Neural Representation Learning for Robust and Fair Speaker Recognition
by
Raghuveer Peri
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2022
Copyright 2022 Raghuveer Peri
To my Mother and Father
For enabling me to pursue my dreams with their unwavering support and selfless sacrifices.
Acknowledgements
From the beginning of my doctoral studies at USC to its culmination in this dissertation, I
had privileged access to invaluable guidance and support from distinguished people.
I would like to begin by thanking my advisor Prof. Shrikanth Narayanan for having faith
in me, and for guiding me through research that has truly piqued and nurtured my scientific
curiosity. The rare combination of his big-picture vision and rigorous attention-to-detail has
been tremendously inspiring.
I would like to thank my committee members, Prof. Wael Abd-Almageed, Prof. Ram
Nevatia, Prof. Keith Jenkins, Prof. Antonio Ortega and Prof. Fei Sha for their insightful
feedback that helped refine this dissertation.
I would like to thank my extraordinary mentors at Amazon and Qualcomm. Shiva and
Srinivas have played an instrumental role in guiding the research during my internship at
Amazon. A special thanks to Asif and Laehoon for helping me navigate some of my first
research endeavors at Qualcomm, and for inspiring me to pursue further academic studies.
I would like to give special acknowledgement to the entire SAIL family for supporting me
on both personal and professional fronts. I cannot adequately underscore the importance
of the innumerable discussions, especially with Krishna, which played a key role in shaping
much of this dissertation.
I would like to thank the department advisors - Diane, Tanya, Tracy and Andy, who have
enabled smooth SAILing during my time at USC.
The acknowledgements would be incomplete without mentioning all the amazing teachers
that have shaped me intellectually, and more importantly have inspired me to be a better
person.
I consider myself lucky to have awesome friends who have extended their support in every
way possible, and who have borne with my not-so-infrequent tantrums.
Finally, I would like to offer special thanks to my uncles - Babu and Nanaji - who,
although no longer with us, continue to bestow their blessings on me.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Background
1.1.1 Overview of Automatic Speaker Verification
1.1.2 Deep speaker embeddings
1.1.3 Within-speaker and Between-speaker variability
1.1.4 Reducing variability in speaker recognition
1.2 Thesis Statement
1.3 Contributions
1.4 Thesis Outline
2 Prior Work
2.1 Quantifying information in speaker embeddings
2.2 Adversarial learning for disentanglement of speaker embeddings
2.3 Biases in speaker recognition
3 Empirical analysis of information encoded in neural speaker representations
3.1 Introduction
3.2 Background
3.2.1 Neural speaker embeddings: x-vector
3.2.2 Factors of variability
3.3 Dataset
3.4 Methods
3.4.1 x-vectors
3.4.2 Dimensionality reduction
3.4.3 Classification model
3.5 Experiments and Results
3.5.1 Setup
3.5.2 Results
3.6 Discussion
4 Unsupervised adversarial disentanglement for nuisance-invariant speaker recognition
4.1 Introduction
4.2 Background: Adversarial learning
4.3 Method
4.3.1 Unsupervised disentanglement
4.4 Dataset
4.5 Experiments and Results
4.5.1 Quantifying information
4.5.2 Speaker Verification
4.5.3 Clustering analysis of embeddings
4.5.4 Speaker diarization using oracle speech segment boundaries
4.6 Discussion
5 Adversarial and multi-task techniques for bias mitigation in speaker recognition
5.1 Introduction
5.2 Background: Fairness in ASV
5.2.1 Evaluating biases in ASV systems
5.2.2 Mitigating biases in ASV systems
5.3 Methods
5.3.1 Adversarial and multi-task extensions of UAI: UAI-AT and UAI-MTL
5.4 Metrics
5.4.1 Utility: Equal error rate (EER)
5.4.2 Fairness: Fairness discrepancy rate (FaDR)
5.4.3 Fairness: Area under the FaDR-FAR curve (auFaDR-FAR)
5.5 Dataset
5.5.1 Training
5.5.2 Evaluation
5.6 Experiments
5.6.1 Baselines
5.6.2 Proposed methods
5.6.3 Ablation studies
5.6.4 Evaluation setup
5.6.5 Implementation details
5.7 Results and Discussion
5.7.1 Fairness
5.7.2 Fairness-Utility analysis
5.7.3 Biases in verification scores
5.8 Discussion
6 Summary and Future Work
References
Appendices
A Supervised adversarial disentanglement for improving robustness of speaker recognition: Emotional speech
A.1 Introduction
A.2 Method
A.3 Experiments
A.3.1 Datasets
A.3.2 Setup
A.4 Results
A.4.1 Quantifying information
A.4.2 Speaker verification
A.5 Conclusions
B Deeper dive into biases and bias mitigation
B.1 Effect of bias weight
B.2 Direction of bias
B.3 Individual-level biases in speaker recognition
List of Tables
3.1 Factors in speaker embeddings and corpora used to quantify them. C denotes the number of unique classes, Num. hidden denotes the number of hidden layers in the classification neural network.

3.2 % Accuracy (% F1 for Emotion) of classifying different factors using speaker embeddings (x-vectors and their transformations). Robust speaker embeddings are expected to perform poorly for classifying non-speaker related tasks, while achieving high accuracy in predicting speaker-related factors.

4.1 Statistics of datasets used to train unsupervised disentanglement models and evaluate robustness of learned speaker embeddings to within-speaker factors of variability (utt-utterances, spk-speakers).

4.2 % Accuracy (% F1 for Emotion) of speaker embeddings in classifying different factors. Unsup. dis. denotes the proposed unsupervised disentanglement embeddings. Robust speaker embeddings are expected to perform poorly for classifying non-speaker related tasks, while capturing maximal information pertaining to speaker-related factors. Values in bold denote the best performance among the different speaker embeddings for each factor.

4.3 Speaker verification performance (% EER - lower is better) in the presence of different nuisance factor conditions (V19-eval). The speaker embedding e_1 from the proposed method (M1 lda-96) is compared against the x-vector baseline. Values in bold denote the best performance for each condition.

4.4 Normalized mutual information (%) between clusters of embeddings and true cluster labels. k represents the number of clusters (V19-dev). The speaker embedding e_1 and nuisance embedding e_2 from the proposed method (M1) are compared against the x-vector baseline. Speaker embeddings are expected to capture speaker information, while the nuisance embedding should capture speaker-unrelated information.

4.5 Diarization performance with oracle speech segment boundaries and known number of speakers (AMI dataset). The two baselines use x-vectors with k-means and PLDA+AHC backends respectively, while the proposed embedding uses k-means clustering.

5.1 List of metrics used in this chapter with a brief description and their purpose (utility or fairness). The FaDR metric (values range from 0.0 to 1.0, higher is better) evaluates the fairness of the ASV system at a particular operating threshold (characterized by demographic-agnostic FAR). Area under the FaDR-FAR curve summarizes the fairness at the various operating points. The error rates (ranging from 0.0 to 1.0, lower is better) are used to measure utility.

5.2 Statistics of datasets used to train and evaluate speaker embedding models. xvector-train-B: balanced with respect to gender in the number of speakers, xvector-train-U: not balanced. embed-train and embed-val (used to train proposed models) have different utterances from the same set of speakers to facilitate evaluating speaker classification performance during embedding training. eval-dev and eval-test (used to evaluate ASV utility and fairness) have speech utterances with no overlap between speakers. voxceleb-H is an out-of-domain evaluation dataset. #spk.-number of unique speakers, #samples-number of speech utterances in training or number of verification pairs in evaluation, F-Female, M-Male.

5.3 Table showing active blocks (corresponding to Figure 5.3.1) used in different embedding transformation techniques. All the techniques use an encoder to reduce the dimensions of the input speaker embeddings and a predictor to classify speakers. The first four rows denote ablation experiments, while the last two correspond to the proposed techniques.

5.4 auFaDR-FAR (upper bound 900) capturing fairness (binary gender groups) for 5 different values of ω, and % EER capturing utility on the eval-test dataset. Both the UAI-AT and UAI-MTL methods achieve similar auFaDR-FAR values, higher than the baseline x-vector-B for all values of ω, with significant improvement for ω = 1.0, 0.75, 0.5. UAI-MTL improves fairness and retains utility (similar % EER as x-vector-B), while UAI-AT achieves the desired fairness at the cost of reduced utility. ∗ denotes significant improvement over the x-vector-B system (significance computed at level = 0.01 using a permutation test with n = 10000 random permutations). Values in bold denote the highest fairness for each value of ω.

5.5 auFaDR-FAR capturing fairness (binary gender groups) for 5 different values of ω, and % EER capturing utility on the voxceleb-H dataset. The UAI-MTL method achieves significantly higher auFaDR-FAR values than the baseline x-vector-B for all values of ω. The UAI-AT and AT methods have reduced fairness compared to the baseline, suggesting a lack of generalizability of the adversarial method across datasets. Values in bold denote the highest fairness for each value of ω.

A.1 Average % F1 of emotion recognition on the IEMOCAP dataset from speaker embeddings. Unsup. dis. denotes unsupervised disentanglement, while Sup. dis. denotes supervised disentanglement of embeddings. Robust speaker embeddings are expected to perform poorly for classifying emotions.

A.2 Speaker verification performance (% EER) for the baseline x-vector and Sup. dis. methods split by emotions. The last row shows the number of verification trials used in the analysis. The supervised disentanglement approach outperforms x-vectors in all emotion classes. The last column shows the results for all emotions combined.

B.1 Classification results on the embed-val dataset and verification results on the eval-dev dataset for different bias weights (δ in Equation 5.3.1). The majority class random chance accuracy for bias labels in the embed-val data was 70%.
List of Figures
1.1.1 (Best viewed in color) Block diagram of a typical deep-learning based ASV system. The inset figure shows a typical histogram of the similarity scores. An overview of ASV is provided in Section 1.1.1.

1.1.2 Factors contributing to variability in speaker recognition. Invariance of ASV systems to within-speaker factors can be termed robustness, while invariance to between-speaker factors can be termed fairness.

3.3.1 Example room configuration in the VOiCES dataset (Nandwana et al., 2019). The primary speaker plays clean speech. The distractor represents a noise source (playing different noise types) and circles represent microphones at different locations. This setup simulates real-life noisy scenarios that speech processing applications typically encounter.

4.2.1 Block diagram of a general adversarial training method for nuisance-invariant speaker recognition. The discriminator is trained in an adversarial fashion to learn nuisance-removed speaker embeddings. It requires the speaker label in addition to the specific nuisance label for which invariance is desired.

4.3.1 Unsupervised adversarial invariance applied for speaker recognition. The goal is to learn a split representation of speaker information in e_1 and nuisance information in e_2 using adversarial training. Note that this does not require labels of any specific nuisance factor to train.

4.5.1 Confusion matrices for emotion recognition using speaker embeddings on IEMOCAP for different embedding transformation techniques. Robust speaker embeddings are expected to perform poorly on the emotion recognition task. 0: Anger, 1: Sadness, 2: Happiness, 3: Neutral. Unsupervised disentanglement provides the best disentanglement as shown by the poor emotion recognition performance. Differences are particularly noticeable for the Happiness (2) class.

4.5.2 DET curves (the y-axis shows the % false rejection rate and the x-axis shows the % false acceptance rate) of the speaker verification task using different speaker embeddings with and without disentanglement in several (A) noise conditions and (B) microphone placements. In almost all scenarios, the model trained using the unsupervised disentanglement, denoted by M2 lda-96, performs the best.

5.3.1 (Best viewed in color) Block diagram of the method showing the predictor, decoder and disentangler modules similar to Figure 4.3.1. The discriminator module (yellow bounding box) is tasked with predicting the demographic factor (e.g., gender) from e_1, and can be trained in an adversarial setup (UAI-AT) or in a multi-task (UAI-MTL) setup with the predictor.

5.7.1 (Best viewed in color) Fairness for binary gender groups at different operating points characterized by demographic-agnostic FAR up to 10%, evaluated using 3 different values for the error discrepancy weight (Eq. 5.4.1), ω = 0.0, 1.0 and 0.5. When evaluating fairness using discrepancy in FRR alone (ω = 0.0), there is not much difference between the different systems. When evaluating fairness using discrepancy in FAR alone (ω = 1.0), the baseline x-vector-B trained on balanced data performs better than x-vector-U. The proposed systems (UAI-AT and UAI-MTL) outperform x-vector-B. When evaluating fairness using weighted discrepancy in FAR and FRR with equal weights (ω = 0.5), the proposed systems still show better performance than the baselines.

5.7.2 (Best viewed in color) Plot of demographic-agnostic % FRR versus demographic-agnostic % FAR showing the utility of the systems. Curves closer to the origin indicate better utility. Notice that the UAI-MTL system closely follows the baseline x-vector-B system at a range of operating conditions, while UAI-AT reduces utility, shown by higher % FRR.

5.7.3 (Best viewed in color) Kernel density estimates of cosine similarity scores of impostor pairs for the female and male demographic groups. Both the x-vector baselines have the scores of the female population shifted compared to the scores of the male population. The UAI-AT and UAI-MTL techniques reduce differences between the scores. Particularly, UAI-MTL produces scores with barely noticeable difference between the genders, shown by the % intersection in the scores between genders.

5.7.4 (Best viewed in color) Kernel density estimates of cosine similarity scores of genuine pairs for the female and male demographic groups. Both the x-vector baselines have the scores of the female and male population overlapping with each other, indicating minimal bias between genders. It is worth noting that both the transformation techniques (UAI-AT and UAI-MTL) retain this overlap as shown by the % intersection in the scores between genders.

6.0.1 Intersectional aspects of the factors of variability in speaker recognition. Important to consider for a holistic understanding of biases.

A.2.1 Disentanglement of nuisance factors from speaker embeddings when nuisance labels are available. The discriminator predicts nuisance labels, and is trained adversarially with the encoder. Learned speaker embeddings e_1 capture nuisance-invariant speaker information.

A.4.1 Confusion matrices for emotion recognition using speaker embeddings on IEMOCAP for different embedding transformation techniques. Robust speaker embeddings are expected to perform poorly on the emotion recognition task. 0: Anger, 1: Sadness, 2: Happiness, 3: Neutral. Unsupervised disentanglement provides the best disentanglement as shown by the poor emotion recognition. Differences are particularly noticeable for the Happiness (2) class.

B.2.1 (Best viewed in color) Demographic-specific FAR for the baseline and proposed systems. Here, we look at the individual FARs of the female and male populations. Compared to the x-vector systems, both the proposed methods reduce the difference in the FARs between the female and male populations. However, the UAI-AT method achieves this by reducing the EER for both groups, while the UAI-MTL method achieves this by increasing the error rate for the male population, making it closer to the female population.

B.2.2 (Best viewed in color) Demographic-specific FRR for the baseline and proposed systems. Here we look at the individual FRRs of the female and male populations. The baseline x-vector method already shows very little difference in the FRRs between the female and male populations. The proposed techniques retain these small differences.

B.3.1 Pairwise t-test between genuine and impostor scores of different speaker pairs separated by gender. 1 represents a statistically significant difference (p < 0.01), 0 represents no significant difference. Notice the presence of several speakers with significantly different scores, for both genuine and impostor verification pairs.
Abstract
Speech is an information-rich signal that conveys a wealth of information including a person's
age, gender, language, emotional state and environmental surroundings. Speaker recognition
(SR), the task of identifying speakers based on their speech, is an active area of research.
SR has found a wide range of applications in several everyday technologies such as smart
speakers, customer care centers etc. It is crucial that SR systems perform reliably in diverse
environments, while not having biases against any particular demographic group or individ-
ual. A technique to improve the robustness of SR systems against variability is to ensure
that the speaker representations used in these systems retain only information related to the
speaker’s identity. Specifically, speaker representations can be trained to contain minimal
information pertaining to factors unnecessary for the SR task such as background noise,
channel conditions and emotional state. In this thesis, we provide insights into the various
factors of information captured in current state-of-the-art speaker representations using ex-
tensive experiments. Guided by these findings, we propose adversarial learning techniques
to minimize nuisance information in speaker representations, and empirically show that such
techniques improve the robustness of SR in challenging conditions. Moreover, studies of vari-
ability in the performance of contemporary SR systems with respect to demographic factors
are lagging compared to other speech applications such as speech recognition. Furthermore,
there exist only a handful of bias mitigation strategies developed for SR systems. Therefore,
we first present systematic evaluations of the biases present in SR systems with respect to
gender across a range of system operating points. We then discuss our proposed representa-
tion learning techniques to mitigate the biases. Finally, we show through quantitative and
qualitative evaluations that the proposed methods improve the fairness of SR systems over
competitive baselines.
Chapter 1
Introduction
Automatic speaker recognition is the task of identifying a speaker's identity from their speech.
This field of research has garnered immense attention over the past few decades due to its
wide range of applications including in voice biometrics (Markowitz, 2000), speaker diariza-
tion (task of identifying who spoke when in multi-party conversations) (Beigi, 2011), person-
alized voice assistants in smart homes (Shin & Jun, 2015), anti-spoofing (C. Zhang, Yu, &
Hansen, 2017) etc. A major challenge in automatic speaker identification from speech is the
amount of variability that is inherently captured in the speech signal. Variability in speech
could arise from a multitude of factors (Hansen & Hasan, 2015). First, it can be introduced
at the signal production stage, for example based on the emotional state of the person, or the
context in which the speech was produced (formal interview, casual conversation etc.). Fur-
ther variability can be introduced at the signal acquisition stage, for example due to channel
conditions such as background noise, reverberation, and differences in microphone charac-
teristics. Furthermore, machine learning (ML) based speech applications can also be prone
to biases against certain demographic populations. These arise due to variability from factors
inherent to the speaker's identity such as gender, age, accent etc. Such biases can hinder
widespread adoption of speech technologies, and can lead to systematic exclusion of certain
populations from the benefits of such technologies. It is therefore imperative that a speaker
recognition system is able to identify a person based on a short sample of speech irrespective
of the variability present in the signal, and should do so without being biased towards or
against any particular individual or group of individuals. This is particularly important for
human-centered speech applications which can have immediate societal impacts.
An important part of speaker recognition systems are low-dimensional representations of
speech that are representative of the speaker’s identity, called speaker embeddings. These
embeddings are utterance-level representations that capture the speaker’s voice characteris-
tics. In addition to being speaker discriminative, for robust speaker recognition systems, it
is important that the speaker embeddings are invariant to variability present in speech. Past
work in this domain has focused on factorizing the identity-related and identity-unrelated
information from speech (Kenny, Boulianne, Ouellet, & Dumouchel, 2007). The idea is to
force identity-related information to be captured by the speaker embeddings, while discard-
ing other factors that are not relevant to the speaker’s identity, such as channel conditions.
This has been the predominant approach for much of the past work. However, modern
deep learning approaches do not explicitly account for the factors of variability. Most deep
learning techniques rely on artificially augmenting datasets with factors of variability such as
background noise and reverberation. However, a quantitative understanding of the amount
of information pertaining to identity-related and identity-unrelated factors retained in such
speaker embeddings is mostly lacking. Such analysis would aid in understanding the infor-
mationretainedinspeakerembeddingsthatcouldincorrectlybeusedbythesystemtomake
decisions. Therefore, a deeper understanding of what the speaker embeddings capture is one
step closer towards better explainability of speaker recognition systems.
In recent times, adversarial techniques have been proposed to capture speaker embeddings
that are invariant to variability (Meng, Zhao, Li, & Gong, 2019; Tawara, Ogawa, Iwata,
Delcroix, & Ogawa, 2020). However, most existing methods based on adversarial learning
require the availability of labels of all the factors of variability during training, which is not
practical in many cases. For example, data augmentation and adversarial training can be
readily performed to induce robustness to background acoustic noise and channel conditions;
however, variability arising during the speech production stage, such as emotional or health
state is harder to add artificially during training. In addition, issues of biases, which have
garnered much interest in the general ML community (Barocas, Hardt, & Narayanan, 2019),
have not been sufficiently addressed in the speaker recognition domain. In this thesis, we
tackle the above-mentioned shortcomings of current approaches and explore the following
aspects of speaker recognition systems:
1. Quantifying the extent to which state-of-the-art speaker embeddings retain identity-
related and identity-unrelated information.
2. Adversarial disentanglement techniques to improve robustness of speaker recognition
by learning speaker representations that are invariant to the factors of variability with-
out explicit labels of the factors.
3. Adversarial and multi-task learning methods to improve the fairness of speaker recog-
nition systems.
1.1 Background
Automatic speaker recognition primarily encompasses two specific tasks namely automatic
speaker verification (ASV) and speaker identification (SI). SI deals with a closed-set classifi-
cation task, where models are trained using utterances from a set of speakers, and the same
speakers are also present at test time. However, it is not a very realistic scenario because the
applications involving speaker recognition are often deployed in unseen test conditions, and
are expected to be effective in identifying unseen speakers. A more realistic and challenging
scenario is ASV, where the task is to identify whether two utterances belong to the same or
different speaker. More often than not, in ASV, the speakers that are present during testing
are not used to train the speaker embedding models. In the rest of the thesis, we will present
results on both the closed-set SI task and the open-set ASV task; however, the primary focus
would be the ASV task. In particular, we use the closed-set SI task to quantify identity-
related information captured in the speaker embeddings, while majority of the performance
evaluations using the proposed speaker embeddings would be performed on the ASV task in
challenging acoustic conditions.
The goal of an ASV system is to automatically detect whether a given speech utterance
belongs to a claimant who is a previously enrolled speaker. ASV techniques in the past
represented speakers using statistical methods leveraging Gaussian mixture models. Likeli-
hood ratio tests were then used to determine if a speech utterance belonged to an enrolled
speaker (Reynolds, 1995). More recently, embedding based methods have been developed
where the speech utterances are modeled using low-dimensional, speaker-discriminative rep-
resentations (Dehak, Kenny, Dehak, Dumouchel, & Ouellet, 2011; Snyder, Garcia-Romero,
Sell, Povey, & Khudanpur, 2018). In particular, embedding-based techniques that model a
speaker’s vocal characteristics using deep-learning methods have gained prominence (Sny-
der et al., 2018; Variani, Lei, McDermott, Moreno, & Gonzalez-Dominguez, 2014). Ro-
bust speaker recognition requires information pertaining to a person’s identity be captured
in fixed-dimensional speaker embeddings, while discarding identity-unrelated information.
These speaker embeddings should be discriminative of speaker’s identity, while being invari-
ant to variability that is not related to the speaker’s identity. Quantifying the information
retained in speaker embeddings serves two purposes:
1. It offers a means to directly compare the efficacy of techniques in extracting speaker
embeddings invariant to nuisance factors.
2. It provides insights into the various factors of information captured in speaker embed-
dingsthatcouldpotentiallybeusedbydownstreamtaskssuchasinspeakerverification
or diarization. Such analysis brings the speaker recognition systems one step closer to
being explainable.
Figure 1.1.1: (Best viewed in color) Block diagram of a typical deep-learning based ASV system. The inset figure shows a typical histogram of the similarity scores. An overview of ASV is provided in Section 1.1.1.
In this section, we first provide a brief overview of a typical deep-learning based ASV
system. We then discuss some recent deep learning based speaker embeddings and then
mention some of the factors of variability that could affect ASV performance. Finally, we will
discuss techniques proposed in the literature to ‘disentangle’ the various factors of information
in speaker embeddings.
1.1.1 Overview of Automatic Speaker Verification
As shown in Figure 1.1.1, speaker embeddings from the test utterance and a previously
collected enrolment utterance are obtained using pre-trained speaker embedding models.
The speaker embedding models are typically trained in a fully-supervised setting on large
amounts of speech with speaker identity labels using a deep neural network (Snyder et al.,
2018). The goal is to learn an embedding of the speech utterance that is discriminative of
the speaker. Much like the likelihood ratio tests, the similarity between the embeddings of
the enrolment and test utterances is compared to a threshold to verify the identity of the
speaker in the considered speech utterance.
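As a minimal illustration of this decision rule, the sketch below scores a pair of fixed-dimensional embeddings with cosine similarity and compares the score against an operating threshold. The function names and the choice of cosine scoring are illustrative assumptions for this sketch, not a reproduction of the exact backend used in this thesis.

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrolment embedding and a test embedding."""
    return float(np.dot(enroll_emb, test_emb) /
                 (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb)))

def verify(enroll_emb: np.ndarray, test_emb: np.ndarray, threshold: float) -> bool:
    """Accept the identity claim if the similarity exceeds the chosen threshold."""
    return cosine_score(enroll_emb, test_emb) >= threshold
```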
Verification pairs consisting of speech from the same speaker in both the enrolment and
test utterances are called genuine pairs. When the speaker of the test utterance is differ-
ent from the enrolment utterance, the corresponding verification pairs are called impostor
pairs. The inset plot in Figure 1.1.1 shows the histogram of similarity scores of genuine
and impostor verification scores of a typical ASV system. Since the final output of an ASV
system is a binary accept/reject decision, its errors may be classified into two categories: false
accepts (FA) and false rejects (FR). FAs are instances where impostor pairs are incorrectly
accepted, while FRs are instances where genuine pairs are incorrectly rejected by the ASV
system. There exists a trade-off between the number of FAs and FRs of the system depend-
ing on the threshold on the similarity scores. Systems which use a larger threshold tend to
have fewer FAs and more FRs. This can be useful in high-security applications such as in
border security. Similarly, a smaller threshold can be used to reduce the number of FRs
in applications requiring greater user convenience. Thus, the chosen similarity threshold
determines the operating point of the overall ASV system, and it can be tuned to suit the
end application. Equal error rate (EER) is a commonly used metric to evaluate performance
of ASV systems. It captures the performance at a single operating point, where the false
acceptance rate (FAR) is equal to the false rejection rate (FRR).
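One way to read the EER off the genuine and impostor score distributions is sketched below: sweep the threshold over the observed scores and take the error rate at the point where FAR and FRR are closest. The brute-force sweep and the function name are illustrative simplifications for this sketch; evaluation toolkits typically interpolate the DET curve instead.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping the decision threshold over all observed
    scores and returning the error rate where FAR and FRR are closest."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        far = np.mean(impostor >= t)  # impostor pairs incorrectly accepted
        frr = np.mean(genuine < t)    # genuine pairs incorrectly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```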
As noted in (Stoll, 2011) and (Si, Li, & Xu, 2021), the verification scores of pairs com-
prising speakers from the female and male demographic groups can be significantly different.
Furthermore, as shown in Figure 1.1.1, the decision by the ASV system depends on the
similarity threshold of these verification scores. Therefore, any biases present in the verifi-
cation scores propagate to the final decisions of the ASV system. This can result in unfair
treatment of certain demographic groups, as we will see later in Chapter 5.
1.1.2 Deep speaker embeddings
Recently, speaker embeddings extracted from deep learning models using data artificially
augmented with various sources of variability have shown promising robustness. A popular
approach that has gained a lot of attention is x-vectors (Snyder et al., 2018). It consists of a
time delay neural network (TDNN) model that is trained to predict speaker identity labels
given mel frequency cepstral coefficients (MFCC) as input. The initial layers work as frame-
level feature extractors, while a statistics pooling layer aggregates frame-level information
across an utterance into a single fixed-dimensional speaker embedding. The x-vector model
is usually trained with datasets consisting of a large amount of data, typically around a mil-
lion utterances from thousands of speakers such as Voxceleb (Chung, Nagrani, & Zisserman,
2018). In addition to the available data, they are trained using artificial augmentations to
the dataset to introduce variability. Training data is augmented with additive noise and
reverberations (convolution with different room impulse responses) to make the speaker em-
beddings invariant to such factors that are not related to a person’s identity. This approach
has been found to be effective in making the system robust to noise and reverberation.
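As an aside on the architecture described above, the statistics pooling step can be written down in a few lines. The sketch below is a simplified, framework-agnostic illustration; the dimensions and the function name are assumptions made for the example, not the exact configuration of the pre-trained model used later in this thesis.

```python
import numpy as np

def statistics_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Aggregate a (num_frames, feat_dim) matrix of frame-level TDNN outputs into a
    single utterance-level vector by concatenating the per-dimension mean and
    standard deviation, as done in the x-vector architecture."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])

# Example: a 300-frame utterance with (illustrative) 1500-dimensional frame features
# pools into a single 3000-dimensional statistic, from which subsequent affine layers
# produce the fixed-dimensional speaker embedding.
pooled = statistics_pooling(np.random.randn(300, 1500))
```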
However, previous works (Raj, Snyder, Povey, & Khudanpur, 2019) have demonstrated that
even x-vectors trained with augmented data retain information pertaining to the sources of
variability not related to speaker identity such as noise type, lexical content etc. In addition,
(Williams & King, 2019) have shown that x-vectors also retain information pertaining to
the emotional content in speech. However, the RedDots dataset (Lee et al., 2015) used for
the analysis in (Raj et al., 2019) was restrictive because of the use of crowd-sourced speech
recordings without control over the acoustic conditions during speech capture. Firstly, it is
not possible to investigate the amount of information present in speaker embeddings per-
taining to the recording devices such as the microphone type or the location relative to the
speaker. Secondly, the experiments on amount of information pertaining to the noise type
were conducted on artificially augmented data, and the findings might not generalize to
realistic scenarios. Therefore, there is a need for a more systematic analysis of the factors of
variability present in speaker embeddings.
Figure 1.1.2: Factors contributing to variability in speaker recognition. Invariance of ASV systems to within-speaker factors can be termed robustness, while invariance to between-speaker factors can be termed fairness.
1.1.3 Within-speaker and Between-speaker variability
As discussed previously, speaker recognition performance can depend on several factors of
variability. As shown in Figure 1.1.2, these can be categorized into two: within-speaker and
between-speakerfactors. Within-speakerfactorsofvariabilityrefertothedifferentconditions
thatcanaffectthespeechqualityofapersonthatisinputtothespeakerverificationsystem.
Some of these factors such as the emotional or health state change due to the way speech
is produced by an individual, while others such as the acoustic conditions affect the speech
signal during acquisition. On the other hand, differences in performance to between-speaker
factors are related to biases present in the systems. As with many other ML technolo-
gies, speaker recognition systems can show biases against certain demographic groups (Fenu,
Lafhouli, & Marras, 2020). These different demographic attributes that contribute to the
variability can be grouped under the term between-speaker variability. Speaker recognition
systems whose performance is minimally impacted by the within-speaker factors of vari-
ability can be called ‘Robust’ systems. Speaker recognition systems whose performance is
independent of the between-speaker factors of variability can be called ‘Fair’ systems.
1.1.4 Reducing variability in speaker recognition
Adversarial disentanglement methods have become popular in the speaker recognition com-
munity to induce invariance to different sources of variability. In particular, such methods
have largely been explored to produce domain-invariant speaker embeddings, where the do-
main could refer to lexical content usage (Tawara et al., 2020), noise conditions (Zhou et al.,
2019) etc. The general idea in such methods is to learn speaker embeddings using an encoder
that is trained to retain speaker discriminative information, while being stripped of domain
information. This way, the identity-related factors of information are disentangled from the
identity-unrelated factors of information present in speaker embeddings. However, the focus
of most of these techniques has primarily been on improving speaker recognition performance
in different domains, while not much attention has been given to quantifying the amount of
information captured in the speaker embeddings. In most cases, domain invariance is shown
through qualitative means such as t-SNE plots.
In addition, the proposed adversarial learning methods typically require labels of the
factor to be disentangled during training, and can be used to make speaker embeddings
invariant to the specific factors. In scenarios where invariance is desired with respect to a
specific nuisance factor such as acoustic noise, the supervised adversarial disentanglement
methods can be effectively applied. However, knowledge of all the various sources of
within-speaker variability is rarely available. Also, it might not be practical to artificially
augment training data using all sources of variability. For example, the effect of variability
induced by noise and reverberation on speaker embeddings has been studied extensively be-
cause it is fairly straightforward to artificially augment data with these factors of variability.
However, the Lombard effect (Zollinger & Brumm, 2011), the change in a speaker's speaking char-
acteristics in the presence of background noise, is a factor of variability that is challenging
to add artificially during training. The supervised adversarial disentanglement techniques
that require labels of nuisance factors are rendered ineffective in such scenarios. In order
to effectively disentangle the within-speaker factors of variability from the identity-related
factors, unsupervised methods are required that do not need labels of nuisance factors. Such
a technique called unsupervised adversarial invariance (UAI) has been recently proposed in
(Jaiswal, Wu, Abd-Almageed, & Natarajan, 2018a) for use in computer vision applications
such as face identification and digit classification tasks. The idea is to disentangle task-
specific information (for example identity-related information) from irrelevant information
(such as within-speaker, identity-unrelated information) through a task-specific prediction
task and a task-agnostic reconstruction task. We adopt this technique and modify the imple-
mentation to fit our task of extracting speaker embeddings invariant to within-speaker
variability.
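To make the structure of this idea concrete, the sketch below lays out the modules of a UAI-style model operating on pre-computed speaker embeddings: two encoders split the input into e_1 and e_2, a predictor classifies the speaker from e_1, and a decoder reconstructs the input from a dropout-corrupted e_1 together with e_2, so that nuisance information is pushed into e_2. Layer sizes, dropout rate and module names are illustrative assumptions following the general recipe of Jaiswal et al. (2018a), not the exact implementation developed in Chapter 4; the adversarial disentanglers that predict one code from the other are shown only as placeholders.

```python
import torch
import torch.nn as nn

class UAISketch(nn.Module):
    """Simplified sketch of unsupervised adversarial invariance (UAI) applied to
    pre-computed speaker embeddings such as x-vectors."""

    def __init__(self, in_dim=512, e1_dim=128, e2_dim=32, num_speakers=1000):
        super().__init__()
        self.encoder_e1 = nn.Linear(in_dim, e1_dim)  # intended to keep speaker information
        self.encoder_e2 = nn.Linear(in_dim, e2_dim)  # intended to absorb nuisance information
        self.speaker_predictor = nn.Linear(e1_dim, num_speakers)
        # Reconstruction sees a heavily dropped-out e_1, so it must rely on e_2
        # for everything the speaker prediction task does not need.
        self.dropout = nn.Dropout(p=0.75)
        self.decoder = nn.Linear(e1_dim + e2_dim, in_dim)
        # Disentanglers, trained adversarially, try to predict one code from the other.
        self.predict_e1_from_e2 = nn.Linear(e2_dim, e1_dim)
        self.predict_e2_from_e1 = nn.Linear(e1_dim, e2_dim)

    def forward(self, x):
        e1, e2 = self.encoder_e1(x), self.encoder_e2(x)
        speaker_logits = self.speaker_predictor(e1)
        reconstruction = self.decoder(torch.cat([self.dropout(e1), e2], dim=-1))
        return e1, e2, speaker_logits, reconstruction
```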
Variability in speaker recognition owing to between-speaker factors has not been suffi-
ciently addressed in the literature. Explicit fairness evaluations are mostly lacking in the speaker
recognition community. Further, the previously described unsupervised disentanglement
techniques may not be sufficient to reduce the between-speaker variability in ASV perfor-
mance, because this form of variability occurs due to the identity-related factors such as
gender, age etc. To handle this variability, additional supervision with respect to the biasing
factors may be necessary (Jaiswal, Wu, AbdAlmageed, & Natarajan, 2019). Also, adversar-
ial techniques can impact the overall ‘utility’ of the system, though they can mitigate the
between-speaker variability issue to some extent (Du, Yang, Zou, & Hu, 2020). Therefore,
extensive evaluations are necessary to fully understand the biases present in ASV systems,
and any proposed bias mitigation strategies need to consider the fairness in addition to the
utility of these systems.
1.2 Thesis Statement
Neural-based embedding transformation techniques can reduce variability in speaker
representations, improving robustness and reducing biases in speaker recognition sys-
tems.
1.3 Contributions
In light of the background provided previously in this section, we make the following contri-
butions in this thesis:
1. Factors of information in speaker embeddings: First, we performed a compre-
hensive quantitative evaluation of the various factors of information that are retained
in speaker embeddings. To do this, we resort to extensive classification experiments us-
ing a wide range of datasets. Each dataset was chosen to aid in understanding specific
factors of information. We show that popular deep-learning based speaker embeddings
retain substantial information about factors unrelated to the speaker’s identity. These
methods help inform techniques to remove such unnecessary information factors from
speaker embeddings.
2. Unsupervised adversarial techniques to mitigate within-speaker variability
in speaker recognition: We adapt a recently proposed adversarial disentanglement
technique to the task of speaker recognition. In particular, we developed a model
that takes state-of-the-art deep speaker embeddings (x-vectors) as input and produces
lower-dimensional speaker embeddings that are speaker-discriminative while also being
invariant to nuisance factors. We analyze the efficacy of the method by quantifying
the amount of information pertaining to various factors in the speaker embeddings.
In addition, we show the performance improvements obtained using these speaker
embeddings for the task of ASV on the challenging VOiCES dataset (Nandwana et
al., 2019) and speaker diarization on the AMI dataset (McCowan et al., 2005). In
particular, we show that the proposed speaker embeddings perform better than the
x-vectors, especially in challenging far-field conditions, while being competitive in less
challenging conditions.
3. Adversarial and multi-task techniques to mitigate between-speaker variabil-
ity in speaker recognition: We systematically evaluate the biases present in speaker
recognition systems with respect to gender across a range of system operating points.
We also propose adversarial and multi-task learning techniques to improve the fairness
of these systems. We show through quantitative and qualitative evaluations that the
proposed methods improve the fairness of ASV systems over baseline methods trained
using data balancing techniques. We also present a fairness-utility trade-off analysis to
jointly examine fairness and the overall system performance. We show that although
systems trained using adversarial techniques improve fairness, they are prone to re-
duced utility. On the other hand, multi-task methods can improve the fairness while
retaining the utility. These findings can inform the choice of bias mitigation strategies
in the field of speaker recognition.
1.4 Thesis Outline
Here we present an outline of our research work, and provide a glimpse into what is to follow
in this thesis document.
Chapter 2: In this chapter, we will present the highlights of some of the past research
that is relevant to our work. In particular, we will discuss the previous efforts in quantifying
information embedded in speaker representations, and highlight some of the shortcomings
of such research. We will also provide brief details of some of the work done in the speaker
verification domain that uses adversarial techniques to disentangle identity-related factors
from nuisance factors in speaker embeddings. Finally, we will discuss some of the relevant
past work looking into biases in speaker recognition.
Chapter 3: In this chapter, we will go into detail about the factors of information that are
captured in speaker embeddings. First, we will discuss the various sources of variability that
are encountered in speech, and how information pertaining to these factors can be embedded
in speaker representations. We will provide quantitative evidence of such identity-unrelated
information captured in x-vectors using several experiments on a wide range of datasets.
Chapter 4: In this chapter, we will first describe the unsupervised disentanglement method
we adopt to induce invariance to within-speaker variability in speaker embeddings. Then, we will
present the results of applying the technique in removing nuisance information pertaining
to a number of factors of variability. In particular, we will show the efficacy of this method
in removing nuisance information without explicit labels during training. In addition, we
show how the speaker embeddings extracted using this method can improve the robustness
of speaker verification task in challenging conditions.
Chapter 5: In this chapter, we will first provide a detailed discussion of the biases present in
modern speaker verification systems. This deals with aspects of between-speaker variability.
Then, we will describe the adversarial and multi-task techniques we developed to mitigate
these biases. We will also discuss aspects of utility in addition to the fairness of the proposed
methods.
Chapter 6: We will provide concluding remarks based on our research findings, and provide
directions for future work.
Chapter 2
Prior Work
In this chapter, we will discuss several past works that have investigated the information
captured in speaker embeddings, and some of the adversarial training techniques proposed
in the literature to reduce nuisance information from speaker representations while keeping them
speaker-discriminative. We will also briefly discuss some of the existing literature in improv-
ing fairness of speaker recognition.
2.1 Quantifying information in speaker embeddings
Studies of what is captured in speaker embeddings have been numerous, especially in the
past, before the deep-learning era. Much of the focus has been on channel factors, the
factors which are captured in speaker embeddings during the signal capture stage. These
typically are a result of the diverse conditions in which recordings are made including the
background acoustic noise, reverberations, microphone non-linearities, compression artifacts
etc. Early speaker recognition systems (Kenny et al., 2007) attempted to decompose speech
representations into speaker, channel and residual factors. They were found to provide
improvements to speaker verification tasks. However, later it was found that such techniques
result in speaker factors leaking into channel factors as well. Therefore, later techniques
combined those factors into a total variability factor, represented by i-vectors (Dehak, Kenny,
Dehak, Dumouchel, & Ouellet, 2010). However, a drawback was that the total variability
space captured all factors of variability including speaker, channel and other factors, therefore
invalidating the invariance assumption of speaker embeddings to nuisance factors. Additional
compensation techniques were thus proposed to make them robust to channel variability
(Castaldo, Colibro, Dalmasso, Laface, & Vair, 2007). However, the goal of most of these
techniques was to improve robustness of speaker embeddings as measured by %EER, and
explicit understanding of what these factors capture has received less interest.
Recently, there has been renewed interest in understanding speaker embeddings, espe-
cially neural representations for the information they capture. (S. Wang, Qian, & Yu, 2017)
have investigated the lexical, speaker and channel information encoded in i-vectors (Dehak
et al., 2010), s-vectors (Bhattacharya, Alam, Stafylakis, & Kenny, 2016) and d-vectors (Var-
iani et al., 2014) on the RSR2015 dataset (Larcher, Lee, Ma, & Li, 2014). On similar lines,
(Raj et al., 2019) investigate x-vectors for several speaker and lexical factors on the RedDots
dataset (Lee et al., 2015). One drawback with these methods is that, with a single dataset,
it might not be possible to control for possible confounding factors while analyzing a specific
factor.
In addition to channel factors, past work has shown that speaker embeddings retain infor-
mation about content factors such as lexical and emotional content. For example, examining
the amount of phoneme-level information present in speaker embeddings, it was found that
better performance in phoneme classification tasks does not translate to improved speaker
recognition performance (Lozano-Diez et al., 2016). Furthermore, speaker embeddings that
perform well for speaker recognition tasks have been shown to capture segment-level charac-
teristics rather than low-level phoneme information (Shon, Tang, & Glass, 2018). (Williams
& King, 2019) examined x-vectors for the task of subject-independent emotion recognition.
Their experiments showed that x-vectors trained for speaker recognition task could predict
the emotion content in speech even without any additional supervision using emotion labels.
Several works in both the NLP domain for word embeddings and in the speaker recog-
nition domain for speaker embeddings have used classification tasks to quantify the information
encoded in the hidden representations. For example, (Adi, Kermany, Belinkov, Lavi, &
Goldberg, 2016) employ classification tasks to analyze sentence embeddings. Also, most of
the works mentioned previously use classification tasks to investigate information in speaker
embeddings. The premise is that if a DNN classifier is unable to classify a particular at-
tribute using the embeddings, then the embeddings do not contain information pertaining
to that factor (Adi et al., 2016).
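A minimal version of such a probing experiment is sketched below: a simple classifier is trained on the embeddings to predict one attribute, and its held-out accuracy is compared to the chance level. The linear probe and the function name are assumptions made for this illustration; the works cited above (and the experiments in Chapter 3) use shallow neural network classifiers instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def probe_attribute(embeddings: np.ndarray, attribute_labels: np.ndarray) -> float:
    """Train a probing classifier on speaker embeddings for one attribute (e.g.,
    noise type or emotion) and return held-out accuracy. Accuracy near the chance
    level suggests the embeddings carry little information about that attribute."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, attribute_labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, probe.predict(X_te))
```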
2.2 Adversarial learning for disentanglement of speaker embeddings
Domain adversarial learning has gained increasing attention over the past few years to learn
speaker-discriminative and domain-invariant speaker representations. (Zhou et al., 2019)
employed adversarial learning to extract speaker embeddings invariant to noise. Disen-
tanglement for domain-invariant speaker representations was explored in (Sang, Xia, &
Hansen, 2020). (Meng et al., 2019) proposed an adversarial speaker verification technique to
learn condition-invariant speaker embeddings using categorical environment labels and con-
tinuous signal-to-noise ratio (SNR) labels. (Luu, Bell, & Renals, 2020a) proposed channel-
invariant speaker embeddings by training them to be invariant to the recording granularity,
and showed improved robustness in verification and diarization tasks.
In addition to channel and condition invariance, techniques to induce invariance to the
lexical content have been proposed in the literature. Though a few studies have shown that
lexical content is often useful for speaker verification tasks, (Tawara et al., 2020) argue that
such results are due to lack of evaluation on short utterances. They show that adversarial
learningtoextractphoneme-invariantspeakerembeddingshelpsimprovespeakerverification
performanceonshortutterances. Allofthesestudiesshowthatlearningspeakerembeddings
invariant to various nuisance factors helps improve robustness of speaker recognition. The
central idea in such techniques is to use labels of nuisance factors such as the channel or noise
present in the signal to strip the speaker embeddings of that information. However, labelled
data might not always be available for several factors of variability. This necessitates
unsupervised adversarial training that can learn speaker representations robust to channel
and other acoustic variability without knowledge of any particular nuisance factor. Such
work has been explored to some extent in the computer vision domain (Jaiswal et al., 2018a),
but has been largely unexplored for speech. Recently, an unsupervised approach to induce
invariance for automatic speech recognition was introduced in (Hsu, Jaiswal, & Natarajan,
2019). However, the goal of that work was to remove speaker-specific information from the
speech representations, whereas our goal was the opposite: to extract speaker-discriminative
information.
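One common mechanism for the supervised adversarial objective described in this section is a gradient reversal layer placed between the encoder and a nuisance classifier, so that the classifier learns to predict the nuisance label while the encoder learns to remove it. The sketch below is a generic illustration of that mechanism, with hypothetical module names in the usage comment; it is not the specific setup of any of the cited works.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; flips (and scales) the gradient in the backward
    pass, so the encoder is pushed to maximize the nuisance classifier's loss."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage inside a training step, assuming encoder, speaker_head and nuisance_head
# are nn.Modules and ce is a cross-entropy loss:
#   emb = encoder(features)
#   loss = ce(speaker_head(emb), speaker_labels) \
#        + ce(nuisance_head(grad_reverse(emb)), nuisance_labels)
# Minimizing this single loss trains the nuisance head normally, while the reversed
# gradient encourages the encoder to strip nuisance information from emb.
```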
2.3 Biases in speaker recognition
Owing to the applications of biometrics in general and ASV in particular in immigration
and law enforcement, and other high-stakes systems, it is imperative that such technologies
be unbiased of demographic attributes. Study of algorithmic bias in biometrics has garnered
much interest in recent times (Ross et al., 2019). As pointed out in (Drozdowski, Rathgeb,
Dantcheva, Damer, & Busch, 2020), an overwhelming majority of bias detection and mit-
igation works in biometric systems have focused on face recognition (Beveridge, Givens,
Phillips, & Draper, 2009; Grother, Ngan, & Hanaoka, 2019; Klare, Burge, Klontz, Bruegge,
& Jain, 2012; Ryu, Adam, & Mitchell, 2017), with a few works on fingerprint matching
(Galbally, Haraksim, & Beslay, 2018; Preciozzi et al., 2020). On the other hand, fairness
in voice-based biometrics has remained a relatively under-explored field with only a handful
of works in that direction (Fenu, Marras, Medda, & Meloni, 2021; Fenu, Medda, Marras,
& Meloni, 2020; Toussaint & Ding, 2021). Differences in speaker verification scores due to
various factors, including individual differences, have been found in previous works (Stoll, 2011).
One can imagine the potential harm caused by ASV technologies that are biased
against certain populations. For example, an ASV system that produces scores that are
dependent on the demographics can cause speakers from a particular population to be prone
to more false matches than others (Toussaint & Ding, 2021). This can lead to reduced trust
in these systems for people from that demographic population. Therefore, it is necessary to
understand and mitigate these biases from ASV systems, which forms a core focus in this
work. A detailed summary of prior work is discussed later in Section 5.2 in Chapter 5.
Chapter 3
Empirical analysis of information
encoded in neural speaker
representations
In this work, we demonstrate that speaker embeddings extracted from a deep learning model
capture information pertaining to several factors of variability. In particular, we design a
suite of experiments on a range of datasets to comprehensively analyze the popular x-vectors
extracted from a pre-trained model that was trained using artificially augmented data. Our
results demonstrate that data augmentation is not sufficient to strip the embeddings of
nuisance information.
3.1 Introduction
Speaker embeddings are low-dimensional representations of speech that capture speaker characteristics.
They have numerous applications in tasks involving automatic speaker recognition,
such as speaker diarization (identifying who spoke when) (Beigi, 2011), voice biometrics
(Markowitz, 2000), anti-spoofing (C. Zhang et al., 2017) and personalized services such as
in smart home devices (Shin & Jun, 2015). Real-world, text-independent speaker recognition
requires that speaker embeddings capture speaker characteristics robust to all other
attributes unrelated to the speaker's identity, such as the acoustic conditions, microphone
characteristics and (aspects of) lexical content in the speech signal.
The work presented in this chapter was published in the following article: Peri, Raghuveer, et al. "An Empirical Analysis of Information Encoded in Disentangled Neural Speaker Representations." Proc. Odyssey 2020 The Speaker and Language Recognition Workshop. 2020.
It is important to understand the extent to which the various factors are entangled in these
representations in order to compare the performance of different speaker embeddings for downstream
tasks. Such analyses also provide valuable directions to improve the task-specific robustness
of these representations to various confounding factors. In addition, we argue that a deeper
understanding of what the speaker embeddings capture can be considered as one step closer
towards explainable speaker recognition systems. Here, we present quantitative evidence of
information retained in the speaker embeddings with respect to various factors. We chose
classification tasks related to a particular factor of variability as a proxy for the amount of
information encoded in the speaker representations related to that factor as has been done
in several previous works (Adi et al., 2016; Shon et al., 2018; S. Wang et al., 2017; Williams
& King, 2019). We will first provide a background leading up to our analysis, such as the
factors of variability in speaker embeddings in Section 3.2, and then provide details of the
datasets we use for the analysis in Section 3.3. We will describe the method used to perform
the analysis in Section 3.4. Finally we present results in Section 3.5, and then provide
concluding remarks in Section 3.6.
3.2 Background
3.2.1 Neural speaker embeddings: x-vector
Even prior to the proliferation of deep learning techniques, the topic of extracting robust
speaker representations invariant to various factors of variability has been widely studied
in the literature (Kenny et al., 2007). These methods typically relied on statistical models
of frame-level features to extract utterance-level speaker-discriminative representations. In
addition to training them to be speaker discriminative, further channel compensation steps
were applied to minimize variability due to recording conditions (Castaldo et al., 2007;
Solomonoff, Campbell, & Quillen, 2007).
Recently, supervised speaker modeling techniques have been developed (Snyder et al.,
2018; Variani et al., 2014). These methods differ from the previous approaches in that they
do not try to explicitly separate the factors of variability. Instead, they attempt to make the
speaker embeddings robust to variability by using data augmentation strategies. One such
popular technique, called x-vectors, was proposed (Snyder et al., 2018). They are speaker
embeddings extracted from the bottleneck layer of a time-delay neural network (TDNN),
which was trained on a large corpus of augmented audio recordings to recognize speaker
identity. One pre-trained model (https://kaldi-asr.org/models/m7) has been shown to be
particularly effective in speaker verification tasks. The model was trained on the VoxCeleb
dataset to predict speakers.
The VoxCeleb dataset consists of in-the-wild recordings of celebrity interviews in various real-world
conditions. It comprises audio recordings corresponding to 1.2M utterances from 7,323
distinct speakers. Since the dataset is sourced from unconstrained recording conditions, there
exists a large amount of variability in terms of channel conditions and content factors
such as lexical and emotion content (Albanie, Nagrani, Vedaldi, & Zisserman, 2018). These
audio recordings were further augmented by artificially adding background noise and music
at varying signal-to-noise ratio levels (Snyder et al., 2018). In order to simulate the effect of
reverberation, the audio signals were convolved with various room impulse responses from
https://www.openslr.org/. This process doubled the amount of training data to 2.4M
utterances keeping the number of speakers the same.
3.2.2 Factors of variability
Ideally, speaker embeddings are expected to capture identity-related information only, while
being invariant to other factors of variability. But, in practice it is challenging to make the
speaker embeddings invariant to all the nuisance factors. Variability in speaker embeddings
can be caused due to a wide range of factors. Several works have studied the effect of
intrinsic factors which are acquired by the signal during the speech production stage, such as
the emotional state, lexical content etc (Bao, Xu, & Zheng, 2007; Kahn, Audibert, Rossato,
& Bonastre, 2010; Parthasarathy, Zhang, Hansen, & Busso, 2017). Other studies have
focused on extrinsic factors that are acquired at the signal recording stage (Nandwana et al.,
2018). These different factors of variability are entangled to varying degrees in the speaker
representations.
Following (S. Wang et al., 2017), for ease of analysis, we categorize the factors into three
distinct categories: channel factors, which are encoded in the speech signal during the process
of signal acquisition, such as background acoustic noise, microphone characteristics, room
response and acoustic scene, content factors which encode information about the spoken
content, such as lexical aspects, emotion, sentiment etc. (Li et al., 2018), and speaker factors, which
are inherent to the speaker, such as speaker identity, age, language and gender. Referring
to Figure 1.1.2, the channel and content factors fall under the category of within-speaker
variability, while speaker factors can be categorized as between-speaker variability.
Channel factors: These factors have been extensively studied in the literature in the context
of speaker recognition. The goal of robust speaker recognition systems is to rid the speaker
embeddings of these factors, while retaining the speaker-related factors. As mentioned
in Chapter 2, early speaker recognition systems either attempted to decompose channel-
dependent factors from the speaker-dependent factors during speaker modeling (Kenny et
al., 2007), or introduced channel compensation steps in the total variability space (Castaldo
et al., 2007). X-vector based methods do not make this separation explicit and can contain
a significant amount of channel-related information, as shown through channel classification
experiments (Raj et al., 2019). The information pertaining to channel factors can be evaluated
with regard to several factors of variability, such as microphone location, background
noise type etc. However, previous works have used artificially augmented data for such
analysis, and typically only the noise type is considered. There are
other sources of variability that can have an effect on the performance of speaker recognition.
To facilitate a more detailed analysis, a dataset that has controlled recordings in all these
conditions is needed. The VOiCES dataset (Richey et al., 2018) provides exactly that; it
will be described later in this chapter. Owing to the availability of these labels, we consider
three specific factors of variability:
1. Microphone type: There were two particular microphone types in the VOiCES dataset,
studio microphone and lavalier mic. The studio microphone is a standard high quality
one, whereas the lavalier microphone is a less expensive lapel microphone. We wanted
to analyze if speaker embeddings retain information about the type of microphone
that was used for recording. Ideally, the speaker embeddings should be invariant to
the quality of microphones used to record speech.
2. Microphone location: In the VOiCES dataset, recordings were made simultaneously
from a range of different distances from the speech source. Speaker embeddings should
be invariant to how far the microphone is from the speech source, and we wanted to
investigate if this is truly the case. In particular, we chose all recordings made using
the studio microphone at 7 different locations for the analysis.
3. Noise type: Acoustic noise is an important factor of variability that can affect the per-
formance of speaker recognition systems. Background noise is ubiquitous, and speaker
embeddings should ideally be invariant to it. The recordings in the VOiCES
dataset were made under 4 conditions, one without any noise, and the others with
babble, music, and television noise played from loudspeakers called distractors.
Content factors: These factors encode information about the spoken content, including lexical
content, emotional content, sentiment etc. These factors of variability occur as a result of
the difference in spoken content, and are entangled in speech during the speech production
phase. There could be several reasons for such variability; typically it
arises due to the inherent state of the speaker, and is therefore sometimes termed 'within-speaker
variability'. Hansen and Hasan (Hansen & Hasan, 2015) provide a nice overview
of the different factors of variability, and how they can affect speaker recognition. Some of
the reasons could include the emotional state, conversational context (formal or informal),
lombard effect, situational task stress (Hansen, 1996) etc. These factors have been shown in
past works to be entangled with speaker-related factors in speaker representations (Shon et
al., 2018; Williams & King, 2019). We consider three specific factors as follows:
1. Emotional content: Emotional content is an inherent part of human speech (Mehra-
bian, 2008). The emotional state of a person has been shown to affect the physiology
of voice production (Scherer, 1986), and is therefore manifested in the produced speech.
However, it is important that such changes in the vocal characteristics do not affect
the speaker recognition systems. Therefore, speaker embeddings are expected to be
invariant to such variability. Following (Williams & King, 2019), we use the IEMOCAP
dataset and perform our analysis using 4 emotion categories: happiness, sadness, anger
and neutral.
2. Sentiment: We use the CMU-MOSEI dataset to analyze sentiment information in
speaker embeddings. Though closely related to emotion content, sentiment analysis
provides insights into how much information about the longer-term state of a person
(as captured in speech) is retained in speaker embeddings. Furthermore, performing
such analysis on a different dataset than one used for emotion content provides further
evidence of the amount of content information captured in speaker embeddings.
3. Lexical content: Lexical content is an inherent and important part of speech. How-
ever, most practical applications of speaker recognition technologies require that they
work well irrespective of what is being said. This scenario is called text-independent
speaker recognition, and is useful for widespread applicability of speaker recognition
technologies. However, since the lexical content forms a major part of the information
present in speech, completely stripping the speaker embeddings of such information is
extremely challenging. These factors have been shown in past works to be entangled
with speaker-related factors in speaker representations (Lozano-Diez et al., 2016). For
such analysis of lexical content, it is important that the dataset used contains multiple
speakers uttering the same sentences. The Reddots dataset provides this flexibility,
where the participants were asked to record a set of 10 sentences that were common
across all the participants.
Speaker factors: These factors are inherent to the speaker (cf., demographics), and
capture the identity of the speaker to varying degree. For example, while gender, language
and age are not sufficient to fully recognize a speaker’s identity from speech, speaker em-
beddings do capture these dimensions, and are sometimes useful in improving the speaker
recognition performance. Thus, examining how speaker embeddings perform for identifying
these individual factors is key to understanding what information is beneficial for speaker
recognition. In particular, it has been found that i-vectors and x-vectors can be successfully
used to classify gender (Raj et al., 2019; S. Wang et al., 2017). However, it is still unclear
whether speaker embeddings that are more discriminative of gender significantly improve
speaker recognition performance. A related, but seemingly orthogonal viewpoint is that
of fairness in speaker recognition. Differences in performance between demographic groups
have been studied extensively in the ASR community. Studies of ASR performance in low-resource
language settings have gained significant attention (Besacier, Barnard, Karpov, &
Schultz, 2014). Much focus has also been given to ASR in the presence of accented speech
(Shi et al., 2021). An early work performed an extensive statistical analysis to understand how the performance of ASV
systems varies between individuals (Doddington, Liggett, Martin, Przybocki, & Reynolds,
1998). Individuals were grouped based on how easy or difficult it is for them to be imitated
by an impostor, thereby affecting ASV performance by contributing disproportionately to
errors. However, fairness analysis was not conducted with respect to demographics. Since
speaker embeddings are arguably the most important aspect of speaker recognition systems,
it is beneficial to understand to what extent demographic information is captured in the
speaker embeddings. For the gender analysis, we use the labelled data from the RedDots
dataset. For the language analysis, we use the Mozilla Common Voice dataset. Both of these
are described in detail in the next section.
Table 3.1: Factors in speaker embeddings and corpora used to quantify them. C denotes the number of unique classes; Num. hidden denotes the number of hidden layers in the classification neural network.

Category | Factors considered | Study corpora | Num. hidden
Channel | Mic. type (C=2) | VOiCES (Richey et al., 2018) | 3
Channel | Noise type (C=4) | VOiCES (Richey et al., 2018) | 3
Channel | Mic. location (C=7) | VOiCES (Richey et al., 2018) | 3
Content | Emotion (C=4) | IEMOCAP (Busso et al., 2008) | 1
Content | Sentiment (C=3) | MOSEI (Bagher Zadeh, Liang, Poria, Cambria, & Morency, 2018) | 1
Content | Lexical (C=10) | RedDots | 3
Speaker | Speaker id. (C=10) | VOiCES | 3
Speaker | Gender (C=2) | RedDots (Lee et al., 2015) | 3
Speaker | Language (C=4) | Common Voice (Ardila et al., 2019) | 3
Figure 3.3.1: Example room configuration in the VOiCES dataset (Nandwana et al., 2019). Pri-
mary speaker plays clean speech. Distractor represents noise source (playing different noise
types) and circles represent microphones at different locations. This setup simulates real-life
noisy scenarios that speech processing applications typically encounter.
3.3 Dataset
As mentioned in Section 3.2, we performed experiments on a number of publicly available
datasets. Each dataset was chosen to enable the exploration of individual factors, while
providing an analysis of how these factors are manifested in speaker representations on real-
world data. In Table 3.1, we summarize a few of the exemplar factors belonging to the three
categories mentioned previously. We also list the corpora that we have considered in this
work for analysis. These datasets are explained in detail below:
VOiCES: Recordings collected from 4 different rooms with microphones placed at var-
ious fixed locations, while a loudspeaker played clean speech samples from the Librispeech
(Panayotov, Chen, Povey, & Khudanpur, 2015) dataset. Along with speech, noise was played
back from loudspeakers present in the room, to simulate real-life recording conditions. Fig-
ure 3.3.1 shows one such room configuration and data collection setup where “Distractor”
represents the noise source and the circles represent the available microphones. The yellow
(large) circles represent studio microphones, while the pink (small) circles represent lavalier
microphones. The loudspeaker and the distractor were placed with their cones facing the
center of the room. We refer interested readers to SRI's published work and their website
(https://iqtlabs.github.io/voices/) for more details (Richey et al., 2018). This dataset was
chosen for the availability of annotations
regarding the acoustic conditions including noise types, microphone types and locations. In
particular, 3 different noise types (babble, music and television) were considered along with
the‘clean’condition. Forthemicrophonelocation, wedistinguishbetweenthe7differentlo-
cations of the studio microphone. We also use information about the 2 different microphone
types, studio and lavalier. We use this dataset to explore the amount of information related
to the microphone location, noise type and microphone type. We also use the VOiCES cor-
pus to evaluate speaker verification performance under various acoustic conditions, which
will be described in Section 4.5.2. We perform our evaluations on the phase-2 release of
the dataset to be consistent with the data used during the evaluation in VOiCES challenge
(Nandwana et al., 2019).
IEMOCAP: A multi-modal corpus consisting of video, audio, face motion, hand move-
ments and text transcriptions (Busso et al., 2008). These modalities were recorded from 10
actors during scripted and enacted conversations. Annotations are provided for discrete and
continuous emotion ratings. Following (Williams & King, 2019), we make use of a small
subset consisting of the audio portion of the data with a single emotion label per utterance,
leading to 1403 utterances corresponding to 4 discrete emotions: angry, sad, happy and
neutral. This dataset provides the capability to investigate the amount of emotional infor-
mation present in the speaker embeddings. Further, we use this dataset to perform speaker
identification experiments to explore the speaker information present in the embeddings.
MOSEI: An audio dataset of more than 23000 YouTube videos annotated for emotion
and sentiment (Bagher Zadeh et al., 2018). Audio from this corpus was used for the analysis
of emotion and sentiment information captured in speaker embeddings. Since the emotions
elicited in all the recordings were spontaneous, this database was useful to validate the
generalizability of our analysis to real-world spontaneous data. We performed pre-processing
of the raw labels provided in the corpus. Each audio recording in the dataset was annotated
for scores corresponding to six emotion labels. We chose the emotion label with the maximum
score for each audio recording. For sentiment analysis, we convert the labels provided in the
dataset into 3 distinct labels, positive, negative and neutral sentiment.
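The label pre-processing just described can be sketched as follows; the column names, score format, and sentiment thresholds are illustrative assumptions, not the exact CMU-MOSEI release schema.

```python
# Illustrative sketch of the MOSEI label pre-processing described above.
# Column names and the sentiment thresholds are assumptions for illustration.
import numpy as np
import pandas as pd

EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "surprise"]

def preprocess_labels(df: pd.DataFrame) -> pd.DataFrame:
    """df: one row per recording, six emotion-score columns and a real-valued
    'sentiment' column (assumed layout)."""
    out = df.copy()
    # Emotion label: the emotion with the maximum annotated score.
    idx = df[EMOTIONS].values.argmax(axis=1)
    out["emotion"] = pd.Series(idx, index=df.index).map(dict(enumerate(EMOTIONS)))
    # Sentiment label: collapse the real-valued score into 3 classes.
    out["sentiment_label"] = np.select(
        [df["sentiment"] > 0, df["sentiment"] < 0],
        ["positive", "negative"],
        default="neutral",
    )
    return out
```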
Common Voice: A publicly available corpus of crowdsourced audio recordings (Ardila
et al., 2019). It consists of transcribed text from multiple languages, from which we chose
a subset of 4 languages (English, German, Mandarin and Turkish) for ease of analysis. The
languages were chosen such that a sufficient number of samples exists for each language. This
corpus is used for the language prediction task.
RedDots: Speech recordings collected from participants using their own audio recording
device, typically a mobile phone (Lee et al., 2015). The dataset comprises recordings of
participants reading out sentences, some of which were common across all the participants
while the others were unique to each participant. To ensure that experiments involving
the lexical content factor were not confounded by variability due to speakers, we pruned the data
and obtained a small subset of recordings consisting of only the 10 common sentences spoken
by all the speakers. This corpus is used for analysis of lexical content and also for the gender
prediction task.
3.4 Methods
3.4.1 x-vectors
We extract x-vectors using the publicly available pre-trained model (https://kaldi-asr.org/models/m7) and consider them
as the baseline speaker embeddings for analysis. The x-vector model was trained on a
large corpus of audio recordings to predict speakers. These audio recordings were further
augmented by artificially adding background noise and music at varying signal-to-noise ratio
levels (Snyder et al., 2018). In order to simulate the effect of reverberation, the audio signals
were convolved with various room impulse responses. As discussed in Section 3.2, since
x-vectors were not trained with an explicit disentanglement stage, they retain information
pertaining to factors unrelated to speaker identity.
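For concreteness, the augmentation strategy described above (additive noise at a target SNR plus convolution with a room impulse response) can be sketched as follows; the SNR value, normalization, and signal-loading step are illustrative assumptions, not the exact recipe used to train the pre-trained model.

```python
# A minimal sketch of noise-and-reverberation augmentation, under assumed I/O.
import numpy as np
from scipy.signal import fftconvolve

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` so the resulting SNR is approximately `snr_db`."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate reverberation by convolving the signal with a room impulse response."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)        # simple peak normalization

# Hypothetical usage: augment one utterance with music noise at 10 dB SNR, then add reverb.
# augmented = add_reverb(add_noise_at_snr(speech, noise, snr_db=10.0), rir)
```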
3.4.2 Dimensionality reduction
We also reduce the dimension of x-vectors (denoted xvector+PCA) using principal com-
ponent analysis (PCA), similar to (Williams & King, 2019). This was done to match the
dimension of x-vectors with that of the disentangled speaker embeddings (that will be ex-
plained in later chapters), to ensure fair comparison with respect to embedding dimension.
For dimensionality reduction, in addition to using PCA, we also employ another simple
technique, which we denote as non-linear dimensionality reduction (NLDR). In this
method, we use a subset of the un-augmented VoxCeleb dataset (that was used to train the
pre-trained x-vector model) to train a simple 2-hidden-layer neural network model to classify
speaker identity. We use the hidden layer activations of the 2nd layer as the dimensionality-reduced
speaker embeddings.
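A minimal sketch of these two dimensionality-reduction baselines follows, assuming the x-vectors are available as arrays; the input, hidden, and output dimensions shown here are placeholders rather than the exact values used in our experiments.

```python
# Sketch of the xvector+PCA and NLDR baselines described above (assumed dims).
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

# (1) xvector+PCA: unsupervised linear reduction fit on training x-vectors.
def pca_reduce(train_xvecs, test_xvecs, dim=128):
    pca = PCA(n_components=dim).fit(train_xvecs)
    return pca.transform(test_xvecs)

# (2) NLDR: a 2-hidden-layer speaker classifier; the 2nd hidden layer's
# activations serve as the reduced embedding.
class NLDR(nn.Module):
    def __init__(self, in_dim=512, hidden_dim=128, n_speakers=7323):
        super().__init__()
        self.h1 = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.h2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.out = nn.Linear(hidden_dim, n_speakers)

    def forward(self, x):
        e = self.h2(self.h1(x))   # 2nd hidden layer = dimensionality-reduced embedding
        return self.out(e), e
```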
Table 3.2: % Accuracy (% F1 for Emotion) of classifying different factors using speaker embeddings (x-vectors and their transformations). Robust speaker embeddings are expected to perform poorly for classifying non-speaker related tasks, while achieving high accuracy in predicting speaker-related factors.

Category | Factors | Majority baseline | x-vector | x-vector + PCA | x-vector + NLDR
Channel ↓ | Mic. type (C=2) | 75.00 | 94.76 | 91.91 | 81.78
Channel ↓ | Mic. location (C=7) | 16.67 | 92.01 | 86.16 | 68.94
Channel ↓ | Noise type (C=4) | 25.00 | 94.90 | 92.20 | 77.89
Content ↓ | Emotion (C=4) | 87.49 | 91.70 | 91.08 | 86.99
Content ↓ | Sentiment (C=3) | 41.53 | 54.35 | 51.76 | 49.00
Content ↓ | Lexical (C=10) | 10.27 | 98.00 | 97.10 | 72.91
Speaker ↑ | Speaker Id. (C=10) | 12.90 | 99.60 | 99.60 | 98.74
Speaker ↑ | Gender (C=2) | 81.55 | 99.00 | 99.50 | 97.41
Speaker ↑ | Language (C=4) | 74.16 | 97.20 | 96.60 | 92.87
3.4.3 Classification model
As mentioned in Section 3.2, to measure the extent of entanglement of the various factors
in speaker embeddings, we designed classification experiments. In these experiments, the
speaker embeddings were used as input to predict each factor of variability. Such a method
of analysis assumes that if a factor is encoded in the speaker embeddings, a classifier can
be trained to predict the factor using the speaker representations as input. Further, the
classification accuracy can be considered a proxy for how well the factor has been encoded in
these speaker embeddings (S. Wang et al., 2017). We employ neural network models (with
number of hidden layers denoted in Table 3.1) for the classification task of each factor, and
use non-overlapping splits of the corresponding datasets for training, validation, and testing.
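A minimal sketch of such a probing classifier is given below, following the hyperparameters described in Section 3.5 (256-unit hidden layers, ReLU, Adam with lr = 0.0002, L2 regularization, early stopping on the validation loss); the data loaders, weight-decay value, and patience are illustrative assumptions.

```python
# Sketch of a probing classifier: predict one factor from speaker embeddings.
import torch
import torch.nn as nn

def make_probe(emb_dim: int, n_classes: int, n_hidden: int) -> nn.Sequential:
    layers, d = [], emb_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, 256), nn.ReLU()]
        d = 256
    layers.append(nn.Linear(d, n_classes))
    return nn.Sequential(*layers)

def train_probe(model, train_loader, val_loader, patience=5, epochs=100):
    # weight_decay provides the L2 regularization; its value is an assumption.
    opt = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    best, wait = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for emb, label in train_loader:
            opt.zero_grad()
            loss_fn(model(emb), label).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(e), y).item() for e, y in val_loader)
        if val < best:                      # early stopping on validation loss
            best, wait = val, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return model
```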
3.5 Experiments and Results
3.5.1 Setup
As mentioned in Section 3.2, we analyze the extent of information present in the speaker
embeddings with respect to various factors. For each factor that was analyzed, we extracted
x-vectors using the pre-trained model mentioned in the previous section from the corre-
sponding dataset for that factor. We trained a classification model for each factor using a
simple feed-forward deep neural network. In general, we trained neural network models on
the ‘train’ split of each of the datasets, while monitoring the performance using the ‘valida-
tion’ split to avoid over-fitting. The final comparison of results was done on the ‘test’ split.
Informed by the training and validation loss, we used a different number of hidden layers for
the different tasks, as explained in Table 3.1. Each layer contained 256 hidden neurons and
had a ReLU activation function. Similar to (Williams & King, 2019), we use L2 regularization,
Adam optimizer with learning rate lr = 0.0002, and an early-stopping criterion monitored
by the loss on validation set. The classifier models were trained to predict the factor with
the speaker embeddings as input.
For the IEMOCAP dataset, for fair comparison, we use the same train and test split as
in (Williams & King, 2019). For the RedDots dataset, in the lexical content analysis task,
for each speaker we randomly split 80% of the data into training, 10% into validation and
10% as the test part. In the gender classification task, based on the previous split, we further
perform minority class over-sampling to balance the data for each gender during training.
Results summarizing the classification performance of the x-vectors with and without
PCA are reported in Table 3.2, where the reported % values were rounded to the nearest
decimal. For the emotion classification task, we report the %F1 score averaged over the 4
emotion classes, while for all other tasks, we report the % classification accuracy.
3.5.2 Results
Channel factors
We explore three different channel variability conditions: microphone type, microphone lo-
cation and noise type. We trained classifiers to predict the microphone location (possible 7
locations with studio microphones) at which the recording was collected given the speaker
embeddings extracted from a recording. We built classifiers to predict the noise type (clean,
babble, music, television) from the speaker embeddings. We also trained classification mod-
elstopredictthetypeofmicrophone(studioorlavalier)usedfortherecordingsfromspeaker
embeddings.
First, we observe that in all the cases, x-vectors perform better than the majority class
baseline (i.e., if all samples were classified as the majority class). While x-vectors are expected to
be invariant to the content and channel factors, we observe they retain substantial information
related to those factors as well, as reflected by the high classification accuracies (>90%).
This is despite the fact that the x-vector models were trained using unconstrained record-
ings obtained from different types of microphones, and the data was further augmented with
additive noise and reverberation. Unsupervised dimensionality reduction using PCA seems
to reduce the amount of information pertaining to factors other than speaker factors, albeit
minimally.
Content factors
Experiments were performed to analyze the information related to 3 different content factors:
lexical content, emotional content and sentiment, using the corresponding corpora mentioned
in Table 3.1.
First, we observe that x-vectors retain information pertaining to all three factors. In
particular, lexical information seems to be captured to a large extent in x-vectors. We also
observe that, unsupervised dimensionality reduction using PCA has the effect of reducing
the amount of information pertaining to factors other than speaker factors. However, the
reduction in accuracy is minimal. With the NLDR technique, though, we observe that
content information (lexical content in particular) is greatly reduced. However, the accuracies
are still higher than the majority class baseline, pointing to the need for more sophisticated
techniques to reduce the nuisance information from speaker embeddings.
Speaker factors
To analyze the speaker-related information present in the speaker representations, we per-
formed separate experiments to predict the speaker identity, speaker gender, and language.
We used the speaker labels in the VOiCES dataset from which 10 speakers were randomly
chosen for the speaker identification task. We trained classifiers to predict the speaker iden-
tity given the speaker embeddings. We also trained classifiers on the RedDots dataset to
make binary gender predictions using the speaker embeddings as input. For the language
prediction task, classifiers were trained using the Mozilla Common Voice dataset to classify
the speaker embeddings into one of 4 languages (English, German, Turkish, Mandarin).
First, we observe that in all the cases, x-vectors perform much better than the majority
class baseline. For the speaker factors, this is expected because the x-vectors were trained
to be speaker discriminative. Therefore, though the speaker embeddings were trained using
speaker classification objective only, they also retain gender (accuracy=99%) and language
(accuracy=97.2%) related information. In addition, PCA to reduce dimensionality does not
result in much loss of classification accuracy because the speaker factors are the modes of
maximum variance in x-vectors. NLDR technique results in a slight degradation in perfor-
mance in predicting gender and language compared to x-vectors. At first glance this may
seem to suggest that x-vectors retain a fair amount of speaker information, and removing nuisance
information is unnecessary. However, we should note that the performance shown here
is for a relatively easy closed-set speaker classification task. As we will see later, the presence of
nuisance information in x-vectors leads to degraded performance in the challenging speaker
verification task.
3.6 Discussion
We discussed the several factors of variability that can be encountered in speaker embeddings,
including speaker-related and speaker-unrelated factors, with concrete examples of each. We
further showed through experimental evaluation on several datasets and a suite of factors
that x-vectors retain substantial information pertaining to these factors. This is despite
x-vector models being trained to be speaker discriminative and expected to be invariant to
nuisance factors. We also show that a simple supervised dimensionality reduction technique
can reduce the amount of speaker-unrelated factors in x-vectors. However, there is scope
for further reduction in nuisance information, as evidenced by the difference from the majority
class baseline. This points to the need for more sophisticated disentanglement techniques,
which we discuss in the next chapter.
Chapter 4
Unsupervised adversarial
disentanglement for
nuisance-invariant speaker recognition
In the previous chapter, we have seen how several factors of speech variability are encoded
in neural speaker embeddings to varying degrees. In this chapter, we will first describe how
unsupervised disentanglement techniques can be applied to remove nuisance information
from speaker embeddings. The primary focus will be on learning speaker embeddings that
are discriminative with respect to speaker identity. We will provide details of our models
and training strategy. Finally, we will present results that show effective disentanglement of
emotion information from speaker embeddings with minimal loss of speaker information. In
addition, we show how such embeddings improve the robustness of speaker verification on
the challenging VOiCES dataset, and for speaker diarization on the AMI dataset.
The work presented in this chapter has been published at ICASSP 2020: Peri, Raghuveer, et al. “Ro-
bust speaker recognition using unsupervised adversarial invariance.” ICASSP 2020-2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
4.1 Introduction
We saw in the previous chapter that embeddings that are trained to capture speaker-specific
information using data augmentation techniques still retain information pertaining to a va-
riety of nuisance factors. In addition, though PCA retains the most informative features
which are related to speaker identity, they still capture nuisance information. We further
found that non-linear dimensionality reduction using a neural network helps reduce nuisance
information. However, we found that there is still scope for improvement.
Several works in the past have investigated ways to tackle this nuisance information
leakage into speaker embeddings using adversarial learning techniques. Though specific
implementations vary, the general idea is to train a neural network such that the intermediate
representations are discriminative of the primary task at hand (speaker identification in our
case), while being stripped of specific nuisance factors. However, most of these techniques are
supervised with respect to the nuisance factors. In other words, invariance can be obtained
only against specific nuisance factors for which it is trained. However, such labelled data
might not be readily available in many real-world scenarios. This necessitates unsupervised
adversarial training, which can learn speaker representations robust to channel and other
acoustic variability without knowledge of any particular nuisance factor. Such work in the
speech domain has been largely unexplored. We intend to bridge this gap by exploring an
unsupervised technique to disentangle speaker factors from nuisance factors.
We will first provide a background on the general idea of adversarial learning in Section
4.2, and then provide details of the method we employ in Section 4.3. We then provide details
of the datasets we use for training and evaluation in Section 4.4. Results are presented in
Section 4.5 comparing the proposed speaker embeddings in removing nuisance factors, while
improving robustness of speaker embeddings. Finally, we provide concluding remarks in
Section 4.6.
Figure 4.2.1: Block diagram of a general adversarial training method for nuisance-invariant
speaker recognition. The encoder takes x-vectors or MFCCs as input; a speaker predictor and a
nuisance discriminator are attached to its output, with adversarial gradients flowing from the
discriminator back to the encoder. The discriminator is trained in an adversarial fashion to learn
nuisance-removed speaker embeddings. It requires the speaker label in addition to the specific
nuisance label for which invariance is desired.
4.2 Background: Adversarial learning
Adversarial learning refers to the general class of training methodologies where two models
are trained with competing objectives. These methods are closely related to domain adapta-
tion techniques, where the goal is to learn representations that are invariant to the domain
from which the input is taken (Ganin et al., 2016). Figure 4.2.1 shows a block diagram
of a general adversarial learning pipeline consisting of an encoder, predictor and nuisance
discriminator. In adversarial learning methods for nuisance-invariant speaker embeddings,
the encoder takes raw speech, speech representations or pre-trained speaker embeddings,
and is trained to produce representations that are speaker-discriminative using the classifier
to predict speaker labels. The discriminator is trained to predict nuisance labels from the
encoded representations. In addition, the encoder is simultaneously trained to maximize the
nuisance discriminator loss. This way, the encoded representations contain minimal informa-
tion pertaining to the nuisance factors, while being speaker-discriminative. Such techniques
have been previously used to learn speaker representations that are noise-invariant (Meng
et al., 2019), channel-invariant (Zhou et al., 2019) etc. One way to train the encoder to
maximize the nuisance-discriminator loss is direct maximization, by alternately training the
encoder, predictor and discriminator. Other techniques, such as applying a gradient
reversal layer (GRL), have also been employed (Q. Wang et al., 2018).
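As an illustration, a gradient reversal layer can be implemented in a few lines; the sketch below is a generic PyTorch version rather than the exact implementation used in the cited works, and the encoder, speaker predictor, and discriminator modules in the usage comment are assumed to be defined elsewhere.

```python
# Generic gradient reversal layer: identity in the forward pass, negated
# (scaled) gradient in the backward pass, so the encoder is pushed to make
# the nuisance discriminator fail.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the encoder.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical usage inside a model's forward pass:
#   emb = encoder(features)
#   speaker_logits = speaker_predictor(emb)
#   nuisance_logits = discriminator(grad_reverse(emb))   # adversarial branch
```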
On successful training, the encoder is able to extract speaker-discriminative representa-
tions which are invariant to nuisance factors. During inference, predictor and discriminator
are discarded, and the encoder is used to produce speaker embeddings. However, one obvious
limitation of this approach is that nuisance factor labels are required during training. This
can be obtained easily using a data augmentation strategy for some factors of variability. For
example, data augmentation has been successfully applied to artificially introduce additional
variability in acoustic conditions to train speaker embeddings. Typically, additive noise is
added to the speech signal, while also convolving the signal with room impulse response to
introduce reverberations. However, it can be challenging to introduce certain factors of vari-
ability that do not follow the straightforward additive or convolution models, as with acoustic
channel conditions. For example, it has been found that the emotional state of a person af-
fects vocal characteristics which are reflected in speaker embeddings. However, artificially
augmenting data with different emotional states is extremely challenging. Even if such aug-
mentation is possible, it might not reflect the real-world conditions, and is usually a poor
approximation of real emotional speech. Also, obtaining large-scale spontaneous speech containing
labels of nuisance factors such as emotional states is difficult and often impractical.
This necessitates unsupervised adversarial training, that can learn speaker representations
robust to channel and other acoustic variability without knowledge of any particular nuisance
factor. Such work has been explored to some extent in the computer vision domain, but has
been largely unexplored for speech. Recently, an unsupervised approach to induce invariance
for automatic speech recognition was introduced in (Hsu et al., 2019). However, the goal
of that work was to remove speaker-specific information from the speech representations,
whereas our goal is the opposite: to extract speaker-discriminative information. Though
we discuss unsupervised adversarial disentanglement here, in Appendix A using emotion as
an exemplar nuisance factor we show how nuisance labels, when available, can be used to
induce invariance in speaker embeddings.
Figure 4.3.1: Unsupervised adversarial invariance applied to speaker recognition. The architecture
consists of an encoder, a predictor, a decoder, two disentanglers and a randomized perturbation
module. The goal is to learn a split representation with speaker information in e_1 and nuisance
information in e_2 using adversarial training. Note that this does not require labels of any specific
nuisance factor to train.
4.3 Method
4.3.1 Unsupervised disentanglement
Inspired by previous work in the computer vision domain (Jaiswal et al., 2018a), we employ
an unsupervised adversarial invariance (UAI) technique to learn nuisance-invariant speaker
representations. The central idea behind this technique is to project the input speaker
representations into a split representation consisting of two embeddings, referred to as e_1 and
e_2 in Figure 4.3.1. While e_1 is trained with the objective of capturing speaker-specific
information, e_2 is trained to capture all other nuisance factors. This is achieved by training two
branches in an adversarial fashion.
$$L_{prim} = \alpha L_{pred}(s, \hat{s}) + \beta L_{recon}(x, \hat{x}) \qquad (4.3.1)$$

$$L_{sec} = L_{dis_1}(e_2, \hat{e}_2) + L_{dis_2}(e_1, \hat{e}_1) \qquad (4.3.2)$$

$$\min_{\Theta_{prim}} \max_{\Phi_{sec}} \; L_{prim} + \gamma L_{sec}, \quad \text{where } \Theta_{prim} = \Theta_e \cup \Theta_d \cup \Theta_p, \; \Phi_{sec} = \Phi_{dis_1} \cup \Phi_{dis_2} \qquad (4.3.3)$$
The goal of one branch, called the primary branch (consisting of the encoder, predictor and
decoder shown in green bounding boxes in Figure 4.3.1), is to predict speakers using e_1 as
input (using the predictor module) and reconstruct the x-vectors using e_2 and a randomly
perturbed version of e_1 as input (using the decoder module). The random perturbation
ensures that the network learns to treat e_1 as an unreliable source of information for the
reconstruction task, hence forcing e_1 to not contain information about factors other than the
speaker. The perturbation of e_1 is modelled as a dropout module that randomly removes
some dimensions from e_1 to create a noisy version. The primary branch produces the loss
term shown in Equation 4.3.1, where L_pred is modelled as a categorical cross-entropy loss to
predict speakers, and L_recon is modelled as the mean squared error (MSE) reconstruction loss of
the decoder. The terms Θ_e, Θ_d, Θ_p denote the network parameters of the encoder, decoder
and predictor respectively, as shown in Figure 4.3.1. The speaker prediction task forces e_1 to
capture speaker-related information, while the reconstruction task ensures that e_2 captures
information related to all factors.
The other branch, called the secondary branch (consisting of the disentanglers shown in
red bounding boxes in Figure 4.3.1), is trained to minimize the mutual information between
e_1 and e_2. This is achieved in the disentangler module, consisting of two networks that predict
e_1 from e_2 and vice-versa. The secondary branch produces the loss term given in Equation
4.3.2, which is the sum of the two disentangler losses, each of which is modelled as an MSE loss.
The terms Φ_dis1, Φ_dis2 denote the network parameters of the disentangler modules shown
in Figure 4.3.1. The UAI model is trained with the minimax objective shown in Equation
4.3.3 by alternating between the primary and secondary branch updates according to a fixed
schedule. The parameters α, β, γ control the contribution of the prediction, reconstruction
and disentanglement loss terms respectively.
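A condensed sketch of the loss terms in Equations 4.3.1 and 4.3.2 is given below, assuming the encoder, predictor, decoder and disentangler modules and the weights alpha, beta, gamma are defined elsewhere; the dropout rate and the exact alternating update schedule realizing the minimax game of Equation 4.3.3 are illustrative choices rather than the exact training recipe.

```python
# Sketch of the UAI loss terms (Equations 4.3.1 and 4.3.2), under assumed modules.
import torch
import torch.nn.functional as F

perturb = torch.nn.Dropout(p=0.5)   # randomized perturbation of e1 (rate assumed)

def primary_loss(x, spk, encoder, predictor, decoder, alpha, beta):
    """L_prim = alpha * L_pred(s, s_hat) + beta * L_recon(x, x_hat)."""
    e1, e2 = encoder(x)
    l_pred = F.cross_entropy(predictor(e1), spk)            # speaker prediction from e1
    x_hat = decoder(torch.cat([perturb(e1), e2], dim=-1))    # reconstruct x-vector from e2 + noisy e1
    l_recon = F.mse_loss(x_hat, x)
    return alpha * l_pred + beta * l_recon

def secondary_loss(e1, e2, dis1, dis2):
    """L_sec = L_dis1(e2, e2_hat) + L_dis2(e1, e1_hat): each disentangler
    tries to predict one embedding from the other."""
    return F.mse_loss(dis1(e1), e2) + F.mse_loss(dis2(e2), e1)

# Training alternates between updates of the primary branch parameters
# (encoder, predictor, decoder) and the secondary branch parameters
# (disentanglers) on a fixed schedule, realizing the minimax objective of
# Equation 4.3.3.
```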
4.4 Dataset
In this work, we designed experiments to analyze the information pertaining to several
nuisance factors retained in speaker embeddings. We also evaluate robustness of speaker
verification performance to two main sources of variability that can occur in real-world au-
dio recordings (microphone distance and noise type). Here, we present details of the publicly
available datasets that we use for the experiments.
AMI: To evaluate the performance of the proposed embeddings on the speaker diariza-
tion task, we use a subset of the AMI meeting corpus (McCowan et al., 2005) that is fre-
quently used for evaluating diarization performance (Cyrta, Trzciński, & Stokowiec, 2017;
Sun, Zhang, & Woodland, 2019). It consists of audio recordings from 26 meetings.
Vox: Our training data consists of a combination of the development and test splits of Vox-
Celeb2 (Chung et al., 2018) and the development split of VoxCeleb1 (Nagrani, Chung, &
Zisserman, 2017) datasets. This is consistent with the split that was used to train the pre-
trained x-vector model (mentioned in Section 3.2 in the previous chapter), but with no data
augmentation. It consists of speaker annotated in-the-wild recordings from celebrity speak-
ers. As such the dataset is sourced from unconstrained recording conditions. For brevity,
henceforth, we refer to this subset of the VoxCeleb dataset as Vox.
V19-eval and V19-dev: This refers to the VOiCES challenge dataset that was described in
Section 3.3 in the previous chapter. We use two subsets of this data corpus: the development
portion of the VOiCES challenge data (Nandwana et al., 2019) referred to as V19-dev and
the evaluation portion referred to as V19-eval. V19-dev is used for probing experiments as
discussed in Section 4.5.3, as it contains annotations for 200 speaker labels, 12 microphone
Table 4.1: Statistics of datasets used to train unsupervised disentanglement models and evaluate robustness of learned speaker embeddings to within-speaker factors of variability (utt: utterances, spk: speakers).

Name | Purpose | No. utt | No. spk | Nuisance annotations available
AMI | diarization | 26 (sessions) | 29 | no
V19-eval | verification | 11,392 | 47 | yes
V19-dev | clustering | 15,904 | 200 | yes
Vox | train | 1.2M | 7323 | no
locations and 3 noise types (babble, television, music). V19-eval is the evaluation portion of
the challenge data on which we perform our experiments to quantify information of factors
and for speaker verification experiments as described in Section 3.3.
Table 4.1 shows the statistics for the different datasets used in our work. We ensured
that the speakers contained in one dataset had no overlap with the speakers from any other
dataset.
4.5 Experiments and Results
We setup the following experiments to study the different aspects of our system:
1. Quantifying information with respect to nuisance factors (V19-eval dataset)
2. Robustness analysis of speaker verification (V19-eval dataset)
3. Unsupervised clustering (V19-dev dataset)
4. Speaker diarization with oracle speech segment boundaries (AMI dataset)
Baseline: We used x-vectors extracted from the pre-trained model (https://kaldi-asr.org/models/m7) as the baseline, to test if
the proposed model is able to improve robustness of speaker embeddings by removing the
nuisance factors from x-vectors. In the results in Section 4.5.1, the baselines are denoted
by x-vector, x-vector+PCA and x-vector+NLDR as explained in the previous chapter. In
the results in Sections 4.5.2 and 4.5.3, we denote the baseline method by x-vector and the
method using the proposed embeddings by e_1. In Section 4.5.4, the baselines are denoted
by Baseline 1 and Baseline 2, which are defined in the section.
4.5.1 Quantifying information
Setup
As discussed in Chapter 3, we analyze the extent of information present in the speaker
embeddings with respect to various factors. For this analysis, we used the proposed unsu-
pervised disentanglement model trained on data not augmented with any additional noise.
This model is denoted by M1. We also developed a model by training using data artificially
augmented with noise. This is denoted by M2, and is used in the speaker verification anal-
ysis discussed in Section 4.5.2. For each factor that was analyzed, we extracted x-vectors
using the pre-trained model mentioned above from the corresponding dataset for that fac-
tor. We trained a classification model for each factor using a simple feed-forward deep neural
network, details of which are provided in Section 3.5 of Chapter 3.
Results summarizing the classification performance of the different speaker embeddings are
reported in Table 4.2, where the reported % values were rounded to the nearest
decimal. For the emotion classification task, we report the %F1 score averaged over the 4
emotion classes, while for all other tasks, we report the % classification accuracy.
Results
Channel factors
First, we observe that in all the cases, the proposed unsupervised dis-
entanglement technique retains the least amount of information. In particular, we notice
that compared to the x-vector+NLDR method, the proposed method can further remove
nuisance information, even without explicit nuisance labels during training.
Content factors
Here as well, we find that the proposed unsupervised disentanglement tech-
nique retains reduced information compared with x-vectors. This shows that, even without
any additional supervision with respect to nuisance factors such as emotion labels, we can
reduce the information from speaker embeddings. In addition, compared to the NLDR tech-
nique, we observe that content information is reduced by using the proposed UAI technique
in the case of emotion and sentiment factors. Figure 4.5.1 shows the confusion matrix of
emotion recognition using the different embeddings. UAI provides the best disentanglement
as shown by the poor emotion recognition performance, showing that it can remove emo-
tion information substantially even without explicit nuisance labels during training. For the
lexical content, we observe that the NLDR technique is able to remove the most amount of
information.
Speaker factors
For the speaker id task on the VOiCES dataset (Richey et al., 2018), we
observe that the embeddings obtained using the unsupervised disentanglement technique
perform almost on par with x-vectors, and close to 100% accuracy, suggesting that speaker
identityisretainedduringthedisentanglementprocess. Thisresultcomplementsthespeaker
verification performance, see Section 4.5.2. We further observe a similar trend for the gender
prediction task on the RedDots dataset (Lee et al., 2015), where all the speaker embeddings
achieve high accuracy (>95%). This is consistent with past work (Raj et al., 2019; S. Wang
et al., 2017). Similarly, even for the language identification task, the proposed embeddings
achieve performance close to the x-vectors, with very minimal loss of information.
4.5.2 Speaker Verification
Setup: Channel factors
We evaluate the baseline and the proposed methods on the speaker verification task on the V19-
eval dataset. Following standard practice (Snyder et al., 2018), we perform dimensionality
Table 4.2: % Accuracy (% F1 for Emotion) of speaker embeddings in classifying different factors. Unsup. Dis. denotes the proposed unsupervised disentanglement embeddings. Robust speaker embeddings are expected to perform poorly for classifying non-speaker related tasks, while capturing maximal information pertaining to speaker-related factors. Values in bold denote the best performance among the different speaker embeddings for each factor.

Factors | Majority baseline | x-vector | x-vector + PCA | x-vector + NLDR | Unsup. Dis.
Mic. type (C=2) | 75.00 | 94.76 | 91.91 | 81.78 | 80.44
Mic. location (C=7) | 16.67 | 92.01 | 86.16 | 68.94 | 64.41
Noise type (C=4) | 25.00 | 94.90 | 92.20 | 77.89 | 76.02
Emotion (C=4) | 87.49 | 91.70 | 91.08 | 86.99 | 80.92
Sentiment (C=3) | 41.53 | 54.35 | 51.76 | 50.53 | 47.25
Lexical (C=10) | 10.27 | 98.13 | 97.41 | 71.47 | 78.96
Speaker id. (C=10) | 12.90 | 99.60 | 99.60 | 98.74 | 98.59
Gender (C=2) | 81.55 | 99.00 | 99.50 | 97.41 | 97.60
Language (C=4) | 74.16 | 97.20 | 96.60 | 92.87 | 95.27
reduction using linear discriminant analysis (LDA) and score the verification trials using a
probabilistic linear discriminant analysis (PLDA) backend for both the proposed embeddings
and the baseline. The LDA and PLDA models were learnt on the training data for our
proposed system, while for the baseline system we used the pre-trained models. For the
embeddings extracted using our method, we use a dimension of 96 after LDA, while for
x-vectors we use 150 as the reduced dimension. Consistent with general practice (Snyder,
Garcia-Romero, Povey, & Khudanpur, 2017), equal error rate (EER) was used as the metric
for evaluation.
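For reference, the EER can be computed from the trial scores and their target/non-target labels as sketched below; this is a generic implementation rather than the exact scoring script used in our experiments.

```python
# Sketch of equal error rate (EER) computation from verification scores.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for target (same-speaker) trials, 0 for non-target trials."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER is the operating point where false acceptance and false rejection rates meet.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Hypothetical usage:
# eer = compute_eer(plda_scores, trial_labels)
# print(f"EER = {100 * eer:.2f}%")
```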
Following (Richey et al., 2018) and (Nandwana et al., 2018), we used knowledge of the
nuisance factor annotations available in the V19-eval dataset to study the various factors affecting
the performance of speaker verification. For these experiments, we consider two distinct
nuisance factors: noise condition (none, babble and television) and microphone location (far-mic,
near-mic and obstructed-mic, i.e., a microphone hidden in the ceiling). We further distinguish
between the recordings collected at 2 different microphone locations (far-mic vs. near-mic)
while examining the performance in noisy conditions.
Figure 4.5.1: Confusion matrices for emotion recognition using speaker embeddings on
IEMOCAP for different embedding transformation techniques: (a) x-vector, (b) x-vector + PCA,
(c) x-vector + NLDR, (d) Unsup. Dis. Robust speaker embeddings are expected to perform
poorly on the emotion recognition task. 0: Anger, 1: Sadness, 2: Happiness, 3: Neutral.
Unsupervised disentanglement provides the best disentanglement, as shown by the poor emotion
recognition performance. Differences are particularly noticeable for the Happiness (2) class.
The experimental setup for the controlled conditions is shown in Fig. 3.3.1. As mentioned
in Section 3.3, the circles represent microphones located at various distances from the main
loudspeaker. In all experiments, the enrolment utterances were collected from source data
used for playback from the loudspeaker, consistent with (Nandwana et al., 2019). We choose
a different set of test utterances depending on the experiment being performed. For example,
to evaluate performance in the noisy (near-mic) scenario, we use the utterances that were
Table 4.3: Speaker verification performance (% EER; lower is better) in the presence of different nuisance factor conditions (V19-eval). The speaker embedding e_1 from the proposed method (M1 lda-96) is compared against the x-vector baseline. Values in bold denote the best performance for each condition.

Condition | | x-vector | e_1
noise (near-mic) | none | 3.34 | 3.99
noise (near-mic) | babble | 5.41 | 4.86
noise (near-mic) | television | 3.28 | 4.15
noise (far-mic) | none | 7.43 | 6.26
noise (far-mic) | babble | 21.93 | 19.79
noise (far-mic) | television | 10.80 | 9.05
mic placement | near-mic | 4.17 | 4.41
mic placement | far-mic | 14.97 | 12.79
mic placement | obstructed-mic | 6.34 | 5.67
overall | | 10.30 | 9.07
recorded from mics 1 and 2 as shown in Fig. 3.3.1 as the test utterances. Similarly for the
noisy (far-mic) scenario, test utterances are pooled from mics 5 and 6.
Results
As shown in Table 4.3, from the analysis of the effect of noise, although the baseline provides
better performance in the near-mic scenario for the no-noise and television noise conditions,
the verification performance using the proposed embedding (denoted by e_1 in Table 4.3)
provides improvement over the baseline when the test utterances were recorded at distant
microphones, for all the noise types. As previously observed in (Nandwana et al., 2018),
babble noise seems to be the most challenging of all the noise types in terms of verification
performance due to its speech-like characteristics. In this particularly harsh condition, the
proposed embedding outperforms the baseline in both the near-mic and far-mic scenarios.
Interestingly, our method shows the highest absolute improvement (∼ 2.2% in EER) in the
most challenging condition, i.e., far-mic recording in the presence of babble noise.
In experiments on the effect of microphone placement, the results show that our method
performs comparably to the baseline in the near-mic scenario and outperforms the baseline in
the more challenging far-mic and obstructed-mic scenarios. The last row in Table 4.3 shows
the overall speaker verification performance using test utterances from all the microphones
under all noise conditions. In this experiment we see a relative 10% EER improvement by
our system over the baseline.
In addition to the % EER, in Figure 4.5.2 we also show the detection error trade-off (DET)
curves that give a more detailed picture of how the performance changes at different operating
points. DET plots show the FAR on the x-axis and the FRR on the y-axis after a standard
normal mapping. Each subplot compares four speaker embeddings: x-vector reduced to
150 dimensions using LDA (xvec lda-150), M1 (model trained without data augmentation)
reduced to 96 dimensions (M1 lda-96), M2 (model trained with data augmentation) reduced
to 96 dimensions (M2 lda-96), and M2 without LDA-based dimensionality reduction (M2
w/o lda).
The first row of subplots in Figure 4.5.2 shows the DET curves across different noise
conditions. We observe that in the majority of cases, disentangled speaker embeddings obtain
lower error rates at most operating points. In particular, for the television and babble
noise conditions, which are considered challenging due to their speech-like characteristics
(Nandwana et al., 2018), we observed a respective absolute reduction of 2.2% and 1% in the
equal error rate (EER). Furthermore, we observe that for the distant microphone scenario,
speaker embeddings after disentanglement provide lower error rates at almost all operating
points, with a 2.6% reduction in EER.
4.5.3 Clustering analysis of embeddings
Setup
In order to further probe the information contained in the latent representations, we analyze
the clustering performance of the embeddings. We expect e_1 to perform best when clustering
speakers and e_2 to cluster well with respect to the nuisance factors. We use the normalized
mutual information (NMI) between the embeddings and the ground truth clusters as a
Figure 4.5.2: DET curves (the y-axis shows the % false rejection rate and the x-axis shows the
% false acceptance rate) for the speaker verification task using different speaker embeddings with
and without disentanglement in several (A) noise conditions (babble, none, music, television) and
(B) microphone placements (distant, close, obstructed, clear). In almost all scenarios, the model
trained using unsupervised disentanglement, denoted by M2 lda-96, performs the best.
Table 4.4: Normalized mutual information (%) between clusters of embeddings and true cluster labels. k represents the number of clusters (V19-dev). The speaker embedding e1 and nuisance embedding e2 from the proposed method (M1) are compared against the x-vector baseline. Speaker embeddings are expected to capture speaker information, while the nuisance embedding should capture speaker-unrelated information.

                           e1      e2      x-vector
speaker (k = 200)        92.20   65.10     87.90
noise (k = 4)             0.10    0.70      0.10
mic placement (k = 12)    0.10    2.00      1.00
proxy for the speaker/nuisance related information contained in each of the embeddings.
The ground truth clusters here are obtained from the annotations available in the V19-dev
dataset. Clustering is performed using k-means (mini batch k-means implementation in
(Pedregosa et al., 2011)) with the known number of clusters.
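For reference, the NMI-based clustering analysis can be reproduced with a scikit-learn sketch along the following lines; the variable names (embeddings, labels, n_clusters) are illustrative placeholders rather than the dissertation's codebase.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(embeddings: np.ndarray, labels: np.ndarray, n_clusters: int) -> float:
    # Cluster the embeddings with the known number of clusters (mini-batch
    # k-means) and score the assignment against the ground-truth labels.
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
    assignments = km.fit_predict(embeddings)
    return 100.0 * normalized_mutual_info_score(labels, assignments)

# Example: NMI (%) of speaker clusters for the e1 embeddings with k = 200.
# nmi_speaker = clustering_nmi(e1_embeddings, speaker_labels, n_clusters=200)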
Results
Table 4.4 reports results comparing the performance of both our embeddings and the base-
line in clustering speakers and nuisance factors (noise type and microphone location). We
conducted permutation tests (Raschka, 2018) between the clustering results of the different experiments to test for statistical significance, rejecting the null hypothesis that the results come from the same distribution if the p-value < α, where α = 0.025 to account for multiple comparison testing.
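A generic two-sample permutation test of this kind can be sketched as follows; this is a textbook recipe under the stated α, not necessarily the exact procedure of (Raschka, 2018).

import numpy as np

def permutation_test(scores_a, scores_b, n_perm: int = 10000, seed: int = 0) -> float:
    # Two-sided permutation test on the difference of means between two sets
    # of scores; returns a p-value to compare against alpha = 0.025.
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[: len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)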
Clustering by speaker, we see that e1 performs significantly better than x-vectors (by an absolute 4.3%, as shown in Table 4.4). This suggests that our method is able to extract more speaker-discriminative information from x-vectors. Furthermore, as expected, e2 shows relatively poor performance in clustering speakers.

Clustering by nuisance factors, as expected, e2 is the most predictive. Also, e1 does not cluster well according to the nuisance factors. Consistent with our findings in Section 4.5.2, x-vectors have significantly higher NMI scores than e1 when clustering by microphone location (Row 3, Table 4.4). This suggests that the proposed embedding captures speaker information that is invariant to microphone placement better than x-vectors do. We found significant differences in the reported NMI scores, suggesting that our method is able to disentangle the two different streams of information, speaker-related and nuisance-related.
4.5.4 Speaker diarization using oracle speech segment boundaries
We further extend the analysis by examining the effectiveness of our proposed speaker embedding in the speaker diarization task (Garcia-Romero, Snyder, Sell, Povey, & McCree, 2017; Sell et al., 2018). Since the goal of this work is to investigate the speaker-discriminative nature of embeddings, we consider only speaker clustering in the diarization task and assume prior knowledge of speaker-homogeneous segments and the number of speakers, as was done in past studies (Garcia-Romero et al., 2017; Sun et al., 2019). The proposed speaker diarization system (denoted by e1) is based on embeddings extracted from speaker-homogeneous segments followed by k-means clustering as the backend. We compare our system with two competitive baselines that use x-vectors from a pre-trained model as input features. One base-
Table 4.5: Diarization performance with oracle speech segment boundaries and known number of speakers (AMI dataset). The two baselines use x-vectors with k-means and PLDA+AHC backends respectively, while the proposed embedding uses k-means clustering.

System          Baseline 1   Baseline 2   e1
Avg. DER (%) ↓    11.91        11.51      7.28
line (denoted by Baseline 1) uses k-means clustering on the extracted x-vectors. The other baseline (Baseline 2) uses PLDA scoring and agglomerative hierarchical clustering (AHC), and has been shown to be one of the best performing diarization systems using x-vectors (Garcia-Romero et al., 2017).
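A minimal sketch of the proposed diarization backend (embeddings from oracle speaker-homogeneous segments, clustered with k-means using the known number of speakers) is given below; the names are illustrative and the actual system details may differ.

import numpy as np
from sklearn.cluster import KMeans

def diarize_oracle_segments(segment_embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    # One embedding per speaker-homogeneous segment; k-means assigns each
    # segment to one of the known number of speaker clusters.
    km = KMeans(n_clusters=n_speakers, random_state=0)
    return km.fit_predict(segment_embeddings)

# The segment-level cluster labels are then scored against the reference
# annotation (e.g., with a standard DER scoring tool).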
Diarization error rate (DER) (Fiscus, Ajot, Michel, & Garofolo, 2006) averaged across all sessions in the AMI dataset is shown in Table 4.5. We see that our proposed system outperforms the Baseline 1 and Baseline 2 systems by a relative 38% and 36% in DER, respectively. This suggests that the proposed speaker embeddings contain more speaker-discriminative information than x-vector embeddings and hence are better suited for speaker clustering across datasets. Further, a simple k-means clustering backend on the proposed embeddings is able to outperform a more sophisticated method using PLDA and AHC.
4.6 Discussion
First, we showed through experimental results that unsupervised disentanglement can be
effectively used to reduce nuisance information from speaker embeddings. In particular, we
showed that, even without labelled factors of variability, the unsupervised disentanglement
technique can reduce information pertaining to factors such as noise type, microphone type,
microphone location, etc. In addition, it also reduces paralinguistic information such as emotion and sentiment, as well as information about lexical content. We further demonstrated through extensive speaker verification experiments that such speaker embeddings are
more robust to channel conditions, and outperform x-vectors in challenging conditions such
as in the presence of babble noise and in far-field recording conditions. Finally, we showed
the applicability of these speaker embeddings in improving the performance of speaker di-
arization on the popular AMI dataset.
So far, we have considered the performance of speaker recognition in the presence of several factors of variability. However, all the factors considered were irrelevant to the speaker recognition task, and removing them from speaker embeddings is sufficient to improve robustness. Another important source of variability comprises inherent speaker factors such as demographic attributes, including age, gender and language. We want our speaker recognition system to work well irrespective of ‘what’ a person says, ‘how’ they say it, and ‘where’ they say it. In addition, we want it to perform well irrespective of ‘who’ says it, that is, the demographic attributes of the person. This brings us to the next topic of discussion, which will be presented in the next chapter.
Chapter 5
Adversarial and multi-task techniques
for bias mitigation in speaker
recognition
5.1 Introduction
Consider a home security system that authenticates the homeowner based on their voice: what if it works reliably only for individuals from certain demographic groups? “What is the practical applicability of such a system?”, “Is this system fair?”, “How do we identify biases in this system?”, and “How might we mitigate these biases?”. Such questions are
being addressed at a rapid pace in technology domains such as computer vision and natural
language understanding that primarily rely on machine learning (ML) algorithms — leading
to the emergence of ML fairness as a field of study in its own right (Barocas et al., 2019). In
the context of speech technologies, ML fairness studies are mostly limited to applications such as speech-to-text conversion (Koenecke et al., 2020). Very few studies have considered ML fairness for speaker recognition, which is a key component in applications such as personalized speech technologies and voice-based biometric authentication.

(The work presented in this part was submitted for review to Elsevier Computer Speech and Language: Peri, Raghuveer, Krishna Somandepalli, and Shrikanth Narayanan. “To train or not to train adversarially: A study of bias mitigation strategies for speaker recognition.” arXiv preprint arXiv:2203.09122 (2022).)
Speaker recognition is the task of identifying a person based on their voice. ASV, which is a specific application of speaker recognition, refers to the task of authenticating users based on their voice characteristics. It has found widespread adoption in smart home appliances (e.g., Alexa) (Lisa Eadicicco, n.d.), voice authentication in airports (Cornacchia, Papa, & Sapio, 2020) and, as a biometric system, in customer service centers and banks (Derek du Preez, n.d.; James Griffiths, n.d.). With the proliferation of speech technologies in everyday lives, biases present in these systems can have dire social consequences. There is an imminent need to identify, understand and mitigate biases in these systems. Our goal in this work is to systematically evaluate biases in speaker recognition, and study bias mitigation strategies. Specifically, we examine whether adversarial or multi-task learning techniques help mitigate these biases, and improve the fairness of speaker recognition systems. We present experimental findings that demonstrate the conditions under which fairness can be improved, and have made the related code and model information publicly available for the benefit of the research community (code and information about pre-trained models can be found at https://github.com/rperi/trustworthy-asv-fairness).
Generally, ASV applications have different expectations of performance depending on the end use-case. For example, security applications typically impose strict restrictions on the proportion of impostors (persons attempting to maliciously gain access to a biometric system by claiming to be a different person) they erroneously admit. On the other hand, smart home applications may value user convenience, i.e., they value seamless verification of genuine users at the expense of tolerating more impostor cases. Thus, it is crucial to consider biases that arise in different applications where ASV systems are deployed. While tremendous strides have been made in evaluating and improving fairness in applied ML fields (Barocas et al.,
2019), techniques for bias mitigation in ASV systems are limited (Fenu, Medda, et al., 2020;
Shen et al., 2022). Notably, most current bias evaluation frameworks for ASV are restricted
to specific operating points of the systems (Fenu, Lafhouli, & Marras, 2020; Toussaint &
Ding, 2021), limiting their applicability to specific use-cases. In particular, differences in
the EER of an ASV system between different demographic population groups is commonly
used as a proxy for the biases present in that system (Fenu, Lafhouli, & Marras, 2020;
Shen et al., 2022). EER is the error of the system where the rate of accepting impostors
is equal to the rate of rejecting genuine users. It refers to a specific operating point of the
ASV system, and fairness evaluations using differences in EER may not generalize to other
use-cases. More general conclusions about the fairness of the systems (applicable to distinct
ASV use-cases) require a thorough evaluation of biases at several system operating points. In
addition, utility of ASV systems, which can be understood as the overall performance (not
considering any specific demographic group) is an important consideration in evaluating
the practical applicability of bias mitigation strategies. An ideal bias mitigation strategy is
expected to reduce the differences in performance between the different demographic groups,
with minimal degradation of the utility of the system.
Bias mitigation strategies in ML systems often involve models trained using data balanc-
ing (Serna, Peña, Morales, & Fierrez, 2021; Y. Zhang & Sang, 2020; J. Zhao, Wang, Yatskar,
Ordonez, & Chang, 2018). In the context of fairness in ASV systems, data balancing meth-
ods were studied with respect to age and gender (Fenu, Medda, et al., 2020). However, it
is not evident if such techniques are the most suitable to induce fairness. For example, in
other ML fields such as computer vision, studies have shown that data balancing may not
be sufficient to mitigate biases (e.g, (T. Wang, Zhao, Yatskar, Chang, & Ordonez, 2019)).
Another class of techniques to improve fairness tackle biases in the modeling stage by incor-
porating fairness constraints during training (Zafar, Valera, Rogriguez, & Gummadi, 2017;
Zemel, Wu, Swersky, Pitassi, & Dwork, 2013). When demographic information (e.g., gender,
language background, age etc.) is available, it can be used in an adversarial training (AT)
setup to learn speaker embeddings (compact speech representations that capture informa-
tion about speaker’s identity) that are fair with respect to the demographic attributes (Li,
Cui, Wu, Gu, & Harada, 2021). Adversarial methods for ASV systems typically train en-
coders to learn speaker-discriminative representations while stripping them of demographic
information (Noé, Mohammadamini, Matrouf, Parcollet, & Bonastre, 2020). However, these
demographic factors are considered as components of a person’s identity (Hassan, Izquierdo,
& Piatrik, 2021), and can help improve the ASV system performance (Luu, Bell, & Renals,
2020b). For example, it is typically easier to reject an impostor verification claim, when the
impostor’s gender is different from that of the target speaker. Removing gender-related in-
formation from speaker embeddings using adversarial techniques can lead to degraded ASV
performance, reducing the utility of such systems. It may be beneficial to develop ASV sys-
tems that perform equally well for people belonging to different demographic groups, despite
the speaker embeddings retaining information of the demographic attributes.
For this purpose, a multi-task learning (MTL) strategy can be employed to simultaneously
predict factors related to speaker identity such as gender, age or language. Techniques
leveraging demographic information have been shown to achieve improved ASV performance,
where models are trained to predict speaker labels along with age and nationality (Luu et
al., 2020b). In a recent work on fairness in ASV, (Shen et al., 2022) developed a system to
fuse scores from separately trained gender-specific models. However, this method does not
explicitly infuse demographic information into the speaker embeddings, but merely combines the separate gender-specific scores. To the best of our knowledge, there exists no current work
on multi-task training techniques that leverage demographic information to train speaker
embeddings with the goal of improving fairness.
In the previous chapter (Chapter 4), we showed that adapting unsupervised adversarial
invariance (UAI) (Jaiswal, Wu, Abd-Almageed, & Natarajan, 2018b) — an adversarial
method proposed for computer vision tasks — for speaker representation learning makes
the ASV system robust to adverse acoustic conditions. In the current chapter, we discuss
an extension of this framework by proposing adversarial training (UAI-AT) and multi-
task learning (UAI-MTL) for bias mitigation in ASV systems that use pre-trained neural
speaker embeddings. Our experiments not only evaluate the fairness of the proposed UAI-
AT and UAI-MTL methods but also compare them with data balancing, UAI, AT and MTL
methods as baselines. In addition to extensive fairness analyses, we jointly examine the
overall performance of the ASV system. This is referred to as fairness-utility trade-off.
Our specific contributions and findings are summarized below:
Summary of Contributions:
• We systematically evaluate biases present in ASV systems at multiple operating points.
Current work in this field is mostly focused on a single operating point, which can lead
to an incomplete evaluation of fairness. To this end, we adopt the fairness discrepancy
rate metric (de Freitas Pereira & Marcel, 2021) to measure the fairness of ASV systems
under different system operating conditions.
• We propose novel adversarial and multi-task training methods to improve the fairness
of ASV systems. Adversarial and multi-task techniques for bias mitigation in ASV sys-
tems are limited or non-existent. We compare the proposed methods against baselines
that rely on data balancing, using quantitative and qualitative evaluations.
• In addition to fairness evaluations, we also consider the utility, which is the overall
system performance, using standard performance metrics such as EER. Joint consider-
ations of fairness and utility can help inform the choice of bias mitigation techniques.
Summary of findings:
• We show that the fairness of baseline ASV systems (trained using data balancing) with
respect to gender varies with the operating point of interest. Our experiments show
increased bias of the baseline methods as the system operation moves to regions with
fewer instances of incorrectly rejecting genuine users. We demonstrate that, compared
with the baseline systems, the fairness of the proposed adversarial and multi-task
methods have minimal dependence on the operating point.
• We demonstrate using qualitative visualizations and quantitative metrics that the pro-
posed techniques are able to mitigate biases to a large extent compared to the baseline
systems based on data balancing. We further show that this observation holds true
across a range of different operating conditions.
• We observe that the adversarial technique improves fairness but suffers from reduced
utility. In contrast, the multi-task technique is able to improve fairness while retaining
the overall system utility. These findings can inform choosing appropriate bias mitiga-
tion strategies, while carefully considering the target use-case of the speaker recognition
systems.
The rest of the chapter is organized as follows. In Section 5.2, we provide background on existing work related to fairness in ASV. Section 5.3 details the methodology used to
induce fairness in ASV systems, followed by a description of the metrics we use to evaluate
the fairness of ASV systems in Section 5.4. We provide a brief description of the datasets
used to build and evaluate our models in Section 5.5. Section 5.6 outlines the baselines and
experiments designed to investigate the biases in the developed ASV systems (including ab-
lation studies). This is followed by the corresponding results and discussions in Section 5.7.
Finally, we provide concluding remarks for this chapter in Section 5.8, where we summarize
our findings.
A note on fairness terminology as it relates to this work
Extensive research has been done on fairness in diverse domains such as law, social science,
computer science, philosophy etc. There exist equally diverse definitions of fairness, each
given within the context of that particular domain (Mulligan, Kroll, Kohli, & Wong, 2019).
With the recent proliferation of ML algorithms in socio-technical systems, fairness in ML has
garnered immense interest. The general idea of the absence of any prejudice or favoritism
toward an individual or a group based on their inherent or acquired characteristics has been
termed fairness in the ML literature (Mehrabi, Morstatter, Saxena, Lerman, & Galstyan,
2019), and the field of study dealing with these issues is referred to as Fair-ML (Barocas et
al., 2019). Most work evaluating and tackling fairness in ML systems can be broadly cate-
gorized into two: group fairness and individual fairness (Binns, 2020). Group fairness deals
with ensuring that the outcomes (correct predictions or errors) of a classifier are equally dis-
tributed between different demographic groups (Chouldechova, 2017). On the other hand,
individual fairness requires that two individuals who are similar to each other be treated
similarly (Dwork, Hardt, Pitassi, Reingold, & Zemel, 2012). In the context of this work, we
focus on group fairness, presenting preliminary investigations towards individual fairness in
Chapter B.
5.2 Background: Fairness in ASV
In this section, we present studies that examine fairness in ASV systems, vis-a-vis other ML
application domains. In particular, we discuss related-work of two key aspects relevant to
this chapter: (1) Evaluation of biases in ASV systems (2) Bias mitigation strategies, which
include adversarial and multi-task methodologies.
Issues of fairness and biases have been extensively studied in several domains involving
ML. A few prominent examples include facial analysis (Beveridge et al., 2009; Buolamwini &
Gebru, 2018; Grother et al., 2019; A. Howard & Borenstein, 2018; J. J. Howard, Sirotin, &
Vemury, 2019; Klare et al., 2012; Robinson et al., 2020; Ryu et al., 2017), natural language
understanding (Bolukbasi, Chang, Zou, Saligrama, & Kalai, 2016; Díaz, Johnson, Lazar,
Piper, & Gergle, 2018; Dixon, Li, Sorensen, Thain, & Vasserman, 2018; Park, Shin, & Fung,
2018; Sap, Card, Gabriel, Choi, & Smith, 2019), affect recognition (Gorrostieta, Lotfian,
Taylor, Brutti, & Kane, 2019; Stoychev & Gunes, 2022; Xu, White, Kalkan, & Gunes, 2020),
criminal justice (Chouldechova, 2017; Green, 2018; Mishler, Kennedy, & Chouldechova, 2021;
Wadsworth, Vera, & Piech, 2018), and health care (Hamon et al., 2021; Suriyakumar, Pa-
pernot, Goldenberg, & Ghassemi, 2021).
In the context of speech technology research, the intersection of speech ML and fairness
is mostly limited to automatic speech recognition (ASR). For example, racial disparities were
found in the performance of several state-of-the-art commercial ASR systems (Koenecke et
al., 2020). Biases with respect to gender, age etc., in a Dutch ASR system were analyzed
(Feng, Kudina, Halpern, & Scharenborg, 2021). Evaluations of ASR systems using criterion
commonly used in Fair-ML research have been explored extensively (Garnerin, Rossato, &
Besacier, 2021; C. Liu et al., 2021; Z. Liu, Veliche, & Peng, 2021; Sarı, Hasegawa-Johnson, &
Yoo, 2021). However, a systematic evaluation of fairness in ASV systems is scarce in current
literature.
A recent work explored racial and gender disparities in several speaker classification
systems (Chen, Li, Setlur, & Xu, 2022). However, their analysis was restricted to a closed-
set classification task, which is different from the more challenging ASV setup we consider
in this work. The field of biometrics is perhaps the most relevant in the context of speaker
verification from the perspective of an end application. As noted in (Drozdowski et al., 2020),
a majority of bias detection and mitigation works in biometrics focus on face recognition
(Beveridge et al., 2009; Grother et al., 2019; Klare et al., 2012; Ryu et al., 2017), and some
in fingerprint matching (Galbally et al., 2018; Preciozzi et al., 2020). Fairness in voice-based
biometrics remains an under-explored field with only a handful of works (Fenu et al.,
2021; Fenu, Medda, et al., 2020; Shen et al., 2022; Toussaint & Ding, 2021).
5.2.1 Evaluating biases in ASV systems
As discussed in Section 1.1.1, ASV systems can be tailored for different applications depend-
ing on the threshold used to accept/reject verification pairs. Studies that evaluate biases in
ASV systems need to ensure that conclusions drawn from them are not limited to specific
62
applications. Evaluations of biases that only focus on specific operating points of an ASV
system can lead to incomplete conclusions about their fairness. Some studies in the past have documented differences in the performance of ASV systems owing to demographic attributes
such as gender and age (Stoll, 2011), where ASV systems that use statistical speaker models
were analyzed. However, the analysis was restricted to verification scores alone, while the
biases in final system decision were not considered. Furthermore, it is unclear if the findings
translate to contemporary speaker-embedding based methods. A recent work showed significant differences in the speaker verification scores between certain demographic groups, but it
is not evident how such differences affect an ASV system in practice (Si et al., 2021). Owing
to the recent challenges organized by NIST (Sadjadi et al., 2019, 2017), there has been an
increased focus on improving ASV performance across languages (Lee et al., 2019; Torres-
Carrasquillo et al., 2017). Significant improvements were obtained on the evaluation data
provided in the challenge. Multi-lingual analyses were added in the latest VoxSRC challenge
(Brown, Huh, Chung, Nagrani, & Zisserman, 2022). However, performance evaluations in
these studies used only a specific operating point characterized by the EER.
In a more recent work, fairness of ASV systems with respect to age and gender as the
demographic factors has been explored (Fenu, Lafhouli, & Marras, 2020). However, the
evaluation of fairness was again limited to the disparity in EER between the demographic groups. An evaluation framework for probing the fairness of ASV systems has been recently proposed (Toussaint & Ding, 2021). Their evaluations focused on the minimum detection cost function,
which again considers one particular operating point characterized by a threshold on the
speaker verification scores. They provide visualizations of the verification scores and the
detection error trade-off (DET) curves, which offer a qualitative way of analyzing the biases
at several operating points. It might be of interest to quantify fairness at different operating
points to systematically understand how the system behaves. Such an analysis can be critical
to understanding the behavior of the ASV systems in different types of applications, e.g., high security applications requiring low FAR (fewer incorrectly accepted impostor pairs) and
high-convenience applications requiring low FRR (fewer incorrectly rejected genuine pairs).
Fairness evaluations were performed using several different definitions of fairness at mul-
tiple operating points with a focus on data balancing strategies to mitigate biases (Fenu et
al., 2021). In this work, we adopt the fairness discrepancy rate (FaDR) metric that has been proposed recently in the biometrics literature (de Freitas Pereira & Marcel, 2021) to evaluate
fairness. FaDR computed at different system operating points can be used to systematically
analyze the fairness of the proposed methods. A detailed explanation of the FaDR metric
can be found in Section 5.4.
Fairness-Utility trade-off: Studies have shown that humans demonstrate a superior
speaker identification performance in their native language compared to non-native lan-
guages (Perrachione & Wong, 2007). Similarly, ASV systems can be biased to perform
better for certain demographic populations (Fenu, Lafhouli, & Marras, 2020). Although
reliance on demographic information can potentially improve ASV performance, it can lead
to unfair systems that discriminate against certain demographic groups. Thus, there is a
trade-off between fairness and utility. Such trade-offs have been studied extensively in gen-
eral Fair-ML literature (Calders, Kamiran, & Pechenizkiy, 2009; Du et al., 2020; Haas, 2020; H. Zhao & Gordon, 2019). For example, it has been shown that adversarial training methods
to improve fairness reduce the utility of such systems (H. Zhao & Gordon, 2019). However,
empirical studies demonstrating such trade-offs between fairness and utility in ASV systems
are limited (Fenu et al., 2021). In this work, we study how the proposed techniques perform
in improving fairness, while also evaluating their utility using standard metrics.
5.2.2 Mitigating biases in ASV systems
Prior work has demonstrated differences in the performance of ASV systems across gender
groups (Reynolds, Quatieri, & Dunn, 2000). This led to the development of gender-specific
models that were used in combination with a gender classification module in ASV (Bimbot
et al., 2004; Kanervisto, Vestman, Sahidullah, Hautamäki, & Kinnunen, 2017). This was the
case even in the popular i-vector based ASV models (Dehak et al., 2011). However, such a
methodology of training separate models for each demographic group needs the demographic group label (either self-reported by the speaker or predicted by a model) at the time of infer-
ence. This information may not always be available, or possible to infer. In addition, such
methods can further perpetuate biases and undermine certain privacy criteria by requiring
the systems to infer demographic attributes. Therefore, for practical purposes, it is desirable
to develop unified, demographic-agnostic ASV models. Most of the recent deep learning
based approaches train a unified model agnostic to the demographic groups, while trying to
ensure substantial representation from each group in the training data (Chung et al., 2018).
However, such systems can still be prone to issues of biases because they are not explicitly
trained to induce fairness.
In fair-ML literature, algorithms to improve fairness or mitigate biases fall into one of the
three categories: pre-processing, post-processing and in-processing (Mehrabi et al., 2019). A
common pre-processing method todevelop fair models is by training them using data that is
balanced with respect to the various potential sources of bias (Y. Zhang & Sang, 2020). This
approach has been explored in ASV systems, where data from individuals that is balanced
with respect to genders, languages, and ages is used to train models to improve fairness
(Fenu, Medda, et al., 2020). Post-processing techniques are used when only access to a
pre-trained model is available, and when it is impractical to train models using our own
data (Bellamy et al., 2018). However, such techniques are commonly employed in closed-set
classification tasks, and it is not straightforward to generalize them to a verification setup
like ASV. In-processing techniques involve explicitly inducing fairness into the model dur-
ing training by introducing fairness constraints (Berk et al., 2017). A common method is
adversarial techniques that use demographic information during training to learn de-biased
representations (B. H. Zhang, Lemoine, & Mitchell, 2018). When demographic labels are
available, they can also be used in a multi-task fashion. In such methods, the demographic
labels are used to reduce the performance disparity between groups (Xu et al., 2020).
Adversarial Training and Multi-task Learning: In the adversarial training method-
ology, labels are used to learn representations devoid of the demographic information. Since
the representations contain very little demographic information, the systems are likely to
use information discriminative of the primary prediction task, and rely less on demographic
attributes, making the systems fair with respect to those attributes (B. H. Zhang et al.,
2018). Adversarial training methods have shown promise in developing unbiased ML-based
classification systems (Edwards & Storkey, 2015; B. H. Zhang et al., 2018). Some studies
in face recognition (e.g., (Morales, Fierrez, Vera-Rodriguez, & Tolosana, 2020)) have shown
the efficacy of adversarial techniques to improve fairness. However, it is not evident if those
benefits translate to a verification setup (ASV systems in particular). Language invariant
speaker representations for ASV were developed using adversarial methods (Bhattacharya,
Alam, & Kenny, 2019). However, evaluations were limited to a single operating point char-
acterized by EER. Adversarial training techniques were employed to develop speaker identity representations that are devoid of certain sensitive attributes (Noé et al., 2020). However,
their goal was privacy protection and not to improve fairness. To the best of our knowledge,
such techniques have not been used for developing fair ASV systems, and we explore that
direction in this work.
Demographic labels could also be used in a multi-task methodology with the label prediction as a secondary task to learn demographic-aware representations. This can be particularly useful in tasks dealing with detecting identity such as face and speaker recognition, where the demographic attributes form part of the person's identity. In such cases, instead of stripping the representations of demographic factors, one can train models to ensure that the performance of the systems is similar across demographic groups. Attribute information was provided to a facial expression recognition model in an attribute-aware fashion to improve its fairness (Xu et al., 2020). However, these observations were made on a classification task,
which is different from our verification setup. In biometric settings (which is closer to our
target ASV task), multi-task training methods can be used to add demographic information
to the general-purpose representations. It was shown that demographic attribute informa-
tion can be used in a multi-task setup to improve utility of ASV systems (Luu et al., 2020b).
However, fairness of such systems trained using the multi-task training setup is not studied.
In a more recent work, it was shown that gender-specific adaptation of encoders to extract
separate gender-specific representations can improve the fairness of ASV systems (Shen et
al., 2022). This can be treated as a demographic-aware method, and they show that their
method can improve the fairness while also improving the overall utility. However, fairness
evaluations were limited to differences in EER between the genders. We intend to investigate
if using a multi-task setup to train demographic-aware speaker representations can improve
the ASV utility in addition to reducing the differences in performance between the different
demographic groups.
5.3 Methods
We develop methods to transform existing speaker embeddings to another representation
space with the goal of minimizing biases with respect to demographic groups in ASV sys-
tems. This is achieved by training models using demographic labels in addition to the
speaker identity labels. We explore adversarial and multi-task techniques to train the em-
bedding models to improve the fairness of pre-trained speaker embeddings. We employ
the unsupervised adversarial invariance (UAI) framework, which was originally proposed in
(Jaiswal et al., 2018b). As discussed in Chapter 4, we adapted this approach to disentangle
speaker factors from nuisance factors unrelated to the speaker's identity present in x-vectors (Peri, Li, Somandepalli, Jati, & Narayanan, 2020). A characteristic feature of this technique is that it disentangles the speaker identity from nuisance factors, which are all the factors unrelated to the speaker's identity (Peri, Li, et al., 2020), such as acoustic noise, reverberation
Figure 5.3.1: (Best viewed in color) Block diagram of the method showing the encoder, predictor, decoder and disentangler modules (with randomized perturbation), similar to Figure 4.3.1. The discriminator module (yellow bounding box) is tasked with predicting the demographic factor (e.g., gender) from e1, and can be trained in an adversarial setup (UAI-AT) or in a multi-task (UAI-MTL) setup with the predictor.
etc. However, as noted in (Jaiswal et al., 2019), the UAI technique cannot be directly used
to induce invariance to demographic factors in an unsupervised fashion, and demographic
labels are needed to induce invariance using adversarial techniques. Therefore, we propose
to use the adversarial extension of the UAI technique developed in (Jaiswal et al., 2019). In
addition, we also propose a novel multi-task extension to the UAI framework. Figure 5.3.1
shows the schematic diagram of the proposed method. The techniques including UAI and
its adversarial and multi-task extensions are explained in detail below.
5.3.1 Adversarial and multi-task extensions of UAI: UAI-AT and
UAI-MTL
UAI by itself cannot provide invariance to demographic factors. Therefore (Jaiswal et al.,
2019) extended the UAI framework to include demographic labels during training. In par-
ticular, they introduced a discriminator that is used to predict demographic labels. In the
formulation proposed in (Jaiswal et al., 2019), this discriminator (shown in yellow bounding
box in Figure 5.3.1) is trained in an adversarial fashion along with the disentangler of the
UAI.
min_{Θ_prim} max_{Φ_sec}  L_prim + γ L_sec + δ L_bias(b, b̂)

UAI-AT:  Θ_prim = Θ_e ∪ Θ_d ∪ Θ_p,        Φ_sec = Φ_dis1 ∪ Φ_dis2 ∪ Θ_b
UAI-MTL: Θ_prim = Θ_e ∪ Θ_d ∪ Θ_p ∪ Θ_b,  Φ_sec = Φ_dis1 ∪ Φ_dis2          (5.3.1)
Equation 5.3.1 shows the training objective that includes the discriminator loss denoted by L_bias, which is modeled as the categorical cross-entropy loss between the true and predicted demographic labels, denoted as b and b̂ respectively. We denote the method of adversarially training the discriminator along with UAI as UAI-AT throughout the rest of the chapter. The term corresponding to UAI-AT in Equation 5.3.1 shows how the discriminator (with its set of trainable parameters, denoted by Θ_b, being part of the secondary branch) is trained adversarially with the predictor. This ensures that the learned embeddings e1 do not retain demographic information, thereby achieving the desired invariance.

On the other hand, it is not evident if adversarial training to induce invariance to demographic factors is necessary to learn fair representations. Given the demographic labels, they can be used to train the discriminator in a multi-task (as opposed to adversarial) fashion. We call these demographic-aware speaker representations, and this method is denoted as UAI-MTL in the rest of the chapter. The term corresponding to UAI-MTL in Equation 5.3.1 shows how the discriminator parameters (being part of the primary branch) are trained in a multi-task fashion with the predictor. The objective is to learn a representation that captures speaker identity information while retaining the demographic attribute information. In both the UAI-AT and UAI-MTL methods, the parameter δ controls the contribution of the discriminator loss to the overall loss term.
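To make the two objectives concrete, the following PyTorch-style sketch shows how the overall loss of Equation 5.3.1 could be assembled; all names, loss weights, and the mean-squared-error reconstruction term are assumptions for illustration, not the dissertation's implementation.

import torch
import torch.nn.functional as F

def overall_loss(speaker_logits, speaker_labels,   # predictor output / targets
                 recon, x,                          # decoder output / input x-vector
                 disentangler_losses,               # terms making up L_sec
                 demo_logits, demo_labels,          # discriminator output / targets
                 gamma=1.0, delta=1.0, mode="UAI-MTL"):
    # L_prim: speaker classification plus reconstruction (assumed MSE here).
    l_prim = F.cross_entropy(speaker_logits, speaker_labels) + F.mse_loss(recon, x)
    l_sec = sum(disentangler_losses)
    if mode == "UAI-MTL":
        # Discriminator in the primary branch: minimize its cross-entropy jointly.
        l_bias = F.cross_entropy(demo_logits, demo_labels)
    else:  # "UAI-AT"
        # Discriminator in the secondary branch: the encoder update is driven by
        # randomly resampled labels, a practical stand-in for maximizing L_bias
        # (as described in Section 5.6.2).
        shuffled = demo_labels[torch.randperm(demo_labels.shape[0])]
        l_bias = F.cross_entropy(demo_logits, shuffled)
    return l_prim + gamma * l_sec + delta * l_bias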
5.4 Metrics
In this section, we provide details of the metrics that we use to evaluate the fairness and
utility of ASV systems. A brief description of each metric is also provided in Table 5.1 for
a quick reference.
5.4.1 Utility: Equal error rate (EER)
EER refers to a particular operating point of the system where the FAR equals FRR. This
metric is commonly used to evaluate the utility of ASV systems. Lower values of EER signify
better system utility. We chose EER over the minimum detection cost function (minDCF),
which is another commonly used evaluation metric, as minDCF requires specifying parame-
ters such as the relative costs of the detection errors and the target speaker prior probability, which imply a particular application (Van Leeuwen & Brümmer, 2007). We wanted to avoid
introducing additional variability arising due to the different parameters. Note that we only
use EER to measure utility and not to evaluate fairness.
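For completeness, EER can be computed from a set of verification scores and genuine/impostor labels with a generic sketch like the one below (1 = genuine pair, 0 = impostor pair); this is a standard recipe rather than the evaluation code used in this work.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    # Sweep thresholds, then pick the operating point where FAR and FRR meet.
    far, tpr, _ = roc_curve(labels, scores)
    frr = 1.0 - tpr
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)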
5.4.2 Fairness: Fairness discrepancy rate (FaDR)
There exist several metrics to measure and evaluate fairness of ML systems, some of which
are more suited to a particular application than others. (Garg, Villasenor, & Foggo, 2020)
have discussed several commonly used metrics proposed in the fairness literature. Metrics
such as equal opportunity and equalized odds have been extensively studied. Verma and
Rubin (Verma & Rubin, 2018) discuss how some metrics can deem an algorithm fair that other metrics have deemed unfair. Therefore, it is crucial to choose a metric that satisfies
the notion of fairness we aim to achieve. As discussed in Section 5.2, a reasonable goal of
fairness in ASV systems is to ensure that the performance differences between demographic
groups is small across a range of different operating points. Algorithms that are fair only at
certain operating points can result in a false sense of fairness, and can be detrimental when
used to design systems with real-world impact.
A straightforward way to analyze the fairness of biometric systems is to use the disparity
in EER between the demographic groups (termed as differential outcomes in (J. J. Howard
et al., 2019)) as an indication of the fairness. This method has been previously used in
evaluating fairness of ASV systems (Fenu, Medda, et al., 2020; Shen et al., 2022). However,
thisapproachassumesthateachdemographicgrouphasitsownthresholdontheverification
scores. This can lead to false notions of fairness, because in most real-world systems a single
threshold is used for verification irrespective of the demographic group (de Freitas Pereira &
Marcel, 2021). In order to overcome this limitation, Pereira and Marcel (de Freitas Pereira &
Marcel, 2021) propose a metric called fairness discrepancy rate (FaDR) to account for FARs
and FRRs in biometric systems. They propose to evaluate fairness at multiple thresholds
that can be chosen agnostic of the demographic groups. We employ this metric to evaluate
the fairness of our models.
FaDR(τ) = 1 − (ω A(τ) + (1 − ω) B(τ))
A(τ) = |FAR_g1(τ) − FAR_g2(τ)|,   B(τ) = |FRR_g1(τ) − FRR_g2(τ)|          (5.4.1)
Intuitively, FaDR computes the weighted combination of absolute differences in FARs and
FRRs between demographic groups. The threshold τ is applied on demographic-agnostic
verification trials to compute the demographic-agnostic FAR (corresponding to the zero-
effort score distribution used by Pereira and Marcel (de Freitas Pereira & Marcel, 2021)),
which characterizes a particular operating point of the system. The fairness of a system
can be measured at different values of the threshold τ corresponding to different operating
points. Assuming two demographic groups are of interest, at a given threshold τ, FaDR is defined in Equation 5.4.1, where FAR_g1(τ) and FRR_g1(τ) refer to the FAR and FRR,
Table 5.1: List of metrics used in this chapter with a brief description and their purpose (utility or fairness). The FaDR metric (values range from 0.0 to 1.0, higher is better) evaluates the fairness of the ASV system at a particular operating threshold (characterized by the demographic-agnostic FAR). Area under the FaDR-FAR curve summarizes the fairness at the various operating points. The error rates (ranging from 0.0 to 1.0, lower is better) are used to measure utility.

Metric                      Abbreviation   Brief description                                                Purpose
False Acceptance Rate       FAR            Rate of accepting impostor verification trials                   Utility
False Rejection Rate        FRR            Rate of rejecting genuine verification trials                    Utility
Demographic-agnostic FAR    -              FAR computed on demographic-agnostic verification trials         Utility
Equal Error Rate            EER            Error rate corresponding to the threshold where FAR equals FRR   Utility
Fairness Discrepancy Rate   FaDR           Weighted absolute discrepancy in FAR and FRR between
                                           demographic groups (Equation 5.4.1)                              Fairness
Area under FaDR-FAR curve   auFaDR-FAR     Area under the FaDR curve plotted at several thresholds          Fairness
when the threshold is applied on the similarity scores of verification pairs consisting only of speakers belonging to demographic group g1 (similarly for demographic group g2). This definition is a special case of FaDR for the case where only two demographic groups are present; a more general definition can be found in (de Freitas Pereira & Marcel, 2021). To contextualize it with the terminology used in (Grother et al., 2019), FaDR can be viewed as a weighted combination of FA and FR differentials, with the error discrepancy weight given by ω (0 ≤ ω ≤ 1).
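Written out as code, Equation 5.4.1 for two demographic groups at one demographic-agnostic threshold amounts to the following sketch; the per-group error rates are assumed to be precomputed.

def fadr(far_g1: float, far_g2: float, frr_g1: float, frr_g2: float,
         omega: float = 0.5) -> float:
    # Fairness discrepancy rate: 1.0 means no FAR/FRR discrepancy between groups.
    a = abs(far_g1 - far_g2)   # discrepancy in false acceptance rates
    b = abs(frr_g1 - frr_g2)   # discrepancy in false rejection rates
    return 1.0 - (omega * a + (1.0 - omega) * b)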
Note on error discrepancy weight (ω)
FaDR can be computed by weighing the discrepancy between the demographic groups in 2
different types of errors, FAR and FRR. The error discrepancy weight, ω in Equation 5.4.1,
can be used to determine the importance of the different types of errors. ω = 1.0 corresponds
to the case where the differences between the demographic groups are evaluated only using
their FARs. Similarly, ω = 0.0 corresponds to considering the differences only in the FRRs
between demographic groups. ω = 0.5 reflects the condition that discrepancy between the
demographic groups in FAs and FRs are equally important. Intuitively, it can be used to
weigh the relative importance of the discrepancy in FAR and FRR between the demographic groups. For example, evaluating FaDR at high values of ω could be useful in applications such as in border control, where FAs are critical (de Freitas Pereira & Marcel, 2021). A larger
emphasis can be given to reducing demographic disparity in accepting impostor verification
pairs. Similarly, smaller values of ω can be used to evaluate the fairness in applications such
as in smart speakers where considering FRRs that can degrade the user experience is more
important.
5.4.3 Fairness: Area under the FaDR-FAR curve (auFaDR-FAR)
FaDR can be computed at various operating points of an ASV system by varying the threshold on verification similarity scores. These thresholds are applied on demographic-agnostic verification scores to compute demographic-agnostic FARs. Therefore, we can obtain a curve showing the FaDR of the system at various demographic-agnostic FAR values, and this curve can be used to compare the fairness of different systems. Furthermore, Pereira and Marcel (de Freitas Pereira & Marcel, 2021) propose the use of the area under the FaDR-FAR (auFaDR-FAR) curve as an objective summary of the fairness of a system for various conditions. We use this as the primary metric for evaluation because it summarizes the fairness of systems at the operating points of interest.
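A hedged sketch of how the FaDR-FAR curve and its area could be computed is shown below; the per-group and pooled genuine/impostor score arrays are illustrative inputs, and the exact threshold sweep used in this work may differ.

import numpy as np

def error_rates(genuine: np.ndarray, impostor: np.ndarray, thr: float):
    far = float(np.mean(impostor >= thr))   # impostor pairs accepted
    frr = float(np.mean(genuine < thr))     # genuine pairs rejected
    return far, frr

def au_fadr_far(g1, g2, pooled, thresholds, omega: float = 0.5) -> float:
    # x-axis: demographic-agnostic FAR at each threshold; y-axis: FaDR.
    fars, fadrs = [], []
    for thr in thresholds:
        far_all, _ = error_rates(pooled["genuine"], pooled["impostor"], thr)
        far1, frr1 = error_rates(g1["genuine"], g1["impostor"], thr)
        far2, frr2 = error_rates(g2["genuine"], g2["impostor"], thr)
        fadrs.append(1.0 - (omega * abs(far1 - far2) + (1.0 - omega) * abs(frr1 - frr2)))
        fars.append(far_all)
    order = np.argsort(fars)
    return float(np.trapz(np.asarray(fadrs)[order], np.asarray(fars)[order]))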
5.5 Dataset
In this section, we provide details of the datasets used for training and evaluating our models.
We employed different subsets of the Mozilla Common Voice (MCV) dataset (Ardila et al.,
2020) in our experiments. In addition, we also used a subset of the Voxceleb1 dataset as an
out-of-domain evaluation set. The MCV corpus consists of crowd-sourced audio recordings
of read speech collected from around the world in multiple languages. It is an ongoing
project, where any user with access to a computer or smartphone and an internet connection
can upload speech samples for research purposes. Users are prompted to read sentences
appearing on their screen, and these recordings are validated by other users. We also used
the Voxceleb1 dataset (Nagrani et al., 2017) as an external corpus (different from the MCV
corpus) to evaluate the generalizability of the described methods on out-of-domain data. It
consists of in-the-wild recordings of celebrity interviews with speaker identity labels. Unlike
in the MCV corpus, the gender labels in Voxceleb1 were not self-reported but obtained from
Wikipedia. The subsets of these corpora we use in our experiments are described below, and
their statistics are provided in Table 5.2.
5.5.1 Training
We use the following datasets to train the speaker embedding transformation model using
the methods described in Section 5.3. These datasets consist of speech samples with speaker
identity labels and demographic labels such as gender and language.
• xvector-train-U (‘U’ stands for ‘Unbalanced’): A subset of MCV employed to train
the model (described in Section 5.6.1) that was used to extract the baseline speaker
embeddings. It corresponds to the data referred to as Train-2 condition in (Fenu,
Medda, et al., 2020). This subset consists of recordings in English and Spanish, which
are not balanced with respect to gender, as shown in Table 5.2.
• xvector-train-B (‘B’ stands for ‘Balanced’): Another subset of MCV employed to train
the x-vector model used to extract the baseline speaker embeddings. It corresponds
to data referred to as Train-1 condition in (Fenu, Medda, et al., 2020). This subset
consists of recordings which are balanced with respect to the number of speakers per
gender and age, as shown in Table 5.2. This is a subset of the xvector-train-U data.
• embed-train: Data used to train the proposed models to improve fairness using the
methods described in Section 5.3. Pre-trained speaker embeddings extracted on this
data were used to train our models. This is a subset of the xvector-train-U data.
• embed-val: This subset was created with the same speakers present in embed-train
dataset to tune the parameters of the models by evaluating speaker and demographic
Table 5.2: Statistics of datasets used to train and evaluate speaker embedding models. xvector-train-B: balanced with respect to gender in the number of speakers; xvector-train-U: not balanced. embed-train and embed-val (used to train the proposed models) have different utterances from the same set of speakers to facilitate evaluating speaker classification performance during embedding training. eval-dev and eval-test (used to evaluate ASV utility and fairness) have speech utterances with no overlap between speakers. voxceleb-H is an out-of-domain evaluation dataset. #spk. - number of unique speakers, #samples - number of speech utterances in training or number of verification pairs in evaluation, F - Female, M - Male.

              xvector-train-U  xvector-train-B  embed-train  embed-val  eval-dev  eval-test  voxceleb-H
#spk.     F          664              620            585         585       1194        529        527
          M         1706              620           1692        1692       5231       2231        665
#samples  F       86,332           87,949         51,016      12,989    721,370    545,103    226,690
          M      124,179          101,527        117,918      30,205    633,126    528,666    324,206
prediction on data unseen during training. The speaker and demographic prediction
accuracies on this subset can be treated as a proxy for the amount of information in
the intermediate speaker embeddings pertaining to speaker identity and demographic
factors respectively.
5.5.2 Evaluation
The following datasets are used to evaluate the transformed speaker embeddings for their
utility and fairness in ASV.
• eval-dev: We use this data to create development set verification pairs to fine-tune
hyperparametersofourmodels,suchasthebiasweightinEquation5.3.1. Thespeakers
in this subset are unseen during training (speakers not present in any of the subsets
described in Section 5.5.1). Tuning hyperparameters on this subset using metrics useful
for verification allows us to build models that are better suited for the task of speaker
verification. Roughly 1.3M verification pairs were created from this data. Evaluations
were performed on separate subsets of the pairs corresponding to different genders. For
example, to evaluate verification performance on the female demographic group, pairs
were created using enrolment and test utterances only from speakers belonging to the
female demographic group.
• eval-test: Similar to the eval-dev data described above, this contains recordings from speakers not present in any of the above datasets. Particularly, there is no speaker overlap with the eval-dev dataset. Verification pairs from this data are used to evaluate models in terms of both fairness and utility. This dataset was used as held-out data to evaluate
only the best models (after hyperparameter tuning). More than 1M verification pairs
were created from this data.
• voxceleb-H: Following (Toussaint & Ding, 2021), we performed evaluations on the
voxceleb-H split. It is a subset of Voxceleb1 containing 1190 speakers, and 500K veri-
fication trials consisting of same gender and same nationality pairs. Different from the
MCV corpus which is mostly read speech, the Voxceleb1 dataset consists of recordings
from celebrity interviews in an unconstrained setting. This dataset facilitates fairness
evaluations of ASV systems in more relaxed settings consisting of spontaneous speech.
5.6 Experiments
In Section 5.3, we described methods to transform pre-trained speaker embeddings to induce
fairness. In this section, we describe the experiments designed to evaluate the fairness and
utility of the proposed UAI-AT and UAI-MTL methods, by comparing them against suit-
able baselines. In addition, we describe the ablation studies we performed to investigate the
importance of the different modules used in our methods.
Setup: Our method consists of training an embedding transformation model using speaker
identity and demographic labels in a closed-set classification setup. For this work, we restrict
our analyses to gender as the demographic attribute for which fairness is desired, but the proposed methods can be extended to other demographic attributes (e.g., age) as well. (We use the term gender to refer to the self-reported gender in the datasets, except for Voxceleb, where the labels were obtained from Wikipedia. We restrict our analysis to binary gender categories due to the limitation imposed by the availability of labels in existing speech datasets (Garnerin et al., 2021), and hope to overcome these limitations in the future.) The
encoder from the trained speaker representation model is used to extract embeddings, that
are then evaluated for fairness and utility in a speaker verification setting. Below we describe
the baselines along with the training setup of the proposed methods. We then discuss the
evaluation setup and implementation details.
5.6.1 Baselines
The pre-trained speaker embeddings used as input to our models were chosen from the
prior methods developed to improve fairness in ASV systems (Fenu, Medda, et al., 2020).
We compare our methods against ASV systems developed using these chosen off-the-shelf
embeddings as baselines, which allows us to investigate the effectiveness of our proposed
methods in improving the fairness of existing speaker embeddings.
• x-vector-U: As a weak baseline, we use the pre-trained models (available at https://drive.google.com/drive/folders/1FW7FqkNuw2QqsaZ6PVF7EzLLg2ZjKbQ7) that were trained
using data not balanced with respect to gender. Specifically, the models were trained
using the xvector-train-U dataset described in Section 5.5. Evaluation on this baseline
provides an understanding of the biases present in speaker verification systems trained
using unbalanced data. This is particularly important because most existing speaker
verification systems rely on speaker embedding models trained on large amounts of
data, typically without specific attention to data balancing.
• x-vector-B: Data balancing is a common technique used to develop fair ML systems.
(Fenu, Medda, et al., 2020) employed this strategy to improve fairness of speaker
verification systems. This is a stronger baseline against which the proposed UAI-AT
and UAI-MTL methods are compared. We use pre-trained models (available at https://drive.google.com/drive/folders/1sGq0WO9pw7P6VQXy6ovm64kidu7ue5dE) that were trained
using the xvector-train-B dataset described in Section 5.5.
5.6.2 Proposed methods
We trained models with the following methods using gender labels along with the speaker
labels on the embed-train dataset described in Section 5.5.1. As mentioned before, the
embed-train dataset is a subset of the xvector-train-U dataset (though, in theory, we could use the full xvector-train-U dataset, we were able to obtain only a subset due to missing recordings). In contrast with the xvector-train-B dataset, the training data samples are not balanced with respect to the gender labels. The advantage of the proposed methods is that
they can leverage all the available data without explicit data balancing.
We used the speaker embeddings referred to as x-vector-B in Section 5.6.1 as input to our models. The rationale behind using these embeddings was that they were trained using an existing data-balancing technique and have been shown to improve fairness (Fenu, Medda, et al., 2020). This allowed us to explore the proposed techniques as a means to further
improve the fairness of existing ASV systems that are already trained to reduce biases.
• UAI-AT: As described in Section 5.3.1, the gender labels can be used in addition to
the UAI technique, similar to the technique proposed in (Jaiswal et al., 2019) to im-
prove fairness. As shown in Table 5.3, all modules including the discriminator from
Figure 5.3.1 were employed. The optimization was implemented as an alternating
mini-max game, where the predictor training forces the encoder to retain speaker in-
formation, while the discriminator training forces it to strip demographic information.
In the minimization step the encoder and the predictor from Figure 5.3.1 were up-
dated while keeping the secondary branch (discriminator and disentanglers) frozen for
Table 5.3: Table showing active blocks (corresponding to Figure 5.3.1) used in different embedding transformation techniques. All the techniques use an encoder to reduce the dimensions of the input speaker embeddings and a predictor to classify speakers. The first four rows denote ablation experiments, while the last two correspond to the proposed techniques.

           Encoder  Predictor  Decoder  Disentangler  Discriminator
NLDR          ✓         ✓
UAI           ✓         ✓         ✓          ✓
MTL           ✓         ✓                                   ✓
AT            ✓         ✓                                   ✓
UAI-AT        ✓         ✓         ✓          ✓              ✓
UAI-MTL       ✓         ✓         ✓          ✓              ✓
a few iterations. In the maximization step, the encoder and the secondary branch
were updated keeping the primary branch frozen. This way, the encoder was trained
to retain speaker identity information while discarding information about the demo-
graphic attributes from the intermediate speaker representations. In practice, instead
of maximizing the discriminator loss, we minimized the loss between the predictions
and a random sampling of the gender labels from the empirical distribution obtained
from the training data, similar to the technique used in (Jaiswal et al., 2019), (Alvi,
Zisserman, & Nellåker, 2018); a minimal sketch of this alternating schedule is given after this list.
• UAI-MTL: Different from the adversarial training strategy, here the gender labels were used in a multi-task fashion. Similar to the UAI-AT technique, predictor training forces
the encoder to retain speaker information. However, in this case, the discriminator is
trained in a multi-task fashion using gender labels to explicitly force the encoder to
learn demographic information, producing demographic-aware speaker embeddings.
This is achieved by making the discriminator a part of the primary branch, with the
secondary branch consisting of only the disentanglers.
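The alternating schedule described in the UAI-AT item above can be sketched as follows; the modules are stand-in linear layers and the decoder and disentanglers are omitted for brevity, so this is only an illustrative approximation, not the dissertation's training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules (dimensions and architectures are assumptions).
encoder = nn.Linear(512, 96)        # maps input x-vectors to e1
predictor = nn.Linear(96, 200)      # speaker classifier (e.g., 200 speakers)
discriminator = nn.Linear(96, 2)    # gender classifier

opt_primary = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)
opt_secondary = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

def uai_at_step(x, speaker_labels, gender_labels):
    # Minimization step: update encoder + predictor; the frozen discriminator
    # is driven toward randomly resampled gender labels.
    for p in discriminator.parameters():
        p.requires_grad_(False)
    e1 = encoder(x)
    shuffled = gender_labels[torch.randperm(gender_labels.shape[0])]
    loss_min = F.cross_entropy(predictor(e1), speaker_labels) \
        + F.cross_entropy(discriminator(e1), shuffled)
    opt_primary.zero_grad(); loss_min.backward(); opt_primary.step()

    # Adversary step: update the discriminator on the true labels while the
    # primary branch is held fixed.
    for p in discriminator.parameters():
        p.requires_grad_(True)
    loss_adv = F.cross_entropy(discriminator(encoder(x).detach()), gender_labels)
    opt_secondary.zero_grad(); loss_adv.backward(); opt_secondary.step()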
5.6.3 Ablation studies
As discussed in Section 5.3 and shown in Table 5.3, the proposed UAI-AT and UAI-MTL
techniques use all the modules including the encoder, predictor, decoder, disentanglers and
discriminators. We performed ablation studies to better understand the impact of each
module on the performance by selectively retaining certain modules. We also considered the
scenario where gender labels are not available. In such scenarios, we investigate if fairness
can be improved either by UAI or by simple dimensionality reduction using neural networks,
which we term non-linear dimensionality reduction (NLDR). This is in contrast to linear
dimensionality reduction approaches such as principal component analysis. The modules
corresponding to Figure 5.3.1 that are active in these experiments are shown in Table 5.3.
• Non-linear dimensionality reduction (NLDR): We investigate the effect of non-linear
transformation of speaker embeddings while retaining speaker identity information
without considering the demographic information. This is achieved by training a neural
network to transform pre-trained speaker embeddings using only the speaker labels.
This helps understand if simple dimensionality reduction techniques can provide ben-
efits in terms of reducing the biases in the systems. We denote this experiment as
NLDR, since the model is trained using just the encoder and predictor modules with
non-linear activation functions.
• UAI: As described in Chapter 4, the UAI technique was used to improve the robustness
of speaker verification systems to nuisance factors such as acoustic noise, reverberation,
etc., that are not correlated with the speaker’s identity (Peri, Pal, Jati, Somandepalli, &
Narayanan, 2020). However, since demographic attributes such as gender and age are
related to the speaker’s identity, we had observed that this method does not remove
these biases from the speaker embeddings (Peri, Li, et al., 2020). We evaluate if such
training can improve the fairness without the need for demographic label information.
As shown in Table 5.3, all modules except the discriminator from Figure 5.3.1 were used
in training.
• AT: We trained models using the encoder, predictor and discriminator in a standard
adversarial setting (without disentanglers). Similar to UAI-AT, the speaker classifica-
tion task of the predictor forces the encoder to learn speaker-related information, while
the discriminator training forces the encoder to learn representations stripped of de-
mographic information. The adversarial loss was implemented by training the encoder
to minimize predictor loss while maximizing the discriminator loss using alternating
minimization and maximization steps. This experiment allowed us to investigate the
importance of the disentanglers in the training process.
• MTL: Similar to the AT setup described above, we used only the discriminator mod-
ule along with the encoder and predictor modules. In contrast to AT setup, here we
trained the discriminator in a multi-task setting with the predictor. This ensured that
the encoder retained the speaker information (due to the predictor), and also the demo-
graphic information (due to the discriminator). The results of this experiment can be
compared with the UAI-MTL method to evaluate the importance of the disentanglers.
5.6.4 Evaluation setup
We used the encoder from the speaker representation models trained using the techniques
mentioned above as the embedding transformation module. Specifically, we transformed the
x-vector-B speaker embeddings (explained in Section 5.6.1) into a new set of speaker rep-
resentations using the trained encoders. These transformed speaker embeddings produced
from the verification evaluation dataset (Section 5.5.2) were evaluated using the standard
ASV setup described in Section 1.1.1. We used pre-determined enrollment and test pairs
generated from the evaluation data, and computed similarity scores using cosine similarity
(inner product of two unit-length vectors). We then applied a threshold on the similarity
score to produce an accept or reject decision for each verification trial, and the error rates
were computed by aggregating the decisions over all the pairs. Varying the threshold pro-
duces different error rates, and there exists an inherent trade-off between FAR and FRR.
To compute the fairness metric (FaDR, detailed in Section 5.4), the FARs and FRRs for each
demographic group were separately computed using verification trials belonging to that de-
mographic group. For example, to compute the FAR and FRR for the female population,
we aggregated the verification trials where both the enrolment and test utterances belong to
the female gender. Following Pereira and Marcel (de Freitas Pereira & Marcel, 2021), we
do not consider cross-gender trials (where enrolment and test utterances belong to different
genders), because they tend to produce substantially lower FARs than same-gender trials
(Hautamäki et al., 2010). To compute the demographic-agnostic FAR values useful for eval-
uating the auFaDR-FAR metric described in Section 5.4, we pooled all the verification trials
agnostic to their demographic attributes.
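As a concrete illustration of this evaluation pipeline, the sketch below scores trials with cosine similarity, computes FAR/FRR and %EER, selects the threshold that yields a target demographic-agnostic FAR, and evaluates FaDR for two gender groups at that operating point. The helper names are hypothetical, and the FaDR expression follows the form shown in the captions of Figure 5.7.1 rather than restating the formal definition of Section 5.4.

import numpy as np

def cosine_score(enroll, test):
    # Cosine similarity: inner product of length-normalised embeddings.
    e = enroll / np.linalg.norm(enroll, axis=-1, keepdims=True)
    t = test / np.linalg.norm(test, axis=-1, keepdims=True)
    return np.sum(e * t, axis=-1)

def far_frr(scores, labels, threshold):
    # labels: 1 = genuine pair, 0 = impostor pair. Values are returned in percent.
    accept = scores >= threshold
    far = 100.0 * np.mean(accept[labels == 0])     # impostors incorrectly accepted
    frr = 100.0 * np.mean(~accept[labels == 1])    # genuine pairs incorrectly rejected
    return far, frr

def eer(scores, labels):
    # Equal error rate: simple sweep over candidate thresholds.
    thresholds = np.unique(scores)
    fars, frrs = zip(*(far_frr(scores, labels, t) for t in thresholds))
    i = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return (fars[i] + frrs[i]) / 2.0

def threshold_at_far(scores, labels, target_far):
    # Threshold giving the desired demographic-agnostic FAR on the pooled trials.
    imp = np.sort(scores[labels == 0])[::-1]
    k = max(int(round(target_far / 100.0 * len(imp))) - 1, 0)
    return imp[k]

def fadr(scores, labels, genders, threshold, omega):
    # Fairness discrepancy rate at one operating point for two groups (genders coded 0/1).
    far_0, frr_0 = far_frr(scores[genders == 0], labels[genders == 0], threshold)
    far_1, frr_1 = far_frr(scores[genders == 1], labels[genders == 1], threshold)
    return 100.0 - (omega * abs(far_0 - far_1) + (1.0 - omega) * abs(frr_0 - frr_1))

# Example: fairness at the operating point where pooled FAR = 5%, with omega = 0.5.
# thr = threshold_at_far(scores, labels, target_far=5.0)
# print(fadr(scores, labels, genders, thr, omega=0.5))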
Statistical testing for differences in performance
We used permutation tests to evaluate the statistical significance of our results. In partic-
ular, we used random permutations of the verification scores of the x-vector-B baseline and
the proposed methods (UAI-AT or UAI-MTL) to generate a distribution of auFaDR-FAR
values. The ‘true’ auFaDR-FAR (without permuting) was compared against this distribution
of synthetically generated auFaDR-FAR values to compute the p-value. We used n = 10^4
permutations on randomly chosen 100,000 verification trials, with p < 0.01 to denote sig-
nificance. For testing the significance of the differences in %EER, we employed a similar
permutation test strategy, but instead used all the verification trials (2M) with n = 10^4
permutations.
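A paired permutation test of the kind described above can be sketched as follows. Here metric_fn stands in for auFaDR-FAR (or %EER) computed from a score vector and the associated trial metadata, and the per-trial swapping of the two systems' scores is one standard way to realize the permutation; the dissertation compares the un-permuted metric against the permutation distribution, and the difference-based form below is a common, closely related variant.

import numpy as np

def permutation_test(scores_a, scores_b, trials, metric_fn, n_perm=10_000, seed=0):
    # Two-sided p-value for the observed metric difference between systems A and B.
    rng = np.random.default_rng(seed)
    observed = metric_fn(scores_a, trials) - metric_fn(scores_b, trials)
    null = np.empty(n_perm)
    for i in range(n_perm):
        swap = rng.random(len(scores_a)) < 0.5          # randomly swap the systems per trial
        perm_a = np.where(swap, scores_b, scores_a)
        perm_b = np.where(swap, scores_a, scores_b)
        null[i] = metric_fn(perm_a, trials) - metric_fn(perm_b, trials)
    return float(np.mean(np.abs(null) >= abs(observed)))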
5.6.5 Implementation details
The encoder, decoder, predictor and disentangler modules were modeled as multi-layer
perceptrons comprising 2 hidden layers each. The encoder and decoder had 512 units in
each layer, while the disentangler modules had 128 units in each layer. For the predictor
modules, 256 and 512 units were used in the first and second hidden layers, respectively.
The discriminator module comprised a single-hidden-layer network with 64 hidden units. The
probability of dropout used in the random perturbation module was set to 0.75.
Each model was trained using an early stopping criterion based on the speaker prediction
accuracy on the embed-val dataset. In each epoch of the UAI-AT and UAI-MTL training,
optimization was performed with 10 updates of the secondary branch for every 1 update of
the primary branch. A minibatch size of 128 was used, and the primary and secondary
objectives were optimized using the Adam optimizer with 1e-3 and 1e-4 learning rates,
respectively, and a decay factor of 1e-4 for both. The dimensions for the embeddings e_1 and
e_2 were chosen to be 128 and 32, respectively. We set the weights for the losses as α = 100,
β = 5 and γ = 100. The architecture and loss weight parameters were chosen based on our
previous work using the UAI technique to improve robustness of speaker embeddings (Peri,
Pal, et al., 2020). For the discriminator module that is used in the proposed UAI-AT and
UAI-MTL methods, we tuned the weight on the bias term, denoted by δ in Equation 5.3.1,
by evaluating several models with different weight values on the eval-dev dataset. Table B.1
in Section B.1 shows the fairness (auFaDR-FAR) and utility (%EER) of systems that were
trained with different bias weights on the eval-dev dataset. For each method, the model that
gave the best performance (in terms of the auFaDR-FAR on the eval-dev dataset) was used
for final evaluations reported in the next section on the held-out eval-test dataset.
5.7 Results and Discussion
In this section, we report results from the experiments described in Section 5.6, and discuss
our findings. First, we compare the fairness of the proposed systems against the baselines
at a range of system operating points in Section 5.7.1. We then discuss how these systems
compare in terms of their utility in Section 5.7.2. Finally, in Section 5.7.3, we delve into
biases present in the ASV systems at the score level (before the thresholding operation shown
in Figure 1.1.1). This sheds light on the biases present in the verification similarity score
distribution of the existing ASV systems, and how the proposed techniques mitigate these
biases.
5.7.1 Fairness
Figure 5.7.1 shows FaDR plotted at various demographic-agnostic FAR values (up to 10%) for
the proposed UAI-AT and UAI-MTL methods in comparison with the baseline x-vector sys-
tems, on the eval-test dataset. We focus on operating points below 10% FAR because systems
operating at FAR values beyond that may not be useful in practice∗∗. The demographic-
agnostic FAR values are obtained by applying different thresholds on all verification pairs
pooled irrespective of the demographic attribute of the utterances. FaDR is plotted for 3
values of the error discrepancy weight (ω in Eq. 5.4.1), denoting varying amount of contri-
bution from the differences between the genders in FRR and FAR. ω = 0.0 corresponds to
differences in FRR alone (Figure 5.7.1a), while ω = 1.0 corresponds to differences in FAR
alone (Figure 5.7.1b). ω =0.5 corresponds to equal contribution of differences in FARs and
FRRs (Figure 5.7.1c).
Discussion: From Figure 5.7.1a, we observe that the x-vector systems (red and orange
curves) score high on the fairness metric when ω = 0.0. This implies that FRR, which is
the rate of incorrectly rejecting genuine verification pairs, has minimal dependence on the
gender of the speaker. As we discuss later in Section 5.7.3, this can be explained from the
similarity scores of the x-vector speaker embeddings for the genuine pairs shown in Figure
5.7.4b, where we observe a substantial overlap in scores of the female and male populations.
Furthermore, we observe that the proposed ASV systems (UAI-AT and UAI-MTL) score
∗∗ We also performed experiments covering operating points up to 50% demographic-agnostic
FAR, and observed similar trends.
[Figure panels (a)-(c): % FaDR versus demographic-agnostic %FAR (1% to 10%), comparing
xvector-U, xvector-B, UAI-AT and UAI-MTL.]
(a) FaDR with ω = 0.0. Equals 100 − |%FRR_g1 − %FRR_g2|
(b) FaDR with ω = 1.0. Equals 100 − |%FAR_g1 − %FAR_g2|
(c) FaDR with ω = 0.5. Equals 100 − (0.5 · |%FRR_g1 − %FRR_g2| + 0.5 · |%FAR_g1 − %FAR_g2|)
Figure 5.7.1: (Best viewed in color) Fairness for binary gender groups at different operating
points characterized by demographic-agnostic FAR upto 10%, evaluated using 3 different
values for the error discrepancy weight (Eq. 5.4.1), ω = 0.0,1.0 and 0.5. When evaluating
fairness using discrepancy in FRR alone (ω = 0.0), there is not much difference between
the different systems. When evaluating fairness using discrepancy in FAR alone ( ω = 1.0),
baseline x-vector-B trained on balanced data performs better than x-vector-U. The proposed
systems (UAI-AT and UAI-MTL) outperform x-vector-B. When evaluating fairness using
weighted discrepancy in FAR and FRR with equal weights (ω =0.5), the proposed systems
still show better performance than the baselines.
similar to the baselines. It can be inferred that if we only care about the FRRs (i.e., how
many genuine verification pairs are rejected by the ASV system), then the x-vector systems
are already fair with respect to the gender attribute, and additional processing using the
proposed methods retains the existing fairness.
On the other hand, as shown in Figure 5.7.1b, the x-vector systems (red and orange
curves) are less fair considering the case of ω = 1.0. This shows that for the baseline systems,
FAR, which is the rate of incorrectly accepting impostor verification pairs, depends on the
gender of the speakers. Particularly, the x-vector system trained with imbalanced data scores
lower on the fairness metric compared to that trained with balanced data. Furthermore, the
fairness of both the x-vector systems drops at higher values of demographic-agnostic FAR.
This suggests that data balancing by itself may not achieve the desired fairness at all oper-
ating regions of the ASV system considering the biases in FARs between genders. Previous
works in domains other than ASV made similar observations. For example, (T. Wang et al.,
2019) showed that data balancing may not be sufficient to address biases, and they attribute
such behavior to bias amplification by models. Recently, in the field of ASR, (Garnerin et
al., 2021) showed that when training with balanced datasets, the actual speaker composition
in the training data plays a key role in the biases observed in the system outputs. We ob-
serve that using the proposed techniques to transform the x-vector speaker embeddings by
including demographic information during training (UAI-AT and UAI-MTL) improves the
fairness of systems considering the biases in FARs between the female and male population.
The FaDR values (ω = 1.0) of the proposed methods (green and blue curves) remain close to
100% at different values of the demographic-agnostic FAR. Therefore, in scenarios where we
care about how many impostor verification pairs are incorrectly accepted by the ASV sys-
tems, the proposed embedding transformation techniques are beneficial in improving fairness
with respect to gender.
We evaluate the FaDR at ω = 0.5 (denoting equal contribution of FAR and FRR), as
shown in Figure 5.7.1c, to consider the scenario where the discrepancy between genders in
both the FAs and FRs of the system is of interest. First, compared to the x-vector system
trained on imbalanced data (orange curve), the system trained with data balanced with
respect to the genders (red curve) performs better in terms of fairness across all operat-
ing points. This confirms the observation made in (Fenu, Medda, et al., 2020), that data
balancing helps improve the fairness of speaker verification systems to some extent. The
proposed UAI-MTL and UAI-AT methods (green and blue curves) consistently perform bet-
ter than the baselines in terms of fairness at all operating points (with the exception of
UAI-MTL at FAR=1%, where it is only slightly lower than the x-vector-B system). These
results suggest that both adversarial and multi-task learning of speaker embeddings using
gender labels can further improve the fairness of speaker representations compared to data
balancing techniques.
An additional observation from the plots in Figure 5.7.1 is that the benefits in terms
of fairness compared to the baselines are more prominent at higher FARs. This is evident
from the increasing difference between the FaDR values of the baseline x-vector-B and the
proposed systems as the demographic-agnostic FAR increases. As we will see later in Section
5.7.3, this behavior can be explained by the distribution of the verification scores. Also, FaDR
only captures the absolute discrepancy in the performance between genders, but does not
provide information about which particular demographic group is impacted. We discuss the
performance of the systems for each gender group separately in Section B.2.
5.7.2 Fairness-Utility analysis
Table 5.4 shows the area under the FaDR-FAR curve (auFaDR-FAR) along with the %EER
on the eval-test dataset. The auFaDR-FAR values are computed at 5 different values of the
error discrepancy weight, ω. The auFaDR-FAR metric provides a quantitative summary of
the fairness of the systems over several system operating points, with higher values denoting
better fairness. The %EER values are indicative of the speaker verification performance of
the systems, providing an understanding of their utility, with lower values denoting better
utility. In Figure 5.7.2, we show the FRR plotted against FAR using demographic-agnostic
verification pairs. This describes the overall utility of the system, computed without consid-
ering the demographic attributes of the speakers. Curves closer to the origin denote better
utility.
Table 5.4: auFaDR-FAR (upper bound 900) capturing fairness (binary gender groups) for
5 different values of ω, and %EER capturing utility on the eval-test dataset. Both the UAI-
AT and UAI-MTL methods achieve similar auFaDR-FAR values, higher than the baseline
x-vector-B for all values of ω, with significant improvement for ω = 1.0, 0.75, 0.5. UAI-MTL
improves fairness and retains utility (similar %EER as x-vector-B), while UAI-AT achieves
desired fairness at the cost of reduced utility. ∗ denotes significant improvement over the
x-vector-B system (significance computed at level = 0.01 using a permutation test with
n = 10000 random permutations). Values in bold denote the highest fairness for each
different value of ω.

                      Baselines                 Proposed              Ablations
Metric         ω      x-vector-U  x-vector-B   UAI-AT    UAI-MTL     NLDR   UAI    AT      MTL
auFaDR-FAR↑   1.00    842.3       864.9        892.4∗    892.5∗      840.2  863.9  895.5∗  853.6
              0.75    854.6       872.1        893.3∗    892.9∗      853.9  871.7  894.9∗  864.6
              0.50    866.8       879.2        894.3∗    893.2∗      867.6  879.5  894.4∗  875.6
              0.25    878.9       886.3        895.2     893.6       881.2  887.3  893.9   886.5
              0.00    891.2       893.5        896.1     893.9       894.9  895.2  893.3   897.5
%EER↓           -     3.8         2.5          3.9       2.7         2.5    2.4    2.8     2.7
[In-figure curve labels: EER=3.9%, EER=2.7%, EER=2.5%]
Figure 5.7.2: (Best viewed in color) Plot of demographic-agnostic %FRR versus
demographic-agnostic %FAR showing the utility of the systems. Curves closer to the ori-
gin indicate better utility. Notice that the UAI-MTL system closely follows the baseline
x-vector-B system at a range of operating conditions, while UAI-AT reduces utility, shown
by higher %FRR.
In Table 5.5, we report the results from the corresponding set of experiments on the
voxceleb-H dataset, that help understand the generalizability of the proposed methods to
more challenging in-the-wild recording conditions.
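The auFaDR-FAR summary used in Tables 5.4 and 5.5 can be sketched by evaluating FaDR at a sweep of demographic-agnostic FAR operating points and integrating with the trapezoidal rule. The helpers threshold_at_far and fadr are the ones sketched in Section 5.6.4 above; the 1% to 10% sweep is an assumption that is consistent with Figure 5.7.1 and with the stated upper bound of 900 (a constant FaDR of 100 integrated over a FAR range of width 9).

import numpy as np

def aufadr_far(scores, labels, genders, omega, far_grid=None):
    # Area under the FaDR-vs-FAR curve; FaDR is evaluated at each pooled-FAR operating point.
    if far_grid is None:
        far_grid = np.arange(1.0, 10.0 + 1e-9, 1.0)
    fadr_values = [fadr(scores, labels, genders,
                        threshold_at_far(scores, labels, f), omega) for f in far_grid]
    return np.trapz(fadr_values, far_grid)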
Discussion: The results in Table 5.4 show that the UAI-AT and UAI-MTL methods
Table 5.5: auFaDR-FAR capturing fairness (binary gender groups) for 5 different values of
ω, and %EER capturing utility on the voxceleb-H dataset. The UAI-MTL method achieves
significantly higher auFaDR-FAR values than the baseline x-vector-B for all values of ω.
The UAI-AT and AT methods have reduced fairness compared to the baseline, suggesting a
lack of generalizability of the adversarial methods across datasets. The values in bold denote
the highest fairness for each different value of ω.

                      Baseline      Proposed             Select ablations
Metric         ω      x-vector-B    UAI-AT    UAI-MTL    AT      MTL
auFaDR-FAR↑   1.00    863.9         848.4     872.2∗     859.9   820.6
              0.75    863.7         845.5     877.8∗     859.8   809.2
              0.50    865.0         843.4     883.9∗     859.6   797.9
              0.25    866.3         841.3     890.0∗     859.5   786.6
              0.00    870.7         841.8     898.4∗     859.3   775.3
%EER↓           -     30.1          30.4      29.1       30.9    30.6
perform better than the x-vector baselines across all the values of ω examined. In particular,
we found significant improvements using the proposed methods compared to the x-vector-B
baseline at ω = 1.00, 0.75, 0.50. This confirms the statistical significance of the findings re-
ported in the previous section. Additionally, we observe that the UAI-MTL method provides
markedly better utility (as shown by the lower %EER) than the UAI-AT system. This is also
evident from Figure 5.7.2, where we observe that the UAI-MTL method (green curve) per-
forms similar to the baseline x-vector-B speaker embeddings (red curve) in terms of speaker
verification performance. Though the UAI-AT method performs similar to the UAI-MTL
method in terms of fairness (auFaDR-FAR in Table 5.4), it comes at the cost of degraded
utility relative to the baseline x-vector-B speaker embeddings (shown by the shift of the blue
curve away from the origin in Figure 5.7.2). In summary, we find that the proposed multi-
task method of transforming speaker embeddings can improve fairness to supplement data
balancing techniques, while having minimal impact on utility (with statistically insignificant
increase from 2.47 to 2.70). In contrast, the adversarial training method UAI-AT improves
fairness at the cost of a significant increase in the %EER (from 2.47 to 3.86). This suggests
that multi-task learning using the UAI-MTL framework to transform speaker embeddings
provides greater benefits than adversarial methods considering both improvement in the
fairness of ASV systems and their impact on utility.
We observe from the ablation studies that the NLDR and UAI techniques to transform
speaker embeddings are not effective at improving fairness. This shows that merely using
speaker labels without the demographic information cannot provide improvements in fairness
over pre-trained speaker embeddings. This implies that the discriminator in Figure 5.3.1 is an
indispensable module to mitigate biases present in existing speaker representations, as noted
in previous work (Jaiswal et al., 2019). We also observe that MTL (without the UAI branch)
is not effective in improving fairness. Even though AT shows some promise, we observe that
the utility takes a hit (higher %EER). Furthermore, we show in Section B.1 that adversarially
trained methods (UAI-AT and AT) have greater variation in %EER with respect to the
contribution of the bias term on the training loss (δ in Equation 5.3.1). This makes it
challenging to tune the bias weight. Also, as we discuss later using results on the voxceleb-H
dataset, the AT and MTL techniques (without the UAI branch) do not generalize well to
out-of-domain datasets.
These experiments suggest that both the discriminator and the disentangler modules play
an important role in developing fair speaker representations using the proposed methods.
We also note the relationship between the benefits of the proposed systems and the error
discrepancy weight ω in Table 5.4. We observe that as ω becomes smaller, the differences
between the proposed techniques (UAI-AT, UAI-MTL) and the baseline x-vector-B system
becomes smaller. This shows that, in applications with a greater emphasis on the discrepancy
in the FAs of the system (higher values of ω), the proposed techniques can be beneficial in
improving the fairness of the baseline x-vector-B speaker embeddings. In applications where
the emphasis is on the discrepancy in the FRs of the system (smaller values of ω), the
baseline x-vector-B embeddings are already fair, and the proposed techniques merely retain
the fairness. We made a similar observation in Section 5.7.1 from Figure 5.7.1.
Results on the out-of-domain voxceleb-H test set are shown in Table 5.5. We observe
that the UAI-MTL technique to transform x-vector-B speaker embeddings attains the best
performance in terms of fairness (highest auFaDR-FAR), and utility (lowest %EER††). This
suggests generalizability of multi-task training using the UAI framework when evaluated on
a different dataset that is unseen during training. We also observe that performance im-
provements over the baselines using the UAI-AT and AT techniques are inconsistent compared
with the results on the eval-test data shown in Table 5.4, while UAI-MTL shows conclusive
performance improvements even on out-of-domain data.
5.7.3 Biases in verification scores
Measures of fairness and utility are obtained after applying a threshold on the speaker
verification scores as described in Section 1.1.1. We have quantitatively observed that the
fairness of ASV systems can be improved through the UAI-AT and UAI-MTL techniques
compared to the baseline x-vector systems. However, the primary source for lack of fair
performance in ASV systems is the biases present in the speaker verification scores (Stoll,
2011). In order to understand the biases present in the ASV systems, we perform a qualitative
analysis of the verification scores similar to (Toussaint & Ding, 2021). In particular, we plot
the kernel density estimate plots of the cosine similarity scores of the impostor verification
pairs for the female and male populations in Figure 5.7.3, and those of the genuine pairs in
Figure 5.7.4. The impostor verification scores determine the FARs, while the scores of the
genuine verification pairs determine the FRRs of the systems.
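A minimal sketch of this score-level analysis is given below: it estimates the two groups' score densities with a Gaussian kernel and reports a percentage intersection of the two densities, which is one plausible way to compute the %intersection quoted in the captions of Figures 5.7.3 and 5.7.4. Names are illustrative.

import numpy as np
from scipy.stats import gaussian_kde

def kde_intersection(scores_female, scores_male, grid_size=512):
    # Kernel density estimates of the two score distributions on a common grid, followed
    # by the area shared by the two curves (in percent).
    lo = min(scores_female.min(), scores_male.min())
    hi = max(scores_female.max(), scores_male.max())
    grid = np.linspace(lo, hi, grid_size)
    density_f = gaussian_kde(scores_female)(grid)
    density_m = gaussian_kde(scores_male)(grid)
    return 100.0 * np.trapz(np.minimum(density_f, density_m), grid)

# Example: intersection of impostor-score densities for the two gender groups.
# print(kde_intersection(impostor_scores[genders == 0], impostor_scores[genders == 1]))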
Discussion: First, we notice from Figure 5.7.3a that there exists a skew in the cosine
similarity scores between the female and male populations in the x-vector system trained on
data imbalanced with respect to the genders. This points to the presence of biases, likely
exacerbated by the training data imbalance. Particularly, we notice that the impostor scores
for the female demographic population are higher than the scores of the male population,
†† The seemingly poor performance of utility of all the methods on the voxceleb-H dataset can be
attributed to the utility of the x-vector speaker embeddings that we begin with. Here, our goal was to
improve fairness of speaker embeddings, while retaining their utility.
(a) x-vector-U (b) x-vector-B
(c) UAI-AT (d) UAI-MTL
Figure 5.7.3: (Best viewed in color) Kernel density estimates of cosine similarity scores of
impostor pairs for the female and male demographic groups. Both the x-vector baselines
have the scores of the female population shifted compared to the scores of the male popula-
tion. UAI-AT and UAI-MTL techniques reduce differences between the scores. Particularly,
UAI-MTL produces scores with barely noticeable difference between the genders shown by
the %intersection in the scores between genders.
suggesting that at a given threshold, the proportion of FAs for the female population would
be higher than for the male population. Such differences between the female and male
impostor scores have been documented in prior literature (Marras, Korus, Memon, & Fenu,
2019). Further, from Figure 5.7.3b it can be observed that training using data balanced
with respect to the gender labels can mitigate the skew to some extent. Finally, we observe
from Figures 5.7.3c,5.7.3d that both the adversarial and multi-task learning techniques can
(a) x-vector-U (b) x-vector-B
(c) UAI-AT (d) UAI-MTL
Figure 5.7.4: (Best viewed in color) Kernel density estimates of cosine similarity scores of
genuine pairs for the female and male demographic groups. Both the x-vector baselines
have the scores of the female and male population overlapping with each other, indicating
minimal bias between genders. It is worth noting that both the transformation techniques
(UAI-AT and UAI-MTL) retain this overlap as shown by the %intersection in the scores
between genders.
further reduce the skew between the female and male verification scores. In particular, the
UAI-MTL method produces almost overlapping score distributions for the female and male
populations. This suggests that transformation of the speaker embeddings using gender
information in a multi-task fashion using the UAI-MTL framework can help mitigate the
biases in the impostor verification scores between genders. Subsequent application of a
threshold on these scores would therefore produce similar rates of FAs for the female and
male populations, as we have seen in Section 5.7.1.
From Figures 5.7.4a and 5.7.4b, we notice that the scores of the genuine pairs between
the female and male demographic groups are mostly overlapping. This implies that at any
given threshold, we would not observe much difference between the proportions of FRs of
the demographic groups. This is consistent with the quantitative analysis shown earlier in
Table 5.4, where we found high values of the auFaDR-FAR metric for the x-vector systems
computed at smaller values of ω (corresponding to greater emphasis on differences in FRR
between genders). We observe from the figures 5.7.4c and 5.7.4d that embedding transfor-
mation using the proposed methods retains the unbiased nature of the genuine verification
scores obtained from the pre-trained embeddings. In summary, we show that the proposed
methods improve or retain the fairness depending on the target use-case. In scenarios where
the rates of false accepts are an important consideration, the proposed UAI-AT and UAI-
MTL methods are able to reduce the biases present in existing speaker representations.
When the false rejects are more important, our methods preserve the fairness of existing
speaker representations.
5.8 Discussion
We presented adversarial and multi-task learning strategies to improve the fairness of ex-
tant speaker embeddings with respect to demographic attributes of the speakers. In the
adversarial setting, the demographic attribute labels were used to learn speaker embeddings
devoid of the demographic information. In the multi-task approach, the goal was to learn
demographic-aware speaker embeddings, where the demographic information is explicitly
infused into the embeddings. In particular, we adopted the unsupervised adversarial invari-
ance (UAI) framework (Jaiswal et al., 2019) to investigate whether adversarial or multi-task
training is better suited for reducing the biases with respect to binary gender groups in
speaker embeddings used in ASV systems. We used the recently proposed fairness discrep-
ancy rate metric (de Freitas Pereira & Marcel, 2021) to evaluate the fairness of the systems
at various operating points. We observed that data balancing, a commonly used strategy to
improve fairness, mitigates the biases to some extent. However, its fairness depends on the
operating point of interest (whether it is a low FAR or low FRR operating region). Therefore,
it is important to consider the specific application, and the corresponding desired operating
region, of the ASV systems when evaluating fairness. For applications strictly focused on
the differences between genders in their FRRs, existing x-vector speaker embeddings (either
trained on balanced or imbalanced data) performed well by having very minimal biases, and
the speaker embeddings transformed using the proposed methods retained this desirable
property. However, as we move toward applications focused on the differences between the
genders in their FARs, the x-vector speaker embeddings showed biases between the genders.
In this scenario, the proposed adversarial and multi-task training strategies were able to
mitigate these biases by a significant margin. Furthermore, we showed qualitative evidence
that the proposed methods were able to effectively reduce the biases in the verification score
distributions between the female and male populations. In addition, we showed that it is
critical to jointly consider aspects of both fairness and utility in selecting embedding transfor-
mation techniques. We found that the adversarial and multi-task training strategies showed
similar performance on fairness metrics. However, while multi-task training to transform the
x-vector speaker embeddings had very little impact on the utility, the adversarial training
strategy significantly degraded the utility.
Chapter 6
Summary and Future Work
The objective in this dissertation is to develop techniques to extract robust and fair speaker
representations for use in speaker recognition applications. This deals with mitigating vari-
ability introduced due to within-speaker and between-speaker factors. To this end, I first
discussed techniques to obtain a comprehensive understanding of the information encoded
in neural speaker representations through extensive experimental evaluation. Informed by
these findings, I adopted an unsupervised disentanglement technique to transform the popu-
lar x-vectors to lower dimensional, speaker-discriminative representations that are invariant
to nuisance factors. Through detailed experiments on a variety of datasets and tasks, I
showed that the proposed speaker embeddings are more robust than x-vectors to several
nuisance factors for speaker recognition tasks including speaker verification and diarization.
Later I explored several aspects of fairness and utility of ASV systems in this work.
In particular, I showed systematic evaluations of biases present in ASV systems at multiple
operating points. Then, I presented adversarial and multi-task learning strategies to improve
the fairness of extant speaker embeddings with respect to demographic attributes of the
speakers. The major takeaway was that bias evaluations need to be done in the context of the
operating region of the system. Such evaluations can then be used to inform the techniques
to improve fairness. I also showed qualitative evidence that the proposed methods were able
[Figure: within-speaker factors such as emotion, noise and linguistic content, and between-speaker
factors such as race, language, age, gender and health.]
Figure 6.0.1: Intersectional aspects of the factors of variability in speaker recognition. Im-
portant to consider for holistic understanding of biases.
to effectively reduce the biases in the verification score distributions between the female and
male populations.
The detailed analyses and methods I discussed in this dissertation demonstrate that neu-
ral representation techniques can improve the robustness and fairness of speaker recognition
systems. However, I believe that there are still open questions that require further inves-
tigation. First, I limited the analyses to gender as the demographic factor of interest in
the current investigations. However, considering other demographic attributes (including
intersectional demographics) is important (Foulds, Islam, Keya, & Pan, 2020). For example,
systems that are not biased with respect to gender alone could be biased when a different
demographic factor (e.g. age) is considered as an intersecting attribute. Also, I trained the
models using the MCV corpus, and analyzed the biases in these systems using the MCV
and Voxceleb corpora. However, such datasets could be prone to systemic censoring (Kallus
& Zhou, 2018). For example, the MCV corpus may not be sufficiently representative of
the different demographic groups and their intersectional attributes, because the data was
collected only from users with access to a microphone and internet connection. Similarly, the
Voxceleb corpus consists of speech samples only from celebrities. A more inclusive adoption
of such technologies requires careful consideration of these various aspects, which form an
interesting direction for future research.
The intersection of the aspects of fairness and robustness is another direction worth con-
sidering for future research. So far, I presented work done on these two aspects separately.
However, in the real-world, these factors of variability typically co-exist. As depicted in Fig-
ure 6.0.1, different demographic groups can have distinct within-speaker variability factors.
For example, certain demographic populations may be prone to more noise, while certain oth-
ers can have more diverse variability in the expression of emotional states that are manifest
in speech. Therefore, it is crucial to jointly consider these different factors of variability for
a holistic understanding of the practical applicability of speaker recognition systems.
References
Adi, Y., Kermany, E., Belinkov, Y., Lavi, O., & Goldberg, Y. (2016). Fine-grained analysis
of sentence embeddings using auxiliary prediction tasks.
Albanie, S., Nagrani, A., Vedaldi, A., & Zisserman, A. (2018). Emotion recognition in speech
using cross-modal transfer in the wild. In Proceedings of the 26th acm international
conference on multimedia (pp. 292–301).
Alvi, M., Zisserman, A., & Nellåker, C. (2018). Turning a blind eye: Explicit removal
of biases and variation from deep neural network embeddings. In Proceedings of the
european conference on computer vision (eccv) workshops (pp. 0–0).
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., ... Weber, G.
(2019). Common voice: A massively-multilingual speech corpus. arXiv preprint
arXiv:1912.06670.
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., ... Weber, G.
(2020). Common voice: A massively-multilingual speech corpus. In Proceedings of the
12th conference on language resources and evaluation (lrec 2020) (pp. 4211–4215).
Bagher Zadeh, A., Liang, P. P., Poria, S., Cambria, E., & Morency, L.-P. (2018, July).
Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable
dynamic fusion graph. In Proceedings of the 56th annual meeting of the association
for computational linguistics (volume 1: Long papers) (pp. 2236–2246). Melbourne,
Australia: Association for Computational Linguistics. Retrieved from
https://www.aclweb.org/anthology/P18-1208 doi: 10.18653/v1/P18-1208
Bao, H., Xu, M.-X., & Zheng, T. F. (2007). Emotion attribute projection for speaker
recognition on emotional speech. In Eighth annual conference of the international
speech communication association.
Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and machine learning. fairml-
book.org. (http://www.fairmlbook.org)
Beigi, H. (2011). Speaker recognition. In Fundamentals of speaker recognition (pp. 543–559).
Springer.
Bellamy, R. K., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., ... others (2018).
Ai fairness 360: An extensible toolkit for detecting, understanding, and mitigating
unwanted algorithmic bias. arXiv preprint arXiv:1810.01943.
Berk, R., Heidari, H., Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., ... Roth, A.
(2017). A convex framework for fair regression. arXiv preprint arXiv:1706.02409.
Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition
for under-resourced languages: A survey. Speech communication, 56, 85–100.
Beveridge, J. R., Givens, G. H., Phillips, P. J., & Draper, B. A. (2009). Factors that influence
algorithm performance in the face recognition grand challenge. Computer Vision and
Image Understanding, 113(6), 750–762.
Bhattacharya, G., Alam, J., & Kenny, P. (2019). Adapting end-to-end neural speaker
verification to new languages and recording conditions with adversarial training. In
Icassp 2019-2019 ieee international conference on acoustics, speech and signal process-
ing (icassp) (pp. 6041–6045).
Bhattacharya, G., Alam, J., Stafylakis, T., & Kenny, P. (2016). Deep neural network based
text-dependent speaker recognition: Preliminary results. In Proc. odyssey (pp. 2–15).
Bimbot, F., Bonastre, J.-F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier,
S., ... Reynolds, D. A. (2004). A tutorial on text-independent speaker verification.
EURASIP Journal on Advances in Signal Processing, 2004(4), 1–22.
Binns, R. (2020). On the apparent conflict between individual and group fairness. In
Proceedings of the 2020 conference on fairness, accountability, and transparency (pp.
514–524).
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is
to computer programmer as woman is to homemaker? debiasing word embeddings.
Advances in neural information processing systems, 29, 4349–4357.
Brown, A., Huh, J., Chung, J. S., Nagrani, A., & Zisserman, A. (2022). Voxsrc 2021: The
third voxceleb speaker recognition challenge.
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities
in commercial gender classification. In Conference on fairness, accountability and
transparency (pp. 77–91).
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., ... Narayanan, S. S.
(2008). Iemocap: Interactive emotional dyadic motion capture database. Language
resources and evaluation, 42(4), 335.
Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab, M., Sadoughi, N., & Provost,
E. M. (2016). Msp-improv: An acted corpus of dyadic interactions to study emotion
perception. IEEE Transactions on Affective Computing , 8(1), 67–80.
Calders, T., Kamiran, F., & Pechenizkiy, M. (2009). Building classifiers with independency
constraints. In 2009 ieee international conference on data mining workshops (pp. 13–
18).
Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., & Vair, C. (2007). Compensation of
nuisance factors for speaker and language recognition. IEEE Transactions on Audio,
Speech, and Language Processing, 15, 1969-1978.
Chen, X., Li, Z., Setlur, S., & Xu, W. (2022). Exploring racial and gender disparities in
voice biometrics. Scientific Reports , 12(1), 1–12.
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments. Big data, 5(2), 153–163.
Chung, J. S., Nagrani, A., & Zisserman, A. (2018). Voxceleb2: Deep speaker recognition.
arXiv preprint arXiv:1806.05622.
Cornacchia, M., Papa, F., & Sapio, B. (2020). User acceptance of voice biometrics in
managing the physical access to a secure area of an international airport. Technology
Analysis & Strategic Management, 32(10), 1236–1250.
Cyrta, P., Trzciński, T., & Stokowiec, W. (2017). Speaker diarization using deep recurrent
convolutional neural networks for speaker embeddings. In International conference on
information systems architecture and technology (pp. 107–117).
de Freitas Pereira, T., & Marcel, S. (2021). Fairness in biometrics: a figure of merit to
assess biometric verification systems. IEEE Transactions on Biometrics, Behavior,
and Identity Science, 4(1), 19–29.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2010). Front-end factor
analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language
Processing, 19(4), 788–798.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2011). Front-end factor
analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language
Processing, 19(4), 788-798. doi: 10.1109/TASL.2010.2064307
Derek du Preez. (n.d.). Barclays adopts voice biometrics for customer identification.
https://www.computerworld.com/article/3418280/barclays-adopts-voice-biometrics-for-customer-identification.html (May 2013).
Díaz, M., Johnson, I., Lazar, A., Piper, A. M., & Gergle, D. (2018). Addressing age-related
bias in sentiment analysis. In Proceedings of the 2018 chi conference on human factors
in computing systems (pp. 1–14).
Dixon, L., Li, J., Sorensen, J., Thain, N., & Vasserman, L. (2018). Measuring and mitigating
unintended bias in text classification. In Proceedings of the 2018 aaai/acm conference
on ai, ethics, and society (pp. 67–73).
Doddington, G., Liggett, W., Martin, A., Przybocki, M., & Reynolds, D. (1998). Sheep,
goats, lambs and wolves: A statistical analysis of speaker performance in the nist 1998
speaker recognition evaluation (Tech. Rep.). National Inst of Standards and Technology
Gaithersburg Md.
Drozdowski, P., Rathgeb, C., Dantcheva, A., Damer, N., & Busch, C. (2020). Demographic
bias in biometrics: A survey on an emerging challenge. IEEE Transactions on Tech-
nology and Society, 1(2), 89–103.
Du, M., Yang, F., Zou, N., & Hu, X. (2020). Fairness in deep learning: A computational
perspective. IEEE Intelligent Systems.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through aware-
ness. In Proceedings of the 3rd innovations in theoretical computer science conference
(pp. 214–226).
Edwards, H., & Storkey, A. (2015). Censoring representations with an adversary. arXiv
preprint arXiv:1511.05897.
Feng, S., Kudina, O., Halpern, B. M., & Scharenborg, O. (2021). Quantifying bias in
automatic speech recognition.
Fenu, G., Lafhouli, H., & Marras, M. (2020). Exploring algorithmic fairness in deep speaker
verification. In International conference on computational science and its applications
(pp. 77–93).
Fenu, G., Marras, M., Medda, G., & Meloni, G. (2021). Fair voice biometrics: Impact
of demographic imbalance on group fairness in speaker recognition. Proc. Interspeech
2021, 1892–1896.
Fenu, G., Medda, G., Marras, M., & Meloni, G. (2020). Improving fairness in speaker
recognition. In Proceedings of the 2020 european symposium on software engineering
(pp. 129–136).
Fiscus, J. G., Ajot, J., Michel, M., & Garofolo, J. S. (2006). The Rich Transcription 2006
spring meeting recognition evaluation. In International workshop on machine learning
for multimodal interaction (pp. 309–322).
Foulds, J. R., Islam, R., Keya, K. N., & Pan, S. (2020). An intersectional definition of
fairness. In 2020 ieee 36th international conference on data engineering (icde) (p. 1918-
1921). doi: 10.1109/ICDE48307.2020.00203
Galbally, J., Haraksim, R., & Beslay, L. (2018). A study of age and ageing in fingerprint
biometrics. IEEE Transactions on Information Forensics and Security, 14(5), 1351–
1365.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., ... Lem-
pitsky, V. (2016). Domain-adversarial training of neural networks. The journal of
machine learning research, 17(1), 2096–2030.
Garcia-Romero, D., Snyder, D., Sell, G., Povey, D., & McCree, A. (2017). Speaker diarization
using deep neural network embeddings. In Proc. icassp (pp. 4930–4934).
Garg, P., Villasenor, J., & Foggo, V. (2020). Fairness metrics: A comparative analysis.
Garnerin, M., Rossato, S., & Besacier, L. (2021). Investigating the impact of gender repre-
sentation in asr training data: a case study on librispeech. In 3rd workshop on gender
bias in natural language processing (pp. 86–92).
George, K. K., Kumar, C. S., Ramachandran, K., & Panda, A. (2015). Cosine distance
features for improved speaker verification. Electronics Letters, 51(12), 939–941.
Gorrostieta, C., Lotfian, R., Taylor, K., Brutti, R., & Kane, J. (2019). Gender de-biasing
in speech emotion recognition. In Interspeech (pp. 2823–2827).
Green, B. (2018). ‘fair’risk assessments: A precarious approach for criminal justice reform.
In5th workshop on fairness, accountability, and transparency in machine learning (pp.
1–5).
Grother, P., Ngan, M., & Hanaoka, K. (2019, 2019-12-19). Face recognition vendor test
part 3: Demographic effects. NIST Interagency/Internal Report (NISTIR), National
Institute of Standards and Technology, Gaithersburg, MD. doi: https://doi.org/10
.6028/NIST.IR.8280
Haas, C. (2020). The price of fairness-a framework to explore trade-offs in algorithmic
fairness. In 40th international conference on information systems, icis 2019.
Hamon, R., Junklewitz, H., Malgieri, G., Hert, P. D., Beslay, L., & Sanchez, I. (2021).
Impossible explanations? beyond explainable ai in the gdpr from a covid-19 use case
scenario. In (p. 549–559). New York, NY, USA: Association for Computing Machinery.
Retrieved from https://doi.org/10.1145/3442188.3445917 doi: 10.1145/3442188
.3445917
Hansen, J. H. (1996). Analysis and compensation of speech under stress and noise for
environmental robustness in speech recognition. Speech communication, 20(1-2), 151–
173.
Hansen, J. H., & Hasan, T. (2015). Speaker recognition by machines and humans: A tutorial
review. IEEE Signal processing magazine, 32(6), 74–99.
Hassan, B., Izquierdo, E., & Piatrik, T. (2021). Soft biometrics: a survey. Multimedia Tools
and Applications, 1–44.
Hautamäki, V., Kinnunen, T., Nosratighods, M., Lee, K.-A., Ma, B., & Li, H. (2010). Ap-
proaching human listener accuracy with modern speaker verification. INTERSPEECH-
2010.
Howard, A., & Borenstein, J. (2018). The ugly truth about ourselves and our robot creations:
the problem of bias and social inequity. Science and engineering ethics, 24(5), 1521–
1536.
Howard, J. J., Sirotin, Y. B., & Vemury, A. R. (2019). The effect of broad and specific
demographic homogeneity on the imposter distributions and false match rates in face
recognition algorithm performance. In 2019 ieee 10th international conference on bio-
metrics theory, applications and systems (btas) (pp. 1–8).
Hsu, I.-H., Jaiswal, A., & Natarajan, P. (2019). Niesr: Nuisance invariant end-to-end speech
recognition. ArXiv, abs/1907.03233.
Jaiswal, A., Wu, R. Y., Abd-Almageed, W., & Natarajan, P. (2018a). Unsupervised adversar-
ial invariance. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
& R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31).
Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/
2018/file/03e7ef47cee6fa4ae7567394b99912b7-Paper.pdf
Jaiswal, A., Wu, R. Y., Abd-Almageed, W., & Natarajan, P. (2018b). Unsupervised adver-
sarial invariance. Advances in Neural Information Processing Systems, 31, 5092–5102.
Jaiswal, A., Wu, Y., AbdAlmageed, W., & Natarajan, P. (2019). Unified adversarial invari-
ance. arXiv preprint arXiv:1905.03629.
James Griffiths. (n.d.). Citi Tops 1 Million Mark for Voice Bio-
metrics Authentication for Asia Pacific Consumer Banking Clients.
https://www.citigroup.com/citi/news/2017/170321b.htm (March 2017).
Kahn, J., Audibert, N., Rossato, S., & Bonastre, J.-F. (2010, 06). Intra-speaker variability
effects of speaker verification performance. Odyssey-2010, the speaker and Language
recognition Workshop.
Kallus, N., & Zhou, A. (2018). Residual unfairness in fair machine learning from prejudiced
data.
Kanervisto, A., Vestman, V., Sahidullah, M., Hautamäki, V., & Kinnunen, T. (2017). Effects
of gender information in text-independent and text-dependent speaker verification. In
2017 ieee international conference on acoustics, speech and signal processing (icassp)
(pp. 5360–5364).
Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007). Joint factor analysis ver-
sus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and
Language Processing, 15(4), 1435–1447.
Klare, B. F., Burge, M. J., Klontz, J. C., Bruegge, R. W. V., & Jain, A. K. (2012). Face
recognition performance: Role of demographic information. IEEE Transactions on
Information Forensics and Security, 7(6), 1789–1801.
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., ... Goel, S.
(2020). Racialdisparitiesinautomatedspeechrecognition. Proceedings of the National
Academy of Sciences, 117(14), 7684–7689.
Larcher, A., Lee, K. A., Ma, B., & Li, H. (2014). Text-dependent speaker verification:
Classifiers, databases and rsr2015. Speech Communication, 60, 56–77.
Lee, K. A., Hautamaki, V., Kinnunen, T., Yamamoto, H., Okabe, K., Vestman, V., ...
others (2019). I4u submission to nist sre 2018: Leveraging from a decade of shared
experiences. arXiv preprint arXiv:1904.07386.
Lee, K. A., Larcher, A., Wang, G., Kenny, P., Brümmer, N., Leeuwen, D. v., ... others
(2015). The reddots data collection for speaker recognition. In Sixteenth annual con-
ference of the international speech communication association.
Li, L., Wang, D., Chen, Y., Shi, Y., Tang, Z., & Zheng, T. F. (2018, April). Deep factorization
for speech signal. In 2018 ieee international conference on acoustics, speech and signal
processing (icassp) (p. 5094-5098). doi: 10.1109/ICASSP.2018.8462169
Li, X., Cui, Z., Wu, Y., Gu, L., & Harada, T. (2021). Estimating and improving fairness
with adversarial learning. arXiv preprint arXiv:2103.04243.
Lisa Eadicicco. (n.d.). Exclusive: Amazon Developing Advanced Voice-Recognition for Alexa.
https://time.com/4683981/amazon-echo-voice-id-feature-2017/ (Feb. 2017).
Liu, C., Picheny, M., Sarı, L., Chitkara, P., Xiao, A., Zhang, X., ... Saraf, Y. (2021). Towards
measuring fairness in speech recognition: Casual conversations dataset transcriptions.
Liu, Z., Veliche, I.-E., & Peng, F. (2021). Model-based approach for measuring the fairness
in asr.
Lozano-Diez, A., Silnova, A., Matejka, P., Glembek, O., Plchot, O., Pesan, J., ... Gonzalez-
Rodriguez, J. (2016). Analysis and optimization of bottleneck features for speaker
recognition. In Odyssey (pp. 352–357).
Luu, C., Bell, P., & Renals, S. (2020a). Channel adversarial training for speaker verification
and diarization. In Icassp 2020-2020 ieee international conference on acoustics, speech
and signal processing (icassp) (pp. 7094–7098).
Luu, C., Bell, P., & Renals, S. (2020b). Leveraging speaker attribute information using multi-
task learning for speaker verification and diarization. arXiv preprint arXiv:2010.14269.
Markowitz, J. A. (2000, September). Voice biometrics. Commun. ACM, 43(9), 66–73.
Retrieved from https://doi.org/10.1145/348941.348995 doi: 10.1145/348941
.348995
Marras, M., Korus, P., Memon, N. D., & Fenu, G. (2019). Adversarial optimization for
dictionary attacks on speaker verification. In Interspeech (pp. 2913–2917).
McCowan, I., Carletta, J., Kraaij, W., Ashby, S., Bourban, S., Flynn, M., ... others (2005).
The ami meeting corpus. In Proceedings of the 5th international conference on methods
and techniques in behavioral research (Vol. 88, p. 100).
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2019). A survey on
bias and fairness in machine learning. arXiv preprint arXiv:1908.09635.
Mehrabian, A. (2008). Communication without words. Communication theory, 6, 193–200.
Meng, Z., Zhao, Y., Li, J., & Gong, Y. (2019). Adversarial speaker verification. In
Icassp 2019-2019 ieee international conference on acoustics, speech and signal pro-
cessing (icassp) (pp. 6216–6220).
Mishler, A., Kennedy, E. H., & Chouldechova, A. (2021). Fairness in risk assessment
instruments: Post-processing to achieve counterfactual equalized odds. In (p. 386–400).
New York, NY, USA: Association for Computing Machinery. Retrieved from https://
doi.org/10.1145/3442188.3445902 doi: 10.1145/3442188.3445902
Morales, A., Fierrez, J., Vera-Rodriguez, R., & Tolosana, R. (2020). Sensitivenets: Learning
agnostic representations with application to face images.
Mulligan, D. K., Kroll, J. A., Kohli, N., & Wong, R. Y. (2019). This thing called fairness:
disciplinary confusion realizing a value in technology. Proceedings of the ACM on
Human-Computer Interaction, 3(CSCW), 1–36.
Nagrani, A., Chung, J. S., Albanie, S., & Zisserman, A. (2020). Disentangled speech
embeddings using cross-modal self-supervision. In Icassp 2020-2020 ieee international
conference on acoustics, speech and signal processing (icassp) (pp. 6829–6833).
Nagrani, A., Chung, J. S., & Zisserman, A. (2017). Voxceleb: a large-scale speaker identifi-
cation dataset. arXiv preprint arXiv:1706.08612.
Nandwana, M. K., Van Hout, J., McLaren, M., Richey, C., Lawson, A., & Barrios, M. A.
(2019). The voices from a distance challenge 2019 evaluation plan. arXiv preprint
arXiv:1902.10828.
Nandwana, M. K., van Hout, J., McLaren, M., Stauffer, A. R., Richey, C., Lawson, A., &
Graciarena, M. (2018). Robust speaker recognition from distant speech under real
reverberant environments using speaker embeddings. In Interspeech (pp. 1106–1110).
Noé, P.-G., Mohammadamini, M., Matrouf, D., Parcollet, T., & Bonastre, J.-F. (2020). Ad-
versarial disentanglement of speaker representation for attribute-driven privacy preser-
vation. arXiv preprint arXiv:2012.04454.
Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015, April). Librispeech: An asr
corpus based on public domain audio books. In 2015 ieee international conference on
acoustics, speech and signal processing (icassp) (p. 5206-5210). doi: 10.1109/ICASSP
.2015.7178964
Park, J. H., Shin, J., & Fung, P. (2018). Reducing gender bias in abusive language detection.
arXiv preprint arXiv:1808.07231.
Parthasarathy, S., Zhang, C., Hansen, J. H. L., & Busso, C. (2017, March). A study of
speaker verification performance with expressive speech. In 2017 ieee international
conference on acoustics, speech and signal processing (icassp) (p. 5540-5544). doi:
10.1109/ICASSP.2017.7953216
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Duch-
esnay, E. (2011). Scikit-learn: Machine Learning in Python . Journal of Machine
Learning Research, 12, 2825–2830.
Peri, R., Li, H., Somandepalli, K., Jati, A., & Narayanan, S. (2020). An empirical analysis of
information encoded in disentangled neural speaker representations. In Proc. odyssey
2020 the speaker and language recognition workshop (pp. 194–201).
Peri, R., Pal, M., Jati, A., Somandepalli, K., & Narayanan, S. (2020). Robust speaker recog-
nition using unsupervised adversarial invariance. In Icassp 2020-2020 ieee international
conference on acoustics, speech and signal processing (icassp) (pp. 6614–6618).
Perrachione, T. K., & Wong, P. C. (2007). Learning to recognize speakers of a non-native
language: Implications for the functional organization of human auditory cortex. Neu-
ropsychologia, 45(8), 1899–1910.
Preciozzi, J., Garella, G., Camacho, V., Franzoni, F., Di Martino, L., Carbajal, G., &
Fernandez, A. (2020). Fingerprint biometrics from newborn to adult: A study from a
national identity database system. IEEE Transactions on Biometrics, Behavior, and
Identity Science, 2(1), 68–79.
Raj, D., Snyder, D., Povey, D., & Khudanpur, S. (2019). Probing the information encoded
in x-vectors. In 2019 ieee automatic speech recognition and understanding workshop
(asru) (pp. 726–733).
Raschka, S. (2018, April). Mlxtend: Providing machine learning and data science utilities
and extensions to python’s scientific computing stack. The Journal of Open Source
Software, 3(24). Retrieved from http://joss.theoj.org/papers/10.21105/joss
.00638 doi: 10.21105/joss.00638
Reynolds, D. A. (1995). Automatic speaker recognition using gaussian mixture speaker
models. In The lincoln laboratory journal.
Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted
gaussian mixture models. Digital signal processing, 10(1-3), 19–41.
Richey, C., Barrios, M. A., Armstrong, Z., Bartels, C., Franco, H., Graciarena, M., ... Ni,
K. (2018). Voices obscured in complex environmental settings (voices) corpus.
Robinson, J. P., Livitz, G., Henon, Y., Qin, C., Fu, Y., & Timoner, S. (2020). Face
recognition: Too bias, or not too bias?
Ross, A., Banerjee, S., Chen, C., Chowdhury, A., Mirjalili, V., Sharma, R., ... Yadav, S.
(2019). Some research problems in biometrics: The future beckons. In 2019 interna-
tional conference on biometrics (icb) (pp. 1–8).
Ryu, H. J., Adam, H., & Mitchell, M. (2017). Inclusivefacenet: Improving face attribute
detection with race and gender diversity. arXiv preprint arXiv:1712.00193.
Sadeghi, B., Wang, L., & Boddeti, V. N. (2021). Adversarial representation learning with
closed-form solvers. In Joint european conference on machine learning and knowledge
discovery in databases (pp. 731–748).
Sadjadi, S. O., Greenberg, C., Singer, E., Reynolds, D., Mason, L., & Hernandez-Cordero, J. (2019). The 2018 NIST Speaker Recognition Evaluation. In Proc. Interspeech 2019 (pp. 1483–1487). Retrieved from http://dx.doi.org/10.21437/Interspeech.2019-1351 doi: 10.21437/Interspeech.2019-1351
Sadjadi, S. O., Kheyrkhah, T., Tong, A., Greenberg, C. S., Reynolds, D. A., Singer, E.,
... Hernandez-Cordero, J. (2017). The 2016 nist speaker recognition evaluation. In
Interspeech (pp. 1353–1357).
Sang, M., Xia, W., & Hansen, J. H. (2020). Deaan: Disentangled embedding and adver-
sarial adaptation network for robust speaker representation learning. arXiv preprint
arXiv:2012.06896.
Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The risk of racial bias in
hate speech detection. In Proceedings of the 57th annual meeting of the association for
computational linguistics (pp. 1668–1678).
Sarı, L., Hasegawa-Johnson, M., & Yoo, C. D. (2021). Counterfactually fair automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3515-3525. doi: 10.1109/TASLP.2021.3126949
Scherer, K. R. (1986). Vocal affect expression: A review and a model for future research.
Psychological bulletin, 99(2), 143.
Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., ...
others (2018). Diarization is hard: Some experiences and lessons learned for the jhu
team in the inaugural dihard challenge. In Interspeech (pp. 2808–2812).
Serna, I., Peña, A., Morales, A., & Fierrez, J. (2021). InsideBias: Measuring bias in deep networks and application to face gender biometrics. In 2020 25th International Conference on Pattern Recognition (ICPR) (pp. 3720–3727).
Shen, H., Yang, Y., Sun, G., Langman, R., Han, E., Droppo, J., & Stolcke, A. (2022).
Improving fairness in speaker verification via group-adapted fusion network. arXiv
preprint arXiv:2202.11323.
Shi, X., Yu, F., Lu, Y., Liang, Y., Feng, Q., Wang, D., ... Xie, L. (2021). The accented
english speech recognition challenge 2020: open datasets, tracks, baselines, results and
methods. arXiv preprint arXiv:2102.10233.
Shin, D.-G., & Jun, M.-S. (2015, 07). Home iot device certification through speaker recog-
nition. In (p. 600-603). doi: 10.1109/ICACT.2015.7224867
Shon, S., Tang, H., & Glass, J. R. (2018). Frame-level speaker embeddings for text-
independent speaker recognition and analysis of end-to-end model. 2018 IEEE Spoken
Language Technology Workshop (SLT), 1007-1013.
Si, S., Li, Z., & Xu, W. (2021). Exploring demographic effects on speaker verification. In
2021 ieee conference on communications and network security (cns) (pp. 1–2).
Snyder, D., Garcia-Romero, D., Povey, D., & Khudanpur, S. (2017). Deep neural network
embeddings for text-independent speaker verification. In Interspeech (pp. 999–1003).
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors:
Robust dnn embeddings for speaker recognition. In 2018 ieee international conference
on acoustics, speech and signal processing (icassp) (pp. 5329–5333).
Solomonoff, A., Campbell, W. M., & Quillen, C. (2007). Nuisance attribute projection.
Speech Communication, 1–73.
Stoll, L. L. (2011). Finding difficult speakers in automatic speaker recognition (Unpublished
doctoral dissertation). UC Berkeley.
Stoychev, S., & Gunes, H. (2022). The effect of model compression on fairness in facial
expression recognition. arXiv preprint arXiv:2201.01709.
Sun, G., Zhang, C., & Woodland, P. C. (2019). Speaker diarisation using 2D self-attentive combination of embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5801–5805).
Suriyakumar, V. M., Papernot, N., Goldenberg, A., & Ghassemi, M. (2021). Chasing your long tails: Differentially private prediction in health care settings. In (pp. 723–734). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/3442188.3445934 doi: 10.1145/3442188.3445934
Tawara, N., Ogawa, A., Iwata, T., Delcroix, M., & Ogawa, T. (2020). Frame-level phoneme-
invariant speaker embedding for text-independent speaker recognition on extremely
short utterances. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6799–6803).
Torres-Carrasquillo, P. A., Richardson, F., Nercessian, S., Sturim, D. E., Campbell, W. M.,
Gwon, Y., ... others (2017). The mit-ll, jhu and lrde nist 2016 speaker recognition
evaluation system. In Interspeech (pp. 1333–1337).
Toussaint, W., & Ding, A. Y. (2021). Sveva fair: A framework for evaluating fairness in
speaker verification. arXiv preprint arXiv:2107.12049.
Ülgen, İ. R., Erden, M., & Arslan, L. M. (2021). Predicting biometric error behaviour from speaker embeddings and a fast score normalization scheme. In International Conference on Speech and Computer (pp. 826–836).
Van Leeuwen, D. A., & Brümmer, N. (2007). An introduction to application-independent evaluation of speaker recognition systems. In Speaker Classification I (pp. 330–353). Springer.
Variani, E., Lei, X., McDermott, E., Moreno, I. L., & Gonzalez-Dominguez, J. (2014).
Deep neural networks for small footprint text-dependent speaker verification. In 2014
ieee international conference on acoustics, speech and signal processing (icassp) (pp.
4052–4056).
Verma,S.,&Rubin,J. (2018). Fairnessdefinitionsexplained. In 2018 ieee/acm international
workshop on software fairness (fairware) (pp. 1–7).
Wadsworth, C., Vera, F., & Piech, C. (2018). Achieving fairness through adversarial learning: an application to recidivism prediction. arXiv preprint arXiv:1807.00199.
Wang, Q., Rao, W., Sun, S., Xie, L., Chng, E. S., & Li, H. (2018). Unsupervised do-
main adaptation via domain adversarial training for speaker recognition. In 2018 ieee
international conference on acoustics, speech and signal processing (icassp) (pp. 4889–
4893).
Wang, S., Qian, Y., & Yu, K. (2017). What does the speaker embedding encode? In
Interspeech (pp. 1497–1501).
Wang, T., Zhao, J., Yatskar, M., Chang, K.-W., & Ordonez, V. (2019). Balanced datasets
are not enough: Estimating and mitigating gender bias in deep image representations.
In Proceedings of the ieee/cvf international conference on computer vision (pp. 5310–
5319).
Williams, J., & King, S. (2019). Disentangling style factors from speaker representations.
In Interspeech (pp. 3945–3949).
Xu, T., White, J., Kalkan, S., & Gunes, H. (2020). Investigating bias and fairness in facial
expression recognition. In European conference on computer vision (pp. 506–523).
Zafar, M. B., Valera, I., Rogriguez, M. G., & Gummadi, K. P. (2017). Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics (pp. 962–970).
Zemel, R., Wu, Y., Swersky, K., Pitassi, T., & Dwork, C. (2013, 17–19 Jun). Learning
fair representations. In S. Dasgupta & D. McAllester (Eds.), Proceedings of the 30th
international conference on machine learning (Vol. 28, pp. 325–333). Atlanta, Georgia,
USA: PMLR. Retrieved from http://proceedings.mlr.press/v28/zemel13.html
Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating unwanted biases with ad-
versarial learning. In Proceedings of the 2018 aaai/acm conference on ai, ethics, and
society (pp. 335–340).
Zhang, C., Yu, C., & Hansen, J. (2017, 01). An investigation of deep learning frameworks for speaker verification anti-spoofing. IEEE Journal of Selected Topics in Signal Processing, PP, 1-1. doi: 10.1109/JSTSP.2016.2647199
Zhang, Y., & Sang, J. (2020). Towards accuracy-fairness paradox: Adversarial example-
based data augmentation for visual debiasing. In Proceedings of the 28th acm interna-
tional conference on multimedia (pp. 4346–4354).
Zhao, H., & Gordon, G. J. (2019). Inherent tradeoffs in learning fair representations. arXiv
preprint arXiv:1906.08386.
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). Gender
bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint
arXiv:1804.06876.
Zhou, J., Jiang, T., Li, L., Hong, Q., Wang, Z., & Xia, B. (2019). Training multi-task
adversarial network for extracting noise-robust speaker embedding. In Icassp 2019-
2019 ieee international conference on acoustics, speech and signal processing (icassp)
(pp. 6196–6200).
Zollinger, S. A., & Brumm, H. (2011). The lombard effect. Current Biology, 21(16), R614–
R615.
Appendices
Appendix A
Supervised adversarial disentanglement for improving robustness of speaker recognition: Emotional speech
In this chapter we describe additional experiments we performed to investigate the effectiveness of supervised methods of inducing invariance. In particular, we will discuss emotion as an exemplar content factor for analysis.
(The work presented in this chapter was inspired by the work published in the following article: Peri, Raghuveer, et al. “Disentanglement for audio-visual emotion recognition using multitask setup.” ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.)
A.1 Introduction
We saw in Chapter 4 that an unsupervised disentanglement approach can be used to remove nuisance information from speaker embeddings without explicit supervision of the factors to be removed. We saw that such an approach can effectively reduce information about channel
factors such as noise type, microphone type etc., in addition to content factors such as
emotion and sentiment. We had argued that in many cases data with labelled nuisance
factors may not be available, and thus the unsupervised techniques can provide a viable
alternative for disentanglement. However, in some cases, labels of the nuisance factors to be
removed are available. Therefore, it may be of interest to understand how much better we
can do in terms of disentanglement when labelled data is available.
We will use emotion as an exemplar content factor to investigate supervised adversarial
disentanglement. We will briefly describe the method in Section A.2. Then, we will discuss
the dataset we used in Section A.3.1. Results are shown in Section A.4, and final concluding
remarks are provided in Section A.5.
A.2 Method
As described in Section 4.2, adversarial training to induce invariance in speaker embeddings
has been explored to a large extent (Meng et al., 2019; Q. Wang et al., 2018; Zhou et al.,
2019). Here, the nuisance factors such as noise type or emotion are assumed to be known
during training. As shown in Figure A.2.1, these labels in addition to the speaker identity
labels are used to learn speaker representations that capture speaker-related information,
while being devoid of nuisance information.
\mathcal{L}_{pred} = \mathrm{CE}(y, \hat{y}) \quad \text{and} \quad \mathcal{L}_{dis} = \mathrm{CE}(n, \hat{n}) \tag{A.2.1}

\min_{\Theta_e, \Theta_p} \; \max_{\Phi_d} \; \mathcal{L}_{pred} + \gamma \, \mathcal{L}_{dis} \tag{A.2.2}
Figure A.2.1: Disentanglement of nuisance factors from speaker embeddings when nuisance labels are available. The discriminator predicts nuisance labels and is trained adversarially with the encoder. The learned speaker embeddings e_1 capture nuisance-invariant speaker information.

During training, the cross-entropy loss between speaker labels, L_pred, and the loss between nuisance labels, L_dis, are employed. In Equation A.2.1, y and ŷ are the true and predicted speaker labels respectively, n and n̂ are the true and predicted nuisance factor labels respectively, and CE denotes cross-entropy loss. The encoder is trained to minimize L_pred and maximize L_dis, while the predictor is trained to minimize L_pred, and the discriminator is trained to minimize L_dis. Clearly, there exists a minimax game between the encoder and discriminator as shown in Equation A.2.2. The discriminator tries to pull nuisance information into the encoded representations (while minimizing the discriminator cross-entropy loss), while the encoder tries to resist it (by maximizing the discriminator cross-entropy loss). In practice, instead of maximizing L_dis, the encoder is trained to produce embeddings for which the cross-entropy loss between the discriminator's nuisance predictions and a uniform distribution is minimized (Nagrani, Chung, Albanie, & Zisserman, 2020).
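To make the optimization in Equations A.2.1 and A.2.2 concrete, the following is a minimal PyTorch sketch of the alternating training loop. It is an illustration rather than the exact implementation used in this work: the layer sizes, optimizers, learning rates and the KL-to-uniform surrogate (which differs from the cross-entropy-to-uniform loss only by a constant) are assumptions made for readability.

```python
# Minimal sketch of supervised adversarial disentanglement (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, hid_dim, n_speakers, n_emotions, gamma = 512, 256, 1251, 5, 1.0

encoder = nn.Sequential(nn.Linear(emb_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim))
predictor = nn.Linear(hid_dim, n_speakers)       # speaker classifier (L_pred)
discriminator = nn.Linear(hid_dim, n_emotions)   # nuisance classifier (L_dis)

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(x, spk_labels, emo_labels):
    # 1) Discriminator step: learn to predict the nuisance factor from e1.
    with torch.no_grad():
        e1 = encoder(x)
    dis_loss = F.cross_entropy(discriminator(e1), emo_labels)
    opt_disc.zero_grad(); dis_loss.backward(); opt_disc.step()

    # 2) Encoder/predictor step: keep speaker information (minimize L_pred)
    #    while pushing the discriminator posterior towards a uniform
    #    distribution over nuisance classes (the practical surrogate for
    #    maximizing L_dis described above).
    e1 = encoder(x)
    pred_loss = F.cross_entropy(predictor(e1), spk_labels)
    log_probs = F.log_softmax(discriminator(e1), dim=-1)
    uniform = torch.full_like(log_probs, 1.0 / n_emotions)
    conf_loss = F.kl_div(log_probs, uniform, reduction="batchmean")
    loss = pred_loss + gamma * conf_loss
    opt_main.zero_grad(); loss.backward(); opt_main.step()
    return pred_loss.item(), dis_loss.item()

# Example call with a random batch of 32 speaker embeddings.
x = torch.randn(32, emb_dim)
train_step(x, torch.randint(0, n_speakers, (32,)), torch.randint(0, n_emotions, (32,)))
```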
A.3 Experiments
A.3.1 Datasets
For the supervised disentanglement experiments, we use the EmoVox dataset (Albanie et
al., 2018). The EmoVox dataset comprises emotion labels for the VoxCeleb-1 dataset, obtained as predictions from a strong teacher network over eight emotional states: neutral, happiness, surprise, sadness, anger, disgust, fear and contempt. Note that the teacher model was trained only using facial features (visual only). Overall, the dataset consists of interview
videos from 1251 celebrities spanning a wide range of ages and nationalities. For each audio
utterance, we find the most dominant emotion based on the distribution and use that as
our ground-truth label similar to (Albanie et al., 2018). The label distribution is heavily
skewed towards a few emotion classes because emotions such as disgust, fear, contempt
and surprise are rarely exhibited in interviews. Following previous approaches that deal
with such imbalanced datasets (Busso et al., 2016), we combine these labels into a single
class ‘other’, resulting in 5 emotion classes. Further, we discard videos corresponding to
speakers belonging to the bottom 5 percentile w.r.t the number of segments to reduce the
imbalance in the number of speech segments per speaker. We create three splits from the
database: EmoVox-Train to train models, EmoVox-Validation for hyperparameter tuning, and EmoVox-Test to evaluate models on the speaker verification task using held-out speech segments from speakers present in the train set. The EmoVox-Train and EmoVox-Validation subsets
were created from the dev partition in VoxCeleb-1, whereas EmoVox-Test was created from
the test partition.
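As a rough illustration of this label preparation, the sketch below picks the dominant emotion per utterance and merges the rarely occurring classes into 'other', and filters speakers with few segments; the input format (a posterior vector per utterance, a segment count per speaker) and the helper names are assumptions for illustration only.

```python
# Illustrative label preparation for EmoVox-style data (assumed input format).
import numpy as np

EMOTIONS = ["neutral", "happiness", "surprise", "sadness",
            "anger", "disgust", "fear", "contempt"]
RARE = {"surprise", "disgust", "fear", "contempt"}   # merged into 'other'

def utterance_label(posteriors):
    """Pick the most dominant emotion, then map rare classes to 'other'."""
    dominant = EMOTIONS[int(np.argmax(posteriors))]
    return "other" if dominant in RARE else dominant

def keep_speakers(segments_per_speaker, percentile=5):
    """Drop speakers in the bottom `percentile` w.r.t. number of segments."""
    counts = np.array(list(segments_per_speaker.values()))
    cutoff = np.percentile(counts, percentile)
    return {spk for spk, c in segments_per_speaker.items() if c >= cutoff}

# Example: a made-up posterior for one utterance.
print(utterance_label([0.10, 0.05, 0.60, 0.05, 0.10, 0.05, 0.03, 0.02]))  # -> 'other'
```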
For the experiments on quantifying the emotion information, we use the IEMOCAP
dataset as described in Section 3.3, while the results for the speaker verification experiments
are shown on the EmoVox-Test dataset.
A.3.2 Setup
We trained models using the previously described adversarial disentanglement method. In
particular, we used the emotion labels available in the dataset to transform speaker em-
beddings into an embedding space stripped of emotion information. We tuned the loss hyper-parameter γ in Equation A.2.2 on the EmoVox-Validation dataset to minimize the emotion classification performance, and evaluated the best models on the IEMOCAP dataset for quantifying information and on the held-out EmoVox-Test dataset for speaker verification evaluation.
A.4 Results
A.4.1 Quantifying information
In this section, we quantitatively show how much emotion information is contained in the speaker embeddings, similar to the analysis in Section 4.5.1, but with the described supervised disentanglement method. As seen in Table A.1, the supervised disentanglement approach yields the smallest %F1 score in classifying emotions from speaker embeddings on the IEMOCAP dataset, suggesting that it is able to reduce the emotion information substantially compared with the unsupervised disentanglement approach described in Section 4.3. This is expected behavior, since the additional information available to the model can now be used to effectively remove nuisance factors. From the confusion matrices shown in Figure A.4.1, it is clear that the supervised disentanglement particularly reduces the performance on the ‘happiness’ class.
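For reference, the probing experiment summarized in Table A.1 can be sketched with scikit-learn as below. The placeholder arrays, the logistic-regression probe, and the weighted F1 averaging are illustrative assumptions; they are not necessarily the exact probing setup used to produce the table.

```python
# Probing sketch: how much emotion information can a simple classifier
# recover from (transformed) speaker embeddings?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))       # placeholder embeddings (use real ones)
y = rng.integers(0, 4, size=1000)      # 0: Anger, 1: Sadness, 2: Happiness, 3: Neutral

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe %F1:", 100 * f1_score(y_te, probe.predict(X_te), average="weighted"))
```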
Table A.1: Average %F1 of emotion recognition on the IEMOCAP dataset from speaker embeddings. Unsup. Dis. denotes unsupervised disentanglement, while Sup. Dis. denotes supervised disentanglement of embeddings. Robust speaker embeddings are expected to perform poorly at classifying emotions.

Method    Majority baseline    x-vector    Unsup. Dis.    Sup. Dis.
%F1       87.49                91.70       80.92          70.02

Figure A.4.1: Confusion matrices for emotion recognition using speaker embeddings on IEMOCAP for different embedding transformation techniques: (a) x-vector, (b) Unsup. Dis., (c) Sup. Dis. Robust speaker embeddings are expected to perform poorly on the emotion recognition task. 0: Anger, 1: Sadness, 2: Happiness, 3: Neutral. Supervised disentanglement provides the best disentanglement, as shown by the poor emotion recognition. Differences are particularly noticeable for the Happiness (2) class.

A.4.2 Speaker verification
So far, we have seen that the supervised adversarial disentanglement approach is able to reduce the emotion information from speaker embeddings. Now we would like to understand
if this helps improve the robustness of speaker verification in the presence of variability
due to emotional speech. In Table A.2, we report speaker verification results on trials cre-
ated from speakers unseen during training for different emotion classes. We observe that
the embeddings obtained using the supervised disentanglement approach yield much better performance compared to the baseline x-vectors. This can be seen in all the emotion
classes (though the number of trials in the case of the emotion ‘sadness’ may be too small to
draw conclusions from). This suggests that removing the emotion information from speaker
embeddings using the supervised disentanglement approach is beneficial in improving the
robustness of speaker verification in real-world emotional speech scenarios. From the last
column in Table A.2, we can see that supervised disentanglement improves the ASV perfor-
mance when considering the overall performance including all emotion classes.
Table A.2: Speaker verification performance (%EER) for the baseline x-vector and Sup. Dis. methods, split by emotion. The last row shows the number of verification trials used in the analysis. The supervised disentanglement approach outperforms x-vectors in all emotion classes. The last column shows the results for all emotions combined.

Method        Neutral     Other     Anger    Happiness   Sadness   Overall
x-vector         8.53      9.26      9.99       5.66        4.88      8.75
Sup. Dis.        5.02      6.12      7.49       3.91        0.17      5.12
Num. Trials  2,395,918   479,711    76,142     59,308      1,564    616,725
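For completeness, the %EER values reported in Table A.2 can be computed from trial scores in a few lines; the sketch below, built on scikit-learn's ROC utilities, is one common recipe and is an assumption rather than the exact scoring tool used in these experiments.

```python
# Equal error rate (EER): the operating point where the false acceptance rate
# equals the false rejection rate (1 - TPR).
import numpy as np
from sklearn.metrics import roc_curve

def eer_percent(labels, scores):
    """labels: 1 for genuine trials, 0 for impostor trials; scores: similarities."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # point closest to FAR == FRR
    return 100.0 * (fpr[idx] + fnr[idx]) / 2.0

# Toy example with synthetic scores (replace with real trial scores).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)
scores = rng.normal(loc=labels.astype(float), scale=1.0)   # genuine trials score higher
print("EER: %.2f%%" % eer_percent(labels, scores))
```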
A.5 Conclusions
In this chapter, we considered the scenario where the labels of the nuisance factors to be removed are available during training. We showed, using emotion as an exemplar nuisance factor, that models can be adversarially trained to remove nuisance information from speaker embeddings. We found that such a supervised disentanglement approach yields better removal of nuisance information than the unsupervised methods. We further saw that this leads to better speaker verification performance in the presence of a wide range of emotions. Therefore, when nuisance labels are available, supervised disentanglement methods can provide improved speaker recognition robustness. However, the unsupervised methods are still useful when explicit nuisance labels are not available.
Appendix B
Deep dive into biases and bias mitigation
(The work presented in this chapter was submitted for review to Elsevier Computer Speech and Language: Peri, Raghuveer, Krishna Somandepalli, and Shrikanth Narayanan. “To train or not to train adversarially: A study of bias mitigation strategies for speaker recognition.” arXiv preprint arXiv:2203.09122 (2022).)
B.1 Effect of bias weight
We trained several models by varying the weight parameter δ in Equation 5.3.1. This parameter allowed us to control the influence of the discriminator loss on the overall optimization. As described in Section 5.6, we fixed the values for the weights of the predictor, decoder and disentangler modules based on preliminary experiments to α = 100, β = 5 and γ = 100, respectively. Therefore, by varying δ, we studied the isolated effect of the discriminator loss on the training objective.
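As a rough sketch of how the bias weight enters the overall objective studied here, the snippet below combines placeholder loss terms with the fixed weights above and sweeps δ over the grid used in Table B.1; the individual loss computations stand in for the modules defined in Chapter 5 and are not reproduced here.

```python
# Illustrative composition of the overall training loss; loss terms are placeholders.
import torch

alpha, beta, gamma = 100.0, 5.0, 100.0        # predictor, decoder, disentangler weights
delta_grid = [10, 30, 50, 70, 100, 150, 200]  # bias (discriminator) weights studied

def total_loss(l_pred, l_dec, l_dis, l_bias, delta):
    return alpha * l_pred + beta * l_dec + gamma * l_dis + delta * l_bias

# Example with dummy scalar losses for a single training step.
dummy = [torch.tensor(v) for v in (0.9, 0.4, 0.7, 0.6)]
for delta in delta_grid:
    print(delta, float(total_loss(*dummy, delta)))
```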
Table B.1: Classification results on the embed-val dataset and verification results on the eval-dev dataset for different bias weights (δ in Equation 5.3.1). The majority-class random chance accuracy for bias labels in the embed-val data was 70%. For each method, the row with the best auFaDR is marked with *.

Method      bias weight   %acc. (predictor)   %acc. (bias)   %EER    auFaDR
xvector-U        -              -                  -          2.66   871.09
xvector-B        -              -                  -          2.36   884.57
UAI-AT          10            96.99              78.24        2.81   893.40
UAI-AT          30            96.91              70.55        3.42   892.17
UAI-AT          50            96.58              75.92        4.95   893.03
UAI-AT          70            96.76              80.35        3.12   896.38 *
UAI-AT         100            96.6               82.03        4.11   890.21
UAI-AT         150            96.1               77.70        4.58   888.12
UAI-AT         200            95.32              72.80       10.26   893.69
UAI-MTL         10            97.04              97.03        2.45   896.72 *
UAI-MTL         30            96.99              97.98        2.66   885.45
UAI-MTL         50            96.96              98.52        2.70   886.52
UAI-MTL         70            97.01              98.73        3.06   858.99
UAI-MTL        100            96.96              98.86        2.66   848.88
UAI-MTL        150            96.94              99.02        2.98   851.31
UAI-MTL        200            96.91              99.03        3.19   852.62
AT              10            96.72              69.91        3.00   879.68
AT              30            96.71              76.04        3.40   882.82
AT              50            96.65              71.91        3.14   897.04 *
AT              70            96.24              59.37        9.78   884.28
AT             100            95.91              78.10        8.11   884.06
MTL             10            96.68              97.19        2.47   861.27
MTL             30            96.70              98.33        2.52   876.25 *
MTL             50            96.79              98.71        2.77   858.71
MTL             70            96.75              98.91        2.99   852.92
MTL            100            93.73              98.96        2.75   870.16

Discussion: The second pair of columns in Table B.1 shows the speaker classification accuracy of the predictor and the gender classification accuracy of the discriminator on the embed-val dataset. Clearly, the UAI-AT method is able to reduce the gender classification accuracy to close to the majority-class chance performance (70%). This shows that the technique is able to successfully reduce the amount of gender information in the speaker embeddings. On the other hand, owing to its multi-task training setup, the UAI-MTL method retains gender information in the speaker embeddings. This is evident from the high gender classification accuracy of the discriminator (>97%).
Verification results on the eval-dev dataset are shown in the third set of columns in
Table B.1. We notice that compared to the UAI-AT models, the UAI-MTL models provide
better verification performance as shown by the %EER in all settings. In addition, across
different training configurations (characterized by the bias weights), the UAI-MTL method has a smaller variation in %EER (min: 2.45, max: 3.19) when compared with the UAI-AT method (min: 2.81, max: 10.26). This provides further evidence of the negative impact on the utility of adversarial training when compared with multi-task learning. It validates the findings from prior research that have shown the instability of adversarial training (Sadeghi, Wang, & Boddeti, 2021). We find similar trends in models trained without the UAI branch. Specifically, we observe that the MTL methods have a smaller variation in %EER (min: 2.47, max: 2.99) when compared to the AT methods (min: 3.00, max: 9.78). Finally, for all the methods, we choose the optimal bias weight δ based on the best auFaDR-FAR value (marked with * in Table B.1). This model was used for the evaluations on the eval-test dataset that were described in Section 5.7.
B.2 Direction of bias
In Section 5.7, we reported the results using the FaDR metric, which considers the absolute difference between the FARs and FRRs of the female and male demographic groups. It does not provide a sense of the direction of bias. Previous studies have shown that ASV systems are prone to higher error rates for the female population than the male population (Fenu, Lafhouli, & Marras, 2020; George, Kumar, Ramachandran, & Panda, 2015). In a similar vein, we wanted to investigate if there is a systematic bias against a particular gender. In particular, we wanted to check if the ASV systems consistently underperform for a particular demographic group when compared with a different demographic group. We report the individual FARs (Figure B.2.1) and FRRs (Figure B.2.2) of the female and male populations at varying thresholds characterized by demographic-agnostic %FARs.
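A sketch of how such curves can be computed is given below: thresholds are chosen to hit target demographic-agnostic %FAR values on the pooled impostor scores, and the FAR and FRR of each gender group are then measured at those thresholds. The data layout and function names are illustrative assumptions rather than the exact evaluation code used here.

```python
# Sketch: demographic-specific FAR/FRR at thresholds set by the
# demographic-agnostic (pooled) FAR. Score arrays are placeholders.
import numpy as np

def far(scores_impostor, thr):
    return np.mean(scores_impostor >= thr)

def frr(scores_genuine, thr):
    return np.mean(scores_genuine < thr)

def group_errors(genuine, impostor, target_fars):
    """genuine/impostor: dicts mapping group ('female'/'male') to score arrays."""
    pooled_imp = np.concatenate(list(impostor.values()))
    out = {}
    for t in target_fars:
        # Threshold achieving the target FAR on the pooled impostor scores.
        thr = np.quantile(pooled_imp, 1.0 - t)
        out[t] = {g: (far(impostor[g], thr), frr(genuine[g], thr)) for g in genuine}
    return out

# Toy example with synthetic scores (replace with real trial scores).
rng = np.random.default_rng(0)
genuine = {g: rng.normal(1.0, 1.0, 5000) for g in ("female", "male")}
impostor = {g: rng.normal(0.0, 1.0, 50000) for g in ("female", "male")}
print(group_errors(genuine, impostor, target_fars=[0.01, 0.05])[0.01])
```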
Figure B.2.1: (Best viewed in color) Demographic-specific FAR for the baseline and proposed systems: (a) xvector-B (Delta = 5.2%), (b) UAI-AT (Delta = 1.1%), (c) UAI-MTL (Delta = 0.9%). Here, we look at the individual FARs of the female and male populations. Compared to the x-vector system, both the proposed methods reduce the difference in the FARs between the female and male populations. However, the UAI-AT method achieves this by reducing the error rate for both groups, while the UAI-MTL method achieves this by increasing the error rate for the male population, making it closer to the female population.

Figure B.2.2: (Best viewed in color) Demographic-specific FRR for the baseline and proposed systems: (a) xvector-B, (b) UAI-AT, (c) UAI-MTL. Here we look at the individual FRRs of the female and male populations. The baseline x-vector method already shows very little difference in the FRRs between the female and male populations. The proposed techniques retain these small differences.

Discussion: From Figure B.2.1, we observe that the baseline x-vector system is highly biased against the female demographic group considering FARs. This is evident from the gap between the curves for the male (solid blue) and female (dotted blue) populations. Furthermore, the gap increases at higher values of demographic-agnostic %FAR. On the other hand, both the proposed UAI-AT and UAI-MTL methods reduce the gap in %FAR between the female and male populations. However, they show noticeably different behavior. UAI-MTL reduces the %FAR of the female population (dotted red) compared to the x-vector baseline, while simultaneously increasing the %FAR of the male population (solid red), bringing them closer to each other. On the other hand, the UAI-AT method substantially reduces the %FAR on the female population (dotted green), while also reducing the %FAR on the male population (solid green) by a small extent. At first glance, this seems to suggest that UAI-AT is a better technique since it improves the performance of both demographic groups with respect to %FAR. However, as we discussed in Section 5.7.2, considering the %FRR of the systems, the UAI-AT method degrades the performance, thereby affecting the overall utility of the ASV system.
In Figure B.2.2, we report the %FRR for the female and male populations. Notice the
difference in the scale of y-axis compared to Figure B.2.1. Here, we observe that there is
not much difference between the %FRRs of the different demographic groups even with the
baseline x-vector system. Furthermore, we observe that the UAI-MTL method to transform
x-vectors does not have a substantial impact on the performance compared with x-vectors.
The UAI-AT technique of transforming x-vectors increases %FRR for both the female and
male populations to some extent.
B.3 Individual-level biases in speaker recognition
So far, we considered the biases in ASV systems at the group level. Particularly, we evaluated biases using gender as the demographic attribute under consideration. We found that the
ASV systems are biased against the female population. However, there could be factors of
variability that can not be explained by the demographic attributes alone. For example,
certain individuals could be prone to errors purely based on the uniqueness of their voice
characteristics. Such differences in performance between individuals have been documented
in relatively old research papers. A prominent example is (Doddington et al., 1998), where
they showed the existence of different individuals who could be grouped based on their
proneness to different types of errors such as false accepts and false rejects. Such differences
were also demonstrated later by Stoll in her Ph.D. thesis (Stoll, 2011). However, those
analyses were performed on relatively older ASV systems that use Gaussian mixture models. More recently, (Ülgen, Erden, & Arslan, 2021) showed differences in the speaker verification scores on modern speaker embeddings (x-vectors). However, their focus was on utterance-level scores, not on specific individuals.
We performed a preliminary investigation along the lines of (Doddington et al., 1998) and (Stoll, 2011) to test for statistical differences between the verification scores of different speakers. For this, we resorted to the eval-test dataset of the MCV corpus discussed in Chapter 5. We collected the genuine and impostor scores of each speaker and, following (Doddington et al., 1998), performed a Kruskal-Wallis rank test (using the implementation at https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html) to check for differences in the scores between speakers. Specifically, for the test using genuine scores, we used all the genuine verification pairs for each speaker. For tests using impostor verification pairs, all the scores corresponding to a pair of a particular enrolment and test speaker were averaged. We performed the analysis separately for the female and male populations for both the genuine and impostor verification scores. The evaluation split we used consisted of 114 female and 167 male speakers. The null hypothesis was that the mean ranks of the scores for each speaker are the same. We were able to reject the null hypothesis with extremely small p-values (p << 0.01) for all scenarios considered. This indicates that in each of the female and male groups, there exists at least one speaker with significant differences in their scores, and this holds true for both the genuine and impostor scores. This suggests that there exist speakers who are more prone to errors, both false accepts and false rejects. Further post-hoc analysis is required to determine the specific speakers who are affected disproportionately by the errors. It is possible that there is a different demographic attribute at play here. Determining the individuals who are at a disadvantage can help mitigate the biases by training speaker-specific models (personalized models). As a preliminary analysis, we performed pairwise t-tests between the scores of each speaker with appropriate Bonferroni correction. Figure B.3.1 shows the statistical significance (with p < 0.01) for each pair of speakers; 1 denotes that significant differences were found between the scores of the pair of speakers, while 0 denotes that no significant differences were found. From the top row in the figure, it is clear that there are more speaker pairs with significant differences in their genuine scores for females than for males. Next, considering the impostor scores (bottom row), we can observe that there are quite a few speaker pairs with significant differences in both females and males. This clearly suggests that the differences in scores observed by the Kruskal-Wallis test described previously are not due to just a few speakers; rather, there are a large number of speakers whose scores differ significantly.

Figure B.3.1: Pairwise t-tests between genuine and impostor scores of different speaker pairs, separated by gender. Panels: (a) Genuine (Female), (b) Genuine (Male), (c) Impostor (Female), (d) Impostor (Male). 1 represents a statistically significant difference (p < 0.01), 0 represents no significant difference. Notice the presence of several speakers with significantly different scores, for both genuine and impostor verification pairs.
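The statistical testing procedure described above can be sketched as follows with scipy; the synthetic per-speaker score arrays are placeholders, and the pairing and averaging of impostor scores is simplified relative to the exact analysis in the text.

```python
# Sketch of the speaker-level score analysis: a Kruskal-Wallis rank test across
# all speakers, followed by Bonferroni-corrected pairwise t-tests.
# per_speaker_scores maps speaker id -> array of (genuine or impostor) scores;
# here it is filled with synthetic data for illustration.
from itertools import combinations
import numpy as np
from scipy.stats import kruskal, ttest_ind

rng = np.random.default_rng(0)
per_speaker_scores = {f"spk{i}": rng.normal(loc=rng.normal(), scale=1.0, size=50)
                      for i in range(20)}

# Omnibus test: are the score distributions the same across speakers?
stat, p = kruskal(*per_speaker_scores.values())
print(f"Kruskal-Wallis H={stat:.2f}, p={p:.3g}")

# Post-hoc pairwise t-tests with Bonferroni correction.
pairs = list(combinations(per_speaker_scores, 2))
alpha_corrected = 0.01 / len(pairs)
significant = [(a, b) for a, b in pairs
               if ttest_ind(per_speaker_scores[a], per_speaker_scores[b]).pvalue
               < alpha_corrected]
print(f"{len(significant)} / {len(pairs)} speaker pairs differ significantly")
```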
Abstract
Speech is an information-rich signal that conveys a wealth of information including a person's age, gender, language, emotional state and environmental surroundings. Speaker recognition (SR), the task of identifying speakers based on their speech, is an active area of research. SR has found a wide range of applications in several everyday technologies such as smart speakers, customer care centers etc. It is crucial that SR systems perform reliably in diverse environments, while not having biases against any particular demographic group or individual. A technique to improve the robustness of SR systems against variability is to ensure that the speaker representations used in these systems retain only information related to the speaker's identity. Specifically, speaker representations can be trained to contain minimal information pertaining to factors unnecessary for the SR task such as background noise, channel conditions and emotional state. In this thesis, we provide insights into the various factors of information captured in current state-of-the-art speaker representations using extensive experiments. Guided by these findings, we propose adversarial learning techniques to minimize nuisance information in speaker representations, and empirically show that such techniques improve the robustness of SR in challenging conditions. Moreover, studies of variability in the performance of contemporary SR systems with respect to demographic factors are lagging compared to other speech applications such as speech recognition. Furthermore, there exist only a handful of bias mitigation strategies developed for SR systems. Therefore, we first present systematic evaluations of the biases present in SR systems with respect to gender across a range of system operating points. We then discuss our proposed representation learning techniques to mitigate the biases. Finally, we show through quantitative and qualitative evaluations that the proposed methods improve the fairness of SR systems over competitive baselines.