UNDERSTANDING SOURCES OF VARIABILITY IN LEARNING ROBUST DEEP AUDIO REPRESENTATIONS

by

Arindam Jati

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

May 2021

Copyright 2021 Arindam Jati

Dedication

To my parents.

Acknowledgements

I want to express my sincere gratitude to my advisor Prof. Shrikanth (Shri) Narayanan. His knowledge and vision have shaped my thesis work. He has always motivated me to conduct research work that has a significant impact. Without his guidance, I could not have finished my Ph.D. journey.

I want to thank my committee members: Prof. Keith Jenkins and Prof. Aiichiro Nakano. I believe that their feedback and comments will help me in my future endeavors as a researcher.

I want to thank Prof. Panayiotis (Panos) Georgiou for his guidance during the initial years of my Ph.D. His meticulousness and in-depth knowledge have taught me to always strive for excellence in research.

I would like to thank my mentors and colleagues in the USC SAIL Lab. Special thanks to Naveen Kumar and Md Nasir. I would also like to thank all my friends. Special thanks to my friend Pranoy Dutta, for always being there with me.

Finally, I want to thank my parents Biva Jati and Rabindranath Jati, my fiancée Sinjona Pal, and my elder brother Anirban Jati for their endless support, love, and motivation.

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Background on representation learning
    1.2.1 Characteristics of a "good" representation
    1.2.2 Supervision in representation learning
  1.3 Audio representations
    1.3.1 Audio event and acoustic scene representation
      1.3.1.1 Prior work
    1.3.2 Speaker representation
      1.3.2.1 Prior work
      1.3.2.2 Types of speaker recognition frameworks
  1.4 Sources of variability in learning audio representations
    1.4.1 Signal-space variability
    1.4.2 Semantic-space variability
  1.5 Overview of proposed robust audio representation learning methods
    1.5.1 Learning latent data similarity patterns
    1.5.2 Data augmentation
    1.5.3 Adversarial training
    1.5.4 Exploring context
  1.6 Outline

Chapter 2: Signal-space variability I: Environmental noise
  2.1 Introduction
  2.2 Methodology
    2.2.1 Total Variability Model (TVM)
    2.2.2 TDNN and x-vectors
    2.2.3 Discriminative DNN-TVM system
      2.2.3.1 Distribution-free TVM formulation
      2.2.3.2 Discriminative TVM training
      2.2.3.3 Hybrid DNN-TVM model
    2.2.4 Triplet loss
    2.2.5 Multi-task training
  2.3 Experimental Setting
    2.3.1 Datasets and features
    2.3.2 Data Augmentation
    2.3.3 System parameters
      2.3.3.1 i-vector and x-vector baselines
      2.3.3.2 Hybrid DNN-TVM
      2.3.3.3 Triplet and multi-task
    2.3.4 LDA and PLDA scoring
    2.3.5 Fusion of multiple systems
  2.4 Results and Discussions
    2.4.1 Performance of individual systems
    2.4.2 Performance of fused systems
  2.5 Conclusion and Future Directions

Chapter 3: Signal-space variability II: Adversarial perturbation
  3.1 Introduction
  3.2 Background
    3.2.1 Speaker recognition systems
    3.2.2 Adversarial attack
      3.2.2.1 Attack space
      3.2.2.2 Threat model
      3.2.2.3 Transferability
  3.3 Related Work
  3.4 Attack and Defense Algorithms
    3.4.1 Attack algorithms
    3.4.2 Defense algorithms
    3.4.3 Proposed defense algorithm
      3.4.3.1 Feature scattering adversarial training
      3.4.3.2 Proposed methodology
  3.5 Experimental Setting
    3.5.1 Dataset
    3.5.2 Model architectures
      3.5.2.1 1D CNN
      3.5.2.2 TDNN
    3.5.3 Training parameters
    3.5.4 Attack parameters
    3.5.5 Defense parameters
  3.6 Results: Exposition study
    3.6.1 Attack strength vs. SNR
    3.6.2 Attack strength vs. perceptibility
    3.6.3 Main results
    3.6.4 Ablation study 1: Varying attack strength
    3.6.5 Ablation study 2: Analyzing the best defense method
    3.6.6 Ablation study 3: Transferability analysis
    3.6.7 Ablation study 4: Effect of noise augmentation
    3.6.8 Speaker Verification
  3.7 Results: Comparison with proposed defense
  3.8 Conclusion and Future Directions
  3.9 Appendices
    3.9.1 Visualizing spectrograms
    3.9.2 Similarity in misclassification for different attacks
    3.9.3 Network Architecture

Chapter 4: Signal-space variability III: Long-term variability
  4.1 Introduction
  4.2 Dataset
    4.2.1 Acoustic features
    4.2.2 Acoustic scene tracking
    4.2.3 Contextual and user demographic measures
  4.3 Egocentric Analysis of Acoustic Scene Dynamics
    4.3.1 Acoustic scene dynamics
    4.3.2 Relationship of acoustic scene dynamics with individual demographic and behavioral constructs
      4.3.2.1 Multiple comparisons
      4.3.2.2 Relationship with individual demographics
      4.3.2.3 Relationship with individual behavioral constructs
  4.4 Automated Prediction of Temporally Varying Acoustic Scenes
    4.4.1 Processing of acoustic features
      4.4.1.1 In-talk acoustic scene identification
      4.4.1.2 Foreground activity detector
    4.4.2 Modeling segment-level acoustic scene
    4.4.3 Modeling temporal sequence of acoustic scenes
  4.5 Experimental Setting
    4.5.1 Model parameters
    4.5.2 Data subsets
      4.5.2.1 Segment-level experiment
      4.5.2.2 Temporal modeling experiments
    4.5.3 Data splits, cross validation, and model selection
  4.6 Results and Discussion
    4.6.1 Segment-level prediction
    4.6.2 Temporal sequence prediction
    4.6.3 Sequence visualization
    4.6.4 Egocentric analysis with predicted scene sequence
  4.7 Conclusion and Future Directions
  4.8 Appendices
    4.8.1 Demographics and daily routines
    4.8.2 Behavioral constructs
    4.8.3 Detailed results: Egocentric analysis with the sequence of predicted scenes

Chapter 5: Semantic-space variability I: Granularity of annotations
  5.1 Introduction
  5.2 Hierarchical Audio Events
    5.2.1 Dataset
    5.2.2 Motivation to use hierarchy-aware loss
  5.3 Hierarchy-aware Loss on Tree Structured Label Space
    5.3.1 Problem formulation
    5.3.2 Baseline DNN
    5.3.3 Hierarchy-aware Loss
      5.3.3.1 Balanced triplet loss
      5.3.3.2 Quadruplet loss
    5.3.4 Multi-task learning
  5.4 Experimental Setting
    5.4.1 DNN architecture
    5.4.2 Data split
    5.4.3 Features and parameters
  5.5 Results and Discussions
    5.5.1 Classification
    5.5.2 Clustering audio embeddings
    5.5.3 Transfer learning
  5.6 Conclusion and Future Directions

Chapter 6: Semantic-space variability II: Absence of labels
  6.1 Introduction
  6.2 Neural Predictive Coding (NPC)
  6.3 NPC for learning of latent similarity patterns
    6.3.1 Method
      6.3.1.1 Short-term speaker stationarity hypothesis
      6.3.1.2 Training framework
      6.3.1.3 Evaluation on Speaker segmentation
      6.3.1.4 Transfer learning: Unsupervised domain adaptation
    6.3.2 Experiments
      6.3.2.1 Training datasets
      6.3.2.2 DNN architectures
      6.3.2.3 Experimental setting
    6.3.3 Results
      6.3.3.1 Distinctive capability of the speaker embeddings over MFCC
      6.3.3.2 Comparison with other state-of-the-art works
    6.3.4 Discussion
  6.4 NPC for learning latent similarity and dissimilarity patterns
    6.4.1 Method
      6.4.1.1 Contrastive sample creation
      6.4.1.2 Siamese Convolutional layers
      6.4.1.3 Comparing Siamese embeddings – Loss functions
      6.4.1.4 Extracting NPC embeddings for test audio
    6.4.2 Experiments
      6.4.2.1 NPC Training Datasets
      6.4.2.2 Validation data
      6.4.2.3 Data for speaker identification experiment
      6.4.2.4 Data for speaker verification experiment
      6.4.2.5 Feature and model parameters
    6.4.3 Results
      6.4.3.1 Convergence curves
      6.4.3.2 Validation of the short-term speaker stationarity hypothesis
      6.4.3.3 Speaker identification evaluation
      6.4.3.4 Upper-bound comparison experiment
    6.4.4 Discussion
  6.5 Conclusions and future directions

Chapter 7: Summary and Future works
  7.1 Summary
  7.2 Future works
    7.2.1 Audio event representation learning in arbitrary label-space hierarchy
      7.2.1.1 Proposed idea
    7.2.2 Temporal dynamics of speaker embeddings in egocentric audio recordings

Bibliography

List Of Tables

2.1 Performance of different systems on VOiCES development (upper sub-table) and evaluation (lower sub-table) sets. The check marks (✓) indicate the systems that were submitted for official evaluation. The last two rows (of both sub-tables) indicate systems after score fusion.

3.1 Different attacks on a speaker recognition system, and performance of different defense methods. "Benign accuracy" denotes accuracy on clean samples, and "Adversarial accuracy" denotes accuracy on adversarial samples generated with different attack algorithms. "AT" stands for Adversarial Training (the defense method described in Section 3.4.2). Accuracy is on a scale of [0, 1].

3.2 Transferability of adversarial samples between different models. The adversarial samples are generated with the "source" model, but seem to be effective against the "target" model as well. ε = 0.002 is employed for this experiment. Accuracy is on a scale of [0, 1].

3.3 Effect of training data augmentation with white Gaussian noise. ε = 0.002 is employed for this experiment. Accuracy is on a scale of [0, 1].

3.4 Recall rates of our model in speaker verification settings. For this experiment, we employ FGSM with 10 random initializations, which is found to be stronger than the vanilla FGSM employed in the experiments of Section 3.6.3.

3.5 Accuracy comparison of the proposed defense with other baseline defenses under untargeted, white-box attacks with ε = 0.002. The columns under the PGD and CW attacks denote the number of attack iterations.

3.6 The network architecture of the CNN model we used in this paper. It takes the waveform as input and predicts class logits. In all of the CNNs, the number of padded points (two-sided) is k/2.

4.1 Abbreviation and description of different measures of acoustic scene dynamics incorporated in the egocentric analysis. All of the following measures are computed on a sequence of acoustic scenes.

4.2 Details of the sequences mined from audio recordings. The sequence lengths are in terms of the total number of constituent audio segments.

4.3 Performance of different models in segment-level acoustic scene classification.

4.4 Performance of different models in predicting the temporal sequence of acoustic scenes for different context durations. "Segment-level" denotes the performance of the TDNN small model aggregated over all 5 second windows (it does not employ temporal information). "Temporal" denotes the performance of the two-stage model (TDNN embeddings + GRU) which learns and utilizes the temporal pattern.

4.5 Mean Absolute Error (MAE) between the measures of temporal dynamics computed with true scene labels and predicted scene labels.

4.6 Total number of constructs with statistically significant outcome in the egocentric analyses performed with the predicted scene labels. The number inside the parenthesis is the number of constructs that are also observed in the same statistical test with the true labels. The rows with True Labels are for comparison purposes.

4.7 Abbreviation, description, and data type of different demographic and daily routine information incorporated for egocentric analysis.

4.8 Abbreviation, description, domain, and data type of different behavioral constructs incorporated for egocentric analysis.

5.1 Audio event classification accuracies of different models at coarse and fine levels of the label hierarchy (CE = Cross Entropy).

5.2 Clustering evaluation of the test embeddings generated by different models at coarse and fine levels of the label hierarchy (see Section 5.5.2 for metric acronyms).

5.3 Classification accuracies for AEC on the Greatest Hits dataset at coarse and fine levels of the label hierarchy.

6.1 Training datasets and corresponding DNN architectures employed.

6.2 (Rows 1-6) Comparison of the proposed method with some baseline methods (which use MFCC) on different datasets. EER is reported in (MDR, FAR) format. The best two EERs are in boldface. (Row 7) The last row shows the absolute improvements in mean EER (mean(MDR, FAR)) over the best baseline method obtained by Speaker2Vec models, averaged across all datasets.

6.3 Comparison of the proposed method with some state-of-the-art papers along with the characteristics (duration and actual number of change points) of the artificial dialogs they created from the TIMIT dataset for evaluation. The best two results are in boldface.

6.4 NPC Training Datasets.

6.5 Frame-level speaker classification accuracies of different features with a kNN classifier (k=1). All features below are trained on unlabeled data except i-vector, which requires speaker-homogeneous files.

6.6 Utterance-level speaker classification accuracies of different features with a kNN classifier (k=1). Red italics indicates the best performing single feature classification result while bold text indicates the best overall performance.

6.7 Speaker verification on VoxCeleb v1 data. i-vector and x-vector use the full utterance in a supervised manner for evaluation while the proposed embedding operates at the 1 second window with simple statistics (mean+std) over an utterance.

List Of Figures

1.1 Multiple sources of variability in an audio representation learning pipeline.

3.1 Mean SNR and PESQ score of the test adversarial samples generated by different l∞ attack algorithms at different strengths.

3.2 Ablation study 1: Varying the strength (ε for the three l∞ attacks, and the corresponding strength parameter for the Carlini l2 attack) in different attack algorithms, and performance of different defense methods.

3.3 Performance of PGD-10 adversarial training against the PGD attack at different strengths, and with different numbers of iterations.

3.4 Mel-spectrograms of an original utterance and its perturbed versions under different l∞ attacks at varying strengths.

3.5 Similarity (on a scale of [0, 1]) between wrong predictions made by the model for different attacks.

4.1 An illustrative schematic of the hospital acoustic scenes: nursing station, patient room, lab, lounge, and medication room. Every scene has its own sources of sound events, and thus unique acoustic characteristics. Note that all the acoustic scenes might have more than one instance (e.g., multiple patient rooms). A nurse (red) might experience several acoustic scenes in a certain work shift, while a lab technician (green) might experience fewer acoustic scenes due to relatively lesser mobility. Owl is the Bluetooth hub recording the location context of the user, and they are installed in different places having different acoustic scenes. Jelly is the wearable device that captures audio features, and assists Owl in registering the location of the user. The sequence of acoustic scenes a user encounters is derived from the location information captured by multiple Owls. The figure is best viewed in color.

4.2 p-values obtained from the Kruskal-Wallis hypothesis tests between scene dynamics (horizontal axis), and individual daily routines and demographics (vertical axis). The multiple comparison problem is corrected by the Benjamini-Hochberg procedure. All the indicated p-values are statistically significant, and correspond to cases where the null hypothesis is rejected. Cases with p < 0.001 are shown as 0 for clearer visualization. Empty cells denote observations that fail to significantly reject the null hypothesis as determined by the Benjamini-Hochberg procedure. Please see Sections 4.3.2.1 and 4.3.2.2.

4.3 Spearman's correlation between scene dynamics (horizontal axis) and individual behavioral constructs (vertical axis). The multiple comparison problem is corrected by the Benjamini-Hochberg procedure. Empty cells denote zero correlations or statistically insignificant correlations. All indicated nonzero correlations are statistically significant as determined by the Benjamini-Hochberg procedure. Please see Sections 4.3.2.1 and 4.3.2.3. Best viewed in color.

4.4 A two-stage modeling framework for identifying a sequence of in-talk acoustic scenes. Left: The acoustic feature stream is masked with the output of the foreground detection model. This keeps the portions of the stream when there is a possible foreground activity, which are then segmented in windows of fixed length, T_s. Middle: The segment-level TDNN model takes a segment of length T_s, and learns to predict the corresponding acoustic scene. Right: A GRU model is then trained on top of the segment-level embeddings for learning the sequence of acoustic scenes.

4.5 Mining sequences of acoustic features and scene labels for temporal modeling and evaluation.

4.6 Visualization of 4 true (red circles, no line) and predicted (black dots and dotted lines) sequences. The accuracy value shown in each subplot indicates the prediction accuracy for that particular sequence.

4.7 Word clouds showing outcomes of the egocentric analysis of daily routine and demographics performed with model predictions. Predictions of the segment-level model and the proposed two-stage temporal model are compared at different context durations, d_s. Green: the construct is also present for true labels. Red: not present for true labels. Word size: proportional to the total number of measures of dynamics having a significant outcome.

4.8 Word clouds showing outcomes of the egocentric analysis of behavioral constructs performed with model predictions. Predictions of the segment-level model and the proposed two-stage temporal model are compared at different context durations, d_s. Green: the construct is also present for true labels. Red: not present for true labels. Word size: proportional to the total number of measures of dynamics having a significant outcome.

5.1 An example of the hierarchical audio event class structure in our dataset.

6.1 Training framework and DNN architecture.

6.2 ROC curves obtained by different algorithms on TED-LIUM evaluation data.

6.3 NPC training scheme utilizing the short-term speaker stationarity hypothesis. Left: Contrastive sample creation from an unlabeled dataset of audio streams. Genuine and impostor pairs are created from the unlabeled dataset. Right: The siamese network training method. The genuine and impostor pairs are fed into it for binary classification. "FC" denotes a Fully Connected hidden layer in the DNN.

6.4 The DNN architecture employed in each of the siamese twins. All the weights are shared between the twins. The kernel sizes are denoted under the red squares. 2×2 max-pooling is used as shown by the yellow squares. All the feature maps are denoted as N@x×y, where N = number of feature maps and x×y = size of each feature map. The dimension of the speaker embedding is 512. "FC" = Fully Connected layer.

6.5 Binary classification accuracies of classifying genuine or impostor pairs for NPC models trained on the Tedlium, Tedlium-Mix, and YoUSCTube datasets. Both training and validation accuracies are shown. The best validation accuracies for all the models are marked by big stars (*).

6.6 t-SNE visualizations of the frames of different features for the Tedlium test data containing 11 speakers (2 utterances per speaker). Different colors represent different speakers.

Abstract

The audio signal carries a multitude of latent information about phoneme and language, speaker identity, audio events and acoustic scene, noise, and other channel-specific characteristics. Learning a fixed dimensional vector representation or embedding of the audio signal that captures a subset of those information streams can be useful in numerous applications. For example, an audio representation that captures information about the underlying acoustic scene or environment can help us build smart devices that provide context-aware and personalized user experiences. Audio embeddings carrying speaker-specific characteristics can be utilized in speaker identity-based authentication systems.

A typical deep neural network-based audio representation learning system can encounter a diverse set of variabilities that span the input signal-space (such as environmental noise, long-term variability, and deliberate perturbation through adversarial attacks) and the output semantic-space (granularity of annotations, absence of annotations).

In this thesis, we investigate multiple sources of variability in an audio representation learning system, and leverage that knowledge to learn robust deep audio representations by exploring context information and latent similarity patterns of the data. The proposed methods have been successfully applied in several cutting-edge applications such as audio event recognition, egocentric acoustic scene characterization, and speaker recognition in noisy conditions and under adversarial attacks. The extensive experimental results demonstrate the efficacy of the learned embeddings for audio representation learning.

Chapter 1: Introduction

1.1 Motivation

The audio signal can provide a multitude of latent information including phoneme or lexical content, speaker-specific characteristics, the speaker's emotional and behavioral traits, and channel-specific characteristics often defined by active audio events and noise sources. Over the past few years, researchers have invented application-specific technologies that capture a subset of these information streams. Automatic speech recognition [160, 203], speaker recognition [27, 74], audio event detection and identification [188, 66, 78], ambient acoustic scene classification [188, 130, 45], and characterization of a speaker's emotional and behavioral state [142] are some of the most thriving research areas in the audio signal processing domain.

Because of this inherently convoluted nature, it is becoming important to learn application-specific "representations" or "embeddings" of the audio signal. Intuitively, a fixed dimensional vector representation (more details in Section 1.2) of the variable length audio signal tries to retain the specific stream of information that subsequently helps a certain application-oriented task. For example, a speaker recognition system [27, 74] would try to extract and retain only the speaker-specific information, and presumably remove channel characteristics. On the other hand, an audio event recognition system [78] would try to capture information related to the specific audio event(s) present in the signal.
A major concern in learning a suitable application-specific audio representation lies in the amount of different types of variability present in the entire "learning system" or "pipeline". A simple example of variability is the environmental noise that generally contaminates the sound signal during propagation. Understanding the source(s) of variability can help us devise modeling techniques that, in turn, can assist in learning robust audio representations.

In this work, we investigate and categorize multiple sources of variability in an audio representation learning system, and leverage that knowledge to learn robust audio representations by exploring context information and latent similarity patterns of the data. The proposed methods have been applied successfully in several cutting-edge applications such as audio event recognition, egocentric acoustic scene characterization, and speaker recognition in noisy conditions and under adversarial attacks.

The following sections in this chapter provide a background on representation learning, different types of audio representations of interest, sources of variability in an audio representation learning system, an overview of the proposed audio representation learning methods, and the outline of the thesis.

1.2 Background on representation learning

The difficulty of a machine learning algorithm heavily depends on how the data is presented to it [71]. High dimensional real-life data tend to lie on a comparatively lower dimensional space or manifold [71, 18]. Representation learning deals with finding that lower dimensional space. Generally, a good representation makes the subsequent task of classification or regression easier [71, 18]. In this work, we focus on learning a fixed dimensional vector representation of a variable length audio signal that is suitable for some specific application-oriented task. We focus on representation learning by Deep Neural Networks (DNN) [71]; the representations are often termed "deep" representations.

In all our experiments, we work on low-level descriptive features of the audio signal, such as the mel spectrogram or MFCCs [52] (the type of feature might change depending on the application, and it will be mentioned in the corresponding chapter). If we denote a variable length audio recording (the low-level features, each a vector of length D) by X ∈ R^{D×T}, a representation learning algorithm learns a (non)linear mapping to project it to a K-dimensional representation or embedding e ∈ R^K, such that the latter helps a subsequent prediction task:

    e = f(X; θ)    (1.1)

where f(·; θ) is modeled by a neural network parameterized by θ. Generally, e is obtained from a hidden layer of the neural network after the training is completed, and the training objective(s) can vary depending on the availability of labels and the learning strategy.
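A minimal sketch of Equation (1.1) is given below, assuming PyTorch. The tiny encoder (1-D convolutions followed by temporal average pooling) is purely illustrative and is not the architecture used in later chapters; it only shows how a variable length feature matrix X is mapped to a fixed K-dimensional embedding e taken from a hidden layer.

```python
# Illustrative sketch of Eq. (1.1): map a variable-length feature matrix X (D x T)
# to a fixed K-dimensional embedding e. PyTorch is assumed; the encoder layout and
# layer sizes are hypothetical, not the models used in later chapters.
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, feat_dim=40, emb_dim=128, num_classes=100):
        super().__init__()
        # frame-level processing over the time axis
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.embedding = nn.Linear(256, emb_dim)            # hidden layer whose output is e
        self.classifier = nn.Linear(emb_dim, num_classes)   # training head (when labels exist)

    def forward(self, x):                 # x: (batch, D, T); T may vary across batches
        h = self.frame_layers(x)          # (batch, 256, T)
        pooled = h.mean(dim=2)            # temporal pooling -> fixed-size vector
        e = self.embedding(pooled)        # e = f(X; theta), Eq. (1.1)
        return e, self.classifier(e)

X = torch.randn(8, 40, 300)               # 8 recordings, 40-dim features, 300 frames
e, logits = EmbeddingNet()(X)
print(e.shape)                             # torch.Size([8, 128])
```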
1.2.1 Characteristics of a "good" representation

As introduced before, the goodness of a representation generally depends on the subsequent learning task. Here we list a few characteristics of a "good" representation/embedding that will be relevant to the tasks addressed in this work. A more detailed explanation can be found in [18].

• Expressiveness: The learned "deep" representations are generally considered as (or required to be) distributed representations, i.e., a representation vector of reasonable size can accommodate an exponentially large number of input configurations.

• Disentanglement: The underlying true distribution of high dimensional data is generally composed of multiple explanatory factors. A useful representation, for a particular task, generally aims to disentangle the factors of variation. This characteristic is relevant to all the embeddings described in this work. For example, in Chapter 2, the speaker embedding will be trained to disentangle channel factors such as noise and reverberation.

• Natural clustering: Probability mass tends to concentrate around low dimensional manifolds [18], and different semantic classes are typically related to different manifolds. A good representation generally learns "compact clusters", i.e., embeddings from the same semantic class should lie in close proximity, and embeddings from two different semantic classes should be far away from each other in terms of some distance metric (usually Euclidean) on the embedding space. This also necessitates the invariance of a representation to local variations inside a semantic class. Chapter 5 will perform a clustering evaluation of the proposed audio event embeddings, and show them to be more compact than the baselines (a simple way to quantify this compactness is sketched after this list).

• Hierarchical organization: The semantic labels annotated by human annotators generally follow a hierarchical ontology. Data with more abstract labels tend to be at the top of the hierarchy. Representations that capture the hierarchical relationship are sometimes needed for fine-grained classification or retrieval. In Chapter 5, we will demonstrate how exploring the similarity of data samples can create audio representations that conform to the hierarchical label-space structure.

• Cross-task transferability: Representations learned with one objective (task) sometimes transfer to another related task. This cross-task transferability generally requires some underlying shared factors that are common to both tasks. In Chapter 3 and Chapter 6, the cross-task transferability will be evident from our experiments on speaker identification and verification.

• Simplicity: Often, in a good representation, the underlying factors are associated with each other through simple, linear relationships. This helps the subsequent task, since only a linear classifier/predictor is needed for the downstream application. In Chapter 6, we will show that a simple kNN classifier can achieve satisfactory performance in speaker identification when working on top of the proposed speaker representations.

• Temporal coherence: Different semantic information can change at different rates over a sequential input. Good representations can be learned by concentrating on a suitable temporal context of the input sequence during learning. This will be evident in Chapter 4 and Chapter 6.
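The sketch below illustrates one simple way to quantify the "natural clustering" property: compare the mean intra-class and inter-class Euclidean distances of a set of embeddings. This is only an illustration using NumPy; Chapter 5 reports its own clustering metrics.

```python
# Illustrative check of "natural clustering": mean intra-class vs. inter-class
# Euclidean distance. Not the metrics used in Chapter 5; just a quick diagnostic.
import numpy as np

def compactness(embeddings, labels):
    """embeddings: (N, K) array, labels: (N,) integer class ids."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dists[same & off_diag].mean()   # distances within a class
    inter = dists[~same].mean()             # distances across classes
    return intra, inter                     # compact clusters: intra << inter

emb = np.random.randn(200, 128)
lab = np.random.randint(0, 10, size=200)
print(compactness(emb, lab))
```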
1.2.2 Supervision in representation learning

A representation learning algorithm might incorporate different levels of human supervision during training. Here, supervision indicates human annotated labels of the audio signals, such as the speaker identity or the audio event class of a particular audio recording. We address two levels of supervision in different applications:

• Supervised: Here, we assume the availability of human annotated labels for all the training samples. A simple example is a DNN-based classification framework. The last layer of the DNN acts as a softmax regression classifier, and the remaining earlier layers learn a representation that is suitable for the classification task [71]. Similar to several existing works [184, 183, 175], the output of a hidden layer (most commonly, the penultimate layer) of the trained neural network is utilized as the embedding.

• Self-supervised: Self-supervised learning [57, 58] is a type of unsupervised learning where the data itself is utilized to generate the supervision. Here, we do not incorporate any human annotated labels in the training, and the training is typically performed on an out-of-domain unlabeled dataset. The learned representation is subsequently tested on multiple downstream tasks in different evaluation datasets.

1.3 Audio representations

We study two major audio representation learning tasks, and investigate multiple sources of variability that span across those tasks.

1.3.1 Audio event and acoustic scene representation

An audio event (e.g., a door closing sound) indicates a sound having a unique auditory characteristic [66]. On the other hand, an acoustic scene (e.g., an office room) is generally composed of multiple constituent audio events, and it denotes a specific ambient environment. Audio Event Classification (AEC) [66, 78, 93, 163, 44] and Acoustic Scene Classification (ASC) [188, 130, 45] are becoming increasingly popular due to their numerous applications in accessibility [92, 4], surveillance [198, 9], audio content indexing and retrieval [207, 88], advanced gaming, and even health monitoring systems [75]. Detection and classification of audio events and/or acoustic scenes directly or indirectly necessitates the learning of semantically meaningful representations.

1.3.1.1 Prior work

One classical approach [10] for AEC and ASC is GMM fitting on a "bag of frames" modeling of the audio recording and applying a KL divergence measure between different GMM models. MFCC features are popular for this kind of approach. Chu et al. [44] proposed a matching pursuit technique for finding useful time-frequency features to complement MFCC features, and obtained improved performance. Later, time series models such as HMMs were employed to better use context over time [214]. A detailed survey of non-deep learning approaches can be found in [188]. Recently, Deep Neural Networks (DNN) [71] have shown promise for AEC. A DNN based approach was shown to outperform GMM in [67] for classifying 61 audio classes. In [210], a Convolutional Neural Network (CNN) was shown to extract robust features for a noisy AEC task. Some other works [205, 190] also showed the potential of CNNs in extracting robust features directly from the spectrogram for AEC. In 2017, Google released an AEC corpus, AudioSet [66], containing 1.8M 10-second excerpts from YouTube videos, which is much larger than the previous datasets. Again, CNN based models like ResNet-50 [76] gave quite satisfactory performance (0.959 AUC) [78] for classifying 485 audio classes on AudioSet.

Detecting the acoustic scene might be helpful to an audio event detection system as well, since the former can provide prior information about which audio events might be present in the particular acoustic scene [14]. For example, a typical office acoustic scene might consist of audio from human speech, keyboards, phones, an air conditioner vent, etc.

1.3.2 Speaker representation

Learning speaker-specific characteristics is an important step in numerous applications like speaker segmentation and diarization [107, 6, 195], speaker recognition (verification and identification) [36, 74, 183, 27], and automatic speech recognition [160, 203].
Most of the above applications involve learning a robust speaker representation that preserves speaker-specific characteristics in the embedding, and removes channel characteristics and other nuisance factors as much as possible.

1.3.2.1 Prior work

Speaker representation learning is a very challenging problem due to the highly complex information that the speech signal modulates, from lexical content to emotional and behavioral attributes [83, 142], and the multi-rate encoding of this information. A major step towards speaker modeling is to identify features that focus only on the speaker-specific characteristics of the speech signal. Most of the prior works use short-term acoustic features [74] like MFCC [52] or PLP [77] for signal parameterization. In spite of the effectiveness of the algorithms used for building speaker models [27] or clustering speech segments [6], sometimes these features fail to produce high between-speaker variability and low within-speaker variability [74]. This is because MFCCs contain a lot of supplementary information like phoneme characteristics, and they are frequently deployed in speech recognition [160].

Significant research effort has gone into solving the above mentioned discrepancies of short-term features by incorporating long-term or prosodic features [177] into existing systems. These features can specifically be used in speaker recognition or verification systems since they are calculated at the utterance level [74]. Another way to tackle the problem is to calculate mathematical functionals or transformations on top of MFCC features to expand the context and project them onto a "speaker space" which is supposed to capture speaker-specific characteristics. One popular method [162] is to build a GMM-UBM [74] on training data and utilize MAP adapted GMM supervectors [162] as fixed dimensional representations of variable length utterances. Along this line of research, there has been ample effort in exploring different factor analysis techniques on the high dimensional supervectors to estimate the contributions of different latent factors like speaker- and channel-dependent variabilities [100]. Eigenvoice and eigenchannel methods were proposed by Kenny et al. [101] to separately determine the contributions of speaker and channel variabilities, respectively. In 2007, Joint Factor Analysis (JFA) [99] was proposed to model speaker variabilities and compensate for channel variabilities, and it outperformed the former technique in capturing speaker characteristics.

Introduction of i-vectors: In 2011, Dehak et al. proposed i-vectors [55] for speaker verification. The i-vectors were inspired by JFA, but unlike JFA, the i-vector approach trains a unified model for speaker and channel variability. One inspiration for proposing the Total Variability Model (TVM) [55] was the observation that the channel effects obtained by JFA also carried speaker factors. The mathematical details of the TVM will be briefly described in Section 2.2.1. The i-vectors have been used by many researchers for numerous applications including speaker recognition [55, 53], diarization [176, 59], and speaker adaptation during speech recognition [174] due to their state-of-the-art performance. However, the performance of i-vector systems tends to deteriorate as the utterance length decreases [98], especially when there is a mismatch between the lengths of training and test utterances.
Also, i-vector modeling, similar to most factor analysis methods, is constrained by the GMM assumption, which might degrade the performance in some cases [74].

DNN-based methods in speaker representation learning: Recently, Deep Neural Network (DNN) [71] derived "speaker embeddings" [168] or bottleneck features [80] have been found to be very powerful for capturing speaker characteristics. For example, in [206, 200] and [68], frame-level bottleneck features have been extracted using DNNs trained in a supervised fashion over a finite set of speakers, and aggregation techniques like GMM-UBM [162] or i-vectors have been used on top of the frame-level features for utterance-level speaker verification. Chen et al. [38, 37] developed a deep neural architecture and trained it for a frame-level speaker comparison task in a supervised way. They achieved promising results in speaker verification and segmentation tasks even when they evaluated their system on out-of-domain data [38]. In [185, 184], the authors proposed an end-to-end text-independent speaker verification method using DNN embeddings. It uses a similar approach to generate the embeddings, but the utterance-level statistics are computed internally by a pooling layer in the DNN architecture. In more recent work [116], different combinations of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [71] have been exploited to find speaker embeddings using the triplet loss function, which minimizes intra-speaker distance and maximizes inter-speaker distance [116]. The model also incorporates a pooling and normalization layer to produce utterance-level speaker embeddings.

1.3.2.2 Types of speaker recognition frameworks

Here we describe the two types of speaker recognition frameworks that will be employed in multiple experiments. Speaker recognition systems can be developed either for identification or verification [74] of individuals from their speech. In a closed set speaker identification scenario [74, 90], we are provided with train and test utterances from a set of unique speakers.
12 1 Audio Processing, Machine Learning True labels Noise and Reverberation Long-term temporal variability Granularity of annotation “Alarm” vs. “Fire Alarm” Full / Partial absence of annotations Deliberate perturbation (Security) Signal space variability Semantic space variability Figure 1.1: Multiple sources of variability in an audio representation learning pipeline. 1.4.2 Semantic-space variability Semantic-space variability refers to the variations in the human annotated label-space. It encompasses the variability that arises due to the abstractness of annotations. This can aect the cluster compactness of the learned representations. For example, a repre- sentation learned for an audio event \alarm" might not provide sucient compactness for a test audio from the class \re alarm". Another form of semantic-space variation deals with partial/full unavailability of annotations. This necessitates the incorporation of dierent learning techniques, particularly self-supervision as described in Section 1.2.2. 13 1.5 Overview of proposed robust audio representation learning methods The proposed robust audio representation learning algorithms are based on the following four pillars. The detailed description of the algorithms will be provided in the following chapters. 1.5.1 Learning latent data similarity patterns As introduced earlier, nding the similarity relationships between dierent samples from the same semantic class might play a crucial role in representation learning. Here we focus on metric learning algorithms [42, 73, 175]. In general, distance metric learning involves learning the embedding, e (Equation 1.1) such that samples from the same semantic class come closer in the embedding space or manifold. Various metric learning algorithms have been proposed in the past like pairwise contrastive learning [42, 73] for signature verication, and triplet learning [175] on applications like face verication. Most of the existing metric learning techniques are generally supervised in nature, and thus, they require labels for training. We study the eect of metric learning in obtaining useful representations of audio. Furthermore, we extend metric learning to an self-supervised framework where we aim to learn speaker-specic representations by exploring contextual information present in the audio signal (chapter 6). 14 1.5.2 Data augmentation Data augmentation refers to articially perturbing (with a reasonable amount) the input data so that the deep neural network can be trained on more data. It is often helpful to build robust and more generalized machine learning models. This behavior will also be observed in our setting for training the speaker representation learning models. 1.5.3 Adversarial training Adversarial training can be intuitively thought as an \online" data augmentation tech- nique where the noise is optimally created on-the- y to increase the model's loss, and then, the model is trained again on the noisy samples to decrease the loss. The training continues as a min-max game. More details will be provided in subsequent chapters. 1.5.4 Exploring context As the audio signal is composed of multitude of information, these information streams change at dierent temporal rates. For example, the spoken phonemes in a speech signal can change very fast, and the speaker itself might change comparatively slowly. The background acoustic environment can change even slower. Exploring the audio signal with suitable context length can help capturing the required information stream(s). 
We will explore both supervised and self-supervised methods to explore context for learning distinctive audio representations. 15 1.6 Outline • Chapter 2: A robust supervised speaker embedding learning method by exploring data similarity/dissimilarity patterns is proposed here. This chapter deals with signal-space variabilities like noise and reverberation. • Chapter 3: This chapter will propose a robust speaker representation learning method that explores data similarity patterns and performs adversarial training to build speaker recognition systems resilient to strong adversarial attacks (a signal- space variability). • Chapter 4: This chapter will show that long-term variations in audio represen- tations can be modeled by exploring context and modeling via a combination of convolutional and recurrent neural networks. • Chapter 5: This chapter deals with a semantic-space variability: granularity of annotations. Here we propose hierarchy-aware representation learning of audio events that can provide varying cluster compactness for samples from dierent levels of semantic granularity or abstractness. • chapter 6: This chapter describes how we can learn speaker representation from completely unlabeled data (i.e., a semantic-space variability: absence of labels) by exploring contextual similarities. We evaluate the learned embeddings in speaker segmentation and recognition tasks. • chapter 7: This chapter summarizes the thesis, and describes some possible future directions in audio representation learning. 16 Chapter 2 Signal-space variability I: Environmental noise This chapter studies the robustness and generalizability of speaker representations when the audio signal is contaminated with environmental noise and reverberation. Gener- ally, state-of-the-art text-independent speaker representation learning methods [183, 184] utilize a training dataset (e.g., Voxceleb [139, 47]) having multiple audio recordings (or utterances) from a large set of speakers. The training employs a speaker classier (a DNN trained with cross entropy loss), and the speaker embedding is extracted from a hidden layer of the trained network. This approach does not directly work on the embedding space to reduce intra-speaker embedding distance, and increase inter-speaker embedding distance. This can result in a less compact speaker representation. To address that is- sue, we propose a novel method that incorporates similarity-driven metric learning [95] along with extensive data augmentation. The learned representations are evaluated in a Speaker Verication (SV) experiment (see Section 1.3.2.2). 17 2.1 Introduction The performance of a speaker verication system can deteriorate due to mismatch between training, enrollment and test environments [74, 141]. Data inconsistencies may arise due to channel conditions, noise, background speakers, and microphone placement (near- vs. far-eld speech) and associated reverberation [74, 96]. One way to tackle the problem is to obtain a speaker representation or embedding [128] that is robust and invariant to dierent channel and noise conditions. The present work focuses on that approach. \The VOiCES from a Distance Challenge 2019" [140] was organized to benchmark state-of-the-art technologies for speaker verication from single channel far-eld speech in noisy conditions. The evaluation data was curated from the VOiCES corpus [164], which includes speech data recorded under challenging acoustic environments. 
On the other hand the training data for the xed condition [140] consisted of three \in the wild" datasets, and was not guaranteed to be from similar acoustic conditions (more details in Section 2.3.1). This paved a way to develop robust speaker embedding systems and evaluate them on realistic far-eld speech with natural reverberation [140]. Some of the earlier works [62, 171] for single channel far-eld speaker verication fo- cused on designing robust features. Avila et al. [12] analyzed performance degradation of classical GMM-UBM [162] and i-vector (employs the Total Variability Model (TVM)) [55] systems in far-eld condition, and proposed multi-condition training with dierent rever- beration levels to address the problem (please see Section 1.3.2 for details about i-vectors). Similarly, a multi-condition approach was adopted in [64] for training a Gaussian PLDA model in i-vector space. Snyder et al. [184] introduced x-vectors which employed dierent 18 types of articial augmentation to train a robust speaker embedding using a Time Delay Neural Network (TDNN) based speaker classication model [183]. Nandwana et al. [141] analyzed the performance of the x-vector and i-vector systems for far-eld noisy SV task on SRI distant speech collect [141] and VOiCES [164] datasets. X-vectors were shown to have superior performance. In the present work, we try to address the noisy and far-eld SV task from two dierent angles: potentially nd a better model to transform a variable length utterance into a xed dimensional embedding, and employ a loss function that directly works on the embedding space to reduce intra-speaker distance and increase inter-speaker distance irrespective of channel conditions. The main contributions of this work are the following: 1. We employ a newly proposed [196] hybrid discriminative DNN-TVM system which leverages the strength of both systems. Specically, it exploits the strength of DNN to project the input feature into a distinctive sequence of vectors, and utilizes TVM to obtain a xed dimensional embedding. 2. We implement a multi-task training scheme with cross entropy and triplet [175] losses to circumvent the deciencies of training with only cross entropy loss as observed in computer vision domain [175]. To the best of our knowledge, exploration of this multi-task training has not been done in the past for speaker recognition. 2.2 Methodology Total Variability Modeling (TVM) and i-vectors have been the state-of-the-art in speaker recognition for a long time. They were originally proposed as unsupervised generative 19 latent variable model [55]. But, when a large amount of labeled data is available, gen- erative models can have inferior performance compared to discriminative models having very high degrees of freedom. The recent success of x-vectors [184] is an example of that. Moreover, the Gaussian Mixture Model (GMM) assumption on the features might put more constraints on TVM than the distribution-free nature of DNN models. The motivations in developing a discriminative multi-task hybrid DNN-TVM are: 1. It removes the GMM assumption on feature vectors as employed in conventional TVM, and provides a distribution-free formulation [197]. 2. TDNNs [155, 183, 184] are found to be good in exploiting temporal context, and training them is generally faster than training recurrent neural networks. We can utilize the strength of TDNN models and deploy them as a feature transformer. 3. 
3. The global statistics pooling of the conventional x-vector model can be replaced by a TVM model, and the hybrid model can be trained end-to-end with any discriminative objective function, e.g., cross entropy loss.
4. Moreover, such systems can be trained on multiple discriminative tasks instead of only speaker classification, as long as the tasks are complementary to each other.
2.2.1 Total Variability Model (TVM)
Let us denote an utterance X of length T as
X = \{x_0, x_1, \ldots, x_{T-1}\}    (2.1)
where x_t \in \mathbb{R}^D is a feature vector at time t. In TVM [55], X is represented by a speaker- and channel-dependent GMM mean supervector M \in \mathbb{R}^{CD}, where C is the number of Gaussian components in the GMM. M can be expressed as
M = m + Tw    (2.2)
where m is the speaker- and channel-independent GMM mean supervector of the Universal Background Model (UBM), T \in \mathbb{R}^{CD \times K} is a low-rank rectangular total variability matrix, and w \in \mathbb{R}^K is the i-vector for that utterance. w has a standard normal prior distribution:
w \sim \mathcal{N}(0, I)    (2.3)
Conventionally, TVM is set up as a Maximum Likelihood Estimation (MLE) problem and trained with the Expectation Maximization (EM) algorithm.
2.2.2 TDNN and x-vectors
The model for x-vectors [184] comprises a TDNN at the lower layers followed by global statistics pooling. The TDNN layers transform the input utterance X into a sequence of vectors:
G = h(X) = \{g_0, g_1, \ldots, g_{T'-1}\}    (2.4)
Then the pooling layer computes statistics (mean and standard deviation) of G and concatenates them to generate a fixed-dimensional vector, which is then projected onto an embedding layer:
w_{\text{x-vector}} = A \, [\, E(G)^\top \;|\; S(G)^\top \,]^\top + b = f(X)    (2.5)
where E and S denote the sample mean and standard deviation respectively, and f(\cdot) is a trainable nonlinear function denoting the part of the DNN model from the input to the embedding layer. The embedding w_{\text{x-vector}} is fed to a shallow classifier network with softmax outputs. The full model is trained end-to-end with cross entropy loss. At test time, w_{\text{x-vector}} is used for scoring.
2.2.3 Discriminative DNN-TVM system
2.2.3.1 Distribution-free TVM formulation
Travadi et al. [197] showed that the GMM assumption on the features can be removed from the TVM formulation if Baum-Welch statistics of the features, instead of the features themselves, are used as variables in the model likelihood function. The zeroth- and first-order Baum-Welch statistics are given by:
N_c = \sum_{t=0}^{T-1} \gamma_{tc} \quad \text{and} \quad F_c = \frac{1}{N_c} \sum_{t=0}^{T-1} \gamma_{tc} \, x_t    (2.6)
Here, \gamma_{tc} = p(c \mid x_t) is the posterior probability that the t-th feature vector belongs to the c-th Gaussian component of the GMM. It was shown in [197] that the statistics F_c asymptotically follow a normal distribution irrespective of the feature distribution. More importantly, the posterior probability \gamma_{tc} is not limited to come from a GMM, but can be any positive function, \gamma_{tc} = \phi_c(x_t) with \phi_c : \mathbb{R}^D \to [0, \infty).
2.2.3.2 Discriminative TVM training
The MLE of TVM assumes the data (features) were generated from a GMM distribution and tries to find the model parameters that best explain the data. The distribution-free TVM formulation makes no assumption about how the features were generated, and is thus best viewed as a trainable function from the features to the i-vector. In this case, the TVM parameters can be trained using cross entropy loss with the help of speaker labels. Discriminative training of TVM was proposed in [69], where the authors adopted a numerical optimization algorithm to train the parameters and found superior performance compared to generative TVM.
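To make the statistics of Equation (2.6) concrete, the following is a minimal NumPy sketch of the Baum-Welch pooling step, assuming frame-level features and soft posteriors are already available as arrays; the array shapes and the softmax posterior used here are illustrative, not the thesis implementation.

```python
import numpy as np

def baum_welch_stats(x, gamma, eps=1e-8):
    """Zeroth- and first-order Baum-Welch statistics (Equation 2.6).

    x     : (T, D) array of frame-level feature vectors.
    gamma : (T, C) array of frame posteriors, gamma[t, c] = p(c | x_t);
            any positive soft-assignment function can be used here.
    Returns N of shape (C,), F of shape (C, D), and the stacked first-order
    statistics as a single supervector of shape (C*D,).
    """
    N = gamma.sum(axis=0)                       # N_c = sum_t gamma_tc
    F = (gamma.T @ x) / (N[:, None] + eps)      # F_c = (1/N_c) sum_t gamma_tc x_t
    return N, F, F.reshape(-1)

# Toy usage: T=200 frames, D=60 features, C=8 soft clusters (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 60))
logits = rng.normal(size=(200, 8))
gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax posteriors
N, F, stacked = baum_welch_stats(x, gamma)
print(N.shape, F.shape, stacked.shape)          # (8,) (8, 60) (480,)
```

Setting C = 1 and gamma identically equal to one recovers plain mean pooling, which is the connection to the x-vector statistics pooling exploited by the hybrid model described next.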
2.2.3.3 Hybrid DNN-TVM model
The hybrid architecture comprises the initial TDNN layers of the x-vector system. After the frame-level processing by the TDNN layers, a trainable mapping \phi'_c(\cdot) is applied to the transformed sequence G (Equation (2.4)) to produce a posterior probability for every g_t:
\gamma_{tc} = \phi'_c(g_t)    (2.7)
The overall transformation of a feature vector x_t to the posterior is given by (refer to Sections 2.2.2 and 2.2.3.1):
\gamma_{tc} = \phi_c(x_t) = \phi'_c(h(x_t)) = \phi'_c(g_t)    (2.8)
Instead of computing the global statistics of G as done in the x-vector model, we compute the Baum-Welch statistics of G using Equation (2.6). Based on the foundation of Section 2.2.3.1, these statistics can be used in a TVM formulation. Intuitively, the posterior \gamma_{tc} denotes the probability that a feature vector x_t belongs to a certain region of the feature space, and F_c denotes the local mean of the features in that region. We concatenate the multiple local means \{F_c\}_{c=1}^{C} to create a local mean supervector. Global mean pooling (as done in the x-vector model) can be regarded as a special case of this formulation with C = 1 and \phi'_c(\cdot) = 1. Finally, we project the local mean supervector onto an embedding layer w_{\text{hybrid}} through an affine transform. Adopting the same convention as Equation (2.5), we have:
w_{\text{hybrid}} = A \, [\, F_1^\top \;|\; F_2^\top \;|\; \ldots \;|\; F_C^\top \,]^\top + b = f(X)    (2.9)
Similar to the x-vector model, w_{\text{hybrid}} then goes to a shallow classifier network. The whole network can be trained using cross entropy loss as done in [196].
2.2.4 Triplet loss
In principle, any discriminative loss function can be used to train the DNN-TVM model (and also the x-vector system used for comparison in Section 2.4.1). [196] employed the standard cross entropy loss L_CE, which tries to increase the softmax posterior probability of the correct class for every sample. The drawback of cross entropy loss is that it does not explicitly focus on reducing intra-class variance. In contrast, the triplet loss [175] works directly on the embedding space and tries to bring samples from the same class closer together than samples from two different classes. Assume X_i^a and X_i^p are two utterances from the same speaker, denoted the anchor and positive utterances respectively. X_i^n is called the negative utterance, and it belongs to a different speaker than the anchor and positive. The tuple (X_i^a, X_i^p, X_i^n) \in \mathcal{T} is a triplet, where \mathcal{T} is the set of all triplets. The DNN should ideally find a mapping f(\cdot) such that
\| f(X_i^a) - f(X_i^p) \|_2^2 + \alpha < \| f(X_i^a) - f(X_i^n) \|_2^2 \quad \forall (X_i^a, X_i^p, X_i^n) \in \mathcal{T}    (2.10)
where \alpha is a positive margin parameter. This is achieved by minimizing the triplet loss
\mathcal{L}_{\text{Triplet}} = \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \max\left( \| f(X_i^a) - f(X_i^p) \|_2^2 + \alpha - \| f(X_i^a) - f(X_i^n) \|_2^2, \; 0 \right)    (2.11)
The mapping f(X) is the DNN that transforms X into an embedding. For the x-vector and hybrid systems, f(X) is w_{\text{x-vector}} and w_{\text{hybrid}} respectively (Equations 2.5 and 2.9).
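The following is a minimal PyTorch sketch of the batched triplet loss of Equation (2.11); the margin value, embedding size, and random embeddings are illustrative only, and triplet selection in the actual system uses online semi-hard mining (Section 2.3.3.3) rather than pre-formed triplets.

```python
import torch

def triplet_loss(emb_a, emb_p, emb_n, margin=0.2):
    """Triplet loss of Equation (2.11) on batches of anchor/positive/negative
    embeddings, each of shape (B, E). `margin` corresponds to alpha."""
    d_ap = (emb_a - emb_p).pow(2).sum(dim=1)   # ||f(Xa) - f(Xp)||_2^2
    d_an = (emb_a - emb_n).pow(2).sum(dim=1)   # ||f(Xa) - f(Xn)||_2^2
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

# Toy usage with random, length-normalized 32-dimensional embeddings.
B, E = 16, 32
norm = torch.nn.functional.normalize
emb_a, emb_p, emb_n = (norm(torch.randn(B, E), dim=1) for _ in range(3))
loss = triplet_loss(emb_a, emb_p, emb_n)
```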
2.2.5 Multi-task training
Triplet loss has been successfully applied to speaker verification in past works [208, 116]. Inspired by its success in the computer vision domain [212], we incorporate the triplet loss alongside the cross entropy loss and train the network in a multi-task setting. Note that the triplet loss is computed directly on the embedding, while the cross entropy loss passes through a shallow classifier network after the embedding layer. Our hypothesis is that multi-task training forces the network to produce both correct classifications and distinctive embeddings, and that the two tasks are complementary. More formally, the joint loss function is given by
\mathcal{L} = \lambda \mathcal{L}_{\text{CE}} + (1 - \lambda) \mathcal{L}_{\text{Triplet}}    (2.12)
where 0 < \lambda < 1 weighs the two losses.
2.3 Experimental Setting
2.3.1 Datasets and features
The training datasets for the "fixed condition" [140] of the VOiCES challenge are Voxceleb 1 & 2 [47] and Speakers In The Wild (SITW) [126]. Both datasets contain 16 kHz single-channel audio. First, we remove the 60 speakers that overlap between Voxceleb and SITW from the SITW dataset before training. Then we remove speakers with fewer than 10 utterances. This results in 7,537 unique speakers and 1.3M utterances for training.*
* We held out the test part of Voxceleb 1 as an internal clean validation set to monitor possible degradation of the model on clean speech.
The official VOiCES development set has around 16K utterances of noisy and far-field speech from 196 speakers, and 4M trials for scoring. The evaluation set has around 11K utterances and 3.6M trials from 100 speakers. We extract 20-dimensional MFCC features with a 25 ms window and 10 ms shift using the Kaldi toolkit [157]. Energy-based VAD is applied. The features are mean-normalized with a moving window of at most 3 s. Delta and delta-delta features are concatenated with the original MFCCs to produce 60-dimensional features for training.
2.3.2 Data Augmentation
First, every utterance in the training data is augmented with three different types of noise:
• Television: To create a simulated environment similar to background television sounds, we first extract the audio from four publicly available video datasets: the AVA-ActiveSpeaker dataset [166], an advertisement dataset [85], and two compilation videos of the TV shows "Friends" and "How I Met Your Mother" available on YouTube. For every training utterance, a random segment from one of these four datasets is picked and added to the original signal at 13-20 dB SNR.
• Babble: Similar to [184], three to seven speakers are randomly chosen from the above four datasets, summed, and added to the original speech at 13-20 dB SNR.
• Music: A single music file is randomly sampled from the MUSAN music dataset [182] and added to each utterance as described in [184].
We then reverberate the clean copy and the above three noisy copies with a Room Impulse Response (RIR) randomly sampled from a pool of 60K RIRs [103]. The final training data also keeps the original clean utterance. Due to the large size of the augmented dataset (5 times the original, i.e., 6.5M utterances) and limited time and resources, we could only train on a subset of the data. The subset is created by sampling a maximum of 300 utterances (after augmentation, so a maximum of 60 clean utterances) from each speaker, limiting the total to 2.1M utterances.
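The noise-mixing step above amounts to scaling a randomly placed noise segment to a target SNR before adding it to the speech. Below is a minimal NumPy sketch of that operation under stated assumptions: the function name, the placeholder waveforms, and the uniform SNR draw are illustrative and not the exact augmentation pipeline used in the thesis.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add a randomly placed noise segment to `speech` at a target SNR (dB)."""
    rng = rng or np.random.default_rng()
    if len(noise) < len(speech):                              # loop short noise
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: add one TV/babble/music segment at an SNR drawn from 13-20 dB.
rng = np.random.default_rng(0)
speech = rng.normal(size=16000 * 3)      # placeholder 3 s utterance at 16 kHz
noise = rng.normal(size=16000 * 5)       # placeholder noise recording
augmented = mix_at_snr(speech, noise, snr_db=rng.uniform(13, 20), rng=rng)
```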
2.3.3 System parameters
2.3.3.1 i-vector and x-vector baselines
A GMM with 2048 components and full covariance matrices is trained for the UBM. A 400-dimensional i-vector extraction system is built on the longest 100K utterances following Kaldi's Voxceleb v1 recipe.†
† The submitted i-vector system is trained on clean data; training on the augmented data was not finished before the system submission deadline.
The x-vector model is as described in [184], but we utilize our augmented data (Section 2.3.2) instead of the default augmentation recipe of [184].
2.3.3.2 Hybrid DNN-TVM
The hybrid DNN-TVM model is composed of: TDNN → Dense → Dense → Dense → Softmax → TVM Layer → Classifier. The TDNN is the part of the x-vector model up to `frame5' [184]. The dense layers have 1024 hidden units, and the softmax layer has 1000 units (analogous to a GMM with 1000 Gaussian components). The three dense layers and the subsequent softmax layer together implement \phi'_c(\cdot) as described in Section 2.2.3.3. The TVM layer computes the Baum-Welch statistics and the local mean supervector. The classifier network is the same as in [184]: two 512-dimensional dense layers followed by the final softmax layer for classification.
2.3.3.3 Triplet and multi-task
Mining good triplets is crucial in triplet learning. Easy triplets may stagnate the training, while very hard triplets may make the training unstable and result in a collapsed model [175]. Moreover, the total number of triplets grows combinatorially with the number of samples. We adopt online (in-batch) semi-hard triplet mining [175] because it provides much faster training than offline mining and is found to address the training issues mentioned above. For the multi-task loss, we choose \lambda = 0.8 in Equation (2.12). As explained in [212], giving more weight to the cross entropy loss is slightly beneficial; in our experiment, we observed that it helped faster convergence at the beginning of training.
2.3.4 LDA and PLDA scoring
For all systems (listed in Section 2.4.1), LDA is applied to reduce the embedding dimension. The LDA dimension is tuned on the VOiCES development set. For the i-vector and multi-task models, 200-dimensional LDA was found to be optimal, while for the x-vector model, 150-dimensional LDA gave the best performance (similar to [184]). After LDA, the embeddings are length-normalized. Finally, PLDA training and scoring are performed following the conventions in [184].
2.3.5 Fusion of multiple systems
Given the wide diversity between training and test scenarios, it is often not possible to come up with a single good system for speaker verification. Therefore, to maximize the benefit from the complementary merits of different SV systems, we employ a weighted-sum log-likelihood score-fusion strategy. System fusion works well when the fused subsystems are similar in nature but not identical, and have complementary characteristics [70]. In this work, we fuse four SV systems: i-vector, x-vector, and the multi-task versions of the x-vector and DNN-TVM models. Fusion weights and a bias term for linear score fusion are determined with the BOSARIS toolkit (https://sites.google.com/site/bosaristoolkit/) on the development data. The toolkit uses a fast unconstrained convex optimization algorithm based on a quasi-Newton method to train the logistic regression fusion [118].
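The actual fusion in this work is trained with BOSARIS; the sketch below is only a rough Python analogue of the same idea, using scikit-learn's logistic regression (whose default solver is a quasi-Newton method) to learn per-system weights and a bias on development trials. All names, sizes, and the random scores are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    """dev_scores: (n_trials, n_systems) matrix of per-system scores on the
    development trials; dev_labels: 1 for target, 0 for non-target trials.
    Returns per-system weights and a bias for linear score fusion."""
    clf = LogisticRegression(C=1e4)      # weak regularization; default lbfgs solver
    clf.fit(dev_scores, dev_labels)
    return clf.coef_.ravel(), clf.intercept_[0]

def fuse(scores, weights, bias):
    """Weighted-sum fusion of subsystem scores for new trials."""
    return scores @ weights + bias

# Toy usage with 4 subsystems and random development scores.
rng = np.random.default_rng(0)
dev_scores = rng.normal(size=(1000, 4))
dev_labels = rng.integers(0, 2, size=1000)
w, b = train_fusion(dev_scores, dev_labels)
fused = fuse(rng.normal(size=(10, 4)), w, b)
```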
2.4 Results and Discussions
2.4.1 Performance of individual systems
Four metrics are reported, as shown in Table 2.1. The primary and secondary metrics are minDCF and CLLR respectively, as defined in [140]. For clearer visualization, we show two sub-tables for the development and evaluation sets respectively. The first six rows of Table 2.1 (in both sub-tables) show the performance of the six individual systems on the VOiCES development (upper sub-table) and evaluation (lower sub-table) sets. The i-vector system trained on the clean dataset is outperformed by all DNN-based systems.

Table 2.1: Performance of different systems on the VOiCES development (upper sub-table) and evaluation (lower sub-table) sets. The check marks (X) indicate the systems that were submitted for official evaluation. The last two rows of each sub-table are systems after score fusion.

VOiCES development set
System                    minDCF   actDCF   EER (%)   CLLR
1. i-vector clean         0.64     8.94     7.02      0.79
2. x-vector kaldi (X)     0.39     4.85     3.42      0.43
3. x-vector native        0.59     7.08     5.23      0.62
4. DNN-TVM                0.62     10.58    5.95      0.87
5. x-vector multi-task    0.41     5.54     3.53      0.49
6. DNN-TVM multi-task     0.40     6.77     3.83      0.59
(2+5+6) (X)               0.36     0.36     3.18      0.13
(1+2+6) (X)               0.35     0.35     3.29      0.13

VOiCES evaluation set
System                    minDCF   actDCF   EER (%)   CLLR
1. i-vector clean         0.99     29.58    31.89     3.51
2. x-vector kaldi (X)     0.62     4.35     7.54      0.58
3. x-vector native        0.86     6.54     11.74     0.76
4. DNN-TVM                0.89     9.45     12.09     0.93
5. x-vector multi-task    0.68     5.41     8.05      0.65
6. DNN-TVM multi-task     0.64     5.37     8.05      0.64
(2+5+6) (X)               0.60     0.62     7.29      0.45
(1+2+6) (X)               0.67     0.67     8.78      0.53

Moreover, the performance of i-vectors degrades severely when we move from the development to the evaluation set. Two x-vector systems are developed, as shown in rows 2 and 3 of Table 2.1 (in both sub-tables). The "x-vector kaldi" system is developed using Kaldi's [157] x-vector training recipe, while "x-vector native" is our own implementation of the x-vector architecture in Keras [41]. The motivation for re-implementing the x-vector system is the need for easy extension and modification of the model with widely used deep learning tools like Keras, and a fair comparison with the DNN-TVM model, which is implemented using the same tool.§
§ We noticed a gap in performance between the two x-vector implementations, possibly because of various custom optimizations done in Kaldi.
Row 4 (in both sub-tables) shows the performance of the hybrid DNN-TVM model. For this application, both implementations of x-vector perform better than the hybrid model on the development as well as the evaluation sets. Rows 5 and 6 (in both sub-tables) show the performance of the multi-task models (implemented in Keras). On the evaluation set (lower sub-table), x-vector multi-task training obtains a relative improvement of 21% in minDCF over its cross entropy counterpart ("x-vector native"). Similarly, on the evaluation set, DNN-TVM multi-task is ahead of the cross entropy based DNN-TVM by 28% in terms of minDCF. Interestingly, multi-task training of the DNN-TVM model works better than the multi-task version of the x-vector model: on the evaluation set, DNN-TVM multi-task has a relative advantage of 6% over x-vector multi-task in terms of minDCF.
2.4.2 Performance of fused systems
The last two rows of Table 2.1 (in both sub-tables) report the two best-performing fused systems. Both fused systems achieve improvements over the individual systems on the development set, but only system (2+5+6) does so on the evaluation set (lower sub-table). Inclusion of the clean i-vector model in system (1+2+6) deteriorates the performance on the evaluation set, although promising performance was observed on the development set. We believe this comes from the poor generalization of the clean i-vector system, as discussed in Section 2.4.1. In terms of minDCF, the best fused system, (2+5+6), is about 8% and 3% better than "x-vector kaldi" on the development and evaluation sets respectively.
2.5 Conclusion and Future Directions
This work focused on speaker verification with noisy and far-field speech. We addressed the problem by employing a recently proposed hybrid DNN-TVM model. Moreover, a multi-task training scheme was proposed for both the state-of-the-art x-vector system and the hybrid model.
The multi-task approach jointly optimized cross entropy loss and triplet based similarity loss to achieve both good categorization and distinctive embed- dings. The results on VOiCES development and evaluation sets showed that the multi-task models (both x-vector and DNN-TVM) are better than our native implementations of cross entropy based x-vector and DNN-TVM models. Moreover, they provided compli- mentary information when combined together with the x-vector system, and thus ob- tained improved performance compared to individual systems. The multi-task training was found to work better on the DNN-TVM model than the x-vector model for this far-eld SV task. In the future, we plan to do an intensive analysis of the performance gap between ours and kaldi's x-vector implementations, because the gap might also create potential degradation in our DNN-TVM system and its multi-task version. We also plan to train the systems on the full 6:5M augmented utterances, which we could not do due to lack of resource and time. This might fulll the data hungry needs of DNN and potentially improve the performance. 33 Chapter 3 Signal-space variability II: Adversarial perturbation The previous chapter focused on a signal-space variability that comes from environmental noise and room reverberation. In this chapter, we will focus on a more convoluted signal- space variability that arise when a deep neural network model is under adversarial attack { a form of deliberate perturbation of the audio signal. We will show that adversarial attacks pose a threat to the security of a deep speaker recognition system (such as the one described in chapter 2), and more importantly, raise a question about the robustness of the representation learning system. We will present an expository study as done in [91], as well as, a new way to train more robust models by exploring similarity patterns of the data and performing adversarial training [149]. 3.1 Introduction Deep learning models have been recently found to be vulnerable to adversarial at- tacks [189, 21] where the attacker potentially discovers blind spots in the model, and crafts adversarial samples that are only slightly dierent from the original samples, rendering the trained model fail to correctly classify them or even to perform any other inference 34 task on them. Over the last few years, researchers have been devoted to devising novel adversarial attack algorithms [72, 122, 153, 30], proposing defensive countermeasures to gain robustness [72, 122], and demonstrating exploratory analyses [30, 8, 29]. Adversarial attack on speech processing systems: With the rapid increase in the incorporation of Deep Neural Networks (DNN) within speech processing applications like Automatic Speech Recognition (ASR) [32, 11], speaker recognition [74, 46, 184, 90], and speech emotion and human behavior studies [142, 82, 23], it is becoming essential to study the probable weaknesses of the employed models in the presence of adversarial attacks. In [31], the authors have shown that it is possible to achieve even 100% success rate in attacking deep ASR systems. In [159] the authors have successfully generated imperceptible (to humans) adversarial audio samples while retaining high attack success rate. These studies highlight the vulnerability of deep ASR models against adversarial attacks. 
Adversarial attack on speaker recognition systems: Speaker recognition models are widely employed in several applications including smart speakers and personal digital assistants [201, 74], biometric systems [146], and forensics [16]. Therefore, having robust speaker recognition models that are not susceptible to adversarial perturbation is an important requirement. However, speaker recognition models have not been investigated extensively in the presence of adversarial attacks. Some initial work can be found in the literature (Section 3.3), but a detailed analysis of white-box attacks (discussed in Section 3.2.2) with state-of-the-art attack algorithms is difficult to find. Moreover, to the best of our knowledge, effective defensive countermeasures for those attacks have not been proposed. The present work aims to address these issues in particular.
3.2 Background
3.2.1 Speaker recognition systems
We employ a speaker identification system for most of the experiments; a dedicated section is provided for speaker verification. Please see Section 1.3.2.2 for an introductory description of the relationship between the two tasks. Let x \in \mathbb{R}^D denote a time-domain audio sample with speaker label y. Learning a speaker identification model is generally done through Empirical Risk Minimization (ERM) [122]:
\arg\min_{\theta} \; \mathbb{E}_{(x,y) \sim D} \left[ \mathcal{L}(x, y; \theta) \right]    (3.1)
where \mathcal{L}(\cdot) is the cross-entropy objective and \theta denotes the set of trainable parameters of the DNN. An intermediate representation of the trained DNN model can subsequently be extracted as a speaker embedding [184], which is expected to carry speaker-specific information. The speaker embeddings are then utilized for verification purposes.
3.2.2 Adversarial attack
Given an audio sample x \in \mathbb{R}^D, an adversarial attack generates a perturbed signal given by
\tilde{x} = x + \delta \quad \text{such that} \quad \|\delta\|_p < \epsilon    (3.2)
with the goal of forcing the classifier to produce an erroneous output for \tilde{x}. In other words, if x has true label y, the attacker forces the classifier to produce \tilde{y} \neq y for the perturbed sample \tilde{x}. In this paper, we focus on the l∞ and l2 norms, which are the most widely employed in the literature.
3.2.2.1 Attack space
Adversarial attacks on speech can be performed in different signal spaces, e.g., the time-domain raw waveform, the extracted spectrogram, or any other feature space [34, 159].∗ Targeting the time-domain speech signal opens up opportunities for "over-the-air" attacks, where the perturbation is added to the speech signal even before it is received by the microphone of the speech processing system (a speaker recognition system in our case). Our current work focuses on time-domain attacks.†
∗ Throughout the chapter, by "feature space" we refer to (mel-)spectrogram features that are generally extracted for further processing in widely employed speaker recognition systems [184, 183].
† Note that we do not perform over-the-air attacks in this work. Over-the-air attacks can be designed by following the work of Qin et al.
[159], and it is straightforward with time domain attacks as done in this work. 37 Adversarial attacks can be targeted or untargeted [29]. An untargeted attack only forces the model to make erroneous predictions, whereas a targeted attack aims at forcing the model to predict the class that the adversary desires. We perform untargeted attacks in this study, and leave the targeted attack for future study. 3.2.2.3 Transferability Although most of the experiments in this paper are performed for white box attacks, we study the transferability of adversarial samples in Section 3.6.6, which gives us a notion of performance during a black-box attack as well. The transferability test [29, 152] evaluates the vulnerability of a target model against the adversarial samples generated with a source model. The attacker has full knowledge about the source model, but no or limited knowledge about the target model (for example, knowledge about the fact that both source and target have convolutional layers). The goal of the attacker is to generate adversarial samples (with the source model) in such a way that they \transfer well" to the target model, i.e., those samples also make the target model vulnerable. 3.3 Related Work This section describes key previous work on adversarial attack and defense methods pro- posed for speaker recognition systems. • Li et al. [117] showed that an i-vector [54] based speaker verication system is susceptible to adversarial attacks, and the adversarial samples generated with the 38 i-vector system also transfer well to a DNN-based x-vector [184] system ‡ . The attack was performed on the feature space (and not directly on the time domain speech signal), and only the Fast Gradient Sign Method (FGSM) [72] (will be further discussed in Section 3.4) was investigated for that purpose. Moreover, no defense method was proposed. • Kreuk et al. [109] demonstrated the vulnerability of an end-to-end DNN-based speaker verication system to an FGSM attack. The attack was done on feature space, and the authors discovered cross-feature transferability of the adversarial samples. No defense method was proposed in the paper. • Chen et al. [34] proposed the Natural Evolution Strategy (NES) based adversarial sample generation procedure, and successfully attacked a GMM-UBM system § and i-vector based speaker recognition systems. They found an impressive attack success rate with their proposed method. However, the authors did not attack more recent DNN-based speaker recognition frameworks which are shown to have state-of-the- art performances. Moreover, the test set involved in their experiments only included 5 speakers (TABLE I of [34]), and thus, an extensive study with a much higher number of test speakers is still needed. • Wang et al. [202] proposed adversarial regularization based defense methods using FGSM and Local Distributional Smoothness (LDS) [132] techniques. The proposed method was shown to improve the performance of a speaker verication system, ‡ i-vectors have been the state-of-the-art in speaker verication for a decade until DNN-based x-vectors were shown to outperform them [183, 184]. § GMM-UBM stands for Gaussian Mixture Model-Universal Background Model, a classical model in speaker recognition [162]. 39 but only FGSM was employed as the attack algorithm, and similar to most of the above methods, the attack was performed on the feature space and not on the time domain audio. 
In summary, although these studies represent important initial efforts on adversarial attacks against speaker recognition systems, many technical questions remain to be addressed. Limitations include consideration of primarily feature-space attacks [117, 109, 202] (and not the time domain), a limited number of attack algorithms [117, 109, 34, 202], a limited number of speakers in the test set [34], and no or a limited number of defense methods [117, 109, 34]. The present expository study aims to address these limitations by reporting extensive experimental analysis and ablation studies, and by proposing and evaluating various defense methods.
3.4 Attack and Defense Algorithms
3.4.1 Attack algorithms
One group of gradient-based attack algorithms tries to maximize the loss function by finding a suitable perturbation that lies inside the l_p-ball around the sample x. Formally,
\max_{\delta : \|\delta\|_p < \epsilon} \mathcal{L}(x + \delta, y; \theta).    (3.3)
Here, the notation follows Equation (3.2): \delta denotes the adversarial noise being added, y denotes the true label of sample x, and \theta denotes the parameters of the DNN model. The loss \mathcal{L}(x, y; \theta) is generally the cross entropy loss for the speaker identification task, as shown in Equation (3.1). A different group of algorithms aims at decreasing the posterior of the true output class while increasing the posterior of the most confusing wrong class. Here we present the attack algorithms employed in our study.
Fast Gradient Sign Method (FGSM). Goodfellow et al. [72] proposed this computationally efficient one-step l∞ attack, which generates adversarial samples by using only the sign of the gradient and moving in the direction that increases the loss:
\tilde{x} = x + \epsilon \cdot \mathrm{sign}\left( \nabla_x \mathcal{L}(x, y; \theta) \right).    (3.4)
Projected Gradient Descent (PGD). Madry et al. [122] proposed a more general, iterative gradient-based l∞ attack:
\tilde{x}^{\,i+1} = \Pi_{x+S}\left[ \tilde{x}^{\,i} + \alpha \cdot \mathrm{sign}\left( \nabla_x \mathcal{L}(x, y; \theta) \right) \right],    (3.5)
where \alpha is the step size of the gradient update, x + S is the set of allowed perturbations, i.e., the l∞-ball of radius \epsilon around x, and \Pi_{x+S} denotes the constrained projection operation of a standard PGD optimization algorithm. Throughout the text, we denote PGD with a fixed number of T iterations by "PGD-T".
Carlini and Wagner attack (Carlini l2 and Carlini l∞). Carlini and Wagner [30] defined the general methodology of their attack as
\text{minimize} \quad \|\delta\|_p + c \cdot g(\tilde{x}) \quad \text{such that} \quad \tilde{x} \in [0, 1]^D.    (3.6)
Here, g(\cdot) is the objective function given by
g(\tilde{x}) = \left[ Z(\tilde{x})_t - \max_{j \neq t} \left( Z(\tilde{x})_j \right) + \kappa \right]_+    (3.7)
where Z(\cdot) is the output vector containing the posterior probabilities of all classes, t denotes the output node corresponding to the true class y, \kappa is the confidence margin parameter, and [\cdot]_+ denotes the max(\cdot, 0) function. Intuitively, the attack tries to maximize the posterior probability of a class that is not the true class of x but has the highest posterior among all the wrong classes. The norm can be either l2 or l∞. For the Carlini l∞ attack, the minimization of \|\delta\|_\infty is not straightforward due to non-differentiability, and an iterative procedure is employed in [30].¶
¶ We refer the reader to [30] for details of the iterative workaround for the l∞ attack, and for the choice of the weight parameter c.
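For concreteness, the following is a minimal PyTorch sketch of the FGSM and PGD-T updates of Equations (3.4) and (3.5) applied to time-domain audio. The experiments in this chapter use the Adversarial Robustness Toolbox rather than this code; the model, batch tensors, and default step size are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM (Equation 3.4) on a batch of waveforms x with labels y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).detach()

def pgd(model, x, y, eps, steps=100, alpha=None):
    """PGD-T (Equation 3.5): iterated signed-gradient steps, projected back
    into the l-infinity ball of radius eps around the clean input."""
    alpha = alpha if alpha is not None else eps / 5.0   # step size used in the chapter
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # projection onto x + S
    return x_adv.detach()
```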
3.4.2 Defense algorithms
Adversarial training. The intuition here is to train the model on adversarial samples generated by a certain adversarial attack. The adversarial samples are generated online, using the training data and the current model parameters. Madry et al. [122] introduced the generalized notion of adversarial training as a mini-max optimization:
\arg\min_{\theta} \; \mathbb{E}_{(x,y) \sim D} \left[ \max_{\delta : \|\delta\|_p < \epsilon} \mathcal{L}(x + \delta, y; \theta) \right]    (3.8)
The inner maximization is addressed by the attack algorithm utilized during adversarial training, and the outer minimization is the standard ERM (Equation (3.1)) used to train the model parameterized by \theta. We separately apply both the one-step FGSM (Equation (3.4)) and the T-step PGD (Equation (3.5)) algorithms to solve the inner maximization problem. Throughout the remaining text, we refer to these as "FGSM adversarial training" and "PGD-T adversarial training" respectively. Notably, the overall training is done on clean as well as adversarial samples. The overall loss function is given by
\mathcal{L}_{\text{AT}}(x, \tilde{x}, y; \theta) = (1 - w_{\text{AT}}) \, \mathcal{L}(x, y; \theta) + w_{\text{AT}} \, \mathcal{L}(\tilde{x}, y; \theta),    (3.9)
where w_{\text{AT}} \in [0, 1] is the weight of the adversarial training.
Adversarial Lipschitz Regularization (ALR). This approach to gaining robustness is based on learning a function that is not very sensitive to small changes in the input. In other words, if we can learn a relatively smooth function, then the posterior distribution should not vary abruptly as long as the input perturbation is within the maximum allowed limit. We propose a training strategy equipped with the recently proposed adversarial Lipschitz regularization technique [191]. Similar to the regularization based on local distributional smoothness in Virtual Adversarial Training (VAT) [132], ALR imposes a regularization term defined via Lipschitz smoothness:
\|f\|_L = \sup_{x, \tilde{x} \in X, \, d_X(x, \tilde{x}) > 0} \frac{d_Y\left( f(x), f(\tilde{x}) \right)}{d_X(x, \tilde{x})},    (3.10)
where f(\cdot) is the function of interest (implemented by the neural network) that maps the input metric space (X, d_X) to the output metric space (Y, d_Y). In our case of speaker classification, we choose f(\cdot) to be the final log-posterior output of the network, i.e., f(x) = \log p(y \mid x; \theta), the l∞ norm as d_Y, and the l2 norm as d_X. The adversarial perturbation \delta in \tilde{x} = x + \delta is approximated by the power iteration
\delta_{i+1} = \frac{\nabla_{\delta_i} \, d_Y\left( f(x), f(x + \xi \delta_i) \right)}{\left\| \nabla_{\delta_i} \, d_Y\left( f(x), f(x + \xi \delta_i) \right) \right\|_2},    (3.11)
where \delta_0 is randomly initialized and \xi is another hyperparameter (see Section 3.5.5). The regularization term added to training is
\mathcal{L}_{\text{ALR}} = \left[ \frac{d_Y\left( f(x), f(\tilde{x}) \right)}{d_X(x, \tilde{x})} - K \right]_+,    (3.12)
where K is the desired Lipschitz constant we wish to impose.
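A minimal sketch of one training step of PGD-T adversarial training (Equations 3.8-3.9) is shown below, assuming the pgd() sketch from the previous block; the model, optimizer, and weighting defaults are placeholders, not the exact training loop used in this chapter.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps, w_at=0.5, steps=10):
    """One mini-batch update of PGD-T adversarial training (Equations 3.8-3.9).

    The inner maximization is solved approximately by the pgd() sketch above;
    the outer minimization mixes clean and adversarial cross entropy with w_at.
    """
    model.eval()                                    # freeze BN/dropout while crafting
    x_adv = pgd(model, x, y, eps=eps, steps=steps)  # inner maximization (PGD-10 here)
    model.train()

    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x), y)
    loss_adv = F.cross_entropy(model(x_adv), y)
    loss = (1.0 - w_at) * loss_clean + w_at * loss_adv   # Equation (3.9)
    loss.backward()
    optimizer.step()
    return loss.item()
```

With w_at = 0.5 this matches the equal clean/adversarial mixing used later in the experimental setup.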
3.4.3 Proposed defense algorithm
3.4.3.1 Feature scattering adversarial training
Adversarial training (Equation 3.8) that employs the cross entropy loss to generate the adversarial samples (the inner maximization) suffers from label leaking [112], which forces the model to overfit to the adversarial noise. Moreover, the cross entropy loss disregards the original manifold structure of the data and may unnaturally push samples toward the decision boundary, which can hurt classification performance. Zhang et al. [209] proposed Feature Scattering (FS) adversarial training to overcome these drawbacks. It employs an unsupervised optimal transport (OT) [51] distance between two discrete distributions \mu and \nu to generate the adversarial perturbations. The distributions are \mu = \sum_{i=1}^{n} u_i \delta_{x_i} and \nu = \sum_{i=1}^{n} v_i \delta_{x_{\text{adv},i}}, where \delta_x is the Dirac delta function, u and v lie on the n-dimensional simplex, and n is the size of the minibatch during training. The OT distance is the minimum cost of transporting \mu to \nu, and is defined as
D(\mu, \nu) = \min_{T \in \Pi(u, v)} \langle T, C \rangle    (3.13)
where \langle \cdot, \cdot \rangle denotes the Frobenius dot product, T is the transport matrix, and C is the cost matrix. The set \Pi(u, v) contains all possible joint probabilities with the given marginals, i.e., \Pi(u, v) = \{ T \in \mathbb{R}_+^{n \times n} \mid T \mathbf{1}_n = u, \; T^\top \mathbf{1}_n = v \} [209]. The cost matrix C is computed from the cosine distance between original and perturbed samples in the feature space:
C_{ij} = 1 - \frac{f(x_i)^\top f(x_{\text{adv},j})}{\|f(x_i)\|_2 \, \|f(x_{\text{adv},j})\|_2}    (3.14)
The feature matching distance of Equation 3.13 is employed as the inner maximization criterion of Equation 3.8, and the adversarial samples for training are generated iteratively as in Equation 3.8. The adversarial training, i.e., the outer minimization of Equation 3.8 (updating the network weights), incorporates the cross-entropy loss.
3.4.3.2 Proposed methodology
Motivation: The success of adversarial training depends on the quality of the local maxima [87] found during the optimization of Equation 3.8. Furthermore, if we can generate a more diverse set of adversarial samples, the trained model can be made more robust. We aim to incorporate multiple cues for generating more diversified adversarial samples, which in turn can help the model encounter more of its blind spots.
Hybrid adversary generation: In our recent work [149], we proposed hybrid adversarial training, which employs the cross-entropy (L_CE), feature scattering (L_FS), and margin (L_M) losses to generate diversified adversarial samples:
\mathcal{L}_{\text{adv}} = \alpha \, \mathcal{L}_{\text{CE}}(x_{\text{adv}}, y) + \beta \, \mathcal{L}_{\text{FS}}(x, x_{\text{adv}}) + \gamma \, \mathcal{L}_{\text{M}}(x_{\text{adv}}, y)    (3.15)
where \alpha, \beta, \gamma are weights on the individual losses. All three losses are calculated from the model output or logit space, and each focuses on a different aspect of generating the perturbed samples. For example, the cross entropy loss moves the adversarial samples toward the decision boundary, while the feature scattering loss, L_FS = D(\mu, \nu), increases the OT distance between the original and adversarial samples without altering the data manifold structure.
Margin loss: The margin loss L_M is based on the Carlini-Wagner (CW) attack described in Section 3.4.1. It attempts to generate stronger perturbations by minimizing the difference between the posterior of the true class of the adversarial sample and that of the most confusing wrong class. The margin controls the confidence of the attack. Formally,
\mathcal{L}_{\text{M}}(x_{\text{adv}}, y) = \sum_{i=1}^{n} \left[ f(x_{\text{adv},i})_t - \max_{j \neq t} \left( f(x_{\text{adv},i})_j \right) + M \right]_+    (3.16)
where t is the output node corresponding to the true class y, M is the confidence margin, and [\cdot]_+ represents the max(\cdot, 0) function.
Hybrid adversarial training (HAT): As described above, L_adv is maximized to generate the adversarial samples, and the model is then trained with the cross entropy loss (the outer minimization in Equation 3.8).
3.5 Experimental Setting
We implement the core of most of our attack and defense algorithms (except ALR) with the Adversarial Robustness Toolbox [144]. For ALR, we follow the original implementation of [191]. The rest of the experimental details are described below.
3.5.1 Dataset
We employ the "train-clean-100" subset of Librispeech [151] for all the experiments. It contains 100 hours of clean speech from 251 unique speakers (125 female), all of whom are used in our experiments. For every speaker, we employ 90% of the utterances for training the classifier and the remaining 10% for testing. The train-test split is deterministic and kept fixed throughout all the experiments.
3.5.2 Model architectures
We implemented our classifier, f : X → Y, by combining a Convolutional Neural Network (CNN) with a digital signal processing (DSP) front-end.
The DSP front-end is non- trainable but dierentiable, and it extracts log Mel-spectrogram which can be viewed as a temporal signal ofF channels, whereF is the number of Mel frequency bins. The back- end is either of the two DNN models described below. As both modules are dierentiable, the adversarial attack schemes introduced in Section 3.4 can be applied to create time- domain perturbation directly. 3.5.2.1 1D CNN Our classier comprises three components: a DSP front-end, a speaker embeddings ex- tractor, and a linear classier. The extractor consists of 8 stacks of convolutional layers and transforms the spectrogram into a single vector of speaker embedding v 2 R 32 . All of the CNN layers are coupled with a batch normalization and ReLU non-linearity. Max-pooling is employed 3 times to down-sample the feature map. The 32D speaker embedding is obtained by max pooling over time of the output of the nal CNN layer. The linear classier maps the embedding into the class logits. The model has around 219 thousand trainable parameters in total. The complete network architecture is shown in Table 3.6. We analyze this model throughout the paper. 48 3.5.2.2 TDNN The Time Delay Neural Network (TDNN) [183, 184] is one of the current state-of-the- art models for speaker recognition. Section 2.2.2 provided a brief introduction of the TDNN model. We adopt the model architecture proposed in [184] for the experiments related to transferability analysis (Section 3.6.6). The model consists of time-dilated convolutional layers along with a statistics pooling module, and it has 4:4 million trainable parameters, and hence, is much larger than the 1D CNN model. We use this model only in transfereability experiments where we create adversarial perturbation from TDNN to attack our model, and vice versa. 3.5.3 Training parameters We employ the Adam optimizer [102] with a learning rate of 0:001, 1 = 0:5, and 2 = 0:999. We use a minibatch size of 128. We train all the models until the training loss saturates. 3.5.4 Attack parameters Our main results (Section 3.6.3) are obtained from the experiment with attack strength = 0:002 for l 1 attacks, and condence margin = 0 for a Carlini l 2 attack. We empirically chose = 0:002 because it results in an average SNR of around 30 dB and 3:5 PESQ score for the FGSM/PGD adversarial samples. This will be explained in more detail in Section 3.6.1 and Section 3.6.2. Furthermore, we vary the strength of dierent attacks, and the results are shown in Section 3.6.4. The PGD attack is performed for 100 iterations with a step size ==5. 49 3.5.5 Defense parameters In the ALR method, we set the number of power iterations K = 1, and the hyperparam- eter = 10, as recommended in [191]. The FGSM- and PGD-based adversarial training algorithms are run with = 0:002. Hence, the main results (Section 3.6.3) employ the same value in both the attack and the adversarial training based defense. The ablation study in Section 3.6.4 is particularly designed to investigate the eect of using dierent values of during the attack. Specically, the study varies above and below the vicinity of = 0:002 (set during defense training), and analyzes the eectiveness of the defense method. The PGD adversarial training uses 10 iterations (i.e., PGD-10 as introduced in Section 3.4.2) ‖ , although we evaluated it against PGD attacks with higher number of iterations (Section 3.6.3 and 3.6.5). 
During adversarial training, we create minibatches containing an equal number of clean and adversarial samples, i.e., we set w_AT = 0.5 in Equation (3.9).
‖ PGD adversarial training is slow, and we could not afford to run it for more than 10 iterations.
3.6 Results: Exposition study
3.6.1 Attack strength vs. SNR
To obtain a substantial understanding of the strength of the different attack algorithms, we computed the mean Signal-to-Noise Ratio (SNR) of all the test adversarial samples at every level of attack strength. For the l∞ attacks, ε varies over {0.0005, 0.002, 0.0035, 0.005}, and for the Carlini l2 attack the confidence margin κ varies over {0, 0.001, 0.01, 0.1}. The curves for the l∞ attacks are shown in Figure 3.1a. There are two important observations.

[Figure 3.1: Mean SNR and PESQ score of the test adversarial samples generated by different l∞ attack algorithms (FGSM, PGD-100, Carlini l∞) at different strengths. (a) Attack strength (ε) vs. mean SNR (dB); (b) attack strength (ε) vs. perceptibility (PESQ score).]

First, the average SNR level is around 30 dB higher for Carlini l∞ than for FGSM and PGD. Second, the SNR level tends to decrease faster with increasing ε for PGD and FGSM than for Carlini l∞. The reason can be attributed to the optimization algorithms the various attack methods use to generate the adversarial samples: the Carlini method enforces the minimum perturbation required to change the output prediction, while PGD enforces a perturbation projected inside the l∞-ball of size ε around x. We have also computed the mean SNR for the Carlini l2 attack for different values of the confidence margin κ; the SNR level tends to stay around 75 dB and does not vary much with increasing κ. A visualization of the adversarial spectrograms is shown in Appendix 3.9.1 for a more detailed analysis of the attack algorithms.
3.6.2 Attack strength vs. perceptibility
We measure the perceptibility of the generated adversarial samples using the Perceptual Evaluation of Speech Quality (PESQ) metric [161, 165]. While a subjective measure with multiple human annotators can be more accurate, it is time-consuming and costly. The objective PESQ measure has been the ITU-T standard for measuring telephonic transmission quality; it gives a mean opinion score by comparing the degraded speech signal with the original recording. The PESQ score lies between −0.5 and 4.5, with higher values indicating better quality. Figure 3.1b depicts the average PESQ scores of all the test adversarial samples generated by the different attack methods at various strengths. We can see the gradual degradation of audio quality as the attack strength increases. Similar to the findings of the SNR analysis (Section 3.6.1), the Carlini l∞ attack produces adversarial audio samples of higher perceptual quality (i.e., less perceptible perturbations) than PGD-100 and FGSM, and the degradation is also slower for the Carlini attack. It is noteworthy that at ε = 0.0005 the attack algorithms achieve high audio quality (PESQ score > 4.0) yet still force the classifier to produce erroneous outputs (Section 3.6.4). We have also computed the average PESQ score for the Carlini l2 attack: it is 4.4 and does not vary much with the confidence margin κ.
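The two measurements above can be computed directly from the clean and adversarial waveforms. Below is a minimal sketch of the perturbation SNR in dB; the PESQ call is shown only as a comment because it relies on the third-party `pesq` package, whose availability and exact interface are an assumption here, and the placeholder waveforms are illustrative.

```python
import numpy as np

def snr_db(clean, adversarial, eps=1e-12):
    """SNR (dB) of the adversarial perturbation relative to the clean waveform."""
    noise = adversarial - clean
    return 10.0 * np.log10((np.mean(clean ** 2) + eps) / (np.mean(noise ** 2) + eps))

# PESQ via the third-party `pesq` package (assumed installed); 'wb' = wide-band mode.
# from pesq import pesq
# score = pesq(16000, clean, adversarial, 'wb')   # roughly -0.5 (bad) to 4.5 (excellent)

# Example: mean SNR over a few (clean, adversarial) pairs with an FGSM-like perturbation.
rng = np.random.default_rng(0)
pairs = [(x, x + 0.002 * np.sign(rng.normal(size=x.shape)))
         for x in (rng.normal(size=16000) for _ in range(5))]
mean_snr = np.mean([snr_db(c, a) for c, a in pairs])
```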
3.6.3 Main results
Table 3.1 presents the test performance of standard training (no defense) and all the employed defense methods under three l∞ attacks and one l2 attack. All performances are averaged over 10 random runs. Note that "benign accuracy" denotes the accuracy on clean samples, and "adversarial accuracy" denotes the accuracy on adversarial samples generated with a particular attack algorithm; we use this terminology throughout the remainder of the text. The l∞ attacks use ε = 0.002, and all the adversarial training methods are run with the same value.

Table 3.1: Different attacks on a speaker recognition system, and performance of different defense methods. "Benign accuracy" denotes accuracy on clean samples, and "Adversarial accuracy" denotes accuracy on adversarial samples generated with different attack algorithms. "AT" stands for Adversarial Training (the defense method described in Section 3.4.2). Accuracy is on a scale of [0, 1].

                              Adversarial accuracy with different attacks
Defense method   Benign acc.   FGSM   Carlini l∞   PGD-100   Carlini l2
No defense       0.94          0.25   0.02         0         0
FGSM AT          0.82          0.20   0.09         0         0
ALR              0.96          0.44   0.10         0         0
PGD-10 AT        0.92          0.73   0.58         0.43      0.09

As we can see, the accuracy of the standard training method drops from 94% by a significant margin under all the attacks. This shows the vulnerability of the model and further underscores the need for strong countermeasures. A comparison of the three l∞ attacks shows that FGSM is the weakest (25% adversarial accuracy for standard training) and PGD-100 (PGD with 100 iterations) is the strongest (0% adversarial accuracy for standard training).
Comparing the defense methods, FGSM-based adversarial training is the weakest defense strategy. The ALR method is better than FGSM adversarial training, although it fails to defend against the PGD-100 attack. PGD-10 adversarial training performs best in our experiments; interestingly, PGD adversarial training with 10 iterations is able to defend against a PGD attack with 100 iterations. PGD-10 adversarial training gives absolute improvements of 48%, 56%, and 43% over the undefended performance under the FGSM, Carlini l∞, and PGD-100 attacks, respectively.
As observed in the previous literature, the robustness gained by PGD-10 adversarial training against adversarial attacks generally comes with a drop in benign accuracy. Similarly, in our experiment, the accuracy on clean test samples drops for both the FGSM- and PGD-based adversarial training methods, with the FGSM variant suffering the larger drop. The ALR method, on the other hand, achieves a 2% absolute improvement in benign accuracy compared to the model with standard training, possibly because of less overfitting due to the penalty term of Equation (3.12).
The last column of Table 3.1 shows the performance of the different defense methods under the Carlini l2 attack. Standard training, FGSM adversarial training, and ALR are unable to defend against this attack, and defense with PGD-10 adversarial training also performs poorly. The reason can be attributed to the adversarial training methodology being based on l∞ perturbations, which probably fails to defend against a strong l2 attack. A related ablation study is provided in Appendix 3.9.2, which examines the similarity between the misclassified predictions made by the model under different attacks; this could reveal some inherent similarities between the attacks.
3.6.4 Ablation study 1: Varying attack strength
Figure 3.2 shows how the performance of the different defense methods varies as we vary the strength of the adversarial attacks. Note that the adversarial training-based defense methods still employ ε = 0.002 during training, but the ε of the attack algorithm varies. We can observe that the general trend of the curves is downward as the strength of any attack increases. The only exception is the unprotected model under the FGSM attack, whose performance surprisingly increases at the beginning and then saturates.

[Figure 3.2: Ablation study 1: varying the strength (ε for the three l∞ attacks, κ for the Carlini l2 attack) of the different attack algorithms, and the resulting adversarial accuracy of the defense methods (standard training, FGSM adversarial training, ALR, PGD-10 adversarial training). Panels: (a) FGSM attack, (b) Carlini l∞ attack, (c) PGD-100 attack, (d) Carlini l2 attack.]

[Figure 3.3: Performance of PGD-10 adversarial training against PGD attacks at different strengths (ε ∈ {0.0005, 0.002, 0.0035, 0.005}) and with different maximum numbers of attack iterations.]

Comparing the defense methods, PGD-10 adversarial training continues to outperform all the other defenses for all attack types and at all strength levels. The ALR-based training is the next best defense technique. Another interesting observation is that the accuracy curves for both Carlini methods are flatter than those for the FGSM and PGD attacks. The reason can be attributed to the relatively smaller drop in the SNR of the test adversarial samples generated by the Carlini method as the attack strength increases, as explained in Section 3.6.1.
3.6.5 Ablation study 2: Analyzing the best defense method
Here we analyze the best defense method, i.e., PGD-10 adversarial training, in further detail. Specifically, we investigate its behavior when we attack it with PGD attacks using different numbers of iterations and different strengths. Figure 3.3 shows the variation of the adversarial accuracy for the PGD-10 defense. Each line denotes an attack at a particular strength, i.e., a particular ε value. The horizontal axis denotes the number of iterations, varying over {10, 20, 30, 50, 100}. A closer inspection reveals that after the first drop in performance from the PGD-10 attack to the PGD-20 attack, the accuracy tends to decrease very slowly.
For example, at ε = 0.002, the adversarial accuracy against the PGD-10 attack is 48%. A 3% absolute drop is observed when we perform a PGD-20 attack, bringing the adversarial accuracy to 45%. The accuracy drops very slowly afterwards, and we see 43% accuracy against a PGD-100 attack. We conjecture that 20 iterations are sufficient for the adversary to find the best perturbation within the l∞-ball of radius ε around x; as a result, the effect of PGD with a larger number of iterations is marginal.
3.6.6 Ablation study 3: Transferability analysis
We perform a transferability analysis between the smaller 1D CNN model and the larger TDNN model defined in Section 3.5.2. The goal of this experiment is to investigate whether the adversarial samples generated with a simple convolutional model (the 1D CNN) are able to make a different model (the TDNN) vulnerable. We choose the off-the-shelf TDNN model for this purpose because of its widespread use in speaker recognition experiments. Table 3.2 shows the performance of the different models on benign and adversarial samples. We can see that adversarial samples crafted from the "source" model tend to be harmful to the "target" model as well, as evident from the significant drops in performance. Adversarial samples generated with the larger model (TDNN) tend to be more effective at attacking the smaller model. Further studies are needed to fully understand the observed pattern of transferability.

Table 3.2: Transferability of adversarial samples between different models. The adversarial samples are generated with the "source" model, but are effective against the "target" model as well. ε = 0.002 is employed for this experiment. Accuracy is on a scale of [0, 1].

Benign accuracy:  1D CNN 0.94,  TDNN 0.95

Adversarial accuracy for FGSM attack
Source \ Target   1D CNN   TDNN
1D CNN            0.24     0.37
TDNN              0.14     0.03

Adversarial accuracy for PGD-100 attack
Source \ Target   1D CNN   TDNN
1D CNN            0        0.28
TDNN              0.06     0

3.6.7 Ablation study 4: Effect of noise augmentation
Noise augmentation is a standard technique employed while training a speaker recognition model [184, 95]. Here, we experiment with augmenting the dataset with white Gaussian noise (scaled by a factor equal to the ε used in the attack) during training of the undefended model. The model is trained on both clean and noisy samples. The experimental observations are tabulated in Table 3.3. As expected, the benign accuracy improves. However, we found no improvement in the model's ability to defend against the FGSM and PGD-100 adversarial attacks. The reason might be attributed to the ability of the attack algorithms to generate more novel noise patterns (compared to simple white Gaussian noise) that force the model posteriors to change.

Table 3.3: Effect of training data augmentation with white Gaussian noise. ε = 0.002 is employed for this experiment. Accuracy is on a scale of [0, 1].

                              Benign     Adversarial accuracy
Training data augmentation    accuracy   FGSM    PGD-100
No                            0.94       0.25    0
Yes                           0.95       0.17    0

3.6.8 Speaker Verification
Finally, we evaluate our speaker recognition model in an open-set speaker verification (SV) setup. For this purpose, we compute the cosine similarity between the speaker embeddings of an enrollment and a query utterance, denoted by v_e and v_q respectively. We determine the decision boundary b on the dev-clean set of Librispeech (40 speakers unseen during training), and apply that boundary for speaker verification on the test-clean set (another 40 unseen speakers). To create adversarial perturbations, we formulate verification as a binary classification problem, where a positive case (Y = 1) denotes the decision that the query and the enrollment come from the same speaker, and a negative case (Y = 0) otherwise. The model prediction is given by
P(Y \mid v_q) = \frac{1}{1 + e^{-(d - b)}}, \quad \text{where} \quad d = \frac{v_q^\top v_e}{\|v_q\| \, \|v_e\|}.    (3.17)
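The decision rule of Equation (3.17) amounts to a cosine score passed through a sigmoid around the tuned boundary. A minimal NumPy sketch is given below; the function name, the embedding dimensionality, and the boundary value are illustrative assumptions.

```python
import numpy as np

def verify(v_q, v_e, b):
    """Speaker verification decision of Equation (3.17).

    v_q, v_e : query and enrollment speaker embeddings.
    b        : decision boundary tuned on a held-out development set.
    Returns (P(Y = 1 | v_q), accept_flag).
    """
    d = np.dot(v_q, v_e) / (np.linalg.norm(v_q) * np.linalg.norm(v_e) + 1e-12)
    p = 1.0 / (1.0 + np.exp(-(d - b)))
    return p, p >= 0.5

# Toy usage with random 32-dimensional embeddings and an illustrative boundary.
rng = np.random.default_rng(0)
v_e, v_q = rng.normal(size=32), rng.normal(size=32)
prob, same_speaker = verify(v_q, v_e, b=0.4)
```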
For positive pairs, in which the speaker in the enrollment and the query is the same, we enumerate all possible combinations (a total of 91,961 pairs). For negative pairs, we randomly draw 100 queries per enrollment from utterances of the other 39 speakers (a total of 262,000 pairs). Following [35], we report the positive and negative recall rates separately in Table 3.4. The random seed was kept fixed, so the benign recall rate remains the same across the positive (or negative) experiments. The goal of attacking a positive pair is to fool the system into believing that the query is not the enrolled speaker, and the goal of attacking a negative pair is to make it believe that the impostor is the enrolled speaker. Our experiments show that our speaker recognition model with adversarial training is more robust to adversarial attacks, even in the speaker verification setting, than the unprotected baseline model. Moreover, even though the adversarial training includes only one type of attack (PGD-10 in our case), the robustness generalizes to other types of attacks. In conclusion, the robustness induced by adversarial training can be transferred to a related task.

Table 3.4: Recall rates of our model in speaker verification settings. For this experiment, we employ FGSM with 10 random initializations, which is found to be stronger than the vanilla FGSM employed in the experiments of Section 3.6.3.

Without adversarial training
Type       Attack           Benign   ε=0.0005   ε=0.002   ε=0.0035   ε=0.005
positive   FGSM (10 init)   .8684    .5955      .4873     .4404      .4076
positive   Carlini l∞       .8684    .5846      .3906     .3736      .3644
positive   PGD-10           .8684    .3911      .0095     .0032      .0029
negative   FGSM (10 init)   .8530    .5298      .4518     .4310      .4162
negative   Carlini l∞       .8530    .5314      .3864     .3791      .3779
negative   PGD-10           .8530    .3817      .0335     .0210      .0157

With PGD-10 adversarial training
Type       Attack           Benign   ε=0.0005   ε=0.002   ε=0.0035   ε=0.005
positive   FGSM (10 init)   .8728    .8248      .7593     .7077      .6599
positive   Carlini l∞       .8728    .8231      .8066     .7890      .7689
positive   PGD-10           .8728    .8071      .7179     .6424      .6102
negative   FGSM (10 init)   .8495    .8188      .7702     .7274      .6870
negative   Carlini l∞       .8495    .8071      .7974     .7875      .7759
negative   PGD-10           .8495    .8133      .7305     .6571      .6236

3.7 Results: Comparison with proposed defense
Table 3.5 compares the performance of the "Standard" (undefended) model, popular adversarial training (AT) models, and the proposed hybrid adversarial training (HAT) model in the same experimental setting as described in Section 3.5, i.e., ε = 0.002 for training and testing. Although feature scattering adversarial training (FS10-AT) provides high accuracy on clean samples and under FGSM, its performance deteriorates under the PGD and CW attacks. This might be because FS-based adversarial samples alone are not diverse enough to encompass the adversaries created by the PGD and CW attacks. The proposed HAT obtains a significant performance boost over the FS method after incorporating the multi-task objectives. Furthermore, our defense attains 3.29% and 3.18% absolute improvements over PGD10-AT under the PGD40 and CW40 adversarial attacks.

Table 3.5: Accuracy comparison of the proposed defense with other baseline defenses under untargeted, white-box attacks with ε = 0.002. The columns under the PGD and CW attacks denote the number of attack iterations.

Defense        Clean   FGSM    PGD-10   PGD-20   PGD-40   CW-10   CW-20   CW-40
Standard       99.55   6.03    0.00     0.00     0.00     0.00    0.00    0.00
FGSM-AT        97.64   57.04   14.86    11.05    9.59     17.01   14.13   12.37
PGD10-AT       97.30   88.78   78.22    76.62    75.55    77.42   76.00   75.34
FS10-AT        96.12   82.85   60.69    55.28    53.19    38.45   35.68   32.54
HAT10 (ours)   97.68   90.06   81.12    79.60    78.84    80.12   79.18   78.52

3.8 Conclusion and Future Directions
The chapter presented an extensive exploratory analysis of adversarial attacks on a closed-set speaker recognition system, and proposed a new defense method.
We reported results obtained from experiments with multiple state-of-the-art attack algorithms at varying attack strengths. We investigated state-of-the-art defense methods, and adopted them as countermeasures for the speaker recognition model. We performed several ablation studies to understand the SNR characteristics and perceptibility of the adversarial speech, analyze the transferability of the adversarial attacks, and assess the effectiveness of white noise augmentation during training. The main observations are the following:

• A speaker recognition system such as the one employed in the current study is vulnerable to white-box adversarial attacks. The performance of the undefended model dropped from 94% to 0% with the strongest attacks (PGD-100, Carlini l_2) in our experiments, even at 40 dB SNR and a PESQ score > 4.

• Adversarial samples crafted with the Carlini and Wagner method are found to have the best perceptual quality in terms of the PESQ score.

• The adversarial samples generated with a particular source model are found to transfer well to a different target model, and hence, are also harmful for the target model. This is particularly alarming because it can open up chances for black-box attacks.

• Augmenting training data with white Gaussian noise is not found to be effective.

• Experimenting with several defense methods showed that PGD-based adversarial training is the best defense strategy in our setting.

• Although PGD adversarial training is the best defense method, it is not found to be effective against l_2 attacks in our experiments, probably because the l_∞ norm is employed during training.

• Robustness induced by adversarially training a speaker recognition model translates to speaker verification.

• The proposed hybrid adversarial training was shown to outperform state-of-the-art defense methods. It utilizes multiple objectives to generate more diverse adversarial samples, and thus, a more robust adversarial training.

Several important future directions can be taken from here.

• Metric learning methods such as triplet training [175] are shown to learn compact and robust embeddings against adversarial attacks for images [124, 213]. Metric learning is also found to be useful for learning robust speaker embeddings in [95]. A natural extension would be to verify the adversarial robustness of speaker embeddings learned via metric learning.

• Adaptive attacks [194] are particularly designed to break any specific defense algorithm. The strategies introduced in [194] can be a starting point to perform model-specific adversarial attacks on existing defense methods proposed for speaker recognition systems.

• Studying targeted attacks might be another good direction from here, especially since this could be a potential threat for biometric systems that rely on speaker recognition modules.

• Finally, further research can be done on crafting imperceptible (to human judgement, or by retaining a high PESQ score) adversarial audio samples with a high attack success rate, such as in [159], and also on formulating effective detection [186] and defense algorithms as countermeasures.
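As a concrete point of reference for the PGD-based adversarial training highlighted in the observations above, the following minimal PyTorch sketch shows an l∞-bounded PGD attack with a random start and one adversarial training step on raw waveforms. The model, optimizer, step size, and the clamping of waveforms to [−1, 1] are illustrative assumptions, not the exact implementation used in this chapter.

import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps, alpha, n_iter):
    """Craft l_inf-bounded adversarial waveforms with PGD (random start)."""
    delta = torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(n_iter):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model((x + delta).clamp(-1.0, 1.0)), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend along the sign of the gradient, then project back onto the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    return (x + delta).clamp(-1.0, 1.0)

def adversarial_training_step(model, optimizer, x, y, eps=0.002):
    """One PGD-10 adversarial training step on a mini-batch of waveforms."""
    model.eval()                          # keep BatchNorm statistics fixed while attacking
    x_adv = pgd_linf(model, x, y, eps=eps, alpha=eps / 4.0, n_iter=10)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()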
3.9 Appendices

Figure 3.4: Mel-spectrograms (0 to 8192 Hz) of an original utterance and its perturbed versions under different l∞ attacks at varying strengths. Panel annotations: (a) FGSM: ε = 0.002 (SNR +30 dB, PESQ 3.288), ε = 0.02 (SNR +7 dB, PESQ 2.152), ε = 0.2 (SNR −13 dB, PESQ 1.405); (b) PGD-100: ε = 0.002 (SNR +32 dB, PESQ 3.320), ε = 0.02 (SNR +9 dB, PESQ 1.923), ε = 0.2 (SNR −10 dB, PESQ 0.896); (c) Carlini l∞: ε = 0.002 (SNR +73 dB, PESQ 4.429), ε = 0.02 (SNR +49 dB, PESQ 4.174), ε = 0.2 (SNR +10 dB, PESQ 3.359).

3.9.1 Visualizing spectrograms

Figure 3.4 shows the mel-spectrograms of a randomly chosen utterance for different attacks at varying ε values. Here, for exploratory analysis, we increase ε beyond the range specified in our experiments described in the main text. We can see that for both FGSM and PGD, the noise is visible in the mel-spectrogram for ε = 0.002. The signal becomes extremely noisy for ε = 0.2 (the SNR drops to −10 dB or below, and the PESQ score falls below 1.5). On the other hand, for the Carlini l∞ attack, the noise is almost invisible at ε = 0.002 and ε = 0.02, as also evident from the high SNR values and PESQ scores. The noise becomes somewhat visible at ε = 0.2, where the SNR drops to 10 dB and the PESQ score becomes 3.4.

Figure 3.5: Similarity (on a scale of [0, 1]) between wrong predictions made by the model for different attacks (FGSM, PGD-100, Carlini l∞, Carlini l_2), shown for (a) ε = 0.0005, (b) ε = 0.002, and (c) ε = 0.005.

3.9.2 Similarity in misclassification for different attacks

We investigate whether different attack algorithms force the model to misclassify a particular input utterance as the same (wrong) speaker. This could possibly reveal similarity between different attack algorithms. Figure 3.5 shows the fraction of similarity (i.e., the average number of matches) between the wrong predictions made by the model for different attacks. As evident, the wrong predictions for the Carlini l∞ and Carlini l_2 attacks are very similar (> 90% similarity for all the ε values), possibly because the inherent strategy of the Carlini attack remains the same in the two variants. The similarity between FGSM and the two Carlini attacks is also noticeable. More interestingly, the similarity scores
tend to decrease when ε increases. We hypothesize that a low ε constrains the attack algorithm to a smaller space for the perturbation, and hence, the model generally tends to wrongly predict the closest class (the one that causes the most confusion). On the other hand, a high ε opens up a much larger allowed space for the perturbation, and hence, the similarity between the wrong predictions tends to decrease.

3.9.3 Network Architecture

The employed 1D CNN model is shown in Table 3.6.

Table 3.6: The network architecture of the CNN model we used in this work. It takes a waveform as input and predicts class logits. In all of the CNNs, the number of padded points (two-sided) is ⌊k/2⌋.

audio waveform x ∈ [−1, 1]^T
DSP: [−1, 1]^T → R^{32×T′}
Conv1D(32, 64, k=3); BatchNorm, ReLU; MaxPool1D(2)
Conv1D(64, 128, k=3); BatchNorm, ReLU
Conv1D(128, 128, k=3); BatchNorm, ReLU
Conv1D(128, 128, k=3); BatchNorm, ReLU; MaxPool1D(2)
Conv1D(128, 128, k=3); BatchNorm, ReLU
Conv1D(128, 64, k=3); BatchNorm, ReLU; MaxPool1D(2)
Conv1D(64, 32, k=3); BatchNorm, ReLU
MaxPool1D over time (the output is treated as the speaker embedding)
FullyConnected(32, 251)

Chapter 4

Signal-space variability III: Long-term variability

The last two chapters dealt with learning a robust audio representation under two different forms of signal-space variability. As introduced in Chapter 1, different information streams in the audio signal vary at different rates. In this chapter, we will show that a short-term audio representation might not be sufficient to model and predict temporally varying background acoustic scenes (see Section 1.3.1 for an introduction to the definition of an acoustic scene). We will also demonstrate a two-stage modeling framework that involves a segment-level model and a higher-level temporal model. The proposed method will be shown to capture long-term patterns of the temporally varying acoustic scenes [94].

4.1 Introduction

The human auditory system experiences a multitude of sounds, often dynamically changing over space and time, from its ambient environment. These experiences are influenced by the nature of an individual's daily routine, lifestyle and, notably, occupation. For example, a highway maintenance worker might experience traffic noise throughout the day, while a nurse in a hospital might deal mostly with human speech and equipment noises. Different sounds tend to have diverse effects on human health. For example, nature sounds are found to be beneficial for supporting recovery from a psychological stressor [5]. On the other hand, certain ambient noises tend to have detrimental effects on both physiological and psychological well-being, from immediate changes in heart rate variability [108] to disturbed sleep patterns [136]. Researchers have also explored the connection between environmental sounds and the elicitation of positive and negative emotional reactions [133]. Increased levels of anxiety and depression have been observed in people from diverse age groups due to annoyance caused by undesirable sounds [20]. Office or workplace sounds and noises are found to cause annoyance [17, 49] and decreased concentration [13] depending on subjective noise sensitivity [17], which might eventually result in lower performance and productivity [180, 123, 143].

Technological advances in wearable devices [63, 115] with the capability of capturing multimodal [24] body signals offer a unique opportunity to study the relationship between the ambient acoustic environment and our everyday life and behavioral patterns. An egocentric analysis (one that is centered on and evolves around an individual) is particularly interesting since it could illuminate auditory experiences directly from the perspective of the user wearing the mobile device.
Automated characterization of the sounds and categorization of the dynamic acoustic scenes experienced by the individuals using the wearable devices is an essential step toward studying the aforementioned relationship. Although recent progress in deep neural network (DNN)-based approaches facilitates accurate detection and classification of sound events [66, 78, 93, 163, 44] and acoustic scenes [188, 130, 45], these approaches do not address applications based on real-world egocentric audio recordings collected through off-the-shelf wearable devices, especially in the scenario where environmental sounds are possibly overlaid with the user's own speech. We term this in-talk acoustic scene identification, since it tries to infer the background acoustic scene when the user's speech is captured by the worn mobile device. This can be useful in providing context-aware user notifications and experiences, and can facilitate environment-aware decision making. Furthermore, to the best of our knowledge, past research has mostly focused on categorizing the underlying acoustic scene given an audio recording, and little work has attempted to model the temporally evolving acoustic scenes that a particular individual experiences through the course of their daily life.

In this chapter, we focus our study on in-talk acoustic environments experienced by employees in a workplace, specifically nurses and other clinical providers in a critical care hospital. Every employee in the workplace experiences a variety of, and potential patterns within, acoustic scenes during the work hours day to day. The acoustic scenes under investigation lie inside a larger set of workplace (hospital) acoustics, and thus represent audio locales with unique acoustic characteristics. In contrast to the commonly used manual acoustic scene annotation schemes [66, 131], we deploy Bluetooth-based acoustic scene tracking devices [135]. We hypothesize the existence of possible temporal patterns in the sequence of acoustic scenes that an individual experiences in a workplace (in this case a hospital), from amongst a finite number of acoustic scenes (discussed in Section 4.2). The temporal patterns could be associated with, or driven by, the daily routine, demographics, job-roles, and work habits of the person. Motivated by the psychology literature presented above, we also hypothesize that the pattern and duration of exposure to a specific set of acoustic scenes might be correlated with certain behavioral states and traits of the employees. Finally, in the case of existence of temporal patterns, we investigate whether we can capture them by employing a machine learning model.

To pursue the above hypotheses, we organize the work into two parts.

• Egocentric analysis: We define some scalar measures of the temporal dynamics given a sequence of acoustic scenes experienced by a certain employee. Then, we perform statistical analyses to verify whether the measures of dynamics are indeed related to some of their underlying factors like job-roles, daily routines, habits, and demographic information of the employee. Furthermore, we perform a correlation study to explore the relationship of the measures with factors such as job performance. The results of this egocentric analysis will reveal the presence of rich temporal patterns in the acoustic scenes experienced by an employee, and a strong, statistically significant relationship between those patterns and the job-roles, habits, and job performance of the employee.
• Prediction: We investigate whether machine learning models can capture the temporal dynamics of acoustic scenes, and thus allow us to predict the sequence of acoustic scenes from the egocentric audio features collected through the wearable device.

The egocentric analysis presented in this work provides initial evidence of the presence of temporal patterns in the sequence of acoustic scenes, and thus builds a foundation for developing machine learning models that can learn those temporal patterns for the prediction task. Moreover, in Section 4.6.4, we perform the same egocentric analysis with the model's predictions (instead of the true scene labels), which underscores the benefits of employing a machine learning model with the capability of learning the underlying temporal dynamics.

For the modeling and prediction task, we propose a two-stage DNN-based framework. A segment-level model is trained to map a segment of low-level audio features to the corresponding acoustic scene label. A recurrent neural network (RNN) subsequently utilizes the embeddings from a pre-trained segment-level model, and learns to map them to a sequence of acoustic scene labels. This two-stage learning framework helps us to separately analyze the capability of the in-talk audio features to infer the background acoustic scenes, and the existence of rich temporal patterns in the sequence of acoustic scenes experienced by a specific user. It is worth mentioning that if there were no specific temporal pattern in the sequence of acoustic scenes experienced by the employees, then the incorporation of the RNN model would not help improve the performance beyond that of the segment-level model. We will find that in our setting the RNN model, working on top of the segment-level embeddings, helps learn the temporal patterns better than solely employing the segment-level model.

4.2 Dataset

The data are from the TILES (Tracking IndividuaL performancE with Sensor) project, a part of the IARPA MOSAIC program∗, that aimed at assessing the effect of workplace stressors on employees' affective traits, behaviors, and job performance and productivity.

∗ Multimodal Objective Sensing to Assess Individuals with Context (MOSAIC): https://www.iarpa.gov/index.php/research-programs/mosaic

Figure 4.1: An illustrative schematic of the hospital acoustic scenes: nursing station, patient room, lab, lounge, and medication room. Every scene has its own sources of sound events, and thus a unique acoustic characteristic. Note that each acoustic scene might have more than one instance (e.g., multiple patient rooms). A nurse (red) might experience several acoustic scenes in a certain work shift, while a lab technician (green) might experience fewer acoustic scenes due to relatively lower mobility. Owl is the Bluetooth hub recording the location context of the user; Owls are installed in different places having different acoustic scenes. Jelly is the wearable device that captures audio features, and assists the Owls in registering the location of the user. The sequence of acoustic scenes a user encounters is derived from the location information captured by multiple Owls. The figure is best viewed in color.

As a part of the project, we deployed wearable sensors to capture multimodal (audio, physiological, location, etc.) [24] data from nurses and other clinical providers in a large critical care hospital†.
The data collection from each clinical provider participant lasted over a duration of 10 weeks; each provider could be in one of multiple work shifts (e.g., day, night), and each shift spanned 8 to 12 hours. The current study focuses on audio and location data from a set of 170 participants (47 male and 123 female)‡. More details about the TILES dataset can be found in [134].

† USC Keck Hospital, Los Angeles, CA, USA.
‡ All the data were collected in accordance with USC's Health Sciences Campus Institutional Review Board (IRB) approval (study ID HS-17-00876).

Figure 4.1 depicts an illustrative schematic of the five acoustically relevant locales in the hospital environment considered in this study: nursing station, patient room, lab, lounge, and medication room. Multiple instances of these locales exist in the actual experimental setting, distributed across the hospital. In a hypothetical situation, a nurse might experience more than one acoustic locale because of higher mobility in their job, and hence a richer set of acoustic scenes, while a lab technician might encounter fewer of them in a certain work shift due to the more static nature of the job. Moreover, the temporal pattern in which they encounter the acoustic scenes might vary from one job type to another. The analyses of the acoustic scene characteristics with respect to job type, daily routines, and individual behavioral patterns (Section 4.3) lay the foundation for the subsequent automated prediction (Section 4.4) of the temporal sequence of acoustic scenes from audio features.

4.2.1 Acoustic features

Acoustic features were collected through a wearable audio recorder (an audio badge called the "TILES Audio Recorder") programmed in-house using a Jelly§ phone device as described in [63]. Audio recordings were triggered with a custom online energy-based Voice Activity Detector (VAD). The HIPAA regulations [7] and the sensitive scenario of the study prevented us from storing the raw audio signals. Instead, the audio badge sampled the audio signal at 16 kHz, and then extracted (online) several low-level descriptive features from the audio using the OpenSMILE toolkit [61]. The online feature extraction was performed with a 60 ms window length and a 10 ms shift. The feature set [63] includes energy, prosodic features like pitch, vocal jitter and shimmer, and spectral features like MFCCs. We use the energy and 13 MFCC features along with their derivatives (delta and delta-delta) for the current study, thus creating 42-dimensional feature vectors. The raw audio signal is discarded after the online feature extraction.

§ Jelly phone device from Unihertz [134, 63].

4.2.2 Acoustic scene tracking

In similar previous studies, acoustic scenes are annotated by humans after the recording is done [188, 131]. The expensive and time-consuming nature of the manual labeling process prohibited us from performing human annotations on our data, especially since our interest was in closely sampled scene labeling for tracking over the entire duration of a work shift. Toward this end, we installed Bluetooth-based transceivers, the Owl-in-ones (Owl in short) sensors [135], in all instances of every acoustic scene (see Figure 4.1). The Jelly sends Bluetooth pings that are received by the Owls in terms of Received Signal Strength Indicator (RSSI) values. The Owl receiving the maximum RSSI value is used to register the location, and hence, the background acoustic scene of the participant at a certain time instant.
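As a rough illustration of this proximity-based labeling (the data layout and names below are hypothetical, not the actual TILES processing code), the scene at each time stamp can be assigned from the Owl that received the strongest ping:

def assign_scenes(pings, owl_to_scene):
    """Assign an acoustic scene to each time stamp from Bluetooth RSSI pings.

    pings: iterable of (timestamp, owl_id, rssi) tuples.
    owl_to_scene: mapping from owl_id to its acoustic scene label.
    Returns a time-ordered list of (timestamp, scene) pairs, choosing at each
    timestamp the Owl with the maximum received signal strength.
    """
    best = {}  # timestamp -> (rssi, owl_id)
    for t, owl_id, rssi in pings:
        if t not in best or rssi > best[t][0]:
            best[t] = (rssi, owl_id)
    return [(t, owl_to_scene[owl_id]) for t, (_, owl_id) in sorted(best.items())]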
4.2.3 Contextual and user demographic measures

As introduced in Section 4.1, we hypothesize that the temporal patterns of the sequence of acoustic scenes experienced by an employee might be associated with factors such as job-roles, habits, daily routines, and demographics of the individual. Therefore, at the beginning of the TILES study, we collected self-reported information from volunteer participants about their demographics and daily routines [134]. This included information about work shift, hours of work, current position in the hospital, extra jobs held, etc. A detailed list is presented in Appendix 4.8.1.

We also hypothesized in Section 4.1 that the dynamics of exposure to various acoustic scenes may be related to some of the behavioral traits and characteristics of the employee. As a part of the study, the participants also completed a comprehensive self-reported assessment (called here the Initial Ground Truth Battery (IGTB) assessment) using a variety of standard psychological instruments to measure multiple traits and behavioral aspects such as personality, organizational behavior, executive function, and smoking and drinking habits [115]. A complete list of the measured IGTB constructs is provided in Appendix 4.8.2. A more detailed description can be found in [134].

4.3 Egocentric Analysis of Acoustic Scene Dynamics

As introduced in Section 4.1, we investigate the presence of temporal patterns in the acoustic scenes experienced by an employee in the workplace. In this section, we propose various scalar measures that capture the temporal variability in the scene dynamics, and investigate how the underlying temporal patterns are associated with factors such as job-roles, daily routines, and demographics of the employees (as introduced in Section 4.2.3). Moreover, we explore whether some of those measures are related to employee job performance and cognitive ability.

We represent the temporally varying acoustic scenes experienced by a certain participant by an ordered sequence of enumerated scene labels as defined below: [lounge: 1, patient room: 2, nursing station: 3, medication room: 4, lab: 5]. For example, in a hypothetical situation, if a participant experiences [nursing station, nursing station, patient room, patient room, patient room, lab] in order, then we represent the temporally varying sequence of acoustic scenes by [3, 3, 2, 2, 2, 5]. The enumeration is fixed throughout the experiment.

Table 4.1: Abbreviation and description of different measures of acoustic scene dynamics incorporated in the egocentric analysis. All of the following measures are computed on a sequence of acoustic scenes.

Abbreviation | Description
std          | Standard deviation
range        | Range
iqr          | Inter-quartile range
change       | Normalized number of changes
1-gram-x     | 1-gram count for acoustic scene class 'x'
2-gram-xy    | 2-gram count for acoustic scene class pair 'x' and 'y'
entropy      | Shannon's entropy measure
tdf x        | Term frequency-inverse document frequency for class 'x'

Definition 4.3.1. A sequence of acoustic scenes experienced by the i-th participant is given by

$$\mathcal{Y}_i = \left[ y^i_0, y^i_1, y^i_2, \ldots, y^i_{(T_i - 1)} \right] = \left\{ y^i_t \right\}_{t=0}^{T_i - 1}, \qquad (4.1)$$

where T_i is the length of the sequence, and y^i_t ∈ {1, 2, ..., C} for C = 5 acoustic scene classes.

Note that Y_i might not contain uniformly spaced acoustic scenes because of our in-talk analysis (in Section 4.4.1, we will discuss how in-talk audio segments are extracted), but they are temporally ordered.
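A minimal sketch of this encoding, using the enumeration above and the hypothetical example from the text:

# Fixed enumeration of the C = 5 acoustic scene classes used throughout this chapter.
SCENE_TO_ID = {"lounge": 1, "patient room": 2, "nursing station": 3,
               "medication room": 4, "lab": 5}

def encode_scene_sequence(scene_names):
    """Map a temporally ordered list of scene names to the label sequence Y_i."""
    return [SCENE_TO_ID[name] for name in scene_names]

# The example from the text encodes to [3, 3, 2, 2, 2, 5].
Y_i = encode_scene_sequence(["nursing station", "nursing station", "patient room",
                             "patient room", "patient room", "lab"])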
4.3.1 Acoustic scene dynamics

We propose a set of measures to capture the dynamics of a sequence of acoustic scenes, Y_i. In general, each measure quantifies a certain pattern of a sequence of acoustic scenes for a particular user in terms of a scalar score. Table 4.1 summarizes the abbreviations and descriptions of the measures we employ for the analysis. The abbreviations will be used in Figure 4.2 and Figure 4.3. Some of the measures are defined below in detail.

To quantify the mobility of an employee between different acoustic scenes, we look at the number of changes in Y_i.

Definition 4.3.2. The Normalized number of changes is defined as the total number of changes in acoustic scenes normalized by the sequence length:

$$\mathrm{change}(\mathcal{Y}_i) = \frac{1}{T_i - 1} \sum_{t=1}^{T_i - 1} \mathbb{I}\left( \Delta_i[t] \neq 0 \right), \qquad (4.2)$$

where Δ_i[t] = Y_i[t] − Y_i[t−1] is the 1st-order difference sequence, and I(·) is an indicator function, i.e., I(b) = 1 if b is True, otherwise I(b) = 0. Higher values of change(Y_i) indicate higher mobility of the participant between different acoustic scenes.

The normalized number of changes gives an aggregated measure of variation in the sequence of acoustic scenes, Y_i. More fine-grained information about the amount of time spent in a particular acoustic scene, or the frequency of movement between two different acoustic scenes, might reveal important characteristics of Y_i. These can be formally quantified by n-gram counts, which are frequently employed for language modeling in the natural language processing domain [97]. The 1-gram count quantifies the number of occurrences of a particular acoustic scene class normalized by the sequence length. Higher values of '1-gram-x' (as abbreviated in Table 4.1) indicate that the user spends more time in acoustic scene 'x'. The 2-gram count measures the frequency of scene changes from one class to another, normalized by the sequence length. Higher values of '2-gram-xy' (as abbreviated in Table 4.1) indicate that the user frequently moves from scene 'x' to scene 'y'.

Furthermore, we try to quantify the uncertainty in the perceived acoustic scenes present in Y_i through the "entropy" measure. Intuitively, the Y_i for a participant who mostly stays in the same acoustic scene should have different characteristics than the Y_i experienced by another participant who frequently moves between different scenes in the workplace.

Definition 4.3.3. The entropy [50] is defined as the average amount of uncertainty present in the signal. Denoting Y as a random variable for the observed acoustic scene with possible outcomes y_k ∈ {1, 2, ..., C}, each with probability P_Y(y_k), Shannon's entropy is defined as¶:

$$H(Y) = -\sum_{k} P_Y(y_k) \log_2 P_Y(y_k). \qquad (4.3)$$

Finally, we borrow the "tf-idf" measure from the information retrieval literature [181]. In the current context, it intuitively denotes how important a particular acoustic scene c is to a certain sequence of acoustic scenes Y_i in a collection of several sequences S.

Definition 4.3.4. The tf-idf for a particular acoustic class c can be defined as:

$$\text{tf-idf}(c, \mathcal{Y}_i, \mathcal{S}) = \text{tf}(c, \mathcal{Y}_i) \cdot \text{idf}(c, \mathcal{S}). \qquad (4.4)$$

¶ Ideally this is valid for i.i.d. observations, which might not always be satisfied in our dataset.

Figure 4.2: p-values obtained from the Kruskal-Wallis hypothesis tests between scene dynamics (horizontal axis) and individual daily routines and demographics (vertical axis). The multiple comparison problem is corrected by the Benjamini-Hochberg procedure. All the indicated p-values are statistically significant, and correspond to cases where the null hypothesis is rejected.
Cases with p < 0.001 are shown as 0 for clearer visualization. Empty cells denote observations that fail to significantly reject the null hypothesis as determined by the Benjamini-Hochberg procedure. Please see Sections 4.3.2.1 and 4.3.2.2.

Here, S = {Y_i}_{i=1}^N denotes the collection of all sequences of acoustic scenes in the dataset containing N participants. The term frequency tf(c, Y_i) denotes the frequency of occurrence of acoustic class c in the sequence Y_i. The inverse document frequency idf(c, S) measures how much information the scene c provides, and penalizes it if it occurs frequently in most of the sequences:

$$\text{idf}(c, \mathcal{S}) = \ln\frac{1 + N}{1 + \text{df}(c)} + 1, \qquad (4.5)$$

where df(c) is the number of sequences in which the class c is present. The final score is l_2-normalized.

Figure 4.3: Spearman's correlation between scene dynamics (horizontal axis) and individual behavioral constructs (vertical axis). The multiple comparison problem is corrected by the Benjamini-Hochberg procedure. Empty cells denote zero correlations or statistically insignificant correlations. All indicated nonzero correlations are statistically significant as determined by the Benjamini-Hochberg procedure. Please see Sections 4.3.2.1 and 4.3.2.3. Best viewed in color.

4.3.2 Relationship of acoustic scene dynamics with individual demographic and behavioral constructs

In this part, we report correlation analyses and hypothesis tests to explore relationships between different measures of acoustic scene dynamics, and individual demographics and behavioral constructs.

4.3.2.1 Multiple comparisons

All the tests involve multiple comparisons [125, 170], i.e., we have multiple variables (e.g., several demographic and daily routine variables as shown on the vertical axis of Figure 4.2), and hence, a statistically significant observation (in general a low p-value, i.e., p < α = 0.05) might occur purely by chance. As a correction technique, we incorporate the Benjamini-Hochberg procedure [19] to control the False Discovery Rate (FDR). The FDR is defined as the proportion of significant results or "discoveries" that are actually false positives [125]. In brief, the Benjamini-Hochberg procedure ranks all the test outcomes by sorting the p-values in increasing order. The maximum p-value which satisfies p < (i/m)Q is a significant observation, and all the smaller p-values are also significant. (i/m)Q is known as the critical value, where i is the assigned rank, m is the total number of tests, and Q is the chosen FDR. In the following two experiments, we apply the Benjamini-Hochberg procedure with 10% FDR to choose the statistically significant observations‖. For each dynamics measure (e.g., normalized number of changes or the "change" column in Figure 4.2), the Benjamini-Hochberg procedure is performed over all demographic and daily routine variables (i.e., over all rows in Figure 4.2).

4.3.2.2 Relationship with individual demographics

We attempt to find the association between the proposed measures of scene dynamics and underlying factors such as job-roles, daily routines, habits, and other demographic information. We hypothesize that different groups of employees (e.g., with different job-roles) might experience different patterns in their acoustic scene dynamics. Toward this end, we perform a hypothesis test to reveal this relationship. Most of the constructs for demographics, job-roles, and daily routines introduced in Section 4.2.3 (full list in Appendix 4.8.1) are categorical in nature.
Therefore, we perform the Kruskal-Wallis hypothesis test [111] between these individual demographics and the measures of experienced scene dynamics for all the participants. This test is a non-parametric version of the one-way ANOVA test. The null hypothesis assumes that data (here, a particular scene dynamics measure, e.g., the normalized number of changes or "change") from different categories (here, different groups of a particular demographic, e.g., different job-roles or current position in the hospital) come from the same distribution. A resultant low p-value casts doubt on the validity of the null hypothesis, and those observations are of particular interest since they indicate that all the data samples do not come from the same distribution. In other words, different groups of participants, with respect to a certain demographic construct, experience different acoustic scene dynamics.

‖ Please note that 10% FDR does not mean α = 0.1. Significant p-values are chosen based on ranking and the Benjamini-Hochberg critical value, (i/m)Q [125].

Figure 4.2 shows the observations which reject the null hypothesis in the Kruskal-Wallis test (observations with p < 0.001 are denoted as 0 for clearer visualization). Some notable observations are summarized below:

• Current occupation (currentposition) in the hospital has a low p-value for most of the measures of acoustic scene dynamics, which intuitively makes sense since the job-roles presumably determine the mobility patterns of the hospital employees.

• Rejections of the null hypothesis are also observed for work shift (shift), hours of work (hours), overtime, type of commute (commute type), extra jobs held (extrajobs), extra hours (extrahours), and student status (student).

4.3.2.3 Relationship with individual behavioral constructs

We compute Spearman's rank correlation between the scene dynamics measures and the individual behavioral constructs (Section 4.2.3, and Appendix 4.8.2) over all the participants. Figure 4.3 shows the statistically significant correlations between the IGTB constructs and the measures of acoustic scene dynamics. Some notable findings therein are:

• Some job performance related constructs (In-Role Behavior: irb, and Organizational Citizenship Behavior: ocb) are significantly correlated with a number of the acoustic dynamics measures.

• Cognitive ability measures (Shipley Abstraction: shipley abs and Shipley Vocabulary: shipley voc) tend to manifest a similar trend.

• Physical activity related measures (ipaq) also show significant correlations with a couple of the acoustic dynamics measures.

• A maximum absolute correlation of +0.33 is observed between ocb and 1-gram-2/tdf 2.

• Among the personality constructs, only "agreeableness" (agr) shows significant correlations with two dynamics measures.

• Affect-related constructs (pos af and neg af) do not show significant correlations with any of the scene dynamics measures.

To summarize, we defined a set of scalar measures that capture specific temporal patterns in a dynamically varying sequence of acoustic scenes experienced by a certain employee. The rejection of the null hypothesis in the Kruskal-Wallis tests in Section 4.3.2.2 shows that those measures are associated with some of the underlying factors like job-roles and daily routines. Moreover, the presence of significant correlations in Section 4.3.2.3 indicates the presence of an inherent relationship between the measures and certain behavioral constructs of the employees, specifically those related to job performance and cognitive ability. These observations confirm the hypotheses introduced in Section 4.1, and also motivate us to investigate further the potential of machine learning methods for modeling the temporal patterns, and inferring the acoustic scene classes directly from the audio features collected via the wearable device.
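As a concrete sketch of the statistical analysis described in this section, the following Python snippet computes two of the dynamics measures from Section 4.3.1 and runs per-measure Kruskal-Wallis tests across demographic groups with Benjamini-Hochberg correction. The data layout (one array of measure values per group) is a hypothetical simplification of the actual study data.

import numpy as np
from scipy.stats import kruskal
from statsmodels.stats.multitest import multipletests

def normalized_changes(Y):
    """Normalized number of changes in a scene sequence (Definition 4.3.2)."""
    Y = np.asarray(Y)
    return np.mean(Y[1:] != Y[:-1])

def scene_entropy(Y):
    """Shannon entropy (in bits) of the observed scene labels (Definition 4.3.3)."""
    _, counts = np.unique(Y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def kruskal_with_fdr(measures_by_group, q=0.10):
    """One Kruskal-Wallis test per dynamics measure, corrected at FDR level q.

    measures_by_group: list of dicts, one per measure, each mapping a group label
    (e.g., a job-role) to a 1-D array of that measure over the group's participants.
    Returns raw p-values, BH-adjusted p-values, and the rejection decisions.
    """
    pvals = [kruskal(*groups.values()).pvalue for groups in measures_by_group]
    reject, p_adj, _, _ = multipletests(pvals, alpha=q, method="fdr_bh")
    return pvals, p_adj, reject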
Figure 4.4: A two-stage modeling framework for identifying a sequence of in-talk acoustic scenes. Left: The acoustic feature stream is masked with the output of the foreground detection model. This keeps the portions of the stream where there is a possible foreground activity, which are then segmented in windows of fixed length T_s. Middle: The segment-level TDNN model takes a segment of length T_s, and learns to predict the corresponding acoustic scene. Right: A GRU model is then trained on top of the segment-level embeddings for learning the sequence of acoustic scenes.

4.4 Automated Prediction of Temporally Varying Acoustic Scenes

As explained in Section 4.1, in contrast to existing acoustic scene classification works, we deal with long egocentric in-talk audio recordings to predict the sequence of observed acoustic scene classes. For this, we propose a two-stage modeling framework. A segment-level model first processes the raw acoustic features. The intermediate representations (or embeddings) learned by the segment-level model are passed on to a recurrent model to learn the temporal dynamics of the sequence of acoustic scenes. Figure 4.4 shows an overview of the employed framework.

4.4.1 Processing of acoustic features

4.4.1.1 In-talk acoustic scene identification

The acoustic features obtained from the TILES audio recorder correspond to any audio activity that happened near the participant during the work shift (due to the presence of the VAD module, see Section 4.2.1). As introduced in Section 4.1, in this work we focus on the problem of in-talk acoustic scene identification, i.e., classification of the background acoustic scene while the user (wearing the audio recording device) is presumably talking. The main difference with traditional acoustic scene identification is that the in-talk acoustic signals originating from ambient sound sources are supposed to be overlaid with speech coming from the user wearing the microphone (in Section 4.1, we have discussed several possible applications). Another distinction comes from the egocentric collection of audio over a long duration, which opens up opportunities for encountering dynamically evolving acoustic scenes.

4.4.1.2 Foreground activity detector

The collection of the in-talk acoustic signal requires selecting the portions of the entire audio recording which correspond to possible speech activity by the participant wearing the mobile device. We apply a foreground speaker (i.e., the person wearing the mobile device) detection model developed in [137] to extract the portions of the audio recordings where there is possible speech activity from the participant. The foreground detection model is trained on a labeled dataset of meeting speech (see [137] for details), and the pre-trained model is used in the current work.
The model provides smoothed binary masks that are employed to extract the foreground speech activity (see the left part of Figure 4.4). The audio features obtained after the masking contain the user's own speech overlaid on background audio coming from different sound sources like machine beeps, door slamming, clocks, telephones, and speech from other people. A certain combination of some of these sound events gives an acoustic scene its unique characteristics. The masked audio features are segmented in chunks of length T_s, which are subsequently employed for segment-level modeling.

4.4.2 Modeling segment-level acoustic scene

We represent a sequence of all temporally ordered segments (which might not be uniformly spaced) for the i-th participant by

$$\mathcal{X}_i = \left[ \mathbf{X}^i_0, \mathbf{X}^i_1, \ldots, \mathbf{X}^i_{T_i - 1} \right] = \left\{ \mathbf{X}^i_t \right\}_{t=0}^{T_i - 1}, \qquad (4.6)$$

where X^i_t represents a segment of acoustic feature vectors of length T_s. Note that the corresponding acoustic scene labels are given by Y_i, as defined in (4.1). The segment-level model ignores the time information, and treats all the segments in X_i as independent and identically distributed (i.i.d.). Dropping the time index, the segment-level model takes X^i as input and learns to predict the corresponding scene label y^i. We employ a Time-Delay Neural Network (TDNN) as introduced in Chapter 2. The middle part of Figure 4.4 shows a schematic of the TDNN model. The TDNN learns a nonlinear mapping or embedding from the input features: e^i = f(X^i). The embedding e^i is projected to the final output layer with C = 5 softmax units emitting posterior probabilities for every acoustic scene class, P(ŷ^i = c | X^i). The model is trained with an acoustic scene classification objective, and the training is done with all segments from all the participants in the training set (see Section 4.5.3 for details). Generally, the embedding learned in this fashion captures the semantic class-level information (for example, in speaker verification the speaker embeddings carry speaker characteristics [184, 95]; in our case they carry the acoustic scene classes), and thus it can be subsequently used for temporal modeling.

4.4.3 Modeling temporal sequence of acoustic scenes

The embeddings learned from the segment-level model help compress a chunk of audio features of length T_s into a fixed-dimensional vector (typically 128-dimensional, see Section 4.5). But the segment-level model does not exploit the temporal dependencies available in the data. We hypothesize that the way a particular participant encounters acoustic scenes during his/her work shift might possess a certain temporal pattern. Therefore, a recurrent modeling framework might be more effective in predicting the sequence of acoustic scenes as experienced by that participant. We adopt a Gated Recurrent Unit (GRU) [48] neural network to map the sequence of segment-level embeddings [e^i_0, e^i_1, ..., e^i_{L-1}] into the sequence of acoustic scene labels [y^i_0, y^i_1, ..., y^i_{L-1}] (see the right part of Figure 4.4).
Dropping the participant's index i for simplicity, and considering e_t to be the input at the t-th time step, the recurrent transformations for a single-layer GRU and the output transformation in this work can be summarized as (see [48] for details):

$$\begin{aligned}
\text{Reset:} \quad & \mathbf{r}_t = \sigma(\mathbf{W}_{er}\mathbf{e}_t + \mathbf{b}_{er} + \mathbf{W}_{hr}\mathbf{h}_{t-1} + \mathbf{b}_{hr}) \\
\text{Update:} \quad & \mathbf{z}_t = \sigma(\mathbf{W}_{ez}\mathbf{e}_t + \mathbf{b}_{ez} + \mathbf{W}_{hz}\mathbf{h}_{t-1} + \mathbf{b}_{hz}) \\
\text{Candidate:} \quad & \tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_{e\tilde{h}}\mathbf{e}_t + \mathbf{b}_{e\tilde{h}} + \mathbf{r}_t \odot (\mathbf{W}_{h\tilde{h}}\mathbf{h}_{t-1} + \mathbf{b}_{h\tilde{h}})) \\
\text{Hidden:} \quad & \mathbf{h}_t = (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t + \mathbf{z}_t \odot \mathbf{h}_{t-1} \\
\text{Output:} \quad & \mathbf{o}_t = \mathbf{W}_o\,\mathrm{relu}(\mathbf{h}_t) + \mathbf{b}_o \\
\text{Posterior:} \quad & \hat{y}_t = \mathrm{softmax}(\mathbf{o}_t)
\end{aligned}$$

where σ(·) is the sigmoid function and ⊙ denotes the Hadamard product. For a multi-layer GRU (not shown in Figure 4.4 for clarity, but employed in our experiments), the hidden state h_t of a layer becomes the input to the next layer.

4.5 Experimental Setting

4.5.1 Model parameters

Segment-level models: For the segment-level modeling, we compare the performance of the TDNN model with two other model architectures:

1. Multi-layer perceptron (MLP): It consists of three layers with hidden dimensions [1024 → 1024 → 512]. The embedding dimension is 512. The model has a total of 1.6M trainable parameters. This model is fed with the concatenated temporal mean and standard deviation of the T_s-second segments. The remaining models are provided with the temporal features.

2. Resnet-18: To investigate the potential of 2D time-frequency convolutions, we experiment with a Resnet-18 model [76]. Two modifications are made: the use of 16×2 average pooling to comply with the 42-dimensional features, and having 5 output units for the 5 acoustic scene classes. The embedding dimension is 512. The model has 11.1M parameters.

3. TDNN small: It follows the TDNN architecture of [184], except for the use of fewer CNN filters and a lower statistics dimension. We use 128 filters at every CNN layer, and set the statistics and embedding dimensions to 256 and 128, respectively. The model has 280k parameters.

4. TDNN big: It has 256 filters at every CNN layer, and the statistics and embedding dimensions are 512 and 256, respectively. The model has 954k parameters.

Temporal models: We experiment with GRUs of different sizes and depths. The grid search for model selection is performed over hidden dimensions of [64, 128, 256] and numbers of hidden layers of [1, 2, 3]. The best model is selected based on the validation set performance (see Section 4.5.3).

The following parameters are common for all the above models (both segment-level and temporal):

• ReLU activation between any two hidden layers.

• Softmax activations in the output layer.

• 30% dropout (for the GRU, only applicable if it is a multi-layer GRU).

• Cross entropy loss as the minimization objective. For the temporal model, this is computed over all time steps of the sequence.

• Adam optimizer with learning rate 0.001, β_1 = 0.9, and β_2 = 0.999.

• Mini-batch size of 64.

4.5.2 Data subsets

4.5.2.1 Segment-level experiment

For segment-level training and testing, we mine T_s = 5 second continuous segments from the foreground-masked acoustic features (see Section 4.4.1). Segments shorter than 5 seconds are ignored, and segments longer than 5 seconds are chunked with no overlap. This creates a total of 269,170 samples. The acoustic scene class labels are not uniformly distributed: 43% of the samples come from patient rooms, 37% from the nursing station, 11% from the lounge, 5% from the lab, and 4% from medication rooms.
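The segment mining step described above can be sketched as follows. The frame rate (100 frames per second given the 10 ms feature shift, i.e., 500 frames per 5-second segment) and the handling of leftover frames are assumptions for illustration, not the exact pipeline.

import numpy as np

def mine_segments(features, fg_mask, frames_per_segment=500):
    """Chunk foreground-masked feature frames into fixed-length segments.

    features: (n_frames, 42) array of low-level acoustic features.
    fg_mask:  (n_frames,) boolean foreground-speech mask.
    Contiguous foreground runs shorter than one segment are ignored; longer runs
    are chunked with no overlap. Returns (n_segments, frames_per_segment, 42).
    """
    segments, run = [], []
    for frame, is_fg in zip(features, fg_mask):
        if is_fg:
            run.append(frame)
            if len(run) == frames_per_segment:
                segments.append(np.stack(run))
                run = []
        else:
            run = []  # the foreground run ended; discard any partial segment
    if not segments:
        return np.empty((0, frames_per_segment, features.shape[1]))
    return np.stack(segments)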
Figure 4.5: Mining sequences of acoustic features and scene labels for temporal modeling and evaluation.

Table 4.2: Details of the sequences mined from the audio recordings. The sequence lengths are in terms of the total number of constituent audio segments.

Context duration, d_s | Number of sequences mined | Minimum sequence length | Mean sequence length | Maximum sequence length
15 minutes | 15k | 12 | 28.00 | 424
30 minutes | 19k | 12 | 29.79 | 678
1 hour     | 21k | 12 | 32.97 | 1209
2 hours    | 21k | 12 | 39.54 | 1839

4.5.2.2 Temporal modeling experiments

For temporal modeling, we mine sequences of audio segments from the foreground-masked recordings (Section 4.4.1) along with the corresponding acoustic scene labels. Figure 4.5 shows a schematic for easier interpretation. At a certain time point, we look behind d_s time units (called the context duration), and accumulate all the segments (each of length T_s, similar to Section 4.5.2.1) which fall under any possible foreground speech activity region. This serves the purpose of obtaining the temporal sequences of acoustic features and scene labels required for the temporal modeling (Section 4.4.3). For the next sequence, we move forward in time by Δ time units. This moving window approach with nonzero Δ helps skip repetitive sequences. In this work, we set Δ to be equivalent to 4 audio segments, i.e., Δ = 4 × 5 = 20 seconds. An ablation study was performed with different values of the context duration, d_s: 15 minutes, 30 minutes, 1 hour, and 2 hours. Table 4.2 provides different statistics of the data subsets we mine from the audio recordings for the temporal model training and evaluation.

For training and testing, we incorporate sequences that have at least 12 time steps (i.e., the total duration of all the segments in the sequence is 5 × 12 = 60 seconds)** in a given context duration. In other words, this removes all sequences having fewer than 12 foreground-masked segments. For example, if a 15-minute time window has only two 5-second segments with foreground speaker activity, it is ignored in our current analysis.

** Note that the time steps might not be equally spaced, as can be seen in Figure 4.5.

4.5.3 Data splits, cross validation, and model selection

For all the experiments (both segment-level and temporal), we perform 5-fold cross validation, ensuring no participant overlap between any two folds. This helps mitigate modeling bias that could arise from speaker-related characteristics. We randomly create a validation split from the training participants each time we train. The validation and train splits also have no overlap of participants. We run the training for 50 epochs (because the training loss seems to saturate around that point), and choose the model with the best validation performance. We repeat the overall experiment 5 times, and report the mean accuracy for both segment-level and temporal modeling.

Table 4.3: Performance of different models in segment-level acoustic scene classification.

Model      | Accuracy (%) | F1 score | McNemar's test
MLP        | 54.99        | 0.53     | +
Resnet-18  | 60.14        | 0.57     | +
TDNN small | 63.49        | 0.62     | +
TDNN big   | 63.62        | 0.62     | −

4.6 Results and Discussion

First, we present the results for segment-level prediction (the modeling is in Section 4.4.2). Intuitively, this is equivalent to existing approaches that infer an acoustic scene label given an audio recording (in this case a segment of length T_s = 5 seconds). Next, we present the results for the proposed two-stage temporal model (the modeling is in Section 4.4.3) for a much longer context duration, d_s (discussed in Section 4.5.2.2).
The best performing segment-level model becomes a baseline for the proposed two-stage temporal framework.

4.6.1 Segment-level prediction

Table 4.3 shows the unweighted classification accuracy scores and weighted F1 scores in the 5-fold cross validation experiments for all the models. We report the mean scores over 5 random repetitions, as explained in Section 4.5.3. We perform McNemar's test [129] to verify the statistical significance of the results. It is a non-parametric paired hypothesis test that checks whether the difference between the error rates of two classifiers is statistically significant. In Table 4.3, a positive (+) outcome indicates that the model outperforms the previous model (one row higher in the table), and the difference between their misclassification rates is statistically significant. The model in the first row has been compared with the chance accuracy, which is 43%.

We can see that the basic MLP model is ahead of chance by 12% in classification accuracy. The Resnet-18 model significantly outperforms the MLP by an absolute 5.15% in accuracy and 0.04 in F1 score. Both TDNN models significantly outperform the Resnet-18 model by 3.4% in accuracy and 0.05 in F1 score. We hypothesize that the relatively lower performance of Resnet-18 might be because of its large number of trainable parameters compared to the TDNN models, and possible overfitting issues. Both TDNN models perform similarly, as verified by the negative (−) outcome in McNemar's test when we move from the TDNN small to the TDNN big model. Therefore, we use segment-level embeddings extracted from the TDNN small model for the subsequent temporal analysis due to their lower embedding dimension (which helps train the RNN faster).

4.6.2 Temporal sequence prediction

The sequence models are trained with the embeddings extracted from the segment-level TDNN small model (the embedding dimension is 128). Table 4.4 shows the unweighted accuracy scores and weighted F1 scores for the best performing temporal model and the segment-level model for different values of the context duration. The segment-level model denotes the already trained TDNN small model (Section 4.6.1). The performance of the segment-level model is equivalent to splitting the entire audio recording into multiple chunks and inferring the acoustic scene independently for every segment. Therefore, any performance gain achieved by employing a temporal RNN model would highlight the existence of an underlying temporal pattern in the sequence of acoustic scenes experienced by the employees. We report mean scores over 5 random repetitions of the experiment. The chance accuracy is 45% for all values of the context duration. As explained in Section 4.5.2.2, for the experiment with sequential acoustic scene labels, we discarded the short sequences.

Table 4.4: Performance of different models in predicting the temporal sequence of acoustic scenes for different context durations. "Segment-level" denotes the performance of the TDNN small model aggregated over all 5-second windows (it does not employ temporal information). "Temporal" denotes the performance of the two-stage model (TDNN embeddings + GRU) which learns and utilizes the temporal pattern.

Context duration, d_s | Model         | Accuracy (%) | F1 score | McNemar's test
15 min  | Segment-level | 80.24 | 0.79 | +
15 min  | Temporal      | 83.52 | 0.83 | +
30 min  | Segment-level | 78.06 | 0.77 | +
30 min  | Temporal      | 81.24 | 0.81 | +
1 hour  | Segment-level | 76.26 | 0.75 | +
1 hour  | Temporal      | 79.33 | 0.79 | +
2 hours | Segment-level | 74.97 | 0.74 | +
2 hours | Temporal      | 77.72 | 0.77 | +
Interestingly, we see an increase in the performance of the segment-level model compared to the performance reported in Section 4.6.1. This indicates the inability of the segment-level model to learn the acoustic scenes for the short isolated segments. This might be happening because of errors coming from the foreground detection module (see Section 4.4.1)††.

†† But this is untraceable because of the absence of human-annotated labels and the raw audio signal.

In Table 4.4, a positive (+) outcome of the McNemar's test indicates that the corresponding model significantly outperforms the previous model (one row higher in the table). The first row (segment-level model) is compared with the chance accuracy. It is evident that the GRU-based temporal models significantly outperform the best segment-level model for all values of the context duration in terms of both accuracy and F1 score. We hypothesize that the performance gain arises from the presence of temporal dependencies between the acoustic scenes observed by a certain participant. For example, within a given context duration, a nurse might experience the acoustic scenes in a specific pattern, e.g., (s)he might mostly move between the nursing station and a patient room.

Table 4.4 shows another interesting (if somewhat intuitive) characteristic. The performance of the temporal model decreases as the context duration increases. This might be happening because it is harder to find patterns in the data for longer sequences. The decrease in the performance of the segment-level model might also be a factor, since the segment-level embeddings are utilized as features in the temporal modeling. End-to-end temporal modeling might be more helpful in this situation, and this will be discussed in Section 4.7.

The results of both the segment-level and temporal experiments show that the DNN models are able to classify the acoustic scenes with reasonable accuracy from in-talk acoustic features, representing potentially a mixture of ambient sounds and the user's speech. This shows the feasibility of addressing the task of in-talk ambient acoustic scene classification, as well as the ability of the DNN models to learn the mapping from the in-talk acoustic cues.

Figure 4.6: Visualization of 4 true (red circles, no line) and predicted (black dots and dotted lines) sequences. The accuracy value shown in each subplot indicates the prediction accuracy for that particular sequence (88.89%, 89.42%, 74.24%, and 97.54% from top to bottom).

4.6.3 Sequence visualization

Figure 4.6 visualizes 4 different sequences of acoustic scenes (for the 1 hour context duration) along with their predicted versions at different accuracy levels. A closer inspection reveals that there are different types of errors made by the temporal model, including failures at sudden changes in the acoustic scenes (e.g., the second subplot from the top), and failures at isolated segments (e.g., the third subplot from the top). The errors could arise from either the segment-level embeddings that are utilized to train the temporal model, or the inability of the GRU models to learn the temporal dependencies in those specific situations.
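For reference, the McNemar's test used to compare classifiers in Tables 4.3 and 4.4 can be set up as in the sketch below (the prediction arrays are hypothetical); it builds the 2×2 contingency table of per-sample correctness for the two models and tests whether their disagreements are symmetric.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, pred_a, pred_b, alpha=0.05):
    """Paired McNemar's test on the correctness patterns of two classifiers."""
    correct_a = np.asarray(pred_a) == np.asarray(y_true)
    correct_b = np.asarray(pred_b) == np.asarray(y_true)
    table = [[np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
             [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]]
    result = mcnemar(table, exact=False, correction=True)
    # A "+" outcome in the tables corresponds to a significant difference in error rates.
    return result.pvalue, bool(result.pvalue < alpha)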
4.6.4 Egocentric analysis with predicted scene sequence

The egocentric analysis presented in Section 4.3 was performed with the true sequence of acoustic scenes captured by the Bluetooth trackers. In Section 4.6.2, we compared the performance of both the segment-level and the proposed two-stage temporal model in predicting the sequence of acoustic scene labels from audio features. Here, we analyze how the prediction error affects the egocentric analysis if we perform the analysis with a model's predictions instead of the true scene labels.

Any prediction error will have a direct effect on the measures of acoustic scene dynamics that we compute from a sequence of acoustic scenes (Section 4.3.1). Table 4.5 shows the Mean Absolute Error (MAE) between the true measures of scene dynamics and the predicted ones. The MAE is averaged over all the participants. Thus, a lower MAE indicates better performance by the prediction model, since the predicted measures of dynamics are closer to the true measures. We compare the performance of the segment-level model and the proposed two-stage temporal framework. It is evident that the MAE is smaller for the temporal model than for the segment-level model, which is expected since the former has achieved better prediction performance (Table 4.4). The higher values of MAE for longer context durations might possibly be because of the higher prediction error (as can be seen in Table 4.4).

Table 4.5: Mean Absolute Error (MAE) between the measures of temporal dynamics computed with true scene labels and predicted scene labels.

Model                | 15 mins | 30 mins | 1 hour | 2 hours
Segment-level        | 0.062   | 0.063   | 0.066  | 0.068
Temporal             | 0.048   | 0.050   | 0.054  | 0.055
Relative improvement | 22.21%  | 20.57%  | 17.68% | 18.90%

To perform the egocentric analysis of Section 4.3 with the predicted acoustic scenes, we run the same statistical tests (to recap, Kruskal-Wallis for daily routines and demographics, and Spearman's correlation for behavioral constructs), but this time with a model's predictions. Subsequently, we compare the outcomes of the statistical tests performed with the true and the predicted scene labels. Table 4.6 shows the total number of constructs (either demographic or behavioral) that are found to have a statistically significant relationship with at least one measure of scene dynamics. The number inside the parentheses indicates how many of those constructs also have a significant outcome in the same statistical test performed with the true labels. We present those values for the segment-level and the proposed two-stage temporal model. We can see that, in most of the cases, the predictions from the proposed temporal model find more constructs (compared to the segment-level model) that are also observed in the experiment performed with the true labels. This highlights the efficacy of the proposed two-stage framework in better predicting the temporal sequence of acoustic scenes. A comparison between the two types of constructs (demographics and behavior) shows that the predicted labels obtain outcomes more similar to those of the true labels for the demographic variables. More details can be found in Appendix 4.8.3.

Table 4.6: Total number of constructs with a statistically significant outcome in the egocentric analyses performed with the predicted scene labels. The number inside the parentheses is the number of constructs that are also observed in the same statistical test with the true labels. The rows with "True labels" are for comparison purposes.

Construct    | Model         | 15 mins | 30 mins | 1 hour  | 2 hours
Demographics | True labels   | 13 (13)
Demographics | Segment-level | 5 (5)   | 6 (6)   | 7 (7)   | 6 (6)
Demographics | Temporal      | 8 (7)   | 9 (8)   | 12 (10) | 10 (9)
Behavior     | True labels   | 7 (7)
Behavior     | Segment-level | 3 (1)   | 3 (1)   | 4 (2)   | 7 (4)
Behavior     | Temporal      | 5 (3)   | 7 (4)   | 13 (5)  | 8 (2)

4.7 Conclusion and Future Directions

We characterized the temporal dynamics of the acoustic scenes observed in a workplace. Specifically, we studied the temporally evolving acoustic scenes experienced by nurses
The number inside the paren- thesis is the number of constructs that are also observed in the same statistical test with the true labels. The rows with True Labels are for comparison purpose. Construct Model #Constructs at dierent context duration d s 15 mins 30 mins 1 hour 2 hours Demo- graphics True labels 13 (13) Segment-level 5 (5) 6 (6) 7 (7) 6 (6) Temporal 8 (7) 9 (8) 12 (10) 10 (9) Behavior True labels 7(7) Segment-level 3 (1) 3 (1) 4 (2) 7 (4) Temporal 5 (3) 7 (4) 13 (5) 8 (2) and other clinical providers in a large hospital environment from egocentric audio record- ings collected with wearable microphones. The acoustic scene labels are obtained via Bluetooth hubs installed in the hospital. In the rst part of our study, we investigated the presence of underlying temporal patterns in the sequence of acoustic scenes experienced by a certain individual. To this end, we characterized the temporal dynamics of the acoustic scenes by proposing a set of measures that try to capture the variability in the acoustic scenes experienced by an individual. We showed that some of those measures are strongly associated with a number of driving factors including those related to job type, work hours, and extra jobs. Furthermore, we found the patterns of exposure to a set of acoustic scenes are correlated with variables like job performance and cognitive ability. The second part of the study focused on modeling the temporal dynamics. We pro- posed a two-stage deep learning framework to predict the sequence of acoustic scenes from egocentric audio features. A TDNN-based segment-level model was trained to learn the acoustic scenes from short segments of audio features. The acoustic scene embeddings 100 extracted from the trained segment-level model were utilized in the next stage of learning i.e., the GRU-based temporal model to directly predict the sequence of acoustic scenes. The extensive experiments and results showed the presence of rich temporal patterns in acoustic scenes encountered by the participants. Specically, the proposed two-stage temporal model was found to achieve superior performance to the baseline segment-level model in predicting the sequence of acoustic scenes. In summary, we provided a comprehensive study of dynamically evolving background acoustic scenes from the egocentric perspective of an employee in a workplace. The egocentric analysis revealed rich temporal patterns in the perceived acoustic scenes, which were also found to be strongly associated with a number of underlying job-related factors. This built a foundation for developing machine learning models that can learn those temporal patterns in order to predict the sequence of acoustic scenes directly from audio features. The improvements obtained by employing a temporal model over the segment- level model, in turn, highlighted the existence of rich temporal patterns in the egocentric sequence of acoustic scenes. There are several future research directions. • In the online acoustic feature acquisition part, more distinctive features like log mel energies might be considered in the future, since they were found to have superior performance in a variety of sound event detection tasks [26]. • Future research can be performed on disentangling the environmental sounds from user's speech to have better prediction accuracy. Several novel unsupervised disen- tanglement methods can be found in recent literature [156]. 
101 • An extension of the proposed two-stage training framework would be to perform end-to-end training of the temporal model directly from the raw acoustic features. An inspection can be done on seamless acoustic scene detection (no foreground activity detection). The amount of training data to process would be however a challenge for that approach, and thus incorporation of ecient data sub-sampling techniques might be benecial. • Real-time implementation of the proposed models, and analyzing the feasible speed of prediction might be another research problem. • The ecacy of the temporal prediction model, and the association between scene dynamics and job characteristics can provide useful insights in devising novel ap- plications such as frameworks that can automatically generate movement statistics of the employees between dierent workplace acoustic scenes. Moreover, the nd- ings about the correlation between the scene dynamics and behavioral states of the employees can inspire further work on building behavioral models [142] that can predict the behavioral states and traits directly from acoustic data. 4.8 Appendices 4.8.1 Demographics and daily routines Table 4.7 lists the information about demographics and daily routines that are used for the analysis. The abbreviations are utilized in Figure 4.2. 102 Table 4.7: Abbreviation, description, and data type of dierent demographic and daily routine information incorporated for egocentric analysis. Abbreviation Description Data type race Race Categorical (1 7) ethnic Ethnicity Categorical (1 2) relationship Relationship status Categorical (1 4) pregnant Pregnancy status Categorical (1 2) children Number of children below 18 years of age Integer (0 15) housing Housing status Categorical (1 4) currentposition Current position in the hospital Categorical (1 8) certications Certications regarding occupation Categorical (1 7) nurseyears Years in the current profession Integer (1 80) shift Work shift Categorical (1 2) hours Hours of work per week Integer (1 100) overtime Overtime hours per month Integer (0 200) commute type Means of communicating to the workplace Categorical (1 6) commute time Quantized time for communicating to the work- place Categorical (1 6) extrajob Having at least one extra job Categorical (1 2) extrahours Extra hours spent at extra job(s) per week Integer (0 100) student Whether enrolled in a certain student program Categorical (1 9) 4.8.2 Behavioral constructs Table 4.8 tabulates the MOSAIC Initial Ground Truth Battery (IGTB) constructs along with their description, domain, and data type. The abbreviations are utilized in Fig- ure 4.3. 4.8.3 Detailed results: Egocentric analysis with the sequence of predicted scenes In Section 4.6.4 we compared the outcomes of the statistical tests performed on true and predicted scene labels by considering the presence of a particular construct if at least one measure of dynamics achieved statistically signicant observation. Here, we consider all 103 Table 4.8: Abbreviation, description, domain, and data type of dierent behavioral con- structs incorporated for egocentric analysis. 
Abbreviation Description Domain Data type itp Individual Task Prociency Job Performance Likert scale (1 5) irb In-Role Behavior Job Performance Likert scale (1 7) iod id Interpersonal and Organizational De- viance Scale / Interpersonal Deviance Job Performance Frequency scale (1 7) iod od Interpersonal and Organizational De- viance Scale / Organizational Deviance Job Performance Frequency scale (1 7) ocb Organizational Citizenship Behavior Job Performance Integer (0 8) shipley abs Shipley Abstraction Cognitive ability Integer (0 25) shipley voc Shipley Vocabulary Cognitive ability Integer (0 40) neu Neuroticism Personality Likert scale (1 5) con Conscientiousness Personality Likert scale (1 5) ext Extraversion Personality Likert scale (1 5) agr Agreeableness Personality Likert scale (1 5) ope Open-Mindedness Personality Likert scale (1 5) pos af Positive Aect Aect Likert scale (1 5) neg af Negative Aect Aect Likert scale (1 5) stai State-Trait Anxiety Inventory Anxiety Likert scale (1 5) audit Alcohol Use Disorders Identication Test Health Alcohol use Integer 0 40 gats status Global Adult Tobacco Survey status Health Tobacco use Categorical (current, past, or never) gats quantity Global Adult Tobacco Survey quantity Health Tobacco use Integer, tobacco units in past week ipaq International Physical Activity Question- naire Health Physical activity Integer, minutes in the past week psqi Pittsburgh Sleep Quality Index Health Sleep Float (0 21) the measures that show statistically signicant outcome and re ect that count in the form of word clouds. Figure 4.7 shows the word clouds for demographics and daily routines. A green word denotes that the particular construct also has a signicant outcome in the test performed with true labels, whether a red word indicates that it does not have a signicant outcome in the test with true labels. A word with larger size denotes that relatively higher number of measures of scene dynamics give statistically signicant outcome (but it might be green or red as described above). From Figure 4.7, we can see that the proposed two-stage temporal model is able to produce higher number of outcomes that are similar to the test with true labels, although it generates some other outcomes as well. Figure 4.8 104 shows similar word clouds for the egocentric analysis with the behavioral constructs. In most of the cases, the temporal model still tends to produce more observations that are similar to those of the true labels (except for d s = 2 hours which is also evident from the counts presented in Table 4.6). A comparison between Figure 4.7 and Figure 4.8 shows that more similar (with the true test outcomes) observations are found with demographics than behavioral variables. 105 (a) d s = 15 mins, Segment-level (b) d s = 15 mins, Temporal (c) d s = 30 mins, Segment-level (d) d s = 30 mins, Temporal (e) d s = 1 hour, Segment-level (f) d s = 1 hour, Temporal (g) d s = 2 hours, Segment-level (h) d s = 2 hours, Temporal Figure 4.7: Word clouds showing outcomes of the egocentric analysis of daily routine and demographics performed with model predictions. Predictions of the segment-level model and the proposed two-stage temporal model are compared at dierent context duration, d s . Green: the construct is also present for true labels, Red: not present for true labels, Word size: proportional to the total number of measures of dynamics having signicant outcome. 
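The per-construct counting that drives the word size in Figures 4.7 and 4.8 (how many measures of scene dynamics reach significance for a given construct) can be sketched as follows. This is a minimal illustration assuming pandas and scipy; the DataFrame columns and the split into categorical and continuous constructs are hypothetical stand-ins for the actual study variables.

```python
import pandas as pd
from scipy.stats import kruskal, spearmanr

def count_significant_measures(df, dynamics_cols, construct, categorical, alpha=0.05):
    """Count how many measures of scene dynamics relate significantly to one construct."""
    count = 0
    for col in dynamics_cols:
        sub = df[[col, construct]].dropna()
        if categorical:
            # Kruskal-Wallis across groups defined by a categorical construct (e.g., shift).
            groups = [g[col].values for _, g in sub.groupby(construct) if len(g) > 1]
            if len(groups) < 2:
                continue
            _, p = kruskal(*groups)
        else:
            # Spearman correlation for continuous behavioral constructs (e.g., itp).
            _, p = spearmanr(sub[col], sub[construct])
        if p < alpha:
            count += 1
    return count

# Hypothetical usage: 'df' holds one row per participant with measures of scene
# dynamics plus demographic/behavioral constructs (column names are placeholders).
# n_sig = count_significant_measures(df, ["entropy", "n_transitions"], "shift", categorical=True)
```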
106 (a) d s = 15 mins, Segment-level (b) d s = 15 mins, Temporal (c) d s = 30 mins, Segment-level (d) d s = 30 mins, Temporal (e) d s = 1 hour, Segment-level (f) d s = 1 hour, Temporal (g) d s = 2 hours, Segment-level (h) d s = 2 hours, Temporal Figure 4.8: Word clouds showing outcomes of the egocentric analysis of behavioral con- structs performed with model predictions. Predictions of the segment-level model and the proposed two-stage temporal model are compared at dierent context duration, d s . Green: the construct is also present for true labels, Red: not present for true labels, Word size: proportional to the total number of measures of dynamics having signicant outcome. 107 Chapter 5 Semantic-space variability I: Granularity of annotations The last three chapters concentrated on signal-space variabilities. In this chapter, we focus on a semantic-space variability that arises due to the granularity of annotations. The granularity of annotations denotes the abstractness of the semantic labels. For example, the sound of a re alarm can be annotated as a \re alarm" or only \alarm". The embedding of the \alarm" class should ideally be more generic and cover \re alarm" and all other types of alarm such as \siren". In other words, the embeddings of the ne- grained classes should be more compact than the embeddings of the parent coarse class. This indicates a hierarchical tree structured semantic-space, and in the case of audio events, dierent levels of granularity come from that label-space ontology of the audio event dataset (see Chapter 1 Section 1.3.1 for an introduction to audio events). We will propose hierarchy-aware similarity learning methodologies [93] to learn compact and distinctive audio event representations that try to follow the hierarchical relationship or the granularity levels of the semantic labels. 108 5.1 Introduction Audio Event Classication (AEC) was introduced in Section 1.3.1. Generally, an audio events ontology is hierarchical in nature [66, 145, 172] because inherently it is easier for humans to identify (and hence annotate) rst the coarse class of the audio (e.g., vehicle), and then the ne class (e.g., bus). This work focuses on classifying audio events that have a hierarchical relationship in their human annotated label space. We have two complementary objectives: 1. Train a model that can identify the audio events satisfactorily at all levels in the label hierarchy, possibly by exploiting the hierarchical label taxonomy. 2. The model should be able to produce a distinctive audio embedding [175] or repre- sentation [18] that tries to follow the label hierarchy in a lower dimensional manifold with respect to some distance measure. Surprisingly, little work can be found in the eld of AEC that directly address the problem of hierarchical classication in audio event taxonomy [172, 145]. Xu et al. [204] proposed a DNN based multi-task learning method to solve this problem for a dataset having 3 coarse classes and 15 ne classes. Pre-training the DNNs [204] separately for coarse and ne classes was found to be helpful before applying the weighted multi-task (coarse- and ne-level classications) cross entropy objective function. But, the number of audio classes were very limited in this work. In this work, our application requires dealing with an unprecedented number ( 5K) of specic AEC classes dened on a hierarchical ontology similar to AudioSet. 
The goal of learning a distinctive audio manifold guides us to employ a loss function that can harness the hierarchy information of the label space and force the embeddings to follow it through similarity constraints imposed during DNN training. The employed hierarchy-aware loss functions are found to be complementary to the standard cross entropy loss, and together, in a multi-task learning setting, they improve the AEC performance at both higher and lower levels of the taxonomy.

Figure 5.1: An example of the hierarchical audio event class structure in our dataset (e.g., Vehicle → {Bus, Car, ...}, Birds chirping → {Swan, Duck, ...}, Alarms → {Buzzers, Electronic beeps, ...}).

5.2 Hierarchical Audio Events

5.2.1 Dataset

Our manually labeled hierarchical Audio Event (AE) dataset has 183 AE classes, and each class has one or more AE subclass(es). An example of the class hierarchy is shown in Fig. 5.1. We represent the dataset labels at two levels: "coarse" and "fine". There are in total 183 coarse labels (e.g., vehicle, birds chirping, etc. in Figure 5.1). A fine class label carries information about both the AE class (e.g., vehicle) and the subclass (e.g., bus). For example, the fine label $c_{kl}$ denotes the $k$-th class and the $l$-th subclass of that class. There are in total 4721 unique fine labels. The dataset has around 230K audio samples of variable duration. Each audio stream contains only one audio event. The dataset was manually recorded and annotated by professional sound engineers.

5.2.2 Motivation to use hierarchy-aware loss

The complementary objectives introduced in Section 5.1 direct us to learn fine-grained feature representations [212] of the data that can help compare the audio events at different levels of their taxonomy. A common approach is to learn distance metrics in the embedding space by imposing similarity constraints during training. Pairwise contrastive loss [42] and triplet loss [175] are popular metrics of this kind in the field of face recognition. However, these loss functions do not consider hierarchical label structures, as in the case of our current AEC problem. To address this issue, we propose a variant of standard triplet learning by deploying a knowledge-driven probabilistic triplet mining that utilizes the hierarchy information while sampling the triplets. In [212], the authors introduced a generalized triplet learning that has the inherent ability to work on label tree structures. We also employ this method in our AEC task. We build a multi-task learning [169] framework in which a multi-objective loss function, combining the hierarchy-aware loss and the cross entropy loss, is employed to train the DNN, inspired by recent successes in the computer vision field [212, 154].

5.3 Hierarchy-aware Loss on Tree Structured Label Space

5.3.1 Problem formulation

Let $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ be a dataset of $N$ variable-length audio samples with the following labels in a bi-level hierarchy:

Coarse: $\mathcal{L}_C = \{y_1, y_2, \ldots, y_N\}, \quad y_i \in \{1, 2, \ldots, C\}$   (5.1)

Fine: $\mathcal{L}_F = \{z_1, z_2, \ldots, z_N\}, \quad z_i \in \{1, 2, \ldots, F\}$   (5.2)

Here, $C$ and $F$ are the numbers of coarse and fine classes respectively ($F \gg C$). Note that, as mentioned in Section 5.2.1, $z_i = c_{kl}$ carries information about both the AE class and the subclass. Our complementary objectives (as introduced in Section 5.1) can now be defined:

1. Train a non-linear mapping $\mathcal{M}$ that maximizes the classification accuracy in both label spaces $\mathcal{L}_C$ and $\mathcal{L}_F$.

2.
The model $\mathcal{M}$ should also provide an intermediate mapping $f(\mathbf{x}) \in \mathbb{R}^d$, $\forall \mathbf{x} \in \mathcal{D}$, with $\|f(\mathbf{x})\|_2^2 = 1$, that tries to project the audio onto a manifold such that the mutual Euclidean distances obey the order found in the label-space hierarchy. To explain, let $\mathbf{x}_1, \mathbf{x}_2 \in S_1 \subset G_1$, $\mathbf{x}_3 \in S_2 \subset G_1$, and $\mathbf{x}_4 \in S_3 \subset G_2$, where $G_i$ and $S_i$ denote a class and a subclass respectively. Then, ideally, in the embedding space,
$\|f(\mathbf{x}_1) - f(\mathbf{x}_2)\|_2^2 < \|f(\mathbf{x}_1) - f(\mathbf{x}_3)\|_2^2 < \|f(\mathbf{x}_1) - f(\mathbf{x}_4)\|_2^2$.

Note that the $l_2$ normalization of the embeddings is required to constrain them to lie on the $d$-dimensional hypersphere [175]. Here, $\mathcal{M}$ is the DNN with softmax() outputs. The embedding layer output $f(\mathbf{x})$ is connected to multiple softmax() outputs (with $C$ or $F$ output units depending on coarse- or fine-level training) through a single linear layer.

5.3.2 Baseline DNN

We want to analyze the effect of introducing the hierarchy-aware loss functions on standard cross entropy learning for the AEC task. We separately train the baseline DNN (Section 5.4.1 describes the architecture) with two Cross Entropy (CE) loss functions. First (called "CE coarse"), we train it with the coarse labels and the following cross entropy objective function:

$\arg\min_g \mathcal{L}_{\text{CE coarse}}(g(\mathbf{x}), y) = \arg\min_g -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(g(\mathbf{x}_i, y_i)\right)}{\sum_{j=1}^{C} \exp\left(g(\mathbf{x}_i, j)\right)}$   (5.3)

Here, $g(\mathbf{x}_i, j) = g_j(f(\mathbf{x}_i))$ denotes the final output of the DNN for the $j$-th class before applying softmax(). This model is unable to give fine representations, but it helps in comparing the benefits of the hierarchy-aware loss at the coarse level. The second model ("CE fine") is trained with the following fine-grained cross entropy loss:

$\arg\min_g \mathcal{L}_{\text{CE fine}}(g(\mathbf{x}), z) = \arg\min_g -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(g(\mathbf{x}_i, z_i)\right)}{\sum_{j=1}^{F} \exp\left(g(\mathbf{x}_i, j)\right)}$   (5.4)

This model is able to predict at both levels.

5.3.3 Hierarchy-aware Loss

5.3.3.1 Balanced triplet loss

Triplet learning was introduced in Section 2.2.4. We rewrite the loss with the current notations:

$\mathcal{L}_{\text{Tri}}\left(f(\mathbf{x}_i^a), f(\mathbf{x}_i^p), f(\mathbf{x}_i^n); \alpha\right) = \max\left(\|f(\mathbf{x}_i^a) - f(\mathbf{x}_i^p)\|_2^2 - \|f(\mathbf{x}_i^a) - f(\mathbf{x}_i^n)\|_2^2 + \alpha,\; 0\right)$   (5.5)

where $\mathbf{x}_i^a$, $\mathbf{x}_i^p$, and $\mathbf{x}_i^n$ denote the anchor, positive, and negative samples respectively, and $\alpha$ is a margin. Basic triplet learning does not support a hierarchical label structure. While sampling a triplet for training the DNN for the hierarchical AEC task, one has the option to pick the negative sample either from the same coarse class as the anchor but a different subclass, or from a coarse class completely different from the anchor's. Mining triplets according to the uniform distribution does not consider the hierarchical label structure, and most of the negative samples are mined (with probability $\frac{C-1}{C} \gg \frac{1}{C}$) from a coarse class that is different from the anchor's coarse class. Consequently, the model mostly learns the coarse representation, and not the fine one. To alleviate this problem, we perform triplet negative mining in a probabilistic way so that the model encounters around 50% of triplets in which the negative sample comes from the same class but a different subclass. If $\mathbf{x}_i^a \in S_i \subset G_j$, then we choose $\mathbf{x}_i^p$ s.t. $\mathbf{x}_i^p \in S_i$. Here, $S_i$, $\forall i = 1, \ldots, F$, is the fine set containing $\mathbf{x}_i^a$, and $G_j$, $\forall j = 1, \ldots, C$, is the coarse set containing $\mathbf{x}_i^a$. The negative exemplar $\mathbf{x}_i^n$ is randomly sampled conditioned on a Bernoulli distribution as follows:

$\mathbf{x}_i^n = \mathbb{I}(r = 0)\,\{\mathbf{x}_i^n : \mathbf{x}_i^n \in G_j \text{ and } \mathbf{x}_i^n \notin S_i\} + \mathbb{I}(r = 1)\,\{\mathbf{x}_i^n : \mathbf{x}_i^n \notin G_j\}$   (5.6)
5.3.3.2 Quadruplet loss The basic idea here is to generalize the triplet learning across multiple levels in the hierarchy [212]. For a bi-level tree like ours, a quadruplet, (x a i ; x p+ i ; x p i ; x n i ) 2 Q is sampled fromD, s.t., if x a i 2 S i G j , then x p+ i 2 S i and x p i = 2 S i but x p i 2 G j . On the other hand, the negative is chosen s.t. x n i = 2 G j . The ultimate goal is to satisfy the following constraint in the embedding space: jjf(x a i )f(x p+ i )jj 2 2 +<jjf(x a i )f(x p i )jj 2 2 +; <jjf(x a i )f(x n i )jj 2 2 ; 8(x a i ; x p+ i ; x p i ; x n i )2Q (5.7) which is achieved by minimizing the objective function: argmin f 1 N Q N Q X i=1 L Quad f(x a i );f(x p+ i );f(x p i );f(x n i );; , argmin f 1 N Q N Q X i=1 L Tri f(x a i );f(x p+ i );f(x p i ); +L Tri f(x a i );f(x p i );f(x n i ); (5.8) Here, L Tri (:) is as dened in Equation (5.5), N Q =jQj, and and are two margin parameters satisfying > > 0. 115 5.3.4 Multi-task learning As mentioned in Section 5.2.2, we train the DNN in a multi-task learning environment (inspired by [212]) by imposing a joint objective function to minimize: L Multi =L HAL + (1)L CE ne (5.9) where,2 [0; 1] and,L HAL is the hierarchy-aware loss function and it can be either L Tri or L Quad . 5.4 Experimental Setting 5.4.1 DNN architecture Inspired from the performance of deep CNN models in at AEC tasks [78], we employ a slightly modied version of recently proposed ResNet-18 model [76]. We change the input layer of the basic ResNet-18 model to confront to the single channel spectrogram inputs, and the output layer depending on number of audio classes. We replace the nal average pooling layer by a fully connected layer that goes to a 512 dimensional embedding layer. We perform l 2 normalization of the embeddings during training and evaluation. The baseline model is trained end-to-end with cross entropy losses as described in Section 5.3.2. The multi-task model is trained with joint objectives as shown in Equation (5.9). 116 5.4.2 Data split A dataset splitting has been performed to produce disjoint (in terms of audio les) train ( 185K), validation ( 22:5K) and test ( 22:5K) sets. Early stopping [71] has been used based on the validation set performance for model selection. We should mention that the chance accuracies of a majority guess classier for coarse and ne classication tasks are 4.65% and 1.1% respectively. 5.4.3 Features and parameters 64 dimensional mel spectrogram features have been extracted from single channel audio streams having sampling rate of 48KHz using moving window of 42:67ms (2048 samples) length and 10:67ms (512 samples) shift. Online batch mining is employed during training for fetching triplets or quadruplets. Random windows of 100 feature frames have been generated, and 100 64 dimensional samples are fed into the input CNN layer of the model for training. We do not implement hard negative sampling [175] for triplet or quadruplet mining to increase the training speed. Instead we use a large batch size of 1024 samples to increase the probability of nding some hard negative exemplars during random sampling. We use 8 GPUs for training with data parallelism. We employ Adam optimizer with a learning rate of 10 3 and l 2 regularization penalty of 10 6 . For triplet loss, we have chosen = 0:1. For quadruplet loss, we have picked = 0:2 and = 0:1. The weighting factor, in Equation (5.9), is chosen to be 0.5 based on the validation set performance. 
117 Table 5.1: Audio events classication accuracies of dierent models at coarse and ne levels of the label hierarchy (CE=Cross Entropy) Coarse-level Model Objective function (Section 5.3) Top 1 Top 3 Top 10 CE Coarse Eq. (5.3) 71.46% 85.82% 94.30% CE Fine Eq. (5.4) 75.58% 85.79% 92.35% Multi-task (CE+Triplet) Eq. (5.5),(5.9) 78.72% 87.42% 93.39% Multi-task (CE+Quadruplet) Eq. (5.8),(5.9) 78.53% 87.44% 93.32% Fine-level Model Objective function (Section 5.3) Top 1 Top 3 Top 10 CE Coarse Eq. (5.3) N/A N/A N/A CE Fine Eq. (5.4) 65.82% 77.30% 84.44% Multi-task (CE+Triplet) Eq. (5.5),(5.9) 69.57% 79.76% 86.23% Multi-task (CE+Quadruplet) Eq. (5.8),(5.9) 69.70% 79.92% 86.42% 5.5 Results and Discussions 5.5.1 Classication Table 5.1 shows the performance of dierent methods for AEC in terms of classication accuracies. The results on coarse- and ne-levels are shown as two sub-tables. The predictions on a test audio stream are generated by taking mean over all the posterior probabilities (non-overlapping sliding windows of 100 frames). `Top k' accuracy calculates the classication accuracy by observing whether the true class is in the top k predictions. From the rst two rows of Table 5.1 (both sub-tables), we can see that the Cross Entropy (CE) loss on ne labels gives a better Top 1 accuracy (even at the coarse level) than a CE loss on the coarse labels only. This indicates that the model has the ability to learn the ne labels and the knowledge of ne labels results in a better representation for classication (even at the coarse level). 118 The training using the multi-task loss function (Equation (5.9)) provides much better results at both levels than only CE supervision. We hypothesize that the hierarchy- aware loss helps to learn a better embedding space by reducing intra-cluster distance and increasing inter-cluster distance , and in turn helps the CE loss. It can also be thought as a regularization term along with standard CE. We achieve 3.14% and 3.88% absolute improvements for Top 1 coarse (183-way), and ne (4721-way) classications respectively. Comparison between balanced triplet and quadruplet learning shows almost similar performance with quadruplet leading by a small margin in case of classication with ne labels. This might be happening because of the bi-level constraints (Equation (5.7)) im- posed on the quadruplet loss (Equation (5.8)). The rest of the experiments are performed with the quadruplet loss model. 5.5.2 Clustering audio embeddings To evaluate the eect of introducing hierarchy-aware loss on learning better manifold and in turn producing more compact audio embeddings, we cluster the embeddings of the test audio using K-means clustering (K=183 for coarse-, K=4721 for ne-level clustering). Table 5.2 shows dierent metrics evaluating the clustering performed on the embeddings generated by baseline DNN with ne level cross entropy and the multi-task algorithm with quadruplet loss. The same parameter settings are used for K-means for both the al- gorithms. The metrics are explained below (all of them implemented in scikit-learn [192]): • Homogeneity (H), Completeness (C) and V-measure (V) [81]: If the true classes are known, then perfect homogeneity (1.0) occurs when each cluster contains samples 119 from a single class. Perfect completeness ensures all members of a single class to stay inside a single cluster. V-measure is the harmonic mean of these two metrics. 
• Adjusted Random Index (ARI): Given true labels and predicted cluster labels, ARI estimates a similarity between them ignoring possible permutations [84]. It varies from 0 to 1, and higher value is better. • Adjusted Mutual Information (AMI): Mutual information between true and pre- dicted cluster labels with a normalization to account for chance. A higher value is preferable. • Intra Cluster Distance (Intra CD): This does not require a clustering algorithm, but only needs true class (or, cluster) labels. It simply calculates the average distance between all points inside a cluster, and takes average among all clusters. Note that the embeddings arel 2 normalized before computing this (and Inter CCD) metric(s). A lower value is better. • Inter Cluster Centroids Distance (Inter CCD): Average distance between all cluster centroids. It is also independent of the clustering algorithm. A higher value is preferable. We can see from Table 5.2 that the multi-task model outperforms the baseline DNN in all metrics at coarse level (upper sub-table), and all except one metric (inter CCD) at ne label (lower sub-table). The lower value of inter CCD at ne level might be coming from the quadruplet constraint as mentioned in Equation (5.7). The model learns to separate the ne level clusters, but not too much because it also tries to keep them under the parent coarse level cluster. 120 Table 5.2: Clustering evaluation of the test embeddings generated by dierent models at coarse and ne levels of the label hierarchy (see Section 5.5.2 for metric acronyms) Coarse-level Model H C V ARI AMI Intra CD Inter CCD CE ne 0.471 0.435 0.452 0.106 0.366 7.137 4.957 Multi-task (CE+Quadruplet) 0.517 0.479 0.497 0.131 0.417 6.942 5.023 Fine-level Model H C V ARI AMI Intra CD Inter CCD CE ne 0.791 0.765 0.777 0.078 0.242 0.318 1.190 Multi-task (CE+Quadruplet) 0.792 0.771 0.781 0.088 0.265 0.308 1.153 5.5.3 Transfer learning To measure the transfer ability [150] of the model, we evaluate its performance on the Greatest Hits dataset [148]. It contains audio visual data of dierent actions (hit, scratch and other) performed on dierent objects (e.g., dirt, glass, leaf etc. ), and also their reactions (e.g., deform, scatter etc. ). We utilize the audio part of the dataset, and all 17 objects and 2 actions (hit and scratch) serve as class and subclass respectively. We do not use reaction because a bi-level hierarchy is more well aligned with the scope of this paper. Following the notations introduced in Section 5.2.1, an object creates a coarse class, and object and action together create a ne class. So, we generate 17 coarse and 34 ne classes. A random (80%,10%,10%) data split has been performed to produce disjoint train, validation, and test sets. Table 5.3 compares the test performances of a ResNet-18 trained from random initialization (referred as Rand-init in the table), and from our quadruplet based pre-trained multi-task model. For Top 1, we get around 10% absolute improvement at both coarse and ne levels. 121 Table 5.3: Classication accuracies for AEC on Greatest Hits dataset at coarse and ne levels of the label hierarchy Coarse-level Fine-level Model Top 1 Top 3 Top 10 Top 1 Top 3 Top 10 Rand-init 64.89% 89.10% 98.64% 59.47% 85.19% 97.45% Pre-trained 74.86% 92.93% 99.20% 69.04% 90.29% 98.49% 5.6 Conclusion and Future Directions The work dealt with a semantic-space variability{granularity of annotations coming from the hierarchical (bi-level) label-space structure of an audio event classication task. 
We introduced hierarchy-aware loss functions that learn from the tree structured label on- tology for achieving better classication performance at all levels and to produce more distinctive audio embeddings. A multi-task learning framework was built with cross en- tropy loss and the hierarchy-aware loss . Two dierent hierarchy-aware loss functions were employed. First, a modied triplet loss with a probabilistic multi-level batch balanc- ing strategy. Second, quadruplet learning suitable for labels having bi-level tree structure. The classication and clustering experiments showed the ecacy of the employed method. The evaluation on the Greatest Hits dataset showed the model's ability to transfer to a dierent domain. An obvious extension of the work would be to apply the employed methods in AEC tasks having deeper label structures. Unsupervised learning of hierarchical audio events and their mixtures might also be an interesting problem to attack in the future due to the availability of large amounts unlabeled audio events data. 122 Chapter 6 Semantic-space variability II: Absence of labels In the previous chapter, we saw how the granularity of annotations can introduce vari- ability in the learned representations. In this chapter, we will focus on another form of semantic-space variability that arise due to partial/full unavailability of labels. ∗ The absence of labels necessitates modication of the similarity learning methodologies we have discussed so far. We will present a self-supervised training framework, termed as Neural Predictive Coding (NPC) [89, 90], which learns the latent similarity patterns of the data by suitably exploring temporal context. We will learn speaker representations with downstream applications in speaker segmentation, identication, and verication. 6.1 Introduction We looked at several DNN-based supervised modeling techniques in Section 1.3.2. All of those methods need one or more annotated dataset(s) for the supervised training. This limits the learning power of the methods, especially given the data-hungry needs of advanced neural network-based models. Supervised training can also limit robustness ∗ Partial unavailability will not be discussed here, and the readers are requested to have a look at [88]. 123 due to over-tuning to the specic training environment. This can cause degradation in performance if the testing condition is very dierent from that of the training. Moreover, transfer learning [150] of the supervised models to a new domain also needs labeled data. This points to a desire and opportunity in employing unlabeled data and self-supervised methods for learning speaker embeddings. There have been a few eorts [173, 211, 113] in the past to employ neural networks for acoustic space division, but these works focused on speaker clustering and they did not employ self-supervision to learn distinctive speaker representations. In [114], an unsu- pervised training scheme using convolutional deep belief networks has been proposed for audio feature learning. The authors applied those features for phoneme, speaker, gender and music classication tasks. Although, the training employed there was unsupervised, the proposed system for speaker classication was trained on TIMIT dataset [65] where every utterance is guaranteed to come from a single speaker, and PCA whitening was applied on the spectrogram per utterance basis. Moreover, performance of the system on out-of-domain data was not evaluated. 
6.2 Neural Predictive Coding (NPC) In this work, we propose a self-supervised method for learning features having speaker- specic characteristics from unlabeled audio streams that might contain many non-speech events (music, noise, and anything else available on YouTube). We term the general learn- ing of signal characteristics via the short-term stationarity assumption Neural Predictive Coding (NPC) since it was inspired by the idea of predicting present value of signal from 124 a past context as done in Linear Predictive Coding (LPC) [147]. The short-term station- arity hypothesis will be described in Section 6.3. LPC predicts future values from past data via a linear lter described by its coecients. NPC can predict future values from past data via neural network (potentially non-linear transformation). The embedding inside the NPC neural network can serve as a feature. Moreover, while predicting future values from past, the NPC model can incorporate knowledge learned from big, unlabeled datasets. Below are the major aims of the proposed method towards establishing a robust speaker embedding: 1. Training should require no labels of any kind (no speaker id labels, or speaker homogeneous utterances for training); 2. System should be highly scalable relying on plentiful availability of unlabeled data; 3. Embedding should represent short-term characteristics and be suitable as an alter- native to MFCCs in an aggregation system like [206] or [68]; and, 4. The training scheme should be readily applicable for unsupervised transfer learning. 6.3 NPC for learning of latent similarity patterns We employ an auto-encoder-like model [71], termed here as Speaker2Vec [89], to learn the speaker-characteristics manifold, but instead of reconstructing the input, we try to reconstruct a small window of speech from a temporally nearby window. Our hypothesis is that given the two windows are from the same speaker, the auto-encoder should be able to lter out all unnecessary features and capture the most common information between 125 the two windows i.e. the speaker-characteristics and encode them in its embedding layer. This task remains simple for single-speaker audio streams, but seems to be dicult for unlabeled multi-speaker audio. The short-term speaker stationarity hypothesis becomes useful here. 6.3.1 Method 6.3.1.1 Short-term speaker stationarity hypothesis It is based on the simple notion that given an audio stream containing multiple speakers, it is very unlikely that the speaker turns will occur very frequently (for example, every 1 second). Hence, if we consider pairs of two adjacent windows of short duration, in most pairs, both windows will belong to a single speaker. There will be some pairs having windows from two dierent speakers (those actually contain the speaker change points), but the number of such pairs will probably be small compared to the total number of all pairs of adjacent windows. As we consider more and more data from various domains, the probability of getting more and more single-speaker window pairs increases. Now, we can take all the pairs to train an auto-encoder [71], [80] so that it learns the speaker manifold as described below. 6.3.1.2 Training framework We use an auto-encoder having (2K + 1) hidden layers, where the (K + 1) th layer is the embedding layer. Let's assume that we have extracted frames of MFCC feature vectors from an audio stream. We consider all pairs of adjacent windows w 1 and w 2 , each of lengthd frames. 
Two consecutive pairs are separated by frames as shown in Figure 6.1. 126 Window w1: Frame n to n+d-1 K hidden layers Embedding Window w2: Frame n+d to n+2d-1 Encoder Decoder K hidden layers Audio input w2 w1 Δ MFCC (d frames) MFCC (d frames) Figure 6.1: Training framework and DNN architecture. Each pair becomes a training sample (input and output) for our auto-encoder. For every pair, w 1 goes to the input layer of the auto-encoder, and w 2 goes to the output. The auto-encoder tries to reconstruct w 2 from w 1 by minimizing a loss function L(g(f(w 1 )); w 2 ) (6.1) wheref(:) is an encoder function which produces the embedding layer h =f(w 1 ),g(:) is a decoder function which produces the output o =g(h),L(:) is a scalar loss function such as mean squared error. This framework enables us to exploit longer context to capture speaker characteristics, and compress them into a lower dimensional vector representation. Dierent values of d and employed in our experiments are reported in Section 6.3.2.3. 6.3.1.3 Evaluation on Speaker segmentation We use speaker segmentation as the task for evaluating Speaker2Vec. We just need the encoder part of the trained model for it. A window of d frames is traversed along the 127 test audio stream with a shift of 1 frame. Every window is applied to the DNN model to generate an embedding h for that particular window. Now, these embeddings are used for segmentation instead of the original MFCC features. The segmentation algorithm (similar to the one used in [38]) can be summarized as follows. (1) Obtain a divergence curve by measuring the KL divergence between two adjacent windows and sliding them over the embeddings. (2) Normalize the divergence curve and apply a low pass lter to smooth it. (3) Detect the peaks of the divergence curve by using threshold T KL (varies between 0 and 1). We keep the segmentation algorithm very simple (single pass, no renement, and asymmetric KL divergence [178]) to verify the distinctive ability of the speaker embeddings. 6.3.1.4 Transfer learning: Unsupervised domain adaptation Here, we provide a simple usage of a trained Speaker2Vec model for unsupervised transfer learning [150]. We use the following two-pass algorithm for adapting a trained DNN model to the test data in a completely unsupervised way. (1) Find the speaker change points by the trained DNN model as described above. (2) Get all possible speaker homogeneous regions. (3) Retrain the same DNN again on these homogeneous segments of speech. There might be some errors made by the model in step (1), but we only care about speaker homogeneous regions. So, we over-segment the test audio in the rst step (by setting a particular value of T KL as will be discussed below) so that we detect as many speaker change points as we can. We don't use segments smaller than 2d frames for adaptation. 128 Table 6.1: Training datasets and corresponding DNN architectures employed. Dataset Size (hours) #Hidden layers (K+1) #Parameters in DNN model TED-LIUM 118 3 24M YouTube 911 3 24M YouTubeLarge 1953 4 52M 6.3.2 Experiments 6.3.2.1 Training datasets The rst two columns of Table 6.1 describe the training datasets. We will refer to the trained Speaker2Vec models by the names of their training datasets. The TED-LIUM data is expected to t well for training since each audio le contains speech mainly from a single speaker. To validate our hypothesis of short-term speaker stationarity more strongly, we have used a comparatively larger amount of various types of audio from YouTube. 
The reason for choosing YouTube is two-fold. First, we can get a virtually unlimited supply of data. Second, we can have speech samples from diverse conditions. For example, our YouTube datasets have varying acoustic environments including single-speaker monologues, multi-speaker discussions, movies, speech with background music, and both in-studio and out-of-studio conversations. They also contain audio from different languages including English, Spanish, Hindi, and Telugu. We have used the TED-LIUM development dataset [167] to stop the training before the model overfits to the training data (for all training sets).

6.3.2.2 DNN architectures

We have used two auto-encoder architectures with K = 2 and 3 (Figure 6.1). The smaller network has been separately trained on the TED-LIUM and YouTube datasets, while the bigger one has been trained on the YouTubeLarge dataset. The numbers of neuron units in the different layers, from input to output, are [4000 → 2000 → 40 → 2000 → 4000] and [4000 → 6000 → 2000 → 40 → 2000 → 6000 → 4000] for the smaller and bigger networks respectively. The length of the embedding layer has been set to 40, which makes it equivalent to the adopted MFCC dimension (discussed below). We have used ReLU activations for the hidden layers and linear activations for the output layer. Logarithmic mean squared error has been used as the error metric L(·). The last two columns of Table 6.1 show the number of hidden layers and the total number of parameters for the encoder parts of the DNN models.

6.3.2.3 Experimental setting

We have adopted 40-dimensional high definition MFCC features extracted from 40 mel-spaced filters over a 25 ms Hamming window with a shift of 10 ms using the Kaldi toolkit [157]. We have used d = 100 frames (1 s) for all training scenarios. This makes the size of the input and output layers of the DNN models 4000. We have chosen Δ = 50 frames (0.5 s) for TED-LIUM, and Δ = 200 frames (2 s) for the YouTube datasets (we could get enough training samples even with this increased Δ for the YouTube datasets, and it completely removed overlap between samples). For segmentation, we have used two adjacent windows, each of size 1 s, and slid them with a shift of 1 frame to achieve maximum resolution. For the first step of domain adaptation, T_KL has been experimentally set to 0.5. This threshold over-segments the test audio so that we get as many pure homogeneous segments as we can.

Figure 6.2: ROC curves (false alarm rate vs. mis-detection rate, both in %) obtained by different algorithms on TED-LIUM evaluation data (BIC, GD, KL2, KL with MFCC40, and the unadapted and adapted Speaker2Vec models trained on Ted, YouTube, and YouTubeLarge).

6.3.3 Results

6.3.3.1 Distinctive capability of the speaker embeddings over MFCC

Three datasets from very different acoustic conditions have been chosen to evaluate the performance of the proposed method. The first one has been artificially generated by taking 200 random utterances (extracted using the timestamps provided in the transcriptions) from the TED-LIUM evaluation data. The second one is the NIST RT-06 conference meetings data (we have used the hsum audio files) [1]. The third set has been built by randomly picking 12 sessions (2.07 hours of audio) from the Couples Therapy Corpus [43], which has spontaneous conversations between husband and wife. The low SNR, poor

Table 6.2: (Row 1-6) Comparison of the proposed method with some baseline methods (which use MFCC) on different datasets.
EER is reported in (MDR, FAR) format. The best two EERs are in boldface. (Row 7) The last row shows the absolute improvements in mean EER (mean(MDR, FAR)) over the best baseline method obtained by Speaker2Vec models averaged across all datasets. Evaluation dataset Dierent distance metrics with MFCC features Speaker2Vec models Adapted Speaker2Vec mod- els BIC with MFCC13 GD with MFCC13 KL2 with MFCC13 KL with MFCC40 Tedlium YouTube YouTube large Tedlium YouTube YouTube large TEDLIUM test (47.50, 65.35) (60.00, 61.54) (62.50, 63.94) (78.50, 76.05) (32.00, 32.32) (44.00, 46.38) (44.50, 43.58) (26.00, 28.26) (30.00, 33.87) (40.00, 31.19) Couples Therapy (49.11, 44.00) (50.79, 44.09) (51.92, 44.24) (57.83, 57.95) (42.39, 42.83) (42.95, 42.36) (43.96, 44.88) (42.45, 42.09) (42.28, 42.36) (43.68, 44.00) RT-06 (tol. 0:25s) (75.35, 69.61) (76.51, 70.71) (76.51, 69.61) (75.32, 75.38) (64.74, 64.41) (65.33, 63.54) (64.08, 65.21) (63.00, 63.31) (62.43, 61.94) (63.95, 62.99) RT-06 (tol. 0:5s) (51.71, 50.84) (52.65, 48.57) (53.47, 48.47) (53.47, 48.47) (46.31, 47.81) (47.59, 47.05) (51.72, 48.34) (47.43, 45.80) (44.69, 45.42) (44.92, 45.87) Mean improvement w.r.t. best baseline 9.92 6.62 5.73 11.72 11.14 9.44 performance of forced alignment systems [22], and our own requirements have inspired us to use this for evaluation. Table 6.2 shows the results in terms of equal error rates (EER) as dened in [56]. Each tuple denotes mis-detection rate (MDR) and false alarm rate (FAR). Four baseline methods have been chosen. The rst three use 13 dimensional MFCC features and apply BIC [39], Gaussian divergence (GD) [15] and KL2 [178] metrics respectively. The fourth algorithm uses the 40 dimensional MFCC features we use in our method, and directly applies the same KL divergence we have adopted. We have used forgiveness tolerance [40] of both0:5s and0:25s for RT-06 data to see the performance of our model as it decreases. For all other datasets,0:5s is used similar to other works [40], [106]. We can see from Table 6.2 that the embeddings consistently perform much better than MFCC. The last row shows the absolute improvements in mean EER over the best baseline methods obtained by dierent Speaker2Vec models averaged across all test 132 Table 6.3: Comparison of the proposed method with some state-of-the-art papers along with the characteristics (duration and actual number of change points) of the articial dialogs they created from TIMIT dataset for evaluation. The best two results are in boldface. Method Delacourt et al. [56] Kotti et al. [105] Kotti et al. [106] Chen et al. [38] Ma et al. [120] Speaker2Vec models Adapted Speaker2Vec models Tedlium YouTube YouTube large Tedlium YouTube YouTube large Duration (mins) { 6.42 60 16.67 160 45.22 Number of change points 60 136 935 250 3200 1010 Unsupervised? Yes Yes Yes No No Yes F1 score { 0.73 0.78 0.74 0.79 0.82 0.82 0.83 0.86 0.85 0.85 EER (15.6, 28.2) (30.5, 21.8) (5.1, 28.9) (19, 25) (16.9, 18.8) (18.51, 18.19) (17.62, 17.33) (17.23, 17.77) (14.75, 14.99) (14.95, 15.03) (15.74, 15.54) datasets. The TED-LIUM model, the best among unadapted models, achieves a 9.92% absolute improvement. The unsupervised adaptation on test data gives some boost in the performance. We obtain 11.72% and 11.14% absolute improvements in mean EER for the adapted TED-LIUM and YouTube models respectively. Figure 6.2 shows the ROC curves [158] obtained by dierent algorithms on TED-LIUM test dataset. 
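For reference, the sketch below shows one way the (MDR, FAR) numbers of the kind reported in Table 6.2 can be computed from reference and hypothesized change points with a forgiveness tolerance. The matching convention (greedy, one-to-one within the tolerance) and the normalization are assumptions for illustration; the exact protocol of [56] may differ.

```python
import numpy as np

def mdr_far(ref_points, hyp_points, tol=0.5):
    """Mis-detection rate and false alarm rate (%) for speaker change detection.

    A reference change point counts as detected if an unmatched hypothesized point
    lies within +/- tol seconds of it; unmatched hypothesized points are false alarms.
    """
    ref = np.asarray(sorted(ref_points), dtype=float)
    hyp = np.asarray(sorted(hyp_points), dtype=float)
    matched_hyp = np.zeros(len(hyp), dtype=bool)
    detected = 0
    for r in ref:
        cand = np.where((~matched_hyp) & (np.abs(hyp - r) <= tol))[0]
        if cand.size:
            matched_hyp[cand[0]] = True
            detected += 1
    mdr = 100.0 * (len(ref) - detected) / max(len(ref), 1)
    far = 100.0 * (len(hyp) - detected) / max(len(hyp), 1)
    return mdr, far

# Hypothetical reference and hypothesized change points (in seconds).
print(mdr_far([10.0, 25.0, 40.0, 61.0], [9.8, 26.1, 55.0], tol=0.5))
```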
One important point is that even though the TED-LIUM model performs best for the TED-LIUM test data (which is expected), the YouTube models give competitive results, and they perform much better than the baselines. 6.3.3.2 Comparison with other state-of-the-art works To compare the performance of the proposed method with more sophisticated speaker segmentation algorithms, we have adopted TIMIT dataset [2]. We have created ten articial dialogs by randomly choosing speakers from the dataset. We have removed the inter-speaker silence regions similar to [106] to make the problem more challenging. Table 6.3 tabulates the characteristics of the articial data created from TIMIT corpus 133 for evaluation in dierent papers, along with their learning strategies, F1 scores [106], and EERs. We achieve 6% and 7% absolute increments in F1 score over the best published work [120] with adapted YouTube and TED-LIUM models respectively. 6.3.4 Discussion The proposed self-supervised neural predictive coding algorithm was able to explore tem- poral context of the audio signal, and learn distinctive speaker representations. The NPC embeddings performed satisfactorily in the speaker segmentation task when evaluated on dierent datasets from varying acoustic environments, even with a very simple distance metric. One major benet is that these embeddings can be employed in typical speaker segmentation or diarization systems, e.g., by replacing MFCC features with the NPC embeddings. One could note that the proposed NPC framework only explored latent data similarity patterns based on the short-term speaker stationarity hypothesis. How can we extend the algorithm to also learn dissimilarity patterns from the data? This will be presented next. 134 6.4 NPC for learning latent similarity and dissimilarity patterns 6.4.1 Method We extend the NPC method to learn similarity as well as dissimilarity patterns from the data. This necessitates the mining of contrastive samples (same vs. dierent pairs) from the unlabeled data with the help of self-supervision. 6.4.1.1 Contrastive sample creation The contrastive samples are generated by distinguishing between dierent speakers. Dur- ing training phase, the NPC algorithm possesses no information about the actual speaker identities, but only learns whether two input audio chunks are generated from the same speaker or not. We provide the NPC model two kinds of samples [42]. The rst kind con- sists of pairs of speech segments that come from the same speaker, called genuine pairs. The second type consists of speech segments from two dierent speakers, called impostor pairs. This approach has been used in the past for numerous applications [73, 42, 104], but all of them needed labeled datasets. The challenge is how we can create such samples if we do not have labeled acoustic data. We again hypothesize short-term speaker stationarity [89] as described in Section 6.3.1. The genuine pairs are mined from consecutive segments of speech. To nd the impostor pairs, we choose two random segments from two dierent audio streams in our unlabeled dataset. 
Intuitively, the probability of finding the same speaker in an impostor pair is much lower than the probability of getting two different speakers in it, provided a sufficiently large unsupervised dataset. For example, when sampling two random YouTube videos, the likelihood of getting the same speaker in both is very low.

Figure 6.3: NPC training scheme utilizing the short-term speaker stationarity hypothesis. Left: Contrastive sample creation from an unlabeled dataset of audio streams; genuine and impostor pairs are created from the unlabeled dataset. Right: The siamese network training method; the genuine and impostor pairs are fed into it for binary classification. "FC" denotes a Fully Connected hidden layer in the DNN.

The left part of Fig. 6.3 shows this contrastive sample creation process. Audio stream 1 and audio stream i (for any i between 2 and N, where N is the number of audio streams in the dataset) are shown there. Assume the vertical red lines denote (unknown) speaker change points. (w_1, w_2) is a window pair in which each of the two windows has d feature frames. This window pair is moved over the audio streams with a shift of Δ to get the genuine pairs. For every w_1, we randomly pick a window w'_2 of the same length from a different audio stream to get an impostor pair. All these samples are then fed into the siamese DNN for binary classification of whether an input pair is genuine or impostor.

A siamese neural network (see the right part of Fig. 6.3), first introduced for signature verification [25], consists of two identical twin networks with shared weights that are joined at the top by some energy function [73, 42, 104]. Generally, siamese networks are provided with two inputs and trained by minimizing the energy function, which is a predefined metric between the highest-level feature transformations of the two inputs. The weight sharing ensures that similar inputs are transformed into embeddings in close proximity to each other. The inherent structure of a siamese network enables us to learn similar or dissimilar input pairs with discriminative energy functions [42, 73]. Similar to [104], we use an L_1 distance loss between the highest-level outputs of the siamese twin networks for the two inputs.

6.4.1.2 Siamese Convolutional layers

The effectiveness of CNNs has been well established in the computer vision field [110, 179]. Recently, speech scientists have also been applying CNNs to different challenging tasks like speech recognition [3, 79], speaker recognition [116, 127, 119], large scale audio classification [78], etc. The general benefits of using CNNs can be found in [71] and in the above papers. In our work, the inspiration to use CNNs comes from the need to explore spectral and temporal contexts together through 2D convolution over the mel-spectrogram features. The benefits of such a 2D convolution have also been shown with more traditional signal processing feature sets such as Gabor features [33]. Our siamese network (one of the identical twins), built using multiple CNN layers and one dense layer at the highest level, is shown in Fig. 6.4. We gradually reduce the kernel size from 7×7 to 5×5, 4×4, and 3×3.
We have used 2×2 max-pooling layers after every two convolutional layers.

Figure 6.4: The DNN architecture employed in each of the siamese twins (input: 100 frames × 40-dimensional MFCC features). All the weights are shared between the twins. The kernel sizes are denoted under the red squares. 2×2 max-pooling is used as shown by the yellow squares. All the feature maps are denoted as N@x×y, where N = number of feature maps and x×y = size of each feature map. The final feature maps are flattened to a 3200-dimensional vector, and the dimension of the speaker embedding is 512. "FC" = Fully Connected layer.

The size of the stride for all convolution and max-pooling operations has been chosen to be 1. We have used the Leaky ReLU nonlinearity [121] after every convolutional or fully connected layer (omitted from Fig. 6.4 for clearer visualization). We have applied batch normalization [86] after every layer to reduce the "internal covariate shift" [86] of the network. This helped the network avoid overfitting and converge faster without the need for dropout layers [187]. After the last convolutional layer, we get 32 feature maps, each of size 20×5. We flatten these maps to get a 3200-dimensional vector, which is connected to the final 512-dimensional NPC embedding through a fully connected layer. The embeddings are obtained before applying the Leaky ReLU non-linearity†.

† Following the standard convention for extracting embeddings from DNNs, such as in [183].

6.4.1.3 Comparing Siamese embeddings: Loss functions

Let f(x_1) and f(x_2) be the highest-level outputs of the siamese twin networks for inputs x_1 and x_2 (in other words, (x_1, x_2) is one contrastive sample obtained from the window pair (w_1, w_2) or (w_1, w'_2)). We will use this transformation f(x) as our "embedding" for any input x (see the right part of Fig. 6.3). Here, x is a matrix of size d × m, and it denotes the d frames of m-dimensional MFCC feature vectors in window w. Similarly, x_i denotes the feature frames in window w_i for i = 1, 2. We have deployed two different types of loss functions for training the NPC model, described below.

1) Cross entropy loss: Inspired by [104], the loss function is designed such that it decreases the weighted L_1 distance between the embeddings f(x_1) and f(x_2) if x_1 and x_2 form a genuine pair, and increases it if they form an impostor pair. The "L_1 distance vector" (Fig. 6.3, right) is obtained by calculating the element-wise absolute difference between the two embedding vectors f(x_1) and f(x_2) and is given by:

$L(\mathbf{x}_1, \mathbf{x}_2) = |f(\mathbf{x}_1) - f(\mathbf{x}_2)|$   (6.2)

We connect $L(\mathbf{x}_1, \mathbf{x}_2)$ to two outputs $g_i(\mathbf{x}_1, \mathbf{x}_2)$ using a fully connected layer:

$g_i(\mathbf{x}_1, \mathbf{x}_2) = \sum_{k=1}^{D} w_{i,k} \, |f(\mathbf{x}_1)_k - f(\mathbf{x}_2)_k| + b_i$   (6.3)

for $i = 1, 2$. Here, $f(\mathbf{x}_1)_k$ and $f(\mathbf{x}_2)_k$ are the $k$-th elements of the vectors $f(\mathbf{x}_1)$ and $f(\mathbf{x}_2)$ respectively, and $D$ is the length of those vectors (so $D$ is the embedding dimension). The $w_{i,k}$'s and $b_i$'s are the weights and bias for the $i$-th output. Note that these weights and biases affect only the binary classifier; they are not part of the siamese network. A softmax layer produces the final probabilities:

$p_i(\mathbf{x}_1, \mathbf{x}_2) = s(g_i(\mathbf{x}_1, \mathbf{x}_2)) \quad \text{for } i = 1, 2.$   (6.4)

Here $s(\cdot)$ is the softmax function given by

$s(g_i(\mathbf{x}_1, \mathbf{x}_2)) = \frac{e^{g_i(\mathbf{x}_1, \mathbf{x}_2)}}{e^{g_1(\mathbf{x}_1, \mathbf{x}_2)} + e^{g_2(\mathbf{x}_1, \mathbf{x}_2)}} \quad \text{for } i = 1, 2.$   (6.5)

The network in Fig. 6.3 is provided with the genuine and impostor pairs. We use the cross entropy loss here.
2) Cosine embedding loss: We also analyze the performance of the network when we directly minimize a contrastive loss function between the embeddings, so there is no need to add an extra fully connected layer at the end. The employed cosine embedding loss is defined as

L_{\cos}(x_1, x_2) = \begin{cases} 1 - C(f(x_1), f(x_2)), & \text{if } y(x_1, x_2) = 0 \\ C(f(x_1), f(x_2)), & \text{if } y(x_1, x_2) = 1 \end{cases}

Here C(f(x1), f(x2)) is the cosine similarity between f(x1) and f(x2), defined as

C(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2},

where \|\cdot\|_2 denotes the L2 norm. The performance of the two types of loss functions will be analyzed through experimental evidence.

6.4.1.4 Extracting NPC embeddings for test audio
Once the DNN model is trained, we use it for extracting speaker embeddings from any test audio stream. As discussed in Section 6.4.1.3, the transformation achieved by the siamese network on an input segment x of length d frames is given by f(x). We use only this siamese part of the network to transform a sequence of MFCC frames of any speech segment into NPC embeddings by using a sliding window w of d frames and shifting it by 1 frame along the entire sequence.
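As a concrete illustration of this extraction step, the sketch below slides a d-frame window over an MFCC sequence with a one-frame hop and passes each window through the trained twin encoder (the hypothetical NPCEncoder from the earlier sketch); batching the windows is an implementation detail assumed here for efficiency.

```python
import torch

@torch.no_grad()
def extract_npc_embeddings(encoder, mfcc, d=100, batch_size=256):
    """mfcc: torch.Tensor of shape (num_frames, 40) -> (num_frames - d + 1, 512)."""
    encoder.eval()
    # All d-frame windows with a shift of one frame: (num_windows, d, 40).
    windows = mfcc.unfold(0, d, 1).permute(0, 2, 1)
    embeddings = []
    for i in range(0, windows.size(0), batch_size):
        batch = windows[i:i + batch_size].unsqueeze(1)   # (B, 1, d, 40)
        embeddings.append(encoder(batch))
    return torch.cat(embeddings, dim=0)
```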
6.4.2 Experiments

6.4.2.1 NPC Training Datasets
Table 6.4 shows the training datasets along with their approximate total durations and the number of contrastive samples created from each dataset. We train three different models individually on these datasets, and we call every trained model by the name of the dataset used for training along with the NPC prefix (for example, the NPC model trained on YoUSCTube data will be called NPC YoUSCTube).

Table 6.4: NPC Training Datasets

  Name of the dataset   Size (hours)   Number of samples
  Tedlium               100            358K
  Tedlium-Mix           110            395K
  YoUSCTube             584            2.1M

1) Tedlium dataset: The Tedlium dataset is built from the Tedlium training corpora [167]. It originally contained 666 unique speakers, but we have removed the 19 speakers that are also present in the Tedlium development and test sets (since the Tedlium dataset was originally developed for speech recognition purposes, it has speaker overlap between the train and dev/test sets). The contrastive samples created from the Tedlium dataset are less noisy (compared to the case for the YoUSCTube data, as will be discussed next), because most of the audio streams in the Tedlium data are from a single speaker talking in the same environment for a long time (although there is some noise, for example, speech of the audience, clapping etc.). The reason for employing this dataset is two-fold: First, the model trained on the Tedlium data will provide a comparison with the models trained on the Tedlium-Mix and YoUSCTube datasets for a validation of the short-term speaker stationarity hypothesis. Second, since the test set of the speaker identification experiment will be from the Tedlium test data, this will help demonstrate the difference in performance for in-domain and out-of-domain evaluation.

2) Tedlium-Mix dataset: The Tedlium-Mix dataset is created mainly to validate the short-term speaker stationarity hypothesis. We create the Tedlium-Mix dataset by creating artificial dialogs through randomly concatenating utterances. Tedlium is annotated, so we know the utterance boundaries. We thus simulate a dialog that has a random speaker every other utterance of the main speaker. For every audio stream, we reject half of the total utterances, and between every two utterances we concatenate a randomly chosen utterance from a randomly chosen speaker (i.e., S, R1, S, R2, S, R3, ..., where the S's are the utterances of the main speaker and Ri (for i = 1, 2, 3, ...) is a random utterance from a randomly chosen speaker, i.e., a random utterance from another Ted recording). In this way we create the Tedlium-Mix dataset, which has a speaker change after every utterance for every audio stream. It also has almost the same size as the Tedlium dataset.

3) YoUSCTube dataset: A large amount of audio data of various types has been collected from YouTube to create the YoUSCTube dataset. We have chosen YouTube for this purpose because of its virtually unlimited supply of data from diverse environments. The dataset has multilingual data including English, Spanish, Hindi and Telugu from heterogeneous acoustic conditions like monologues, multi-speaker conversations, movie clips with background music and noise, outdoor discussions etc.

6.4.2.2 Validation data
The Tedlium development set (8 unique speakers) has been used as validation data for all training cases. We used the utterance start and end time stamps and the speaker IDs provided in the transcripts of the Tedlium dataset to create the validation set so that it does not have any noisy labels.

6.4.2.3 Data for speaker identification experiment
The Tedlium test set (11 unique speakers from 11 different Ted recordings) has been employed for the speaker identification experiment. Similar to the development dataset, it has the start and end time of every utterance for every speaker as well as the speaker IDs. We have extracted the utterances from every speaker, and all utterances of a particular speaker have been assigned the corresponding speaker ID. These have been used for creating the experimental scenarios for speaker classification. Similar to the validation set, the labels of this dataset are very clean since they are created using the human-labeled speaker IDs.

6.4.2.4 Data for speaker verification experiment
A large speaker verification corpus, VoxCeleb (version 1) [139], is employed for the speaker verification experiment. It has a total of 1251 unique speakers with 154K utterances at a 16 kHz sample rate. The average number of sessions per speaker is 18. We use the default development and test split provided with the dataset and mentioned in [139].

6.4.2.5 Feature and model parameters
We employ 40-dimensional MFCC features computed from 16 kHz audio with a 25 ms window and 10 ms shift using the Kaldi toolkit [157]. We choose d = 100 frames (1 s) and Δ = 200 frames (2 s). Therefore, each window is a 100×40 matrix, and we feed this to the first CNN layer of our network (Fig. 6.4). The employed model has a total of 1.8M parameters and has been trained using the RMSProp optimizer [193] with a learning rate of 10^-4 and a weight decay of 10^-6. The held-out validation set (as described above) has been used for model selection.
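For reference, the snippet below reproduces this front-end and optimizer configuration. librosa is used as a stand-in for the Kaldi MFCC pipeline actually employed (so exact coefficient values will differ), the file name is a placeholder, and NPCSiamese refers to the hypothetical model sketched earlier.

```python
import librosa
import torch

# 40-dimensional MFCCs from 16 kHz audio with a 25 ms window and 10 ms shift
# (a librosa approximation of the Kaldi features used here).
wav, sr = librosa.load("example.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=40,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr)).T

d, delta = 100, 200                        # 1 s windows, 2 s shift between window pairs
x = torch.tensor(mfcc[:d], dtype=torch.float32).view(1, 1, d, 40)   # one 100x40 input

model = NPCSiamese()                       # hypothetical model from the earlier sketch
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4, weight_decay=1e-6)
```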
6.4.3 Results

6.4.3.1 Convergence curves
Fig. 6.5 shows the convergence curves, in terms of the binary classification accuracy of classifying genuine or impostor pairs, for training the DNN model separately on the different datasets, along with the corresponding validation accuracies. The development set used for calculating the validation accuracy is the same for all the training sets, and it does not contain any noisy samples. In contrast, our training set is noisy, since it is unlabeled and relies on short-term stationarity for assigning same/different-speaker pairs.

Figure 6.5: Binary classification accuracies of classifying genuine or impostor pairs for NPC models trained on the Tedlium, Tedlium-Mix, and YoUSCTube datasets. Both training and validation accuracies are shown. The best validation accuracies for all the models are marked by big stars (*).

We can see from Fig. 6.5 that NPC Tedlium reaches almost 100% training accuracy, but NPC Tedlium-Mix converges at a lower training accuracy, as expected. This is due to the larger portion of noisy samples present in the Tedlium-Mix dataset that arise from the artificially introduced fast speaker changes and the simultaneous hypothesis of short-term speaker stationarity‡. However, this does not hurt the validation accuracy on the development set, which is both distinct from the training set and correctly labeled: we obtain 90.19% and 90.48% for the NPC Tedlium and NPC Tedlium-Mix trained models respectively. We believe this is because the model is correctly learning not to label some of the assumed same-speaker pairs as same-speaker when there is a speaker change that we introduced via our mixing, due to the large amount of correct data that compensates for the smaller amount of mislabeled pairs.

The NPC YoUSCTube model reaches much better training accuracy than the NPC Tedlium-Mix model, even with fewer epochs. This points both to increased robustness due to the increased data variability and to the fact that speaker changes in real dialogs are not as fast as we simulated in the Tedlium-Mix dataset. It is interesting to see that the NPC YoUSCTube model achieves a slightly better validation accuracy (92.16%) than the other two models, even though the training dataset had no explicit domain overlap with the validation data. We think this is because of the huge size (approximately 6 times larger than the Tedlium dataset) and widely varying acoustic environments of the YoUSCTube dataset.

‡ The corpus is created by mixing turns. This means that there are 54,778 speaker change points in the 115 hours of audio. However, in this case we assumed that there are no speaker changes in consecutive frames. If the change points were uniformly distributed, then that would result in an upper bound of 87%.
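The 87% figure quoted in the footnote is consistent with the following back-of-the-envelope calculation, under the assumption (ours, for illustration) that each change point invalidates roughly one second, i.e., one window length, of genuine pairs:

1 - 54,778 / (115 × 3600) ≈ 1 - 0.13 ≈ 0.87,

i.e., about 87% of the one-second windows would be free of a speaker change point.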
6.4.3.2 Validation of the short-term speaker stationarity hypothesis
Here we analyze the validation accuracies obtained by the NPC models trained separately on the Tedlium and Tedlium-Mix datasets. From Fig. 6.5 it is quite clear that both models could achieve similar validation accuracies, although the Tedlium-Mix dataset has audio streams containing speaker changes at every utterance and the Tedlium dataset contains mostly single-speaker audio streams. The reason is that even though there are frequent speaker turns in the Tedlium-Mix dataset, the short length of context (d = 100 frames = 1 s) chosen to learn the speaker characteristics ensures that the total number of correct same-speaker pairs dominates the falsely labeled same-speaker pairs. Therefore the sudden speaker changes have little impact and do not deteriorate the performance of the neural network on the development set. This result validates the short-term speaker stationarity hypothesis.

6.4.3.3 Speaker identification evaluation
1) Frame-level embedding visualization: Visualization of high-dimensional data is vital in many scenarios because it can reveal the inherent structure of the data points. For speaker characteristics learning, visualizing the employed features can manifest the clusters formed around different speakers and thus demonstrate the efficacy of the features. We use t-SNE visualization [199] for this purpose. We compare the following features (the terms in boldface show the names we will use to call the features).
(a) MFCC: Raw MFCC features.
(b) MFCC stats: This is generated by moving a sliding window of 1 s along the raw MFCC features with a shift of 1 frame (10 ms) and taking the statistics (mean and standard deviation in the window) to produce a new feature stream. This is done for a fair comparison of MFCC and the embeddings (since the embeddings are generated using 1 s of context).
(c) NPC YoUSCTube Cross Entropy: Embeddings extracted with the NPC YoUSCTube model using the cross entropy loss.
(d) NPC Tedlium Cross Entropy: Embeddings extracted with the NPC Tedlium model using the cross entropy loss.
(e) NPC YoUSCTube Cosine: Embeddings extracted with the NPC YoUSCTube model using the cosine embedding loss.
(f) i-vector: 400-dimensional i-vectors extracted independently every 1 s using a sliding window with a 10 ms shift. The i-vector system (Kaldi VoxCeleb v1 recipe) was trained on the VoxCeleb dataset [139] (16 kHz audio). It is not possible to train an i-vector system on YoUSCTube since it contains no labels on speaker-homogeneous regions.

Fig. 6.6 shows the 2-dimensional t-SNE visualizations of the frames (at 10 ms resolution) of the above features extracted from the Tedlium test dataset containing 11 unique speakers. For better visualization of the data, we chose only 2 utterances from every speaker, and the feature frames from a total of 22 utterances become our input dataset for the t-SNE algorithm.

Figure 6.6: t-SNE visualizations of the frames of different features for the Tedlium test data containing 11 speakers (2 utterances per speaker): (a) MFCC, (b) MFCC stats, (c) i-vector, (d) NPC YoUSCTube Cosine, (e) NPC YoUSCTube Cross Entropy, (f) NPC Tedlium Cross Entropy. Different colors represent different speakers.
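A visualization of this kind can be produced along the following lines; the variable names and the scikit-learn t-SNE settings are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(frames, speaker_ids, title):
    """frames: (num_frames, feat_dim) features pooled over the 22 utterances;
    speaker_ids: (num_frames,) integer speaker label for each frame."""
    points = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(frames)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        plt.scatter(points[mask, 0], points[mask, 1], s=2, label=str(spk))
    plt.title(title)
    plt.axis("off")
```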
From Fig. 6.6 we can see that the raw MFCC features are very noisy, but the inherent smoothing applied to compute the MFCC stats features helps the features of the same speaker come closer. However, we notice that although some same-speaker features cluster in lines, these lines are far apart in the space, which indicates that the MFCC features capture additional information. For example, we see that the speaker denoted in green occupies both the very left and very right parts of the t-SNE space. The i-vector plot looks similar to the MFCC stats plot and does not cluster the speakers very well. This is consistent with existing literature [98], which showed that i-vectors do not perform well for short utterances, especially when the training utterances are comparatively longer. Below, we will see that the utterance-level i-vectors perform much better for speaker classification.

The NPC YoUSCTube Cosine embeddings underperform the cross entropy-based methods, possibly because of the poorer convergence we observed during training. They are also noisier than MFCC stats and i-vectors, indicating that even a little change in the input (just 10 ms of extra audio) perturbs the embedding space, which might not be desirable. The NPC YoUSCTube Cross Entropy and NPC Tedlium Cross Entropy embeddings provide a much better distinction between different speaker clusters. Moreover, they also provide much better cluster compactness compared to the MFCC and i-vector features. Among the NPC embeddings, the NPC YoUSCTube Cross Entropy features provide possibly the best t-SNE visualization. They even produce better clusters than NPC Tedlium Cross Entropy, although the latter is trained on in-domain data. The larger size of the YoUSCTube dataset might be the reason behind this observation.

2) Frame-level speaker identification: We perform frame-level speaker identification experiments on the Tedlium development set (8 speakers) and the Tedlium test set (11 speakers). By frame-level classification we mean that every frame in the utterance is independently classified as to its speaker ID. The reason for evaluating with frame-level speaker classification is that better frame-level performance conveys the inherent strength of the system to derive short-term features that carry speaker-specific characteristics. It also shows the possibility of replacing MFCCs with the proposed embeddings by incorporating them in systems such as [206] and [68].

Table 6.5 shows a detailed comparison between the 6 different features described above in terms of frame-wise speaker classification accuracies. We have tabulated the accuracies for different numbers of enrollment utterances (in other words, training utterances for the speaker ID classifier) per speaker. We have used a kNN classifier (with k=1) for speaker classification. The reason for using such a naive classifier is to reveal the true potential of the features, and not to harness the strength of the classifier. We have repeated each experiment 5 times and the average accuracies have been reported here. Each time we have held out 5 random utterances from each speaker for testing. The same seen (enrollment) and test utterances have been used for all types of features, and in all cases the test and enrollment sets are distinct.

Table 6.5: Frame-level speaker classification accuracies (%) of different features with a kNN classifier (k=1). All features below are trained on unlabeled data except i-vector, which requires speaker-homogeneous files.

  Tedlium development set
  # of Enrollment Utterances   MFCC    MFCC stats   NPC YoUSCTube Cross Entropy   NPC Tedlium Cross Entropy   NPC YoUSCTube Cosine   i-vector VoxCeleb
  1                            48.75   72.70        79.05                         80.25                       62.97                  70.26
  2                            54.12   81.33        87.26                         88.32                       70.04                  79.07
  3                            57.05   84.11        89.56                         89.62                       73.77                  82.58
  5                            61.36   88.85        92.34                         92.00                       78.59                  87.80
  8                            63.38   89.73        91.62                         91.33                       79.07                  88.91
  10                           64.13   90.17        92.42                         91.88                       80.84                  89.12

  Tedlium test set
  # of Enrollment Utterances   MFCC    MFCC stats   NPC YoUSCTube Cross Entropy   NPC Tedlium Cross Entropy   NPC YoUSCTube Cosine   i-vector VoxCeleb
  1                            38.02   70.45        75.62                         76.40                       56.30                  64.02
  2                            44.08   79.43        83.75                         83.21                       58.18                  74.50
  3                            46.39   81.98        85.06                         84.79                       59.05                  76.76
  5                            50.24   86.20        89.12                         88.65                       62.18                  81.65
  8                            51.56   87.70        89.66                         89.07                       64.33                  84.21
  10                           52.65   88.46        90.34                         89.94                       65.79                  88.13
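The frame-level evaluation protocol reduces to the following sketch (scikit-learn 1-nearest-neighbor, with hypothetical variable names for the enrollment and test frames):

```python
from sklearn.neighbors import KNeighborsClassifier

def frame_level_accuracy(enroll_feats, enroll_labels, test_feats, test_labels):
    """enroll_feats/test_feats: (num_frames, dim); labels: per-frame speaker IDs."""
    knn = KNeighborsClassifier(n_neighbors=1)     # k = 1, plain Euclidean distance
    knn.fit(enroll_feats, enroll_labels)
    return knn.score(test_feats, test_labels)     # fraction of frames correctly classified
```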
From Table 6.5 we can see that MFCC stats perform much better than raw MFCC features. We think the reason is that the raw features are much noisier than the MFCC stats features because of the implicit smoothing performed during the statistics computation. The NPC YoUSCTube and NPC Tedlium models with cross entropy loss perform quite similarly (for the test data, NPC YoUSCTube even performs better), even though the former is trained on out-of-domain data. This highlights the benefits and possibilities of employing out-of-domain unsupervised learning using publicly available data. NPC YoUSCTube Cosine does not perform well compared to the other NPC embeddings. The i-vectors perform worse than the NPC embeddings and MFCC stats for frame-level classification due to the reasons discussed above and as reported by [98].

3) Utterance-level speaker identification: Here we are interested in the utterance-level speaker identification task. We compare the NPC YoUSCTube Cross Entropy (out-of-domain (OOD) YouTube), MFCC, and i-vector (OOD VoxCeleb) methods. For MFCC and NPC embeddings, we calculate the mean and standard deviation vectors over all frames in a particular utterance, and concatenate them to produce a single vector for every utterance. For i-vector, we calculate one i-vector for the whole utterance.

We applied LDA (trained on the development part of VoxCeleb) to project the 400-dimensional i-vectors to a 200-dimensional space. This gave better performance for i-vectors (also observed in the literature [183]) and lets us compare the unsupervised NPC embeddings with the best possible i-vector configuration. We again classify using a k-NN classifier with k = 1, as explained above, to focus on the feature performance and not on the next layer of trained classifiers.

Table 6.6 shows the classification accuracies for different features with an increasing number of enrollment utterances. In each enrollment scenario, 5 randomly held-out utterances from each of the 11 speakers have been used for testing, and the process has been repeated 20 times to report the average accuracies.

Table 6.6: Utterance-level speaker classification accuracies (%) of different features with a kNN classifier (k=1). Red italics indicates the best performing single-feature classification result, while bold text indicates the best overall performance.

  Tedlium test set
  # of Enrollment Utterances   MFCC stats   NPC YoUSCTube stats   i-vector VoxCeleb   NPC YoUSCTube + i-vector
  1                            75.12        82.12                 86.38               85.62
  2                            83.00        87.88                 91.75               92.12
  3                            84.88        89.88                 93.12               93.12
  5                            91.25        94.50                 95.25               95.88
  8                            92.12        95.00                 96.62               97.25
  10                           92.50        95.25                 95.12               96.50

  Tedlium development set
  # of Enrollment Utterances   MFCC stats   NPC YoUSCTube stats   i-vector VoxCeleb   NPC YoUSCTube + i-vector
  1                            80.00        83.27                 85.00               86.82
  2                            87.64        92.36                 92.82               95.73
  3                            92.18        95.45                 94.82               97.09
  5                            92.09        95.36                 93.91               95.45
  8                            95.27        97.36                 95.82               97.36
  10                           96.54        98.00                 96.73               98.09

Both i-vectors and NPC YoUSCTube embeddings perform similarly. It is interesting to note the complementary behavior of the concatenated i-vector-embedding feature. From Table 6.6 we can see that NPC YoUSCTube Cross Entropy + i-vector performs the best in almost all the cases.
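A sketch of the utterance-level feature construction and the feature fusion used in Table 6.6 follows; the function names are illustrative, and the i-vector itself is assumed to come from an external Kaldi pipeline.

```python
import numpy as np

def utterance_vector(frame_feats):
    """Pool frame-level features (num_frames, dim) into one utterance vector (2*dim,)."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

def fused_vector(npc_utt_vec, ivector):
    """'NPC YoUSCTube + i-vector': simple concatenation of the two utterance-level vectors."""
    return np.concatenate([npc_utt_vec, ivector])

# As before, classification uses a 1-nearest-neighbor classifier, e.g.
# sklearn.neighbors.KNeighborsClassifier(n_neighbors=1).fit(enroll_vectors, enroll_speaker_ids)
```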
An additional important point is that the classifier used is the simple 1-Nearest Neighbor classifier. So, we believe that the highly non-Gaussian nature of the embeddings (as can be observed from Fig. 6.6) might not be captured well by the 1-NN, since it is based on the Euclidean distance, which underperforms in complex manifolds such as we observe with the NPC embeddings. This motivates future work on higher-layer, utterance-based, neural network-derived features that build on top of these embeddings.

6.4.3.4 Upper-bound comparison experiment
Table 6.7 compares the performance of i-vector, x-vector [184], and the proposed NPC embeddings for the speaker verification task on VoxCeleb v1 data using the default Dev and Test splits [139] distributed with the dataset.

Table 6.7: Speaker verification on VoxCeleb v1 data. i-vector and x-vector use the full utterance in a supervised manner for evaluation, while the proposed embedding operates on 1-second windows with simple statistics (mean+std) over an utterance.

  Method          Training domain   Feature context   Speaker labels   Speaker homogeneity   minDCF   EER (%)
  i-vector [59]   ID                Full              No               Yes                   0.73     8.80
  x-vector        ID                Full              Full             Yes                   0.61     7.21
  NPC stats       OOD               1 sec             No               No                    0.87     15.54

We want to highlight that since the assumption for our system is that we have absolutely no labels during DNN training (in fact, our YouTube-downloaded data are not even guaranteed to be speech!), the comparison with x-vector or i-vector is highly asymmetric. To simplify this explanation:
• Our proposed method uses "some random audio": completely self-supervised and challenging data.
• i-vector uses "speech" with labels on "speaker-homogeneous regions": unsupervised, with a supervised step on clean data.
• x-vector uses "speech" with labels of "id of speaker": completely supervised on clean data.
Moreover, the i-vector and x-vector here are trained on the in-domain (ID) Dev part of the VoxCeleb dataset. On the other hand, the NPC model is trained out-of-domain (OOD) on unlabeled YouTube data. Please note that here "out-of-domain" refers to the generic characteristics of the YoUSCTube dataset compared to the VoxCeleb dataset. For example, the VoxCeleb dataset was mined using the keyword "interview" [139] along with the speaker name, and the active speaker detection [139] ensured the active presence of that speaker in the video. On the other hand, the YoUSCTube dataset is mined without any constraints, thus generalizing more to realistic acoustic conditions. Moreover, having only celebrities [139] in the VoxCeleb dataset helped it find multiple sessions of the same speaker, which subsequently helped the supervised DNN models to be more channel-invariant. However, such freedom is not available in the YoUSCTube dataset, thus paving a way to build unsupervised models that can be trained or adapted in such conditions.

Finally, the i-vector and x-vector systems employ the whole utterance, of average length 8.2 s (min = 4 s, max = 145 s) [139], while the NPC model is only producing 1-second estimates. While we do intend to incorporate more contextual learning for longer sequences, in this work we are focusing on the low-level feature and hence employ statistics (mean and std) of the embeddings. This is suboptimal and creates an uninformed information bottleneck; however, it is a necessary and easy way to establish an utterance-based feature, thus enabling comparison with the existing methods.
For all the above reasons, we expect that any evaluation against i-vector and x-vector can only be seen as an upper bound, and we do not expect to beat either of these two in performance. The i-vector performance is as reported in [139]. No data augmentation is performed for x-vector, for a fair comparison.

To maintain standard scoring mechanisms, we employed LDA to project the embeddings onto a lower-dimensional space, followed by PLDA scoring as in [184, 139]. The same VoxCeleb Dev data is utilized to train the LDA and PLDA models for all methods for a fair comparison. The LDA dimension is 200 for the x-vector and i-vector [139] systems, and 100 for the NPC system. We report the minimum normalized detection cost function (minDCF) for P_target = 0.01 and the Equal Error Rate (EER). We can see that the best in-domain supervised method is 30% better than the unsupervised NPC in terms of minDCF.

6.4.4 Discussion
Based on the experiments stated above, we have established that the resulting embedding captures significant information about the speakers' identity. The feature has been shown to be considerably better than knowledge-driven features such as MFCCs or statistics of MFCCs, and even more robust than supervised features such as i-vectors operating on 1-second windows. Importantly, the proposed embedding showed strong portability by operating better on the Tedlium dataset when trained on larger amounts of random audio from the collected YoUSCTube corpus than when trained in-domain on the Tedlium dataset itself. Also importantly, we have shown that if we purposely create a fast-changing dialog by mixing the Tedlium utterances, the short-term stationarity hypothesis still holds. This encourages the use of unlabeled data.

Evaluating this embedding, however, is challenging, as its use is not obvious until it is employed in a full-blown speaker identification framework. This requires several more stages of development that we will discuss further in this work, along with the shortcomings of this embedding. However, we can, and we are, providing some early evidence that the embedding does indeed capture significant information about the speaker. In Section 6.4.3.3 we present results that compare utterance-based classification systems on the Tedlium data. We compare the i-vector system, which is optimized for utterance-level classification and employs supervised data, with a very simple statistic (mean and std) of our proposed unsupervised embedding. We show that our embedding provides very robust results that are comparable to the i-vector system. The shortcoming of this comparison is that the utterances are drawn from the Tedlium dataset, and they are likely also incorporating channel information. We provide some suggestions for overcoming this shortcoming further in this section.

We proceeded, in Section 6.4.3.4, to present results that compare utterance-based verification systems on the VoxCeleb v1 test set. Here we wanted to provide an upper-bound comparison. We evaluated the i-vector and supervised x-vector methods trained on VoxCeleb data. These methods are able to employ the full utterance as a single observation, while the proposed embedding only operates at a < 1 second resolution; hence we again aggregate via an uninformed information bottleneck (mean and std). We see that despite the information bottleneck and the completely unsupervised and out-of-domain nature of the experiment, our proposed system still achieves an acceptable performance, with a 30% worse minDCF than x-vector.
The above observations and analysis provide many directions for future work. Given that all our same-speaker examples come from the same channel, we believe that the proposed embedding captures both channel and speaker characteristics. This provides an opportunity for data augmentation, and hence a reduction of the channel influence. In future work we intend to augment the nearby frames such that contextual pairs come from a range of different channels through augmentation. This also provides another opportunity for joint channel and speaker learning. Through the above augmentation we can jointly learn same vs. different speakers and same vs. different channels, thus providing disentanglement and more robust speaker representations. Further, triplet learning [116], especially with hard triplet mining, has been shown to provide improved performance, and we intend to use such an architecture in future work to directly optimize intra- and inter-class distances in the manifold.

One additional opportunity for improvement is to employ a larger neural network. We employed a CNN with only 1.8M parameters for our training (as an initial attempt to check the validity of the proposed method), whereas recent CNN-based speaker verification systems employ much deeper networks (e.g., VoxCeleb's baseline CNN comprises 67M parameters). We think utilizing recent state-of-the-art deep architectures will improve the performance of the proposed technique for large-scale speaker verification experiments.

Finally, and more applicable to the speaker ID task, we need embeddings that can capture information from longer sequences. As we see in Section 6.4.3.4, the supervised speaker identification methods are able to exploit longer-term context, while the proposed embedding is only able to serve as a short-term feature. This requires either supervised methods, towards higher-level information integration, or, more in alignment with our interests, better unsupervised context exploitation. For example, we can employ a better aggregation mechanism via unsupervised diarization using this embedding to identify speaker-homogeneous regions, and then employ Recurrent Neural Networks (RNNs) [71] towards longitudinal information integration.

6.5 Conclusions and future directions
In this chapter, we proposed a self-supervised technique to learn speaker-specific characteristics from unlabeled data that contain any kind of audio, including speech, environmental sounds, and multiple overlapping speakers of many languages. We proposed two variants of the NPC training framework. In the first variant, we mined same-speaker samples based on the short-term speaker stationarity hypothesis, and trained an encoder-decoder model to predict the future from past context. In the second framework, we mined both same- and different-speaker samples, and trained a siamese network with a contrastive loss.

We showed that it is possible to learn speaker representations from a completely unlabeled dataset by employing the NPC self-supervision technique. The analysis of the proposed out-of-domain self-supervised method against the in-domain supervised methods helps identify challenges and raises a range of opportunities for future work, including longitudinal information integration and the introduction of robustness to channel characteristics.

Chapter 7

Summary and Future works

This chapter summarizes the main contributions of the thesis, and provides a brief overview of future research directions.
7.1 Summary
In this thesis, we have investigated multiple sources of variability in an audio representation learning system. Specifically, we have studied signal-space and semantic-space variabilities. For signal-space variabilities, we looked at noise and reverberation, long-term temporal variability, and deliberate perturbation of audio through adversarial attacks. For semantic-space variability, we studied the granularity of annotations and the absence of annotations.

By leveraging the understanding of multiple sources of variability in a deep neural network-based audio representation learning system, we have proposed new robust audio representation learning methods that explore the temporal context and latent similarity patterns of the data. We have applied the proposed methods in multiple real-life applications such as speaker recognition, speaker segmentation, audio event classification, and acoustic scene identification from egocentric audio recordings. The extensive analysis and experimental results demonstrate the efficacy of the proposed methods for robust audio representation learning.

7.2 Future works

7.2.1 Audio event representation learning in arbitrary label-space hierarchy
In Chapter 5 we proposed metric learning methods that helped learn representations of audio events that follow a bi-level hierarchy. We aim to further extend this method to work for any arbitrary hierarchy with a possibly much deeper and unbalanced structure. This can facilitate audio event detection and identification in a much larger dataset like AudioSet [66]. Moreover, learning the hierarchical relationships in audio representations will also open up the possibility of retrieving unseen audio classes, e.g., retrieving the parent audio class when the classifier fails to recognize a novel audio class.

7.2.1.1 Proposed idea
The idea is to learn two separate embedding spaces: one for the audio events, and the other for their semantic text labels. Then, find a suitable mapping to connect the two embedding spaces in a way that subsequently helps in the identification and retrieval of the events. Some related work has been proposed in the past [60, 28] that learns both data and label embedding spaces for cross-modal retrieval, but limited work could be found that addresses this problem for a hierarchical label space having an arbitrary tree structure.

7.2.2 Temporal dynamics of speaker embeddings in egocentric audio recordings
In Chapter 2 and Chapter 6 we proposed supervised and self-supervised methods to learn speaker embeddings by exploiting latent similarity relationships present in the data or label space. Here, we plan to investigate the temporal dynamics of a speaker embedding, specifically, how it varies across different social situations and activities. We hypothesize that the speaking style and characteristics of a certain speaker might vary depending on their activity (like eating, playing etc.) and social interactions (like talking to someone over the phone, socializing with a group, giving a job talk etc.). Moreover, from the egocentric perspective of a user, these variations can happen throughout the course of their daily life. Finding the variations in the speaker representation can help us learn the strengths and weaknesses of different speaker representation learning algorithms, and guide us in developing more robust embeddings.
It can also have potential applications in building speaker recognition and tracking systems which can track the voice segments of a particular speaker in a multi-speaker conversation.

Bibliography

[1] Spring 2006 (rt-06s) rich transcription meeting recognition evaluation plan. http://www.itl.nist.gov/iad/mig/tests/rt/2006-spring/docs/rt06s-meeting-eval-plan-V2.pdf.

[2] Timit acoustic-phonetic continuous speech corpus. https://catalog.ldc.upenn.edu/LDC93S1. LDC Catalog No. LDC93S1.

[3] Ossama Abdel-Hamid, Abdel Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing, 22(10):1533–1545, 2014.

[4] Justice Adams, Arindam Jati, Sudha Krishnamurthy, Masanori Omote, Jian Zheng, Naveen Kumar, Min-Heng Chen, and Ashish Singh. Action description for on-demand accessibility, April 30 2020. US Patent App. 16/177,197.

[5] Jesper J Alvarsson, Stefan Wiens, and Mats E Nilsson. Stress recovery during exposure to nature sound and environmental noise. International journal of environmental research and public health, 7(3):1036–1046, 2010.

[6] Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):356–370, 2012.

[7] Denise L Anthony, Ajit Appari, and M Eric Johnson. Institutionalizing HIPAA compliance: organizations and competing logics in US health care. Journal of health and social behavior, 55(1):108–124, 2014.

[8] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, pages 274–283, 2018.

[9] Pradeep K Atrey, Namunu C Maddage, and Mohan S Kankanhalli. Audio based event detection for multimedia surveillance. In ICASSP, volume 5, pages V–V. IEEE, 2006.

[10] Jean-Julien Aucouturier, Boris Defreville, and Francois Pachet. The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music. The Journal of the Acoustical Society of America, 122(2):881–891, 2007.

[11] Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo. Direct acoustics-to-word models for english conversational speech recognition. In Proc. Interspeech 2017, pages 959–963, 2017.

[12] Anderson R Avila, Milton Sarria-Paja, Francisco J Fraga, Douglas O'Shaughnessy, and Tiago H Falk. Improving the performance of far-field speaker verification using multi-condition training: The case of gmm-ubm and i-vector systems. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[13] Simon P Banbury and Dianne C Berry. Office noise and employee concentration: Identifying causes of disruption and potential improvements. Ergonomics, 48(1):25–37, 2005.

[14] Daniele Barchiesi, Dimitrios Giannoulis, Dan Stowell, and Mark D Plumbley. Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Processing Magazine, 32(3):16–34, 2015.

[15] Claude Barras, Xuan Zhu, Sylvain Meignier, and J-L Gauvain. Multistage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1505–1512, 2006.

[16] Timo Becker, Michael Jessen, and Catalin Grigoras. Forensic speaker verification using formant features and gaussian mixture models.
In Ninth Annual Conference of the International Speech Communication Association, 2008. [17] Goran Belojevi c, Evy Ohrstr om, and Ragnar Rylander. Eects of noise on mental performance with regard to subjective noise sensitivity. International archives of occupational and environmental health, 64(4):293{301, 1992. [18] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798{1828, 2013. [19] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a prac- tical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289{300, 1995. [20] Manfred E Beutel, Claus J unger, Eva M Klein, Philipp Wild, Karl Lackner, Maria Blettner, Harald Binder, Matthias Michal, J org Wiltink, Elmar Br ahler, et al. Noise annoyance is associated with depression and anxiety in the general population-the contribution of aircraft noise. Plos one, 11(5), 2016. [21] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndi c, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387{402. Springer, 2013. [22] Matthew Black, Athanasios Katsamanis, Chi-Chun Lee, Adam C Lammert, Brian R Baucom, Andrew Christensen, Panayiotis G Georgiou, and Shrikanth S Narayanan. 165 Automatic classication of married couples' behavior using audio features. In IN- TERSPEECH, pages 2030{2033, 2010. [23] Daniel Bone, Chi-Chun Lee, Theodora Chaspari, James Gibson, and Shrikanth Narayanan. Signal processing and machine learning for mental health research and clinical applications. IEEE Signal Processing Magazine, 34(5):189{196, September 2017. [24] Brandon M Booth, Karel Mundnich, Tiantian Feng, Amrutha Nadarajan, Tiago H Falk, Jennifer L Villatte, Emilio Ferrara, and Shrikanth Narayanan. Multimodal human and environmental sensing for longitudinal behavioral studies in naturalistic settings: Framework for sensor selection, deployment, and management. Journal of medical Internet research, 21(8):e12832, 2019. [25] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard S ackinger, and Roopak Shah. Signature verication using a" siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737{744, 1994. [26] Emre Cakir, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. Polyphonic sound event detection using multi label deep neural networks. In 2015 international joint conference on neural networks (IJCNN), pages 1{7. IEEE, 2015. [27] Joseph P Campbell. Speaker recognition: A tutorial. Proceedings of the IEEE, 85(9):1437{1462, 1997. [28] Yue Cao, Mingsheng Long, Jianmin Wang, and Shichen Liu. Deep visual-semantic quantization for ecient image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1328{1337, 2017. [29] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019. [30] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE symposium on security and privacy (sp), pages 39{57. IEEE, 2017. [31] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. 
In 2018 IEEE Security and Privacy Workshops (SPW), pages 1{7. IEEE, 2018. [32] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960{4964. IEEE, 2016. [33] Shuo-Yiin Chang and Nelson Morgan. Robust cnn-based speech recognition with gabor lter kernels. In Fifteenth Annual Conference of the International Speech Communication Association, 2014. 166 [34] Guangke Chen, Sen Chen, Lingling Fan, Xiaoning Du, Zhe Zhao, Fu Song, and Yang Liu. Who is real bob? adversarial attacks on speaker recognition systems. arXiv preprint arXiv:1911.01840, 2019. [35] Guangke Chen, Sen Chen, Lingling Fan, Xiaoning Du, Zhe Zhao, Fu Song, and Yang Liu. Who is real bob? adversarial attacks on speaker recognition systems. CoRR, abs/1911.01840, 2019. [36] Ke Chen. Towards better making a decision in speaker verication. Pattern Recog- nition, 36(2):329{346, 2003. [37] Ke Chen and Ahmad Salman. Extracting speaker-specic information with a reg- ularized siamese deep network. In Advances in Neural Information Processing Sys- tems, pages 298{306, 2011. [38] Ke Chen and Ahmad Salman. Learning speaker-specic characteristics with a deep neural architecture. IEEE Transactions on Neural Networks, 22(11):1744{1756, 2011. [39] Scott Chen and Ponani Gopalakrishnan. Speaker, environment and channel change detection and clustering via the bayesian information criterion. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, volume 8, pages 127{ 132. Virginia, USA, 1998. [40] Shih-Sian Cheng, Hsin-Min Wang, and Hsin-Chia Fu. Bic-based speaker segmen- tation using divide-and-conquer strategies with application to speaker diarization. IEEE transactions on audio, speech, and language processing, 18(1):141{157, 2010. [41] Fran cois Chollet et al. Keras. https://keras.io, 2015. [42] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric dis- criminatively, with application to face verication. In CVPR, volume 1, pages 539{546. IEEE, 2005. [43] Andrew Christensen, David C Atkins, Sara Berns, Jennifer Wheeler, Donald H Baucom, and Lorelei E Simpson. Traditional versus integrative behavioral couple therapy for signicantly and chronically distressed married couples. Journal of consulting and clinical psychology, 72(2):176, 2004. [44] Selina Chu, Shrikanth Narayanan, and C-C Jay Kuo. Environmental sound recog- nition with time{frequency audio features. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1142{1158, 2009. [45] Selina Chu, Shrikanth Narayanan, C-C Jay Kuo, and Maja J Mataric. Where am I? scene recognition for mobile robots using audio features. In Int. conference on multimedia and expo, pages 885{888. IEEE, 2006. [46] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. In Proc. Interspeech 2018, pages 1086{1090, 2018. 167 [47] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018. [48] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Em- pirical evaluation of gated recurrent neural networks on sequence modeling. In NeurIPS Workshop on Deep Learning, December 2014. [49] Charlotte Clark and Stephen A Stansfeld. The eect of transportation noise on health and cognitive development: A review of recent evidence. 
International Jour- nal of Comparative Psychology, 20(2), 2007. [50] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012. [51] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292{2300, 2013. [52] Steven Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE transactions on acoustics, speech, and signal processing, 28(4):357{366, 1980. [53] Najim Dehak, Reda Dehak, Patrick Kenny, Niko Br ummer, Pierre Ouellet, and Pierre Dumouchel. Support vector machines versus fast scoring in the low- dimensional total variability space for speaker verication. In Tenth Annual con- ference of the international speech communication association, 2009. [54] Najim Dehak, Patrick J Kenny, R eda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verication. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788{798, 2010. [55] Najim Dehak, Patrick J Kenny, R eda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verication. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788{798, 2011. [56] Perrine Delacourt and Christian J Wellekens. Distbic: A speaker-based segmenta- tion for audio data indexing. Speech communication, 32(1):111{126, 2000. [57] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual represen- tation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, pages 1422{1430, 2015. [58] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2051{2060, 2017. [59] Gr egor Dupuy, Mickael Rouvier, Sylvain Meignier, and Yannick Esteve. I-vectors and ILP clustering adapted to cross-show speaker diarization. In Thirteenth Annual Conference of the International Speech Communication Association, 2012. 168 [60] Benjamin Elizalde, Shuayb Zarar, and Bhiksha Raj. Cross modal audio search and retrieval with joint embeddings based on text and audio. In ICASSP 2019- 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4095{4099. IEEE, 2019. [61] Florian Eyben, Martin W ollmer, and Bj orn Schuller. Opensmile: the munich ver- satile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, pages 1459{1462. ACM, 2010. [62] Tiago H Falk and Wai-Yip Chan. Modulation spectral features for robust far- eld speaker identication. IEEE Transactions on Audio, Speech, and Language Processing, 18(1):90{100, 2010. [63] Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon Booth, and Shrikanth Narayanan. TILES audio recorder: An unobtrusive wearable solution to track audio activity. In 4th ACM Workshop on Wearable Sys. and Apps., pages 33{38. ACM, 2018. [64] Daniel Garcia-Romero, Xinhui Zhou, and Carol Y Espy-Wilson. Multicondition training of gaussian plda models in i-vector space for noise and reverberation robust speaker recognition. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4257{4260. IEEE, 2012. [65] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. Timit acoustic-phonetic continuous speech corpus ldc93s1, 1993. 
[66] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776{780. IEEE, 2017. [67] Oguzhan Gencoglu, Tuomas Virtanen, and Heikki Huttunen. Recognition Of Acoustic Events Using Deep Neural Networks. Proc. 22nd EUSIPCO, pages 506{ 510, 2014. [68] Sina Hamidi Ghalehjegh and Richard C Rose. Deep bottleneck features for i-vector based text-independent speaker verication. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on, pages 555{560. IEEE, 2015. [69] Ond rej Glembek, Luk a s Burget, Niko Br ummer, Old rich Plchot, and Pavel Mat ejka. Discriminatively trained i-vector extractor for speaker verication. In Twelfth An- nual Conference of the International Speech Communication Association, 2011. [70] Ondrej Glembek, Franti sek Gr ezl, Martin Kara at, David A van Leeuwen, Pavel Matejka, Petr Schwarz, and Albert Strasheim. Fusion of heterogeneous speaker recognition systems in the stbu submission for the nist speaker recognition evalua- tion 2006. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2007. 169 [71] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learn- ing, volume 1. MIT press Cambridge, 2016. [72] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harness- ing adversarial examples. In International Conference on Learning Representations, 2015. [73] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learn- ing an invariant mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pages 1735{1742. IEEE, 2006. [74] John HL Hansen and Tauq Hasan. Speaker recognition by machines and humans: A tutorial review. IEEE Signal processing magazine, 32(6):74{99, 2015. [75] Tian Hao, Guoliang Xing, and Gang Zhou. isleep: unobtrusive sleep quality moni- toring using smartphones. In Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems, page 4. ACM, 2013. [76] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770{778, 2016. [77] Hynek Hermansky. Perceptual linear predictive (plp) analysis of speech. the Jour- nal of the Acoustical Society of America, 87(4):1738{1752, 1990. [78] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Sey- bold, et al. CNN architectures for large-scale audio classication. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131{135. IEEE, 2017. [79] Georey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recogni- tion: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82{97, 2012. [80] Georey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504{507, 2006. [81] Julia Bell Hirschberg and Andrew Rosenberg. V-measure: a conditional entropy- based external cluster evaluation. Proceedings of EMNLP, 2007. [82] Che-Wei Huang and Shrikanth Narayanan. 
Abstract
The audio signal carries a multitude of latent information about phonemes and language, speaker identity, audio events and acoustic scenes, noise, and other channel-specific characteristics. Learning a fixed-dimensional vector representation, or embedding, of the audio signal that captures a subset of these information streams can be useful in numerous applications. For example, an audio representation that captures information about the underlying acoustic scene or environment can help us build smart devices that provide context-aware and personalized user experiences. Audio embeddings carrying speaker-specific characteristics can be utilized in speaker identity-based authentication systems.

A typical deep neural network-based audio representation learning system can encounter a diverse set of variabilities that span the input signal-space (such as environmental noise, long-term variability, and deliberate perturbation through adversarial attacks) and the output semantic-space (granularity of annotations, or the absence of annotations).

In this thesis, we investigate multiple sources of variability in an audio representation learning system and leverage that knowledge to learn robust deep audio representations by exploring context information and latent similarity patterns in the data. The proposed methods have been successfully applied in several cutting-edge applications, such as audio event recognition, egocentric acoustic scene characterization, and speaker recognition in noisy conditions and under adversarial attacks. Extensive experimental results demonstrate the efficacy of the learned embeddings and their utility for robust audio representation learning.