Context-Aware Models for Understanding and Supporting Spoken Interactions with Children

by Manoj Kumar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2020

Copyright 2020 Manoj Kumar

Dedicated to my parents Prabakaran Srinivasan and Abitha Perumalsamy

Table of Contents

Dedication
List of Tables
List of Figures
Abstract

Part I: Introduction
Chapter 1: Introduction
  1.1 Motivation
  1.2 Child-Adult Dyadic Conversations
  1.3 Background
    1.3.1 Child Speech Processing by Computers
    1.3.2 Automatic Speech Recognition
    1.3.3 Child-Adult Classification from Speech
    1.3.4 Exploiting Context in Human Computer Interactions
  1.4 Thesis Overview

Part II: Background Context for Child-Adult Classification
Chapter 2: Meta-Learning for Variability
  2.1 Datasets
  2.2 Baseline Methods
    2.2.1 X-vectors as Speaker Representations
    2.2.2 Siamese Networks
  2.3 Prototypical Networks for Few-Shot Learning
    2.3.1 Batch training
    2.3.2 Extension to multiple sessions
  2.4 Experiments
    2.4.1 Weakly Supervised Classification
    2.4.2 Speaker Clustering
    2.4.3 Qualitative Analysis using TSNE
  2.5 Conclusions
Chapter 3: Designing Neural Speaker Embeddings with Meta-Learning
  3.1 Background
  3.2 Methods
    3.2.1 Prototypical Networks
    3.2.2 Relation Networks
    3.2.3 Use in Speaker Applications
  3.3 Datasets
  3.4 Experiments and Results
    3.4.1 Baseline Speaker Embeddings
    3.4.2 Meta-learned embeddings
    3.4.3 Speaker Diarization Results
      3.4.3.1 Effect of classes within a task
      3.4.3.2 Performance across different domains in DIHARD
      3.4.3.3 Performance across different child age groups
    3.4.4 Speaker Verification Results
      3.4.4.1 Robust Speaker Verification
  3.5 Conclusions
Chapter 4: Entropy Filtering for Self Learning
  4.1 Problem Formulation
    4.1.1 Datasets
  4.2 Entropy Filtering for Selecting Robust Enrolment Data
  4.3 Iterative Bootstrapping for Self Learning
    4.3.1 Classification of uncertain segments
  4.4 Experimental Results
    4.4.1 System selection using baseline performance
    4.4.2 Session-level adaptation strategies
    4.4.3 Effect of Uncertainty Threshold
  4.5 Conclusions

Part III: Interlocutor Context for ASR and Speaker Diarization
Chapter 5: LM Adaptations for Child ASR
    5.0.1 Spontaneous Child Speech
  5.1 Background
    5.1.1 Lexical Adaptation
    5.1.2 Semantic Adaptation
    5.1.3 Use of Context for ASR Adaptation
  5.2 Evaluation Corpora
    5.2.1 Forensic Interviews
    5.2.2 Play-based, naturalistic sessions for children with ASD with an adult social partner
  5.3 Methods
    5.3.1 Lexical Context: Token Matching
    5.3.2 Semantic Context
    5.3.3 Linear Interpolation
  5.4 Experiments
    5.4.1 Baseline child ASR
    5.4.2 Baseline adult ASR
    5.4.3 Domain Adaptations
    5.4.4 Context Adaptation
  5.5 Results and Discussions
    5.5.1 Domain Adaptation
    5.5.2 Context Adaptation
    5.5.3 Effect of context type
    5.5.4 Effect of context size
    5.5.5 Effect of external factors
  5.6 Conclusions
Chapter 6: Error Analysis for Improving Speaker Diarization
  6.1 Background
    6.1.1 Analyzing Diarization Errors
  6.2 Dataset
  6.3 Contextual Factors
  6.4 Experiments
    6.4.1 Baseline system
    6.4.2 Error Analysis
    6.4.3 Learning to Improve Diarization Errors
  6.5 Conclusions

Part IV: Applications of Context-aware Descriptors
Chapter 7: Robust Behavioral Descriptors using End-to-End Pipeline
  7.1 Dataset
  7.2 Experiments
  7.3 Results
  7.4 Conclusion
Chapter 8: Tracking Treatment Outcomes among Children with Autism
  8.1 Background
  8.2 Dataset
  8.3 Methods
    8.3.1 Prosodic Descriptors
    8.3.2 Turn Conversational Descriptors
    8.3.3 Language Descriptors
    8.3.4 Correlation Analysis
  8.4 Results
    8.4.1 Descriptor Variation across Child Age
    8.4.2 Identifying Group Differences using Pipeline Descriptors
    8.4.3 Effect of treatment on speech descriptors
  8.5 Conclusion
Chapter 9: Detecting Truthful Language from Child Speech
  9.1 Narrative Truth Induction (NTI) Corpus
  9.2 Methods
    9.2.1 Speech to Text Alignment
    9.2.2 Turn Conversational Descriptors
    9.2.3 Classification Models
  9.3 Results
  9.4 Conclusions

Part V: Conclusion
Chapter 10: Concluding Remarks
Chapter 11: Future Directions
References

List of Tables

2.1 Statistics of child-adult corpora used in this work.
2.2 Child-adult classification results using macro-F1 (%).
2.3 Speaker clustering results using purity (%).
3.1 Overview of training and evaluation corpora.
3.2 Statistics of corpora used for speaker verification, including trial subsets created for analysis purposes.
3.3 Statistics of corpora used for speaker diarization.
3.4 Selecting a baseline system for speaker diarization. For each embedding and clustering method (AHC-f: AHC with fixed threshold, AHC-p: AHC with optimized threshold, bSC: binarized spectral clustering with normalized maximum eigengap), diarization error rate (DER %) is provided for two settings: using oracle speaker count (Oracle) and estimated count (Est).
3.5 Speaker diarization results comparing meta-learning models with x-vectors. x-vector+retrain represents mean DER computed with 3 trials.
3.6 Analysis of child-adult diarization performance on the ADOS-Mod3 corpus. For each age group, mean DER (%) of sessions in each group is presented along with relative improvement in parentheses.
3.7 Selecting a baseline system for speaker verification. Results are presented as equal error rate (EER %).
3.8 Speaker verification results comparing meta-learning models with x-vectors. Results presented using EER and minDCF computed at P_target = 0.01.
3.9 Analysis of speaker verification based on microphone location (Near: near-field, Far: far-field, Obs: fully obscured) in the VOiCES corpus and level of degradation artefacts in the SITW corpus.
4.1 Data statistics showing number of labeled and unlabeled samples.
4.2 Mean UAR for speaker clustering on ADOS and LID on CALLFRIEND. Results are reported separately for each number of languages (2-5) on CALLFRIEND. (E: entropy-based filtering, B: bootstrapping, D: cosine-distance based assignment for uncertain segments)
5.1 Statistics for child speech in Forensic Interview (FI) sessions and Autism Spectrum Disorder (ASD) sessions. *Session duration for ASD is averaged across the ADOS (n=1, usually lasting 40-60 minutes) and BOSCC sessions (n=21, usually lasting 12 minutes).
5.2 Details of child speech corpora used in training and evaluation of the baseline ASR model. *CSLU statistics are computed after speech-to-text alignment.
5.3 Word error rates (%) for baseline models applied on CIDMIC and the child speech portion from Forensic Interviews (FI) and Autism diagnosis sessions (ASD).
5.4 Improvements to word-error rate and perplexity scores for domain adaptation. *The AM-adapted model was discarded and the baseline model used instead for all further experiments.
5.5 Global context adaptation using ASR hypotheses (Session LM) and ground truth transcripts (Session LM - Oracle).
5.6 Utterance-level adaptation results. For each method and corpus, results are reported for forward (F), backward (B) and bidirectional (Bi) directions of context adaptation. GT-Oracle represents adaptation using ground truth transcripts.
5.7 Effect of utterance length (U), child age (A) and adult WER (W) on the adaptation performance measured using WER and perplexity. Each entry presents the statistically significant factors (p < 0.1, *p < 0.01) as determined by ANOVA.
6.1 Statistics of child-adult interactions. clinic: clinical interactions administered by a psychologist; BOSCC-high and BOSCC-low represent parent-child interactions for children with high and low language level respectively.
6.2 DER results from the error correction network with system VAD and oracle VAD.
8.1 Details of BOSCC sessions used in this chapter. Locations: New York University (NYU), Icahn School of Medicine at Mount Sinai (MSSM), Center for Autism and the Developing Brain (CADB) at Weill Cornell Medicine and Albert Einstein College of Medicine (EIN).
8.2 Significant correlations between pipeline descriptors from each participant and biological age of the child (p < 0.05*, p < 0.01**, n.s - not significant).
8.3 Significant correlations between pipeline descriptors and symptom severity.
8.4 List of descriptors which exhibit different significant changes over time for treatment and control groups.
9.1 Outcome counts in the NTI Corpus. The table lists the session count after removing a few due to speech-to-text alignment issues.
9.2 Classification results reported as unweighted F1-scores for truth-telling and disclosure tasks on the NTI corpus.
9.3 Top important descriptors for each type, based on classification performance. For each feature, group trends are represented using ↑: higher value for positive class; and ↓: lower value for positive class.

List of Figures

1.1 Interplay between observed signals and latent mental states. In this dissertation, I develop methods for incorporating contextual information for developing speech processing applications, and validate using respective task performance and statistical models (dashed arrows).
2.1 Illustration of meta-learning. Typically, the target task contains classes unseen in any of the training tasks.
2.2 Training in protonets (left) vs siamese networks (right) in the embedding space. Colored backgrounds represent class decision regions. Distances from the query sample (non-filled) to prototypes from each class (filled with black) are used to estimate the training loss using Equations (2.2) and (2.3). Siamese networks are trained to maximize similarity between same-speaker pairs (dashed line) and minimize similarity between different-speaker pairs (solid line). Illustration adopted from [191, 214].
2.3 Illustrating the classification and clustering methods used in this chapter.
2.4 TSNE visualizations for protonet embeddings (left) and x-vectors (right) for 3 test sessions in the ASD corpora.
3.1 Overview of baseline and meta-learning architectures. (a) A time-delay layer F(N, D, K) which forms the basic component across models. At each time-step, activations from the previous layer are computed using a context width of K and a dilation of D. N represents the output embedding dimension. (b) Baseline x-vector model. Kaldi speaker embeddings are extracted at the fc1 layer. I find that fc2 and fc1 embeddings perform better for speaker diarization and speaker verification respectively. (c) Prototypical network architecture. Layers marked with a dashed boundary are initialized with pre-trained x-vector models, while layers with a solid boundary are randomly initialized. The final layer output is referred to as protonet embeddings. (d) Relation encoder architecture. The final layer output is referred to as relation network embeddings. Relation scores are computed using these embeddings as illustrated in Fig. 3.2(b).
3.2 (a) Illustrating the training step in prototypical networks. Decision regions are indicated using background colors. For each class, prototypes are estimated as the centroid of supports (filled shapes). Given the query (unfilled shape), negative distances to each prototype are treated as logits. Adopted from [102]. (b) Comparison module in relation networks. The sum of support embeddings from class c (v_c) is concatenated with a query embedding (f(x_j)) and input to the comparison network. r_{c,j} is known as the relation score for query x_j with respect to class c and treated as the logit.
3.3 Speaker diarization performance (% DER) across different corpora for different combinations of support examples and training classes within an episode. Number of queries per class is always 1 in all experiments.
3.4 Diarization performance across domains in DIHARD. For each domain, the mean DER across sessions is provided for baseline (x-vectors), protonets and relation nets. The relative change in DER (%) with respect to the baseline is given next to the bar (positive: DER reduction).
3.5 An example room configuration from the VOiCES corpus.
4.1 Illustrating the conventional setup for supervised learning vs the alternative setup. The latter is explored in this chapter.
4.2 Overall methodology for session adaptation.
4.3 Relation between posterior probability and class-specific classification performance for two different language combinations from the CALLFRIEND corpus: (left) Farsi vs German, R^2 = 0.53; (right) French vs Hindi, R^2 = 0.49.
4.4 Illustrating entropy-filtering: the relationship between p_{C_i} and exp(e_{C_i}), the fraction of enrolment samples.
4.5 Effect of threshold for determining uncertain segments on mean UAR (in %). Presented in combination with entropy-filtering and bootstrapping for each corpus.
4.6 Classification performance against fine-tuned uncertainty threshold for ADOS and CALLFRIEND: 3 languages. The abscissa has been scaled non-linearly for better visualization.
5.1 Transcript excerpts (top: Forensic Interviews; bottom: ASD session) illustrating information flow between the speech from adult and child. Child phrases similar to contextual adult phrases are indicated in blue. Directional flows from adult-to-child and child-to-adult are indicated using green and red respectively.
5.2 (a) Training a seq2seq model with a single interlocutor utterance at the encoder and target child utterance at the decoder. At each decoder unit, words from the child utterance are used to compute cross-entropy loss. (b) Inference using a trained seq2seq model. At each decoder unit, the top hypotheses for the next word are sampled and fed into the successive decoder unit until the <END> token is encountered.
5.3 Proposed context adaptation framework.
5.4 Effect of number of context utterances on the perplexity and WER. For each case, up to 10 context utterances are used. Results are provided using both oracle transcripts (blue, continuous) and ASR hypotheses (red, dashed).
6.1 Illustrating the conversational descriptors: utterance length and speaker change proximity.
6.2 Effect of contextual factors on adult speech (top) and child speech (bottom) for the baseline diarization system. For each speaker and factor range, all possible outcomes are normalized to sum to 1 so as to display the error distributions uniformly across the context ranges. Within each bar, the outcomes (from top to bottom) follow: correctly classified frames, missed frames and misclassified frames.
6.3 The attention network used during error correction. The speaker labels from the baseline diarization system and the factors are independently attended to in time using feed-forward attention. Context vectors from each source are merged and passed to a fully connected network to predict the output label.
6.4 Speaker errors for home BOSCC sessions before and after DNN error correction.
7.1 Illustrating extraction of pipeline descriptors and oracle descriptors. In the former (top row), a speech/language processing pipeline consisting of multiple concatenated components is used to estimate who spoke when (speaker labels; task: speaker diarization) and what was spoken (transcripts; task: automatic speech recognition). Alternatively, trained annotators (bottom row) mark speaking times and words by listening to the audio.
7.2 Most robust descriptors (NMSE) from each modality due to errors arising from speaker diarization (top), and both speech detection and speaker diarization (bottom).
7.3 Predicted calibrated severity scores using oracle descriptors and pipeline descriptors for two error conditions: (1) (top) errors from speech activity detection and speaker diarization modules, (2) (bottom) errors from the speaker diarization module.
9.1 An overview of the toy-play and interviewing segments from a narrative truth induction (NTI) session. There are four possible outcomes of the session depending on whether the toy breaks, and whether the child answers the interviewer's questions truthfully.
9.2 Overview of the two-pass alignment pipeline used in the NTI corpus.

Abstract

Speech produced by children is inherently different from adult speech, and automatic child speech understanding is considered a relatively harder task when compared to adult speech. While decades of speech research have resulted in significant technical innovations, researchers have often met with limited success when developing technologies for child speech comprehension. A number of challenges in child speech processing have been put forward in recent times, from a speech production perspective and from a machine learning perspective.
In recent times, however, there is growing interest in understanding child speech both from a commercial viewpoint (digital assistants, interactive entertainment services) and for developmental health and other societal applications. In the former, the child typically interacts with a computer agent and the conversations are often confined to a constrained topic set. In this thesis, I focus on the latter, which involve naturalistic interactions such as with an adult professional and tend to be goal oriented (e.g., during clinical diagnosis) while spanning multiple topics and activities. I develop methods for automated child speech understanding which can be broadly categorized into multiple subtasks: (child) speech detection, speaker classification (e.g., child versus adult), child speech-to-text, and inferring high-level behavioral constructs from acoustic-language features that are relevant to the interaction domain. I address these tasks by developing context-driven approaches, where context often manifests as circumstances that shape the participants' immediate responses as well as propagate through the course of interaction, often shaping the outcome. I hypothesize that contextual information from multiple sources and time scales can enhance or substitute for information that may be severely degraded or not available in speech produced by the child, in order to support overall understanding and modeling. In the case of child-adult interactions, these contextual sources can be broadly categorized based on their origin: the environment, the adult interlocutor and the child itself. I then illustrate applications of the proposed context-aware techniques to child-adult interactions in the autism research and forensic interview (FI) domains.

In the first part, I address background context, which can be defined as sources of variability from the environment that are potentially unique to an interaction but can be considered constant/stationary across the interaction session. While background context can be straightforward to identify (age and language level of the child, symptom severity), its effect on participants' speech and language is often complex. I present two methods for incorporating background context for the task of child-adult classification from speech: entropy-filtering for identifying robust enrolment samples, and prototypical networks for training a session-invariant discriminative space using the meta-learning paradigm.

In the second part, I use interlocutor context for improving child speech understanding in two tasks. In the first task, I exploit the language of the adult interlocutor as a time-varying context source for improving child automatic speech recognition (ASR), which is typically considered a more challenging problem than adult ASR. I use lexical and semantic information from the language in the vicinity of a child speech utterance to adapt language models for ASR. Further, I study the role of directionality and length of context in this use-case scenario. In the second task, I address speaker diarization using participants' contextual features. I first study the relation between diarization system performance and context features, and propose a method to improve the errors by incorporating context features in a post-processing step.

In the final part of the dissertation, I apply the context-aware approaches to two important clinical and mental health domains involving child speech understanding: Autism Spectrum Disorder (ASD) and Child Forensic Interviews (CFI). As a first step before application, I validate the descriptors extracted using a speech and language pipeline. I compare these automatically extracted descriptors with those created using manual annotations for speaker labels and transcripts. Following this, I study the descriptors on two tasks in the ASD domain: identifying group differences in the speech of children higher and lower on the spectrum, and analyzing the effect of parent-administered intervention treatment. Finally, I apply similar descriptors to detect truthfulness in child speech during a testimony following a minor transgression task. The results from the above studies help validate the automatically extracted descriptors for large-sample studies involving child speech.

The proposed context-aware approaches, along with the applications, illustrate their applicability in real-world scenarios. While previous approaches have focused on a single module such as ASR, this dissertation makes the case for a holistic analysis of an integrated speech and language pipeline. The descriptor studies expose potential trade-offs between individual component performance and overall descriptor robustness, especially when the source signal is degraded or contains incomplete information. Finally, results from the large-sample studies demonstrate the validity of fully automated descriptors in critical domains. Replacing manual annotations offers potential reductions in administration time and cost while improving focus on more informative descriptors which capture interaction dynamics.

Part I: Introduction

Chapter 1: Introduction

1.1 Motivation

Interpersonal interactions are a complex phenomenon that develops with information exchange between the participants. The exchange is often multi-modal and requires the ability to acquire and interpret behavioral cues from the other participant(s) as well as to follow up with an appropriate response (or lack thereof), including shared affect. Humans are particularly receptive to such cues at multiple time-scales, and often shape their responses based on these cues as part of a socially meaningful interaction. This phenomenon, known as reciprocal determinism [8, 9], states that "an individual's behavior influences and is influenced by both the social world and personal characteristics." Hence, each participant's behavior depends on persistent feedback mechanisms, whether they occur between participants or originate in response to environmental factors.

Recently, there has been growing interest towards automatic, large-scale understanding of spontaneous human interactions using computers. Such approaches have been utilized at multiple levels: data acquisition (minimizing distraction to the child during autism diagnosis interviews [14], designing a non-obtrusive yet privacy-preserving audio activity detector [49]), signal enhancement (filtering out MRI noise from speech [209]), target participant identification (speaker diarization from audio [5], audio-visual speaker localization [27, 170]), and mapping signals to latent behavior [20, 204]. However, a majority of these approaches have rarely accounted for the above interpersonal feedback systems during algorithm design. In this thesis, I develop technological approaches towards understanding spontaneous human interactions that incorporate information from participant/environment feedback (referred to as "context").
Since these context mechanisms influence participant behavior and interaction progress, I expect benefits in task performance over techniques that do not incorporate context. I extend the scope of context-driven feedback mechanisms beyond the interlocutors to account for environmental/background factors, such as acoustic channel conditions. I focus my methods on a specific domain: dyadic (two-participant) interactions where one of the participants is a child.

1.2 Child-Adult Dyadic Conversations

Computational analyses of spoken interactions with children have been relatively few when compared to conversations between adults. Previous work on analyzing spontaneous child speech has been limited to conversational agents such as animated characters [12, 143, 80] or agents in the role of instructor [154, 92, 182], data collected from search engines [116], or computer-based reading assessment/training applications [13, 139, 38]. In the above applications, the agent's behavior (voice commands) is often restricted to a discrete set of modes specific to the application goal. The constrained nature of child speech in such applications (single/few words, restricted vocabulary) meant that previous approaches for child speech understanding focused only on developmental aspects of child speech to improve task performance. On the other hand, spontaneous interactions with an adult have only recently been explored. Unlike with a computer agent, these interactions are often goal-oriented, and the adult interlocutor (a trained professional) continuously modifies their behavior to align with, or control, the direction of conversation. In this thesis, I develop context-aware signal processing methodologies at each stage of child speech understanding during such interactions. Specifically, I consider child-adult interactions from two domains: autism spectrum disorder (ASD) and child forensic interviews (CFI); computational analyses of both domains have seen growing interest in recent times ([21, 104, 105]). In both cases, the child's behavior, specifically speech and language, is influenced by two broad factors: the underlying mental condition/incident trauma and the interaction context (Figure 1.1).

Figure 1.1: Interplay between observed signals and latent mental states. In this dissertation, I develop methods for incorporating contextual information for developing speech processing applications, and validate using respective task performance and statistical models (dashed arrows).

Previous works on analyzing these interactions can be broadly categorized into two groups (Figure 1.1, top-left and top-right). The first group develops core speech and language processing algorithms to map raw signals (audio) to lexical and speaker rubrics such as transcripts and speaker labels ([103, 221, 53]). The latter are in turn used to compute behavioral descriptors for each participant. The term "behavioral descriptor" here refers to any information that is extracted from audio using speech transcripts and speaker labels, and is associated with either participant or with the interaction itself. This is not to be confused with "features", which refers to frame-level transformations of audio that assist machine learning models, for instance Mel Frequency Cepstral Coefficients (MFCCs). Behavioral descriptors from audio can be divided into lexical (e.g., affective normatives, part-of-speech tags), prosodic (e.g., pitch, jitter, shimmer) and turn-taking/conversational (e.g., utterance length, latency, speaking fraction) descriptors.
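To make the turn-taking category concrete, the following is a minimal sketch of how such descriptors might be computed from time-stamped, speaker-labeled utterances; the segment format and the exact descriptor definitions are illustrative assumptions rather than the definitions used later in this thesis.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str   # "child" or "adult"
    start: float   # seconds
    end: float     # seconds

def turn_taking_descriptors(utts):
    """Compute simple session-level conversational descriptors from
    time-ordered, speaker-labeled utterances."""
    utts = sorted(utts, key=lambda u: u.start)
    durations = {"child": 0.0, "adult": 0.0}
    lengths = []    # per-utterance lengths in seconds
    latencies = []  # gap between a speaker change and the response
    for i, u in enumerate(utts):
        durations[u.speaker] += u.end - u.start
        lengths.append(u.end - u.start)
        if i > 0 and u.speaker != utts[i - 1].speaker:
            latencies.append(max(0.0, u.start - utts[i - 1].end))
    total = sum(durations.values()) or 1.0
    return {
        "mean_utterance_length": sum(lengths) / len(lengths),
        "mean_latency": sum(latencies) / len(latencies) if latencies else 0.0,
        "child_speaking_fraction": durations["child"] / total,
    }

# Example: two short exchanges
session = [Utterance("adult", 0.0, 2.5), Utterance("child", 3.1, 4.0),
           Utterance("adult", 4.4, 6.0), Utterance("child", 6.8, 7.2)]
print(turn_taking_descriptors(session))
```

In practice such descriptors are only as reliable as the speaker labels and transcripts they are computed from, which motivates the discussion that follows.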
Often, descriptor extraction involves multiple speech processing modules that are connected in a sequential manner, for instance voice activity detection, speaker diarization, speaker role recognition and automatic speech recognition [221]. The second group of works has analyzed the relationship between these behavioral descriptors and the underlying latent state of the child using statistical models, primarily correlational and causality studies [20, 17, 63], synchrony and coordination studies [22, 91, 78] and machine learning for prediction/regression of latent states [20]. While both groups of works have provided useful insights, they have often been carried out independently of each other, i.e., behavioral descriptors have been assumed available for statistical models, while the robustness of the behavioral descriptors used in statistical models has not been studied.

A handful of approaches in recent times have studied the association between contextual factors and the latent state of the child ([105, 21, 79]). For instance, during autism diagnosis sessions, [105] quantified affective dimensions of the language using psycholinguistic norms [125]. Multiple dimensions (affect, valence and gender-ladenness) in the clinician's language were found to be significantly correlated with the child's symptom severity. Further, [21] found that the clinician's speaking time increased, and variance in volume level and pitch dynamics increased, when interacting with children with higher autism severity. Similar descriptors were found to predict the severity with better accuracy than the child's own descriptors. These studies suggest that the interlocutor modifies their behavior subconsciously (since they are not aware of the severity scale during the session) to adjust for atypicalities in the child's speech and language, thus suggesting the influence that the latent state has on the interlocutor.

Building on the results from the above studies, along with the knowledge that the child's speech is affected by the latent state, I propose to incorporate contextual factors while developing speech and language processing methods for better child spoken language understanding. Context-driven methods are particularly relevant in these domains, since the latent state of the child results in additional complexity for automatic child speech understanding over existing developmental factors such as a short vocal tract [140], mispronunciations [156], etc. From a computational analysis point of view, the signal source from the child is often severely degraded in quality, or simply does not contain the target information; in such cases understanding the effect of contextual information is often useful. I categorize context into (1) background context and (2) interlocutor context. The former includes the environment, which can be assumed common to both participants and remains constant throughout an analysis session. While background context may not necessarily influence child speech production, it manifests in channel conditions which affect the acquired speech signal. Interlocutor context, on the other hand, is a time-varying behavior signal from the adult that influences the child's response and/or is influenced by the underlying latent state of the child [105, 79]. I validate the proposed methods in two ways. First, I show improvements using respective task performance measures such as word error rate (WER) for speech recognition, F1-score for child-adult classification from speech, etc.
Next, I extract behavioral descriptors using these methods to study robustness to errors in each speech processing task. I analyze robustness with respect to descriptors extracted using oracle speaker labels and transcripts, as well as latent state inference derived using statistical models.

1.3 Background

1.3.1 Child Speech Processing by Computers

Child speech is inherently different from adult speech and more complex from the viewpoint of computational processing. A number of child speech properties have been studied, namely high acoustic variability within and across age groups due to the growing vocal tract in children [60, 213, 140], an under-developed vocabulary leading to pronunciation and grammatical errors [82, 156], and overall high temporal and spectral speech variability [112] across children of different age groups and gender. Child speech and language processing approaches in the last decade have developed various techniques to address specific idiosyncrasies of child speech. I review the literature on two important tasks addressed in this thesis: child automatic speech recognition (ASR) and child-adult classification from speech, which can be viewed as a variant of the speaker diarization task [5].

1.3.2 Automatic Speech Recognition

The relative difficulty in obtaining large annotated child speech corpora (more than a few tens of hours) when compared to adult speech corpora led to early studies focusing on adapting acoustic models trained on adult speech to the variabilities associated with child speech [26, 35, 42, 143]. Later, [188, 187, 46] trained acoustic models using only child speech corpora or augmented with adult speech. They observed that mixed data augmentation provided better performance than using only either child or adult speech. In order to capture pronunciation differences between child and adult speech, and variabilities within child speech, [187] employed phone confusion matrices to generate pronunciation alternatives during the decoding stage. Since formant frequency locations shift across the child's linguistic development, vocal tract length normalization (VTLN) [197, 183, 184, 46] was used as a speaker normalization technique. VTLN compensates for different vocal tract lengths of speakers by warping the frequency axis using a speaker-specific normalization factor. Previous works also explored acoustic model adaptations such as maximum likelihood linear regression (MLLR) [73, 155] and language model adaptations using linear interpolation [73, 104], resulting in modest improvements in word error rate (WER). In a recent approach, [188] explored adaptation in the feature space using a hybrid DNN-HMM acoustic model. Specific layers of a DNN trained on adult speech were re-trained to account for idiosyncrasies of child speech: acoustic variability (layers close to the input) and mispronunciations (layers close to the output). In practice, a combination of the above methods can be employed to achieve the best performance.

1.3.3 Child-Adult Classification from Speech

Interest in child-adult classification from speech during spontaneous conversations has increased recently. The task can be viewed as a variant of speaker diarization combined with role recognition on the diarized outputs. Diarization solutions for child speech (both child-directed and adult-directed) initially looked at traditional feature representations (MFCCs, PLPs) [141] and speaker segmentation/clustering methods (generalized likelihood ratio, Bayesian information criterion) [233, 199].
In [233], the authors introduced several methods for working with audio collected from children with autism using a wearable device. More recently, approaches based on fixed-dimensional embeddings such as i-vectors [40] and DNN speaker embeddings such as x-vectors [223] were explored. The authors observed that splitting the adult speech into gender-specific portions while training the PLDA returned improvements in diarization performance. A majority of these works have looked at day-long recordings collected from a wearable microphone, while some of them deal with dyadic conversations collected in the clinical setting [199]. In this dissertation, I focus on the latter domain.

1.3.4 Exploiting Context in Human Computer Interactions

Utilizing context from one interlocutor to study the other's behavioral state has been explored in the past, especially in paralinguistic analyses. [217] and [216] showed that the presence of a specific prosodic descriptor (a low pitch value late in the utterance) was an important predictor of verbal backchannels (um, oh, etc.) in two separate corpora of telephone conversations involving English and Japanese speakers. Building upon [217], [137] performed head gesture recognition (e.g., head nods) using latent dynamic conditional random fields trained on multimodal descriptors from the interlocutor, namely timing (pause information, utterance length), eye gaze, prosodic (pitch slope, continuing intonation, rapid energy changes in speech) and lexical information (unigrams). Later, [111] used a dynamic Bayesian network to model emotional states (both categorical and continuous) using acoustic-prosodic descriptors from current and past utterances of the interlocutor. Follow-up works on the IEMOCAP corpus [133, 134] proposed a more generic framework for modeling emotion dynamics using motion-capture (MOCAP) and prosodic descriptors. The authors study the evolution of dynamics both within a turn (current utterances of two speakers) and across turns (current and past utterances of both speakers). In the latter case, they utilize recurrent neural network based architectures to model an arbitrarily long context length. Multi-modal cues from the interlocutor were also shown to predict the participants' body language [227], with the strength of coordination depending on the nature of the interaction (friendly or conflicting). While previous methods have studied the relation between context and participants' behavior from an analysis point of view, I incorporate them for improving spoken language technologies.

1.4 Thesis Overview

This thesis is organized into three parts. In part one, I explore methods to incorporate background context for improving child-adult classification from speech. I use two approaches: meta-learning for reducing within-speaker variability and entropy-filtering for identifying robust enrolment samples. Both methods adopt fundamentally different approaches to solve the task: while meta-learning learns channel-invariant representations, entropy-filtering focuses on adapting the model within each channel/session. Meta-learning and entropy-filtering are described in Chapters 2 and 4 respectively. In Chapter 3, I propose an extension of the algorithms developed in Chapter 2: I use meta-learning to reconfigure the classification objective for speaker embedding training into an ensemble of smaller, but related, training tasks.

In part two, I focus on participants' context, which is time-varying within each session, unlike background context.
I work on two different tasks in this part: child automatic speech recognition and child-adult classification using a speaker diarization framework. In the first task (Chapter 5), I develop methods for lexical and semantic information transfer from the interlocutor's language for adapting language models for child ASR. In the second task (Chapter 6), I first analyze the associations between errors produced by a state-of-the-art diarization system and the participants' contextual descriptors. Next, I train a neural network that learns the above associations towards improving diarization performance.

In part three, I validate the context-aware approaches developed in parts one and two using the end application, i.e., behavioral state inference. As a first step towards an end-to-end speech and language pipeline (audio to latent state), I use a pilot corpus to analyze the robustness of descriptors produced by such a pipeline. I estimate robustness of pipeline descriptors with respect to oracle descriptors as well as latent state inference, and identify subsets of descriptors that can drive future large-scale analyses. In the second task, I study the relation between pipeline descriptors and children's demographic and clinical severity conditions in a large-sample study from the Autism Spectrum Disorder (ASD) domain. I establish the validity of pipeline descriptors by studying their association with the child's biological age. Following this, I explore group differences using pipeline descriptors between children higher and lower on the autism spectrum. Finally, I use pipeline descriptors to understand the effect of intervention treatment. In the third task, I explore the conditions under which children disclose incident details following a minor transgression, specifically a toy break. These curated sessions are an effort towards understanding incident disclosure and minimizing recall trauma among children during forensic interviews.

Part II: Background Context for Child-Adult Classification

Chapter 2: Meta-Learning for Variability

Child-adult interactions have been used in the ASD domain primarily for diagnosis (Autism Diagnostic Observation Schedule: ADOS, [119]) and for measuring intervention response (Brief Observation of Social Communication Change: BOSCC, [77]). Automated computational processing of the participants' audio [17] and language streams [105] has provided objective descriptions that characterize the session progress and help understand the association with symptom severity. However, behavioral descriptor extraction in the above studies has necessitated manual annotation for speaker labels, which can be expensive and time-consuming to obtain, especially for large corpora. Automatic speaker label extraction involves a combination of speech activity detection (speech/non-speech classification) and speaker classification (categorization of speech regions into child and adult). In this work, I assume that oracle speech activity detection is available and focus on building a robust child-adult classification model. Training a conventional child-adult classifier from speech faces at least two major issues in addition to background/channel variability:

- large within-class variability, especially for the child class, arising from age, gender and autism severity
- lack of sufficient amounts of balanced training data

I propose to address the above issues using meta-learning, also known as learning-to-learn [51].
Meta-learning consists of two optimizations: the conventional learner, which learns within a task, and a meta-learner, which learns across tasks. This is in contrast to conventional supervised learning, which operates within a single task for training and testing, and learns across samples. Meta-learning is inspired by the human learning process of rapid generalization to new tasks; for instance, children who have never seen a new animal before can learn to identify it using only a handful of images. As a consequence, meta-learning has demonstrated success in low-resource applications [191, 51] in computer vision in recent years.

Figure 2.1: Illustration of meta-learning. Typically, the target task contains classes unseen in any of the training tasks.

In this chapter, I consider a corpus of multiple child-adult sessions. Within the training set, I model each session as a separate task. Hence, each task consists of two classes: child and adult from the particular session. During training, classes are not shared across tasks, i.e., the child in one session is a separate class from the child in another session. By optimizing the network to discriminate between child-adult speaker pairs across all training tasks (sessions), I mitigate the influence of within-class variabilities. Further, the need for large amounts of training data is removed by randomly sampling training and testing subsets (referred to as supports and queries respectively in meta-learning [191]) within each batch. I evaluate my proposed method under two settings: 1) weak supervision, where a handful of labeled examples are available from the test session, and 2) no supervision, i.e., clustering. The latter is similar to conventional speaker clustering in diarization systems. I show that the learnt representations outperform baselines in both settings.

The rest of the chapter is organized as follows: training and evaluation corpora are described in Section 2.1; Section 2.2 describes segment-level feature representations (x-vectors) and baseline methods for the classification and clustering tasks; Section 2.3 outlines the episodic training process for prototypical networks. Experimental results and discussions, including qualitative analysis, are provided in Section 2.4, followed by conclusions in Section 2.5.

2.1 Datasets

I select two types of child-adult interactions from the ASD domain: the gold-standard ADOS ([119]), which is used for diagnostic purposes, and a recently proposed treatment outcome measure, the BOSCC ([77]), for verbal children who fluently use complex sentences. Module 3 of the ADOS is designed for verbally fluent children, typically lasts between 45 and 60 minutes, and includes over 10 semi-structured tasks. The ADOS produces a diagnostic algorithm score which can be used to classify children into ASD vs. non-ASD groups. On the other hand, the BOSCC is a treatment outcome measure used to track changes in social-communication skills over the course of treatment in individuals with ASD, and is applicable in different collection settings (clinics, homes, research labs). A BOSCC session typically lasts 12 minutes and consists of 4 segments (two 4-minute play segments with toys and two 2-minute conversation segments). I use a combination of ADOS (n=3) and BOSCC (n=24) sessions which were administered by clinicians and manually labeled by trained annotators for speaking times and transcripts. This corpus is referred to as ASD.
The sessions in ASD cover sources of variability in child age, collection centers (4) and amount of available speech per child (Table 2.1).

Table 2.1: Statistics of child-adult corpora used in this work.

Corpus        Duration (min)   Child Age (yrs)   # Utts (Child / Adult)
              (mean ± std.)    (mean ± std.)
ASD           17.76 ± 11.99    9.02 ± 3.10       11045 / 20313
ASD-Infants   10.35 ± 0.51     1.87 ± 0.78       1371 / 4120

To check generalization performance, I train the models on ASD and evaluate on a different child-adult corpus within the autism diagnosis and intervention domain. The ASD-Infants corpus (Table 2.1) consists of BOSCC (n=12) sessions with minimally verbal toddlers and preschoolers with limited language (nonverbal, single words or phrase speech). As opposed to ASD, these sessions are administered by a caregiver, and represent a more naturalistic data collection setup aimed at early behavioral assessments with a familiar adult. The age difference between children in the two corpora provides a significant domain mismatch.

2.2 Baseline Methods

2.2.1 X-vectors as Speaker Representations

Until recently, research involving speaker representations for speaker recognition, speaker verification and speaker diarization applications used total variability modeling (TVM) approaches to extract fixed-length representations (namely, speaker embeddings) from frame-level features such as MFCCs. TVM [45] is a generative framework that learns a transformation from high-dimensional statistics of audio feature representations into a low-dimensional space (i-vectors; [43]) in an unsupervised manner (i.e., with no speaker labels). I-vectors are expected to capture the dominant variability modes in the data. After extraction, i-vectors are typically trained for a supervised, task-specific objective [110, 58, 174]. However, advances in deep neural networks have resulted in speaker embeddings extracted from an intermediate layer of a network typically trained with a speaker ID loss ([207, 178, 225, 166]). Among them, x-vectors have emerged as the state of the art in most speaker modeling approaches.

I use x-vectors as a baseline representation as well as input features to the prototypical network. I use a pre-trained x-vector model (https://kaldi-asr.org/models/m6) which is trained on the extended Switchboard (https://catalog.ldc.upenn.edu/LDC2004S07) and CALLHOME corpora from the NIST SRE 2000 dataset (https://catalog.ldc.upenn.edu/LDC2001S97). The extractor consists of time-delay layers on frame-level MFCC features followed by a statistics pooling layer, from which 128-dimensional x-vectors are extracted. Complete details of the architecture are provided in [178].

2.2.2 Siamese Networks

While x-vectors have demonstrated state-of-the-art performance, they are extracted using a cross-entropy loss function which, unlike meta-learning, is not a metric-learning training objective. To enable a fair comparison with the proposed embeddings, which are optimized on top of x-vectors using a metric-learning objective, I implement Siamese networks [100] with x-vector inputs to extract speaker embeddings. Siamese networks learn a metric space that maximizes pairwise similarity between same-class pairs and minimizes similarity between different-class pairs. Specifically, I implement the variant used in speaker diarization [59], where the training label for each input pair represents the probability of belonging to the same speaker. The network jointly learns both the embedding space and the distance metric for computing similarity.
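One way such a pair-based setup can be realized is sketched below (PyTorch); the layer sizes, the similarity head and the pair-sampling scheme are illustrative assumptions, not the exact configuration of [59].

```python
import random
import torch
import torch.nn as nn

class SiameseXvec(nn.Module):
    """Embed two x-vectors and score the probability that they share a speaker.
    Layer sizes and the similarity head are illustrative assumptions."""
    def __init__(self, dim=128, emb=32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, emb))
        self.score = nn.Sequential(nn.Linear(emb, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x1, x2):
        e1, e2 = self.embed(x1), self.embed(x2)
        # learned similarity on the element-wise difference of the embeddings
        return torch.sigmoid(self.score(torch.abs(e1 - e2))).squeeze(-1)

def sample_pair(child_xvecs, adult_xvecs):
    """Return (x1, x2, label): 1.0 for same-speaker pairs, 0.0 for child-adult pairs."""
    if random.random() < 0.5:
        pool = random.choice([child_xvecs, adult_xvecs])
        return random.choice(pool), random.choice(pool), 1.0
    return random.choice(child_xvecs), random.choice(adult_xvecs), 0.0

# One illustrative training step with random tensors standing in for x-vectors
model, bce = SiameseXvec(), nn.BCELoss()
child = [torch.randn(128) for _ in range(10)]
adult = [torch.randn(128) for _ in range(10)]
x1, x2, y = sample_pair(child, adult)
loss = bce(model(x1.unsqueeze(0), x2.unsqueeze(0)), torch.tensor([y]))
loss.backward()
```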
In my work, I randomly select same-speaker (child-child, adult-adult) and different-speaker (child-adult) x-vector pairs as input to the model. Fig. 2.2 illustrates the differences between Siamese networks and protonets during training.

Figure 2.2: Training in protonets (left) vs siamese networks (right) in the embedding space. Colored backgrounds represent class decision regions. Distances from the query sample (non-filled) to prototypes from each class (filled with black) are used to estimate the training loss using Equations (2.2) and (2.3). Siamese networks are trained to maximize similarity between same-speaker pairs (dashed line) and minimize similarity between different-speaker pairs (solid line). Illustration adopted from [191, 214].

2.3 Prototypical Networks for Few-Shot Learning

Meta-learning methods were introduced to address the problem of few-shot learning [51], where only a handful of labeled examples are available in new tasks not seen by the trained model. Deep metric-learning methods were developed within the meta-learning paradigm to specifically address generalization in the few-shot scenario. I choose prototypical networks (protonets) [191], which presume a simpler learning bias than other metric-learning methods and have demonstrated state-of-the-art performance in image classification [51] and natural language processing tasks such as sequence classification [229]. Protonets learn a non-linear transformation into an embedding space where each class is represented using a single sample, specifically the centroid of examples from that class. During inference, a test sample is assigned to the nearest centroid.

My application of protonets to speaker classification is motivated by the fact that participants in a test session represent unseen classes, i.e., speakers in an audio recording to be diarized are typically assumed unknown. However, the target roles, namely child and adult, are shared across train and test sessions. Hence, by treating child-adult speaker classification in each training session as an independent task, I hypothesize that protonets learn the common discriminating characteristics between child and adult classes irrespective of local variabilities which might influence the task performance.

As a metric-learning method, protonets share similarities with triplet networks [23] and siamese networks [32] for learning speaker embeddings. Other than a recently proposed work which used the protonet loss function for speaker identification and verification [214], to the best of my knowledge this work is one of the early applications of protonets for speaker clustering. In the following, I illustrate the protonet training process using a single batch, then extend it to multiple training sessions.

2.3.1 Batch training

Consider a set of labeled training examples from C classes, (X_tr, Y_tr) = {(x_1, y_1), ..., (x_N, y_N)}, where each sample x_i is a vector in D-dimensional space and y_i ∈ {1, 2, ..., C}. Protonets learn a non-linear mapping f_\theta : \mathbb{R}^D \to \mathbb{R}^M, where the prototype of each class is computed as follows:

p_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i)    (2.1)

S_c represents the set of training samples belonging to class c. For every test sample (x, y), the posterior probability given class c is as follows:

p(y = c \mid x) = \frac{\exp(-d_\varphi(f_\theta(x), p_c))}{\sum_{c' \in \{1,\ldots,C\}} \exp(-d_\varphi(f_\theta(x), p_{c'}))}    (2.2)

where d_\varphi denotes a distance metric.
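For concreteness, a minimal sketch (PyTorch, with squared Euclidean distance assumed) of the prototype and posterior computations in Equations (2.1) and (2.2); the tensor layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def prototypes(support_emb, support_labels, num_classes):
    """Eq. (2.1): class prototypes as centroids of support embeddings.
    support_emb: [N, M] mapped embeddings, support_labels: [N] in {0..C-1}."""
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(num_classes)])          # [C, M]

def log_posteriors(query_emb, protos):
    """Eq. (2.2): softmax over negative squared Euclidean distances.
    query_emb: [Q, M], protos: [C, M] -> log p(y=c|x) of shape [Q, C]."""
    dists = torch.cdist(query_emb, protos, p=2) ** 2            # [Q, C]
    return F.log_softmax(-dists, dim=1)

# The negative log-likelihood of the true classes gives the loss in Eq. (2.3):
# loss = F.nll_loss(log_posteriors(f_theta(queries), protos), query_labels)
```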
While the choice of d_\varphi can be arbitrary, it was shown in [191] that using squared Euclidean distance is equivalent to modeling the supports using Gaussian mixture density functions, and that it empirically performed better than other functions. Thus, squared Euclidean distance is used in this work. Learning proceeds by minimizing the negative log probability for the true class using gradient descent:

    L(y, x) = -\sum_{c=1}^{C} y_c \log(p_\theta(y = c \mid x))    (2.3)

Pseudo-code for training a batch is provided in Algorithm 1.

Algorithm 1: Single batch of protonet training. U(S, N) denotes uniform sampling of N elements from S without replacement.
Input: D = \bigcup_{c=1}^{C} D_c, where D_c = {(x_i, y_i) : y_i = c}
Output: L (batch training loss)
1: for c in {1, ..., C} do
2:     S_c = U(D_c, k)                                  (supports)
3:     Q_c = U(D_c \ S_c, B - k)                        (queries)
4:     \mathbf{p}_c = \frac{1}{k} \sum_{(x_i, y_i) \in S_c} f_\theta(x_i)    (prototypes)
5: L <- 0
6: for c in {1, ..., C} do
7:     for (x, y) in Q_c do
8:         L <- L + \frac{1}{C(B-k)} [-\log(p_\theta(y = c \mid x))]

2.3.2 Extension to multiple sessions

Consider S sessions in the training corpus, with N_{c,s} samples belonging to class c in session s. I iterate through each session s, and randomly sample k examples each from child and adult without replacement. These samples (supports) are used to construct the prototypes using Equation (2.1). From the remaining N_{c,s} - k samples, B - k samples are chosen without replacement from each class, where B denotes the training batch size. These samples (queries) are used to update the weights in a single back-propagation step according to Equation (2.3). Although a significant fraction of samples is not seen during a single epoch (1 epoch = S batches), random sampling of supports and queries over multiple epochs improves the generalizability of protonets.

2.4 Experiments

I use 128-dimensional x-vectors as pre-trained audio embeddings for all experiments in this chapter. X-vectors are input to a feed-forward neural network with 3 hidden layers (128, 64 and 32 units per layer). Embeddings from the third hidden layer (32-dimensional) are treated as speaker representations. Rectified linear unit (ReLU) non-linearity is used between the layers. Batch normalization and dropout (p = 0.2) are used for regularization. The Adam optimizer (lr = 3e-4, beta_1 = 0.9, beta_2 = 0.999) is used for weight updates. A batch size of 128 samples is employed.

Since the ASD corpus contains only 27 sessions, I use nine-fold cross-validation to estimate test performance. At each fold, 18 sessions are used for model training. The best model is chosen using the validation loss computed with 6 sessions. The remaining 3 sessions are treated as evaluation data. No two folds share data from the same speaker. A summary of the experiments is presented in Figure 2.3.

2.4.1 Weakly Supervised Classification

I evaluate my models in a few-shot setting similar to the original formulation of protonets [191], which is equivalent to having sparsely labeled segments from the test session. In practice, such labels can be made available from the session through random selection or active learning [185]. I train a baseline model using the architecture described in the previous paragraph and a softmax layer to minimize the cross-entropy loss between the child and adult classes. This model is directly used to estimate class posteriors on the testing data. I refer to this model as Base. I use a second baseline where the labeled samples from test sessions in each fold are made available during the training process, i.e., updating the network weights using back-propagation (Base-backprop).
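The following is a minimal sketch of the episodic training step in Algorithm 1, assuming pre-extracted 128-dimensional x-vector inputs and the 128-64-32 embedding network described above; tensor shapes and the toy episode at the end are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Embedding network over 128-dim x-vectors; the 32-dim output is the prototype space.
    embed = nn.Sequential(
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 32),
    )
    optimizer = torch.optim.Adam(embed.parameters(), lr=3e-4)

    def protonet_step(support, query):
        """One batch of Algorithm 1.
        support: (C, k, 128) x-vectors used to build prototypes (Eq. 2.1)
        query:   (C, q, 128) x-vectors used to compute the loss (Eqs. 2.2-2.3)"""
        C, k, _ = support.shape
        prototypes = embed(support).mean(dim=1)            # (C, 32), Eq. (2.1)
        q_emb = embed(query.reshape(-1, 128))              # (C*q, 32)
        dists = torch.cdist(q_emb, prototypes) ** 2        # squared Euclidean distances
        targets = torch.arange(C).repeat_interleave(query.shape[1])
        loss = F.cross_entropy(-dists, targets)            # softmax over -distances, Eqs. (2.2)-(2.3)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Toy episode: 2 classes (child, adult), 5 supports and 20 queries per class
    support = torch.randn(2, 5, 128)
    query = torch.randn(2, 20, 128)
    protonet_step(support, query)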
For protonets, I train two variants, P20 and P30, with 20 and 30 supports per class during training. A larger number of supports translates into more samples for reliable prototype computation; however, it results in fewer queries for back-propagation. During evaluation, 5 samples from each class in the test session are randomly chosen as training data. These samples are used to compute prototypes for child and adult, followed by minimum-distance based assignment for the remaining samples in that session. In order to estimate a robust performance measure for Base-backprop, P20 and P30, I repeat each evaluation 200 times by selecting a different set of 5 samples and compute the mean macro (unweighted) F1-score over the corpus.

Figure 2.3: Illustrating the classification and clustering methods used in this chapter

Results

Weakly-supervised classification results are presented in Table 2.2. In general, both variants of protonet outperform the baselines significantly in their respective corpora (ASD: p < 0.05, ASD-Infants: p < 0.01). However, all models degrade in performance on the ASD-Infants corpus as compared to ASD. As mentioned before, the data from younger children presents a large domain mismatch between training and evaluation data, and I suspect this is the primary reason for lower performance. Surprisingly, in ASD, updating network weights using samples from the test session (Base-backprop) reduces classification performance. I suspect that the network overfits on the labeled samples. However, in the case of ASD-Infants, the labeled samples from the test session provide useful information about the speakers, resulting in a modest improvement over a weaker Base. While protonets provide the best F1-scores in both corpora, the performance on ASD-Infants leaves room for improvement. I do not observe any significant difference between P20 and P30, suggesting that the performance is robust to the number of supports and queries during training.

Table 2.2: Child-adult classification results using macro-F1 (%)
Method          ASD      ASD-Infants
Base            82.67    53.67
Base-backprop   78.64    56.29
P20             86.66    61.30
P30             86.10    61.47

2.4.2 Speaker Clustering

Clustering PLDA scores (trained with supervision) using AHC is an integral part of recent diarization systems [179, 180]. This method forms the first baseline. I note that the training data for the PLDA transformation represents a significant domain mismatch with the corpora.
As opposed to classication, clustering performance does not degrade in ASD-Infants, suggesting that discriminative information between child and adult speakers within a session is preserved in all the embeddings compared in Table 2.3. Siamese networks present a modest improvement over x-vectors, upto 5.26% relative improvement for spectral clustering in ASD. However, protonets provide the best perfor- mance in both the corpora. In particular, P20 results in slightly higher purity scores than P30 across clustering methods and corpora. Hence, a larger number of queries within a batch appears benecial for speaker clustering in this work. I also note that the best 27 clustering performance (P20) is better in the out-of-domain corpus. I believe that the younger ages of children in ASD-Infants over ASD might benet the clustering process. Table 2.3: Speaker clustering results using purity (%) Method ASD ASD-Infants K-Means SC K-Means SC x-vectors 77.05 75.22 77.98 75.97 siamese 78.22 79.18 78.30 76.86 P20 81.39 80.70 85.51 85.55 P30 79.80 80.24 83.57 83.26 2.4.3 Qualitative Analysis using TSNE I provide a qualitative analysis using TSNE in Figure 2.4. I collect embeddings from both child and adult from a single-fold (3 sessions) in ASD and provide the TSNE visu- alizations for protonet embeddings and x-vectors. Embeddings from child and adult class are represented using 3 shades of red and blue respectively, one shade for each session. Although x-vectors cluster compactly within each speaker in a session, embeddings across sessions from the same class are spread apart. Protonets are able to cluster within classes compactly, while preserving the discriminative information between classes. In particular, embeddings belonging to child (which are expected to cover more sources of variability) are as compact as embeddings from adult. This suggests that protonets are able to learn across within-class variabilities for child-adult classication from speech. 2.5 Conclusions In this chapter, I modeled child-adult interactions from multiple sessions as dierent, related tasks in order to perform child-adult classication on an unseen test session. Protonets were shown to learn a session-invariant representation for the speakers while 28 Figure 2.4: TSNE visualizations for protonet embeddings (left) and x-vectors (right) for 3 test sessions in ASD corpora preserving the between-class discriminability and reducing within-class variability. In the future work, I are interested in combining meta-learning with domain-adversarial tech- niques using deep neural networks [56] since the cross-corpora classication performance on ASD-Infants provides room for improvement. In the next chapter, I develop generic speaker representations using meta-learning and evaluate using speaker diarization and speaker verication applications. 29 Chapter 3 Designing Neural Speaker Embeddings with Meta-Learning In the previous chapter, I proposed the use of prototypical networks for addressing intra- class speaker variations among child class during child/adult classication. Both system performance and qualitative visualizations illustrate the applicability of meta-learning as a useful paradigm for learning robust speaker embeddings. However, reducing eects of nuisance factors (i.e., not related to speaker characteristics) in speaker embeddings is a common research problem for other domains during speaker diarization. 
Additionally, approaches in other applications such as speaker verification have also sought to train embeddings robust to linguistic content and background speakers. Finally, the work in Chapter 2 focused on learning embeddings using pretrained x-vectors as model input. Hence, any learnable information is constrained by the information already encoded in x-vectors.

In this chapter, I address the above limitations and explore an extension to meta-learning for child/adult speaker classification. I employ meta-learning at a more fundamental level of learning speaker embeddings, namely, using frame-level features as input representations. Specifically, the contributions of this chapter are as follows:

- I develop new speaker embeddings using meta-learning that are not restricted to an application. Within each application, I demonstrate improvements using multiple corpora obtained under controlled as well as naturalistic speech interaction settings.
- I identify conditions where meta-learning demonstrates benefits over the conventional cross-entropy paradigm. I analyze diarization performance across different domains in the DIHARD corpora. Further, I consider the special case of the impact of child age groups using internal child-adult interaction corpora from the autism domain. I study the effect of data collection setups (near-field, far-field and obstructed microphones) and the level of degradation artifacts on speaker verification performance.
- While I present results using two variants of meta-learning, prototypical networks and relation networks, the proposed framework is independent of the specific metric-learning approach and hence offers scope for incorporating non-classification objectives such as clustering. Such objectives can be used in conjunction with or in place of the metric-learning objectives used in this work.
- I present an open-source implementation of my work (https://github.com/manojpamk/pytorch_xvectors), including x-vector baselines, based on a generic machine learning toolkit (PyTorch).

3.1 Background

Audio speaker embeddings refer to fixed-dimensional vector representations extracted from variable-duration audio utterances and assumed to contain information relevant to speaker characteristics. In recent times, speaker embeddings have emerged as the most common representations used for speaker-identity relevant tasks such as speaker diarization (speaker segmentation followed by clustering: who spoke when?) [4] and speaker verification (does an utterance pair belong to the same speaker?) [28]. Such applications are relevant across a variety of domains such as voice biometrics [161, 175], automated meeting analysis [6, 205], and clinical interaction analysis [146, 220]. Recent technology evaluation challenges [172, 167, 81, 130] have drawn attention to these domains by incorporating natural and simulated in-the-wild speech corpora exemplifying the many diverse technical facets that need to be addressed.

While initial efforts toward training speaker embeddings had focused on generative modeling [165, 29] and factor analysis [44], deep neural network (DNN) representations extracted at bottleneck layers have become the standard choice in recent works. The most widely used representations are trained using a classification loss (d-vectors [208], x-vectors [193, 194]), while other training objectives such as triplet loss [24, 231] and contrastive loss [34] have also been explored.
More recently, end-to-end training strategies [54, 87, 55] have been proposed for speaker diarization to address the mismatch between the training objective (classification) and the test setup (clustering, speaker selection, etc.).

A common factor in the classification formulation is that all the speakers from the training corpora are used throughout the training process for the purpose of loss computation and minimization. Typically, categorical cross-entropy is used as the loss function. While the number of speakers (classes) can often be large in practice (O(10^3)), the classification objective represents a single task, i.e., the same speaker set is used to minimize cross-entropy at every training minibatch. This entails limited task diversity during the training process and offers scope for training better speaker-discriminative embeddings by introducing more tasks. I note that a few approaches do exist which introduce multiple objectives for embedding training, such as metric-learning with cross-entropy [224, 164] and speaker classification with domain adversarial learning [232, 215]. While these approaches demonstrate improvements over a single training objective, the speaker set is often common across objectives (except in domain adversarial training, where target speaker labels are assumed unavailable).

In this work I use the classification framework while training neural speaker embeddings; however, I decompose the original classification task into multiple tasks wherein each training step optimizes a new task. A common encoder is learnt over this ensemble of tasks and used for extracting speaker embeddings during inference. At each step of speaker embedding training, I construct a new task by sampling speakers from the training corpus. For the large training speaker set available in typical training corpora, generating speaker subsets results in a large number of tasks. This provides a natural regularization to prevent task over-fitting. Our approach is inspired by the meta-learning [176] paradigm, also known as learning to learn. Meta-learning optimizes at two levels: within each task and across a distribution of tasks [163]. This is in contrast to conventional supervised learning, which optimizes a single task over a distribution of samples. In addition to benefits from increased task variability, meta-learning has demonstrated success on unseen classes [163, 52, 3]. This forms a natural fit for applications such as speaker diarization and speaker verification, which often evaluate on speakers unseen during embedding training.

3.2 Methods

In this section, I introduce the meta-learning setup for neural embedding training, followed by a description of the two metric-learning approaches adopted in this work: prototypical networks and relation networks. Following that, I outline their use in my tasks, speaker diarization and speaker verification, including a description of the choice of clustering algorithm.

Consider a training corpus where C denotes the set of unique speakers, and where each speaker has multiple utterances available. Typically, |C| is a large integer (O(10^3)). Here, an utterance might be in the form of a raw waveform or frame-level features such as MFCCs or a Mel spectrogram. Under the meta-learning setup, each episode (a training step; equivalent to a minibatch) consists of two stages of sampling: classes and utterances conditioned on classes.
First, a subset of classes L (speakers) is sampled from C within an episode, with the number of speakers per episode |L| typically held constant during the training process. Next, for each speaker in L, two disjoint subsets are sampled without replacement from the set of all utterances belonging to that speaker: supports S and queries Q. Within an episode, supports and queries are used for model training and loss computation, respectively, similar to train and test sets in supervised training. This process continues across a large number of episodes with speakers and utterances sampled as explained above. An episode is equivalent to a task, wherein the model learns to classify speakers from that task. Hence, meta-learning optimizes across tasks, treating each task as a training example. The optimization process is given as:

    \theta = \arg\max_\theta \, \mathbb{E}_L \big[ \mathbb{E}_{S,Q} \big[ \mathbb{E}_{(x,y) \in Q} \left[ \log p_\theta(y \mid x, S) \right] \big] \big]    (3.1)

Here, \theta denotes the trainable parameters of the neural network, and (x, y) represents an utterance and its corresponding speaker label. In contrast, conventional supervised learning optimizes:

    \theta = \arg\max_\theta \, \mathbb{E}_B \big[ \mathbb{E}_{(x,y) \in B} \left[ \log p_\theta(y \mid x) \right] \big]    (3.2)

where B denotes a minibatch. Meta-learning approaches are broadly categorized based on the characterization of p_\theta(y | x): model-based [173], metric-based [210] and optimization-based meta-learning [52]. Of interest in this work are metric-based approaches, where p_\theta(y | x) is a potentially learnable kernel function between utterances from S and Q. The reasoning is as follows: speaker embeddings trained for classification are bottleneck representations, and the latter are directly optimized using task performance in metric-learning approaches. I now describe the two metric-learning approaches used in this work: prototypical networks and relation networks.

3.2.1 Prototypical Networks

Protonets learn a non-linear transformation where each class is represented by a single point in the embedding space, namely the centroid (prototype) of the training utterances from that class. During inference a test sample is assigned to the class of the nearest centroid, similar to the nearest class mean method [132].

At training time, consider an episode t, with the support set (S_t) and the query set (Q_t) sampled as explained above. Supports are used for prototype computation while queries are used for estimating class posteriors and the loss value. The prototype (\mathbf{v}_c) for each class is computed as follows:

    \mathbf{v}_c = \frac{1}{|S_{t,c}|} \sum_{(x_i, y_i) \in S_{t,c}} f_\theta(x_i)    (3.3)

f_\theta : \mathbb{R}^M \to \mathbb{R}^P represents the protonet. x_i represents an M-dimensional utterance-level representation extracted using a DNN. S_{t,c} is the set of all utterances in S_t belonging to class c. For every test utterance x_j \in Q_t, the posterior probability is computed by applying a softmax activation over the negative distances to the prototypes:

    p_\theta(y_j = c \mid x_j, S_t) = \frac{\exp(-d(f_\theta(x_j), \mathbf{v}_c))}{\sum_{c' \in L} \exp(-d(f_\theta(x_j), \mathbf{v}_{c'}))}    (3.4)

d represents the distance function. Squared Euclidean distance was proposed in the original formulation [191] due to its interpretability as a Bregman divergence [10] as well as supporting empirical results. For the above reasons, I adopt squared Euclidean distance as the metric in this work. The negative log-posterior is treated as the episodic loss function and minimized using gradient descent:

    \mathrm{Loss} = \frac{1}{|Q_t|} \sum_{(x_j, y_j) \in Q_t} -\log(p_\theta(y_j \mid x_j, S_t))    (3.5)
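The following is a sketch of the two-stage episode sampling described at the start of this section combined with the prototypical loss of Eqs. (3.3)-(3.5). It assumes an encoder module that maps a batch of fixed-length utterance features to P-dimensional embeddings; the data structure (a dictionary from speaker id to feature tensors) and function names are illustrative.

    import random
    import torch
    import torch.nn.functional as F

    def sample_episode(utts_by_spk, n_way, n_support, n_query):
        """Two-stage sampling: draw a speaker subset L from C, then disjoint
        supports S and queries Q per speaker.
        utts_by_spk: dict mapping speaker id -> list of feature tensors."""
        speakers = random.sample(list(utts_by_spk), n_way)
        support, query = [], []
        for spk in speakers:
            utts = random.sample(utts_by_spk[spk], n_support + n_query)
            support.append(utts[:n_support])
            query.append(utts[n_support:])
        return support, query           # lists of length n_way

    def episode_loss(encoder, support, query):
        """Prototypical episode loss (Eqs. 3.3-3.5) for one sampled episode."""
        protos = torch.stack([encoder(torch.stack(s)).mean(0) for s in support])   # (n_way, P)
        q_emb = torch.cat([encoder(torch.stack(q)) for q in query])                # (n_way*q, P)
        labels = torch.arange(len(query)).repeat_interleave(len(query[0]))
        dists = torch.cdist(q_emb, protos) ** 2                                    # squared Euclidean
        return F.cross_entropy(-dists, labels)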
3.2.2 Relation Networks

Relation networks compare supports and queries by learning the kernel function simultaneously with the embedding space [201]. In contrast with protonets, which use squared Euclidean distance, relation networks learn a more complex inductive bias by parameterizing the comparison metric using a neural network. Hence, relation networks attempt to jointly learn the embedding and metric over an ensemble of tasks such that they generalize to an unseen task. Specifically, there exist two modules: an encoder network that maps utterances into fixed-dimensional embeddings, and a comparison network that computes a scalar relation given pairs of embeddings. Given supports S_t, the class representation is taken as the sum of all support embeddings:

    \mathbf{v}_c = \sum_{(x_i, y_i) \in S_{t,c}} f_\theta(x_i)    (3.6)

f_\theta represents the encoder network. For each query embedding belonging to class j, its relation score r_{c,j} with training class c is computed using the comparison network g_\phi as follows:

    r_{c,j} = g_\phi([\mathbf{v}_c; f_\theta(x_j)])    (3.7)

Here [ . ; . ] represents the concatenation operation. The original formulation of relation networks [201] treated the relation score as a similarity measure, hence r_{c,j} is defined as:

    r_{c,j} = 1 if y_j = c, and 0 otherwise    (3.8)

The networks f_\theta and g_\phi were jointly optimized using a mean squared error (MSE) objective, since the predicted relation score was treated similarly to a linear regression model output. In this work, I replace MSE with the conventional cross-entropy objective based on empirical results. Hence the posterior probability is computed as:

    p_\theta(y_j \mid x_j, S_t) = \frac{\exp(r_{c,j})}{\sum_{c' \in L} \exp(r_{c',j})}    (3.9)

and the loss function is computed using Equation (3.5).

3.2.3 Use in Speaker Applications

Speaker Diarization

Typically, there exist four steps in a speaker diarization system: speech activity detection, speaker segmentation, embedding extraction and speaker clustering (exceptions include recently proposed end-to-end approaches [87, 55]). In this work, I adopt a uniform segmentation strategy similar to [180, 57], wherein the session is segmented into equal-duration segments with overlap. Meta-learned embeddings are extracted from these segments, followed by clustering. I use a recently proposed variant of spectral clustering [149] which uses a binarized version of the affinity matrix between speaker embeddings. The binarization is expressed using a parameter (p) which represents the fraction of non-zero values in every row of the affinity matrix. The clustering algorithm attempts a trade-off between pruning excessive connections in the affinity matrix (minimizing p) and increasing the normalized maximum eigengap (NME; g_p), where the latter is expressed as a function of p (Eq. (10) in [149]). The ratio (p / g_p) is then minimized to estimate the number of resulting clusters (i.e., speakers) in a session. This process is referred to as binarized spectral clustering with normalized maximum eigengap (NME-SC).

Our choice of NME-SC in this work is motivated by two reasons: (1) I do not require a separate development set to estimate a threshold parameter, as is needed in the more common agglomerative hierarchical clustering (AHC) method with average linking applied on distances estimated using probabilistic linear discriminant analysis (PLDA) [180]. I choose the binarization parameter (p) for each session by optimizing (p / g_p) over a pre-determined range for p. (2) Empirical results demonstrate similar performance between AHC tuned on a development set and NME-SC, as reported in [149] and in this work.
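The following is a simplified sketch of the binarization-and-eigengap search described above; the full NME-SC procedure in [149] differs in details, and the pruning grid, Laplacian variant and eigengap heuristic here are assumptions made for illustration.

    import numpy as np
    from scipy.linalg import eigh

    def estimate_speakers_nme(embeddings, p_grid=np.arange(0.05, 0.55, 0.05)):
        """Simplified NME-style search: binarize the cosine affinity row-wise at
        each pruning level p, compute the Laplacian eigengap, and keep the p that
        minimizes p / g_p.  Returns the estimated number of speakers."""
        X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        A = X @ X.T                                   # cosine affinity matrix
        n = A.shape[0]
        best = None
        for p in p_grid:
            k = max(1, int(round(p * n)))
            B = np.zeros_like(A)
            for i in range(n):                        # keep the k largest entries per row
                idx = np.argsort(A[i])[-k:]
                B[i, idx] = A[i, idx]
            B = np.maximum(B, B.T)                    # symmetrize
            L = np.diag(B.sum(axis=1)) - B            # unnormalized graph Laplacian
            eigvals = eigh(L, eigvals_only=True)      # ascending eigenvalues
            gaps = np.diff(eigvals[: n // 2 + 1])
            g_p = gaps.max() / (eigvals.max() + 1e-10)    # normalized maximum eigengap
            n_spk = int(np.argmax(gaps)) + 1
            if best is None or p / (g_p + 1e-10) < best[0]:
                best = (p / (g_p + 1e-10), n_spk)
        return best[1]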
Speaker Verification

I use the standard protocol for speaker verification, wherein a speaker embedding is extracted from the entire utterance. Subsequently, the embeddings are reduced in dimension using LDA and trial pairs are scored using a PLDA model trained on the same data used to train the embeddings. Following this, target/imposter pairs are determined using a threshold on the PLDA scores.

3.3 Datasets

Since I evaluate meta-learned embeddings on two applications, speaker diarization and speaker verification, I use different corpora commonly used in evaluating these respective applications. I choose corpora obtained from both controlled and naturalistic settings, with the former generally assumed relatively free from noise, reverberation and babble. I further choose additional corpora to assist with application-specific analysis of performance, such as the effect of domains and speaker characteristics (age) on diarization error rate (DER) and of channel conditions on equal error rate (EER). A summary of the corpora used in this work is presented in Table 3.1. Below, I provide details for each corpus.

Figure 3.1: Overview of baseline and meta-learning architectures. (a) A time-delay layer F(N, D, K), which forms the basic component across models. At each time step, activations from the previous layer are computed using a context width of K and a dilation of D. N represents the output embedding dimension. (b) Baseline x-vector model. Kaldi speaker embeddings are extracted at the fc1 layer. I find that fc2 and fc1 embeddings perform better for speaker diarization and speaker verification, respectively. (c) Prototypical network architecture. Layers marked with a dashed boundary are initialized with pre-trained x-vector models, while layers with a solid boundary are randomly initialized. The final layer output is referred to as protonet embeddings. (d) Relation encoder architecture. The final layer output is referred to as relation network embeddings. Relation scores are computed using these embeddings as illustrated in Fig. 3.2(b).

Table 3.1: Overview of training and evaluation corpora
Training    Evaluation
            Speaker Diarization    Speaker Verification
Vox2        AMI                    Vox1 test
Vox1 dev    DIHARD II dev          VOiCES
            ADOS-Mod3              SITW

Voxceleb

The Voxceleb corpus [34] consists of YouTube videos and audio of speech from celebrities with a balanced gender distribution. Over a million utterances from about 7300 speakers are annotated with speaker labels. The utterances are collected from varied background conditions to simulate an in-the-wild collection. The Voxceleb corpus is further subdivided into the Vox1 and Vox2 datasets. Following the baseline Kaldi recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb/v2), I use the dev and test splits from Vox2 and the dev split from Vox1 for embedding training. The test split from Vox1 is reserved for speaker verification. There exists no speaker overlap between the train set and the Vox1-test set.
VOiCES

The VOiCES corpus [167] was released as part of the VOiCES from a Distance challenge (https://voices18.github.io/). It consists of clean audio (the Librispeech corpus [148]) played inside multiple room configurations and recorded with microphones of different types placed at different locations in the room. In addition, various distractor noise signals were played along with the source audio to simulate acoustically challenging conditions for speaker and speech recognition. Furthermore, the audio source was rotated in its position to simulate a real person. I use the evaluation portion of the corpus, which is expected to contain more challenging room configurations [142] than the development portion.

SITW

The speakers-in-the-wild corpus [131] was released as part of the SITW speaker recognition challenge. It consists of in-the-wild audio collected from a diverse range of recording and background conditions. In addition to speaker identities, the utterances are manually annotated for gender, extent of degradation, microphone type and other noise conditions in order to aid analysis. A subset of the utterances also includes multiple speakers, with timing information available for the speaker with the longest duration.

Table 3.2: Statistics of corpora used for speaker verification, including trial subsets created for analysis purposes
Corpus        #Spkrs    #Utterances    #Trials (#target)
Vox1 test     40        4715           38K (19K)
VOiCES        100       11392          3.6M (36K)
  close mic   98        1076           0.84M (8.5K)
  far mic     96        1006           0.78M (7.9K)
  obs mic     96        1006           0.77M (7.9K)
SITW          151       1006           0.50M (3K)
  low deg     150       998            0.16M (735)
  high deg    151       1003           0.20M (1.2K)

AMI

The AMI Meeting corpus (http://groups.inf.ed.ac.uk/ami/corpus/) consists of over 100 hours of office meetings recorded in four different locations. The meetings are recorded using both close-talk and far-field microphones; I use the former for diarization purposes. Since each speaker has an individual channel, the audio is beamformed into a single channel. I follow [198, 147] for splitting the sessions into the dev and eval partitions, ensuring that no speakers overlap between them. For the purpose of this chapter, the AMI sessions represent audio collected in noise-free recording conditions.

DIHARD

The DIHARD speaker diarization challenges [171] were introduced in order to focus on hard diarization tasks, i.e., in-the-wild data collected with naturalistic background conditions. In this work, I use the development set from the second DIHARD challenge. This corpus consists of data from multiple domains such as clinical interviews, audiobooks, broadcast news, etc. I make use of the 192 sessions in the single-channel task in this work. It is worth noting that a handful of sessions in this corpus contain only a single speaker.

Table 3.3: Statistics of corpora used for speaker diarization
Corpus            #Sessions    #Spkrs/Session (mean)    Session Duration (min: mean (std))
DIHARD            192          3.48                     7.44 (3.00)
AMI (dev+eval)    26           3.96                     31.54 (9.06)
ADOS-Mod3         173          2                        3.23 (1.50)

ADOS-Mod3

One of the most challenging domains in the DIHARD evaluations included speech collected from children. Speaker diarization for these interactions involves additional complexities for two reasons: (1) an intrinsic variability in child speech owing to developmental factors [112, 113], and (2) speech abnormalities due to an underlying neurodevelopmental disorder such as autism.
To this end, I use 173 child-adult interactions consisting of excerpts from the administration of Module 3 of the ADOS (Autism Diagnostic Observation Schedule) [120]. These interactions involve children with sufficiently developed linguistic skills, i.e., the ability to form complete sentences. All the children in this study had a diagnosis of autism spectrum disorder (ASD) or attention deficit hyperactivity disorder (ADHD). The sessions were collected from two different locations and manually annotated using the SALT transcription guidelines (https://www.saltsoftware.com/media/wysiwyg/tranaids/TranConvSummary.pdf). Details of the corpora used for speaker diarization are provided in Table 3.3.

3.4 Experiments and Results

3.4.1 Baseline Speaker Embeddings

In order to select a competitive and fair baseline for the meta-learned embeddings, I first developed an implementation of x-vectors. Our model is similar to the Kaldi Voxceleb recipe (https://kaldi-asr.org/models/m7) with respect to training corpora and network architecture. I compare the reported performance of Kaldi embeddings with my implementation and select the best performing model as the baseline system.

Figure 3.2: (a) Illustrating the training step in prototypical networks. Decision regions are indicated using background colors. For each class, prototypes are estimated as the centroid of supports (filled shapes). Given the query (unfilled shape), negative distances to each prototype are treated as logits. Adopted from [102]. (b) Comparison module in relation networks. The sum of support embeddings from class c (v_c) is concatenated with a query embedding (f_theta(x_j)) and input to the comparison network. r_{c,j} is known as the relation score for query x_j with respect to class c and treated as the logit.

As mentioned in Section 3.3, I use the Vox2 and Vox1-dev corpora for embedding training. Similar to the Kaldi recipe, I extract 30-dimensional MFCC features using a frame width of 25 ms and an overlap of 15 ms. I augment the training data with noise, music and babble speech using the MUSAN corpus [192], and with reverberation using the RIR_NOISES corpus (http://www.openslr.org/28). The augmented data consist of 7323 speakers and 2.2M utterances. Following this, all utterances shorter than 4 seconds in duration and all speakers with fewer than 8 utterances each are removed to assist the training process. Cepstral mean normalization using a sliding window of 3 seconds was performed to remove any channel effects.

The model architecture consists of 5 time-delay layers which model temporal context information, followed by a statistics pooling layer that maps into an utterance-level vector. This is followed by two feed-forward bottleneck layers with 512 units in each layer and a final layer which outputs speaker posterior probabilities. In contrast with the Kaldi implementation, I use the Adam optimizer (beta_1 = 0.9, beta_2 = 0.99) to train the model, with an initial learning rate of 1e-3. The learning rate is increased to 2e-3 and progressively reduced to 1e-6. Dropout and batch normalization are used at all layers for regularization purposes. A minibatch of 32 samples is used at each iteration, while ensuring that utterances in each minibatch are of fixed duration to improve the training process. I accumulated gradients for every 4 minibatches before back-propagation, which was observed to improve model convergence.
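The following is a hedged sketch of the baseline architecture just described: five time-delay layers, statistics pooling, and two 512-unit bottleneck layers before the speaker softmax. The specific context widths and dilations follow the widely used x-vector recipe and are assumptions here rather than values stated in this chapter.

    import torch
    import torch.nn as nn

    class XvectorTDNN(nn.Module):
        """Sketch of the x-vector baseline: TDNN frame layers, statistics pooling,
        and two bottleneck layers (fc1, fc2) before the speaker classifier."""
        def __init__(self, feat_dim=30, num_speakers=7323):
            super().__init__()
            def tdnn(i, o, k, d):   # time-delay layer as a dilated 1-D convolution
                return nn.Sequential(nn.Conv1d(i, o, k, dilation=d),
                                     nn.ReLU(), nn.BatchNorm1d(o))
            self.frame_layers = nn.Sequential(
                tdnn(feat_dim, 512, 5, 1), tdnn(512, 512, 3, 2), tdnn(512, 512, 3, 3),
                tdnn(512, 512, 1, 1), tdnn(512, 1500, 1, 1))
            self.fc1 = nn.Linear(3000, 512)    # embedding layer used for verification
            self.fc2 = nn.Linear(512, 512)     # embedding layer used for diarization
            self.out = nn.Linear(512, num_speakers)

        def forward(self, feats):              # feats: (batch, feat_dim, frames)
            h = self.frame_layers(feats)
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
            e1 = self.fc1(stats)
            e2 = self.fc2(torch.relu(e1))
            return self.out(torch.relu(e2)), e1, e2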
Table 3.4: Selecting a baseline system for speaker diarization. For each embedding and clustering method (AHC-f: AHC with fixed threshold, AHC-o: AHC with optimized threshold, bSC: binarized spectral clustering with normalized maximum eigengap), the diarization error rate (DER %) is provided for two settings: using the oracle speaker count (Oracle) and the estimated count (Est).
Tool       Method    DIHARD            AMI               ADOSMod3
                     Oracle   Est      Oracle   Est      Oracle   Est
Kaldi      AHC-f     15.94    24.67    13.96    12.64    19.53    31.05
           AHC-o     -        18.35    -        14.28    -        18.17
           bSC       18.81    15.26    8.57     9.50     14.77    19.57
Ours fc1   AHC-f     17.09    24.47    15.40    14.49    18.82    33.14
           AHC-o     -        18.74    -        14.55    -        20.18
           bSC       18.81    14.62    7.95     14.51    15.85    21.37
Ours fc2   AHC-f     22.17    24.77    18.03    16.25    18.89    30.37
           AHC-o     -        19.61    -        16.23    -        20.03
           bSC       17.62    13.93    6.94     8.47     13.94    17.16

3.4.2 Meta-learned embeddings

I select DNN architectures for the meta-learning models similar to the baseline model in order to enable a fair comparison. I use the same network as x-vectors except for the final layer, i.e., I retain the time-delay layers, the stats pooling layer, and two fully connected layers with 512 units in each layer. The protonet model uses an additional two fully connected layers with 512 units in each layer. Embeddings extracted at the final layer are used for prototype computation and loss estimation. The relation network uses one additional fully connected layer (512 units) for the encoder network. The comparison network consists of three fully connected layers with 1024 units at the input, 512 units in the hidden layer and 1 unit at the output. For both networks, I use batch normalization, which was observed to improve convergence. I do not use dropout in the meta-learned models, following their respective original implementations [191, 201]. The number of trainable parameters for the baseline x-vector model, protonet and relation net (encoder + comparison) are 9.8M, 6.6M and 7.1M, respectively.

I trained both protonets and relation nets using the Adam optimizer (beta_1 = 0.9, beta_2 = 0.99). The initial learning rate was set to 1e-4 and exponentially decreased (gamma = 0.9) every 500 episodes, where an episode corresponds to a single back-propagation step. The models were trained for 100K episodes, with the stopping point determined based on convergence of the smoothed loss function. The architecture and initialization strategies for all models are presented in Figure 3.1, while the meta-learning losses are illustrated in Figure 3.2.

Model Initialization: I use a part of the pre-trained x-vector model as an initialization for the meta-learning model. Specifically, I initialize the time-delay layers using the pre-trained weights from the corresponding layers of the x-vector model. The fully connected layers are initialized uniformly at random between [-1/sqrt(N), 1/sqrt(N)], where N is the number of parameters in the layer. Empirically, I observed that the above initialization scheme provided a significant performance improvement in my experiments.
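The following is a minimal sketch of this initialization scheme: time-delay weights are copied from the pre-trained x-vector model and the fully connected layers are re-initialized uniformly at random. The layer-name matching ("frame_layers") and the interpretation of N as the layer's parameter count are assumptions for illustration.

    import math
    import torch
    import torch.nn as nn

    def init_meta_model(meta_model, xvec_state_dict):
        """Copy pre-trained TDNN weights; re-initialize every Linear layer
        uniformly in [-1/sqrt(N), 1/sqrt(N)] (N = parameters in the layer)."""
        own_state = meta_model.state_dict()
        with torch.no_grad():
            for name, param in xvec_state_dict.items():
                if name.startswith("frame_layers") and name in own_state:
                    own_state[name].copy_(param)          # reuse time-delay layer weights
        for module in meta_model.modules():
            if isinstance(module, nn.Linear):
                bound = 1.0 / math.sqrt(module.weight.numel())
                nn.init.uniform_(module.weight, -bound, bound)
                if module.bias is not None:
                    nn.init.uniform_(module.bias, -bound, bound)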
Figure 3.3: Speaker diarization performance (% DER) across different corpora (DIHARD, AMI-eval, ADOSMod3) for different combinations of support examples (2-50) and training classes (25-400) within an episode, shown for protonets and relation nets. The number of queries per class is always 1 in all experiments.

Table 3.5: Speaker diarization results comparing meta-learning models with x-vectors. x-vector+retrain represents the mean DER computed over 3 trials.
Method              DIHARD            AMI               ADOSMod3
                    Oracle   Est      Oracle   Est      Oracle   Est
x-vectors           17.62    13.93    6.94     8.47     13.94    17.16
x-vector+retrain    17.39    13.26    7.49     8.52     16.74    16.89
Protonet            15.44    12.96    6.67     7.31     12.81    17.22
Relation Net        16.17    12.65    6.38     8.94     12.34    16.19

Since I borrow a part of the pre-trained x-vector model in my meta-learning models during initialization, I verify that any gains in performance obtained with the meta-learning models do not arise from overtraining the x-vector model. I conduct a sanity-check experiment wherein I retrain the x-vector model similarly to the meta-learning models. Specifically, I use the baseline model from Section 3.4.1 and retrain it using pre-trained weights for the time-delay layers and random initialization for the fully connected layers. The model was trained for 100K minibatches, which corresponds to the same number of episodes used for training the meta-learning models.

3.4.3 Speaker Diarization Results

I use the oracle speech activity detection for speaker diarization in order to exclusively study the speaker errors. I segment the session to be diarized into uniform segments 1.5 seconds in duration with an overlap of 0.75 seconds. Embedding clustering is performed using the NME-SC method as described in Section 3.2.3. During scoring, I do not use a collar, similar to the DIHARD evaluations. However, I discard speaker overlap regions since neither x-vectors nor meta-learned embeddings are trained to handle overlapping speech.

Table 3.4 presents speaker diarization results for various baseline embeddings. I compare between pre-trained Kaldi embeddings and both feed-forward bottleneck layers in my implementation. In addition to NME-SC for speaker clustering, I use AHC on PLDA scores with two methods for estimating the number of speakers: (1) a fixed threshold parameter of 0, and (2) a threshold parameter tuned using a development set. I tuned the parameter using two-fold cross-validation for DIHARD and ADOS-Mod3, and the AMI-dev set for the AMI corpus.

First, I notice that AHC is quite sensitive to the threshold parameter when estimating the number of speakers, across all corpora and clustering methods. The DER reduction using a fine-tuned threshold is particularly significant for the ADOS-Mod3 corpus, with nearly 13% absolute improvement for fc1 and 10% for fc2 embeddings extracted using my network.
In some cases on the DIHARD and AMI corpora, the DER obtained by fine-tuning the threshold is lower than when the oracle number of speakers is used, similar to observations in [146]. Next, fc1 embeddings outperform fc2 embeddings when clustering using AHC and PLDA scores, consistent with findings from [193]. However, when cosine affinities are used with NME-SC, I notice that the layer closer to the cross-entropy objective (fc2) results in a lower DER. This is the case both when the oracle number of speakers is used and when it is estimated using the maximum eigengap value. The combination of fc2 embeddings with the NME-SC method returns the lowest DERs for most conditions. Further, NME-SC removes the need for a separate development set for estimating the threshold parameter. Hence, I adopt this as the baseline diarization method in all my experiments.

Figure 3.4: Diarization performance across domains in DIHARD (broadcast interview, child, clinical, court, maptask, meeting, restaurant, socio-field, socio-lab, webvideo). For each domain, the mean DER across sessions is provided for the baseline (x-vectors), protonets and relation nets. The relative change in DER (%) with respect to the baseline is given next to the bar (positive: DER reduction).

In Table 3.5, I compare the baseline with the meta-learning models. x-vector+retrain represents the mean results from 3 trials of the sanity-check experiment described in Section 3.4.2. Both meta-learning models were trained for 100K episodes. Within each episode, 400 classes were randomly chosen without replacement from the training corpus. Following this, 3 samples were chosen without replacement from each class. Two samples were treated as supports, while the third sample was treated as a query.

From the results, I note that retraining the x-vector model provides a minor DER improvement on the DIHARD corpus, while performance worsens on the AMI corpus. The meta-learning models outperform the baselines in most cases, although improvements depend on the corpus and setting. On the DIHARD corpus, which consists of challenging domains, protonets result in a 12.37% relative improvement given the oracle number of speakers and a 6.96% improvement when the number of speakers is estimated. Relation networks show a slight degradation when compared to protonets. This difference is larger on a relatively clean corpus such as AMI when estimating the number of speakers.
Increasing the number of speakers in an episode favours DER. This is similar to previous ndings in few-shot image recognition [191], where during training, a higher way than expected during testing was found to provide the best results. However, the eect of number supports per class (shot) on DER is not straightforward. When a large number of classes is used, increasing supports provides little to no improvements in both protonets and relation nets. Upon reducing the number of classes, the performance degrades with more supports across most models. This suggests a possibility of over-tting due to large number of supports even though the conguration closely resembles a test session. It is more benecial to increase the number of classes within an episode during training. 52 3.4.3.2 Performance across dierent domains in DIHARD It is often useful to understand the eect of conversation type, including speaker count, spontaneous speech and recording setups on the diarization performance. I study this using the domain labels [172] available for the DIHARD corpus. For each domain, I compute the mean DER across sessions using the baseline model as well as the meta- learning models. Oracle speaker count is used during clustering in order to exclusively study the eect of domain factors. I do not include the Audiobooks domain in this experiment since all the models return the same performance on account of sessions consisting of only one speaker. I present the results in Table 3.4. I note that there exists considerable variation between domains in terms of the DER improvement between x-vectors and meta-learning models. Broadcast news, child, map- task, meeting and socio-eld domains show signicant gains due to meta-learning models. Specically, meeting and child domains benet upto 38.31 % and 16.18 % relative DER improvement from protonets. Diarization in the court domain degrades in performance consistently between protonets and relation nets, with up to 20.05 % relative degradation for relation networks. Upon a closer look at the court and meeting domains to understand this dierence, I note that both domains contain similar number of speakers per session (Court: 7, Meeting: 5.3). However, the domains dier in the data collection setup: court sessions are collected by averaging audio streams from individual table-mounted microphones, while meeting sessions are collected using a single table microphone distant from all the participants [172]. Among the socio-linguistic interview domains, interviews recorded in the eld under diverse locations and subject age groups (socio-eld) result in 53 a larger DER improvement over those collected under quiet conditions (socio-lab). Socio- lab contains recording from both close-talking and distant microphones, hence it is not immediately clear whether microphone placement alone is a factor in DER improvement. Child and restaurant domains show variation in DER reduction although they perform similar with the baseline models, suggesting that background noise types aect benets from meta-learning. Overall, most domains that include in-the-wild data collection show improvements with meta-learning. 3.4.3.3 Performance across dierent child age groups As mentioned in Section 3.3, automatic child speech processing has been considered a hard problem when compared to processing adult speech. More recently, the child domain returned one of the highest DERs during the DIHARD evaluations [222], illustrating the challenges of working with child speech for diarization. 
Considering that the meta-learning models return a significant improvement over x-vectors for the child domain, I attempt to understand the gains in DER by controlling for the age of the child. Children develop linguistic skills as they grow up, hence child age is a reasonable proxy for their linguistic development. I select sessions from the ADOS-Mod3 corpus where I have access to the child age metadata. I compute the DER for each child using the respective baseline and meta-learned models described in Section 3.4.2. For children where two sessions are available, I compute the mean DER per child. I study the effect of child age on DER by grouping child age into 3 groups with approximately equal numbers of children in each set. Children below 7.5 years of age are collected in the Low age group, children between 7.5 and 9.5 years of age are collected in the Mid age group, and children above 9.5 years of age are collected in the High age group.

Table 3.6: Analysis of child-adult diarization performance on the ADOS-Mod3 corpus. For each age group, the mean DER (%) of sessions in each group is presented along with the relative improvement in parentheses.
Model          Low             Mid             High
Baseline       17.36           13.42           13.77
Protonet       15.77 (9.16)    12.39 (7.68)    12.33 (10.46)
Relation Net   15.69 (9.62)    12.82 (4.47)    11.37 (17.43)

From the results in Table 3.6, I notice that the Low age group returns the highest DER, while the Mid and High age groups return similar performance across models. Given that children in the Low age group are more likely to exhibit speech abnormalities, this result illustrates the relative difficulty of automatic speech processing under such conditions. Improvements in DER from the meta-learning models are distributed across all age groups. A consistent improvement of 10% relative DER in the Low age group is particularly encouraging given the challenging nature of such sessions. The High age group exhibits similar improvements in DER, with relation networks providing up to 17.43% relative gains.

3.4.4 Speaker Verification Results

I use speaker verification as another application task to illustrate the generalized speaker information captured by meta-learned embeddings. Similar to speaker diarization, I first evaluate my implementation of the baseline against the pre-trained Kaldi embeddings. I use the test partition of the Voxceleb corpus, the eval set of the VOiCES corpus and the eval set of the SITW corpus in my experiments. I use the core-core condition in the SITW corpus, where a single speaker is present in both utterances of a trial. For all models, I score trials using PLDA after performing dimensionality reduction to 200 using LDA and length normalization. The PLDA model is trained using the same data used for embedding training, i.e., the Vox2 corpus and the dev set of the Vox1 corpus. Speakers in the SITW corpus which overlap with the Voxceleb corpus were removed from the trials before evaluation. I use equal error rate (EER) as the metric to select the best performing baseline system. Since cosine scoring returned significantly higher EERs relative to PLDA, I did not investigate it further. Results are provided in Table 3.7.

Table 3.7: Selecting a baseline system for speaker verification. Results are presented as equal error rate (EER %).
Embedding    Vox1-test    VOiCES    SITW
Kaldi        3.128        10.300    4.054
Ours: fc1    2.815        8.591     3.856
Ours: fc2    3.006        9.854     4.087

I notice that embeddings from both layers in my implementation outperform or closely match the Kaldi implementation. Similar to observations from Section 3.4.1 and [193], fc1 embeddings fare better than fc2 embeddings when scored with PLDA. I select fc1 embeddings as the baseline speaker verification method.
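The following is a short sketch of how the two trial metrics reported in this section, EER and minDCF, can be computed from a list of scored trials by sweeping a decision threshold; the cost parameters mirror the P_target = 0.01 operating point used below, and the normalization step is the conventional one.

    import numpy as np

    def eer_and_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
        """labels: 1 for target trials, 0 for impostor trials.
        Returns (EER, normalized minDCF) by sweeping a threshold over the scores."""
        order = np.argsort(scores)
        labels = np.asarray(labels)[order]
        n_target, n_imp = labels.sum(), (1 - labels).sum()
        # Miss and false-alarm rates at every possible threshold position
        p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_target])
        p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_imp])
        eer_idx = np.argmin(np.abs(p_miss - p_fa))
        eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2
        dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
        dcf = dcf / min(c_miss * p_target, c_fa * (1 - p_target))   # normalize
        return eer, dcf.min()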
Table 3.8: Speaker verification results comparing meta-learning models with x-vectors. Results are presented using EER and minDCF computed at P_target = 0.01.
Model          Vox1-test          VOiCES             SITW
               EER      DCF       EER      DCF       EER      DCF
Baseline       2.815    0.311     8.591    0.696     3.856    0.359
Protonets      2.831    0.299     7.837    0.646     3.560    0.347
Relation Net   2.884    0.313     8.238    0.690     3.725    0.370

When comparing the meta-learning models, I use the same models developed in Section 3.4.3. In addition to EER, I present results using the minimum detection cost function (minDCF) computed at P_target = 0.01. From Table 3.8, I note that the meta-learning models outperform x-vectors in most settings except in the case of the Voxceleb corpus when EER is used. Both protonets and relation nets return similar EER and minDCF on the Voxceleb corpus. Interestingly, I achieve notable improvements on the relatively more challenging corpora. Protonets provide up to 8.78% and 7.68% EER improvements on the VOiCES and SITW corpora, respectively, with similar improvements in minDCF. While relation nets provide better performance than x-vectors on the above corpora, they do not outperform protonets in any setting. This suggests that using a predefined distance function (namely squared Euclidean in protonets) might be beneficial overall when compared to learning a distance metric using relation networks for the speaker verification application.

3.4.4.1 Robust Speaker Verification

Since the VOiCES and SITW corpora return the most improvement for speaker verification, I take a closer look at which factors benefit meta-learning. For each corpus, I make use of the annotations for microphone location and channel degradation to create new trials for speaker verification.

In the VOiCES corpus, I collect playback recordings from rooms 3 and 4 present in the eval subset. Within these recordings, I distinguish between the utterances based on the microphone placement with respect to the loudspeaker (audio source). Specifically, I create three categories: (1) utterances collected using mic1 and mic18 are treated as near-field, being closest to the source, (2) utterances collected from mic19 are treated as far-field, and (3) utterances collected from mic12 are treated as obscured, since they are fully obscured by the wall. While creating the trials for each category, I ensure that the ratio of target to non-target pairs remains approximately equal to that of the overall eval set trials. An example room configuration is presented in Figure 3.5.

Table 3.9: Analysis of speaker verification based on microphone location (Near: near-field, Far: far-field, Obs: fully obscured) in the VOiCES corpus, and level of degradation artefacts in the SITW corpus.
VOiCES (mic location)
Model          Near               Far                Obs
               EER      DCF       EER      DCF       EER      DCF
Baseline       3.907    0.3407    7.311    0.5797    22.65    0.9375
Protonets      3.801    0.376     7.132    0.6337    20.58    0.9366
Relation Net   3.872    0.3521    7.618    0.6282    21.24    0.9527
SITW (degradation level)
Model          Low                High
               EER      DCF       EER      DCF
Baseline       3.401    0.3463    4.815    0.445
Protonets      3.537    0.3281    4.414    0.4268
Relation Net   3.81     0.3467    4.414    0.4525

From the SITW corpus, I use the metadata annotations for the level of degradation. The corpus includes multiple degradation artifacts: reverberation, noise, compression, etc., among others. The level of degradation for the most prominent artefact was annotated manually on a scale of 0 (least) to 4 (maximum).
I use the trials available as part of the eval set which are annotated with the degradation level. I group the trials into two levels: low (deg0 and deg1) and high (deg3 and deg4). Note that the utterances contain multiple types of degradation at each level. Details of the target and imposter pairs for the SITW corpus (degradation level) and the VOiCES corpus (microphone placement) are presented in Table 3.2. Speaker verification results using EER and minDCF are presented in Table 3.9.

Figure 3.5: An example room configuration from the VOiCES corpus (figure adapted from https://voices18.github.io/rooms/). Microphones are represented using circles.

I notice that no single model performs best across all conditions. When controlled for microphone placement in VOiCES, protonets return the best EER at all locations. The margin of improvement remains approximately the same when only the distance from the source is considered: 2.71% for near-field and 2.45% for far-field. The margin improves to 9.14% when the microphone is fully obscured by a wall and placed close to distractor noises. Interestingly, these improvements are not reflected in the minDCF scores in the absence of noise, where x-vectors outperform both meta-learning models. I believe that the improvements in EER and minDCF on the VOiCES corpus primarily arise from utterances collected in obstructed locations and in close vicinity of distractor noises.

The experiments on the SITW corpus focus on the strength of such noise conditions. Under low degradation levels, I see that x-vectors return the lowest EER, although their performance is not consistent with minDCF. Meta-learning models continue to work better at higher degradation levels, providing 8.3% and 4.1% reductions in EER and minDCF, respectively.

3.5 Conclusions

I proposed neural speaker embeddings trained with the meta-learning paradigm, and evaluated them on corpora representing different tasks and settings. In contrast to conventional speaker embedding training, which optimizes a single classification task, I simulate multiple tasks by sampling speakers during the training process. Meta-learning optimizes on a new task at every training iteration, thus improving generalizability to an unseen task. I evaluate two variants of meta-learning, namely prototypical networks and relation networks, on speaker diarization and speaker verification. I analyze the performance of meta-learned speaker embeddings in challenging settings such as far-field recordings, child speech, fully obstructed microphone collection and the presence of high noise degradation levels. The results indicate the potential of meta-learning as a framework for training multi-purpose speaker embeddings.

Chapter 4

Entropy Filtering for Self Learning

In this chapter, I continue to focus on improving child-adult classification from speech in the ASD domain. However, I adopt a different problem setting in terms of the labeled sample distribution across multiple sessions. Specifically, I assume a single labeled sample (referred to as enrolment data) per class (child/adult) from all sessions, as opposed to the conventional train-test split across sessions. The motivation for this is two-fold: First, the assumption that all samples are labeled in training sessions is often not feasible, since annotations are often time-consuming, expensive to obtain and even unreliable.
On the other hand, typical child-adult diagnostic tools often begin with a preamble [77], which provides a straightforward way to obtain a handful of labeled enrolment data for both the child and the adult. Second, from the standpoint of variability modeling, enrolment data drawn from every session, including the test sessions, are more likely to cover the acoustic and background variability conditions present in the audio.

The proposed problem setting (Figure 4.1) is closely aligned with semi-supervised learning [236] for the low-resource scenario. In this setting, semi-supervised algorithms utilize a small amount of labeled data in combination with a (usually) larger set of unlabeled data to improve learning. Various strategies for semi-supervised learning have been developed, including the use of generative models [235, 96], self-training [234] and graph-based methods [31]. Each variant of semi-supervised learning places different assumptions on the distribution of the unlabeled data.

Figure 4.1: Illustrating the conventional setup for supervised learning vs. the alternative setup explored in this chapter

In this chapter, I implement semi-supervised learning as a two-step process. First, a `global' model is built using per-speaker enrolment data from all available sessions. Next, classification is performed within a session by adapting the global model to the session's data. I first introduce an entropy-based filtering step to identify sessions with `similar' labeled segments, which are then used to train a session-specific classifier. Next, I classify the unlabeled segments using self-training in an iterative manner, classifying those with high confidence scores and retraining my classifier. This method (bootstrapping), in which the classifier trains on its own predictions, has been used in natural language processing and computer vision [169, 123]. Finally, I use a distance-based assignment for the remaining segments deemed uncertain by self-training. In this way, I adapt the `global' classifier to each session's variability during the classification process.

The rest of the chapter is organized as follows: Section 4.1 describes the problem setting, motivates the use of a global classifier, and introduces the two speech tasks addressed with this method, demonstrating the generalized nature of the approach. Section 4.2 presents the entropy-based filtering technique used to select sessions with representative enrolment data. These data are used to adapt the global model under the self-learning paradigm, which I refer to as iterative bootstrapping (Section 4.3). Section 4.4 describes the experiments, including classification methods, results and analysis. Conclusions are presented in Section 4.5.

4.1 Problem Formulation

In this section, I describe the problem formulation common to both speech tasks addressed in this chapter. Consider $N$ sessions in the corpus with up to $K$ classes; i.e., session $i$ contains $k_i$ classes such that $k_i \in [2, K]$. Let $\{x_{ij}\}_{j \in J_i}$ denote the set of all segments within session $i$; i.e., the cardinality of $J_i$ is the number of segments in session $i$. Also let $\{x_{ij'}\}_{j' \in J'_i}$ represent the set of labeled segments in session $i$ (one labeled segment for each class). The objective is then to classify the unlabeled data $\{x_{ij}\}$ for all $j \notin J'_i$. In all experiments, I first build a `global' supervised classifier (referred to as $S_0$) using only the labeled data across all sessions, and then perform unsupervised adaptation to create another classifier ($S_i$) specific to session $i$ (Figure 4.2).

Figure 4.2: Overall methodology for session adaptation

4.1.1 Datasets

I applied my method to two different tasks: speaker clustering on the ADOS dataset and an LID-like task on CALLFRIEND. With the latter task, I demonstrate the ability of my methods to generalize to multiple classes and replicate my findings on a publicly available corpus (https://catalog.ldc.upenn.edu/). In the following, I describe the datasets and experimental setup.

Speaker Classification on ADOS

I use 269 sessions from Module 3 of the ADOS, which is a spoken diagnostic interaction between a clinician and a child. The sessions include the Emotions subtask, where the child is asked to identify the causes and effects of various emotions in themselves, and the Social Difficulties & Annoyance subtask, where various social problems at home and school are discussed. The sessions cover children from ages 4 to 13 years (duration: μ = 219s, σ = 89s). I define speaker-homogeneous segments within an utterance boundary, obtained using the ground-truth speech transcripts. For both the child and the adult, I choose the longest utterance (duration: μ = 7.01s, σ = 1.98s) from each session as the labeled data, so as to ensure each speaker is represented sufficiently. In total, I have 538 labeled segments and 10,284 unlabeled segments.

Language ID on CALLFRIEND

For the next task, I use 13 languages from the CALLFRIEND corpus, which consists of unscripted telephone conversations between native speakers of the particular language. I use Japanese, Korean, Mandarin (Mainland & Taiwan dialects), Spanish (Caribbean & Non-Caribbean), Tamil and Vietnamese to train the GMM-UBM and i-vector extractor, and Arabic, Farsi, French, German and Hindi for evaluation purposes. For each language, I pool data from the train, dev and eval subsets, resulting in an average of 118 speakers per language. I use an energy-based voice activity detector from Kaldi [157] to remove silence regions, since the conversations were recorded with low levels of background noise. A segment is defined as a contiguous speech utterance of 2 seconds in duration. I formulate the task similar to spoken language identification by creating synthetic sessions which include speech segments from speakers of different languages, while ensuring that each language within a session is represented by only one speaker. For example, a session can include segments from speaker 1 of French, speaker 10 of German and speaker 100 of Farsi. The number of languages per session is varied from 2 to 5, and 200 unique sessions are created for each setting. The segment to be labeled for each language is randomly chosen from within that conversation. Data statistics for both corpora are presented in Table 4.1.

Table 4.1: Data statistics showing the number of labeled and unlabeled samples

Corpus        ADOS      CF-2      CF-3      CF-4      CF-5
#Labeled      538       400       600       800       1000
#Unlabeled    10,284    20,654    31,844    42,575    52,078

Motivation for Entropy Filtering

In this section, I explain the motivation behind the choice of entropy filtering as a method for selecting relevant labeled data per session. Given a test session, I define relevant data as the subset of enrolment data (including data from other sessions in the corpus) that closely reflects the background conditions of the test session.
I start with the enrolment data (one sample per class) from the same session and augment them with the closest samples from other sessions using cosine distance. However, it is often not straightforward to determine how many of the closest samples to select per class. Choosing too few samples may result in insufficient data, while too many samples can potentially add noisy information from other channel conditions. I introduce an asymmetric entropy measure to choose the number of samples for every session (entropy filtering).

I compute the entropy measure from the posterior probabilities for the enrolment samples, using a global model estimated from the enrolment data of all sessions. I hypothesize that the posterior probability (and the entropy measure) reflects the extent to which the background information from that session has been captured by the global model: a high probability score indicates reasonably high performance, and vice versa. Hence, for a session with high probability scores for its enrolment samples, I retain a relatively large number of samples from other sessions.

Before presenting the mathematical formulation for entropy filtering in Section 4.2, I validate this hypothesis using binary LID tasks. I consider two pairs of languages: Farsi vs. German and French vs. Hindi. For each language pair, I simulate conversations between speakers of each language similar to Section 4.1.1. I randomly select one sample from each language per conversation and train a global model. For each enrolment sample, I plot the relationship between the training posterior probability computed using the global model and the performance for that class in a session in Figure 4.3. In both cases, I see that there exists a strong relation between the two, suggesting that a higher posterior probability is related to increased classification performance.

Figure 4.3: Relation between posterior probability and class-specific classification performance for two different language combinations from the CALLFRIEND corpus. (left) Farsi vs. German: $R^2$ = 0.53, (right) French vs. Hindi: $R^2$ = 0.49

4.2 Entropy Filtering for Selecting Robust Enrolment Data

The global classifier $S_0$ is trained using the labeled utterances from all sessions. In order to avoid overfitting, I do not aim for perfect classification accuracy on the training set. Instead, if the labeled segments of a session are incorrectly classified by $S_0$, I take this as an indicator that the classification performance of $S_0$ on the unlabeled segments of that session is also likely to be poor. In other words, the confidence of classification for the labeled example $x_i^C$, $C \in [1, K_i]$, from session $i$ can indicate the suitability of $S_0$ for class $C$ within the $i$-th session. I use it to control the number of labeled examples to retain from each class while classifying within session $i$. Using the class posteriors for the labeled examples obtained from $S_0$, I define an entropy-inspired score (after Shannon's entropy [186]) for each labeled example as follows:

$$
e_i^C =
\begin{cases}
\sum_{x \in \{p_i^C,\, 1 - p_i^C\}} x \log(x), & \text{if } p_i^C \geq 0.5 \\
2\log(0.5) - \sum_{x \in \{p_i^C,\, 1 - p_i^C\}} x \log(x), & \text{if } p_i^C < 0.5
\end{cases}
\qquad (1)
$$

where $p_i^C$ is the posterior probability from $S_0$ for $x_i^C$.
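A minimal numerical sketch of the score in (1) is given below; the clipping constant is an implementation detail added only to avoid log(0) and is not part of the formulation above.

```python
import numpy as np

def entropy_score(p):
    """Asymmetric entropy-inspired score e_i^C of Eq. (1) for a posterior p = p_i^C."""
    p = float(np.clip(p, 1e-12, 1 - 1e-12))          # guard against log(0)
    h = p * np.log(p) + (1 - p) * np.log(1 - p)      # negative Shannon entropy, in [-log 2, 0]
    return h if p >= 0.5 else 2 * np.log(0.5) - h

# exp(e_i^C) is the fraction of labeled segments retained from other sessions:
# a confident, correct posterior (p -> 1) keeps everything, p = 0.5 keeps half,
# and a confidently wrong posterior (p -> 0) keeps the floor of 0.25 noted below.
print([round(np.exp(entropy_score(p)), 3) for p in (1.0, 0.9, 0.5, 0.1, 0.0)])
# [1.0, 0.722, 0.5, 0.346, 0.25]
```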
The number of labeled segments belonging to class $C$ from other sessions to retain while classifying session $i$ is:

$$N_i^C = N \, e^{\,e_i^C}$$

The closest $N_i^C$ segments are selected using a distance measure. Note that the entropy score in (1) is asymmetric about $p_i^C = 0.5$, unlike regular entropy. This ensures that $e_i^C$ is monotonically increasing with $p_i^C$ (Figure 4.4). This behavior is preferable, since a confident and correct classification of $x_i^C$ should favor retaining most of the data used in $S_0$, and vice versa. Further, the formulation in (1) ensures that the minimum fraction of labeled data that will be retained is 0.25. This natural regularization guarantees that I do not remove all the data from any class before classification.

Figure 4.4: Illustrating entropy filtering: the relationship between $p_i^C$ and $\exp(e_i^C)$, the fraction of enrolment samples retained

4.3 Iterative Bootstrapping for Self Learning

Since $S_0$ may not capture the session characteristics of every session, I classify in an iterative manner while adapting to the session's data. At every iteration, I augment the set of labeled examples by selecting the most confident unlabeled segment (based on the posterior probability) along with the label predicted by the current model. The model is then re-trained at the end of each iteration. The procedure is presented in Algorithm 2.

Algorithm 2: Iterative bootstrapping
1: Input: $S_0$: global classifier; $X_{train}$: labeled features; $Y_{train}$: labels; $\{x_{ij}\}, j \notin J'_i$: unlabeled segments from session $i$; $T$: uncertainty threshold
2: while all segments are not classified and $\max(p_j^C) > T$ do
3:   obtain posterior probabilities $p_j^C$ for the unlabeled data using the current classifier (initially $S_0$)
4:   $[j'', C''] = \arg\max_{j,C}(p_j^C)$
5:   $X_{train}$.append($x_{ij''}$)
6:   $Y_{train}$.append($C''$)
7:   re-train the classifier on the augmented ($X_{train}$, $Y_{train}$)

4.3.1 Classification of uncertain segments

Self-training is known to suffer in later iterations, as the confidence of the most certain unlabeled segment keeps decreasing. To avoid this, I use the posterior probabilities from $S_i$ to identify uncertain segments and fall back to a simpler classification method for such examples. While it is straightforward to use the posterior probability as an uncertainty metric for two classes, I use the entropy of the posterior in the case of multiple classes. Cosine distance has been used for score computation with i-vectors in many applications [43, 190, 115]. Hence, I classify the uncertain segments by assigning them to the nearest labeled segment based on cosine similarity.

4.4 Experimental Results

4.4.1 System selection using baseline performance

I define a baseline classifier that does not perform session-level adaptation: the global classifier $S_0$ is used to classify all the sessions in the corpus. I use support vector machines for both $S_0$ and the session-specific classifiers in this work, since they are a popular choice for supervised classification. I use the baseline performance to optimize the number of Gaussian mixtures used when estimating the GMM-UBM, as well as the i-vector dimension. In the case of LID, I also choose the front-end feature representation between MFCC and SDC (Shifted Delta Cepstra), since the latter has been used recently due to its ability to capture temporal information [101]. While I use the five evaluation languages for parameter optimization in CALLFRIEND, for ADOS I resort to 20-fold cross-validation, since leave-one-session-out validation would be computationally expensive. Considering the size of the ADOS corpus, I also experiment with smaller numbers of GMM mixtures and i-vector dimensions in addition to the commonly used values. I use the unweighted average recall (UAR) as my performance metric for each session, which takes class imbalance into account [177]. I report the results as UAR averaged across sessions. The optimal combinations of i-vector dimension and number of UBM mixtures were found to be (400 and 2048) for CALLFRIEND and (20 and 256) for ADOS. These parameter combinations are used in the rest of this work.

4.4.2 Session-level adaptation strategies

I measure the effect of each of the adaptation strategies on classification performance. I also present all possible combinations of the strategies in Table 4.2 and study their contributions. A consistent increase in mean UAR is observed across the corpora and across different numbers of languages in the case of LID. Furthermore, the baseline performance decreases as the number of classes grows.

Table 4.2: Mean UAR for speaker clustering on ADOS and LID on CALLFRIEND. Results are reported separately for each number of languages (2-5) on CALLFRIEND. (E: entropy-based filtering, B: bootstrapping, D: cosine-distance based assignment for uncertain segments)

Method      ADOS     CF-2     CF-3     CF-4     CF-5
Baseline    87.62    68.68    55.35    47.07    41.41
E           89.36    73.13    60.58    52.56    47.94
B           88.23    69.83    56.18    47.64    42.38
D           92.25    71.23    55.11    44.99    38.66
E+D         92.90    74.51    61.68    53.09    48.62
E+B         89.87    75.28    60.57    52.30    46.64
B+D         92.25    70.58    56.34    47.51    42.06
E+D+B       92.91    76.81    61.32    52.69    46.98

Figure 4.5: Effect of the threshold for determining uncertain segments on mean UAR (in %), presented in combination with entropy filtering and bootstrapping for each corpus

Classifying uncertain segments using cosine distance (D) enhances classification accuracy when the number of classes is small, e.g., ADOS and up to 3 languages in LID, but the performance drops otherwise. This happens because labeled examples from the same class have different representations across different sessions. The entropy-based filtering of labeled data (E) gives the largest performance gains of all the strategies I employed. In contrast to distance-based classification (D), the performance gains grow with the number of classes: in the 2-language scenario I observe a performance boost of 6.48%, while I observe a 15.77% increase in the 5-language case. Entropy filtering reduces the amount of labeled segments used within a session, which suggests that simply adding more data does not necessarily imply better classifiers. It is more beneficial to retain labeled data from other sessions that are similar (in terms of cosine distance) to the labeled data from the current session.

Bootstrapping by itself provides only moderate gains in performance. The performance drops marginally for larger numbers of classes when bootstrapping is combined with distance-based classification, which can be explained in the same way as for distance-based classification (D) alone. However, combining bootstrapping with entropy filtering decreases performance for larger numbers of classes when compared to entropy filtering alone. I expect that the baseline performance (which serves as the initial step) influences the bootstrap strategy, since the latter is iterative in nature. Overall, the best clustering performance is obtained using a combination of all adaptation methods (E+B+D) for smaller numbers of classes, while entropy filtering followed by distance-based classification (E+D) is the best configuration for larger numbers of classes.

4.4.3 Effect of Uncertainty Threshold

In the last set of experiments, I look at how the uncertainty threshold influences the overall performance of my scheme. The threshold determines the amount of data that will be classified using the distance measure, as well as the amount of labeled data used while adapting $S_0$. I experimented with values between 0.5 (all examples are deemed certain by the classifier) and 0.9 (most examples are considered uncertain) with a step of 0.1, and present the results in Figure 4.5. I observe that in most cases the performance increases as the threshold value is increased towards 0.9. This is expected, since I classify only a small but confident subset of the data with the supervised classifier and hence minimize the errors from uncertain examples. However, the dependence is not uniform across the different combinations involving entropy filtering and bootstrapping. Combining cosine-distance based classification with bootstrapping (B+D) makes the system highly dependent on the threshold, suggesting that bootstrapping accumulates errors at each iteration as I continue classifying segments with lower confidence scores. This effect is somewhat ameliorated by entropy filtering (E+B+D, E+D), especially as the number of languages increases.

Since the performance increases monotonically with the threshold in most cases, I further fine-tuned the threshold between 0.9 and 1.0 (all examples considered uncertain) and present the results for two cases, ADOS and CALLFRIEND with 3 languages. From Figure 4.6, I observe that there exists a clear optimal threshold for ADOS, while the performance saturates in most cases for CALLFRIEND. Hence, while a large threshold favors better performance in general, further inferences might be corpus-specific.

Figure 4.6: Classification performance against the fine-tuned uncertainty threshold for ADOS and CALLFRIEND with 3 languages. The abscissa is scaled non-linearly for better visualization

4.5 Conclusions

In this chapter, I propose entropy filtering as an adaptation strategy to select reliable enrolment samples during child-adult classification. Entropy filtering selects a small but confident subset of labeled data using the posterior probability, which favors task performance over using a large number of uncertain examples. I find that the performance gains returned by entropy filtering increase with the number of classes, suggesting a promising approach for tasks with many classes. As a next step, I would like to automatically select labeled segments within an active learning setup, in order to make this methodology fully unsupervised. I would also like to investigate different non-linear functional forms in place of the entropy for selecting the labeled examples in the global model, and to analyze the robustness of the adaptation strategies, especially entropy filtering, to noise conditions.
75 Part III Interlocutor Context for ASR and Speaker Diarization 76 Chapter 5 LM Adaptations for Child ASR In chapters 2-4, I proposed child/adult classication systems that exploited background context, namely acoustic background conditions. I developed channel-invariant (meta- learning) and channel-specic (entropy-ltering) methods to improve classication and clustering performance. In this chapter and the next, I develop child speech processing techniques that include interlocutor context as a parameter. I dene interlocutor context as all expressive verbal and non-verbal behavioral cues of the adult participating in the in- teraction. An important distinction between background context and interlocutor context is that the former remains constant throughout a session while the latter is time-varying and in uences & in uenced by the child behavior. In this chapter, I use the language information of the interacting adult to improve child automatic speech recognition (ASR). Motivation Recent advances in deep learning have contributed to signicant improvements in speech recognition accuracy in the last decade [85, 71, 72]. However, improving ASR performance for child speech continues to be more challenging problem than for adult speech due to the 77 inherent variability and heterogeneity in speech signal [156, 112, 113]. Most approaches developed in the last decade [26, 155, 187, 73] to tackle these issues have been conned to read or prompted speech, with spontaneous speech receiving little attention from researchers. For a detailed review of child ASR approaches, please refer to Section 1.3. 5.0.1 Spontaneous Child Speech Spontaneous speech, especially in an interaction setting, tends to include a more diverse vocabulary and increased variability in speaking style and background conditions than prompted speech. Previous works on analyzing spontaneous child speech have been lim- ited to conversational agents as animated characters [12, 143, 80], in the role of instructor [154, 92, 182], data collected from search engines [116] or in computer-based reading assessment/training applications [13, 139, 38]. Analytic dierences were observed in the speech signal collected in spontaneous versus prompted manner. Specically, chil- dren were found to exhibit increased dis uencies, signicantly decreased vowel duration, shorter utterance durations and higher speaking rates which can possibly be attributed due to higher cognitive load [61]. Designing child ASR systems in such cases has tradi- tionally focused on task-specic pronunciation and language models to tackle the unique vocabulary, often making use of transcripts generated along with the corpus. In this chapter, I improve child ASR in two real-world application domains: forensic interviews and play-based, interactive sessions of children with ASD with an adult social partner. In forensic interviews [109, 83], a trained interviewer questions a child about sus- pected criminal victimization, usually child sexual or physical abuse. The interviews are 78 held outside a courtroom setting, and are aimed at reducing potential trauma while max- imizing incident recall as well as reducing suggestive and leading question types. Obser- vations of autism symptoms in clinical settings often involve play-based, semi-structured contexts created by an adult social partner (e.g., clinician) such as the Autism Diagnostic Observation Schedule (ADOS; [118]) and the Brief Observation of Social Communication Change (BOSCC; [77]). 
These sessions are conducted in an interactive manner and the child's responses are evaluated based on multiple socio-communicative categories, which are further combined to quantify the level of autism symptom severity. Both domains contain spontaneous child speech which are typically characterized by short utterance durations and produced either under signicant cognitive load (forensic interviews) or social demand (ADOS and BOSCC sessions, which probe for social communication and interaction characteristics). I introduce a modeling framework for exploiting context from the spoken language of the interlocutor during such sessions. I explore incorporating contextual information from the semantics using a novel methodology. Specically, I train a neural network based model directly on the word representations. I employ a sequence-to-sequence model [202] (seq2seq) which is trained using the adult's speech as encoder inputs and child's response as decoder outputs. At test time, contextual utterances are fed into the network to produce hypotheses for the current utterance, which are incorporated into a dynamic language model (In the rest of this work, an utterance is dened operationally as a speaker turn). I also make use of context using the lexical information, and study the combination of both systems. I investigate the eect of strength and direction of context for both lexical and semantic cases. Finally, I study the eect of external factors (age, 79 utterance length and WER of adult hypotheses) on the improvement of my adaptation methods. The rest of the chapter is organized as follows: I review previous approaches for lexical and semantic adaptation, and use of context for ASR adaptation in Section 5.1. I present the two domains of evaluation: autism diagnosis sessions and forensic interviews in Section 5.2. In Section 5.3, I describe the two modes of context transfer: token matching for lexical transfer and seqeunce-to-sequence models for semantic transfer. I propose a way to incorporate long-term context from the interlocutor utterances into seq2seq model. In Section 5.4, I create a compeitive baseline child ASR from scratch. I also describe the experimental setup for domain adaptation, global context adaptation and local context adaptation. Results and discussions for each of these experiments are presented in Section 5.5. Conclusions and directions for future work follow in Section 5.6. 5.1 Background 5.1.1 Lexical Adaptation Language model adaptations for ASR have been studied previously in the context of human-machine interactions, especially those involving conversational agents [143]. A majority of studies consider topics/classes - either a broad categorization of word units, or task-specic groupings (e.g., music, travel, etc). A large, out-of-domain class-independent language model is rst trained and (typically linearly) interpolated with smaller class specic models. The interpolation weights can be set using hyper-parameter search, es- timated using a development set [195] or predicted using a deep neural network trained 80 to minimize perplexity on a development set [162]. To mitigate the availability of class- labeled transcriptions, [212] included unlabeled data for training class-specic models through iterative re-estimation. However, training multiple disjoint models often leads to data fragmentation. To tackle this, [76] trained context-based dynamic classes for language model adaptation by borrowing the dialog state information from the conver- sation agent. 
An initial class-based language model is still required, where semantically similar units (e.g., names of dierent airlines) share n-gram statistics. In the context of this work, the diculties associated with training such a model are two fold: 1. Labeled, spontaneous child speech corpora are scarce, let alone with class-labeled transcripts, and 2. Concept of classes can be very dierent for child speech than adult speech (for instance, class distributions and labels might dier) and has not been explored yet. 5.1.2 Semantic Adaptation Semantics refers to the meaning or concepts conveyed in language, rather than surface words. Semantic information in an utterance can be represented using parsers, such as a hierarchical grouping constructed using semantic tags from the word level to the sentence level (full parser) [47]. While decoding a test utterance, either a full parser or shallow parser (single level of hierarchy - one semantic tag per word) is rst used to obtain an LM score. This is then combined with LM score computed using statistical word-level (n- gram) language model using maximum entropy modeling. Combining information from semantic and lexical sources has been shown to achieve WER improvements for a spoken dialog system in three dierent domains. Semantic information adaptation can also be achieved using topic models [86] wherein a training data set is used to estimate a latent 81 set of topics. Topic-dependent language models can then be adapted from a base model using interpolation. Alternatively, semantic relations between words can be represented using shared occurrence/proximity in a latent space (Latent Semantic Analysis; LSA [66]). However, both topic-models and LSA require substantial in-domain training corpus to reliably estimate topics. 5.1.3 Use of Context for ASR Adaptation A majority of ASR have treated `context' as external sources beyond the conversation itself, such as time of the day 1 [135, 219, 150]. This has been possible since context sources are reliable and fairly easy to obtain. For instance, obtaining n-best ASR hypotheses can be reliable for adult speech especially in controlled channel conditions, while state information for agent dialog (during human-agent conversations) are already available within an application. In contrast, child-adult interactions are often collected in diverse environments and estimating topics/classes from ASR hypotheses for child speech can prove unreliable. Moreover the child speech may be produced under communication diculty or impairment. For the above reasons, I restricted the context source to only the adult speech. 5.2 Evaluation Corpora 5.2.1 Forensic Interviews I looked at forensic interviews conducted with 30 children (each child had one FI con- ducted, i.e one session). Time-stamps at every two minutes were marked in the transcripts 1 Note that `context' here denotes information external to the decoded utterance, and hence does not include spliced frames 82 and used to split the session into smaller segments. Speech-to-text alignment is performed within each segment and this information is used to segment the utterances at turn level. All turns were manually checked for errors before being used in the experiments of this work. 5.2.2 Play-based, naturalistic sessions for children with ASD with an adult social partner I consider a dataset of 1 ADOS (Autism Diagnosis Observation Schedule [119]) and 21 BOSCC (Brief Observation of Social Communication Change [77]) sessions obtained from 4 dierent clinical sites. 
The ADOS is a semi-structured interaction designed to examine social communication and repetitive and restricted behaviors for a diagnosis of ASD. The ADOS takes about 40-60 minutes to complete. The BOSCC is a treatment outcome measure designed to capture subtle changes in social communication in children with ASD over the course of intervention. The BOSCC takes about 12 minutes to complete. Both the ADOS and the BOSCC create a naturalistic setting in which to examine social communication for children with ASD while they interact with an adult social partner. The sessions were manually annotated for both speaking times and transcripts. As with FI, the transcripts were broken down into turns and manually checked for errors. Further details are presented in Table 5.1.

I note that in both domains, ground truth for speech activity detection and speaker diarization is assumed to be available. This is not the case for most recordings in natural, real-world settings; however, I have chosen this setup in order to exclusively analyze ASR errors.

Table 5.1: Statistics (μ ± σ) for child speech in Forensic Interview (FI) sessions and Autism Spectrum Disorder (ASD) sessions. *Session duration for ASD is averaged across the ADOS (n=1, usually lasting 40-60 minutes) and BOSCC sessions (n=21, usually lasting 12 minutes)

                           ASD                 FI
Session Duration (min)     16.50 ± 6.52*       45.00 ± 17.71
Age (yrs)                  9.28 ± 3.12         8.56 ± 3.14
Number of utterances       29.45 ± 16.55       37.3 ± 16.28
Turns/session              40 ± 22.45          63.25 ± 24.25

5.3 Methods

In this section, I describe the methods used to incorporate context from the interlocutor's speech. I borrow the contextual information from two sources, lexical and semantic. This information is then used to adapt a base language model to create context-adapted models. Mathematically, consider a child-adult conversation $\{P_1, C_1, P_2, C_2, \ldots, P_T, C_T\}$ with $T$ utterance turns, where $P_t$ and $C_t$ represent the interlocutor and child utterances at turn $t$, respectively. I define a turn as a speaker-homogeneous segment consisting of a complete query (interlocutor) or response (child). The goal is to improve the ASR performance for $C_t$ using the information from $P_t, P_{t-1}, \ldots$ and/or $P_{t+1}, P_{t+2}, \ldots$.

5.3.1 Lexical Context: Token Matching

In this method, I look for exact repetitions of context words in the child speech. Repetitions form an important component of reflective or active listening, serving to confirm and/or clarify what was spoken. Conversational excerpts showing repetitions are presented in Figure 5.1. Note that repetitions occur in both directions; for example, the child confirms what was asked and the interlocutor confirms what was said, hence there are benefits to searching for context in both the forward and backward directions. From the selected context window and direction, I extract n-grams (up to trigrams) and modify the base language model as explained in Section 5.3.3.

Figure 5.1: Transcript excerpts (top: Forensic Interviews; bottom: ASD session) illustrating information flow between the speech of the adult and the child. Child phrases similar to contextual adult phrases are indicated in blue. Directional flows from adult-to-child and child-to-adult are indicated using green and red, respectively

5.3.2 Semantic Context

Clearly, not all context can be obtained through repetitions alone. Semantics (referring to the concept or meaning of a sentence) offers important broader context and, in a coherent conversation, accounts for a large portion of the inter-dependence between utterances. As mentioned earlier, semantic information transfer can be implemented using parsers, topic models or LSA. However, it is often non-trivial to define a discrete set of concepts even within a limited time window during an open conversation. Further, topic models and LSA ignore the order of words within a document (in this case, an utterance), which can prove important.

In this work, I use a sequence-to-sequence (seq2seq) network to model the semantic information transfer. A seq2seq network [202] maps one variable-length input sequence to another variable-length output sequence through a fixed-dimensional embedding vector (also known as a thought vector or state vector), as opposed to supervised classification, which uses fixed-size output labels. Neural networks (often recurrent) are used to map the input sequence to the embedding (encoder) and the embedding to the output sequence (decoder). Seq2seq networks take advantage of the learning power of neural networks with minimal assumptions on the data structure, as demonstrated in tasks like speech recognition [159], neural machine translation [202] and conversational agents [211], my approach being inspired by the latter.

Training the seq2seq model

I train the seq2seq network with the adult utterance $P_t$ as input and the child utterance $C_t$ as output, i.e., the network is trained to predict the child's responses from the context. I first convert the words into a one-hot representation using a vocabulary built from the most frequent words; i.e., for a vocabulary containing $V$ words, each word is converted into a $V$-dimensional binary vector $\vec{w}$ containing zeros in all dimensions except the one uniquely identifying the word. Thus, each utterance is represented by a matrix of dimension $V \times N$, where $N$ is the number of words in that turn. Both child and adult utterances are padded with zero vectors up to a fixed length. During training, the adult speech utterance is input to the encoder one word at a time. Each encoder unit takes as input the current word and the cell state from the previous unit. The cell state from the final encoder unit (an embedding of fixed dimension) encompasses the semantic information from the adult speech utterance. This embedding is input to the first decoder unit along with a special start token (see Figure 5.2). Successive decoder units take the words from the child utterance as input, while the cell state is propagated through the utterance in the decoder. At the output of the $n$-th decoder unit, the cross-entropy loss is computed between the softmax activations and the one-hot representation of the word spoken by the child. The overall loss for the turn is:

$$\text{loss} = \sum_{n=1}^{N_C} \text{CE}\big(\vec{w}_{C_t,n},\ \text{softmax}(h_n)\big) \qquad (5.1)$$

where $h_n$ represents the output of the $n$-th decoder unit and $\vec{w}_{C_t,n}$ represents the $n$-th word of $C_t$.

During inference, the adult utterance is input to the encoder to obtain the embedding. As in training, the start token is fed to the first decoder unit along with the embedding vector. However, I randomly sample from the top words of the output distribution to feed into the next decoder unit. This continues until the end token is encountered or the maximum sequence length is reached. Unlike greedy decoding (using only the most likely word at every unit), sampling enables us to obtain multiple hypotheses for the child speech.

Incorporating context in the seq2seq model

It is often meaningful to incorporate the history of the conversation while predicting the current utterance.
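Before turning to conversation history, the single-utterance training and sampling described above can be sketched in a few lines. This is only an illustrative sketch: it uses PyTorch with an embedding layer in place of the explicit one-hot inputs, toy tensors in place of real transcripts, and hypothetical dimensions; it is not the exact configuration used in this thesis.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder over word indices, mirroring the loss in Eq. (5.1)."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, adult_ids, child_in_ids):
        # Encode the adult turn; the final hidden/cell state is the "thought vector".
        _, state = self.encoder(self.emb(adult_ids))
        # Teacher forcing: the decoder sees the child turn shifted right (<START> first).
        dec_out, _ = self.decoder(self.emb(child_in_ids), state)
        return self.out(dec_out)          # logits over the vocabulary at every step

# Toy batch: two adult turns of length 6, two child turns of length 5 (index 0 = padding).
adult_ids = torch.randint(1, 5000, (2, 6))
child_in_ids = torch.randint(1, 5000, (2, 5))       # <START> + child words, shifted right
child_target_ids = torch.randint(1, 5000, (2, 5))   # child words + <END>

model = Seq2Seq(vocab_size=5000)
criterion = nn.CrossEntropyLoss(ignore_index=0, reduction="sum")   # summed, as in Eq. (5.1)
logits = model(adult_ids, child_in_ids)
loss = criterion(logits.reshape(-1, logits.size(-1)), child_target_ids.reshape(-1))
loss.backward()
```

At inference time, the same decoder is unrolled one step at a time, sampling from the top of each softmax rather than taking the argmax, which is what yields the multiple child-utterance hypotheses later used for language model adaptation.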
Previous approaches to incorporate long-term context include con- catenating multiple utterances in the encoder [226] and using a two-step (hierarchical) 87 E 1 E 2 E Np E 3 w Pt,2 w Pt,1 w Pt,3 w Pt,Np D 1 D 2 D Nc+1 D 3 w Ct,1 w Ct,2 w Ct,Nc <START> <END> Interlocutor utterance (encoder) Child utterance (decoder) w Ct,1 w Ct,2 w Ct,3 (a) E 1 E 2 E Np E 3 w Pt,2 w Pt,1 w Pt,3 w Pt,Np D 1 D 2 D N D 3 <START> <END> Interlocutor utterance (encoder) ȟ 1 Ct, 1 ȟ 1 Ct, 1 ȟ 3 Ct, 2 ȟ 2 Ct, 1 ȟ 3 Ct, 1 ȟ 1 Ct, 2 ȟ 2 Ct, 2 ȟ 3 Ct, 2 ȟ 1 Ct, 3 ȟ 2 Ct, 3 ȟ 3 Ct, 3 Child utterance (decoder) Sampling from most likely words (b) Figure 5.2: (a) Training a seq2seq model with a single interlocutor utterance at the encoder and target child utterance at the decoder. At each decoder unit, words from child utterance are used to compute cross-entropy loss (b) Inference using a trained seq2seq model. At each decoder unit, the top hypotheses for next word are sampled and fed into successive decoder unit until the <END> token is encountered. model [181] where the rst network encodes individual utterances, and a second network is trained to model previous utterances to the decoder output. In this work, I use a variant of the recently proposed context-aware training [33] where the decoder states are passed through a conversation. Specically, I use the same architecture in Figure 5.2, but feed the nal decoder state of current turn to the rst encoder of next turn. I use the following representation of a training step: (P t ;x)! (C t ;y) (5.2) 88 where P t represents an encoder input sequence, x is cell state to rst encoder unit, C t is decoder output sequence and y is the output from the last decoder unit. I illustrate below the proposed iterative context-aware training for a conversation dened earlier as fP 1 ;C 1 ;P 2 ;C 2 ;::P T ;C T g. Context-Independent Training: During the rst step of training,x is an empty vector representing no information from previous turns. (P t ; )! (C t ;e t;0 ) 8t2f1; 2;::;Tg wheree t;0 denotes the decoder output for turnt and no previous turns input through the encoder Single-Context Training: At the next step, decoder outputs from context-independent training are fed as input state to the rst encoder unit while training successive ut- terances: (P t ;e t1;0 )! (C t ;e t;1 ) 8t2f2;::;Tg Note that e t;1 contains information from turns t and t 1 through P t and e t1;0 respectively By feeding in e t;1 as input to the rst encoder unit and so on, the above steps can be continued by increasing the context available. I note that although each input-output pair (P t ;C t ) is seen multiple times by the network, dierent amounts of context are made available through the rst encoder unit input. In this work, I train the seq2seq model with upto 3 context utterances (e t;2 ). I note that longer context windows can possibly 89 be used, however the propagated context information might diminish across utterances. At the time of testing, I do not have access to neighboring child utterances. Hence, I resort to context-independent testing. I generate multiple hypotheses for the current child speech utterance C t using sampling. Similar to lexical context, I control for the length and direction of context and extract n-grams from these hypotheses to adapt a base language model. 
5.3.3 Linear Interpolation Consider a fairly-large n-gram based language model L base trained using out-of-domain text and a small context modelL context estimated using adult context utterances or child utterance hypotheses from the seq2seq network. Both L base and L context are assumed to share a common pronunciation. The goal is to estimate an adapted model that includes context information while at the same time does not overt. I create an adapted model L adapt as follows: P (wjL adapt ) = 8 > > > > > > > > > < > > > > > > > > > : P (wjL base ) + (1)P (wjL context ) w2L base \L context P (wjL base ) w62L context (1)P (wjL context ) w62L base where w represents an n-gram. 5.4 Experiments I present my experimental framework in three dierent steps: 90 Baseline Adult ASR Model Baseline Child ASR Model Domain Data LM Adaptation AM + LM Adaptation Adult ASR Hypotheses Context- Filter Seq2seq Model LM Adaptation Child ASR Hypotheses Domain- Adapted Models Domain- Adapted Models Session- Adapted Models Global Context Lexical Context Semantic Context Context Adaptation Framework Figure 5.3: Proposed context adaptation framework Baseline Child ASR: In Section 5.4.1, I describe the steps involved in building a robust baseline ASR for child speech, including corpus selection, noise and reverb augmentation and training a hybrid DNN-HMM acoustic model. I validate my model using an out-of-domain child speech corpus, namely CIDMIC ([112]). Domain Adaptation: I build separate models for both FI and ASD in Section 5.4.3 by ne-tuning the baseline models to domain specic acoustic conditions and vocabulary. All context adaptation results are presented against domain-adapted models. Context Adaptation: I analyze adaptation to the interlocutor's utterances at global (using all utterances from session) and local (using only neighbouring utter- ances) in Section 5.4.4. In the case of latter, I experiment with two dierent types 91 Table 5.2: Details of child speech corpora used in training and evaluation of baseline ASR model. *CSLU statistics are computed after speech-to-text alignment. Corpus # Kids Age Range (yrs) Speaking Style # Utts Size (hrs) CHIMP [143] 130 4-11 Spontaneous 4457 3.2 CMU Kids [48] 73 6-11 Prompted 2120 3.1 CSLU* [189] 1093 5-15 Read, Prompted 14238 14,8 CU Kids [37] 1084 6-11 Read 32776 45.9 TIDIGITS [114] 101 6-15 Prompted 3371 2.8 CIDMIC [112](eval) 419 7-17 Prompted 4800 3.27 of context: at the lexical and semantic level (Section 5.5.3). I further analyze the amount of context by controlling for the number of utterances (Section 5.5.4). Fi- nally I use statistical analysis to examine the eect of external factors on adaptation results in Section 5.5.5. An overview of the entire system is presented in Figure 5.3. 5.4.1 Baseline child ASR As outlined in Section 5.1, child ASR models perform better when trained using children speech. I trained a baseline model based on deep neural networks that is further adapted separately at domain and session level. I use multiple child speech corpora spanning a wide range of age groups, background conditions, speaking styles (read, prompted, spontaneous), topics (video games, educational, generic) and utterance durations (isolated digits to stories). Except TIDIGITS, all the corpora used here were specically designed for developing child ASR systems. In the case of CSLU, utterances were of relatively long duration (=99.8s,=34.4s) and hence speech-text alignment was performed to segment the audio into smaller utterances. 
Only successfully aligned words were used for training purposes. 92 I resample audio from all corpora to 16kHz sampling frequency. All utterances shorter than 2 seconds or longer than 20 seconds were removed to aid training. Utterances that in- cluded noise or silence labels in the transcripts were also removed. Further, corpus-specic dis uencies and paralinguistic labels were mapped onto a common out-of-vocabulary (OOV) unit. Details of each corpus after pre-processing steps are presented in Table 5.2. Since the combined size of training corpora is only 70 hours, I experimented with two augmentation techniques in this work. First, I perturb the fundamental frequency (F0) of the audio without modifying the phonetic component. This is motivated by the fact that children have smaller vocal folds than adults which is manifested by higher F0. This dierence is minimized above the age of 11-12 years [112]. Hence, introducing modest pitch variations can potentially mimic a larger sample of children. For every utterance in the training set, I randomly choose a factor between [0.9,1.1] to scale the F0 for the entire utterance. I repeat this process 9 times to generate 700 hours of clean speech. Next, I randomly introduce noise and reverberation conditions following [99]. A variety of real and simulated room impulse responses (RIRs) are added to the audio to simulate various background conditions, a complete description of this augmentation method can be found in [99]. I generate approximately 2000 hours of reverberation augmented data this way. Since pre-DNN training steps typically saturate in performance with the size of training data, I include the noise and reverb-augmented data only during DNN training. Pre-DNN training: I extract 13-dimensional Mel Frequency Cepstral Coecients (frame width 25ms, frame shift 10ms) as front end representations. Using the Kaldi [157] speech recognition toolkit, I train the acoustic model beginning with monophone 93 GMM system through hybrid DNN-HMM in an iterative manner. At each step, align- ments obtained using the previous model are used to initialize the current model. I used adaptation techniques such as linear discriminant analysis (LDA) and speaker adaptive training (SAT) using feature-space maximum likelihood linear regression (fMLLR) trans- forms. I use the SAT system to obtain alignments for the entire data for use in DNN training. DNN Training: I train a hybrid HMM-DNN system where a time-delay neural network (TDNN) is used to estimate posterior probabilities for context-dependent HMM states. TDNNs have been shown to model long-term temporal dependencies as eectively as recurrent architectures with signicantly less training times [152]. I use 6 hidden layers in the network with 512 units at each layer. At the input to the network, bidirectional context of 2 frames is appended, followed by splicing at osetsf0,0g,f-1,1g,f-2,2g,f- 3,3g,f-3,3g,f-6,0g (refer to Figure 1 in [152]) at each successive layer. The network is trained for 10 epochs using stochastic gradient descent updates. In addition, I train another network where the pre-nal TDNN layer is replaced with self-attention [158]. Language Modeling: I estimate a tri-gram language model using the CMU dic- tionary as pronunciation lexicon and Whitten-Bell smoothing for unseen units. 
Since transcripts from entire collection of child speech corpora consisted of only 800K to- kens, I experimented with additional transcripts from adult speech corpora (Librispeech, TEDLIUM, ICSI, Fisher and WSJ) containing 35M words. I attempted interpolating the two models, but found that child speech transcripts negatively impacted performance. 94 I evaluate the TDNN and Self-Attention systems on the CIDMIC corpus and provide the results in Table 5.3. I note that the WER is comparable to other recent systems evaluated on a subset of CIDMIC in previous studies [187, 188]. Table 5.3: Word error rates (%) for baseline models applied on CIDMIC and child speech portion from Forensic Interviews (FI) and Autism diagnosis session (ASD) Model TDNN Self-Attention CIDMIC 29.50 29.76 FI 63.33 62.53 ASD 76.69 76.23 5.4.2 Baseline adult ASR I use the ASPIRE model 2 as an o-the-shelf ASR system for adult speech. The model is trained on an augmented version of the Fisher English corpus, using an augmentation method similar to Section 5.4.1. The model uses a time-delay network with intermediate BLSTM layers as part of its acoustic model. I observed a WER of 26.18% (Forensic Inter- views) and 33.15% (Autism Sessions) using this model, and use the obtained hypotheses for adaptation purpose. 5.4.3 Domain Adaptations For the rst adaptation step, I extend both the adult and child baseline ASR models to their respective domains. Acoustic adaptation adapts baseline models to domain-specic channel conditions while language model adaptation assigns higher weights to domain- specic vocabulary. Similar to previous work on domain adaptation [104], I perform LM adaptation for the adult speech model and AM+LM adaptation for the child speech 2 http://kaldi-asr.org/models/m1 95 model. During LM adaptation, I estimate a new language model by interpolating the base model with an in-domain model following Section 5.3.3 and choosing so as to minimize WER. N-gram weights from the interpolated model are used to rescore lattices generated during baseline decoding. This amounts to re-ranking lattice paths generated using the baseline model against computing a new set of paths from the audio and decoding graph. I choose this method since constructing the decoding graph can be a time-intensive process especially during multiple folds of validation. For the case of AM adaptation I train the baseline hybrid DNN-HMM model for a single epoch using the pre-trained network as an initialization point. Training is restricted to a single epoch to avoid over-tting. Although alternative stopping criteria can be explored, I did not experiment with them since this was not the focus of the current study. I implement domain adaptation using a leave- one-session-out manner, wherein a particular session is used for evaluation and all other sessions are treated as adaptation data. For the purpose of this work, domain adapted models represent a ne-tuned comparison against which I present my context adaptation improvements. 5.4.4 Context Adaptation I use the interlocutor language within every session as the context source. I categorize context adaptation as global: use of all utterances in the session, and local: use of only utterances neighboring the current child utterance to be decoded. Global context helps understand the in uence (if any) of commonly occurring phrases/concepts throughout the session, while local context makes a trade-o between the amount and quality of context. 
I group local context adaptation by the type: lexical and semantic (see Sections 5.3.1 and 96 5.3.2). For each type, I investigate the direction (forward/backward) and length of context (number of utterances). Context direction helps understand the relative importance between child repetitions (forward) and clarications by the interlocutor (backward). For all of my experiments, I perform the adaptation using the interpolation method outlined in Section 5.3.3 and set to 0.5. I present results using word-error-rate (WER) and language perplexity. 5.5 Results and Discussions 5.5.1 Domain Adaptation From Table 5.4, I note an overall improvement in word error rate and perplexity across the two domains considered. However, there exist a few dierences between the domains. First, child speech from ADOS and BOSCC sessions has a signicantly higher baseline WER and perplexity. The dierence can be partly explained by the fact that idiosyncratic speech is a well-known observation in speech of children with ASD, for example neologism (creating new words) and atypical prosody [74]. Hence, spoken language abnormalities are an additional complexity over segmental acoustic variabilities associated with children's speech. Next, utterances from FI are signicantly longer than ASD (p < 0.001) and with signicantly higher signal-to-noise ratio (in dB, p < 0.001). The above reasons also explain why acoustic adaptation using audio from other sessions does not provide any improvement in WER for ASR. However, there exists a clear improvement in the case of FI which are relatively free of language abnormalities. LM adaptation provides a clear WER improvement, with the absolute increase more for FI. This is also re ected 97 in considerable perplexity reduction. The performance dierence between baseline and LM-adapted models are also re ective of adult speech corpora being used to estimate language models in the former. Table 5.4: Improvements to word-error rate and perplexity scores for domain adaptation. *The AM-adapted model was discarded and baseline model used instead for all further experiments Model Forensic ASD WER (%) PPL WER (%) PPL Baseline 62.53 217.09 76.23 335.08 Domain AM 59.10 - 77.52* - Domain AM+LM 56.10 156.16 73.94 234.06 5.5.2 Context Adaptation For both FI and ASD, although global context oers signicant WER improvements over the baseline (Table 5.5), that is not the case with respect to domain-adapted models (Table 5.4). Hence, using all interlocutor utterances in a session during adaptation does not provide any signicant gains over domain knowledge. However, signicant gains are observed in perplexity values for both FI and ASD domains. This suggests that while contextual language from the entire session oers an informative channel, the acoustic model might be unable to account for degraded audio conditions and hence not result in signicant WER improvements. Alternatively, the trained speech acoustic model might be insucient to capture variations in speech of children with pathological conditions. To get an estimate of upper bound to improvement from global context, I replace the adult ASR hypotheses used during adaptation with the ground truth transcripts (Session LM - Oracle). I do not observe signicant WER improvement for this case either, suggesting that the ASR hypotheses are robust enough for the global adaptation. 
98 Table 5.5: Global context adaptation using ASR hypotheses (Session LM) and ground truth transcripts (Session LM - Oracle) Model Forensic ASD WER (%) PPL WER (%) PPL Baseline 62.53 217.09 76.23 335.08 Session LM 55.79 141.18 73.79 204.86 Session LM - Oracle 55.51 131.11 73.42 182.31 5.5.3 Eect of context type: I recall that lexical context adaptation uses n-grams directly from ASR hypotheses of adult context utterances while in semantic context adaptation, the context utterances are passed through a seq2seq model and n-grams from decoder hypotheses are used for adaptation. While analyzing the type of context adaptation in Table 5.6, I x the number of context utterances in each direction at 3. Table 5.6: Utterance-level adaptation results. For each method and corpus, results are reported for forward (F), backward (B) and bidirectional (Bi) directions of context adap- tations. GT-Oracle represents adaptation using ground truth transcripts Method Forensic ASD WER (%) PPL WER (%) PPL F B Bi F B Bi F B Bi F B Bi Baseline 62.53 217.09 76.23 335.08 AM+LM Domain Adapted 56.10 156.16 73.94 234.06 GT-Oracle 52.13 52.28 51.60 129.49 140.8 125.70 70.31 70.34 69.86 190.33 199.28 176.32 Lexical 52.32 52.38 51.82 133.43 142.63 131.45 70.54 70.47 70.32 202.66 202.26 187.89 Semantic 52.13 52.26 - 148.87 148.59 - 70.29 70.75 - 217.35 214.89 - Combined 52.05 52.02 - 134.12 143.49 - 70.31 70.45 - 195.47 206.29 - Lexical Context Adaptation: I observe that local context adaptation consistently outperforms the out-of-domain baseline as well as ne-tuned domain adapted model. Both perplexity and WER show signicant improvement for all directions in both FI and ASD. However, similar to domain adaptations, the magnitude of improvement is higher in 99 FI, which suggests that context adaptation is still dependent on baseline performance and data complexity, i.e., a better performing baseline results in larger improvements from adaptation. Bidirectional adaptation (where adult ASR hypotheses from both directions are used for LM adaptation) outperforms individual directions in terms of both perplex- ity and WER. However, I note that this improvement is not statistically signicant over any individual direction. Moreover the amount of context available is twice than either forward/backward, hence a straightforward comparison may not be appropriate. This suggests that both repetitions by the child (response) and interlocutor (clarications, follow-up, etc) are equally important when it comes to eect on adaptation. Semantic Context Adaptation: Using decoder outputs from the seq2seq model shows improvement similar to, but slightly less than the lexical adaptation for both perplexity and WER, and both FI and ASD. I note two important dierences in the way multiple context utterances are handled by the lexical and semantic adaptations. First, lexical adaptation uses the entirety of context utterances and in an order-agnostic manner (since only n-gram counts are borrowed). The seq2seq model is restricted by the number of encoder units when multiple utterances are concatenated. For instance, during forward context adaptation of length 3 for child utterance C t , the rst word from utterance P t2 is fed at the rst encoder unit. Hence, the amount of context actually seen by the seq2seq model may be smaller than the context length. Second, the current seq2seq architecture cannot incorporate both directions of con- text in the encoder in an unbiased manner, hence rendering bidirectional adaptation not 100 possible. 
Nevertheless, semantic adaptation provides signicant improvement over the domain-adapted models. To investigate any complementary information from the lexical and semantic models, I combine them at the hypothesis level. I augment the adult ASR hypotheses (lexical) with the outputs from seq2seq model (semantic) and present results for the combined method in Table 5.6. I observe marginal improvement from combined adaptation in majority of cases, while even exceeding the oracle inputs with respect to WER for FI. Hence, while combined adaptation may not always be optimal, preliminary ndings hold promise. 5.5.4 Eect of context size I study the eect of amount of context used for adaptation, measured using the number of context utterances. A larger context window however can contain both useful and non- useful information, and hence may not necessarily be useful for adaptation. In Figure 5.4, I provide results for absolute perplexity and WER for lexical adaptation. As mentioned above, increasing context during semantic adaptation has a high possibility of truncating utterances closer to the target, hence I do not present results for the semantic or combined adaptations. In all cases, I vary the number of context utterances from 1 to 10, again noting that the bidirectional case receives twice the amount of context compared to forward/backward directions. For comparison, I repeat the experiments using oracle transcripts. I observe that the perplexity for FI shows a slight degradation with increase in context, suggesting that the noisy information dominate during adaptation. This is however not the case with backward adaptation, where there is no clear increase/decrease. Both oracle 101 transcripts and ASR hypotheses result in the same eects, although the former results in reduced perplexity in most cases. A small context window (2 or smaller) seems to be optimal for both measures - perplexity and WER. In case of ASD, I notice that a window length of 2 provides the best perplexity values, either saturating or worsening with larger context windows. 5.5.5 Eect of external factors In this part, I perform a statistical analysis on the adaptation results obtained from Sec- tion 5.4.4. I analyze the eect of external factors such as duration of child utterance, age of child and WER of corresponding adult context utterances on the adaptation re- sults. For each factor, I compare the change in child ASR performance (perplexity/WER) against the baseline using one-way analysis-of-variance (ANOVA) with the null hypothe- sis representing no change in performance distribution. I encode the dependent variable into three levels, representing increase, no-change and decrease respectively. I categorize the age into two levels - young (<8yrs) and old (>8yrs). Results are presented in the form of signicant factors according to p-values for each adaptation method in Table 5.7. Across dierent adaptation methods and domains, the duration of child utterance is a signicant factor in perplexity improvement and strongly signicant in WER. Overall, shorter utterances resulted in degradation of performance when compared to longer utter- ances. This raises a signicant issue, since speech turns by the child are typically shorter and consisting of fewer words than their adult interlocutors. Both age and adult WER do not play a dominant role for utterance adaptation, appearing only in selected cases 102 under semantic or combined adaptation. A notable exception is during domain adap- tation. 
There, the WER of the adult transcripts plays a significant role in improving perplexity but not WER. In this case, source factors (interlocutor: adult WER) dominate perplexity improvement while target factors (child: utterance length, child age) dominate WER improvement. This observation follows from the fact that target factors directly influence acoustic variability, which is measured through WER and not perplexity.

Table 5.7: Effect of utterance length (U), child age (A) and adult WER (W) on the adaptation performance measured using WER and perplexity. Each entry presents the statistically significant factors (p < 0.1, *p < 0.01) as determined by ANOVA

Direction | Type | WER (FI) | WER (ASD) | Perplexity (FI) | Perplexity (ASD)
- | Domain | U*,A | U*,A | W* | -
- | Session-Global | U*,A | U*,W | - | U,W
Forward | Lexical | U*,A | U* | U | U*
Forward | Semantic | U*,W | U* | U | U*,A
Forward | Combined | U* | U* | U | U*,A
Backward | Lexical | U* | U* | U | U*
Backward | Semantic | U*,A | U* | U | U*
Backward | Combined | U*,A | U* | U | U*

5.6 Conclusions

In this work, I show that the adult interlocutor's spoken language is useful in improving child speech recognition accuracy in a child-adult dyadic interaction setting. I make use of two semi-structured but spontaneous child speech application domains to motivate and evaluate the proposed context modeling - forensic interviews and play-based, interactive sessions for children with ASD. Traditionally considered a challenging problem, I describe the development of a robust child ASR system built on top of state-of-the-art models designed for adult speech. I demonstrate methods to extract lexical and semantic contextual information from the adult speech hypotheses produced by an ASR system. I show that even a few utterances from the immediate vicinity of the target utterance provide significant gains in performance as compared to session-level context. I further investigate the effect of direction and number of context utterances, noting that the seq2seq model is limited by the number of encoder units. Combining semantic and lexical context types results in the highest performance for the majority of test conditions. I do not observe a significant difference between adaptations using oracle transcripts and ASR hypotheses, emphasizing the robustness of my models to transcription errors from the adult ASR system. I also consider the effect of source-based factors (originating from the interlocutor - adult WER) and target-based factors (originating from the child - utterance length, chronological age) separately on the performance improvement using statistical analysis. I find that while utterance length is a dominant factor in the majority of adaptation conditions, improvements in WER and perplexity during domain adaptation are found to be influenced by target-based factors and source-based factors, respectively. In the future, I am interested in automatically learning adaptation weights (possibly unique to each n-gram) to minimize WER without using a held-out set, considering the limited availability of in-domain data. In the case of semantic adaptation, it would be useful to learn to select the context adult utterances relevant to each target child utterance. This would require both incorporating long-term context (using longer encoder lengths or hierarchical networks) and including an attention mechanism in the seq2seq network. Considering the high baseline WER, further work will also continue to focus on developing a robust generic child ASR.
Figure 5.4: Effect of the number of context utterances on perplexity and WER, shown for forward, backward and bidirectional adaptation in FI (top) and ASD (bottom). For each case, up to 10 context utterances are used. Results are provided using both oracle transcripts (blue, continuous) and ASR hypotheses (red, dashed)

Chapter 6

Error Analysis for Improving Speaker Diarization

6.1 Background

Speaker diarization is the process of identifying speaker identities and speaking times in audio recordings, i.e., who spoke when? A typical audio diarization system consists of multiple components: speech activity detection (identifying and removing non-speech regions), speaker segmentation (identifying speaker change points) and speaker clustering (assigning identities to speaker-homogeneous segments). In the context of spontaneous child-adult interactions, speaker diarization becomes a challenging task, especially when these sessions are extended to non-clinical settings such as homes [93], and contain a significant fraction of short individual utterance lengths and a small within-session child speech fraction. Applying state-of-the-art diarization systems trained on adult speech corpora can leave room for improvement when applied to such parent-child interactions. Further, straightforward adaptation of such systems with annotated corpora might not be beneficial due to the often small sizes of in-domain data.

In this chapter, I extend the intuition from chapter 5 and hypothesize that contextual information from the participants and the environment is associated with diarization system performance. In the first part of this study, I design an error analysis experiment where I study the relation between different types of diarization errors and contextual descriptors. Next, I exploit this relation to automatically identify errors in a speaker diarization system and improve overall task performance.

6.1.1 Analyzing Diarization Errors

Compared to the number of works that improve upon diarization systems, relatively few studies have systematically analyzed the different types of errors from such systems. In [136], the authors studied the relation between diarization system performance and various session-level descriptors such as speaker count and rate of conversational turns. Alternatively, local descriptors such as utterance duration and distance to the closest speaker change point were chosen to study missed speech and speaker errors in [98]. In related works [89, 88], the authors studied the effect of different components of the diarization system on the overall DER. At each stage the original component is replaced with an oracle, enabling an individual analysis of each component. The authors note that errors from speech activity detection and overlapped speech regions contribute to the majority of diarization errors.
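For concreteness, the frame-level error bookkeeping used in the analyses below could be sketched as follows. This is a minimal illustration assuming per-frame reference and hypothesis speaker labels (with "sil" marking non-speech) and no scoring collar; the labels and frame rate are assumptions rather than the exact implementation.

```python
# A minimal sketch of a frame-level diarization error breakdown, assuming label
# sequences sampled at a fixed frame rate ("sil" marks non-speech frames).
def frame_error_breakdown(ref, hyp, silence="sil"):
    assert len(ref) == len(hyp)
    counts = {"correct": 0, "missed": 0, "false_alarm": 0, "speaker_error": 0}
    for r, h in zip(ref, hyp):
        if r == silence and h == silence:
            continue                        # true non-speech, not scored
        elif r != silence and h == silence:
            counts["missed"] += 1           # speech labelled as non-speech
        elif r == silence and h != silence:
            counts["false_alarm"] += 1      # non-speech labelled as speech
        elif r == h:
            counts["correct"] += 1
        else:
            counts["speaker_error"] += 1    # speech assigned to the wrong speaker
    total_speech = sum(1 for r in ref if r != silence)
    der = (counts["missed"] + counts["false_alarm"] + counts["speaker_error"]) / max(total_speech, 1)
    return counts, der


# Example: per-frame labels, e.g., at 100 frames per second
ref = ["child", "child", "adult", "sil", "adult"]
hyp = ["child", "adult", "adult", "adult", "sil"]
print(frame_error_breakdown(ref, hyp))
```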
While studies have illustrated the challenges associated with speaker diarization during child-adult interactions [93, 200], a comprehensive analysis of errors is yet to be explored in this domain. As a first step in this direction, I analyze the effect of acoustic-prosodic and conversational factors on diarization errors separately for the child and the adult. This analysis is similar to [98] in that I use local factors rather than global (session-level) functionals. However, I analyze the effect of these factors on each error type rather than on the overall DER.

Table 6.1: Statistics of child-adult interactions. clinic: Clinical interactions administered by a psychologist; BOSCC-high and BOSCC-low represent parent-child interactions for children with high and low language level, respectively

Corpus | # Sess | Duration (min) | Child Speech Fraction (%) | Psych Speech Fraction (%)
clinic | 27 | 17.76 ± 11.99 | 27.5 ± 7.8 | 40.8 ± 7.1
BOSCC-high | 20 | 11.1 ± 3.9 | 13.9 ± 6.1 | 39.6 ± 5.8
BOSCC-low | 18 | 10.1 ± 0.3 | 7.8 ± 5.6 | 37.0 ± 10.1

6.2 Dataset

I select child-adult interactions related to a recently proposed treatment outcome measure, BOSCC (Brief Observation of Social Communication Change) [77], similar to the sessions used in chapter 2. I make use of BOSCC sessions collected both at clinics and at homes. The clinical sessions were collected across four different sites and administered by trained psychologists. The home sessions were administered by parents and coded by psychologists. The home sessions represent a complete "in-the-wild" data collection setup, aimed at assisting periodic self-collection in the child's natural environment. I divide the home BOSCC sessions according to the child's language level. Further details are reported in Table 6.1.

6.3 Contextual Factors

As part of a preliminary study, I select two acoustic-prosodic factors for the error analysis: signal-to-noise ratio (SNR) and speech intensity; and two conversational factors: utterance length and speaker change proximity, with all factors being estimated at the frame level.

Utterance length: This is defined as the length of the utterance (or speaking turn) in which the current frame is present. Utterance length is zero for all non-speech frames. Short utterances typically do not contain enough speaker information for embedding extraction and are more prone to errors [106, 98].

Speaker change proximity: This is the absolute distance to the nearest speaker change point. I define a speaker change point as the time instance at which speakers switch, begin speaking, or end speaking. The minimum proximity considered in this work is 0.25 seconds, which is the standard no-score collar as defined by NIST evaluations. An illustration of the conversational factors is provided in Figure 6.1.

Signal-to-Noise Ratio (SNR): Given a mixed audio signal, SNR is the relative strength between the noise-only and speech-only components. I estimate SNR using the NIST-STNR tool [95] and scale the values (in dB) to zero mean and unit variance within each session to assist training.

Speech intensity: I use the Praat toolkit [15] to estimate speech intensity as a smoothed version of the signal energy. Since absolute intensity values are not informative, I normalize intensity to zero mean and unit variance within each session.
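The following sketch shows one way the four contextual factors could be computed at the frame level from a list of speaker turns and per-frame acoustic measurements. The 0.01 s frame step, the treatment of turn boundaries as change points and the helper names are assumptions for illustration only.

```python
# A sketch of frame-level contextual factor computation, assuming speaker turns given
# as (speaker, start, end) tuples and per-frame SNR / intensity values per session.
import numpy as np

FRAME_STEP = 0.01  # seconds (assumed frame rate)


def conversational_factors(turns, num_frames):
    """Per-frame utterance length and absolute distance to the nearest speaker change."""
    times = np.arange(num_frames) * FRAME_STEP
    utt_len = np.zeros(num_frames)
    change_points = []
    for spk, start, end in turns:
        utt_len[(times >= start) & (times < end)] = end - start
        change_points.extend([start, end])           # turn boundaries as change points
    change_points = np.array(sorted(set(change_points)))
    proximity = np.abs(times[:, None] - change_points[None, :]).min(axis=1)
    return utt_len, np.maximum(proximity, 0.25)      # floor at the 0.25 s no-score collar


def session_normalize(values):
    """Zero-mean, unit-variance normalization of SNR / intensity within a session."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + 1e-8)


turns = [("adult", 0.0, 2.0), ("child", 2.4, 3.1)]
utt_len, prox = conversational_factors(turns, num_frames=400)
```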
Figure 6.1: Illustrating the conversational descriptors: utterance length and speaker change proximity

6.4 Experiments

6.4.1 Baseline System

I use a two-hidden-layer feed-forward DNN developed as part of the DARPA RATS project (https://www.darpa.mil/program/robust-automatic-transcription-of-speech) to classify speech from non-speech. The input consists of spliced (±15 frames) 13-dimensional MFCCs to provide context, while the output nodes are binary labels. The network is trained to minimize cross-entropy loss. During testing, the class posteriors are smoothed and passed through a threshold to obtain speech segments. Speaker embeddings (x-vectors) are extracted at uniform intervals from the speech segments, followed by AHC to obtain the diarization labels. I use the pre-trained x-vectors provided with the CALLHOME recipe in Kaldi, similar to the system in [178]. The training data consisted of a collection of the NIST SRE04, 05, 06, 08 and Switchboard corpora, augmented with noise, music and reverberation, a subset of which was used to train the PLDA scoring metric for AHC. Further details are available in [178]. In this work, the PLDA transforms are trained using the oracle speaker labels from clinic, which is found to further improve the baseline performance.

6.4.2 Error Analysis

In the first experiment, I study the effect of each of the contextual factors in Section 6.3 on the different types of diarization errors produced by the baseline system. For each session in BOSCC-high and BOSCC-low, I compute the factors (utterance length and speaker change proximity using oracle speaker labels, SNR and intensity using raw audio) and decisions (correct, missed speech and speaker error) at the frame level. Note that false alarms are not accounted for in this analysis since I am interested in the effect of factors on child/adult speech segments (not silence regions). While plotting the results in Figure 6.2, the maximum ranges for utterance duration and speaker change proximity are chosen so as to ensure a sufficient number of frames for analysis.

From Figure 6.2, significant differences are observed between how contextual factors affect diarization errors on child and adult speech. The fraction of correctly classified frames marginally increases for both speakers with longer utterances and farther from speaker change points, similar to what was observed in [98]. However, the improvement is reflected by lower fractions of speaker errors for child speech and lower fractions of missed speech for adult speech. Further, short child utterances are more likely to get diarized as adult speech than any other outcome, and the fraction of missed child speech is consistent across different utterance lengths and proximities to speaker change points. Adult speech is seen to perform better as SNR increases, reducing both missed speech and speaker errors. However, child speech seems to exhibit an optimal SNR with respect to correctly classified frames. Specifically, the fraction of child speech diarized as adult speech increases with SNR, suggesting that the baseline diarization system is likely to cluster clean speech segments from the child into the adult cluster. Frames with low speech intensity from both speakers get missed by the VAD, while child speech is likely to get diarized as adult speech as the intensity increases.
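A minimal sketch of the binning behind this analysis is given below, assuming per-frame arrays of a contextual factor, the reference speaker and the baseline outcome. The bin edges and column names are illustrative, not the exact configuration used for Figure 6.2.

```python
# Sketch of a binned outcome distribution (per speaker, per factor), assuming
# per-frame factor values and outcomes in {"correct", "missed", "speaker_error"}.
import pandas as pd


def outcome_distribution(factor, ref_speaker, outcome, speaker, bins):
    df = pd.DataFrame({"factor": factor, "ref": ref_speaker, "outcome": outcome})
    df = df[df["ref"] == speaker].copy()               # analyze one speaker at a time
    df["bin"] = pd.cut(df["factor"], bins=bins)
    # Fraction of each outcome within every factor bin, normalized to sum to 1
    return pd.crosstab(df["bin"], df["outcome"], normalize="index")


bins = [0.25, 0.75, 1.25, 1.75, 2.25]                  # example utterance-duration bins (s)
# dist = outcome_distribution(utt_len, ref_speaker, outcome, speaker="child", bins=bins)
```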
While the above analysis provides insight into how errors from the diarization system manifest differently on child and adult speech, joint effects of multiple contextual factors on each error type, as well as temporal effects, cannot be analyzed in this manner.

6.4.3 Learning to Improve Diarization Errors

Given that contextual factors influence speaker diarization errors, I hypothesize that they can be exploited to correct those errors. I feed the outputs from the baseline diarization system (speaker labels along with silence) and the time-aligned contextual factors to a deep neural network. At the output, the network is provided with a single label representing the speaker. Hence, I pose the neural network training as a sequence classification problem where the output belongs to one of three classes: child, adult or silence. The input is spliced with context frames to exploit temporal information, which cannot be captured during the error analysis in Section 6.4.2.

I define an input sample as a contiguous block of frames spanning a duration of 2 seconds. At each instant, the diarization output is converted into a 3-dimensional one-hot encoding and appended with the 4 contextual factors. The SNR and intensity values are obtained directly from the audio; the conversational factors, viz. utterance length and speaker change proximity, are obtained using the diarization outputs since the oracle speaker labels will not be available during testing. I hypothesize that any errors introduced by the latter will be compensated for with enough training data. I experiment with three types of network designs in this work - ffn, blstm and atten. In ffn, the input is flattened across the time axis before being passed through 3 dense layers with 256 units each. blstm employs LSTM (Long Short Term Memory) layers to capture forward and backward temporal information, with the hidden state fed to 3 dense layers with 256 units each. atten makes use of feed-forward attention [160] to selectively attend to the frames relevant to the output. The context vector c in this method is a weighted sum of hidden state activations h_t, where the weights \alpha_t themselves are obtained using a learnable function A(h_t):

c = \sum_{t=1}^{T} \alpha_t h_t, \quad \alpha_t = \mathrm{softmax}(e_t), \quad e_t = A(h_t)    (6.1)

I use a BLSTM to produce h_t and a fully-connected DNN with 2 hidden layers (8 units in each layer) to learn A. The context vectors from all sources (speaker labels and factors) are appended and passed through 2 dense layers before the output. All networks used in this work are optimized using rmsprop to minimize the cross-entropy loss between output labels and logits. Batch normalization and random dropout (rate = 0.2) are used in the dense layers for regularization. I obtain the results using cross-validation, where the home BOSCC sessions are divided across different folds and the network is retrained from scratch after every fold. The folds are repeated until all sessions are treated as test data. Within the train data in each fold, I further set apart 4 sessions as validation data to determine the optimum training epoch using the validation loss.

Additionally, I check the network's capability to exclusively correct speaker errors by using oracle VAD labels at input and output. Hence, the DNN's output layer consists of two nodes, child/adult, and predictions are made only at voiced frames. The input speaker labels may still consist of silence due to the spliced temporal context.
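A minimal sketch of the feed-forward attention module in Eq. (6.1) is given below. PyTorch is an assumption here (the thesis system may use a different toolkit), and only the layer sizes explicitly stated above (BLSTM hidden states, a 2-layer scorer with 8 units, a 7-dimensional frame input of 3 one-hot labels plus 4 factors) are taken from the text.

```python
# Sketch of feed-forward attention (Eq. 6.1), assuming PyTorch; hyperparameters
# other than those stated in the text are illustrative.
import torch
import torch.nn as nn


class FeedForwardAttention(nn.Module):
    def __init__(self, hidden_dim, scorer_dim=8):
        super().__init__()
        # A(h_t): learnable scorer producing one energy e_t per time step
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, scorer_dim), nn.Tanh(),
            nn.Linear(scorer_dim, 1),
        )

    def forward(self, h):                   # h: (batch, T, hidden_dim)
        e = self.scorer(h).squeeze(-1)      # energies e_t, shape (batch, T)
        alpha = torch.softmax(e, dim=1)     # attention weights over time
        c = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # context vector (batch, hidden_dim)
        return c, alpha


blstm = nn.LSTM(input_size=7, hidden_size=64, bidirectional=True, batch_first=True)
attend = FeedForwardAttention(hidden_dim=128)
x = torch.randn(4, 200, 7)                  # 2 s of frames: 3-dim one-hot label + 4 factors
h, _ = blstm(x)
context, weights = attend(h)
```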
In both experiments (oracle VAD and system VAD), the training data is augmented with clinic sessions at all folds.

I provide DERs on the BOSCC home corpora with both oracle VAD and system VAD in Table 6.2. All networks which use system VAD provide gains in DER, with the attention network resulting in 8.2% and 15.8% relative improvement in DER. The results across the different network designs (ffn, blstm, atten) underscore the importance of exploiting temporal information for diarization. With oracle VAD supplied, the improvement in speaker error is marginal and confined to the BOSCC-high corpus. Upon closer inspection, I observe that there exist significant variations in speaker error change across sessions, i.e., sessions away from y = x in Figure 6.4 exhibit considerable change in speaker error. Hence, there possibly exist other factors that influence the speaker errors which are not accounted for in the current experiments.

Table 6.2: DER results from the error correction network with system VAD and oracle VAD

Model | System VAD: BOSCC-high | System VAD: BOSCC-low | Oracle VAD: BOSCC-high | Oracle VAD: BOSCC-low
Baseline | 55.08 | 65.27 | 36.50 | 30.77
+ ffn | 53.86 | 61.51 | 34.29 | 33.03
+ blstm | 54.58 | 60.06 | 34.25 | 33.17
+ atten | 50.56 | 54.94 | 35.81 | 34.43

6.5 Conclusions

In this chapter, I showed that a state-of-the-art diarization system trained on general large-scale corpora does not necessarily perform well for naturalistic child speech interactions. I investigated the role of context in improving upon the results of this system. First, I examined the effect of various contextual factors on missed speech and speaker errors, analyzing the differences between the effects on child speech and adult speech. Next, I trained a DNN to correct systematic errors in diarization as well as exclusively speaker errors. The results suggest the benefit of local context in improving child-adult speaker diarization. An important drawback of the current setup is the lack of a reliable role-recognition module, which is required to assign child/adult labels to diarization outputs. In the future I would like to handle this by training with role-agnostic inputs, similar to permutation invariant training [228] in speaker separation. Additional factors would also be explored, especially from the visual modality, since audio-video diarization has been shown to perform better than audio-only diarization whenever video is available. Further, the DNN training can be posed as a sequence-to-sequence learning task where the output label is substituted with a label sequence.

Figure 6.2: Effect of contextual factors (utterance duration, speaker change proximity, normalized SNR and normalized speech intensity) on adult speech (top) and child speech (bottom) for the baseline diarization system. For each speaker and factor range, all possible outcomes are normalized to sum to 1 so as to display the error distributions uniformly across the context ranges. Within each bar, the outcomes (from top to bottom) follow: correctly classified frames, missed frames and misclassified frames
Figure 6.3: The attention network used during error correction. The speaker labels from the baseline diarization system and the contextual factors are independently attended to in time using feed-forward attention. Context vectors from each source are merged and passed to a fully connected network to predict the output label

Figure 6.4: Speaker errors for home BOSCC sessions before and after DNN error correction, for the ffn, blstm and atten networks

Part IV

Applications of Context-aware Descriptors

Chapter 7

Robust Behavioral Descriptors using End-to-End Pipeline

In Chapters 2, 4, 5, 6 I developed various context-aware speech and language processing approaches towards child spoken language understanding. A unifying question concerning these approaches is how to determine whether the task performance is adequate. Previous chapters have demonstrated improvements in the respective performance metrics as opposed to non context-aware approaches. In this chapter, I answer the question with the end application in mind, i.e., behavioral descriptor extraction and latent state inference. Specifically, I apply the context-aware approaches developed in previous chapters as part of an end-to-end pipeline and evaluate the extracted descriptors through multiple methods. Automatically extracted descriptors are of interest in clinical and mental health applications [144, 16]. Specifically, they assist in decreasing administration times and costs, and in increasing subject numbers by removing the need for often time-consuming manual annotation steps. The contributions of this chapter are as follows: (1) I evaluate pipeline descriptors by comparing them with oracle descriptors (extracted using manually annotated speaker labels and transcripts). (2) I train statistical models (regression for autism symptom severity) using pipeline descriptors and compare the inference produced with that of similar models using oracle descriptors. The results from this chapter serve as a validation step for the use of pipeline descriptors in large-sample studies as described in Chapters 8 and 9.

7.1 Dataset

I select a pilot corpus of 27 child-examiner interactions from the ASD domain, similar to the sessions used in chapter 2. Unlike chapter 6, I restrict to sessions collected in the clinic due to the relatively better signal quality (higher SNR). Specifically, I use 24 BOSCC (Brief Observation of Social Communication Change [77]) sessions and 3 sessions of the gold-standard diagnostic tool ADOS (Autism Diagnostic Observation Schedule [118]) (child age: μ = 9.3 years, σ = 3.4; verbal IQ: μ = 98.6, σ = 24.3).

7.2 Experiments

For each session in the corpus, trained annotators labeled speaker boundaries and transcripts, which were used to extract session-level descriptors (oracle descriptors). Next, the speech pipeline was employed to extract an identical set of descriptors (pipeline descriptors), wherein at each module the oracle labels were replaced with the previous module's outputs. An illustration of oracle descriptors and pipeline descriptors is provided in Figure 7.1.

I study descriptor robustness using two methods: (1) Value-level comparison between pipeline descriptors and oracle descriptors. For each type of module error, I used normalized mean squared error (NMSE) to identify a subset of robust descriptors.
Figure 7.1: Illustrating extraction of pipeline descriptors and oracle descriptors. In the former (top row), a speech/language processing pipeline consisting of multiple concatenated components is used to estimate who spoke when (speaker labels; task: speaker diarization) and what was spoken (transcripts; task: automatic speech recognition). Alternatively, trained annotators (bottom row) mark speaking times and words by listening to the audio

(2) Relation to symptom severity: I examine the applicability of pipeline descriptors towards predicting autism symptom severity. I train a linear regression model using the pipeline descriptors as inputs to predict the calibrated severity scores (CSS) evaluated by a certified expert. Next, I train a similar regression model by replacing the pipeline descriptors with oracle descriptors. The parameters of the regression model are kept constant between both types of descriptors to ensure a fair comparison.

7.3 Results

I obtained perfect role assignment for all sessions, word error rates of 45.61% (clinician) and 75.63% (child) from the ASR system, a speaker error of 9.42% from the speaker diarization system and an F-score of 0.90 from the speech detection system. Similar descriptor subsets were found to be robust under both conditions: (lexical) first-person pronoun use by the clinician, (turn-taking) clinician speaking fraction and utterance lengths, and (prosodic) intonation slope and intercept from both speakers (Figure 7.2), suggesting that these descriptors can be particularly useful in large-sample studies. Regression models trained for symptom severity prediction returned similar fit measures for oracle descriptors (adjusted R² = 0.23) and pipeline descriptors (adjusted R² = 0.26), suggesting that the predictive power of the latter is robust to errors in the pipeline. A qualitative comparison of predicted severity is presented in Figure 7.3.

Figure 7.2: Most robust descriptors (NMSE) from each modality due to errors arising from speaker diarization (top), and both speech detection and speaker diarization (bottom)

7.4 Conclusion

In this chapter, I applied a fully automated speech and language pipeline to extract behavioral descriptors on a pilot set of semi-structured, naturalistic examiner-child interactions. I identified descriptors that are robust to different module errors from the pipeline, and hence prove useful for large-sample automated studies. I further demonstrated similar predictive power between robust pipeline descriptors and oracle descriptors. In the following chapters, I use a subset of the above descriptors to answer domain-specific questions towards latent state modeling in children.

Figure 7.3: Predicted calibrated severity scores using oracle descriptors and pipeline descriptors for two error conditions: (1) (Top) Errors from speech activity detection and speaker diarization modules, (2) (Bottom) Errors from speaker diarization module

Chapter 8

Tracking Treatment Outcomes among Children with Autism

In previous chapters 2, 3, 4, and 5, I developed context-aware audio and language processing methodologies towards improving child speech understanding in dyadic conversations.
The above approaches are typically integrated in modules such as speaker diarization, role assignment and ASR to develop an end-to-end pipeline. Such a pipeline outputs speaker labels and transcripts, which are used to extract behavioral descriptors. In chapter 7, I used a pilot set of child-adult interactions from the ASD domain to study the applicability of such descriptors via their association with symptom severity.

In this chapter, I extend the study of behavioral descriptors further within the ASD domain. Using a large-sample study of child-clinician interactions collected across varied demographics and locations, I use pipeline descriptors to answer the following questions: (1) How do child speech and language during an interaction with the clinician vary with the child's biological age? (2) What differences (if any) exist between children on the higher and lower ends of the autism spectrum, as characterized using pipeline descriptors? (3) Do noticeable differences exist in the child's speech/language behavior before and after a treatment intervention? While there exist similar approaches (see Section 8.1) towards answering these questions, the current study is one of the first to extend fully automatic pipeline descriptors to a large number of participants. Manual annotation for speaker labels and transcripts is intractable for descriptor extraction at this scale, hence pipeline descriptors become necessary for any computational analysis.

8.1 Background

Autism Spectrum Disorder (ASD) refers to a heterogeneous group of complex neurodevelopmental disorders characterized by social-communicative deficits along with restricted, repetitive behaviors. The prevalence of ASD among children in the US has been rapidly increasing, from about 1 in 150 children in 2002 to 1 in 54 children in 2016 [7, 124]. The predominant symptoms of ASD manifest as difficulties in language and non-verbal comprehension and expression, and anomalies in expressive vocal prosody patterns [94]. The symptoms become apparent in the early years of an individual, hence early diagnosis in children is considered an important step towards effective treatment and intervention. One of the most common diagnosis tools consists of clinically administered interactions between the child and a trained clinician [119, 77]. These dyadic interactions often consist of multiple activities which are designed to observe various socio-communicative behaviors [1, 128]. Computational methodologies, especially those including objective speech and language analyses of such interactions combined with machine learning [67, 20, 105], have helped characterize the symptoms associated with ASD and provided insights into the diagnosis process. For instance, [19] associated subjective perception of awkward prosody with prosodic descriptors extracted from the child's speech, and showed that the descriptors were significant in classifying between ASD subjects and typically developing controls. [105] extended this to language descriptor subsets available with the Linguistic Inquiry and Word Count toolkit (LIWC, [203]). Specific phrases were found to be prominent between typically developing children and children on the spectrum. Furthermore, studies [151, 153] have illustrated significant correlations between the interlocutor's prosody, language use, and discourse linguistic descriptors and symptom severity.

However, the time and monetary cost of manual annotation continues to influence sample size or methodologies in such studies.
Specifically, the number of subjects is usually limited: N = 29 [22], N = 52 [75], N = 37 [230], N = 39 [78]. Alternatively, the segments of analysis are limited to certain questions/topics of interest - in [105, 108], only the Emotions and Social Difficulties and Annoyance tasks from ADOS Module 3 were annotated for speaker labels and transcripts. In this work, I attempt to mitigate some of the above limitations by using the pipeline descriptors that were used in the descriptor robustness study from chapter 7.

8.2 Dataset

I use child-clinician interactions based on the Brief Observation of Social Communication Change (BOSCC [77]) tool, typically used with children verbally fluent enough to produce complex sentences. As described in chapter 2, BOSCC is a treatment outcome measure used to record changes in child behavior (specifically, language and communication skills) over the course of treatment. Further, BOSCC is relatively shorter in time to administer and code when compared to the more popular Autism Diagnosis Observation Schedule (ADOS [118]) tool. For a detailed description of a BOSCC session, refer to Section 2.1. Contrary to previous works, I use BOSCC sessions where I do not have access to oracle speaker labels and transcripts. As a result, I select 210 sessions for analyses throughout the rest of the chapter. Demographic and diagnostic information for the sessions is provided in Table 8.1.

Table 8.1: Details of BOSCC sessions used in this chapter. Locations: New York University (NYU), Icahn School of Medicine at Mount Sinai (MSSM), Center for Autism and the Developing Brain (CADB) at Weill Cornell Medicine and Albert Einstein College of Medicine (EIN)

Attribute | Value
Locations | NYU: 63; MSSM: 37; EIN: 48; CADB: 62
Age (yrs) | 9.35 ± 3.15
CSS | 7.52 ± 1.83
IQ | 116.93 ± 118.47
VCAE | 71.08 ± 22.07
Duration (min) | 14.05 ± 10.41

8.3 Methods

Following extraction of speaker labels and transcripts, I compute a number of vocal communication and language descriptors relevant to the interaction.

8.3.1 Prosodic Descriptors

I extracted intonation (log-pitch) and volume (intensity) contours, which can be used to operationalize perceptions of "monotone" speech or variable volume level [21, 18]. These descriptors were extracted using the Praat [15] toolkit. Following, I subtract the mean value of log-pitch and intensity per speaker and per session, to remove speaker-specific and channel-specific offsets, respectively. Second-order polynomial parameterizations of each contour are then extracted, followed by computing functionals (mean, standard deviation) of the polynomial coefficients.
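The parameterization above could be sketched as follows, assuming per-utterance contours of log-pitch (or intensity) as 1-D arrays. The time normalization and the aggregation of functionals across utterances are assumptions for illustration, and all names are hypothetical.

```python
# Sketch of the prosodic parameterization: quadratic fit of a speaker- and
# session-normalized contour, plus mean/std functionals of the coefficients.
import numpy as np


def contour_coefficients(contour, mean_offset):
    """Fit a 2nd-order polynomial to a mean-removed log-pitch or intensity contour."""
    contour = np.asarray(contour, dtype=float) - mean_offset   # remove speaker/channel offset
    t = np.linspace(0.0, 1.0, len(contour))                    # normalized time axis
    return np.polyfit(t, contour, deg=2)                       # coefficients: quadratic, linear, intercept


def session_functionals(contours, mean_offset):
    """Mean and standard deviation of each coefficient across utterances in a session."""
    coeffs = np.array([contour_coefficients(c, mean_offset) for c in contours])
    return coeffs.mean(axis=0), coeffs.std(axis=0)


log_pitch_utts = [np.log(200 + 20 * np.random.rand(80)) for _ in range(5)]
offset = np.mean(np.concatenate(log_pitch_utts))
mu, sd = session_functionals(log_pitch_utts, offset)
```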
Topic segmentation is performed by optimizing for lexical cohesion measure [2] within the segmented topics. The transcript of entire BOSCC session is treated as a single document, followed by estimating a segmentation rule that results in maximum lexical similarity within segments. I explore both greedy and optimal split strategies, I nd that the former is computationally inexpensive and returns similar similarity score. Similar to turn-level conversational descriptors, I extract both global descriptors: total number of topics, total number and fraction of child/clinician-initiated topics as well as functionals of local descriptors: child/clinician-initiated topic durations (time), child/clinician-initiated topic duration (utterance counts) and turn latency within topic. 128 8.3.3 Language Descriptors I quantify language use primarily through the Linguistic Inquiry and Word Count (LIWC) toolbox [203]. LIWC has previously been used to study language in ASD [25, 105]. and other domains such as motivational interviewing [64] and medical student writings [117]. For each participant and descriptor, I use LIWC to extract the overall count, unique count (as applicable), and functionals of turn-level counts. Lexical concepts of interest include words, nouns, verbs, pronoun (rst & second person), positive & negative emotion, assent (OK), dis uencies (\hmm", \uhm") and llers (\I mean", \You know"). 8.3.4 Correlation Analysis Following extraction of pipeline descriptors as described above, I perform Pearson corre- lation of each descriptor with various demographic and clinical severity codes. I select the following factors for correlation with the descriptors: biological age of child (years); calibrated severity score (CSS) [69] which is a measure of Autism severity independent of age and language development; and Vineland adaptive communication score [30] which captures both receptive communication (Follows instruction with two related actions e.g., `Get your coat and put it on') and expressive communication (e.g., correct use of prepo- sitions, uses plural nouns). Vineland scores are independent of biological age, i.e a 3-year old child may be scored the same as 1-year old child as long as they possess the same language development. 129 8.4 Results 8.4.1 Descriptor Variation across Child Age In the rst set of experiments, I analyze how the pipeline descriptors vary across the child age. A summary of signicant correlations is provided in Table 8.2. Table 8.2: Signicant correlations between pipeline descriptors from each participant and biological age of child (p <0.05*, p <0.01**, n.s - not signicant) Category Descriptor Child (r) Clinician (r) Prosodic 2 nd pitch coe 0.176* -0.447** Language 1 st person plural \We", \us" n.s -0.184* Assent (, ) -0.164*, -0.291* n.s, n.s Nouns (Total, Uniq) 0.342**, 0.204* 0.208**, 0.236** Verbs (Total, Uniq) n.s, 0.184* n.s, 0.211* Words-per-turn 0.333** 0.211* Conversational Turn duration (, ) n.s, n.s 0.152*, 0.172* Topic Count 0.328** Topic Duration -0.323** Multiple descriptors exhibit signicant associations with the child age. Correlations from conversational descriptors suggest increased language exchange between the child and clinician. Specically, interactions involving an older kid cover more topics within a session and spend lesser time on each topic, with both associations being signicant (p < 0:01). Further, the clinician's speaking turn is longer in duration although there exists increased variability in duration. 
Interestingly, the child's turn duration does not change significantly with age; hence, older children are not necessarily more likely to speak longer than younger children. However, the words-per-turn descriptor for both participants is strongly positively associated (p < 0.01) with child age. Given that the turn duration does not increase with age, this suggests that older children increase their speaking rate when interacting with the clinician. Other language descriptors point to an overall increase in language diversity. The numbers of unique nouns and verbs increase with age, validating an increase in vocabulary with age. Similar associations are seen in the clinician's language as well. These results are likely related to the increase in topics, since diverse semantic entities are likely to involve more nouns and verbs. An increase in topics is also possibly related to reduced references to the conversation itself in the language of the clinician, as seen from the negative correlation with first person plurals.

While interacting with older children, the pitch contour of the clinician exhibits a falling pattern. While assuming the role of interviewer, i.e., asking questions, a falling pitch contour is known to be associated with open-ended questions, as opposed to closed-ended questions such as the yes/no type. This suggests that clinicians are able to engage in more open-ended conversations with older children, a finding that is also possibly reflected in the lesser use of assent ("OK", "alright") by the children. Given that a number of correlations observed in this experiment share similarities with common knowledge of child development, the results serve as another method to validate the use of pipeline descriptors.

8.4.2 Identifying Group Differences using Pipeline Descriptors

In the second set of experiments, I compute the correlations between pipeline descriptors and two manually coded variables: the calibrated severity score on a scale of 1 (least) to 10 (most), which measures symptom severity, and the Vineland adaptive communication score, which measures adaptive communication skills.

Table 8.3: Significant correlations between pipeline descriptors and symptom severity

Calibrated Severity Score:
Prosodic | Clinician: 2nd intensity coefficient | 0.238**
Language | Child: Positive emotion | 0.167*
Language | Child: Words (uniq) | -0.164*
Language | Child: Verbs (uniq) | -0.174*
Language | Child: Nouns (total, uniq) | -0.141*, -0.151*
Language | Child: Nonfluencies (μ, σ) | 0.232**, 0.228**
Conversation | Number of topics | -0.159*
Conversation | Clinician: Turn duration (μ, σ) | -0.150*, -0.236*
Conversation | Child: Intra-topic latency | 0.152*

Vineland:
Prosodic | Clinician: 2nd pitch coefficient | -0.383**
Language | Child: Words per turn (μ, σ) | 0.585**, 0.520**
Language | Clinician: Words per turn (μ, σ) | 0.412**, 0.413**
Language | Child: Verbs (total, uniq) | 0.561**, 0.463**
Language | Child: Nouns (total, uniq) | 0.532**, 0.481**
Language | Child: Assent (μ, σ) | -0.496**, 0.564**
Conversation | Number of topics | 0.482**
Conversation | Clinician: Turn duration (μ, σ) | 0.486**, 0.592**
Conversation | Child: Intra-topic latency | -0.354*

From Table 8.3, I notice that language descriptors return the highest number of significant correlations, similar to the first experiment. While interacting with children higher on the spectrum, clinicians speak louder towards the end of their turn. The reason for this is not immediately clear from the analysis, although it is known that children with autism respond appropriately to a loud tone of voice [90]. Further, clinicians engage in a smaller number of topics with these children.
While there exist multiple possible causes of reduced topic coverage, the most likely manifestation is increased intra-topic child latency. In other words, a child higher on the spectrum pauses longer before responding to the clinician's turn, validating a well-known observation among children with autism [121, 97, 122]. Given that latency was not a significant descriptor before performing topic segmentation, the finding outlines the relevance of semantic segmentation before analyzing latency in a multi-topic conversation.

Many language descriptors show negative correlations with severity. The total counts of words, verbs and nouns and the unique counts of nouns are decreased for children higher on the spectrum. While these descriptors were found to be positively correlated with age, the findings assume particular significance since child age is not correlated with CSS (r = 0.098). Further, the clinician's language does not exhibit any associations with CSS. During the interaction, the clinician is blind to the child's autism severity but not to the biological age, suggesting that interlocutor language is affected by the child's overt demographic factors more than by latent behavioral traits. A perhaps surprising result is the positive correlation for positive language content in the child's speech, which contradicts previous findings [105]. However, the findings ought to be cautiously interpreted given the difference in toolkit (LIWC vs. psycho-linguistic norms [125]) and demographic composition.

8.4.3 Effect of Treatment on Speech Descriptors

In the third experiment, I address the following question: Do noticeable differences exist in the child's speech/language behavior before and after a treatment intervention? I sample sessions from 107 children who underwent a parent-delivered intervention [168, 127, 41] over the course of a few weeks. I also select a control group (69 children) who did not undergo any intervention. For each child in both groups, exactly one session was selected after the treatment (or) as a second data collection. The constraint of two sessions from each child reduced the overall sample size from 210 children to 176 children. For each group and time point (before/after), I aggregate statistics for each descriptor. To test for significant change in descriptor values post treatment, I use an independent sample t-test to check for a difference in means. I use Welch's variant [218] since the sample sizes of the groups are different and variances cannot be assumed equal between the groups. However, not all descriptors exhibiting significant change are useful for studying the treatment effect. To remove the effects of developmental changes, I check for any corresponding changes in the control group. Only descriptors which exhibit a change in the treatment group and do not change in the control group, and vice versa, are selected. The final list of significant descriptors is presented in Table 8.4.
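One way this per-descriptor test could be implemented is sketched below. The DataFrame layout (columns group, timepoint and one column per descriptor) and the pre/post comparison within each group are assumptions made for illustration; Welch's variant is obtained in scipy by setting equal_var=False.

```python
# Sketch of the treatment-vs-control descriptor screening, assuming session-level
# descriptor values tagged by group (treatment/control) and timepoint (pre/post).
import pandas as pd
from scipy import stats


def treatment_specific_change(df, descriptor, alpha=0.05):
    """Keep a descriptor only if it changes in exactly one of the two groups."""
    changed = {}
    for group in ("treatment", "control"):
        pre = df[(df.group == group) & (df.timepoint == "pre")][descriptor]
        post = df[(df.group == group) & (df.timepoint == "post")][descriptor]
        _, p = stats.ttest_ind(pre, post, equal_var=False)   # Welch's t-test
        changed[group] = p < alpha
    return changed["treatment"] != changed["control"], changed


# Example usage with a hypothetical table of session-level descriptors.
# df = pd.read_csv("boscc_descriptors.csv")   # columns: group, timepoint, <descriptors>
# keep, detail = treatment_specific_change(df, "clinician_intra_topic_latency")
```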
However, if they had underwent intervention, the topics terminate faster after treatment. While these results are of preliminary nature, they provide the potential to study the eect of treatment on automatically derived descriptors. Table 8.4: List of descriptors which exhibit dierent signicant changes over time for treatment and control groups Descriptor Treatment Controls Clinician speech latency (intra-topic) { * Clinician-initiated topic duration + * Clinician: Assent { * 8.5 Conclusion In this Chapter, I applied fully automated descriptors on a large sample study from the ASD domain for addressing diagnosis and treatment studies. Given that there is no human-in-the-loop to validate the accuracy of descriptors, I validate them by study- ing their association with biological age. I corroborate signicant associations between descriptors and age are in agreement with knowledge of child speech and language de- velopment. Following, I use the descriptors to study group dierences between high and low severity on the scale of calibrated severity scores (CSS). Further, I study the eect of treatment on the speech and language behavior of children using descriptors. I identify descriptors that exhibit dierent trends of change between the treatment and control groups. I believe that above experiments strengthen the cause for using automated de- scriptors for clinical applications involving child speech understanding. 135 Chapter 9 Detecting Truthful Language from Child Speech In this chapter, I extend the application of speech and language processing pipeline to another domain involving spontaneous child-adult conversations. I look at interactions from the legal interviewing sector, specically child forensic interviews (CFI). A CFI is a semi-structured conversation between a trained [109] professional and a child who is a suspected victim of/witness to abuse. Unlike interactions from the autism domain where the primary goal is diagnosis and intervention eect study, a CFI is designed to maximize incident recall while minimizing trauma to the child. Child maltreatment has been long considered one of the primary obstructions to emo- tional, physical and overall well-being of children [129, 50, 196, 65]. Eects of child abuse on victims have been observed during transition to adulthood as well as integration with society [11, 145, 206]. Often, children are the sole witness to incidents involving them. Further, children are susceptible to/intimidated by legal proceedings in a courtroom sce- nario and in the presence of alleged perpetrators [126, 62], necessitating a separate in- terviewing strategy from that of adults. The NICHD (National Institute of Child Health 136 and Human Development) protocol [109, 84, 107] was developed to address the above dif- culties, by focusing on open-ended questions which prompt recall from memory rather than close-ended prompts which may be of suggestive or leading nature. An unique aspect of CFIs is the presence of a rapport-building segment before the incident recall segment. During rapport building, the interviewer discusses day-to-day topics pertinent to the child's life such as school, friends, festivals, etc. Rapport building in CFI has been shown to result in longer engagement times of the child with the interviewer, and assist the child with memory recall following the rapport building segment [39]. 
Automated conversation analyses during CFI can benefit in two ways: (1) By providing objective descriptors of the child's speaking style, including localized and overall statistics of language use and prosodic patterns. These descriptors can be used to understand the relation with the interview outcome in terms of both the legal proceeding and the child's mental state. (2) By providing feedback to the interviewer: How does the interview language affect incident recall? Specifically, at what points in time does the interview protocol yield an accurate recall from the child?

Identifying truthful language in the child's speech during CFI is especially important given the high stakes of CFI. A wrong conviction can lead to broken family ties, while a wrong acquittal means that the perpetrator returns to society and to the child victim's household. In this work, I focus on two questions related to the task of identifying truthful language from the child's speech.

Truth-Telling Task: I examine whether the child is truthful during the incident recall phase, i.e., does the testimony accurately reflect what transpired. The outcome of this task is binary: a positive outcome includes true non-disclosures and true disclosures, and a negative outcome includes false non-disclosures and false disclosures (see Figure 9.1 for a description of the outcomes).

Disclosure Task: Given the knowledge that the transgression transpired, I examine whether the children disclose it. The outcome of this task is binary: a positive outcome indicates true disclosure while a negative outcome indicates false non-disclosure. This task is particularly relevant for understanding situations where children do not disclose abuse, possibly due to shame, trauma or memory impairment.

9.1 Narrative Truth Induction (NTI) Corpus

A significant drawback of directly using CFIs for studying truthful child language is that the outcomes of court proceedings are at best inappropriate for use as ground truth and impossible to verify. Hence, I choose a controlled transgression study with children where a toy-break is simulated for a random subset of the children. Within each session in the study, the adult professionals who take part in the toy-break segment (the confederate) and the interview segment (the interviewer) are not the same. Further, the interviewer is blind to the outcome of the transgression (toy-break). This setting enables recreating the circumstances leading to, and during, a CFI, albeit with a minor transgression. If a transgression happens (i.e., the toy breaks), the confederate asks the child not to mention it to anyone else, including the interviewer. Further, the interviewer navigates through a series of rapport-building topics before moving to the recall questions.
On an average, the rapport building segment lasts for 5 minutes 9 seconds while the free recall segment lasts for 3 minutes 23 seconds. Table 9.1 provides counts of sessions for each outcome. Table 9.1: Outcome counts in the NTI Corpus. The table lists the session count after removing a few due to speech-to-text alignment issues Truthful Deceit Toy-Break 44 121 No Toy-Break 50 0 9.2 Methods 9.2.1 Speech to Text Alignment Similar to chapter 8, I develop a speech pipeline to obtain information about what was spoken and who spoken when before extracting the descriptors. As part of the NTI corpus curation, I have transcripts alongwith speaker labels already available. Hence, I employ a slightly dierent pipeline from previous works since speech-to-text module (or, 1 https://uscchildinterviewinglab.com/ 139 Confederate Interviewer Playing with toys [ Toy breaks ] [ Toy does not break ] Don’t tell the interviewer Rapport building Incident Recall Tell me about your last birthday [ Confederate leaves ] [ Interviewer enters ] Did anything happen while playing? Toy broke Toy didn’t break Toy broke Toy didn’t break Time True Disclosure False Non-Disclosure False Disclosure True Non-Disclosure Outcome Figure 9.1: An overview of the toy-play and interviewing segments from a narrative truth induction (NTI) session. There are four possible outcomes of the session depending on whether the toy breaks, and whether the child answers the interviewer's questions truthfully. 140 Transcripts ASR XXYXYX YYX ABBABA BBABB YYX ABBA BA Aligned Words Unaligned Words Forced- Alignment Word Alignment [ First Pass ] [ Second Pass ] String Matching Figure 9.2: Overview of the two-pass alignment pipeline used in NTI corpus automatic speech recognition task) is not needed anymore. Instead, I employ a two-pass speech-to-text alignment pipeline illustrated in Figure 9.2. I make use of the time boundaries of rapport building and free recall segments to select the audio for respective segments. Within each segment, I employ the alignment pipeline. During the rst pass, I perform an incomplete but relatively accurate alignment using a combination of ASR and text matching. An o-the-shelf ASR model is used to decode the audio into word hypotheses. The decoded word string is matched with the transcript, wherein the speaker labels are removed from the latter. This step ensures that only words which are recognized by an ASR are matched with the transcript, thereby ensuring the aligned time information of every word is fairly accurate. During the second- pass, all intermediate phrases from the transcript are aligned to the corresponding audio portions using a forced-alignment system. A forced alignment assigns every word to a timestamp in the audio, irrespective of condence of alignment (hence, forced). Thus the second pass can be considered as a complete but relatively inaccurate alignment. The 141 above two-pass alignment is inspired by [138] which employed a similar recursive system for long audio recordings. 9.2.2 Turn Conversational Descriptors Following the application of a two-pass speech alignment pipeline (Section 9.2.1), I use the obtained word alignments alongwith the speaker information to extract turn conver- sational descriptors. Similar to chapter 8 I extract descriptors that capture information relevant a speaker turn, specically latency, turn duration and speaking rate (words/sec). Each descriptor is extracted for both the child and interviewer. 
9.2.2 Turn Conversational Descriptors

Following the application of the two-pass speech alignment pipeline (Section 9.2.1), I use the obtained word alignments along with the speaker information to extract turn conversational descriptors. Similar to Chapter 8, I extract descriptors that capture information relevant to a speaker turn, specifically latency, turn duration and speaking rate (words/sec). Each descriptor is extracted for both the child and the interviewer. I then compute global functionals of these descriptors, which serve as session-level features.

Additionally, I extract behavioral synchrony information from the dynamics of the conversational descriptors. Specifically, I compute the strength of the causal relationship (in the Granger [70] sense) from the interviewer's descriptors to the child's corresponding descriptors. Given two time series X and Y, X is said to Granger-cause Y if it can be demonstrated through statistical tests (usually an F-test) that lagged values of X provide significantly more information about Y than lagged values of Y alone. The prediction information for Y using its own lagged values is quantified using an autoregressive model, to which lagged values of X are added to fit an augmented regression model. Model orders are estimated using information-theoretic criteria, usually Akaike (AIC) or Bayesian (BIC).

In this work, I treat the interviewer's turn descriptors as X and the child's turn descriptors as Y. Hence, I test for the causal relationship in which the interviewer's speech descriptors affect the child's. I use the F-statistic comparing the residual error terms from the two regression models, and this F-statistic serves as a session-level feature. From each participant, I use the time series for latency and speaking rate, and extract four features per session: two within-descriptor features (Interviewer speaking rate -> Child speaking rate and Interviewer latency -> Child latency) and two across-descriptor features (Interviewer speaking rate -> Child latency and Interviewer latency -> Child speaking rate).
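As an illustration, one such session-level causality feature could be computed as sketched below with statsmodels, treating the child's descriptor series as the variable to be predicted and the interviewer's as the candidate cause; the fixed maximum lag and the lag selection by BIC of the augmented model are assumptions made here for concreteness, not necessarily the exact settings used in this work.

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    def granger_feature(child_series, interviewer_series, max_lag=3):
        """F-statistic for 'interviewer descriptor -> child descriptor' over one session."""
        # statsmodels convention: test whether the second column Granger-causes the first.
        data = np.column_stack([child_series, interviewer_series])
        results = grangercausalitytests(data, maxlag=max_lag, verbose=False)
        # Choose the lag whose unrestricted (augmented) regression has the lowest BIC.
        best_lag = min(results, key=lambda lag: results[lag][1][1].bic)
        # ssr_ftest returns (F, p-value, df_denom, df_num); keep the F-statistic as the feature.
        return results[best_lag][0]["ssr_ftest"][0]

    # Example (hypothetical variable names): one of the four per-session features.
    # f_rate_to_latency = granger_feature(child_latency, interviewer_speaking_rate)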
9.2.3 Classification Models

I train binary classification models separately for the truth-telling and disclosure tasks. Further, I control for the interlocutor (child/interviewer) for the descriptor functionals, and for within-feature/across-feature for the Granger causality features. In all experiments, I report the unweighted average of F1 scores in order to account for the class imbalance. For both tasks, I use a bootstrapped version of random predictions as a baseline: I predict random labels for each session, repeat this for 1000 trials, and select the trial corresponding to the 50th percentile of the F1 score. For the truth-telling task, I also simulate a human baseline based on a recent study [68] which examined how accurately adults predict deception from children's behavior during NTI sessions. Based on [68], I simulate a classifier which correctly predicts truthfulness 54% of the time, and compute the mean F1 score over 1000 trials. For the proposed systems, I evaluate two ensemble classifiers (random forests and AdaBoost) and a binary support vector classifier. The motivation behind controlling for the feature sources in different experiments is to distinguish their relative importance towards group differences rather than to train a perfect classifier.

Table 9.2: Classification results reported as unweighted F1-scores for the truth-telling and disclosure tasks on the NTI corpus.

Model                                      Truth-Telling   Disclosure
Baseline              Human                    0.452          NA
                      Random-Bootstrap         0.479          0.445
Feature-Functionals   Child                    0.488          0.473
                      Interviewer              0.519          0.481
Granger Causality     Within-Feature           0.540          0.533
                      Across-Feature           0.538          0.562

9.3 Results

Classification results are provided in Table 9.2. I note that while all the proposed models outperform the random baseline and human performance, the improvements are marginal. Between feature functionals trained with child descriptors and interviewer descriptors, there is a minor improvement when using the latter. This trend mirrors results from the autism domain (Chapter 8), with the possibility that interviewer speech is better aligned with text than child speech. Models trained on the causality features return the best performance, implying that feature dynamics provide more discriminative information than feature functionals. The best performing models return 12.73% and 26.29% relative improvement over the random baseline for the truth-telling and disclosure tasks, respectively, and 19.45% relative improvement over human performance for the truth-telling task. Overall, AdaBoost and random forests were the best performing models for the truth-telling and disclosure tasks, respectively.

As a follow-up to the classification models, I analyze the contribution of different descriptors towards the performance. I employ a recursive feature elimination method (implementation: https://scikit-learn.org/stable/) to rank the descriptors. All descriptors are used in the initial model. A weight is then assigned to each descriptor, and the one with the least weight is recursively removed, followed by retraining with the remaining descriptors. This process continues until the most informative descriptor remains. I present the results in Table 9.3.
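The ranking procedure described above can be reproduced with scikit-learn's recursive feature elimination; the sketch below assumes the session-level descriptors are stacked in a matrix X with binary task labels y and descriptor names in names, and the choice of a random forest as the base estimator is illustrative.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    def rank_descriptors(X, y, names):
        # Recursively drop the lowest-weight descriptor and refit,
        # until a single (most informative) descriptor remains.
        selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
                       n_features_to_select=1, step=1)
        selector.fit(X, y)
        # ranking_ == 1 marks the last surviving descriptor; larger ranks were removed earlier.
        return sorted(zip(selector.ranking_, names))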
Unlike the classification models, I include both participants' features when training with functionals, and both within-feature and across-feature causality scores, in order to compare across descriptors.

The results for the speaking rate descriptor suggest that children speak faster when they speak the truth. This finding supports the observation that a lower cognitive load is generally incurred when speaking the truth as opposed to lying. However, children speak slower while (truthfully) disclosing a toy-break. This is partly because, while they are still truthful during disclosure, children are forced to ignore the confederate's instructions. Silence fraction measures the time within a turn when the participant does not speak, i.e., pauses. Children pause less during recall when they are truthful, another indicator of lower cognitive load. The interviewer, on the other hand, pauses longer during rapport building. Further analysis is required to explore whether the increase in pause duration is a result of back-channels which did not elicit a response from the child.

The causal descriptors relate the lead-lag structure of the interaction to the truthfulness of the child's recall. Specifically, the child's latency follows the interviewer's speaking rate to a lower extent when the child is truthful, whereas the speaking rate itself closely follows that of the interviewer. Hence, while there exists an effort to match the interviewer's speaking style, these preliminary results appear contradictory. When the child does disclose a toy-break, the most significant finding suggests that the interviewer's latency influences the child's speaking rate to a lower extent.

Table 9.3: Top important descriptors for each type, based on classification performance. For each feature, group trends are marked with ↑ (higher value for the positive class) and ↓ (lower value for the positive class).

Feature-Functionals
  Truth-Telling  - Recall: Interviewer Speaking Rate ↓, Child Speaking Rate ↑, Child Silence Fraction ↓
                 - Rapport: Interviewer Latency ↓, Interviewer Silence Fraction ↑
  Disclosure     - Recall: Interviewer Speaking Rate ↓, Child Speaking Rate ↓, Interviewer Silence Fraction ↓
                 - Rapport: Child Speaking Rate ↓, Interviewer Latency ↓

Granger Causality
  Truth-Telling  - Recall: Speaking Rate -> Latency ↓
                 - Rapport: Speaking Rate -> Speaking Rate ↑
  Disclosure     - Recall: Latency -> Speaking Rate ↓
                 - Rapport: Latency -> Speaking Rate ↓

9.4 Conclusions

In this chapter, I examined two tasks related to predicting truthfulness from a child's speech and language. Motivated by the high-stakes application of child forensic interviews, I use a simulated minor-transgression study with controlled outcomes. I apply a two-pass speech alignment pipeline to extract automatic descriptors, which are then used to train binary classifiers for predicting truthfulness and disclosure. I find that causal descriptors provide better classification performance than feature functionals. Further, a recursive feature elimination technique is used to study the relative importance of the descriptors. Overall, the minor but consistent improvements in classification performance obtained with automatic descriptors illustrate their applicability in this critical domain.

Part V

Conclusion

Chapter 10

Concluding Remarks

In this dissertation, I develop and implement approaches for understanding child speech during spontaneous interactions with an adult. The need for robust and automated speech processing is ubiquitous given the prevalence of voice-controlled assistants, and child speech processing poses unique acoustic and linguistic challenges due to a number of developmental factors. As an alternative to conventional techniques which focus on a single source of information for improving child speech understanding, I highlight the effect of the interlocutor's behavior and of environment conditions on the child's speech. Hence, including information from the context is a natural solution for training better modules at each step of any speech and language pipeline. The developed methods are validated in two ways: individual component performance, and application of the extracted descriptors to the end task.

In Chapters 2 and 4, I use the background conditions (specifically, acoustic) as environment context for improving child/adult speech classification. Two different methods are developed to exploit environment context: entropy-filtering for identifying robust enrolment samples, and prototypical networks for training session-invariant speaker embeddings. While both methods optimize the same task, entropy-filtering attempts to quickly adapt a classifier to a new session. The extent of adaptation is controlled using the classifier's initial performance as a proxy: a better initial classifier implies more confidence and more cautious adaptation to the session, and vice versa. Using an enrolment setup, I showed that entropy-filtering provided significant improvements over a conventional classification model. Further, entropy-filtering can be combined with other semi-supervised methods, such as bootstrapping, to further improve system performance. Meta-learning with prototypical networks, on the other hand, learns a high-dimensional space invariant to background conditions.
An ensemble of tasks is used to train a shared embedding, which is shown both qualitatively and quantitatively to possess more discriminative information than a state-of-the-art speaker embedding. Further, I explore an extension of meta-learning (Chapter 3) to train a generic speaker embedding agnostic to the speaker ID application, namely speaker verification and speaker diarization. The proposed embeddings outperform conventional models for both child-adult and adult-adult conversations, and are particularly robust to challenging noise and reverberation conditions.

A drawback of using environment context for child speech understanding is its static nature through the interaction. Context characterized through model parameters does not vary with the child's speech, posing a significant limitation on the amount of useful context. I address this by adding interlocutor context as an additional source of information. Moreover, the relationship between interlocutor context and child (speech) behavior is two-way: there exists a direct feedback mechanism from the interlocutor that the child often subconsciously adapts to during the interaction. I illustrate the application of interlocutor context on two tasks: child automatic speech recognition (ASR; Chapter 5) and child-adult speaker diarization (Chapter 6). In the former, the interlocutor's language is used as an input source while scoring hypotheses output from a speech recognition lattice. The approach is inspired by the mechanism of locally shared semantics, which are common in naturalistic human conversations and in child-adult conversations in particular. We borrow all phrases from the interlocutor's speech and re-weigh them during lattice decoding. Alternatively, a sequence-to-sequence model is trained to predict the child's speech using the interlocutor's speech as input; random sampling is used to simulate multiple child speech hypotheses, which are then re-weighed during lattice decoding. In the latter, the contributions of interlocutor factors are used to improve predictions from a speaker diarization model. Contrary to child ASR, no distinction is made between the interlocutor's and the child's conversational context factors. Multiple factors are directly appended to the diarization system outputs and provided as input to an error-correcting DNN model, with ground truth speaker labels provided as the targets. Hence, the system learns to correct errors in diarization using information from the contextual factors. Using child-adult as well as toddler-adult interactions, we illustrate the improvements obtained when using both oracle speech activity outputs and system outputs.

In the third part of the dissertation, I use the context-aware methods developed in the first two parts for behavioral applications. Each method contributes a module in an end-to-end speech and language processing pipeline, which is used to obtain speech transcripts and speaker labels at each time instant. An end-to-end pipeline is meant to replace manual annotation for the above information, a process which is time consuming, expensive and often impossible to make completely accurate. Further, manual annotation served as a primary limiting factor in previous studies, which resorted to small sample sizes. The outputs from the pipeline are used to extract descriptors, which provide information about the spoken behavior of each participant.
While the pipeline enables automated descriptor extraction, it is imperative to validate its outputs since ground truths are not available. Towards this, we propose two solutions: (1) comparing pipeline descriptors to oracle descriptors (Chapter 7), where the latter are extracted using manual annotations. A pilot set of child-adult interactions from the Autism Spectrum Disorder (ASD) domain is used for this task, and pipeline descriptors are found to behave similarly to their oracle counterparts when trained to predict symptom severity. (2) Comparing associations between descriptors and demographic factors (Chapter 8) with prior knowledge from the child development literature. I use a large-sample study and compute the correlations between the descriptors and the child's biological age to establish their validity. Following their validation, pipeline descriptors are extended within the ASD domain to identify group differences between children higher and lower on the severity spectrum, and to study the effect of treatment intervention. The results obtained mostly corroborate previous findings on group differences, while opening up the possibility of a wider range of descriptors. The treatment intervention study returned a limited set of descriptors which exhibit significant differences between the groups; the efficacy of intervention treatment, especially in terms of automated tracking of behavior change, is an important question to address in future work. Finally, pipeline descriptors are used to identify truthfulness in child speech following a minor transgression task (toy-break). Using a simulated study with controlled outcomes, a classification model trained on the descriptors is shown to outperform both random guessing and human ability for deception detection. Of particular importance are descriptors which capture a measure of synchrony in the interaction (specifically, Granger causality). These descriptors are observed to outperform static functionals over the session, illustrating the usefulness of capturing dynamics in the interaction for predicting truthfulness.

Chapter 11

Future Directions

The methods developed in the first parts of this dissertation focus on including context information as an additional input source in child speech understanding models. Context is appended at the feature level (Chapter 6), used for hypothesis re-scoring (Chapter 5), or used as a proxy for assessing adaptation performance (Chapter 4). An alternate method to incorporate context is to add an objective related to the end application, for instance a mean squared error on the descriptors of interest. Multi-objective training is particularly suited to deep neural networks, via an additional branch at the outputs. Further, techniques such as adversarial training [56] enable controlling the contribution of different objectives during the training process. However, one still needs to investigate the relevance of different objective functions to the different context factor types. Further, the lack of supervised child speech data continues to necessitate low-resource methodologies in multi-task training.
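As a concrete illustration of the multi-objective direction sketched above, a model could share a trunk between the primary task and a descriptor-regression branch; the layer sizes, the descriptor dimensionality and the weighting term lam below are placeholders rather than settings proposed in this dissertation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiObjectiveNet(nn.Module):
        def __init__(self, in_dim, n_classes, n_descriptors, hidden=128):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.cls_head = nn.Linear(hidden, n_classes)        # primary task, e.g. child/adult
            self.desc_head = nn.Linear(hidden, n_descriptors)   # auxiliary descriptor regression

        def forward(self, x):
            h = self.trunk(x)
            return self.cls_head(h), self.desc_head(h)

    def multi_objective_loss(logits, desc_pred, labels, desc_true, lam=0.1):
        # Cross-entropy for the primary task plus MSE on the descriptors of interest.
        return F.cross_entropy(logits, labels) + lam * F.mse_loss(desc_pred, desc_true)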
A potential outcome of replacing or augmenting the conventional metric for different pipeline modules (such as word error rate for ASR, or diarization error rate for speaker diarization) with descriptor robustness is a downgrade in module performance. It is often up to the user to tolerate the drop in performance: for instance, is an ASR system that yields a higher-than-normal WER but still enables robust language descriptors an acceptable system? There exist trade-offs at each module and descriptor that ought to be studied before application in behavioral domains.

Sometimes, it might be possible to predict descriptor robustness without multiple objectives. The descriptor robustness study in Chapter 7 applied an end-to-end pipeline to extract descriptors. While this setup closely resembles a real-world application, it is also possible to simulate errors in pipeline modules directly using output perturbations. The perturbations should be carefully controlled for the expected amount of error from each pipeline module. A significant advantage of random perturbations is the possibility of simulating a large number of pipelines on the same set of sessions. Further, the relationship between error distributions and descriptor robustness can be analyzed to estimate expected operating points for the pipeline, as well as to identify descriptors which are robust over a wide range of errors within the operating point. The number of trials and the maximum admissible error are two parameters which can be controlled depending on the application.

Finally, an important finding from Chapter 9 is that causality-related descriptors provide better classification performance than conventional session functionals. Standard moments of descriptors have long been used as inputs to classification models due to their relative ease of computation and interpretation, especially in clinical and mental health applications. The findings from that chapter, especially the cross-feature causality descriptors, can help understand the effect of different modalities on one another: for instance, does the tone of the interlocutor's voice affect the child's facial expressions? Such relations hold significance since they quantify cross-modal effects which are often acknowledged through anecdotal observation but are otherwise hard to detect using computational methods.

References

[1] Natacha Akshoomoff, Christina Corsello, and Heather Schmidt. The role of the autism diagnostic observation schedule in the assessment of autism spectrum disorders in school and community settings. The California School Psychologist, 11(1):7-19, 2006.
[2] Alexander A Alemi and Paul Ginsparg. Text segmentation based on semantic word embeddings, 2015.
[3] Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 3988-3996, 2016.
[4] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):356-370, 2012.
[5] Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):356-370, 2012.
[6] Xavier Anguera, Chuck Wooters, and Javier Hernando. Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2011-2022, 2007.
[7] J. Baio. Prevalence of autism spectrum disorder among children aged 8 years - autism and developmental disabilities monitoring network, 11 sites, United States, 2010. Morbidity and Mortality Weekly Report: Surveillance Summaries, 63(2):1-21, 2014.
[8] Albert Bandura.
The self system in reciprocal determinism. American psychologist, 33(4):344, 1978. [9] Albert Bandura and Richard H Walters. Social learning theory, volume 1. Prentice- hall Englewood Clis, NJ, 1977. 156 [10] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clus- tering with bregman divergences. J. Mach. Learn. Res., 6:1705{1749, December 2005. [11] DOUGLAS BARNETT. The impact of subtype, frequency, chronicity, and severity of child maltreatment on social competence and behavior problems. Development and psychopathology, 6:121{143, 1994. [12] Linda Bell, Johan Boye, Joakim Gustafson, Mattias Heldner, Anders Lindstr om, and Mats Wir en. The Swedish NICE corpus{spoken dialogues between children and embodied characters in a computer game scenario. In Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, pages 2765{2768, September 2005. [13] M. P. Black, J. Tepperman, and S. S. Narayanan. Automatic prediction of children's reading ability for high-level literacy assessment. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):1015{1028, May 2011. [14] Matthew P Black, Daniel Bone, Marian E Williams, Phillip Gorrindo, Pat Levitt, and Shrikanth Narayanan. The usc care corpus: Child-psychologist interactions of children with autism spectrum disorders. In Twelfth Annual Conference of the International Speech Communication Association, 2011. [15] Paul Boersma et al. Praat, a system for doing phonetics by computer. Glot inter- national, 5, 2002. [16] D. Bone, C. Lee, T. Chaspari, J. Gibson, and S. Narayanan. Signal processing and machine learning for mental health research and clinical applications [perspectives]. IEEE Signal Processing Magazine, 34(5):196{195, 2017. [17] Daniel Bone, Somer L Bishop, Matthew P Black, Matthew S Goodwin, Catherine Lord, and Shrikanth S Narayanan. Use of machine learning to improve autism screening and diagnostic instruments: eectiveness, eciency, and multi-instrument fusion. Journal of Child Psychology and Psychiatry, 57(8):927{937, 2016. [18] Daniel Bone, Matthew P Black, Chi-Chun Lee, Marian E Williams, Pat Levitt, Sungbok Lee, and Shrikanth Narayanan. Spontaneous-speech acoustic-prosodic fea- tures of children with autism and the interacting psychologist. In Proceedings of the Interspeech, 13th Annual Conference of the International Speech Communica- tion Association, pages 1043{1046, 2012. [19] Daniel Bone, Matthew P. Black, Anil Ramakrishna, Ruth B. Grossman, and Shrikanth S. Narayanan. Acoustic-prosodic correlates of `awkward' prosody in story retellings from adolescents with autism. In INTERSPEECH, pages 1616{ 1620. ISCA, 2015. [20] Daniel Bone, Matthew S Goodwin, Matthew P Black, Chi-Chun Lee, Kartik Au- dhkhasi, and Shrikanth Narayanan. Applying machine learning to facilitate autism diagnostics: pitfalls and promises. Journal of autism and developmental disorders, 45(5):1121{1136, 2015. 157 [21] Daniel Bone, Chi-Chun Lee, Matthew P Black, Marian E Williams, Sungbok Lee, Pat Levitt, and Shrikanth Narayanan. The psychologist as an interlocutor in autism spectrum disorder assessment: Insights from a study of spontaneous prosody. Jour- nal of Speech, Language, and Hearing Research, 57(4):1162{1177, 2014. [22] Daniel Bone, Chi-Chun Lee, Alexandros Potamianos, and Shrikanth S. Narayanan. An investigation of vocal arousal dynamics in child-psychologist interactions using synchrony measures and a conversation-based model. 
In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pages 218{222, 2014. [23] H. Bredin. Tristounet: Triplet loss for speaker turn embedding. In ICASSP, pages 5430{5434, March 2017. [24] H. Bredin. Tristounet: Triplet loss for speaker turn embedding. In IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5430{5434, 2017. [25] Benjamin T Brown, Gwynn Morris, Robert E Nida, and Lynne Baker-Ward. Brief report: making experience personal: internal states language in the memory nar- ratives of children with and without asperger's disorder. Journal of autism and developmental disorders, 42(3):441{446, 2012. [26] D. C. Burnett and M. Fanty. Rapid unsupervised adaptation to children's speech on a connected-digit task. In Proceedings of the Fourth International Conference on Spoken Language Processing. ICSLP '96, volume 2, pages 1145{1148, Oct 1996. [27] Carlos Busso, Panayiotis G Georgiou, and Shrikanth S Narayanan. Real-time mon- itoring of participants' interaction in a meeting using audio-visual sensors. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing- ICASSP'07, volume 2, pages II{685. IEEE, 2007. [28] J. P. Campbell. Speaker recognition: a tutorial. Proceedings of the IEEE, 85(9):1437{1462, 1997. [29] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomono. Svm based speaker verication using a gmm supervector kernel and nap variability compensa- tion. In IEEE International Conference on Acoustics Speech and Signal Processing Proceedings (ICASSP), volume 1, pages I{I, 2006. [30] Alice S Carter, Fred R Volkmar, Sara S Sparrow, Jing-Jen Wang, Catherine Lord, Geraldine Dawson, Eric Fombonne, Katherine Loveland, Gary Mesibov, and Eric Schopler. The vineland adaptive behavior scales: supplementary norms for individ- uals with autism. Journal of autism and developmental disorders, 28(4):287{302, 1998. [31] O. Chapelle and A. Zien. Semi-supervised classication by low density separation. In Proceedings of the Tenth International Workshop on Articial Intelligence and Statistics, pages 57{64, 2005. 158 [32] Ke Chen and Ahmad Salman. Learning speaker-specic characteristics with a deep neural architecture. IEEE Transactions on Neural Networks, 22(11):1744{1756, 2011. [33] Silje Christensen, Simen Johnsrud, Massimiliano Ruocco, and Heri Ramampiaro. Context-aware sequence-to-sequence models for conversational systems. 2018. [34] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition, 2018. [35] T. Claes, I. Dologlou, L. ten Bosch, and D. van Compernolle. A novel feature transformation for vocal tract length normalization in automatic speech recognition. IEEE Transactions on Speech and Audio Processing, 6:549{557, Nov 1998. [36] Kyndra C Cleveland, Jodi A Quas, and Thomas D Lyon. The eects of implicit encouragement and the putative confession on children's memory reports. Child abuse & neglect, 80:113{122, 2018. [37] R Cole, P Hosom, and B Pellom. University of colorado prompted and read chil- dren's speech corpus. Technical report, Technical Report TR-CSLR-2006-02, Uni- versity of Colorado, 2006. [38] Ron Cole, Dominic W Massaro, Jacques de Villiers, Brian Rundle, Khaldoun Shobaki, Johan Wouters, Michael Cohen, Jonas Baskow, Patrick Stone, Pamela Connors, et al. New tools for interactive speech and language training: us- ing animated conversational agents in the classroom of profoundly deaf children. 
In MATISSE-ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education, 1999. [39] Roger Collins, Robyn Lincoln, and Mark G Frank. The eect of rapport in forensic interviewing. Psychiatry, psychology and law, 9(1):69{78, 2002. [40] Alejandrina Cristia, Shobhana Ganesh, Marisa Casillas, and Sriram Ganapathy. Talker diarization in the wild: The case of child-centered daylong audio-recordings. In Interspeech 2018, pages 2583{2587, 2018. [41] Allison Cunningham. Measuring change in social interaction skills of young children with autism. Journal of autism and developmental disorders, 42(4):593{605, 2012. [42] S. Das, D. Nix, and M. Picheny. Improvements in children's speech recognition performance. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages 433{436, May 1998. [43] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end fac- tor analysis for speaker verication. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788{798, May 2011. [44] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end fac- tor analysis for speaker verication. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788{798, 2011. 159 [45] Najim Dehak, Reda Dehak, Patrick Kenny, Niko Br ummer, Pierre Ouellet, and Pierre Dumouchel. Support vector machines versus fast scoring in the low- dimensional total variability space for speaker verication. In Tenth Annual con- ference of the international speech communication association, 2009. [46] Daniel Elenius and Mats Blomberg. Adaptation and normalization experiments in speech recognition for 4 to 8 year old children. In Proceedings of the Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, pages 2749{2752, 2005. [47] Hakan Erdogan, Ruhi Sarikaya, Stanley F Chen, Yuqing Gao, and Michael Picheny. Using semantic analysis to improve speech recognition performance. Computer Speech & Language, 19(3):321{343, 2005. [48] Maxine Eskenazi, Jack Mostow, and David Gra. The CMU kids corpus. Linguistic Data Consortium, 1997. [49] Tiantian Feng, Amrutha Nadarajan, Colin Vaz, Brandon Booth, and Shrikanth Narayanan. Tiles audio recorder: An unobtrusive wearable solution to track au- dio activity. In Proceedings of the 4th ACM Workshop on Wearable Systems and Applications, pages 33{38. ACM, 2018. [50] David Finkelhor, Heather A Turner, Anne Shattuck, and Sherry L Hamby. Vio- lence, crime, and abuse exposure in a national sample of children and youth: An update. JAMA pediatrics, 167(7):614{621, 2013. [51] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML-Volume 70, pages 1126{1135. JMLR, 2017. [52] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Con- ference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1126{1135. PMLR, 2017. [53] Nikolaos Flemotomos, Pavlos Papadopoulos, James Gibson, and Shrikanth Narayanan. Combined speaker clustering and role recognition in conversational speech. In Interspeech, pages 1378{1382, 2018. [54] Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, and Shinji Watanabe. End-to-End Neural Speaker Diarization with Permutation-Free Objec- tives. 
In INTERSPEECH, pages 4300{4304, 2019. [55] Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, and Kenji Naga- matsu. End-to-end neural diarization: Reformulating speaker diarization as simple multi-label classication, 2020. 160 [56] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran cois Laviolette, Mario Marchand, and Victor Lempitsky. Domain- adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096{2030, 2016. [57] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree. Speaker diariza- tion using deep neural network embeddings. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4930{4934, 2017. [58] Daniel Garcia-Romero and Carol Y Espy-Wilson. Analysis of i-vector length nor- malization in speaker recognition systems. In Twelfth Annual Conference of the International Speech Communication Association, 2011. [59] Daniel Garcia-Romero, David Snyder, Gregory Sell, Daniel Povey, and Alan Mc- Cree. Speaker diarization using deep neural network embeddings. In ICASSP, pages 4930{4934. IEEE, 2017. [60] Matteo Gerosa, Diego Giuliani, and Shrikanth Narayanan. Acoustic analysis and automatic recognition of spontaneous children's speech. In Proceedings of the Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006. [61] Matteo Gerosa, Diego Giuliani, Shrikanth Narayanan, and Alexandros Potamianos. A review of ASR technologies for children's speech. In Proceedings of the 2nd Workshop on Child, Computer and Interaction, 2009. [62] Simona Ghetti, Kristen Weede Alexander, and Gail S Goodman. Legal involvement in child sexual abuse cases: Consequences and interventions. International Journal of Law and Psychiatry, 25(3):235{251, 2002. [63] James Gibson, Dogan Can, Panayiotis G. Georgiou, David C. Atkins, and Shrikanth S. Narayanan. Attention networks for modeling behaviors in addiction counseling. In INTERSPEECH 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pages 3251{3255, 2017. [64] James Gibson, Nikolaos Malandrakis, Francisco Romero, David C Atkins, and Shrikanth S Narayanan. Predicting therapist empathy in motivational interviews using language features inspired by psycholinguistic norms. In Sixteenth annual conference of the international speech communication association, 2015. [65] Ruth Gilbert, Cathy Spatz Widom, Kevin Browne, David Fergusson, Elspeth Webb, and Staan Janson. Burden and consequences of child maltreatment in high-income countries. The Lancet, 373(9657):68 { 81, 2009. [66] Daniel Gildea and Thomas Hofmann. Topic-based language models using EM. In Proceedings of the Sixth European Conference on Speech Communication and Technology, 1999. 161 [67] Tze Jui Goh, Joachim Diederich, Insu Song, and Min Sung. Using diagnostic information to develop a machine learning application for the eective screening of autism spectrum disorders. In Mental health informatics, pages 229{245. Springer, 2014. [68] Jennifer Gongola, Nicholas Scurich, and Jodi A Quas. Detecting deception in chil- dren: A meta-analysis. Law and human behavior, 41(1):44, 2017. [69] Katherine Gotham, Andrew Pickles, and Catherine Lord. Standardizing ados scores for a measure of severity in autism spectrum disorders. Journal of autism and developmental disorders, 39(5):693{705, 2009. [70] Clive WJ Granger. 
Investigating causal relations by econometric models and cross- spectral methods. Econometrica: journal of the Econometric Society, pages 424{ 438, 1969. [71] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645{6649, May 2013. [72] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with re- current neural networks. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, pages II{1764{II{1772, 2014. [73] Sharmistha S. Gray, Daniel Willett, Jianhua Lu, Joel Pinto, Paul Maergner, and Nathan Bodenstab. Child automatic speech recognition for US English: child inter- action with living-room-electronic-devices. In Proceedings of the 4th Workshop on Child, Computer and Interaction, WOCCI, Singapore, September 19, pages 21{26, 2014. [74] Ruth B Grossman, Rhyannon H Bemis, Daniela Plesa Skwerer, and Helen Tager- Flusberg. Lexical and aective prosody in children with high-functioning autism. Journal of Speech, Language, and Hearing Research, 53(3):778{793, 2010. [75] Ruth B Grossman, Julia Mertens, and Emily Zane. Perceptions of self and other: Social judgments and gaze patterns to videos of adolescents with and without autism spectrum disorder. Autism, 23(4):846{857, 2019. [76] Alexander Gruenstein, Chao Wang, and Stephanie Sene. Context-sensitive sta- tistical language modeling. In Proceedings of the Ninth European Conference on Speech Communication and Technology, 2005. [77] Rebecca Grzadzinski et al. Measuring changes in social communication behaviors: preliminary development of the brief observation of social communication change (BOSCC). Journal of autism and developmental disorders, 46(7):2464{2479, 2016. [78] Tanaya Guha, Zhaojun Yang, Ruth B Grossman, and Shrikanth S Narayanan. A computational study of expressive facial dynamics in children with autism. IEEE transactions on aective computing, 9(1):14{20, 2016. 162 [79] Rahul Gupta, Daniel Bone, Sungbok Lee, and Shrikanth Narayanan. Analysis of engagement behavior in children during dyadic interactions using prosodic cues. Computer speech & language, 37:47{66, 2016. [80] Andreas Hagen, Bryan Pellom, and Ronald Cole. Children's speech recognition with application to interactive books and tutors. In Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 186{191, Nov 2003. [81] John H.L. Hansen, Abhijeet Sangwan, Aditya Joglekar, Ahmet E. Bulut, Lakshmish Kaushik, and Chengzhu Yu. Fearless steps: Apollo-11 corpus advancements for speech technologies from earth to the moon. In INTERSPEECH, pages 2758{2762, 2018. [82] Valerie Hazan and Sarah Barrett. The development of phonemic categorization in children aged 6{12. Journal of Phonetics, 28(4):377 { 396, 2000. [83] Irit Hershkowitz, Sara Fisher, Michael E. Lamb, and Dvora Horowitz. Improving credibility assessment in child sexual abuse allegations: The role of the NICHD investigative interview protocol. Child Abuse & Neglect, 31(2):99 { 110, 2007. [84] Irit Hershkowitz, Michael E Lamb, and Carmit Katz. Allegation rates in forensic child abuse investigations: Comparing the revised and standard nichd protocols. Psychology, Public Policy, and Law, 20(3):336, 2014. [85] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Van- houcke, P. Nguyen, T. N. Sainath, and B. Kingsbury. 
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82{97, Nov 2012. [86] Thomas Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177{196, Jan 2001. [87] Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Kenji Naga- matsu. End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors, 2020. [88] Marijn Huijbregts, David A van Leeuwen, and Chuck Wooters. Speaker diarization error analysis using oracle components. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):393{403, 2012. [89] Marijn Huijbregts and Chuck Wooters. The blame game: performance analysis of speaker diarization system components. In INTERSPEECH 2007, 8th Annual Con- ference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007, pages 1857{1860, 2007. [90] Brian Isaacson. Helping children with autism learn: A guide to treatment ap- proaches for parents and professionals. Psychiatric Services, 55(11):1328{1328, 2004. 163 [91] Anil Jakkam and Carlos Busso. A multimodal analysis of synchrony during dyadic interaction using a metric based on sequential pattern mining. In 2016 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6085{6089. IEEE, 2016. [92] W. Lewis Johnson, Je W. Rickel, and James C. Lester. Animated pedagogical agents: Face-to-face interaction in interactive learning environments. International Journal of Artical Intelligence in Education, 11:47{78, 2000. [93] Rebecca M. Jones, Daniela Plesa Skwerer, Rahul Pawar, Amarelle Hamo, Caroline Carberry, Eliana L. Ajodan, Desmond Caulley, Melanie R. Silverman, Shannon McAdoo, Steven Meyer, Anne Yoder, Mark Clements, Catherine Lord, and He- len Tager-Flusberg. How eective is LENA in detecting speech vocalizations and language produced by children and adolescents with ASD in dierent contexts? Autism Research, January 2019. [94] Leo Kanner et al. Autistic disturbances of aective contact. Nervous child, 2(3):217{250, 1943. [95] Chanwoo Kim and Richard M Stern. Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis. In Ninth Annual Conference of the International Speech Communication Association, 2008. [96] D.P. Kingma, S. Mohamed, D.J. Rezende, and M. Welling. Semi-supervised learn- ing with deep generative models. In Advances in Neural Information Processing Systems, pages 3581{3589, 2014. [97] Jessica Klusek, Gary E Martin, and Molly Losh. A comparison of pragmatic lan- guage in boys with autism and fragile x syndrome. Journal of Speech, Language, and Hearing Research, 57(5):1692{1707, 2014. [98] Mary Tai Knox, Nikki Mirghafori, and Gerald Friedland. Where did i go wrong?: Identifying troublesome segments for speaker diarization systems. In Thirteenth Annual Conference of the International Speech Communication Association, 2012. [99] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5220{5224, March 2017. [100] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015. [101] M. A. Kohler and M. Kennedy. 
Language identication using shifted delta cepstra. In The 45th Midwest Symposium on Circuits and Systems, volume 3, pages 69{72, Aug 2002. [102] N. R. Koluguri, M. Kumar, S. H. Kim, C. Lord, and S. Narayanan. Meta-learning for robust child-adult classication from speech. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8094{8098, 2020. 164 [103] M. Kumar, P. Papadopoulos, R. Travadi, D. Bone, and S. Narayanan. Improving semi-supervised classication for low-resource speech interaction applications. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149{5153, April 2018. [104] Manoj Kumar, Daniel Bone, Kelly McWilliams, Shanna Williams, Thomas D. Lyon, and Shrikanth S. Narayanan. Multi-scale context adaptation for improving child automatic speech recognition in child-adult spoken interactions. In Proceedings of the Interspeech, 18th Annual Conference of the International Speech Communica- tion Association, pages 2730{2734, 2017. [105] Manoj Kumar, Rahul Gupta, Daniel Bone, Nikolaos Malandrakis, Somer Bishop, and Shrikanth S Narayanan. Objective language feature analysis in children with neurodevelopmental disorders during autism assessment. In Interspeech, pages 2721{2725, 2016. [106] Soonil Kwon and Shrikanth Narayanan. Robust speaker identication based on selective use of feature vectors. Pattern Recognition Letters, 28(1):85 { 89, 2007. [107] David La Rooy, Sonja P Brubacher, Anu Arom aki-Stratos, Mireille Cyr, Irit Her- shkowitz, Julia Korkman, Trond Myklebust, Makiko Naka, Carlos E Peixoto, Kim P Roberts, et al. The nichd protocol: A review of an internationally-used evidence- based tool for training child forensic interviewers. Journal of Criminological Re- search, Policy and Practice, 2015. [108] Rimita Lahiri, Manoj Kumar, Somer Bishop, and Shrikanth Narayanan. Learn- ing domain invariant representations for child-adult classication from speech. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6749{6753. IEEE, 2020. [109] Michael E Lamb, Yael Orbach, Irit Hershkowitz, Phillip W Esplin, and Dvora Horowitz. A structured forensic interview protocol improves the quality and infor- mativeness of investigative interviews with children: A review of research using the NICHD investigative interview protocol. Child Abuse & Neglect, 31(11-12):1201{ 1231, 2007. [110] Anthony Larcher, Pierre-Michel Bousquet, Kong Aik Lee, Driss Matrouf, Haizhou Li, and Jean-Francois Bonastre. I-vectors in the context of phonetically-constrained short utterances for speaker verication. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4773{4776. IEEE, 2012. [111] Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53(9):1162 { 1171, 2011. Sensing Emotion and Aect - Fac- ing Realism in Speech Processing. [112] Sungbok Lee, Alexandros Potamianos, and Shrikanth Narayanan. Acoustics of children's speech: Developmental changes of temporal and spectral parameters. The Journal of the Acoustical Society of America, 105(3):1455{1468, 1999. 165 [113] Sungbok Lee, Alexandros Potamianos, and Shrikanth Narayanan. Developmen- tal acoustic study of american english diphthongs. The Journal of the Acoustical Society of America, 136(4):1880{1894, 2014. [114] R. Leonard. 
A database for speaker-independent digit recognition. In Proceedings of the ICASSP '84. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 9, pages 328{331, March 1984. [115] M. Li, C. Lu, A. Wang, and S. Narayanan. Speaker verication using lasso based sparse total variability supervector with PLDA modeling. In Asia-Pacic Signal and Information Processing Association Annual Summit and Conference, pages 1{4, 2012. [116] Hank Liao, Golan Pundak, Olivier Siohan, Melissa Carroll, Noah Coccaro, Qi-Ming Jiang, Tara N. Sainath, Andrew Senior, Fran coise Beaufays, and Michiel Bacchiani. Large vocabulary automatic speech recognition for children. In Proceedings of the Interspeech, 2015. [117] Chi-Wei Lin, Meei-Ju Lin, Chin-Chen Wen, and Shao-Yin Chu. A word-count approach to analyze linguistic patterns in the re ective writings of medical students. Medical education online, 21(1):29522, 2016. [118] C. Lord, M. Rutter, P.C. DiLavore, S. Risi, K. Gotham, and S. Bishop. Autism diagnostic observation schedule: Ados-2. 2012. [119] Catherine Lord et al. The autism diagnostic observation schedule|Generic: A standard measure of social and communication decits associated with the spectrum of autism. Journal of autism and developmental disorders, 30(3):205{223, 2000. [120] Catherine Lord, Susan Risi, Linda Lambrecht, Edwin H. Cook, Bennett L. Leven- thal, Pamela C. DiLavore, Andrew Pickles, and Michael Rutter. The Autism Di- agnostic Observation Schedule|Generic: A Standard Measure of Social and Com- munication Decits Associated with the Spectrum of Autism. Journal of Autism and Developmental Disorders, 30(3):205{223, Jun 2000. [121] O Ivar Lovaas, Alan Litrownik, and Ronald Mann. Response latencies to audi- tory stimuli in autistic children engaged in self-stimulatory behavior. Behaviour Research and Therapy, 9(1):39{49, 1971. [122] Katherine A Loveland, Susan H Landry, Sheryl O Hughes, Sharon K Hall, and Robin E McEvoy. Speech acts and the pragmatic decits of autism. Journal of Speech, Language, and Hearing Research, 31(4):593{604, 1988. [123] B. Maeireizo, D. Litman, and R. Hwa. Co-training for predicting emotions with spoken dialogue data. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, 2004. 166 [124] Matthew J Maenner, Kelly A Shaw, Jon Baio, et al. Prevalence of autism spec- trum disorder among children aged 8 years|autism and developmental disabilities monitoring network, 11 sites, united states, 2016. MMWR Surveillance Summaries, 69(4):1, 2020. [125] Nikolaos Malandrakis and Shrikanth S Narayanan. Therapy language analysis using automatically generated psycholinguistic norms. In Sixteenth Annual Conference of the International Speech Communication Association, 2015. [126] Judy Martin, Jessie Anderson, Sarah Romans, Paul Mullen, and Martine O'Shea. Asking about child sexual abuse: methodological implications of a two stage survey. Child abuse & neglect, 17(3):383{392, 1993. [127] Johnny L Matson. Determining treatment outcome in early intervention programs for autism spectrum disorders: A critical analysis of measurement issues in learning based interventions. Research in developmental disabilities, 28(2):207{218, 2007. [128] Carla A Mazefsky and Donald P Oswald. The discriminative ability and diagnostic utility of the ADOS-G, ADI-R, and GARS for children in a clinical setting. Autism, 10(6):533{549, 2006. [129] KAREN McCURDY and DEBORAH DARO. Child maltreatment: A national survey of reports and fatalities. 
Journal of Interpersonal Violence, 9(1):75{94, 1994. [130] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson. The 2016 speakers in the wild speaker recognition evaluation. In INTERSPEECH, pages 823{827, 2016. [131] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson. The speakers in the wild (sitw) speaker recognition database. In INTERSPEECH, pages 818{822, 2016. [132] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Distance-based image clas- sication: Generalizing to new classes at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2624{2637, 2013. [133] Angeliki Metallinou, Athanasios Katsamanis, and Shrikanth Narayanan. A hierar- chical framework for modeling multimodality and emotional evolution in aective dialogs. In Proceedings of the Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 2401{2404. IEEE, 2012. [134] Angeliki Metallinou, Martin Wollmer, Athanasios Katsamanis, Florian Eyben, Bjorn Schuller, and Shrikanth Narayanan. Context-sensitive learning for enhanced audiovisual emotion classication. IEEE Transactions on Aective Computing, 3(2):184{198, 2012. 167 [135] A. H. Michaely, X. Zhang, G. Simko, C. Parada, and P. Aleksic. Keyword spotting for google assistant using contextual speech recognition. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 272{278, Dec 2017. [136] Nikki Mirghafori and Chuck Wooters. Nuts and akes: A study of data character- istics in speaker diarization. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 1, pages I{I. IEEE, 2006. [137] Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. Context-based recog- nition during human interactions: Automatic feature selection and encoding dictio- nary. In Proceedings of the 10th international conference on Multimodal interfaces, pages 181{188. ACM, 2008. [138] Pedro J Moreno, Chris Joerg, Jean-Manuel Van Thong, and Oren Glickman. A recursive algorithm for the forced alignment of very long audio segments. In Fifth International Conference on Spoken Language Processing, 1998. [139] Jack Mostow, Steven F. Roth, Alexander G. Hauptmann, and Matthew Kane. A prototype reading coach that listens. In Proceedings of the 12th National Conference on Articial Intelligence, Seattle, WA, USA, July 31 - August 4, 1994, Volume 1., pages 785{792, 1994. [140] Ryoko Mugitani and Sadao Hiroya. Development of vocal tract and acoustic fea- tures in children. Acoustical Science and Technology, 33(4):215{220, 2012. [141] Maryam Najaan and John HL Hansen. Speaker independent diarization for child language environment analysis using deep neural networks. In 2016 IEEE Spoken Language Technology Workshop (SLT), pages 114{120. IEEE, 2016. [142] Mahesh Kumar Nandwana, Julien van Hout, Mitchell McLaren, Colleen Richey, Aaron Lawson, and Maria Alejandra Barrios. The voices from a distance challenge 2019 evaluation plan, 2019. [143] S. Narayanan and A. Potamianos. Creating conversational interfaces for children. IEEE Transactions on Speech and Audio Processing, 10(2):65{78, Feb 2002. [144] Shrikanth Narayanan and Panayiotis G Georgiou. Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5):1203{1233, 2013. [145] Assaf Oshri, Erinn B Duprey, Steven M Kogan, Matthew W Carlson, and Sihong Liu. 
Abstract
The need for robust and automated speech processing is ubiquitous. Of particular interest for automated analysis are interpersonal interactions, especially those between a child and an adult in clinical and mental health applications. Child speech understanding in such applications poses unique acoustic and linguistic challenges owing to a number of developmental factors. In contrast to conventional techniques, which focus on a single source of information during the learning process, we highlight the effect of the interlocutor’s behavior and of environmental conditions on the child’s speech. We claim that incorporating such contextual information is a natural way to train better modules at each stage of a speech and language processing pipeline. Our proposed methods demonstrate improvements in child/adult speaker diarization and child speech recognition, and they are validated in two ways: through the performance of the individual components and through the application of the extracted descriptors to the end task, i.e., understanding the child’s latent state.