Extracting and Using Speaker Role Information in Speech Processing Applications

by Nikolaos Flemotomos

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2022

Copyright 2022 Nikolaos Flemotomos

To my parents, So and Antonis.

Acknowledgements

This dissertation marks the end of a journey; a journey that I could definitely not have done alone. First and foremost, I would like to thank my advisor, Professor Shrikanth Narayanan, for giving me the opportunity to join SAIL and embark on this journey. Also, I need to extend a big thank you to all my professors at USC who, through their classes, provided me the foundational tools needed for my research, as well as to all the staff, and especially Tanya Avecedo-Lam and Diane Demetras, who supported me and made sure the trip was as smooth as possible.

During this PhD journey, I had the chance to work with some great collaborators. A special thank you to Professors Panayiotis Georgiou, David Atkins, Zac Imel, and Torrey Creed, who had a huge impact on my research agenda and on the way I learned to approach real-world problems and research questions. I also need to express my gratitude to all the professors who served as members in my qualification and/or dissertation committees, providing valuable advice and guidance: Keith Jenkins, Maja Matarić, C.-C. Jay Kuo, and Antonio Ortega.

To all my labmates and colleagues I met along the way, many of whom are now my friends. From study groups to music nights, and from writing papers to having fun at international conferences around the world, the SAIL family was always there to make the journey both more productive and more enjoyable. But also to all the friends outside the lab, both the ones I made in this part of the world and the ones back in Greece. A special thank you to my good friends and roommates who made the journey so much more fun, especially during the weird times of the pandemic, and a big thank you to my girlfriend for her support, patience, and understanding.

Last, and definitely not least, a huge thank you to my family, who supported this journey in every possible way long before it even started. Mom, dad, Evelina, this was only possible because of you.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abbreviations
Abstract
Introduction
    Roles and Human Interactions
    Computational Analysis of Speaker Roles
    Research Directions
    Outline
Part I: Extracting Speaker Roles
Chapter 1: Combined Speaker Clustering and Role Recognition in Conversational Speech
    1.1 Introduction
    1.2 Proposed Method
        1.2.1 General framework
        1.2.2 Speaker clustering module
        1.2.3 Role recognition module
    1.3 Datasets
    1.4 Experiments and Results
    1.5 Conclusion
Chapter 2: Role Specific Lattice Rescoring for Speaker Role Recognition from Speech Recognition Outputs
    2.1 Introduction
    2.2 Background
        2.2.1 Weighted Finite State Transducers
        2.2.2 WFST framework for speech recognition
        2.2.3 Speech lattices
        2.2.4 Lattice rescoring
    2.3 Proposed Method
    2.4 Datasets
    2.5 Experiments and Results
        2.5.1 Turn-level SRR
        2.5.2 Speaker-level SRR
        2.5.3 Effect on speech recognition accuracy
    2.6 Conclusion
Part II: Using Speaker Roles
Chapter 3: Linguistically Aided Speaker Diarization Using Speaker Role Information
    3.1 Introduction
    3.2 Background: Audio-Only Speaker Diarization
    3.3 Proposed Method: Linguistically-Aided Speaker Diarization
        3.3.1 Text-based segmentation
        3.3.2 Role recognition
        3.3.3 Profile estimation
        3.3.4 Audio segmentation and classification
    3.4 Datasets
        3.4.1 Evaluation data
        3.4.2 Segmenter and role LM training data
    3.5 Experiments and Results
        3.5.1 Baseline systems
        3.5.2 Experimental setup
        3.5.3 Results with reference transcripts
        3.5.4 Results with ASR transcripts
    3.6 Conclusion
Chapter 4: Multimodal Speaker Clustering with Role Induced Constraints
    4.1 Introduction
    4.2 Background and Prior Work
        4.2.1 Spectral clustering for speaker diarization
        4.2.2 Constrained clustering for speaker diarization
        4.2.3 Constrained spectral clustering
    4.3 Proposed Method
    4.4 Datasets
        4.4.1 Psychotherapy sessions
        4.4.2 Podcast episodes
    4.5 Experiments and Results
        4.5.1 Experimental setup
        4.5.2 Results and discussion
    4.6 Conclusion
Part III: Real World Impact
Chapter 5: Why Do We Need Roles? Automated Psychotherapy Evaluation as an Example Downstream Application
    5.1 Need for Psychotherapy Quality Assessment Tools
    5.2 Behavioral Coding for Motivational Interviewing
    5.3 Psychotherapy Evaluation in the Digital Era
    5.4 Current Study
        5.4.1 System overview
        5.4.2 Deployment: data collection and pre-processing
    5.5 Experiments
        5.5.1 System with clustering-based diarization
        5.5.2 System with classification-based diarization
    5.6 Analysis and Results
        5.6.1 Speaker diarization
        5.6.2 Psychotherapy evaluation
    5.7 Conclusion
Conclusions and Future Directions
    Summary and Main Contributions
    Directions for Future Work
References
Appendices
    Appendix A: UCC dataset: Inter-Rater Reliability
    Appendix B: Psychotherapy Transcription and Coding Pipeline
        B.1 Datasets
        B.2 System Details

List of Tables

1.1 Descriptive analysis of the corpora used for SRR.
1.2 Misclassification rates of the SC algorithm, the language-based recognizer, and the audio-based recognizer, when used independently or in a piped or combined architecture.
2.1 Size of the PSYCH dataset.
2.2 Size of the AMI dataset.
2.3 Size of the vocabulary and total number of tokens in the corpora used for LM training.
2.4 Misclassification rates for turn-level SRR.
2.5 Misclassification rates for speaker-level SRR and for speaker clustering.
2.6 WER using the best path of a generic lattice or role-specific rescored lattices.
3.1 DER following our linguistically-aided approach and the two baselines.
4.1 Size of the UCC dataset.
4.2 Size of the TAL dataset.
4.3 Classification accuracy of the BERT-based model and a majority-class baseline.
4.4 DER using unconstrained audio-only clustering, constrained clustering with role-induced constraints, and language-only role-based classification.
5.1 Therapist-related utterance-level codes, as defined by MISC 2.5.
5.2 Mapping between MISC-defined behavior codes and grouped target labels, together with the occurrences of each group in the evaluation UCC sets.
5.3 Diarization results for the UCC data.
5.4 Correlations between the manually-derived codes and the machine-generated ones for the per-session counts of the utterance-level MISC labels.
5.5 Correlations between the manually-derived codes and the machine-generated ones for the session-level MISC aggregate metrics.
A.1 Krippendorff's alpha for the MISC-defined codes in the UCC data.
A.2 Krippendorff's alpha for the grouped target labels in the UCC data.
B.1 ASR results for the UCC data.
B.2 F1 scores for the predicted utterance-level codes using the manually transcribed UCC data.

List of Figures

1.1 Two approaches for SRR.
1.2 Proposed approach for SRR.
1.3 Misclassification rate when using only the AM-based decision as a function of the number of Gaussians in the GMM.
1.4 Distribution of the scores which are the output of the SC, the LM-based recognizer, and the AM-based recognizer.
2.1 Example of speech recognition lattice.
2.2 Turn-level SRR by role-specific lattice rescoring.
2.3 Turn-level SRR by evaluating the text with role-specific LMs.
3.1 Finding "who spoke when" in a speech signal.
3.2 Linguistically-aided speaker diarization using role information.
3.3 Neural network for sentence-level text segmentation.
3.4 Baseline audio-based speaker diarization.
3.5 Baseline language-based speaker diarization.
3.6 DER as a function of the number of text segments we take into account per session for the profile estimation.
4.1 Two-step speaker clustering for role-playing interactions.
4.2 DER for the UCC dataset as a function of the number of constraints.
4.3 Classification accuracy and support for the BERT-based classifiers when only segments with associated softmax value above some threshold are considered.
5.1 Baseline transcription and coding pipeline developed to assess the quality of a psychotherapy session.
5.2 Transcription and coding pipeline employing linguistically-aided, role-based speaker diarization.
5.3 Speaker error rate (SER) per UCC session for the different system designs.

Abbreviations

ADOS: Autism Diagnostic Observation Schedule
AM: Acoustic Model
ASR: Automatic Speech Recognition
BERT: Bidirectional Encoder Representations from Transformers
BIC: Bayesian Information Criterion
CL: Cannot-Link
CMLLR: Constrained Maximum Likelihood Linear Regression
CNN: Convolutional Neural Network
CPTS: Counseling and Psychotherapy Transcripts Series
CRF: Conditional Random Field
DER: Diarization Error Rate
DNN: Deep Neural Network
E2CP: Exhaustive and Efficient Constraint Propagation
FA: Facilitate
GI: Giving Information
GMM: Gaussian Mixture Model
HAC: Hierarchical Agglomerative Clustering
HMM: Hidden Markov Model
IRR: Inter-Rater Reliability
LDA: Linear Discriminant Analysis
LDC: Linguistic Data Consortium
LM: Language Model
LSTM: Long Short-Term Memory
MFCC: Mel Frequency Cepstral Coefficient
MI: Motivational Interviewing
MIA: Motivational Interviewing - Adherent
MISC: Motivational Interviewing Skill Code
ML: Must-Link
MR: Misclassification Rate
NER: Named Entity Recognition
NLP: Natural Language Processing
PDD: Pervasive Developmental Disorder
PLDA: Probabilistic Linear Discriminant Analysis
QUC: Closed Question
QUO: Open Question
RATS: Robust Automatic Transcription of Speech
REC: Complex Reflection
RES: Simple Reflection
SC: Speaker Clustering
SER: Speaker Error Rate
SRR: Speaker Role Recognition
ST: Structure
SU: Subword Unit
SVM: Support Vector Machine
TAL: This American Life
TDNN: Time-Delay Neural Network
UCC: University Counseling Center
VAD: Voice Activity Detection
WER: Word Error Rate
WFSA: Weighted Finite State Acceptor
WFST: Weighted Finite State Transducer

Abstract

Individuals assume distinct roles in different situations throughout their lives, and people who consistently adopt particular roles develop specific commonalities in behavior. As a result, roles can be defined in terms of observable tendencies and behavioral patterns that can be manifest through a wide range of modalities during a conversational interaction. For instance, an interviewer is expected to use more interrogative words than the interviewee, and a teacher is likely to speak in a more didactic style than the student. Speaker role recognition is the task of assigning a role label to a speech segment where a single speaker is active, through computational models that capture such behavioral characteristics.
The approaches that tackle this problem depend on successful pre-processing steps applied on the recorded conversation, such as speaker segmentation and clustering or automatic speech recognition, something that inevitably leads to error propagation. At the same time, accurate role information can provide valuable cues for the aforementioned speech processing tasks.

In this dissertation I propose techniques that combine role recognition with other speech processing modules to alleviate the problem of error propagation. Additionally, focusing on the task of speaker diarization (that answers the question "who spoke when"), I demonstrate that role-aware systems can achieve improved performance when compared to traditional, state-of-the-art approaches. Finally, I showcase how some of the proposed techniques can be applied in a real-world system, by presenting and analyzing an automated tool for psychotherapy quality assessment, where robust diarization and role identification (i.e., therapist vs. patient) are of critical importance.

Introduction

Roles and Human Interactions

Roles are one of the most important concepts in understanding and modeling human behavior. According to social psychology, individuals assume distinct roles in different situations throughout their lives that both guide their own behavioral patterns and create expectations about the behaviors of other people they interact with (Biddle, 1986). Systems of human interaction can be viewed as "microscopic social systems" (Bales, 1950), and roles can be defined as stated functions "associated with a position in a group (a status) with rights and duties toward one or more other group members" (Hare, 1994). The underlying social structure, the context, and the end goal of an interaction both enable and constrain the participants' actions and behaviors (Gleave, Welser, Lento, & Smith, 2009), and this is reflected in the participants' roles. For instance, the role of the parent is associated with protecting and caring for offspring, the role of the chief executive officer is linked to taking managerial decisions that will lead to the economic growth of a company, and the role of the lecturer is related to conveying a clear message to their audience. Those roles can be adopted by the same person during different interactions and can occasionally collide and conflict with each other. However, clear role expectations can assist towards better task distribution within a group, promote individual responsibility and accountability, improve group cohesion, and eventually lead to more effective task performance (Mudrack & Farrell, 1995).

Roles can be distinguished into two broad categories. Formal roles (e.g., interviewer vs. interviewee) are typically associated with pre-defined objectives of an agent within a group, while informal roles (e.g., protagonist vs. supporter in a group discussion) can develop naturally as a result of interpersonal interactions and social dynamics and are sometimes referred to as emergent roles (Hare, 1994). Additionally, roles can be assigned to people either implicitly, because of the organizational or social structure of the environment where the interaction takes place, or explicitly, by requiring participants to perform specific tasks and providing detailed guidelines.
Such scripted roles are of particular interest in collaborative learning scenarios, where role playing can foster engagement, discussion, and knowledge sharing (Strijbos & De Laat, 2010), leading to improved learning outcomes when compared to unstructured interactions (Weinberger, Stegmann, & Fischer, 2010). They are also a key aspect in group psychotherapy, where patients learn to adhere to specific social and moral values through psychodrama (Kipper, 1992).

In any case, people who consistently adopt particular roles develop specific commonalities in behavior (Gleave et al., 2009). As a result, roles can be defined in terms of observable tendencies and behavioral regularities that can be manifest through a wide range of modalities during a conversational interaction. Thus, different roles may be associated with distinguishable patterns observed in acoustic, prosodic, linguistic, and structural characteristics (Bales, 1950; Knapp, Hall, & Horgan, 2013; Sacks, Schegloff, & Jefferson, 1978). For instance, a teacher is likely to speak in a more didactic style while a student is likely to be more inquisitive, an interviewer is expected to use more interrogative words than the interviewee, a doctor is likely to inquire about symptoms and prescribe while a patient describes their symptoms, and so on. All those patterns can be viewed as the structural signatures of the various roles and can be studied through statistical analysis and appropriate computational modeling.

Computational Analysis of Speaker Roles

The phenomenal growth of multimedia data, including audio recordings, during the last few years has been connected to heavy demands for efficient data manipulation applications. Speaker role information can be used to facilitate such applications, including audio indexing (Bigot, Ferrané, Pinquier, & André-Obrecht, 2010), topic-based segmentation (Vinciarelli & Favre, 2007), information retrieval (Barzilay, Collins, Hirschberg, & Whittaker, 2000), media browser enhancement (Ordelman, De Jong, & Larson, 2009), and multimedia summarization (Vinciarelli, 2006). At the same time, speaker roles offer valuable cues when studying various aspects of human communication such as entrainment and dominance (Beňuš et al., 2014; Danescu-Niculescu-Mizil, Lee, Pang, & Kleinberg, 2012). They are, additionally, of critical importance in computer-supported collaborative learning (Strijbos & De Laat, 2010), as well as in social computing and robotics (Beňuš, 2014). Roles can also be of great value for the development of specialized dialogue models. For instance, in the psychotherapy domain, chatbots can play both the role of a therapist to provide mental health care services (Inkster, Sarda, & Subramanian, 2018), and the role of a patient to assist in training new counselors (Demasi, Li, & Yu, 2020; Tanana, Soma, Srikumar, Atkins, & Imel, 2019). A closely related notion to roles is that of personae. Personae, also known as character archetypes, are classes of characters grouped by similar behavioral traits (Jung, 2014), which means they are affected by the roles they potentially assume. Being able to adopt consistent personae is an essential element of engaging, naturalistic interactions (Roller et al., 2021) and, thus, persona modeling has been a key area of research in developing artificial conversational agents (Demasi et al., 2020; Song, Zhang, Cui, Wang, & Liu, 2019).
Given the importance of speaker roles in multimedia analysis, it is not surprising that there has been an increasing interest in applying computational methods to automatically recognize roles in speech documents. Formal speaker role recognition has been explored in a variety of domains, such as broadcast news programs (Bigot, Fredouille, & Charlet, 2013; Salamin & Vinciarelli, 2012), call centers (Garnier-Rizet et al., 2008), business meetings (Favre, Dielmann, & Vinciarelli, 2009; Sapru & Valente, 2012), psychotherapy sessions (Xiao, Huang, et al., 2016), press conferences (Li et al., 2017), interviews (Rasipuram & Jayagopi, 2018), and medical discussions (Luz, 2009). Other studies have investigated recognition of informal, emergent roles in multi-party interactions occurring in meetings (Sapru & Bourlard, 2015; Zancanaro, Lepri, & Pianesi, 2006) or in computer-assisted learning platforms (Dowell, Nixon, & Graesser, 2019). Usually, role recognition assigns a label from a pre-defined, finite set to a speech segment and is, therefore, viewed as a supervised classification task. However, unsupervised approaches that exploit the structure of the interaction and discover roles through clustering have also been proposed (Dowell et al., 2019; Li et al., 2017).

In order to address the problem of role recognition, appropriate features that capture the distinguishable patterns between different roles have to be extracted. Those features need to exploit characteristics that may be shared between different individuals, since the same role can be played by various speakers. Role-specific regularities can be found in the acoustic (Bigot, Ferrané, et al., 2010), lexical (Garg, Favre, Salamin, Hakkani-Tür, & Vinciarelli, 2008), prosodic (Sapru & Valente, 2012), or structural (Salamin & Vinciarelli, 2012) characteristics of the speech signal, with the importance of each modality being task-specific. The extracted features are coupled with machine learning algorithms towards the final task of role classification or clustering. Early works in the field used boosting algorithms, maximum entropy classifiers, and support vector machines (Barzilay et al., 2000; Y. Liu, 2006; Zancanaro et al., 2006). More recently, social network analysis (Garg et al., 2008; Marcos-García, Martínez-Monés, & Dimitriadis, 2015), conditional random fields (Salamin & Vinciarelli, 2012), and deep learning approaches (Li et al., 2017) have also been explored.

Automated role recognition methods that rely on the computational analysis of recorded speech signals typically depend on successful pre-processing steps applied on the conversation, such as speaker diarization (answering the question "who spoke when") or automatic speech recognition (ASR). At the same time, accurate role information can improve the performance of the aforementioned speech processing tasks (Sapru, Yella, & Bourlard, 2014; Valente, Vijayasenan, & Motlicek, 2011). This interplay between core speech processing and speaker roles is the focus of the current dissertation.

Research Directions

In this dissertation I build and apply computational models to i) recognize speaker roles using speech and language processing techniques, and ii) use speaker role information to facilitate speech applications. In more detail, I study formal roles within both dyadic and multi-party recorded conversational interactions (e.g., therapist during a psychotherapy session, host during a podcast, project manager during a business meeting) and:
1. I propose a framework for speaker role recognition that alleviates error propagation from pre-processing steps, and

2. I leverage speaker role information to improve the performance of core modules in a speech processing pipeline, with a focus on speaker diarization.

My work can be summarized in the following research statement: The behavioral patterns found within conversational interactions can help us study speaker roles towards improved performance in speech processing tasks.

Outline

The current dissertation is structured as follows:

The Introduction defines roles within the context of human interactions and reviews the computational methods that have been proposed in the literature for recognition and analysis of speaker roles, as well as applications for which speaker role information is a useful or even essential sub-task.

In Part I the focus is on how we can effectively use specific speech processing techniques in order to robustly infer speaker roles. To that end, Chapter 1 introduces a framework for the task of speaker role recognition that combines speaker-specific and role-specific information within a conversation from both the acoustic and linguistic modalities. The linguistic information here is acquired through manually-derived transcripts. Chapter 2 describes an effective way to infer speaker roles from transcribed audio data in real-world situations where transcriptions are obtained by an automatic speech recognition system.

In Part II we switch the focus to how speaker role information can help improve the performance of core speech analysis tasks within certain domains. The main area of interest is speaker diarization, the problem of answering the question "who spoke when" within a conversation. Even though this is typically addressed as an audio-only clustering-based problem, herein I explore ways to provide supplemental information in the form of linguistically extracted speaker roles. Chapter 3 presents a way to reduce the clustering diarization problem into a classification one, answering the question "which role spoke when". A limitation of the proposed approach is that it assumes a one-to-one correspondence between speakers and roles, i.e., each speaker needs to be associated with a unique role within the conversation (e.g., single interviewer vs. single interviewee). To address this limitation, Chapter 4 introduces an alternative, two-step framework where the language-based roles are only used to impose constraints on the subsequent audio-based clustering step.

Part III presents how some of the role-based computational techniques proposed in this work can be successfully applied in a real-world application. To that end, in Chapter 5 a fully automated psychotherapy quality assessment tool, deployed in clinical settings, is described and analyzed. We see why speaker role recognition is an essential element of the system and how the techniques introduced in Chapter 3 can be used to improve the overall performance, with respect to the downstream task of therapy evaluation.

The last chapter, named Conclusions and Future Directions, presents an overview of the dissertation and gives potential directions for future work. While my research has focused on formal roles, computational analysis of informal, emergent speaker roles and their usage within speech processing is an exciting area for future research.
The relationship of speaker roles, either formal or informal, with other aspects of a person's identity and with social phenomena is another interesting, and quite unexplored from a computational perspective, research area.

Part I: Extracting Speaker Roles

Chapter 1: Combined Speaker Clustering and Role Recognition in Conversational Speech

Speaker role recognition (SRR) is usually addressed either as an independent classification task, or as a subsequent step after a speaker clustering module. However, the first approach does not take speaker-specific variabilities into account, while the second one results in error propagation. In this chapter we propose the integration of an audio-based speaker clustering algorithm with a language-aided role recognizer into a meta-classifier which takes both modalities into account. That way, we can treat separately any speaker-specific and role-specific characteristics before combining the relevant information together. The method is evaluated on two corpora of different conditions with interactions between a clinician and a patient, and it is shown that it yields superior results for the SRR task. The work presented in this chapter has been published in (Flemotomos, Papadopoulos, Gibson, & Narayanan, 2018).

1.1 Introduction

Speaker role recognition (SRR) is the task of assigning a specific role to each speaker turn (speaker-homogeneous segment) in a speech signal. This task plays a significant role in numerous areas, such as information retrieval (Barzilay et al., 2000), audio indexing (Bigot, Ferrané, et al., 2010), or social interaction analysis (Biddle, 1986). Most of the research efforts have been focused on identifying roles in broadcast news programs or talk shows (Bazillon, Maza, Rouvier, Bechet, & Nasr, 2011; Damnati & Charlet, 2011a; Laurent, Camelin, & Raymond, 2014; Salamin & Vinciarelli, 2012), while there have also been works dealing with meeting scenarios (Sapru & Valente, 2012), conferences (Li et al., 2017), medical discussions between domain experts (Luz, 2009), and psychotherapy sessions (Xiao, Huang, et al., 2016). Both supervised (Barzilay et al., 2000; Bigot et al., 2013; Laurent et al., 2014; Rouvier, Delecraz, Favre, Bendris, & Bechet, 2015) and unsupervised (Hutchinson, Zhang, & Ostendorf, 2010; Li et al., 2017) methods have been presented.

The approaches towards dealing with the problem of SRR can be distinguished on the basis of whether the final decision is made at the turn level or the speaker level. In the former case (Figure 1.1a), a classifier is built where the input space is the space of speaker turns with no speaker information available. In a real-world application, those turns are obtained through a speaker change detection algorithm. The first works in the field use boosting algorithms (Barzilay et al., 2000) and statistical methods (Barzilay et al., 2000; Y. Liu, 2006) towards this classification task. Sapru and Valente (2012) combine lexical, prosodic, structural, and dialog act information, also through boosting algorithms. Damnati and Charlet (2011a) combine audio-based and language-based classifiers with early or late fusion through a logistic regression model. Finally, Rouvier et al. (2015) have more recently applied deep learning techniques to learn turn-level role embeddings.
In the case of speaker-level SRR (Figure 1.1b), the classifier is built in two steps, the first being a speaker clustering (SC) algorithm, or a diarization system in the more general case [1], where turns are grouped into same-speaker clusters in an unsupervised way and then each cluster is assigned a specific role. In this line of work, Vinciarelli (2007) uses a social network analysis approach taking into consideration relational data across different speakers, while Bigot, Ferrané, et al. (2010) and Bigot et al. (2013) propose a hierarchical classification system. W. Wang, Yaman, Precoda, and Richey (2011) investigate the effect of various modalities on the final performance of SRR when using boosting algorithms. Dufour, Estève, and Deléglise (2011) study the relationship between speech spontaneity levels and speaker roles, using a classifier based on boosting methods with decision stumps, which are replaced by small decision trees by Laurent et al. (2014). Bazillon et al. (2011) use question types as features, with results reported both at the speaker and the turn level.

[1] More details on speaker clustering and speaker diarization are provided in Chapters 3 and 4.

[Figure 1.1: Two approaches for speaker role recognition. (a) Turn-level SRR: speaker-homogeneous segments are fed directly to a turn-level role recognizer. (b) Speaker-level SRR: speaker-homogeneous segments are first grouped by a speaker clustering step and roles are then assigned at the speaker level.]

In contrast to tasks such as speaker identification, the features to be extracted for SRR have to exploit characteristics that may be shared between different individuals, since the same role can be shared between various speakers. However, knowledge of speaker-specific information can lead to better classification results (e.g., Bazillon et al., 2011), which is the reason why many SRR-related works operate at the speaker level, employing an SC step. A major drawback of this piped approach, presented in Figure 1.1b, is that no matter how good the subsequent classifier is, any potential error in the SC algorithm is propagated and the overall performance is upper-bounded by the performance of the SC module. Thus, it is desirable to effectively combine speaker-specific and role-specific information without such problems.

To that end, Salamin and Vinciarelli (2012) propose an approach where the final role recognition decision is taken at the turn level, but speaker information, available after a diarization step, is taken into account during feature extraction. However, that information is only used for the extraction of structural features (such as the average time between two turns of the current speaker). Those are combined with turn-level prosodic features and the final classification is made using conditional random fields (CRFs). It is reported that, when using oracle speaker segmentation, this combination does not lead to improved results over the independent usage of the two different feature sets. Damnati and Charlet (2011b) present a hybrid hierarchical approach, where the SC output is used to distinguish at the speaker level a specific role from all the others, which are then classified at the turn level. However, this approach has been proposed specifically for application in broadcast news shows, taking into consideration different variabilities between the anchors and the reporters on the one hand and between the reporters and others on the other.
In this chapter, we present an alternative generic framework to combine an SC algorithm with a turn-level supervised role classifier, in such a way that both speaker-specific and role-specific information is taken into account for the final decision. We evaluate our method on the binary problem of patient-clinician interactions using manually extracted speaker turns. However, the framework presented is generalizable to an arbitrary number of speakers, under the assumption of one-to-one correspondence between speakers and roles in a single speech document, in the sense that each speaker is uniquely linked to a single role within the conversation [2].

[2] This is an assumption we will follow throughout most of the dissertation. In Chapter 4 we will extensively discuss how such an assumption can be limiting in some domains and we will see an application-specific approach that can be used even in cases where such an assumption does not hold.

1.2 Proposed Method

1.2.1 General framework

We propose the combined architecture presented in Figure 1.2, where the SC and role recognition modules work in parallel and their output is fed as input to a meta-classifier. We assume that we know a priori the number of speakers in the speech document, say $N$, and that there is a one-to-one correspondence between the set of speakers $\{S_i\}_{i=1}^{N}$ and the set of roles $\{R_i\}_{i=1}^{N}$.

[Figure 1.2: Proposed approach for speaker role recognition. Speaker-homogeneous segments are processed in parallel by a speaker clustering module and a turn-level role recognition module, whose scores are combined by a meta-classifier.]

We treat the outputs of the two modules as continuous-valued scores assigned to each speaker/role label. Thus, the output of the SC algorithm is the sequence of tuples $(p_{1i})_{i=1}^{N}, (p_{2i})_{i=1}^{N}, \ldots, (p_{Ti})_{i=1}^{N}$, such that the $k$-th turn would be assigned the speaker label $S_m$ if and only if $p_{km} = \max_i p_{ki}$. Similarly, the output of the role recognition module is the sequence of tuples $(q_{1i})_{i=1}^{N}, (q_{2i})_{i=1}^{N}, \ldots, (q_{Ti})_{i=1}^{N}$, such that the $k$-th turn would be assigned the role label $R_m$ if and only if $q_{km} = \max_i q_{ki}$. In that way, for each turn we have $2N$ scores corresponding to the $N$ speakers/roles. Those are treated as input features for the classifier of the last step of the architecture.

Since there is not a natural correspondence between the two systems' outputs, it is necessary to find the optimal matching between the two sets of labels $\{S_i\}_{i=1}^{N}$ and $\{R_i\}_{i=1}^{N}$. This is a standard step taking place in the more general case of diarization system output combination (Bozonnet et al., 2010; Tranter, 2005) or for the evaluation of speaker clustering performance (D. Liu & Kubala, 2004). For a small $N$ (which is a realistic assumption for conversational settings), it is easy to find this matching in an exhaustive way. Formally, if we denote such a matching as the mapping $M: \{S_i\}_{i=1}^{N} \rightarrow \{R_i\}_{i=1}^{N}$, the optimal matching is defined as

$$\hat{M} = \operatorname*{argmin}_{M} \sum_{k=1}^{T} \mathbb{I}\big(M(S'_k) \neq R'_k\big)\, d_k \qquad (1.1)$$

where $S'_k \in \{S_i\}_{i=1}^{N}$ and $R'_k \in \{R_i\}_{i=1}^{N}$ are the labels assigned by the two modules to the $k$-th turn, $\mathbb{I}(\cdot)$ is the indicator function, $d_k$ is the duration of the turn, and $T$ is the total number of turns in the speech document.
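Since $N$ is small, the optimal matching of Eq. (1.1) can be found by brute force over all speaker-to-role permutations. The following is a minimal illustrative sketch of that search (not code from the dissertation); the function and variable names are hypothetical.

```python
from itertools import permutations

def optimal_matching(speaker_labels, role_labels, durations, speakers, roles):
    """Exhaustive search for the speaker-to-role mapping M minimizing Eq. (1.1):
    the total duration of turns on which the mapped speaker label and the
    role label disagree."""
    best_map, best_cost = None, float("inf")
    for perm in permutations(roles):
        mapping = dict(zip(speakers, perm))       # candidate mapping M
        cost = sum(d for s, r, d in zip(speaker_labels, role_labels, durations)
                   if mapping[s] != r)            # duration-weighted disagreement
        if cost < best_cost:
            best_map, best_cost = mapping, cost
    return best_map

# Example (dyadic case), with turn-level outputs of the two modules:
# optimal_matching(["S1", "S2", "S1"], ["T", "Cl", "T"], [3.2, 5.1, 1.0],
#                  speakers=["S1", "S2"], roles=["T", "Cl"])  -> {"S1": "T", "S2": "Cl"}
```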
1.2.2 Speaker clustering module

For the speaker clustering module we use a simple Bayesian information criterion (BIC) based hierarchical agglomerative clustering (HAC) algorithm (S. Chen & Gopalakrishnan, 1998; Cheng & Wang, 2003). At each step of the HAC procedure we use one Gaussian to model each cluster, so that the distance metric, known as BIC, between two clusters $x$ and $y$, with $n_x$, $n_y$ members (frames) and with covariance matrices $\Sigma_x$, $\Sigma_y$, respectively, is

$$\mathrm{BIC}(x,y) = \frac{1}{2}\big(n\log|\Sigma| - n_x\log|\Sigma_x| - n_y\log|\Sigma_y|\big) - \lambda\,\frac{d(d+3)}{4}\log n \qquad (1.2)$$

where $n = n_x + n_y$, $\Sigma$ is the covariance matrix if we merge the clusters $x$ and $y$, $d$ is the dimensionality of the feature vector representing each frame, and $\lambda$ is a penalty factor ($\lambda = 1$ for our experiments). At each step, the pair of clusters with the minimum BIC is merged. Speaker clustering in this work is purely based on the acoustic information, and as features we use the 13 first MFCCs for each frame. At the last step, we have one Gaussian modeling each of the $N$ speakers, and the required scores for the turn are the per-frame log-likelihoods with respect to each Gaussian averaged over the voiced frames of the turn. The voiced frames are identified with a voice activity detection (VAD) algorithm, which is also applied at the initial step of the HAC procedure, so that the constructed Gaussians model only the voiced information for each speaker.

1.2.3 Role recognition module

We explore two different approaches for the role recognition module: one language-based and one audio-based. In order to build a language-based role recognizer that exploits the linguistic patterns potentially shared between speakers with the same roles, we use similar ideas as in the role matching module presented by Xiao, Huang, et al. (2016). Since we treat role recognition as a supervised classification task, we need a role-labeled training set of speaker turns. On that set we train $N$ n-gram language models (LMs), one for each role. During the test phase, we evaluate the perplexity of the turn to be classified with respect to all the constructed LMs. The required scores to be used as input to the meta-classifier are the $N$ negative log-perplexities.

Even though we use the acoustic information in the SC module, we are interested in exploring the hypothesis that the exact same information has predictive power over roles, apart from speakers. Following a similar idea as in (Damnati & Charlet, 2011a), we build an acoustic model (AM) for each one of the $N$ roles. The AM for a role is a Gaussian mixture model (GMM) fit on the voiced frames of all the turns available in the training set which are labeled with that role. The scores for the turn to be used during the test phase are again, as in the case of the SC algorithm, the $N$ per-frame log-likelihoods with respect to each GMM averaged over the voiced frames of the turn.
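To make the language-based scoring concrete, the sketch below computes one negative log-perplexity per role and lets a turn be assigned the role whose LM fits it best. It is only an illustration under simplifying assumptions: it uses a toy add-k-smoothed bigram model instead of the interpolated 3-gram SRILM models used in the experiments, and all names are hypothetical.

```python
import math
from collections import Counter

def train_bigram_lm(turns, k=1.0):
    """Toy add-k-smoothed bigram LM trained on a list of tokenized turns."""
    unigrams, bigrams = Counter(), Counter()
    for toks in turns:
        toks = ["<s>"] + toks + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab_size = len(unigrams) + 1
    return lambda w1, w2: math.log((bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size))

def neg_log_perplexity(lm, toks):
    """Average per-token log-probability, i.e., the negative log-perplexity."""
    toks = ["<s>"] + toks + ["</s>"]
    return sum(lm(w1, w2) for w1, w2 in zip(toks[:-1], toks[1:])) / (len(toks) - 1)

def score_turn(turn_tokens, role_lms):
    """Return one negative log-perplexity per role; these N scores feed the
    meta-classifier, and their argmax alone gives the LM-only role decision."""
    return {role: neg_log_perplexity(lm, turn_tokens) for role, lm in role_lms.items()}
```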
1.3 Datasets

For this work, we evaluate our proposed method on two different corpora from the psychology domain, featuring interactions between a clinician and a patient. The first corpus is composed of motivational interviewing (MI) sessions (a specific type of psychotherapy) between a therapist (T) and a client (Cl), collected from six independent clinical trials (ARC, ESPSB, ESB21, CTT, iCHAMP, HMCBI; Atkins, Steyvers, Imel, & Smyth, 2014; Baer et al., 2009) [3]. We collectively refer to those sessions as the MI corpus. In this study, we use 343 manually transcribed sessions. The second corpus comprises autism diagnostic observation schedule (ADOS) assessments between a psychologist (P) and a child (Ch) being evaluated for a pervasive developmental disorder (PDD) (Lord et al., 2000). In this study, we use 273 manually transcribed sessions, with a minimum duration of 2 min.

[3] Motivational interviewing is studied extensively in Chapter 5.

There is a limited number of sessions where there are more than two speakers involved. In such cases, we do not take into account any turns not belonging to the clinician/patient for our analysis. Additionally, there is a limited number of non-pure speaker turns, in the sense that the manually annotated boundaries are not optimal and occasionally overlap. We chose to include such turns in the analysis without any preprocessing, since in a real-world setting (i.e., after automatic segmentation) such problems are impossible to completely avoid.

Some descriptive analysis for the two datasets is presented in Table 1.1. Unfortunately, the exact total number of different clients is not available for the MI dataset. However, under the assumption that it is highly improbable for the same client to visit different therapists in the same study, and having partial information available about the client identities, we made the train/test split in a way that we are highly confident there is no overlap between speakers. Similarly, the exact total number of psychologists is unknown for the ADOS corpus, but the data are collected from two different clinics (in different cities) and we assume that the same clinician does not work for both. So, the data from one clinic is used for training and from the other for testing.

Table 1.1: Descriptive analysis of the corpora used.

|                 | MI-train  | MI-test   | ADOS-train | ADOS-test |
|-----------------|-----------|-----------|------------|-----------|
| #sessions       | 242       | 101       | 141        | 132       |
| duration (mean) | 27.24 min | 33.14 min | 3.67 min   | 3.67 min  |
| duration (std)  | 14.40 min | 17.42 min | 1.34 min   | 1.65 min  |
| duration-T/P    | 47.30 h   | 26.35 h   | 2.63 h     | 2.52 h    |
| duration-Cl/Ch  | 52.96 h   | 25.87 h   | 2.97 h     | 2.98 h    |
| #T/P            | 123       | 53        | –          | –         |
| #Cl/Ch          | –         | –         | 89         | 81        |

By duration-T/P and duration-Cl/Ch we denote the total duration of all the speaker turns labeled as therapist/psychologist and client/child, respectively. By #T/P and #Cl/Ch we denote the total number of different therapists/psychologists and clients/children.

1.4 Experiments and Results

The two available datasets are split into train and test sets, as explained in Section 1.3, in a way that, with high confidence, there are no overlapping speakers between the sets, in order to ensure that the trained models indeed capture role-specific and not speaker-specific information. The train set is only used to build the LMs and AMs described in Section 1.2.3 corresponding to the different roles.

The LMs are 3-gram models trained (and later evaluated) using the SRILM toolkit (Stolcke, 2002) with manually derived transcriptions of the recordings. In order to ensure a large enough vocabulary that minimizes the unseen words during the test phase, we interpolate those models with a large background model, namely with the pruned version of the 3-gram model of cantab-TEDLIUM (Williams, Prasad, Mrva, Ash, & Robinson, 2015), giving a weight of 0.9 to the domain-specific LM and 0.1 to the background one.

The AMs are diagonal GMMs, modeling the frames of turns assigned to each role, where frames are represented by 13-dimensional MFCCs. During training, we take into consideration only the voiced frames, by applying to the initial speaker turns a simple, energy-based VAD algorithm, as implemented in the Kaldi speech recognition toolkit (Povey et al., 2011). The same VAD algorithm is applied during evaluation, as well as during the SC step, as explained in Section 1.2.2. As a meta-classifier we use a binary linear support vector machine (SVM), since we evaluate on binary problems.
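As a rough illustration of how the $2N$ module scores can be stacked for the meta-classifier, the snippet below trains a linear SVM on the concatenated speaker-clustering and role-recognition scores. This is a hedged sketch, assuming scikit-learn and hypothetical toy values, not the exact experimental setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical inputs: for each turn, N scores from the SC module (after the
# optimal speaker-to-role matching) and N scores from the role recognizer.
sc_scores = np.array([[-48.2, -52.9], [-51.0, -47.5], [-47.9, -53.4], [-52.2, -46.8]])
role_scores = np.array([[-3.1, -4.0], [-4.2, -2.8], [-2.9, -4.3], [-4.5, -3.0]])
labels = np.array([0, 1, 0, 1])           # reference roles (e.g., 0 = T, 1 = Cl)

X = np.hstack([sc_scores, role_scores])   # 2N features per turn
meta_clf = LinearSVC().fit(X, labels)     # turn-level meta-classifier
predictions = meta_clf.predict(X)
```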
All the results are based on a 5-fold cross-validation scheme on the data allocated for testing in each dataset, where, as is the case for the initial train/test split, we use all the available meta-data information to minimize any possible overlapping of speakers between different folds. The reason we are adopting this approach and do not use the training part of the datasets is that we do not want to pipe data already seen by the AMs and/or LMs to the SVM training.

As the evaluation metric of SRR we use the misclassification rate (MR), defined as (D. Liu & Kubala, 2004)

$$\mathrm{MR} = \frac{\#\text{misclassified frames}}{\text{total }\#\text{frames}} = \frac{\sum_k \mathbb{I}(R_k \neq \hat{R}_k)\, d_k}{\sum_k d_k} \qquad (1.3)$$

where the summation is over all the speaker turns, $R_k$ is the role assigned by the algorithm, $\hat{R}_k$ is the reference role, $d_k$ is the duration of the $k$-th turn, and $\mathbb{I}(\cdot)$ is the indicator function.

In Figure 1.3 we can see how MR is affected by the number of Gaussians in the GMM-based AM, when only the audio-based role recognizer is used. Based on that, we use 512 Gaussians for the subsequent experiments, both for the MI and the ADOS datasets.

[Figure 1.3: SRR misclassification rate when using only the AM-based decision, as a function of the number of Gaussians in the GMM, for the MI and ADOS datasets.]

In this work we do not report results for the piped architecture presented in Figure 1.1b using an actual classification algorithm as the second step of the pipeline. Instead, in Table 1.2 we give the best possible result with this architecture when using the SC algorithm that we have described. Using a perfect classification algorithm for the SRR task at the speaker level, which we denote as R†, the overall error of the system is always lower-bounded by the error of the SC algorithm itself. So, the results reported in the SC+R†-piped column of the table are in fact the MRs of the SC algorithm. The language-based and audio-based recognizers are evaluated when used independently (LM-only and AM-only) and when used in the combined architecture presented in Figure 1.2 (SC+LM-comb and SC+AM-comb). The results are reported in Table 1.2.

Table 1.2: Misclassification rates (%) of the SC algorithm, the language-based recognizer (LM), and the audio-based recognizer (AM), when used independently (only) or in a piped (piped) or combined (comb) architecture for the task of SRR. By R† an optimal, 0-error classification algorithm is denoted.

|      | SC+R†-piped | LM-only | SC+LM-comb | AM-only | SC+AM-comb | AM+LM-comb | SC+AM+LM-comb |
|------|-------------|---------|------------|---------|------------|------------|---------------|
| MI   | 3.59        | 9.49    | 2.76       | 35.45   | 3.66       | 9.17       | 2.71          |
| ADOS | 12.67       | 12.37   | 7.70       | 14.03   | 10.58      | 8.02       | 5.98          |

As we can see, the LM-based approach has a strong predictive power for both datasets, revealing differences in the linguistic patterns between a clinical provider and a client or a child evaluated for PDD. When this is combined with the SC algorithm, which captures the speaker-specific differences within a single session, the results are considerably better, compared not only to the independent classifiers, but also to the piped architecture.

On the other hand, the AM approach does not behave in the same manner for the two datasets. As expected, the acoustic characteristics of the children as a whole are different than those of the adult clinicians. This is reflected in the AM-only results for the ADOS data, even though they are still worse than the LM-only ones. This age distinction between the two different groups of speakers does not exist in the MI dataset. So, although it seems from the results that there is some non-negligible acoustic variability between the clinicians and the clients, the performance gap between the LM-only and the AM-only approaches is much larger for those data. When combined with the SC algorithm the results are substantially better, because the meta-classifier is affected by the more separated scores which are the output of the SC module. This notion of "separability" is visually depicted in Figure 1.4, where we show how the outputs of the SC, LM, and AM modules are distributed on the plane. It is of high interest that in the case of the ADOS dataset, because of its very special nature, the exact same information (at the feature level) can be used to capture both role-specific and speaker-specific variabilities, in a way that if the two modules are combined by our proposed architecture (SC+AM-comb), they can improve the overall performance as if they carried complementary information.

As a final experiment, we combine the outputs of the LM- and the AM-based recognizers, again using the linear SVM as the meta-classifier (AM+LM-comb), and we also combine all three constructed modules in an extended combined architecture (SC+AM+LM-comb). In this latter case the meta-classifier gets 3 × 2 (in the general case 3N) inputs for each turn to be classified. We note that the result of the optimal matching between SC and LM was the same as between SC and AM, so we did not encounter any conflict. When compared to the LM-only and the SC+LM-comb results, the addition of the acoustic-based recognizer in the architecture does not lead to any substantial improvements, as expected, for the MI data, but does improve the performance of the system for the case of the ADOS sessions. Overall, the relative error improvement with our final system which follows the combined architecture is 24.5% for the MI data and 52.8% for the ADOS data, when compared to the piped architecture with an optimal recognizer.

1.5 Conclusion

In this chapter we proposed a framework to incorporate speaker-specific and role-specific information for the SRR task, by independently implementing an unsupervised SC algorithm and a supervised turn-level role classifier, the output scores of which are fed to a meta-classifier which gives a turn-level final decision. By evaluating our method on dyadic interactions we showed that it yields superior results, compared both to the independent use of turn-level classifiers which do not take speaker-specific variabilities into account, and to systems that use speaker-specific information by applying SC as a first step and predicting the output at the speaker level.

[Figure 1.4: Distribution of the scores which are the output of the SC ((a),(b)), the LM-based recognizer ((c),(d)), and the AM-based recognizer ((e),(f)) for the MI ((a),(c),(e)) and the ADOS ((b),(d),(f)) datasets. Each data point is a speaker turn with size proportional to the turn length; 300 turns of the test set are randomly shown for each dataset. x_a and x_t are the acoustic and textual representations of a turn x. LM_R and AM_R are the LM and AM corresponding to the role R. G_R is the Gaussian corresponding to the role R at the end of the SC and after an optimal matching between speakers and roles.]
One drawback of our methodology is that it requires additional data for the training of the meta-classifier. Moreover, in a real-world scenario, the speaker boundaries, as well as the language-based features, would be extracted, at least at the evaluation phase, from diarization and automatic speech recognition (ASR) outputs, which can lead to error propagation. In the following chapter, we will explore a technique to mitigate such potential error propagation due to ASR.

Chapter 2: Role Specific Lattice Rescoring for Speaker Role Recognition from Speech Recognition Outputs

As shown in the previous chapter, the language patterns followed by different speakers who play specific roles in conversational interactions provide valuable cues for the task of speaker role recognition (SRR). Given the speech signal, existing algorithms typically try to find such patterns either in manually derived transcripts or in the best path of an automatic speech recognition (ASR) system. In this chapter we propose an alternative way of revealing role-specific linguistic characteristics, by making use of role-specific ASR outputs, which are built by suitably rescoring the lattice produced after a first pass of ASR decoding. That way, we avoid pruning the lattice too early, eliminating the potential risk of information loss. The work presented in this chapter has been published in (Flemotomos, Georgiou, & Narayanan, 2019).

2.1 Introduction

In Chapter 1 we introduced the problem of SRR, defined as the classification task of mapping a speaker-homogeneous segment (speaker turn) to an element of a predefined set of roles, where a role is characterized by the task a speaker performs and the objectives related to it. Typical examples of conversational interactions between individuals with specific roles are business meetings (Sapru & Valente, 2012), broadcast news programs (Bigot et al., 2013; Damnati & Charlet, 2011a), psychotherapy sessions (Xiao, Huang, et al., 2016), or press conferences (Li et al., 2017).

In order to address the problem of SRR, appropriate features which capture distinguishable patterns between the different roles have to be extracted. Such patterns can be found in the acoustic (Bigot, Pinquier, Ferrané, & André-Obrecht, 2010), lexical (Garg et al., 2008), prosodic (Sapru & Valente, 2012), or structural (Li et al., 2017; Salamin & Vinciarelli, 2012) characteristics of the speech signal, with the importance of each modality being task-specific. For instance, it is desired that a psychotherapist speaks less than the client, an interviewer is expected to use more interrogative words than the interviewee, etc. However, as validated by our experiments in Chapter 1 where we explored linguistic and acoustic characteristics, it seems that language often carries the most important information for the problem at hand (Damnati & Charlet, 2011a; Sapru & Valente, 2012; W. Wang et al., 2011) and is more robust to unseen conditions (e.g., different speakers) (Rouvier et al., 2015), which is the reason why a great portion of the research efforts has been focused on studying and exploiting the lexical variability between the speaker roles.

The first efforts in the field extract bags of n-grams to represent the lexical information and use them as input features to boosting algorithms or maximum entropy classifiers (Barzilay et al., 2000; Y. Liu, 2006). Boosting approaches have also been followed by W. Wang et al. (2011) and Sapru and Valente (2012) to combine n-gram features with other modalities, with the final classification decision taken either at the speaker (W. Wang et al., 2011) or at the turn level (Sapru & Valente, 2012). Bazillon et al. (2011) first classify the types of questions posed by the different speakers and use that information for the role assignment. Rouvier et al. (2015) explore deep learning approaches by using word embeddings as inputs to convolutional neural networks. Xiao, Huang, et al. (2016) build role-specific n-gram LMs and reduce SRR to the problem of finding the LM which minimizes the perplexity of a speaker turn (or of all the turns assigned to a specific speaker after a speaker clustering step) [1].

[1] This is the baseline SRR approach we followed for our experimentation in the previous chapter, as detailed in Section 1.2.3.

Although a bulk of the aforementioned studies use manually transcribed speech data to perform SRR, in a real-world application the lexical information would become available after an ASR step (Rouvier et al., 2015; Xiao, Huang, et al., 2016). Moreover, Damnati and Charlet (2011b) suggest that the quality of ASR transcripts can be used to extract additional features carrying complementary information in specific scenarios. In any case, the ASR output is considered to be the best path of a system that uses generic acoustic and language models. In this chapter, we propose using role-specific ASR systems, each one of which gives a potentially different output together with a corresponding cost. Then, after passing any given turn through all the systems, we can assign to that turn the role which corresponds to the system producing the minimum cost. In particular, for this study, we create the role-specific systems by rescoring the lattices generated by a generic ASR with role-specific LMs, as explained in Section 2.3. That way, we can exploit any information carried by the decoding lattice before pruning it to find the best path. Based on similar intuitions, Georgiou, Black, Lammert, Baucom, and Narayanan (2011a) and Xiao, Huang, et al. (2016) have previously explored lattice rescoring techniques for binary classification problems in the field of behavioral code prediction. Our method is evaluated on dyadic interactions from the clinical domain, as well as on multi-participant business meeting scenarios, yielding improved results for the task of SRR.

2.2 Background

In this section we give an overview of speech lattices and lattice rescoring. In order to better understand lattices, we first explain what a decoding graph is, within the framework of weighted finite state transducers (WFSTs).

2.2.1 Weighted Finite State Transducers

The job of a WFST is to transform (or transduce) an input sequence into another output sequence, where the input and output sets of labels (alphabets) may differ [2] (Hori & Nakamura, 2013; Mohri, Pereira, & Riley, 2002). Every WFST is associated with some semiring, which enables us to perform various algebraic operations on it. The formal definition of a semiring is given below (Kuich & Salomaa, 1986):

Definition 1. A semiring, denoted as $\langle A, \oplus, \otimes, \bar{0}, \bar{1} \rangle$, consists of a set $A$ with two binary operations $\oplus$ and $\otimes$ and two constants $\bar{0}$ and $\bar{1}$, such that the following axioms are satisfied:
(i) $a \oplus \bar{0} = \bar{0} \oplus a = a \quad \forall a \in A$,
(ii) $a \otimes \bar{1} = \bar{1} \otimes a = a \quad \forall a \in A$,
(iii) commutativity for $\oplus$: $a \oplus b = b \oplus a \quad \forall a, b \in A$,
(iv) distributivity: $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$ and $(a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c) \quad \forall a, b, c \in A$,
(v) $\bar{0} \otimes a = a \otimes \bar{0} = \bar{0} \quad \forall a \in A$.

[2] If there is no new output (or input and output are always the same), we have a weighted finite state acceptor (WFSA).
During a transition $t$, an input symbol (or the empty symbol) is converted to an output symbol (or the empty symbol) with some weight $w(t) \in \mathbb{W}$, where $\mathbb{W}$ is the set of a semiring. A valid path $\pi$ is a sequence of finitely many successive transitions $t_1, t_2, \dots, t_n$ from an initial state [3] to a final state $f$, associated with some weight $w(f)$. The total cost of the path is

$w(\pi) = w(t_1) \otimes w(t_2) \otimes \cdots \otimes w(t_n) \otimes w(f)$   (2.1)

For the task of speech recognition, weights are typically negative log probabilities and the tropical semiring $\langle \mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0 \rangle$ is the most widely used one.

[2] If there is no new output (or input and output are always the same), we have a weighted finite state acceptor (WFSA).
[3] We can safely assume that every WFST has a single initial state.

One of the most common operations defined on WFSTs is the binary operation of composition. Given two WFSTs $T_1$ and $T_2$, their composition $T$ is a WFST that transforms an input sequence $x$ into an output sequence $y$ according to the formula

$T(x, y) = (T_1 \circ T_2)(x, y) \triangleq \bigoplus_{z} T_1(x, z) \otimes T_2(z, y)$   (2.2)

2.2.2 WFST framework for speech recognition

Given a sequence of acoustic features $O$, the job of an ASR system from a traditional point of view [4] is to find, out of the set $\mathcal{W}$ of possible word sequences, the most probable sequence

$\hat{W} = \operatorname{argmax}_{W \in \mathcal{W}} P(W|O) = \operatorname{argmax}_{W \in \mathcal{W}} P(O|W) P(W)$   (2.3)

where $P(O|W)$ is the acoustic likelihood of $O$ for $W$, estimated through the acoustic model (AM), and $P(W)$ is the prior probability of $W$, estimated through a language model (LM). If the pronunciation lexicon mapping words to subword units (SUs) contains the additional information of how probable an SU sequence $V$ is, given the word sequence $W$, then we get

$\hat{W} = \operatorname{argmax}_{W \in \mathcal{W}} \sum_{V \in K(W)} P(O|V, W) P(V|W) P(W) \approx \operatorname{argmax}_{W \in \mathcal{W}} \sum_{V \in K(W)} P(O|V) P(V|W) P(W)$   (2.4)

where $K(W)$ is the set of the possible SU-level representations of $W$. Since decoding is based on the Viterbi algorithm, the summation is replaced by a max function and finally we get, in the log domain,

$\hat{W} \approx \operatorname{argmax}_{W \in \mathcal{W}} \max_{V \in K(W)} \{\log P(O|V) + \log P(V|W) + \log P(W)\}$   (2.5)

In the WFST framework, we have the transducer $\tilde{H}$ that transforms a sequence of acoustic features $O$ into an SU sequence $V$ with a weight $-\log P(O|V)$, the WFST $L$ that transforms an SU sequence $V$ into a word sequence $W$ with a weight $-\log P(V|W)$, and the WFSA $G$ that accepts a word sequence $W$ with a weight $-\log P(W)$ (Hori & Nakamura, 2013; Mohri et al., 2002). $\tilde{H}$ is actually split into a WFST $H$ that transforms a sequence of hidden Markov model (HMM) states into an SU sequence, and a model $S$ that maps the acoustic observations to HMM states and is trained following either the GMM or the DNN paradigm. Since typically the elementary SUs in ASR are triphones and the pronunciation lexicons give the phoneme-level representation of each word, it is necessary to have one more WFST $C$ that transforms a triphone sequence into a phoneme sequence, where each phoneme is context-independent and is identical to the central phoneme of the corresponding triphone. Those automata are composed into a final WFST [5]

$N = H \circ C \circ L \circ G$   (2.6)

[4] As opposed to the end-to-end neural approaches.

Given any speech utterance $x$ of $t$ frames, we construct the WFSA $T_x$ that represents $x$ (with $t+1$ nodes and one arc between consecutive nodes for each HMM state) and ASR is now a shortest path problem on $T_x \circ N$, called the decoding search graph for the specific utterance.
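To make the semiring machinery concrete before moving to lattices, the following is a small illustrative sketch of how path costs behave in the tropical semiring used for decoding; the numeric weights are made up and this is not the OpenFst/Kaldi implementation.

```python
# Illustrative sketch of the tropical semiring <R+ U {inf}, min, +, inf, 0>:
# along a path, transition costs combine with the semiring "times" (+), as in
# eq. (2.1), while competing paths combine with the semiring "plus" (min),
# which is exactly what a Viterbi-style shortest-path search does.
def t_times(a, b):        # semiring multiplication: accumulate costs
    return a + b

def t_plus(a, b):         # semiring addition: keep the cheaper alternative
    return min(a, b)

def path_weight(transition_weights, final_weight):
    """Total path cost w(t1) (x) ... (x) w(tn) (x) w(f), as in eq. (2.1)."""
    total = 0.0           # multiplicative identity of the tropical semiring
    for w in transition_weights:
        total = t_times(total, w)
    return t_times(total, final_weight)

# Two competing paths through a toy decoding graph (made-up costs):
best_cost = t_plus(path_weight([2.3, 0.7, 1.1], final_weight=0.5),
                   path_weight([1.9, 1.5, 1.2], final_weight=0.5))
print(best_cost)          # cost of the better (Viterbi) path
```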
2.2.3 Speech lattices

Conventional ASR systems try to find the shortest path on the decoding graph, such that a sequence of HMM states is transduced to a word sequence with the minimum possible cost. In many cases, however, it is desirable to keep multiple sufficiently probable transcription hypotheses, and not only the best one. This can be done either by keeping a list of n-best sequences, or by generating a speech recognition lattice. A lattice is a weighted directed acyclic graph (and thus, can be represented as a WFSA) with word labels (Ljolje, Pereira, & Riley, 1999). Each valid path represents an alternative word sequence, weighted by its recognition cost, and an exponential number of such word sequences can be encoded with respect to the number of nodes in the lattice. An example of a word lattice is given in Figure 2.1. Time and alignment information is also usually included in the lattice.

According to Povey et al. (2012), given some cost tolerance $\alpha$, any lattice should satisfy the following conditions:
(i) there should be one path for every word sequence within $\alpha$ from the one with the minimum total cost,
(ii) there should only be one path for any distinct word sequence (no duplicate paths allowed),
(iii) the scores and alignments in the lattice should be accurate.

[5] In practice, optimization operations need to be applied before the final composition (Mohri et al., 2002).

Figure 2.1: Example of speech recognition lattice encoding four alternative transcription hypotheses. For simplicity, no scores or alignments are shown.

2.2.4 Lattice rescoring

One of the main reasons it is desirable to generate lattices during decoding is so that we can later process them and rescore them with more complex, or domain-specific, models. For example, a lattice can be rescored to infuse knowledge-based information (Siniscalchi, Li, & Lee, 2006). Another common scenario is when a relatively simple language model is used during first-pass decoding due to lower computational complexity and the generated lattice is later rescored with a better language model to improve accuracy (Sak, Saraçlar, & Güngör, 2010; Xu et al., 2018).

One of the simplest ways to rescore a lattice is through composition (Povey et al., 2012). Essentially, for LM rescoring, which is the focus of this study, we want to subtract the old LM cost and add the new LM cost to the weighted automaton representing the lattice. According to the analysis in Section 2.2.2, the lattice should include two scores: the graph cost corresponding to the weight of the WFST $N$ (which, based on equation (2.6), incorporates the LM cost from $G$, the pronunciation cost from $L$, and the HMM-transitions-related cost from $H$) and the acoustic cost corresponding to the model $S$. Storing each weight on the lattice as a pair of graph and acoustic weights $(w_{gr}, w_{ac})$, the lattice $\mathcal{L}_{G_{new}}(x)$ for an utterance $x$, after rescoring with an LM $G_{new}$, can be expressed as

$\mathcal{L}_{G_{new}}(x) = \left( \mathcal{L}_{G_{old}}^{\dagger}(x) \circ G_{old} \right)^{\dagger} \circ G_{new}$   (2.7)

where $\mathcal{L}_{G_{old}}(x)$ is the lattice generated using the old LM $G_{old}$ and $\dagger$ denotes the operation of scaling both the graph and acoustic lattice costs by $-1$. For $G_{old}$ and $G_{new}$ the weights are of the form $(w, 0)$. A semiring similar to the tropical semiring on $w_{gr} + w_{ac}$ can be used for lattices, but keeping track of the graph and acoustic weights separately.
More precisely, the semiring used is equipped with the following operations:

$(w_{gr_1}, w_{ac_1}) \otimes (w_{gr_2}, w_{ac_2}) = (w_{gr_1} + w_{gr_2},\ w_{ac_1} + w_{ac_2})$

$(w_{gr_1}, w_{ac_1}) \oplus (w_{gr_2}, w_{ac_2}) = \begin{cases} (w_{gr_1}, w_{ac_1}), & \text{if } w_{gr_1} + w_{ac_1} < w_{gr_2} + w_{ac_2} \\ (w_{gr_2}, w_{ac_2}), & \text{if } w_{gr_1} + w_{ac_1} > w_{gr_2} + w_{ac_2} \end{cases}$

Ties in the latter case are broken by comparing $w_{gr_1} - w_{ac_1}$ vs. $w_{gr_2} - w_{ac_2}$ [6].

[6] For more details, please refer to (Povey et al., 2012) and to the Kaldi documentation at https://kaldi-asr.org/doc/lattices.html.

2.3 Proposed Method

Given a generic ASR system, the goal is to convert the generated decoding lattice for an input turn to multiple, role-specific versions, in such a way that there is one version that reflects the speaker role corresponding to the particular turn. We do this by rescoring the lattice $N$ times, where $N$ is the number of roles, with role-specific LMs. Let us assume we have a background, out-of-domain n-gram LM $\mathcal{G}$ and $N$ role-specific LMs $\mathcal{R}_1, \mathcal{R}_2, \dots, \mathcal{R}_N$ corresponding to the roles $R_1, R_2, \dots, R_N$, which are trained using in-domain data. First, we ensure that all the models which are going to be used recognize the same vocabulary. We can efficiently do so by interpolating the individual LMs to get the mixed models $\mathcal{G}^+, \mathcal{R}^+_1, \mathcal{R}^+_2, \dots, \mathcal{R}^+_N$. To obtain an interpolated model, we assign to each n-gram the weighted average of the probabilities from the input models, and we then re-normalize the produced model (Stolcke, 2002). Using the symbol $\uplus$ to denote LM interpolation, the final models are expressed as

$\mathcal{G}^+ = w_g\, \mathcal{G} \uplus (1 - w_g)\, \tilde{\mathcal{R}}$   (2.8)

$\mathcal{R}^+_i = w_{g_i}\, \mathcal{G} \uplus w_{r_i}\, \mathcal{R}_i \uplus (1 - w_{g_i} - w_{r_i})\, \tilde{\mathcal{R}}_i$   (2.9)

where

$\tilde{\mathcal{R}} = \frac{1}{N} \biguplus_{i=1}^{N} \mathcal{R}_i, \qquad \tilde{\mathcal{R}}_i = \frac{1}{N-1} \biguplus_{j=1,\, j \neq i}^{N} \mathcal{R}_j$

and all the weights $w_g, w_{g_i}, w_{r_i}$ are chosen to minimize the perplexity of appropriate role-specific development corpora.

Given an input turn $x$, we first pass it through an ASR system, trained with the LM $\mathcal{G}^+$, producing a decoding lattice $\mathcal{L}_{\mathcal{G}^+}(x)$. The lattice is then rescored with all the LMs $\mathcal{R}^+_j$, $j = 1, 2, \dots, N$, to produce the lattices $\mathcal{L}_{\mathcal{R}^+_j}(x)$. Denoting as $c_j(x)$ the LM cost of the best path in $\mathcal{L}_{\mathcal{R}^+_j}(x)$, the role assigned to $x$ is $R_m$ where $m = \operatorname{argmin}_j c_j(x)$. The process is visually depicted in Figure 2.2. The difference between this approach and the language-based approach followed in Chapter 1 is that in the second case the evaluation with respect to a role-specific LM would be done using the final output of the ASR, as presented in Figure 2.3. That way, the lattice $\mathcal{L}_{\mathcal{G}^+}(x)$ is pruned using a generic LM, which can potentially lead to loss of valuable information for the task of SRR. This is exactly the problem our approach tries to avoid.

Figure 2.2: Turn-level SRR by role-specific lattice rescoring.

Figure 2.3: Turn-level SRR by evaluating the text with role-specific LMs.

If the extra information of the speaker who uttered the turn is available, after a speaker clustering step, then the role assignment can be done more robustly at the speaker level instead of the turn level, as we already saw in Chapter 1.
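Before turning to the speaker-level assignment, here is a toy sketch of the n-gram interpolation behind equations (2.8) and (2.9): each n-gram receives the weighted average of its probabilities under the component models, and the mixture is renormalized. In the actual experiments this is done with SRILM over full backoff models; the unigram dictionaries and weights below are made-up placeholders.

```python
# Toy sketch of the LM interpolation behind eqs. (2.8)-(2.9). Real experiments
# use SRILM on full backoff n-gram models; these unigram dictionaries are
# illustrative only.
def interpolate(models, weights):
    vocab = set().union(*(m.keys() for m in models))
    mixed = {w: sum(lam * m.get(w, 0.0) for lam, m in zip(weights, models))
             for w in vocab}
    z = sum(mixed.values())        # renormalize (needed for backoff LMs in practice)
    return {w: p / z for w, p in mixed.items()}

G  = {"the": 0.5, "okay": 0.2, "feel": 0.3}   # background (out-of-domain) LM
R1 = {"feel": 0.6, "why": 0.4}                # role-specific LM, e.g., therapist
R2 = {"feel": 0.3, "yeah": 0.7}               # role-specific LM, e.g., client

# With N = 2 roles, the complementary model R~_1 is simply R_2.
w_g, w_r = 0.4, 0.5                           # weights tuned on a development set
R1_plus = interpolate([G, R1, R2], [w_g, w_r, 1 - w_g - w_r])
print(R1_plus)
```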
If we denote by $T_i$ the set of turns corresponding to speaker $S_i$, we can define the cost of the speaker-role pair $(S_i, R_j)$ as

$c(S_i|R_j) \triangleq \sum_{x \in T_i} c_j(x)$   (2.10)

Ideally, we would again like to assign to any speaker $S_i$ the role $R_m$ such that the cost $c(S_i|R_m)$ is the minimum among all $c(S_i|R_j)$, $j = 1, 2, \dots, N$. However, assuming that there is a one-to-one correspondence between speakers and roles in a speech document, which is the case for many practical applications, this criterion would fail, since there is no guarantee that for $n \neq m$ we have $\operatorname{argmin}_j c(S_n|R_j) \neq \operatorname{argmin}_j c(S_m|R_j)$.

Thus, in order to take such a constraint into account, we use Algorithm 1, which is a generalization of the role matching criterion we proposed in (Flemotomos, Martinez, et al., 2018) for the 2-speaker scenario, where the costs were perplexities. The algorithm begins with the entire sets $\tilde{S}$ and $\tilde{R}$ of the available speakers and roles and at every iteration it chooses the speaker $S_k$ such that a confidence metric $C_k$ is the maximum among all $C_i$, $i = 1, 2, \dots, |\tilde{S}|$. Then, it assigns to $S_k$ the role $R_{l_k}$ that minimizes the cost $c(S_k|R_j)$, $j = 1, 2, \dots, |\tilde{R}|$, and removes $S_k$ and $R_{l_k}$ from the available speakers and roles. The confidence metric $C_i$ is designed in such a way that the larger the difference between the minimum cost and the rest of the costs for $S_i$ is, the more confident we are about the role assignment of the particular speaker.

Algorithm 1: Speaker-level SRR given costs for each (speaker, role) pair.
Inputs: speakers $S_1, S_2, \dots, S_N$; roles $R_1, R_2, \dots, R_N$; costs $c(S_i|R_j)\ \forall i, j$
  $\tilde{S} \leftarrow \{S_i\}_{i=1}^{N}$; $\tilde{R} \leftarrow \{R_i\}_{i=1}^{N}$
  while $\tilde{S} \neq \emptyset$ do
    for $S_i \in \tilde{S}$ do
      $l_i \leftarrow \operatorname{argmin}_m c(S_i|R_m)$, $R_m \in \tilde{R}$
      $C_i \leftarrow \min_n |c(S_i|R_{l_i}) - c(S_i|R_n)|$, $R_n \in \tilde{R} \setminus \{R_{l_i}\}$
    end for
    $k \leftarrow \operatorname{argmax}_i C_i$
    assign $R_{l_k}$ to $S_k$
    $\tilde{S} \leftarrow \tilde{S} \setminus \{S_k\}$; $\tilde{R} \leftarrow \tilde{R} \setminus \{R_{l_k}\}$
  end while

2.4 Datasets

We evaluate our method on two datasets featuring interactions between individuals under different conditions. The first dataset, to which we will refer as the PSYCH corpus, is composed of motivational interviewing sessions between a therapist (T) and a client (C) and is collected from five independent clinical trials (ARC, ESPSB, ESP21, iCHAMP, HMCBI; Atkins et al., 2014) [7]. The second one is the AMI meeting corpus (Carletta et al., 2005), from which we use the independent headset microphone (IHM) setup of the scenario-only part. This is composed of meetings where each participant plays the role of an employee in a company: the project manager (PM), the marketing expert (ME), the user interface designer (UI), and the industrial designer (ID).

The two datasets are split into training, development and test sets in such a way that there is no speaker overlap between them. For the AMI corpus we follow the scenario-only partition which is officially recommended [8]. For the PSYCH corpus, since the client identities are not available for the HMCBI sessions, the partitioning is done under the assumption that it is highly improbable for the same client to visit different therapists in the same study, as explained in Chapter 1. In both cases, we use the manually derived segmentation. The datasets are presented in Tables 2.1 and 2.2.

[7] Note that this is a subset of the MI dataset used in Chapter 1, described in Table 1.1. In particular, here we do not use the 200 CTT sessions (Baer et al., 2009). Those feature scripted interactions between actors playing the roles of therapist vs. patient; here we only consider real-world clinical interactions.
[8] https://groups.inf.ed.ac.uk/ami/corpus/datasets.shtml

Table 2.1: Size of the PSYCH dataset.

              PSYCH-train   PSYCH-dev   PSYCH-test
#sessions          74            44          25
duration-T      26:43 h       15:23 h      7:34 h
duration-C      23:29 h       12:17 h      7:54 h

Durations are calculated based on manual turn boundaries.

Table 2.2: Size of the AMI dataset.

              AMI-train   AMI-dev   AMI-test
#meetings          98         20         20
duration-PM    16:00 h     2:95 h     3:93 h
duration-ME    10:22 h     2:61 h     2:51 h
duration-UI     9:71 h     2:26 h     1:79 h
duration-ID    11:03 h     2:02 h     2:15 h

Durations are calculated based on manual turn boundaries.

In order to train the required LMs we use the training parts of the PSYCH and AMI corpora, as well as the Fisher English corpus (Cieri, Miller, & Walker, 2004) and the transcribed therapy sessions provided by the counseling and psychotherapy transcripts series [9] (CPTS), as described in Section 2.5. The size of the corresponding vocabularies and the total number of tokens are given in Table 2.3.

[9] https://alexanderstreet.com/products/counseling-and-psychotherapy-transcripts-series

Table 2.3: Size of the vocabulary and total number of tokens in the corpora used for LM training.

                   PSYCH-train   AMI-train   Fisher   CPTS
vocabulary size        8.17K       8.54K     58.6K    35.6K
#tokens                 530K        479K     21.0M    6.52M

2.5 Experiments and Results

First, we train all the necessary LMs, which are 3-gram models with Kneser-Ney smoothing. The generic LM $\mathcal{G}$ is trained using the Fisher English corpus. For the AMI corpus, the 4 role-specific LMs $\mathcal{R}_{PM}, \mathcal{R}_{ME}, \mathcal{R}_{UI}, \mathcal{R}_{ID}$ are trained using only the turns belonging to the corresponding roles in the training set. For the PSYCH corpus, we additionally use the CPTS sessions and get the role-specific LMs $\mathcal{R}_T = w_{o_T}\, \mathcal{R}_{T,CPTS} \uplus (1 - w_{o_T})\, \mathcal{R}_{T,PSYCH}$, and similarly for $\mathcal{R}_C$. The mixing weights $w_{o_T}$ and $w_{o_C}$ are optimized so that the perplexity of the turns of the corresponding roles in the development set is minimized. Once we have those LMs, we create the mixed versions according to equations (2.8) and (2.9), where all the appearing mixing weights are again optimized to minimize the perplexity of the development corpora. For the optimization of $w_g$, the corresponding development corpus is the union of all the role-specific development corpora for the dataset we work with. The LM training and weight optimization is done with the SRILM toolkit (Stolcke, 2002). The size of the final mixed vocabulary is 69.5K for the experiments with the PSYCH corpus and 59.6K for the experiments with the AMI corpus, while the phonetic representation of those words is given by the CMU dictionary [10]. The ASR decoding is done with the Kaldi speech recognition toolkit (Povey et al., 2011) using Kaldi's pre-trained ASpIRE acoustic model [11]. The word insertion penalty and the LM weighting factor used during decoding are chosen to minimize the word error rate (WER) on the development set. The evaluation metric used for the final role assignment is the misclassification rate (MR), as defined in equation (1.3).

2.5.1 Turn-level SRR

In Table 2.4 we present the results using our method (lm-resc) for turn-level (tl) SRR, as shown in Figure 2.2, as well as using the approach shown in Figure 2.3 (lm-asr), where the cost $c'_j(x)$ is the log-likelihood of the turn $x$ given the LM $\mathcal{R}^+_j$. As we can see, both lm-resc-tl and lm-asr-tl fail to beat the baseline classifier which always chooses the majority class (from the training set) for the case of the AMI corpus. For the 2-role problem in the PSYCH corpus this is not the case, but still lm-asr-tl outperforms lm-resc-tl.
This is because the corpora feature conversational interactions and thus, prior to speaker clustering, utterances are broken into very short speech segments. Each individual segment contains insufficient observations to infer speaker role, and since all decisions are independent, that increases error. Such inaccuracies cancel out when we exploit the aggregate score for all the turns of a speaker, as we will see in the following section.

[10] https://github.com/cmusphinx/cmudict
[11] https://kaldi-asr.org/models/m1

Table 2.4: MR (%) for turn-level SRR.

          lm-resc-tl   lm-asr-tl   maj. class
PSYCH        23.58        10.75       50.67
AMI          64.70        63.40       62.22

lm-resc-tl refers to the system of Figure 2.2. lm-asr-tl refers to the system of Figure 2.3.

2.5.2 Speaker-level SRR

Here, the final decision of the role assignment is taken at the speaker level, according to Algorithm 1, which means that a speaker clustering step is required. To that end, a BIC-based HAC is employed on top of an energy-based voice activity detector at the frame level, like in Chapter 1. In order for the clustering to make sense in the case of the AMI corpus, we downmix the 4 headset microphones into one audio file per meeting. As observed in Table 2.5, our method (lm-resc-sl) yields improved results, outperforming both lm-asr-sl and the turn-level approaches (Table 2.4). Of course, the final performance depends on the performance of the clustering algorithm used.

Table 2.5: MR (%) for speaker-level (sl) SRR and for speaker clustering (BIC-HAC).

           lm-resc-sl   lm-asr-sl   BIC-HAC
PSYCH†        0.00          7.46        -
PSYCH         4.41          5.83       4.08
AMI†         29.46         55.52        -
AMI          46.16         60.94      15.63

† denotes the use of ground truth speaker clustering information.

2.5.3 Effect on speech recognition accuracy

Finally, we want to explore whether the role-specific lattice rescoring can lead to improved results for the task of ASR, apart from SRR. To that end, for every turn we assume that the lexical information is given by the best path of the rescored lattice corresponding to the role that was assigned by our algorithm to that turn. The results in Table 2.6 show that this approach, following our per-speaker role assignment, can indeed slightly improve the ASR performance. The slight difference between the WER of the generic ASR model and the combination of the rescored ones, together with the substantial improvements in SRR performance (Table 2.5), suggests that even small role-specific improvements in the text produced by the ASR can be of high value for a reliable role identification.

Table 2.6: WER (%) using the best path of a generic lattice or role-specific rescored lattices.

          lm-resc-tl   lm-resc-sl   generic
PSYCH        37.84        37.54      37.99
AMI          29.35        29.27      29.29

2.6 Conclusion

Here we presented an algorithm that rescores the lattices produced by an ASR system with role-specific LMs in order to exploit the linguistic information in a more robust way for the task of SRR. We experimented with approaches taking the final decision both at the turn and at the speaker level and we identified that the second case leads to more reliable results.

This chapter concludes our analysis on how to robustly extract speaker roles from the speech signal. In Chapters 3 and 4 we will focus on how to use the role information in order to improve the performance of a fundamental speech processing task, that of speaker diarization. There, we are going to use weaker approaches to extract speaker roles (since we will do so at the turn level for only a few turns) and for that reason we will employ specific confidence criteria.
35 Part II Using Speaker Roles 36 Chapter 3 Linguistically Aided Speaker Diarization Using Speaker Role Information In the previous chapter we demonstrated how to infer speaker roles from speech recognition outputs and additionally showed that speaker role information can improve the performance of an ASR system. In this chapter, we utilize speaker roles to facilitate another speech processing task, namely speaker diarization. This task relies on the assumption that speech segments corresponding to a particular speaker are concentrated in a specic region of the speaker space; a region which represents that speaker's identity. These identities are not known a priori, so a clustering algorithm is typically employed, traditionally based solely on audio. Under noisy conditions, however, such an approach poses the risk of generating unreliable speaker clusters. Here we aim to utilize linguistic information as a supplemental modality to identify the various speakers in a more robust way. In particular, we show that the dierent linguistic patterns that speakers are expected to follow in role-based conversational scenarios can help us construct the speaker identities. That way, we are able to boost diarization performance by converting the clustering task to a classication one. The work presented in this chapter has been published in (Flemotomos, Georgiou, & Narayanan, 2020). 37 3.1 Introduction Given a speech signal with multiple speakers, diarization answers the question \who spoke when" (Anguera et al., 2012). To address the problem, the main underlying idea is that speech segments corresponding to some speaker share common characteristics which are ideally unique to the par- ticular person. So, the problem is usually reduced to nding a suitable representation of the signal and a reliable distance metric. Under this viewpoint, when the distance between two speech seg- ments is beyond a certain threshold, they are considered to belong to dierent speakers. The job of a speaker diarization system is visually depicted in Figure 3.1. (a) Raw signal. (b) Diarization output. Figure 3.1: Finding \who spoke when" in a speech signal. In (b), the white regions indicate silence or noise. The 5 detected (colored) speech regions are further segmented into 7 speaker-homogeneous segments which are clustered into 3 same-speaker groups. In the conventional diarization approach, the input signal is rst segmented either uniformly (e.g., Sell et al., 2018) or according to a speaker change detection algorithm (e.g., Zajc, Kune sov a, Zelinka, & Hr uz, 2018). In either case, it is assumed that a single speaker is present in each one of the resulting segments. Since diarization is typically viewed as an unsupervised task, it heavily depends on the successful application of a clustering algorithm in order to group same-speaker 38 segments together. Such a method, however, poses the risk of creating noisy, non-representative speaker clusters. In particular, if the speakers to be clustered reside closely in the speaker space, some speakers may be merged. Additionally, if there is enough noise and/or silence within a recording (possibly not suciently captured by a voice activity detection algorithm), it may be the case that one of the constructed clusters only contains the non-speech or distorted-speech segments. This behavior can lead to poor performance even if the exact number of speakers is known in advance. 
Even though speaker diarization has traditionally been an audio-only task which relies on the acoustic variability between dierent speakers, the linguistic content captured in the speech signal can oer valuable supplementary cues. Apart from practical observations such as the fact that it is highly improbable for a speaker change point to be located within a word (Dimitriadis & Fousek, 2017; Silovsky, Zdansky, Nouza, Cerva, & Prazak, 2012), it is widely accepted that each individual has their very own way of using language (Johnstone, 1996). Thus, language patterns followed by individual speakers have been explored in the literature for the tasks of speaker segmentation and clustering, both when used unimodally (Meng, Mou, & Jin, 2017), and in combination with the speech audio (India Massana, Rodr guez Fonollosa, & Hernando Peric as, 2017; Park & Georgiou, 2018; Park, Han, Huang, et al., 2019; Zaj c, Soutner, Hr uz, M uller, & Radov a, 2018). Despite the benecial eects of using language as an additional stream of information, there is an important practical consideration: how to get access to the transcripts. In a real-world scenario, a high-performing ASR system needs to be applied before any textual data is available. However, speaker diarization is widely viewed as a pre-processing step of multi-talker ASR systems and is often a module that precedes ASR in conversational speech processing pipelines (Huang, Marcheret, Visweswariah, Libal, & Potamianos, 2007; Xiao, Huang, et al., 2016). This is because single-speaker speech segments allow for speaker normalization techniques, including speaker adaptive training through constrained maximum likelihood linear regression (CMLLR; Gales, 1998) and i-vector based neural network adaptation (Saon, Soltau, Nahamoo, & Picheny, 2013). Nevertheless, taking into consideration the error propagating from a non-ideal diarization system to the ASR output, it is nowadays questionable whether diarization can in practice improve recognition accuracy, which is why several modern pipelines start by applying ASR rst, achieving state-of-the-art results (Park, Han, Huang, et al., 2019; Yoshioka et al., 2019). In any case, if there are not major computational 39 and/or time constraints, running a second pass of ASR after diarization could be a reasonable approach 1 . Following the aforementioned line of work, we propose an alternative way of using the linguistic information for the task of speaker diarization in recordings where participants play specic roles which are known in advance. In particular, we process the text stream independently in order to segment it in speaker-homogeneous chunks (where only one speaker is active), each one of which can be assigned to one of the available speaker roles. Aggregating this information for all the segments, and aligning text with audio, we can construct the acoustic identities of the speakers found in the recording. That way, each audio segment can be assigned to a speaker through a simple classier, overcoming the potential risks of clustering. We apply this approach in psychotherapy recordings featuring dyadic interactions between two speakers with well-dened roles; namely those of a therapist and a patient. 3.2 Background: Audio-Only Speaker Diarization Speaker diarization is the process of partitioning a speech signal into speaker-homogeneous segments and then grouping same-speaker segments together, without having prior information about the speaker identities. 
Therefore, research eort has been focused on nding i) a representation that can capture speaker-specic characteristics, and ii) a suitable distance metric that can separate dierent speakers based on those characteristics. The traditional approach has been to model speech segments under some probability distribution (e.g., GMMs), and measure the distance between them using a metric such as the one based on the bayesian information criterion (BIC) (S. Chen & Gopalakrishnan, 1998). Speaker modeling by GMMs was later replaced by i-vectors (Shum, Dehak, Chuangsuwanich, Reynolds, & Glass, 2011), xed-dimensional embeddings inspired by the total variability model. In this framework, the cosine distance metric was initially proposed as the divergence criterion to be used, but probabilistic linear discriminant analysis (PLDA) based scoring (Ioe, 2006; Prince & Elder, 2007) was proved to yield improved results (Sell & Garcia-Romero, 2014). Given two 1 In Chapter 5 we will see how diarization and ASR can be connected within a larger speech processing pipeline. 40 embeddingsv,r, PLDA provides a framework to estimate their similaritys(v;r) as the log-likelihood ratio s(v;r) = log p(v;rjsame speaker) p(vjdif. speakers)p(rjdif. speakers) (3.1) In recent years, with the advent of deep neural networks (DNNs), the embeddings used are usu- ally bottleneck features extracted from neural architectures. Such architectures are trained under the objective of speaker classication, employing a cross-entropy loss function (Snyder, Garcia- Romero, Sell, Povey, & Khudanpur, 2018), or under the objective of speaker discrimination em- ploying contrastive (Garcia-Romero, Snyder, Sell, Povey, & McCree, 2017) and triplet (Bredin, 2017b) loss functions. Typical examples of embeddings that have shown state-of-the-art perfor- mance for speaker diarization are the long-short term memory (LSTM) based d-vectors (Q. Wang, Downey, Wan, Manseld, & Moreno, 2018) and the time-delay neural net (TDNN) based x-vectors (Sell et al., 2018), which are also the embeddings used for the work presented here. The rst layers of the architecture used to extract x-vectors operate at the frame level, with deeper layers seeing longer temporal contexts. Then, a statistics pooling layer is used to collect the outputs of the last layer of the TDNN and compute the mean and standard deviation vectors. The next few dense lay- ers operate at the segment level before a softmax inference layer maps segments to speaker labels. The activations of the rst dense layer are selected as speaker embeddings. Speaker diarization usually comprises two steps: rst, the speech signal is segmented into single- speaker chunks, and second, the resulting segments are clustered into same-speaker groups (Anguera et al., 2012). Even though speaker change detection is by itself an active research eld (Hr uz & Zaj c, 2017; Jati & Georgiou, 2017), it has been shown that it doesn't necessarily lead to improved results within the framework of diarization when compared to a uniform, sliding-window based segmentation (Zaj c, Kune sov a, & Radov a, 2016; Zajc et al., 2018), so the latter method is widely used. As far as the clustering is concerned, common approaches include hierarchical agglomerative clustering (HAC; Sell et al., 2018) and spectral clustering (Park, Kumar, et al., 2019; Q. Wang et al., 2018), while methods based on anity propagation (Yin, Bredin, & Barras, 2018) and generative adversarial networks (Pal et al., 2020) have also been proposed. 
In order to overcome some of the problems connected with clustering, supervised systems that directly output a sequence of speaker labels have been recently introduced (Fujita et al., 2019; Zhang, Wang, Zhu, Paisley, & Wang, 41 2019). 3.3 Proposed Method: Linguistically-Aided Speaker Diarization Our proposed approach for speaker diarization in conversational interactions where speakers as- sume specic roles is illustrated in Figure 3.2. We describe the various modules in detail in Sec- tions 3.3.1{3.3.4. segmentation segment-level role recognition text role LMs profile estimation audio uniform segmentation classification Figure 3.2: Linguistically-aided speaker diarization using role information. 3.3.1 Text-based segmentation Given the textual information of the conversation, our goal is to obtain speaker-homogeneous text segments; that is segments where all the words have been uttered by a single speaker. Those will later help us construct the desired acoustic speaker identities. Even though text-based speaker change detectors have been proposed (Meng et al., 2017), for our nal goal we can safely over seg- ment the available document, provided this leads to a smaller number of segments containing more than one speakers (Zaj c et al., 2018). So, we assume that each sentence is with high probability speaker-homogeneous and we instead segment at the sentence level. To that end, the problem can be viewed as a sequence labeling one, where each word is tagged as either being at the beginning of a sentence, or anywhere else. In particular, we address the problem building a Bidirectional LSTM (BiLSTM) network with a conditional random eld (CRF) inference layer (Ma & Hovy, 2016), as shown in Figure 3.3. The input to the recurrent layers is a sequence of words. Each word is given as a concatenation of a character-level representation predicted by a CNN and a word embedding. For our experiments, we initialize the word embeddings with the 42 extended dependency skip-gram embeddings (Komninos & Manandhar, 2016), pre-trained on 2B words of Wikipedia. Those extend the semantic vector space representation of the word2vec model (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) considering not only spacial co-occurrences of words within text, but also co-occurrences in a dependency parse graph. That way, they can capture both functional and topic-related semantic properties of words. CNN CNN CNN h i M a r h e y y Mary hey hi BiLSTM CRF B B M Figure 3.3: Neural network for sentence-level text segmentation: A character representation is constructed for each word through a CNN and is concatenated with a word embedding (here shown in grey). This is the input to a BiLSTM-CRF architecture which predicts a sequence of labels. Here B denotes a word at the beginning of a sentence and M in the middle (any word which is not the rst one of a sentence). 3.3.2 Role recognition The next step in our system is the application of a text-based role recognition module. In more detail, assuming we have N speakers in the session (N = 2 for our experiments) and there is one- to-one correspondence between speakers and roles (e.g., there is one therapist and one patient), we want to assign one of the role labelsfR i g N i=1 to each segment. To do so, we buildN LMsfR + i g N i=1 , one for each role, and we estimate the perplexity of a segment given the LMR + i , fori = 1; 2; ;N. The role assigned to the segment is the one yielding the minimum perplexity like in Chapters 1 and 2. 
We note that in our experiments all the perplexities are normalized for segment length. The required role-specific LMs are n-gram models built as described in Chapter 2, Section 2.3. For this process, we assume that in-domain text data is available for training. We first construct a background, out-of-domain LM $\mathcal{G}$ and $N$ role-specific LMs $\{\mathcal{R}_i\}_{i=1}^{N}$. $\mathcal{G}$ is used to ensure a large enough vocabulary that minimizes the unseen words during the test phase. Those individual LMs are interpolated to get the mixed models $\{\mathcal{R}^+_i\}_{i=1}^{N}$ [2].

[2] For more details, please refer to equations (2.8)-(2.9).

3.3.3 Profile estimation

After applying the text-based segmenter and role recognizer, we have several text segments corresponding to each role $R_i$. If we have the alignment information at the word level [3], we can directly get the time boundaries of those segments. We extract one embedding (x-vector) for each and we estimate a role identity $r_i$ as the mean of all the embeddings corresponding to the specific role. Under the assumption of one-to-one correspondence between speakers and roles that we already introduced in Section 3.3.2, those role identities are at the same time the acoustic identities (also known as profiles) of the speakers appearing in the initial recording.

[3] If we have access to the transcripts and the audio, we can force-align. If we generate the text through ASR, we get the desired alignments from the decoding lattices.

We note that role recognition at the segment level does not always provide robust results, as explained in the previous chapters, something which could lead to unreliable generated profiles. However, we expect that there will be a fraction of the segments for the results of which we are confident enough, and we can take only those into consideration for the final averaging. The proxy used as our confidence for the segment-level role assignment is the difference between the best and the second best perplexity of a segment given the various LMs, similarly to the confidence introduced in Algorithm 1 (Chapter 2). Formally, if segment $x$ is assigned the role $R_i$, and if $pp(x|\mathcal{R}^+_i)$ is the perplexity of $x$ given the LM $\mathcal{R}^+_i$, then the confidence metric used for this assignment is

$c_x = \min_{j \neq i} |pp(x|\mathcal{R}^+_j) - pp(x|\mathcal{R}^+_i)|$   (3.2)

Then, the corresponding profile is

$r_i = \frac{\sum_{x \in R_i} \tilde{c}_x u_x}{\sum_{x \in R_i} \tilde{c}_x} \triangleq \frac{\sum_{x \in R_i} \mathbb{I}\{c_x > \theta\}\, u_x}{\sum_{x \in R_i} \mathbb{I}\{c_x > \theta\}}$   (3.3)

where $u_x$ is the x-vector for segment $x$, $\mathbb{I}\{\cdot\}$ is the indicator function, and $\theta$ is a tunable parameter.

3.3.4 Audio segmentation and classification

After having computed all the needed profiles $\{r_i\}_{i=1}^{N}$, in order to perform speaker diarization, we first segment the audio stream of the speech signal uniformly with a short sliding window, a typical approach in audio-only diarization systems. In other words, the language information is used by our framework only to construct the speaker profiles, with the final diarization result relying on audio-based segmentation, as illustrated in Figure 3.2. For each one of the resulting segments an x-vector is extracted. However, instead of clustering the x-vectors, we now classify them within the correct speaker/role. In order to have a fair comparison between common diarization baselines and our proposed system, our classifier is based on PLDA, but we note that any other classifier could be employed instead. In this framework, a segment $x$ with embedding $u_x$ is assigned the label

$\hat{R}_x = \operatorname{argmax}_{1 \leq i \leq N} \{s(u_x, r_i)\}$   (3.4)

where $s(\cdot, \cdot)$ is the PLDA similarity score estimated in equation (3.1).
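As a quick illustration of equations (3.3) and (3.4), the following sketch builds one profile per role from the confidence-gated segment embeddings and then assigns an audio segment to the closest profile. Cosine similarity stands in for the PLDA scorer purely to keep the example self-contained, and all arrays are synthetic placeholders rather than real x-vectors.

```python
# Sketch of profile estimation (eq. 3.3) and segment classification (eq. 3.4).
# Cosine similarity replaces the PLDA scorer for self-containedness; synthetic
# data throughout.
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_segments, n_roles = 200, 50, 2

xvectors = rng.normal(size=(n_segments, emb_dim))   # one embedding per text segment
roles = rng.integers(0, n_roles, size=n_segments)   # role predicted for each segment
confidence = rng.random(n_segments)                 # c_x from eq. (3.2)
theta = 0.5                                         # tunable confidence threshold

profiles = []
for r in range(n_roles):
    keep = (roles == r) & (confidence > theta)      # indicator I{c_x > theta}
    profiles.append(xvectors[keep].mean(axis=0))    # eq. (3.3): mean of kept segments
profiles = np.stack(profiles)                       # r_i, one profile per role

def assign(u_x, profiles):
    """Eq. (3.4): label an audio segment with the role of the closest profile."""
    sims = profiles @ u_x / (np.linalg.norm(profiles, axis=1) * np.linalg.norm(u_x))
    return int(np.argmax(sims))

print(assign(rng.normal(size=emb_dim), profiles))
```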
3.4 Datasets 3.4.1 Evaluation data We evaluate our proposed method on datasets from the clinical psychology domain. In particular, we apply the system to the motivational interviewing sessions introduced in the previous chapters, and specically the PSYCH corpus described in Table 2.1. As explained there, the train/dev/eval split has been done in such a way that there is no speaker overlap between the subsets. All the results reported are on PSYCH-test. 3.4.2 Segmenter and role LM training data The segmenter presented in Section 3.3.1 is trained on a subset of the Fisher English corpus (Cieri et al., 2004) comprising a total of 10,195 telephone conversations for which the original transcriptions (including punctuation symbols which are essential for the training of our network) are available. This set is enhanced by 1,199 in-domain therapy sessions provided by the counseling and psy- chotherapy transcripts series (CPTS). The combined dataset is randomly split (80-20 split at the 45 session level) into training and validation sets. Here, we use the same role-specic LMs as the ones trained and used in Chapter 2, employing CPTS and the entire Fisher English corpus for training. Please refer to Table 2.3 for details on the size of the corresponding vocabularies. 3.5 Experiments and Results 3.5.1 Baseline systems Audio-based diarization with speaker clustering As an audio-only baseline, we use a diarization system following the widely applied x-vector/PLDA paradigm (Sell et al., 2018). As shown in Figure 3.4, the speech signal is rst segmented uniformly and an x-vector is extracted for each segment. The pairwise similarities s(;) between all those embeddings are then calculated based on PLDA scoring (equation (3.1)). uniform segmentation audio clustering Figure 3.4: Baseline audio-based speaker diarization. The segments are clustered into same-speaker groups following a HAC approach with average linking. Since our experiments are conducted on dyadic interactions, we force the HAC algorithm to run until two clusters are constructed. Language-based diarization As a language-only baseline, we use the system of Figure 3.5, which essentially consists of the rst steps of the framework in Figure 3.2. After estimating the segment-level role labels as described in Sections 3.3.1 and 3.3.2, we can simply use those as our diarization output labels to evaluate the performance of a system that only depends on linguistic information. In that case, we only utilize audio to get the timestamps of the text segments. If an ASR system is used, this information is already available through the decoding lattices. 46 segmentation segment-level role recognition text role LMs alignment audio Figure 3.5: Baseline language-based speaker diarization. 3.5.2 Experimental setup As a pre-processing step, the text which is available from the manual transcriptions of the PSYCH corpus is normalized to remove punctuation symbols and capital letters, and force-aligned with the corresponding audio sessions. Based on the word alignments, we segment the audio according to whether there is a silence gap between two words larger than a threshold equal to 1 sec. We should highlight that this initial segmentation is applied before running either one of the baseline systems or our proposed architecture. Thus, the initial segments to be diarized are always the same and those are also the segments that we pass to the ASR system. 
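As a side note on the pre-processing just described, a minimal sketch of the silence-gap segmentation is given below: given word-level alignments, a new segment starts whenever the pause between consecutive words exceeds the 1-second threshold. The alignment tuples are invented for illustration.

```python
# Minimal sketch of silence-gap segmentation from word alignments: start a new
# segment whenever the pause between consecutive words exceeds the threshold.
# The (start, end, word) tuples below are invented for illustration.
def segment_by_silence(word_alignments, max_gap=1.0):
    segments, current = [], [word_alignments[0]]
    for prev, cur in zip(word_alignments, word_alignments[1:]):
        if cur[0] - prev[1] > max_gap:        # silence longer than the threshold
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

words = [(0.0, 0.4, "so"), (0.5, 0.9, "how"), (1.0, 1.3, "are"), (1.35, 1.7, "you"),
         (3.2, 3.5, "good"), (3.6, 4.1, "thanks")]
for seg in segment_by_silence(words, max_gap=1.0):
    print([w for _, _, w in seg])
```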
The diarization ground truth is also constructed through the word alignments, by allowing a maximum of 0:2 sec-long in-turn silence. The resulting text segments are further subsegmented at the sentence level based on the output of the tagger in Figure 3.3. During training we dene as \sentence" any text segment between two punctuation symbols denoting pause, apart from commas. We exclude commas rst because they normally do not indicate speaker change points but also because they are too frequent in our training set and they would lead to very short segments, not containing sucient information for the task of role recognition. The tagger is built using the NCRF++ toolkit (Yang & Zhang, 2018). Following the general recommendations by Reimers and Gurevych (2017) and after our own hyperparameter tuning, the network comprises 4 CNN layers and 2 stacked BiLSTM layers with dropout (p = 0:5) and l 2 regularization ( = 10 8 ). The length of each word representation is 330 (character embedding dimension = 30, word embedding dimension = 300). The network is trained using the Adam optimizer with a xed learning rate equal to 10 3 and a batch size equal to 256 word sequences. The tagger achieves an F 1 score of 0:805 on the validation set after 14 training epochs. All the LMs required for role recognition are 3-gram models with Kneser-Ney smoothing built with the SRILM toolkit, as described in Section 2.5. The audio-based diarization framework is built 47 using the Kaldi toolkit (Povey et al., 2011). We use the VoxCeleb pre-trained x-vector extractor 4 and the PLDA model which comes with it, after we adapt it on the development set of the PSYCH corpus, both for the audio-only baseline and for our linguistically-aided system. The x-vectors are extracted after uniformly segmenting the audio into 1:5 sec-long windows with a window shift equal to 0:25 sec. Those are normalized and decorrelated through a linear discriminant analysis (LDA) projection and dimensionality reduction (nal embedding length = 200), mean, and length normalization. The evaluation is always based on the diarization error rate (DER), as estimated by the NIST md-eval.pl tool, with a 0:25 sec-long collar, ignoring overlapping speech. DER incorporates three sources of error: false alarms (speech in the output but not in the ground truth), missed speech (speech in the ground truth but not in the output), and speaker confusion (speech assigned to the wrong speaker cluster). To get the ASR outputs, we use Kaldi's pre-trained ASpIRE acoustic model 5 , coupled with the 3-gram LM given in equation (2.8). This ASR system gives a word error rate (WER) equal to 38:02% for the PSYCH-dev and 39:78% for the PSYCH-test set. It is noted that WERs in this range are typical in spontaneous medical conversations (Kodish-Wachs, Agassi, Kenny III, & Overhage, 2018). 3.5.3 Results with reference transcripts Before applying ASR, we employ our system using the manually derived transcripts. That way, we can inspect the usability and eectiveness of our approach, eliminating potential propagation errors because of ASR. Table 3.1 gives the results of our linguistically-aided diarization system in comparison with the audio-only and language-only baseline approaches. First, we notice that, between the two baselines, the one using the acoustic modality yields better results. This came at no surprise since we expected that audio carries the most important speaker-specic characteristics. Hence, we propose using language only as a supplementary stream of information. 
When we apply our linguistically-aided system using our sequence tagger to segment at the sentence level (still using all the segments, without applying any condence criterion) we get a 4 https://kaldi-asr.org/models/m7 5 https://kaldi-asr.org/models/m1 48 Table 3.1: DER (%) following our linguistically-aided approach and the two baselines. transcript source text segmentation audio only language only linguistically aided linguistically aided y reference oracle 11:05 12:99 7:28 6:99 tagger 20:09 7:71 7:30 ASR tagger 11:05 27:07 8:37 7:84 The text segmentation (when needed) is either performed by our sequence tagger or based on the manually annotated speaker changes (oracle). y denotes results when only a% of the segments we are most condent about are taken into account in each session for the prole estimation, where a is a parameter optimized on the development set. 30:23% DER relative improvement compared to the audio-only approach. In the rst row of Table 3.1 we additionally report results when using the oracle speaker segmentation provided by the manual annotations instead of applying the sequence tagger. That way, we can eliminate any negative eects caused by a suboptimal speaker change detector. As expected, the results are indeed better, but it is worth noting the dierence in the performance gap between the language-only and the linguistically-aided approaches when we compare the oracle vs. the tagger-based segmentation. Since the sequence tagger operates at the sentence level, its output is over-segmented with respect to speaker changes. As a result, utterances are broken into very short segments, with several segments containing insucient information to infer speaker role in a robust way. However, when we aggregate all those speaker turns to only estimate an average speaker prole, such inaccuracies cancel out. Further improvements are observed if for prole estimation we only keep the segments we are most condent about, applying the condence metric introduced in Section 3.3.3. Instead of directly optimizing for the parameter appearing in equation (3.3), we nd the parametera that minimizes the overall DER on the development set when only the a% segments we are most condent about are taken into consideration per session. The results on the test set are reported in the last column of Table 3.1 (a = 70 for the tagger segmentation and a = 55 for the oracle segmentation, after optimizing on the development set). An additional 5:32% relative error reduction is achieved when our tagger is used and similar improvements are noticed in the case of oracle text segmentation. In Figure 3.6 we plot DER as a function of the percentage of the segments we use to estimate the speaker proles within a session. Even though the oracle text segmentation consistently yields 49 marginally better results, it seems that if we carefully choose which segments to use to get an estimate of the speakers' identities, our tagger-based segmentation approaches the oracle perfor- mance. In fact, the best result we got on the test set (optimizing for a on the same set) using our segmenter was 7:13% DER, while the corresponding number using the oracle segmentation was 6:99%. We should highlight here that the analysis presented in this work is based on using a% of the segments within a session, after choosing some a which remains constant across sessions. It is probable that this is a session-specic parameter which ideally should be chosen based on an alternative, session-level strategy. 
0 10 20 30 40 50 60 70 80 90 100 most confident text segments (%) 6 8 10 12 14 16 DER (%) audio-only tagger (reference) oracle (reference) tagger (ASR) Figure 3.6: DER (%) as a function of the number of text segments we take into account per session for the prole estimation, based on our condence metric. Text segmentation is either performed by our sequence tagger or based on the manually annotated speaker changes (oracle). Results are presented both with reference and with ASR transcripts. 3.5.4 Results with ASR transcripts For the experiments in this Section we apply the same pre-processing steps, but we replace the reference transcripts with the textual outputs of the ASR system and the corresponding time alignments. The results are given in the last row of Table 3.1. Here, we report results only when using the sequence tagger (and not with oracle segmentation), simply because we now assume we have no access to the reference transcripts, so we cannot know the oracle speaker change points. As we can see, the diarization performance is substantially improved compared to the audio-only 50 system (relative DER reduction equal to 24:25%) even if the WER of the ASR module is relatively high, as reported in Section 3.5.2. It seems that when using the transcripts only for the task of prole estimation, the overall performance is not severely degraded by a somehow inaccurate ASR system. This is not the case for our language-only baseline. Since in that case the nal output only depends on linguistic information, the performance gap between using manual and ASR-derived transcripts (language-only column in Table 3.1) is large. We should note that this performance gap is not only due to higher speaker confusion in the case of ASR transcripts, but also because of increased missed speech. In particular, the missed speech when using ASR is 2:7% because of word deletions (as opposed to 0:6% when the reference transcrpits and the tagger are used). As was the case with the experiments in Section 3.5.3, further improvements are observed when only using a subset of the total number of segments per session to estimate the speaker proles. In particular, if a = 45% of the segments for which we are most condent about (after optimizing for a on the development set) are used, DER is reduced to 7:84%. The benecial eects of using our condence metric to estimate a speaker representation only by a subset of their assigned speech segments is also demonstrated in Figure 3.6. 3.6 Conclusion We proposed a system for speaker diarization suitable to use in conversations where participants assume specic roles, associated with distinct linguistic patterns. While this task typically relies on clustering methods which can lead to noisy speaker partitions, we demonstrated how we can exploit the lexical information captured within the speech signal in order to estimate the speaker proles and follow a classication approach instead. A text-based speaker change detector is an essential component of our system. For this subtask, assuming each sentence is speaker-homogeneous, we proposed using a sequence tagger which segments at the sentence level, by detecting the beginning of a new sentence and we showed that this segmentation strategy approaches the oracle performance. The resulting segments are assigned a speaker role label which is later used to construct the desired speaker identities and we introduced a condence metric to be associated with this assignment. 
Our results showed that such a metric can be used in order to take into consideration only the segments we are most condent about, leading to further performance improvements. When applied to 51 dyadic interactions between a therapist and a patient, our proposed method achieved an overall relative DER reduction equal to 29:05%, compared to the baseline audio-only approach with speaker clustering. When reference transcripts were used instead of ASR outputs, the corresponding overall reduction was equal to 33:94%. Since role recognition is a supervised task, one drawback of our system when compared to traditional diarization approaches is that it requires in-domain text data in order to build the role-specic LMs. It should be additionally highlighted that the diarization results can be fur- ther improved if, for example, a re-segmentation module is employed as a nal step, or a more precise audio segmentation strategy is followed instead of relying on uniform segmentation. For instance, an audio-based speaker change detector could be applied both for the audio-only baseline and the linguistically-aided system and in the latter case this could be used in combination with the language-based segmenter. However, our goal in this chapter was mainly to demonstrate the eectiveness of constructing the speaker proles within a session to be diarized in order to con- vert the clustering task into a classication one and not to achieve the best possible diarization performance. Additionally, since the initial segmentation was the same both for our system and our audio-only baseline, we expect that any improvements with respect to that part (i.e. more so- phisticated segmentation and/or application of re-segmentation techniques) would lead to similar relative improvements to both systems. Here we essentially modelled each speaker by a single embedding, since for the nal prole esti- mation we averaged over all the speech segments assigned to the corresponding speaker. A potential extension of the current work would be an exploration of alternative speaker identity construction strategies, e.g., representing a speaker by a distribution of embeddings. This is particularly promis- ing in scenarios where recordings are long enough so that they may incorporate various acoustic conditions or dierent speaking styles corresponding to the same speaker. In any case, to construct the speaker proles based on roles, we had to assume there is one-to-one correspondence between speakers and roles within a conversation (for our experiments, one speaker assuming the role of patient and one speaker assuming the role of provider). However, there are domains where such an assumption does not hold. In the following chapter, we are going to study in depth such scenarios and we will provide an alternative role-based approach towards more robust speaker diarization. 52 Chapter 4 Multimodal Speaker Clustering with Role Induced Constraints In the previous chapter, we introduced a methodology that utilizes speaker roles to reduce diarization from a clustering problem to a classication one, following a multimodal approach where both audio and text were taken into consideration. As we saw, the language used by the participants in a conversation carries information that can supplement the audio modality. However, we assumed that each speaker is linked to a unique speaker role, an assumption that we also followed in Chapters 1 and 2. 
In this chapter we propose an alternative approach where we employ a supervised text-based model to extract speaker roles and then use this information to guide an audio-based spectral clustering step by imposing must-link and cannot-link constraints between segments. The proposed method, which does not need the aforementioned assumption, is applied on two different domains, namely on medical interactions and on podcast episodes, and is shown to yield improved results when compared to the audio-only approach. The work presented in this chapter has been submitted for publication. The pre-print is available (Flemotomos & Narayanan, 2022).

4.1 Introduction

Speaker diarization, as explained in Sections 3.1 and 3.2, is the task of segmenting a multi-party speech signal into speaker-homogeneous regions (and tagging them with speaker-specific labels) and is a critical component of several applications, including speaker-attributed speech recognition, audio indexing, and speaker tracking (Anguera et al., 2012; Park et al., 2022). Even though recently introduced end-to-end neural diarization offers simplicity and achieves remarkable results in some scenarios (Fujita et al., 2019; Horiguchi, Fujita, Watanabe, Xue, & Nagamatsu, 2020), modular, clustering-based diarization is still widely used and has been an indispensable part of award-winning systems in recent challenges (Medennikov et al., 2020; Y. Wang et al., 2021).

In the conventional diarization approach, the speech signal is first segmented into chunks which are assumed to be speaker-homogeneous, in the sense that a single speaker is active therein. Speaker representations, typically bottleneck feature vectors obtained from a speaker classification neural network (Dawalatabad et al., 2021; Koluguri, Park, & Ginsburg, 2022; Snyder et al., 2018), are then estimated for all the segments and their pairwise similarities are computed. A clustering algorithm that gives the desired labeled speech segments is finally employed. Even though it is generally assumed that no information is known a priori about the speakers, in practice we often need to deploy diarization systems in specific applications, and domain-dependent processing can be used to further improve the final performance. To that end, both the acoustic (e.g., Y. Wang et al., 2021) and the linguistic (e.g., Chapter 3) streams of information can be exploited to either adapt the models or modify the diarization pipeline.

The language-based approach, where the transcripts of a recording are taken into consideration during diarization, is especially promising for interactions where speakers play dissimilar roles. It should be noted that several role-playing conversations, such as interviews, clinical interactions, and court hearings, have been included in the evaluation data of recent diarization challenges (Ryant et al., 2021). In Chapter 3 we used language to identify the roles associated with different speech segments, estimate the acoustic profiles of the participants in the conversation, and eventually reduce the clustering problem to a classification one. Along similar lines, in Chapter 1 we ran an audio-based speaker clustering and a language-based role recognition module in parallel and then combined their outputs through a meta-classifier. However, those systems assume a one-to-one correspondence between speakers and roles, i.e., every speaker is linked to a unique role during a conversation.
Even though this is a reasonable assumption in multiple domains (e.g., the medical domain with dialogues between a clinician and a patient), the systems cannot be easily generalized when a single speaker assumes multiple roles or when multiple speakers play the same role (e.g., trials with a single judge, a group of co-defendants, and multiple prosecution witnesses).

To overcome this limitation, here we propose to exploit the linguistically extracted role information only to impose constraints during audio-based clustering. Depending on the domain, we can impose must-link and/or cannot-link constraints, without the need for one-to-one correspondence between speakers and roles. In particular, we use a BERT-based classifier to extract speaker role information from text and we then impose a list of pairwise constraints between segments linked to the same roles or different ones. Using manually-derived speaker-homogeneous segments with oracle transcriptions, we evaluate the effectiveness of the approach on the clustering performance by running experiments on two different domains: i) dyadic clinical interactions, where the roles of interest are the ones of the therapist and the patient, and ii) multi-party interactions from a weekly radio show with only partial role information available, where the role of interest is the host.

4.2 Background and Prior Work

4.2.1 Spectral clustering for speaker diarization

Clustering is one of the main components in modular speaker diarization. During that step, speech segments are grouped into same-speaker classes, usually following either a HAC (Sell et al., 2018) or a spectral clustering (Q. Wang et al., 2018) approach. This grouping is based on the pairwise similarities between the N segments to be clustered, which are stored in an affinity matrix $\hat{\mathbf{W}} \in \mathbb{R}^{N \times N}$. In Chapter 3 we used PLDA to estimate the elements of the affinity matrix. Another common choice for estimating the affinities uses the cosine distance: given two speaker embeddings $\mathbf{v}_i$, $\mathbf{v}_j$, we have

$$\hat{W}_{ij} = \frac{1}{2}\left(1 + \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\|\,\|\mathbf{v}_j\|}\right) \qquad (4.1)$$

which ensures that the affinities are in the range [0, 1].

Having constructed the refined affinity matrix $\mathbf{W}$ (where the refinements are explained later), spectral clustering is a technique that exploits the eigen-decomposition of $\mathbf{W}$ to project the N elements onto a suitable lower-dimensional space (Ng, Jordan, & Weiss, 2001). To do so, we define the degrees $d_i \triangleq \sum_j W_{ij}$ and we construct the normalized Laplacian matrix

$$\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2}\,\mathbf{W}\,\mathbf{D}^{-1/2} \qquad (4.2)$$

where $\mathbf{D} = \mathrm{diag}\{d_1, d_2, \ldots, d_N\}$. Assuming we know the number of speakers k, we find the k eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ corresponding to the k smallest eigenvalues of $\mathbf{L}$ and form the matrix $\mathbf{X} = [\mathbf{x}_1 \,|\, \mathbf{x}_2 \,|\, \cdots \,|\, \mathbf{x}_k]$. After normalizing the rows of $\mathbf{X}$ to unit norm, so that $\tilde{X}_{ij} = X_{ij} / \sqrt{\sum_j X_{ij}^2}$, we cluster the N rows of $\tilde{\mathbf{X}}$ through a k-means algorithm and assign the original l-th segment to speaker s if and only if the l-th row of $\tilde{\mathbf{X}}$ is assigned to speaker s.
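As a concrete illustration of equations (4.1) and (4.2) and the clustering step just described, the following is a minimal NumPy/scikit-learn sketch of the core algorithm for a known number of speakers k. The refinement operations discussed next are omitted, and all function and variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def cosine_affinity(V):
    """Affinity matrix of equation (4.1) from row-stacked embeddings V (N x d)."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return 0.5 * (1.0 + Vn @ Vn.T)                      # entries in [0, 1]

def spectral_cluster(W, k):
    """Basic spectral clustering on a (refined) affinity matrix W for k speakers."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt    # normalized Laplacian, eq. (4.2)
    eigvals, eigvecs = np.linalg.eigh(L)                # eigenvalues in ascending order
    X = eigvecs[:, :k]                                  # eigenvectors of the k smallest eigenvalues
    X_tilde = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(X_tilde)

# usage: labels = spectral_cluster(cosine_affinity(xvectors), k=2)
```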
In order to effectively use spectral clustering in diarization settings, several refinement operations have been proposed to be applied on the original affinity matrix (Park, Han, Kumar, & Narayanan, 2019; Q. Wang et al., 2018), the most notable being p-thresholding. Given the original affinity matrix $\hat{\mathbf{W}}$, the (100 − p)% largest values in each row are set to 1 and the rest are either binarized to 0 or multiplied by a small constant (soft thresholding), giving the modified matrix $\hat{\mathbf{W}}_p$. Since this operation may break the symmetry property of the affinity matrix, we re-symmetrize it to get

$$\mathbf{W} = \frac{1}{2}\left(\hat{\mathbf{W}}_p + \hat{\mathbf{W}}_p^T\right) \qquad (4.3)$$

Instead of fixing a specific value p, an auto-tuning approach which uses the maximum eigengap of the Laplacian matrix can be followed (Park, Han, Kumar, & Narayanan, 2019). The eigengap criterion has its roots in graph theory and is also used to estimate the number of clusters (speakers) $\hat{k}$, when this is not known a priori. $\mathbf{L}$ is a positive semi-definite matrix with N non-negative real eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N$. If $\mathbf{W}$ is viewed as an adjacency matrix of a graph with $\hat{k}$ perfectly connected components, then $\hat{k}$ equals the multiplicity of the eigenvalue $\lambda = 0$. In practical applications, where we do not expect perfect components, $\hat{k}$ is estimated by the maximum eigengap:

$$\hat{k} = \arg\max_k \left(\lambda_{k+1} - \lambda_k\right) \qquad (4.4)$$

4.2.2 Constrained clustering for speaker diarization

Constrained clustering extends the traditional unsupervised learning paradigm of clustering by integrating supplemental information in the form of constraints (Gançarski, Dao, Crémilleux, Forestier, & Lampert, 2020). Even though several types of constraints have been explored, the most common ones are the instance-level relations, and in particular the must-link (ML) and cannot-link (CL) constraints. Under that viewpoint, if an ML (CL) constraint is imposed between two segments, then those segments must (must not) be in the same cluster.

In speaker diarization, constrained clustering has been applied with constraints imposed either by human input or by acquired knowledge within a particular framework. C. Yu and Hansen (2017) propose a system where a sufficient number of segments corresponding to all the speakers are first identified by a human expert and the rest of the segments are clustered in a constrained fashion. Bost and Linares (2014) apply a two-step clustering for audio-based speaker diarization in videos, where speakers are first clustered locally in scenes detected to contain dialogues, before a global clustering with CL constraints between segments locally assigned to different clusters. Similarly, in an effort to integrate end-to-end and clustering-based diarization, Kinoshita, Delcroix, and Tawara (2021a, 2021b) first estimate distinct local neural speaker embeddings from short speech chunks, which they then CL-constrain in the subsequent global speaker clustering step. Finally, Tripathi et al. (2022) employ a speaker change detector and impose CL constraints between consecutive segments separated by a speaker change and ML constraints between segments where a speaker change was not detected. To the best of our knowledge, constraints grounded on language, which can provide crucial information in role-based conversational settings, have not been explored.

4.2.3 Constrained spectral clustering

Constraints can be combined with several clustering algorithms, such as k-means (Kinoshita et al., 2021b) or HAC (Prokopalo, Shamsi, Barrault, Meignier, & Larcher, 2021). In this work we use a constrained spectral clustering approach, where constraints are integrated via the exhaustive and efficient constraint propagation (E²CP) algorithm (Lu & Peng, 2013), which was recently applied in diarization settings (Tripathi et al., 2022). Applying E²CP, we can propagate an initial set of pairwise constraints to the entire session. In order to do so, we define a constraint matrix $\mathbf{Z} \in \mathbb{R}^{N \times N}$, such that

$$Z_{ij} = \begin{cases} +1, & \text{if an ML constraint exists between } i \text{ and } j \\ -1, & \text{if a CL constraint exists between } i \text{ and } j \\ 0, & \text{if no constraint exists between } i \text{ and } j \end{cases} \qquad (4.5)$$

Soft constraints can also be applied within this framework by setting $|Z_{ij}| < 1$, with $|Z_{ij}|$ denoting the confidence score that a constraint should be imposed between the i-th and j-th segments. The elements of the affinity matrix $\hat{\mathbf{W}}$ are then updated as

$$\hat{W}_{ij} \leftarrow \begin{cases} 1 - (1 - F_{ij})(1 - \hat{W}_{ij}), & \text{if } F_{ij} \geq 0 \\ (1 + F_{ij})\,\hat{W}_{ij}, & \text{if } F_{ij} < 0 \end{cases} \qquad (4.6)$$

where $\mathbf{F}$ contains the constraints propagated to the entire session based on the initial set of constraints and is estimated as

$$\mathbf{F} = (1 - \alpha)^2 \left(\mathbf{I} - \alpha\bar{\mathbf{L}}\right)^{-1} \mathbf{Z} \left(\mathbf{I} - \alpha\bar{\mathbf{L}}\right)^{-1} \qquad (4.7)$$

$\bar{\mathbf{L}}$ equals $\bar{\mathbf{D}}^{-1/2}\,\hat{\mathbf{W}}\,\bar{\mathbf{D}}^{-1/2}$, where $\bar{\mathbf{D}}$ is a diagonal matrix defined like $\mathbf{D}$ in Section 4.2.1, but using the degrees of $\hat{\mathbf{W}}$. The constant $\alpha \in [0, 1]$ is a tunable hyperparameter: a small value penalizes large changes between the initial pairwise constraints in $\mathbf{Z}$ and the new constraints created during propagation, while a large value penalizes large changes between the neighboring segments in the graph described by $\hat{\mathbf{W}}$. Note that for $\alpha = 0$ we get $\mathbf{F} = \mathbf{Z}$, which means we only rely on the initial constraints, and for $\alpha = 1$ we get $\mathbf{F} = \mathbf{0}$, which means we completely ignore any constraint information. The constraint propagation and integration described here take place before the refinement and spectral operations described in Section 4.2.1.
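A compact NumPy sketch of the propagation and affinity update in equations (4.5)–(4.7) is given below. It is meant as an illustration of the formulas rather than a reproduction of the E²CP reference implementation, and the function and variable names are ours.

```python
import numpy as np

def propagate_constraints(W_hat, Z, alpha=0.5):
    """Propagate a sparse constraint matrix Z over the whole session (eq. 4.7)."""
    d = W_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_bar = D_inv_sqrt @ W_hat @ D_inv_sqrt          # normalized affinity of W_hat
    P = np.linalg.inv(np.eye(len(W_hat)) - alpha * L_bar)
    return (1.0 - alpha) ** 2 * P @ Z @ P

def apply_constraints(W_hat, F):
    """Update the affinities with the propagated constraints (eq. 4.6)."""
    return np.where(F >= 0,
                    1.0 - (1.0 - F) * (1.0 - W_hat),
                    (1.0 + F) * W_hat)

# usage: W_constrained = apply_constraints(W_hat, propagate_constraints(W_hat, Z, alpha=0.5))
```

The constrained affinity matrix produced here would then go through the p-thresholding, re-symmetrization, and spectral steps of Section 4.2.1.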
4.3 Proposed Method

We propose to use a two-step clustering for conversations where speakers assume distinct roles, as depicted in the example of Figure 4.1.

Figure 4.1: Two-step speaker clustering for role-playing interactions. Here, an ML constraint is imposed for two segments both associated with the role patient. Those segments have to be in the same cluster after the clustering step.

First, speaker roles are identified from text for each speech segment. To that end, we employ a BERT-based classifier (Devlin, Chang, Lee, & Toutanova, 2019), where we add dropout and a softmax inference layer on top of a pre-trained BERT model and we fine-tune it for the task with in-domain data. If, after classification, we have complete role information available (i.e., each segment is associated with a distinct speaker role), we can directly get a purely text-based diarization result (see also Chapter 3). However, there are multiple scenarios where only partial role information is available (e.g., we have sufficient data to only train a binary classifier to identify news anchor vs. guest in a broadcast news program with multiple potential guests within a show). Additionally, we expect that there will be several segments where the linguistic content is not sufficient to robustly infer the associated speaker role. So, we only use role information to impose suitable constraints for the following step of audio-based clustering, and we take into account only segments where roles are identified with sufficient confidence. Even though it is well known that neural classifiers tend to be over-confident about their decisions and that softmax values are usually not a robust proxy of confidence scores, in practice we saw that we can use a softmax threshold as a threshold of confidence, as discussed in Section 4.5.
For those segments where the confidence of their associated role is beyond some specified threshold, we impose ML and CL constraints, according to the domain we are working on. For instance, we can distinguish between the following general scenarios:

1. different roles are always played by different speakers within a session (e.g., teacher vs. students during a lecture): apply a CL constraint between any segments associated with different speaker roles,
2. different speakers always play different roles within a session (e.g., anchor vs. interviewer vs. guest during a broadcast news program, where anchor and interviewer might be the same person): apply an ML constraint between any segments associated with the same speaker role,
3. one-to-one correspondence between speakers and roles within a session (e.g., doctor vs. patient during a doctor's visit): apply both CL and ML constraints as in cases (1) and (2).

Different types of domain-specific strategies can also be followed. The constraints are then integrated within a spectral clustering algorithm, and we proceed as described in Sections 4.2.1 and 4.2.3.

4.4 Datasets

We evaluate the proposed speaker clustering approach on two different domains with role-playing interactions. As detailed below, we use a medical dataset drawn from the psychotherapy field and another dataset from the entertainment industry with podcast episodes.

4.4.1 Psychotherapy sessions

We use a collection of psychotherapy sessions recorded at a university counseling center (UCC), and specifically the sessions in the sets denoted as UCC_train, UCC_dev and UCC_test1 in (Flemotomos et al., 2021); see also Chapter 5, Section 5.4.2. Note that this is a different dataset than the MI sessions used for the experiments in the previous chapters. All the recordings have been normalized to 16 kHz sampling rate, 16 bit precision, and the two recording microphones suspended from the ceiling of the clinic offices have been combined through acoustic beamforming. Each session is a dyadic conversation between a therapist and a patient, and thus falls under case (3) according to the categorization given in Section 4.3. The dataset comprises 97 participants (23 therapists and 74 patients), with no speaker overlap between the train/dev/eval sets. The sessions have been professionally transcribed, the transcribed segments have been forced-aligned with the beamformed audio, and any utterances consisting of only non-speech vocal sounds (e.g., laughs) have been discarded. More details on the dataset are provided in Table 4.1.

Table 4.1: Size of the UCC dataset.
                               train     dev       eval
#sessions                      50        26        20
#segments - therapist          8,766     3,959     4,146
#segments - patient            9,052     4,246     4,245
segment duration (mean)        7.8 sec   8.7 sec   6.4 sec
#words per segment (mean)      21.4      22.3      18.8

4.4.2 Podcast episodes

This American Life (TAL; https://www.thisamericanlife.org/) is a weekly podcast and public radio show where each episode revolves around a specific theme and is structured as a story-telling act with multiple characters. Mao, Li, McAuley, and Cottrell (2020) have curated a dataset of 663 TAL episodes aired between 1995 and 2020. We use the clean, audio-aligned utterances provided, with the recommended train/dev/eval split, and with the archived audio standardized to 16 kHz, 16 bit precision, mono-channel, wav format (as described by Mao et al., 2020). In each episode there are on average 17.7 speakers (std=8.7) with variable speaking times, while the existing background music poses extra challenges for robust clustering and diarization.
The dataset, described in Table 4.2, has been annotated with speaker identities and with three speaker roles, those of host, interviewer, and subject. However, the provided role information was not helpful for our purposes, since, according to the annotations, multiple speakers may play the same role within an episode and, at the same time, a single speaker may play multiple roles (with some episodes having the same speaker occasionally playing all 3 roles). So, we instead chose to annotate as host utterances only the ones spoken by Ira Glass and assign all the other utterances to a non-host speaker role. Ira Glass is the host and executive producer of the show and speaks for 18.6% of the time during the entire dataset (for reference, the second single most-talking speaker of the dataset is Nancy Updike, speaking for 1.6% of the time). Since in this case different roles always denote different speakers (but the inverse does not hold, since there are multiple non-host speakers), this dataset falls under case (1) according to the categorization given in Section 4.3. Of course, we should note that since this annotation strategy is speaker-dependent, the role recognition algorithm applied is also expected to capture speaker-specific, and not purely role-specific, information.

Table 4.2: Size of the TAL dataset.
                               train      dev        eval
#episodes                      593        34         36
#segments - host               26,523     1,765      1,317
#segments - non-host           119,295    6,869      8,039
segment duration (mean)        14.1 sec   13.7 sec   13.4 sec
#words per segment (mean)      37.7       36.6       36.6

4.5 Experiments and Results

4.5.1 Experimental setup

For both datasets we run experiments using the manually derived speaker segments and the corresponding transcriptions, in order to evaluate the effectiveness of the proposed method without propagating potential errors from automated segmentation and speech recognition modules.

We standardize the text by stripping punctuation, removing non-verbal vocalizations, and converting all letters to lower case. We build the binary role classifiers (therapist vs. patient and host vs. non-host) using TensorFlow (Abadi et al., 2016) with the pre-trained uncased English BERT-base model provided in TensorFlow model garden (H. Yu et al., 2020), adding a dropout layer with dropout ratio equal to 0.2. Since very short segments are not expected to have sufficient role-related information, during fine-tuning we only take into account segments containing at least 5 words (65.58% of the available training segments for UCC and 88.65% of the available training segments for TAL). We fine-tune the models for 2 epochs on the training subsets of the datasets, using the development subsets for validation. We use the Adam optimizer with decoupled weight decay (Loshchilov & Hutter, 2019) with initial learning rate equal to $2 \times 10^{-5}$ and with a warm-up stage lasting for the first 10% of the training time. The mini-batch size is set to 16 segments and the maximum allowed segment length is set to 128 tokens (according to the default WordPiece-based BERT tokenizer), which means that 2.91% of the initial training UCC segments and 2.06% of the initial training TAL segments are cropped.
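For readers who want to reproduce a comparable classifier, the sketch below fine-tunes a binary BERT role classifier with the hyperparameters listed above. It uses the Hugging Face Transformers library with PyTorch rather than the TensorFlow model garden setup actually employed here, so it should be read as an approximate re-implementation; all identifiers (finetune_role_classifier, train_texts, train_labels) are ours.

```python
import torch
from torch.optim import AdamW
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

def finetune_role_classifier(train_texts, train_labels, epochs=2, batch_size=16):
    tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
    # dropout of 0.2 and a 2-way softmax head on top of BERT-base
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2, classifier_dropout=0.2)

    enc = tok(train_texts, truncation=True, max_length=128,
              padding=True, return_tensors="pt")
    data = torch.utils.data.TensorDataset(enc["input_ids"], enc["attention_mask"],
                                          torch.tensor(train_labels))
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)

    steps = epochs * len(loader)
    opt = AdamW(model.parameters(), lr=2e-5)                               # decoupled weight decay
    sched = get_linear_schedule_with_warmup(opt, int(0.1 * steps), steps)  # 10% warm-up

    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, y in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()
            opt.step(); sched.step(); opt.zero_grad()
    return tok, model
```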
The speaker representation of the segments is based on the widely used x-vectors (Snyder et al., 2018) and, to that end, the pre-trained VoxCeleb x-vector extractor from Kaldi (Povey et al., 2011) is used (https://kaldi-asr.org/models/m7). A single x-vector is extracted per segment, taking into consideration only the voiced frames, as identified by an energy-based voice activity detector. X-vectors are projected through linear discriminant analysis (LDA) onto a 200-dimensional space and are further mean- and length-normalized. The segments are then clustered following the described constrained spectral clustering approach (https://github.com/wq2012/SpectralCluster) with ML and/or CL constraints imposed according to the predicted associated roles. For the UCC dataset, which features dyadic interactions, we group all the segments into two clusters, while for TAL we estimate the number of speakers using the eigengap criterion described in Section 4.2.1, searching in the range 2–50. The value of p for the p-thresholding step is found through auto-tuning (Park, Han, Kumar, & Narayanan, 2019), searching in the range 40–95, and we use soft thresholding with the multiplicative constant set to 0.01 (Q. Wang et al., 2018).

All the results are reported on the eval subsets of the data. Diarization is evaluated with respect to the diarization error rate (DER), estimated with the pyannote.metrics library (Bredin, 2017a) without allowing any tolerance collar around segment boundaries. As explained in Chapter 3 (Section 3.5.2), DER incorporates three sources of error: false alarms, missed speech, and speaker confusion. However, segmentation is always the oracle one provided by human annotators and, since there is almost no speaker overlap in our datasets, by DER we essentially estimate speaker confusion (false alarm is always 0 and missed speech is 0.02% for UCC and 0.13% for TAL).

4.5.2 Results and discussion

If we have perfect role information for all the segments and if there is a one-to-one correspondence between roles and speakers (e.g., one therapist vs. one patient in UCC), we can get a perfect diarization result in terms of speaker confusion. In the framework of constrained spectral clustering, this can be done by filling Z in equation (4.5) with all the corresponding constraints and setting α = 0 in equation (4.7) so that F = Z. This is reflected in Figure 4.2, where we see how DER changes as we provide more oracle constraints to the algorithm. This is similar to the expected behavior of the algorithm when constraints are added in the form of human supervision.

Figure 4.2: DER for the UCC dataset as a function of the (normalized) number of constraints, always providing oracle role information to build the constraints, for different values of α in equation (4.7).

Without having access to the oracle role information, we have to rely on a segment-level role classifier. The classification accuracy of our BERT-based classifiers after fine-tuning is given in Table 4.3 and is compared to a naive majority-class baseline.

Table 4.3: Classification accuracy (%) of the BERT-based model and a majority-class baseline.
         all segments              segments w. at least 5 words
         maj. class     BERT       maj. class     BERT
UCC      50.59          73.63      53.50          83.22
TAL      85.92          90.87      85.41          90.92
For UCC data: binary problem of identifying therapist vs. patient; for TAL data: binary problem of identifying host vs. non-host.

Even though the classifiers provide reasonable results, we need to ensure that constraints are imposed only on segments which are confidently linked to some role. For this work, apart from only using segments longer than a specified duration (here, containing at least 5 words) to ensure some minimal linguistic content, we use the softmax values associated with the predicted roles as a proxy of the confidence level. As shown in Figure 4.3, the softmax value can indeed act as a reasonable proxy of confidence for our purposes.
In particular, if we only consider segments where the corresponding softmax value is above some threshold, accuracy increases monotonically as a function of the threshold. However, the choice of the threshold value is a trade-off decision between accuracy and adequate support, so that we have a sufficient number of constraints. With that in mind, we choose a threshold equal to 0.980 for the UCC data (accuracy = 94.66%, support = 3,222) and equal to 0.995 for the TAL data (accuracy = 98.15%, support = 3,674), which leads to imposing constraints on around 40% of the segments in both cases.

Figure 4.3: Classification accuracy and support for the BERT-based classifiers when only segments with associated softmax value above some threshold are considered: (a) UCC, (b) TAL.

After constructing the constraint matrix based on the described role classification only for the segments with sufficient role classification confidence, we perform the constrained spectral clustering algorithm. Our experiments fall under case (3) for the UCC data and under case (1) for the TAL data, according to the categorization given in Section 4.3. The results are reported in the second column of Table 4.4 (those results are for α = 0.75 for UCC and α = 0.50 for TAL). For comparison, we also provide results for the two extreme cases: i) following a conventional, unconstrained spectral clustering, which ignores any language-based information, and ii) following a language-only classification using the results of the BERT-based classifier for all the segments, without setting any softmax threshold, which ignores any audio-based information.

Table 4.4: DER (%) using unconstrained audio-only clustering, constrained clustering with role-induced constraints, and language-only role-based classification.
        unconstrained clustering   constrained clustering   role-based classification
        (audio-only)               (multimodal)             (language-only)
UCC     1.38                       1.31                     10.34
TAL     42.22                      23.86                    63.01
The language-only results contain only 2 speakers, since we rely on binary classification.

In the case of the UCC data, our approach yields a small improvement (5.1% relative) compared to the unconstrained baseline. We additionally found that adding more constraints (selecting a smaller softmax threshold as our confidence criterion) leads to worse performance. Comparing this finding to the results displayed in Figure 4.2 with oracle constraints, where the error approaches 0 given a large number of constraints, we realize that our method is sensitive to the performance of the role classifier. This is because any classification errors can be easily propagated to the clustering step (Figure 4.1). This error propagation becomes, as expected, more evident in the case where we constrain all the segments, relying only on the linguistic stream of information (last column of Table 4.4).

Looking at the results with the TAL data, we observe a substantial improvement when going from unconstrained to constrained clustering. We can see that in scenarios with a large number of speakers, even partial role-based information (like the host vs. non-host classification here) can provide useful cues that robustly guide the subsequent clustering.
In more detail, we observed that the imposed constraints changed the final Laplacian matrix in a way that the eigengap criterion led to the detection of more clusters (speakers) per episode. The severe performance degradation with the language-only approach is expected, since the results in that case only contain two speakers (since we only have two role classes), even though each TAL episode features multiple participants.

4.6 Conclusion

In this chapter we proposed to integrate text-based constraints within audio-based clustering to improve the performance of speaker diarization in conversational interactions where speakers assume specific roles. We implemented a BERT-based role classifier relying solely on text data and used its output to construct a constraint matrix for use within constrained spectral clustering. Experimental results in two different domains showed that, after applying a softmax-based confidence criterion, performance can be improved both in cases of one-to-one correspondence between speakers and roles and in cases with only partial available role information, thus overcoming limitations of assumptions we needed to follow for the approaches proposed in the previous chapters.

We performed all our experiments using oracle textual information and oracle speaker segmentation. We should note that, in a real-world scenario, errors would be introduced and potentially propagated to the clustering step both because of a speech recognizer and because of non-ideal segmentation. Speaker segmentation could be included as a separate pre-processing module (e.g., like in Chapter 3), or incorporated with the role recognizer in a named entity recognition (NER)-like approach (e.g., Zuluaga-Gomez et al., 2021). Future work can also investigate a combination of hard and soft constraints for the task, as well as different types of role-induced constraints. Even though here we focused on linguistic characteristics, role-specific behaviors can also be manifest through acoustic, structural, or visual cues, all of which can potentially be used within the framework of role-dependent constrained speaker clustering.

With this chapter, we close our discussion on how linguistically-extracted speaker role information can be used to facilitate the task of speaker diarization. Here we studied how to use this information to impose constraints during audio-based clustering. In Chapter 3 we proposed a technique, suitable in scenarios with one-to-one correspondence between speakers and roles (e.g., patient-doctor interactions), to construct the acoustic speaker identities and reduce diarization to a classification problem. In the following chapter we are going to see how the latter technique can be incorporated within a larger speech and language processing pipeline deployed in clinical settings to solve a real-world problem: the one of psychotherapy quality assessment.

Part III
Real World Impact

Chapter 5
Why Do We Need Roles? Automated Psychotherapy Evaluation as an Example Downstream Application

With the growing prevalence of psychological interventions, it is vital to have measures that rate the effectiveness of psychological care to assist in training, supervision, and quality assurance of services. Traditionally, quality assessment is addressed by human raters who evaluate recorded sessions along specific dimensions, often codified through constructs relevant to the approach and domain.
This is, however, a cost-prohibitive and time-consuming method that leads to limited use in real-world settings. To facilitate this process, we have developed an automated competency rating tool able to process the raw recorded audio of a session, analyzing who spoke when, what they said, and how the health professional used language to provide therapy. Since the system focuses on therapist-attributed language, it is essential to robustly differentiate between utterances spoken by the therapist vs. the patient. We present and analyze our platform using a dataset drawn from its deployment in a real-world clinical setting, and we show how applying the techniques we introduced in Chapter 3 can have a substantial beneficial effect on the overall performance. The work presented in this chapter is based on work that has been published in (Flemotomos et al., 2021).

5.1 Need for Psychotherapy Quality Assessment Tools

Recent epidemiological research suggests that developing a mental disorder is the norm rather than the exception, estimating that the lifetime prevalence of diagnosable mental disorders (i.e., the proportion of the population that, at some point in their life, have experienced or will experience a mental disorder) is around 50% (Kessler et al., 2005) or even more (Schaefer et al., 2017). According to data from 2018, an estimated 47.6 million adults in the United States had some mental illness, and 1 in 7 adults received professional mental health services (Substance Abuse and Mental Health Services Administration, 2019).

Psychotherapy is a commonly used process in which mental health disorders are treated through communication between an individual and a trained mental health professional. Even though its positive effects have been well documented (Lambert & Bergin, 2002; Perry, Banon, & Ianni, 1999; Weisz, Weiss, Han, Granger, & Morton, 1995), there is room for improvement in terms of the quality of services provided. In particular, a substantial number of patients report negative outcomes, with signs of mental health deterioration after the end of therapy (Curran et al., 2019; Klatte, Strauss, Flückiger, & Rosendahl, 2018). Apart from patient characteristics (Lambert & Bergin, 2002), therapist factors play a significant and clinically important role in contributing to negative outcomes (Saxon, Barkham, Foster, & Parry, 2017). This has direct implications for more rigorous training and supervision (Lambert & Ogles, 1997), quality improvement, and skill development.

A critical factor that can lead to increased performance, and thus ensure high quality of services, is the provision of accurate feedback to the practitioner (Hattie & Timperley, 2007). This can take various forms; both client progress monitoring (Lambert, Whipple, & Kleinstäuber, 2018) and performance-based feedback (Schwalbe, Oh, & Zweben, 2014) have been reported to reduce therapeutic skill erosion and to contribute to improved clinical outcomes. The timing of the feedback is of utmost importance as well, since it has been shown that immediate feedback is more effective than delayed (Kulik & Kulik, 1988).

In psychotherapy practice, however, providing regular and immediate performance evaluation is almost impossible. Behavioral coding, the process of listening to audio recordings and/or reading session transcripts in order to observe therapists' behaviors and skills (Bakeman & Quera, 2012), is both time-consuming and cost-prohibitive when applied in real-world settings. It has been reported (Moyers, Martin, Manuel, Hendrickson, & Miller, 2005) that, after intensive training and supervision that lasts on average 3 months, a proficient coder would need up to two hours to code just a 20 min-long session of motivational interviewing (MI), a specific type of psychotherapy which is the focus of the current chapter. The labor-intensive nature of coding means that the vast majority of psychotherapy sessions are not evaluated. As a result, many providers get inadequate feedback on their therapy skills after their initial training (Miller, Sorensen, Selzer, & Brigham, 2006), and behavioral coding is mainly applied for research purposes with limited outreach to community settings (Proctor et al., 2011). At the same time, the barriers imposed by manual coding usually lead to research studies with relatively small sample sizes (Magill et al., 2014), limiting progress in the field.

It is, thus, made apparent that being able to evaluate a therapy session and provide feedback to the practitioner at a low cost and in a timely manner would both boost psychotherapy research and scale up quality assessment to real-world use. In this chapter we investigate whether it is feasible to analyze a therapy session recording in a fully automatic way and provide feedback to the therapist within a short time. The focus is on the importance of speaker role modeling within the overall computational approach and on how some of the techniques presented earlier (especially in Chapter 3) can improve the final performance of automated behavioral coding.

5.2 Behavioral Coding for Motivational Interviewing

Motivational interviewing (MI; Miller & Rollnick, 2012), often used for treating addiction and other conditions, is a client-centered intervention that aims to help clients make behavioral changes through resolution of ambivalence. It is a psychotherapy treatment with evidence supporting that specific skills are correlated with the clinical outcome (Gaume, Gmel, Faouzi, & Daeppen, 2009; Magill et al., 2014) and also that those skills cannot be maintained without ongoing feedback (Schwalbe et al., 2014). Thus, great effort from MI researchers has been devoted to developing instruments to evaluate fidelity to MI techniques.

The gold standard for monitoring clinician fidelity to treatment is behavioral observation and coding (Bakeman & Quera, 2012). During that process, trained coders assign specific labels or numeric values to the psychotherapy session, which are expected to provide important therapy-related details (e.g., "how many open questions were posed by the therapist?" or "did the counselor accept and respect the client's ideas?") and essentially reflect particular therapeutic skills. While there is a variety of coding schemes (Madson & Campbell, 2006), in this study we focus on a widely used research tool, the motivational interviewing skill code (MISC 2.5; Houck, Moyers, Miller, Glynn, & Hallgren, 2010), which was specifically developed for use with recorded MI sessions (Madson & Campbell, 2006). MISC defines behavior codes both for the counselor and the patient, but for the automated system reported here we focus on counselor behaviors.

The MISC manual (Houck et al., 2010) defines both session-level and utterance-level codes. Session-level codes characterize the entire interaction and are scored on a 5-point Likert scale. When coding at the utterance level, instead of assigning numerical values, the coder decides in which behavior category each utterance belongs.
An utterance is a "thought unit" (Houck et al., 2010), which means that multiple consecutive phrases might be parsed into a single utterance and, likewise, multiple utterances might compose a single sentence or talk turn. After the session is parsed into utterances, each one is assigned one of the codes summarized in Table 5.1 (or gets the label NC if it cannot be coded). For the work presented in this chapter we focus on utterance-level codes, but we also use session-level summary indicators. In particular, we estimate i) the ratio of reflections (simple and complex) to questions (open and closed), ii) the percentage of open questions (over the total number of questions), iii) the percentage of complex reflections (over the total number of reflections), and iv) MI adherence, defined as the percentage of utterances coded with any code other than advice (with or without permission), raise concern (with or without permission), confront, direct, warn. A minimal sketch of how these indicators can be computed from a coded session is given after Table 5.1.

Table 5.1: Therapist-related utterance-level codes, as defined by MISC 2.5.
abbreviation  name                           example
ADP           Advise with Permission         Would it be all right if I suggested something?
ADW           Advise w/o Permission          I recommend that you attend 90 meetings in 90 days.
AF            Affirm                         Thank you for coming today.
CO            Confront                       (C: I don't feel like I can do this.) Sure you can.
DI            Direct                         Get out there and find a job.
EC            Emphasize Control              It is totally up to you whether you quit or cut down.
FA            Facilitate                     Uh huh. (keep-going acknowledgment)
FI            Filler                         Nice weather today!
GI            Giving Information             Your blood pressure was elevated [...] this morning.
QUO           Open Question                  Tell me about your family.
QUC           Closed Question                How often did you go to that bar?
RCP           Raise Concern with Permission  Could I tell you what concerns me about your plan?
RCW           Raise Concern w/o Permission   That doesn't seem like the safest plan.
RES           Simple Reflection              (C: The court sent me here.) That's why you're here.
REC           Complex Reflection             (C: The court sent me here.) This wasn't your choice to be here.
RF            Reframe                        (C: [...] something else comes up [...]) You have clear priorities.
SU            Support                        I'm sorry you feel this way.
ST            Structure                      Now I'd like to switch gears and talk about exercise.
WA            Warn                           Not showing up for court will send you back to jail.
NC            No Code                        You know, I... (meaning is not clear)
Most of the examples are drawn from the MISC manual (Houck et al., 2010). Many of the code assignments depend on the client's previous utterance (C).
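The following is a small, self-contained Python sketch of the four session-level summary indicators defined above, computed from a list of utterance-level codes. The function names and groupings are ours, and stacked codes are ignored for simplicity.

```python
from collections import Counter

REFLECTIONS = {"RES", "REC"}
QUESTIONS = {"QUC", "QUO"}
NON_ADHERENT = {"ADP", "ADW", "RCP", "RCW", "CO", "DI", "WA"}  # advice, raise concern, confront, direct, warn

def session_summary(codes):
    """codes: list of MISC utterance-level codes assigned to therapist utterances."""
    c = Counter(codes)
    n_refl = sum(c[x] for x in REFLECTIONS)
    n_q = sum(c[x] for x in QUESTIONS)
    return {
        "reflections_to_questions": n_refl / n_q if n_q else float("inf"),
        "pct_open_questions": 100.0 * c["QUO"] / n_q if n_q else 0.0,
        "pct_complex_reflections": 100.0 * c["REC"] / n_refl if n_refl else 0.0,
        "mi_adherence": 100.0 * sum(v for k, v in c.items() if k not in NON_ADHERENT) / len(codes),
    }

# example: session_summary(["QUO", "RES", "REC", "FA", "GI", "QUC"])
```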
5.3 Psychotherapy Evaluation in the Digital Era

Psychotherapy sessions are interventions primarily based on spoken language, which means that the information capturing the session quality is encoded in the speech signal and the language patterns of the interaction. Thus, with the rapid technological advancements in the fields of speech and natural language processing (NLP) over the last few years (e.g., Devlin et al., 2019; Xiong et al., 2017), and despite many open challenges specific to the healthcare domain (Quiroz et al., 2019), it is not surprising to see trends in applying computational techniques to automatically analyze and evaluate psychotherapy sessions. Such efforts span a wide range of psychotherapeutic approaches including couples therapy (Black et al., 2013), MI (Xiao, Can, et al., 2016) and cognitive behavioral therapy (Flemotomos, Martinez, et al., 2018), used to treat a variety of conditions such as addiction (Xiao, Can, et al., 2016) and post-traumatic stress disorder (Shiner et al., 2012).

Both text-based (Imel, Steyvers, & Atkins, 2015; Xiao, Can, Georgiou, Atkins, & Narayanan, 2012) and audio-based (Black et al., 2013; Xiao et al., 2014) behavioral descriptors have been explored in the literature and have been used either unimodally or in combination with each other (Singla et al., 2018). In this study we focus on behavior code prediction from textual data. Most research studies focused on text-based behavioral coding have relied on written text excerpts (Barahona et al., 2018) or used manually-derived transcriptions of the therapy session (Can, Atkins, & Narayanan, 2015; Gibson et al., 2022; Lee, Hull, Levine, Ray, & McKeown, 2019). However, a fully automated evaluation system for deployment in real-world settings requires a speech processing pipeline that can analyze the audio recording and provide a reliable speaker-segmented transcript of what was spoken by whom. This is a necessary condition before such an approach is introduced into clinical settings since, otherwise, it may eliminate the burden of manual behavioral coding, but it introduces the burden of manual transcription. An end-to-end system is presented by Xiao, Imel, Georgiou, Atkins, and Narayanan (2015) and Xiao, Huang, et al. (2016), where the authors report a case study of automatically predicting the empathy expressed by the provider. A similar platform, focused on couples therapy, is presented by Georgiou, Black, Lammert, Baucom, and Narayanan (2011b). Even employing an ASR module with a relatively high error rate, those systems were reported to provide competitive prediction performance. The scope of the particular studies, though, was limited only to session-level codes, while the evaluation sessions were selected from the two extremes of the coding scale. Thus, for each code the problem was formulated as a binary classification task trying to identify therapy sessions where a particular code (or its absence) is represented more prominently (e.g., identify 'low' vs. 'high' empathy).

5.4 Current Study

5.4.1 System overview

We analyze a platform able to process the raw recording of a psychotherapy session and provide, within a short time, performance-based feedback according to therapeutic skills and behaviors. We focus on dyadic psychotherapy interactions (i.e., one therapist and one client) and the quality assessment is based on the counselor-related codes of the MISC protocol (Houck et al., 2010). The behavioral codes are predicted by NLP algorithms that analyze the linguistic information captured in the automatically derived transcriptions. The behavioral analysis of the counselor is summarized into a comprehensive feedback report that can be used directly by the provider as a self-assessment method or by a supervisor as a supportive tool that helps them deliver more effective and engaging training.

After both parties have formally consented, the therapist begins recording the session. The digital recording is sent to the processing pipeline and appropriate acoustic features are extracted from the raw speech signal.
The baseline system explored here (Figure 5.1) consists of six main steps: (a) voice activity detection (VAD), where speech segments are detected over silence or background noise, (b) speaker diarization, where the speech segments are clustered into same-speaker groups (e.g., speaker A, speaker B of a dyad), (c) automatic speech recognition (ASR), where the audio speech signal of each speaker-homogeneous segment is transcribed to words, (d) speaker role recognition (SRR), where each speaker group is assigned their role (i.e., therapist vs. client), (e) utterance segmentation, where the speaker turns are parsed into utterances which are the basic units of behavioral coding, and (f) automated behavioral coding, where a MISC-based code is assigned to each therapist-attributed utterance. Speaker role recognition is essential in this system, since the goal is to robustly identify and then automatically code the therapist utterances.

The architecture design described can inevitably lead to error propagation. Here, we study how errors due to diarization can affect the overall performance of the downstream task of psychotherapy quality assessment and how an alternative framework can help alleviate such error propagation problems. In particular, we compare the system of Figure 5.1 with the architecture of Figure 5.2, which is based on the linguistically aided diarization approach introduced in Chapter 3. Using a collection of real-world psychotherapy recordings acquired after the deployment of our system in clinical settings, we show that traditional clustering-based diarization can fail for certain sessions, leading to inaccurate behavior coding results. Employing simple quality and confidence thresholds based on the expected speaking times of the two interlocutors, we can instead use the linguistically-aided approach for those sessions and get significant performance gains.

Figure 5.1: Baseline transcription and coding pipeline developed to assess the quality of a psychotherapy session. The focus of the particular study is on the effect of the speaker diarization and role recognition modules on the overall performance.

Figure 5.2: Transcription and coding pipeline employing linguistically-aided, role-based speaker diarization.

5.4.2 Deployment: data collection and pre-processing

Through a collaboration with the counseling center of a large US-based university, we gathered a corpus of real-world psychotherapy sessions to evaluate the system. Therapy treatment was provided by a combination of licensed staff as well as trainees pursuing clinical degrees.
Topics discussed span a wide range of concerns common among students, including depression, anxiety, substance use, and relationship concerns. All the participants (both patients and therapists) had formally consented to their sessions being recorded. Study procedures were approved by the institutional review board of the University of Utah. Each session was recorded by two microphones suspended from the ceiling of the clinic offices, one omni-directional and one directed to where the therapist generally sits.

Data were collected between September 2017 and March 2020, for a total of 5,097 recordings. Out of those, 188 sessions were selected to be manually transcribed and coded. Coding took place in two independent trials (one in mid 2018 and one in late 2019), with some differences in the procedure between the two. For the first coding trial (96 sessions), the transcriptions were stripped of punctuation and coders were asked to parse the session into utterances. During the second trial (92 sessions), the human transcriber was asked to insert punctuation, which was used to assist parsing. Additionally, for the second batch of transcriptions, stacked behavioral codes (more than one code per utterance) were allowed in case one of the codes is open or closed question (QUO or QUC). We have split the first trial into train (UCC_train; 50 sessions), development (UCC_dev; 26 sessions), and test (UCC_test1; 20 sessions) sets, while we refer to the second trial as the UCC_test2 set. The split for the first trial was done in a way so that there is no speaker overlap between the different sets. For this chapter, we report results on the 112 sessions of the combined UCC_test1 and UCC_test2 sets.

The manually transcribed UCC sessions do not contain any timing information, which means that we needed to align the provided audio with text. That way, we were able to get estimates of the "ground truth" information required for evaluation. We did so by using the Gentle forced aligner (https://github.com/lowerquality/gentle), an open-source, Kaldi-based (Povey et al., 2011) tool, in order to align at the word level. However, we should note that this inevitably introduces some error to the evaluation process, since 9.4% of the words per session on average (std=3.4%) remain unaligned.

Another pre-processing step we needed to take in order to have a meaningful evaluation of the system on the UCC data is related to the behavioral labels assigned by the humans and by the platform. In particular, some of the utterance-level MISC codes are assigned very few times within a session by the human raters and the corresponding inter-rater reliability (IRR) is very low (Table A.1); additionally, there are pairs or groups of codes with very close semantic interpretation, as reflected by the examples in Table 5.1 (e.g., complex reflections (REC) and reframes (RF)). Thus, we clustered the codes into composite groups, resulting in 9 target labels. The mapping between the codes defined in the MISC manual and the target labels, as well as the occurrences of those labels in the UCC data, is given in Table 5.2. The facilitate code (FA) seems to dominate the data, because most of the verbal fillers (e.g., uh-huh, mm-hmm, etc.), which are very frequent constructs in conversational speech, and single-word utterances (e.g., yeah, right, etc.) are labeled as FA.

Table 5.2: Mapping between MISC-defined behavior codes and grouped target labels, together with the occurrences of each group in the evaluation UCC sets.
group   MISC codes                         count
FA      FA                                 13,618
GI      GI, FI                             7,661
QUC     QUC                                4,387
QUO     QUO                                2,658
REC     REC, RF                            6,342
RES     RES                                829
MIN     ADP, ADW, CO, DI, RCW, RCP, WA     987
MIA     AF, EC, SU                         1,839
ST      ST                                 2,081
MISC abbreviations are defined in Table 5.1. MIA stands for MI-Adherent codes. MIN stands for MI-NonAdherent codes.

5.5 Experiments

We apply and compare the two systems introduced in Section 5.4.1 and we focus on the performance of the speaker diarization and role recognition modules with respect to the end task of automated behavioral coding. In both cases, all the other modules (VAD, ASR, utterance segmentation, MISC labeling) remain fixed; details on those modules are provided in Appendix B.

5.5.1 System with clustering-based diarization

Following a traditional speaker diarization approach, the speech signal is first partitioned into segments where a single speaker is present and then those speaker-homogeneous segments are clustered into same-speaker groups. In the baseline pipeline of Figure 5.1 we follow the x-vector/PLDA paradigm (Sell et al., 2018), the same baseline audio-only diarization approach we followed in Chapter 3. Each voiced segment, as predicted by VAD, is partitioned uniformly into subsegments of length equal to 1.5 sec with a shift of 0.25 sec. For each subsegment an x-vector (Snyder et al., 2018) is extracted using the pre-trained CallHome x-vector extractor provided by Kaldi (Povey et al., 2011; https://kaldi-asr.org/models/m6). The subsegments are finally clustered according to hierarchical agglomerative clustering (HAC) with average linking, using probabilistic linear discriminant analysis (PLDA) as the similarity metric. Since each session is expected to have exactly two speakers, we continue the HAC procedure until two clusters are constructed. As a post-processing step, adjacent speech segments assigned to the same speaker are concatenated into a single speaker turn, allowing a maximum of 1 sec of in-turn silence.

After diarization, we have the entire set of utterances clustered into two groups; however, there is not a natural correspondence between the cluster labels and the actual speaker roles (i.e., therapist and client). For our purposes, speaker role recognition (SRR) is exactly the task of finding the mapping between the two. We employ the speaker-level SRR approach used as a baseline in Chapter 2, with the text provided by the ASR subsystem (without lattice rescoring), since early experiments showed that this approach gives perfect recognition results for all the sessions where diarization accurately distinguishes the two interlocutors. Let us denote the two clusters identified by diarization as S_1 and S_2, each one containing the utterances assigned to the two different speakers. We know a priori that one of those speakers is the therapist (T) and one is the client (C). In order to do the role matching, two trained LMs, one for the therapist (LM_T) and one for the client (LM_C), are used. We then estimate the perplexities of S_1 and S_2 with respect to the two LMs and assign to S_i the role that yields the minimum perplexity. In case one role minimizes the perplexity for both speakers, we first assign the speaker for whom we are most confident; the confidence metric is based on the absolute distance between the two estimated perplexities (for more details, please also refer to Algorithm 1 in Chapter 2). The required LMs are 3-gram models trained with the SRILM toolkit (Stolcke, 2002), using the MI-train and CPTS corpora with mixing parameters 0.8 and 0.2, respectively (those datasets were introduced in Chapter 1 and are also described in Appendix B).
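To illustrate the role-matching step just described, here is a small Python sketch of the perplexity-based assignment for the dyadic case. The perplexity functions are assumed to be provided by whatever LM toolkit is used (e.g., a wrapper around the SRILM models), and all names are ours.

```python
def assign_roles(cluster_texts, ppl_therapist, ppl_client):
    """Map two diarization clusters to the roles T (therapist) and C (client).

    cluster_texts:  dict {"S1": "...", "S2": "..."} with the concatenated words of each cluster
    ppl_therapist:  callable returning the perplexity of a text under LM_T
    ppl_client:     callable returning the perplexity of a text under LM_C
    """
    scores = {s: (ppl_therapist(txt), ppl_client(txt)) for s, txt in cluster_texts.items()}
    # preferred role and confidence (absolute perplexity gap) per cluster
    prefs = {s: ("T" if pt < pc else "C", abs(pt - pc)) for s, (pt, pc) in scores.items()}

    first, second = sorted(prefs, key=lambda s: prefs[s][1], reverse=True)  # most confident first
    roles = {first: prefs[first][0]}
    roles[second] = "C" if roles[first] == "T" else "T"   # remaining cluster gets the other role
    return roles
```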
5.5.2 System with classification-based diarization

As an alternative, we explore the system of Figure 5.2, where the clustering-based diarization is replaced by a classification-based one. In order to do so, we follow the approach developed in Chapter 3 (Figure 3.2). The voiced segments derived from the VAD module are transcribed with a first pass of ASR and are then sub-segmented based on the textual information. For text segmentation we use the DeepSegment tool (https://github.com/notAI-tech/deepsegment), which uses a BiLSTM-CRF architecture similar to the one we built in Chapter 3; this is the same tool used for the utterance segmentation module of both systems (Figures 5.1 and 5.2) before behavioral coding (see also Appendix B). SRR is now performed at the turn level using the same LMs, LM_T and LM_C, as in the baseline system of Section 5.5.1. In order to estimate the acoustic profiles (see Section 3.3), we use the 50% of the role-annotated segments per session about which we are most confident according to the perplexity-based criterion of equation (3.2). Those acoustic profiles are then used during a PLDA-based classification: as in the baseline system, each voiced segment, as predicted by VAD, is partitioned uniformly into subsegments of length equal to 1.5 sec with a shift of 0.25 sec, and each subsegment is labeled as belonging to the interlocutor who maximizes the PLDA similarity. We use the same speaker representation as in the baseline system (employing the CallHome x-vector extractor) for both the profile estimation and the sub-segmentation step.

As shown in Figure 5.2, after the linguistically-aided diarization (for which ASR outputs are required), we have a second pass of ASR. The reason is that diarization defines different speech segments than the VAD-based ones used during the first pass, and we wanted to have a fair comparison with the baseline system, keeping all the other modules (apart from diarization/role recognition) fixed. However, we should note that, as explained in Appendix B, the second ASR pass does not yield improved recognition results (with respect to the estimated word error rates) compared to the first pass.

5.6 Analysis and Results

5.6.1 Speaker diarization

We first evaluate the two different diarization systems described in Section 5.5. The standard evaluation metric, which we have also used in Chapters 3 and 4, is the diarization error rate (DER; Anguera et al., 2012), and it incorporates three sources of error: false alarms (the percentage of speech in the output but not in the ground truth), missed speech (the percentage of speech in the ground truth but not in the output), and speaker error (the percentage of speech assigned to the wrong speaker cluster after an optimal mapping between speaker clusters and true speaker labels). However, false alarm here is not representative of the algorithms' performance because of the specific implementation followed. In particular, we chose to concatenate adjacent speech segments assigned to the same speaker if there is not a silence gap between them greater than 1 sec. This step increases DER, since it labels short non-voiced segments as belonging to some speaker, thus introducing false alarms. However, it creates longer speaker-homogeneous segments, which is beneficial to ASR and, hence, to the overall system.
5.6 Analysis and Results

5.6.1 Speaker diarization

We first evaluate the two different diarization systems described in Section 5.5. The standard evaluation metric, which we have also used in Chapters 3 and 4, is the diarization error rate (DER; Anguera et al., 2012), and it incorporates three sources of error: false alarms (the percentage of speech in the output but not in the ground truth), missed speech (the percentage of speech in the ground truth but not in the output), and speaker error (the percentage of speech assigned to the wrong speaker cluster after an optimal mapping between speaker clusters and true speaker labels). However, false alarm here is not representative of the algorithms' performance because of the specific implementation followed. In particular, we chose to concatenate adjacent speech segments assigned to the same speaker if there is not a silence gap between them greater than 1 sec. This step increases DER, since it labels short non-voiced segments as belonging to some speaker, thus introducing false alarms. However, it creates longer speaker-homogeneous segments, which is beneficial to ASR and, hence, to the overall system.

What is important for the downstream task is to identify the therapist speech, and for that reason we want to minimize missed speech and speaker error. Results with respect to those metrics are reported in Table 5.3. They are estimated using the NIST md-eval.pl tool, with a forgiveness collar of 0.25 sec around each speaker boundary.

Table 5.3: Diarization results (%) for the UCC data.
diarization method       missed speech    speaker error
clustering-based              0.5              7.6
classification-based          0.5              4.9
clustering-based refers to the system of Figure 5.1 and classification-based refers to the system of Figure 5.2.

We can see that the overall diarization performance is improved by the classification-based system, thus validating the results of Chapter 3. However, a per-session analysis revealed that most of this performance gap is due to a handful of sessions for which the traditional, clustering-based diarization essentially failed, with a reported speaker error rate as high as 50%. At the same time, the clustering-based system occasionally performs even better than the classification-based one for sessions under clean acoustic conditions and featuring speakers with very dissimilar acoustic characteristics (e.g., male vs. female). In order to get the best of both worlds, and to avoid the increased computational complexity of the classification-based system whenever this is not needed[6], we propose to start with the clustering-based system for all the sessions and apply a simple proxy of diarization performance. If, according to this proxy, the clustering-based diarization fails, we halt processing and re-run diarization, this time using the classification-based, linguistically-aided system. According to our proxy, the percentage of speech assigned to each one of the two speakers should be at least m% of the total speaking time, with m = 10 for the results reported here[7]. Since we deal with dyadic conversational scenarios, it is expected that each of the two speakers talks for a substantial amount of time. Even though therapy is not a normal dialogue and the provider often plays more the role of the listener (Hill, 2009), if one of the two interlocutors seems to not be participating in the conversation, then we are highly confident there is some problem. This may be an issue associated either with the audio quality, or with high speaker error introduced by the diarization module because the two speakers have similar acoustic characteristics.

Per-session results in terms of speaker error rate (SER) when using either the clustering-based or the classification-based system for all the sessions, or a combination of those based on the described threshold, are given in Figure 5.3. As we can see (Figure 5.3a), our quality safeguard is a reasonable proxy of diarization performance: most of the sessions with high estimated SER (more than 15%) are sessions where the speaking time of one of the interlocutors is very low (less than 10%), suggesting that the two speaker clusters were collapsed into one. When we choose to continue processing those sessions using the classification-based system (Figure 5.3c), the problem is alleviated.

[Figure 5.3: Speaker error rate (SER) per UCC session for the different system designs illustrated in Figures 5.1 and 5.2. Scatter plots of SER (%) against min speaker time / total speaking time for (a) the clustering-based system, (b) the classification-based system, and (c) the combination. In (c), we use the classification-based system for the sessions where the speaking time of each speaker is not at least 10% of the overall speaking time according to the clustering-based diarization output.]

[6] Note that the specific implementation of the classification-based system requires applying ASR and extracting x-vectors twice.
[7] This was one of the quality safeguards incorporated within the original system, presented in (Flemotomos et al., 2021). Since this system was designed with real-world deployment in mind, it was important to incorporate specific quality safeguards that help us both identify potential computational errors, including ones due to diarization, and determine whether the input was an actual therapy session or not (e.g., whether the therapist pushed the recording button by mistake). Based on those safeguards, if certain quality thresholds were not met, then the final report was not generated and feedback was not provided for the specific session. Instead, an error message was displayed to the counselor.
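A minimal sketch of the speaking-time proxy is given below. It assumes the clustering-based output is available as a list of (speaker, start, end) turns and that a separate routine can re-run the session with the classification-based system; the data layout and the fallback function name are illustrative assumptions, not the deployed code.

```python
from collections import defaultdict

def needs_fallback(turns, min_fraction=0.10):
    """Return True if either speaker holds less than `min_fraction` of the speaking time.

    `turns` is assumed to be a list of (speaker_id, start_sec, end_sec) tuples
    produced by the clustering-based diarization.
    """
    per_speaker = defaultdict(float)
    for speaker, start, end in turns:
        per_speaker[speaker] += end - start
    total = sum(per_speaker.values())
    if total == 0 or len(per_speaker) < 2:
        return True  # degenerate output: empty or single-cluster diarization
    return min(per_speaker.values()) / total < min_fraction

# Usage (illustrative):
# if needs_fallback(clustering_turns):
#     turns = run_classification_based_diarization(session_audio)
```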
5.6.2 Psychotherapy evaluation

When diarization fails, the error is propagated throughout the entire pipeline and the system cannot accurately code the therapist utterances. In fact, for seven of the sessions where the clustering-based diarization algorithm failed to sufficiently distinguish between the two speakers, the subsequent speaker-level SRR module (Figure 5.1) failed to find the right mapping between roles and speakers. This is not the case when, for those "problematic" sessions, we use the linguistically-aided, classification-based diarization where role assignment is done at the turn level (Figure 5.2). When we compare the total number of utterances per session that have been assigned to the therapist by the human annotators and by the automated systems, the Spearman correlation is increased from 0.478 (p < 10^-7) in the clustering-based system to 0.561 (p < 10^-9) in the system that uses either the clustering-based or the classification-based diarization according to the minimum speaking time criterion[8].

This behavior is reflected in the final evaluation of the overall system performance as well. Evaluation with respect to utterance-level behavioral coding is not straightforward, since the utterances given to the MISC predictor after automatic transcription are not the same as the ones defined by human transcribers. In that case, we use as a simple evaluation metric the correlation between the tallies, i.e., the counts of each MISC label in the manual coding trial and in the automatically generated report. The results are given in Table 5.4.

Table 5.4: Spearman correlation coefficients for the per-session counts of the utterance-level MISC labels between the manually-derived codes and the machine-generated ones for different diarization approaches.
MISC    clustering-based    classification-based    combination
FA           0.194                0.309†                0.305
GI           0.639†               0.507†                0.627†
RES          0.303                0.187                 0.235
REC          0.388†               0.502†                0.447†
QUC          0.634†               0.475†                0.639†
QUO          0.524†               0.741†                0.753†
MIA          0.451†               0.576†                0.596†
MIN          0.455†               0.390†                0.474†
ST           0.428†               0.549†                0.581†
mean         0.446                0.471                 0.517
clustering-based refers to the system of Figure 5.1 and classification-based refers to the system of Figure 5.2. combination: use the classification-based system for the sessions where the speaking time of each speaker is not at least 10% of the overall speaking time according to the clustering-based system. † p < 0.001, ‡ p < 0.01, * p < 0.05

[8] Those numbers correspond to the utterances predicted by the system after the utterance segmentation module (Figures 5.1 and 5.2).
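For concreteness, the tally comparison can be sketched as follows, assuming the manual and automatic codes are available as per-session lists of utterance-level labels; the helper names and the data layout are hypothetical, and SciPy's spearmanr is used for the correlation.

```python
from collections import Counter
from scipy.stats import spearmanr

def tally_correlation(human_codes, auto_codes, label):
    """Spearman correlation between per-session counts of one MISC label.

    `human_codes` and `auto_codes` map a session id to the list of utterance-level
    labels assigned in that session (an assumed layout, not the system's own).
    """
    sessions = sorted(set(human_codes) & set(auto_codes))
    human_tally = [Counter(human_codes[s])[label] for s in sessions]
    auto_tally = [Counter(auto_codes[s])[label] for s in sessions]
    return spearmanr(human_tally, auto_tally)  # (correlation, p-value)

# Usage (illustrative): rho, p = tally_correlation(manual, automated, "QUO")
```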
We additionally report results with respect to session-level functionals commonly used in MI research. Those are the ratio of reflections to questions (Re2Qu), the percentage of open questions out of all the questions (QUO2Qu), the percentage of complex reflections out of all the reflections (REC2Re), the MI adherence (MI-Adh), defined as the percentage of utterances not assigned the MIN code, as well as the ratio of therapist-attributed over client-attributed speaking time (Ther2Cl). As shown in Table 5.5, Re2Qu and QUO2Qu are reflected more accurately (on average, taking all sessions into account) with the combination approach. With respect to MI-Adh and Ther2Cl, results are better when only the classification-based diarization is used for all the sessions. The only metric for which the clustering-based approach yields the best results is REC2Re; however, Spearman correlations are not statistically significant for the particular metric. The reason behind the low overall performance with respect to the REC2Re metric is that our text-based MISC prediction algorithm had a high confusion rate between complex and simple reflections[9] (see also Appendix B).

Table 5.5: Spearman correlation coefficients for the session-level MISC aggregate metrics between the manually-derived codes and the machine-generated ones for different diarization approaches.
metric     clustering-based    classification-based    combination
Re2Qu           0.324                0.428                 0.452
QUO2Qu          0.527                0.575                 0.698
REC2Re          0.172                0.087                 0.154
MI-Adh          0.354                0.509                 0.418
Ther2Cl         0.720                0.823                 0.815
clustering-based refers to the system of Figure 5.1 and classification-based refers to the system of Figure 5.2. combination: use the classification-based system for the sessions where the speaking time of each speaker is not at least 10% of the overall speaking time according to the clustering-based system. All correlations are significant (p < 0.001), apart from REC2Re (p > 0.05).

[9] In the original system deployed in clinical settings, we have grouped complex and simple reflections into a single composite "reflections" label when generating the feedback report.
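The sketch below spells out how these session-level functionals can be computed from a single session's therapist-attributed utterance labels, using the grouped label names of Table 5.2 and speaking times in seconds. It is a hypothetical helper that returns ratios (rather than percentages), not the deployed implementation.

```python
from collections import Counter

def session_functionals(therapist_labels, therapist_time_sec, client_time_sec):
    """Compute the MI session-level functionals from grouped MISC labels."""
    c = Counter(therapist_labels)
    reflections = c["RES"] + c["REC"]     # simple + complex reflections
    questions = c["QUC"] + c["QUO"]       # closed + open questions
    n = len(therapist_labels)
    return {
        "Re2Qu":   reflections / questions if questions else float("nan"),
        "QUO2Qu":  c["QUO"] / questions if questions else float("nan"),
        "REC2Re":  c["REC"] / reflections if reflections else float("nan"),
        "MI-Adh":  1 - c["MIN"] / n if n else float("nan"),
        "Ther2Cl": therapist_time_sec / client_time_sec if client_time_sec else float("nan"),
    }
```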
Since we cannot directly evaluate the diarization performance on unseen sessions, we used a quality proxy based on the minimum expected speaking time of the two interlocutors. More sophisticated combination methods (e.g., Stolcke & Yoshioka, 2019) and/or condence metrics (e.g., Vaquero, Ortega, Miguel, & Lleida, 2013) for diarization systems can potentially further improve the nal results. The application of a competency rating tool, like the one we presented, in clinical settings could guarantee the provision of fast and low-cost feedback. Performance-based feedback is an essential aspect both for training new therapists and for maintaining acquired skills, and can eventually lead to improved quality of services and more positive clinical outcomes. Additionally, being able to accurately record, transcribe, and code interventions at large scale opens up ample opportunities for psychotherapy research studies with increased statistical power. 86 Conclusions and Future Directions Summary and Main Contributions In the previous chapters I proposed various methods to recognize speaker roles and use the inferred information to facilitate speech processing tasks. A main motivation behind the research conducted has been the reduction of error propagation in pipelined architectures for speech-based applications. In Chapter 1 I showed that combining audio-based speaker clustering with language-based role recognition at the turn level can lead to substantial performance gains for the task of speaker role recognition (SRR). The linguistic information used for this work, however, was extracted from manual transcriptions. In Chapter 2 I extended the ideas behind language-based SRR for a more realistic scenario where the textual information is drawn from automatically derived transcripts. I did so by producing role-specic ASR outputs, suitably rescoring the decoding lattices produced by a generic ASR system. The proposed approach also led to slight improvements in ASR performance. Moving to a dierent speech processing task, in Chapters 3 and 4 I utilized role-related informa- tion to improve the performance of speaker diarization, when applied in conversational interactions where speakers assume dissimilar roles. In particular, in Chapter 3 speaker roles were used to construct the acoustic proles of the interlocutors, thus enabling us to convert speaker diariza- tion from a clustering problem to a classication one. This method, however, assumed that every speaker in the conversation is mapped to a single role and vice-versa. In Chapter 4, I presented a more generic framework, where linguistic, role-based information is used to impose segment-wise constraints during the subsequent step of audio-based clustering. Finally, in Chapter 5 I presented an end-to-end speech and language processing pipeline de- veloped to transcribe and evaluate psychotherapy sessions to provide performance-based feedback 87 to therapists. Speaker diarization and role recognition are crucial components within this compe- tency rating tool, and I showed how employing a role-aided diarization approach can reduce error propagation and lead to improved overall results. Directions for Future Work This dissertation has focused on the computational analysis of formal speaker roles within conversa- tional interactions and on ways that role information can be used to facilitate core speech processing tasks, such as speaker diarization. 
With the evolution and success of end-to-end neural architectures, an exciting area of future research is towards unified frameworks where role recognition and other speech processing modules, such as speech recognition and speaker diarization, are combined together. Early works towards that direction have shown promising results, but leave ample room for improvements and further research (El Shafey, Soltau, & Shafran, 2019; Flemotomos, Chen, Atkins, & Narayanan, 2018).

An assumption made throughout this work is that the role concepts we study remain static during a single interaction. Even though this is in general true for formal roles (e.g., patient vs. doctor), informal roles emerge as a result of interpersonal dynamics and can change over the course of a conversation (Dowell et al., 2019). An interesting direction for future work would be an extension of the tools presented here to the analysis of informal, emergent roles, incorporating this additional element of temporal variability.

Both formal and informal roles can be manifested through specific behavioral patterns; this has been the main overarching idea behind the various models proposed in the previous chapters. However, how an individual behaves within a specific group and under specific circumstances is a function of various aspects, including personality traits and other dimensions of identity. A role that an individual assumes can be viewed as just one such dimension (Hare, 1994). Future research efforts could focus on the analysis and modeling of the relationship between speaker roles and identity characteristics, such as gender, age, and personality.

Finally, an exciting prospect would be the incorporation of role-specific information in voice assistants. While smart conversational agents are becoming part of our everyday lives, generic responses and lack of emotional intelligence remain a shortfall of dialogue generation models, posing obstacles to carrying on long and naturalistic conversations (Roller et al., 2021). Allowing intelligent agents to assume specific roles and adopt role-aware behaviors would potentially give them more human-like conversational characteristics and would make them more adaptive to different environments.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., . . . Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (pp. 265-283).

Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., & Vinyals, O. (2012). Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2), 356-370. doi: 10.1109/TASL.2011.2125954

Anguera, X., Wooters, C., & Hernando, J. (2007). Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2011-2022. doi: 10.1109/TASL.2007.902460

Atkins, D. C., Steyvers, M., Imel, Z. E., & Smyth, P. (2014). Scaling up the evaluation of psychotherapy: evaluating motivational interviewing fidelity via statistical text classification. Implementation Science, 9(1), 49. doi: 10.1186/1748-5908-9-49

Baer, J. S., Wells, E. A., Rosengren, D. B., Hartzler, B., Beadnell, B., & Dunn, C. (2009). Agency context and tailored training in technology transfer: A pilot evaluation of motivational interviewing training for community counselors. Journal of Substance Abuse Treatment, 37(2), 191-202. doi: 10.1016/j.jsat.2009.01.003

Bakeman, R., & Quera, V.
(2012). Behavioral observation. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psy- chology, Vol. 1. Foundations, planning, measures, and psychometrics (pp. 207{225). American Psychological Association. doi: 10.1037/13619-013 Bales, R. F. (1950). A set of categories for the analysis of small group interaction. American Sociological Review, 15(2), 257{263. doi: 10.2307/2086790 90 Barahona, L. M. R., Tseng, B.-H., Dai, Y., Manseld, C., Ramadan, O., Ultes, S., . . . Gasic, M. (2018). Deep learning for language understanding of mental health concepts derived from cognitive behavioural therapy. In Proceedings of the 9th International Workshop on Health Text Mining and Information Analysis (pp. 44{54). doi: 10.18653/v1/W18-5606 Barzilay, R., Collins, M., Hirschberg, J., & Whittaker, S. (2000). The rules behind roles: Identifying speaker role in radio broadcasts. In Proceedings of the 7th National Conference on Articial Intelligence and 12th Conference on Innovative Applications of Articial Intelligence (pp. 679{684). doi: 10.7916/D8PC39Q9 Bazillon, T., Maza, B., Rouvier, M., Bechet, F., & Nasr, A. (2011). Speaker role recognition using question detection and characterization. In Proceedings of Interspeech 2011 (pp. 917{920). doi: 10.21437/Interspeech.2011-442 Be nu s, S. (2014). Social aspects of entrainment in spoken interaction. Cognitive Computation, 6(4), 802{813. doi: 10.1007/s12559-014-9261-4 Be nu s, S., Gravano, A., Levitan, R., Levitan, S. I., Willson, L., & Hirschberg, J. (2014). Entrain- ment, dominance and alliance in supreme court hearings. Knowledge-Based Systems, 71(1), 3{14. Biddle, B. J. (1986). Recent developments in role theory. Annual Review of Sociology, 12(1), 67{92. doi: 10.1146/annurev.so.12.080186.000435 Bigot, B., Ferran e, I., Pinquier, J., & Andr e-Obrecht, R. (2010). Speaker role recognition to help spontaneous conversational speech detection. In Proceedings of the 2010 International Workshop on Searching Spontaneous Conversational Speech (pp. 5{10). doi: 10.1145/1878101 .1878104 Bigot, B., Fredouille, C., & Charlet, D. (2013). Speaker role recognition on TV broadcast doc- uments. In Proceedings of the 1st Workshop on Speech, Language and Audio in Multimedia (pp. 66{71). Bigot, B., Pinquier, J., Ferran e, I., & Andr e-Obrecht, R. (2010). Looking for relevant features for speaker role recognition. In Proceedings of Interspeech 2010 (pp. 1057{1060). doi: 10.21437/ Interspeech.2010-137 Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C.-C., Lammert, A. C., Christensen, A., . . . Narayanan, S. S. (2013). Toward automating a human behavioral coding system for married 91 couples' interactions using speech acoustic features. Speech Communication, 55(1), 1{21. doi: 10.1016/j.specom.2011.12.003 Bost, Xavier and Linares, Georges. (2014). Constrained speaker diarization of TV series based on visual patterns. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (pp. 390{395). doi: 10.1109/SLT.2014.7078606 Bozonnet, S., Evans, N., Anguera, X., Vinyals, O., Friedland, G., & Fredouille, C. (2010). System output combination for improved speaker diarization. In Proceedings of Interspeech 2010 (pp. 2642{2645). doi: 10.21437/Interspeech.2010-701 Bredin, H. (2017a). pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. In Proceedings of Interspeech 2017 (pp. 3587{3591). doi: 10.21437/Interspeech.2017-411 Bredin, H. (2017b). 
Tristounet: Triplet loss for speaker turn embedding. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5430{5434). doi: 10.1109/ICASSP.2017.7953194 Can, D., Atkins, D. C., & Narayanan, S. S. (2015). A dialog act tagging approach to behavioral coding: A case study of addiction counseling conversations. In Proceedings of Interspeech 2015 (pp. 339{343). doi: 10.21437/Interspeech.2015-151 Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., . . . Wellner, P. (2005). The AMI meeting corpus: A pre-announcement. In Proceedings of the 2nd International Conference on Machine Learning for Multimodal Interaction (pp. 28{39). doi: 10.1007/ 11677482 3 Chen, S., & Gopalakrishnan, P. (1998). Speaker, environment and channel change detection and clustering via the bayesian information criterion. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop (pp. 127{132). Chen, Z., Flemotomos, N., Ardulov, V., Creed, T. A., Imel, Z. E., Atkins, D. C., & Narayanan, S. (2021). Feature fusion strategies for end-to-end evaluation of cognitive behavior therapy ses- sions. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineer- ing in Medicine and Biology Society (p. 1836-1839). doi: 10.1109/EMBC46164.2021.9629694 Cheng, S.-S., & Wang, H.-M. (2003). A sequential metric-based audio segmentation method via the bayesian information criterion. In Proceedings of the 8th European Conference on Speech 92 Communication and Technology (pp. 945{948). Cieri, C., Miller, D., & Walker, K. (2004). The Fisher corpus: a resource for the next generations of speech-to-text. In Proceedings of the 4th International Conference on Language Resources and Evaluation (pp. 69{71). Curran, J., Parry, G. D., Hardy, G. E., Darling, J., Mason, A.-M., & Chambers, E. (2019). How does therapy harm? A model of adverse process using task analysis in the meta-synthesis of service users' experience. Frontiers in Psychology, 10, 347. doi: 10.3389/fpsyg.2019.00347 Damnati, G., & Charlet, D. (2011a). Multi-view approach for speaker turn role labeling in TV broadcast news shows. In Proceedings of Interspeech 2011 (pp. 1285{1288). doi: 10.21437/ Interspeech.2011-430 Damnati, G., & Charlet, D. (2011b). Robust speaker turn role labeling of tv broadcast news shows. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5684{5687). doi: 10.1109/ICASSP.2011.5947650 Danescu-Niculescu-Mizil, C., Lee, L., Pang, B., & Kleinberg, J. (2012). Echoes of power: Language eects and power dierences in social interaction. In Proceedings of the 21st International Conference on World Wide Web (pp. 699{708). doi: 10.1145/2187836.2187931 Dawalatabad, N., Ravanelli, M., Grondin, F., Thienpondt, J., Desplanques, B., & Na, H. (2021). ECAPA-TDNN embeddings for speaker diarization. In Proceedings of interspeech 2021 (pp. 3560{3564). doi: 10.21437/Interspeech.2021-941 Demasi, O., Li, Y., & Yu, Z. (2020). A multi-persona chatbot for hotline counselor training. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3623{3636). doi: 10.18653/v1/2020.ndings-emnlp.324 Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of the 2019 Con- ference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171{4186). doi: 10.18653/v1/N19-1423 Dimitriadis, D., & Fousek, P. (2017). Developing on-line speaker diarization system. In Proceedings of Interspeech 2017 (pp. 2739{2743). doi: 10.21437/Interspeech.2017-166 Dowell, N. M., Nixon, T. M., & Graesser, A. C. (2019). Group communication analysis: A com- 93 putational linguistics approach for detecting sociocognitive roles in multiparty interactions. Behavior Research Methods, 51(3), 1007{1041. doi: 10.3758/s13428-018-1102-z Dufour, R., Esteve, Y., & Del eglise, P. (2011). Investigation of spontaneous speech characterization applied to speaker role recognition. In Proceedings of Interspeech 2011 (pp. 917{920). doi: 10.21437/Interspeech.2011-370 El Shafey, L., Soltau, H., & Shafran, I. (2019). Joint speech recognition and speaker diarization via sequence transduction. Proceedings of Interspeech 2019 , 396{400. doi: 10.21437/Interspeech .2019-1943 Favre, S., Dielmann, A., & Vinciarelli, A. (2009). Automatic role recognition in multiparty record- ings using social networks and probabilistic sequential models. In Proceedings of the 17th ACM International Conference on Multimedia (pp. 585{588). doi: 10.1145/1631272.1631362 Flemotomos, N., Chen, Z., Atkins, D. C., & Narayanan, S. (2018). Role annotated speech recog- nition for conversational interactions. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (pp. 1036{1043). doi: 10.1109/SLT.2018.8639611 Flemotomos, N., Georgiou, P., & Narayanan, S. (2019). Role specic lattice rescoring for speaker role recognition from speech recognition outputs. In Proceedings of the 2019 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (pp. 7330{7334). doi: 10.1109/ICASSP.2019.8683900 Flemotomos, N., Georgiou, P., & Narayanan, S. (2020). Linguistically aided speaker diarization using speaker role information. In Proceedings of The Speaker and Language Recognition Workshop (Odyssey 2020) (pp. 117{124). doi: 10.21437/Odyssey.2020-17 Flemotomos, N., Martinez, V. R., Chen, Z., Singla, K., Ardulov, V., Peri, R., . . . Narayanan, S. (2021). Automated evaluation of psychotherapy skills using speech and language technologies. Behavior Research Methods. doi: 10.3758/s13428-021-01623-4 Flemotomos, N., Martinez, V. R., Gibson, J., Atkins, D., Creed, T., & Narayanan, S. (2018). Language features for automated evaluation of cognitive behavior psychotherapy sessions. Proceedings of Interspeech 2018 , 1908{1912. doi: 10.21437/Interspeech.2018-1518 Flemotomos, N., & Narayanan, S. (2022). Multimodal clustering with role induced constraints for speaker diarization. arXiv preprint arXiv:2204.00657 . Flemotomos, N., Papadopoulos, P., Gibson, J., & Narayanan, S. (2018). Combined speaker clus- 94 tering and role recognition in conversational speech. In Proceedings of Interspeech 2018 (pp. 1378{1382). doi: 10.21437/Interspeech.2018-1654 Fujita, Y., Kanda, N., Horiguchi, S., Xue, Y., Nagamatsu, K., & Watanabe, S. (2019). End-to-end neural speaker diarization with self-attention. In Proceedings of the 2020 IEEE Automatic Speech Recognition and Understanding Workshop (pp. 296{303). doi: 10.1109/ASRU46091 .2019.9003959 Gales, M. J. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2), 75{98. 
doi: 10.1006/csla.1998.0043 Gan carski, P., Dao, T.-B.-H., Cr emilleux, B., Forestier, G., & Lampert, T. (2020). Constrained clustering: Current and new trends. In P. Marquis, O. Papini, & H. Prade (Eds.), A Guided Tour of Articial Intelligence Research: Volume II: AI Algorithms (pp. 447{484). Springer. doi: 10.1007/978-3-030-06167-8 14 Garcia-Romero, D., Snyder, D., Sell, G., Povey, D., & McCree, A. (2017). Speaker diarization using deep neural network embeddings. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4930{4934). doi: 10.1109/ICASSP.2017 .7953094 Garg, N. P., Favre, S., Salamin, H., Hakkani T ur, D., & Vinciarelli, A. (2008). Role recognition for meeting participants: An approach based on lexical information and social network analysis. In Proceedings of the 16th ACM International Conference on Multimedia (pp. 693{696). doi: 10.1145/1459359.1459462 Garnier-Rizet, M., Adda, G., Cailliau, F., Guillemin-Lanne, S., Waast-Richard, C., Lamel, L., . . . Waast-Richard, C. (2008). CallSurf: Automatic transcription, indexing and structura- tion of call center conversational speech for knowledge extraction and query by content. In Proceedings of the 6th International Conference on Language Resources and Evaluation (pp. 2623{2628). Gaume, J., Gmel, G., Faouzi, M., & Daeppen, J.-B. (2009). Counselor skill in uences outcomes of brief motivational interventions. Journal of Substance Abuse Treatment, 37(2), 151{159. doi: 10.1016/j.jsat.2008.12.001 Georgiou, P. G., Black, M. P., Lammert, A., Baucom, B., & Narayanan, S. S. (2011a). \That's aggravating, very aggravating": Is it possible to classify behaviors in couple interactions 95 using automatically derived lexical features? In Proceedings of Aective Computing and Intelligent Interaction (ACII 2011), Lecture Notes in Computer Science (Vol. 6974). doi: 10.1007/978-3-642-24600-5 12 Georgiou, P. G., Black, M. P., Lammert, A. C., Baucom, B. R., & Narayanan, S. S. (2011b). \That's aggravating, very aggravating": Is it possible to classify behaviors in couple interactions using automatically derived lexical features? In Proceedings of the 2011 International Conference on Aective Computing and Intelligent Interaction (pp. 87{96). doi: 10.1007/978-3-642-24600 -5 12 Gibson, J., Atkins, D., Creed, T., Imel, Z., Georgiou, P., & Narayanan, S. (2022). Multi-label multi-task deep learning for behavioral coding. IEEE Transactions on Aective Computing, 13(1), 508{518. doi: 10.1109/TAFFC.2019.2952113 Gleave, E., Welser, H. T., Lento, T. M., & Smith, M. A. (2009). A conceptual and operational denition of social role in online community. In Proceedings of the 42nd Hawaii International Conference on System Sciences (pp. 1{11). Gra, D., Wu, Z., MacIntyre, R., & Liberman, M. (1997). The 1996 broadcast news speech and language-model corpus. In Proceedings of the 1997 DARPA Speech Recognition Workshop (pp. 11{14). Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: an overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23{34. doi: 10.20982/ tqmp.08.1.p023 Hare, A. P. (1994). Types of roles in small groups: A bit of history and a current perspective. Small Group Research, 25(3), 433{448. doi: 10.1177/1046496494253005 Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81{112. doi: 10.3102/003465430298487 Hill, C. E. (2009). Helping skills: Facilitating, exploration, insight, and action. 
American Psycho- logical Association. Hori, T., & Nakamura, A. (2013). Speech recognition algorithms using weighted nite-state trans- ducers. Morgan & Claypool. Horiguchi, S., Fujita, Y., Watanabe, S., Xue, Y., & Nagamatsu, K. (2020). End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. In 96 Proceedings of Interspeech 2020 (pp. 269{273). doi: 10.21437/Interspeech.2020-1022 Houck, J. M., Moyers, T. B., Miller, W. R., Glynn, L. H., & Hallgren, K. A. (2010). Motiva- tional interviewing skill code (MISC) version 2.5. (Available from http://casaa.unm.edu/ download/misc25.pdf) Hr uz, M., & Zaj c, Z. (2017). Convolutional neural network for speaker change detection in telephone speaker diarization system. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4945{4949). doi: 10.1109/ ICASSP.2017.7953097 Huang, J., Marcheret, E., Visweswariah, K., Libal, V., & Potamianos, G. (2007). The IBM Rich Transcription 2007 speech-to-text systems for lecture meetings. In R. Stiefelhagen, R. Bowers, & J. Fiscus (Eds.), Multimodal Technologies for Perception of Humans: Inter- national Evaluation Workshops CLEAR 2007 and RT 2007 (pp. 429{441). Springer. doi: 10.1007/978-3-540-68585-2 40 Hutchinson, B., Zhang, B., & Ostendorf, M. (2010). Unsupervised broadcast conversation speaker role labeling. In Proceedings of the 2010 IEEE International Conference on Acoustics Speech and Signal Processing (pp. 5322{5325). doi: 10.1109/ICASSP.2010.5494958 Imel, Z. E., Steyvers, M., & Atkins, D. C. (2015). Computational psychotherapy research: Scaling up the evaluation of patient{provider interactions. Psychotherapy, 52(1), 19{30. doi: 10.1037/ a0036841 India Massana, M. A., Rodr guez Fonollosa, J. A., & Hernando Peric as, F. J. (2017). LSTM neural network-based speaker segmentation using acoustic and language modelling. In Proceedings of Interspeech 2017 (pp. 2834{2838). doi: 10.21437/Interspeech.2017-407 Inkster, B., Sarda, S., & Subramanian, V. (2018). An empathy-driven, conversational articial intelligence agent (Wysa) for digital mental well-being: real-world data evaluation mixed- methods study. JMIR mHealth and uHealth, 6(11), e12106. doi: 10.2196/12106 Ioe, S. (2006). Probabilistic linear discriminant analysis. In Proceedings of the 9th European Conference in Computer Vision, Part IV (pp. 531{542). doi: 10.1007/11744085 41 Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., . . . Wooters, C. (2003). The ICSI meeting corpus. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. I/364{I/367). doi: 10.1109/ICASSP.2003.1198793 97 Jati, A., & Georgiou, P. G. (2017). Speaker2Vec: Unsupervised learning and adaptation of a speaker manifold using deep neural networks with an evaluation on speaker segmentation. In Proceedings of Interspeech 2017 (pp. 3567{3571). doi: 10.21437/Interspeech.2017-1650 Johnstone, B. (1996). The linguistic individual: Self-expression in language and linguistics. Oxford University Press. Jung, C. G. (2014). Two essays on analytical psychology. Routledge. Kessler, R. C., Berglund, P., Demler, O., Jin, R., Merikangas, K. R., & Walters, E. E. (2005). Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the national comorbidity survey replication. Archives of General Psychiatry, 62(6), 593{602. doi: 10.1001/archpsyc.62.6.593 Kinoshita, K., Delcroix, M., & Tawara, N. (2021a). 
Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech. In Proceedings of Interspeech 2021 (pp. 3565{3569). doi: 10.21437/Interspeech.2021-1004 Kinoshita, K., Delcroix, M., & Tawara, N. (2021b). Integrating end-to-end neural and clustering- based diarization: Getting the best of both worlds. In Proceedings of the 2021 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (pp. 7198{7202). doi: 10.1109/ICASSP39728.2021.9414333 Kipper, D. A. (1992). Psychodrama: Group psychotherapy through role playing. International Journal of Group Psychotherapy, 42(4), 495{521. doi: 10.1080/00207284.1992.11490720 Klatte, R., Strauss, B., Fl uckiger, C., & Rosendahl, J. (2018). Adverse eects of psychotherapy: protocol for a systematic review and meta-analysis. Systematic Reviews, 7, 135. doi: 10.1186/ s13643-018-0802-x Knapp, M. L., Hall, J. A., & Horgan, T. G. (2013). Nonverbal communication in human interaction. Cengage Learning. Ko, T., Peddinti, V., Povey, D., & Khudanpur, S. (2015). Audio augmentation for speech recog- nition. In Proceedings of Interspeech 2015 (pp. 3586{3589). doi: 10.21437/Interspeech.2015 -711 Kodish-Wachs, J., Agassi, E., Kenny III, P., & Overhage, J. M. (2018). A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech. In AMIA Annual Symposium Proceedings (pp. 683{689). 98 Koluguri, N. R., Park, T., & Ginsburg, B. (2022). TitaNet: Neural model for speaker representation with 1D depth-wise separable convolutions and global context. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Komninos, A., & Manandhar, S. (2016). Dependency based embeddings for sentence classication tasks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1490{1500). doi: 10 .18653/v1/N16-1175 Krippendor, K. (2018). Content analysis: An introduction to its methodology. Sage publications. Kuich, W., & Salomaa, A. (1986). Semirings, automata, languages. Springer Verlag. Kulik, J. A., & Kulik, C.-L. C. (1988). Timing of feedback and verbal learning. Review of educational research, 58(1), 79{97. doi: 10.3102/00346543058001079 Lambert, M. J., & Bergin, A. E. (2002). The eectiveness of psychotherapy. In M. Hersen & W. Sledge (Eds.), Encyclopedia of Psychotherapy (Vol. 1, pp. 709{714). USA: Elsevier Science. doi: 10.1016/B0-12-343010-0/00084-2 Lambert, M. J., & Ogles, B. M. (1997). The eectiveness of psychotherapy supervision. In C. E. Watkins (Ed.), Handbook of Psychotherapy Supervision (pp. 421{446). John Wiley & Sons, Inc. Lambert, M. J., Whipple, J. L., & Kleinst auber, M. (2018). Collecting and delivering progress feedback: A meta-analysis of routine outcome monitoring. Psychotherapy, 55(4), 520{537. doi: 10.1037/pst0000167 Laurent, A., Camelin, N., & Raymond, C. (2014). Boosting bonsai trees for ecient features combination: application to speaker role identication. In Proceedings of Interspeech 2014 (pp. 76{80). Lee, F.-T., Hull, D., Levine, J., Ray, B., & McKeown, K. (2019). Identifying therapist conversational actions across diverse psychotherapeutic approaches. In Proceedings of the 6th Workshop on Computational Linguistics and Clinical Psychology (pp. 12{23). doi: 10.18653/v1/W19-3002 Li, Y., Wang, Q., Zhang, X., Li, W., Li, X., Yang, J., . . . He, Q. (2017). 
Unsupervised classication of speaker roles in multi-participant conversational speech. Computer Speech & Language, 42, 81{99. doi: 10.1016/j.csl.2016.09.002 Liu, D., & Kubala, F. (2004). Online speaker clustering. In Proceedings of the 2004 IEEE In- 99 ternational Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. I{333). doi: 10.1109/ICASSP.2004.1325990 Liu, Y. (2006). Initial study on automatic identication of speaker role in broadcast news speech. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers (pp. 81{84). Ljolje, A., Pereira, F., & Riley, M. (1999). Ecient general lattice generation and rescoring. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech 1999) (pp. 1251{1254). Lord, C., Risi, S., Lambrecht, L., Cook, E. H., Leventhal, B. L., DiLavore, P. C., . . . Rutter, M. (2000). The autism diagnostic observation schedule-generic: A standard measure of social and communication decits associated with the spectrum of autism. Journal of Autism and Developmental Disorders, 30(3), 205{223. doi: 10.1023/A:1005592401947 Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In Proceedings of the 7th international conference on learning representations. Lu, Z., & Peng, Y. (2013). Exhaustive and ecient constraint propagation: A graph-based learning approach and its applications. International Journal of Computer Vision, 103(3), 306{325. doi: 10.1007/s11263-012-0602-z Luz, S. (2009). Locating case discussion segments in recorded medical team meetings. In Proceedings of the 3rd Workshop on Searching Spontaneous Conversational Speech (pp. 21{30). doi: 10.1145/1631127.1631131 Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1064{1074). doi: 10.18653/v1/P16-1101 Madson, M. B., & Campbell, T. C. (2006). Measures of delity in motivational enhancement: a systematic review. Journal of Substance Abuse Treatment, 31(1), 67{73. doi: 10.1016/ j.jsat.2006.03.010 Magill, M., Gaume, J., Apodaca, T. R., Walthers, J., Mastroleo, N. R., Borsari, B., & Longabaugh, R. (2014). The technical hypothesis of motivational interviewing: A meta-analysis of MI's key causal model. Journal of Consulting and Clinical Psychology, 82(6), 973{983. doi: 10.1037/a0036833 100 Mao, H. H., Li, S., McAuley, J., & Cottrell, G. W. (2020). Speech recognition and multi-speaker diarization of long conversations. In Proceedings of Interspeech 2020 (pp. 691{695). doi: 10.21437/Interspeech.2020-3039 Marcos-Garc a, J.-A., Mart nez-Mon es, A., & Dimitriadis, Y. (2015). DESPRO: A method based on roles to provide collaboration analysis support adapted to the participants in CSCL situations. Computers & Education, 82, 335{353. doi: 10.1016/j.compedu.2014.10.027 Medennikov, I., Korenevsky, M., Prisyach, T., Khokhlov, Y., Korenevskaya, M., Sorokin, I., . . . others (2020). The STC system for the CHiME-6 challenge. In Proceedings of the 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020) (pp. 36{41). doi: 10.21437/CHiME.2020-9 Meng, Z., Mou, L., & Jin, Z. (2017). Hierarchical RNN with static sentence-level attention for text-based speaker change detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (pp. 2203{2206). 
doi: 10.1145/3132847.3133110 Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (pp. 3111{3119). Miller, W. R., & Rollnick, S. (2012). Motivational interviewing: Helping people change. Guilford press. Miller, W. R., Sorensen, J. L., Selzer, J. A., & Brigham, G. S. (2006). Disseminating evidence-based practices in substance abuse treatment: A review with suggestions. Journal of Substance Abuse Treatment, 31(1), 25{39. doi: 10.1016/j.jsat.2006.03.005 Mohri, M., Pereira, F., & Riley, M. (2002). Weighted nite-state transducers in speech recognition. Computer Speech & Language, 16(1), 69{88. doi: 10.1006/csla.2001.0184 Moyers, T. B., Martin, T., Manuel, J. K., Hendrickson, S. M., & Miller, W. R. (2005). Assessing competence in the use of motivational interviewing. Journal of Substance Abuse Treatment, 28(1), 19{26. doi: 10.1016/j.jsat.2004.11.001 Mudrack, P. E., & Farrell, G. M. (1995). An examination of functional role behavior and its consequences for individuals in group settings. Small Group Research, 26(4), 542{571. doi: 10.1177/1046496495264005 Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In 101 Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (pp. 849{856). Ordelman, R., De Jong, F., & Larson, M. (2009). Enhanced multimedia content access and exploitation using semantic speech retrieval. In Proceedings of the 2009 IEEE International Conference on Semantic Computing (pp. 521{528). doi: 10.1109/ICSC.2009.80 Pal, M., Kumar, M., Peri, R., Park, T. J., Kim, S. H., Lord, C., . . . Narayanan, S. (2020). Speaker diarization using latent space clustering in generative adversarial network. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6504{6508). doi: 10.1109/ICASSP40776.2020.9053952 Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: an ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5206{5210). doi: 10.1109/ICASSP.2015 .7178964 Park, T. J., & Georgiou, P. (2018). Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks. In (pp. 1373{1377). doi: 10.21437/Interspeech.2018-1364 Park, T. J., Han, K. J., Huang, J., He, X., Zhou, B., Georgiou, P., & Narayanan, S. (2019). Speaker diarization with lexical information. In Proceedings of Interspeech 2019 (pp. 391{395). doi: 10.21437/Interspeech.2019-1947 Park, T. J., Han, K. J., Kumar, M., & Narayanan, S. (2019). Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters, 27, 381{385. doi: 10.1109/LSP.2019.2961071 Park, T. J., Kanda, N., Dimitriadis, D., Han, K. J., Watanabe, S., & Narayanan, S. (2022). A review of speaker diarization: Recent advances with deep learning. Computer Speech & Language, 72, 101317. doi: 10.1016/j.csl.2021.101317 Park, T. J., Kumar, M., Flemotomos, N., Pal, M., Peri, R., Lahiri, R., . . . Narayanan, S. (2019). The second DIHARD challenge: System description for USC-SAIL team. In Proceedings of Interspeech 2019 (pp. 998{1002). doi: 10.21437/Interspeech.2019-1903 Paul, D. B., & Baker, J. M. (1992). 
The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language (pp. 357{362). doi: 10.3115/ 102 1075527.1075614 Peddinti, V., Povey, D., & Khudanpur, S. (2015). A time delay neural network architecture for ecient modeling of long temporal contexts. In Proceedings of Interspeech 2015 (pp. 3214{ 3218). doi: 10.21437/Interspeech.2015-647 Perry, J. C., Banon, E., & Ianni, F. (1999). Eectiveness of psychotherapy for personality disorders. American Journal of Psychiatry, 156(9), 1312{1321. doi: 10.1176/ajp.156.9.1312 Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., . . . Vesely, K. (2011). The Kaldi speech recognition toolkit. In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding. (IEEE Catalog No.: CFP11SRW-USB) Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A., Janda, M., . . . Vu, N. T. (2012). Generating exact lattices in the WFST framework. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (p. 4213-4216). doi: 10.1109/ICASSP.2012.6288848 Prince, S. J., & Elder, J. H. (2007). Probabilistic linear discriminant analysis for inferences about identity. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision (pp. 1{8). doi: 10.1109/ICCV.2007.4409052 Proctor, E., Silmere, H., Raghavan, R., Hovmand, P., Aarons, G., Bunger, A., . . . Hensley, M. (2011). Outcomes for implementation research: conceptual distinctions, measurement chal- lenges, and research agenda. Administration and Policy in Mental Health and Mental Health Services Research, 38(2), 65{76. doi: 10.1007/s10488-010-0319-7 Prokopalo, Y., Shamsi, M., Barrault, L., Meignier, S., & Larcher, A. (2021). Active correction for speaker diarization with human in the loop. In Proceedings of IberSPEECH (pp. 260{264). doi: 10.21437/IberSPEECH.2021-55 Quiroz, J. C., Laranjo, L., Kocaballi, A. B., Berkovsky, S., Rezazadegan, D., & Coiera, E. (2019). Challenges of developing a digital scribe to reduce clinical documentation burden. npj Digital Medicine, 2, 114. doi: 10.1038/s41746-019-0190-1 Rasipuram, S., & Jayagopi, D. B. (2018). Automatic assessment of communication skill in interview- based interactions. Multimedia Tools and Applications, 77, 18709{18739. doi: 10.1007/ s11042-018-5654-9 Reimers, N., & Gurevych, I. (2017). Reporting score distributions makes a dierence: Performance 103 study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 338{348). doi: 10.18653/v1/D17 -1035 Roller, S., Dinan, E., Goyal, N., Ju, D., Williamson, M., Liu, Y., . . . Weston, J. (2021). Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 300{325). doi: 10.18653/v1/2021.eacl-main.24 Rousseau, A., Del eglise, P., & Esteve, Y. (2014). Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proceedings of the 8th International Conference on Language Resources and Evaluation (pp. 3935{3939). Rouvier, M., Delecraz, S., Favre, B., Bendris, M., & Bechet, F. (2015). Multimodal embedding fusion for robust speaker role recognition in video broadcast. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 383{389). 
doi: 10.1109/ASRU.2015.7404820 Ryant, N., Singh, P., Krishnamohan, V., Varma, R., Church, K., Cieri, C., . . . Liberman, M. (2021). The third DIHARD diarization challenge. In Proceedings of Interspeech 2021 (pp. 3570{3574). doi: 10.21437/Interspeech.2021-1208 Sacks, H., Scheglo, E. A., & Jeerson, G. (1978). A simplest systematics for the organization of turn taking for conversation. In J. Schenkein (Ed.), Studies in the Organization of Conver- sational Interaction (pp. 7{55). Elsevier. doi: 10.1016/B978-0-12-623550-0.50008-2 Sak, H., Sara clar, M., & G ung or, T. (2010). On-the- y lattice rescoring for real-time automatic speech recognition. In Proceedings of Interspeech 2010 (pp. 2450{2453). doi: 10.21437/ Interspeech.2010-532 Salamin, H., & Vinciarelli, A. (2012). Automatic role recognition in multiparty conversations: An approach based on turn organization, prosody, and conditional random elds. IEEE Transactions on Multimedia, 14(2), 338{345. doi: 10.1109/TMM.2011.2173927 Saon, G., Soltau, H., Nahamoo, D., & Picheny, M. (2013). Speaker adaptation of neural network acoustic models using i-vectors. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 55{59). doi: 10.1109/ASRU.2013.6707705 Sapru, A., & Bourlard, H. (2015). Automatic recognition of emergent social roles in small group 104 interactions. IEEE Transactions on Multimedia, 17(5), 746{760. doi: 10.1109/TMM.2015 .2408437 Sapru, A., & Valente, F. (2012). Automatic speaker role labeling in AMI meetings: recognition of formal and social roles. In Proceedings of the 2012 IEEE International Conference on Acous- tics, Speech and Signal Processing (pp. 5057{5060). doi: 10.1109/ICASSP.2012.6289057 Sapru, A., Yella, S. H., & Bourlard, H. (2014). Improving speaker diarization using social role information. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 101{105). doi: 10.1109/ICASSP.2014.6853566 Saxon, D., Barkham, M., Foster, A., & Parry, G. (2017). The contribution of therapist eects to patient dropout and deterioration in the psychological therapies. Clinical Psychology & Psychotherapy, 24(3), 575{588. doi: 10.1002/cpp.2028 Schaefer, J. D., Caspi, A., Belsky, D. W., Harrington, H., Houts, R., Horwood, L. J., . . . Mott, T. E. (2017). Enduring mental health: prevalence and prediction. Journal of Abnormal Psychology, 126(2), 212{224. doi: 10.1037/abn0000232 Schwalbe, C. S., Oh, H. Y., & Zweben, A. (2014). Sustaining motivational interviewing: A meta- analysis of training studies. Addiction, 109(8), 1287{1294. doi: 10.1111/add.12558 Sell, G., & Garcia-Romero, D. (2014). Speaker diarization with PLDA i-vector scoring and unsu- pervised calibration. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (pp. 413{417). doi: 10.1109/SLT.2014.7078610 Sell, G., Snyder, D., McCree, A., Garcia-Romero, D., Villalba, J., Maciejewski, M., . . . Khudanpur, S. (2018). Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. In Proceedings of Interspeech 2018 (pp. 2808{2812). doi: 10.21437/Interspeech.2018-1893 Shiner, B., D'Avolio, L. W., Nguyen, T. M., Zayed, M. H., Watts, B. V., & Fiore, L. (2012). Automated classication of psychotherapy note text: implications for quality assessment in PTSD care. Journal of Evaluation in Clinical Practice, 18(3), 698{701. doi: 10.1111/ j.1365-2753.2011.01634.x Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D., & Glass, J. (2011). 
Exploiting intra- conversation variability for speaker diarization. In Proceedings of interspeech 2011 (pp. 945{ 948). doi: 10.21437/Interspeech.2011-383 105 Silovsky, J., Zdansky, J., Nouza, J., Cerva, P., & Prazak, J. (2012). Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams. In Proceedings of the 2012 IEEE 14th International Workshop on Multimedia Signal Processing (pp. 118{123). doi: 10.1109/MMSP.2012.6343426 Singla, K., Chen, Z., Flemotomos, N., Gibson, J., Can, D., Atkins, D. C., & Narayanan, S. (2018). Using prosodic and lexical information for learning utterance-level behaviors in psychotherapy. In Proceedings of Interspeech 2018 (pp. 3413{3417). doi: 10.21437/Interspeech.2018-2551 Siniscalchi, S. M., Li, J., & Lee, C.-H. (2006). A study on lattice rescoring with knowledge scores for automatic speech recognition. In Proceedings of Interspeech 2006 (p. paper 1319-Mon3A2O.1). doi: 10.21437/Interspeech.2006-198 Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the 2018 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (pp. 5329{5333). doi: 10.1109/ICASSP.2018.8461375 Song, H., Zhang, W.-N., Cui, Y., Wang, D., & Liu, T. (2019). Exploiting persona information for diverse generation of conversational responses. In Proceedings of the 28th International Joint Conference on Articial Intelligence (pp. 5190{5196). doi: 10.24963/ijcai.2019/721 Stolcke, A. (2002). SRILM{an extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (pp. 901{904). Stolcke, A., & Yoshioka, T. (2019). DOVER: A method for combining diarization outputs. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (p. 757-763). doi: 10.1109/ASRU46091.2019.9004031 Strijbos, J.-W., & De Laat, M. F. (2010). Developing the role concept for computer-supported collaborative learning: An explorative synthesis. Computers in Human Behavior, 26(4), 495{505. doi: 10.1016/j.chb.2009.08.014 Substance Abuse and Mental Health Services Administration. (2019). Key substance use and mental health indicators in the United States: Results from the 2018 national survey on drug use and health. Center for Behavioral Health Statistics and Quality. Tanana, M. J., Soma, C. S., Srikumar, V., Atkins, D. C., & Imel, Z. E. (2019). Development and evaluation of ClientBot: Patient-like conversational agent to train basic counseling skills. 106 Journal of Medical Internet Research, 21(7), e12529. doi: 10.2196/12529 Thomas, S., Saon, G., Van Segbroeck, M., & Narayanan, S. S. (2015). Improvements to the IBM speech activity detection system for the DARPA RATS program. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4500{4504). doi: 1109/ICASSP.2015.7178822 Tranter, S. (2005). Two-way cluster voting to improve speaker diarisation performance. In Proceed- ings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 1, pp. I/753{I/756). doi: 10.1109/ICASSP.2005.1415223 Tripathi, A., Lu, H., Sak, H., Moreno, I. L., Wang, Q., & Xia, W. (2022). Turn-to-diarize: Online speaker diarization constrained by transformer transducer speaker turn detection. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 
Valente, F., Vijayasenan, D., & Motlicek, P. (2011). Speaker diarization of meetings based on speaker role n-gram models. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4416{4419). doi: 10.1109/ICASSP.2011 .5947333 Vaquero, C., Ortega, A., Miguel, A., & Lleida, E. (2013). Quality assessment for speaker diarization and its application in speaker characterization. Transactions on Audio, Speech, and Language Processing, 21(4), 816{827. doi: 10.1109/TASL.2012.2236317 Vinciarelli, A. (2006). Sociometry based multiparty audio recordings summarization. In Proceedings of the 18th International Conference on Pattern Recognition (pp. 1154{1157). doi: 10.1109/ ICPR.2006.1063 Vinciarelli, A. (2007). Speakers role recognition in multiparty audio recordings using social network analysis and duration distribution modeling. IEEE Transactions on Multimedia, 9(6), 1215{ 1226. doi: 10.1109/TMM.2007.902882 Vinciarelli, A., & Favre, S. (2007). Broadcast news story segmentation using social network analysis and hidden markov models. In Proceedings of the 15th ACM International Conference on Multimedia (pp. 261{264). doi: 10.1145/1291233.1291287 Wang, Q., Downey, C., Wan, L., Manseld, P. A., & Moreno, I. L. (2018). Speaker diarization with LSTM. In Proceedings of the 2018 ieee international conference on acoustics, speech and 107 signal processing (pp. 5239{5243). doi: 10.1109/ICASSP.2018.8462628 Wang, W., Yaman, S., Precoda, K., & Richey, C. (2011). Automatic identication of speaker role and agreement/disagreement in broadcast conversation. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5556{5559). doi: 10.1109/ICASSP.2011.5947618 Wang, Y., He, M., Niu, S., Sun, L., Gao, T., Fang, X., . . . Lee, C.-H. (2021). USTC-NELSLIP system description for DIHARD-III challenge. arXiv preprint arXiv:2103.10661 . Weinberger, A., Stegmann, K., & Fischer, F. (2010). Learning to argue online: Scripted groups surpass individuals (unscripted groups do not). Computers in Human behavior, 26(4), 506{ 515. doi: 10.1016/j.chb.2009.08.007 Weisz, J. R., Weiss, B., Han, S. S., Granger, D. A., & Morton, T. (1995). Eects of psychother- apy with children and adolescents revisited: a meta-analysis of treatment outcome studies. Psychological Bulletin, 117(3), 450{468. doi: 10.1037/0033-2909.117.3.450 Williams, W., Prasad, N., Mrva, D., Ash, T., & Robinson, T. (2015). Scaling recurrent neural net- work language models. In Proceedings of the 2015 IEEE International Conference on Acous- tics, Speech, and Signal Processing (pp. 5391{5395). doi: 10.1109/ICASSP.2015.7179001 Xiao, B., Bone, D., Segbroeck, M. V., Imel, Z. E., Atkins, D. C., Georgiou, P. G., & Narayanan, S. S. (2014). Modeling therapist empathy through prosody in drug addiction counseling. In Proceedings of Interspeech 2014 (pp. 213{217). doi: 10.21437/Interspeech.2014-55 Xiao, B., Can, D., Georgiou, P. G., Atkins, D., & Narayanan, S. S. (2012). Analyzing the language of therapist empathy in motivational interview based psychotherapy. In Proceedings of the 2012 Asia Pacic Signal and Information Processing Association Annual Summit and Conference. Xiao, B., Can, D., Gibson, J., Imel, Z. E., Atkins, D. C., Georgiou, P. G., & Narayanan, S. S. (2016). Behavioral coding of therapist language in addiction counseling using recurrent neural networks. In Proceedings of Interspeech 2016 (pp. 908{912). doi: 10.21437/Interspeech.2016 -1560 Xiao, B., Huang, C., Imel, Z. 
E., Atkins, D. C., Georgiou, P., & Narayanan, S. S. (2016). A technology prototype system for rating therapist empathy from audio recordings in addiction counseling. PeerJ Computer Science, 2, e59. doi: 10.7717/peerj-cs.59
Xiao, B., Imel, Z. E., Georgiou, P. G., Atkins, D. C., & Narayanan, S. S. (2015). "Rate my therapist": Automated detection of empathy in drug and alcohol counseling via speech and language processing. PLOS ONE, 10(12), e0143055. doi: 10.1371/journal.pone.0143055
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., ... Zweig, G. (2017). Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12), 2410–2423. doi: 10.1109/TASLP.2017.2756440
Xu, H., Chen, T., Gao, D., Wang, Y., Li, K., Goel, N., ... Khudanpur, S. (2018). A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 5929–5933). doi: 10.1109/ICASSP.2018.8461974
Yang, J., & Zhang, Y. (2018). NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of ACL 2018, System Demonstrations (pp. 74–79). doi: 10.18653/v1/P18-4013
Yin, R., Bredin, H., & Barras, C. (2018). Neural speech turn segmentation and affinity propagation for speaker diarization. In Proceedings of Interspeech 2018 (pp. 1393–1397). doi: 10.21437/Interspeech.2018-1750
Yoshioka, T., Dimitriadis, D., Stolcke, A., Hinthorn, W., Chen, Z., Zeng, M., & Huang, X. (2019). Meeting transcription using asynchronous distant microphones. In Proceedings of Interspeech 2019 (pp. 2968–2972). doi: 10.21437/Interspeech.2019-3088
Yu, C., & Hansen, J. H. (2017). Active learning based constrained clustering for speaker diarization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(11), 2188–2198. doi: 10.1109/TASLP.2017.2747097
Yu, H., Chen, C., Du, X., Li, Y., Rashwan, A., Hou, L., ... Li, J. (2020). TensorFlow Model Garden. https://github.com/tensorflow/models.
Zajíc, Z., Kunešová, M., & Radová, V. (2016). Investigation of segmentation in i-vector based speaker diarization of telephone speech. In Proceedings of the 18th International Conference on Speech and Computer (pp. 411–418). doi: 10.1007/978-3-319-43958-7_49
Zajíc, Z., Kunešová, M., Zelinka, J., & Hrúz, M. (2018). ZCU-NTIS speaker diarization system for the DIHARD 2018 challenge. In Proceedings of Interspeech 2018 (pp. 2788–2792). doi: 10.21437/Interspeech.2018-1252
Zajíc, Z., Soutner, D., Hrúz, M., Müller, L., & Radová, V. (2018). Recurrent neural network based speaker change detection from text transcription applied in telephone speaker diarization system. In Proceedings of the 21st International Conference on Text, Speech and Dialogue (pp. 342–350). doi: 10.1007/978-3-030-00794-2_37
Zancanaro, M., Lepri, B., & Pianesi, F. (2006). Automatic detection of group functional roles in face to face interactions. In Proceedings of the 8th International Conference on Multimodal Interfaces (pp. 28–34). doi: 10.1145/1180995.1181003
Zhang, A., Wang, Q., Zhu, Z., Paisley, J., & Wang, C. (2019). Fully supervised speaker diarization. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6301–6305). doi: 10.1109/ICASSP.2019.8683892
Zuluaga-Gomez, J., Sarfjoo, S. S., Prasad, A., Nigmatulina, I., Motlicek, P., Ohneiser, O., & Helmke, H. (2021).
BERTraffic: A robust BERT-based approach for speaker change detection and role identification of air-traffic communications. arXiv preprint arXiv:2110.05781.

Appendices

Appendix A
UCC dataset: Inter-Rater Reliability

In Chapter 5 we presented and used the MISC-annotated UCC data (which we also used in Chapter 4, but without the MISC labels). Here, we present an inter-rater reliability (IRR) analysis of the utterance-level codes assigned by human raters, based on a small subset of the available sessions.

Each of the 188 sessions that were selected for professional transcription and coding was coded by at least one of three trained raters. Among those, 14 sessions (from the first trial described in Section 5.4.2) were coded by two or three coders. We estimated Krippendorff's alpha (Krippendorff, 2018) for each code, a statistic which is generalizable to different types of variables and flexible with missing observations (Hallgren, 2012). Since sessions were parsed into utterances by the human raters, the unit of coding is not fixed, so we estimated Krippendorff's alpha at the session level by using the per-session occurrences (treated as ratio variables) of each label. The results for all the codes are given in Table A.1.

Table A.1: Krippendorff's alpha (α) to estimate inter-rater reliability for the utterance-level codes in the UCC data.

code   IRR (α)    code   IRR (α)    code   IRR (α)    code   IRR (α)
ADP    0.542      EC     0.558      QUC    0.897      RF     0.093
ADW    0.422      FA     0.868      RCP    --         SU     0.345
AF     0.123      FI     0.784      RCW    0.000      ST     0.434
CO     0.497      GI     0.861      RES    0.268      WA    -0.054
DI     0.590      QUO    0.945      REC    0.478

MISC abbreviations are defined in Table 5.1. A dash denotes that the particular code was not used (count = 0) by at least 2 coders for at least half of the analyzed sessions; RCP was never used by any coder.

As described in Section 5.4.2, those MISC labels are grouped into 9 target classes (Table 5.2). Table A.2 gives the results of the IRR analysis for this labeling scheme.

Table A.2: Krippendorff's alpha (α) to estimate inter-rater reliability for the utterance-level target labels in the UCC data.

group   IRR (α)    group   IRR (α)    group   IRR (α)
FA      0.868      QUO     0.946      MIN     0.606
GI      0.898      REC     0.479      MIA     0.363
QUC     0.897      RES     0.268      ST      0.434

The mapping between MISC-defined behavior codes and grouped target labels is given in Table 5.2.
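For reference, the snippet below is a minimal sketch (not the exact analysis script used here) of how such session-level alpha values could be computed, assuming the open-source krippendorff Python package; the per-session counts shown are invented purely for illustration.

```python
# A minimal sketch of session-level Krippendorff's alpha for one code,
# treating per-session occurrence counts as ratio variables.
# Assumes the `krippendorff` PyPI package; the counts below are made up.
import numpy as np
import krippendorff

# Rows: raters; columns: sessions; entries: how many times a rater assigned
# a given code (e.g., QUC) in that session. np.nan marks sessions a rater
# did not code.
counts = np.array([
    [12.0,   3.0, 7.0, np.nan],
    [11.0,   4.0, 8.0, 5.0],
    [np.nan, 3.0, 6.0, 5.0],
])

alpha = krippendorff.alpha(reliability_data=counts,
                           level_of_measurement="ratio")
print(f"Krippendorff's alpha (ratio level): {alpha:.3f}")
```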
Appendix B
Psychotherapy Transcription and Coding Pipeline

The following sections provide details related to the several modules of the transcription and coding pipeline introduced in Chapter 5, including training data, hyperparameter values, and evaluation results.

B.1 Datasets

The design of the system is based on datasets drawn from a variety of sources. We have combined large speech and language corpora both from the psychotherapy domain and from other fields (meetings, telephone conversations, etc.). That way, we wanted to ensure high in-domain accuracy when analyzing psychotherapy data, but also robustness across various recording conditions.

Out-of-domain corpora

The acoustic modeling was mainly based on a large collection of speech corpora, widely used by the research community for a variety of speech processing tasks. Specifically, we used the Fisher English (Cieri et al., 2004), ICSI Meeting Speech (Janin et al., 2003), WSJ (Paul & Baker, 1992), and 1997 HUB4 (Graff, Wu, MacIntyre, & Liberman, 1997) corpora, available through the Linguistic Data Consortium (LDC), as well as Librispeech (Panayotov, Chen, Povey, & Khudanpur, 2015), TED-LIUM (Rousseau, Deléglise, & Estève, 2014), and AMI (Carletta et al., 2005). This combined speech dataset consists of more than 2,000 hours of audio and contains recordings from a variety of scenarios, including business meetings, broadcast news, telephone conversations, and audiobooks/articles.

The aforementioned datasets are accompanied by manually-derived transcriptions which can be used for language modeling tasks. In our case, since we need to capture linguistic patterns specific to the psychotherapy domain, the main reason we need some out-of-domain text corpus is to build a background model that guarantees a large enough vocabulary and minimizes the unseen words during evaluation. To that end, we use the transcriptions of the Fisher English corpus, featuring a vocabulary of 58.6K words and totaling more than 21M tokens.

Psychotherapy-related corpora

In order to train and adapt our machine learning models on in-domain data, in addition to the UCC data collection described in Section 5.4.2, we also used available psychotherapy-focused corpora. In particular, we used a collection of MI sessions (for which audio, transcription, and manual coding information were available) from six independent clinical trials (ARC, ESPSB, ESP21, iCHAMP, HMCBI, CTT; Atkins et al., 2014; Baer et al., 2009), as introduced in Chapter 1 (with the MI-train subset defined in Table 1.1). The transcripts of those MI sessions were enhanced by data provided by the Counseling and Psychotherapy Transcripts Series (CPTS; https://alexanderstreet.com/products/counseling-and-psychotherapy-transcripts-series). This included transcripts from a variety of therapy interventions totaling about 300K utterances and 6.5M words. For this corpus, no audio or behavioral coding are available, and the data were hence used only for language-based modeling tasks.

B.2 System Details

Audio feature extraction

For all the modules of the speech pipeline (VAD, diarization, ASR), the acoustic representation is based on the widely used mel-frequency cepstrum coefficients (MFCCs), extracted every 10 msec using 25 msec-long windows with the Kaldi toolkit (https://github.com/kaldi-asr/kaldi). For the UCC data, the channels from the two recording microphones are combined through acoustic beamforming (Anguera, Wooters, & Hernando, 2007), using the open-source BeamformIt tool (https://github.com/xanguera/BeamformIt).

Voice activity detection

The first step of the transcription pipeline is to extract the voiced segments of the input audio session. The rest of the session is considered to be silence, music, background noise, etc., and is not taken into account for the subsequent steps. To that end, we use a feed-forward neural network with two layers of 512 neurons each and sigmoid activation functions, before a final inference layer giving a frame-level probability. The input is a 13-dimensional MFCC vector characterizing a frame, spliced with a context of 30 neighboring frames (15+15). This is a pre-trained model, initially developed as part of the Robust Automatic Transcription of Speech (RATS) program (Thomas, Saon, Van Segbroeck, & Narayanan, 2015). The model was trained to reliably detect speech activity in highly noisy acoustic scenarios, with most of the noise types included during training being military noises like machine gun, helicopter, etc. Hence, in order to make the model better suited to our task, the original model was adapted using the UCC dev data. Optimization of the various parameters was done with respect to the unweighted average recall (UAR).
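To make the shape of this model concrete, the following is a minimal PyTorch sketch of a frame-level classifier with the layer sizes and spliced MFCC input described above; it is an illustration only, not the actual RATS-trained network or the toolkit used in this work, and the training loop and adaptation on the UCC dev data are omitted.

```python
# A minimal sketch of a frame-level VAD classifier: two 512-unit sigmoid
# layers over spliced MFCC frames, followed by an inference layer that
# produces a per-frame speech probability.
import torch
import torch.nn as nn

CONTEXT = 15                                # frames of left/right context
MFCC_DIM = 13
INPUT_DIM = (2 * CONTEXT + 1) * MFCC_DIM    # 31 spliced frames -> 403 dims

class FrameVAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(INPUT_DIM, 512), nn.Sigmoid(),
            nn.Linear(512, 512), nn.Sigmoid(),
            nn.Linear(512, 1),              # final inference layer
        )

    def forward(self, x):
        # x: (batch, INPUT_DIM) spliced MFCC frames
        return torch.sigmoid(self.net(x)).squeeze(-1)   # speech probability per frame

model = FrameVAD()
dummy = torch.randn(8, INPUT_DIM)           # 8 random spliced frames
print(model(dummy).shape)                   # torch.Size([8])
```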
The frame-level outputs are smoothed via a median lter of 31 taps and converted to longer speech segments which are passed to the diarization sub-system. During this process, if silence between any two contiguous voiced segments is less than 0:5 sec, the corresponding segments are merged together. Automatic speech recognition The linguistic content captured within speech segments is the information supplied to the subse- quent text-based algorithms used for speaker role recognition, lignuistically-aided diarization, and behavioral coding. Automatic speech recognition (ASR) depends on two components; the acoustic model (AM), which calculates the likelihood of acoustic observations given a sequence of words, 2 https://github.com/kaldi-asr/kaldi 3 https://github.com/xanguera/BeamformIt 116 and the language model (LM), which calculates the likelihood of a word sequence by describing the distribution of typical language usage. We note that, for the system depicted in Figure 5.2, the same ASR module is used for both the rst and the second passes. In order to train the AM, we build a time-delay neural network (TDNN) with subsampling (Ped- dinti, Povey, & Khudanpur, 2015). First, word alignments are derived based on the GMM/HMM paradigm. The input feature vectors to the TDNN architecture are 40-dimensional MFCCs which are augmented by 100-dimensional i-vectors, extracted online through a sliding window. The net- work is trained on a large combined speech dataset composed of the Fisher English, ICSI Meeting Speech, WSJ, 1997 HUB4, Librispeech, TED-LIUM, AMI, and MI corpora. We use the ocially recommended training subsets for Librispeech and TED-LIUM and the recommended training and development sets for AMI. We randomly choose 95% of the available Fisher utterances and 80% of the available ICSI, WSJ, and HUB4 utterances. We also use the 242 MI-train sessions (Table 1.1). We have kept the rest of the combined dataset for internal validation and evaluation of the ASR system. Among the aforementioned corpora, TED-LIUM and the clean portion of Librispeech are augmented with speed perturbation, noise, and reverberation (Ko, Peddinti, Povey, & Khudanpur, 2015). The nal combined, augmented corpus contains more than 4,000 hours of phonetically rich speech data, recorded under dierent conditions and re ecting a variety of acoustic environments. The ASR AM is built and trained using the Kaldi speech recognition toolkit (Povey et al., 2011). In order to build the LM, we independently train two 3-gram models using the SRILM toolkit (Stolcke, 2002). One is trained with in-domain psychotherapy data from the CPTS transcribed sessions. This is interpolated with a large background model, in order to minimize the unseen words during inference. The background LM is trained with the Fisher English corpus, which features conversational telephone data. The two 3-gram LMs are interpolated with mixing weights equal to 0.8 for the in-domain model and 0.2 for the background model. The evaluation of an ASR system is usually performed through the word error rate (WER) metric which is the normalized Levenshtein distance between the ASR output and the manually- derived transcript and includes errors because of word substitutions, word deletions, and word insertions. Those errors are typically estimated for each utterance given to the ASR module and then summed up for all the evaluation data, in order to get an overall WER. 
However, when we analyze an entire therapy session which has been processed by the VAD and diarization sub-systems, the "utterances" are different from the ones identified by the human transcriber. In that case, we evaluate at the session level, concatenating all the session utterances. The results are reported in Table B.1 using either the oracle segmentation (from the manual transcriptions) or the one generated by the automated systems. For the latter case, we explore VAD-only segmentation (after which we run the first pass of ASR needed for the linguistically-aided diarization, as shown in Figure 5.2), as well as the two diarization-based segmentation approaches we explored in Chapter 5: the audio-only, clustering-based one and the linguistically-aided, classification-based one.

Table B.1: ASR results (%) for the UCC data.

segmentation             substitutions   deletions   insertions   WER
oracle                   15.1            14.1        2.6          31.7
VAD                      16.4            12.5        3.2          32.0
clustering-based         16.8            12.9        3.2          32.9
classification-based     17.2            12.8        3.4          33.4

WER is estimated as the sum of the substitution, insertion, and deletion rates. Results are reported when using either the segments derived by the manual transcriptions (oracle) or the machine-generated ones, based only on VAD, or based on the two different diarization methods we have explored (Section 5.5).

As we can see, ASR performance is not severely degraded by error propagation due to the pre-processing steps of VAD/diarization (up to about 5% relative WER increase). However, we do note that the degradation observed between the VAD-based and the diarization-based segmentations suggests that ASR could completely precede diarization, and an alternative overall architecture than the ones presented in Chapter 5 might provide improved overall performance. This is a direction we did not explore within this study.

Interestingly, comparing the oracle and the machine-generated segmentations, we can see that even though the insertion rate is increased, the deletion rate is decreased when machine-generated segments are provided. This is explained by the long segments constructed after concatenating consecutive segments given by the VAD and diarization algorithms. On the one hand, labeling silence or noise as "speech" associated with some speaker occasionally leads ASR to predict words where in reality there is no speech activity, thus increasing the insertion rate. On the other hand, this minimizes the probability of missing some words because of missed speech. Such deleted words may occur when providing the oracle segments because of inaccuracies during the construction of the "ground truth" through forced alignment.

We note that, even though the estimated error is high, WERs in the range reported and even higher are typical in spontaneous medical conversations (Kodish-Wachs et al., 2018). Error analysis revealed that those numbers are inflated because of fillers (e.g., uh-huh, hmm) and other idiosyncrasies of conversational speech. It should be additionally highlighted that WER is a generic metric that gives equal importance to all the words, while for our end goal of behavior coding there are specific linguistic constructs which potentially carry more valuable information than others.

Utterance segmentation

The ASR output is at the segment level, with segments defined by the VAD and diarization algorithms. However, silence and speaker changes are not always the right cues to help us distinguish between utterances, which are the basic units of behavioral coding.
The presence of multiple utterances per speaker turn is a challenge we often face when dealing with conversational interactions. Especially in the psychotherapy domain, it has been shown that utterance-level segmentation can significantly improve the performance of automatic behavior code prediction (Z. Chen et al., 2021). Thus, we have included an utterance segmentation module at the end of the automatic transcription, before employing the subsequent NLP algorithms. In particular, we merge together all the adjacent segments belonging to the same speaker in order to form speaker-homogeneous talk-turns, and we then segment each turn using the DeepSegment tool (https://github.com/notAI-tech/deepsegment). DeepSegment has been designed to perform text-based sentence boundary detection with specifically ASR outputs in mind, where punctuation is not readily available. In this framework, sentence segmentation is viewed as a sequence labeling problem, where each word is tagged as being either at the beginning of a sentence (utterance), or anywhere else. DeepSegment addresses the problem employing a bidirectional long short-term memory (BiLSTM) network with a conditional random field (CRF) inference layer (Ma & Hovy, 2016), similarly to the tagger architecture we used in Chapter 3 (Figure 3.3).
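As an illustration of this step, the snippet below shows how a speaker turn could be split into utterances with DeepSegment; the call follows the tool's public documentation (an assumption, since exact usage may vary across versions) and the input turn is an invented example.

```python
# Example use of the DeepSegment tool for utterance boundary detection on
# unpunctuated ASR output (API as described in the tool's documentation;
# the input turn is an invented example).
from deepsegment import DeepSegment

segmenter = DeepSegment("en")   # pre-trained English segmentation model
turn = "okay so how have you been feeling this week i know last time was hard"
print(segmenter.segment(turn))
# e.g., ['okay so how have you been feeling this week', 'i know last time was hard']
```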
Utterance-level code prediction

Once the entire session is transcribed at the utterance level, we employ text-based algorithms for the task of behavior code prediction. We focus on counselor behaviors, so we only take into account the utterances assigned to the therapist according to the speaker role recognition module. Each one of those needs to be assigned a single code from the 9 target labels summarized in Table 5.2. This is achieved through a BiLSTM network with an attention mechanism (Singla et al., 2018) which only processes textual features. The input to the system is a sequence of word-level embeddings for each utterance. The recurrent layer exploits the sequential nature of language and produces hidden vectors which take into account the entire context of each word within the utterance. The attention layer can then learn to focus on salient words carrying valuable information for the task of code prediction, thus enhancing robustness and interpretability.

The network is first trained on the MI data using the Adam optimizer with a learning rate equal to 0.001 and exponential decay equal to 0.9. The batch size is set equal to 256 utterances and we use class weights inversely proportional to the class frequencies. The system is trained on that dataset for 30 epochs with an early stopping strategy, keeping the model with the lowest validation loss. The system is further fine-tuned to the University Counseling Center conditions by continuing training on the UCC train data.

When we use the manually transcribed data to perform utterance-level MISC code prediction, the overall averaged F1 score is 0.517 for the UCC evaluation sets. The F1 scores for each individual code are reported in Table B.2. As expected, the results are better for the highly frequent codes (Table 5.2), such as the one expressing facilitation (FA), since the machine learning models have more training examples to learn from. On the other hand, the models do not perform as well for less frequent codes, such as MI-NonAdherent behaviors (MIN) and simple reflections (RES).

Comparing Table B.2 and Table A.1, we can also see that for several of the codes for which our system performs relatively poorly (e.g., simple reflections (RES), MI-Adherent (MIA), structure (ST)), the inter-annotator agreement is also considerably low. A notable example which does not follow this pattern is the non-adherent behavior (MIN), where the performance of our system is relatively poor (F1 = 0.261), while there is substantial inter-annotator agreement (α = 0.606). This is partly because of the underrepresentation of the particular code (or cluster of codes) in the training and development sets. It may also be the case that pure linguistic information found in textual patterns may not be enough for the operationalization of the particular code. This example suggests that a hybrid approach where machine learning methods are combined with knowledge-based rules from the coding manuals may be an interesting direction for future research. Finally, by examining the confusion matrices (not reported here), we realized that the system often gets confused between the codes representing questions (QUC vs. QUO) and reflections (RES vs. REC), since those pairs of codes usually get assigned to utterances with several structural and semantic similarities.

Table B.2: F1 scores for the predicted utterance-level codes using the manually transcribed UCC data.

FA     GI     QUC    QUO    REC    RES    MIN    MIA    ST
0.951  0.473  0.604  0.792  0.476  0.198  0.261  0.423  0.472

It is interesting to compare the remarkably good performance of the system with respect to FA with the relatively low correlation reported in Table 5.4, where the MISC predictor is given the automatically generated utterances. The reason behind this is that FA is assigned to a lot of one-word utterances and talk turns. Our speech pipeline, however, often fails to capture turns of such short duration (or concatenates them with neighboring utterances to construct longer segments), which results in a smaller than expected frequency for the specific code.
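To make the utterance-level code predictor described in this section concrete, the following is a simplified PyTorch sketch of a BiLSTM classifier with a simple attention layer over word embeddings; it mirrors the general design described above, but the vocabulary size, embedding dimension, hidden size, and training details are placeholder assumptions rather than the exact configuration used in this work.

```python
# A minimal sketch of a BiLSTM utterance classifier with attention, in the
# spirit of the behavior-code predictor described above (sizes are placeholders).
import torch
import torch.nn as nn

class AttnBiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=128, n_codes=9):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)     # scores each word's hidden state
        self.out = nn.Linear(2 * hidden, n_codes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer word indices
        h, _ = self.lstm(self.emb(tokens))        # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over words
        context = (weights * h).sum(dim=1)        # weighted sum of hidden states
        return self.out(context)                  # unnormalized scores for the 9 codes

model = AttnBiLSTMClassifier()
dummy = torch.randint(1, 10000, (4, 20))          # 4 utterances of 20 tokens each
print(model(dummy).shape)                          # torch.Size([4, 9])
```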
Abstract
Individuals assume distinct roles in different situations throughout their lives and people who consistently adopt particular roles develop specific commonalities in behavior. As a result, roles can be defined in terms of observable tendencies and behavioral patterns that can be manifest through a wide range of modalities during a conversational interaction. For instance, an interviewer is expected to use more interrogative words than the interviewee and a teacher is likely to speak in a more didactic style than the student.
Speaker role recognition is the task of assigning a role label to a speech segment where a single speaker is active, through computational models that capture such behavioral characteristics. The approaches that tackle this problem depend on successful pre-processing steps applied to the recorded conversation, such as speaker segmentation and clustering or automatic speech recognition, something that inevitably leads to error propagation. At the same time, accurate role information can provide valuable cues for the aforementioned speech processing tasks.
In this dissertation I propose techniques that combine role recognition with other speech processing modules to alleviate the problem of error propagation. Additionally, focusing on the task of speaker diarization (that answers the question who spoke when), I demonstrate that role-aware systems can achieve improved performance when compared to traditional, state-of-the-art approaches. Finally, I showcase how some of the proposed techniques can be applied in a real-world system, by presenting and analyzing an automated tool for psychotherapy quality assessment, where robust diarization and role identification (i.e., therapist vs. patient) are of critical importance.