ROBUST SPEAKER CLUSTERING UNDER VARIATION IN DATA CHARACTERISTICS by Kyu Jeong Han A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) December 2009 Copyright 2009 Kyu Jeong Han Dedication This dissertation is dedicated with loving appreciation to my family for their endless love and support. Without them this work would never have been achieved. ii Acknowledgements FirstofallIwouldliketothankmyadvisor,Dr. ShrikanthS.Narayanan,forhisinspiring and encouraging way to guide me to a deeper understanding of research work, and his invaluablecommentsduringmygraduateyears. Hetaughtmehowtoformulateproblems and solve them with reasonable thinking, and how to express ideas and results efficiently. On top of all these, I really appreciate his patience and belief in me throughout my entire years as his students in Signal Analysis and Interpretation Laboratory (SAIL). I also would like to thank the rest of my thesis committee: Dr. C.-C. Jay Kuo, Dr. Hong-Goo Kang, and Dr. Cyrus Shahabi, who reviewed my work and gave insightful comments. My special thanks go to Dr. Kang, who gave us a favor of accepting our invitation to this committee while he was working for Broadcom in Irvine, CA for his sabbatical year. To my family, I appreciate and love you with all my heart. Without your love and unwavering belief in me it would have been impossible for me to complete Ph.D. work at USC. Specially I would like to thank my wife and daughter. I could not have overcome even a bit of huddle without them in the entire life in the United Sates. Seo Jung, my lovely wife offered by gracious God, you are the only one that deserves to share all my achievements during my USC years and even for the rest of my life. I LOVE YOU. My preciousdaughter, Jimin, Iwantyoutoknowthatyouhavemotivatedmeaswellasyour mom to keep hard working on every side of our life since you were given to us. Please you never forget you are more than God’s gift to us. Finally, I would like to thank you, my Lord, for your guidance of my life and family through perseverance to know and enjoy you in the name of Jesus Christ. iii Table of Contents Dedication ii Acknowledgements iii List of Tables vii List of Figures x List of Algorithms xiv Abstract xv 1. Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Pattern Classification and Clustering . . . . . . . . . . . . . . . . . 2 1.1.2 Speaker Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Focus Identification . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Previous Work on Speaker Clustering . . . . . . . . . . . . . . . . . . . . 5 1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Proposed Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4.1 Perspective 1: Stopping Point Estimation . . . . . . . . . . . . . . 9 1.4.2 Perspective 2: Inter-Cluster Distance Measurement . . . . . . . . . 10 1.4.2.1 Earlier Recursion Steps of AHSC . . . . . . . . . . . . . . 11 1.4.2.2 Later Recursion Steps of AHSC . . . . . . . . . . . . . . 11 1.4.2.3 Cluster Modeling. . . . . . . . . . . . . . . . . . . . . . . 12 1.4.3 Application: Speaker Diarization . . . . . . . . . . . . . . . . . . . 12 1.5 Contribution Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.6 Dissertation Outline . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . 14 2. Robust Stopping Point Estimation for AHSC 16 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Data and Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 BIC-based Stopping Point Estimation for AHSC . . . . . . . . . . . . . . 18 2.3.1 Generalized Likelihood Ratio (GLR) . . . . . . . . . . . . . . . . . 19 2.3.2 Bayesian Information Criterion (BIC) . . . . . . . . . . . . . . . . 24 2.3.3 BIC-based Stopping Point Estimation Method for AHSC . . . . . 25 iv 2.3.4 Tuning Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.5 Stopping Criterion under the Variation of Input Speech Data . . . 29 2.4 ICR-based Stopping Point Estimation for AHSC . . . . . . . . . . . . . . 31 2.4.1 Information Change Rate (ICR) . . . . . . . . . . . . . . . . . . . 31 2.4.2 Comparison of ICR with ICR-like Measures . . . . . . . . . . . . . 33 2.4.3 ICR as a Homogeneity Decision Measure for Clusters. . . . . . . . 33 2.4.4 ICR-based Stopping Point Estimation Method for AHSC . . . . . 34 2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3. Robust Inter-Cluster Distance Measurement for AHSC 40 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2 GLR at Early AHSC Recursion Steps . . . . . . . . . . . . . . . . . . . . 41 3.3 Modification of AHSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1 Constrained Cluster Selection for Merging . . . . . . . . . . . . . . 45 3.3.2 Pre-Classification of Short Speech Segments . . . . . . . . . . . . . 47 3.3.3 Sequential Clustering prior to AHSC . . . . . . . . . . . . . . . . . 48 3.4 Combination of GLR and ICR . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.1 (GLR+ICR)-based Inter-Cluster Distance Measurement . . . . . . 51 3.4.2 Proposed Measure in Modified AHSC Approaches . . . . . . . . . 54 3.5 Selective AHSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.5.1 Modified AHSCs with Stopping Point Estimation . . . . . . . . . . 56 3.5.2 Selective AHSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4. Robust Cluster Modeling for Inter-Cluster Distance Measurement in AHSC 64 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Inter-Cluster Distance Measurement for AHSC . . . . . . . . . . . . . . . 66 4.2.1 GLR-based statistical inter-cluster distance measurement . . . . . 66 4.2.2 Conventional cluster modeling approaches . . . . . . . . . . . . . . 67 4.2.2.1 Single Gaussian cluster modeling . . . . . . . . . . . . . . 67 4.2.2.2 GMM cluster modeling . . . . . . . . . . . . . . . . . . . 69 4.2.2.3 Experimental comparison . . . . . . . . . . . . . . . . . . 72 4.3 Incremental Gaussian Mixture Cluster Modeling . . . . . . . . . . . . . . 77 4.3.1 Proposed cluster modeling approach . . . . . . . . . . . . . . . . . 77 4.3.2 Comparison and analysis . . . . . . . . . . . . . . . . . . . . . . . 79 4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5. Reliable Speaker Diarization based on Robust AHSC 84 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.2 Speaker Diarization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.3 SAIL Speaker Diarization System . . . . . . . . . . . . . . . . . . . . 
. . . 87 5.3.1 Data Description and Experimental Setup . . . . . . . . . . . . . . 87 5.3.2 Speech/Non-Speech Detection . . . . . . . . . . . . . . . . . . . . . 88 5.3.3 Speaker Change Detection . . . . . . . . . . . . . . . . . . . . . . . 91 5.3.4 Speaker Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.3.4.1 IGMM Cluster Modeling . . . . . . . . . . . . . . . . . . 92 v 5.3.4.2 ICR-based Stopping Point Estimation . . . . . . . . . . . 94 5.3.4.3 Comparison. . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.4 Refined Speaker Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.4.1 Selection of Representative Speech Segments . . . . . . . . . . . . 98 5.4.2 Participant Interaction Pattern Modeling . . . . . . . . . . . . . . 101 5.4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6. Conclusions 106 6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.2 Possible Future Research Topics . . . . . . . . . . . . . . . . . . . . . . . 107 6.3 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Bibliography 110 vi List of Tables 2.1 Developmentsetofdatasources. N s : #ofspeakeridentities(male:female) in each data source, T s : total utterance time (sec.), N t : # of speech seg- ments, and T a : average segment length (sec.). C, N, and I: data sources chosen from ICSI, NIST, and ISL meeting speech corpora respectively. . . 17 2.2 Evaluation set of data sources. The notation is same as that in Table 2.1. 18 2.3 Comparison of ICR with other measures utilizing the idea of normalizing GLR. C x and C y : two clusters consisting of M and N feature vectors respectively, : parameter empirically determined, and n: dimension of feature vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.4 ICR-basedstoppingpointestimationmethodvs. BIC-basedstoppingpoint estimation method. c = 1 2 { n+ 1 2 n(n+1) } , where n is the dimension of feature vectors. n=12, =0:18603, and =12:0. . . . . . . . . . . . . . 36 2.5 Global comparison (averaged speaker error time rate for the evaluation data set) of AHSC with the BIC-based stopping point estimation method and AHSC with the ICR-based stopping point estimation method. . . . . 39 3.1 Distribution of three different merging types (M ss , M sl , and M ll ) at the first quarter of the entire merging recursions during AHSC for every data source in the development data set in Section 2.2. M ss : merging between the speech segments shorter than 3 seconds, M sl : merging between one speech segment shorter than 3 seconds and the other longer than or equal to 3 seconds, and M ll : merging between the speech segments longer than or equal to 3seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 Accuracy of the GLR-based inter-cluster distance measure for AHSC de- pending on the merging types defined in Table 3.1. These accuracies were obtained only based on the first quarter of the entire merging recursions during AHSC for every data source in the development data set in Section 2.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 vii 3.3 ComparisonofbasicAHSCanditsfirstmodifiedversionintermsofaverage speaker error time rate for the development and evaluation data set in Section 2.2. 
Both of the clustering strategies use the GLR-based inter- cluster distance measure to select clusters for merging at every recursion stepofAHSC,andperfectstoppingpointestimationisassumed. (Foreach result in the table, the corresponding standard deviation is presented as well.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4 Comparison of basic AHSC and its second modified version in terms of average speaker error time rate for the development and evaluation data setinSection2.2. Thesamedistancemeasureandassumptionforstopping point estimation in AHSC as ones in Table 3.3 are applied. . . . . . . . . 47 3.5 Comparison of basic AHSC and its third modified version in terms of av- erage speaker error time rate for the development and evaluation data set in Section 2.2. The same distance measure and assumption for stopping point estimation in AHSC as ones in Table 3.3 are applied. . . . . . . . . 50 3.6 Comparison of AHSC with the GLR-based inter-cluster distance measure andthatwithourproposedmethod,intermsofaveragespeakererrortime rate for the development and evaluation data set in Section 2.2. Perfect stopping point estimation for AHSC is assumed. . . . . . . . . . . . . . . 54 3.7 Average speaker error time rate for the evaluation data set in Section 2.2. ThistablecomparesAHSCanditsthreemodifiedversionswithbothGLR- based and (GLR+ICR)-based inter-cluster distance measurement. Perfect estimation of the optimal stopping point for AHSC is assumed. . . . . . . 56 3.8 Average speaker error time rate for the evaluation data set in Section 2.2 when the ICR-based stopping point estimation method is applied. This table compares AHSC and its three modified versions with GLR-based and (GLR+ICR)-based inter-cluster distance measurement, as Table 3.7 does. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.9 Average speaker error time rate for the evaluation data set in Section 2.2. This table compares selective AHSC with all the counterparts that have been dealt with in this dissertation. . . . . . . . . . . . . . . . . . . . . . . 63 4.1 Datasource. N s : numberofspeakersources(male:female),T s : totalutter- ance time (sec.), N t : number of speech segments, and T a : average segment length (sec.). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.2 Performancecomparisonofthetwoconventionalclustermodelingapproaches in terms of speaker error time rate (%).N: single Gaussian cluster mod- eling and x : GMM cluster modeling with x mixture components. . . . . 74 5.1 Training data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.2 Testing data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 viii 5.3 Performancecomparisonoftheproposedspeech/non-speechdetectionpro- cess with and without updating the silence cluster, in terms of the two detection error rates for the training data set. . . . . . . . . . . . . . . . . 90 5.4 Comparison of 1) IGMM cluster modeling + ICR-based stopping point estimation, and2) singleGaussianclustermodeling+BIC-basedstopping point estimation, in terms of speaker-error-time rate for the testing data set. = 25:0 (for BIC-based stopping point estimation) and = 0:225 (for ICR-based stopping point estimation), which are tuned based on the training data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
95 5.5 Improved speaker diarization performance with the two approaches pro- posed in this section, i.e., representative speech segment selection and par- ticipant interaction pattern modeling. For the former approach we empir- ically set N = 32. Performance comparison is given in terms of average DER (%) across 10 data sources in the testing data set in Section 5.3.1. . 103 ix List of Figures 1.1 Categorization of automatic pattern classification systems. . . . . . . . . . 3 1.2 Application domains of speaker clustering. . . . . . . . . . . . . . . . . . . 5 1.3 Unreliable speaker clustering performance by AHSC across various input speech data. The entire data are 10 sets of segmented meeting conversa- tions speech and each of them contains a number of speech segments in an utterance or sentence level. In each speech segment speaker-specific characteristics are homogeneous. . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Overview illustration of dissertation contributions to robust AHSC and speaker diarization in terms of the variation of input speech data. . . . . . 9 2.1 GLR for two clusters C 1 and C 2 along with the number of feature vectors in each cluster. The second order statistics of the corresponding cluster models are fixed at 1 =0, 2 =1, and Σ 1 =Σ 2 =1. . . . . . . . . . . . . 21 2.2 Comparison of speaker clustering performance (for the evaluation data set described in Section 2.2) with and without accurate stopping point esti- mation. For the BIC-based stopping point estimation method, we tuned to be 12.0. Average speaker error time rate degradation by incorrect esti- mation of the optimal stopping point is about 9.65% (absolute) per data source. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3 lnGLR and ln(M +N) (=ln(N 1 +N 2 ) in this case) for the same clusters considered in Figure 2.1 along with the number of feature vectors in each cluster, with the fixed second order statistics of 1 = 0, 2 = 1, and Σ 1 =Σ 2 =1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 DistributionsforcorrectandincorrectmergingintermsofICR.Thethresh- old is set so as to minimize classification error between the two distri- butions. The distributions were obtained based on our development data set, and feature vectors in every cluster considered corresponded to more than 30 seconds in amount of time. . . . . . . . . . . . . . . . . . . . . . . 35 x 2.5 lnGLR, Th BIC = ·c·ln(M +N), and Th ICR = ·(M +N) for C-6, where = 12:0 and = 0:18603. The stopping point estimated by the ICR-based stopping point estimation method is identical to the optimal one in this case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6 Comparison of speaker clustering performance for the evaluation data set with accurate stopping point estimation and with the ICR-based stopping point estimation method, for which = 0:18603. Average speaker error timeratedegradationbyincorrectestimationoftheoptimalstoppingpoint is less than 1% (absolute) per data source. . . . . . . . . . . . . . . . . . . 38 3.1 Figure 2.1 revisited. This figure displays GLR for two clusters C 1 and C 2 along with the number of feature vectorsin each cluster. The second order statistics of the corresponding cluster models are fixed at 1 = 0, 2 = 1, and Σ 1 =Σ 2 =1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Segment length distributions for the development data set in Section 2.2. 
44 3.3 Speaker error time rate by AHSC with perfect detection of the optimal stopping point for the development data set in Section 2.2. This figure compares performance for the entire speech segments given as an input to AHSC with that for the corresponding subset containing the segments longer than or equal to 3 seconds only. . . . . . . . . . . . . . . . . . . . . 45 3.4 Comparison of basic AHSC and its three modified versions proposed in terms of average speaker error time rate. . . . . . . . . . . . . . . . . . . . 51 3.5 Softrankingusedintheproposedinter-clusterdistancemeasurementmethod. If a certain pair of clusters have the normalized distance of 0.5, their soft ranking becomes 0.69 (grey line) in this system. . . . . . . . . . . . . . . . 53 3.6 Extra performance improvement achieved if the proposed (GLR+ICR)- basedinter-clusterdistancemeasurewereappliedtothelaterecursionsteps ofthethreemodifiedversionsofAHSCintroducedinSection3.3. Thedata sources used in this experiment are the development and evaluation data set in Section 2.2. Perfect estimation of the optimal stopping point for AHSC is assumed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.7 Figure 3.3 revisited, showing speaker error time rate by AHSC with per- fect detection of the optimal stopping point for the development data set in Section 2.2. This figure compares performance for the entire speech seg- ments given as an input to AHSC with that for the corresponding subset containing the segments longer than or equal to 3 seconds only. . . . . . . 58 3.8 Comparison of basic AHSC with the assumption of perfect estimation of the optimal stopping point and selective AHSC (including the BIC-based stopping point estimation method), in terms of speaker error time rate on the evaluation data set in Section 2.2. . . . . . . . . . . . . . . . . . . . . 60 xi 3.9 Comparison of basic AHSC with the assumption of perfect estimation of the optimal stopping point and selective AHSC (including the ICR-based stopping point estimation method), in terms of speaker error time rate on the evaluation data set in Section 2.2. . . . . . . . . . . . . . . . . . . . . 61 4.1 EffectivenessoftheEMproceduresintheGMMclustermodelingapproach for GLR-based inter-cluster distance measurement. Each subfigure com- pares distances between two pairs of clusters along with the number of iterations in the EM procedures for GMMs with 16 mixture components. Onepaircomesfromthesamespeakersource(blackcurve)whiletheother is from different sources (grey curve). . . . . . . . . . . . . . . . . . . . . . 71 4.2 Processing time comparison of the two conventional cluster modeling ap- proaches. (For the GMM approach, four different mixture numbers are compared, i.e., 4, 8, 16, and 32.) (a) Full-shot version. (b) Zoomed-in version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.3 Clustering performance variation for Data 13 in the GMM cluster model- ingapproach. Thecirclesdenotespeakererrortimeratesfortherespective 10 sessions of the GMM approach with four different numbers of mixture components (i.e., 4, 8, 16, and 32), and the bold crosses are the corre- sponding mean values. The horizontal line presents the speaker error time rate obtained from the single Gaussian cluster modeling approach, which is 23.4%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4 Performance comparison of the proposed and two conventional cluster modeling approaches in terms of speaker error time rate (%). 
For this comparison, the best performance of the GMM approach for each data source was chosen among the 4 candidates (4, 8, 16, and 32 mixture com- ponents). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.5 Processing time comparison of the proposed and two conventional cluster modeling approaches. (For the GMM approach, four different mixture numbers are compared, i.e., 4, 8, 16, and 32.) (a) Full-shot version. (b) Zoomed-in version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.1 Speaker diarization: (a) Block diagram of a speaker diarization system. (b) Step-by-step graphical interpretation of how a given audio source is transcribed (in terms of “who spoke when”) by speaker diarization. . . . . 85 5.2 Performance of the proposed SAIL speaker diarization system on non- overlapped speech in the testing data set, in terms of DER. . . . . . . . . 97 5.3 IGMM cluster modeling. {C i } 5 i=1 are initial clusters for AHSC, and a and b (a+b = 1) are weights for the respective constituent GMMs. The weights are determined by the cardinalities of{C 1 ;C 2 ;C 3 } and{C 4 ;C 5 }, respectively. This figure illustrates how IGMMs grow through merging during AHSC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 xii 5.4 Selection of representative speech segments for improved IGMM cluster modeling. Inthiscase,C 2 ;C 4 ,andC 5 areselectedasrepresentativespeech segments to model{C i } 5 i=1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.5 1 st -orderMarkovchainmodelforparticipantinteractionpatternswhenthe estimated number of speakers is 4, where p ij is the transition probability from the speaker S i to the speaker S j for 1≤i;j≤m (m=4 in this case). 101 5.6 Performance of the modified SAIL speaker diarization system on non- overlapped speech in the testing data set, in terms of DER. . . . . . . . . 104 xiii List of Algorithms 1 Agglomerative Hierarchical Speaker Clustering (AHSC) . . . . . . . . . . 6 2 Modified Version 1 of AHSC . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3 Modified Version 2 of AHSC . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4 Modified Version 3 of AHSC . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5 AHSCwithcombinationofGLRandICRasaninter-clusterdistancemeasure 52 6 Selective AHSC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 7 Leader-Follower Clustering (LFC). . . . . . . . . . . . . . . . . . . . . . . 89 8 Agglomerative Hierarchical Speaker Clustering (AHSC) revisited . . . . . 92 xiv Abstract Speaker clustering refers to a process of classifying a set of input speech data (or speech segments) by a speaker identity in an unsupervised way, based on the similarity of speaker-specific characteristics between the data. The process identifies the speech segments of the same speaker source without any prior speaker-specific information of thegiveninputdata. Thisspeaker-perspective,unsupervisedclassificationofspeechdata can be applied as a pre-processing step to speech/speaker recognition or multimedia data segmentation/classification in various ways. Thus, speaker clustering has been recently attracting much attention in the research area of speech recognition and multimedia data processing. One big, yet unsolved, issue in the research field of speaker clustering is unreliable clustering performance under the variation of input speech data. 
In this disser- tation, we deal with this problem in the framework of agglomerative hierarchical speaker clustering (AHSC) in two perspectives: stopping point estimation and inter-cluster dis- tance measurement. In order to improve the robustness of stopping point estimation for AHSC under the variation of input speech data, we propose a new statistical measure called information change rate (ICR), which can improve estimation of the optimal stopping point. The ICR-based stopping point estimation method is not only empirically but also theoretically verified to be more robust to the variation of input speech data than the conventional BIC-based method. In order to improve the robustness of inter- clusterdistancemeasurementforAHSCunderthevariationofinputspeechdata, wealso propose selective AHSC and incremental Gaussian mixture cluster modeling. xv These two approaches are proven to provide much more reliability for speaker clustering performance under the variation of input speech data. Basedontheseresultsonrobustspeakerclusteringunderthevariationofinputspeech data,weextendourinteresttoimplementingamorerobustspeaker diarization system to the variation of input audio data. (Speaker diarization refers to an automated process that can annotate a given audio source in terms of “who spoke when”.) Focusing on speaker diarization of meeting conversations speech, we propose two refinement schemes to further improve the reliability of speaker clustering performance in the framework of speaker diarization under the variation of input audio data. One is selection of representative speech segments and the other is interaction pattern modeling between meeting participants,andbothofthemareexperimentallyverifiedtoenhance thereliabilityofspeakerclusteringperformanceandhenceimprovetheoveralldiarization accuracy under the variation of input audio data. xvi Chapter 1 Introduction 1.1 Motivation Pattern classication refers to a process, by not only human beings but also ma- chines, that categorizes information into pre-defined classes or classifies similar informa- tion among what is given to handle without any prior knowledge. This process is very common and we can count a number of examples inside/around us. A typical example can be found from the learning/cognitive systems of a human brain. With the help of various built-in pattern classification systems functioning in one’s brain, he/she identifies whoisthepersontowakehim/herupeverymorning,verifiesifthecarbeingtowedacross the street is his/hers or not, and recognizes that the music coming from a radio station is one of his/her favorites. We can also distinguish a mother and her son correctly out of unknown people based on their physical and behavioral resemblance detected by our brain systems, although we have never seen and known about them. For other instances, we can bring many state-of-the-art pattern recognition machines mimicking human brain functions, which are currently in a wide service in our daily life as a variety of forms, such as security-purposed biometrics, speech recognition solutions, and data mining ap- plications. From these machine recognition systems, we obtain huge benefit in terms of both convenience and efficiency. 1 From an engineering point of view, pattern classification in general means a process by machines based on understanding of human pattern classification 1 . 
In this regard, we, pattern classification engineers, have been trying to further expand the territory of relevant application domains beyond the currently deployed service areas including what has been mentioned above, e.g., biometrics, and there is still a significant amount of pattern classification research work actively going on around the world. Such a vast research effort can more enrich our future life as it has given us a lot of benefit thus far. 1.1.1 Pattern Classication and Clustering Automatic pattern classification systems can be categorized into being supervised or unsupervised [18], [75]. In supervised classification, there are the respective class labels available for a part of the entire data set given to be classified, so we can utilize such a portion of labeled data to train a classifier. Then we can identify which class the rest of the data belong to, respectively, based on the trained classifier. On the other hand, unsupervised classification, also called clustering, is used when such a high-level information as class labels is not available for the given data set. In this case, the only way to classify the data is to measure their proximity, e.g., similarity or dissimilarity, and is to decide which data points or clusters belong to the same class based on the measured proximity. Since there is no prior class-dependent information available for the given data set, clustering is generally considered as a more challenging task than supervised classification. Clustering systems can be further categorized into being partitional and hierarchical. The former is used when it is known how many classes there exist in the givendataset, whilethelatterisconsideredforthecasethatthereisnosuchinformation at hand. The categorization of pattern classification systems is depicted in Figure 1.1. Compared to supervised classification, clustering has attracted relatively more atten- tion in recent years mainly due to information overload [46], [7]. As an enormous amount ofnewinformationarepouredeverymomentoutofvariousmassmediaandmostofthem 1 This is a research topic usually conducted by cognitive science rather than by engineering. 2 Figure 1.1: Categorization of automatic pattern classification systems. are accessible because of the continuing growth of the Internet as well as the World Wide Web, there emerges a need for not only cost-effective but also time-efficient technologies that can handle the whole available information properly [41,49,56]. With most of the available information being stored as electric forms of data nowadays, data classification is necessary as the very first step for a proper information handling, and clustering offers such a required functionality. According to [18], there are a number of advantages in clustering in the era of information overload, some of which is listed as follows: • Itisalmostimpossibleinpracticetoalwaysguaranteeanenoughamountoflabeled data for classification, because labeling a large amount of data might cost too much time/effort and be prohibitive in some applications. • Inthecasethatthereexistedthetemporaldynamicsofdatapatternsandweneeded to consider them over time for a classifier, clustering could be applied to tracking such changes and possibly improve the overall classification performance based on it, which is hardly available in supervised classification. • Clustering can provide a form of data-dependent “smart pre-processing” such as smartfeatureextraction. 
Forthispurpose,unsuperviseddataanalysisbyclustering could be utilized to gain some insight into the nature or structure of the given data set. 3 1.1.2 Speaker Clustering As multimedia data, e.g, short environmental audio clips or a complex mixture of au- ral and visual sources such as movies and TV broadcast news, exponentially increase in number these days, how to properly classify and process a considerable amount of audio recordings,especiallyspeechportions,becomesacriticaltopic. Thisbroughtdatacluster- ing concept into the research field of speech signal processing. There are various criteria that can be considered in terms of clustering speech data, such as gender, emotion, topic or genre, and so on. Such major criteria for speech data clustering also include speaker identity, based on which we can classify speech data by speaker-specific characteristics. This classification process is called speaker clustering. Specifically, speaker clustering refers to the process of classifying a set of input speech data(orspeechsegments)byspeakeridentityinanunsupervisedway,basedonmeasuring the similarity of speaker-specific characteristics between the data. The process identifies which speech segments belong to the same speaker source without any prior speaker- specific information of the given input data. This speaker-perspective, unsupervised classification of speech data can be applied as a pre-processing step to speech/speaker recognitionormultimediadatasegmentation/classificationinvariousways. Forinstance, speakerclusteringcanprovidespeechrecognitionofaspontaneousconversationrecording withunsupervisedspeakeradaptationcapability, combinedwithspeaker-specificsegmen- tationoftherecording. (ItspossibleapplicationdomainsarefurthershowninFigure1.2.) For this reason speaker clustering has been recently attracting much attention in the re- search area of speech recognition and multimedia data processing. 1.1.3 Focus Identication Based on the aforementioned general benefit from pattern classification research and currentimportanceofspeakerclusteringinspeechsignalandmultimediadataprocessing, in this dissertation, we focus our research effort on speaker clustering and its relevant issues. 4 Figure 1.2: Application domains of speaker clustering. 1.2 Previous Work on Speaker Clustering Since a simple speaker clustering framework based on a hierarchical approach 2 was intro- ducedinearly1990sbyGish, et al.[26], therehavebeenalotofresearchworkonspeaker clustering thus far. Most of the work were initially relevant to broadcast news transcrip- tion systems [6,9,10,12,15,19–25,27,28,34–39,43,44,47,48,57,59,61,67–71,74,76,77]. Speaker clustering was utilized and had been developed in those systems to improve the accuracy of speech recognition for transcription of broadcast news audio data, en- abling to adapt original phoneme models (for speech recognition) based on unsupervised speaker-specific data classification results. As general interest in the research field of speech transcription and understanding moves from scripted/read speech data toward more challenging data domains like spontaneous meeting conversations, speaker cluster- ing now gets more important. 
In addition, a number of current and potential systems for multimedia content analysis and data indexing requires speaker-specific information of multimedia data as a necessary prior knowledge to semantically understand the whole 2 For your reminder, hierarchical clustering is utilized when there is no prior information of data at all including the number of data classes, as mentioned in Section 1.1.1. Since speaker clustering is mainly applied to applications where there is no available speaker-specific information of given speech data such as the number of speaker sources, hierarchical approaches are more natural to be considered in speaker clustering research than partitional ones. 5 Algorithm 1 Agglomerative Hierarchical Speaker Clustering (AHSC) Require: {x i };i=1;:::;ˆ n: speech segments ˆ C i ;i=1;:::;ˆ n: initial clusters Ensure: C i ;i=1;:::;n: finally remaining clusters 1: ˆ C i ←{x i }, i=1;:::;ˆ n 2: do 3: i;j←argmind( ˆ C k ; ˆ C l );k;l =1;:::;ˆ n;k̸=l 4: merge ˆ C i and ˆ C j 5: ˆ n← ˆ n−1 6: until speaker clustering performance is estimated to have reached the lowest level 7: return C i ;i=1;:::;n content of the data, so speaker clustering (being combined with speaker-specific segmen- tation) becomes into the spotlight more than ever. The Rich Transcription (RT) event that has been annually offered as one of mainstream benchmark tests since 2002 by the National Institute of Standards and Technology (NIST), therefore, includes speaker di- arization 3 system evaluation, considering it to be one of its main evaluation categories. A typical strategy for speaker clustering is an agglomerative hierarchical approach [6,9,10,12,13,15,18,19,21–23,25–27,35,43,59,61,69,74,76,77], which we usually call agglomerative hierarchical speaker clustering (AHSC). This strategy is considered as the best one for speaker clustering tasks, due to its simple processing structure and acceptable level of performance (although it is sub-optimal). Algorithm 1 shows how it works. It considers input speech data (or segments) as individual initial clusters and, at every recursion step, merges the closest pair of clusters in terms of speaker-specific characteristics among the entire candidate pairs. Such recursions continue until a certain stopping point where it is decided that an additional merging would not improve speaker clustering performance any further. AHSC, since its prototypical introduction in [26], has evolved in terms of two main perspectives, as follows: 3 Speakerdiarizationisanextendedversionofspeakerclustering,referringtoaprocessthatdividesand classifies speech data by speaker-specific characteristics and, as a result, can annotate the data in terms of “who spoke when” [63]. This process includes speaker-specific segmentation before speaker clustering, but the latter plays a much more critical role in the entire process than the former. 6 1. How to estimate when speaker clustering performance reaches the lowest level? 2. How to select the most homogeneous clusters (in terms of speaker-specific charac- teristics) for merging at every recursion step so as to achieve the minimum possible level of speaker clustering performance overall? Toward addressing the first question, a stopping point estimation method based on Bayesian information criterion (BIC) [58] is now widely used as a standardized approach. It was introduced in 1998 by Chen and Gopalakrishnan [13], and since then most of speaker clustering applications have utilized it to estimate the optimal recur- sion stopping point in AHSC. 
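As a concrete illustration of the recursion in Algorithm 1, the following is a minimal Python sketch of the AHSC loop. It is an illustrative outline rather than the dissertation's implementation: the callables `distance` and `should_stop` are placeholders for an inter-cluster distance measure (e.g., the GLR-based measure discussed next) and a stopping criterion (e.g., the BIC-based method).

    import itertools

    def ahsc(segments, distance, should_stop):
        """Agglomerative hierarchical speaker clustering (sketch of Algorithm 1).

        segments:    list of per-segment feature-vector arrays (initial clusters)
        distance:    callable returning the distance between two clusters
        should_stop: callable estimating whether further merging would degrade
                     clustering performance (the stopping point)
        """
        # Every input speech segment starts out as its own cluster.
        clusters = [[seg] for seg in segments]
        while len(clusters) > 1:
            # Select the closest pair of clusters among all candidate pairs.
            i, j = min(itertools.combinations(range(len(clusters)), 2),
                       key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
            if should_stop(clusters, i, j):
                break  # estimated optimal stopping point reached
            # Merge the selected pair and recurse on the reduced set of clusters.
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters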
In order to tackle the second question, on the other hand, the generalized likelihood ratio (GLR) [26] has been popularly utilized. This statistical distance measure between speech data was empirically verified and is thus considered the best solution for properly selecting the closest pair of clusters for merging during AHSC [63], among candidate measures including single/complete/average linkage, Euclidean/Mahalanobis distance, and Kullback-Leibler divergence. A basic AHSC framework with these two schemes for stopping point estimation and inter-data (or inter-cluster) distance measurement is broadly adopted by many state-of-the-art speaker diarization systems [1,3–5,33,42,50,54,55,65,72,73,78].

1.3 Problem Statement

Despite its broad adoption in current state-of-the-art speaker clustering applications, AHSC has a big, yet unsolved, issue: its performance is not robust to the variation of input speech data [63], [30]. The huge, negative effect of this issue on AHSC performance can be seen clearly in Figure 1.3, which shows baseline experimental results⁴ for AHSC with BIC-based stopping point estimation and GLR-based inter-cluster distance measurement. While the performance for Set 10 is less than 5%, which is quite good, the performances for Sets 2, 3, and 9 are all more than 35%. The absolute difference is roughly over 30%, which is undesirable.

Figure 1.3: Unreliable speaker clustering performance by AHSC across various input speech data (clustering error rate, %, for Sets 1-10). The entire data are 10 sets of segmented meeting conversation speech, and each of them contains a number of speech segments at an utterance or sentence level. In each speech segment, speaker-specific characteristics are homogeneous.

⁴ AHSC performance was measured by speaker error time rate, which is one of the official performance measures for speaker diarization, particularly for speaker clustering, in the NIST RT evaluation. The measurement tool is freely available at http://www.nist.gov/speech/tests/rt/2006-spring. We discuss this measure in more detail in subsequent chapters.

This unreliability problem in AHSC performance arises because both the BIC-based stopping point estimation method and the GLR-based inter-cluster distance measure are strongly influenced by the variation of input speech data. In this dissertation, we address this problem not only by analyzing its causes but also by proposing various algorithmic solutions. The next section briefly lists our proposed approaches, each of which is explained in more detail later in the dissertation.

1.4 Proposed Approaches

An overview of our approaches to tackling the aforementioned unreliability problem in AHSC performance is shown in Figure 1.4. They can be categorized into two perspectives: stopping point estimation (the leftmost column of the upper figure) and inter-cluster distance measurement (the rest of the upper figure). The proposed approaches are later applied to one of the promising applications in the research field of speaker clustering, i.e., speaker diarization. In the framework of speaker diarization, we further propose a few refinement schemes for speaker clustering performance (the lower figure).

Figure 1.4: Overview illustration of dissertation contributions to robust AHSC and speaker diarization in terms of the variation of input speech data.
1.4.1 Perspective 1: Stopping Point Estimation The conventional BIC-based stopping point estimation method for AHSC is not robust to the variation of input speech data, i.e., does not provide every input data set with reliable estimation of the optimal stopping point where clustering performance would not be improved any further with extra merging. A main reason for this robustness problem 9 in the method is that the stopping criterion used in the method is too sensitive to the variability of the following characteristics across input speech data: • Total amount of time for the entire speech utterances in a given input data set • Utterance time distribution over speaker sources, etc. InordertoimprovetherobustnessofstoppingpointestimationforAHSC,wepropose a new statistical measure, calledinformation change rate (ICR), that can help better and more robustly estimating the optimal stopping point. The ICR-based stopping point estimationmethodisnotonlyempiricallybutalsotheoreticallyverifiedtobemorerobust to the variation of input speech data than the conventional BIC-based one. We will take care of this subject in more detail in Chapter 2. 1.4.2 Perspective 2: Inter-Cluster Distance Measurement As the BIC-based stopping point estimation method, the conventional GLR-based inter- cluster distance measure for AHSC does not provide reliable performance for every input speech data set. Specifically speaking, the GLR-based measure incorrectly selects a pair ofheterogeneousclusters(intermsofspeaker-specificcharacteristics)formergingatsome recursionstepsofAHSC,whichcausestheoverallAHSCperformancetodegradeseverely. AmainreasonforthisproblemisthatthereliabilityoftheGLR-basedmeasureisaffected by the variability of the following characteristics across input speech data: • Utterance time distribution over speaker sources in a given input data set • Total number of speech segments • Average/individual time length per segment, etc. In order to address this problem, we have three different viewpoints on it. The next three sub-sections show each viewpoint in the order. 10 1.4.2.1 Earlier Recursion Steps of AHSC According to [29–32,40,66], a more specified reason for this unreliability problem in the GLR-based inter-cluster distance measurement is that GLR tends to get larger in proportion to the size of a cluster pair under consideration. As a result, the GLR-based measure has the following undesirable patterns: • Apairofhomogeneousclusters(intermsofspeaker-specificcharacteristics)ofsmall size might have a smaller GLR value and be regarded as mutually closer than those of large size. • A pair of heterogeneous clusters of small size might have a smaller GLR value and be regarded as mutually closer than a pair of homogeneous clusters of large size. These patterns cause merging between small size clusters to occur mostly in the early recursionstepsofAHSC,forwhichitishighlylikelythatincorrectmergingoftenhappens inthosestepsduetoinsufficientdatarepresentingspeaker-specificcharacteristicsinsmall size clusters. Such an incorrect merging, especially in the early recursion steps of AHSC, could affect the subsequent merging process negatively because of the recursive structure of AHSC, and thus needs to be prevented as much as possible. 
We propose several algorithmic approaches, e.g., selective AHSC in Figure 1.4, to figure out these undesirable patterns of the GLR-based inter-cluster distance measure by forcing merging between small size clusters to be kept from occurring in the early recursion steps of AHSC, which will be discussed more in Chapter 3. 1.4.2.2 Later Recursion Steps of AHSC The aforementioned patterns of the GLR-based inter-cluster distance measure could also cause incorrect merging between heterogeneous clusters in the late recursion steps of AHSC, which might have much bigger impact on the overall clustering performance than incorrect merging in the earlier recursion steps of AHSC. 11 In order to prevent such an incorrect merging in the later recursion steps of AHSC, we propose an alternative distance measurement method to combine GLR and ICR for better resolution in selecting the closest pair of clusters. We will see more details later in Chapter 3 as well. 1.4.2.3 Cluster Modeling Sinceinter-clusterdistanceisstatisticallymeasuredinAHSC,selectingproperprobability distributionfunctions(PDFsorpdfs)isrequiredforindividualclustersinordertoobtain accurate distance between clusters. Ideal cluster modeling for cluster distance measure- mentwithintheframeworkofAHSCshouldaccountforvariableclustersize,whichgrows when clusters are merged, and be dynamic enough to represent the statistical changes of data in clusters throughout the entire AHSC procedures. Since such changes in clusters during AHSC largely depend upon a number of input data characteristics, cluster mod- eling without dynamic representation capability would be much affected by the variation of input speech data, which is undesirable for reliable AHSC performance. Conventional cluster modeling approaches using either single Gaussian distributions or Gaussian mix- ture models (GMMs) are not ideal in this regard. We introduce a novel cluster modeling approach with dynamic representation capa- bility, called incremental Gaussian mixture cluster modeling. This new approach notonlycanbetterrepresentthestatisticalchangesofdatainclustersthroughoutAHSC than single Gaussian cluster modeling, but also provides slightly better clustering perfor- mance and has much lower computational complexity compared to GMM-based cluster modeling. We will handle this topic further in Chapter 4. 1.4.3 Application: Speaker Diarization BasedonourproposedapproachesforrobustAHSCtothevariationofinputspeechdata, we try to make speaker diarization (which is one of main speaker clustering applications and AHSC plays a critical role in it) further reliable across various input speech data 12 sets. For this purpose, we first implement our own speaker diarization system for analy- sis of spontaneous meeting conversations, called SAIL 5 speaker diarization system, equipped with the ICR-based stopping point estimation method and the incremental Gaussian mixture cluster modeling strategy for GLR-based inter-cluster distance mea- surement. Then, we propose two schemes for clustering performance refinement in the framework of speaker diarization: representative cluster model selection and in- teraction pattern modeling. In Chapter 5, all of these will be taken care of in more detail. 
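To make the incremental Gaussian mixture cluster modeling idea of Section 1.4.2.3 slightly more tangible before the full treatment in Chapter 4, here is a small Python sketch of one plausible reading of it, inferred from the description above and from the caption of Figure 5.3: each initial cluster is modeled by a single Gaussian, and when two clusters merge, their mixture models are combined with weights proportional to the cluster cardinalities instead of refitting a model from scratch. The class and method names are hypothetical, not the dissertation's code.

    import numpy as np

    class IncrementalGaussianMixture:
        """Cluster model that grows by weighted combination of constituent models."""

        def __init__(self, features):
            # An initial cluster is modeled by one full-covariance Gaussian component.
            self.n = len(features)                      # cluster cardinality
            self.weights = [1.0]
            self.means = [features.mean(axis=0)]
            self.covs = [np.cov(features, rowvar=False, bias=True)]

        def merge(self, other):
            """Combine two cluster models, weighting by cluster cardinalities."""
            total = self.n + other.n
            a, b = self.n / total, other.n / total      # a + b = 1, as in Figure 5.3
            self.weights = ([a * w for w in self.weights] +
                            [b * w for w in other.weights])
            self.means = self.means + other.means
            self.covs = self.covs + other.covs
            self.n = total
            return self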
1.5 Contribution Summary This dissertation, as it has been mentioned thus far, handles how to make speaker clus- tering, particularly AHSC, more robust to the variation of input speech data in two main perspectives, i.e., stopping point estimation and inter-cluster distance measurement, and how to extend research results from work on robust speaker clustering toward an appli- cation domain. A summary for the contributions of the dissertation is as follows: • Stopping Point Estimation Perspective { Proposes ICR, a new statistical distance measure between clusters, in order to avoid the negative effect of the variability of input data characteristics on stopping point estimation for AHSC. { IntroducesastoppingpointestimationmethodbasedonICRforAHSC,which ismorerobusttothevariationofinputspeechdatathantheconventionalBIC- based one. • Inter-Cluster Distance Measurement Perspective 5 SAIL stands for Signal Interpretation and Analysis Laboratory, in which I have been a member since 2004 under Prof. Shri Narayanan, my advisor and committee chair for this dissertation. 13 { Proposes several modified versions of AHSC approaches so as to enhance the reliability of the GLR-based inter-cluster distance measure at the early recur- sion steps of AHSC. { Proposes a method to combine GLR and ICR so as to improve the reliability of the GLR-based inter-cluster distance measure at the late recursion steps of AHSC. { Proposes a dynamic cluster modeling method so as to account for variable cluster size throughout the entire AHSC procedures. • Application { Proposes SAIL speaker diarization that can utilize our promising results from research work on robust speaker clustering. { Proposes two refinement schemes for clustering performance in the framework of speaker diarization. 1.6 Dissertation Outline Thisdissertationisorganizedasfollows. InChapter2,weaddresstherobustnessproblem oftheBIC-basedstoppingpointestimationmethodforAHSCunderthevariationofinput speech data. For this, we first take a short review of GLR and BIC, and then investigate a main reason for the problem considered. This investigation leads to understanding why a new statistical distance measure between clusters is needed for more robust stopping point estimation in AHSC under the variation of input speech data, which results in our proposal of ICR. In addition, we introduce a stopping point estimation method for AHSC based on ICR in this chapter. This stopping point estimation method is verified through experimental results to be more robust to the variation of input speech data than the conventional BIC-based one. In Chapter 3, we tackle the robustness problem of the GLR-based inter-cluster distance measure from both viewpoints of early and late 14 AHSC recursion steps. For this, we first examine why the reliability of the GLR-based inter-cluster distance measure severely varies across input data sources. Based on this examination, we propose several modified versions of AHSC approaches to improve the accuracy of the GLR-based inter-cluster distance measure, particularly at the early re- cursion steps of AHSC. Then we propose a supplement inter-cluster distance measure to utilize the advantages of GLR and ICR in order to tackle the robustness problem of the GLR-based inter-cluster distance measure at the late recursion steps of AHSC. 
All the methods proposed in this chapter are compared with original AHSC in terms of averaged performance across data sources, and are proven to provide benefit to the reliability of theGLR-basedinter-clusterdistancemeasureandthustheoverallspeakerclusteringper- formance. In Chapter 4, we introduce incremental Gaussian mixture cluster modeling for inter-cluster distance measurement in AHSC. This dynamic cluster modeling approach not only provides AHSC with as comparable clustering performance as the conventional GMM-based one does, but also has a lot more feasibility in computational complexity. In Chapter 5, we apply our research results to speaker diarization. For this, we implement ourownspeakerdiarizationsystemandfurthermodifyitwithtwoclusteringperformance refinement schemes. This dissertation is concluded in Chapter 6 with the final remarks on the work that has been dealt with thus far. We also mention our research’s potential application domains other than speaker diarization in this final chapter. 15 Chapter 2 Robust Stopping Point Estimation for AHSC 2.1 Introduction This chapter handles the robustness problem of the conventional BIC-based stopping point estimation method for AHSC under the variation of input speech data. This prob- lem has a huge impact on speaker clustering performance because it results in incorrect estimation of the optimal stopping point for AHSC on some data sources 1 , which might cause speaker clustering performance to be extremely worse than what it could be with exact estimation of the optimal stopping point during AHSC. In order to address the problem, we first propose information change rate (ICR), and then apply it to stop- ping point estimation for AHSC. Thechapterisorganizedasfollows. InSection2.2,weintroducethedatasourcesused for experiments in this chapter including analysis and comparison. Experimental setup and relevant assumptions are also described here. In Section 2.3, the BIC-based stopping point estimation method is investigated. This section provides analysis on the cause of the sensitivity of the BIC-based stopping point estimation method to the variation of input speech data. In Section 2.4, based on the analysis in Section 2.3, we tackle the robustness problem of the BIC-based stopping point estimation method by proposing a novelalternativebasedonICR.Throughexperimentsonourevaluationdatasources, the 1 This means that stopping point estimation in AHSC is perfectly done for some data sources while it is not for some others, which is why we call this issue a robustness problem to the variation of input speech data. 16 Table 2.1: Development set of data sources. N s : # of speaker identities (male:female) in each data source, T s : total utterance time (sec.), N t : # of speech segments, and T a : average segment length (sec.). C, N, and I: data sources chosen from ICSI, NIST, and ISL meeting speech corpora respectively. Development Set C-1 C-2 C-3 N-1 I-1 N s 7 (5:2) 7 (5:2) 6 (4:2) 4 (3:1) 4 (2:2) T s 1064.9 931.3 1148.5 835.7 477.7 N t 417 278 243 178 118 T a 2.5 3.3 4.7 4.7 4.0 proposedICR-basedstoppingpointestimationmethodisdemonstratedtobemorerobust to the variation of input speech data than the BIC-based one. We conclude this chapter inSection2.5withcommentsonfutureworkwithregardtotheICR-basedstoppingpoint estimation method for AHSC. 
2.2 Data and Experimental Setup

Tables 2.1 and 2.2 present the development and evaluation data sets used for the experiments reported in this chapter, obtained from 15 different meeting conversation excerpts (with a total length of approximately 3 hours and 45 minutes). The data sources are chosen from the ICSI, NIST, and ISL meeting speech corpora². They are distinct from one another in terms of the number of speaker sources (N_s), gender distribution over speaker sources, total utterance time (T_s), number of speech segments (N_t), and average segment length (T_a). The development set is used for tuning the parameters of the stopping point estimation methods (i.e., the BIC- and ICR-based methods) mentioned in this chapter, while the evaluation set is used for performance calculation.

For the experiments presented in this chapter, we assume that there is no individual speech segment having more than two speaker sources or including overlapped utterances, in order to avoid any potential confusion in performance analysis. To enable this, we manually segmented each data source according to the reference transcription officially provided by the Linguistic Data Consortium (LDC) prior to the experiments.

² LDC2004S02, LDC2004S09, and LDC2004S05, respectively.

Table 2.2: Evaluation set of data sources. The notation is the same as that in Table 2.1.

    Evaluation Set | C-4     C-5     C-6     C-7     C-8     C-9     N-2     N-3     I-2     I-3
    N_s            | 5 (3:2) 9 (7:2) 7 (6:1) 6 (5:1) 4 (4:0) 9 (7:2) 4 (3:1) 6 (4:2) 8 (4:4) 3 (2:1)
    T_s            | 674.5   423.2   2336.3  1664.9  1475.9  659.7   443.4   624.1   272.4   365.3
    N_t            | 175     129     610     531     477     158     74      143     92      72
    T_a            | 3.8     3.3     3.8     3.1     3.1     4.1     5.9     4.3     2.9     5.0

Mel-frequency cepstral coefficients (MFCCs) are used as acoustic features. Through 23 mel-scaled filter banks, a 12-dimensional MFCC vector is generated for every 20 ms-long frame of speech. Every frame is shifted at a fixed rate of 10 ms, so that there is an overlap between two adjacent frames. In order to measure speaker clustering performance, the official scoring tool md-eval-v21.pl³, distributed by NIST, is used. This tool provides clustering performance as speaker error time rate.

³ This tool can be downloaded from http://www.nist.gov/speech/tests/rt/2006-spring, as mentioned in Section 1.3.

2.3 BIC-based Stopping Point Estimation for AHSC

We begin this section by providing relevant background details on GLR and BIC. The former is, as mentioned in Section 1.2, a widely used inter-cluster distance measure for selecting merging clusters at every recursion step of AHSC, and the latter is a well-known model selection criterion that is utilized for the stopping point estimation method considered in this section.

2.3.1 Generalized Likelihood Ratio (GLR)

Suppose that a pair of clusters C_x and C_y are given and that they consist of the n-dimensional acoustic feature vectors x = {x_1, x_2, ..., x_M} and y = {y_1, y_2, ..., y_N}, respectively. Then the GLR for the given pair is computed as follows:

    \mathrm{GLR}(C_x, C_y) = \frac{P(x \cup y \mid H_1)}{P(x \cup y \mid H_2)},    (2.1)

where
• H_1 (unmerging hypothesis): C_x and C_y are hypothesized to be left unmerged.
• H_2 (merging hypothesis): C_x and C_y are hypothesized to be merged into a new cluster C_z, where z = x ∪ y.

In order to mathematically calculate the two likelihoods on the right side of Eq. (2.1), the two hypotheses need to be modeled by probability mass or distribution functions (PMFs or PDFs), respectively.
In this regard, single Gaussian modeling for each cluster considered (C_x and C_y for H_1, and C_z for H_2) has been popularly utilized since [26]. In this chapter, we follow this approach as well, because single Gaussian cluster modeling is much easier to analyze theoretically than other cluster modeling approaches such as those based on Gaussian mixture models (GMMs). (We discuss cluster modeling for inter-cluster distance measurement in AHSC in more detail in Chapter 5.) Based on [26], C_x, C_y, and C_z are modeled by (multivariate) single Gaussian distributions f_X, f_Y, and f_Z with full covariance matrices, respectively. Assuming that these PDFs represent random variables X, Y, and Z respectively, x, y, and z can be regarded (in the modeling framework of [26]) as sequences of independently and identically distributed (i.i.d.) random variables drawn according to the PDFs f_X, f_Y, and f_Z of the random variables X, Y, and Z respectively. The mean vectors and the covariance matrices of f_X, f_Y, and f_Z are determined by maximizing the likelihoods of x, y, and z for f_X, f_Y, and f_Z respectively. In other words,

    θ̃_x = (μ_x, Σ_x) = (μ_{f_X}, Σ_{f_X}) = θ_{f_X},                                  (2.2)
    θ̃_y = (μ_y, Σ_y) = (μ_{f_Y}, Σ_{f_Y}) = θ_{f_Y},                                  (2.3)
    θ̃_z = (μ_z, Σ_z) = (μ_{f_Z}, Σ_{f_Z}) = θ_{f_Z},                                  (2.4)

where μ_x, μ_y, and μ_z are the sample mean vectors and Σ_x, Σ_y, and Σ_z are the sample covariance matrices obtained from x, y, and z respectively, while μ_{f_X}, μ_{f_Y}, and μ_{f_Z} are the mean vectors and Σ_{f_X}, Σ_{f_Y}, and Σ_{f_Z} are the covariance matrices of f_X, f_Y, and f_Z respectively. Under this framework, Eq. (2.1) can be re-written as

    GLR(C_x, C_y) = p(x | f_X, θ_{f_X}) · p(y | f_Y, θ_{f_Y}) / p(z | f_Z, θ_{f_Z})    (2.5)
                  = [p(x | f_X, θ̃_x) / p(x | f_Z, θ̃_z)] · [p(y | f_Y, θ̃_y) / p(y | f_Z, θ̃_z)].   (2.6)

Eq. (2.6) tells us that GLR is always greater than or equal to 1, because both numerators in the equation are maximal among the likelihoods of x and y respectively. In other words, p(x | f_X, θ̃_x) ≥ p(x | f_Z, θ̃_z) and p(y | f_Y, θ̃_y) ≥ p(y | f_Z, θ̃_z), where the equalities hold only if C_x = C_y or x = y. This means that H_1 is always more likely than H_2, and thus GLR is not adequate for indicating that one hypothesis is more likely than the other. Instead, GLR tells how much more likely H_1 is than H_2. Therefore, the more likely H_1 is for a pair of clusters, the more distant the clusters are regarded to be in GLR-based inter-cluster distance measurement.

The drawback of GLR as an inter-cluster distance measure is, as mentioned in [29-32, 40, 66], that GLR tends to get larger as the total number of feature vectors within a pair of clusters under consideration increases. This is clearly illustrated in Figure 2.1, which shows GLRs between two clusters C_1 and C_2 along with the corresponding numbers of feature vectors N_1 and N_2. In order to observe the effect of the numbers of feature vectors in the clusters, we fixed the second order statistics of θ̃_1 and θ̃_2 arbitrarily (in this case, μ_1 = 0, μ_2 = 1, and Σ_1 = Σ_2 = 1). From this figure we can explicitly see the exponential rise of GLR as the numbers of feature vectors in the clusters increase.

Figure 2.1: GLR for two clusters C_1 and C_2 along with the number of feature vectors in each cluster. The second order statistics of the corresponding cluster models are fixed at μ_1 = 0, μ_2 = 1, and Σ_1 = Σ_2 = 1.
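To make the computation concrete, here is a minimal sketch (not from the dissertation) of the GLR of Eqs. (2.2)-(2.6), with every cluster modeled by a full-covariance Gaussian fit by maximum likelihood; the 12-dimensional synthetic features and the cluster sizes are arbitrary placeholders chosen only to mimic the trend in Figure 2.1.

import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(data):
    """Log-likelihood of data under the single Gaussian fit to it by maximum likelihood."""
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False, bias=True)    # ML (biased) covariance, as in Eqs. (2.2)-(2.4)
    return multivariate_normal.logpdf(data, mean=mu, cov=sigma).sum()

def log_glr(x, y):
    """ln GLR(C_x, C_y) of Eq. (2.5) for two clusters of feature vectors (one row per frame)."""
    return log_likelihood(x) + log_likelihood(y) - log_likelihood(np.vstack([x, y]))

rng = np.random.default_rng(0)
for size in (50, 500):
    x = rng.normal(0.0, 1.0, size=(size, 12))        # cluster C_1
    y = rng.normal(0.5, 1.0, size=(size, 12))        # cluster C_2, slightly shifted
    print(size, round(log_glr(x, y), 1))             # lnGLR grows with the cluster size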
Consequently, in GLR-based inter-cluster distance measurement, a pair of homogeneous clusters (in terms of speaker-specific characteristics) of small size are likely to have a smaller GLR value and be regarded as mutually closer than those of large size. Moreover, a pair of heterogeneous clusters of small size might have a smaller GLR value and be regarded as mutually closer than a pair of homogeneous clusters of large size, which is undesirable.

This undesirable tendency of GLR can be confirmed by analyzing the GLR computation with a few basic concepts from the field of information theory. Let us begin this analysis with Eq. (2.5). We can re-write the equation as below, without loss of generality, by applying the logarithm to both sides:

    ln GLR(C_x, C_y) = ln [ p(x | f_X, θ_{f_X}) · p(y | f_Y, θ_{f_Y}) / p(z | f_Z, θ_{f_Z}) ]
                     = ln f_X(x_1, x_2, ..., x_M) + ln f_Y(y_1, y_2, ..., y_N) − ln f_Z(x_1, ..., x_M, y_1, ..., y_N).   (2.7)

Considering that the GLR computation intrinsically assumes the weak law of large numbers to be satisfied during its procedure, we can apply the asymptotic equipartition property (AEP), widely known in information theory as a consequence of the weak law of large numbers, to the right-side term of Eq. (2.7). (The weak law of large numbers states that a sample mean and a sample variance converge in probability towards the expected value and the second central moment of the corresponding random variable respectively; in GLR computation, this law is inherent to Eqs. (2.2)-(2.4). The AEP can be stated as follows: let x_1, x_2, ..., x_M be a sequence of i.i.d. random variables drawn according to the PDF f_X of a random variable X. Then, according to [16],

    −(1/M) ln f_X(x_1, x_2, ..., x_M) → h(X) in probability,                           (2.8)

where h denotes entropy.) Then, Eq. (2.7) can be simplified to

    ln GLR(C_x, C_y) = −M·h(X) − N·h(Y) + (M + N)·h(Z).                                (2.9)

Since the entropy of an n-dimensional multivariate normal distribution N(μ, Σ) can be obtained (according to [16]) in the closed form (1/2) ln((2πe)^n |Σ|), where |·| denotes the determinant, we can further simplify Eq. (2.9) to

    ln GLR(C_x, C_y) = −M·(1/2) ln((2πe)^n |Σ_x|) − N·(1/2) ln((2πe)^n |Σ_y|) + (M + N)·(1/2) ln((2πe)^n |Σ_z|)
                     = [(M + N)/2] ln|Σ_z| − (M/2) ln|Σ_x| − (N/2) ln|Σ_y|,             (2.10)

where Σ_z has the following relation with Σ_x and Σ_y, because z = x ∪ y:

    Σ_z = (M·Σ_x + N·Σ_y)/(M + N) + (M·μ_x μ_x^T + N·μ_y μ_y^T)/(M + N)
          − [(M·μ_x + N·μ_y)/(M + N)] · [(M·μ_x + N·μ_y)/(M + N)]^T.                    (2.11)

Based on this, let us consider a simple instance. Suppose that we need to compute the GLR between two clusters C_x′ and C_y′, where x′ and y′ are sequences of i.i.d. random variables drawn according to the PDFs f_X and f_Y, and their cardinalities are 2M and 2N respectively. In other words, x′ has the same second order statistics as x but twice as many feature vectors, and y′ has the same relation to y. Then Σ_z′ = Σ_z, and hence

    ln GLR(C_x′, C_y′) = (M + N) ln|Σ_z′| − M·ln|Σ_{f_X}| − N·ln|Σ_{f_Y}|
                       = (M + N) ln|Σ_z| − M·ln|Σ_x| − N·ln|Σ_y| = 2·ln GLR(C_x, C_y).

The above example indicates that lnGLR increases linearly (or GLR increases exponentially) with fixed second order statistics as the numbers of feature vectors within a pair of clusters under consideration get larger, which is consistent with what is shown in Figure 2.1.

2.3.2 Bayesian Information Criterion (BIC)

BIC [58] was primarily intended for model (or PDF) selection, specifically for the problem of how to select the best model for given observations from among candidate models. A basic model selection strategy based on BIC is as follows:
1. Compute BIC scores for all candidate models:

       BIC(f) = ln p(x | f, θ_f) − P_f = ln p(x | f, θ_f) − (1/2)·#(θ_f)·ln M,          (2.12)

   where x = {x_1, x_2, ..., x_M} represents the M given observations, f is a model (or PDF), θ_f is the set of model parameters for f, and #(θ_f) is the total number of model parameters for f.

2. Select the model whose BIC score is the highest as the best one to represent the observations.

The core of BIC is that the log-likelihood of the given observations for a model is penalized by P_f, which is determined by the total number of model parameters and the logarithm of the cardinality of the observations. This prevents the model having the largest number of parameters from always being chosen as the best one, which is a well-known issue in model selection based on maximum likelihood without penalization.

2.3.3 BIC-based Stopping Point Estimation Method for AHSC

Keeping both GLR and BIC in mind, let us now investigate the BIC-based stopping point estimation method for AHSC. This conventional method for searching for the optimal stopping point of AHSC (the point where speaker clustering performance would not be improved any further by extra merging) was originally introduced in [13] by Chen and Gopalakrishnan. It basically stops AHSC at the point where the closest pair among all pairs of remaining clusters is decided, for the first time in the entire AHSC procedure, to be not homogeneous in terms of speaker-specific characteristics. This is based on the reasoning that if the closest pair of clusters were heterogeneous then so would be any other pair of clusters, and thus there would be no more need for merging in AHSC. The decision of homogeneity for the closest pair of clusters at every recursion step of AHSC is made by comparing the BIC scores of the clusters under the two hypotheses 'Unmerging' and 'Merging'. These two hypotheses are the same as those (H_1 and H_2) used in GLR computation in Section 2.3.1; in this case H_2 supports homogeneity while H_1 supports heterogeneity. As in GLR computation, the two clusters considered are modeled by (multivariate) single Gaussian distributions with maximum likelihood parameter estimation. The details of how the BIC-based stopping point estimation method works for AHSC are as follows (we use the same notation as in Section 2.3.1 for single Gaussian cluster modeling):

1. For the closest pair of clusters C_x and C_y, consisting of feature vectors x = {x_1, x_2, ..., x_M} and y = {y_1, y_2, ..., y_N} respectively, compute the BIC scores of x ∪ y for H_1 and H_2:

       BIC(H_1) = ln P(x ∪ y | H_1) − λ·P_{H_1}
                = ln P(x ∪ y | H_1) − λ·(1/2)·#(H_1)·ln N_total
                = ln { p(x | f_X, θ_{f_X}) · p(y | f_Y, θ_{f_Y}) } − λ·(1/2)·{#(θ_{f_X}) + #(θ_{f_Y})}·ln N_total
                = ln { p(x | f_X, θ̃_x) · p(y | f_Y, θ̃_y) } − λ·(1/2)·[ 2{ n + n(n+1)/2 } ]·ln N_total.      (2.13)

       BIC(H_2) = ln P(x ∪ y | H_2) − λ·P_{H_2}
                = ln P(x ∪ y | H_2) − λ·(1/2)·#(H_2)·ln N_total
                = ln { p(x | f_Z, θ_{f_Z}) · p(y | f_Z, θ_{f_Z}) } − λ·(1/2)·#(θ_{f_Z})·ln N_total
                = ln { p(x | f_Z, θ̃_z) · p(y | f_Z, θ̃_z) } − λ·(1/2)·{ n + n(n+1)/2 }·ln N_total.           (2.14)

   In Eqs. (2.13) and (2.14), λ is a parameter that should be tuned a priori so as to minimize the averaged speaker error time rate on a development set of data sources (this will be explained in more detail later), N_total is the total number of feature vectors over all the clusters given as an input to AHSC, and n is the dimension of the feature vectors.

2. Compute ∆BIC(C_x, C_y) = BIC(H_1) − BIC(H_2):
       ∆BIC(C_x, C_y) = ln { p(x | f_X, θ̃_x) · p(y | f_Y, θ̃_y) } − λ·(1/2)·[ 2{ n + n(n+1)/2 } ]·ln N_total
                        − ln { p(x | f_Z, θ̃_z) · p(y | f_Z, θ̃_z) } + λ·(1/2)·{ n + n(n+1)/2 }·ln N_total
                      = ln [ p(x | f_X, θ̃_x) · p(y | f_Y, θ̃_y) / ( p(x | f_Z, θ̃_z) · p(y | f_Z, θ̃_z) ) ] − λ·(1/2)·{ n + n(n+1)/2 }·ln N_total
                      = ln GLR(C_x, C_y) − λ·(1/2)·{ n + n(n+1)/2 }·ln N_total,         (2.15)

   where H_1 is decided if ∆BIC(C_x, C_y) > 0 and H_2 otherwise.

3. If ∆BIC(C_x, C_y) < 0, i.e., BIC(H_1) < BIC(H_2), decide that C_x and C_y are homogeneous and merge them. Otherwise, do not merge them and stop AHSC.

The stopping criterion mentioned above can be re-written as

    ln GLR(C_x, C_y) ≷ λ·c·ln N_total   (H_1 if the left side is greater, H_2 otherwise),   (2.16)

where c = (1/2){ n + n(n+1)/2 } is a constant. This criterion can be replaced by

    ln GLR(C_x, C_y) ≷ λ·c·ln(M + N).                                                    (2.17)

This modified criterion was introduced in [8] based on its better performance in estimating the optimal stopping point for AHSC compared to Eq. (2.16). In this chapter, we therefore consider Eq. (2.17) as the baseline stopping criterion for the BIC-based stopping point estimation method. From this point on, the stopping criterion mentioned throughout the chapter thus refers to Eq. (2.17), not Eq. (2.16).

Figure 2.2: Comparison of speaker clustering performance (for the evaluation data set described in Section 2.2) with and without accurate stopping point estimation. For the BIC-based stopping point estimation method, we tuned λ to be 12.0. Average speaker error time rate degradation by incorrect estimation of the optimal stopping point is about 9.65% (absolute) per data source.

2.3.4 Tuning Parameter λ

An important aspect to note for this BIC-based stopping point estimation method is the use of the tuning parameter λ in Eqs. (2.13) and (2.14). This parameter is not included in the original BIC score computation shown in Eq. (2.12), which means that the parameter was intentionally introduced when applying BIC to devise a stopping point estimation method for AHSC. Unfortunately, there is no explicit explanation in [13] of why λ is necessary and how it can be optimally chosen. In the research field of speaker clustering, however, the parameter is widely considered as a weighting factor that lifts the level of the whole right-side term of Eq. (2.17), and it is generally tuned so that the stopping criterion provides the minimum averaged speaker error time rate on a development data set. (In this chapter, we set λ to 12.0 because λ = 12.0 minimized the averaged speaker error time rate for our development data set.)

A problem is that λ does not work globally, because it is tuned only on a development data set. Such a tuned parameter cannot guarantee that the stopping criterion correctly estimates the optimal stopping points for data sources in a different data domain, due to its dependency upon the data set used for tuning. This problem is clearly confirmed in Figure 2.2. (In this experiment, GLR was used as the inter-cluster distance measure for AHSC to select the closest pair of clusters at every recursion step.) We can see from this figure that with λ = 12.0 the BIC-based stopping point estimation method does not reliably estimate the optimal stopping point for the evaluation data set. In our experiments, the impact of incorrect estimation of the optimal stopping point is detrimental specifically for C-5, C-6, N-2, and I-2, while it is not the case for C-4, C-8, and I-3. The average speaker error time rate degradation due to such incorrect estimation is about 9.65% (absolute) per data source.
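For concreteness, here is a minimal sketch (not from the dissertation) of the stopping check of Eq. (2.17): given lnGLR for the closest pair and the pair's cluster sizes, it decides whether merging should continue. The default values n_dim = 12 and lam = 12.0 mirror the setup above; the function name is ours.

import numpy as np

def bic_keep_merging(ln_glr, m, n, n_dim=12, lam=12.0):
    """True if Delta BIC < 0 for the closest pair, i.e. the pair looks homogeneous
    and AHSC should merge it and continue; False means stop AHSC (Eq. (2.17))."""
    c = 0.5 * (n_dim + 0.5 * n_dim * (n_dim + 1))   # c = (1/2){n + n(n+1)/2}
    return ln_glr < lam * c * np.log(m + n)         # H_2 (merge) iff lnGLR < lambda*c*ln(M+N)

# Example: a pair with 1500 + 900 feature vectors and lnGLR of 2.0e4 would stop AHSC.
print(bic_keep_merging(2.0e4, 1500, 900))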
In order to handle this problem, one interesting approach was proposed in [3] based on the idea of [2], namely to remove λ automatically by equalizing #(H_1) to #(H_2) in the computation of the BIC scores for H_1 and H_2. For this, a Gaussian mixture model (GMM) with m model parameters for each cluster considered (C_x and C_y) in H_1, and another GMM with 2m model parameters for the hypothetically merged cluster (C_z) in H_2, were utilized. By doing so, this approach can avoid parameter tuning. However, it has some side effects, such as increased computing time for training GMMs at every recursion step of AHSC. Moreover, the approach does not directly address a fundamental cause of the robustness problem of the BIC-based stopping point estimation method, namely that the stopping criterion itself is not robust to the variation of input speech data.

2.3.5 Stopping Criterion under the Variation of Input Speech Data

The stopping criterion of the BIC-based stopping point estimation method, Eq. (2.17), has an intrinsic flaw in terms of robustness to the variation of input speech data because it utilizes GLR. As aforementioned in Section 2.3.1, GLR is sensitive to the numbers of feature vectors within the clusters considered. As a result, the left side term of Eq. (2.17), lnGLR, is affected by several aspects of the entire set of speech segments given as an input to AHSC, beyond the statistical difference between the clusters considered. This is because the size of the clusters considered by the BIC-based stopping point estimation method at a certain recursion step of AHSC is determined jointly by the total amount of time of the entire speech utterances (or speech segments) in the given input data, the distributions of the speech segments in length and speaker identity, and the merging procedures at the previous recursion steps of AHSC. One might claim that the right side term of Eq. (2.17) is also decided by the numbers of feature vectors within the clusters considered due to ln(M+N), so that the stopping criterion looks robust to the variation of input speech data. However, lnGLR grows in a linear fashion in proportion to M and N (we confirmed in Section 2.3.1 that GLR grows exponentially in proportion to the numbers of feature vectors within the clusters considered), while ln(M+N) increases in a logarithmic fashion, which is well shown in Figure 2.3. lnGLR increases fast along with M and N, but ln(M+N) looks relatively flat in the figure. This indicates that the right side term of Eq. (2.17) cannot fully compensate for the data dependency of the left side term, and the stopping criterion is thus highly likely to vary across input speech data sources. For this reason, it is too difficult to set a global λ.

Figure 2.3: lnGLR and ln(M+N) (= ln(N_1+N_2) in this case) for the same clusters considered in Figure 2.1, along with the number of feature vectors in each cluster, with the fixed second order statistics μ_1 = 0, μ_2 = 1, and Σ_1 = Σ_2 = 1.

2.4 ICR-based Stopping Point Estimation for AHSC

In the previous section, we investigated the BIC-based stopping point estimation method for AHSC and underscored that a fundamental reason for the robustness problem of the method is that the stopping criterion is not robust to the variation of input speech data.
In this section, based on the analysis in Section 2.3, we propose a new stopping point estimation method for AHSC that is more robust to the variation of input speech data than the BIC-based one.

2.4.1 Information Change Rate (ICR)

First, we propose a new statistical distance measure between clusters, the information change rate (ICR), which is defined as follows for a pair of clusters C_x and C_y consisting of feature vectors x = {x_1, x_2, ..., x_M} and y = {y_1, y_2, ..., y_N}, respectively:

    ICR(C_x, C_y) ≜ [1/(M + N)] · ln GLR(C_x, C_y).                                    (2.18)

As shown above, ICR is a normalized version of lnGLR. This simple idea of normalizing lnGLR by the total number of feature vectors within a pair of clusters under consideration was inspired by analyzing GLR from an information-theoretic perspective. Let us consider Eq. (2.9) in Section 2.3.1 again. Considering that entropy can be regarded as the average description length for a random sample from a given PDF, we can separate the right side term of the equation into the following two parts:

    ln GLR(C_x, C_y) = (M + N)·h(Z) − { M·h(X) + N·h(Y) },                              (2.19)

where the first term is the total description length for z = x ∪ y under H_2 (Merging), and the second is the total description length for z under H_1 (Unmerging). This means that lnGLR equals the difference between the total description lengths for the whole set of feature vectors considered under the two hypotheses H_1 and H_2. That is, lnGLR represents how much the total amount of information would change by merging the clusters considered. This is why GLR is sensitive to cluster size. Thus, it is natural to expect that a distance measure representing how much the amount of information would change on average over feature vectors by merging the clusters considered could avoid being affected by the size of the clusters. ICR satisfies such an expectation. From Eqs. (2.18) and (2.19), we can obtain a different form of ICR:

    ICR(C_x, C_y) = h(Z) − [ M·h(X) + N·h(Y) ] / (M + N).                              (2.20)

In this form, ICR reveals the inter-cluster relation as follows, for two extreme examples:

• Ex 1: C_x = C_y or x = y.

    ICR(C_x, C_y) = h(X) − [ M·h(X) + M·h(X) ] / (M + M) = h(X) − h(X) = 0.

• Ex 2: C_x and C_y are mutually independent.

    ICR(C_x, C_y) = h(X) + h(Y) − [ M·h(X) + N·h(Y) ] / (M + N)
                  = [ (M + N)·h(X) + (M + N)·h(Y) ] / (M + N) − [ M·h(X) + N·h(Y) ] / (M + N)
                  = [ N·h(X) + M·h(Y) ] / (M + N).

Table 2.3: Comparison of ICR with other measures utilizing the idea of normalizing GLR. C_x and C_y: two clusters consisting of M and N feature vectors respectively, α: a parameter determined empirically, and n: dimension of the feature vectors.

  ICR(C_x, C_y)                   PLR in [40]                      NLLR in [66]
  [1/(M+N)]·ln GLR(C_x, C_y)      GLR(C_x, C_y) / (M+N)^α          [1/((M+N)·n)]·ln GLR(C_x, C_y)

2.4.2 Comparison of ICR with ICR-like Measures

In fact, there have been several ICR-like inter-cluster distance measures that normalize GLR in the research field of speaker clustering. Table 2.3 compares two such measures, i.e., the penalized likelihood ratio (PLR) [40] and the normalized log-likelihood ratio (NLLR) [66], with ICR. PLR normalizes GLR with the α-th power of the total number of feature vectors within the clusters considered. However, it does not appear promising in terms of mitigating the effect of cluster size on distance measurement, because

    ln PLR(C_x, C_y) = ln GLR(C_x, C_y) − α·ln(M + N).                                  (2.21)

As shown in Section 2.3.5, ln(M+N) cannot entirely compensate for the dependency of lnGLR on cluster size. Thus, it is difficult to set a global α.
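To see the effect of the normalization concretely, here is a minimal sketch (not from the dissertation) that computes lnGLR via the closed form of Eqs. (2.10)-(2.11) and ICR via Eq. (2.18) directly from cluster statistics; the means, covariances, and sizes below are arbitrary placeholders. Scaling both cluster sizes by a common factor scales lnGLR by that factor but leaves ICR unchanged.

import numpy as np

def ln_glr_from_stats(mu_x, cov_x, m, mu_y, cov_y, n):
    """Closed-form lnGLR of Eq. (2.10), with the merged covariance of Eq. (2.11)."""
    mu_z = (m * mu_x + n * mu_y) / (m + n)
    cov_z = ((m * cov_x + n * cov_y) / (m + n)
             + (m * np.outer(mu_x, mu_x) + n * np.outer(mu_y, mu_y)) / (m + n)
             - np.outer(mu_z, mu_z))
    logdet = lambda a: np.linalg.slogdet(a)[1]
    return 0.5 * ((m + n) * logdet(cov_z) - m * logdet(cov_x) - n * logdet(cov_y))

def icr_from_stats(mu_x, cov_x, m, mu_y, cov_y, n):
    """ICR of Eq. (2.18): lnGLR normalized by the total number of feature vectors."""
    return ln_glr_from_stats(mu_x, cov_x, m, mu_y, cov_y, n) / (m + n)

d = 12
mu_x, mu_y, cov = np.zeros(d), 0.5 * np.ones(d), np.eye(d)
for m, n in ((100, 80), (1000, 800)):
    print(m, n,
          round(ln_glr_from_stats(mu_x, cov, m, mu_y, cov, n), 2),   # grows 10-fold
          round(icr_from_stats(mu_x, cov, m, mu_y, cov, n), 5))      # stays the same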
On the other hand, NLLR is very similar to ICR, and its relation to ICR is as follows:

    NLLR(C_x, C_y) = (1/n)·ICR(C_x, C_y).                                              (2.22)

However, it has a different physical meaning from that of ICR because it further normalizes lnGLR by the dimension of the feature vectors, n.

2.4.3 ICR as a Homogeneity Decision Measure for Clusters

Since ICR represents how much the amount of information would change on average over feature vectors by merging the clusters considered, it is natural to expect ICR to be very small when the clusters considered are homogeneous in terms of speaker-specific characteristics and each cluster is large enough to fully cover the intra-speaker variance of the corresponding speaker identity. In other words, ICR will be small when the clusters considered have the same speaker identity source and do not need additional information to represent full speaker-specific characteristics. On the contrary, ICR will be relatively large when the clusters considered are heterogeneous, or when they are homogeneous but contain too few feature vectors to cover more than a part of the speaker-specific characteristics. Thus, ICR could properly work as a measure for deciding homogeneity of clusters if every cluster considered were large enough to fully represent the characteristics of the corresponding speaker identity.

We assume that a cluster containing feature vectors corresponding to more than 30 seconds in amount of time is such a large enough cluster. This assumption is based on the fact that long speech utterances (at least longer than 20 seconds) are required to derive reliable speaker characteristics [51-53]. Figure 2.4 displays the distribution of ICR between homogeneous clusters and that of ICR between heterogeneous clusters. The distributions were assumed to be Gaussian, and their sample means and sample variances were obtained from our development data set. The number of feature vectors in all the clusters considered here corresponded to more than 30 seconds in amount of time. Using the distributions in the figure, we set a threshold δ (= Th_ICR) to 0.18603, with which the classification error between the two distributions is minimized. We can thus regard a pair of clusters having an ICR less than δ = 0.18603 as homogeneous in terms of speaker-specific characteristics.

Figure 2.4: Distributions for correct and incorrect merging in terms of ICR. The threshold is set so as to minimize the classification error between the two distributions (Th_ICR = 0.18603). The distributions were obtained from our development data set, and the feature vectors in every cluster considered corresponded to more than 30 seconds in amount of time.

2.4.4 ICR-based Stopping Point Estimation Method for AHSC

Based on ICR and its applicability to inter-cluster homogeneity decisions in terms of speaker-specific characteristics, we now introduce an ICR-based stopping point estimation method for AHSC. This method is distinct from the BIC-based one in terms of 1) the stopping criterion and 2) the order of the clusters considered. Its details are as follows:

1. Wait until AHSC reaches the end of its merging procedure, i.e., until all the clusters given as an input to AHSC are merged into one big cluster.

2. For the pair of clusters merged at the last recursion step of AHSC, C_x and C_y, consisting of feature vectors x = {x_1, x_2, ..., x_M} and y = {y_1, y_2, ..., y_N} respectively, compute ICR.
3. Compare ICR with δ:

       ICR(C_x, C_y) ≷ δ   (H_1 if greater, H_2 otherwise).                            (2.23)

   If ICR(C_x, C_y) > δ, decide that C_x and C_y are heterogeneous in terms of speaker-specific characteristics and move on to consider the pair of clusters merged at the next latest recursion step of AHSC. Otherwise, stop considering earlier merging recursions and select the recursion step previously considered as the estimated optimal stopping point.

The ICR-based stopping point estimation method depends upon the reasoning that all merging occurring after the optimal stopping point occurs between heterogeneous clusters. (The BIC-based stopping point estimation method also relies on this same reasoning.) The reason why this stopping point estimation method starts its consideration from the pair of clusters merged at the last recursion step of AHSC is that such a strategy lets the stopping criterion, Eq. (2.23), consider large clusters only. As mentioned earlier, ICR can properly work as a homogeneity decision measure only for clusters large enough to represent full speaker-specific characteristics.

Eq. (2.23) can be re-written as follows:

    ln GLR(C_x, C_y) ≷ δ·(M + N)   (H_1 if greater, H_2 otherwise).                    (2.24)

Comparing this criterion with Eq. (2.17) for the BIC-based stopping point estimation method, we can see that the difference in computational complexity between the two stopping point estimation methods is negligible. For easier understanding of the ICR-based stopping point estimation method for AHSC, Table 2.4 is presented.

Table 2.4: ICR-based stopping point estimation method vs. BIC-based stopping point estimation method. c = (1/2){ n + n(n+1)/2 }, where n is the dimension of the feature vectors; n = 12, δ = 0.18603, and λ = 12.0.

                                  ICR-based method                     BIC-based method
  Criterion                       ICR(C_x, C_y) ≷ δ                    lnGLR(C_x, C_y) ≷ λ·c·ln(M+N)
  Right side term in criterion    Fixed during AHSC                    Floats with M and N during AHSC
  Computing complexity            Complexity for lnGLR(C_x, C_y)       Complexity for lnGLR(C_x, C_y)
                                  and δ·(M+N)                          and λ·c·ln(M+N)
  Order of clusters considered    From the pair of clusters merged     From the pair of clusters merged
                                  at the last recursion step           at the 1st recursion step

Figure 2.5 shows lnGLR, Th_BIC = λ·c·ln(M+N), and Th_ICR = δ·(M+N) for the data source C-6 in our evaluation data set, where λ = 12.0 and δ = 0.18603. This figure focuses on the variations of the three terms over the final 10 merging recursions during AHSC for C-6. From the figure, we can see that Th_ICR varies along with lnGLR while Th_BIC does not. The observation that Th_BIC looks almost flat compared to lnGLR is consistent with what was shown in Figure 2.3 in Section 2.3.5, and verifies that Eq. (2.17) is not robust to the variation of input speech data. In contrast, the robustness of the criterion in Eq. (2.23) or Eq. (2.24) to the variation of input speech data is demonstrated through this figure.

Figure 2.5: lnGLR, Th_BIC = λ·c·ln(M+N), and Th_ICR = δ·(M+N) for C-6, where λ = 12.0 and δ = 0.18603. The stopping point estimated by the ICR-based stopping point estimation method is identical to the optimal one in this case. (The plot marks the optimal stopping point, the stopping point estimated by the ICR-based method, i.e., where its criterion exceeds Th_ICR for the last time, and the stopping point estimated by the BIC-based method, i.e., where lnGLR falls below Th_BIC for the first time.)
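A minimal sketch (not from the dissertation) of this backward search follows: AHSC is first run until a single cluster remains while every merge is recorded, and the merge history is then scanned from the last merge toward the first. Here merge_history is a hypothetical list of ((mu_x, cov_x, M), (mu_y, cov_y, N)) statistics for the two clusters joined at each recursion step, and icr_from_stats() is the computation sketched in Section 2.4.2; both names are ours.

def estimate_stopping_point(merge_history, delta=0.18603):
    """Return the number of merges to keep, i.e. the estimated optimal stopping point.
    Merges after that point were between heterogeneous clusters (ICR > delta)."""
    for step in range(len(merge_history) - 1, -1, -1):
        (mu_x, cov_x, m), (mu_y, cov_y, n) = merge_history[step]
        if icr_from_stats(mu_x, cov_x, m, mu_y, cov_y, n) <= delta:
            return step + 1      # this merge joined homogeneous clusters: keep it
    return 0                     # even the first merge looks heterogeneous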
Figure 2.6 presents AHSC performance using the ICR-based stopping point estimation method (δ = 0.18603) for the evaluation data set. (GLR was used as the inter-cluster distance measure for AHSC in this experiment.) In the figure, we can observe that the proposed stopping point estimation method exactly detected the optimal stopping points for all the data sources except C-4, C-8, and C-9. Even for these three data sources, the gaps between the speaker error time rates at the estimated stopping points and those at the optimal ones are insignificant. Compared to the results (in Figure 2.2) obtained using AHSC with the BIC-based stopping point estimation method for the same data set, the results in this figure are much improved overall, and indicate that the ICR-based method is superior to the BIC-based one in terms of robustness to the variation of input speech data. Consequently, the ICR-based method for AHSC led to an average performance improvement of 8.76% (absolute) and 35.77% (relative).

Figure 2.6: Comparison of speaker clustering performance for the evaluation data set with accurate stopping point estimation and with the ICR-based stopping point estimation method, for which δ = 0.18603. Average speaker error time rate degradation by incorrect estimation of the optimal stopping point is less than 1% (absolute) per data source.

2.5 Conclusions

In this chapter, we addressed the robustness problem of the BIC-based stopping point estimation method for AHSC to the variation of input speech data. For this, we proposed a novel ICR-based alternative. Through experimental results on the excerpts obtained from meeting speech corpora, AHSC with the ICR-based stopping point estimation method was shown to outperform, and be more robust to the variation of input speech data than, basic AHSC with the BIC-based stopping point estimation method. Table 2.5 presents the performance comparison of AHSC with the BIC-based stopping point estimation method and AHSC with the ICR-based stopping point estimation method. A reason for the improvements achieved by our proposed method in terms of averaged speaker error time rate across the data sources in the evaluation data set is that the undesirable tendency of GLR, i.e., that GLR tends to get larger as the total number of feature vectors within a pair of clusters under consideration increases, was removed.

Table 2.5: Global comparison (averaged speaker error time rate for the evaluation data set) of AHSC with the BIC-based stopping point estimation method and AHSC with the ICR-based stopping point estimation method.

  AHSC (BIC)    AHSC (ICR)
  24.49%        15.73%

One potential future direction is to identify the lower bound on cluster size that guarantees ICR to be reliable as a statistical distance measure, more specifically as a homogeneity decision measure, between the clusters considered. In this chapter, we avoided the possibility that ICR would not work properly by checking ICR-based inter-cluster homogeneity starting from the pair of clusters merged at the last recursion step of AHSC, under the assumption that clusters at the late recursion steps of AHSC would be large enough for reliable ICR. This assumption worked for the meeting conversation excerpts used for the experiments presented in this chapter, because most of the speaker sources involved in the conversations generated enough speech utterances whose total length in time was longer than at least 30 seconds.
Thus, at the late recursion steps of AHSC, where the ICR-based stopping point estimation method is usually applied, ICR could be reliable as an inter-cluster homogeneity measure, as expected. The assumption could, however, be broken for other data sources that have a preponderance of short speech segments, which are inadequate for revealing the corresponding speaker-specific characteristics completely.

Chapter 3
Robust Inter-Cluster Distance Measurement for AHSC

3.1 Introduction

In this chapter we handle the robustness problem of the GLR-based inter-cluster distance measure for AHSC under the variation of input speech data. Like the robustness problem of the BIC-based stopping point estimation method, this problem is caused mainly by the undesirable tendency of GLR described in Chapter 2 (Section 2.3.1), which contributes to incorrect merging between heterogeneous clusters (in terms of speaker-specific characteristics) throughout the entire merging recursions in AHSC. In this chapter, we particularly focus on and tackle the negative effect of this problem on both the early and the late recursion steps of AHSC.

This chapter is organized as follows. In Section 3.2, we investigate the reason why the reliability of the GLR-based inter-cluster distance measure for AHSC varies severely across data sources, from the viewpoint of early AHSC recursion steps. Based on this investigation, in Section 3.3, we propose modified versions of AHSC, which are verified through experiments to enhance the reliability of the GLR-based inter-cluster distance measure at the early recursion steps of AHSC. As a result, all the modified AHSCs proposed in this section obtain better performance than basic AHSC in terms of inter-cluster distance measurement and thus speaker error time rate (assuming perfect estimation of the optimal stopping point for AHSC). In Section 3.4, based on the investigation in Section 3.2, we also propose a new method for measuring distance between clusters at the late recursion steps of AHSC, which combines the advantages of GLR and ICR (proposed in Chapter 2). This novel method is demonstrated through experimental results to be better than the conventional GLR-based inter-cluster distance measure at the later merging recursions in AHSC (assuming perfect estimation of the optimal stopping point for AHSC). One issue with the modified speaker clustering strategies of Sections 3.3 and 3.4 is that they are beneficial only when the optimal stopping point is accurately detected, which is not always the case in real situations. In this regard, in Section 3.5, we propose another modified version of AHSC, called selective AHSC, which offers better speaker clustering performance than the strategies dealt with in Sections 3.3 and 3.4 under ICR-based stopping point estimation. In Section 3.6, we conclude this chapter with comments on future research work with regard to the topics handled throughout the chapter.

3.2 GLR at Early AHSC Recursion Steps

As examined in Section 2.3.1, GLR tends to get larger as the total number of feature vectors within a pair of clusters under consideration increases. Figure 3.1 explicitly shows this tendency. During AHSC, this tendency of GLR causes a pair of homogeneous clusters (in terms of speaker-specific characteristics) of small size to have a smaller GLR value and be regarded as mutually closer than those of large size.
This tendency of GLR leads AHSC with the GLR-based inter-cluster distance measure to preferentially select short speech segments (among the entire set of speech segments given as an input to AHSC) as the closest candidates for merging at the early recursion steps of AHSC. (From this point on, we call speech segments shorter than 3 seconds short speech segments and, accordingly, speech segments longer than or equal to 3 seconds long speech segments.) This can be well noticed in Table 3.1. From the fourth row ('sub-total') of the table, we can observe that speech segments shorter than 3 seconds are involved in more than half of the first quarter of the entire merging recursions during AHSC for all the data sources in the development data set presented in Section 2.2. This trend is particularly distinct for C-1, C-2, and I-1 (92.38%, 88.57%, and 90.00% respectively), which seems reasonable because these data sources contain a large number of short speech segments, as shown in Figure 3.2.

Figure 3.1: Figure 2.1 revisited. This figure displays GLR for two clusters C_1 and C_2 along with the number of feature vectors in each cluster. The second order statistics of the corresponding cluster models are fixed at μ_1 = 0, μ_2 = 1, and Σ_1 = Σ_2 = 1.

Table 3.1: Distribution of the three different merging types (M_ss, M_sl, and M_ll) over the first quarter of the entire merging recursions during AHSC for every data source in the development data set in Section 2.2. M_ss: merging between speech segments shorter than 3 seconds, M_sl: merging between one speech segment shorter than 3 seconds and another longer than or equal to 3 seconds, and M_ll: merging between speech segments longer than or equal to 3 seconds.

              C-1       C-2       C-3       N-1       I-1
  M_ss        60.95%    52.86%    21.32%    13.33%    50.00%
  M_sl        31.43%    35.71%    39.34%    42.22%    40.00%
  sub-total   92.38%    88.57%    60.66%    55.55%    90.00%
  M_ll        7.62%     11.43%    39.34%    44.45%    10.00%

The interesting point that we observed from the first quarter of the entire merging recursions during AHSC for every data source in the development data set is that the accuracy of the GLR-based inter-cluster distance measure for AHSC stayed perfect for merging between speech segments longer than or equal to 3 seconds (M_ll), as shown in Table 3.2. In contrast, the accuracy became lower for the other merging types (M_ss and M_sl). In other words, only short speech segments were involved in all the incorrect merging. In this context, we can say that incorrect merging at the early recursion steps of AHSC is more likely to occur for data sources having a large number of short speech segments.

Table 3.2: Accuracy of the GLR-based inter-cluster distance measure for AHSC depending on the merging types defined in Table 3.1. These accuracies were obtained based only on the first quarter of the entire merging recursions during AHSC for every data source in the development data set in Section 2.2.

              M_ss      M_sl      M_ll
  Accuracy    88.89%    93.81%    100.00%

Considering that AHSC has a recursive structure and thus any incorrect merging during AHSC becomes a potential seed for further incorrect merging recursions, such incorrect merging at the early recursion steps of AHSC, due to the aforementioned tendency of GLR, can be regarded as one of the direct causes of high speaker error time rate.
This is confirmed in an indirect way in Figure 3.3, which compares the speaker error time rate (under the assumption of perfect estimation of the optimal stopping point) for each data source in the development data set with that for the corresponding subset containing long speech segments only. From this figure, we can observe that performance improvement would be achieved for most of the data sources without short speech segments. The improvement is considerable for C-1, C-2, and I-1, where short speech segments take a relatively large portion compared to the other data sources (C-3 and N-1).

Figure 3.2: Segment length distributions for the development data set in Section 2.2.

Figure 3.3: Speaker error time rate by AHSC with perfect detection of the optimal stopping point for the development data set in Section 2.2. This figure compares performance for the entire set of speech segments given as an input to AHSC with that for the corresponding subset containing only the segments longer than or equal to 3 seconds.

Based on all of these, we can conclude that the portion of short speech segments in the entire set of speech segments given as an input to AHSC can affect speaker error time rate, because of the undesirable tendency of GLR to get larger as the total number of feature vectors within a pair of clusters under consideration increases. For better AHSC performance, we thus need to mitigate this negative effect of short speech segments on the GLR-based inter-cluster distance measure, which is handled in the next section.

3.3 Modification of AHSC

In this section, in order to address the problem (mentioned in the previous section) of incorrect merging between short speech segments at the early recursion steps of AHSC, we propose three modified versions of AHSC that constrain merging between short speech segments, especially at the early recursion steps of AHSC, so as to minimize its effect on speaker error time rate under GLR-based inter-cluster distance measurement. The first two modified clustering strategies try to avoid merging between short speech segments (M_ss), because the accuracy of the GLR-based inter-cluster distance measure for M_ss is relatively worse than that for the other merging types (M_sl and M_ll). The third modified AHSC tries to preferentially consider M_ll rather than M_ss or M_sl, in order to exploit the high accuracy of the GLR-based inter-cluster distance measure for M_ll at the early recursion steps of AHSC. In the next three sub-sections, we explain these strategies in more detail.

3.3.1 Constrained Cluster Selection for Merging

The first modified version of AHSC prevents M_ss by allowing only M_sl or M_ll during the entire AHSC procedure. If the pair of clusters selected for merging at a certain recursion step of AHSC are both short speech segments, the next closest pair of clusters at that recursion step is considered for merging, until the pair of clusters considered are not both short speech segments. (See Algorithm 2.) This idea is based on the results in Table 3.2, which show that the accuracy of the GLR-based inter-cluster distance measure for M_ss is worse than that for M_sl or M_ll.
Algorithm 2 Modified Version 1 of AHSC
Require: {x_i}, i = 1, ..., n̂: speech segments
         Ĉ_i, i = 1, ..., n̂: initial clusters
Ensure: C_i, i = 1, ..., n: finally remaining clusters
 1: Ĉ_i ← {x_i}, i = 1, ..., n̂
 2: do
 3:   i, j ← argmin GLR(Ĉ_k, Ĉ_l) such that either {x_k} or {x_l} is a long speech segment, k, l = 1, ..., n̂, k ≠ l
 4:   merge Ĉ_i and Ĉ_j
 5:   n̂ ← n̂ − 1
 6: until optimal stopping point
 7: return C_i, i = 1, ..., n

Table 3.3: Comparison of basic AHSC and its first modified version in terms of average speaker error time rate for the development and evaluation data sets in Section 2.2. Both clustering strategies use the GLR-based inter-cluster distance measure to select clusters for merging at every recursion step of AHSC, and perfect stopping point estimation is assumed. (For each result in the table, the corresponding standard deviation is presented as well.)

           Basic AHSC         Modified Version 1
  Dev.     13.02% (± 9.92)    10.90% (± 6.80)
  Eval.    14.84% (± 9.00)    13.60% (± 8.71)

This modified version of AHSC, as shown in Table 3.3, provides better clustering performance (in terms of averaged speaker error time rate over data sources) than basic AHSC for both the development and evaluation data sets in Section 2.2, by 2.12% and 1.24% (absolute) respectively. This overall improvement in speaker clustering performance is achieved through the enhanced reliability of the GLR-based inter-cluster distance measure in the modified AHSC, which is supported by the reduced standard deviation of the performance results of the modified AHSC shown in the right column of the table. We can confirm from the results in this table that preventing merging between short speech segments from occurring at the early recursion steps of AHSC improves the reliability of AHSC performance, as expected.

3.3.2 Pre-Classification of Short Speech Segments

The second modified version merges every short speech segment into a long speech segment prior to AHSC. It shares the basic idea of the first modified AHSC in the sense of preventing M_ss during the entire AHSC procedure, but implements the idea differently. This modified version of AHSC first has each short speech segment merged into the closest long speech segment in terms of GLR, and then runs AHSC on the remaining set of speech segments, which contains long ones only. (See Algorithm 3.)

Algorithm 3 Modified Version 2 of AHSC
Require: {x_i}, i = 1, ..., n̂: speech segments
         Ĉ_i, i = 1, ..., n̂′, n̂′ ≤ n̂: initial clusters
Ensure: C_i, i = 1, ..., n: finally remaining clusters
 1: sort {x_i} in descending order of length
 2: Ĉ_j ← {x_i} such that {x_i} is a long speech segment, i = 1, ..., n̂ and j = 1, ..., n̂′
 3: m ← n̂′ + 1
 4: do
 5:   Ĉ ← {x_m}
 6:   i ← argmin GLR(Ĉ, Ĉ_k), k = 1, ..., n̂′
 7:   merge Ĉ into Ĉ_i
 8:   m ← m + 1
 9: until m > n̂
10: do
11:   i, j ← argmin GLR(Ĉ_k, Ĉ_l), k, l = 1, ..., n̂′, k ≠ l
12:   merge Ĉ_i and Ĉ_j
13:   n̂′ ← n̂′ − 1
14: until optimal stopping point
15: return C_i, i = 1, ..., n

Table 3.4: Comparison of basic AHSC and its second modified version in terms of average speaker error time rate for the development and evaluation data sets in Section 2.2. The same distance measure and stopping point estimation assumption as in Table 3.3 are applied.

           Basic AHSC         Modified Version 2
  Dev.     13.02% (± 9.92)    11.67% (± 9.72)
  Eval.    14.84% (± 9.00)    14.82% (± 9.87)
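A minimal sketch (not from the dissertation) of the pre-classification step of Algorithm 3: every short segment is merged into its closest long segment in terms of GLR before AHSC is run on the long segments only. Here segments is a hypothetical list of (frames × dimension) feature arrays, log_glr() is the likelihood-based sketch from Section 2.3.1, and a 10 ms frame shift is assumed when converting the 3-second boundary to a frame count.

import numpy as np

def preclassify_short_segments(segments, frames_per_sec=100, min_sec=3.0):
    """Merge each short segment into the closest long segment (by GLR); AHSC then
    runs on the returned clusters, which contain long segments only."""
    min_frames = int(min_sec * frames_per_sec)
    long_segs = [s for s in segments if len(s) >= min_frames]
    short_segs = [s for s in segments if len(s) < min_frames]
    clusters = [s.copy() for s in long_segs]
    for short in short_segs:
        closest = int(np.argmin([log_glr(short, c) for c in clusters]))
        clusters[closest] = np.vstack([clusters[closest], short])
    return clusters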
This second modified version of AHSC, as shown in Table 3.4, also provides better clustering performance (in terms of averaged speaker error time rate over data sources) than basic AHSC for both the development and evaluation data sets in Section 2.2, by 1.35% and 0.02% (absolute) respectively, which is however somewhat worse than the performance improvement of the first modified version of AHSC in the previous sub-section. One interesting point is that the standard deviation of the modified AHSC performance for the evaluation data set is higher than that of the basic AHSC performance. This suggests that this version of AHSC cannot provide better reliability in terms of AHSC performance, although it can offer a better overall speaker error time rate than basic AHSC.

3.3.3 Sequential Clustering prior to AHSC

The third modified version is somewhat different from the two versions proposed above. Instead of pre-screening M_ss, this modified version of AHSC simply reduces the proportion of M_ss (and M_sl as well) at the early recursion steps of AHSC by letting long speech segments be preferentially considered for merging through sequential clustering prior to AHSC. Specifically, it first sorts the entire set of speech segments (given as an input to AHSC) in descending order of length, runs leader-follower clustering (LFC) [18] on the sorted segment set, and then performs AHSC on the clusters provided by LFC. (See Algorithm 4.) In this sequential clustering strategy, input data are classified in order of arrival without any pre-trained class model; thus, the first incoming datum automatically becomes the first class, and every datum thereafter either is merged into one of the existing class(es) or becomes a new class. The threshold η used in LFC was empirically set to 250.0.

Algorithm 4 Modified Version 3 of AHSC
Require: {x_i}, i = 1, ..., n̂: speech segments; η: threshold
         Ĉ_i, i = 0, ..., n̂′, n̂′ ≤ n̂: intermediate clusters
Ensure: C_i, i = 1, ..., n: finally remaining clusters
 1: sort {x_i} in descending order of length
 2: Ĉ_1 ← {x_1}; n̂′ ← 1; m ← 2
 3: do
 4:   Ĉ ← {x_m}
 5:   i ← argmin GLR(Ĉ, Ĉ_k), k = 1, ..., n̂′
 6:   if min GLR(Ĉ, Ĉ_i) > η
 7:     n̂′ ← n̂′ + 1
 8:     Ĉ_{n̂′} ← Ĉ
 9:   else
10:     merge Ĉ into Ĉ_i
11:   m ← m + 1
12: until m > n̂
13: do
14:   i, j ← argmin GLR(Ĉ_k, Ĉ_l), k, l = 1, ..., n̂′, k ≠ l
15:   merge Ĉ_i and Ĉ_j
16:   n̂′ ← n̂′ − 1
17: until optimal stopping point
18: return C_i, i = 1, ..., n

This third modified version of AHSC, as shown in Table 3.5, also provides better clustering performance (in terms of averaged speaker error time rate over data sources) than basic AHSC for both the development and evaluation data sets in Section 2.2, by 3.41% and 1.04% (absolute) respectively, which is the best overall performance improvement among the versions proposed thus far, as shown in Figure 3.4. Comparing the standard deviations of the results in the table, we can see that this modified AHSC provides more reliable clustering performance across data sources, like the first modified AHSC in Section 3.3.1.

Table 3.5: Comparison of basic AHSC and its third modified version in terms of average speaker error time rate for the development and evaluation data sets in Section 2.2. The same distance measure and stopping point estimation assumption as in Table 3.3 are applied.

           Basic AHSC         Modified Version 3
  Dev.     13.02% (± 9.92)    9.61% (± 8.41)
  Eval.    14.84% (± 9.00)    13.80% (± 8.24)

3.4 Combination of GLR and ICR

In the previous two sections, we have focused on the early recursion steps of AHSC with regard to GLR-based inter-cluster distance measurement. In this section, we turn our attention to the GLR-based inter-cluster distance measure at the late recursion steps of AHSC. Let us start the section by recalling the undesirable tendency of GLR mentioned in Section 3.2, i.e.,

• A pair of homogeneous clusters of small size are likely to have a smaller GLR value and be regarded as mutually closer than those of large size.

This tendency leads to the following:

• A pair of heterogeneous clusters of small size might have a smaller GLR value and be regarded as mutually closer than a pair of homogeneous clusters of large size.
In other words, this tendency could cause incorrect merging during AHSC. Incorrect merging is more detrimental to speaker error time rate when it occurs at the late recursion steps of AHSC than elsewhere. This is because average cluster size increases as the merging recursions continue during AHSC, and thus any incorrect merging at the late recursion steps of AHSC is likely to occur between clusters of large size. Such incorrect merging at the late recursion steps of AHSC generally raises speaker error time rate much more than incorrect merging between small clusters at other recursion steps. Therefore, inter-cluster distance measurement needs to be more accurate at the late recursion steps of AHSC, and for this purpose we propose in this section a novel alternative to the GLR-based inter-cluster distance measure that can be applied to the late recursion steps of AHSC. This alternative distance measurement method considers both GLR and ICR (proposed in Section 2.4.1) in the selection of clusters for merging at the late recursion steps of AHSC. It is motivated by the idea that ICR could be utilized as a complementary inter-cluster distance measure to GLR, in the sense that it could compensate for the aforementioned undesirable tendency of GLR if we manipulate it to handle large clusters only. As mentioned in Section 2.4.3, ICR properly works as a distance measure between clusters only if it handles large clusters.

Figure 3.4: Comparison of basic AHSC and its three proposed modified versions in terms of average speaker error time rate.

3.4.1 (GLR+ICR)-based Inter-Cluster Distance Measurement

The new method to measure distance between clusters that we propose here basically depends upon GLR at every recursion step of AHSC, but starts to additionally consider ICR from the recursion step of AHSC at which all remaining clusters contain data samples of more than 30 seconds in amount of time. (See Algorithm 5.) Since co-consideration of ICR begins only at such a recursion step, this method is naturally applicable to the late recursion steps of AHSC. The reason why 30 seconds is specifically chosen here is, as aforementioned, that ICR could properly work as an inter-cluster distance measure only if every cluster considered were large enough to fully represent the corresponding speaker-specific characteristics. In this section, we conservatively assume that a cluster containing feature vectors that correspond to more than 30 seconds in amount of time is a large enough cluster to represent speaker-specific characteristics completely, as we did in Section 2.4.3.
Algorithm 5 AHSC with the combination of GLR and ICR as an inter-cluster distance measure
Require: {x_i}, i = 1, ..., n̂: speech segments
         Ĉ_i, i = 1, ..., n̂: initial clusters
Ensure: C_i, i = 1, ..., n: finally remaining clusters
 1: Ĉ_i ← {x_i}, i = 1, ..., n̂
 2: do
 3:   if all {Ĉ_i}, i = 1, ..., n̂, contain data of more than 30 seconds
 4:     i, j ← argmin [ SR_GLR(Ĉ_k, Ĉ_l) + SR_ICR(Ĉ_k, Ĉ_l) ];
        SR_GLR (or SR_ICR): soft ranking of cluster pairs in terms of GLR (or ICR), k = 1, ..., n̂, and l = k+1, ..., n̂
 5:   else
 6:     i, j ← argmin GLR(Ĉ_k, Ĉ_l), k = 1, ..., n̂, and l = k+1, ..., n̂
 7:   merge Ĉ_i and Ĉ_j
 8:   n̂ ← n̂ − 1
 9: until optimal stopping point
10: return C_i, i = 1, ..., n

In order to consider both GLR and ICR when selecting clusters for merging at the late recursion steps of AHSC, where the aforementioned condition is satisfied, the proposed method utilizes the sum of the rankings (in terms of GLR and ICR) over the entire set of cluster pairs at the recursion step considered, as a means of information fusion. Specifically, each pair of clusters at the recursion step of AHSC considered is ranked in two ways, one in terms of GLR and the other in terms of ICR. The smaller the GLR (or ICR) value of a certain pair of clusters, the higher they are ranked in terms of GLR (or ICR). The proposed method selects the pair of clusters having the smallest summed ranking for merging. We use such a high-level fusion strategy exploiting 'ranking' because GLR is empirically shown to have a much wider variance than ICR for any given cluster pair, and thus low-level fusion strategies like score normalization could cause GLR to be extremely dominant over ICR in the selection of clusters for merging in our case.

As for information fusion in the proposed method, we use 'soft ranking' when ranking clusters in terms of GLR and ICR. Each soft ranking is defined as follows:

    SR_GLR(Ĉ_k, Ĉ_l) ≜ F_N( ( GLR(Ĉ_k, Ĉ_l) − μ_GLR ) / σ_GLR ),                        (3.1)
    SR_ICR(Ĉ_k, Ĉ_l) ≜ F_N( ( ICR(Ĉ_k, Ĉ_l) − μ_ICR ) / σ_ICR ),                        (3.2)

where μ_GLR and σ_GLR (or μ_ICR and σ_ICR) are the mean and standard deviation of the GLR (or ICR) values over all cluster pairs at the recursion step of AHSC considered, and F_N(·) is the normal cumulative density function with zero mean and unit variance. This soft ranking approach normalizes the inter-cluster distances, assuming that they are normally distributed, and transforms them through a monotonically increasing function, so as to provide a sort of relative information between clusters. (See Figure 3.5.)

Figure 3.5: Soft ranking used in the proposed inter-cluster distance measurement method. If a certain pair of clusters have a normalized distance of 0.5, their soft ranking becomes 0.69 (grey line) in this system.

AHSC with our proposed method for measuring distance between clusters, as shown in Table 3.6, provides better clustering performance (in terms of averaged speaker error time rate over data sources) than AHSC with the conventional GLR-based inter-cluster distance measure, for both the development and evaluation data sets in Section 2.2, by 2.82% and 0.20% (absolute) respectively.

Table 3.6: Comparison of AHSC with the GLR-based inter-cluster distance measure and AHSC with our proposed method, in terms of average speaker error time rate for the development and evaluation data sets in Section 2.2. Perfect stopping point estimation for AHSC is assumed.

           GLR       GLR+ICR
  Dev.     13.02%    10.20%
  Eval.    14.84%    14.64%
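A minimal sketch (not from the dissertation) of the selection rule on line 4 of Algorithm 5: the GLR and ICR values of all current cluster pairs are z-scored, mapped through the standard normal CDF as in Eqs. (3.1)-(3.2), and the pair with the smallest summed soft ranking is chosen. The function and argument names are ours.

import numpy as np
from scipy.stats import norm

def select_pair_glr_icr(pairs, glr_values, icr_values):
    """pairs: list of (i, j) cluster index tuples; the two value lists are aligned with it."""
    def soft_rank(values):
        values = np.asarray(values, dtype=float)
        std = values.std()
        if std == 0.0:                       # degenerate case: all pairs tie
            return np.zeros_like(values)
        return norm.cdf((values - values.mean()) / std)    # Eqs. (3.1)-(3.2)
    fused = soft_rank(glr_values) + soft_rank(icr_values)
    return pairs[int(np.argmin(fused))]      # closest pair under the fused ranking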
This improvement comes from, as expected, the reduced number of incorrect merging occurrences at the late recursion steps of AHSC. Based on these results, we can expect that applying this method to the late recursion steps of the modified AHSC approaches proposed in Section 3.3 would result in extra performance improvement.

3.4.2 Proposed Measure in Modified AHSC Approaches

Figure 3.6 explicitly displays the extra performance improvement that would be achieved if the proposed (GLR+ICR)-based inter-cluster distance measure were applied to the late recursion steps of the three modified versions of AHSC introduced in Section 3.3. The overall results in this figure indicate that the proposed, supplementary inter-cluster distance measure does not degrade the merits of the modified AHSC approaches at the early recursion steps, while retaining its own merit at the late recursion steps. The most outstanding improvement is 2.94% (absolute) for both modified versions 1 and 2 on the evaluation data set, while the performance improvement for modified version 3 is not significant.

Figure 3.6: Extra performance improvement achieved when the proposed (GLR+ICR)-based inter-cluster distance measure is applied to the late recursion steps of the three modified versions of AHSC introduced in Section 3.3. The data sources used in this experiment are the development (D) and evaluation (E) data sets in Section 2.2. Perfect estimation of the optimal stopping point for AHSC is assumed.

3.5 Selective AHSC

In the previous sections, we have proposed three modified versions of AHSC to tackle incorrect merging between heterogeneous clusters (in terms of speaker-specific characteristics) at the early recursion steps of AHSC. Furthermore, we have introduced a new supplementary inter-cluster distance measure to handle incorrect merging at the late recursion steps of AHSC. From these approaches, we obtained performance improvement for the evaluation data set (in Section 2.2) in terms of average speaker error time rate by up to 4.18% (absolute) and 28.17% (relative), which is re-organized in Table 3.7. However, they work only under the assumption of perfect estimation of the optimal stopping point in AHSC.

In this section, we test how badly the clustering strategies proposed in Sections 3.3 and 3.4 perform with the ICR-based stopping point estimation method (proposed in Chapter 2).

Table 3.7: Average speaker error time rate for the evaluation data set in Section 2.2. This table compares AHSC and its three modified versions with both GLR-based and (GLR+ICR)-based inter-cluster distance measurement. Perfect estimation of the optimal stopping point for AHSC is assumed.

              AHSC      Mod 1     Mod 2     Mod 3
  GLR         14.84%    13.60%    14.82%    13.80%
  GLR+ICR     14.64%    10.66%    11.88%    13.21%

Table 3.8: Average speaker error time rate for the evaluation data set in Section 2.2 when the ICR-based stopping point estimation method is applied. This table compares AHSC and its three modified versions with GLR-based and (GLR+ICR)-based inter-cluster distance measurement, as Table 3.7 does.

              AHSC      Mod 1     Mod 2     Mod 3
  GLR         15.73%    15.65%    18.48%    16.51%
  GLR+ICR     22.42%    16.00%    14.18%    19.18%
In this regard, we propose a new clustering strategy, i.e., selective AHSC, utilizing (relatively) accurate stopping point estimation by the ICR-based stopping point estima- tionmethodandhighreliabilitybytheGLR-basedinter-clusterdistancemeasureforlong speech segments. This clustering approach is empirically verified to be one of possible combinations that can well coordinate our results in Chapters 2 and 3. 3.5.1 Modied AHSCs with Stopping Point Estimation The clustering performance that would be achieved from the three modified versions of AHSC (proposed in Section 3.3) if the ICR-based stopping point estimation method were applied instead of the assumption of perfect estimation of the optimal stopping point is shown in Table 3.8. Considering Table 3.7 together, we can see that incorrect stopping point estimation by the ICR-based method mostly erases the advantages of the 56 modified versions of AHSC that were obtained under the assumption of perfect stopping point estimation. A noticeable thing in Table 3.8 is that the (GLR+ICR)-based inter- cluster distance measure and the ICR-based stopping point estimation method do not work well together, which can be easily shown when we compare the results in the second row and their counterparts in the third row. Except for the second modified version of AHSC, clustering performance is observed to be degraded in every case when the two approaches are applied together. This is caused because the ICR-based stopping point estimationmethodstartsitsestimationprocessfromthepairofclustersmergedatthelast recursion step of AHSC by comparing the ICR value of the pair with a pre-set threshold ( in Chapter 2). However, the (GLR+ICR)-based inter-cluster distance measure is likely to select the clusters having a small ICR (and GLR) value for merging, especially at the late recursion steps of AHSC. Thus, the ICR-based stopping point estimation method is more likely to confuse itself in estimating the optimal stopping point in AHSC (and its modified versions) under (GLR+ICR)-based inter-cluster distance measurement. Therefore,inpracticalapplications,themodifiedversionsofAHSCwiththe(GLR+ICR)- based inter-cluster distance measure introduced in the previous section need a better stopping estimation method using a different (or independent) measure from ICR. We do not further take care of this issue in this dissertation, remaining it as a future research topic. 3.5.2 Selective AHSC In order to better coordinate our results in Chapters 2 and 3, we propose selective AHSC in this section. This proposed method is motivated by the same reason for the other modified versions of AHSC introduced in Section 3.3, which is well shown in Figure 3.7. WecanseefromthefigurethattheaccuracyoftheGLR-basedinter-clusterdistance measure for AHSC would mostly rise up and thus result in better clustering performance when an input to AHSC contains only long speech segments. Selective AHSC can utilize this advantage that could be obtained when dealing with long speech segments only, by 57 C−1 C−2 C−3 N−1 I−1 0 5 10 15 20 25 30 Data Source Speaker Error Time Rate (%) Entire segments given Subset containing long segments Figure 3.7: Figure 3.3 revisited, showing speaker error time rate by AHSC with perfect detection of the optimal stopping point for the development data set in Section 2.2. This figure compares performance for the entire speech segments given as an input to AHSC with that for the corresponding subset containing the segments longer than or equal to 3 seconds only. 
firstrunningbasicAHSCwiththeGLR-basedinter-clusterdistancemeasureandtheICR- based stopping point estimation method only on long speech segments among the entire given speech segments, and then classifying the rest (i.e., short speech segments) into one of the clusters provided by the initial AHSC step. (See Algorithm 6.) By selective classification of speech segments in terms of length, selective AHSC can mitigate the negativeeffectofshortspeechsegmentsonGLR-basedinter-clusterdistancemeasurement during AHSC, especially at the early recursion steps. Note that robust stopping point estimation to the variation of input speech data, like by the ICR-based stopping point estimation method, is critical to selective AHSC due to selective consideration of speech segments at the initial AHSC step. Such selective consideration causes the variability of an input to AHSC, so selective AHSC would not work properly if it were with any other stopping point estimation method not robust to the variation of data sources. How badly the BIC-based stopping point estimation method, which has been verified in Chapter 2 to be not robust to the variation of input 58 Algorithm 6 Selective AHSC Require: {x i };i=1;:::;ˆ n: speech segments ˆ C i ;i=1;:::;ˆ n ′ ;ˆ n ′ ≤ ˆ n: initial clusters Ensure: C i ;i=1;:::;n: finally remaining clusters 1: permutate{x i } in the descending order of length 2: ˆ C j ←{x i } such that{x i } is a long speech segment, i=1;:::;ˆ n and j =1;:::;ˆ n ′ 3: m= ˆ n ′ 4: do 5: i;j←argminGLR( ˆ C k ; ˆ C l );k;l =1;:::;m;k̸=l 6: merge ˆ C i to ˆ C j 7: m←m−1 8: until optimal stopping point (detected by ICR-based stopping point estimation) 9: return C i ;i=1;:::;n 10: m= ˆ n ′ +1 11: do 12: ˆ C←{x m } 13: i←argminP( ˆ C| ˆ C k );k =1;:::;n 14: merge ˆ C to ˆ C i 15: m←m+1 16: until m> ˆ n 17: return C i ;i=1;:::;n speech data, would break selective AHSC performance is given in Figure 3.8. The results in this figure suggest that without robust stopping point estimation we cannot get any benefit from this novel approach to speaker clustering. Figure 3.9 shows the performance of selective AHSC with robust stopping point es- timation for the evaluation data set in Section 2.2. From this figure, we can see that selective AHSC is a reasonable strategy to coordinate our results in Chapters 3 and 4. The performance of selective AHSC with ICR-based stopping point estimation is shown to be better than that of basic AHSC with perfect stopping point estimation for every data source except C-4, C-5, and I-3. Even for the three data sources, performance gap is negligible. The result that average speaker error time rate by selective AHSC for the evaluation data set (in Section 2.2) is even better than that by AHSC with perfect stop- ping point estimation can be regarded as promising. Table 3.9 explicitly indicates the superiority of selective AHSC over all the counterparts that have been dealt with in this dissertation. 59 C−4 C−5 C−6 C−7 C−8 C−9 N−2 N−3 I−2 I−3 0 10 20 30 40 50 60 70 Data Source Speaker Error Time Rate (%) AHSC with perfect stopping point estimation Selective AHSC with BIC−based stopping point estimation Figure 3.8: Comparison of basic AHSC with the assumption of perfect estimation of the optimal stopping point and selective AHSC (including the BIC-based stopping point estimation method), in terms of speaker error time rate on the evaluation data set in Section 2.2. 
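The two-phase procedure of Algorithm 6 can be sketched in Python as follows. This is an illustrative outline only: glr, stop_by_icr, and loglik stand for the GLR distance between clusters, the ICR-based stopping check (here simplified to an online test of the selected pair), and the log-likelihood of a segment under a cluster model; the 3-second threshold and the segment attribute duration are assumptions of the sketch. In the second phase we assume the intended rule of Algorithm 6 is to assign each short segment to the cluster under which it is most likely.

def selective_ahsc(segments, glr, stop_by_icr, loglik, min_len=3.0):
    long_segs = [s for s in segments if s.duration >= min_len]
    short_segs = [s for s in segments if s.duration < min_len]

    # Phase 1: basic AHSC on the long segments only, with GLR-based pair
    # selection and ICR-based stopping.
    clusters = [[s] for s in long_segs]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: glr(clusters[p[0]], clusters[p[1]]))
        if stop_by_icr(clusters[i], clusters[j]):
            break
        clusters[i].extend(clusters.pop(j))   # j > i, so index i stays valid

    # Phase 2: classify each remaining short segment into the most likely
    # cluster produced by the initial AHSC step.
    for s in short_segs:
        best = max(range(len(clusters)), key=lambda k: loglik(s, clusters[k]))
        clusters[best].append(s)
    return clusters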
3.6 Conclusions In this chapter, we addressed the robustness problem of the GLR-based inter-cluster distance measure for AHSC to the variation of input speech data. For this, we proposed 1) three modified versions of AHSC so as to enhance the reliability (or accuracy) of the GLR-based inter-cluster distance measure at the early recursion steps AHSC and 2) a (GLR+ICR)-basedinter-clusterdistancemeasuresoastoimprovereliabilityinmeasuring inter-cluster distance at the late recursion steps of AHSC. Through experimental results on the excerpts obtained from meeting corpora, all the proposed ones are demonstrated to provide better clustering performance than basic AHSC with the GLR-based inter- cluster distance measure in terms of averaged speaker error time rate over data sources. Furthermore, weproposedselectiveAHSCtobettercoordinatethemeritsofourresearch resultsinthischapterwithICR-basedstoppingpointestimation(proposedinChapter2) 60 C−4 C−5 C−6 C−7 C−8 C−9 N−2 N−3 I−2 I−3 0 5 10 15 20 25 30 35 40 45 50 Data Source Speaker Error Time Rate (%) AHSC with perfect stopping point estimation Selective AHSC with ICR−based stopping point estimation Figure 3.9: Comparison of basic AHSC with the assumption of perfect estimation of the optimal stopping point and selective AHSC (including the ICR-based stopping point estimation method), in terms of speaker error time rate on the evaluation data set in Section 2.2. in more realistic situations forspeakerclustering. This novelclustering strategy provided AHSC performance with even better reliability over input speech data. There are several directions for future work including further refinements to the pro- posed solutions. For instance, in the third modified version of AHSC, the threshold parameter determines the number of intermediate clusters, which is directly linked to the final speaker error time rate. It was chosen empirically in this chapter, but finding ways for optimally setting would be beneficial in further enhancing clustering per- formance. As another example, we might have to consider how to optimally fuse two different statistical information on the same object for (GLR+ICR)-based inter-cluster distance measurement at the late recursion steps of AHSC. In this chapter, we used soft rankings in terms of GLR and ICR for that purpose, but it is not theoretically proven to be optimal to the task considered. Establishing more systematic frameworks for selection of information fusion methods could be one of valuable future research directions. In addition, as mentioned earlier in the middle part of this chapter, it would be a good research topic to find out a stopping point estimation method for AHSC with the 61 (GLR+ICR)-based inter-cluster distance measure, other than the ICR-based one. A new stopping point estimation method should be comparable to our proposed ICR-based one in terms of estimation accuracy, but needs to use an inter-cluster homogeneity decision measure independent of ICR. Then it could keep the advantages of the modified versions ofAHSCandthe(GLR+ICR)-basedinter-clusterdistancemeasurevalideveninpractical applications. 62 Table 3.9: Average speaker error time rate for the evaluation data set in Section 2.2. This table compares selective AHSC with all the counterparts that have been dealt with in this dissertation. 
Key Components Performance AHSC Distance Measure: GLR 24.49% Stopping Method: BIC AHSC Distance Measure: GLR 15.73% Stopping Method: ICR Modified Version 1 Distance Measure: GLR 15.65% of AHSC Stopping Method: ICR Modified Version 1 Distance Measure: (GLR+ICR) 16.00% of AHSC Stopping Method: ICR Modified Version 2 Distance Measure: GLR 18.48% of AHSC Stopping Method: ICR Modified Version 2 Distance Measure: (GLR+ICR) 14.18% of AHSC Stopping Method: ICR Modified Version 3 Distance Measure: GLR 16.51% of AHSC Stopping Method: ICR Modified Version 3 Distance Measure: (GLR+ICR) 19.18% of AHSC Stopping Method: ICR Selective AHSC Distance Measure: GLR 12.28% Stopping Method: ICR 63 Chapter 4 Robust Cluster Modeling for Inter-Cluster Distance Measurement in AHSC 4.1 Introduction Thus far we have tried to address the robustness problem of AHSC performance under the variation of input speech data in this dissertation. To overcome the problem in the perspective stopping point estimation, we proposed a more reliable way to determine the optimal (recursion) stopping point for AHSC across a variety of input speech data than the conventional method [13] utilizing Bayesian information criterion (BIC) [58] (Chapter 2). Specifically we defined a new statistical distance measure between clusters, i.e.,informationchangerate(ICR),andappliedittostoppingpointestimationforAHSC based on its superiority to BIC in terms of robustness to input data variation. To tackle therobustnessproblemofAHSCperformancefromtheviewpointofinter-clusterdistance measurement, on the other hand, we claimed and verified in Chapter 3 that short speech segments (< 3s, in general) among input speech data degraded the accuracy of picking up the closest cluster pair especially at the early recursion steps of AHSC, and proposed a variety of schemes to prevent short input speech segments from negatively affecting distance measurement of clusters. All of the proposed schemes were empirically verified to offer clustering performance improvement particularly for the input data suffering from the negative effect of short speech segments, meaning that they can enhance the 64 reliability of AHSC performance against short input speech segments. In addition, we introducedanewinter-clusterdistancemeasurebycombininggeneralizedlikelihoodratio (GLR)[26]andICR.ThismetricmitigatedtheaccuracydegradationofGLR-basedinter- cluster distance measurement at the late recursion steps of AHSC, caused by unbalanced speaking time distribution over speaker sources in input speech data. In this chapter we tackle the robustness problem of AHSC performance across input speech data, in terms of statistical cluster modeling for inter-cluster distance measure- ment. This work was motivated by the reasoning that ideal cluster modeling for inter- cluster distance measurement within the framework of AHSC should account for variable cluster size, which grows when clusters are merged, and be dynamic enough to represent the statistical changes of data in clusters throughout the entire AHSC procedures. Since such changes in clusters during AHSC largely depend upon a number of input data char- acteristics, cluster modeling without dynamic representation capability would be affected by input data variation, which is undesirable for reliable AHSC performance under the variation of input speech data. Conventional cluster modeling approaches using either single Gaussian distributions or Gaussian mixture models (GMMs) are not ideal in this regard. 
We introduce a novel cluster modeling approach with dynamic representation capability in this chapter. In this regard, the chapter is organized as follows. In Section 4.2, we re-investigate GLR-based inter-cluster distance measurement, which is a general method to statistically choose the closest cluster pair at every recursion step of AHSC. This investigation leads us to why reliable statistical cluster modeling is important for inter-cluster distance measurement in AHSC. Then we examine the aforementioned con- ventional cluster modeling approaches for this GLR-based distance measurement frame- work. Through this we show the merits and demerits of the conventional approaches in terms of cluster representation capability and computational complexity. In Section 4.3, we propose a new cluster modeling approach using incremental Gaussian mixture models (IGMMs) and compare it with the conventional approaches. The comparison ver- ifies that the proposed method not only provides improved clustering performance but 65 also has moderate computational cost so that it is feasible in practice. In Section 4.4, we provide concluding remarks and future research directions regarding speaker-specific data modeling. 4.2 Inter-Cluster Distance Measurement for AHSC Inter-cluster distance measurement is a critical part in AHSC of selecting the closest pair of clusters (in terms of speaker-specific characteristics) for merging at every recursion step. Because of the recursiveness of AHSC, erroneous selection of merging clusters at any given recursion step would affect subsequent recursion steps and might result in the significant degradation of overall clustering performance in the end. Therefore, precise selection of merging clusters is desirable at every recursion step of AHSC. Ingeneral,clusterdistanceisstatisticallymeasuredwithintheframeworkofAHSC.A typical method [26], i.e., GLR described in Section 2.3.1, calculates inter-cluster distance by comparing likelihoods for two hypotheses on the clusters considered. The details of this method are re-presented in the next subsection, for better understanding of the rest of this chapter. 4.2.1 GLR-based statistical inter-cluster distance measurement Let us consider a certain recursion step during AHSC. Suppose that a pair of clusters x = {x 1 ;x 2 ;··· ;x M } and y = {y 1 ;y 2 ;··· ;y N } are given for distance measurement. Then, GLR for the given pair is computed as follows: GLR(x;y) = P (x;y|H 1 ) P (x;y|H 2 ) ; (4.1) where • H 1 (unmerging hypothesis): x and y are hypothesized to be left unmerged, 66 • H 2 (merging hypothesis): x and y are hypothesized to be merged so as to become a new, larger cluster z, where z=x∪y. By this way distance for every pair of clusters at the recursion step considered can be measured in terms of GLR, and the cluster pair having the smallest GLR value is chosen to be merged. 4.2.2 Conventional cluster modeling approaches Note that each hypothesis in this method of measuring distance between clusters is mod- eled by some probabilistic distribution. This is called hypothesis or cluster modeling, by which all the clusters considered for distance measurement (x, y, and z) are represented byPDFs,respectively. Thus,properdistributionselectionforclustermodelingisveryim- portant for precise distance measurement of clusters. In this subsection, we examine two conventional distributions for cluster modeling in the research field of speaker clustering. 
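To make the role of cluster modeling in Eq. (4.1) concrete before examining the candidate distributions, the following minimal Python sketch scores the two hypotheses with maximum-likelihood full-covariance Gaussians (one of the conventional choices discussed next) and picks the closest pair. It is an illustration only, not the implementation used in our experiments.

import numpy as np
from scipy.stats import multivariate_normal

def cluster_loglik(frames):
    # Log-likelihood of the frames under a full-covariance Gaussian fitted to
    # those same frames by maximum likelihood.
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False, bias=True)
    return multivariate_normal.logpdf(frames, mean=mean, cov=cov,
                                      allow_singular=True).sum()

def log_glr(frames_x, frames_y):
    # Log of Eq. (4.1): log-likelihood under H1 (x and y modeled separately)
    # minus log-likelihood under H2 (x and y pooled into z and modeled jointly).
    frames_z = np.vstack([frames_x, frames_y])
    return (cluster_loglik(frames_x) + cluster_loglik(frames_y)
            - cluster_loglik(frames_z))

def closest_pair(clusters):
    # clusters: list of (num_frames, dim) feature arrays; returns the indices
    # of the pair with the smallest GLR, i.e. the pair chosen for merging.
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda p: log_glr(clusters[p[0]], clusters[p[1]]))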
4.2.2.1 Single Gaussian cluster modeling One of conventional selection of PDFs for cluster modeling is to use a single Gaussian distribution. In this approach, all the three clusters aforementioned are modeled by multivariate normal PDFs, N x =N(m x ;Σ x ), N y =N(m y ;Σ y ), and N z =N(m z ;Σ z ). The sample mean vectors (m x , m y , and m z ) and (full) covariance matrices (Σ x , Σ y , and Σ z ) are determined by way of maximizing the likelihoods of x, y, and z for N x , N y , and N z , respectively. As a result, Eq. (4.1) can be rewritten as follows: GLR(x;y) = ln p(x;y|H 1 ) p(x;y|H 2 ) = ln p(x| N x )·p(y| N y ) p(z| N z ) = ln p(x;m x ;Σ x )·p(y;m y ;Σ y ) p(z;m z ;Σ z ) : (4.2) 67 This single Gaussian cluster modeling approach has been popular mainly due to its small computational cost. For example, the right-hand side in Eq. (4.2) can be simplified by the following relation [13]: ln p(x;m x ;Σ x )·p(y;m y ;Σ y ) p(z;m z ;Σ z ) = M +N 2 lndet(Σ z )− M 2 lndet(Σ x )− N 2 lndet(Σ y ); (4.3) wheredet(·)standsforamatrixdeterminantoperator. Hencewecanavoiddirectcompu- tation of the likelihoods, i.e., p(x;m x ;Σ x ), p(y;m y ;Σ y ), and p(z;m z ;Σ z ), which would require more processing time as the cardinalities ofx, y, andz increase. Another advan- tageous factor of this cluster modeling approach in terms of computational complexity is relatively simple parameter estimation for Gaussian PDFs. Specifically, m z and Σ z can be simply calculated from m x , m y , Σ x and Σ y using the following closed-form relations, instead of direct maximum likelihood estimation from z: m z = M·m x +N·m y M +N (4.4) and Σ z = M·Σ x +N·Σ y M +N + M·m x m T x +N·m y m T y M +N −m z m T z : (4.5) Therefore, in this cluster modeling approach, there is no need to estimate model pa- rameters for merging-hypothesized clusters at distance measurement during AHSC. This reduces a lot of computational cost over the entire clustering procedures, especially as cluster size increases. However,thisapproachhasacriticalissueintermsofrepresentationcapability. Single Gaussian distributions are known to have limited capability in representing the statis- tical characteristics of large speech data in terms of speaker-specific properties [51–53]. 68 Considering that the average size of the clusters handled by AHSC increases as merg- ing recursions continue, one-mode PDFs like normal PDFs could degenerate inter-cluster discernibility in terms of speaker-specific characteristics especially at the late recursion steps of AHSC, and hence cause overall clustering performance to degrade severely. 4.2.2.2 GMM cluster modeling The other conventional approach for cluster modeling is to utilize GMMs as cluster models. In this approach, all the aforementioned three clusters (x, y, and z) con- sidered at distance measurement are modeled by GMMs, x = ({m i x ;Σ i x ;w i x } i=1 ), y = ({m i y ;Σ i y ;w i y } i=1 ), and z = ({m i z ;Σ i z ;w i z } i=1 ). The mean vectors (m i x , m i y , and m i z ), (diagonal) covariance matrices (Σ i x , Σ i y , and Σ i z ), and weights (w i x , w i y , and w i z ) for Gaussian mixture components are estimated by the expectation-maximization (EM) procedures [18]. The number of component mixtures in GMMs, , is empirically fixed at 8, 16 or 32 1 in general. As a consequence, Eq. 
(4.1) can be rewritten as follows: GLR(x;y) = ln p(x;y|H 1 ) p(x;y|H 2 ) = ln p(x| x )·p(y| y ) p(z| z ) = ln ∑ i=1 w i x ·p(x;m i x ;Σ i x )· ∑ i=1 w i y ·p(y;m i y ;Σ i y ) ∑ i=1 w i z ·p(z;m i z ;Σ i z ) : (4.6) This GMM cluster modeling approach has better representation capability in terms of speaker-specific characteristics because of multiple modes (or component mixtures) and the respective weights for them, compared to the previous single Gaussian approach. Thus, this approach can provide better clustering performance overall. However, there exist some problems as well with using GMMs as cluster models. 1 These values come from Reynolds’ work [51–53] saying that GMMs with those numbers of mixture components well represent speaker-specific characteristics for speaker identification tasks. 69 First, GMMs with a fixed number 2 of mixture components cannot consider variations in cluster size throughout the entire AHSC procedures. For example, GMMs with a lot of mixture Gaussians could be overfitted for small clusters at the early recursion steps of AHSC because most initial clusters handled by AHSC usually do not contain sufficient data to train multi-component GMMs properly. On the other hand, GMMs with a small number of mixture components might not be able to fully represent the speaker-specific characteristics of large clusters at the late recursion steps as in the single Gaussian case. Totacklethisproblem,therehasbeensomeresearch[14,45,60,62]toadjustthenumberof mixture components during AHSC in proportion to cluster size based on certain criteria, but they cost a lot of computational burden for mixture number selection for each GMM in addition to the EM procedures that have already high computational complexity. In this regard, it is necessary to dynamically represent clusters during AHSC with low (or at least moderate) computational complexity. The second issue in the GMM cluster modeling approach is that, although they re- quire a significant amount of processing time in proportion to cluster size and mixture number, the EM procedures might degrade discernibility between clusters in terms of speaker-specific characteristics. Let us consider some examples shown in Figure 4.1, which presents simple test results about the effectiveness of the EM procedures in the GMM cluster modeling approach with 16 mixture components in each GMM. In this figure, each subfigure compares GLR distances between two pairs of clusters along with the number of iterations in the EM procedures. One pair comes from the same speaker source (black curve), meaning that cluster distance should be relatively close in terms of speaker-specific characteristics, while the other is from different sources (grey curve). In- terestingly,wecanobservefromtheleftmostsubfigurethatdistancefortheheterogeneous cluster pair, presented by the grey curve, gets smaller than that for the homogeneous one as iterations continue, which is undesirable because distance between homogeneous clus- ters (in terms of speaker-specific characteristics) should be always less than that between 2 It is common that the number of mixture components is universally set throughout the whole AHSC procedures as an empirically reasonable value like 8, 16 or 32. 
70 1 20 40 60 80 100 −300 −200 −100 0 100 200 300 400 Inter−Cluster Distance 1 20 40 60 80 100 −300 −200 −100 0 100 200 300 400 Number of Iterations in EM 1 20 40 60 80 100 −300 −200 −100 0 100 200 300 400 Figure 4.1: Effectiveness of the EM procedures in the GMM cluster modeling approach for GLR-based inter-cluster distance measurement. Each subfigure compares distances between two pairs of clusters along with the number of iterations in the EM procedures for GMMs with 16 mixture components. One pair comes from the same speaker source (black curve) while the other is from different sources (grey curve). heterogeneous ones in the ideal case. Even from the other subfigures, the EM procedures do not significantly widen distance between the two pairs of clusters compared, indicat- ing that the EM procedures do not much improve inter-cluster discernibility in terms of speaker-specific characteristics. These observations can be explained as follows: the EM proceduresiterativelyadapttheparametersofanyGMMtowardoptimizationintermsof maximum likelihood and thus increase p(z| z ) in Eq. (4.6) regardless of speaker-specific homogeneity between the clusters considered. However, this does not help increase inter- cluster discernibility and may even make it worse, as shown in the leftmost subfigure in Figure 4.1. Another issue in this cluster modeling approach is that the EM procedures are af- fected by random initialization in the beginning and result in different estimation of model parameters for GMMs every session, which might cause the variation of clustering performance for the same input speech data. 71 4.2.2.3 Experimental comparison We have thus far examined the two conventional cluster modeling approaches within the framework of GLR-based inter-cluster distance measurement for AHSC, and listed the merits and demerits of each approach. In this subsection, we verify our examination by comparing the two approaches empirically. 1) Experimental setup Before we start, we need to take a look at speech data and setup for the entire experiments in this chapter. Table 4.1 presents input speech data for AHSC. These data sources are 15 sets of speech segments with approximately 4hr-long total duration, and were randomly chosen from the ICSI, NIST, and ISL meeting corpora. They are distinct from one another in terms of the number of speaker sources (N s ), gender distribution over speaker sources, total utterance time (T s ), number of speech segments (N t ), and average segment length (T a ). For preparing each set of speech segments, we manually segmented each audio clip at every point of speaking turn changes according to the given reference transcription beforehand. In order to avoid any potential confusion in performance analysis that might result from overlaps between segments, we excluded all the segments involved in any overlap during data preparation. Inallexperimentsinthischapter,weassumethatstoppingpointestimationforAHSC is optimal, i.e, the optimal stopping point where extra merging in AHSC would not im- prove speaker error time rate any further can be exactly estimated for every data source. Forthisassumptiontogetrealized,wemanuallystoppedAHSCwherethelowestspeaker error time rate would be achieved, because we only focus on inter-cluster distance mea- surement in this chapter, not stopping point estimation. 2) Comparison Table 4.2 shows performance comparison of the two conventional cluster modeling approachesintermsofspeakererrortimerate. 
FortheGMMclustermodelingapproach, 72 Table4.1: Datasource. N s : numberofspeakersources(male:female),T s : totalutterance time (sec.), N t : number of speech segments, and T a : average segment length (sec.). Data Source 1 2 3 4 5 6 7 8 N s 7 (5:2) 7 (5:2) 7 (6:1) 6 (4:2) 5 (1:4) 6 (5:1) 5 (5:0) 4 (4:0) T s 1064.9 931.3 2336.3 1148.5 805.1 1664.9 1609.1 1475.9 N t 418 279 611 244 228 532 591 478 T a 2.5 3.3 3.8 4.7 3.5 3.1 2.7 3.1 Data Source 9 10 11 12 13 14 15 N s 9 (7:2) 4 (3:1) 4 (3:1) 6 (4:2) 8 (4:4) 4 (2:2) 4 (0:4) T s 659.7 443.4 835.7 624.1 272.4 477.7 429.1 N t 159 75 179 144 93 119 95 T a 4.1 5.9 4.7 4.3 2.9 4.0 4.5 we chose 4 different values, i.e., 4, 8, 16, and 32, as the number of mixture components in GMMs. The lowest error rate in each column (or each data source) is bold-faced. From this table we can first observe that the GMM approach is better than the single Gaussian approach in terms of overall performance, except for the case 3 of 4 mixture components in the GMM approach. Other than a few data sources (Data 1, 4, 5, and 11) GMMs provide better clustering accuracy and, even for Data 1, 4, 5, and 11, difference between the clustering error rates of AHSC by the two cluster modeling approaches is not that significant, which verifies our previous statement that the GMM approach has better representation capability for modeling clusters and thus provides better clustering performance overall. However, the results in this table also show the difficulty to set the proper number of mixture Gaussians in the GMM cluster modeling approach. 8 mixture Gaussians fit to Data 7, 9, 10, 13, and 14 while 16 mixtures are better for Data 3, 6, and 8. For 3 In this case, we can see that only 4 mixture Gaussians (with diagonal covariance matrices) in GMMs are not enough to represent speaker-specific characteristics in cluster data, which can be suppoted by the previous research in [53]. 73 Table 4.2: Performance comparison of the two conventional cluster modeling approaches in terms of speaker error time rate (%). N: single Gaussian cluster modeling and x : GMM cluster modeling with x mixture components. Data Source 1 2 3 4 5 6 7 8 N 7.0 19.3 10.6 2.7 15.1 7.5 13.2 25.6 4 30.9 20.8 25.8 9.3 25.2 21.5 14.0 20.4 8 13.7 13.8 8.1 8.6 29.5 7.4 7.0 21.5 16 12.9 16.9 6.2 4.1 20.2 7.0 10.9 16.3 32 10.3 10.0 18.4 4.7 15.9 9.5 9.1 26.4 Data Source 9 10 11 12 13 14 15 Avg. N 23.6 7.6 9.1 9.7 23.4 27.0 29.2 15.4 4 20.6 13.1 27.8 12.4 29.4 27.6 38.7 22.5 8 10.9 2.5 9.9 13.4 10.5 22.8 30.2 14.0 16 12.3 7.8 10.3 9.1 33.3 27.1 29.2 14.9 32 12.0 5.2 10.0 8.7 28.7 30.7 28.7 15.2 Data 2, 12, and 15, 32 mixtures are relatively superior to 8 or 16. This suggests that in order to obtain the best clustering performance in the GMM cluster modeling approach we need to optimize the number of mixture components for every data source. This is impractical because there is no theoretical way yet to find out the optimal number of mixture Gaussians in GMMs depending upon input data sources. Thus, there is still a need for a more adaptive and reliable way of modeling clusters during AHSC. Another demerit of the GMM cluster modeling approach is well depicted in Figure 4.2, which shows comparison of the processing time 4 of AHSC depending upon cluster modeling approaches. From this figure it is so easy to confirm that the GMM approach requires by far more time than the single Gaussian approach. 
For example, the process- ing time for Data 3 by the GMM approach with 32 mixture components is more than 4 For this experiment, an MS-Windows machine with the Intel Pentium-4 3.2GHz CPU was used. The number of iterations in the EM procedures for each GMM parameter estimation was fixed at 15. After 15 iterations we can empirically assume that there is no significant change in likelihoods for GMMs. 74 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 20000 40000 60000 80000 Data Source (a) Processing Time (s) GMM (32) GMM (16) GMM (8) GMM (4) Gaussian 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 200 400 600 Data Source (b) Processing Time (s) Figure 4.2: Processing time comparison of the two conventional cluster modeling ap- proaches. (For the GMM approach, four different mixture numbers are compared, i.e., 4, 8, 16, and 32.) (a) Full-shot version. (b) Zoomed-in version. 22hrs and is approximately 1500 times that by the single Gaussian approach, which is prohibitive in practice. This high computational cost is mainly due to the EM proce- dures for GMM parameter estimation and is unavoidable in the GMM cluster modeling approach, becauseintheEMprocedurestherearenoclosedformrelationslikeEqs. (4.4) and (4.5) and thus the parameters of the merging-hypothesized clusters used for GLR- based inter-cluster distance measurement should be newly estimated for every possible clusterpair. Therefore,theprocessingtimeoftheGMMapproachexponentiallyincreases in proportion to the total amount of input speech data. Figure 4.3 presents another minor issue in the GMM cluster modeling approach, i.e. performancevariationduetoinitialrandomnessintheEMproceduresforGMMs. Inthis figure clustering performance for Data 13 is shown, and we can clearly see the session- to-session variation of speaker error time rate in every number of mixture Gaussians considered. The variations of the error rates in all the GMM approaches are so large that we cannot claim that the GMM approach is in general better than the single Gaussian 75 GMM (4) GMM (8) GMM (16) GMM (32) 10 15 20 25 30 35 40 GMM−based Cluster Modeling Approach Speaker Error Time Rate (%) Figure 4.3: Clustering performance variation for Data 13 in the GMM cluster modeling approach. The circles denote speaker error time rates for the respective 10 sessions of the GMM approach with four different numbers of mixture components (i.e., 4, 8, 16, and 32), andtheboldcrossesarethecorrespondingmeanvalues. Thehorizontallinepresents the speaker error time rate obtained from the single Gaussian cluster modeling approach, which is 23.4%. approach in terms of performance. Only the GMM approach with 8 mixture components shows lower error rates than the single Gaussian approach despite such performance vari- ation 5 . However, asaforementioned, thevalueof8mixturecomponentsisnotuniversally optimal across data sources. Fromalltheseexaminationsandcomparisonsofthetwoconventionalclustermodeling approaches for GLR-based inter-cluster distance measurement within the framework of AHSC in this section, we are able to confirm a need for a new cluster modeling approach, whichnotonlyrequiresmoderatecomputationalcostbutalsohasdynamicrepresentation capability. In the next section, we propose such an alternative method to overcome the disadvantages of the conventional cluster modeling approaches. 5 Nevertheless we can still see one outlier worse than the speaker error time rate by the single Gaussian approach in Figure 4.3. 
76 4.3 Incremental Gaussian Mixture Cluster Modeling For GLR-based inter-cluster distance measurement within the framework of AHSC, ideal cluster modeling should: • Keep clusters well represented in terms of speaker-specific characteristics through- outthewholeAHSCproceduresasclustersizescontinuetoincreaseduetomerging recursions. • Be reliable in terms of performance across data sources or sessions. • Have moderate computational complexity so that it is feasible in practice. To achieve all these, we introduce a novel cluster modeling approach in this section, named incremental Gaussian mixture cluster modeling. 4.3.1 Proposed cluster modeling approach For this new cluster modeling approach, we devise a simple but dynamic distribution for AHSC, called an incremental Gaussian mixture model (IGMM), which increments mixture components from one Gaussian to multiple Gaussians by summing the PDFs of therespectivedistributionsformergingclusterstorepresentanewlymergedcluster. The details of the incremental Gaussian mixture or IGMM cluster modeling approach are as follows: • Every initial cluster is modeled by a multivariate normal distribution. • Any newly merged or merging-hypothesized cluster is modeled by a distribution whose PDF is determined by the weighted sum of the PDFs of the respective dis- tributions for the two clusters involved in (potential) merging. The weights for the two PDFs are the normalized cardinalities of the clusters considered, respectively. In this approach, all the three clusters (i.e., two clusters under consideration: x and y, and a merging-hypothesized cluster: z) considered within the framework of GLR- based inter-cluster distance measurement are thus represented by IGMMs, IGMM x = 77 IGMM({m i x ;Σ i x ;w i x } x i=1 ), IGMM y =IGMM({m i y ;Σ i y ;w i y } y i=1 ),and IGMM z =IGMM({m i z ; Σ i z ;w i z } z i=1 ). The IGMM parameters are given as below: • ThenumbersofmixtureGaussiansin IGMM x and IGMM y , x and y ,areequaltothe numbersoftheinitialclustersthathavebeenmergedtomakexandy,respectively. In general, x ̸= y . Since we can express x and y as x ={x 1 ;x 2 ;···;x x } and y={y 1 ;y 2 ;···;y y }, respectively, where{x i } x i=1 and{y i } y i=1 areallinitialclusters, and z = x∪y ={x 1 ;x 2 ;···;x x ;y 1 ;y 2 ;···;y y }, then the number of mixture components in IGMM z is z = x + y . • The weights w i x and w i y are |x i | ∑ x j=1 |x j | and |y i | ∑ y j=1 |y j | , respectively, where|·| de- notes cardinality. Thus, {w i z } x i=1 = { |x i | ∑ x j=1 |x j |+ ∑ y j=1 |y j | } x i=1 and{w i z } z i=x+1 = { |y i | ∑ x j=1 |x j |+ ∑ y j=1 |y j | } y i=1 . • The mean vectors and (full) covariance matrices in IGMM x and IGMM y are those of model distributions for the constituent initial clusters of x and y, respectively, i.e., IGMM({m i x ;Σ i x } x i=1 )={N(m x i;Σ x i)} x i=1 andIGMM({m i y ;Σ i y } y i=1 )={N(m y i;Σ y i) } y i=1 . Thus,IGMM({m i z ;Σ i z } x i=1 )={N(m x i;Σ x i)} x i=1 andIGMM({m i z ;Σ i z } z i=x+1 )= {N(m y i;Σ y i)} y i=1 . Asaconsequence,Eq. 
(4.1)canberewritteninthisclustermodelingapproachasfollows: GLR(x;y) = ln p(x;y|H 1 ) p(x;y|H 2 ) = ln p(x| IGMM x )·p(y| IGMM y ) p(z| IGMM z ) = ln ∑ x i=1 w i x ·p(x;m i x ;Σ i x )· ∑ y i=1 w i y ·p(y;m i y ;Σ i y ) ∑ z i=1 w i z ·p(z;m i z ;Σ i z ) = ln ∑ x i=1 w i x ·p(x;m x i;Σ x i)· ∑ y i=1 w i y ·p(y;m y i;Σ y i) ∑ x i=1 w i z ·p(z;m x i;Σ x i)+ ∑ y i=1 w x+i z ·p(z;m y i;Σ y i) : (4.7) 78 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 5 10 15 20 25 30 Data Source Speaker Error Time Rate (%) Gaussian GMM IGMM Figure 4.4: Performance comparison of the proposed and two conventional cluster mod- eling approaches in terms of speaker error time rate (%). For this comparison, the best performance of the GMM approach for each data source was chosen among the 4 candi- dates (4, 8, 16, and 32 mixture components). 4.3.2 Comparison and analysis The proposed IGMM cluster modeling approach has several merits compared to the two conventional approaches. The first advantage is that the numbers of mixture compo- nentsinIGMMskeepincreasingduringAHSC.Thisisbecausethenumberofcomponent mixtures in the IGMM representing any newly merged cluster is determined by the sum of the numbers of mixture Gaussians in the IGMMs representing the clusters involved in merging. In other words, smooth transition from a single Gaussian distribution for modelingeveryinitialclustertomultipleGaussianmixturesforlargerclustersgenerating from merging recursions occurs during AHSC. For this reason, the IGMM cluster model- ingapproachcanprovidedynamicclusterrepresentationcapabilitythroughoutthewhole AHSCprocedures. Consideringthatbothoftheconventionalclustermodelingapproaches have limitation in this regard, i.e., limited representation capability for large clusters in the single Gaussian approach and overfitting for small clusters in the GMM approach, we 79 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 20000 40000 60000 80000 Data Source (a) Processing Time (s) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 200 400 600 Data Source (b) Processing Time (s) GMM (32) GMM (16) GMM (8) GMM (4) Gaussian IGMM Figure 4.5: Processing time comparison of the proposed and two conventional cluster modeling approaches. (For the GMM approach, four different mixture numbers are com- pared, i.e., 4, 8, 16, and 32.) (a) Full-shot version. (b) Zoomed-in version. can say that this new cluster modeling approach compromises the two conventional ap- proaches efficiently. As a consequence, it can provide better clustering performance than the two conventional approaches. This claim is verified by experimental results in Figure 4.4, which compares the proposed and two conventional cluster modeling approaches in terms of speaker error time rate. From this figure, we can easily observe that the pro- posed approach provides much better clustering performance than the single Gaussian approach by around 30% (relative) on average over the entire 15 data sources while it gives as comparable error rate as the GMM approach. Note that, in this comparison, we chose the best performance among the 4 candidates (4, 8, 16, and 32 mixture compo- nents) for each data source in the case of the GMM approach. Considering this, we can insist that the proposed approach has even better performance than the GMM approach. According to our comparison test (not shown here), average speaker error time rate by theIGMMclustermodelingapproachislowerthanthatbythehighest-performingGMM approach with 8 mixture components by approximately 20% (relative). 
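To make the IGMM construction behind Eq. (4.7) concrete, the following minimal Python sketch keeps one full-covariance Gaussian per initial cluster, merges models by concatenating their component lists, and weights each component by the cardinality of its initial cluster. The names initial_igmm, merge_igmm, and igmm_loglik are illustrative; this is a sketch rather than the implementation used for the experiments reported above.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def initial_igmm(frames):
    # One full-covariance Gaussian per initial cluster; the frame count is
    # stored so that cardinality-based weights can be formed later.
    return [(len(frames), frames.mean(axis=0),
             np.cov(frames, rowvar=False, bias=True))]

def merge_igmm(igmm_x, igmm_y):
    # Merging two clusters simply concatenates their component lists;
    # no EM re-estimation is performed.
    return igmm_x + igmm_y

def igmm_loglik(frames, igmm):
    counts = np.array([c for c, _, _ in igmm], dtype=float)
    log_w = np.log(counts / counts.sum())            # cardinality weights
    comp = np.column_stack([multivariate_normal.logpdf(frames, mean=m, cov=S,
                                                       allow_singular=True)
                            for _, m, S in igmm])
    return logsumexp(comp + log_w, axis=1).sum()

def log_glr_igmm(frames_x, igmm_x, frames_y, igmm_y):
    # Log of Eq. (4.7).
    frames_z = np.vstack([frames_x, frames_y])
    return (igmm_loglik(frames_x, igmm_x) + igmm_loglik(frames_y, igmm_y)
            - igmm_loglik(frames_z, merge_igmm(igmm_x, igmm_y)))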
80 The second merit of the proposed cluster modeling approach is that despite many mixture Gaussians in IGMMs the approach requires only moderate computational com- plexity, which makes it much more feasible in practice than the GMM approach. Figure 4.5 shows this advantage of the IGMM approach by comparing processing time. It is clear from this figure that the computational cost of the IGMM approach is a lot less than that of the GMM approach, and as comparably small as that of the single Gaussian approach. For instance, the processing time of the IGMM cluster modeling approach for Data 3 is impressively about 3.5mins while that of the GMM approach with 32 mixture components is more than 22hrs (also mentioned in Section 4.2.2.3). The interesting fact is that according to our observation there are 5 clusters finally remaining at the optimal stopping point for AHSC on Data 3, and the numbers of mixture Gaussians in the cor- responding IGMMs for the clusters are 343, 91, 78, 68, and 31, respectively. The main reason why the IGMM approach has relatively low computational complexity although therearemuchmoremixturecomponentsinvolvedthantheGMMapproachisthatthere is no complex process for parameter estimation like the EM procedures. Instead, like the single Gaussian approach, the right-hand side in Eq. (4.7) can be simplified into a closed form based on cross-likelihood 6 between initial clusters. To verify this, let us go back to Eq. (4.7). From the definition of cross-likelihood, the first term of the numerator in Eq. (4.7) can be rewritten as x ∑ i=1 w i x ·p(x;m x i;Σ x i) = x ∑ i=1 w i x ·p(x 1 ;x 2 ;···;x x ;m x i;Σ x i)= x ∑ i=1 w i x x ∏ j=1 p(x j ;m x i;Σ x i)= x ∑ i=1 w i x x ∏ j=1 L x j |x i: (4.8) 6 Wedefinecross-likelihoodbetweeninitialclustersasfollows. Supposethatwehavetwoinitialclusters x i and x j , and the respective single Gaussian models N x i = N(m x i;Σ x i) and N x j = N(m x j;Σ x j). The cross-likelihoods of the two clusters,L x i |x j andL x j |x i, are defined asp(x i ;m x j;Σ x j) andp(x j ;m x i;Σ x i), respectively. 81 Similarly, the second term of the numerator can be simplified into y ∑ i=1 w i y ·p(y;m y i;Σ y i) = y ∑ i=1 w i y y ∏ j=1 L y j |y i: (4.9) The denominator can be also rewritten in the similar way as x ∑ i=1 w i z ·p(z;m x i;Σ x i)+ y ∑ i=1 w x+i z ·p(z;m y i;Σ y i) = x ∑ i=1 w i z ·p(x 1 ;···;x x ;y 1 ;···;y y ;m x i;Σ x i)+ y ∑ i=1 w x+i z ·p(x 1 ;···;x x ;y 1 ;···;y y ;m y i;Σ y i) = x ∑ i=1 w i z x ∏ j=1 p(x j ;m x i;Σ x i) y ∏ j=1 p(y j ;m x i;Σ x i)+ y ∑ i=1 w x+i z x ∏ j=1 p(x j ;m y i;Σ y i) y ∏ j=1 p(y j ;m y i;Σ y i) = x ∑ i=1 w i z x ∏ j=1 L x j |x i y ∏ j=1 L y j |x i + y ∑ i=1 w x+i z x ∏ j=1 L x j |y i y ∏ j=1 L y j |y i: (4.10) Eq. (4.7) thus can be rewritten as GLR(x;y) = ln ∑ x i=1 w i x ∏ x j=1 L x j |x i· ∑ y i=1 w i y ∏ y j=1 L y j |y i ∑ x i=1 w i z ∏ x j=1 L x j |x i ∏ y j=1 L y j |x i + ∑ y i=1 w x+i z ∏ x j=1 L x j |y i ∏ y j=1 L y j |y i : (4.11) If we calculated the cross-likelihoods of every pair of initial clusters beforehand, there would be thus no additional computational cost for direct parameter estimation and likelihood calculation at every inter-cluster distance measurement in the IGMM cluster modeling approach. The cross-likelihood computation does not take relatively long as empirically verified by Figure 4.5 although its complexity increases in proportion to the number of initial clusters. 
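A minimal sketch of this cross-likelihood shortcut is given below: the table of log cross-likelihoods between initial clusters is computed once, and the log-GLR of Eq. (4.11) is then evaluated from sums over table entries in the log domain (a log-sum-exp over components), with no further density evaluations during clustering. The function names and the list-of-indices bookkeeping are assumptions of the sketch.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_cross_likelihood_table(initial_clusters):
    # L[i, j] = log p(x_j ; m_{x_i}, Sigma_{x_i}): log-likelihood of the frames
    # of initial cluster j under the Gaussian fitted to initial cluster i.
    params = [(c.mean(axis=0), np.cov(c, rowvar=False, bias=True))
              for c in initial_clusters]
    n = len(initial_clusters)
    L = np.empty((n, n))
    for i, (m, S) in enumerate(params):
        for j, c in enumerate(initial_clusters):
            L[i, j] = multivariate_normal.logpdf(c, mean=m, cov=S,
                                                 allow_singular=True).sum()
    return L

def log_glr_from_table(L, idx_x, idx_y, counts):
    # idx_x, idx_y: lists of the initial-cluster indices currently making up
    # clusters x and y; counts: number of frames in each initial cluster.
    w = np.asarray(counts, dtype=float)
    idx_z = list(idx_x) + list(idx_y)
    log_wx = np.log(w[idx_x] / w[idx_x].sum())
    log_wy = np.log(w[idx_y] / w[idx_y].sum())
    log_wz = np.log(w[idx_z] / w[idx_z].sum())
    # Eqs. (4.8)-(4.10) in the log domain: each weighted sum of products of
    # cross-likelihoods becomes a log-sum-exp of summed log table entries.
    log_px = logsumexp(log_wx + L[np.ix_(idx_x, idx_x)].sum(axis=1))
    log_py = logsumexp(log_wy + L[np.ix_(idx_y, idx_y)].sum(axis=1))
    log_pz = logsumexp(log_wz + L[np.ix_(idx_z, idx_z)].sum(axis=1))
    return log_px + log_py - log_pz                  # log-GLR of Eq. (4.11)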
Another merit of the proposed approach is that, because no random initialization is involved, performance does not vary from session to session, in contrast to the GMM approach. This advantage further boosts the reliability of AHSC performance.

4.4 Conclusions

In this chapter we proposed the incremental Gaussian mixture cluster modeling approach for GLR-based inter-cluster distance measurement within the framework of AHSC. The proposed approach addresses the limitations of the two conventional cluster modeling approaches, i.e., single Gaussian and GMM cluster modeling, by smoothly updating cluster models from normal distributions to GMMs with multiple mixture components during AHSC. Apart from this, the IGMM approach requires only moderate computational cost compared to the GMM approach. With this low-complexity and dynamic cluster modeling approach, we obtained clustering performance improvements in terms of speaker error time rate of approximately 30% and 20% (relative) against the single Gaussian and GMM approaches, respectively. These improvements suggest that the proposed cluster modeling approach enhances the reliability of AHSC performance across input speech data.

The results of this work could be extended to speaker modeling in the research field of speaker recognition. Currently, speaker modeling is typically based on GMMs with a fixed number of mixture components such as 16 or 32, but it is still unknown how many mixture Gaussians are necessary for optimal modeling of speaker-specific characteristics. Intuitively, the number should be speaker-dependent, but there is as yet no canonical method to derive the proper number of mixture components in GMMs for speaker-specific representation of data. The cluster modeling approach proposed in this chapter does not require any fixed number of mixture Gaussians beforehand, so it could be a good alternative to conventional GMM-based speaker modeling.

Chapter 5 Reliable Speaker Diarization based on Robust AHSC

5.1 Introduction

In this chapter we extend our research results on robust speaker clustering under the variation of input speech data toward one application domain by applying them to one of the main applications of speaker clustering, i.e., speaker diarization. In the research field of speaker diarization, the robustness of diarization performance has been a major issue as well, and it stems largely from speaker clustering, which plays a decisive role in current state-of-the-art speaker diarization systems. This chapter is organized as follows. In Section 5.2, we first take a general but closer look at speaker diarization and its various uses. In Section 5.3, based on the research results in Chapters 2, 3, and 4, we propose our own speaker diarization system equipped with sequential clustering for speaker change detection, and ICR-based stopping point estimation and IGMM cluster modeling for speaker clustering. In Section 5.4, we also propose clustering performance refinement schemes in the framework of speaker diarization, which can enhance the reliability of diarization performance across data sources. We conclude this chapter in Section 5.5 with final remarks on robust speaker clustering from a speaker diarization perspective.
84 Transcription of Who Spoke When (b) Speaker Change Detection & Speaker Clustering Feature Extraction Speech/Non-speech Detection Audio Source Audio Source Speech 1 Speech 2 Speech 3 Speech 4 Speech 5 Speech by Speaker 1 Speech by Speaker 2 Speech by Speaker 3 Speech by Speaker 2 Speech by Speaker 1 Speech by Speaker 3 Speech by Speaker 2 0 T 0 T 0 T (a) Non-speech Regions Figure 5.1: Speaker diarization: (a) Block diagram of a speaker diarization system. (b) Step-by-stepgraphicalinterpretationofhowa givenaudiosource istranscribed(in terms of “who spoke when”) by speaker diarization. 5.2 Speaker Diarization Speaker diarization refers to the automatic process of dividing a given audio source, predominantly using speech, into speaker-specific segments by transcribing it in terms of “who spoke when” [63]. Such speaker-specific segmentation done by speaker diarization can be beneficial and have many applicable areas, such as for automatic speech recog- nition. For instance, speaker diarization enables selecting speaker-specific data that can be utilized for unsupervised speaker adaptation. It also can help provide statistics that rely on speaker-specific information, such as frequency of speaking turn change, average speaking time per turn, number of speakers, speaking time distribution over speakers, and so on. These statistics are useful for multimedia content analysis. Because of its broad significance, speaker diarization is currently regarded as one of the main categories evaluated in the Rich Transcription (RT) Evaluation led by NIST. Many state-of-the-art speaker diarization systems have a basic structure in common as shown in Figure 5.1, consisting of three main steps following audio feature extraction. One is speech/non-speech detection, which separates target speech regions from a given audio source. The others are speaker change detection and speaker clustering. Speaker 85 change detection identifies potential speaker changing points in each speech region, and further divides the speech region into smaller speaker-specific segments. Speaker cluster- ing classifies the resultant segments by speaker identity to append a unique label to the segments belonging to the same speaker class. These two steps are in general performed in the order mentioned, i.e., speaker change detection followed by speaker clustering. The overall performance of speaker diarization systems is evaluated by diarization error rate (DER). This performance indicator for speaker diarization is defined by NIST as the sum of three constituent error rates: false alarm speaker time rate, missed speaker time rate, and speaker error time rate. The first twos jointly indicate how precisely speech/non-speech detection is performed, while the last one solely tells how well speaker changedetectionandspeakerclusteringcoordinate. Recentresearchpapersdemonstrated the dominance of speaker error time rate over the other error rates in deciding DER, and besides its severe variability across data sources [63], [30]. Among those two steps in speaker diarization, speaker clustering is more critical than speaker change detection in terms of impact on DER. Furthermore, serial concatenation of speaker change detection and speaker clustering in typical speaker diarization systems could require speaker clustering to be more precise in terms of performance. 
Speaker change detection is typically tuned not to miss speaker changing points in given speech regions at the cost of false alarms because, if actual speaker changing points were missed duringspeakerchangedetection,therewouldbenochanceforthemtoberecovered(orre- detected) by speaker clustering. The segments unnecessarily divided due to false alarms during speaker change detection, on the other hand, could be possibly merged through speaker clustering. As a consequence, such a tuning pattern for speaker change detection in typical speaker diarization systems results in not so many detection errors, but cannot help burdening speaker clustering with a large number of short speech segments. (It is generally more difficult to classify short speech segments by speaker-specific character- istics, as we have seen in Chapter 3, than to classify long ones because speaker-specific identification requires long speech utterances (at least longer than 3 seconds) [51–53].) 86 Thus, it is very important to make speaker clustering work properly for reliable speaker diarization. In this regard, in the next section, we implement a more reliable speaker diarization systembasedonrobustAHSCapproaches,whichhavebeendealtwithinthisdissertation thus far. Specifically, we utilize ICR-based stopping point estimation and IGMM cluster modeling for this purpose. In addition, we also exploit the merit of the third modified version of AHSC in Section 3.3.3 for speech/non-speech detection and speaker change detection in this system. We call this proposed system the SAIL speaker diarization system. 5.3 SAIL Speaker Diarization System In this section we propose a novel speaker diarization system based on robust speaker clustering. The proposed SAIL speaker diarization system has the same structure as other state-of-the-art speaker diarization systems have: speech/non-speech detection, speaker change detection, and speaker clustering. It first applies a sequential clustering concept to segmentation of a given audio data source, and then performs AHSC for speaker-specific classification (or speaker clustering) of speech segments. The speaker clusteringalgorithmutilizesanIGMMclustermodelingstrategyforinter-clusterdistance measurement, and ICR-based stopping point estimation to properly stop the recursion of AHSC. Before explaining the details of each step in the system, let us describe the data sets and experimental setup used in this chapter. 5.3.1 Data Description and Experimental Setup Tables5.1and5.2presentthetwodatasets(trainingandtesting)usedfortheexperiments reported in this chapter. The training data set is used for tuning the whole speaker diarization system, while the testing data set is used for performance evaluation. All the data sources in the data sets were chosen from the ICSI, NIST, and USC meeting 87 Table 5.1: Training data set. Source Name Length (min:sec) No. of Speakers 1 ICSI Bmr018 20:01 7 2 ICSI Bro003 20:00 7 3 ICSI Bsr001 20:00 8 4 NIST 20020214 19:59 6 5 NIST 20030925 19:59 4 Table 5.2: Testing data set. Source Name Length (min:sec) No. of Speakers 1 ICSI Bdb001 19:57 5 2 ICSI Bed015 20:00 6 3 ICSI Bmr013 20:01 7 4 ICSI Bed002 20:01 6 5 NIST 20011115 17:52 4 6 NIST 20030702 20:00 4 7 NIST 20031215 19:57 5 8 USC 200804011207 17:23 5 9 USC 200804011259 6:28 4 10 USC 200804011325 19.41 4 speechcorpora, andaredistinctfromoneanotherintermsofthenumberofspeakersand meeting topics (not given in the tables). In order to measure DER, we use a scoring tool distributed by NIST, i.e., md-eval- v21.pl 1 . 
This tool calculates DER as the sum of missed speaker time rate, false alarm speaker time rate, and speaker error time rate. 5.3.2 Speech/Non-Speech Detection Theproposedspeech/non-speechdetectionstepintheSAILspeakerdiarizationsystemis basedonleader-followerclustering(LFC)[18],whichisawell-knownsequentialclustering 1 Available at http://www.nist.gov/speech/tests/rt/2006-spring. 88 Algorithm 7 Leader-Follower Clustering (LFC) Require: {x i };i=1;:::;ˆ n: data sequentially incoming : threshold Ensure: C i ;i=1;:::;n: clusters finally remaining 1: C 1 ←{x 1 };n←1;m←1 2: do m←m+1 3: ˆ C←{x m } 4: i←argmind(C j ; ˆ C);j =1;:::;n 5: if d(C i ; ˆ C)> 6: n←n+1 7: C n ← ˆ C 8: else 9: merge ˆ C into C i 10: until m= ˆ n 11: return C i ;i=1;:::;n strategy. As shown in Algorithm 7, LFC sequentially classifies incoming data, either by having them merged to existing clusters or by generating new clusters for them. Decision is made by comparing the minimum distance between each of incoming data and the existing clusters with a pre-set threshold, and continues until there are no more data available. The speech/non-speech detection step utilizes this sequential process of LFC, as follows: 1. Wedividethedatasourcegivenforspeakerdiarizationinto2s-longframes 2 without overlap, and perform LFC on all the frames. 2. LFC decides which cluster every incoming frame is the closest to, choosing from 1) the silence cluster, 2) the universal background cluster, and 3) one of the existing speaker clusters. • If 1) is selected, the frame considered is labeled as silence. • If 2) is chosen, a new speaker cluster for the frame is generated. (The frame is newly labeled as well.) 2 The reason that we select 2s as a frame length is that 2s is widely known to be the minimum window length for reasonable segmentation results [13], [64], [17]. 89 Table 5.3: Performance comparison of the proposed speech/non-speech detection process with and without updating the silence cluster, in terms of the two detection error rates for the training data set. Without Update With Update False-Alarm Rate 2.80% 2.56% Missed-Detection Rate 3.28% 4.46% Total Detection Error Rate 6.08% 7.02% • Incase3), theframeismergedtothecorrespondingspeakercluster. (Itcomes to have the same label as the other frames in the cluster.) 3. The previous step is repeated until there remain no more incoming frames. Forthisprocess, thesilenceandtheuniversalbackgroundclustershouldbegenerated prior to LFC. (For reference, there is no speaker cluster initially other than these two clusters. Speaker clusters are generated during LFC.) For the silence cluster, we gather a total of 15s of 25ms-long audio chunks with the lowest energy from the entire data source given for speaker diarization, assuming that silence spreads over the given data source with various lengths at least longer than 25ms, and that the total length of such silence chunks in the data source is at least longer than 15s overall. Empirically, 15s is consideredasenoughamounttofullyrepresentthespectralcharacteristicsofsilence. For the universal background cluster, we use the given data source entirely. This huge cluster works as if it is a source-dependent threshold for LFC, and thus we do not need to tune such a certain threshold value prior to the process as shown (as ) in Algorithm 7 in the previous page. Notethatthesilenceclusterisnotupdatedduringtheproposedsequentialspeech/non- speechdetectionprocess, whilethespeakerclusterskeepbeingupdatedthroughmerging. 
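A minimal sketch of this leader-follower pass is given below. It assumes the silence and universal background clusters have been built beforehand as described above, and that glr(frame, cluster) implements the distance of Eq. (5.1); the function name and the label bookkeeping are illustrative only.

def lfc_speech_nonspeech(frames, silence, background, glr):
    # frames: the 2-second feature blocks of the recording, in temporal order.
    speaker_clusters = []                      # created on the fly
    labels = []
    for f in frames:
        candidates = [('sil', silence), ('ubg', background)] + \
                     [(k, c) for k, c in enumerate(speaker_clusters)]
        tag, _ = min(candidates, key=lambda kc: glr(f, kc[1]))
        if tag == 'sil':
            labels.append('silence')           # non-speech frame
        elif tag == 'ubg':
            speaker_clusters.append([f])       # new speaker cluster
            labels.append(len(speaker_clusters) - 1)
        else:
            speaker_clusters[tag].append(f)    # merge into the closest cluster
            labels.append(tag)
    return labels, speaker_clusters

As in the description above, the silence cluster is left untouched throughout this pass, while the speaker clusters keep growing through merging.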
In the proposed speech/non-speech detection process, the distance between the frame considered and each of the clusters is measured by GLR [26]. For a frame F and one of the clusters C, the GLR for the two objects is given as

GLR(F, C) = \frac{p(F \mid \Theta_F)\, p(C \mid \Theta_C)}{p(F \cup C \mid \Theta_{F \cup C})}.   (5.1)

Each object and the union of the two objects are modeled by single Gaussian distributions with full covariance matrices to compute the likelihoods in the equation above; Θ is the set of parameters of each normal distribution, estimated so as to maximize the likelihood of the data in F, C, and F ∪ C under the respective model distributions.

5.3.3 Speaker Change Detection

For speaker change detection, we use the result of the preceding speech/non-speech detection process. As shown in the previous subsection, every 2s-long frame incoming to LFC is labeled either as silence or with one of the speaker tags assigned to the speaker clusters. In other words, all the frames except silence frames carry speaker tags, which means that we already have the boundary information of potential speaker changing points in the given data source. Therefore, using this information, we can further divide the data source into speaker-specific segments, each of which is delimited by two consecutive boundaries. Every resultant segment becomes an initial cluster for AHSC in the next step.

5.3.4 Speaker Clustering

In this subsection, we apply our work in Chapters 2 and 4 to the framework of SAIL speaker diarization. Let us start by briefly reviewing how AHSC works in the SAIL speaker diarization system.

Algorithm 8 Agglomerative Hierarchical Speaker Clustering (AHSC) revisited
Require: {x_i}, i = 1, ..., n̂: speaker-specific segments
         Ĉ_i, i = 1, ..., n̂: initial clusters
Ensure: C_i, i = 1, ..., n: clusters finally remaining
 1: Ĉ_i ← {x_i}, i = 1, ..., n̂
 2: do
 3:   i, j ← argmin_{k,l} d(Ĉ_k, Ĉ_l), k, l = 1, ..., n̂, k ≠ l
 4:   merge Ĉ_i and Ĉ_j
 5:   n̂ ← n̂ − 1
 6: until DER is estimated to reach the lowest level
 7: return C_i, i = 1, ..., n

As shown in Algorithm 8, AHSC treats the speaker-specific segments produced by speaker change detection as individual initial clusters, and recursively merges the closest pair of clusters in terms of speaker-specific characteristics. Its recursive process continues until it is decided that further cluster merging would not improve speaker clustering performance any more, i.e., until DER is estimated to reach its lowest level. All the segments in each of the finally remaining clusters are identically labeled, and every cluster label is unique. In order for AHSC to achieve reliable performance, two critical questions need to be answered properly: 1) how to select the closest pair of clusters for merging at every recursion step, and 2) how to decide the optimal (recursion) stopping point at which the lowest DER would be achieved. In this context, our proposed speaker clustering method utilizes two novel approaches to address these questions, respectively: IGMM cluster modeling and ICR-based stopping point estimation.
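Both the LFC distance above (Eq. (5.1)) and the baseline inter-cluster distance in AHSC reduce to the same GLR computation under single full-covariance Gaussian models. A minimal sketch is given below; it fits each model by maximum likelihood and returns the log of the ratio, omits any numerical safeguards, and the function names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_loglik(X):
    """Log-likelihood of X (frames x dims) under its own ML full-covariance Gaussian."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False, bias=True)   # maximum-likelihood (biased) covariance
    return multivariate_normal.logpdf(X, mean=mu, cov=sigma, allow_singular=True).sum()

def log_glr(X, Y):
    """log GLR(X, Y) of Eq. (5.1)/(5.2): two separate models versus one pooled model."""
    return gaussian_loglik(X) + gaussian_loglik(Y) - gaussian_loglik(np.vstack([X, Y]))
```

A small value indicates that pooling the two objects costs little likelihood, i.e., that they are statistically similar; this is why the cluster pair with the smallest GLR is merged first in the next subsection.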
5.3.4.1 IGMM Cluster Modeling

The inter-cluster distance measurement used to select the closest pair of clusters at every recursion step of AHSC is done by comparing GLR values over all possible cluster pairs. (Once this comparison is done, the cluster pair with the smallest GLR value is picked for merging.) For two clusters C_x and C_y, the GLR is

GLR(C_x, C_y) = \frac{p(C_x \mid \Theta_{C_x})\, p(C_y \mid \Theta_{C_y})}{p(C_x \cup C_y \mid \Theta_{C_x \cup C_y})},   (5.2)

where Θ is the set of parameters of each cluster model distribution. Unlike the speech/non-speech detection in Section 5.3.2, single Gaussian cluster modeling is not appropriate for inter-cluster distance measurement in AHSC, as discussed in Chapter 4. Therefore, we utilize the IGMM cluster modeling approach of Chapter 4, which works as follows:

• Each initial cluster is modeled by a normal distribution with a full covariance matrix.

• For the GLR computation in Eq. (5.2), the union of the clusters considered is modeled by the distribution whose PDF is the weighted sum of the PDFs of the distributions representing the respective clusters. (As a consequence, this distribution has the form of a weighted sum of normal distributions, i.e., a GMM.)

• Any newly merged cluster is modeled by the distribution whose PDF is the weighted sum of the PDFs of the respective distributions representing the merging-involved clusters, for GLR computation against other clusters at the subsequent recursion steps of AHSC.

This approach enables not only a smooth transition of cluster models from single Gaussian distributions to GMMs during AHSC, but also a gradual increase in the complexity of the GMMs (i.e., the number of mixture components). With this cluster modeling method, Eq. (5.2) is thus written as

GLR(C_x, C_y) = \frac{p(C_x \mid \Lambda_{C_x})\, p(C_y \mid \Lambda_{C_y})}{p(C_x \cup C_y \mid \Lambda_{C_x \cup C_y})},   (5.3)

where \Lambda_{C_x}, \Lambda_{C_y}, and \Lambda_{C_x \cup C_y} are the parameter sets of the IGMMs representing the clusters considered, and the PDF of the distribution representing C_x ∪ C_y is simply determined as

f_{C_x \cup C_y} = \frac{N_{C_x}}{N_{C_x} + N_{C_y}} f_{C_x} + \frac{N_{C_y}}{N_{C_x} + N_{C_y}} f_{C_y}.   (5.4)

In the above equation, N is the cardinality of a cluster, and f is the PDF of a model distribution with parameter set Λ.

5.3.4.2 ICR-based Stopping Point Estimation

A conventional stopping point estimation method, which is based on BIC, checks whether the GLR for the closest pair of clusters is greater than 0, using Eq. (5.3), at every recursion step of AHSC [13]. However, as discussed in Chapter 2, this method is known to be unreliable (across data sources) in terms of estimation accuracy. In order to overcome such unreliability, we utilize the ICR-based stopping point estimation proposed in Chapter 2. The ICR for two clusters C_x and C_y is defined as

ICR(C_x, C_y) \triangleq \frac{1}{N_{C_x} + N_{C_y}} \ln GLR(C_x, C_y).   (5.5)

From an information-theoretic viewpoint, this statistical measure between clusters represents how much entropy would be increased by merging the clusters considered. Thus, it is natural to expect ICR to be small when the clusters considered are homogeneous in terms of speaker-specific characteristics and each cluster is large enough to fully cover the intra-speaker variance of the corresponding speaker identity. In other words, ICR would be small when the clusters considered come from the same speaker source and do not need additional information to represent the full speaker-specific characteristics. On the contrary, ICR would be relatively large when the clusters considered are heterogeneous, or when they are homogeneous but contain too little data to cover more than a part of the whole speaker-specific characteristics. As a consequence, ICR can properly work as a homogeneity decision measure for clusters only if every cluster considered is large enough to fully represent the characteristics of the corresponding speaker identity.
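A minimal sketch of these quantities is given below: an IGMM is kept as a list of (weight, mean, covariance) components, merging two clusters just concatenates their components with cardinality-based re-weighting (Eq. (5.4)), and GLR/ICR follow Eqs. (5.3) and (5.5). The class and function names are ours; since IGMM modeling is EM-free by design, no training code appears.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

class IGMMCluster:
    """A cluster and its incrementally built GMM (weights, means, full covariances)."""
    def __init__(self, data, weights=None, means=None, covs=None):
        self.data = data                               # frames x dims
        if weights is None:                            # fresh initial cluster: one Gaussian
            weights = [1.0]
            means = [data.mean(axis=0)]
            covs = [np.cov(data, rowvar=False, bias=True)]
        self.weights, self.means, self.covs = weights, means, covs

    @property
    def size(self):
        return len(self.data)                          # cluster cardinality N_C

    def loglik(self, X):
        """Log-likelihood of X under the weighted sum of Gaussian PDFs (the IGMM)."""
        comp = np.stack([np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c,
                                                                allow_singular=True)
                         for w, m, c in zip(self.weights, self.means, self.covs)])
        return logsumexp(comp, axis=0).sum()

def merge(cx, cy):
    """Eq. (5.4): the merged PDF is the cardinality-weighted sum of the two IGMM PDFs."""
    wx = cx.size / (cx.size + cy.size)
    wy = 1.0 - wx
    return IGMMCluster(np.vstack([cx.data, cy.data]),
                       weights=[wx * w for w in cx.weights] + [wy * w for w in cy.weights],
                       means=cx.means + cy.means,
                       covs=cx.covs + cy.covs)

def log_glr(cx, cy):
    """log of Eq. (5.3), with the union C_x U C_y modeled by the merged IGMM."""
    union = merge(cx, cy)
    return cx.loglik(cx.data) + cy.loglik(cy.data) - union.loglik(union.data)

def icr(cx, cy):
    """Eq. (5.5): log GLR normalized by the total number of frames in both clusters."""
    return log_glr(cx, cy) / (cx.size + cy.size)
```

The stopping rule described next then simply thresholds icr() for the pairs merged at the latest recursion steps, walking backward from the final one-cluster state.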
Based on this, the ICR-based stopping point estimation method for AHSC in the SAIL speaker diarization system proceeds as follows:

1. Wait until AHSC reaches the end of its process, i.e., until all the initial clusters have been merged into one big cluster.

2. For the pair of clusters merged at the last recursion step of AHSC, C_x and C_y, compute the ICR.

3. Compare the ICR with a pre-set threshold. If the ICR exceeds the threshold, decide that C_x and C_y are heterogeneous in terms of speaker-specific characteristics and consider the pair of clusters merged at the next latest recursion step. Otherwise, stop examining merged clusters and select the recursion step previously considered as the final stopping point.

Like the conventional BIC-based method, this stopping point estimation method relies on the reasoning that every merging after the optimal stopping point occurs only between heterogeneous clusters. The reason why its examination of merged clusters starts from the pair merged at the last recursion step of AHSC (i.e., the opposite direction to the one used in the BIC-based method) is that this strategy lets ICR work properly as a homogeneity measure by handling large clusters only.

5.3.4.3 Comparison

Table 5.4 compares our proposed approaches with the conventional ones for cluster modeling and stopping point estimation in AHSC. The proposed techniques resulted in an improvement of 4.96% (absolute) and 21.80% (relative) in terms of speaker clustering performance (i.e., speaker error time rate) in the end-to-end speaker diarization system. This improvement is directly connected to the enhanced reliability of the proposed speaker clustering strategies.

Table 5.4: Comparison of 1) IGMM cluster modeling + ICR-based stopping point estimation, and 2) single Gaussian cluster modeling + BIC-based stopping point estimation, in terms of speaker error time rate for the testing data set. The tuning parameter for BIC-based stopping point estimation was set to 25.0 and the threshold for ICR-based stopping point estimation to 0.225, both tuned on the training data set.
                               1)        2)
  Speaker-Error-Time Rate      17.79%    22.75%

5.3.5 Experimental Results

Figure 5.2 presents the overall performance of the proposed SAIL speaker diarization system on non-overlapped speech in the testing data set, in terms of DER. The lowest DER (6.77%) was achieved for Data 9, while the highest (40.32%) was obtained for Data 10. The average DER is 21.90%. These results are quite comparable with those in the recent RT evaluations. (However, a fair comparison with other state-of-the-art speaker diarization systems is practically impossible in this dissertation, because system performance varies across data sources and our training/testing data sets differ from those used for the RT evaluations; the best way to do such a fair comparison would be to join an RT evaluation and compete with the other systems.)
One interesting observation is that, despite our proposed approaches to robust speaker clustering, the speaker error time rates for some data sources, such as Data 4, 6, and 10, still differ greatly from those for the others, which means that there remains room for further improvement in the reliability of AHSC performance. A main reason for the relatively poor results on Data 4, 6, and 10 was a large amount of incorrect merging between heterogeneous clusters (in terms of speaker-specific characteristics) during AHSC. This also caused a mismatch between the optimal and the estimated stopping points, which led to severe overall DER degradation compared to the DERs for the other test data sources.

[Figure 5.2: Performance of the proposed SAIL speaker diarization system on non-overlapped speech in the testing data set, in terms of DER. For each of the 10 testing data sources and in total, the DER (%) is decomposed into speaker error, missed speaker, and false alarm speaker time rates.]

The biggest contributor to this phenomenon in speaker clustering within the framework of speaker diarization is incorrect speaker change detection, which causes many speech segments (i.e., individual initial clusters in AHSC) to contain more than one speaker source. Due to their mixed statistical characteristics, those segments can confuse inter-cluster distance measurement and result in a series of incorrect mergings during AHSC. In addition, considering that we handle spontaneous meeting conversation speech as data sources for speaker diarization, some segments inevitably contain overlapped speech, which naturally occurs in real-life conversations. These kinds of 'impure' speech segments can also cause confusion in inter-cluster distance measurement. In the next section, we introduce a method to overcome this problem in the framework of SAIL speaker diarization for more reliable speaker diarization performance, as well as a high-level dialogue pattern modeling approach for better AHSC performance in speaker diarization of meeting conversation speech.

5.4 Refined Speaker Clustering

In this section, we propose two approaches to making AHSC better and more refined in terms of DER in the framework of SAIL speaker diarization. The first approach selects representative speech segments when modeling clusters with IGMMs, instead of using all the segments available. This avoids the negative effect of the aforementioned 'impure' speech segments, naturally generated throughout speaker change detection, on cluster modeling and thus on clustering/diarization performance. The second approach utilizes interaction patterns between speakers in a given audio source for speaker diarization. By modeling such a high-level dialogue pattern, it can provide more robust diarization performance under the variation of input audio data. Let us start this section with the first approach, by reconsidering IGMM cluster modeling for inter-cluster distance measurement in AHSC.

[Figure 5.3: IGMM cluster modeling. {C_i}_{i=1}^{5} are initial clusters for AHSC, and a and b (a + b = 1) are weights for the respective constituent GMMs, determined by the cardinalities of {C_1, C_2, C_3} and {C_4, C_5}. The figure illustrates how IGMMs grow through merging during AHSC.]

5.4.1 Selection of Representative Speech Segments

In IGMM cluster modeling, clusters are modeled as follows:

• Every initial cluster at the beginning of AHSC is represented by a normal PDF with a sample mean vector and (full) covariance matrix.
• After merging during AHSC, a newly merged cluster is represented by the weighted sum of the PDFs of the clusters being merged.

• The weights are determined by the normalized cardinalities of the merged clusters.

In this way, the PDFs of the cluster models not only transition smoothly from normal PDFs to the PDFs of GMMs, but also gain a gradual increase in the number of Gaussian mixtures. The computational complexity of this cluster modeling approach is quite low because IGMMs require no training sessions such as the EM procedures used for conventional GMM training.

Figure 5.3 presents how the PDFs of IGMMs grow through merging in AHSC. In this figure, GMM_1 and GMM_2 represent the two clusters {C_1, C_2, C_3} and {C_4, C_5}, respectively. Each C_i is an initial cluster (i.e., an individual input speech segment to AHSC). The top row of the figure illustrates the two clusters, which have gone through merging between initial clusters twice and once, respectively. Now suppose that these two clusters are merged and the newly merged cluster {C_1, C_2, C_3, C_4, C_5} is represented by GMM_0, depicted in the bottom part of the figure. The PDF of GMM_0 is formed as the weighted sum of the PDFs of GMM_1 and GMM_2.

In this cluster modeling framework, therefore, every initial cluster is modeled by the PDF of a single Gaussian distribution, and once an initial cluster is merged into a larger cluster during AHSC, the PDF of its cluster model contributes an individual Gaussian component to the respective IGMM. A problem is that some initial clusters might contain data from more than one speaker source, due to imprecise speaker change detection or inherently overlapped speech in the given audio data source. The Gaussian mixtures generated from those 'impure' initial clusters degrade the capability of IGMMs to represent the statistical characteristics of the corresponding data clusters. In this subsection, we propose a novel idea to address this problem in IGMM cluster modeling: representative speech segment selection. The basic idea is that, when modeling a certain large cluster during AHSC, selecting representative initial sub-clusters from the cluster would help because they can represent the cluster statistically better.

Our way of choosing representative speech segments from a cluster is as follows. Consider a cluster C, and suppose that it has gone through merging and contains n initial clusters, i.e., C = {C_1, C_2, ..., C_n}, where {C_i}_{i=1}^{n} are initial clusters. Then IGMM{C} = IGMM{C_1, C_2, ..., C_n} = {(m_i, \Sigma_i, w_i)}_{i=1}^{n}, where this parameter set defines a GMM, m_i and \Sigma_i are the sample mean vector and (full) covariance matrix estimated from C_i, respectively, and w_i is the weight of the Gaussian component representing C_i in this GMM.

1. Compute the likelihood of the entire data in the cluster C under the PDF of every single Gaussian component, i.e., {p(C; m_i, \Sigma_i)}_{i=1}^{n}. Note that we exclude the weights {w_i}_{i=1}^{n} in this likelihood computation; otherwise, Gaussian components with large weights in IGMMs would tend to have high likelihood values, which is not desirable for a fair comparison in the next step.

2. Select the N best components in terms of likelihood, where N is less than the total number of Gaussian mixtures in the respective IGMM. The initial clusters (or speech segments) corresponding to the chosen N Gaussian components are considered representative. The N components form a new GMM (with N mixtures), which we call a refined IGMM for the cluster C.

3. During AHSC, repeat 1) and 2) for every newly merged cluster whose IGMM has more than N Gaussian components. This keeps the representative speech segments for clusters updated throughout AHSC. (A minimal sketch of this selection step is given below.)
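The sketch below reuses the IGMMCluster object from the earlier example (any cluster object exposing data, weights, means, and covs attributes would do). It scores each component by its unweighted log-likelihood over the whole cluster and keeps only the N best; renormalizing the surviving weights is our own choice here, not something prescribed above, and the function name is ours as well.

```python
import numpy as np
from scipy.stats import multivariate_normal

def refine_igmm(cluster, n_keep):
    """Keep the N Gaussian components (i.e., segments) that best explain the whole cluster."""
    if len(cluster.weights) <= n_keep:
        return cluster                                   # nothing to prune yet
    # Step 1: unweighted log-likelihood of all cluster data under each component.
    scores = [multivariate_normal.logpdf(cluster.data, mean=m, cov=c,
                                         allow_singular=True).sum()
              for m, c in zip(cluster.means, cluster.covs)]
    # Step 2: the N-best components form the refined IGMM.
    best = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:n_keep]
    total = sum(cluster.weights[i] for i in best)        # renormalize surviving weights
    return IGMMCluster(cluster.data,
                       weights=[cluster.weights[i] / total for i in best],
                       means=[cluster.means[i] for i in best],
                       covs=[cluster.covs[i] for i in best])
```

In the experiments of Section 5.4.3 we empirically set N = 32.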
The selection procedure is simply illustrated in Figure 5.4, where we reconsider {C_1, ..., C_5} and its IGMM, GMM_0, from Figure 5.3. Assuming that N = 3, {C_2, C_4, C_5} are selected as the representative speech segments in this case and form a new, refined GMM with 3 Gaussian mixtures.

[Figure 5.4: Selection of representative speech segments for improved IGMM cluster modeling. In this case, C_2, C_4, and C_5 are selected as representative speech segments to model {C_i}_{i=1}^{5}.]

Note that our interest in this method lies in how universally the individual Gaussian components of the IGMM considered represent the entire cluster data, because it is reasonable to regard the speech segments corresponding to the components selected for such universality as representative. This selective approach to cluster modeling, using only a portion of the entire cluster data, can refine the representation capability of cluster models by not only keeping statistically representative speech segments but also excluding potentially unnecessary or even degenerate ones.

5.4.2 Participant Interaction Pattern Modeling

In this section, we propose another idea for improving SAIL speaker diarization of meeting conversation speech: refining speaker clustering performance with respect to interaction patterns between meeting participants. This idea was motivated by the expectation that temporal dynamics between participants in meeting conversations are informative from a diarization perspective [11]. Modeling such dynamics would help in understanding the whole meeting speech and would consequently reduce DER.

We estimate participant interaction patterns, which are meeting-dependent, based on diarization results. For this purpose, we use an m-state 1st-order Markov chain model, illustrated in Figure 5.5 for the case of 4 states. The number of states in this interaction pattern model is set to the number of clusters that remain after AHSC, i.e., the estimated number of speakers in the given meeting speech.

[Figure 5.5: 1st-order Markov chain model for participant interaction patterns when the estimated number of speakers is 4: states S_1 through S_4 are fully connected (including self-loops), and p_ij is the transition probability from speaker S_i to speaker S_j for 1 ≤ i, j ≤ m (m = 4 in this case).]

Each transition probability is decided as follows:

1. The "who spoke when" output of speaker diarization is used to count the number of speaking-turn transitions N_ij from speaker S_i to speaker S_j, where 1 ≤ i, j ≤ m. Every 2s-long segment, the smallest unit handled in our speaker diarization system, is considered for transition counting.

2. N_ij is normalized by N_i, where N_i = \sum_{j=1}^{m} N_{ij}. Thus, each transition probability p_ij (1 ≤ i, j ≤ m) is determined by

p_{ij} = \frac{N_{ij}}{N_i} = \frac{N_{ij}}{\sum_{j=1}^{m} N_{ij}}.

The estimated transition probabilities in this model are used as a priori information for the refinement of diarization results. The refinement step performs a simple speaker identification task, treating m GMMs as pre-trained speaker models. (These GMMs are trained by EM over the representative speech segments of the respective clusters; the number of Gaussian mixtures is empirically set to 32.) Specifically, it refines the diarization results by classifying every 2s-long segment into one of the clusters that remain after AHSC, based on maximum a posteriori. Suppose that the GMMs for the clusters remaining after AHSC are \lambda_i (1 ≤ i ≤ m) and that the entire input meeting speech x can be split into L 2s-long segments, i.e., x = {x_1, x_2, ..., x_L}. The refinement step computes the likelihood of x_l (1 ≤ l ≤ L) under each \lambda_i and assigns to x_l, as its speaker label, the argument i providing the highest a posteriori value, i.e.,

\arg\max_i \; p(x_l \mid \lambda_i)\, p_{ji},

where p_{ji} is the transition probability from speaker S_j to speaker S_i in the estimated interaction pattern model and it is assumed that the speaker label j has been assigned to x_{l-1}. (A sketch of both stages is given below.)
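The following is a minimal sketch of the two stages just described: counting speaker-tag transitions over the 2s segments and MAP relabeling. The per-cluster GMM scoring is abstracted into a loglik callable, the first segment simply keeps its AHSC label (our choice), and all names here are ours.

```python
import numpy as np

def estimate_transitions(labels, m):
    """m x m matrix of p_ij = N_ij / sum_j N_ij from the sequence of 2s speaker labels (0..m-1)."""
    counts = np.zeros((m, m))
    for prev, cur in zip(labels[:-1], labels[1:]):
        counts[prev, cur] += 1
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)

def refine_labels(segments, labels, loglik, trans):
    """MAP relabeling: argmax_i  log p(x_l | lambda_i) + log p_{ji}, with j the previous label.

    segments -- list of per-segment feature matrices x_l
    labels   -- initial AHSC labels (ints in 0..m-1), one per 2s segment
    loglik   -- callable loglik(segment, i) -> log p(x_l | lambda_i)
    trans    -- transition matrix from estimate_transitions()
    """
    refined = list(labels)
    for l in range(1, len(segments)):
        j = refined[l - 1]
        scores = [loglik(segments[l], i) + np.log(trans[j, i] + 1e-12)
                  for i in range(trans.shape[0])]
        refined[l] = int(np.argmax(scores))
    return refined
```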
5.4.3 Experimental Results

Table 5.5 presents the speaker diarization performance of our original SAIL speaker diarization system from Section 5.3 and of the modified system with the two approaches proposed in this section, in terms of average DER across the data sources in the testing data set given in Section 5.3.1.

Table 5.5: Improved speaker diarization performance with the two approaches proposed in this section, i.e., representative speech segment selection and participant interaction pattern modeling. For the former approach we empirically set N = 32. Performance comparison is given in terms of average DER (%) across the 10 data sources in the testing data set in Section 5.3.1.
                                                  DER
  Baseline Performance                            21.90
  + Representative Speech Segment Selection       16.76
  + Participant Interaction Pattern Modeling      14.49

The main reason for the performance increase in the modified system with selection of representative speech segments for IGMM cluster modeling (21.90% → 16.76%, a 23.47% relative improvement) is that the proposed approach helped not only in properly choosing the closest pair of clusters at every recursion step of AHSC but also in accurately estimating the optimal stopping point for AHSC. This indicates that selecting representative speech segments is better for IGMM cluster modeling than using all the data in the clusters. This claim is reasonable because clusters may contain unnecessary or defective data, from a cluster representation perspective, due to incorrect speaker change detection or incorrect merging during AHSC, and there is therefore a significant need to keep purifying such clusters throughout AHSC for better clustering performance. From the table, we can also see that the second approach contributed to DER reduction as well (a 13.54% relative improvement), as expected. It is especially meaningful in this high-level modeling approach that interaction patterns between participants, which are hard to model universally due to their data-dependency, can be mathematically represented in an unsupervised fashion based on diarization results. Note that very accurate stopping point estimation for AHSC is required in the proposed approach, because the number of states m in the 1st-order Markov chain model for interaction patterns is determined by the number of clusters that remain after AHSC. This is already bolstered in the modified speaker diarization system by the first approach proposed in this section, as well as by the ICR-based stopping point estimation method.

[Figure 5.6: Performance of the modified SAIL speaker diarization system on non-overlapped speech in the testing data set, in terms of DER. For each of the 10 testing data sources and in total, the DER (%) is shown for the baseline, with representative speech segment selection, and with interaction pattern modeling added.]
Figure 5.6 shows more explicitly how much the proposed approaches improve the reliability of AHSC/diarization performance across the data sources in the testing data set. Compared with the baseline performance (also shown in Figure 5.2 in Section 5.3.5), the improved performance of the modified speaker diarization system, particularly for Data 4, 6, and 10, indicates that the SAIL speaker diarization system with the two refinement methods for robust speaker clustering can further enhance the reliability of diarization performance under the variation of input data sources.

5.5 Conclusions

In this chapter, we implemented the SAIL speaker diarization system, not only with our research results on robust speaker clustering from the previous chapters, but also with refinement schemes to further improve speaker clustering performance, based on 1) representative speech segment selection for IGMM cluster modeling and 2) interaction pattern modeling. The proposed speaker diarization system showed much improvement in terms of performance reliability under the variation of input data sources. Important future work includes finding a way to deal robustly with overlapped speech in this framework of speaker diarization. Although the first approach proposed in Section 5.4 provided some prototypical ideas in this regard, i.e., that selective clustering of data can maintain or even boost representativeness in cluster models, there is still a long way to go toward a state-of-the-art level of handling overlapped speech from a speaker clustering/diarization viewpoint. We continue to work on this topic.

Chapter 6 Conclusions

6.1 Contributions

In this dissertation, we dealt with one big, yet unsolved, issue in the research field of speaker clustering: unreliable clustering performance under the variation of input speech data. For this, we focused on two main perspectives in the framework of agglomerative hierarchical speaker clustering (AHSC): stopping point estimation and inter-cluster distance measurement. In Chapter 2, we addressed the robustness problem of the BIC-based stopping point estimation method for AHSC under the variation of input speech data. For this, we first gave a short review of GLR and BIC, and then investigated a main reason for the problem considered. This investigation led to an understanding of why a new statistical distance measure between clusters is needed for more robust stopping point estimation in AHSC under the variation of input speech data, which resulted in our proposal of ICR. In addition, we introduced a stopping point estimation method for AHSC based on ICR in that chapter. This stopping point estimation method was verified through experimental results to be more robust to the variation of input speech data than the conventional BIC-based one. In Chapter 3, we tackled the robustness problem of the GLR-based inter-cluster distance measure from the viewpoints of both early and late AHSC recursion steps. For this, we first examined why the reliability of the GLR-based inter-cluster distance measure varies severely across input data sources. Based on this examination, we proposed several modified versions of AHSC to improve the accuracy of the GLR-based inter-cluster distance measure, particularly at the early recursion steps of AHSC. We then proposed a supplementary inter-cluster distance measure that combines the advantages of GLR and ICR, in order to tackle the robustness problem of the GLR-based inter-cluster distance measure at the late recursion steps of AHSC.
All the methods proposed in this chapter were compared with the original AHSC in terms of performance averaged across data sources, and were shown to benefit the reliability of the GLR-based inter-cluster distance measure and thus the overall speaker clustering performance. In Chapter 4, we introduced incremental Gaussian mixture cluster modeling for inter-cluster distance measurement in AHSC. This dynamic cluster modeling approach not only provided AHSC with clustering performance comparable to that of the conventional GMM-based approach, but was also far more feasible in terms of computational complexity. In Chapter 5, we applied our research results to speaker diarization. For this, we implemented our own speaker diarization system and further modified it with two clustering performance refinement schemes.

6.2 Possible Future Research Topics

One potential future research direction is to identify the lower bound on cluster size that guarantees ICR to be reliable as a statistical distance measure, more specifically as a homogeneity decision measure, between the clusters considered. In Chapter 2, we avoided the possibility that ICR would not work properly by checking ICR-based inter-cluster homogeneity starting from the pair of clusters merged at the last recursion step of AHSC, under the assumption that clusters at the late recursion steps of AHSC would be large enough for ICR to be reliable. This assumption held for the meeting conversation excerpts used in the experiments presented in that chapter, because most of the speaker sources involved in the conversations generated enough speech, with a total length of at least 30 seconds each. Thus, at the late recursion steps of AHSC, where the ICR-based stopping point estimation method was usually applied, ICR could be reliable as an inter-cluster homogeneity measure, as expected. The assumption could, however, break down for other data sources that have a preponderance of short speech segments inadequate to fully reveal the corresponding speaker-specific characteristics.

There are also several directions for future work regarding what was handled in Chapter 3, including further refinements of the proposed modified AHSC approaches. For instance, in the third modified version of AHSC in that chapter, the threshold parameter determines the number of intermediate clusters, which is directly linked to the final speaker error time rate. It was chosen empirically in that chapter, but finding ways to set this threshold optimally would be beneficial for further enhancing clustering performance. As another example, we might have to consider how to optimally fuse two different kinds of statistical information on the same object for (GLR+ICR)-based inter-cluster distance measurement at the late recursion steps of AHSC. In that chapter, we used soft rankings in terms of GLR and ICR for this purpose, but this is not theoretically proven to be optimal for the task considered. Establishing a more systematic framework for selecting information fusion methods could be another valuable future research direction. In addition, as mentioned in the middle part of that chapter, it would be a good research topic to find a stopping point estimation method, other than the ICR-based one, for AHSC with the (GLR+ICR)-based inter-cluster distance measure. A new stopping point estimation method should be comparable to our proposed ICR-based one in terms of estimation accuracy, but needs to use an inter-cluster homogeneity decision measure independent of ICR.
Then it could keep the advantages of the modified versions of AHSC and of the (GLR+ICR)-based inter-cluster distance measure valid even in practical applications.

The research results in Chapter 4 could be extended to speaker modeling in the research field of speaker recognition. Currently, speaker modeling is performed with GMMs having a fixed number of mixture components, such as 16 or 32, but we still do not know how many mixture Gaussians are necessary for optimal modeling of speaker-specific characteristics. Intuitively this number should be speaker-dependent, yet there is as yet no canonical method for deriving the proper number of mixture components in GMMs for speaker-specific representation of data. The cluster modeling approach proposed in Chapter 4 does not require any fixed number of mixture Gaussians beforehand, so it might be a good alternative to conventional GMM-based speaker modeling.

6.3 Final Remarks

My Ph.D. research work on this topic is a tiny part of the vast research effort now being conducted within the research field of pattern classification, but I hope and believe that it is a meaningful contribution to this field, because the reliability issue of speaker clustering performance across data sources has not been significantly tackled thus far, although there has been much recognition of the seriousness of this issue. The entire set of results in this dissertation could be utilized for other data domains beyond speech data, particularly where similar data-dependency in clustering performance exists.

Bibliography

[1] J. Ajmera, C. Wooters, B. Peskin, and C. Oei. Speaker segmentation and clustering. In NIST RT-03S Workshop, Boston, MA, USA, May 2003.

[2] Jitendra Ajmera, Iain McCowan, and Herve Bourlard. Robust speaker change detection. IEEE Signal Processing Letters, 11(8):649-651, 2004.

[3] Jitendra Ajmera and Chuck Wooters. A robust speaker clustering algorithm. In IEEE Automatic Speech Recognition and Understanding Workshop, pages 411-416, St. Thomas, VI, USA, November 2003.

[4] X. Anguera, C. Wooters, and J. Pardo. Robust speaker diarization for meetings: ICSI RT06s evaluation system. In International Conference on Spoken Language Processing, pages 1674-1677, Pittsburgh, PA, USA, September 2006.

[5] X. Anguera, C. Wooters, B. Peskin, and M. Aguilo. Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system. In Multimodal Interaction and Related Machine Learning Algorithms, pages 402-414, Edinburgh, UK, July 2005.

[6] Raimo Bakis, Scott Chen, Ponani Gopalakrishnan, Ramesh Gopinath, Stephane Maes, Lazaros Polymenakos, and Martin Franz. Transcription of broadcast news shows with the IBM large vocabulary speech recognition system. In DARPA Speech Recognition Workshop, Chantilly, VA, USA, February 1997.

[7] P. Balaram. Information overload. Current Science, 78(5):533-534, 2000.

[8] Claude Barras, Xuan Zhu, Sylvain Meignier, and J. Gauvain. Multistage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1505-1512, 2006.

[9] P. Beyerlein, X. Aubert, R. Haeb-Umbach, M. Harris, Dietrich Klakow, A. Wendemuth, Sirko Molau, Michael Pitz, and A. Sixtus. The Philips/RWTH system for transcription of broadcast news. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999.

[10] Peter Beyerlein, Xavier Aubert, Reinhold Haeb-Umbach, Dietrich Klakow, Meinhard Ullrich, Andreas Wendemuth, and Patricia Wilcox. Automatic transcription of English broadcast news. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998.
110 [11] Carlos Busso, Panayiotis G. Georgiou, and Shrikanth S. Narayanan. Real-time mon- itoring of participants interaction in a meeting using audio-visual sensors. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 685– 688, Honolulu, HI, USA, April 2007. [12] Scott S. Chen, Ellen Eide, M. J. F. Gales, Ramesh A. Gopinath, D. Kanevsky, and P. Olsen. Recent improvements to IBM’s speech recognition system for automatic transcription of broadcast news. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. [13] Scott S. Chen and Ponani S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via theBayesian information criterion. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. [14] Scott S. Chen and Ramesh A. Gopinath. Model selection in acoustic modeling. In European Conference on Speech Communication and Technology, pages 1087–1090, Budapest, Hungary, September 1999. [15] Scott Shaobing Chen, M. J. F. Gales, P. S. Gopalakrishnan, R. A. Gopinath, H. Printz, D. Kanevsky, P. Olsen, and L. Polymenakos. IBM’s LVCSR system for tanscription of broadcast news used in the 1997 Hub-4 English evaluation. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. [16] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 1991. [17] P. Delacourt and C. J. Wellekens. DISTBIC: a speaker-based segmentation for audio data indexing. Speech Communication, 32(1-2):111–126, 2000. [18] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern classication. John Wiley & Sons, 2001. [19] D. Eichmann, Miguel Ruiz, Padmini Srinivasan, Nick Street, Chris Culy, and Fil- ippo Menczer. A cluster-based approach to tracking, detection, and segmentation of broadcast news. In DARPA Broadcast News Workshop, Herndon, VA, USA, Febru- ary 1999. [20] J. Gauvain, G. Adda, L. Lamel, and M. Adda-Decker. Transcribing broadcast news: theLIMSINov96Hub4system.InDARPASpeechRecognitionWorkshop,Chantilly, VA, USA, February 1997. [21] J. Gauvain, L. Lamel, G. Adda, L. Chen, and H. Schwenk. TheLIMSI RT03 BN systems. In NIST RT-03S Workshop, Boston, MA, USA, May 2003. [22] J. Gauvain, Lori Lamel, and G. Adda. The LIMSI 1997 Hub-4E transcription system. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. 111 [23] J.Gauvain,LoriLamel,andGillesAdda. Partitioningandtranscriptionofbroadcast newsdata. InInternational Conference on Spoken Language Processing,pages1335– 1338, Sydney, Australia, November 1998. [24] J.Gauvain,LoriLamel,GillesAdda,andMicheleJardino. TheLIMSI1998Hub-4E transcription system. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. [25] Jean Gauvain, Lori Lamel, and Gilles Adda. Transcribing broadcast news for audio and video indexing. Communications of the ACM, 43(2):64–70, 2000. [26] Herbert Gish, M. Siu, and Robin Rohlicek. Segregation of speakers for speech recog- nition and speaker identification. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 873–876, Toronto, Ontario, Canada, May 1991. [27] XueFeng Guo, WeiBin Zhu, Qin Shi, Scott S. Chen, and Ramesh A. Gopinath. The IBM LVCSR system used for 1998 Mandarin broadcast news transcription evaluation. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. [28] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, and S.J. Young. 
Segment generation and clustering in the HTK broadcast news transcription system. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. [29] Kyu J. Han, Samuel Kim, and Shrikanth S. Narayanan. Robust speaker cluster- ing strategies to data source variation for improved speaker diarization. In IEEE Automatic Speech Recognition and Understanding Workshop, pages 262–267, Kyoto, Japan, December 2007. [30] Kyu J. Han, Samuel Kim, and Shrikanth S. Narayanan. Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speakerdiarization. IEEE Transactions on Audio, Speech, and Language Processing, 16(8):1590–1601, 2008. [31] Kyu J. Han and Shrikanth S. Narayanan. A robust stopping criterion for agglomera- tive hierarchical clustering in a speaker diarization system. In European Conference on Speech Communication and Technology, pages 1853–1856, Antwerp, Belgium, August 2007. [32] Kyu J. Han and Shrikanth S. Narayanan. A novel inter-cluster distance measure combiningGLRandICRforimprovedagglomerativehierarchicalspeakerclustering. InIEEEInternationalConferenceonAcoustics, Speech, andSignalProcessing,pages 4373–4376, Las Vegas, NV, USA, March 2008. [33] Jing Huang, Etienne Marcheret, Karthik Visweswariah, and Gerasimos Potamianos. TheIBM RT07 evaluation systems for speaker diarization on lecture meetings. In International Evaluation Workshops CLEAR 2007 and RT 2007, pages 497–508, Baltimore, MD, USA, May 2007. 112 [34] Juan M. Huerta, Stanley Chen, and Richard M. Stern. The 1998Carnegie Mellon University Sphinx-3 Spanish broadcast news transcription system. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. [35] Hubert Jin, Francis Kubala, and Rich Schwartz. Automatic speaker clustering. In DARPA Speech Recognition Workshop, Chantilly, VA, USA, February 1997. [36] F. Kubala, T. Anastasakos, H. Jin, L. Nguyen, and R. Schwartz. Transcribing radio news. In International Conference on Spoken Language Processing, pages 598–601, Philadelphia, PA, USA, October 1996. [37] Francis Kubala, Jason Davenport, Hubert Jin, Daben Liu, Tim Leek, Spyros Mat- soukas, David Miller, Long Nguyen, Fred Richardson, Richard Schwartz, and John Makhoul. The1997BBNByblossystemappliedtobroadcastnewstranscription. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. [38] Francis Kubala, Hubert Jin, Spyros Matsoukas, Long Nguyen, Rich Schwartz, and John Makhoul. The 1996 BBN Byblos Hub-4 transcription system. In DARPA Speech Recognition Workshop, Chantilly, VA, USA, February 1997. [39] D. Liu, A. Srivastava, F. Kubala, D. Kiecza, A. Ann, J. Maguire, R. Schwartz, M. Snover, and B. Dorr. BBN+UMD Rich Transcription system for broadcast news. In NIST RT-03F Workshop, Washington, DC, USA, October 2003. [40] Daben Liu and Francis Kubala. Fast speaker change detection for broadcast news transcription and indexing. In European Conference on Speech Communication and Technology, pages 1031–1034, Budapest, Hungary, September 1999. [41] P. Maes. Agents that reduce work and information overload. Communications of the ACM, 37(7):30–40, 1994. [42] D.Mararu,L.Besacier,S.Meignier,C.Fredouille,andJ.Bonastre.ELISA,CLIPS andLIANIST 2003 segmentation. In NIST RT-03S Workshop, Boston, MA, USA, May 2003. [43] SpyrosMatsoukas,LongNguyen,JasonDavenport,JayBilla,FredRichardson,Man- hung Siu, Daben Liu, and Richard Schwartz. 
The 1998 BBN BYBLOS primary system applied to English and Spanish broadcast news transcription. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. [44] L. Nguyen, N. Duta, J. Makhoul, S. Matsoukas, R. Schwartz, B. Xing, and D. Xu. The BBN RT03 BN English system. In NIST RT-03S Workshop, Boston, MA, USA, May 2003. [45] Masafumi Nishida and Tatsuya Kawahara. Speaker model selection based on the Bayseian information criterion applied to unsupervised speaker indexing. IEEE Transactions on Speech and Audio Processing, 13(4):583–592, 2005. 113 [46] J. M. Noyes and P. J. Thomas. Information overload: an overview. IEE Colloquium on Information Overload, pages 0–6, 1995. [47] Katsutoshi Ohtsuki, Sadaoki Furui, Naoyuki Sakurai, Atushi Iwasaki, and Z. Zhang. ImprovementsinJapanesebroadcastnewstranscription. InDARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. [48] M. Padmanabhan, L. R. Bahl, D. Nahamoo, and M. A. Picheny. Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems. In IEEE International Conference on Acoustics, Speech, and Signal Pro- cessing, pages 701–704, Atlanta, GA, USA, May 1996. [49] N.Reeves, S.Mills, andJ.Noyes. Informationretrievalfromauserperspective. IEE Colloquium on Information Overload, pages 3/1–3/7, 1995. [50] D. Reynolds, P. Torres, and R. Roy. EARS RT03S diarization. In NIST RT-03S Workshop, Boston, MA, USA, May 2003. [51] Douglas A. Reynolds. Speaker identification and verification using gaussian mixture speaker models. Speech Communication, 17(1-2):91–108, 1995. [52] DouglasA.Reynolds,ThomasF.Quatieri,andRobertB.Dunn. Speakerverification using adapted gaussian mixture models. Digital Signal Processing, 10(1-3):19–41, 2000. [53] Douglas A. Reynolds and Richard C. Rose. Robust text-independent speaker iden- tification using gaussian mixture models. IEEE Transactions on Speech and Audio Processing, 3(1):72–83, 1995. [54] Douglas A. Reynolds and Pedro A. Torres-Carrasquillo. The MIT Lincoln labo- ratory RT-04F diarization systems: applications to broadcast news and telephone conversations. In NIST RT-04F Workshop, Palisades, NY, USA, November 2004. [55] DouglasA.ReynoldsandPedroA.Torres-Carrasquillo. Approachesandapplications of audio diarization. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 953–956, Philadelphia, PA, USA, March 2005. [56] R. H. Rockland. Reducing the information overload: a method on helping students research engineering topics using the Internet. IEEE Transactions on Education, 43(4):420–425, 2000. [57] AnanthSankar, RamanaRaoGadde, andFuliangWeng.SRI’s1998broadcastnews system - toward faster, better, smaller speech recognition. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. [58] G.Schwarz. Estimatingthedimensionofamodel. Annals of Statistics,6(2):461–464, 1978. 114 [59] K. Seymore, Stanley Chen, S. Doh, M. Eskenazi, E. Gouvea, B. Raj, Mosur Rav- ishankar, Ronald Rosenfeld, M. A. Siegler, Richard M. Stern, and Eric Thayer. The 1997CMUSphinx-3English broadcast news transcription system. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. [60] Koichi Shinoda and Takao Watanabe. Acoustic modeling based on theMDL prin- ciple for speech recognition. In European Conference on Speech Communication and Technology, pages 99–102, Berlin, Germany, September 1997. [61] Matthew A. Siegler, Uday Jain, Bhiksha Raj, and Richard M. Stern. 
Automatic segmentation, classification, and clustering of broadcast news audio. In DARPA Speech Recognition Workshop, Chantilly, VA, USA, February 1997. [62] Hiroshi Tenmoto, Mineichi Kudo, and Masaru Shimbo. MDL-based selection of the number of components in mixture models for pattern classification. In Joint IAPR International Workshop on Structural and Syntactic Pattern Recognition, and Statistical Pattern Recognition, pages 831–836, Sydney, NSW, Australia, August 1998. [63] S.E.TranterandD.A.Reynolds. Anoverviewofautomaticspeakerdiarizationsys- tems. IEEE Transactions on Audio, Speech, and Language Processing, 43(5):1557– 1565, 2006. [64] A. Tritschler and R. Gopinath. Improved speaker segmentation and segments clus- tering using theBayesian information criterion. In European Conference on Speech Communication and Technology, pages 679–682, Budapest, Hungary, September 1999. [65] David A. van Leeuwen. The TNO speaker diarization system for NIST RT05s meeting data. In Multimodal Interaction and Related Machine Learning Algorithms, pages 440–449, Edinburgh, UK, July 2005. [66] AnVandecatseyeandJ.Martens. Afast,accurateandstream-basedspeakersegmen- tation and clustering algorithm. In European Conference on Speech Communication and Technology, pages 941–944, Geneva, Switzerland, September 2003. [67] Steven Wegmann, Francesco Scattone, Ira Carp, Larry Gillick, Robert Roth, and Jonathan P. Yamron. Dragon systems’ 1997 broadcast news transcription system. InDARPABroadcastNewsTranscriptionandUnderstandingWorkshop,Lansdowne, VA, USA, February 1998. [68] P. Woodland, G. Evermann, M. Gales, T. Hain, R. Chan, B. Jia, D. Y. Kim, A. Liu, D. Mrva, D. Povey, K. C. Sim, M. Tomalin, S. Tranter, L. Wang, and K. Yu. CU- HTKSTT systems forRT03. In NIST RT-03S Workshop, Boston, MA, USA, May 2003. [69] P. C. Woodland, M. J. F. Gales, D. Pye, and S. J. Young. The development of the 1996HTK broadcast news transcription system. In DARPA Speech Recognition Workshop, Chantilly, VA, USA, February 1997. 115 [70] P.C.Woodland,T.Hain,S.E.Johnson,T.R.Niesler,A.Tuerk,E.W.D.Whittaker, and S. J. Young. The 1997HTK broadcast news transcription system. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. [71] P. C. Woodland, T. Hain, G. L. Moore, T. R. Niesler, D. Povey, Andreas Tuerk, and E. W. D. Whittaker. The 1998 HTK broadcast news transcription system: developmentandresults. InDARPA Broadcast News Workshop,Herndon,VA,USA, February 1999. [72] C. Wooters, J. Fung, B. Peskin, and X. Anguera. Towards robust speaker segmen- tation: the ICSI-SRI fall 2004 diarization system. In NIST RT-04F Workshop, Palisades, NY, USA, November 2004. [73] ChuckWootersandMarijnHuijbregts. TheICSIRT07sspeakerdiarizationsystem. In International Evaluation Workshops CLEAR 2007 and RT 2007, pages 509–519, Baltimore, MD, USA, May 2007. [74] XintianWu,ChaojunLiu,YonghongYan,DoughwaKim,SethCameron,andRandy Parr. The1998OGI-Fonixbroadcastnewstranscriptionsystem. InDARPA Broad- cast News Workshop, Herndon, VA, USA, February 1999. [75] Rui Xu and Donald C. Wunsch. Clustering. IEEE Press, 2009. [76] Yonghong Yan, Xintian Wu, Johan Schalkwyk, and Ron Cole. Development of the CSLULVCSR: the 1997DARPAHub-4 evaluation system. In DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, USA, February 1998. [77] PumingZhan,StevenWegmann,andLarryGillick. DragonSystems’1998broadcast news transcription system for Mandarin. In DARPA Broadcast News Workshop, Herndon, VA, USA, February 1999. 
[78] Xuan Zhu, Claude Barras, Lori Lamel, and J. Gauvain. Speaker diarization: from broadcast news to lectures. In Machine Learning for Multimodal Interaction, pages 396–406, Bethesda, MD, USA, May 2006. 116
Abstract
Speaker clustering refers to a process of classifying a set of input speech data (or speech segments) by a speaker identity in an unsupervised way, based on the similarity of speaker-specific characteristics between the data. The process identifies the speech segments of the same speaker source without any prior speaker-specific information of the given input data. This speaker-perspective, unsupervised classification of speech data can be applied as a pre-processing step to speech/speaker recognition or multimedia data segmentation/classification in various ways. Thus, speaker clustering has been recently attracting much attention in the research area of speech recognition and multimedia data processing.