A STUDY OF UNSUPERVISED SPEAKER INDEXING

by

Soon-Il Kwon

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

May 2005

Copyright 2005 Soon-Il Kwon

Dedication

To my family, Min-Kyung Lee and Olivia Yoo-Jung Kwon, for their love, encouragement, and support. And to my parents, Hyuk-Sung Kwon and Young-Sook Han, for their dedication and hard work.

Acknowledgements

This thesis reflects the research area in which I had the opportunity to be involved during my graduate studies at USC. It contains the results of detecting speaker changes in a multi-speaker audio stream and clustering each detected speaker-specific segment in an unsupervised manner. In addition, generic models are used along with speaker quantization to improve the overall performance of unsupervised speaker indexing. I would like to use this opportunity to express my special thanks to Professor Narayanan, my adviser, who made this thesis possible. I thank him for his great understanding, patience, help, and support. I also thank Professor Sung-Bok Lee, Professor Panayiotis G. Georgiou, and Dr. Naveen Srinivasamurthy.

S.K.
Los Angeles, California
May 2005

Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
1 Introduction
  1.1 Motivation
  1.2 Human Speech Production and Perception System
  1.3 Speaker Recognition
  1.4 Feature Extraction
  1.5 Feature Similarity Measure
  1.6 Entropy and Kullback-Leibler Distance
  1.7 Hypothesis Testing
  1.8 Unsupervised Speaker Indexing
  1.9 Literature Review
  1.10 Contribution of this thesis
  1.11 Outline
2 Speaker Change Detection
  2.1 Introduction
  2.2 Silence Detection Based Search Algorithm
  2.3 Localized Search Algorithm (LSA)
  2.4 Some Further Thoughts
3 Generic Models for Bootstrapping
  3.1 Introduction
  3.2 Sample Speaker Models (SSM)
  3.3 Clustering and Model Adaptation
  3.4 Experiments
  3.5 Results
  3.6 Some Further Thoughts
4 Speaker Quantization
  4.1 Introduction
  4.2 Basics of Vector Quantization
  4.3 Speaker Quantization
  4.4 Analysis of Errors in Unsupervised Speaker Indexing
  4.5 Experiments
  4.6 Results
  4.7 Some Further Thoughts
5 An Analytical Approach Toward Understanding The Capacity of Unsupervised Speaker Indexing
  5.1 Introduction
  5.2 Problem: Unsupervised Speaker Indexing Using Sample Speaker Models
  5.3 Relevant Information-Theoretic Background
  5.4 Capacity in Unsupervised Speaker Indexing
    5.4.1 Information-Theoretic Approach
    5.4.2 Statistical Approach
  5.5 Special Case
    5.5.1 One-Dimensional Gaussian Speaker Models
    5.5.2 Simulation and Result
  5.6 Some Further Thoughts
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
    6.2.1 Unsupervised Speaker Indexing with a large population
    6.2.2 Multi-Channel Unsupervised Speaker Indexing
    6.2.3 Control over Overlapped Speech in Unsupervised Speaker Indexing
Reference List

List Of Tables

3.1 Indexing error rates as a function of average length of segment per speaker, number of speakers in the test sequence, and generic model type
3.2 Number of speakers in clips vs. accuracy for the broadcast news material
4.1 Error rates of unsupervised speaker indexing: Universal Background Model and Sample Speaker Models with MCMC and Speaker Quantization (SQ), respectively. The figures in parentheses show the relative improvement from the baseline.
5.1 Average of the clustering criterion (Jc) when the two speakers to be indexed are males, females, or male and female. The last column (Total) is the average of the three cases.

List Of Figures

1.1 Simplified diagram of speaker indexing
1.2 Diagram of speaker verification
1.3 Diagram of speaker identification
1.4 An example of a filter bank: triangular bandpass frequency responses with bandwidth and spacing determined by a mel scale
1.5 An example of hypothesis testing
1.6 Block diagram of the unsupervised speaker indexing process with generic models
1.7 An example of speaker indexing (x-axis: time (msec), y-axis: energy): (a) original speech data sequence extracted from HUB-4 Broadcast News (1999), (b) speech data sequence indexed as Speaker 1, (c) speech data sequence indexed as Speaker 2, (d) speech data sequence indexed as Speaker 3
2.1 A block diagram of speaker change detection using the Silence Detection Based Algorithm
2.2 An example of speaker change detection using the Silence Detection Based Algorithm
2.3 A block diagram of speaker change detection using the Localized Search Algorithm (LSA)
2.4 An example of speaker change detection using the Localized Search Algorithm (LSA): an analysis window (4 seconds) looks for the exact speaker changing point near the potential boundary by comparing the two analysis segments (2 seconds each) of the analysis window. The first analysis segment is the reference segment, and the second analysis segment is compared with the reference using the GLR test. When the analysis window detects a latent speaker change point, it shifts by 0.2 seconds to enable a finer search, yielding a total of 10 ratios from GLR tests. The minimum is chosen, implying the highest probability of a speaker change, and the boundary between the two analysis segments is recognized as a true speaker changing point.
3.1 Generic models: (a) Universal Background Model (UBM): the entire speaker data in the pool is used to create a single model; (b) Universal Gender Models (UGM): the data is used to create 2 gender models; (c) Sample Speaker Models (SSM): speaker models are selected from the generic speaker data pool by the proposed sampling method
3.2 Comparison of KL distances between various types of generic models and target speaker models: 1. UBM, 2. UGM, and 3. SSM. On average, SSM is closer to the target than the other models.
3.3 Model adaptation: from generic speaker models into speaker-specific models
3.4 Clustering with Sample Speaker Models (SSM) and adaptation
3.5 Assigning segments to clusters: only clusters assigned to segments survive
3.6 Indexing accuracy for various types of generic models (SSM, UBM, UGM): (a) 2-speaker conversations, (b) 4-speaker conversations, and (c) broadcast news. Note that 'samples' here refer to those drawn from the generic model pool by MCMC. The results for SSM were consistently better than for the UBM and UGM cases.
4.1 A block diagram of the Generalized Lloyd Algorithm
4.2 Vector Quantization vs. Speaker Quantization: (a) Vector Quantization, (b) Speaker Quantization
4.3 Example of a tree for Tree-Structured Vector Quantization (TSVQ)
4.4 Type-1 error in speaker indexing: each segment of a speech sequence from a single speaker is indexed to a sample
4.5 Type-2 error in speaker indexing: the segments of speech sequences from different speakers are indexed to samples
4.6 Relation between the number of pooled speakers and the number of quantized speakers
4.7 Speaker indexing error rates vs. number of quantized speakers
5.1 An illustration of 16 Sample Speaker Models in a 2-dimensional space
5.2 An illustration of 64 Sample Speaker Models and 4 targets to be indexed in a 2-dimensional space
5.3 Relative entropy as a function of N, the number of Sample Speaker Models (solid line: $D(P_1 \| P_1)$, dashed line: $D(P_1 \| P_2)$)
5.4 An example of 1-D Gaussian cases: pseudo targets (1-D Gaussian models with solid lines), two sources (1-D Gaussian models with a dash-dot line for Source 1 and a dotted line for Source 2). Note that the range between a and b is the dominant segment for the pseudo target assigned to Source 1.
5.5 (a) Clustering criterion vs. number of Sample Speaker Models resulting in minimum total error probability; (b) number of Sample Speaker Models vs. total error probability with varying clustering criterion

Abstract

Speaker indexing sequentially detects points where speaker identity changes in a multi-speaker audio stream, and classifies each detected segment according to the speaker's identity. This thesis addresses three challenges. The first relates to efficient sequential speaker change detection. The second relates to the fact that the number/identity of the speakers is unknown. The third relates to building speaker models with only small amounts of training data. To address the first issue, a localized search algorithm is proposed which aims to provide speaker change detection with minimal amounts of data for speech analysis. To address the issue of speaker modeling under unsupervised data conditions, a novel predetermined generic speaker-independent model set, called the Sample Speaker Models (SSM), is proposed. This set can be useful for more accurate speaker modeling and clustering without requiring training models on target speaker data. Once a speaker-independent model is selected from SSM, it is progressively adapted into a speaker-dependent model. Experiments were performed with data from the Speaker Recognition Benchmark NIST Speech (1999) and the HUB-4 Broadcast News Evaluation English Test Material (1999).
Results showed that our new technique, sampled using the Markov Chain Monte Carlo method, gave 92.5% indexing accuracy on 2-speaker telephone conversations, 89.6% on 4-speaker conversations with telephone speech quality, and 87.2% on broadcast news. SSM outperformed the Universal Background Model by up to 29.4% absolute and the Universal Gender Models by up to 22.5% absolute in indexing accuracy in the experiments. While SSM is useful in unsupervised speaker indexing, an optimal sampling method is still required. To address this problem, the Speaker Quantization method, motivated by Tree-Structured Vector Quantization, is proposed and experimentally compared with the MCMC approach. Experimental results showed that the new sampling approach, even though it is not optimal, outperformed random selection by 22.7% relative in error rate on telephone conversations and by 19.8% relative on broadcast news. We also analytically studied the capacity of unsupervised speaker indexing. From the analytic approach, we found that the similarity between the speakers to be indexed plays an important role in determining the most appropriate number of sample speaker models.

Chapter 1

Introduction

1.1 Motivation

Speaker recognition is the process of automatically determining a person's identity by analysis of spoken phrases. Many applications have been considered for speaker recognition: secure access controlled by voice, customizing services by voice, surveillance, forensic investigations, banking transactions over a telephone line or other types of network, voice mail, and so on [5] [28] [23] [35]. Speaker recognition technology is expected to make our daily lives more convenient.

One of the speaker recognition applications is speaker indexing, the process of determining when and who is talking [Fig. 1.1]. It is an integral element of speech data monitoring and content-based data mining applications. Consider, for example, applications such as meeting/teleconference monitoring, archiving, and browsing. A key motivation arises from the fact that it is impossible or tedious to attend all relevant meetings face to face. Multimedia meeting or teleconference monitors and browsers can be useful for getting meeting information, such as who is saying what and when, remotely through on-line or off-line systems [29] [78]. Specifically, these applications commonly include a speaker indexing process, that is, segmentation and classification of data with respect to speakers [66].

[Figure 1.1: Simplified diagram of speaker indexing: speech signal of multiple speakers → microphone → speaker indexing ("Who is talking now?").]

1.2 Human Speech Production and Perception System

Speech is sound carrying meaning for human communication. In biophysics, the sound is produced by waves of air pressure emanating from the human vocal mechanism. As air is expelled from the lungs, the vocal cords are vibrated by the air flow, which is changed into quasi-periodic pulses. These pulses are then modulated in frequency as they flow through the throat, mouth, and nasal cavities. Depending on the movements and positions of the jaw, tongue, lips, and mouth, different sounds are produced. However, the same movements and positions of the articulators might produce different sounds if the shape, length, and size of the vocal tract are different [60] [59]. This is one of the main reasons that everybody has different vocal characteristics.

In signal processing, the speech sound is a periodic wave consisting of a number of harmonics, such as the fundamental frequency and several formants. These combinations of frequencies change depending on the movements and positions of the articulators. Hence, every vowel has a different distribution of energy in the frequency domain. In addition, everyone has a different tone of vowel sounds due to the different shapes and movements of the articulators [59] [10].

In psychoacoustics, the concept of timbre is defined as an attribute of auditory sensation. The timbre of a sound depends on many physical variables, including spectral and temporal variations. This is the reason that humans can differentiate two sounds from different musical instruments playing the same pitch for the same duration. In the same manner, humans can discriminate the voices of different people [23].
1.3 Speaker Recognition

While speech recognition aims to capture what a person is saying, speaker recognition aims to identify or authenticate the person who is talking. Speaker recognition is essentially a kind of (voice) pattern recognition problem: the speech signal is obtained and then compared with a previously prepared reference [61].

Speaker recognition encompasses speaker verification and speaker identification [64]. Speaker verification is the task of verifying a person's claimed identity from his or her voice, enrolled previously in the system. When an identity is claimed, a binary decision, acceptance or rejection, is required. To make the decision, a set of feature vectors is obtained from the utterances of the speaker and compared with a set of stored reference vectors. Thus only a single comparison is executed to make the final decision.

[Figure 1.2: Diagram of speaker verification: speech signal → feature extraction → similarity check against the claimed speaker model → accept/reject decision.]

[Figure 1.3: Diagram of speaker identification: speech signal → feature extraction → similarity check against all speaker models → maximum selection → speaker ID.]

The speaker identification problem is different from speaker verification. In speaker identification, without a prior identity claim, the system decides who the person is among a group of people. Instead of a single comparison, multiple comparisons are executed. First, the speech data of the people in the group are collected as a training step. The second step of speaker identification is the task of comparing an unidentified utterance with the training data and making the identification. The decision rule is to choose the speaker with the minimum probability of error.

There are two categories: text-dependent and text-independent speaker recognition [64]. In text-dependent recognition, the spoken phrase is known to the system, whereas in the text-independent case, the spoken phrase is unknown. Speaker recognition can also be subdivided into closed-set and open-set problems. Closed-set speaker recognition identifies a speaker only from a group of N known speakers. The open-set problem is to identify a speaker who may or may not belong to a group of N known speakers. Speaker indexing in general is a text-independent, open-set speaker identification problem [5] [45].
1.4 Feature Extraction

Speech information is usually analyzed via the short-time spectrum. Short-time spectral analysis differentiates not only speech sounds but also speakers. In conjunction with this analysis, cepstral processing is useful for extracting features from the speech signal. The cepstrum is computed as follows:

\[ c[n] = \mathrm{DFT}^{-1}\big( \ln \left| \mathrm{DFT}(x[n]) \right| \big). \tag{1.1} \]

There are two principal methods for parameterizing the short-term spectrum: linear predictive coding (LPC) analysis and filter bank analysis [59] [71]. LPC-based spectral analysis is widely used for speech recognition. The LPC model represents the speech signal as a linear sum of previous samples plus an excitation term:

\[ H(z) = \frac{1}{A(z)}, \tag{1.2} \]

where the inverse filter is

\[ A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}, \tag{1.3} \]

\[ x[n] = Z^{-1}[X(z)] = \sum_{k=1}^{p} a_k x[n-k] + e[n], \tag{1.4} \]

\[ \hat{x}[n] = \sum_{k=1}^{p} a_k x[n-k], \tag{1.5} \]

where $\hat{x}$ is the current sample predicted from a linear combination of its past p samples, and

\[ e[n] = x[n] - \hat{x}[n]. \tag{1.6} \]

The LPC coefficients are computed by solving a set of linear equations obtained from minimizing the mean-squared error between the signal and the linearly estimated signal [64] [60]. Given the LPC coefficients, the cepstrum, called the LPC-cepstrum, can be obtained. This method is computationally efficient, but less popular in speaker recognition.

The filter bank method analyzes the speech signal through a bank of bandpass filters covering the available range of frequencies [60] [53].

[Figure 1.4: An example of a filter bank: triangular bandpass frequency responses with bandwidth and spacing determined by a mel scale.]

The Mel Frequency Cepstral Coefficients (MFCC) can be obtained by inserting an intermediate frequency-warping step, as follows:

1. Take the Fast Fourier Transform (FFT).
2. Take the magnitude.
3. Take the log: the result is real-valued and symmetric.
4. Warp the frequencies according to the mel scale: the higher frequency band is compressed, since it is regarded as less important than the lower frequency band.
5. Take the inverse Fast Fourier Transform.

The mel scale is based on the non-linear human perception of the frequency of sounds.
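To make the recipe concrete, the sketch below computes MFCCs for one windowed frame with NumPy. It is a minimal illustration, not the configuration used in this thesis: the mel filter bank is assumed precomputed, the log is taken on the mel band energies (the common practical ordering of steps 3 and 4), and a DCT stands in for the inverse FFT in step 5, the usual choice since the log spectrum is real and symmetric.

```python
import numpy as np
from scipy.fftpack import dct  # DCT-II, the usual stand-in for step 5

def mfcc_frame(frame, mel_fb, n_coeffs=13):
    """MFCCs for one windowed speech frame (illustrative configuration).

    frame  : 1-D array of samples, already windowed (e.g., Hamming)
    mel_fb : (n_filters, len(frame)//2 + 1) triangular mel filter-bank matrix
    """
    spectrum = np.abs(np.fft.rfft(frame))         # steps 1-2: FFT magnitude
    mel_energies = mel_fb @ (spectrum ** 2)       # step 4: mel-scale warping
    log_mel = np.log(mel_energies + 1e-10)        # step 3: log (epsilon avoids log 0)
    return dct(log_mel, norm='ortho')[:n_coeffs]  # step 5: decorrelate, keep low coeffs
```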
1.5 Feature Similarity Measure

To recognize speakers, the extracted features are compared. There are mainly two approaches: metric-based and model-based. The metric-based method measures the similarity between two vector sets, $X_i$ and $X_j$, by calculating a distance such as the Euclidean distance,

\[ d = \sum_{i=1}^{D} (x_i - y_i)^2, \tag{1.7} \]

where $x_i$ and $y_i$, $i = 1, 2, \ldots, D$, are the components of the two D-dimensional feature vectors being compared. There are a few other methods for similarity measures:

1. Minkowski:
\[ d = \left( \sum_{i=1}^{D} |x_i - y_i|^q \right)^{1/q}, \tag{1.8} \]
where q = 1 gives the city-block distance and q = 2 the Euclidean distance.

2. Mahalanobis:
\[ d^2 = (\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}), \tag{1.9} \]
where $\boldsymbol{\mu}$ is a mean vector and $\Sigma$ is a covariance matrix.

3. Tanimoto: the ratio of the number of shared attributes to the number possessed by x or y,
\[ d(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^{T}\mathbf{y}}{\mathbf{x}^{T}\mathbf{x} + \mathbf{y}^{T}\mathbf{y} - \mathbf{x}^{T}\mathbf{y}}. \tag{1.10} \]

The model-based method uses a probabilistic formulation of the feature space to measure the similarity between two vector sets. The speech signal of a particular speaker can be specified by some probability distribution. There are nonparametric and parametric probability models. Nearest neighbor and vector quantization modeling are examples of the nonparametric method. The nearest neighbor method estimates the density by measuring the distances between neighboring vectors, in that a smaller nearest-neighbor distance means a higher density [64] [21] [36]. Vector Quantization (VQ) modeling constructs a set of representatives from the training vectors; the feature space is quantized by mapping each vector to one of the representatives.

The Gaussian model is a basic parametric model. The multivariate Gaussian model is simply, but effectively, parameterized by its mean, $\mu$, and covariance matrix, $\Sigma$. The classification of the test vector sets is based on the log likelihood:

\[ \mathcal{L}(X; \mu, \Sigma) = \log L(X; \mu, \Sigma) = -\frac{n}{2} \log |2\pi\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu), \tag{1.11} \]

where n is the number of vectors in X. In speaker recognition, the Gaussian mixture model (GMM), a weighted sum of Gaussian distributions, has been found to better reflect speaker information. Model training is accomplished by the Expectation-Maximization (EM) algorithm. Suppose that a GMM with N mixtures is used as a speaker model. All frames are initially divided into N clusters. An initial model is obtained by parameter estimation: means and covariances are estimated from the vectors in each cluster, and the prior weights of the GMM can be set simply as the proportion of feature vectors in each cluster. Then the feature vectors are re-clustered by the Maximum Likelihood (ML) method using the previously estimated model. These processes are executed iteratively until the model parameters converge [12] [50] [67] [22] [46].
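As a concrete reading of Eq. (1.11), the following sketch scores a set of feature vectors under a single multivariate Gaussian; the function name and array layout are illustrative assumptions. A speaker identification decision then amounts to evaluating this score under each candidate model and choosing the maximum.

```python
import numpy as np

def gaussian_log_likelihood(X, mu, Sigma):
    """Total log likelihood of the vector set X under N(mu, Sigma), as in Eq. (1.11).

    X : (n, d) array of n feature vectors; mu : (d,); Sigma : (d, d)
    """
    n, d = X.shape
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)  # log |2*pi*Sigma|
    diff = X - mu
    # Mahalanobis terms (x_i - mu)^T Sigma^{-1} (x_i - mu), computed for all i at once
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (n * logdet + maha.sum())
```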
1.6 Entropy and Kullback-Leibler Distance

Entropy is a measure of the randomness or uncertainty of a sequence drawn from a discrete distribution over a set of symbols with probabilities $P_i$, $i = 1, 2, \ldots, m$:

\[ H = -\sum_{i=1}^{m} P_i \log_2 P_i, \tag{1.12} \]

where entropy is measured in bits. It is the average uncertainty per symbol. Entropy is nonnegative and can be zero only if the distribution is deterministic, i.e., $P_i = 1$ for exactly one symbol. In the continuous case, the entropy is

\[ H = -\int_{-\infty}^{\infty} p(x) \ln p(x)\, dx. \tag{1.13} \]

For continuous cases, we usually use the natural logarithm instead of the logarithm base 2. Another property of entropy is

\[ H = E[-\log_2 P_i] \le E[-\log_2 Q_i] = -\sum_{i=1}^{m} P_i \log_2 Q_i, \tag{1.14} \]

where equality occurs if and only if $P_i = Q_i$, $i = 1, 2, \ldots, m$.

To classify estimated models, we need to measure the distance between two statistical models. The Kullback-Leibler distance (relative entropy) is an appropriate measure of the distance between distributions:

1. The discrete case:
\[ D_{KL}(p(x), q(x)) = \sum_{x} p(x) \ln \frac{p(x)}{q(x)}. \tag{1.15} \]

2. The continuous case:
\[ D_{KL}(p(x), q(x)) = \int_{-\infty}^{\infty} p(x) \ln \frac{p(x)}{q(x)}\, dx. \tag{1.16} \]

This measure is not easy to calculate in practice when the statistical models are complex. The Monte Carlo method is commonly used to numerically approximate the KL distance. However, this approximation may vary across computations due to the randomness of the Monte Carlo approximation. Hence, we calculate the upper bound of the KL distance using the closed form between two d-dimensional Gaussians, as follows [11] [73] [62]:

\[ D\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(\hat{\mu}, \hat{\Sigma})\big) = \frac{1}{2}\left[ \log \frac{|\hat{\Sigma}|}{|\Sigma|} - d + \mathrm{tr}\big(\hat{\Sigma}^{-1}\Sigma\big) + (\mu - \hat{\mu})^{T} \hat{\Sigma}^{-1} (\mu - \hat{\mu}) \right]. \tag{1.17} \]

However, it is not a true distance, since it is not symmetric and does not satisfy the triangle inequality [74] [11]. To obtain a symmetric KL distance, we often use the following measure, the average of the two asymmetric distances:

\[ D_{KLS}(p(x), q(x)) = \frac{D_{KL}(p(x), q(x)) + D_{KL}(q(x), p(x))}{2}. \tag{1.18} \]
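Eqs. (1.17) and (1.18) translate directly into code for full-covariance Gaussians. The sketch below is a minimal rendering, assuming NumPy arrays for the means and covariance matrices; the function names are illustrative.

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """D( N(mu0, S0) || N(mu1, S1) ) for d-dimensional Gaussians, Eq. (1.17)."""
    d = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (logdet1 - logdet0 - d
                  + np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff)

def kl_symmetric(mu0, S0, mu1, S1):
    """Symmetrized KL distance of Eq. (1.18)."""
    return 0.5 * (kl_gauss(mu0, S0, mu1, S1) + kl_gauss(mu1, S1, mu0, S0))
```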
1.7 Hypothesis Testing

[Figure 1.5: An example of hypothesis testing: two overlapping score densities separated by a decision threshold.]

Hypothesis testing is deciding on the validity of a hypothesis. The aim is not to determine whether a hypothesis is true, but to establish whether the evidence supports rejecting it [47] [56] [20]. Suppose that there are two hypotheses. Let $H_0$ be the hypothesis that the user is an impostor, and let $H_1$ be the hypothesis that the user is the claimed speaker. The scores of the observations form probability density functions according to whether the user is the claimed speaker or an impostor [68]. Let $p(z|H_0)$ be the conditional density function of the observation score, z, generated by speakers other than the claimed speaker (impostors), and let $p(z|H_1)$ be that for the claimed speaker. Then the likelihood ratio is

\[ \Lambda(z) = \frac{p(z|H_0)}{p(z|H_1)}. \tag{1.19} \]

If $\Lambda(z) > T$, choose $H_0$; otherwise choose $H_1$. The threshold, T, is set for minimum-error performance.

The problem of speaker identification differs significantly from the above hypothesis testing. In this case, the system needs to choose one of N speakers in the population. Hence the decision rule is to choose speaker i such that

\[ p_i(z) > p_j(z), \quad j = 1, 2, \ldots, N, \; j \neq i. \tag{1.20} \]

The speaker with the minimum probability of error is chosen [61].

1.8 Unsupervised Speaker Indexing

Speaker indexing can be divided into two categories based on processing requirements: on-line and off-line. Both on-line and off-line indexing can be executed only sequentially, a characteristic of speaker indexing. Off-line speaker indexing can be used for record keeping, but it is not appropriate for real-time meeting or teleconferencing systems that demand on-line processing. One of the main technical differences between off-line and on-line speaker indexing is the feasibility of multiple passes over the same data. In off-line indexing, it is possible to use various speaker indexing algorithms at each iteration; in on-line indexing, only one strategy can be used throughout the sequential process. Recently we proposed an on-line method that picks out the speech segments from an audio stream and classifies them by speaker [30].

A block diagram of the speaker indexing process is shown in Fig. 1.6.

[Figure 1.6: Block diagram of the unsupervised speaker indexing process with generic models: front-end analysis → speaker change detection → clustering with generic models → model adaptation.]

The first step is front-end analysis, where the incoming audio samples are classified into foreground speech and other background audio (noise) types. Generally, audio data can be categorized into four broad classes: speech, music, environmental noise, and silence. In speaker indexing, we only need speech/non-speech discrimination. When there is background noise or music, it is likely to overlap with speech, and corrupted speech is not easily discriminated from noise. Since it is critical that we not lose any speech data, the focus of the classification is to minimize false rejection, perhaps even at the cost of false acceptance. Usually, for speech/non-speech discrimination, the zero-crossing rate and short-time energy are used [41]. A speech signal usually has a higher silence ratio and level of variation in zero-crossing rate than music. From the result of audio segmentation, we collect all segments classified as speech, whether clean or corrupted by music or other background noise [42].

[Figure 1.7: An example of speaker indexing (x-axis: time (msec), y-axis: energy): (a) original speech data sequence extracted from HUB-4 Broadcast News (1999); (b)-(d) speech data sequences indexed as Speakers 1, 2, and 3.]

Only the speech data are used for the next step, speaker change detection. In this step, the system sequentially detects whether a speaker change occurs in the middle of a speech analysis frame, without assuming any specific knowledge about the speakers. Some of the challenges faced by this detection problem are addressed in Chapter 2.

Once the speaker change detection determines a boundary, all the data between the speaker change points are used for speaker clustering. In the clustering step, we use speaker models from a predetermined generic model set. After clustering, the selected speaker-independent generic model is adapted into an appropriate speaker-dependent model during the indexing process. The adapted model either replaces the original model or is inserted back into the generic model set. When new audio samples beyond the boundary of the current speaker come into the system, the previous steps are repeated until all data are exhausted [33].
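The control flow of Fig. 1.6 can be summarized in a few lines of schematic Python. This is a structural sketch only: the helpers (is_speech, detect_change, cluster, adapt) are hypothetical stand-ins for the components described above and developed in Chapters 2 and 3.

```python
def index_stream(audio_stream, generic_models):
    """Sequential (on-line) unsupervised speaker indexing, following Fig. 1.6."""
    clusters = []  # one adapted model per speaker cluster found so far
    buffer = []    # speech frames accumulated since the last detected change
    for frame in audio_stream:
        if not is_speech(frame):       # front-end analysis: drop noise/music/silence
            continue
        buffer.append(frame)
        if detect_change(buffer):      # speaker change detection (Chapter 2)
            model = cluster(buffer, generic_models, clusters)  # closest model wins
            clusters.append(adapt(model, buffer))              # MAP-style adaptation
            buffer = []                # start collecting the next speaker's data
    return clusters
```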
They used only 1 second segments, and these might be too short to build an initial speaker model. In addition, some segments including the speech of more than 2 speakers could not be correctly clustered without the speaker change detection. Rosenberg et al. used the General ized Likelihood Ratio (GLR) Test for initial segmentation of speaker indexing. After initial segmentation, speaker models were constructed and then repeatedly segmented. Their process focused on the iterative segmentation and clustering th at was only for the off-line speaker indexing systems [66], Liu used the Hybrid speaker clustering method, which utilized both the dispersion and GLR threshold [38]. 16 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . In many previous scenarios of speaker indexing or clustering, there is no prior knowl edge about the identity or the number of speakers involved. The speaker indexing and model construction can be performed sequentially without storing all the testing data in advance. However, the problem is that sequentially constructed models may not repre sent speakers well due to model initialization problems. In addition, when the training of speakers is not supervised, this problem also potentially leads to continual error propagation. W ithout good initial models for speaker indexing, we cannot effectively build/update speaker models sequentially and incrementally. Recall that sequentially constructed models in the unsupervised indexing scenario cannot represent speakers well due to the small initial amount of data. We try to solve this critical drawback by employing an alternative m ethod of using the notion of generic speaker models [33]. 1.10 C ontribution of this thesis To enable speaker indexing, ideally we need information about speakers such as the number of speakers and the appropriate speaker models. However, in some scenarios, it is not easy to obtain a priori information about the target speakers in the data, including the number of speakers, in advance. Consider for example speaker indexing applied to broadcast news (interviews). It may not be easy to obtain information about the reporters and interviewees in advance. Hence, unsupervised speaker indexing may be required. Assuming one is using streaming audio, we are limited to making any indexing decision with only current and previously seen speech data from the session. Furthermore, since the models of speakers are not available a priori for indexing, we 17 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . need to create and update them on the fly. This leads to a number of challenges. In general, under these circumstances of sequential learning, data are not sufficient to build a speaker model initially. Although a model can be roughly built, it is apt to cause decision errors due to potential uncertainty in the unsupervised learning. To address the problem, we need some method to enable effective model bootstrapping [31]. This proposal aims to contribute to 1. effective speaker change detection: Localized Search Algorithm (LSA). 2. model bootstrapping for unsupervised scenarios: Sample Speaker Models (SSM). 3. proper sampling method for a generic model initialization: Speaker Quantization (SQ). 4. analytical investigation into determining capacity of unsupervised speaker index ing. 1.11 O utline The efficient speaker change detection method is proposed in Chapter 2. 
Before clustering, the detection of speaker changing points is required. This step sequentially binds data segments according to speakers, which helps to improve the performance of indexing. There are two issues: the size of the analysis segments and the specific analysis approach. The size of the analysis segments is usually fixed. A large analysis segment size is useful for improved indexing decisions, as it includes more information about the speakers. However, it is apt to miss speaker changes that may occur within an analysis segment. To solve this problem, a smaller analysis segment can be used, but this requires a robust speaker change detection process to maintain precision [30]. To detect the changing points robustly, sufficient overlapping across the analysis frames is required, resulting in higher computational complexity; without sufficient overlapping, changing points are easily missed. The Localized Search Algorithm (LSA) is adopted as a compromise between these conflicting requirements. Chapter 2 provides details about this algorithm. We use the Generalized Likelihood Ratio (GLR) test for speaker change detection. Though the GLR test can be unstable for small amounts of analysis data, clustering can help compensate for this instability.

The Sample Speaker Models (SSM) approach is proposed in Chapter 3. Two kinds of generic models have been proposed previously for this purpose: Universal Background Models (UBM) and Universal Gender Models (UGM). Here we propose a new method for creating and evaluating generic models, referred to as the Sample Speaker Models (SSM). This is built on the hypothesis that a speech data corpus, independent of the target data, can help initialize a model set for unsupervised speaker indexing. The sample model set is predetermined at training time, and samples can be randomly picked from a pool of generic speaker models using the Markov Chain Monte Carlo (MCMC) method. The generic model set can be used for initializing/bootstrapping any speaker indexing process, and can be referred to during speaker clustering with the target test data. After clustering, a selected model can be continually adapted with the test data used for clustering [Fig. 1.6]. The model adaptation step uses the Maximum a Posteriori (MAP) scheme. The experiments were conducted on data from
It also showed higher accuracy compared to other generic models such as the Universal Background Model (UBM) and Universal Gender Models (UGM) under the various experimental conditions: 2 speaker conversations, 4 speaker conversations, and broadcast news. For more principled selection of samples to reduce error rates, we also propose a novel m ethod called Speaker Quantization (SQ) in Chapter 4. This m ethod is useful for sampling speaker models in that the feature space is properly quantized. SQ originates from Vector Quantization while it quantizes speaker models instead of vectors. The experiments were conducted on the same data used in Chapter 3. The efficiency of a novel sampling m ethod (Speaker Quantization) was compared with a random selection (Markov Chain Monte Carlo) in regard to the error rate of speaker indexing. The result showed that our Speaker Quantization (SQ) method outperformed the random selection. In Chapter 5, we explore analytical capacity of unsupervised speaker indexing: the relation between the number of speaker samples and error rates with respect to the 20 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . number of target speakers. In addition, we simulate a simplified case: two speaker indexing with one-dimensional Gaussian models. From this simulation, we got useful insights th at the similarity between two speakers to be indexed is very im portant to determine the number of samples. In Chapter 6 , we discuss and conclude our study on the Speaker Change Detection method, Sample Speaker Models, Speaker Quantization, and analytical understanding the capacity of Unsupervised Speaker Indexing. Future work is also described in the end of Chapter 6 . 21 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . C hapter 2 Speaker C hange D etection 2.1 Introduction The goal of unsupervised speaker change detection is to detect the points where the speaker changes in the middle of a speech sequence without any knowledge about the identity and number of speakers. The robust speaker change detection is a critical prerequisite for speaker clustering. If we falsely detect a speaker changing point, we may compensate for the error through the speaker clustering step. However, if we totally skip the real changing point, the clustering step cannot recover it. In the speaker change detection step, the system sequentially detects whether a speaker changes in the middle of speech analysis frame assuming no knowledge about the identity and number of speakers. There are two kinds of segmentation: fixed and variable length segmentation. In fixed length segmentation, audio data is chopped into segments, the length of which is predetermined. Fixed length segmentation is simple, but relatively long segments are likely to include speaker changing points, while short segments do not have enough 22 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . speaker information. In variable length segmentation, the audio data is divided into variable length of segments depending on some factors. We can also categorize speaker change detection methods with respect to the measure of similarity between two neighboring segments: metric-based and model-based. The metric-based method employs the maximum point of an appropriately defined “metric” between neighboring segments for signaling detection. 
The model-based m ethod on the other hand relies on models for speakers, background noise, speech and music built in advance. The incoming audio stream is then classified, for example, by a maximum like lihood selection over a sliding analysis window [7], Model-based detection requires both training data and some information about the test data such as the number of speak ers. In contrast to the model-based methods, metric-based methods can be executed without such data requirements. However, it offers potential for progressive adaptation and regression against variability provided data is available. Unsupervised speaker indexing systems usually use the following methods: Euclid ean distance, Mahalanobis distance, the Bayesian Information Criterion (BIC), and the Generalized Likelihood Ratio (GLR) tests. Euclidean distance and Mahalanobis dis tance measure the dissimilarity between two neighboring analysis segments. We can also define the corresponding weighted version [29]. The usual weight is the variance of feature within a cluster. It gives smaller weights to some feature vectors with large variances and larger weights to some feature vectors with small variances. However, it reflects only intraclass information of feature vectors. Interclass information also plays an im portant role in clustering. For that reason, we usually use the variance of two 23 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . groups of vectors in two neighboring segments as weights. Even though this weighted Euclidean distance measure is better than the others, it is not so robust to environment noise. The Generalized Likelihood Ratio (GLR) test and the Bayesian Information Cri terion (BIC) m ethod are based on the comparison of two statistical models from two adjacent segments. To detect speaker changes, we use an analysis window which consists of two segments. The two segments within the window are compared using the GLR test [ 6 6 ] [3] [16] [72] . Suppose there are two feature vector sets, X j and Xg, coming from each segment, respectively. Hypothesis, Ho, is that the speakers in two segments are the same, while hypothesis, Hi, is that the speakers are different. Let L(Xi; Ai) and L(Xg; Ag) be the likelihood of Xi and Xg where Ai and Xg represent model parameters that maximize each likelihood. Similarly let X be the union of X i and Xg. L(X; Xi+g) is the maximum likelihood estimate for X. Gaussian models are used here, and A in cludes the mean and variance of the Gaussian model which are obtained from the data of each segment. Then (~ < T D IJ{.X, Xl+2) /n p L(X r;A ,)L(X g;A g)' ( ' } W hen two segments correspond to the same speaker, GLR value goes up to 1, and, otherwise, it falls to zero. We apply a preset threshold, which is empirically obtained, on GLR to determine the latent changing point [33], 24 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . BIC is a likelihood criterion penalized by the model complexity: the number of model parameters [7] [49] [70]: BIC(H) = logL(X\H) - ^v(H)logN , (2.2) where X is a vector set from a segment, H is a hypothesis, v(M) is the complexity of a model representing the hypothesis (H), and A is a tunable paremeter. The BIC procedure is to choose the model with which the BIC criterion is maximized. 
The BIC difference of two competing models can be seen as an approximation to the logarithm of the Bayes factor [79], For example, there are two feature vector sets, X j and Xs, coming from each segment, respectively. Hypothesis, Ho, is th at two segments are from the same speakers, while hypothesis, Hi, is th at the speakers are different. Let L {X f,\i) and L(Xg; Ag) be the likelihood of X i and X 2 where Xi and A 2 represent parameters from two different models. Additionally let X be the union of X i and X 2 with likelihood for X ,L(X ; A1+2)- The difference of two BIC values is: A B , C= + <23> If A B IC is positive, it means that the two neighboring segments are relative to the same speaker. BIC has some advantages: robustness and threshold-free. However, its computation is costly. The Generalized Likelihood Ratio (GLR) test is similar to the Bayesian Information Criterion in that it compares two competing models, but it is simpler and 25 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . less complex to compute. We have adopted the Generalized Likelihood Ratio (GLR) test in our unsupervised speaker indexing algorithm [33] [25]. 2.2 Silence D etection B ased Search A lgorithm Silence can be supposed to be one of the latent points of speaker changes. In the middle of speaking, people usually breathe. Speaker changes are not likely to occur there between breathing points. In the unsupervised variable size segmentation, audio data is segmented by silence points in the middle of the sequence of speech. The silence point is defined as a certain period within which the energy of a signal stays below the threshold. To avoid potential problems of false alarm, some restrictions are placed on the length of segments [Fig. 2.1]. For example, we can define a silence segment as a period that is longer than 100 msec in length and lower than -40dB in energy. If the length of a segment is shorter than a threshold, it is merged with neighboring segments depending on the length of adjacent silence segments [29] [42] [19] [44]. For example, if front-end silence of a segment is longer than rear-end silence, then the segment is merged with the next following segment. The period and threshold can be determined depending on the characteristics of the data sources [Fig. 2.2]. This segmentation technique might lower the number of segments processed without losing the merit of a short fixed segment method. However, it still has problems in noisy environments. If there is some global noise or local noise near silence points, it is not easy to detect silences using a preset threshold. In addition to robustness, it is difficult to define the length of silence. Suppose that it usually takes longer than 100 msec for a 26 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . Multi-Speaker Audio Stream No Length of a Segment > Threshold Yes Silence Detection Similarity Measure of Two Neighboring Segments Figure 2.1: A block diagram of speaker change detection using the Silence Detection Based Algorithm human breath. If a set of test data is not noisy, then we can easily detect the speaker changing points. However, if the second person immediately start speaking without any silence period after the end of the first speaker, then we may miss the changing point. 
To further improve the performance of the unsupervised speaker change detection, we need a more robust segmentation algorithm followed by an adaptive distance measure technique. 27 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . Audio Stream Speaker 2 Speaker 1 Speaker 1 Speaker 2 Silence Silence Silence Similarity Measure Detected Changing Point Figure 2.2: An example of speaker change detection using the Silence Detection Based Algorithm 2.3 Localized Search A lgorithm (LSA) The length of the analysis segment used for the speaker change detection can be either variable or static. In variable length segmentation, the speech stream is divided into different lengths depending on several factors such as pauses and background changes. However, static segmentation assumes a fixed analysis segment. A static segmentation is attractive since it is computationally simple but care has to be taken while choosing the analysis segment length. Too short an analysis segment may not provide adequate data for analysis while a longer analysis segment may likely miss a speaker change point. It has been found in previous experiments that 2 seconds is the proper length of analysis segments with respect to accuracy and robustness [29]. The other point to be thought of is the shift size of an analysis window. While it is sure that small shift gives less errors, the computational complexity should be considered. Suppose that the length of speech sequence is L and the number of speaker changes is C. If the size of shift is I, then the number of iterative computation is L/l. In 28 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . Multi-Speaker Audio Stream No GLR < Threshold Yes Analysis Window Smaller Shift Analysis Window Shift Detection of a speaker changing point Figure 2.3: A block diagram of speaker change detection using the Localized Search Algorithm (LSA) this case, the error margin will be 1/2. However, if the size of shift is l/f, (/ > 1), the number of iteration would be / times more and the error margin would be l / f times less. If shift is usually I and it gets smaller, l/f, for a fine search th at runs only near latent speaker changing points, the total number of computation, N, is simply N = ^ + (Cxf), (2.4) where C is properly smaller than L/l . It means th at the LSA can reduce the number of iterations with a smaller error margin. 29 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . Fig. 2.3 shows the process of our speaker change detection. Each analysis segment we considered was 2 seconds long, and the total analysis window, which includes two analysis segments, considered at any point is 4 seconds long. This implies th at we cannot detect a speaker change occurring within an analysis segment shorter than 2 seconds. Smaller analysis window shifts (e.g., 0.2 sec) could lead to finer resolution [ 6 6 ], But computational complexity severely increases with the number of analysis windows. For example, if the analysis window shifts by 0.2 second, we need 10 times more Generalized Likelihood Ratio calculations than in the 2 second shift case. To solve this problem, we propose a Localized Search Algorithm (LSA). Our algorithm seeks a compromise between accuracy and efficiency. Fig. 2.4 shows an illustration of how this algorithm works. 
We assume that the boundary between the two analysis segments is the potential speaker changing point. The analysis window consists of two analysis segments, the first of which is the reference for speaker change detection. The speech data in the second analysis segment are compared with the reference to detect whether they contain speech from the same speaker or not. An analysis window that shifts with a 3-second overlap is not appropriate for detecting the exact speaker changing positions; in other words, we cannot detect a changing point within a 1-second duration with this amount of shift. For that reason, the analysis window first shifts by 1 second. When the GLR falls below the threshold, the data in the second analysis segment of the current analysis window may include a latent speaker changing point. Then the Localized Search Algorithm starts running with a 0.2-second analysis window shift to enable a finer search [66]. There are 10 candidates, one of which indicates the true speaker changing point [33].

Figure 2.4: An example of speaker change detection using the Localized Search Algorithm (LSA): An analysis window (4 seconds) looks for the exact speaker changing point near the potential boundary by comparing the two analysis segments (2 seconds each) of the analysis window. The first analysis segment is the reference segment, and the second analysis segment is compared with the reference segment using the GLR test. When the analysis window detects a latent speaker change point, it shifts by 0.2 seconds to enable a finer search, yielding a total of 10 ratios from GLR tests. The minimum is chosen, implying the highest probability of a speaker change. The boundary between the two analysis segments is recognized as a true speaker changing point.
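The GLR statistic and the coarse-to-fine search described above can be sketched as follows; modeling each segment with a single full-covariance Gaussian and resuming the scan just after a detected change are assumptions of this illustration, not necessarily the exact configuration used in our system:

import numpy as np

def log_glr(X1, X2):
    """Log GLR of H0 (both segments from one speaker, one Gaussian over the
    union) against H1 (one Gaussian per segment); low values favor a change.
    X1, X2: (frames x dims) arrays of multi-dimensional features."""
    ld = lambda Z: np.linalg.slogdet(np.cov(Z.T, bias=True))[1]
    n1, n2 = len(X1), len(X2)
    X = np.vstack([X1, X2])
    return -0.5 * ((n1 + n2) * ld(X) - n1 * ld(X1) - n2 * ld(X2))

def localized_search(feats, fps, thresh, seg_s=2.0, coarse_s=1.0, fine_s=0.2):
    """Coarse scan with 1 s shifts of a 4 s window (two 2 s segments); on a
    threshold crossing, rescan the neighborhood with 0.2 s shifts and take
    the boundary with the minimum ratio as the changing point."""
    seg, coarse, fine = int(seg_s * fps), int(coarse_s * fps), int(fine_s * fps)
    changes, t = [], 0
    while t + 2 * seg <= len(feats):
        if log_glr(feats[t:t + seg], feats[t + seg:t + 2 * seg]) < thresh:
            last = min(t + coarse, len(feats) - 2 * seg)
            cands = range(t, last + 1, fine)   # the ~10 fine-shift candidates
            best = min(cands, key=lambda s: log_glr(feats[s:s + seg],
                                                    feats[s + seg:s + 2 * seg]))
            changes.append((best + seg) / fps)  # boundary time in seconds
            t = best + seg                      # resume after the detected change
        else:
            t += coarse
    return changes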
2.4 Some Further Thoughts

Speaker change detection is a very important step in unsupervised speaker indexing. If we can detect changes accurately, speaker clustering is easier and more accurate. However, it is not easy to detect speaker changing points due to the lack of data, the variability of the speech signal, and environmental noise. More sophisticated algorithms may overcome these difficulties, but no such method is available so far. To improve the speaker change detection algorithm, we can incorporate other features representing the speed and habits of speaking, which have not yet been deeply explored. Multi-modal features, such as expression, emotion, gaze, and gesture, can also be useful for improving the performance of speaker change detection.

Chapter 3

Generic Models for Bootstrapping

3.1 Introduction

To build effective speaker models, sufficient training data are required. In the unsupervised scenario, there is no prior knowledge about the speakers. When the speaker indexing process starts, only the data seen thus far can be used for modeling, due to the sequential nature of the indexing process. Such models, constructed roughly on-the-fly, can cause severe clustering errors. The key issue here is finding a method for alleviating the model initialization problem: it is difficult to build speaker models for speaker indexing without adequate prior knowledge/data about target speakers. The idea of generic models offers a promising alternative. We can create generic models of speakers that are independent of the test-set speakers, with the hypothesis that some speakers of the reference set are acoustically close to the target (test) speaker and can be adapted to be closer with new data [76]. Although we do not know the exact number of target speakers, we assume that the number is finite. With this assumption, the initial generic models are built through training with data not directly related to the test condition. This makes it possible for the speaker indexing system to operate without training of true (target) speaker models [33]. There are at least two possibilities that one can consider for creating generic models: the Universal Background Model (UBM) and Universal Gender Models (UGM). For example, suppose there are M male speakers and N female speakers in the generic speaker data pool. The UBM is built by pooling the entire data of the (M + N) speakers. UGM includes two models: one for male speakers (trained with data from the M male speakers), and one for female speakers (trained with the N female speakers) [33]. Solomonoff introduced a metric based on purity and completeness of clusters for speaker clustering. With this method, even though it is not necessary to train speaker models, it is not very robust to environmental noise [72]. Other efforts have been reported on on-line speaker segmentation and clustering without prior knowledge of speakers and speaker models. The Universal Background Model (UBM) was used to classify feature vectors by Wu [77]. Even though the UBM is a generic model, it cannot reflect the information of a single speaker well since it is built with data from many different speakers. Ajmera and Woosters tried another method for speaker clustering without any prior knowledge of the identities or the number of speakers. More models than the number of target speakers in the conversations are built and distributed in the feature space in advance, and then the target speakers are initially indexed. After initial segmentation, some of the indexed models are merged using BIC [2]. However, the merged speaker model might not represent a single speaker well under false clustering and merging. They also need to find the optimal number of initial models and an efficient method of cluster merging. Here we propose a new generic model set, Sample Speaker Models (SSM), to improve the unsupervised speaker indexing system without trained models [33].

3.2 Sample Speaker Models (SSM)

Suppose there are (M + N) speakers in the pool. First, we pick S speakers, where S is smaller than the total number of speakers in the data pool and bigger than the total number of (target) speakers in each conversation, and then construct S speaker models. While UBM and UGM involve "averaging" across a number of speakers, SSM does not [Fig. 3.1].
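As an illustration, the three kinds of generic models could be built from a pooled data set along the following lines; the scikit-learn GMMs with 16 diagonal-covariance mixtures match the modeling choices reported later in this chapter, while the dictionary-based pool layout is assumed purely for the example:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_mix=16):
    """Fit a 16-mixture, diagonal-covariance GMM to a (frames x dims) matrix."""
    return GaussianMixture(n_components=n_mix, covariance_type='diag',
                           max_iter=200, random_state=0).fit(features)

def build_generic_models(pool):
    """pool: dict mapping speaker id -> (gender, feature matrix).
    Returns the three generic model sets discussed above."""
    ubm = train_gmm(np.vstack([f for _, f in pool.values()]))      # UBM: one model, all data
    ugm = {g: train_gmm(np.vstack([f for gg, f in pool.values() if gg == g]))
           for g in ('m', 'f')}                                    # UGM: one model per gender
    # One model per pooled speaker; the SSM set is then sampled from these
    # per-speaker models (e.g., by MCMC as described below).
    ssm_pool = {spk: train_gmm(f) for spk, (_, f) in pool.items()}
    return ubm, ugm, ssm_pool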
To investigate our assumption, we calculated KL distances between initial generic models and target speaker models using cross validation. A telephone speech data set (500 speakers) was used for building the generic models. Fig. 3.2 shows an experimental example of KL distances between generic models and target speaker models. In the SSM case, the similarity between sample models and targets clearly depends on the number of samples. Referring to Fig. 3.2, the KL distance of SSM is much shorter than the others since we use 400 samples. Of course, the number of SSM samples is usually much larger than the number of gender models; experimentally, however, whenever the number of samples exceeds 2, the KL distance of SSM is already shorter than the others. In conclusion, as initial speaker models, SSM is more similar to target speaker models than UBM and UGM with respect to relative entropy (KL distance).

Figure 3.1: Generic Models: (a) Universal Background Model (UBM): the entire speaker data in the pool is used to create a single model; (b) Universal Gender Models (UGM): the data is used to create 2 gender models; (c) Sample Speaker Models (SSM): speaker models are selected from the generic speaker data pool by the proposed sampling method.

When we use SSM as a generic model, we have to address two problems. The first concerns the number of sample models needed, and the other concerns the sampling method required for constructing the SSM set. At present we lack an analytical way of seeking an optimal choice for these parameters. We evaluate them empirically in each case: for example, for the size of SSM, 16 is optimal for 2-person conversations, and 32 is optimal for 4-person conversations. We assume that this optimal number varies with the number of speakers and the type of speech data, including how the features are defined [33].

The Markov Chain Monte Carlo (MCMC) approach offers a promising means for model sampling. Monte Carlo methods are computational techniques that make use of random numbers.

Figure 3.2: Comparison of KL distances between various types of generic models and target speaker models: 1. UBM, 2. UGM, and 3. SSM. On average, SSM is closer to the target than the other models.

One use of the Monte Carlo method is to generate samples from a given probability distribution. We used the Metropolis algorithm, which is an instantiation of a Markov Chain Monte Carlo (MCMC) method. The Metropolis method is widely used for high-dimensional problems [43] [9] [39] [14]. It generates samples by running an ergodic Markov chain which converges to a target distribution function f. For an arbitrary starting value x^{(0)}, a chain (X^{(t)}) is generated using a transition kernel with stationary distribution f, which ensures the convergence of (X^{(t)}) to a random variable from f. Thus, for a "large enough" T_0, X^{(T_0)} can be considered as distributed under f. The number of samples, n, can be predetermined, and the samples (X^{(T_0)}, X^{(T_0+1)}, ..., X^{(T_0+T_n)}) are generated from f according to the Metropolis criterion [65].
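For concreteness, a minimal random-walk Metropolis sampler of the kind described above might look as follows; the isotropic Gaussian proposal and the fixed burn-in are illustrative assumptions:

import numpy as np

def metropolis(f, x0, n_samples, burn_in=1000, step=1.0, seed=0):
    """Draw `n_samples` points from an (unnormalized) density f with a
    random-walk Metropolis chain using an isotropic Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = []
    for t in range(burn_in + n_samples):
        cand = x + step * rng.standard_normal(x.shape)
        # Metropolis criterion: accept with probability min(1, f(cand)/f(x)).
        if rng.random() < min(1.0, f(cand) / max(f(x), 1e-300)):
            x = cand
        if t >= burn_in:
            out.append(x.copy())
    return np.asarray(out)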
In our speaker indexing system, we applied the Markov Chain Monte Carlo (MCMC) method to choose sample speaker models [33]. It can be briefly summarized as follows:

1. The target distribution of the sampling space is estimated from the Universal Background Model (UBM) that represents all the speakers in the pool.

2. A predetermined number of sample vectors are chosen by the Metropolis criterion from this initial distribution (a normal distribution is assumed). The mean of this distribution is obtained from the centroid of the Universal Background Model (UBM).

3. For every sample vector and every speaker model, the likelihood is calculated.

4. The speaker model that provides the maximum likelihood of a sample vector is chosen.

3.3 Clustering and Model Adaptation

The segments obtained from the speaker change detection step are indexed into models in terms of speakers, and then the corresponding models are adapted with the newly indexed data. Several general methods exist: agglomeration, stopping criteria, distance measures, and set distances [48]. However, in our unusual clustering condition, our method differs from these general solutions. We experimented with speaker models from the predetermined generic model sets: UBM, UGM, and SSM. In the UBM and UGM cases, the generic pooled model was adapted to create the speaker-specific models. Since speaker indexing is a sequential process, the first speaker model is always created from the UBM (or UGM) using the first speaker segment. For the next speaker segment, the model just constructed is assumed to be a speaker model; however, if the likelihood of the second speaker segment is lower than the threshold, a new speaker model is created by bootstrapping from the UBM (or UGM). Subsequent speaker segments are sequentially clustered in a similar manner: each time, the new data segment is evaluated against all available speaker models. Whenever new speakers are detected, the number of models hence increases in the UBM and UGM cases. With SSM, the generic speaker models are adapted into speaker-specific models [Fig. 3.3]. The likelihood of every speaker segment is calculated with the sample speaker models, and the model with maximum likelihood is selected and adapted sequentially. The number of models is kept the same for the entire data [Fig. 3.4]. For example, suppose there are S single-speaker models (m_i, where i = 1, ..., S) that represent S clusters. When a segment comes into the clustering unit, the maximum (log) likelihood, l, is calculated with each sample speaker model (m_i):

\hat{i} = \arg\max_i \; l(\mathrm{segment} \mid m_i).   (3.1)

After the ML calculation, the segment is assigned to the sample speaker model (m_{\hat{i}}) representing cluster \hat{i}. The remaining clusters without assigned segments are deleted. In the end, the number of clusters will be equal to the number of speakers [Fig. 3.5].
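A sketch of this maximum-likelihood assignment and cluster-pruning step (Eq. 3.1) follows, assuming scikit-learn GMMs whose score() method returns the average log-likelihood of a segment's frames:

def index_segments(segments, models):
    """Assign each segment to the sample model with maximum log-likelihood
    (Eq. 3.1), then drop clusters that received no segment.
    segments: list of (frames x dims) arrays; models: dict id -> fitted GMM."""
    assignments = {k: max(models, key=lambda i: models[i].score(seg))
                   for k, seg in enumerate(segments)}
    survivors = set(assignments.values())
    return assignments, {i: m for i, m in models.items() if i in survivors}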
Figure 3.3: Model Adaptation: from generic speaker models into speaker-specific models

Figure 3.4: Clustering with Sample Speaker Models (SSM) and Adaptation

Figure 3.5: Assigning segments to clusters: only clusters assigned to segments survive.

Model adaptation is executed by the Maximum a Posteriori (MAP) scheme [20]. As the amount of data increases towards infinity, the MAP estimate converges to the ML estimate [75]. MAP adaptation on a Gaussian Mixture Model (GMM) is straightforward [40]. Given the adaptation vectors X = {x_1, x_2, ..., x_T}, we compute the probability Pr(i | x_t):

\Pr(i \mid x_t) = \frac{w_i \, p_i(x_t)}{\sum_{j=1}^{M} w_j \, p_j(x_t)},   (3.2)

where w_i is the weight of each mixture in the GMM, p_i is the probability of the input, x_t, in each mixture, and M is the number of mixtures. In this system, the means, \mu, and weights, w, of the GMM are updated as follows:

\hat{\mu}_i = \alpha_i^m E_i(x) + (1 - \alpha_i^m) \mu_i,   (3.3)

\hat{w}_i = \left[ \alpha_i^w n_i / T + (1 - \alpha_i^w) w_i \right] \gamma,   (3.4)

where \gamma is a scale factor ensuring the weights sum to unity, and \alpha_i^m and \alpha_i^w are data-dependent adaptation coefficients defined as:

\alpha_i^{\rho} = \frac{n_i}{n_i + r^{\rho}}, \quad \rho \in \{m, w\},   (3.5)

where r^{\rho} is the fixed relevance factor, and the sufficient statistics of the mixtures, n_i, and the re-estimates of the mixtures, E_i(x), are defined as:

n_i = \sum_{t=1}^{T} \Pr(i \mid x_t),   (3.6)

E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} \Pr(i \mid x_t) \, x_t.   (3.7)

We assume that speaker models in the reference set are independent of the test speech data. A desirable property of the generic models is hence to ensure rapid adaptation to the true speaker models. Furthermore, the acoustic environments of the generic models and the test speech stream might be different. If the difference is large, we need to compensate additionally for such effects. To address this problem, for example, we may use the first speaker segment to adjust for the channel difference [33].
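The mean and weight updates of Eqs. (3.2)-(3.7) can be sketched in a few lines of numpy for a diagonal-covariance GMM; treating the scale factor gamma as a final weight renormalization and the particular relevance-factor value are assumptions of this illustration:

import numpy as np

def map_adapt(weights, means, variances, X, relevance=16.0):
    """MAP adaptation of GMM means and weights (Eqs. 3.2-3.7) for a
    diagonal-covariance GMM; `relevance` is the factor r in Eq. (3.5)."""
    M, T = len(weights), len(X)
    log_p = np.empty((T, M))
    for i in range(M):
        diff = X - means[i]
        log_p[:, i] = (-0.5 * np.sum(diff ** 2 / variances[i], axis=1)
                       - 0.5 * np.sum(np.log(2 * np.pi * variances[i])))
    log_post = np.log(weights) + log_p                 # numerator of Eq. (3.2)
    log_post -= log_post.max(axis=1, keepdims=True)    # for numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)            # Pr(i | x_t), Eq. (3.2)
    n = post.sum(axis=0)                               # Eq. (3.6)
    E = (post.T @ X) / np.maximum(n, 1e-10)[:, None]   # Eq. (3.7)
    alpha = n / (n + relevance)                        # Eq. (3.5)
    new_means = alpha[:, None] * E + (1 - alpha[:, None]) * means  # Eq. (3.3)
    new_weights = alpha * n / T + (1 - alpha) * weights            # Eq. (3.4)
    new_weights /= new_weights.sum()                   # gamma: renormalize to 1
    return new_weights, new_means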
3.4 Experiments

We used two audio data sources: the 1999 Speaker Recognition Benchmark Corpus from NIST (1999) and the HUB-4 Broadcast News Evaluation English Test Material (1999). For the generic models, we used 100 speakers (50 male and 50 female) who were randomly selected from the training data in the NIST Speaker Recognition Benchmark Corpus. For training each speaker model, about one minute of speech data was used. We tested our speaker indexing system using independent portions of these data corpora. The primary experiment focused on unsupervised on-line speaker indexing. To investigate the convergence performance under multi-pass ("off-line") conditions, we repeated the indexing over several iterations. We performed two main experiments with a variety of speech materials: the first test focused on model adaptation and convergence behavior with various lengths of analysis windows and generic model types, while the second test investigated the overall performance with various generic models.

Specifically, in the first set of experiments, we evaluated speaker identification error rates with various lengths of analysis windows (i.e., 1, 2, 4, 8 seconds) and with the three types of generic models (UBM, UGM, and SSM) to find the most appropriate conditions for the speaker change detection and model adaptation steps. We randomly picked some speakers from the 100-speaker pool for SSM using the Markov Chain Monte Carlo (MCMC) method. Although the speaker pool consisted of 50 males and 50 females, the sampled speakers were not necessarily evenly distributed in gender. The Universal Background Model (UBM) and Universal Gender Models (UGM) were also built using data from the 100 speakers in the pool. For SSM, we assumed that 16 was an experimentally proper number of speaker models to use as a reference set for indexing speakers in the telephone conversation and broadcast news materials [30] [33].

The second experiment concerned exploring the performance of the generic models. The speaker indexing test included three tests with different speech data sets: 2-speaker conversations, 4-speaker conversations, and broadcast news. The first test material consisted of about 24 minutes of audio data from the Speaker Recognition Benchmark NIST Speech (1999). The length of each speech audio sequence was about one minute. Each sequence included a two-speaker telephone conversation. One third of the sequences included mixed-gender (one male and one female) conversations. The other sequences included two male or two female speakers. As for the second test material, about 24 minutes of audio data from the Speaker Recognition Benchmark NIST Speech (1999) were used. Since we needed exactly four-speaker conversations, we created them from the 2-speaker conversations ("artificial sequences") of the Speaker Recognition Benchmark NIST Speech (1999). No speaker used for building the generic models participated in the test conversations. One third of the sequences included mixed-gender (two males and two females) conversations. The other sequences included four male or four female speakers [33]. The third test material constituted about 45 minutes of audio data from the HUB-4 Broadcast News Evaluation English Test Material (1999). The broadcast news data included various categories of audio data and environmental conditions. Speaker types in this data included anchors, guests, interviewers, and interviewees. For our experiment, we considered 20 news clips representing different topics, with the number of speakers ranging from 2 to 6. We tested which of the three generic models (UBM, UGM, SSM) showed the best speaker indexing performance on these multiple-speaker conversations. In the Sample Speaker Models (SSM) case, we considered several sample set sizes (i.e., 8, 16, 32, 64, and 100 sample models) [33]. Since long silences have an adverse effect on speaker recognition, we eliminated as silence data segments that were longer than 100 msec and lower than -40 dB in energy. Experimental data were sampled at 8000 Hz. As feature vectors, we used 24-dimensional Mel-Frequency Cepstral Coefficients (MFCC) computed from 26 filterbank channels. We also used a 30 msec Hamming window that was shifted by 10 msec.
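Such a front end could be sketched with librosa as follows; note that librosa's MFCC pipeline (e.g., no pre-emphasis or liftering by default) may differ in detail from the front end actually used in these experiments:

import librosa

def extract_features(wav_path):
    """24-dimensional MFCCs from a 26-channel mel filterbank, computed with a
    30 ms Hamming window shifted by 10 ms on 8 kHz audio, as described above."""
    y, sr = librosa.load(wav_path, sr=8000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24, n_mels=26,
                                n_fft=240, win_length=240,  # 30 ms at 8 kHz
                                hop_length=80,              # 10 ms shift
                                window='hamming')
    return mfcc.T  # one 24-dimensional feature vector per frame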
The models of UBM, UGM, and SSM were Gaussian Mixture Models (GMM) with 16 mixtures [33].

3.5 Results

Only the speech portion extracted from the input audio stream was sequentially chopped into segments that were categorized in terms of speakers using the generic models. We calculate speaker indexing error rates as follows:

Indexing error rate = (Length of mismatched speaker index in time) / (Total length of sequence in time).   (3.8)

The first experiment concerned the convergence of model adaptation, and the results are shown in Table 3.1. In this experiment, we wished to determine what length of speaker segment is proper for indexing speakers under various conditions, such as the number of speakers in the speech sequence and the type of generic models used [33]. When the length of the segment increased (e.g., to 8 seconds) under the same model conditions, speaker indexing error rates decreased to almost 0% in most cases, although in some conditions the error rates slightly increased (the case with 4 speakers and UGM). This implies that more speech data in a segment carries more discriminatory information with which to represent a specific speaker. For example, when the number of speakers was 1, in the SSM case, the error rate decreased from 2.2% to 0% as the length of the segment reached 8 seconds. From our experiments, we found that a segment of 8 seconds provided the best choice (Table 3.1). No significant improvements were found for segments longer than 8 seconds.

Table 3.1: Indexing error rates as a function of average length of segment per speaker, number of speakers in the test sequence, and generic model type

  Avg. Length of     Generic       Number of Speakers
  Segment (sec)      Model Type    1        2       4
  1                  SSM           2.2%     0%      2.8%
                     UGM           3.9%     4.3%    6.6%
                     UBM           3.0%     3.0%    5.3%
  2                  SSM           1.6%     0%      0%
                     UGM           1.9%     0%      0.9%
                     UBM           2.0%     0%      0.9%
  4                  SSM           0.5%     0%      0%
                     UGM           0.5%     0%      1.8%
                     UBM           0.4%     0%      0%
  8                  SSM           0%       0%      0%
                     UGM           1.0%     0%      3.6%
                     UBM           0%       0%      0%

However, we may need shorter data segments to index shorter speaker turns. For example, consider a 20-second two-speaker conversation analyzed with 8-second analysis segments, and suppose that one of the speakers' speech lies between the 5 and 7 second marks in the sequence (measured from the beginning). Although the first 8-second analysis segment includes two speakers, it would be recognized as one speaker. To detect shorter speech episodes, we should use as short an analysis segment as we can. However, shorter (e.g., 1 second) segments could not capture a speaker's information adequately in our experiment: error rates significantly increased. We determined empirically from this first experiment that a 2-second analysis segment was a good compromise [33]. As the number of speaker candidates (participants) in a conversation increased, the error rates were expected to rise. For example, suppose that we had 4 speakers in a test sequence. Each speaker had about 1 minute of speech data, hence a total of about 4 minutes of data was used for speaker indexing. The point is that the indexing process was executed purely sequentially, without using any prior target speaker models.
While the first minute of segments passed through the indexing system, some speaker changing points might be falsely detected. The first adapted speaker model and the generic model(s) were compared with the group of segments hypothesized to be speech from a certain speaker. After one speaker model was adapted from the generic model, the two models were then compared. Whenever a speaker change occurred, the system looked for the next speaker model. As the number of newly detected speakers increased, the number of models to compare increased, which in turn might affect the overall error rate. Note that since this corpus consists of fairly clean land-line telephone conversations, there was no significant background noise to adversely affect the recognition. However, in some conditions, the number of speakers (participants) did not critically affect the error rate of speaker indexing. This result may imply that each speaker spoke for about one minute at a time without any speaker change, which might give enough speaker information to adapt the speaker models and help discriminate speakers well [33]. Along with the number of speakers, the similarity of the speakers may be an important factor affecting the error rates as well.

Table 3.1 shows results with the three types of generic models: Sample Speaker Models (SSM), Universal Gender Models (UGM), and the Universal Background Model (UBM). SSM provided the most stable performance across all experimental conditions, both in terms of the number of speaker candidates and the length of the analysis segment. The lower error rates also implied that the concomitant model adaptation was better. Recall that whenever a segment was assigned to a speaker, the corresponding speaker model was updated [33].

The second experiment focused on speaker indexing for the three different test materials with the 3 types of generic models. Based on the results of the first experiment, we adopted a 2-second analysis segment length. Figure 3.6 shows the unsupervised speaker indexing performance of the generic models for the telephone conversations, 4-person conversations, and broadcast news clips. In Fig. 3.6, the initial Universal Background Model (UBM) was a unitary Gaussian Mixture Model trained with data from all 100 speakers in the pool. The Universal Gender Models (UGM) set consisted of two models: male and female. The Sample Speaker Models (SSM) set had a variable number of models. In our experiments, we used sets of 8, 16, 32, 64, and 100 models to find empirically the optimal number of samples for unsupervised speaker indexing under various test conditions [33]. In the 2-speaker case, when the number of generic model speakers was smaller than 16, the indexing accuracy was below 90%. As the number of model samples became larger, the accuracy peaked at about 92.5%, before slowly degrading as the number of samples increased further. The reason might be that 8 models were not adequate to recognize two speakers, as they did not have adequate discriminatory power in our feature space. While the 32 model case (90.1%) performed better than the 8 model case (87.4%), it was still worse than the 16 model set.
Figure 3.6: Indexing accuracy for various types of generic models (SSM, UBM, UGM) versus the number of sample speaker models: (a) 2-speaker conversations, (b) 4-speaker conversations, and (c) broadcast news. Note that 'samples' here refer to those drawn from the generic model pool by MCMC. The results for SSM were consistently better than for the UBM and UGM cases.

However, as the number of models increased further, too many similar models occupied the feature space. In this situation, one (test) speaker could be recognized as two or more (model) speakers. From this experiment, 16 was found to be the best number of sample speaker models. The results for the Universal Background Model (UBM) (82.8%) and Universal Gender Models (UGM) (86.1%) cases were worse than that of the 16-SSM case. Recall that the UBM was built with data from 100 speakers, and the UGM consisted of a male model and a female model built with data from 50 male and 50 female speakers, respectively. Each model of SSM, however, was a specific generic speaker model. For that reason, UBM and UGM had larger variances, and each speaker model was adapted from those initial models. Although the model variance can be adapted, it is difficult to represent a speaker well with small amounts of data. In sum, UGM is better than UBM because the gender models have relatively smaller variances [33].

The results were similar for the 4-speaker case; the best performance (89.6% accuracy) was obtained with the 16 sample models. When the number of samples was 8, the accuracy was about 69.5%, while for the 32 model case it was 84%. Again, the results for the Universal Background Model (UBM) (60.2%) and Universal Gender Models (UGM) (67.1%) cases were much worse than for the 16-SSM case. These experiments showed that automatic selection of initial models most similar to the final target models leads to better and faster model adaptation and convergence [33].

Based on the results of the 2-speaker and 4-speaker cases, the accuracy of 2-speaker indexing was higher than that of 4-speaker indexing with all generic models except 100-SSM. The reason might be that the possibility of false indexing increases as the number of speakers to index increases. However, it is interesting that the difference in accuracy decreased as the number of sample models in the SSM case increased; indeed, with 100 sample models, the 4-speaker case achieved slightly higher accuracy than the 2-speaker case [33].

The broadcast news data posed significantly more challenges, mainly due to the diversity in the audio accompanying the speech therein. Our test set consisted of 20 audio clips, segmented manually based on topics, from a 45-minute broadcast. There were 2 to 6 speakers in these clips; the distribution of genders was uneven, with more males than females. Fig. 3.6(c) shows the speaker indexing result.
Even though the best performance was obtained with the sample model set of size 64 and the accuracy was 87.2%, the difference in accuracy from the other set sizes (16 and 32) was small. There might be several reasons. The different audio data conditions and the variety in the number of speakers (2 to 6) could affect the results. Based on our analysis, we might need more sample models, but the number needed is not directly proportional to the number of target (true) speakers. The main reason that 64-SSM was best is the stability of its performance: even though more sample models make more errors, the variance of the error rates in each test was smaller. When the number of speakers is over 6 (e.g., 10, 16, or 32), we might need more sample models, at least as many as the number of target speakers [33]. Based on this result, we compared the accuracy of 64-SSM with those of UGM and UBM by the number of speakers in the clips (Table 3.2).

Table 3.2: Number of speakers in clips vs. accuracy for the broadcast news material

  Number of    Length of          Accuracy
  Speakers     clips (sec)     64-SSM    UGM      UBM
  2            507.83          93.9%     70.1%    68.8%
  3            824.32          89.8%     67.0%    72.0%
  4            773.92          80.8%     60.8%    66.2%
  5            402.89          89.2%     72.5%    63.6%
  6            168.95          80.0%     70.2%    68.8%

With any number of speaker models, SSM with 64 sample speaker models was the best among the three generic models. Even in the 6-speaker clips, the error rate was 20%, which indicates the stability of SSM. The performance could, however, be adversely affected by the environment and individual speakers. That may explain why the accuracy for the 4-speaker clips was much worse than that for the 2- or 5-speaker clips, not only in the 64-SSM case but also in the UGM and UBM cases. Even in this worse condition, 64-SSM showed a below-20% error rate for the 4-speaker clips [33]. In the next chapter, we explore the effect of speaker clustering on speaker indexing, leveraging recent related work that considers the notion of optimal quantization of the speaker model space.

3.6 Some Further Thoughts

In this chapter, we adopted the Markov Chain Monte Carlo (MCMC) method to pick the samples from the pool. This method attained a measure of success in obtaining proper positions of speaker models in the feature space, primarily because more samples were picked from the regions of the feature space where more speaker models in the pool were concentrated. There are a couple of challenges that need further investigation in this context. One critical issue with this SSM approach relates to finding the optimal number of sample models and their positions in the feature space (optimal sampling). For a given feature space, some of the models can be severely overlapped while others are farther apart, even if this formation can be thought of as inherently natural. A more principled approach, with supporting experiments, is required for organizing the space spanned by the (generic) speakers for SSM, such as feature space or speaker quantization for optimal speaker (model) sampling.
Chapter 4

Speaker Quantization

4.1 Introduction

The Sample Speaker Models (SSM) method is very useful in an unsupervised speaker indexing system that operates without any prior knowledge of speakers. However, one critical issue with this SSM approach relates to finding the optimal number of sample models and their positions in the feature space. A more principled approach, with supporting experiments, is required for organizing the space spanned by the (generic) speakers for SSM, such as feature space or speaker model quantization for optimal speaker (model) sampling. Vector Quantization can be adopted to obtain sample speaker models with a more proper number and positions in the feature space for unsupervised speaker indexing. Through the Vector Quantization process, speaker models in a pool can be quantized (categorized) to obtain a proper number of representatives. Our experiments showed that the SSM with SQ method outperformed both SSM in conjunction with random selection and the baseline Universal Background Model (UBM) [32].

Several efforts have been reported on speaker indexing and the use of Vector Quantization (VQ) ideas for speaker recognition. Kinnunen et al. applied the vector quantization method directly to speaker identification [26]. They generated the VQ codebook in advance and identified a target speaker by measuring the Euclidean distance between the target speaker data and the codebook. Recently, Nishida and Kawahara proposed a novel method of model selection for speaker indexing [52]. While the Gaussian Mixture Model (GMM) is usually a better method than VQ in terms of speaker recognition performance, VQ was found to outperform GMM when only small amounts of data are available for training. For that reason, they explored a flexible framework to select an optimal speaker model (GMM or VQ) based on the Bayesian Information Criterion (BIC) [52]. J. Pelecanos proposed a method of combining Vector Quantization with multi-dimensional Gaussians to generate a fast and robust approximation of Gaussian Mixture Models. In this method, Vector Quantization was used only for computational efficiency in the GMM construction [57]. Kolano and Regel-Brietzmann combined VQ and GMM based on the fact that both methods represent the distribution of data vectors in the feature space. This method also used VQ to train a GMM without the EM algorithm. The results showed much better recognition rates than simply using VQ, but lower accuracy compared with the GMM method [27]. Recall that sequentially constructed models in unsupervised indexing scenarios cannot represent speakers well due to the lack of data available initially. We tried to solve this critical drawback by employing an alternative method using the notion of generic speaker models. In addition, we utilized VQ for categorizing speakers in the given feature space based on their acoustic characteristics, whereas in all these previous efforts, VQ was used to quantize the feature space or select models [32].
We performed three main experiments to investigate the following: the relation between the size of the population (the number of pooled speakers) and the number of quantized speakers, the analysis of error types, and the unsupervised speaker indexing evaluation on telephone conversations and broadcast news. The experimental results showed that our SSM with SQ method achieved a higher recognition rate compared with others such as the Universal Background Model (UBM) and SSM with random model selection.

4.2 Basics of Vector Quantization

Vector quantization is a generalization of scalar quantization. While scalar quantization is usually used for analog-to-digital conversion, vector quantization is used for more sophisticated digital signal processing. Vector quantization plays an important role in data compression, such as image coding and speech coding, and has been explored extensively [54] [17] [1]. Vector quantization is a sequential process to encode data into vectors. Each source vector is mapped to the closest matching vector from the codebook by the vector quantization encoder; the decoder has a codebook identical to the encoder's. The goal of vector quantizer design is to produce the "best" possible reproduction vectors for a given condition, and the key to a good vector quantizer is developing a good codebook. One method used for developing a codebook is the Linde-Buzo-Gray (LBG) algorithm, also referred to as the generalized Lloyd algorithm. It has been used extensively to build codebooks [13]. The LBG algorithm iteratively derives a codebook from training vectors through the procedure shown in Fig. 4.1. The initialization step involves choosing the initial codebook. This might be a codebook used previously, or an arbitrary one with evenly spaced points in the vector space obtained using a splitting technique. The iteration starts by assigning each training vector to its codeword, chosen by some distortion measure. Then, given the set of training vectors assigned to a particular code, the code is changed to minimize its distortion. These two steps are repeated until the relative distortion decrease falls below some small threshold [15]. If this algorithm converges to a codebook in the sense that further iterations do not produce any changes, the resulting codebook must be optimal. However, such an iterative improvement algorithm need not yield a truly optimal quantizer, only a locally optimal one.

Figure 4.1: A block diagram of the Generalized Lloyd Algorithm

We can measure the performance of a vector quantizer by an average distortion; a good vector quantizer yields a small average distortion. For example, the n-dimensional source inputs, x = (x_1, ..., x_n) \in X^n, are quantized to a vector, y = (y_1, ..., y_n) \in Y^n. The distortion between input and output vectors is

d(x, y) = \frac{1}{n} \sum_{j=1}^{n} d(x_j, y_j),   (4.1)

where the average distortion is E[d(x, y)]. There are several distortion measures. The most widely used measure is the squared Euclidean distance,

d(x, y) = \sum_{j=1}^{n} |x_j - y_j|^2.   (4.2)

Another distortion measure is the weighted squared error,

d(x, y) = (x - y)^T W (x - y),   (4.3)

where W is a symmetric and positive definite weighting matrix. Note that this measure reduces to the squared Euclidean distance when W is the identity matrix. When W is the inverse of the covariance matrix of the input vector, x, this measure is called the Mahalanobis distortion measure.
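A compact sketch of this split-and-iterate LBG procedure under the squared Euclidean distortion is given below; the perturbation constant, the re-seeding of empty cells, and growing the codebook only to power-of-two sizes are assumptions of the illustration:

import numpy as np

def lbg(train, n_codes, eps=1e-4, seed=0):
    """Generalized Lloyd / LBG training with codebook splitting: start from
    the global centroid, perturb to double the codebook, then run Lloyd
    iterations until the relative distortion decrease falls below `eps`.
    Codebook sizes grow by powers of two up to `n_codes`."""
    rng = np.random.default_rng(seed)
    codebook = train.mean(axis=0, keepdims=True)
    while len(codebook) < n_codes:
        codebook = np.vstack([codebook * 1.001, codebook * 0.999])  # split step
        prev = np.inf
        while True:
            # Nearest-codeword assignment under squared Euclidean distortion.
            d2 = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            idx = d2.argmin(axis=1)
            dist = d2[np.arange(len(train)), idx].mean()
            # Centroid update; empty cells are re-seeded from random vectors.
            for k in range(len(codebook)):
                members = train[idx == k]
                codebook[k] = (members.mean(axis=0) if len(members)
                               else train[rng.integers(len(train))])
            if prev - dist < eps * max(dist, 1e-12):
                break
            prev = dist
    return codebook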
Note that this measure can be the squared Euclidean distance in case th at W is the identity matrix. W hen W 58 R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithout p erm ission . is the inverse of the covariance m atrix of the input vector, x, this measure is called the Mahalanobis distortion measure. There is no better method to quantize vectors than VQ with an optimal codebook. However, unconstrained VQ is severely limited to the vector dimensions and codebook sizes for practical problems. Hence, several techniques, such as Tree-Structured VQ and classified VQ, have been developed to apply various constraints to the structure of the VQ codebook. These methods generally provide some trade-offs between performance and complexity. 4.3 Speaker Q uantization In SSM, samples can be randomly picked from a pool of generic speaker models th at are Gaussian Mixture Models. However, the random selection m ethod does not give optimal samples in the feature space. While the Sample Speaker Models (SSM) m ethod has been found useful in unsupervised speaker indexing, the critical issue th at remains is finding the optimal number of sample models, and their positions in the feature space. A more principled approach is required in initializing the space spanned by the (generic) speakers for the SSM method. To obtain sample speaker models with the optimal number and positions in the feature space, we propose a novel m ethod called “Speaker Quantization (SQ)” . The basic concept of SQ originates from Tree Structured Vector Quantization (TSVQ) [18], In this method, we do not quantize feature vectors but speaker models in the given feature space [Fig. 4.2], In other words, the Speaker Quantization method 59 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . Vector Quantization Speaker Quantization QO ► ► o : Speaker Model O : Vector X : Representative of each t ~ \ : Representative of each node after VQ node after SQ Figure 4.2: Vector Quantization vs. Speaker Quantization: (a) Vector Quantization, (b) Speaker Quantization. is used to select speaker models that can represent some acoustically similar speakers well for unsupervised speaker indexing [32]. The Kullback-Leibler (KL) distance is used as the distortion measure. To approx imate the KL distance between two Gaussian Mixture Models, we need to assign each Gaussian of one GMM to each of the other GMM. However, as the number of mixtures increases, it is harder to get optimal one-to-one mappings between two groups of Gaus- sians. So far, we do not have any optimal algorithm to solve this mapping problem in case of the large number of mixtures: There are 16 mixtures in our case. As an 60 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . n l n 21 n 22 n 31 n 32 Figure 4.3: Example of a tree for Tree Structured Vector Quantization (TSVQ). alternative method, random mappings could be used even though it makes the upper- bound loose. We calculate the upper-bound of KL distances between two d-dimensional Gaussian Mixture Models as follows [11] [73] [62]: ( M M y J w iih fu\i y / i= 1 3=1 ) M < E ^ l i ^ 2 j D (Wuijl’h; CU)\\N2j(f^2ji G2j ) ) i,3=1 M , < y log i , j = 1 d ^ C y ~ d + t r (C2jCn) + ( V U ~ - p^)], (4.4) where the number of mixture is M. 
To get a symmetric KL distance, we use the following measure, which is the average of the two asymmetric distances:

D_{KL}^{sym} = \frac{1}{2} \left[ D\left( \sum_{i=1}^{M} w_{1i} N_{1i} \,\Big\|\, \sum_{i=1}^{M} w_{2i} N_{2i} \right) + D\left( \sum_{i=1}^{M} w_{2i} N_{2i} \,\Big\|\, \sum_{i=1}^{M} w_{1i} N_{1i} \right) \right].   (4.5)

In our SQ, a binary tree structure is used [Fig. 4.3]; each node splits into two new nodes at each level. A node splits under the local condition that the rate of averaged distortion decrease, R, is greater than a threshold \epsilon, where

R = \frac{D_r - D_l}{D_r},   (4.6)

D_r is the mean KL distance within the root node, and D_l is the average of the mean KL distances of the two leaf nodes. The threshold, \epsilon, is empirically preset. To implement TSVQ for SQ, a codebook can be built as follows:

• Step 1: Partition the root node (level 1) of the tree into 2 new leaf nodes using the Lloyd algorithm.

• Step 2: Check the variation of the averaged distortion, R.

• Step 3: If R is larger than \epsilon, the partitioning is accepted.

• Step 4: Repeat node splitting and distortion checking (Steps 1 through 3) until no nodes are left to split.

We proposed SSM to avoid the defect of averaged speaker models. Hence, it is notable that the speaker model giving the minimum distortion at each final node is chosen as the quantized speaker model representing the other (acoustically similar) models at that node [32]. In other words, one of the members, not a composite one, in a node represents the others.
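The node-splitting procedure above might be sketched as follows; for brevity, each two-way partition uses a single assignment pass seeded by the two farthest-apart members instead of full Lloyd iterations, which is an assumption of this illustration:

import numpy as np

def speaker_quantize(ids, D, eps):
    """Binary-tree speaker quantization over a pool of models, given a
    symmetric KL distance matrix D (zero diagonal). A node is split while the
    relative distortion decrease R = (D_r - D_l) / D_r (Eq. 4.6) exceeds eps;
    each final node is represented by its member with minimum total distance."""
    def mean_dist(node):
        if len(node) < 2:
            return 0.0
        sub = D[np.ix_(node, node)]
        return sub.sum() / (len(node) * (len(node) - 1))  # mean off-diagonal

    def two_way(node):
        # Seed with the two farthest-apart members; one assignment pass.
        sub = D[np.ix_(node, node)]
        i, j = np.unravel_index(sub.argmax(), sub.shape)
        a, b = [], []
        for k in node:
            (a if D[k, node[i]] <= D[k, node[j]] else b).append(k)
        return a, b

    def split(node):
        if len(node) < 2:
            return [node]
        left, right = two_way(node)
        if not left or not right:
            return [node]
        d_r = mean_dist(node)
        d_l = 0.5 * (mean_dist(left) + mean_dist(right))
        if d_r <= 0 or (d_r - d_l) / d_r <= eps:  # Eq. (4.6) stopping rule
            return [node]
        return split(left) + split(right)

    return [min(leaf, key=lambda k: D[k, leaf].sum())  # representative member
            for leaf in split(list(ids))]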
Figure 4.4: Type-1 error in speaker indexing: segments of a speech sequence from a single speaker are indexed to multiple samples.

Figure 4.5: Type-2 error in speaker indexing: segments of speech sequences from different speakers are indexed to a single sample.

4.4 Analysis of Errors in Unsupervised Speaker Indexing

Speaker Quantization determines the number of samples; however, this number can change depending on the size of the speaker pool and the quantization threshold, \epsilon. We thus have room for controlling the number of samples by analyzing the errors of speaker indexing, resulting in error reduction. The quantized speaker models are expected to occupy the feature space with less overlap than before quantization. There are two types of errors occurring in unsupervised speaker indexing using SSM, due to the difference between the number of target speakers and the number of initially quantized speakers. A type-1 error occurs when a single target speaker is recognized as multiple speakers [Fig. 4.4], while a type-2 error occurs when multiple target speakers are indexed to a single speaker [Fig. 4.5]. Our previous experiments showed that type-1 errors are dominant with a larger number of sample speaker models. On the other hand, with a smaller number of sample speaker models, type-2 errors are dominant. The unsolved problem is how to choose the optimal number of speaker models while minimizing both of these errors [32]. After the SQ process, only one representative speaker model at each final node exists in place of potentially several acoustically similar models, which helps reduce type-1 errors. The reduced number of models in SQ may, however, contribute to the induction of type-2 errors. By increasing the number of initial speaker models, type-2 errors may be reduced. Under the given quantization conditions, the size of the population in the pool affects the number of quantized samples as well. For these reasons, we need to consider the size of the population and the quantization parameters to reduce both type-1 and type-2 errors.

4.5 Experiments

We performed experiments with a 400-speaker data set (240 females and 160 males) in our speaker pool obtained from the Speaker Recognition Benchmark NIST Speech (1999) corpus. We performed three experiments. First, we investigated the relation between the size of the population (in a speaker pool) and the number of quantized speakers. Initially, we had three sets with various population sizes: 100, 200, and 400 speakers, respectively. To get averaged results, we quantized each set three times, setting \epsilon to zero. The second experiment focused on the analysis of type-1 and type-2 errors. After the first experiment, we obtained three sets of quantized speakers as generic model sets with respect to the size of the population in the speaker pool. To analyze type-1 errors, we had 40 seconds of speech data from each of the 100 speakers in the test set. The first selected speaker was taken to be the true target speaker, and every 5-second data segment was sequentially indexed to a single speaker. For type-2 errors, a 10-second speech-data segment from each of the 100 test speakers was used. We had 5 sub-experiments with 5 different numbers of participants in a conversation (2, 4, 6, 8, and 10 speakers). Each sub-experiment was executed three times to get averaged results. The last experiment investigated the performance of unsupervised speaker indexing with the quantized generic model set. For the generic models, we used the best quantized speaker set, i.e., the one that yielded the smallest error in the previous experiment. We compared three methods: Sample Speaker Models (SSM) with the Speaker Quantization (SQ) method, Sample Speaker Models (SSM) with random selection using the Markov Chain Monte Carlo (MCMC) method, and the Universal Background Model (UBM). A Universal Background Model (UBM) was built with the 400-speaker data in the pool. It was challenging to get data from a large number (on the order of hundreds) of speakers from the broadcast news audio corpus for training. Hence, with generic models trained on telephone speech, we indexed both telephone conversations and broadcast news. We used about 70 minutes of data for testing: 24 minutes of two-speaker telephone conversations from the Speaker Recognition Benchmark NIST Speech (1999), and about 45 minutes of audio data from the HUB-4 Broadcast News Evaluation English Test Material (1999). Experimental data were sampled at 8000 Hz. As feature vectors, we used 24-dimensional mel-cepstral coefficients computed from 26 filterbank channels. We also used a 30 msec Hamming window that was shifted by 10 msec. Speaker models for UBM, UGM, and SSM were standard Gaussian Mixture Models (GMM) with 16 mixtures.

4.6 Results

The first experiment was relevant to exploring the relation between the number of speakers in a pool and the number of speakers resulting from speaker quantization.
Fig. 4.6 shows that the number of quantized speakers is directly proportional to the initial number of speakers in the pool. Based on this result, we can empirically choose the number of pooled speakers to obtain the desired number of quantized speakers by controlling the quantization threshold. One key point to note, however, is that this experiment is limited in the sense that it does not provide any insight into whether there exists a converging bound on the number of quantized speakers with increasing speaker pool size. Further experiments with a far larger speaker pool are needed to address this question.

Figure 4.6: Relation between the number of pooled speakers and the number of quantized speakers.

The next experiment focused on speaker indexing errors. Two types of errors were considered, as mentioned in Section 4.4. In this experiment, we counted a type-1 error whenever at least one of the segments from a single speaker was indexed to a sample speaker model to which it was not supposed to be indexed. Hence, we calculate type-1 errors as follows:

Type-1 error = (Number of segments indexed to false targets) / (Total number of segments).   (4.7)

Figure 4.7: Speaker indexing error rates vs. number of quantized speakers: (a) type-1 error, (b) type-2 error (for conversations of 2, 4, 6, 8, and 10 speakers).

Conversely, we counted a type-2 error whenever at least two segments from two different speakers were indexed to one sample speaker model. Each trial to differentiate speakers thus made a binary decision, correct or not. It is calculated as follows:

Type-2 error = (Number of trials failing in perfect differentiation) / (Total number of trials).   (4.8)

Fig. 4.7 shows that the type-1 error rate is directly proportional to the number of quantized speakers, while the type-2 error rate is inversely proportional to the number of quantized speakers. Fig. 4.7(b) also shows different slopes according to the number of speakers participating in a conversation. In Fig. 4.7(b), an error rate of 1 means that every trial failed to achieve a perfect differentiation. For example, referring to Fig. 4.7(b), the type-2 error rate of 8-participant indexing was 1 when the number of quantized samples was 13: there was no success in differentiating 8 speakers with 13 samples. From the results of this experiment, we observed that the behaviors of these two types of errors are opposite to one another. Hence, to reduce the total error rate, we need to find the optimal number of quantized speakers (or the number of pooled speakers) while considering the number of participants.
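For reference, the counting rules of Eqs. (4.7) and (4.8) are straightforward to express in code; the data layout assumed here (per-segment model assignments, and trials given as (speaker, assigned model) pairs) is illustrative:

def type1_error(assignments, true_model):
    """Eq. (4.7): fraction of one speaker's segments indexed to any sample
    model other than the one representing that speaker."""
    return sum(m != true_model for m in assignments) / len(assignments)

def type2_error(trials):
    """Eq. (4.8): fraction of trials without a perfect differentiation, i.e.,
    where two different speakers share an assigned sample model. Each trial
    is a list of (speaker, assigned_model) pairs."""
    failures = 0
    for trial in trials:
        speakers_per_model = {}
        for spk, m in trial:
            speakers_per_model.setdefault(m, set()).add(spk)
        failures += any(len(s) > 1 for s in speakers_per_model.values())
    return failures / len(trials)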
In the last experiment, we indexed speakers using telephone conversation and broadcast news data. Prior to speaker indexing, we empirically chose the number of quantized speakers based on the analysis of the previous results. In the telephone conversation (2-speaker) case, as the number of quantized speakers increased from 30 to 70, type-1 errors continuously increased [Fig. 4.7(a)] while the type-2 errors remained unchanged [Fig. 4.7(b)]. Hence, the set of 30 quantized speakers was deemed to give the best result. However, since the broadcast news clips include various numbers (from 2 to 6) of speakers, we could expect that 70 quantized speakers would give the best result here. We calculate speaker indexing error rates in the same manner as in the previous chapter:

Indexing error rate = (Length of mismatched speaker index in time) / (Total length of sequence in time).   (4.9)

Table 4.1: Error rates of unsupervised speaker indexing: Universal Background Model and Sample Speaker Models with MCMC and Speaker Quantization (SQ), respectively. The figures in parentheses show the relative improvement over the baseline.

  Test Set          Telephone        Broadcast News
  UBM (baseline)    14.6% (-)        25.2% (-)
  SSM-MCMC          11.0% (24.7%)    18.1% (28.2%)
  SSM-SQ            8.5% (41.8%)     14.5% (42.5%)

Table 4.1 summarizes the speaker indexing results. The results of this experiment show that using Sample Speaker Models (SSM) in conjunction with the Speaker Quantization (SQ) method yielded the best performance. The SSM-SQ method outperformed SSM-MCMC by 2.5% absolute (22.7% relative) in the two-speaker conversation case. It also showed a lower error rate than UBM by 6.1% absolute (41.8% relative) in this telephone conversation case. With broadcast news, SSM-SQ gave lower error rates than SSM-MCMC by 3.6% absolute (19.9% relative), and much lower than UBM by 10.7% absolute (42.5% relative). In sum, the generic model approach is useful for unsupervised speaker indexing. However, approaches based on the UBM are not adequate since the UBM, having been constructed with data pooled from a large number of speakers, does not reflect individual speaker-specific information well. The experiment showed that the Sample Speaker Models (SSM) approach improved unsupervised speaker indexing performance over the UBM. For SSM, we initially adopted the Markov Chain Monte Carlo (MCMC) method to pick the samples from the pool. However, this method could not ensure optimal positions of speaker models in the feature space. For improved speaker model sampling, Speaker Quantization (SQ) was proposed, and the experimental results showed higher accuracy than for SSM sampled by MCMC. While SQ cannot ensure the optimal number of samples, it picks more properly spaced sample models than random selection, at least for speaker indexing.
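The time-weighted metric of Eqs. (3.8)/(4.9) can be approximated by scanning the timeline on a fine grid, as in this sketch; it assumes the mapping between hypothesized and reference speaker labels has already been resolved:

def indexing_error_rate(ref, hyp, step=0.01):
    """Eq. (4.9): time-weighted indexing error. `ref` and `hyp` are lists of
    (start_s, end_s, speaker) tuples over the same timeline."""
    def label(segs, t):
        for s, e, spk in segs:
            if s <= t < e:
                return spk
        return None
    total = max(e for _, e, _ in ref)
    n = int(total / step)
    mismatched = sum(label(ref, k * step) != label(hyp, k * step) for k in range(n))
    return mismatched / n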
4.7 Some Further Thoughts

We investigated the effect of initializing with clustered speaker models that use averaged statistics from a pool of like speakers, much as universal/gender background models do [6]. The main difference from the latter is that the averaged speaker information used for modeling comes from a set of speakers deemed to be cohorts under a specific similarity criterion such as the KL distance. However, our hypothesis is that averaging information across speakers reduces the discriminative power of the models, and that their adaptation into target speaker models is in general slower; hence, if the clusters are small, both methods may perform similarly. In the experiments with clustered speaker models, we used the "clusters" generated in the quantization process to create a clustered speaker model corresponding to each quantized region. We used a portion of the NIST 1999 corpus with 100 speakers. In this experiment, we found 12 clusters, with between 2 and 15 members per cluster. The indexing results showed that, on average, SSM outperformed the clustered version by about 4% absolute in unsupervised speaker indexing on 2-speaker telephone conversations [33].

We need to investigate our Sample Speaker Models (SSM) and Speaker Quantization (SQ) methods further. From a pool of four hundred speakers, some speakers were chosen as samples using the Speaker Quantization method, and then multiple (from 2 to 6 in our experiments) speakers were indexed with those samples. However, we do not know the maximum number of speakers that can participate in a conversation given a number of sample speaker models, nor the optimal number of samples given the number of participants in a conversation. We need a theoretical analysis of the relation between the number of participants and the number of sample speaker models.

To compute the distance between Gaussian Mixture Models, the KL distance was used. It is not easy to measure distances between Gaussian models consisting of many mixture components. We need to find a computationally feasible way to obtain exact KL distances among mixture models, which would let us obtain better organized samples from speaker quantization.

Chapter 5
An Analytical Approach Toward Understanding The Capacity of Unsupervised Speaker Indexing

5.1 Introduction

A desirable property for the Sample Speaker Models is that they occupy the feature space with minimal overlap. In practice, however, some overlap among the models is unavoidable and results in errors. We can categorize the errors into two types: type-1 errors occur when a single target speaker is recognized as multiple speakers, while type-2 errors occur when multiple target speakers are indexed to a single speaker. We can expect that too many speaker model samples relative to the number of targets increases type-1 errors; on the other hand, too few samples produces more type-2 errors. It is difficult to choose a number of sample speaker models that minimizes both errors at once. To understand this point, let us consider the following. The error types in speaker indexing resemble those in statistical hypothesis tests between H_0 and H_1, which also have type I and type II errors:

• Type I error: rejecting H_0 and accepting H_1 when H_0 is true.
• Type II error: failing to reject H_0 when H_1 is true.

We can control these errors so that satisfactory values of the error probabilities are obtained, as the simulation sketch below illustrates.
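For a concrete feel for this trade-off, the following sketch (an illustration of the textbook threshold test, not of the thesis experiments) evaluates both error probabilities for a test between two Gaussian hypotheses; the means and thresholds are arbitrary.

```python
from math import erf, sqrt

def gaussian_cdf(x, mu, sigma):
    # Closed-form Gaussian CDF via the error function.
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# H0: N(0, 1) vs. H1: N(2, 1); accept H1 when the observation exceeds t.
mu0, mu1, sigma = 0.0, 2.0, 1.0
for t in (0.5, 1.0, 1.5):
    type_I = 1.0 - gaussian_cdf(t, mu0, sigma)  # P(accept H1 | H0 true)
    type_II = gaussian_cdf(t, mu1, sigma)       # P(accept H0 | H1 true)
    print(f"t={t:.1f}  type I={type_I:.3f}  type II={type_II:.3f}")
# Raising t lowers the type I error while raising the type II error.
```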
However, the two types of error cannot be reduced simultaneously: there is a trade-off [22]. The error types in the two settings, speaker indexing and statistical hypothesis testing, look similar, but the underlying notion of error differs slightly. In hypothesis testing, we can find the decision boundary that minimizes the total error probability using information-theoretic bounds such as the Bhattacharyya and Chernoff bounds. In our speaker indexing case, however, the decision boundary can be controlled only through the number and positions of the sample speaker models, which makes it harder to find the optimal decision boundary and the optimal number of sample speaker models. In this chapter, we apply the same approach to error probabilities that is used in statistical hypothesis testing.

5.2 Problem: Unsupervised Speaker Indexing Using Sample Speaker Models

Suppose that there are 16 sample speaker models, equally spaced Gaussian models with identical variances, in a two-dimensional finite space [Fig. 5.1]. The space can be divided into 16 disjoint subspaces, one per sample speaker model. Each subspace is defined as the region where its sample speaker model is more likely to be indexed to a target than any other sample. Hence each sample speaker model represents one subspace, and our goal is to differentiate target speakers using the subspaces.

[Figure 5.1: An illustration of 16 Sample Speaker Models in a 2-dimensional space.]

Another illustration, Fig. 5.2, shows 64 Sample Speaker Models and 4 targets to be indexed in a 2-dimensional space. There are 64 disjoint subspaces of equal size, and the 4 speaker models are to be differentiated using them. If the number of subspaces is too small, each subspace becomes too large and we cannot differentiate adjacent targets. Conversely, with too many samples each subspace becomes too small, and a single speaker is apt to be indexed to multiple subspaces.

[Figure 5.2: An illustration of 64 Sample Speaker Models and 4 targets to be indexed in a 2-dimensional space.]

In the following sections, we discuss how many samples (subspaces) are required to index targets with low total error probability.

5.3 Relevant Information-Theoretic Background

Hypothesis testing is one of the standard problems in statistics. As a simple case, consider two hypotheses, where X_i (i = 1, ..., n) is i.i.d. with distribution P:

• H_1: P = P_1.
• H_2: P = P_2.

From these two hypotheses, we can define two error probabilities with respect to the set A over which H_1 is accepted:

\[ P_{e1} = P(H_2 \text{ is accepted} \mid H_1 \text{ is true}) = P_1^n(A^c), \]
\[ P_{e2} = P(H_1 \text{ is accepted} \mid H_2 \text{ is true}) = P_2^n(A). \]

We want to minimize both error probabilities, but there is a trade-off. One possibility is to minimize one error probability under a constraint on the other [8]. Stein's lemma gives the best achievable error exponent for this hypothesis testing problem.
Stein's Lemma: Let X_i (i = 1, ..., n) be i.i.d. ~ P. Consider the hypothesis test between P = P_1 and P = P_2, where the relative entropy between P_1 and P_2 satisfies D(P_1 || P_2) < \infty. Let A_n \subseteq \mathcal{X}^n be an acceptance region for H_1, and let the probabilities of error be

\[ P_{e1} = P_1^n(A_n^c), \qquad P_{e2} = P_2^n(A_n), \quad (5.1) \]

and for 0 < \epsilon < \tfrac{1}{2}, define

\[ P_{e2}^{\epsilon} = \min_{A_n \subseteq \mathcal{X}^n,\; P_{e1} < \epsilon} P_{e2}. \quad (5.2) \]

Then

\[ \lim_{\epsilon \to 0} \lim_{n \to \infty} \frac{1}{n} \ln P_{e2}^{\epsilon} = -D(P_1 \,\|\, P_2). \quad (5.3) \]

Since we constrain P_{e1} < \epsilon while P_{e2} \approx e^{-nD}, Stein's lemma lacks symmetry [8]. Using a Bayesian approach, we can instead minimize the total error probability, given by the weighted sum of the individual error probabilities. Suppose that X_i (i = 1, ..., n) is i.i.d. with P:

• H_1: P = P_1 with prior probability \pi_1.
• H_2: P = P_2 with prior probability \pi_2.

The total error probability is

\[ P_e = \pi_1 P_{e1} + \pi_2 P_{e2}. \quad (5.4) \]

By Sanov's theorem, the individual error probabilities are approximated as

\[ P_{e1} = P_1^n(A^c) \approx e^{-n D(P_\lambda \| P_1)} \quad (5.5) \]

and

\[ P_{e2} = P_2^n(A) \approx e^{-n D(P_\lambda \| P_2)}, \quad (5.6) \]

where P_\lambda is the distribution in A closest to P_2 and, equivalently, the distribution in A^c closest to P_1. Therefore, the total error probability satisfies

\[ P_e \approx \pi_1 e^{-n D(P_\lambda \| P_1)} + \pi_2 e^{-n D(P_\lambda \| P_2)} \approx e^{-n \min\{D(P_\lambda \| P_1),\, D(P_\lambda \| P_2)\}}. \quad (5.7) \]

We obtain the tightest upper bound on the total error probability when the minimum of \{D(P_\lambda \| P_1), D(P_\lambda \| P_2)\} is maximized, which happens when A is selected so that

\[ D(P_\lambda \| P_1) = D(P_\lambda \| P_2) \triangleq C(P_1, P_2). \quad (5.8) \]

Thus the upper bound is determined by the highest achievable exponent, C(P_1, P_2), called the Chernoff information [8][24].

Chernoff: The best achievable exponent in the Bayesian probability of error is D^*, where

\[ D^* = \lim_{n \to \infty} \min_{A_n \subseteq \mathcal{X}^n} \left( -\frac{1}{n} \ln P_e \right) = D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2) \quad (5.9) \]

with

\[ P_\lambda(x) = \frac{P_1^\lambda(x)\, P_2^{1-\lambda}(x)}{\sum_{a \in \mathcal{X}} P_1^\lambda(a)\, P_2^{1-\lambda}(a)} \quad (5.10) \]

and \lambda^* the value of \lambda such that

\[ D(P_{\lambda^*} \| P_1) = D(P_{\lambda^*} \| P_2). \quad (5.11) \]

We can derive the upper bound on the total error probability from the above. Suppose the decision region A for H_1 under the maximum a posteriori (MAP) decision rule is

\[ A = \{ x : \pi_1 P_1(x) > \pi_2 P_2(x) \}, \quad (5.12) \]

so that the total error probability becomes

\[ P_e = \pi_1 P_{e1} + \pi_2 P_{e2} = \sum_{x \in A^c} \pi_1 P_1(x) + \sum_{x \in A} \pi_2 P_2(x) = \sum_x \min\{\pi_1 P_1(x),\, \pi_2 P_2(x)\} \le \sum_x P_1^\lambda(x)\, P_2^{1-\lambda}(x). \quad (5.13) \]

Since this holds for any \lambda, we obtain the Chernoff bound by taking the minimum over 0 \le \lambda \le 1 [8].
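As a numerical sanity check, the sketch below evaluates the bound of Eq. (5.13) over a grid of lambda for two discretized unit-variance Gaussians; the grid and means are arbitrary choices of the sketch. For equal-variance Gaussians the Chernoff information is known in closed form, (mu1 - mu2)^2 / (8 sigma^2), which the scan should approximately recover at lambda* = 1/2.

```python
import numpy as np

def chernoff_bound(p1, p2, grid=101):
    """Eq. (5.13): min over lambda in [0, 1] of sum_x p1^lambda * p2^(1-lambda)."""
    lambdas = np.linspace(0.0, 1.0, grid)
    sums = [np.sum(p1**lam * p2**(1.0 - lam)) for lam in lambdas]
    i = int(np.argmin(sums))
    return sums[i], lambdas[i]

# Discretize two unit-variance Gaussians, N(0,1) and N(2,1), on a fine grid.
x = np.linspace(-8.0, 10.0, 2001)
dx = x[1] - x[0]
gauss = lambda mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi) * dx
p1, p2 = gauss(0.0), gauss(2.0)

bound, lam_star = chernoff_bound(p1, p2)
print(lam_star, -np.log(bound))  # ~0.5 and ~0.5 = (2 - 0)**2 / 8
```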
5.4 Capacity in Unsupervised Speaker Indexing

5.4.1 Information-Theoretic Approach

In the previous section (5.3), the information-theoretic background on hypothesis testing was reviewed. Since unsupervised speaker indexing is also a kind of hypothesis testing problem, we adopt the same approach to study its capacity. Recall that the upper bound on the total error probability is obtained by selecting A to maximize the minimum of \{D(P_\lambda \| P_1), D(P_\lambda \| P_2)\}, where P_\lambda is the distribution in the region A closest to P_2 and, equivalently, the distribution in A^c closest to P_1. To relate the number of Sample Speaker Models to the total error probability, we must examine the decision region A: it is strongly tied to the sample speaker models, since the number and positions of samples determine the decision regions for hypothesis testing. In addition, P_\lambda is controlled by the acceptance region A of P_1, whose size depends on the number of samples. Therefore, the error probability depends on which samples are used for indexing. Assuming properly positioned samples obtained by Speaker Quantization, the number of samples becomes a crucial factor in minimizing the total error probability. D(P_\lambda \| P_1) and D(P_\lambda \| P_2) are controlled by \lambda, which in turn depends on the acceptance region A: as the number of samples increases, D(P_\lambda \| P_1) decreases while D(P_\lambda \| P_2) increases. Hence the total error probability is

\[ P_e = \pi_1 P_{e1} + \pi_2 P_{e2} = \pi_1 e^{-n D(P_\lambda \| P_1)} + \pi_2 e^{-n D(P_\lambda \| P_2)}, \quad (5.14) \]

where \lambda is a function of N, the number of Sample Speaker Models: \lambda \propto N.

[Figure 5.3: Relative entropy as a function of N, the number of Sample Speaker Models (solid line: D(P_\lambda || P_1), dashed line: D(P_\lambda || P_2)).]

Fig. 5.3 illustrates the relation between the number of Sample Speaker Models and the relative entropies D(P_\lambda \| P_1) and D(P_\lambda \| P_2). As mentioned before, we can select the \lambda attaining the maximum of the minimum of \{D(P_\lambda \| P_1), D(P_\lambda \| P_2)\}: it is attained when the two are equal, and N can then be determined from the selected \lambda. However, this information-theoretic relation between \lambda and N has not yet been explored in depth.

5.4.2 Statistical Approach

We cannot directly apply the information-theoretic analysis of the previous section because the errors here have different characteristics. Indexing errors are strongly tied to the sample speaker models: the number and positions of samples determine the acceptance regions for hypothesis testing. Moreover, the decision boundaries can be controlled only through the number of sample speaker models. In this section, we consider the Sample Speaker Models (SSM) based unsupervised speaker indexing problem proposed in this thesis. The problem can be described as follows. There are M speakers involved in a conversation, and N (\ge M) generic sample speaker models are trained in advance for indexing them. When the number of speakers M is fixed, what is the best N for minimizing both types of error? Conversely, what is the best number of speakers for a fixed number of sample speaker models? We simplify the problem to two-speaker indexing (M = 2) so that we can analyze the error probabilities more easily and compare them with simple hypothesis tests. To make the problem concrete, we describe it as follows. Consider two speech signal sources with statistical models S_1 and S_2. A transmitter sends one packet of speech data to the receiver at a time; a packet contains a sequence of n-dimensional vectors (v^n) from a single source (S_1 or S_2). Whenever a packet arrives at the receiver, one of the pseudo targets (sample speaker models), with statistical model T_i, is assigned to represent the source. The pseudo target model first assigned to a source is considered the true target model for that source. In other words, after a pseudo target model T_i
is assigned to a source S_j, it is considered an error to assign a packet of data from that source to a different pseudo target T_k (k \ne i). The other type of error occurs when T_i is assigned to the other source S_l (l \ne j). We choose each pseudo target by calculating relative entropies (Kullback-Leibler distances) between source models and pseudo target models:

\[ \hat{i} = \arg\min_i D(T_i \,\|\, S_j), \quad i = 1, \ldots, N, \ j = 1 \text{ or } 2. \quad (5.15) \]

In practice, D(\cdot) is difficult to calculate exactly. Hence, instead, we calculate

\[ \hat{i} = \arg\min_i \sum_{k=1}^{n} \left| T_i(p_k) - S_j(p_k) \right|, \quad i = 1, \ldots, N, \ j = 1 \text{ or } 2. \quad (5.16) \]

However, in our case we do not have source models, so we use maximum likelihood instead of the KL distance:

\[ \hat{i} = \arg\max_i \sum_{k=1}^{n} \log T_i(p_k), \quad i = 1, \ldots, N. \quad (5.17) \]

While every packet of data is assigned to a pseudo target, some errors may occur; as mentioned before, there are two types. Suppose that there are N pseudo target speaker models in a bounded n-dimensional vector space C^n. The space can be divided into N disjoint subsets c_i^n (i = 1, ..., N):

\[ c_i^n \cap c_j^n = \emptyset \ \text{ for } i \ne j, \quad (5.18) \]

\[ \bigcup_{i=1}^{N} c_i^n = C^n. \quad (5.19) \]

The subset c_i^n is the region where T_i is the most likely model to be assigned to the source, defined as

\[ \forall v^n \in c_i^n, \quad T_i(v^n) \ge T_j(v^n), \quad j = 1, \ldots, N, \ j \ne i. \quad (5.20) \]

As mentioned before, once \hat{i} is set to i, every packet of data from that source should be assigned to T_i; otherwise a type-1 error occurs:

• Continuous case:
\[ P_{e1} = 1 - \int_{v^n \in c_i^n} S_j(v^n)\, dv^n. \quad (5.21) \]

• Discrete case:
\[ P_{e1} = 1 - \sum_{v^n \in c_i^n} S_j(v^n). \quad (5.22) \]

In the two-source case, when one pseudo target is assigned to source S_1, the other source S_2 may still be unassigned. If S_2 is assigned to the same pseudo target as S_1, a type-2 error occurs. Suppose a packet is sent from S_k and assigned to T_i, which was already assigned to S_j (j \ne k). This is a type-2 error:

• Continuous case:
\[ P_{e2} = \int_{v^n \in c_i^n} S_k(v^n)\, dv^n. \quad (5.23) \]

• Discrete case:
\[ P_{e2} = \sum_{v^n \in c_i^n} S_k(v^n). \quad (5.24) \]

The total error probability is the weighted sum of the two types of error:

\[ P_e = \pi_1 P_{e1} + \pi_2 P_{e2}, \quad (5.25) \]

where \pi_1 and \pi_2 are the prior probabilities with which S_1 and S_2 transmit a packet. As the number of pseudo target models increases, every disjoint subspace shrinks; smaller subspaces induce more type-1 errors and fewer type-2 errors, so there is a trade-off. However, the rate of change in the type-1 case may differ from that in the type-2 case as the environment (the subspaces and the speaker models to be indexed) changes. We investigate this assumption under a more specific condition.

5.5 Special Case

5.5.1 One-Dimensional Gaussian Speaker Models

Speaker models are usually built with high-dimensional Gaussian mixtures, but analyzing errors with n-dimensional Gaussian Mixture Models is very difficult. For a more comprehensible analysis with lower computational complexity, we use one-dimensional single Gaussian models with identical variances.
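Before specializing to one dimension, here is a quick illustration of the maximum-likelihood assignment rule of Eq. (5.17). The 1-D Gaussian pseudo targets below stand in for the GMMs used in practice, and all numbers are illustrative.

```python
import numpy as np

def assign_packet(packet, targets):
    """Eq. (5.17): pick the pseudo target T_i that maximizes the packet
    log-likelihood sum_k log T_i(p_k). `targets` holds (mu, sigma) pairs
    for 1-D Gaussian pseudo targets."""
    def log_likelihood(mu, sigma):
        z = (packet - mu) / sigma
        return np.sum(-0.5 * z**2 - np.log(sigma * np.sqrt(2.0 * np.pi)))
    scores = [log_likelihood(mu, sigma) for mu, sigma in targets]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
targets = [(-2.0, 1.0), (0.0, 1.0), (2.0, 1.0)]  # pseudo target models
packet = rng.normal(1.7, 1.0, size=50)           # data near the third target
print(assign_packet(packet, targets))            # -> 2
```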
This special case can be described as follows [Fig. 5.4]:

[Figure 5.4: An example of the 1-D Gaussian case with (a) 4 pseudo targets and (b) 8 pseudo targets. Pseudo targets are 1-D Gaussian models (solid lines); the two sources are 1-D Gaussian models (dash-dot line for Source 1, dotted line for Source 2). The range between a and b is the dominant segment for the pseudo target assigned to Source 1.]

• The one-dimensional space is bounded, with length L.
• Source and pseudo target speaker models have identical variances, \sigma^2.
• Packets of data are transmitted from two sources: S_1(\mu_{S_1}, \sigma^2) and S_2(\mu_{S_2}, \sigma^2).
• The receiver includes N pseudo target speaker models: T_i(\mu_{T_i}, \sigma^2), i = 1, ..., N.

Suppose that T_k is assigned to a source S_1 and that the dominant range for T_k is [a, b]. The type-1, type-2, and total error probabilities are then:

• Type-1 error probability:

\[ P_{e1} = 1 - \int_a^b S_1(l)\, dl = 1 - \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{(l - \mu_{S_1})^2}{2\sigma^2} \right\} dl. \quad (5.26) \]

• Type-2 error probability:

\[ P_{e2} = \int_a^b S_2(l)\, dl = \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{(l - \mu_{S_2})^2}{2\sigma^2} \right\} dl. \quad (5.27) \]

• Total error probability:

\[ P_e = \pi_1 P_{e1} + \pi_2 P_{e2} = \pi_1 \left( 1 - \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{(l - \mu_{S_1})^2}{2\sigma^2} \right\} dl \right) + \pi_2 \int_a^b \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{(l - \mu_{S_2})^2}{2\sigma^2} \right\} dl, \quad (5.28) \]

where \pi_1 and \pi_2 are the prior probabilities of S_1 and S_2.

Several factors govern the relation between the error probabilities and the number of pseudo targets: the positions of the sources relative to the subsets, the length of each subset, and the variances \sigma^2 of the speaker models. Since Gaussian models are used for both sources and pseudo targets, and the distances of the sources from the pseudo targets differ, the rate of change of type-1 errors may differ from that of type-2 errors: the rate of change of a Gaussian probability density depends on the distance from its mean [Fig. 5.4]. These differing rates ultimately affect the total error probability.

The relation between the number of target models and the error probabilities also depends on \sigma^2. Two sources can be differentiated well when \sigma^2 is small relative to the distance between the source means, whereas a relatively large \sigma^2 makes the sources overlap severely. Considering that speaker models usually have large variances and hence considerable overlap, \sigma^2 is typically large relative to the distances between speaker model means; equivalently, the intra-speaker variance of each model is larger than the inter-speaker variance. Hence we introduce the following clustering criterion [12]:

\[ J_c = \frac{\sigma_B^2}{\sigma_{S_1}^2 + \sigma_{S_2}^2}, \quad (5.29) \]

where \sigma_B^2 is the between-source variance and \sigma_{S_1}^2, \sigma_{S_2}^2 are the within-source variances. This criterion, the ratio of between-source variance to within-source variance, also affects the capacity of our unsupervised speaker indexing: a lower J_c indicates that the source speaker models to be indexed are highly overlapped, while a higher J_c implies that the sources are more separable.

The length of a subset is closely related to the number of targets. As the number of Gaussian models increases, the length of [a, b] decreases: the type-1 error probability rises and the type-2 error probability falls.
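The following sketch evaluates Eqs. (5.26)-(5.28) directly with the Gaussian CDF for one assigned pseudo target; the means, variance, dominant range, and priors below are illustrative values, not those of the thesis simulation.

```python
from math import erf, sqrt

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def error_probabilities(mu_s1, mu_s2, sigma, a, b, pi1=0.5, pi2=0.5):
    """Eqs. (5.26)-(5.28) for a pseudo target assigned to source S1
    whose dominant range is [a, b]."""
    mass_s1 = gaussian_cdf(b, mu_s1, sigma) - gaussian_cdf(a, mu_s1, sigma)
    mass_s2 = gaussian_cdf(b, mu_s2, sigma) - gaussian_cdf(a, mu_s2, sigma)
    pe1 = 1.0 - mass_s1   # (5.26): S1 probability mass escaping [a, b]
    pe2 = mass_s2         # (5.27): S2 probability mass captured by [a, b]
    return pe1, pe2, pi1 * pe1 + pi2 * pe2  # (5.28)

# A narrow dominant range misses much of S1 (high type-1 error) while
# also capturing little of S2 (low type-2 error): the trade-off.
print(error_probabilities(mu_s1=0.0, mu_s2=1.0, sigma=1.0, a=-0.5, b=0.5))
```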
5.5.2 Simulation and Result

Based on the statistics and information theory discussed in the previous sections, we simulated the relation between the error probabilities and the number of sample speaker models (pseudo target models), N. As the simplified case, we used one-dimensional Gaussians with identical variances for all speaker models (sources and pseudo targets). The feature range was bounded and split equally into N (N = 2, 4, 6, 8, ..., 64) sub-ranges, and we investigated the total error probability while varying J_c and N. The prior probabilities were assumed equal,

\[ \pi_1 = \pi_2 = \tfrac{1}{2}, \quad (5.30) \]

so the total error probability was

\[ P_e = \frac{P_{e1} + P_{e2}}{2}, \quad (5.31) \]

with the type-1 error rising and the type-2 error falling as the number of pseudo target models N increased.

[Figure 5.5: (a) Clustering criterion (Jc) vs. the number of sample speaker models yielding the minimum total error probability. (b) Number of sample speaker models vs. total error probability for varying clustering criterion (Jc = 0.08, 0.72, 2.00, 7.03).]

Fig. 5.5(a) shows that the number of sample speaker models (between 2 and 64) accompanied by the minimum total error probability is inversely proportional to the clustering criterion:

\[ N_{\min(P_e)} = \frac{\alpha}{J_c}, \quad (5.32) \]

where \alpha is a positive number that can be determined empirically. In Fig. 5.5(a), the value on the y-axis falls to 2, the smallest number of samples that can distinguish two sources, when J_c rises to about 7.03; as J_c gets smaller, a larger number of samples is required to distinguish the two sources. Thus a relatively large number of sample models can yield lower errors when indexing highly overlapped sources. We also examined the relation between the number of sample models and the total error probability while varying J_c between 0.08 and 7.03 [Fig. 5.5(b)]. With smaller J_c, more sample models are required to reach the minimum total error probability, as noted above, but the minimum value of the total error probability also increases. In addition, as the number of samples increases, the total error probability converges to 0.5 regardless of J_c.

From the simulations, we note that J_c is an important element in deciding the number of sample speaker models. To investigate typical real-world values of J_c, we conducted an experiment with a 500-speaker data set (300 females and 200 males) obtained from the Speaker Recognition Benchmark NIST Speech (1999) corpus. We assumed that speakers could be represented with Gaussian models and calculated J_c in three cases: male & male, female & female, and male & female. Our experimental results (Table 5.1) show that the average J_c is 0.1165.

Table 5.1: Average clustering criterion (Jc) when the two speakers to be indexed are both male, both female, or male and female. The last column (Total) is the average of the three cases.

          Male & Male    Female & Female    Male & Female    Total
    Jc    0.1174         0.1158             0.1162           0.1165
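A minimal sketch of Eq. (5.29) and the rule of thumb of Eq. (5.32) for 1-D Gaussian speaker models. Two assumptions here are mine, not the thesis's: the between-source variance of two equally weighted sources is taken as ((mu1 - mu2) / 2)^2, and alpha is set to 2.2 as a placeholder so that the average Jc of 0.1165 maps to roughly the 19 samples discussed next.

```python
def clustering_criterion(mu1, var1, mu2, var2):
    """Eq. (5.29): between-source variance over the sum of the
    within-source variances, for two 1-D Gaussian speaker models.
    Assumption: the between-source variance of two equally weighted
    sources is the variance of {mu1, mu2}, i.e. ((mu1 - mu2) / 2)**2."""
    sigma_b2 = ((mu1 - mu2) / 2.0) ** 2
    return sigma_b2 / (var1 + var2)

def samples_needed(jc, alpha=2.2):
    """Eq. (5.32): N inversely proportional to Jc. alpha = 2.2 is a
    placeholder fit, chosen so that Jc = 0.1165 gives about 19 samples."""
    return max(2, round(alpha / jc))

jc = clustering_criterion(0.0, 1.0, 1.4, 1.0)  # heavily overlapped sources
print(jc, samples_needed(jc))                  # ~0.245 -> 9 samples
print(samples_needed(0.1165))                  # -> 19
```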
Referring to Fig. 5.5(a), about 19 samples are required for two-speaker indexing. From the simulation and this investigation, we found that the relation between the error probabilities and the number of sample speaker models varies with the similarity of the target speaker models to be indexed (the clustering criterion J_c). In conclusion, the proper number of sample speaker models for indexing two speakers can be determined from J_c. In practice, more complicated statistical models (Gaussian Mixture Models) are used for speaker recognition, which makes the relation between the number of sample models and the total error probability harder to analyze; nevertheless, this method lets us approximate the proper number of samples.

5.6 Some Further Thoughts

We explored the capacity of unsupervised speaker indexing through an analytical study and a simulation. The simulation showed the importance of the similarity (or separability) between the two speakers to be indexed for the relation between the number of sample models and the total error probability. However, we still need a method to calculate the similarity of sources under arbitrary conditions: prior information alone is not sufficient for similarity measures. We may adaptively control the number of samples by monitoring the target speakers being indexed and receiving feedback on the fly.

We also lack an analysis of these relations for the multiple (>2) speaker indexing case. In multiple-speaker indexing, some of the models may be severely overlapped while others are separable, so it is not easy to calculate J_c and determine the best number of samples; we might use the average or the minimum of J_c over the multiple speakers. All of the points above will be far more complicated to analyze and simulate, and we can only assume that the situation is similar to two-speaker indexing.

Chapter 6
Conclusion and Future Work

6.1 Conclusion

We presented a novel method for enabling unsupervised speaker indexing. For unsupervised sequential processing without any prior knowledge about the speakers to be indexed, a generic model set was incorporated into the general speaker indexing framework. This generic model set was shown to help the unsupervised speaker indexing system overcome some of the difficulties arising from the lack of data for building true target speaker models, since it needs neither information about the type of speakers nor initial training data for the target speakers. This also implies that we do not have to retrain speaker models whenever we test with different speakers. Building on this merit, we proposed the Sample Speaker Models (SSM) approach, which showed better and more stable performance than other generic model methods such as the Universal Background Model (UBM) and the Universal Gender Models (UGM).

We used telephone conversation data and broadcast news to evaluate the performance of our algorithm with generic speaker models built from a pool of 100 speakers' data.
The condition that yielded the best performance in our experiments was using 2 second analysis segments in conjunction with 16 sample speaker models for 2 speaker conversations. The total error rate was lower compared with the UBM case. In the case of 4 speaker conversations, the 16 sample speaker model set was deemed to be the best. As the number of speakers involved in conversations increases, the error rate of UBM increased at a much higher rate than th at of 16-SSM model set. In the experi ment with the broadcast news, the actual number of speakers was unknown. The news clips considered had between two and six speakers. More samples were required than those in the 4 speaker telephone conversations to cover the wide range in the number of speakers. The result showed that 64 was the best number of speakers for broadcast news clips that we experimented with. From our experiments, the Sample Speaker Models (SSM) approach was found to be more robust and stable to variations in the number of speakers and data types than previously proposed generic model approaches such as the Universal Background Model (UBM) and the Universal Gender Models (UGM). However, there was a key issue worth considering to further improve the overall performance of unsupervised speaker indexing: strategies for effectively sampling the SSM set. For a given feature space, some of the models can be severely overlapped, and some are farther apart, even if this may be thought to be inherently natural. Firstly, we adopted the Markov Chain Monte Carlo (MCMC) method to pick the samples from the pool. This m ethod attained a measure of success to get some positions of speaker models in the feature space primarily due to 95 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . the fact th at more samples were picked from the space where more speaker models in the pool were concentrated in the feature space. To get more properly distributed speaker models, we proposed a novel sampling method (Speaker Quantization) for attaining some measure of success. Some of our results were related to seeking the optimal size of the sampled speaker model set where the goal was to determine what number of sampled models is optimal for a given speaker indexing scenario. This relates to the question of the “capacity” of speaker indexing, i.e. model set size versus indexing performance trade offs. Prom the experiment with generic speaker models built with the pool of 400 speaker data, we could attain more properly positioned samples to get the best result. Speaker Quantization also gave better results of unsupervised speaker indexing than any other generic model considered thus far, including SSM with MCMC. We, however, understood the need for further investigation in regard to optimal number of samples. Toward this goal, we made use of an analytical method to explore the question of capacity. Through simulation of a simplified case, we recognized that the best number of sample speaker models to index speakers strongly depends on the Clustering Criterion (Jc), the similarity of input speakers (targets). Using (Jc), we may estimate the proper number of samples before indexing. Further investigations are needed for understanding the role of different Clustering Criteria and performance levels. 
In addition, experiments with a speaker data pool comprising more than 400 speakers are necessary to complete the analysis and to generalize the trends observed in this study. Such efforts are critical, especially for unsupervised speaker identification over a very large population.

6.2 Future Work

6.2.1 Unsupervised Speaker Indexing with a Large Population

We have used only 400 speakers' data for speaker model training. We acknowledge that this is a very small number of samples considering, for instance, the population of the United States (over 200 million). Everyone's voice and speaking style is acoustically distinct, yet some voices are very similar and others quite different, so we can categorize speakers by their acoustic characteristics. Here we meet difficulties, captured by two questions:

• How many categories do we need to identify a given number of speakers with adequate indexing accuracy?
• Even if we knew the answer to the previous question, does that number remain the best regardless of the environment or domain of the speech data?

If we can categorize speakers generally, or at least domain-dependently, our unsupervised speaker indexing method would prove even more useful in the many applications that cannot provide any prior information about the speakers.

6.2.2 Multi-Channel Unsupervised Speaker Indexing

Meetings are one of the fundamental human activities. Face-to-face meetings usually involve several modalities, including speech, gesture, handwriting, and person identification, and it is very important to recognize and integrate each of these modalities for an accurate record of a meeting. However, each modality poses its own difficulties for automatic recognition. Among these modalities, speech plays an important role in a meeting browser. To produce the record automatically, we have to solve the assignment problem (who is saying what), which involves speaker identification and speech recognition [4][58][37]. Speaker identification suffers from segmentation difficulties in a crowded room. The speech signal can be captured hands-free using a microphone array, which is useful for discriminating between multiple competing speakers based on their locations. While we have so far used only single-channel speech data, we could also use multi-channel speech data to identify speakers in an unsupervised manner. Multi-channel information can help improve the overall performance of unsupervised speaker indexing and of speaker (or object) tracking [55]. Jin and Schultz have used multiple channels for speaker segmentation and clustering in meetings [25]: before speaker segmentation and clustering, they selected the best channel to unify the boundaries across the channels. This is one way of exploiting multiple channels to improve speaker indexing systems.

Before attempting multi-channel speaker indexing, we need to investigate some basic problems. First, there is the channel selection problem. Since participants usually sit around a table in meetings, the distances between them are short.
Hence it is common for each microphone to pick up a significant level of mixed speech from the multiple speakers near it. We may assume that the speech signal from the speaker nearest each microphone has higher energy than the others, but this is not always true. Second, there is the audio classification problem: meeting participants may make noises (coughing, clearing the throat, murmuring, and so on) while someone is talking, and some headset microphones may be sensitive to such noises. Third, a microphone array is usually installed on the table rather than worn as headsets; if a speaker moves to another location or shakes his or her head while speaking, indexing becomes more difficult because of the variation in speech level.

6.2.3 Control over Overlapped Speech in Unsupervised Speaker Indexing

The last challenging problem is how to resolve overlapped speech. It has gained considerable attention because it affects the performance of automatic speech and speaker recognition, and such overlap has serious consequences for speaker indexing. Overlapped speech has been noted as a characteristic of natural conversation. So far, researchers have handled only a single channel, but it is very difficult to resolve overlapped speech with single-channel information alone [69][63]. Lathoud et al. introduced an approach that processes location-based features from a microphone array within a GMM/HMM framework to produce a global segmentation of speaker changes; this approach is accurate, but each possible combination of active speakers has to be modeled [34].

We have studied the task of automatically segmenting speakers on individual channels. However, we also need information from the other channels to resolve overlapped speakers: each channel source would be processed by its own speaker indexing engine, and the engines would communicate for robust speaker indexing and for the resolution of overlapped speakers. The challenge of this strategy is how to handle multiple sources sequentially without unacceptable delay.

Reference List

[1] H. Abut, R. M. Gray, and G. Rebolledo. Vector quantization of speech and speechlike waveforms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30(3):423-435, June 1982.
[2] J. Ajmera and C. Wooters. A robust speaker clustering algorithm. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop 2003, pages 411-416, 2003.
[3] R. Bakis, S. Chen, P. Gopalakrishnan, R. Gopinath, S. Maes, L. Polymenakos, and M. Franz. Transcription of broadcast news shows with the IBM large vocabulary speech recognition system. In Proceedings of Speech Recognition Workshop, pages 67-72, 1997.
[4] J. F. Bonastre, C. Delacourt, T. Fredouille, T. Merlin, and C. Wellekens. A speaker tracking system based on speaker turn detection for NIST evaluation. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1177-1180, 2000.
[5] J. P. Campbell. Speaker recognition: A tutorial. Proceedings of the IEEE, 85(9):1436-1462, 1997.
[6] U. V. Chaudhari, J. Navratil, G. N. Ramaswamy, and S. H. Maes. Very large population text-independent speaker identification using transformation enhanced multi-grained models. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 7-11, May 2001.
[7] S. Chen and P. Gopalakrishnan. Speaker, environment, and channel change detection and clustering via the Bayesian information criterion. In Proceedings of DARPA Speech Recognition Workshop, pages 127-132, 1998.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley Interscience, 1991.
[9] M. Davy, C. Doncarli, and J. Tourneret. Supervised classification using MCMC methods. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 33-36, 2000.
[10] J. R. Deller, J. H. L. Hansen, and J. G. Proakis. Discrete-Time Processing of Speech Signals. IEEE Press Classic Reissue. Macmillan, New Jersey, 1993.
[11] M. Do. Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE Signal Processing Letters, 9:115-118, 2003.
[12] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2001.
[13] W. H. Equitz. A new vector quantization clustering algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(10):1568-1575, October 1989.
[14] J. E. Gentle. Statistics and Computing, pages 165-209. Springer, 2003.
[15] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression, pages 307-486. Kluwer Academic Publishers, 1999.
[16] H. Gish, M. Siu, and R. Rohlicek. Segregation of speakers for speech recognition and speaker identification. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 873-876, 1991.
[17] R. M. Gray. Vector quantization. IEEE Magazine on Acoustics, Speech, and Signal Processing, pages 4-29, April 1984.
[18] R. M. Gray and D. L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325-2383, 1998.
[19] T. Hain, S. E. Johnson, A. Tuerk, P. C. Woodland, and S. J. Young. Segment generation and clustering in the HTK broadcast news transcription system. In Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, pages 133-137, 1998.
[20] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, pages 411-426. Springer, 2001.
[21] A. Higgins, L. Bahler, and J. Porter. Voice identification using nearest neighbor distance measure. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 375-378, 1993.
[22] R. V. Hogg and E. A. Tanis. Probability and Statistical Inference. Prentice Hall, New Jersey, sixth edition, 2001.
[23] X. Huang, A. Acero, and H. W. Hon. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, pages 930-931. Prentice Hall PTR, 2001.
[24] A. Jain, P. Moulin, M. I. Miller, and K. Ramchandran. Information-theoretic bounds on target recognition performance based on degraded image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1153-1166, 2002.
[25] Q. Jin and T. Schultz. Speaker segmentation and clustering in meetings. In Proceedings of International Conference on Spoken Language Processing 2004, page TuC2105o.2, October 2004.
[26] T. Kinnunen, T. T. Kilpelainen, and P. Franti. Comparison of clustering algorithms in speaker identification. In Proceedings of IASTED International Conference on Signal Processing and Communications (SPC 2000), pages 222-227, 2000.
[27] G. Kolano and P. Regel-Brietzmann. Combination of vector quantization and Gaussian mixture models for speaker verification with sparse training data. In Proceedings of Eurospeech 1999, pages 1203-1206, 1999.
[28] H. J. Kunzel. Current approaches to forensic speaker recognition. In Proceedings of ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pages 135-141, 1994.
[29] S. Kwon and S. Narayanan. Speaker change detection using a new weighted distance measure. In Proceedings of International Conference on Spoken Language Processing, volume 4, pages 2537-2540, 2002.
[30] S. Kwon and S. Narayanan. A method for on-line speaker indexing using generic reference models. In Proceedings of Eurospeech 2003, pages 2653-2656, 2003.
[31] S. Kwon and S. Narayanan. A study of generic models for unsupervised on-line speaker indexing. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop 2003, pages 423-428, 2003.
[32] S. Kwon and S. Narayanan. Speaker model quantization for unsupervised speaker indexing. In Proceedings of International Conference on Spoken Language Processing 2004, page WeC2102p.18, October 2004.
[33] S. Kwon and S. Narayanan. Unsupervised speaker indexing using generic models. IEEE Transactions on Speech and Audio Processing, in press.
[34] G. Lathoud, I. A. McCowan, and D. C. Moore. Segmenting multiple concurrent speakers using microphone arrays. In Proceedings of Eurospeech 2003, pages 2889-2892, 2003.
[35] C. H. Lee, F. K. Soong, and K. K. Paliwal. Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer Academic Publishers, 1996.
[36] H. Lin and A. N. Venetsanopoulos. A weighted minimum distance classifier for pattern recognition. In Proceedings of Canadian Conference on Electrical and Computer Engineering, volume 2, pages 904-907, 1993.
[37] Q. Lin, E. E. Jan, and J. Flanagan. Microphone arrays and speaker identification. IEEE Transactions on Speech and Audio Processing, 2:622-629, 1994.
[38] D. Liu and F. Kubala. Online speaker clustering. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 572-575, 2003.
[39] J. S. Liu. Monte Carlo Strategies in Scientific Computing, pages 105-114. Springer, 2001.
[40] M. Liu, E. Chang, and B. Q. Dai. Hierarchical Gaussian mixture model for speaker verification. In Proceedings of International Conference on Spoken Language Processing, volume 2, pages 1353-1356, 2002.
[41] G. Lu and T. Hankinson. An investigation of automatic audio classification and segmentation. In Proceedings of International Conference on Spoken Language Processing, pages 776-781, 2000.
[42] L. Lu, H. J. Zhang, and H. Jiang. Content analysis for audio classification and segmentation. IEEE Transactions on Speech and Audio Processing, 10(7):504-516, 2002.
[43] D. J. C. MacKay. Introduction to Monte Carlo methods, pages 175-204. MIT Press, 1999.
[44] I. Magrin-Chagnolleau and F. Bimbot. Indexing telephone conversations by speakers using time-frequency principal component analysis. In Proceedings of IEEE International Conference on Multimedia and Expo 2000, volume 2, pages 881-894, 2000.
[45] R. J. Mammone, X. Zhang, and R. Ramachandran. Robust speaker recognition: A feature-based approach. IEEE Signal Processing Magazine, 13(5):58-71, September 1996.
[46] W. Mendenhall and R. Sincich. Statistics for Engineering and the Sciences. Prentice Hall, fourth edition, 1995.
[47] T. M. Mitchell. Machine Learning, pages 144-145. McGraw-Hill, 1997.
[48] Y. Moh, P. Nguyen, and J.-C. Junqua. Towards domain independent speaker clustering. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 85-88, 2003.
[49] K. Mori and S. Nakagawa. Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 413-416, 2001.
[50] M. Nadler and E. P. Smith. Pattern Recognition Engineering. Wiley Interscience, New York, 1992.
[51] M. Nishida and Y. Ariki. Speaker indexing for news articles, debates, and drama in broadcasted TV programs. In Proceedings of IEEE International Conference on Multimedia Computing and Systems, volume 2, pages 466-471, 1999.
[52] M. Nishida and T. Kawahara. Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 172-175, 2003.
[53] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing, pages 629-651. Prentice Hall, 1999.
[54] A. Orlitsky. Scalar versus vector quantization: Worst case analysis. IEEE Transactions on Information Theory, 48(6):1393-1409, June 2002.
[55] J. Ortega-Garcia and J. Gonzalez-Rodriguez. Providing single and multi-channel acoustical robustness to speaker identification systems. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 1107-1110, 1997.
[56] A. Papoulis. Probability, Random Variables, and Stochastic Processes, pages 265-279. McGraw-Hill, 1991.
[57] J. Pelecanos, S. Myers, S. Sridharan, and V. Chandran. Vector quantization based Gaussian modeling for speaker verification. In Proceedings of International Conference on Pattern Recognition, volume 1, pages 294-297, 2000.
[58] T. Pfau, D. P. W. Ellis, and A. Stolcke. Multispeaker speech activity detection for the ICSI meeting recorder. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop 2001, pages 107-110, 2001.
[59] T. F. Quatieri. Discrete-Time Speech Signal Processing: Principles and Practice. Prentice-Hall Signal Processing Series. Prentice Hall, New Jersey, 2002.
[60] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition, pages 122-132. Prentice Hall PTR, 1993.
[61] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals, pages 476-489. Prentice Hall, 1978.
[62] J. Ramirez, J. C. Segura, C. Benitez, A. D. L. Torre, and A. J. Rubio. A new Kullback-Leibler VAD for speech recognition in noise. IEEE Signal Processing Letters, 11(2):266-269, 2004.
[63] A. M. Reddy and B. Raj. A minimum mean squared error estimation for single channel speaker separation. In Proceedings of International Conference on Spoken Language Processing 2004, page ThC1202p.11, October 2004.
[64] D. A. Reynolds. An overview of automatic speaker recognition technology. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 4072-4075, 2002.
[65] C. P. Robert and G. Casella. Monte Carlo Statistical Methods, pages 71-192. Springer, 1999.
[66] A. Rosenberg, A. Gorin, and S. Parthasarathy. Unsupervised speaker segmentation of telephone conversations. In Proceedings of International Conference on Spoken Language Processing 2002, volume 1, pages 565-568, 2002.
[67] A. E. Rosenberg, I. Magrin-Chagnolleau, S. Parthasarathy, and Q. Huang. Speaker detection in broadcast speech databases. In Proceedings of International Conference on Spoken Language Processing, pages 202-205, 1998.
[68] A. E. Rosenberg, O. Siohan, and S. Parathasarathy. Speaker verification using minimum verification error training. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 105-108, 1998.
[69] E. Shriberg, A. Stolcke, and D. Baron. Observations on overlap: Findings and implications for automatic multi-party conversation. In Proceedings of Eurospeech 2001, volume 2, pages 1359-1362, 2001.
[70] P. Sivakumaran, A. M. Ariyaeeinia, and J. Fortuna. An effective unsupervised scheme for multiple-speaker-change detection. In Proceedings of International Conference on Spoken Language Processing, pages 569-572, 2002.
[71] S. S. Soliman and M. D. Srinath. Continuous and Discrete Signals and Systems. Prentice Hall Information and System Science Series. Prentice Hall, New Jersey, 1990.
[72] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish. Clustering speakers by their voices. In Proceedings of International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 12-15, 1998.
[73] N. Srinivasamurthy and S. Narayanan. Language-adaptive Persian speech recognition. In Proceedings of Eurospeech 2003, pages 3137-3140, 2003.
[74] R. Veldhuis. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters, 9:96-99, 2002.
[75] P. C. Woodland. Speaker adaptation: Techniques and challenges. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, pages 85-90, 1999.
[76] J. Wu and E. Chang. Cohorts based custom models for rapid speaker and dialect adaptation. In Proceedings of Eurospeech 2001, pages 1261-1264, 2001.
[77] T. Wu, L. Lu, K. Chen, and H. Zhang. UBM-based real-time speaker segmentation for broadcasting news. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 193-196, 2003.
[78] J. Yang, X. Zhu, R. Gross, J. Kominek, Y. Pan, and A. Waibel. Multimodal people ID for a multimedia meeting browser. In Proceedings of ACM International Conference on Multimedia, volume 1, pages 159-168, 1999.
[79] B. Zhou and J. H. L. Hansen. Unsupervised audio stream segmentation and clustering via the Bayesian information criterion. In Proceedings of International Conference on Spoken Language Processing, pages 714-717, 2000.