Representation, Classification and Information Fusion for Robust and Efficient Multimodal Human States Recognition

by

Ming Li

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2013

Copyright 2013 Ming Li

Table of Contents

List of Figures
List of Tables
Abstract
Chapter 1: Introduction
  1.1 Background
  1.2 Simplified supervised i-vector modeling for representation in LID and SRE tasks
  1.3 Sparse representation for classification and representation
    1.3.1 Sparse representation for classification in talking face video verification, SRE, and age/gender recognition tasks
    1.3.2 Sparse representation for representation in LID and SRE tasks
  1.4 Latent factor analysis based Eigenchannel factor vector modeling in affective states recognition tasks
  1.5 A generalized optimization framework
  1.6 Speaker verification based on fusion of acoustic and articulatory information
  1.7 Multimodal physical activity recognition
  1.8 The role of the proposed methods in each individual application
Chapter 2: Simplified supervised i-vector for representation
  2.1 Methods
    2.1.1 The i-vector baseline
    2.1.2 Label regularized supervised i-vector
    2.1.3 Simplified i-vector
    2.1.4 GFCC features and Gabor features for robust LID
    2.1.5 Score level fusion
  2.2 Corpus, classification task and feature extraction
    2.2.1 RATS database dev2 task
    2.2.2 NIST SRE 2010 database female condition 5 task
  2.3 Experimental Results
    2.3.1 LID
    2.3.2 SRE
Chapter 3: Sparse representation for both representation and classification
  3.1 Sparse representation for classification
    3.1.1 Methods
    3.1.2 Corpora, tasks and feature extraction
    3.1.3 Experimental Results
  3.2 Sparse representation for representation
    3.2.1 Methods
    3.2.2 Experimental Results
Chapter 4: Eigenchannel for speaker states representation
  4.1 Methods
    4.1.1 GMM baseline
    4.1.2 Eigenchannel matrix estimation
    4.1.3 Eigenchannel factor extraction
  4.2 Corpora, tasks and feature extraction
    4.2.1 Intoxicated speech detection
    4.2.2 Speaker emotion classification
  4.3 Experimental Results
Chapter 5: A general optimization framework for representation
  5.1 A general optimization framework
Chapter 6: Acoustic and articulatory information fusion for speaker verification
  6.1 Data
  6.2 Experimental Setup
    6.2.1 Subject-independent inversion
    6.2.2 Articulatory features
    6.2.3 Front end processing
    6.2.4 GMM baseline modeling
  6.3 Experimental Results and Discussions
  6.4 Conclusion
Chapter 7: Multimodal physical activity recognition
  7.1 Feature extraction
    7.1.1 Temporal feature extraction
    7.1.2 Cepstral feature extraction
  7.2 Activity Modeling
    7.2.1 SVM Classification for temporal features
    7.2.2 GMM modeling for cepstral features
  7.3 System Fusion
    7.3.1 Feature level fusion
    7.3.2 Score level fusion
  7.4 Experimental setup and results
    7.4.1 Data acquisition and evaluation
    7.4.2 Results
  7.5 Discussion
  7.6 Conclusion
Chapter 8: Conclusions
Bibliography
Appendix
  .1 Journal papers
  .2 Conference papers

List of Figures

1.1 Five fundamental questions from multimodal data for human states recognition
1.2 Typical examples of human states recognition tasks in this work
1.3 Applications across numerous domains
1.4 The framework with representation, classification and information fusion
1.5 Performing classification in the generative models' parameter space
1.6 The same signals or features could be adopted for different tasks
1.7 Information vs variability from the same set of MFCC features for context independent speaker recognition
1.8 Information vs variability from the same set of MFCC features for speaker independent emotion recognition
1.9 Multimodal signals could also be jointly used to recognize a single human state
1.10 Fusing diverse subsystems
1.11 The baseline systems and the proposed methods
1.12 Face detection, eyes detection and face normalization
1.13 Block wise DCTmod2xy features [128]
1.14 Sparse coding vs factor analysis
1.15 KSVD based dictionary learning (split-merge strategy)
1.16 The system overview
1.17 KNOWME wearable body area network system
1.18 The proposed physical activity recognition system overview
1.19 Language identification
1.20 Speaker verification
1.21 Speaker verification using articulation and acoustics information fusion
1.22 Face video verification
1.23 Intoxication and emotion recognition
1.24 Age and gender recognition
2.1 I-vector modeling and its objective function
2.2 Supervised i-vector modeling and its objective function
2.3 Simplified supervised i-vector modeling and its objective function
2.4 The n_j quantization curve in the log domain, 300 indexes
2.5 Waveform (top), spectrum (middle), and cochleagram (bottom) of one speech segment in the RATS database
2.6 2D-Gabor filters arranged by spectral and temporal modulation frequencies
2.7 Waveform and spectrum of two speech segments in the RATS database (dv2 0011 B and dv2 0439 F)
2.8 EER and minDCF values for LID performance in Table 2.5
2.9 EER performance on the NIST 2010 SRE female condition 5 task
3.1 The sparse solution of a true trial with problem B (3.7)
3.2 The sparse solution of a false trial with problem B (3.7)
3.3 MAP adaptation and mean shifted GMM model
3.4 The correlation matrix of the over-complete dictionary A_2 (3.7), N_2 = 4601. The coherence values max_{j != k} <A_2j, A_2k> are 0.996 (left) and 0.963 (right), respectively
3.5 (a) Sigma_1 and Sigma_ubm for the 1st Gaussian component; (b) the distribution of all the elements of (Sigma_1)^{-1} Sigma_ubm
3.6 (a) In the NIST SRE 2010 database, K = 3000, tau = 6; (b) in the LID RATS database, K = 600, tau = 150
3.7 One example of the original and the reconstructed mean supervector for the type I s-vector. Blue is the original, red is the reconstruction. (left) full supervector; (right) top 800 dimensions
3.8 EER and minDCF values for LID performance in Table 3.12. (left) Table 5 ID 11; (right) Table 5 ID 6
3.9 EER and minDCF values for the type II s-vector for SRE female 05.nve-nve.phn-phn
3.10 EER and minDCF values for the type I s-vector for SRE female 05.nve-nve.phn-phn
3.11 DET curves of the proposed systems in Table 3.15
4.1 Accuracy against Eigenchannel matrix rank
4.2 First 2 dimensions of the Eigenchannel factor vector of fold 0 IEMOCAP training data
5.1 A general optimization framework
6.1 Error bars of pair-wise correlation coefficients between session-one estimated articulatory signals (after DTW) from all 46 speakers (all speak the same word sequence, 1081 pair-wise DTW and correlation computations in total). The nine dimensions of the estimated articulation are LA, PRO, JAW OPEN, TTCD, TBCD, TDCD, TTCL, TBCL, and TDCL, respectively
6.2 Estimated articulatory signals (after DTW) of lip aperture and tongue body constriction location of session one from two speaker pairs. The top plots are for the pair of speakers JW 48 and 33, and the bottom plots are for the pair of speakers JW 48 and 59. The first speaker pair shows high correlation, while the other pair shows low correlation. (a) LA of file TP001 2 from JW48 and 33; (b) TBCL of file TP001 2 from JW48 and 33; (c) LA of file TP001 from JW48 and 59; (d) TBCL of file TP001 from JW48 and 59
6.3 DET curves of speech only (ID1) and fusion (ID3) results
7.1 The proposed physical activity recognition system overview
7.2 The mean and standard deviation of normalized ECG signals
7.3 Hermite basis functions with delta = 10, D = 201: (a) l = 1; (b) l = 2; (c) l = 3; (d) l = 4
7.4 The original and the reconstructed ECG heartbeat from HPE
7.5 Placement of electrodes (black filled circles) and accelerometer (red open triangle) and the data collection environment

List of Tables

2.1 Target and nontarget languages in the RATS database
2.2 Corpora used to estimate the UBM, total variability matrix, JFA factor loading matrix, WCCN, LDA, PLDA and the normalization data for the NIST 2010 task condition 5
2.3 Complexity of the proposed methods for a single utterance (GMM size C = 2048, feature dimension D = 56, T matrix rank K = 600, type I s-vector tau_1 = 200, PCA projection matrix V of size R x CD = 2500 x (2048 x 56), target class number M = 6; time was measured on an Intel i7 CPU with 12 GB memory)
2.4 Performance of the proposed methods with SVM modeling for the LID RATS 120 seconds task
2.5 Performance of score level fusion with systems based on multiple features
2.6 Performance of the proposed methods for the 2010 NIST SRE task, female part, condition 5
2.7 Performance of the proposed systems in fusion
3.1 The two configurations (S1 and S2) and the corresponding complexity of sparse representation with Tnorm
3.2 Corpora used to estimate the UBM, total variability matrix (T), WCCN, LDA, SVM imposter and the Tnorm data
3.3 EER (%) performance of the proposed systems
3.4 EER (%) of the sparse representation system (P protocol)
3.5 Performance of the sparse representation system on the male part of the NIST 08 test with configuration S1
3.6 Performance of the sparse representation system on the female part of the NIST 08 test with configuration S1
3.7 Performance on the male part of the NIST 08 test
3.8 Performance on the female part of the NIST 08 test
3.9 Performance of GMM UWPP supervector modeling evaluated on the development set with a 512-component GMM
3.10 Performance of the fusion system between sparse representation and SVM on the square root UWPP supervectors
3.11 Performance of the proposed methods with SVM modeling for the LID RATS 120 seconds task
3.12 Performance of score level fusion with systems based on multiple features
3.13 Performance of the type I s-vector system using PLDA modeling (SRE female 05.nve-nve.phn-phn)
3.14 Performance of the type II s-vector system using PLDA modeling (SRE female 05.nve-nve.phn-phn)
3.15 Performance of the proposed systems when fused with the JFA and i-vector baseline systems for the SRE task
4.1 Unweighted accuracy (UA) and weighted accuracy (WA) [140] on the development set of the ALC database in the 2011 speaker state challenge
4.2 Unweighted accuracy (UA) and weighted accuracy (WA) per utterance for 10-fold leave-one-speaker-out cross validation on the IEMOCAP database
5.1 Parameter settings for the generalized optimization problem
6.1 Data set partition for the SRE experiments. "Other L5S sessions of JW41-63" denotes all the longer-than-5-seconds sessions (excluding 11) of speakers JW41-JW63
6.2 The performance of 26 speaker classes (closed set) identification systems based on different utterance-level features derived from estimated articulatory data
6.3 Performance of the MFCC-real-articulation system with the "ALL-small" protocol
6.4 Performance of the MFCC-estimated-articulation system with the "ALL" protocol
6.5 Performance of the MFCC-estimated-articulation system with the "L5S" protocol
7.1 Conventional temporal accelerometer features
7.2 Performance (% correct) of the SVM system based on temporal ECG features (HR: heart rate, NM: noise measurement)
7.3 Evaluation of GMM systems based on different configurations of cepstral feature extraction (ACC = accelerometer)
7.4 Score level fusion: the mean +/- standard deviation of accuracies P_c (%) for different subjects
7.5 The configuration and performance of the open-set tasks
Abstract

The goal of this work is to enhance the robustness and efficiency of the multimodal human states recognition task. Human states recognition can be considered as a joint term for identifying/verifying various kinds of human related states, such as biometric identity, language spoken, age, gender, emotion, intoxication level, physical activity, vocal tract patterns, ECG QT intervals and so on. I performed research on the aforementioned states recognition problems, and my focus is to increase the performance while reducing the computational cost.

I start by extending the well known total variability i-vector modeling (a factor analysis on the concatenated GMM mean supervectors) to the simplified supervised i-vector modeling to enhance the robustness and efficiency. First, by concatenating the label vector and the linear classifier matrix at the end of the mean supervector and the i-vector factor loading matrix, respectively, the traditional i-vectors are extended to label regularized supervised i-vectors. These supervised i-vectors are optimized to not only reconstruct the mean supervectors well but also minimize the mean square error between the original and the reconstructed label vectors, which makes the supervised i-vectors more discriminative in terms of the regularized label information. Second, I perform the factor analysis (FA) on the pre-normalized GMM first order statistics supervector to ensure each Gaussian component's statistics sub-vector is treated equally in the FA, which reduces the computational cost by a factor of 25. Since there is only one global total frame number in the equation, I make a global table of the resulting matrices indexed by its log value. By looking up the table, the computational cost of each utterance's i-vector extraction is further reduced by 4 times with a small quantization error. I demonstrate the utility of the simplified supervised i-vector representation on both the language identification (LID) and speaker verification (SRE) tasks, achieving comparable or better performance with a significant computational cost reduction.

Inspired by the recent success of sparse representation in face recognition, I explored the possibility of adopting sparse representation for both representation and classification in this multimodal human states recognition problem. For classification purposes, a sparse representation computed by l1-minimization (to approximate the l0 minimization) with quadratic constraints was proposed to replace the SVM on the GMM mean supervectors, and by fusing the sparse representation based classification (SRC) method with SVM, the overall system performance was improved. Second, by adding a redundant identity matrix at the end of the original over-complete dictionary, the sparse representation is made more robust to variability and noise. Third, both the l1 norm ratio and the background normalized (BNorm) l2 residual ratio are used and shown to outperform the conventional l2 residual ratio in the verification task. I showed the usage of SRC on GMM mean supervectors, total variability i-vectors, and UBM weight posterior probability (UWPP) supervectors for the face video verification, speaker verification and age/gender identification tasks, respectively. For representation purposes, rather than projecting the GMM mean supervector on the low rank factor loading matrix, I project the mean supervector on a large rank dictionary to generate sparse coefficient vectors (s-vectors). I show that the KSVD algorithm can be adopted here to learn the dictionary. I fuse the s-vector systems with other methods to improve the overall performance in the LID and SRE tasks.
I also present an automatic speaker affective state recognition approach which models the factor vectors in the latent factor analysis framework, improving upon the Gaussian Mixture Model (GMM) baseline performance. I consider the affective speech signal as the original normal average speech signal corrupted by affective channel effects. Rather than reducing the channel variability to enhance the robustness, as in the speaker verification task, I directly model the speaker state on the channel factors under the factor analysis framework. Experimental results show that the proposed speaker state factor vector modeling system achieved unweighted and weighted accuracy improvements over the GMM baseline on the intoxicated speech detection task and the emotion recognition task, respectively.

To summarize the methods for representation, I propose a general optimization framework. The aforementioned methods, such as traditional factor analysis, i-vector, supervised i-vector, simplified i-vector and s-vectors, are all special cases of this general optimization problem. In the future, I plan to investigate other kinds of distance measures, cost functions and constraints in this unified general optimization problem.

I use two examples to demonstrate my work in the areas of domain specific novel features and multimodal information fusion for the human states recognition task. The first application is speaker verification based on the fusion of acoustic and articulatory information. We propose a practical, feature-level fusion approach for combining acoustic and articulatory information in the speaker verification task. We find that concatenating articulation features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves the overall speaker verification performance. However, since access to measured articulatory data is impractical for real world speaker verification applications, we also experiment with estimated articulatory features obtained using an acoustic-to-articulatory inversion technique. Specifically, we show that augmenting MFCCs with articulatory features obtained from a subject-independent acoustic-to-articulatory inversion technique also significantly enhances the speaker verification performance. This performance boost could be due to the information about inter-speaker variation present in the estimated articulatory features, especially at the mean and variance level.

The second example is multimodal physical activity recognition. A physical activity (PA) recognition algorithm for a wearable wireless sensor network using both ambulatory electrocardiogram (ECG) and accelerometer signals is proposed. First, in the time domain, the cardiac activity mean and the motion artifact noise of the ECG signal are modeled by a Hermite polynomial expansion and principal component analysis, respectively. A set of time domain accelerometer features is also extracted. A support vector machine (SVM) is employed for supervised classification using these time domain features. Second, motivated by their potential for handling convolutional noise, cepstral features extracted from ECG and accelerometer signals based on a frame level analysis are modeled using Gaussian mixture models (GMM). Third, to reduce the dimension of the tri-axial accelerometer cepstral features, which are concatenated and fused at the feature level, heteroscedastic linear discriminant analysis is performed. Finally, to improve the overall recognition performance, fusion of the multimodal (ECG and accelerometer) and multi-domain (time domain SVM and cepstral domain GMM) subsystems at the score level is performed.
Chapter 1: Introduction

1.1 Background

The goal of this work is to enhance the robustness and efficiency of the multimodal human states recognition task. Nowadays, we can capture huge amounts of multimodal multimedia data (speech, image, video, accelerometer, ECG, etc.) of human users through computers or mobile devices. Given those multimodal signals, we want to understand the human user's situation more thoroughly, and therefore provide better services to the user in security, health, monitoring, user assistance, science and many other applications.

From Fig. 1.1 and Fig. 1.2, we can see that there are five fundamental questions (where, when, who, what, how) to describe this human states recognition problem. The where and when questions are already addressed by GPS navigation and device time. To answer the "who" question, we want to provide the identity states of the human user for multimodal biometric applications. It covers speaker verification (SRE) [96, 105, 106, 108], talking face video recognition [100], ECG biometrics [99], age/gender identification [95, 96], biometrics in the speech production system [106] and many other applications. For the "what" question, there are spoken language identification (LID) [102, 104], multimodal physical activity recognition [103], high-level descriptions of real-life physical activities [76], and other speech recognition related approaches. For the "how" question, it provides descriptions of health monitoring states, e.g. ECG QT interval detection, and mental affective states, such as emotion recognition [98], intoxication detection [15, 98], personality recognition [8], pathology recognition [74], and many other behavioral signal processing applications [14, 87].

Figure 1.1: Five fundamental questions from multimodal data for human states recognition

In this work, human states recognition can be considered as a joint term for identifying/verifying various kinds of human related states, such as biometric identity, language spoken, age, gender, emotion, intoxication level, personality, pathology, physical activity, vocal tract patterns, ECG QT intervals and so on. It could lead to many applications in different domains, as demonstrated in Fig. 1.3. I performed research on these aforementioned states recognition problems, and my focus is to increase the performance while reducing the computational cost.

Figure 1.2: Typical examples of human states recognition tasks in this work

Figure 1.3: Applications across numerous domains

Since the aforementioned human states are relatively stable, without rapid and dynamic changes within one speech utterance segment interval or a couple of seconds, and their state-of-the-art recognition methods share similar modeling approaches, this work converges to a global supervised modeling framework that includes representation, classification and information fusion, as shown in Fig. 1.4. This representation is a more general term which includes the statistical processing and modeling for the subsequent supervised classification step. The reason to adopt an information fusion module at the decision score level is to fuse results from different classifiers, representations and features to improve the overall performance.

Figure 1.4: The framework with representation, classification and information fusion
Due to the time varying property of speech, video, ECG and accelerometer signals, short time frame level features are the common choice. In contrast to recording-sample-level global features [44, 57], which are fed directly to the classifier, short time frame level features are modeled to fit a generative model for each recording sample, and then the feature vectors in the generative model space, or distance measures between those generative models, are employed for classification. As shown in Fig. 1.5, we usually perform classification in the generative models' parameter space, and the features in this parameter space are usually denoted as supervectors.

Figure 1.5: Performing classification in the generative models' parameter space

In this work, the Gaussian Mixture Model (GMM) is adopted as the generative model, while various supervectors [27, 38, 95, 144, 159] have been proposed to capture the information in the GMM model parameters' space as features for classification. One major challenge for human states recognition is to compensate for or reduce the variabilities. On one hand, some variabilities come from the nature of the features. For example, the same set of Mel-frequency cepstral coefficients (MFCCs) from speech can be used for speech recognition, speaker verification, language identification, age/gender identification and emotion recognition at the same time, as shown in Fig. 1.6. The useful information for one task may become variability or noise for other tasks. For example, in the context independent speaker recognition task (Fig. 1.7), the content information, accent from bilingual speakers and affective emotions are variabilities, while content, language and speaker information can also become noise in the speaker independent emotion recognition task (Fig. 1.8). On the other hand, there are also variabilities from many kinds of mismatch or noise in the data collection, such as background noise in speech, lighting changes in video, motion artifacts in ECG signals, etc. Which supervector in the model space fits a given task better? How can we make it more discriminative? Will new raw features or representation supervectors provide complementary information to the baseline and contribute in the fusion? These are still open questions.

Figure 1.6: The same signals or features could be adopted for different tasks

Figure 1.7: Information vs variability from the same set of MFCC features for context independent speaker recognition

Figure 1.8: Information vs variability from the same set of MFCC features for speaker independent emotion recognition

Therefore, our focus is to extract the useful information relevant to each particular task and to compensate for the other variabilities to make the system robust. Moreover, in order to reduce the computational costs in both training and testing for industry applications, efficiency is also very important. Furthermore, we can also use multimodal signals to jointly recognize a single human state, as demonstrated in Fig. 1.9. Chapters 6 and 7 use two application examples to show multimodal information fusion at different levels. From Fig. 1.10, we can see that fusion can be performed at different levels as long as the different subsystems are diverse.

Figure 1.9: Multimodal signals could also be jointly used to recognize a single human state

Figure 1.10: Fusing diverse subsystems

My thesis is on the topic of robust and efficient multimodal human states recognition. The connection to general machine learning problems is that we perform context independent verification or identification on a set of short time frame level features with variability modeling.
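To make the supervector idea above concrete, the following minimal sketch (my own illustration, not code from this thesis) trains a small UBM-style GMM on pooled background frames, MAP-adapts its means to one recording, and stacks the adapted means into a supervector. The GMM size, feature dimension and relevance factor are arbitrary toy assumptions.

```python
# Minimal sketch: frame-level features -> UBM -> MAP-adapted means -> supervector.
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapted_supervector(ubm, frames, r=16.0):
    """Relevance-MAP adaptation of the UBM means, stacked into one supervector."""
    post = ubm.predict_proba(frames)                 # frame-level posteriors, (T, C)
    n = post.sum(axis=0)                             # zeroth order statistics, (C,)
    f = post.T @ frames                              # first order statistics, (C, D)
    alpha = (n / (n + r))[:, None]                   # per-component adaptation weights
    adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm.means_
    return adapted.reshape(-1)                       # supervector of length C * D

rng = np.random.default_rng(0)
background = rng.normal(size=(5000, 13))             # stand-in for pooled MFCC frames
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(background)
utterance = rng.normal(size=(300, 13))                # one recording's MFCC frames
print(map_adapted_supervector(ubm, utterance).shape)  # (8 * 13,) = (104,)
```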
Human states recognition is a very active and competitive research area. NIST has organized the Language Recognition Evaluation (LRE) and the Speaker Recognition Evaluation (SRE) from 1996 to the present, while the recent five years of Interspeech challenges cover the emotion, paralinguistic, intoxication, personality and pathology human states topics. There are also well defined standard databases for evaluation and fair comparison. Therefore, in this work, we build on top of the state-of-the-art systems to further enhance the robustness and efficiency. We demonstrate the baseline systems and the proposed methods in Fig. 1.11. The following sections in this chapter introduce the background, literature review and motivation of the proposed methods in this work.

Figure 1.11: The baseline systems and the proposed methods

1.2 Simplified supervised i-vector modeling for representation in LID and SRE tasks

The goal of language identification (LID) is to determine the language spoken in a given segment of speech. In real world security or military intelligence applications, the speech signals could come from extremely noisy and distorted communication channels, such as short wave AM broadcasting. Thus robust LID on real degraded data becomes a challenging task.

Approaches using phonotactic information, namely PRLM (phoneme recognizer followed by language models) and PPRLM (parallel PRLM), have been shown to be quite successful in the past decade [162]. In PPRLM, a set of tokenizers is used to transcribe the input speech into phoneme strings or lattices, which are later scored by n-gram language models [51] or mapped into a bag-of-trigrams feature vector for support vector machine (SVM) modeling [93]. Lately, due to the introduction of shifted-delta-cepstral (SDC) acoustic features [150], promising results using the Gaussian Mixture Model (GMM) framework with latent factor analysis [29] and supervector modeling [26, 104] have been reported. We focus on acoustic level systems in this work.

Meanwhile, for acoustic level speaker verification, the use of joint factor analysis (JFA) [70, 71, 72] has contributed to state of the art performance in text independent speaker verification and hence is widely used. It is a powerful technique for compensating the variability caused by different channels and sessions.

Recently, total variability i-vector modeling has gained significant attention in both the language identification and speaker verification domains due to its excellent performance, low complexity and small model size [38, 40]. In this modeling, first, a single factor analysis is used as a front end to generate a low dimensional total variability space which models language, speaker and channel variabilities all together [40]. Then, within this i-vector space, variability compensation methods, such as Within-Class Covariance Normalization (WCCN) [58], Linear Discriminant Analysis (LDA) and Nuisance Attribute Projection (NAP) [26], are performed to reduce the variability for the subsequent modeling (SVM for LID and probabilistic linear discriminant analysis (PLDA) for SRE, respectively). However, the i-vector training and extraction algorithms are computationally very expensive, especially for a large GMM and a large training data set [7, 56]. Both [56] and [7] used a pre-calculated UBM weighting vector to approximate each utterance's zeroth order GMM statistics vector to avoid the computationally expensive GMM component-wise matrix operations, and only for the SRE tasks.
This approximation resulted in a 10-25 times speed up at the expense of a significant performance degradation (about 17% EER) [7]. By enforcing this approximation in both the training and extraction stages, the performance degradation can be reduced to a minimum [56], provided there is little or no mismatch between the train/test data and the UBM data.

Therefore, we investigated an alternative robust and efficient solution for both the LID and SRE tasks in this work. We performed the factor analysis (FA) on the pre-normalized GMM first order statistics supervector to ensure each Gaussian component's statistics sub-vector is treated equally in the FA, which reduces the computational cost by a factor of 25. In this way, each utterance is represented as one single pre-normalized supervector as the feature vector, plus one total frame number to control its importance against the prior. Each component's statistics sub-vector is normalized by the square root of its own occupancy probability, so it avoids the mismatch between a globally pre-calculated average weighting vector ([56] adopted the UBM weights) and each utterance's own occupancy probability distribution vector. Furthermore, since there is only a global total frame number inside the matrix inversion, we made a global table of the resulting matrices indexed by its log value. The reason to choose the log domain is that the smaller the total frame number, the more important it is against the prior. By looking up the table, each utterance's i-vector extraction is further sped up by another 4 times with a small table index quantization error. The larger the table, the smaller this quantization error.

Moreover, as an unsupervised method, i-vectors cover language, speaker and channel variabilities all together, which necessarily requires variability compensation methods (both LDA and WCCN are linear) as the back end. This motivates us to investigate a joint optimization that minimizes the weighted summation of both the reconstruction error and the linear classification error simultaneously. Compared to the aforementioned sequential optimization for traditional i-vectors, this joint optimization can select only the top eigen directions related to the given labels, which reduces the non-relevant information in the i-vector space, such as noise and variabilities. In this work, the traditional i-vectors are extended to label regularized supervised i-vectors by concatenating the label vector and the linear classifier matrix at the end of the mean supervector and the i-vector factor loading matrix, respectively. The contribution weight of each feature dimension and each target class to the objective function is automatically calculated by the iterative EM training. The traditional i-vector SVM/PLDA system serves as our baseline, and our motivation is to propose alternative robust, effective and efficient representation methods that improve the performance while reducing the computational cost at the same time.

For the robust LID task on noisy data, we also applied the Gammatone frequency cepstral coefficients (GFCC) [142] and Gabor filter bank features [79, 160] as a kind of auditory feature and spectral-temporal feature, respectively. When combining them with the traditional MFCC based sub-systems, the overall system performance was enhanced.
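As an illustration of the table-lookup speed-up described in this section, the sketch below precomputes the per-frame-count matrix inverse on a grid of quantized log frame counts, so that extraction reduces to a table lookup plus matrix-vector products. This is my own toy example, not the thesis implementation; the matrix sizes, the 300-entry grid and the exact form of the factor posterior used here are assumptions (the precise statistics normalization is defined in Chapter 2).

```python
# Minimal sketch: precompute (I + n * T' Sigma^-1 T)^-1 on a log(n) grid,
# then extract factors by table lookup instead of a per-utterance inversion.
import numpy as np

rng = np.random.default_rng(0)
CD, K = 400, 50                                        # supervector dim (C*D) and factor rank, toy sizes
T = rng.normal(scale=0.1, size=(CD, K))                # factor loading matrix
sigma_inv = 1.0 / rng.uniform(0.5, 2.0, size=CD)       # diagonal covariance inverse
TtSi = T.T * sigma_inv                                 # T' Sigma^-1, reused everywhere
TtSiT = TtSi @ T

# Global table of inverses indexed by quantized log total frame count.
log_grid = np.linspace(np.log(10.0), np.log(100000.0), 300)
inv_table = [np.linalg.inv(np.eye(K) + np.exp(g) * TtSiT) for g in log_grid]

def extract_factor(stats, n_frames):
    """Look up the nearest precomputed inverse; quantization error shrinks as the table grows."""
    idx = np.argmin(np.abs(log_grid - np.log(n_frames)))
    return inv_table[idx] @ (TtSi @ stats)

stats = rng.normal(size=CD)    # stand-in for a pre-normalized first order statistics supervector
print(extract_factor(stats, n_frames=2500)[:5])
```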
1.3 Sparse representation for classification and representation

Inspired by the recent success of sparse representation in face recognition, I explored the possibility of adopting sparse representation for both representation and classification in this multimodal human states recognition task.

1.3.1 Sparse representation for classification in talking face video verification, SRE, and age/gender recognition tasks

Face recognition using video sequences has recently gained significant attention [161]. With built-in cameras and microphones becoming a standard feature on most personal computing and mobile devices, audio-visual biometrics has become a natural way for user verification and personal secure access. Specifically, face verification based on a video sequence of a talking face, rather than just one or a few still images, offers the possibility of increased robustness. It has been previously demonstrated that systems based on block wise local features and Gaussian mixture models (GMM) are suitable for video based talking face verification, as they offer the best trade-off in terms of complexity, robustness and performance [28, 127]. In this work, we follow this framework and focus on further enhancing the robustness and performance of the GMM modeling, notably by exploring sparse representations of the talking face.

Figure 1.12: Face detection, eyes detection and face normalization

Figure 1.13: Block wise DCTmod2xy features [128]

The verification task based on talking face images acquired in an uncontrolled environment, such as with a mobile device, is very challenging. A large variability in the facial appearance of the same subject is caused by variations in recording devices, illumination, and facial expression. These variations are further increased by errors in face localization, alignment and normalization. While facial dynamic information (continuity in head/camera movement, facial expression or photometric continuity) has been studied for robust face video recognition [88], algorithms based on the similarity of unordered image sets have also been proposed [17]. Furthermore, in the GMM framework based on block wise local features, selection of good image frames using quality measurements [127] and score normalization (ZT-norm) [129] have been proposed to compensate for the session variability. However, the GMM modeling is still based on the UBM training and MAP adaptation framework. Recently, joint factor analysis (JFA) [71] has been successfully used in the speaker verification task, in which session variability caused by different channels influences the system performance dramatically [71]. In this work, given data from multiple sessions, we divide each face video sequence into several continuous short segments and adopt the JFA approach to reduce the intra-personal variations within all these segments.

A key concept in the JFA approach is to use a GMM supervector consisting of the stacked means of all the mixture components [26, 71]. Support vector machines (SVM) based on this GMM mean supervector form the GMM-SVM supervector system, which has been successfully applied to the speaker verification task [26]. More recently, a sparse representation computed by l1-minimization with equality constraints was proposed to replace the SVM in the GMM mean supervector modeling and has been demonstrated to be effective in the closed set speaker identification task on the clean TIMIT database [117].
However, the sparse representation of GMM mean supervectors has not been explored or exploited in detail to handle the robust face video verification task against large session variabilities.

In this work, we exploit the discriminative nature of sparse representations to perform face verification based on GMM supervectors. Given a verification trial with the test supervector and the target identity, we first construct an over-complete dictionary using all the target supervector samples and non-target background supervector samples, and then calculate the sparsest linear representation via l1 norm minimization. The membership of the sparse representation in the over-complete dictionary itself captures the discriminative information, given sufficient training samples [157]. If the trial is true, the test sample should have a sparse representation whose nonzero entries concentrate mostly on the target samples, whereas the test sample from a false trial should have sparse coefficients spread widely among multiple subjects [157]. For most verification tasks, the number of non-target background subjects/samples is naturally much larger than the number of target subjects/samples, so the chance nonzero entries on the target training samples for a test sample from a false trial should be arbitrarily small and close to zero. Therefore, for the calculated sparse representation, the l1 norm ratio between the target samples and all the samples in the over-complete dictionary becomes the verification decision criterion. Given the overwhelmingly unbalanced non-target negative training samples and the very limited target positive training samples, in contrast to the SVM system, which requires tuning the SVM cost values each time, the proposed framework utilizes the highly unbalanced nature of the training samples to form a sparse representation problem.

Furthermore, we propose three methods to enhance the robustness and performance against variabilities. First, the sparse representation is computed by l1-minimization with quadratic constraints rather than equality constraints. Second, by adding a redundant identity matrix at the end of the original over-complete dictionary, the sparse representation is made more robust to variability and noise [157]. Third, the difference between the UBM and the MAP adapted model is mapped into the GMM mean shifted supervector, which not only preserves the distance of the associated GMM but also makes the supervector sparse. Compared to the conventional mean supervector, the correlation of the constructed over-complete dictionary becomes smaller, which helps to achieve a robust sparse representation.

The aforementioned SRC approach can also be applied to the speaker verification task and the age/gender recognition task. However, sparse representation on large dimension supervectors not only requires a large training data set (for a large GMM size, the number of samples must be greater than the supervector dimension [157]) but also consumes a large amount of memory space due to the over-complete dictionary, which can limit the number of training samples and slow down the recognition process. Thus, in this work, we adopt SRC on the low dimensional i-vectors [105] and UBM weight posterior probability (UWPP) supervectors [95] for the SRE and age/gender recognition tasks, respectively.
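The sketch below illustrates the l1-norm-ratio decision rule described above on synthetic data. It is my own toy example rather than the thesis implementation: scikit-learn's Lasso (l1-penalized least squares) stands in for the l1-minimization with quadratic constraints, and the appended identity block plays the role of the redundant identity matrix that absorbs noise.

```python
# Minimal sketch of SRC-style verification scoring with an l1-norm ratio.
import numpy as np
from sklearn.linear_model import Lasso

def src_verification_score(test_vec, target_vecs, background_vecs, alpha=0.01):
    """l1 mass on the target columns divided by l1 mass on all class columns."""
    A = np.hstack([target_vecs, background_vecs, np.eye(len(test_vec))])   # over-complete dictionary
    A = A / np.maximum(np.linalg.norm(A, axis=0), 1e-12)                   # unit-norm columns
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000).fit(A, test_vec)
    x = coder.coef_
    n_target, n_identity = target_vecs.shape[1], len(test_vec)
    class_mass = np.abs(x[:-n_identity]).sum()                             # identity block excluded
    return np.abs(x[:n_target]).sum() / max(class_mass, 1e-12)

rng = np.random.default_rng(0)
dim = 60
target = rng.normal(size=(dim, 5))          # enrollment supervectors of the claimed identity
background = rng.normal(size=(dim, 200))    # non-target background supervectors
true_trial = target @ np.array([0.5, 0.5, 0, 0, 0]) + 0.05 * rng.normal(size=dim)
false_trial = background[:, 0] + 0.05 * rng.normal(size=dim)
print(src_verification_score(true_trial, target, background))    # high for a true trial
print(src_verification_score(false_trial, target, background))   # low for a false trial
```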
In addition, we propose three methods to further enhance the robustness and the performance of our SRE task. First, the background normalized (BNorm) l2 residual is proposed as a score measuring criterion. Second, by directly using the Tnorm i-vectors as the non-target background samples in the over-complete dictionary, Tnorm score normalization is efficiently achieved with only one sparse representation computation, using the BNorm l2 residual as the scoring measure. Finally, the results of these i-vector modeling systems are fused to further improve the overall verification performance.

In the age and gender recognition task, we also adopted the sparse representation framework as a classification method on the GMM UWPP supervectors [95]. Since age/gender recognition is just a 7-class closed set recognition task, we used the same SRC setup as in the face recognition task and then fused with SVM at the score level to further improve the overall performance.

1.3.2 Sparse representation for representation in LID and SRE tasks

In the aforementioned approaches [95, 101, 105], the sparse representation framework was used just as a classification approach on various GMM supervectors. Since the sparse representation solution needs to be calculated for every trial for verification purposes, it is computationally expensive for high dimensional supervectors and sometimes intractable for score normalization (ZT-norm). Therefore it is more efficient to utilize SRC to model the low dimensional supervectors, such as i-vectors and JFA speaker factors, rather than the mean supervectors [95, 105]. However, factor analysis based Eigenvoice modeling and sparse representation are generally similar in terms of projecting the supervector onto a dictionary. The dictionary of Eigenvoice modeling is a low dimensional subspace, which makes the factor vector low dimensional, while the dictionary of SRC is of large rank, which results in a sparse coefficient vector.

Figure 1.14: Sparse coding vs factor analysis

Thus, this analogy motivated us to explore sparse representation as a front end representation framework, similar to the factor analysis based Eigenvoice modeling in the i-vector approach [97]. In this case, the benefits are as follows. First, computing the sparse representation solution is required only once for each testing utterance, which makes score normalization efficient. Second, there is no need to use an over-complete dictionary, since it is adopted as a front end representation framework rather than a classification approach. Therefore, it can be performed on the high dimensional GMM mean supervectors. In [97], we proposed a Lasso based l1 norm regularized weighted least squares estimation to map the centered GMM 1st order statistics vector into a sparse factor vector, which is denoted as the sparse total variability supervector (s-vector). The dictionary adopted in [97] was the first 3000 PCA eigenvectors; it was later extended by [11, 12] to use a K-SVD (singular value decomposition) dictionary learning scheme and an l0 sparsity constraint.
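A minimal sketch of this s-vector idea is given below. It is only an illustration under assumed toy sizes: the dictionary is a PCA basis of training supervectors, and scikit-learn's Lasso substitutes for the weighted least squares solver of [97].

```python
# Minimal sketch: l1-regularized coding of a centered supervector on a large-rank dictionary.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
train_svs = rng.normal(size=(500, 300))       # training statistics supervectors (toy stand-ins)
mean_sv = train_svs.mean(axis=0)

# Dictionary: leading PCA eigenvectors of the centered training supervectors.
dictionary = PCA(n_components=200).fit(train_svs - mean_sv).components_.T   # (300, 200)

def extract_svector(stats_sv, alpha=0.05):
    """Sparse coding of one centered supervector; most coefficients end up exactly zero."""
    coder = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    coder.fit(dictionary, stats_sv - mean_sv)
    return coder.coef_                         # s-vector of length 200

s_vec = extract_svector(rng.normal(size=300))
print(np.count_nonzero(s_vec), "of", s_vec.size, "coefficients are non-zero")
```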
It is in this context that we further investigated the usage of sparse representation as a sparse coding representation method for the LID task. We first train a small rank dictionary for each target language and then merge them into a joint dictionary; the representation of each mean supervector on this dictionary is naturally sparse, which makes the corresponding sparse representation valid and discriminative. The KSVD algorithm was employed for dictionary learning. Similar algorithms have also recently been reported to be effective and competitive in the face recognition field [67, 158]. Given the similarity of the two recognition tasks, the success in the face recognition field motivated us to explore sparse representation in the LID task and fuse it with other systems to further improve the performance. We denote the sparse representation coefficient vector as the s-vector in the following sections.

Figure 1.15: KSVD based dictionary learning (split-merge strategy)

1.4 Latent factor analysis based Eigenchannel factor vector modeling in affective states recognition tasks

Automatic recognition of paralinguistic information (e.g., gender, age, emotional state) can guide human computer interaction systems to automatically adapt to different user needs. Identifying the speaker state given a short speech utterance is a challenging task and has gained significant attention recently in the speaker emotion challenge [137], the paralinguistic challenge [139], and the speaker states challenge [140].

It has been shown in [94, 96, 115] that speaker state information can be modeled at various levels, such as phonetic, acoustic, and prosodic. Due to the different aspects of modeling, combining different classification methods can significantly improve the overall performance [96, 115]. Acoustic level modeling approaches, such as Hidden Markov Models (HMM) or Gaussian Mixture Models (GMM) operating on mel-frequency cepstral coefficient (MFCC) features, play the most fundamental and important role among those various subsystems [80, 96] given short utterances. In this work, we follow the GMM-MFCC framework and focus on further enhancing the performance of the intoxicated and affective speaker state recognition tasks.

Figure 1.16: The system overview

Recently, latent factor analysis (LFA) [23, 29] has been successfully and widely used for the speaker verification task, in which session variability caused by different channels influences the system performance dramatically. However, systems that directly use LFA to remove the speaker variability factors in the speaker state recognition task have been shown to perform worse than a GMM baseline [80, 94]. This might indicate that the speaker variability is larger than the speaker state variability. To address this issue, we treat a paralinguistic speech signal as the normal average speech signal corrupted by channel effects and consider the speaker states as the "channels". Thus we employ the Eigenchannel factors as a new kind of speaker state supervector and adopt SVM to model these factor vectors for the discriminative speaker state classification task.

The GMM LFA approach can be considered as a type of feature extraction front end which summarizes the affective speaker state information into low dimensional Eigenchannel factor vectors. Compared to the commonly used large dimensional feature vectors computed using statistical functionals [139], the proposed factor vector reduces the feature dimensionality dramatically; therefore, it is more efficient for adaptive or online model training applications.
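The sketch below illustrates the pipeline of Fig. 1.16 on synthetic data: UBM training, per-utterance Baum-Welch statistics, Eigenchannel factor extraction, and a linear SVM on the factors. It is my own toy example; in particular the Eigenchannel matrix U is randomly initialized here rather than estimated by EM as described in Chapter 4, and all sizes are arbitrary assumptions.

```python
# Minimal sketch: UBM statistics -> Eigenchannel factor vector -> SVM classifier.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

C, D, R = 8, 13, 10                                 # GMM size, feature dim, Eigenchannel rank (toy)
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=C, covariance_type="diag", random_state=0)
ubm.fit(rng.normal(size=(4000, D)))                 # stand-in background data
U = rng.normal(scale=0.01, size=(C * D, R))         # Eigenchannel matrix (EM-trained in practice)
sigma_inv = (1.0 / ubm.covariances_).reshape(-1)    # diagonal Sigma^-1, length C*D

def eigenchannel_factor(frames):
    """Posterior mean of the channel factor for one utterance."""
    post = ubm.predict_proba(frames)                          # (T, C)
    n = post.sum(axis=0)                                      # zeroth order statistics
    f = post.T @ frames - n[:, None] * ubm.means_             # centered first order statistics
    n_rep = np.repeat(n, D)                                   # expand counts to C*D
    prec = np.eye(R) + (U * (n_rep * sigma_inv)[:, None]).T @ U
    return np.linalg.solve(prec, U.T @ (sigma_inv * f.reshape(-1)))

# Toy "speaker state" classification: factors from two synthetic classes fed to a linear SVM.
X = np.array([eigenchannel_factor(rng.normal(loc=0.2 * (i % 2), size=(200, D))) for i in range(40)])
y = np.arange(40) % 2
print(LinearSVC(dual=False).fit(X, y).score(X, y))
```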
1.5 A generalized optimization framework

Meanwhile, inspired by [49], we extend the EM training of the supervised i-vector modeling to a more generalized convex optimization problem with optional sparsity constraints. The aforementioned modeling schemes (factor analysis, i-vector, simplified i-vector, supervised i-vector, s-vector, etc.) are all special cases of this general optimization problem. Therefore, other kinds of distance measures, cost functions and constraints, as well as optimization approaches, can be applied, which is our focus for future work.
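To make the unification concrete, one schematic way to write such an objective is shown below. This is only an assumed illustration consistent with the description above; the exact distance, weighting and constraint choices are defined in Chapter 5.

```latex
\hat{\mathbf{w}} \;=\; \arg\min_{\mathbf{w}}\;
  d\!\left(\tilde{\mathbf{F}},\, \mathbf{m} + \mathbf{T}\mathbf{w}\right)
  \;+\; \lambda_{1}\,\bigl\|\mathbf{L} - \mathbf{W}\mathbf{w}\bigr\|_{2}^{2}
  \;+\; \lambda_{2}\,\Omega(\mathbf{w})
```

Here d(.,.) is a reconstruction distance between the (pre-normalized) first order statistics supervector and its subspace reconstruction m + Tw; the middle term is the label regression error used by the supervised i-vector (label vector L, linear classifier W), which disappears when lambda_1 = 0; and Omega(w) is an optional penalty, where an l2 prior corresponds to i-vector style factor analysis and an l1 norm yields the sparse s-vector.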
1.6 Speaker verification based on fusion of acoustic and articulatory information

The goal in a speaker verification (SRE) task is to determine whether a given segment of speech is spoken by the claimed target speaker.

At the acoustic level, joint factor analysis (JFA) [70, 71, 72] has contributed to the state-of-the-art performance in text independent SRE. It is a powerful and widely used technique for compensating the acoustic variabilities caused by different channels and sessions. Recently, total variability i-vector modeling has gained significant attention in SRE due to its excellent performance, low complexity and small model size [38]. In this approach, a single factor analysis is used as a front end to generate a low dimensional total variability space (i.e. the i-vector space) which jointly models speaker and channel variabilities [38]. The factor analysis can also be extended to a simplified supervised version to enhance the performance and reduce the computational cost [108]. Within the i-vector space, variability compensation methods, such as Within-Class Covariance Normalization (WCCN) [58] and Linear Discriminant Analysis (LDA), are performed to reduce the variability for the subsequent probabilistic LDA (PLDA) modeling [111, 130]. Sparse representation can also be applied in the SRE task [97, 101, 105].

In addition to the aforementioned state-of-the-art modeling methods, various kinds of features have also been proposed for text independent speaker recognition (e.g. short-term spectral features, voice source features, spectral-temporal features, prosodic features and high-level features) [78]. Both feature level and score level fusion based on multiple features have been shown to enhance the overall SRE system performance [78]. In this work, our goal is to examine the utility of speech production-oriented features for the SRE task.

Several studies have shown that an important source of inter-speaker variability in speech acoustics lies in the variability of vocal tract morphology across speakers; morphological variability can result from differences in vocal tract length [46, 91, 125, 143] and in the morphology of the hard palate and the posterior pharyngeal wall [84, 85, 86]. Since vocal tract length is closely related to formant frequencies [46, 143], a change in vocal tract length scales the frequency of the speech spectra for voiced sounds. This has been extensively used for vocal tract length normalization (VTLN) [42, 89] in automatic speech recognition (ASR). Unlike normalization, we focus here on exploiting morphological variations as a cue for speaker characteristics in SRE applications. Speakers with flat palates exhibit less articulatory variability during vowel production than speakers with highly domed palates [20, 21, 116, 124]. Articulation of coronal fricatives is also influenced by palate shape, including apical vs. laminal articulation of sibilants [37], as well as jaw height and the positioning of the tongue body [60, 149]. Therefore, due to different vocal tract morphology characteristics, different speakers articulate even the same words differently. We also show in [107] that the fusion of speech and articulation features enhances discrimination between different morphological structures. This motivates us to examine the utility of morphological characteristics in the SRE task by using articulatory features in addition to acoustics.

We find that concatenating articulation features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves the overall SRE performance. However, since measuring articulatory movement during speech production is impractical for real world SRE applications, an SRE experiment is also performed in which the measured articulatory features are replaced with estimated articulatory features obtained using acoustic-to-articulatory inversion. Specifically, we show that augmenting MFCCs with features obtained from subject-independent acoustic-to-articulatory inversion techniques significantly enhances the SRE performance. In other words, we show that the estimated articulation obtained by articulatory inversion carries useful information about inter-speaker variation, especially at the mean and variance level, which leads to better performance. To the best of our knowledge, there has not been a prior study on this topic.

Although the estimated articulatory features are also generated from the speech signal, we can show that adding this new information (the articulation-acoustics mapping) on top of MFCCs can still enhance the performance. Theoretical support from the machine learning field is provided in [123, 154], which shows that recognition of a target label can be improved with additional knowledge of related labels. Practically, this concatenation based speech-articulatory feature level fusion has been reported to increase ASR performance [77, 151] significantly.
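The feature-level fusion itself is simple frame-wise concatenation, as sketched below on synthetic data. The inversion step is only stubbed out, and the nine articulatory dimensions are assumed to follow the tract variables listed for Chapter 6 (LA, PRO, JAW OPEN, TTCD, TBCD, TDCD, TTCL, TBCL, TDCL); the GMM back end and all sizes are illustrative.

```python
# Minimal sketch of acoustic-articulatory feature-level fusion for speaker modeling.
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_articulation(mfcc_frames):
    """Placeholder for subject-independent acoustic-to-articulatory inversion.
    A real system would regress nine tract-variable trajectories per frame."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(mfcc_frames.shape[0], 9))

def fuse_frames(mfcc_frames, articulation_frames):
    """Concatenate the acoustic and articulatory streams frame by frame."""
    assert mfcc_frames.shape[0] == articulation_frames.shape[0]
    return np.hstack([mfcc_frames, articulation_frames])

rng = np.random.default_rng(1)
mfcc = rng.normal(size=(400, 39))                    # 13 MFCCs + deltas + delta-deltas (toy)
fused = fuse_frames(mfcc, estimate_articulation(mfcc))
print(fused.shape)                                    # (400, 48)
speaker_model = GaussianMixture(n_components=16, covariance_type="diag").fit(fused)
print(round(speaker_model.score(fused), 2))           # average log-likelihood per frame
```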
The KNOWME network utilizes heterogeneous sensors simultaneously, which send their measurements to a Nokia N95 cell phone via Bluetooth, as shown in Fig. 1.17. Flexible sensor measurement choices can include ECG signals, accelerometer signals, heart rate, and blood oxygen levels as well as other vital signs. Furthermore, external sensor data are combined with data from the mobile phone’s built-in sensors (GPS and accelerometer signal). Thus, the mobile phone can display and transmit the combined health record to a back-end server (e.g. Google Health Server [2]) in real time. In this study, we use ECG and accelerometer signals in the KNOWME network to detectPAcategories. Thissensorchoiceiscommonandfrequentlyusedinmanystudies for multimodal PA recognition [33, 43, 120, 121]. ECG is a physiological signal which accompanies physical measurements and therefore has great potential to increase the accuracy of PA recognition. There already exist several commercial ECG monitors with built-in accelerometers [1]; thus, users only need to wear one single multimodal sensor of this type and can feel more comfortable while carrying out their daily lives. Finally, the ECG is a very important diagnostic tool and is widely used in a great majority of mobile health systems. A study of the relationship between PAs and the ECG signal can be useful in health monitoring applications. Nokia N95 Cell Phone Fusion Center ECG sensor Server Oximeter sensor ... Figure 1.17: KNOWME wearable body area network system 24 The ECG sensor measures the change in electrical potential over time. A single nor- mal cycle of the ECG represents the successive atrial depolarization/repolarization and ventricular depolarization/repolarization. The advantage of the wearable ECG devices is that they can be used both in a hospital setting and under free living conditions. The practical challenge is that the ECG signal is often contaminated by noise and artifacts within the frequency band of interest, which can manifest with similar morphologies as the ECG itself [33]. Instant heart rate extracted from the ECG signal has been studied in distinguishing PAs in conjunction with accelerometer data [5, 43, 120, 146], and results showed that only modest gains were achieved [146]. Recently, it has been shown in [121, 122] that the motion artifacts in a single-lead wearable ECG signal in- duced by body movement of an ambulatory patient can be detected and reduced by a principal component analysis (PCA) based classification approach. Thus, in addition to heart rate details, ECG signals contain additional discriminative information about PA. In the proposed work, we extend the development in [122] by using Hermite poly- nomial expansion (HPE) and PCA to describe the cardiac activity mean (CAM) and motion artifact noise (MAN), respectively. Furthermore, instant heart rate variability (mean/variance) and heartbeat shape variability (noise measure within a window) are combined with HPE and PCA coefficients to generate a set of ECG temporal features and used for PA classification. In contrast to the ECG signal, the accelerometer signal has been studied extensively for PA recognition. There exists a wide range of features and algorithms for supervised classification of PAs with accelerometer derived features. 
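To make the CAM modeling above concrete, the following is a minimal Python/NumPy sketch of fitting Hermite-function coefficients to a segmented, amplitude-normalized heartbeat by least squares; the expansion order, width parameter and function names are illustrative assumptions rather than the exact configuration used in this work. (The accelerometer-based classifiers are discussed next.)

    import numpy as np

    def hermite_expansion_coeffs(beat, order=10, sigma=1.0):
        # beat: one segmented heartbeat centered on the R peak, amplitude-normalized
        # order: number of Hermite basis functions; sigma: width of the Gaussian envelope
        n = len(beat)
        t = (np.arange(n) - n // 2) / (sigma * n / 2.0)         # centered, scaled time axis
        H = np.polynomial.hermite.hermvander(t, order - 1)      # H_0..H_{order-1} evaluated on t
        basis = H * np.exp(-t[:, None] ** 2 / 2.0)              # Hermite functions
        coeffs, *_ = np.linalg.lstsq(basis, beat, rcond=None)   # least-squares projection
        return coeffs                                           # CAM descriptors for this beat

    # toy usage with a synthetic Gaussian-shaped beat
    beat = np.exp(-np.linspace(-3.0, 3.0, 200) ** 2)
    cam_features = hermite_expansion_coeffs(beat, order=10)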
Commonly used methods in the context of activity recognition include Naive Bayes classifiers [10, 65, 112, 132, 146], C4.5 decision trees [10, 43, 65, 112, 120, 132, 146], nearest neighbor methods [10, 65, 112], boosting [92, 132], support vector machines (SVMs) [63, 132], and Hidden markov models (HMM)[59, 81]. A comparison of these methods is reported in [57, 62, 65, 132]. 25 Figure 1.18: The proposed physical activity recognition system overview Moreover, a variety of features in both time and frequency domains have been adopted [10, 57, 62, 65, 112]. In general, the SVM classifier based on temporal feature statistics was found to be one of the best performing systems [63, 132]. In this work, a set of conventional temporal features is extracted from accelerometer signals and used for PA classification. The temporal features from ECG and accelerometers are modeled using a support vector machine (SVM). The generalized linear discriminative sequence (GLDS) kernel [26] was employed due to its good classification performance and low computational complexity. The GLDS kernel uses a discriminative classification metric that is simply an inner product between the averaged feature vector andmodel vectorand thus is very computationally efficient with small model size, making it attractive for mobile device implementations. More recently, promising results in biometrics [126] have shown that cepstral fea- tures of stethoscope-collected heart sound signals can be used to identify different per- sons. This inspired us to explore the potential of cepstral domain ECG features for PA detection. Compared to time-domain fiducial points or the PCA approach, cepstral feature calculation uses short fixed length processing windows and thus does not need thepre-processingstepsofheartbeatsegmentationandnormalization. Furthermore, for accelerometer signals, the evaluation in [62] shows that Fast Fourier Transform (FFT) 26 features always rank among the features with the highest precision, but the FFT coef- ficients that attain the highest precision are different for each activity type. Therefore, combining different FFT coefficients within filter bands might provide a good compro- mise versus using individual spectral coefficients. Thus, in the proposed work, linear filter bank based cepstral features extracted from both accelerometer and ECG signals are used to measure the cepstral characteristics of different PAs. The cepstral fea- tures corresponding to different PA types are modeled using Gaussian Mixture Models (GMMs). We combine both temporal and cepstral information at the score level to im- prove the system performance. We hypothesize that cepstral features can capture the spectral envelope variations in both ECG and accelerometer signals and thus can com- plement conventional time domain features. Also, as described in Section 7.1, cepstral features provide a natural way for handling convolutional noise inherent in the sensor measurements. Moreover, fusing system outputs from multiple modalities at the score level can also improve performance [134]. ECG and accelerometer cepstral features are not concatenated and fused at the feature level due to compatibility issues arising from differenttimeshiftandwindowlengthconfigurationsanddifferentsamplingfrequencies. However, the cepstral features from each axis of the accelerometer are concatenated to construct a long cepstral feature vector in each frame. 
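As an illustration of the GLDS scoring just described, here is a minimal sketch (assuming a second-order monomial expansion and omitting the background correlation normalization) showing why classification reduces to a single inner product between the averaged expanded feature vector and a model vector; the function names are illustrative.

    import numpy as np

    def glds_average_expansion(frames):
        # frames: (num_frames, dim) temporal feature matrix for one window/utterance
        n, d = frames.shape
        iu = np.triu_indices(d)
        # second-order monomials x_i * x_j (upper triangle only)
        quad = np.einsum('ni,nj->nij', frames, frames)[:, iu[0], iu[1]]
        expanded = np.hstack([frames, quad])
        return expanded.mean(axis=0)          # the GLDS "averaged feature vector"

    def glds_score(frames, model_vector):
        # scoring is just an inner product, hence the small model size and low complexity
        return float(glds_average_expansion(frames) @ model_vector)

    # A model_vector per activity can be obtained, e.g., from the weights of a linear
    # SVM trained on the averaged expansions of the training windows (one-vs-rest).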
Heteroscedastic linear discriminant analysis (HLDA) [83] is then used to reduce the dimension of this concatenated accelerometer cepstral feature vector. As a special form of (single state) HMM, a GMM is trained for each activity using a sequence of feature vectors, rather than individual instances, with a view toward better capturing the temporal dynamics. As shown in Fig. 7.1, once the classification scores of both the temporal feature based SVM systems and the cepstral feature based GMM systems are available, the four individual system outcomes are fused at the score level to generate the final recognized activity.
Just as individual variability can have a significant impact on the interpretation of both the accelerometer and ECG data [48, 64], session variability is another important issue in PA recognition. In real life applications, many other factors can influence or even modify the desired sensor signals, such as sensor placement location, user emotion, fitness, etc. Even within the same activity, an individual can perform various styles of PA which might not appear in the training set and thus degrade the system performance. In this study, the session variability of the ECG and accelerometer signals is studied under a subject-dependent modeling framework.
In summary, we address the PA recognition problem with multimodal wearable sensors (ECG and accelerometer) in this work. The contributions are as follows: (1) The cardiac activity mean (CAM) component of the ECG signal is described by Hermite polynomial expansion (HPE) in the temporal feature extraction. (2) In the SVM framework for both ECG and accelerometer temporal features, the GLDS kernel makes the classification computationally efficient with a small model size. (3) A GMM system based on cepstral features is proposed to capture the frequency domain information in a fashion robust against convolutional effects, and HLDA is used to reduce the feature dimension of the tri-axial accelerometer based measurements. (4) Score level fusion of the multi-modal and multi-domain subsystems is performed to improve the overall performance. (5) The effects of session variability of ECG and accelerometer measurements on PA recognition are studied.
1.8 The role of the proposed methods in each individual application
In the previous sections, I have introduced several major contributions of this work. However, since different applications have different baselines and characteristics, I demonstrate the usage of these methods on top of the baseline systems in Fig. 1.19, Fig. 1.20, Fig. 1.22, Fig. 1.23 and Fig. 1.24.
Figure 1.19: Language identification
Figure 1.20: Speaker verification
Figure 1.21: Speaker verification using articulation and acoustics information fusion
Figure 1.22: Face video verification
Figure 1.23: Intoxication and emotion recognition
Figure 1.24: Age and gender recognition
Chapter 2: Simplified supervised i-vector for representation
2.1 Methods
2.1.1 The i-vector baseline
In the total variability space, there is no distinction between the language effects, speaker effects and the channel effects. Rather than using the Eigenvoice matrix V and the eigenchannel matrix U [71], the total variability space contains the speaker and channel variabilities simultaneously [38]. Given a C component GMM UBM model \lambda with \lambda_c = \{p_c, \mu_c, \Sigma_c\}, c = 1, \cdots, C, and an utterance with an L frame feature sequence \{y_1, y_2, \cdots, y_L\}, the 0th order and centered 1st order Baum-Welch statistics on the UBM are calculated as follows:

N_c = \sum_{t=1}^{L} P(c|y_t, \lambda)   (2.1)

F_c = \sum_{t=1}^{L} P(c|y_t, \lambda)(y_t - \mu_c)   (2.2)

Figure 2.1: I-vector modeling and its objective function

where c = 1, \cdots, C is the GMM component index and P(c|y_t, \lambda) is the occupancy probability of y_t on \lambda_c. The corresponding centered mean supervector \tilde{F} is generated by concatenating all the \tilde{F}_c together:

\tilde{F}_c = \frac{\sum_{t=1}^{L} P(c|y_t, \lambda)(y_t - \mu_c)}{\sum_{t=1}^{L} P(c|y_t, \lambda)}   (2.3)

The centered GMM mean supervector \tilde{F} can be written as follows:

\tilde{F} = T x   (2.4)

where T is a rectangular total variability matrix of low rank and x is the so-called i-vector [38]. Considering a C-component GMM and D dimensional acoustic features, the total variability matrix T is a CD \times K matrix which can be estimated in the same way as the Eigenvoice matrix V in [69], except that here every utterance is considered to be produced by a new speaker or in a new language [38]. Given the centered mean supervector \tilde{F} and the total variability matrix T, the i-vector is computed as follows [38]:

x = (I + T^t \Sigma^{-1} N T)^{-1} T^t \Sigma^{-1} N \tilde{F}   (2.5)

where N is a diagonal matrix of dimension CD \times CD whose diagonal blocks are N_c I, c = 1, \cdots, C, and \Sigma is a diagonal covariance matrix of dimension CD \times CD estimated in the factor analysis training step. It models the residual variability not captured by the total variability matrix T [38].
In this total variability space, two channel compensation methods, namely Linear Discriminant Analysis (LDA) and Within Class Covariance Normalization (WCCN) [58], are applied to reduce the variabilities. LDA attempts to transform the axes to minimize the intra-class variance due to the variability effects and maximize the variance between classes, while WCCN uses the inverse of the within-class covariance to normalize the cosine kernel, minimizing an upper bound of the binary SVM classification error [58]. After the LDA and WCCN steps, the cosine distance is employed for i-vector modeling. The cosine kernel between two i-vectors x_1 and x_2 is defined as follows:

k(x_1, x_2) = \frac{\langle x_1, x_2 \rangle}{\|x_1\|_2 \|x_2\|_2}   (2.6)

Finally, cosine kernel based SVM and PLDA modeling are adopted as the classifiers for the LID and SRE tasks, respectively.
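A minimal NumPy sketch of the statistics in (2.1)-(2.2) and the i-vector extraction in (2.5), assuming a diagonal-covariance UBM and an already trained total variability matrix T and residual covariance Sigma; variable names are illustrative.

    import numpy as np

    def baum_welch_stats(Y, weights, means, variances):
        # Y: (L, D) frames; weights: (C,); means, variances: (C, D) diagonal UBM
        diff = Y[:, None, :] - means[None, :, :]                        # (L, C, D)
        logp = (-0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
                + np.log(weights))
        logp -= logp.max(axis=1, keepdims=True)
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)                         # P(c | y_t, lambda)
        N = post.sum(axis=0)                                            # eq (2.1), shape (C,)
        F = np.einsum('tc,tcd->cd', post, diff)                         # eq (2.2), shape (C, D)
        return N, F

    def extract_ivector(N, F, T, Sigma):
        # T: (C*D, K); Sigma: (C*D,) diagonal. Note that N * F_tilde in (2.5) equals the
        # centered first-order statistic F of (2.2), which is what is used below.
        C, D = F.shape
        N_big = np.repeat(N, D)                                         # diag(N), (C*D,)
        TtSi = T.T / Sigma                                              # T^t Sigma^{-1}
        L_mat = np.eye(T.shape[1]) + (TtSi * N_big) @ T                 # I + T^t Sigma^{-1} N T
        return np.linalg.solve(L_mat, TtSi @ F.ravel())                 # eq (2.5)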
For SRE, we assume that the training data consists of J utterances from I speakers and denote the j th i-vector of the i th speaker by x ij . We assume that the data are generated in the following way [130]: x ij =μ+Uh i +Gw ij + ij (2.7) wherethespeakertermμ+Uh i isonlydependentonthespeakerindexandthevariabil- ity termGw ij + ij is different for every i-vector and used to model the within-speaker variances. The model parameters are estimated by employing Expectation Maximiza- tion(EM)algorithmsonthetrainingdata. Givenapairofi-vectorsS(x i ,x j )fortesting, 34 Figure 2.2: Supervised i-vector modeling and its objective function the log likelihood ratio is computed based on a hypothesis testing P(S|H 1 )/P(S|H 0 ) where H 1 means it is a true trial and H 0 denotes a false trial [130]. Since the scoring is symmetric for the target and test i-vectors, symmetric normalization (Snorm) [19] is performed as the score normalization approach. The PLDA implementation is based on the University College of London (UCL) toolkit [130]. 2.1.2 Label regularized supervised i-vector The i-vector training and extraction can be re-interpreted as a classic factor analysis basedgenerativemodelingproblem. Forthej th utterance, thepriorandtheconditional distribution is defined as following multivariate Gaussian distributions: P(x j ) =N(0,I), P( ˜ F j |x j ) =N(Tx j ,N −1 j Σ) (2.8) therefore, the posterior distribution of i-vectorx given the observed ˜ F is: P(x j | ˜ F j ) =N((I +T t Σ −1 N j T) −1 T t Σ −1 N j ˜ F j ,(I +T t Σ −1 N j T) −1 ). (2.9) The mean of the posterior distribution (point estimate) is adopted as the i-vector. The traditional i-vectors are extended to the label regularized supervised i-vectors by concatenating the label vector and the linear classifier matrix at the end of the mean supervector and the i-vector factor loading matrix, respectively. This supervised 35 i-vectors are optimized to not only reconstruct the mean supervectors well but also minimizethemeansquareerrorbetweentheoriginalandthereconstructedlabelvectors, thuscanmakethesupervisedi-vectorsbecomemorediscriminativeintermsofthelabel information regularized. P(x j ) =N(0,I), P([ ˜ F j L j ]|x j ) =N([ Tx j Wx j ]|x j ),[ N −1 j Σ 1 n −1 j Σ 2 ]) (2.10) In (2.8,2.9,2.10),x j ,N j , ˜ F j andL j denote thej th utterance’s i-vector,N vector, mean supervector and label vector, respectively. Σ 1 and Σ 2 denote the variance for CD dimensional mean supervector and M dimensional label vector, respectively. n j = P C c=1 N cj where N cj denotes the N c for the j th utterance. The reason for using a global scalar n j is that each target class is treated equally in terms of frame length importance, the variance Σ 2 is adopted to capture the variance and accuracy for each particular class. We define two types of label vectors as follows: Supervised type 1: L ij = 1 if utterance j is from class i 0 otherwise (2.11) For type 1 label vectors, we want the regression matrixW to correctly classify the class labels. Suppose there are M speaker classes, L j is a M dimensional binary vector with only one non-zero element with the value of 1 and W is a M ×K linear classification matrix. Supervised type 2: L j =¯ x s j ,W =I. (2.12) Type 2 label vectors are the sample mean vector of all the supervised i-vectors from the same speaker index in the last iteration ¯ x s j (similar to the one in WCCN). The reason 36 is to reduce the within class covariance and help all the supervised i-vectors to move towards their class sample mean. 
Therefore, M =K in this case. The log likelihood of the total ˆ N utterances is: ˆ N X j=1 ln(P( ˜ F j ,L j ,x j )) = ˆ N X j=1 {ln(P([ ˜ F j L j ]|x j ))+ln(P(x j ))} (2.13) Combining (2.10) and (2.13) together and remove non-relevant items, we can get the objective function J for the Maximum Likelihood ML EM training: J = ˆ N X j=1 ( 1 2 x t j x j + 1 2 ( ˜ F j −Tx j ) t Σ1 −1 N j ( ˜ F j −Tx j )+ 1 2 (L j −Wx j ) t Σ2 −1 nj(L j −Wx j ) − 1 2 ln(|Σ1 −1 |)− 1 2 ln(|Σ2 −1 |)) (2.14) For the E-step, we estimated E(x j ) and E(x j x j t ): E(x j )=(I +T t Σ 1 −1 N j T +W t Σ 2 −1 njW) −1 (T t Σ 1 −1 N j ˜ F j +W t Σ 2 −1 njL j ), (2.15) E(x j x j t ) =E(x j )E(x j ) t +(I +T t Σ 1 −1 N j T +W t Σ 2 −1 n j W) −1 . (2.16) Then, for the M-step, we need to minimize the following expected objective function: E(J)= ˆ N X j=1 ( 1 2 Tr[E(x j x j t )]+ 1 2 ˜ F j Σ 1 −1 N j ˜ F j + 1 2 Tr[TE(x j x j t )T t Σ 1 −1 N j ]− ˜ F j t Σ 1 −1 N j TE(x j ) + 1 2 L j Σ 2 −1 n j L j + 1 2 Tr[WE(x j x j t )W t Σ 2 −1 n j ]−L j t Σ 2 −1 n j WE(x j )− 1 2 ln(|Σ 1 −1 |)− 1 2 ln(|Σ 2 −1 |)) (2.17) In deriving(2.17), we used Tr[Tx j x j t T t Σ 1 −1 N j ] = Tr[x j t T t Σ 1 −1 N j Tx j ]. By set- ting the derivatives of E(J) towards toT andW to be 0, we can get: ˆ N X j=1 N j TE(x j x j t ) = ˆ N X j=1 N j ˜ F j E(x j t ) (2.18) 37 ˆ N X j=1 n j WE(x j x j t ) = ˆ N X j=1 n j L j E(x j t ) (2.19) Since n j is a scalar, the new W matrix is updated as: Type1 :W new = [ ˆ N X j=1 n j L j E(x j t )][ ˆ N X j=1 n j E(x j x j t )] −1 (2.20) For theT matrix, we employed the strategy in [38] to update component by component since N cj is also a scalar. T cnew = [ ˆ N X j=1 N cj ˜ F cj E(x j t )][ ˆ N X j=1 N cj E(x j x j t )] −1 (2.21) In (2.21), T c denotes the [(c− 1)D + 1 : cD] rows sub-matrix of T and ˜ F cj is the [(c−1)D+1:cD] elements sub-vector of ˜ F j . Similarly, by setting ∂Ej ∂(Σ 1 −1 ) and ∂Ej ∂(Σ 2 −1 ) to be 0, we can get: Σ 1 = diag{ P ˆ N j=1 (N j ( ˜ F j −T new E(x j )) ˜ F j t )} ˆ N (2.22) Type 1: Σ2 = diag{ P Γ j=1 (nj(L j −WnewE(x j ))L j t )} Γ (2.23) Type 2: Σ2 = diag{ P Γ j=1 (nj(L j −E(x j )) t (L j −E(x j )))} Γ (2.24) These 2 variance vectors describe the energy that can not by represented by the factor analysis and control the importance in the joint optimization objective function (2.14). Afterseveraliterations’EMtraining,theparametersarelearned. Theninthesupervised i-vectorextraction, lettheΣ 2 tobeinfinitysincewedonotknowthegroundtruthlabel information. This will make equation(2.15) converges back to equation(2.5). After the supervisedi-vectorextraction. theclassificationmethodsarethesameasthetraditional i-vector modeling. 38 Figure 2.3: Simplified Supervised i-vector modeling and its objective function There are some naive extensions of this supervised i-vector frameworks. We can make L as the parameter vector that we want to perform regression, this can make the proposed framework suitable for regression problems. Moreover, if the classification or regression relation is not linear, we can use non-linear mapping as a preprocessing before generatingL. 2.1.3 Simplified i-vector I-vector training and extraction is computational expensive. Consider the GMM size, feature dimension, factor loading matrix size to be C, D, and K, respectively. The complexity for generating a single i-vector is O(K 3 +K 2 C +KCD) [56]. In this work, we make 2 approximations to reduce the complexity. The K 3 term comes from the matrix inversion while the K 2 C term is from T t Σ −1 N j T in equation(2.5). 
When C is large, this K 2 C term’s computational cost is huge. The fundamental reason is that each Gaussian component λ c has different N c for each utterancej which means some sub-vectors ˜ F cj has less variance than others in ˜ F j and needs intra mean supervector re-weighting in the objective function. We first decompose the N j vector into N j = n j m j where n j = P C c=1 N cj , m cj = N cj /n j and P C c=1 m cj = 1. m j is the re-weighting vector andn j (total frame number) controls the confidence at the global level. Our motivation is to re-weight each utterance’s mean 39 supervector with its own (m j ) 1/2 before the factor analysis step which makes each di- mension of the new supervector ˆ F j be treated equally in the approximated modeling (2.27). ˆ F c = P L t=1 P(c|y t ,λ)(y t −μ c ) P L t=1 P(c|y t ,λ) [ N cj n j ] 1 2 , ˆ F j =m 1/2 j ˜ F j (2.25) Sotheintrasupervectorunbalanceiscompensatedbythispre-weighting,eachutterance is represented by ˆ F j as the general feature vector andn j as the confidence value for the subsequent machine learning algorithms. We perform factor analysis in the following way by linearly project this new normalized supervector ˆ F on a dictionary ˆ T: ˆ F = ˆ Tx,P( ˆ x j ) =N(0,I) (2.26) P( ˆ F j | ˆ x j )=N( ˆ T ˆ x j ,m j N j −1 Σ)=N( ˆ T ˆ x j ,m j (njm j ) −1 Σ)=N( ˆ T ˆ x j ,n −1 j Σ) (2.27) therefore, the posterior distribution of i-vector ˆ x given the observed ˆ F is: P( ˆ x j | ˆ F j ) =N((I + ˆ T t Σ −1 n j ˆ T) −1 ˆ T t Σ −1 n j ˆ F j ),(I + ˆ T t Σ −1 n j ˆ T) −1 ). (2.28) From the above equation, we can find that the complexity is reduced toO(K 3 +KCD) sincen j isnotdependentonanyGMMcomponent. Byreplacingthe1stGMMstatistics supervector ˜ F j with the pre-normalized supervector ˆ F j and setN j to a scalar n j , the i-vector training equations become the simplified i-vector solution. Moreover, since the entire term (I + ˆ T t Σ −1 n j ˆ T) −1 ˆ T t Σ −1 in equation (2.28) only depends on the scalar total frame number n j , we can create a global table of this item againstthelogvalueofn j . Thereasontochooselogdomainitthatthesmallerthetotal framenumber,themoreimportantitisagainsttheprior. Ifn j isverylargecomparedto the prior, then the twon j get canceled. By looking at the table, the complexity of each utterance’s i-vector extraction is further reduced to O(KCD) with small table index 40 0 50 100 150 200 250 300 0 1 2 3 4 x 10 4 Table Index Corresponding n j value Figure 2.4: The n j quantization curve in the log domain, 300 indexes. quantization error. The larger the table, the smaller this quantization error. Figure 2.4 shows the quantization distance curve. We can see that the quantization error is relatively smaller when n j is small. In this work, we also derived the type 1 simplified supervised i-vector modeling’s solution as follows: E( ˆ x j ) =(I+ ˆ T t Σ 1 −1 n j ˆ T+W t Σ 2 −1 n j W) −1 ( ˆ T t Σ 1 −1 n j ˆ F j +W t Σ 2 −1 n j L j ), (2.29) E( ˆ x j ˆ x j t ) =E( ˆ x j )E( ˆ x j ) t +(I + ˆ T t Σ 1 −1 n j ˆ T +W t Σ 2 −1 n j W) −1 . (2.30) W new = [ ˆ N X j=1 n j L j E( ˆ x j t )][ ˆ N X j=1 n j E( ˆ x j ˆ x j t )] −1 (2.31) 41 0 0.5 1 1.5 2 2.5 3 -0.5 0 0.5 Time (s) Time (s) Frequency (Hz) 0 0.5 1 1.5 2 2.5 3 0 1000 2000 3000 4000 Time (s) Frequency (Hz) 1 2 3 4000 2184 1209 591 260 50 Figure 2.5: Waveform(top), spectrum(middle), and cochleagram(bottom) of one speech segment in RATS database. 
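Returning to the table-based simplified extraction of (2.28): because the posterior precision depends on the utterance only through the scalar n_j, the matrix term can be precomputed on a log-spaced grid of n values. The sketch below tabulates only the K x K inverse (rather than the full projection term) to keep memory modest while preserving the O(KCD) per-utterance cost; the grid bounds and 300-entry resolution mirror Fig. 2.4 but are illustrative assumptions.

    import numpy as np

    class SimplifiedIVectorExtractor:
        def __init__(self, T_hat, Sigma, n_min=1.0, n_max=4.0e4, table_size=300):
            K = T_hat.shape[1]
            self.TtSi = T_hat.T / Sigma                          # T^t Sigma^{-1}, (K, C*D)
            M = self.TtSi @ T_hat                                # T^t Sigma^{-1} T, (K, K)
            self.log_grid = np.linspace(np.log(n_min), np.log(n_max), table_size)
            # precompute (I + n T^t Sigma^{-1} T)^{-1} for each quantized n value
            self.table = np.stack([np.linalg.inv(np.eye(K) + np.exp(g) * M)
                                   for g in self.log_grid])

        def extract(self, F_hat, n_j):
            # F_hat: pre-weighted supervector of eq (2.25), flattened to (C*D,)
            idx = int(np.argmin(np.abs(self.log_grid - np.log(n_j))))   # nearest log(n) entry
            # eq (2.28); the O(KCD) product below dominates the per-utterance cost
            return self.table[idx] @ (self.TtSi @ (n_j * F_hat))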
ˆ T new = [ ˆ N X j=1 n j ˆ F cj E( ˆ x j t )][ ˆ N X j=1 n j E( ˆ x j ˆ x j t )] −1 (2.32) Σ 1 = diag{ P ˆ N j=1 (n j ( ˆ F j − ˆ T new E( ˆ x j )) ˆ F j t )} ˆ N (2.33) Σ 2 = diag{ P ˆ N j=1 (n j (L j −W new E(x j ))L j t )} ˆ N (2.34) Since the label vector dimensionality M << CD, the complexity is almost the same as previous O(CDK). It is worth noting that for best accuracy, we only perform approximationusingtheglobaltableforthetrainingpurpose. Whenintesting,equation (2.29) is sill employed. All the experimental results based on simplified i-vector or simplified supervised i-vector in Section 2.3 are generated in this way. 42 Figure2.6: 2D-Gaborfiltersarrangedbyspectralandtemporalmodulationfrequencies. 2.1.4 GFCC features and Gabor features for robust LID ForrobustLIDtaskonthenoisydata,wealsoappliedtheauditoryinspiredGammatone frequency cepstral coefficients (GFCC) [142] and spectral temporal Gabor features. It is shown in Fig. 2.5 that GFCC features have more detailed resolution on the low dimensional part of the frequency response and performed better than MFCC feature for robust SID task on a variety of low SNR conditions [142]. While Gabor features can capture spectral-temporal information in different scales as shown in Fig. 2.6. In this work, when fuse with traditional MFCC feature based subsystems, the overall system performance was enhanced. 43 Table 2.1: Target and nontarget languages in RATS database Target languages Arabic Farsi Pashto Dari Urdu Nontarget languages English Mandarin Spanish Italian Thai Vietnamese Russian Japanese Bengali Korean 2.1.5 Score level fusion Due to the limited amount of training data, we simply employed the weighted summa- tion fusion approach with parameters tuned by cross validation. Let there be G input subsystems where the i th subsystem outputs its own posterior probability vector l i (y) for every trial. Then the fused score vector ´ l(y) is given by: ´ l(y) = G X i=1 η i l i (y) (2.35) The weight, η k , can be tuned by validation data. For the SVM system in the LID task, log-likelihood normalization was adopted to map the log likelihood scores into the posterior probabilities [95]. It is worth noting that other advanced score fusion approaches, like the logistic regression method in the popular FoCal toolkit [18], can also be adopted here to increase the performance which is a topic for our future work. 2.2 Corpus, classification task and feature extraction 2.2.1 RATS database dev2 task We first performed experiments on the Robust Automatic Transcription of Speech (RATS) LID corpus [36, 131] 1 . Each speech recording was collected through various degraded, weak and/or noisy communication channels (A-H) with low SNR. Audio file format is 16-Khz 16-bit PCM MS WAV/RIFF with lossless FLAC compression. In this 1 http://www.darpa.mil/Our Work/I2O/Programs/Robust Automatic Transcription of Speech (RATS).aspx (please copy the full url including those underscores) 44 Figure 2.7: Waveform and spectrum of two speech segments in RATS database (dv2 0011 B and dv2 0439 F). LID task, each testing sample of speech needs to be scored on every target language hypothesis (a target language of interest to be detected). In other words, the task is to decide whether that target language was in fact spoken in the given sample (yes or no) based on an automated analysis of the speech signal. So we used equal error rate (EER), miss rate with 10% false alarm 2 and Detection Error Trade-off (DET) curve as the metrics for evaluation. 
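Since EER and the miss rate at the fixed 10% false-alarm operating point are the metrics used throughout this evaluation, the following is a small sketch of how they can be computed from pooled target and nontarget trial scores by sweeping the decision threshold; it is a generic implementation, not the official scoring tool.

    import numpy as np

    def eer_and_miss_at_fa(target_scores, nontarget_scores, fa_point=0.10):
        thresholds = np.sort(np.unique(np.concatenate([target_scores, nontarget_scores])))
        miss = np.array([(target_scores < t).mean() for t in thresholds])    # miss rate
        fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false alarm rate
        i = int(np.argmin(np.abs(miss - fa)))
        eer = 0.5 * (miss[i] + fa[i])                                # equal error rate
        miss_at_fa = miss[int(np.argmin(np.abs(fa - fa_point)))]     # miss at 10% false alarm
        return eer, miss_at_fa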
It can be observed from Fig 2.7 that the speech signals can be highly noisy and degraded with significant variability. As a preprocessing step, we performed Speech Activity Detection (SAD) on the raw signal to obtain the speech boundaries. The data set used to train the SAD system is the corpus provided by LDC for the RATS project [36, 131]. We used the Multi-Resolution Long-term spectral variability (MR- LTSV)SADsystemdescribedin[152]whichusesvariousresolutionsoftheLTSVfeature 2 this point in the DET curve is required in the task 45 proposedin[52]. Theclassifierweusedtoclassifyaframeasspeech/non-speechistheK- NNclassifier[34]. Theparametersofthe MR-LTSVfeatures andthenumberofoptimal neighbors of K-NN have been optimized on five randomly picked files for training and five randomly picked files for testing for each channel A-H, given in the RATS corpus. Approximately, the parameters have been optimized on 1 hour of training and 1 hour of testing data for each channel. In addition, we have used 100ms frame shift and smoothed the per frame decision using a median filter as described in [152]. We extracted 3 different features on 8K sampled valid speech data which have been reported to be useful in the noisy data LID task. For MFCC-SDC feature extraction, a 25ms Hamming window with 10ms shifts was adopted. Each utterance was converted into a sequence of 56-dimensional MFCC SDC feature vectors [150], each consisting of a 49 SDC (7-1-3-7) features and 7 MFCC coefficients including C0. We also extracted the 36dimensionalMFCCfeaturesconsistingof18MFCCcoefficientsandtheirfirstderiva- tives. Finally, GFCC features with 64 filter banks were generated. Feature warping was applied on all 3 features to mitigate variabilities. LIBLINEAR [45] was employed for the SVM modeling. The target and nontarget languages are shown in Table 2.1. The training data for target and nontarget languages are from ldc2011e111 and ldc2012e03, respectively. The dev2 data set from ldc2012e06 is employed as the evaluation data set. Our focus is the 120 seconds condition task with 1914 testing segments. There is a verification trial on every target language for each testing segment which results in 9570 trials in total. 2.2.2 NIST SRE 2010 database female condition 5 task WealsoperformedexperimentsontheNIST2010speakerrecognitionevaluation(SRE) corpus[119]. Ourfocusisthefemalepartofthecommoncondition5(asubsetoftel-tel) in the core task. We used equal error rate (EER), the normalized old minimum decision 46 Table 2.2: Corpora used to estimate the UBM, total variability matrix, JFA factor loading matrix, WCCN, LDA, PLDA and the normalization data for NIST 2010 task condition 5. Switchboard NIST04 NIST05 NIST06 NIST08 UBM √ √ T √ √ √ √ √ JFA V √ JFA U √ √ √ √ JFA D √ WCCN √ √ √ √ LDA √ √ √ √ PLDA √ √ √ √ Znorm √ √ Snorm √ Tnorm √ cost value (norm old minDCF) and norm new minDCF as the metrics for evaluation [119]. For cepstral feature extraction, a 25ms Hamming window with 10ms shifts was adopted. Each utterance was converted into a sequence of 36-dimensional feature vec- tors, each consisting of 18 MFCC coefficients and their first derivatives. We employed a Czech phoneme recognizer [141] to perform the voice activity detection (VAD) by sim- ply dropping all frames that are decoded as silence or speaker noises. Feature warping is applied to mitigate channel effects. The training data for NIST 2010 task included Switchboard II part1 to part3, NIST SRE 2004, 2005, 2006 and 2008 corpora on the telephone channel. 
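Returning to the MFCC-SDC front end described above, the sketch below computes shifted delta cepstra with the usual N-d-P-k parameterization (7-1-3-7 here); stacking the 49 SDC values with the 7 static MFCCs gives the 56-dimensional vector. Edge handling by repeating the first/last frame is an assumption of this sketch.

    import numpy as np

    def sdc(cepstra, N=7, d=1, P=3, k=7):
        # cepstra: (T, >=N) MFCC matrix; returns (T, N*k) shifted delta cepstra
        T = cepstra.shape[0]
        c = cepstra[:, :N]
        pad_r = (k - 1) * P + d
        cp = np.vstack([np.repeat(c[:1], d, axis=0), c, np.repeat(c[-1:], pad_r, axis=0)])
        out = np.zeros((T, N * k))
        t = np.arange(T)
        for i in range(k):
            # delta block i: c(t + i*P + d) - c(t + i*P - d), using padded indices
            out[:, i * N:(i + 1) * N] = cp[t + i * P + 2 * d] - cp[t + i * P]
        return out

    # 56-dimensional MFCC-SDC vector per frame: 7 static MFCCs (incl. C0) + 49 SDC values
    # features = np.hstack([cepstra[:, :7], sdc(cepstra)])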
The description of the data sets used in each step is provided in Table 2.2. The gender-dependent GMM UBMs consist of 1024 mixture components, which were trained using EM with the data from NIST SRE 04 and 05 corpus. We used all of the training data for estimating the total variability space. The NIST SRE 2004, 2005, 2006 and 2008 data sets were used for training WCCN, LDA and PLDA matrix, and a data set chosen from SRE 2006 47 Table 2.3: Complexity of the proposed methods for a single utterance (GMM size C = 2048, feature dimension D = 56, T matrix rank K = 600, type I s-vector τ 1 = 200, PCA projection matrixV sizeR×CD = 2500×2048×56, target class numberM = 6, time was measured on a Intel I7 CPU with 12GB memory) Methods Approximated complexity Time (s) I-vector O(K 3 +K 2 C +KCD) 8 Simplified I-vector without table O(K 3 +KCD) 0.22 Simplified I-vector with table O(KCD) 0.06 Supervised Simplified I-vector without table O(K 3 +K(CD+M)) 0.22 Supervised Simplified I-vector with table O(K(CD+M)) 0.06 type I s-vector (OMP toolkit [135]) O(τ 3 1 +τ 2 1 K +3Kτ 1 +2RK) 0.02 corpus was used for Tnorm score normalization, including 1325 female utterances. 256 female utterances from NIST 2008 were adopted as Snorm data. The JFA baseline system is trained using the BUT toolkit [22] and linear common channel point estimate scoring [55] is adopted. The speaker factor size and channel factor size is 300 and 100, respectively. ZTnorm was applied on JFA subsystem while Snorm was employed in i-vector subsystem. 2.3 Experimental Results 2.3.1 LID First, the complexity of the proposed methods for a single utterance is shown in Table 2.3. We can see that the proposed simplified and supervised i-vector systems achieves significant complexity cost reduction which has potentially large impact on mobile de- vices. It is worth noting that the batch OMP solution was employed here for type I s-vector method. If we use standard OMP optimization, the complexity will be larger. Furthermore, the performance of the proposed methods on all three feature set is shown in Table 2.4. First, we can observe that WCCN worked well to compensate the variability for all the systems. This makes sense due to there are 8 highly degraded 48 Table 2.4: Performance of the proposed methods with SVM modeling for LID RATS 120 seconds task ID Method WCCN EER (%) / Miss rate (%) at 10% false alarm MFCC-SDC 56dim MFCC 36dim GFCC 44dim 1 I-vector × 6.6/4.5 5.7/4.2 6.1/4.3 2 I-vector √ 6.1/4.4 5.7/3.9 5.8/3.9 3 Simplified I-vector × 7.6/6.7 7.0/5.3 7.3/6.1 4 Simplified I-vector √ 6.2/4.4 5.7/4.2 6.3/4.4 5 Simplified Supervised I-vector × 6.6/4.9 5.1/3.5 5.6/3.8 6 Simplified Supervised I-vector √ 6.3/4.5 4.8/3.0 5.4/3.7 Table 2.5: Performance of score level fusion with systems based on multiple features Table 5 Task Method EER (%) / Miss rate (%) at 10% false alarm ID duration MFCC-SDC MFCC GFCC Fusion 3 features 6 120s Simplified Supervised I-vector 6.3/4.5 4.8/3.0 5.4/3.7 3.7/1.9 communication channels for each language. Second, the simplified i-vector achieved comparable results to the i-vector baseline with small degradation. Third, the simpli- fiedsupervisedi-vectoroutperformedthesimplifiedi-vectordramatically. Especiallyfor MFCC 36 dimensional features and GFCC 44 dimensional features, the simplified su- pervised i-vector achieved relatively 10%-20% error reduction compared to the i-vector baseline. 
The reason might be because those 2 feature sets are not commonly adopted on clean data set for LID task (MFCC-SDC is the traditional feature), the supervised i-vector can better selectively extract the top Eigen directions only associated with the language labels. Finally, we also observed dramatic error reduction by fusing 3 feature set based systems together as our final system in Table 2.5. Fig. 2.8 demonstrates the DET curves for each feature set as well as the 3 feature set fusion. The fused system achieved 3.7% EER and 1.9% miss rate at 10% false alarm. 2.3.2 SRE The results of the i-vector baseline and the proposed supervised, simplified as well as the simplified supervised i-vector methods are shown in Table 2.6. We can observe that 49 Figure 2.8: EER and minDCF values for LID performance in Table 2.5 LDA, PLDA, and Snorm contributed to increase the performance for all the systems. WCCN reduced the EER by more than 40% for all systems except the type 2 simplified supervised i-vector (type 2 SIM-SUP-IV). For type 2 SIM-SUP-IV, WCCN is not that importantsincethelabelregularizedjointoptimizationalreadyincludesthewithinclass covariance in the objective function. This was reflected by a 30% EER reduction (I- vector9.02%,type2SIM-SUP-IV6.45%)inthecosinedistancerawscoringwithoutany backendprocessing. Furthermore, type1supervisedi-vector(type1SUP-IV)andtype 1 simplified supervised I-vector (type 1 SIM-SUP-IV) outperformed IV and SIM-IV by 5%-10% relatively for all the modeling configurations (3.37% and 3.45% EER vs 2.95 and3.13%EER).Alsoas showninTable 2.7(ID6vs5), afterfusingwithJFAbaseline, SUP-IV still outperformed IV baseline by 9% relative EER reduction. Therefore, by 50 Table 2.6: Performance of the proposed methods for the 2010 NIST SRE task female part condition 5 Method LDA WCCN PLDA S EER% norm minDCF norm new old IV × × × × 9.02 0.724 0.409 IV 250 × × √ 7.87 0.668 0.307 IV 250 √ × √ 3.91 0.454 0.190 IV 250 √ √ √ 3.37 0.415 0.165 type 1 SUP-IV 250 × × √ 7.64 0.640 0.278 type 1 SUP-IV 250 √ × √ 4.01 0.425 0.170 type 1 SUP-IV 250 √ √ √ 2.95 0.420 0.154 SIM-IV × × × × 8.94 0.758 0.374 SIM-IV 250 × × √ 7.96 0.696 0.311 SIM-IV 250 √ × √ 4.79 0.527 0.213 SIM-IV 250 √ √ √ 3.45 0.545 0.192 type 1 SIM-SUP-IV × × × × 8.65 0.746 0.341 type 1 SIM-SUP-IV 250 × × √ 7.06 0.654 0.289 type 1 SIM-SUP-IV 250 √ × √ 3.95 0.518 0.197 type 1 SIM-SUP-IV 250 √ √ √ 3.13 0.541 0.176 type 2 SIM-SUP-IV × × × × 6.45 0.645 0.285 type 2 SIM-SUP-IV 250 × × √ 5.35 0.575 0.228 type 2 SIM-SUP-IV 250 √ × √ 4.51 0.549 0.195 type 2 SIM-SUP-IV 250 √ √ √ 3.06 0.569 0.179 type 2 SIM-SUP-IV 250 × √ √ 3.08 0.581 0.189 IV: i-vector, SIM-IV: simplified i-vector, SUP-IV: supervised i-vector, SIM-SUP-IV: simplified supervised i-vector adding label information in the i-vector training indeed improves the performance. The less improvement of type 2 SIM-SUP-IV compared to type 1 SIM-SUP-IV might be due to the diagonal version of Σ 2 against the triangular WCCN matrix. Moreover, simplified supervised i-vector systems (type 1 SIM-SUP-IV and type 2 SIM-SUP-IV) achieved better EER but worse norm cost compared to the i-vector base- line. However, the computationally cost is reduced by around 120 times. And after fusing with JFA system (Table 2.7 ID 7 vs 5), this gap is reduced to only 3% to 6% relatively. Therefore, simplified supervised i-vector has the potential to replace the computational expensive i-vector baseline when fusing with JFA system. 
It is worth noting that the supervised version of all the systems only performed better on EER and norm old minDCF values. How to further reduce the norm new 51 Table 2.7: Performance of the proposed systems in fusion ID Systems EER% norm minDCF new old 1 JFA linear scoring ZTnorm 3.62 0.414 0.193 2 IV LDA WCCN PLDA Snorm 3.37 0.415 0.165 3 type 1 SUP-IV LDA WCCN PLDA Snorm 2.95 0.420 0.154 4 type 1 SIM-SUP-IV LDA WCCN PLDA Snorm 3.13 0.541 0.176 5 Fusion ID 1 + ID 2 2.77 0.372 0.152 6 Fusion ID 1 + ID 3 2.53 0.370 0.146 7 Fusion ID 1 + ID 4 2.82 0.377 0.162 Figure 2.9: EER performance on NIST 2010 SRE female condition 5 task minDCF is our current focus. Future work also includes applying the non-simplified type 2 supervised i-vector as well as evaluating different label vector designs. 52 Chapter 3: Sparse representation for both representation and classification 3.1 Sparse representation for classification 3.1.1 Methods 3.1.1.1 Sparse representation for modeling Given N 1 (N 1 = 1 in our case because only one recording for each target speaker and one target speaker per trial) target training samplesA 1 andN 2 non-target background training samplesA 2 , we construct the over-complete dictionaryA: A = [A 1 A 2 ] = [s 11 ,s 12 ,··· ,s 1N 1 ,s 21 ,s 22 ,··· ,s 2N 2 ]. (3.1) Each samples ij is an L dimensional i-vector and is normalized to unit l 2 norm. This matches the length normalization in the SVM cosine kernel. Throughout the entire testing progress, the background samplesA 2 are fixed; and only the target samplesA 1 53 are replaced according to the claimed target identity in the test trial. Let us denote N =N 1 +N 2 ,thenN 1 N 2 andL<N needtobesatisfiedforsparserepresentation. In our case, the dimensionality L of the i-vectors is significantly smaller than the number of training samples N. For any test sample y ∈ R L with unit l 2 norm, we want to use the over-complete dictionary A to linearly represent y in a sparse way. If y is from the target, then y will approximately lie in the linear span of training samples in A 1 [157]. Since the equality constraint Ax =y is not robust against large session variabilities [157], we constrain the Euclidian distance between the test sample and the linearcombinationof training samples to be smaller than which resulted in a standard convex optimization problem (l 1 -minimization with quadratic constraints): Problem A: minkxk 1 subject tokAx−yk 2 ≤ (3.2) Since N 1 = 1 in our case, for each sample in the over-complete dictionary i, (i = 1,··· ,N), let δ i :R N →R N be the characteristic function which selects the coefficient only associated with the i th sample. For x ∈R N , δ 1 (x) ∈R N is a new vector whose nonzero entries are the only entries in the first element ofx. Now based on the sparse representation x, in addition to the l 1 norm ratio and l 2 residual ratio introduced in [101], we propose the new Background Normalized (BNorm) l 2 residual criterion for verification purposes. It uses the scores from the background data to perform a kind of Tnorm on the target score. Given a solved sparse representation, we can also consider every background sample as the target sample and calculate its minus l 2 residual as a similarity score. Without any additional sparse representation computation, just 54 by rotating the role of each sample in this over-complete dictionary, we can instantly generate the similarity measure scores (φ) for all the samples. 
l 1 norm ratio = kδ 1 (x)k 1 /kxk 1 (3.3) l 2 residual ratio = ky−A(Σ N i=2 δ i (x))k 2 ky−Aδ 1 (x)k 2 (3.4) Bnorm l 2 residual = −ky−Aδ 1 (x)k 2 −mean(φ) std(φ) φ j,j=2:N = −ky−Aδ j (x)k 2 (3.5) Alargerscorerepresentsahigherlikelihoodforthetestingsamplebeingfromthetarget subject. Due to large session variabilities, the test sampley can be partially corrupted. Thus an error vectore is introduced to explain the variability [157]: y =y 0 +e =Ax 0 +e (3.6) So the original optimization problem takes the following form: Problem B : minkzk 1 subject tokBz−yk 2 ≤ (3.7) B = [A I]∈R L×(N+L) ,z = [x t e t ] t ∈R (N+L) (3.8) If the error vector e is sparse and has no more than (L+N 1 )/2 nonzero entries, the new sparse solutionz is the true generator according to (3.6) [157]. Finally, we redefine the three decision criteria based on the new sparse solution ˆ z = [ˆ x t ˆ e t ] t . 55 0 500 1000 1500 2000 0 0.2 0.4 0.6 0.8 Target: timvdA, Test: fumuyB z dictionary index Figure 3.1: The sparse solution of a true trial with problem B (3.7) 0 500 1000 1500 2000 -0.1 0 0.1 0.2 Target: tkgadA, Test: fmndgA dictionary index z Figure 3.2: The sparse solution of a false trial with problem B (3.7) l 1 norm ratio = kδ 1 (ˆ x)k 1 /kˆ xk 1 (3.9) l 2 residual ratio = ky−ˆ e−A(Σ N i=2 δ i (ˆ x))k 2 ky−ˆ e−Aδ 1 (ˆ x)k 2 (3.10) Bnorml 2 residual = −ky−ˆ e−Aδ 1 (ˆ x)k 2 −mean(φ) std(φ) φ j,j=2:N = −ky−ˆ e−Aδ j (ˆ x)k 2 (3.11) Fig.3.1 and 3.2 demonstrate the sparse solutions of two trials in the evaluation using problem B (3.7) before Tnorm. 56 Table 3.1: The two configurations (S1 and S2) and the corresponding complexity of sparse representation with Tnorm. Trail Score Tnorm Score (N 3 samples) SR A 1 A 2 A 1 A 2 Times S1 Target Background i th Tnorm Background 1+N 3 S2 Target Tnorm None None 1 3.1.1.2 Sparse representation with Tnorm Test normalization (Tnorm) is an important technique for normalizing the variance of the testing score based on a set of cohort models and hence is widely adopted in veri- fication tasks. Calculating the similarity scores between the testing sample and all the cohort models can be computationally expensive for the sparse representation system. Thus, as shown in Table 3.1, we propose a new setup for the sparse representation sys- temtoefficientlyperformTnormscorenormalization. Comparedtothestraightforward configuration S1, the new setting S2 only requires a single sparse representation calcu- lation which reduces the computational complexity significantly. In the configuration S2, we directly employ the Tnorm data as the non-target background samples in the over-complete dictionary and use the score distribution of the Tnorm data to normalize thetargetsample’sscoreusingeq(3.11). NotethatinproblemBsetting, thenumberof samples in the over-complete dictionary (L+1+N 3 ) is always bigger than the i-vector dimensionality L. Therefore, the condition of sparse representation is still satisfied. 3.1.2 Corpora, tasks and feature extraction 3.1.2.1 GMM mean shifted supervector and BANCA database The supervectorss andy in the face video verification task are the GMM mean shifted supervectors. 57 For each segment of video sequence, a GMM was adapted from the UBM by MAP adaptation; the GMMs were modeled with diagonal covariance matrices and only the means of the GMMs were adapted [28, 133]. 
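To tie the pieces of Section 3.1.1 together before returning to the supervectors themselves, here is a minimal sketch of (i) solving the sparse representation over the dictionary A = [A1 A2] and (ii) computing the three decision scores (3.9)-(3.11). The thesis uses the SPGL1 solver for the epsilon-constrained problem; scikit-learn's Lasso (the equivalent Lagrangian form) is used here as a stand-in, with alpha playing the role of the noise level, and N1 = 1 is assumed as in the text.

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import normalize

    def sparse_code(A, y, alpha=1e-3):
        # columns of A and y are unit-l2-normalized, matching the cosine-kernel length norm
        A = normalize(A, axis=0)
        y = y / np.linalg.norm(y)
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(A, y)                      # min ||y - A x||^2 + alpha * ||x||_1
        return lasso.coef_                   # the sparse vector x

    def sr_decision_scores(A, y, x_hat, e_hat=None):
        # column 0 of A is the single target sample; the rest are background/Tnorm data
        if e_hat is None:                    # problem A: no explicit error vector
            e_hat = np.zeros_like(y)
        r = y - e_hat
        resid = np.array([np.linalg.norm(r - A[:, j] * x_hat[j]) for j in range(A.shape[1])])
        l1_ratio = np.abs(x_hat[0]) / (np.abs(x_hat).sum() + 1e-12)        # eq (3.9)
        bg = x_hat.copy()
        bg[0] = 0.0
        l2_ratio = np.linalg.norm(r - A @ bg) / (resid[0] + 1e-12)         # eq (3.10)
        phi = -resid[1:]                                                   # background scores
        bnorm = (-resid[0] - phi.mean()) / (phi.std() + 1e-12)             # eq (3.11)
        return l1_ratio, l2_ratio, bnorm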
The KL divergence D(λ, ˆ λ) between two GMM models (λ, ˆ λ) is approximated by the upper bound distanced(λ, ˆ λ) which satisfy the Mercer condition [26]: 0≤D(λ, ˆ λ)≤d(λ, ˆ λ) = 1 2 N X i=1 p i (μ i −ˆ μ i ) t Σ −1 i (μ i −ˆ μ i ) (3.12) The linear kernel is defined as the corresponding inner product of the GMM mean supervectors m which is a concatenation of the weighted GMM mean vectors m = [m t 1 ,··· ,m t i ,··· ,m t N ] t [26]: K(λ, ˆ λ) = N X i=1 m t i ˆ m i = N X i=1 ( √ p i Σ − 1 2 i μ i ) t ( √ p i Σ − 1 2 i ˆ μ i ) (3.13) where p i and Σ i are the i th UBM mixture weight and diagonal covariance matrix, μ i corresponds to the mean of the i th Gaussian component in this GMM. Let the UBM model be e λ = {p j ,e μ j ,Σ j }, then we can construct mean shifted GMM model λ ? by subtracting the UBM mean vector from the MAP adapted model: λ ? ={p i ,μ ? i ,Σ i } = {p i ,μ i − e μ i ,Σ i },i = 1,··· ,N. It is clear that the distance between the mean shifted modelsisexactlythesameasthedistancebetweentheoriginalmodels. Thus,theGMM mean shifted supervectors is generated from the mean shifted GMM using (3.13). In MAP adaptation [133], the mean vector is updated as follows: μ i =α i E i (x)+(1−α i )e μ i , α i = P T t=1 Pr(i|x t ) γ + P T t=1 Pr(i|x t ) (3.14) wherePr(i|x t )denotestheoccupancyprobabilityoffeatureframetbelongingtothei th gaussiancomponentandγ istheconstantrelevancefactor. Therefore,μ ? i =α i (E i (x)− 58 UBM MAP adapted UBM MAP adapted (a) UBM and MAP adaptation (b) Mean shifted GMM after MAP Figure 3.3: MAP adaptation and mean shifted GMM model e μ i ). Given a segment of feature vectors and a large sized UBM, α i can be arbitrarily small on certain gaussian components due to the small occupancy probability and lack of enough data to update [133]. Thus the entries of the corresponding dimensions on mean shifted supervector are close to zero. It is shown in Fig.3.4 that the over- complete dictionary constructed using mean shifted supervectors significantly reduced thecorrelationbetweenatoms. Sinceanincoherentover-completedictionarycanprovide better performance in l 1 -minimization sparse representation [41, 157], the proposed meanshiftedsupervectorismoresuitableinthisframeworkcomparedtothetraditional mean supervector. From Fig.3.3, we can see that the GMM mean shifted supervector models the distance between the MAP adapted model and the UBM model rather than the mean of the MAP adapted model itself. By taking out the common and dominant component(UBM)fromeachadaptedmodel, theover-completedictionarycomposedof all the training supervectors becomes more incoherent while the discriminative distance measure is still preserved. The reason that the coherence was not reduced significantly in Fig.3.4 might due to some close to duplicate samples from the same subject. Thus other dictionary design approaches [4], such as duplicate samples removal, can also be adopted on the mean shifted supervectors to further enhance the robustness and performance of sparse representations. 59 0 0.2 0.4 0.6 0.8 1 (a) using mean supervectors 0 0.2 0.4 0.6 0.8 1 (b) using mean shifted supervectors Figure3.4: Thecorrelationmatrixoftheover-completedictionaryA 2 (3.7),N 2 = 4601. Thecoherencevaluesmax j6=k hA 2j ,A 2k i)are0.996(left)and0.963(right),respectively. 
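A short sketch of the two steps just described: the relevance-MAP update of the component means (3.14) and the KL-kernel-normalized, mean-shifted supervector construction implied by (3.13). Diagonal covariances are assumed and the variable names are illustrative.

    import numpy as np

    def map_adapt_means(post, Y, ubm_means, relevance=12.0):
        # post: (T, C) occupancy probabilities; Y: (T, D) frames; eq (3.14)
        n_c = post.sum(axis=0)                                   # soft counts per component
        Ex = (post.T @ Y) / np.maximum(n_c[:, None], 1e-12)      # E_i(x)
        alpha = n_c / (n_c + relevance)                          # adaptation coefficients
        return alpha[:, None] * Ex + (1.0 - alpha)[:, None] * ubm_means

    def mean_shifted_supervector(ubm_weights, ubm_means, ubm_vars, adapted_means):
        # sqrt(p_i) * Sigma_i^{-1/2} * (mu_i_MAP - mu_i_UBM), concatenated over components
        shifted = adapted_means - ubm_means
        scaled = np.sqrt(ubm_weights)[:, None] * shifted / np.sqrt(ubm_vars)
        return scaled.ravel()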
Given a sequence of face images, face detection was performed for each image frame by using the Viola-Jones face detector in the opencv library, and then the Biosecure talkingfacereferencesystem[17]withtheMPTlibrary[47]wasusedtofindthelocation of two eyes. Given the detected face image and eye location, a geometric normalization tool [13] was applied to crop the detected face image (200×240) into a normalized face image (51×55). The histogram was globally equalized for each cropped and normalized image. Furthermore, DCTmod2xy [28] feature vectors were extracted for each block of every normalized face image. The 20 dimensional DCTmod2xy feature is the standard DCTmod2 feature [136] plus (x,y) coordinates of each block. Due to the included space domain information, it performs better than the DCTmod2 feature [28, 127]. Thus given the block size of 8×8 pixels, each normalized face generated 11×12 =132 frames of DCTmod2xy feature vectors which were assumed to be independent and modeled by GMM. Compared to the holistic feature based systems, this block wise local feature based system was reported to be robust against face localization and normalization errors [28, 127]. Finally, the feature vectors were normalized to mean zero and unit variance on a per-video basis. 60 The Banca talking face video database [9] was used for evaluation. The Banca En- glishdatabasehas2groups. Eachgrouphas26targetsubjects(13female, 13male)and for each subject, 12 sessions were recorded in 3 different scenarios. The P protocol [9] was employed for evaluation. There are another 60 world model video recordings from different subjects (not in the 52 subjects set) for background UBM training. The eval- uation method is the Equal Error Rate(EER). Since data set g1 and g2 are separated, g2 is the development set for g1, and vise versa. For the GMM modeling, the mixture number was 128 (tuned by system 2) while Z-norm, T-norm, JFA and non-target back- ground data sets were all from the development data set. A relevance factor of 12 was used for the MAP adaptation. N 2 and mean value of N 1 are 4601 and 8.62 for g1 and 3843 and 11.34 for g2. The P protocol has 544 trials (232 true and 312 false) and 545 trials(234trueand311false)forg1andg2, respectively. Thesparserepresentationwas achieved by the SPGL tool[153]. Rather than using the provided pre-selected 5 images set to perform testing, we used all the images in the testing video sequence without selection by face quality measurements which will be considered in the further work. 3.1.2.2 NIST SRE 2008 database The supervectorss andy in the SRE task are the low dimensional i-vectors 2.5. We performed experiments on the NIST 2008 speaker recognition evaluation (SRE) corpus [118]. Our focus is the single-side 1 conversation train, single-side 1 conversation test, and the multi-language handheld telephone task, which is one part of the core test condition. This setup resulted in 3832 true trials and 33218 false trials. We used equal error rate (EER) and the minimum decision cost value (minDCF) as the metrics for evaluation [118]. 61 Table 3.2: Corpora used to estimate the UBM, total variability matrix (T), WCCN, LDA ,SVM imposter and the Tnorm data. Switchboard NIST04 NIST05 NIST06 UBM √ T √ √ √ √ WCCN √ √ √ LDA √ √ √ SVM-Imposter √ √ Tnorm √ For cepstral feature extraction, a 25ms Hamming window with 10ms shifts was adopted. Each utterance was converted into a sequence of 36-dimensional feature vec- tors,eachconsistingof18MFCCcoefficientsandtheirfirstderivatives. 
Anenergy-based speech detector was applied to discard low-energy frames. Feature warping is applied to mitigate channel effects. The training data included Switchboard II part2 and part3, Switchboard Cellular, NIST SRE 2004, 2005, and 2006 corpora. The description of the dataset used in each step is provided in Table 3.2. The gender-dependent GMM UBMs consist of 1024 mixture components, which were trained using EM with the data from NIST SRE 04 corpus. The background data was the same as UBM. We used all of the training data for estimating the total variability space. The NIST SRE 2004, 2005 and 2006 datasets were used for training WCCN and the the LDA matrix, and a data set chosen from NIST SRE 2006 corpus was used for Tnorm score normalization, including 367 male utterances and 340 female utterances. The SVMLight toolkit [68] was used for SVM modeling. 3.1.2.3 UBM weight posterior probability supervector and Agender database Thesupervectorssandy intheageandgenderrecognitiontaskarethelowdimensional square root UWPP supervectors 3.20. 62 For each utterance in the training and evaluation sets, UWPP feature extraction is performed using the UBM. Given a frame-based MFCC featurex t and the GMM-UBM λ with C Gaussian components (each component is defined as λ i ), λ i ={w i ,μ i ,Σ i },i = 1,··· ,C, (3.15) the occupancy posterior probability is calculated as follows: P(λ i |x t ) = p i (x t |μ i ,Σ i ) Σ C j=1 p j (x t |μ j ,Σ j ) . (3.16) Thisposteriorprobabilitycanalsobeconsideredasthefractionofthisfeaturex t coming from the i th Gaussian component which is also denoted as partial counts. The larger the posterior probability, the better this Gaussian component can be used to represent this feature vector. The UWPP supervector is defined as follows: UWPP supervector =b = [b 1 ,b 2 ,··· ,b M ] (3.17) b i = y i T = 1 T Σ T t=1 P(λ i |x t ). (3.18) Equation (3.18) is exactly the same as the weight updating equation in the expectation-maximization (EM) algorithm in GMM training. The mixing coefficient b i is equal to the fraction of data points assigned to the corresponding i th GMM com- ponent. Considering the GMM as a generative model which generates all the T inde- pendent frames of feature vectors, y i = Σ T t=1 P(λ i |x t ) is the total frames being drawn from the corresponding i th GMM component. InordertomodeltheUWPPsupervectorusingSVM,weextendthekernelfromthe traditional linear kernel [159] to the Bhattacharyya probability product (BPP) kernel [66]. 63 k linear (P,P 0 ) = (b) t b 0 (3.19) k BPP (P,P 0 ) = ( √ b) t √ b 0 (3.20) Let the square root of UWPP supervector θ = [b 1/2 1 ,b 1/2 2 ,··· ,b 1/2 M ] and θ 0 = [b 0 1 1/2 ,b 0 2 1/2 ,··· ,b 0 M 1/2 ] be two input feature vectors of SVM, then the BPP kernel is just the standard linear kernel on the square root UWPP supervectors: k BPP (P,P 0 ) =(θ) t θ 0 . (3.21) A multi-class SVM classifier based on the BPP kernel was employed for UWPP super- vector modeling using all the training set data. LIBSVM [31] with probabilistic output training was adopted. Another database to evaluate the proposed approach is the aGender database [24, 138]. The task is to classify a speaker’s age and gender class which is defined as follows: children < 13 years (C), young people 14−19 years (YF/YM), adults 20−54 years (AF/AM), and seniors > 55 years (SF/SM), where F and M indicate female and male, respectively. We employed a Czech phoneme recognizer [141] to perform the voice activitydetection(VAD)bysimplydropallframesthataredecodedassilenceorspeaker noises. 
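Returning to the UWPP representation defined in (3.16)-(3.21), the computation is compact: average the per-frame UBM occupancy posteriors over the utterance, and the BPP kernel is then a linear kernel on the square-rooted supervectors. A minimal sketch follows (the posterior computation on the UBM is omitted, since it is the same occupancy step shown for the Baum-Welch statistics in Chapter 2):

    import numpy as np

    def uwpp_supervector(post):
        # post: (T, C) occupancy posteriors P(lambda_i | x_t); eqs (3.17)-(3.18)
        return post.mean(axis=0)

    def bpp_kernel(b, b_prime):
        # Bhattacharyya probability product kernel, eq (3.20): linear kernel on sqrt(b)
        return float(np.sqrt(b) @ np.sqrt(b_prime))

    # theta = np.sqrt(b) is what is fed to the multi-class linear-kernel SVM, eq (3.21)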
ThemeanandstandarddeviationofspeechdurationperdatasampleafterVAD inthetraininganddevelopmentdatasetsoftheaGenderdatabaseare1.13±0.86seconds and1.14±0.87seconds,respectively. Thusitisashortlengthspeechutterancedatabase. The training data set of the aGender database (472 speakers, 32527 utterances) was used for model training while the development data set from the aGender database (300 speakers, 20549 utterances) was used as the evaluation set of each subsystem as well as the fusion system in this paper. Finally, the testing data set from the aGender 64 Table 3.3: EER (%) performance of the proposed systems Methods / System ID 1 2 3 4 5 GMM baseline √ √ √ √ √ ZT norm √ √ JFA √ √ √ GMM-SVM √ GMM-Sparse √ P protocol G1 27.9 14.2 12.8 10.2 8.4 P protocol G2 29.6 13.3 12.0 12.3 10.5 Table 3.4: EER (%) of the sparse representation system (P protocol) criterion / problem settings A eq(3.2) B eq(3.7) B eq(3.7) supervectors mean mean mean shifted g1: l 2 residual ratio 15.5 11.5 9.9 g2: l 2 residual ratio 15.0 13.3 11.8 g1: l 1 norm ratio 12.5 10.9 8.4 g2: l 1 norm ratio 13.6 12.4 10.5 database (17332 utterances) was evaluated. The details about the aGender database and the evaluation methods are provided in [138]. 3.1.3 Experimental Results 3.1.3.1 Face video verification As shown in Table 3.3, ZT-norm dramatically improved the results which matches the conclusion of [129]. Furthermore, the use of JFA reduced the EER by 1.3% absolutely which demonstrates the good performance of the JFA method in terms of variability compensation. Comparing the results of system 4 with 5, we can observe that sparse representation based on mean shifted supervectors consistently outperformed the SVM meansupervectorsystemby1.8%absoluteEERreductioninbothgroups. Byonlyusing the proposed JFA and sparse representation approaches, system 5 achieved the best performance which significantly improved the performance of GMM ZT-norm baseline. 65 Table 3.5: Performance of the sparse representation system on the Male part of the NIST 08 test with configuration S1. System Without Tnorm With Tnorm EER minDCF EER minDCF l 1 norm ratio 5.68% 0.0243 5.13% 0.0235 l 2 residual ratio 6.25% 0.0247 5.13% 0.0240 Bnorm -(l 2 residual) 5.35% 0.0236 4.94% 0.0226 Moreover, compared to the results in [28, 127, 129], the proposed method also achieved highly competitive results. In Table 3.4, the performance of different sparse representation problem settings and decision criteria are shown. First, the sparse solution computed by the problem B achieved better results compared to the problem A which validates the assumption that adding an error vector can enhance the robustness against variabilities. Second, the systems using mean shifted supervectors performed consistently better than the ones employed the traditional mean supervectors. This can be attributed to the more incoherent over-complete dictionary constructed by mean shifted supervectors. Third, the l 1 norm ratio criterion outperformed the l 2 residual ratio criterion which matches the results of sparsity concentration index (SCI) criterion in the open set rejection task [157]. Finally, adopting the sparse representation method in [117] (15.5% EER) did not improve the results compared to system 4. However, in the proposed framework, our system 5 achieved significantly improvement against system 4. 3.1.3.2 SRE The performance of sparse representation using S1 configuration and problem B setting with different score measuring criteria is shown in Table 3.5 and Table 3.6. 
It is demonstrated in [101] that the l1 norm ratio is better than the l2 residual ratio for verification tasks, which matches our experimental results here. Furthermore, the proposed BNorm l2 residual criterion achieved the best performance among all three score measurements, with 0.0226 and 0.0293 minDCF values after Tnorm for the male and female tasks, respectively.

Table 3.6: Performance of the sparse representation system on the Female part of the NIST 08 test with configuration S1.
System | EER / minDCF (without Tnorm) | EER / minDCF (with Tnorm)
l1 norm ratio | 7.21% / 0.0327 | 6.95% / 0.0317
l2 residual ratio | 7.26% / 0.0325 | 6.86% / 0.0307
Bnorm (l2 residual) | 6.76% / 0.0310 | 6.19% / 0.0293

Table 3.7: Performance on the Male part of the NIST 08 test
ID | System | EER / minDCF (without Tnorm) | EER / minDCF (with Tnorm)
1 | SVM-base | 4.75% / 0.0231 | 4.36% / 0.0216
2 | CDS-base | 4.76% / 0.0256 | 4.43% / 0.0221
3 | SR S1 | 5.35% / 0.0236 | 4.94% / 0.0226
4 | SR S2 | 4.82% / 0.0235 | -
6 | Fusion 1+3 | 4.18% / 0.0202 | -
7 | Fusion 2+3 | 4.30% / 0.0205 | -
8 | Fusion 1+4 | 4.05% / 0.0204 | -
9 | Fusion 2+4 | 4.22% / 0.0204 | -

Table 3.8: Performance on the Female part of the NIST 08 test
ID | System | EER / minDCF (without Tnorm) | EER / minDCF (with Tnorm)
1 | SVM-base | 5.86% / 0.0278 | 5.52% / 0.0268
2 | CDS-base | 6.87% / 0.0326 | 5.93% / 0.0302
3 | SR S1 | 6.76% / 0.0310 | 6.19% / 0.0293
4 | SR S2 | 6.40% / 0.0314 | -
6 | Fusion 1+3 | 5.40% / 0.0263 | -
7 | Fusion 2+3 | 5.55% / 0.0272 | -
8 | Fusion 1+4 | 5.25% / 0.0262 | -
9 | Fusion 2+4 | 5.53% / 0.0272 | -

It is shown in both Table 3.7 and Table 3.8 that the S1-configuration-based sparse representation system performed better than the S2 configuration in terms of minDCF value. This might be because, in the S2 setting, each Tnorm target sample was not scored on the test sample independently and the Tnorm set is smaller than the background data set. However, the S2 setting is significantly more efficient. Moreover, when fusing the sparse representation systems with the SVM and CDS baselines, the S2-based system demonstrates superior performance over the S1 setting in terms of EER for both the male and female tasks.

In Table 3.7, we see that a significant improvement in minDCF was achieved by fusing the proposed S2 sparse representation system with either the SVM baseline or the CDS baseline. The fusion system (ID 8) achieved the best result on the male task, with 4.05% EER and a 0.0204 minDCF value. Similar results are shown in Table 3.8 for the female task. The minDCF value of the cosine distance baseline system was improved from 0.0302 to 0.0272 by fusion with the sparse representation system (fusion system ID 9). The overall system performance was improved to 5.25% EER and a 0.0262 minDCF value by fusion system ID 8.

Tables 3.7 and 3.8 also show that, after T-norm, sparse representation did not achieve superior performance compared to the SVM or CDS baselines in terms of single-system performance in this task. This might be because there is only one enrollment (positive) sample in the over-complete dictionary. Furthermore, we can see that the improvements from Tnorm score normalization on the sparse representation systems are less significant than for the SVM and CDS baselines. This might be due to the score distribution not being Gaussian (the majority of the l1 norm ratio scores concentrate around 0), suggesting that we need to investigate other distribution-based score normalizations. Future work also includes investigating the use of sparse representation for the language identification task and potential ways to represent the speaker/language/channel information in a sparse manner.
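As a concrete illustration of the two decision criteria compared above, the following Python sketch (hypothetical names) computes an l1 norm ratio score and an l2 residual ratio score for one test supervector against a dictionary whose columns carry class labels. It assumes the sparse coefficient vector has already been obtained by an l1 solver; it is one plausible instantiation of these ratios, whose exact definitions follow the equations given earlier in this chapter.

import numpy as np

def src_scores(D, labels, x, y, target):
    """Verification scores for sparse-representation classification.

    D      : (d, n) over-complete dictionary, columns are background/enrollment supervectors
    labels : (n,)   class label of each column
    x      : (n,)   sparse coefficient vector solving y ~ D x
    y      : (d,)   test supervector
    target : label of the hypothesized (enrollment) class
    """
    mask = (labels == target)
    # l1 norm ratio: fraction of the coefficient mass that falls on the target columns
    l1_ratio = np.abs(x[mask]).sum() / (np.abs(x).sum() + 1e-12)
    # l2 residual ratio: overall residual relative to the residual using only target columns,
    # so that a larger value again favors the target hypothesis
    res_target = np.linalg.norm(y - D[:, mask] @ x[mask])
    res_all = np.linalg.norm(y - D @ x)
    l2_ratio = res_all / (res_target + 1e-12)
    return l1_ratio, l2_ratio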
Table 3.9: Performance of GMM UWPP supervector modeling evaluated on the development set with 512 GMM components.
Method | Kernel/Ratio | Supervector | age & gender UA/WA | age UA/WA | gender UA/WA
SVM | linear | UWPP | 41.5 / 42.2 | 45.0 / 44.9 | 74.5 / 82.1
SVM | linear | square root UWPP | 42.4 / 42.9 | 45.8 / 45.7 | 74.6 / 82.2
Sparse representation | L1 | UWPP | 36.9 / 38.2 | 41.7 / 42.3 | 72.1 / 79.8
Sparse representation | L2 | UWPP | 38.5 / 38.7 | 43.1 / 42.2 | 73.7 / 80.4
Sparse representation | L1 | square root UWPP | 38.3 / 39.7 | 42.8 / 43.5 | 72.8 / 80.9
Sparse representation | L2 | square root UWPP | 41.4 / 41.4 | 45.3 / 44.4 | 75.3 / 82.4

Table 3.10: Performance of the fusion system between sparse representation and SVM on the square root UWPP supervectors
System | age & gender UA/WA | age UA/WA | gender UA/WA
GMM-UWPP-SVM | 42.4 / 42.9 | 45.8 / 45.7 | 74.6 / 82.2
GMM-UWPP-Sparse Representation | 41.4 / 41.4 | 45.3 / 44.4 | 75.3 / 82.4
Weighted sum fusion | 43.8 / 44.2 | 47.2 / 46.9 | 76.2 / 83.5

3.1.3.3 Age and gender recognition

In Table 3.9, the results of different setups for the SVM and sparse representation subsystems on the GMM UWPP supervectors are shown. First, we can see that the BPP kernel outperforms the linear kernel in the SVM modeling of the UWPP supervectors. Furthermore, the L2 residual ratio criterion performs better than the L1 norm ratio for the sparse representation modeling, and the Hellinger's-distance-based quadratic constraint yields a significant boost in the sparse representation subsystem compared to the baseline Euclidean distance. This might be because the square root UWPP supervectors automatically have unit l2 norm and Hellinger's distance is closely related to the Bhattacharyya affinity between distributions [66]. We can also observe from Table 3.9 that sparse representation modeling with the L2 ratio on square root UWPP supervectors yields competitive results compared to the SVM subsystems. Considering that no multi-class training is required for sparse representation, this approach can serve as a kind of online adaptive learning method. Finally, fusing the SVM and sparse representation systems on the square root UWPP supervectors further improves the performance, as shown in Table 3.10.

3.2 Sparse representation for representation

3.2.1 Methods

Rather than adopting sparse representation as a classification approach [82, 101, 105, 117], as discussed in the Introduction, in this work we propose to use sparse representation with KSVD dictionary learning as an alternative sparse-coding-based representation to i-vector modeling in the LID and SRE tasks. First, let us consider the objective function of the simplified i-vector:

J = \sum_{j=1}^{\hat N} \left( \frac{1}{2} \hat x_j^t \hat x_j + \frac{1}{2} (\hat F_j - \hat T \hat x_j)^t \Sigma_1^{-1} n_j (\hat F_j - \hat T \hat x_j) - \frac{1}{2} \ln(|\Sigma_1^{-1}|) \right).   (3.22)

Since the sparse representation optimizes x and T to minimize the objective function J (\Sigma_1 is fixed), the term \frac{1}{2} \ln(|\Sigma_1^{-1}|) can be neglected. Furthermore, we assume that \Sigma_1^{-1} \Sigma_{ubm} \approx Q I, where Q is a scalar. The reason is that each dimension plays the same role in equation (3.22) (n_j is a scalar), so the inter-dimension variance differences mainly come from the UBM variance initialization. Figure 3.5(a) shows an example of \Sigma_1 and \Sigma_{ubm} at the first Gaussian component after simplified i-vector training.

Figure 3.5: (a) \Sigma_1 and \Sigma_{ubm} for the 1st Gaussian component; (b) the distribution of all the elements of \Sigma_1^{-1} \Sigma_{ubm}.
The distribution of all the elements of \Sigma_1^{-1} \Sigma_{ubm} is shown in Figure 3.5(b). We can observe that most values in Figure 3.5(b) are close to a constant Q, which supports our approximation. Let us denote \Sigma_{ubm}^{-1/2} \hat F_j = \check F_j and \Sigma_{ubm}^{-1/2} \hat T = \check T, respectively. The objective function then becomes

J = \sum_{j=1}^{\hat N} \left( \frac{1}{2} \check x_j^t \check x_j + \frac{n_j Q}{2} \| \check F_j - \check T \check x_j \|_2^2 \right).   (3.23)

Since Q is a constant and \| \check x_j \|_2 is already constrained in the sparse representation, we finally arrive at the following minimization problem (denoted sparse type I):

Sparse type I:  \min_{\check T, \check x} J = \sum_{j=1}^{\hat N} n_j \| \check F_j - \check T \check x_j \|_2^2  subject to  \| \check x_j \|_0 < \tau_1.   (3.24)

In this case, the constraint \| \check x_j \|_0 < \tau_1 bounds the \check F_j reconstruction error from below, which has an effect similar to limiting the rank K of the T matrix. The KSVD toolkit [135] was adopted for this optimization. In the LID task, after training a small-rank dictionary for each target language and merging them into one large dictionary, the representation of each mean supervector on this dictionary is naturally sparse, which makes the corresponding sparse representation valid and discriminative. This resulting large dictionary serves as the initial matrix for KSVD dictionary learning on the full training data set. Following [97], we call this sparse vector \check x_j the s-vector. In order to reduce the computational cost, we multiply the R-rank PCA projection matrix V with the CD-dimensional supervector \check F_j to obtain an R-dimensional vector:

Sparse type I:  \min_{\check T, \check x} J = \sum_{j=1}^{\hat N} n_j \| V \check F_j - \check T \check x_j \|_2^2  subject to  \| \check x_j \|_0 < \tau_1.   (3.25)

Figure 3.6: (a) One type II s-vector example in the NIST SRE 2010 database, K = 3000, \tau = 6; (b) one type I s-vector example in the LID RATS database, K = 600, \tau = 150.

Our earlier s-vector work in [97] can be summarized as a Lasso-based, l1-norm-regularized weighted least squares estimation for each utterance, without dictionary learning:

Sparse type II:  \min_{\check x_j} J_j = \| \Sigma^{-1/2} N_j^{1/2} (\tilde F_j - T \check x_j) \|_2^2  subject to  \| \check x_j \|_1 < \tau_2.   (3.26)

The reason dictionary learning could not be performed there is that each normalized mean supervector \Sigma^{-1/2} N_j^{1/2} \tilde F_j needs to be projected on a normalized dictionary \Sigma^{-1/2} N_j^{1/2} T, and the vector N_j^{1/2} is different for each utterance. It is the simplified i-vector (2.25, 2.27, 2.28) proposed in this work that cuts the link between each utterance's own re-weighting vector and the global dictionary, making global KSVD dictionary learning possible. In [97], we employed the full-length mean supervector \tilde F_j and the top 3000 PCA eigenvectors as T, which made s-vector extraction computationally expensive. In this work, the complexity of the type I s-vector's dictionary training and single-utterance extraction is approximately O(\hat N (\tau_1^3 + \tau_1^2 K + 7 K \tau_1 + 2 R K + 4 R \tau_1)) and O(\tau_1^3 + \tau_1^2 K + 3 K \tau_1 + 2 R K), respectively [135]. Since R << CD, K << CD and \tau_1 < K, the cost is reduced dramatically. Fig. 3.6 shows type I and type II s-vector examples, and Fig. 3.7 shows one example of the original and reconstructed mean supervector V \check F_j for the type I s-vector; a minimal sparse-coding sketch is given below.
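The sketch below (Python/NumPy, hypothetical names) illustrates only the extraction step of (3.25) for a single utterance: given a learned dictionary with unit-norm columns, it greedily selects at most tau1 atoms by orthogonal matching pursuit and returns the resulting sparse coefficient vector (the s-vector). The KSVD dictionary update itself is omitted; in practice a KSVD toolkit such as [135] performs both steps.

import numpy as np

def extract_s_vector(D, f, tau1):
    """Type I s-vector extraction by orthogonal matching pursuit (cf. eq. 3.25).

    D    : (R, K) dictionary (T-check) with unit l2-norm columns
    f    : (R,)   projected, variance-normalized mean supervector V F-check
    tau1 : maximum number of non-zero coefficients (l0 constraint)
    """
    K = D.shape[1]
    residual = f.copy()
    support = []
    x = np.zeros(K)
    coeffs = np.zeros(0)
    for _ in range(tau1):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # jointly re-fit the selected atoms by least squares on the support
        coeffs, *_ = np.linalg.lstsq(D[:, support], f, rcond=None)
        residual = f - D[:, support] @ coeffs
        if np.linalg.norm(residual) < 1e-8:
            break
    x[support] = coeffs
    return x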
The reason that V \check F_j has larger values at small indexes is that the top rows of V have larger eigenvalues.

Figure 3.7: One example of the original and the reconstruction of the mean supervector V \check F_j for the type I s-vector. Blue is the original, red is the reconstruction. (left) V \check F_j; (right) top 800 dimensions of V \check F_j.

As in [145, 157], each row of V has unit l2 norm, which makes the sparse representation valid. From Fig. 3.7 (right), we can observe that the larger the eigenvalue (the smaller the index), the smaller the reconstruction error and the more important the dimension is in the dictionary learning.

3.2.2 Experimental Results

The databases used for evaluation are RATS and NIST SRE 2010, respectively. Detailed descriptions of these two databases and the evaluation protocols are presented in Section 2.2.

Table 3.11: Performance of the proposed methods with SVM modeling for the LID RATS 120-second task (EER (%) / miss rate (%) at 10% false alarm)
ID | Method | WCCN | MFCC-SDC 56dim | MFCC 36dim | GFCC 44dim
1 | I-vector | × | 6.6/4.5 | 5.7/4.2 | 6.1/4.3
2 | I-vector | √ | 6.1/4.4 | 5.7/3.9 | 5.8/3.9
3 | Simplified I-vector | × | 7.6/6.7 | 7.0/5.3 | 7.3/6.1
4 | Simplified I-vector | √ | 6.2/4.4 | 5.7/4.2 | 6.3/4.4
5 | Simplified Supervised I-vector | × | 6.6/4.9 | 5.1/3.5 | 5.6/3.8
6 | Simplified Supervised I-vector | √ | 6.3/4.5 | 4.8/3.0 | 5.4/3.7
7 | type I s-vector | × | 7.2/5.6 | 6.4/4.4 | 6.7/5.2
8 | type I s-vector | √ | 6.3/4.8 | 6.0/4.1 | 6.6/4.9
9 | type II s-vector | × | 8.5/7.4 | - | -
10 | type II s-vector | √ | 8.1/7.1 | - | -
11 | Fusion ID 6+8 | √ | 5.5/4.0 | 4.5/2.6 | 5.2/3.2

Table 3.12: Performance of score-level fusion with systems based on multiple features (EER (%) / miss rate (%) at 10% false alarm)
ID | duration | Method | MFCC-SDC | MFCC | GFCC | Fusion 3 features
6 | 120 s | Simplified supervised i-vector | 6.3/4.5 | 4.8/3.0 | 5.4/3.7 | 3.7/1.9
11 | 120 s | Simplified supervised i-vector + type I s-vector | 5.5/4.0 | 4.5/2.6 | 5.2/3.2 | 3.2/2.0

3.2.2.1 LID

From Table 3.11, we can see that the type I s-vector performed significantly better than the type II s-vector, which might be due to the dictionary learning and dictionary initialization setup. By training a small dictionary separately for each target language and then merging them into a joint dictionary used as the KSVD initialization, the projection on the joint dictionary becomes naturally sparse, which makes the sparse representation valid and even more discriminative. Furthermore, since the simplified supervised i-vector and the type I s-vector are different, complementary and efficient in terms of complexity (Table 2.3), we fuse these two systems together (ID 11) to further improve the overall system performance. This fusion resulted in around 10%, 20% and 10% relative EER reduction for the MFCC-SDC, MFCC and GFCC features, respectively. Finally, we also observed a dramatic error reduction by fusing the three feature-set-based systems together as our final system in Table 3.12. Fig. 3.8 shows the DET curves for each feature set as well as the three-feature-set fusion.

Figure 3.8: DET curves for the LID performance in Table 3.12: (left) ID 11, (right) ID 6.

The final system achieved 3.2% EER and a 2.0% miss rate at 10% false alarm.
3.2.2.2 SRE

The performance of the s-vector systems using LDA with cosine distance scoring is shown in Fig. 3.9 and Fig. 3.10, respectively. Compared to raw scoring (EER 15.23% in Table 3.14 ID 1 and EER 12.48% in Table 3.13 ID 2), applying LDA on top of the s-vectors significantly improved the performance (by more than 40% relative), which might be because the majority of s-vector coefficients are zero. Furthermore, both the EER and the normalized old minDCF cost continued to decrease with decreasing LDA dimensionality, while the normalized new minDCF cost achieved its best result at an LDA dimension of 500-600. This suggests that a larger LDA dimension may be needed when optimizing the new minDCF values. Moreover, for the type I s-vector, all the results (EER, old and new normalized minDCF values) were significantly better (by more than 15% relative) than those of the type II s-vector, as in the LID task. This might be due to the KSVD dictionary learning and the l0 norm constraint.

Figure 3.9: EER and normalized minDCF values against LDA dimension for the type II s-vector, SRE female 05.nve-nve.phn-phn.

Figure 3.10: EER and normalized minDCF values against LDA dimension for the type I s-vector, SRE female 05.nve-nve.phn-phn.

Table 3.13: Performance of the type I s-vector system using PLDA modeling (SRE female 05.nve-nve.phn-phn)
ID | τ1 | K | M | LDA | WCCN | PLDA U | PLDA G | PLDA #EM | Snorm | EER% | norm minDCF new | norm minDCF old
1 | 800 | 2575 | 2500 | × | × | × | × | × | × | 11.87 | 0.93 | 0.55
2 | 200 | 600 | 2500 | × | × | × | × | × | × | 12.48 | 0.91 | 0.54
3 | 400 | 2575 | 2500 | × | × | × | × | × | × | 13.21 | 0.87 | 0.57
4 | 200 | 600 | 2500 | 200 | × | × | × | × | × | 6.75 | 0.77 | 0.32
6 | 200 | 600 | 2500 | 200 | × | × | × | × | √ | 5.56 | 0.63 | 0.28
7 | 200 | 600 | 2500 | 250 | √ | × | × | × | √ | 5.95 | 0.65 | 0.30
8 | 200 | 600 | 2500 | 200 | √ | × | × | × | √ | 5.66 | 0.65 | 0.30
9 | 200 | 600 | 2500 | 200 | √ | 200 | 80 | 20 | √ | 5.02 | 0.61 | 0.26
10 | 200 | 600 | 2500 | 200 | √ | 200 | 80 | 50 | √ | 4.50 | 0.60 | 0.26
11 | 200 | 600 | 2500 | 200 | √ | 200 | 65 | 50 | √ | 4.18 | 0.61 | 0.26

Table 3.14: Performance of the type II s-vector system using PLDA modeling (SRE female 05.nve-nve.phn-phn)
ID | τ2 | LDA | WCCN | PLDA U | PLDA G | PLDA #EM | Snorm | EER% | norm minDCF new | norm minDCF old
1 | 6 | × | × | × | × | × | × | 15.23 | 0.96 | 0.65
2 | 6 | 600 | × | × | × | × | × | 10.16 | 0.72 | 0.42
3 | 6 | 600 | √ | × | × | × | × | 11.79 | 0.71 | 0.44
4 | 6 | 600 | × | 100 | 100 | 10 | × | 11.22 | 0.97 | 0.55
5 | 6 | 100 | × | 100 | 100 | 10 | × | 8.97 | 0.85 | 0.43
6 | 6 | × | × | 100 | 100 | 10 | × | 11.02 | 0.90 | 0.52
7 | 6 | × | × | 100 | 100 | 20 | × | 8.44 | 0.83 | 0.42
8 | 6 | × | × | 100 | 100 | 20 | √ | 4.80 | 0.62 | 0.27
9 | 8 | × | × | 100 | 100 | 20 | √ | 4.86 | 0.65 | 0.26
10 | 6 | × | × | 150 | 50 | 20 | × | 8.73 | 0.75 | 0.41
11 | 6 | 150 | × | 150 | 50 | 20 | √ | 7.61 | 0.85 | 0.34
12 | 6 | × | × | 150 | 50 | 20 | √ | 4.83 | 0.55 | 0.28

The results for the type I and type II s-vector systems with WCCN, PLDA and Snorm are shown in Tables 3.13 and 3.14. In Table 3.13, we can see that the performance was similar for different dictionary sizes and l0 constraint parameters (system ID 1-3). The 400-800 non-zero elements in IDs 1 and 3 match the rank of the i-vector baseline system, while K = 600 also makes sense since the number of valid and useful eigen-directions is around 400-600. The sparsity constraints of all three configurations are around one third of the dictionary size, which supports our assumption that both the i-vector and s-vector systems use the dictionary size and constraint parameters to partially reconstruct the mean supervector. Similarly, the type II s-vector system is not very sensitive to the constraint values τ2 in equation (3.26). A larger τ loosens the norm constraint, which results in a more accurate least squares solution.
However, a large norm constraint also slows the Lasso or OMP computation and may violate the sparsity assumption of the s-vector. With a large τ, the proposed solution becomes the standard generalized least squares estimator. On the other hand, setting τ to be very small increases the residual between \tilde F and T x, which may lead to inaccurate s-vectors. Thus, a balanced τ is preferred. The Snorm score normalization achieved large improvements in both the EER and minDCF cost values for the type II s-vector system, while WCCN contributed the most for the simplified and supervised simplified i-vector systems. This might be because, in the type II s-vector solution (3.26), the projection is performed on a different dictionary (\Sigma^{-1/2} N_j^{1/2} T) for each utterance (N_j), and therefore depends more on score normalization. Also, adding WCCN on top of LDA (dimension 200 for type I, Table 3.13 ID 7, and 600 for type II, Table 3.14 ID 3) did not help, which may suggest nonlinear session variabilities. Therefore, PLDA was applied for modeling. PLDA modeling improved the system performance with small-rank sub-matrices (U and G). This matches the result in [111] that rank-90 U and G achieved the best performance. The best result was observed with 200 eigenvoices and 65 eigenchannels for the type I s-vector and 150 eigenvoices and 50 eigenchannels for the type II s-vector, which matches the parameter-setting ratio (300 eigenvoices and 100 eigenchannels) used in the JFA framework. Finally, the type I s-vector system of Table 3.13 ID 11 serves as the s-vector system in the score-level fusion.

The performance of each single system as well as the score-level fusion is shown in Table 3.15. Since the JFA and i-vector baselines are different and complementary, fusing these two systems can significantly improve the performance, as has already been widely adopted in the SRE task. In this work, we have already demonstrated that the simplified supervised i-vector achieves the same performance as the traditional i-vector. We now show that, after fusing with the JFA baseline, the fusion system also preserves this improvement. Considering its low complexity, the simplified supervised i-vector might be adopted to replace the traditional i-vector in SRE tasks on very large databases.

Table 3.15: Performance of the proposed systems when fusing with the JFA and i-vector baseline systems for the SRE task
ID | System | EER (%) | norm minDCF new | norm minDCF old
1 | JFA linear scoring ZTnorm | 3.62 | 0.41 | 0.193
2 | I-vector LDA WCCN PLDA Snorm baseline | 3.08 | 0.52 | 0.19
3 | Simplified Supervised I-vector LDA WCCN PLDA Snorm | 3.12 | 0.52 | 0.18
4 | Type I s-vector LDA PLDA Snorm | 4.18 | 0.61 | 0.26
5 | Fusion ID 1 + ID 2 | 2.75 | 0.39 | 0.16
6 | Fusion ID 1 + ID 3 | 2.75 | 0.37 | 0.15
7 | Fusion ID 3 + ID 4 | 3.05 | 0.52 | 0.18

From Fig. 3.11, we can see that the proposed fusion system (Table 3.15 ID 6) performed better in the very low false alarm region, which may fit the new DCF values better. Nevertheless, compared to the large fusion benefit in the LID task, fusing the type I s-vector and the simplified supervised i-vector together in the SRE task did not achieve an improvement. This might be because PLDA serves as the modeling method for both systems and the difference between them is not large enough. The improvement reported in [97] might be because, in that work, the i-vector baseline did not use PLDA modeling while the type II s-vector did. How to effectively combine the s-vector with the simplified supervised i-vector for speaker verification is left as future work.
Figure 3.11: DET curves of the proposed systems in Table 3.15.

Chapter 4: Eigenchannel for speaker states representation

4.1 Methods

4.1.1 GMM baseline

The Gaussian Mixture Model (GMM) is adopted to model the MFCC features. In the proposed work, a universal background model (UBM) in conjunction with a maximum a posteriori (MAP) model adaptation approach [133] is used to model different speaker states in a supervised manner. Let the UBM be an N-component GMM \tilde\lambda = \{p_i, \tilde\mu_i, \Sigma_i\}, i = 1, \cdots, N, where p_i and \Sigma_i are the i-th UBM mixture weight and diagonal covariance matrix, and \tilde\mu_i is the mean of the i-th Gaussian component of the UBM. For each target speech segment, a GMM was adapted from the UBM by MAP adaptation [133]. As shown in (4.1), the GMMs are modeled with diagonal covariance matrices and only the means are adapted:

\mu_i = \alpha_i E_i(x) + (1 - \alpha_i) \tilde\mu_i, \quad \alpha_i = \frac{\sum_{t=1}^{T} \Pr(i | x_t)}{\beta + \sum_{t=1}^{T} \Pr(i | x_t)},   (4.1)

where \Pr(i | x_t) denotes the occupancy probability of feature frame x_t (t = 1, \cdots, T) belonging to the i-th Gaussian component, and \beta is the constant relevance factor. The GMM mean supervector M is defined as the concatenation of the GMM mean vectors, M = [\mu_1^t, \cdots, \mu_i^t, \cdots, \mu_N^t]^t [26, 29]. Assuming the feature vectors are D-dimensional, the GMM mean supervector M is an ND-dimensional vector.

4.1.2 Eigenchannel matrix estimation

In the GMM LFA framework for speaker verification [29], we can consider paralinguistic speech as normal average speech corrupted by affective channel variability. Let us denote by M_{k,c} the speaker- and affective-channel-dependent mean supervector. Then M_{k,c} can be decomposed into a speaker-dependent mean supervector plus the channel variability U y, where U is the low-rank Eigenchannel matrix learned by Principal Component Analysis (PCA) on the pooled within-speaker covariance matrix:

M_{k,c} = M_k + U y.   (4.2)

U is a factor loading matrix and the components of y are the speaker state channel factors [29]. In order to train the Eigenchannel matrix, we need data from multiple speakers; furthermore, for each speaker, there should be speech utterances from multiple speaker states. First, for each speaker k, k = 1, \cdots, K, and all of that speaker's utterances j = 1, \cdots, J_k, the UBM is adapted to obtain a supervector M_{kj}. Second, the corresponding speaker "true" supervector is estimated by averaging all the supervectors of this speaker:

\tilde M_k = \frac{1}{J_k} \sum_{j=1}^{J_k} M_{kj}, \qquad \acute M_{kj} = M_{kj} - \tilde M_k.   (4.3)

Then we concatenate all the speakers' state-variability supervectors \acute M_{kj} into a variability supervector matrix S with ND rows and J columns (J = \sum_{k=1}^{K} J_k):

S = [\acute M_{11}, \cdots, \acute M_{1 J_1}, \cdots, \acute M_{K1}, \cdots, \acute M_{K J_K}].   (4.4)

Finally, the Eigenchannel matrix U is given by the R PCA eigenvectors of the within-speaker covariance matrix (1/J) S S^t that correspond to the R largest eigenvalues [29].
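A minimal Python/NumPy sketch of this Eigenchannel matrix estimation, following (4.3)-(4.4), is given below (hypothetical names; it assumes the per-utterance MAP-adapted mean supervectors have already been computed). The R leading eigenvectors of the pooled within-speaker covariance are obtained via a thin SVD of the centered supervector matrix rather than by forming the ND x ND covariance explicitly.

import numpy as np

def estimate_eigenchannel_matrix(supervectors_by_speaker, R):
    """Estimate U from speaker-state variability supervectors (eqs. 4.3-4.4).

    supervectors_by_speaker : list of (J_k, ND) arrays, one per speaker
    R                       : desired Eigenchannel rank
    Returns U (ND, R) and the R leading eigenvalues of (1/J) S S^t.
    """
    residuals = []
    for M_k in supervectors_by_speaker:
        speaker_mean = M_k.mean(axis=0)            # eq. (4.3): speaker "true" supervector
        residuals.append(M_k - speaker_mean)       # state-variability supervectors
    S = np.concatenate(residuals, axis=0).T        # ND x J matrix of eq. (4.4)
    J = S.shape[1]
    # eigenvectors of (1/J) S S^t via thin SVD of S
    U_full, singular_values, _ = np.linalg.svd(S, full_matrices=False)
    eigvals = (singular_values ** 2) / J
    return U_full[:, :R], eigvals[:R]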
4.1.3 Eigenchannel factor extraction

Based on the latent factor analysis framework, the speaker state factor vector y is estimated as follows [23, 29]:

y = (A + E^{-1})^{-1} \sum_{i=1}^{N} U_i^t \Sigma_i^{-1} \sum_{t=1}^{T} \gamma_i(t) (x_t - \mu_i),   (4.5)

A = \sum_{i=1}^{N} U_i^t \Sigma_i^{-1} U_i \sum_{t=1}^{T} \gamma_i(t),   (4.6)

where U_i, \gamma_i(t) and x_t denote the D x R sub-matrix of U corresponding to the i-th Gaussian component, the occupancy probability of the t-th feature on the i-th Gaussian component, and the t-th feature vector, respectively; U_i^t is the transpose of U_i. The diagonal matrix E contains the R leading eigenvalues associated with the Eigenchannel matrix U. Details about the Eigenchannel factor extraction are provided in [23, 29].

For each utterance, the acoustic MFCC features are mapped into the Eigenchannel factor vector y. A back-end SVM classifier is then trained using LIBSVM [30] to model the multi-class speaker state categories.
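The following Python/NumPy sketch (hypothetical names) maps one utterance onto the Eigenchannel factor vector y of (4.5)-(4.6), assuming the zero- and first-order Baum-Welch statistics against the UBM have already been accumulated.

import numpy as np

def eigenchannel_factors(U, ubm_means, ubm_vars, eigvals, gamma, first_order):
    """Eigenchannel factor vector y for one utterance (eqs. 4.5-4.6).

    U           : (N*D, R) Eigenchannel matrix
    ubm_means   : (N, D)   UBM component means mu_i
    ubm_vars    : (N, D)   UBM diagonal covariances Sigma_i
    eigvals     : (R,)     leading eigenvalues (diagonal of E)
    gamma       : (N,)     zero-order statistics  sum_t gamma_i(t)
    first_order : (N, D)   first-order statistics sum_t gamma_i(t) x_t
    """
    N, D = ubm_means.shape
    R = U.shape[1]
    A = np.zeros((R, R))
    b = np.zeros(R)
    for i in range(N):
        U_i = U[i * D:(i + 1) * D, :]                        # sub-matrix for component i
        inv_var = 1.0 / ubm_vars[i]
        A += (U_i * inv_var[:, None]).T @ U_i * gamma[i]     # eq. (4.6)
        centered = first_order[i] - gamma[i] * ubm_means[i]  # sum_t gamma_i(t) (x_t - mu_i)
        b += U_i.T @ (inv_var * centered)                    # right-hand side of eq. (4.5)
    return np.linalg.solve(A + np.diag(1.0 / eigvals), b)    # (A + E^{-1})^{-1} b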
4.2 Corpora, tasks and feature extraction

The proposed Eigenchannel factor vector modeling approach is evaluated on two speaker state recognition tasks: intoxicated speech detection, in Section 4.2.1, and speaker emotion classification, in Section 4.2.2.

4.2.1 Intoxicated speech detection

The Alcohol Language Corpus (ALC) database [140], comprising 154 German speakers and released in the 2011 speaker state challenge [140], was used to study the intoxicated speaker state recognition task. The two speaker states of interest are intoxicated (indicated by a blood-alcohol content above 0.5 mg/L) and sober. First, in the GMM baseline system, a 512-component GMM was trained for each state on the 39-dimensional MFCC features of the training data set. Second, in the LFA framework, MAP adaptation from the UBM was performed for every utterance in both the training and development data sets. The GMM mean supervectors were generated by concatenating the mean vectors of all components of the adapted GMM. Then, the Eigenchannel matrix was trained using the mean supervectors from the training set, and Eigenchannel factor vector extraction was performed to map each utterance into the factor vector. Speaker normalization [16] was adopted on top of these factor vectors. The GMM size N and the Eigenchannel matrix rank R are 256 and 4, respectively. Finally, the LIBSVM toolkit [30] was adopted to perform this binary classification task on the 4-dimensional Eigenchannel factor vectors y.

4.2.2 Speaker emotion classification

In the speaker emotion study, we use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [25]. This database contains approximately 12 hours of audio-visual data from five mixed-gender pairs of actors [25]. IEMOCAP contains detailed face and head information obtained from motion capture as well as video, audio and transcripts of each session. Two act types were used: scripted dialogs and improvisations of hypothetical scenarios. The goal was to elicit emotional displays that resemble natural emotional expression. Dyadic sessions of approximately 5 minutes in length were recorded and later manually segmented into utterances. Each utterance was annotated by at least 3 annotators with categorical labels (anger, happiness, neutrality, etc.). We examine all 10 available speakers and use only the speech modality. We consider the classes of anger, happiness, excitation, neutrality and sadness where there was majority consensus across the three annotators, and we merge the classes of happiness and excitation into a single class which we will refer to as happiness.

We organize our emotion recognition experiments using 10-fold leave-one-speaker-out cross validation. The mean and standard deviation of the number of test utterances across the folds are: 62 ± 28, angry; 87 ± 26, happy; 58 ± 23, neutral; and 65 ± 25, sad. The 30-dimensional feature set is composed of 13 MFCC features, energy, pitch, and their first-order derivatives. The GMM size is 512 and the Eigenchannel matrix rank is 26.

4.3 Experimental Results

The classification accuracies on the development set are shown in Table 4.1. The proposed LFA Eigenchannel factor modeling approach outperformed the GMM baseline, achieving 3.94% and 5.34% absolute improvements in weighted and unweighted accuracy, respectively.

Table 4.1: Unweighted accuracy (UA) and weighted accuracy (WA) [140] on the development set of the ALC database in the 2011 speaker state challenge.
System | WA | UA
GMM baseline | 65.33% | 65.05%
LFA Eigenchannel Factor Modeling | 69.27% | 70.39%

Table 4.2: Unweighted accuracy (UA) and weighted accuracy (WA) per utterance for 10-fold leave-one-speaker-out cross validation on the IEMOCAP database.
System | WA | UA
GMM baseline | 54.11% | 54.35%
LFA Eigenchannel Factor Modeling | 55.88% | 55.84%

In Table 4.2, the speaker-independent emotion classification results averaged over the 10 folds are presented. Only moderate improvements are achieved (1.77% WA and 1.49% UA). This might be because the emotional states are not stable and vary dynamically both within and across utterances. It is shown in [73] that factor-analysis-based speaker factor vectors can be used for the speaker change point detection task with promising results. Therefore, our further work will focus on analyzing and tracking the proposed speaker state factor vectors along an entire conversation or dialog using a sliding-window framework, which may have great potential for speaker state change point detection and tracking.

The SVM classification results against the Eigenchannel matrix rank R are shown for both tasks in Fig. 4.1. The results are not sensitive to the Eigenchannel matrix rank, and even a small rank (R < 5) achieves competitive results. This property indicates that the proposed Eigenchannel factor vector is a highly informative and effective speaker state feature. Furthermore, the matrix rank in the intoxicated speaker state recognition task is smaller than in the speaker emotion classification task, which might be due to the smaller number of speaker state categories.

Figure 4.1: Accuracy against Eigenchannel matrix rank: (left) intoxicated speech detection task, (right) speaker emotion recognition task.

In Figure 4.2 we plot the first two dimensions of the Eigenchannel factor vectors of the training data instances from the first fold (the plots are similar across folds). We observe that different emotions tend to occupy different, although overlapping, regions of the two-dimensional space, suggesting the discriminative ability of these two dimensions. Moreover, the first dimension seems to carry activation-related information. Activation is an emotional attribute describing how active versus passive an emotional state is.
Figure 4.2: First two dimensions of the Eigenchannel factor vectors of the fold-0 IEMOCAP training data (angry, happy, neutral and sad utterances).

Typical examples of highly activated emotions include anger, happiness and excitement, while neutrality and sadness are usually described by medium and low activation, respectively [35]. A similar structure can be observed along the first dimension, where angry and happy utterances tend to have high values while neutral and sad utterances have medium and low values, respectively. This suggests that some of the computed Eigenchannel factors may carry interpretable (here, emotion-related) information.

Future work includes analyzing and tracking the proposed speaker state factor vectors along an entire speech conversation or dialog using a sliding-window framework. Moreover, factor-analysis-based i-vectors [39] and Lasso-based s-vectors [98], in combination with variability compensation methods such as Within Class Covariance Normalization (WCCN) and Probabilistic Linear Discriminant Analysis (PLDA), may also be employed to model the speaker states.

Chapter 5: A general optimization framework for representation

5.1 A general optimization framework

From the objective functions of the supervised i-vector, the simplified i-vector and the sparse-coding-based s-vector with fixed variance, the aforementioned algorithms can be summarized as a general optimization problem:

\min_{T,W,x} \sum_{j=1}^{\hat N} \Big( \underbrace{\tfrac{\alpha}{2} x_j^t x_j}_{\text{prior}} + \underbrace{\tfrac{1}{2} (\tilde F_j - T x_j)^t \Sigma_1^{-1} N_j (\tilde F_j - T x_j)}_{\text{mean supervector reconstruction error}} + \underbrace{\tfrac{1}{2} (L_j - W x_j)^t \Sigma_2^{-1} n_j (L_j - W x_j)}_{\text{label regression error}} \Big) \quad \text{subject to} \quad \underbrace{\|x_j\|_1 < \tau_1}_{\text{constraint}}.   (5.1)

Figure 5.1: A general optimization framework.

Table 5.1: Parameter settings of the generalized optimization problem (5.1) that recover each solution (blank entries denote the general, trained setting).
α | τ1 | Σ1 | Σ2 | N_j | Supervector | Solution
1 | ∞ | | | | \tilde F_j | Supervised i-vector
1 | ∞ | | ∞ | | \tilde F_j | I-vector
1 | ∞ | | ∞ | n_j | \hat F_j = m_j^{1/2} \tilde F_j | Simplified i-vector
1 | ∞ | | | n_j | \hat F_j = m_j^{1/2} \tilde F_j | Simplified supervised i-vector
0 | τ1 | 1/Q | ∞ | n_j | \check F_j = \Sigma_{ubm}^{-1/2} m_j^{1/2} \tilde F_j | Type I s-vector

As shown in Table 5.1, the traditional i-vector and the proposed simplified and supervised i-vectors, as well as the sparse-coding-based s-vector methods, are special cases of this generalized optimization problem. This opens the door to solving (5.1) with general optimization toolkits and to trying new distance measures and penalty cost functions, such as the Kullback-Leibler divergence between GMM models and the within-class covariance matrix.

Chapter 6: Acoustic and articulatory information fusion for speaker verification

6.1 Data

We used the Wisconsin X-Ray Microbeam database (XRMB) [155] for our analysis and experiments. A key feature of this database is that both articulatory measurements and the simultaneously recorded speech signal are available from multiple speakers. We selected read speech data (citation words, sentences and paragraphs) from sessions 1 to 101 for each speaker from JW11 to JW63, resulting in a total of 4034 utterances from 46 speakers with an average duration of 5.72 seconds per utterance. Note that we excluded speech sessions involving different speaking styles (such as fast or slow speech, emphasized speech, or stimuli that involved diadochokinesis). We also omitted speaker sessions where a speaker had to repeat an utterance, as well as those found to contain severe pellet tracking errors, as detailed in the XRMB manual [155].
Table 6.1 shows the two protocols that we adopted in the evaluation, namely "ALL" and "L5S". We used all sessions from speakers JW11 to JW40 (26 speakers, 2295 utterances) as the background data and selected session 11 (a paragraph session) of each speaker from JW41 to JW63 (20 speakers) as the target registration utterance. For testing, protocol "ALL" selects all the sessions (excluding session 11 in the target set) from speakers JW41 to JW63 (a total of 20 speakers and 1719 utterances), while protocol "L5S" adopts only the sessions that are longer than 5 seconds (a total of 20 speakers and 840 utterances). The reason for creating the separate "L5S" protocol is that some testing utterances in the "ALL" protocol are too short. In order to perform Test Segment Score Normalization (T-norm), we selected all other paragraph sessions (sessions 12, 79, 80 and 81, 95 utterances in total) in the background set as the T-norm set. To evaluate the performance of the acoustic-only baseline as well as the acoustic-estimated-articulatory system, we followed this protocol (as shown in Table 6.1) exactly. For the speech-real-articulation system, a subset of data was removed from the train, target and testing sets due to missing data in some articulatory channels [155]. We call this modified "ALL" protocol the "ALL-small" protocol. In the "ALL-small" protocol, each utterance is shorter, and there are 1849, 18 and 1389 utterances in the train, target and testing sets, respectively.

Table 6.1: Data set partition for the SRE experiments. "Other L5S sessions of JW41-63" denotes all sessions longer than 5 seconds (excluding session 11) of speakers JW41-JW63.
Data sets / Protocol | ALL | L5S
Background: all sessions of JW11-40 | √ | √
Target: session 11 of JW41-63 | √ | √
Test: other sessions of JW41-63 | √ |
Test: other L5S sessions of JW41-63 | | √
Tnorm: sessions 11, 12, 79, 80, 81 of JW11-40 | √ | √

6.2 Experimental Setup

6.2.1 Subject-independent inversion

We used the generalized smoothness criterion (GSC) for acoustic-to-articulatory inversion [54] under a subject-independent inversion setting [53]. The GSC estimates articulatory parameters given acoustic features such that the estimated parameters are the optimal solution satisfying two conditions: (1) the estimated trajectories are smooth and slowly varying, and (2) the difference between the estimated and original articulatory parameters, weighted by a similarity metric of the corresponding acoustic features, is minimum. The subject-independent inversion setting uses a probability feature vector (PFV) as the acoustic feature. The PFV is the normalized likelihood score of the conventional acoustic feature vector, i.e. MFCCs, with respect to the 40 clusters of a general acoustic model (GAM) [53]. The general acoustic model represents the variability of the acoustic space and was created with TIMIT data [50].

In subject-independent acoustic-to-articulatory inversion, the acoustics of an arbitrary test subject are converted to a PFV, which is then used to find the closest PFV from the chosen exemplar, whose articulatory data is used for training the inversion mapping. It is expected that the PFV reflects the acoustic sound produced by the test subject irrespective of the speaker, i.e., the PFVs corresponding to a sound recorded from different speakers, including the exemplar, should be similar to each other, so that speaker variability is eliminated in the inversion. The quality of this speaker variability elimination depends solely on the generalization of the GAM used to compute the probability feature vector.
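The probability feature vector described above can be illustrated with a short Python/NumPy sketch (hypothetical names; a 40-cluster GAM with diagonal covariances is assumed): each MFCC frame is mapped to its normalized likelihoods over the GAM clusters.

import numpy as np

def probability_feature_vector(mfcc, gam_means, gam_vars, gam_weights):
    """Map MFCC frames to normalized GAM-cluster likelihoods (one PFV per frame).

    mfcc        : (T, D) acoustic frames of the test subject
    gam_means   : (K, D) general acoustic model cluster means (K = 40 here)
    gam_vars    : (K, D) diagonal covariances
    gam_weights : (K,)   cluster weights
    """
    log_const = np.log(gam_weights) - 0.5 * np.sum(np.log(2 * np.pi * gam_vars), axis=1)
    diff = mfcc[:, None, :] - gam_means[None, :, :]
    log_lik = log_const[None, :] - 0.5 * np.sum(diff ** 2 / gam_vars[None], axis=2)  # (T, K)
    # normalize each frame's scores so they sum to one, removing speaker-dependent scale
    log_norm = log_lik - np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
    return np.exp(log_norm)   # (T, K) probability feature vectors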
Note that the GAM used in this work is built on the TIMIT training corpus, whereas the articulatory inversion is done on the XRMB corpus, whose acoustics may differ from TIMIT's, resulting in poorer elimination of speaker variability during inversion. This in turn is reflected in the estimated articulatory features, which, when used for the SRE task, provide inter-speaker discrimination in addition to the MFCCs. Thus, the SRE performance improvement in this work (Section 6.3) may result from the non-linear mapping between the acoustic and articulatory spaces and from the residual speaker-specific information present in the probability features computed during subject-independent acoustic-to-articulatory inversion.

6.2.2 Articulatory features

We used tract variables as articulatory parameters, as in a previous study [53]. The tract variables are estimated from an Electromagnetic Articulography database [75], which includes speech audio spoken by a native female speaker of American English together with its parallel articulatory data. The speaker was asked to read 460 English sentences (approximately 69 minutes), identical to the sentences of the MOCHA-TIMIT database [156]. The tract variables comprise nine articulatory parameters, such as lip aperture (LA), lip protrusion (PRO), jaw opening (JAW OPEN), and the constriction degree (CD) and constriction location (CL) of the tongue tip (TT), tongue blade (TB) and tongue dorsum (TD). The constriction location parameter for each tongue sensor is the distance from a fixed point on the palatal line, chosen manually by visual inspection, to the projection of each sample onto the palatal line. We followed the definitions in the previous study [53] for the other parameters, such as LA, PRO, JAW OPEN and the CDs.

Fig. 6.1 shows an error bar plot of the pair-wise correlation coefficients between the estimated articulatory signals of session one (the data of 46 speakers) after temporal alignment of the utterance pairs. All speakers spoke the same word sequence in this case, allowing us to compare inter-speaker variations. Dynamic time warping (DTW), applied to the estimated articulation, was used to remove possible speaking-rate confounds for this correlation study.

Figure 6.1: Error bars of pair-wise correlation coefficients between session-one estimated articulatory signals (after DTW) from all 46 speakers (all spoke the same word sequence; 1081 pair-wise DTW and correlation computations in total). The nine dimensions of the estimated articulation are LA, PRO, JAW OPEN, TTCD, TBCD, TDCD, TTCL, TBCL and TDCL, respectively.

Fig. 6.1 shows that the tongue constriction location features (dimensions 7, 8, 9) have more inter-speaker variation than the tongue constriction degree features (dimensions 4, 5, 6). Lip aperture and tongue body constriction location (dimensions 1 and 8) show relatively low correlation, implying that their inter-speaker variations are larger than those of the other tract variables. Fig. 6.2 shows the estimated articulatory signals after DTW for LA and TBCL of session one from two speaker pairs. Among all the speaker pairs, the pair of speakers JW48 and JW33 has the highest correlation, while the pair of speakers JW48 and JW59 has the lowest (JW48 and JW33 are female; JW59 is male).

Figure 6.2: Estimated articulatory signals (after DTW) of lip aperture and tongue body constriction location of session one from two speaker pairs. The top plots are for the pair of speakers JW48 and JW33, and the bottom plots are for the pair of speakers JW48 and JW59. The first speaker pair shows high correlation, while the other pair shows low correlation.
(a) LA of file TP001 2 from JW48 and JW33; (b) TBCL of file TP001 2 from JW48 and JW33; (c) LA of file TP001 from JW48 and JW59; (d) TBCL of file TP001 from JW48 and JW59.

We can see from Figs. 6.2 (a)-(b) that, even with the highest correlation, the estimated articulatory signals are not exactly the same. We investigated this further and found that the mean and variance values of these signals can differentiate between different speakers to a large extent, as shown in Figs. 6.2 (c)-(d). In order to test our assumption that the mean and variance might carry inter-speaker variability information, we performed a simple multi-class SVM experiment.

Table 6.2: Performance of 26-speaker-class (closed set) identification systems based on different utterance-level features derived from estimated articulatory data.
Features / System | 1 | 2 | 3
mean | √ | √ | √
variance | | √ | √
mean crossing rate | | | √
Accuracy | 32% | 48% | 52%

Table 6.2 shows the performance of speaker classification with different utterance-level features derived from the estimated articulatory trajectories. The number of speaker classes is 26. Sessions 12, 79, 80 and 81 of all 26 speakers in the background data set (this is actually the T-norm set) were used as the training set, and session 11 was used as the test data. Using only the mean and variance, system 2 achieves around 50% accuracy, indicating that these statistics carry valuable information about inter-speaker variation. This result also suggests normalizing the mean and variance of the estimated articulatory parameters when the goal is to minimize speaker-dependent information.

6.2.3 Front end processing

Wiener filtering [3] was adopted for the X-Ray Microbeam data to reduce stationary artifact noise. After voice activity detection (VAD), non-speech frames were eliminated and cepstral features were extracted. The real and estimated articulatory signals were also truncated based on the VAD results and then re-sampled at 100 Hz. A 25 ms Hamming window with 10 ms shifts was adopted for MFCC extraction. Each utterance was converted into a sequence of 36-dimensional feature vectors, each consisting of 18 MFCC coefficients and their first derivatives. Cepstral mean subtraction and variance normalization (MVN) were performed to normalize the MFCC and real articulatory features to zero mean and unit variance on a per-utterance basis. The reason for performing MVN on the real articulatory signals is that the baseline values of these sensors already encode the vocal tract shape information of the speakers [155]. For the estimated articulatory signals, we do not perform MVN, since they are calculated from the speech signal and their mean and variance are useful for speaker recognition. After MVN, the MFCCs are concatenated with the real or estimated articulatory signals to generate the two enhanced feature sets: MFCC-real-articulation and MFCC-estimated-articulation.

Table 6.3: Performance of the MFCC-real-articulation system with the "ALL-small" protocol
ID | System | OptDCF (%) | EER (%)
1 | MFCC-only | 11.04 | 11.95
2 | MFCC-real-articulation | 9.98 | 10.15
3 | Score level fusion 1+2 | 6.42 | 6.77

6.2.4 GMM baseline modeling

A UBM in conjunction with a MAP model adaptation approach [14] was used to model different speakers in a supervised manner. All the data in the background set was used to train a 256-component UBM, and MAP adaptation was performed using the training set data for each speaker.
A relevance factor of 16 was used for the MAP adaptation. We performed AT-norm to calibrate the scores. Every testing utterance is scored against every target sample to generate the trials. The reason for using the GMM baseline here, rather than the state-of-the-art i-vector PLDA method, is that the data set is too small to train a large-scale factor analysis model.

6.3 Experimental Results and Discussions

We evaluate system performance in terms of identification weighted accuracy, verification EER, and the old OptDCF cost value [119]. Table 6.3 shows the performance of the MFCC-only and MFCC-real-articulation feature systems with the "ALL-small" protocol. We can see that, by augmenting with the measured articulatory features (although mean and variance normalized), the enhanced feature set reduced the EER from 11.95% to 10.15%. Score-level fusion of these two systems further reduced the EER to 6.77%, a 40% relative EER reduction compared to the MFCC-only system. Thus it is clear that adding real articulation information enhances the SRE performance.

Tables 6.4 and 6.5 show the SRE performance when estimated articulatory features are used to augment the MFCCs in the "ALL" and "L5S" protocols, respectively. Here we observe patterns similar to those in Table 6.3. The MFCC-estimated-articulation features achieved 4% and 8% relative EER reductions for the "ALL" and "L5S" protocols, respectively, and score-level fusion further increased these relative reductions to 9% and 14%. However, it should be noted that the improvement is not as large as in the real-articulation case of Table 6.3. The SRE performance improvement using MFCC-estimated-articulation, though moderate, suggests the potential benefit that estimated articulatory features may provide in an SRE task. This is particularly important because, in a real-world SRE application, we only have access to the speech signal. In such scenarios, only articulatory inversion can provide information about a speaker's morphological characteristics in terms of the estimated articulatory features. The experimental results in this work show that estimated articulatory features indeed provide production-oriented information (complementary to the MFCCs) for discriminating between speakers. This is also shown by the Detection Error Trade-off (DET) curves in Fig. 6.3, which clearly demonstrate that adding estimated articulatory features improves the SRE performance.

Table 6.4: Performance of the MFCC-estimated-articulation system with the "ALL" protocol
ID | System | OptDCF (%) | EER (%) | Accuracy (%)
1 | MFCC-only | 8.68 | 8.73 | 89.65
2 | MFCC-estimated-articulation | 8.40 | 8.44 | 90.92
3 | Score level fusion 1+2 | 7.83 | 7.91 | 91.74

Table 6.5: Performance of the MFCC-estimated-articulation system with the "L5S" protocol
ID | System | OptDCF (%) | EER (%) | Accuracy (%)
1 | MFCC-only | 4.84 | 4.88 | 95.95
2 | MFCC-estimated-articulation | 4.34 | 4.52 | 97.14
3 | Score level fusion 1+2 | 4.05 | 4.17 | 97.02

6.4 Conclusion

We proposed a practical feature-level fusion approach for speaker verification using information from both acoustic and articulatory representations. We found that speaker verification performance improves by concatenating articulation features derived from measured articulatory movement data during speech production with conventional MFCCs. However, since access to measured articulatory movement is impractical for real-world speaker verification applications, we also experimented with estimated articulatory features obtained through acoustic-to-articulatory inversion.
Specifically, we showed that augmenting MFCCs with the estimated articulatory features also significantly enhances speaker verification performance. Our future work includes investigating better inversion methods that can maximize the inter-speaker articulatory variations, as well as applying the proposed MFCC-estimated-articulation feature to the NIST SRE data sets with state-of-the-art methods.

Figure 6.3: DET curves of the speech-only (ID 1) and fusion (ID 3) results.

Chapter 7: Multimodal physical activity recognition

7.1 Feature extraction

A feature is a characteristic measurement, transform, or structural mapping extracted from the input data to represent important patterns of the desired phenomena (physical activity, PA, in our case) with reduced dimension. For example, the standard deviation of an accelerometer reading and the mean of the instantaneous heart rate from the ECG are good candidate PA cues or features. Furthermore, utilizing the complementary characteristics of different types of features can offer substantial improvement in recognition accuracy over single-type features, depending upon the information being combined and the fusion methodology adopted [134]. In this section, we describe the proposed time-domain and cepstral feature extraction processes in detail.

Figure 7.1: The proposed physical activity recognition system overview.

7.1.1 Temporal feature extraction

We consider four types of temporal features. Features in the first set, which we denote as "conventional", were selected based on their efficacy as demonstrated in the literature on wireless body area sensor networks; for the accelerometer, the conventional features are shown in Table 7.1, and for the ECG sensor, the mean and variance of the instantaneous heart rate constitute the conventional features. The other three feature sets are comprised of features that describe the discriminative activity information in the ECG signals. These features result from more complex processing of the ECG signal: (i) the principal component analysis (PCA) error vector, previously studied in [122] for body movement activity recognition, (ii) the Hermite polynomial expansion (HPE) coefficients, and (iii) the standard deviation of multiple normalized beats, the latter two of which are novel to our work. These techniques model the underlying signals, and the resulting model parameters are the features. First, we describe the required pre-processing of the collected biometric ECG signal, then the ECG temporal feature extraction, and finally the temporal accelerometer features.

7.1.1.1 Pre-processing of the ECG Signal

Each type of body movement induces a particular type of motion artifact in the ECG signal. If there are M hypothesized activities, then for the j-th heartbeat observation under the i-th activity, the continuous-time recorded ECG signal r_{ij}(t) is modeled as [121, 122]

r_{ij}(t) = \theta_i(t) + \chi_{ij}(t) + \eta_{ij}(t),   (7.1)

where \theta_i(t) is the cardiac activity mean (CAM), i.e. the normal heart signal, \chi_{ij}(t) is an additive motion artifact noise (MAN) due to the i-th class of activities, and \eta_{ij}(t) is the sensor noise present in the ECG signal. Since the length of each heartbeat differs due to inherent heart rate variability, the first step of pre-processing normalizes each heartbeat waveform to the same time duration (in the phase domain) and amplitude range [33, 122].
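A minimal sketch of this per-beat normalization step is given below (Python/NumPy, hypothetical names): each detected beat is resampled to a fixed number of phase-domain bins and rescaled to a common amplitude range. The beat segmentation itself (R-peak detection) is assumed to have been done already.

import numpy as np

def normalize_beat(beat, num_bins=201):
    """Normalize one ECG heartbeat to a fixed length and amplitude range.

    beat     : 1-D array with the samples of a single heartbeat
    num_bins : number of phase-domain bins D used for all beats
    """
    # resample to D bins by linear interpolation (duration normalization)
    src = np.linspace(0.0, 1.0, num=len(beat))
    dst = np.linspace(0.0, 1.0, num=num_bins)
    resampled = np.interp(dst, src, beat)
    # rescale to the range [0, 1] (amplitude normalization)
    span = resampled.max() - resampled.min()
    if span > 0:
        resampled = (resampled - resampled.min()) / span
    return resampled

# Stacking the normalized beats of one window gives the vectors r_ij used from (7.1) onward.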
Due to the low signal-to-noise ratio (SNR) of the ECG signal during high-intensity PA, fake-peak elimination and valid-beat selection [33] are performed to enhance robustness and reduce the peak detection error. The D-dimensional vector representation of r_{ij}(t) over one heartbeat is denoted r_{ij}, and the D-dimensional vector representations of the corresponding CAM, MAN and sensor noise components are \theta_i, \chi_{ij} and \eta_{ij}, respectively. Fig. 7.2 shows the mean and standard deviation of the normalized ECG signal for different activities. One of our innovations over [122] is the recognition that both the CAM and the MAN carry discriminative information between different PAs.

Figure 7.2: The mean and standard deviation of normalized ECG signals.

7.1.1.2 Principal Component Analysis

Principal component analysis (PCA) is used for feature extraction from the MAN component \chi_{ij}. For the i-th activity class, we use \nu_i heartbeats to estimate the CAM \theta_i (as in [122]),

\tilde\theta_i = \frac{1}{\nu_i} \sum_{j=1}^{\nu_i} r_{ij}.   (7.2)

We note that the number of heartbeats available for training, \nu_i, is different for each activity. Subtracting the CAM from the signal r_{ij} yields the residual activity vectors \acute r_{ij}:

\acute r_{ij} = r_{ij} - \tilde\theta_i = \chi_{ij} + \acute\eta_{ij},   (7.3)

where \acute\eta_{ij} includes both the sensor noise and the CAM estimation noise induced by session variability. As noted in [122], although the signal component due to the MAN has a smaller amplitude than the CAM, it has a much greater amplitude than the sensor noise, i.e., |\eta| << |\chi_i| < |\theta_i| for all i (where |·| is the 2-norm). Thus, the MAN has a dominant influence on the shape of the residual activity vector \acute r_i. For each activity class i, we compute eigenvectors and eigenvalues from the eigen-decomposition of the covariance matrix \Sigma_i of \acute r_{ij}. Let E_i = [e_{i0}, e_{i1}, \cdots, e_{i\kappa_i}] be the set of eigenvectors corresponding to the \kappa_i < D largest eigenvalues, and let p_{uj} be the vector representation of the j-th normalized observed ECG heartbeat after pre-processing. We subtract the class mean \tilde\theta_i, see (7.2), from p_{uj} to yield \tilde p_{ij}. A measure of the reconstruction error in the i-th activity's residual-vector eigenspace, for the j-th ECG heartbeat observation, is then defined as

RE^{PCA}_j(i) = \| \tilde p_{ij} - (E_i E_i^T) \tilde p_{ij} \|_2,   (7.4)

which is summed over the \nu_F heartbeat observations. In the PCA approach studied in [121] and [122], the decision is assigned to the activity class label i = 1, \cdots, M for which the reconstruction error RE^{PCA}(i) is minimum. However, the activity class mean \tilde\theta_i is pre-trained and fixed in all testing situations. This can induce session-to-session variability issues: differences in sensor electrode placement and user emotional state can cause fluctuations of the mean vector between the training and testing data, which affect the computation of the residual activity vector. Furthermore, this PCA method does not use the heart rate or other intra-beat statistical information, and focuses only on the normalized heartbeat modeling. In this work, we address this issue by adopting the PCA error vector RE^{PCA} = [RE^{PCA}(1), RE^{PCA}(2), \ldots, RE^{PCA}(M)] as one of the temporal ECG features used for PA recognition.
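The per-window PCA error vector used as a feature above can be sketched as follows (Python/NumPy, hypothetical names): for each activity class, the trained class mean is subtracted from every observed beat, the beat is projected onto that class's residual eigenspace, and the reconstruction errors of (7.4) are accumulated over the beats of the window.

import numpy as np

def pca_error_vector(beats, class_means, class_eigvecs):
    """PCA error vector RE_PCA over one window of normalized beats (eq. 7.4).

    beats         : (nu_F, D) normalized heartbeats of the current window
    class_means   : list of M arrays (D,), the trained CAM of each activity class
    class_eigvecs : list of M arrays (D, kappa_i), leading eigenvectors per class
    Returns an (M,) feature vector of accumulated reconstruction errors.
    """
    M = len(class_means)
    errors = np.zeros(M)
    for i in range(M):
        E_i = class_eigvecs[i]
        centered = beats - class_means[i]              # p_tilde_ij for every beat
        projected = centered @ E_i @ E_i.T             # (E_i E_i^T) p_tilde_ij
        errors[i] = np.linalg.norm(centered - projected, axis=1).sum()
    return errors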
7.1.1.3 Hermite polynomial expansion

A Hermite polynomial expansion (HPE) is used to model the CAM component \theta_i of the sampled ECG signal, and the resulting coefficients serve as another feature set for classification. Hermite polynomials are classical orthogonal polynomial sequence representations [6] and have been successfully used to describe ECG signals for arrhythmia detection [109], but they do not appear to have been previously used for PA detection. In Fig. 7.2, the shape of the CAM component differs for each activity, so these signals can be used to distinguish between different PA states. Rather than subtracting the activity mean to model the motion artifact noise, we average the normalized ECG signal to estimate the cardiac activity mean. Let \nu_F denote the fixed number of normalized heartbeats in each running window; the CAM component of the \kappa-th window is estimated by

\acute\theta_{i\kappa} = \frac{1}{\nu_F} \sum_{j=\kappa\nu_F + 1}^{(\kappa+1)\nu_F} r_{ij}.   (7.5)

Denote each estimated D-dimensional (D odd) CAM component vector by \acute\theta[n] and the polynomial order by L. The HPE of \acute\theta[n] can be expressed as [109]

\acute\theta[n] = \sum_{l=0}^{L-1} c_l \psi_l(n, \delta), \quad n \in \left[ -\frac{D-1}{2}, \frac{D-1}{2} \right],   (7.6)

where \{c_l\}, l = 0, 1, \cdots, L-1, are the HPE coefficients and \psi_l(n, \delta) are the Hermite basis functions defined as

\psi_l(n, \delta) = \frac{1}{\sqrt{\delta \, 2^l \, l! \, \sqrt{\pi}}} \, e^{-n^2 / 2\delta^2} \, H_l(n/\delta).   (7.7)

The functions H_l(n/\delta) are the Hermite polynomials [6]:

H_0(t) = 1, \quad H_1(t) = 2t,   (7.8)

H_l(t) = 2t H_{l-1}(t) - 2(l-1) H_{l-2}(t).   (7.9)

It has previously been shown in [109] that, for Hermite basis functions of different orders, the higher the order, the higher the frequency of changes within the time domain, and thus the better the capability for capturing morphological details of ECG signals.

Figure 7.3: Hermite basis functions with \delta = 10, D = 201: (a) l = 1, (b) l = 2, (c) l = 3, (d) l = 4.

The HPE basis functions can be collected into a D x L matrix B = [\psi_0, \psi_1, \cdots, \psi_{L-1}]; the expansion coefficients c = [c_0, c_1, \cdots, c_{L-1}]^T are obtained by minimizing the sum squared error E:

E = \left\| \acute\theta[n] - \sum_{l=0}^{L-1} c_l \psi_l(n, \delta) \right\|_2^2 = \| \acute\theta - B c \|_2^2 \;\Rightarrow\; c = (B^T B)^{-1} B^T \acute\theta.   (7.10)

As shown in [109], the HPE-based reconstruction is nearly identical to the original waveform for ECG signals. The HPE coefficients c are also employed as ECG temporal features.

Figure 7.4: The original and the reconstructed ECG heartbeat from HPE.
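A small Python/NumPy sketch of the HPE fit in (7.6)-(7.10) follows (hypothetical names): it builds the Hermite basis matrix B on the symmetric sample grid and solves the least squares problem for the coefficients c of one estimated CAM vector.

import numpy as np
from math import factorial, pi, sqrt

def hermite_basis(D, L, delta):
    """Hermite basis matrix B (D x L) of eqs. (7.7)-(7.9); D must be odd."""
    n = np.arange(-(D - 1) // 2, (D - 1) // 2 + 1)
    t = n / delta
    H = [np.ones_like(t), 2.0 * t]                                   # H_0, H_1 of (7.8)
    for l in range(2, L):
        H.append(2.0 * t * H[l - 1] - 2.0 * (l - 1) * H[l - 2])      # recursion (7.9)
    B = np.empty((D, L))
    for l in range(L):
        norm = 1.0 / sqrt(delta * (2.0 ** l) * factorial(l) * sqrt(pi))
        B[:, l] = norm * np.exp(-n ** 2 / (2.0 * delta ** 2)) * H[l]  # eq. (7.7)
    return B

def hpe_coefficients(cam, L=20, delta=10.0):
    """Least-squares HPE coefficients c for one CAM vector (eq. 7.10)."""
    B = hermite_basis(len(cam), L, delta)
    c, *_ = np.linalg.lstsq(B, cam, rcond=None)
    return c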
7.1.1.5 Temporal Accelerometer Features

For the tri-axial accelerometer, a set of conventional temporal features (listed in Table 7.1) is extracted from the signal of each axis in every processing window. These features have been previously studied in [10, 57, 62, 65, 112], with various subsets of the listed features employed.

Both ECG and accelerometer temporal feature vectors are denoted as y and modeled by a support vector machine as explained in Section 7.2.1.

7.1.2 Cepstral feature extraction

In this work, it is assumed that both ECG and accelerometer signals have quasi-periodic characteristics resulting from the convolution between an excitation (heart rate or moving pace) and a corresponding system response (ECG waveform shapes [33] or accelerometer movement patterns). Furthermore, in the acquisition of both ECG and accelerometer signals, there are many other "channel" artifacts, such as skin muscle activity, mental state variability, electrode displacement, and so on. Cepstral analysis [32, 61, 133] is a homomorphic signal transform technique that turns a convolution into an additive relationship, which makes it especially well suited for mitigating convolutional effects. It has been successfully and widely used for processing many real life signals, such as speech and seismic signals [32, 61, 133]. Thus, in order to filter out the effects of the different paths from the source signals to the sensors, using cepstral features to model the frequency information of the native signal allows us to separate inherent convolutive effects by simple linear filtering. In the following, we explain the use of the real cepstrum and describe the proposed linear frequency band based cepstral features in detail.

The sensor signal has some frequencies at which motion artifacts or sensor noise dominate. For example, ECG baseline wander and high frequency noise can result in drastic frame-to-frame phase changes. Furthermore, the properties of the "excitation" source of the sensor signal (e.g., ECG heart rate and accelerometer speed) also vary from frame to frame, which makes the phase not very meaningful. Because of this, the complex cepstrum is rarely adopted for real life signals such as speech [61]. Thus, in this activity recognition application, we use only the real cepstrum, which is based on the spectral magnitude information of the sensor signals. The real cepstrum of a signal x[n] with spectral magnitude |X(e^{jw})| is defined as [61]

C[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln |X(e^{jw})| \, e^{jwn} \, dw.    (7.11)

In many applications, instead of operating directly on the signal spectrum, filter banks are employed to emphasize different portions of the spectrum separately. For example, in speech and audio processing, mel frequency cepstral coefficients are popular [133] and are derived from nonlinear filter bank processing of the spectral energies to approximate the frequency analysis of the human ear.

Given the FFT of the input signal x[n],

X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi n k / N}, \quad 0 \le k < N,    (7.12)

where N is the FFT size, a filter bank with M filters (m = 1, 2, ..., M) is adopted to map the powers of the spectrum obtained above into the mel scale using triangular overlapping windows H_m[k] [61]. Thus, the log-energy at the output of each filter is computed as

S[m] = \ln \left( \sum_{k=0}^{N-1} |X[k]|^2 H_m[k] \right), \quad 0 \le m < M.    (7.13)

Finally, the discrete cosine transform (DCT) of the M filter log-energy outputs is calculated to generate the cepstral features:

C[n] = \sum_{m=0}^{M-1} S[m] \cos\left( \pi n (m + 1/2)/M \right), \quad 0 \le n < M.    (7.14)

The filter energies are more robust to noise and spectral estimation errors and have therefore been used extensively as a standard feature set for speech and music recognition applications [61]. The perceptually motivated logarithmic mel-scale filter bands are designed for the human auditory system, which might not match the ECG and accelerometer signals. For this reason, and for simplicity, in this work we use linear frequency bands rather than mel-scale frequency bands. Cepstral mean subtraction (CMS) and cepstral variance normalization (CVN) are adopted to mitigate convolutional filtering effects and ensure robustness.

Specifically, due to potential inter-session variability, such as a change in electrode position or a variation in the user's emotional state, there is always a fluctuation in the "relative transfer function" that characterizes the transformation from the ground truth measurements of the PAs to the sensors' signals. Therefore, CMS is performed to mitigate this effect. The multiplication of the signal's spectrum, X[n,k], and the relative transfer function's spectrum, H[k], in the frequency domain is equivalent to a superposition in the cepstral domain:

C_y[n,k] = C_x[n,k] + C_H[k].    (7.15)

The second component, C_H[k], can be removed by long term averaging along each dimension k:

C_y[n,k] - \langle C_y[n,k] \rangle_{avg} = C_x[n,k] - \langle C_x[n,k] \rangle_{avg}.    (7.16)

Thus, cepstral features with CMS and CVN normalization are more robust against session variability.
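The full cepstral pipeline (7.12)-(7.14) followed by CMS/CVN (7.16) can be sketched as below. This is a simplified illustration: rectangular, equal-width linear bands are assumed in place of triangular overlapping windows, the mean and variance normalization is applied over the whole recording, and all names are illustrative.

```python
import numpy as np

def linear_band_cepstra(frames, n_bands, n_ceps):
    """Linear-frequency-band cepstral features for a (num_frames x frame_len) array
    of windowed sensor samples: power spectrum via the FFT (7.12), log band
    energies (7.13), DCT (7.14), then CMS/CVN over the whole recording (7.16)."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # frame power spectra
    edges = np.linspace(0, power.shape[1], n_bands + 1).astype(int)
    band_energy = np.stack([power[:, a:b].sum(axis=1)
                            for a, b in zip(edges[:-1], edges[1:])], axis=1)
    S = np.log(np.maximum(band_energy, 1e-12))                 # log band energies, Eq. (7.13)
    m = np.arange(n_bands)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), m + 0.5) / n_bands)
    C = S @ dct.T                                              # cepstra, Eq. (7.14)
    C = (C - C.mean(axis=0)) / (C.std(axis=0) + 1e-12)         # CMS and CVN, Eq. (7.16)
    return C
```

First order delta coefficients, used in the experiments below, would be appended by differencing consecutive rows of C.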
7.2 Activity Modeling

As shown in Fig. 7.1, the features in the temporal and cepstral domains are modeled using SVM and GMM classifiers, respectively. The multimodal and multi-domain subsystems are fused together at the score level to improve the overall PA recognition performance.

7.2.1 SVM Classification for temporal features

An SVM is a binary classifier constructed from sums of a kernel function K(·,·) over N support vectors, where y_i denotes the i-th support vector and t_i is the ideal output:

f(y) = \sum_{i=1}^{N} \alpha_i t_i K(y, y_i) + d.    (7.17)

The ideal outputs are either 1 or −1, depending on whether the corresponding support vector belongs to class 1 or −1. By using kernel functions, an SVM can be generalized to a non-linear classifier by mapping the input features into a high dimensional feature space.

The original form of the generalized linear discriminative sequence (GLDS) kernel [26] involves a polynomial expansion, b(y), with monomials (between each combination of vector components) up to a given degree p. The GLDS kernel between two sequences of vectors Y_1 = {y^1_t}, t = 1, ..., N_1 and Y_2 = {y^2_t}, t = 1, ..., N_2 is a rescaled dot product between average expansions:

K(Y_1, Y_2) = \left( \frac{1}{N_1} \sum_{i=1}^{N_1} b(y^1_i) \right)^T R^{-1} \left( \frac{1}{N_2} \sum_{j=1}^{N_2} b(y^2_j) \right) = \left( R^{-1/2} \bar{b}_{y^1} \right)^T \left( R^{-1/2} \bar{b}_{y^2} \right),    (7.18)

where R is the second moment matrix of the polynomial expansions; its diagonal approximation is usually used for efficiency. In this work, only the first order of b(y) is used for simplicity: b(y) = y. In addition, if we add one dummy dimension with value 1 at the head of each feature vector, \acute{b}(y) = [1 \; b(y)], and R becomes \acute{R}, the scoring function of the GLDS kernel can be simplified by the following compact technique [26]:

f(\{Y\}) = \left( \sum_{i=1}^{N} \alpha_i t_i \acute{R}^{-1} \acute{b}_{y^i} + d \right)^T \cdot \acute{b}_y = W^T \cdot \acute{b}_y,    (7.19)

where the \acute{b}_{y^i} are the support vectors and d is defined as [d \; 0 \cdots 0]^T. Therefore, the score of a target model on a sequence of observations can be calculated from the averaged observation. Furthermore, by collapsing all the support vectors into a single model vector W, each target score becomes a simple inner product, which makes this framework computationally efficient. In this study, the LIBSVM tool [31] and the 1vsRest strategy [26] were used for SVM model training. For each activity, a binary SVM classifier was trained against the remaining M − 1 activities using the GLDS kernel in (7.18). Moreover, for each binary SVM model, all the support vectors were collapsed into a single vector W by (7.19) to make the scoring function computationally efficient.
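The collapse of (7.19) and the resulting inner-product scoring can be sketched as follows, assuming the diagonal approximation of \acute{R} mentioned above. The function names and array shapes are assumptions of this sketch, not the actual training code.

```python
import numpy as np

def collapse_glds_model(support_vectors, alphas, labels, R_diag, d):
    """Collapse a trained linear-GLDS SVM into one model vector W, Eq. (7.19).
    support_vectors: (N x (D+1)) expansions b~(y_i) with the leading dummy 1;
    alphas, labels: the SVM dual coefficients and +/-1 targets;
    R_diag: diagonal approximation of the second-moment matrix (length D+1);
    d: SVM bias, folded into the dummy dimension."""
    W = ((alphas * labels)[:, None] * (support_vectors / R_diag[None, :])).sum(axis=0)
    W[0] += d
    return W

def glds_score(W, Y):
    """Score a test sequence Y (T x D temporal feature vectors) against W:
    average the first-order expansions, prepend the dummy 1, take an inner product."""
    b_avg = np.concatenate(([1.0], Y.mean(axis=0)))
    return float(W @ b_avg)
```

With one collapsed vector W per activity, scoring a 20-second test segment against all M activities costs only M inner products, which is the efficiency argument made above.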
7.2.2 GMM modeling for cepstral features

A Gaussian mixture model (GMM) is used to model the cepstral features of the ECG and accelerometer signals. A Gaussian mixture density is a weighted sum of N component densities,

p(C|\lambda) = \sum_{j=1}^{N} \omega_j \, p_j(C),    (7.20)

where C is a D-dimensional random vector, p_j(C), j = 1, ..., N are the component densities, and ω_j, j = 1, ..., N are the mixture weights. Each component density is a D-variate Gaussian of the form

p_j(C) = \frac{1}{(2\pi)^{D/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (C - \mu_j)^T \Sigma_j^{-1} (C - \mu_j) \right),    (7.21)

with mean vector μ_j and covariance matrix Σ_j. The mixture weights satisfy the constraint \sum_{j=1}^{N} \omega_j = 1. The complete Gaussian mixture density is parameterized by the mean vectors, covariance matrices, and mixture weights of all component densities. These parameters are collectively represented by the notation λ_i for activity i, i = 1, ..., M, and are explicitly written as

\lambda_i = \{ \omega_j, \mu_j, \Sigma_j \}, \quad j = 1, \cdots, N.    (7.22)

For subject-dependent PA identification using the cepstral features of the sensor signals, each activity performed by every subject is represented by a GMM and is referred to by its model λ_i. In the proposed work, since the training data for each activity of each subject is too limited to train a good GMM, a universal background model (UBM) in conjunction with maximum a posteriori (MAP) model adaptation [133] is used to model the different PAs in a supervised manner. The UBM is trained using all the training data, covering all activities and all subjects; the subject-dependent activity models are then derived by MAP adaptation from the UBM using the subject's activity-specific training data. The expectation maximization (EM) algorithm is adopted for UBM training.

Under the GMM framework, during testing, each signal segment with T frames is scored against all the activity models of the same subject. Using logarithms and the independence assumption between observations, the GMM system outputs the recognized activity by maximizing the log likelihood criterion:

\hat{S} = \arg\max_{1 \le i \le M} \sum_{t=1}^{T} \log p(C_t | \lambda_i).    (7.23)
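The test-time decision rule (7.23) can be sketched as below. For compactness this sketch assumes diagonal covariance matrices (Eq. (7.21) allows full covariances) and leaves out UBM training and MAP adaptation; the function names are illustrative.

```python
import numpy as np

def gmm_segment_loglike(C, weights, means, variances):
    """Segment log-likelihood of cepstral frames C (T x D) under a GMM with
    diagonal covariances, Eqs. (7.20)-(7.21): per-frame log-sum-exp over the
    mixture components, summed over the T frames."""
    T, D = C.shape
    comp_ll = np.empty((T, len(weights)))
    for j, (w, mu, var) in enumerate(zip(weights, means, variances)):
        diff = C - mu
        log_gauss = -0.5 * (D * np.log(2.0 * np.pi)
                            + np.sum(np.log(var))
                            + np.sum(diff ** 2 / var, axis=1))
        comp_ll[:, j] = np.log(w) + log_gauss
    return np.logaddexp.reduce(comp_ll, axis=1).sum()

def recognize_activity(C, activity_models):
    """Maximum log-likelihood decision over the M MAP-adapted activity GMMs, Eq. (7.23).
    activity_models: list of (weights, means, variances) tuples, one per activity."""
    scores = [gmm_segment_loglike(C, *model) for model in activity_models]
    return int(np.argmax(scores))
```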
7.3 System Fusion

In a multimodal activity recognition system, fusion can be accomplished by utilizing the complementary information available in each of the modalities. In the proposed work, both feature level fusion and score level fusion are studied.

7.3.1 Feature level fusion

Feature level fusion requires the feature sets of multiple modalities to be compatible [134]. Let q = {q_1, q_2, ..., q_m} and s = {s_1, s_2, ..., s_n} denote two feature vectors (q ∈ R^m and s ∈ R^n) representing the information extracted from two different modalities. The goal of feature level fusion is to combine these two feature sets into a new feature vector z with better capability to represent the PA. The l-dimensional vector z, l ≤ (m+n), can be generated by first augmenting the vectors q and s and then performing feature selection or feature transformation on the resulting vector in order to reduce the feature dimensionality.

In the proposed work, we only studied feature level fusion across the different axis features of the accelerometer in the cepstral domain. This is because the cepstral features may not be compatible with the temporal features, and the window length for the temporal feature calculation is significantly larger than for the cepstral features. Furthermore, the ECG and accelerometer cepstral features are not concatenated and fused at the feature level due to compatibility issues arising from different time shift and window length configurations and different sampling frequencies. However, the cepstral features from each axis of the accelerometer are concatenated to construct a long cepstral feature vector in each frame. Heteroscedastic linear discriminant analysis (HLDA) [83] is used to perform feature dimension reduction.

7.3.2 Score level fusion

Multimodal information can also be fused at the score level rather than the feature level. The match score is a measure of similarity between the input sensor signals and the hypothesized activity. When the match scores generated by subsystems based on different modalities are consolidated in order to generate a final recognition decision, fusion is done at the score level. Since some multimodal feature sets are not compatible, and it is relatively easy to access and combine the scores generated by different subsystems, information fusion at the score level is the most commonly used approach in multimodal recognition systems [134].

Let there be K input PA recognition subsystems (as shown in Fig. 7.1, K = 4 in this work), each acting on a specific sensor modality and feature set, where the k-th subsystem outputs its own normalized log-likelihood vector l_k(x_t) for every trial. The fused log-likelihood vector is then given by

\acute{l}(x_t) = \sum_{k=1}^{K} \beta_k \, l_k(x_t).    (7.24)

The weights β_k are determined by logistic regression on the training data [134].
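A minimal sketch of the score level combination (7.24) follows; it assumes the weights β_k have already been estimated by logistic regression on held-out training scores, and the function name is illustrative.

```python
import numpy as np

def fuse_loglikelihoods(subsystem_loglikes, betas):
    """Score level fusion, Eq. (7.24): weighted sum of each subsystem's normalized
    log-likelihood vector for one trial.
    subsystem_loglikes: list of K arrays of shape (M,), one per subsystem;
    betas: the K fusion weights from logistic regression."""
    fused = np.zeros_like(np.asarray(subsystem_loglikes[0], dtype=float))
    for beta_k, l_k in zip(betas, subsystem_loglikes):
        fused += beta_k * np.asarray(l_k, dtype=float)
    return fused

# The final decision picks the activity with the largest fused log-likelihood:
# decision = int(np.argmax(fuse_loglikelihoods(loglikes_per_subsystem, betas)))
```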
7.4 Experimental setup and results

7.4.1 Data acquisition and evaluation

Data collection was conducted using an ALIVE heart rate monitor [1] and a Nokia N95 cell phone. The single lead ECG signal is collected by the heart rate monitor with electrodes on the chest, and at the same time the heart rate monitor with its built-in accelerometer is placed on the left hip to record the accelerometer signal. The placement of the electrodes and accelerometer is shown in Fig. 7.5. Both signals are synchronized and packaged together for transmission to the cell phone through a Bluetooth wireless connection [1, 5, 90]. The sampling frequencies of the ECG and the accelerometer are 300 Hz and 75 Hz, respectively. In this work, only one tri-axial (heart rate monitor built-in) accelerometer signal and one single lead ECG signal are used for analysis.

For each session, the subject was required to wear the sensors and perform 9 categories of PA following a predetermined protocol [113, 148]: lying, sitting, sitting fidgeting, standing, standing fidgeting, playing Nintendo Wii tennis, slow walking, brisk walking, and running. The last 3 activities were performed on a treadmill at the subjects' own choice of speed (around 1.5 mph for slow walking and around 3 mph for brisk walking).

Figure 7.5: Placement of electrodes (black filled circles) and accelerometer (red open triangle) and data collection environment.

The activities selected here are based on a version of the System for Observing Fitness Instruction Time (SOFIT), considered a gold standard for physical activity measurement [113]. These basic activities are believed to make up or represent a majority of real life physical activities. Furthermore, since the measurements are based on a laboratory protocol, the modeling and recognition of these categories can be considered a foundational baseline. Subjects wore the sensors for 7 minutes in each of the 9 PAs, with inter-activity rest as needed. Data from 5 subjects (2 male, 3 female, ages ranging from 13 to 30) who participated in the experiment are reported in this paper. Each subject performed 4 sessions on different days and at different times. Thus the data reflect variability in electrode positions and a variety of environmental and physiological factors. In the following, the proposed approach is evaluated in both closed set and open set classification tasks.

First, the proposed PA recognition is formulated as a subject-dependent closed set activity identification problem, so the performance is measured by classification accuracy. For each subject, there are data from 4 sessions. Thus we established 3 different settings to evaluate our methods. Setting 1: for each subject and session, training was based on data from the first half of the session and testing on the second half. Setting 2: for each subject, training was on one session's data and testing was on another session. Setting 3: for each subject, training was on 3 sessions' data and testing was on the remaining session. In the following, evaluations of our feature extraction and supervised modeling, as shown in Tables 7.2, 7.3, and 7.5, use setting 3, in which training and testing data come from different days/times and training/testing data are rotated 4 times (for cross validation). The reported performance is the average over all subjects and all rotation tests. In addition, score level fusion and session variability for all 3 settings are studied and demonstrated in Table 7.4.

Second, in real life free living conditions, there might be situations that do not quite fit our 9-category PA protocol. Thus, 3 different open set experiments were conducted to evaluate the generalizability of the results to everyday, ambulatory monitoring by testing the ability to correctly reject activities that do not fall within the set categories. All 3 open set tasks are based on the subject dependent modeling of the previously described setting 3. Task 1 is formulated as an activity verification task (e.g., walking or not) by testing each in-set hypothesis activity's likelihood against a global threshold: each time, 8 activities are considered in-set target activities while the remaining activity is assigned as the out-of-set activity for rejection purposes. This out-of-set activity is excluded from any training process.
The setup was rotated 9 times to calculate the average performance, and equal error rate (EER) is used to evaluate it. Second, rather than identifying/rejecting activities based on thresholds, Tasks 2 and 3 employed closed set classification with an "others" activity model used to absorb all activities that do not belong to the desired closed set. As shown in Table 7.5, Task 2 focuses on distinguishing sedentary activities while rejecting unknown vigorous activities by using an "others" model trained on data from the "standing fidgeting" activity, and vice versa for Task 3.

The testing duration in all the evaluation experiments is fixed at 20 seconds. The HPE order, PCA eigenvector dimension, and normalized heartbeat sample length D were empirically chosen to be 60, 40 and 201, respectively.

7.4.2 Results

7.4.2.1 SVM system based on temporal features

Table 7.2 shows the results of the ECG temporal feature based SVM system. Compared to the conventional PCA method [122], the proposed HPE coefficients together with heart rate (HR) and noise measurement (NM) features achieved nearly 10% improvement in accuracy. Furthermore, fusing the PCA, HPE, HR, and NM features together achieves an additional 4% improvement.

Table 7.2: Performance (% correct) of the SVM system based on temporal ECG features (HR: heart rate, NM: noise measurement)

ECG           1 PCA    2 HPE    3 HR+NM    2+3     1+2+3
10 beats      46.9     51.7     43.3       56.7    60.8
20 seconds    49.4     54.4     44.0       60.0    64.2

7.4.2.2 GMM system based on cepstral features

Table 7.3 shows the results of the GMM system for different configurations of cepstral features. Before feature extraction, the DC baseline is removed by a high pass filter. ECG IDs (1, 2, 8) show that smaller shifts and window sizes give better performance, while ECG IDs (2, 3, 4) show that the number of cepstral coefficients used for recognition does not have to equal the number of spectral bands, because the DCT in the cepstral feature extraction can be seen as an implicit dimension reduction. Moreover, ECG IDs (3, 5, 6) demonstrate that 50% overlap and first order delta coefficients in the cepstral extraction are necessary. Finally, ECG IDs (7, 8, 9) illustrate the performance for different numbers of Gaussian components. In this case, a GMM with 64 components together with a 120 millisecond window, 24 cepstral coefficients, 48 frequency bands, 50% overlap, and first order delta derivatives gives the best performance of 63.45%.

Evaluation of the accelerometer (ACC) cepstral features in Table 7.3 yields similar results: smaller window sizes yield higher accuracy. Since the sampling frequency of the accelerometer is only 75 Hz, we set the minimum window length to 480 milliseconds, which is exactly one quarter of the ECG feature window size. However, in ACC IDs (1, 6), the best setup for the number of cepstral coefficients is 20 rather than 7, so the final feature dimension is 120 after adding the first order delta and combining the tri-axial feature vectors. ACC IDs (1, 7, 8, 9) show the results of HLDA dimension reduction in the accelerometer cepstral domain. The system is not sensitive to the final reduced dimension, and the accuracy improves from 74.76% to 77.56% when the dimension is reduced to 72.

Table 7.3: Evaluation of GMM systems based on different configurations of cepstral feature extraction (ACC = accelerometer)

ECG ID   cepstral number   spectral bands   window length(s)   window shift(s)   delta   cepstral HLDA   feat dimension   GMM size   accuracy Pc(%)
1        36                64               0.5                0.25              yes     no              64               32         55.49
2        36                64               0.25               0.125             yes     no              64               32         61.83
3        24                64               0.25               0.125             yes     no              48               32         61.45
4        64                64               0.25               0.125             yes     no              128              32         57.64
5        24                64               0.25               0.25              yes     no              48               32         59.23
6        24                64               0.25               0.125             no      no              24               32         59.17
7        24                48               0.12               0.06              yes     no              48               16         61.26
8        24                48               0.12               0.06              yes     no              48               32         63.20
9        24                48               0.12               0.06              yes     no              48               64         63.45

ACC ID   cepstral number   spectral bands   window length(s)   window shift(s)   delta   cepstral HLDA   feat dimension   GMM size   accuracy Pc(%)
1        20                20               0.48               0.24              yes     no              120              32         74.76
2        20                20               0.96               0.48              yes     no              120              32         67.48
3        20                20               0.24               0.24              yes     no              120              32         70.73
4        20                20               0.48               0.24              yes     no              120              16         75.01
5        20                20               0.48               0.24              yes     no              120              64         72.45
6        7                 20               0.48               0.24              yes     no              42               32         72.00
7        20                20               0.48               0.24              yes     yes             96               32         77.52
8        20                20               0.48               0.24              yes     yes             84               32         77.34
9        20                20               0.48               0.24              yes     yes             72               32         77.56

7.4.2.3 Score level fusion

Performance of the score level fusion at different settings is shown in Table 7.4.
In setting 3, first, fusion of the ECG temporal and cepstral systems improves the accuracy from 64.17% to 68.49%, while fusion of the accelerometer temporal and cepstral systems improves the accuracy from 84.85% to 90.00%. Second, within the same kind of features, fusing ECG and accelerometer information together also improves the results: in the temporal domain, fusion of the ECG SVM system and the accelerometer SVM system increases the accuracy by only 1%, while in the cepstral domain, fusion of both modalities improves the accuracy from 77.56% to 82.30%. Finally, we fuse all 4 individual systems together to further improve the PA recognition performance, which results in 91.40% accuracy for setting 3. Our fusion method thus yields a 6.55% absolute improvement (from 84.85% to 91.40%) over the conventional accelerometer temporal-feature based SVM system. Similar results are also observed in settings 1 and 2.

7.4.2.4 Session variability study

In Table 7.4, the performances in setting 2 are noticeably lower than in setting 1 because of the mismatch between training and testing data due to session variability. The ECG systems can drop in performance by up to 30%, while the accelerometer systems are relatively more robust, with only a 15% decrease. This might be because the ECG signal varies with a range of factors, such as electrode placement, mental stress, emotion, and so on, while the accelerometer only measures physical movement and thus varies only with different movement types or patterns. However, by adding more training data from different sessions, this variability can be mitigated and the system can be made more robust. This is demonstrated by the 10%-21% improvement from setting 2 to setting 3. The accuracy standard deviations across subjects are also shown in Table 7.4; the individual standard deviation improves along with the average accuracy under score level fusion. Furthermore, in terms of accuracy of the fusion system (ID 9), the p-values [114] of the null hypotheses that setting 1 equals setting 2 and that setting 3 equals setting 2 are 0.00003 and 0.0009, respectively. Thus, even with the influence of individual variability, session variability is verified at the 0.0009 significance level.

7.4.2.5 Open set tasks study

Table 7.5 clearly shows that, in the open set tasks, score level fusion of the multi-modal and multi-domain subsystems significantly improves performance.
Based on the similar accuracy results between the closed set classification and open set tasks 2 and 3, it can be observed that, with the use of the "others" activity model, the proposed approach can effectively identify the activities of interest as well as reject out-of-set activities.

7.5 Discussion

This work addresses the PA recognition problem with multimodal wearable sensors (ECG and accelerometer). The contributions are as follows:

(1) The cardiac activity mean (CAM) component of the ECG signal is described by HPE in the temporal feature extraction. It can be observed in Table 7.2 that HPE features perform better than conventional PCA features, and that combining the PCA, HPE, HR, and NM features achieves significant improvement. This is because the pre-trained activity mean in the PCA approach might differ from the testing condition due to session variability, which can decrease system performance. Moreover, PCA and HPE model the MAN and CAM parts of the normalized ECG waveform, respectively, while HR and NM measure the heart rate and the inter-beat noise level, and this information is complementary.

(2) In the SVM framework for both ECG and accelerometer temporal features, the GLDS kernel makes the classification computationally efficient with a small model size. We can see that the single lead ECG signal carries more activity discrimination information than the heart rate alone, but as shown in Table 7.4, the performance is still relatively low compared with accelerometer based methods. Therefore, fusing the information from both modalities is necessary.

(3) A GMM system based on cepstral features is proposed to capture the frequency domain information, and HLDA is used to reduce the feature dimension of the tri-axial accelerometer based measurements. In Table 7.4, compared to the ECG temporal feature based SVM system, the GMM approach with ECG cepstral features achieved almost the same performance in setting 3 and, in fact, performed 10% better in setting 2, because cepstral features together with CMS are more robust to session variability. Furthermore, because there is no need for pre-processing steps such as peak detection and segmentation, which are inherently noisy and computationally expensive, cepstral feature calculation is faster and more efficient than temporal feature extraction. Compared to the accelerometer temporal feature based SVM system (84.85%), this GMM-cepstral approach achieved a lower performance (77.56%). This is due to the characteristics of the cepstral features and the CMS normalization, in which the mean of the accelerometer signal is removed. The mean of the tri-axial accelerometer signal corresponds to gravity along the different axes; thus different static activity positions can have different mean values because of sensor rotation, and by analysis of this mean value the performance of the accelerometer temporal SVM system is enhanced. However, comparing the results from settings 1 and 2 in Table 7.4, it is clear that the cepstral feature based system is less sensitive to session variability than the temporal feature based system.

Table 7.4: Score level fusion: the mean ± standard deviation of accuracies Pc (%) for different subjects

System ID and name        Setting 1     Setting 2     Setting 3
1 ECG-Temporal-SVM        88.05±5.0     43.39±8.9     64.17±6.3
2 ACC-Temporal-SVM        95.13±4.1     72.76±7.9     84.85±7.8
3 ECG-Cepstral-GMM        85.43±9.1     53.81±15.1    63.45±9.8
4 ACC-Cepstral-GMM        78.93±7.4     63.69±5.7     77.56±5.4
5 Fusion (1+3)            92.52±3.0     54.05±13.9    68.49±7.4
6 Fusion (2+4)            96.17±3.9     79.04±5.4     90.00±4.1
7 Fusion (1+2)            97.02±2.8     71.81±7.6     85.49±5.8
8 Fusion (3+4)            90.78±8.8     66.12±5.8     82.30±6.2
9 Fusion (1+2+3+4)        97.29±2.4     79.30±4.8     91.40±3.4
Table 7.5: The configuration and performance of the open set tasks

Experiment setup (activity roles):
Activities          T2    T3
Lying               N
Sitting             N
Sit Fidgeting       N
Standing            N
Stand Fidgeting     M
Playing Wii               M
Slow Walking              N
Brisk Walking             N
Running                   N

Performance:
System ID           T1 EER(%)    T2 Accuracy(%)    T3 Accuracy(%)
1: ECG-Tem          14.7         66.8              68.8
2: ACC-Tem          6.6          84.6              80.6
3: ECG-Cep          23.2         70.0              49.1
4: ACC-Cep          12.7         64.3              76.4
5: Fuse 1,3         14.5         72.5              68.8
6: Fuse 2,4         5.3          88.2              83.4
7: Fuse 1,2         6.5          87.0              83.5
8: Fuse 3,4         9.8          75.8              82.3
9: Fuse 1,2,3,4     5.0          91.4              86.5

N is an in-set target activity, M is the "others" model activity, and unmarked activities are out of set. T1, T2 and T3 denote Tasks 1, 2 and 3, respectively.

(4) Score level fusion of the multi-modal and multi-domain subsystems is performed to improve the overall recognition performance. We demonstrated in Section 7.4.2.3 that fusing temporal and cepstral information within each single modality improves the overall system performance. This result substantiates our assumption that temporal information and cepstral information are complementary. Additionally, fusing ECG and accelerometer information together also increases the accuracy, so fusing both modalities is useful. Compared to the conventional accelerometer temporal feature based approach (System ID 2), the proposed multimodal temporal and cepstral information fusion method (System ID 9) achieved 44%, 24%, and 43% relative error reduction for settings 1, 2, and 3, respectively.

(5) The effects of session variability of ECG and accelerometer measurements on PA recognition were studied. Session variability compensation in the PA recognition application may become an important and challenging research question, for which many algorithms need to be designed and applied to increase system robustness. For example, the nuisance attribute projection (NAP) [27] method in SVM modeling has already been successfully and widely used in speaker recognition to reduce the influence of different channels. In this study, through hypothesis testing, we showed that results in setting 1 (within-session recognition) cannot reflect the performance in real PA recognition applications, such as the across-session condition of setting 2, but that adding more training data from multiple sessions can mitigate this variability and improve real system performance. This also underscores the need for dynamic adaptation to changing data conditions.

7.6 Conclusion

In this work, a multimodal physical activity recognition system was developed by fusing ECG and accelerometer information. Each modality is modeled in both the temporal and cepstral domains. The main novelty is that by fusing both modalities together, and by fusing temporal and cepstral domain information within each modality, the overall system performance is shown to improve significantly in both accuracy and robustness. We also show that the ECG signals are more sensitive to session-to-session variability than the accelerometer signals, and that by adding more multi-session training data, the session variability can be mitigated and the system can become more robust under real life usage conditions. Future work includes validating the results with data collected under free living conditions.
Chapter 8: Conclusions

In my thesis, I proposed the concept of human states recognition as an encompassing term for the task of understanding, modeling and recognizing various kinds of human centered information. My technical contributions lie in the representation, classification and information fusion modules of the human states recognition pipeline, aimed at enhancing both recognition accuracy and processing efficiency. For representation, I proposed a unified general optimization framework; all the representation methods studied in my thesis, such as factor analysis, the traditional i-vector, the simplified i-vector, the supervised i-vector and the sparse coding based s-vector, are special cases of this general optimization framework. The proposed simplified supervised i-vector modeling not only achieves better performance but also reduces the computational cost by a factor of 100. For classification, I employed sparse representation as an alternative classification method for both identification and verification purposes, which generates information complementary to other classification methods. By fusing multiple diverse subsystems at different levels (feature level, representation level and score level), the overall system performance is enhanced. I presented two applications, namely speaker verification using articulatory and acoustic information fusion and multimodal physical activity recognition, as examples of domain specific novel feature extraction and multimodal information fusion.

Bibliography

[1] Alive Heart Monitor. http://www.alivetec.com/products.htm.
[2] Google Health Service. http://www.google.com/health/.
[3] Andre Adami, Lukas Burget, Stephane Dupont, Hari Garudadri, Frantisek Grezl, Hynek Hermansky, Pratibha Jain, Sachin Kajarekar, Nelson Morgan, and Sunil Sivadas. Qualcomm-ICSI-OGI features for ASR. In Proceedings of ICSLP, volume 1, pages 4–7, 2002.
[4] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing, 54(11):4311–4322, 2006.
[5] M. Annavaram, N. Medvidovic, U. Mitra, S. Narayanan, G. Sukhatme, Z. Meng, S. Qiu, R. Kumar, G. Thatte, and D. Spruijt-Metz. Multimodal sensing for pediatric obesity applications. International Workshop on Urban, Community, and Social Applications of Networked Sensing Systems, UrbanSense, 2008.
[6] G.B. Arfken, H.J. Weber, and H.J. Weber. Mathematical Methods for Physicists. Academic Press, New York, 1985.
[7] H. Aronowitz and O. Barkan. Efficient approximated i-vector extraction. In Proc. ICASSP, pages 4789–4792, 2012.
[8] K. Audhkhasi, A. Metallinou, M. Li, and S.S. Narayanan. Speaker personality classification using systems based on acoustic-lexical cues and an optimal tree-structured bayesian network. 2012.
[9] E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariethoz, J. Matas, K. Messer, V. Popovici, F. Poree, et al. The BANCA database and evaluation protocol. Lecture Notes in Computer Science, 2688:625–638, 2009.
[10] L. Bao and S.S. Intille. Activity recognition from user-annotated acceleration data. Lecture Notes in Computer Science, 3001:1–17, 2004.
[11] Haris BC and R. Sinha. On exploring the similarity and fusion of i-vector and sparse representation based speaker verification systems. In Proc. ODYSSEY, 2012.
[12] Haris BC and R. Sinha. Sparse representation over learned and discriminatively learned dictionaries for speaker verification. In Proc. ICASSP, 2012.
[13] J.R. Beveridge, D. Bolme, B.A.
Draper, and M. Teixeira. The CSU face identifi- cation evaluation system. Machine vision and applications, 16(2):128–138, 2005. [14] M.Black,A.Katsamanis,C.C.Lee,A.C.Lammert,B.R.Baucom,A.Christensen, P.G.Georgiou, andS.S.Narayanan. AutomaticClassificationofMarriedCouples’ BehaviorUsingAudioFeatures. InProc.INTERSPEECH,pages2030–2033,2010. [15] D. Bone, M.P. Black, M. Li, A. Metallinou, S. Lee, and S. Narayanan. Intoxicated speech detection by fusion of speaker normalized hierarchical features and gmm supervectors. In Proc. INTERSPEECH, 2011. [16] Daniel Bone, Matthew P. Black, Ming Li, Angeliki Metallinou, Sungbok Lee, and Shrikanth Narayanan. Intoxicated speech detection by fusion of speaker normal- izedhierarchicalfeaturesandgmmsupervectors. InProc. of the Interspeech,pages 3217–3220, August 2011. [17] H. Bredin, G. Aversano, C. Mokbel, and G. Chollet. The biosecure talking-face reference system. In 2nd Workshop on Multimodal User Authentication, 2006. [18] N. Br¨ ummer. Focal multi-class: Toolkit for evaluation, fusion and calibration of multi-class recognition scorestutorial and user manual, 2007. Software available at http://sites.google.com/site/nikobrummer/focalmulticlass. [19] N. Br¨ ummer and A. Strasheim. Agnitios speaker recognition system for evalita 2009. 2009. [20] J. Brunner, S. Fuchs, and P. Perrier. The influence of the palate shape on articu- latory token-to-token variability. ZAS Papers in Linguistics, 42:43–67, 2005. [21] Jana Brunner, Susanne Fuchs, Pascal Perrier, et al. On the relationship between palateshapeandarticulatorybehavior. Journal of the Acoustical Society of Amer- ica, 125(6):3936–3949, 2009. [22] L. Burget, M. Fapˇ so, and V. Hubeika. But system description: Nist sre 2008. In Proc. NIST Speaker Recognition Evaluation Workshop, pages 1–4, 2008. Soft- wareavailableathttp://speech.fit.vutbr.cz/software/joint-factor-analysis-matlab- demo. 133 [23] L. Burget, P. Matejka, P. Schwarz, O. Glembek, and J. Cernocky. Analysis of fea- ture extraction and channel compensation in a gmm speaker recognition system. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):1979–1986, 2007. [24] F. Burkhardt, M. Eckert, W. Johannsen, and J. Stegmann. A Database of Age and Gender Annotated Telephone Speech. In Proc. 7th International Conference on Language Resources and Evaluation (LREC), pages 1562–1565, 2010. [25] C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, and S.S. Narayanan. Iemocap: Interactive emotional dyadic motion cap- ture database. Language resources and evaluation, 42:335–359, 2008. [26] W.M. Campbell, DE Sturim, and DA Reynolds. Support vector machines us- ing gmm supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311, 2006. [27] WMCampbell,DESturim,DAReynolds,andA.Solomonoff. SVMbasedspeaker verification using a GMM supervector kernel and NAP variability compensation. In Proc. ICASSP, volume 1, pages 97–100, 2006. [28] F. Cardinaux, C. Sanderson, and S. Bengio. User authentication via adapted statistical models of face images. IEEE Trans. Signal Processing, 54(1):361–373, 2005. [29] F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, and C. Vair. Compensation of nuisance factors for speaker and language recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):1969–1978, 2007. [30] C.C. Chang and C.J. Lin. Libsvm: a library for support vector machines. 2001. [31] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. 
Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm. [32] DGChilders,DPSkinner,andRCKemerait. Thecepstrum: aguidetoprocessing. Proceedings of the IEEE, 65:1428–1443, 1977. [33] G.D. Clifford, F. Azuaje, and P. McSharry. Advanced methods and tools for ECG data analysis. Artech House, 2006. [34] T. CoverandP. Hart. Nearest neighborpatternclassification. IEEE Transactions on Information Theory, 13:21–27, January 1967. [35] R. Cowie, E. Douglas-Cowie, B. Apolloni, J. Taylor, A. Romano, and W. Fellenz. Whataneuralnetneedstoknowaboutemotionwords.Computationalintelligence and applications, pages 109–114, 1999. 134 [36] DAPPA. RATS project. http://www.darpa.mil/. [37] S. Dart. Articulatory and acoustic properties of apical and laminal articulations. In I. Maddieson, editor, UCLA Working Papers in Phonetics, number 79. 1991. [38] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 19(4):788–798, 2011. [39] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end fac- tor analysis for speaker verification. IEEE Trans Audio, Speech, Lang. Process., 19(4):788–798, 2011. [40] N. Dehak, P.A. Torres-Carrasquillo, D. Reynolds, and R. Dehak. Language recog- nitionviai-vectorsanddimensionalityreduction. InProc. INTERSPEECH,2011. [41] D.L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on pure and applied mathematics, 59(6):797–829, 2006. [42] Ellen Eide and Herbert Gish. A parametric approach to vocal tract length nor- malization. In Proceedings of ICASSP, volume 1, pages 346–348, 1996. [43] M.Ermes,J.Parkka, J.Mantyjarvi,andI.Korhonen. Detectionofdailyactivities and sports with wearable sensors in controlled and uncontrolled conditions. IEEE Transactions on Information Technology in Biomedicine, 12(1):20–26, 2008. [44] F. Eyben, M. Wollmer, and B. Schuller. OpenEARintroducing the Munich open- source emotion and affect recognition toolkit. In Affective Computing and Intel- ligent Interaction and Workshops, ACII, pages 1–6, 2009. [45] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. Liblinear: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008. [46] G. Fant. Acoustic Theory of Speech Production. Mouton & Co., The Hague, 1960. [47] I.Fasel,R.Dahl,J.Hershey,andB.Fortenberry.TheMachinePerceptionToolbox. http://sourceforge.net/projects/mptbox/. [48] D. Gafurov, K. Helkala, and T. Søndrol. Biometric gait authentication using accelerometer sensor. Journal of computers, 1, 2006. [49] D. Garcia-Romero and C. Espy-Wilson. Joint factor analysis for speaker recog- nition reinterpreted as signal coding using overcomplete dictionaries. In Proc. ODYSSEY, 2010. 135 [50] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993. [51] J.L. Gauvain, A. Messaoudi, and H. Schwenk. Language recognition using phone latices. In Proc. ICSLP, 2004. [52] P. Ghosh, A. Tsiartas, P. G. Georgiou, and S. Narayanan. Robust Voice Activity DetectionUsingLong-TermSignalVariability. IEEE Transactions Audio, Speech, and Language Processing, 19:600–613, March 2011. [53] Prasanta Ghosh and Shrikanth Narayanan. A subject-independent acoustic-to- articulatory inversion. In Proceedings of ICASSP, 2011. [54] Prasanta Kumar Ghosh and Shrikanth S. 
Narayanan. A generalized smoothness criterion for acoustic-to-articulatory inversion. Journal of the Acoustical Society of America, 128(4):2162–2172, 2010. [55] O. Glembek, L. Burget, N. Dehak, N. Brummer, and P. Kenny. Comparison of scoring methods used in speaker recognition with joint factor analysis. In Proc. ICASSP, pages 4057–4060, 2009. [56] O. Glembek, L. Burget, P. Matejka, M. Karafi´ at, and P. Kenny. Simplification and optimization of i-vector extraction. In Proc. ICASSP, pages 4516–4519, 2011. [57] A.Godfrey,R.Conway,D.Meagher,andG. ´ OLaighin. Directmeasurementofhu- man movement by accelerometry. Medical Engineering and Physics, 30(10):1364– 1386, 2008. [58] A.O. Hatch, S. Kajarekar, and A. Stolcke. Within-class covariance normalization for SVM-based speaker recognition. In Proc. Interspeech, volume 4, pages 1471– 1474, 2006. [59] J.He,H.Li,andJ.Tan. Real-timedailyactivityclassificationwithwirelesssensor networksusingHiddenMarkovModel. In29thAnnualInternationalConferenceof the IEEE Engineering in Medicine and Biology Society, EMBS, pages 3192–3195, 2007. [60] M. Honda, A. Fujino, and T. Kaburagi. Compensatory responses of articulators tounexpectedperturbationofthepalateshape. Journal of Phonetics,30:281–302, 2002. [61] X. Huang, A. Acero, and H.W. Hon. Spoken language processing: A guide to theory, algorithm, and system development. Prentice Hall PTR Upper Saddle River, NJ, USA, 2001. 136 [62] D.T.G. Huynh. Human Activity Recognition with Wearable Sensors. Ph.D. The- sis, 2008. [63] T. Huynh, U. Blanke, and B. Schiele. Scalable recognition of daily activities with wearable sensors. Lecture Notes in Computer Science, 4718:50, 2007. [64] S.A. Israel, J.M. Irvine, A. Cheng, M.D. Wiederhold, and B.K. Wiederhold. ECG to identify individuals. Pattern Recognition, 38:133–142, 2005. [65] L.C. Jatoba, U. Grossmann, C. Kunze, J. Ottenbacher, and W. Stork. Context- awaremobilehealthmonitoring: Evaluationofdifferentpatternrecognitionmeth- ods forclassificationofphysical activity. In30th Annual International Conference of the IEEEE engineering in Medicine and Biology Society, EMBS., pages 5250– 5253, 2008. [66] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. The Journal of Machine Learning Research, 5:819–844, 2004. [67] Z. Jiang, Z. Lin, and L.S. Davis. Learning a discriminative dictionary for sparse coding via label consistent k-svd. In Proc. CVPR, pages 1697–1704, 2011. [68] T. Joachims. SVMLight: Support Vector Machine. SVM-Light Support Vector Machine http://svmlight. joachims. org/, University of Dortmund, 1999. [69] P. Kenny, G. Boulianne, and P. Dumouchel. Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing, 13(3):345– 354, 2005. [70] P. Kenny, G. Boulianne, P. Dumouchel, and P. Ouellet. Speaker and Session Variability in GMM-Based Speaker Verification. IEEE Transactions on Audio, Speech and Language Processing, 15(4):1448–1460, 2007. [71] P.Kenny,G.Boulianne,P.Ouellet,andP.Dumouchel. Jointfactoranalysisversus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1435–1447, 2007. [72] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel. A study of inter- speaker variability in speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 16(5):980–988, 2008. [73] P. Kenny, D. Reynolds, and F. Castaldo. Diarization of telephone conversa- tions using factor analysis. IEEE Journal of Selected Topics in Signal Processing, 4(6):1059–1070, 2010. 
[74] J. Kim, N. Kumar, A. Tsiartas, M. Li, and S.S. Narayanan. Intelligibility classi- fication of pathological speech using fusion of multiple subsystems. 2012. 137 [75] Jangwon Kim, Adam Lammert, Prasanta Kumar Ghosh, and Shrikanth S. Narayanan. Spatial and temporal alignment of multimodal human speech pro- duction data: realtime imaging, flesh point tracking and audio. In Proceedings of ICASSP, 2013. [76] S. Kim, M. Li, S. Lee, U. Mitra, A. Emken, D. Spruijt-Metz, M. Annavaram, and S.Narayanan. Modelinghigh-leveldescriptionsofreal-lifephysicalactivitiesusing latent topic modeling of multimodal sensor signals. In Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE, pages 6033–6036. IEEE, 2011. [77] Simon King, Joe Frankel, Karen Livescu, Erik McDermott, Korin Richmond, and Mirjam Wester. Speech production knowledge in automatic speech recognition. Journal of the Acoustical Society of America, 121:723, 2007. [78] Tomi Kinnunen and Haizhou Li. An overview of text-independent speaker recog- nition: From features to supervectors. Speech Communication, 52(1):12–40, 2010. [79] Michael Kleinschmidt. Robust speech recognition based on spectro-temporal pro- cessing. Bibliotheks-und Informationssystem der Univ., 2003. [80] M. Kockmann, L. Burget, and J. ˇ Cernock` y. Brno university of technology system for interspeech 2010 paralinguistic challenge. In Proc. of the Interspeech, pages 2822–2825, 2010. [81] A.Krause,DPSiewiorek,A.Smailagic,andJ.Farringdon.Unsupervised,dynamic identificationofphysiologicalandactivitycontextinwearablecomputing.InIEEE International Symposium on Wearable Computers, pages 88–97, 2005. [82] J.M.K.Kua, E.Ambikairajah, J.Epps, andR.Togneri. Speakerverificationusing sparse representation classification. In Proc. ICASSP, pages 4548–4551, 2011. [83] N. Kumar and A.G. Andreou. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech communication, 26(4):283– 298, 1998. [84] A. Lammert, M. Proctor, A. Katsamanis, and S. Narayanan. Morphological vari- ation in the adult vocal tract: A modeling study of its potential acoustic impact. In Proceedings of INTERSPEECH, 2011. [85] A. Lammert, M. Proctor, and S. Narayanan. Morphological variation in the adult hard palate and posterior pharyngeal wall. Journal of Speech, Language and Hearing Research, in press. 138 [86] Adam Lammert, Mike Procto, and Shri Narayanan. Interspeaker variability in hard palate morphology and vowel production. Journal of Speech, Language and Hearing Research. in revision. [87] C.C.Lee,M.Black,A.Katsamanis,A.C.Lammert,B.R.Baucom,A.Christensen, P.G. Georgiou, and S.S. Narayanan. Quantification of Prosodic Entrainment in Affective Spontaneous Spoken Interactions of Married Couples. In Proc. INTER- SPEECH, pages 793–796, 2010. [88] K.C.Lee,J.Ho,M.H.Yang,andD.Kriegman. Video-basedfacerecognitionusing probabilistic appearance manifolds. In CVPR, volume 1, page 313, 2003. [89] LiLeeandRichardCRose. Speakernormalizationusingefficientfrequencywarp- ing procedures. In Proceedings of ICASSP, volume 1, pages 353–356, 1996. [90] S. Lee, M. Annavaram, G. Thatte, Rozgic V., M. Li, U. Mitra, S. Narayanan, and D.Spruijt-Metz. SensingforObesity: KNOWMEImplementationandLessonsfor anArchitect. InWorkshop on Biomedicine in Computing: Systems, Architectures, and Circuits, 2009. [91] S. Lee, A. Potamianos, and S. Narayanan. Acoustics of children’s speech: Devel- opmental changes of temporal and spectral parameters. 
Journal of the Acoustical Society of America, 105(3):1455–1468, 1999. [92] J. Lester, T. Choudhury, and G. Borriello. A practical approach to recognizing physical activities. Lecture Notes in Computer Science, 3968:1–16, 2006. [93] H.Li,B.Ma,andC.H.Lee. Avectorspacemodelingapproachtospokenlanguage identification. IEEE Transactions on Audio, Speech, and Language Processing, 15(1):271–284, 2007. [94] M. Li, K. Han, and S. Narayanan. Automatic speaker age and gender recognition usingacousticandprosodiclevelinformationfusion. submittedtoComputerspeech and language. [95] M. Li, K.J. Han, and S. Narayanan. Automatic speaker age and gender recog- nition using acoustic and prosodic level information fusion. Computer Speech & Language, 2012. [96] M. Li, C.S. Jung, and K.J. Han. Combining five acoustic level modeling methods forautomaticspeakerageandgenderrecognition. InEleventh Annual Conference of the International Speech Communication Association, 2010. [97] M. Li, C. Lu, A. Wang, and S. Narayanan. Speaker verification using lasso based sparse total variability supervector and probabilistic linear discriminant analysis. In Proc. of the NIST speaker verification workshop, 2011. 139 [98] M. Li, A. Metallinou, D. Bone, and S. Narayanan. Speaker states recognition using latent factor analysis based eigenchannel factor vector modeling. 2012. [99] M. Li and S. Narayanan. Robust ecg biometrics by fusing temporal and cep- stral information. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR10), pages 1326–1329, 2010. [100] M. Li and S. Narayanan. Robust talking face video verification using joint factor analysis and sparse representation on gmm mean shifted supervectors. In Acous- tics, Speech and Signal Processing (ICASSP), 2011 IEEE International Confer- ence on, pages 1481–1484. IEEE, 2011. [101] M. Li and S. Narayanan. Robust talking face video verification using joint factor analysis and sparse representation on GMM mean shifted supervectors. In Proc. ICASSP, pages 1481–1484, 2011. [102] M. Li and S. Narayanan. Simplified supervised i-vector modeling and sparse representation with application to robust language recognition. Computer Speech & Language, Oct 2012. submitted. [103] M. Li, V. Rozgic, G. Thatte, S. Lee, A. Emken, M. Annavaram, U. Mitra, D. Spruijt-Metz, and S. Narayanan. Multimodal physical activity recognition by fusing temporal and cepstral information. Neural Systems and Rehabilitation Engineering, IEEE Transactions on, 18(4):369–380, 2010. [104] M. Li, H. Suo, X. Wu, P. Lu, and Y. Yan. Spoken language identification using score vector modeling and support vector machine. In Proc. INTERSPEECH, pages 350–353, 2007. [105] M. Li, X. Zhang, Y. Yan, and S. Narayanan. Speaker verification using sparse representations on total variability i-vectors. In Proc. INTERSPEECH, pages 4548–4551, 2011. [106] MingLi, JangwonKim, PrasantaGhosh, VikramRamanarayanan, andShrikanth Narayanan. Speakerverificationbasedonspeechandinvertedarticulatorysignals. In INTERSPEECH, 2013. [107] Ming Li, Jangwon Kim, Prasanta Kumar Ghosh, Vikram Ramanarayanan, and Shrikanth Narayanan. Automatic classification of palatal and pharyngeal wall shape categories from speech acoustics and inverted articulatory signals. In IN- TERSPEECH, 2013. [108] Ming Li, Andreas Tsiartas, Maarten Van Segbroeck, and Shrikanth Narayanan. Speaker verification using simplified and supervised i-vector modeling. In Pro- ceedings of ICASSP, 2013. 140 [109] TH Linh, S. Osowski, and M. Stodolski. 
On-line heart beat recognition using Hermite polynomials and neuro-fuzzy network. IEEE Transactions on Instru- mentation and Measurement, 52(4):1224–1231, 2003. [110] P. Lukowicz, H. Junker, M. Stager, T. Von Buren, and G. Troster. WearNET: A distributed multi-sensor system for context aware wearables. Lecture notes in computer science, pages 361–370, 2002. [111] P. Matejka, O. Glembek, F. Castaldo, MJ Alam, O. Plchot, P. Kenny, L. Burget, and J. Cernocky. Full-covariance ubm and heavy-tailed plda in i-vector speaker verification. In Proc. ICASSP, pages 4828–4831, 2011. [112] U. Maurer, A. Rowe, A. Smailagic, and D. Siewiorek. Location and Activity Recognition Using eWatch: A Wearable Sensor Platform. Lecture Notes in Com- puter Science, 3864:86, 2006. [113] T.L. McKenzie, J.F. Sallis, and P.R. Nader. SOFIT: System for observing fitness instruction time. J Teach Phys Educ, 11:195–205, 1991. [114] W. Mendenhall and T. Sincich. Statistics for Engineering and the Sciences (5th Edition). Prentice Hall, 2006. [115] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Muller, R.Huber,B.Andrassy,JGBauer,andB.Littel. Comparisonoffourapproachesto ageandgenderrecognitionfortelephoneapplications. InProc.ICASSP,volume4, pages 1089–1092, 2007. [116] C. Mooshammer, P. Perrier, C. Geng, and D. Pape. An EMMA and EPG study on token-to-token variability. AIPUK, 36:47–63, 2004. [117] I. Naseem, R. Togneri, and M. Bennamoun. Sparse Representation for Speaker Identification. In Proc. ICPR, page 4460, 2010. [118] NIST. The NIST Year 2008 Speaker Recognition Evaluation Plan. http://www.nist.gov/speech/tests/spk/2008/index.html. [119] NIST. The NIST Year 2010 Speaker Recognition Evaluation Plan. http://www.itl.nist.gov/iad/mig/tests/spk/2010/index.html. [120] J.Parkka,M.Ermes,P.Korpipaa,J.Mantyjarvi,J.Peltola,I.Korhonen,V.T.T.I. Technol, and F. Tampere. Activity classification using realistic data from wear- able sensors. IEEE Transactions on Information Technology in Biomedicine, 10(1):119–128, 2006. [121] T. Pawar, NS Anantakrishnan, S. Chaudhuri, and SP Duttagupta. Impact of ambulation in wearable-ECG. Annals of Biomedical Engineering, 36(9):1547– 1557, 2008. 141 [122] T.Pawar,S.Chaudhuri,andSPDuttagupta. Bodymovementactivityrecognition for ambulatory cardiac monitoring. IEEE Transactions on Biomedical Engineer- ing, 54(5):874–882, 2007. [123] Dmitry Pechyony and Vladimir Vapnik. On the theory of learning with privileged information. Advances in Neural Information Processing Systems, 23, 2010. [124] J.S. Perkell. Articulatory processes. In W.J. Hardcastle and J. Laver, editors, The Handbook of Phonetic Sciences, pages 333–370. Blackwell, Oxford, 1997. [125] G. E. Peterson and H. L. Barney. Control methods used in a study of vowels. Journal of the Acoustical Society of America, 24:175–184, 1952. [126] K. Phua, J. Chen, T.H. Dat, and L. Shue. Heart sound as a biometric. Pattern Recognition, 41(3):906–919, 2008. [127] N. Poh, C. Chan, J. Kittler, S. Marcel, C. Cool, E. R´ ua, J. Castro, M. Villegas, R. Paredes, V. ˇ Struc, et al. Face video competition. Lecture Notes in Computer Science, 5558:715–724, 2009. [128] N. Poh, C.H. Chan, J. Kittler, S. Marcel, C. McCool, E.A. R´ ua, J.L.A. Castro, M. Villegas, R. Paredes, V. Struc, et al. An evaluation of video-to-video face verification. InformationForensicsandSecurity, IEEETransactionson,5(4):781– 801, 2010. [129] N. Poh, J. Kittler, S. Marcel, D. Matrouf, and J.F. Bonastre. 
Model and Score Adaptation for Biometric Systems: Coping With Device Interoperability and Changing Acquisition Conditions. In ICPR, pages 1229–1232, 2010. [130] S.J.D. Prince and J.H. Elder. Probabilistic linear discriminant analysis for infer- ences about identity. In Proc. ICCV, pages 1–8, 2007. [131] DAPPARATS.Ratscorpus.https://secure.ldc.upenn.edu/intranet/dataMatrixGenerate.jsp. [132] N. Ravi, N. Dandekar, P. Mysore, and M.L. Littman. Activity recognition from accelerometer data. In Proceedings of the National Conference on Artificial Intel- ligence, volume 20, page 1541, 2005. [133] D.A.Reynolds, T.F. Quatieri, andR.B.Dunn. Speaker verification using adapted Gaussian mixture models. Digital signal processing, 10(1-3):19–41, 2000. [134] A.A. Ross, K. Nandakumar, and A.K. Jain. Handbook of multibiometrics. Springer, 2006. [135] R. Rubinstein, M. Zibulevsky, and M. Elad. Efficient implementation of the k- svd algorithm using batch orthogonal matching pursuit. Technical Report-CS Technion, 2008. 142 [136] C. Sanderson and KK Paliwal. Fast feature extraction method for robust face verification. Electronics Letters, 38(25):1648–1650, 2002. [137] B. Schuller, S. Steidl, and A. Batliner. The interspeech 2009 emotion challenge. In Proc. of the Interspeech, pages 312–315, 2009. [138] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Mueller, and S. Narayanan. The INTERSPEECH 2010 paralinguistic challenge. In Proc. IN- TERSPEECH, pages 2794–2797, 2010. [139] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. M¨ uller, and S.S. Narayanan. The interspeech 2010 paralinguistic challenge. In Proc. of the Interspeech, pages 2794–2797, 2010. [140] B. Schuller, S. Steidl, A. Batliner, F. Schiel, and J. Krajewski. The interspeech 2011 speaker state challenge. In Proc. of the Interspeech, pages 3201–3204. [141] P. Schwarz, P. Matejka, and J. Cernocky. Hierarchical structures of neural networks for phoneme. In Proc. ICASSP, pages 325–328, 2006. Software available at http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long- temporal-context. [142] Y. Shao and D.L. Wang. Robust speaker identification using auditory features and computational auditory scene analysis. In Proc. ICASSP, pages 1589–1592, 2008. [143] K. Stevens. Acoustic Phonetics. MIT Press, Cambridge, MA., 1998. [144] A. Stolcke, L. Ferrer, S. Kajarekar, E. Shriberg, and A. Venkataraman. MLLR transforms as features in speaker recognition. In Proc. INTERSPEECH, pages 2425–2428, 2005. [145] T. Takiguchi, M. Yoshii, Y. Ariki, and J. Bilmes. Acoustic model transformations based on random projections. In Proc. ICASSP, 2012. [146] E.M. Tapia, S.S. Intille, W. Haskell, K. Larson, J. Wright, A. King, and R. Fried- man. Real-time recognition of physical activities and their intensities using wire- less accelerometers and a heart monitor. In IEEE International Symposium on Wearable Computers, ISWC, 2007. [147] G. Thatte, M. Li, A. Emken, U. Mitra, S. Narayanan, M. Annavaram, and D. Spruijt-Metz. Energy-Efficient Multihypothesis Activity-Detection for Health- Monitoring Applications. In International Conference of the IEEE engineering in Medicine and Biology Society, EMBS, 2009. 143 [148] G. Thatte, V. Rozgic, M. Li, S. Ghosh, U. Mitra, S. Narayanan, M. Annavaram, and D. Spruijt-Metz. Optimal Allocation of Time-Resources for Multihypothe- sis Activity-Level Detection. In IEEE International Conference on Distributed Computing in Sensor Systems, DCOSS, 2009. [149] M. Thibeault, L. M´ enard, S.R. Baum, G. 
Appendix

.1 Journal papers

Ming Li, Kyu J. Han and Shrikanth Narayanan, “Automatic Speaker Age and Gender Recognition using acoustic and prosodic level information fusion”, Computer Speech and Language, 27(1):151–167, 2013.

Ming Li, Viktor Rozgic, Gautam Thatte, Sangwon Lee, Adar Emken, Murali Annavaram, Urbashi Mitra, Donna Spruijt-Metz and Shrikanth Narayanan, “Multimodal Physical Activity Recognition by Fusing Temporal and Cepstral Information”, IEEE Transactions on Neural Systems & Rehabilitation Engineering, vol. 18, issue 4, August 2010.

Ming Li and Shrikanth Narayanan, “Simplified Supervised I-vector Modeling and Sparse Representation with Application to Robust Language Recognition”, submitted to Computer Speech and Language (in revision).

U. Mitra, A. Emken, S. Lee, M. Li, V. Rozgic, G. Thatte, H. Vathsangam, D. Zois, M. Annavaram, S. Narayanan, D. Spruijt-Metz and G. Sukhatme, “KNOWME: A Case Study in Wireless Body Area Sensor Network Design”, IEEE Communications Magazine, 50(5):116–125, 2012.
Gautam Thatte, Ming Li, Sangwon Lee, Adar Emken, Murali Annavaram, Shri Narayanan, Donna Spruijt-Metz and Urbashi Mitra, “Optimal Time-Resource Allocation for Energy Efficient Physical Activity Detection”, IEEE Transactions on Signal Processing, vol. 59, issue 4, April 2011.

Gautam Thatte, Ming Li, Sangwon Lee, Adar Emken, Shri Narayanan, Urbashi Mitra, Donna Spruijt-Metz and Murali Annavaram, “KNOWME: An Energy-Efficient and Multimodal Body Area Sensing System for Physical Activity Monitoring”, ACM Transactions on Embedded Computing Systems, vol. 11, no. S2, August 2012.

Adar Emken, Ming Li, Gautam Thatte, Sangwon Lee, Murali Annavaram, Urbashi Mitra, Shrikanth Narayanan and Donna Spruijt-Metz, “Recognition of Physical Activities in Overweight Hispanic Youth using KNOWME Networks”, Journal of Physical Activity and Health, 9(3):432–441, 2012.

Daniel Bone, Ming Li, Matthew P. Black and Shrikanth S. Narayanan, “A Fusion Framework with Speaker-Normalized Hierarchical Functionals and GMM Supervectors”, Computer Speech and Language, 2012.

.2 Conference papers

Ming Li, Andreas Tsiartas, Maarten Van Segbroeck and Shrikanth S. Narayanan, “Speaker Verification Using Simplified and Supervised I-vector Modeling”, ICASSP 2013.

Ming Li, Jangwon Kim, Prasanta Kumar Ghosh, Vikram Ramanarayanan and Shrikanth Narayanan, “Speaker verification based on fusion of acoustic and articulatory information”, Interspeech 2013.

Ming Li, Adam Lammert, Jangwon Kim, Prasanta Ghosh and Shrikanth Narayanan, “Automatic Classification of Palatal and Pharyngeal Wall Morphology Patterns from Speech Acoustics and Inverted Articulatory Signals”, submitted to the Workshop on Speech Production in Automatic Speech Recognition 2013.

Ming Li, Charley Lu, Anne Wang and Shrikanth Narayanan, “Speaker Verification using Lasso based Sparse Total Variability Supervector and Probabilistic Linear Discriminant Analysis”, presented at the NIST Speaker Recognition Workshop, Atlanta, 2011; published in Proceedings of the APSIPA Annual Summit and Conference, Hollywood, CA, 2012.

Ming Li, Angeliki Metallinou, Daniel Bone and Shrikanth Narayanan, “Speaker states recognition using latent factor analysis based Eigenchannel factor vector modeling”, ICASSP 2012.

Ming Li, Charley Lu, Anne Wang and Shrikanth Narayanan, “Speaker Verification using Lasso based Sparse Total Variability Supervector and Probabilistic Linear Discriminant Analysis”, NIST Speaker Recognition Workshop, Atlanta, 2011.

Ming Li, Xiang Zhang, Yonghong Yan and Shrikanth Narayanan, “Speaker Verification using Sparse Representations on Total Variability I-Vectors”, INTERSPEECH, 2011.

Ming Li and Shrikanth Narayanan, “Robust talking face video verification using joint factor analysis and sparse representation on GMM mean shifted supervectors”, ICASSP, 2011.

Ming Li and Shrikanth Narayanan, “ECG Biometrics by Fusing Temporal and Cepstral Information”, ICPR 2010.

Ming Li, Chi-Sang Jung and Kyu Jeong Han, “Combining Five Acoustic Level Methods for Automatic Speaker Age and Gender Recognition”, INTERSPEECH, 2010.

Ming Li, Adar Emken, Shri Narayanan, Gautam Thatte, Sangwon Lee, Harshvardhan Vathsangam, Gaurav Sukhatme, Urbashi Mitra, Murali Annavaram and Donna Spruijt-Metz, “Using the KNOWME Networks Mobile Biomonitoring System to Characterize Physical Activity in Overweight Hispanic Youth”, ACSM Health and Fitness Summit, Austin, TX (April 2010).

Ming Li, Charley Lu, Anne Wang and Shrikanth Narayanan, “Speaker Verification using Lasso based Sparse Total Variability Supervector with PLDA modeling”, APSIPA, 2012.
Andreas Tsiartas, Theodora Chaspari, Nassos Katsamanis, Prasanta Ghosh, Ming Li, Maarten Van Segbroeck, Alexandros Potamianos and Shrikanth Narayanan, “Multi-resolution long-term signal variability features for robust voice activity detection”, Interspeech 2013.

Jangwon Kim, Naveen Kumar, Andreas Tsiartas, Ming Li and Shrikanth S. Narayanan, “Intelligibility classification of pathological speech using fusion of multiple high level descriptors”, Interspeech, 2012.

Kartik Audhkhasi, Angeliki Metallinou, Ming Li and Shrikanth S. Narayanan, “Speaker Personality Classification Using Systems Based on Acoustic-Lexical Cues and an Optimal Tree-Structured Bayesian Network”, Interspeech, 2012.

Daniel Bone, Matthew P. Black, Ming Li, Angeliki Metallinou, Sungbok Lee and Shrikanth S. Narayanan, “Intoxicated Speech Detection by Fusion of Speaker Normalized Hierarchical Features and GMM Supervectors”, Interspeech, 2011.

Samuel Kim, Ming Li, Sangwon Lee, Urbashi Mitra, Adar Emken, Donna Spruijt-Metz, Murali Annavaram and Shrikanth Narayanan, “Modeling high-level descriptions of real-life physical activities using latent topic modeling of multimodal sensor signals”, EMBC, 2011.

Gautam Thatte, Viktor Rozgic, Ming Li, Sabyasachi Ghosh, Urbashi Mitra, Shri Narayanan, Murali Annavaram and Donna Spruijt-Metz, “Optimal Time-Resource Allocation for Activity-Detection via Multimodal Sensing”, BodyNets, Los Angeles, CA (April 2009).

Gautam Thatte, Viktor Rozgic, Ming Li, Sabyasachi Ghosh, Urbashi Mitra, Shri Narayanan, Murali Annavaram and Donna Spruijt-Metz, “Optimal Allocation of Time-Resources for Multihypothesis Activity-Level Detection”, DCOSS, Marina Del Rey, CA (June 2009).

Sangwon Lee, Murali Annavaram, Gautam Thatte, Viktor Rozgic, Ming Li, Urbashi Mitra, Shri Narayanan and Donna Spruijt-Metz, “Sensing for Obesity: KNOWME Implementation and Lessons for an Architect”, BiC, Austin, TX (June 2009).

Gautam Thatte, Ming Li, Adar Emken, Urbashi Mitra, Shri Narayanan, Murali Annavaram and Donna Spruijt-Metz, “Energy-Efficient Multihypothesis Activity-Detection for Health-Monitoring Applications”, EMBC (September 2009).

Donna Spruijt-Metz, Ming Li, Gautam Thatte, Gaurav Sukhatme, Murali Annavaram, Sabyasachi Ghosh, Viktor Rozgic, Urbashi Mitra, Nenad Medvidovic, Britni Belcher and Shrikanth Narayanan, “Differentiating physical activity modalities in youth using heartbeat waveform shape and differences between adjacent waveforms”, 7th International Conference on Diet and Activity Methods (ICDAM 7), Washington DC (June 2009).

Gautam Thatte, Ming Li, Adar Emken, Urbashi Mitra, Shri Narayanan, Murali Annavaram and Donna Spruijt-Metz, “Energy-Efficient Activity-Detection via Multihypothesis Testing for Pediatric Obesity”, 7th Annual CENS Research Review, Los Angeles, CA (October 2009).

D. Spruijt-Metz, S. Narayanan, U. Mitra, G. Sukhatme, M. Li, G. Thatte, A. Emken, S. Lee, H. Vathsangam and M. Annavaram, “KNOWME Networks: Mobile Device Biomonitoring to Prevent and Treat Obesity in Underserved Minority Youth”, mHealth Summit, Washington, DC (October 2009).

Donna Spruijt-Metz, et al., “Decreasing Sedentary Behavior in Overweight Youth Using a Real-Time Mobile Intervention”, Wireless Health 2012.
Abstract
The goal of this work is to enhance the robustness and efficiency of multimodal human states recognition. Human states recognition is used here as a joint term for identifying or verifying various kinds of human-related states, such as biometric identity, language spoken, age, gender, emotion, intoxication level, physical activity, vocal tract patterns, ECG QT intervals and so on. I performed research on the aforementioned states recognition problems, with the focus on increasing performance while reducing computational cost.

I start by extending the well-known total variability i-vector modeling (a factor analysis on the concatenated GMM mean supervectors) to simplified supervised i-vector modeling to enhance robustness and efficiency. First, by concatenating the label vector and the linear classifier matrix at the end of the mean supervector and the i-vector factor loading matrix, respectively, the traditional i-vectors are extended to label regularized supervised i-vectors. These supervised i-vectors are optimized not only to reconstruct the mean supervectors well but also to minimize the mean square error between the original and the reconstructed label vectors, which makes them more discriminative with respect to the regularized label information. Second, I perform the factor analysis (FA) on the pre-normalized GMM first-order statistics supervector so that each Gaussian component's statistics sub-vector is treated equally in the FA, which reduces the computational cost by a factor of 25. Since only one global total frame count appears in the resulting equation, I build a global table of the resulting matrices indexed by the log of that frame count. By looking up this table, the computational cost of each utterance's i-vector extraction is further reduced by a factor of 4 with only a small quantization error. I demonstrate the utility of the simplified supervised i-vector representation on both the language identification (LID) and speaker verification (SRE) tasks, achieving comparable or better performance with a significant reduction in computational cost.
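To illustrate the table-lookup idea described above, the following is a minimal NumPy sketch with toy dimensions; the grid of log frame counts, the matrix values, and the function name are illustrative assumptions rather than the exact implementation evaluated in this thesis.

import numpy as np

rng = np.random.default_rng(0)
C, F, R = 64, 20, 50                      # GMM components, feature dim, i-vector rank (toy sizes)
CF = C * F
T = 0.01 * rng.standard_normal((CF, R))   # total variability (factor loading) matrix
Sigma_inv = np.ones(CF)                   # inverse of the diagonal covariance supervector

# With a single global frame count n per utterance, the posterior precision
# I + n * T' Sigma^-1 T depends on the utterance only through the scalar n,
# so its inverse can be tabulated over a grid of log(n) values in advance.
TtST = (T * Sigma_inv[:, None]).T @ T     # T' Sigma^-1 T, computed once
log_n_grid = np.linspace(np.log(50), np.log(50000), 64)
table = [np.linalg.inv(np.eye(R) + np.exp(g) * TtST) for g in log_n_grid]

def extract_ivector(F_centered, n_frames):
    # F_centered: pre-normalized, centered first-order statistics supervector, shape (CF,)
    idx = int(np.argmin(np.abs(log_n_grid - np.log(n_frames))))   # nearest table entry
    return table[idx] @ (T.T @ (Sigma_inv * F_centered))          # small quantization error

w = extract_ivector(rng.standard_normal(CF), n_frames=3000)
print(w.shape)                            # (50,)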
Inspired by the recent success of sparse representation in face recognition, I explored adopting sparse representation for both representation and classification in this multimodal human states recognition problem. For classification, first, a sparse representation computed by l1-minimization (approximating the l0-minimization) with quadratic constraints was proposed to replace the SVM on the GMM mean supervectors; fusing this sparse representation based classification (SRC) method with the SVM improved the overall system performance. Second, by appending a redundant identity matrix to the original over-complete dictionary, the sparse representation is made more robust to variability and noise. Third, both the l1 norm ratio and the background normalized (BNorm) l2 residual ratio are used and shown to outperform the conventional l2 residual ratio in the verification task. I demonstrated the use of SRC on GMM mean supervectors, total variability i-vectors, and UBM weight posterior probability (UWPP) supervectors for the face video verification, speaker verification and age/gender identification tasks, respectively. For representation, rather than projecting the GMM mean supervector onto the low-rank factor loading matrix, I project the mean supervector onto a dictionary of much larger rank to generate sparse coefficient vectors (s-vectors), and show that the K-SVD algorithm can be adopted to learn this dictionary. I fuse the s-vector systems with other methods to improve the overall performance on the LID and SRE tasks.
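To make the SRC recipe concrete, here is a toy sketch in Python (NumPy and scikit-learn), assuming random exemplar supervectors as dictionary atoms, an l1-penalized Lasso solver in place of the exact quadratically constrained formulation, and a class-wise l2 residual decision rule; the dimensions and the regularization value are illustrative only.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
dim, per_class, n_classes = 60, 5, 3
blocks = [rng.standard_normal((dim, per_class)) for _ in range(n_classes)]
blocks = [b / np.linalg.norm(b, axis=0) for b in blocks]         # unit-norm exemplar atoms per class
D = np.hstack(blocks + [np.eye(dim)])                            # over-complete dictionary + identity block for noise

x = blocks[2][:, 0] + 0.05 * rng.standard_normal(dim)            # noisy test sample drawn from class 2

coder = Lasso(alpha=0.005, fit_intercept=False, max_iter=50000)  # l1-regularized sparse coding
coder.fit(D, x)
a = coder.coef_

# decide by the class-wise reconstruction residual, using each class's atoms only
residuals = []
for c in range(n_classes):
    a_c = np.zeros_like(a)
    a_c[c * per_class:(c + 1) * per_class] = a[c * per_class:(c + 1) * per_class]
    residuals.append(np.linalg.norm(x - D @ a_c))
print("predicted class:", int(np.argmin(residuals)))             # should typically print 2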
I also present an automatic speaker affective state recognition approach that models factor vectors within the latent factor analysis framework, improving upon the Gaussian mixture model (GMM) baseline performance. I consider the affective speech signal as the speaker's normal, average speech signal corrupted by affective channel effects. Rather than reducing the channel variability to enhance robustness, as in the speaker verification task, I directly model the speaker state through the channel factors under the factor analysis framework. Experimental results show that the proposed speaker state factor vector modeling system achieved unweighted and weighted accuracy improvements over the GMM baseline on the intoxicated speech detection and emotion recognition tasks, respectively.

To summarize the methods for representation, I propose a general optimization framework. The aforementioned methods, such as traditional factor analysis, i-vectors, supervised i-vectors, simplified i-vectors and s-vectors, are all special cases of this general optimization problem. In the future, I plan to investigate other kinds of distance measures, cost functions and constraints within this unified optimization framework.

I use two examples to demonstrate my work on domain-specific novel features and multimodal information fusion for the human states recognition task. The first application is speaker verification based on the fusion of acoustic and articulatory information. We propose a practical, feature-level fusion approach for combining acoustic and articulatory information in the speaker verification task. We find that concatenating articulation features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves the overall speaker verification performance. However, since access to measured articulatory data is impractical for real-world speaker verification applications, we also experiment with estimated articulatory features obtained using an acoustic-to-articulatory inversion technique. Specifically, we show that augmenting MFCCs with articulatory features obtained from a subject-independent acoustic-to-articulatory inversion technique also significantly enhances the speaker verification performance. This performance boost could be due to the information about inter-speaker variation present in the estimated articulatory features, especially at the mean and variance level.

The second example is multimodal physical activity recognition. A physical activity (PA) recognition algorithm for a wearable wireless sensor network using both ambulatory electrocardiogram (ECG) and accelerometer signals is proposed. First, in the time domain, the cardiac activity mean and the motion artifact noise of the ECG signal are modeled by a Hermite polynomial expansion and principal component analysis, respectively. A set of time-domain accelerometer features is also extracted. A support vector machine (SVM) is employed for supervised classification using these time-domain features. Second, motivated by their potential for handling convolutional noise, cepstral features extracted from the ECG and accelerometer signals through frame-level analysis are modeled using Gaussian mixture models (GMMs). Third, to reduce the dimension of the tri-axial accelerometer cepstral features, which are concatenated and fused at the feature level, heteroscedastic linear discriminant analysis is performed. Finally, to improve the overall recognition performance, the multi-modal (ECG and accelerometer) and multi-domain (time-domain SVM and cepstral-domain GMM) subsystems are fused at the score level.
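As a small illustration of this final score-level fusion step, the sketch below linearly combines per-class scores from the two subsystems after a simple per-system normalization; the example scores, the fusion weight, and the z-score normalization are illustrative assumptions rather than the exact fusion rule used in this work.

import numpy as np

def zscore(s):
    # normalize each subsystem's scores so they are comparable before fusion
    return (s - s.mean()) / (s.std() + 1e-9)

svm_scores = np.array([0.2, 1.5, -0.3, 0.8])      # time-domain SVM: one score per activity class
gmm_scores = np.array([-4.1, -2.0, -5.5, -2.3])   # cepstral-domain GMM: average log-likelihoods per class

w = 0.6                                           # fusion weight, which would be tuned on held-out data
fused = w * zscore(svm_scores) + (1 - w) * zscore(gmm_scores)
print("fused decision:", int(np.argmax(fused)))   # index of the class with the highest fused score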