Representation, Classification and Information Fusion for Robust and Efficient Multimodal Human States Recognition, by Ming Li. A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING). August 2013. Copyright 2013 Ming Li
Object Description
Title | Representation, classification and information fusion for robust and efficient multimodal human states recognition |
Author | Li, Ming |
Author email | mingli@usc.edu;ming.li.ioa@gmail.com |
Degree | Doctor of Philosophy |
Document type | Dissertation |
Degree program | Electrical Engineering |
School | Viterbi School of Engineering |
Date defended/completed | 2013-05-14 |
Date submitted | 2013-08-06 |
Date approved | 2013-08-06 |
Restricted until | 2013-08-06 |
Date published | 2013-08-06 |
Advisor (committee chair) | Narayanan, Shrikanth S. |
Advisor (committee member) | Kuo, C.-C. Jay; Ortega, Antonio K.; Sha, Fei |
Abstract | The goal of this work is to enhance the robustness and efficiency of multimodal human states recognition. Human states recognition is a joint term for identifying/verifying various kinds of human-related states, such as biometric identity, language spoken, age, gender, emotion, intoxication level, physical activity, vocal tract patterns, ECG QT intervals, and so on. I performed research on the aforementioned recognition problems, focusing on increasing performance while reducing computational cost. ❧ I start by extending the well-known total variability i-vector modeling (a factor analysis on the concatenated GMM mean supervectors) to simplified supervised i-vector modeling to enhance robustness and efficiency. First, by concatenating the label vector and the linear classifier matrix at the end of the mean supervector and the i-vector factor loading matrix, respectively, the traditional i-vectors are extended to label-regularized supervised i-vectors. These supervised i-vectors are optimized not only to reconstruct the mean supervectors well but also to minimize the mean square error between the original and reconstructed label vectors, which makes them more discriminative with respect to the regularized label information. Second, I perform the factor analysis (FA) on the pre-normalized GMM first-order statistics supervector to ensure that each Gaussian component's statistics sub-vector is treated equally in the FA, which reduces the computational cost by a factor of 25. Since there is only one global total frame number in the equation, I build a global lookup table of the resulting matrices indexed by the log of that number. By consulting this table, the computational cost of each utterance's i-vector extraction is further reduced by a factor of 4 with small quantization error.
I demonstrate the utility of the simplified supervised i-vector representation on both the language identification (LID) and speaker verification (SRE) tasks, achieving comparable or better performance with significant computational cost reduction. ❧ Inspired by the recent success of sparse representation in face recognition, I explored adopting sparse representation for both representation and classification in the multimodal human states recognition problem. For classification, a sparse representation computed by l1-minimization (approximating the l0-minimization) with quadratic constraints was proposed to replace the SVM on the GMM mean supervectors, and fusing the sparse representation based classification (SRC) method with the SVM improved the overall system performance. Second, by appending a redundant identity matrix to the original over-complete dictionary, the sparse representation is made more robust to variability and noise. Third, both the l1-norm ratio and the background-normalized (BNorm) l2 residual ratio are used and shown to outperform the conventional l2 residual ratio in the verification task. I showed the usage of SRC on GMM mean supervectors, total variability i-vectors, and UBM weight posterior probability (UWPP) supervectors for the face video verification, speaker verification, and age/gender identification tasks, respectively. For representation, rather than projecting the GMM mean supervector on the low-rank factor loading matrix, I project the mean supervector on a large-rank dictionary to generate sparse coefficient vectors (s-vectors). I show that the K-SVD algorithm can be adopted to learn this dictionary. I fuse the s-vector systems with other methods to improve the overall performance in the LID and SRE tasks.
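The SRC idea just described (solve an l1-regularized reconstruction against a dictionary of training vectors, then pick the class whose atoms best explain the test vector) can be sketched in a few lines. The ISTA solver, the toy orthonormal dictionary, and the parameter names below are illustrative assumptions, not the implementation used in the dissertation:

```python
import numpy as np

def ista(D, y, lam=0.1, n_iter=200):
    """Iterative shrinkage-thresholding for min ||Dx - y||^2 + lam * ||x||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = x - (D.T @ (D @ x - y)) / L    # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam / (2 * L), 0.0)  # soft threshold
    return x

def src_classify(D, labels, y, lam=0.1):
    """Sparse-representation classification: keep only the coefficients
    belonging to each class in turn and return the class with the
    smallest l2 reconstruction residual."""
    x = ista(D, y, lam)
    residuals = {}
    for c in set(labels):
        mask = np.array([lab == c for lab in labels], dtype=float)
        residuals[c] = np.linalg.norm(y - D @ (x * mask))
    return min(residuals, key=residuals.get)
```

The verification-style scores mentioned in the abstract (l1-norm ratio, BNorm l2 residual ratio) would be computed from the same per-class coefficients and residuals rather than from the arg-min alone.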
❧ I also present an automatic speaker affective state recognition approach which models the factor vectors in the latent factor analysis framework, improving upon the Gaussian Mixture Model (GMM) baseline performance. I consider the affective speech signal as a neutral average speech signal corrupted by affective channel effects. Rather than reducing the channel variability to enhance robustness, as in the speaker verification task, I directly model the speaker state on the channel factors under the factor analysis framework. Experimental results show that the proposed speaker state factor vector modeling system achieves unweighted and weighted accuracy improvements over the GMM baseline on the intoxicated speech detection task and the emotion recognition task, respectively. ❧ To summarize the methods for representation, I propose a general optimization framework. The aforementioned methods, such as traditional factor analysis, i-vectors, supervised i-vectors, simplified i-vectors, and s-vectors, are all special cases of this general optimization problem. In the future, I plan to investigate other kinds of distance measures, cost functions, and constraints within this unified general optimization problem. ❧ I use two examples to demonstrate my work in the areas of domain-specific novel features and multimodal information fusion for the human states recognition task. The first application is speaker verification based on the fusion of acoustic and articulatory information. We propose a practical, feature-level fusion approach for combining acoustic and articulatory information in the speaker verification task. We find that concatenating articulatory features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves the overall speaker verification performance.
However, since access to measured articulatory data is impractical for real-world speaker verification applications, we also experiment with estimated articulatory features obtained using an acoustic-to-articulatory inversion technique. Specifically, we show that augmenting MFCCs with articulatory features obtained from a subject-independent acoustic-to-articulatory inversion technique also significantly enhances speaker verification performance. This performance boost could be due to information about inter-speaker variation present in the estimated articulatory features, especially at the mean and variance level. ❧ The second example is multimodal physical activity recognition. A physical activity (PA) recognition algorithm for a wearable wireless sensor network using both ambulatory electrocardiogram (ECG) and accelerometer signals is proposed. First, in the time domain, the cardiac activity mean and the motion artifact noise of the ECG signal are modeled by a Hermite polynomial expansion and principal component analysis, respectively. A set of time-domain accelerometer features is also extracted. A support vector machine (SVM) is employed for supervised classification using these time-domain features. Second, motivated by their potential for handling convolutional noise, cepstral features extracted from the ECG and accelerometer signals based on a frame-level analysis are modeled using Gaussian mixture models (GMMs). Third, to reduce the dimension of the tri-axial accelerometer cepstral features, which are concatenated and fused at the feature level, heteroscedastic linear discriminant analysis is performed. Finally, to improve the overall recognition performance, fusion of the multimodal (ECG and accelerometer) and multi-domain (time-domain SVM and cepstral-domain GMM) subsystems is performed at the score level. |
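The score-level fusion step that closes the PA recognition pipeline can be illustrated with a minimal sketch. The z-normalization used here to put the SVM and GMM scores on a comparable scale before weighting is one common calibration choice and an assumption of this sketch, not necessarily the dissertation's calibration method:

```python
import numpy as np

def fuse_scores(score_lists, weights):
    """Score-level fusion: z-normalize each subsystem's scores so they
    are on a comparable scale, then take a weighted sum."""
    fused = np.zeros(len(score_lists[0]))
    for scores, w in zip(score_lists, weights):
        s = np.asarray(scores, dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-12)  # per-subsystem z-norm
        fused += w * s
    return fused
```

In practice the weights would be tuned on a development set; the same routine applies whether the subsystems differ by modality (ECG vs. accelerometer) or by domain (time-domain SVM vs. cepstral-domain GMM).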
Keyword | human state characterization; speaker verification; language identification; multimodal biometrics; emotion recognition; simplified supervised i-vector; sparse representation; physical activity recognition; ECG processing; speech production; articulation; vocal tract morphology |
Language | English |
Format (imt) | application/pdf |
Part of collection | University of Southern California dissertations and theses |
Publisher (of the original version) | University of Southern California |
Place of publication (of the original version) | Los Angeles, California |
Publisher (of the digital version) | University of Southern California. Libraries |
Provenance | Electronically uploaded by the author |
Type | texts |
Legacy record ID | usctheses-m |
Contributing entity | University of Southern California |
Rights | Li, Ming |
Physical access | The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given. |
Repository name | University of Southern California Digital Library |
Repository address | USC Digital Library, University of Southern California, University Park Campus MC 7002, 106 University Village, Los Angeles, California 90089-7002, USA |
Repository email | cisadmin@lib.usc.edu |
Filename | etd-LiMing-1970.pdf |
Archival file | uscthesesreloadpub_Volume7/etd-LiMing-1970.pdf |