ADVANCED FEATURES AND FEATURE SELECTION METHODS FOR VIBRATION AND AUDIO SIGNAL CLASSIFICATION

by Enshuo Tsau

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2012

Copyright 2012 Enshuo Tsau

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
    1.2.1 Engine Fault Detection
    1.2.2 Environmental Sound Recognition
    1.2.3 Fundamental Frequency Estimation for Music Signals
  1.3 Contributions of this Research
  1.4 Organization of the Dissertation
Chapter 2: Research Background
  2.1 Matching Pursuit
  2.2 Hilbert-Huang Transform
  2.3 Code Excited Linear Prediction (CELP)
Chapter 3: Environmental Sound Recognition with CELP-based Features
  3.1 Introduction
  3.2 Extraction of CELP-Based Features
    3.2.1 CELP Codec
    3.2.2 CELP-based Features
  3.3 Experimental Results
    3.3.1 Experimental Setup
    3.3.2 Results and Discussion
  3.4 Conclusion and Future Work
Chapter 4: Robust Jet Engine Fault Detection and Diagnosis Using CELP and MFCC Features
  4.1 Introduction
  4.2 Review of Previous Work
    4.2.1 Data Collection
      4.2.1.1 The SR-30 data
      4.2.1.2 The PW4000 data
  4.3 Description of Proposed System
  4.4 Extraction of CELP-Based Features
    4.4.1 CELP Codec
    4.4.2 CELP-based Features
    4.4.3 Pre-Processing Operations
    4.4.4 Feature Selection and Extraction
    4.4.5 Feature Space Reduction
    4.4.6 Classification
      4.4.6.1 Classifier Selection and Design
      4.4.6.2 Classifier Training
    4.4.7 Decision Synthesis
      4.4.7.1 Necessity of Synthesis
      4.4.7.2 Time-Based Synthesis
      4.4.7.3 Stage-Based Synthesis
      4.4.7.4 Sensor-Based Synthesis
      4.4.7.5 Decision Synthesis Conclusion
  4.5 Experimental Results
    4.5.1 Comparison of MFCC/CELP Features
    4.5.2 Vibration Sensor Analysis
      4.5.2.1 Test Parameters
      4.5.2.2 Detailed Results
      4.5.2.3 Vibration Sensor Performance Analysis with PCA Reduction
    4.5.3 Algorithmic Complexity
      4.5.3.1 Classifier Training
      4.5.3.2 Classifier Testing
  4.6 Conclusion and Future Work
    4.6.1 Training Set Selection
Chapter 5: Fundamental Frequency Estimation for Music Signals with Modified Hilbert-Huang Transform
  5.1 Introduction
  5.2 Music Pitch Analysis with Modified HHT
    5.2.1 Weakness of HHT
    5.2.2 Modified HHT for Fundamental Frequency Estimation
  5.3 Experimental Results
  5.4 Conclusion and Future Work
Chapter 6: Content/Context-Adaptive Feature Selection for Environmental Sound Recognition
  6.1 Introduction
  6.2 Review of Previous Work
  6.3 Content/Context-Adaptive Feature Selection
    6.3.1 Context-Adaptive Method
    6.3.2 Content-Adaptive Method
  6.4 Experimental Results
    6.4.1 Experimental Setup
    6.4.2 Results and Discussion
    6.4.3 Length of Samples
    6.4.4 Confusion Matrix Discussion
  6.5 Conclusion and Future Work
Chapter 7: Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
References

List of Tables

3.1 Correct classification rates with different feature sets using the Bayesian network as the classifier.
3.2 The confusion matrix obtained with the CELP features only and the Bayesian network classifier.
3.3 Correct classification rates with different feature sets using the support vector machine as the classifier.
4.1 List of flight stages and associated engine speeds.
4.2 System performance with MFCC features (7.97% error rate).
4.3 System performance with CELP features (6.13% error rate).
4.4 System performance with MFCC + CELP features (6.06% error rate).
4.5 Run time of each vibration sensor.
4.6 Detection error of each vibration sensor.
4.7 System performance and computational complexity for different down-sampling ratios.
4.8 Training set and complexity.
6.1 The confusion matrix obtained with the CELP features only and the Bayesian network classifier.

List of Figures

2.1 The time-frequency representation of a chirp signal with $\omega(t) = 300 \times 2\pi t$ obtained by (a) HHT and (b) the Fourier transform.
2.2 The block diagram of a CELP codec.
2.3 The bit stream of a CELP codec.
3.1 The block diagram of a CELP codec.
3.2 Computation of the CELP-based features, which consist of the 10-dimensional LPC features and the one-dimensional pitch lag feature.
3.3 System diagram.
4.1 PW4000 data set.
4.2 The proposed system.
4.3 The proposed system.
4.4 The block diagram of a CELP codec.
4.5 Computation of the CELP-based features, which consist of the 10-dimensional LPC features and the one-dimensional pitch lag feature.
4.6 Sensor readings pre-synthesis.
4.7 Sensor readings post-synthesis.
4.8 Performance of (a) vibration sensor 1 (sensor 12/15), (b) vibration sensor 2 (sensor 13/15), (c) vibration sensor 3 (sensor 14/15), (d) vibration sensor 4 (sensor 15/15).
4.9 PCA of (a) vibration sensor 1 (sensor 12/15), (b) vibration sensor 2 (sensor 13/15), (c) vibration sensor 3 (sensor 14/15), (d) vibration sensor 4 (sensor 15/15).
4.10 Classification error for PCA reductions on sensor 7 data.
5.1 (a) An example of mode mixing and (b) a signal with an added bubble.
5.2 The modified HHT process for fundamental frequency estimation.
5.3 IMFs of C4 (261 Hz) by sifting (a) without and (b) with the filter bank pre-processing.
5.4 Performance comparison of hit rates of the YIN method, the original HHT method and the modified HHT method.
5.5 Performance comparison of the modified HHT method for notes C2, C3, C4 and C5.
6.1 The conceptual diagram of content/context-adaptive feature selection methods.
6.2 Illustration of the context-based classifier design.
6.3 Classification rate.
6.4 Classification rate.
6.5 Classification rate with different lengths of samples.
6.6 Confusion matrix.

Abstract

An adequate feature set plays a key role in many signal classification and recognition applications. This is a challenging problem due to the nonlinear and non-stationary characteristics of real-world signals, such as engine acoustic/vibration data, environmental sounds, speech signals and musical instrument sounds. Some traditional features, such as the Mel Frequency Cepstral Coefficients (MFCC), may not offer good performance.
Other features, such as those based on the Matching Pursuit (MP) decomposition, may perform better, yet their complexity is very high. In this research, we consider a new feature set that can be easily generated in the model-based signal compression process, known as the Code Excited Linear Prediction (CELP) features. The CELP-based coding algorithm and its variants have been widely used to encode speech and low-bit-rate audio signals. In this research, we examine two applications based on CELP-based features.

First, we present a new approach to engine fault detection and diagnosis based on acoustic and vibration sensor data with MFCC and CELP features. Through proper algorithmic adaptation to the specifics of the data set, the fault conditions of a damaged blade and a bearing failure can, with high probability, be autonomously discovered and identified. The conducted experiments show that CELP features, although generally used in speech applications, are particularly well suited to this problem, in terms of both compactness and detection specificity. Furthermore, the issue of automatic fault detection with different levels of decision resolution is addressed. The low prediction error, coupled with ease of hardware implementation, makes the proposed method an attractive alternative to manual maintenance.

Next, we propose the use of CELP-based features to enhance the performance of the environmental sound recognition (ESR) problem. Traditionally, MFCC features have been used for the recognition of structured data like speech and music. However, their performance for the ESR problem is limited. An audio signal can be well preserved by its highly compressed CELP bit streams, which motivates us to study the CELP-based features for the audio scene recognition problem. We present a way to extract a set of features from the CELP bit streams and compare the performance of ESR using different feature sets with the Bayesian network classifier. It is shown by experimental results that the CELP-based features outperform the MFCC features in the ESR problem by a significant margin, and the integrated MFCC and CELP-based feature set can even reach a correct classification rate of 95.2% using the Bayesian network classifier.

CELP-based features may not be suitable for wideband audio signals such as music signals. To address this problem, we would like to add other new features. One idea is to perform real-time fundamental frequency estimation using a modified Hilbert-Huang transform (HHT), as studied in the last part of this proposal. HHT is a nonlinear transform that is suitable for the analysis of non-stationary AM/FM-like data. However, applying HHT directly to music signals encounters several problems. In this research, we modify HHT so that it is tailored to short-window pitch analysis. It is shown by experimental results that the proposed HHT method performs significantly better than several benchmark schemes.

Finally, for the ESR application with a large number of classes, more features are needed in order to maintain the classification performance. On the other hand, more data are desired to avoid the over-fitting problem. These two become contradicting requirements. We propose two methods to resolve the contradiction: the content-adaptive feature selection method and the context-based feature selection method. The content-adaptive feature selection method selects different features for different testing samples according to their statistics. The context-based feature selection method reduces the load on the classifier by adding a context stage as a preprocessing layer. As a result, we can dramatically decrease the number of features used and adaptively select a good subset of features.

Chapter 1
Introduction

1.1 Significance of the Research

Suitable features are critical to the performance of many classification and recognition systems in signal processing applications such as engine fault diagnosis, environmental sound recognition, speaker/speech recognition and music information retrieval. In the pertinent literature, several features have been used to characterize such signals. For instance, some commonly used features for audio signals include the band energy ratio, frequency roll-off, spectral bandwidth, spectral asymmetry, spectral flatness, zero-crossing rate, energy [ZK01], and the set of Mel Frequency Cepstral Coefficients (MFCC). However, none of these features alone provides efficient performance in every application. As a result, a combination of some of these features is usually used for audio analysis. For good performance, it is crucial to select the right set of features for the application. For example, when the number of features used increases, there may exist redundant and/or irrelevant features, which can potentially have a negative impact on the classification performance. One way to explain this phenomenon is that, as the feature dimension becomes larger, data points become sparser and, as a result, the clustering performance can be affected. It is therefore an important problem to find a suitable set of features. The feature selection problem becomes much more important when the number of classes and the number of features used increase dramatically. However, finding a good feature set and an efficient way of selecting features is challenging, since the result should not only provide better performance but also have lower complexity, easier implementation and the ability to adapt to different applications.

In this research, we focus on suitable feature sets for audio applications, i.e., engine fault diagnosis in Chapter 4, environmental sound recognition (ESR) in Chapter 3, music information retrieval (MIR) in Chapter 5 and ESR with a large number of classes in Chapter 6. The significance of these problems is detailed below.

• In an engine fault diagnosis system, any tiny defect may cause malfunction and, even worse, result in a catastrophe. Thus, it is desirable to develop a robust diagnostic and high-precision inspection mechanism that can discover abnormality and identify fault types at the very early stage. It could prevent unnecessary damage to the machine, reduce the cost of repair or replacement substantially and improve flight security.

• As compared to visual data, audio data is readily available even in challenging conditions such as lack of light and/or visual obstruction. Besides, as compared with video, audio is relatively easy to store and process. Hence, audio scene analysis is easier than video scene analysis. The ESR technique can also be used to enhance the performance of speaker identification and language recognition with environmental sounds in the background.

• For the MIR problem, a good fundamental frequency estimator not only helps resolve the polyphonic problem but also provides a good feature in general audio applications. Several challenging problems along this line are described below.
Estimation of the fundamental frequency (also denoted by F0) plays an important role in music information retrieval (MIR), mechanical fault diagnosis and speech recognition. A number of fundamental frequency estimation methods have been proposed for different applications in the last two decades. Motivated by real-time applications, it is desirable to conduct pitch analysis within a short analysis window in time. In the context of MIR, the fundamental frequency estimation problem is particularly challenging due to the rich variety of musical instruments and expressions (e.g., notes of different durations and polyphonic mixing). Some of the challenges encountered in processing these signals are listed below:

• Multiple sources and nonlinearity of signals. Engine faults arise from a variety of sources, such as high rotor dynamic forces, bearing failures and structural fatigue faults. All of these problems are known to produce specific variations in the sound and vibration patterns of the engines, since they are not operating within normal boundaries. Furthermore, these variations are mostly non-linear and non-stationary. Due to the non-stationary nature of engine signals, traditional analytic tools have limitations in discovering useful features. Besides, environmental sounds are also unstructured and generated from multiple sources. Such signals are not easily analyzed by traditional signal processing tools.

• Polyphonic problem for the Hilbert-Huang Transform (HHT). HHT is a nonlinear analysis tool that is suitable for non-stationary AM-FM data such as quasi-periodic music signals, and it can be used to identify low-pitch signals within a short temporal window. However, applying HHT directly to music signals encounters a major challenge known as the polyphonic problem, where different frequency modes co-exist at the same time location.

• Complexity and implementation issues. Some features are suitable for audio applications, yet they demand high computational complexity. For example, the computation of the Matching Pursuit (MP) features highly depends on the number of adopted atoms $M$, since its computational complexity is $O(MN^2 \log N)$.

1.2 Review of Previous Work

1.2.1 Engine Fault Detection

Engine faults arise from a variety of sources, such as high rotor dynamic forces, bearing failures and structural fatigue faults. All of these problems are known to produce specific variations in the sound and vibration patterns of the engines, because they are not operating within normal boundaries. Previous work in this field has leveraged these facts and sought ways in which information from these signals could be used to infer the state of engine components. In practice, these signals are used not only for the investigation of signal characteristics under various operational scenarios, but also for the extraction of useful features combined with data mining techniques in order to detect whether fault situations occur. The signal analysis and feature extraction approach has proven to be efficient in a wide array of practical applications.

The performance of an engine fault detection system is quite dependent on the selection of efficient and robust features, which can capture the signatures of a problematic condition in certain circumstances, and on the design of a classifier to correctly identify engine conditions. The time-domain approach was considered first and focused mainly on the ratio of the peak value to the root-mean-square (RMS), the probability density method, and the kurtosis method [DS77].
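The peak-to-RMS ratio (crest factor) and kurtosis mentioned above are easy to compute directly from a vibration frame. The sketch below is illustrative only (the function and test-signal names are ours, not from the dissertation); it assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.stats import kurtosis

def time_domain_indicators(x):
    """Classic time-domain condition indicators for one vibration frame."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))            # root-mean-square level
    crest = np.max(np.abs(x)) / rms           # peak-to-RMS (crest factor)
    kurt = kurtosis(x, fisher=False)          # 4th moment; ~3 for Gaussian noise
    return crest, kurt

# Impulsive bearing defects raise both indicators above their baseline.
rng = np.random.default_rng(0)
healthy = rng.normal(size=8000)               # near-Gaussian baseline vibration
faulty = healthy.copy()
faulty[::400] += 8.0                          # synthetic periodic impacts
print(time_domain_indicators(healthy))        # crest ~ 4, kurtosis ~ 3
print(time_domain_indicators(faulty))         # both noticeably larger
```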
Later, band-pass filtering and the shock pulse methods were introduced to exploit the resonance frequency concept. The time-domain tools described above were not adequate for the complexities of engine vibration and acoustic signals.

A frequency-domain approach was developed more recently, and it is widely used nowadays. The intuition behind this approach is that high and low frequency components represent the resonance frequency of the engine and the characteristic defect frequency, respectively. The latter is especially of interest, because it provides additional information about the nature of the defect. In practice, however, it is difficult to find the significant peaks corresponding to these frequencies within the spectrum of measured engine signals because of noise and vibration from other sources. Furthermore, the Fourier transform has a resolution limitation in the low-frequency region. Some mathematical tools, such as decomposing signals into periodic components and adaptive noise canceling (ANC) [CT82], have been proposed to overcome these difficulties, but the improvement is limited. Another approach, known as the High Frequency Resonance Technique (HFRT) [?], performs band-pass filtering on the high frequencies, i.e., the resonance frequency range, and demodulates the resulting signal into the characteristic frequency. However, some artifacts still remain in the processed data. For example, defect frequencies may be submerged in the rising background level of the spectrum, which makes their detection with this method very challenging.

Various techniques have also been proposed to improve the fidelity of features using time-frequency domain signal analysis, particularly based on the Wavelet Transform [ZYL+09, LQ00, TYT04]. Compared to the Fourier transform, the Wavelet Transform can provide a more flexible time-frequency resolution and multi-scale analysis of signals. However, the Continuous Wavelet Transform requires prohibitively large computational resources, since it needs a lot of data at each scale. In [PTC05], the Hilbert-Huang Transform (HHT) is used to deal with nonlinear and non-stationary engine signal characteristics with lower complexity. Based on the principle of empirical mode decomposition (EMD) and the Hilbert transform, signals can be decomposed into several intrinsic mode functions (IMF), which are efficient for representing the instantaneous frequency components of non-stationary signals. However, the IMFs may not provide accurate signal interpretation at low frequencies, and thus the overall system performance can degrade.

Having obtained useful features, robust classifiers were developed to effectively take advantage of the extracted signal information. Principal Component Analysis (PCA) is frequently used as a first step to reveal the internal structure of the features in a way that best explains the variance in the data. Therefore, it can be used to reduce the dimension of the feature space after data collection. W. Sun et al. proposed a PCA-based fault diagnosis system based on decision trees [SCL07]. In a later work, multiple classifiers were used in a decision fusion system which integrated multiple data sources from different types of sensors [NHYT07]. Various other pattern analysis and machine learning techniques have been proposed as good classifiers, such as K-Nearest Neighbors (KNN), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), Independent Component Analysis (ICA), and Artificial Neural Networks (ANN).
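As a concrete illustration of the PCA reduction step described above, the following sketch projects frame-level feature vectors onto their top-k principal components via the SVD. The array shapes and names are our own assumptions, not the systems cited above.

```python
import numpy as np

def pca_reduce(features, k):
    """Project (n_frames, n_dims) feature vectors onto the top-k components."""
    mu = features.mean(axis=0)
    centered = features - mu
    # Right singular vectors of the centered data are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)     # fraction of variance per component
    return centered @ vt[:k].T, explained[:k]

X = np.random.default_rng(1).normal(size=(500, 24))  # e.g. 24-D frame features
X_red, ratios = pca_reduce(X, k=8)
print(X_red.shape, ratios.sum())              # (500, 8) and variance retained
```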
Despite proposed systems combining these techniques, robust engine fault detection and diagnosis based on sound and/or vibration signal analysis is still an open problem, due to poor frequency resolution in the low-frequency region, noise interference, ineffective features, high computational complexity, and the non-stationary nature of the underlying signals. Novel feature extraction and signal analysis techniques are needed.

1.2.2 Environmental Sound Recognition

Suitable features are critical to the performance of ESR systems. Several features have been used to characterize audio signals. They are briefly reviewed below.

One commonly used feature is the set of Mel Frequency Cepstral Coefficients (MFCC). The MFCC features, obtained by taking the cosine transform of a log power spectrum on the nonlinear Mel scale of frequency, capture the short-term power spectrum of an audio signal. The filter banks used in extracting the MFCC features are based on the human auditory system. The MFCC features have been shown to work particularly well for structured sounds, like speech and music. However, environmental sounds contain a large variety of sound sources that can be characterized by narrow spectral peaks, such as the chirping of insects, which MFCCs are not able to capture well. Other commonly used features for audio signals include the band energy ratio, frequency roll-off, spectral bandwidth, spectral asymmetry, spectral flatness, zero-crossing rate, and energy [ZK01]. In some previous work, a combination of some of these features has been used for audio analysis. When the number of features used increases, there may exist redundant and/or irrelevant features, which can potentially have a negative impact on the classification performance. One way to explain this phenomenon is that, as the feature dimension becomes larger, data points become sparser and, as a result, the clustering performance can be affected. Thus, it is an important problem to find a suitable set of features.

Research on unstructured audio recognition, such as environmental sounds, has received less attention than that for structured audio such as speech or music. Only a few studies have been reported, and most of them were conducted with raw environmental audio. To give a couple of examples, sound-based scene analysis was investigated in [CSP98, PTK+02, EPT+06]. Because of the randomness, high variance and other difficulties associated with environmental sounds, their recognition rates are poorer than those for structured audio. This is especially true when the number of sound classes increases. To overcome the insufficiency of MFCCs and other commonly used features, Chu et al. [CNK09] proposed a set of features based on the Matching Pursuit (MP) technique. Although the MP-based features provide good performance, their computational complexity is too high for real-time applications.
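For reference, the MFCC pipeline reviewed above (mel-scale log power spectrum followed by a cosine transform) can be reproduced in a few lines with the third-party librosa library; the dissertation itself does not use librosa, and the file name below is only a placeholder.

```python
import librosa  # third-party MFCC implementation; illustrative only

# Load and resample to 8 kHz, matching the rate used elsewhere in this work.
y, sr = librosa.load("clip.wav", sr=8000)

# 13 MFCCs per frame: power spectrum -> mel filter bank -> log -> DCT.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=256, hop_length=128)
print(mfcc.shape)  # (13, n_frames)
```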
1.2.3 Fundamental Frequency Estimation for Music Signals

A number of fundamental frequency estimation methods have been proposed for different applications in the last two decades. Driven by real-time applications, it is desirable to conduct pitch analysis within a short analysis window in time. In the context of MIR, the fundamental frequency estimation problem is particularly challenging due to the rich variety of musical instruments and expressions (e.g., notes of different durations and polyphonic mixing). The autocorrelation-based method [dCK02] cannot be easily extended to multiple F0s and high-pitch identification. Besides, it needs a sufficiently long analysis frame, whose size is at least twice the fundamental period. The Fourier-transform-based method performs poorly in estimating low-pitch notes, since it is not effective in handling low-frequency musical signals with logarithmic-scale notes [Kla08]. The method of auditory scene analysis (ASA) imitates the human hearing mechanism to separate mixed sounds [WBE06]. An unsupervised learning method known as nonnegative matrix factorization (NMF) was proposed recently in [Con06] to solve this problem. However, very few of these methods can successfully detect a wide pitch range with a short temporal analysis window. In particular, the performance deteriorates significantly in the low-frequency region. Besides, the conventional window length of 93 ms is not suitable for all types of music, especially musical compositions with a very fast rhythm. All of the above observations motivate this research.

1.3 Contributions of this Research

The contributions of this research are discussed below.

1. For engine fault diagnosis, we propose a novel hybrid feature set, composed of Mel Frequency Cepstral Coefficients (MFCC) and Code Excited Linear Prediction (CELP) features, supplemented by a hierarchical SVM classifier after PCA decomposition, in Chapter 4. The conducted experiments show that CELP features, although generally used in speech applications, are particularly well suited to this problem in terms of both compactness and detection specificity. Furthermore, the issue of automatic fault detection with different levels of decision resolution is addressed. The low prediction error, coupled with ease of hardware implementation, makes the proposed method an attractive alternative to manual maintenance.

2. We propose the use of CELP-based features to enhance the performance of the environmental sound recognition (ESR) problem in Chapter 3. Traditionally, MFCC features have been used for the recognition of structured data like speech and music. However, their performance for the ESR problem is limited. We present a way to extract a set of features from the CELP bit streams and compare the performance of ESR using different feature sets with the Bayesian network classifier. We show that the most commonly used features do not work well for environmental sounds. In contrast, the CELP-based features provide excellent performance using the Bayesian network classifier, where the averaged correct classification rate reaches 91.2%. Moreover, we also show that the integrated MFCC and CELP-based features can provide an averaged correct classification rate of 95.2%.

3. HHT [HSL+98] has proved to be useful in certain data analysis applications arising from the mechanical and aerospace engineering disciplines. It is a nonlinear analysis tool suitable for non-stationary AM-FM data such as quasi-periodic music signals. In Chapter 5, we show that HHT can be used to identify low-pitch signals within a short temporal window. However, applying HHT directly to music signals is challenging due to several issues, e.g., mixing frequency modes at the same time location. Most HHT-based signal analysis methods proposed before, e.g., [HWLD05], are not suitable in this application context. Our main contribution here is a modified HHT tailored to the short-window pitch analysis application.

4. For the ESR application with a large number of classes, more features are needed in order to maintain the classification performance.
On the other hand, more data are desired to avoid the over-fitting problem. These two become contradicting requirements. We propose two methods to resolve the contradiction: the content-adaptive feature selection method and the context-based feature selection method. The content-adaptive feature selection method selects different features for different testing samples according to their statistics. The context-based feature selection method reduces the load on the classifier by adding a context stage as a preprocessing layer. As a result, we can dramatically decrease the number of features used and adaptively choose a good subset of features.

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows. The background of this research, including Matching Pursuit, the Hilbert-Huang Transform (HHT) and Code Excited Linear Prediction (CELP), is reviewed in Chapter 2. Then, the proposed engine fault diagnosis system is presented in Chapter 4. Environmental sound recognition using CELP features is described in Chapter 3. The fundamental frequency estimation problem for music signals with the modified HHT is examined in Chapter 5. The content-adaptive and context-based feature selection methods for the ESR problem are discussed in Chapter 6. Finally, concluding remarks and future research directions are presented in Chapter 7.

Chapter 2
Research Background

2.1 Matching Pursuit

Our goal is to obtain the minimum number of basis functions to represent a signal, resulting in a sparse representation. This is an NP-complete problem. Various adaptive approximation techniques have been proposed in the literature, such as the method of frames, basis pursuit, matching pursuit (MP), and orthogonal matching pursuit. All these methods utilize the notion of a dictionary that allows the decomposition of a signal by selecting basis functions from a given dictionary to find the best basis set. Among these, MP is a more efficient, but greedy, approach. By using a dictionary that consists of a wide variety of basis functions, MP provides an efficient way of selecting a small basis set that produces meaningful features as well as a flexible representation. MP is suboptimal in the sense that it may not achieve the sparsest solution for a given dictionary. However, as long as the dictionary is complete, the expansion is guaranteed to converge to a solution where the residual signal has zero energy. Elements in the dictionary are selected based on maximizing the energy removed from the residual signal at each step. Even with a few steps, the algorithm is capable of yielding a reasonable approximation using only a few atoms. For further details, we refer to [7].

The MP result relies on the choice of the dictionary. A dictionary is a set of basis functions (or simply parameterized waveforms) whose linear combination produces an approximate representation of the signal. Several dictionaries have been proposed for MP, including frequency dictionaries (e.g., Fourier), time-scale dictionaries (e.g., Haar), and time-frequency dictionaries (e.g., Gabor). Most dictionaries are complete or overcomplete. It is important for atoms in the dictionary to be discriminative among themselves; otherwise, similar atoms will compete with each other in the MP process, resulting in low weight values distributed among their coefficients. We will go over the Gabor function in more detail, as it is the most relevant to our feature extraction method.
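Before turning to the Gabor dictionary itself, the greedy MP iteration described above (pick the atom that removes the most residual energy, subtract its contribution, repeat) can be sketched as follows. The dictionary layout and atom parameters here are illustrative assumptions, not the exact dictionary used later in this section.

```python
import numpy as np

def gabor_atom(n, scale, freq, shift):
    """Sine-modulated Gaussian (Gabor) atom of length n, unit norm."""
    t = np.arange(n) - shift
    g = np.exp(-np.pi * (t / scale) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / (np.linalg.norm(g) + 1e-12)

def matching_pursuit(x, dictionary, n_atoms):
    """Greedy MP over a (n_total_atoms, n) array of unit-norm atoms."""
    residual = x.astype(float).copy()
    picks = []
    for _ in range(n_atoms):
        corr = dictionary @ residual           # inner product with every atom
        k = int(np.argmax(np.abs(corr)))       # atom removing the most energy
        picks.append((k, corr[k]))
        residual -= corr[k] * dictionary[k]    # peel off its contribution
    return picks, residual

n = 256
dico = np.array([gabor_atom(n, 2 ** p, f, n // 2)
                 for p in range(1, 9) for f in (0.05, 0.1, 0.2)])
x = 3.0 * gabor_atom(n, 16, 0.1, n // 2)       # signal built from one atom
x += 0.1 * np.random.default_rng(2).normal(size=n)
picks, res = matching_pursuit(x, dico, n_atoms=5)
print(picks[0], np.linalg.norm(res))           # first pick matches the true atom
```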
Time-Frequency Dictionaries

A combination of both time and frequency localization is demonstrated by the Gabor dictionary. Gabor functions are sine-modulated Gaussian functions that have been scaled and translated, providing joint time-frequency localization. The scale $s$, which corresponds to the atom's width in time, is drawn from the dyadic sequence $s = 2^p$, up to the size of the atom for which the dictionary is constructed.

[Fig. 1: Approximation (reconstruction) using the first ten coefficients from MP with dictionaries of Gabor (top), Haar (middle) and Fourier (bottom).]
[Fig. 2: Decomposition of a signal item from six different classes as listed, where the top-most signal is the original, followed by the first five bases.]

Example atoms from these dictionaries are given in Fig. 1(a). Because of the nature of sine and cosine functions, the Fourier dictionary is more suitable for high-frequency data, while the Haar wavelet dictionary is better for more stable, lower-frequency signals. The Gabor representation combines the advantages of these two dictionaries, characterizing the signal in both the time and frequency domains and permitting a more general representation. Fig. 1(b) demonstrates the effectiveness of reconstructing a signal using only a small subset of coefficients. Gabor atoms result in the lowest reconstruction error, as compared with the Haar or Fourier transforms using the same number of coefficients. Due to the non-homogeneous nature of environmental sounds, we hypothesize that features with these Gabor properties would benefit a classification system: they provide the flexibility to capture the time and frequency localization of unstructured sounds, yielding a more general representation. In the following, we focus on the Gabor function. We chose N = 256 and m = 8; in other words, a dictionary of 1120 Gabor atoms of length 256 was generated by using atom scales of powers of two from 2 to 256 and a translation of a quarter of the atom. We use a logarithmic frequency scale, where the frequency was chosen to be a parabolic function that distributes frequencies so as to allow a higher resolution of histogram bins at lower frequencies and a lower resolution at higher frequencies. The reason for the more subtle granularity at lower frequencies is that more object types occur in these ranges, and we wish to capture the finer differences between them. Since we use discrete atoms, the choice of index resolution affects the discriminative power of the atoms. The phase was kept constant. We try to keep the dictionary size small, since a large dictionary demands higher complexity. Fig. 2 demonstrates a decomposition of a signal using Gabor atoms. We observe differences in the bases between different types of environmental sounds.

2.2 Hilbert-Huang Transform

HHT contains two main steps: 1) empirical mode decomposition (EMD) and 2) the Hilbert transform. They are detailed below.

EMD, also called sifting, serves as a pre-processing step to the Hilbert transform. It adopts a nonlinear procedure to decompose an input signal into several intrinsic mode functions (IMFs). An IMF is a function that satisfies the following two conditions:

1. The number of extrema and the number of zero crossings should be equal or differ at most by one (i.e., the local maxima have to be positive and the local minima must be negative).

2. The mean function (the mean of the envelopes defined by the local maxima and minima) should be close to zero (i.e., an IMF oscillates symmetrically around zero).
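These two conditions can be checked numerically; the rough sketch below counts extrema against zero crossings and tests whether the envelope mean stays near zero. The tolerance, the piecewise-linear envelopes and the helper name are our own simplifications.

```python
import numpy as np
from scipy.signal import argrelextrema

def looks_like_imf(h, tol=0.1):
    """Rough numerical test of the two IMF conditions (illustrative sketch)."""
    maxima = argrelextrema(h, np.greater)[0]
    minima = argrelextrema(h, np.less)[0]
    if len(maxima) < 2 or len(minima) < 2:
        return False
    # Condition 1: #extrema and #zero-crossings equal or differ by at most one.
    zero_cross = int(np.sum(np.diff(np.sign(h)) != 0))
    cond1 = abs(len(maxima) + len(minima) - zero_cross) <= 1
    # Condition 2: mean of the max/min envelopes stays close to zero.
    t = np.arange(len(h))
    upper = np.interp(t, maxima, h[maxima])    # crude piecewise-linear envelopes
    lower = np.interp(t, minima, h[minima])
    cond2 = np.max(np.abs(upper + lower)) / 2 < tol * np.max(np.abs(h))
    return cond1 and cond2
```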
These IMFs form a multi-scale representation of the original signal in the time domain.

For an input signal $x(t)$, we first find its local maxima and construct an upper envelope $x_U(t)$ by cubic spline fitting. Similarly, we can construct its lower envelope $x_L(t)$ from its local minima. The EMD process then finds the mean function, $m(t)$, of $x_L(t)$ and $x_U(t)$. An IMF candidate, denoted by $h(t)$, can be acquired via $x(t) - m(t)$. If $h(t)$ meets the two IMF criteria, we set the first IMF $c_1(t) = h(t)$. Otherwise, we use the residual $r(t) = x(t) - h(t)$ as the new input signal and apply the same process until the first IMF is found. Based on the above description, we can state the EMD process formally below. Let $x(t) = r_0(t)$ be the initial residual.

1. For the $i$th IMF, we set $h_{i,0}(t) = r_{i-1}(t)$. Initially, $i = 1$.

2. We perform the iteration
$$h_{i,k}(t) = h_{i,k-1}(t) - m_{i,k-1}(t), \quad k = 1, 2, \cdots,$$
where $m_{i,k-1}(t)$ is the mean of the local-maxima and local-minima envelopes of $h_{i,k-1}(t)$. The iteration continues until $h_{i,k}(t)$ satisfies the two IMF requirements.

3. Set $h_{i,k}(t)$ to be the $i$th IMF $c_i(t)$. Compute the residual $r_i(t) = h_{i,0}(t) - c_i(t)$. If the energy of $r_i(t)$ is sufficiently small, or $r_i(t)$ becomes a monotonic function (or it does not have enough local extrema), the iteration terminates. Otherwise, we go to Step 1 to determine the $(i+1)$th IMF.

The stopping criterion in Step 2 used by Huang [HSL+98] is to check whether the normalized standard deviation of two consecutive siftings, defined by
$$SD = \sum_{t=0}^{T} \frac{|h_{1,k-1}(t) - h_{1,k}(t)|^2}{h_{1,k-1}^2(t)},$$
is less than a threshold value (often set to a number between 0.2 and 0.3). If this is true, the iteration in Step 2 stops. Thus, we can decompose the input signal into the following sum:
$$x(t) = \sum_{i=1}^{I} c_i(t) + r_I(t).$$
To conclude, the IMFs and the last residual form an adaptive, data-driven representation set for the original signal.

By applying the Hilbert transform to each IMF $c_i(t)$, we can obtain the corresponding Hilbert spectrum $H_i(\omega, t)$. Mathematically, it can be written as
$$x(t) = \mathrm{Re}\left[a(t)\, e^{j \int \omega(t)\,dt}\right],$$
where $\omega(t)$ and $a(t)$ are the instantaneous frequency and amplitude, respectively. Unlike the Fourier transform, which uses a fixed frequency globally (i.e., $\omega(t) = \omega$), HHT allows the frequency to be a time-varying function for better local analysis. Since $\omega(t)$ is a function of time, we can estimate the local frequency at any time stamp without the time-frequency resolution constraint.

It is well known that HHT is a powerful analytical tool for non-stationary AM-FM-like signals. The spirit of EMD is to decompose an input signal into a sum of simple AM-FM narrowband signals so that the local estimation of an AM-FM signal by the Hilbert transform does not interfere with the other AM-FM signals [HSL+98]. For instance, Fig. 2.1 compares the Hilbert spectrum and the Fourier transform of a chirp signal whose frequency component varies with time. It is observed that HHT demonstrates greater resolution than the Fourier transform, where the white color represents higher energy. Furthermore, the local analysis capability of HHT allows us to analyze time-varying signals in a shorter time window. As a result, HHT offers an excellent analytical tool for music note estimation.
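The sifting procedure above translates almost line by line into code. The sketch below uses cubic-spline envelopes and the normalized-SD stopping rule; it is a simplified illustration under our own termination choices, not a production EMD implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def mean_envelope(h):
    """m(t): mean of the cubic-spline upper and lower envelopes of h."""
    t = np.arange(len(h))
    mx = argrelextrema(h, np.greater)[0]
    mn = argrelextrema(h, np.less)[0]
    if len(mx) < 4 or len(mn) < 4:
        return None                            # too few extrema: stop decomposing
    upper = CubicSpline(mx, h[mx])(t)
    lower = CubicSpline(mn, h[mn])(t)
    return (upper + lower) / 2.0

def emd(x, sd_thresh=0.25, max_imfs=8):
    """Sift x into IMFs: h_{i,k} = h_{i,k-1} - m_{i,k-1} until SD < threshold."""
    imfs, residual = [], x.astype(float).copy()
    for _ in range(max_imfs):
        h = residual.copy()
        while True:
            m = mean_envelope(h)
            if m is None:
                return imfs, residual          # residual is (near) monotonic
            h_new = h - m
            sd = float(np.sum((h - h_new) ** 2 / (h ** 2 + 1e-12)))
            h = h_new
            if sd < sd_thresh:                 # normalized-SD stopping criterion
                break
        imfs.append(h)
        residual = residual - h                # r_i = r_{i-1} - c_i
    return imfs, residual
```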
[Figure 2.1: The time-frequency representation of a chirp signal with $\omega(t) = 300 \times 2\pi t$ obtained by (a) HHT and (b) the Fourier transform.]

Knowing the well-behaved Hilbert transforms of the IMF components is only the starting point. Unfortunately, most data are not IMFs. At any given time, the data may involve more than one oscillatory mode; that is why the simple Hilbert transform cannot provide the full description of the frequency content for general data, as reported by Long et al. (1995). We have to decompose the data into IMF components. Here, we introduce a method to deal with both non-stationary and nonlinear data by decomposing the signal first, and discuss the physical meaning of this decomposition later. Contrary to almost all previous methods, this method is intuitive, direct, a posteriori and adaptive, with the basis of the decomposition based on, and derived from, the data. The decomposition is based on the following assumptions: (1) the signal has at least two extrema, one maximum and one minimum; (2) the characteristic time scale is defined by the time lapse between the extrema; and (3) if the data are totally devoid of extrema but contain only inflection points, then the data can be differentiated one or more times to reveal the extrema, and the final results can be obtained by integration(s) of the components.

The essence of the method is to identify the intrinsic oscillatory modes by their characteristic time scales in the data empirically, and then decompose the data accordingly. According to Drazin (1992), the first step of data analysis is to examine the data by eye. From this examination, one can immediately identify the scales directly in two ways: by the time lapse between successive alternations of local maxima and minima, and by the time lapse between successive zero crossings. The interlaced local extrema and zero crossings give us complicated data: one undulation rides on top of another, and they, in turn, ride on still other undulations, and so on. Each of these undulations defines a characteristic scale of the data; it is intrinsic to the process. We have decided to adopt the time lapse between successive extrema as the definition of the time scale for the intrinsic oscillatory mode, because it not only gives a much finer resolution of the oscillatory modes, but can also be applied to data with a non-zero mean, either all-positive or all-negative values, without zero crossings. A systematic way to extract these modes, designated as the sifting process, was described above.
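Once the IMFs are available, the instantaneous frequency $\omega(t)$ of each one follows from the phase of its analytic signal. A minimal sketch using SciPy's Hilbert transform, applied to a chirp like the one in Fig. 2.1 (the helper name and test signal are ours):

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(imf, fs):
    """Instantaneous frequency (Hz) from the analytic signal of one IMF."""
    analytic = hilbert(imf)                    # imf + j * HilbertTransform(imf)
    phase = np.unwrap(np.angle(analytic))
    return np.diff(phase) * fs / (2.0 * np.pi)

fs = 8000
t = np.arange(0, 1, 1 / fs)
chirp = np.cos(2 * np.pi * 150 * t ** 2)       # instantaneous frequency 300*t Hz
f_inst = instantaneous_frequency(chirp, fs)
print(f_inst[len(f_inst) // 2])                # ~150 Hz at t = 0.5 s
```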
2.3 Code Excited Linear Prediction (CELP)

The code excited linear prediction (CELP) technique is a mature and widely adopted speech coding algorithm, first proposed by Schroeder and Atal [SA85]. It outperforms several other vocoders, such as the linear prediction coder (LPC) and the residual-excited linear predictive (RELP) vocoder, through its better quality at low bit rates. It is adopted by ITU-T G.723.1 with two coding rates (namely, 5.3 kbps and 6.3 kbps). After the introduction of the CELP codec, several variants (e.g., ACELP, RCELP, LD-CELP and VSELP) were developed. CELP and its variants offer an effective low-bit-rate speech coding tool, which serves as the core of all modern speech codecs.

[Figure 2.2: The block diagram of a CELP codec.]

The CELP codec encodes speech or audio signals based on linear predictive analysis-by-synthesis coding with frame-level processing. Each frame consists of 240 samples, which are further decomposed into four 60-sample subframes. One set of CELP-based features is obtained from each frame. For a sampling rate of 8 kHz, each frame has a duration of 30 ms. The block diagram of a CELP encoder is shown in Fig. 2.2. It consists of two input signals, obtained from an adaptive and a fixed codebook, respectively, and their sum serves as the excitation to a synthesis filter whose coefficients are updated dynamically. The excitation from the adaptive codebook is used to synthesize the main signal, while the excitation from the fixed codebook is used to account for the residual signal. The fixed codebook contains 5 to 6 fixed-position pulses as an enhancement to the excitation.

The CELP codec extracts four types of information from each frame and optimizes the decoded audio signal perceptually in a closed loop.

[Figure 2.3: The bit stream of a CELP codec.]

• Linear Prediction Coefficients (LPC). Each frame is first filtered to remove the DC component and decomposed into four subframes, each of which has 60 samples. For every subframe, we mask it with a Hamming window, compute the 10th-order LPC with the Levinson-Durbin recursion, and use the LPC to construct the synthesis and formant weighting filters in each subframe. Note that the 10-dimensional LPC is sufficient to represent the regularity of audio signals within a short frame. In the implementation of a CELP system, only the LPC parameters of the last subframe are quantized and transmitted, since the LPC parameters of the other three subframes can be interpolated under the linearity assumption of adjacent subframes. The LPC coefficients are employed to construct the short-term perceptual weighting filter, which is used to filter the entire frame and to obtain the perceptually weighted speech signal. The LPC is differentially quantized in the line spectral frequency (LSF) domain using a Predictive Split Vector Quantizer (PSVQ) and encoded according to the built-in adaptive codebook. Since the reference code uses an approximation method to acquire the LSF, the average LP parameters of the four subframes are chosen as one of the CELP features, denoted by LPC in Table 3.1.

• Pitch Lag (PITCH). For every two subframes (120 samples), an open-loop pitch period is computed from the weighted audio signal by the correlation method. The open pitch lag, L, takes on 128 values using 7 bits. The values range from 18 to 145, which correspond to frequencies from 444 Hz down to 55 Hz, respectively, under the 8 kHz sampling rate. This is sufficient for most speech and audio applications, since the fixed codebook can compensate for prediction residuals. Since the pitch lags in adjacent subframes are close to each other, we choose the average of the open pitch lag L in one frame as the desired feature, denoted by PITCH in Table 3.1.
• Gain of Pitch Filter (GAIN). The closed-loop pitch lag is computed by optimizing the 5-tap pitch filter gain, b, within the codebook of the system by searching from L−1 to L+2. The filter gain parameter b is denoted by GAIN in Table 3.1.

• Pulse Position in Fixed Codebook (POS). The pulse position information is acquired by minimizing the residual error in a nested loop for each subframe. This piece of information occupies the largest portion of the bit stream. It is denoted by POS in Table 3.1.
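The two components that later form the CELP feature vector, the 10th-order LPC and the open-loop pitch lag, can be approximated outside the G.723.1 reference code. The following is a simplified sketch (not the standard's bit-exact routines): LPC via the Levinson-Durbin recursion and the pitch lag via the correlation method over lags 18 to 145.

```python
import numpy as np

def lpc(frame, order=10):
    """10th-order LPC via the Levinson-Durbin recursion (illustrative sketch)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]  # autocorrelation
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]                                # the 10 prediction coefficients

def open_loop_pitch(frame, lo=18, hi=145):
    """Open-loop pitch lag by the correlation method over lags 18..145."""
    corr = [np.dot(frame[lag:], frame[:-lag]) for lag in range(lo, hi + 1)]
    return lo + int(np.argmax(corr))

fs = 8000
t = np.arange(240) / fs                         # one 240-sample (30 ms) frame
frame = np.cos(2 * np.pi * 100 * t) * np.hamming(240)
feature = np.concatenate([lpc(frame), [open_loop_pitch(frame)]])
print(feature.shape)                            # (11,): 10 LPC + 1 pitch lag
# Lag 80 at 8 kHz is 8000/80 = 100 Hz; lags 18..145 span roughly 444 down to 55 Hz.
```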
Chapter 3
Environmental Sound Recognition with CELP-based Features

3.1 Introduction

The environmental sound recognition (ESR) problem arises in many interesting applications such as audio scene analysis, navigation, assistive robotics, and mobile-device-based services. By audio scene analysis, we refer to the classification of a location (such as a restaurant, a playground or a rural area) based on its distinct acoustic characteristics. Audio data are available in challenging conditions such as lack of light and/or visual obstruction. Besides, as compared with video, audio is relatively easy to store and process. The ESR technique can also be used to enhance the performance of speaker identification and language recognition with environmental sounds in the background.

Mel Frequency Cepstral Coefficients (MFCC) have long been used for the recognition of structured audio such as speech and music. However, their performance for unstructured audio such as environmental sounds is limited. In this work, we study a set of new features based on CELP (Code Excited Linear Prediction) to enhance the performance of the ESR problem. Although CELP was initially developed for speech coding, it actually offers an excellent low-bit-rate codec for any audio signal, including music and environmental sounds. Our research is motivated by the observation that an audio signal can be well preserved by its highly compressed CELP bit streams. It is therefore intuitive to extract features from the CELP bit streams for audio recognition purposes.

There are several advantages to the CELP-based features. First, they provide a concise yet accurate representation of the underlying audio signals. Second, their computational complexity is low, which allows real-time feature extraction. Third, CELP-based features are more robust with respect to background noise. Fourth, all modern telecommunication systems adopt the CELP technique or its variations as the speech codec, due to its high compression ratio, excellent coded audio quality and low complexity. All mobile telecommunication devices are equipped with a CELP-based codec. As a result, recognition based on CELP features is desirable, since the additional effort required by feature extraction is almost negligible.

In this work, we compare a variety of audio features and provide a performance evaluation on the classification of ten environmental sounds using different feature sets. We find that the most commonly used features do not work well for environmental sounds. In contrast, the CELP-based features provide excellent performance using the Bayesian network classifier, where the averaged correct classification rate reaches 91.2%. Besides, the integrated MFCC and CELP-based features can provide an averaged correct classification rate of 95.2%. To the best of our knowledge, this is the first paper that examines the use of CELP-based features in audio classification. Since the preliminary study reported in this paper is promising, it will be worthwhile to extend our study to a larger class of audio signals, including both speech and music, in the future.

The rest of this paper is organized as follows. A comparison of various audio features is made in Sec. 1.1. The proposed CELP-based features are discussed in Sec. 3.2. Experimental results are shown in Sec. 3.3 to demonstrate the superior performance of the CELP-based features in ESR. Finally, concluding remarks and future research directions are given in Sec. 3.4.

3.2 Extraction of CELP-Based Features

3.2.1 CELP Codec

The code excited linear prediction (CELP) technique is a mature and widely adopted speech coding algorithm, first proposed by Schroeder and Atal [SA85]. It outperforms several other vocoders, such as the linear prediction coder (LPC) and the residual-excited linear predictive (RELP) vocoder, through its better quality at low bit rates. It is adopted by ITU-T G.723.1 with two coding rates (namely, 5.3 kbps and 6.3 kbps). After the introduction of the CELP codec, several variants (e.g., ACELP, RCELP, LD-CELP and VSELP) were developed. CELP and its variants offer an effective low-bit-rate speech coding tool, which serves as the core of all modern speech codecs.

[Figure 3.1: The block diagram of a CELP codec.]

The CELP codec encodes speech or audio signals based on linear predictive analysis-by-synthesis coding with frame-level processing. Each frame consists of 240 samples, which are further decomposed into four 60-sample subframes. One set of CELP-based features is obtained from each frame. For a sampling rate of 8 kHz, each frame has a duration of 30 ms. The block diagram of a CELP encoder is shown in Fig. 3.1. It consists of two input signals, obtained from an adaptive and a fixed codebook, respectively, and their sum serves as the excitation to a synthesis filter whose coefficients are updated dynamically. The excitation from the adaptive codebook is used to synthesize the main signal, while the excitation from the fixed codebook is used to account for the residual signal. The fixed codebook contains 5 to 6 fixed-position pulses as an enhancement to the excitation.
3.2.2 CELP-based Features

The CELP codec extracts four types of information from each frame and optimizes the decoded audio signal perceptually in a closed loop.

• Linear Prediction Coefficients (LPC). Each frame is first filtered to remove the DC component and decomposed into four subframes, each of which has 60 samples. For every subframe, we mask it with a Hamming window, compute the 10th-order LPC with the Levinson-Durbin recursion, and use the LPC to construct the synthesis and formant weighting filters in each subframe. Note that the 10-dimensional LPC is sufficient to represent the regularity of audio signals within a short frame. In the implementation of a CELP system, only the LPC parameters of the last subframe are quantized and transmitted, since the LPC parameters of the other three subframes can be interpolated under the linearity assumption of adjacent subframes. The LPC coefficients are employed to construct the short-term perceptual weighting filter, which is used to filter the entire frame and to obtain the perceptually weighted speech signal. The LPC is differentially quantized in the line spectral frequency (LSF) domain using a Predictive Split Vector Quantizer (PSVQ) and encoded according to the built-in adaptive codebook. Since the reference code uses an approximation method to acquire the LSF, the average LP parameters of the four subframes are chosen as one of the CELP features, denoted by LPC in Table 3.1.

• Pitch Lag (PITCH). For every two subframes (120 samples), an open-loop pitch period is computed from the weighted audio signal by the correlation method. The open pitch lag, L, takes on 128 values using 7 bits. The values range from 18 to 145, which correspond to frequencies from 444 Hz down to 55 Hz, respectively, under the 8 kHz sampling rate. This is sufficient for most speech and audio applications, since the fixed codebook can compensate for prediction residuals. Since the pitch lags in adjacent subframes are close to each other, we choose the average of the open pitch lag L in one frame as the desired feature, denoted by PITCH in Table 3.1.

• Gain of Pitch Filter (GAIN). The closed-loop pitch lag is computed by optimizing the 5-tap pitch filter gain, b, within the codebook of the system by searching from L−1 to L+2. The filter gain parameter b is denoted by GAIN in Table 3.1.

• Pulse Position in Fixed Codebook (POS). The pulse position information is acquired by minimizing the residual error in a nested loop for each subframe. This piece of information occupies the largest portion of the bit stream. It is denoted by POS in Table 3.1.

[Figure 3.2: Computation of the CELP-based features, which consist of the 10-dimensional LPC features and the one-dimensional pitch lag feature.]

As shown in the experimental results in Sec. 3.3, we observe that the set of features with the highest discriminant power is the union of the 10-dimensional LPC features and the one-dimensional pitch lag (or simply pitch) feature. Thus, we choose them as the desired CELP features, which results in an 11-dimensional feature vector per frame, denoted by CELP in Table 3.1. The computation of the 11-dimensional CELP feature vector is illustrated in Fig. 3.2.

3.3 Experimental Results

3.3.1 Experimental Setup

In the experiments, we collected ten classes of environmental sounds from the BBC audio data. They were sounds associated with the following:

• Transportation (3): airplane, motorcycle and train.
• Weather (4): rain, thunder, wind and stream.
• Rural areas (2): bird, insect.
• Indoor (1): restaurant.

All data were resampled at 8 kHz and normalized to a one-minute-long audio clip from a mono channel. Some pre-processing and filtering operations were used to filter out silence as well as irrelevant or noisy parts of the environmental sounds. The feature extraction process is illustrated in Fig. 3.3. The CELP features were extracted by modifying the standard reference code of ITU-T G.723.1.

[Figure 3.3: System diagram.]

In order to compare the CELP features with MFCC, we adopted two classifiers, i.e., the support vector machine (SVM) and the naive Bayesian network. The machine learning toolbox Weka [HFH+09] provides a convenient interface to analyze the results of LibSVM [CL11] and the Bayesian network. Ten-fold cross-validation was adopted for each analysis.
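The same evaluation protocol can be reproduced with scikit-learn in place of the Weka/LibSVM toolchain actually used here; the feature matrix below is random stand-in data, so only the mechanics (a naive Bayes classifier, an SVM, and ten-fold cross-validation) carry over.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 11))                # one 11-D CELP vector per frame
y = rng.integers(0, 10, size=2000)             # labels for the ten sound classes

for name, clf in [("naive Bayes", GaussianNB()), ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=10) # ten-fold cross-validation
    print(name, scores.mean())
```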
3.3.2 Results and Discussion

First, we show the correct classification rates in Table 3.1 for environmental sounds with different features using the Bayesian network classifier. We have the following observations. Neither the PITCH nor the GAIN feature alone provides good results. The LPC features work well, offering an averaged correct classification rate of 88.5%. On one hand, by adding the PITCH feature to the LPC features, the resultant 11-dimensional CELP feature vector reaches an averaged correct classification rate of 91.2%. On the other hand, adding further features such as GAIN and POS has a negative impact.

The CELP features outperform the MFCC features in most categories, and the difference in the averaged classification rate is around 9%, which is very significant. This is especially obvious in the cases of the rain, stream and thunder sounds. MFCC features capture the short-time energy in different frequency bands. For these three sounds, the short-time energy distribution varies across bands at different time instances. To classify these three sounds accurately, the temporal correlation captured by the LPC features is more useful, even though both feature sets are acquired from the same frame.

The combined CELP+MFCC features achieve the best result, where the averaged correct classification rate is 95.2%. With the combined features, the performance on the restaurant, thunder and train sounds improves dramatically.

Next, we show the confusion matrix obtained using the CELP features only and the Bayesian network classifier in Table 3.2, where the (i,j) entry in the table denotes the rate of the i-th class data being classified to the j-th class. We see clearly that the restaurant, thunder and train sounds are more easily misclassified as each other.

Finally, we show the correct classification rates for environmental sounds with three feature sets using the SVM classifier in Table 3.3. In this case, the performance difference between the MFCC and the CELP features is smaller. The integrated MFCC/CELP feature set still provides the best performance, with an averaged correct classification rate of 93.3%.

3.4 Conclusion and Future Work

A novel set of CELP-based features was proposed in this work, which offers excellent performance for the ESR problem. This new feature set outperforms the MFCC feature set by a significant margin using the Bayesian network classifier. The integration of the CELP and MFCC features offers the best classification result, with a correct classification rate of 95.2% using the Bayesian network classifier for 10 environmental sounds. Since the preliminary study reported in this paper is promising, it will be interesting to extend our study to a larger class of audio signals, including both speech and music, in the future.

Classification rate(%)  Airplane  Bird  Insect  Motor  Rain  Rest.  Stream  Thunder  Train  Wind  Overall
PITCH                   77.8      28.8  1.1     27.1   1.2   62.6   10.5    –        29.1   21.2  26.8
GAIN                    66.3      8.5   44.0    18.5   32.0  8.3    8.3     2.4      15.9   11.5  22.2
LPC                     85.4      96.3  99.6    89.8   99.1  63.7   98.0    77.0     74.1   98.5  88.5
CELP+GAIN               88.7      96.8  99.6    90.4   99.0  77.8   97.6    79.5     81.6   98.7  91.0
CELP+GAIN+POS           92.6      99.5  98.7    73.7   96.3  55.9   96.0    30.0     61.0   93.0  81.3
MFCC                    87.8      90.0  95.8    86.2   76.8  69.4   77.0    43.2     86.9   100   82.5
CELP                    88.4      96.8  99.6    90.4   99.0  77.9   97.7    78.8     81.3   98.7  91.2
CELP+MFCC               92.3      97.7  99.5    95.5   99.0  87.5   98.7    85.4     93.4   99.9  95.2

Table 3.1: Correct classification rates with different feature sets using the Bayesian network as the classifier.

%         Airplane  Bird  Insect  Motor  Rain  Rest.  Stream  Thunder  Train  Wind
Airplane  88.4      –     –       –      –     1.9    –       0.2      5.1    4.4
Bird      –         96.8  –       0.1    –     1.6    0.3     0.2      1.1    –
Insect    –         –     99.6    –      –     0.4    –       –        –      –
Motor     0.1       –     –       90.4   –     5.7    –       0.3      3.5    –
Rain      –         –     –       –      99.0  0.3    0.4     0.1      –      –
Rest.     1.0       2.2   –       8.1    0.1   77.9   1.4     2.6      6.8    0.1
Stream    –         0.2   –       –      0.3   1.0    97.7    0.2      0.5    –
Thunder   1.9       0.6   0.1     3.0    0.3   7.5    3.8     78.8     3.4    0.7
Train     5.1       0.7   –       5.0    0.1   7.1    0.1     0.7      81.3   –
Wind      –         –     –       –      –     –      –       1.3      –      98.7

Table 3.2: The confusion matrix obtained with the CELP features only and the Bayesian network classifier.

Classification rate(%)  Airplane  Bird  Insect  Motor  Rain  Rest.  Stream  Thunder  Train  Wind  Overall
MFCC                    91.1      91.9  99.2    90.2   84.0  67.6   81.8    54.3     91.1   100   86.1
CELP                    85.9      93.4  99.5    80.4   98.7  69.4   95.4    69.8     73.2   99.3  87.0
CELP+MFCC               92.5      96.8  99.5    92.9   98.2  81.4   95.2    80.2     92.2   100   93.3

Table 3.3: Correct classification rates with different feature sets using the support vector machine as the classifier.

Chapter 4

Robust Jet Engine Fault Detection and Diagnosis Using CELP and MFCC Features

4.1 Introduction

Engine condition monitoring and fault diagnosis have been a hot research topic over the last two decades, drawing much attention from researchers in academia and industry. Any tiny defect may cause a malfunction and, even worse, result in a catastrophe. Thus, it is desirable to develop a robust diagnostic and quality inspection mechanism that can discover abnormality and identify fault types at a very early stage. It could prevent unnecessary damage to the machine, reduce the cost of repair or replacement substantially and improve flight safety.

Engine faults may arise from a variety of sources, such as high rotor dynamic forces, bearing failures and structural fatigue faults. All of them produce specific variations in sound and vibration data. Consequently, sound and vibration data analysis in either the time or the frequency domain plays an important role in a fault diagnosis system. However, due to the non-stationary characteristics of vibration signals and the limitations of traditional analytical tools, engine fault diagnosis is not a trivial problem.

In this proposed research, we intend to use several new mathematical tools, e.g., code excited linear prediction (CELP) and the matching pursuit (MP) decomposition, to extract features from sound and vibration data and then apply classification techniques to identify engine conditions. Our ultimate research goal is to develop a novel engine monitoring and fault diagnosis system.

Since engine faults usually arise from a variety of sources whose statistical characteristics are mostly non-linear and non-stationary, such as high rotor dynamic forces or bearing interactions, data analysis in the time and frequency domains plays an important role in the fault diagnosis system. Due to the non-stationarity of engine signals, traditional analytic tools have limitations in discovering useful features. In this paper, a joint feature extraction technique based on Mel-Frequency Cepstral Coefficients (MFCC) and Code Excited Linear Prediction (CELP) is used to overcome this problem. MFCC analysis is one of the feature representations used successfully in auditory recognition and is amenable to compensation for convolutional channel distortion. CELP is a state-of-the-art coding technique for speech signals that provides linear prediction coefficient (LPC) and pitch information for an underlying audio signal. Experimental results demonstrate that the use of these two approaches compensates for the time-varying characteristics of the source signals to some degree, although acceleration and deceleration scenarios still provide a difficult environment for fault detection and diagnosis.
4.2 Review of Previous Work

Several fault diagnosis methods have been studied in the last two decades, including the monitoring of rolling bearings from vibration measurements in the time and the frequency domain, sound measurements and acoustic emission. Vibration data analysis is widely used for its efficiency. A bearing acts as a source of vibration through varying compliance or defects. There are two categories of defects, "distributed" defects and "localized" defects [TA99]. Distributed defects come from surface roughness, waviness, misaligned races and off-size rolling elements, which result in an increased vibration level. Localized defects are caused by cracks, pits and spalls on the rolling surface, which tend to yield periodic short-duration pulses.

The time domain approach was considered first, including the ratio of the peak value to the root-mean-square (RMS), the probability density method and the kurtosis method [DR78]. Later, the band-pass filtering and the shock pulse methods were introduced to exploit the resonance frequency concept. The time-domain tools described above are, however, not accurate enough.

The frequency domain approach was developed more recently, and it is widely used nowadays. The high and low frequency components represent the resonance frequency and the characteristic defect frequency, respectively. The latter is especially of interest to us. Generally speaking, it is difficult to obtain significant peaks at these frequencies in the spectrum of measured engine data due to noise and vibration from other sources. Furthermore, the Fourier transform has limited resolution in the low frequency region. Some mathematical tools, such as decomposing signals into periodic components and adaptive noise canceling (ANC) [CT81], have been proposed to overcome this difficulty, but the improvement is limited. The high frequency resonance technique (HFRT) [MS84] performs band-pass filtering at the high (i.e., resonance) frequency and demodulates the resulting signal to the characteristic frequency. However, some artifacts still remain in the processed data. For example, defect frequencies may be submerged in the rising background level of the spectrum, which makes their detection very challenging.

In recent years, the wavelet transform (WT), the neural network and adaptive "order" tracking [BHHS05] techniques have been proposed by researchers for enhanced signal analysis. They are briefly reviewed below. As compared with the FFT, the WT can provide a more flexible time-frequency resolution [LQ00]. A neural network classifier was presented in [LCTH00]; however, its performance highly depends on the features used in the classification. Another approach is to analyze the order of the vibration signal, which is the signal amplitude in a narrow frequency band centered on a harmonic of the rotation frequency of a shaft. Since the rotational speed varies with time, classical analytical tools (e.g., harmonic analysis) cannot be directly applied to order analysis. Order tracking by Kalman filtering and the recursive least-squares (RLS) algorithm was proposed to tackle the problem of multiple vibration sources.

Despite the several techniques proposed before, robust engine fault detection based on sound and vibration signal analysis is still an open problem due to the poor frequency resolution in the low-frequency region, noise interference, multiple vibration sources, poor features and the non-stationary nature of the underlying signals. Some novel feature extraction and signal analysis techniques are needed. This will be the focus of our current proposal.
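As an aside, the HFRT idea reviewed above is simple to express in code. The following is a hedged sketch (not taken from any of the cited works) of envelope analysis in Python with SciPy: band-pass around an assumed resonance band, demodulate with the Hilbert envelope, and inspect the envelope spectrum for peaks near the characteristic defect frequency.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_spectrum(x, fs, band):
    """HFRT-style analysis: band-pass around the resonance band (a Hz pair),
    demodulate via the Hilbert envelope, and return the envelope spectrum,
    whose peaks sit near the characteristic defect frequencies."""
    sos = butter(4, band, btype="band", fs=fs, output="sos")
    env = np.abs(hilbert(sosfiltfilt(sos, x)))
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, spec
```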
4.2.1 Data Collection

4.2.1.1 The SR-30 data

The SR-30 data set used in this paper was acquired by Turbine Technologies, and it includes a collection of 15 sensors. 11 sensors were used to capture acoustic engine signals (1 at the edge of the compressor blade, 9 around the engine, and 1 at the edge of the exhaust). Also, 4 vibration sensors were mounted at 90-degree intervals around the outside of the engine casing. The sampling rate is 102.4 kHz. Three separate tests were performed to acquire different data sets:

1. Nominal represents the engine state in a standard configuration and balance.

2. Blade Damaged represents the engine state when there is a notch cut from a blade of the compressor.

3. Bearing Failure represents the engine state when one of the bearings is damaged so that it causes a vibration to occur throughout the engine.

Each test consists of a sequence of sub-tests according to the typical stages of flight, which are summarized below for this particular engine. Note that speeds are given in kRPM (thousands of revolutions per minute).

Flight Stage        Engine Setting
Idle                Engine run at idle speed, 43 kRPM
Slow Acceleration   Increasing engine speed from 43 to 70 kRPM over 15-20 s
Cruise              Engine run at cruise speed, 70 kRPM
Slow Deceleration   Decreasing engine speed from 70 to 43 kRPM over 15-20 s
Fast Acceleration   Increasing engine speed from 43 to 70 kRPM in 3 s
Fast Deceleration   Decreasing engine speed from 70 to 43 kRPM in 3 s

Table 4.1: List of Flight Stages and Associated Engine Speeds

4.2.1.2 The PW4000 data

The PW4000 data set is real jet engine data collected by vibration sensors. It was recorded at a lower sampling rate of 25 kHz compared to the SR-30 data. There are 55 vibration sensors, or "channels", each collecting approximately 5 minutes of data. However, for the performance evaluation, only channels with no errors were used: channels 72-80 and channels 99-113 (24 channels in total). There are no separate "stages" as in the SR-30 data. Similarly, three separate tests were performed to acquire different data sets, as shown in Fig. 4.1:

Figure 4.1: PW4000 data set

1. ADR61: Normal condition engine

2. ADR77: The fan is imbalanced

3. ADR119: Low pressure turbine imbalance

4.3 Description of Proposed System

The solution to this challenging problem highly depends on the mathematical tools used. To extract features from measured vibration data, tools such as the short-time Fourier transform (STFT), the wavelet transform (WT), the filter-bank decomposition and the "order" tracking method reviewed in Section 4.2 are not perfect, since some vibration signals cannot be well analyzed by them due to their nonlinear and non-stationary properties. We propose two new analytical tools for sound and vibration signal analysis in this project.

The first one is to use the matching pursuit (MP) decomposition [MZ93] to analyze sounds for effective time-domain signature extraction. The MP-based method utilizes a dictionary of atoms for feature selection, resulting in a flexible, intuitive and physically interpretable set of features. In our previous work, we have shown that MP-based time-domain features can be used to supplement frequency-domain features to yield higher recognition accuracy for environmental sounds. MP-based features have proved to be effective in classifying sounds where frequency-domain features fail.
Furthermore, it can be advantageous to combine both time- and frequency-domain features to improve the overall performance. For seven environment types, including the street, elevator, café, hallway, lobby, and so on, we have achieved a classification rate higher than 90% [CNK09].

The second tool is the Hilbert-Huang Transform (HHT) [HSL+98], which was introduced by Huang in 1998. He developed an empirical mode decomposition (EMD) scheme and used it as the pre-processing step for the Hilbert transform. Instead of retrieving the information directly from signals, the EMD process adopts a nonlinear approach to decompose a given signal into several intrinsic mode functions (IMFs), which give a multi-scale representation of the signal in the time domain. The AM-FM-like property of the IMFs not only makes the spectrum better visualized after the Hilbert transform but also provides insightful analysis of the harmonic structure of engine vibration signals. This tool has received a lot of attention in the last decade for nonlinear and nonstationary signal analysis.

The above two mathematical tools have been proven successful in certain areas, but not yet in vibration data analysis for engine fault diagnosis and classification. For non-stationary vibration data, HHT provides a different scale in the time domain that can offer better frequency resolution in the low frequency region. When HHT is fused with the time-domain MP features, we will be able to obtain better features for engine condition analysis, fault diagnosis, etc.

Figure 4.2: The proposed system

4.4 Extraction of CELP-Based Features

4.4.1 CELP Codec

The code excited linear prediction (CELP) technique is a mature and widely-adopted speech coding algorithm, which was first proposed by Schroeder and Atal [SA85]. It outperforms several other vocoders, such as the linear prediction coder (LPC) and the residual-excited linear prediction (RELP) coder, by offering better quality at low bit rates. It is adopted by ITU-T G.723.1 with two coding rates (namely, 5.3 kbps and 6.3 kbps). After the introduction of the CELP codec, several variants (e.g., ACELP, RCELP, LD-CELP and VSELP) were developed. CELP and its variants offer an effective low-bit-rate speech coding tool, which serves as the core of all modern speech codecs.

Figure 4.3: The proposed system

The CELP codec encodes speech or audio signals based on linear predictive analysis-by-synthesis coding with frame-level processing. Each frame consists of 240 samples, which are further decomposed into four 60-sample subframes. One set of CELP-based features is obtained from each frame. For a sampling rate of 8 kHz, each frame has a duration of 30 ms. The block diagram of a CELP encoder is shown in Fig. 4.4. It consists of two input signals, obtained from an adaptive and a fixed codebook, respectively, and their sum serves as the excitation to a synthesis filter whose coefficients are updated dynamically. The excitation from the adaptive codebook is used to synthesize the main signal while the excitation from the fixed codebook is used to account for the residual signal. The fixed codebook contains 5 to 6 fixed-position pulses as an enhancement to the excitation.

Figure 4.4: The block diagram of a CELP codec.

4.4.2 CELP-based Features

The CELP codec extracts four types of information from each frame and optimizes the decoded audio signal perceptually in a closed loop.
• Linear Prediction Coefficients (LPC)

Each frame is first filtered to remove the DC component and decomposed into four subframes, each of which has 60 samples. For every subframe, we mask it with a Hamming window, compute the 10th-order LPC with the Levinson-Durbin recursion, and use the LPC to construct the synthesis and formant weighting filters in each subframe. Note that the 10-dimensional LPC is sufficient to represent the regularity of audio signals in a short frame.

In the implementation of a CELP system, only the LPC parameters of the last subframe are quantized and transmitted, since the LPC parameters of the other three subframes can be interpolated under the linearity assumption of adjacent subframes. The LPC coefficients are employed to construct the short-term perceptual weighting filter, which is used to filter the entire frame and to obtain the perceptually weighted speech signal.

The LPC coefficients are differentially quantized in the line spectral frequency (LSF) domain using a Predictive Split Vector Quantizer (PSVQ) and encoded according to the built-in adaptive codebook. Since the reference code uses an approximation method to acquire the LSF, the average LP parameters of the four subframes are chosen as one of the CELP features, denoted by LPC in Table 3.1.

• Pitch Lag (PITCH)

For every two subframes (120 samples), an open-loop pitch period is computed from the weighted audio signal by the correlation method. The open pitch lag, L, takes on 128 values using 7 bits. The values range from 18 to 145, which correspond to the frequency range from 55 Hz to 444 Hz, respectively, under the 8 kHz sampling rate. This is enough for most speech and audio applications since the fixed codebook can compensate for prediction residuals. Since the pitch lags in adjacent subframes are close to each other, we choose the average of the open pitch lag L in one frame as the desired feature, which is denoted by PITCH in Table 3.1.

• Gain of Pitch Filter (GAIN)

The closed-loop pitch lag is computed by optimizing the 5-tap pitch filter gain, b, within the codebook of the system by searching from L−1 to L+2. The filter gain parameter b is denoted by GAIN in Table 3.1.

• Pulse Position in Fixed Codebook (POS)

The pulse position information is acquired by minimizing the residual error in a nested loop for each subframe. This piece of information occupies the largest portion of the bit stream. It is denoted by POS in Table 3.1.

Figure 4.5: Computation of the CELP-based features, which consist of the 10-dimensional LPC features and the one-dimensional pitch lag feature.

4.4.3 Pre-Processing Operations

The huge amount of available engine data obstructs a real-time implementation of the system, even if the classifier training can be done offline. Because there are 11 different sensors and the time resolution of the sensors is relatively high, the amount of information must be reduced or aggregated if it is to be processed by on-board systems in a timely manner. An efficient data reduction method, Multi-Resolution Analysis (MRA), can be adopted to effect this. An MRA model includes multiple sub-sequences at several resolutions different from the original signal and allows for independent treatment of each level of a sub-sequence.

A critical down-sampling ratio, which keeps fair classification performance but maximally reduces the number of necessary samples, was also investigated. From extensive experiments on the sample data, there appears to be a good trade-off between the error rate and the computational complexity if the samples are decimated by a ratio of 32; a sketch of this step follows.
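The decimation step itself can be written compactly. Below is an illustrative sketch in Python with SciPy; note that scipy.signal.decimate applies a Chebyshev type I anti-aliasing low-pass filter before each down-sampling stage, close in spirit to the anti-aliasing filter discussed next. The two-stage 8x4 split is our choice here, since very large single-stage IIR decimation factors are discouraged.

```python
from scipy.signal import decimate

def reduce_rate(x, stages=(8, 4)):
    """Down-sample a sensor record by 32 in two stages (8x, then 4x);
    each stage low-pass filters the signal before discarding samples."""
    for q in stages:
        x = decimate(x, q, ftype="iir", zero_phase=True)
    return x
```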
[NEEDS MORE INFO ON THIS, ALONG WITH TABLES/GRAPHS]

In addition, a low-pass filter should be used as an anti-aliasing filter before down-sampling, since there are frequency components with considerable energy above the new Nyquist frequency which might give rise to aliasing. In this system, an 11th-order Chebyshev IIR low-pass filter is used to reject the harmonics in the stop band.

4.4.4 Feature Selection and Extraction

Mel-Frequency Cepstral Coefficients are a well-known signal processing tool, which has had huge success in the speech community, in applications like speaker recognition and identification. The design of MFCC takes the physical construction of human ear mechanics and acoustics into account. The audible frequency range is divided into several mel-scale filter banks, which correspond roughly to what can be perceived by the human ear. The MFCC transform maps the original signal frequencies into the cepstral domain and calculates the energy within each filter bank. In this paper, the original MFCC design is adapted to engine acoustics, with an increase in the number and density of analysis filters in the filter bank. This modification is necessary because the sampling rate of the sensors in the engine is much higher than that used in traditional speech applications.

Code Excited Linear Prediction is a well-known technique widely used in telecommunications and is the most successful technique among those used in speech coding, primarily because it is simple, fast, and effective. Because it is model based, it has a much higher compression rate than any other speech coding technique, leading to an efficient signal representation. The breakthrough that increases performance in the detection of engine faults is the adoption of CELP model parameters as classifier features. In this paper, the LPC coefficients and pitch information are incorporated into the feature set, using the original settings of the CELP coder.

The original CELP standard adopts 10 Linear Predictive Coding (LPC) coefficients to model the data at the subframe level. In each subframe, the optimal LPC coefficients are found by efficiently solving the Levinson-Durbin recursion. Since the models assume the data to be largely linear, the mean of the LPC coefficients from each of the 4 subframes is taken as the final representative set of coefficients for each frame.

The original CELP standard estimates the pitch every 2 subframes with an open-loop and a closed-loop estimation method. The mean of the two closed-loop estimates is taken as the final pitch information for one frame in this implementation. The pitch information and LPC coefficients contain the most important parts of the original signal.

4.4.5 Feature Space Reduction

Feature reduction is necessary because, although the feature extraction process extracts the essential information from the underlying data, the result is often an over-determined system. Feature reduction decreases performance slightly, by eliminating the data dimensions with the smallest variances, but yields a large boost in the speed of classification.

Principal Components Analysis (PCA) is a widely used technique in the pattern recognition community. It works by transforming the original data domain into another, much more concise, domain in order to save computation and space while preserving the essential decision information of the data. The resulting domain is a linear combination of the original feature set, ordered by the components with the highest variance; a high-variance component represents large variability in the data and therefore makes samples from different classes easier to separate.
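A minimal sketch of this reduction step, using scikit-learn's PCA (our experiments used MATLAB; the retained dimension is a tunable parameter, cf. the reductions reported in Sec. 4.5):

```python
from sklearn.decomposition import PCA

def reduce_features(X_train, X_test, n_components=10):
    """Fit PCA on the training features only, then project both sets
    onto the top principal components (highest-variance directions)."""
    pca = PCA(n_components=n_components).fit(X_train)
    return pca.transform(X_train), pca.transform(X_test)
```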
The intuition behind using statistical block-based features (multiple frames) instead of frame-based features is that frame-based features contain information that is overly localized. It does not make sense to identify the engine status based on the classification of a single frame. Aggregating statistical results over multiple frames for classification provides more sound performance, at a slight increase in computational cost and memory requirements.

4.4.6 Classification

4.4.6.1 Classifier Selection and Design

The selection of a pertinent classification algorithm is an important step in any recognition problem, and determines much of what kinds of design decisions will need to be made in the subsequent tuning of the algorithm. Because of the nature of the MFCC and CELP data, the classifier chosen for this project was the Support Vector Machine (SVM), which provides a way to deal with a large set of data separated into relatively few classes. SVM is a technique that maps sample points into a higher dimension so that it may create a linear decision boundary with a hyperplane, rather than constructing complicated decision rules in the original dimension, which tend to either severely under-represent the complexity of the data or overfit the classifier to the specifics of the training samples.

The other modification made to the classification process during the course of development was to include a two-phase classifier: one stage for the detection of engine faults and one for their identification. Both stages of classification use SVMs and the same training/testing process, but it was discovered that breaking the process down into two steps (see the sketch below) achieved several important things:

1. Improved overall classification performance

2. Improved fault detection

3. Improved training runtime
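A hedged sketch of this two-phase design in Python with scikit-learn follows; the class labels and cascade structure shown here are illustrative, not the actual MATLAB implementation.

```python
import numpy as np
from sklearn.svm import SVC

class TwoStageClassifier:
    """Stage 1 detects whether any fault is present; stage 2 identifies
    the fault type. Labels: 0 = nominal, 1..k = fault classes."""

    def fit(self, X, y):
        self.detector = SVC().fit(X, y > 0)          # fault vs. nominal
        faulty = y > 0
        self.diagnoser = SVC().fit(X[faulty], y[faulty])
        return self

    def predict(self, X):
        labels = np.zeros(len(X), dtype=int)         # default: nominal
        flagged = self.detector.predict(X).astype(bool)
        if flagged.any():
            labels[flagged] = self.diagnoser.predict(X[flagged])
        return labels
```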
4.4.6.2 Classifier Training

Training the classifier is the most complex step in the classification phase of a pattern recognition problem. Assuming that the data has been dimensionally and semantically decomposed to its essential components, a number of factors can still influence the success or failure of the algorithm, some of which are discussed below.

The first consideration undertaken for robust classifier training was the size of the training set. The two main constraints on this size are the time it takes for the classifier to be trained and the amount of data that can be stored in the memory of the computer doing the calculation. Although more training data theoretically means better performance, the experience of the pattern recognition community is that an overwhelming amount of information will actually overfit the classifier, making it unreasonably sensitive to minute changes in the input.

During the course of our testing, it was found that the SVM code in the PRTools package for MATLAB would not run out of memory if fewer than around 600 samples per class were used for training. In effect, this means there were at most 1800 data points with 11-32 dimensions (600 each from the three engine operation cases). We tested the performance of the classifier by selecting 100-600 consecutive samples from a particular record of the data and testing the remaining samples with the classifier developed from this training data.

Classification performance improves with larger sample sizes, as expected, but the time it takes to perform each set of training/testing roughly doubles with each increase of 100 in the training set size. Thus, a sample size of 200 or 400 is a sufficient tradeoff between runtime and performance. Further improvements can be achieved by two means: simply taking the best-performing training segments as the general training sets for the entire dataset, or finding specific "good" samples within the dataset that are representative of each class.

4.4.7 Decision Synthesis

4.4.7.1 Necessity of Synthesis

Depending on the memory considerations of an on-board system, it may be necessary to combine some of the classification results throughout the operation of the engine during the flight. Although a mechanic or maintenance crew would most likely prefer to see all of the information from each sensor throughout the duration of the flight, it is conceivable that a concise summary of the engine's performance might be required, without the need to perform complicated data analysis.

4.4.7.2 Time-Based Synthesis

The first method of aggregating the data would be to simply average the decisions for each sensor over a window, taking the most frequently occurring decision within that window as the representative decision for that window (a code sketch of this majority vote is given at the end of this section). This would significantly reduce the amount of data that needs to be stored, but it might also artificially remove behavior that could lead to suspicion of an imminent failure. An example of what this time-based synthesis might look like is shown below:

Figure 4.6: Sensor Readings Pre-Synthesis

Figure 4.7: Sensor Readings Post-Synthesis

As seen in the above figures, the decision for each window is determined by the majority of the sensor's individual decisions within the confines of the window. Although the threshold of this majority-rule decision could be lowered to permit a more sensitive analysis of the underlying signal, there would still be a possibility of erasing important information that might give maintenance crews an early warning of some impending problem. Because windowing is involved in this kind of signal synthesis, an appropriate investigation of proper techniques would also need to be conducted, taking into account especially the effects of window overlap and window shapes/weights.

4.4.7.3 Stage-Based Synthesis

The information generated by the detection/diagnosis system can also be collected on a stage basis; different classification techniques would be necessary to analyze the engine's acoustics during different stages of flight, so this data would have different durations and even different importance. During the acceleration and deceleration stages of flight, the engine is put under much more stress, which might reveal structural integrity problems detectable with acoustic sensors. Therefore, although it would be possible to consider all flight-time data as equal and average everything together, it is important to keep in mind that different engine states mean different stresses on the entire plane, which might give contradictory output signals in this system. A higher resolution output (i.e., no stage-based synthesis) would then be useful in analyzing such discrepancies.

4.4.7.4 Sensor-Based Synthesis

The data gathered from each of the 11 sensors can also be aggregated, although this is probably inadvisable simply from the point of view of sensor failure. If one of the sensors experienced a failure, it might continue to feed faulty readings into the entire system if sensor-based decision integration were implemented.
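For concreteness, a minimal sketch of the window-level majority vote described in Sec. 4.4.7.2, assuming per-frame decisions are encoded as non-negative integer labels:

```python
import numpy as np

def majority_vote(decisions, window=50):
    """Collapse per-frame decisions into one representative decision per
    non-overlapping window, keeping the most frequent label in each."""
    n = len(decisions) // window
    frames = np.reshape(decisions[:n * window], (n, window))
    return np.array([np.bincount(w).argmax() for w in frames])
```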
4.4.7.5 Decision Synthesis Conclusion

Although the details have not been thoroughly investigated, it appears that synthesis of information strictly on a time scale would be the recommended option in this system, if memory restrictions required it. Any other synthesis might introduce more uncertainty and problems than it would alleviate, and even time-based synthesis would need careful examination in order to retain the essential parts of the system output.

4.5 Experimental Results

In this section we present a detailed analysis of the experiments we conducted, as well as their interpretation with regard to constructing a successful fault detection and diagnosis system. Many variations of experimental setups were tested, and the general trends are presented here, but we have opted to restrict parameters in such a way that they meet the design constraints of our particular application area. In brief, we are looking for a system which:

1. May have an arbitrarily long setup time

2. Must process data in real time

3. Must produce decisions that take up little memory

To this end, we tried to make design decisions which sped up the testing time of our algorithm and simplified intermediate steps as much as possible, while assuming that an extended period of time may be spent doing pre- and post-processing of the data.

4.5.1 Comparison of MFCC/CELP Features

                 Nominal  Blade Damaged  Bearing Failure
Nominal          10689    1857           124
Blade Damaged    778      11890          2
Bearing Failure  238      29             12403

Table 4.2: System Performance with MFCC Features (7.97% Error Rate)

                 Nominal  Blade Damaged  Bearing Failure
Nominal          10913    1744           13
Blade Damaged    547      12121          2
Bearing Failure  19       4              12647

Table 4.3: System Performance with CELP Features (6.13% Error Rate)

                 Nominal  Blade Damaged  Bearing Failure
Nominal          11222    1435           13
Blade Damaged    648      12022          0
Bearing Failure  169      39             12462

Table 4.4: System Performance with MFCC + CELP Features (6.06% Error Rate)

4.5.2 Vibration Sensor Analysis

The vibration sensors were not used in the original classification system, but were added in order to examine their efficiency at detecting various kinds of engine problems.

4.5.2.1 Test Parameters

1. Hardware Specifics:

(a) Engine: SR-30 (Turbine Technologies)

(b) Number of Sensors: 4 (0, 90, 180, 270 degrees)

2. Software Specifics:

(a) Downsampling Rate: None

(b) Features: MFCC (21) + CELP (11)

(c) PCA Order: 32 (Full Features)

(d) Segment Sizes: 100, 300, 500

The algorithm used for classification was the 2-stage detection and diagnosis implementation, which means there is slightly less information about which fault classes (bearing failure and blade damage) are misclassified in the first detection stage, but this is not a big issue because of the extremely high general performance of the system.

4.5.2.2 Detailed Results

Runtime: Algorithm runtime varied depending on the size of the training segment, but was relatively constant for a given segment size. The table below shows the time it took to train a particular instance of the classifier given a set of training data.

Segment Size (samples per class)  Runtime (seconds)
100                               1.2 - 1.7
300                               43 - 52
500                               199 - 250

Table 4.5: Classifier training runtime for the vibration sensors

The amount of time needed to test the data was similar to that for the acoustic sensors, or about 0.1 ms per sample, indicating that the testing process can still be implemented in real time.
Detection Accuracy:

The following is a summary of the detection error (for 500 training samples per class) for each of the sensors at each stage:

                   Sensor 1  Sensor 2  Sensor 3  Sensor 4
Idle               0.045     0.002     0.000     0.000
Acceleration       0.457     0.169     0.255     0.209
Cruise             0.005     0.002     0.003     0.003
Deceleration       0.305     0.217     0.174     0.169
Fast Acceleration  0.362     0.181     0.196     0.180
Fast Deceleration  0.273     0.204     0.139     0.183

Table 4.6: Detection error of each vibration sensor

From the above information, we can conclude that, as predicted, the idle and cruise stages of engine operation have the best performance, with detection error in the 1-5% range (detection accuracy of 95-99%). Other stages have mixed success, generally in the 20-30% range in terms of detection error (70-80% detection accuracy), with the exception of Sensor 1, which shows a particularly high error for each of the more difficult stages.

4.5.2.3 Vibration Sensor Performance Analysis with PCA Reduction

As seen from Figure 1, Sensor 1 still shows some inconsistencies with the other sensors, in virtually all aspects of the sensor's performance. The only mode which seems to function as expected is the Cruise mode; all other performance results are outside of the expected bounds of detection accuracy when compared to the other three sensors.

In general, the Cruise and Idle stages perform best, with nearly 99% accuracy regardless of sensor, as seen in Figures 2-4. There are some variations as to which of the other stages is best or worst, but in general they all perform with an error rate of around 20%, which is arguably better than the 66% error rate of random guessing among 3 possible classification results, but still relatively poor.

As seen in Figures 5-7, the performance of the Cruise and Idle stages for the three sensors with the best results is somewhat varied. They all generally follow the expected trend of increasing performance with a higher number of PCA-reduced features, but Sensor 2 seems to behave a little strangely, in that its Idle stage has worse performance than its Cruise stage and its accuracy gets worse with more features. This might be an artifact in the data, or simply odd results within the margin of error, since the deviations are on the scale of 10^-3.

Lastly, an interesting observation from Figures 2-4 is that, regardless of the PCA order, the performance on the "hard" stages is generally the same. This most likely means that we are not capturing the behavior of these modes of operation accurately, since there is so much variation in the results.

Fig. 4.10 shows a comparison of the performance for several reduced feature sets for different feature configurations. As seen from the figure, reducing the MFCC features from the basic set of 21 down to as few as 10 preserves the performance of the full feature set. For CELP, the possible reduction is much more dramatic, since just 3 of the 11 original features are necessary to preserve the performance of the system.

Down-Sampling Ratio           1     8     16    32    64
Computational Complexity (s)  85.5  9.3   8.3   8.15  7.9
Error Rate                    0.15  0.19  0.21  0.2   0.4

Table 4.7: System Performance and Computational Complexity for Different Down-Sampling Ratios

4.5.3 Algorithmic Complexity

Analyzing the complexity (whether space or time) of this implementation of a classifier is difficult, primarily for two reasons:
1. MATLAB Environment

Using MATLAB code speeds up the development of an algorithm, but serves to hide some of the costs incurred during runtime. Overall, MATLAB requires a good deal of memory for its own runtime environment, GUI, and internal functions. Generally, this means that MATLAB code will run slower and less efficiently than code developed for a specific platform in C/C++. Although it is possible to determine the runtime of the code (see below), it is difficult to judge how much memory and how many system resources are being consumed to process the necessary information. As such, only rough estimates based on operating system monitoring software can be given, and these will be over-estimates of the actual necessary resources.

2. PRTools Classification Package

Because an open-source package is being used to perform the actual SVM classifier training and testing, there are inherent inefficiencies introduced by the generality of the base code. The PRTools functions do many things internally to calculate statistics and additional meta-data that are probably not necessary in a commercial implementation. There are also a few "workarounds" in our code that could be better implemented in dedicated code to speed up the process. Again, this implies that the resource requirements will be an over-estimate of the true system load.

We now describe some initial performance results for classifier training and testing, respectively. The two need to be separated because the classifier training is extremely memory intensive (using hundreds of samples of data at once to build a statistical model) while the testing is relatively simple (evaluating the features of a particular sample with the classifier).

4.5.3.1 Classifier Training

This portion is the most data and processor intensive part of the entire classification process, but it needs to be performed only once and could theoretically be completed offline, rather than in-flight.

1. Code Segment Size

The code we have for the classification is a couple hundred lines long, and the inclusion of the PRTools code would most likely leave the entire classifier under 1000 lines. While running the code, the Windows Task Manager showed an increase in memory usage of between 20 and 40 MB. As mentioned before, this is MATLAB code, so its length and memory requirements could vary depending on the platform. In particular, since the MATLAB environment includes a lot of built-in functionality, it is difficult to truly estimate the resource requirements of the code itself.

2. Data Segment Size

A typical training sample set is between 300 and 1200 samples (100 to 400 samples per class), since additional samples do not seem to improve performance considerably. Each sample currently consists of 21 MFCC features and 11 CELP features, each of which is stored as an 8-byte double. This results in a training data requirement of 75 kB to 300 kB for the data itself. The classifier is a set of mapping coefficients from each of the features and classes into a decision space.
The computer we are using has the following parameters: • Operating System: 64-bit Windows 7 • Installed RAM: 6 GB • Processor: Intel Core i7 @ 2.67 GHz Thetablebelowshowsthetypicalruntimesfortheclassifiertrainingfordifferent training set sizes: Training Set Runtime (seconds) 300 2.7 600 20.3 900 72.1 1200 187.8 Table 4.8: training set and complexity Experiments have shown that the runtime is roughly quadratic with respect to the size of the training set, although this depends on the actual data, since the SVM classifier’s stopping condition is based on an iterative optimality algorithm. 73 4.5.3.2 Classifier Testing This portion of the program is relatively processor and memory un-intensive, be- cause only one sample is being processed at a time and all other parameters have already been calculated. 1. Code Segment Size The code for this portion of the algorithm is relatively short, since the clas- sifier is only being applied to samples of test data. The bulk of this work is being done by PRTools, but the classification is a very simple process - the discriminant functions (one per class) are applied to the test sample and map it into the decision space. 2. Data Segment Size A single testing sample will again consist of 21 MFCC features and 11 CELP features, each of which are stored as 8-Byte doubles. This means a memory requirement of 256 Bytes for each data sample, although that will be lower after the dimensions of the feature space are reduced. It is important to note that the classifier itself must be kept in memory during this part of the algorithm, so the 70kB to 420kB will also be necessary. 3. Instruction Throughput Requirement Again, we estimate the runtime for processing a single sample by referencing our machine’s specifications in the previous section and noting additionally 74 that we currently process all the testing samples in a batch process. This probably improves the runtime slightly (since less time is used on additional function calls and intermediate steps), but rewriting this to work on a per- sample basis will not increase the processing needs significantly. Regardless of how many samples the classifier was trained from, the algorithm required around 3 seconds to process 35,000 samples, for a runtime of around 11.6 ms per sample. 4.6 Conclusion and Future Work 4.6.1 Training Set Selection One of the potential areas where substantial improvements could be made, espe- cially in light of our design restrictions, is the selection of the best training set for a given choice of training data. As is usually the case, there is an overwhelming amount ofsample dataand no clear choice ofwhich pieces touse, but there is some intuition for this set of data which may prove to be useful. First, itis clearthatcontiguous portionsofraw datashould beused fortraining sets. Usingselectedsamplesfromanydatasetthatisessentiallyatime-serieswould be meaningless. The second assumption we can make is that there is some portio of the training datawhichismorerepresentativeoftheenginestatethanotherparts. Thissuggests 75 using a form of analysis that looks at optimal stationarity of the data within a selected window, but it is still unclear which method is best suited for this. Tests with simple statistical approaches tended to yield slightly worse results than the best random training set. Lastly, training sets should be chosen from areas that are outside of the tem- poral boundaries of the entire training set. 
We found that training samples taken from around the start and end of the training data set tended to perform much worse than samples from the middle portions, although we have not quantified a temporal area of effect for this phenomenon. [INSERT PICTURE OF THIS]

Figure 4.8: Performance of (a) Vibration Sensor 1 (Sensor 12/15), (b) Vibration Sensor 2 (Sensor 13/15), (c) Vibration Sensor 3 (Sensor 14/15), (d) Vibration Sensor 4 (Sensor 15/15)

Figure 4.9: PCA of (a) Vibration Sensor 1 (Sensor 12/15), (b) Vibration Sensor 2 (Sensor 13/15), (c) Vibration Sensor 3 (Sensor 14/15), (d) Vibration Sensor 4 (Sensor 15/15)

Figure 4.10: Classification Error for PCA Reductions on Sensor 7 Data

Chapter 5

Fundamental Frequency Estimation for Music Signals with a Modified Hilbert-Huang Transform

5.1 Introduction

The technique of fundamental frequency (also denoted by F0) estimation plays an important role in music information retrieval (MIR), mechanical fault diagnosis and speech recognition. A number of fundamental frequency estimation methods have been proposed for different applications in the last two decades. Driven by real-time applications, it is desirable to conduct pitch analysis within a short analysis window in time. In the context of MIR, the fundamental frequency estimation problem is particularly challenging due to the rich variety of musical instruments and expressions (e.g., notes of different durations and polyphonic mixing).

The autocorrelation-based method [dCK02] cannot be easily extended to multiple F0s and high pitch identification. Besides, it needs a sufficiently long analysis frame, whose size is at least twice the fundamental period. The Fourier transform based method performs poorly in estimating low-pitch notes, since it is not effective in handling low-frequency musical signals with logarithmic-scale notes [Kla08]. The method of auditory scene analysis (ASA) imitates the human hearing mechanism to separate mixed sounds [WBE06]. An unsupervised learning method known as nonnegative matrix factorization (NMF) was recently proposed in [Con06] to solve this problem. However, very few of these methods can successfully detect a wide pitch range with a short temporal analysis window. In particular, their performance deteriorates significantly in the low frequency region. Besides, the conventional window length of 93 ms cannot satisfy all types of music, especially music with a very rapid rhythm. All of the above observations motivate this research.

The Hilbert-Huang Transform (HHT), introduced by Huang [HSL+98], has proved to be useful in certain data analysis applications arising from the mechanical and aerospace engineering disciplines. HHT is a nonlinear analysis tool, which is suitable for non-stationary AM-FM data such as quasi-periodic music signals. In this work, we show that HHT can be used to identify low-pitch signals within a short temporal window. However, applying HHT directly to music signals encounters several issues, e.g., mixing of frequency modes at the same time location. Most HHT-based signal analysis methods proposed before, e.g., [HWLD05], are not suitable in this application context. Our main contribution here is the proposal of a modified HHT tailored to the short-window pitch analysis application.

The rest of this paper is organized as follows. A brief review of HHT is given in Sec. 2.2. The modified HHT method is presented in Sec. 5.2.
Experimental results are shown in Sec. 5.3 to demonstrate the superior performance of the modified HHT method in low pitch analysis. Finally, concluding remarks and future research topics are given in Sec. 5.4.

5.2 Music Pitch Analysis with Modified HHT

5.2.1 Weaknesses of HHT

There are many challenges in music pitch analysis. Emiya [EDB07] pointed out that the performance of most pitch estimation methods decreases if the analysis window becomes shorter or the target F0 range is wide. Klapuri [Kla08] observed that YIN, which employs a reliable autocorrelation function (ACF) method, is not robust when F0 is greater than 600 Hz. In addition, methods based on the Fourier transform inevitably encounter a problem at low pitch notes due to time-frequency localization. In this work, we show that these problems can be well handled by a modified version of HHT.

There are several reasons why the original HHT does not give good performance for music pitch analysis, which are described in more detail below.

• Boundary effect

For extrema that are close to but outside of the analysis window, we need their values to support the cubic spline fitting in the boundary region of the window. The accuracy of the outside extrema affects the shape of the envelope curve and, thus, the final IMF. The error is largest around the boundary, and its effect also propagates into the interior region of the window.

Figure 5.1: (a) An example of mode mixing and (b) a signal with an added bubble, shown with its upper and lower envelopes.

• Intermittence and mode mixing

By intermittence, we mean that the frequency of a signal tracked by a particular IMF jumps to other IMFs. This situation occurs when a signal has multiple primary frequency components, called mode mixing [DK05]. Some frequency components can be easily sifted out while others cannot. As a result, the same frequency component may not reside in the same IMF. An example is given in Fig. 5.1 (a), where mode mixing is observed between IMF2 and IMF3. The mode mixing problem is a major obstacle to F0 estimation in music signal processing. Furthermore, there is a problem with combinations of pure tones, as observed in [DK05]. To illustrate this, consider the sum of two sine waves,

sin(f1 t) + sin(f2 t) = 2 sin(0.5 (f1 + f2) t) cos(0.5 (f1 − f2) t),

which is itself an amplitude-modulated sine wave. Without proper pre-processing to separate such components before sifting, HHT may produce undesired IMFs.

• Stability of EMD with respect to signal perturbation

A perturbation in the neighborhood of an extremum may change its position and, thus, change the upper/lower envelope curves and the mean function. To address this issue, more iterations are needed in this area to sift out the IMF at a suitable scale. However, the extra sifting may not be needed in other, smooth areas. This phenomenon complicates the stopping criterion and demands more computational complexity. In Fig. 5.1 (b), we show the upper/lower envelope curves and the mean function of note C4 (261 Hz). One "bubble" in the right region of Fig. 5.1 (b) affects the shape of the lower envelope significantly.

• Existence of subharmonics and partials

A music signal usually contains subharmonic components (frequencies lower than F0) and partials due to imperfect periodicity. HHT is very sensitive to non-stationary components. The existence of these components complicates the task of fundamental frequency estimation.
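A quick numerical check of the corrected two-tone identity above illustrates why such signals challenge EMD: the sum of two nearby tones is exactly an amplitude-modulated carrier, which the sifting process may keep as a single mixed IMF rather than two separate tones. The frequencies below are illustrative only.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8000, endpoint=False)
w1, w2 = 2 * np.pi * 300, 2 * np.pi * 320          # two close tones (rad/s)
lhs = np.sin(w1 * t) + np.sin(w2 * t)
rhs = 2 * np.sin(0.5 * (w1 + w2) * t) * np.cos(0.5 * (w1 - w2) * t)
assert np.allclose(lhs, rhs)   # a 310 Hz carrier under a 10 Hz beat envelope
```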
5.2.2 Modified HHT for Fundamental Frequency Estimation

To address the four problems stated in Sec. 5.2.1, we propose a modified HHT to extract the F0 information. The block diagram is shown in Fig. 5.2. After the signal is segmented with a suitable window size, we apply a filterbank to decompose the signal into several narrowband music signals. We discard weak bands using an energy threshold (which is set to 10% in our experiments). EMD is then applied to each individual band to acquire the IMFs. Again, we discard IMFs which fall outside the passband. Through this modified EMD process, we collect an IMF set that is different from that obtained by the original HHT. Then, we select the IMF containing the fundamental frequency (denoted by IMF-F0) as the one that has the maximum correlation with the original signal. In the last step, the Hilbert transform is applied to IMF-F0, and only the central 50% effective window is evaluated. The median of the instantaneous frequency inside the effective window is taken as the fundamental frequency.

Figure 5.2: The modified HHT process for fundamental frequency estimation.

The proposed modified HHT algorithm has the following three unique features.

First, to address the boundary effect problem, we adopt a mirror approach to estimate the outside extrema in the EMD process. Furthermore, we define an effective window around the center, and only values within the effective window are evaluated. The effective window size is about 50-75% of that of the original analysis window.

Second, to handle the mode mixing problem, we use Rilling's stopping criteria [RFG03] in the EMD process and add a filterbank pre-processing stage. We use the mode amplitude a(t) = (x_U(t) − x_L(t))/2 and the evaluation function σ(t) = |m(t)/a(t)|. Then, the stopping criteria can be written as

• σ(t) < θ1 for some prescribed fraction (1−α) of the total duration;
• σ(t) < θ2 for the remaining region.

Here, θ1, θ2 and α are three pre-selected parameters, usually set to 0.05, 0.5 and 0.05, respectively. These two conditions are much stricter than the original one because they impose a local restriction on the IMF, while Huang's criterion is based on a global standard deviation difference.

Although Rilling's stopping criteria provide a better sifting result, some separation is still required for non-stationary music data. In the pre-processing stage, we apply a filterbank to the input signal to decompose it into several narrowband signals before HHT. Each narrowband music signal contains only one harmonic (or subharmonic) component, which is locally constant within the window. Besides, this dramatically decreases the possibility of ambiguity between tones and AM-FM signals. To sum up, the pre-processing step helps decrease the interference caused by mode mixing. In the experimental section, we use 10 second-order Butterworth filters for the filterbank implementation. The flat passband gain of the Butterworth filterbank allows us to select IMF-F0 in the time domain robustly and without attenuation.

Third, to address the problem of subharmonics and partials, we consider the following post-processing technique. We discard the undesired frequency bands whose energy is less than 10% of that of the original signal. Then, we focus on the IMFs that contain the frequency bands of interest, which gives us a smaller pool from which to select the IMF containing the fundamental frequency. Unlike other multi-scale techniques, EMD generates IMFs from high resolution to low resolution scales. As a result, the energy of IMF-F0 may not be high. Thus, we use the correlation as the metric to identify IMF-F0, since IMF-F0 should be the IMF most correlated with the input signal. Finally, we estimate F0 by finding the median of the instantaneous frequency inside the effective window of IMF-F0.
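The pipeline above can be summarized in a short sketch. The version below is a hedged illustration in Python, assuming the third-party PyEMD package for sifting and SciPy for the filterbank and the Hilbert transform; the band edges are placeholders, and Rilling's stopping criteria are left to the EMD implementation's defaults.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert
from PyEMD import EMD

def estimate_f0(x, fs, band_edges):
    """Modified-HHT F0 estimate: filterbank -> per-band EMD -> IMF-F0
    selection by correlation -> median instantaneous frequency over the
    central 50% effective window."""
    imfs = []
    for lo, hi in band_edges:                        # 2nd-order Butterworth bands
        sos = butter(2, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        if np.sum(band ** 2) < 0.10 * np.sum(x ** 2):
            continue                                 # discard weak bands (10% rule)
        imfs.extend(EMD().emd(band))
    # IMF-F0: the IMF with the largest normalized correlation to the input
    imf = max(imfs, key=lambda m: abs(np.dot(m, x)) / (np.linalg.norm(m) + 1e-12))
    phase = np.unwrap(np.angle(hilbert(imf)))
    inst_f = np.diff(phase) * fs / (2.0 * np.pi)     # instantaneous frequency (Hz)
    mid = slice(len(inst_f) // 4, 3 * len(inst_f) // 4)
    return np.median(inst_f[mid])
```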
Thus, we use the correlation as a metric to measure IMF-F0 since IMF-F0 should be most correlated to the input signal. Finally, we estimateF0by findingthemedianoftheinstantaneous frequency inside the effective window of IMF-F0. 87 5.3 Experimental Results In the experiments, we chose the widely-used music database, the Musical Instru- ment Samples from the University of Iowa , which are stereoly recorded with a sampling rate of 44.1KHz. We collected 3-sec-long data for each of the 88 notes that range from A0 (27.5Hz) through C8 (4186Hz). For each frame-size, we had 100 realizations from a different portion of the data. signal imf1 imf2 imf3 imf4 imf4 (a) signal imf1 imf2 imf3 imf4 imf4 (b) Figure 5.3: IMFs of C4 (261Hz) by sifting (a) without and (b) with the filterbank pre-processing. We used one realization from the C4 (261Hz) data to illustrate the benifit of the filterbank pre-processing in the sifting process. As shown in Fig. 5.3 (a), both IMF3 and IMF4 contain the F0 information in different time intervals due to the mode mixing problem. However, by applying the modified HHT to the same data, we obtain results as shown in Fig. 5.3 (b), where the mixing mode phenomenon is avoided. We can overcome the problem of mixing mode as described in Sec. 5.2 88 using the filterbank to generate narrowband signals. By using the filterbank and EMD, we can obtain the F0 information in one IMF with a higher probability. 20 40 60 80 100 0 20 40 60 80 100 Frame(ms) Hit Rate(%) Modified HHT HHT YIN Figure 5.4: Performance comparison of hit rates of the YIN method, the original and the modified HHT method. Fig. 5.4comparestheperformanceofaverageF0hitrateoftheYIN,traditional HHT and modified HHT algorithms. 100 realizations of C4 piano data were used. The hit rate is defined as the percentages of detected F0s within 6% of the true F0 of given notes. The YIN method assumes the knowledge of the minimum F0 and, consequently, the minimum frame size is also known (which is roughly twice of (1/minimum F0) [dCK02]). The original minimum F0 was set to 30Hz in the experiment. As shown in Fig. 5.4, the YIN method requires a mimimum of 65ms frame-length to analyze F0. It fails to work properly if its frame size is smaller 89 than the desired one. In contrast, HHT-based methods do not require any prior knowledge of the minimum frame size. They are more robust and suitable for short frame analysis. The above characteristics allow HHT-based algorithms to identify F0s under some challenging situation, for example, fast varying frequency componentsinmusic. WealsoseefromFig. 5.4thatthemodifiedHHThasabetter hit rate than the YIN method and the original HHT. Note that the original HHT is suitable for non-stationary data, yet music data is not entirely non-stationary. we can impose some constraints (such as pre- and post-processing steps) to get the pitch information more accurately. Besides, the original HHT tends to identify subharmonics (i.e., frequencies less than F0) as F0 when the frame size is larger. Finally, we show the hit rate of the modified HHT method for notes C2, C3, C4 and C5 as a function of the frame size. The performance is better as the frame size becomes large and/or F0 becomes higher. The hit rate is equal to 100% for note C5. 5.4 Conclusion and Future Work In this work, we proposed a HHT-based F0 estimation method that provides a better short frame analysis than traditional methods. 
5.4 Conclusion and Future Work

In this work, we proposed an HHT-based F0 estimation method that provides better short-frame analysis than traditional methods. Specifically, we presented a modified HHT algorithm that gives better results when applied to quasi-periodic music signals. Experimental results showed that the modified HHT method outperforms the YIN method and the original HHT method. There are some extensions to be pursued in the near future. For example, we would like to generalize our work to the polyphonic case. Besides, it is worthwhile to examine the performance of the proposed algorithm in a real-world music setting.

Chapter 6

Content/Context-Adaptive Feature Selection for Environmental Sound Recognition

6.1 Introduction

The environmental sound recognition (ESR) problem arises in many interesting applications such as audio scene analysis, navigation, assistive robotics, and mobile-device-based services. By audio scene analysis, we refer to the classification of a location (such as a restaurant, a playground or a rural area) based on its distinct acoustic characteristics. Audio data remain available in challenging conditions such as a lack of light and/or visual obstruction. Besides, compared with video, audio is relatively easy to store and process. The ESR technique can also be used to enhance the performance of speaker identification and language recognition when environmental sounds are present in the background.

Research on unstructured audio, such as environmental sounds, has received less attention than that on structured audio such as speech or music. Only a few studies have been reported, and most of them were conducted with raw environmental audio. To give a couple of examples, sound-based scene analysis was investigated in [CSP98, PTK+02, EPT+06]. Because of the randomness, high variance and other difficulties associated with environmental sounds, their recognition rates are poorer than those for structured audio. This is especially true when the number of sound classes increases. To overcome the insufficiency of MFCCs and other commonly used features, Chu et al. [CNK09] proposed a set of features based on the Matching Pursuit (MP) technique. Although the MP-based features provide good performance, their computational complexity is too high for real-time applications. The low-complexity CELP-based features were studied in earlier chapters of this thesis.

It is well known that the performance of ESR algorithms decreases dramatically as the number of sound classes increases (even with good features), so more good features are needed for performance improvement. On the other hand, a larger number of features not only results in higher complexity but also demands more samples, with no guarantee of performance improvement. That is because some features may help classify some classes while confusing others. Besides, it is not easy to train a classifier with good discriminant power in a higher-dimensional feature space. As a result, feature selection and reduction is an important task. In this chapter, we propose a novel content/context-based feature selection method to achieve this goal.

The rest of this chapter is organized as follows. Previous work on feature selection is reviewed in Sec. 6.2. The proposed content/context-based feature selection methods are discussed in Sec. 6.3. Experimental results are presented in Sec. 6.4 to demonstrate the superior performance of the proposed methods. Finally, concluding remarks and future research directions are given in Sec. 6.5.
6.2 Review of Previous Work

Traditional feature selection attempts to find a feature subset that maximizes a utility function or a certain pre-defined performance metric. It assumes that the metric is a good approximation to the classification rate. Then, the problem can be re-formulated as the search for a feature subset that maximizes the pre-defined metric. Note, however, that the feature set reduction problem is examined with all available sounds in the database, so the reduced feature subset is independent of any particular query sound. Feature selection algorithms typically fall into the following two categories.

• Feature Ranking
Feature ranking scores the features by a metric and eliminates all features that do not achieve an adequate score.

• Subset Selection
Subset selection searches the set of possible features for the optimal subset. It can be further classified into three types.
– Wrappers
Wrappers use a search algorithm to explore the space of possible feature subsets and evaluate each subset by running a model on it. Wrappers can be computationally expensive and run the risk of over-fitting to the model.
– Filters
Filters are similar to wrappers in the search approach. However, instead of evaluating subsets against a model, a simpler filter is used in the evaluation. Two popular filter metrics for classification are correlation and mutual information, although they are not true distance measures in the mathematical sense, since they fail to obey the triangle inequality. They should rather be regarded as scores. These scores are computed between a candidate feature (or a set of candidate features) and the desired output category.
– Embedded Techniques
Embedded techniques are embedded in, and specific to, a model, for example, the decision tree. A large amount of data analysis software for feature selection is available in the public domain.

There are two main feature selection tasks: 1) selecting the evaluation criterion and 2) selecting the search algorithm. For the evaluation criterion, we have the following choices.

1. Distance-based measure
Distance measures are also known as separability, divergence or discrimination measures. One can maximize the inter-class distance using linear or non-linear metrics such as the Minkowski, Euclidean or Chebychev distance. Many probabilistic distances (e.g., Mahalanobis, Bhattacharyya, Divergence, Patrick-Fischer) can be simplified in the two-class case when the distribution of each class has a parametric functional form.

2. Margin-based measure
One can maximize the margin of a hyperplane that separates two classes.

3. Information-based measure
This measure uses the information gain from a feature, the mutual information between a feature and a class label, or the entropy. The maximal-relevance-minimal-redundancy (mRMR) criterion has been shown to be equivalent to maximal statistical dependency of the target class on the data distribution, while being more efficient.

4. Dependence-based measure
The dependence-based measure is also called the correlation measure or the similarity measure. The correlation coefficient can be used to find the correlation between a feature and a class.

5. Consistency-based measure
The consistency-based measure attempts to minimize the number of features that separate classes inconsistently, where an inconsistency is defined as two instances having the same feature values but different class labels.

The relation between these five measures and the classification error probability in terms of performance bounds has been widely studied. However, it is still an open question.
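As a concrete illustration of the filter-style ranking approach with an information-based criterion, consider the following minimal sketch; scikit-learn is our choice here, not a tool used in the thesis.

```python
# An illustrative filter-style feature ranking: score each feature by its
# estimated mutual information with the class label, then keep the best.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_by_mutual_info(X, y, keep=10):
    """Return the indices of the `keep` features with the highest estimated
    mutual information with the labels y."""
    scores = mutual_info_classif(X, y)
    return np.argsort(scores)[::-1][:keep]
```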
Given a criterion, the next question is how to find the optimal feature set. Exhaustive search is not practical. All optimal methods are based on Branch and Bound; however, they can only be applied to problems of lower dimensionality with a monotonic criterion (e.g., a distance measure). Most often, people use sub-optimal solutions of polynomial complexity for problems of higher dimensionality or with a non-monotonic criterion. Examples of sub-optimal search include: sequential selection, floating search, oscillating search, dynamic oscillating search, random subspace methods, evolutionary algorithms, memetic algorithms, the Relief algorithm, simulated annealing, tabu search, randomized oscillating search, etc. Other feature selection problems include: the determination of the feature set size, the feature acquisition cost, over-fitting and instability.

At the end of this section, we would like to review the Fisher Discriminant Analysis (FDA), since it will be used in our work. The Fisher linear discriminant provides an efficient tool for dimension reduction in statistical pattern recognition. Its main idea can be briefly stated as follows. Suppose that there are two kinds of sample points in a d-dimensional data space. We desire to find a line in the feature space such that the projections of the sample points onto the line are separated as much as possible. To achieve this goal, one can define the Fisher discriminant ratio as

J(w) = (m̃_1 - m̃_2)^2 / (S̃_1^2 + S̃_2^2),

where w is the direction vector of the separating line (i.e., the linear discriminant direction), m̃_i and S̃_i are the mean and the within-class scatter of the i-th class, respectively, and the tilde denotes the result after projection; for example, m̃ = w^T m. FDA finds the linear projection w that maximizes the Fisher ratio J(w).
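Since the Fisher ratio recurs throughout the selection stages in the next section, a per-feature sketch may help. It specializes J(w) to a coordinate axis, which is how we score individual features; the helper names are ours.

```python
# Per-feature Fisher ratio: J = (m1 - m2)^2 / (S1^2 + S2^2), where each
# S^2 is the within-class scatter of one class along that feature.
import numpy as np

def fisher_ratio(x1, x2):
    """Fisher ratio for two 1-D class samples of a single feature."""
    m1, m2 = x1.mean(), x2.mean()
    s1 = np.sum((x1 - m1) ** 2)  # within-class scatter of class 1
    s2 = np.sum((x2 - m2) ** 2)  # within-class scatter of class 2
    return (m1 - m2) ** 2 / (s1 + s2)

def rank_features(X1, X2):
    """Rank the feature columns of two class matrices, best first."""
    scores = [fisher_ratio(X1[:, j], X2[:, j]) for j in range(X1.shape[1])]
    return np.argsort(scores)[::-1]
```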
6.3 Content/Context-Adaptive Feature Selection

To address the problem of a large number of classes, we consider a new methodology in this work. That is, we first study the content and the context of each audio sound, and identify the most useful content/context-adaptive feature set. The reduced feature set then has a higher discriminant power for a particular query sound with respect to the other sounds in the database. Specifically, we propose two methods, namely the context-based method and the content-based method, to achieve this goal.

6.3.1 Context-Adaptive Method

The context-adaptive method divides the classification task into two stages: 1) context identification and 2) target classification, as shown in Fig. 6.1 [RTC11]. The main idea is to divide the whole set of classes into several contexts, where each context is a group of classes that share similar features or belong to a similar category. We can then use different context-adaptive features to further classify the target within a context.

Figure 6.1: The conceptual diagram of content/context-adaptive feature selection methods.

The context identification serves as a pre-processing unit before classification. We consider the following two kinds of contexts.

• Nominal
Nominal contexts are groups of classes that belong to a similar category with a physical meaning, for example, similar sounds or similar situations among the environmental sounds, such as rain and water. A context does not necessarily have to be a group of classes; it can be some other taxonomy or another factor that affects the features, such as the weather condition in [RTC11]. No matter what kind of context is used, additional information about the data is necessary.

• Artificial
For some specific data, there is no clue or prior knowledge about the data and its features. As a result, an artificial taxonomy is adopted, i.e., clustering. We cluster the data so that each context contains samples with similar or nearby features. Intuitively, those samples or classes should be under the same situation or context even though we have no idea what the context's physical meaning could be. In this case, we artificially cluster all input sounds into a pre-defined number of contexts.

We treat context identification as an extra stage that reduces the load on the target classifier in both complexity and performance, since fewer classes remain for the target classifier after context identification. Generally speaking, fewer features are needed because the number of sound classes within a context is smaller. It also avoids ambiguous features that are good for some classes but notoriously bad for others. We summarize the context-adaptive method below.

Context-Based Method: Nominal Context
Training Phase
1. Group the original classes into different contexts by nominal categories.
2. Pick the dominant feature set F1 by Fisher ratio for the context identification stage.
3. For each context, extract the dominant feature set F2 by Fisher ratio.
4. Use the sequential forward search to decide F1 and F2, stopping when the Fisher ratio can no longer be increased by adding features.
Testing Phase
1. Context identification: use feature set F1 to identify the context.
2. Depending on the result of context identification, use the corresponding feature set F2 to identify the target.

Context-Based Method: Artificial Context
The algorithm is basically the same. The only difference is in the context construction, which is given below.
1. Normalize the features. To make the clustering meaningful, we have to normalize the features; we do not want the clustering to be driven by differences in scale. For example, an LPC value is always less than 1 while the pitch is in the range of hundreds.
2. Use PCA to reduce the features to three linearly combined features. We use PCA to reduce the dimension because clustering in a high-dimensional space may not be precise. Besides, if we need a probability-density-based classifier to estimate the density, a high dimension is very challenging. Furthermore, PCA projects onto the principal components regardless of class labels; since we are concerned with clustering rather than classification here, it is fine to use PCA.
3. Use the K-means algorithm on the reduced features to cluster the contexts. Because clustering in a high-dimensional space is not practical, we use PCA to project the features into a 3-dimensional space, and the K-means algorithm is then adopted to cluster the contexts. PCA is the easiest way to reduce the dimension here because the labels are not important; what we care about is the main principal structure of the features. The best number of clusters is the one beyond which the Fisher ratio cannot be increased by adding more clusters.

For multiple classes, we need to identify the classes within each context. Here, we use a user-defined parameter to filter out non-relevant classes in a context: if a class accounts for less than c% of one context, we ignore it as not being in that context. We can set the optimal cluster number by the following method (a code sketch follows the list):
1. We start with 2 clusters and compute the Fisher ratio R.
2. We further try more clusters and recompute R.
3. We stop at K clusters when R stops improving.
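A minimal sketch of this artificial-context construction is given below. As a stand-in for our Fisher-ratio stopping rule it uses the Calinski-Harabasz score, a between/within scatter ratio; that substitution, and the helper names, are illustrative assumptions.

```python
# Build artificial contexts: normalize, project to 3 PCA dimensions, then
# K-means, growing K until the scatter ratio stops improving.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def build_artificial_contexts(X, max_k=8):
    """Cluster normalized, PCA-reduced features into artificial contexts."""
    Z = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
    best_k, best_score, best_labels = 2, -np.inf, None
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
        score = calinski_harabasz_score(Z, labels)
        if score <= best_score:
            break  # more clusters no longer improve the scatter ratio
        best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```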
The whole system diagram is shown in Fig. 6.2. As we can see, this multi-stage design for a multi-class system can dramatically speed up the classification and reduce the number of features. By using a different feature set for each context, we can also improve the classification rate. Moreover, the approach is compatible with any modern classifier.

Figure 6.2: Illustration of the context-based classifier design.

6.3.2 Content-Adaptive Method

The context-based method already gives satisfactory results. However, it still has some drawbacks. First, the result depends on a high context identification rate; if the context is misclassified, there is no way to correct the error. Second, there is a trade-off between the number of contexts and the number of within-context classes. We want fewer within-context classes so that the within-context features are valuable to each class. However, fewer within-context classes means a larger number of contexts, which makes the context identification stage more challenging. That is why we propose the second algorithm, the content-based (or sample-based) method.

The sample-based method assumes that the testing sample carries enough information or statistics to decide its own dominant features. The basic idea is that we collect the statistics and information from the testing sample and decide its dominant features, again by Fisher ratio. Then we compare them with the training database. We do not necessarily compare against all the classes, because we can eliminate those classes with very different dominant features. By eliminating some of the classes, we first reduce the complexity and, second, improve the performance by taking fewer classes into account. Besides, we do not really need that many features.

After that, we can use a similarity measure to compare the testing sample with the training database. Here we use a density-based measure such as the KL divergence, because representative-point distance measures, such as the mean distance, are not precise.

However, using all (or even many) features to perform the density similarity measurement is not practical; we should compare only the important features. Using the Fisher ratio, we can obtain a set of dominant, discriminant features for each sample to measure the similarity. The length of the collected testing samples is also a main issue of concern in this algorithm. In our experiments, we use 3 seconds of data as one sample point.

To sum up, we do the following:
1. Use the Fisher ratio to rank the important features for each class.
2. Use multiple testing samples as one frame to calculate the statistics and collect the important feature information.
3. Compare the testing sample statistics with the training data using the KL divergence.

Content-Adaptive Method
Training Phase
1. For all the classes, extract all the features.
2. Use the Fisher ratio to extract the dominant feature set F1_i for each class i as a comparison database.
Testing Phase
1. Extract the features over 3 seconds.
2. Calculate the Fisher ratio by comparing with the database.
3. Compare with the training database; eliminate classes (and their features) whose dominant features in the comparison database do not appear among the dominant features of the testing sample.
4. Use the KL divergence to calculate the distribution distance to each remaining class (see the sketch below).
5. Classify the testing sample to the most similar class.
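Steps 2-4 can be sketched as follows, assuming for illustration that each class's dominant features are modeled as independent Gaussians; the thesis does not fix a particular density model, so this form and the helper names are assumptions.

```python
# Rank candidate classes by the KL divergence between per-feature Gaussian
# fits of the testing sample and of each training class.
import numpy as np

def gauss_kl(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ), summed over independent features."""
    return 0.5 * np.sum(np.log(var1 / var0)
                        + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def most_similar_class(sample_feats, class_stats, dominant_idx):
    """sample_feats: (frames, features) array from one 3-second sample.
    class_stats: {label: (mu, var)} fitted on training data.
    dominant_idx: indices of the dominant features picked by Fisher ratio."""
    mu0 = sample_feats[:, dominant_idx].mean(axis=0)
    var0 = sample_feats[:, dominant_idx].var(axis=0) + 1e-8
    scores = {label: gauss_kl(mu0, var0,
                              mu[dominant_idx], var[dominant_idx] + 1e-8)
              for label, (mu, var) in class_stats.items()}
    return min(scores, key=scores.get)  # smallest divergence wins
```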
We have two options. First, we can calculate the Fisher ratio of the testing sample based on the whole database. Second, we can compute a class-based Fisher ratio: for each feature, we calculate the Fisher ratio of the testing sample with respect to each individual class in the database. We first pick the features that differentiate the sample most strongly from some specific class, rather than those with the best average differentiation; once a feature is picked, we choose from the remaining features. In short, instead of finding the maximum in a vector, we search in a matrix. If global statistics (whether statistics of the class itself or statistics against all other classes' data) do not help much, we may go deeper and analyze class-to-class statistics. For example, if the feature ranking is not stable at testing time, we may go class-to-class, since a matrix carries more information than a vector. Some class data may be dramatically distinguishable from a specific class even when statistics computed against all other classes are blurred by the influence of those classes.

Classifying by directly matching the content feature set picked from the training database against that of the testing sample is not suitable for the environmental sound recognition problem: environmental sounds are too complicated to reproduce the exact dominant features stored in the database. However, if the dominant features we picked are not on the list of those of the testing sample, it is essentially guaranteed that the two are quite different, especially for the first few dominant features.

The selection/statistics processing helps decrease the load on the classifier. This means that we may choose a simple, low-complexity classifier instead of a sophisticated one, which is a side benefit of our approach.

6.4 Experimental Results

6.4.1 Experimental Setup

In the experiments, we collected 30 classes of environmental sounds from the BBC audio data. They were sounds associated with the following:
• Transportation (7): airplane, car, motorcycle, train, helicopter, ship, elevator.
• Weather (3): rain, thunder, wind.
• Sports (3): table tennis, tennis, basketball.
• Rural Areas (3): bird, insect, stream.
• Animal (5): dog, chicken, sheep, horse, pig.
• Indoor (3): telephone, bell, clock.
• Human (3): crowd chatting, crowd applause, baby crying.
• Special (3): machine gun, tank, vacuum cleaner.

All data were resampled at 8 KHz and normalized to one-minute-long mono audio clips. Some pre-processing and filtering operations were used to filter out silence as well as irrelevant or noisy parts of the environmental sounds. The CELP features were extracted by modifying the standard code of ITU-T G.723.1. There were 19,403 instances in total in the feature space, with a roughly equal number in each class. To compare the CELP features with the MFCC features, we adopted the naive Bayesian network with ten-fold cross-validation; the machine learning toolbox weka [HFH+09] provides a convenient interface to both. Due to the space limit, only the results of the Bayesian network are provided. Since we collect the statistics every 3 seconds, the classifier also takes the average of the features over 3 seconds as one sample point (a sketch of this protocol follows the feature list below). The adopted feature set includes the following:
• CELP
• MFCC
• Amplitude Modulation
• Auto-Correlation
• Energy
• Envelope
• Envelope Shape Statistics
• Zero-crossing Rate
• Perceptual Sharpness
• Spectral Flatness
• Spectral Shape Statistics
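For illustration, the evaluation protocol might look as follows in code, with scikit-learn's Gaussian naive Bayes standing in for the weka Bayesian network used in the thesis; `per_clip_features`, `y` and `frames_per_3s` are hypothetical names.

```python
# Average frame-level features into 3-second sample points, then run
# ten-fold cross-validation with a naive Bayes classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def aggregate_3s(frame_feats, frames_per_3s):
    """Average consecutive frames so each 3-second group is one sample."""
    n = (len(frame_feats) // frames_per_3s) * frames_per_3s
    grouped = frame_feats[:n].reshape(-1, frames_per_3s, frame_feats.shape[1])
    return grouped.mean(axis=1)

# X = np.vstack([aggregate_3s(f, frames_per_3s) for f in per_clip_features])
# scores = cross_val_score(GaussianNB(), X, y, cv=10)  # ten-fold CV
```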
6.4.2 Results and Discussion

Context-Based Method. In Fig. 6.3, the X axis is the number of features adopted in the within-context classification, i.e., the number of dominant features, and the Y axis is the correct classification rate. We compare the performance of the context-based method against context-independent features, i.e., directly using all the features (and their PCA projection) regardless of context. We can clearly see that the performance of the context-based method is better than using all features or their PCA projection over all classes. For the context construction, the artificial context-based method outperforms the nominal one. This is because, for environmental sounds, classes that belong to a similar category may still behave quite differently.

Figure 6.3: Classification Rate.

Content-Based Method. In Fig. 6.4, the X axis is again the number of dominant features adopted and the Y axis is the correct classification rate. We compare the performance of the MP features proposed by Chu et al. [CNK09] with our content-dependent features. The MP features are 3-dimensional, so their performance is measured only up to 3 dimensions. We can clearly see that our content-based feature selection outperforms the MP features and PCA, thanks to the benefits mentioned in the previous section. MP feature extraction is quite time-consuming and, when dealing with a large database, the MP features may not be sufficient to distinguish certain classes. Here we provide an adaptive way to select only the important features for classification, which makes the algorithm faster and easier.

Comparing the content-based method with the context-based method, we find that the content-based feature selection is better, because its features are adapted to each testing sample and we do not have to worry about the trade-off between context features and target features. We can also see that 3 to 6 features are enough to classify a sample. Too many features do not necessarily help, since we use a similarity measurement; that is why the performance drops with a larger number of features.

Figure 6.4: Classification Rate.

6.4.3 Length of Samples

An interesting question is how long a sample should be to collect reliable statistics. From Fig. 6.5, we can clearly see that the best performance is obtained at 3 seconds. With larger lengths the performance drops, because the resulting smaller number of samples may cause an over-fitting problem.

Figure 6.5: Classification Rate with different lengths of samples.

6.4.4 Confusion Matrix Discussion

Due to the space limit, we only list the result for one context. The result is quite good.

Figure 6.6: Confusion Matrix.

%           Airplane  Car    Motor  Train  Helicopter  Ship   Elevator
Airplane    95.1      –      1.1    –      –           0.7    –
Car         –         85.3   4.9    0.1    –           –      –
Motor       –         –      80.4   –      –           3.5    –
Train       0.1       –      –      90.4   –           1.1    –
Helicopter  –         –      –      –      99          –      –
Ship        0.5       –      –      1.2    0.1         91.2   –
Elevator    –         0.1    0.1    –      0.3         0.3    98.8

Table 6.1: The confusion matrix obtained with the CELP features only and the Bayesian network classifier.

6.5 Conclusion and Future Work

A novel context-based method and a content-based method were proposed in this work to solve the ESR problem. The new methods outperform the MP feature set by a significant margin. The content-based method offers the best classification result, with a correct classification rate of 95.2% using the Bayesian network classifier on 30 environmental sound classes. Since the preliminary study reported in this work is promising, it will be interesting to extend our study to a larger class of audio signals.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

We proposed and applied a novel Code Excited Linear Prediction (CELP) feature set to the engine fault diagnosis system in Chapter 4 and to environmental sound recognition in Chapter 3. Furthermore, we proposed a modified HHT to better address fundamental frequency estimation for music in Chapter 5, which can dramatically speed up the traditional MP process in frequency estimation.
By using HHT-based MP features, we not only obtain another useful audio feature set but also preserve the ability of HHT to deal with nonlinear and nonstationary data. Finally, for the ESR application with a large number of classes, we proposed the content-adaptive feature selection method and the context-based feature selection method to decrease the number of features used and adaptively select a good subset of features.

Although CELP was initially developed for speech coding, it actually offers an excellent low-bit-rate codec for any audio signal, including music and environmental sounds. Our research is motivated by the observation that an audio signal can be well preserved by its highly compressed CELP bit streams. It is therefore intuitive to extract features from the CELP bit streams for audio recognition purposes. There are several advantages to the CELP-based features. First, they provide a concise yet accurate representation of the underlying audio signals. Second, their computational complexity is low, which allows real-time feature extraction. Third, CELP-based features are more robust with respect to background noise. Fourth, all modern telecommunication systems adopt the CELP technique or its variations as the speech codec due to its high compression ratio, excellent coded audio quality and low complexity. All mobile telecommunication devices are equipped with a CELP-based codec. As a result, recognition based on CELP features is desirable, since the additional effort required by feature extraction is almost negligible.

7.2 Future Work

Although we obtained promising results in our preliminary study, we would like to extend our current work to make the treatment more complete in several areas, as described below.

1. Robust engine fault detection and diagnosis
• Sequential decision making and processing
The amount of jet engine data is huge, and the computational complexity of straightforward processing is high. It is also not practical to make a decision only after all measured data are collected. It is desirable to adopt a sequential decision making approach that makes decisions based on the received data and offers probability estimates of different modes as a function of time.
• Advanced machine learning techniques
We would like to adopt more advanced machine learning techniques that can make robust decisions based on a smaller sampled data set. Furthermore, the system can learn using unlabeled data. As a result, the system can make decisions based on fewer labeled samples.
• Decision fusion
We will consider decision fusion from multiple sensors and across different stages. Although using only the most reliable sensors/stages is easy, we may throw away useful and relevant data. We may get more accurate results through data filtering and decision fusion.
• Data pre-processing via down-sampling
To reduce the computational complexity, we may perform data pre-processing via down-sampling. The detection accuracy may decrease with the down-sampling rate; that is, the detection rate may deteriorate if the down-sampling rate is high. We will test this idea and analyze the relationship between the down-sampling factor and the detection performance.

2. Environmental sound recognition
• Search for more features
The CELP-based features include both the LPC and the pitch information. There is still other data and/or intermediate information in the bit stream that can provide good features.
We will continue to search for simple yet effective features to enhance the performance on the environmental sound recognition problem.
• Speech and speaker recognition
We expect that the CELP-based features will work well for the speech and speaker recognition problems as well. We will conduct extensive experiments and compare the performance of the CELP-based features and the traditional MFCC features.

3. Fundamental frequency estimation for music signals
• Robust and fast matching pursuit
Although the MP decomposition provides a nice way to analyze nonlinear and nonstationary data, its computational complexity is extremely high. We expect HHT to provide fast and robust frequency estimation for this type of data. This will be investigated in our future research. Besides, as an application, it may improve the frequency and scale estimation in the Matching Pursuit decomposition. If the complexity of the MP decomposition can be dramatically decreased, the MP features will be very useful in many signal analysis problems.
• Polyphonic extension
HHT has a good capability to sift the signal into several IMFs, which could be used to analyze chords. We would like to examine the sifting result in the polyphonic case and extend our current work to polyphonic signals such as real-world music signals.

References

[BHHS05] Mingsian Bai, Jiamin Huang, Minghong Hong, and Fucheng Su. Fault diagnosis of rotating machinery using an intelligent order tracking system. Journal of Sound and Vibration, 280(3-5):699-718, 2005.

[CL11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines, May 2011.

[CNK09] Selina Chu, Shrikanth Narayanan, and C.-C. Jay Kuo. Environmental sound recognition with time-frequency audio features. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1142-1158, August 2009.

[Con06] Arshia Cont. Realtime multiple pitch observation using sparse non-negative constraints. In ISMIR, pages 206-211, 2006.

[CSP98] B. Clarkson, N. Sawhney, and A. Pentland. Auditory context awareness via wearable computing. In Proc. Workshop on Perceptual User Interfaces, 1998.

[CT81] G. K. Chaturvedi and D. W. Thomas. Bearing fault detection using adaptive noise canceling. ASME Paper, 1981.

[CT82] G. K. Chaturvedi and D. W. Thomas. Bearing fault detection using adaptive noise cancelling. ASME Transactions and Journal of Mechanical Design, 104(1):280-289, April 1982.

[dCK02] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917-1930, 2002.

[DK05] R. Deering and J. F. Kaiser. The use of a masking signal to improve empirical mode decomposition. In IEEE ICASSP, volume 4, pages iv/485-iv/488, March 2005.

[DR78] D. Dyer and R. M. Stewart. Detection of rolling element bearing damage by statistical vibration analysis. Trans. ASME, 1978.

[DS77] D. Dyer and R. M. Stewart. Detection of rolling element bearing damage by statistical vibration analysis. In American Society of Mechanical Engineers, Design Engineering Technical Conference, volume 3, pages 26-30, September 1977.

[EDB07] V. Emiya, B. David, and R. Badeau. A parametric method for pitch estimation of piano tones. In IEEE ICASSP, volume 1, pages I-249-I-252, April 2007.

[EPT+06] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi. Audio-based context recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):321-329, January 2006.

[HFH+09] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: an update. SIGKDD Explorations Newsletter, 11(1):10-18, November 2009.

[HSL+98] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. C. Yen, C. C. Tung, and H. H. Liu. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 454(1971):903-995, March 1998.

[HWLD05] Weiping Hu, Xiuxin Wang, Yaling Liang, and Minghui Du. A novel pitch period detection algorithm based on HHT with application to normal and pathological voice. January 2005.

[Kla08] A. Klapuri. Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):255-266, February 2008.

[LCTH00] Bo Li, Mo-Yuen Chow, Yodyium Tipsuwan, and James C. Hung. Neural-network-based motor rolling bearing fault diagnosis. IEEE Transactions on Industrial Electronics, 47(5):1060-1069, 2000.

[LQ00] Jing Lin and Liangsheng Qu. Feature extraction based on Morlet wavelet and its application for mechanical fault diagnosis. Journal of Sound and Vibration, 234(1):135-148, June 2000.

[MS84] P. D. McFadden and J. D. Smith. Vibration monitoring of rolling element bearings by the high-frequency resonance technique - a review. Tribology International, 17(1):3-10, 1984.

[MZ93] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 1993.

[NHYT07] G. Niu, T. Han, B.-S. Yang, and A. C. C. Tan. Multi-agent decision fusion for motor fault diagnosis. Mechanical Systems and Signal Processing, 21(3):1285-1299, April 2007.

[PTC05] Z. K. Peng, P. W. Tse, and F. L. Chu. An improved Hilbert-Huang transform and its application in vibration signal analysis. Journal of Sound and Vibration, 286(1):187-205, August 2005.

[PTK+02] Vesa Peltonen, Juha Tuomi, Anssi Klapuri, Jyri Huopaniemi, and Timo Sorsa. Computational auditory scene recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages II-1941-II-1944, May 2002.

[RFG03] G. Rilling, P. Flandrin, and P. Gonçalvès. On empirical mode decomposition and its algorithms. 2003.

[RTC11] C. R. Ratto, P. A. Torrione, and L. M. Collins. Exploiting ground-penetrating radar phenomenology in a context-dependent framework for landmine detection and discrimination. IEEE Transactions on Geoscience and Remote Sensing, 49(5):1689-1700, May 2011.

[SA85] M. Schroeder and B. Atal. Code-excited linear prediction (CELP): High-quality speech at very low bit rates. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), volume 10, pages 937-940, April 1985.

[SCL07] Weixiang Sun, Jin Chen, and Jiaqing Li. Decision tree and PCA-based fault diagnosis of rotating machinery. Mechanical Systems and Signal Processing, 21(3):1300-1317, April 2007.

[TA99] N. Tandon and A. Choudhury. A review of vibration and acoustic measurement methods for the detection of defects in rolling element bearings. Tribology International, 32(8):469-480, 1999.

[TYT04] P. W. Tse, W. X. Yang, and H. Y. Tam. Machine fault diagnosis through an effective exact wavelet analysis. Journal of Sound and Vibration, 277(4):1005-1024, November 2004.

[WBE06] D. Wang and G. J. Brown, Eds. Computational Auditory Scene Analysis: Principles, Algorithms and Applications. New York: Wiley-IEEE Press, 2006.

[ZK01] T. Zhang and C.-C. Jay Kuo. Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 9(4):441-457, May 2001.

[ZYL+09] Z. K. Zhu, Ruqiang Yan, Liheng Luo, Z. H. Feng, and F. R. Kong. Detection of signal transients based on wavelet and statistics for machine fault diagnosis. Mechanical Systems and Signal Processing, 23(4):1076-1097, May 2009.
Abstract
An adequate feature set plays a key role in many signal classification and recognition applications. This is a challenging problem due to the nonlinear and nonstationary characteristics of real-world signals, such as engine acoustic/vibration data, environmental sounds, speech signals and musical instrument sounds. Some traditional features, such as the Mel Frequency Cepstral Coefficients (MFCC), may not offer good performance. Other features, such as those based on the Matching Pursuit (MP) decomposition, may perform better, yet their complexity is very high. In this research, we consider a new feature set that can be easily generated in the model-based signal compression process, known as the Code Excited Linear Prediction (CELP) features. The CELP-based coding algorithm and its variants have been widely used to encode speech and low-bit-rate audio signals. In this research, we examine two applications based on CELP-based features.

First, we present a new approach to engine fault detection and diagnosis based on acoustic and vibration sensor data with MFCC and CELP features. Through proper algorithmic adaptation to the specifics of the dataset, the fault conditions of a damaged blade and a bearing failure can, with high probability, be autonomously discovered and identified. The conducted experiments show that CELP features, although generally used in speech applications, are particularly well suited to this problem, in terms of both compactness and detection specificity. Furthermore, the issue of automatic fault detection with different levels of decision resolution is addressed. The low prediction error, coupled with ease of hardware implementation, makes the proposed method an attractive alternative to manual maintenance.

Next, we propose the use of CELP-based features to enhance the performance of the environmental sound recognition (ESR) problem. Traditionally, MFCC features have been used for the recognition of structured data like speech and music. However, their performance on the ESR problem is limited. An audio signal can be well preserved by its highly compressed CELP bit streams, which motivates us to study the CELP-based features for the audio scene recognition problem. We present a way to extract a set of features from the CELP bit streams and compare the performance of ESR using different feature sets with the Bayesian network classifier. It is shown by experimental results that the CELP-based features outperform the MFCC features on the ESR problem by a significant margin, and the integrated MFCC and CELP-based feature set can even reach a correct classification rate of 95.2% using the Bayesian network classifier.

CELP-based features may not be suitable for wideband audio signals such as music signals. To address this problem, we would like to add other new features. One idea is to perform real-time fundamental frequency estimation using a modified Hilbert-Huang transform (HHT), as studied in the last part of this work. HHT is a non-linear transform which is suitable for the analysis of non-stationary AM/FM-like data. However, applying HHT directly to music signals encounters several problems. In this research, we modify HHT so that it is tailored to short-window pitch analysis. It is shown by experimental results that the proposed HHT method performs significantly better than several benchmark schemes.

Finally, for the ESR application with a large number of classes, more features are needed in order to maintain the classification performance. On the other hand, more data are desired to avoid the over-fitting problem. These become contradictory requirements. We propose two methods to resolve the contradiction: the content-adaptive feature selection method and the context-based feature selection method. The content-adaptive feature selection method selects different features for different testing samples according to their statistics. The context-based feature selection method reduces the load on the classifier by adding a context stage as a pre-processing layer. As a result, we can dramatically decrease the number of features used and adaptively select a good subset of features.