RECOGNITION AND CHARACTERIZATION OF UNSTRUCTURED ENVIRONMENTAL SOUNDS

by

Selina Chu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2011

Copyright 2011 Selina Chu

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
  1.3 Contributions of the Research
  1.4 Organization of the Proposal
Chapter 2: Review of Research Background
  2.1 Audio Features for Recognition and Classification
Chapter 3: A Study on Features and Classification Methods for Environmental Audio
  3.1 Introduction
  3.2 Common Audio Features
  3.3 Environmental Data Acquisition
  3.4 Classification Methods
  3.5 Experimental Results and Discussion
  3.6 Conclusion
Chapter 4: MP-Features: Feature Extraction with Matching Pursuit
  4.1 Introduction
  4.2 Signal Representation with Matching Pursuit (MP)
  4.3 Feature Extraction with Matching Pursuit (MP)
    4.3.1 Extracting MP Features
  4.4 MP Dictionary Selection
  4.5 Computational Cost of MP Features
  4.6 Experimental Evaluation
    4.6.1 Experimental Setup
    4.6.2 Experimental Results
  4.7 Confusion Matrix and Pairwise Classification
  4.8 Comparison of Time-Domain Features
  4.9 Listening Tests
    4.9.1 Test Setup and Procedure
    4.9.2 Test Results
  4.10 Conclusion
Chapter 5: A Semi-Supervised Learning Approach to Online Audio Background Detection
  5.1 Introduction
  5.2 Semi-Supervised Learning with Audio
  5.3 Online Adaptive Background Detection
  5.4 Data and Experiments
  5.5 Results and Discussion
  5.6 Conclusion
Chapter 6: Environmental Sound Recognition with Composite Deep Belief Network
  6.1 Introduction
  6.2 Deep Belief Networks (DBNs)
    6.2.1 Restricted Boltzmann Machines
    6.2.2 Deep Network Training
  6.3 DBN for Environmental Sound
    6.3.1 Experimental Setup
    6.3.2 Classification
    6.3.3 Unlabeled data for pre-training
  6.4 Composite DBNs
    6.4.1 Initial Evaluation
    6.4.2 Towards Automatic Composite-DBN
      6.4.2.1 Stability of High-Level Groupings
  6.5 Conclusions and Future Work
  6.6 Applications
Chapter 7: Conclusion and Future Work
  7.1 Summary of the Research
  7.2 Future Research Direction
Chapter 8: Related Publications
  8.1 Book Chapter
  8.2 Journals
  8.3 Conferences
  8.4 Workshops
Bibliography

List of Tables

3.1 List of features used in classification
3.2 Summary of classification accuracy
3.3 Confusion matrix of KNN classification using forward feature selection with 16 features, in percentage
4.1 Recognition accuracy using GMM with a varying number of mixtures, using MFCC and MP features
4.2 Confusion matrix for 14-class classification using MP features and MFCC with GMM
4.3 Recognition accuracy for pairwise classification using GMM
4.4 Comparison of recognition accuracy between MFCC, MP, and MFCC+MP features for pairwise classification of five-class examples; for each pair of classes, the three recognition accuracy values correspond to (left) MFCC, (middle) MP, and (right) MFCC+MP features; all values in percentages
4.5 Comparison of recognition accuracy between MFCC and MFCC concatenated with individual MP features for pairwise and overall classification of the five-class examples using GMM, in percentage
4.6 Recognition performance from the listening test
5.1 Classification results using self-training (in %)
5.2 Background detection accuracy (in %)
6.1 Classification accuracies comparing DBN and GMM (in %)

List of Figures

3.1 Classification with forward feature selection using KNN, GMM, and SVM, respectively
4.1 Illustration of the decomposition of signals from 6 different classes as listed, where the top-most signal is the original, followed by the first five basis vectors
4.2 Examples of reconstruction using MP with the Gabor dictionary by varying the number of atoms (basis vectors)
4.3 Comparison of classification rates (with the GMM classifier) using the first n atoms, where n = 1, ..., 10, as features while the MFCC features are kept the same
4.4 (a) Decomposition of signals using MP (the first five basis vectors) with dictionaries of Fourier (left), Haar (middle), and Gabor (right), and (b) approximation (reconstruction) using the first ten coefficients from MP with dictionaries of Gabor (top), Haar (middle), and Fourier (bottom)
4.5 Overall recognition rate (GMM) comparing 14 classes using MFCC only, MP only, and MP+MFCC as features (0% recognition for four classes using MFCC only: Casino, Nature-nighttime, Train passing, and Street with ambulance)
4.6 Overall recognition accuracy using kNN with varying number of K
4.7 Overall recognition accuracy comparing MP, MFCC, and other commonly used features for 14 classes of sounds using kNN and GMM as classifiers
4.8 Sample of the short-time energy function from each of the five example classes: a) Nature-nighttime, b) Nature-daytime, c) Playground, d) Raining, e) River/stream
4.9 Temporal features: a) the energy range and b) the zero-crossing rate (figures a and b share the same legend)
4.10 MP features (i.e., the mean value of the corresponding parameters) in feature space
4.11 Individual MP feature descriptor values: mean-F (top left), std-F (top right), mean-S (bottom left), std-S (bottom right)
4.12 Recognition accuracy of 14 classes from the listening test
4.13 User confidence in the listening test
5.1 Decomposition of signals from 6 environments, where the top-most signal is the original, followed by the first five basis vectors, demonstrating different underlying structures for various environments, where an MP-based algorithm picks up these top basis vectors and represents them uniquely
5.2 Comparison of classification results obtained by online background modeling with combination models (CM) and persistency-only models (PSM)
6.1 2-dimensional plot of the GMMs for three classes (a) using restricted / less noisy training data (left), and (b) using training data from (a) and unrestricted / noisy data (right)
6.2 Confusion matrix for GMM using Sets A and B (in percentage of misclassification)
6.3 Confusion matrix for classification with GMM using (a) Set A (left) and (b) Sets A and B (right), in percentage (less than 5% is ignored)
6.4 Confusion matrix for classification with DBN using (a) Set A (left) and (b) Sets A and B (right), in percentage (less than 5% is ignored)
6.5 Effect of the amount of unlabeled data in pre-training on classification
6.6 Effect of the amount of pre-training using unlabeled data on classification
6.7 Average misclassification errors between the different environments
6.8 Manually constructed hierarchical configuration for different environments based on Fig. 6.7
6.9 Dendrogram using hierarchical clustering with cosine similarity measure, based on activations of a trained 12-class DBN (distance along the y-axis depicts the measure of similarity)
6.10 Automatic construction of the hierarchical configuration for different environments
6.11 Dendrogram using hierarchical clustering with cosine similarity measure using MP-features and MFCC directly (without using DBN)
6.12 Evolution of the dendrogram using hierarchical clustering with cosine similarity measure as the number of target classes increases (the dotted box depicts the new target class being added)
6.13 Evolution of the dendrogram using hierarchical clustering with cosine similarity measure as the number of target classes increases (the dotted box depicts the new target class being added); target classes are added in the reverse order from Fig. 6.12
6.14 Evolution of the dendrogram using hierarchical clustering with cosine similarity measure as the number of target classes increases (the dotted box depicts the new target class being added); weights between the third hidden layer and the output layer were not saved; similar ordering as Fig. 6.12

Abstract

Environmental sounds are what we hear every day or, more generally, the sounds that surround us: ambient or background audio. Humans utilize both vision and hearing to respond to their surroundings, a capability still quite limited in machine processing. The first step toward achieving multi-modality is the ability to process unstructured audio and recognize audio scenes (or environments). The goal of my thesis is the characterization of unstructured environmental sounds for understanding and predicting the context surrounding an agent or device, investigating the development of appropriate feature extraction algorithms and learning techniques for modeling the variations of the environment. Such an ability would have applications in content analysis and mining of multimedia data, and in improving robustness in context-aware applications through multi-modality, such as assistive robotics, surveillance, or mobile device-based services.

The goal of this thesis is the characterization of unstructured environmental sounds for understanding and predicting the context surrounding an agent or device. Most research on audio recognition has focused primarily on speech and music; less attention has been paid to the challenges and opportunities of using audio to characterize unstructured environments. My research focuses on investigating the challenging issues in characterizing unstructured environmental audio and on developing novel algorithms for modeling the variations of the environment.

The first step in building a recognition system for unstructured auditory environments was to investigate techniques and audio features for working with such audio data. We begin by performing a study that explores suitable features and the feasibility of designing an automatic environment recognition system using audio information.
In this initial investigation, I have found that traditional recognition and feature extraction for audio were not suitable for environmental sound, as they lack any type of structures, unlike those of speech and music which contain forman- tic and harmonic structures, thus dispelling the notion that traditional speech and music recognition techniques can simply be used for realistic environmental sound. Natural unstructured environment sounds contain a large variety of sounds, which are in fact noise-like and thus are not eectively modeled by Mel-frequency cepstral coecients (MFCCs) or other commonly-used audio features, e.g. energy, zero-crossing, etc. To achieve a more eective representation, I proposed a special- ized feature extraction method for environmental sounds that utilizes the matching pursuit (MP) algorithm to learn the inherent structure of each type of sounds, which we called MP-features. MP-features have shown to classify sounds where the frequency domain features (e.g., MFCCs) fail and can be advantageous when combining with MFCCs to improve the overall performance. The third component leads to our investigation on modeling and detecting the background audio. One of the goals of this research is to characterize an environ- ment. Since many events would blend into the background, I wanted to look for a way to achieve a general model for any particular environment. Once we have an idea of the background, it will enable us to identify foreground events even if we havent seen these events before. Therefore, the next section proposes a framework for robust audio background modeling, which includes prediction, data knowledge and persistent characteristics of the environment. This approach has the ability to model the background and detect foreground events as well as the ability to verify whether the predicted background is indeed the background or a foreground event that protracts for a longer period of time. I also investigated the use of a semi-supervised learning technique to exploit unlabeled audio data. x The nal components of my thesis will involve investigating on the use of deep learning as a way to obtain a generative model-based method for classication and to learn features within each type of sounds in an unsupervised manner. The inher- ent nature of environmental sound is noisy and contains relatively large amounts of overlapping events between dierent environments. Environmental sounds con- tain large variances even within a single environment type, and frequently, there are no divisible or clear boundaries between some types. Traditional methods of classication are generally not robust enough to handle classes with overlaps. This audio, hence, requires representation by complex models. Using deep learning ar- chitecture provides a way to obtain a generative model-based method for classi- cation. Specically, I considered the use of Deep Belief Networks (DBNs) to model environmental audio and investigate its applicability with noisy data to improve robustness and generalization. A framework was proposed using composite-DBNs to discover high-level representations by unsupervised learning of features charac- terizing the dierent acoustic environments and providing a hierarchical structure of sound types in a data-driven fashion. Experimental results on real data sets demonstrate its eectiveness over traditional methods with over 90% accuracy on recognition for a high number of environmental sound types. 
xi Chapter 1 Introduction 1.1 Signicance of the Research Unstructured audio is an important aspect in building systems that are capable of understanding their surrounding environment through the use of both audio and other modalities of information, i.e. visual, sonar, global positioning, etc. Consider, for example, applications in robotic navigation, assistive robotics, and other mobile device-based services, where context aware processing is often desired. Human beings utilize both vision and hearing to navigate and respond to their surroundings, a capability still quite limited in machine processing. The rst step toward achieving recognition of multi-modality is the ability to process unstructured audio and recognize audio scenes (or environments). By audio scenes, we refer to a location with dierent acoustic characteristics such as a coee shop, park, or quiet hallway. Dierences in acoustic characteristics could be caused by the physical environment or activities of humans and nature. To enhance a systems context awareness, we need to incorporate and adequately utilize such audio information. A stream of audio data contains a signicant wealth of information, enabling the system to capture a semantically richer environment. Moreover, to capture a more complete description of a scene, the fusion of audio 1 and other sensory information can be advantageous, say, for disambiguation of en- vironment and object types. To use any of these capabilities, we have to determine the current ambient context rst. Thus, the determination of the ambient context using audio is a main goal of this research. Most research in environmental sounds has centered mostly on recognition of specic events or sounds. To date, only a few systems have been proposed to model raw environment audio without pre-extracting specic events or sounds. In this work, our focus is not in the analysis and recognition of discrete sound events, but rather on characterizing the general unstructured acoustic environment as a whole. In general, most research on audio recognition has focused primarily on speech and music. Less attention has been paid to the challenges and opportunities for us- ing audio to characterize unstructured audio. In my initial investigation to explore the feasibility of designing an automatic environment recognition system using au- dio information, I have found that traditional recognition and feature extraction for audio were not suitable for environmental sound, as they lack any type of structures, unlike those of speech and music which contain formantic and harmonic structures, thus dispelling the notion that traditional speech and music recognition techniques can simply be used for realistic environmental sound. Unstructured environment characterization is still in its infancy. Current al- gorithms still have diculty in handling such situations, and a number of issues and challenges remain. We brie y describe some of the issues that we think make learning in unstructured audio particularly challenging: One of the main issues arises from the lack of proper audio features for en- vironmental sounds. Audio signals have been traditionally characterized by Mel-frequency cepstral coecients (MFCCs) or some other time-frequency representations such as the short-time Fourier transform and the wavelet transform, etc. We found from our study that traditional features do not 2 perform well with environmental sounds. 
MFCCs have been shown to work relatively well for structured sounds such as speech and music sounds, but their performance degrades in the presence of noise. Environmental sounds, for example, contain a large variety of sounds, which may include compo- nents with strong temporal domain signatures, such as chirpings of insects and sounds of rain. These sounds are in fact noise-like with a broad spectrum and are not eectively modeled by MFCCs. Modeling the background audio of complex environments is a challenging problem as the audio, in most cases, are constantly changing. Therefore one of the main question is what is considered the background and how do we model it. We can dene the background in an ambient auditory scene as something recurring, and noise-like, which is made up of various sound sources, but changing over time, i.e., trac and passers-by on a street. In contrast, the foreground can be viewed as something unanticipated or as a deviation from the background model, i.e., passing ambulance with siren. The problem arises when identifying foreground existence in the presence of background noise, given the background also changes with a varying rate, depending on dierent environments. If we create xed models with too much prior knowledge, these models could be too specic and might not do well with new sounds. There are many varieties of events occurring in a specic environment, de- pending on a number of diverse factors, i.e. location, time and recording source, etc. For example a restaurant environment in dierent locations would be slightly dierent in content. However, it is relatively easy for humans to identify something as a restaurant environment even when presented with audio recordings of dierent restaurants. We wish be able to nd the com- monality between dierent scenarios of the same type of environment and be able to recognize them automatically. If we can nd a systematic way to 3 break down environmental sounds, it would increase the eciency of identi- fying them. Coming up with a taxonomy of sound structures from learning a hierarchy of sound types might improve and clarify problems caused by the confusion of an acoustic environment with similar characteristics. The use of suitable hierarchies also allow us to assign confusing samples to a more general class. Using a structured classication technique (e.g. hierarchical sound structures), it would alleviate much of the confusion when trying to recognize large varieties of environmental sounds. 1.2 Review of Previous Work Unstructured Environmental Audio Research in general audio environment recognition has received some interest in the last few years (Ellis, 1996; Eronen et al., 2006; Malkin & Waibel, 2005; Peltonen, 2001; Aucouturier et al., 2007), but the activity is considerably less compared to that for speech or music. Some areas of non-speech sound recognition that have been studied to various degrees are those pertaining to recognition of specic events using audio from carefully produced movies or television tracks (Cai et al., 2006; Radhakrishnan et al., 2005). Others include the discrimination between musical instruments (Cano et al., 2004; Herrera et al., 2002), musical genres (Tzanetakis & Cook, 2002), and between variations of speech, nonspeech and music (Carey et al., 1999; El-Maleh et al., 2000; Zhang & Kuo, 2001). As compared to other areas in audio such as speech or music, research on gen- eral unstructured audio-based scene recognition has received little attention. 
To the best of our knowledge, only a few systems (and frameworks) have been proposed to investigate environmental classication with raw audio. Sound-based situation analysis has been studied in (Eronen et al., 2006; Peltonen, 2001) and in (Clark- son et al., 1998; Ellis & Lee, 2004), for wearables and context-aware applications 4 . Because of randomness, high variance, and other diculties in working with en- vironmental sounds, the recognition rates fall rapidly with increasing number of classes; representative results show recognition accuracy limited to around 92% for 5 classes (Chu et al., 2006), 77% for 11 classes (Malkin & Waibel, 2005) and approximately 60% for 13 or more classes (Eronen et al., 2006; Peltonen, 2001). The analysis of sound environments in Peltonen's thesis (Peltonen, 2001), which is closest to our work, presented two classication schemes. The rst scheme was based on averaging the band-energy ratio as features and classifying them using a K-nearest neighborhood (kNN) classier. The second uses MFCCs as features and a Gaussian Mixture Model (GMM) classier. Peltonen noticed the shortcomings of MFCCs for environmental sounds and proposed using the band-energy ratio as a way to represent sounds occurring in dierent frequency ranges. Both of these experiments involved classifying 13 dierent contexts or classes. The classiers and types of features compared were similar to our experiments, but the actual type of classes were dierent. Similar to their work, we also compared a variety of dierent class types. In a subsequent paper by Eronen et al. (Eronen et al., 2006), they extended the investigation to audio-based context recognition by proposing a system that classies 24 individual contexts. They subdivided 24 contexts into 6 higher-level categories, with each category consisting of 4-6 contexts. Peltonen et al. also performed a listening test and reported the ndings in (Peltonen, 2001). Subjects were presented with 34 samples, each one minute in duration, for the rst experiment and 20 samples, of three minutes each, in the second experiment. The tests were mostly conducted in a specialized listening room. Their listening experiment setup is dierent than the one presented in this work, most notably in how the data were presented to the subjects. The samples used in our study are the same 4 second segments as used in our automatic classication system. The work by Aucouturier et al. (Aucouturier et al., 2007) also investigated on environmental type of sounds. Their focus is mainly to study the dierences 5 between urban environments, or as the authors refer to as urban soundscapes, and polyphonic music. In their system, they propose to model the distribution of MFCCs using 50-component GMMs and to use Monte Carlo approximation of the Kullback-Leibler distance to determine the similarities between urban and musical sounds. They studied the temporal and statistical homogeneity of each of these classes and demonstrated dierences in the temporal and statistical structure for soundscapes and polyphonic music signals. However, instead of dening 4 general classes of urban sounds, (viz., avenue, calm neighborhood, street markets, and parks.), they consider each location as a single class. For example, a specic street (or location) would be considered a class of its own. In contrast, our approach is to consider dierent streets (or dierent locations of similar environment) to be of the same class and propose features that furthers generalization. 
Matching Pursuit The MP algorithm has been used in a variety of applications, such as video cod- ing (Ne & Zakhor, 1997) and music note detection (Gribonval & Bacry, 2003). MP has also been used in music genre classication (Umapathy et al., 2005) and clas- sication of acoustic emissions from a monitoring system (Ebenezer et al., 2004). In our proposed technique, MP is used for feature extraction in the context of en- vironmental sound (Chu et al., 2008). We investigate a variety of audio features and provide an empirical evaluation on fourteen dierent environment types. It is shown that the most commonly-used features do not always work well with environ- mental sounds while the MP-based features can be used to supplement traditional frequency domain features (MFCC) to yield higher automatic recognition accuracy for environmental sounds. There has also been some prior work on using matching pursuit for analyzing audio for classication but quite limited. The proposed approach by Ebenezer et al. (Ebenezer et al., 2004) demonstrated the use of MP for signal classication. Their framework classied acoustic emissions using a modied MP algorithm in an 6 actual acoustic monitoring system. The classier was based on a modied version of the MP decomposition algorithm. For each class, appropriate learning signals were selected, the time- and frequency-shifting of these signals forms their dictionary. After the MP algorithm terminates, the net contribution of correlation coecients from each class is used as the decision statistic, where the one that produces the largest value is the chosen class. They demonstrated an overall classication rate of 83.0% for the 12-class classication case. However, the type and nature of the sound classes are unclear since the test data were proprietary (the classes were only identied by their numbers in the report). Another system using MP was presented by Umapathy et al. (Umapathy et al., 2005). In this work, they proposed a technique that uses an adaptive time-frequency transform algorithm, which is based on MP with Gaussian functions. Their work is most similar to our proposed technique in utilizng the parameters of their signal decomposition to obtain features for classication. However, their parameters to the decompositions were conducted with octave scaling and was used to generate a set of 42 features over three frequency bands. These features were then analyzed for the classication of 6-class music genres using the linear discriminant analysis (LDA) and were able to achieve an overall correct classication rate of 97.6%. Background modeling Various approaches have been proposed to model the background. One such example are models that do not consider prior knowledge, such as unsupervised techniques, typically use simple methods of thresholding (H arm a et al., 2005). However, this would led to the problem of threshold determination. This is im- practical in unstructured environments since there are no clear boundaries between dierent types of sounds. Other systems employ learning techniques, such as in (Ellis, 2001), build models explicitly for specic audio events, making it in exible to new events. 7 The state-of-the-art approaches in background modeling (Moncrie et al., 2007) do not make any assumptions about the prior knowledge of the location and operate with ephemeral memory of the data. 
Their proposed method models the persistency of features by dening the background as changing slowly over time and assuming foreground events to be short and terse, e.g., breaking glass. The problem arises when the foreground is also gradual and longer lasting, e.g., plane passing overhead. In this case, it would adapt the foreground sound as background, since there is no knowledge of the background or foreground. It would be dicult to verify whether the background model is indeed correct or models some persistent foreground sound as well. Classication for Unstructured Audio Traditional methods, like Gaussian mixture models (GMM) (Bishop, 2003) and Hidden Markov Models (HMM) (Ra- biner & Juang, 1993) are commonly used classiers for audio classication. HMMs have been extensively used in speech. It also works well with sounds that change in duration because the durational change can occur at a single state or across many states. Since environmental sounds or general ambient sounds lack such temporal structure or phonetic structure that speech, there is no set alphabet that allows for slices of non-speech sound to be divided into, making HMM-based methods dicult to implement. For explicit sound events, such as gunshot and alarms, this technique could work well because each sound events contain similar temporal structure and each state can be use to model the dierent stationary parts of the signal. Even if we could build models for the many dierent possible events that might occur in an environment, this requires the knowledge of specic events for each environment and being able to handle the problem when many dierent events are occurring simultaneously. This would be challenging for specic sound events model to recognize. Therefore, GMMs are typically used for modeling the entire scene or environment without requiring the knowledge of specic sound events that might exist. They are frequently used to train the models of the characterized 8 clusters (e.g., speech (Hinton, 1995), music (Tzanetakis & Cook, 2002), and audio background (Chu et al., 2009b)). Overall, GMM generalizes better but loses the temporal relationship that one nds with HMMs. A way to ameliorate this prob- lem is by incorporating the temporal structure into the features as in (Chu et al., 2009a). Even with specialized features, such as those described in (Chu et al., 2009a), the performance of traditional audio classiers, like GMM, still deteriorate dramatically when using realistic environmental sound that might be noisy data or have overlapping classes. In recent years, there has been a large interest in deep learning and using neural network with recent introduction of a fast greedy layer-wise unsupervised learning algorithm by Hinton et al (Hinton et al., 2006). Approaches have also been proposed to learn simple features in the lower levels and more complex features in the higher levels (Hinton et al., 2006; Ranzato et al., 2006; Bengio et al., 2007). The idea is to learn some abstract representations of the input data in a layer-wise fashion using unsupervised learning, which then can be used as input for supervised learning in tasks such as classication and regression. Recently deep belief networks (DBNs) have been applied to music audio (Hamel et al., 2009) and to learn features for speech recognition (Lee et al., 2009). Ballan et al. 
(Ballan et al., 2009) applied DBN for audio concept recognition and compares with SVM classiers and shows the results for DBN to be comparable, but have not been tested with environmental sound. The Deep Belief Network (DBN) is a neural network constructed from many layers of Restricted Boltzmann Machines (RBMs) (Hinton et al., 2006). Previously, traditional neural network were trained using gradient descent (Bengio et al., 2007). Using gradient descent, however, makes neural networks dicult or impossible to train. Hinton proposes a greedy layer-wise unsupervised pre-training phase, which in (Bengio et al., 2007; Hinton et al., 2006) shows that this unsupervised pre- training builds a representation from which made it possible to perform supervised 9 learning by ne-tuning the resulting weights using gradient descent learning (tra- ditional neural network learning) . In other words, the unsupervised stage sets the weights of the network to be closer to a good solution than random initial- ization, thus avoiding local minima that made occur when using supervised gradi- ent descent. This training strategy has been subsequently analyzed by Bengio et al.(Bengio et al., 2007) who concluded that it is an important ingredient in eective optimization and training of deep networks. 1.3 Contributions of the Research Several contributions are made in this research. They are summarized below. Study on Classifying Environmental Sounds We investigated on techniques for developing a scene classication system using audio features. We performed the study by rst collecting real world audio with a robot and then building a classier to discriminate dierent en- vironments, which allows us to explore and investigate on suitable features and the feasibility of designing an automatic environment recognition system using audio information. The classication system was successful in classify- ing the 5 classes of environment using real data obtained from a tele-operated mobile robot. We also found that using high number of features is not al- ways benecial to classication. In using forward feature selection, a form of greedy search, only nine of the thirty-four features were required to achieve a high recognition rate. We have also identied features that can discriminate between these 5 types of environment. With success in using audio to dis- criminate between dierent unstructured environments, we have shown that it is feasible to build such a system. Novel Feature Extraction Method 10 We propose a novel feature extraction method that utilizes matching pursuit (MP) to select a small set of time-frequency features, which is exible, intu- itive and physically interpretable. MP features can classify sounds where the pure frequency-domain features fail and can be advantageous combining with them to improve the overall performance. This MP-based feature is adopted to supplement the MFCC features to yield higher recognition accuracy for en- vironmental sounds. Extensive experiments were conducted to demonstrate the advantages of MP features as well as joint MFCC and MP features in environmental sound classication. The experimental results show promising performance in classifying fourteen dierent audio environments, and shows comparable performance to human classication results on a similar task. Our work provides competitive performance for multi-audio category environment recognition using a comprehensive feature processing approach. 
Background Modeling We show that we can utilize both labeled and unlabeled data to improve audio classication performance. By incorporating prediction models in the determination process, we can improve the background detection performance even further. Experimental results on real data sets demonstrate the eec- tiveness of our proposed method. In this work, we proposed a framework for audio background modeling, which includes prediction, data knowledge and persistent characteristics of the environment, leading to a more robust audio background detection algorithm. This framework has the ability to model the background and detect foreground events as well as the ability to verify whether the predicted background is indeed the background or a foreground event that protracts for a longer period of time. Experimental re- sults demonstrated promising performance in improving the state-of-the-art in background modeling of audio environments. We also investigated the use of a semi-supervised learning technique to exploit unlabeled audio data. It is 11 encouraging that we could utilize more unlabeled data to improve generaliza- tion as they are usually cheap to acquire but expensive to label. Composite Deep Belief Network We investigate the use of a richer generative model based method for classi- cation and attempt to discover high-level representations for dierent acoustic environments in a data-driven fashion. Specically , we consider Deep Belief Network (DBN) to model environmental audio and investigate its applica- bility with noisy auditory data for robustness and generalization. We also propose a framework for composite DBNs as a way to represent various lev- els of representations. This provides a way to learn features within each type of sound in an unsupervised manner. Experimental results demonstrate promising performance in improving the state of art recognition for audio environments. It is encouraging that we could utilize more unrestrictive data to improve generalization. To the best of our knowledge, our work has shown to provide a signicant improvement on classication over previous proposed methods discriminating this many dierent types of general environmental sounds. 1.4 Organization of the Proposal The rest of this proposal is organized as follows. Chapter 2 provides some back- ground for audio research, including a review of dierent common audio features. Then, a study on features and classication methods for environmental sounds is given in Chapter 3. In Chapter 4, we propose a novel feature extraction method using matching pursuit. Then a framework on audio background modeling using semi-supervised learning is introduced in Chapter 5. Chapter 6 our research on using deep learning on acoustic environments and proposes a composite deep belief network for unsupervised feature learning and constructing a hierarchical sound 12 structure model. Finally concluding remarks and future work items are given in Chapter 7. 13 Chapter 2 Review of Research Background Several major feature extraction techniques for audio signal processing are reviewed here. 2.1 Audio Features for Recognition and Classication One major issue in building an automatic audio recognition system is the choice of proper signal features that are likely to result in eective discrimination between dierent auditory environments. 
Environmental sounds in general are unstructured data comprising contributions from a variety of sources, and unlike music or speech, no assumptions can be made about predictable repetitions or harmonic structure in the signal. Because of this unstructured nature, it is difficult to form a generalization to quantify them. Due to their inherent diversity, there are many features that can be used, or are needed, to describe audio signals. The appropriate choice of these features is crucial in building a robust recognition system. Here, we examine some of the commonly used audio signal features. Broadly, acoustic features can be grouped into two categories: time-domain (or temporal) features and frequency-domain (or spectral) features. A number of these have been proposed in the literature. Two widely used time-domain measures are given below (Zhang & Kuo, 2001).

- Short-time energy:
  $$E_n = \frac{1}{N} \sum_{m} \left[ x(m)\, w(n-m) \right]^2,$$
  where $x(m)$ is the discrete-time audio signal, $n$ is the time index of the short-time energy, and $w(m)$ is the window of length $N$. Short-time energy provides a convenient representation of the amplitude variation over time.

- Short-time average zero-crossing rate (ZCR):
  $$Z_n = \frac{1}{2} \sum_{m} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| \, w(n-m),$$
  where
  $$\operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \geq 0, \\ -1, & x(n) < 0. \end{cases}$$
  Zero-crossings occur when successive samples have different signs, and the ZCR is the average number of times the signal changes its sign within the short-time window.

We calculate both energy and ZCR values using a window of 256 samples with a 50% overlap, at an input sampling rate of 22050 Hz.

Similarly, a variety of spectral features have been proposed. These features are typically obtained by first applying a Fourier transform (implemented as a fast Fourier transform, FFT) to short-time window segments of the audio signal, followed by further processing to derive the features of interest. Some commonly used ones include the following.

- Mel-frequency cepstral coefficients (MFCC) (Rabiner & Juang, 1993): After taking the FFT of each short-time window, the first step in MFCC calculation is to obtain the mel-filter bank outputs by mapping the powers of the spectrum onto the mel scale, using 23 triangular mel-filterbanks, and transforming them onto a logarithmic scale, which emphasizes the slowly varying frequency characteristics of the signal. Typically, 13 mel-frequency cepstral coefficients are then obtained by taking the discrete cosine transform (DCT).

- Band energy ratio (Eronen et al., 2006): the ratio of the energy in a specific frequency band to the total energy. Eight logarithmic sub-bands are used in our experiments.

- Spectral flux (Tzanetakis & Cook, 2002): measures the spectral amplitude difference between two successive frames.

- Statistical moments (Tzanetakis & Cook, 2002; Agostini et al., 2003): the commonly used statistical moments include the following.
  - Spectral centroid measures the brightness of a sound; the higher the centroid, the brighter the sound.
  - Signal bandwidth measures the width of the range of the signal's frequencies.
  - Spectral flatness quantifies the tonal quality, namely how tone-like the sound is as opposed to noise-like.
  - Spectral roll-off quantifies the frequency at which the accumulated magnitude of the frequency response reaches a certain percentage of the total magnitude. A commonly used threshold is 95%.

Another commonly used feature is linear prediction cepstral coefficients (LPCC) (Markel & Gray Jr., 1976).
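Before returning to linear prediction, the two time-domain measures defined above can be made concrete with a short sketch. The code below is an illustrative NumPy implementation rather than the implementation used in this thesis; the frame length of 256 samples with 50% overlap and the 22050 Hz sampling rate follow the values quoted above, while the function names and the synthetic test signal are arbitrary.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (50% overlap by default)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames, window=None):
    """E_n = (1/N) * sum_m [x(m) w(n-m)]^2, evaluated per frame."""
    if window is None:
        window = np.ones(frames.shape[1])          # rectangular window
    return np.mean((frames * window) ** 2, axis=1)

def zero_crossing_rate(frames):
    """Z_n: average number of sign changes within each frame."""
    signs = np.where(frames >= 0, 1.0, -1.0)       # sgn[x(n)] as defined above
    return 0.5 * np.mean(np.abs(np.diff(signs, axis=1)), axis=1)

if __name__ == "__main__":
    sr = 22050
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(sr)   # synthetic test signal
    frames = frame_signal(x, frame_len=256, hop=128)
    print(short_time_energy(frames).shape, zero_crossing_rate(frames).shape)
```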
The basic idea behind linear prediction is that the current sample can be predicted, or approximated, as a linear combination of the previous sam- ples, which would provide a more robust feature against sudden changes. LPCC is calculated using the autocorrelation method in this work (Rabiner & Juang, 1993). 16 Chapter 3 A Study on Features and Classication Methods for Environmental Audio 3.1 Introduction In this chapter, we consider the task of recognizing environmental sounds for the un- derstanding of a scene or context surrounding an audio sensor. A variety of features have been proposed for audio recognition, including the popular Mel-frequency cep- stral coecients (MFCCs) which describe the audio spectral shape. Unlike speech and music, which have formantic structures and harmonic structures, environmental sounds are considered unstructured since they are variably composed from dier- ent sound sources. We consider the task of recognizing environment sounds for the understanding of a scene (or context) surrounding an audio sensor. By auditory scenes, we refer to a location with dierent acoustic characteristics such as a coee shop, park or quiet hallway. Our rst attempt to understand environmental audio is by performing an initial study by rst collecting real world audio with a robot and then building a classier to discriminate dierent environments, which allows us to explore and investigate on suitable features and the feasibility of designing an automatic environment recognition system using audio information. We begin our investigation of recognizing dierent auditory environments with the audio information. In this section, we utilize low-level audio features from a 17 mobile robot and investigate using high-level features based on spectral analysis for scene characterization, and a recognition system was built to discriminate between dierent environments based on these audio features found. Many robotic applications are being utilized for navigation in unstructured en- vironments (Pineau et al., 2003; Thrun et al., 1999). There are other tasks that require knowing the environment. For example, Yanco (Yanco, 1998) introduced a robotic wheelchair system that switches automatically between control modes for indoor and outdoor environments. Also, laser range-nder can track people in an outdoor environment In order to use any of these capabilities, we rst have to de- termine the current context, e.g., the location type (outdoor environment or inside an oce or hallway, etc). Environments are dynamic, and the setting might change even in the same area. With the loss of certain landmarks, a vision-based robot might not be able to recover from its displacement because it is unable to determine the environment that it is in. Knowing the scene provides a coarse and ecient way to prune out irrelevant scenarios. Even with a GPS system and a well-dened map, without clear images, it is dicult to discern dierent characteristics of the environment. It is relatively easy for most people to make sense of what they hear or to discriminate where they are located in the environment on the basis of sound alone. However, this is typically not the case with an autonomous agent or mobile device. Surprisingly little research has been done on audio scene analysis in these areas. With increasing number of systems built for service and social settings, it is ever more important for them to not only to identify locations, but to comprehend and characterize their auditory features. 
3.2 Common Audio Features

One of the major issues in building a recognition system for multimedia data is the choice of proper signal processing features that are likely to result in effective discrimination between different auditory environments. Sounds from a general ambient environment are considered neither speech nor music, but a combination of specific audio signals that are similar to noise. While much work has concentrated on speech and music, little research has been done on the actual analysis of features for the classification of environmental sounds. One of the major goals of this work is to study the effect of various features on the efficiency of an auditory environmental recognition system. There are many features that can be used to describe audio signals. We examined the following features in our experiments: Mel-frequency cepstral coefficients (MFCC), statistical moments from the audio signal's spectrum (i.e., spectral centroid, spectral bandwidth, spectral asymmetry, and spectral flatness), zero-crossing rate, energy range, and frequency roll-off. The term frequency roll-off in this work refers to the frequency at which the accumulated magnitude of the frequency response equals 95% of the total magnitude. Since the energy level varies depending on the location of the source of the sound, we do not use the mean of the energy; instead, we only use the range and the standard deviation. The feature vector contained a total of 34 features, summarized in Table 3.1.

Table 3.1: List of features used in classification

Feature No. | Type of feature
1-12        | 1st-12th MFCCs
13-24       | Standard deviation of 1st-12th MFCCs
25          | Spectral centroid, S_c
26          | Spectral bandwidth, S_w
27          | Spectral asymmetry, S_a
28          | Spectral flatness, S_f
29          | Zero-crossing rate
30          | Standard deviation of zero-crossing rate
31          | Energy range, E_r
32          | Standard deviation of energy range
33          | Frequency roll-off
34          | Standard deviation of roll-off

3.3 Environmental Data Acquisition

We would like to capture actual scenarios of situations in which a robot might find itself, including any environmental sounds along with additional noise generated by the robot. To simplify the problem, we restricted the number of scenes we examined and required that the different types of environmental sound not overlap one another. The locations we considered were recorded within and around a multipurpose engineering building on the USC campus. The locations we focused on include: 1) a cafe area, 2) hallways where research labs are housed, 3) around and inside elevator areas, 4) the lobby area, and 5) along the street on the south side of the building.

We used a Pioneer DX mobile robot from ActivMedia, running playerjoy and playerv (Pla, ). The robot was manually controlled using a laptop computer. To train and test our algorithm, we collected about 3 hours of audio recordings of the five aforementioned types of environmental locations. We used an Edirol USB audio interface, along with a Sennheiser microphone mounted to the chassis of the robot. Several recordings were taken at each location, each about 10-15 minutes long, on multiple days and at various times. This was done to introduce a variety of sounds and to prevent biases in the recordings. The robot was deliberately driven around with its sonar sensors turned on (and sometimes off) to resemble a more realistic situation and to include noise from the onboard motors and sonar. We did not use the laser and camera because they produce little, if any, noticeable sound.
Recordings were manually labeled and assigned to one of the five classes listed previously to aid the experiments described below. Our general observations about the sounds encountered at the different locations are:

- Hallway: mostly quiet, with occasional doors opening/closing, distant sound from the elevators, individuals quietly talking, and some footsteps.
- Cafe: many people talking, ringing of the cash registers, moving of chairs.
- Lobby: footsteps with echoes (different from hallways due to the type of flooring), people talking, sounds of rolling dollies from deliveries being made.
- Elevators: bells and alerts from the elevator, footsteps, rolling of dollies on the steel frame of the elevator entrance.
- Outside: footsteps on concrete, traffic from buses and cars, bicycles, and occasional planes and helicopters.

We chose for this study to focus on a few simple yet robust features that can be extracted in a straightforward manner; features that require many thresholds were avoided. The audio data samples collected were mono-channel, 16 bits per sample, with a sampling rate of 44 kHz and of varying lengths. The input signal was down-sampled to a 22050 Hz sampling rate. Each clip was further divided into 4-second segments. Features were calculated from a 20 msec rectangular window with 10 msec overlap. Each 4-second segment makes up an instance for training/testing. All spectra were computed with a 512-point FFT. All data were normalized to zero mean and unit variance.

3.4 Classification Methods

To evaluate the performance of our recognition system, we examined the following three classification methods: K-Nearest Neighbor (KNN) (Mitchell, 1997), Gaussian Mixture Models (GMM) [9], and Support Vector Machine (SVM) (Schölkopf, 2002). For KNN, we used the Euclidean distance as the distance measure and 1-nearest-neighbor queries to obtain the results. For GMM, we set the number of mixtures for both training and testing to 5. For the SVM classifiers, we used a degree-2 polynomial as the kernel with C = 10 and ε = 1e-7, where C is the regularization parameter and ε controls the width of the ε-insensitive zone used to fit the training data, which affects the number of support vectors used. Since SVM is a two-class classifier, we use the one-against-the-rest algorithm (Burges, 1998) for multi-class classification in all of the experiments.

We performed leave-one-out cross-validation on the data. Although this method is computationally expensive, it has been shown to produce almost unbiased results. The recognition accuracy under leave-one-out cross-validation was computed as the fraction of held-out segments that were classified correctly.

Figure 3.1: Classification with forward feature selection using KNN, GMM, and SVM, respectively

More than half of the data collected contained sonar and motor sounds emitted by the robot. Motor noises were found to be less noticeable than those emitted by the sonars. To determine how the sonar sounds affect the classification, we manually separated the data into two sets: A) containing sonar and B) without any sonar sounds. Classifications were performed on three sets of data: set A only, set B only, and sets A and B together. The use of set-B-only data would be unrealistic for mobile robots. The accuracy using all 34 features for KNN was 90.8%, 91.2%, and 89.5% for sets A, B, and A&B, respectively. For the rest of this chapter, all experiments are performed using set A&B.
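As a concrete illustration of the evaluation protocol just described (a 1-nearest-neighbor classifier with Euclidean distance, a 5-mixture GMM per class, a degree-2 polynomial SVM with one-against-the-rest, and leave-one-out cross-validation), the sketch below uses scikit-learn. This is an assumed, illustrative setup and not the software actually used in this study; in particular, the diagonal-covariance GMM wrapper, the parameter defaults, and the placeholder data are assumptions made for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

class GMMClassifier(BaseEstimator, ClassifierMixin):
    """One Gaussian mixture per class; predict the class with the highest log-likelihood."""
    def __init__(self, n_components=5):
        self.n_components = n_components
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {c: GaussianMixture(n_components=self.n_components,
                                           covariance_type="diag",
                                           random_state=0).fit(X[y == c])
                        for c in self.classes_}
        return self
    def predict(self, X):
        scores = np.column_stack([self.models_[c].score_samples(X) for c in self.classes_])
        return self.classes_[np.argmax(scores, axis=1)]

# Placeholder data: one 34-dimensional feature vector per 4-second segment, 5 environment labels.
X = np.random.randn(50, 34)
y = np.repeat(np.arange(5), 10)

classifiers = {
    "KNN (1-NN, Euclidean)": KNeighborsClassifier(n_neighbors=1, metric="euclidean"),
    "GMM (5 mixtures per class)": GMMClassifier(n_components=5),
    # Note: scikit-learn's SVC has no ε parameter; the ε-insensitive zone is specific to
    # support vector regression, so only the kernel, degree, and C are reproduced here.
    "SVM (degree-2 poly, one-vs-rest)": OneVsRestClassifier(SVC(kernel="poly", degree=2, C=10)),
}

for name, clf in classifiers.items():
    # Leave-one-out cross-validation: accuracy = correctly classified segments / total segments.
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: {acc:.3f}")
```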
3.5 Experimental Results and Discussion

One of the problems in using a large number of features is that there are many potentially irrelevant features that could negatively impact the quality of classification. By using feature selection techniques, we can choose a smaller feature set to reduce the computational cost and running time, as well as achieve an acceptable, if not higher, recognition rate. Adding more features is not always helpful; as the feature dimension increases, data points become more sparse and some features are essentially noise. This leads to the issue of selecting, from a larger set of possible features, the subset that will be most effective. The optimal solution is an exhaustive search over all feature subsets, which requires evaluating $2^{34} - 1$, or roughly $10^{10}$, combinations. Instead of performing $10^{10}$ computations, we use a greedy search to select the features. There are various ways of performing feature selection, such as forward feature selection, backward selection, branch and bound, and stochastic search, each with its advantages and disadvantages. We used forward feature selection in our experiments since it is simple and straightforward. The algorithm is given as:

  Initialize the selected set S = ∅
  Initialize the unselected set F = {1, ..., M}
  Repeat:
    Evaluate the performance with S ∪ {f_i} for each f_i ∈ F
    Set S := S ∪ {f_m} and F := F \ {f_m}, where f_m gives the maximum improvement in performance
  Stop when there is no significant improvement in classification or no features remain

(An illustrative code sketch of this greedy search is given at the end of this discussion.)

Using this feature selection algorithm and evaluating by picking the feature f_m that yields the maximum accuracy, we found that using 16 features enabled us to achieve a recognition accuracy of 96.6% for SVM and 94.3% for KNN. It took 25 features to achieve an accuracy of 93.4% for GMM. The results and the features selected are summarized in Table 3.2.

Table 3.2: Summary of classification accuracy

Classifier | Features used                                       | Recognition accuracy
KNN        | All 34 features                                     | 89.5%
           | Forward FS: 1-3, 5-10, 12, 13, 16, 17, 28, 31, 33   | 94.3%
           | Backward FS: 1-3, 7-9, 28, 31, 33                   | 94.2%
GMM        | All 34 features                                     | 89.5%
           | Forward FS: 1-10, 12-16, 20-22, 25, 26, 28, 31-34   | 93.4%
SVM        | All 34 features                                     | 95.1%
           | Forward FS: 1-3, 5-10, 13, 15, 18, 28, 31-33        | 96.6%

Figure 3.1 shows a plot of the recognition accuracies for different numbers of features. Using only 6 features (91.1% for KNN), we were able to surpass the accuracy of using all 34 features (89.5% for KNN). The confusion matrix in Table 3.3 shows the misclassified classes for the KNN classifier using 16 features. It can be seen that the worst performance was for the Elevator class, with most misclassifications going to Hallway. One reason for this comes from the fact that the area where the robot was driven for the Elevator class was actually part of the Hallway as well, so there was less separation between the two areas. However, Hallway gave the best performance due to its distinct characteristic of being relatively quiet most of the time. We can also observe from the same table that Lobby and Street were confused, as both contained many sharp footstep noises, but on different types of flooring. The Lobby has granite tiling, while the Street is concrete. There are footstep noises in the Hallway class as well, but the flooring in the hallways is plastic, so the footsteps were less prominent than those from Lobby or Street and created less confusion. Footsteps in the Cafe were drowned out by other noises, such as crowds of people talking and the shuffling of furniture.
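The greedy forward-selection procedure listed above can be sketched as follows. This is an illustrative implementation, not the code used in the experiments; the 1-NN classifier with leave-one-out scoring as the selection criterion, the `min_gain` stopping threshold standing in for the informal "no significant improvement" rule, and the placeholder data are all assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def forward_feature_selection(X, y, make_clf, min_gain=1e-3, max_features=None):
    """Greedy forward selection: grow the selected set S one feature at a time,
    keeping the feature whose addition gives the largest accuracy improvement;
    stop when the gain is insignificant or no features remain."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score, history = 0.0, []
    while remaining and (max_features is None or len(selected) < max_features):
        scores = []
        for f in remaining:
            cols = selected + [f]
            acc = cross_val_score(make_clf(), X[:, cols], y, cv=LeaveOneOut()).mean()
            scores.append((acc, f))
        acc, f_best = max(scores)              # feature giving the maximum improvement
        if acc - best_score < min_gain:        # stop: no significant improvement
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = acc
        history.append((list(selected), acc))
    return selected, history

# Example usage on placeholder data (X would hold the 34 features of Table 3.1).
X = np.random.randn(60, 34)
y = np.repeat(np.arange(5), 12)
feats, hist = forward_feature_selection(X, y, lambda: KNeighborsClassifier(n_neighbors=1))
print("selected features:", feats)
```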
Figure 3.1 shows that GMM produces the worst results compared to KNN. One possible reason is that we tuned the parameters for the case of using all features and did not re-optimize them when performing the forward-search feature selection experiments.

A final note on the various learning algorithms studied here: unlike KNN, both GMM and SVM require careful parameter choices. There are at least two degrees of freedom for GMM and four for SVM. For GMM, the number of mixtures for training and for testing must be picked a priori. For SVM, one needs to decide on C, ε, the kernel type and its bias, as explained in Section 3.4. In other words, even minor changes, such as the number of training samples, require fine tuning of parameters. In addition, both GMM and SVM have much higher computational complexity. Despite the higher accuracy of SVM over KNN and GMM, SVMs are very expensive to train, even with only five classes. The running times required for training and classification with the full feature set and no feature selection were 1.1, 148.9, and 1681.8 sec for KNN, GMM, and SVM, respectively. The KNN classifier works well overall, outperforming GMM while being roughly 1000 times faster than SVM.

Table 3.3: Confusion matrix of KNN classification using forward feature selection with 16 features, in percentage
            Street  Elevator  Cafe   Hallway  Lobby
Street      92.2    0         0      0        5.6
Elevator    0       90.0      1.1    7.8      1.1
Cafe        0       0         95.6   0        4.4
Hallway     0       0         0      100      0
Lobby       2.2     0         3.3    0        94.4

To check for overfitting and to confirm the validity of the selected features, we performed a sensitivity analysis with respect to the forward feature selection algorithm. Since GMM and SVM require tuning of many parameters and are more complex, we restricted this experiment to the KNN algorithm. The experiment was as follows:

Repeat 100 times:
- Randomly pick half of the dataset
- Run the forward feature selection algorithm on the subset
- Record the features selected

We tallied the features selected in each trial and picked those used more than half of the time, which resulted in 11 features. With these 11 features, we performed a backward feature selection search. Similar in spirit to forward search, backward search starts with all 11 features and removes one feature at a time. Instead of adding the feature that yields the maximum recognition accuracy, as in forward search, we removed at each step the feature whose contribution to the accuracy was smallest. This returned 9 features, which were in turn fed back into the 1-NN classifier. We finally achieved 94.2% recognition accuracy on the entire dataset with these 9 features. As listed in Table 3.2, the 9 features include MFCC 1-4 and 9-10, the zero-crossing rate, the standard deviation of the zero-crossing rate, and the standard deviation of the roll-off frequency.

3.6 Conclusion

Most previous efforts utilize a combination of some, or even all, of the aforementioned features to characterize audio signals. However, adding more features is not always helpful. As the feature dimension increases, data points become sparser and there are potentially irrelevant features that can negatively impact the classification result. We showed in this section that using all features for classification does not always produce good performance for the audio classification problems of our interest. This in turn leads to the issue of selecting, from a larger set of possible features, the optimal subset that yields the most effective representation.
We utilized a simple feature selection algorithm to obtain a smaller feature set to reduce the computational cost and running time and achieve an acceptable, if not higher, classication rate. Although the results showed improvements, the features found after the feature selection process were found to be specic to each classier and environment type. A similar phenomenon was observed in (Peltonen, 2001), where dierent feature subsets were tried to increase the performance for each con- text type. It was with these ndings that motivated us to look for a more eective and principled approach for determining an appropriate representation for environ- mental sound classication. Toward this goal, we investigated in ways of extracting features and introduce a novel feature extraction algorithm using matching pursuit in Chapter 4 to capture the underlying structures of audio data. 27 Chapter 4 MP-Features: Feature Extraction with Matching Pursuit 4.1 Introduction Desirable types of features should be robust, stable and straightforward, with the representation being sparse and physically interpretable. Environmental sounds, such as chirpings of insects and sounds of rain which are typically noiselike with a broad at spectrum, may include strong temporal domain signatures. However, only few temporal-domain features have been developed to characterize such diverse audio signals previously. As with most pattern recognition systems, selecting proper features is key to eective system performance. Audio signals have been traditionally characterized by Mel-frequency cepstral coecients (MFCCs) or some other time-frequency rep- resentations such as the short-time Fourier transform and the wavelet transform. The lterbanks used for MFCC computation approximates some important prop- erties of the human auditory system. MFCCs have been shown to work well for structured sounds such as speech and music, but their performance degrades in the presence of noise. MFCCs are also not eective in analyzing noise-like signals that have a at spectrum. Environmental audio contain a large and diverse variety of sounds, including those with strong temporal domain signatures, such as chirpings 28 of insects and sounds of rain that are typically noise-like with a broad at spectrum that may not be eectively modeled by MFCCs. In this work, we propose to use the matching pursuit (MP) algorithm to analyze environmental sounds. MP provides a way to extract time-frequency domain features that can classify sounds where using frequency-domain only features (e.g., MFCCs) fail. The process includes nding the decomposition of a signal from a dictionary of atoms, which would yield the best set of functions to form an approximate representation. The MP algorithm has been used in a variety of applications, such as video cod- ing (Ne & Zakhor, 1997) and music note detection (Gribonval & Bacry, 2003). MP has also been used in music genre classication (Umapathy et al., 2005) and clas- sication of acoustic emissions from a monitoring system (Ebenezer et al., 2004). In our proposed technique, MP is used for feature extraction in the context of en- vironmental sound (Chu et al., 2008). We investigate a variety of audio features and provide an empirical evaluation on fourteen dierent environment types. 
It is shown that the most commonly-used features do not always work well with environ- mental sounds while the MP-based features can be used to supplement traditional frequency domain features (MFCC) to yield higher automatic recognition accuracy for environmental sounds. Our goal in this chapter is to study the dierent unstructured environmental sounds in a more general sense and to use MP to learn the inherent structures of each type of sounds as a way to discriminate the various sound classes. In this work, we performed an empirical feature analysis for audio environment characteri- zation and proposed to use the matching pursuit (MP) algorithm to obtain eective time-frequency features. We will show that using MP will make this representa- tion possible. The advantages of this representation are the ability to capture the inherent structure within each type of signal and to map from a large, complex signal onto a small, simple feature space. More importantly, it is conceivably more 29 invariant to background noise and could capture characteristics in the signal where MFCCs tend to fail. 4.2 Signal Representation with Matching Pursuit (MP) The intuition behind our strategy is that there are underlying structures that lie within signals of each type of environment, and we could use MP to discover them. Dierent types of environmental sounds have their own unique characteristics, mak- ing the decomposition into sets of basis vectors to be noticeably dierent from one another. By using a dictionary that consists of a wide variety of functions, MP provides an ecient way of selecting a small set of basis vectors that produces meaningful features as well as exible representation for characterizing an audio environment. To achieve an ecient representation, we would like to obtain the minimum number of basis vectors to represent a signal, resulting in a sparse approximation. However, this is an NP-complete problem. Various adaptive approximation tech- niques to obtain such a signal representation in an ecient manner have been pro- posed in the literature, including basis pursuit (BP) (Chen et al., 1998), matching pursuit (MP) (Mallat & Zhang, 1993), and orthogonal matching pursuits (OMP) (Pati et al., 1993). All of these methods utilize the notion of a dictionary that capacitates the decomposition of a signal by selecting basis vectors from a given dictionary to nd the best subset. BP provides a framework that minimizes the L1-norm of coecients occurring in the representation, but at a cost in linear programming. Although it provides good representations, BP is computationally intensive. By using a dictionary that consists of a wide variety of elementary waveforms, MP aims at nding sparse decompositions of a signal eciently in a greedy manner. MP is sub-optimal in 30 the sense that it may not achieve the sparsest solution. Usually, elements in a given dictionary are selected by maximizing the energy removed from the residual signal at each step. Even in just a few steps, the algorithm can yield a reasonable approximation with a few atoms, and the decomposition will provide us with an interpretation of the signal structure. We adopt the classic MP approach to generate audio features in our study. The MP algorithm was originally introduced by Mallat and Zhang (Mallat & Zhang, 1993) for decomposing signals in an overcomplete dictionary of functions, providing a sparse linear expansion of waveforms. 
As long as the dictionary is overcomplete, the expansion is guaranteed to converge to a solution where the residual signal has zero energy. The following description of the MP algorithm is based on the description in (Chen et al., 1998). Let dictionary D be a collection of parameterized waveforms given by

D = \{ \phi_{\gamma} : \gamma \in \Gamma \},

where \Gamma is the parameter set and \phi_{\gamma} is called an atom. The approximate decomposition of a signal s can be written as

s = \sum_{i=1}^{m} \alpha_i \phi_{\gamma_i} + R^{(m)},     (4.1)

where R^{(m)} is the residual. Given s, m, and D, our goal is to find the indices \gamma_i and compute the coefficients \alpha_i, i = 1, 2, ..., m, while minimizing R^{(m)}. Starting from the initial approximation s^{(0)} = 0 and residual R^{(0)} = s, the MP algorithm builds up a sequence of sparse approximations stepwise.

Initially, the MP algorithm computes all inner products of the signal s with the atoms in dictionary D. The atom \phi_{\gamma_0} with the largest magnitude inner product is selected as the first element. Thus, the atom selection criterion can be given as

|\langle s, \phi_{\gamma_0} \rangle| \geq |\langle s, \phi_{\gamma} \rangle| \quad \forall \gamma \in \Gamma.

After the first step, the contribution of atom \phi_{\gamma_0} is subtracted from s to yield the residual R^{(1)}. Generally, at stage k = 1, 2, ..., the MP algorithm identifies the atom that best correlates with the residual and then adds the scalar multiple of that atom to the current approximation,

s^{(k)} = s^{(k-1)} + \alpha_k \phi_{\gamma_k},     (4.2)

where \alpha_k = \langle R^{(k-1)}, \phi_{\gamma_k} \rangle and R^{(k)} = s - s^{(k)}. After m steps, one has the approximate decomposition with residual R = R^{(m)}, as shown in Eq. (4.1).

Various dictionaries have been proposed for use with MP, including wavelets (Vera-Candeas et al., 2004), wavelet packets (Yang et al., 2007), cosine packets (Sugden & Canagarajah, 2004), Gabor dictionaries (Mallat & Zhang, 1993), multiscale Gabor dictionaries (Sugden & Canagarajah, 2004; Gribonval, 2001), chirplets (Ghofrani et al., 2003), and others. Most dictionaries are complete or overcomplete, and approximation techniques such as MP allow the combination of different dictionaries. Examples of some basic dictionaries are: 1) frequency (e.g., Fourier functions), 2) time-scale (e.g., Haar wavelets), and 3) time-frequency (e.g., Gabor functions). To encapsulate the nonstationary characteristics of audio signals, we use a dictionary of Gabor atoms, which offers a more discriminant time-frequency representation. We discuss this in further detail in Sec. 4.4.

Algorithm: Matching Pursuit
Input: signal s, dictionary D
Return: list of coefficient/atom pairs (\alpha_k, \phi_{\gamma_k})
Initialize: R^{(0)} <- s, k <- 0
Repeat:
    Find the atom \phi_{\gamma_k} in D with the maximum magnitude inner product |\langle R^{(k)}, \phi_{\gamma_k} \rangle|
    \alpha_k <- \langle R^{(k)}, \phi_{\gamma_k} \rangle
    R^{(k+1)} <- R^{(k)} - \alpha_k \phi_{\gamma_k}
    k <- k + 1
until ||R^{(k)}|| < threshold or a preset number of atoms k is reached

4.3 Feature Extraction with Matching Pursuit (MP)

Desirable types of features should be robust, stable and straightforward, with a representation that is sparse and physically interpretable. We will show that using MP makes such a representation possible. The advantages of this representation are the ability to capture the inherent structure within each type of signal and to map a large, complex signal onto a small, simple feature space. More importantly, it is conceivably more invariant to background noise and can capture characteristics of the signal where MFCCs tend to fail. In this section, we describe how MP features are obtained.

4.3.1 Extracting MP Features

Our goal is to use MP as a tool for feature extraction for classification, and not necessarily to recover or approximate the original signal for compression. Nevertheless, MP provides an excellent way to accomplish either of these tasks.
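The MP iteration summarized above can be written compactly. The sketch below assumes the dictionary is stored as an array of unit-norm atoms of the same length as the analysis window; it is meant only to mirror the pseudo-code, not to reproduce a particular MP toolbox.

```python
# A minimal NumPy sketch of the matching pursuit loop described above.
import numpy as np


def matching_pursuit(s, dictionary, n_atoms=5, threshold=1e-6):
    """Greedy MP decomposition: returns selected atom indices, coefficients, residual."""
    residual = s.astype(float).copy()
    indices, coeffs = [], []
    for _ in range(n_atoms):
        # Inner products of the current residual with every atom.
        products = dictionary @ residual
        k = int(np.argmax(np.abs(products)))
        alpha = products[k]
        indices.append(k)
        coeffs.append(alpha)
        # Subtract the selected atom's contribution from the residual.
        residual -= alpha * dictionary[k]
        if np.linalg.norm(residual) < threshold:
            break
    return indices, coeffs, residual
```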
MP is a desirable method to provide a coarse representation and to reduce the residual energy with as few atoms as possible. The decomposition from MP also furnishes us with an interpretation of the signal structures. The strategy for feature extraction is based on the assumption that the most important information of a signal lies in leading synthesizing atoms with the highest energy, yielding a simple representa- tion of the underlying structure. Since MP selects atoms in order by eliminating 33 Figure 4.1: Illustration of the decomposition of signals from 6 dierent classes as listed, where the top-most signal is the original, followed by the rst ve basis vectors. the largest residual energy, it lends itself in providing the most useful atoms, even just after a few iterations. We illustrate the eectiveness of using MP with Gabor functions in Fig. 4.2. As shown in Fig. 4.2(a), the biggest drop in the residual error happens in the rst few terms. It is also observed from Fig. 4.2(b) that using only 10 atoms will provide a reasonable signal; while using the rst 50 atoms produces an approximation very similar to the original one. The MP algorithm selects atoms in a stepwise manner among the set of wave- forms in the dictionary that best correlate the signal structures. The iteration can be stopped when the coecient associated with the atom selection falls below a threshold or when a certain number of atoms selected overall has been reached. Another common stopping criterion is to use the signal to residual energy ratio. In this work, we chosen atoms as the stopping criterion for the iteration. MP features are selected by the following process. Based on our experimental setup, explained in Sec. 4.6.1, we use a rectangular window of 256 points with a 50% overlap. This corresponds to the window size used for all feature extraction. We decompose each 256-point segment using MP 34 (a) (b) Figure 4.2: Examples of reconstruction using MP with the Gabor dicionary by varying the number of atoms (basis vectors). with a dictionary of Gabor atoms that are also 256 points in length. We stop the MP process after obtainingn atoms. Afterwards, we record the frequency and scale parameters for each of thesen atoms and nd the mean and the standard deviation corresponding to each parameter separately, resulting in 4 feature values. To select parameter n in the stopping criterion, we plot the classication per- formance as a function of n in Fig. 4.3. It shows a rise with an increasing number of features due to the increased discriminatory power with the performance lev- eling o around 4 or 5 atoms. Thus, we chose n = 5 atoms in our experiments and use the same process to extract features for both training and test data. The decomposition of dierent signals from the same environmental class might not be composed of exactly the same atoms or order. However, since we are taking the average of their parameters as features, the sequencing order of atoms is neglected and the robustness of these features is enhanced by averaging. Using these atom parameters as features abstracts away ner details and forces the concentration on the most pronounced characteristics. The above truncation process is similar to that of non-injective mapping. When mapping a large problem space into the feature space, only a few signicant features 35 Figure 4.3: Comparison of classication rates (with the GMM classier) using the rst n atoms, where n = 1; ; 10, as features while the MFCC features are kept the same. 
are considered, enabling us to disregard the rest. The most important information describing a signal can be found in a few basis vectors with the highest energies, and the process by which MP selects these vectors is exactly in the order of eliminating the largest residual energy. This means that even the first few atoms found by MP will naturally contain the most information, making them the more significant features. It also allows us to map each signal from a larger problem space onto a point in a smaller feature space. Data items are similar as long as their representations in the feature space are similar or close in proximity.

4.4 MP Dictionary Selection

Examples of the MP decomposition using different dictionaries are compared in Fig. 4.4. The first five atoms obtained from the MP decomposition with Fourier, Haar and Gabor dictionaries are shown in Fig. 4.4(a). Since the Fourier representation is formed by the superposition of non-local signals, it demands a large number of atoms whose cancellation results in a local waveform. In contrast, the Gabor representation is formed by band-limited signals of finite duration, making it more suitable for time-frequency localized signals. The Gabor representation was shown in (Vetterli & Kovacevic, 1995) to be optimal in the sense of minimizing the joint two-dimensional uncertainty in the combined spatial-frequency space.

Figure 4.4: (a) Decomposition of signals using MP (the first five basis vectors) with dictionaries of Fourier (left), Haar (middle), and Gabor (right), and (b) approximation (reconstruction) using the first ten coefficients from MP with dictionaries of Gabor (top), Haar (middle) and Fourier (bottom).

The effectiveness of reconstructing a signal using only a small number of atoms is compared in Fig. 4.4(b), where ten atoms are used. Gabor atoms result in the lowest reconstruction error, as compared with the Haar or Fourier dictionaries using the same number of coefficients. Due to the non-homogeneous nature of environmental sounds, using features with these Gabor properties benefits a classification system. Based on the above observations, we choose to use the Gabor function in this work.

Gabor functions are sine-modulated Gaussian functions that are scaled and translated, providing joint time-frequency localization. Mathematically, the discrete Gabor time-frequency atom is written as

g_{s,u,\omega,\theta}(n) = \frac{K_{s,u,\omega,\theta}}{\sqrt{s}} \, e^{-\pi (n-u)^2 / s^2} \cos[2\pi\omega(n-u) + \theta],

where s \in R^{+}, u, \omega \in R, and \theta \in [0, 2\pi]. K_{s,u,\omega,\theta} is a normalization factor such that ||g_{s,u,\omega,\theta}||^2 = 1. We use \gamma = (s, u, \omega, \theta) to denote the parameters of the Gabor function, where s, u, \omega and \theta correspond to an atom's position in scale, time, frequency and phase, respectively. The Gabor dictionary in (Mallat & Zhang, 1993) was implemented with atom parameters chosen from dyadic sequences of integers. The scale s, which corresponds to the atom width in time, is derived from the dyadic sequence s = 2^p, 1 \leq p \leq m, and the atom size is equal to N = 2^m.

We chose the Gabor function with the following parameters in this work: s = 2^p (1 \leq p \leq 8), u = \{0, 64, 128, 192\}, \omega = K i^{2.6} (with 1 \leq i \leq 35 and K = 0.5/35^{2.6}, so that the range of \omega is normalized between 0 and 0.5), \theta = 0, and the atom length truncated to N = 256. Thus, the dictionary consists of 1120 = 8 x 35 x 4 Gabor atoms, generated using scales of 2^p and translations by quarters of the atom length N. We attempt to keep the dictionary size small, since a large dictionary demands higher complexity.
For example, we chose a fixed phase term, since varying it does not help much. By shifting the phase, i.e., \theta = \{0, \pi/4, \pi/2, ...\}, each basis vector only varies slightly. Since we use only the top few atoms for creating the MP features, it was found unnecessary to incorporate the phase-shifted basis vectors. A logarithmic frequency scale is used to permit a higher resolution in the lower frequency region and a lower resolution in the higher frequency region. We found the exponent 2.6 in \omega experimentally, given the parameter setting of the frequency interval. We wanted a finer granularity below 1000 Hz as well as enough descriptive power at higher frequencies. The reason for finer granularity at lower frequencies is that more audio object types occur in this range, and we want to capture finer differences between them.

The decomposition of signals from six sound classes using Gabor atoms is shown in Fig. 4.1, where the top five atoms are shown. We can observe differences in the synthesizing atoms for different environments, which demonstrates that different environments exhibit different characteristics, and each set of decompositions encapsulates the inherent structures within each type of signal. For example, because the two classes On boat and Harbor both contain ocean sounds, their decompositions are very similar to each other. Another example is the pair Nature-daytime and Near highway. Both were recorded outdoors; therefore there are some similarities in a subset of their decompositions, but because the Near highway class contains traffic noise, it leads to distinctively different atoms with higher frequency components compared to Nature-daytime. When we compare differing classes, e.g., Nature-nighttime and Near highway, the decompositions are noticeably different from one another. We therefore utilize these sets of atoms as a simple representation of these structures.

4.5 Computational Cost of MP Features

Each input audio signal is divided into k overlapping windows of length N, and MP is performed on each of these k windows. At each iteration, the MP algorithm computes the inner product of the windowed signal (or residual) with all D atoms in the dictionary. The cost of computing all inner products is O(kND). During this process, we need to record the highest correlation value and the corresponding atom. We terminate the MP algorithm after n iterations, yielding a total cost of O(nkND). By keeping the dictionary size small, with a constant iteration number n and window size N, the computational cost is a linear function of the total length of the signal. Thus, it can be done in real time.

4.6 Experimental Evaluation

4.6.1 Experimental Setup

We investigated the performance of a variety of audio features and provide an empirical evaluation on fourteen different types of commonly encountered environmental sounds. We used recordings of natural (unsynthesized) sound clips obtained from (BBC, ; Fre, ). We used recordings available in WAV format to avoid introducing artifacts into our data (e.g., from the MP3 format). Our auditory environment types were chosen so that they are made up of non-speech and non-music sounds. Each type is essentially the background noise of a particular environment, composed of many sound events. We do not consider each constituent sound event individually, but rather treat them collectively as properties of each environment. Naturally, there could be infinitely many possible combinations.
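Putting Secs. 4.3-4.5 together, the MP-feature pipeline for a single 256-point frame can be sketched as follows. The parameter grid follows Sec. 4.4, matching_pursuit refers to the sketch given earlier, and the function and variable names are illustrative rather than the exact implementation used here.

```python
# Sketch of MP-feature extraction (Secs. 4.3-4.5): build a Gabor dictionary on the
# parameter grid described above, run MP on a 256-point frame, and keep the mean/std
# of the selected atoms' frequency and scale parameters as the 4-dim MP feature.
import numpy as np

N = 256  # atom / frame length


def gabor_atom(s, u, w, theta=0.0, n_points=N):
    n = np.arange(n_points)
    g = np.exp(-np.pi * (n - u) ** 2 / s ** 2) * np.cos(2 * np.pi * w * (n - u) + theta)
    return g / np.linalg.norm(g)  # normalize so that ||g|| = 1


def build_gabor_dictionary():
    scales = [2 ** p for p in range(1, 9)]          # s = 2^p, 1 <= p <= 8
    shifts = [0, 64, 128, 192]                      # quarters of the atom length
    K = 0.5 / 35 ** 2.6
    freqs = [K * i ** 2.6 for i in range(1, 36)]    # logarithmic frequency grid
    atoms, params = [], []
    for s in scales:
        for u in shifts:
            for w in freqs:
                atoms.append(gabor_atom(s, u, w))
                params.append((s, w))
    return np.vstack(atoms), np.array(params)       # 1120 atoms x 256 samples


def mp_features(frame, dictionary, params, n_atoms=5):
    """4-dim MP feature: mean/std of frequency and of scale over the top atoms."""
    idx, _, _ = matching_pursuit(frame, dictionary, n_atoms=n_atoms)
    scales, freqs = params[idx, 0], params[idx, 1]
    return np.array([freqs.mean(), freqs.std(), scales.mean(), scales.std()])
```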
To simplify the problem, we restricted the number of environment types examined and enforced that each type of sound be distinctively different from the others, which minimized overlaps as much as possible. The fourteen environment types considered were: Inside restaurants, Playground, Street with traffic and pedestrians, Train passing, Inside moving vehicles, Inside casinos, Street with police car siren, Street with ambulance siren, Nature-daytime, Nature-nighttime, Ocean waves, Running water/stream/river, Raining/shower, and Thundering.

We examined the performance of the MP features, extracted as described in Section 4.3; a concatenation of the MP features and MFCCs to form a longer feature vector, MP+MFCC (16); and a variety of commonly used features, which include MFCC (12), ΔMFCC (12), LPC (12), ΔLPC (12), LPCC (12), the band energy ratio, frequency roll-off set at 95%, spectral centroid, spectral bandwidth, spectral asymmetry, spectral flatness, zero-crossing, and energy.

We adopted the Gaussian Mixture Model (GMM) classification method in the feature space for our work. With GMMs, each data class is modeled as a mixture of several Gaussian clusters. Each mixture component is a Gaussian represented by the mean and the covariance matrix of the data. Once the model was generated, conditional probabilities were computed using

p(x | X_k) = \sum_{j=1}^{m_k} p(x | j) P(j),

where X_k is the set of data points for class k, m_k is the number of components, P(j) is the prior probability that datum x was generated by component j, and p(x | j) is the mixture component density. The EM algorithm (Bishop, 2003) was then used to find the maximum likelihood parameters of each class. We also investigated the K-nearest neighbor (kNN) classification method. kNN is a simple supervised learning algorithm in which a new query is classified based on the majority class of its k nearest neighbors. A commonly used distance measure is the Euclidean distance,

d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }.

In our experiments, we utilized separate source files for the training and test sets; 4-sec segments originating from the same source file were never split between training and testing. The source files for each environment were obtained at different locations. For instance, the Street with traffic class contains four source files labeled as taken from various cities. We required that each environment contain at least four separate source recordings, and the segments from the same source file were considered one set. We used three sets for training and one set for testing. Finally, we performed 4-fold cross-validation for the MP features and for all commonly used features individually for performance comparison. In this setup, none of the training and test items originated from the same source. Since the recordings were taken from a wide variety of locations, the ambient sound can have very high variance. Results were averaged over 100 trials. The sound clips were of varying lengths (1-3 minutes long) and were processed by dividing them into 4-second segments, downsampled to a 22050 Hz sampling rate, mono-channel, 16 bits per sample. Each 4-sec segment makes up one instance for training/testing. Features were calculated from a rectangular window of 256 points (11.6 msec) with 50% overlap.

Figure 4.5: Overall recognition rate (GMM) comparing 14 classes using MFCC only, MP only, and MP+MFCC as features. (0% recognition for four classes using MFCC only: Casino, Nature-nighttime, Train passing, and Street with ambulance.)
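The source-separated evaluation protocol described above can be sketched as follows. The fold-assignment logic is a simplified illustration that assumes each class has at least four source recordings identified by a NumPy array source_ids; it is not the exact script used for the reported experiments.

```python
# Sketch of the 4-fold evaluation protocol: for every environment class, segments
# from one source recording form the test set while the remaining sources are used
# for training, rotating the held-out source across folds.
import numpy as np


def fourfold_by_source(X, y, source_ids, classifier_factory):
    """source_ids[i] identifies the recording that segment i came from."""
    accs = []
    for fold in range(4):
        test_mask = np.zeros(len(y), dtype=bool)
        for c in np.unique(y):
            # The sources of class c; one of them is held out in this fold.
            sources = np.unique(source_ids[y == c])
            held_out = sources[fold % len(sources)]
            test_mask |= (y == c) & (source_ids == held_out)
        train_mask = ~test_mask
        clf = classifier_factory().fit(X[train_mask], y[train_mask])
        accs.append(np.mean(clf.predict(X[test_mask]) == y[test_mask]))
    return float(np.mean(accs))
```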
4.6.2 Experimental Results We compare the overall recognition accuracy using MP, MFCC and their combina- tion for 14 classes of sounds in Fig. 4.5. As shown in this gure, MFCC features tend to operate on the extremes. They perform better than MP features in six of the examined classes while producing extremely poor results in the case of ve other classes; namely, a recognition rate of 0% for four classes, Casino, Nature-nighttime, Train passing, and Street with ambulance and less than 10% for Thundering. MP features perform better overall, with the exception of two classes (Restaurant and Thundering) having the lowest recognition rate at 35%. One illustrative example is the Nature-nighttime class, which contains many insect sounds of higher frequen- cies. Unlike MFCCs that recognized 0% of this category, MP features were able to yield a correct recognition rate of 100%. Some of these sounds are best character- ized by narrow spectral peaks, like chirps of insects. MFCC is unable to encode such narrow-band structure, but MP features are eective in doing so. By combin- ing MP and MFCC features, we were able to achieve an averaged accuracy rate of 42 Figure 4.6: Overall recognition accuracy using kNN with varying number of K Table 4.1: recognition accuracy using GMM with a varying number of mixtures, using MFCC and MP features. No. of mixtures 1 2 3 4 5 6 7 8 9 10 15 20 mixed Accuracy (%) 68.1 66.8 74.6 76.0 83.9 80.4 77.5 73.9 73.2 73.7 70.8 69.4 83.4 83.9% in discriminating fourteen classes. There are seven classes that have a clas- sication rate higher than 90%. We see that MFCC and MP features complement each other to give the best overall performance. For completeness, we compared the results from the two dierent classiers, namely GMM and kNN. We examine the results from varying the number of neigh- bors K and using the same K for each environment type. The overall recognition rate by varying K are given in Fig. 4.6. The highest recognition rate was obtained using K=32, with an accuracy of 77.3% We could observe the performance slowly attens out and further degrades as we increase the number of neighbors. By in- creasing K, we are in fact expanding the radius of its neighbors. Extending this space makes it more likely the classes would overlap. In general, the results from GMM outperforms those from using kNN. Therefore, we will concentrate on GMM for the rest of our experiments. Using GMM allows for better generalization. kNN would perform well if the data samples are very similar to each other. However, since we are using dierent sources for testing and training, they might be similar in their overall structure but not ner details. 43 Figure 4.7: Overall recognition accuracy comparing MP, MFCC, and other commonly-used features for 14 classes of sounds using kNN and GMM as class- ers. To determine the model order of GMM, we examine the results by varying the number of mixtures. Using the same settings as the rest of the experiments, we examined mixtures of 1-10, 15, and 20 and used the same number of mixtures for each environment type. The overall recognition rates are given in Table 4.1. We see that the classication performance peaks around 5 mixtures and the performance slowly degrades as the number of mixtures increases. The highest recognition rate for each class across the number of mixtures was obtained with 4-6 mixtures. 
They were equal to 4, 5, 5, 5, 5, 5, 6, 5, 4, 4, 5, 6, 5, 5 for the corresponding classes: Nature-daytime, Inside moving vehicles, Inside restaurants, Inside casinos, Nature- nighttime, Street with police car siren, Playground, Street with trac, Thundering, Train passing, Raining/shower, Running water/stream, Ocean waves, and Street with ambulance. We also experimented with this combination of mixtures numbers, and the results is given as mixed in Table 4.1. Since the latter requires tailoring to each class, we decided to just use 5 mixtures throughout all of our experiments to avoid making the classier too specialized to the data. 44 We performed an analysis of variance (ANOVA) on the classication results. Specically, we used the t-test, which is a special case of ANOVA for comparing two groups. The t-test was run on each of the 14 classes individually. The t-tests showed that the result of the two systems was signicant with p<0.001 for all 14 classes. An interesting benchmark is shown in Fig. 4.7, where we ran the same experi- ments using all features, including MP, MFCC, and other commonly-used features as stated in Sec. 4.6.1. The average recognition accuracy is approximately 55.2%, which is much worse than using combined MFCC and MP features. This conrms our discussion in Sec. 3.6; namely, adding more features may not be always helpful. 4.7 Confusion Matrix and Pairwise Classication Results presented in Sec. 4.6.2 are averaged values from all trials together. To further understand the classication performance, we show results in the form of a confusion matrix, which allows us to observe the degree of confusion among dierent classes. The confusion matrix given in Table 4.2 is built from a single arbitrary trial, constructed by applying the classier to the test set and displaying the number of correctly/incorrectly classied items. The rows of the matrix denote the environment classes we attempt to classify, and the columns depict classied Table 4.2: Confusion matrix for 14-class classication using MP features and MFCC with GMM 14 classes: 1) Nature-daytime, 2) Inside vehicle, 3) Restaurant, 4) Casino, 5) Nature-nighttime, 6) Street - police, 7) Playground, 8) Street - trac, 9) Thundering, 10), Train passing, 11) Rain / shower, 12) River / stream, 13) Waves, 14) Trac - ambulance. All values are in percentages. Blank cells equates to less than 1%. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1 92.2 2 100 3 66.8 2.5 12.8 8.6 4 23.0 62.2 5 100 6 1.8 33.7 97.5 4.4 7 2.9 94.6 13.5 8 1.2 74.8 5.9 3.5 9 7.6 1.3 12.1 91.1 11.2 7.1 10 60.7 11 46.5 12 3.8 3.5 1.2 53.3 78.3 37.1 13 22.0 100.0 14 2.2 54.6 45 Table 4.3: Recognition accuracy for pairwise classication using GMM. All values are in percentages. Blank cells equates to greater than 90%. results. We see from Table 4.2 that Restaurant, Casino, Train, Rain and Street ambulance were more often misclassied than the rest. We could further point out that the misclassication overlaps between pairs, such as those of Inside restaurant and Inside casino and of Rain and Steam (Running River). Interestingly, there exists a one-sided confusion between Train and Waves, where samples of Train were misclassied as Waves, but not vice versa. Generating a confusion matrix provides a convenient way to understand the per- formance of features and classiers. However, since it is obtained from all classes, it is dicult to observe more subtle details. 
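For reference, a confusion matrix such as Table 4.2 can be tabulated directly from one trial's predictions. The helper below is an illustrative sketch using scikit-learn and row-normalizes the counts to percentages; the names are assumptions rather than the original code.

```python
# Sketch: build a row-normalized (percentage) confusion matrix from one trial.
import numpy as np
from sklearn.metrics import confusion_matrix


def confusion_percent(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    # Normalize each row (true class) so that entries are percentages.
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)
```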
In many instances, we are interested in determining where misclassication actually occurs; namely, whether it is origi- nating from the classier or the ambiguity of extracted features. To address this, we use a pairwise classication method to observe the interaction between all pos- sible pairs of classes. Pairwise classication is a series of two-class problems in a one-against-one manner, instead of the one-against-all method used to construct the confusion matrix. By examining all exhaustive pairs of classes and nding the most dicult ones, we show the pairwise classication results in Table 4.3. For 46 Table 4.4: Comparison of recognition accuracy between MFCC, MP, and MFCC+MP features for pairwise classication of ve-class examples. For each pair of classes, the three recognition accuracy values correspond to: (left) MFCC, (middle) MP, (right) MFCC+MP features. All values are in percentages. most pairs of classes, we obtained a correct classication rate higher than 90%. Only cases with correct classication rates less than 90% are listed in Table 4.3. A simple 2-class classication result is around 58% in dierentiating classes between inside restaurant or casino, which is not much better than random guessing. We investigate more closely the eectiveness of MP features by presenting the pairwise classication results for ve classes of environmental sounds, with 20 data samples each. By examining a smaller problem, we could observe the subtle details of their classication performance. The ve classes are Playground, Nature-daytime, Nature-nighttime, Stream/river, and Raining. Table 4.4 shows ten pairwise classi- cation results between ve classes. For each pair of classes, recognition rates are given in three boxes. They correspond to the use of dierent features for classi- cation: MFCC features only (left), MP features only (middle) and joint MFCC and MP features (right). The use of joint MFCC and MP features tends to result in a higher accuracy rate. One impressive example is observed in discriminating Rain/shower and Nature-daytime, the use of MFCC and MP-features alone results in only an accuracy rate of 50%. However, the use of two types of features jointly leads to an accurate classication rate of 98.4%. 47 Table 4.5: Comparison of recognition accuracy between MFCC and MFCC with (concatenating) individual MP features for pairwise and overall classication of the ve-class examples using GMM, in percentage. MFCC MP-features mean-F std-F mean-, std-F mean-S std-S mean-, std-S 5-class 65.3 86.7 80 78.7 82.7 73.3 69.3 78.7 (Overall) Nature-daytime - 47.6 89.7 75.3 75.3 61.2 85.3 66.5 66.5 Nature-nighttime Nature-daytime - 50 99.5 66.5 75.3 75.3 50 89.7 66.5 Playground Nature-daytime - 50 98.4 61.2 77.3 61.2 75.3 75.3 85.3 Rain Nature-daytime - 50 100 100 100 100 85.3 85.3 85.3 Stream Nature-nighttime - 66.5 99.3 100 75.3 100 75.3 85.3 75.3 Playground Nature-nighttime - 50 100 100 85.3 100 89.7 85.3 85.3 Rain Nature-nighttime - 100 98.9 100 98.9 100 100 98.9 98.9 Stream Playground - 50 61.2 75.3 66.5 75.3 47.6 50 50 Rain Playground - 46.2 85.3 66.5 61.2 61.2 61.2 50 66.5 Stream Rain - 58.1 75.3 75.3 75.3 75.3 66.5 61.2 61.2 Stream Nature-nighttime - 0 100 100 0 100 0 0 0 Thundering 4.8 Comparison of Time-Domain Features Some environment sounds may include strong temporal domain signatures such as those from chirpings of insects and raining, which are noise-like with a broad at spectrum. These characteristics might be better captured with temporal type features. 
When compared with spectral features, there are fewer temporal-domain features used to characterize audio signals. Two commonly used temporal features are the short-time energy and the zero-crossing rate (Zhang & Kuo, 2001). In this work, we present new temporal features based on MP. In this subsection, we would like to compare these three features. Fig. 4.8 provides an example of the short-time energy function of signals from ve dierent classes. However, it may not provide an eective discriminant feature as illustrated in Fig. 4.9(a), where we show the energy range of twenty data samples for ve sound classes. We see from Fig. 4.9(a) that the energy range of Nature- nighttime resembles a at line. This is due to the high frequency in the chirping of insects, making it similar to a constant sound. The large variation within each type of sounds also makes it dicult to determine the eectiveness of each feature for 48 Figure 4.8: Sample of the short-time energy function from each of the example ve classes: a) Nature-nighttime, b)Nature-daytime, c) Playground, d) Raining, e) River / stream. each sound type. The zero-crossing rate can be useful to separating some classes such as Nature-nighttime and Raining from the rest of the classes as shown in Fig. 4.9(b). However, the other three types have very similar properties and, thus, they are more dicult to distinguish. MP features provide a more exible and eective way to extract temporal fea- tures of environmental sounds using time- and frequency-localized representation. For illustration, the mean distribution of three types of MP parameters are shown in Fig. 4.10. We see that these MP features form clearly separable clusters among themselves. For example, the Nature-nighttime class makes a cluster in the higher frequency and smaller scales due to the fact that insects have high-pitched repeat- ing chirps. In contrast, running streams of water produce a lower frequency sound, and they are mapped to the lower frequency and higher scale region in the gure. Using similar experiment settings as in Sec. 4.6.1, we perform classication on these ve classes using GMM. We obtained results of 75.3%, 84.0%, and 89.7% for MFCC only, MP-features only, and the combined MFCC+MP-features, respec- tively. Similar to previous ndings, including MP-features with MFCCs in the feature vector increases classication performance than using MFCCs alone. To 49 (a) (b) Figure 4.9: Temporal features: a) the energy range and b) the zero-crossing rate. (Figures a and b share the same legend.) achieve a better understanding of how combining MFCCs with individual MP- feature descriptors helps with classication, we can observe the results in rst row of the Table 4.5, where we perform the classication using the input feature vector as a combination, or more specically concatenation, of MFCCs with one (or two) of the descriptors at a time. We use mean-F and std-F to denote the mean and standard deviation for the frequency indices and likewise, mean-S and std-S for the scale indices. Table 4.5 shows how the descriptors contributes to the overall classi- cation. We further observe how each descriptor aects certain classes by repeating the experiment with pairwise classications as listed in Table 4.5. We see that the eect of each descriptor is dierent for each pair of environments. To further examine the eects, we plotted the values to each of the descriptor in Fig. 4.11. The mean-S can be viewed as an indication of the overall amplitude of the signal. 
It depends on the loudness of the signal or how far away the microphone is from the sound source. The std-S descriptor provides us with a way to disclose the variability of the energy in the time-frequency plane. The values for static type of noises, such as those of constant raining, are higher than diverse noises. Another interesting observation is that out of the four descriptors, std-S was the 50 Figure 4.10: MP features (i.e., the mean value of the corresponding paramters) in feature space. only one that separates out much of the Nature-daytime class from the others, which was was the most dicult to do with the other descriptors. The mean-F might be similar to that of the centroid as it represents where the energy on the frequency axis. Although, the mean-F only describes the frequency, but it still proved to be useful when combined with MFCC. One of the reason is that MFCCs model the human auditory system and do poorly when modeling non-speech type noise. Mean-F furnishes us with a description of the basic frequency without being modeled based on any auditory system. Std-F expresses the frequency range. If the sound frequency is narrow, std-F is low, i.e. running stream. An interesting example is for the class, between Nature-Nighttime and Thundering, where using MFCCs alone yields 0%. However, we can see in Table 4.5 that adding the mean-F to the feature vector helps signicantly. In this case, mean-S was less important in discriminating between Nature-Nighttime and Thundering, which also indicates that it is not relying on the amplitude of the signal. We can see that although dierent descriptors might be better for certain pair of classes, it would be dicult, and too specic, to selectively choose them. But from Table 4.5, we can conclude 51 Figure 4.11: Individual MP feature descriptor values: mean-F (top left), std-F (top right), mean-S (bottom left), std-S (bottom right). that using all the frequency and scale descriptors provides us with extra information for discriminating between dicult classes. 4.9 Listening Tests 4.9.1 Test Setup and Procedure A listening test was conducted to study human recognition capability of these envi- ronmental sounds. Our motivation was to nd another human-centric performance benchmark for our automatic recognition system. Our test consisted of 140 au- dio clips from 14 categories, with 10 clips from each of the classes described in Sec. 4.6.1. Audio clips were randomly picked from the test and training sets, and the duration varied between 2, 4, and 6 seconds. A total of 18 subjects participated in the test. They were volunteers and had no prior experience in analyzing envi- ronmental sounds. Participants consisted of both male and female subjects with 52 their ages between 24-40. About half of the subjects were from academia while the rest were from non-related elds. Four of the subjects were involved in speech and audio research. Each subject was asked to complete 140 classication tasks (the number of audio clips) in the course of this experiment. In each task, subjects were asked to evaluate the sound clip presented to them by assigning a label of one of 15 choices, which includes the 14 possible scenes and the others category. In addition to class labeling, we also obtained the condence level for each of the tasks. The condence levels were between 1 and 5, with 5 being the most condent. The order in which sound clips were presented was randomized to minimize any bias. 
The test was set up so that the rst 14 clips were samples of each of the classes and was not included in calculating the nal results. They were used to introduce subjects to the variety of sounds to be examined and to accustom them to dierent categories. The user interface was a webpage accessible via a browser with internet connec- tion. Users were asked to use headphones so as to reduce the amount of possible background noise. The test was performed without any time limit, and users were able to break and return at any time. For each task, the results are expressed as an entry consisting of 4 data items: 1) the original environment type, 2) the audio clip duration, 3) user labeled environment type, and 4) user condence level. 4.9.2 Test Results The results from the listening test are shown in Fig. 4.12. The overall recogni- tion rate was 82.3%, and the recognition accuracy for each individual environment ranged from 50% to 100%. The three best recognized scenes were Nature-daytime (98%), Playground (95%), and Thundering (95%). On the other hand, the four most dicult scenes were Ocean waves (65%), Inside Casino (70%), Inside mov- ing vehicles (73%), and Street with trac (74%). The listening test showed that humans are able to recognize everyday auditory scenes in 82% of the cases. The 53 Figure 4.12: Recognition accuracy of 14 classes from the listening test. Table 4.6: Recognition performance from the listening test. 14 classes: 1) Nature-daytime, 2) Inside vehicle, 3) Restaurant, 4) Casino, 5) Nature-nighttime, 6) Street - police, 7) Playground, 8) Street - trac, 9) Thundering, 10), Train passing, 11) Rain / shower, 12) River / stream, 13) Waves, 14) Trac - ambulance. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Others 1 98.5 1.5 2 73.1 10.0 1.5 3.1 12.3 3 86.9 3.0 4.8 5.4 4 14.8 70.0 3.1 11.5 5 6.2 88.5 1.5 4.0 6 6.2 76.2 17.9 7 95.0 3.1 2.0 8 15.6 73.8 5.6 3.8 9 94.6 2.3 3.1 10 8.5 80.8 3.1 4.8 11 82.3 15.6 2.3 12 16.9 83.9 13 13.1 2.3 4.6 2.3 65.4 2.3 14 16.4 83.1 All values are in percentages. Blank cells equates to less than 1%. confusions were mainly between scenes that had similar types of prominent sound events. We can also examine the performance of each sound class as an eect of the duration in Fig. 4.12. The overall average recognition rates were 77%, 82%, and 85% for an audio clip duration of 2, 4 and 6 seconds, respectively. There is a larger dierence in the rates between 2 and 4 seconds, but less between 4 and 6 seconds. A longer duration permits the listener more opportunities to pick up prominent sounds within each clip. However, the duration eect becomes less important as it passes a certain threshold. 54 One of the main reasons for misclassication was due to misleading sound events. For example, the scene Street with trac was recorded with dierent types of traf- c, which was frequently recognized as Inside moving vehicles, and vice versa. The recordings from Inside moving vehicles consist of dierent vehicles passing, which included a variety of vehicles like passenger sedans, trucks, and buses. Another reason for misclassication arises from the similarity between two dierent sounds and the inability of human ears to separate them. For example Ocean waves ac- tually sounds very similar to that of Train passing. Another problem comes from subjects' unfamiliarity of a particular scene. For example, some users reported that they have never set foot inside a casino. 
Thus, the sound event Inside casino was mislabeled by them as Inside restaurant due to the crowd type of the ambient sound. The confusion matrix for the test is given in Table 4.6. The rows of the matrix are the presented environmental scenes while the columns describe the subject responses. All values are given in percentages. Confusion between scenes was most noticeably high between Street with police car and Streets with ambulance, between Raining and Running water, and between Street with trac and Inside moving vehicles. The highest o-diagonal value occurs when Streets with police car is recognized as Street with ambulance. Confusion between sirens from police cars and ambulance was not due to the actual discrimination between the two sound classes but rather some people were semantically confused between the two sirens. In other words, the discrimination between the two classes requires background knowledge of subjects. Many users reported afterwards that they were second guessing the type of emer- gency vehicles that sirens were originating from. Confusion also occured between scenes that are lled with crowded people, such as Inside restaurant and Inside casino. 55 Besides recognition accuracy, we are also interested in the relationship between the user condence level and the audio clip duration. The results are shown in Fig. 4.13. If we compare Fig. 4.12 and Fig. 4.13, a lower condence translates to a lower recognition rate, and vice versa. The condence of listeners increases as we extend from 2 to 4 seconds, but there is only a slight increase from 4 to 6 seconds. The average condence for each class, out of a possible 5, is 3.7, 4.2, and 4.4 for 2 seconds, 4 seconds, and 6 seconds, respectively. The lowest scores with the largest discrepancy between 2 and 4 seconds comes from the pair of Waves and Street with trac. In general, a higher condence is displayed with audio clips that are longer than 2 seconds. The listening test shows that human listeners were able to correctly recognize 82% of ambient environment sounds for a duration of 4 seconds. Under the con- dition of 4 second clips, our automatic recognition system achieved a rate of 83%, which demonstrates that our recognition system has comparable performance to that of human listeners. The results of our listening test and those in (Eronen et al., 2006) are dissimilar. As indicated in the studies in (Eronen et al., 2006), their results were higher for humans than that obtained from the computer system. Whereas in our case, the results were fairly similar between human and computer recognition. One possible reason for the dierences is that their experimental setup was dierent than the one presented here, most notably in the length of the data presented to the subjects. The data presented to the users in our setup are the same segments as used in our automatic classication system, which was 4 seconds long, while the samples in Eronen's experiments were 30 seconds to 1 minute long. Given that humans may have prior knowledge to dierent situations that can be advantageously used in classication, allowing them a much longer time to listen to the audio sample increases the likelihood that they would nd some audio cue within each segment as to the environmental context in question. 56 Figure 4.13: User condence in the listening test. 
4.10 Conclusion This chapter reports a novel feature extraction method that utilizes matching pur- suit (MP) to select a small set of time-frequency features, which is exible, intu- itive and physically interpretable. MP features can classify sounds where the pure frequency-domain features fail and can be advantageous combining with them to improve the overall performance. Extensive experiments were conducted to demon- strate the advantages of MP features as well as joint MFCC and MP features in environmental sound classication. The experimental results show promising per- formance in classifying fourteen dierent audio environments, and shows compara- ble performance to human classication results on a similar task. Our work provides competitive performance for multi-audio category environment recognition using a comprehensive feature processing approach. 57 Chapter 5 A Semi-Supervised Learning Approach to Online Audio Background Detection 5.1 Introduction The background in an ambient auditory scene can be considered as something recurring, and noise-like, which is made up of various sound sources, but changing over time, i.e., trac and passers-by on a street. In contrast, the foreground can be viewed as something unanticipated or as a deviation from the background model, i.e., passing ambulance with siren. The problem arises when identifying foreground existence in the presence of background noise, given the background also changes with a varying rate, depending on dierent environments. If we create xed models with too much prior knowledge, these models could be too specic and might not do well with new sounds. On the other hand, models that do not consider prior knowledge, such as unsupervised techniques, typically use simple methods of thresholding (H arm a et al., 2005). Then, we would be led to the problem of threshold determination. This would be impractical in unstructured environments since there are no clear boundaries between dierent types of sounds. Systems employing learning techniques, such as in (Ellis, 2001), build models explicitly for specic audio events, making it in exible to new events. The state-of- the-art approaches in background modeling (Moncrie et al., 2007) do not make any 58 assumptions about the prior knowledge of the location and operate with ephemeral memory of the data. The method proposed in (Moncrie et al., 2007) models the persistency of features by dening the background as changing slowly over time and assuming foreground events to be short and terse, e.g., breaking glass. The problem arises when the foreground is also gradual and longer lasting, e.g., plane passing overhead. In this case, it would adapt the foreground sound as background, since there is no knowledge of the background or foreground. It would be dicult to verify whether the background model is indeed correct or models some persistent foreground sound as well. In this work, we consider modeling and detecting background and foreground sounds by incorporating explicit knowledge of data into the process. We propose to include audio prediction models as a procedure to learn the background and foreground sounds. Our framework is comprised of two modules, each addressing a separate issue. First, we use a semi-supervised method to train classiers to learn models for the foreground and background of an environment. Then, we use the learned models as a way to bootstrap the overall system. A separate model is constructed to detect the changes in the background. 
It is then integrated together with audio prediction models to decide on the nal background/foreground (BG/FG) determination. 5.2 Semi-Supervised Learning with Audio We begin our study by building prediction models to classify the enivironment into foreground and background. To obtain classiers with high generalization ability, a large amount of training samples are typically required. However, labeled samples are fairly expensive to obtain while unlabeled natural recordings consisting of envi- ronmental sounds are easy to come by. Thus, we investigate ways to automatically label them using self-training (Nigam et al., 2000) to increase the training example 59 size. In self-training, a classier for each class is trained on a labeled data set. Then, it labels these unlabeled examples automatically. The newly labeled data are added to the original labeled training set, and the classier is rened with the augmented data set. After the above steps, we train a multivariate probability density model to rec- ognize the background and a separate one for the foreground, where the expectation maximization (EM) approach is used to estimate the parameters of the model. We use the augmented EM approach (EM-) in (Nigam et al., 2000), where parame- ters are estimated using both labeled and unlabeled samples. The contribution of unlabeled samples are weighted by a factor 0 1, which provides a way to reduce the in uence of the use of a large amount of unlabeled data (as compared to the labeled data). It also makes the algorithm less sensitive to the newly labeled data. With the standard EM, we maximize the M-step using: l c (jX) =log(P ()) + X x i X log K X i=1 j P (X i j j ); (5.1) where there are N training data, X, with class labels y i K. When unlabeled data x i X u are incorporated into the labeled data x i X l , the new training set becomes X =X l [X u . Then, to maximize the M-step in EM-, we have l c (jX) =log(P ()) + X x i X l log K X i=1 j P (X i j j ) + X x i X u log K X i=1 j P (X i j j ) ! ; (5.2) which results in the following parameter estimation: P (Xjk j ;) = 1 + P jXj i=1 P (y i =k j jx i ) P jXj i=1 (i)P (y i =k j jd i ) (5.3) with the prior as 60 Table 5.1: Classication results using self-training. (in %) Data set EM EM- Coee room 88.7 92.5 Courtyard 76.1 81.0 Subway platform 73.7 86.5 P (k j j) = 1 + P jXj i=1 (i)P (y i =k j jx i ) jKj +jX u j +jX u j ; (5.4) and a weighting factor (i), dened as (i) = 8 > < > : ; if x i D u ; 1; if x i D l : (5.5) If none of the classiers found the unlabeled sample to be probable (e.g., probabil- ities are low, say, less than 15%), we assign the unlabeled data to the foreground classier since it is more likely that the unseen data sample is part of the foreground model. To demonstrate the eectiveness of this semi-supervised training approach for our dataset, we compare results between the usual EM approach and the EM- approach. The experimental setup is described in Sec. 5.4. After using the self- training method to label a large amount of unlabeled audio data, we re-train models P fg and P bg for the background determination process. Experimental results are summarized in Table 5.1. 5.3 Online Adaptive Background Detection Once prediction models, P fg and P bg , are learned for foreground (FG) and back- ground (BG) classication, we utilize these models in the online adaptive back- ground detection process. The initial background modeling work was done for video (Stauer & Grimson, 1999), which uses the mixture of Gaussians for each pixel. 
5.3 Online Adaptive Background Detection

Once prediction models P_fg and P_bg are learned for foreground (FG) and background (BG) classification, we utilize these models in the online adaptive background detection process. The initial background modeling work was done for video (Stauffer & Grimson, 1999), which uses a mixture of Gaussians for each pixel. Instead of modeling the pixel process, Moncrieff et al. (Moncrieff et al., 2007) propose to model the audio feature vector with Gaussian mixture densities. Our adaptation is based on the latter. The resultant algorithm is summarized below.

The history of feature vectors can be viewed as {x_1, x_2, ..., x_t}, where each x_t is modeled by a mixture of K Gaussian distributions. The probability of observing the current x_t is given by

P_online(x_t) = \sum_{i=1}^{K} \omega_{i,t} P(x_t | \theta_{i,t}).

That is, x_t is represented by the components of the mixture model. Since x_t varies over time, the mixture models would have to be re-trained at every time t to maximize the likelihood of X. Instead, we use an online K-means approximation algorithm. Every x_t is checked against the existing K Gaussian distributions to determine if a match is found. The k-th component is viewed as a match if x_t is within 2.5 standard deviations of the mean of that distribution, as done in (Moncrieff et al., 2007). If none of the distributions qualifies, the least probable distribution is replaced by the current observation x_t as the mean value, with an initially high variance and a low prior. The prior weights of the components are adjusted as

\omega_{k,t} = (1 - \alpha_\omega) \omega_{k,t-1} + \alpha_\omega M_{k,t},    (5.6)

where M_{k,t} is 1 for matched models and 0 for mismatched ones, and \alpha_\omega is the learning rate, which determines the rate of adaptation of the background model. After the approximation, the priors are re-normalized (summing to 1), which decreases the weight of the models that are not matched. The parameters of unmatched models remain the same, while matched models update their parameters with the new observation x_t as

\mu_t = (1 - \rho) \mu_{t-1} + \rho x_t,
\Sigma^{i,j}_t = (1 - \rho) \Sigma^{i,j}_{t-1} + \rho (x^i_t - \mu^i_t)(x^j_t - \mu^j_t),

where

\rho = \alpha_g \exp\left(-\tfrac{1}{2d} (x_t - \mu_{t-1})^T \Sigma^{-1}_{t-1} (x_t - \mu_{t-1})\right)

is a second, data-dependent learning rate and \alpha_g determines the update rate of the model parameters. Updating in this manner does not destroy the existing models when new data arrive; they remain in the overall system (while having their prior weights decrease). The model with the lowest prior weight becomes the least probable model, which will then be replaced by the next unmatched observation.

From here on, we deviate from (Moncrieff et al., 2007) by including both P_fg and P_bg in the system. For updating P_fg and P_bg, we continue to use Eqs. (5.2)-(5.4), but in a sliding-window fashion. As the system runs online, if we permitted an ever-increasing amount of unlabeled data to be included in the model, errors from the self-training process would propagate. To regulate possible concept drift (as trends and patterns tend to change over time), we use a simple sliding window, where only unlabeled data within a window of size m is utilized in the self-training process. Any unlabeled data outside the window is not used. Furthermore, to avoid re-training the models at every second, we perform the self-training process at every m/4 interval; that is, we remove the oldest m/4 of the data from the window to include newly arrived unlabeled data. For example, if m = 120 samples, we perform self-training every 30 samples. In addition, we attempt to keep P_fg reflecting the current concept by placing more emphasis on the most recent data samples, using a new weighting \alpha_fg for P_fg defined as

\alpha_{fg}(i) = \lambda \tfrac{z_i}{m} if x_i \in X_u, and \alpha_{fg}(i) = \lambda if x_i \in X_l,    (5.7)

where z_i \in \{z_1, z_2, ..., z_m\} indexes the position of x_i within the window, with z_1 and z_m being the beginning and the end of window m, respectively.
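The online update loop can be summarized in a short sketch, shown below with diagonal covariances for compactness. The class name, the per-dimension application of the 2.5-standard-deviation match test, and the exact kernel used for the data-dependent rate rho are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np

class OnlineBackgroundMixture:
    """Online K-means-style approximation of a K-component Gaussian mixture
    over the incoming per-second feature vectors x_t (Sec. 5.3)."""

    def __init__(self, dim, K=4, alpha_w=0.01, alpha_g=0.01, init_var=10.0):
        self.K, self.alpha_w, self.alpha_g, self.init_var = K, alpha_w, alpha_g, init_var
        self.w = np.full(K, 1.0 / K)              # prior weights omega_k
        self.mu = np.zeros((K, dim))
        self.var = np.full((K, dim), init_var)    # diagonal covariances

    def update(self, x):
        dim = x.size
        d2 = (((x - self.mu) ** 2) / self.var).sum(axis=1)   # squared Mahalanobis distance
        matched = d2 < (2.5 ** 2) * dim                      # 2.5-sigma rule, per dimension on average
        if not matched.any():
            # replace the least probable component with the new observation:
            # mean = x, high initial variance, low prior weight
            k = int(np.argmin(self.w))
            self.mu[k], self.var[k], self.w[k] = x, self.init_var, 0.05
            self.w /= self.w.sum()
            return k
        k = int(np.argmin(np.where(matched, d2, np.inf)))    # best matching component
        M = np.zeros(self.K)
        M[k] = 1.0
        # Eq. (5.6): update and re-normalize the prior weights
        self.w = (1.0 - self.alpha_w) * self.w + self.alpha_w * M
        self.w /= self.w.sum()
        # data-dependent second learning rate rho for the matched component
        rho = self.alpha_g * np.exp(-0.5 * d2[k] / dim)
        self.mu[k] = (1.0 - rho) * self.mu[k] + rho * x
        self.var[k] = (1.0 - rho) * self.var[k] + rho * (x - self.mu[k]) ** 2
        return k

# usage: stream the per-second feature vectors through the model
bg_mixture = OnlineBackgroundMixture(dim=20)
for x_t in np.random.randn(300, 20):        # stand-in for MFCC/MP feature vectors
    bg_mixture.update(x_t)
```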
For BG/FG classification, we rank the distributions by the prior weights \omega_online and \omega_bg, \omega_fg (from P_bg and P_fg, respectively, depending on P(k | x_t, \theta)). Since we use separate models for the foreground and background, we normalize their priors to 1. For classification, we order the Gaussians by their values of \omega_online + \omega_s, where \omega_s = \omega_bg if P_bg(\mu_t) \geq P_fg(\mu_t) and 0 otherwise. The distributions chosen as the foreground model are obtained via

FG = \left[ \sum_{k=1}^{k_{highest}} (\omega_{online,k} + \omega_{s,k}) \right] > T,

where k_highest is the highest-ranked model and k = 1 the lowest ranked. T is the tolerance threshold for background classification; a lower T will result in more distributions being classified as background. Models not chosen as foreground are considered background models.

We use a heuristic to perform the final classification. We utilize a queue to keep track of the results at each time t from either a background or foreground classification of P_{bg|fg}(x_t, x_{t-1}, ...). If there is no change between the classifications of the previous distribution P^{t-1}_online and the current one, P^t_online, we append the result P_{bg|fg}(x_t) to the queue and make the classification at t by taking a majority vote over the result queue at t, t-1, ..., t-q, where q is the queue size. Using a queue to remember the results allows for some misclassification in the prediction process. When there is a change between P^{t-1}_online and P^t_online, we examine P_{bg|fg}(x_t) and P_{bg|fg}(x_{t-1}). If they are consistent, we also take a majority vote. Otherwise, we take on the classification P_{bg|fg}(x_t) at t. Whenever there is a change in classification, we clear the queue of results from t-1, ..., t-q.
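The sketch below gives one possible reading of this decision rule: split the ranked components into foreground and background using the tolerance T, then smooth the per-second labels with a majority-vote queue. The way the cumulative-weight threshold is applied and all names are illustrative assumptions; they follow the behavior stated above (a lower T leaves more distributions in the background) rather than a verified implementation.

```python
from collections import deque

import numpy as np

def split_fg_bg(w_online, w_side, T=0.5):
    """w_side holds omega_bg for components better explained by P_bg and 0 otherwise.
    Components are ranked by combined weight; the top-ranked mass up to T is marked
    as foreground, so a lower T classifies more distributions as background."""
    score = w_online + w_side
    order = np.argsort(score)[::-1]          # highest ranked first
    cum = np.cumsum(score[order])
    fg = np.zeros(len(score), dtype=bool)
    fg[order[cum <= T]] = True
    return fg                                 # True = foreground component

class SmoothedDecision:
    """Majority vote over the last q per-second BG/FG labels (Sec. 5.3)."""

    def __init__(self, q=10):
        self.queue = deque(maxlen=q)
        self.prev_component = None
        self.prev_label = None

    def _majority(self):
        return max(set(self.queue), key=list(self.queue).count)

    def step(self, label, component):
        changed = self.prev_component is not None and component != self.prev_component
        if changed and label != self.prev_label:
            # inconsistent change: trust the current prediction and restart the vote
            self.queue.clear()
            self.queue.append(label)
            decision = label
        else:
            self.queue.append(label)
            decision = self._majority()
        self.prev_component, self.prev_label = component, label
        return decision
```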
5.4 Data and Experiments

To demonstrate the effectiveness of the proposed online background modeling algorithm, the following three environment sounds are used in the test:

1. Coffee room: Background includes footsteps, shuffling of things, and people coming in and out. Foreground includes coffee grinding and brewing, printing sounds (since a printer is located in the coffee room), etc.
2. Courtyard: Background includes a water fountain, distant talking from passers-by, and traffic from nearby streets. Foreground includes planes passing overhead, cellphones ringing, loud talking, footsteps, etc.
3. Subway station platform: Background includes passer-by noise and talking, and trains in the distance. Foreground includes train arrivals/departures, trains braking, announcements, etc.

These sets consist of the ambient noise of a particular environment, composed of many sound events. We do not consider each constituent sound event individually, but rather the many properties of each environment. We use continuous, unedited audio streams as the training and testing data. The first two data sets were collected and recorded in mono-channel, 16 bits per sample, with a sampling rate of 44.1 kHz and of varying lengths, in the Electrical Engineering building of the University of Southern California. They were taken at various times over a period of two weeks, with durations averaging around 15 minutes each. The subway station set also consists of recordings of natural (unsynthesized) sound clips, which were obtained from (Fre, ) and down-sampled to a 22.050 kHz sampling rate.

The incoming audio signal was segmented into fixed-duration 1-second clips. Every second of the data sets was manually labeled. Features were computed for every 1-second window by averaging those from a 30-msec rectangular sampling window with 5-msec overlap. They were calculated for each clip and combined to form those of the current clip x_t. We use two types of features: 1) Mel-frequency cepstral coefficients (MFCC), widely popular in speech and audio processing, and 2) MP-features (Chapter 4). MP-features utilize the matching pursuit (MP) algorithm and a dictionary of Gabor atoms to learn the inherent structure of each type of sound and select a small set of joint time-frequency features. This method has been shown to be robust for classifying sounds where pure frequency-domain features fail, and it can be advantageous to combine it with MFCC to improve the overall performance. Examples of MP-features are given in Fig. 5.1. The feature vector used in this work contains a combination of MFCC and MP-features. For details of feature extraction and extensive experiments, we refer to Chapter 4.

Figure 5.1: Decomposition of signals from 6 environments, where the top-most signal is the original, followed by the first five basis vectors, demonstrating the different underlying structures of various environments; an MP-based algorithm picks up these top basis vectors and represents each environment uniquely.

For evaluation, the BG/FG classification was compared with the labeled testing sets. We had 40 minutes of data each for the Coffee room and Courtyard; we were only able to obtain 15 minutes for the Subway station. The data were divided into 4 sets for each class. We used 2 sets as unlabeled data and 1 set as labeled data in the self-training process. The last subset was used for testing in the final online BG/FG determination. Results were taken as the average of six trials (from different permutations of the 4 sets). The data were segmented into 1-second segments but analyzed in sequence, where each segment was considered a sample. The accuracy of the detection is calculated by

BG accuracy = \frac{N_{y=bg}}{N_{total} - N_{fg}},

where N_{y=bg} is the number of samples classified as BG, N_{fg} is the number of FG samples that are correctly classified, and N_{total} is the total number of samples.

We calibrated the parameter values for each dataset to produce better overall results. The weighting factor λ = 0.5 was set to reduce the sensitivity to unlabeled data. The threshold T = 0.5 was the tolerance for determining the distributions considered as BG. \alpha_g and \alpha_\omega were set to 0.01 in the experiments. The sliding window size m was set to 120 samples. Based on the observations in (Chu et al., 2009a), the setting for the MP-features was chosen for the Gabor function with the following parameters: s = 2^p (1 \leq p \leq 8), u = {0, 64, 128, 192}, \omega = K i^{2.6} (with 1 \leq i \leq 35 and K = 0.5/35^{2.6}, so that the range of \omega is normalized between 0 and 0.5), \phi = 0, and the atom length truncated to N = 256. Thus, the dictionary consists of 1120 = 8 x 35 x 4 Gabor atoms, generated using scales of 2^p and translated by quarters of the atom length N.
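The Gabor dictionary with these settings can be generated directly; the sketch below builds the 1120 atoms (8 scales x 35 frequencies x 4 translations, truncated to N = 256 samples). The exact windowed-cosine form of the atom is an assumption carried over from the description of MP-features in Chapter 4.

```python
import numpy as np

def gabor_atom(N, s, u, omega, phi=0.0):
    """Unit-norm Gabor atom: Gaussian window of scale s centered at u, modulated
    by a cosine of normalized frequency omega and phase phi."""
    n = np.arange(N)
    g = np.exp(-np.pi * ((n - u) / s) ** 2) * np.cos(2 * np.pi * omega * (n - u) + phi)
    return g / np.linalg.norm(g)

def build_dictionary(N=256):
    scales = [2 ** p for p in range(1, 9)]          # s = 2^p, 1 <= p <= 8
    shifts = [0, 64, 128, 192]                       # u: quarter-length translations
    K = 0.5 / (35 ** 2.6)                            # normalizes omega into (0, 0.5]
    freqs = [K * (i ** 2.6) for i in range(1, 36)]   # 35 non-uniformly spaced frequencies
    return np.array([gabor_atom(N, s, u, w)
                     for s in scales for w in freqs for u in shifts])

D = build_dictionary()
assert D.shape == (1120, 256)
```

Matching pursuit then correlates each windowed frame against D and retains the strongest atoms, whose parameters form the MP-features (Chapter 4).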
5.5 Results and Discussion

We compared the performance accuracy of the proposed method against the one from (Moncrieff et al., 2007) as a baseline. We refer to our approach as the combination models (CM) and to (Moncrieff et al., 2007) as the persistency-only models (PSM). The experimental results are summarized in Table 5.2. We see that both methods produce better accuracy for the background than for the foreground, since the background is more constant than the foreground and therefore easier to learn. The PSM method performs poorly on the Coffee room data since it cannot classify the long, persistent sound of a printer as foreground. At one point, the printing sound ran continuously for 49 seconds, and these segments were misclassified as background.

Table 5.2: Background detection accuracy (in %)
                         CM             PSM
  Data               FG     BG      FG     BG
  Coffee room        75.9   82.5    27.4   56.8
  Courtyard          63.5   92.1    36.7   89.9
  Subway platform    74.8   79.2    46.5   58.9

We examine a small segment of data (shown in Fig. 5.2) in more detail. In this example, the delay from PSM was about 7 seconds, while CM resulted in a 2-3 second delay. We also note that, after about 10 seconds of considering the current sound clip as foreground, the foreground distributions were soon considered part of the background process. With a quick change in the BG/FG events, PSM takes about 10-15 seconds to stabilize, depending on the update rates.

Figure 5.2: Comparison of classification results obtained by online background modeling with combination models (CM) and persistency-only models (PSM).

We observe that it is more difficult to detect the foreground segments in the Courtyard class. When a plane passed over for 16 seconds, PSM only detected 4 seconds of it, while CM detected about 10 seconds. The Subway set provides an example comprised of many short events; there were very few moments with a constant background. In this case, we observe that it was difficult for both systems to achieve high performance. However, CM still outperforms PSM. For the CM method, class determination is based on the combined effort of both the online models and the prediction models, making it less sensitive to changes in parameters. The PSM method is more sensitive to parameter changes since its classification depends on only one model.

5.6 Conclusion

In this work, we proposed a framework for audio background modeling, which includes prediction, data knowledge and persistent characteristics of the environment, leading to a more robust audio background detection algorithm. This framework has the ability to model the background and detect foreground events, as well as the ability to verify whether the predicted background is indeed the background or a foreground event that protracts over a longer period of time. Experimental results demonstrated promising performance in improving the state-of-the-art in background modeling of audio environments. We also investigated the use of a semi-supervised learning technique to exploit unlabeled audio data. It is encouraging that we could utilize more unlabeled data to improve generalization, as such data are usually cheap to acquire but expensive to label. And more often than not, we are forced to work with a relatively small amount of labeled data due to this limitation.

Chapter 6: Environmental Sound Recognition with Composite Deep Belief Network

6.1 Introduction

In this chapter, we investigate the use of a richer, generative-model-based method for classification and attempt to discover high-level representations for different acoustic environments in a data-driven fashion. Specifically, we consider the Deep Belief Network (DBN) to model environmental audio and investigate its applicability to noisy auditory data for robustness and generalization. We also propose a framework for composite DBNs as a way to represent various levels of representation. Experimental results on real data sets demonstrate its effectiveness over traditional methods.

Environmental sound contains large variance even within a single environment. It is constantly changing, but these changes and events are inconsistent. Despite these differences, humans can distinguish and contextualize them.
For example, humans can easily differentiate between sounds originating from outside on the street, inside a restaurant, a coffee shop, a train station, and so on. We could mostly agree that it is relatively easy for humans to identify something as a restaurant environment even when presented with audio recordings of varying restaurant locations, settings or events. This implies that there are commonalities between different locations of the same situation or scene. The goal of this work is to uncover these commonalities despite the noise and variance that come with environmental sounds, so that we are able to characterize and represent specific environments in a tangible manner.

Traditional methods, like Gaussian mixture models (GMM) (Bishop, 2003) and hidden Markov models (HMM) (Rabiner & Juang, 1993), are commonly used classifiers for audio classification. HMMs have been used extensively in speech. They also work well with sounds that change in duration, because the durational change can occur at a single state or across many states. Since environmental sounds, or general ambient sounds, lack the temporal or phonetic structure found in speech, there is no set alphabet into which slices of non-speech sound can be divided, making HMM-based methods difficult to apply. For explicit sound events, such as gunshots and alarms, this technique can work well, because such sound events contain similar temporal structure and each state can be used to model a different stationary part of the signal. Even if we could build models for the many different possible events that might occur in an environment, this requires knowledge of the specific events for each environment and the ability to handle many different events occurring simultaneously, which would be challenging for specific sound-event models to recognize. Therefore, GMMs are typically used for modeling the entire scene or environment without requiring knowledge of the specific sound events that might exist. They are frequently used to train models of the characterized clusters (e.g., speech (Hinton, 1995), music (Tzanetakis & Cook, 2002), and audio background (Chu et al., 2009b)). Overall, the GMM generalizes better but loses the temporal relationships that one finds with HMMs. A way to ameliorate this problem is to incorporate the temporal structure into the features, as in (Chu et al., 2009a).

Even with specialized features, such as those described in (Chu et al., 2009a), the performance of traditional audio classifiers like the GMM still deteriorates dramatically when using realistic environmental sound that may be noisy or have overlapping classes. To illustrate this point, we demonstrate a trivial example of using a GMM classifier on three types of environmental sound: Playground, Nature-daytime, and Street with traffic. We used two sets of data: A) sounds that were enforced to be distinctively different from one another, minimizing overlaps, and B) a combination of the data used in Set A along with unrestricted data from the same environment types. All three of these classes are associated with being outdoors, with Playground and Nature-daytime occurring during the daytime and Street with traffic occurring at any time of the day. The experimental setup is described in Sec. 6.3.1. We obtained a classification rate of 99.2% on Set A and 74.2% on Set B. To illustrate, in Fig. 6.1 we plot a 2-dimensional projection, obtained using PCA, of the components (means and covariance matrices) of the three sets of GMMs.
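The illustration in Fig. 6.1 can be reproduced along the following lines: fit one GMM per class and project each component's parameters (mean concatenated with its covariance) to two dimensions with PCA. The class data here are random placeholders, and the helper name and diagonal-covariance choice are assumptions for the sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def component_vectors(X, n_components=5, seed=0):
    """One row per mixture component: [mean | diagonal covariance]."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed).fit(X)
    return np.hstack([gmm.means_, gmm.covariances_])

rng = np.random.default_rng(0)
classes = {                                   # placeholders for the per-class feature matrices
    "Playground": rng.normal(0.0, 1.0, (300, 20)),
    "Nature-daytime": rng.normal(1.0, 1.0, (300, 20)),
    "Street with traffic": rng.normal(2.0, 1.0, (300, 20)),
}

vectors = {name: component_vectors(X) for name, X in classes.items()}
pca = PCA(n_components=2).fit(np.vstack(list(vectors.values())))
points_2d = {name: pca.transform(v) for name, v in vectors.items()}   # ready to scatter-plot
```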
GMMs work well for distinctively separated and homogeneous data. However, as we included the additional, less restricted data, the GMMs for these classes became more overlapped. Using GMM classifiers poses a big problem if we were to use them in a realistic situation.

To be useful in realistic situations, acoustic environment classification must be robust enough to allow for changes and noise in the environments and for the introduction of new phenomena within each environment. Theoretically, it is possible to train a classifier in a supervised manner with as much data as possible, covering all the variation of an acoustic environment, but this requires an enormous amount of training data to encompass the variability of the characteristics of all possible environments. As we observe above, unless we only include restricted data, more data will not be helpful. This is especially impractical if we want to use such a system in real-world context-aware applications. In a realistic environment, there will always be unseen data and unforeseen changes. A more robust system is necessary to allow the model to adapt to random changes.

Figure 6.1: 2-dimensional plot of the GMMs for three classes (a) using restricted / less noisy training data (left), and (b) using training data from (a) plus unrestricted / noisy data (right).

The nature of environmental sound is noisy, with a relatively large amount of overlap between the environment classes, and there is no divisible or clear structure between these environment types. Such audio requires representation by complex models, and a typical task involves many environments. Traditional methods are not robust enough to handle classes with overlaps. Non-linear classifiers like SVMs and traditional neural networks have also been used for audio classification. SVMs have been shown to provide better discrimination than other classifiers in previous studies, since they typically perform well on non-linearly separable classes (Chu et al., 2006). However, they do not scale well to long feature vectors as input or to larger numbers of classes (e.g., over 10 classes). Since they are not as efficient as GMMs or HMMs, these non-linear classifiers have been utilized to a lesser extent.

To obtain classifiers with high generalization ability, a large number of training samples containing large variation is typically required. However, the limited representational capacity of the GMM prevents us from modeling data that originate from many sources or have overlapping boundaries. There has been little research on building generative models and algorithms for this purpose. This leads us to investigate ways to build a more robust recognition system using deep architectures with multiple layers.

In recent years, there has been large interest in deep learning and in using neural networks, following the introduction of a fast, greedy layer-wise unsupervised learning algorithm by Hinton et al. (Hinton et al., 2006). Approaches have also been proposed to learn simple features in the lower levels and more complex features in the higher levels (Hinton et al., 2006; Ranzato et al., 2006; Bengio et al., 2007). The idea is to learn abstract representations of the input data in a layer-wise fashion using unsupervised learning, which can then be used as input for supervised learning in tasks such as classification and regression.
Recently, deep belief networks (DBNs) have been applied to music audio (Hamel et al., 2009) and to learning features for speech recognition (Lee et al., 2009). Ballan et al. (Ballan et al., 2009) applied DBNs to audio concept recognition and compared them with SVM classifiers, showing comparable results for the DBN, but these have not been tested with environmental sound. In this work, we investigate using a more generative-model-based method and making use of noisy data to improve generalization. The goal is to learn a more complex, higher-level representation of the environment types and to make use of unlabeled or semi-related data to improve generalization in the context of DBNs. We want to determine whether the DBN is suitable for environmental sounds and whether it improves classification over traditional audio classifiers. In this chapter, we apply DBNs to environmental audio and empirically evaluate them on classification. We use the DBN for unsupervised learning of features, as a way to discover the commonality within each type of sound. We also investigate using a composite of DBNs as a way to represent various levels of representation.

6.2 Deep Belief Networks (DBNs)

The Deep Belief Network (DBN) is a neural network constructed from many layers of Restricted Boltzmann Machines (RBMs) (Hinton et al., 2006). Previously, traditional neural networks were trained using gradient descent alone, which makes deep networks difficult or impossible to train (Bengio et al., 2007). Hinton proposed a greedy, layer-wise unsupervised pre-training phase; it is shown in (Bengio et al., 2007; Hinton et al., 2006) that this unsupervised pre-training builds a representation from which supervised learning can proceed by fine-tuning the resulting weights using gradient-descent learning (traditional neural network learning). In other words, the unsupervised stage sets the weights of the network closer to a good solution than random initialization, thus avoiding the local minima that may occur when using supervised gradient descent alone. This training strategy has subsequently been analyzed by Bengio et al. (Bengio et al., 2007), who concluded that it is an important ingredient in the effective optimization and training of deep networks. A schematic representation is shown in Fig. 6.2.

6.2.1 Restricted Boltzmann Machines

An RBM is a bipartite graph composed of a layer of stochastic visible units v and a layer of stochastic hidden units h. Many RBMs can be stacked on top of each other by linking the hidden layer of one RBM to the visible layer of the next RBM, forming a multilayer neural network. Details are given in (Hinton et al., 2006); we briefly provide a review and formulation here.

The role of an individual RBM is to model the distribution of its input. The visible units of one layer are connected to all the hidden units of the other layer, but there are no connections between units of the same layer. The joint configuration (v, h) of visible and hidden units has an energy given by

E(v, h) = - \sum_{i,j} v_i h_j w_{ij} - \sum_{i \in visible} a_i v_i - \sum_{j \in hidden} b_j h_j,

where v_i and h_j are the binary states of visible and hidden units i and j, w_{ij} are the weights, a_i are the visible unit biases, and b_j are the hidden unit biases. From the energy function, we can see that the hidden units are conditionally independent of one another given the visible layer.
Using this energy function, the probability that the model assigns to a visible vector v is

p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,h} e^{-E(u,h)}}.

Since there are no within-layer connections, the conditional distributions p(v|h) and p(h|v) are factorial and are given by

p(h_j = 1 | v) = \sigma\left(b_j + \sum_i w_{ij} v_i\right)

for the binary states h of the hidden units, where \sigma(x) = (1 + e^{-x})^{-1} is the logistic function. Once the binary states of the hidden units have been chosen, a reconstruction is produced by setting each v_i to 1 according to the conditional distribution

p(v_i = 1 | h) = \sigma\left(a_i + \sum_j w_{ij} h_j\right).

The states of the hidden units are then updated once more, so that they represent features of the reconstruction. The change in the weight between visible unit i and hidden unit j is

\Delta w_{ij} = \epsilon\left(\langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{reconstruction}\right),

where \epsilon is the learning rate, \langle v_i h_j \rangle_{data} is the expectation that visible unit i and hidden unit j are on together in the training set, and \langle v_i h_j \rangle_{reconstruction} is the corresponding expectation after at least one iteration of Gibbs sampling. A similar learning rule is applied to update the biases a_i and b_j. In theory, these parameters can be optimized by performing stochastic gradient ascent on the log-likelihood of the training data. However, computing the exact gradient of the log-likelihood is intractable. Therefore, the parameters can be approximated efficiently by contrastive divergence using Gibbs sampling, which has been shown to work well in practice (Hinton, 2002).

6.2.2 Deep Network Training

The DBN is trained in two phases. The pre-training phase considers each layer (an RBM) separately and trains the layers closest to the input layer first. It takes the output of the first layer and uses it as input to the next layer, and so forth. It uses greedy, layer-wise Contrastive Divergence (CD) pre-training for initializing the weights. The overall pre-training process is repeated several times, layer by layer, obtaining a hierarchical model in which each layer captures strong high-order correlations between its input units. This phase allows the DBN to make use of unlabeled data in an unsupervised manner. The second phase is a supervised, global fine-tuning phase that is similar to traditional neural network training. Gradient descent is used to fine-tune the parameters for optimal reconstruction of the input data.
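A minimal sketch of a single binary RBM trained with one step of contrastive divergence (CD-1), and of the greedy layer-wise pre-training that stacks such RBMs, is given below. The layer sizes mirror the configuration used later in Sec. 6.3.1, but the class and function names, mini-batch handling, and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)      # visible biases
        self.b = np.zeros(n_hidden)       # hidden biases
        self.lr = lr
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)       # p(h_j = 1 | v)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.a)     # p(v_i = 1 | h)

    def cd1(self, v0):
        """One contrastive-divergence (CD-1) update on a mini-batch of binary rows."""
        ph0 = self.hidden_probs(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
        v1 = self.visible_probs(h0)                              # reconstruction
        ph1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / n        # <v h>_data - <v h>_recon
        self.a += self.lr * (v0 - v1).mean(axis=0)
        self.b += self.lr * (ph0 - ph1).mean(axis=0)

def pretrain_stack(X, layer_sizes=(100, 100, 450), epochs=20, batch=64):
    """Greedy layer-wise pre-training: each RBM is trained on the hidden
    probabilities produced by the previous one."""
    rbms, inp = [], X
    for n_hidden in layer_sizes:
        rbm = RBM(inp.shape[1], n_hidden)
        for _ in range(epochs):
            for i in range(0, len(inp), batch):
                rbm.cd1(inp[i:i + batch])
        rbms.append(rbm)
        inp = rbm.hidden_probs(inp)
    return rbms
```

The supervised fine-tuning stage would then attach an output layer on top of the stack and adjust all weights with gradient descent, as described above.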
6.3 DBN for Environmental Sound

We begin our investigation by determining the appropriateness of the DBN as a classification method for environmental sounds, compared with traditionally used classifiers. We provide an empirical evaluation on twelve different types of commonly encountered environmental sounds. In the first experiment, we investigate the performance of the DBN and the GMM for audio classification. In the second task, we explore the use of noisy and unlabeled data to improve generalization.

6.3.1 Experimental Setup

The environment types were chosen so that they are made up of the ambient sounds of a particular environment, composed of many sound events. We do not consider each constituent sound event individually, but rather the many properties of each environment. We used recordings of natural (unsynthesized) sound clips obtained from (Sou, 1992). Our auditory environment types were chosen so that they consist mostly of non-speech and non-music sounds. Depending on the environment, a clip might contain muddled conversations within a big crowd or faint music; it is essentially the background noise of a particular environment, composed of many sound events. The twelve environment types considered were: 1) Inside casino, 2) Playground, 3) Nature-nighttime, 4) Nature-daytime, 5) Inside restaurants, 6) Next to rivers/streams, 7) Train passing, 8) Inside vehicles, 9) Raining, 10) Street with traffic, 11) Ocean waves, and 12) Thundering.

We use two types of features: Mel-frequency cepstral coefficients (MFCC) and Matching Pursuit (MP) features (Chu et al., 2009a). MFCC is the most common feature representation for audio and has been shown to work relatively well for speech and music, but its performance degrades in the presence of noise. MFCCs have been demonstrated to be ineffective in analyzing noise-like signals that have a flat spectrum, such as the chirping of insects and the sound of rain. Previous research on audio features has shown that using specialized features for environmental sounds is successful in aiding the classification of unstructured environmental sound (Chu et al., 2009a). MP-features, a feature extraction method specifically proposed for environmental sounds, attempt to characterize each audio clip as being composed of only a few atoms. This high-level representation can then be applied to classification tasks.

We would like to use both types of features in this work: MP-features and MFCCs. Because MP-features are discrete values and MFCCs are continuous, we first need to discretize the MFCCs, which provides a simple way to work with both types of features. We discretize the values of the MFCCs by creating breakpoints, or intervals, based on the distribution of the data; a different set of breakpoints should be learned for each variable or coefficient. We use the equal-frequency discretization method in this work (Dougherty et al., 1995). This discretization method is based on the PDF estimates of the variables: it works by dividing the range into b bins of equal frequency. The method is less susceptible to outliers, and the intervals are closer together in regions with more elements and farther apart in sparsely populated regions, which represents the distribution of each variable better than the equal-width method. The user must specify the number of bins b.
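A minimal sketch of this equal-frequency binning is shown below: the breakpoints for each coefficient are placed at the empirical quantiles of the training data, so that each of the b bins holds roughly the same number of samples, and the same breakpoints are reused for test data. The function names and the choice of b are illustrative.

```python
import numpy as np

def equal_frequency_breakpoints(values, b):
    """b - 1 breakpoints at the empirical quantiles of a single coefficient."""
    return np.quantile(values, np.linspace(0.0, 1.0, b + 1)[1:-1])

def discretize(X, b=10, breakpoints=None):
    """Map each column of X to bin indices 0..b-1; learn the breakpoints on the
    training data and reuse them for the test data."""
    if breakpoints is None:
        breakpoints = [equal_frequency_breakpoints(X[:, j], b) for j in range(X.shape[1])]
    codes = np.column_stack([np.searchsorted(bp, X[:, j])
                             for j, bp in enumerate(breakpoints)])
    return codes, breakpoints

# learn the bins on training MFCCs, then apply the same bins to the test MFCCs
train_mfcc = np.random.randn(1000, 12)    # stand-in for 12 MFCCs per segment
test_mfcc = np.random.randn(200, 12)
train_codes, bps = discretize(train_mfcc, b=10)
test_codes, _ = discretize(test_mfcc, breakpoints=bps)
```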
The sound clips used are of varying lengths (1-3 minutes long); they were divided into 3-second segments and down-sampled to a 22,050 Hz sampling rate, mono-channel and 16 bits per sample. Each 3-second segment makes up one instance for training/testing. The audio was analyzed using a rectangular window of 512 points (23.3 msec) with 50% overlap. We represented the audio using 12 MFCCs and MP-features. We discretized the features using the equal-frequency discretization method from (Dougherty et al., 1995), which resulted in inputs of dimension 174. We kept the features of segments originating from the same source separate from one another. In this setup, none of the training and test items originated from the same source. Since the recordings were taken from a wide variety of locations, the ambient sound may have a very high variance. The only preprocessing we performed on the data was verifying that the recordings were not saturated and removing the silent parts from the beginning and end of the files. The data used consisted of three different sets:

Set A: Samples were selected to be more homogeneous within each type of environment. The samples were also chosen so that each type of sound is distinctively different from the others, minimizing overlaps as much as possible. Each environment contained at least four separate source recordings, and segments from the same source file were considered a batch. We use three batches for training and one batch for testing, which leads us to perform a 4-fold cross-validation over the features.

Set B: There are around 10-15 files for each environment. These files are related to the same environments as in Set A, but they are more diverse sounds of the same class, which makes them farther away in terms of similarity. There are fewer restrictions on the data, such as homogeneity within each class or minimal overlap between types.

Set C: Consists of about 2 hours of background sounds of different environments. These are unlabeled and are mainly used in the pre-training phase for some of the experiments.

We use a DBN with three hidden layers, with 100 hidden units for the first and second layers and 450 hidden units for the third layer. The input layer consists of 174 units, which corresponds to the dimension of the features, and there are twelve output units for the number of target classes. We use a learning rate of 0.1. A schematic representation is given in Fig. 6.2.

Figure 6.2: Confusion matrix for GMM using Sets A and B (in percentage of misclassification)

6.3.2 Classification

To analyze the performance of the DBN for environmental sounds, we experimented with various settings and compared the results to those obtained using the GMM. For the GMM classifiers, we use 5 mixtures throughout all of our experiments. Note that it is possible to tailor the number of mixtures to each class of data; the improvement proved negligible, however, and doing so causes the classifier to specialize to the training data. Therefore, we decided to use a fixed number of mixtures throughout all of our experiments.

In the first experiment, the DBN is first pre-trained with the training set in an unsupervised manner. We use the same training set in the next stage, where supervised fine-tuning of the weights is performed. Since there is no pre-training option for the GMM, this method furnishes a fairer comparison between the two methods. We compared results on two classification tasks: 1) using only Set A, and 2) using both Sets A and B. The results are shown in Table 6.1.

Table 6.1: Classification accuracies comparing DBN and GMM (in %)
  Data set used    DBN    GMM
  A                67.9   64.8
  A + B            79.6   41.7

As expected, the more complex DBN model performs slightly better than the GMM. We can see that including data from Set B increased the classification accuracy of the DBN, but the opposite occurred in the case of the GMM, which has difficulty handling the extra data. To understand this, let us observe the classifications in more detail. Sample confusion matrices of the classifications are given in Figs. 6.3 and 6.4. We observe that adding the extra training data of Set B to the DBN method improved the performance for eight classes, but reduced it for three classes, by 5-23%. The improvement was most significant for Near river, from 10% to 95%, and for Restaurant, from 20% to 87.5%. For the GMM, the performance on Casino, Nature-nighttime and Ocean waves was reduced to 0%.
The misclassifications seem to gravitate toward the classes Near river, On vehicle, and Raining. The bias might come from the fact that these three classes are somewhat more constant and homogeneous sounding, particularly Near river and Raining. GMMs are biased toward categorizing the test data into them, as compared to models that are more diverse. Clearly, using Sets A and B for training provides better overall performance for the DBN. By introducing variability into the data, the performance for certain classes might degrade a bit, but the degradation is limited, and the increase in recognition ability for the other classes outweighs it. Even when additional, possibly noisy information is introduced, the DBN's performance does not suffer as much, and it degrades less dramatically than the GMM's.

Figure 6.3: Confusion matrix for classification with GMM using (a) Set A (left), and (b) Sets A and B (right), in percentage (less than 5% is ignored)

Figure 6.4: Confusion matrix for classification with DBN using (a) Set A (left), and (b) Sets A and B (right), in percentage (less than 5% is ignored)

6.3.3 Unlabeled Data for Pre-training

In the next experiment, we wish to investigate the use of unlabeled data for generalization. In this setup, the DBN is first pre-trained with unlabeled data from Set C, and then we use Sets A and B in the supervised training stage. We compared the amount of unlabeled data and the amount of pre-training, in terms of iterations, necessary to improve classification. For the first task, we kept the number of epochs constant at 20 and varied the amount of unlabeled data used. The results are given in Fig. 6.5. The best performance was produced when the DBN was pre-trained with about 1000 3-second samples, yielding a classification rate of 80.4%. As Fig. 6.5 shows, the effect of the amount of unlabeled data on the classification accuracy of the overall system is almost negligible.

In the next task, we used 1000 samples throughout the experiment and varied the number of epochs. The results are given in Fig. 6.6. The best performance was produced when the DBN was pre-trained with the unlabeled data for about 20 epochs; the classification accuracy was 81.1%. Both produced a slight improvement over the 79.6% obtained before. However, we also observed that if we increase either the number of epochs or the amount of unlabeled data used, the performance degrades, which signifies a trade-off between the amount of pre-training and the resulting performance. The interesting finding here is that by using only Set C, rather than the actual training data, in pre-training, the system outperforms the one using the same data for both the pre-training stage and the supervised training stage. Therefore, it is possible to use unlabeled data to improve generalization and to speed up network convergence, although this requires some fine-tuning.

Figure 6.5: Effect of the amount of unlabeled data used in pre-training on classification

Figure 6.6: Effect of the amount of pre-training with unlabeled data on classification

6.4 Composite DBNs

Learning a hierarchy of sound types might improve and clarify problems caused by the confusion of acoustic environments with similar characteristics, for example, a restaurant and a shopping mall. Both share the characteristics of being indoors and in a crowd with people talking, but the restaurant contains the clanking of utensils, whereas in a mall there might be footsteps or the shuffling of feet.
The use of suitable hierarchies would also allow us to assign confusing samples to a more general class, for example, grouping Near river and Raining into a fluidic class, and grouping Restaurant and Casino into an indoor-crowded-place class. It allows environments to be grouped using the heuristic that a good grouping of two classes should maximize the number of misclassifications shared between them. Therefore, we wish to investigate a method for learning audio structural models for general environments utilizing a hierarchy of sound types. To some extent, there has been work on coming up with such a hierarchy. In (Sundaram & Narayanan, 2007), the authors partition the data into signal representations according to their perceptual audio attributes, for example grouping together machine-generated noise sources (such as vehicle noise, engine noise, printers, fax and telex machines, etc.) versus natural noise sources (rainfall, waves on a seashore, blowing wind, etc.).

6.4.1 Initial Evaluation

In order to determine whether a hierarchy of DBNs would work, the first step is to build the overall system manually. We need to evaluate the performance of the DBN with higher-level combined classes. In addition, building one manually provides a guideline as to what such a hierarchy might look like if it were constructed automatically. We begin by identifying and mapping out the misclassifications. Fig. 6.7 illustrates the confusions between classes, giving us a direction for categorizing certain groups together. Each circle signifies a class, and the arrows signify the direction in which the misclassification occurred; for example, class A -> class B indicates that the actual class is A but it was misclassified as class B. The values next to the arrows are the percentages of misclassifications; only errors above 15% are listed.

Figure 6.7: Average misclassification errors between the different environments

This furnishes us with some insight into where the difficulties might lie within these twelve environments. For example, we can see that there is high confusion (or, in other words, similarity) between the pairs: 1) Casino and Restaurant, 2) Playground and Restaurant, 3) On vehicle and Street with traffic, 4) Playground and Near river, and 5) Casino and Train passing. The others are more one-way confusions/misclassifications, which makes them a bit easier to separate.

The higher-level groupings were created using the higher percentages of errors from Fig. 6.7 as the criterion. We examined many different combinations to determine how each would perform. After evaluation, we came up with the following possible configuration. At the first level, we subdivide into 5 classes, consisting of a) Casino, Playground and Restaurant, b) Nature-nighttime and Nature-daytime, c) Near river and Raining, d) Train passing, Ocean waves and Thundering, and e) On vehicle and Street with traffic. By combining classes like Casino, Playground and Restaurant, we combine all the classes characterized by a crowded ambiance with human voices. Combining Near river and Raining gives us all the running-water types of sounds, and placing Nature-nighttime and Nature-daytime together provides a set with constant higher frequencies and pitch, since both are located in a forest and are composed mainly of insects and birds. Although Train passing, Ocean waves and Thundering seem unlikely to be placed together, both Thundering and Ocean waves have a slight, slow, crashing type of sound, and the train-passing sound we used was mostly freight trains, which also give a loud, rumbling type of sound.
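One way to turn confusion statistics such as those in Fig. 6.7 into candidate groupings is sketched below: keep class pairs whose confusion exceeds a threshold in either direction and take the connected components of the resulting graph as high-level groups. The confusion values, the 15% threshold, and the function name are placeholders for illustration.

```python
import numpy as np

def confusion_groups(conf, names, threshold=0.15):
    """conf[i, j]: fraction of class i samples misclassified as class j."""
    n = len(names)
    strength = np.maximum(conf, conf.T)          # confusion in either direction
    np.fill_diagonal(strength, 0.0)
    adjacent = strength >= threshold
    groups, seen = [], set()
    for start in range(n):                       # connected components over the graph
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            k = stack.pop()
            if k in comp:
                continue
            comp.add(k)
            stack.extend(j for j in range(n) if adjacent[k, j] and j not in comp)
        seen |= comp
        groups.append([names[k] for k in sorted(comp)])
    return groups

names = ["Casino", "Playground", "Restaurant", "On vehicle", "Street traffic"]
conf = np.array([[0.00, 0.05, 0.20, 0.00, 0.00],      # illustrative error rates
                 [0.06, 0.00, 0.18, 0.00, 0.00],
                 [0.17, 0.16, 0.00, 0.00, 0.00],
                 [0.00, 0.00, 0.00, 0.00, 0.22],
                 [0.00, 0.00, 0.00, 0.19, 0.00]])
print(confusion_groups(conf, names))
# [['Casino', 'Playground', 'Restaurant'], ['On vehicle', 'Street traffic']]
```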
Fig. 6.8 illustrates a possible hierarchical configuration of the overall system. To create this composite of DBNs, we trained a different DBN for each of the high-level classes. In Fig. 6.8, each box with more than one target class can be considered a separate DBN. At the very top level, we have a 5-class DBN that classifies the input data into the five high-level classes. For example, Casino, Playground and Restaurant together are regarded as one class at that level, and Nature-nighttime and Nature-daytime together are considered another class. Let us further consider the example of the left-most grouping of Casino, Playground and Restaurant. We trained a 2-class DBN to discriminate between 1) Casino and 2) Playground plus Restaurant. Another 2-class DBN is then trained to separate the output of the previous-level DBN into Playground or Restaurant.

Figure 6.8: Manually constructed hierarchical configuration for the different environments based on Fig. 6.7

In this experiment, each of the DBNs is first pre-trained with 1000 samples (one hour of audio data) randomly selected from Set C for 20 epochs, and then Sets A and B are used in the supervised training stage. The accuracy for this configuration is 87.6%, compared to 79.6% originally from Sec. 6.3.2. This shows that the hierarchical structure leads to much better performance than using a single DBN alone. Despite the higher complexity, it works well with noisy data and overlapping environments. This demonstrates that combining related environment types at the higher level and separating them only at the lower, more detailed levels is advantageous in classifying environments with many overlapping classes.

6.4.2 Towards Automatic Composite-DBN

Although we have demonstrated the promising capability of a hierarchical structure, it is currently manually constructed. The next step is to build this hierarchy of DBNs automatically, in some optimal way. The key is to discover the commonality between different scenes and locations of the same (but similar) type of environment, e.g., fluidic (running water taps, ocean waves on the shore, and flowing streams) or crowds (in restaurants, shopping malls), etc., and to learn a low-level, type-specific structural model for each type, e.g., restaurants, school playgrounds, streets with traffic, utilizing this method. Similarly, at the next level, we need to discover the commonality between similar types of environment.
To discover the features that represent the commonality within each type of sound, we use the DBN to learn these features in an unsupervised manner. Once the structural models are learned, it is then possible to build a hierarchy of sounds for each environment. The ability to automatically separate the environment types into subclasses correctly at the top level is crucial, and the DBN actually supplies us with a simple method to accomplish this. We can view the activations, or weights, that are learned as features. The idea is to utilize the combination of activations between the last hidden layer and each of the target output units of a trained DBN.

Let us consider an experiment similar to the setup in Sec. 6.3.1, with twelve output target classes (or environment types). After the DBN is trained, each target unit is equipped with a set of activations. Using these as features, we calculate the pairwise distance between the activation sets of each pair of environment types using cosine similarity. Then, we perform hierarchical clustering with average linkage to generate the framework for the hierarchy of environment types. A dendrogram of the result is depicted in Fig. 6.9. Although the generated dendrogram is not identical to the manually constructed version, we can observe that most of the groupings do indeed correspond to those in Fig. 6.8 (e.g., Near river and Raining, Nature-nighttime and Nature-daytime). The main disparity is Ocean waves, which is placed with Street with traffic. More improvement is needed to produce this structure, but the results are encouraging, as this leads the way to creating a composite DBN automatically.

Figure 6.9: Dendrogram using hierarchical clustering with cosine similarity measure, based on activations of a trained 12-class DBN (distance along the y-axis depicts the measure of similarity)

A composite DBN framework was built based on the results provided by the hierarchical clustering of the activations found from the 12-class classification in Sec. 6.3.2. Utilizing pre-training and training methods similar to those in Sec. 6.4.1, this composite DBN yielded an average classification accuracy of 91.9% for twelve different classes of environmental sounds, compared to 81.1% from Sec. 6.3.2 using the straightforward (non-composite) DBN method. This is also a slight improvement over the manually analyzed and constructed composite DBN from Sec. 6.4.1, which produced an accuracy of 87.6%. The breakdown of the classification accuracy for each of the DBNs is illustrated in Fig. 6.10. To the best of our knowledge, this is a significant improvement in classification over previously proposed methods discriminating this many different types of general environmental sounds.

Figure 6.10: Automatic construction of the hierarchical configuration for the different environments

As a baseline for comparison, we eliminated the DBN step and obtained a hierarchical clustering based only on the original audio features (MP-features and MFCCs) that were used for training the DBNs. For the distance measure, we again used cosine similarity with average linkage to generate the clusters. Using the average of each cluster, we obtain a dendrogram to illustrate the distances between the classes, as depicted in Fig. 6.11. We can observe that the resulting dendrogram is unevenly branched, meaning that the clusters are considered very similar to each other, making it difficult to separate them and to distribute the branches more widely. The high-level groupings of the classes are formed more by how the classes sound acoustically: for example, Ocean might be closer to Thunder due to the crashing sound, and both Train and Nature-nighttime have high-frequency periodic sounds. For comparison, a composite DBN was also created based on Fig. 6.11, which yields an average classification accuracy of 51.76%.

Figure 6.11: Dendrogram using hierarchical clustering with cosine similarity measure on MP-features and MFCC directly (without using the DBN)
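A minimal sketch of the grouping step described in this section follows: take the weight vector feeding each output unit of the trained 12-class DBN as that class's signature, compute pairwise cosine distances, and cluster with average linkage. The random matrix stands in for the learned last-layer weights, and cutting the tree into five groups is an illustrative choice.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

class_names = ["Casino", "Playground", "Nature-nighttime", "Nature-daytime",
               "Restaurant", "Near river", "Train passing", "On vehicle",
               "Raining", "Street with traffic", "Ocean waves", "Thundering"]

# stand-in for the (12 x 450) weights between the last hidden layer and the output units
activations = np.random.randn(12, 450)

distances = pdist(activations, metric="cosine")     # pairwise cosine distances
Z = linkage(distances, method="average")            # average-linkage hierarchy
groups = fcluster(Z, t=5, criterion="maxclust")     # e.g., cut into five high-level groups
for name, g in zip(class_names, groups):
    print(f"{name}: group {g}")
# dendrogram(Z, labels=class_names) would produce a figure analogous to Fig. 6.9
```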
6.4.2.1 Stability of High-Level Groupings

In this section, we investigate the stability of the high-level clusters produced by the hierarchical structure from Sec. 6.4.2 and the effects of training order. In this set of experiments, the setup is similar to that in Sec. 6.3.1: we pre-trained the three RBMs in an unsupervised manner with Set C, and then utilized the data from Sets A and B on the selected classes for training and fine-tuning the DBN.

In the first experiment, we begin with two target classes and gradually grow the dendrogram by adding one target class at a time, analyzing how the resulting hierarchy evolves. The selection order was based on classes that have related pairs under the same high-level grouping from Fig. 6.10 (e.g., Casino and Restaurant in crowded-room settings, or Raining outside and Next to river being fluidic types), followed by the rest of the target classes. We started with Casino and Restaurant as the first step; then, at each of the following steps, we added one more target class. The order of the target classes being added was: 1) Casino, 2) Restaurant, 3) On vehicle, 4) Street with traffic, 5) Raining outside, 6) Next to river, 7) Nature-nighttime, 8) Nature-daytime, 9) Train passing, 10) Near ocean, 11) Playground, 12) Thundering.

At each step, we save the weights of the resulting DBN after training. We change the structure of the DBN by adding a unit to the output layer, which represents the additional target class. Weights are then introduced between all the nodes in the third hidden layer and the new target class, initialized with random values in a specific range.

Figure 6.12: Evolution of the dendrogram using hierarchical clustering with cosine similarity measure, obtained by increasing the number of target classes (the dotted box depicts the new target class being added).

From Fig. 6.12, we can observe a bias introduced by the initial two selected classes. Since we started with Casino and Restaurant, the dendrogram at each step tries to keep these two classes apart on separate branches. The related classes added after the initial two, e.g., On vehicle and Street with traffic, or Raining outside and Next to river, were able to be clustered together. Since we build upon the weights of the DBN found at the previous step, the bias of the initial network is propagated throughout the entire process. This means that the most important step is to ensure that the first two classes are the most dissimilar.

To verify the bias found above, in the second experiment we reversed the order of selection of the target classes. We begin with 1) Thundering and 2) Playground, followed by 3) Near ocean, 4) Train passing, 5) Nature-daytime, 6) Nature-nighttime, 7) Next to river, 8) Raining outside, 9) Street with traffic, 10) On vehicle, 11) Restaurant, 12) Casino. Fig. 6.13 illustrates the dendrograms created at each step. Since we used two dissimilar classes for the initialization, the resulting clusterings do not break apart the more compelling high-level groupings, e.g., crowded room (Restaurant and Casino), fluidic types (Next to river and Raining outside), and vehicle related (On vehicle and Street with traffic).

Figure 6.13: Evolution of the dendrogram using hierarchical clustering with cosine similarity measure, obtained by increasing the number of target classes (the dotted box depicts the new target class being added). Target classes are added in the reverse order from Fig. 6.12.

In the last experiment, we use a selection order similar to that of the first experiment (Fig. 6.12). However, in this case, we do not save the weights between the third hidden layer and the output layer.
This is analogous to keeping only the weights found from pre-training the three RBMs. For each additional target class, weights are added between the new target class and the units in the third hidden layer. In addition, the weights between the rest of the target classes and the third hidden layer are also reinitialized to random values. We then proceed to fine-tune the DBN on the training data as before.

Figure 6.14: Evolution of the dendrogram using hierarchical clustering with cosine similarity measure, obtained by increasing the number of target classes (the dotted box depicts the new target class being added). Weights between the third hidden layer and the output layer were not saved; same ordering as Fig. 6.12.

If we compare Fig. 6.9 and Fig. 6.14, we can observe similar target classes being clustered together. The exception is Thundering, which is clustered with Train passing instead of being grouped with Street with traffic as in Fig. 6.9. However, having Thundering and Train passing together makes more sense, as the two produce somewhat similar reverberating sounds. The experiments in this section demonstrate that the high-level grouping found in Fig. 6.10 is mostly stable. The only aberration comes from the bias created by selecting two similar, related classes to initialize the expansion of the hierarchical clustering. This indicates that we can utilize the activations from the original DBN as a way to automatically create a hierarchy structure for a composite DBN.

6.5 Conclusions and Future Work

This chapter proposes a framework for generative modeling of environmental sound using DBNs in a hierarchical structure. Our framework demonstrates the ability to model different environmental sound types despite overlapping and noisy data. Experimental results demonstrate promising performance in improving the state-of-the-art recognition of audio environments. It is encouraging that we could utilize more unrestricted data to improve generalization. We also proposed a method for automating the creation of the hierarchy structure for the composite DBN.

6.6 Applications

The potential of extracting context information from the environment extends to a wide range of applications, from robotics to characterizing user-generated multimedia content on the internet.

The most apparent applications are those of vision-based systems, such as service-type robots and automatic surveillance. There are many opportunities where the fusion of audio and visual (or other sensory) information can be advantageous, say for disambiguation of uncertainties, environments, or object types. With many robotic applications being utilized for navigation in unstructured environments (Pineau et al., 2003; Thrun et al., 1999), there has been recent interest in finding ways to provide hearing for mobile robots (Chu et al., 2006; Huang, 2002) so as to enhance their perception. Typically, robots are employed in a well-defined and/or highly constrained environment. Vision-based systems typically require much world knowledge and are susceptible to problems of lighting (or the lack thereof) and camera angle. It is possible to mitigate a system's dependency on vision alone by incorporating audio information into the recognition process. In some systems, the manner in which the robot navigates depends on its environment, switching automatically between different control modes for various environments (Yanco, 1998; DeSouza & Kak, 2002). Audio can provide an approximation of the current activities surrounding the agent. Using audio enables the system to capture a semantically richer environment on top of what the visual information can provide. Audio data can also be obtained more easily and is computationally cheaper to process than visual data.
Robots or virtual systems deployed in crowded areas would also benefit greatly from being able to understand and adapt to their environment. Museum guides, for example, should be able to adjust their hearing and their presentation manner based on their surroundings. If the agent perceives itself as being in a crowd, yet no one is conversing with it, the guide might want to call out to attract people's attention, or it might want to adjust the way it speaks so as to make itself heard in a crowd or to speak more softly in a quiet room. For any interactive system, a major issue is the performance of speech recognition in a noisy environment. One way to improve the robustness of speech recognition is to integrate environmental noise models into the recognition system. However, instead of just selecting from a pool of acoustic models (Thatphithakkul et al., 2006), the system should learn the background of the environment the agent is in and then adapt it to the changing environment.

Search-and-rescue and automatic surveillance systems can also use audio to improve recognition accuracy. A system can monitor with both audio and video for anomalies or unusual events, e.g., identifying suspicious noise (Ntalampiras et al., 2009) or monitoring a crowded marketplace for unusual happenings (screaming, or locating a person crying for help), where it would be difficult with visuals alone. A home monitoring system that uses both audio and video to detect unexpected accidents was found to be robust despite the many common everyday events occurring (Zhuang et al., 2009). Just knowing the scene provides a coarse and efficient way to prune out irrelevant scenarios.

Other applications include those in the domain of wearables and context-aware applications (Waibel et al., 2004; Ellis & Lee, 2004; Clarkson et al., 1998), e.g., the design of a mobile device such as a cellphone that can automatically change its notification mode based on knowledge of the user's surroundings, such as switching to silent mode in a theater or classroom (Waibel et al., 2004), or even providing information customized to the user's location (Mantyjarvi et al., 2002).

In recent years, there has been an explosion of multimedia resources on media-sharing sites, e.g., YouTube, exhibiting scenes of various aspects of quotidian life in a wide range of domains. Being able to recognize environmental audio can also contribute towards the analysis and mining of unstructured multimedia data through the use of both audio and visual information, e.g., the everyday surrounding environment. Currently, to make sense of this content, we rely mainly on associated tags provided by users. Using audio can enable users to retrieve video clips from among massive, heterogeneous visual data in a more semantically meaningful manner.

Chapter 7: Conclusion and Future Work

7.1 Summary of the Research

Unstructured environmental audio is extensively studied in this research. We investigated it in terms of classification, feature extraction, modeling, and characterization in Chapters 3-6, respectively.

In Chapter 3, we investigated techniques for developing a scene classification system using audio features. The classification system was successful in classifying five classes of environments using real data obtained from a tele-operated mobile robot. We also found that using a high number of features is not always beneficial to classification: using forward feature selection, a form of greedy search, only nine of the thirty-four features were required to achieve a high recognition rate.
We have also identified features that can discriminate between these five types of environment. With this success in using audio to discriminate between different unstructured environments, we have shown that it is feasible to build such a system. Although the results exhibited improvements, the features found after the feature selection process were rather specific to each classifier and environment type. These findings motivated us to look for a more effective approach to representing environmental sounds.

Toward this goal, we proposed a novel feature extraction method in Chapter 4 that utilizes the matching pursuit (MP) algorithm to learn the inherent structure of each type of sound and to select a small set of time-domain features, which we called MP-features. MP-features have been shown to classify sounds where frequency-domain features (e.g., MFCCs) fail, and they can be advantageous when combined with MFCCs to improve overall performance. Extensive experiments were conducted to demonstrate the advantages of MP-features, as well as of joint MFCC and MP-features, in environmental sound classification. To the best of our knowledge, we were the first to propose using MP for feature extraction for environmental sounds. This method has been shown to perform well in classifying fourteen different audio environments, achieving 83% classification accuracy. This result is very promising considering that, due to the high variance and other difficulties in working with environmental sounds, recognition rates have thus far been limited as the number of targeted classes increases, i.e., approximately 60% for 13 or more classes (Eronen et al., 2006).

In Chapter 5, we proposed a framework for audio background modeling that incorporates prediction, data knowledge, and persistent characteristics of the environment, leading to a more robust audio background detection algorithm. This framework has the ability to model the background and detect foreground events, as well as the ability to verify whether the predicted background is indeed the background or a foreground event that persists for a longer period of time. Experimental results demonstrated promising performance in improving the state of the art in background modeling of audio environments. We also investigated the use of a semi-supervised learning technique to exploit unlabeled audio data. It is encouraging that we can utilize more unlabeled data to improve generalization, as such data are usually cheap to acquire but expensive to label. More often than not, we are forced to work with a relatively small amount of labeled data because of this limitation.

In Chapter 6, we investigated the use of a richer generative-model-based method for classification and attempted to discover high-level representations for different acoustic environments in a data-driven fashion. Specifically, we considered DBNs for modeling environmental audio and investigated their applicability to noisy auditory data for robustness and generalization. We proposed a framework for creating a composite DBN as a way to represent various levels of representation. Experimental results on real data sets demonstrate its effectiveness over traditional methods. The result was promising, as it provided over 90% recognition accuracy for a high number of environmental sound types.

Overall, the main contribution of this thesis is to dispel the notion that traditional speech and music recognition techniques can simply be used for environmental sound.
We provided a comparison and analysis of traditional methods for acoustic environments and presented novel recognition techniques specifically for environmental sounds to improve recognition.

7.2 Future Research Directions

There are many interesting directions that could be taken from this work. Below are some directions for research stemming from this PhD work.

Taxonomy of Sound Structures

If we can find a systematic way to break down environmental sounds, it would increase the efficiency of identifying them. Developing a taxonomy of sound structures by learning a hierarchy of sound types might reduce the confusion between acoustic environments with similar characteristics. The use of suitable hierarchies also allows us to assign confusing samples to a more general class. Using a structured classification technique (e.g., hierarchical sound structures) would alleviate much of the confusion when trying to recognize large varieties of environmental sounds. Our framework in Chapter 6 demonstrates the ability to model different environmental sound types despite overlapping and noisy data, and the ability to form a hierarchy of sound types. Experimental results demonstrate promising performance in improving the state of the art in recognition for audio environments. It is encouraging that we can utilize more unrestricted data to improve generalization. Future work includes continuing to investigate a framework to fully automate the creation of the hierarchy structure for the composite-DBN. Results from this work provide information that allows a greater understanding of the nature of environmental sound. Gaining insight into how a taxonomy of sound structures could be constructed would provide some structure to what is currently considered unstructured, and would be a step toward some form of environmental sound alphabet. Inferences from this could in turn be used to further refine environmental sound recognition techniques.

Adaptiveness

To be useful in realistic situations, acoustic environment classification must be adaptable to changes in the environments and to the introduction of new phenomena within an environment. Over time, the characteristics of a certain location will change and new events may be encountered. In some instances, the environment changes gradually and incrementally, such as new sounds in a restaurant or on a street. The environment might also change sporadically, like trains arriving and departing in a subway station. Therefore, adapting to changes in the environment, learning the rate of change, and continuously learning new environments are important requirements for acoustic environment classifiers. In theory, it is possible to train a classifier in a supervised manner with as much data as possible, covering all the variations of an acoustic environment; however, this requires an enormous amount of training data to encompass the variability of the characteristics of all possible environments. This is especially impractical if we want to use such a classifier in context-aware applications. In a realistic environment, there will always be unseen data and unforeseen changes. An adaptive learning strategy allows the model to adapt to random changes and new environments, and to deal with concept drift.

Fusion of Audio and Other Modalities in Robotic or Monitoring Systems

Areas such as robotics, hearing aid technology, smartphones, home monitoring and security systems would benefit from research on identifying non-speech sounds.
Much of this work is already implemented based on visual recognition. The next step is to combine the proposed ideas with the design of museum guide robots, mobile devices like smartphones, and surveillance systems. For example, we can equip a robot, which typically contains a vision-based sensory system, or a smartphone, which already has GPS, with audio listening capability. By incorporating different modalities and including environment knowledge, such a system gains the capability to disambiguate object types and to localize, track, and identify robustly for context-aware personalization. The first step is to develop a framework for information fusion from the different modalities to enhance the characterization of the environment or scene context. A major issue in this area is the synchronization, or coupling, of events from different data streams. An initial approach would be to build a system where the decisions for determining certain scenes or environments are made in each individual information stream, based on low-level information processing. Then, from further processing, such as correlation analysis on the training data, we can extract linkage information between features of the different data streams to combine the individual decisions for prediction, decision making, and planning.

Chapter 8: Related Publications

8.1 Book Chapter

Chu, S., Narayanan, S., and Kuo, C.-C. J. "Unstructured Environmental Audio: Representation, Classification and Modeling," in Machine Audition: Principles, Algorithms and Systems, edited by Wenwu Wang. IGI Global, 2011. ISBN-13: 9781615209194.

8.2 Journals

Chu, S., Narayanan, S., and Kuo, C.-C. J. "Environmental Sound Recognition with Time-Frequency Audio Features," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142-1158, 2009.

8.3 Conferences

Chu, S., Narayanan, S., and Kuo, C.-C. J. "Environmental Sound Recognition with Composite Deep Belief Network," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), submitted.

Swartout, W., Traum, D., Artstein, R., Noren, D., Debevec, P., Bronnenkant, K., Williams, J., Leuski, A., Narayanan, S., Piepol, D., Lane, C., Morie, J., Aggarwal, P., Liewer, M., Chiang, J.-Y., Gerten, J., Chu, S., and White, K. "Ada and Grace: Toward Realistic and Engaging Virtual Museum Guides," Proceedings of the 10th International Conference on Intelligent Virtual Agents (IVA), 2010.

Chu, S., Narayanan, S., and Kuo, C.-C. J. "A Semi-Supervised Learning Approach to Online Audio Background Detection," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009.

Chu, S., Narayanan, S., and Kuo, C.-C. J. "Environmental Sound Recognition using MP-based Features," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008.

Chu, S., Narayanan, S., Kuo, C.-C. J., and Matarić, M. J. "Where Am I? Scene Recognition for Mobile Robots using Audio Features," Proceedings of the IEEE International Conference on Multimedia & Expo (ICME), 2006.

Chu, S., Narayanan, S., and Kuo, C.-C. J. "Towards Parameter-Free Classification of Sound Effects in Movies," Proceedings of SPIE Optics & Photonics, Conference on Applications of Digital Image Processing XXVIII, vol. 5909, 2005.
8.4 Workshops

Swartout, W., Traum, D., Artstein, R., Noren, D., Debevec, P., Bronnenkant, K., Williams, J., Leuski, A., Narayanan, S., Piepol, D., Lane, C., Morie, J., Aggarwal, P., Liewer, M., Chiang, J.-Y., Gerten, J., Chu, S., and White, K. "Virtual Museum Guides," Proceedings of the IEEE Workshop on Spoken Language Technology (SLT), Berkeley, CA, 2010.

Chu, S. "Unstructured Audio Classification for Environment Recognition," Proceedings of the 23rd AAAI National Conference on Artificial Intelligence (Doctoral Consortium), 2008.

Chu, S., Narayanan, S., and Kuo, C.-C. J. "Content Analysis for Acoustic Environment Classification in Mobile Robots," Proceedings of the AAAI Fall Symposium, Aurally Informed Performance: Integrating Machine Listening and Auditory Presentation in Robotic Systems, 2006.

Bibliography

The BBC sound effects library - original series. http://www.sound-ideas.com/bbc.html.

The freesound project. http://freesound.iua.upf.edu/index.php.

Player project. (1992). The Series 6000, the general sound effects library. Sound-Ideas.

Agostini, Giulio, Longari, Maurizio, & Pollastri, Emanuele (2003). Musical instrument timbres classification with spectral features. EURASIP J. on Applied Signal Processing.

Aucouturier, Jean-Julien, Defreville, Boris, & Pachet, François (2007). The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music. J. of the Acoustical Society of America, 122.

Ballan, Lamberto, Bazzica, Alessio, Bertini, Marco, Bimbo, Alberto Del, & Serra, Giuseppe (2009). Deep networks for audio event classification in soccer videos. Proceedings of the International Conference on Multimedia and Expo (ICME).

Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, & Larochelle, Hugo (2007). Greedy layer-wise training of deep networks. Proceedings of Advances in Neural Information Processing Systems (NIPS).

Bishop, Christopher M. (2003). Neural networks for pattern recognition. Oxford University Press.

Burges, Christopher J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121-167.

Cai, Rui, Lu, Lie, Hanjalic, Alan, Zhang, HongJiang, & Cai, Lian-Hong (2006). A flexible framework for key audio effects detection and auditory context inference. IEEE Trans on Audio, Speech and Lang Processing, 14.

Cano, Pedro, Koppenberger, Markus, Groux, Sylvain Le, Ricard, Julien, Wack, Nicolas, & Herrera, Perfecto (2004). Nearest-neighbor generic sound classification with a wordnet-based taxonomy. Proc of AES.

Carey, Michael J., Parris, Eluned S., & Lloyd-Thomas, Harvey (1999). A comparison of features for speech, music discrimination. Proc of ICASSP.

Chen, Scott Shaobing, Donoho, David L., & Saunders, Michael A. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20, 33-61.

Chu, Selina, Narayanan, Shrikanth, & Kuo, C.-C. Jay (2008). Environmental sound recognition using MP-based features. Proc of ICASSP.

Chu, Selina, Narayanan, Shrikanth, & Kuo, C.-C. Jay (2009a). Environmental sound recognition with time-frequency audio features. IEEE Trans on Audio, Speech and Lang Processing, 17.

Chu, Selina, Narayanan, Shrikanth, & Kuo, C.-C. Jay (2009b). A semi-supervised learning approach to online audio background detection. Proc of ICASSP.

Chu, Selina, Narayanan, Shrikanth, Kuo, C.-C. Jay, & Matarić, Maja J. (2006). Where am I? Scene recognition for mobile robots using audio features. Proc of ICME.
Clarkson, Brian, Sawhney, Nitin, & Pentland, Alex (1998). Auditory context awareness via wearable computing. Workshop on Perceptual User Interfaces.

DeSouza, Guilherme N., & Kak, Avinash C. (2002). Vision for mobile robot navigation: a survey. IEEE Trans on Pattern Analysis and Machine Intelligence, 24.

Dougherty, James, Kohavi, Ron, & Sahami, Mehran (1995). Supervised and unsupervised discretization of continuous features. Proc of ICML.

Ebenezer, Samuel P., Papandreou-Suppappola, Antonia, & Suppappola, Seth B. (2004). Classification of acoustic emissions using modified matching pursuit. EURASIP J. on Applied Signal Processing.

El-Maleh, Khaled, Klein, Mark, Petrucci, Grace, & Kabal, Peter (2000). Speech/music discrimination for multimedia applications. Proc of ICASSP.

Ellis, Daniel P. W. (1996). Prediction-driven computational auditory scene analysis. Doctoral dissertation, MIT Department of Electrical Engineering and Computer Science.

Ellis, Daniel P. W. (2001). Detecting alarm sounds. Proceedings of the Workshop Consistent and Reliable Acoustic Cues (CRAC).

Ellis, Daniel P. W., & Lee, Keansub (2004). Minimal-impact audio-based personal archives. Proc of the Workshop on CARPE.

Eronen, Antti J., Peltonen, Vesa T., Tuomi, Juha T., Klapuri, Anssi P., Fagerlund, Seppo, Sorsa, Timo, Lorho, Gaëtan, & Huopaniemi, Jyri (2006). Audio-based context recognition. IEEE Trans on Audio, Speech and Lang Processing, 14.

Ghofrani, Sedigheh, McLernon, Des, & Ayatollahi, Ahmad (2003). Comparing Gaussian and chirplet dictionaries for time-frequency analysis using matching pursuit decomposition. Proc of ISSPIT.

Gribonval, Rémi (2001). Fast matching pursuit with a multiscale dictionary of Gaussian chirps. IEEE Trans on Signal Processing, 49.

Gribonval, Rémi, & Bacry, Emmanuel (2003). Harmonic decomposition of audio signals with matching pursuit. IEEE Transactions on Signal Processing, 51, 101-111.

Hamel, Philippe, Wood, S., & Eck, D. (2009). Automatic identification of instrument classes in polyphonic and polyinstrument audio. Proceedings of the International Conference on Music Information Retrieval (ISMIR).

Härmä, Aki, McKinney, Martin F., & Skowronek, Janto (2005). Automatic surveillance of the acoustic activity in our living environment. Proc of ICME.

Herrera, Perfecto, Yeterian, Alexandre, & Gouyon, Fabien (2002). Automatic classification of drum sounds: a comparison of feature selection methods and classification techniques. Proc of ICMAI.

Hinton, Geoffrey E. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Audio, Speech and Language Processing, 3.

Hinton, Geoffrey E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1711-1800.

Hinton, Geoffrey E., Osindero, Simon, & Teh, Yee-Whye (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.

Huang, Jie (2002). Spatial auditory processing for a hearing robot. Proc of ICME.

Lee, Honglak, Largman, Yan, Pham, Peter, & Ng, Andrew Y. (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. Proceedings of Advances in Neural Information Processing Systems (NIPS).

Malkin, Robert G., & Waibel, Alex (2005). Classifying user environment for mobile applications using linear autoencoding of ambient audio. Proc of ICASSP.

Mallat, Stephane, & Zhang, Zhifeng (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans on Signal Processing, 41.
Mantyjarvi, Jani, Huuskonen, Pertti, & Himberg, Johan (2002). Collaborative context determination to support mobile terminal applications. IEEE Trans on Wireless Communications, 9.

Markel, John D., & Gray, Augustine H., Jr. (1976). Linear prediction of speech. Springer-Verlag.

Mitchell, Tom M. (1997). Machine learning. McGraw Hill.

Moncrieff, Simon, Venkatesh, Svetha, & West, Geoff (2007). On-line audio background determination for complex audio environments. ACM Trans on Multimedia Computing, Comm, and App, 3.

Neff, Ralph, & Zakhor, Avideh (1997). Very low bit rate video coding based on matching pursuits. IEEE Transactions on Circuits and Systems for Video Technology, 7, 158-171.

Nigam, Kamal, McCallum, Andrew, Thrun, Sebastian, & Mitchell, Tom (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39.

Ntalampiras, Stavros, Potamitis, Ilyas, & Fakotakis, Nikos (2009). An adaptive framework for acoustic monitoring of potential hazards. EURASIP J. on Audio, Speech, and Music Processing.

Pati, Yagyensh Chandra, Rezaiifar, Ramin, & Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. Proceedings of the 27th Annual Asilomar Conference on Signals, Systems and Computers.

Peltonen, Vesa (2001). Computational auditory scene recognition. Master's thesis, Tampere University of Technology, Finland.

Pineau, Joelle, Montemerlo, Michael, Pollack, Martha, Roy, Nicholas, & Thrun, Sebastian (2003). Towards robotic assistants in nursing homes. Special issue on Socially Interactive Robots, Robotics and Autonomous Systems, 42.

Rabiner, Lawrence, & Juang, Biing-Hwang (1993). Fundamentals of speech recognition. Prentice-Hall.

Radhakrishnan, Regunathan, Divakaran, Ajay, & Smaragdis, Paris (2005). Audio analysis for surveillance applications. Proc of IEEE Workshop on Appl of Signal Processing to Audio and Acoustics.

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. Proceedings of Advances in Neural Information Processing Systems (NIPS).

Schölkopf, Bernhard (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.

Stauffer, Chris, & Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. Proc of CVPR.

Sugden, Paul, & Canagarajah, Nishan (2004). Underdetermined noisy blind separation using dual matching pursuits. Proc of ICASSP.

Sundaram, Shiva, & Narayanan, Shrikanth (2007). Discriminating two types of noise sources using cortical representation and dimension reduction technique. Proc of ICASSP.

Thatphithakkul, N., Kruatrachue, B., Wutiwiwatchai, C., Marukatat, S., & Boonpiam, V. (2006). Robust speech recognition using PCA-based noise classification. ECTI Trans on Comp and Info Tech, 2.

Thrun, Sebastian, Bennewitz, Maren, Burgard, Wolfram, Cremers, Armin B., Dellaert, Frank, Fox, Dieter, Haehnel, Dirk, Rosenberg, Charles, Roy, Nicholas, Schulte, Jamieson, & Schulz, Dirk (1999). Minerva: A second generation museum tour-guide robot. Proc of ICRA.

Tzanetakis, George, & Cook, Perry (2002). Musical genre classification of audio signals. IEEE Trans on Speech and Audio Processing, 10.

Umapathy, Karthikeyan, Krishnan, Sridhar, & Jimaa, Shihab (2005). Multigroup classification of audio signals using time-frequency parameters. IEEE Trans on Multimedia, 7.
Vera-Candeas, Pedro, Ruiz-Reyes, Nicolás, Rosa-Zurera, Manuel, Martinez-Muñoz, Damian, & López-Ferreras, Francisco (2004). Transient modeling by matching pursuits with a wavelet dictionary for parametric audio coding. IEEE Signal Processing Letters, 11, 349-352.

Vetterli, Martin, & Kovacevic, Jelena (1995). Wavelets and subband coding. Prentice-Hall.

Waibel, Alex, Steusloff, Hartwig, Stiefelhagen, Rainer, & the CHIL Project Consortium (2004). CHIL - Computers in the human interaction loop. Proc of the Intl Workshop on Image Analysis for Multimedia Interactive Services.

Yanco, Holly A. (1998). A robotic wheelchair system: Indoor navigation and user interface. In Lecture notes in artificial intelligence: Assistive technology and artificial intelligence, 256-268. Springer-Verlag.

Yang, Guang, Zhang, Qi, & Que, Pei-Wen (2007). Matching-pursuit-based adaptive wavelet-packet atomic decomposition applied in ultrasonic inspection. Russian J of Nondestructive Testing, 43.

Zhang, Tong, & Kuo, C.-C. Jay (2001). Audio content analysis for online audio-visual data segmentation and classification. IEEE Trans on Speech and Audio Processing, 9.

Zhuang, Xiaodan, Huang, Jing, Potamianos, Gerasimos, & Hasegawa-Johnson, Mark (2009). Acoustic fall detection using Gaussian mixture models and GMM supervectors. Proc of ICASSP.
Abstract
Environmental sounds are what we hear every day, or more generally, the sounds that surround us: ambient or background audio. Humans utilize both vision and hearing to respond to their surroundings, a capability still quite limited in machine processing. The first step toward achieving multi-modality is the ability to process unstructured audio and recognize audio scenes (or environments). The goal of this thesis is the characterization of unstructured environmental sounds for understanding and predicting the context surrounding an agent or device, through the development of appropriate feature extraction algorithms and learning techniques for modeling the variations of the environment. Such an ability would have applications in content analysis and mining of multimedia data, and in improving the robustness of context-aware applications through multi-modality, such as in assistive robotics, surveillance, or mobile device-based services.