DATA-DRIVEN METHODS IN DESCRIPTION-BASED APPROACHES TO AUDIO INFORMATION PROCESSING

by

Shiva Sundaram

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2008

Copyright 2008 Shiva Sundaram

Dedication

This work is dedicated to my parents.

Acknowledgements

The work presented in this dissertation is a cumulative result of my effort over the last fifteen years. These years signify my transition from a curious student, to an engineer, to a researcher and a scientist. I would like to take this opportunity to express my deepest thanks to those who have supported me to this point today. I have learnt a lot from them and they deserve more than just a few lines of gratitude. It is important to emphasize that the order in which they appear is approximately chronological and not the order of importance in any manner unless explicitly mentioned. For all future reference and readers, this acknowledgment was written on September 6th, 2008, on a sunny Saturday in Los Angeles.

First and foremost, this dissertation is dedicated to my parents and my brother: my mother Rama, my father Pranatharthihran and my older brother Mukund. My mother is a school teacher with great patience and infinite love. My father is an exceptional engineer himself with immeasurable experience. He is extremely particular and organized about his responsibilities, which included my upbringing. My brother is a successful manager with unparalleled acuity who has no reservations about sharing his different yet revered point of view. It is difficult for me to see myself standing at this point without their continuous support, encouragement and direction. Throughout my life, I cannot think of a single example of discouragement or doubt about my abilities from them. Although my mother knew little about engineering, she clearly identified my passion and always encouraged me in all the silly things I dabbled at in the name of curiosity. She has also borne with all the things I have broken around our home. I am sure she only remembers the small number of things I have actually fixed. In addition to this, she has taken great effort in building my creative thinking, which is a large part of my research contributions. Everything today started in my childhood with the doodles and sketches that my father made on random pieces of paper. These sketches were made to answer my questions about how things around us worked. For no apparent reason, I clearly remember the diagrams of a transformer and a cathode-ray tube that he made more than fifteen years ago. Even today he talks about the world and the technologies men have created with child-like excitement. With this thesis in audio information processing, I hope I am completing some aspects of his dream. In my family, my brother is a pioneer, experiencing and succeeding in life before I did, and, when asked, he has always been ready to share it with me. Right from my childhood, I can remember the long discussions or arguments that I have had with my brother on various aspects of life. Although we did not agree with each other on many things, he has always been, and still is, someone whom I turn to in times of grave doubt. To be fearless is something I am still learning from him. Looking back, I can clearly see how my family laid the foundation on which I stand today. I would like to acknowledge Prof.
Ramakrishna Ramaswamy (Ram) who has been my mentor since my teenage years. He was the first one to expose me to the exciting world of scientific research work. He showed me science and what people do to realize it. I iv believe realization of science is the only way to improve human life, which is a funda- mental objective of any technology. At that time, electronic circuit building was my hobby, and I was building circuits while working for him. I owe it to him as he made sure that I understand what I do rather than just doing it. As compared to his own students, I have actually spent little time with him, but I certainly treasure all the things I have implicitly learnt from him. I would particularly like to thank Kapil and Manu (or Manu and Kapil, depending on the reader’s point of view) for inducing me into all that lead to being mentored by Prof. Ramaswamy. Manu and Kapil (or Kapil and Manu) are top scientists in their own fields and have always been my role models. They are also close friends of mine, who are always around in some way or the other. My PhD adviser, Prof. Shrikanth S. Narayanan (Shri) is one of the most important people who have shaped the final stages of my education. He is an exceptional teacher, scientist, and an engineer and he is fundamental to all the research work and ideas presented in this dissertation. This dissertation has materialized as a result of his avid mentoring during seven years of my graduate school. While the words and ideas pre- sented are my own, he has guided me throughout every page and every aspect of it. I was introduced to him first in the speech processing course that he offered in the Fall of 2001. Although I was new to the field and went to him with a blank slate, he was quick to identify my potential and took me under his aegis. Any idea starts off as being rough and unpolished. After a lot of effort and with proper guidance, the idea becomes a contribution to the field. Shri helped me to convert my ideas in to the thesis presented here. His office door is always open (literally!) and he always welcomed a discussion. As his student, he let me nurture my own ideas, and knew exactly when to v step-in and offers the most guidance. It lead to a perfect balance between independent thought and the right amount of guidance. I shall always admire his immense energy and motivation to work and I am most grateful to him for completing this dissertation work. I would also like to acknowledge Prof. Chris Kyriakakis, who was my adviser for the early years of my graduate school. He taught me how to take an idea, develop it andrunwithit until its realization. Hedefinedforme, thetruemeaningof anengineer: one who can solve any problem through objective analysis and experimentation. Prof. Kyriakakis laid the final brick that built my confidence to pursue, without doubt, my way of creative thinking. At this point, I would like to thank a very special personin my life: Ms. Ozlem Kalinli. She is a friend, and a colleague and has unconditionally supported me right from the start. This dissertation work is the end result of the successes achieved only after wad- ing through all the failures. Ozlem has been with me at every step. I am grateful to have been able to share all my successes with her, and gain direction and support at moments of failure. I have always enjoyed long technical discussions with her and getting her point of view. It has always been wonderful to share ideas with her as it helped me put things in perspective. 
Her meticulous nature has helped me to sharpen the final presentations of the research work. In short, I would like to emphasize that she has made the journey a lot easier. Iwouldalsoliketoexpressmythankstoothersignificant peoplewhohave helpedmein many ways: Murtaza Bulut, Demetrios Cantzos, Abhinav Sethy, Dhananjay Raghavan, Maurice Boueri andPankaj Mishra. Finally, Iwould like tosay that thislist will always remain incomplete and I am limited to the words when expressing gratitude. vi Table of Contents Dedication ii Acknowledgements iii List of Tables x List of Figures xii Abstract xv Chapter 1: Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Extracting Information from Audio Signals . . . . . . . . . . . . 4 1.1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.2.1 Three Fundamental Procedures in Audio Processing . . . . . . . 11 1.2.2 Representation. Choice of Features . . . . . . . . . . . . . . . . . 12 1.2.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.2.4 Performance Assessment . . . . . . . . . . . . . . . . . . . . . . . 17 1.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . 20 Chapter 2: Background 21 2.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 The First Step: Short-time Analysis of Audio Signal . . . . . . . . . . . 22 2.2.1 Temporal Analysis Features . . . . . . . . . . . . . . . . . . . . . 23 2.2.2 Spectral Analysis Features: . . . . . . . . . . . . . . . . . . . . . 24 2.2.3 Perceptual Features: . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2.4 Bio-Inspired Cortical Representation (CR). . . . . . . . . . . . . 30 2.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 vii Chapter 3: Literature Review 34 3.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Speech versus Non-Speech Discrimination . . . . . . . . . . . . . . . . . 35 3.3 Music Content Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4 Classification of general sound sources . . . . . . . . . . . . . . . . . . . 39 3.4.1 Audio-Based Context Recognition: . . . . . . . . . . . . . . . . . 40 3.4.2 Audio based methods in Video Segmentation . . . . . . . . . . . 42 3.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Chapter 4: Audio representation using words 48 4.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.3 Motivation: Describing Sounds with Words . . . . . . . . . . . . . . . . 53 4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.4.1 Distance Metric in Lexical Meaning Space . . . . . . . . . . . . . 55 4.4.2 Tagging the Audio Clips with onomatopoeia words . . . . . . . . 57 4.5 Vector Representation of Audio Clips in Meaning Space . . . . . . . . . 61 4.5.1 Unsupervised Clustering of Audio clips using Vector Representa- tion in Meaning Space . . . . . . . . . . . . . . . . . . . . . . . . 63 4.5.2 Clustering Results . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.6 Clustering Onomatopoeia Words in Meaning Space . . . . . . . . . . . . 
66 4.6.1 Clustering the Feature Vectors . . . . . . . . . . . . . . . . . . . 69 4.6.2 Classification Experiments . . . . . . . . . . . . . . . . . . . . . . 70 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 Chapter 5: Attribute Based approach to processing 75 5.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.3 System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.5.1 Segmenting Vocal Sections: . . . . . . . . . . . . . . . . . . . . . 89 5.6 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.6.1 Genre Classification Experiments . . . . . . . . . . . . . . . . . . 91 5.6.2 Classification Results and Discussion . . . . . . . . . . . . . . . . 94 5.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 97 Chapter 6: Noise Classification for Attribute based Approach to Audio Processing100 6.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.3 Auditory Cortical Representation (CR) . . . . . . . . . . . . . . . . . . 103 6.4 Dimension reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 viii 6.4.1 DATER Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.4.2 Partitioning data . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.5 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.6 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Chapter 7: Latent Perceptual Indexing 114 7.1 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.2.1 Problem definition and motivation . . . . . . . . . . . . . . . . . 115 7.2.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 7.2.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 7.2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 7.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 7.3.1 Singular Value Decomposition (SVD) and Representation . . . . 125 7.3.2 Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.3.3 Text-based indexing using Latent Semantic Indexing . . . . . . . 127 7.3.4 Text Query Representation . . . . . . . . . . . . . . . . . . . . . 129 7.3.5 Audio representation in Latent Perceptual Space . . . . . . . . . 131 7.3.6 Query Representation . . . . . . . . . . . . . . . . . . . . . . . . 134 7.3.7 Relationship with the LSI framework . . . . . . . . . . . . . . . . 135 7.4 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 7.4.1 Assigning onomatopoeia labels to audio clips . . . . . . . . . . . 137 7.4.2 Consolidation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 7.4.3 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 7.5 Experiments and Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 146 7.5.1 Text-based query and retrieval using Latent Semantic Indexing (LSI). . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 7.5.2 Example-based query and retrieval using Latent Perceptual In- dexing (LPI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 7.6 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 158 Chapter 8: Conclusion 163 Bibliography 172 ix List of Tables 4.1 Complete list of Onomatopoeia Words used in this work. . . . . . . . . . 54 4.2 Resultsofunsupervisedclusteringofaudioclipsusingtheproposedvector representation method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.3 Examples of automatically derived word clusters in lexical onomatopoeic ‘meaning space’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.4 Examplesofautomaticallyderivedclustersofaudioclipsbasedonfeature vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.5 4 best and worst precision (%P) and recall (%R) rates for classification. Clusters are formed by using word-level grouping information . . . . . . 71 4.6 The two most confusing clusters for each of the 4 clusters that have the best precision and recall rates. . . . . . . . . . . . . . . . . . . . . . . . 72 5.1 k-NN class. accuracy (%) speech-like(S-l), harmonic(H), noise-like(N-l) . 88 5.2 false alarm & error rates of popular music segmentation . . . . . . . . . 91 5.3 Distribution of number of songs in each genre . . . . . . . . . . . . . . . 91 5.4 Expt. 1 Confusion Matrix: Using activity rate signals and DTW for similarity measure(using 9-Nearest Neighbour Rule). . . . . . . . . . . . 96 5.5 Expt. 2 Confusion Matrix: Using activity rate signals in an HMM-based classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.6 Expt. 3 Confusion Matrix: Using MFCCs as features in an HMM-based classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.1 A list of nine audio clips and its high-level description from the database. 128 x 7.2 Distribution of clips under each semantic category. . . . . . . . . . . . . 137 7.3 The starting list of Onomatopoeia labels used in this work. . . . . . . . 138 7.4 Distribution of clips in each onomatopoeia category. . . . . . . . . . . . 140 7.5 Examples of automatically derived word clusters. . . . . . . . . . . . . . 144 7.6 List of combined onomatopoeia categories. . . . . . . . . . . . . . . . . . 144 7.7 Query examples and corresponding 5 best matching retrieved clips. . . . 148 xi List of Figures 1.1 Varying levels of possible descriptions of audio used in this work. Using this description scheme a full-duplex audio retrieval system can be im- plemented. Such a retrieval system will be able to retrieve instances of audio based on both signal level measures (scalable) and language-level descriptions (scalable). . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2 Overviewofvariouselementsinthedescriptionbasedapproachpresented in this thesis work. A collection of general audio clips is categorized by their perceived acoustic properties. The system is trained to recognize this in a test audio clip, creating a meta-level representation scheme, which can be further processed to create higher-level ontologies. . . . . 9 1.3 The basic blocks to build an audio processing system. . . . . . . . . . . 14 2.1 The Spectrogram: Conversion of time-indexed frames to a time-indexed time-frequency representation . . . . . . . . . . . . . . . . . . . . . . . . 
33 2.2 TheprocessingblockstoobtainMel-frequencyCepstralCoefficients(MFCCs) for an acoustic signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.1 Typical hierarchical methods used in content-based methods for audio classification of general sources. Top from [23] and bottom from [73] . . 41 4.1 Arrangement of some onomatopoeia words in 2 dimensional ‘meaning space’. Note that words such as clang and clank are closer to each other, but they are farther away from words such as fizz and sizzle . . . . . . . 58 4.2 Tagging and clustering the audio clips with onomatopoeia words. . . . . 60 4.3 VectorrepresentationoftheaudioclipHORSE VARIOUS SURFACES BB with tags{clatter, pitpat} . . . . . . . . . . . . . . . . . . . . . . . . . . 62 xii 4.4 BIC as a function of number of clusters k in model M k . The maximum value is obtained for k =112. . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5 Variation of d BIC(M k ) as a function of k for Q = 500 for clustering onomatopoeia words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.1 Illustration of the proposed system for audio segmentation. . . . . . . . 83 5.2 Thetime-varyingr m vectorforaT =45sec. songclip. Forthisparticular illustration, T s =20 msec and M r =100×T s . . . . . . . . . . . . . . . 85 6.1 Summaryofprocessingfrominputsoundtothemulti-dimensional tensor representation of cortical processing [45, 66, 72]. . . . . . . . . . . . . . 104 6.2 Illustration of data partitioning. Data points belong to 2 classes (hollow and solid). The whole data is partitioned into 3 subsets (circles, squares and triangles). The whole data mapped onto each projection (resulting from discriminant analysis of each subset)are augmented column-wiseto form the resulting data matrix Y . . . . . . . . . . . . . . . . . . . . . 108 6.3 5 nearest neighbour (left), AdaBoost (right) classifier results (average accuracy and true positives rate) of the cortical representation (CR) ver- sus MFCCs as a function of number of projections included. (90/10% train/test split,1 -machine generated, -1 -other noise) . . . . . . . . . . . 113 7.1 Overview of a description-based audio retrieval system. . . . . . . . . . 119 7.2 An example representations of seven clips in the Latent Semantic Space. 130 7.3 Representation in Latent Perceptual Space. . . . . . . . . . . . . . . . . 132 7.4 Variation of d BIC(M k ) as a function of k, the number of word clusters . 143 7.5 Subjective Evaluation: Probability of retrieving≥C relevant clips. Dot- ted line represents the probability of retrieving 0 relevant clips. . . . . . 149 7.6 Precision-Recall rates fortext-based retrieval. Performanceusingseman- tic and onomatopoeic labels. . . . . . . . . . . . . . . . . . . . . . . . . 152 7.7 Nearest Neighbor confusion matrix for semantic categories for retrieval by latent semantic indexing. . . . . . . . . . . . . . . . . . . . . . . . . . 153 7.8 Nearest Neighbor confusion matrix for onomatopoeia categories for re- trieval by latent semantic indexing. . . . . . . . . . . . . . . . . . . . . . 154 xiii 7.9 Averagerelative percentageimprovement asafunctionofvocabularysize N for example-based retrieval. . . . . . . . . . . . . . . . . . . . . . . . 155 7.10 Precision-Recall rates for example-based retrieval. Performance using semantic and onomatopoeic labels. . . . . . . . . . . . . . . . . . . . . . 156 7.11 NearestNeighborclassificationconfusionmatrixbyLPIandonomatopoeia categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
7.12 Nearest Neighbor classification confusion matrix by LPI and semantic categories

Abstract

Hearing is a part of everyday human experience. Starting with the sound of our alarm clock in the morning, there are innumerable sounds that are familiar to us, and more! This familiarity and knowledge about sounds is learned over our lifetime. It is our innate ability to try to quantify (or consciously ignore) and interpret every sound we hear. In spite of the tremendous variety of sounds, we can understand each and every one of them. Even if we hear a sound for the first time, we are able to come up with specific descriptions about it. The descriptions can be about the source or the properties of its sound. This is the listening process that is continuously taking place in our auditory mechanism. It is based on context, placement and timing of the source. The descriptions are not necessarily in terms of words in a language; they may be some meta, residual understanding of the sound that immediately allows us to draw a mental picture and subsequently recognize it. All computer-based processing systems help human users to augment their audio-visual processing mechanism. The objective of this work is to try to capture a part of, or at least mimic aspects of, this listening and interpretation process and implement it in a computing machine. This would help one to browse vast amounts of audio and locate parts of interest quickly and automatically.

Other contemporary systems that attempt the same problem exist. Although these methods are highly accurate, primarily because they solve specific problems that are well constrained, they lack sophisticated information extraction and representation of audio beyond the realm of a simple labeling scheme and its classification. Additionally, they present the drawback that they cannot handle a large number of classes or categories of audio, as they inherently rely on a naive implementation of pattern classification algorithms. To this end, flexibility and scalability are important traits of an automatic listening machine.

One of the primary contributions of this work is developing a new, scalable framework for processing audio through higher-level descriptions (compared to signal-level measures) of acoustic properties instead of just an object labeling and classification scheme. The research efforts are geared toward developing representations and analysis techniques that are scalable in terms of time and description level. The work considers both perception-based description (using onomatopoeia) and high-level semantic descriptions. These methods can be universally applied to the domain of unstructured audio, which covers all forms of content where the type and the number of acoustic sources and their duration are highly variable. The ultimate goal of the work presented here is to develop a full-duplex audio information processing system where audio is categorized, segmented, and clustered using both signal-level measures and higher-level language-based descriptions. For this, new organization and inference techniques are developed that are implemented as learning machines using existing pattern classifiers. First, an approach of describing audio properties with words is introduced. The results indicate that onomatopoeic descriptions can be used as a proper meta-level representation of audio properties and that this representation scheme is different from the properties of audio captured by existing signal-level measures.
Anothertechniquetoprocessingbydescriptions is using audio attributes. Here, instead of the conventional approach of directly identi- fying segments of interest, a mid-level representation through generic audio attributes (such as noise-like, speech or harmonic) using an activity rate measure is first created. xvi Using this representation, a system that segments vocal sections and identifying the genre of a popular music piece is presented. While the performance is comparable to other contemporarymethods, theideaspresented herearealsoscalable andcan beused forprocessingmorecomplex audioscenes (with large numberof sources). Todoso, it is necessarytoincreasethenumberofattributesthatarebeingtracked. Anideaforexten- sionistofurtherresolvethenoise-likeattributeintomachine-generateandnaturalnoise is discussed. Using a bio-inspired cortical representation, performance of two pattern classification systems in discriminating between the two noise types is presented. To handle large amounts of large dimensional data, a new dimension reduction technique based on partitioning the data into smaller subsets is implemented. Along the same lines, another framework for description-based audio retrieval using unit-document co- occurrence measure is presented. In this case the retrieval is performed by explicitly discovering discrete units in audio clips and then formulating the unit-document co- occurrencemeasure. Thisis usedto index any audioclip in acontinuous representation space, and therefore, perform retrieval. The approach that is adopted in this dissertation work presents an alternative method for audio processing that moves away from direct identification of acoustic sources andits correspondinglabels. Instead, theframework presentsideas to represent and process general, unstructured audio without explicitly identifying distinct acoustic source-events. xvii Chapter 1: Introduction 1.1 Motivation Hearing is one of the five human senses. It is used to sense the immediate environment by capturing sound. Sound is fleeting in nature and it can only travel limited distances before it becomes inaudible. Therefore, in earlier times, it was never seen as a feasible medium to store information, or as a means for long-distance interaction. However, since the invention of the microphone, phonograph and the telephone, there has always been interest in audio as a vehicle to capture rich aural information for anywhere on-demand access and delivery over long-distances. It may be for entertainment, news, or even directions to find a restaurant in an unfamiliar part of town. Today, hundreds of resources in the Internet offer this in varieties such as streaming radio/video, pod- casting, multimedia blogs, and download services for music and even full-length movies at the user’s finger tips. Cheaper media storage and hardware has led to penetration of this multimedia content delivery in every facet of everyday life. Affordable devices 1 such as the TiVo, or the more recent SlingBox, that can store, organize and re-transmit television programming (multimedia) at the user’s convenience are available. Storing, streaming and capturing multimedia content in handheld devices such as cell phones and digital cameras is commonplace. New compression algorithms (for both audio and video) along with cheaper storage devices have made it possible to capture, store and share any audio or visual aspect of life. Documenting news, events or observations is not just limited to language and text. 
It is rich with pictures of people, their voices, colors, expressions and sounds. Present technology allows us to capture all of these aspects in the digital domain instantaneously. Necessity is the mother of invention: This continuously increasing growth of multimedia data at an explosive rate has led to an increased need for automatic media content processing techniques for storage-retrieval, indexing, and classification. Multimedia content includes a variety of modalities in which audio is a key ingredient. The focus of the present work will be audio content processing: grouping, identifying and appropriately marking audio segments. There is no singular, ideal approach to solve the problems entailed in the engineering aspects of audio processing. Different applications require different ideas and methodology. In this work, any device that is capable of generating sound is referred to as an acoustic source. An audio clip is a recording of one or more acoustic sources and an audio/acoustic event is the segment of the clip when sources are generating sound. An audio scene is an environment where multiple acoustic sources are interacting. Finally, the term audio class is the label that an acoustic source is given to uniquely identify it and other similar audio sources in the same class. 2 Audio is a sequential sound stream that can be used to transmit precise aural in- formation to the listener through the auditory senses. Conversely, The auditory senses capture precise aural information from the immediate surrounding by measuring a sequential sound stream known as audio. The main problem that has been tackled here is to establish a meta under- standing of the audio signal that has been captured (heard). This is useful to develop a more complex relationship of comparisons such as similarity (or lack of), structure and content. This understanding is completely implemented in a computing machine using signal processing techniques for analysis, representation and classification. Examples of these implementations are as follows. In chapter 4 representation of audio using onomatopoeic descriptions is presented. Subsequent experiments indicate that this scheme differs from the properties captured using common acoustical features. In chapter 5, an audio processing method using a mid-level representation is presented. This mid-level scheme is created by classifying short-frames of the acoustic signal into categories such as speech-like, harmonic or noise-like. As an example, it is used to segment vocal sections in popular music. Finally, in chapter 6 to take this idea further splitting the noise-like attribute into machine-generated and natural noise categories is explored. Using more attributes can essentially be useful in extending this approach to process more complex auditory scenes. Next a note on extracting information from audio signals is presented, followed by the main contributions in this thesis. 3 1.1.1 Extracting Information from Audio Signals Information extraction in audio involves identifying distinct, familiar entities from the captured sound signal. The entities/objects can be any information about the source that is generating the sound.The audio medium is inherently rich and it conveys vary- ing levels of information depending on the time-scale and the description, developing an indexing framework using language is challenging. 
In general, it is not possible to identify these individual objects explicitly in an audio signal because sound generated by the sources is embedded as continuous signal variations in time. Defining individual “units”isdifficult. Even ifaworkingdefinitionofaunitiscreated (say anaudioevent), they do not have any distinct boundaries (overlapping in time), the duration of each event is variable, and multiple events can occur in the same time segment. Although audio processing systems analyzes an acoustic signal as short-time frames, the signal itself is not a series of discrete elements that are perceived to be continuous, it is al- ready continuous, overlapping and all the information about all the sources is encoded as time variations that cause discernible spectral variations in a one dimensional signal. Usually, these time and spectral variations caused by the sources allows us to extract enough information to discern individual elements. Processing audio involves trans- forming this time continuous signal into discrete elements or entities according to some familiar knowledge. The familiar knowledge is about a source and its sound generating mechanism, which allows us to reliably distinguish it amongst all the other details cap- tured in the sound signal. The work presented here deals with extracting information based on the knowledge of sound sources in terms of quantitative descriptions of their 4 perceived properties, and the relationship between these perceived properties and the acoustic signal measures. 1.1.2 Contribution Conventional content-based audio processing methods try to identify individual objects inaclip. Whilethisworksforalimitedsetofsources,itignorestheinherentrelationship between acoustic sources that arises due to the acoustic properties they share. The relationship is in form of many-to-one mapping between the assigned class labels (the directnameofsources)anditsacousticproperties. Agivensource,say,“Nail Hammered on Bench”, (which is the given direct name) shares similar acoustic properties with “Knocking on Door” which also shares its properties with “Footsteps on Tarmac”. In such a scenario, an audio retrieval system will be required to retrieve the other two clips when queried with one of them. An intuitive scheme like this is going to be difficult with the content-based methods since it creates individual class labels for each of these sources. This approach is also not easily scalable because as more number of identifiable sources are involved, the complexity of the system would also increase by many folds. It will also require reorganization of the data and retraining of the audio processing system for every new application domain. The research work discussed in thisdissertation providesaframeworktomoveaway fromthecontent-based methodsto description based methods to utilize the relationship that arises from the many-to-one mapping between the acoustic sources and the acoustic properties they share. The methods discussed here presents an alternative method to the conventional techniquethat clearly definestheaudioclasses anddevicemethodsto distinguishthem. 5 As mentioned above, while content-based methods are a good enough start for limited applications, it lacks provisions to include mid-level information such as descriptions of the audio properties, qualification (description of source type) and quantification which can better utilize the shared acoustic properties amongst sources. 
A source “Nail Hammered on Bench” can have mid-level information that it is impulsive, repetitive (succinctlydescribedbythewordstap-tap). Giventhatitispossibletodecryptthiskind of information from the audio signal, it will be possible to describe an unknown audio clip (say “Knocking on Door”) with this intermediate information. For a whole gamut of such descriptions it will also be possible to accommodate recognition of new audio sourceswithoutreorganizing andre-trainingthecompleteaudioprocessingsystem fora new class. The final decision about the source can be drawn based on this intermediate information. Thecoredifferencebetweenthisapproachandthecontent-basedtechnique is that here, the audio processing system is trained to identify descriptive acoustic properties of sources irrespective of the direct class names. The hypothesis is that any audio scene with multiple acoustic sources can be familiarized through one or more of these distinguishable universal acoustic properties. Once they are identified in any arbitrary unknown audio clip, a meta-representation of the given signal can be formed. Then, a second level of flexible post processing can be used to either actually identify the direct class names or create other ontological representations. The basic blocks of a description based approach is illustrated in figure 1.2 . Insomeways, thisapproachissimilartotheways humansprocessauralinformation and recognize sounds. This is substantiated by the work presented in [34], where it is suggested that the time-frequency responses of the auditory filters depends on the 6 statistical nature of the acoustic source signal. It is shown that the bandwidth-centre frequencybehaviouroftheauditoryfiltersvaryaccordingtosourcetypes. Althoughthe illustration is for categories such as speech, animal sounds and environmental sounds, it alludes to the fact that the human auditory system responds to a particular source type irrespective the actual direct name. Thus, each sound we hear immediately gets qualified and a higher level of inference eventually identifies the sound source. The responseto a particular source type is a form of meta-level understandingthat is aimed for here. These aspects can be mimicked as pattern classifiers that discern various properties of sounds. Themain challenge lies in discovering these descriptive properties that one needs to look for in audio. In this work, two approaches are presented: (1) by an ontological understanding of how language level word descriptions relate to acoustic signal properties and (2) by building a mid-level,coarser representation scheme that quantitatively measures a given audio clip in terms of three primitive attributes and finally relating these levels of description to the overall semantic description of audio clips. In each case, the automatic processing is performed without explicitly classifying or identifying individual acoustic sources. These varying levels of descriptions and representations are illustrated in figure 1.1. The signal processing aspects in this work involves implementing systems that can decode descriptive information from the acoustic signal. Since common signal features thatareusedinaudioclassificationtaskscanalreadydiscernawidevarietyofsources,it isassumedthatthesefeaturescanalsomeasuretheintermediatedescriptiveinformation that is relevant here. Using pre-labelled data, pattern classifiers are used to discern and identify the acoustic properties. 
The data-driven learning, however, shall be the 7 Figure 1.1: Varying levels of possible descriptions of audio used in this work. Using this description scheme a full-duplex audio retrieval system can be implemented. Such a retrieval system will be able to retrieve instances of audio based on both signal level measures (scalable) and language-level descriptions (scalable). 8 Figure 1.2: Overview of various elements in the description based approach presented in this thesis work. A collection of general audio clips is categorized by their perceived acoustic properties. Thesystem is trained to recognize this ina test audioclip, creating ameta-levelrepresentationscheme,whichcanbefurtherprocessedtocreatehigher-level ontologies. 9 only common aspect with the other contemporary methods. Experiments in chapter 5 show that it is indeed possible to implement a description based system using existing signal features and pattern classifiers. Of course, modeling signal measures, choice of pattern classifiers etc. are application dependent. In all the cases discussed, existing signal processing algorithms for representation, discrimination and classification are used on new application either directly or by modifying them. For instance, in chapter 6 discriminant analysis by partitioning the data is implemented for large dimensional data to make it tractable. Since the methods are data-driven, organizing audio data into categories according to their perceived properties is a major challenge. Almost all the work presented here uses a database of general audio sources (satisfying both variety of content and amount). Also, in certain cases, new signal representations are investigated for better performance of the implemented systems. In summary, the main contribution of the research presented in this thesis, adopts a description based approach where sound sources are described by their properties and they are grouped, identified and segmented accordingly. This scheme is irrespective of their direct audio class names. It also presents an approach to tie language level descriptions to audio classification problems using available signal processing methods. Forinstance, soundgeneratedbyavacuumcleanerandthesoundofwavesonaseashore aredescribedtobenoisy. Intheconventionalapproachboththesesoundswouldbegiven different class labels. Such a framework is useful in creating a mid-level representation scheme that can be easily generalized to different domains of interest. As a part of the introduction, a general framework to audio processing is presented next. 10 1.2 Approach 1.2.1 Three Fundamental Procedures in Audio Processing Although an acoustic event, such as a clap or a knock can be thought as an audio unit, no explicit definition of acoustic events exists. This is because these events in audio can take place at any time-scale depending on the acoustic sources: from a few milliseconds to tens of seconds. The term audio is very generic and covers a wide vari- ety of sound sources. Processing audio, essentially involves detecting the time instant and the type of change in audio. The change type can be according to any a priori knowledge framework. Audio processing always involves one or more of the following related procedures: segmentation, clustering, and classification. Segmentation involves marking homogeneous segments of audio. Each segment usually belongs to only one of many known type of audio. For example, marking vocal sections in popular music songs[61]. 
Clusteringinvolves groupinglargenumberofsamplesofaudiointoahandful of classes. The clips within a group share some similarity. Classification is to identify an unknown source or event in an audio signal. Each has its own problem definitions dependingon how these sources are interpreted for a particular application. Sometimes one procedure can be used to implement a different one. For example, in a two class problem (such as speech/non-speech classification) segmentation can beused to classify individual sections of a clip into speech or non-speech classes. Also, these procedures can be employed in a wide spectrum of application such as, for example, in a front end of a low bit rate speech codec for voice activity detection, to a back-end of a large database system: searching through terabytes of audio to retrieve relevant “thumbnail” 11 clips for a user query. Any audio processing system that uses one of the three proce- dures requires a working definition of an audio event and change in the signal. The focus here is to develop techniques that can automatically identify, understand, detect, and interpret events or sound sources for audio indexing applications using these three techniques. It is easy to see that these three fundamental techniques can be combined in any order to develop a usable audio information processing systems. For example, in chapter 5 a segmentation scheme is implemented by classifying individual frames of audio into previously determined attributes. These techniques are usually implemented as a learning machine that can discriminate or find similarity amongst the patterns in the representation scheme of various sounds. 1.2.2 Representation. Choice of Features As one attempts to design and/or create such a machine, it is informative to also understandthepropertiesofvarioussoundsources. Thishelpstocreatearepresentation thatwouldeffectively identifyordiscernthesources. Asananalogousexampleinvision, colour is an effective feature to distinguish apples from oranges. However, brightness will not be as effective as colour. An apple and an orange can be equally bright. This also applies to audio. Appropriate choice of features is important for good system performance. For audio processing applications a large set of features are available. Most features are either based on temporal analysis or spectral analysis. Temporal featuressuchasshorttermenergy(STE)areusedinstatisticalmodellingtechniquesand they have limited scope. On the other hand, spectral features can be either statistical measures or perceptual. Some examples of statistical spectral measures are the spectral 12 centroid, spectral flux, linear prediction cepstral coefficients etc. Perceptual features are those that model the time-frequency processing in the human auditory system. Thesefeatures are based on power or energy measures in frequency bins that are evenly placed in a logarithmic frequency scale. Although extensive research on the auditory processing of the human ear has been conducted, scientists and engineers have not yet established an exact mathematical model. The main reason behind this is the fact that the transformations in the auditory processing, like all things in nature, is non-linear. Thismakesittodifficulttoreverseengineertheprocessinginthehumanear(the“black box” approach). Nevertheless, for engineering, it is not completely in vain to try and develop such representations. Mel-frequency cepstral coefficients (MFCCs) is one such example[18]. 
Ithasbeenusedinmanyaudioclassificationsystemswithexcellentresults and it is the feature set of choice for state-of-the-art speech recognition systems. In most audio processing systems, both statistical and perceptual feature sets are usedtogether. Theresultingrepresentation is in alarge dimensional feature space. The important issuethat needstobeaddressedinchoosingalargedimensionrepresentation is whether the features of choice measure the audio properties of interest. For instance, pitch is important to measure melody in music. However, additionally measuring the properties such as STE or other spectral information using a large dimension feature set can be an unnecessary overhead, ineffective or worse, affect system performance. A feature can be relevant and effective in a system only if it can appropriately measure the audio properties of interest. 13 Figure 1.3: The basic blocks to build an audio processing system. 1.2.3 Implementation The methods discussed here follow a data-driven approach where a pattern classifica- tionalgorithmistrainedtodiscernpatternsbasedonlargeamountsoflabeleddatabase. The samples of audio (in its feature space representation) are labeled with its member- ship class. This falls under the class of supervised techniques because the labeling of this data is done under human supervision. The steps to building an audio processing system is roughly illustrated in figure 1.3. Before implementing a data-driven system, the initial legwork involves acquiring digitized waveforms of audio signals from real sources. In most cases, this has already been done professionally (such as a part of a sound effects library used for movie pro- duction). The audio clips can be collected from a variety of resources depending on the application. For example, for creating music information retrieval systems (MIR), whereidentification of thegenre is theobjective [36], audioclips of variety of genres are collected. Asanotherexample, formoregeneralizedaudioclassificationtasks, wherethe sound source of interest could be anything from door knocking to engine noise, audio clips of general everyday sources is collected. Relevance to the application is important 14 because the machine that is being trained on this data is usually optimized to discern between the various groups or classes within the domain of interest. In general, the more generalized the classes are, the harder it is to collect audio clips for the data. The next steps relate to building a classification system. The first step is the pre-processing stage. This can vary from simple down sampling and normalization to elaborate realizations that exclude segments of audio by classifi- cation. In this stage, a continuous audio stream is fragmented into smaller, overlapping frames of twenty to few hundred milliseconds. This is known as frame based analysis (details in chapter 2) and it is borrowed from speech processing. The short frames are usually multiplied by a window of the same length which is tapering at the edges and has a maximum value at the middle. Multiplying it with a window minimizes the effects of abrupt sequence truncation in the frequency domain. The idea behind frame basedanalysis is that thenon-stationary audiosignal (processes whosestatistics change in time) is assumed to be stationary for that period of analysis. Following this is the feature extraction stage that essentially extracts information and represents the input streaminacondensedformaspointsinafeaturespace. 
Again,dependingontheoverall system, it can be a simple STE measure to large dimension perceptual measures such as MFCCs. The feature extraction block is based on the idea of observing a random process. It is implicitly assumed that an underlying random process is generating an output and the extracted features are the observations. Since each acoustic source is assumed to be a process, the observations are assumed to have a source-dependent sta- tistical structure to it. The pattern learning/classification stage is the core of the audio processing system which essentially learns the structure of these observations. Here, a 15 pattern classification algorithm is trained to discriminate one class of audio from an- other in the feature space. The underlying assumption being points mapped from the same class of audio will be in close proximity (forming a pattern). The classification al- gorithm initially undergoes an offlineprocedurewhereit learns to discriminatebetween audioclasses ofinterest. Thisisdoneby initially providingtheoverall systemwithdata similar to the test case. The classifier stage can be decisions based on statistical model or algorithms for discriminant analysis. The final stage is the post processing stage. It combinestherawoutputsfromtheclassifiertodrawinferenceaboutasegmentofaudio. In most cases it is a set of hardwired heuristic rules or based on empirical observations of data (a form of training). In other cases it can be more sophisticated such as using Neural Networks. On close examination of the blocks, it can be seen that aspects such as extracting perceptual features, building a classifier, training it, and combining the raw outputs of the classifier for a final decision, roughly model the way humans process audio. In fact, thisapproachalsomodelsthehumanauditoryprocessingatdifferedlevels. Forinstance, the feature vectors can be designed to respond to one type of acoustic source. This by itself can be seen as a higher-level process of detection. As another example, post pro- cessing the classifier outputs can by itself beanother meta level of representation which can be further processed to a final decision. In [61] the author combines individual classifier outputs to create an activity rate measure of a given audio and subsequently locates vocal sections in popular music songs. The auditory processing as a whole also does similar things. An audio track recorded in a street corner, would contain many vehicles passing by, people talking, footsteps dogs barking etc. Each individual sound 16 sourcerecognized can beseenasanoutputof aclassifier, andthesubsequentconclusion that it is a street scene, is a higher level inference based on the individual classifier. It is generally assumed that due to the choice of classes, the results of the classifiers can be interpreted as having semantic information. For example, key audio effects in [52] are used to provide cues to higher level understandingof the underlyingauditory scene. The main lesson to be learned from trying to model the human auditory processing is that it is an accepted Utopian system. Modeling aspects of human auditory processing is a well informed way to develop a framework for automatic processing technique. 1.2.4 Performance Assessment Measuring performance depends on the critical nature of an application. There are two main performance issues: the actual identification accuracy and temporal accuracy. 
In most cases, both aspects of performance can be directly measured, however in certain cases, it is better measured with the performance of the complementary system. For instance, for a front end of an automatic speech recognizer (ASR) it is very important that non-speech sounds are correctly identified and accurately segmented. In such a case, the performance of a speech/non-speech discriminator can be indirectly measured by measuring the improvement in the word error rate. Note that due to the nature of application, accuracy at the frame level of few tens of milliseconds may be critical to the performance. However for other applications such as audio thumb-nailing or highlightgenerationperformancecanbedirectlymeasured. Hereidentificationaccuracy is morecritical than temporal accuracy, butit is difficult toascertain good performance based on only one aspect. If an audio thumb-nailing system result is off by a few 17 milliseconds, it is not necessarily bad performance. What if the same result is off by a few seconds? How many hundreds of milliseconds constitute bad performance? It is difficulttoestablish suchgrounds. Thisisbecause, themeasurementof performancefor audio processing system also relates to the definition of an event and change. Systems such as thumbnail generation are looking for a segment of speech event, melody, chorus event. They are also looking for whole scenes such as war or dialogue scenes in a movie. Usually these definitions also match with the human definitions of event and change. This can also contain smaller audio events and change points. Such accepted humandefinitionsofeventandchangeare,however, inherentlyfraughtwithincomplete, nonspecific, ephemeral formulations. It is highly dependent on context that is not considered in the performance assessment of such system. As an example, an audio based video segmentation scheme may classify a scene with dialogues as a dialogue scene, but the dialogues may be in the context of a war scene where the characters are speaking to each other. Examples like these may result in degradation in measured system performance. Therefore, it is often important to review the exact variables involved. Inshort,performancemeasuresarelimitedtoapplications, anditvariesfromsystem to system. The research ideas presented in this dissertation are sometimes compared with other contemporary systems as they sometimes lie within the same application domain. In other cases, new metrics to appropriately assess system performance are suggested. In general, new metrics can be developed by a formal understanding of the assumptions and the purpose served by the audio processing system. 18 1.3 Conclusion As it will be clearer in the next chapters, each and every system discussed has an implicit definition of an event and change. Theproblem of audio processingis to detect changesatdifferenttime-scales anddifferentdomains: fromafewmillisecondstotensof seconds, from signal level events such as silence or otherwise, to semantic events such as sound source recognition, tracking and placement. All the systems fundamentally differ from each other in this aspect. The implementation of an individual system is highly dependent on the definition of the above entities and how efficiently they are detected. The posed problem lies in their definitions too. The devised schemes to handle these problems at multiple scales are based on common observations, content and statistical measures. 
In this thesis work, the audio processing problem is tackled based on conjectures of the human auditory processing. As mentioned earlier, this would entail building a system that would model the auditory processing at different levels, depending on the application. The experiments for proposed systems presented in this research not only try to model the signal processing aspects of the auditory system but also the plausible meta representation that can be used to draw further inference about the captured audio. This will help in creating more intuitive systems that are along the lines ofhumandefinitionsandperceptionofaudio. Someaspects of theexperiments are based on approximate definitions of acoustic properties. This meta understanding are in terms of properties of audio types, and language level descriptions. As such systems are implemented, it will be seen that the same three fundamental audio processing procedures are revisited using classical signal processing techniques. 19 1.4 Organization of the Dissertation The next chapter begins by laying the groundwork. It describes the various features commonlyusedinaudioprocessingindetail. Thenadetailedliteraturereviewdiscusses the state-of-the-art in audio processing techniques. The review is organized in terms of audio content. Later, they are also discussed in terms of implementation schemes. In many cases, a given technique is compared with others in the given domain of ap- plication. It will be apparent that in almost all the cases the basic implementation discussedin section 1.2 is followed. In thechapters following the review, thedescription based methods proposed here are discussed. As mentioned previously, all the methods are implemented by a data-driven method using conventional pattern classification al- gorithms. For each application appropriate performance assessment is also presented. Each chapter beginswithabriefsummaryandat theendit alsohasrelevant conclusion and proposed future work. This dissertation concludes with a brief reiteration of the work presented and possi- ble future directions of research. Finally an approximate time-line is presented for the completion of the thesis work. 20 Chapter 2: Background What we observe is not nature itself, but nature exposed to our method of questioning.- Werner Heisenberg 2.1 Chapter Summary In this chapter, various feature sets that are used for analysis of audio signals is pre- sented. Both statistical and perceptual features are covered and they are grouped by the domain of the measure: temporal, spectral and perceptual. It covers the sets that are widely and successfully used in the research community. These features are popular in the techniques presented in chapter 3. First, the concept of short-time analysis of audio signals is discussed. Thisstep is incorporated in thepreprocessingand/or feature extraction stage of the audio processing system. 21 2.2 The First Step: Short-time Analysis of Audio Signal In subsection 1.2.3, it was mentioned that to extract features, a short-time sequence of an audio stream is windowed before the feature extraction stage. This helps to convert a process which is a function of time into a set indexed by time. This can be mathematically represented as: x(n)→x k ∀ k·L<n<(k+1)·L (2.1) A more practical view of this (that can be implemented) is to extract a time windowed segment of the audio stream. 
Following [50],

x_i = \sum_{m=-\infty}^{\infty} x(m)\, w(m-i), \quad \text{where} \quad w(m) > 0 \;\; \text{for} \;\; 0 \le m \le N-1, \;\; w(m) = 0 \;\; \text{otherwise} \qquad (2.2)

Any audio signal continuously varies in amplitude, frequency content, phase and the properties of the source generating the sound. A temporal representation of a sound source consists of time-synchronized amplitude variation. This x_i is essentially a set of samples of the input audio stream in which the time index of the samples is maintained. The analysis is based on the premise that audio signals change relatively slowly, i.e., within this time window the audio signal is relatively constant in terms of the above-mentioned factors. This is a valid premise if the time reference of analysis is relatively fast; a time window of 10 milliseconds to 1 second is typical. Analysis is performed by extracting successive time windows, and the windows may even overlap in time.

Short-time analysis helps in the feature extraction stage because (1) enough samples are usually present in each window for a reliable frequency estimate (for many spectrum-based features), and (2) it facilitates mapping this time-indexed windowed audio signal from the time domain to a (usually) higher-dimensional feature domain.

2.2.1 Temporal Analysis Features

2.2.1.1 Short-term Average Energy (E):

This is the sum of squared samples within a short time window. It is a simple way to measure the amplitude or loudness of a given segment of audio. It was initially used in speech processing to detect voiced sections in a speech signal. With appropriate thresholds, silence or very low loudness sections of a clip can also be detected. It is commonly assumed that sections of silence or low-level segments in a clip are part of the "background" and are not relevant to the actual "foreground" of the audio clip; hence the corresponding frames are discarded. Mathematically, the short-term energy of the i-th frame is given by:

E_i = \sum_{m=-\infty}^{\infty} \left[ x(m)\, w(i-m) \right]^2 \qquad (2.3)

From the definition in equation (2.2), since w(i) is a finite-length sequence, E_i is limited in observation to its length.

2.2.1.2 Short-term Average Zero-Crossing Rate (ZCR):

The ZCR is another important temporal feature that was also developed for voiced/unvoiced detection in speech processing [50]. It is commonly used in audio processing for speech/music discrimination [53],[35],[47]. The expression for the ZCR of the i-th audio frame is:

Z_i = \sum_{m=-\infty}^{\infty} \left| \mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)] \right| w(i-m) \qquad (2.4)

Again, due to equation (2.2), Z_i is a short-time measure. Here, sgn[·] is the signum function, and w(i) = 1/(2N) for 0 \le i \le N-1. The ZCR is closely related to the instantaneous frequency of the signal and is also highly correlated with the spectral centroid.

2.2.2 Spectral Analysis Features

2.2.2.1 The Spectrogram

The spectrogram is a time-frequency representation of a long audio signal. It is the natural next step from analysis of short-term temporal properties to short-time spectral properties. It is based on the concept of the short-term Fourier transform, and it is an important first step for measuring spectral properties of audio signals. An indexed discrete-time Fourier transform (DTFT) is given as [50]:

F_i(e^{j\omega}) = \sum_{m=-\infty}^{\infty} x(m)\, w(i-m)\, e^{-j\omega m} \qquad (2.5)

Here w(n) is as defined in equation (2.2), i is the time index and \omega is the frequency variable of the Fourier transform. Since each window typically lasts 20-100 milliseconds, this equation is essentially a sequence of Fourier transforms of a long audio signal, one every window duration.
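To make the short-time machinery above concrete, the following minimal Python/NumPy sketch frames a signal and computes the short-term energy, the zero-crossing rate and a magnitude spectrogram. It is only an illustration of equations (2.2)-(2.5), not the implementation used in this work; the 25 ms frame length, 10 ms hop, Hann window and FFT size are assumed values chosen for the example.

import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (rectangular framing)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def short_term_energy(frames):
    """Sum of squared samples in each frame, cf. equation (2.3)."""
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent samples whose signs differ, cf. equation (2.4)."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                    # treat exact zeros as positive
    return np.mean(np.abs(np.diff(signs, axis=1)) / 2.0, axis=1)

def magnitude_spectrogram(frames, n_fft=1024):
    """Magnitude DFT of each Hann-windowed frame, i.e. a spectrogram."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))

# Example: 1 second of a 440 Hz tone at 16 kHz, 25 ms frames with a 10 ms hop.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(x, frame_len=int(0.025 * fs), hop=int(0.010 * fs))
E = short_term_energy(frames)
Z = zero_crossing_rate(frames)
S = magnitude_spectrogram(frames)            # rows: time index i, columns: frequency bin k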
A time-indexed frequency representation (a time-frequency representation) is concerned with the spectral content at each frequency. Therefore, it is common to use the magnitude of the complex frequency coefficients obtained from the transform of equation (2.5). A more practical time-frequency representation using the discrete Fourier transform (DFT) is given by:

F(i,k) = |F_i(k)| = \left| \sum_{j=0}^{N-1} y_i(j)\, e^{-j 2\pi \frac{k}{N} j} \right| \qquad (2.6)

where y_i(j) is the N-length sequence of the i-th windowed signal,

y_i(j) = \left\{ x(m)\, w(m-i) \right\}, \quad 0 \le j \le N-1 \qquad (2.7)

F(i,k) is visualized as an image, F(i,k) \leftrightarrow I(x,y), where the colour of each pixel is scaled according to the magnitude at the (i,k)-th point. This is illustrated in figure 2.1. All spectral features are based on the spectrogram of the signal; they are all statistical measures of the time-frequency content of the signal. Common measures used for audio processing are discussed next.

2.2.2.2 Spectral Centroid (SC):

The spectral centroid is the weighted mean frequency of a given frame of audio. The weights are the magnitudes of the corresponding frequency points of a short-time discrete Fourier transform. For our work we use an 8192-point fast Fourier transform (FFT) to calculate the spectral centroid. This feature is strongly correlated with the ZCR [47] and is useful for distinguishing sources whose energy is skewed towards the higher frequencies. Mathematically, the SC of the i-th frame is:

SC_i = \frac{\sum_{k=0}^{N/2} k\, |F(k)|^2}{\sum_{k=0}^{N/2} |F(k)|^2} \qquad (2.8)

Here |F(k)| is the magnitude of the k-th point of an N-point discrete Fourier transform (DFT). Indirectly, SC is a measure of the brightness of the underlying audio signal. Acoustic sources with content in the higher frequencies are perceived to be brighter (a perceptual term describing properties similar to turning down the bass knob and turning up the treble setting of a radio).

2.2.2.3 Bandwidth (BW):

This is a weighted mean of the squared deviation from the SC of a given audio frame. The weights are again the magnitudes of the corresponding frequency bins of a short-time DFT. For the i-th frame it is given by:

BW_i = \frac{\sum_{k=0}^{N/2} (k - SC_i)^2\, |F(k)|^2}{\sum_{k=0}^{N/2} |F(k)|^2} \qquad (2.9)

where N, F(k), SC_i and k are as defined previously.

2.2.2.4 Linear Prediction Spectrum Slope (LPSS):

The slope of the line passing through the maximum point and the minimum point in the magnitude of a low-order linear prediction coefficient (LPC) spectrum is the Linear Prediction Spectrum Slope (LPSS). The LPSS of the i-th frame of audio is given by:

LPSS(i) = \frac{|\theta_{max}| - |\theta_{min}|}{F_{\theta_{max}} - F_{\theta_{min}}} \qquad (2.10)

Here |\theta_{max}| is the maximum of the 4th-order LPC spectrum and |\theta_{min}| is the minimum; F_{\theta_{max}} and F_{\theta_{min}} are the frequency bins at which the maximum and minimum occur, respectively. The LPSS is simply the 'slope' of the magnitude spectrum of a given frame of audio. It is scaled appropriately to match the order of magnitude of the SC values.

2.2.2.5 Spectral Roll-off Frequency (SRF):

The SRF is defined as:

SRF_i = k, \;\; \text{such that} \;\; \sum_{j=0}^{k} |F(j)| < P \cdot \sum_{j=0}^{N/2} |F(j)| \qquad (2.11)

Its usefulness has been established in speech/music discrimination [53] and in the general audio classification task of speech versus non-speech classification [35]. P is a threshold expressed as a fraction; its value is usually just under 1.0.

2.2.2.6 Band Energy Ratio (BER):

This can be defined in many ways; here a generic definition is established. A 'band' in the spectrum can be represented by the pair (f_1, f_2), where the edge frequencies of the band satisfy f_1 \le f_2.
The BER is then the ratio of the total energy contained within the frequency band to the total energy of the audio frame over all frequencies. It is written as:

BER_i = \frac{\sum_{k=K_1}^{K_2} |F(k)|^2}{\sum_{k=0}^{N/2 - 1} |F(k)|^2} \qquad (2.12)

Here, K_1 and K_2 are the frequency points corresponding to f_1 and f_2 in the DFT of the given audio frame.

2.2.2.7 Spectral Flux Changes (SF):

Spectral flux is defined as the change in spectrum from one frame to the next. Mathematically it is defined as:

SF_i = \sum_{\forall k} |F_{i+1}(k) - F_i(k)| \qquad (2.13)

Changes in the spectrum are brought out by the SF measure. It is commonly used in speech/music discrimination, where music, due to its variability, shows more changes in SF than, say, speech.

2.2.3 Perceptual Features

2.2.3.1 Pitch

Pitch is an important perceptual measure of the fundamental frequency (F0). In most cases, it is defined as the perceived fundamental frequency. While the differences between pitch and the fundamental frequency are well documented [25], for all practical purposes speech and audio processing systems rely on measuring F0 as a measure of pitch. Common techniques for estimating pitch are based on autocorrelation, linear prediction coefficients (LPC), the cepstrum [49, 50], the wavelet transform [32], etc.

Pitch information is commonly used for applications such as automatic speaker identification, music transcription and content analysis, query by humming, etc. In more general cases, such as audio classification of general sources, pitch information is useful because most sources are either harmonic (musical instruments such as piano or flute, voiced portions of human speech, bird calls, etc.) or non-harmonic in nature (noisy sources such as the sound of a vehicle, rain, etc.). Even within the harmonic category, some sources are perceived to have higher pitch levels (such as musical instruments) than others (such as speech).

2.2.3.2 Mel-frequency Cepstral Coefficients (MFCC):

These coefficients are the result of a cepstral analysis of the magnitude-squared spectrum on the Mel-frequency scale. They were originally proposed in [18] for speech recognition. The perceptual aspect of this feature lies in the Mel-frequency scale (a logarithmic axis) that approximately models the filter-bank analysis in the human ear. The steps for calculating MFCCs for a given frame of audio are illustrated in figure 2.2. There are two additional measures, popular in speech and audio processing systems, that capture the time-varying properties of the perceptual spectral features: the MFCC-delta (Δ) and the MFCC-(Δ-Δ) measures. These are simply the first and second derivatives of the MFCCs.

2.2.4 Bio-Inspired Cortical Representation (CR)

Recently, features based on modelling the processing at the auditory cortex (AC) have been developed for audio classification. Figure 6.1 illustrates the processing stages, which finally result in a tensor representation. The output of the early auditory system is a time-frequency representation of the input signal. Here the input sound signal is filtered by the basilar membrane at different centre frequencies along the tonotopic frequency axis, followed by a differentiator stage, a non-linearity, low-pass filtering and, finally, a lateral inhibitory network (LIN) [72]. The differentiator, non-linearity and low-pass filtering are known as the transduction stage; it essentially models the transformation of the output of the basilar membrane by the hair-cell stages.
The LIN is assumed to be a neural network model in the cochlear nucleus. The LIN stage is modelled in three steps: a derivative with respect to the tonotopic axis, a half-wave rectification operation and, finally, a leaky temporal integration. This is the input to the central auditory system, which is analogous to the early auditory system, except that all the transformations are along the tonotopic frequency axis. The processing is modelled as a double affine wavelet transform of the frequency axis at different scales and phases. The mother wavelet function is the negative second derivative of the normal Gaussian function. The result of this analysis is a 3-mode tensor A(f : f_c, \phi, \lambda) \in \mathbb{R}^{D_1 \times D_2 \times D_3}. Here, f is the tonotopic frequency at different centre frequencies f_c, \phi is the symmetry (or phase) and \lambda is the scale or dilation factor of the wavelet function.

The main differences between this representation and the MFCCs are that (1) MFCCs roughly model the processing in the peripheral auditory system, whereas the CRs model the processing in both the early and the central auditory system with more accuracy, and (2) the CRs also result in a higher-resolution, multi-scale analysis that captures the local time-frequency properties of the signal better than MFCCs (including the Δ and Δ-Δ measures). An example of this is shown in chapter 6. Since the representation involves high-dimensional data, data reduction techniques such as principal component analysis (PCA) (used in [45]) and discriminant analysis (in chapter 6) are employed.

2.3 Concluding Remarks

It can be seen that almost all the features described are designed around, and based on, short-time frame analysis of an input audio signal. This results in a trade-off between time and frequency resolution. In many cases, the choice of the window length for the short-time analysis is critical; system performance depends on it, as it affects the measure of change in an audio signal.

The reason for using multiple features is quite simple. Each feature extracts information that is characteristic of an audio source. Each source is assumed to be acoustically unique because of its unique sound production mechanism. From the perspective of capturing audio, no information about how the sound is generated is available. All the measurements and inferences (about the source and the production mechanism) are drawn from the captured pressure variations of the sound. Information about the position of the source, the acoustics of the environment or setting, the type of source, the semantic information contained in the source (say, the words in speech), identification, individual traits, and the relation with other sources (essentially all the acoustic information) is contained in the one-dimensional captured sound signal. Developing and using feature measures such as those presented in this chapter allows us to indirectly measure different aspects of the way the sound was generated, resulting in a unique qualification by the given feature vector.

As will be seen in chapter 3 and in the research work presented here (chapter 4 onwards), the information extracted from the audio signal is a combination of more than one of the feature sets described here. There is no hard and fast rule for selecting an appropriate feature for a given problem; most systems use a combination of the measures presented here (directly, or as modified higher-order statistics).
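As a companion to the definitions in this chapter, the sketch below computes several of the spectral measures from a magnitude spectrogram such as the one produced by a framing step like the earlier example. It is illustrative only: bin indices stand in for frequencies in Hz, the roll-off fraction P = 0.95 and the band (K1, K2) = (10, 50) are arbitrary example values, and the small constant added to the denominators simply guards against division by zero.

import numpy as np

def spectral_features(S, roll_off_p=0.95, band=(10, 50)):
    """Frame-wise spectral measures from a magnitude spectrogram S of shape
    (n_frames, n_bins), in the spirit of equations (2.8)-(2.13)."""
    power = S ** 2
    k = np.arange(S.shape[1])

    centroid = (power @ k) / (power.sum(axis=1) + 1e-12)                    # cf. (2.8)
    bandwidth = ((power * (k - centroid[:, None]) ** 2).sum(axis=1)
                 / (power.sum(axis=1) + 1e-12))                             # cf. (2.9)

    # Roll-off: first bin at which the cumulative magnitude reaches the fraction P.
    cum = np.cumsum(S, axis=1)
    rolloff = np.argmax(cum >= roll_off_p * cum[:, -1:], axis=1)            # cf. (2.11)

    k1, k2 = band
    band_energy_ratio = (power[:, k1:k2 + 1].sum(axis=1)
                         / (power.sum(axis=1) + 1e-12))                     # cf. (2.12)

    flux = np.abs(np.diff(S, axis=0)).sum(axis=1)                           # cf. (2.13)
    return centroid, bandwidth, rolloff, band_energy_ratio, flux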
Figure 2.1: The spectrogram: conversion of time-indexed frames to a time-indexed time-frequency representation.

Figure 2.2: The processing blocks used to obtain Mel-frequency cepstral coefficients (MFCCs) for an acoustic signal.

Chapter 3: Literature Review

3.1 Chapter Summary

In this chapter the state of the art in audio processing is summarized. The review starts with the most commonly tackled problem in audio classification: speech versus non-speech classification. This is a fundamental first step in most elaborate classification systems because it reduces the computational complexity and, in most cases, allows for a systematic hierarchical classification scheme. For instance, once a segment of audio is determined to be speech, further classification procedures are unnecessary; if it is non-speech, the system can proceed with further classification, such as music, noise or any other acoustic source. This is dealt with in the later subsection on classification of general audio sources. Finally, techniques in multimodal analysis of multimedia data (audio + video) for basic classification and categorization are presented. In these techniques audio is seen as an integral part of multimedia-rich content (such as video presentations, movie scenes, etc.). Since the audio is synchronized with the video, video scene classification is tackled in a more simplified manner by classifying audio and drawing context inferences from the video.

3.2 Speech versus Non-Speech Discrimination

This is a classical problem for audio processing techniques. Speech and music are distinct audio classes common in commercial media (for example, radio shows, television programmes, etc.). The signals are assumed to contain audio of speech or musical instruments, and in most cases speech and music signals are assumed to be non-overlapping in time. Due to the easy and large-scale availability of standardized training and test material, this classic problem has been solved in a variety of ways, and it also serves as a benchmark for assessing the performance of new feature sets and techniques. It is of importance because it serves as a front end to many audio processing systems. As mentioned above, this essentially reduces the computational load and implementation complexity of the core pattern classification block of an audio processing system. It can also be used as an additional information source for efficiency gains. For example, in [62] the authors propose to use a speech/music discriminator as a front end to their query-by-humming (QBH) system, where it distinguishes between sung and instrumental queries. It can be used to generate thumbnails and highlights for long, complex audio segments with numerous sources, and other, more complex audio classification systems that also cover speech and music can use it. Therefore, depending on the particular application, this problem can be solved in a variety of ways, using simple to complex features, and with anything from real-time online systems to offline methods. First, discrimination systems based on signal properties are discussed; then other systems that use higher-level abstraction for classification are presented.

In [53], the author proposes a set of perceptual acoustic features and statistical signal features for the discrimination of speech and music. The effectiveness of the feature set (mainly the 4 Hz energy modulation and the spectral flux) is measured using different classifier approaches, namely GMM-based MAP, k-NN and k-d trees.
In the presented cases they achieved a classification accuracy of more than 90%. While it may seem that the authors have unnecessarily used many features for a reasonably separable two-class problem, the feature set can easily be incorporated into a larger audio classification system that uses some or all of the features from their work. The performance was assessed by using individual known segments of speech or music and examining the classification results. In [47] the authors assume additional system constraints, such as the ability to function in a real-time environment with low latency. They approach the problem using statistical signal dissimilarity between speech and music by measuring two time-domain features: root-mean-squared (RMS) energy and zero-crossings. The RMS energy feature is used to segment the signal (change-point detection) and the zero-crossing rate is additionally used to classify the signal as speech or music. The accuracy of their system was 95%, with a segmentation accuracy of 97% and at most 20 milliseconds of temporal error. The performance was measured in terms of segmentation and identification accuracy. In [9] the authors compare statistical and neural-network approaches to speech/music discrimination. For the statistical approach they implement a Bayesian classifier using statistics measured from the zero-crossing rate (ZCR); for the neural-network approach, a multilayer perceptron (MLP) using 8 feature sets. The raw classifier outputs are further processed using a regularization procedure based on empirical observation. The performance of the system was measured by calculating the fraction of the duration of correctly classified segments. Their system has an error rate of about 10% for the MLP implementation and about 20% for the Bayesian classifier based on the ZCR.

In [68] the authors design a new set of four features (mean per-frame entropy, average probability, background-label energy ratio, and phone distribution match) from the posterior probabilities of an acoustic classifier. These features represent a form of post-processing and a higher abstraction of raw classifier outputs. The classifiers were trained neural-net classifiers that output the posterior probability of membership in the speech/music class. The final classifier decision was based on a likelihood ratio test. Essentially, this is a form of higher-level processing to draw inference about the underlying acoustic signal. The performance was evaluated on hand-labelled segments of speech and music, with an effective accuracy of over 90%. They attempt to extend the same posterior-probability features to detecting singing voice in popular music; however, in this case the results were not as good as with simple cepstral features. This reflects the fact that post-processing raw classifier outputs, in some cases, yields highly streamlined results that are not easily generalizable. In [31] the authors implement a speech/music discriminator based on the same approach as in [68]. Here also the feature set was based on post-processing of the classifier output. Specifically, they design their classifier to detect phonemes, which are higher-level speech units beyond signal-level features. Their measures are based on the duration of consonants and vowels detected within a window, consonant/vowel rates, changes in consonant-vowel patterns, etc. They evaluate their systems on a broadcast news database with an aggregate frame-level accuracy of 92-95%.
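Several of the systems surveyed above ultimately reduce to thresholding one or two frame-level measures. As a purely schematic illustration of that idea (and not a reimplementation of any of the cited systems), the sketch below labels a clip as speech or music from the variability of its frame-wise zero-crossing rate; the frame length and threshold are invented placeholder values that would need to be tuned on real data.

import numpy as np

def classify_clip(x, fs, frame_ms=20.0, zcr_std_threshold=0.05):
    """Toy clip-level speech/music decision. Speech alternates between voiced and
    unvoiced or silent frames, so its frame-wise ZCR tends to vary more than music's."""
    frame_len = int(fs * frame_ms / 1000.0)
    n_frames = len(x) // frame_len
    zcrs = []
    for i in range(n_frames):
        frame = x[i * frame_len:(i + 1) * frame_len]
        signs = np.sign(frame)
        signs[signs == 0] = 1              # avoid spurious zero "crossings"
        zcrs.append(np.mean(np.abs(np.diff(signs)) / 2.0))
    return 'speech' if np.std(zcrs) > zcr_std_threshold else 'music'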
3.3 Music Content Analysis

Music content analysis mainly involves summarizing or highlighting through musical structure analysis, grouping similar musical segments, and problems such as genre identification. Structure analysis is performed by correlation, which essentially measures the similarity between a given section of a music piece and other sections. A good illustration of this method of analysis is presented in [16]. The similarity within a clip is presented as a matrix whose (i,j)-th entry is the similarity between the i-th and j-th segments of the clip. The similarity is based on a measure of 'closeness' between the feature vectors extracted from the individual segments. Summarization is performed by extracting self-similar, repeated sections based on the structure matrix. This approach was initially followed by the authors in [6] and [13] for thumbnailing; in [6] the authors use a chroma representation instead of just a dot-product similarity. Music similarity analysis is another problem that requires finding similarities between different musical clips and does not necessarily involve analysis of repeating structures within a given piece. In [38] the authors present such a problem: playlist generation. The problem is to create a playlist from a large number of songs in which songs that are similar to each other appear together as a contiguous block. They measure similarity by extracting a spectral signature for each song and comparing the signatures. The comparison is made using a histogram comparison method, where the effort taken to change one spectral signature (extracted MFCC vectors represented as a histogram) into another is taken as the distance between them. For a given query song, the system returns the N most similar songs out of a large pool.

Automatic genre classification is a problem dealt with in music information retrieval (MIR), particularly in the domain of popular music. The problem is to determine the type or category of a given audio clip of music. The categories of music are based on rhythmic structure, musical instruments, and content. It is harder than direct audio classification problems because musical genres have highly overlapping definitions. Music genre classification systems have been implemented using timbral content features (such as MFCCs, SC, SRF, ZCR, etc.), rhythmic content features (beats, tempo [29, 42]) and pitch. These features are commonly used in speech/non-speech discrimination (see the previous section for literature). In [64] the authors achieve 61% accuracy for ten genres (classical, country, disco, hip-hop, jazz, rock, blues, reggae, pop, metal). This indicates that the said features, while they may be excellent for speech/non-speech discrimination, do not necessarily work for other representations of audio classes (in this case, music genre). In [36] the authors improve the classification results to up to 80% by using histograms of Daubechies wavelet coefficients (DWHCs) instead of the features used in [64]. In [65] the authors use a similar approach, using histograms of time-frequency transformations to classify music genres; they achieve an accuracy of over 91%.

3.4 Classification of General Sound Sources

This category can contain any acoustic source. Acoustic signals from musical instruments, engine noise, dog barking, machine-gun fire, bird tweets, etc. are a handful of examples; it also covers speech and music signals. A typical example of a content-based method is [23, 69].
Here the authors evaluate their system on a database of animals, bells, crowds, female voices, laughter, machines, male voices, percussion instruments, telephones, water sounds, etc. However, since the problem of general sound sources contains innumerable audio classes, processing techniques are made implementable by grouping the sources into categories. Categorizing sounds into groups such as animals, environmental noise, music, etc. makes classification feasible. For example (as illustrated in figure 3.1), in [73] the authors present a hierarchical, generalized extension of the speech/non-speech discrimination problem. They group the sources into categories such as silence, with music components, and without music components. They also present results on grouping into pure speech, pure music, song, speech with music background, sound effects with music background, and harmonic and non-harmonic sound effects. In [37] a hierarchical scheme is also used, grouping sources into speech, music or environmental sounds. The authors first determine whether a given segment is speech or non-speech; speaker change-point detection is then performed on speech segments, while non-speech segments are classified as music, environment or silence. Next, other techniques with different objectives that essentially deal with the classification of general audio sources are discussed.

Figure 3.1: Typical hierarchical methods used in content-based approaches to audio classification of general sources. Top from [23] and bottom from [73].

3.4.1 Audio-Based Context Recognition

Context recognition primarily deals with the recognition and classification of general audio sources. In this newly recognized application the aim is to classify a given segment (usually of duration 1 second or more) into one of a set of previously known auditory scenes. It is based on the principle that certain acoustic sources are specific to certain scenes and settings. For instance, a meeting-room scene would predominantly contain audio of multiple speakers. Some examples of other scenes dealt with here are, outdoors: street corners, market places, construction sites, etc.; indoors: meeting rooms, offices, libraries, home living rooms, kitchens, etc.; public places: restaurants, cafeterias, shopping malls, etc. In [21] the authors set up the problem at two levels: the actual low-level context (examples shown previously) and a higher-level category such as outdoors, vehicles, public/social places, offices, home environments, etc. They compare the performance of their system with the recognition ability of actual human listeners. The results indicated that the human listeners were superior by more than 10% compared with the implemented system (average accuracy of 58%) for the low-level context. However, when the higher-level context was considered, the performance of the system and the human listeners was comparable (about 85%).

3.4.2 Audio-Based Methods in Video Segmentation

In multimedia content, audio categories are defined by the accommodating video scenes: for example, cheering scenes, fighting/violence scenes, quiet scenes, scene change detection, broadcast news, topic summarization, etc. Sometimes, information about video scene changes is detected indirectly by tracking changes in audio. In [11] the authors develop a two-level system applicable to video summarization by detecting key audio effects. This is based on the assumption that such key effects provide unique cues to the underlying video scene.
Using three audio effects (laughter, applause and cheering), the highlights are extracted based on a second-level sound-effect attention model that is used to model the salience of an effect. They have extended this work to more effects, such as explosions, gun-shots, vehicle sounds and sirens [52]. Here the authors attempt to bridge signal-level features and audio context (a semantic-level description) by following a cue-based approach, their premise being that certain audio effects are important indicators of a given scene. In the proposed system, they devise a flexible framework to deduce the audio scene by detecting special audio effects using hidden Markov models (HMMs). The authors extract higher-level context information indirectly by detecting key effects such as applause, car racing, car crashes, cheering, explosions, gun-shots, helicopter sounds, laughter, airplane sounds and sirens. At a higher level they infer the context using a set of heuristic rules; for example, cheering and applause represent excitement, while gun-shots and explosions indicate a war scene. Their performance accuracy is about 80%. In [70], the authors implement a system based on the same ideas discussed previously. They attempt to extract salient highlights in sport videos by tracking reliable audio cues such as applause, speech with music, cheering and music. Their premise is that these audio events precede important parts of the sporting event (e.g. a soccer goal or a home run in baseball). These problems are also related to data fusion ([28] and references therein). In [28] the authors perform shot segmentation of video clips using both information from audio segmentation (into speech, music and environmental sounds) and video segmentation (using colour correlation information). They achieve an accuracy of over 90%. In [44] the authors classify video genre using information extracted from audio. They basically rely on a content-based method similar to the approach followed in context recognition. Specifically, they classify video segments into sports, cartoon, news, commercial and music scenes. The actual classification is done on reasonably long audio segments extracted from each of the video genres. They achieve their best performance, about 80%, for 25-second segments.

3.5 Concluding Remarks

What has been seen so far is that audio processing as a whole involves interpretation and identification at a higher level. It is not a simple case of identifying sources with distinct characteristics. Audio processing problems inherently deal with abstract definitions and representations that are not well defined and are highly overlapping. It is a hard and interesting problem to systematically measure these representations so that the underlying acoustic process can be fully discovered and understood.

The methods and frameworks discussed here can be organized into statistical techniques, heuristic rule-based methods, and methods based on empirical observations; some of them use more than one of these approaches. Yet another way of looking at these techniques is in terms of the assumptions made about solving a given problem. In this view, content-based audio classification methods assume that a given audio stream can be completely analyzed by identifying the individual sources. Hierarchical methods try to cover all the acoustic sources by organizing them in a hierarchy (an example is [73]).
In the work presented in this dissertation, another approach to audio processing, based on perceptual descriptions and interpretations of audio properties, is presented.

When dealing with simpler problems such as speech and music discrimination, existing signal-level measures and features work well. However, when the problem is shifted to higher-level definitions (such as, say, music genre identification), existing features fall short of measuring the audio signals along the lines of descriptions of perceptions. This broad look at the state of the art in audio processing techniques can be taken as evidence for the initial statements made in this dissertation: there is no single solution to the various problems tackled in audio processing. Both existing and future techniques can be implemented by modelling auditory processing by humans. How much of it is modelled, and at what level, depends on the problem presented and the complexity of the proposed system.

An obvious and widely used approach to characterizing a complex audio scene is to recognize each of the possible constituent classes through deterministic or statistical methods. This works especially well for closed-set problems where the number of identifiable classes is limited and a given audio event is always known to belong to one of the previously known classes. In more open problems, where the number of acoustic events and possible audio scenes is large, and perhaps unseen, it would become tremendously complex to implement such a scheme. Consider the classes speech and non-speech audio. This definition is very generic, since any and all audio data can be considered to be speech or non-speech. While it is useful for reducing the complexity of other classification systems, it is of little use by itself. To incorporate both generic and more numerous classes, more heuristic rules, such as those implemented in [37] for classification, would be required. Related work on this problem of content analysis of audio is presented in [23, 73]. Usually in these implementations, each acoustic source is labeled as a unique audio class and the core challenge is to identify these sources. Also, in methods such as [47, 53, 73] the discrimination system and the method of analysis are based on the underlying assumption that the two classes are non-overlapping in time. Essentially, these create distinct classes that are separable using the features, and the problem is limited to discriminating one class from another. Related work in description-based approaches to audio typically involves an application-tailored choice of descriptions trained on corresponding data. For example, in [10] the author uses probabilistic descriptions specific to music. The approach proposed in [55] involves directly tying semantic-level descriptions to signal-level Gaussian mixture models built on extracted feature vectors. These approaches are relevant to the present work, but they rely on statistical similarity amongst acoustic sources and are not easily generalizable to large classes of acoustic sources.

The wide variety of methods covered in this chapter highlights the fact that every audio information processing system requires a working definition of an audio event and change.
In the case of speech/non-speech discrimination, it is the speech source versus other sources; for music content analysis, it is segments of self-similar sections of a song; in the case of context recognition, it is whole (long duration) audio scenes; and in the case of video scenes, it is salient audio events such as applause or cheering. The definitions are made based on the specific application at hand. Each system also solves the problem in its own way: based on signal statistics, heuristic rules, or empirical observations derived from the human auditory system.

The description-based approach proposed in this dissertation also attempts to solve these common problems in audio information processing using a combination of statistical measures, heuristic rules and/or empirical observations. However, the fundamental difference lies in the definition of an event and change. As illustrated previously, the methods proposed here do not rely on a specific, distinct definition of an audio event, but on human-interpreted descriptions of acoustic sources and their perceived properties. Details of the implementation of the proposed methods, and specific differences from the contemporary systems listed here, are discussed in the next chapters.

Chapter 4: Audio Representation Using Words

4.1 Chapter Summary

In this chapter, results on the organization of audio data based on descriptions using onomatopoeia words are presented. Onomatopoeia words are imitative of sounds: they directly describe and represent different types of sound sources through their perceived properties. For instance, the word pop aptly describes the sound of opening a champagne bottle. First, this type of audio-to-word relationship is established by manually tagging a variety of audio clips from a sound effects library with onomatopoeia words. Using principal component analysis (PCA) and a newly proposed distance metric for word-level clustering, the audio data representing the clips is clustered. Due to the distance metric and the audio-to-word relationship, the resulting clusters of clips have similar acoustic properties. It was found that, as language-level units, the onomatopoeic descriptions are able to represent perceived properties of audio signals. This form of description can be useful in relating higher-level descriptions of events in a scene by providing an intermediate perceptual understanding of the acoustic event. Also, by discriminant analysis of the clusters at the feature level, the separability of the clusters is analyzed. In terms of feature-level representation, clusters formed by words such as buzz, fizz, etc. are better represented by signal features than percussive sounds such as clang, clank and tap.

4.2 Introduction

Automatic techniques are required to interpret and manage the ever-increasing multimedia data that is acquired, stored and delivered in a wide variety of forms. In interactive environments involving humans and/or robots, data is available in the form of video/images, audio and a variety of sensors, depending on the nature of the application. Each of these represents a different form of communication and a variety of expressions. To utilize and manage them effectively, it is desirable to organize, index and label these forms according to their content. Language (rather, textual) description or annotation is a concise representation of an event that is useful in this respect. It makes the audio and video data more presentable and accessible for reasoning and/or search and retrieval.
In these cases, it is desirable to compute using both words (for user queries) and signal-level measures (for automatic retrieval). This also aids in developing machine listening systems that can use aural information for decision-making tasks. The work presented here mainly deals with the ontological representation and characterization of different audio events. While the recorded data is stored in a signal feature space (in terms of frequency components, energy, etc.) for automatic processing, text annotation represents the audio clip in a semantic space. The underlying representations of an audio clip in the signal feature space and in the semantic space are different. This is because the feature vectors represent signal-level properties (frequency components, energy, etc.), while in the semantic space the definition is based on human perception and context information. This semantic definition is often represented using natural language in textual form, since words directly represent 'meaning'. Therefore, natural language representations of audio properties and events are important for the semantic understanding of audio, and they are the focus of this chapter.

In content-based processing, natural language representations are typically established by a naive labeling scheme where the audio data is mapped onto a set of pre-specified classes. The resulting mapped clusters are used to train a pattern classifier, which is eventually used to identify the correct class for given test data. Examples of such systems are in [23, 37, 73]. While such an approach yields high classification accuracy, it has limited scope in characterizing generic audio scenes, save for situations where the expected audio classes are known in advance. For instance, consider the phrase Nail Hammered as a label (an example from the BBC Sound Effects Library [1]). It indicates that the audio clip is the sound of a nail being hammered, but it does not describe the acoustic properties of the event. However, the underlying automatic processing is based on similarities in acoustic properties. This inherent ambiguity needs to be reconciled by the automatic audio classification system.

Other techniques for retrieval that better exploit semantic relations in language are implemented in [12]. Here the authors use WordNet [43] to generate word tags for a given audio clip using acoustic feature similarities, and also to retrieve clips that are similar to the initial tags. While such semantic relations in language are important in building audio ontologies, they are still sufficiently insulated from the signal-level properties that directly affect the perception of sources. In this work, by contrast, an approach using linguistic descriptions that are closer to signal-level properties is presented. This is implemented using onomatopoeia words present in the English language. These are words that are imitative of sounds (as defined by the Oxford English Dictionary). The rationale is that these descriptions provide more intuitive (perception-based) but less ambiguous lexical descriptions to aid automatic classification. For example, the audio clip of Nail Hammered can be better described by tap-tap, which provides more direct information about the acoustic properties of the event.

The experiments presented in this chapter have three objectives:

1. To develop a distance metric to analyze the relationship amongst onomatopoeia words and thus cluster them.
The ability to cluster these words in a quantitative space makes them useful as a meta-level representation for computing with words for audio retrieval.

2. To create a vector representation of audio using its word-based descriptions (using the proposed distance metric mentioned above). This helps in computing in terms of words for user-query-based retrieval systems.

3. To measure the effectiveness of common acoustic signal features in representing the resulting clusters. For this, the features extracted from a selection of audio clips are clustered by two methods: (a) using information from the onomatopoeia word clusters mentioned above, and (b) by unsupervised clustering of the whole collection of extracted feature vectors.

The presentation of the idea is as follows. First, the onomatopoeia words are represented as vectors in a 'meaning space'. This is implemented using the proposed inter-word distance metric. Then various clips of acoustic sources from a general sound effects library are tagged (offline) with appropriate onomatopoeia words. These words are the descriptions of the acoustic properties of the corresponding audio clips. Using the tags of each clip and the vector representation of each word, the audio clips are represented and clustered in the meaning space. Using an unsupervised clustering algorithm and a model-fit measure in this vector-based representation, the clips are then clustered according to their representation in this space. The resulting clusters are both semantically relevant and share similar perceived acoustic properties. Some examples of the resulting clusters are also presented.

For 3(a) above, the audio clips are clustered differently, using a voting scheme. In this case, the voting scheme clusters the audio clips according to inter-word relations and not according to the representation of the clips. These word-based clusters are compared (in the feature space) to clusters formed by unsupervised clustering of the feature vectors extracted from the clips (referred to as "raw grouping"). The two methods of clustering are compared using a Gaussian maximum a posteriori (GMAP) classifier after multiple discriminant analysis (MDA) [20]. The clustering in 3(a) is according to a meta-level understanding of the onomatopoeic words; these words, in turn, are descriptive of the underlying acoustic properties of the clips. Next, the motivation for this research is discussed.

4.3 Motivation: Describing Sounds with Words

Humans are able to express and convey a wide variety of acoustic events using language. This is achieved by using words that express the properties of a particular acoustic event. For example, if one attempts to describe the event "knocking on the door", the words "tap-tap-tap" describe the acoustic properties well. Communicating acoustic events in such a manner is possible because of a two-way mapping between the acoustic space and the language or semantic space. The existence of such a mapping is a result of a common understanding of familiar acoustic events. The person communicating the acoustic aspect of the event "knocking on the door" may use the word "tap" to describe it. That individual is aware of a provision in language (the onomatopoeia word "tap") that would best describe it to another. The person who hears the word is also familiar with the acoustic properties associated with the word "tap". Here, it is important to point out the following issues: (1) There is a difference between the language descriptions "knocking on the door" and "tap-tap". The former is an original lexical description of the event and the latter is closer to a description of the acoustic properties of the knocking event.
(2) Since words such as "tap" describe the acoustic properties, they can also represent multiple events (for example, knocking on a door, horse hooves on tarmac, etc.). Other relevant examples of such descriptions of familiar sounds using onomatopoeia words are as follows:

• In the case of bird sounds: a hen clucks, a sparrow tweets, a crow or raven caws, and an owl hoots.

• Examples of sounds from everyday life: a door closing is described as a thud and/or a thump. A door can creak or squeak while opening or closing. A clock ticks. A doorbell is described with the words ding and/or dong, or even toot.

Table 4.1: Complete list of onomatopoeia words used in this work: bang, bark, bash, beep, biff, blah, blare, blat, bleep, blip, boo, boom, bump, burr, buzz, caw, chink, chuck, clang, clank, clap, clatter, click, cluck, coo, crackle, crash, creak, cuckoo, ding, dong, fizz, flump, gabble, gurgle, hiss, honk, hoot, huff, hum, hush, meow, moo, murmur, pitapat, plunk, pluck, pop, purr, ring, rip, roar, rustle, screech, scrunch, sizzle, splash, splat, squeak, tap-tap, thud, thump, thwack, tick, ting, toot, twang, tweet, whack, wham, wheeze, whiff, whip, whir, whiz, whomp, whoop, whoosh, wow, yak, yawp, yip, yowl, zap, zing, zip, zoom.

In general, the onomatopoeic description of such sounds is not restricted to single-word expressions; one usually uses multiple words to paint an appropriate acoustic picture. The above examples also provide the rationale for using onomatopoeic descriptions. For example, by their onomatopoeic descriptions, the sound of a doorbell is close to an owl hooting, whereas their lexical descriptions (which semantically represent the events using the sound sources "doorbell" and "owl") are entirely different. It is also possible to draw a higher level of inference from the onomatopoeic description of an audio event. Given the scene of a thicket or a barn, the acoustic features of a sample clip with hoot as its description are more likely to come from an owl than from a doorbell; given the scene of a living room, the same acoustic features are more likely to represent a doorbell. Based on such ideas, it can be seen that descriptions with onomatopoeia words automatically provide a flexible framework for the recognition or classification of general auditory scenes.

In the next sections, the implementation of the analysis in this work is discussed. First, a quantitative relationship between the words and their representation is presented.

4.4 Implementation

4.4.1 Distance Metric in Lexical Meaning Space

The onomatopoeia words are represented as vectors using a semantic word-based similarity/distance metric and principal component analysis (PCA). The details of this method follow. A set \{L_i\} consisting of l_i words is generated by a thesaurus for each word O_i in the list of onomatopoeia words. Then the similarity between the j-th and k-th words can be defined as:

s(j,k) = \frac{c_{j,k}}{l^{d}_{j,k}}, \qquad (4.1)

resulting in a distance measure:

d(j,k) = 1 - s(j,k) \qquad (4.2)
Therefore for a set of W words, using this distance metric, we get a sym- metric W ×W distance matrix where the (j,k) th element is the distance between the j th and k th word. Note that the j th row of the matrix is a vector representation of the j th word in terms of other words present in the set. We perform principal com- ponent analysis (PCA) [20] on this set of feature vectors, and represent each word as a point in a smaller dimensional space O d with d < W. In our implementation the squared sum of the first eight ordered eigenvalues covered more than 95% of the total squared sum of all the eigenvalues. Therefore d=8 was selected for reduced dimension representation and W = 83. Thus these points (or vectors) are representation of the onomatopoeic words in meaning space. A similar distance based approach for semantic clustering of words with documents is the Latent Semantic Analysis (LSA)(introduced in [19]). Here a keyword and its occurence in a set of documents is considered [30]. The 56 two-mode representation is factorized and a framework for word-word comparison and word-documentcomparision isdeveloped. Whileit issimilar tothewordrepresentation method presented here, it is not directly useful since related documents to calculate word occurence frequencies is not available. Also, LSA dependson interpretation of the words in terms of a given set of documents, while the presented metric is more absolute as it represents word-word meaning directly. Table 1 lists all the onomatopoeia words used in this work. By studying the words it can beseen that many have overlapping meanings (eg. clang and clank), some words are ‘closer’ in meaning to each other with respect to other words (eg. fizz is close to sizzle, bark is close to roar, but (fizz/sizzle) and (bark/roar) are far from each other). These observations can also be made from Figure 4.1 that illustrates the arrangement of the words in a d = 2 dimensional space. Observe that the words growl and twang are close to each other. This is mainly because the words are represented in a low dimensional space (d =2) in the figure. Once we have the tags for each audio clip, the clips can also be represented as vectors in the meaning space. Next, we discuss the tagging procedure that results in onomatopoeic descriptions of each audio clip. Later, vector representation based on these tags is discussed. 4.4.2 Tagging the Audio Clips with onomatopoeia words A set of 236 audio clips were selected from the BBC sound effects Library [1]. The clips were chosen to represent a wide variety of recordings belonging to categories such as: animals, birds, footsteps, transportation, construction work, fireworks etc. Four 57 0 0 bang bark beep blah blare blat bleep blip boo boom bump burr buzz caw chink chuck clang clank clap clatter click cluck coo crackle crash creak crunch cuckoo ding dong fizz flump gabble growl grunt gurgle hiss honk hoot huff hum murmur pitapat pluck purr ring rip roar rustle screech scrunch sizzle splash squeak tap thud thump thwack tick ting toot twang tweet wham wheeze whiff whip whir whiz whomp whoop whoosh wow yak yawp yowl zap zing zip zoom dimension 1 dimension 2 Figure4.1: Arrangementofsomeonomatopoeiawordsin2dimensional‘meaningspace’. Note that words such as clang and clank are closer to each other, but they are farther away from words such as fizz and sizzle 58 subjects (with English as their first language) volunteered to tag this initial set of clips with onomatopoeia words. 
A Graphical User Interface (GUI) based software tool was designed to play each clip in stereo over a pair of headphones. All the clips were edited to be about 10-14 seconds in duration. The GUI also had the complete list of the words. The volunteers were asked to choose words that best described the audio by clicking on them. The clips were randomly divided into 4 sets, so that the volunteers spent only 20-25 minutes at a time tagging the clips in each set. The chosen words were recorded as the onomatopoeia tags for the corresponding clip. The tags of all the volunteers recorded for each clip were counted. The tags that had a count of two or more were retained and the rest discarded. This results in tags that are common to all the responses of the volunteers. This tagging method is illustrated in Figure 4.2. Note that the resulting tags are basically onomatopoeic descriptions that best represent the perceived audio signal. The tags for this initial set of words were then transposed to otherclipswithsimilaroriginallexicaldescription. Forexample,theclipwiththelexical name BRITISH SAANEN GOAT 1 BB received the tags{blah, blat, boo, yip, yowl} and this same set of words were used to tag the file BRITISH SAANEN GOAT 2 BB. Similarly, the audio clip BIG BEN 10TH STRIKE 12 BB received the tags {clang, ding, dong}. These tags were also used for the file BIG BEN 2ND STRIKE 12 BB. After transposing the tags, a total of 1014 clips was available. Next, we represent each tagged audio clip in the meaning space. 59 Figure 4.2: Tagging and clustering the audio clips with onomatopoeia words. 60 4.5 Vector Representation of Audio Clips in Meaning Space The vector representation of the tagged audio clips in two dimensions is illustrated in Figure 4.3. The vectors for each audio clip is simply the sum of the vectors that correspond to the onomatopoeic tags. Let the clip HORSE VARIOUS SURFACES BB have the onomatopoeic description tags {pitpat, clatter}. Now the tags pitpat and clatter are already represented as vectors in meaning space. Performing a vector sum of the vectors that correspond to these tags (pitpat, clatter), i.e, the sum of the vectors of the points 1 and 2 shown in Figure 4.3. This results in the point 3. Therefore, the vector of point 3 is taken to be the vector of the clip HORSE VARIOUS SURFACES BB. The implicit properties of this representation technique is as follows: • If two or more audio clips have the same tags then the resulting vectors of the clips would be the same. • Iftwoclipshavesimilarmeaningtags(notthesametags)thentheresultingpoints of the vectors of the clips would bein close proximity to each other. For example, let clips A and B have tags {sizzle, whiz} and {fizz, whoosh} respectively. Since these tags are already close to each other in the meaning space (refer to Figure 4.1), because of the vector sum, the resulting points of the vectors of clips A and B would also be in close proximity to each other. In contrast, if the clips have tags that are entirely different from each other, then the vector sum would result in points that are relatively far from each other. Subsequently, using clustering 61 Figure 4.3: Vector representation of the audio clip HORSE VARIOUS SURFACES BB with tags{clatter, pitpat} algorithms in this space, audio clips that have similar acoustic and/or semantic properties can be grouped together. Thusthe audioclips can berepresented as vectors in theproposedmeaning space. This allows us to use conventional pattern recognition algorithms. 
In this work, we group clips with similar onomatopoeic descriptions (and hence similar acoustic properties) using the unsupervised k-means clustering algorithm. The complete summary of tagging and clustering the clips is illustrated in Figure 4.2. The clustering procedure is discussed in the next section.

4.5.1 Unsupervised Clustering of Audio Clips Using the Vector Representation in Meaning Space

The Bayesian information criterion (BIC) [54] has been used for model selection in unsupervised learning. It is widely used for choosing the appropriate number of clusters in unsupervised clustering [14, 74]. It works by penalizing a selected model in terms of the complexity of the model fit to the observed data. For a model fit M on an observation set X, it is defined as [54, 74]:

BIC(M) = \log(P(X|M)) - \frac{1}{2} \cdot r_M \cdot \log(R_X), \qquad (4.6)

where R_X is the number of observations in the set X and r_M is the number of independent parameters in the model M. From a set of competing models \{M_1, M_2, \ldots, M_i\} we choose the model that maximizes the BIC. For the case where each cluster in M_k (with k clusters) is modelled as a multivariate Gaussian distribution, we get the following expression for the BIC:

BIC(M_k) = \sum_{j=1}^{k} -\frac{1}{2} \cdot n_j \cdot \log(|\Sigma_j|) \; - \; \frac{1}{2} \cdot r_M \cdot \log(R_X) \qquad (4.7)

Here, \Sigma_j is the sample covariance matrix of the j-th cluster, k is the number of clusters in the model and n_j is the number of samples in each cluster. We use this criterion to choose k for the k-means algorithm when clustering the audio clips in the meaning space. Figure 4.4 plots the BIC as a function of the number of clusters k, estimated using equation (4.7). It can be seen that the maximum value is obtained for k = 112.

Figure 4.4: BIC as a function of the number of clusters k in model M_k. The maximum value is obtained for k = 112.

4.5.2 Clustering Results

Some of the resulting clusters obtained using the presented method are shown in Table 4.2. The table lists some of the significant audio clips in each cluster; only five of the k = 112 clusters are shown for illustration. As mentioned previously, audio clips with similar onomatopoeic descriptions are clustered together. As a result, the clips in each cluster share similar perceived acoustic properties. Consider, for example, the clips SML NAILS DROP ON BENCH B2.wav and DOORBELL DING DING DONG MULTI BB.wav in cluster 5 of the table. From their respective onomatopoeic descriptions, and an understanding of the properties of the sound generated by a doorbell and a nail dropping on a bench, a relationship can be made between them.

Cluster 1: CAR FERRY ENGINE ROOM BB {buzz, fizz, hiss}; WASHING MACHINE DRAIN BB {buzz, hiss, whoosh}; PROP AIRLINER LAND TAXI BB {buzz, hiss, whir}
Cluster 2: GOLF CHIP SHOT 01 BB.wav {thump, thwack}; 81MM MED MORTAR FIRING 5 BB.wav {bang, thud, thump}; THUNDERFLASH BANG BB.wav {bang, thud, wham}; TRAIN ELEC DOOR SLAM 01 B2.wav {thud, thump, whomp}
Cluster 3: PARTICLE BEAM DEVICE 01 BB.wav {buzz, hum}; BUILDING SITE AERATOR.wav {burr, hum, murmur, whir}; PULSATING HARMONIC BASS BB.wav {burr, hum, murmur}; ...
Cluster 4: HUNT KENNELS FEED BB.wav {bark, blat, yip, yowl}; PIGS FARROWING PENS 1 BB.wav {blare, boo, screech, squeak, yip}; SMALL DOG THREATENING BB.wav {bark, blare}; ...
Cluster 5: DOORBELL DING DING DONG MULTI BB.wav {ding, dong, ring}; SIGNAL EQUIPMENT WARN B2.wav {ding, ring, ting}; SML NAILS DROP ON BENCH B2.wav {chink, clank}; ...
Table 4.2: Results of unsupervised clustering of audio clips using the proposed vector representation method.

For example, consider the clips SML NAILS DROP ON BENCH B2.wav and DOORBELL DING DING DONG MULTI BB.wav in cluster 5 of the table. From their respective onomatopoeic descriptions and an understanding of the properties of the sound generated by a doorbell and a nail dropping on a bench, a relationship can be made between them. The relationship is established by the vector representation of the audio clips in the meaning space according to their onomatopoeic descriptions.

4.6 Clustering Onomatopoeia Words in Meaning Space

Using the BIC definition in subsection 4.5.1, it is also possible to directly cluster the words in the meaning space (instead of clustering the vectors of the clips). For this, the BIC expression in equation (4.7) is again used, only the observations are now the vector representations of the words, not the audio clips. Again, we use this criterion to choose k for the k-means algorithm for clustering the words. Since the number of points in the onomatopoeic meaning space is small (the number of words is 87), we follow a bootstrapping approach to estimate the BIC for each k. The method is as follows:

1. FOR k = 1, 2, ..., N, DO
   (a) FOR q = 1, 2, ..., Q, DO
       • Initialize the cluster means randomly; let this initialization set be I_{k,q}.
       • Execute the k-means algorithm until convergence to obtain the sample mean \hat{\mu}_j and sample covariance \hat{\Sigma}_j for each cluster j = 1, 2, ..., k. Let the resulting cluster means be stored in the set K_{k,q}.
       • Calculate BIC(k, q) = BIC(M_k) using equation (4.7).
   (b) END
2. Calculate \widehat{BIC}(M_k) = \frac{1}{Q} \sum_{\forall q} BIC(k, q)
3. END
4. Choose: M_k = \arg\max_k \{\widehat{BIC}(M_k)\}

The resulting variation of the BIC as a function of k for Q = 500 is shown in Figure 4.5. The maximum value of \widehat{BIC} was obtained for k = 19.

Figure 4.5: Variation of \widehat{BIC}(M_k) as a function of k for Q = 500 for clustering onomatopoeia words.

Again, because of the small number of points in the onomatopoeic space, for each q the k-means algorithm results in a slightly different set of cluster means \hat{\mu}_j, j = 1, 2, ..., k. The following procedure is implemented for an appropriate choice of the cluster means.

1. AFTER M_k = \arg\max_k \{\widehat{BIC}(M_k)\} is known (k = 19).
2. FOR q = 1, 2, ..., Q, DO
   • Calculate the histogram H_{q,j} = n_j, where n_j is the number of samples in cluster j = 1, 2, ..., k for the k-means result K_{k,q} (here k is known from the BIC maximization).
   • Calculate the sample variance of the histogram, \sigma H_q = \frac{1}{k-1} \sum_{j=1}^{k} (H_{q,j} - \mu H_q)^2, where \mu H_q is the mean number of samples per cluster, i.e., \mu H_q = \frac{1}{k} \sum_{j=1}^{k} H_{q,j}.
3. END
4. Choose the q (and the corresponding cluster means K_{k,q}) such that q = \arg\min_q \{\sigma H_q\}.

Table 4.3 illustrates a few of the resulting clusters. Since word clusters can be formed, it can be inferred that:

1. Words within clusters have overlapping meaning.
2. Words in different clusters are sufficiently distinct.
3. The proposed metric sufficiently discerns the words by their meaning. Both general and specific perceived audio properties can be described using onomatopoeia words.

Next, the clustering of the extracted feature vectors using word-level clustering information is discussed.

cluster 1 (C_1): Clang, Clank, Ding, Dong, Ting
cluster 2 (C_2): Beep, Bleep, Toot
cluster 3 (C_3): Creak, Squeak, Screech, Yawp, Yowl
cluster 4 (C_4): Cluck, Cuckoo, Hoot, Tweet
cluster 5 (C_5): Buzz, Fizz, Sizzle, Hiss, Whiz, Wheeze, Whoosh, Zip
cluster 6 (C_6): Thump, Thwack, Wham
cluster 7 (C_7): Burr, Crunch, Scrunch
cluster 8 (C_8): Rip, Zing, Zoom
cluster 9 (C_9): Clatter, Blah, Gabble, Yak
cluster 10 (C_10): Meow, Moo, Yip

Table 4.3: Examples of automatically derived word clusters in the lexical onomatopoeic 'meaning space'.
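The bootstrapped model-selection step described above can be sketched as follows. This is a minimal sketch, assuming scikit-learn's KMeans, the Gaussian-cluster BIC form of equation (4.7), and a simple choice for the parameter count r_M (means plus full covariances per cluster); the helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def bic_gaussian_clusters(X, labels, k):
    """BIC of a k-cluster model with Gaussian clusters, in the spirit of
    equation (4.7): sum over clusters of -0.5*n_j*log|Sigma_j|, minus a
    0.5*r_M*log(R_X) complexity penalty."""
    n, d = X.shape
    fit_term = 0.0
    for j in range(k):
        Xj = X[labels == j]
        if len(Xj) <= d:               # guard against singular covariances
            continue
        _, logdet = np.linalg.slogdet(np.cov(Xj, rowvar=False))
        fit_term += -0.5 * len(Xj) * logdet
    r_m = k * (d + d * (d + 1) / 2)    # assumed parameter count per cluster
    return fit_term - 0.5 * r_m * np.log(n)

def choose_k_bootstrap(X, k_values, Q=500, seed=0):
    """Average the BIC over Q random k-means initializations for each k,
    and return the k with the largest average BIC."""
    rng = np.random.RandomState(seed)
    avg_bic = {}
    for k in k_values:
        scores = []
        for _ in range(Q):
            km = KMeans(n_clusters=k, n_init=1,
                        random_state=rng.randint(1 << 30)).fit(X)
            scores.append(bic_gaussian_clusters(X, km.labels_, k))
        avg_bic[k] = float(np.mean(scores))
    return max(avg_bic, key=avg_bic.get), avg_bic
```

The same per-initialization k-means results can then be ranked by the histogram-variance criterion above to pick a final set of cluster means.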
4.6.1 Clustering the Feature Vectors As mentioned in point 2(a) of section 4.2, the feature vectors are first clustered using theinformationfromclustersofonomatopoeiawordsoftheirclips. Thefollowingvoting procedure was used: 1. For each clip, the number of onomatopoeia tags common with the words in each clusterC j ,j∈{1,2,...,k} was counted. 2. ThefeaturesextractedfromtheaudioclipareassignedtotheclusterC j withmost number of common words. Collision (where the tags may have the same number of common words with more than one word cluster) was randomly resolved. With this procedure, the resulting clusters of audio in the feature space are similar in terms of their onomatopoeic descriptions. In step 2, it is also possible to assign the acoustic features from a clip to more than one cluster C j by having at least one word common with the clusters. But this would result in a more complex grouping of the features. Some of the clusters as a result of this procedureare listed in table 4.4. As an example, note that AMSTERDAM TRAM 3 BB ,TOOLBOX CLOSED B2 are clustered together. It can be interpreted that the properties of the sound generated by the tram and the 69 Clusters: PERSIAN CATS EAT PURR, GARGLE BB, SLURP BB TRACTORS WORK IN YARD BB TEAPOT BEING FILLED BB MANY HORSES TROTTING BB, LRG TACK NAIL HAMMERED B2, CLIZA OUTDOOR MARKET B2, BRUSH PAINTING B2, GAS BLOWLAMP LIT FLAME B2 AMSTERDAM TRAM 3 BB, LONDON SUBWAY ARRIVES 01 B2, BUILDING SITE HAMMERING B2,CAN OPENER BB, TOOLBOX CLOSED B2 Table 4.4: Examples of automatically derived clusters of audio clips based on feature vectors box can both be described with the words {clang, clank}. Thus the clusters resulting from this procedure have similarity in terms of their onomatopoeic descriptions. Using word-level clustering information, features extracted from the clips were clus- tered into k =19 clusters. Since the onomatopoeia words describe the acoustic proper- ties,theunderlyingacousticdatacanalsobeexpectedtohave19clusters. Asmentioned in point 2(b) of section 4.2 a “raw grouping” is done by clustering all the extracted fea- tures using k-means algorithm in the acoustic feature space. This was done without using information from the word-level clusters. For this, the algorithm was also initial- ized to have 19 clusters. 4.6.2 Classification Experiments First, by MDA the dimensionality of the problem was reduced to (k−1). Then, the data was split into train and test sets (90% and 10% respectively). Parameters of the GMAP classifier were determined using the train set. This was done for both methods of clustering. The final result for each clustering method is presented. 70 % P % R Cluster words B 64.8 81.1 {buzz,fizz,hiss,sizzle,wheeze,whiz,whoosh,zip} E 72.5 73.4 {huff,hum,whiff,wow} S 47.8 68.3 { click,chink,tick} T 65.6 54.8 {creak,squeak,screech,yawp,yowl}. W 25.6 23.0 {cluck,cuckoo,honk,hoot,tweet} O 28.1 21.2 {meow,moo,whoop,yip} R 39.1 18.1 { crackle,pluck,splash,tap} ST 24.1 10.7 {clang,clank,ding,dong,ting} Table 4.5: 4 best and worst precision (%P) and recall (%R) rates for classification. Clusters are formed by using word-level grouping information Classification accuracy using word-level clustering : The classification re- sults we obtained were better for some clusters and worse for others with an overall accuracy of 54.44%. The recall and precision for those clusters are listed in Table 4.5. The resulting 2 nearest clusters (in terms of most confusing clusters for a given cluster) is given in Table 4.6. 
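The voting rule of the procedure above can be written compactly as follows. This is a minimal sketch with hypothetical names; ties (collisions) are broken randomly, as stated in the text.

```python
import random

def assign_clip_to_word_cluster(clip_tags, word_clusters, rng=None):
    """Assign a clip to the word cluster C_j sharing the most tags with it.

    clip_tags     : set of onomatopoeia words tagged to the clip
    word_clusters : list of sets; word_clusters[j] holds the words of cluster C_j
    """
    rng = rng or random.Random(0)
    overlaps = [len(clip_tags & cluster) for cluster in word_clusters]
    best = max(overlaps)
    candidates = [j for j, o in enumerate(overlaps) if o == best]
    return rng.choice(candidates)          # random tie-breaking

# Hypothetical example using two of the word clusters from Table 4.3
clusters = [{"clang", "clank", "ding", "dong", "ting"},
            {"buzz", "fizz", "sizzle", "hiss", "whiz", "wheeze", "whoosh", "zip"}]
j = assign_clip_to_word_cluster({"ding", "ring", "ting"}, clusters)   # -> 0
```

All feature vectors extracted from the clip are then labelled with the returned cluster index before the dimension reduction and classification steps.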
Accuracy of “raw grouping”: We obtained a classification accuracy of 85.28% for frame level tests for the raw clustering without using information from the word clus- ters. Indicating 19 distinct clusters indeed exist amongst the feature vectors extracted from the audio data in the acoustic space. That is, higher classification accuracy ⇒ resulting clusters are well separated⇒ acoustic features are sufficiently discriminatory. 4.7 Conclusion In this paper we represent descriptions of audio clips with onomatopoeia words and cluster them according to their vector representation in the linguistic (lexical) meaning 71 {blare,blat,grunt,murmur} & ,{burr,crunch,,caw,scrunch} FOR {buzz,fizz,hiss,sizzle,wheeze,whiz,whoosh,zip} {buzz,fizz,hiss,sizzle,wheeze,whiz,whoosh,zip} &{beep,bleep,toot} FOR {huff,hum,whiff,wow} {buzz,fizz,hiss,sizzle,wheeze,whiz,whoosh,zip} &{creak,squeak,screech,yawp,yowl} FOR {click,chink,tick} {burr,crunch,caw,scrunch} &{blah,clatter,gabble,yak} FOR {creak,squeak,screech,yawp,yowl} Table 4.6: Thetwo most confusingclusters for each of the 4 clusters that have thebest precision and recall rates. space. Onomatopoeia words are imitative of sounds and provide a means to represent perceived audio characteristics with language level units. This form of representation essentially bridges the gap between signal level acoustic properties and higher-level au- dio class labels. First, using the proposed distance/similarity metric we establish a vector represen- tation of the words in a ‘meaning space’. We then provide onomatopoeic descriptions (onomatopoeiawordsthatbestdescribethesoundinanaudioclip)bymanuallytagging them with relevant words. Then, the audio clips are represented in the meaning space as the sumof thevectors of its correspondingonomatopoeia words. Using unsupervised k-means clustering algorithm, and the Bayesian Information Criterion (BIC), we clus- ter the clips into meaningful groups. The clustering results presented in this work indicate that the clips within each cluster are well represented by their onomatopoeic descriptions. These descriptions effectively capture the relationship between the audio clips based on their acoustic properties. 72 Then clustering of acoustic feature vectors with and without using word-level in- formation was analyzed. Word-level clustering was done using onomatopoeia words as means to represent perceived audio signal characteristics. These words can be used as a meta-level representation between acoustic features and language-level descriptions of audio. This clustering was compared with “raw grouping”: clustering feature vectors without using information from word-level grouping. The comparison was performed in terms of classification accuracy of a GMAP classifier after MDA. The results in this work indicate that certain word clusters are more separable than the others. It can be, in part, due to the fact that the acoustic features used in this work are only able to representcertain onomatopoeicdescriptions. Itcan alsobebecauseofinconsistencies in theunderstandingand usageof onomatopoeia wordsas well. Also, some words(such as crackle) represent long-term temporal properties that arenot well represented by frame level analysis. Anotherinterpretationoftheresultsisthat,therawclusteringresultsinpartitioning of the feature vectors into regions of contiguous volume of space. 
However, clustering usingonomatopoeia groupinginformation may result in fragmented partitioning, where the feature vectors of a cluster may be present in different regions of the feature space. This essentially brings out the differences in signal level measures and linguistic level descriptions. Thisalsocalls forsignalmeasuresthatarerepresentative oflinguisticlevel descriptions. Onomatopoeiawordscanbeusedtoannotateamediumthatcannotrepresentaudio (e.g. text). Given better signal measures, this representation can be useful for comput- ing with both in terms of language level units and signal level measures. The ability 73 to cluster words in a quantitative meaning space implies words within the clusters have overlappingmeaningandwordsindifferent clustersaresufficientlydistinct. Thismakes them useful as they can express and represent both specific and general audio charac- teristics. This is a desirable trait as a meta-level representation making them suitable for automatic annotation and processing of audio. Onomatopoeia words are useful in representing signal properties of acoustic events. Theyareausefulprovisioninlanguagetodescribeandconveyacousticevents. Theyare especially useful to convey the underlying audio in media that cannot represent audio. For example, comic books frequently use words such as bang to represent the acoustic properties of an explosion in the illustrations. As mentioned previously, this is a re- sult of common understandingof the words that convey specific audio properties of the acoustic events. This is a desirable trait in language level units making them suitable for automatic annotation and processing of audio. This form of representation is useful indevelopingmachinelisteningsystemsthat canexploit bothsemanticinformation and similarities in acoustic properties for aural detection and decision making tasks. As a part of our future work, we wish to explore the clustering and vector representation of audio clips directly based on their lexical labels and then relate it to the underly- ing properties of the acoustic sources using onomatopoeic descriptions and signal level features. For this, we would like to develop techniques based on pattern recognition algorithms that can automatically identify acoustic properties and build relationships amongst various audio events. 74 Chapter 5: Attribute Based approach to processing 5.1 Chapter Summary This chapter presents a descriptive approach for analyzing audio scenes that can com- prise a mixture of audio sources. This method is applied to segment popular music songs into vocal and non-vocal sections. Unlike existing methods that directly rely on within-class feature similarities of acoustic sources, the proposed data-driven system is based on a training set where the acoustic sources are grouped by their perceptual or semantic attributes. The audio analysis approach is based on a quantitative time- varying metric to measure the interaction between acoustic sources present in a scene developedusingpatternrecognitionmethods. Usingtheproposedsystemthatistrained on a general sound effects library, one can achieve less than ten percent vocal-section 75 segmentation error and less than five percent false alarm rates when evaluated on a database of popular music recordings that spans four different genres (rock, hip-hop, pop, and easy listening). 
5.2 Introduction The increased focus on automatic audio processing techniques for retrieval, indexing, and classification is a result of vastly improved digital media storage, delivery of con- tent over the network, and the availability of cheaper and efficient computing. Audio processing for the above tasks involves one or more of the following related procedures: segmentation, clustering, and classification. Segmentation involves markingsimilar, ho- mogeneous sections of audio. The work presented here segments popular music songs intovocalandnon-vocalsections. Thisisprimarilyusefulforaudioinformationretrieval applications for annotation, browsing, summarization and creating audio thumbnails. The implementation is based on developing a method for descriptive characterization of audio content through a data-driven approach. Typically, a data-driven approach to discerning the type (classification) and the temporal extent of an audio event (segmen- tation) is based on some model-free or model-based approach, in either case involving clusteringandlearningthecharacteristics ofthedesiredaudioclasses. Thepresentwork differs from other approaches by grouping them based on perceived description and not simply on their signal level similarities. An obvious and widely used approach to characterizing a complex audio scene is by recognizing each of the possible constituent classes through deterministic or statistical methods. This would especially work well for closed-set problems where the number 76 of identifiable classes are limited and a given audio event is always known to belong to one of the previously known classes. In more open problems, where the number of acoustic events and possible audio scenes are large, and perhaps unseen, it would be- come tremendously complex to implement such a scheme. For example more heuristic rules, such as the one implemented in [37] for classification would be required. Related work to this problem of content analysis of audio is presented in [23, 73]. Usually in these implementations, each acoustic source is labeled to be a unique audio class and the core challenge in the problem is to identify these sources. Also in methods such as in [47, 53, 73] the discrimination system and the method of analysis are based on the underlying assumption that the two classes are non-overlapping in time. Related work in description based approaches to audio typically involve application-tailored choice of descriptions trained on corresponding data. For example in [10], the author uses probabilistic descriptions specific to music. The approach proposed in [55] involves di- rectly tying semantic level descriptions to signal level Gaussian Mixture Models built on extracted feature vectors. These approaches, are relevant to the present work, but rely on statistical similarity amongst acoustic sources and are not easily generalizable to large classes of acoustic sources. Audio categorization implicitly or explicitly involves a change-point detection (seg- mentation) scheme. Thisisdonebyeitherrecognition, classification ormeasuringsignal statistics localized in time. In [37, 73] the authors have implemented a set of heuristic rules to classify audio into speech, music, silence and environmental sound and thus segment them. In [63] the authors have used a peak-picking scheme on the derivative of adistancesignal. In[47] aclassification schemeby astatistical measureofzero-crossing 77 rate (ZCR) and root-mean-squared (RMS) of signal energy is usedfor segmentation. 
In [74] the authors have used aT 2 -statistic along with the Bayesian Information Criterion (BIC) to the related problem of speaker turn detection. All these techniques use only information at the signal level to perform the segmentation task. No higher level un- derstanding of the audio is utilized. Our main motivation is based on the change point detection scheme of the human auditory system. It can perform audio segmentation with ease and reasonable accuracy [63]. This can be attributed to the fact that the auditory system as a whole depends not only on signal level characteristics but also on the semantic understanding, rele- vance and temporal/spatial placement of acoustic events. The segmentation task is easy because, in some way, the auditory system is able to quantitatively measure the interaction of the events presented to it in a scene. Thus when the interaction changes, a change-point in this quantitative measure can be detected. This idea proposes to use perceived audio descriptors, or attributes, to measure and process the interaction of acoustic sources quantitatively and implement a change-point detection scheme. First a general audio data-set of sound effects is grouped into high level attributes based on how humans interpret and describe audio. Then a metric to quantitatively characterize a given audio clip in terms of these attributes is proposed. Finally a change point detection rule is derived based on this quantitative metric and evaluated on segmenting vocal sections in popular music. The next two sections describe the overall approach to the segmentation problem 78 followed by its implementation. Experiments performed to test the accuracy are de- scribed in section 5.5, with the results in section 5.6. Finally, the conclusion, additional discussion and ideas about our future work are presented in Section 5.7. 5.3 System Description Although an audio scene may consist numerous acoustic sources, each identified by a unique linguistic name, many of them share similar perceptual qualities and thus they can be grouped under one category. For the purposes of this work, the framework focuses on three such high level attributes, namely, speech-like, harmonic and noise- like. For the training data, the relevant audio data available in the BBC sound effects library [1] are manually categorized based on these three attributes. The clips were grouped according to the way they are interpreted after listening to them.These at- tributes are referred to as the perceived audio descriptors. For example, the sound of a vacuumcleaner andthesoundofacar’senginearebothconsiderednoise-like. Similarly many other acoustic sources (e.g.,: waves in a seashore, machine-shop tools, heavy rain, breathing sounds), have such noise-like characteristics. Thus, each of these sources can begroupedbased ontheseperceived attributes regardless of thelinguistic label/class or just based on their signal feature similarity/dissimilarity. Along the same lines, a wide variety of acoustic sources such as door bells, string musical instruments such as violin, guitar, (excluding percussion instruments), telephone ringtones sirens, pure tones etc. can be categorized under the group harmonic i.e. sources that are harmonically rich. Speech-like mainly covers individual speech, conversations in a crowd, laughter, and human vocalizations. Further narrow, additional descriptions are also possible. 
These 79 three high level attributes were chosen because they are sufficiently distinct human in- terpretations of events in an auditory scene and they are sufficiently ”separable” using perceptual signal-level features for the automatic classifiers. This grouping, based on human characterization of the signal, leads to a mapping in a relatively lower dimensional representation space. Such a description offers scala- bility, and can be also effectively used as an intermediate step for conventional audio processing methods. In practice, one can assume that the time-varying description of any audio scene typically contains acoustic events that can be considered semantically grouped under this broad description scheme. This identification is constructed using a bank of classi- fiers (as detectors), where each one focuses on a specific attribute such as harmonicity. Then,thetimeseriesoftheseclassifieroutputsareassimilated toprovideafinalcatego- rization of the audio (illustrated in Figure 1 and explained further in Section 5.4). The descriptive labels are automatically assigned to each frame of audio by using standard pattern classifiers that are trained off-line. The technique is based on broad definitions of various audio attributes, and the training data is also chosen accordingly. For the task of tracking these attributes, a k-Nearest Neighbour (k-NN) Classifier [20] is imple- mented. Discrimination based on k-NN rulefor classification task has beeninvestigated previously, and found useful, in speech/music classification tasks [53]. An additional reason for choosing it here was because it belongs to a class of lazy learning algorithms, and suits our approach which makes no assumption regarding signal feature similarities of the target attributes in the feature space. To measure the degree of interaction between the audio attributes (descriptors) at 80 any given time, the quantity activity rate (AR) is used. It is defined as the number of each event (e.g., noise-like/speech-like/harmonic) detected per unit time of analy- sis. For example, for sections of audio with just music, the harmonic activity rate is expected to be high whereas for sections with singing/dialogues in a scene, the speech activity rate is expected to be high. The complexity of audio scenes can be described in terms of the different acoustic sources or events present in it. Thus in effect, the activity rate provides an aggregate quantitative measureof interaction of the individual events. Note that in classical speech/music discriminators, it is usually assumed that segments of speech and music are non-overlapping in time and the problem is to statis- tically measure the signal properties appropriately to discern one from the other (such as through a maximum likelihood scheme). In the present work, no such assumptions of temporal mutual exclusivity are made and the derived time-varying representation is used to Music Information Retrieval (MIR) problems. Namely, segmentation of vocal section in popular music songs and overall genre classification. To evaluate the pro- posedscheme, it is appliedto theproblem of segmenting popularmusicsongs intovocal and non-vocal sections, and also classify the genre of the songs. Note that such audio is characterized by vocals of one or more main singers, and possibly other background singers, along with polyphonic instrumentals. 
It should be pointed out that while the results indicate satisfactory segmentation of vocal sections, the framework provides a generalized approach to analyze the underlying acoustic structure in a given audio clip [40]. The chosen evaluation domain reflects such application where obtaining sections with the vocals in a song and classifying the genre of the music is the goal. The next section provides details about the implementation of the proposed system. 81 5.4 Implementation TheproposedsystemforsegmentationisdepictedinFigure1. Theoverallsystemissplit intotwoblocks: Level 1 andLevel 2. Thefirststageofthesystemisafeatureextraction stage. It maps a given windowed audioframex m ofT s duration (usually from 20 to 100 msec.) to a point X m in a D dimensional feature space Ω (m being the time-index). The extracted feature is a popular, perceptually-motivated 37 dimensional vector com- prising of 13 Mel Frequency Cepstral Coefficients (MFCC), its delta-MFCC(DMFCC) anddelta-delta (DDMFCC). Thefeaturesarerelevant herebecausetheymodeltheper- ception of the human auditory system and they have also been successfully applied in recognition of general audio ([35] and references therein). This is used both during the training and testing. For an audio scene of T duration, this stage generates a sequence ofX m ,m∈{1,...,M}(discrete-time) featurevectors. Thesilencesegmentsaretreated separately using the root mean-squared energy (RMS) of the signal. Level 1 comprises N classifiers trained in a one-against all scheme. The output of the k th classifier C k is w k , an element of the N dimensional vector L m and w k ∈ {0,1}, ∀ k ∈ {1,...,N}. C k is trained to identify (the classification process) the k th label in a frame. w k = 1 indicates that the classifier has classified the given frame with the k th label. For example, suppose the k th label is noise-like. Then the classifier C i is trained on all the data that contains audio clips of acoustic sources that are noisy (eg: engine noise, vacuum cleaner, hair dryer, waves on a seashore etc.) and w k =1 means that the given frame is noise-like. In our implementation of the proposed system N =3 and the labels used are w 1 ≡ Speech-like, w 2 ≡ harmonic and w 3 ≡ noise-like. Thetrainingdataforthethreegroupstotalled 7.28 hours(about2.42 hoursforeach 82 Figure 5.1: Illustration of the proposed system for audio segmentation. 83 group). The clips were available as 44.1 kHz, 16-bit 2 channel uncompressedaudio. For the feature extraction stage, the clips were converted to mono without changing the sampling rate. As mentioned previously, the classifiers C 1 ,C 2 ,C 3 were implemented using the non-parametric k- nearest neighbor classifier (k = 5) scheme [20]. Note that the realization of the N = 3 classifiers can be combined into one block, but they have been shown separately for the sake of clarity. At this point, for a sequence of audio frames for an audio scene of T duration, a sequence of vectors L m m∈{1,2,...,M} This is the input to the Activity Rate mod- ule. If a sequence of L m vectors is time-aligned (similar to a spectrogram where the magnitude of a sequence of Fourier coefficients are time aligned), then the number of detected events in each dimension of the L m vector in the time-aligned representation for a given time period is a measure of the event activity rate (AR). 
Mathematically it can be written as:

r_m^k = \frac{1}{M_r} \sum_{j = m - M_r/2}^{m + M_r/2} I\{ w_{k,j} = 1 \}, \qquad \mathbf{r}_m = (r_m^1, r_m^2, \ldots, r_m^N)

where I{·} is an indicator function and w_{k,j} is the k-th dimension of L_j. M_r is the duration over which the AR is measured; typically T_s < M_r << T, with M_r approximately 50 to 100 times T_s. The output of the AR device is again an N = 3 dimensional signal r_m that takes continuous values between 0.0 and 1.0. A value close to 0.0 indicates no activity and 1.0 indicates high activity in terms of the corresponding descriptor. As an example, Figure 5.2 shows the r_m vector for a 45-second clip of the song Don't Know Why by Norah Jones.

Figure 5.2: The time-varying r_m vector for a T = 45 sec. song clip. For this particular illustration, T_s = 20 msec and M_r = 100 × T_s.

In this clip, the song starts with soft music (comprising a piano, a stringed instrument, and snare drums) and the singer starts singing at the 10th second. Studying the activity rate plot, it can be seen that the speech-like activity rate (r^1) increases around the 10th second, and the harmonic activity rate (r^2) is high initially and lower when the singer starts to sing. As the singer sings each verse of the song, the rate r^1 alternates between high and low values. Note a different trend in the plot between the 26th and 36th second, as compared to the segment between the 15th and 25th second. This is observed because the singer actually sings the second part of the first verse differently from the first part. Similarly, in the r^2 plot, the piano notes briefly come to the foreground between the 36th and 38th second, indicating fluctuations in the detected harmonic activity rate. The trends in the noise-like activity rate (r^3) also bring out such variations as the song proceeds. If a windowed audio segment is determined to be silence (by an appropriate threshold on the RMS energy signal), then the frame's corresponding L_m vector is not used to calculate the activity rate.

Thus, it can be seen that the interaction between the acoustic sources in an audio scene can be quantitatively measured using the activity rate. This is the descriptive approach presented in this paper. The time variations in the activity rate (AR) signals quantitatively describe the audio scene locally in time. The description is generalized in the sense that the N categories cover the type of acoustic source irrespective of the actual signal statistics. While a direct correlation between the trends in the activity rate and changes in the audio clip can be drawn by audio-visual inspection, one such secondary analysis is provided for segmenting popular music songs.

Level 2: This level focuses on using the proposed attribute-based audio descriptors for specific categorical classification. Specifically, the application of segmenting audio into vocal and non-vocal sections is considered (generically referred to from here on as 'speech-like' and 'non-speech-like', although it includes both sung and spoken forms). In addition, the task of identifying the genre of the given music piece is also considered. In frame-based analysis, we observe an audio segment W of duration T_w as a sequence of independent frames of short duration. Let this set of observed frames be given by {X_1, X_2, ..., X_R}. In practice, each observation X_i, i ∈ {1, ..., R}, is a point in the feature space and represents about 20 to 50 msec of windowed audio samples.
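Before turning to the detection rule, the activity-rate computation just defined can be sketched as follows. This is a minimal sketch assuming the per-frame Level 1 labels are available as integers (0 = speech-like, 1 = harmonic, 2 = noise-like) with silence frames marked as None; the window handling at the clip edges and the normalization by the number of non-silence frames in the window are illustrative choices, not necessarily the exact implementation.

```python
import numpy as np

def activity_rates(frame_labels, M_r=100):
    """Sliding-window activity rates r_m^k for the three attributes.

    frame_labels : sequence of ints in {0, 1, 2} (speech-like, harmonic,
                   noise-like) or None for silence frames, which are excluded.
    Returns an array of shape (num_frames, 3) with values in [0, 1].
    """
    labels = list(frame_labels)
    N = 3
    rates = np.zeros((len(labels), N))
    for m in range(len(labels)):
        lo = max(0, m - M_r // 2)
        hi = min(len(labels), m + M_r // 2 + 1)
        window = [l for l in labels[lo:hi] if l is not None]
        if window:
            for k in range(N):
                rates[m, k] = window.count(k) / len(window)
    return rates
```

These r_m sequences are what the frame-level decision rule derived in the following paragraphs operates on.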
A typical probabilistic detection scheme involves estimating P(W = Speech-like) and P(W = non-Speech-like), which gives the rule:

if P(W = Speech-like) > P(W = non-Speech-like), then W is a vocal section.    (5.1)

Let I{X_i = speech-like} represent classification of a frame X_i as speech-like. In an observation of only R vectors, P(W = Speech-like) can be estimated by the classical definition [48]:

P(W = Speech-like) = \frac{ \sum_{\forall i} I\{X_i = \text{speech-like}\} }{R}, and

P(W = non-Speech-like) = \frac{ \sum_{\forall i} I\{X_i = \text{harmonic-like}\} }{R} + \frac{ \sum_{\forall i} I\{X_i = \text{noise-like}\} }{R}.

Thus, from inequality (5.1), for a given R we conclude that a segment W is a vocal segment if

\sum_{\forall i} I\{X_i = \text{speech-like}\} > \sum_{\forall i} I\{X_i = \text{harmonic-like}\} + \sum_{\forall i} I\{X_i = \text{noise-like}\}.

Therefore we obtain a time-localized voting scheme for the detection of a vocalized segment in a test clip. By the definition of the event activity rate (AR) given above, we get the rule for change-point detection:

if s > 0, then the segment is a vocal section, where s = r^1 - r^2 - r^3, i.e., s = (speech-like AR) - (harmonic AR) - (noise-like AR).

In the implementation, R = 100 is used (from the value set for M_r). Note that this combination highlights the segments that contain the vocals of the song. It basically draws an inference based on a set of individual classification results. The classification accuracy of the k-NN classifier for two training/test splits is listed in Table 5.1.

Table 5.1: k-NN classification accuracy (%) for speech-like (S-l), harmonic (H), and noise-like (N-l).

  Train/Test split = 80/20               Train/Test split = 90/10
  classified as →   S-l     H      N-l     S-l     H      N-l
  S-l             94.44   2.96    2.60   94.72   2.66    2.62
  H                1.08  97.30    1.62    0.94  97.56    1.50
  N-l              0.42   0.67   98.91    0.47   0.55   98.98

We also include a "fill-in" procedure where two segments that are not more than 1 sec. apart are combined [35]. Other combination schemes are also possible depending on the application. For example, for larger N, r_m can be directly used as a low-dimensional feature vector for classifying auditory scenes. This aspect is used for identifying the genre of a given music clip. The next section discusses the performance in each of these music processing tasks.

5.5 Experiments

5.5.1 Segmenting Vocal Sections

A collection of 67 full-length assorted songs was used to assess the segmentation performance of the proposed system. The tracks belonged to 4 genres: Rock, Pop, Hip-Hop and Easy Listening, covering a variety of artists like Red Hot Chili Peppers, Cake, Bangles, Chris Isaak, Wyclef Jean, Mark Knopfler, Tom Waits, Enya, etc. The uncompressed audio tracks were directly obtained as mono from the original commercially available CDs at a 44.1 kHz sampling rate.

For each track, a binary signal of the same duration was obtained at the output. The binary signal has a value of 1 for sections of audio determined to have vocals, and 0 otherwise. This output was then converted into time markers that mark the corresponding sections on a waveform. A Graphical User Interface (GUI) was used to display the waveform and the markers aligned in time. The error and false alarm percentages were calculated manually by listening to the songs and checking the actual vocal sections of the songs against the sections marked by the system. The error and false alarm rates were calculated using the following formulae:

%Error = (no. of vocal sections in a song not marked by the system) / (total no. of segments in the song)

%False Alarm = (no. of non-vocal sections marked by the system) / (total no. of segments in the song)

The tolerance for segmentation was 1 sec.
i.e., if the markers were off by±1 sec., then it was not considered as an error. This tolerance value can be determined from the size ofM r tocalculate theactivity rate. Intheproposedimplementation T s =20 msec. and 89 M r = 100×T s with 50% overlap between frames for estimating r k m . It is of the same order of acceptance measure suggested by the authors in [63]. 5.6 Results and Discussion The error and false alarm rates for segmentation of the songs using the proposed sys- tem, grouped by genre, is tabulated in Table 5.2. It can be seen that the estimates are consistent through the various genres. The overall false alarm and error rate was found to be about 4.0% and 7.0% respectively. Although direct comparision with other speech/music discriminators is not possible due to differences in training data, imple- mentation and domain of application our results are comparable to other systems such as [47, 53]. While the results show low error values, some specific problems arose in certain cases causing relatively high error. They are discussed below: Observed sources of error: In some Rock songs, there were sections with high inten- sity music (lead guitars+drums) along with screaming voices that were not segmented correctly. This is because relatively high values were obtained for all the three Activity Rate signals r 1 ,r 2 ,r 3 and these resulted in low (approximately 0) values of s (refer to eqn. 1). Also, certain instances of extreme pitch levels of voices were not correctly marked. Similar problems with segmentation arose in songs of other genres that had extended durationof singingnotes, accompanied by amusical instrumentsuch astrum- pet or violin. In certain cases, due to fade in and fade out of the singer’s voices the exact time instant of the segmentation boundaries was off by a few seconds (termed as border effect [35]). 90 Table 5.2: false alarm & error rates of popular music segmentation Genre No. of Songs % False Alarm % Error Rock 14 4.28 5.97 Pop 20 5.09 6.63 Hip-Hop 13 3.39 8.45 Easy Listening 20 3.37 6.66 Table 5.3: Distribution of number of songs in each genre Genre No. of songs Classical 318 Ambient 199 Electronica 206 Metal 201 Rock-Pop 278 Total 1202 5.6.1 Genre Classification Experiments For these experiments we use 1202 full-length MPEG1-Layer 3 (mp3) tracks encoded at 128 kbps collected from the Magnatune database available online (http://www.magnatune.com). The distribution of number of songs in each class is shown in Table 5.3. The mp3 codec is transparent at this bit-rate and all the songs were first converted into 44.1kHz one-channel audio before the feature extraction stage. We focus on classification into 5 genres: Classical, Ambient, Rock-Pop, Metal, and Elec- tronica. Thegenre information for each track was obtained from the ID3 tag embedded in the mp3 format and information obtained from the website. We test the perfor- mance of the activity rate measure for the genre classification task of full-length tracks. Three experiments are designed for this purpose. The first two experiments deal with the performance of using only the activity rate measure as “features” for the classifica- tion task. In the third experiment, we also present the performance of an HMM-based 91 classifier built using the MFCCs extracted from the full-length tracks. The details are given below: 5.6.1.1 Expt. 1: Using Activity Rate and DTW The activity rate measure is a time-series of the interaction of the various attributes. 
Also, because there are inter-song artistic variations and duration differences we obtain a distance measure between the activity rate time-series of two songs using Dynamic Time Warping (DTW). DTW essentially aligns two time-series of unequal length by defining a warping function to map the samples of one time series onto the other under a set of constraints. The Euclidean distance between the time-aligned series (in this case, the activity rate measures) is calculated to be the measure of similarity between them. Since the activity rate is a time-series of the underlying structure of the song in terms of three attributes, this gives us a measure of similarity between the songs. Thus using a nearest-neighbour approach, by comparing a given test track with a previously organized training data, the genre of the given track can be identified. The activity rate signals are slow-varying signals. Also, in this case, the overall temporalenvelopeofthesignal ismoreimportantthantheminorvariations. Therefore, for this experiment, we perform a 4−level asymmetric dyadic wavelet decomposition of the activity rate signals by decomposing only the low-frequency sub-band output from each level. The output after the 4−levels of decomposition is used for computing the DTW based similarity measure described above. 92 5.6.1.2 Expt. 2: Using Activity Rate in an HMM-based classifier Using the activity rate signals as features, it is possible to train a classifier to recognize the genre of a given test music track. In this experiment we train an HMM-based classifier using the Baum-Welch algorithm similar to the procedures in a continuous speech recognition task. By this we attempt to capture the genre-specific temporal statistics of the activity rate signals. For training, each track of the training set is split into ten segments with same genre labels. Again, similar to continuous speech recognition, using the trained HMMs, the final identification of a given test track was performedusingViterbi decoding. Thedecodingprocedureresultsinmultiplesegments that are labeled with the genre with the maximum likelihood score. Then by summing the output likelihood scores of the individual segments for each genre, the label with maximum score is chosen to bethe genre label for the wholesong. Mathematically, this is expressed as: L=argmax i j=R i X j=1 S i j . In this equation S i j is the output likelihood score of the j th segment that has been marked with i th genre label. R i is the number of segments that has been marked with the i th genre label in a given track, and L is the overall label given to the track. Here, the number of states of the HMMs and the number of Gaussian mixtures for each state was determined experimentally, and the best performing states-mixtures combination was chosen. 93 5.6.1.3 Expt. 3: Using MFCCs in an HMM-based classifier Here again, the training and test procedures are same as Experiment 2 described previ- ously. Only, instead of using the activity rate measure, the extracted MFCCs are used directly as features. MFCC features representthetimbral qualities in music[46]. Along with the 13 MFCCs we also include the MFCC-D and MFCC-DD coefficients that also represent the inter-frame spectral changes. The features for these experiments were extracted every 50 milliseconds with a window size of 100 milliseconds. These values were experimentally determined to give the best classification performance. The three experiments attempt the genre classification task at two different levels. 
Experiment 3 uses detailed signal level measures that represent spectral properties and inter-frame variations. Experiments 1 and 2 use higher-level information and represent a given music in terms of an aggregate measure in terms of the attributes. Note that for each song, the MFCC feature set is extracted only once. They are directly used for classification in experiment 3. They are further processed into activity rate measure as shown in Fig. 5.4 and used for classification in experiments 1 and 2. It is important to point out that the activity rate measure derived in this work is based on a generic soundeffects database [1] that is different from the database used for training the genre classifiers. The results of these experiments are discussed next. 5.6.2 Classification Results and Discussion The classification performance for each experiment was estimated using the average of 10-fold cross-validation. For each cross-validation, 60% of the songs were randomly 94 selected for the training set and remaining 40% was used for testing. The testing was done with approximately equal priors. For experiments 1 and 2 the MFCC features wereextracted every 25milliseconds (50 milliseconds featurewindow size) andM r =50 for estimating the activity rate. For experiment 2, a 10-state 16 Gaussian mixture left- to-right HMM topology was determined to have the best performance. For experiment 3, a 32 state 16 mixture Gaussian mixture left-to-right HMM topology was determined to have the best performance. The results of experiments 1 and 2 that use the activity signals are listed in Table 5.4, 5.5. It can be seen that the classification performance is well above chance except the Electronica case. The two most confusing classes for the anomalous result for the Electronica case in experiment 1 are Classical and Rock-Pop. This could partly be due to the type of tracks present in the data whose acoustic structure are similar to either Classical songs or Rock-Pop songs. However, since this does not show up in the results of experiment 2 (using the same activity rate measure and a different classification approach), it could be attributed to certain issues with the similarity measure using the DTW algorithm (such as cases of non-zero mean in the signals [33]). Another observation is that thetracks that belongto Metal aremostly misclassified as Rock-Pop and vice versa. This can be attributed to the stylistic similarity between Metal and Rock-Pop tracks. This trend can be seen in the results of all the experiments shown in Table 5.4-5.6. Another trend is the significantly better performance for the Classical and Metal classes. This essentially indicates that the content of the two genres and their structure are distinct compare to the other genre classes. 95 Table5.4: Expt. 1ConfusionMatrix: UsingactivityratesignalsandDTWforsimilarity measure(using 9-Nearest Neighbour Rule). class. as→ Classical Ambient Electronica Metal Rock-Pop Classical 97.03 00.00 00.15 01.63 01.19 Ambient 29.00 57.05 00.94 02.98 10.03 Electronica 34.32 00.45 07.13 17.38 40.71 Metal 03.64 00.00 00.00 74.49 21.87 Rock-Pop 24.18 00.00 01.28 34.42 40.11 Chance 20.00 18.96 19.79 20.36 20.89 Table 5.5: Expt. 2 Confusion Matrix: Using activity rate signals in an HMM-based classifier. class. 
as→ Classical Ambient Electronica Metal Rock-Pop Classical 71.07 15.87 02.93 05.47 04.67 Ambient 24.15 55.65 04.94 10.17 05.08 Electronica 06.62 03.71 40.00 34.44 15.23 Metal 00.13 01.55 00.52 91.33 06.47 Rock-Pop 01.94 05.30 13.31 45.22 34.24 Chance 20.00 18.96 19.79 20.36 20.89 Table 5.6: Expt. 3 Confusion Matrix: Using MFCCs as features in an HMM-based classifier. class. as→ Classical Ambient Electronica Metal Rock-Pop Classical 99.33 00.26 00.13 00.26 00.00 Ambient 09.89 77.97 04.10 02.68 05.37 Electronica 01.05 04.90 82.11 03.17 08.71 Metal 00.00 00.13 01.03 90.03 08.79 Rock-Pop 00.51 03.74 05.03 34.75 55.94 Chance 20.00 18.96 19.79 20.36 20.89 96 The MFCC results shown in 5.6 are better than by just using activity rate signals. This can partly be because the activity rate experiments presented here are based on only three attributes. This can also be due to the fact that the timbral information represented by the MFCC features provide a more detailed information that is useful for genre classification task. However, a re-examination of the results by combining the classes Classical with Ambient and Metal with Rock-Pop (since they are mis-classified as the other) in experiments 1 and 3 indicate comparable performance with an average accuracy of 88.51% and 94.34% respectively. This good performance for this combined case as opposed to considering all the five genre classes can be a result of using only three attributes for the activity rate measure. As a part of our future work discussed in the next section, we would like to explore this aspect by including more attributes in the analysis. While there are differences in data sets and the genre classes under examination, theresults shown for theactivity ratesignals are similar to thetrends observed in other systemssuchas[46,57,64]. Theseresultsofexperiments1and2arealsocompetingwith other systems that use both music-specific information [46] [41] and a purely statistical approach [17]. Thus they can appropriately capture information that is relevant to genre classification. 5.7 Conclusions and Future Work In this work, an attribute-based approach to quantitatively measure the changes in an audio scene is presented. It is applied it to segment popular music tracks into sections with and without vocals. The idea is motivated by the fact that the human 97 auditory system can instantly identify changes in the scene by tracking changes in the interaction of the different acoustic sources. The work in this paper presents activity rate (AR) as a metric to quantitatively measure the interaction of different sources. This measure does not identify individualsources, but measures the activity of different semantically-related attributes of sources over time. These are based on how the sources are perceived, and not necessarily on the similarities in signal properties. We then use these attributes for categorical classification; specifically we consider segmenting music into vocal and non-vocal sections. The clips are segmented without assuming that the vocal and non-vocal sections of audio are non-overlapping in time. As mentioned previously the framework presented here is not limited to building a binary speech/music discriminator type system. It provides a way to analyze the underlying acoustic structure in a given audio clip and also a way to annotate and highlight relevant sections. This is very useful for audio summarization and thumbnail- ing applications. 
The application of this system is not limited to segmentation, and as shown, it has also been applied for music genre classification task. The main conclusion from our genre classification results is that it is indeed possible to recognize genre using the acoustic structure information represented by the event activity rate measure. This is feasible due to the distinct content (in terms of musical instruments and vocals), and the inter-genre styles which can be captured without actually identifying the individual sources. It is sufficient to just have an aggregate measure of the various attributes. The interesting aspect of the experiments is that the activity rate measure derived in this work is based on a generic sound 98 effectsdatabasethatisdifferentfromthedatabaseusedfortrainingthegenreclassifiers. As another example, the activity rate (AR) signals can be used to organize a large audiodatabase. Ausercanmakeaquerytosuchadatabaseusingthesmall-dimensional AR signals and the relevant audio clips can be returned to the user. Of course, one would require a larger dimension (N > 3) description for this application and the categories need to be appropriately chosen. Splitting the large category noise-like into machine-noise (e.g: engines noise, vacuum cleaner, hair dryer etc.) and non-machine- noise (e.g: seashore, breathing sound, heavy rain, clothes rustling etc.), or having additional categories such as impulsive sounds (e.g: explosions, gunfire, knocking, clock ticks etc.) are some ways to increase the dimensionality of the descriptions. Our future work would involve further investigating this categorization of acoustic sources through perceptual/language descriptions similar to the ideas presented here. The main principle which sets the stage for this work is that humans can describe auditory scenes using language which is a representation of the semantic information captured from an audio clip. Analysis of more complex and rich scenes with large number of acoustic sources can be potentially implemented by increasing the number of audio descriptors and seeking quantitative measures such as the activity rate to adequately characterize them. This is a topic of our on going work. 99 Chapter 6: Noise Classification for Attribute based Approach to Audio Processing 6.1 Chapter Summary Content-based audio classification techniques have focused on classifying events that are both semantically and perceptually distinct (such as speech, music, environmental sounds etc.). However, it is both useful and challenging to develop systems that can alsodiscernsourcesthataresemantically andperceptuallyclose. Inthischapter, results of experiments on discriminating two types of noise sources is presented. Particularly, we focus on machine-generated versus natural noise sources. A bio-inspired tensor representationofaudiothatmodelstheprocessingattheprimaryauditorycortexisused 100 for feature extraction. To handle large tensor feature sets, a generalized discriminant analysismethodisusedtoreducethedimension. Also, anovel techniqueofpartitioning dataintosmallersubsetsandcombiningtheresultsofindividualanalysisbeforetraining pattern classifiers is presented. The results of the classification experiments indicate that cortical representation performs 25% better than the common perceptual feature set used in audio classification systems (MFCCs). 
6.2 Introduction Content-based audio systems rely on clustering, segmentation and classification of dis- tinct acoustic source types through their within-class signal similarities. These systems rely on direct mapping between signal level feature vectors and their classes to achieve the end result. They group the vast possibilities of general audio classes into a handful of application specific classes such as speech, environmental sounds, music etc [73]. To generalize, it will be necessary to explicitly increase the number of recognizable groups resulting in increased complexity of the system (for example, more heuristic rules in [37] would be required). In [61] the authors developed a generalizable, mid-level rep- resentation scheme, where each instance (of frame based analysis) of an audio signal is classified into perceptual speech-like, harmonic and noise-like categories. This rep- resentation was successfully used to segment vocal sections in popular songs using a maximum a posteriori scheme. To tackle more complex scenes, futher categorization of sounds would be necessary. In this paper, we builduponideas of attribute based audiorepresentation presented in [61]. Particularly, we aim at connecting signal representations to audio attributes, 101 andfocusonfurtheranalyzingthenoise-like class. Wepresentaclassification systemto discern between two noise sources: machine generated and other natural noise sources. Machine-generated noise are audio from sources such as computer printers, telex ma- chines, vehicle engines, air plane propellers etc. Examples of other noises are sounds of wind, waves on a seashore, rainfall, leaves rustling. etc. The discrimination of the two noise categories is challenging because they are both semantically and acoustically similar and they are usually categorized without any distinction as non-speech or envi- ronmental sounds in systems such as [37, 73]. Other noise classification systems follow thecontent-basedapproachoftryingtoexplicitlyclassifyindividualnoiseclassessuchas car,plane, trainetc., usingelaborateHiddenMarkovModels([39]listsacomprehensive list of such systems). As mentioned earlier, for a generalizable mid-level representation, it is desirable to classify noises into categories based on signal attributes (such as sug- gested here) rather than classes based on canonical names. Classifying noise categories has applications in context recognition [39], scene change detection and indexing [11], context-aware listening for robots [15] and also in background/foreground audio track- ing [51]. The contributions of this work are as follows. First, as features for the noise clas- sification task, we use a bio-inspired approach involving a model of processing at the primaryauditorycortex. Thishasbeenappliedtothespeechnon-speechdiscrimination (SNS)problemsuccessfully[45]. Asindicatedbyourexperimental resultsfornoiseclas- sification, the cortical representation (CR) exceeds the performance of the commonly used Mel-frequency cepstral co-efficients (MFCCs). Since CR is a multi-dimensional 102 tensor, training pattern classifiers using this data becomes prohibitive due to large vec- tor dimensions (∼10 3 ). For dimension reduction, we use a generalization of the Fisher discriminant analysis (FDA). However, this also is not feasible on large training data for pattern classifiers. This issue is exacerbated by the large dimension of the data. 
To address this, we propose a new technique of data partitioning, and combining the re- sultsofanalysisontheindividualsubsets. Weshowthatdiscriminantanalysisofalarge data set can be made practical by breaking the problem into smaller sets and using the information from each of this subset. In the next section, details of this representation followed by the proposed dimension reduction method is presented. Finally, results of pattern classification algorithms are presented. We also compare performance of the CR with the performance of the MFCCs. 6.3 Auditory Cortical Representation (CR) WhilefeaturessuchasthepopularMFCCsarebasedontheprocessingattheearlyaudi- tory system, theCRis basedon processingof soundat thecentral auditory system[66]. Thisis modelled as are-analysis of theinput spectrafromtheearly auditory processing stage along the logarithmic frequency axis, a scale axis (local frequency bandwidth) and a phase axis which is a measure of the local symmetry of the spectrum. Since each time-frequency slice of an input audio signal is measured with respect to these three axes, the analysis results in a tensor (n-mode) representation. Figure 6.1 illustrates the processing stages which finally result in a tensor represen- tation. The output of the early auditory system is a time-frequency representation of the input signal. Here the input sound signal is filtered by the basillar membrane at 103 Figure6.1: Summaryofprocessingfrominputsoundtothemulti-dimensionaltensorrepresentationofcorticalprocessing [45, 66, 72]. 104 different centre frequencies along the tonotopic frequency axis, followed by an differen- tiator stage, a non-linearity , low-pass filtering and finally, a lateral inhibitory network [72]. This is the input to the central auditory system, which is analogous to the early auditory system, except all the transformations are along the tonotopic frequency axis. Theprocessingis modelled as a doubleaffine wavelet transform of the frequency axis at different scales and phase. The mother function wavelet is a negative second derivative of the normal Gaussian function. The result of this analysis is a 3-mode tensor A(f : f c ,φ,λ) ∈ R D 1 ×D 2 ×D 3 . Here, f is the tonotopic frequency at different centre frequencies f c , φ is the symmetry (or phase) and λ is the scale factor or the dilation factor of the wavelet function. In our experiments, theanalysis was performedusing64 bankfilters (D 1 =64) at 12 phase(or symmetry) values (D 2 =12), 5 scale values (D 3 =5). ThereforeA(·,·,·) is a 64×12×5 tensor. 6.4 Dimension reduction The classification experiments presented in this paper follow a data-driven approach. A total of 1.38 hours of data (from 217 clips) was collected from the BBC sound ef- fects library (http://www.soundideas.com). A preprocessing stage first converted all the 2-channel 44.1 kHz uncompressed audio files into 1-channel 16kHz channels. Then, using an audio editor (http://audacity.sourceforge.net/) each clip was manually segmented to remove silence sections, extraneous impulsive sounds and other non-noise segments. In the process, to facilitate data analysis, the data was also grouped into machine-noise and other-noise. For analysis, audio frames of 40 millisecond duration, 105 were extracted every 10 milliseconds after it was multiplied with a Hamming window. After the processing stages in the early and the central auditory system, a tensor of dimension 64×12× 5 is obtained for each frame. 
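A rough sketch of one way to approximate this per-frame re-analysis is given below. It is not the exact cortical model of [45, 66, 72]: Gabor-like kernels stand in for the seed function (a negative second derivative of a Gaussian), and the scale and phase values are illustrative placeholders. The point is only to show how a 64-channel spectral slice becomes a 64 × 12 × 5 tensor.

```python
import numpy as np

def cortical_like_tensor(spec_slice, n_phase=12, scales=(0.5, 1.0, 2.0, 4.0, 8.0)):
    """Filter one 64-channel spectral slice along the (log-)frequency axis with
    kernels at several scales (local bandwidths) and phases (local symmetries),
    producing a (64, n_phase, len(scales)) tensor for the frame."""
    D1 = len(spec_slice)
    out = np.zeros((D1, n_phase, len(scales)))
    x = np.arange(-16, 17, dtype=float)              # kernel support in channels
    for s_idx, s in enumerate(scales):
        env = np.exp(-0.5 * (x / (2.0 * s)) ** 2)    # Gaussian envelope at this scale
        for p_idx in range(n_phase):
            phase = np.pi * p_idx / n_phase
            kernel = env * np.cos(2.0 * np.pi * x / (8.0 * s) + phase)
            out[:, p_idx, s_idx] = np.convolve(spec_slice, kernel, mode="same")
    return out

# e.g. one 64-channel auditory-spectrogram frame -> 64 x 12 x 5 feature tensor
frame_tensor = cortical_like_tensor(np.random.rand(64))
```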
As a vectorized 1-mode representation, a vector of length $64 \times 12 \times 5 = 3840$ was extracted for each frame (effectively $5 \times 10^5$ vectors in all). For comparison of performance, MFCCs (39 dimensions: 13th order $+\,\Delta+\Delta\Delta$) were also extracted.

Due to the multi-scale analysis of the spectral profile, the extracted data contain large amounts of redundancy. It is also not feasible to train pattern classifiers on this raw, large-dimensional data set. To make it practical, the dimension of the data needs to be reduced before training the classifier. In [45] the authors reduce the dimension using a generalization of principal component analysis (PCA) for this 3-mode tensor, and successfully apply it to robust speech/non-speech discrimination. Although PCA is a powerful dimension reduction technique, it focuses on finding the best representation of the data with fewer principal components. For discriminatory pattern classification, however, discriminant analysis is more appropriate.

For the work presented here, a generalization of the matrix discriminant analysis to the tensor case was implemented. This was originally proposed in [71] and successfully used for face recognition tasks. Although the discriminant analysis of tensor representation (DATER) algorithm is a suboptimal solution and works iteratively by unfolding the tensor along each dimension, it is still not directly practical for the data set extracted here. It is made feasible here by introducing a further sub-optimality: partitioning the data into smaller sets and performing localized discriminant analysis. Since the partitioning is not restricted to a given discriminant analysis method, we also apply it to perform FDA. Our experimental results indicate that splitting the data does not deteriorate performance, and as shown for the MFCC case, it improves the classification result. Next, the DATER algorithm is discussed, followed by the proposed data partitioning modification.

6.4.1 DATER Algorithm

Like FDA, the DATER algorithm seeks matrices $\mathcal{U} = \{U_1 \in \mathbb{R}^{D_1 \times m_1}, U_2 \in \mathbb{R}^{D_2 \times m_2}, \ldots, U_N \in \mathbb{R}^{D_N \times m_N}\}$, where $m_k < D_k\ \forall k$, such that, for a data set tensor $\mathcal{X} \in \mathbb{R}^{D_1 \times D_2 \cdots \times D_N \times N_S}$ ($N_S$ is the total number of sample tensors),

$$\{U_k\}_{k=1}^{N} = \arg\max_{\{U_k\}_{k=1}^{N}} \frac{\sum_c n_c \left\| \bar{\mathcal{X}}_c \times_1 U_1 \cdots \times_N U_N - \bar{\mathcal{X}} \times_1 U_1 \cdots \times_N U_N \right\|}{\sum_i \left\| \mathcal{X}_i \times_1 U_1 \cdots \times_N U_N - \bar{\mathcal{X}}_{c_i} \times_1 U_1 \cdots \times_N U_N \right\|}$$

Here $\bar{\mathcal{X}}_c$ is the mean of the tensors belonging to class $c$, $\bar{\mathcal{X}}$ is the overall mean, $n_c$ is the number of samples in class $c$, and $\mathcal{X}_i$ is the $i$th sample tensor, of class $c_i$. The operation $\times_k U_k$ represents the unfolding of a tensor along the $k$th dimension (into a matrix) and multiplication with $U_k$ [4]. This is equivalent to maximizing the between-class scatter while minimizing the within-class scatter in FDA.

The algorithm finds $\mathcal{U}$ iteratively by re-projecting the tensor $\mathcal{X}$ along $U_k\ \forall k$ at the end of each iteration. Similar to FDA, a generalized eigenvector problem is solved to determine a mapping that maximizes the inter-class scatter and minimizes the within-class scatter (using the unfolded tensor). When the stopping criterion is met, the algorithm outputs the matrices $\mathcal{U} = \{U_1 \in \mathbb{R}^{D_1 \times m_1}, U_2 \in \mathbb{R}^{D_2 \times m_2}, \ldots, U_N \in \mathbb{R}^{D_N \times m_N}\}$. After projecting $\mathcal{X}$ along the matrices in $\mathcal{U}$, we obtain a smaller-dimension tensor $\mathcal{X}' \in \mathbb{R}^{m_1 \times m_2 \cdots \times m_N \times N_S}$.

Figure 6.2: Illustration of data partitioning. Data points belong to 2 classes (hollow and solid). The whole data is partitioned into 3 subsets (circles, squares and triangles). The whole data, mapped onto each projection (resulting from discriminant analysis of each subset), is augmented column-wise to form the resulting data matrix Y.
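To make the $\times_k$ operation in the objective above concrete, the following small numpy helper performs a generic mode-$k$ unfolding and projection; it is only an illustration of the operation (unfold, multiply with $U_k$, fold back), under the usual convention that $U_k$ has size $D_k \times m_k$, and is not the DATER implementation used in the experiments.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: move the chosen axis to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_k_product(tensor, U, mode):
    """Project the mode-`mode` fibres (length D_k) onto the m_k columns of U."""
    D_k, m_k = U.shape
    assert tensor.shape[mode] == D_k
    res = U.T @ unfold(tensor, mode)                       # (m_k, product of remaining dims)
    other = [s for i, s in enumerate(tensor.shape) if i != mode]
    return np.moveaxis(res.reshape([m_k] + other), 0, mode)

# usage: project a 64 x 12 x 5 cortical tensor down to 3 x 2 x 1
A = np.random.randn(64, 12, 5)
U1, U2, U3 = np.random.randn(64, 3), np.random.randn(12, 2), np.random.randn(5, 1)
B = mode_k_product(mode_k_product(mode_k_product(A, U1, 0), U2, 1), U3, 2)
print(B.shape)   # (3, 2, 1)
```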
6.4.2 Partitioning data

Since the DATER algorithm involves unfolding the tensors along each dimension, it is impractical to use it directly on large training sets. However, by partitioning the data into smaller sets, it is possible to use the algorithm on each subset. This is the partitioning modification proposed in this work. The method is as follows:

1. Randomly partition the whole data $\mathcal{X}$ into $P$ sets $L_j$ such that $\mathcal{X} = \{L_1 | L_2 | \ldots | L_P\}$, i.e., $L_j \in \mathbb{R}^{D_1 \times D_2 \cdots \times D_N \times N_j}$ and $\sum_{j=1}^{P} N_j = N_S$ (the total number of sample tensors).

2. FOR $j = 1, 2, \ldots, P$:
   - Execute the DATER algorithm on $L_j$ and obtain $\mathcal{U}_j$.
   - Project the tensor samples $\mathcal{X}$ along $\mathcal{U}_j$ and obtain $\mathcal{X}'_j \in \mathbb{R}^{m_1^j \times m_2^j \cdots \times m_N^j \times N_S}$.
   - Let $Y_j = \mathrm{matricize}(\mathcal{X}'_j)$, i.e., by vectorizing each tensor, obtain $Y_j \in \mathbb{R}^{N_S \times (m_1^j \cdot m_2^j \cdots m_N^j)}$.

3. END

4. Let $Y = \{Y_1 | Y_2 | \cdots | Y_P\} \in \mathbb{R}^{N_S \times \sum_{j=1}^{P}(m_1^j \cdot m_2^j \cdots m_N^j)}$ (column-wise augmentation).

This partitioning technique, illustrated in 2 dimensions, is shown in figure 6.2. As a more concrete example, consider the data used in the current work. From the 1.38 hours of audio data, $N_S \approx 5 \times 10^5$ tensors (each in $\mathbb{R}^{64 \times 12 \times 5}$), belonging to $c = 2$ classes, are extracted. For the DATER algorithm, unfolding along the first dimension would result in a $64 \times 3\cdot10^7$ matrix, a size that is prohibitive from a computational standpoint. The same problem arises when the tensor data set is unfolded along the other dimensions. But if the data is partitioned into $P = 50$ smaller sets ($N_j = 10^4\ \forall j$), unfolding results in matrices that are 50 times smaller. This also yields $P = 50$ projection matrix sets $\mathcal{U}_j$. Since it is a 2-class problem, $m_k^j = 1\ \forall j,k$; therefore each $\mathcal{U}_j$ results in a mapping from the $64 \times 12 \times 5$ space to a 1-dimensional line. By column-wise augmentation, $\sum_{j=1}^{P}(m_1^j \cdot m_2^j \cdots m_N^j) = 50$, and $Y \in \mathbb{R}^{(5\cdot10^5) \times 50}$ is the reduced-dimension data set available for training (instead of the initial $5\cdot10^5 \times 3840$ set). Each column of the resulting matrix $Y$ is the projection of all the data points onto the projection obtained from one partition. While figure 6.2 is drawn in 2 dimensions, the data points are actually in a very high dimensional space. The figure also shows that this partitioning procedure is not specific to the tensor representation: as illustrated, it can also be used for FDA of 2-mode data. This partitioning is also used with FDA to compare the CR with MFCCs. A small sketch of the partition-and-project idea is given below, after which the results of the classification experiments are presented.
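The sketch below is a hedged illustration of the partition-and-project scheme, using ordinary FDA on vectorized data (scikit-learn's LinearDiscriminantAnalysis) in place of the full DATER iterations applied to the tensor case in this chapter; the toy data dimensions are placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def partition_and_project(X, y, P=50, seed=0):
    """X: (N_S, D) vectorized samples, y: (N_S,) labels in {0, 1}.
    Returns Y of shape (N_S, P): each column is the whole data set projected
    onto the 1-D discriminant learned from one random partition."""
    rng = np.random.default_rng(seed)
    partitions = np.array_split(rng.permutation(len(X)), P)    # roughly equal subsets L_j
    columns = []
    for part in partitions:
        lda = LinearDiscriminantAnalysis(n_components=1)       # 2 classes -> 1-D projection
        lda.fit(X[part], y[part])                              # discriminant analysis on L_j only
        columns.append(lda.transform(X))                       # project *all* samples
    return np.hstack(columns)                                  # column-wise augmentation -> (N_S, P)

# usage with toy data standing in for the 3840-dimensional vectorized cortical tensors
X = np.random.randn(2000, 100)
y = np.random.randint(0, 2, size=2000)
Y = partition_and_project(X, y, P=10)
print(Y.shape)   # (2000, 10)
```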
6.5 Results

The performance of a 5-nearest-neighbour (5NN) classifier and a decision stump classifier with AdaBoost, as a function of the number of projections (or columns) used in the matrix $Y$, is shown in figure 6.3. Using only 1 or 2 projections ($Y \in \mathbb{R}^{N_S \times 2}$), the average accuracy (and the true positives rate) of the classifiers using the CR is about 95%. This is about 18-25% better than the performance of MFCCs for the same number of projections.

For MFCCs, as the number of projections (columns of $Y$) increases, the classifier accuracy increases. This trend can be observed for both classifiers. However, for the CR, there is no significant increase with more projections (F-measure = 98.4% for 5NN and 94.1% for AdaBoost with 10 projections). Even with 10 projections, the performance of MFCC features (F-measure = 93.5% for 5NN and 71.2% for AdaBoost) is below the performance obtained using just 1 or 2 projections of the CR. However, for MFCCs, partitioning the data and using multiple projections gives better performance than the average baseline accuracy obtained by FDA on the whole data set (also shown in figure 6.3).

6.6 Discussion and Conclusion

In this work, discrimination of two types of noise sources, machine-generated (such as vehicle noise, engine noise, printers, fax and telex machines, etc.) versus other natural noise sources (rainfall, waves on a seashore, blowing wind, etc.), was presented. Performance of the pattern classifiers was better using the cortical representation (CR) than using MFCCs. To reduce the dimension of the extracted data, a generalized version of discriminant analysis for the multi-dimensional tensor representation was used. Discriminant analysis of this large data set was made tractable by a new data partitioning technique. Intuitively, discriminant analysis on a subset of the whole data gives rise to a projection that is optimal for that smaller subset. This gives rise to many projections for the whole data set. When the information obtained from multiple projections is used for classification, it results in higher classification accuracy than the single projection obtained by discriminant analysis on the whole data. This partitioning technique makes training classifiers on large-dimensional data sets feasible. It can also be extended to real-time processing problems.

From the high correct-classification results, it can be concluded that the two noise types in question are effectively discernible using the cortical representation. The performance differential with respect to MFCC features is significant. This better performance can be attributed to the high-resolution multi-scale spectral analysis of the cortical processing, which in effect also naturally captures temporal properties of audio (due to time-frequency duality), whereas with MFCCs this had to be approximated with delta ($\Delta$) and delta-delta ($\Delta\Delta$) features.

As suggested in [61], as part of future work we would like to use this type of bio-inspired feature-based discrimination ability to robustly represent various audio scenes in an attribute mid-level representation (speech-like, harmonic, machine-noise-like, etc.) and use higher-level decisions to classify a given audio scene.

Figure 6.3: 5-nearest-neighbour (left) and AdaBoost (right) classifier results (average accuracy and true-positives rate) of the cortical representation (CR) versus MFCCs as a function of the number of projections included
(90/10% train/test split; class 1 = machine-generated, class -1 = other noise).

Chapter 7: Latent Perceptual Indexing

7.1 Chapter Summary

We present a framework for description-based audio retrieval using a unit-document co-occurrence measure. Audio descriptions can be text-based, such as a caption (for query-by-text), or example-based, such as an audio clip sample (for query-by-example). Query-by-text retrieval is performed by latent semantic indexing, where words are the units and the captions of the audio clips are the documents. Example-based retrieval is performed by explicitly discovering discrete units in audio clips and then formulating the unit-document co-occurrence measure. For this, feature vectors extracted from the clips in the database are grouped into reference clusters using an unsupervised clustering technique. Using these as units, an audio clip-to-cluster matrix is constructed by keeping count of the number of features that are quantized into each of the reference clusters. By singular value decomposition of this matrix, each audio clip of the database is mapped to a point in a latent perceptual space. This is used for indexing the audio retrieval system.

Since each of the initial reference clusters represents a specific perceptual quality in a perceptual space defined by the signal features (similar to words that represent specific concepts in the semantic space), querying by example results in clips that have similar perceptual qualities. The performance of the indexing methods is evaluated based on two methods of categorization: using semantic labels and using perceptually motivated onomatopoeic labels. Our results indicate that both forms of indexing perform well for the two categorization methods. This supports the fact that text-based descriptions and the semantic and perceptual categories of acoustic sources are inter-related. Using the flexible framework presented here, it is possible to retrieve both semantically and perceptually similar audio clips for queries that can be high-level text descriptions or a clip of an audio signal itself. The method does not make domain-specific assumptions for indexing; therefore it can be extended to different audio classification or retrieval problems with little or no modification.

7.2 Introduction

7.2.1 Problem definition and motivation

The Web 2.0 platform and the proliferation of consumer devices have led to the online availability of large amounts of multimedia content on the World Wide Web. Multimedia content can include text (such as blogs, captions, personal notes, etc.), images (personal or publicly available), audio (podcasts, lectures, music, audio books, news clips, etc.) and video (also personal or publicly available archives, news items, sporting events, movies, documentaries, etc.), or any combination of these media. In these platforms, content can be updated, edited and created easily and frequently. To provide efficient access for both the user and the back-end retrieval system, the content needs to be organized and indexed. Typical examples of commercial Web 2.0 retrieval applications for pictures/images and video are Flickr(TM) and YouTube(TM). In these systems, users can upload an image or a video clip and include a short description of the content. It is also possible to tag or categorize the content, and the application also has the provision for other users to provide comments, tag or rate the content. A user can also search for content using keywords or text descriptions. However, as far as we know, these systems do not allow searching by example.
Therefore the indexing and retrieval method is primarily based on keywords or text descriptions.

The work presented in this paper is related to the problem of organizing and indexing audio for retrieval. A fully duplex retrieval system framework is developed where a user presents a query (a description of what the user needs) to a system with a pre-indexed library of audio clips, and the system retrieves a list of similar audio clips for the user. The full duplex property stems from three aspects:

- First, the query can be one of, or a combination of, the following forms of description: (1) a high-level semantic description such as "airplane take-off", (2) a description of the acoustical properties such as "buzzing sounds", or (3) an example of the audio clip itself. Corresponding to the query forms, the system can retrieve a list of meaningful high-level text descriptions, a semantic label categorizing the query, a perceptually meaningful label that describes the acoustic properties, or a list of similar-sounding audio clips.

- Second, these varying levels of description are inter-related because semantic similarities implicitly result in perceptual similarities and vice versa. This is empirically true for both high-level text descriptions and example-based descriptions. For instance, "airplane take-off" and "car engine start" are two descriptions of clips that belong to the semantic category of "transportation sounds". The acoustical similarities between airplane engine and car engine sounds can be easily understood, as they both have a buzzing or humming quality. Conversely, since "airplane engine" and "car engine" have perceptual similarities, they also belong to the same semantic category of "transportation sounds".

- Finally, in contrast to the previous point, audio clips belonging to the same semantic category can belong to different perceptual categories and vice versa. For instance, "women walk on tarmac" and "small cocktail party" belong to the same semantic category of "human sounds", but they perceptually sound different. Also, "women walk on tarmac" and "horse walks on tarmac" sound similar, yet they belong to different semantic categories.

To handle retrieval at these different levels, and to include semantically and perceptually relevant similarities, the system needs to be able to compute appropriately both in terms of text descriptions and signal-level features. In this work, retrieval for text-based queries is implemented using latent semantic indexing (LSI) [7, 19], where, using a unit-document or word-document co-occurrence measure, text documents are mapped to a high-dimensional semantic space. Retrieval for example-based queries is through signal-level features whose response to audio is similar to that of the front-end of the human auditory system. Indexing for example-based retrieval is also implemented using a unit-document co-occurrence measure. In this case, however, the units are explicitly discovered in the signal features using unsupervised clustering methods, and the document is the whole set of features extracted from a given audio clip [60]. Here, the full duplex nature of the framework is brought out by using onomatopoeia labels for acoustic properties and semantic labels for high-level description of audio, and by performing retrieval using text captions and example audio clips. The main motivation for incorporating text descriptions of acoustic properties in retrieval is the fact that humans express a wide variety of acoustic properties using language, in addition to the semantic description of the acoustic event.
Many acoustic qualifications of events are intuitively expressed using onomatopoeia words [8]. As an example, to describe the event "knocking on the door", the words "tap-tap-tap" describe the acoustic properties well. Communicating acoustic events in such a manner is possible because of a two-way mapping between the acoustic space and the language or semantic space. The existence of such a mapping is language dependent and a result of a common understanding of familiar acoustic properties [8, 59]. Here, it is important to point out the following: there is a difference between the language descriptions "knocking on the door" and "tap-tap". The former is a high-level description of a specific semantic event and the latter is closer to a description of its acoustic properties. This form of onomatopoeic description of acoustical properties is particularly useful in the context of audio retrieval because the human auditory system predominantly relies on perception [34]. It allows for the development and evaluation of a retrieval system that is indexed in a perceptually and semantically meaningful way. Therefore, in this work, perceptually meaningful onomatopoeia words (such as buzz, dong, whoosh, bang) are used as category labels in addition to semantic category labels (such as human sounds, transportation, animals, etc.).

Potential applications of the approach presented here include spoken document retrieval, query-based audio retrieval, audio browsing, audio fingerprinting, context recognition, auditory scene recognition, audio-based audio-visual segmentation, and problems in music information retrieval. Next, a brief overview of the retrieval system is presented, followed by a summary of the main contributions of this work in the context of other contemporary systems.

7.2.2 Overview

A typical description-based audio information retrieval system is shown in figure 7.1.

Figure 7.1: Overview of a description-based audio retrieval system.

In this system, the back-end database or library consists of audio clips annotated with high-level text captions/descriptions, their semantic category label and their onomatopoeic category label. This library is assumed to be indexed offline using the framework presented in this paper. The indexing step involves mapping the audio clip and the high-level text description to vectors in a high-dimensional latent perceptual space and semantic space, respectively. The corresponding vectors in these spaces are denoted as $\tilde{f}_i$ and $\tilde{b}_i$ in the figure. A client or a user presents a query to the system, and the system retrieves a ranked list of items from the library based on a similarity measure. The equivalent representation of the presented query (either in the perceptual or the semantic space) is first derived, and the ranked list of clips is retrieved based on the degree of similarity between the query representation and the indexed library clips. The similarity metric used differs based on the type of query presented to the system. It can be either signal-level similarity between vectors in the latent perceptual space, for the case of query-by-example, or semantic similarity, for text-based queries, or both. Similarly, the retrieved results can be either a set of descriptive tags, such as high-level category labels, onomatopoeic labels or short phrases describing the clip, or a set of ranked audio clips itself. As mentioned earlier, this is the full duplex nature of the retrieval system implemented here.
Next, the main contribution of this work is highlighted, followed by a discussion of other work related to the problem of audio information retrieval.

7.2.3 Contribution

In this paper, an audio retrieval system that addresses both perceptual and semantic descriptions is presented. The framework provides a method to handle both retrieval by audio example and retrieval by text description. The main contributions of this work are as follows. First, in contrast to the methods discussed in the next section [23, 26, 37, 69], the work presented here does not deal with training models specifically for explicit class definitions (such as music, stationary noise, animal, laughter, speech, etc.). Instead, the perceptual structure of an audio clip as a single vector representation is obtained from its signal-level measures using methods that extract latent information through a unit-document co-occurrence measure. Additionally, for higher-level text-based indexing, a text vector is also obtained from the clip's text caption by LSI. These two vector representations for each audio clip allow for retrieval by audio example and by text description. The framework is implemented and evaluated using a generic sound effects audio database [1] that covers a wide variety of semantic and perceptual characteristics. Since an entire audio clip is represented as a single vector, the computationally intensive signal-based similarity measure becomes manageable. This feature is ideal for the very large audio databases that are typical of Web 2.0 applications. This representation method brings out an underlying latent perceptual structure of audio clips. The performance of the acoustic representation and the text representation is analyzed using two categorization methods: high-level semantic categories and mid-level, perceptually meaningful onomatopoeia labels. For semantic evaluation, the available audio database is split into twenty-one mutually exclusive high-level categories (such as airplane, crowd, construction, industry, etc.). The category labels are obtained directly from a commercially available sound effects database [1]. The onomatopoeia-based categorization was performed by labeling the clips in the database with perceptually descriptive onomatopoeia words. In section (7.4.1) a semi-automatic method is developed to label the clips in the database so that both subjective interpretation and word-to-word relations are incorporated. The performance of the single-vector representation of each clip's text caption (for retrieval by text description), obtained in the latent semantic space, is also analyzed and presented.

7.2.4 Related Work

Contemporary work in audio information retrieval mainly involves semantic categorization of audio. Typical examples of such content-based methods through direct labeling for a generic audio database are [23, 69]. Here the system is evaluated on a database of animals, bells, crowds, female voices, laughter, machines, male voices, percussion instruments, telephone, water sounds, etc. Their work is important in the sense that it establishes the discriminative power of the signal-level features and the machine learning algorithm. The underlying assumption in these methods is that clips belonging to the same category are perceptually similar, and their database is designed accordingly. While reasonable, this assumption has its limitations, as it does not allow for the possibility that, in general, acoustic sources belonging to the same category can be perceptually dissimilar, and vice versa.
This approach is limited to specific predefined classes, and it becomes harder to scale it to other applications.

An alternative scalable scheme that avoids direct labeling of audio clips is to organize clips in a hierarchy. It is a more flexible extension of the speech/non-speech approach to audio processing. For example, in [73] the authors present a system of hierarchical categories such as silence, with music components and without music components. They also present results on grouping as pure speech, pure music, speech plus music background, sound effects plus music background, and harmonic and non-harmonic sound effects. The work in [37] also uses a similar hierarchical scheme by grouping sources into speech, music or environmental sounds. They first determine whether a given segment is speech or non-speech; speaker change-point detection is then performed for speech segments, while non-speech segments are further classified into music, environment and silence.

While these generalized categories are applicable to various domains, they lack the specific descriptive information of a direct labeling of a given audio clip. Therefore, it is desirable to move away from methods that implement limited signal-based labeling/modeling and similarity measures (implemented in classical content-based audio analysis methods [23, 26, 37]) to more descriptive methods. Examples of methods that deal with semantic descriptions for audio retrieval are [5, 12, 55]. In [55] the author improves on the direct labeling scheme by creating a mapping from each node of a hierarchical model in the abstract semantic space to the acoustic feature space. The nodes in the hierarchical model (represented probabilistically as words) are mapped onto their corresponding acoustic models. In [5], the authors take a similar approach of modeling features with the text labels in the captions. Other techniques for retrieval using semantic relations in language include [12]. Here WordNet [43] is used to generate words for a given audio clip using acoustic feature similarities, and then to retrieve clips that are similar to the tags.

The work discussed so far is based on high-level labels and the associated text descriptions. An example of systems that deal with onomatopoeic descriptions is found in music information retrieval. In [24] the query to the retrieval system is a spoken form of onomatopoeia for percussive sounds. The authors focus on eight basic drum categories and their corresponding onomatopoeia labels (such as bom, ta, ti, do). Spoken queries were in the form of short sequences of onomatopoeia words.

In [56] a form of unit-document co-occurrence for semantic information extraction in audio is used implicitly. While it was used for deriving semantic concepts for audio-visual segmentation, their information extraction procedure for audio is relevant here. In this, the audio features are used directly at different scale levels to represent the overall timbral qualities of the audio track. This is combined with the video and text features, and the time series of these features, after dimensionality reduction, is used to segment media clips.

Most of the work discussed here addresses the measurement of acoustic sources according to their semantic content. It primarily deals with forming semantic categories of acoustic sources, or extracting information that is semantically meaningful.
In contrast, the work presented in this paper addresses the perceptual aspects of audio and its description in addition to its semantic description and categorization. As mentioned earlier, it also explores the implicit relationship between semantic categories of audio sources and their perceptual similarities. The paper is organized as follows. In section (7.4), the database, the labelling scheme, and the signal features extracted for the implementation are discussed. In section (7.5) the experiments performed and the results obtained are presented, followed by the conclusion and a discussion of the results. Text-based and example-based indexing are presented first, in the next section.

7.3 Proposed Method

In latent information analysis of documents, a reduced-rank approximation of the unit-document co-occurrence measure is used to obtain representations in the analysis space. Since the co-occurrence measure is generally a rectangular matrix, the rank reduction is achieved by singular value decomposition (SVD). In the following section, a brief discussion of SVD is presented, followed by a description of the representation of a document and the similarity measure in the high-dimensional space to which the documents are mapped. Based on the unit-document co-occurrence ideas discussed for LSI, latent perceptual indexing (LPI) of an audio clip example is also presented.

7.3.1 Singular Value Decomposition (SVD) and Representation

Singular value decomposition [58] is a generic approach to factoring an $M \times N$ matrix in terms of its singular values and singular vectors. It states that a matrix $G$ can be decomposed as

$$G_{M\times N} = U_{M\times M} \cdot S_{M\times N} \cdot (V_{N\times N})^{T} \qquad (7.1)$$

where $U$ is a matrix of column-orthonormal vectors (the left singular vectors), $S$ is a diagonal matrix of singular values and $V$ is a matrix of row-orthonormal vectors. In many applications, $G$ is a sparse sample matrix, where each row corresponds to a sample. Two important aspects of the sample matrix and its factorization are discussed next. Then, the similarity between two vectors in the representation space is presented.

7.3.1.1 Rank R approximation

One of the important properties of this factorization is that we can obtain the rank-$R$ approximation of the matrix $G$. This is obtained by retaining only the significant singular values of $G$ and the corresponding columns of $U$ and rows of $V^T$. In this work, the significant singular values are chosen according to the percentage of the total power in the top $R$ singular values. For example, to retain 90% of the total power, the smallest $R$ is chosen such that

$$\frac{\sum_{k=1}^{R} s_k^2}{\sum_{k=1}^{M} s_k^2} \ge 0.9 \qquad (7.2)$$

where $s_1 \ge s_2 \ge \cdots \ge s_M$ are the diagonal elements of $S$. The main reason for obtaining the rank-$R$ approximation is to reduce the noisiness in the data while still maintaining the correlation between the samples [22]. The resulting approximate sample matrix is given by

$$\tilde{G}_{M\times N} = \tilde{U}_{M\times R} \cdot \tilde{S}_{R\times R} \cdot (\tilde{V}_{N\times R})^{T} \qquad (7.3)$$

Here, $\tilde{U}$ and $\tilde{V}$ are matrices obtained by retaining the columns corresponding to the $R$ significant singular values, and $\tilde{S}$ is the diagonal matrix formed by the $R$ significant singular values.

7.3.1.2 Representation

For the sample matrix $G$, the row vectors are the individual samples. In the present case, they are the unit-document co-occurrence measure, i.e., the frequency of occurrence of each unit (column) in a given document (row). In the rank-$R$ approximation, each row is projected onto the basis formed by the columns of $V$ [7]. Thus the vector characterizing the $i$th sample (or row of $G$) is represented by the $i$th row of $\tilde{U}\cdot\tilde{S}$.
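A minimal numpy sketch of this rank-$R$ truncation and the resulting row representation is given below; the 90% power threshold follows equation (7.2), while the matrix $G$ here is only a random stand-in for an actual unit-document matrix.

```python
import numpy as np

def rank_r_representation(G, power=0.9):
    """Truncate the SVD of G to the smallest rank R that keeps `power` of the total
    squared singular-value mass, and return the sample (row) representations,
    i.e. the rows of U_tilde @ S_tilde, together with V_tilde."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)    # s is returned in decreasing order
    cum = np.cumsum(s**2) / np.sum(s**2)
    R = int(np.searchsorted(cum, power) + 1)            # smallest R reaching the power threshold
    return U[:, :R] * s[:R], Vt[:R].T, s[:R]

# usage with a toy 7 x 8 unit-document matrix
G = np.random.rand(7, 8)
doc_vectors, V_r, sing_vals = rank_r_representation(G)
print(doc_vectors.shape)   # (7, R): one R-dimensional vector per document (row of G)
```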
7.3.2 Similarity Measure

Since the rows of $\tilde{U}\cdot\tilde{S}$ correspond to the samples or documents, using a cosine metric, the similarity between the $k$th and $i$th samples in the representation space can be expressed as the angle between the vectors:

$$\mathrm{Similarity}(k,i) = \cos^{-1}\!\left( \frac{(\tilde{u}_k \times \tilde{S}) \cdot (\tilde{u}_i \times \tilde{S})}{\|\tilde{u}_k \times \tilde{S}\| \cdot \|\tilde{u}_i \times \tilde{S}\|} \right) \qquad (7.4)$$

Here, $\times$ is the vector-matrix product, $(\cdot)$ is the dot product between two vectors, and $\|\,\|$ is the vector length. In the next section, the data representation and mapping method based on SVD is discussed for representing both text-based queries and audio example-based queries.

7.3.3 Text-based indexing using Latent Semantic Indexing

Latent semantic indexing is a method to represent and index text documents in terms of the words present in them. It is based on the assumption that the underlying semantic structure in a document can be derived by analyzing the individual word occurrences in it. It was originally proposed in [19], and later extensions of its applications to problems such as email filtering, document classification and unit selection for text-to-speech synthesis are presented in [7]. This method of mapping documents to a high-dimensional space is briefly presented here again, as it sets the stage for example-based retrieval using latent perceptual indexing.

In this method, indexing a set of text documents begins by viewing each document as a bag of words/terms.

No.  Filename : Description
1    11 MO OLD BOY BATHING : 11 month old baby bath with water and bob toy
2    CAR ALARM : Car alarm
3    DESCENDING SIREN 01 : Alarm and siren descend
4    POLICE CAR LEAVES SIREN : Emergency police car depart with wail siren
5    BEAGLE DOG DRINK WATER : Animal beagle dog drink water interior
6    CAR WASH FROM INSIDE : Water car wash from inside car Berkshires
7    LARGE WATER SPLASH : Water splash large

Table 7.1: A list of seven audio clips and their high-level descriptions from the database.

The documents and words are discrete entities that combine to form a concept. They are discrete because they represent specific, well-defined volumes in a semantic space. For example, words such as "door" or "road" represent specific semantic entities and associated properties. LSI maps discrete entities to a continuous space that spans the semantic information present in them. To create a representation of a document in the semantic space, a term-document frequency count is calculated for each term in the bag of words within each document. Adopting the notation from [7], for a set of $M$ documents and a list of $N$ words or units in a vocabulary $W$, this results in an $M \times N$ matrix. For a simplified illustration, consider the list of $M = 7$ audio clip descriptions shown in table 7.1. For LSI, the description or caption of each clip is considered to be a document. To proceed, a list of words is selected from the documents. For this illustration $N = 8$ words are chosen: baby, alarm, animal, siren, water, car, dog, large. The term-document frequency is obtained as

$$[B_{M\times N}]_{ij} = \frac{c_{ij}}{T_i} \qquad (7.5)$$

Here $c_{ij}$ is the number of times the $j$th word of the vocabulary $W$ occurs in the $i$th document, and $T_i$ is the total number of words in the $i$th caption or document. In general, $M$ and $N$ are very large ($\sim 10^3$ to $10^4$). This results in a large, sparse matrix $B$. This matrix is also intrinsically noisy because the words used in the documents can map to a variety of meanings, or many meanings can be represented by the same word depending on context. To reduce this noisiness and derive the underlying semantic structure, a reduced-rank approximation of $B$ is obtained by SVD, as explained earlier in section 7.3.1.
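As a small illustration of equation (7.5), the snippet below builds the term-document matrix for the seven captions of table 7.1 with the eight-word vocabulary used in the text; the normalization by the caption length $T_i$ follows the definition above, and a rank-$R$ SVD of the resulting matrix (as in section 7.3.1) then yields the latent semantic space.

```python
import numpy as np

captions = [
    "11 month old baby bath with water and bob toy",
    "car alarm",
    "alarm and siren descend",
    "emergency police car depart with wail siren",
    "animal beagle dog drink water interior",
    "water car wash from inside car berkshires",
    "water splash large",
]
vocab = ["baby", "alarm", "animal", "siren", "water", "car", "dog", "large"]

def term_document_matrix(docs, vocab):
    """B[i, j] = (count of vocab[j] in document i) / (total words in document i)."""
    B = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        words = doc.lower().split()
        for j, w in enumerate(vocab):
            B[i, j] = words.count(w) / len(words)
    return B

B = term_document_matrix(captions, vocab)
print(B.shape)   # (7, 8): M documents x N vocabulary words
```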
7.3.4 Text Query Representation

To represent a query in the semantic space (one that is not part of the initial collection), the co-occurrence of the vocabulary $W$ in the query text is first estimated. This results in an $N$-dimensional vector $b$, similar to a row of $B$. This can be taken as an additional row of $B$, and assuming that $S$ and $V$ remain the same, we can express

$$b = u_b \times S \cdot V^{T}$$

Here $u_b$ is the additional row in $U$ corresponding to $b$. For the similarity measure, we need to estimate $u_b \cdot S$. From the above equation we get the representation of the query as

$$\tilde{b} = \tilde{u}_b \times \tilde{S} = b \times \tilde{V}$$

By using the similarity measure in section (7.3.2), it is possible to retrieve the set $\{R_b\}$ that is close to the query $b$.

Figure 7.2: An example representation of the seven clips in the latent semantic space (points are labeled with the captions of table 7.1; the axes are the first and second SVD dimensions).

For illustration, the representation of the audio descriptions in table 7.1 is shown in figure 7.2. From the figure it can be seen that the resulting representation is semantically meaningful. For example, in the two-dimensional space shown, the captions with the words alarm or siren are close to each other, as are the captions with the word water. This closeness can also be observed for captions with the word car. An interesting case is the caption "Water car wash from inside car Berkshires": it contains both the words car and water, and it is approximately equidistant from the other groups. The remaining words, such as boy, dog, bath, large, are also significant in this illustration because, although they occur only once in their corresponding captions, they affect the placement of the vectors containing the other words such as car, water or alarm. Further properties, including more sophisticated applications, can be found in [7].

This method of latent analysis and indexing by measuring the unit-document occurrence frequency is also adopted for indexing the audio clips, using clusters of their feature vectors in feature space as "units" and the audio clip as the document. Additionally, the similarity measure between documents (or clips) is also presented. Details of this and its relationship with LSI are presented next.

7.3.5 Audio representation in Latent Perceptual Space

Audio clips are represented as vectors in the latent perceptual space in a way similar to how text documents are indexed in the latent semantic space. Here, an audio clip is modeled as a sequence of non-overlapping, time-ordered acoustic units. Each of these units belongs to a unique region in the feature space, in the same way that words represent distinct meanings in the semantic space. Since the feature space is derived using perceptually meaningful feature measures of the audio signal, the resulting indexing is according to the perceptual content. This is analogous to the sequence of words that make up sentences in a text document and the representation of documents using a word-document co-occurrence measure. Formally, an audio clip $A_i$, in a dataset of $M$ clips, $i \in \{1,2,\ldots,M\}$, can be expressed as

$$A_i = \left( u^i_{t_1}, u^i_{t_2}, u^i_{t_3}, \ldots, u^i_{t_{K_i}} \right) \qquad (7.6)$$

Figure 7.3: Representation in Latent Perceptual Space.
Here $t_1 < t_2 < t_3 < \cdots < t_{K_i}$ are time indices, and $u^i_{t_k}$ is the $t_k$th acoustic unit of the $i$th clip. In frame-based analysis, each of these units may be made up of any number of frames of audio. The next step is to identify each of these individual units in terms of a vocabulary, and subsequently derive the unit-document measure. Let $\mathcal{V}$ be a vocabulary of $N$ identifiable units; for a given dataset, $\mathcal{V}$ is obtained by clustering the dataset (in an unsupervised manner) into $N$ distinct clusters in the feature space. Since the clusters are identified by their centers, the resulting $N$ centers $\{C_1, C_2, C_3, \ldots, C_N\}$ are taken as the known units of the vocabulary, i.e., $\mathcal{V} = \{C_1, C_2, C_3, \ldots, C_N\}$. Next, using a Euclidean distance measure, each feature vector extracted from the clip is quantized into one of the $N$ vocabulary units. After identifying all the frames extracted from the audio, this step results in a time-ordered sequence of units as described in equation (7.6). Here the sequence $u_{t_1}, u_{t_2}, u_{t_3}, \ldots, u_{t_K}$ becomes the sequence $v_{t_1}, v_{t_2}, v_{t_3}, \ldots, v_{t_K}$, where $v_{t_k} \in \mathcal{V}\ \forall\, t_k$. In this way an audio clip can be represented as a sequence of units. The method employed here is highly data dependent, since $\mathcal{V}$ is obtained from a fixed dataset, and the representation of an audio clip as units is also dependent on the choice of $N$. Note that, since $A_i$ is now represented as a sequence of units in $\mathcal{V}$, by counting the number of times each of $\{C_1, C_2, C_3, \ldots, C_N\}$ occurs in $A_i\ \forall i$, a unit-document co-occurrence matrix can be created for the whole collection of $M$ clips. Next, the implementation of this scheme is presented.

7.3.5.1 Algorithm

Assume that a collection of $M$ audio clips is available in a database as the training data, and let the $i$th clip have $T_i$ feature vectors. The procedure for obtaining a representation in the latent perceptual space is as follows:

STEP 1. The collection of all the feature vectors obtained from all the clips in the training database is clustered using the k-means clustering algorithm. This results in $N$ reference clusters. The centers of the reference clusters are the $N$ vocabulary units of $\mathcal{V}$.

STEP 2. Let the $i$th audio clip have a total of $T_i$ frames. FOR audio clip $A_i$, where $i \in \{1,\ldots,M\}$, DO:

   i. Calculate $f_{i,j} = \frac{\sum_{t=1}^{T_i} I(\mathrm{lab}(t) = j)}{T_i}\ \ \forall j \in \{1,\ldots,N\}$. Here $I(\cdot) \in \{0,1\}$ is an indicator function: $I(\mathrm{lab}(t) = j) = 1$ if the $t$th frame is labeled as belonging to the $j$th cluster, and $I(\cdot) = 0$ otherwise.

   ii. Assign $F(i,j) = f_{i,j}$, the $(i,j)$th element of the sparse matrix $F_{M\times N}$.

   STEP 2 derives the unit-document frequency. First, the feature vector of each frame of a given clip is quantized into one of the $N$ reference clusters. This is an identification step that determines which of the $N$ units is present in a given segment of the clip.
Therefore, the given set of audio clips are indexed in the latent perceptual space. This is analogous to text document representation by LSI with term- document frequency. Therefore, ideas of similarity measurement and representation of a query can be re-applied here. Since similarity measure has already been discussed in section (7.3.2), only query representation will be discussed next. 7.3.6 Query Representation To represent a query audio clip in latent perceptual space (not part of the initial col- lection), the number of feature-vectors of the query in each of the N reference clusters 134 is first estimated. This results in a N dimensional vector x similar to a row of F. This can be seen as an additional row of F, and assuming S and V remain the same, we can express: x =u x ×S·V T Here u x is the additional row in U corresponding to x. For similarity measurement we need to estimate u x ·S. From the above equation we get the representation of the query audio clip as: ˜ x= ˜ u x × ˜ S =x× ˜ V By using the similarity measure in section (7.3.2), it is possible to retrieve the set of {R x } that are close to the query x. Since the similarity is not calculated directly in the feature space, it makes compar- ison of two audio clips significantly more manageable. After the initial clustering and SVD (can be performed offline), since N <T i ∀i∈{1,...,M}, retrieving clips based on this similarity measure is tractable. 7.3.7 Relationship with the LSI framework While the method presented here is similar to the LSI framework, there are some dif- ferences which are discussed here for clarity. As stated in [7], LSI tries to uncover the underlying semantic structure in data by eliminating the randomness that arises due to variations in expressing the same concept with different choice of words. It maps discrete objects such as words and documents onto a continuous space. The words and documents occupy specific volumes in the semantic space as concepts, which is used in 135 measuring “closeness” between documents. The present work, attempts to derive the underlying perceptual structure notwithstanding the randomness caused by temporal variations. These variations arise due to differences in the environment in which the acoustic source is present. Based on the features extracted from a given database, the method presented here seeks distinct acoustic clusters in the perceptual space. The significance of these acoustic clusters in the perceptual space are analogous to concepts in the semantic space. Therefore, the resulting similarity is a measure of closeness in the perceptual structure between two clips. One fundamental difference between text andaudiois that discrete unitsanddocuments comprised of discreteunits already exist in text. These discrete units combine to express a concept. However, in audio, discrete unitsandthedocumentsare discovered usingunsupervisedclusteringandquantization. The results obtained in our experiments are sensitive to the choice of number of clusters and the span of the domain of the dataset. In the next section, the details of the dataset used in this work is presented. It also includes the feature set used for deriving the vector representation for example-based retrieval. 7.4 Database For the presented application, 2,491 whole audio clips from the BBC Sound Effects Library [1] were used. Each clip in the library is labeled with a semantically high- level category and a perceptually descriptive onomatopoeia tag that best describes the acoustic properties of the source. 
7.4 Database

For the presented application, 2,491 whole audio clips from the BBC Sound Effects Library [1] were used. Each clip in the library is labeled with a semantically high-level category and a perceptually descriptive onomatopoeia tag that best describes the acoustic properties of the source. The database is available pre-organized according to high-level categories and their corresponding subcategories. A short caption that appropriately describes the event in the audio clip is also available. This organization is primarily tailored for sound engineers to mix actual acoustic events during sound production. The database is originally designed to be accessible through text-based searches or by manually browsing the catalog.

Category         No. of files    Category         No. of files
IMPACT           16              NATURE           85
OPEN             8               SPORTS           151
TRANSPORTATION   295             HUMAN            357
AMBIENCES        311             EXPLOSIONS       18
MILITARY         102             MACHINERY        117
ANIMALS          359             SCI-FI           121
OFFICE           144             POLICE           96
HORROR           98              PUBLIC           44
AUTOMOBILES      53              DOORS            4
MUSIC            25              HOUSEHOLD        38
ELECTRONICS      49

Table 7.2: Distribution of clips under each semantic category.

For the experiments presented in the next section, the semantic category labels provided in the library were directly used. The selection of 2,491 clips used here belonged to twenty-one high-level semantic categories. The distribution of the number of audio clips under each category is shown in table 7.2. Since a mid-level onomatopoeic tag was not available in the database, the clips were manually tagged with onomatopoeic labels by listening to each clip. This is described next.

7.4.1 Assigning onomatopoeia labels to audio clips

Choosing an onomatopoeic label that best describes the acoustic properties of the source in an audio clip is completely based on subjective perception, and this can only be achieved by manually listening to each clip. Such subjective categorization of clips for a reasonably large database is tedious and prone to inconsistencies. In this work, to keep errors to a minimum, the assignment of onomatopoeic labels was done in three passes and a final consolidation step. First, a small set of clips was manually tagged by subjects. Then, other clips having the same filename were assigned the same tags as the corresponding ones in the initial set. Finally, by manual comparison by the author with the sets obtained from the first two passes, the remaining files in the database were also appropriately labeled. As a final post-labeling step, using automatic clustering methods and a word-based similarity measure, files with similar onomatopoeic labels were grouped together. This resulted in twenty-two onomatopoeic categories. At the end of this process, an appropriate onomatopoeia label still could not be found for 351 of the 2,491 clips.(1) Therefore, for the experiments with onomatopoeia-based categorization presented later, only 2,140 clips were used. The distribution of the clips under each onomatopoeia label is shown in table 7.4.

Bang, Bark, Bash, Beep, Biff, Blah, Blare, Blat, Bleep, Blip, Boo, Boom, Bump, Burr, Buzz, Caw, Chink, Chuck, Clang, Clank, Clap, Clatter, Click, Cluck, Coo, Crackle, Crash, Creak, Cuckoo, Ding, Dong, Fizz, Flump, Gabble, Gurgle, Hiss, Honk, Hoot, Huff, Hum, Hush, Meow, Moo, Murmur, Pit-pat, Plunk, Pluck, Pop, Purr, Ring, Rip, Roar, Rustle, Screech, Scrunch, Sizzle, Splash, Splat, Squeak, Tap-tap, Thud, Thump, Thwack, Tick, Ting, Toot, Twang, Tweet, Whack, Wham, Wheeze, Whiff, Whip, Whir, Whiz, Whomp, Whoop, Whoosh, Wow, Yak, Yawp, Yip, Yowl, Zap, Zing, Zip, Zoom

Table 7.3: The starting list of onomatopoeia labels used in this work.

(1) The main reason for this is that the initial list of onomatopoeia words does not appropriately cover the innumerable acoustic properties of the clips. Also, many audio clips are too complicated to correctly apply an onomatopoeic description to them. This leads to unlabeled files in the final labelling step.
7.4.1.1 First Pass

A set of 236 audio clips (belonging to categories such as animals, birds, footsteps, transportation, construction work, fireworks, etc.) was selected from the sound library. Four subjects, with English as their first language, tagged this initial set of clips with onomatopoeia words. A graphical user interface (GUI) based software tool was designed to play each clip over a pair of headphones and have the subject click on the relevant onomatopoeia words that best described the clip. The starting list of onomatopoeia labels used in the GUI is shown in table 7.3. All the clips were edited to be about 10-14 seconds in duration. The clips were randomly divided into 4 sets, so that the volunteers spent only 20-25 minutes at a time, per set. Only the tags that were common to two or more volunteers were retained. In general, at least one or two tags assigned by the volunteers were in agreement for a clip. Note that the resulting labels are assumed to be the onomatopoeic labels that best represent the perceived acoustic properties.

7.4.1.2 Second Pass

The tags from this set were then copied to other clips with similar names. For example, the clip with the lexical name BRITISH SAANEN GOAT 1 BB received the tags {blah, blat, boo, yip, yowl}, and this same set of words was used to tag the file BRITISH SAANEN GOAT 2 BB. Similarly, the audio clip BIG BEN 10TH STRIKE 12 BB received the tags {clang, ding, dong}; these tags were also used for the file BIG BEN 2ND STRIKE 12 BB. After transposing the tags, a total of 1,014 clips was available.

Category   No. of files    Category   No. of files
GROWL      73              MEOW       67
CRACKLE    34              BLEAT      17
CLATTER    157             GABBLE     128
DONG       108             BANG       69
HONK       64              BEEP       111
WHOOSH     86              BURR       143
HUM        240             BUZZ       151
THUD       77              SPLASH     66
TICK       17              CRUNCH     48
TAP        240             SQUEAK     124
TWEET      108             CROW       12

Table 7.4: Distribution of clips in each onomatopoeia category.

7.4.1.3 Third Pass

Using the labeled 1,014 clips as a reference, the remaining clips were labeled by manually listening to each file. To keep the labels consistent, this was done only by the author, using the GUI interface.

7.4.2 Consolidation

As mentioned in [59], in the first pass eighty-seven onomatopoeic labels were available; however, only forty-two labels were actually used in all. This is due to the fact that a given set of onomatopoeia words does not necessarily span all the acoustic properties in a given library, and vice versa. Since many words have overlapping meanings (such as hiss and buzz), the set of forty-two words was further reduced to twenty-two based on word-word similarity and the unsupervised clustering technique explained next.

Unsupervised clustering of similar onomatopoeia words: A set $\{L_i\}$ consisting of $l_i$ words is generated by a thesaurus [67] for each word $O_i$ in the list of onomatopoeia words. Then the similarity between the $j$th and $k$th words can be defined as

$$s(j,k) = \frac{c_{j,k}}{l^{d}_{j,k}}, \qquad (7.7)$$

resulting in a distance measure

$$d(j,k) = 1 - s(j,k) \qquad (7.8)$$

Here $c_{j,k}$ is the number of common words in the sets $\{L_j\}$ and $\{L_k\}$, and $l^{d}_{j,k}$ is the total number of words in the union of $\{L_j\}$ and $\{L_k\}$. By this definition it can be seen that

$$0 \le d(j,k) \le 1 \qquad (7.9)$$
$$d(j,k) = d(k,j) \qquad (7.10)$$
$$d(k,k) = 0 \qquad (7.11)$$

Except for the triangular inequality, it is a valid distance metric. It is also based on word-word relations, because the words in the sets $\{L_j\}$ and $\{L_k\}$ generated by the thesaurus carry meaning associated with the words $O_j$ and $O_k$ in the language.
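A minimal sketch of this word-overlap metric is given below; the thesaurus sets $\{L_i\}$ here are small invented stand-ins, since the actual sets were generated with the thesaurus of [67].

```python
def onomatopoeia_distance(L_j, L_k):
    """d(j,k) = 1 - (common words) / (words in the union), as in equations (7.7)-(7.8)."""
    c_jk = len(L_j & L_k)     # number of common thesaurus words
    l_jk = len(L_j | L_k)     # total number of words in the union
    return 1.0 - c_jk / l_jk

# toy thesaurus sets (illustrative only)
L = {
    "buzz": {"hum", "drone", "whir", "murmur"},
    "hum":  {"buzz", "drone", "murmur", "purr"},
    "bang": {"boom", "crash", "thud", "slam"},
}
print(onomatopoeia_distance(L["buzz"], L["hum"]))   # small distance: overlapping meaning
print(onomatopoeia_distance(L["buzz"], L["bang"]))  # distance 1.0: no common words
```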
The similarity between two words depends on the number of common words (a measure of sameness or overlap in meaning). Therefore, for a set of $W$ words, using this distance metric we obtain a symmetric $W \times W$ distance matrix whose $(j,k)$th element is the distance between the $j$th and $k$th words. Note that the $j$th row of the matrix is a vector representation of the $j$th word in terms of the other words present in the set. We perform principal component analysis (PCA) [20] on this set of feature vectors and represent each word as a point in a smaller-dimensional space $O_d$ with $d < W$. In our implementation, the squared sum of the first eight ordered eigenvalues covered more than 95% of the total squared sum of all the eigenvalues; therefore $d = 8$ was selected for the reduced-dimension representation. From table 7.3 it can be seen that many words have overlapping meanings (e.g. clang and clank), and some words are closer in meaning to each other than to the rest (e.g. fizz is close to sizzle and bark is close to roar, but the pairs (fizz/sizzle) and (bark/roar) are far from each other). By modeling the onomatopoeia word representations in this space as observations of a multivariate Gaussian process, the words can be clustered. This results in the consolidation of onomatopoeia words with overlapping sense under one category. Here, the number of clusters for unsupervised clustering was decided using the Bayesian information criterion (BIC). The BIC [54] is widely used for choosing the appropriate number of clusters in unsupervised clustering [14, 74]. If each cluster in a model $M_k$ (with $k$ clusters) is assumed to follow a multivariate Gaussian distribution, we get a closed-form expression for the BIC, as given in [14]. For a set of competing models $\{M_1, M_2, \ldots, M_i\}$ we choose the model that maximizes the BIC. We use this criterion to choose $k$ for the k-means algorithm for clustering the words. Since the number of points in the onomatopoeic meaning space is small ($W = 87$), a bootstrapping approach is used to estimate the BIC measure. The resulting variation of the BIC as a function of $k$ is shown in figure 7.4. The maximum value of $\widehat{\mathrm{BIC}}$ was obtained for $k = 19$. Table 7.5 illustrates a few of these automatically derived clusters.

Figure 7.4: Variation of $\widehat{\mathrm{BIC}}(M_k)$ as a function of $k$, the number of word clusters.

Since word clusters can be formed, it can be inferred that: (1) words within clusters have overlapping meaning, (2) words in different clusters are sufficiently distinct, and (3) the proposed metric sufficiently discerns the words by their meaning. In essence, both general and specific perceived audio properties can be described using onomatopoeia words. For this work, however, the clusters are used to combine words with overlapping meanings so as to have an adequate number of samples in each category. The labels that were combined are shown in table 7.6. This results in a manageable number of onomatopoeia categories with enough audio clips in each to obtain reliable retrieval rate estimates. Next, the signal features extracted for example-based indexing are presented.

Example   Words in cluster
1         Clang, Clank, Ding, Dong, Ting
2         Beep, Bleep, Toot
3         Creak, Squeak, Screech, Yawp, Yowl
4         Cluck, Cuckoo, Hoot, Tweet
5         Buzz, Fizz, Sizzle, Hiss, Whiz, Wheeze, Whoosh, Zip
6         Thump, Thwack, Wham
7         Burr, Crunch, Scrunch
8         Rip, Zing, Zoom
9         Clatter, Blah, Gabble, Yak
10        Meow, Moo, Yip

Table 7.5: Examples of automatically derived word clusters.
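The clustering step can be sketched roughly as follows, using scikit-learn's PCA and the Gaussian-mixture BIC as a stand-in for the closed-form k-means BIC of [14] and the bootstrapping used in this chapter; D is a $W \times W$ distance matrix such as the one defined above, and here it is a random placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def cluster_onomatopoeia(D, d=8, k_range=range(2, 26), seed=0):
    """Rows of the W x W distance matrix D are word vectors; reduce to d dimensions
    with PCA and pick the number of clusters k by the (Gaussian-mixture) BIC."""
    X = PCA(n_components=d).fit_transform(D)
    best_k, best_bic, best_labels = None, np.inf, None
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        bic = gmm.bic(X)                      # note: sklearn's GMM BIC is minimized
        if bic < best_bic:
            best_k, best_bic, best_labels = k, bic, gmm.predict(X)
    return best_k, best_labels

# usage with a random symmetric stand-in for the 87 x 87 word-distance matrix
W = 87
A = np.random.rand(W, W)
D = (A + A.T) / 2
np.fill_diagonal(D, 0.0)
k, labels = cluster_onomatopoeia(D)
print(k, labels[:10])
```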
Combined under:   Initially labeled with:
BUZZ              Hum, Burr, Fizz, Hiss, Boo
CLATTER           Rattle
CROW              Cuckoo
CRUNCH            Crash
DONG              Twang, Ring, Clang
GABBLE            Coo
GROWL             Roar, Purr
MEOW              Moo
SQUEAK            Creak
TAP               Clap

Table 7.6: List of combined onomatopoeia categories.

7.4.3 Features

For retrieval-by-example, an audio clip is represented in the latent perceptual space using signal-level features. In this work, a fourteen-dimensional feature vector is extracted. Twelve dimensions comprise the perceptually motivated Mel-frequency cepstral coefficients (MFCCs), and the remaining two are the spectral centroid (SC) and spectral roll-off frequency (SRF) measures. The SC and SRF measures were normalized to half the sampling frequency of the audio clips. These features were extracted by windowing the audio signal with a twenty millisecond Hamming window every ten milliseconds. The details of these measures are as follows.

7.4.3.1 Mel-frequency Cepstral Coefficients (MFCC)

These coefficients are the result of a cepstral analysis of the magnitude square of the spectrum on the Mel-frequency scale. They were originally proposed in [18] for speech recognition, but are also widely used for audio classification tasks [35]. The perceptual aspect of this feature lies in the Mel-frequency scale (a logarithmic axis) that approximately models the filter-bank analysis in the human ear. The steps for calculating MFCCs for a given frame of audio are illustrated in [35].

7.4.3.2 Spectral Centroid (SC)

This is the weighted mean frequency of a given frame of audio. The weights are the magnitudes of the corresponding frequency points of a short-time discrete Fourier transform. Here an 8192-point fast Fourier transform (FFT) was used to calculate the spectral centroid. This feature is strongly correlated with the zero-crossing rate (ZCR) [47], and it is useful for distinguishing sources with more energy skewed towards the higher frequencies. Mathematically, the SC of the $i$th frame is

$$\mathrm{SC}_i = \frac{\sum_{k=0}^{N/2} k \cdot |F(k)|^2}{\sum_{k=0}^{N/2} |F(k)|^2} \qquad (7.12)$$

where $|F(k)|$ is the magnitude of the $k$th point of an $N$-point discrete Fourier transform (DFT). Indirectly, SC is a measure of the brightness of the underlying audio signal. Acoustic sources with content in the higher frequencies are perceived to be brighter (a perceptual term describing properties similar to turning down the bass knob and turning up the treble setting of a radio).

7.4.3.3 Spectral roll-off frequency (SRF)

It is defined as

$$\mathrm{SRF}_i = k, \ \text{such that} \ \sum_{j=0}^{k} |F(j)| < P \cdot \sum_{j=0}^{N/2} |F(j)| \qquad (7.13)$$

Its usefulness has been established in speech/music discrimination [53] and in the general audio classification task of speech versus non-speech classification [35]. $P$ is a threshold expressed as a fraction, usually just under 1.0. In the current work, a value of 0.9 was chosen.
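The spectral centroid and roll-off measures of equations (7.12) and (7.13) can be computed per frame as sketched below; the frame length, 8192-point FFT and roll-off threshold P = 0.9 follow the text, the roll-off index is taken as the smallest bin reaching the fraction P (one common reading of equation (7.13)), and the MFCC part is left to a standard library and omitted here.

```python
import numpy as np

def spectral_centroid_rolloff(frame, n_fft=8192, P=0.9):
    """Return (SC, SRF) for one Hamming-windowed frame, both normalized to half
    the sampling rate (i.e. expressed as a fraction of the Nyquist frequency)."""
    F = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=n_fft))   # |F(k)|, k = 0..n_fft/2
    k = np.arange(len(F))
    sc = np.sum(k * F**2) / (np.sum(F**2) + 1e-12)                     # eq. (7.12), in FFT bins
    cum = np.cumsum(F)
    srf = np.searchsorted(cum, P * cum[-1])                            # eq. (7.13): bin reaching P of total
    nyquist_bin = len(F) - 1
    return sc / nyquist_bin, srf / nyquist_bin

# usage: a 20 ms frame at 16 kHz (320 samples) of noise as a stand-in for audio
frame = np.random.randn(320)
print(spectral_centroid_rolloff(frame))
```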
7.5 Experiments and Results
As mentioned previously, 2,491 clips were selected from the BBC Sound Effects Library. For the experiments, the clips were down-sampled to 16.0 kHz and converted to a single-channel uncompressed format (from 44.1 kHz stereo clips). The fourteen-dimensional features extracted from the clips were used for example-based querying after conversion to a vector representation in the latent perceptual space.

Both query-by-example and query-by-text procedures were evaluated for this system. While category-wise retrieval performance was evaluated for both retrieval procedures, an additional subjective evaluation was performed for query-by-example. This is described first.

7.5.0.4 Subjective Evaluation
Following the query representation procedure, a selection of 100 clips is first represented in the LPS. Then, using the similarity measure, a list of the 5 closest matching clips is obtained for each query clip. Examples of query clips (text caption only) and the 5 best matching retrieved clips are presented in Table 7.7.

Table 7.7: Query examples and corresponding 5 best matching retrieved clips.

Query clip 1: AUTO RACE PIT AMBIENCE BEFORE RACE
  1. SUBWAY EXTERIOR PULL INTO STATION STOP EXIT STATION TRAIN
  2. ORCHESTRA WARMING UP IN CONCERT HALL
  3. AUTO RACE INDY TIER AMBIENCE OVERALL PERSPECTIVE
  4. BAR PUB SMALL CROWD AMBIENCE
  5. RESTAURANT LARGE CROWD

Query clip 2: CHAIR WOOD SIT DOWN IN WOODEN CHAIR
  1. FOOTSTEPS METAL MALE LEATHER SOLE WALK
  2. CHAIR LAWN SIT DOWN IN LAWN CHAIR
  3. BRIEFCASE UNLOCK CLASP OFFICE
  4. CHAIR KITCHEN SIT DOWN IN KITCHEN CHAIR
  5. BOTTLE SOFT DRINK REMOVE SCREW LID SODA

Query clip 3: SIREN SIREN POLICE AMBULANCE FIRE TRUCK
  1. SIREN CONSTANT YELPING FOR EMERGENCY VEHICLE
  2. SIREN FIVE SIRENS SIMULTANEOUSLY WAILS AND YELPS
  3. SIREN WAIL SIREN POLICE AMBULANCE FIRE TRUCK
  4. SIREN CONSTANT WAILS AND YELPS
  5. FOREST VENEZUELA VENEZUELA DAYTIME AMBIENCE BIRDS

Query clip 4: TWO SWORDS CLANKING TOGETHER SINGLE HIT
  1. TWO SWORDS CLANKING TOGETHER SINGLE HIT
  2. SWORD SLIDE INTO SHEATH
  3. SWORD REMOVE FROM SHEATH
  4. CHAIN DROP SMALL CHAIN DROP TO WOOD SURFACE
  5. SWORD TWO SWORDS SCRAPING

Seven subjects evaluated the retrieval system. The 100 query clips and the ordered list of the top 5 best matching clips were presented to them one by one through a web-page interface. For each query, the subjects were instructed to select the retrieved clips that sounded similar to it. Alternatively, they could choose a "None of them similar" option if they determined that none of the retrieved clips sounded similar to the query. The results are shown in Figure 7.5.

Figure 7.5: Subjective evaluation: probability of retrieving >= C relevant clips. The dotted line represents the probability of retrieving 0 relevant clips.

The framework presented here was evaluated by ten-fold cross-validation. The whole data set was randomly split into ten sets. One of the ten sets was randomly chosen as the test set, and the remaining nine sets (without replacement) constituted the train set. This was repeated ten times for both text-based and example-based retrieval, and the average is taken as the final result. In each case, the train set constitutes the library, or indexed back-end, of the retrieval system, and the test set provides the queries. The average precision and recall rates were calculated using

Precision = \frac{R_{correct}}{R_{retrieved}} \times 100    (7.14)

Recall = \frac{R_{correct}}{R_{category}} \times 100    (7.15)

where R_{retrieved} is the number of clips retrieved for a query, R_{correct} is the number of correctly retrieved clips for that query, and R_{category} is the number of clips in the database that belong to the same category as the query. The recall percentage is varied from 0.0 to 100.0 by considering more and more clips from the ordered list of retrieved clips. The average precision percentage is calculated by averaging the precision values at every instance of a correctly retrieved clip in the list. The average precision values were then averaged over the cross-validation folds.
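To make the scoring concrete, the following sketch computes the average precision for a single query from its ranked retrieval list (an illustrative helper, not the exact evaluation code; relevance here is simply a category-label match, as in the experiments).

```python
# Hedged sketch of the per-query scoring implied by Eqs. (7.14)-(7.15).
import numpy as np

def average_precision(ranked_labels, query_label, n_in_category):
    """ranked_labels: category labels of retrieved clips, best match first."""
    precisions = []
    n_correct = 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            n_correct += 1
            precisions.append(100.0 * n_correct / rank)  # Eq. (7.14) at this hit
        if n_correct == n_in_category:                   # recall has reached 100%
            break
    return float(np.mean(precisions)) if precisions else 0.0

# Example: query of category 'SIREN', 4 of the 5 retrieved clips relevant -> 88.75
print(average_precision(['SIREN', 'SIREN', 'FOREST', 'SIREN', 'SIREN'], 'SIREN', 4))
```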
It is important to note that although the framework allows text descriptions and audio clips to be retrieved for an unknown audio clip or text query, it is difficult to evaluate performance on that basis. The difficulty stems from the lack of a valid definition of a "correctly" retrieved clip. To limit the scope of this work, and to allow comparison with other systems, we evaluate the performance of the system using only the semantic and onomatopoeic labels assigned to the clips. Four experiments are performed in all:
1. Text-based retrieval using LSI and semantic categories.
2. Text-based retrieval using LSI and onomatopoeic categories.
3. Example-based retrieval using LPI and semantic categories.
4. Example-based retrieval using LPI and onomatopoeic categories.

For both the semantic and the onomatopoeic categorization, the baseline precision and recall rates are calculated by randomly retrieving clips (without replacement) from the database for a given query. This represents the chance-level performance that would be obtained by randomly selecting the retrieved clips instead of using the retrieval method presented here. The performance of the system is better if its retrieval rates exceed this baseline. In the example-based retrieval case, the improvement over the baseline is also reported.

7.5.1 Text-based query and retrieval using Latent Semantic Indexing (LSI)
For text-based retrieval, a vocabulary W of M = 2477 words was obtained from the whole collection of captions available in the database. The list was obtained by retaining only significant words (nouns, verbs, adjectives, etc.); words such as articles and numbers were not considered. The audio captions of the clips in the test set were used as queries. The representation of a query in the semantic space is the same as described in section 7.3.6, and the clips are retrieved using the similarity measure in section 7.3.2, except that in this case the matrices from section 7.3.3 are used. To identify a correctly retrieved clip, the corresponding category label was used. The average precision for varying recall percentages is shown in Figure 7.6. Note that the size of W and the choice of words will affect retrieval performance. However, since each caption (as an individual text document) is only a few words long (approximately 3-10 words), W is fixed. This eliminates the possibility of all-zero rows in matrix B in section 7.3.5.

Figure 7.6: Precision-recall rates for text-based retrieval, using semantic and onomatopoeic labels.

From the precision and recall rates, it can be seen that retrieval performance by semantic category is superior to the performance using onomatopoeia categories. This is also illustrated in the nearest neighbor (NN) classification confusion matrices presented in Figures 7.7 and 7.8. While the classification performance is poorer for the onomatopoeia categories, it still empirically follows perceptual similarities: for example, the category Beep is most confused with Hum and Dong, Burr with Hum and Buzz, Squeak with Tweet, Bang with Clatter, and Thud with Tap. These observations support the fact that the high-level text descriptions of audio clips predict semantic categories well.
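For reference, the text-based LSI pipeline of this section can be summarized by the sketch below. The conventions are assumptions for illustration (raw word counts, a classic SVD fold-in for queries, and cosine similarity); the actual weighting and similarity measure are the ones defined in sections 7.3.2-7.3.6.

```python
# Hypothetical LSI retrieval sketch; matrix shapes and the fold-in convention
# are illustrative, not the exact formulation used in the system.
import numpy as np

def build_lsi(term_doc, rank):
    """term_doc: (W, D) word-by-caption count matrix (W words, D captions)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U, s = U[:, :rank], s[:rank]
    docs = Vt[:rank].T                      # (D, rank) caption coordinates
    return U, s, docs

def retrieve(query_counts, U, s, docs, top_n=5):
    """query_counts: (W,) word counts of the query caption."""
    q = (query_counts @ U) / s              # classic fold-in: q^T U S^-1
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:top_n]        # indices of the closest captions
```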
Retrieval performance by onomatopoeia category is lower than that for the semantic categorization. This is expected, as the text descriptions do not explicitly express the perceptual content. However, the performance is well above the chance baseline, indicating that the semantic categories of the events also relate to perceptual similarities, thereby strengthening the initial premise that the two are indeed related.

Figure 7.7: Nearest Neighbor confusion matrix for semantic categories for retrieval by latent semantic indexing.

Figure 7.8: Nearest Neighbor confusion matrix for onomatopoeia categories for retrieval by latent semantic indexing.

7.5.2 Example-based query and retrieval using Latent Perceptual Indexing (LPI)
For example-based retrieval, the vocabulary V of N centers is derived only from the train set, and the example audio clip query from the test set is indexed in the latent perceptual space using the steps discussed in section 7.3.5. Again, using the similarity measure, an ordered list of clips is retrieved from the database (the train set). Since V is obtained by unsupervised clustering of the dataset into N reference clusters, retrieval can be performed for different values of N. By increasing N, the dataset is resolved into a larger number of cluster units; it is therefore possible to perform example-based retrieval from a coarse scale (small N) to a finer scale (large N). In this case, N is varied from 100 to 8000 (N ∈ {100, 200, 500, 1000, 2000, 4000, 8000}). N depends on the dataset and the domains it covers. In the present case, since the dataset is a large, generic sound effects database, a reasonably large N is expected to perform best. The effect of varying N on retrieval performance is measured as the average relative improvement in precision, as calculated by equation (7.14), over all recall values. Mathematically, the average improvement is given by

\text{Avg. (\%) Improvement} = \frac{1}{|R|} \sum_{r \in R} \frac{\text{Precision}(r) - \text{Baseline}(r)}{\text{Baseline}(r)} \times 100

where R is the set of recall values, and Precision(r) and Baseline(r) are the precision values of the present method and of the chance-level baseline at recall value r. The performance improvement as a function of N for the semantic and onomatopoeic category labels is shown in Figure 7.9.

Figure 7.9: Average relative percentage improvement as a function of vocabulary size N for example-based retrieval.
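Before turning to the results, the LPI front end and the improvement measure can be sketched as below. The details are assumptions for illustration (plain k-means for the codebook, raw frame counts per unit); the actual indexing follows section 7.3.5 and reuses the same SVD machinery shown earlier for the captions.

```python
# Hypothetical sketch of the latent perceptual indexing inputs; names are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def build_unit_document_matrix(train_frames, clip_frames, n_units=4000, seed=0):
    """train_frames: (T, 14) pooled frame-level features from the train set;
    clip_frames: list of (T_i, 14) arrays, one per clip to be indexed."""
    codebook = KMeans(n_clusters=n_units, random_state=seed).fit(train_frames)
    counts = np.zeros((n_units, len(clip_frames)))
    for j, frames in enumerate(clip_frames):
        units = codebook.predict(frames)                      # nearest centre per frame
        counts[:, j] = np.bincount(units, minlength=n_units)  # unit-document co-occurrence
    return codebook, counts                                   # counts then feeds the SVD step

def avg_relative_improvement(precision, baseline):
    """Mean over recall values r of (Precision(r) - Baseline(r)) / Baseline(r) * 100."""
    precision, baseline = np.asarray(precision, float), np.asarray(baseline, float)
    return float(np.mean((precision - baseline) / baseline) * 100.0)
```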
From Figure 7.9 it can be seen that N = 4000 gives the best retrieval result, and performance begins to degrade for higher values. The precision-recall curves for this setting are shown in Figure 7.10.

Figure 7.10: Precision-recall rates for example-based retrieval, using semantic and onomatopoeic labels.

It can be seen that LPI performs equally well for semantic and onomatopoeia categorization. This result is again due to the fact that semantic categories in the data set are also formed through the perceptual similarities the clips share; in many cases, clips belonging to the same onomatopoeia category also have similar or identical text captions. Also, many audio clips are complex auditory scenes with multiple sources, and this results in a query being spread out in the acoustic space. This causes the query (and the indexed clips) to be less precise and thus affects the performance of the system. The confusion matrices of NN classification for the onomatopoeia and semantic categories are shown in Figures 7.11 and 7.12, respectively. For the onomatopoeia case, the confusions in identifying a query clip are perceptually meaningful: for example, Thud is most confused with Tap, Crackle with Burr, and Beep with Dong and Tweet. For the semantic categories, although the system performs reasonably well in identifying the correct category for a given query, it is difficult, in contrast to the onomatopoeia categories, to deduce similarities between the most confused categories.

Figure 7.11: Nearest Neighbor classification confusion matrix by LPI and onomatopoeia categories.

Figure 7.12: Nearest Neighbor classification confusion matrix by LPI and semantic categories.

In summary, the following can be observed from the precision-recall rates:
1. Retrieval by semantic categories with LSI performs best.
2. Retrieval by onomatopoeic categories with LSI performs relatively worse.
3. Retrieval with LPI is comparable for semantic and onomatopoeic categories, and both are in turn comparable to LSI with onomatopoeic categories.

These results support the initial statements in section 7.2.1 that motivated the development of a fully duplex retrieval system. The final conclusions and a detailed discussion of the results are presented next.

7.6 Conclusion and Discussion
In this work, we present a data-driven framework to perform audio retrieval with text-based and example-based queries. The approach is based on a unit-document co-occurrence measure that extracts latent information from a given set of audio clips. This method is applied to index both the text captions and the signal features extracted from each audio clip.

The text captions describing the acoustic event are used to represent audio clips in a continuous space using latent semantic analysis, a method that has been used extensively in text document indexing and classification. First, a unit-document co-occurrence is measured as the number of times each word in a vocabulary (the units) occurs in each text caption (the documents).
By singular value decomposition (SVD) of this matrix, a representation of the documents in a latent space is derived from its singular vectors. Using a vector similarity measure, a list of semantically similar clips is retrieved for a given text query.

One of the main contributions of this work is applying this unit-document analysis approach directly to the audio clips, using the signal features extracted from each clip. Although an audio signal is continuous, discrete units are discovered by quantizing each frame, or a set of frames, to the nearest feature vector in a vocabulary. The vocabulary is derived from the centers of the clusters formed by unsupervised clustering of the whole data set. Since these centers act as distinct units in the perceptual space, an audio clip can be modeled as a sequence of these discrete units. The fundamental difference from text indexing is that here the units are discovered in an unsupervised manner for a given data set. By counting the number of feature vectors extracted from an audio clip that are quantized into these units, a unit-document co-occurrence is established. By singular value decomposition a reduced representation is obtained, and by a vector similarity measure, audio clips perceptually similar to the example query clip are retrieved. This is the latent perceptual indexing approach presented here.

The other contribution of this work is evaluating the performance of the system in a perceptually meaningful way, using onomatopoeia words as category labels in addition to semantic categories of audio. Onomatopoeia words describe acoustic properties of sound sources; the motivation for using them is to allow audio clips to be indexed in a perceptually meaningful way. This makes the retrieval system fully duplex, in that a user can query at varying levels of description. In this work, a partially supervised approach was presented to label each audio clip with the onomatopoeia word that best represents its perceptual properties. Both text-based LSI and signal-based LPI are evaluated using the two categorization methods. It is important to note that there are many ways to evaluate the system, and some call for subjective evaluation. Subjective evaluation can be highly variable; consequently, it inherently lacks a precise definition of "correctly retrieved" clips. Therefore, to limit the scope of the performance evaluation, only the high-level semantic tags and the perceptual onomatopoeia labels of the retrieved clips are considered. If the category of the retrieved clip matches the category of the query clip, then the retrieval is assumed to be correct. The obtained results indicate that LPI performs equally well for both semantic and perceptual categories, and that its performance is comparable to that of text-based LSI with perceptual categories. LSI performs best for semantic categories. These results allude to the statements made initially that text descriptions, semantic categories, and perceptual categories are highly inter-connected.

There are two important points pertaining to the result that LPI works equally well for semantic and onomatopoeia categories. First, the audio clips used in these experiments are not "pure". Many clips were recorded outside a studio environment, where there is little control over the interfering sources present in them. For example, clips recorded in a restaurant environment contain the clattering of dishes and also include background conversation or distant vehicle sounds.
This results in queries being spread out in the perceptual space, making them less precise or specific. This is supported by the observation that a few categories in the confusion matrix shown in Figure 7.11 are most confused with Clatter; Clatter has been used for a variety of real sounds, from a train in an underground station, to the movement of cattle on wooden planks, to the sound of a noisy bicycle. Second, except for some clips, the semantic categories created here are closely related to their perceptual similarity. In some cases, there are multiple instances of similar acoustic sources that also fall under the same semantic categories. This reason also explains why text-based retrieval performance using LSI for onomatopoeia categories is comparable to the performance of example-based retrieval using LPI: many audio clips that fall under one onomatopoeic category also share similar text captions. The slightly better performance of text-based retrieval using LSI relates to the precision of the query; as discussed earlier, unlike example-based queries, text queries are highly specific.

The results show that the framework presented here can handle a variety of queries. The work and experiments do not assume any domain-specific properties of audio, and the framework can therefore be applied to a variety of audio and music information processing problems that involve segmentation, browsing, clustering, and classification. In the proposed framework, SVD is the most computationally expensive step. However, as shown here, it needs to be performed only once, offline. Therefore, the framework can be used for small datasets and can be easily scaled to very large online multimedia databases. The main limitation of the approach is that it does not consider temporal information for indexing. Time information can be brought in by splitting a clip into smaller segments or through feature stacking. Along with this, other signal representation techniques and features can also be explored. Another avenue is the use of approximate nearest neighbor algorithms [2, 3] and a probabilistic approach [27] for latent perceptual indexing of audio clips. These remain part of our ongoing and future work.

Chapter 8: Conclusion

The work presented in this thesis explores audio information retrieval problems using varying levels of signal- and language-level descriptions. In each case, the contributions with respect to existing work, possible limitations, and extensions have been discussed. Two main ideas have been explored in this dissertation: organizing audio based on its word-level descriptions, and using perception-based representations of audio. The audio medium is inherently rich and conveys varying levels of information depending on the time-scale and the description; developing indexing frameworks to handle this leads us toward the ultimate goal of a full-duplex indexing system.

In Chapter 4, a systematic approach to categorizing audio by its onomatopoeic description is presented. In this approach, we represent descriptions of audio clips with onomatopoeia words and cluster them according to their vector representation in the linguistic (lexical) meaning space. Onomatopoeia words are imitative of sounds and provide a means to represent perceived audio characteristics with language-level units. This form of representation essentially bridges the gap between signal-level acoustic properties and higher-level audio class labels.
First, using the proposed distance/similarity metric, we establish a vector representation of the words in a "meaning space". We then provide onomatopoeic descriptions (the onomatopoeia words that best describe the sound in an audio clip) by manually tagging the clips with relevant words. The audio clips are then represented in the meaning space as the sum of the vectors of their corresponding onomatopoeia words. Using the unsupervised k-means clustering algorithm and the Bayesian information criterion (BIC), we cluster the clips into meaningful groups. The clustering results presented in this work indicate that the clips within each cluster are well represented by their onomatopoeic descriptions. These descriptions effectively capture the relationship between the audio clips based on their acoustic properties.

Clustering of acoustic feature vectors with and without word-level information was then analyzed. Word-level clustering was done using onomatopoeia words as a means to represent perceived audio signal characteristics; these words can be used as a meta-level representation between acoustic features and language-level descriptions of audio. This clustering was compared with "raw grouping": clustering feature vectors without using information from word-level grouping. The comparison was performed in terms of the classification accuracy of a GMAP classifier after MDA. The results indicate that certain word clusters are more separable than others. This may be, in part, because the acoustic features used in this work are only able to represent certain onomatopoeic descriptions. It may also be due to inconsistencies in the understanding and usage of onomatopoeia words. Also, some words (such as crackle) represent long-term temporal properties that are not well captured by frame-level analysis.

Another interpretation of the results is that raw clustering partitions the feature vectors into regions of contiguous volume in the feature space, whereas clustering with onomatopoeia grouping information may result in fragmented partitioning, where the feature vectors of a cluster are present in different regions of the feature space. This essentially brings out the differences between signal-level measures and linguistic-level descriptions, and it calls for signal measures that are representative of linguistic-level descriptions.

In Chapter 5, ideas from onomatopoeia-based descriptions of audio are adopted to develop a time-varying quantitative representation of audio, based mainly on how onomatopoeic words cluster under a word-based similarity measure. Serendipitously, an implementation of this idea leads to an alternative approach to acoustic structure analysis that is coarser than the signal-level measures derived from acoustic features. In this work, an attribute-based approach to quantitatively measuring changes in an audio scene is presented and applied to segment popular music tracks into sections with and without vocals. The idea is motivated by the fact that the human auditory system can instantly identify changes in a scene by tracking changes in the interaction of the different acoustic sources. The work presents the activity rate (AR) as a metric to quantitatively measure the interaction of different sources. This measure does not identify individual sources, but measures the activity of different semantically related attributes of sources over time.
These attributes are based on how the sources are perceived, and not necessarily on similarities in signal properties. We then use these attributes for categorical classification; specifically, we consider segmenting music into vocal and non-vocal sections. The clips are segmented without assuming that the vocal and non-vocal sections of the audio are non-overlapping in time. As mentioned previously, the framework presented here is not limited to building a binary speech/music-discriminator-type system. It provides a way to analyze the underlying acoustic structure in a given audio clip, and also a way to annotate and highlight relevant sections. This is very useful for audio summarization and thumbnailing applications. The application of this system is not limited to segmentation; as shown, it has also been applied to a music genre classification task.

The main conclusion from our genre classification results is that it is indeed possible to recognize genre using the acoustic structure information represented by the event activity rate measure. This is feasible because of the distinct content (in terms of musical instruments and vocals) and inter-genre styles, which can be captured without actually identifying the individual sources; it is sufficient to have an aggregate measure of the various attributes. An interesting aspect of the experiments is that the activity rate measure derived in this work is based on a generic sound effects database that is different from the database used for training the genre classifiers.

As another example, the activity rate (AR) signals can be used to organize a large audio database. A user can query such a database using the small-dimensional AR signals, and the relevant audio clips can be returned to the user. Of course, one would require a larger-dimensional (N > 3) description for this application, and the categories would need to be appropriately chosen. Splitting the large noise-like category into machine noise (e.g., engine noise, vacuum cleaner, hair dryer) and non-machine noise (e.g., seashore, breathing, heavy rain, clothes rustling), or adding categories such as impulsive sounds (e.g., explosions, gunfire, knocking, clock ticks), are some ways to increase the dimensionality of the descriptions. Our future work involves further investigating this categorization of acoustic sources through perceptual/language descriptions similar to the ideas presented here. The main principle that sets the stage for this work is that humans can describe auditory scenes using language, which is a representation of the semantic information captured from an audio clip. Analysis of more complex and rich scenes with a large number of acoustic sources can potentially be implemented by increasing the number of audio descriptors and seeking quantitative measures, such as the activity rate, to adequately characterize them.

A further generalization of the above attributes is to increase their number. In Chapter 6, further resolving the noise-like categories into machine-generated and naturally occurring noise has been explored. To achieve this goal, a new cortical representation and dimension reduction technique has been explored.
Results indicated that while the cortical representation of the audio signal suffers from high dimensionality, its performance is superior for the classification task. In Chapter 7, a further generalization of the attribute-based approach and the onomatopoeia-based description is implemented. In this work, we present a data-driven framework to perform audio retrieval with text-based and example-based queries. The approach is based on a unit-document co-occurrence measure that extracts latent information from a given set of audio clips. This method is applied to index both the text captions and the signal features extracted from each audio clip.

The text captions describing the acoustic event are used to represent audio clips in a continuous space using latent semantic analysis, a method that has been used extensively in text document indexing and classification. First, a unit-document co-occurrence is measured as the number of times each word in a vocabulary (the units) occurs in each text caption (the documents). By singular value decomposition (SVD) of this matrix, a representation of the documents in a latent space is derived from its singular vectors. Using a vector similarity measure, a list of semantically similar clips is retrieved for a given text query.

One of the main contributions of this work is applying this unit-document analysis approach directly to the audio clips, using the signal features extracted from each clip. Although an audio signal is continuous, discrete units are discovered by quantizing each frame, or a set of frames, to the nearest feature vector in a vocabulary. The vocabulary is derived from the centers of the clusters formed by unsupervised clustering of the whole data set. Since these centers act as distinct units in the perceptual space, an audio clip can be modeled as a sequence of these discrete units. The fundamental difference from text indexing is that here the units are discovered in an unsupervised manner for a given data set. By counting the number of feature vectors extracted from an audio clip that are quantized into these units, a unit-document co-occurrence is established. By singular value decomposition a reduced representation is obtained, and by a vector similarity measure, audio clips perceptually similar to the example query clip are retrieved. This is the latent perceptual indexing approach presented here.

The other contribution of this work is evaluating the performance of the system in a perceptually meaningful way, using onomatopoeia words as category labels in addition to semantic categories of audio. Onomatopoeia words describe acoustic properties of sound sources; the motivation for using them is to allow audio clips to be indexed in a perceptually meaningful way. This makes the retrieval system fully duplex, in that a user can query at varying levels of description. In this work, a partially supervised approach was presented to label each audio clip with the onomatopoeia word that best represents its perceptual properties. Both text-based LSI and signal-based LPI are evaluated using the two categorization methods. It is important to note that there are many ways to evaluate the system, and some call for subjective evaluation. Subjective evaluation can be highly variable; consequently, it inherently lacks a precise definition of "correctly retrieved" clips. Therefore, to limit the scope of the performance evaluation, only the high-level semantic tags and the perceptual onomatopoeia labels of the retrieved clips are considered.
If the category of the retrieved clip matches the category of the query clip, then the retrieval is assumed to be correct. The obtained results indicate that LPI performs equally well for both semantic and perceptual categories, and that its performance is comparable to that of text-based LSI with perceptual categories. LSI performs best for semantic categories. These results allude to the statements made initially that text descriptions, semantic categories, and perceptual categories are highly inter-connected.

There are two important points pertaining to the result that LPI works equally well for semantic and onomatopoeia categories. First, the audio clips used in these experiments are not "pure". Many clips were recorded outside a studio environment, where there is little control over the interfering sources present in them. For example, clips recorded in a restaurant environment contain the clattering of dishes and also include background conversation or distant vehicle sounds. This results in queries being spread out in the perceptual space, making them less precise or specific. This is supported by the observation that a few categories in the confusion matrix shown in Figure 7.11 are most confused with Clatter; Clatter has been used for a variety of real sounds, from a train in an underground station, to the movement of cattle on wooden planks, to the sound of a noisy bicycle. Second, except for some clips, the semantic categories created here are closely related to their perceptual similarity. In some cases, there are multiple instances of similar acoustic sources that also fall under the same semantic categories. This reason also explains why text-based retrieval performance using LSI for onomatopoeia categories is comparable to the performance of example-based retrieval using LPI: many audio clips that fall under one onomatopoeic category also share similar text captions. The slightly better performance of text-based retrieval using LSI relates to the precision of the query; as discussed earlier, unlike example-based queries, text queries are highly specific.

The results show that the framework presented here can handle a variety of queries. The work and experiments do not assume any domain-specific properties of audio, and the framework can therefore be applied to a variety of audio and music information processing problems that involve segmentation, browsing, clustering, and classification. In the proposed framework, SVD is the most computationally expensive step. However, as shown here, it needs to be performed only once, offline. Therefore, the framework can be used for small datasets and can be easily scaled to very large online multimedia databases. The main limitation of the approach is that it does not consider temporal information for indexing. Time information can be brought in by splitting a clip into smaller segments or through feature stacking. Along with this, other signal representation techniques and features can also be explored. Another avenue is the use of approximate nearest neighbor algorithms [2, 3] and a probabilistic approach [27] for latent perceptual indexing of audio clips. These are, however, extensions of the basic framework presented here.

As mentioned earlier, the approach adopted in this dissertation presents an alternative method for audio processing that moves away from direct identification of acoustic sources and their corresponding labels.
Instead, the framework presents ideas to represent and process general, unstructured audio without explicitly identifying distinct acoustic source-events.

Bibliography

[1] The BBC Sound Effects Library - Original Series. http://www.sound-ideas.com, May 2006.
[2] A. Andoni and P. Indyk. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 459-468, 2006.
[3] A. Arya, D. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. SODA '94: Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 573-582, 1994.
[4] B. W. Bader and T. G. Kolda. Algorithm 8xx: MATLAB Tensor Classes for Fast Algorithm Prototyping. ACM Transactions on Mathematical Software, 32(4), December 2006.
[5] L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet. Audio Information Retrieval using Semantic Similarity. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii, USA, 2:II-725-II-728, 2007.
[6] M. A. Bartsch and G. H. Wakefield. To catch a chorus: Using Chroma-based Representations for Audio Thumbnailing. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '01), pages 25-18, October 2001.
[7] J. R. Bellegarda. Latent Semantic Mapping: A Data-driven Framework for Modeling Global Relationships Implicit in Large Volumes of Data. IEEE Signal Processing Magazine, 22:70-80, September 2005.
[8] H. Bredin. Onomatopoeia as a Figure and a Linguistic Principle. New Literary History, 27(3):555-569, 1996.
[9] A. Bugatti, A. Flammini, and P. Migliorati. Audio Classification in Speech and Music: A Comparison between a Statistical and a Neural Approach. EURASIP Journal on Applied Signal Processing, 2002:372-378, January 2002.
[10] B. Whitman. Semantic Rank Reduction of Music Audio. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA-03), Mohonk, New York, USA, October 2003.
[11] R. Cai, L. Lu, H. J. Zhang, and L. H. Cai. Highlight Sound Effects Detection in Audio Stream. In Proceedings of the International Conference on Multimedia and Expo (ICME) 2003, 3:37-40, July 2003.
[12] P. Cano, M. Koppenberger, S. Le Groux, J. Ricard, P. Herrera, and N. Wack. Nearest-Neighbor Generic Sound Classification with a WordNet-based Taxonomy. In Proceedings of the 116th Audio Engineering Society (AES) Convention, Berlin, Germany, 2004.
[13] W. Chai and B. Vercoe. Structural analysis of musical signals for indexing and thumbnailing. Joint Conference on Digital Libraries 2003, pages 27-34, May 2003.
[14] S. S. Chen and P. S. Gopalakrishnan. Clustering via the Bayesian Information Criterion with Applications in Speech Recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seattle, USA, 2:12-15, May 1998.
[15] S. Chu, S. Narayanan, C. C. Kuo, and M. J. Mataric. Where am I? Scene Recognition for Mobile Robots using Audio Features. In Proceedings of the International Conference on Multimedia and Expo (ICME) 2006, July 2006.
[16] M. Cooper and J. Foote. Summarizing Popular Music via Structural Similarity Analysis. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2003 (WASPAA '03), October 2003.
[17] C. Tzagkarakis, A. Mouchtaris, and P. Tsakalides. Musical Genre Classification via Generalized Gaussian and Alpha-Stable Modeling. In Proc.
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, 5, 2006.
[18] S. Davis and P. Mermelstein. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357-366, 1980.
[19] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 6(41):391-407, 1990.
[20] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, October 2000.
[21] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi. Audio-Based Context Recognition. IEEE Transactions on Audio, Speech and Language Processing, 14(1):321-329, January 2006.
[22] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[23] G. Guo and S. Z. Li. Content-Based Audio Classification and Retrieval by Support Vector Machines. IEEE Transactions on Neural Networks, 14(1):209-215, January 2003.
[24] O. Gillet and G. Richard. Drum Loops Retrieval from Spoken Queries. Journal of Intelligent Information Systems, 24(2):159-177, 2005.
[25] B. Gold and N. Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley, ISBN 0-471-35154-7, August 1999.
[26] M. Helen and T. Virtanen. Query by Example of Audio Signals using Euclidean Distance Between Gaussian Mixture Models. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, Hawaii, USA, 1:225-228, April 2007.
[27] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57, 1999.
[28] H. Jiang, T. Lin, and H. J. Jiang. Video Segmentation with the Support of Audio Segmentation and Classification. International Conference on Multimedia and Expo (ICME), July 2000.
[29] J. Laroche. Estimating tempo, swing and beat locations in audio recordings. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '01), 2001.
[30] J. R. Bellegarda. Latent Semantic Mapping. IEEE Signal Processing Magazine, 22:70-80, September 2005.
[31] J. Zibert, N. Pavesic, and F. Mihelic. Speech/Non-Speech Segmentation Based on Phoneme Recognition Features. EURASIP Journal on Applied Signal Processing, 2006:1-13, February 2006.
[32] S. Kadambe and G. F. B. Boudreaux. Application of the Wavelet Transform for Pitch Detection of Speech Signals. IEEE Transactions on Information Theory, 38:917-924, September 1992.
[33] E. J. Keogh and M. J. Pazzani. Derivative Dynamic Time Warping. First SIAM International Conference on Data Mining 2001, Chicago, IL, 2001.
[34] M. S. Lewicki. Efficient coding of natural sounds. Nature Neuroscience, 5(4):356-363, 2002.
[35] D. Li, I. K. Sethi, N. Dimitrova, and T. McGee. Classification of General Audio Data for Content-Based Retrieval. Pattern Recognition Letters, 22:533-544, 2001.
[36] T. Li, M. Ogihara, and Q. Li. A comparative study on Content-based Music Genre Classification. SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 282-289, August 2003.
[37] L. Liu, H. J. Zhang, and H. Jiang. Content Analysis for Audio Classification and Segmentation.
IEEE Transactions on Speech and Audio Processing, 10(7):504-516, October 2002.
[38] B. Logan and A. Salomon. A music similarity function based on signal analysis. IEEE International Conference on Multimedia and Expo (ICME) 2001, 10(5):745-748, August 2001.
[39] L. Ma, D. Smith, and B. Miller. Context Awareness using Environmental Noise Classification. In Proceedings of EUROSPEECH 2003, pages 2237-2240, 2003.
[40] K. D. Martin, E. D. Scheirer, and B. L. Vercoe. Music Content Analysis through Models of Audition. Presented at the ACM Multimedia '98 Workshop on Content Processing of Music for Multimedia Applications, Bristol, UK, September 1998.
[41] A. Meng, P. Ahrendt, and J. Larsen. Improving Music Genre Classification by Short-Time Feature Integration. In Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA, 5:497-500, 2005.
[42] M. Goto and Y. Muraoka. A beat tracking system for acoustic signals of music. In ACM Multimedia, pages 365-372, 1994.
[43] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39, 1995.
[44] M. Roach and J. Mason. Classification of Video Genre using Audio. In Proceedings of EUROSPEECH 2001, 4:2693-2696, September 2001.
[45] M. Slaney, N. Mesgarani, and A. S. Shamma. Discrimination of Speech From Nonspeech Based on Multiscale Spectro-Temporal Modulations. IEEE Transactions on Audio, Speech and Language Processing, 14:920-930, May 2006.
[46] N. Scaringella, G. Zoia, and D. Mlynek. Automatic Genre Classification of Music Content: A Survey. IEEE Signal Processing Magazine, 23(2):133-141, 2001.
[47] C. Panagiotakis and G. Tziritas. A Speech/Music Discriminator Based on RMS and Zero-Crossings. IEEE Transactions on Multimedia, 7(1), February 2005.
[48] A. Papoulis and S. U. Pillai. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 4th edition, p. 304, December 2001.
[49] L. R. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal. A Comparative Performance Study of Several Pitch Detection Algorithms. IEEE Transactions on Acoustics, Speech and Signal Processing, 24:399-417, October 1976.
[50] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice Hall, New Jersey, ISBN 0132136031, September 1978.
[51] R. Radhakrishnan and A. Divakaran. Generative Process Tracking for Audio Analysis. IEEE International Conference on Acoustics, Speech and Signal Processing, May 2006.
[52] R. Cai, A. Hanjalic, H.-J. Zhang, and L.-H. Cai. A Flexible Framework for Key Audio Effects Detection and Auditory Context Inference. IEEE Transactions on Audio, Speech and Language Processing, 14(3):1026-1039, May 2006.
[53] E. Scheirer and M. Slaney. Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1997, Munich, Germany, 2:1331-1334, April 1997.
[54] G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461-464, March 1978.
[55] M. Slaney. Semantic Audio Retrieval. International Conference on Acoustics, Speech and Signal Processing (ICASSP), Orlando, USA, pages 13-17, May 2002.
[56] M. Slaney, D. Ponceleon, and J. Kaufman. Multimedia edges: finding hierarchy in all dimensions. Proceedings of the 9th ACM International Conference on Multimedia, pages 29-40, 2001.
[57] H. Soltau, T. Schultz, M. Westphal, and A. Waibel. Recognition of Music Types.
In Proceedings, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2:1137-1140, 1998.
[58] G. Strang. Linear Algebra and its Applications. Harcourt Brace Jovanovich, 3rd edition, 1988.
[59] S. Sundaram and S. Narayanan. Analysis of Audio Clustering using Word Descriptions. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2:769-772, October 2006.
[60] S. Sundaram and S. Narayanan. Audio Retrieval by Latent Perceptual Indexing. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, USA, 2008.
[61] S. Sundaram and S. Narayanan. An Attribute-based Approach to Audio Description applied to Segmenting Vocal Sections in Popular Music Songs. International Workshop on Multimedia Signal Processing (MMSP), 2006.
[62] B. Thoshkahna, V. Sudha, and K. R. Ramakrishnan. Speech/Music Discriminator using HILN Model based Features. EURASIP Journal on Applied Signal Processing, 2006:1-13, February 2006.
[63] G. Tzanetakis and P. Cook. A Framework for Audio Analysis based on Classification and Temporal Segmentation. In Proc. 25th Euromicro Conference, Workshop on Music Technology and Audio Processing, Milan, Italy, IEEE Computer Society, 2:2061, 1999.
[64] G. Tzanetakis and P. Cook. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 10(5):293-302, July 2002.
[65] K. Umapathy, S. Krishnan, and S. Jimaa. Multigroup Classification of Audio Signals using Time-Frequency Parameters. IEEE Transactions on Multimedia, 7(2):308-315, April 2005.
[66] K. Wang and A. S. Shamma. Spectral Shape Analysis in the Central Auditory System. IEEE Transactions on Speech and Audio Processing, 3(5):382-395, September 1995.
[67] G. Ward. Grady Ward's Moby Thesaurus.
[68] G. Williams and D. P. Ellis. Speech/Music Discrimination Based on Posterior Probability Features. EUROSPEECH, September 1999.
[69] E. Wold, T. Blum, D. Keislar, and J. W. Wheaton. Content-Based Classification, Search, and Retrieval of Audio. IEEE Multimedia, 3(3):27-36, Fall 1996.
[70] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang. Audio Events Detection Based Highlights Extraction from Baseball, Golf and Soccer Games in a Unified Framework. International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hong Kong, 3:401-404, April 2003.
[71] S. Yan, D. Xu, Q. Yang, L. Zhang, X. Tang, and H.-J. Zhang. Discriminant Analysis with Tensor Representation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2005, pages 526-532, 2005.
[72] X. Yang, K. Wang, and A. S. Shamma. Auditory Representations of Acoustic Signals. IEEE Transactions on Information Theory, 38(2):824-839, March 1992.
[73] T. Zhang and J. Kuo. Audio Content Analysis for Online Audiovisual Data Segmentation and Classification. IEEE Transactions on Speech and Audio Processing, 9(4), May 2001.
[74] B. Zhou and J. H. L. Hansen. Unsupervised Audio Stream Segmentation and Clustering Via the Bayesian Information Criterion. In Proceedings of the International Conference on Speech and Language Processing (ICSLP), Beijing, China, 3:714-717, October 2000.
Abstract
Hearing is a part of everyday human experience. Starting with the sound of our alarm clock in the morning there are innumerable sounds that are familiar to us and more! This familiarity and knowledge about sounds is learned over our lifetime. It is our innate ability to try to quantify (or consciously ignore) and interpret every sound we hear. In spite of the tremendous varieties of sounds, we can understand each and every one of them. Even if we hear a sound for the first time, we are able to come up with specific descriptions about it. The descriptions can be about the source or the properties of its sound. This is the listening process that is continuously taking place in our auditory mechanism. It is based on context, placement and timing of the source. The descriptions are not necessarily in terms of words in a language, it may be some meta residual understanding of the sound that immediately allows us to draw a mental picture and subsequently recognize it.