Here w(n) is as defined in equation 2.2, i is the time index and ω is the frequency index of the Fourier transform. Since each window typically lasts 20-100 milliseconds, this equation is essentially a sequence of Fourier transforms of a long audio signal, one every window duration. A time-indexed frequency representation (time-frequency representation) is concerned mainly with the spectral content at each frequency; therefore it is common to use the magnitude of the complex frequency coefficients obtained in the transform of equation 2.5. A more practical time-frequency representation using the discrete Fourier transform (DFT) is given by the equation:

F(i,k) = |F_i(k)| = \left| \sum_{j=0}^{N-1} y_i(j) \cdot e^{-\imath \frac{2\pi k}{N} j} \right| \qquad (2.6)

where y_i(j) is an N-length sequence of the ith windowed signal,

y_i(j) = \left( \sum_{m=-\infty}^{\infty} x(m) \cdot w(m - i) \right), \qquad 0 \le j \le N - 1 \qquad (2.7)

This F(i,k) is visualized as an image, F(i,k) \leftrightarrow I(x, y), where the colour of each pixel is scaled according to the magnitude at the (i, k)th point. This is illustrated in figure 2.1. All spectral features are based on the spectrogram of the signal. Each of them is a statistical measure of the time-frequency content of the signal. Common measures used for audio processing are discussed next.

2.2.2.2 Spectral Centroid (SC):

It is the weighted mean frequency of a given frame of audio. The weights are the magnitudes of the corresponding frequency points of a short-time Discrete Fourier Transform.
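As a concrete illustration of equations 2.6 and 2.7 above, the following is a minimal Python/NumPy sketch of the magnitude spectrogram F(i, k); the window length N, the hop between successive windows, and the choice of a Hann window are assumptions for the example and are not taken from the text.

```python
import numpy as np

def magnitude_spectrogram(x, N=1024, hop=256):
    """Magnitude spectrogram F(i, k) = |F_i(k)| of equation 2.6.

    x   : 1-D audio signal x(m)
    N   : window length in samples (assumed value)
    hop : shift between successive windows (assumed value)
    """
    w = np.hanning(N)                        # one common choice for the window w(n)
    n_frames = 1 + (len(x) - N) // hop
    F = np.empty((n_frames, N // 2 + 1))
    for i in range(n_frames):
        y_i = x[i * hop : i * hop + N] * w   # windowed N-length sequence y_i(j), eq. 2.7
        F[i] = np.abs(np.fft.rfft(y_i))      # |F_i(k)|, eq. 2.6, via the real-input DFT
    return F
```

Plotting F with the frame index i along one axis and the frequency bin k along the other, with pixel colour scaled by magnitude, gives the image F(i, k) ↔ I(x, y) described above.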
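In the same spirit, a sketch of the spectral centroid of a single frame, computed from one row of the magnitude spectrogram above; the sampling rate fs is an assumed parameter.

```python
def spectral_centroid(mag_frame, fs):
    """Weighted mean frequency of one audio frame; weights are the DFT magnitudes."""
    freqs = np.linspace(0.0, fs / 2.0, num=len(mag_frame))  # centre frequency of each bin
    total = mag_frame.sum()
    return (freqs * mag_frame).sum() / total if total > 0 else 0.0

# Example use (fs is assumed): one centroid value per frame of the spectrogram.
# centroids = [spectral_centroid(frame, fs=16000) for frame in magnitude_spectrogram(x)]
```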
Object Description
Title | Data-driven methods in description-based approaches to audio information processing |
Author | Sundaram, Shiva |
Author email | ssundara@usc.edu; abstractshiva@gmail.com |
Degree | Doctor of Philosophy |
Document type | Dissertation |
Degree program | Electrical Engineering |
School | Viterbi School of Engineering |
Date defended/completed | 2008-05-05 |
Date submitted | 2008 |
Restricted until | Unrestricted |
Date published | 2008-10-07 |
Advisor (committee chair) | Narayanan, Shrikanth S. |
Advisor (committee member) | Kyriakakis, Chris; Shahabi, Cyrus |
Abstract | Hearing is a part of everyday human experience. Starting with the sound of our alarm clock in the morning, there are innumerable sounds that are familiar to us, and more. This familiarity and knowledge about sounds is learned over our lifetime. It is our innate ability to try to quantify (or consciously ignore) and interpret every sound we hear. In spite of the tremendous variety of sounds, we can understand each and every one of them. Even if we hear a sound for the first time, we are able to come up with specific descriptions about it. The descriptions can be about the source or the properties of its sound. This is the listening process that is continuously taking place in our auditory mechanism. It is based on the context, placement and timing of the source. The descriptions are not necessarily in terms of words in a language; they may be some residual, meta-level understanding of the sound that immediately allows us to draw a mental picture and subsequently recognize it.; All computer-based processing systems help human users to augment their audio-visual processing mechanism. The objective of this work is to try to capture a part of, or at least mimic aspects of, this listening and interpretation process and implement it in a computing machine. This would help one to browse vast amounts of audio and locate parts of interest quickly and automatically.; Other contemporary systems that attempt the same problem exist. Although these methods are highly accurate, primarily because they solve specific, well-constrained problems, they lack sophisticated information extraction and representation of audio beyond the realm of a simple labeling scheme and its classification. Additionally, they have the drawback that they cannot handle a large number of classes or categories of audio, as they inherently rely on a naive implementation of pattern classification algorithms. To this end, flexibility and scalability are important traits of an automatic listening machine.; One of the primary contributions of this work is a new, scalable framework for processing audio through higher-level descriptions of its acoustic properties (compared to signal-level measures) instead of just an object labeling and classification scheme. The research efforts are geared toward developing representations and analysis techniques that are scalable in terms of time and description level. The work considers both perception-based descriptions (using onomatopoeia) and high-level semantic descriptions. These methods can be applied universally to the domain of unstructured audio, which covers all forms of content where the type, the number, and the duration of acoustic sources are highly variable. The ultimate goal of the work presented here is to develop a full-duplex audio information processing system where audio is categorized, segmented, and clustered using both signal-level measures and higher-level language-based descriptions. For this, new organization and inference techniques are developed and implemented as learning machines using existing pattern classifiers. First, an approach to describing audio properties with words is introduced. The results indicate that onomatopoeic descriptions can be used as a proper meta-level representation of audio properties, and that this representation scheme differs from the properties of audio captured by existing signal-level measures. Another technique for processing by descriptions uses audio attributes. Here, instead of the conventional approach of directly identifying segments of interest, a mid-level representation through generic audio attributes (such as noise-like, speech, or harmonic) using an activity rate measure is first created. Using this representation, a system that segments vocal sections and identifies the genre of a popular music piece is presented.; While the performance is comparable to other contemporary methods, the ideas presented here are also scalable and can be used for processing more complex audio scenes (with a large number of sources). To do so, it is necessary to increase the number of attributes that are being tracked. An idea for extension, further resolving the noise-like attribute into machine-generated and natural noise, is discussed. Using a bio-inspired cortical representation, the performance of two pattern classification systems in discriminating between the two noise types is presented. To handle large amounts of high-dimensional data, a new dimension reduction technique based on partitioning the data into smaller subsets is implemented. Along the same lines, another framework for description-based audio retrieval using a unit-document co-occurrence measure is presented.; In this case the retrieval is performed by explicitly discovering discrete units in audio clips and then formulating the unit-document co-occurrence measure. This is used to index any audio clip in a continuous representation space and, therefore, to perform retrieval.; The approach adopted in this dissertation work presents an alternative method for audio processing that moves away from direct identification of acoustic sources and their corresponding labels. Instead, the framework presents ideas to represent and process general, unstructured audio without explicitly identifying distinct acoustic source-events. |
Keyword | audio information processing; speech and audio processing; information extraction; audio indexing and retrieval; audio representation; music information representation and processing; multimedia signal processing; data-driven methods in signal processing; auditory perception; machine learning; human computer interface |
Language | English |
Part of collection | University of Southern California dissertations and theses |
Publisher (of the original version) | University of Southern California |
Place of publication (of the original version) | Los Angeles, California |
Publisher (of the digital version) | University of Southern California. Libraries |
Provenance | Electronically uploaded by the author |
Type | texts |
Legacy record ID | usctheses-m1636 |
Contributing entity | University of Southern California |
Rights | Sundaram, Shiva |
Repository name | Libraries, University of Southern California |
Repository address | Los Angeles, California |
Repository email | cisadmin@lib.usc.edu |
Filename | etd-Sundaram-2398 |
Archival file | uscthesesreloadpub_Volume14/etd-Sundaram-2398.pdf |