SOURCE-SPECIFIC LEARNING AND BINAURAL CUES SELECTION TECHNIQUES FOR AUDIO SOURCE SEPARATION

by

Namgook Cho

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2009

Copyright 2009 Namgook Cho

Dedication

To my family

Acknowledgements

First of all, I would like to thank my advisor, C.-C. Jay Kuo, for his inspiring and encouraging way of guiding me to a deeper understanding of knowledge work, and for his invaluable comments throughout the work on this dissertation. He taught me how to ask questions, how to approach different solutions, and how to express my ideas and work. He was always there to listen and to give advice.

I also would like to thank the rest of my thesis committee, Prof. Shrikanth S. Narayanan and Prof. Liang Chen, who reviewed my work and gave insightful comments.

I am very grateful to my sincere friends, Jong-Dae Oh and Kyunghyun Sung, who have taken care of me in many ways since I first came to USC. A special thanks goes to my friends Hea-Lim Lee and Seung-Hyun Chon, who led me to God at USC. I would like to express my deep appreciation to all my friends at USC: Yu Shiu, Jung-Hoon Park, Jae-Joon Lee, Yongjin Cho, Byung-Ho Cha, Byung-Tae Oh, Jong-Hye Woo, Seong-Ho Cho, Dong-Woo Kang, and Selina Chu. Without their sincere care and consideration, I could never have succeeded in my Ph.D. program.

Finally, my profound gratitude goes to my parents and my wife, Yeonshin, who is my best company, friend, and partner for life. My parents have supported me both emotionally and financially throughout my studies, for which I am deeply indebted.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
  1.3 Contributions of the Research
  1.4 Organization of the Dissertation
Chapter 2: Research Background
  2.1 Sparse Representations
    2.1.1 Overcomplete Signal Representation
    2.1.2 Matching Pursuit
  2.2 Learning Signal Structures via Sparse Coding
  2.3 Iterative Reweighted Least Squares
Chapter 3: Sparse Music Representation with Source-Specific Dictionaries and Its Application to Signal Separation
  3.1 Introduction
  3.2 Review of Previous Work
    3.2.1 Source-Specific Atoms
    3.2.2 Source-Independent Atoms
  3.3 Source-Specific Representation for Music
    3.3.1 Music Decomposition with Matching Pursuit
    3.3.2 Atom Prioritization
    3.3.3 Source-Specific Atoms and Dictionaries
  3.4 Application to Music Signal Separation
  3.5 Experimental Results
    3.5.1 Learning Musical Structure and Source-Specific Atoms
    3.5.2 Approximation Capability
    3.5.3 Music Signal Separation
  3.6 Conclusion
Chapter 4: Multi-Channel Audio Source Separation with Noisy Data
  4.1 Introduction
  4.2 Unifying Framework for Sparse Component Analysis
  4.3 Noise Effects on Sparse Component Analysis
  4.4 Enhancements of Sparse Component Analysis
    4.4.1 Estimation of Mixing Parameters
    4.4.2 Estimation of Sources
  4.5 Source-Specific Representation for Multichannel Source Separation
  4.6 Experimental Results
    4.6.1 Estimation of Mixing Parameters
    4.6.2 Extraction of Music Signals
  4.7 Conclusion
Chapter 5: Audio Source Separation in Room Acoustic Environments with Selected Binaural Cues
  5.1 Introduction
  5.2 Review of Previous Work
  5.3 Problem Formulation
  5.4 Estimation of Mixing Parameters
    5.4.1 Phase Determinacy Condition
    5.4.2 Source Sparsity
  5.5 Recovery of Source Signals
  5.6 Experimental Results
    5.6.1 Estimation of Mixing Parameters
    5.6.2 Recovery of Source Signals
  5.7 Conclusion
Chapter 6: Conclusion and Future Work
  6.1 Summary of the Results
  6.2 Future Research Direction
Bibliography

List of Tables

3.1 Comparison for music signal separation between source-independent and source-specific representation approaches.
3.2 Music signal separation examples for solo musical instrument sounds.
3.3 Music signal separation examples for multiple musical instrument sounds.
3.4 Music signal separation examples for simulated room recordings with $RT_{60} = 112$ ms.
4.1 Performance benchmarking for estimating mixing matrices $A_1$ and $A_2$.
5.1 Source separation example with stereo mixtures of three male speech utterances, $\delta_{j\max} > 1$.
5.2 Source separation example with stereo mixtures of a female speech utterance, piano, and guitar sound, $\delta_{j\max} > 1$.
5.3 Source separation example with stereo mixtures of a female speech utterance, cello, and train sound, $\delta_{j\max} > 1$.
5.4 Source separation examples with stereo mixtures of three speech utterances, $\delta_{j\max} > 1$.

List of Figures

2.1 An efficient signal representation using an overcomplete set of unit-norm functions.
3.1 The representations of a mixture of music and speech signals using (a) source-specific atoms and (b) source-independent atoms.
3.2 Constructing music source-specific atoms and the corresponding dictionary from musical notes.
3.3 (a) A note signal G4 of clarinet (top) and the decay of $|a_m|$ as a function of iteration number $m$ (bottom), and (b) the accumulation values of $c_m$, where the pre-defined threshold $\eta_p$ is set to 0.99.
3.4 Comparison of prioritized atoms distributed in the parameter space $(\xi, s)$: (a) 319 atoms for the clarinet sound, (b) 602 atoms for the drum sequence, and (c) overlapping atoms between the prioritized dictionaries $D_{p_c}$ and $D_{p_d}$ with a total number of 61. Note that the scale parameter is logarithmic to the base 2, and each dot represents the center frequency and the scale parameter of a prioritized atom.
3.5 (a), (b), and (c) show atoms obtained from three variations of clarinet note G4, and (d) is the synthesized atom corresponding to the clarinet note G4, where each top subfigure presents the time-domain waveform and each bottom subfigure shows its time-frequency representation.
3.6 Source-specific atoms synthesized using different scales: (a) scale 10, $|D_{s10}| = 33$, (b) scale 9, $|D_{s9}| = 25$, and (c) scale 8, $|D_{s8}| = 25$. (d) Scales 8, 9, and 10 are used, $|D_{G4}| = 83$. Note that $|\cdot|$ represents the number of prioritized atoms.
3.7 Source-specific atoms that correspond to notes C4 and C6 of trumpet and their time-frequency representations, where $f_0$ denotes the fundamental frequency of the notes.
3.8 Spectra of source-specific atoms for (a) the clarinet and (b) the piano, where each column represents an atom.
3.9 Illustration of various source-specific atoms: (a) atoms obtained from clarinet note G4 (top), piano note G4 (middle), and trumpet note G4 (bottom) and (b) spectra corresponding to these atoms.
3.10 The approximation capability of the proposed music representation with source-specific dictionaries: (a) a real clarinet sound and (b) a real piano sound. The original signal (top), the reconstructed signal (middle), and the SDR values as a function of the number of approximating atoms (bottom).
3.11 Spectrograms of signals. (a) Approximation: a real clarinet sound (top) and its reconstructed signal (bottom). (b) Signal separation: a mixture of clarinet and male speech (top) and the extracted clarinet signal (bottom).
3.12 The configuration of loudspeakers and a microphone in a room, where recordings were simulated with an absorption coefficient of 0.6 for the room's surfaces.
4.1 Sparse component analysis for blind source separation.
4.2 The scatter plot of two linear mixtures of three sources in (a) the time domain and (b) the transform domain, where $F_W$ represents the short-time Fourier transform.
4.3 The scatter plot of two linear mixtures of two sources: (a) clean data and their associated line orientations with a large intersection angle from a synthetic mixture, and (b) noisy data and their associated line orientations with a small intersection angle from a commercial music CD excerpt, "Let It Be" by the Beatles. $F_W$ represents the short-time Fourier transform.
4.4 Estimation of three random signals from two observation mixtures, where the original sources $s_i$ (circle) and their estimates $s^e_i$ (square) are represented.
4.5 Percentages of overlapping points in the STFT domain as a function of white Gaussian noise: (a) a mixture of clarinet, male speech, and street sound, and (b) a mixture of one male and two female speech utterances.
4.6 Scatter plots of three sources mixed into two mixtures using the degenerate linear mixing system $A_1$: (a) clean data, (b) SNR = 15 dB, and (c) SNR = 5 dB. The square dots represent the normalized original column vectors in $A_1$.
4.7 Scatter plots of three sources mixed into two mixtures using the degenerate linear mixing system $A_2$: (a) clean data, (b) SNR = 15 dB, and (c) SNR = 5 dB. The square dots represent the normalized original column vectors in $A_2$.
4.8 SDRs obtained from extracting the music signal as a function of $p$ when mixing system (a) $A_1$ or (b) $A_2$ is used.
4.9 Performance comparison between Bofill and Zibulevsky's algorithm (BZ), where the mixing matrix is estimated by the scatter plot technique, and BZ with the original mixing matrix. The mixing matrices (a) $A_1$ and (b) $A_2$ are used.
4.10 Music extraction performance from mixtures of clarinet, speech, and street environmental sounds with mixing matrix $A_1$.
4.11 Music extraction performance from mixtures of clarinet, speech, and street environmental sounds with mixing matrix $A_2$.
4.12 Performance evaluation of the source-specific-based source extraction technique as a function of SNR.
5.1 Minimum microphone spacing to avoid spatial aliasing as a function of temporal sampling frequency.
5.2 The system diagram for the solution of the underdetermined convolutive audio source separation.
5.3 A magnitude scatter plot at the frequency of 217.9 Hz, where three speech utterances are mixed under the room acoustic environment at $RT_{60} = 113$ ms. Note that all data points are normalized such that the maximum energy value equals one.
5.4 Magnitude scatter plots at the frequency band 715.95 Hz, where three speech utterances are mixed under (left) the anechoic ($RT_{60} = 0$ ms) and (right) the echoic environment ($RT_{60} = 113$ ms). Note that all data points are normalized such that the maximum energy value equals one.
5.5 Comparison of histograms over the parameter space $(x_{at}, x_d)$ with (a) the whole set of binaural cues and (b) the selection of binaural cues, where the symbol "x" marks local peaks used to estimate mixing parameters.
5.6 The configuration of loudspeakers and microphones in a room.
5.7 Contour maps of the binaural cues in an anechoic room environment, where three speech utterances were simulated for stereo mixtures and the whole set of binaural cues was used with microphone spacing equal to (a) 2 cm, (b) 8 cm, and (c) 20 cm, and a selected set of binaural cues was used with microphone spacing equal to (d) 2 cm, (e) 8 cm, and (f) 20 cm.
5.8 Contour maps of binaural cues in an echoic room environment ($RT_{60} = 113$ ms), where three speech utterances were simulated for stereo mixtures and the whole set of binaural cues was used with microphone spacing of (a) 2 cm, (b) 8 cm, and (c) 20 cm, and only the selected binaural cues were used with microphone spacing of (d) 2 cm, (e) 8 cm, and (f) 20 cm.
5.9 The average SDR over mixtures of three speech sources as a function of $p$: (a) anechoic and (b) echoic environment ($RT_{60} = 113$ ms).

Abstract

Several audio source separation techniques, which aim to determine the original sources given their acoustic mixtures, are proposed in this research. Two different mixing processes are considered in terms of the number of microphones: single-channel and multi-channel settings. Since no spatial cue is available in the single-channel observation, we exploit different characteristics of audio sounds. In the case of multichannel mixtures, the spatial information of the sound field enables us to estimate the mixing system by locating sound sources.

In Chapter 3, we propose a source-specific learning approach to efficient music representation, and use it to separate music signals that co-exist with background noise such as speech or environmental sounds. The basic idea is to determine a set of learned elementary functions, called atoms, that efficiently capture music signal characteristics. There are three steps in the construction of a learned dictionary. First, we decompose basic components of musical signals (e.g., musical notes) into a set of source-independent atoms (i.e., Gabor atoms). Then, we prioritize these Gabor atoms according to their approximation capability for the music signals of interest. Third, we use the prioritized Gabor atoms to synthesize new atoms to build a compact learned dictionary. The number of atoms needed to represent music signals using the learned dictionary is much smaller than that of the Gabor dictionary, resulting in a sparse music representation. Experimental results are given to demonstrate its efficiency and its application to music signal separation from a mixture of multiple sounds.

In Chapter 4, we investigate noise effects on multichannel audio source separation with an instantaneous mixing system, where sounds emanating from different sources arrive at the same time without any delay between them. Under noisy conditions, the source sparsity assumption, which is critical in Sparse Component Analysis (SCA), is easily violated; that is, several sources may be active at a time-frequency point. These violations of the assumption yield errors in the estimation of mixing parameters and render the use of $\ell_1$-norm minimization improper. We propose an enhanced technique to address the problem by employing weighted soft-assignment clustering and generalized $\ell_p$-norm minimization with regularization. The technique results in more robust and sparser solutions in a noisy environment than SCA-based methods. In addition, we extend the single-channel audio source separation technique based on source-specific dictionaries to the multichannel case to extract music signals from stereo-channel mixtures.
In Chapter 5, we propose a robust technique to separate audio sources received by a microphone array in a room acoustic environment with an underdetermined mixing process (i.e., the number of sources is larger than the number of mixtures). Our scheme consists of two stages: 1) estimation of mixing parameters and 2) recovery of source signals. For the first stage, contrary to the traditional DUET (Degenerate Unmixing Estimation Technique)-type methods that exploit all binaural cues, we estimate the mixing parameters by selecting a reliable subset of binaural cues based on the phase determinacy condition and source sparsity. As a result, we can determine the mixing parameters successfully even in a reverberant environment with longer time delays. Then, proper mathematical tools are applied to the underdetermined linear system to recover the original audio sources in the second stage. Experimental results on simulated data in a room acoustic environment are given to show a significant gain over the DUET-type method in audio source separation.

Chapter 1: Introduction

1.1 Significance of the Research

Audio signals are often perceived against the background of other sounds with different characteristics in real-world situations. Audio source separation, which aims to determine the original sources given one or multiple acoustic mixtures of those sources, has become an emerging research topic in recent years due to its many potential applications. For example, Microsoft [4] adopted a sound capturing technique with a microphone array system in speech recognition applications, since users prefer to use their own voice without wearing a headset in front of the computer. The use of multiple microphones can compensate for the effect of ambient noise and reverberation by providing spatial filtering of the sound field, effectively focusing attention in a desired direction. In addition, the DICIT (Distant-talking Interfaces for Control of Interactive TV) project [35] employed a linear microphone array to enable a user-friendly interface with TV and infotainment devices via the human voice. However, most current applications assume a single noisy sound source (e.g., speech) without taking into account multiple audio sources being active at the same time. Thus, it is difficult to extract a single source from complex acoustic mixtures in the pre-processing step. For example, in human conversation, speech signals are often perceived against undesired audio sources (e.g., music background) and, as a result, automatic speech recognition may fail to produce the desired outcomes.

We consider audio source separation techniques that produce cleaner audio sources from a complex acoustic environment by exploiting different features of audio sources and/or spatial information obtained from multiple microphones. In this research, we focus on extracting the original sources from single or multiple microphones, which can be written as

    x(t) = A \star s(t),

where $x$ and $s$ are the $M$ mixtures and $N$ sources, respectively, and $A \in \mathbb{C}^{M \times N}$ is a mixing matrix. Note that $\star$ represents linear multiplication under the assumption of an instantaneous mixing process, while it yields a convolutive mixing model in a reverberant mixing system. Here, our goal is to recover the unknown source signals from the observed mixtures only. We assume that the mixing matrix $A$ is also unknown and that the number of sources is larger than the number of microphones (i.e., an underdetermined mixing system). From the mathematical point of view, the underdetermined linear system has a non-unique, infinite set of solutions. Thus, the unmixed signals cannot be obtained directly even when the mixing matrix $A$ is successfully estimated.
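As a minimal numeric illustration of the instantaneous model (a sketch, not the experimental setup of this research; the toy signals and the 2x3 mixing matrix are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
fs, dur = 11025, 1.0                       # sampling rate (Hz) and duration (s)
t = np.arange(int(fs * dur)) / fs

# Three toy sources (N = 3): two tones and a noise burst.
s = np.stack([np.sin(2 * np.pi * 440 * t),
              np.sin(2 * np.pi * 554 * t),
              0.3 * rng.standard_normal(t.size)])

# Underdetermined instantaneous mixing: M = 2 observations of N = 3 sources.
A = np.array([[0.9, 0.5, 0.1],
              [0.2, 0.6, 0.8]])            # arbitrary 2x3 mixing matrix
x = A @ s                                  # x(t) = A s(t), shape (2, len(t))

# Even with A known, A s = x has infinitely many solutions; the minimum
# l2-norm solution via the pseudoinverse generally does NOT recover s.
s_min_norm = np.linalg.pinv(A) @ x
print(np.linalg.norm(s_min_norm - s) / np.linalg.norm(s))
```

Even with $A$ known exactly, the minimum-norm solution differs from the true sources, which is why the sparsity priors discussed in later chapters are needed.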
In the case of single-channel mixtures, we are motivated by the need to extract music signals from background noise, including speech and environmental sounds. With a single-channel observation, there is no spatial cue to exploit as there is with multichannel microphones. In addition, music signals tend to have a wide range of spectral and temporal characteristics, which results in significant overlap with other sounds in the time-frequency domain. Due to this overlap, it is more difficult to impose the sparsity assumption in the time-frequency domain. Sparsity here means that only one source is active at a time-frequency point. It is well understood that impairment of the sparsity assumption results in distortion, known as the musical noise artifact, in the estimated sources.

Although the use of multiple microphones provides spatial information about the sound field, there are additional technical challenges arising from a room acoustic environment: reverberation and spatial aliasing. An acoustic signal in a room is usually observed at the receiver as a direct-path signal together with reflected signals, which are delayed and attenuated copies of the direct-path signal. Thus, room reverberation usually smears a sound across time and frequency, which has a negative impact on source sparsity. Moreover, the spatial sampling effect should be considered in the scenario of multiple microphones. Similar to the principle of temporal sampling (known as the Nyquist sampling theorem), there is an analogous requirement in spatial sampling to avoid grating lobes (or spatial aliasing) in the directivity pattern of multiple microphones [53]. With a uniform linear array of microphones, for example, the maximum distance between two microphones is 7.8 mm with respect to a sampling rate of 44.1 kHz. However, some sound recordings may not satisfy this constraint on microphone spacing; for instance, the KEMAR dummy head recording [24] at a sampling rate of 44.1 kHz, where the microphones (the left and right ears of the mannequin) are spaced far enough apart that spatial aliasing may occur. In addition to reverberation and spatial aliasing, the geometry of the microphones determines the performance of source separation, but some recordings of audio sounds may be given without information about the microphone geometry. In this work, we consider a stereo-channel microphone setting in a room acoustic environment without any constraint on the geometry of the microphones. Thus, spatial aliasing may occur. Note that common audio signals are typically available in stereo format and consist of a mixture of more than two sound sources.
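The 7.8 mm figure follows from the distance sound travels in one sample period. A quick check, assuming a speed of sound of c = 343 m/s (a standard value; the dissertation does not state the constant it uses):

```python
# Maximum microphone spacing for which a one-sample delay covers the full
# propagation time between microphones (assumes sound speed c = 343 m/s).
def max_spacing_mm(fs_hz: float, c: float = 343.0) -> float:
    return 1000.0 * c / fs_hz  # distance sound travels in one sample period

print(max_spacing_mm(44100.0))  # ~7.78 mm at a 44.1 kHz sampling rate
```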
1.2 Review of Previous Work

Methods proposed for single-channel audio source separation can be roughly classified into two categories: parametric analysis (or model-based inference) and non-parametric analysis. Model-based inference methods adopt a parametric model for each source under separation, and the model parameters are estimated based on observed mixture signals [61,72]. The main difficulty with the parametric approach is that it is not easy to find a proper model for a wide range of signals. For example, the optimal state space for music sources with the hidden Markov model (HMM) is much larger than that for a speech source, due to their wider frequency range and dynamic range. The sinusoidal model is also limited in its applicability [72]; that is, it is most suitable for pitched musical instruments and voiced speech.

Unsupervised non-parametric algorithms depend on a simple linear model in the time-frequency domain [12,16,66,71]. With this approach, the sound mixture is first decomposed under the assumption that the sources are statistically independent. Then, a clustering scheme is used to find a set of disjoint functions with respect to the source signals. Afterwards, the grouped basis functions and their coefficients are recombined to reconstruct the source signals. For example, the magnitude spectrogram of a single-channel mixture can be decomposed into the product of basis spectra and time-varying gains using Independent Component Analysis (ICA) [16,66] or Nonnegative Matrix Factorization (NMF) [71]. Then, audio separation can be accomplished by clustering the basis spectra into disjoint sets using a statistical distance measure [16], instrument-specific features [66], or the original sources as references [71]. Finally, the phase information of the original source is used to re-synthesize time-domain estimates of the source [23]. There are, however, several challenges associated with this approach. The clustering process is a nontrivial task, and the phase has to be estimated for the re-synthesis process [23,71].

The multichannel audio source separation problem has been examined by researchers for years. Early approaches concentrated on tackling even-determined demixing. ICA [10,33,38] pioneered those early approaches under the crucial assumption that the underlying sources are statistically independent. Even though ICA-based algorithms are fast and reliable for the multichannel audio source separation problem, they demand that the number of sensors be no less than the number of sources.

The sparse component analysis (SCA) technique was used to solve the audio source separation problem in [31,51] under the source sparsity assumption in the transform domain. The SCA-based methods exploit clustering in a scatter plot and use $\ell_1$-norm optimization to solve the problem by assuming a linear instantaneous mixing model, where sounds emanating from different sources arrive at the same time without any delay between them. The instantaneous mixing model is, however, too restrictive in practice. Another solution was derived based on an anechoic mixing model with time delays between sound sources, which leads to the DUET (Degenerate Unmixing Estimation Technique)-type methods [63,74]. DUET creates a two-dimensional histogram of the interaural level and time differences observed over the entire spectrograms of the mixtures. It then smooths the histogram and finds the largest peaks, which correspond to the sources. DUET assumes a constant interaural level and time difference at all frequencies and that there is no spatial aliasing, which can be met to a large degree with free-standing microphones close to one another. These methods work well for a pair of microphones with spacing small enough that any delay between the two microphones is no larger than the spacing of consecutive audio samples. A delay of one audio sample can correspond to a very short distance; for example, at a sampling rate of 44.1 kHz, a delay of one sample corresponds to a propagation distance in air of 7.8 mm. This constraint makes each source difficult to localize and isolate in the parameter space of the attenuation ratio and time delay, and it further leads to failures in mixing parameter estimation in situations where delays are larger than one sample.
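To make the DUET-style cue computation concrete, the following sketch derives per-time-frequency-point level and delay cues from the STFTs of a stereo mixture. The STFT settings and the use of scipy are illustrative assumptions rather than the cited authors' implementation:

```python
import numpy as np
from scipy.signal import stft

def duet_cues(x1, x2, fs, nperseg=1024):
    """Per time-frequency-point attenuation and delay cues (DUET-style)."""
    f, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    eps = 1e-12
    ratio = (X2 + eps) / (X1 + eps)                # interchannel ratio per TF point
    a = np.abs(ratio)                              # level (attenuation) cue
    w = 2 * np.pi * f[:, None]                     # angular frequency per bin
    delta = -np.angle(ratio) / np.maximum(w, eps)  # time-delay cue in seconds
    return a, delta

# A 2D weighted histogram of (a, delta) would then be smoothed and its
# largest peaks taken as the estimated mixing parameters of the sources.
```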
1.3 Contributions of the Research

We examine audio source separation techniques based on single and multiple acoustic mixtures of sources by exploiting different characteristics of audio sources and spatial information obtained from multiple microphones, respectively. The main contributions of this research are summarized below.

• We propose a source-specific dictionary approach to efficient music representation. The basic idea is to determine a set of learned elementary functions, called atoms, that efficiently capture music signal characteristics. That is, the atoms are highly correlated with the class of music signals of interest, yet uncorrelated with other classes of sounds, i.e., speech and environmental sounds. We adopt the overcomplete representation approach to learn inherent music characteristics efficiently. There are three steps in the construction of a source-specific dictionary. First, we decompose basic components of musical signals (e.g., musical notes) into a set of source-independent atoms (i.e., Gabor atoms). Then, we prioritize these Gabor atoms according to their approximation capability for the music signals of interest. Third, we use the prioritized Gabor atoms to synthesize new atoms to build a compact learned dictionary. The number of atoms needed to represent music signals using the learned dictionary is much smaller than that of the Gabor dictionary, resulting in a sparse music representation. That is, the source-specific dictionary approach drastically reduces the computational complexity of musical signal approximation.

• We apply the source-specific dictionary approach to extract music signals that co-exist with background noise such as speech or environmental sounds. Given a single-channel mixture of a music signal and a background sound, we project the mixture signal onto the corresponding subspace spanned by atoms from a music source-specific dictionary. Since the atoms learned from musical notes have stronger energy correlation with the music signal than with the background sound in the mixture, we can extract the music signal from the observed mixture. In contrast to the unsupervised non-parametric algorithms, all computations in our scheme can be performed in the time domain; thus, the phase information of the original sources is not needed for the re-synthesis of time-domain estimates.

• We propose a robust technique to separate audio sources received by a microphone array in a room acoustic environment, where an underdetermined mixing process is considered with no knowledge about the geometry of the microphones. The proposed technique consists of two stages: 1) estimation of mixing parameters and 2) recovery of source signals. For the first stage, after applying the short-time Fourier transform (STFT) to the mixtures, we estimate the mixing parameters (i.e., scalar attenuation coefficients and time-delay parameters) by selecting reliable binaural cues instead of using the whole set of time-frequency points. That is, we select the frequency range that produces no phase ambiguity and reliable time-frequency (TF) points that contribute to signals traveling from audio sources to microphones along the direct path, while excluding attenuated and delayed replicas of a sound. The selected subset of binaural cues behaves like the binaural cues of each source presented separately in an anechoic environment. As a result, our method can estimate the mixing parameters successfully even in a reverberant environment and under conditions where spatial aliasing occurs.

• After the mixing parameter estimation with reliable binaural cues, we solve the underdetermined convolutive linear system by employing an optimization problem with an $\ell_p$-norm criterion ($p \leq 1$).

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows.
Research background, such as sparse representations of audio signals and the underdetermined linear system based on the source sparsity assumption, is described in Chapter 2. Then, the sparse music representation scheme based on source-specific dictionaries is discussed for single-channel audio source separation in Chapter 3. In addition to audio separation, the efficiency of the proposed sparse representation technique is examined in terms of computational complexity and the approximation of music signals. In Chapters 4 and 5, we examine multichannel audio source separation. The instantaneous mixing model is discussed and the effects of noise on the source sparsity assumption of SCA are examined in Chapter 4. We propose an enhanced technique based on a weighted soft-assignment clustering algorithm and generalized $\ell_p$-norm minimization with regularization. Furthermore, a robust audio source separation technique based on selected binaural cues is presented in Chapter 5, where reverberation of sounds is assumed in a room acoustic environment. Finally, concluding remarks and future research directions are given in Chapter 6.

Chapter 2: Research Background

Some background knowledge for the current research is given in this chapter. Building efficient representations for signals is one of the main themes in signal processing and analysis. Overcomplete signal representation [18,50] enables the sparse representation of complex signals by capturing salient signal features using an overcomplete set of non-orthogonal, linearly dependent functions. Matching Pursuit (MP) [52] is one of the most popular algorithms that find good suboptimal solutions to the sparsest signal representation problem. As an alternative, sparse coding techniques [50,57] perform sparse decomposition using a given dictionary, and the dictionary is adaptively modified to achieve higher sparsity. Sparse representations and the MP algorithm are reviewed in Sec. 2.1. Then, sparse coding is explained in Sec. 2.2. Finally, the iterative reweighted least squares approach, a non-parametric method to find localized energy solutions from a limited amount of data, is introduced in Sec. 2.3.

2.1 Sparse Representations

2.1.1 Overcomplete Signal Representation

To analyze a signal $x(t)$ in an $N$-dimensional inner-product space $H$, called the signal space, we often decompose it into a weighted sum of basis functions $\varphi_k(t)$ via

    x(t) = \sum_{k=1}^{M} a_k \varphi_k(t),    (2.1)

which can be expressed in matrix notation as

    x = \Phi a,    (2.2)

where $\Phi = [\varphi_1, \cdots, \varphi_M]$ is an $N \times M$ matrix whose columns are basis functions, and $a$ is an $M \times 1$ column vector of coefficients.

When a signal is represented by orthonormal bases, $\Phi$ in (2.2) is a square ($N = M$) and invertible matrix, and the coefficient vector $a$ can be determined via $a = \Phi^H x$, where $H$ denotes the complex-conjugate transpose. Although such a representation is simple, some signals may not be represented efficiently by orthonormal bases [25,65] in a compact (or sparse) manner; that is, the coefficient vector $a$ contains many non-zero elements.

To represent a family of signals with a smaller number of basis functions, an overcomplete representation ($N < M$) can be introduced [18,50,52] by choosing $\varphi_k(t)$ from an overcomplete set of non-orthogonal, linearly dependent functions. The overcomplete set spans the signal space while containing more functions than necessary. The main advantage of the overcomplete representation is that it enables a compact representation of complex signals by capturing essential signal characteristics with fewer functions [25,50].
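A small numeric sketch of (2.1)-(2.2), anticipating the two-dimensional example of Fig. 2.1 below: the same 2-D signal needs two coefficients in an orthonormal basis but essentially one atom of a 3-atom overcomplete dictionary (all vectors here are made-up illustrations):

```python
import numpy as np

x = np.array([1.0, 1.0])                  # toy 2-D signal

# Orthonormal basis: both coefficients are needed (a = Phi^H x).
Phi = np.eye(2)
a = Phi.T @ x                             # -> [1, 1], two non-zero terms

# Overcomplete dictionary of three unit-norm atoms; the third is aligned
# with x, so a single coefficient captures nearly all of the energy.
D = np.array([[1.0, 0.0, 1 / np.sqrt(2)],
              [0.0, 1.0, 1 / np.sqrt(2)]])
a3 = D[:, 2] @ x                          # <x, phi_3>
residual = x - a3 * D[:, 2]
print(a, a3, np.linalg.norm(residual))    # residual ~ 0 with a single atom
```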
If most of the energy of a signal $x(t)$ is concentrated on a small set of overcomplete functions, it implies that the functions $\varphi_k(t)$ are highly correlated with the underlying signal. Then we say that the signal model in (2.1) is efficient, compact, or sparse. The efficiency of representing a signal with a set of overcomplete functions can be explained by a simple two-dimensional example in Fig. 2.1. When using orthonormal basis functions $\varphi_1$ and $\varphi_2$, a signal $x$ can be represented as

    x = \langle x, \varphi_1 \rangle \varphi_1 + \langle x, \varphi_2 \rangle \varphi_2 = a_1 \varphi_1 + a_2 \varphi_2,

where the angle bracket represents the Hermitian inner product. On the other hand, given the overcomplete functions $\varphi_1$, $\varphi_2$, and $\varphi_3$, we have the more concise representation

    x \approx \langle x, \varphi_3 \rangle \varphi_3 = a_3 \varphi_3.

The residual signal, $r = x - a_3 \varphi_3$, can be projected onto the other two functions $\varphi_1$ and $\varphi_2$, and the resulting coefficients, $|\langle r, \varphi_1 \rangle|$ and $|\langle r, \varphi_2 \rangle|$, are much smaller than $|a_3|$.

Figure 2.1: An efficient signal representation using an overcomplete set of unit-norm functions.

2.1.2 Matching Pursuit

Given an $N$-dimensional signal $x(t) \in H$, the central problem in sparse approximation is to represent the signal with the best linear combination of $M$ atoms from an overcomplete dictionary. Since $M$ is taken to be much smaller than $N$, the approximant is compact or sparse. It is, however, difficult to calculate the optimal sparse representation of an arbitrary signal, which has been proved to be an NP-hard problem [22].

Despite the difficulty of finding the best solution, it is possible to find sufficiently good representations that are nearly optimal. There are two popular algorithms: Basis Pursuit (BP) [18] and Matching Pursuit (MP) [52]. BP formulates sparse approximation as a linear programming problem, while MP employs an iterative greedy strategy that selects the atom best correlated with the residual of the signal at each step. Empirical evidence suggests that BP is more powerful but demands significantly higher complexity than MP [18].

MP iteratively constructs an approximant by selecting the element of the dictionary that best matches the signal at each iteration. Let $D = \{g_\gamma : \gamma \in \Gamma\}$ be a redundant family of $P$ unit-norm vectors. The dictionary includes $N < P$ linearly independent vectors that span the signal space of size $N$. MP begins by projecting the signal $x$ onto a vector $g_{\gamma_0} \in D$ and setting $R^0 x = x$. We first decompose the signal as

    R^0 x = \langle R^0 x, g_{\gamma_0} \rangle g_{\gamma_0} + R^1 x.

Since $R^1 x$ is orthogonal to $g_{\gamma_0}$, we have $\|R^0 x\|^2 = |\langle R^0 x, g_{\gamma_0} \rangle|^2 + \|R^1 x\|^2$. To minimize $\|R^1 x\|$, we must choose $g_{\gamma_0} \in D$ such that $|\langle R^0 x, g_{\gamma_0} \rangle|$ is maximized. In some cases, it is computationally more efficient to find a vector $g_{\gamma_0}$ that is almost optimal,

    |\langle R^0 x, g_{\gamma_0} \rangle| \geq \alpha \sup_{\gamma \in \Gamma} |\langle R^0 x, g_\gamma \rangle|,

where $\alpha \in (0, 1]$ is an optimality factor. We can iterate the above procedure by decomposing the residue functions. Suppose that the $m$th-order residue $R^m x$, $m \geq 0$, is given. In the next iteration, we choose $g_{\gamma_m} \in D$ such that

    |\langle R^m x, g_{\gamma_m} \rangle| \geq \alpha \sup_{\gamma \in \Gamma} |\langle R^m x, g_\gamma \rangle|,    (2.3)

and project $R^m x$ onto $g_{\gamma_m}$ via

    R^m x = \langle R^m x, g_{\gamma_m} \rangle g_{\gamma_m} + R^{m+1} x.    (2.4)

The orthogonality of $R^{m+1} x$ and $g_{\gamma_m}$ implies $\|R^m x\|^2 = |\langle R^m x, g_{\gamma_m} \rangle|^2 + \|R^{m+1} x\|^2$. Summing (2.4) over $m = 0, \cdots, M-1$ yields

    x = \sum_{m=0}^{M-1} \langle R^m x, g_{\gamma_m} \rangle g_{\gamma_m} + R^M x.    (2.5)

The norm of the signal is equal to

    \|x\|^2 = \sum_{m=0}^{M-1} |\langle R^m x, g_{\gamma_m} \rangle|^2 + \|R^M x\|^2.    (2.6)

It was proved by Jones [36] that $\|R^m x\|$ converges exponentially to 0 as $m$ goes to infinity, under the assumption that the dictionary $D$ is complete, i.e., $\mathrm{span}(D) = H$.
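The greedy selection rule of (2.3)-(2.4) with $\alpha = 1$ can be transcribed directly; a sketch over a generic unit-norm dictionary matrix (a random dictionary stands in for the Gabor dictionary used in Chapter 3):

```python
import numpy as np

def matching_pursuit(x, D, n_iter):
    """Greedy MP: D is (N, P) with unit-norm columns; returns coeffs, residual."""
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_iter):
        corr = D.T @ residual              # <R^m x, g_gamma> for all atoms
        k = np.argmax(np.abs(corr))        # best-matching atom (alpha = 1)
        coeffs[k] += corr[k]
        residual -= corr[k] * D[:, k]      # R^{m+1} x = R^m x - <.,.> g_k
    return coeffs, residual

rng = np.random.default_rng(1)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
x = D[:, 5] * 2.0 + D[:, 99] * 0.5         # synthetic 2-sparse signal
c, r = matching_pursuit(x, D, n_iter=10)
print(np.nonzero(c)[0], np.linalg.norm(r)) # residual energy decays per (2.6)
```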
2.2 Learning Signal Structures via Sparse Coding

Olshausen and Field [57] presented an algorithm capable of performing sparse decomposition with a given dictionary and adaptively modifying the dictionary to achieve greater sparsity. A linear generative model is used for the task, where observed data are represented as a weighted sum of elements chosen from a dictionary of available atoms. That is, an $N$-dimensional random vector $x$ is generated from $M$ independent hidden (or latent) variables $s = [s_1, \cdots, s_M]^T$ by

    x = A s + e,    (2.7)

where $e$ is a zero-mean Gaussian random vector representing additive noise and $A$ denotes a dictionary of $M$ vectors. The statistical independence of the components of $s$ implies that the probability density $p(s)$ can be factorized as

    p(s) = \prod_{j=1}^{M} p(s_j).    (2.8)

The Gaussian noise model implies that the conditional density $p(x \mid s, A)$ is given by

    p(x \mid s, A) = \left[ \frac{\det \Lambda_e}{(2\pi)^N} \right]^{1/2} \exp\left( -\tfrac{1}{2} e^T \Lambda_e e \right),    (2.9)

where $e = x - A s$ and $\Lambda_e = \langle e e^T \rangle^{-1}$ is the inverse of the noise covariance. (Probability density functions such as $p(x \mid y)$ and $p(x; \theta)$ denote a conditional density and a density parameterized by $\theta$, respectively.) Since the noise covariance is known to be $\langle e e^T \rangle = \sigma^2 I$, we can work with a scalar inverse covariance $\Lambda_e = \lambda I$, where $\lambda = 1/\sigma^2$.

Given the latent variable model in (2.7), (2.8), and (2.9), the remaining two tasks are to infer the optimal encoding $s$ with respect to the dictionary $A$ and to learn a suitable dictionary $A$ given a sequence of training vectors $(x_1, x_2, \cdots)$. They are described below.

For fixed values of $A$ and $x$, likely values of $s$ can be inferred from

    p(s \mid x, A) = \frac{p(x \mid s, A)\, p(s)}{p(x; A)}.    (2.10)

The maximum a posteriori (MAP) estimate is $\hat{s} = \arg\max_s p(s \mid x, A)$, or, equivalently,

    \hat{s} = \arg\min_s E(s; x, A),    (2.11)

where the energy function $E$ is defined as

    E(s; x, A) = -\log p(x \mid s, A) - \log p(s).

With the conditional probability given in (2.9), this becomes

    E(s; x, A) = \tfrac{1}{2} \lambda \|x - A s\|^2 - \sum_{j=1}^{M} \log p(s_j).    (2.12)

The first term can be interpreted as the quadratic error cost, and the second as the sparsity cost (or the diversity cost [43]) that penalizes non-sparse component activities. A common choice for the densities $p(s_j)$ leads to the $\ell_1$-norm, which is equivalent to using a Laplacian prior $p(s_j) \propto \exp(-|s_j|)$ [18,50].

Learning the dictionary $A$ can be accomplished with maximum-likelihood (ML) estimation, $\hat{A} = \arg\max_A L(A)$, where the log-likelihood function $L$ over an entire dataset is

    L(A) = \langle \log p(x; A) \rangle_x,    (2.13)

and where $\langle \cdot \rangle_x$ denotes an expectation over the training set. The derivative of this function with respect to the dictionary matrix is

    \frac{\partial L(A)}{\partial A} = \left\langle \langle \lambda e s^T \rangle_{s \mid x, A} \right\rangle_x,

where the inner expectation is over the posterior density $p(s \mid x, A)$ conditioned on particular values of $x$.

The simplest algorithm for optimizing this quantity is to approximate the posterior $p(s \mid x, A)$ by a Dirac delta distribution positioned at the posterior mode [57]. It involves iterative application of the update

    A \leftarrow A + \eta \lambda \langle \hat{e} \hat{s}^T \rangle_x,    (2.14)

where $\eta$ is a learning-rate parameter and $\hat{e} = x - A \hat{s}$. Later, Lewicki and Sejnowski [50] used a multivariate Gaussian approximation to the posterior around its maximum to derive another update rule,

    A \leftarrow A + \eta A \langle \gamma(\hat{s}) \hat{s}^T - I \rangle_x,    (2.15)

where the vector-valued function $\gamma$ is the gradient of the log-prior, $\gamma(s) = -\nabla \log p(s)$. The update rule in (2.15) does not require separate normalization steps, as the decay term keeps the dictionary matrix from diverging.
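A compact sketch of one learning iteration under the Laplacian prior: an iterative-shrinkage loop approximates the MAP encoding of (2.11)-(2.12), and (2.14) then updates the dictionary. The step sizes, dimensions, and the shrinkage-based inference are illustrative choices, not the algorithm of [57] verbatim:

```python
import numpy as np

def map_encode(x, A, lam=10.0, penalty=0.5, n_iter=200):
    """Approximate the MAP encoding of (2.11) by iterative shrinkage
    (Laplacian prior, so the sparsity cost is an l1 penalty)."""
    s = np.zeros(A.shape[1])
    step = 1.0 / (lam * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        grad = lam * A.T @ (A @ s - x)          # gradient of the quadratic term
        s = s - step * grad
        s = np.sign(s) * np.maximum(np.abs(s) - step * penalty, 0.0)  # shrink
    return s

def dictionary_step(A, X, eta=0.01, lam=10.0):
    """One update of (2.14), averaged over a batch X of training vectors."""
    G = np.zeros_like(A)
    for x in X:
        s_hat = map_encode(x, A, lam)
        G += np.outer(x - A @ s_hat, s_hat)     # e_hat s_hat^T
    return A + eta * lam * G / len(X)
```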
2.3 Iterative Reweighted Least Squares

The goal here is to determine the coefficient vector $s \in R^M$ so as to minimize the weighted norm of the error $e$ in $x = A s + e$. Let $W = S^T S$ be a weighting matrix. Then, we have

    \min_s e^T W e = \min_s e^T S^T S e.

Based on the weighted least squares method, we get [54]

    s = \left( A^T S^T S A \right)^{-1} A^T S^T S x.    (2.16)

Considering $\ell_p$ optimization, we obtain

    \min_s \|x - A s\|_p^p = \min_s \sum_{i=1}^{M} |x_i - (A s)_i|^p.    (2.17)

Let $s^*$ be the solution to this optimization problem. The problem in (2.17) can be rewritten with weighting coefficients as

    \sum_{i=1}^{M} w_i\, |x_i - (A s)_i|^2,

where $w_i = |x_i - (A s^*)_i|^{p-2}$. The solution cannot be found in one step, because $s^*$ is needed to compute the appropriate weights. With the iterative reweighted least squares (IRLS) method, the current solution is used to compute the weights for the next iteration. The FOCUSS algorithm [26] is an example of the IRLS method.
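A direct transcription of the reweighting loop; the small damping constant eps that keeps the weights finite near zero residuals is a standard safeguard and an assumption here, not part of the text above:

```python
import numpy as np

def irls(x, A, p=1.0, n_iter=20, eps=1e-8):
    """Iteratively reweighted least squares for min_s ||x - A s||_p."""
    s = np.linalg.lstsq(A, x, rcond=None)[0]     # ordinary LS start (w_i = 1)
    for _ in range(n_iter):
        r = np.abs(x - A @ s) + eps              # current residual magnitudes
        w = r ** (p - 2.0)                       # w_i = |x_i - (A s*)_i|^{p-2}
        Aw = A * w[:, None]                      # rows of A scaled by weights
        s = np.linalg.solve(A.T @ Aw, Aw.T @ x)  # weighted LS step, as in (2.16)
    return s
```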
Chapter 3: Sparse Music Representation with Source-Specific Dictionaries and Its Application to Signal Separation

3.1 Introduction

Humans are often able to recognize an individual sound in a complex acoustic environment. It was suggested in [49] that the human auditory system might have a highly efficient coding mechanism that extracts the meaningful structure of audio signals for perception and conveys the information to the brain compactly. Generally speaking, redundancy reduction plays an important role in mammalian perceptual processing [9]. Mathematically, we can view this problem as finding a sparse (or compact) representation of an audio signal, namely, a combination of a small number of functions taken from a set of elementary functions. We call these elementary functions atoms, and the set formed by these atoms a dictionary. The technique of sparse signal representation finds applications in numerous audio processing problems, such as audio structure analysis [12], automatic music transcription [6], and audio source separation [76].

The efficiency of a sparse representation depends on how well the atoms in a dictionary capture the salient features of the signal of interest. Methods to represent real-valued audio signals can be roughly classified into two categories: the orthogonal basis expansion and the overcomplete representation. Orthogonal bases, such as Fourier and wavelet bases, provide a complete representation of signals with finite energy. However, this approach does not guarantee a compact representation for certain types of signals [25]. With the overcomplete representation, we find a dictionary of atoms to span the signal space, where the dictionary may include more atoms than necessary, leading to a non-orthogonal, linearly dependent set. For instance, the Matching Pursuit (MP) algorithm [52] approximates signals using parameterized time-frequency atoms from an overcomplete dictionary. More recently, Lewicki and Sejnowski [50] presented an algorithm to learn overcomplete representations of sensory data using a probabilistic framework.

Our research is motivated by the need to extract music signals from background noise that includes speech and environmental sounds. To analyze an acoustic waveform consisting of a complex mixture of sounds, several unsupervised non-parametric algorithms based on a simple linear model were proposed previously, e.g., [12,16,66,71]. With this approach, the sound mixture is first decomposed under the assumption that the sources are statistically independent. Then, a clustering scheme is used to find a set of disjoint functions with respect to the source signals. However, these two tasks are nontrivial in real-world applications. Furthermore, since the non-parametric algorithms proposed in [16,66,71] operate on the magnitudes of short-time Fourier transform (STFT) coefficients, the phase information is lost.

In this work, we adopt the overcomplete representation approach for sparse musical signal approximation. The following two issues will be addressed. First, we find a source-specific dictionary tailored to music signals. As studied in [75], music signals have unique characteristics that differentiate them from speech and environmental sounds. For example, music signals tend to contain strong harmonic components, which can be used as prior knowledge for audio signal analysis [68,69]. Further evidence was given in [49], where optimal auditory filters were derived to maximize the amount of information about sounds using a set of statistically independent features; the derived auditory filters depend on the different sound classes. These properties can be exploited in the design of source-specific dictionaries. Second, although the overcomplete representation provides a concise representation of audio signals, it has one shortcoming: its computational complexity is high. It is desirable to reduce this complexity as much as possible.

To find a sparse representation for music signals, we build a source-specific dictionary that captures inherent music characteristics efficiently. There are three steps in the construction of a source-specific dictionary. First, we decompose basic components of musical signals (e.g., musical notes) into a set of source-independent atoms (i.e., Gabor atoms). Then, we prioritize these Gabor atoms according to their approximation capability for the music signals of interest. Third, we use the prioritized Gabor atoms to synthesize new atoms to build a compact dictionary. The number of atoms needed to represent music signals using the source-specific dictionary is much smaller than that of the Gabor dictionary, resulting in a sparse music representation. Experimental results are given to demonstrate the efficiency and applications of the proposed approach.

3.2 Review of Previous Work

In this section, we review two classes of elementary functions (or atoms) for signal representation: 1) source-specific atoms that can represent a specific class of sounds in a compact way, and 2) source-independent atoms that are used to approximate all classes of sounds. Both can be applied to signal separation, but in different ways.

3.2.1 Source-Specific Atoms

Optimal auditory filters for different classes of natural sounds were studied by Lewicki in [49]. It was observed that different classes of sounds have different characteristics and, as a result, the optimal auditory filters have different shapes. For example, efficient coding of music signals results in sinusoidal filters whose lengths extend over the entire analysis window, resembling a Fourier representation. In contrast, the coding of environmental sounds yields a set of filters that resemble a wavelet representation, where their amplitude envelopes are localized in time. Since speech signals share properties of music and environmental sounds (e.g., harmonic vowels and non-harmonic consonants), their coding yields a representation intermediate between those of music and environmental sounds.

Our research in [20,21] aimed to separate music signals from background noise such as speech and/or environmental sounds. Based on the observation in [49], we considered a source-specific representation of audio signals. It was assumed that a class of audio sounds can be represented in the following form:

    x = m + s,
It was assumed that a class of audio sounds can be represented of the following form: x=m+s; 23 wheremusicsignalm2S m andspeechsignals2S s asdepictedinFig.3.1(a). Themusic subspaceS m andthespeechsubspaceS s aresubsets of the universalaudio spacedenoted by U. It should be noted that since musical sound and speech have the characteristics of harmonic components in common, there exist some overlap between the two subspaces as shown in Fig. 3.1 (a). Speech Subspace Music Subspace m s Universal Audio Space, overlap (a) m s Universal Audio Space, (b) Figure3.1: Therepresentationsofamixtureofmusicandspeechsignalsusing(a)source- speci¯c atoms and (b) source-independent atoms. If a ¯nite set of atoms of a speci¯c subspace is known a priori such that S m =spanf' m i g i2¤m and S s =spanf' s i g i2¤s (3.1) with index sets ¤ m and ¤ s , we can extract the desired audio content by projecting the mixture onto the corresponding subspace via b m=P S m x and b s=P S s x; (3.2) where P Sm and P Ss represent the orthonormal projections onto the subspaces S m and S s , respectively. The main challenge of this approach is to ¯nd a ¯nite set of f' m i g i2¤m 24 in (3.1), which is highly correlated with the class of music signals of interest, yet uncor- related with other classes of sounds (i.e., speech and environmental sounds). A similar idea was adopted in [34] for single-channel source separation, where time- domain Independent Component Analysis (ICA) elementary functions were learned from a training dataset a priori and then used to separate unknown test sound sources. This method employed unsupervised learning from arbitrary musical sounds. Some of learned ICAfunctionsformusicsignalsmaybesharedbyspeech,whichcorrespondtotheoverlap region as illustrated in Fig. 3.1 (a). The atoms in the overlap region degrade the source separation performance. 3.2.2 Source-Independent Atoms An alternative approach to separate a source from a sound mixture is to use a set of source-independent atoms as illustrated in Fig. 3.1 (b), where the universal audio space is spanned by these atoms as U =spanfà i g i2¤ U with index set ¤ U . Source separation can be achieved as follows. First, the mixture is decomposed into source-independent atoms and their coe±cients. Then, these atoms are groupedaccordingtosomesimilaritycriteria. Afterwards,groupedatomsarerecombined to reconstruct source signals. For example, the magnitude spectrogram of a single-channel mixture can be decom- posed into the product of basis spectra and time-varying gains using the ICA [16,66] or the Nonnegative Matrix Factorization (NMF) [71]. Then, audio separation can be 25 accomplished by clustering the basis spectra into disjoint sets using statistical distance measure [16], instrument-speci¯c features [66], or original sources as the reference [71]. Finally, the phase information of the original source is used to re-synthesize time-domain estimates of the source [23]. There are however several challenges associated with the approach. The clustering process is a nontrivial task, and the phase has to be estimated for the re-synthesis process [23,71]. Forsourceseparationwithmultichannelobservations,theSparseComponentAnalysis (SCA) with overcomplete dictionaries was studied in [31]. Speci¯cally, multichannel mix- turesignalsaredecomposedinthetimedomainusingMPwithovercompletedictionaries. 
A similar idea was adopted in [34] for single-channel source separation, where time-domain Independent Component Analysis (ICA) elementary functions were learned from a training dataset a priori and then used to separate unknown test sound sources. This method employed unsupervised learning from arbitrary musical sounds. Some of the learned ICA functions for music signals may be shared by speech; these correspond to the overlap region illustrated in Fig. 3.1(a). The atoms in the overlap region degrade the source separation performance.

3.2.2 Source-Independent Atoms

An alternative approach to separating a source from a sound mixture is to use a set of source-independent atoms, as illustrated in Fig. 3.1(b), where the universal audio space is spanned by these atoms as

    U = \mathrm{span}\{\psi_i\}_{i \in \Lambda_U}

with index set $\Lambda_U$. Source separation can be achieved as follows. First, the mixture is decomposed into source-independent atoms and their coefficients. Then, these atoms are grouped according to some similarity criteria. Afterwards, the grouped atoms are recombined to reconstruct the source signals.

For example, the magnitude spectrogram of a single-channel mixture can be decomposed into the product of basis spectra and time-varying gains using ICA [16,66] or Nonnegative Matrix Factorization (NMF) [71]. Then, audio separation can be accomplished by clustering the basis spectra into disjoint sets using a statistical distance measure [16], instrument-specific features [66], or the original sources as references [71]. Finally, the phase information of the original source is used to re-synthesize time-domain estimates of the source [23]. There are, however, several challenges associated with this approach. The clustering process is a nontrivial task, and the phase has to be estimated for the re-synthesis process [23,71].

For source separation with multichannel observations, Sparse Component Analysis (SCA) with overcomplete dictionaries was studied in [31]. Specifically, multichannel mixture signals are decomposed in the time domain using MP with overcomplete dictionaries. Then, given a mixing matrix estimated from the spatial information between channels (or with the scatter plot technique in SCA), the set of decomposed atoms is partitioned according to the column vectors of the mixing matrix. In other words, the approach consists of decomposition followed by clustering, under a multichannel framework.

3.3 Source-Specific Representation for Music

Figure 3.2: Constructing music source-specific atoms and the corresponding dictionary from musical notes: time-domain note signals $s_k(t)$ from a music database are decomposed with Matching Pursuit over the Gabor dictionary, a source-specific subspace of Gabor atoms is selected, and source-specific atoms are synthesized to form the music dictionary $D$.

In this work, we adopt the approach of source-specific atoms and dictionaries and look for a sparse representation of recordings of harmonic musical instruments. The main challenge is to reduce the overlap region in Fig. 3.1(a) as much as possible.

Our basic idea for the sparse music representation is to exploit the facts that musical notes have harmonic structure and that most of the energy of note signals is concentrated on a small set of functions corresponding to that structure. We may use this set to represent music signals efficiently without including all functions in the dictionary. As depicted in Fig. 3.2, the proposed scheme consists of three major steps. These modules are detailed below.

3.3.1 Music Decomposition with Matching Pursuit

In the first module, we attempt to analyze the essential characteristics of a specific musical instrument from its audio waveforms. These waveforms can easily be obtained from a music database that contains various musical instruments. To analyze music signals, we adopt the overcomplete representation with redundant dictionaries. Its main advantage is that it enables a compact representation of complex signals by capturing essential signal characteristics with a small number of functions [25,50]. In contrast, although signal representation with orthonormal bases such as Fourier or wavelets is simple, music signals might not be efficiently represented by them in a compact (or sparse) manner [25].

In the following, we will show that, by using Matching Pursuit (MP) [52] with a redundant set of Gabor atoms to decompose musical signals, the signal energy spreads over a smaller number of atoms than with orthonormal bases, leading to a more compact representation. Gabor atoms are obtained by dilation, translation, and modulation of a mother function of the following form [52]:

    g_{s,u,\xi}(t) = \frac{1}{\sqrt{s}}\, g\!\left( \frac{t-u}{s} \right) e^{j \xi (t-u)},    (3.3)

where $s$, $u$, and $\xi$ are the scale, position, and frequency parameters, respectively. The energy of the atom $g_{s,u,\xi}(t)$ is concentrated in the neighborhood of time $u$ and frequency $\xi$, and is proportional to $s$ in time and $1/s$ in frequency. The Gabor dictionary can be expressed as $D_g = \{g_\gamma\}_{\gamma \in \Gamma_g}$, where the parameter vector $\gamma$ is drawn from an index set $\Gamma_g$.

Generally speaking, any representative audio waveforms from the same class of instruments can be used as the input to the first module in Fig. 3.2. In our implementation, we choose the $k$th musical note signal $s_k(t)$ as the representative one.
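Before walking through the decomposition, a discrete atom of (3.3) might be generated as follows; the Gaussian window and the particular parameter values are illustrative assumptions (the actual parameter grid is described in Sec. 3.5.1):

```python
import numpy as np

def gabor_atom(n_samples, s, u, xi):
    """Discrete unit-norm Gabor atom of (3.3) with a Gaussian window g."""
    t = np.arange(n_samples)
    window = np.exp(-np.pi * ((t - u) / s) ** 2)        # Gaussian envelope
    atom = window * np.exp(1j * xi * (t - u)) / np.sqrt(s)
    return atom / np.linalg.norm(atom)                  # enforce ||g|| = 1

# Example: scale 2^8 samples, centered at u = 512, normalized frequency 0.1.
g = gabor_atom(1024, s=2 ** 8, u=512, xi=2 * np.pi * 0.1)
```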
We begin by setting the initial residual equal to the input signal,

    R^0(t) = s_k(t).

At step $m$, MP chooses the $g_{\gamma_m}$ that maximizes the correlation with the residual $R^{m-1}(t)$,

    g_{\gamma_m} = \arg\max_{g_\gamma \in D_g} |\langle R^{m-1}, g_\gamma \rangle|.    (3.4)

Then, it calculates the new residual as

    R^m(t) = R^{m-1}(t) - \langle R^{m-1}, g_{\gamma_m} \rangle g_{\gamma_m}(t).    (3.5)

The note signal $s_k(t)$ can thus be decomposed into a linear combination of $M$ Gabor atoms chosen from the Gabor dictionary $D_g$ plus the residual term $R^M(t)$:

    s_k(t) = \sum_{m=1}^{M} \langle R^{m-1}, g_{\gamma_m} \rangle g_{\gamma_m}(t) + R^M(t).    (3.6)

Consequently, the chosen atoms $g_{\gamma_m}(t)$ with large magnitudes of $\langle R^{m-1}, g_{\gamma_m} \rangle$ can be viewed as coherent components of the musical note.

3.3.2 Atom Prioritization

With the decomposition of a signal $s(t)$ into $M \geq 1$ Gabor atoms, we build the following approximation:

    s(t) \simeq \sum_{m=1}^{M} a_m g_{\gamma_m}(t),    (3.7)

where $a_m$ is the coefficient for $g_{\gamma_m}$ such that $a_m = \langle R^{m-1}, g_{\gamma_m} \rangle$. It is observed that these atoms make different contributions to the signal representation.

The top subfigure of Fig. 3.3(a) shows a signal of clarinet note G4, which was obtained from the RWC Musical Instrument Sound Database [27] and downsampled to 11,025 Hz. The curve shown in Fig. 3.3(a) illustrates the decay of the magnitude correlation $|a_m|$ as a function of the iteration number $m$, which was obtained by decomposing the whole note signal using MP with Gabor atoms (the parameters of the Gabor atoms for the decomposition are described in detail in Sec. 3.5.1), as discussed in Sec. 3.3.1.

From the curve, we see that $|a_m|$ decays very fast for $m \leq 250$. These atoms capture the inherent characteristics of the note signal well, and the energy of the residual decreases quickly. Therefore, atoms with high energy correlation can be viewed as coherent components with respect to the given signal. On the other hand, $|a_m|$ decays slowly for $m > 250$, where the corresponding atoms have low correlation values and the residues behave similarly to random noise with no meaningful structure [52]. These later atoms no longer reflect signal properties but simply decrease the energy of the residual. In the second module of Fig. 3.2, we therefore select the coherent components (or atoms) from a note signal using high energy correlation.

Figure 3.3: (a) A note signal G4 of clarinet (top) and the decay of $|a_m|$ as a function of iteration number $m$ (bottom), and (b) the accumulation values of $c_m$, where the pre-defined threshold $\eta_p$ is set to 0.99.

The contribution of a Gabor atom $g_{\gamma_m}$ to the representation of the single note signal can be measured by the normalized squared magnitude correlation, defined as

    c_m = \frac{|a_m|^2}{\sum_{m'=1}^{M} |a_{m'}|^2}.    (3.8)

Specifically, we compute the accumulation values of the normalized squared magnitude correlation by sorting the $c_m$ in descending order. Fig. 3.3(b) shows the accumulation values of $c_m$ with respect to the number of atoms, where the Gabor atoms chosen in the decomposition of the note signal are sorted by their normalized squared magnitude correlation. Then, we select as coherent components the Gabor atoms that satisfy the condition

    \sum_{k=1}^{m} c_k \leq \eta_p,    (3.9)

which forms a small set of Gabor atoms, i.e., a subdictionary $\{g_\gamma\}_{\gamma \in \Gamma}$, where $\Gamma \subset \Gamma_g$. The subdictionary captures the harmonic structure of the note signal well.
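The selection rule in (3.8)-(3.9) amounts to keeping the smallest set of atoms whose sorted normalized energies accumulate to $\eta_p$; a sketch, given the MP coefficients $a_m$:

```python
import numpy as np

def prioritize_atoms(a, eta_p=0.99):
    """Indices of coherent atoms per (3.8)-(3.9), given MP coefficients a."""
    c = np.abs(a) ** 2 / np.sum(np.abs(a) ** 2)   # normalized energies c_m
    order = np.argsort(c)[::-1]                   # sort in descending order
    keep = np.cumsum(c[order]) <= eta_p           # accumulate up to eta_p
    return order[keep]
```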
From experiments with various musical instruments, we found that setting the threshold $\eta_p$ between 0.95 and 0.99 yields good performance in capturing the inherent characteristics of their notes. In Fig. 3.3 (b), we select 385 Gabor atoms by setting the threshold $\eta_p$ to 0.99. It should be emphasized that, since some chosen Gabor atoms may share the same scale and frequency parameters but differ in time shift during the decomposition, the final set $\{g_\gamma\}_{\gamma\in\Gamma}$ contains fewer atoms than 385, e.g., 79 in this experiment. (In this work, we take into account only two parameters, i.e., scale and frequency, when constructing a Gabor dictionary. For the translation-invariant property of atoms in time and frequency, we use a fast Fourier transform to compute all scalar products with shifted atoms when finding the best atom. More details are given in Sec. 3.5.1.)

Furthermore, we use $L$ signals, corresponding to variations of a musical note, to compute the union of the $L$ subdictionaries, each of which is obtained from one variation of the note. For example, the RWC Musical Instrument Sound Database provides three variations for each note of a specific musical instrument. Each variation features an instrument from a different manufacturer played by a different musician, which provides a large variety of sounds. Thus, a collection of the subdictionaries $\mathcal{D}_l$ for the $k$th note of a specific musical instrument can be determined as

$\mathcal{D}_k = \bigcup_{l=1}^{L} \mathcal{D}_l$,   (3.10)

where $\mathcal{D}_k = \{g_\gamma\}_{\gamma\in\Gamma_k}$ and $\Gamma_k$ is a finite set of indices $\gamma$. The atoms determined by (3.10) are called dominant or prioritized atoms; they represent the characteristics of the note signal. By repeating the above process for various note signals, we can form a prioritized dictionary, i.e., a collection of $\mathcal{D}_k$, $k = 1,\dots,K$, where $K$ is the number of notes of a specific instrument.

For single-channel separation of music from background sounds, we assume that each class of audio sounds can be represented by a prioritized dictionary. If the prioritized dictionaries of the various sources have little overlap with each other, i.e., a small number of commonly shared atoms, separating the music sources by projecting the mixture onto the music prioritized dictionary is easily accomplished. On the other hand, if the overlap is significant, extracting the desired music source signals from the mixture is less trivial, and source separation performance degrades significantly due to the overlap region between the audio sources.

Consider a simple example where a test signal is generated by mixing samples of a pitched musical instrument and a drum sequence (the analyzed signals correspond to excerpts from an 8-note melodic recording of clarinet [47] and an acoustic drum sequence [1]). To determine the characteristics of their prioritized atoms in the parameter space, we choose the center frequency of atoms from an interval of normalized frequencies $[0, 0.5]$, and the logarithmic scale of atoms from $2^n$, where $n = 0,\dots,8$.

Figure 3.4: Comparison of prioritized atoms distributed in the parameter space $(\xi, s)$: (a) 319 atoms for the clarinet sound, (b) 602 atoms for the drum sequence, and (c) overlapping atoms between the prioritized dictionaries $\mathcal{D}_{p_c}$ and $\mathcal{D}_{p_d}$, with a total number of 61. The scale parameter is logarithmic to the base 2, and each dot represents the center frequency and the scale parameter of a prioritized atom.
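Reusing the helper functions of the previous sketch, the union in (3.10) over the note variations could be computed as follows (an illustrative sketch, not the actual implementation):

```python
def note_subdictionary(variations, D, n_iter=1000, eta_p=0.99):
    """Union of the prioritized subdictionaries over the L variations of a
    note (Eq. 3.10), reusing matching_pursuit() and prioritize() above."""
    D_k = set()
    for s_k in variations:                       # e.g. L = 3 recordings of one note
        coeffs, indices, _ = matching_pursuit(s_k, D, n_iter)
        D_k |= prioritize(coeffs, indices, eta_p)
    return D_k                                   # index set Gamma_k
```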
Fig. 3.4 (a) and (b) show the distributions of the prioritized atoms for the clarinet, $\mathcal{D}_{p_c} = \{g_\gamma\}_{\gamma\in\Gamma_c}$, and the drum, $\mathcal{D}_{p_d} = \{g_\gamma\}_{\gamma\in\Gamma_d}$, respectively. We see that atoms of the clarinet sound are distributed in the lower frequency region, as shown in Fig. 3.4 (a), while atoms of the drum sequence are distributed over a wider range of frequency and scale components, as illustrated in Fig. 3.4 (b). The overlapping atoms between the prioritized dictionaries $\mathcal{D}_{p_c}$ and $\mathcal{D}_{p_d}$ are shown in Fig. 3.4 (c). Note that $\Gamma_c \cap \Gamma_d \ne \emptyset$. The example implies that separating the clarinet signal from the mixture using $\mathcal{D}_{p_c}$ may also extract drum components due to the shared atoms. The overlapping atoms come from the low frequency region, in which significant overlap occurs when the two sounds are mixed together. Note that the overlapping atoms represent the overlap region shared by the two different subspaces, as illustrated in Fig. 3.1 (a). As one simple way to improve the separation performance, we may cluster atoms that come from different source signals into disjoint sets using a priori information, as discussed in Sec. 3.2.2. In this work, we instead exploit the harmonic structure of audio sounds to address the overlapping issue, as discussed in Sec. 3.3.3.

3.3.3 Source-Specific Atoms and Dictionaries

In Sec. 3.3.2, we noted that the prioritized dictionaries of different instruments may yield degraded separation performance due to the overlap region between them. In the last module, we address this problem by re-organizing prioritized atoms through linear combinations so as to generate a set of new atoms called source-specific atoms.

The harmonic features of musical instruments enable redundancy reduction and compact representation of source signals with a small number of atoms. For instance, non-percussive music sounds usually consist of a limited number of musical notes (i.e., 12 notes in each octave range). This implies that most of the energy of music signals is concentrated on a small set of atoms. Mathematically, we can express the source-specific atom as a linear combination of prioritized atoms of the form

$h_k(t) = \sum_{\gamma_m \in \Gamma_k} \alpha_m g_{\gamma_m}(t)$,   (3.11)

where $\alpha_m$ is the weighting coefficient reflecting the importance of $g_{\gamma_m}$ in the decomposition of note signal $s_k(t)$.

It is worthwhile to comment on the difference between the $k$th musical note signal $s_k(t)$ and the source-specific atom $h_k(t)$. They are close but not identical. Taking the piano as an example, the audio waveforms of the same note signal from different pianos still vary, which is called intra-class variation. Furthermore, the matching pursuit decomposition of $s_k(t)$ yields a large number of Gabor atoms, including many with very small weights. In contrast, atom $h_k(t)$ is a more robust representation, whose synthesizing Gabor atoms must come from the prioritized subdictionary. Thus, we may view source-specific atom $h_k(t)$ as a filtered version of musical note signal $s_k(t)$, where the filtering process is used to increase its robustness against intra-class variation. Note that all prioritized atoms have the same time localization and that the new atom is normalized such that $\|h_k(t)\|_2 = 1$. The entire set of atoms $h_k$, $1 \le k \le K$, forms a source-specific dictionary denoted by $\mathcal{D}$.
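A minimal sketch of the synthesis step in (3.11), assuming the prioritized atoms and their weights $\alpha_m$ are supplied in matching order (names and the calling convention are illustrative):

```python
import numpy as np

def synthesize_source_atom(D, gamma_k, alphas):
    """Source-specific atom h_k (Eq. 3.11): a weighted sum of the prioritized
    Gabor atoms indexed by gamma_k, normalized so that ||h_k||_2 = 1."""
    idx = list(gamma_k)
    h = (np.asarray(alphas)[:, None] * D[idx]).sum(axis=0)   # sum of alpha_m * g_{gamma_m}
    return h / np.linalg.norm(h)
```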
For any $M \ge 1$, an approximation of an arbitrary music signal $s(t)$ can be calculated as a linear combination of the new source-specific atoms:

$s(t) \approx \sum_{m=1}^{M} a_m h_{\gamma_m}(t)$,   (3.12)

where $a_m = \langle R_{m-1}, h_{\gamma_m}\rangle$ is the coefficient of $h_{\gamma_m}$, computed at the $m$th step with the standard matching pursuit technique. The synthesized source-specific atoms will be used for the approximation of real music sounds in Sec. 3.5.2 and for the separation of music signals from a single-channel mixture in Sec. 3.5.3. We will give examples of source-specific dictionaries and show that the time-frequency representation of each new atom has the well-organized harmonic structure of the corresponding musical note in Sec. 3.5.1.

We should point out that a similar idea was proposed in [30] to create the so-called harmonic dictionary. There, a signal model was used that assumes a priori knowledge of the musical harmonics, i.e., the harmonicity $\xi_r = r\xi_0$, $1 \le r \le R$, between the frequency $\xi_r$ of the $r$th overtone partial and the fundamental frequency $\xi_0$, where $r$ is an integer and $R$ is the number of partials. In that model, each partial of the musical harmonics is represented by one Gabor atom, and a harmonic atom has essentially $R$ peaks in its Fourier transform, located around the frequencies $\xi_r$ with a common width of the order of $1/s$. Recently, Leveau et al. [48] proposed a mid-level music representation for musical instrument recognition, where instrument-specific harmonic atoms are designed especially to obtain timbre information by learning the amplitudes of harmonic partials from individual notes. In contrast, we need no harmonicity assumption to construct source-specific atoms, which get the pitch and the strength of the overtone partials from isolated notes.

3.4 Application to Music Signal Separation

In this section, the source-specific representation discussed in Sec. 3.3 is applied to the music signal separation problem, where a music signal is mixed with different background sounds such as speech or environmental sounds. Here, we assume that the musical instruments performed are known a priori. Let us assume that an audio mixture signal $x(t)$ consists of music signal $m(t)$ and background sound $b(t)$:

$x(t) = m(t) + b(t)$.   (3.13)

We would like to extract the music signal from the observed mixture $x(t)$. Since a music signal tends to have a broad spectrum, its time-frequency representation often overlaps with that of the background sound. Due to this overlap, it is more difficult to impose the sparsity assumption in the time-frequency plane (i.e., that only one source is active at each time-frequency point) [74]. It is well understood that violating the sparsity assumption results in distortion, known as the musical noise artifact, in the estimated sources. Here, we use the source-specific representation technique to extract the music signal by selecting the best approximating atoms, which have stronger energy correlation with the music signal than with the background sound in the mixture.

To this end, we project the mixture signal $x(t)$ onto atoms from a music source-specific dictionary $\mathcal{D}$. After initialization by setting $R_0(t) = x(t)$ in the matching pursuit technique, we perform the following computation at the $m$th step:

1. Compute $|\langle R_{m-1}, h_\gamma\rangle|$ for all $h_\gamma \in \mathcal{D}$.

2. Select a (near) best atom of the dictionary, which yields the maximum projection value:

$h_{\gamma_m} = \arg\max_{h_\gamma \in \mathcal{D}} |\langle R_{m-1}, h_\gamma\rangle|$.   (3.14)

3. Compute the new residual according to

$R_m(t) = R_{m-1}(t) - \langle R_{m-1}, h_{\gamma_m}\rangle h_{\gamma_m}(t)$.   (3.15)

Note that musical notes can be easily identified from the mixture by (3.14), since they yield large projection values. After $M$ steps, the enhanced music signal can be reconstructed as a weighted sum of the source-specific atoms chosen from $\mathcal{D}$:

$\hat{m}(t) \approx \sum_{m=1}^{M} \langle R_{m-1}, h_{\gamma_m}\rangle h_{\gamma_m}(t)$,   (3.16)

where $\langle R_{m-1}, h_{\gamma_m}\rangle$ is the gain of $h_{\gamma_m}$ at the $m$th step. Finally, the residual background signal can be obtained as $\hat{b}(t) = x(t) - \hat{m}(t)$.
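The separation loop (3.14)-(3.16) can be sketched as follows, assuming the source-specific dictionary is stored as an array H of unit-norm atoms; the fixed iteration count stands in for the energy-ratio stopping rule used in Sec. 3.5 (an illustrative sketch only):

```python
import numpy as np

def extract_music(x, H, n_iter):
    """Music extraction with a source-specific dictionary (Eqs. 3.14-3.16).
    H is an (n_atoms, N) array of unit-norm source-specific atoms h_k."""
    residual = x.astype(float).copy()
    m_hat = np.zeros_like(residual)
    for _ in range(n_iter):
        corr = H @ residual
        k = int(np.argmax(np.abs(corr)))         # best atom, Eq. (3.14)
        m_hat += corr[k] * H[k]                  # accumulate music estimate, Eq. (3.16)
        residual -= corr[k] * H[k]               # residual update, Eq. (3.15)
    return m_hat, x - m_hat                      # music and background estimates
```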
3.5 Experimental Results

In the experiments presented in this section, source-specific atoms were obtained for each musical instrument from isolated notes in three music databases: the RWC Musical Instrument Sound Database [27], the McGill University Master Samples Library [58], and the University of Iowa Musical Instrument Samples [5]. These acoustic databases cover various musical instruments, providing individual notes of a specific instrument over the entire range of tones that the instrument can produce. Test signals were chosen from several excerpts of audio sounds: anechoic recordings of solo performances of monophonic and polyphonic instruments [47], speech signals [70], and environmental sounds [2]. Mixture signals were generated by mixing samples of one or more pitched musical instrument sources with speech or environmental sounds. All sounds used in the experiments were mono and downsampled to 11,025 Hz.

To measure the quality of a reconstructed sound with respect to the original one, the source-to-distortion ratio (SDR), expressed in decibels (dB), was used [67]. SDR is a global performance measure that accounts for three types of distortion: artifacts, interference, and noise. A higher value indicates a better reconstruction scheme with less distortion.

3.5.1 Learning Musical Structure and Source-Specific Atoms

This subsection illustrates how to analyze the harmonic structure of music (i.e., decomposition of musical note signals using MP with Gabor atoms, followed by prioritization) and how to construct music source-specific atoms, as discussed in Sec. 3.3. We then present results corresponding to each module in Fig. 3.2.

The discrete multiscale Gabor dictionary $\mathcal{D}_g$ is the collection of atoms $g_{s,u,\xi}$ in (3.3) such that $(s, u, \xi) = (a^j, n a^j \Delta u, k a^{-j} \Delta\xi)$, $j, n, k \in \mathbb{Z}$, where $\Delta u$ and $\Delta\xi$ are constants. To analyze real-valued music signals, real Gabor atoms are usually used to construct the discrete Gabor dictionary instead of the complex-valued atoms in (3.3). We begin with a source-independent dictionary consisting of real Gabor atoms of the following form [37,52]:

$g_{s,u,\xi,\phi}(t) = K_{s,u,\xi,\phi}\, g\!\left(\frac{t-u}{s}\right) \cos(2\pi\xi(t-u) + \phi)$,   (3.17)

where $g(t) = \frac{1}{\sqrt{s}} e^{-\pi t^2}$, and the normalizing constant $K_{s,u,\xi,\phi}$ is chosen such that $\|g_{s,u,\xi,\phi}\|_2 = 1$. Atoms are characterized by their scale $s$, time position $u$, frequency $\xi$, and phase $\phi$. A truncated Gaussian envelope $g(t)$ is used to generate real Gabor atoms with the parameters $s$, $u$, and $\xi$:

• The scale $s$ varies between 1 and the atom length $N$, i.e., $1 \le 2^n \le N$, where 2 is the base and $n \in \mathbb{Z}$. In our experiments, $N = 1024$ samples (or 92.8 ms) and the atoms have dyadic scales. Among these scales, the largest three are selected after the decomposition of a musical note to learn the music structure.

• For the learning process, 800 different frequencies are used, spread uniformly over the interval of normalized frequencies $[0, 0.5]$ with a step of $\Delta f = (f_{0.5} - f_0)/800$.

• The phase $\phi$ is set to 0.
• The Gabor dictionary is often built by considering the scale, frequency, and time-shift parameters so that its atoms can be translated to any place in the (residual) signal. However, to reduce the computational complexity, an effective MP implementation was proposed in [37], which uses the Fast Fourier Transform (FFT) to compute all scalar products with shifted atoms. Here, we build the Gabor dictionary with only two parameters (scale and frequency) while setting the time shift $u$ to $N/2$ for all atoms. When MP selects the best atom in the decomposition of a residual signal, we use the FFT to compute the time shift $u$ of an atom from the correlation between the atom $g(t)$ and the residual signal $r(t)$. That is, we calculate $u = \arg\max |\mathrm{FFT}^{-1}\{R \star G\}|$, where $\star$ denotes component-wise multiplication between two vectors, and $R$ and $G$ are the FFTs of $r(t)$ and $g(t)$, respectively. (The approach is similar to estimating the cross-correlation between a signal and its time-delayed version to determine the time delay [42].) Thus, we can find the best time positions of all atoms in the Gabor dictionary and select the best atom for the residual signal. This results in a Gabor dictionary of size $|\mathcal{D}_g| = 8000$ without taking all possible time shifts of Gabor atoms into account, i.e.,

$|\mathcal{D}_g| = N_s \times N_\xi$,   (3.18)

where $N_s$ and $N_\xi$ are the number of scales and the number of frequency bins, respectively.

After analyzing three variations of each note using MP with Gabor atoms, we find prioritized atoms from the decomposition of the signals, as discussed in Sec. 3.3.2. We set $\eta_p$ in (3.9) to 0.99, empirically, and then use the music signal model in (3.11) to synthesize a source-specific atom from the set of prioritized atoms.

Figure 3.5: (a), (b), and (c) show atoms obtained from three variations of clarinet note G4, and (d) is the synthesized atom corresponding to clarinet note G4, where each top subfigure presents the time-domain waveform and each bottom subfigure shows its time-frequency representation.

In Fig. 3.5 (a), (b), and (c), we present three atoms and their time-frequency representations, obtained from three variations of clarinet note G4. For comparison, a set of prioritized atoms in $\mathcal{D}_l$, obtained from one variation of the note, is used to synthesize a new atom with the music signal model. With the three sets of prioritized atoms, we determine a prioritized subdictionary for the note, $\mathcal{D}_{G4} = \bigcup_{l=1}^{3}\mathcal{D}_l$, which consists of a total of 221 Gabor atoms. Fig. 3.5 (d) shows the source-specific atom synthesized using $\mathcal{D}_{G4}$, as discussed in Sec. 3.3.3. The time-frequency representation of the source-specific atom illustrates that it captures the inherent harmonic structure of the note well. Note that each partial of the note, including its fundamental frequency, is not represented by one Gabor atom but rather by several Gabor atoms with different scales.
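The FFT-based search for the best time position described above can be sketched as follows. Note the conjugate on one spectrum, which implements the correlation; this detail is implicit in the component-wise product $R \star G$ of the text, so the sketch reflects one reasonable reading rather than the exact implementation:

```python
import numpy as np

def best_time_shift(residual, atom):
    """Find the best time position of an atom by computing its (circular)
    cross-correlation with the residual for all shifts at once:
    u = argmax |IFFT(R * conj(G))|."""
    R = np.fft.fft(residual)
    G = np.fft.fft(atom, n=len(residual))        # zero-pad atom to signal length
    xcorr = np.fft.ifft(R * np.conj(G))          # correlation value for every shift u
    u = int(np.argmax(np.abs(xcorr)))
    return u, float(np.abs(xcorr[u]))
```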
Figure 3.6: Source-specific atoms synthesized using different scales: (a) scale 10, $|\mathcal{D}_{s10}| = 33$; (b) scale 9, $|\mathcal{D}_{s9}| = 25$; (c) scale 8, $|\mathcal{D}_{s8}| = 25$; and (d) scales 8, 9, and 10 combined, $|\mathcal{D}_{G4}| = 83$. Note that $|\cdot|$ denotes the number of prioritized atoms.

Fig. 3.6 (a), (b), and (c) show the time-domain waveforms and time-frequency representations of an atom obtained from clarinet note G4 at different scales, where $N = 1024$ samples and the largest three scales, i.e., 8, 9, and 10, are selected for comparison. Each waveform is normalized to unit energy. We observe a clear harmonic structure at all scales. The first feature is the different duration of the partials in time, e.g., the shortest duration at scale 8. The other feature is the different width of the order of $1/s$, e.g., the thickest width at scale 8. We observe in Fig. 3.6 (a) that the energies of the first three partials, including the fundamental frequency, are similar to each other. On the other hand, the first and third partials in Fig. 3.6 (c) have more energy than the second partial. This information is important in synthesizing the new atoms for a specific musical instrument. For instance, Fig. 3.8 (a) shows that the energy of the second partials of clarinet notes below atom index 20 is weaker than the energy of the fundamentals and the third partials; the same is observed in the spectrogram of a real clarinet sound in Fig. 3.11 (a). In addition, we observe in Fig. 3.6 (c) that several harmonic partials in the high frequency region are missing. Thus, one scale alone is not enough to represent the characteristics of a note in our work. However, the missing information can be compensated by information from other scales, as shown in Fig. 3.6 (d), in which the Fourier transform of the atom has several peaks located around the harmonic frequencies. Note in Fig. 3.6 (d) that each overtone partial, including the fundamental frequency, consists of one or more prioritized atoms, which yields the relative amplitudes between the partials.

Fig. 3.7 shows source-specific atoms obtained from two note signals of the trumpet and their time-frequency representations. The two isolated notes were C4 and C6, obtained from the McGill University Master Samples Library. In these examples, given a Gabor dictionary of size 8000, 132 atoms for note C4 and 36 atoms for note C6 were selected as their prioritized atoms. There are 13 overlapping atoms between the two sets of prioritized atoms.

Figure 3.7: Source-specific atoms corresponding to notes C4 ($f_0$ = 261 Hz) and C6 ($f_0$ = 1046 Hz) of the trumpet and their time-frequency representations, where $f_0$ denotes the fundamental frequency of the notes.

With source-specific atoms, we do not consider each individual Gabor atom but view the entire waveform of a note as one object (i.e., one atom per musical note).
For instance, given a mixture of trumpet note C4 and trumpet note C6, MP can extract the source component that comes from trumpet note C4 by choosing the source-specific atom shown in Fig. 3.7 (a) as the best atom.

Figure 3.8: Spectra of source-specific atoms for (a) the clarinet and (b) the piano, where each column represents an atom.

Fig. 3.8 (a) illustrates the whole source-specific dictionary for the clarinet, where 40 individual notes of the clarinet were used to create the corresponding source-specific atoms. All time-domain atoms in the dictionary were transformed to the Fourier transform magnitude domain, i.e., $H_k = |\mathrm{FFT}\{h_k(t)\}|$, where the $h_k(t)$ are source-specific atoms and $1 \le k \le 40$. Thus, the x-axis in Fig. 3.8 represents the atom index and the y-axis shows the frequency content of each atom. Fig. 3.8 (b) presents a source-specific dictionary for the piano, which consists of 50 atoms obtained from 50 individual notes of the piano.

Figure 3.9: Illustration of various source-specific atoms: (a) atoms obtained from clarinet note G4 (top), piano note G4 (middle), and trumpet note G4 (bottom), and (b) the spectra corresponding to these atoms.

To illustrate the difference between source-specific atoms obtained from various instruments, we used one note signal, G4, from the clarinet, the piano, and the trumpet to construct their source-specific atoms as discussed in Sec. 3.3. The resultant atoms are shown in Fig. 3.9 (a). For comparison, the time-domain atoms were transformed into the Fourier transform magnitude domain, as shown in Fig. 3.9 (b). We observe that the pitches of these three atomic spectra are almost identical in frequency, but with different strengths among their overtone partials. The source-specific atoms are constructed by analyzing the pitch and the strength of the overtone partials from isolated notes of a specific instrument, not the time evolution of the coarse spectral energy distribution of a sound. This information constitutes the timbre, and it is closely related to the recognition of musical instruments [41]. Note that the information on the partials' amplitudes may be used to represent distinct characteristics of an instrument.

In the following subsections, we evaluate the potential of the proposed source-specific representation in music signal approximation and separation.

3.5.2 Approximation Capability

In the experiments, we employed the classic greedy algorithm on a frame-by-frame basis, where the frame size was set to the atom length $N$ in a given experiment. A non-overlapping rectangular moving window of unit height (namely, the overlap between frames was zero) was used. The algorithm was iterated until a desired level of accuracy was achieved, in terms of the energy ratio between the original signal $x(t)$ and the current maximum correlation of the best atom $\langle R_{m-1}, g_{\gamma_m}\rangle$, defined as $\|\langle R_{m-1}, g_{\gamma_m}\rangle\|_2 / \|x(t)\|_2$, where $\|\cdot\|_2$ is the Euclidean norm. Here, the relative energy ratio was set to 0.01.

To evaluate the performance of the proposed approach in capturing the inherent harmonic structures of music sounds, a solo clarinet audio signal and a polyphonic piano sound were approximated using source-specific atoms obtained from individual notes of the instruments.
Figure 3.10: Approximation capability of the proposed music representation with source-specific dictionaries: (a) a real clarinet sound and (b) a real piano sound. The original signal (top), the reconstructed signal (middle), and the SDR values as a function of the number of approximating atoms (bottom).

The top subfigure of Fig. 3.10 (a) shows an excerpt of a recording of a solo clarinet piece consisting of seven different notes, obtained from monophonic sounds (CD-quality, 16-bit, 44,100 Hz) [47] and downsampled to 11,025 Hz. The original audio signal was approximated using the source-specific dictionary for the clarinet. To evaluate the approximation performance, the SDR value as a function of cumulative source-specific atoms is plotted in the bottom subfigure of Fig. 3.10 (a), where we accumulate the source-specific atoms corresponding to the notes in the clarinet signal to show that these atoms can approximate the original signal efficiently. We see that, after seven atoms from the clarinet source-specific dictionary corresponding to the seven different notes of the real clarinet signal are used, the SDR of the reconstructed signal saturates around 19 dB. This means that seven atoms are enough to capture most of the energy of the original clarinet signal. The middle subfigure of Fig. 3.10 (a) shows the reconstructed signal obtained using only these seven atoms from the dictionary.

Similarly, we used a real piano sound, Bach's Invention BWV 772 [3], to test the approximation performance. The excerpt consists of a 12-note melodic recording of piano. We observe from Fig. 3.10 (b) that 12 different source-specific atoms for the piano can approximate the original signal efficiently, resulting in a reconstructed signal at 17 dB.

It is well known that the dictionary size in an iterative greedy algorithm such as MP has a great impact on the computational complexity [37]. Two major factors affect the dictionary size, as shown in (3.18): the range of scales ($N_s$) and the resolution of frequency bins ($N_\xi$). Note that the source-specific dictionary for a specific musical instrument always has the same number of atoms due to the fixed number of musical notes in the music database (e.g., $|\mathcal{D}| = 40$ for the clarinet, where $|\cdot|$ returns the cardinality of a set, when a total of 40 clarinet notes is adopted as training sounds). In contrast, a dictionary used in MP usually needs about $10^3$ to $10^5$ Gabor atoms to represent the clarinet sounds.

3.5.3 Music Signal Separation

We apply the proposed source-specific representation approach to music signal separation, as discussed in Sec. 3.4. In the experiments, the iterative greedy algorithm was employed with source-specific dictionaries frame by frame, where each frame was set to the same length as the atom size (i.e., 1024 samples) without any overlap. The algorithm was iterated until a target energy ratio between the original signal and the maximum correlation of the best atom was reached, which was set to 0.01.

Two different approaches to music signal separation were discussed in Sec. 3.2. One adopts source-independent atoms, while the other exploits source-specific atoms determined by a training process.
Two source-independent algorithms were chosen for comparison with our source-specific algorithm: ISA [16,66] and NMF [44]. As discussed in Sec. 3.2.2, we used them for spectral decomposition in the short-time Fourier transform magnitude domain to obtain basis spectra and their gains. Clustering was then performed to achieve source separation. Finally, the phase information of the original sources was used to re-synthesize the time-domain source estimates. NMF was tested with two different cost functions, minimizing the Euclidean distance (NMFEUC) and the divergence (NMFDIV) [44]. ISA and NMF were tested by factoring the magnitude spectrogram of the mixture signal into 40 independent components.

Table 3.1: Comparison of music signal separation between source-independent and source-specific representation approaches. Reconstructed signals in SDR (dB).

  Representation       Algorithm  Clustering  Music $\hat{m}(t)$  Background $\hat{b}(t)$
  Source-independent   ISA        k-means     -2.41               0.58
                       NMFEUC     k-means      2.89               1.44
                       NMFDIV     k-means      3.29               1.75
                       NMFEUC     manual      10.24               7.60
                       NMFDIV     manual      13.65               3.98
  Source-specific      SSD        -           10.77              10.54

The mixture was generated by mixing a solo clarinet recording and a speech utterance. The original SDR of the mixture signal without any processing is -0.062 dB. For the automatic clustering in ISA, NMFEUC, and NMFDIV, the standard k-means algorithm with the symmetric Kullback-Leibler metric was used, and the best results were selected after repeating the experiments 50 times. The quality of the reconstructed signals under the different algorithms is compared in Table 3.1. Note that NMFDIV yields slightly better results than NMFEUC and ISA. The poor performance of the algorithms based on source-independent atoms is probably due to poor results of the automatic clustering. Manual clustering after spectral factorization gives better results than automatic clustering, which implies that it is a non-trivial task to cluster the basis spectra into disjoint sets with respect to all underlying source signals. As the number of independent components in the factorization increases, performance might improve, but manual clustering would become too troublesome and unreliable. On the other hand, even though the proposed scheme assumes that the musical instruments performed are known a priori, it needs no clustering or re-synthesis process, thanks to time-domain source-specific atoms tailored to the musical harmonic structure. We see from Table 3.1 that the source-specific dictionary (SSD) approach provides good results for both the reconstructed music and the background.

Table 3.2: Music signal separation examples for solo musical instrument sounds (SDR in dB).

  Original signals     white noise  speech  drum  street  train
  clarinet             14.03        10.77   7.78  12.96   12.33
  piano (polyphonic)    8.99         6.90   4.53   8.07    4.63
  violin (vibrato)      6.60         5.18   1.12   5.93    3.81

Next, we tested a wider range of audio sounds, including musical instrument sounds, speech, and environmental sounds. The results are shown in Table 3.2, where each value gives the quality of the reconstructed music sound with respect to the original one in SDR (dB). For example, the separation performance for the clarinet signal was measured as 10.77 dB from a mixture of clarinet and a speech utterance. Environmental sounds are broadband and non-harmonic, and they typically consist of a mixture of ambient sounds. For example, the street sound was collected outdoors on a city street and contains sounds of moving cars and human speech in the background.
Thus, environmental sounds have unstructured characteristics in the time-frequency domain. Due to the very different signal structures of environmental sounds and pitched harmonic ones, the proposed SSD approach can offer good SDR values for the estimated pitched harmonic sources. As shown in Table 3.2, the proposed SSD approach demonstrates good performance for the woodwind instrument, environmental sounds, and white noise (the signal-to-noise ratio of the music signal in white noise was 5 dB), except for the cases of polyphonic piano and violin with vibrato. The poor performance on piano sounds could be attributed to their rich sound model, including attack, decay, sustain, and release (known as the ADSR model [60]), as compared with woodwind instruments. Violin is usually played with vibrato, which has a frequency and amplitude modulation effect and results in poor source separation performance.

Fig. 3.11 (a) presents the spectrograms of the real clarinet sound in Fig. 3.10 (a), which consists of seven different notes, and of its reconstructed signal. The original clarinet sound was approximated by only seven source-specific atoms, as discussed in Sec. 3.5.2. The harmonic structure of the clarinet sound is successfully extracted by these atoms, as shown in the spectrogram of the reconstructed signal. Fig. 3.11 (b) shows the spectrograms of a mixture of clarinet and a male speech utterance, and of the separated clarinet signal. Since the speech signal has a different structure from that of the clarinet sound, we can successfully extract the harmonic structure of the clarinet sound from the mixture signal.

Figure 3.11: Spectrograms of signals. (a) Approximation: a real clarinet sound (top) and its reconstructed signal (bottom). (b) Signal separation: a mixture of clarinet and male speech (top) and the extracted clarinet signal (bottom).

Table 3.3: Music signal separation examples for multiple musical instrument sounds. Reconstructed signals in SDR (dB).

  Mixture                                 Music $\hat{m}_1(t)$  Music $\hat{m}_2(t)$
  clarinet (m1) + piano (m2)              3.51                  1.65
  piano (m1) + violin (m2)                2.19                  0.29
  clarinet (m1) + piano (m2) + speech     2.74                  0.37
  clarinet (m1) + piano (m2) + train      2.70                  1.24

Furthermore, we consider a more challenging case with multiple musical instrument recordings. Here, solo recordings of polyphonic piano and violin with vibrato were used for the mixtures. The results are shown in Table 3.3, where two musical instruments are played simultaneously to generate single-channel mixtures. Compared with the results in Table 3.2, the separation performance degrades due to the similar musical harmonic structure of the pitched musical instruments.

Figure 3.12: The configuration of loudspeakers and a microphone in a room, where recordings were simulated with an absorption coefficient of 0.6 for the room's surfaces. (Diagram labels: loudspeakers at ±30°, height 1.6 m; omni-directional microphone, height 1.6 m; room height 3 m; distances of 1 m, 2 m, 2.5 m, 4 m, and 5 m.)

Finally, we provide an example of music signal separation in a room acoustic environment. We conducted experiments with simulated room recordings of several sources using the room configuration shown in Fig. 3.12, which consists of one omni-directional microphone and two loudspeakers.
Each monaural recording was simulated by convolving the source signals with room impulse responses using the Roomsim toolbox [15]. The reverberation time $RT_{60}$ of the simulated room was 112 ms. Table 3.4 shows the separation performance of the music signals in terms of SDR. Compared with the results on anechoic mixtures in Table 3.2, the performance of the SSD approach in music signal separation degrades significantly. In the reverberant environment, the reflected source signals arrive at the microphone as delayed and attenuated copies of the direct-path source signal, which smears the mixture signal at the microphone across time. For this reason, some source-specific atoms were selected incorrectly when the SSD approach computed the best atoms from the residual signal. Note that the amount of smearing is a function of the reverberation time $RT_{60}$.

Table 3.4: Music signal separation examples for simulated room recordings with $RT_{60}$ = 112 ms. Reconstructed signals in SDR (dB).

  Mixture            Music $\hat{m}(t)$  Background $\hat{b}(t)$
  clarinet + speech  5.27                 5.21
  clarinet + drum    6.72                -0.70
  clarinet + train   6.80                 5.74

3.6 Conclusion

A systematic way to find an efficient representation of harmonic music signals based on source-specific dictionaries was presented. The proposed approach learns the essential features of music signals by modeling their basic components, i.e., musical notes, using source-independent atoms. Due to the efficiency of the resulting source-specific atoms, the number of atoms needed to represent music signals is much smaller than that of the Gabor dictionary, resulting in a lower complexity algorithm. The proposed scheme was applied to music signal enhancement with a mixture of multiple sounds.

The proposed technique builds source-specific atoms and dictionaries from Gabor atoms. Since the Gabor atoms chosen in our work are harmonic signals, the resultant source-specific atoms are harmonic as well. For this reason, the proposed algorithm will probably not be effective for non-harmonic musical instrument recordings; we need to find other types of atoms as the building elements. Along this line of thought, it appears interesting yet challenging to obtain speech-specific dictionaries because of special properties of speech signals such as inharmonicity and irregular pitch sweeps.

Chapter 4: Multi-Channel Audio Source Separation with Noisy Data

4.1 Introduction

Multichannel blind source separation (BSS) is the term used to describe the process of separating the underlying original source signals from a number of observable mixture signals, where the mixing model is either unknown or the knowledge about the mixing process is limited. Thanks to the multichannel framework, the spatial diversity between channels is often used for source separation.

The multichannel BSS problem has been examined by researchers for years. Early approaches concentrated on tackling even-determined demixing. Independent component analysis (ICA) [10,33,38] pioneered those early approaches under the crucial assumption that the underlying sources are statistically independent. Even though ICA-based algorithms are fast and reliable for the multichannel BSS problem, they demand that the number of sensors be no less than the number of sources.

Recovering the sources for degenerate or underdetermined mixtures, with fewer observation sensors than sources, is a much harder problem in a setup relying only on the
Among other possible priors, sparsity of sources in a trans- formdomainhasturnedoutinthelastfewyearsoneofthesuccessfultoolstoaddressthe underdetermined BSS problem. These approaches [13,19,51,62,76] based on the source sparsityassumptiontypicallyallowtheestimationofmixingparametersbyclusteringthe scatter plot of a sparse representation of the mixtures, followed by ` 1 -norm minimization techniques for the estimation of sources. The related techniques are known as Sparse Component Analysis (SCA). SCAhasprovedaverysuccessfultoolfortheunderdeterminedsourceseparation, but it requires some innate assumptions to ensure an accurate recovery of sources. That is, givenM mixturesofN sources(M <N)nomorethanM sourcesareactiveateachdata point in the transform domain, and the noise a®ecting the mixtures is negligible [62,64]. However, by examining real-world mixture data carefully, we ¯nd that line orientations in the scatter plot are close to each other and the observation mixtures clearly contain noise [19]. The noisy data and close line orientations, to which the previous work has not given much attention, cause the violation of the basic assumptions of SCA, and results in poor performance in the source separation. Here, we concentrate on the practical e®ectiveness of the noise for the underdeter- mined BSS problem with stereo-channel observation mixtures. To address the problem in SCA with noisy data, we propose an enhanced technique that consists of two major steps. The ¯rst step is to exploit the weighted soft-assignment clustering algorithm to estimate the mixing parameters in a noisy environment. The weighting scheme relies on the distance of a data point from the origin in the scatter plot, due to that more distant points are more reliable. Then, it is applied to a modi¯ed soft-assignment method to 57 identify the multiple orientations in the scatter plot. In the second step, we employ the generalized ` p -norm optimization technique with regularization, which yield more robust and sparser solutions in the noisy environment than the ` 1 -norm minimization technique in SCA. Experimental results on several synthetic mixtures demonstrate the gains in source separation performance when compared to other SCA-based algorithms reported in the literature. 4.2 Unifying Framework for Sparse Component Analysis FortheBSSproblemwiththelinearinstantaneousmixingmodel,onehasN sourcess j (t) and M observation mixtures x i (t) such that x i (t)= N X j=1 a ij s j (t)+e i (t); i=1;¢¢¢ ;M; (4.1) which can be expressed in matrix notation as x=As+e; wherea ij areunknownmixingparametersassociatedwiththepathfromthejthsourceto theithsensor,ande i (t)isadditivenoisethatisoftenneglectedinthehighsignal-to-noise (SNR) regime. We aim to estimate the underlying original sources s j (t) from observed mixtures x i (t), where the mixing model is either unknown or the knowledge about the mixing process is limited. 58 Mixing Parameters Estimation Source Signals Estimation Sparse Transform Mixtures ) (t x i Inverse Sparse Transform Transform domain ) (t s e j Time domain Time domain Figure 4.1: Sparse component analysis for blind source separation. ICA has been one of the favorite tools to tackle the BSS problem given in (5.1); namely, torecoversourcesignalsbyestimatingtheinvertedmatrix W = e A ¡1 and¯nding s e = W x = e A ¡1 x under the condition that the number of sensors is at least as large as the number of sources, i.e., M ¸ N. 
Recovering the source signals for degenerate or underdetermined mixtures, that is, $M < N$, is a more difficult problem. There is no unique solution even if the mixing matrix $A$ is known [26,64]. The underdetermined set of solutions can be expressed as $\mathbf{s} = \mathbf{s}_{mn} + \mathbf{v}$, where $\mathbf{s}_{mn}$ is the minimum-norm solution and $\mathbf{v}$ is any vector in the null space of $A$. The minimum $\ell_1$-norm solution is the most widely used estimate for (4.1).

Another possible solution is to impose the constraint that the source signals are sparse in a transform (e.g., Fourier or wavelet) domain. This technique, called Sparse Component Analysis (SCA), has been studied extensively in recent years [13,19,51,62,74,76] to tackle the underdetermined BSS problem. SCA consists of four steps, as shown in Fig. 4.1. First, the mixtures $x_i(t)$ are transformed into a proper sparse domain. Given a family of $K$ basis functions $\varphi_k(t)$, the correlations $\langle x_i, \varphi_k\rangle$ are computed and collected in an $M \times K$ matrix as

$\hat{\mathbf{x}} := A\hat{\mathbf{s}} + \hat{\mathbf{e}}$,   (4.2)

(note that bold letters such as $\mathbf{x}$ or $\mathbf{s}$ denote multichannel variables, and $x_i$ denotes the row vector of the $i$th monochannel, $(x_i(t))_{1\le t\le T} \in \mathbb{R}^T$) where

$\hat{\mathbf{x}} = \begin{bmatrix} \hat{x}_1 \\ \vdots \\ \hat{x}_M \end{bmatrix}$ and $\hat{\mathbf{s}} = \begin{bmatrix} \hat{s}_1 \\ \vdots \\ \hat{s}_N \end{bmatrix}$.   (4.3)

The sparse transform of each monochannel can be represented by $\hat{x}_i = x_i\Phi^H$ and $\hat{s}_j = s_j\Phi^H$, where $H$ denotes the complex-conjugate transpose and $\Phi$ is the transform matrix whose rows are the basis functions $\varphi_k(t)$, $1 \le k \le K$.

Figure 4.2: The scatter plot of two linear mixtures of three sources in (a) the time domain and (b) the transform domain, where $F_W$ represents the short-time Fourier transform.

In SCA, orthogonal basis functions such as the short-time Fourier transform (STFT) [13,19,51,62,74], the discrete cosine transform (DCT), and the wavelet transform [40] are often used for the sparse linear transform. Note that the transform is chosen to sparsify the source signal representation so that each $\hat{s}_j$ has few significant coefficients in the transform domain.

The second step is to estimate the mixing parameters $a_{ij}$ of $A$ from the scatter plot of $\hat{\mathbf{x}}$ [13,19,51,55]. If the sparsity is sufficient, the scatter plot consists of points almost aligned with the columns of the mixing matrix $A$, as shown in Fig. 4.2 (b). To explain the existence of this special structure in the scatter plot, we take a simple example with two mixtures and three sources, under the assumption that only one source is active at a time, say $\hat{s}_1$ (hence $\hat{s}_2 = 0$ and $\hat{s}_3 = 0$):

$\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \begin{bmatrix} \hat{s}_1 \\ \hat{s}_2 \\ \hat{s}_3 \end{bmatrix} = \hat{s}_1 \begin{bmatrix} a_{11} \\ a_{21} \end{bmatrix}$.

Then, points on the scatter plot of $\hat{x}_1$ versus $\hat{x}_2$ lie on the line through the origin whose direction is given by the column vector $[a_{11}\ a_{21}]^T$. When more than one source is active at the same time, the points on the scatter plot deviate slightly from the directions of the column vectors, resulting in broadened line orientations, as shown in Fig. 4.2 (a). Since each line orientation is unique to one source (corresponding to a column of $A$), one can cluster the scatter plot to estimate the mixing parameters. Note that the STFT domain in Fig. 4.2 exhibits a sparser signal representation than the time domain.
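As an illustration of the first step, the mixtures can be mapped to the STFT domain and stacked as scatter-plot points as follows. This is a minimal sketch: the window and overlap match the experimental settings of Sec. 4.6, scipy's stft is one possible implementation, and taking the real parts of the coefficients follows the scatter-plot convention used in Sec. 4.6.1.

```python
import numpy as np
from scipy.signal import stft

def scatter_points(x, fs=11025, nperseg=512, noverlap=256):
    """Step 1 of SCA: STFT each mixture channel and stack the real parts of
    the coefficients as scatter-plot points, one column per time-frequency
    point. x has shape (M, T)."""
    _, _, X = stft(x, fs=fs, window='hann', nperseg=nperseg, noverlap=noverlap)
    return X.reshape(X.shape[0], -1).real        # shape (M, n_points)
```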
The third step is to estimate the source representation under the sparsity assumption. In a noiseless context, the approach based on the minimum $\ell_1$-norm solution [13,51,55,76] has been widely used to estimate the source representation $\hat{\mathbf{s}}^e$:

$P_1$: $\min_{\hat{\mathbf{s}}^e} \|\hat{\mathbf{s}}^e\|_1$ subject to $\tilde{A}\hat{\mathbf{s}}^e = \hat{\mathbf{x}}$,   (4.4)

where $\tilde{A}$ is the estimated mixing matrix obtained in the previous step.

The optimization problem $P_1$ can be interpreted as maximum a posteriori (MAP) estimation [64]. More precisely, if the source coefficients in the transform domain in (4.2) are independent and identically distributed (i.i.d.), we can adopt the Bayesian approach and choose the solution given by the corresponding MAP estimator. That is, at each data point $k$ in the transform domain, the extracted source vector $\hat{\mathbf{s}}^e$ should satisfy

$\hat{\mathbf{s}}^e(k) = \arg\max_{\hat{\mathbf{s}}^e} P(\hat{\mathbf{s}}^e(k) \mid \tilde{A}, \hat{\mathbf{x}}(k)) = \arg\max_{\hat{\mathbf{s}}^e} P(\hat{\mathbf{x}}(k) \mid \tilde{A}, \hat{\mathbf{s}}^e(k))\, P(\hat{\mathbf{s}}^e(k)) = \arg\max_{\hat{\mathbf{s}}^e} P(\hat{\mathbf{s}}^e(k)) = \arg\max_{\hat{\mathbf{s}}^e} \exp\Big(-\sum_{j=1}^{N} |\hat{s}^e_j(k)|\Big) = \arg\min_{\hat{\mathbf{s}}^e} \sum_{j=1}^{N} |\hat{s}^e_j(k)| = \arg\min_{\hat{\mathbf{s}}^e} \|\hat{\mathbf{s}}^e(k)\|_1$.   (4.5)

Note that the fourth equality assumes that the prior probability $P(\hat{\mathbf{s}}^e(k))$ is a Laplacian distribution with mutually independent components, i.e., $P(\hat{\mathbf{s}}^e(k)) \propto \exp\big(-\sum_j |\hat{s}^e_j(k)|\big)$. Practically, the $\ell_1$-norm minimization is often performed via linear programming to estimate the source signals [18].

The last step is the reconstruction of the sources by an inverse transform. For unitary transforms such as the Fourier transform, this is easily obtained as $s^e_j = \hat{s}^e_j \Phi$.
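To make the linear-programming route concrete, the following sketch solves $P_1$ in (4.4) via the standard split $\hat{\mathbf{s}}^e = \mathbf{u} - \mathbf{v}$ with $\mathbf{u}, \mathbf{v} \ge 0$, for one real-valued data point. For complex STFT coefficients, a common practice is to apply it to the real and imaginary parts separately; these choices are illustrative, not prescribed by the text.

```python
import numpy as np
from scipy.optimize import linprog

def l1_solve(A, x):
    """Minimum l1-norm solution of A s = x (problem P1, Eq. 4.4) by linear
    programming with the split s = u - v, u >= 0, v >= 0."""
    M, N = A.shape
    c = np.ones(2 * N)                           # objective: sum(u) + sum(v) = ||s||_1
    A_eq = np.hstack([A, -A])                    # equality constraint A(u - v) = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * N))
    u, v = res.x[:N], res.x[N:]
    return u - v
```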
4.3 Noise Effects on Sparse Component Analysis

Figure 4.3: The scatter plot of two linear mixtures of two sources: (a) clean data and their associated line orientations with a large intersection angle, from a synthetic mixture, and (b) noisy data and their associated line orientations with a small intersection angle, from a commercial music CD excerpt, "Let It Be" by the Beatles. $F_W$ represents the short-time Fourier transform.

The SCA-related algorithms [13,51,55,62,64,76] described in Sec. 4.2 require the following assumptions to ensure an accurate estimation of the sources:

1. Sparsity of sources: no more than $M$ sources are active at each point in the transform domain [62,64]. With this assumption, SCA separates $N$ sources from $M$ mixtures ($M < N$) by extracting at most $M$ sources at a time via solving the $P_1$ problem in (4.4).

2. Separability of the columns of the mixing matrix: the columns are independent of each other to allow accurate estimation. Numerically, the angle between any two column vectors should be greater than a certain threshold.

3. Nearly noise-free observation system: the noise that affects the observed mixtures is negligible.

If these assumptions hold, the estimation of both the mixing parameters and the sources will be successful. Fig. 4.3 (a) shows a scatter plot of two mixtures of two sources, using clean observation data. In the scatter plot, the data points are almost aligned with two line orientations, and the two lines intersect at a large angle. Consequently, the columns of the mixing matrix can be easily found by simple clustering. However, real-world mixture data often contain noise. The example in Fig. 4.3 (b), consisting of a human voice and background piano, has a small intersection angle between the line orientations and contains many data points deviating from these lines. In SCA, poor estimation of the mixing parameters in the mixing matrix affects the source estimates negatively.

Figure 4.4: Estimation of three random signals from two observation mixtures, where the original sources $s_i$ (circles) and their estimates $s^e_i$ (squares) are shown.

Fig. 4.4 demonstrates how more than $M$ sources being active at certain points affects source estimation in the $\ell_1$-norm minimization solution of (4.4). For the synthetic sources, the non-zero samples have random positions and zero-mean unit-variance Gaussian amplitudes. These sources are mixed together using a $2 \times 3$ mixing matrix and estimated by the linear programming technique. As shown in Fig. 4.4, the minimum $\ell_1$-norm solutions work well where only one or two sources are active at a sample point. However, having more than $M$ active sources at certain points violates the basic assumption of SCA described above and renders the $\ell_1$-norm minimization process inappropriate (for example, three sources are active simultaneously at time sample points 3 and 10 in Fig. 4.4).

Figure 4.5: Percentages of overlapping points in the STFT domain as a function of white Gaussian noise level: (a) a mixture of clarinet, male speech, and street sound, and (b) a mixture of one male and two female speech signals.

Fig. 4.5 shows the percentage of overlapping points in the STFT domain when white Gaussian noise is added. Overlapping points are those at which more than $M$ sources are active simultaneously with more than a predefined energy power (for instance, one percent of the whole power of each source). These overlapping points result from the violation of the main assumptions of SCA, and they affect the accuracy of the source estimation. The percentage of overlapping points increases as the signal-to-noise ratio (SNR) decreases. Clearly, noise significantly affects the estimation of the mixing parameters and the sources in SCA. Little attention has been paid to this noise effect in previous work. In this work, we concentrate on underdetermined blind source separation with stereo-channel observations (i.e., $M < N$ and $M = 2$) in the presence of noise.

4.4 Enhancements of Sparse Component Analysis

To address the problems of SCA described in Sec. 4.3, we propose an enhanced technique that exploits a weighted soft-assignment clustering algorithm and generalized $\ell_p$-norm optimization with regularization. The enhanced technique yields more robustness against noisy data when estimating the mixing parameters and sources.

4.4.1 Estimation of Mixing Parameters

Under the assumption that the observation mixtures are contaminated by noise, many data points in the scatter plot deviate from the line orientations. In that case, the previous methods [13,40,55] fail to identify the line orientations accurately. The basic idea for estimating the mixing parameters from noisy data is to weight the points in the scatter plot according to their distance from the origin, since more distant points are more reliable. The weighting scheme is applied to a modified soft-assignment method, originally discussed in [39] and applied in [55], to identify the multiple line orientations in the scatter plot.

Let us first take the STFT of the mixtures $x_i(t)$, $1 \le i \le M$, with an appropriate window function. Then, the mixing model (4.1) reduces to

$\hat{\mathbf{x}}(k,l) \approx A\hat{\mathbf{s}}(k,l)$,   (4.6)
where $\hat{\mathbf{x}}(k,l)$ is the vector of STFT coefficients at the time-frequency (TF) point $(k,l)$. In this section, we shall use the STFT as the preferred sparse transform representation, for the following reasons: the STFT is linear and easy to implement and invert, and STFTs of audio signals are sparse.

To compute a soft assignment of each data point, data point $\hat{\mathbf{x}}(j) = \hat{\mathbf{x}}(k,l)$ is assigned to orientation column vector $\mathbf{v}_i$ as

$\hat{z}_{ij} = \frac{z_{ij}^{-m}}{\sum_r z_{rj}^{-m}}$,   (4.7)

where $1 \le i, r \le N$ (the number of line orientations), $1 \le j \le T$ (the number of observation data points), and $m$ controls the softness of the boundaries between the regions attributed to each line. The $\hat{z}_{ij}$ are the computed soft assignments of the $j$th data point to each line orientation $i$, and $z_{ij} = \|\hat{\mathbf{x}}(j) - \langle \mathbf{v}_i, \hat{\mathbf{x}}(j)\rangle \mathbf{v}_i\|_2$ measures the distance between data point $\hat{\mathbf{x}}(j)$ and line orientation $\mathbf{v}_i$.

Since the orientation of a linear cloud of data corresponds to the principal eigenvector of its covariance matrix [38], we calculate the covariance matrix of each orientation while weighting the data according to the distance of each data point from the origin. The covariance matrix associated with line orientation $i$, based on the data weights, can be written as

$\Sigma_i = \frac{\sum_j c_j \hat{z}_{ij}\, \hat{\mathbf{x}}(j)\hat{\mathbf{x}}(j)^T}{\sum_j c_j \hat{z}_{ij}}$, with $c_j = \frac{|\hat{\mathbf{x}}(j)|}{\sum_l |\hat{\mathbf{x}}(l)|}$,   (4.8)

where $1 \le l \le T$.

The weighted soft-assignment clustering procedure for estimating the mixing parameters can be summarized as follows:

1. Given stereo-channel audio mixtures, transform the mixtures from the time domain to the STFT domain to better reveal the sparsity property.

2. Randomly initialize the $N$ line orientations $\mathbf{v}_i$ such that $\|\mathbf{v}_i\|_2 = 1$.

3. Using (4.7) and (4.8), assign weights to each data point and calculate the covariance matrix $\Sigma_i$ for line orientation $i$. Then compute the principal eigenvector of $\Sigma_i$ for the new orientation estimate by the eigenvalue decomposition (EVD), $\Sigma_i = U_i \Lambda_i U_i^{-1}$, where $U_i$ contains the eigenvectors of $\Sigma_i$ and $\Lambda_i$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues $\lambda_1, \dots, \lambda_N$. This provides the new estimated column vector $\mathbf{v}_i^{new} = \mathbf{u}_i^{max}$, where $\mathbf{u}_i^{max}$ is the principal eigenvector of $\Sigma_i$, with eigenvalue $\lambda^{max}$. Repeat until all $\mathbf{v}_i$ converge.

4. After convergence, the estimated orientations $\mathbf{v}_i$ form the estimated mixing matrix $\tilde{A} = [\mathbf{v}_1, \dots, \mathbf{v}_N]$.
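A minimal sketch of this weighted soft-assignment procedure (Eqs. 4.7-4.8), assuming the scatter-plot points are the real parts of the STFT coefficients stacked as columns; the initialization, iteration count, and the small guard against division by zero are illustrative choices:

```python
import numpy as np

def estimate_mixing(X, n_sources, m=2.0, n_iter=50, seed=0):
    """Weighted soft-assignment clustering (Eqs. 4.7-4.8). X is an (M, T)
    array of scatter-plot points (e.g., real STFT coefficients)."""
    rng = np.random.default_rng(seed)
    c = np.linalg.norm(X, axis=0)
    c = c / c.sum()                                  # distance-based point weights c_j
    V = rng.standard_normal((X.shape[0], n_sources))
    V /= np.linalg.norm(V, axis=0)                   # random unit orientations v_i
    for _ in range(n_iter):
        proj = V.T @ X                               # <v_i, x_j> for every i, j
        dist = np.linalg.norm(X[:, None, :] - V[:, :, None] * proj[None, :, :],
                              axis=0)                # z_ij: point-to-line distances
        dist = np.maximum(dist, 1e-12)               # guard against division by zero
        z_hat = dist ** (-m) / np.sum(dist ** (-m), axis=0)   # soft assignments, Eq. (4.7)
        for i in range(n_sources):
            w = c * z_hat[i]
            Sigma = (X * w) @ X.T / w.sum()          # weighted covariance, Eq. (4.8)
            _, vecs = np.linalg.eigh(Sigma)
            V[:, i] = vecs[:, -1]                    # principal eigenvector
    return V                                         # estimated mixing matrix A~
```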
4.4.2 Estimation of Sources

Subsequent to the estimation of the mixing parameters, separation of the underlying sources can be performed to compute the source estimates $\hat{\mathbf{s}}^e$ of the original sources $\hat{\mathbf{s}}$. These estimates can be represented in the STFT domain as

$\hat{\mathbf{x}}(k,l) = \tilde{A}\hat{\mathbf{s}}^e(k,l)$,   (4.9)

with $\hat{\mathbf{s}}^e = [\hat{s}^e_1, \dots, \hat{s}^e_N]^T$, where each TF point $(k,l)$ provides $M$ equations with $N > M$ unknowns, and $\tilde{A}$ is the estimated mixing matrix obtained in the previous subsection.

Due to the non-uniqueness of the solutions to (4.9), $\ell_1$-norm minimization has been most widely used in SCA as a reasonable estimate of the solutions. Recently, variants in which the $\ell_1$-norm is replaced with an $\ell_p$-norm, $0 < p \le 1$, have been reported to improve performance on the BSS problem [19,62]. Like the $\ell_1$-norm problem in (4.4), the $\ell_p$-norm problem can be interpreted as MAP estimation. The difference between them is that the $\ell_p$-norm solutions assume a generalized Gaussian prior [45],

$P(\hat{\mathbf{s}}^e(k,l)) \propto \exp\Big(-\sum_j |\hat{s}^e_j(k,l)|^p\Big)$,

where $0 < p \le 1$. Thus, the MAP estimate of the $\ell_p$-norm problem becomes [62]

$\hat{\mathbf{s}}^e(k,l) = \arg\min_{\hat{\mathbf{s}}^e} \|\hat{\mathbf{s}}^e(k,l)\|_p$,

where, for $p$ as a measure of sparsity, $\|x\|_p = \big(\sum_i |x_i|^p\big)^{1/p}$.

To overcome the noise effect on the source sparsity assumption of SCA, we propose to use the generalized $\ell_p$-norm optimization technique with regularization. Its solutions are much sparser and more robust to noise than the $\ell_1$-norm solutions. More specifically, the source estimates $\hat{\mathbf{s}}^e$ are computed by solving the $P_p$ problem at each TF point $(k,l)$:

$P_p$: $\min_{\hat{\mathbf{s}}^e} \lambda \|\hat{\mathbf{s}}^e(k,l)\|_p$ subject to $\tilde{A}\hat{\mathbf{s}}^e(k,l) = \hat{\mathbf{x}}(k,l)$,   (4.10)

where $\lambda$ is a regularization parameter and $0 < p \le 1$. Note that a smaller $p$ signifies increased importance of the sparsity of $\hat{\mathbf{s}}^e$. In the $P_p$ problem, one usually finds the best basis for the column space of $\tilde{A}$ that minimizes the $\ell_p$-norm of the solution vector; thus, we can obtain a sparser solution than the $\ell_1$-norm solution (corresponding to $p = 1$). It should be noted that when $p = 0$, solving the optimization problem is in general combinatorial and very sensitive to noise, and hence not computationally feasible.

The parameter $\lambda$ in the $P_p$ problem makes the $\ell_p$-norm minimization solutions more robust to the additive noise. More precisely, let $\mathcal{A}$ be the set of $M \times M$ invertible submatrices of $\tilde{A}$. The $\ell_p$-norm solutions can be represented by the solution of [26,62]

$\min \|B^{-1}\hat{\mathbf{s}}^e_B\|_p$, $B \in \mathcal{A}$,

where more than $(N - M)$ entries of the vector $\hat{\mathbf{s}}^e_B(k,l)$ are zero. Thus, the inverse operation can be regularized to prevent arbitrarily large changes in $\hat{\mathbf{s}}^e$ in response to even small noise in the data. The most common regularization techniques include Tikhonov regularization and the truncated singular value decomposition (TSVD) [56]. Technically, we propose the use of the FOCUSS algorithm [26] to implement the generalized $\ell_p$-norm minimization with regularization, whereas SCA exploits linear programming to implement the $\ell_1$-norm minimization.

4.5 Source-Specific Representation for Multichannel Source Separation

Despite its successes in underdetermined source separation with noisy data, the enhanced SCA technique discussed in Sec. 4.4 faces some challenges. First, although it yields a more robust and sparser solution than $\ell_1$-norm minimization [26,62], it still demands that no more than $M$ sources be active at a TF point. With noisy data and confusing line orientations in the scatter plot, it is likely that more than $M$ sources are active at certain TF points. In that case, errors in the mixing parameter estimation and the noise effect may lead to false assignment of some data points to columns of the mixing matrix. Second, the computational complexity of SCA is high. The $\ell_1$-norm solution of (4.4) relies on a convex programming technique such as linear programming, whose complexity is also demanding [18,65] (linear programming is more complex than greedy algorithms such as Matching Pursuit or Orthogonal Matching Pursuit [65]). As for the enhanced SCA technique, the FOCUSS algorithm [26,43] is used to implement the $\ell_p$-norm minimization with regularization. (FOCUSS was originally proposed to solve underdetermined linear systems of equations; Rao and Kreutz-Delgado [43,59] extended it to data-driven dictionary learning for sparse data representation.) The FOCUSS algorithm uses the iterative reweighted least squares (IRLS) technique and converges in a few iterations to a local minimum of the cost function. However, it needs a good initialization, and its complexity increases with the dimension of the linear system.
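For reference, a minimal sketch of a regularized FOCUSS iteration for the $P_p$ problem in (4.10) follows; the initialization, $p$, $\lambda$, and the iteration count are illustrative choices rather than the actual settings:

```python
import numpy as np

def focuss(A, x, p=0.8, lam=1e-3, n_iter=20):
    """Regularized FOCUSS for min ||s||_p s.t. A s ~= x (cf. Eq. 4.10):
    iteratively reweighted, Tikhonov-regularized least squares."""
    M, N = A.shape
    s = np.linalg.pinv(A) @ x                    # minimum-norm initialization
    for _ in range(n_iter):
        W = np.diag(np.abs(s) ** (1.0 - p / 2.0))    # IRLS weights from current estimate
        AW = A @ W
        q = AW.T @ np.linalg.solve(AW @ AW.T + lam * np.eye(M), x)
        s = W @ q                                # reweighted regularized solution
    return s
```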
4.5 Source-Specific Representation for Multichannel Source Separation

Despite its successes in underdetermined source separation with noisy data, the enhanced SCA technique discussed in Sec. 4.4 faces some challenges. First, although it yields a more robust and sparser solution than that obtained by minimizing the $\ell_1$-norm [26,62], it still demands that no more than $M$ sources be active at a TF point. With noisy data and confusing line orientations in the scatter plot, it is likely that the number of active sources exceeds $M$ at certain TF points. In that case, errors in the mixing parameter estimation and the noise effect may lead to false assignment of some data points to columns of the mixing matrix. Second, the computational complexity of SCA is high. The $\ell_1$-norm solution in (4.4) relies on a convex programming technique such as linear programming, whose complexity is demanding [18,65] (linear programming is more complex than greedy algorithms such as Matching Pursuit or Orthogonal Matching Pursuit [65]). As for the enhanced SCA technique, the FOCUSS algorithm [26,43] is used to implement the $\ell_p$-norm minimization with regularization (FOCUSS was originally proposed to solve underdetermined linear systems of equations; Rao and Kreutz-Delgado [43,59] extended it to data-driven dictionary learning for sparse data representation). The FOCUSS algorithm uses the iterative reweighted least squares (IRLS) technique and converges in a few iterations to a local minimum of the cost function. However, it needs a good initialization, and its complexity increases as the dimension of the linear systems increases.

In this section, we apply the source-specific dictionaries discussed in Chapter 3 to the multichannel audio source extraction problem to overcome the above two challenges. Suppose that multichannel observation signals $x_i(t)$ consist of a pure music signal $s_1(t)$ and background sounds $s_2(t), \cdots, s_N(t)$:
\[
x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t) + e_i(t), \qquad i = 1, \cdots, K, \qquad (4.11)
\]
where the $a_{ij}$ are unknown scalar coefficients, $K$ is the number of observation channels, and $e_i(t)$ is additive noise. We assume $K < N$, that is, the system is degenerate, and our objective is to extract the music signal from background speech and/or environmental sound using the music source-specific dictionaries developed in Chapter 3.

To obtain the sparse decomposition of the multichannel signals, we use a single-channel music source-specific dictionary $D$. We begin by setting each initial residual equal to the corresponding channel of the observation mixtures, $R_0^k = x^k$, where $1 \le k \le K$. At step $m$, the residuals are projected onto the subspace spanned by the atom $h_{\gamma_m}$ chosen from dictionary $D$, i.e.,
\[
P_{h_{\gamma_m}} R_m^k(t) = \langle R_m^k, h_{\gamma_m}\rangle\, h_{\gamma_m}(t), \qquad (4.12)
\]
where $P_{h_{\gamma_m}}$ represents the orthogonal projection onto the subspace $S_{h_{\gamma_m}}$ spanned by $h_{\gamma_m}$. The best atom $h_{\gamma_m}$ is chosen as
\[
h_{\gamma_m} = \arg_{h_\gamma \in D} \big( corr_m(\gamma) > \eta_e \big) \qquad (4.13)
\]
with the joint correlation
\[
corr_m(\gamma) = \sum_{k=1}^{K} |\langle R_m^k, h_\gamma\rangle|^2,
\]
where $\eta_e$ is a pre-defined threshold based on an energy criterion. For channel $k$, the new residual can be computed as
\[
R_{m+1}^k(t) = R_m^k(t) - \langle R_m^k, h_{\gamma_m}\rangle\, h_{\gamma_m}(t).
\]
Finally, the extracted music signal can be reconstructed as a weighted sum of the source-specific atoms chosen from the dictionary $D$:
\[
m_e^k(t) \simeq \sum_{m=0}^{M-1} \langle R_m^k, h_{\gamma_m}\rangle\, h_{\gamma_m}(t), \qquad (4.14)
\]
where $1 \le k \le K$. The multichannel signals can be converted into a single-channel one via $m_e(t) = \frac{1}{K}\sum_{k=1}^{K} m_e^k(t)$. (A small code sketch of this pursuit is given after the following list of advantages.)

The source-specific-based source extraction scheme has the following competitive advantages:

• No need to estimate mixing parameters. Although there are many techniques to estimate the mixing parameters from scatter plots, such as clustering, accurate estimation of the mixing parameters is still a main challenge in SCA.

• Relaxed assumptions compared with SCA. The basic assumptions of SCA discussed in Sec. 4.3 originate from the inverse problem in a degenerate linear system. The source-specific-based source extraction technique does not rely on these assumptions, so it accommodates the case in which more than $M$ sources are active simultaneously.

• Lower computational complexity. Since the source-specific-based source extraction scheme uses greedy algorithms such as Matching Pursuit (MP), it has a much lower computational complexity when compared with linear programming.

• No need to perform clustering. Recently, MP-based source separation techniques based on multichannel dictionaries were proposed for underdetermined source separation [29,46]. The principle is essentially to decompose the mixtures into elementary components, which are then grouped according to a given criterion; the grouped components are finally recombined to reconstruct the sources. These techniques need a proper grouping criterion to distinguish which components correspond to which source. In [46], spatial diversity is used to group components according to the similarity between their spatial patterns; thus, the original mixing matrix is assumed to be known. As discussed in Chapter 3, the source-specific-based scheme needs no clustering for source extraction.
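The following NumPy sketch of the multichannel pursuit represents the dictionary as a matrix of unit-norm atom waveforms. For simplicity, the selection rule (4.13) is realized as an arg-max of the joint correlation combined with an energy-based stopping threshold; the function name and default values are illustrative assumptions.

```python
import numpy as np

def extract_music(x, D, n_atoms=200, eta=1e-4):
    """Multichannel matching pursuit with a single-channel dictionary,
    a sketch of Eqs. (4.12)-(4.14). x : (K, T) mixture channels;
    D : (T, P) matrix whose columns are unit-norm source-specific atoms.
    Returns the (K, T) extracted signal and its single-channel average.
    """
    K, T = x.shape
    R = np.asarray(x, dtype=float).copy()   # residuals R_0^k = x^k
    m_e = np.zeros_like(R)
    for _ in range(n_atoms):
        C = R @ D                           # <R_m^k, h_gamma> for all atoms
        corr = (C ** 2).sum(axis=0)         # joint correlation over channels
        g = int(np.argmax(corr))            # best atom index gamma_m
        if corr[g] <= eta:                  # energy criterion, as in (4.13)
            break
        coef = C[:, g]                      # per-channel coefficients
        m_e += np.outer(coef, D[:, g])      # accumulate Eq. (4.14)
        R -= np.outer(coef, D[:, g])        # residual update
    return m_e, m_e.mean(axis=0)            # m_e(t) = (1/K) sum_k m_e^k(t)
```

Because the atoms are selected jointly over all channels, no clustering or mixing-matrix estimate is needed, which is exactly the point of the advantages listed above.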
4.6 Experimental Results

Experiments were conducted to evaluate the performance of the proposed techniques for underdetermined blind source separation. The test audio signals were chosen from several excerpts of real audio sounds: musical instrument sounds, speech, and environmental sounds. All sounds were downsampled to 11,025 Hz. The time-domain signals were transformed to the STFT domain using a Hanning window of length 512 samples with a 256-point overlap between windows. The regularization parameter $\lambda$ in (4.10) can be set independently for each vector $\hat{s}^e(k,l)$. Even though the L-curve approach [32] can be used to select this parameter, it is computationally intensive. Alternatively, a heuristic method proposed in [43] was adopted to calculate the regularization parameter.

To assess the quality of the separation, the performance measures proposed in [67] were used, namely the source-to-interference ratio (SIR), the source-to-artifact ratio (SAR), and the source-to-distortion ratio (SDR). SIR measures the interference due to sources other than the one being extracted. SAR measures the distortion due to algorithmic or numerical artifacts such as "forced zeros" in the STFT domain. SDR, on the other hand, is a measure of global distortion, i.e., interference, artifacts, or noise. The higher the performance measures, the better the separation performance. In [67], it was observed that informal listening tests correlated well with the nature of the perceived distortion as quantified by the SIR and SAR measures.

For the degenerate linear mixing system, we used two synthetic mixing matrices,
\[
A_1 = \begin{bmatrix} 0.92 & 1.40 & 1.05 \\ 0.56 & 1.27 & 1.36 \end{bmatrix}
\quad \text{and} \quad
A_2 = \begin{bmatrix} 0.92 & 1.40 & 1.05 \\ 0.36 & 0.70 & 1.10 \end{bmatrix}, \qquad (4.15)
\]
where $A_1$ is formed with equally spaced directions of its column vectors, while two directions in $A_2$, i.e., the first and second columns, are closer to each other than the others.

4.6.1 Estimation of Mixing Parameters

The weighted soft-assignment clustering algorithm is first tested on simulated mixtures of three sources, one of which is speech and two of which are music. We examined several synthetic mixtures: clean mixtures and mixtures with additive white Gaussian noise, as shown in Fig. 4.6 and Fig. 4.7. For the scatter plots, we used the real coefficients of the stereo-channel mixtures in the STFT domain; similar results are obtained when the imaginary coefficients are used.

Figure 4.6: Scatter plots of three sources mixed into two mixtures using the degenerate linear mixing system $A_1$: (a) clean data, (b) SNR = 15 dB, and (c) SNR = 5 dB. The square dots represent the normalized original column vectors of $A_1$.

Figure 4.7: Scatter plots of three sources mixed into two mixtures using the degenerate linear mixing system $A_2$: (a) clean data, (b) SNR = 15 dB, and (c) SNR = 5 dB. The square dots represent the normalized original column vectors of $A_2$.

We can observe that the line orientations broaden and data points deviate from the line orientations as the signal-to-noise ratio (SNR) decreases.
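As an outline of this setup, the sketch below builds the noisy stereo mixtures with $A_1$ and collects the real STFT coefficients used for the scatter plots; the source array argument and the SNR scaling are assumptions of the sketch (the actual test excerpts are the recordings described above).

```python
import numpy as np
from scipy.signal import stft

A1 = np.array([[0.92, 1.40, 1.05],
               [0.56, 1.27, 1.36]])

def make_scatter_data(sources, snr_db=15.0, fs=11025, seed=0):
    """sources : (3, T) time-domain signals sampled at 11025 Hz.
    Returns the 2-D scatter points (real STFT coefficients of the
    two noisy mixture channels)."""
    rng = np.random.default_rng(seed)
    x = A1 @ sources                                  # (2, T) clean mixtures
    noise = rng.standard_normal(x.shape)
    noise *= np.linalg.norm(x) / np.linalg.norm(noise) * 10 ** (-snr_db / 20)
    x = x + noise                                     # additive white Gaussian noise
    # Hanning window of 512 samples with 256-sample overlap, as in the text
    _, _, X = stft(x, fs=fs, window='hann', nperseg=512, noverlap=256)
    return X.real.reshape(2, -1)                      # one column per data point
```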
To measure the performance of the mixing parameter estimation, we defined the following distance measure between two unit-norm vectors $a_i$ and $a_j$:
\[
\lambda_a = \frac{1}{N}\sum_{j=1}^{N} \min_{i=1,\cdots,N} d(a_i, a_j), \qquad (4.16)
\]
with $d(a_i, a_j) = 1 - |\langle a_i, a_j\rangle|^2$. Note that a vector $a_i$ can be considered equivalent to $-a_i$, and the distance measure is independent of the direction of $a_i$ since $d(a_i, -a_i) = 0$. Thus, the smaller $\lambda_a$ is, the more precise the estimation of the mixing parameters.

The performance of the proposed algorithm is compared with a reference algorithm, the Line Orientation Separation Technique (LOST) [55]. Table 4.1 shows the results obtained from the performance benchmarking using the synthetic mixtures in Fig. 4.6 and Fig. 4.7. Consequently, the data weights and soft assignment are efficient in estimating line orientations when the orientations have small intersection angles and many data points deviate from the orientations in the scatter plot.

Table 4.1: Performance benchmarking ($\lambda_a$ with respect to SNR) for estimating the mixing matrices $A_1$ and $A_2$.

  Mixing matrix   Algorithm   clean     20 dB     15 dB     10 dB     5 dB
  A_1             LOST        0.00002   0.00001   0.32016   0.30397   0.33198
  A_1             proposed    0.00004   0.00006   0.00005   0.00014   0.00001
  A_2             LOST        0.00004   0.00003   0.28531   0.30877   0.31519
  A_2             proposed    0.00006   0.00010   0.00006   0.00003   0.00001

4.6.2 Extraction of Music Signals

We discussed $\ell_p$-norm minimization with regularization for underdetermined source separation with noisy data in Sec. 4.4.2. With this method, we show the SDR results of several extracted music signals for a range of $p$ values in steps of 0.1 in Fig. 4.8: (a) extraction of the piano from a mixture of speech, piano, and synthesizer sounds, and (b) extraction of the clarinet from a mixture of speech, clarinet, and street sounds. We see from the experimental results that the performance of the $\ell_p$-norm minimization with regularization can be improved for $0.1 \le p \le 0.8$. More specifically, given clean observation mixtures, the $\ell_1$-norm solution with $p = 1.0$ yields good performance; its performance, however, degrades significantly with noisy observation data. Besides, the source sparsity assumption may not hold at certain data points. The performance of the $\ell_p$-norm optimized solution degrades drastically for $1 < p \le 1.5$, since the solution is no longer sparse; in this region, the energy is spread over a large number of entries of $\hat{s}^e$ at each TF point.

Figure 4.8: SDRs obtained from extracting the music signal as a function of $p$ when mixing system (a) $A_1$ or (b) $A_2$ is used (curves for clean data and SNR = 15 dB).

Figure 4.9: Performance comparison (SDR versus SNR) between Bofill and Zibulevsky's algorithm (BZ), where the mixing matrix is estimated by the scatter-plot technique, and BZ with the original mixing matrix, for the mixing matrices (a) $A_1$ and (b) $A_2$.

Figure 4.10: Music extraction performance (SDR, SIR, and SAR versus SNR) from mixtures of clarinet, speech, and street environmental sounds with mixing matrix $A_1$, comparing BZ, the enhanced SCA, and CAD.
Figure 4.11: Music extraction performance (SDR, SIR, and SAR versus SNR) from mixtures of clarinet, speech, and street environmental sounds with mixing matrix $A_2$, comparing BZ, the enhanced SCA, and CAD.

For SCA to perform well, the key requirement is that at most one source contributes significantly to each data point in the scatter plot. Ideally, we look for representations of the sources that are not only sparse but also disjoint. Techniques were proposed in [8,19,51,55] to estimate the mixing parameters under weaker yet practical assumptions. Experiments were performed to estimate the two mixing matrices given in (4.15) from three audio sources (clarinet, speech, and street environmental sounds) using Bofill and Zibulevsky's (BZ) algorithm [13]. Two settings are compared: BZ with the estimated mixing matrix obtained from the scatter-plot technique, and BZ with the original mixing matrix. The results are compared in Fig. 4.9. With clean data, the sparsity hypothesis is valid for most points in the scatter plot, and BZ with the original mixing matrix has better performance. However, some points deviate from the original line orientations with noisy data, and the robustness of BZ with the original mixing matrix is affected.

For performance benchmarking, we compare several algorithms: BZ, the enhanced SCA algorithm [19], and the source-specific dictionaries for multichannel audio source extraction of Sec. 4.5. The stereo-channel mixtures were generated using the mixing matrices in (4.15) with several excerpts of real audio sounds: recordings of pitched instrument sounds, speech, and environmental sounds. We show the SDR, SIR, and SAR values of music signal extraction with mixing matrix $A_1$ in Fig. 4.10. Recall that the columns of $A_1$ are formed with equally spaced directions. With clean data, BZ has the best SDR performance. However, it suffers from the noise effect, and its performance degrades significantly in both the SDR and SAR measures. The enhanced SCA algorithm [19] performs well in the SDR, SIR, and SAR measures in the presence of noise. The source-specific-based source extraction algorithm has the best SDR, SIR, and SAR performance (with only one exception, the SAR measure for clean data). Besides, it is the most robust against noise.

To test the performance when some line orientations are close to each other, we adopt matrix $A_2$ in (4.15). The closeness of the line orientations hinders accurate estimation of the mixing parameters and the correct assignment of data points to columns of the mixing matrix. The results are shown in Fig. 4.11, where we see that the source-specific-based source extraction technique yields good performance and is robust against noise. Note that the source-specific-based scheme needs no estimation of mixing parameters and no source sparsity assumption. BZ suffers from the noise effect and is less robust. The superior performance of the source-specific-based source extraction is due to the different signal structures of white Gaussian noise and pitched harmonic sounds. The proposed source-specific-based algorithm achieves good SDR values for the extracted harmonic source, as shown in Fig. 4.12.

Figure 4.12: Performance evaluation of the source-specific-based source extraction technique as a function of SNR (clarinet and piano).
4.7 Conclusion

We examined the noise effect on underdetermined blind source separation, which causes violations of the basic assumptions of SCA and results in degraded source separation performance. To address the problem, we presented an enhanced technique that exploits the weighted soft-assignment clustering algorithm and $\ell_p$-norm minimization with regularization. In addition, we applied source-specific dictionaries to multichannel audio source extraction. It was shown by experimental results that the proposed techniques are more robust for mixing systems with noisy data.

Chapter 5

Audio Source Separation in Room Acoustic Environments with Selected Binaural Cues

5.1 Introduction

Audio source separation, which aims to determine the original sources given their acoustic mixtures, has emerged as an important audio processing problem in recent years due to its many potential applications. For example, Microsoft [4] adopted a sound-capturing technique with a microphone array in speech recognition applications, since users prefer to use their own voice without wearing a headset in front of the computer. The DICIT (Distant-talking Interfaces for Control of Interactive TV) project [35] employed a linear microphone array to enable a user-friendly voice interface to TV and infotainment devices using short sentences. The use of multiple microphones can compensate for the effects of ambient noise and reverberation by applying spatial filtering to the sound field, effectively focusing on a desired direction. However, most current applications assume a single noisy sound source (e.g., speech) without taking into account multiple audio sources being active at the same time. To suppress undesired audio sources (e.g., a music background), we consider the audio source separation problem, which is the main objective of this work. Although the proposed audio source separation algorithm is quite generic by nature, our simulation results will focus on the room acoustic environment, yet without assuming any geometrical setting of the room and the microphone array.

As reviewed in Sec. 5.2, most previous algorithms fail under challenging conditions, for example, when the audio delay between two microphones is too long (i.e., larger than one sample) or when the signals are not sparse enough in the transform domain. Here, we propose a two-stage approach that solves the problem in the transform domain: 1) estimation of the mixing parameters and 2) recovery of the source signals. For the first stage, after applying the short-time Fourier transform (STFT) to the mixtures, we estimate the mixing parameters (i.e., scalar attenuation coefficients and time delay parameters) by selecting reliable binaural cues instead of using the whole set of time-frequency points. That is, we select the frequency range that produces no phase ambiguity and the reliable time-frequency (TF) points that correspond to signals traveling from the audio sources to the microphones along the direct path, while excluding attenuated and delayed replicas of a sound. The selected subset of binaural cues behaves like the corresponding binaural cues of each source presented separately in an anechoic environment. As a result, our method can estimate the mixing parameters successfully even in a reverberant environment and/or with time delays larger than one sample. After the mixing parameter estimation with reliable binaural cues, in the second stage we recover the sources by solving an optimization problem with an $\ell_p$-norm criterion ($p \le 1$) and use the inverse STFT to reconstruct the time-domain signals.
5.2 Review of Previous Work

The Sparse Component Analysis (SCA) technique was used to solve the audio source separation problem in [31,51] under the source sparsity assumption in the transform domain. The SCA-based methods exploited clustering in a scatter plot and used $\ell_1$-norm optimization to solve the problem by assuming a linear instantaneous mixing model, where sounds emanating from different sources arrive at the same time without any delay between them. The instantaneous mixing model is, however, too restrictive in practice.

Another solution was derived based on an anechoic mixing model with time delays between sound sources, which leads to the DUET (Degenerate Unmixing Estimation Technique)-type methods [63,74]. The DUET methods work well for a pair of microphones with spacing small enough that any delay between the two microphones is not larger than the spacing of consecutive audio samples. A delay of one audio sample can correspond to a very short distance; for example, at a sampling rate of 44.1 KHz, a delay of one sample corresponds to a propagation distance in air of 7.8 mm. This constraint makes each source difficult to localize and isolate in the parameter space of the attenuation ratio and time delay, and it further causes failures in mixing parameter estimation in situations with delays larger than one sample.

In addition, another essential requirement of all previous methods is that the sources be very sparse in the transform domain. It is difficult to estimate the mixing parameters accurately when this requirement is not met. Moreover, the sparsity can be violated for other reasons in reality, e.g., reverberation and a high degree of overlapping among audio sources.

In this research, our goal is to extract an arbitrary number of audio source signals given two mixtures of audio sources (i.e., a pair of microphones). This often leads to an underdetermined audio source separation problem, where the number of audio sources is greater than the number of mixtures.

5.3 Problem Formulation

Consider a convolutive mixing model with $N$ audio sources, denoted by $s_j(t)$, $1 \le j \le N$, and $M$ audio sensors (or microphones) that yield linearly mixed signals. This mixing process can be described by
\[
x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t - \delta_{ij}), \qquad i = 1, \cdots, M, \qquad (5.1)
\]
where $a_{ij}$ and $\delta_{ij}$ are the scalar attenuation coefficients and time delay parameters, respectively, for the path from the $j$th source to the $i$th microphone. Without loss of generality, we set $\delta_{1j} = 0$ and scale the sources $s_j$ such that $\sum_{i=1}^{M} |a_{ij}|^2 = 1$ for $j = 1, \cdots, N$. Our objective is to recover the unknown source signals $s_j(t)$ from the observed mixtures $x_i(t)$ only. It is assumed that the number of sources, $N$, is known a priori and that $M < N$, i.e., the number of sources to be separated exceeds the number of available mixtures, which equals the number of microphones. Under this setting, the mixing model in (5.1) is underdetermined.

Instead of solving the problem in the time domain, we apply a time-frequency transformation to the mixture signals, assuming that the sources have a sparse representation in the transform domain. By sparsity, we mean that a large percentage of the signal power is contained in a small percentage of TF points and, consequently, TF points that contain a significant amount of power from different audio signals rarely overlap [74]. By performing the STFT with a fixed window function, we can rewrite the mixing model in (5.1) from the time domain to the transform domain as
\[
\hat{x}[k,l] =
\begin{bmatrix}
a_{11} & \cdots & a_{1N} \\
a_{21}e^{-j\omega_0 l\delta_{21}} & \cdots & a_{2N}e^{-j\omega_0 l\delta_{2N}} \\
\vdots & \ddots & \vdots \\
a_{M1}e^{-j\omega_0 l\delta_{M1}} & \cdots & a_{MN}e^{-j\omega_0 l\delta_{MN}}
\end{bmatrix}
\hat{s}[k,l], \qquad (5.2)
\]
where $\hat{x} = [\hat{x}_1, \cdots, \hat{x}_M]^T$ and $\hat{s} = [\hat{s}_1, \cdots, \hat{s}_N]^T$ are the STFTs of the mixtures and sources, respectively, $\omega_0 = 2\pi/L$, and $L$ is the length of the analysis window.

The application of the STFT has several advantages. First, the convolutive mixtures in (5.1) are reduced to multiplicative ones at each TF point with index $[k,l]$. Second, we can exploit the sparsity of the source components, which plays a key role in source signal separation. Note that sources are, in general, not sparse in the time domain.

There are additional challenges arising from room acoustic environments, namely, reverberation and spatial aliasing. Room reverberation usually smears a sound across time and frequency, which has a negative impact on source sparsity. The amount of smearing is a function of the reverberation time, RT60, which is the time required for reflections of a direct sound to decay 60 dB below the magnitude of the direct sound. Moreover, the spatial sampling effect should be considered in the scenario of multiple microphones. Similar to the principle of temporal sampling (known as the Nyquist sampling theorem), there is an analogous requirement in spatial sampling to avoid grating lobes (or spatial aliasing) in the directivity pattern of multiple microphones [53]. With a uniform linear array of microphones, the spatial sampling theorem states that the maximum microphone spacing $d$ in the array should satisfy
\[
d \le \frac{c}{2 f_{max}}, \qquad (5.3)
\]
where the speed of sound $c$ is approximately 344 m/sec in air and $f_{max}$ is the maximum frequency component in the signal's frequency spectrum.

Figure 5.1: Maximum microphone spacing (in cm) to avoid spatial aliasing as a function of the temporal sampling frequency (1.7 to 44.1 KHz).

Based on (5.3), one can compute the maximum distance between a pair of microphones that prevents spatial aliasing as a function of the temporal sampling frequency. The results are shown in Fig. 5.1. For example, the maximum distance between two microphones is 7.8 mm with respect to a sampling rate of 44.1 KHz. Although the DUET method [74] works well for a pair of microphones with small spacing, it cannot provide accurate estimates of the time delay parameters if some of these delays are longer than one sample. Note that some sound recordings may not satisfy the constraint on microphone spacing, for example, the KEMAR dummy-head recording [24] at a sampling rate of 44.1 KHz, where the microphones (the left and right ears of the mannequin) are spaced far enough apart that there might be spatial aliasing. Furthermore, the small-spacing constraint makes each audio source harder to localize and isolate in a two-dimensional histogram of the interaural level and time differences. This limitation will be explained in detail later.

Figure 5.2: The system diagram for the solution of the underdetermined convolutive audio source separation (mixtures $x_i(t)$ in the time domain, sparse transform, mixing parameter estimation of $\hat{a}_{ij}$ and $\hat{\delta}_{ij}$, recovery of source signals in the transform domain, inverse sparse transform to the time-domain source estimates $s^e_j(t)$).

In this work, we consider a stereo-channel microphone setting in a room acoustic environment without any constraint on the geometry of the microphones. Thus, spatial aliasing may occur, which tends to lead to ambiguity in estimating the parameters $a_{ij}$ and $\delta_{ij}$. Note also that common audio signals are typically available in stereo format and consist of a mixture of more than two sound sources.
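To make the aliasing bound in (5.3) concrete, the following small sketch evaluates the largest admissible spacing when $f_{max}$ is taken as the Nyquist frequency $f_s/2$ of the temporal sampling rate; the function name is illustrative.

```python
C_SOUND = 344.0  # approximate speed of sound in air, m/s

def max_spacing_cm(fs_hz):
    """Largest microphone spacing (cm) avoiding spatial aliasing, Eq. (5.3),
    with f_max taken as the Nyquist frequency fs/2."""
    f_max = fs_hz / 2.0
    return 100.0 * C_SOUND / (2.0 * f_max)

for fs in (16000, 44100):
    print(f"fs = {fs} Hz -> d <= {max_spacing_cm(fs):.2f} cm")
# fs = 16000 Hz -> d <= 2.15 cm;  fs = 44100 Hz -> d <= 0.78 cm
```

The two printed values match the figures quoted in the text: 7.8 mm at 44.1 KHz and 2.15 cm at the 16 KHz rate used in the experiments of Sec. 5.6.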
To solve the underdetermined convolutive audio source separation problem in (5.1), we employ a two-stage approach in the STFT domain, as shown in Fig. 5.2. After applying the STFT to the mixtures, we first estimate the mixing parameters $a_{ij}$ and $\delta_{ij}$ from the mixtures by selecting a subset of reliable TF points based on the phase determinacy condition and the source sparsity assumption. In the second stage, we recover the audio sources based on the estimated parameters. Finally, we use the inverse STFT to reconstruct the time-domain signals. These two modules in the STFT domain are described in the following two sections.

5.4 Estimation of Mixing Parameters

With the stereo-channel mixtures, the mixing model in (5.2) reduces to
\[
\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix}
=
\begin{bmatrix}
a_{11} & \cdots & a_{1N} \\
a_{21}e^{-j\omega_0 l\delta_{21}} & \cdots & a_{2N}e^{-j\omega_0 l\delta_{2N}}
\end{bmatrix}
\begin{bmatrix} \hat{s}_1 \\ \vdots \\ \hat{s}_N \end{bmatrix}. \qquad (5.4)
\]
If only a single source $j$ is significantly different from zero at a TF point $[k,l]$, we deduce from (5.4) that
\[
\frac{\hat{x}_2[k,l]}{\hat{x}_1[k,l]} = \frac{a_{2j}}{a_{1j}}\, e^{-j\frac{2\pi}{L} l\delta_{2j}}. \qquad (5.5)
\]
The corresponding attenuation ratio $x_{at}$ and time delay $x_d$ are defined as
\[
\hat{x}_{at}[k,l] = \frac{a_{2j}}{a_{1j}} = \left|\frac{\hat{x}_2[k,l]}{\hat{x}_1[k,l]}\right|
\quad \text{and} \quad
\hat{x}_d[k,l] = -\frac{1}{\omega_0 l}\,\angle\frac{\hat{x}_2[k,l]}{\hat{x}_1[k,l]}, \qquad (5.6)
\]
which are called binaural cues.

The DUET-type method constructs a smoothed two-dimensional histogram of the binaural cues computed over all TF points and then selects the $N$ largest peaks, which correspond to the $N$ sources, to estimate the mixing parameters. Several variants of this method have been proposed to form clearer clusters in the parameter space with well-known clustering algorithms, e.g., k-means [7]. All of them use the whole set of TF points of the mixtures to construct the parameter space. There is a limitation to these methods: due to the periodicity of the complex exponential (i.e., phase indeterminacy), they yield accurate estimates of the delay parameters only if each delay between the two microphones is less than one sample [63,74]. This entails that the spacing between microphones should satisfy the condition in (5.3). To overcome this limitation, we propose a new parameter estimation technique that efficiently extracts reliable TF points that have no phase ambiguity. Under the source sparsity condition, these selected TF points yield clear peaks in the smoothed two-dimensional histogram of the binaural cues $x_{at}$ and $x_d$, even if condition (5.3) is violated.

5.4.1 Phase Determinacy Condition

To resolve the phase indeterminacy problem, we are motivated by human audition for source localization [11]. At low frequencies, since the sound's wavelength is much longer than the diameter of the human head, the phase difference between the signals received by the two ears can be estimated with no ambiguity. In contrast, there can be several cycles of shift at high frequencies, which results in phase ambiguity of the interaural time difference. Here, our goal is to find the frequency range that produces no phase ambiguity and to use the TF points in this range for the estimation of mixing parameters.

To avoid phase indeterminacy in (5.4) with stereophonic observations, we should choose TF points $[k,l]$ that meet the following criterion:
\[
|\omega_0 l \delta_{2j}| < \pi. \qquad (5.7)
\]
Let $\delta_{jmax} = \max_j |\delta_{2j}|$ be the largest delay in the mixing system. Clearly, condition (5.7) is guaranteed for all $j$ if $\omega_0 l \delta_{jmax} < \pi$. This is equivalent to the condition
\[
\delta_{jmax} < \frac{\pi}{\omega_0 l} = \frac{L/2}{l}. \qquad (5.8)
\]
Thus, given $\delta_{jmax}$, we can determine the frequency range that satisfies the phase determinacy condition based on (5.8).

Next, we estimate $\delta_{jmax}$ between the stereo microphones using a well-known time delay estimation (TDE) technique, the GCC-PHAT method proposed in [42]. It is widely acknowledged that GCC-PHAT is more immune to reverberation and able to provide consistent performance when the characteristics of the source signal change over time [17]. With GCC-PHAT, the maximum time delay estimate can be obtained by computing the inverse Fourier transform of the weighted cross-power spectrum of the mixture signals as
\[
\hat{\delta}_{jmax} = \arg\max_m \Psi_{PHAT}[m], \qquad (5.9)
\]
where $\Psi_{PHAT}[m] = FT^{-1}\{S_x / |S_x|\}$ and $S_x$ is the cross-power spectrum of the mixture signals. Note that all TDE techniques, including GCC-PHAT, measure the time delay based on discrete signal samples; the delay estimate is therefore an integer multiple of the sampling period [17]. It is worthwhile to emphasize that the goal of (5.9) is not to estimate the time delay parameters $\delta_{ij}$ between multiple audio sources, but the maximum time delay $\delta_{jmax}$ among them.
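To make the delay estimation step concrete, here is a small sketch of GCC-PHAT in the spirit of (5.9); the zero-padding length, the centered lag search, and the small constant guarding the division are implementation assumptions of this sketch.

```python
import numpy as np

def gcc_phat_delay(x1, x2, max_lag=None):
    """GCC-PHAT sketch of (5.9): whiten the cross-power spectrum and pick
    the lag of the strongest peak. Returns the integer lag (in samples)
    of x2 relative to x1; its magnitude serves as delta_jmax."""
    n = 2 * max(len(x1), len(x2))                 # zero-pad against wrap-around
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    Sx = X2 * np.conj(X1)                         # cross-power spectrum
    psi = np.fft.irfft(Sx / (np.abs(Sx) + 1e-12), n)   # PHAT weighting
    psi = np.concatenate((psi[-(n // 2):], psi[:n // 2]))  # center zero lag
    lags = np.arange(-(n // 2), n // 2)
    if max_lag is not None:                       # optionally restrict the search
        keep = np.abs(lags) <= max_lag
        psi, lags = psi[keep], lags[keep]
    return int(lags[np.argmax(psi)])
```

Given the estimate, (5.8) then confines the usable frequency indices to $l < L/(2\hat{\delta}_{jmax})$.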
5.4.2 Source Sparsity

Acoustic signals observed at microphones contain not only the direct-path signals but also attenuated and delayed replicas of the source signals due to reflections in a room. This multipath propagation effect results in spectral overlapping of the signals and blurs the cues related to the direct-path signal. Thus, several sources may contribute to the mixtures at the same TF points. It is desirable to find TF points where only one source contributes to the mixtures under the source sparsity condition.

Suppose that only a particular source $j$ has dominant power, while the energy of all other sources is sufficiently small, at a TF point indexed by $[k,l]$. Then we can derive (5.5) from (5.4), which depends on the mixing parameters of source $j$ only. Furthermore, these sparse TF points are aligned to a line in the magnitude mixture space, i.e., the $(|\hat{x}_1|, |\hat{x}_2|)$ plane. On the other hand, in the non-sparse case where several sources are active at the same time, the mixing model (5.4) takes the form
\[
\frac{\hat{x}_2[k,l]}{\hat{x}_1[k,l]} = \frac{\sum_j a_{2j}e^{-j\frac{2\pi}{L}l\delta_{2j}}\,\hat{s}_j[k,l]}{\sum_j a_{1j}\,\hat{s}_j[k,l]},
\]
which depends on the mixing parameters of all active source components, and such non-sparse TF points deviate from the oriented line in the magnitude mixture space. This observation allows us to identify good TF points of the mixtures in the scatter plot, where only the direct sound of a single source has non-negligible energy. One such example is shown in Fig. 5.3.

Figure 5.3: A magnitude scatter plot ($|\hat{x}_1|$ versus $|\hat{x}_2|$) at the frequency of 217.9 Hz, where three speech utterances are mixed in a room acoustic environment at RT60 = 113 ms. Note that all data points are normalized such that the maximum energy value equals one.

To exploit the audio source sparsity in the transform domain, we measure the distance between a TF point and the principal line of the scatter plot. That is, the orientation of a linear data cloud can be estimated by computing the principal eigenvector of its covariance matrix. The procedure to estimate the oriented line of a magnitude scatter plot is summarized as follows [19].

Algorithm: Selection of TF points based on source sparsity

1. Initialize a line orientation $v$ such that $|v|_2 = 1$.

2. Assign sparseness weights to each TF point $[k,l]$ as
\[
\hat{z}[k,l] = \frac{z^{-m}[k,l]}{\sum_{k,l} z^{-m}[k,l]}
\]
with $z[k,l] = \|\,|\hat{x}| - \langle v, |\hat{x}|\rangle v\,\|_2$, where $m$ is a control parameter, $l = 1, \cdots, L/2$, and $k = 1, \cdots, K$, with $K$ the number of window frames in the STFT representation.

3. Assign a weight to each TF point $[k,l]$ as
\[
c[k,l] = \frac{dist(\hat{x})}{\sum_{k,l} dist(\hat{x})},
\]
where $dist(\cdot)$ is a distance measure between the TF point and the origin $[0,0]$.

4. Calculate the principal eigenvector of the covariance matrix of the TF points with the eigenvalue decomposition
\[
D = \frac{\sum_{k,l} c[k,l]\,\hat{z}[k,l]\,|\hat{x}|\,|\hat{x}^T|}{\sum_{k,l} c[k,l]\,\hat{z}[k,l]} = U\Lambda U^{-1},
\]
where $\Lambda$ is the diagonal matrix whose entries are the eigenvalues of $D$ and the columns of $U$ are formed by the eigenvectors of $D$. Then compute a new line vector by setting $v = u_{max}$, where $u_{max}$ is the eigenvector corresponding to the largest eigenvalue.

5. Repeat Steps 2-4 until $v$ converges.

Fig. 5.4 illustrates magnitude scatter plots under different acoustic environments, where the solid lines represent the principal eigenvectors computed by the proposed method. In the anechoic environment, shown in the left sub-figure, most TF points are close to the oriented line since there are no reflected signals; the deviation of some TF points from the line might come from the overlapping of several source sounds.

Figure 5.4: Magnitude scatter plots ($|\hat{x}_1|$ versus $|\hat{x}_2|$) at the frequency band 715.95 Hz, where three speech utterances are mixed under (left) the anechoic (RT60 = 0 ms) and (right) the echoic environment (RT60 = 113 ms). Note that all data points are normalized such that the maximum energy value equals one.

By selecting the TF points that satisfy both the phase determinacy condition in (5.8) and the source sparsity condition of this subsection, we obtain a subset of binaural cues that behave similarly to the binaural cues of each source presented separately in an anechoic environment. Given the selected set of TF points, we use (5.6) to define a parameter space of the attenuation ratio and time delay, generate a smoothed two-dimensional histogram of the selected binaural cues, and find the peaks that correspond to the sound sources to determine the mixing parameters. In contrast to the DUET-type method, which uses the binaural cues of all TF points, our method exploits the binaural cues of the selected TF points so that each source can be localized and isolated in the parameter space more easily.

We compare the smoothed histograms of two mixtures originating from four speech sources in Fig. 5.5, computed using the DUET-type method and the proposed one. In this example, the maximum time delay is larger than one sample ($\delta_{jmax}$ = 7.2 samples). The use of the whole set of TF points yields spurious peaks and, consequently, incorrect estimation of the mixing parameters, as shown in Fig. 5.5 (a). These spurious points can, however, be successfully eliminated using the proposed method, as presented in Fig. 5.5 (b), where each detected peak location labeled by the symbol "x" corresponds to one pair of mixing parameters. Note that, since $\hat{\delta}_{jmax}$ was computed as 7 samples by GCC-PHAT in this case, the frequency interval (0, 1120] Hz is used to confine the parameter space.

Figure 5.5: Comparison of histograms over the parameter space $(x_{at}, x_d)$ with (a) the whole set of binaural cues and (b) the selected binaural cues, where the symbol "x" marks the local peaks used to estimate the mixing parameters.
It is worthwhile to mention that, when $\delta_{jmax}$ is less than one sample, we can estimate the four sets of mixing parameters successfully with either method (i.e., the whole or the selected set of TF points).

5.5 Recovery of Source Signals

Based on the estimated mixing parameters, the mixing model in (5.2) can be written as
\[
\hat{x}[k,l] = \hat{A}[l]\, \hat{s}[k,l], \qquad (5.10)
\]
where $\hat{A}[l] \in \mathbb{C}^{M \times N}$ is the estimated mixing matrix. The next step is to find the estimates $\hat{s}$ of the original sources. However, they cannot be obtained directly, since the mixing matrix in (5.10) is underdetermined. That is, the mixing model has more unknowns ($N$) than equations ($M$) at each TF point, and the system (5.10) has an infinite number of possible solutions. Several mathematical techniques have been proposed to solve underdetermined systems of linear equations. Among them, the sparsity of the source vector has been widely and successfully exploited via the minimum-norm solution with an $\ell_1$-norm or $\ell_p$-norm criterion ($p < 1$). That is, to find a sparse solution, one can reformulate the problem in (5.10) as the optimization problem
\[
\min_{\hat{s}} \|\hat{s}\|_p \quad \text{subject to} \quad \hat{A}\hat{s} = \hat{x}, \qquad (5.11)
\]
where $0 < p \le 1$. It was shown in [64] that the $N$-dimensional vector $\hat{s}$ that solves (5.11) contains at least $N - M$ zeros if the columns of $\hat{A}$ are normalized. Based on this result, one can employ the combinatorial solution (CS) to solve (5.11) [63]. That is, one finds the set $\mathcal{A}$ that contains all $M \times M$ invertible submatrices of $\hat{A}$ and chooses the one that offers the solution with the minimum norm:
\[
\min \|B^{-1}\hat{x}\|_p, \quad B \in \mathcal{A}, \qquad (5.12)
\]
where $B \in \mathbb{C}^{M \times M} = [a_{i_1}, \cdots, a_{i_M}]$ and the number of possible $B$ equals $C_N^M$. For the problem in (5.12), Winter et al. [73] employed the $\ell_1$-norm constraint, while Saab et al. [63] showed experimentally that the separation performance can be improved with the $\ell_p$-norm ($p < 1$).

In Sec. 5.6, we compare the proposed technique with the CS technique and the FOCal Underdetermined System Solver (FOCUSS) algorithm [26]. Note that the FOCUSS algorithm is an iteratively reweighted norm minimization of the source vector, which yields a sparser solution with more than $N - M$ components of the vector $\hat{s}$ equal to zero.
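A minimal sketch of the combinatorial solution in (5.12) follows. Enumerating all $C_N^M$ submatrices is feasible here because $N$ and $M$ are small; the function name and the rank check are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def combinatorial_lp(A, x, p=0.4):
    """Sketch of the combinatorial solution (5.12): try every M x M
    submatrix B of the estimated mixing matrix A, solve B s_B = x, and
    keep the candidate with the smallest l_p norm. A : (M, N) complex;
    x : (M,) mixture at one TF point. The result has at most M nonzeros.
    """
    M, N = A.shape
    best_cols, best_sB, best_norm = None, None, np.inf
    for cols in combinations(range(N), M):
        B = A[:, cols]
        if np.linalg.matrix_rank(B) < M:       # skip singular submatrices
            continue
        s_B = np.linalg.solve(B, x)
        lp = np.sum(np.abs(s_B) ** p) ** (1.0 / p)
        if lp < best_norm:
            best_cols, best_sB, best_norm = cols, s_B, lp
    s = np.zeros(N, dtype=complex)             # remaining N - M entries stay zero
    s[list(best_cols)] = best_sB
    return s
```

Applying this at every TF point and inverting the STFT yields the time-domain source estimates; FOCUSS (sketched in Chapter 4) is the iterative alternative compared against it below.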
5.6 Experimental Results

The performance of the proposed technique based on the binaural cues of selected TF points is evaluated in this section. To measure the quality of the reconstructed sounds with respect to the originals, the performance metrics suggested in [67] were used, including the source-to-distortion ratio (SDR), the source-to-interference ratio (SIR), and the source-to-artifact ratio (SAR), expressed in decibels (dB). The SIR measures the interference due to sources other than the one being extracted, while the SAR measures the distortion due to algorithmic artifacts such as forced zeros at the TF points. It was observed in [67] that informal listening tests correlated well with the SIR and SAR measures. The SDR is a global performance measure that accounts for three types of distortion: interference, artifacts, and noise. A higher metric value means better reconstruction with less distortion.

Test signals were chosen from several excerpts of audio sounds, including male/female speech utterances [70], recordings of musical instruments [47], and environmental sounds [2]. All sounds used in the experiments were downsampled to 16,000 Hz with a length of 10 seconds. The frame size $L$ of the Hanning window was set to 512 samples, and the shifting interval of the frame was 256. We examined underdetermined mixtures with $N = 3$ and $M = 2$.

Figure 5.6: The configuration of loudspeakers and microphones in a room (loudspeakers at 30°, -5°, and -40°, 1 m from a pair of omni-directional microphones spaced 20 cm apart; room 5 m x 4 m with a height of 3 m; loudspeaker and microphone heights 1.6 m; other marked distances: 2.5 m and 2 m).

Stereo recordings of several sources were simulated by convolving the source signals with room impulse responses generated by the Roomsim toolbox [15]. The toolbox simulates the geometrical acoustics of a rectangular parallelepiped room volume using the image source model to produce an impulse response from each source to a receiver. The positions of the omnidirectional microphones and loudspeakers are illustrated in Fig. 5.6. In this configuration, the maximum time delay between microphones is larger than one sample. Note that the reflected source signals smear the mixture signals at the microphones across time; the amount of smearing is a function of the reverberation time, RT60.

5.6.1 Estimation of Mixing Parameters

Most techniques based on multiple microphones are valid only for microphone pairs with a spacing small enough to avoid spatial aliasing. As mentioned in Sec. 5.3, this assumption imposes several constraints on real-world applications. Fig. 5.7 and Fig. 5.8 illustrate contour maps of smoothed two-dimensional histograms for different microphone spacings, where three speech utterances were used to create the simulated mixtures. Each peak corresponds to one source, and the peak location is associated with that source's mixing parameters. Note that the maximum aliasing-free distance between two microphones is 2.15 cm at the sampling rate of 16 KHz, according to the condition in (5.3). Here, we compare the parameter estimation performance of the proposed algorithm with the DUET-type methods.

Figure 5.7: Contour maps of the binaural cues over the parameter space $(x_{at}, x_d)$ in an anechoic room environment, where three speech utterances were simulated for stereo mixtures. The whole set of binaural cues was used with microphone spacing equal to (a) 2 cm, (b) 8 cm, and (c) 20 cm, and a selected set of binaural cues was used with microphone spacing equal to (d) 2 cm, (e) 8 cm, and (f) 20 cm.

In Fig. 5.7, we present the binaural parameter space in the anechoic room environment; (a), (b), and (c) are constructed using the whole set of TF points, while (d), (e), and (f) result from the selected TF points. As the microphone spacing increases, the distance between clusters becomes larger. Thus, the proposed technique can easily locate the peaks of the histograms and estimate the mixing parameters accurately.
On the other hand, the spatial aliasing and the phase ambiguity make the peaks harder to localize and isolate, as shown in Fig. 5.7 (a), (b), and (c). Fig. 5.8 further evaluates the performance of the proposed approach in the echoic environment (RT60 = 113 ms). Due to the close peaks in Fig. 5.8 (d) and (e), which result from the reflected sounds of a direct-path source signal, the proposed technique fails to locate the peaks, while a longer distance between microphones helps find the peaks successfully, as shown in Fig. 5.8 (f). In contrast, Fig. 5.8 shows that the DUET-type methods yield spurious points and peaks in the parameter space, failing to form clusters in all reported cases of microphone spacing.

Figure 5.8: Contour maps of the binaural cues over the parameter space $(x_{at}, x_d)$ in an echoic room environment (RT60 = 113 ms), where three speech utterances were simulated for stereo mixtures. The whole set of binaural cues was used with microphone spacing of (a) 2 cm, (b) 8 cm, and (c) 20 cm, and only the selected binaural cues were used with microphone spacing of (d) 2 cm, (e) 8 cm, and (f) 20 cm.

5.6.2 Recovery of Source Signals

We conducted audio source separation experiments with simulated room recordings of several sources based on the room configuration in Fig. 5.6. Source recovery algorithms that solve the optimization problem in (5.11) find the sparsest $\hat{s}$ at each TF point based on the $\ell_1$-norm or $\ell_p$-norm criterion ($p < 1$), as discussed in Sec. 5.5. To understand the sensitivity to the parameter $p$ in the minimum-norm solution, we plot the SDR performance curves as a function of $p$ in Fig. 5.9. In both the anechoic and echoic environments, the separation performance is improved when one uses the $\ell_p$-norm criterion with $0.1 \le p < 0.9$ as compared to the $\ell_1$-norm case. These results are consistent with the empirical probabilistic interpretation in [63]. Note that the $p > 1$ cases do not provide sparse solutions and, hence, the corresponding performance is poor.

Figure 5.9: The average SDR over mixtures of three speech sources as a function of $p$: (a) anechoic and (b) echoic environment (RT60 = 113 ms).

Next, we compare the performance of source separation algorithms using features obtained from the whole or the selected binaural cues, and the CS and FOCUSS algorithms for source recovery. Three test conditions were considered under a quiet-room acoustic environment with RT60 = 44 ms:

• mixtures of three male speech utterances;
• mixtures of female speech, piano, and guitar sounds;
• mixtures of female speech, cello, and train sounds.

The separation performance metrics in terms of the SDR, SIR, and SAR values are shown in Tables 5.1-5.3, respectively, where each performance metric is computed as the average over all three extracted signals. The poor performance of the DUET-type approach, which uses the whole set of binaural cues, is due to incorrect estimation of the mixing parameters. In contrast, the proposed technique based on selected binaural cues can estimate the mixing parameters successfully and thus achieves better performance in source recovery. These results indicate that the performance in source recovery is highly dependent on successful estimation of the mixing parameters.

Table 5.1: Source separation example with stereo mixtures of three male speech utterances, $\delta_{jmax} > 1$ (values in dB).

                          Whole binaural cues        Selected binaural cues
  Algorithm               SDR      SIR      SAR      SDR      SIR      SAR
  CS with p = 1.0        -4.84     7.58    27.35     2.58     4.40    13.05
  CS with p = 0.4         0.41    10.14    24.57     8.23    15.81     9.74
  FOCUSS with p = 0.4    -1.99     1.33     3.61     5.26    15.99     5.76

Furthermore, we see that the CS algorithm with p = 0.4 yields the best performance in Tables 5.1-5.3. Although FOCUSS yields more zero components in the vector $\hat{s}$ than the CS algorithm, FOCUSS tends to produce numerical artifacts.
These artifacts can be measured by the SAR, thus leading to poorer performance in both the SAR and the SDR.

Table 5.2: Source separation example with stereo mixtures of a female speech utterance, piano, and guitar sound, $\delta_{jmax} > 1$ (values in dB).

                          Whole binaural cues        Selected binaural cues
  Algorithm               SDR      SIR      SAR      SDR      SIR      SAR
  CS with p = 1.0        -3.37    -3.07    13.46     5.05    10.68     8.39
  CS with p = 0.4        -3.79    -3.40    12.43     6.08    13.24     8.13
  FOCUSS with p = 0.4    -4.39    -3.03     6.15     3.55    12.40     4.54

Table 5.3: Source separation example with stereo mixtures of a female speech utterance, cello, and train sound, $\delta_{jmax} > 1$ (values in dB).

                          Whole binaural cues        Selected binaural cues
  Algorithm               SDR      SIR      SAR      SDR      SIR      SAR
  CS with p = 1.0        -0.52     6.15     9.05     2.30     7.25     6.83
  CS with p = 0.4         0.05     3.92     8.85     4.73    11.00     7.53
  FOCUSS with p = 0.4    -1.33     4.35     4.83     0.82     2.95     5.74

For the source separation results obtained from FOCUSS, our own listening experience indicates that there is little interference from other sources, but there exists some artificial distortion known as the "musical noise" artifact.

Finally, Table 5.4 reports the separation performance of the proposed technique in different acoustic environments. All results were obtained by running the algorithm on mixtures of three speech utterances, namely female (English), male (English), and female (Japanese) sounds. The performance degrades with a higher room reverberation value, RT60. The reported results indicate that the best performance occurs for the CS method with p = 0.4, where all extracted sources were recovered successfully and the utterances could be discerned without difficulty.

Table 5.4: Source separation examples with stereo mixtures of three speech utterances, $\delta_{jmax} > 1$ (values in dB).

                     RT60 = 0 ms             RT60 = 73 ms            RT60 = 113 ms
  Algorithm          SDR    SIR    SAR       SDR    SIR    SAR       SDR    SIR    SAR
  CS, p = 1.0        2.26   4.06  12.16      0.70   2.95   8.39     -1.34   1.94   6.20
  CS, p = 0.4        7.91  15.62   9.48      5.21  14.02   6.45      2.80  11.96   3.83
  FOCUSS, p = 0.4    5.30  15.89   5.82      0.41   5.01   5.92      0.27   9.97   1.75

5.7 Conclusion

We examined the main assumptions and limitations of DUET-type methods for underdetermined convolutive audio source separation and presented a new technique to overcome their limitations. Instead of using the whole set of TF points for the estimation of mixing parameters, the proposed approach efficiently extracts reliable TF points with no phase ambiguity under the assumption of source sparsity. These features help form obvious peaks in a smoothed two-dimensional histogram of binaural cues, even under the condition of spatial aliasing. This leads to an accurate estimate of the mixing parameters, which in turn yields good performance in the recovery of the source signals.

Chapter 6

Conclusion and Future Work

6.1 Summary of the Results

Audio source separation, given single or multiple acoustic mixtures of source sounds, was studied. For single-channel audio source separation, we examined sparse music representation based on source-specific dictionaries by exploiting the different characteristics of audio sources. In the case of multichannel audio source separation, multiple microphones were used to compensate for the effects of ambient noise by exploiting spatial information in the sound field.

In Chapter 3, a systematic way to find an efficient representation of harmonic music signals based on source-specific dictionaries was presented. The proposed approach learns the essential features of music signals by modeling their basic components, i.e., musical notes, using source-independent atoms.
After finding prioritized atoms according to their approximation capability for the music signals of interest, we synthesized new atoms to construct compact learned dictionaries. Due to the efficiency of these source-specific atoms, the number of atoms needed to represent music signals is much smaller compared with that of the Gabor dictionary, resulting in a lower-complexity algorithm. The proposed scheme was applied to music signal separation given a single mixture of multiple sounds.

In Chapter 4, we investigated the noise effect on multichannel audio source separation with the instantaneous mixing model. Under noisy conditions, several basic assumptions required by SCA are violated. As a result, there were errors in the estimation of the mixing parameters, which rendered the use of $\ell_p$-norm minimization improper. An enhanced technique was proposed to address the problem by employing weighted soft-assignment clustering and generalized $\ell_p$-norm minimization with regularization. The technique yields more robust and sparser solutions in a noisy environment than SCA-based algorithms. Besides, we extended the single-channel sparse signal representation technique based on source-specific dictionaries to multichannel audio source separation to extract music signals from stereo-channel mixtures. It was shown by simulation that the proposed technique is more robust for mixing systems with noisy data.

In Chapter 5, we examined the main assumptions and limitations of DUET-type methods for multichannel convolutive audio source separation and presented a new technique to overcome their limitations. Instead of using the whole set of TF points for the estimation of mixing parameters, our approach efficiently extracts reliable TF points with no phase ambiguity under the assumption of source sparsity. These features help form obvious peaks in a smoothed two-dimensional histogram of binaural cues, even under the condition of spatial aliasing. This leads to an accurate estimate of the mixing parameters, which in turn yields good performance in the recovery of source signals.

6.2 Future Research Direction

To make our work more complete, we would like to extend the current results along the following two directions.

• From Gabor atoms to chirp atoms. Due to complex phenomena, it is difficult to derive a compact representation of oscillatory signals using Gabor atoms. Instead, we need several constant-frequency Gabor atoms of smaller scales to represent time-varying frequency components such as the oscillation of partials in a note. Chirp atoms have been proposed to deal with the nonstationary behavior of signals [14], and have been used specifically for vibrato sounds [28]. The chirplet decomposition provides information about the local structure of a signal in terms of scale $s$, time shift $u$, frequency shift $\xi$, and chirp rate $c$ [14]:
\[
g_{s,u,\xi,c}(t) = \frac{1}{\sqrt{s}}\, g\!\left(\frac{t-u}{s}\right) e^{j\{\xi(t-u)+0.5c(t-u)^2\}}, \qquad (6.1)
\]
where the instantaneous frequency $\omega(t) = \xi + c(t-u)$ varies linearly with time. When the chirp rate $c = 0$, the chirp atom is exactly the Gabor atom. Given a signal with vibrato, the chirplet decomposition can reconstruct the original signal well with just a few chirp atoms, thus providing a compact representation. Therefore, we can use chirp atoms, instead of Gabor atoms, in the first module of the proposed representation scheme in Chapter 3 to learn the inherent structure of oscillatory notes. Since the increasing and decreasing parts of oscillatory partials are decomposed based on their temporal position, the position parameter $u$ of the chirp atom should be learned accordingly.
Notethatweconstructedsource-speci¯catoms by setting u = N=2 in Chapter 3. After learning, the source-speci¯c dictionary constructed using chirp atoms is expected to contain a small number of atoms due to the ¯xed number of notes. A modi¯ed matching pursuit algorithm, called ridge pursuit, was introduced in [28] as an e®ective tool for signal decomposition with chirp atoms. Currently, we investigate the problem of e±cient modeling of vibrato sounds using chirp atoms. ² An array of microphones with larger than stereo channels As a natural extension of Chapter 5, it may be advantageous to apply the proposed framework to an array of microphones with M ¸ 2 mixtures and incorporate the beamforming technique for further improvement of the audio source separation performance. The array of microphones allows to leverage of the spatial location of sources for better audio source separation. 110 Bibliography [1] The acoustic drum sequences. [Online]. Available: http://www.cs.tut.¯/»tuomasv [2] The BBC sound e®ects library. [Online]. Available: http://www.sound- ideas.com/bbc.html [3] J. S. Bach, Inventions and Sinfonias, BWV 772{786 by R. Stahlbrand (piano), recorded in 2007. [4] Microphone array support coming with Windows Vista. [Online]. Available: http://research.microsoft.com/en-us/projects/microphone-array [5] The university of Iowa musical instrument samples database. [Online]. Available: http://theremin.music.uiowa.edu [6] S. A. Abdallah and M. D. Plumbley, \Unsupervised analysis of polyphonic music using sparse coding," IEEE Trans. Neural Netw., vol. 17, pp. 179{196, 2006. [7] S. Araki, H. Sawada, R. Mukai, and S. Makino, \Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors," Signal Process., vol. 87, pp. 1833{1847, 2007. [8] S. Arberet, R. Gribonval, and F. Bimbot, \A robust method to count and locate au- diosourcesinastereophoniclinearinstantaneousmixture,"inProc.ofInt.Workshop on Independent Compon. Analysis Blind Signal Separation (ICA2006), Charleston, SC, Mar. 2006, pp. 536{543. [9] H. B. Barlow, Possible principles underlying the transformations of sensory mes- sages. Cambridge, MA: MIT Press, 1961. [10] A. Bell and T. Sejnowski, \An information-maximization approach to blind separa- tion and blind deconvolution," Neural Computat., vol. 7, pp. 1129{1159, 1995. [11] J. Blauert, Spatial Hearing { The Psychophysics of Human Sound Localization. Cambridge, MA: MIT Press, 1997. [12] T.BlumensathandM.Davies,\Sparseandshift-invariantrepresentationsofmusic," IEEE Trans. Audio, Speech, and Language Process., vol. 14, pp. 50{57, Jan. 2006. [13] P. Bo¯ll and M. Zibulevsky, \Blind separation of more sources than mixtures using sparsity of their short-time fourier transform," in Int. Workshop on Independent Compon. Analysis Blind Signal Separation (ICA2003). 111 [14] A. Bultan, \A four-parameter atomic decomposition of chirplets," IEEE Trans. Sig- nal Process., vol. 47, pp. 731{745, 1999. [15] K. P. D. Campbell and G. Brown, \A Matlab simulation of shoebox room acoustics foruseinresearchandtesting," Computing and Inf. Syst. J., vol.9, pp.48{51, 2005. [16] M. A. Casey and A. Westner, \Separation of mixed audio sources by independent subspace analysis," in Proc. Int. Comp. Music Conf., Berlin, Germany, 2000, pp. 154{161. [17] J. Chen, J. Benesty, and Y. A. Huang, \Time delay estimation in room acoustic environments: an overview," EURASIP J. Appl. Signal Process., vol. 2006, pp. 1{ 19, 2006. [18] S. S. Chen, D. L. Donoho, and M. A. 
Bibliography

[1] The acoustic drum sequences. [Online]. Available: http://www.cs.tut.fi/~tuomasv
[2] The BBC sound effects library. [Online]. Available: http://www.sound-ideas.com/bbc.html
[3] J. S. Bach, Inventions and Sinfonias, BWV 772-786, by R. Stahlbrand (piano), recorded in 2007.
[4] Microphone array support coming with Windows Vista. [Online]. Available: http://research.microsoft.com/en-us/projects/microphone-array
[5] The University of Iowa musical instrument samples database. [Online]. Available: http://theremin.music.uiowa.edu
[6] S. A. Abdallah and M. D. Plumbley, "Unsupervised analysis of polyphonic music using sparse coding," IEEE Trans. Neural Netw., vol. 17, pp. 179-196, 2006.
[7] S. Araki, H. Sawada, R. Mukai, and S. Makino, "Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors," Signal Process., vol. 87, pp. 1833-1847, 2007.
[8] S. Arberet, R. Gribonval, and F. Bimbot, "A robust method to count and locate audio sources in a stereophonic linear instantaneous mixture," in Proc. Int. Workshop on Independent Compon. Analysis Blind Signal Separation (ICA2006), Charleston, SC, Mar. 2006, pp. 536-543.
[9] H. B. Barlow, Possible Principles Underlying the Transformations of Sensory Messages. Cambridge, MA: MIT Press, 1961.
[10] A. Bell and T. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computat., vol. 7, pp. 1129-1159, 1995.
[11] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge, MA: MIT Press, 1997.
[12] T. Blumensath and M. Davies, "Sparse and shift-invariant representations of music," IEEE Trans. Audio, Speech, and Language Process., vol. 14, pp. 50-57, Jan. 2006.
[13] P. Bofill and M. Zibulevsky, "Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform," in Int. Workshop on Independent Compon. Analysis Blind Signal Separation (ICA2003).
[14] A. Bultan, "A four-parameter atomic decomposition of chirplets," IEEE Trans. Signal Process., vol. 47, pp. 731-745, 1999.
[15] K. P. D. Campbell and G. Brown, "A Matlab simulation of shoebox room acoustics for use in research and testing," Computing and Inf. Syst. J., vol. 9, pp. 48-51, 2005.
[16] M. A. Casey and A. Westner, "Separation of mixed audio sources by independent subspace analysis," in Proc. Int. Comp. Music Conf., Berlin, Germany, 2000, pp. 154-161.
[17] J. Chen, J. Benesty, and Y. A. Huang, "Time delay estimation in room acoustic environments: an overview," EURASIP J. Appl. Signal Process., vol. 2006, pp. 1-19, 2006.
[18] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, pp. 33-61, 1999.
[19] N. Cho, Y. Shiu, and C.-C. J. Kuo, "An improved technique for blind audio source separation," in Proc. IEEE Int. Conf. on Intelligent Information Hiding and Multimedia Signal Process., Pasadena, CA, Dec. 2006, pp. 525-528.
[20] N. Cho, Y. Shiu, and C.-C. J. Kuo, "Audio source separation with matching pursuit and content-adaptive dictionaries (MP-CAD)," in IEEE Workshop on Applications of Signal Process. Audio Acoust., New Paltz, NY, 2007, pp. 287-290.
[21] N. Cho, Y. Shiu, and C.-C. J. Kuo, "Efficient music representation with content-adaptive dictionaries," in IEEE Int'l Symposium on Circuits and Systems, Seattle, WA, May 2008, pp. 3254-3257.
[22] G. M. Davis, S. Mallat, and M. Avelanedo, "Adaptive greedy approximations," J. Construct. Approx., vol. 13, pp. 57-98, 1997.
[23] D. FitzGerald, "Automatic drum transcription and source separation," Ph.D. dissertation, Dublin Inst. Technol., Dublin, Ireland, 2004.
[24] W. G. Gardner and K. D. Martin, "HRTF measurements of a KEMAR," J. Acoust. Soc. Am., vol. 97, pp. 3907-3908, 1995.
[25] M. M. Goodwin and M. Vetterli, "Matching pursuit and atomic signal models based on recursive filter banks," IEEE Trans. Signal Process., vol. 47, pp. 1890-1902, Jul. 1999.
[26] I. F. Gorodnitsky and B. Rao, "Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm," IEEE Trans. Signal Process., vol. 45, pp. 600-616, 1997.
[27] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Music genre database and musical instrument sound database," in Int. Conf. Music Inf. Retrieval, Washington, DC, Oct. 2003, pp. 229-230.
[28] R. Gribonval, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps," IEEE Trans. Signal Process., vol. 49, pp. 994-1001, May 2001.
[29] R. Gribonval, "Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture," in IEEE Int. Conf. Audio, Speech, Signal Process., Orlando, FL, May 2002, pp. 3057-3060.
[30] R. Gribonval and E. Bacry, "Harmonic decomposition of audio signals with matching pursuit," IEEE Trans. Signal Process., vol. 51, pp. 101-111, Jan. 2003.
[31] R. Gribonval and S. Lesage, "A survey of sparse component analysis for source separation: principles, perspectives, and new challenges," in 14th European Symposium on Artificial Neural Networks, Bruges, Belgium, Apr. 2006, pp. 323-330.
[32] P. C. Hansen and D. P. O'Leary, "The use of the L-curve in the regularization of discrete ill-posed problems," SIAM J. Sci. Comput., vol. 14, pp. 1487-1503, 1993.
[33] A. Hyvarinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Netw., vol. 13, pp. 411-430, 2000.
[34] G. J. Jang and T. W. Lee, "A probabilistic approach to single channel blind signal separation," in Advances in Neural Information Processing Systems, Vancouver, British Columbia, Canada, Dec. 2002, pp. 1178-1180.
[35] J. Huang, M. Epstein, and M. Matassoni, "Effective acoustic adaptation for a distant-talking interactive TV system," in Interspeech, Brisbane, Australia, Sep. 2008, pp. 1709-1712.
[36] L. K. Jones, "On a conjecture of Huber concerning the convergence of projection pursuit regression," Ann. Statist., vol. 15, pp. 880-882, 1987.
[37] P. Jost, P. Vandergheynst, and P. Frossard, "Tree-based pursuit: Algorithm and properties," IEEE Trans. Signal Process., vol. 54, pp. 4685-4697, Dec. 2006.
[38] J. Karhunen, A. Hyvarinen, and E. Oja, Independent Component Analysis. John Wiley & Sons, 2001.
Kearns, Y. Mansour, and A. Y. Ng, \An information-theoretic analysis of hard and soft assignment methods for clustering," in Proc. of the Thirteenth Conf. on Uncertainty in Arti¯cial Intelligence, 1997, pp. 282{293. [40] P. Kisilev, M. Zibulevsky, and Y. Y. Zeevi, \A multiscale framework for blind sepa- ration of linearly mixed signals," Journal of Machine Learning Research, vol. 4, pp. 1339{1363, 2003. [41] A. Klapuri and M. Davy, Signal processing methods for music transcription. New York: Springer-Verlag, 2006. [42] C. H. Knapp and G. Carter, \The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Process., vol. 24, pp. 320{327, 1976. 113 [43] K. Kreutz-Delgado, J. F. Murray, B. Rao, K. Engan, T.-W. Lee, and T. Se- jnowski, \Dictionary learning algorithms for sparse representation," Neural Com- putat., vol. 15, pp. 349{396, 2003. [44] D. D. Lee and H. S. Seung, \Algorithms for nonnegative matrix factorization," in Neural Inf. Process. Syst., 2001, pp. 556{562. [45] T.-W. Lee and M. Lewicki, \The generalized gaussian mixture model using ica," in Int. Workshop on Independent Compon. Analysis Blind Signal Separation (ICA2000), Helsinki, Finland, Jun. 2000, pp. 239{244. [46] S. Lesage, S. Krstulovic, and R. Gribonval, \Under-determined source separation: comparison of two approaches based on sparse decompositions," in Proc. Int. Conf. Independent Compon. Anal. Blind Signal Separation, Mar. 2006, pp. 633{640. [47] P. Leveau, L. Daudet, and G. Richard, \Methodology and tools for the evaluation of automatic onset detection algorithms in music," in Int. Conf. Music Inf. Retrieval, Barcelona, Spain, 2004, pp. 72{75. [48] P. Leveau, E. Vincent, G. Richard, and L. Daudet, \Instrument-speci¯c harmonic atoms for mid-level music representation," IEEE Trans. Audio, Speech, and Lan- guage Process., vol. 16, pp. 116{128, Jan. 2008. [49] M. S. Lewicki, \E±cient coding of natural sounds," Nature Neurosci., vol. 5, pp. 356{363, Apr. 2002. [50] M. S. Lewicki and T. J. Sejnowski, \Learning overcomplete representations," Neural Comput., vol. 12, pp. 337{365, 2000. [51] Y. Li, S. Amari, A. Cichocki, D. Ho, and S. Xie, \Underdetermined blind source separation based on sparse representation," IEEE Trans. Signal Process., vol. 54, pp. 423{437, 2006. [52] S. G. Mallat and Z. Zhang, \Matching pursuit with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, pp. 3397{3415, Dec. 1993. [53] I. A. McCowan, \Robust speech recognition using microphone arrays," Ph.D. dis- sertation, Queensland University of Technology, Australia, 2001. [54] T. K. Moon and W. C. Stirling, Mathematical methods and algorithms for signal processing. Prentice-Hall, 2000. [55] P. D. O'Grady and B. A. Pearlmutter, \Soft-lost: Em on a mixture of oriented lines," in Int. Conf. on Independent Compon. Analysis, Granada, Spain, Sep. 2004, pp. 428{435. [56] D.P.O'Leary,\Near-optimalparametersfortikhonovandotherregularizationmeth- ods," SIAM J. Sci. Comput., vol. 23, pp. 1161{1171, 2001. [57] B. A. Olshausen and D. J.Field, \Emergenceof simple-cell receptive¯eld properties by learning a sparse code for natural images," Nature, vol. 13, pp. 607{609, 1996. 114 [58] F. Opolkoand J. Wapnick, \McGilluniversitymaster samples," McGill Univ., Mon- treal, QC, Canada, Tech. Rep., 1987. [59] B. Rao and K. Kreutz-Delgado, \An a±ne scaling methodology for best basis selec- tion," IEEE Trans. Signal Process., vol. 47, pp. 187{200, 1999. [60] C. Roads, The computer music tutorial. 
Cambridge, MA: MIT Press, 1998. [61] S. Roweis, \One microphone source separation," Proc. Neural Inf. Proc. Syst. (NIPS), vol. 13, pp. 793{799, 2000. [62] R.Saab,O.Yilmaz,M.J.McKeown,andR.Abugharbieh,\Underdeterminedsparse blind source separation with delays," in Workshop on Signal Process. with Adaptive SparseStructuredRepresentation(SPARS05),Rennes,France,Nov.2005,pp.67{70. [63] ||, \Underdetermined anechoic blind source separation via ` q -basis-pursuit with q <1," IEEE Trans. Signal Process., vol. 55, pp. 4004{4017, 2007. [64] I. Takigawa, M. Kudo, and J. Toyama, \Performance analysis of minimum ` 1 - norm solutions for underdetermined source separation," IEEE Trans. Signal Pro- cess., vol. 52, pp. 582{591, 2004. [65] J. A. Tropp, \Greed is good: algorithmic results for sparse approximation," IEEE Trans. Inf. Theory, vol. 50, pp. 2231{2242, Oct. 2004. [66] C. Uhle, C. Dittmar, and T. Sporer, \Extraction of drum tracks from polyphonic music using independent subspace analysis," in Proc. 4th Int. Symp. Independent Compon. Anal. Blind Signal Separation, Nara, Japan, 2003, pp. 843{848. [67] E.Vincent,R.Gribonval,andC.Fevotte,\Performancemeasurementinblindaudio source separation," IEEE Trans. Audio, Speech, and Language Process., vol. 14, pp. 1462{1469, 2006. [68] E. Vincent and X. Rodet, \Music transcription with ISA and HMM," in Proc. Int. Conf.IndependentCompon.Anal.BlindSignalSeparation,Sep.2004,pp.1197{1204. [69] ||, \Underdetermined source separation with structured source priors," in Proc. Int. Conf. Independent Compon. Anal. Blind Signal Separation, Sep. 2004, pp. 327{ 332. [70] E. Vincent, H. Sawada, P. Bo¯ll, S. Makino, and J. P. Rosca, \First stereo audio source separation evaluation campaign: data, algorithm and results," in Int. Conf. Independent Compon. Analysis and Signal Separation, London, UK, 2007, pp. 552{ 559. [71] T. Virtanen, \Monaural sound source separation by nonnegative matrix factoriza- tion with temporal continuity and sparse criteria," IEEE Trans. Audio, Speech, and Language Process., vol. 15, pp. 1066{1074, Mar. 2007. 115 [72] T. Virtanen and A. Klapuri, \Separation of harmonic sound sources using sinusoidal modeling," in IEEE Int. Conf. Audio, Speech, Signal Process., Istanbul, Turkey, 2000, pp. 765{768. [73] S. Winter, W. Kellermann, H. Sawada, and S. Makino, \MAP-based underdeter- mined blind source separation of convolutive mixtures by hierarchical clustering and ` 1 -norm minimization," EURASIP J. Adv. in Signal Process., 2007. [74] O. Yilmaz and S. Rickard, \Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, pp. 1830{1847, 2004. [75] T. Zhang and C.-C. J. Kuo, \Audio content analysis for online audiovisual data segmentation and classi¯cation," IEEE Trans. Speech Audio Process., vol. 9, pp. 441{457, May 2001. [76] M. Zibulevsky and B. A. Pearlmutter, \Blind source separation by sparse decompo- sition in a signal dictionary," Neural Computat., vol. 13, pp. 863{882, 2001. 116
Abstract
Several audio source separation techniques, which aim to recover the original sources from their acoustic mixtures, are proposed in this research. Two mixing scenarios are considered, distinguished by the number of microphones: single-channel and multichannel settings. Since no spatial cues are available in a single-channel observation, we exploit the distinct characteristics of the audio sources themselves. In the case of multichannel mixtures, the spatial information embedded in the sound field enables us to estimate the mixing system by localizing the sound sources.
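For the multichannel setting, one concrete spatial cue is the inter-channel time delay. The following minimal Python sketch (not the method developed in this dissertation) shows how such a delay can be recovered from a synthetic anechoic two-microphone mixture of a single source by peak-picking the cross-correlation; the sampling rate, delay, and attenuation values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
fs = 16000                     # sampling rate in Hz (assumed)
true_delay = 12                # delay of mic 2 relative to mic 1, in samples

# Synthesize a 1-second noise-like source and an anechoic stereo mixture.
src = rng.standard_normal(fs)
mic1 = src
mic2 = 0.8 * np.roll(src, true_delay)   # attenuated, delayed copy (circular shift)

# The cross-correlation peaks at the lag where the two channels align.
corr = np.correlate(mic2, mic1, mode="full")
lags = np.arange(-len(mic1) + 1, len(mic2))
est_delay = lags[np.argmax(corr)]

print(f"true delay: {true_delay} samples, estimated: {est_delay} samples")

In a reverberant room, a plain correlation peak is smeared by reflections; variants such as the generalized cross-correlation with phase transform (GCC-PHAT) are commonly used instead.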
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Statistical inference for dynamical, interacting multi-object systems with emphasis on human small group interactions
Data-driven methods in description-based approaches to audio information processing
Classification and retrieval of environmental sounds
Digital signal processing techniques for music structure analysis
Understanding sources of variability in learning robust deep audio representations
Modeling expert assessment of empathy through multimodal signal cues
Hybrid methods for music analysis and synthesis: audio key finding and automatic style-specific accompaniment
Statistical enhancement methods for immersive audio environments and compressed audio
Biologically inspired auditory attention models with applications in speech and audio processing
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
Representation, classification and information fusion for robust and efficient multimodal human states recognition
Boundary layer and separation control on wings at low Reynolds numbers
Block-based image steganalysis: algorithm and performance evaluation
Efficient coding techniques for high definition video
Automatic quantification and prediction of human subjective judgments in behavioral signal processing
Human behavior understanding from language through unsupervised modeling
Enhancing speech to speech translation through exploitation of bilingual resources and paralinguistic information
Multimodality, context and continuous dynamics for recognition and analysis of emotional states, and applications in healthcare
Emotional speech production: from data to computational models and applications
Advanced features and feature selection methods for vibration and audio signal classification
Asset Metadata
Creator
Cho, Namgook (author)
Core Title
Source-specific learning and binaural cues selection techniques for audio source separation
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
09/09/2009
Defense Date
08/31/2009
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
audio source separation, musical signal processing, OAI-PMH Harvest, room acoustics, source-specific signal processing, sparse signal representation, underdetermined mixing
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Chen, Liang (committee member), Narayanan, Shrikanth S. (committee member)
Creator Email
namgookc@gmail.com, namgookc@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2594
Unique identifier
UC1287281
Identifier
etd-Cho-3258 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-258996 (legacy record id), usctheses-m2594 (legacy record id)
Legacy Identifier
etd-Cho-3258.pdf
Dmrecord
258996
Document Type
Dissertation
Rights
Cho, Namgook
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu