Matrix Factorization for Noise-Robust Representation
of Speech Data
A Dissertation
Presented to
The Faculty of the USC Graduate School
By
Colin Vaz
In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
in the
Ming Hsieh Department of Electrical and Computer Engineering
UNIVERSITY OF SOUTHERN CALIFORNIA
May 10, 2019
ABSTRACT
Noise is usually present in collected speech data, and its presence can affect subsequent processing of the
data. Furthermore, speech is often processed for different end goals, and noise can affect the results differ-
ently depending on the end goals. In my thesis, I propose several matrix factorization algorithms that are
noise-robust and facilitate different end goals. On a speech denoising task, I use Complex Matrix Factor-
ization with spectral and temporal regularization to remove MRI acoustic noise from speech recordings. I
show that this algorithm achieves greater noise suppression and lower spectral distortion compared to state-
of-the-art methods. I also introduce the Complex Beta Divergence that generalizes the Beta Divergence
to complex-valued scalars. It uses a real-valued parameter, beta, to control the penalty placed on errors
in estimation. Using this divergence, I propose the Beta Complex Matrix Factorization algorithm and de-
rive update rules with convergence guarantees to a stationary point. I show that different settings of beta
improve the performance of speech denoising for different metrics, such as spectral distortion and speech
quality. Moreover, I show on audio recorded in an MRI scanner that Beta CMF improves performance over
Euclidean CMF on several denoising metrics. I also propose a Non-negative Matrix Factorization algorithm
to estimate noise-robust activation matrices. On the noisy Aurora 4 dataset, I extract the activation matrices
and use them as acoustic features for an Automatic Speech Recognition task. Results show a 4.2% relative
improvement in word error rate compared to baseline log-mel features. Finally, I describe the Convolutive
Convex Hull Factorization of a Matrix algorithm I proposed for uncovering temporal patterns from multi-
variate time-series data. I show that this algorithm uncovers a richer and more meaningful set of vocal tract
movements from vocal tract contours of continuous speech compared to a previously-proposed algorithm for
this task.
ACKNOWLEDGEMENTS
I first want to thank my parents, brother, and sisters for their enduring love and support. From a young age,
my parents fostered my curiosity and encouraged me to learn new things, whether it was playing sports, learning
a musical instrument, or writing novels. At the same time, they allowed me to make mistakes and, more
importantly, taught me how to learn and grow from my failures. I firmly believe that this strong, nurturing
foundation gave me the drive to pursue a Ph.D., to constantly ask questions and explore new ideas, to have
the mental fortitude to endure setbacks, and to make the most of my journey. And what a journey it has
been! I cannot thank them enough for being so supportive and making my journey so rewarding.
My thesis would not have been possible without the support and mentorship of my advisor, Dr. Shrikanth
Narayanan. His forward-thinking ideas encouraged me to think outside of the box, to apply my knowledge
in a variety of topics, and taught me to keep the big picture in mind while digging into the nitty gritty details.
His energy inspired me to push beyond my comfort zone and to see challenges for the opportunities they
present. Thank you, Shri, for giving me the opportunity to pursue a Ph.D. at SAIL.
I also want to thank the other members of my Qualification Exam and Defense committees: Dr. Louis
Goldstein, Dr. Keith Jenkins, Dr. Antonio Ortega, and Dr. Mahdi Soltanolkotabi. I especially want to thank
Dr. Soltanolkotabi for providing me with valuable technical advice and for being so approachable. I also
want to thank Dr. Asterios Toutios for his leadership in the SPAN group at SAIL, and to him, Dr. Dani
Byrd, Dr. Goldstein, and the other students in the SPAN group for their technical advice and feedback.
Much of my thesis has its roots in SPAN, and my thesis would not have taken shape without their guidance
and critical ears.
Finally, I want to thank my colleagues at SAIL. The many discussions I’ve had over the years helped me
grow as a researcher and a mentor. Plus, I won’t forget the great times at SAIL music nights. I also want to
give a shout out to my friends for making my time in Los Angeles an incredibly vibrant experience. This
journey would not have been possible without them.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures

Chapter 1: Introduction
    1.1 Notation

Chapter 2: Matrix Factorization Overview
    2.1 Non-negative Matrix Factorization
    2.2 Beta NMF
    2.3 Convolutive Non-negative Matrix Factorization
    2.4 Convex Hull Non-negative Matrix Factorization
    2.5 Complex Matrix Factorization

Chapter 3: Speech Denoising
    3.1 Prior Work
    3.2 MRI Noise
    3.3 Denoising Algorithm
        3.3.1 Algorithm Overview
        3.3.2 Temporal Regularization
        3.3.3 Spectral Regularization
        3.3.4 Update Equations
    3.4 Experimental Evaluation
        3.4.1 Datasets
        3.4.2 Other Denoising Algorithms
        3.4.3 Quantitative Performance Metrics
        3.4.4 Qualitative Performance Metrics
    3.5 Analysis of Regularization Parameters
        3.5.1 Spectral Regularization
        3.5.2 Temporal Regularization
        3.5.3 Discussion
    3.6 Results and Discussion
        3.6.1 Objective Results
        3.6.2 Listening Test Results
    3.7 Future Work

Chapter 4: Complex Beta Divergence
    4.1 Notation and Complex Number Properties
    4.2 Standard Beta Divergence
    4.3 Young's Inequality for Complex Values
    4.4 Complex Beta Divergence
        4.4.1 Complex Beta Divergence Properties
    4.5 Discussion
    4.6 Future Work

Chapter 5: Beta Complex Matrix Factorization
    5.1 Beta Complex Matrix Factorization Algorithm
    5.2 Experiments and Results
        5.2.1 Synthetic Data
        5.2.2 MRI Speech Recording Data

Chapter 6: Robust Recovery of Latent Structure
    6.1 Joint Filtering–Factorization Algorithm
        6.1.1 Update equations
    6.2 Experiments and Results
    6.3 Discussion
    6.4 Future Work

Chapter 7: Noise-Robust Acoustic Features for Automatic Speech Recognition
    7.1 Prior Work
    7.2 Algorithm for Creating Noise-Robust Acoustic Features
        7.2.1 Step 1: Learn a speech basis
        7.2.2 Step 2: Learn a noise basis
        7.2.3 Step 3: Learn a time-varying projection
        7.2.4 Step 4: Extract acoustic features
    7.3 ASR Experiment
    7.4 Discussion
    7.5 Future Work

Chapter 8: Extracting Temporal Patterns from Time-Series Data
    8.1 CHARM Algorithm
    8.2 Experiments
        8.2.1 Synthetic data
        8.2.2 Vocal tract data
    8.3 Future Work

Chapter 9: Conclusion

Appendix A: Derivation of Denoising Algorithm Update Equations
    A.1 Basis update equations
    A.2 Time-activation update equations

Appendix B: Complex Beta Divergence Proof

References
LIST OF TABLES
3.1  Description of common rtMRI (seq1, seq2, seq3, ga21, ga55, mult) and static 3D (st3d) pulse sequences.
3.2  Key variables.
3.3  Number of data points for the listening test for each dataset and pulse sequence noise.
3.4  Parameter settings for the number of speech dictionary elements (n_s) and wavelet packet depth (D) in the 2step algorithm. The number of noise dictionary elements was set to 70 and the window length for wavelet packet analysis was set to 2048 for all noises. See [1] for more information about the 2step parameters.
3.5  NS results (dB) for the MRI-utt dataset.
3.6  NS, LLR, DV, PESQ scores, and STOI scores for the Aurora 4 dataset.
3.7  Mean rankings of the audio clips for each dataset corrupted with different pulse sequence noises.
3.8  Mean ratings of the audio clips for each dataset corrupted with different pulse sequence noises.
5.1  Convexity and concavity of g_β (Equation 5.5) and h_β (Equation 5.6) in terms of W (H) with H (W) fixed for different values of β, assuming the cosine terms are non-negative.
5.2  Noise suppression results (dB) for the MRI-utt dataset.
5.3  Noise suppression (NS), LLR, PESQ scores, and STOI scores for the Aurora 4 dataset.
7.1  Word error rates for different acoustic model features in different noise and channel conditions.
8.1  Root mean square errors (RMSE) and correlations when reconstructing vocal tract constrictions using a calculated encoding matrix H_test and a random encoding matrix H_rand.
B.1  Functions for different values of β used in the proof of the Complex Beta divergence.
LIST OF FIGURES
3.1  Quantitative metrics for different spectral regularization weights γ and temporal regularization weights λ.
3.2  Noisy and denoised spectrograms of the sentence “Don’t ask me to carry an oily rag like that” in the MRI-utt dataset. The noise is seq3.
3.3  Clean, noisy, and denoised spectrograms of the sentence “The language is a big problem” in the Aurora 4 dataset. The noise is seq3.
3.4  Average values of the noisy cost function (Equation 3.8) as a function of iteration number and average run times for the denoising algorithms as a function of audio duration for the Aurora 4 dev set.
4.1  Complex Beta divergence as a function of w for various values of β.
4.2  Sensitivity of the Complex Beta divergence to errors in phase as a function of β. In these plots, z is fixed such that |z| < 1 (left panel) or |z| > 1 (right panel) and w = |z| e^{j(∠z − ε)}, with ε = 0.1.
5.1  Denoising metrics for various values of β with restaurant noise added at 0 dB.
5.2  Denoising metrics for various values of β with traffic noise added at 0 dB.
6.1  (a) Cosine similarity and (b) PESQ scores for the proposed algorithm, standard NMF, and Wiener filter + NMF in 5 dB (front) and 10 dB (behind) SNR levels.
7.1  Flowchart illustrating the algorithm for generating noise-robust time-activation matrices.
7.2  Comparison of log-mel features and encoding matrices for an Aurora 4 utterance.
8.1  Input synthetic data and the recovered temporal patterns from CHARM and CNMF-SC. The circles in (a) indicate the states of the Markov chains. The arrows represent the temporal progression within the Markov chains and the recovered temporal patterns.
8.2  Air-tissue boundary contours (thin black lines) from a single real-time MRI frame. The thick black lines mark the places of articulation where constrictions are measured (from lips to glottis: bilabial; alveolar; palatal; velar; velopharyngeal port; pharyngeal). For a given frame, constrictions are measured as the Euclidean distances between opposing gray dots.
8.3  Visualization of the CHARM gesture basis. The vocal tract at time step 1 is shown in light grey, time step 2 in dark grey, and time step 3 in black.
8.4  Visualization of the CNMF-SC gesture basis. The vocal tract at time step 1 is shown in light grey, time step 2 in dark grey, and time step 3 in black.
CHAPTER 1
INTRODUCTION
Data collected from the real world often contains noise. Noise may arise from environmental conditions
that contaminate the data collection and from measurement noise from instruments used to collect the data.
Furthermore, data often contains latent structure usually related to the underlying process or set of pro-
cesses that generated the data. Observation of the latent structure can provide insight into the generating
processes and expose what the processes can and cannot generate. For example, machine learning algo-
rithms assume that data from different classes is generated by different processes or the same process with
different parameters. Uncovering the latent structure of the different classes can help one design features
that accurately capture the parameters of the generating processes of each class, thus helping the machine
learning algorithm generalize to unseen data. However, presence of noise in the data can obscure the latent
structure.
Additionally, collected data is used for different end goals. One goal may be to simply remove the noise
and present the end user with denoised data. Another goal may be to estimate a set of parameters. One may
want to find and analyze the latent structure to gain a better understanding about the generating processes.
Such analysis can help, for example, improve diagnosis of medical conditions from various medical data or
improve prediction of future events from time-series data. Given these different end goals, noise can affect
the results in different ways.
My thesis aims to alleviate the effect that noise in speech data has on the results for different end goals.
Centrally, the question my thesis tries to answer is “How does one robustly recover
latent structure from noisy speech to facilitate different end goals?” In this work, I will show the initial
strides I have taken to address this question by developing multiple matrix factorization algorithms. I will
describe the algorithms for removing acoustic noise from speech data (Chapter 3), recovering latent structure
from noisy speech (Chapter 6), generating noise-robust acoustic features for automatic speech recognition on
noisy speech (Chapter 7), and extracting temporal patterns from time-series data (Chapter 8). Furthermore,
I propose a new divergence called the Complex Beta divergence (Chapter 4) and use this as the cost function
in Beta CMF (Chapter 5). Prior to describing these algorithms, I provide an overview of various matrix
factorization algorithms in Chapter 2.
1.1 Notation
I will use the following notation throughout this work.
Scalars are denoted by non-bolded letters: n, β, K.

Vectors are denoted by bolded lowercase letters: x, φ.

Matrices are denoted by bolded uppercase letters: V, Φ.

A ⊙ B means the element-wise product of matrices A and B. That is, C = A ⊙ B means C_ij = A_ij B_ij ∀ i, j.

A / B means the element-wise division of matrices A and B. That is, C = A / B means C_ij = A_ij / B_ij ∀ i, j.

|A| means the element-wise modulus of the elements in A.

[A]_+ and [A]_− create matrices containing only the positive and negative elements in A respectively: [A]_+ = (1/2)(|A| + A); [A]_− = (1/2)(|A| − A).

H^{→τ} means a right-shift of the columns of matrix H by τ columns. The last τ columns of H are removed and τ all-zero columns are added on the left. H^{←τ} similarly defines a left-shift of the columns.

diag(x) is used to form a diagonal matrix with the diagonal elements taken from vector x.

Z, R, R_+, and C refer to the sets of integers, real values, positive real values, and complex values respectively. For brevity, I will use C* to denote the punctured complex plane C∖{0}.

R^{m×n}, R_+^{m×n}, and C^{m×n} denote the sets of m×n real-valued matrices, non-negative matrices, and complex-valued matrices respectively.
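To make the notation concrete, the element-wise operators and the column-shift operator map directly onto numpy; the following is a minimal illustrative sketch (the example matrices are hypothetical, not data from this work):

    import numpy as np

    A = np.array([[1.0, -2.0], [3.0, -4.0]])
    B = np.array([[2.0, 2.0], [2.0, 2.0]])

    hadamard = A * B                 # A ⊙ B: element-wise product
    ratio = A / B                    # A / B: element-wise division
    modulus = np.abs(A)              # |A|: element-wise modulus
    A_pos = 0.5 * (np.abs(A) + A)    # [A]_+: positive elements, zeros elsewhere
    A_neg = 0.5 * (np.abs(A) - A)    # [A]_-: magnitudes of the negative elements

    def shift_right(H, tau):
        """H^{-> tau}: drop the last tau columns, zero-pad on the left."""
        out = np.zeros_like(H)
        if tau == 0:
            return H.copy()
        out[:, tau:] = H[:, :-tau]
        return out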
CHAPTER 2
MATRIX FACTORIZATION OVERVIEW
Matrix factorization decomposes a matrix into a product of low-rank matrices. The general idea is that
data in the input matrix has a low rank, so decomposing the input into a product of low-rank matrices can
expose the low-rank structure in the data. Matrix factorization has been used for matrix completion [2, 3],
source separation [4–6], image processing [7], music analysis [8], and topic modeling [9], to name a few
applications. The following sections describe several matrix factorization variants.
2.1 Non-negative Matrix Factorization
Non-negative Matrix Factorization (NMF) was first proposed by Paatero and Tapper [10, 11] and further
developed by Lee and Seung [12]. NMF factors a non-negative matrix V ∈ R_+^{m×n} into a basis matrix
W ∈ R_+^{m×K} and an encoding matrix H ∈ R_+^{K×n} by minimizing a cost function. Typically for
audio processing, V is the magnitude or power spectrogram, W is a basis of different spectral “building
blocks” found in the spectrogram, and H indicates when and how strongly the building blocks occur in the
spectrogram. NMF optimizes the following program:

    minimize    Σ_{i=1}^m Σ_{l=1}^n d( V_il ‖ Σ_{k=1}^K W_ik H_kl )
    subject to  W_ik ≥ 0, H_kl ≥ 0  ∀ i, k, l                                   (2.1)

where d(·‖·) is the squared Euclidean distance or the I-divergence.
NMF has two attractive properties: the factorization is interpretable and its cost function can be minimized
with multiplicative updates. The interpretability occurs because the non-negative constraint means that a
matrix is decomposed into additive parts. The basis matrix contains the parts and the encoding matrix tells
how to combine the parts to reconstruct the original matrix. By properly choosing the number of basis
vectors K and by imposing sparsity on the encoding matrix [13, 14], the parts in the basis matrix can reveal
the latent structure in the data. The ability to learn the basis and encoding matrices with multiplicative
update rules removes the need to set appropriate step-size parameters for gradient descent. See [12] for
further details about how the multiplicative rules are derived.
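As an illustration of these multiplicative updates, the following numpy sketch implements the squared-Euclidean-distance version from [12] (a minimal sketch under the assumption of random uniform initialization; eps is a small constant added only to guard against division by zero):

    import numpy as np

    def nmf_euclidean(V, k, n_iter=200, eps=1e-9):
        """Lee-Seung multiplicative updates for min ||V - WH||_F^2."""
        m, n = V.shape
        rng = np.random.default_rng(0)
        W = rng.uniform(size=(m, k))
        H = rng.uniform(size=(k, n))
        for _ in range(n_iter):
            # Multiplicative updates keep W and H non-negative
            # without needing a gradient-descent step size.
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H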
2.2 Beta NMF
There has been a push, particularly in the speech community, to derive NMF update rules for divergence
functions beyond the squared Euclidean distance and I-divergence. One such function is the beta divergence
[15], which takes a real-valued parameter β and smoothly connects the squared Euclidean distance (β = 2)
to the I-divergence (β = 1) and Itakura-Saito (IS) divergence (β = 0) [16, 17]. The beta divergence is
defined ∀ x, y > 0 as

    d_β(x ‖ y) = (1/(β(β − 1))) ( x^β + (β − 1) y^β − β x y^{β−1} ),   β ∈ R∖{0, 1}
    d_β(x ‖ y) = x log(x/y) − x + y,                                   β = 1
    d_β(x ‖ y) = x/y − log(x/y) − 1,                                   β = 0        (2.2)
Beta NMF optimizes the following program:

    minimize    Σ_{i=1}^m Σ_{l=1}^n d_β( V_il ‖ Σ_{k=1}^K W_ik H_kl )
    subject to  W_ik ≥ 0, H_kl ≥ 0  ∀ i, k, l                                   (2.3)

Fevotte et al. derived the following update rules for the beta divergence:

    W ← W ⊙ [ ((V ⊙ V̂^{β−2}) H^T) / (V̂^{β−1} H^T) ]^{γ(β)},
    H ← H ⊙ [ (W^T (V ⊙ V̂^{β−2})) / (W^T V̂^{β−1}) ]^{γ(β)},                    (2.4)

where the exponents are applied element-wise and

    γ(β) = 1/(2 − β),   β < 1
    γ(β) = 1,           1 ≤ β ≤ 2
    γ(β) = 1/(β − 1),   β > 2                                                   (2.5)

Note that the update equations for β = 2 and β = 1 coincide with the update equations derived by Lee and
Seung for the squared Euclidean distance and I-divergence respectively.
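The updates in Equations 2.4 and 2.5 translate directly to numpy; the sketch below performs one beta-NMF iteration (illustrative only; eps is a small numerical-safety constant that is not part of the original formulation):

    import numpy as np

    def gamma_exp(beta):
        """Exponent gamma(beta) from Equation 2.5."""
        if beta < 1:
            return 1.0 / (2.0 - beta)
        if beta <= 2:
            return 1.0
        return 1.0 / (beta - 1.0)

    def beta_nmf_step(V, W, H, beta, eps=1e-9):
        """One multiplicative update of beta-NMF (Equation 2.4)."""
        g = gamma_exp(beta)
        Vh = W @ H + eps
        W *= (((V * Vh ** (beta - 2)) @ H.T) / (Vh ** (beta - 1) @ H.T + eps)) ** g
        Vh = W @ H + eps
        H *= ((W.T @ (V * Vh ** (beta - 2))) / (W.T @ Vh ** (beta - 1) + eps)) ** g
        return W, H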
2.3 Convolutive Non-negative Matrix Factorization
Smaragdis proposed Convolutive NMF (CNMF) [18] to consider temporal context in time-series data when
learning the basis and encoding matrices. CNMF represents the input matrix as the result of the encoding
matrix being convolved with a set of time-varying basis matrices:

    V ≈ Σ_{τ=0}^{t−1} W(τ) H^{→τ}

This formulation is inspired by convolution: v[n] = Σ_m w[m] h[n − m]. From this perspective, one can think of modeling the
observed signal v[n] with a source-filter model, where w[n] is a filter that acts on the source h[n]. Like NMF,
CNMF also uses multiplicative updates to learn the time-varying basis and encoding matrix. Also like
NMF, the basis can contain latent structure of the data, except now it can also contain temporal structure.
This makes CNMF particularly useful for finding temporal patterns in time-series data [19].
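A minimal sketch of the convolutive reconstruction V ≈ Σ_τ W(τ) H^{→τ} in numpy follows (illustrative only; W_list and H are assumed to be given, with W_list holding the T time-varying basis matrices):

    import numpy as np

    def cnmf_reconstruct(W_list, H):
        """Approximate V as sum over tau of W(tau) @ shift_right(H, tau).

        W_list: list of T basis matrices W(0), ..., W(T-1), each m x k.
        H:      k x n encoding matrix.
        """
        m, n = W_list[0].shape[0], H.shape[1]
        V_hat = np.zeros((m, n))
        for tau, W_tau in enumerate(W_list):
            H_shift = np.zeros_like(H)
            H_shift[:, tau:] = H[:, : n - tau]  # right-shift columns by tau
            V_hat += W_tau @ H_shift
        return V_hat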
2.4 Convex Hull Non-negative Matrix Factorization
One drawback of NMF and CNMF is the requirement of a non-negative data matrix. This can prevent the
use of matrix factorization in cases where the data contains negative values. To overcome this, Ding et al.
proposed the Convex NMF algorithm [20], where the basis matrix is formed as a convex combination of the
data points. They showed that Convex NMF tends to find sparse solutions and the basis vectors correspond
closely to observed data points, making the basis more interpretable than an NMF basis. Furthermore, they
highlighted the connection between Convex NMF and k-means clustering, where the basis matrix contains
cluster centroids and the encoding matrix assigns soft membership of the data points to each cluster. Thurau
et al. introduced Convex Hull NMF (CH-NMF) to improve computation speed on large datasets [21]. They
proposed to form the basis matrix from convex combinations of the convex hull vertices rather than the data
itself. They showed that the basis vectors from this approach tend to lie at the extremities of the data. Thus,
the CH-NMF basis contains a wide range of basic “building blocks” present in the data.
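To illustrate the CH-NMF idea on toy 2-D data (a hypothetical setup; CH-NMF itself works with projections of high-dimensional data), the following sketch picks the convex hull vertices with scipy and forms one basis vector as a convex combination of them:

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))          # 500 toy data points in 2-D
    hull = ConvexHull(X)
    vertices = X[hull.vertices]            # extreme points of the data cloud

    # A CH-NMF basis vector is a convex combination of the hull vertices.
    alpha = rng.dirichlet(np.ones(len(vertices)))  # convex weights (sum to 1)
    basis_vector = alpha @ vertices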
2.5 Complex Matrix Factorization
Researchers have developed the complex matrix factorization (CMF) algorithm to process complex-valued
data. Processing complex-valued data is very useful in signal processing, such as speech processing, where
one can take advantage of phase information present in the complex-valued spectrogram. Kameoka et
al. introduced the CMF algorithm [22] to be able to use the complex-valued spectrogram as the input
V ∈ C^{m×n}. In addition to learning a basis matrix W and encoding matrix H, CMF also learns a phase
matrix P_i = e^{jφ̂_i} ∈ C^{m×n} corresponding to the ith basis vector and ith row in H. CMF optimizes the
following program:

    minimize    Σ_{i=1}^m Σ_{l=1}^n | V_il − Σ_{k=1}^K W_ik H_kl e^{jφ̂_ikl} |²
    subject to  W_ik ≥ 0, H_kl ≥ 0  ∀ i, k, l                                   (2.6)
In the above program, the product WH models the magnitude of V while φ̂ models the phase. As with
NMF, W and H are updated iteratively with multiplicative update rules along with an update for φ̂ in each
iteration; I refer the reader to [22] for the update equations. The authors showed that the update rules reduce
to the squared Euclidean NMF updates when φ̂ is fixed to the phase of V.
King and Atlas showed that reconstructed sources from CMF have lower distortion and fewer artifacts than those
from NMF [23]. However, one drawback of CMF is that it has significantly more parameters than NMF
because it estimates a phase matrix for each basis vector (note the dependence of φ̂ on k). This results in
a high computational load and memory requirements. CMF with intra-source additivity (CMF-WISA) was
proposed to overcome this issue [24]. Suppose that the observed matrix V is composed of S underlying
additive sources {V_s}_{s=1}^S with Σ_{s=1}^S V_s = V, and K(s) is the set containing the indices of the basis vectors
corresponding to source s. CMF-WISA solves the following program:

    minimize    Σ_{i=1}^m Σ_{l=1}^n | V_il − Σ_{s=1}^S Σ_{k∈K(s)} W_ik H_kl e^{j[φ̂_s]_il} |²
    subject to  W_ik ≥ 0, H_kl ≥ 0  ∀ i, k, l                                   (2.7)

where φ̂_s is the estimated phase matrix of source s. Typically S ≪ K, so CMF-WISA greatly reduces the
number of phase parameters to estimate compared to the original CMF algorithm.
In both CMF and CMF-WISA, the cost function is restricted to the squared Euclidean distance due to the
complex-valued matrices V and e^{jφ̂}. However, I will introduce the Complex Beta divergence in Chapter 4
to overcome this limitation.
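For concreteness, the CMF-WISA objective in Equation 2.7 can be evaluated with a few lines of numpy (a minimal sketch; the variable names are illustrative, not from an existing implementation):

    import numpy as np

    def cmf_wisa_cost(V, W, H, phases, groups):
        """Evaluate the CMF-WISA objective (Equation 2.7).

        V:      m x n complex spectrogram.
        W, H:   non-negative basis (m x K) and encoding (K x n) matrices.
        phases: list of S complex matrices e^{j phi_s}, each m x n, unit modulus.
        groups: list of S index arrays K(s) assigning basis vectors to sources.
        """
        V_hat = np.zeros_like(V)
        for phase_s, ks in zip(phases, groups):
            V_hat += (W[:, ks] @ H[ks, :]) * phase_s  # one phase matrix per source
        return np.sum(np.abs(V - V_hat) ** 2)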
CHAPTER 3
SPEECH DENOISING
Technological applications using speech are ubiquitous, and include speech-to-text systems [25], emotional-
state detection [26], and assistive applications, such as hearing aids [27]. The presence of background noise
usually degrades the performance of these systems, thus limiting their use to confined environments or sce-
narios. Researchers are actively developing speech denoising methods to overcome these barriers. Such
methods include signal subspace approaches [28], model-based methods [29], and spectral subtraction algo-
rithms [30]. These different techniques make specific assumptions about the noise or SNR levels, and give
a certain trade-off between noise suppression and speech distortion. This trade-off is particularly important
when denoising speech for speech science analysis.
This chapter focuses on denoising speech audio obtained during magnetic resonance imaging (MRI) scans,
a major motivation arising from speech science and clinical applications. Speech science researchers use
a variety of methods to study articulation and the associated acoustic details of speech production. These
include Electromagnetic Articulography [31] and x-ray microbeam [32] methods that track the movement
of articulators while subjects speak into a microphone. Data from these methods offer excellent temporal
details of speech production. Such methods, however, are invasive and do not offer a full view of the vocal
tract. On the other hand, methods using real-time MRI (rtMRI) offer a non-invasive method for imaging
the vocal tract, affording access to more structural details [33]. Unfortunately, MRI scanners produce high-
energy broadband noise that corrupts the speech recording. This affects the ability to analyze the speech
acoustics resulting from the articulation and requires additional schemes to improve the audio quality. An-
other motivation for denoising speech corrupted with MRI scanner noise arises from the need for enabling
communication between a patient and a provider during scanning. In this chapter, I will describe a denoising
algorithm using CMF-WISA with additional spectral and temporal regularizations for removing magnetic
resonance imaging (MRI) scanner acoustic noise from speech recorded inside an MRI scanner. Importantly,
the algorithm infers the speech and noise components from observing the noisy mixture and a short segment
of noise; that is, the speech component is not available separately a priori to learn properties of the
speech.
This chapter is organized as follows. Section 3.1 reviews two denoising methods currently used to remove
MRI scanner acoustic noise from noisy speech. Section 3.3 describes the algorithm I developed to perform
denoising. Section 3.4 discusses the experiments I conducted and the evaluation metrics I used to evaluate
the denoising performance. Section 3.6 shows the results of my method on data acquired from MRI scans
and artificially-created noisy speech. Finally, Section 3.7 offers my directions for future work.
3.1 Prior Work
The Least Mean Squares (LMS) algorithm is a popular technique for signal denoising. The algorithm
estimates the filter weights of an unknown system by minimizing the mean square error between the denoised
signal and a reference signal. This approach removes noise from the noisy signal very well, but severely
degrades the quality of the recovered speech [34]. Bresch et al. proposed a variant to the LMS algorithm
in [35] to remove MRI noise from noisy recordings. This method, however, uses knowledge of the MRI
pulse sequence to design an artificial reference “noise” signal that can be used in place of a recorded noise
reference. We found that this method outperforms LMS in denoising speech corrupted with noise from
certain types of pulse sequences. Unfortunately, it performs rather poorly when the noise frequencies are
spaced closely together in the frequency domain. Furthermore, the algorithm creates a reverberant artifact
in the denoised signal, which makes speech analysis challenging. The LMS formulation assumes additive
noise, so these algorithms may not perform well in the presence of convolutive noise in the signal, which
we encounter during MRI scans.
Recently, Inouye et al. proposed an MRI denoising method that uses correlation subtraction followed by
spectral noise gating [36]. Correlation subtraction finds the temporal shift that maximizes the correlation
between the noisy signal and a reference noise signal, and subtracts this shifted reference noise from the
noisy signal. The residual noise from this procedure is removed by spectral noise gating, which uses the
reference noise to calculate a spectral envelope of the noise and attenuates the frequency components of the
noisy speech that are below the noise spectral envelope. Their method showed a high level of noise suppres-
sion and low distortion, both desirable properties of a denoising algorithm. A drawback to their approach
is the manual setting of the threshold in the spectral noise gating. Furthermore, their algorithm assumes access
to a reference noise recording. As such, their algorithm would not be suitable for use in single-microphone
setups and would perform poorly if speech leaks into the reference microphone.
3.2 MRI Noise
MRI scanners produce a powerful magnetic field that aligns the protons in water molecules with this field.
The MRI operator briefly turns on a radio frequency electromagnetic field, which causes the protons to
realign with the new field. After the electromagnetic field is turned off, the protons relax back to their alignment
with the scanner’s magnetic field. The on and off switching pattern of the electromagnetic field is called a
pulse sequence. The pulse sequence constantly realigns the protons, which causes a changing magnetic flux,
and which in turn generates a changing voltage within the receiver coils.
During each pulse, the MRI scanner samples these changing voltages in the 2-dimensional Fourier space
(called k-space). In real-time MRI (rtMRI), the pulses are repeated periodically to get a temporal sequence
of images. The period between each repetition is called the repetition time (TR). Typically, the readout
from multiple successive pulses are combined to form one image because it improves the SNR and spatial
resolution of the image. The number of pulses that are combined to form one image is called the number
of interleaves. The number of interleaves gives a trade-off between spatial and temporal resolution of the
images; a higher number of interleaves increases the spatial resolution but decreases the temporal resolu-
tion.
A primary source of MRI noise arises from Lorentz forces, due to the pulse sequence, acting on receiver
coils in the body of an MRI scanner. These forces cause vibrations of the coils, which impact against their
mountings. The result is a high-energy broadband noise that can reach as high as 115 dBA [37]. The noise
corrupts the speech recording, making it hard to listen to the speaker, and can obscure important details in
speech.
MRI pulse sequences typically used in rtMRI produce periodic noise because the pulse is repeated every TR.
The fundamental frequency of this noise, i.e., the closest spacing between two adjacent noise frequencies in
the frequency spectrum, is given by:

    f_0 = 1 / (TR × number of interleaves)  Hz                                  (3.1)
The repetition time and number of interleaves are scanning parameters set by the MRI operator. The choice of
these parameters informs the spatial and temporal resolution of the reconstructed image sequence, as well as
the spectral characteristics of the acoustic noise generated by the scanner.
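As a quick sanity check of Equation 3.1, the following snippet computes f_0 for the seq1 parameters listed in Table 3.1:

    TR = 6.164e-3          # repetition time in seconds
    interleaves = 13
    f0 = 1.0 / (TR * interleaves)
    print(round(f0, 2))    # ~12.48 Hz, matching Table 3.1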
Table 3.1 provides a summary of the pulse sequences considered in this chapter and their properties.
Importantly, the periodicity property of the noise allows us to design effective denoising algorithms for time-synchronized
audio collected during rtMRI scans. For instance, the algorithm proposed by Bresch et al. [35]
relies on knowing f_0 to create an artificial “noise” signal which can then be used as a reference signal by
standard adaptive noise cancellation algorithms. This algorithm has been shown to effectively remove noise
from some commonly-used rtMRI pulse sequences, such as Sequences 1–3 (seq1, seq2, seq3) and the
multislice (mult) sequence listed in Table 3.1.
Table 3.1: Description of common rtMRI (seq1, seq2, seq3, ga21, ga55, mult) and static 3D (st3d) pulse
sequences.

    Pulse sequence usage     Pulse sequence   TR (ms)   Number of interleaves   f_0 (Hz)
    Real-time (dynamic)      seq1             6.164     13                      12.48
    MRI (rtMRI)              seq2             6.004     13                      12.81
                             seq3             6.028     9                       18.43
                             ga21             6.004     21                      7.93
                             ga55             6.004     55                      3.03
    Multislice rtMRI         mult             6.004     13                      12.81
    Static 3D MRI            st3d             4.22      N/A                     N/A
However, there are pulse sequences that do not exhibit this exact periodic structure. In addition, there are
other useful sequences that are either periodic with an extremely large period, resulting in very closely-spaced
noise frequencies in the spectrum (i.e., f_0 is very small), or are periodic with discontinuities that can
introduce artifacts in the spectrum. To handle these cases, it is essential that denoising algorithms do not
rely on periodicity. One example of such sequences considered in this chapter is the Golden
Angle (GA) sequence [38], which allows for retrospective and flexible selection of the temporal resolution
of the reconstructed image sequences (typical rtMRI protocols do not allow this desirable property). We
will consider the ga21 and ga55 Golden Angle sequences; these two sequences, along with
seq1, seq2, seq3, and mult, constitute the rtMRI pulse sequences that this chapter focuses on.
In addition to using rtMRI for imaging speech dynamics, one can use 3D MR imaging to capture a three-dimensional
image of a static speech posture. 3D pulse sequences scan the vocal tract in multiple planes
simultaneously. Such sequences can be highly aperiodic and, like the GA sequences, require a denoising
algorithm that does not rely on periodicity. We will consider the st3d static 3D pulse
sequence (see Table 3.1). For further reading about MRI pulse sequences and their use in
upper airway imaging, see [38–40]. For an example spectrogram of speech recorded with the seq3 pulse
sequence, see the top panel in Figure 3.2.
3.3 Denoising Algorithm
We propose a denoising algorithm that uses CMF-WISA to model spectro-temporal properties of the speech
and noise components. We also add spectral and temporal regularization terms to better model the noise
component. The following subsections provide an overview of the algorithm, introduce the regularization
terms, and show the update equations used in the algorithm. Table 3.2 shows the key variables I will use
consistently throughout the chapter as well as a brief description for quick reference.
Table 3.2: Key variables.

    Variable         Meaning
    k_s, k_d         Number of basis elements in the speech and noise bases
    t_d, t_n         Number of spectrogram frames of the noise-only and noisy speech signals
    V_s, V_d, V      Complex-valued spectrograms of the speech, noise-only, and noisy speech signals
    W_s, W_d, W_n    Speech basis, noise basis learned on the noise-only signal, and noise basis learned on the noisy speech
    H_s, H_d, H_n    Speech time-activation matrix, noise time-activation matrix learned on the noise-only signal, and noise time-activation matrix learned on the noisy speech
    P_s, P_d, P_n    Speech phase matrix, noise phase matrix learned on the noise-only signal, and noise phase matrix learned on the noisy speech
3.3.1 Algorithm Overview
I propose a denoising algorithm that uses CMF-WISA to model spectro-temporal properties of the speech
and MRI noise and to faithfully recover the speech. I first use NMF on the MRI noise to learn a noise basis
W_d ∈ R_+^{m×k_d} and its time-activation matrix H_d ∈ R_+^{k_d×t_d}. I obtain the noise-only recording from the
first second of the noisy speech recording, before the speaker speaks (it is usually the case that the
speaker starts speaking at least 1 second after the start of the recording). Alternatively, one can obtain a noise-only
recording using a reference microphone placed far enough away from the speaker that it does not record
speech. I convert the noise signal to a spectrogram V_d ∈ R_+^{m×t_d} by taking the magnitude of its STFT
with a 25-ms Hamming window shifted by 10 ms. NMF will approximate V_d by W_d H_d.
NMF uses iterative updates to learn the basis and time-activation matrix, so I initialize W_d and H_d with
random matrices sampled from the uniform distribution on [0, 1].

After learning the noise basis, I use CMF-WISA with the noisy speech complex-valued spectrogram V ∈
C^{m×t_n} as the input to separate it into speech and noise components. I initialize the basis matrix with W_0 =
[W_s W_d], where W_s is a random m × k_s matrix from the uniform distribution and W_d is the noise
basis learned from the noise-only signal. I initialize the time-activation matrix with H_n = [H_s; H_d], where
H_s ∈ R_+^{k_s×t_n} and H_d ∈ R_+^{k_d×t_n} are random matrices from the uniform distribution. I initialize the
phase matrices for speech P_s ∈ C^{m×t_n} and noise P_n ∈ C^{m×t_n} with the phase of the noisy spectrogram:
exp(j arg(V)). After initialization, I run the CMF-WISA algorithm for a fixed number of iterations, which
approximates V with V̂ = V̂_s + V̂_n, where V̂_s = W_s H_s ⊙ P_s and V̂_n = W_n H_n ⊙ P_n. I will show
the update equations for the basis, time-activation, and phase matrices in Section 3.3.4. For convenience,
I define W = [W_s W_n] as the concatenation of the learned speech and noise dictionaries. Similarly, I
define H = [H_s; H_n] as the concatenation of the learned speech and noise time-activation matrices.

Once CMF-WISA terminates, I reconstruct the speech component. Generally, I have a better estimate
of the noise component than the speech component because I learn the noise model from a noise-only
signal, whereas I learn the speech model from the noisy speech. Moreover, I apply regularization terms
(discussed in Sections 3.3.2 and 3.3.3) to improve the noise model. Consequently, I reconstruct the speech
by reconstructing the noise component and subtracting it from the noisy speech. I form the complex-valued
spectrogram V̂_n = W_n H_n ⊙ P_n and take the inverse STFT to reconstruct the time-domain noise signal
d̂. I subtract d̂ from the noisy signal x to obtain the denoised speech ŝ = x − d̂.
3.3.2 Temporal Regularization
After running NMF on the noise-only signal, I have a noise dictionary W_d and time-activation matrix H_d
that model the noise-only signal. I use W_d and H_d to initialize the noise dictionary W_n and
time-activation matrix H_n that model the noise in the noisy speech. In order to model the noise for the
entire duration of the noisy speech, I assume that the columns of H_d are generated by a multivariate log-normal
random variable. Then ln(H_d) consists of t_d samples drawn from the normal distribution with
mean μ ∈ R^{k_d} and covariance Σ ∈ R^{k_d×k_d}. Suppose that the columns of the log time-activation matrix
ln(H_n) ∈ R^{k_d×t_n} for the noise component of the noisy signal consist of t_n samples drawn from the normal
distribution with mean m ∈ R^{k_d} and covariance S ∈ R^{k_d×k_d}. The statistics μ, Σ, m, and S of ln(H_d) and
ln(H_n) are approximated by their sample estimates:

    μ̂ = (1/t_d) Σ_{t=1}^{t_d} ln([H_d]_t)

    Σ̂ = (1/(t_d − 1)) Σ_{t=1}^{t_d} (ln([H_d]_t) − μ̂)(ln([H_d]_t) − μ̂)^T

    m̂ = (1/t_n) Σ_{t=1}^{t_n} ln([H_n]_t)

    Ŝ = (1/(t_n − 1)) Σ_{t=1}^{t_n} (ln([H_n]_t) − m̂)(ln([H_n]_t) − m̂)^T        (3.2)

I add a regularization term J_temp(H_n) to the CMF-WISA cost function that approximates the Kullback-Leibler
(KL) divergence between ln(H_d) and ln(H_n) using the sample estimates defined in Equation 3.2:

    J_temp(H_n) = d_KL(ln(H_d) ‖ ln(H_n))
                ≈ (1/2) [ tr(Ŝ^{−1} Σ̂) + (m̂ − μ̂)^T Ŝ^{−1} (m̂ − μ̂) − k_d + ln(det Ŝ / det Σ̂) ]   (3.3)

This term will regularize H_n such that its second-order statistics match those of H_d. In practice, μ̂ and Σ̂
are computed beforehand from the noise-only time-activation matrix H_d and are then fixed throughout the
algorithm. I assume that the covariance matrices Σ̂ and Ŝ are diagonal; i.e., each row of H_d and H_n is
generated independently.
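Under the diagonal-covariance assumption, the sample statistics in Equation 3.2 and the Gaussian KL divergence in Equation 3.3 reduce to the following numpy sketch (illustrative only; eps avoids taking the log of zero):

    import numpy as np

    def temporal_reg(H_d, H_n, eps=1e-12):
        """J_temp from Equation 3.3, assuming diagonal covariances."""
        Ld, Ln = np.log(H_d + eps), np.log(H_n + eps)
        mu, m = Ld.mean(axis=1), Ln.mean(axis=1)     # sample means
        sig = Ld.var(axis=1, ddof=1)                 # diagonal of Sigma-hat
        s = Ln.var(axis=1, ddof=1)                   # diagonal of S-hat
        k_d = H_d.shape[0]
        # KL divergence between Gaussians with diagonal covariances.
        return 0.5 * (np.sum(sig / s) + np.sum((m - mu) ** 2 / s)
                      - k_d + np.sum(np.log(s)) - np.sum(np.log(sig)))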
3.3.3 Spectral Regularization
The bore of the MRI scanner acts as a resonance cavity that imparts a transfer function on the MRI noise prior
to it being recorded. When I learn a noise model from the noise-only signal, I implicitly capture the Fourier
coefficients of the transfer function in the noise dictionary W_d. When a subject speaks inside the scanner,
they open and close their mouth and vary the position of their articulators, which changes the volume of
the resonance cavity. This results in slight but noticeable changes in the transfer function. Consequently,
there can be a slight mismatch between the noise dictionary W_d and the noise component during speech
production. The mismatch is most noticeable at frequencies where the noise has high energy.

To address the mismatch, I allow entries in W_d corresponding to frequencies with high noise energy to
change when updating the noise dictionary W_n on the noisy speech. I achieve this by introducing a regularization
term J_spec(W_n) to the CMF-WISA cost function:

    J_spec(W_n) = ‖Λ (W_d − W_n)‖²_F                                            (3.4)

Λ ∈ R_+^{m×m} is a diagonal matrix that specifies how closely the entries in W_n must match the entries in W_d
for the frequency bins 1, ..., m. High values in Λ enforce less change while lower values allow for greater
change, so I set the entries in Λ corresponding to frequencies with low noise energy to a high value λ_0 and
the entries corresponding to frequencies with high noise energy to values lower than λ_0.
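A minimal sketch of Equation 3.4 follows, with a hypothetical rule for building the diagonal weighting Λ from the per-bin noise energy (the threshold and weight values here are illustrative, not the settings used in the experiments):

    import numpy as np

    def spectral_reg(W_d, W_n, noise_energy, lam0=10.0, lam_hi=0.1, thresh=0.5):
        """J_spec from Equation 3.4 with a diagonal weighting Lambda.

        noise_energy: per-frequency-bin energy of the noise (length m).
        Bins with high noise energy get a small weight (free to change);
        all other bins get the large weight lam0.
        """
        lam = np.where(noise_energy > thresh * noise_energy.max(), lam_hi, lam0)
        # ||Lambda (W_d - W_n)||_F^2 with diagonal Lambda scaling the rows.
        return np.sum((lam[:, None] * (W_d - W_n)) ** 2)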
3.3.4 Update Equations
I now present the update equations with the regularization terms incorporated and pseudo-code for the
denoising algorithm. When learning the noise-only model, I minimize the following cost function:

    C_noise(W_d, H_d) = ‖V_d − W_d H_d‖²_F + λ_d Σ_{j=1}^{t_d} ‖[H_d]_j‖_1,     (3.5)

where λ_d trades reconstruction error for sparsity in H_d. The update equations for the noise model on the
noise-only signal are as follows:
    W_d ← W_d ⊙ (V_d H_d^T) / (W_d H_d H_d^T)                                   (3.6)

    H_d ← H_d ⊙ (W_d^T V_d) / (W_d^T W_d H_d + λ_d)                             (3.7)

These update equations are derived in [12].
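Equations 3.6 and 3.7 amount to sparsity-penalized Euclidean NMF; a minimal numpy sketch of this noise-model learning step follows (the initialization and eps constant are illustrative):

    import numpy as np

    def noise_model_nmf(V_d, k_d, lam_d=0.1, n_iter=200, eps=1e-9):
        """Learn W_d, H_d by Equations 3.6-3.7."""
        m, t_d = V_d.shape
        rng = np.random.default_rng(0)
        W_d = rng.uniform(size=(m, k_d))
        H_d = rng.uniform(size=(k_d, t_d))
        for _ in range(n_iter):
            W_d *= (V_d @ H_d.T) / (W_d @ H_d @ H_d.T + eps)          # Eq. 3.6
            H_d *= (W_d.T @ V_d) / (W_d.T @ W_d @ H_d + lam_d + eps)  # Eq. 3.7
        return W_d, H_d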
When learning the speech model and updating the noise model on the noisy speech, I minimize the following
cost function:
    C_noisy(W_s, W_n, H_s, H_n, P_s, P_n) = ‖V − (W_s H_s ⊙ P_s + W_n H_n ⊙ P_n)‖²_F
        + λ_s Σ_{j=1}^{t_n} ‖[H_s]_j‖_1 + λ J_temp(H_n) + γ J_spec(W_n),        (3.8)

where λ_s trades reconstruction error for sparsity in H_s, λ controls the amount of temporal regularization,
and γ controls the amount of spectral regularization. I will discuss parameter settings of λ and γ in Section 3.5.
Minimizing Equation 3.8 directly is difficult, so I minimize an auxiliary cost function, shown in
Equation A.7 in Appendix A. The auxiliary function has auxiliary variables V̄_s and V̄_n that are calculated
as

    V̄_s = V̂_s + B_s ⊙ (V − V̂)                                                  (3.9)

    V̄_n = V̂_n + B_n ⊙ (V − V̂),                                                 (3.10)

where

    B_s = (W_s H_s) / (W H)                                                     (3.11)

    B_n = (W_n H_n) / (W H)                                                     (3.12)

The update equations for the speech model on the noisy speech are

    P_s = exp(j arg(V̄_s)),                                                      (3.13)

    W_s ← W_s ⊙ [ (|V̄_s| / B_s) H_s^T ] / [ ((W_s H_s) / B_s) H_s^T ],          (3.14)

    H_s ← H_s ⊙ [ W_s^T (|V̄_s| / B_s) ] / [ W_s^T ((W_s H_s) / B_s) + λ_s 1_{k_s×t_n} ].   (3.15)

The derivation of these update equations can be found in [23]. Finally, the update equations for the noise
model on the noisy speech are
    P_n = exp(j arg(V̄_n)),                                                      (3.16)

    W_n ← W_n ⊙ [ (|V̄_n| / B_n) H_n^T + γ [∇_{W_n} J_spec(W_n)]_num ]
               / [ ((W_n H_n) / B_n) H_n^T + γ [∇_{W_n} J_spec(W_n)]_den ],      (3.17)

    H_n ← H_n ⊙ [ W_n^T (|V̄_n| / B_n) + λ [∇_{H_n} J_temp(H_n)]_num ]
               / [ W_n^T ((W_n H_n) / B_n) + λ [∇_{H_n} J_temp(H_n)]_den ],      (3.18)

where

    [∇_{W_n} J_spec(W_n)]_num = Λ^T Λ W_d,                                       (3.19)

    [∇_{W_n} J_spec(W_n)]_den = Λ^T Λ W_n,                                       (3.20)

    [∇_{H_n} J_temp(H_n)]_num = (1 / H_n) ⊙ [ (1/t_n) Ŝ^{−1} ([Û]_+ + [M̂]_−) 1_{k_d×t_n}
        + (1/(t_n − 1)) (Ŝ^{−2} Σ̂ + (M̂ − Û)^T Ŝ^{−2} (M̂ − Û)) ([ln(H_n)]_+ + [M̂]_− 1_{k_d×t_n})
        + (1/(t_n − 1)) Ŝ^{−1} ([ln(H_n)]_− + [M̂]_+ 1_{k_d×t_n}) ],             (3.21)

and

    [∇_{H_n} J_temp(H_n)]_den = (1 / H_n) ⊙ [ (1/t_n) Ŝ^{−1} ([Û]_− + [M̂]_+) 1_{k_d×t_n}
        + (1/(t_n − 1)) (Ŝ^{−2} Σ̂ + (M̂ − Û)^T Ŝ^{−2} (M̂ − Û)) ([ln(H_n)]_− + [M̂]_+ 1_{k_d×t_n})
        + (1/(t_n − 1)) Ŝ^{−1} ([ln(H_n)]_+ + [M̂]_− 1_{k_d×t_n}) ].             (3.22)

In the equations above, Û = diag(μ̂) and M̂ = diag(m̂). I show the derivation of these update equations
in Appendix A. Algorithm 1 shows the pseudo-code for the denoising algorithm.
Algorithm 1 Denoising Algorithm
 1: Initialize parameters num_iter, k_s, k_d, λ_s, λ_d, λ, γ
 2: Create spectrograms V_d from the noise-only signal and V from the noisy speech x
    {Learn noise model from noise-only signal}
 3: Initialize W_d and H_d with random matrices
 4: Initialize P_d = exp(j arg(V_d))
 5: for iter = 1 to num_iter do
 6:     Update W_d using Equation 3.6
 7:     Update H_d using Equation 3.7
 8: end for
 9: Calculate second-order statistics μ̂ and Σ̂ from H_d
    {Learn speech model and update noise model from noisy speech}
10: Initialize W_s, H_s, and H_n with random matrices
11: Initialize W = [W_s W_d] and H = [H_s; H_n]
12: Initialize P_s, P_n = exp(j arg(V))
13: Initialize V̂ = W_s H_s ⊙ P_s + W_n H_n ⊙ P_n
14: Calculate second-order statistics m̂ and Ŝ from H_n
15: for iter = 1 to num_iter do
16:     Update B_s, B_n with Equations 3.11, 3.12
17:     Update V̄_s, V̄_n with Equations 3.9, 3.10
18:     Update P_s, P_n with Equations 3.13, 3.16
19:     Update W_s, W_n with Equations 3.14, 3.17
20:     Update H_s, H_n with Equations 3.15, 3.18
21:     Update second-order statistics m̂ and Ŝ from H_n
22: end for
23: Estimate noise d̂ from the inverse STFT of W_n H_n ⊙ P_n
24: return Estimated speech ŝ = x − d̂
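For illustration, one pass of the inner loop of Algorithm 1 can be sketched in numpy as below. This simplified version drops the sparsity, temporal, and spectral regularization terms (i.e., it sets λ_s = λ = γ = 0) and is not the exact implementation used in this work:

    import numpy as np

    def cmf_wisa_iteration(V, W_s, H_s, P_s, W_n, H_n, P_n, eps=1e-9):
        """One unregularized inner-loop pass (Eqs. 3.9-3.18 with
        the lambda and gamma terms dropped)."""
        WH = W_s @ H_s + W_n @ H_n + eps
        B_s = (W_s @ H_s) / WH                               # Eq. 3.11
        B_n = (W_n @ H_n) / WH                               # Eq. 3.12
        V_hat = (W_s @ H_s) * P_s + (W_n @ H_n) * P_n
        Vb_s = (W_s @ H_s) * P_s + B_s * (V - V_hat)         # Eq. 3.9
        Vb_n = (W_n @ H_n) * P_n + B_n * (V - V_hat)         # Eq. 3.10
        P_s = np.exp(1j * np.angle(Vb_s))                    # Eq. 3.13
        P_n = np.exp(1j * np.angle(Vb_n))                    # Eq. 3.16
        A_s = np.abs(Vb_s) / (B_s + eps)
        A_n = np.abs(Vb_n) / (B_n + eps)
        W_s *= (A_s @ H_s.T) / (((W_s @ H_s) / (B_s + eps)) @ H_s.T + eps)
        H_s *= (W_s.T @ A_s) / (W_s.T @ ((W_s @ H_s) / (B_s + eps)) + eps)
        W_n *= (A_n @ H_n.T) / (((W_n @ H_n) / (B_n + eps)) @ H_n.T + eps)
        H_n *= (W_n.T @ A_n) / (W_n.T @ ((W_n @ H_n) / (B_n + eps)) + eps)
        return W_s, H_s, P_s, W_n, H_n, P_n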
3.4 Experimental Evaluation
The following sections describe the datasets I tested my algorithm on, the other denoising algorithms I
compared against, and the evaluation metrics I used.
3.4.1 Datasets
MRI-utt dataset: The MRI-utt dataset contains 6 utterances spoken by a male in an MRI scanner. The
utterances include 2 TIMIT sentences [41] and various standard vowel-consonant-vowel utterances that can
be used to verify how well the denoising preserves the spectral components of these vowels and consonants.
These utterances were recorded with seq1, seq2, seq3, ga21, ga55, and mult pulse sequences (I refer to these
sequences as the real-time sequences). In the case of the static 3D pulse sequence (st3d), the utterances
consist of a vowel held for 7 seconds because this sequence can only be used to capture static vocal tract
postures. I obtained a noise-only signal of the real-time sequences from the start of the noisy speech before
the subject speaks, while the st3d noise-only signal came from a recording of the st3d pulse while the subject
remained silent. The drawback with using recordings in the MRI scanner for denoising evaluation is the lack
of a clean reference signal.
Aurora 4 dataset [42]: The Aurora 4 dataset is a subset of clean speech from the Wall Street Journal corpus
[43]. I added the 7 pulse sequence noises to the clean speech at an SNR of −7 dB, which is similar to the
SNR in the MRI-utt dataset. Note that even though the static 3D noise would occur with a held vowel rather
than continuous speech in a real-world scenario, I still added this noise to the clean speech to evaluate how
well my algorithm removes this noise. Aurora 4 is divided into train, dev, and test sets. I used the dev set
to determine optimum parameter settings for my algorithm (see Section 3.5) and report denoising results on
the test set.
3.4.2 Other Denoising Algorithms
I compared the performance of my proposed algorithm to the two-step algorithm (denoted 2step) I previously
proposed in [44], the correlation subtraction + spectral noise gating algorithm (denoted CS+SNG) [36], and
the LMS variant (denoted LMS-model) proposed in [35].
2step [44]: The 2step algorithm sequentially processes the noisy speech through an NMF step then a wavelet
packet analysis stage. The NMF step estimates the speech and noise components in the noisy speech and
passes the estimated speech to a wavelet packet analysis step for further noise removal. Wavelet packet
analysis thresholds the estimated speech wavelet coefficients in different frequency bands based on the
wavelet coefficients of the reference noise signal [45]; speech wavelet coefficients below the threshold are
set to zero. The resulting thresholded coefficients are converted back to the time domain with the inverse
wavelet packet transform to give the final denoised speech.
CS+SNG [36]: The CS+SNG algorithm is also a two-stage algorithm. The first stage, correlation sub-
traction, determines the best temporal alignment between the noisy speech and noise reference using the
correlation metric. The time-aligned noise reference is subtracted from the noisy speech to get the estimated
speech. The estimated speech is then passed to a spectral noise gating algorithm which thresholds the esti-
mated speech Fourier coefficients in each frequency band based on the noise reference Fourier coefficients,
similar to wavelet packet analysis. The thresholded coefficients are converted back to the time domain,
resulting in the final denoised speech.
LMS-model [35]: LMS-model creates an artificial noise reference signal based on the periodicity of the
MRI pulse sequence (see f_0 in Table 3.1). Using the noisy speech and reference noise signals, LMS-model
recursively updates the weights of an adaptive filter to minimize the mean square error between the filter
output and the noise signal. The residual error between the filter output and the noise signal is the final
denoised speech.
LMS-model is known to perform well with seq1, seq2, and seq3 noises and is currently used to remove
these pulse sequence noises from speech recordings. However, its performance degrades with golden angle
and static 3D pulse sequence noises, preventing speech researchers from collecting better MR images using
golden angle pulse sequences or capturing 3D visualizations of the vocal tract during speech production. On
the other hand, the other denoising methods are agnostic to the pulse sequence and can be used for removing
a wider range of pulse sequence noises, including the golden angle sequences.
3.4.3 Quantitative Performance Metrics
I used the following 5 objective measures for evaluating the denoising performance.
1. Noise suppression (NS): To quantify the amount of noise the denoising algorithms remove, I calculated
the noise suppression, which is given by

       NS = 10 log10 ( P_noise / P̂_noise ),                                     (3.23)

where P_noise is the power of the noise in the noisy signal and P̂_noise is the power of the noise in the
denoised signal. I used a voice activity detector (VAD) to find the noise-only regions in the denoised
and noisy signals (see the sketch after this list). I calculated the noise suppression measure instead of SNR because I do not have a
clean reference signal for the MRI-utt dataset.
2. Log-likelihood ratio (LLR): Ramachandran et al. proposed the log-likelihood ratio (LLR) and distortion
variance (DV) measures in [46] for evaluating the amount of distortion introduced by the
denoising algorithm. The LLR calculates the mismatch between the spectral envelopes of the clean
signal and the denoised signal. It is calculated using

       LLR = log ( (a_ŝ^T R_s a_ŝ) / (a_s^T R_s a_s) ),                          (3.24)

where a_s and a_ŝ are p-order LPC coefficients of the clean and denoised signals respectively, and R_s
is a (p + 1) × (p + 1) autocorrelation matrix of the clean signal. An LLR of 0 indicates no spectral
distortion between the clean and denoised signals, while a high LLR indicates the presence of noise
and/or distortion in the denoised signal.
3. Distortion variance (DV): The distortion variance is given by

       DV = (1/N) Σ_{n=0}^{N−1} |s[n] − ŝ[n]|²,                                  (3.25)

where s[n] and ŝ[n] are the clean and denoised signals respectively, and N is the length of the signal.
A low distortion variance is more desirable than a high distortion variance.
4. Perceptual Evaluation of Speech Quality (PESQ) score: The PESQ score is an automated assessment
of speech quality [47]. It gives a score for the denoised signal from −0.5 to 4.5, where −0.5
indicates poor speech quality and 4.5 indicates excellent quality. The score models the mean opinion
score (but with a different scale), so the PESQ score provides a way to estimate the speech quality
quantitatively without requiring listening tests. I calculated the PESQ score using C code provided by
ITU-T.
5. Short-Time Objective Intelligibility (STOI) score: Similar to the PESQ score, the STOI score is an automated assessment of the speech intelligibility [48]. Unlike several other objective intelligibility measures, STOI is designed to evaluate denoised speech. The STOI score ranges from 0 to 1, with higher values indicating better intelligibility. I calculated the STOI score using the Matlab code provided by the authors in [48].
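The sketch below illustrates how the first three measures could be computed. It is a minimal illustration under stated assumptions, not the evaluation code used for this thesis: it assumes whole-utterance 1-D arrays at a common sample rate and an externally supplied boolean VAD mask marking noise-only samples.

```python
# Minimal sketch (not the thesis evaluation code) of NS, LLR, and DV.
import numpy as np
import librosa
from scipy.linalg import toeplitz

def noise_suppression(noisy, denoised, noise_mask):
    """Eq. 3.23: noise power before vs. after denoising, in dB."""
    p_noise = np.mean(noisy[noise_mask] ** 2)
    p_noise_hat = np.mean(denoised[noise_mask] ** 2)
    return 10.0 * np.log10(p_noise / p_noise_hat)

def log_likelihood_ratio(clean, denoised, p=12):
    """Eq. 3.24: LPC spectral-envelope mismatch between clean and denoised."""
    a_s = librosa.lpc(clean, order=p)        # (p+1)-length LPC polynomials
    a_sh = librosa.lpc(denoised, order=p)
    r = np.array([clean[: len(clean) - k] @ clean[k:] for k in range(p + 1)])
    R_s = toeplitz(r)                        # (p+1) x (p+1) autocorrelation matrix
    return np.log((a_sh @ R_s @ a_sh) / (a_s @ R_s @ a_s))

def distortion_variance(clean, denoised):
    """Eq. 3.25: mean squared sample-level error."""
    n = min(len(clean), len(denoised))
    return np.mean(np.abs(clean[:n] - denoised[:n]) ** 2)
```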
3.4.4 Qualitative Performance Metrics
To supplement the quantitative results, I created a listening test on Amazon Mechanical Turk to compare
the denoised signals from my proposed algorithm, 2step, CS+SNG, and LMS-model. I selected 4 Aurora
sentences and added the 7 pulse sequence noises to these at −7 dB SNR. For each clean/noisy pair, I
denoised the noisy signal with the denoising algorithms and presented the listeners with the clean, denoised,
and noisy signals. I refer to these 6 clips (clean, denoised with proposed, 2step, CS+SNG, LMS-model,
and noisy) as a set. I asked the listeners to rate the speech quality of each of the clips on a scale of 1 to 5,
with 1 meaning poor quality and 5 meaning excellent quality. Additionally, I asked them to rank the clips
within each set from 1 to 6, with 1 being the least natural/worst quality clip to 6 being the most natural/best
quality. I also included 2 clips of TIMIT sentences from the MRI-utt dataset with the rtMRI pulse sequences
and 2 clips of held vowels with the st3d static 3D sequence. For these clips, I only provided the noisy and
denoised clips in the set because I don’t have a clean recording of the speech. The listeners had to rate
these clips from 1 to 5 as before, but only provide rankings from 2 to 6 because there are only 5 clips in
these sets. 40 Mechanical Turk workers evaluated each set and assigned a rating and ranking to each clip as
described.
During the experiment, I rejected any sets where the rating or ranking was left blank and allowed someone
else to provide ratings and rankings for those sets. After the experiment concluded, I processed the results
to remove bad data. If an annotator rated a noisy clip from a set as a 4 or 5, or ranked it as a 5 or 6, then
I discarded the results for that set. Table 3.3 shows the total number of data points for each dataset and
pulse sequence noise after processing the results. The values in Table 3.3 reflect the fact that I used 2 clips
Table 3.3: Number of data points for the listening test for each data set and pulse sequence noise.
Dataset seq1 seq2 seq3 ga21 ga55 mult st3d
MRI-utt 73 70 73 75 69 70 60
Aurora 4 144 139 141 136 140 134 137
from MRI-utt and 4 clips from Aurora per noise in the listening test. Thus, on average, I retained 35 unique
ratings and rankings for the clips in each dataset and sequence noise after processing the results.
3.5 Analysis of Regularization Parameters
The proposed algorithm contains two parameters that control the spectral and temporal regularization during
the multiplicative updates. Generally, analysis of the noise can inform proper selection of these parameters.
In this section, I will analyze these parameters and provide insight into choosing good values for these
parameters.
3.5.1 Spectral Regularization
The weight of the spectral regularization term in the cost function (Equation 3.8) is controlled by Λ. In this article, I explore spectral regularization weightings of the form Λ = diag([c ⋯ c λ ⋯ λ c ⋯ c]), where c ∈ R₊ controls the regularization of the DFT bins corresponding to the low and high frequency bins and λ ∈ R₊ controls the regularization of the DFT bins corresponding to the middle frequencies. Higher values of c and λ result in less change in W_n relative to W_d at the corresponding frequencies.

In the datasets, most of the MRI noise energy is concentrated between 600 Hz and 6 kHz for the rtMRI sequences and 700 Hz to 8 kHz for the st3d sequence. Thus, I let λ regularize the frequency bins for 600 Hz to 6 kHz for the real-time sequences and 700 Hz to 8 kHz for the st3d sequence, while c controls the remaining frequency bins. I set c = 10⁸ and varied λ over the set λ ∈ {0, 10¹, 10², 10³, 10⁴, 10⁵}.
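A minimal sketch of how such a weighting matrix could be assembled follows; the uniform one-sided DFT grid, the function name, and the example sample rate are illustrative assumptions, while the band edges and the values of c and λ follow the settings described above.

```python
# Sketch of building the diagonal spectral weighting Lambda.
import numpy as np

def spectral_weights(n_bins, sr, lo_hz, hi_hz, c=1e8, lam=1e3):
    freqs = np.linspace(0.0, sr / 2.0, n_bins)   # center frequency of each bin
    w = np.full(n_bins, c)                       # c regularizes low/high bins
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    w[band] = lam                                # lam regularizes the noise band
    return np.diag(w)                            # diag([c..c lam..lam c..c])

# e.g., the rtMRI noise band, with an illustrative FFT size and sample rate
Lambda = spectral_weights(n_bins=513, sr=16000, lo_hz=600, hi_hz=6000)
```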
3.5.2 Temporal Regularization
The influence of the temporal regularization term on the cost function (Equation 3.5) is controlled by γ. Higher values of γ enforce greater adherence to the statistics calculated from H_d. Temporal regularization also implicitly affects how the noise basis W_n is updated; by incorporating prior knowledge about the time-activations, W_n is forced to model parts of the noisy speech (i.e., noise) that result in time-activation statistics matching the learned statistics. To explore the effect of temporal regularization on the denoising performance, I varied γ over the set γ ∈ {0, 10¹, 10², 10³} and measured the noise suppression, LLR, PESQ scores, and STOI scores for the Aurora 4 dev set with ga55 noise added.
3.5.3 Discussion
Figure 3.1 shows the noise suppression, LLR, PESQ scores, and STOI scores for the Aurora 4 dev set with ga55 noise added at −7 dB SNR when varying λ and γ. From the figure, one can see a trade-off between noise suppression and signal distortion as λ is varied. Noise suppression, LLR, and distortion variance decrease as λ increases. This makes sense because a higher λ results in fewer changes to the noise dictionary, which causes less noise to be removed but also reduces the chance of removing speech. The PESQ score indicates that the denoised speech quality increases slightly when increasing λ from 10¹ to 10³, but decreases beyond 10³. Similar to the spectral regularization, one can see a trade-off between noise suppression and signal distortion as γ is varied, though the effect is not as pronounced as when λ is varied. Higher values of γ lead to less noise suppression, greater distortion, and lower speech quality. In the interest of space, I only show results with ga55 noise, but the trends are similar for the other pulse sequence noises.
Figure 3.1: Quantitative metrics for different spectral regularization weights λ and temporal regularization weights γ.
When I do not use any regularization in the cost function (Equation 3.8) (i.e., λ = 0 and γ = 0), one can see that the performance is generally worse than when regularization is used. Without these regularization terms, the cost function only contains the reconstruction error and the ℓ₁ penalty on the speech time-activation matrix. In this case, the algorithm will learn a noise model that maximally minimizes the reconstruction error, which leads to maximal noise removal. This result is reflected in the noise suppression values in Figure 3.1. However, the unregularized cost function does not take into account the temporal structure of the noise and the filtering effects of the MRI scanner bore and vocal tract shaping, as discussed in Sections 3.3.2 and 3.3.3. This means that the algorithm does not properly account for the presence of speech when learning the noise model, and subtracting the estimated noise component from the noisy speech leads to distortion in the speech. This results in a higher LLR and lower PESQ and STOI scores, as shown in Figure 3.1.
3.6 Results and Discussion
Based on the discussion in Section 3.5, I optimized the parameters of my proposed algorithm for each pulse sequence noise. I chose λ = 10³ and γ = 100. Additionally, I set the number of speech dictionary elements k_s = 30 and the number of noise dictionary elements k_d = 50 for the real-time sequences in the MRI-utt dataset and for all sequences in the Aurora 4 dataset. For the st3d sequence in the MRI-utt dataset, I used k_s = 5 and k_d = 100 because a held vowel requires fewer speech dictionary elements than running speech, which has a wider range of sounds. I ran the update equations for 300 iterations. The parameters used for the 2step algorithm [44] are shown in Table 3.4. These parameters were determined in the same manner I used to select the parameters for the proposed algorithm. For the CS+SNG method [36], I optimized the noise reduction coefficient parameter for the 5 objective metrics. I found the best value to be 0.3. The LMS-model algorithm [35] does not require parameter tuning; its parameter is based on f_0, which is noise-dependent (see Table 3.1).
Figures 3.2 and 3.3 show spectrograms of removing seq3 noise from an audio clip in the MRI-utt and Aurora
4 datasets using the different denoising algorithms. Figure 3.4 shows the average value of the cost function
(Equation 3.8) at each iteration when denoising files in the Aurora 4 dev set. The cost function monotonically
decreases and reaches convergence after roughly 300 iterations for both datasets. Additionally, the figure
shows the average run time for the denoising algorithms when processing files of different durations in the
Aurora 4 dev set. I either chopped or zero-padded the files to achieve the desired duration. Unfortunately,
one can see that the proposed algorithm has the longest run time among the denoising algorithms. Finding
ways to improve computation efficiency will be one of my priorities in improving the algorithm.
Table 3.4: Parameter settings for the number of speech dictionary elements (n_s) and wavelet packet depth (D) in the 2step algorithm. The number of noise dictionary elements was set to 70 and the window length for wavelet packet analysis was set to 2048 for all noises. See [1] for more information about the 2step parameters.

Parameter   seq1–3   ga21   ga55   mult   st3d
n_s         30       30     30     30     10
D           7        8      9      7      9
[Figure 3.2 shows five spectrogram panels (Noisy, Proposed, 2step, CS+SNG, LMS-model), frequency 0–10 kHz versus time 0–3 s.]
Figure 3.2: Noisy and denoised spectrograms of the sentence “Don’t ask me to carry an oily rag like that” in the MRI-utt dataset. The noise is seq3.
3.6.1 Objective Results
Table 3.5 lists the average noise suppression across each utterance in the MRI-utt dataset. I used the non-
parametric Wilcoxon Rank-Sum Test to determine if the medians of the noise suppression (and the other
metrics) are significantly different between the different denoising methods. In Table 3.5 and subsequent ta-
bles, a bolded value indicates the best-performing algorithm and an asterisk denotes statistically significant
performance with p < 0.05. Table 3.6 shows the noise suppression, LLR, distortion variance, PESQ, and
STOI results for the Aurora 4 test set.
One can see that my proposed algorithm consistently has the least signal distortion compared to the other
denoising methods, except for the LLR measurement in seq1, seq2, and seq3 noises, where the LMS-model
performs the best. Unfortunately, this comes at a cost of less noise removal, as indicated by the better noise
Table 3.5: NS results (dB) for the MRI-utt dataset.

Sequence   Proposed   2step   CS+SNG   LMS-model
seq1       30.18      25.52   33.51    13.90
seq2       29.42      14.71   31.87    15.04
seq3       29.55      13.65   31.79    16.70
ga21       29.26      15.47   31.57    13.81
ga55       30.34      14.74   33.19    10.30
mult       29.22      12.69   32.87    0.47
st3d       10.82      7.99    10.12    1.69
Table 3.6: NS, LLR, DV, PESQ scores, and STOI scores for the Aurora 4 dataset.

Metric       Sequence   Proposed   2step   CS+SNG   LMS-model
NS (dB)      seq1       15.42      11.17   18.08    9.52
             seq2       15.78      11.38   17.49    9.62
             seq3       15.61      11.38   18.24    10.33
             ga21       15.39      11.29   16.57    8.71
             ga55       14.96      10.95   16.36    7.16
             mult       14.93      10.51   16.61    0.21
             st3d       14.78      11.98   17.12    1.80
LLR          seq1       1.004      3.676   2.462    0.987
             seq2       1.058      3.666   2.046    0.931
             seq3       1.012      3.650   2.065    0.850
             ga21       1.018      3.329   1.987    1.058
             ga55       1.020      3.179   1.882    1.497
             mult       1.098      3.486   2.480    2.839
             st3d       0.676      2.522   2.265    2.094
DV (×10⁻⁵)   seq1       1.933      2.502   2.512    3.105
             seq2       1.919      2.484   2.401    3.094
             seq3       1.846      2.342   2.428    3.013
             ga21       1.635      2.149   1.909    2.941
             ga55       1.497      1.908   1.769    3.043
             mult       1.552      1.897   1.941    4.187
             st3d       0.971      2.919   1.683    4.217
PESQ         seq1       2.20       2.49    1.95     1.91
             seq2       2.23       2.60    2.09     1.97
             seq3       2.30       2.67    2.06     2.03
             ga21       2.36       2.65    2.07     1.94
             ga55       2.43       2.71    2.14     1.97
             mult       2.30       2.70    2.08     1.56
             st3d       3.01       2.12    2.02     1.97
STOI         seq1       0.907      0.781   0.785    0.869
             seq2       0.910      0.778   0.800    0.873
             seq3       0.920      0.795   0.788    0.883
             ga21       0.920      0.782   0.828    0.861
             ga55       0.922      0.798   0.836    0.825
             mult       0.907      0.792   0.790    0.714
             st3d       0.964      0.705   0.812    0.765
[Figure 3.3 shows six spectrogram panels (Clean, Noisy, Proposed, 2step, CS+SNG, LMS-model), frequency 0–8 kHz versus time 0–3 s.]
Figure 3.3: Clean, noisy, and denoised spectrograms of the sentence “The language is a big problem” in the Aurora 4 dataset. The noise is seq3.
suppression performance of CS+SNG for all of the pulse sequence noises in the Aurora 4 dataset. However,
as I discussed in Section 3.5, minor changes in parameter settings can vary the trade-off between noise
suppression and distortion, depending on the user’s needs. One can also see that my algorithm always gave
the best STOI scores and the best PESQ score in st3d noise. The low distortion coupled with good speech
intelligibility indicates that my proposed algorithm produces denoised speech that can be used reliably for
speech analysis and subjective listening tests. One can observe that the proposed algorithm improves upon
my previous approach (2step algorithm) in all measures except the PESQ score in real-time pulse sequences.
This observation suggests that incorporating phase information results in better separation of speech and
noise, particularly at frequencies where there is overlap between speech and noise.
For the st3d noise, one can see that my algorithm far outperforms the other denoising methods in terms of
signal distortion, speech quality, and intelligibility. This encouraging result suggests my denoising approach
is better suited for removing aperiodic noise, such as st3d pulse sequence noises, than other denoising
approaches. One reason why my algorithm shows better results for st3d compared to the real-time sequences
is that my algorithm had access to the st3d noise-only signal while it extracted the real-time sequence
noises from the start of the noisy speech. Meanwhile, CS+SNG had access to the noise-only signal for
all sequences. I performed the experiment in this way because I wanted to mimic how these algorithms
function in the wild; CS+SNG requires a reference noise signal while my algorithm can handle having
[Figure 3.4 shows two panels: average cost-function value versus iterations (0–500, log scale) and average run time (s) versus audio duration (2–10 s) for the Proposed, 2step, CS+SNG, and LMS-model algorithms.]
Figure 3.4: Average values of the noisy cost function (Equation 3.8) as a function of iteration number and average run times for the denoising algorithms as a function of audio duration for the Aurora 4 dev set.
partial information about the noise signal.
It is interesting to note that the 2step algorithm gives a better PESQ score for the real-time sequence noises
while the proposed algorithm gives a better STOI score. These results suggest that the 2step approach
preserves properties of the speech that lead to better perceptual quality while the proposed method retains
speech properties important for conveying speech content. This finding warrants further investigation into
the specific speech properties required for good speech quality and intelligibility, and understanding
how the proposed and 2step algorithms preserve these properties. Incorporating these properties in the
optimization framework of the proposed algorithm can further improve the denoised speech quality.
3.6.2 Listening Test Results
Table 3.7 shows the mean rankings obtained from the listening test for the 3 datasets corrupted by the pulse
sequence noises. A higher value indicates a better ranking. In this table, I highlight the best rank in bold and
statistically significant results, marked with an asterisk, are computed by comparing the rankings among the
denoising methods only; not surprisingly, the rankings for the clean speech are always significantly better
than the denoised speech. Table 3.8 shows the mean ratings of speech quality obtained from the listening
test. As with the ranking results, I highlight the best statistically significant results when comparing the
Table 3.7: Mean rankings of the audio clips for each dataset corrupted with different pulse sequence noises.

Dataset    Sequence   Clean   Proposed   2step   CS+SNG   LMS-model   Noisy
MRI-utt    seq1       —       3.85       3.63    4.47     3.47        1.63
           seq2       —       4.13       3.57    4.10     3.44        1.70
           seq3       —       3.56       3.47    3.81     3.71        1.66
           ga21       —       3.81       3.25    4.21     3.48        1.64
           ga55       —       3.54       3.65    3.94     2.70        1.65
           mult       —       3.44       3.39    4.10     1.94        1.99
           st3d       —       2.78       3.17    2.72     2.35        1.92
Aurora 4   seq1       5.74    4.10       3.74    2.99     3.07        1.29
           seq2       5.66    4.17       3.58    3.09     3.30        1.29
           seq3       5.60    4.06       3.64    3.48     3.28        1.33
           ga21       5.71    4.46       3.94    2.95     2.87        1.28
           ga55       5.63    4.34       3.82    3.20     2.33        1.30
           mult       5.69    4.18       4.28    3.22     1.59        1.66
           st3d       5.72    4.26       3.62    3.57     1.39        1.93
Table 3.8: Mean ratings of the audio clips for each dataset corrupted with different pulse sequence noises.

Dataset    Sequence   Clean   Proposed   2step   CS+SNG   LMS-model   Noisy
MRI-utt    seq1       —       3.07       2.99    3.60     2.78        1.26
           seq2       —       3.30       2.77    3.24     2.69        1.29
           seq3       —       2.82       2.66    3.07     3.00        1.19
           ga21       —       2.93       2.65    3.36     2.80        1.29
           ga55       —       2.99       2.94    3.14     2.09        1.32
           mult       —       2.44       2.40    3.14     1.20        1.36
           st3d       —       1.73       2.07    1.78     1.53        1.27
Aurora     seq1       4.78    3.58       3.39    2.75     2.85        1.33
           seq2       4.78    3.68       3.17    2.75     2.98        1.35
           seq3       4.73    3.59       3.28    3.17     2.99        1.45
           ga21       4.82    3.83       3.44    2.75     2.70        1.36
           ga55       4.75    3.74       3.44    2.93     2.16        1.34
           mult       4.79    3.66       3.66    2.90     1.50        1.57
           st3d       4.77    3.63       3.24    3.14     1.44        1.61
ratings from the denoising methods.
One can see from Tables 3.7 and 3.8 that listeners compared the denoised speech from my algorithm fa-
vorably with the denoised speech from CS+SNG. In all cases in the Aurora dataset, listeners ranked and
rated my algorithm’s output as the best denoised speech. More interestingly, one can see that my algorithm ranked
and rated the best among the denoising algorithms for removing st3d pulse sequence noise in the Aurora
dataset. Though the ratings are poor for the MRI-utt dataset, they are a promising indicator that my algo-
rithm is a step in the right direction for handling aperiodic, high-power noise corrupting a speech recording.
Another observation is that the rankings and ratings for the LMS-model algorithm decrease when going
from Sequence 1–3 noise to Golden Angle noise and finally to multislice and static 3D noise. In contrast,
the proposed algorithm performs consistently well in the different noises, giving speech researchers greater
flexibility in choosing an MRI sequence to study the vocal tract.
3.7 Future Work
To further extend my work, I will improve the contribution of the temporal regularization term by modeling
the distribution of the noise time-activation matrix in a data-driven manner rather than assuming a log-
normal distribution. Additionally, I will incorporate STFT consistency constraints [49] and phase constraints
[50] when learning the speech and noise components to reduce artifacts and distortions in the estimated
components. In my current work, I made strides towards addressing convolutive noise in the MRI recordings
by using spectral regularization to account for filtering effects of the scanner bore, but a more rigorous
treatment of convolutive noise might further improve results. Given that the primary motivation behind
recording speech in an MRI is for linguistic studies, I will evaluate how well my algorithm aids speech
analysis, such as improving the reliability of formant and pitch measurements. However, I will also target
clinical use of this algorithm by developing a real-time version that facilitates doctor-patient interaction
during MRI scanning. Finally, I will evaluate the performance of my algorithm in other low-SNR speech
enhancement scenarios, such as those involving babble and traffic noises to generalize its application beyond
MRI acoustic denoising.
CHAPTER 4
COMPLEX BETA DIVERGENCE
Researchers have been exploring the use of information-theoretic divergences in a wide range of applica-
tions, from image and audio signal processing [51, 52] to machine learning [53, 54]. Well-known examples
of these divergences include the Kullback-Leibler (KL) [55], Hellinger [56], and Itakura-Saito (IS) [16]
divergences. Using these divergences as a cost function or measure of dissimilarity allows one to tailor
the penalty placed on errors or outliers in the data to the application. It also gives greater flexibility in
choosing an appropriate loss function that corresponds closely to the probability distribution of the data or
noise.
These divergences are often special cases of a larger class of divergences. One such class is the f-divergence
[57–59]. It uses a convex function f with the property that f (1) = 0 to measure the distance between
two probability distributions. The KL and Hellinger divergences are two examples of special cases of the
f-divergence, formed by a particular f for each case. Another sub-class of the f-divergence is the Alpha
divergence [60], which consists of divergences parameterized by a real value α. The parameter smoothly
connects the aforementioned KL and Hellinger divergences, among others. One property the Alpha diver-
gence inherits from the f-divergence is that it is convex in both of its arguments, making it useful as an error
metric in optimization problems.
Another class of divergences is the Bregman divergence [61]. Similar to the f-divergence, it uses a strictly-
convex function f to measure the distance between points on a closed convex set. Picking appropriate
functionsf allows one to generate the squared Euclidean distance, I divergence (which is the KL divergence
generalized to positive values), IS divergence, and many others. The Beta divergence [15, 17] is a sub-
class of the Bregman divergence (as is the Alpha divergence, shown in [62], making it one of the few
divergences belonging to both the f-divergence and Bregman divergence classes). A real-valued parameter β smoothly connects the squared Euclidean distance, I divergence, and IS divergence, and it was also shown in [15] that the value of β trades off between robustness and efficiency of estimators in statistical models.
Recently, the Alpha and Beta divergences were united to form the Alpha-Beta divergence [63]. It was shown that different settings of α and β control the influence of small or large values in the divergence calculation, allowing one to adjust the importance of outliers based on the distribution of the data and the application.
Given that these divergences are motivated by information theory concepts, they are usually defined for
values in the positive orthant or sometimes even restricted to the probability simplex. However, there are
some applications where the data contains negative or complex values, and it would be useful to inherit
the flexibility of these divergences when measuring the errors or dissimilarities between values that can be
non-positive. For example, it has been shown that optimizing directly on the complex-valued spectrogram
can help improve the quality of source-separated audio [23, 50]. To that end, we propose the Complex Beta
divergence, which extends the standard Beta divergence to measure the error between complex values. Like
the standard Beta divergence, it uses a real-valued parameter β to vary the emphasis of outliers in the data
or prediction errors.
4.1 Notation and Complex Number Properties
We write the Cartesian decomposition of z ∈ C as Re{z} + j Im{z}. We write the polar decomposition of z ∈ C as |z|e^{jθ_z}, with θ_z := Arg(z) being the principal-value argument of z. We will use the convention that −π ≤ θ_z < π. Writing z with the principal-value argument gives a unique polar representation.

Transformations f(z) of z can sometimes result in the argument falling outside of [−π, π). For example, z^a = |z|^a e^{jaθ_z} could result in arg(z^a) = aθ_z ∉ [−π, π). To remedy this, we add multiples of 2π to arg(f(z)) to get the argument in [−π, π). Define κ : R → Z as
\[
\kappa(\theta) = -\left\lfloor \frac{\theta + \pi}{2\pi} \right\rfloor. \tag{4.1}
\]
Then, Arg(f(z)) = arg(f(z)) + 2πκ(arg(f(z))). One can easily show that the following property holds:

Property 4.1. ∀k ∈ Z, κ(θ + 2πk) = κ(θ) − k.
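A small sketch of the wrapping function of Equation 4.1 and the principal-value argument it induces, under the convention −π ≤ θ < π used here; the function names are illustrative.

```python
# Sketch of kappa (Eq. 4.1) and the induced principal-value argument.
import numpy as np

def kappa(theta):
    """Integer number of 2*pi turns that brings theta into [-pi, pi)."""
    return -np.floor((theta + np.pi) / (2.0 * np.pi))

def principal_arg(theta):
    return theta + 2.0 * np.pi * kappa(theta)

# Property 4.1: kappa(theta + 2*pi*k) = kappa(theta) - k
theta, k = 2.5, 3
assert kappa(theta + 2.0 * np.pi * k) == kappa(theta) - k
assert -np.pi <= principal_arg(17.0) < np.pi
```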
Next, I state a few properties of complex numbers that I will refer to in the proofs.

Property 4.2. ∀z ∈ C, Re{z} ≤ |z|.

Property 4.3. ∀z ∈ C∖{0}, ∀a, b ∈ R: (z^a)^b = z^{ab} e^{j2π(bκ(aθ_z) + κ(abθ_z + 2πbκ(aθ_z)) − κ(abθ_z))}.

Property 4.4. ∀z, w ∈ C∖{0}: z = w ⟺ |z| = |w| and arg(z) ≡ arg(w) (mod 2π). As a special case, z = w ⟺ |z| = |w| and θ_z = θ_w.

Property 4.5. ∀z ∈ C∖{0}, ∀a ∈ R: Arg(z^a) = aθ_z + 2πκ(aθ_z).

Note that Property 4.5 may not hold in general if one defines z as |z|e^{j arg(z)} because Arg(z^a) will be a set of solutions with aθ_z + 2πκ(aθ_z) as one of its elements. However, since I define z as |z|e^{jθ_z} in this article, Arg(z^a) yields a unique value.
4.2 Standard Beta Divergence
The Beta divergence [15, 17] between x, y ∈ R₊ is parameterized by β ∈ R and is defined as
\[
d_\beta(x \,\|\, y) =
\begin{cases}
\dfrac{1}{\beta(\beta-1)}\left(x^{\beta} + (\beta-1)y^{\beta} - \beta x y^{\beta-1}\right), & \beta \in \mathbb{R}\setminus\{0,1\}\\[1.5ex]
x \ln\dfrac{x}{y} - x + y, & \beta = 1\\[1.5ex]
\dfrac{x}{y} - \ln\dfrac{x}{y} - 1, & \beta = 0
\end{cases}
\tag{4.2}
\]
The cases of β = {1, 0} are defined by continuity using the identity lim_{β→0} (x^β − y^β)/β = ln(x/y). The cases β = {2, 1, 0} correspond to the squared Euclidean distance, I divergence [55], and Itakura-Saito divergence [16] respectively. Though x, y ∈ R₊ for d_β(x‖y) to be properly defined for all β, recent work has investigated the instances in which the beta divergence remains a proper divergence when the domain is extended to include negative scalars [64].
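A direct transcription of Equation 4.2 for positive scalars; a minimal sketch for illustration, not a library routine.

```python
# Sketch of the standard Beta divergence (Eq. 4.2) for x, y > 0.
import numpy as np

def beta_divergence(x, y, beta):
    if beta == 0:                            # Itakura-Saito divergence
        return x / y - np.log(x / y) - 1.0
    if beta == 1:                            # I (generalized KL) divergence
        return x * np.log(x / y) - x + y
    return (x**beta + (beta - 1) * y**beta
            - beta * x * y**(beta - 1)) / (beta * (beta - 1))

assert np.isclose(beta_divergence(3.0, 3.0, 2.0), 0.0)
assert np.isclose(beta_divergence(3.0, 2.0, 2.0), 0.5 * (3.0 - 2.0) ** 2)
```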
The standard Beta divergence can be derived from Young’s inequality [65]. Young’s inequality states:
\[
\forall p, q \in \mathbb{R}_{+},\ \forall a, b \in \mathbb{R}_{+} : \frac{1}{a} + \frac{1}{b} = 1 \implies pq \le \frac{p^{a}}{a} + \frac{q^{b}}{b}, \tag{4.3}
\]
with equality iff p^a = q^b. The beta divergence is derived by letting a := a(β), b := b(β), p := p_β(x, y), and q := q_β(x, y) be functions of x, y, and β, where the functional forms depend on the value of β. For example, with β > 1, setting a = β, b = β/(β−1), p = x, and q = y^{β−1} results in the equation for the beta divergence shown in Equation 4.2. Different settings of a, b, p, and q for other ranges of β yield the beta divergence for all values of β; see Appendix A of [63] for the functional forms for all ranges of β. Using this inequality, one can show that the beta divergence satisfies the requirements of a divergence: d_β(x‖y) ≥ 0 ∀x, y ∈ R₊ and d_β(x‖y) = 0 ⟺ x = y.
4.3 Young’s Inequality for Complex Values
In this chapter, I propose an extension of the beta divergence to measure the distance between complex values z and w. As with the standard Beta divergence, the Complex Beta divergence is built upon Young’s inequality, so I state and prove a generalization of Young’s inequality to complex values in this section. First, I state a lemma which will be required in the proof of Young’s Inequality.
Lemma 4.6. Define k_p := κ(aθ_p), k_q := κ(bθ_q), l_p := κ((a/b)θ_p + 2πk_p/b) − κ((a/b)θ_p), and l_q := κ(θ_q + 2πk_q/b). ∀p, q ∈ C∖{0}, ∀a, b ∈ R : b ≠ 0, if p^a = q^b, then
\[
p^{\frac{a}{b}} \exp\!\left(j2\pi\left(\tfrac{k_p}{b} + l_p\right)\right) = q \exp\!\left(j2\pi\left(\tfrac{k_q}{b} + l_q\right)\right).
\]

Proof.
\[
p^{a} = q^{b} \implies \left(p^{a}\right)^{\frac{1}{b}} = \left(q^{b}\right)^{\frac{1}{b}}
\iff p^{\frac{a}{b}} \exp\!\left(j2\pi\left(\tfrac{k_p}{b} + l_p\right)\right) = q \exp\!\left(j2\pi\left(\tfrac{k_q}{b} + l_q\right)\right) \quad \text{by Property 4.3.} \qquad \blacksquare
\]
Theorem 4.7 (Young’s Inequality for Complex Values). ∀p, q ∈ C∖{0}, ∀a, b ∈ R₊ : 1/a + 1/b = 1,
\[
\operatorname{Re}\!\left\{ p\,q^{*} \exp\!\left(j\left(\tfrac{a}{b}\theta_p - \tfrac{b}{a}\theta_q\right)\right) \right\} \le \frac{|p|^{a}}{a} + \frac{|q|^{b}}{b}.
\]

Proof. Let t = 1/a. Then 1 − t = 1/b.
\[
\begin{aligned}
\frac{|p|^{a}}{a} + \frac{|q|^{b}}{b} &= t|p|^{a} + (1-t)|q|^{b}\\
&= \exp\!\left(\ln\!\left(t|p|^{a} + (1-t)|q|^{b}\right)\right)\\
&\ge \exp\!\left(t \ln |p|^{a} + (1-t) \ln |q|^{b}\right) \quad \because \ln(\cdot) \text{ concave and } \exp(\cdot) \text{ monotonically increasing}\\
&= \exp\!\left(\ln|p| + \ln|q|\right) = |p||q| = |pq^{*}|\\
&= \left| p\,q^{*} \exp\!\left(j\left(\tfrac{a}{b}\theta_p - \tfrac{b}{a}\theta_q\right)\right) \right|\\
&\ge \operatorname{Re}\!\left\{ p\,q^{*} \exp\!\left(j\left(\tfrac{a}{b}\theta_p - \tfrac{b}{a}\theta_q\right)\right) \right\} \quad \text{by Property 4.2.} \qquad \blacksquare
\end{aligned}
\]
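A quick numeric spot-check of the inequality on random complex pairs; this is a hedged illustration that takes the angles from numpy.angle and uses a conjugate pair (a, b) with 1/a + 1/b = 1.

```python
# Spot-check of Theorem 4.7 on random complex pairs.
import numpy as np

rng = np.random.default_rng(0)
a = 1.7
b = a / (a - 1.0)                       # ensures 1/a + 1/b = 1
for _ in range(1000):
    p, q = rng.normal(size=2) + 1j * rng.normal(size=2)
    lhs = np.real(p * np.conj(q)
                  * np.exp(1j * ((a / b) * np.angle(p) - (b / a) * np.angle(q))))
    rhs = np.abs(p) ** a / a + np.abs(q) ** b / b
    assert lhs <= rhs + 1e-12
```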
Corollary 4.8. Young’s Inequality for Complex Values is tight if and only if p^a = q^b.

Proof. Suppose p^a = q^b. Then |p^a| = |q^b| and
\[
\operatorname{Arg}(p^{a}) = \operatorname{Arg}(q^{b}) \implies a\theta_p + 2\pi k_p = b\theta_q + 2\pi k_q \implies a\theta_p = b\theta_q + 2\pi(k_q - k_p).
\]
Then
\[
\begin{aligned}
\frac{|p|^{a}}{a} + \frac{|q|^{b}}{b} &= \left(\frac{1}{a} + \frac{1}{b}\right)|p|^{a} = |p|^{a} \quad \because \tfrac{1}{a} + \tfrac{1}{b} = 1\\
&= \frac{1}{2}\left(|p|\,|p|^{a-1} + |p|\,|p|^{a-1}\right)\\
&= \operatorname{Re}\!\left\{ p\,e^{-j\theta_p} \left(p^{\frac{a}{b}}\right)^{*} e^{j\frac{a}{b}\theta_p} \right\} \quad \because a - 1 = \tfrac{a}{b}\\
&= \operatorname{Re}\!\left\{ p\,e^{j\frac{a}{b}\theta_p}\, e^{-j\theta_p}\, q^{*} \exp\!\left(j2\pi\left(\tfrac{1}{b}(k_p - k_q) + l_p - l_q\right)\right) \right\} \quad \text{by Lemma 4.6}\\
&= \operatorname{Re}\!\left\{ p\,e^{j\frac{a}{b}\theta_p}\, q^{*} \exp\!\left(-j\tfrac{1}{a}\left(b\theta_q + 2\pi(k_q - k_p)\right)\right) \exp\!\left(j2\pi\left(\tfrac{1}{b}(k_p - k_q) + l_p - l_q\right)\right) \right\}\\
&= \operatorname{Re}\!\left\{ p\,q^{*} \exp\!\left(j\left(\tfrac{a}{b}\theta_p - \tfrac{b}{a}\theta_q\right)\right) \right\} \quad \text{using } \tfrac{1}{a} + \tfrac{1}{b} = 1.
\end{aligned}
\]

Now suppose Re{p q* exp(j((a/b)θ_p − (b/a)θ_q))} = |p|^a/a + |q|^b/b. Then all the inequalities in the proof of Theorem 4.7 are tight.

For the first inequality (Jensen’s inequality), exp(ln(t|p|^a + (1−t)|q|^b)) = exp(t ln |p|^a + (1−t) ln |q|^b) ∀t ∈ [0, 1] iff ln(·) is linear or |p|^a = |q|^b. ln(·) is a non-linear function, so |p|^a = |q|^b.

Let ζ = p q* exp(j((a/b)θ_p − (b/a)θ_q)) = |p| exp(jθ_p) |q| exp(−jθ_q) exp(j((a/b)θ_p − (b/a)θ_q)).

For the second inequality, |ζ| = Re{ζ} if and only if θ_ζ = 0. Then
\[
\theta_\zeta = a\theta_p - b\theta_q + 2\pi\kappa(a\theta_p - b\theta_q) = 0
\implies a\theta_p + 2\pi\kappa(a\theta_p - b\theta_q) = b\theta_q
\implies \arg(p^{a}) \equiv \arg(q^{b}) \pmod{2\pi}.
\]
Since |p|^a = |q|^b and arg(p^a) ≡ arg(q^b) (mod 2π), p^a = q^b by Property 4.4. ∎
4.4 Complex Beta Divergence
Let θ_Δ = θ_z − θ_w. I define the Complex Beta divergence between z, w ∈ C∖{0} as
\[
d_\beta(z \,\|\, w) =
\begin{cases}
\dfrac{1}{\beta(\beta-1)}\left(|z|^{\beta} + (\beta-1)|w|^{\beta} - \beta|z||w|^{\beta-1}\cos(\theta_\Delta)\right), & \beta > 1\\[1.5ex]
\dfrac{1}{\beta(\beta-1)}\left(|z|^{\beta}\cos(\theta_\Delta) + (\beta-1)|w|^{\beta} - \beta|z||w|^{\beta-1}\right), & 0 < \beta < 1\\[1.5ex]
\dfrac{|z|}{|w|} - \ln\dfrac{|z|}{|w|} - 1, & \beta = 0\\[1.5ex]
\dfrac{1}{\beta(\beta-1)}\left(|z|^{\beta} + (\beta-1)|w|^{\beta}\cos^{-1}(\theta_\Delta) - \beta|z||w|^{\beta-1}\right), & \beta < 0
\end{cases}
\tag{4.4}
\]
The equation for β = 0 is derived by using the identity lim_{a→0} (x^a − y^a)/a = ln(x/y). Unfortunately at this time, I am not able to find an equation for β = 1 for arbitrary z and w, so the Complex Beta divergence remains discontinuous at β = 1. Note that β = 2 is the squared Euclidean distance and β = 0 is the IS divergence.

Theorem 4.9. d_β(z‖w) satisfies the requirements for a divergence: ∀z, w ∈ C∖{0}, d_β(z‖w) ≥ 0 and d_β(z‖w) = 0 ⟺ z = w.

I put the proof of this theorem in Appendix B.
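A sketch of Equation 4.4 follows. The handling of the β < 0 branch (applying 1/cos(θ_Δ) to the |w|^β term) follows my reading of the case structure above, β = 1 is deliberately rejected because the divergence is undefined there, and z, w are assumed to be nonzero complex scalars.

```python
# Sketch of the Complex Beta divergence (Eq. 4.4).
import numpy as np

def complex_beta_divergence(z, w, beta):
    if beta == 1:
        raise ValueError("undefined at beta = 1 for arbitrary z, w")
    az, aw = np.abs(z), np.abs(w)
    td = np.angle(z) - np.angle(w)           # theta_delta = theta_z - theta_w
    if beta == 0:                            # IS divergence on the magnitudes
        return az / aw - np.log(az / aw) - 1.0
    if beta > 1:
        core = az**beta + (beta - 1) * aw**beta - beta * az * aw**(beta - 1) * np.cos(td)
    elif 0 < beta < 1:
        core = az**beta * np.cos(td) + (beta - 1) * aw**beta - beta * az * aw**(beta - 1)
    else:                                    # beta < 0
        core = az**beta + (beta - 1) * aw**beta / np.cos(td) - beta * az * aw**(beta - 1)
    return core / (beta * (beta - 1))

z = 2.0 * np.exp(1j * np.pi / 4)             # the target value used in Figure 4.1
assert np.isclose(complex_beta_divergence(z, z, 2.0), 0.0)
```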
4.4.1 Complex Beta Divergence Properties
Here, I make a few observations about the Complex Beta divergence.

In general, the divergence is not symmetric: d_β(z‖w) ≠ d_β(w‖z).

If z and w have the same angle (θ_z = θ_w), then the Complex Beta divergence can be stated succinctly as
\[
d_\beta(z \,\|\, w) =
\begin{cases}
\dfrac{1}{\beta(\beta-1)}\left(|z|^{\beta} + (\beta-1)|w|^{\beta} - \beta|z||w|^{\beta-1}\right), & \beta \in \mathbb{R}\setminus\{0,1\}\\[1.5ex]
|z| \ln\dfrac{|z|}{|w|} - |z| + |w|, & \beta = 1\\[1.5ex]
\dfrac{|z|}{|w|} - \ln\dfrac{|z|}{|w|} - 1, & \beta = 0
\end{cases}
\tag{4.5}
\]
Note that in this case, the Complex Beta divergence is defined for β = 1, unlike for arbitrary z and w, and this case is the I divergence.

If z, w ∈ R₊, then the Complex Beta divergence reduces to the standard Beta divergence (Eq. 4.2). This is a consequence of the previous observation with θ_z = θ_w = 0.

∀α ∈ C∖{0}, d_β(αz‖αw) = |α|^β d_β(z‖w). This property means that the value of β controls how much scaling z and w affects the divergence between them. In particular, β = 0 results in a scale-invariant divergence (d_β(αz‖αw) = d_β(z‖w)); indeed, this condition is the IS divergence, which was introduced for measuring the distance between two audio spectra because it gives equal weight to frequency components with low and high energy (audio spectral energy typically rolls off as frequency increases).
4.5 Discussion
I show a few plots in Figure 4.1 of the Complex Beta divergence as a function of w with z set to 2e^{jπ/4} for various values of β. In the plots, the red “x” indicates the location of z in the complex plane, and the blue contours show the level sets of the Complex Beta divergence. To make the visualization easier, I also plot the values of the Complex Beta divergence along the dashed red line in the upper right panel and the dashed red circle in the lower right panel. The dashed red line represents values of w that have the same angle as z and the dashed red circle represents values of w that have the same magnitude as z. Note that the values of the divergence along the dashed red line are the same as the standard Beta divergence between |z| and |w|; this follows from the observation I made in the previous section that the Complex Beta divergence reduces to the standard Beta divergence when z and w have the same angle.

The level-set contours in Figure 4.1 help with visualizing the rate of change in the divergence when moving in different directions from the target value. With β = 2 (squared Euclidean distance), one can see that the contours are concentric circles, as expected; any w that is some ε > 0 distance away from z will have the same divergence. It is interesting to observe how these contours change as β is varied; the contours give an idea about how much the errors in the magnitude or phase of w affect the divergence. First, I focus on the errors in magnitude. From the contours and the same-angle divergence subplots, one can see that setting β > 2 penalizes errors with a large magnitude more than errors with a small magnitude, while setting β < 2 reverses this effect. This behavior is also observed with the standard Beta divergence. This property is useful for adjusting the weight of errors when using the complex beta divergence as a cost
[Figure 4.1 shows six panels, one per β ∈ {3, 2, 1.2, 0.8, 0, −1}. Each panel plots the level sets of the divergence over the complex plane (real and imaginary axes from −4 to 4), together with divergence-versus-magnitude and divergence-versus-angle subplots.]
Figure 4.1: Complex Beta divergence as a function of w for various values of β.
[Figure 4.2 shows d_β(z‖w) versus β ∈ [−4, 4] for the two conditions |z| < 1 (left panel) and |z| > 1 (right panel).]
Figure 4.2: Sensitivity of the Complex Beta divergence to errors in phase as a function of β. In these plots, z is fixed such that |z| < 1 (left panel) or |z| > 1 (right panel) and w = |z|e^{j(θ_z−ε)}, with ε = 0.1.
function. Depending on the application, one might want to penalize small errors differently than large errors, so setting β appropriately can allow one to achieve the desired penalty weighting.

Next, I focus on errors in angle. To simplify things, I focus on the case where the error is due purely to angle and not magnitude; i.e., given a target value z, I consider the case when w = |z|e^{j(θ_z−ε)} for some 0 < ε ≪ π. Figure 4.2 shows the Beta divergence as a function of β in the conditions that |z| < 1 and |z| > 1. In both conditions, one can observe that errors in the angle do not affect the complex beta divergence when β = 0, while setting β → 1 greatly increases the effect of phase errors on the divergence. In the case that |z| < 1, the effect of errors in angle increases as β → −∞ and decreases as β → ∞. This effect reverses when |z| > 1. As in the cases of errors in magnitude, one can choose an appropriate β based on how sensitive a particular application is to errors in angle. I focused on small errors in angle in this discussion, but as can be seen in the same-magnitude subplots in Figure 4.1, the divergence increases with greater error in angle. Moreover, additional magnitude error further increases the divergence. In the discussion of magnitude and angle errors, I set z as the target value and varied w, but the plots in Figures 4.1 and 4.2 would follow a similar, though not identical, pattern if I set w as the target and varied z.
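A small, self-contained numeric illustration of the setup in Figure 4.2, evaluated with the β = 0 and β = 2 branches of Equation 4.4; the magnitudes 0.5 and 2.0 are illustrative stand-ins for the |z| < 1 and |z| > 1 conditions.

```python
# Pure phase error eps at fixed magnitude, as in Figure 4.2.
import numpy as np

eps = 0.1
for mag in (0.5, 2.0):                       # the |z| < 1 and |z| > 1 panels
    az = aw = mag                            # same magnitude, phases differ by eps
    d_is = az / aw - np.log(az / aw) - 1.0   # beta = 0: insensitive to phase (= 0)
    d_euc = 0.5 * (az**2 + aw**2 - 2 * az * aw * np.cos(eps))  # beta = 2
    print(f"|z| = {mag}:  d_0 = {d_is:.4f}  d_2 = {d_euc:.6f}")
```

The β = 0 value is exactly zero here, matching the scale-invariant IS behavior discussed in Section 4.4.1, while the β = 2 value grows with the phase error.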
As mentioned in the introduction, I was motivated to bring the flexibility offered by the Beta divergence to applications where the data is complex-valued. One such application is speech enhancement and audio source separation. One commonly-used approach to this problem is using Non-negative Matrix Factorization (NMF) [10, 12]. As the name implies, the NMF algorithm works with non-negative data. In audio, and specifically speech, the non-negative data comes from the magnitude or power spectrogram, which discards the phase information of the Fourier transform. The algorithm, when first proposed, optimized the Euclidean distance and KL divergence [12], but recent advancements have resulted in update equations for NMF under a variety of divergence metrics, including the Beta divergence [63, 66]. However, the estimated denoised or source signals tend to contain artifacts and distortions due to the phase information being discarded. Furthermore, it has been shown that phase information affects speech intelligibility [67], so discarding the phase response can lead to suboptimal denoising and source separation in terms of intelligibility. To overcome this limitation, researchers have proposed Complex Matrix Factorization [22, 23, 68] to directly process the complex-valued spectrogram. Unfortunately, the cost function is limited to the squared Euclidean distance (there is initial work with using the KL divergence for Complex NMF by reformulating the cost function; see [69]). With my proposed Complex Beta divergence, it is possible to incorporate the benefits of a flexible cost function used in NMF in the Complex Matrix Factorization paradigm.
Another area where the Complex Beta divergence can be useful is machine learning. Recent advance-
ments, particularly in deep learning, have pushed the state-of-the-art in a variety of fields. While most
implementations assume real-valued input, there have been efforts in extending machine learning theory
and application to complex-valued input. For example, researchers have extended kernel methods to handle
complex-valued signals and have shown improved performance over standard kernel methods in tasks such
as non-linear channel equalization [70, 71]. Recently, there have been advancements in deriving a complex-
valued backpropagation algorithm for use in training deep learning models [72–74]. Reichert and Serre [75]
have suggested that complex-valued neurons can more accurately model neuronal firing, where the magni-
tude represents the firing rate and phase represents the relative timing of the firing. Complex-valued neural
networks have shown promising results in computer vision [76, 77], speech [78], and natural language pro-
cessing tasks [79]. Using the Complex Beta divergence as a loss function when training these networks can
allow one to tailor the weight of outliers in these tasks.
One of the drawbacks with the Complex Beta divergence is its disconnection from information theory. The standard Beta divergence exhibits a relationship to various probability distributions. For example, β = {2, 1, 0} is connected to the normal, Poisson, and Gamma distributions respectively, and 1 < β < 2 is connected to the compound Poisson distribution [52]. Thus, one can choose or estimate a value for β based on the distribution of the data. It isn’t clear how the distribution of complex-valued data is related to β in the Complex Beta divergence, and it warrants further investigation.
4.6 Future Work
Work needs to be done to establish a connection, if any, between the Complex Beta divergence and statistical
models, as is the case with the Standard Beta divergence. Furthermore, the utility of the Complex Beta
divergence as a cost function in tasks such as matrix factorization, clustering, and other machine learning
tasks needs to be explored.
CHAPTER 5
BETA COMPLEX MATRIX FACTORIZATION
In the previous chapter, I introduced the complex beta divergence. In this chapter, I introduce the Beta CMF algorithm that combines the formulations of Beta NMF and CMF-WISA by using the Complex Beta divergence as the cost function. By varying a real-valued parameter β, one can tailor the amount of penalty placed on reconstruction error to the application, as is the case with Beta NMF. At the same time, the algorithm leverages phase information in the complex-valued input to yield a better factorization compared to Beta NMF. The rest of this chapter will lay out the derivation of the Beta CMF update equations (Section 5.1) and discuss the results of denoising experiments on synthetically-created noisy speech and speech recorded in the MRI scanner (Section 5.2).
5.1 Beta Complex Matrix Factorization Algorithm
Let |z|e^{jθ_z} be the polar decomposition of complex-valued z, and for convenience, I define θ_Δ = θ_z − θ_w to be the phase difference between complex values z and w. The complex beta divergence is defined ∀z, w ∈ C∖{0} as
\[
d_\beta(z \,\|\, w) =
\begin{cases}
\dfrac{1}{\beta(\beta-1)}\left(|z|^{\beta} + (\beta-1)|w|^{\beta} - \beta|z||w|^{\beta-1}\cos(\theta_\Delta)\right), & \beta > 1\\[1.5ex]
\dfrac{1}{\beta(\beta-1)}\left(|z|^{\beta}\cos(\theta_\Delta) + (\beta-1)|w|^{\beta} - \beta|z||w|^{\beta-1}\right), & 0 < \beta < 1\\[1.5ex]
\dfrac{|z|}{|w|} - \ln\dfrac{|z|}{|w|} - 1, & \beta = 0\\[1.5ex]
\dfrac{1}{\beta(\beta-1)}\left(|z|^{\beta} + (\beta-1)|w|^{\beta}\cos^{-1}(\theta_\Delta) - \beta|z||w|^{\beta-1}\right), & \beta < 0
\end{cases}
\tag{5.1}
\]
β = 2 is the squared Euclidean distance and β = 0 is the IS divergence. Furthermore, if z and w are in phase (θ_z ≡ θ_w (mod 2π)), then the complex beta divergence reduces to the beta divergence (Equation 2.2). Note that the complex beta divergence is undefined at β = 1, though it is defined if z and w are restricted to be in phase, at which point it becomes the KL divergence.
Building upon CMF-WISA, I formulate the Beta CMF program as
\[
\begin{aligned}
\underset{\Theta}{\text{minimize}} \quad & \sum_{s=1}^{S} \sum_{d=1}^{D} \sum_{t=1}^{T} d_\beta\!\left( [V_s]_{dt} \,\Big\|\, \sum_{k \in K(s)} W_{dk} H_{kt}\, e^{j[\hat{\varphi}_s]_{dt}} \right)\\
\text{subject to} \quad & \sum_{s=1}^{S} [V_s]_{dt} = V_{dt},\quad W_{dk} \ge 0,\quad H_{kt} \ge 0 \quad \forall d, k, t
\end{aligned}
\tag{5.2}
\]
Let Θ = {V, W, H, φ̂} be the parameters of the CMF model. From the program, I formulate the following cost function:
\[
\begin{aligned}
C_\beta(\Theta) &= \sum_{s=1}^{S} \sum_{d=1}^{D} \sum_{t=1}^{T} d_\beta\!\left( [V_s]_{dt} \,\Big\|\, \sum_{k \in K(s)} W_{dk} H_{kt}\, e^{j[\hat{\varphi}_s]_{dt}} \right) + \sum_{d=1}^{D} \sum_{t=1}^{T} \lambda_{dt} \left( \sum_{s=1}^{S} [V_s]_{dt} - V_{dt} \right)\\
&= \sum_{s=1}^{S} \sum_{d=1}^{D} \sum_{t=1}^{T} \left( f^{(\beta)}_{s,d,t} + g^{(\beta)}_{s,d,t} + h^{(\beta)}_{s,d,t} \right) + \sum_{d=1}^{D} \sum_{t=1}^{T} \lambda_{dt} \left( \sum_{s=1}^{S} [V_s]_{dt} - V_{dt} \right)
\end{aligned}
\tag{5.3}
\]
where
\[
f^{(\beta)}_{s,d,t} =
\begin{cases}
\dfrac{1}{\beta(\beta-1)} \left|[V_s]_{dt}\right|^{\beta}, & \beta \in (-\infty, 0) \cup (1, \infty)\\[1.5ex]
\dfrac{1}{\beta(\beta-1)} \left|[V_s]_{dt}\right|^{\beta} \cos\!\left( [\theta_s]_{dt} - [\hat{\varphi}_s]_{dt} \right), & 0 < \beta < 1\\[1.5ex]
-\ln\left|[V_s]_{dt}\right| - 1, & \beta = 0
\end{cases}
\tag{5.4}
\]
\[
g^{(\beta)}_{s,d,t} =
\begin{cases}
\dfrac{1}{\beta} \left( \sum_{k \in K(s)} W_{dk} H_{kt} \right)^{\beta}, & \beta \in (0, \infty)\setminus\{1\}\\[1.5ex]
\ln\!\left( \sum_{k \in K(s)} W_{dk} H_{kt} \right), & \beta = 0\\[1.5ex]
\dfrac{1}{\beta} \left( \sum_{k \in K(s)} W_{dk} H_{kt} \right)^{\beta} \cos^{-1}\!\left( [\theta_s]_{dt} - [\hat{\varphi}_s]_{dt} \right), & \beta < 0
\end{cases}
\tag{5.5}
\]
\[
h^{(\beta)}_{s,d,t} =
\begin{cases}
-\dfrac{1}{\beta-1} \left|[V_s]_{dt}\right| \left( \sum_{k \in K(s)} W_{dk} H_{kt} \right)^{\beta-1} \cos\!\left( [\theta_s]_{dt} - [\hat{\varphi}_s]_{dt} \right), & \beta > 1\\[1.5ex]
-\dfrac{1}{\beta-1} \left|[V_s]_{dt}\right| \left( \sum_{k \in K(s)} W_{dk} H_{kt} \right)^{\beta-1}, & \beta < 1
\end{cases}
\tag{5.6}
\]
Here, [θ_s]_{dt} denotes the phase of [V_s]_{dt}.
Optimizing Equation 5.3 can be intractable, so I create an auxiliary cost function and optimize that instead.

Definition 5.1 (Auxiliary function). A function C̃(x̃, x) is an auxiliary function for a function C(x) iff C(x) = min_{x̃} C̃(x̃, x). x̃ is called an auxiliary variable.

It has been shown in [22] that the primary function C(x) can be indirectly minimized by iteratively updating the primary and auxiliary variables with
\[
\tilde{x} \leftarrow \underset{\tilde{x}}{\operatorname{argmin}}\ \tilde{C}(\tilde{x}, x); \qquad x \leftarrow \underset{x}{\operatorname{argmin}}\ \tilde{C}(\tilde{x}, x). \tag{5.7}
\]
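Schematically, the recipe in Equation 5.7 has the following shape; the two callables are hypothetical placeholders for the argmin solutions, not the Beta CMF solver itself.

```python
# Generic majorize-minimize loop for Eq. 5.7 (schematic sketch).
def mm_minimize(argmin_aux, argmin_primal, x0, iters=100):
    x = x0
    for _ in range(iters):
        x_tilde = argmin_aux(x)        # tighten: ~x <- argmin_~x C~(~x, x)
        x = argmin_primal(x_tilde)     # descend: x <- argmin_x C~(~x, x)
    return x
```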
Following the approach in [66], I form the auxiliary function for C_β(Θ) by breaking it into convex and concave functions in terms of W and H separately. Here, I describe forming an auxiliary function in terms of W with H fixed, but the same procedure is used for creating an auxiliary function in terms of H with W fixed. One can upper-bound the convex functions using Jensen’s inequality (Theorem 5.2) and the concave functions by using a first-order Taylor series expansion (Theorem 5.3). With H (W) fixed, g^{(β)}_{s,d,t} and h^{(β)}_{s,d,t} in Equations 5.5 and 5.6 are convex or concave in W (H) for different values of β and the sign of the cosine terms. Table 5.1 summarizes the convexity/concavity of g_β and h_β assuming the cosine terms are non-negative; the convexity/concavity is reversed when the cosine terms are negative.

Table 5.1: Convexity and concavity of g_β (Equation 5.5) and h_β (Equation 5.6) in terms of W (H) with H (W) fixed for different values of β, assuming the cosine terms are non-negative.

        β < 1     1 < β ≤ 2   β > 2
g_β     concave   convex      convex
h_β     convex    convex      concave
Theorem 5.2 (Auxiliary function for convex functions). Fix s, d, t, and {H_kt}_{k∈K̃(s)}, where K̃(s) = {k ∈ K(s) : H_kt > 0}. Then
\[
\tilde{C}\!\left( \{\tilde{W}_{dk}\}_{k \in K(s)}, \{W_{dk}\}_{k \in K(s)} \right) = \sum_{k \in \tilde{K}(s)} \tilde{\phi}_{dkt}\, f\!\left( \frac{W_{dk} H_{kt}}{\tilde{\phi}_{dkt}} \right)
\]
is an auxiliary function for C({W_dk}_{k∈K(s)}) = f(Σ_{k∈K̃(s)} W_dk H_kt) for some convex function f if
\[
\tilde{\phi}_{dkt} = \frac{\tilde{W}_{dk} H_{kt}}{\sum_{l \in \tilde{K}(s)} \tilde{W}_{dl} H_{lt}}.
\]

Proof. Let φ̃_dkt = W̃_dk H_kt / Σ_{l∈K̃(s)} W̃_dl H_lt ∀k ∈ K̃(s). Note that φ̃_dkt > 0 ∀k ∈ K̃(s) and Σ_{k∈K̃(s)} φ̃_dkt = 1. C̃({W_dk}_{k∈K(s)}, {W_dk}_{k∈K(s)}) = C({W_dk}_{k∈K(s)}) is trivially true. Then
\[
C\!\left( \{W_{dk}\}_{k \in K(s)} \right) = f\!\left( \sum_{k \in \tilde{K}(s)} W_{dk} H_{kt} \right) \le \sum_{k \in \tilde{K}(s)} \tilde{\phi}_{dkt}\, f\!\left( \frac{W_{dk} H_{kt}}{\tilde{\phi}_{dkt}} \right) = \tilde{C}\!\left( \{\tilde{W}_{dk}\}_{k \in K(s)}, \{W_{dk}\}_{k \in K(s)} \right) \tag{5.8}
\]
by Jensen’s inequality. ∎
Theorem 5.3 (Auxiliary function for concave functions). Fix s, d, t, and {H_kt}_{k∈K(s)}. Then
\[
\tilde{C}\!\left( \{\tilde{W}_{dk}\}_{k \in K(s)}, \{W_{dk}\}_{k \in K(s)} \right) = f(\tilde{\omega}_{sdt}) + \nabla f(\tilde{\omega}_{sdt}) \left( \sum_{k \in K(s)} W_{dk} H_{kt} - \tilde{\omega}_{sdt} \right)
\]
is an auxiliary function for C({W_dk}_{k∈K(s)}) = f(Σ_{k∈K(s)} W_dk H_kt) for some concave function f if ω̃_sdt = Σ_{k∈K(s)} W̃_dk H_kt.

Proof. Let ω̃_sdt = Σ_{k∈K(s)} W̃_dk H_kt. C̃({W_dk}_{k∈K(s)}, {W_dk}_{k∈K(s)}) = C({W_dk}_{k∈K(s)}) is trivially true. For a concave function f, it holds that f(y) ≤ f(x) + ∇f(x)^T (y − x) ∀x, y ∈ dom(f). Setting x = ω̃_sdt and y = Σ_{k∈K(s)} W_dk H_kt, we have C({W_dk}_{k∈K(s)}) ≤ C̃({W̃_dk}_{k∈K(s)}, {W_dk}_{k∈K(s)}). ∎
By applying Theorems 5.2 and 5.3 to g_β and h_β in the primal cost function C_β(Θ) (Equation 5.3), I can form the auxiliary cost function C_β(Θ, Θ̃) with auxiliary variables Θ̃ = {W̃, H̃} and indirectly optimize C_β(Θ) by iteratively minimizing C_β(Θ, Θ̃) w.r.t. Θ̃ and Θ. C_β(Θ, Θ̃) is minimized w.r.t. Θ̃ when the inequalities in Theorems 5.2 and 5.3 are tight, so W̃ = W and H̃ = H minimize C_β(Θ, Θ̃). The following equations minimize C_β(Θ, Θ̃) w.r.t. Θ:
\[
[V_s]_{dt} \leftarrow
\begin{cases}
\dfrac{[\hat{V}_s]_{dt}^{\beta-1}\, e^{j[\hat{\varphi}_s]_{dt}} - 2(\beta-1)\lambda_{dt}}{[V_s]_{dt}^{\beta-2}}, & \beta > 1\\[2ex]
\dfrac{[V_s]_{dt}\left( [V_s]_{dt}^{\beta-1}\, e^{j[\hat{\varphi}_s]_{dt}} + 2(\beta-1)\lambda_{dt} \right)}{[V_s]_{dt}\, [\hat{V}_s]_{dt}^{\beta-1}}, & 0 < \beta < 1\\[2ex]
[V_s]_{dt}\, [\hat{V}_s]_{dt} \left( [V_s]_{dt}^{-1} - 2\lambda_{dt} \right), & \beta = 0
\end{cases}
\tag{5.9}
\]
\[
[\hat{\varphi}_s]_{dt} = \operatorname{Arg}\!\left( [V_s]_{dt} \right) \tag{5.10}
\]
\[
W_{dk} = \tilde{W}_{dk} \left( \frac{ \sum_{t=1}^{T} \left|[V_s]_{dt}\right| [\hat{V}_s]_{dt}^{\beta-2} H_{kt} }{ \sum_{t=1}^{T} [\hat{V}_s]_{dt}^{\beta-1} H_{kt} } \right)^{\gamma(\beta)} \tag{5.11}
\]
\[
H_{kt} = \tilde{H}_{kt} \left( \frac{ \sum_{d=1}^{D} \left|[V_s]_{dt}\right| [\hat{V}_s]_{dt}^{\beta-2} W_{dk} }{ \sum_{d=1}^{D} [\hat{V}_s]_{dt}^{\beta-1} W_{dk} } \right)^{\gamma(\beta)} \tag{5.12}
\]
where [V̂_s]_{dt} denotes the magnitude of the model reconstruction Σ_{k∈K(s)} W_dk H_kt and γ(β) is a β-dependent step exponent.
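A sketch of the magnitude updates in Equations 5.11–5.12 for a single source follows. It assumes V_s holds the current complex source estimate; the specific step exponent γ(β) shown is the one commonly used in Beta NMF convergence analyses and should be treated as an assumption here, as are the function names and the eps guard.

```python
# Sketch of the W and H updates (Eqs. 5.11-5.12) for one source.
import numpy as np

def step_exponent(beta):
    # Common gamma(beta) choice in Beta NMF analyses; an assumption here.
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta <= 2:
        return 1.0
    return 1.0 / (beta - 1.0)

def update_WH(V_s, W, H, beta, eps=1e-12):
    g = step_exponent(beta)
    A = np.abs(V_s)                          # |[V_s]_dt|
    V_hat = W @ H + eps                      # magnitude reconstruction
    W *= ((A * V_hat**(beta - 2)) @ H.T / (V_hat**(beta - 1) @ H.T + eps)) ** g
    V_hat = W @ H + eps
    H *= (W.T @ (A * V_hat**(beta - 2)) / (W.T @ V_hat**(beta - 1) + eps)) ** g
    return W, H
```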
5.2 Experiments and Results
I ran two speech denoising experiments. In the first experiment, I denoise speech from the Aurora 4 dataset using Beta CMF with different values of β and measure the denoising performance in terms of various speech denoising metrics. Using the results of this experiment, I run a second experiment where I denoise MRI audio with particular settings of β that optimize specific metrics.
5.2.1 Synthetic Data
In this experiment, I add noises from the DEMAND noise database [80] to the clean speech in Aurora 4 [42] at 0 dB SNR. I train speaker-specific speech dictionaries from the utterances in Aurora 4 and noise-specific dictionaries from the DEMAND noises using Beta CMF with β ∈ {0, 0.2, 0.4, …, 2}∖{1}. I then keep the dictionaries fixed and denoise the noisy audio with the same values of β, picking the dictionaries trained with the corresponding β. I measure the LLR, WSS, PESQ, and first and second formant RMSEs.
Table 5.2: Noise suppression results (dB) for the MRI-utt dataset.

Sequence   LMS-model   CS+SNG   CMF-WISA   Beta CMF
per1       13.09       33.51    30.18      33.59
per2       15.04       31.87    29.42      32.12
per3       16.70       31.79    29.55      30.99
per4       13.81       31.57    29.26      30.64
per5       10.30       33.20    30.34      32.00
aper1      0.47        32.87    29.22      31.99
aper2      1.69        10.12    10.82      10.95
The results are shown in Figures 5.1 and 5.2.
From the figures, one can see in general that Beta CMF outperforms Beta NMF for β < 1 while Beta NMF tends to do better than Beta CMF for 1 < β ≤ 2 across the different metrics. However, this is noise-dependent, where greater gains in performance are achieved by Beta CMF in noises with significant overlap with speech, such as speech babble and meeting room noises. Another trend to note is which values of β give better performance for each of the metrics. Generally, β near 1 gives good performance for LLR, PESQ, and f_2 estimation. On the other hand, β closer to 0 works better for WSS and f_1 estimation.
5.2.2 MRI Speech Recording Data
I ran a second experiment for denoising the audio recorded in an MRI scanner. I ran the same CMF algorithm described in Chapter 3, except I replaced the squared Euclidean cost function with the complex beta cost function. Based on the synthetic data experiment, I set β = 0.6 because it is roughly the value that gave the best performance for the different metrics on average. However, β can be, and should be, chosen to optimize a specific metric as is required; my choice of β = 0.6 is merely for exposition. Table 5.2 shows the noise suppression results on the MRI-utt dataset. As was done in Chapter 3, I calculated all the metrics after denoising Aurora 4 clean speech with MRI noise added at −7 dB SNR due to the lack of ground-truth clean speech in the MRI-utt dataset. Table 5.3 shows these results. From the tables, one can see that Beta CMF was able to improve the performance over the proposed approach in Chapter 3 that uses the squared Euclidean distance in all the metrics except for STOI.
Figure 5.1: Denoising metrics for various values of β with restaurant noise added at 0 dB.

Figure 5.2: Denoising metrics for various values of β with traffic noise added at 0 dB.
Table 5.3: Noise suppression (NS), LLR, PESQ scores, and STOI scores for the Aurora 4 dataset.

Metric    Sequence   LMS-model   CS+SNG   CMF-WISA   Beta CMF
NS (dB)   per1       9.52        18.08    15.42      16.25
          per2       9.62        17.49    15.78      16.07
          per3       10.33       18.24    15.61      15.88
          per4       8.71        16.57    15.39      15.67
          per5       7.16        16.36    14.96      15.29
          aper1      0.21        16.61    14.93      15.23
          aper2      1.80        17.12    14.78      14.61
LLR       per1       0.987       2.462    1.004      1.097
          per2       0.931       2.046    1.058      1.084
          per3       0.850       2.065    1.012      0.998
          per4       1.058       1.987    1.018      1.058
          per5       1.497       1.882    1.020      1.029
          aper1      2.839       2.480    1.098      0.995
          aper2      2.094       2.265    0.679      0.506
PESQ      per1       1.91        1.95     2.20       2.31
          per2       1.97        2.09     2.23       2.38
          per3       2.03        2.06     2.30       2.46
          per4       1.94        2.07     2.36       2.51
          per5       1.97        2.14     2.43       2.57
          aper1      1.56        2.08     2.30       2.60
          aper2      1.97        2.02     3.01       3.56
STOI      per1       0.869       0.785    0.907      0.859
          per2       0.873       0.800    0.910      0.874
          per3       0.883       0.788    0.920      0.882
          per4       0.861       0.828    0.920      0.877
          per5       0.825       0.836    0.922      0.880
          aper1      0.714       0.790    0.907      0.878
          aper2      0.765       0.812    0.964      0.975
CHAPTER 6
ROBUST RECOVERY OF LATENT STRUCTURE
Observation of latent structure in data provides researchers with a tool for data analysis and interpretation.
Non-negative matrix factorization (NMF) is a widely-used method for observing the latent structure in a
signal of interest. NMF has been employed in a variety of areas, from analyzing molecular structure [81] to
enhancing noisy speech [44, 82]. The drawback to NMF is that it is sensitive to outliers in the data, because
the NMF formulation minimizes the squared residual error. Researchers have proposed several techniques to overcome this drawback. Kong et al. derived update equations that minimize the ℓ_{2,1} norm rather than
the Frobenius norm, and achieved better image clustering results using the modified metric [83]. Another
approach is to induce sparsity on the encoding matrix to control the number of basis elements that are
simultaneously activated, which can reduce the influence of outliers in the data in the factorization process
[19, 84]. I propose a method to filter the data during the factorization process to try to overcome outliers
and noise in the data.
I model the filter on the minimum variance distortionless response (MVDR) filter. This filter was first
proposed by Capon for beamforming in array signal processing [85], and then later adapted for spectral
estimation by Rao et al. [86, 87]. The MVDR filter computes a power spectrum that estimates the spectral
envelope of a signal with the property that it does not distort the spectrum. It does so by computing a bank of
filters, each of which try to pass a specific frequency of the signal undistorted while suppressing the output
at other frequencies. This formulation leads to a smoother estimate of the spectrum that is less sensitive
to noise compared to the Fourier transform of the signal. This is a desirable property for improving the
performance of NMF when factoring noisy data. Thus, I will rewrite the MVDR formulation and use it in
the NMF framework to perform filtering of the noisy data during the factoring operation.
This chapter is organized as follows. Section 6.1 describes the NMF and MVDR algorithms and describes
my proposed approach to combine these two methods to achieve joint filtering and factorization of a noisy
signal. Section 6.2 compares the performance of the proposed algorithm to NMF as well as NMF of a
Wiener-filtered input. In Section 6.3, I discuss the experiments and point out the conditions in which my
algorithm performs well and where it fails. Finally, I state my future work in Section 6.4.
6.1 Joint Filtering–Factorization Algorithm
NMF factors a non-negative matrix V into basis matrix W and encoding matrix H by minimizing ‖V − WH‖²_F
(see Chapter 2, Section 2.1 for more details). If there is noise in the data, then it is possible for some
columns ofW to capture properties of the noise, which can obscure the structure in the data. To overcome
noise in the data, I design a filter modeled on the MVDR filter because it has the desirable property of
preserving salient peaks in the spectrum of the data. I aim to incorporate the MVDR formulation within the
NMF framework to perform joint filtering and factorization of noisy data. More specifically, I will use the
data inV to derive a set of filters that optimally estimates the spectrum of the desired signal corrupted with
noise, and use this filtered data to calculate an improved basis matrix estimate.
The MVDR formula for a filter f_k that passes a frequency ω_k undistorted with magnitude α_k is
\[
\hat{f}_k = \underset{f_k}{\operatorname{argmin}}\ f_k^{H} R f_k \quad \text{s.t.} \quad e^{H}(\omega_k)\, f_k = \alpha_k, \tag{6.1}
\]
where R is a Toeplitz autocorrelation matrix of the input signal x[n] and e(ω_k) = [1, e^{jω_k}, …, e^{jω_k(N−1)}]^⊤.
Using Parseval’s Theorem, I can write a roughly equivalent formulation in the frequency domain as
\[
\hat{g}_k = \underset{g_k}{\operatorname{argmin}}\ \| g_k \odot x \|_2^2 \quad \text{s.t.} \quad g_{kk}\, x_k = \alpha_k x_k, \tag{6.2}
\]
where g_k is the frequency response of f_k, g_kk is the value of the frequency response of g_k at the kth frequency bin, and x is the frequency response of the input data. Equation 6.2 computes the frequency response that has a pre-determined fixed value at the kth frequency bin and has minimal amplitude at the other frequencies. For maximal minimization, the frequency response at the other frequencies should be 0. To jointly compute a bank of filters, I solve
\[
\hat{g} = \underset{g}{\operatorname{argmin}}\ \| g \odot x \|_2^2 \quad \text{s.t.} \quad g \odot x = a \odot x, \tag{6.3}
\]
where a = [α_1 α_2 ⋯ α_m]^⊤ is a vector of the desired frequency response for all frequencies. To achieve
joint filtering and factoring, I incorporate Equation 6.3 in the NMF framework. I formulate the cost function
as
\[
J = \| G \odot V - WH \|_F^2 + \lambda_1 \| G \odot (WH) \|_F^2 + \lambda_2 \| G \odot (WH) - A \odot (WH) \|_F^2, \tag{6.4}
\]
where V is the spectrogram of the input data. The first term in the cost function performs NMF on the filtered input data, λ_1 controls the level of filtering, and λ_2 controls the extent to which G is constrained by the desired frequency response A.
6.1.1 Update equations
Computing the gradient of J with respect to G and setting it to zero allows one to obtain a closed-form solution for the filter G:
\[
G = \frac{ (WH) \odot V + \lambda_2\, A \odot (WH)^{\odot 2} }{ V^{\odot 2} + (\lambda_1 + \lambda_2)(WH)^{\odot 2} }, \tag{6.5}
\]
where the squares, division, and products are element-wise.
Since G depends on W and H, the filter is updated at every iteration during the algorithm. The iterative update equations for the basis matrix W and encoding matrix H are:
\[
W \leftarrow W \odot \frac{ (G \odot V) H^{\top} }{ WHH^{\top} + \lambda_1 WC + \lambda_2 WD }, \qquad
H \leftarrow H \odot \frac{ W^{\top} (G \odot V) }{ W^{\top}WH + \lambda_1 EH + \lambda_2 FH }, \tag{6.6}
\]
where
\[
\begin{aligned}
C_{ab} &= (G_{a,:} \odot H_{b,:})(G_{a,:} \odot H_{b,:})^{\top}\\
D_{ab} &= ((G_{a,:} - A_{a,:}) \odot H_{b,:})((G_{a,:} - A_{a,:}) \odot H_{b,:})^{\top}\\
E_{ab} &= (W_{:,a} \odot G_{:,b})^{\top}(W_{:,a} \odot G_{:,b})\\
F_{ab} &= (W_{:,a} \odot (G_{:,b} - A_{:,b}))^{\top}(W_{:,a} \odot (G_{:,b} - A_{:,b})).
\end{aligned}
\tag{6.7}
\]
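A compact sketch of one iteration of the joint scheme follows: the closed-form filter follows Equation 6.5, while the multiplicative updates are simplified (the C, D, E, F penalty terms of Equation 6.7 are omitted for brevity), so this is an illustration rather than the full algorithm.

```python
# One simplified iteration of joint filtering and factorization.
import numpy as np

def joint_filter_factor_step(V, W, H, A, lam1, lam2, eps=1e-12):
    WH = W @ H
    # Eq. 6.5: element-wise closed-form update of the filter G
    G = (WH * V + lam2 * A * WH**2) / (V**2 + (lam1 + lam2) * WH**2 + eps)
    GV = G * V                                   # filtered input data
    W *= (GV @ H.T) / (WH @ H.T + eps)           # factor the filtered data
    H *= (W.T @ GV) / (W.T @ (W @ H) + eps)
    return G, W, H
```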
6.2 Experiments and Results
I compared the performance of the proposed algorithm to standard NMF as well as Wiener filter denoising followed by standard NMF (henceforth Wiener filter + NMF). The Wiener filter + NMF method represents sequential filtering followed by factorization, as opposed to the joint filtering and factorization of the proposed algorithm. I randomly sampled a set of 100 sentences from the MOCHA-TIMIT corpus [88], with 50 sentences spoken by male speakers and the other 50 sentences spoken by female speakers. I added white noise, pink noise, speech babble, and factory floor noise from the NOISEX database [89] to these sentences at 5 dB and 10 dB SNR levels. For each noise and SNR level, I computed the cosine similarity between the basis matrices of noisy speech and ground-truth clean speech for the proposed algorithm, standard NMF, and Wiener filter + NMF. The cosine similarity between the clean and noisy basis matrices is calculated using
\[
\rho = \frac{1}{k} \sum_{i=1}^{k} \frac{ \left| W_{\text{clean}}^{\top}(:, i)\, W_{\text{noisy}}(:, i) \right| }{ \| W_{\text{clean}}(:, i) \|_2\, \| W_{\text{noisy}}(:, i) \|_2 }. \tag{6.8}
\]
Equation 6.8 computes the mean of the cosine of the angle between a vector in the clean basis and the corresponding vector in the noisy basis. The cosine similarity ranges from 0 to 1, with higher values indicating higher similarity between the clean basis and noisy basis. Hence, the cosine similarity is a measure of the basis recovery of the noisy signal compared to the clean signal. I sorted the columns of the basis matrices in ascending order of center of gravity prior to computing the cosine similarity to make the computation meaningful. Figure 6.1a shows the cosine similarity for each algorithm in each noise condition. The values shown are the average of the cosine similarity values of the 100 sentences in each noise condition.
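A sketch of Equation 6.8 together with the center-of-gravity column sorting described above; the helper names are illustrative.

```python
# Cosine similarity between clean and noisy bases (Eq. 6.8).
import numpy as np

def sort_by_centroid(W):
    bins = np.arange(W.shape[0])
    centroids = (bins @ W) / (W.sum(axis=0) + 1e-12)   # spectral center of gravity
    return W[:, np.argsort(centroids)]

def basis_cosine_similarity(W_clean, W_noisy):
    Wc, Wn = sort_by_centroid(W_clean), sort_by_centroid(W_noisy)
    num = np.abs(np.sum(Wc * Wn, axis=0))
    den = np.linalg.norm(Wc, axis=0) * np.linalg.norm(Wn, axis=0)
    return float(np.mean(num / den))                   # rho in Eq. 6.8
```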
To evaluate the performance of the filtering in the proposed algorithm, I calculated the Perceptual Evaluation
of Speech Quality (PESQ) score of the denoised speech [47]. I reconstructed the denoised speech from the
recovered noisy basis and encoding matrices by applying the formula $\hat{V}_{\text{denoised}} = V_{\text{noisy}} \odot (W_{\text{noisy}}H_{\text{noisy}})$ and then computing the inverse Fourier transform of $\hat{V}_{\text{denoised}}$. Figure 6.1b shows the PESQ scores for
the reconstructed signals from my algorithm and standard NMF, and the denoised signal from the Wiener
filter in the different noise conditions. As with the cosine similarity, these scores are averaged over the 100
sentences.
Figure 6.1: (a) Cosine similarity and (b) PESQ scores for proposed algorithm, standard NMF, and Wiener
filter + NMF in 5 dB (front) and 10 dB (behind) SNR levels.
6.3 Discussion
I used the Wilcoxon rank-sum test, a non-parametric version of the Student's t-test, to evaluate the statistical
significance of my results. The cosine similarity values of the proposed algorithm are significantly better
than NMF’s and Wiener filter + NMF’s correlation values at the 99% level in all noise conditions. This
suggests that the basis matrices recovered from noisy data by the proposed algorithm better represent
the underlying structure of the signal of interest compared to using standard NMF or filtering the signal
prior to performing NMF. This holds true across stationary and non-stationary noises, and wideband and
narrowband noises. Overall, it appears that the joint filtering and factoring approach performs the best on
wideband stationary noise, such as white and pink noises. This is because the bandpass filter inA helpsG to
remove a lot of the out-of-band noise, while the MVDR-like formulation in the proposed algorithm helpsG
preserve the speech, which appears as peaks in the passband region of the spectrum. On the other hand, the
proposed algorithm’s performance degrades in narrowband noise. If the noise lies outside the frequencies
of interest, then it will likely be suppressed by the stopband imposed by $A$. However, if the noise is within
the passband, there are no such guarantees.
From looking at the correlation scores in Figure 6.1a, one can see that the joint filtering and factorization
approach consistently outperforms the filtering followed by factorization approach. This suggests that there
are benefits to filtering during the factorization operation rather than prior to factorization. One such benefit
is that the filter can adapt to the factored outputsW andH and reweight the frequency response to further
boost frequencies of interest while suppressing undesirable frequencies.
From the PESQ scores in Figure 6.1b, one can see that the proposed algorithm’s denoising performance
is on par with NMF and Wiener filtering methods. The difference in PESQ scores between my algorithm
and NMF is statistically significant only in the 10 dB noise conditions, but the Wiener filter scores are
significantly worse than the proposed algorithm and NMF in all noise conditions. This suggests that in
addition to recovering improved basis matrices with the proposed algorithm, one can also reconstruct a
denoised signal with a quality comparable to other denoising methods. The comparable performance also
suggests that the speech bases found by the proposed algorithm contain meaningful speech information; it
is possible to find a pathological basis with high cosine similarity, but the PESQ scores would suffer since
the basis won’t contain speech information.
6.4 Future Work
The current joint filtering and factorization formulation minimizes the Frobenius norm of the reconstruction
error. For different data and applications, it might be appropriate to minimize a different divergence. I
will explore using the alpha-beta divergence [90] for measuring the reconstruction error. The alpha-beta
divergence provides two tunable parameters, alpha and beta, that generalize the alpha-, beta- and gamma-
divergences, which include the Frobenius norm and generalized KL divergence. The various divergences
give different weights to the error magnitudes, so the ability to set alpha and beta can re-weight the error
magnitudes based on the application.
CHAPTER 7
NOISE-ROBUST ACOUSTIC FEATURES FOR AUTOMATIC SPEECH
RECOGNITION
Automatic speech recognition (ASR) is increasingly being used as the primary interface between humans
and devices. Speech offers a natural and efficient way to communicate with devices. Furthermore, rich
information contained in speech, such as emotion [26] and cognitive load [91], can help devices interact or
respond appropriately to users. Unfortunately, ASR systems perform poorly in noisy environments. Gen-
erally, features extracted from noisy speech contain distortion and artifacts. Researchers have proposed
several approaches to reduce the distortion and artifacts, including speech denoising [92], feature enhance-
ment [93], feature transformation [94], and acoustic model adaptation [95, 96]. Multi-condition training
has also been found to reduce word error rates on noisy speech [97]. The goal in all of these approaches is
to reduce the mismatch between the features extracted from clean and noisy speech. In this chapter, I will
detail an algorithm using matrix factorization to generate noise-robust acoustic features for ASR on noisy
speech.
This chapter is organized as follows. Section 7.1 provides an overview of several methods researchers have
developed to improve ASR performance in noisy environments. Section 7.2 describes the process I used to
create acoustic features that are more invariant to acoustic noise. Section 7.3 discusses the ASR experiment
and compares the word error rate with baseline log-mel features extracted from noisy and denoised speech.
Section 7.4 gives insights into the results of the experiments and points out some of the limitations in my
work. Finally, Section 7.5 offers my directions for future work.
7.1 Prior Work
Speech denoising is a commonly-used pre-processing step. Popular methods for speech denoising include
Wiener filtering and spectral subtraction methods [98]. These methods assume the power spectra of speech
and noise are additive, and an estimate of the noise power spectra can be subtracted from the noisy power
spectra at the frame level. Another denoising technique assuming additive components is non-negative ma-
trix factorization (NMF) [10–12]. In NMF, each frame of noisy speech is decomposed into components
from a speech dictionary and a noise dictionary, and the underlying speech is recovered by keeping the com-
ponents corresponding to the speech dictionary. Speech denoising, however, can introduce distortion and
artifacts, such as musical noise, and has been shown to degrade ASR performance [97, 99]. Moreover, these
algorithms operate at each frame independently, and so they can introduce discontinuities across frames.
These discontinuities can manifest as noise in the features, thus contributing to feature mismatch between
clean and noisy speech.
Another way to reduce feature mismatch is to extract features that are more robust to noise. Moreno et
al. introduced Vector Taylor Series (VTS) features [100], which use the Taylor series expansion of the
noisy signal to model the effect of noise and channel characteristics on the speech statistics. Deng et al.
proposed the Stereo-based Piecewise Linear Compensation for Environments (SPLICE) algorithm [101] for
generating noise-robust features for datasets that have clean versions of the noisy data (stereo datasets).
They assume that each cepstral vector from the noisy speech comes from a mixture of Gaussians, and that
the clean speech cepstral vector has a piece-wise linear relationship to the noisy speech cepstral vector.
Power-Normalized Cepstral Coefficients (PNCC), recently proposed by Kim and Stern [102], were shown
to reduce word error rates on noisy speech compared to Mel-Frequency Cepstral Coefficients (MFCC) and
Relative Spectral Perceptual Linear Prediction (RASTA-PLP) coefficients. Inspired by human auditory pro-
cessing, the processing steps for creating PNCCs include a power-law nonlinearity, a denoising algorithm,
and temporal masking.
7.2 Algorithm for Creating Noise-Robust Acoustic Features
I propose an algorithm for creating noise-robust acoustic features using the encoding matrices from convo-
lutive NMF (CNMF) [18]. The encoding matrix can be discriminative of the different speech sounds at the
frame level when the basis matrix remains fixed. I describe an algorithm that reduces the effect of noise on
the resulting encoding matrices. The following sections describe the steps of the algorithm, and Figure 7.1
summarizes the algorithm in a flowchart.
Figure 7.1: Flowchart illustrating the algorithm for generating noise-robust time-activation matrices.
7.2.1 Step 1: Learn a speech basis
Speech contains certain spectro-temporal properties that help distinguish it from background noise. I use
CNMF to discover the spectro-temporal building blocks of speech and store them in a time-varying basis.
CNMF decomposes a spectrogram $V \in \mathbb{R}_+^{m \times n}$ into a time-varying basis $W \in \mathbb{R}_+^{m \times k \times t}$ and encoding matrix $H \in \mathbb{R}_+^{k \times n}$ by minimizing the divergence between $V$ and $\hat{V} = \sum_{\tau=0}^{t-1} W(\tau)\,\overset{\tau\rightarrow}{H}$, where $\overset{\tau\rightarrow}{(\cdot)}$ shifts the columns of its argument $\tau$ frames to the right and $W(\tau)$ refers to the basis at time $\tau$ (the third dimension of $W$). In this work, I minimize the generalized KL divergence between $V$ and $\hat{V}$:
$$(7.1)\qquad D_{\text{GKL}}\!\left(V \,\middle\|\, \hat{V}\right) = \sum_{i=1}^{m}\sum_{j=1}^{n} V_{ij}\ln\!\left(\frac{V_{ij}}{\hat{V}_{ij}}\right) - V_{ij} + \hat{V}_{ij}$$
To learn a speech dictionary, I concatenate the clean speech from a stereo dataset into one long utterance and create the spectrogram $V_{\text{clean}}$ from this utterance. I use CNMF to decompose $V_{\text{clean}}$ into a spectro-temporal speech basis $W_{\text{speech}}$ and encoding matrix $H_{\text{clean}}$. Researchers have shown that imposing sparsity on the encoding matrix improves the quality of the dictionary [13, 14]. Thus, I augment the generalized KL divergence with an $\ell_1$ penalty on the encoding matrix to encourage sparsity:
$$(7.2)\qquad C_{\text{speech}} = D_{\text{GKL}}\!\left(V_{\text{clean}} \,\middle\|\, \hat{V}_{\text{clean}}\right) + \lambda\sum_{i=1}^{k}\sum_{j=1}^{n}\left[H_{\text{clean}}\right]_{ij},$$

where $\hat{V}_{\text{clean}} = \sum_{\tau=0}^{t-1} W_{\text{speech}}(\tau)\,\overset{\tau\rightarrow}{H}_{\text{clean}}$ and $\lambda$ controls the level of sparsity of $H$. To minimize Equation 7.2, I iteratively update $W_{\text{speech}}$ and $H_{\text{clean}}$ with the following multiplicative updates:
$$W_{\text{speech}}(\tau) \leftarrow W_{\text{speech}}(\tau) \odot \frac{\left(\frac{V_{\text{clean}}}{\hat{V}_{\text{clean}}}\right)\left(\overset{\tau\rightarrow}{H}_{\text{clean}}\right)^{\!\top}}{\mathbf{1}_{m \times n}\left(\overset{\tau\rightarrow}{H}_{\text{clean}}\right)^{\!\top}}, \quad \forall\,\tau \in \{0,\dots,t-1\} \qquad (7.3a)$$

$$H_{\text{clean}} \leftarrow H_{\text{clean}} \odot \frac{\sum_{\tau=0}^{t-1} W_{\text{speech}}(\tau)^\top\,\overset{\leftarrow\tau}{\left(\frac{V_{\text{clean}}}{\hat{V}_{\text{clean}}}\right)}}{\sum_{\tau=0}^{t-1} W_{\text{speech}}(\tau)^\top\,\mathbf{1}_{m \times n} + \lambda}, \qquad (7.3b)$$

where $\overset{\leftarrow\tau}{(\cdot)}$ shifts the columns of its argument $\tau$ frames to the left.
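A compact numpy sketch of these sparse CNMF updates is given below. The helpers shift_right and shift_left implement the right- and left-shift operators; the initialization, iteration count, and epsilon guard are my own choices, and later sketches in this chapter reuse these helpers.

import numpy as np

def shift_right(X, tau):
    # Columns move right by tau frames, zero-padded (the tau-> operator).
    if tau == 0:
        return X
    out = np.zeros_like(X)
    out[:, tau:] = X[:, :-tau]
    return out

def shift_left(X, tau):
    # Columns move left by tau frames, zero-padded (the <-tau operator).
    if tau == 0:
        return X
    out = np.zeros_like(X)
    out[:, :-tau] = X[:, tau:]
    return out

def cnmf_reconstruct(W, H):
    # V_hat = sum_tau W(tau) (H shifted right by tau); W has shape (m, k, t).
    return sum(W[:, :, tau] @ shift_right(H, tau) for tau in range(W.shape[2]))

def cnmf_kl_sparse(V, k=60, t=5, lam=2.0, n_iter=100, eps=1e-12):
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k, t)) + eps
    H = rng.random((k, n)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        R = V / (cnmf_reconstruct(W, H) + eps)
        for tau in range(t):                           # Equation 7.3a
            Hs = shift_right(H, tau)
            W[:, :, tau] *= (R @ Hs.T) / (ones @ Hs.T + eps)
        R = V / (cnmf_reconstruct(W, H) + eps)
        num = sum(W[:, :, tau].T @ shift_left(R, tau) for tau in range(t))
        den = sum(W[:, :, tau].T @ ones for tau in range(t)) + lam
        H *= num / den                                 # Equation 7.3b
    return W, H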
7.2.2 Step 2: Learn a noise basis
I also use CNMF to learn the spectro-temporal properties of noise. Importantly, I want the noise basis to capture as much of the perturbation due to noise as possible so that the encoding matrix is unaffected by noise. That is, suppose I have clean speech $V_{\text{clean}}$ that decomposes into $W_{\text{speech}}$ and $H_{\text{clean}}$, and I have the corresponding speech corrupted by noise, $V_{\text{noisy}}$. Then, I would like to find a noise basis $W_{\text{noise}}$ such that the CNMF decomposition of $V_{\text{noisy}}$ also yields the encoding matrix $H_{\text{clean}}$.
To achieve this goal, I minimize the following cost function:
$$(7.4)\qquad C_{\text{noisy}} = D_{\text{GKL}}\!\left(V_{\text{noisy}} \,\middle\|\, \hat{V}_{\text{noisy}}\right) + \lambda\sum_{i=1}^{k}\sum_{j=1}^{n}\left[H_{\text{clean}}\right]_{ij},$$

where $\hat{V}_{\text{noisy}} = \sum_{\tau=0}^{t-1}\left(W_{\text{speech}}(\tau) + W_{\text{noise}}(\tau)\right)\overset{\tau\rightarrow}{H}_{\text{clean}}$. The idea behind this cost function is to try to push the variability due to noise into $W_{\text{noise}}$. This formulation is similar in idea to total variability modeling [103], where $W_{\text{speech}}$ represents the universal background model (UBM) and $W_{\text{noise}}$ represents the shift in the UBM due to some source of variability (in this case, noise).
To learn a noise basis, I pair the clean and noisy utterances in the stereo dataset. I concatenate the clean utterances and the noisy utterances and create spectrograms from these concatenated utterances, $V_{\text{clean}}$ and $V_{\text{noisy}}$. With $V_{\text{clean}}$ and $W_{\text{speech}}$ fixed, I run Equation 7.3b to get $H_{\text{clean}}$. Then, with $V_{\text{noisy}}$, $W_{\text{speech}}$, and $H_{\text{clean}}$ fixed, I obtain the spectro-temporal noise basis $W_{\text{noise}}$ by using the following update rule that minimizes Equation 7.4:
$$W_{\text{noise}}(\tau) \leftarrow W_{\text{noise}}(\tau) \odot \frac{\left(\frac{V_{\text{noisy}}}{\hat{V}_{\text{noisy}}}\right)\left(\overset{\tau\rightarrow}{H}_{\text{clean}}\right)^{\!\top}}{\mathbf{1}_{m \times n}\left(\overset{\tau\rightarrow}{H}_{\text{clean}}\right)^{\!\top}}, \quad \forall\,\tau \in \{0,\dots,t-1\} \qquad (7.5)$$
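A sketch of this update follows, reusing shift_right and cnmf_reconstruct from the Step 1 sketch: $W_{\text{speech}}$ and $H_{\text{clean}}$ stay fixed, so the residual mismatch between $V_{\text{noisy}}$ and the reconstruction is pushed into $W_{\text{noise}}$. The iteration count and epsilon are my own choices.

import numpy as np

def update_noise_basis(V_noisy, W_speech, W_noise, H_clean, n_iter=100, eps=1e-12):
    # Equation 7.5: only W_noise moves; W_noise is updated in place.
    t = W_speech.shape[2]
    ones = np.ones_like(V_noisy)
    for _ in range(n_iter):
        R = V_noisy / (cnmf_reconstruct(W_speech + W_noise, H_clean) + eps)
        for tau in range(t):
            Hs = shift_right(H_clean, tau)
            W_noise[:, :, tau] *= (R @ Hs.T) / (ones @ Hs.T + eps)
    return W_noise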
7.2.3 Step 3: Learn a time-varying projection
Once I have the speech and noise bases in hand, I can generate encoding matrices for the entire dataset.
However, note that the CNMF cost function minimizes the signal reconstruction error. This cost function is
appropriate when you want to reconstruct the signal (e.g., denoising speech). What is important when using the encoding matrices as features is the reduction in mismatch between the matrices from clean and noisy speech, which is not guaranteed by the CNMF cost function.

To reduce feature mismatch, I find a time-varying projection matrix $P \in \mathbb{R}_+^{k \times m \times t}$ that denoises the encoding matrices from noisy speech by projecting them onto the space containing the encoding matrices from clean speech. The cost function that achieves this is
$$(7.6)\qquad C_{\text{proj}} = D_{\text{GKL}}\!\left(H_{\text{clean}} \,\middle\|\, \hat{H}_{\text{denoised}}\right) + D_{\text{GKL}}\!\left(\hat{H}_{\text{clean}} \,\middle\|\, \hat{H}_{\text{denoised}}\right),$$

where $\hat{H}_{\text{clean}} = \sum_{\tau=0}^{t-1} P(\tau)\,\overset{\tau\rightarrow}{\hat{V}}_{\text{clean}}$, $\hat{H}_{\text{denoised}} = \sum_{\tau=0}^{t-1} P(\tau)\,\overset{\tau\rightarrow}{\hat{V}}_{\text{denoised}}$, $\hat{V}_{\text{clean}}$ is as defined in Equation 7.2, and $\hat{V}_{\text{denoised}} = \sum_{\tau=0}^{t-1} W_{\text{speech}}(\tau)\,\overset{\tau\rightarrow}{H}_{\text{noisy}}$. The first part of the cost function minimizes the divergence between the denoised and target clean encoding matrices. The second part of the cost function ensures that $P$ projects encoding matrices from clean and noisy speech in the same way. The second part is useful during feature extraction (Step 4), where it is unknown whether the utterance is clean or noisy. Equation 7.6 can be minimized with the following multiplicative update:
$$(7.7)\qquad P(\tau) \leftarrow P(\tau) \odot \frac{\mathbf{1}\left(\overset{\tau\rightarrow}{\hat{V}}_{\text{clean}}\right)^{\!\top} + \frac{H_{\text{clean}} + \hat{H}_{\text{clean}}}{\hat{H}_{\text{denoised}}}\left(\overset{\tau\rightarrow}{\hat{V}}_{\text{denoised}}\right)^{\!\top}}{\left(1 + \ln\frac{\hat{H}_{\text{clean}}}{\hat{H}_{\text{denoised}}}\right)\left(\overset{\tau\rightarrow}{\hat{V}}_{\text{clean}}\right)^{\!\top} + 2\,\mathbf{1}\left(\overset{\tau\rightarrow}{\hat{V}}_{\text{denoised}}\right)^{\!\top}}, \quad \forall\,\tau \in \{0,\dots,t-1\}$$
To learn the time-varying projection, I pair the clean and noisy utterances. For the clean utterances, I run CNMF with $W_{\text{speech}}$ fixed to get $H_{\text{clean}}$. For the noisy utterances, I run CNMF with $W_{\text{speech}}$ and $W_{\text{noise}}$ fixed to get $H_{\text{noisy}}$. I then learn the time-varying projection with Equation 7.7.
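A sketch of the projection update follows, reusing shift_right from the Step 1 sketch. Because the logarithmic term in the denominator of Equation 7.7 can go negative in practice, I clamp the denominator with a small floor; that guard is my own practical addition, not part of the derivation.

import numpy as np

def project(P, V):
    # H_hat = sum_tau P(tau) (V shifted right by tau); P has shape (k, m, t).
    return sum(P[:, :, tau] @ shift_right(V, tau) for tau in range(P.shape[2]))

def update_projection(P, H_clean, V_hat_clean, V_hat_denoised, n_iter=100, eps=1e-12):
    ones = np.ones_like(H_clean)
    for _ in range(n_iter):                            # Equation 7.7
        H_hat_clean = project(P, V_hat_clean) + eps
        H_hat_den = project(P, V_hat_denoised) + eps
        ratio = (H_clean + H_hat_clean) / H_hat_den
        log_term = 1.0 + np.log(H_hat_clean / H_hat_den)
        for tau in range(P.shape[2]):
            Vc = shift_right(V_hat_clean, tau)
            Vd = shift_right(V_hat_denoised, tau)
            num = ones @ Vc.T + ratio @ Vd.T
            den = log_term @ Vc.T + 2.0 * (ones @ Vd.T)
            P[:, :, tau] *= num / np.maximum(den, eps)
    return P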
7.2.4 Step 4: Extract acoustic features
Once I have learned the time-varying projection, I can generate encoding matrices for a test dataset as features for the acoustic model. For each utterance $V_{\text{utt}}$ in the corpus, I find the encoding matrix $H_{\text{utt}}$ with $W_{\text{speech}}$ and $W_{\text{noise}}$ fixed using the following update rule:
$$H_{\text{utt}} \leftarrow H_{\text{utt}} \odot \frac{\sum_{\tau=0}^{t-1}\left(W_{\text{speech}}(\tau) + W_{\text{noise}}(\tau)\right)^\top\,\overset{\leftarrow\tau}{\left(\frac{V_{\text{utt}}}{\hat{V}_{\text{utt}}}\right)}}{\sum_{\tau=0}^{t-1}\left(W_{\text{speech}}(\tau) + W_{\text{noise}}(\tau)\right)^\top\,\mathbf{1}_{m \times n} + \lambda}, \qquad (7.8)$$

where $\hat{V}_{\text{utt}} = \sum_{\tau=0}^{t-1}\left(W_{\text{speech}}(\tau) + W_{\text{noise}}(\tau)\right)\overset{\tau\rightarrow}{H}_{\text{utt}}$. Then, I use the time-varying projection $P$ to calculate the denoised encoding matrix $H_{\text{denoised}} = \sum_{\tau=0}^{t-1} P(\tau)\,\overset{\tau\rightarrow}{\hat{V}}_{\text{denoised}}$, where $\hat{V}_{\text{denoised}} = \sum_{\tau=0}^{t-1} W_{\text{speech}}(\tau)\,\overset{\tau\rightarrow}{H}_{\text{utt}}$. I input $\log(H_{\text{denoised}})$ as features into the acoustic model.
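Putting Step 4 together, the following sketch estimates the utterance encoding with both bases fixed, rebuilds the speech-only reconstruction, and projects it to the denoised encoding used as the acoustic feature. It reuses cnmf_reconstruct, shift_left, and project from the earlier sketches; the log floor is my own numerical guard.

import numpy as np

def extract_features(V_utt, W_speech, W_noise, P, lam=2.0, n_iter=100, eps=1e-12):
    k, t = W_speech.shape[1], W_speech.shape[2]
    rng = np.random.default_rng(0)
    H_utt = rng.random((k, V_utt.shape[1])) + eps
    W_both = W_speech + W_noise
    ones = np.ones_like(V_utt)
    den = sum(W_both[:, :, tau].T @ ones for tau in range(t)) + lam
    for _ in range(n_iter):                            # Equation 7.8
        R = V_utt / (cnmf_reconstruct(W_both, H_utt) + eps)
        num = sum(W_both[:, :, tau].T @ shift_left(R, tau) for tau in range(t))
        H_utt *= num / den
    V_hat_denoised = cnmf_reconstruct(W_speech, H_utt)
    H_denoised = project(P, V_hat_denoised)
    return np.log(H_denoised + eps)                    # acoustic-model input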
7.3 ASR Experiment
I investigated the performance of the proposed algorithm on the Aurora 4 corpus [42]. The training set
consists of 7137 multi-condition sentences from the Wall Street Journal database. The noisy utterances are
corrupted with one of six different noise types (airport, babble, car, restaurant, street traffic, and train station)
at 10–20 dB SNR. The standard Aurora 4 test set consists of 330 base utterances from 8 speakers, with each
of the utterances corrupted by the same six noises with SNRs ranging from 5–15 dB. The test set is divided
into four categories:
A: clean speech with near-field microphone.
B: average of all noise conditions with near-field microphone.
C: clean speech with far-field microphone.
D: average of all noise conditions with far-field microphone.
The acoustic model for the ASR is a 7-layer fully-connected deep neural network (DNN) with 1024 neurons
per hidden layer and 2000 neurons in the output layer. I use the rectified linear units (ReLU) activation
function and a fixed dropout rate of 50% for layers 4 and 5. The training is based on the cross-entropy
criterion, using stochastic gradient descent (SGD) and a mini-batch size of 256. I apply speaker-independent
global mean and variance normalization to the features prior to augmenting them with delta and delta-delta,
followed by splicing of 5 frames to the left and right for context. I used the task-standard WSJ0 bigram
language model. The Aurora 4 test set is decoded using the IBM Attila dynamic decoder [104].
I ran two baseline experiments: extracting 40-dimensional log-mel features from the unprocessed speech and
extracting 40-dimensional log-mel features from speech denoised by CNMF. To obtain denoised speech, I
calculated the denoised spectrogram

$$V_{\text{denoised}} = \frac{\hat{V}_{\text{speech}}}{\hat{V}_{\text{speech}} + \hat{V}_{\text{noise}}} \odot V_{\text{utt}} \qquad (7.9)$$

for each utterance $V_{\text{utt}}$, with $\hat{V}_{\text{speech}} = \sum_{\tau=0}^{t-1} W_{\text{speech}}(\tau)\,\overset{\tau\rightarrow}{H}_{\text{utt}}$ and $\hat{V}_{\text{noise}} = \sum_{\tau=0}^{t-1} W_{\text{noise}}(\tau)\,\overset{\tau\rightarrow}{H}_{\text{utt}}$. I converted the denoised spectrogram to the time-domain using the overlap-add method [105].
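A sketch of this baseline, reusing cnmf_reconstruct from the Step 1 sketch; pairing the masked magnitudes with the noisy phase before overlap-add resynthesis is my own simplification.

import numpy as np

def denoise_spectrogram(V_utt, W_speech, W_noise, H_utt, eps=1e-12):
    # Equation 7.9: Wiener-style soft mask built from the two reconstructions.
    V_s = cnmf_reconstruct(W_speech, H_utt)
    V_n = cnmf_reconstruct(W_noise, H_utt)
    return V_utt * V_s / (V_s + V_n + eps)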
Next, I generated encoding matrices in three different ways: using only a speech basis, using a speech and
noise basis and keeping the rows of the encoding matrix corresponding to the speech dictionary, and using
the algorithm described in the previous section. I used $k = 60$, $t = 5$, and $\lambda = 2$ to generate these matrices.
Furthermore, I appended the encoding matrices generated using the proposed method to log-mel features.
Table 7.1 shows the word error rates (WER) for all the experiments.
Table 7.1: Word error rates for different acoustic model features in different noise and channel conditions.

Feature                          A      B      C      D      Average
Log-mel, unprocessed speech      4.82   8.32   8.03   16.35  12.34
Log-mel, denoised speech         5.29   9.67   9.19   19.36  14.52
Encodings, speech basis          5.06   8.35   10.50  18.69  13.53
Encodings, speech+noise basis    5.01   8.29   10.42  18.27  13.29
Encodings, proposed algorithm    4.43   7.34   7.86   16.29  11.82
Log-mel + encodings              4.22   7.17   7.70   15.56  11.37
7.4 Discussion
Table 7.1 shows that log-mel features extracted from denoised speech performed worse than log-mel fea-
tures extracted from unprocessed speech. As mentioned previously, denoising is a common step taken by
researchers when performing ASR on noisy speech. The results indicate, in the context of multi-condition
training, that it is better not to denoise the speech. Denoising most likely increases the WER because it
introduces distortions and artifacts in the signal. Since most features, including log-mel features, are cal-
culated directly from the signal, the features capture the artifacts, thus increasing the mismatch between
features from clean and noisy speech. Moreover, the distortions and artifacts can vary by noise and SNR
level. These introduce additional sources of variability in the log-mel features.
The results show that using the encoding matrices directly as features outperforms using them as a denois-
ing pre-processing step. Unfortunately, calculating the encoding matrices with only a speech basis performs
below log-mel features on unprocessed speech. Since the speech dictionary is fixed when generating fea-
tures, a poorer performance is expected because the variability due to noise is captured by the encoding
matrix, making it susceptible to noise. Adding the noise basis gave slight improvements because it was
able to capture some of the variability due to noise. However, the noise basis did not adapt to different
noises during feature extraction, reducing its efficacy in capturing the noise variability. On the other hand,
generating encoding matrices with the proposed algorithm outperformed all of the previous experiments. In
different noise conditions with the near-field microphone (category B), I achieved an 11.8% relative improvement over log-mel features on unprocessed speech. This result suggests that designing noise-robust features
can improve ASR performance on noisy speech compared to extracting standard features on unprocessed or
denoised speech.
Finally, appending the encoding matrices to the log-mel features gives the best-performing system. In
category B, I achieved a 13.8% relative improvement over log-mel features on unprocessed speech. The im-
provement in performance over using just the encoding matrices indicates that the encoding matrices contain
complementary information to log-mel features. The log-mel features are a low-dimensional projection of
the spectrogram, and so they contain spectral information. On the other hand, the encoding matrix is an
encoding of the spectrogram relative to the speech basis. Thus, the encoding matrix doesn’t contain spectral
information, but rather shows the magnitude of different spectro-temporal speech patterns at each frame. For
visualization, Figure 7.2 compares the log-mel features and encoding matrices extracted for an Aurora 4 ut-
terance in clean and babble noise. Notice that the encoding matrix for babble noise is more closely matched
to the matrix for clean speech than the log-mel features for babble noise are to clean log-mel features.
Figure 7.2: Comparison of log-mel features and encoding matrices for an Aurora 4 utterance: (a) log-mel features, clean; (b) log-mel features, babble; (c) encodings, clean; (d) encodings, babble.

A limitation of the proposed algorithm is the need for clean versions of noisy speech in the corpus (stereo dataset). I used the clean speech when learning the bases and time-varying projection. This limits the approach to datasets with clean speech ground truth. One way around this constraint is to learn the bases
and projection on a different stereo dataset, and then apply the bases and projection when extracting features
on a non-stereo dataset. Another workaround is to use a voice activity detector (VAD) to learn the speech
bases only from frames that have a high confidence of containing speech. Additionally, the frames marked
as non-speech can be used to adapt the noise dictionary during the feature extraction step.
7.5 Future Work
To build upon this work, I will explore ways to adapt the noise dictionary during feature extraction to
generalize better to unseen noises. As mentioned earlier, using a VAD can provide frame-level confidence
scores about the presence of speech. Frames that have low speech confidence scores can be used to update
the noise basis. Such updating should improve the noise-robustness of the features in non-stationary noises.
I will also incorporate channel compensation into the proposed algorithm. From the results, it is clear that
all features give poorer performance in mismatched channel conditions (Conditions C and D). One way
to tackle the channel mismatch problem is to use the proposed algorithm with mean-normalized cepstral
coefficients instead of spectrograms. Cepstral mean normalization is a common approach to removing the
filtering effects of the microphone channel (or any other convolutive noise).
CHAPTER 8
EXTRACTING TEMPORAL PATTERNS FROM TIME-SERIES DATA
As mentioned in Chapter 6, matrix factorization techniques are a powerful method for observing latent
structure in data. Convolutive NMF (CNMF) [18] was proposed to consider temporal context in time-series
data and extract temporal patterns observed in the data. CNMF was shown to find speech phones when
operating on spectrograms of speech. Sparsity constraints on either the basis or encoding matrix can be
employed in order to get more interpretable outputs, depending on the application [13, 14]. In order to
provide interpretability of the sparsity parameter, Hoyer proposed NMF with sparsity constraints (NMF-
SC) [13], where the sparsity parameter ranges between 0 and 1, with 0 requiring no sparsity and 1 enforcing
maximum sparsity. A convolutive extension to this algorithm, CNMF-SC, was recently proposed in [19] to
find articulatory primitives in Electromagnetic Articulography (EMA) data.
One drawback of the various NMF variants is the requirement of a non-negative data matrix. This can
prevent the use of NMF in cases where the data contain negative values. To overcome this, Ding et al.
proposed the Convex NMF algorithm [20], where the basis matrix is formed as a convex combination of
the data points. Thurau et al. introduced Convex Hull NMF (CH-NMF) to improve computation speed of
Convex NMF on large datasets [21]. See Chapter 2, Section 2.4 for more details about these two algo-
rithms. I propose the Convolutive Convex Hull fActoRization of a Matrix (CHARM) algorithm that extends
CH-NMF to incorporate temporal context in time-series data. Like CNMF, the basis will contain a set of
temporal patterns found in the data. However, the basis will inherit the desirable properties of the CH-NMF
basis: temporal patterns that correspond closely to temporal units in the data and represent a wide range of
dynamics.
This chapter is organized as follows. Section 8.1 describes the CHARM algorithm. Section 8.2 discusses the
experiments on synthetic time-series and real articulatory data and compares the performance quantitatively
and qualitatively to CNMF-SC. Finally, Section 8.3 offers my directions for future work.
8.1 CHARM Algorithm
Assume a multivariate time-series $V = [v_1\ v_2\ \cdots\ v_n] \in \mathbb{R}^{m \times n}$ of $m$ variables over $n$ time frames. CHARM tries to find $k$ temporal patterns of duration $t$ in $V$. To achieve this, I propose minimizing the cost function

$$(\hat{G}, \hat{H}) = \underset{G,H}{\arg\min}\ \left\|V - S\sum_{\tau=0}^{t-1} G(\tau)\,\overset{\tau\rightarrow}{H}\right\|_F^2 + \lambda\|H\|_1 \quad \text{subject to } \|G_i(\tau)\|_1 = 1,\ \forall\,i \in \{1,\dots,k\},\ \forall\,\tau \in \{0,\dots,t-1\}. \qquad (8.1)$$
$S \in \mathbb{R}^{m \times p}$ holds $p$ vertices of the convex hull of $V$. $G \in \mathbb{R}_+^{p \times k \times t}$ forms convex combinations of the columns of $S$ to represent the time series in $V$. Because I want convex combinations of the columns of $S$, I require $G$ to have non-negative entries and for each column to sum to 1. In Equation 8.1, $G(\tau) \in \mathbb{R}_+^{p \times k}$ indexes the third dimension of $G$. $W(\tau) := SG(\tau) \in \mathbb{R}^{m \times k}$ is a time-varying basis for $V$. By combining $W(\tau)$ over all $\tau$ into a three-dimensional tensor $W \in \mathbb{R}^{m \times k \times t}$, one has a time-varying basis for $V$ that captures $k$ temporal patterns of duration $t$. $H \in \mathbb{R}_+^{k \times n}$ represents the encodings of $V$ in terms of the basis $W$. The $\lambda$ parameter trades off reconstruction error for sparsity in the encoding matrix. Sparsity in the encoding matrix forces each data point in $V$ to be represented by a few basis vectors. This usually leads to more interpretable basis vectors.
To find the vertices of the convex hull, I follow the approach described in [21]. First, compute the $d$ eigenvectors $e_p$ corresponding to the $d$ largest eigenvalues of the covariance matrix of $V$, where $d$ is chosen such that the eigenvectors account for 95% of the data variance. Then, project $V$ onto 2D subspaces:

$$\tilde{V}_{p,q} = [e_p\ e_q]^\top V \in \mathbb{R}^{2 \times n}, \quad \forall\,p,q \in \{1,\dots,d\},\ p \neq q. \qquad (8.2)$$
Next, find the vertices of the convex hull of $\tilde{V}_{p,q}$ using a convex hull-finding method (e.g., [106, 107]) and store the frame indices of the vertices in $\mathrm{ch}\!\left(\tilde{V}_{p,q}\right)$. Finally, form $S$ by concatenating all the points in $V$ marked as a convex hull vertex:

$$S = \left[V_{\mathrm{ch}(\tilde{V}_{1,2})}\ V_{\mathrm{ch}(\tilde{V}_{1,3})}\ \cdots\ V_{\mathrm{ch}(\tilde{V}_{d-1,d})}\right], \qquad (8.3)$$

where $V_{\mathrm{ch}(\tilde{V}_{p,q})}$ are the columns of $V$ corresponding to the indices in $\mathrm{ch}\!\left(\tilde{V}_{p,q}\right)$. There may be duplicate columns in $S$, so the repeated columns should be removed.
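A sketch of this anchor search using scipy.spatial.ConvexHull on each 2-D projection follows; the 95% variance rule is from the text, while the helper name and the set-based de-duplication are my own.

import numpy as np
from scipy.spatial import ConvexHull

def find_anchors(V, var_frac=0.95):
    # V: m x n multivariate time series. Returns S, the convex-hull columns.
    cov = np.cov(V)
    evals, evecs = np.linalg.eigh(cov)                 # ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    d = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_frac)) + 1
    d = max(d, 2)
    idx = set()
    for p in range(d):
        for q in range(p + 1, d):
            proj = evecs[:, [p, q]].T @ V              # 2 x n (Equation 8.2)
            idx.update(ConvexHull(proj.T).vertices.tolist())
    return V[:, sorted(idx)]                           # S (Equation 8.3)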
To find $G$ and $H$ that minimize the cost function (Equation 8.1), I iteratively alternate updating $G$ and $H$ until the cost function converges or a given number of iterations has been reached. Let $F = \sum_{\tau=0}^{t-1} G(\tau)\,\overset{\tau\rightarrow}{H}$ and $I_n$ be the $n \times n$ identity matrix. The update for $G$ is

$$G(\tau) \leftarrow G(\tau) \odot \frac{\left([S^\top V]^+ + [S^\top S]^-\,F\right)\left(\overset{\tau\rightarrow}{H}\right)^{\!\top}}{\left([S^\top V]^- + [S^\top S]^+\,F\right)\left(\overset{\tau\rightarrow}{H}\right)^{\!\top}}, \quad \forall\,\tau \in \{0,\dots,t-1\}. \qquad (8.4)$$

The update for $H$ is

$$H \leftarrow H \odot \frac{\sum_{\tau=0}^{t-1} G(\tau)^\top\left([S^\top V]^+\,\overset{\leftarrow\tau}{I_n} + [S^\top S]^-\,F\right)}{\sum_{\tau=0}^{t-1} G(\tau)^\top\left([S^\top V]^-\,\overset{\leftarrow\tau}{I_n} + [S^\top S]^+\,F\right) + \lambda}, \qquad (8.5)$$

where $[A]^+$ and $[A]^-$ denote the non-negative matrices in the split $A = [A]^+ - [A]^-$.
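A numpy sketch of these updates is below, reusing shift_right and shift_left from the CNMF sketch in Chapter 7. The functions pos() and neg() implement the split $A = [A]^+ - [A]^-$; renormalizing each column of $G(\tau)$ onto the simplex, and applying the left shift to the whole bracketed term in the $H$ update, are my own readings of Equations 8.1 and 8.5.

import numpy as np

def pos(A):
    return np.maximum(A, 0.0)

def neg(A):
    return np.maximum(-A, 0.0)

def charm(V, S, k, t, lam=1.0, n_iter=200, eps=1e-12):
    p, n = S.shape[1], V.shape[1]
    rng = np.random.default_rng(0)
    G = rng.random((p, k, t)) + eps
    H = rng.random((k, n)) + eps
    StV, StS = S.T @ V, S.T @ S
    for _ in range(n_iter):
        F = sum(G[:, :, tau] @ shift_right(H, tau) for tau in range(t))
        for tau in range(t):                           # Equation 8.4
            Hs = shift_right(H, tau)
            num = (pos(StV) + neg(StS) @ F) @ Hs.T
            den = (neg(StV) + pos(StS) @ F) @ Hs.T + eps
            G[:, :, tau] *= num / den
            G[:, :, tau] /= G[:, :, tau].sum(axis=0, keepdims=True) + eps
        F = sum(G[:, :, tau] @ shift_right(H, tau) for tau in range(t))
        num = sum(G[:, :, tau].T @ shift_left(pos(StV) + neg(StS) @ F, tau)
                  for tau in range(t))
        den = sum(G[:, :, tau].T @ shift_left(neg(StV) + pos(StS) @ F, tau)
                  for tau in range(t)) + lam
        H *= num / (den + eps)                         # Equation 8.5
    return G, H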
8.2 Experiments
I evaluated CHARM on two datasets. The first dataset was created synthetically to aid evaluation of the basis
vectors and verify that the algorithm finds meaningful temporal patterns. For the second dataset, I used
vocal tract constrictions derived from real-time MRI images of a subject speaking TIMIT sentences. I chose
this dataset to assess the performance of CHARM on realistic time-series data and to uncover articulatory
gestures in a data-driven manner. For both datasets, I compared the performance of CHARM to CNMF-
SC.
8.2.1 Synthetic data
To create synthetic time-series data, I created three Markov chains, each with four states. Each state gener-
ates a sample from a two-dimensional Gaussian distribution with a given mean vector and covariance matrix.
The means were chosen such that each Markov chain produces distinct 4-sample sequences. Within each
chain, the states transitioned from left to right with probability 1 to ensure that the chain outputs exactly
four samples. After transitioning out of the last state, another chain is chosen, with each chain having equal
Figure 8.1: Input synthetic data (a) and the recovered temporal patterns from CHARM (b) and CNMF-SC (c). The circles in (a) indicate the states of the Markov chains. The arrows represent the temporal progression within the Markov chains and the recovered temporal patterns.
probability of being selected. I used this procedure to generate a 1000-sample time series. Figure 8.1a shows
an example output plotted in two-dimensional space, with circles indicating the states of the Markov chains
and arrows indicating transitions between the states within each chain. Note that the output of one chain is
separated spatially from the other two, while the other two chains share the same second state.
Since I know that the data has three distinct patterns with a length of four samples, I set $k = 3$ and $t = 4$. Additionally, I experimentally determined $\lambda = 1$ to be a good choice. In this data, I know that only one of the three chains is active at a particular time, which means that the encoding matrix $H$ should be about 67% sparse. Thus, I set the sparsity parameter in CNMF-SC to 0.67. I used 100 update iterations for each algorithm, and I ran the experiment 100 times to account for effects of random initialization.
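A sketch of this generator is given below; the specific state means and covariance are my own illustrative assumptions, chosen so that one chain is spatially separated and the other two share their second state, as described above.

import numpy as np

def make_synthetic(n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Four state means per chain; chains 2 and 3 share their second state.
    means = np.array([
        [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]],   # chain 1
        [[0.0, 5.0], [1.0, 6.0], [2.0, 5.0], [3.0, 6.0]],   # chain 2
        [[0.0, 8.0], [1.0, 6.0], [2.0, 8.0], [3.0, 9.0]],   # chain 3
    ])
    cov = 0.01 * np.eye(2)
    samples = []
    while len(samples) < n_samples:
        c = rng.integers(3)                  # chains chosen with equal probability
        for s in range(4):                   # left-to-right with probability 1
            samples.append(rng.multivariate_normal(means[c, s], cov))
    return np.array(samples[:n_samples]).T   # 2 x n data matrix V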
Figures 8.1b and 8.1c show the basis for CHARM and CNMF-SC respectively. From these figures, one can
see that CHARM recovers the three temporal patterns more clearly than CNMF-SC. Specifically, one can
see that CHARM accurately chooses clusters that reside near the convex hull of the data. Meanwhile, inner
clusters are represented less accurately; the algorithm tends to shift the basis vectors for the inner clusters
closer to the convex hull. On the other hand, the CNMF-SC basis is less interpretable because the temporal
patterns don’t correspond closely to the observed data points. CNMF-SC scales the rows of the encoding
matrix $H$ to have unit $\ell_2$ norm, so the basis patterns are scaled accordingly to minimize the CNMF-SC cost
function. These results agree with those found by Thurau et al. [21], where the CH-NMF basis lies near the
convex hull of the data while the NMF basis does not correspond well to the data points. While it is possible
to scale the CNMF-SC basis vectors to lie closer to the data points, this procedure may not be feasible on
larger and real-world datasets where it is not known what the ground-truth temporal patterns are. Thus, I
find that CHARM captures temporal structure in the data more reliably than CNMF-SC.
Figure 8.2: (a) An example real-time MRI frame and (b) air-tissue boundary contours (thin black lines) found from that frame. The thick black lines mark the places of articulation where constrictions are measured (from lips to glottis: bilabial; alveolar; palatal; velar; velopharyngeal port; pharyngeal). For a given frame, constrictions are measured as the Euclidean distances between opposing gray dots.
8.2.2 Vocal tract data
To supplement the synthetic data experiment, I tested CHARM on measurements of constriction degrees
(Euclidean distances) between active and passive articulators during a speech task. Articulatory Phonology
[108] theorizes that the movements of the vocal tract can be decomposed into units of vocal tract actions
called gestures. Gestures implement constrictions of variable degrees along the vocal tract, and the timings
of the gestures are arranged into a gestural score. The goal in this experiment is to derive such gestures
along with a gestural score in a data-driven manner.
I used mid-sagittal real-time MRI data of the vocal tract from the F1 speaker of the USC-TIMIT database
[109]. The frame rate of the MRI data in this corpus is 23.18 frames per second. I used an automatic
segmentation algorithm [110] to find the contours of the air-tissue boundaries of the vocal tract in each frame.
Figure 8.2 shows an example MRI frame and the contours found from this frame. Based on the contours,
the constriction degrees were measured at five places of articulation (bilabial, alveolar, palatal, velar, and
pharyngeal), plus the velopharyngeal port opening. The locations of the constriction measurements are
indicated in Figure 8.2b with thick black lines. I used 250 TIMIT sentences in this experiment, where the first 150 sentences were assigned to the training set, which I used to learn a basis of gestures, and the remaining
100 sentences were assigned to the testing set to evaluate how well the gesture basis generalizes to unseen
data.
The parameters $k$ and $t$ are chosen based on the data and application. I chose $t = 3$ to capture gestures with a duration of 130 ms ($3 \times \frac{1\ \text{second}}{23.18\ \text{frames}} \approx 130$ ms), which is roughly the average duration of a phoneme.
Figure 8.3: Visualization of the CHARM gesture basis (a–d: 1st–4th gestures). The vocal tract at time step 1 is shown in light grey, time step 2 in dark grey, and time step 3 in black.
Since I measured constrictions at 6 locations, I chose $k = 6$. The choice of $k$ is highly data-dependent and can be made in a more principled manner for a specific application. I used the same $k$ and $t$ values for CNMF-SC, and I set CNMF-SC's sparsity parameter to 0.7, as suggested in [19]. I ran both algorithms with
200 update iterations.
After running the algorithms, the bases contain constriction degrees at the six locations in the vocal tract. In
order to visualize the bases, I used a forward map [111] to convert the constriction degrees to articulatory
weights [112] that parametrize vocal tract shape. Figure 8.3 shows vocal tract movements (gestures) due to
the constrictions found in the CHARM basis. Figure 8.4 shows the same for the CNMF-SC basis. In the
interest of space, I only show four gestures from each algorithm. The CHARM basis shows interpretable
articulatory gestures; for example, Figure 8.3a shows the tongue body rising, while Figure 8.3c shows
the tongue forming a dental/alveolar constriction while the velum simultaneously closes. In general, the
gestures found by CHARM are more overt and display a wider range of vocal tract movement than those
found by CNMF-SC. This agrees with the results of the synthetic data experiment, where CHARM tends to
find temporal patterns at the extremities of the data, while the temporal patterns from CNMF-SC generally
don’t correspond to the data.
To evaluate how well the learned gestures generalize to unseen data, I fix the basis for each algorithm and
find the encoding matrix $H_{\text{test}}$ for each sentence in the test set using Equation 8.5. I then reconstruct the constrictions and find the root mean square error (RMSE) and correlation between the input and reconstructed constrictions. Table 8.1 shows the average RMSE and correlations for both algorithms on the test set. I used a one-sided Wilcoxon rank-sum test to find that the RMSE was significantly lower ($p \approx 0$) and
Figure 8.4: Visualization of the CNMF-SC gesture basis (a–d: 1st–4th gestures). The vocal tract at time step 1 is shown in light grey, time step 2 in dark grey, and time step 3 in black.
the correlation was significantly higher ($p \approx 0$) for CHARM than for CNMF-SC. This indicates that the gestures found by CHARM generalize better than the CNMF-SC basis.
Additionally, I used an experimental procedure described in [19] to evaluate the extent to which the bases
captured temporal structure in the data. If one supposes that the bases contain random temporal patterns,
then one won’t expect a significant change in the RMSE and correlation between the input and reconstructed
constrictions when one substitutesH
test
with a random matrixH
rand
with the same sparsity asH
test
. To
ensureH
rand
has the same sparsity asH
test
, I used the method proposed by Hoyer [13] to set the`
1
and
`
2
norms of each row ofH
rand
to the `
1
and `
2
norms of the corresponding rows ofH
test
. The results
of reconstructing with a random matrix are shown in Table 8.1. Using a one-sided Wilcoxon rank-sum
test, I found that the RMSE was significantly lower when reconstructing withH
test
than withH
rand
for both
CHARM (p 0) and CNMF-SC (p 0). Also, the correlation was significantly higher when reconstructing
withH
test
than withH
rand
for both CHARM (p 0) and CNMF-SC (p 0). These results suggest that
both algorithms learn meaningful temporal structure from the training set data.
Table 8.1: Root mean square errors (RMSE) and correlations when reconstructing vocal tract constrictions using a calculated encoding matrix $H_{\text{test}}$ and a random encoding matrix $H_{\text{rand}}$.

Algorithm   Encoding matrix       RMSE (mm)   Correlation
CHARM       $H_{\text{test}}$     0.824       0.964
CHARM       $H_{\text{rand}}$     3.419       0.002
CNMF-SC     $H_{\text{test}}$     6.058       0.619
CNMF-SC     $H_{\text{rand}}$     8.127       0.168
8.3 Future Work
One shortcoming of CHARM is the inability to identify when the same temporal pattern is observed at different temporal scales. Currently, CHARM looks for patterns with duration $t$, so the same pattern occurring with duration $\alpha t$ might be considered a different pattern. Particularly if $\alpha < 1$ (the pattern is stretched in duration), it is possible that CHARM will identify multiple sections of the pattern separately. To overcome this issue, I will incorporate $\alpha$ in the cost function to explicitly account for temporal scaling. This procedure will most likely involve learning by how much to resample the encoding matrix at each time point to model the observed data; that is, it requires learning an $\alpha$ for each time point.
CHAPTER 9
CONCLUSION
In this work, I considered several matrix factorization techniques to address the issue of noise in data and
its effects on different end goals. To restate, the central question my thesis addressed is "How does one robustly recover latent structure from noisy data to facilitate different end goals?"
For a speech denoising task, I proposed an algorithm that uses CMF-WISA to model spectro-temporal
properties of the speech and noise in the noisy signal. Using CMF-WISA instead of NMF allowed me to
model the magnitude and phase of the speech and noise. I incorporated spectral and temporal regularization
terms in the CMF-WISA cost function to improve the modeling of the noise. Parameter analysis of the
weights of the regularization terms gave me optimum ranges for the weights to balance the trade-off between
noise suppression and speech distortion and also showed that having the regularization terms improved
denoising performance over not having the regularization terms. Objective measures show that my proposed
algorithm achieves lower distortion and higher STOI scores than other recently proposed denoising methods.
A listening test shows that my algorithm yields higher quality and more intelligible speech than some other
denoising methods in some pulse sequence noises, especially the aperiodic static 3D pulse sequence.
I then presented an extension of the Beta divergence to measure the distance between complex-valued scalars. The divergence uses a real-valued parameter $\beta$, which has the squared Euclidean distance ($\beta = 2$) and IS divergence ($\beta = 0$) as special cases. I showed the standard Beta divergence is also a special case of the Complex Beta divergence when the scalars being measured are in phase. I demonstrated that different settings of $\beta$ change the amount of weight the errors in magnitude and phase have on the divergence, which allows for greater flexibility in penalizing errors or outliers. I then proposed the Beta CMF algorithm that uses the Complex Beta divergence as its cost function. I derived multiplicative update rules that optimize the cost function with guaranteed convergence to a stationary point. Experiments on noisy speech demonstrated trade-offs in various metrics, such as speech quality and formant frequency measurement errors, by varying the $\beta$ parameter. Thus, one can choose appropriate values of $\beta$ to optimize performance for a certain
metric or application.
Next, I investigated using matrix factorization for robustly recovering latent structure in noisy speech data.
I proposed a joint filtering and factorization algorithm for robustly recovering spectral components from
noisy speech. Cosine similarity measures show that the basis matrices recovered by my proposed algorithm
represent spectral structural information better than using NMF factorization alone or performing filtering
prior to NMF factorization. Furthermore, I found that the quality of denoised signals reconstructed from
my algorithm is comparable to the quality when using NMF or Wiener filtering for denoising, making my
algorithm a viable alternative for signal denoising.
Continuing the investigation into recovering latent structure, I proposed an algorithm to generate noise-
robust encoding matrices using CNMF, and I used these as features for the acoustic model of an ASR. The
algorithm centered upon forcing the variability due to noise out of the encoding matrices and into the basis
matrices. ASR results on the Aurora 4 dataset indicate an 11.8% relative improvement in WER over log-mel features. Furthermore, combining the encoding matrices with log-mel features gives a 13.8% relative improvement in WER over log-mel features.
Finally, I introduced the Convolutive Convex Hull fActoRization of a Matrix (CHARM) algorithm to find temporal patterns in multivariate time-series data. It factors a data matrix into a basis tensor that contains
temporal patterns and an encoding matrix that indicates the times at which the temporal patterns occur in the
data. Using synthetic data, I showed that CHARM extracts better, more interpretable temporal patterns than
Convolutive Non-negative Matrix Factorization with sparsity constraints (CNMF-SC). With vocal tract con-
striction data, I was able to find a wider range of articulatory gestures using CHARM than using CNMF-SC.
I also demonstrated that the gestures contained in the CHARM basis generalized better to unseen data and
extracted better vocal tract dynamics than the CNMF-SC basis by reconstructing the data with significantly
lower RMSE and significantly higher correlation.
Appendices
APPENDIX A
DERIVATION OF DENOISING ALGORITHM UPDATE EQUATIONS
When learning the speech basis and updating the noise basis from the noisy speech, we used the following
cost function:
$$(A.1)\qquad C(\Theta) = J_{\text{error}}\!\left(\hat{V}\right) + \lambda_s\, J_{\text{spars}}(H_s) + \lambda_t\, J_{\text{temp}}(H_n) + J_{\text{spec}}(W_n),$$

where

$$(A.2)\qquad J_{\text{error}}\!\left(\hat{V}\right) = \left\|V - \hat{V}\right\|_F^2,$$

$$(A.3)\qquad J_{\text{spars}}(H_s) = \sum_{j=1}^{t_n}\left\|[H_s]_j\right\|_1,$$

$$(A.4)\qquad J_{\text{temp}}(H_n) = D_{\text{KL}}\left(\ln(H_d)\,\middle\|\,\ln(H_n)\right),$$

and

$$(A.5)\qquad J_{\text{spec}}(W_n) = \left\|\Lambda\,(W_d - W_n)\right\|_F^2.$$

$\Theta = (W_s, W_n, H_s, H_n, P_s, P_n)$ is the set of parameters we seek when optimizing the cost function, and $\hat{V} = (W_sH_s) \odot P_s + (W_nH_n) \odot P_n$.
In this work, we assume that $\ln(H_d) \sim \mathcal{N}(\mu, \Sigma)$ and $\ln(H_n) \sim \mathcal{N}(m, S)$, with diagonal covariance matrices $\Sigma$ and $S$. In this case,

$$(A.6)\qquad J_{\text{temp}}(H_n) \approx \frac{1}{2}\left(\mathrm{tr}\!\left(\hat{S}^{-1}\hat{\Sigma}\right) + (\hat{m} - \hat{\mu})^\top\hat{S}^{-1}(\hat{m} - \hat{\mu}) - k_d + \ln\!\left(\frac{\det\hat{S}}{\det\hat{\Sigma}}\right)\right).$$

We estimate $\mu$ with the sample mean $\hat{\mu} = \frac{1}{t_d}\sum_{t=1}^{t_d}\ln\left([H_d]_t\right)$ and $\Sigma$ with the sample covariance $\hat{\Sigma} = \frac{1}{t_d - 1}\sum_{t=1}^{t_d}\left(\ln\left([H_d]_t\right) - \hat{\mu}\right)\left(\ln\left([H_d]_t\right) - \hat{\mu}\right)^\top$, keeping only the diagonal elements in $\hat{\Sigma}$. Similarly, we estimate $m$ with the sample mean $\hat{m} = \frac{1}{t_n}\sum_{t=1}^{t_n}\ln\left([H_n]_t\right)$ and $S$ with the sample covariance $\hat{S} = \frac{1}{t_n - 1}\sum_{t=1}^{t_n}\left(\ln\left([H_n]_t\right) - \hat{m}\right)\left(\ln\left([H_n]_t\right) - \hat{m}\right)^\top$, keeping only the diagonal elements in $\hat{S}$.
When minimizing the primary cost function is difficult, an auxiliary function is introduced.
Definition A.1. $C^+\!\left(\Theta, \bar{\Theta}\right)$ is an auxiliary function for $C(\Theta)$ if $C^+\!\left(\Theta, \bar{\Theta}\right) \geq C(\Theta)$ and $C^+(\Theta, \Theta) = C(\Theta)$.

It has been shown in [22] that $C(\Theta)$ monotonically decreases under the updates $\Theta \leftarrow \underset{\Theta}{\arg\min}\ C^+\!\left(\Theta, \bar{\Theta}\right)$ and $\bar{\Theta} \leftarrow \underset{\bar{\Theta}}{\arg\min}\ C^+\!\left(\Theta, \bar{\Theta}\right)$.
We form the auxiliary function as
$$(A.7)\qquad C^+\!\left(\Theta, \bar{\Theta}\right) = J^+_{\text{error}}\!\left(\hat{V}, \bar{V}\right) + \lambda_s\, J^+_{\text{spars}}\!\left(H_s, \bar{H}_s\right) + \lambda_t\, J^+_{\text{temp}}\!\left(H_n, \bar{H}_n\right) + J_{\text{spec}}(W_n),$$

where
$$(A.8)\qquad J^+_{\text{error}}\!\left(\hat{V}, \bar{V}\right) = \sum_{f=1}^{m}\sum_{t=1}^{n}\frac{\left|\bar{V}_{s,ft} - \left[\hat{V}_s\right]_{ft}\right|^2}{[B_s]_{ft}} + \sum_{f=1}^{m}\sum_{t=1}^{n}\frac{\left|\bar{V}_{n,ft} - \left[\hat{V}_n\right]_{ft}\right|^2}{[B_n]_{ft}},$$

$$(A.9)\qquad J^+_{\text{spars}}\!\left(H_s, \bar{H}_s\right) = \sum_{k=1}^{k_s}\sum_{t=1}^{n}\frac{\gamma}{2}\left[\bar{H}_s\right]_{kt}^{\gamma - 2}[H_s]_{kt}^2 + \left(1 - \frac{\gamma}{2}\right)\left[\bar{H}_s\right]_{kt}^{\gamma},$$

and
$$(A.10)\qquad J^+_{\text{temp}}\!\left(H_n, \bar{H}_n\right) = \frac{1}{2}\left(\mathrm{tr}\!\left(\hat{S}^{-1}\hat{\Sigma}\right) + (\hat{m} - \hat{\mu})^\top\hat{S}^{-1}(\hat{m} - \hat{\mu}) - k_d + \mathrm{tr}\!\left(\hat{\bar{S}}^{-1}\hat{S}\right) + \ln\!\left(\frac{\det\hat{\bar{S}}}{\det\hat{\Sigma}}\right) - k_d\right),$$

where $\hat{\bar{m}} = \frac{1}{t_n}\sum_{t=1}^{t_n}\ln\left(\left[\bar{H}_n\right]_t\right)$ and $\hat{\bar{S}} = \frac{1}{t_n - 1}\sum_{t=1}^{t_n}\left(\ln\left(\left[\bar{H}_n\right]_t\right) - \hat{\bar{m}}\right)\left(\ln\left(\left[\bar{H}_n\right]_t\right) - \hat{\bar{m}}\right)^\top$.
$\bar{\Theta} = \left(\bar{V}, \bar{H}_s, \bar{H}_n\right)$ are the auxiliary variables. $0 < \gamma < 2$ is a parameter for $\sum_{k=1}^{k_d}\sum_{t=1}^{t_n}\left|[H_s]_{kt}\right|^\gamma$ to promote sparsity in $H_s$. In our work, we measure the $\ell_1$ norm of $H_s$, so $\gamma = 1$. Proofs that $J^+_{\text{error}}$ and $J^+_{\text{spars}}$ are auxiliary functions for $J_{\text{error}}$ and $J_{\text{spars}}$, respectively, can be found in Appendix A of [23], so we will focus on proving that $J^+_{\text{temp}}$ is an auxiliary function of $J_{\text{temp}}$.
Since we assume that each row of $H_d$ and $H_n$ is independent, we will consider each row separately. In this case, Equation A.6 simplifies to

$$(A.11)\qquad J_{\text{temp}}(h_n) = \frac{1}{2}\left(\frac{\hat{\sigma}^2 + (\hat{m} - \hat{\mu})^2}{\hat{s}^2} - 1 + \ln\frac{\hat{s}^2}{\hat{\sigma}^2}\right)$$
and Equation A.10 simplifies to

$$(A.12)\qquad J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right) = \frac{1}{2}\left(\frac{\hat{\sigma}^2 + (\hat{m} - \hat{\mu})^2}{\hat{s}^2} - 1 + \frac{\hat{s}^2}{\hat{\bar{s}}^2} + \ln\hat{\bar{s}}^2 - 1 - \ln\hat{\sigma}^2\right).$$
Theorem A.2. $J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right)$ is an auxiliary function for $J_{\text{temp}}(h_n)$.
Proof. If $\bar{h}_n = h_n$, then $\hat{\bar{m}} = \hat{m}$ and $\hat{\bar{s}}^2 = \hat{s}^2$. In this case, $J^+_{\text{temp}}(h_n, h_n) = J_{\text{temp}}(h_n)$.

$$(A.13)\qquad J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right) - J_{\text{temp}}(h_n) = \frac{1}{2}\left(\frac{\hat{s}^2}{\hat{\bar{s}}^2} + \ln\hat{\bar{s}}^2 - 1 - \ln\hat{s}^2\right) = \frac{1}{2}\left(\ln\hat{\bar{s}}^2 + \frac{\hat{s}^2}{\hat{\bar{s}}^2} - 1 - \ln\hat{s}^2\right) \geq \frac{1}{2}\left(\ln\hat{\bar{s}}^2 + \ln\frac{\hat{s}^2}{\hat{\bar{s}}^2} - \ln\hat{s}^2\right) = 0, \quad \because\ \ln(x) \leq x - 1\ \forall\,x > 0.$$

Hence $J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right) \geq J_{\text{temp}}(h_n)$ and $J^+_{\text{temp}}(h_n, h_n) = J_{\text{temp}}(h_n)$.
$\Rightarrow J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right)$ is an auxiliary function for $J_{\text{temp}}(h_n)$.
The optimum value of the auxiliary variable $\bar{h}_n$ can be found by setting the gradient of $J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right)$ with respect to $\bar{h}_n$ equal to zero:

$$(A.14)\qquad \nabla_{\bar{h}_n} J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right) = \frac{\ln\!\left(\bar{h}_n\right) - \hat{m}\mathbf{1}_{t_n}}{(t_n - 1)\,\hat{\bar{s}}^2\,\bar{h}_n} - \hat{s}^2\,\frac{\ln\!\left(\bar{h}_n\right) - \hat{m}\mathbf{1}_{t_n}}{(t_n - 1)\left(\hat{\bar{s}}^2\right)^2\bar{h}_n} = 0$$
$$\implies \ln\!\left(\bar{h}_n\right) - \hat{m}\mathbf{1}_{t_n} = \frac{\hat{s}^2}{\hat{\bar{s}}^2}\left(\ln\!\left(\bar{h}_n\right) - \hat{m}\mathbf{1}_{t_n}\right)$$
$$\implies \left(1 - \frac{\hat{s}^2}{\hat{\bar{s}}^2}\right)\ln\!\left(\bar{h}_n\right) = \left(1 - \frac{\hat{s}^2}{\hat{\bar{s}}^2}\right)\hat{m}\mathbf{1}_{t_n}$$
$$\implies \ln\!\left(\bar{h}_n\right) = \hat{m}\mathbf{1}_{t_n}.$$

$J^+_{\text{temp}}\!\left(h_n, \bar{h}_n\right)$ can be rewritten for all rows of $H_d$ and $H_n$ as Equation A.10, and the auxiliary variable $\bar{H}_n$ can be updated as $\ln\!\left(\bar{H}_n\right) = \mathrm{diag}(\hat{m})\,\mathbf{1}_{k_n \times t_n}$.
We did not create an auxiliary function for $J_{\text{spec}}(W_n)$ because it is already quadratic in $W_n$, so minimizing $J_{\text{spec}}$ with respect to $W_n$ is not difficult. Indeed, $\nabla_{W_n} J_{\text{spec}}(W_n) = \Lambda^\top\Lambda\,(W_n - W_d)$.
A.1 Basis update equations
To find the update for $W_s$, we need to find $\nabla_{W_s} C^+\!\left(\Theta, \bar{\Theta}\right)$. Since the regularization terms we added do not contain $W_s$, they do not affect the gradient. Hence, we use the update equation derived in [24], which results in Equation 3.14.
To find the update for $W_n$, we calculate $\nabla_{W_n} C^+\!\left(\Theta, \bar{\Theta}\right) = \nabla_{W_n}\left(J^+_{\text{error}}\!\left(\hat{V}, \bar{V}\right) + J_{\text{spec}}(W_n)\right)$. The term $\nabla_{W_n} J^+_{\text{error}}\!\left(\hat{V}, \bar{V}\right) = \left(\frac{W_nH_n - \bar{V}_n}{B_n}\right)H_n^\top$ is derived in [23], and $\nabla_{W_n} J_{\text{spec}}(W_n) = \Lambda^\top\Lambda\,(W_n - W_d)$. So,

$$(A.15)\qquad \nabla_{W_n} C^+\!\left(\Theta, \bar{\Theta}\right) = \left(\frac{W_nH_n - \bar{V}_n}{B_n}\right)H_n^\top + \Lambda^\top\Lambda\,(W_n - W_d).$$

The update equation for $W_n$ is

$$(A.16)\qquad W_n \leftarrow W_n \odot \frac{\left[\nabla_{W_n} C^+\!\left(\Theta, \bar{\Theta}\right)\right]^-}{\left[\nabla_{W_n} C^+\!\left(\Theta, \bar{\Theta}\right)\right]^+},$$

which leads to the update equation given in Equation 3.17.
A.2 Time-activation update equations
To find the update for $H_s$, we need to find $\nabla_{H_s} C^+\!\left(\Theta, \bar{\Theta}\right)$. As in the case with $W_s$, the added regularization terms do not contain $H_s$, so they do not affect the gradient. Hence, we use the update equation derived in [23], which results in Equation 3.15.
To find the update for $H_n$, we calculate $\nabla_{H_n} C^+\!\left(\Theta, \bar{\Theta}\right) = \nabla_{H_n}\left(J^+_{\text{error}}\!\left(\hat{V}, \bar{V}\right) + \lambda_t\, J^+_{\text{temp}}\!\left(H_n, \bar{H}_n\right)\right)$. The term $\nabla_{H_n} J^+_{\text{error}}\!\left(\hat{V}, \bar{V}\right) = W_n^\top\left(\frac{W_nH_n - \bar{V}_n}{B_n}\right)$ is derived in [23]. Define $\hat{U} = \mathrm{diag}(\hat{\mu})$ and $\hat{M} = \mathrm{diag}(\hat{m})$. Then

$$(A.17)\qquad \nabla_{H_n} J^+_{\text{temp}}\!\left(H_n, \bar{H}_n\right) = \frac{1}{H_n} \odot \left(\frac{1}{t_n}\hat{S}^{-1}\!\left(\hat{M} - \hat{U}\right)\mathbf{1}_{k_n \times t_n} - \frac{1}{t_n - 1}\left(\hat{S}^{-2}\hat{\Sigma} + \left(\hat{M} - \hat{U}\right)^{\!\top}\hat{S}^{-2}\left(\hat{M} - \hat{U}\right)\right)\left(\ln(H_n) - \hat{M}\mathbf{1}_{k_n \times t_n}\right) + \frac{1}{t_n - 1}\hat{\bar{S}}^{-1}\left(\ln(H_n) - \hat{M}\mathbf{1}_{k_n \times t_n}\right)\right).$$

The update equation for $H_n$ is

$$(A.18)\qquad H_n \leftarrow H_n \odot \frac{\left[\nabla_{H_n} C^+\!\left(\Theta, \bar{\Theta}\right)\right]^-}{\left[\nabla_{H_n} C^+\!\left(\Theta, \bar{\Theta}\right)\right]^+}.$$

Note that $\hat{U}$, $\hat{M}$, and $\ln(H_n)$ are mixed-sign matrices. A mixed-sign matrix $A$ can be rewritten in terms of non-negative matrices as $A = [A]^+ - [A]^-$. Rewriting the mixed-sign matrices leads to the update equation for $H_n$ given by Equation 3.18.
APPENDIX B
COMPLEX BETA DIVERGENCE PROOF
Theorem B.1. $d_\beta(z\|w)$ satisfies the requirements for a divergence: $\forall\, z, w \in \mathbb{C}$, $d_\beta(z\|w) \geq 0$ and $d_\beta(z\|w) = 0 \iff z = w$.

Proof. From Young's Inequality for Complex Values (Theorem 4.7),

$$\mathrm{Re}\!\left(pq\,\exp\!\left(-j\!\left(\frac{a}{b}\theta_p + \frac{b}{a}\theta_q\right)\right)\right) \leq \frac{|p|^a}{a} + \frac{|q|^b}{b},$$

with equality iff $p^a = q^b$. Set $a$, $b$, $p$, $q$ for different ranges of $\beta$ as given in Table B.1.

Let $d_\beta(z\|w) = \dfrac{|p|^a}{a} + \dfrac{|q|^b}{b} - \mathrm{Re}\!\left(pq\,\exp\!\left(-j\!\left(\frac{a}{b}\theta_p + \frac{b}{a}\theta_q\right)\right)\right) \geq 0$.

From Corollary 4.8, we know that $d_\beta(z\|w) = 0 \iff p^a = q^b$. What remains to be shown is that $z = w \iff p^a = q^b$, which would establish $d_\beta(z\|w) = 0 \iff z = w$.
Suppose $z = w$.

Case 1: $\beta > 1$. $(p_\beta(z,z))^a = |z|^a\, e^{j(a\theta_z + 2\pi\nu(a\theta_z))}$ and $(q_\beta(z,z))^b = |z|^a\, e^{j(a\theta_z + 2\pi\nu(a\theta_z))}$, where $\nu(\cdot)$ denotes the integer-valued wrapping function from Chapter 4 that brings its argument into the principal range $(-\pi, \pi]$.

Case 2: $0 < \beta < 1$. $(p_\beta(z,z))^a = |z|^{1/a}\, e^{j\theta_z}$ and $(q_\beta(z,z))^b = |z|^{1/a}\, e^{j\theta_z}$.

Case 3: $\beta < 0$. $(p_\beta(z,z))^a = |z|^{1-b}\, e^{j\theta_z}$ and $(q_\beta(z,z))^b = |z|^{1-b}\, e^{j\theta_z}$.

In all 3 cases, $\left|(p_\beta(z,z))^a\right| = \left|(q_\beta(z,z))^b\right|$ and $\mathrm{Arg}\!\left((p_\beta(z,z))^a\right) = \mathrm{Arg}\!\left((q_\beta(z,z))^b\right)$.
Thus $(p_\beta(z,z))^a = (q_\beta(z,z))^b$, by Property 4.4.
(z;w))
a
= (q
(z;w))
b
.
Thenj(p
(z;w))
a
j =
(q
(z;w))
b
and Arg((p
(z;w))
a
) = Arg
(q
(z;w))
b
by Property 4.4.
Case 1: > 1
84
We have
(p
(z;w))
a
=jzj
a
exp
j
a
1
a
z
+
1
b
w
+ 2
a
1
a
z
+
1
b
w
(q
(z;w))
b
=jwj
a
exp(j (a
w
+ 2(a
w
)))
For convenience, definek
1
=
z
+
a
b
w
andk
2
=(a
w
). Then,
jzj
a
=jwj
a
jzj =jwj
and
a
1
a
z
+
1
b
w
+ 2k
1
=a
w
+ 2k
2
z
+
a
b
w
+ 2k
1
=a
w
+ 2k
2
z
+ 2k
1
+ 2(
z
+ 2k
1
) =
a
a
b
w
+ 2k
2
+ 2
a
a
b
w
+ 2k
2
z
+ 2(
z
) =
w
+ 2(
w
)
z
=
w
;
where in the second-to-last line we used Property 4.1 and in the last line we used the fact that() = 0.
Case 2: $0 < \beta < 1$. We have

$$(p_\beta(z,w))^a = |z|\,|w|^{-\frac{1}{b}} \exp\!\left(j\!\left(\frac{1}{a}\theta_z + \frac{1}{b}\theta_w\right)\right)$$
$$(q_\beta(z,w))^b = |w|^{\frac{1}{a}}\, e^{j\theta_w}$$

Then,

$$|z|\,|w|^{-\frac{1}{b}} = |w|^{\frac{1}{a}} \implies |z| = |w|^{\frac{1}{a} + \frac{1}{b}} = |w|$$

and

$$\frac{1}{a}\theta_z + \frac{1}{b}\theta_w = \theta_w \implies \frac{1}{a}\theta_z = \left(1 - \frac{1}{b}\right)\theta_w = \frac{1}{a}\theta_w \implies \theta_z = \theta_w.$$

Table B.1: Functions for different values of $\beta$ used in the proof of the Complex Beta divergence.

$\beta$           $a(\beta)$                 $b(\beta)$                 $p_\beta(z,w)$                                                                                                            $q_\beta(z,w)$
$(1,\infty)$      $\beta$                    $\frac{\beta}{\beta-1}$    $|z|\exp\!\left(j\!\left(\frac{1}{a}\theta_z + \frac{1}{b}\theta_w\right)\right)$                                         $|w|^{\frac{a}{b}}\exp\!\left(j\frac{1}{b}\left(a\theta_w + 2\pi\nu(a\theta_w)\right)\right)$
$(0,1)$           $\frac{1}{\beta}$          $\frac{1}{1-\beta}$        $|z|^{\frac{1}{a}}\,|w|^{-\frac{1}{ab}}\exp\!\left(j\frac{1}{a}\!\left(\frac{1}{a}\theta_z + \frac{1}{b}\theta_w\right)\right)$   $|w|^{\frac{1}{ab}}\exp\!\left(j\frac{1}{b}\theta_w\right)$
$(-\infty,0)$     $\frac{\beta-1}{\beta}$    $1-\beta$                  $|z|^{\frac{1}{a}}\,|w|^{-\frac{b}{a}}\exp\!\left(j\frac{1}{a}\!\left(\frac{1}{a}\theta_w + \frac{1}{b}\theta_z\right)\right)$   $|z|^{-\frac{1}{a}}\exp\!\left(j\frac{1}{b}\theta_z\right)$
Case 3: $\beta < 0$. We have

$$(p_\beta(z,w))^a = |z|\,|w|^{-b} \exp\!\left(j\!\left(\frac{1}{a}\theta_w + \frac{1}{b}\theta_z\right)\right)$$
$$(q_\beta(z,w))^b = |z|^{-\frac{b}{a}}\, e^{j\theta_z}$$

Then

$$|z|\,|w|^{-b} = |z|^{-\frac{b}{a}} \implies |w|^{-b} = |z|^{-\left(\frac{b}{a} + 1\right)} = |z|^{-b} \implies |w| = |z|$$

and

$$\frac{1}{a}\theta_w + \frac{1}{b}\theta_z = \theta_z \implies \frac{1}{a}\theta_w = \left(1 - \frac{1}{b}\right)\theta_z = \frac{1}{a}\theta_z \implies \theta_w = \theta_z.$$

In all 3 cases, $|z| = |w|$ and $\theta_z = \theta_w$. Thus, $z = w$, by Property 4.4.
REFERENCES
[1] C. Vaz, D. Dimitriadis, S. Thomas, and S. Narayanan, “CNMF-based acoustic features for noise-robust ASR,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process., 2016, pp. 5735–5739.
[2] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,”
Computer, vol. 42, no. 8, pp. 30–37, Aug. 2009.
[3] Y. Xu, W. Yin, Z. Wen, and Y. Zhang, “An alternating direction algorithm for matrix completion with
nonnegative factors,” Frontiers of Mathematics in China, vol. 7, no. 2, pp. 365–384, Apr. 2012.
[4] T. H. Chan, W. K. Ma, C. Y. Chi, and Y. Wang, “Convex analysis framework for blind separation of
non-negative sources,” IEEE Trans. Signal Process., vol. 56, no. 10, 2008.
[5] C. Laroche, M. Kowalski, H. Papadopoulos, and G. Richard, “A structured nonnegative matrix fac-
torization for source separation,” in Proc. European Signal Processing Conf., Nice, France, 2015, pp.
2033–2037.
[6] A. Rolet, V. Seguy, M. Blondel, and H. Sawada, “Blind source separation with optimal transport
non-negative matrix factorization,” EURASIP J. Adv. Signal Process., vol. 2018, no. 53, Dec. 2018.
[7] Z. He, W. Zhang, W.-S. Chen, C. S. Tong, and Y. Xue, “A Modified Non-negative Matrix Factor-
ization Algorithm for Face Recognition,” in 18th Int. Conf. Pattern Recognition, Hong Kong, China,
2006, pp. 495–498.
[8] C. Févotte, N. Bertin, and J. L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito
divergence: With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830,
2009.
[9] T. Shi, K. Kang, J. Choo, and C. K. Reddy, “Short-Text Topic Modeling via Non-negative Matrix
Factorization Enriched with Local Word-Context Correlations,” in Proc. 2018 World Wide Web Conf.,
Lyon, France, 2018, pp. 1105–1114.
[10] P. Paatero and U. Tapper, “Positive matrix factorization: a non-negative factor model with optimal
utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, Jun. 1994.
[11] P. Paatero, “Least squares formulation of robust non-negative factor analysis,” Chemometrics and
Intelligent Laboratory Systems, vol. 37, no. 1, pp. 23–35, May 1997.
[12] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Adv. in Neu. Info.
Proc. Sys. 13, Denver, CO, 2001, pp. 556–562.
[13] P. O. Hoyer, “Non-negative Matrix Factorization with Sparseness Constraints,” J. Machine Learning
Research, vol. 5, pp. 1457–1469, Dec. 2004.
[14] P. D. O’Grady and B. A. Pearlmutter, “Discovering speech phones using convolutive non-negative
matrix factorisation with a sparseness constraint,” Neurocomputing, vol. 72, no. 1-3, pp. 88–101,
Dec. 2008.
[15] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones, “Robust and efficient estimation by minimising a
density power divergence,” Biometrika, vol. 85, no. 3, pp. 549–559, 1998.
[16] F. Itakura and S. Saito, “Analysis synthesis telephony based on the maximum likelihood method,” in
Proc. 6th Int. Cong. Acoustics, Tokyo, Japan, 1968, pp. C17–C20.
[17] S. Eguchi and Y. Kano, “Robustifying maximum likelihood estimation,” Tokyo Institute of Statistical
Mathematics, Tech. Rep., June 2001.
[18] P. Smaragdis, “Convolutive Speech Bases and Their Application to Supervised Speech Separation,”
IEEE Trans. Acoustics, Speech, and Lang. Process., vol. 15, no. 1, pp. 1–12, Jan. 2007.
[19] V. Ramanarayanan, L. Goldstein, and S. Narayanan, “Spatio-temporal articulatory movement primi-
tives during speech production: Extraction, interpretation, and validation,” J. Acoustical Soc. Amer-
ica, vol. 134, no. 2, pp. 1378–1394, Aug. 2013.
[20] C. Ding, T. Li, and M. I. Jordan, “Convex and semi-nonnegative matrix factorizations,” IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 45–55, 2010.
[21] C. Thurau, K. Kersting, M. Wahabzada, and C. Bauckhage, “Convex non-negative matrix factoriza-
tion for massive datasets,” Knowledge and Information Systems, vol. 29, no. 2, pp. 457–478, 2011.
[22] H. Kameoka, N. Ono, K. Kashino, and S. Sagayama, “Complex nmf: A new sparse representation
for acoustic signals,” in IEEE Int. Conf. Acoustics, Speech, Signal Process., Taipei, Taiwan, 2009, pp.
3437–3440.
[23] B. King and L. Atlas, “Single-Channel Source Separation Using Complex Matrix Factorization,”
IEEE Trans. Audio, Speech, and Language Process., vol. 19, no. 8, pp. 2591–2597, Nov. 2011.
[24] B. King, “New Methods of Complex Matrix Factorization for Single-Channel Source Separation and
Analysis,” Ph.D. dissertation, University of Washington, Seattle, WA, 2012.
[25] S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, “Speech-to-text and speech-to-speech summarization
of spontaneous speech,” IEEE Trans. Speech Audio Process., vol. 12, no. 4, pp. 401–408, Jul. 2004.
[26] C. M. Lee and S. Narayanan, “Towards detecting emotion in spoken dialogs,” IEEE Trans. Speech
and Audio Process., vol. 13, no. 2, pp. 293–302, Mar. 2005.
[27] H. W. Löllmann and P. Vary, “Low delay noise reduction and dereverberation for hearing aids,”
EURASIP J. Advances in Signal Process., vol. 2009, no. 1, 2009.
[28] Y. Hu and P. C. Loizou, “A generalized subspace approach for enhancing speech corrupted by colored
noise,” IEEE Trans. Speech Audio Process., vol. 11, no. 4, pp. 334–341, Jul. 2003.
[29] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral
amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. 33, no. 2, pp. 443–445, May
1985.
[30] S. D. Kamath and P. C. Loizou, “A multi-band spectral subtraction method for enhancing speech
corrupted by colored noise,” in IEEE Int. Conf. Acoustics, Speech, Signal Process., Orlando, FL,
2002.
[31] W. F. Katz, S. V. Bharadwaj, and B. Carstens, “Electromagnetic Articulography Treatment for an
Adult With Broca’s Aphasia and Apraxia of Speech,” J. Speech, Language, and Hearing Research,
vol. 42, no. 6, pp. 1355–1366, Dec. 1999.
[32] M. Itoh, S. Sasanuma, H. Hirose, H. Yoshioka, and T. Ushijima, “Abnormal articulatory dynamics in
a patient with apraxia of speech: X-ray microbeam observation,” Brain and Language, vol. 11, no. 1,
pp. 66–75, Sep. 1980.
[33] D. Byrd, S. Tobin, E. Bresch, and S. Narayanan, “Timing effects of syllable structure and stress on
nasals: A real-time MRI examination,” J. Phonetics, vol. 37, no. 1, pp. 97–110, Jan. 2009.
[34] J. Chen, J. Benesty, Y. Huang, and S. Doclo, “New Insights Into the Noise Reduction Wiener Filter,”
IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1218–1234, Jul. 2006.
[35] E. Bresch, J. Nielsen, K. S. Nayak, and S. Narayanan, “Synchronized and Noise-Robust Audio
Recordings During Realtime Magnetic Resonance Imaging Scans,” J. Acoustical Society of Amer-
ica, vol. 120, no. 4, pp. 1791–1794, Oct. 2006.
[36] J. M. Inouye, S. S. Blemker, and D. I. Inouye, “Towards Undistorted and Noise-free Speech in an
MRI Scanner: Correlation Subtraction Followed by Spectral Noise Gating,” J. Acoustical Society of
America, vol. 135, no. 3, pp. 1019–1022, Mar. 2014.
[37] M. McJury and F. G. Shellock, “Auditory Noise Associated with MR Procedures,” J. Magnetic Res-
onance Imaging, vol. 12, no. 1, pp. 37–45, Jul. 2001.
[38] Y . C. Kim, S. Narayanan, and K. S. Nayak, “Flexible retrospective selection of temporal resolution
in real-time speech MRI using a golden-ratio spiral view order,” J. Magnetic Resonance in Medicine,
vol. 65, no. 5, pp. 1365–1371, May 2011.
[39] S. Narayanan, K. Nayak, S. Lee, A. Sethy, and D. Byrd, “An approach to real-time magnetic reso-
nance imaging for speech production,” J. Acoustical Society of America, vol. 115, no. 4, pp. 1771–
1776, Mar. 2004.
[40] Y . C. Kim, S. Narayanan, and K. Nayak, “Accelerated Three-Dimensional Upper Airway MRI Using
Compressed Sensing,” J. Magnetic Resonance Imaging, vol. 61, no. 6, pp. 1434–1440, Jun. 2009.
[41] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V . Zue,
“TIMIT Acoustic-Phonetic Continuous Speech Corpus,” in Linguistic Data Consortium, Philadel-
phia, PA, 1993.
90
[42] N. Parihar and J. Picone, “Analysis of the Aurora large vocabulary evaluations,” in Proc. Eurospeech,
Geneva, Switzerland, 2003, pp. 337–340.
[43] D. B. Paul and J. M. Baker, “The Design for the Wall Street Journal-based CSR Corpus,” in Proc.
Workshop on Speech and Natural Language, Harriman, New York, 1992, pp. 357–362.
[44] C. Vaz, V . Ramanarayanan, and S. Narayanan, “A two-step technique for MRI audio enhancement
using dictionary learning and wavelet packet analysis,” in Proc. Interspeech, Lyon, France, 2013, pp.
1312–1315.
[45] S. Tabibian, A. Akbari, and B. Nasersharif, “A new wavelet thresholding method for speech enhance-
ment based on symmetric Kullback-Leibler divergence,” in 2009 14th International CSI Computer
Conf., Tehran, Iran, 2009, pp. 495–500.
[46] V . R. Ramachandran, I. M. S. Panahi, and A. A. Milani, “Objective and Subjective Evaluation
of Adaptive Speech Enhancement Methods for Functional MRI,” J. Magnetic Resonance Imaging,
vol. 31, no. 1, pp. 46–55, Jan. 2010.
[47] J. G. Beerends, A. P. Hekstra, A. W. Rix, and M. P. Hollier, “Perceptual evaluation of speech qual-
ity (PESQ): The new ITU standard for end-to-end speech quality assessment part II-psychoacoustic
model,” J. Audio Eng. Soc., vol. 50, no. 10, pp. 765–778, Oct. 2002.
[48] C. H. Taal, R. C. Hendricks, R. Heusdens, and J. Jensen, “A Short-Time Objective Intelligibility
Measure for Time-Frequency Weighted Noisy Speech,” in IEEE Int. Conf. Acoustics, Speech, Signal
Process., Dallas, TX, 2010, pp. 4214–4217.
[49] J. Le Roux, H. Kameoka, E. Vincent, N. Ono, K. Kashino, and S. Sagayama, “Complex NMF under
spectrogram consistency constraints,” in ASJ Autumn Meeting, Koriyama, Japan, 2009.
[50] P. Magron, R. Badeau, and B. David, “Complex nmf under phase constraints based on signal model-
ing: Application to audio source separation,” in IEEE Int. Conf. Acoustics, Speech, Signal Process.,
Shanghai, China, 2016.
[51] H. Woo and J. Ha, “Besta-divergence-based variational model for speckle reduction,” IEEE Signal
Process. Letters, vol. 23, no. 11, pp. 1557–1561, 2016.
91
[52] M. Mihoko and S. Eguchi, “Robust blind source separation by beta divergence,” Neural Computation,
vol. 14, no. 8, pp. 1859–1886, 2002.
[53] J. Lafferty, “Additive models, boosting, and inference for generalized divergences,” in Proc. Twelfth
Annual Conf. Computational Learning Theory, Santa Cruz, CA, 1999, pp. 125–133.
[54] T. Villmann and S. Haase, “Divergence-based vector quantization,” Neural Computation, vol. 23,
no. 5, pp. 1343–1392, 2011.
[55] S. Kullback and R. A. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics,
vol. 22, no. 1, pp. 79–86, 1951.
[56] E. Hellinger, “Neue begr¨ undung der theorie quadratischer formen von unendlichvielen
ver¨ anderlichen,” Journal f¨ ur die reine und angewandte Mathematik, vol. 136, pp. 210–271, 1909.
[57] I. Csisz´ ar, “Eine informationstheoretische ungleichung und ihre anwendung auf den beweis der er-
godizitat von markoffschen ketten,” Magyar. Tud. Akad. Mat. Kutato Int. Kozl., vol. 8, pp. 85–108,
1963.
[58] T. Morimoto, “Markov processes and the h-theorem,” J. Physical Society of Japan, vol. 18, no. 3, pp.
328–331, 1963.
[59] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from
another,” J. Royal Statistical Society. Series B (Methodological), vol. 28, no. 1, pp. 131–142, 1966.
[60] H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of
observations,” Annals of Mathematical Statistics, vol. 23, no. 4, pp. 493–507, 1952.
[61] L. M. Bregman, “The relaxation method of finding the common point of convex sets and its appli-
cation to the solution of problems in convex programming,” USSR Computational Mathematics and
Mathematical Physics, vol. 7, no. 3, pp. 200–217, 1967.
[62] S.-I. Amari, “-divergence is unique, belonging to both f-divergence and bregman divergence
classes,” IEEE Trans. Information Theory, vol. 55, no. 11, pp. 4925–4931, 2009.
[63] A. Cichocki, S. Cruces, and S.-I. Amari, “Generalized alpha-beta divergences and their application
to robust nonnegative matrix factorization,” Entropy, vol. 13, pp. 134–170, 2011.
92
[64] H. Woo, “A characterization of the domain of beta-divergence and its connection to bregman varia-
tional model,” Entropy, vol. 19, no. 9, p. 482, 2017.
[65] W. H. Young, “On classes of summable functions and their fourier series,” Proc. Royal Soc. A, vol. 87,
no. 594, pp. 225–229, 1912.
[66] C. F´ evotte and J. Idier, “Algorithms for nonnegative matrix factorization with the beta-divergence,”
Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[67] G. Shi, M. M. Shanechi, and P. Aarabi, “On the importance of phase in human speech recogntion,”
IEEE Trans. Audio, Speech, and Language Process., vol. 14, no. 5, pp. 1867–1874, 2006.
[68] H. Kameoka and H. Kagami, “Complex non-negative matrix factorization: Phase-aware sparse rep-
resentation of audio spectrograms,” J. Acoustical Soc. of America, vol. 140, no. 4, pp. 3053–3053,
2016.
[69] H. Kameoka and H. K. abd Masahiro Yukawa, “Complex nmf with the generalized kullback-leibler
divergence,” in IEEE Int. Conf. Acoustics, Speech, Signal Process., New Orleans, LA, 2017, pp.
56–60.
[70] P. Bouboulis and S. Theodoridis, “Extension of wirtinger’s calculus to reproducing kernel hilbert
spaces and the complex kernel lms,” IEEE Trans. Signal Process., vol. 59, no. 3, pp. 964–978, 2011.
[71] R. Boloix-Tortosa, J. J. Murillo-Fuentes, I. Santos, and F. P´ erez-Cruz, “Widely linear complex-valued
kernel methods for regression,” IEEE Trans. Signal Process., vol. 65, no. 19, pp. 5240–5248, 2017.
[72] G. M. Georgiou and C. Koutsougeras, “Complex domain backpropagation,” IEEE Trans. Circuits and
Systems–II: Analog and Digital Signal Process., vol. 39, no. 5, pp. 330–334, 1992.
[73] T. Kim and T. Adali, “Approximation by fully complex multilayer perceptrons,” Neural Computation,
vol. 15, no. 7, pp. 1641–1666, 2003.
[74] C. Trabelsi, O. Bilaniuk, Y . Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Ros-
tamzadeh, Y . Bengio, and C. J. Pal, “Approximation by fully complex multilayer perceptrons,”
arXiv:1705.09792, 2017.
93
[75] D. P. Reichert and T. Serre, “Neuronal synchrony in complex-valued deep networks,” in Int. Conf.
Learning Representations, Banff, Canada, 2014.
[76] E. Oyallon and S. Mallat, “Deep roto-translation scattering for object classification,” in IEEE. Conf.
Computer Vision and Pattern Recognition, Boston, MA, 2015, pp. 2865–2873.
[77] S. Chintala, M. Ranzato, A. Szlam, Y . Tian, M. Tygert, and W. Zaremba, “Scale-invariant learning
and convolutional networks,” Applied and Computational Harmonic Analysis, vol. 42, no. 1, pp.
154–166, 2017.
[78] Q. Hu, J. Yamagishi, K. Richmond, K. Subramanian, and Y . Stylianou, “Initial investigation of speech
synthesis based on complex-valued neural networks,” in IEEE Int. Conf. Acoustics, Speech, Signal
Process., Shanghai, China, 2016.
[79] T. Trouillon, J. Welbl, S. Reidel,
´
E. Gaussier, and G. Bouchard, “Complex embeddings for simple
link prediction,” in IEEE. Conf. Computer Vision and Pattern Recognition, Boston, MA, 2015, pp.
2865–2873.
[80] J. Thiemann, N. Ito, and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise
Database (DEMAND): A database of multichannel environmental noise recordings,” in 21st Int.
Cong. Acoustics, Montreal, Canada, 2013.
[81] Y . Gao and G. Church, “Improving molecular cancer class discovery through sparse non-negative
matrix factorization,” Bioinformatics, vol. 21, no. 21, pp. 3970–3975, Sep. 2005.
[82] T. Virtanen, “Sound source separation using sparse coding with temporal continuity objective,” in
Proc. Int. Computer Music Conf., 2003, pp. 231–234.
[83] D. Kong, C. Ding, and H. Huang, “Robust Nonnegative Matrix Factorization using L21-norm,” in
ACM Int. Conf. Information and Knowledge Management, Glasgow, Scotland, 2011, pp. 673–682.
[84] F. J. Theis, K. Stadlthanner, and T. Tanaka, “First Results on Uniqueness of Sparse Non-negative
Matrix Factorization,” in 2005 13th European Signal Processing Conference, 2005, pp. 1–4.
[85] J. Capon, “High-Resolution Frequency-Wavenumber Spectrum Analysis,” Proc. IEEE, vol. 57, no. 8,
pp. 1408–1418, Aug. 1969.
94
[86] S. Dharanipragada and B. D. Rao, “MVDR based feature extraction for robust speech recognition,”
in IEEE Int. Conf. Acoustics, Speech, Signal Process., Salt Lake City, UT, 2001, pp. 309–312.
[87] M. N. Murthi and B. D. Rao, “Minimum Variance Distortionless Response (MVDR) Modeling of
V oiced Speech,” in IEEE Int. Conf. Acoustics, Speech, Signal Process., Munich, Germany, 1997, pp.
1687–1690.
[88] A. A. Wrench, “A multi-channel/multi-speaker articulatory database for continuous speech recog-
nition research,” in Workshop on Phonetics and Phonology in Automatic Speech Recognition, Saar-
brucken, Germany, 2000, pp. 1–13.
[89] A. Varga and H. J. M. Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92:
A database and an experiment to study the effect of additive noise on speech recognition systems,”
Speech Communication, vol. 12, no. 3, pp. 247–251, Jul. 1993.
[90] A. Cichocki, S. Cruces, and S. Amari, “Generalized Alpha-Beta Divergences and Their Application
to Robust Non-negative Matrix Factorization,” Entropy, vol. 13, no. 1, pp. 134–170, Jan. 2001.
[91] B. Yin, F. Chen, N. Ruiz, and E. Ambikairajah, “Speech-based cognitive load monitoring system,” in
Proc. Int. Conf. Acoustics, Speech, and Signal Process., 2008, pp. 2041–2044.
[92] D. Macho, L. Mauury, B. No´ e, Y . M. Cheng, D. Ealey, D. Jouvet, H. Kelleher, D. Pearce, and
F. Saadoun, “Evaluation of a noise-robust DSR front-end on AURORA databases,” in Proc. Int. Conf.
Spoken Lang. Process., 2002, pp. 17–20.
[93] T. Yoshioka and T. Nakatani, “Noise model transfer: novel approach to robustness against nonsta-
tionary noise,” IEEE Trans. Acoustics, Speech, and Lang. Process., vol. 21, no. 10, pp. 2182–2192,
Oct. 2013.
[94] J. Droppo, A. Acero, and L. Deng, “Evaluation of the SPLICE algorithm on the Aurora2 database,”
in Proc. Eurospeech, 2001, pp. 217–220.
[95] O. Kalinli, M. L. Seltzer, J. Droppo, and A. Acero, “Noise adaptive training for robust automatic
speech recognition,” IEEE Trans. Acoustics, Speech, and Lang. Process., vol. 18, no. 8, pp. 1889–
1901, Nov. 2010.
95
[96] Y . Wang and M. J. F. Gales, “Speaker and noise factorization for robust speech recognition,” IEEE
Trans. Acoustics, Speech, and Lang. Process., vol. 20, no. 7, pp. 2149–2158, Sep. 2012.
[97] M. L. Seltzer, D. Yu, and Y . Wang, “An investigation of deep neural networks for noise robust speech
recognition,” in Proc. Int. Conf. Acoustics, Speech, and Signal Process. IEEE, 2013, pp. 7398–7402.
[98] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acous-
tics, Speech, and Signal Process., vol. 20, no. 2, pp. 113–120, Apr. 1979.
[99] A. Narayanan and D. L. Wang, “Investigation of speech separation as a front-end for noise robust
speech recognition,” IEEE/ACM Trans. Audio, Speech, and Lang. Process., vol. 22, no. 4, pp. 826–
835, 2014.
[100] P. J. Moreno, B. Raj, and R. M. Stern, “A vector Taylor series approach for environment-independent
speech recognition,” in Proc. Int. Conf. Acoustics, Speech, and Signal Process. IEEE, 1996, pp.
733–736 vol. 2.
[101] L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang, “High-performance robust speech recognition
using stereo training data,” in Proc. Int. Conf. Acoustics, Speech, and Signal Process. IEEE, 2001,
pp. 301–304.
[102] C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients (PNCC) for robust speech recog-
nition,” in Proc. Int. Conf. Acoustics, Speech, and Signal Process. IEEE, 2012, pp. 4101–4104.
[103] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker
verification,” IEEE/ACM Trans. Acoustics, Speech, and Lang. Process., vol. 19, no. 4, pp. 788–798,
May 2011.
[104] H. Soltau, G. Saon, and B. Kingsbury, “The IBM Attila Speech Recognition Toolkit,” in Proc. IEEE
Workshop Spoken Lang. Technology, Dec. 2010, pp. 97–102.
[105] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans.
Acoustics, Speech, and Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.
[106] R. L. Graham, “An Efficient Algorithm for Determining the Convex Hull of a Finite Planar Set,”
Information Process. Letters, vol. 1, no. 4, pp. 132–133, 1972.
96
[107] C. B. Barber, D. P. Dobkin, and H. Huhdanpaa, “The quickhull algorithm for convex hulls,” ACM
Trans. Mathematical Software, vol. 22, no. 4, pp. 469–483, 1996.
[108] C. P. Browman and L. Goldstein, “Articulatory phonology: An overview,” Phonetica, vol. 49, no.
3-4, pp. 155–180, 1992.
[109] S. Narayanan, A. Toutios, V . Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. S. Nayak, Y .-C. Kim,
Y . Zhu, L. Goldstein, D. Byrd, E. Bresch, P. K. Ghosh, A. Katsamanis, and M. I. Proctor, “Real-
time magnetic resonance imaging and electromagnetic articulography database for speech production
research (TC),” J. Acoustical Society of America, vol. 136, no. 3, pp. 1307–1311, 2014.
[110] E. Bresch and S. Narayanan, “Region segmentation in the frequency domain applied to upper airway
real-time magnetic resonance images,” IEEE Trans. Medical Imaging, vol. 28, no. 3, pp. 323–338,
2009.
[111] T. Sorensen, A. Toutios, L. Goldstein, and S. Narayanan, “Characterizing vocal tract dynamics with
real-time MRI,” in LabPhon, 2015.
[112] A. Toutios and S. Narayanan, “Factor analysis of vocal-tract outlines derived from real-time magentic
resonance imaging data,” in Int. Congress Phonetic Sciences, 2015, pp. 523–532.
97