Noise Aware Methods for Robust Speech Processing
Applications.
by
Pavlos Papadopoulos
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)
August 2020
Copyright 2020 Pavlos Papadopoulos
Acknowledgments
I would like to thank my advisor Shrikanth Narayanan for giving me the opportunity to
join his lab and for trusting me to work on many interesting projects. Despite the busy
schedule that a successful lab such as SAIL entails, he always found time for us to discuss
ideas and my work. His guidance all these years has been invaluable, and the work ethic
he instilled in me will follow me forever.
Moreover, I would like to thank all my friends in SAIL. In such a diverse lab it is
difficult to name everyone, but I appreciate the time I have spent with each and every one
of you, and I am very happy and proud that we crossed the line from being just "labmates"
to being friends. I have learned a lot through our collaborations and our discussions, and
it is through those that I was able to move on and find solutions when I was stuck on a
problem in my research.
I dedicate this thesis to my parents and my brother. They have always been by my side
and supported me in all of my steps. It is true that you don't choose your family, but I was
extremely lucky to have been born into this one, and I am very grateful for it.
Finally, I would like to thank my friends from my hometown. Whenever I was visiting
they always welcomed me, and made my vacations enjoyable and recharging.
Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 SNR Estimation
  1.2 Speech Enhancement
2 Global SNR Estimation
3 A Supervised SNR Estimation of Speech Signals
  3.1 Feature Description
    3.1.1 Long-Term Energy
    3.1.2 Long-Term Signal Variability
    3.1.3 Pitch
    3.1.4 Voicing Probability
  3.2 Regression models for global SNR estimation
4 Long-term SNR Estimation of Speech Signals in Known and Unknown Channel Conditions
  4.1 Known Noise Case
  4.2 Unknown Noise Case
  4.3 Noise Dependent Regression Model Training
  4.4 DNN Training for Noise-Model Selection
  4.5 Experimental Results
  4.6 Performance under known noise conditions
  4.7 Performance under unknown noise conditions
5 Global SNR Estimation of Speech Signals for Unknown Noise Conditions using Noise Adapted Non-linear Regression
  5.1 Total Variability Model
  5.2 Using Neural Networks as nonlinear regression models for SNR estimation
6 Nonnegative Matrix Factorization (NMF) for Speech Enhancement
  6.1 Introduction to NMF
  6.2 Noise Aware and Combined Noise Models for Speech Denoising in Unknown Noise Conditions
    6.2.1 Using Model Selection
    6.2.2 Method Using Combined Noise Dictionary
    6.2.3 Results and Discussion
7 NMF Dictionaries as Convex Polyhedral Cones
  7.1 Conic Affinity Measures
  7.2 Individual Performance of conic affinity measures
  7.3 Combining Conic Affinity Measures
8 Conclusions and Future Directions
References
List of Tables

1  Percentile Pair values
2  Types of noise
3  Percentile values that define measurement windows
4  SNR mean absolute error using different features
5  SNR mean absolute error, nonspeech component
6  SNR mean absolute error in cross-database setup
7  SNR Mean Absolute Error for different feature sets
8  SNR Estimation using nonlinear regression
9  Noises used for speech enhancement
10 Performance of speech enhancement metrics based on individual conic affinity measures
11 Performance of speech enhancement metrics using conic affinity measures in unknown and unseen noise cases
List of Figures

1  Mean absolute error for White Noise
2  Mean absolute error for Pink Noise
3  Mean absolute error for Car Interior Noise
4  Mean absolute error for Machine Gun Noise
5  Mean absolute error for Babble Speech Noise
6  Mean absolute error for High Frequency Noise
7  Analysis of regression models performance
8  Comparison of SNR estimation methods
9  Analysis per noise type
10 Analysis per SNR level
11 Dependence on utterance duration (known noise)
12 Dependence on nonspeech segment duration (known noise)
13 Analysis on noise pool size
14 Dependence on utterance duration (unknown noise)
15 Dependence on nonspeech segment duration (unknown noise)
16 Cross database performance
17 PESQ score improvements
18 Convex polyhedral cones in the positive orthant
Abstract
The performance of speech technologies deteriorates in the presence of noise. Additionally, the spread of such technologies in everyday scenarios (e.g., speech applications operating on mobile phones), specialized data collection (e.g., audio for medical applications), as well as surveillance, gave rise to a new demand in the field of speech processing: these speech technologies should be able to operate across a variety of noise levels and conditions. Traditional speech and audio processing applications require prior knowledge of the noise corrupting the signal, or an estimate of the noise from noise-only portions of the signal, which in turn necessitates knowledge of speech boundaries. Relaxing those requirements can facilitate processing of data captured in different real-life environments, and relax rigid data acquisition protocols that can potentially create a bottleneck. Although it is impossible to have every type of noise available, we can exploit similarities between different noise types to boost the performance of algorithms. In this work we demonstrate this approach on two applications: Signal to Noise Ratio (SNR) estimation and speech enhancement. Many speech processing algorithms and applications rely on explicit knowledge of the signal to noise ratio (SNR) in their design and implementation. Hence, SNR estimation can guide the design and operation of such technologies, or can be used as a pre-processing tool in database creation (e.g., to identify and discard noisy signals). Speech enhancement is a core tool in many speech related applications, since it transforms a noisy signal to a state where we can extract useful and reliable information. The goal of this work is to enable those two technologies to successfully operate under unseen noise conditions. To achieve this goal we followed different approaches for the two problems at hand. In the case of SNR estimation we propose new features designed to capture information in the signal that generalizes across different noise types and SNR characteristics, used in models that take into account not only those features but also information about the noise type itself. In speech enhancement we follow a different approach: we employ a method called Nonnegative Matrix Factorization (NMF), which has met widespread success in denoising, and propose modifications that condition the method to operate under unseen noise conditions.
1 Introduction
Real-life speech processing is a challenging task, since environmental conditions introduce
"noise" to the speech signal, altering its original properties and decreasing the performance
of speech technology applications. In the last few years, data availability and the need for
speech applications operating under a variety of noise conditions have renewed research efforts
towards unseen noise conditions, that is, noise conditions that systems have not previously
encountered. Two core tools in speech processing are signal to noise ratio (SNR) estimation
and speech enhancement (speech denoising), thus the need for them to operate across a
variety of noise types is critical.
1.1 SNR Estimation
Signal to Noise Ratio (SNR), one of the most fundamental constructs in signal processing,
gives information about the level of noise present in the original signal, and is defined as the
ratio of signal power to noise power expressed in decibels (dB).

SNR estimation is a challenging task, since in general we do not know the type of noise
that corrupts the signal. Moreover, when dealing with non-deterministic signals (e.g., speech)
there is an additional layer of randomness. However, accurate SNR estimation can guide
the design of algorithms and systems that compensate for the effects of noise, such as robust
automatic speech recognition (e.g. [1], [2]), speech enhancement (e.g. [3], [4], [5]), and noise
suppression [6].
Broadly speaking, SNR estimation algorithms can be divided into two categories, those
that focus on a frame of the original signal (instantaneous SNR), and those that focus on the
entire signal (global SNR). Instantaneous SNR estimation has been the focus of many works
in speech processing ([7],[8],[9],[10]) since it can directly be applied to speech enhancement.
Global SNR estimation is also useful when building SNR-specific speech and speaker
recognition systems ([11], [12]) as well as for other speech related tasks. For example, there is
a resurgence of research efforts on robust Speech Activity Detection (SAD), such as in the
DARPA RATS program, wherein the speech signal can be altered by a variety of channel
conditions. Therefore, there has been a renewed effort on robust global SNR estimation
([13], [14], [15]). In this work, we address the problem of global SNR estimation, which is
defined as:
$$\mathrm{SNR} = 20\log_{10}\frac{\sqrt{\frac{1}{M}\sum_{m=1}^{M} s^{2}[m]}}{\sqrt{\frac{1}{M}\sum_{m=1}^{M} n^{2}[m]}} = 10\log_{10}\frac{E(s)}{E(n)}$$

where $s[\cdot]$, $n[\cdot]$ are the speech and noise signals, respectively.
Usually, SNR estimation algorithms (both global and local ones) are based on the following assumptions:
• Background noise is stationary
• Noise and speech sources are independent
• Noise and speech are zero-mean signals
• Speech boundaries in the signal are known
However, recent demands of speech technology systems being widely deployed under real-
life conditions have resulted in many SNR estimation efforts moving away from the stationary
case ([16], [17]). Moreover, prior knowledge of speech boundaries in the signal is not always
feasible. While a SAD system could be employed to extract speech regions, robust SAD
systems are usually tuned to specic channel conditions.
In this work, we focus on the estimation of global SNR (i.e., at the utterance level) in
signals with unknown speech boundaries under two main frameworks. In the first case, we
assume we know what type of noise corrupts the original signal, while in the second the
noise type is assumed unknown. In both scenarios, we make no assumptions about the
characteristics of the noise (e.g., stationarity), and our experiments show that we can achieve
accurate estimation regardless of noise conditions.

We employ a series of approaches and propose new methods to successfully estimate the
global SNR of signals in both unknown and unseen noise conditions.
1.2 Speech Enhancement
Real-life speech processing can be challenged by different environmental noise and channel
conditions, degrading the performance of speech applications. In the last few years, data
availability and the need for speech applications operating under a variety of noise conditions
have renewed the efforts on more sophisticated denoising schemes. For example, subspace
methods with time and spectral constraints ([18], [19]) as well as Nonnegative Matrix
Factorization methods ([20], [21]) are not restricted to specific noise types (e.g., stationary
or quasi-stationary). However, all these methods require either prior information about the
noise conditions that corrupt the speech signal or a robust estimate of the noise. This type
of knowledge cannot always be obtained, especially if the data are collected from various
sources and under varying noise conditions.
The motivation behind this work is to design a system that is able to operate under
unknown noise conditions at different SNR levels without requiring prior information about
the noise. To that end, we propose various methods. The first method has two stages.
In the first stage, we choose an appropriate pre-trained noise model to "match" the noise of
the corrupted signal. The work in [22] addresses the problem of noise classification in speech
signals using Bark scale features, and we followed a similar approach in [23]. This is a classic
pattern classification task, where the tested signal is corrupted by one of the noises that the
system was trained on. The problem with this approach arises when the signal is corrupted
by a type of noise that the system was not trained on. The sensitivity of Bark scale features
and MFCCs to noise results in poor generalization properties when such systems encounter
unknown types of noise. To overcome these issues we use a method that is based on Long-Term
Signal Variability (LTSV), introduced in [24]. Once we compute LTSV features on
the test signal, we construct a histogram of its values and find the Kullback-Leibler (KL)
distance of the signal's LTSV histogram to the other LTSV histograms in our training set. In the
second stage of our system, we employ Nonnegative Matrix Factorization (NMF) to denoise
the signal, using the noise model we selected in the first phase. NMF has been successfully
used in speech denoising ([20], [21]) for a variety of noise types.
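To make the selection step concrete, here is a minimal sketch in Python (the bin count, value range, and function names are our own illustrative assumptions, not specifics from the original system):

import numpy as np

def ltsv_histogram(ltsv_values, n_bins=50, value_range=(0.0, 1.0), eps=1e-10):
    # Histogram of LTSV values, normalized to a probability distribution.
    # The actual LTSV value range depends on the analysis setup; (0, 1) is a placeholder.
    hist, _ = np.histogram(ltsv_values, bins=n_bins, range=value_range)
    p = hist.astype(float) + eps  # avoid zero bins inside the KL computation
    return p / p.sum()

def kl_distance(p, q):
    # Kullback-Leibler divergence KL(p || q) between two normalized histograms.
    return float(np.sum(p * np.log(p / q)))

def select_noise_model(test_ltsv, train_histograms):
    # Pick the training noise whose LTSV histogram is closest in KL distance.
    p = ltsv_histogram(test_ltsv)
    return min(train_histograms, key=lambda noise: kl_distance(p, train_histograms[noise]))

The returned key would then index the pre-trained noise model handed to the NMF denoising stage.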
In the second method, we do not supply a chosen noise model to NMF. Instead, we create
a "combined" noise dictionary from a variety of noises, which is used by NMF for denoising.
The intuition behind this approach is that if this combined dictionary contains noise types
which are similar to the noise that corrupts the signal, then the atoms of the dictionary will
adequately model the noise that corrupts the test signal.
Finally, the last method exploits the geometric properties of NMF. All the elements of
the dictionaries created by NMF are nonnegative. Hence, these elements form a convex
polyhedral cone in the positive orthant. We designed different features that attempt to
capture this geometry, and based on that make an informed decision regarding which noise
dictionary to use during the test phase. We call these features conic affinity measures, since
their goal is to give some measure of cone "similarity", and they were inspired by the subspace
affinity metrics [25]. We study the validity of these features with regard to the denoising
task, and design methods that enable us to perform speech enhancement under unknown
and unseen noise conditions.
2 Global SNR Estimation
The global SNR of a speech signal is defined as:
$$\mathrm{SNR} = 20\log_{10}\frac{\sqrt{\frac{1}{M}\sum_{m=1}^{M} s^{2}[m]}}{\sqrt{\frac{1}{M}\sum_{m=1}^{M} n^{2}[m]}} = 10\log_{10}\frac{E(s)}{E(n)}$$
where $M$ is the number of samples in the signal, $s$, $n$ are the speech and noise signals
respectively, $E(s)$ is the total energy of the speech signal (i.e., $E(s) = \sum_{m} s^{2}[m]$), and
$E(n)$ is the total energy of the noise.
For the rest of this work we will assume that the sources of speech and noise are independent,
and the noise is additive, i.e., $x[m] = s[m] + n[m]$. Moreover, we will assume that both
the speech and noise signals are zero-mean. Under these assumptions, SNR can be expressed
as:
$$\mathrm{SNR} = 10\log_{10}\frac{E(x) - E(n)}{E(n)}$$
Therefore, once we have measurements of the energies $E(x)$ and $E(n)$, we can estimate
the SNR. A SAD algorithm could be used to detect regions in the signal without speech and
estimate $E(n)$; however, the analysis in [14] shows that SAD methods require fine tuning
depending on channel noise conditions, which could be highly variable and diverse. Notice
that in audio processing the SNR definition of a corrupted signal is different from other
fields like information theory, where the noise influence is measured per sample and not
over the whole signal.
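As a minimal sketch of the two expressions above (assuming the total energies have already been measured; the variable and function names are ours):

import numpy as np

def oracle_global_snr_db(speech, noise):
    # Global SNR in dB when the clean speech and noise signals are both known.
    return 10.0 * np.log10(np.sum(speech ** 2) / np.sum(noise ** 2))

def snr_from_energies_db(E_x, E_n):
    # Global SNR of x = s + n from total energies, using E(s) = E(x) - E(n)
    # under the independence and zero-mean assumptions stated above.
    return 10.0 * np.log10((E_x - E_n) / E_n)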
3 A Supervised SNR Estimation of Speech Signals
Our method is based on features that capture the presence of speech in the signal, from
which we formulate a regression model, estimating its coefficients with ordinary least squares. Based
on those features we create "soft" SNR estimators and let the regression model learn the weight
each estimator should have. We make the following two assumptions:
• Speech and noise sources are independent and additive.
• Speech and noise signals are zero-mean.
Our experiments showed that our method outperforms others found in the literature,
such as [13, 26]. In order to understand how we construct these soft SNR estimators, we must first
study the features from which they are constructed.
3.1 Feature Description
In this section we present the features used to create the "soft" SNR estimators from which
we train the regression models that yield the final SNR estimation.
3.1.1 Long-Term Energy
Since SNR is a ratio of energies, our first feature will be the long-term energy calculated in
each frame (the average energy in each frame). The average energy in each frame $n$ of the
signal $y$ can be found by

$$E_{y}(n) = \frac{1}{|F|}\sum_{f_{j}\in F} S_{y}(n, f_{j})$$
where $S_{y}(n, f_{j})$ is the spectrum at frame $n$ and frequency bin $f_{j}$, $F$ is the set of
frequency bins, and $|F|$ is the cardinality of $F$. $S_{y}(n, f_{j})$ is computed as:
$$S_{y}(n, f_{j}) = |Y(n, f_{j})|^{2}$$

$$Y(n, f_{j}) = \sum_{l=(n-1)N_{sh}+1}^{N_{w}+(n-1)N_{sh}} w\bigl(l - (n-1)N_{sh} - 1\bigr)\, y(l)\, e^{-2\pi i f_{j} l}$$
where $w(k)$, $0 \le k < N_{w}$, is the short-time window, $N_{w}$ is the frame length, $N_{sh}$ is the
frame shift duration (in samples), and $Y(n, f_{j})$ is the short-time Fourier transform (STFT)
at frequency $f_{j}$, computed for the $n$th frame.
Since transitions of energy values can be abrupt, we apply a simple moving average window
to smooth the long-term energy. Let $S_{m}(\cdot)$ be a simple moving average window operator of
window length $m$; then we can get a "smoothed" version of the long-term energy of a signal
$y$ as $E(y) = S_{m}(E_{y})$. In order to balance between retaining the original information of the
signal and getting robust measurements of the energy regions, we try smoothing windows
of different lengths.
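A sketch of the long-term energy and its smoothed version (in Python, our choice; the 25 ms window and 10 ms shift at 16 kHz match the framing described in Section 4.3 and are assumptions here):

import numpy as np

def long_term_energy(y, n_win=400, n_shift=160):
    # E_y(n) = (1/|F|) * sum_j S_y(n, f_j): average spectral energy per frame.
    window = np.hamming(n_win)
    n_frames = 1 + (len(y) - n_win) // n_shift
    energies = np.empty(n_frames)
    for n in range(n_frames):
        frame = y[n * n_shift : n * n_shift + n_win] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2  # S_y(n, f_j) = |Y(n, f_j)|^2
        energies[n] = spectrum.mean()               # average over frequency bins
    return energies

def smooth(values, m):
    # Simple moving-average operator S_m(.) of window length m.
    return np.convolve(values, np.ones(m) / m, mode="same")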
For every window length we compute an energy measurement, $E(x)$, in the following
manner: first, we calculate the smoothed long-term energy in each time frame and sort the
values; then we pick two percentile values (e.g., 90% and 95%) that correspond to percentage
values of the total energy, and calculate the average long-term energy of the frames that
fall in that region. The reason we chose percentile values is that signals can be of arbitrary
length and speech boundaries are unknown.

We repeat the same procedure for two different percentile values (e.g., 10% and 15%).
Finally, we build a regressor for every triplet of smoothing window length and energy
measurements. This regressor can be expressed as:
$$l_{m}^{ab,cd} = 10\log_{10}\frac{E_{cd}(x) - E_{ab}(x)}{E_{ab}(x)} \qquad (1)$$
In this "overloaded" notation, $m$ stands for the length of the smoothing window, $a, b, c, d$
are the percentile values, and $E_{cd}(x)$, $E_{ab}(x)$ are energy measurements based on their
respective percentile values.
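A sketch of one such regressor, following our reading of the description above (the percentile-to-index mapping is an assumption):

import numpy as np

def percentile_band_energy(smoothed_energy, lo, hi):
    # Average smoothed energy of the frames whose sorted values fall in the lo%-hi% band.
    ordered = np.sort(smoothed_energy)
    i = int(len(ordered) * lo / 100)
    j = max(int(len(ordered) * hi / 100), i + 1)
    return ordered[i:j].mean()

def energy_regressor(smoothed_energy, a, b, c, d):
    # l_m^{ab,cd} = 10 log10((E_cd(x) - E_ab(x)) / E_ab(x)), cf. eq. (1).
    # Assumes the c%-d% band carries more energy than the a%-b% band.
    E_ab = percentile_band_energy(smoothed_energy, a, b)
    E_cd = percentile_band_energy(smoothed_energy, c, d)
    return 10.0 * np.log10((E_cd - E_ab) / E_ab)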
3.1.2 Long-Term Signal Variability
The second feature we use to create our regressors is Long-Term Signal Variability (LTSV).
LTSV was proposed in [24] and is a way of measuring the degree of non-stationarity in a
signal by measuring the entropy of the normalized short-time spectrum at every frequency
over consecutive frames. LTSV is computed using the last R frames of the observed signal
x with respect to the current frame of interest r:
$$L(r) \triangleq \frac{1}{K}\sum_{i=1}^{K}\bigl(\xi_{i}(r) - \bar{\xi}(r)\bigr)^{2}, \qquad \bar{\xi}(r) = \frac{1}{K}\sum_{i=1}^{K}\xi_{i}(r)$$

$$\xi_{i}(r) \triangleq -\sum_{n=r-R+1}^{r}\frac{S(n,f_{i})}{\sum_{p=r-R+1}^{r} S(p,f_{i})}\,\log\!\left(\frac{S(n,f_{i})}{\sum_{p=r-R+1}^{r} S(p,f_{i})}\right) \qquad (2)$$
where $S(n, f_{i})$ is the short-time spectrum computed for the $n$th frame over $i = 1,\dots,K$
frequency bins, and $R$ is the analysis window. Hence, for every short-time spectrum frame
$n$, we have a corresponding LTSV frame $r$. Regardless of the noise type that corrupts the
signal, we expect higher LTSV values in speech regions, because speech itself is inherently
non-stationary.
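A direct transcription of eq. (2) into code (a sketch, not the authors' implementation), for a short-time spectrogram S of shape (frames, K):

import numpy as np

def ltsv(S, R):
    # For each frame r, compute the entropy xi_i(r) of the spectrum over the
    # last R frames at every frequency bin i, then its variance across the K bins.
    n_frames, K = S.shape
    values = np.full(n_frames, np.nan)  # undefined for the first R-1 frames
    for r in range(R - 1, n_frames):
        block = S[r - R + 1 : r + 1, :] + 1e-12       # last R frames; eps guards log(0)
        p = block / block.sum(axis=0, keepdims=True)  # normalize per frequency bin
        xi = -(p * np.log(p)).sum(axis=0)             # entropy xi_i(r), i = 1..K
        values[r] = ((xi - xi.mean()) ** 2).mean()    # L(r): variance across bins
    return values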
Using LTSV, we create the second set of regressors in a similar manner as described in
the previous section. First, we apply a simple moving average smoothing window on both
the LTSV values and the energy of the signal (to smooth regions with abrupt transitions)
and sort the smoothed LTSV. Then, we choose a window defined by two percentile values
(e.g., 90% and 95%) of the highest LTSV values and find all the frames that fall in that
range. Finally, we compute the energy of the LTSV frames of that window and take a
measurement of $E(x)$ based on those. Using the same method, we compute the energy in a
different window of LTSV values. We use these energy measurements to form a regressor as:
$$v_{m,k,R}^{ab,cd} = 10\log_{10}\frac{E(\xi_{cd}(x)) - E(\xi_{ab}(x))}{E(\xi_{ab}(x))} \qquad (3)$$
where $m$, $k$ are the lengths of the smoothing windows for the energy and the LTSV respectively,
the values $a, b, c, d$ are the percentile values that define the windows, and $R$ is the analysis
window. The expression $E(\xi_{cd}(x))$ stands for the energy measurement through the LTSV
feature; in other words, it is the estimate of the energy based on the frames that fall into the
$c\%$-$d\%$ region of the sorted LTSV.
3.1.3 Pitch
Pitch is another feature we employ for constructing regressors for our models. Through
pitch detection we can distinguish the speech regions of the signal and then exploit this
information to create additional regressors for our models. We use the openSMILE software
([27]) to detect the pitch regions of the signal. Since pitch transitions are abrupt (e.g., due to
unvoiced regions), we smooth the outcome of pitch detection by applying median filtering.

The pitch-based regressors are formulated in a similar fashion to those constructed from
LTSV. We first apply smoothing windows to both the energy and the pitch frames. Then,
we choose a window defined by two percentile values (e.g., 90% and 95%) of the highest
pitch values, and find all the frames that fall in that window. Finally, we compute the
energy of those frames. By choosing two different percentile values we take another energy
measurement, and then form the regressor as:
$$p_{m,k}^{ab,cd} = 10\log_{10}\frac{E(f_{cd}(x)) - E(f_{ab}(x))}{E(f_{ab}(x))} \qquad (4)$$
where $m$, $k$ are the lengths of the smoothing windows for the energy and the pitch respectively,
while the values $a, b, c, d$ are the percentile values that define the windows. Finally,
$E(f_{cd}(x))$ is the energy measurement based on the frames where $c\%$ to $d\%$ of the pitch is
concentrated.
3.1.4 Voicing Probability
The final measure we employ to identify speech regions is the voicing probability ([28]).
Voicing probability assigns a value to every time frame that denotes the probability that
speech exists in that frame. We calculate the voicing probability of each frame in the signal
using the openSMILE software ([27]).

We create regressors based on voicing probability using the methodology described in the
two previous sections. These regressors can be expressed as:
$$c_{m,k}^{ab,cd} = 10\log_{10}\frac{E(g_{cd}(x)) - E(g_{ab}(x))}{E(g_{ab}(x))} \qquad (5)$$
where $m$, $k$ are the lengths of the smoothing windows for the energy and the voicing
probability respectively. The values $a, b, c, d$ are the percentile values that define the
windows. Similar to the previous cases, $E(g_{cd}(x))$ is the measurement of the energy based on
the frames where $c\%$ to $d\%$ of the voicing probability is ranked.
3.2 Regression models for global SNR estimation
Based on the features described, we created regression models for different types of noise
(white, pink, car interior, machine gun, and babble speech noise). We chose these types
of noise to test how our method performs under both stationary and nonstationary noise
conditions.

We initially investigate two use cases. In the first case, we assume that we already know
what kind of noise corrupts the signal. In the second case, we have no prior knowledge about the kind
of noise that corrupts the signal. Instead, we use a classification scheme to identify the noise
type and use the appropriate regression model.
In this scenario, where the type of noise that alters the speech signal is considered known,
we create a regression model which takes into account all the estimations (see eq. (1)-(5))
based on the features described in section 3.1 and weights them accordingly. The dependent
variable in this regression model will be our final SNR estimation $\widehat{\mathrm{SNR}}_{f}$, based on the
formula:

$$\widehat{\mathrm{SNR}}_{f} = \boldsymbol{\alpha}^{T}\boldsymbol{l} + \boldsymbol{\beta}^{T}\boldsymbol{v} + \boldsymbol{\gamma}^{T}\boldsymbol{p} + \boldsymbol{\delta}^{T}\boldsymbol{c} \qquad (6)$$

where $\boldsymbol{l}, \boldsymbol{v}, \boldsymbol{p}, \boldsymbol{c}$ are vectors of the regressors calculated from equations (1) to (5)
respectively, $\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\delta}$ are the corresponding vectors of regression coefficients, and
$\boldsymbol{x}^{T}$ is the transpose of vector $\boldsymbol{x}$.
Notice that the regressors in our model are not the features themselves (e.g., raw energy,
raw pitch, etc.). Since the relation between the features we used and the global SNR is
nonlinear, a linear regression model on the raw features would perform poorly in this case. On the other hand,
the log ratios (eq. (1)-(5)) which act as the regressors will have a linear relationship with
the dependent variable, which is the true SNR. Moreover, as long as some of the regressors
provide relatively accurate estimates, the model will yield an accurate final SNR estimate
by adjusting the coefficients.
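A sketch of the training and prediction steps under this formulation (assuming the 312 regressor values per utterance are stacked into a matrix; this is a plain least-squares fit, with the grouping into the four coefficient vectors implicit in the column ordering):

import numpy as np

def fit_snr_model(X, y):
    # X: (n_utterances, 312) matrix of regressor values (energy, LTSV, pitch,
    #    voicing columns); y: (n_utterances,) true SNR labels in dB.
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
    return coeffs

def estimate_snr(coeffs, x):
    # Final estimate: weighted sum of the "soft" SNR regressors, cf. eq. (6).
    return float(x @ coeffs)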
The total number of regressors we used in our models is 312 (24 from long-term energy,
216 from LTSV, 36 from pitch and 36 from voicing), and we estimate the features' coefficients
with ordinary least squares. The regressors result from a combination of smoothing window
lengths and regions of the features from which we make energy estimations.

In the case of Long-Term Energy and LTSV the window length ranges from 0.3ms to
1.8ms with a 0.3ms step, while in Pitch and Voicing Probability the window lengths are
0.9ms, 1.6ms, 2.2ms, 2.8ms, 3.4ms, and 4.1ms. The value pairs $a, b, c, d$ in eqs. (1)-(5) we
used to estimate the energies are shown in Table 1.
  a     b     c     d
 85%   95%    5%   15%
 80%   90%   10%   20%
 75%   85%   15%   25%
  5%   15%   85%   95%
 10%   20%   80%   90%
 15%   25%   75%   85%

Table 1: Percentile pair values of the windows from which we calculate the average energy
These values were the result of an experimental procedure. Our experiments showed that
adding more features (i.e., more smoothing windows, etc.) boosts the performance of the
estimation. Since this is work in progress, in the future we plan to provide a detailed
analysis of the impact each feature has on the model.
For every noise type we used 1680 clean speech files from the TIMIT database, sampled at
16kHz, in which we introduced silence periods randomly selected between 3 and 10 seconds
to create signals with unknown speech boundaries. Then we added noise at six SNR levels
(-5dB, 0dB, 5dB, 10dB, 15dB, 20dB), resulting in a total of 10080 training samples per
regression model.

For the KNN classifier we used 20 nearest neighbors (K=20) based on 13 MFCCs. We
used the same set of 1680 files (adding noise for every SNR level) to train the KNN classifier.
The final decision is made by calculating the probability of each class in every frame,
followed by a majority vote.
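A sketch of this classifier using scikit-learn (our choice of library; the thesis does not name an implementation):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_noise_classifier(train_mfcc_frames, train_noise_labels):
    # K=20 neighbors over 13-dimensional MFCC frames, one noise label per frame.
    knn = KNeighborsClassifier(n_neighbors=20)
    knn.fit(train_mfcc_frames, train_noise_labels)
    return knn

def classify_noise(knn, utterance_mfcc_frames):
    # Per-frame class probabilities followed by a majority vote over frames.
    probs = knn.predict_proba(utterance_mfcc_frames)  # (n_frames, n_classes)
    votes = probs.argmax(axis=1)                      # most probable class per frame
    return int(np.bincount(votes).argmax())           # majority vote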
We have tested our system for five different noise types. We randomly selected 150 files
from the TIMIT database (there was no overlap between the training and testing files). In
each file we introduced 3 to 10 second silence regions and then added noise at 6 different
SNR levels. We compared our method with the WADA and NIST SNR estimation methods
using the mean absolute error metric. In all cases we found that our method outperforms
the other methods.
Figure 1: Mean absolute error for White Noise (mean absolute error vs. SNR level, in dB, for the Regression, WADA, and NIST SNR methods).
In Figures 1, 2, 3 the results for white, pink and car interior noise are presented. Comparing
the mean absolute error of our method with those of the WADA and NIST SNR methods
for 6 different SNR levels, it is clear that our method provides better estimates at every SNR
level (the difference in error ranges from 0.3dB to 7dB).

Figure 2: Mean absolute error for Pink Noise.
In the case of machine gun noise (Figure 4) our method greatly outperforms the other
methods (the difference in mean absolute error is about 30dB). Both WADA and NIST SNR
fail to provide accurate estimates, as shown by their mean absolute error values. The
reason for this is that our method does not make any assumptions about stationarity. This
also indicates that our method can perform well across different noise types with different
characteristics.

Finally, in the case of Babble Speech Noise (Figure 5) we can see that only for 0dB does the
WADA method perform better. Since babble speech noise is similar to speech, some of our
features (i.e., pitch, voicing) fail when speech and noise are at the same energy levels. However,
our method gives better estimates overall.

Figure 3: Mean absolute error for Car Interior Noise.
The above results refer to the case where we know the type of noise that corrupts the
signal and we choose the appropriate regression model. In the second set of experiments we
used the same test set of files. In every case where we corrupted a signal with a noise that was
used for training the KNN classifier, the signal was correctly classified and the appropriate
regression model was used. Since our classifier achieved perfect accuracy for the given set
of noises, we tried corrupting signals with High Frequency noise (which was not used for
training the classifier). The classifier chose the regression model for white noise. In Figure
6 we can see the results when we corrupted signals with high frequency noise and used the
white noise regression model to estimate the SNR.

Figure 4: Mean absolute error for Machine Gun Noise.
In all the cases we examined, our method outperforms other state-of-the-art methods
when the kind of noise that corrupts the signal is known. When the noise is unknown, the
performance of our method depends on the outcome of the KNN classifier; for instance, in
the example of high frequency noise, if the classifier had chosen the regression model of machine
gun noise we would have failed to provide accurate SNR estimates. Generally speaking,
MFCCs do not generalize well, so in the unknown noise scenario the results left ample room
for improvement.
Figure 5: Mean absolute error for Babble Speech Noise.
Figure 6: Mean absolute error for High Frequency Noise, using the regression model of white noise.
4 Long-term SNR Estimation of Speech Signals in Known and Unknown Channel Conditions
In this section we expand upon our previous model. We will examine both scenarios (known
and unseen noise conditions) in more detail, as well as provide insight and analysis of the
results.
4.1 Known Noise Case
In this scenario, we assume we know the type of noise that alters the speech signal. Hence,
we can create a regression model which takes into account all the estimations (see eq.
(1)-(5)) based on the features described in the previous section and weights them accordingly.
The dependent variable in this regression model will be our final SNR estimation $\widehat{\mathrm{SNR}}_{f}$,
based on the formula of eq. (6), presented here again for convenience:
$$\widehat{\mathrm{SNR}}_{f} = \boldsymbol{\alpha}^{T}\boldsymbol{l} + \boldsymbol{\beta}^{T}\boldsymbol{v} + \boldsymbol{\gamma}^{T}\boldsymbol{p} + \boldsymbol{\delta}^{T}\boldsymbol{c} \qquad (7)$$

where $\boldsymbol{l}, \boldsymbol{v}, \boldsymbol{p}, \boldsymbol{c}$ are vectors of the regressors calculated from equations (1) to (5)
respectively, $\boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\delta}$ are the corresponding vectors of regression coefficients, and
$\boldsymbol{x}^{T}$ is the transpose of vector $\boldsymbol{x}$.
Notice that the regressors in our model are not the features themselves (e.g., raw energy,
raw pitch, etc.). Since the relation between the features we used and the global SNR is
nonlinear, a linear regression model on the raw features would perform poorly in this case. On the other hand,
the log ratios (eq. (1)-(5)) which act as the regressors will have a linear relationship with
the dependent variable, which is the true SNR. Moreover, as long as some of the regressors
provide relatively accurate estimates, the model will yield an accurate final SNR estimate
by adjusting the coefficients.
The key idea behind our approach is that by studying the effects of noise on different
aspects of the speech signal, we can gather information about its impact on them. This, in
turn, enables us to create a valid global SNR estimation method. Utilizing this insight, we
create a regression model for every type of noise that we examine.

Our method differs from others in the literature because it does not rely on specific
characteristics of the noise (e.g., stationarity). Hence, it can easily be applied to any channel
condition. In many real-life situations it is feasible to gather noise measurements (e.g., car
interior noise, jet cockpit noise, etc.). However, there are also scenarios where we do not
know the type of noise that corrupts the speech signal. In the following section we present
how our system handles such scenarios.
4.2 Unknown Noise Case
Information about the type of noise that corrupts the signal may not always be available.
Therefore, we also developed a procedure to estimate the SNR with no prior knowledge of the
channel noise conditions.

Since we do not know the noise conditions beforehand, we cannot directly use a noise-specific
regression model to estimate the SNR. A simple approach would be to estimate
the SNR using every regression model at our disposal and take the average; however, this
approach could lead to an inaccurate estimate, e.g., if the bulk of the regression models are
derived from stationary types of noise and the test signal is corrupted by impulsive noise.
An alternative approach would be to detect the type of noise in the channel, and then use
the appropriate regression model for SNR estimation. This method is more robust, since it
uses only the appropriate regression model. Eamdeelerd et al. in [31] use Bark scale features
to train a KNN classifier to find the type of noise that alters a speech signal, while in [27]
we follow a similar approach using MFCC features.
Both these methods provided good results when the test signal's noise was included in
the KNN noise set; however, they had poor generalization properties due to the sensitivity of
MFCC and Bark scale features to noise conditions. They performed poorly when the test signal
was altered by noise which was not part of the training set. To overcome the shortcomings of
the previous methods, we implemented a noise selection scheme based on a DNN classifier.
In this method, the DNN makes a decision about which regression model will be used, and
the SNR is estimated based on that model. This scheme yields good results even when the
noise that corrupts the signal is not part of the DNN training set.

Since the regression models were trained using percentile values of different features,
we used a similar feature set to train the DNN. In order to train the DNN, we used files
corrupted by various types of noise at different SNR levels. From every file we extracted
percentile values of the long-term energy (i.e., the values at 5%, 10%, etc.), LTSV, pitch, and
voicing probability. Moreover, we split the spectrum into eight sub-bands, calculated
the average energy and LTSV in the sub-bands, and extracted percentile values of the
long-term energy and LTSV in every sub-band. We did not calculate pitch and voicing probability
at the sub-band level, since they did not add sufficient information, as revealed by our initial
experiments.
4.3 Noise Dependent Regression Model Training
We created regression models for fifteen noise types from the NOISEX-92 database (see
Table 2).

To train a noise-specific regression model we used 1680 files from the TIMIT database,
sampled at 16kHz. In every file we introduced silence periods randomly selected between 3
and 10 seconds to create signals with unknown speech boundaries. Following that, we added
each aforementioned noise at six SNR levels (-5dB, 0dB, 5dB, 10dB, 15dB, 20dB), resulting
in a total of 10080 training files.
NOISE TYPES
WHITE
PINK
SPEECH BABBLE
TANK
MILITARY VEHICLE
CAR INTERIOR
DESTROYER ENGINE ROOM
DESTROYER OPERATIONS ROOM
F16 COCKPIT
FACTORY FLOOR 1
FACTORY FLOOR 2
HIGH FREQUENCY
MACHINE GUN
JET COCKPIT 1
JET COCKPIT 2
Table 2: Types of noise for which we have created regression models for global SNR estima-
tion
To form the long-term energy regressors (eq. (1)) we first find the energy in windows of
25ms with a 10ms shift (using a Hamming window on the original signal). These regressors
are parametrized by the length of the smoothing window and the percentile values that define
the measurement windows. The smoothing window length (parameter $m$ in eq. (1)) ranges
from 5 frames to 30 frames with a step of 5, while the percentile values (parameters $a, b, c, d$)
are presented in Table 3.

In the case of the LTSV regressors, we first compute the short-time spectrum using a Hamming
window of 25ms length with a 10ms shift and find the energy in each frame. We tried three
different lengths of windows for energy smoothing (parameter $m$ in eq. (3)): 10, 20, and 30
frames. Then, we extract different sets of LTSV features using different analysis windows
(parameter $R$ in eq. (3)), over 10, 15, and 20 energy frames. We applied six different window
lengths for LTSV smoothing (parameter $k$ in eq. (3)), from 5 to 30 with a step of 5 LTSV frames.
We compute $E(x)$ using the measurement windows defined by the quadruple $a, b, c, d$,
presented in Table 3.
Finally, we produce measurements of pitch and voicing probability in windows of 50ms
with a 10ms shift. We used six different window lengths (parameter $k$ in eq. (4), (5)) to
smooth pitch and voicing probability, from 5 to 30 with a step of 5 pitch/voicing probability
frames. Energy is computed in 25ms windows with a 10ms shift, and we smooth the energy
frames by applying a moving average window of 10 frames. The percentile values $a, b, c, d$
that define the measurement windows are presented in Table 3.

Through this parametrization we created 312 regressors (24 from long-term energy, 216
from LTSV, 36 from pitch and 36 from voicing). We trained every regression model (eq. (6))
with ordinary least squares.
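The count of 312 follows directly from this parametrization; as a quick check:

# Long-term energy: 6 smoothing lengths (m = 5..30, step 5) x 4 percentile quadruples
energy = 6 * 4                 # 24
# LTSV: 3 energy smoothings (m) x 3 analysis windows (R) x 6 LTSV smoothings (k)
#       x 4 percentile quadruples
ltsv = 3 * 3 * 6 * 4           # 216
# Pitch and voicing probability: 6 smoothing lengths (k) x 6 quadruples each
pitch = voicing = 6 * 6        # 36
assert energy + ltsv + pitch + voicing == 312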
The specific parametrization values are the result of both design and experimental
investigation. We used multiple window lengths for smoothing in an effort to maintain the
information of the features but also provide robust estimates. Moreover, the values we chose
for $a, b, c, d$ are able to indicate speech and non-speech regions in the signal. Depending
on the type of noise, some regressors give more accurate estimates than others; e.g., the
regressors that give accurate estimates for stationary types of noise do not perform well for
impulsive types, and vice versa. However, in every noise case the regression model will learn
which regressors perform better, and the regression coefficients will be trained accordingly.
                                 a     b     c     d
Long-Term Energy and            85%   95%    5%   15%
Long-Term Signal Variability    80%   90%   10%   20%
                                 5%   15%   85%   95%
                                10%   20%   80%   90%

Pitch and Voicing Probability   85%   95%    5%   15%
                                80%   90%   10%   20%
                                75%   85%   15%   25%
                                 5%   15%   85%   95%
                                10%   20%   80%   90%
                                15%   25%   75%   85%

Table 3: Percentile values that define the measurement windows in equations (1), (3), (4), and (5)
4.4 DNN Training for Noise-Model Selection
When the channel conditions are unknown we use a DNN to decide which regression model
will be applied to estimate the SNR.
The work in [22] addresses the problem of noise classification in speech signals using Bark
scale features and a KNN classifier, and we followed a similar approach in [23]. This is a
classic pattern classification task, where the tested signal is corrupted by one of the noises
that the system was trained on. The problem with this approach arises when the signal is
corrupted by a type of noise that the system was not trained on. Although examples can be
found where this approach provides satisfactory results (e.g., in [23], when our system, which
is trained without signals corrupted by High Frequency noise, encounters a signal altered
by High Frequency noise, it uses the regression model for white noise), in general it has poor
generalization properties due to the sensitivity of Bark scale features and MFCCs to noise.
To overcome these issues we use a DNN for model selection.
To train the DNN we used 1680 speech files from the TIMIT database. To each file
we added silence periods randomly selected between 3 and 10 seconds to create signals with
unknown speech boundaries, as well as different types of noise (see Table 2) at six SNR levels
(from -5dB to 20dB with a step size of 5dB). Notice that this training set differs from the one
we used for the regression models, because of the random silence periods and the randomness
of the noise. In our setup the DNN has two hidden layers, with 392 neurons in the first layer and
198 neurons in the second, while the output layer uses a softmax function.
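A sketch of this architecture in PyTorch (our choice of framework; the hidden-layer activations are not specified in the text, so the ReLU here is an assumption):

import torch.nn as nn

def build_model_selector(n_features, n_noise_models):
    # Two hidden layers (392 and 198 neurons) and a softmax output that scores
    # the available noise-dependent regression models.
    return nn.Sequential(
        nn.Linear(n_features, 392), nn.ReLU(),
        nn.Linear(392, 198), nn.ReLU(),
        nn.Linear(198, n_noise_models),
        nn.Softmax(dim=1),
    )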
In order to test the robustness of our scheme we followed a noise leave-one-out cross-validation
approach. We excluded all the files that were corrupted by a particular type of
noise from the training set, and repeated this procedure for all the noise types listed in
Table 2. Thus, when the input to our system is a signal corrupted by a specific type of noise
(e.g., white noise), the DNN cannot choose the regression model trained on signals altered
by white noise; instead it will choose another "similar" one. Under this training procedure our
DNNs are trained on 141120 samples (1680 files × 14 types of noise × 6 SNR levels).

Figure 7: Average mean absolute error computed for every noise model. Rows denote the
test conditions, while columns denote the noise dependent regression model. We use the
mark "X" to denote high error. For example, using the machine gun model to estimate the SNR
of utterances corrupted by white noise fails to provide meaningful results.
Moreover, from every training file we extract the following feature set. We calculate the
average long-term energy, three "versions" of LTSV (using three analysis windows of 10, 15, and
20 frames), pitch, and voicing probability, and take the percentile values of those quantities
from 0 to 100 with a step of 5. Furthermore, we split the spectrum into eight frequency
bands, calculate the average energy and LTSV (using an analysis window of 10 frames) in
each band, and find percentile values of these quantities from 0 to 100 with a step of 5.
At this point we want to emphasize that noise selection per se is not our goal; instead, the
task here is to choose a regression model that will provide the most accurate SNR estimation.
This is the reason that the training of the regression models and of the DNN uses different
transforms of the same feature set. Although at this point we do not have theoretical results
to justify this scheme, our idea is backed up by the experimental results.

Figure 8: Average mean absolute SNR estimation error across all dB levels for all the channel
conditions. "Regr" refers to the assumed known channel conditions case, where we can apply
the noise-specific regression model in a straightforward manner to estimate the SNR. On the
other hand, "DNN" refers to the case of unknown channel conditions, where a DNN chooses
the "appropriate" regression model. "WADA" ([13]) and "NIST" ([26]) are SNR estimation
methods we compare against. When the channel conditions are assumed known our method
("Regr") outperforms all the others, while when they are unknown our method ("DNN")
provides better results in all cases but the car interior and babble speech noise types.
4.5 Experimental Results
To test the performance of our system in both known and unknown noise scenarios we used
150 files (ensuring there is no overlap between training and test files). To each of these files
we added silence periods randomly selected to be between 3 and 10 seconds, to create signals
with unknown speech boundaries, and noise at six SNR levels (from -5dB to 20dB with a
step size of 5dB). We compared our proposed method against the Waveform Amplitude
Distribution Analysis (WADA) method ([13]) and the NIST SNR method ([26]).
4.6 Performance under known noise conditions
In the first set of experiments we assume we know the noise conditions in the channel and
use the appropriate regression model. For each type of noise we use 900 test files (150 files
× 6 SNR levels) to measure the mean absolute error for every SNR level, as well as the average
mean absolute error of the estimation across all SNR levels.

First, we wanted to check the validity of every batch of regressors. To that end, we
built regression models consisting only of regressors coming from a single feature set, and
compared their performance with the regression model using regressors from all the features.
The performance is tested in terms of the estimation mean absolute error across all SNR
levels. Although we checked the validity of the regressors for all the models, in Table 4 we
present the results only for some noise types due to space limitations.
With the exception of white noise, where the voicing probability regressors perform
marginally better, for every noise type the model that estimates the SNR using only energy
regressors outperforms those that use only regressors from any other single feature set.
Figure 9: Mean absolute error for different types of noise. "D" and "R" represent our method
for unknown and known channel conditions respectively, while "W" and "N" stand for WADA
and NIST SNR. Darker colors denote lower values of the mean absolute error.
Figure 10: Mean absolute error for different dB levels of SNR. "D" and "R" represent our
method for unknown and known channel conditions respectively, while "W" and "N" stand
for WADA and NIST SNR. Darker colors denote lower values of the mean absolute error.
Machine Gun noise is not included here, since the errors of WADA and NIST at every dB
level are not comparable with those of our method.
                               Feature Set
Type of Noise        Energy    LTSV     Pitch    Voic. Pr.  Complete
White                0.9585    1.0465   1.3469   0.9044     0.5702
Babble Speech        1.8564    2.0590   2.0326   3.2018     1.6693
Car Interior         2.0409    2.2384   2.2292   2.0518     1.5543
Tank                 1.1419    1.5214   1.4487   1.1838     0.9452
Factory Floor 1      2.1733    2.7141   2.3220   2.2733     1.8493
Destroyer Engine     1.3699    1.6465   1.6923   2.4037     1.2099
Machine Gun          2.4279    2.9887   2.8907   2.5444     1.9333

Table 4: SNR mean absolute error of models using regressors from different feature sets
On the other hand, the performance of the models that use regressors only from LTSV,
Pitch, or Voicing Probability varies depending on the noise conditions. Moreover, we tested
the performance of models using regressors from combinations of two and three feature sets;
however, in every case the model that uses regressors from the complete feature set
outperforms the others.
In Figure 7, we compare the performance of every noise-specific model on every
other noise condition. Every row of Figure 7 stands for signals corrupted by the annotated
noise, and every column stands for the noise dependent model. We observe that in every
case the model corresponding to the actual conditions that corrupt the signal gives the
best overall performance. Additionally, the machine gun model is not suitable for other
noise conditions. This is attributed to its impulsive characteristics, which separate it from the
rest of the noise pool. Notice, however, that in many cases there are other models that give
comparably satisfactory results. Our second set of experiments (SNR estimation under unknown
noise conditions) is going to exploit this fact.
Next, we compare our method with WADA and NIST. For each type of noise we estimated
the SNR for 900 test files (150 files × 6 SNR levels). The mean absolute error of the
estimation for all noises and SNR levels is shown in Figure 8. It is clear that our method
outperforms WADA and NIST SNR for every type of noise, not only on average but for every
dB level as well (see Figure 10). Especially in cases where stationarity assumptions fail (e.g.,
Machine Gun noise), our method still provides accurate estimation (see Figures 8, 10), which
demonstrates its generalization properties. For babble speech noise, WADA performs better
than our model when the SNR is at 0dB, as shown in Figures 9, 10. The reason is that the pitch and
voicing probability regressors underperform in this noise condition, which resembles speech,
especially when the energy of speech is the same as the energy of noise. However, the
average mean absolute error across all SNR levels is lower for our system.
Since the average duration of TIMIT files is approximately 3 seconds, we examine the
behaviour of our models on utterances of increased duration and compare it with WADA. The
increased duration signals were created by concatenating TIMIT utterances and smoothing
the transitions between them. In Figure 11 we present the performance for a subset of the
models for clearer demonstration (although the results were similar across all the models).
We observe that WADA seems to improve its performance for longer utterances, while the
performance of our models remains consistent across different utterance lengths. Our method
provides better results across all durations. NIST SNR was omitted from the comparison
since it failed to provide results for longer utterances.
Furthermore, we study the performance of the models for different silence periods. The
models were trained by inserting silence periods of random length, from 3 to 10 seconds, into
the signals, to simulate signals with unknown speech boundaries. In this set of experiments, we
examine the performance of our models on TIMIT utterances with inserted non-speech
segments of predetermined lengths. In Figure 12, we show our findings for a subset of models for
better presentation (although our experiments showed similar results across all models). In all our
experiments the NIST SNR estimates were worse than those from WADA, and by extension our
models, and thus we do not present its performance. Notice that WADA performs better for
smaller non-speech segments, while it deteriorates significantly as the duration of the non-speech
segments increases. On the other hand, our models outperform WADA in all cases except
those without silence regions (0 duration of the non-speech segment). In Table 5, we examine
the performance of the Car Interior model on signals without silence regions at individual SNR
levels. We observe that performance deteriorates for high SNR values. This is to be expected,
since our features and models were designed to distinguish speech from non-speech regions
in the signal. Our features were designed to operate in real-life applications, where we cannot have
exact speech boundaries and speech is more fluent; even the NIST SAD evaluation takes into
account some seconds of silence at the beginning and end of speech regions. We remind
the reader that both the experiments based on utterance duration and those based on non-speech
segment duration were performed using our initial models. If prior knowledge were available
about utterance duration, etc., one could tailor the models to those conditions to improve
the performance.
Figure 11: Mean absolute error estimation for utterances of various lengths, assuming we
know the type of noise that corrupts the signal.
Figure 12: Mean absolute error estimation for utterances with various durations of non-speech
segments, assuming we know the type of noise that corrupts the signal.
4.7 Performance under unknown noise conditions
In our second set of experiments we assume that the channel conditions are not known. Thus,
we do not know a priori which regression model to use. This decision is made by employing
a DNN. To test the performance we again use 900 test files. For each file the DNN is used
to decide which regression model to use, and then the SNR is estimated according to that
model.

In order to test the performance under unknown noise conditions, we follow a leave-one-out
method. For example, when we test files corrupted by white noise, the DNN is trained
on the remaining set; thus the DNN cannot choose the regression model for white noise.
SNR   -5      0      5      10     15     20
MAE   2.453   2.237  2.047  3.493  6.787  10.858

Table 5: SNR mean absolute error of the Car Interior model for signals without a silence
component
Comparing the results of our method in the known noise and unknown noise cases, we
notice that the average mean absolute error is smaller for every type of noise when we have
a priori knowledge about the channel conditions (see Figure 8). Of course this is expected, since
in this scenario we can use the appropriate regression model in a straightforward fashion,
while in the second case the DNN chooses another model to estimate the SNR.
Figure 13: Mean absolute error for SNR estimation of signals corrupted by Destroyer Engine
Room noise. First the DNN has only one model to choose from (Tank noise), then
the DNN can choose between the models for Tank and Military Vehicle, and so on.
Moreover, we compare the performance of our method with WADA and NIST SNR. In Figure 8, we show that for all noise types, with the exception of car interior noise and babble speech noise, our method produces a smaller average mean absolute error compared to the other methods. This is attributed to the fact that we used a wide variety of noises to train the DNN. If we reduce the pool of noises (which in turn reduces the number of "similar" noises) and let the DNN choose amongst fewer models, the performance of our scheme suffers (see Figure 13). However, for the setup we followed, our method provides accurate SNR estimation under unknown noise conditions.
From Figure 13, we observe that the DNN chooses an appropriate model for SNR estimation as long as it has a diverse pool of models to choose from. On the other hand, the DNN does not necessarily choose the model that would minimize the SNR estimation error; instead it chooses a noise type that is "similar" to the noise that corrupts the input signal. Since noise similarity is not well studied in the literature, it is important to understand this mechanism, since it can benefit many algorithms that are tuned for specific noise conditions.
Furthermore, we repeated the experiments based on utterance length and duration of non-speech segments; the results are presented in Figure 14 and Figure 15 respectively. In both experiments we observe a pattern similar to the experiments performed when the noise that corrupted the signals was known. For increased utterance duration (Figure 14), WADA seems to improve as utterance duration increases, while our system remains fairly constant. Notice that in this case we do not use the "oracle" noise model; the DNN chooses which model to use, since we assume we do not have prior knowledge about the type of noise that corrupts the signal.
In the second experiment (Figure 15) we observe that WADA deteriorates as the duration of the non-speech segment increases. On the other hand, our system performs better than WADA in all cases except those where the duration of non-speech segments is 0 seconds. This behaviour is similar to the case where we assume we know the type of noise that corrupts the signal (Figure 12). Our features and models were designed to distinguish speech from non-speech regions in the signal, since in real-life applications we cannot have exact speech boundaries. We remind the reader that in this experiment we assume we do not know the type of noise
Figure 14: Mean absolute error estimation for utterances of various lengths (3, 6, 9, and 12 seconds), assuming no prior knowledge about the type of noise that corrupts the signal. The plot compares the regression models against WADA for White, Destroyer Engine Room, and Jet Cockpit 2 noise.
that corrupts the signal.
Since our system was evaluated on the NOISEX-92 database and TIMIT utterances, of which the former is biased toward military and machine noise while the latter shares the recording conditions of our training set, we need to examine how well our system generalizes to new conditions. To that end, we designed an experiment where we corrupt speech utterances from the Wall Street Journal (WSJ) corpus with noises from the DEMAND noise database [29]. In this case, our DNN is able to choose any of the 15 models corresponding to the NOISEX-92 noise conditions. We used 150 utterances from the WSJ, which we corrupted with DEMAND noises at 6 different SNR levels. We compare our approach with WADA (NIST SNR was not compared in this experiment since it was not providing meaningful results) and present the results in Figure 16.
Figure 15: Mean absolute error estimation for utterances with various durations of non-speech segments (12, 8, 5, 2, 1, and 0 seconds), assuming no prior knowledge about the type of noise that corrupts the signal. The plot compares the regression models against WADA for Pink, High Frequency, and Car Interior noise.
Type of Noise        MAE     Type of Noise        MAE
White                4.90    Pink                 4.79
Babble Speech        4.82    Machine Gun          262.34
Car Interior         4.96    Mil. Vehicle         4.88
Tank                 4.86    High Freq.           4.93
Factory Floor 1      5.20    Factory Floor 2      4.91
Destr. Eng. Room     5.01    Destr. Op. Room      4.97
Jet Cockpit 1        5.58    Jet Cockpit 2        6.60
F16 Cockpit          8.07    TMetro               3.17
Table 6: SNR mean absolute error of regression models on WSJ utterances corrupted by TMetro noise.
Figure 16: Mean absolute error for SNR estimation of WSJ utterances corrupted by DEMAND noises (PStation, DLiving, DKitchen, NPark, PCafeter, PResto, STraffic, and TMetro), comparing our DNN-based method against WADA. The DNN model can choose any of the 15 models created from the NOISEX-92 database.
Our system outperforms WADA in all the cases we tested. However, comparing the results of Figures 8 and 16, we observe that the system performs better on signals corrupted by noises from the NOISEX-92 database. The reason for this difference is that the NOISEX-92 database is biased towards military and machine types of noise, so the model chosen by the DNN is going to be a better match for the noise that corrupts the signal. To confirm this claim we tested the performance of individual models on utterances corrupted by the TMetro noise and report the results in Table 6.
We notice that every individual model from the NOISEX-92 database fails to give accurate results (in most cases the error is close to 5 dB). However, a model trained with TIMIT utterances corrupted by TMetro noise yields the best result. This means that the performance of our system could be improved if the noise pool were expanded to include similar noise types.
5 Global SNR Estimation of Speech Signals for Unknown Noise Conditions using Noise Adapted Nonlinear Regression
In this work, we propose a new method to estimate SNR which is not dependent on specific noise conditions. To that end, we perform a non-linear regression (to allow for more flexibility in the estimation procedure) by employing a neural network that accepts a feature set based on energy ratios. These features are able to capture the proper information in the signal for accurate SNR estimation under known noise conditions [30]. However, training such a network for multiple noise conditions is a challenging task, since the input features are dependent on noise and this information is not represented in the features. We use ivectors [31] to perform channel adaptation on neural networks, inspired by previous work on speaker adaptation [32]. Since ivectors contain both speaker and channel information, we follow a similar approach to adapt the network to specific channel conditions, by appending ivectors to our original feature set. Furthermore, we make no assumptions regarding speech boundaries in the signal.
5.1 Total Variability Model
The Total Variability Model (TVM) [31] is a popular framework for obtaining a fixed-dimensional vector-space representation, also known as an ivector, in order to capture differences in feature space distributions across variable-length sequences.
The Total Variability Model is most commonly used in applications such as speaker recognition [31] and language identification [33], [34], where ivectors are used to capture speaker or language variability across utterances, respectively. However, in applications where such variability is undesirable, the ivector representation can also be used as an appended input to a discriminative system, in order to enable it to adapt to the source of variability represented by the ivector. For example, appending ivectors to the input while training an acoustic model for speech recognition has been found to improve speaker independence of the recognition system [32], [35]. Similarly, in our case, the motivation for using ivectors is to enable the SNR estimation system to predict the SNR robustly, while staying independent of the variable noise conditions present in the utterance. Next we describe the model formulation.
Let $X = \{X_u\}_{u=1}^{U}$ be the collection of acoustic feature vectors in a dataset comprising $U$ utterances, where $X_u = \{x_{ut}\}_{t=1}^{T_u}$ denotes the feature vector sequence of length $T_u$ from a specific utterance $u$. Let $D$ be the dimensionality of each feature vector: $x_{ut} \in \mathbb{R}^D$. In the Total Variability Model (TVM), it is assumed that with every utterance $u$ there is an associated vector $w_u \in \mathbb{R}^K$ ($K$ being a design parameter), known as the ivector for that utterance. The conditional distribution of $x_{ut}$ given $w_u$ is a Gaussian Mixture Model of $C$ components with parameters $\{p_c, \; \mu_{uc} = \mu_c + T_c w_u, \; \Sigma_c\}_{c=1}^{C}$, where $p_c \in \mathbb{R}$, $\mu_c \in \mathbb{R}^D$, $T_c \in \mathbb{R}^{D \times K}$, and $\Sigma_c \in \mathbb{R}^{D \times D}$. The prior distribution for $w_u$ is assumed to be standard normal:

$$f(w_u) = \mathcal{N}(0, I)$$
Let $M_0, M_u \in \mathbb{R}^{CD}$ denote vectors consisting of the stacked global and utterance-specific component means $\mu_c$ and $\mu_{uc}$ respectively. Then, the TVM can be summarized as:

$$M_u = M_0 + T w_u$$

where $T \in \mathbb{R}^{CD \times K}$ is given as $T = \left[ T_1^T \; \ldots \; T_C^T \right]^T$.
TVM parameters are usually estimated using the Expectation Maximization algorithm [31]. However, we chose to estimate them using randomized Singular Value Decomposition (SVD) [36], since it is much faster compared to EM. For extracting the ivector of an utterance, we first obtain the statistics $N_u$, $F_u$:

$$N_{uc} = \sum_{t=1}^{T_u} \gamma_{utc}, \qquad N_u = [N_{u1} \; \ldots \; N_{uC}]$$

$$F_{uc} = \frac{1}{N_{uc}} \sum_{t=1}^{T_u} \gamma_{utc}\, x_{ut}, \qquad F_u = \left[ F_{u1}^T \; \ldots \; F_{uC}^T \right]^T$$
where $\gamma_{utc}$ are component posteriors obtained from the Universal Background Model (UBM). Let $\Sigma_c^{-1} = L_c L_c^T$ be the Cholesky decomposition of $\Sigma_c^{-1}$. Then, the statistics are normalized as:

$$\tilde{F}_{uc} = \sqrt{N_{uc}}\, L_c^T F_{uc}, \qquad \tilde{F}_u = \left[ \tilde{F}_{u1}^T \; \ldots \; \tilde{F}_{uC}^T \right]^T$$
Then, as in [36], the ivector for an utterance is extracted from its normalized statistics as:

$$w_u = \frac{1}{\sqrt{T_u}} \left( \frac{1}{T_u} I + \tilde{T}^T \tilde{T} \right)^{-1} \tilde{T}^T \tilde{F}_u$$

where $\tilde{T}$ is a normalized version of the matrix $T$ estimated during the randomized SVD algorithm.
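For concreteness, the following is a minimal numpy sketch of this closed-form extraction step. It assumes the UBM posteriors, the normalized statistics, and the randomized-SVD estimate of $\tilde{T}$ have already been computed; the function name and argument layout are ours, not part of any library.

    import numpy as np

    def extract_ivector(F_tilde, T_tilde, T_u):
        # F_tilde: (C*D,) normalized, stacked first-order statistics of one utterance
        # T_tilde: (C*D, K) normalized total variability matrix
        # T_u:     number of frames in the utterance
        K = T_tilde.shape[1]
        # w_u = (1/sqrt(T_u)) * (I/T_u + T~^T T~)^{-1} T~^T F~_u
        A = np.eye(K) / T_u + T_tilde.T @ T_tilde
        b = T_tilde.T @ F_tilde
        return np.linalg.solve(A, b) / np.sqrt(T_u)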
5.2 Using Neural Networks as nonlinear regression models for SNR estimation
In order to estimate the SNR of noisy signals we implemented a feed-forward neural network in TensorFlow [37]. The network consists of 4 hidden layers. Every layer has 1024 neurons with ReLU (rectified linear unit) activations [38]. A ReLU activation is defined as:

$$f(y) = \max(y, 0)$$

A major benefit of ReLU activations over the sigmoid is the constant value of the gradient whenever $y = Wx + b$ is greater than 0. In contrast, the sigmoid gradient goes to 0 as the absolute value of $x$ increases, which results in the vanishing gradient problem.
Usually, the Mean Square Error (MSE) cost function is used in regression settings. However, we opted for the Mean Absolute Error (MAE) cost function, since we achieved better SNR estimates in our experiments without affecting training time. The MAE cost function is defined as:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|$$

where $N$ is the number of data points, and $\hat{y}_i$, $y_i$ are the estimated SNR value and the ground truth respectively, for data point $i$. Parameter optimization was performed using the Adam (Adaptive Moment) optimizer [39] with $l = 10^{-5}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$, where $l$ is the learning rate and $\beta_1$, $\beta_2$ are hyper-parameters controlling the exponential decay rates of the moving averages of the gradient and the squared gradient. Gradient descent operated on mini-batches of 128 utterances for 20 epochs. Moreover, we utilized two GPUs (Graphics Processing Units) to train the network following a synchronous training strategy [40] through gradient averaging. Finally, the input layer had a dimensionality of 712 (our feature dimensionality), while the output layer of the network was a linear layer producing the SNR estimate.
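As an illustration, a minimal sketch of this network in the modern Keras API is shown below (the original implementation used a lower-level TensorFlow setup with multi-GPU gradient averaging, which is omitted here); the layer sizes, loss, and optimizer settings follow the text.

    import tensorflow as tf

    layers = [tf.keras.layers.Dense(1024, activation="relu", input_shape=(712,))]
    layers += [tf.keras.layers.Dense(1024, activation="relu") for _ in range(3)]
    layers += [tf.keras.layers.Dense(1)]  # linear output layer: the SNR estimate in dB
    model = tf.keras.Sequential(layers)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, beta_1=0.9, beta_2=0.999),
        loss="mean_absolute_error",
    )
    # model.fit(features, snr_labels, batch_size=128, epochs=20)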
To test the validity of our approach we created a "noisy" speech dataset by combining clean speech utterances from TIMIT and noises from the DEMAND database [29]. The DEMAND database contains 18 different noises, each with a duration of 5 minutes, drawn from urban environments. We used 2000 clean speech utterances, and to each utterance we added one of the 18 different types of noise at 9 different SNR levels, from -5 dB to 15 dB with a step of 2.5 dB. Moreover, in each of these utterances we added silence periods randomly selected to be between 2 and 4 seconds, to create signals with unknown speech boundaries. Thus, for each of the 18 noise types we have 18000 noisy utterances of both male and female speakers (2000 utterances × 9 SNR levels), resulting in 324000 noisy utterances, subsets of which serve as our training set. The test set is created in a similar fashion, using
100 clean speech utterances (50 male and 50 female) from TIMIT, ensuring that there is no
overlap in the training and testing utterances. Hence our test set consists of 900 utterances
per noise type.
In our first set of experiments we check the predictive capability of our features independently. To that end, we performed three experiments. In the first, we used just the energy ratios to train the network that provides SNR estimates; in the second we used only the ivectors; and in the last, their combination. In all of these experiments we used the complete training and testing sets. The results are summarized in Table 7. Obviously this set of experiments does not refer to the unknown noise scenario; however, we can draw some useful conclusions. We observe that both energy ratios and ivectors (to a lesser extent) contain valuable information for SNR estimation. However, the combination of the ratios with the ivectors provides the lowest MAE, since ivectors capture the channel conditions, enabling the network to fine-tune the SNR estimates for different types of noise.
Features    ratios   ivectors   combination
MAE         2.807    3.190      1.546
Table 7: SNR Mean Absolute Error for different feature sets.
Next, we test our method under unknown noise conditions. To achieve this goal, we follow a leave-one-out strategy. We exclude all the files in the training set that have been corrupted by a particular type of noise (e.g. Park noise). We train a UBM with the remaining set of 306000 noisy utterances, and extract ivectors. The UBM is built using 512 Gaussian components, while the extracted ivectors have a dimensionality of 400. Then, we extract the energy ratio features, combine them with the ivectors, and train the network. Our test set consists of utterances corrupted by the type of noise that was excluded from the training set (e.g. Park noise). Using this approach, we ensure that our model operates on utterances altered by a noise for which it has no prior information. Finally, we repeat this procedure for different types of noise.
We compare our method (Channel Adapted DNN) with WADA (Waveform Amplitude Distribution Analysis) [13] and DNN selection [30] for 8 different types of noise¹. For each noise type we calculate the average MAE across different SNR levels and present the results in Table 8.
Noise Type      WADA    DNN Selection   Ch. Ad. DNN
KITCHEN         4.663   2.976           2.835
LIV. ROOM       3.641   2.413           1.358
METRO           7.126   4.761           2.902
PARK            5.644   3.356           2.116
STATION         3.121   1.732           1.141
TRAFFIC         4.567   3.599           1.936
RESTAURANT      3.345   2.454           1.918
CAFE            3.766   2.691           1.106
Table 8: SNR Mean Absolute Error across different estimation methods, averaged over 9 different SNR levels.
We observe that our method consistently outperforms the other approaches and achieves low MAE for all the noise types we tested against. Furthermore, our results indicate that the ivectors hold information regarding the type of noise that is altering the signal, something that other applications (e.g., speech enhancement) can take advantage of. Our method is dependent on the size of the noise pool at our disposal, since we exploit similarities between noise conditions. For example, if the UBM and the network were trained with instances drawn from just one noise type, we believe that performance would drop significantly. However, our method achieves low MAE for many challenging noise conditions encountered in real life.
¹ We also compared against NIST SNR, but it failed to provide reasonable SNR estimates (e.g., the MAE was over 30 dB in most cases), thus we opted not to include it in our comparison.
6 Nonnegative Matrix Factorization (NMF) for Speech Enhancement
Real-life speech processing can be challenged by different environmental noise and channel conditions, degrading the performance of speech applications. In the last few years, data availability and the need for speech applications operating under a variety of noise conditions have renewed efforts on more sophisticated denoising schemes. For example, subspace methods with time and spectral constraints ([18], [19]), as well as Nonnegative Matrix Factorization methods ([20], [21]), are not restricted to specific noise types (e.g. stationary or quasi-stationary). However, all these methods require either prior information about the noise conditions that corrupt the speech signal or a robust estimate of the noise. This type of knowledge cannot always be obtained, especially if the data are collected from various sources and under varying noise conditions.
6.1 Introduction to NMF
Given a non-negative matrix $V \in \mathbb{R}^{K \times N}$, in our case the magnitude of the spectrogram, the goal of NMF is to find non-negative matrices $W \in \mathbb{R}^{K \times L}$ and $H \in \mathbb{R}^{L \times N}$ such that $V \approx WH$. This approximation is achieved by solving the following optimization problem:

$$\underset{W,\,H}{\text{minimize}} \;\; D(V \,\|\, WH) \qquad \text{subject to} \;\; W \geq 0, \; H \geq 0$$

where $X \geq 0$ means that all the elements of $X$ are nonnegative, while $D(\cdot \,\|\, \cdot)$ is a separable cost function such that:

$$D(V \,\|\, WH) = \sum_{k=1}^{K} \sum_{n=1}^{N} d(V_{kn} \,\|\, [WH]_{kn})$$

where $A_{ij}$ and $[A]_{ij}$ denote the element of matrix $A$ at row $i$ and column $j$. A common choice for the cost function is the $\beta$-divergence [41], defined as:

$$d_\beta(x \,\|\, y) = \begin{cases} \dfrac{1}{\beta(\beta - 1)} \left( x^\beta + (\beta - 1) y^\beta - \beta x y^{\beta - 1} \right) & \beta \in \mathbb{R} \setminus \{0, 1\} \\[2mm] x \log \dfrac{x}{y} - x + y & \beta = 1 \\[2mm] \dfrac{x}{y} - \log \dfrac{x}{y} - 1 & \beta = 0 \end{cases}$$
Notice that when $\beta = 2$, the above expression reduces to the Euclidean distance. In traditional approaches, the updates for $W$ and $H$ are alternated until convergence: first the cost function is optimized for $W$ with $H$ treated as a constant, then the cost function is optimized for $H$ with $W$ fixed. In this work, we will use the Euclidean distance ($\beta = 2$) for the cost function. The update equations in this case are:

$$W_{kl} \leftarrow W_{kl} \frac{[V H^T]_{kl}}{[W H H^T]_{kl}}, \qquad H_{ln} \leftarrow H_{ln} \frac{[W^T V]_{ln}}{[W^T W H]_{ln}}$$
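A minimal numpy sketch of these alternating multiplicative updates is given below; the small constant added to the denominators, a standard numerical safeguard, is our addition.

    import numpy as np

    def nmf_euclidean(V, L, n_iter=200, eps=1e-9, seed=0):
        # V: (K, N) nonnegative matrix (e.g. a magnitude spectrogram); L: number of atoms
        rng = np.random.default_rng(seed)
        K, N = V.shape
        W = rng.random((K, L)) + eps
        H = rng.random((L, N)) + eps
        for _ in range(n_iter):
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H held fixed
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W held fixed
        return W, H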
In the speech enhancement framework, NMF is applied in the following way. In the training phase, we compute a speech dictionary $W_{\text{speech}} \in \mathbb{R}^{K \times L}$ and a noise dictionary $W_{\text{noise}} \in \mathbb{R}^{K \times L}$ from their corresponding spectrogram magnitudes, where the design parameters $K$ and $L$ represent the number of frequency bins and the number of dictionary basis vectors respectively. We assume, without loss of generality, that both the speech and noise dictionaries have the same number of basis vectors $L$. In the testing phase, we estimate the activation matrix $H_{\text{noisy}} \in \mathbb{R}^{2L \times M}$ that best approximates the magnitude spectrogram of the noisy signal $V_{\text{noisy}} \in \mathbb{R}^{K \times M}$:

$$V_{\text{noisy}} \approx [W_{\text{speech}} \; W_{\text{noise}}]\, H_{\text{noisy}} \tag{8}$$

where $W_{\text{speech}}$ and $W_{\text{noise}}$ are fixed and retrieved from the training phase. Finally, the enhanced spectrogram magnitude $\hat{V}$ is calculated by:

$$\hat{V} = W_{\text{speech}} H' \tag{9}$$

where $H'$ is the $L \times M$ matrix consisting of the first $L$ rows of $H_{\text{noisy}}$, i.e. $H' = [h_1^T;\, h_2^T;\, \ldots;\, h_L^T]$, where $h_j^T$ is row $j$ of $H_{\text{noisy}}$.
Assuming that the magnitude spectrogram $V_{\text{noisy}}$ consists of $M$ frames, Equations (8) and (9) can be expressed as:

$$v_m \approx [W_{\text{speech}} \; W_{\text{noise}}]\, h_m \quad \forall m = 1, 2, \ldots, M \tag{10}$$

$$\hat{v}_m = W_{\text{speech}}\, h'_m \quad \forall m = 1, 2, \ldots, M \tag{11}$$

where $v_m$, $\hat{v}_m$ are the $m$-th frames of $V_{\text{noisy}}$ and $\hat{V}$ respectively, and $h_m$, $h'_m$ the $m$-th columns of $H_{\text{noisy}}$ and $H'$.
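The testing-phase procedure of Eqs. (8)-(11) can be sketched as follows; for brevity this sketch reuses the Euclidean multiplicative update for the activations (the GKL variant used later in this chapter differs only in the update rule), and the function name is ours.

    import numpy as np

    def nmf_enhance(V_noisy, W_speech, W_noise, n_iter=100, eps=1e-9, seed=0):
        # Dictionaries are fixed; only the activations H_noisy are estimated.
        rng = np.random.default_rng(seed)
        W = np.hstack([W_speech, W_noise])                # (K, 2L) combined dictionary
        H = rng.random((W.shape[1], V_noisy.shape[1])) + eps
        for _ in range(n_iter):
            H *= (W.T @ V_noisy) / (W.T @ W @ H + eps)    # Eq. (8): fit activations
        L = W_speech.shape[1]
        return W_speech @ H[:L]                           # Eq. (9): keep the speech part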
6.2 Noise Aware and Combined Noise Models for Speech Denoising in Unknown Noise Conditions
The motivation behind this work is to design a system that is able to operate under unknown noise conditions at different Signal to Noise Ratio (SNR) levels without requiring prior information about the noise. To that end, we developed two methods. The first method has two stages. In the first stage, we choose an appropriate pre-trained noise model to "match" the noise of the corrupted signal. The work in [22] addresses the problem of noise classification in speech signals using Bark Scale features, and we followed a similar approach in [23]. This is a classic pattern classification task, where the tested signal is corrupted by one of the noises that the system was trained on. The problem with this approach arises when the signal is corrupted by a type of noise that the system was not trained on. The sensitivity of Bark Scale features and MFCCs to noise results in poor generalization properties when such systems encounter unknown types of noise. To overcome these issues we use a method based on Long-Term Signal Variability (LTSV), introduced in [24]. Once we compute LTSV features on the test signal, we construct a histogram of its values and find the Kullback-Leibler (KL) distance of the signal's LTSV histogram to the other LTSV histograms in our training set. In the second stage of our system, we employ Nonnegative Matrix Factorization (NMF) to denoise the signal, using the noise model we selected in the first phase. NMF has been successfully used in speech denoising ([20], [21]) for a variety of noise types.
In the second method, we do not supply a chosen noise model to NMF. Instead we create a "combined" noise dictionary from a variety of noises, which is used by NMF for denoising. The intuition behind this approach is that if this combined dictionary contains noise types which are similar to the noise that corrupts the signal, then the atoms of the dictionary will adequately model the noise that corrupts the test signal.
6.2.1 Using Model Selection
In the first stage of our system, we choose an appropriate noise model to "match" the test signal. To achieve this goal, we first compute the LTSV values $L(m)$ for every frame $m$. LTSV is computed using the last $R$ frames (the analysis window) of the observed signal with respect to the current frame of interest, as shown in the following equations, presented here for convenience:
$$L(m) \triangleq \frac{1}{K} \sum_{k=1}^{K} \left( \xi_k(m) - \bar{\xi}(m) \right)^2, \qquad \bar{\xi}(m) = \frac{1}{K} \sum_{k=1}^{K} \xi_k(m)$$

$$\xi_k(m) \triangleq -\sum_{n=m-R+1}^{m} \frac{S(n, \omega_k)}{\sum_{l=m-R+1}^{m} S(l, \omega_k)} \log \frac{S(n, \omega_k)}{\sum_{l=m-R+1}^{m} S(l, \omega_k)}$$
where $S(n, \omega_k)$ is the short-time spectrum computed for the $n$-th frame over $k = 1, \ldots, K$ frequencies. In our experiments the length of the LTSV analysis window $R$ is 30 frames, while for the short-time spectrum we used a window of 25 ms with a shift of 10 ms. After calculating the LTSV values for each frame, we perform moving-average smoothing (with a window span of 20 frames) to eliminate abrupt transitions of the LTSV values. Then, we construct a normalized histogram $P$ of the smoothed LTSV with 151 bins (the number of bins was chosen heuristically as the result of an experimental procedure).
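A minimal numpy sketch of the LTSV computation, following the definition above, is given below; the small constants guarding the logarithms are our addition.

    import numpy as np

    def ltsv(S, R=30, eps=1e-12):
        # S: (N_frames, K) short-time power spectrum; returns L(m) per frame.
        # Frames earlier than R-1 lack a full analysis window and are left as NaN.
        N, K = S.shape
        L = np.full(N, np.nan)
        for m in range(R - 1, N):
            window = S[m - R + 1 : m + 1]                  # last R frames, (R, K)
            p = window / (window.sum(axis=0) + eps)        # normalize per frequency bin
            entropy = -(p * np.log(p + eps)).sum(axis=0)   # xi_k(m), length K
            L[m] = entropy.var()                           # variance across frequencies
        return L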
To build our training histogram set we used 300 utterances from the TIMIT database (including both male and female speakers). To each utterance we added fifteen different types of noise at five different SNR levels, from -5 dB to 15 dB with a step of 5 dB. Thus, we have a total of 1500 histograms for each of the fifteen different types of noise in the NOISEX-92 database [42], presented in Table 9.
Given a test signal corrupted by noise, we determine which noise model matches it by calculating the KL distances between the test signal histogram $P$ and those in our training histogram dataset $Q$:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{12}$$
Then we do a majority vote over the 30 shortest distances; the result of the majority vote determines the appropriate noise model. In practice, this is a K-nearest-neighbour algorithm where the data points are histograms and the distance is the KL divergence.

Table 9: NOISEX-92 noise types used in our experiments: White, Pink, Speech Babble, Tank, Military Vehicle, Car Interior, Destroyer Engine Room, Destroyer Operations Room, F16 Cockpit, Factory Floor 1, Factory Floor 2, High Frequency, Machine Gun, Jet Cockpit 1, Jet Cockpit 2.

We assume that the noise in the test signal did not appear in our training set. Thus, we simulate unknown noise conditions by excluding the noise that is in the test signal from the model selection process (e.g. if the signal is corrupted by pink noise, we exclude it from the selection process). Notice that this model selection approach is different from a traditional pattern classification task, since the test signal is corrupted by a noise that does not exist in our noise dataset.
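The selection stage therefore amounts to the short procedure sketched below, assuming the smoothed-LTSV histograms of the training set and their noise-type labels are available.

    import numpy as np
    from collections import Counter

    def kl_divergence(P, Q, eps=1e-12):
        # Discrete KL distance between normalized histograms, Eq. (12).
        return float(np.sum(P * np.log((P + eps) / (Q + eps))))

    def select_noise_model(P_test, train_hists, train_labels, k=30):
        # k-nearest-neighbour majority vote with the KL divergence as the distance.
        dists = np.array([kl_divergence(P_test, Q) for Q in train_hists])
        nearest = np.argsort(dists)[:k]                 # the k shortest distances
        votes = Counter(train_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]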
Once the model is selected, we employ NMF to denoise the signal. In [43], different NMF objective functions are explored, which lead to different NMF variants. The Generalized KL divergence (GKL), denoted as $D_{GKL}(\cdot \,\|\, \cdot)$, has been successfully used for audio source separation in [44]. Hence, in our experiments we use the GKL variant of NMF, which leads to the minimization of the objective function:
$$D_{GKL}(V \,\|\, WH) = \sum_{l,r} \left( V_{lr} \log \frac{V_{lr}}{(WH)_{lr}} - V_{lr} + (WH)_{lr} \right) \tag{13}$$
In the training phase we used NMF on clean speech spectrograms² $V_s$ to find a speech dictionary $W_s$ and the time activations $H_s$. Similarly, we calculated spectrograms $V_{n_i}$, with $i = 1 \ldots 15$, for the fifteen types of noise of Table 9, to find noise dictionaries $W_{n_i}$. The matrices $W_s$, $W_{n_i}$ have dimensionality $n_f \times n_b$. In our experiments we used $n_f = 513$ and $n_b = 80$, which are values commonly used in the literature (e.g. [21], [20]).
Each column of the dictionaries $W_s$, $\{W_{n_i}\}_{i=1}^{15}$ is a basis vector and represents a specific spectral "building block" of their respective signals. On the other hand, the matrices $H_s$, $\{H_{n_i}\}_{i=1}^{15}$ represent the time-varying activation levels of the basis vectors. To estimate $W_s$ and $H_s$ we used 1120 utterances of male speakers and 560 utterances of female speakers from the TIMIT database, while for $W_{n_i}$ and $H_{n_i}$ we used noise signals whose durations are approximately 3 minutes.
In the testing phase, we fix $W_s$, assuming that its basis vectors accurately describe speech. However, since we do not have prior knowledge about the type of noise, $n'$, that corrupts the signal, we cannot use $W_{n'}$ to describe its characteristics. Instead, we choose a noise $n''$ using the LTSV model selection process, and then use the corresponding noise dictionary $W_{n''}$. We expect that $W_{n''}$ will provide a good representation of the noise that corrupts the test signal. Once $W_{n''}$ is chosen, we form a "complete" dictionary $W_c = [W_s \; W_{n''}]$ whose basis functions will be used to represent the test speech signal, which is corrupted by an unknown type of noise.
Afterwards, we compute the spectrogram $V_t$ of the test signal. Having at our disposal $W_c$ and $V_t$, the goal is to find $H_t$ by minimizing the objective function $D_{GKL}(V_t \,\|\, W_c H_t)$ given

² Speech dictionaries created through NMF are usually gender dependent ([20]) or even speaker dependent ([45]). We followed a gender-dependent approach. We keep the notation $V_s$, $W_s$, $H_s$ for convenience, but the reader should keep in mind that they are gender dependent.
by equation (13). The multiplicative update rule for $H_t$ is given by:

$$H^t_{ij} \leftarrow H^t_{ij} \frac{\sum_k W^c_{ki} V^t_{kj} / (W^c H^t)_{kj}}{\left[ \sum_r W^c_{ri} \right]_\epsilon} \tag{14}$$

where $[\,\cdot\,]_\epsilon$ indicates that if the quantity within the brackets is less than $\epsilon > 0$ then it will be replaced with $\epsilon$, to prevent violations of the nonnegativity constraint and avoid divisions by zero.
Finally, we reconstruct the denoised spectrogram as $\hat{V}_s = W_s H^t_{1:n_b}$, using the speech basis functions along with the first $n_b$ rows of $H^t$ to approximate the target speech.
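One iteration of the update in Eq. (14) can be sketched as follows; the eps floor plays the role of the bracket operator, and the function name is ours.

    import numpy as np

    def gkl_update_H(V_t, W_c, H_t, eps=1e-9):
        # One multiplicative GKL update of H_t with the dictionary W_c held fixed.
        ratio = V_t / (W_c @ H_t + eps)                     # elementwise V / (W H)
        num = W_c.T @ ratio                                 # sum_k W_ki V_kj / (W H)_kj
        den = np.maximum(W_c.sum(axis=0), eps)[:, None]     # [sum_r W_ri]_eps
        return H_t * num / den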
6.2.2 Method Using Combined Noise Dictionary
The second system we implemented uses a combined noise dictionary. In the training phase, we perform NMF on the speech and noise types separately. The goal is to minimize both $D_{GKL}(V_s \,\|\, W_s H_s)$ and $D_{GKL}(V_{n_i} \,\|\, W_{n_i} H_{n_i})$, where $i$ is the noise type index.
In the testing phase, $W_s$ is again fixed. However, in this system we do not choose a noise dictionary. Instead we form a noise dictionary $W_a = [W_1 \; W_2 \; \ldots \; W_k]$ based on all the $k$ available noise types. To simulate unknown noise conditions, $W_a$ does not contain the dictionary corresponding to the noise type corrupting the signal. For example, if it is pink noise that corrupts the signal, then $W_a$ will not contain $W_{\text{pink}}$.
The intuition behind this approach is that the atoms of $W_a$ will compensate for the missing atoms corresponding to the noise type that actually corrupts the signal. Of course, for this approach to work, $W_a$ should be created in a way that contains noise types similar to the one that corrupts the signal. Hence, when creating $W_a$ one must include a variety of noise types so that $W_a$ represents a wide range of noises.
To denoise the signal we follow the approach described in the previous section: we compute the spectrogram $V_t$ of the test signal, we form $W_c$ as $W_c = [W_s \; W_a]$, and we minimize the objective function $D_{GKL}(V_t \,\|\, W_c H_t)$ given by (13) in order to find $H_t$. Finally, we reconstruct the denoised spectrogram $\hat{V}_s = W_s H^t_{1:n_b}$ to approximate the target speech.
Figure 17: PESQ score improvements of the system for a variety of noise types (Pink, Factory Floor 1, Factory Floor 2, Speech Babble, Car Interior, and Machine Gun) under different SNR levels (-5 to 15 dB), shown separately for (a) female and (b) male speakers. Red depicts the performance of the LTSV system, yellow the performance of the combined dictionary system, while blue stands for the performance when we use the "true" noise model.
6.2.3 Results and Discussion
To test the performance of our system we used 50 utterances of male speakers and 50 utterances of female speakers from the TIMIT database. Based on those utterances we created our test dataset by introducing different types of noise at five SNR levels, from -5 dB to 15 dB with a step of 5 dB. As a result, we have 250 male and 250 female noisy files for every type of noise (50 per SNR level).
We followed a leave-one-out cross-validation approach to simulate unknown noise conditions. For example, if the test signal was corrupted by white noise, then we remove the LTSV histograms corresponding to white noise from the first system, while in the second system the combined noise dictionary $W_a$ does not contain atoms from white noise.
To quantify our results we used the ITU Perceptual Evaluation of Speech Quality (PESQ) [46], a metric designed to match mean opinion scores of audio perceptual quality. Higher PESQ scores indicate better signal quality, and PESQ increments of the order of 0.5 offer noticeable improvements in terms of speech intelligibility [20].
In Figure 17, we present average PESQ improvements for six different types of noise. We notice that for all the noise types and across all SNR levels, both our systems improve the signal quality. We observe that in many cases the LTSV system gives comparable results to the "oracle" system that uses the true noise type, especially at low SNR levels. However, in the case of Machine Gun noise the LTSV system falls behind the oracle model because the noise pool does not contain similar noise types. Limited availability of diverse noise types can be a severe drawback for our system, since it relies on a large pool of noise models in order to select one that closely resembles the noise that corrupts the signal. On the other hand, the system that uses a combined noise dictionary is able to outperform the oracle system in most cases. This reinforces our assumption that the atoms of the combined dictionary compensate for the missing atoms of the noise that corrupts the signal. This is clearly demonstrated in the factory floor cases, as well as in the babble speech case. However, this system fails when the combined dictionary does not contain noise types similar to those corrupting the signal, as showcased in the Machine Gun case, in which the LTSV system performs slightly better. We observe a similar pattern in Car Interior noise, with the difference that the systems do not fail, since there are similar noise types in the noise pool. This phenomenon warrants further investigation toward a system that would use a method to guide the creation of the combined dictionary.
As shown in our experiments, we tested both systems on various noise types with different statistical properties across different SNR levels, and demonstrated that they improve signal quality (as measured by PESQ scores) consistently. The performance of the system warrants further theoretical and empirical investigation that can help formalize the design.
7 NMF Dictionaries as Convex Polyhedral Cones
One way to explore similarities between different noises is through their dictionary representations produced by NMF. Since the dictionaries are expressed as matrices with nonnegative elements, they can be interpreted as polyhedral convex cones where the dictionary elements are the extreme rays (generators) of the cone. That is, a dictionary $W$ defines a cone that is the set of all the conic combinations of its columns³.
By construction, the dictionaries $W_{\text{speech}}$, $W_{\text{noise}}$, and by extension their combination $[W_{\text{speech}} \; W_{\text{noise}}]$, contain only nonnegative values. Hence, they can be interpreted as generators of convex polyhedral cones in the positive orthant [47]. Given a matrix $P$, a convex polyhedral cone is the set defined by the conic combinations of its columns:

$$\Omega_P = \left\{ x : x = \sum_j \alpha_j P_j, \;\; \alpha_j \geq 0 \;\; \forall j \right\} = \{ x : x = P\alpha, \; \alpha \geq 0 \} \tag{15}$$

where $P_j$ are the columns of $P$, $\alpha_j$ are nonnegative constants, and $\alpha$ is a vector whose elements are the $\alpha_j$ values.
Since all the elements of $h_m$ in Eq. (10) are nonnegative (as a column of the nonnegative matrix $H_{\text{noisy}}$), $v_m$ is approximated as a conic combination of the atoms in $[W_{\text{speech}} \; W_{\text{noise}}]$. Therefore, in NMF the noisy frame is expressed as a point in the cone $\Omega_C$ generated by $C = [W_{\text{speech}} \; W_{\text{noise}}]$.
This insight is crucial for understanding how speech enhancement is achieved in the NMF framework. The noisy frame is decomposed into speech and noise components in the combined speech and noise cone $\Omega_C$. The noise dictionary will capture the noise-only information of the signal, separating it from the speech components.

³ It is trivial to determine whether every column is an extreme ray. In our experiments all the columns of the dictionaries were generators of the cone.
Figure 18: Convex polyhedral cones in the positive orthant (shown against x, y, z axes).
Thus, once the noisy frame $v_m$ is decomposed, Eq. (10), we retrieve the enhanced frame by keeping only the activations that correspond to the speech dictionary, Eq. (11).
It is clear that the quality of the enhanced signal depends on the ability of the cone $\Omega_N$, generated by $W_{\text{noise}}$, to accurately model the noise components of the signal. Hence, it is necessary to have prior knowledge about the type of noise that corrupts the signal. However, this is not always possible, and various methods have been proposed in the literature to address this issue. For example, the authors in [48] use a noise selection scheme to decide which dictionary to use in the denoising phase, while a similar approach has been used for SNR estimation in [30] and image clustering in [49]. Hence, investigating conic affinity measures could guide the design of such systems by selecting the appropriate noise through its cone representation.
7.1 Conic Affinity Measures
We will explore four conic affinity measures and their individual performance. The first one exploits the Euclidean distance of a point to a cone, the second is based on cosine similarity, the third takes into account the truncated Pompeiu-Hausdorff metric, and the fourth uses the ball-truncated volume of the cone.
Consider two cones $\Omega_A$, $\Omega_B$ generated by matrices $A$ and $B$. We assume without loss of generality that the columns of both matrices act as the extreme rays of the cones they generate. The first affinity measure is defined as the average Euclidean distance of each extreme ray in $\Omega_A$ to the cone $\Omega_B$:

$$\bar{d}(\Omega_A, \Omega_B) := \frac{1}{K} \sum_{k=1}^{K} d(a_k, \Omega_B) \tag{16}$$

where $a_k$ is an extreme ray of $\Omega_A$, $K$ the number of extreme rays, and $d(a_k, \Omega_B)$ the Euclidean distance of $a_k$ to the cone $\Omega_B$. In order to find the required distance we need to solve the following convex quadratic problem:

$$\underset{x}{\text{minimize}} \;\; \|Bx - a_k\|_2^2 \qquad \text{subject to} \;\; x \geq 0$$
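This problem is a nonnegative least-squares (NNLS) instance, so a sketch of the measure in Eq. (16) can be built directly on scipy's NNLS solver:

    import numpy as np
    from scipy.optimize import nnls

    def cone_distance(a_k, B):
        # Euclidean distance of point a_k to the cone generated by the columns of B:
        # min ||B x - a_k||_2 subject to x >= 0.
        x, residual = nnls(B, a_k)
        return residual

    def avg_cone_distance(A, B):
        # Eq. (16): average distance of the extreme rays of cone(A) to cone(B).
        return float(np.mean([cone_distance(A[:, k], B) for k in range(A.shape[1])]))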
In our case, the cones are generated by the NMF dictionaries. Since the atoms of those dictionaries can have different $\ell_2$ norms, we normalize all the atoms to unit $\ell_2$ norm in order to have consistent distance values, without affecting the performance in the denoising phase. Notice that smaller values of $\bar{d}(\Omega_A, \Omega_B)$ indicate that the two cones $\Omega_A$, $\Omega_B$ are closer in the multidimensional space in which they are defined.
The second conic affinity measure is based on pairwise cosine similarity between the extreme rays of the cones. For each of the two cones $\Omega_A$, $\Omega_B$, we form random conic combinations of their extreme rays to produce new points within their respective sets. For example, if $\Omega_A$ is generated by matrix $A \in \mathbb{R}^{M \times N}$ and every column of $A$ acts as an extreme ray of $\Omega_A$, then for random vectors $z_i = [z_{i1} \; z_{i2} \; \ldots \; z_{iN}]$, where $z_{ij} \geq 0$, $1 \leq j \leq N$, the point $x_i$:
$$x_i = z_{i1} \begin{bmatrix} A_{11} \\ A_{21} \\ \vdots \\ A_{M1} \end{bmatrix} + z_{i2} \begin{bmatrix} A_{12} \\ A_{22} \\ \vdots \\ A_{M2} \end{bmatrix} + \ldots + z_{iN} \begin{bmatrix} A_{1N} \\ A_{2N} \\ \vdots \\ A_{MN} \end{bmatrix}$$

is part of the cone $\Omega_A$.
The result of this "sampling" process is the sets $C_A \subset \Omega_A$ and $C_B \subset \Omega_B$. Following this, we find the vectors $a_i \in C_A$ and $b_i \in C_B$ with the maximum cosine similarity:

$$s(a_i, b_i) = \frac{\sum_{m=1}^{M} a_{im} b_{im}}{\sqrt{\sum_{m=1}^{M} a_{im}^2} \sqrt{\sum_{m=1}^{M} b_{im}^2}}$$
Subsequently, these vectors are removed from $C_A$ and $C_B$ and we repeat the process. Finally, we compute the average cosine similarity of all pairs:

$$\bar{s}(\Omega_A, \Omega_B) := \frac{1}{|C_A|} \sum_{r=1}^{|C_A|} s(a_r, b_r) \tag{17}$$

where $|C_A| = |C_B|$ is the cardinality of the set $C_A$, and $a_r, b_r$ are points in the sets $C_A$ and $C_B$ respectively. Notice that $\bar{s}(\Omega_A, \Omega_B)$ is bounded between 0 and 1, and higher values of $\bar{s}(\Omega_A, \Omega_B)$ indicate a high degree of similarity between the two cones $\Omega_A$, $\Omega_B$.
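A sketch of this sampling-and-pairing procedure is shown below; the number of sampled points is a free parameter of ours, and normalizing the samples to unit length lets the greedy matching work directly on inner products.

    import numpy as np

    def cone_cosine_affinity(A, B, n_samples=100, seed=0):
        # Eq. (17): sample random conic combinations in each cone, greedily pair
        # the most similar points, and average the pairwise cosine similarities.
        rng = np.random.default_rng(seed)
        CA = A @ rng.random((A.shape[1], n_samples))      # points in cone(A)
        CB = B @ rng.random((B.shape[1], n_samples))      # points in cone(B)
        CA /= np.linalg.norm(CA, axis=0)                  # unit norm: cosine = dot product
        CB /= np.linalg.norm(CB, axis=0)
        S = CA.T @ CB                                     # all pairwise similarities
        sims = []
        for _ in range(n_samples):
            i, j = np.unravel_index(np.argmax(S), S.shape)
            sims.append(S[i, j])
            S[i, :] = -np.inf                             # remove the matched pair
            S[:, j] = -np.inf
        return float(np.mean(sims))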
The Pompeiu-Hausdorff metric is defined as:

$$\mathrm{PH} := \mathrm{haus}(\Omega_A \cap B_n, \; \Omega_B \cap B_n) \tag{18}$$

where $B_n$ is the closed unit ball in $\mathbb{R}^n$ and

$$\mathrm{haus}(\Omega_A, \Omega_B) = \max \left\{ \max_{x \in \Omega_A} d(x, \Omega_B), \;\; \max_{x \in \Omega_B} d(x, \Omega_A) \right\}$$
We remind the reader that $d(x, \Omega_A)$ is the Euclidean distance of point $x$ to the cone $\Omega_A$. In order to calculate $\mathrm{haus}(\Omega_A, \Omega_B)$ one needs to solve two conic linear programming problems.
Finally, the ball-truncated volume of the cone is defined as:

$$\mathrm{btv}(K) := \mathrm{vol}_n(K \cap B_n) \tag{19}$$

where $\mathrm{vol}_n$ stands for the $n$-dimensional Lebesgue measure.
One way of calculating the expression in (19) is by using the $n$-dimensional Gaussian measure, as shown in [50]. Hence we can write:

$$\frac{\mathrm{btv}(K)}{\mathrm{vol}_n(B_n)} = \frac{1}{(2\pi)^{n/2}} \int_K e^{-\frac{1}{2}\|x\|^2} \, dx \tag{20}$$
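Since the right-hand side of (20) is the standard Gaussian measure of the cone, it can be approximated by Monte Carlo: draw standard normal samples and count how many fall inside the cone, with membership tested through a zero NNLS residual. The sketch below is purely illustrative, under our own assumptions; in high dimensions the hit rate becomes vanishingly small.

    import numpy as np
    from scipy.optimize import nnls

    def btv_ratio(P, n_samples=2000, tol=1e-6, seed=0):
        # Estimates btv(K) / vol_n(B_n) for the cone generated by the columns of P.
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        hits = 0
        for _ in range(n_samples):
            x = rng.standard_normal(n)
            _, residual = nnls(P, x)       # residual == 0 iff x lies in the cone
            hits += residual < tol
        return hits / n_samples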
We can compare the similarity between two cones using the ball-truncated volume in the following way. First, we calculate the ball-truncated volume of the cone of interest, e.g. the cone represented by the dictionary of pink noise. Then, we create "expanded cones" by combining the extreme rays of two cones, e.g. a new cone represented by the extreme rays corresponding to the cones of the pink and white dictionaries. The less the volume is altered, the better a match that dictionary is; e.g. if the smallest change in volume between the dictionary of pink noise and the rest of the dictionaries in our noise pool corresponds to the dictionary of white noise, we can assume that these two dictionaries are the "closest".
We can utilize the above conic affinity measures, along with a diverse pool of available noise dictionaries, in order to design systems operating in unseen noise conditions. For each noise in our pool we calculate the corresponding dictionaries. Subsequently, when the system is presented with a noisy signal, we apply the NMF procedure to produce a "dictionary" representation $W_{\text{noisy}}$, and calculate the conic affinity measures of the cone generated by $W_{\text{noisy}}$ with those generated from the noise dictionaries.
Although each of these measures captures a different aspect of the cone geometry, it is valuable to examine the performance of each affinity measure individually. So the system designed based on the measure defined by (16) will calculate the metric between $W_{\text{noisy}}$ and each of the noise dictionaries, and the one that yields the smallest value will be the selected dictionary.
7.2 Individual Performance of Conic Affinity Measures
To perform our experiments we need available clean utterances to construct speaker-specific dictionaries, as well as a pool of noises with different characteristics. To that end, we use 50 male and 50 female speakers from the TIMIT database [51]. Each dictionary is trained using 9 utterances. The test utterances are corrupted with noises from the NOISEX-92 database [42], at SNR levels of 0 dB and 5 dB. The NOISEX-92 database contains 15 types of noise with different characteristics, such as wideband and narrowband noises, as well as stationary and nonstationary noises. Both TIMIT and NOISEX-92 are sampled at 16 kHz. All the spectrograms were extracted using 25 ms windows with an overlap of 10 ms and 512 frequency bins. Thus, for each speaker and noise type, we have dictionaries of 257 atoms, and all the atoms were normalized to unit length.
Moreover, we examine the ability of the proposed systems to perform in unseen noise conditions with the following experiment. We corrupt speech utterances with a specific type of noise and then remove it from the noise pool; thus the system cannot pick the type of noise that alters the signal and is forced to pick another noise with "similar" characteristics. For example, if we corrupt an utterance with Military Vehicle noise, this specific noise is removed from the pool, and the system will select one of the remaining 14 types of noise. We enhance the noisy signal with the selected dictionary, and measure the quality of the produced signal in terms of Perceptual Evaluation of Speech Quality (PESQ) and segmental-SNR score improvements. This process is repeated for two different SNR levels, 0 dB and 5 dB.
The results of this experiment are presented in Table 10. An immediate conclusion that can be drawn is that the systems perform well and often choose the best available option, with performance on par with the oracle noise dictionary, i.e. the dictionary that corresponds to the type of noise the signal was corrupted with. Obviously, the system is dependent on the size and variability of the noise pool. Notice that for both PESQ and segmental-SNR the systems show satisfactory performance. However, restricting the noise pool to only a few types of noise, or to noises that all share the same characteristics, severely limits the ability of the system to select appropriate noise dictionaries to enhance the signal, resulting in poor performance.
Furthermore, these results support our hypothesis that conic affinity measures could be employed to guide the design of NMF-based systems able to operate in unseen noise conditions. This warrants further investigation into utilizing the geometric properties of NMF-produced dictionaries to improve the performance of speech technologies. Similar systems found in the literature use signal-extracted features (e.g. MFCCs, filterbanks, etc.) [30, 52].
Another interesting observation is that the systems based on different conic affinity measures hold complementary information. This leads to the conclusion that a method combining these conic affinity measures, along with others capturing different characteristics of the cones' geometry, could yield superior performance.
7.3 Combining Conic Affinity Measures
Since the results from Table 10 indicate that the individual conic affinity measures can provide useful information in the classification task, the next step is to observe whether they hold complementary information. To that end we need to investigate methods to combine the information from the individual measures into a final classification scheme.
Decision trees have been successfully applied to a variety of problems, e.g. in agriculture [53], in the finance sector [54], as well as data mining [55], to name a few. Many algorithms for building classification and regression trees have been proposed in the literature [56];
Table 10: Performance of speech enhancement metrics for five noises from the NOISEX-92 database (white, speech babble, high frequency, machine gun, and factory floor 1). Performance is measured with respect to two metrics: PESQ and segmental-SNR improvements. In all cases Oracle represents the dictionary that corresponds to the noise that corrupts the signals, while 2nd Best corresponds to the dictionary that results in the best performance if we exclude the oracle. The other columns correspond to the systems designed based on the conic affinity measures described in (16), (17), (18), and (19).

PESQ improvement:
Noise    Oracle   d̄       s̄       PH      btv     2nd Best
W.       0.667    0.586   0.586   0.586   0.586   0.586
S.B.     0.392    0.257   0.323   0.298   0.323   0.323
H.F.     0.511    0.408   0.458   0.436   0.486   0.486
M.G.     0.603    0.482   0.534   0.134   0.534   0.534
F.F.1    0.451    0.422   0.303   0.422   0.378   0.422

segmental-SNR improvement:
Noise    Oracle   d̄       s̄       PH      btv     2nd Best
W.       19.453   17.342  17.342  17.342  17.342  17.342
S.B.     5.437    5.043   4.642   5.244   4.956   5.244
H.F.     17.014   13.181  15.248  11.481  15.101  15.248
M.G.     13.787   9.123   10.355  11.541  12.042  12.042
F.F.1    15.656   11.435  12.981  12.981  13.431  13.431
however, most of them end up building binary decision trees, which restricts their classification potential. Quinlan proposed the algorithm C5.0 in [57], along with its predecessor C4.5, which can create multibranch decision trees that do not suffer from the restrictions imposed on binary trees, thus enhancing their performance. Moreover, the C5.0 algorithm uses information gain as its splitting criterion and adopts the binomial confidence limit method as a pruning technique.
We employed the C5.0 algorithm to build trees that combine the conic affinity metrics into a single classification scheme. The training data consisted of the conic affinity measures extracted from the noise cones in our noise pool and noisy signals corrupted by those noises. In our experiments we used utterances from 50 male and 50 female speakers from the TIMIT database [51]. We added 15 different types of noise to those utterances (one type of noise per utterance) from the NOISEX-92 database [42], at SNR levels of 0 dB and 5 dB, chosen randomly. Both NOISEX-92 and TIMIT are sampled at 16 kHz, and each speaker in TIMIT had 9 training utterances. Following this procedure we end up with 27000 noisy utterances for training. The spectrograms of both the noises and the noisy speech signals were extracted using 25 ms windows with an overlap of 10 ms and 512 frequency bins. Hence, the resulting dictionaries consist of 257 atoms, which were normalized to unit length.
Table 11: Performance of speech enhancement metrics for ten noises from the NOISEX-92 database (car interior, destroyer engine room, destroyer operations room, factory floor 1, machine gun, military vehicle, jet cockpit 2, white, speech babble, and high frequency). Performance is measured with respect to two metrics: PESQ and segmental-SNR improvements. In all cases Oracle represents the dictionary that corresponds to the noise that corrupts the signals, Tree (unk.) and Tree (uns.) are the tree decisions under unknown and unseen noise conditions respectively, 2nd Best corresponds to the dictionary that results in the best performance if we exclude the oracle, and Worst to the dictionary that results in the worst performance.

PESQ improvement:
Noise     Oracle   Tree(unk.)  Tree(uns.)  2nd Best   Worst
C.I.      0.5070   0.5070      0.4233      0.4233     -0.0683
D.E.R.    0.4925   0.4925      0.3574      0.3574     0.0186
D.O.R.    0.4911   0.4911      0.4851      0.4851     0.1074
F.F.1     0.4072   0.4072      0.3730      0.39531    0.0634
M.G.      0.6643   0.6643      0.5225      0.5737     0.0452
M.V.      0.5987   0.5987      0.5577      0.5577     0.0434
J.C.2     0.7184   0.6934      0.6934      0.6934     0.0391
W.        0.667    0.667       0.586       0.586      0.0141
S.B.      0.392    0.323       0.298       0.323      -0.0341
H.F.      0.511    0.511       0.458       0.458      0.0313

segmental-SNR improvement:
Noise     Oracle    Tree(unk.)  Tree(uns.)  2nd Best   Worst
C.I.      15.1519   15.1519     12.1326     12.1326    0.4985
D.E.R.    4.8504    4.8504      2.5881      2.7218     0.2596
D.O.R.    4.7053    4.7053      4.5913      4.5913     0.1199
F.F.1     18.5265   18.5265     16.7111     18.4446    3.0617
M.G.      11.3707   11.3707     7.9637      7.9851     -0.5457
M.V.      18.3689   18.3689     17.8433     17.8433    1.7797
J.C.2     13.0912   13.4052     13.4052     13.4052    3.9956
W.        19.453    19.453      17.342      17.342     4.1553
S.B.      5.437     5.043       5.043       5.243      0.1129
H.F.      17.0140   17.0140     15.248      15.248     2.4318
After the spectrograms were extracted, we applied NMF to get the corresponding dictionaries and computed the conic affinity measures between the dictionaries produced from the noisy signals and the dictionaries produced from the noises. Having calculated the conic affinity measures, we can now utilize the C5.0 algorithm to create classification trees whose decisions take all the measures into account.
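The combination step itself is a standard supervised tree fit. Since C5.0 is not available in common Python libraries, the sketch below uses scikit-learn's CART tree purely as an illustrative stand-in, with hypothetical placeholder data; each training row holds the four conic affinity measures between a noisy-signal dictionary and a candidate noise dictionary, and the label is the noise type.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder training data (hypothetical): (N, 4) affinity measures, (N,) labels
    X_train = np.random.rand(1000, 4)
    y_train = np.random.randint(0, 15, size=1000)
    tree = DecisionTreeClassifier(max_depth=8).fit(X_train, y_train)
    # At test time, the measured affinities of a noisy signal map to a noise type:
    # chosen_noise = tree.predict(test_affinities.reshape(1, -1))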
In the test phase, we use speech signals from TIMIT corrupted by one of the noises in the noise pool. We performed two different sets of experiments. The first configuration is the case of unknown noise: the test speech signal is corrupted by a noise from our noise pool, and the tree is created by taking all the noises into account. So we know that the signal is corrupted by a noise from our pool, but we do not know which one. In the second configuration we simulate unseen noise conditions: the test signal is corrupted by a noise that was not used in the creation of the classification tree. As a result, the tree cannot choose the type of noise that actually corrupts the signal, but rather a noise similar to it. In both configurations, we measure the quality of the produced signal in terms of the PESQ and segmental-SNR metrics.
The results of this experiment are presented in Table 11. We observe that in the unknown noise scenario the tree selects the oracle dictionary in all cases but one, i.e. speech babble. This means that the decision tree is able to recognize which type of noise corrupts the signal, and is thus able to use the oracle noise dictionary. A similar behavior is observed in the unseen noise scenario, where the tree is able to select the best available noise in the noise pool, meaning that the conic affinity measures we employed are able to provide information about the cone geometries generated by the noise dictionaries, and use it to successfully compare these cones. As in previous cases, the performance of our system is dependent on the diversity and availability of the noise pool; reducing the pool size results in deteriorated performance. Further investigation is needed to find criteria regarding the size and diversity of the noise pool, e.g. to determine when there are enough noises in our pool such that adding more will not yield any benefit.
8 Conclusions and Future Directions
In this thesis, we explored methods that take advantage of noise similarities to improve audio and speech processing technologies, with an emphasis on SNR estimation and speech enhancement. For each application we investigated different approaches to improve performance in unknown and unseen noise conditions.
In SNR estimation we initially presented a novel method for global SNR estimation using regression models which are trained on features that can be ranked by presence of speech. Using machine learning methods to estimate the SNR was a different approach from previous works found in the literature, which attempted to estimate the SNR from the signal properties. We tested our method on various noise types from the NOISEX-92 database, which have different statistical properties, and demonstrated that these regression models can successfully provide an accurate SNR estimate, given that we know the type of noise that corrupts the signal. Furthermore, we compared our work with two other SNR estimation algorithms (WADA, NIST SNR), and our experiments showed that our proposed method in general outperforms them across all experimental conditions.
However, a shortcoming of this approach is that it assumes knowledge regarding the type of noise that corrupts the signal. Obviously this assumption cannot hold true in a wide variety of cases. To that end, we expanded our efforts to facilitate operation of the proposed SNR system in unseen noise conditions by proposing a new system that uses a neural network responsible for choosing an appropriate noise regression model. We designed the system and demonstrated through our experiments that this method is able to operate under unseen noise conditions, since the network chooses a noise model that is "similar" to the oracle model⁴. Once again we compared the performance of this "noise-independent" system to two other methods (WADA, NIST SNR), as well as to the oracle regression model. We want to draw attention to the fact that our method can be considered "noise-independent", since it does not make explicit assumptions about the type of noise that corrupts the speech signal.

⁴ the model that corresponds to the true type of noise that corrupts the signal
In fact, it is this "noise-independence" property that is the reason for the enhanced performance of our method. Instead of forcing the system to handle a specific family of noise types, we allow it to adjust to the channel conditions accordingly. In particular, the system accurately estimates the SNR when the signal is corrupted by either stationary (e.g. white) or impulsive noise (e.g. Machine Gun). Since the regression models are trained on specific types of noise, they are able to adapt to them, and thus are not constrained to specific characteristics, as is the case with other methods that are restricted to specific types of noise (e.g. Wiener filtering). On the other hand, the neural network is able to select similar noise models when we have no knowledge about the noise that corrupts the signal, enabling our method to operate in scenarios of unknown and unseen noise conditions.
However, there were cases where other methods provided better results than ours, due to some feature underperforming or due to model mismatch. It is easier to identify cases of feature underperformance when the channel conditions are assumed known, for example in the case of babble speech noise, where the regressors from pitch and voicing probability do not perform well (see Figure 10); thus our method provides worse results at 0 dB. On the other hand, model mismatch occurs when channel conditions are assumed unknown. In this case, the model the DNN chooses to estimate the SNR might be a poor fit (e.g. see Car Interior noise in Figure 9). One important consideration is that the overall performance depends on the diversity of the noise pool. If we do not have a variety of noises with different properties, our system's performance suffers, since the DNN is not able to select a noise model similar to the actual noise that corrupts the signal.
To overcome these shortcomings our next research effort focused on two aspects. First, we explored additional features that could provide information about the SNR level of the signal. I-vectors provide utterance-level information about a speech signal and have been used in applications such as speaker identification and language identification. We found that i-vectors can provide meaningful information about the SNR level of an audio signal. Moreover, to address the limitations of linear regression models, we adopted a nonlinear regression scheme realized by a fully connected neural network. Again, this method can be considered "noise-independent" since it does not make explicit assumptions about the type of noise that alters the original speech signal; instead, it uses i-vectors to model the noise type and adapt the final SNR estimate. Comparing this method with others in the literature, we found that our system had superior performance, attributed to the fact that we do not force the system to deal with a specific family of noises but utilize the neural network to adapt to the channel conditions. However, our method depends on the availability of noises. Training the UBM and the network from a small noise pool will result in a model that is not able to generalize across different noise conditions. Moreover, the noise pool must contain diverse noise types. If the noise pool only contains examples of stationary noise, our model will not be able to handle noises with impulsive characteristics. To overcome these shortcomings, we plan to explore whether different models can capture the noise type without relying on features to provide that information (e.g., recurrent networks).
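A sketch of the nonlinear regression stage is given below, assuming an external front-end (UBM plus total variability matrix) produces the i-vectors; `extract_ivector` is hypothetical and the layer sizes are illustrative:

```python
# Sketch: nonlinear SNR regression on i-vector inputs. A small
# fully connected network maps each utterance's i-vector to a
# single SNR value in dB.
from sklearn.neural_network import MLPRegressor

def train_ivector_regressor(ivectors, snr_labels):
    """ivectors: (n_utterances, ivector_dim); snr_labels: dB values."""
    reg = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=1000)
    reg.fit(ivectors, snr_labels)
    return reg

# Usage (extract_ivector is a hypothetical front-end call):
# snr_hat = reg.predict(extract_ivector(wav).reshape(1, -1))[0]
```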
Another application we focused on was speech enhancement. Robust SNR estimation can help improve the performance of speech enhancement algorithms. For example, methods based on NMF usually end up deteriorating the quality of the signal when the input has a high SNR level, while filtering methods depend on an accurate SNR estimate. Therefore, we have deployed our SNR estimation system as a preprocessing step in our speech enhancement efforts whenever necessary.
Our speech enhancement efforts in unseen noise conditions are based on NMF. The underlying principle that enables NMF to operate in unseen noise conditions is the same as in our SNR estimation approach: we have access to a diverse noise pool and we have developed methods to exploit noise similarity. Initially, we investigated a method that matches the test signal to a noise in the pool, whose dictionary is then used in the enhancement phase, by comparing LTSV (long-term signal variability) histograms. In practice, we implemented a nearest neighbor classification scheme where the data points are LTSV histograms and the "distance" is the KL divergence. We found that this method provides meaningful results and was able to propose a "valid" noise candidate whose dictionary would be used in the denoising phase.
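The matching step reduces to a few lines, sketched here under the assumption that LTSV values have already been computed and binned into histograms; the bin layout is shared between test signal and pool:

```python
# Sketch: nearest-neighbour noise matching. Each candidate noise
# is represented by an LTSV histogram; the test signal's
# histogram is compared to each via KL divergence and the
# closest noise wins.
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()                      # normalize counts to probabilities
    q = q / q.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))

def match_noise(test_hist, noise_hists):
    """noise_hists: dict noise_type -> LTSV histogram (counts)."""
    return min(noise_hists,
               key=lambda n: kl_divergence(test_hist, noise_hists[n]))
```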
Another interesting set of results came from using a "combined" noise dictionary in the enhancement phase. This combined dictionary is a new dictionary whose atoms are the atoms of all the noises in our noise pool. Our experiments indicate that such an approach can operate in unseen noise conditions and even outperform the "oracle" noise dictionary. This means that the atoms of the combined dictionary can not only supplement the atoms of the missing "oracle" dictionary, but also create a space where the NMF has more flexibility to reconstruct the enhanced signal. This was an important observation, since it motivated us to explore the properties of the combined dictionary and compare it with the oracle noise dictionary. In order to accomplish that, we need mathematically robust methods that enable such comparisons. Adopting a geometric view of the dictionaries created through the NMF process allowed us to use such mathematical tools to compare the dictionaries.
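The following is a minimal sketch of enhancement with a combined dictionary, assuming a speech dictionary learned from clean speech and per-noise dictionaries learned from the pool; the Euclidean multiplicative update and the Wiener-like mask are one common choice, not necessarily the exact configuration of the thesis:

```python
# Sketch: semi-supervised NMF denoising with a "combined" noise
# dictionary. W_speech is learned from clean speech; the noise
# part concatenates the atoms of every dictionary in the pool.
# Only the activations H are updated at test time.
import numpy as np

def denoise(V, W_speech, noise_dicts, n_iter=200, eps=1e-10):
    """V: magnitude spectrogram (freq x time), nonnegative.
    noise_dicts: list of (freq x atoms) noise dictionaries."""
    W = np.hstack([W_speech] + noise_dicts)     # combined dictionary
    H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
    for _ in range(n_iter):                     # multiplicative updates,
        H *= (W.T @ V) / (W.T @ (W @ H) + eps)  # W held fixed
    k = W_speech.shape[1]
    V_speech = W_speech @ H[:k]                 # speech-only reconstruction
    return V_speech * V / (W @ H + eps)         # Wiener-like soft mask
```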
NMF dictionaries can act, through their properties, as generators of convex polyhedral cones in the positive orthant. Initially, we explored conic affinity measures and their relationship with speech enhancement metrics. We found that conic affinity measures can inform the decision of which noise dictionary to use in the denoising phase. Using a selection procedure, we were able to choose noise dictionaries that result in performance comparable to the "oracle" dictionary. These results led us to further investigate this approach by incorporating more conic affinity measures that capture different aspects of the geometry of the cones and, based on those, to design a system that performs NMF enhancement in unknown and unseen noise conditions.
As a first step, we tested the performance of these conic affinity measures individually (see Table 10) and found that each can provide valuable information in the noise selection step, while our results also indicated that they carry complementary information. Hence, we designed a system that takes into account all the information these conic affinity measures include. To achieve that, we employed multi-branch classification trees trained with the C5.0 algorithm. Our experiments indicate that in both unknown and unseen noise scenarios this system always chooses an appropriate model for the denoising phase: in the case of unknown noise we almost always choose the correct noise type, while in the case of unseen noise the system chooses the best available option in our noise pool.
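A sketch of this selection step is shown below. A CART tree from scikit-learn stands in for the C5.0 multi-branch trees used in the thesis (an explicit substitution), and the feature columns are placeholders for the conic affinity measures computed between the test signal and each candidate noise cone:

```python
# Sketch: dictionary selection from conic affinity measures.
# Each training example pairs a vector of affinity measures with
# the id of the dictionary that maximized enhancement quality
# (e.g., PESQ) for that example.
from sklearn.tree import DecisionTreeClassifier

def train_dictionary_selector(affinity_features, best_dictionary_ids):
    """affinity_features: (n_examples, n_measures) conic affinities;
    best_dictionary_ids: (n_examples,) index of the best dictionary."""
    tree = DecisionTreeClassifier(max_depth=8)
    tree.fit(affinity_features, best_dictionary_ids)
    return tree

# Usage: dict_id = tree.predict(measures.reshape(1, -1))[0]
```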
Finally, all our efforts so far warrant further investigation of methods that exploit noise similarity in order to boost the performance of speech applications in unseen noise conditions. For example, in SNR estimation we could adopt the same approach to estimate the instantaneous SNR, i.e., the SNR value in every time-frequency bin. A recurrent network that takes into account temporal information, combined with features that provide information about the channel conditions, could help create a robust instantaneous SNR estimation method able to operate in real-life settings. Instantaneous SNR estimates can boost the performance of many speech technologies, e.g., speech activity detection, security applications, as well as speech enhancement. Moreover, further study of conic affinity measures would allow us to investigate more ways to describe the geometry of the cones, as well as to identify which elements of this geometry are responsible for boosting the performance of NMF-based speech enhancement. Furthermore, combining the study of conic affinity measures with the "combined" dictionary method we proposed could allow us to design criteria for selecting which noises should be part of an expanded noise dictionary.
Abstract
The performance of speech technologies deteriorates in the presence of noise. Additionally, the spread of such technologies in everyday scenarios (e.g., speech applications operating on mobile phones), specialized data collection (e.g., audio for medical applications), as well as surveillance, gave rise to a new demand in the field of speech processing: these technologies should be able to operate across a variety of noise levels and conditions. Traditional speech and audio processing applications require prior knowledge of the noise corrupting the signal, or an estimate of the noise from noise-only portions of the signal, which in turn necessitates knowledge of speech boundaries. Relaxing those requirements can facilitate processing of data captured in different real-life environments, and relax rigid data acquisition protocols that can potentially create a bottleneck. Although it is impossible to have every type of noise available, we can exploit similarities between different noise types to boost the performance of algorithms. In this work we demonstrate this approach on two applications: Signal-to-Noise Ratio (SNR) estimation and speech enhancement.

Many speech processing algorithms and applications rely on explicit knowledge of the signal-to-noise ratio (SNR) in their design and implementation. Hence, SNR estimation can guide the design and operation of such technologies, or can be used as a pre-processing tool in database creation (e.g., to identify or discard noisy signals). Speech enhancement is a core tool in many speech-related applications, since it transforms a noisy signal to a state from which we can extract useful and reliable information. The goal of this work is to enable those two technologies to successfully operate under unseen noise conditions.

To achieve this goal we followed different approaches for the two problems at hand. In the case of SNR estimation, we propose new features designed to capture information in the signal that generalizes across different noise types and SNR characteristics, used in models that take into account not only those features but also information about the noise type itself. In speech enhancement we follow a different approach: we employ a method called Nonnegative Matrix Factorization (NMF), which has met widespread success in denoising, and propose modifications that condition the method to operate under unseen noise conditions.