Toward Robust Affective Learning from Speech Signals Based on Deep Learning
Techniques
by
Che-Wei Huang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)
December 2018
Copyright 2019 Che-Wei Huang
Acknowledgements
This dissertation could not have been possible without the support and encouragement of my
family, friends and colleagues.
First of all, I would like to thank my adviser Dr. Shrikanth S. Narayanan for his tremendous
support during the past six years. I have benefited greatly from his continuous guidance on
academic writing and presentation, and his words of encouragement for out-of-the-box explorations
in research. With the research flexibility he provided, I had the luxury of exploring and enjoying
several topics until I settled on my dissertation work. I would also like to thank my qualifying
committee, Dr. Panayiotis Georgiou, Dr. Gayla Margolin, Dr. Anand Arvind Joshi and Dr. An-
tonio Ortega, and my dissertation committee, Dr. Panayiotis Georgiou and Dr. Gayla Margolin,
for their insightful comments and valuable suggestions.
I am grateful to Dr. Chi-Chun Jeremy Lee, Dr. Bo Xiao, Dr. Naveen Kumar, Dr. Jangwon
Kim and Dr. Theodora Chaspari for being excellent mentors/colleagues and for numerous talks we
had during uncountably many setbacks and confusions that I, fortunately, faced in the past years,
and to Dr. Tanaya Guha for being a wonderful collaborator. Thanks also go to my colleagues at
Signal Analysis and Interpretation Laboratory (SAIL) for their help and support.
I could not finish this long journey without a bottle of beer, for which I thank my beer buddies,
Yongzhe Wang and Roberto Martin del Campo, for the beer and football games we shared.
Finally, I would like to express my ultimate gratitude to my family. I thank my parents for
their unconditional support of my decisions and pursuits overseas, and for the love and care they
have devoted to me. Special thanks to my beloved wife, Chia-Wei, for her love and patience, and
for being with me to share the ups and downs in life and in research. This long journey would be
much tougher without her in my life.
Table of Contents
Acknowledgements
List Of Tables
List Of Figures
Abstract

I Background

Chapter 1: Introduction
    1.1 Problem Definition and Challenges

Chapter 2: Related Work
    2.1 Acoustic Processing Based on Signal Processing, Machine Learning and Deep Learning

II Feature-Based Data Augmentation

Chapter 3: Emotion Discriminative Feature Formulation via Deep Convolutional Neural Networks
    3.1 Introduction
    3.2 Related Work
    3.3 Deep Convolutional Recurrent Models
        3.3.1 Types of Convolutional Operations
        3.3.2 Deep Recurrent Neural Network
        3.3.3 CLDNN-based Models
    3.4 Baseline Models
        3.4.1 Support vector machine with the Low-Level Descriptors and Their Statistical Functionals
        3.4.2 LDNN with the log-Mels
        3.4.3 LDNN with the MFCCs
    3.5 Databases Description
        3.5.1 The Clean Set
        3.5.2 The Noisy Set
    3.6 Speech Emotion Recognition Experiments
        3.6.1 SVM with openSMILE features
        3.6.2 CLDNN-based Models with the MFCCs and the log-Mels
    3.7 Experimental Results
        3.7.1 SVM with openSMILE features
        3.7.2 LDNN with the MFCCs and the log-Mels
        3.7.3 CLDNN with the MFCCs and the log-Mels
        3.7.4 Finer hyper-parameter search on the spectral axis
        3.7.5 Module-wise evaluations
            3.7.5.1 Quantitative Analysis
    3.8 Conclusion

III Model-Based Data Augmentation

Chapter 4: Regularization Via Model-Based Perturbation
    4.1 Introduction
    4.2 Related Work
        4.2.1 Shake-Shake Regularization and Its Variants
        4.2.2 End-to-End Discriminative Feature Learning
        4.2.3 Vicinal Risk Minimization and Mixup
    4.3 Sub-band Shaking
    4.4 Batch Normalized Shaking
        4.4.1 Embedding Learning on MNIST and CIFAR-10
        4.4.2 Relation with Batch Normalized Recurrent Neural Networks
    4.5 Database and Experiments for Speech Emotion Recognition
    4.6 Experimental Results
        4.6.1 Sub-band Shaking
        4.6.2 Experiments with Different Layouts
    4.7 Conclusion

IV Concluding Remarks

Chapter 5: Conclusion

Reference List
List Of Tables
3.1 A summary of the parameters for each model architecture. M denotes the spectral
    dimensionality and var stands for variable parameters for tuning. The dash symbol
    indicates the situation where the parameter tuning is not applicable.
3.2 A summary of the ranges for parameter tuning on each type of the convolutional
    layers, where M denotes the spectral dimensionality and the subscripts of h and w
    correspond to the first and the second convolutional layers, respectively.
3.3 The SVM baseline performance (UA (%)) based on the leave-one-subject-out (LOSO)
    cross validation and on the training-validation-testing (TVT) partitions using the
    acoustic feature sets from past INTERSPEECH challenges.
3.4 The performances (UA (%)) of the optimal SVM model, the LDNN-based models
    and the CLDNN-based models. The sparse kernel reduced rank regression
    (SKRRR) [114] is one of the state-of-the-art models on the eNTERFACE'05 corpus.
3.5 The performances (UA (%)) of a SVM classifier trained on the spliced log-Mels,
    the spliced MFCCs and the output of each module from all CLDNN-based models
    under the clean condition.
4.1 Test error (%) and model size on CIFAR-10
4.2 Performances of PreAct and PreActBN with and without shaking for speech
    emotion recognition
4.3 An overview of these selected corpora, including the number of actors and the
    distribution of utterances in the emotional classes
4.4 F: female, M: male. The gender and corpus distributions in each actor set partition
    of the cross validation
4.5 ResNeXt (12, 2×8d) network architecture
4.6 Performances of PreAct, PreAct-Full and PreAct-Both
4.7 P-values from a one-sided paired t-test with df = 4×3−1 = 11
4.8 Performances of PreAct, PostAct, RPreAct and PreActBN with and without
    shaking for speech emotion recognition, where γ0 is the initialization value of the
    standard deviation parameter γ in batch normalization
4.9 P-values from a one-sided paired t-test between PreActBN at γ0 = 0.05 and
    PostAct and RPreAct with various values of γ0
4.10 Performances of PreAct and PreActBN with sub-band shaking for speech emotion
    recognition, where γ0 is the initialization value of the standard deviation parameter
    γ in batch normalization
List Of Figures
3.1 An overview of the proposed neural networks for speech emotion recognition.
3.2 Unweighted accuracy, UA (%), of the S-CLDNN (log-Mels) model on the validation
    partition under noisy and clean conditions with respect to different kernel sizes h1
    in the first convolutional layer. The curves in red are median filtered UAs.
4.1 An overview of a 3-branch Shake-Shake regularized residual block. (a) Forward
    propagation during the training phase (b) Backward propagation during the training
    phase (c) Testing phase. The coefficients α and β are sampled from the uniform
    distribution over [0, 1] to scale down the forward and backward flows during the
    training phase.
4.2 Shaking regularized ResNeXt architectures with different layouts introduced in [40]
4.3 An illustration for the sub-band definitions
4.4 MNIST embeddings based on different layouts of residual blocks. We set the feature
    dimension entering into the output layer to be two and train them in an end-to-end
    fashion. The top and bottom rows depict embeddings of the training samples
    extracted in the train mode (i.e. α ∈ [0, 1]) without updating parameters, and
    testing samples extracted in the eval mode (α = 0.5), respectively. (a,e) fully
    pre-activation (Fig. 4.2(c)) without shaking (b,f) fully pre-activation (Fig. 4.2(c)) with
    shaking (c,g) fully pre-activation + BN (Fig. 4.2(d)) without shaking (d,h) fully
    pre-activation + BN (Fig. 4.2(d)) with shaking
4.5 Embeddings of training samples extracted in the (a) train (b) eval mode from
    PreActBN with shaking
4.6 Percentage of class samples within a relative distance to the class center
Abstract
Regularization is crucial to the success of many practical deep learning models, particularly in
the common scenario where only a few to a moderate number of training samples are accessible;
speech emotion recognition, and computational paralinguistics in general, is undoubtedly one
such case. Common practices of regularization include weight decay, Dropout [97] and data
augmentation.
From the representation learning perspective, we first examine the negative influence of data
augmentation with noise with respect to the size and aspect of filters in convolutional neural
networks, from which we are able to design the optimal architecture under noisy and clean
conditions, respectively, for speech emotion recognition.
Moreover, regularization based on multi-branch architectures, such as Shake-Shake regularization
[29], has been proven successful in many applications and has attracted more and more attention.
However, beyond model-based representation augmentation, it is unclear how a stochastic mixture
of model branches could help to provide further improvement on classification tasks, let alone the
baffling interaction between batch normalization and shaking.
We present our investigation on regularization by model-based perturbation, drawing connections
to the vicinal risk minimization principle [12] and discriminative feature learning in verification
tasks. Furthermore, we identify a strong resemblance between batch normalized residual
blocks and batch normalized recurrent neural networks, where both of them share a similar
convergence behavior, which could be mitigated by a proper initialization of batch normalization.
Based on these findings, our experiments on speech emotion recognition demonstrate simultaneously
an improvement in classification accuracy and a reduction in the generalization gap, both with
statistical significance.
Part I
Background
Chapter 1
Introduction
1.1 Problem Definition and Challenges
Computational paralinguistics aims for a comprehensive understanding of information from human
speech regardless of the content. In general, the non-linguistic information of interest includes
the way speech is expressed and the attributes of a speaker. Emotion, one well-known kind of
non-linguistic information, plays a fundamental role in our daily lives for effective communication,
and underlies the abilities of humans to interact, collaborate, empathize and even compete
with others. Researchers have been working on understanding human emotion, or human behaviors
in general [77], for years from both psychological and computational perspectives, for several
reasons including that it serves as a lens to observe the dynamics of one's internal mental
state. Moreover, with the advent of artificially intelligent agents, it is hardly an overstatement to
stress the importance of emotion recognition in supporting natural and engaging human-machine
interaction.
Human behavioral cues often mix and manifest multiple sources of information together. The
need to robustly recover affective information from multiplexed behavioral cues renders emotion
recognition a challenging task. For example, speech contains not only the linguistic content of
what is said but also attributes of the speaker such as identity, gender, age, speaking style, and
language background, as well as information about the environment and context. All of these
factors are entangled and transmitted through a single channel during speech articulation. Speech
emotion recognition, therefore, involves the inverse process of disentangling these signals and
identifying affective information. Despite having drawn much attention in the research community,
the task of robust speech emotion recognition still poses several challenges.
One of the challenges in speech emotion recognition is the lack of knowledge about discriminative
feature formulation to distinguish between emotions [24]. Attributes of the speakers and
background channels often introduce a certain degree of acoustic variability that affects most of
the commonly used acoustic features, including pitch, energy and speaking rate. Even though
techniques such as speaker normalization [9] for variability reduction have been proposed to
mitigate this issue, a scalable and task-specific feature formulation remains desirable. This challenge
is not specific to speech emotion recognition, as research problems in other domains and/or based
on other modalities also share this problem. Fortunately, recent advancements in representation
learning and end-to-end training of deep neural networks have offered an opportunity to address
this problem from a different perspective and in a scalable way. However, due to the lack of
sufficient training samples, which is often termed a low-resource condition, applying deep neural
networks to speech emotion recognition is still a challenging task.
Low-resource conditions are not rare in machine learning, and machine learning practitioners
employ various techniques to control, implicitly or explicitly, model complexity when facing these
conditions. This category of techniques is referred to as regularization, which includes weight
decay, Dropout and data augmentation. Usually, a combination of several techniques is used
together instead of relying solely on only one of them. In speech or audio processing, data
augmentation often involves speech rate adjustment and noise addition. Despite its success in
improving classification accuracy in certain applications, it has also been reported that under a
moderately noisy condition, representation learning by deep convolutional neural networks suffers
from data augmentation with noise, resulting in a degradation in model performance. It was
hypothesized by previous work that it is harder to learn robust representations using small or
context-free filters in convolutional neural networks under a noisy condition.
In addition to the application of data augmentation to the raw input, model-based representation
augmentation, such as Shake-Shake regularization [29], ShakeDrop regularization [113] and
Stochastic Shake-Shake regularization [49], is gradually gaining more and more attention. All
of them, based on multi-branch architectures, promote a stochastic mixture of model branches
in the forward and the backward propagations. However, beyond the concept of model-based
representation augmentation, it is unclear how a mixture of model branches could help to learn
a better representation. Furthermore, both Shake-Shake and ShakeDrop regularization pointed
out a close interaction between shaking and batch normalization, but neither provides an
explanation for this phenomenon other than a brief discussion on the strength of shaking. In
Stochastic Shake-Shake regularization, it was reported that shaking without a directly preceding
batch normalization contributes more to constraining the optimization process than to boosting
the generalization capacity.
In this dissertation, we present our investigations of regularization in speech emotion recognition
based on end-to-end representation learning with deep neural networks. First, we study
the hypothesis that small or context-free filters would lead to a poor representation under a noisy
condition. Based on an architecture that consists of convolutional and recurrent neural layers, we
thoroughly benchmark four types of convolutional operations under the noisy and clean conditions.
We also perform module-wise evaluation to probe the performance gain from each module in
the networks, where each module is responsible for modeling either local spectral-temporal
information or long-term temporal dynamics.
Second, we conduct ablation studies of Shake-Shake regularization on the MNIST and CIFAR-10
datasets to study the baffling interaction between batch normalization and shaking. Based on
the observations, we draw connections between batch normalization and discriminative feature
learning, and between shaking and the vicinal risk minimization principle, to elucidate this close
interaction. Despite a clearer understanding of shaking, we face a convergence issue when we
employ batch normalization before shaking. To address this issue, we identify a strong resemblance
between batch normalized residual blocks and batch normalized recurrent neural networks,
where both of them suffer from a similar convergence pattern. In the end, our experiments on
speech emotion recognition not only improve the classification accuracy but also reduce the
generalization gap, both with statistical significance.
The rest of the dissertation is organized as follows. Chapter 2 summarizes related work in
acoustic signal processing and affective computing based on signal processing, machine learning
and deep learning. Chapter 3 thoroughly benchmarks the performance of speech emotion recognition
with respect to the size of convolutional filters based on a convolutional and recurrent neural
architecture. Chapter 4 focuses on model-based regularization techniques, including Shake-Shake
and ShakeDrop regularization, their working mechanism for improving representation learning
in classification tasks and their application to speech emotion recognition. Chapter 5 concludes
this dissertation.
Chapter 2
Related Work
2.1 Acoustic Processing Based on Signal Processing, Machine
Learning and Deep Learning
Due to the subtlety of human emotion, which acoustic features are able to effectively characterize
the affective content in speech is still an open question [25, 93].
A multitude of studies on the subject of emotion recognition have discovered a number of
emotion-covariant parameters based on prior knowledge of psychology, speech science and vision
science, and through signal processing and machine learning approaches. Commonly used features
include pitch, log-Mel filterbank energies (log-Mels) [86], Mel-frequency cepstral coefficients
(MFCCs) [7] and perceptual linear prediction (PLP) [41] in the acoustic modality, and Haar [34],
local binary pattern (LBP) [79], histogram of oriented gradients (HOG) [74] and scale-invariant
feature transform (SIFT) [68] in the visual modality.
A variety of classifiers based on these features have been reported to give a reasonable
performance, including the support vector machine (SVM) [17], extreme learning machine (ELM) [50],
decision tree [5] and random forest [6]. In particular, an extensive feature set consisting of
thousands of hand-engineered parameters has been recommended in the past few INTERSPEECH
challenges [94] and in a recent meta-research review article [27]. However, these features are
designed for general acoustic or visual pattern recognition tasks, and are not specific to affective
information representation. With recent advances in machine learning and in particular deep
learning, the deep neural network has emerged as a powerful tool for pattern classification.
In addition to the aforementioned hand-crafted feature engineering, deep learning [85, 19, 43]
provides an alternative approach to formulating appropriate features for the task at hand. By
feeding low-level data, e.g. the pixels of an image or the spectrogram of an utterance, into a deep
neural network and supervising the training with a suitable objective and corresponding, often
task-specific, annotation, the deep neural network is capable of fitting the transformation from
low-level data to annotation via a functional composition of non-linear mappings. Intermediate
results within the composition are referred to as intermediate representations and have been shown
to possess more discriminative power than the low-level data. In the last few years, convolutional
neural networks (CNNs) [57, 116, 95, 102, 38] have demonstrated outstanding performances in
various applications including image recognition, object detection, and recently speech acoustic
modeling. Compared to hand-crafted features, a CNN that learns from a large number of training
samples via a deep architecture can capture a higher-level representation of the task-specific
knowledge distilled from annotated data. In the area of speech emotion recognition, several
researchers have investigated the effectiveness of CNNs in automatically learning affective
information from signal data [72, 4, 118, 76].
Information encoded in speech signals is inherently sequential. Moreover, psychological studies
have shown that affective information involves a slow temporal evolution of mental states [78].
Based on this observation, previous studies have also investigated the use of architectures that
explicitly model the temporal dynamics, such as hidden Markov models (HMM) [92] or recurrent
neural networks (RNN) [111, 75, 61] for recognizing human emotion in speech.
Furthermore, there is a growing trend toward combining CNNs and RNNs into one architecture
and training the entire model in an end-to-end fashion. The motivation behind holistic training
is derived from the need to avoid greedily enforcing the distribution of intermediate layers to
approximate that of the labels, which is believed to maximally exploit the advantage of deep
learning over traditional learning methods and to lead to an improved performance. For example,
Sainath et al. [88] proposed an architecture, called the Convolutional Long Short-Term Memory
Deep Neural Networks (CLDNN) model, made up of a few convolutional layers, long short-term
memory (LSTM) RNN layers and fully connected (FC) layers, in that order. They
trained CLDNNs on the log-Mel filterbank energies [88] and on the raw waveform speech signal
[89] for speech recognition, and showed that both CLDNN models outperform a CNN and an LSTM
alone or combined. Likewise, Huang et al. [46] and Lim et al. [64] reported CLDNN-based
speech emotion recognition experiments, on log-Mels and spectrograms respectively, using similar
benchmark settings to highlight the superior performance resulting from end-to-end training.
Part II
Feature-Based Data Augmentation
Chapter 3
Emotion Discriminative Feature Formulation via Deep
Convolutional Neural Networks
3.1 Introduction
In a recent work, Sainath et al. [87] observed that under a moderately noisy condition, the
spectrally only convolutional operation (see Fig. 3.1 for details) degrades the performance. They
hypothesized that noise makes it difficult for local filters to learn translation invariance and thus
the local decisions are prone to error. In addition to studying the performance gain from end-to-end
training for speech emotion recognition, one of the goals of this chapter builds upon this
observation to quantitatively investigate whether convolutional operations of different types and
sizes could show robustness to noise for affective computing from speech signals.
In this chapter, we propose to characterize four types of convolutional operations in a CLDNN
for speech emotion recognition under both clean and noisy conditions. We employ log-Mels
and MFCCs as input to the proposed models depending on their spectral-temporal correlation.
In particular, we compare the spectral decorrelation power of one type of convolutional operation
with that of the discrete cosine transformation (DCT). In addition, we quantitatively analyze
the modules in the proposed CLDNN-based models in order to gain insights into the information
flow within the models.
The contributions in this chapter are multi-fold. First of all, in addition to demonstrating the
benefit of end-to-end training, we consider all commonly used convolutional operations to
offer a comprehensive understanding of discriminative representation learning for affective
information using deep neural networks, including the two types covered in [4]. Second, unlike
previous studies [4, 53] that increased the training corpus size internally, we perform data
augmentation with a noise corpus. As a result, we evaluate the proposed models under both clean
and noisy conditions to quantitatively measure the influence of noise on different types and sizes
of convolutional operations. Furthermore, we carry out module-wise evaluation to analyze the
information flow encoded in speech along the depth of an architecture. The content in this chapter
is covered in our published work [46] and [45].
The outline of this chapter is as follows. Section 3.2 reviews previous related work on speech
emotion recognition. Section 3.3 presents the architecture of the proposed models and Section 3.4
describes three competitive baseline models. Section 3.5 introduces the corpus and data augmen-
tation procedure. Section 3.6 details the experimental settings and the results are interpreted in
Section 3.7. Section 3.8 concludes this chapter.
3.2 Related Work
Before the present era of deep learning, speech emotion recognition systems prevalently relied on
a two-stage training approach, where feature engineering and classifier training were performed
separately. Commonly used hand-crafted features include pitch, MFCCs, log-Mels and the
recommended feature sets from the INTERSPEECH challenges. The support vector machine and
the extreme learning machine were two of the most competitive classifiers. For ease of model
comparison, Eyben et al. [27] summarized the performances of an SVM trained on the INTERSPEECH
challenge feature sets over several public corpora. Yan et al. [114] recently proposed a sparse
kernel reduced-rank regression (SKRRR) for bimodal emotion recognition from facial expressions
and speech, which has achieved one of the state-of-the-art performances on the eNTERFACE'05 [73]
corpus.
Han et al. [36] employed a multilayer perceptron (MLP), a network architecture of multiple
fully connected layers, to learn from spliced data frames, where the label for a spliced frame is
the same as that of the utterance it belongs to. The statistics of aggregated frame posteriors, such
as the minimum or maximum, are taken as utterance-level features. An MLP-ELM supervised by
these utterance features and corresponding labels is shown to outperform its MLP-SVM counterpart.
It has been known that emotion involves temporal variations of mental state. To exploit this
fact, Wöllmer et al. [111] and Metallinou et al. [75] conducted experiments at the conversation
level to show that human emotion depends on the context of a long-term temporal relationship
using an HMM and a Bi-directional LSTM (BLSTM), respectively. Lee et al. [61] posed speech
emotion recognition at the utterance level as a sequence learning problem and trained an LSTM
with a connectionist temporal classification (CTC) objective to align voiced frames with emotion
activation.
Deep CNN models were initially applied to computer vision related tasks and have achieved
many ground-breaking results [57, 116, 95, 102, 38]. Recently, researchers have started to
consider their use in the acoustic domain, including speech recognition [2, 62, 88, 89, 11], audio
event detection [42, 103] and speech emotion recognition [72, 4, 84]. Abdel-Hamid et al. [2]
concluded that one of the advantages of using CNNs to learn from less processed features such as
raw waveforms, spectrograms and log-Mels is their ability to reduce spectral variation, including
speaker and environmental variabilities; this capability is attributed to structures such as local
connectivity, weight sharing, and pooling. When training a CNN model for speech emotion
recognition, Mao et al. [72] proposed to learn the filters in a CNN on spectrally whitened
spectrograms. The learning, however, is carried out by a sparse auto-encoder in an unsupervised
fashion. Anand et al. [4] benchmarked two types of convolutional operations in their CNN-based
speech emotion recognition systems: the spectral-temporally convolutional operation and the
full-spectrum temporally convolutional operation (see Fig. 3.1 for details). Their results showed
the full-spectrum temporal convolution is more favorable for speech emotion recognition. They
also reported that an LSTM trained on the raw spectrograms could achieve a better performance
compared to the convolutional neural networks.
Recently, Sainath et al. proposed the CLDNN architecture for speech recognition based on
the log-Mels [88] and the raw waveform signal [89], in which both models have been shown to be
more competitive than an LSTM and a CNN model alone or combined. They also demonstrated
that with a sufficient amount of training data (roughly 2,000 hours), a CLDNN trained on the
raw waveform signal can match the one trained on the log-Mels. Moreover, they found the raw
waveform and the log-Mels in fact provide complementary information. Based on the CLDNN
architecture, Trigeorgis et al. [105] published a model using the raw waveform signal for continuous
emotion tracking. Huang et al. [46] trained a CLDNN model on the log-Mels for speech emotion
recognition and quantitatively analyzed the difference in spectrally decorrelating power between
the discrete cosine transformation and the convolutional operation. Lim et al. [64] repeated the
comparison between CNN, LSTM and CLDNN for speech emotion recognition using spectrograms.
Ma et al. [69] applied the CLDNN architecture to classifying depression based on the log-Mels
and spectrograms. They employed the full-spectrum temporally convolutional operation on the
log-Mels but the temporally only convolutional operation on the spectrograms.
On the multi-modal side, Zhang et al. [118] fine-tuned AlexNet on spectrograms and
images, separately, for audio-visual emotion recognition but only applied time-averaging for
temporal pooling. Tzirakis et al. [106] extended the uni-modal work in [105] to make use of visual
cues. They fine-tuned the pre-trained ResNet model [38] for facial expression recognition and
then re-trained the concatenated bimodal network with the LSTM layers re-initialized.
Figure 3.1: An overview of the proposed neural networks for speech emotion recognition. (a) The
X-Conv front-end operating on the log-Mels/MFCCs over time; (b) the LDNN sub-network (a
BLSTM layer followed by four FC layers) producing the emotion class; two X-Conv layers plus the
LDNN make up an X-CLDNN.
3.3 Deep Convolutional Recurrent Models
In this section we describe the proposed deep convolutional recurrent networks and the details of
structurally different convolutional operations on the log-Mels and the MFCCs. Fig. 3.1 illustrates
the overview of the models we design for speech emotion recognition. In Fig. 3.1 (a), we define
four types of convolutional operations depending on the shape of their feature maps. By dividing
the convolutional operations into four types, we expect to acquire a finer understanding of
their differences after they have been optimized to learn from the spectral-temporal signals. In
Fig. 3.1 (b), we depict a deep recurrent neural network, called the LDNN model, as the common
sub-network architecture for every model. Two convolutional layers together with the LDNN
sub-network make up a CLDNN architecture. As a convolutional layer is applied locally in time,
the LDNN model is supposed to model the long-term temporal relationship within an utterance.
We only consider spectral-temporal features as input to a CLDNN model. Specifically, an emotional
utterance u is represented by a sequence of spectral features {x_t^u}. These spectral features can
either be the log-Mels or the MFCCs depending on the application scenario. Overall, we present
eight models based on the combinations of the factors including the input features (the log-Mels
or the MFCCs) and the type of convolutional operations (spectral only, temporal only,
spectral-temporal or full-spectrum temporal). All models are presented for a comprehensive study
to understand the role a convolutional layer plays in learning the affective information in speech.
In the following subsections, we give a brief review of the convolutional and recurrent neural
layers and introduce the corresponding notations.
3.3.1 Types of Convolutional Operations
A convolutional neural layer Conv that receives an input tensor X ∈ R^(C×H_0×W_0) consists of a
convolutional function F_γ : R^(C×H_0×W_0) → R^(K×H_1×W_1), an activation function σ and an
optional pooling function F_ρ : R^(K×H_1×W_1) → R^(K×H_2×W_2).

The convolutional function F_γ is defined by K feature maps (h_k, b_k) ∈ R^(C×h×w) × R^(H_1×W_1)
of shape h×w, where the kij-th component of F_γ(X) is given as

    F_γ(X)_kij ≜ ⟨h_k, X_ij^(h×w)⟩ + b_kij
               = Σ_{c=0}^{C−1} Σ_{α=0}^{h−1} Σ_{β=0}^{w−1} X_ij[c, α, β] h_k[c, α, β] + b_kij,    (3.1)

in which X_ij[c, α, β] = X[c, i·s_γ + α, j·t_γ + β] and s_γ, t_γ are the strides, i.e. the amount of
shift, of the filters in the convolutional operation in their respective directions.

Likewise, it is straight-forward to formulate the pooling function F_ρ acting on an input
Y ∈ R^(K×H_1×W_1) through a filter of shape m×n by the component-wise definition:

    F_ρ(Y)_kij ≜ ρ(Y_kij^(m×n)),    (3.2)

where Y_kij^(m×n) ∈ R^(m×n) is a sub-tensor of Y lying on the k-th slice of Y with its first entry
aligned to Y[k, i·s_ρ, j·t_ρ], and ρ is the pooling operation, usually the max or the mean function.
Similarly, s_ρ and t_ρ are the strides of the filters in the pooling operations in their respective
directions. Typical choices of the activation function include the sigmoid function
σ(x) = 1/(1 + exp(−x)), the hyperbolic tangent function σ(x) = tanh(x) and the rectified linear
unit (ReLU) σ(x) = max(0, x).

Concisely, a convolutional neural layer can be summarized as a function composition

    Conv ≜ F_ρ ∘ (σ · F_γ),    (3.3)

where ∘ and · denote functional composition and element-wise application.
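For concreteness, the following NumPy sketch is a direct, deliberately unvectorized rendering of Eq. (3.1); it is an illustration only, with variable names mirroring the definitions above rather than any code used in the experiments.

import numpy as np

def conv_forward(X, h, b, s=1, t=1):
    """Eq. (3.1): X is (C, H0, W0), h is (K, C, fh, fw), b is (K, H1, W1)."""
    C, H0, W0 = X.shape
    K, _, fh, fw = h.shape
    H1, W1 = (H0 - fh) // s + 1, (W0 - fw) // t + 1
    out = np.empty((K, H1, W1))
    for k in range(K):
        for i in range(H1):
            for j in range(W1):
                patch = X[:, i * s:i * s + fh, j * t:j * t + fw]   # the sub-tensor X_ij
                out[k, i, j] = np.sum(patch * h[k]) + b[k, i, j]   # <h_k, X_ij> + b_kij
    return out

# Toy check: two 3x2 filters on a 5x4 single-channel input.
rng = np.random.default_rng(0)
X = rng.normal(size=(1, 5, 4))
h = rng.normal(size=(2, 1, 3, 2))
b = np.zeros((2, 3, 3))
print(conv_forward(X, h, b).shape)   # (2, 3, 3)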
We concentrate entirely on the convolutional function F_γ and adjust the pooling function F_ρ
accordingly. In particular, we are interested in the relationship between the acoustic emotional
pattern learnt by the model and the shape of the filter h_k in the feature maps. To this end, we
divide the shapes of the filters h_k into four categories to highlight their structural differences:
the full-spectrum temporally (FST-Conv), the spectral-temporally (ST-Conv), the temporally
only (T-Conv) and the spectrally only (S-Conv) convolutional operations. In what follows, we
mathematically define each category.

FST-Conv First of all, we consider filters of shape M×w for w ≥ 2, where M denotes the
number of spectral bands and w specifies the width on the temporal axis. Since this type of filter
covers the entire spectrum, it convolves with the input tensor only in the temporal direction
and as a result the pooling function can only perform temporal pooling. This type convolves with
global spectral information and models across neighboring frames.

ST-Conv An ST-Conv layer contains filters of shape h×w, where 2 ≤ h ≤ M−1 and w ≥ 2.
This type of filter observes local spectral-temporal information at a time and is free to convolve
with the input tensor in both directions. Accordingly, the pooling function also operates on the
convolved tensor through a two-dimensional filter.

T-Conv A T-Conv layer is similar to an FST-Conv layer except that the filters in a T-Conv layer
have a shape of 1×w for w ≥ 2. These filters convolve with the input tensor along the temporal
direction from one frequency band to another and ignore spectrally contextual information. The
pooling function acts on the convolved tensor along the temporal direction correspondingly.

S-Conv A spectrally only convolutional neural layer consists of filters of shape h×1, where
h ≥ 1, and the pooling function down-samples the convolved tensor along the spectral direction.
Note that the S-Conv type is closely related to traditional signal processing techniques; for
example, the DCT transformation from log-Mels to MFCCs belongs to this category when h = M,
except that the filters in the DCT are mathematically pre-defined; see Sec. 3.4.3 for more details.

For each type of the convolutional operations, we employ a stride of 1. Since our focus is on
the convolutional operations, we employ a fixed pooling size of 3 and a fixed stride of 2 in the
respective direction(s) of convolution. Table 3.1 summarizes the parameters for all Conv layers.
Table 3.1: A summary of the parameters for each model architecture. M denotes the spectral
dimensionality and var stands for variable parameters for tuning. The dash symbol indicates the
situation where the parameter tuning is not applicable.

              h     w     m     n     s_γ   t_γ   s_ρ   t_ρ
  S-Conv     var    1     3     1     1     1     2     1
  T-Conv      1    var    1     3     1     1     1     2
  ST-Conv    var   var    3     3     1     1     2     2
  FST-Conv    M    var    1     3     -     1     -     2
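To make the four filter geometries concrete, here is a minimal sketch in tf.keras (the original experiments used Keras on a Theano backend; the kernel sizes below are arbitrary placeholders rather than the tuned values):

# Sketch of the four X-Conv types of Table 3.1, assuming M = 40 log-Mel bands,
# a 16-frame context window and 32 feature maps per layer (Sec. 3.3.3).
import tensorflow as tf
from tensorflow.keras import layers

M, T, K = 40, 16, 32  # spectral bands, context frames, feature maps

def conv_block(conv_type, h=8, w=5):
    """Kernel and pooling geometry for one X-Conv layer."""
    if conv_type == "S":      # spectrally only: h x 1 kernel, spectral pooling
        kernel, pool, stride = (h, 1), (3, 1), (2, 1)
    elif conv_type == "T":    # temporally only: 1 x w kernel, temporal pooling
        kernel, pool, stride = (1, w), (1, 3), (1, 2)
    elif conv_type == "ST":   # spectral-temporal: h x w kernel, 2-D pooling
        kernel, pool, stride = (h, w), (3, 3), (2, 2)
    else:                     # "FST": full-spectrum temporal, M x w kernel
        kernel, pool, stride = (M, w), (1, 3), (1, 2)
    conv = layers.Conv2D(K, kernel, strides=1, activation="relu")
    pooling = layers.MaxPooling2D(pool_size=pool, strides=stride, padding="same")
    return conv, pooling

# Example: one ST-Conv layer applied to a batch of spliced log-Mel patches.
x = tf.random.normal((10, M, T, 1))        # batch of 10 patches of shape (M, T, 1)
conv, pooling = conv_block("ST", h=6, w=5) # hypothetical kernel size for illustration
y = pooling(conv(x))
print(y.shape)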
3.3.2 Deep Recurrent Neural Network
Suppose the input is a sequence of vectors {x_t}. The Elman-type simple recurrent neural network
(RNN) [26] is defined through the following equations:

    h_t = σ_h(U_hx x_t + U_hh h_{t−1} + u_h)    (3.4)
    y_t = σ_y(U_yh h_t + u_y),    (3.5)

where h_t, as a non-linear recurrent transformation of all past history {x_s}_{s=1}^{t}, represents
the system memory at time t, (U_ba, u_b) is an affine mapping from a space of type a to one of
type b, and σ_c is the activation function for type c. Here x, h and y denote the input, hidden
and output vectors, respectively. However, training a simple RNN with the back-propagation
algorithm may cause the issues of gradient vanishing or explosion. Although heuristic techniques
such as gradient clipping can alleviate the issue of gradient explosion, the gradient vanishing
problem is mitigated by an enhanced architecture: the LSTM architecture [43].
An LSTM is able to decide when to read from the input, to forget the memory or to write an
output by controlling a gating mechanism. By definition an LSTM learns the following internal
controlling functions:

    i_t = σ_i(U_ix x_t + U_is s_{t−1} + u_i)    (3.6)
    f_t = σ_f(U_fx x_t + U_fs s_{t−1} + u_f)    (3.7)
    o_t = σ_o(U_ox x_t + U_os s_{t−1} + u_o)    (3.8)
    g_t = tanh(U_gx x_t + U_gs s_{t−1} + u_g)    (3.9)
    c_t = c_{t−1} ⊙ f_t + g_t ⊙ i_t    (3.10)
    s_t = tanh(c_t) ⊙ o_t    (3.11)

where i, f, o, g, c and s represent the input, forget, output, gate, cell and output vectors,
respectively, and ⊙ denotes the element-wise product. In particular, the change from non-linear
multiplicative recurrence in Eq. (3.4) to linear additive recurrence in Eq. (3.10) theoretically
prevents gradients from vanishing during back-propagation of the error through time. Moreover,
studies have found that a BLSTM layer can further improve upon a unidirectional LSTM in
applications such as speech recognition [33], translation [99] and emotion recognition [111, 75]
as it fuses information from the past and the future.
Suppose an LSTM : R^(D_1×T) → R^(D_2×T) takes in a sequence {x_t}_{t=1}^{T} and returns
{y_t^f}_{t=1}^{T}, and another LSTM : R^(D_1×T) → R^(D_2×T) takes in the reversed sequence
{x_{T+1−t}}_{t=1}^{T} and returns {y_t^b}_{t=1}^{T}. A BLSTM : R^(D_1×T) × R^(D_1×T) → R^((2D_2)×T),
which is made of two LSTMs, runs on the two sequences {x_t}_{t=1}^{T} and {x_{T+1−t}}_{t=1}^{T} and
gives another sequence {z_t}_{t=1}^{T}, where z_t = [y_t^f; y_t^b] is the concatenation of y_t^f and y_t^b.
3.3.3 CLDNN-based Models
Before defining a variety of CLDNN-based models, we introduce a sub-network architecture shared
among them. The sub-network contains one BLSTM layer followed by four fully connected
feed-forward layers. Each direction of the BLSTM layer has 128 cells, so the BLSTM outputs a
sequence of vectors in R^256. We take a mean pooling over the output of the BLSTM layer to obtain
the utterance representation c, rather than using the output vector at the last time step. A dropout
mechanism [97] with probability 0.2 is fixed and applied to the representation c to regularize the
learning process. The four FC layers have sizes of 128, 32, 32 and N, respectively, where N denotes
the number of emotion classes; the first three FC layers are activated by the ReLU and the last
one by the softmax function for classification. This architecture based on (B)LSTM and FC layers
is conveniently called an LDNN model [88]. Note that we employ a BLSTM layer instead of an
LSTM layer as in [88] because it has been shown that the ability of a BLSTM to integrate future
information into representation learning is beneficial to emotion recognition.
Below the LDNN sub-network, there are two Conv layers. Each Conv layer has 32 feature maps,
each activated by the ReLU. Formally, we define the X-CLDNN model to be the LDNN sub-network
architecture specified above on top of two X-Conv layers, where X ∈ {S, T, ST, FST}.
A Conv layer is often said to be local because its feature maps, when computed at a local
region of the input tensor, depend only on the entries that the feature maps currently overlap
with. As a result, we expect the input tensor to preserve locality in both the spectral and temporal
directions in general. However, due to the aforementioned structural differences, it is reasonable
to relax this expectation a little accordingly. For example, an ST-Conv certainly requires its
input tensor to maintain spectral-temporal correlation locally, while a (FS)T-Conv and an S-Conv
only need such locality preservation in the temporal or spectral direction, respectively. Taking
this issue into consideration, we apply all four types of the Conv layer to the log-Mels and denote
the corresponding CLDNN-based models as X-CLDNN (log-Mels) for X ∈ {S, T, ST, FST}. On
the other hand, because the discrete cosine transformation decorrelates the spectral energies, the
MFCCs may not maintain locality in the spectral domain. Therefore, we apply only temporal
convolutional operations to the MFCCs and denote these CLDNN-based models as X-CLDNN
(MFCCs) for X ∈ {T, FST}.
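A compact sketch of the LDNN sub-network and one X-CLDNN variant is given below; it uses tf.keras (the original experiments used Keras with a Theano backend), and the input shape and kernel sizes are illustrative assumptions rather than the exact tuned configuration.

# Minimal sketch of the LDNN sub-network and an ST-CLDNN, assuming 40 log-Mel
# bands, a 16-frame splice and N = 6 emotion classes; kernel sizes are hypothetical.
import tensorflow as tf
from tensorflow.keras import layers, models

M, T_ctx, N = 40, 16, 6

def ldnn(x):
    """BLSTM (128 cells per direction) -> mean pooling -> dropout -> FC 128-32-32-N."""
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    c = layers.GlobalAveragePooling1D()(x)      # utterance representation c
    c = layers.Dropout(0.2)(c)
    for units in (128, 32, 32):
        c = layers.Dense(units, activation="relu")(c)
    return layers.Dense(N, activation="softmax")(c)

# ST-CLDNN: two ST-Conv layers (32 maps each) applied frame by frame, then the LDNN.
inp = layers.Input(shape=(None, M, T_ctx, 1))   # (time, spectral, context, channel)
h = layers.TimeDistributed(layers.Conv2D(32, (6, 5), activation="relu"))(inp)
h = layers.TimeDistributed(layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))(h)
h = layers.TimeDistributed(layers.Conv2D(32, (3, 2), activation="relu"))(h)
h = layers.TimeDistributed(layers.Flatten())(h)
model = models.Model(inp, ldnn(h))
model.summary()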
3.4 Baseline Models
We evaluate our CLDNN-based models for understanding the convolutional operations by comparing
them with three baseline models on a speech emotion recognition task. The first baseline model
uses the low-level descriptors and their statistical functionals within an utterance to train a
support vector machine. The other two baseline models are based on BLSTM recurrent neural
networks and take the log-Mels and the MFCCs as their input, respectively.
3.4.1 Support vector machine with the Low-Level Descriptors and Their
Statistical Functionals
Many scientific studies of speech have empirically found emotion-correlating parameters, also
known as the low-level descriptors (LLDs), along different aspects of phonation and articulation in
speech, such as speech rate in the time domain, fundamental frequency or formant frequencies in
the frequency domain, intensity or energy in the amplitude domain, and relative energy in different
frequency bands in the spectral energy domain. Furthermore, statistical functionals over an entire
emotional utterance are derived from the LLDs to obtain global information, complementary to
the local information captured by frame-level LLDs. Popular selections of these parameters for
developing machine learning algorithms in practical applications often amount to several thousands
of features. For example, in the INTERSPEECH 2013 computational paralinguistics challenge,
the recommended feature set contains 6,373 parameters of LLDs and statistical functionals
altogether [94]. Fortunately, researchers have identified the support vector machine as one of the
most effective machine learners for using these hand-crafted high-dimensional features [27].
To make our work comparable to the published results, we set up the first baseline model similar
to the evaluation experiments conducted in [27]. We use the openSMILE toolkit [28] to extract
the acoustic feature sets of the INTERSPEECH Challenges from 2009 to 2013, including the
Emotion Challenge (EC, 384 parameters), Paralinguistic Challenge (PC, 1582 parameters), Speaker
State Challenge (SSC, 4368 parameters), Speaker Trait Challenge (STC, 5757 parameters) and
Computational Paralinguistic ChallengE (ComParE, 6373 parameters). On each of these feature
sets, we train an SVM for speech emotion recognition.
3.4.2 LDNN with the log-Mels
As suggested by previous studies [92, 111, 75, 61], explicit temporal modeling is beneficial for
speech emotion recognition, and a recurrent neural network is a better choice than a hidden
Markov model for its outstanding ability to model longer-term temporal relationships. Meanwhile,
in order to build a baseline model that is both competitive and compatible with the CLDNN-based
models, we take the LDNN architecture defined in Sec. 3.3.3 as our second baseline model. In
particular, we use the log-Mels as the input to the LDNN model as the "raw" feature set without
temporal or spectral convolutional operations. We denote this model as the LDNN (log-Mels).
3.4.3 LDNN with the MFCCs
MFCCs are related to log-Mels via a mathematical construct: the discrete cosine transformation
(DCT). Specifically, the relationship is defined as the following:

    MFCC[k] = Σ_{m=0}^{M−1} log-Mel[m] · cos( (πk/M) · (m + 1/2) ),    (3.12)

where MFCC[k] and log-Mel[m] are the k-th and the m-th coefficients of MFCCs and log-Mels,
respectively, and M is the number of the Mel-scaled filter banks.

We can easily convert Eq. (3.12) into a convolutional operation along the spectral direction, in
which scenario all feature maps are thus tensors of shape M×1. For the k-th feature map h_k, its
m-th component

    h_k[m] = cos( (πk/M) · (m + 1/2) )    (3.13)

is pre-defined mathematically based on the prior knowledge of signal processing, rather than
task-specifically learnt from training samples. With this development, Eq. (3.12) can be succinctly
summarized as

    MFCC ≜ DCT-Conv (log-Mel),    (3.14)

where DCT-Conv represents the mathematically pre-defined spectrally only convolutional layer
transforming log-Mels into MFCCs. Note that the properties of a conventional convolutional
layer, such as the pooling function and the non-linear activation function, are missing in this
special configuration of a convolutional layer. In fact, there is no convolutional operation per
se. Nevertheless, the purpose for this identification of DCT as a convolutional operation is to
encapsulate this spectral modeling into the language of convolutional operations, to help us focus
on the difference among various convolutional operations and mostly to contrast DCT with the
S-Conv layer.
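A small NumPy sketch of this identification follows (an illustration under the assumption of M = 40 Mel bands and 13 retained coefficients, not the feature-extraction pipeline used in the experiments): the DCT matrix of Eq. (3.13) acts as a bank of fixed M×1 spectral filters.

import numpy as np

M, n_mfcc = 40, 13                       # Mel bands and retained MFCC orders (Sec. 3.6.2)

# Fixed "feature maps" h_k[m] = cos(pi * k / M * (m + 0.5)), Eq. (3.13)
k = np.arange(n_mfcc)[:, None]           # shape (13, 1)
m = np.arange(M)[None, :]                # shape (1, 40)
dct_filters = np.cos(np.pi * k / M * (m + 0.5))

log_mels = np.random.randn(M, 100)       # a toy utterance: 40 bands x 100 frames
mfccs = dct_filters @ log_mels           # Eq. (3.14): DCT-Conv applied frame by frame
print(mfccs.shape)                       # (13, 100)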
Our third baseline model is an LDNN model which takes the MFCCs as its input. Similarly,
we denote this model as the LDNN (MFCCs). By comparing the performances of the LDNN
(MFCCs) and the S-CLDNN (log-Mels), we are able to quantitatively demonstrate the advantages
of the S-Conv layer over the DCT-Conv layer.
3.5 Databases Description
3.5.1 The Clean Set
We use the eNTERFACE'05 emotion database [73], a publicly available multi-modal corpus of
elicited emotional utterances, to evaluate the performance of our proposed models. Although the
entire database contains speech, facial expressions and text, in this work we only conduct
experiments on the audio modality. This database includes 42 subjects from 14 different countries,
of whom 34 were male and 8 were female. Each subject was asked to listen carefully to 6 short
stories, each of which was designed to elicit a particular emotion from among the 6 archetypal
emotions defined by Ekman et al. [23]. The subjects then reacted to each of the scenarios to
express their emotion according to a proposed script in English. Each subject was asked to speak
five utterances per emotion class for the 6 emotion classes (anger, disgust, fear, happiness, sadness,
and surprise). For each recorded emotional utterance, there is one corresponding global label
describing the affective information conveyed by the whole utterance. The resulting corpus,
however, is slightly unbalanced in the emotion class distribution because subject 23 has only two
utterances portraying happiness, so the total number of emotional utterances in this corpus is
1,257. We call the set of these 1,257 utterances the clean set. The average length of the utterances
is around 2.78 seconds, and the total duration of the clean set amounts to roughly 0.97 hours. We
believe it is the moderate number of speakers and the variety of their cultural backgrounds that
render it one of the most popular corpora for benchmarking speech emotion recognition models.
3.5.2 The Noisy Set
Deep neural networks have a well-known reputation of being data-hungry. Despite the
aforementioned diversity, the clean set alone is not large enough to train a deep neural network as
big as a CLDNN, for doing so would incur a high risk of over-fitting. Various techniques have been
proposed to implicitly or explicitly regularize the training process of deep neural networks in
order to prevent over-fitting as well as to improve the generalization performance, such as dropout
[97], early-stopping [70], data augmentation [53], transfer learning [76] and the recent group
convolution approach [15, 22]. In addition to the dropout mechanism and the early-stopping
strategy, we also adopt the data augmentation approach to artificially increase the number of data
samples for the purpose of implicit regularization. To be precise, we aggressively mix samples from
the clean set with samples from another publicly available database, the MUSAN corpus [96], at a
few randomly chosen levels of signal-to-noise ratio (SNR).
The MUSAN corpus consists of three portions: music, speech and noise. As speech and music
may inherently convey affective information, mixing samples from these two portions with clean
emotional utterances would unnecessarily complicate the learning process and would possibly
result in a suboptimal system due to a mixture of inconsistent emotion types. Therefore, to avoid
adding confounding factors to the clean emotional utterances, we only use the noise portion of the
MUSAN corpus for data augmentation. The noise portion contains 929 samples of assorted noise
types, including technical noises, such as dual-tone multi-frequency (DTMF) tones, dialtones, fax
machine noises, and ambient sounds, such as car idling, thunder, wind, footsteps, paper rustling,
rain, animal noises, and so forth. The total duration of the noise portion is about 6 hours.
We generate artificially corrupted data based on the clean set using the following recipe. For each
clean utterance, 20 noise samples are uniformly selected from the noise portion and 3 levels of
the SNR are uniformly chosen from the interval [10, 15]. Mixing the clean utterance with the 60
combinations of the 20 noise samples and 3 SNR levels augments the clean set by a factor of 61.
Note that randomly selecting samples from the noise portion gives an advantage over simply using
a fixed subset of the noise portion. Due to the stochasticity, the probability of choosing the same
set of 20 noise samples is on the order of one in C(929, 20) ≈ 7×10^40, which makes a repetition
practically impossible. By carefully eliminating potential artificial patterns, we hope the deep
neural networks can capture the true underlying acoustic emotion prototypes.
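The recipe can be summarized by the sketch below; it is a simplified illustration with synthetic signals standing in for real recordings, and the power-based SNR mixing rule is an assumption since the exact mixing tool is not specified in the text.

# 20 noise samples x 3 SNR levels per clean utterance (a 61x augmentation counting
# the clean original).
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals `snr_db`."""
    noise = np.resize(noise, clean.shape)               # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

def augment(clean, noise_pool, n_noise=20, n_snr=3, snr_range=(10, 15)):
    """Return the 60 corrupted versions of one clean utterance."""
    noise_ids = rng.choice(len(noise_pool), size=n_noise, replace=False)
    snrs = rng.uniform(*snr_range, size=n_snr)
    return [mix_at_snr(clean, noise_pool[i], snr) for i in noise_ids for snr in snrs]

# Toy usage with synthetic placeholders for the clean utterance and the 929 noises.
clean = rng.normal(size=16000)
noise_pool = [rng.normal(size=24000) for _ in range(929)]
noisy_versions = augment(clean, noise_pool)
print(len(noisy_versions))                               # 60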
We call this set of the resulting 75,420 noisy utterances the noisy set, as opposed to the clean
set defined above. The total duration of the noisy set is about 58.25 hours. In the following
sections, when referring to the clean condition, we mean the experiments are conducted on the
clean set; on the other hand, when referring to the noisy condition, we mean they are conducted
on the union of the noisy and the clean sets. Moreover, we further randomly divide the set of
subjects into training, validation and testing (TVT) partitions under the percentage constraint of
70:10:20, respectively, for experimental convenience. Partitioning the subject set, instead of the
utterance set, allows us to maintain speaker independence across all experiments.
3.6 Speech Emotion Recognition Experiments
In this section, we evaluate the proposed models with the following experiments:
1. Baseline models
(a) SVM with openSMILE features
(b) LDNN (MFCCs)
(c) LDNN (log-Mels)
2. CLDNN-based models
(a) T-CLDNN (MFCCs)
(b) FST-CLDNN (MFCCs)
(c) T-CLDNN (log-Mels)
(d) S-CLDNN (log-Mels)
(e) ST-CLDNN (log-Mels)
(f) FST-CLDNN (log-Mels)
The purposes of these experiments are multi-fold. The comparison between the baseline models
and the CLDNN-based models aims to demonstrate the effectiveness of the convolutional operations
in learning the affective information. Within the category of CLDNN-based models, the goal is to
quantify the difference between the types of convolutional operations.
3.6.1 SVM with openSMILE features
For the first set of baseline experiments, we employ two evaluation strategies. In the first one,
we perform a leave-one-subject-out (LOSO) cross validation. Since we train our deep neural
network models using the TVT partitions, the second strategy evaluates the performances of the
SVM classifiers on the TVT partitions for a fair comparison. In addition, we also apply the regular
pre-processing procedures, including speaker standardization for removing speaker characteristics
and class weighting for the slight class imbalance. We conduct the baseline experiments using SVM
classifiers trained on the acoustic feature sets from the past INTERSPEECH challenges. The SVM
classifiers are trained on these hand-crafted high-dimensional features using the Scikit-Learn
machine learning toolkit [81] with linear, polynomial and radial basis function (RBF) kernels. All
of the SVM experiments are conducted under the clean condition.
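A minimal scikit-learn sketch of this baseline (illustrative only; the exact hyper-parameters and the speaker standardization routine are assumptions, and random arrays stand in for the openSMILE features) could look as follows:

# Per-speaker standardization, class weighting, and an RBF-kernel SVM evaluated
# with leave-one-subject-out cross validation; balanced accuracy equals UA.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1257, 384))          # stand-in for the EC (IS09) feature set
y = rng.integers(0, 6, size=1257)         # stand-in emotion labels
speakers = rng.integers(0, 42, size=1257) # stand-in subject ids

# Speaker standardization: z-normalize each speaker's features independently.
for s in np.unique(speakers):
    idx = speakers == s
    X[idx] = (X[idx] - X[idx].mean(axis=0)) / (X[idx].std(axis=0) + 1e-8)

clf = SVC(kernel="rbf", class_weight="balanced")
scores = cross_val_score(clf, X, y, groups=speakers,
                         cv=LeaveOneGroupOut(), scoring="balanced_accuracy")
print(scores.mean())                       # mean UA over the LOSO folds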
3.6.2 CLDNN-based Models with the MFCCs and the log-Mels
To begin with, we extract the log-Mels and the MFCCs using the KALDI toolkit [82] with a
window size of 25 ms and a window shift of 10 ms. In both cases, the number of Mel-frequency
filterbanks is chosen to be 40. It has been shown [27] that, due to the strong energy compaction
property of the discrete cosine transformation, the lower order MFCCs are more important for
affective and paralinguistic analysis, while the higher order MFCCs are more related to phonetic
content understanding. In fact, the INTERSPEECH challenge feature sets contain the first
12-14 orders of MFCCs; however, the Geneva Minimalistic Acoustic Parameter Set (GeMAPS)
[27] recommends the use of only the first 4 orders of the MFCCs. In this work, we keep the
conventional first 13 coefficients when computing the MFCCs. After feature extraction, we splice
the raw log-Mels and raw MFCCs with a context of 10 frames on the left and 5 frames on the right.
At this point, each spliced log-Mel or spliced MFCC $x_t$ lives in $\mathbb{R}^{40\cdot16}$ or $\mathbb{R}^{13\cdot16}$, respectively. An
emotional utterance is now represented as a sequence of spliced spectral vectors $\{x_t\}$. We train
the LDNN (log-Mels) and the LDNN (MFCCs) as depicted in Fig. 3.1 with their corresponding
inputs.
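The splicing step can be sketched as follows: given a (T, D) matrix of frame-level features, each frame is concatenated with its 10 left and 5 right context frames. The function name and the edge-padding choice are illustrative assumptions, not the exact implementation used here.

import numpy as np

def splice_frames(feats, left=10, right=5):
    """Concatenate each frame with its context: (T, D) -> (T, D * (left + 1 + right))."""
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    # Each spliced vector stacks left+1+right consecutive frames, e.g. 40-dim
    # log-Mels with a 16-frame context become 640-dimensional vectors.
    return np.stack([padded[t:t + left + 1 + right].reshape(-1) for t in range(T)])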
Table 3.2: A summary of the ranges for parameter tuning on each type of the convolutional layers,
where M denotes the spectral dimensionality and the subscripts of h and w correspond to the
first and the second convolutional layers, respectively.

            h1     w1     h2     w2
T-Conv      1      3-8    1      2-3
S-Conv      4-9    1      3-4    1
ST-Conv     4-9    3-8    3-4    2-3
FST-Conv    M      3-8    1      2-3
In order to accommodate the inputs to the various CLDNN models in Fig. 3.1, we further reshape
each $x_t$ into a matrix $X_t$ with the shape of $40 \times 16$ or $13 \times 16$. We train the X-CLDNN (log-Mels)
and X-CLDNN (MFCCs) on the emotional utterances $\{X^u_t\}$ for each training utterance $u$ and
for X $\in$ {S, T, ST, FST}. The ranges of the tunable parameters for the convolutional layers are
summarized in Table 3.2, where as shown we focus mostly on the first Conv layer. We exhaust
all of the parameter combinations for the S-Conv, T-Conv and FST-Conv types when tuning the
architectural parameters. Note, however, that the search space of the optimal parameter set for
the ST-Conv is rather large. Therefore, instead of exploring all of the combinations aimlessly, we
limit our attention to the combinations of the top k parameters from the S-Conv and T-Conv.
We use the Keras library [14] on top of the Theano [104] backend to specify the network
architectures and execute the learning processes on an NVIDIA K40 Kepler GPU. The weights
of all deep neural network models are learned by minimizing the cross-entropy objective, using
the Adam method [55] to adjust the parameters in the stochastic optimization with an initial
learning rate of 0.001. The mini-batch size is fixed to 10 due to the capacity of the GPU
memory as well as the pursuit of better generalization [54]. An early-stopping strategy
[70] with a patience of 3 epochs is employed to avoid over-training. We train all deep neural
network models with the emotional utterances in the training partition under the noisy condition;
we perform parameter tuning on the validation partition, and the most competitive model on the
validation partition under the noisy (clean) condition is tested under the noisy (clean) condition,
respectively.
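A condensed sketch of this optimization setup, written against the Keras API of that period with the Theano backend: the model-building helper build_cldnn(), the training arrays and the epoch budget are placeholders and assumptions, not the exact code of this work.

from keras.callbacks import EarlyStopping
from keras.optimizers import Adam

# build_cldnn() is assumed to return a Keras model mapping spliced spectral
# inputs to the 6 emotion classes; its layer stack is defined elsewhere.
model = build_cldnn()
model.compile(optimizer=Adam(lr=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Mini-batches of 10 samples, early stopping with a patience of 3 epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=3)
model.fit(x_train, y_train,
          batch_size=10,
          epochs=100,                      # epoch budget is illustrative
          validation_data=(x_valid, y_valid),
          callbacks=[early_stop])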
3.7 Experimental Results
We present our experimental results for speech emotion recognition in this section. Even though
the class imbalance in the corpus is insignificant, throughout the entire section we use the un-
weighted accuracy (UA) as the performance metric to avoid being biased toward the larger classes.
3.7.1 SVM with openSMILE features
Table 3.3 summarizes the results of using SVM classifiers to identify the emotion class of an
emotional utterance with one of the 6 archetypal emotions. Based on the LOSO evaluation
strategy, an SVM with the STC feature set gives the best baseline performance, while under the
TVT evaluation strategy, an SVM with the ComParE feature set stands out among the other feature
sets. It is clear from these results that an SVM learns better from higher-dimensional feature sets
such as the ComParE and the STC sets, which is consistent with the phenomenon observed in [27].
Yan et al. [114] recently published a baseline result on the eNTERFACE'05 corpus using the
PC feature set. They trained an SVM classifier on the PC feature set with a speaker-dependent
five-fold cross validation evaluation strategy as one of their baseline models. Their baseline work
is comparable to ours, and is included in Table 3.3 as well.
Table 3.3: The SVM baseline performance (UA (%)) based on the leave-one-subject-out (LOSO)
cross validation and on the training-validation-testing (TVT) partitions using the acoustic feature
sets from past INTERSPEECH challenges.*

                    EC       PC       SSC      STC      ComParE
LOSO                66.61    73.87    79.19    81.18    80.45
TVT                 70.83    71.66    77.92    80.00    80.83
Yan et al. [114]    --       74.21    --       --       --

* Emotion Challenge (EC), Paralinguistic Challenge (PC), Speaker State Challenge (SSC),
  Speaker Trait Challenge (STC), Computational Paralinguistic ChallengE (ComParE)
Table 3.4: The performances (UA (%)) of the optimal SVM model, the LDNN-based models and
the CLDNN-based models. The sparse kernel reduced rank regression (SKRRR) [114] is one of
the state-of-the-art models on the eNTERFACE'05 corpus.

Model (features)          noisy    clean
SVM (ComParE)             --       80.83
SKRRR [114]               --       87.46
LDNN (MFCCs)              75.51    88.33
LDNN (log-Mels)           78.87    90.42
T-CLDNN (MFCCs)           83.44    87.92
FST-CLDNN (MFCCs)         84.45    92.92
T-CLDNN (log-Mels)        84.23    92.92
S-CLDNN (log-Mels)        82.73    91.67
ST-CLDNN (log-Mels)       84.26    93.75
FST-CLDNN (log-Mels)      86.21    94.58
3.7.2 LDNN with the MFCCs and the log-Mels
We present the results of the LDNN-based models in Table 3.4. Under the noisy condition, the
LDNN (MFCCs) and the LDNN (log-Mels) models are able to accurately classify 75.51% and
78.87% of the testing samples, respectively. Under the clean condition, they give performances
of 88.33% and 90.42%, respectively. One can easily observe a gap of 3.36% and
2.09%, respectively, between the LDNN (MFCCs) and the LDNN (log-Mels) under each condition. Since
the MFCCs are DCT-transformed log-Mels, this implies that the DCT may have removed a certain amount
of affective information when transforming the log-Mels into the MFCCs. The widened gap under
the noisy condition also suggests that MFCCs are more sensitive to noise than log-Mels, which
renders learning from MFCCs a more challenging task. Nevertheless, both LDNN models achieve
promising results comparable to those of one of the state-of-the-art models on the eNTERFACE'05
corpus, the sparse kernel reduced rank regression (SKRRR) [114].
3.7.3 CLDNN with the MFCCs and the log-Mels
Finally, Table 3.4 also presents the effectiveness of the CLDNN-based models for classifying emo-
tional utterances into one of the 6 archetypal emotions. First of all, notice that with the CNN
layers all CLDNN-based models improve upon their LDNN-based counterparts under both noisy
and clean conditions, except that the T-CLDNN (MFCCs) results in a slightly inferior perfor-
mance under the clean condition. Since MFCCs are rather sensitive to noise, it is likely that
the T-Conv layers are mainly optimized to reduce prominent variations due to the artificial noise
while neglecting other subtle factors of variation such as speaker or gender. Yet, the result from
the FST-CLDNN (MFCCs) also suggests that the MFCCs still contain a reasonable amount of
affective information which is learnable by a suitable architecture.

Among the X-CLDNN (log-Mels) models, the order of performances from high to low is the
FST-CLDNN (log-Mels), the ST-CLDNN (log-Mels), the T-CLDNN (log-Mels) and the S-CLDNN
(log-Mels). The fact that the FST-Conv outperforms the ST-Conv is consistent with the conclu-
sion from [4] under the clean condition. However, the margin is not as significant when there is
an LDNN sub-network to help with temporal modeling. It has been reported that the S-Conv
layer in an S-CLDNN (log-Mels) would degrade the performance for speech recognition under a
moderately noisy condition [87]. The authors attributed this deterioration to the noise-enhanced
difficulty for local filters of small sizes to make decisions when learning to capture translational
invariance. This attribution seems valid when we contrast the FST-Conv with the other three
types. Actually, if we take a closer look, we can easily discover a varying degree of
enhanced difficulty across the types of convolutional operations, in which the S-Conv suffers from noise
the most, followed by the T-Conv and the ST-Conv to a roughly equivalent degree, and finally
the FST-Conv the least. Even though we validate on the clean validation partition for selecting
the model to be tested on the clean testing partition, the performances under the clean condition
demonstrate a similar trend influenced by noise, since we carried out the training process under
the noisy condition.
One of our goals is to benchmark the strength of the S-Conv and the discrete cosine trans-
formation for spectral modeling. Specifically, the fair comparison should be between the LDNN
(MFCCs) and the S-CLDNN (log-Mels), where the DCT-CNN and the S-Conv layers, respec-
tively, act on the spliced log-Mels along the spectral direction, and both of them have an LDNN
sub-network for further temporal modeling. Despite the negative impact of noise on the S-Conv layer,
it is interesting to observe a stark performance gap between them under the noisy con-
dition. Even under the clean condition, the S-CLDNN (log-Mels) still has a leading margin of
more than 3%. Due to its task independence, the DCT is not particularly designed to decorrelate the
affective information from the other factors. Moreover, since the DCT-CNN layer is shallow and
structurally simple, the S-Conv layer has an advantage over the DCT as it is deeper and thus better
at disentangling the underlying factors of variation [80, 31, 32]. This strength is manifested
most especially when it comes to the noise-related factors. Given that the MFCCs still carry a
reasonable amount of affective information, these significant differences in performance between
the S-Conv and the DCT can best be explained by the inability of the DCT to adequately disentangle
the affective information from other irrelevant factors of variation.
Last but not least, we notice that the temporally convolutional operations and the temporally
recurrent operations learn complementary information. For instance, the LDNN (log-Mels)
models the evolution of affective information through temporal recurrence alone, while the FST-
CLDNN (log-Mels) does so by fitting itself to the dynamics via temporal convolution and then
temporal recurrence, which improves upon the LDNN (log-Mels) and results in a more competitive
system.
3.7.4 Finer hyper-parameter search on the spectral axis
We have seen the negative effect of noise on the S-Conv for speech emotion recognition in Table 3.4
and for speech recognition in [87]. The authors hypothesized that noise increases the difficulty for
these local filters to correctly capture translational invariance. On the other hand, the performance
shown by the FST-CLDNN (log-Mels) model suggests that global information over the entire
spectrum helps to learn a better representation. To gain more insight into how convolving with
more spectral information contributes to affective learning, we further conduct an extensive search
on the spectral axis for the optimal kernel size in the first convolutional layer of the S-CLDNN
(log-Mels) model, i.e. $h_1$ in Table 3.2. For the search, we fix $h_2 = 3$ and keep the pooling hyper-
parameters the same as in Sec. 3.3.1. We iterate the filter height $h_1$ through all possible sizes from
4 to 30 (to allow pooling and convolution in the second layer).

Figure 3.2: Unweighted accuracy, UA (%), of the S-CLDNN (log-Mels) model on the validation
partition under noisy and clean conditions with respect to different kernel sizes $h_1$ in the first
convolutional layer. The curves in red are median-filtered UAs.
Fig. 3.2 depicts the validation UA under the clean and noisy conditions with respect to different
kernel sizes $h_1$. Although highly fluctuating, possibly due to the influence of noise, the accuracy
indeed improves with a larger kernel size until it peaks at $h_1 = 22$ for both conditions,
and increasing the kernel size beyond 22 does not result in any further improvement. Second,
from the median-filtered curves, the S-Conv is able to benefit more under the noisy condition
from having a larger kernel size, specifically $h_1 > 18$ in Fig. 3.2, which suggests a potential phase
transition from small to large filters; however, such a pattern is not equally significant under
the clean condition, as the curve is relatively flat. Third, when $h_1 = 22$, the respective test UAs are
85.87% and 95.42% under the noisy and clean conditions. Despite the outstanding performance
under the clean condition, when compared with the FST-CLDNN (log-Mels) model, these results
further highlight the influence of noise on the S-Conv operation as well as the robustness of two-
dimensional filters to noise [11], even though it has convolved with the optimal amount of spectral
information.
Overall, this extended set of experiments demonstrates one of the advantages of convolving
with more spectral information, emphasizing the ability to counter the negative effect of noise
in learning. Since the S-Conv shows two characterizations with small and large filter sizes, and since
convolving with more spectral information is one of the characteristics of the FST-Conv in addition
to being two-dimensional, we will continue to refer to the S-Conv as defined in Table 3.2
for consistency in this work.
3.7.5 Module-wise evaluations
We have so far analyzed the proposed models from an end-to-end perspective and observed inter-
esting phenomena. Although this kind of external analysis has distilled certain working knowledge,
what we are equally interested in is the internal mechanism within these models. Along these
lines, a key step is to track the flow of relevant information using techniques such as an information
regularizer [44] or layer-wise evaluation [3, 46]. In this work, we take the second approach due to
its simplicity. To make it clear, we only evaluate the intermediate representations at the module
level, where by module we mean the CNN module (two Conv layers), the BLSTM module (a
BLSTM layer) and the multi-layer perceptron (MLP) module (four FC layers) that make up a
CLDNN model.

To begin with, we take the trained CLDNN-based models as feature extractors and the
activated responses of each layer as the discriminative features. For each CLDNN model, we
only keep the extraction from the output layer of each module. In addition, the raw spectral-
temporal features are presented to serve as the lower bound. A mean pooling over the temporal
direction is applied to the raw features, the output of the CNN module and the output of the
BLSTM module to form an utterance representation for each of them. In order to quantify the
improvement of the representations for speech emotion recognition achieved by each module, we
train an SVM classifier on the utterance representation from the output of each module as well as
on the raw features. The experimental setting is similar to the SVM baseline, where only the clean set
is used and the evaluation is based on the TVT strategy.
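The module-wise probing can be sketched as follows: a trained model is truncated at a module's output, the activations are mean-pooled over time into utterance vectors, and a linear SVM is trained on them. This is a minimal sketch against the Keras 2 functional API; the layer names and shapes are illustrative assumptions.

import numpy as np
from keras.models import Model
from sklearn.svm import SVC

def module_features(trained_model, layer_name, utterances):
    """Extract mean-pooled utterance vectors from a named module output."""
    probe = Model(inputs=trained_model.input,
                  outputs=trained_model.get_layer(layer_name).output)
    feats = []
    for utt in utterances:                        # utt: one utterance, (T, ...)
        act = probe.predict(utt[np.newaxis])[0]   # module activations for that utterance
        act = act.reshape(act.shape[0], -1)       # flatten any spectral/channel axes per frame
        feats.append(act.mean(axis=0))            # mean pooling over the temporal direction
    return np.stack(feats)

# Example probing of the BLSTM module output ("blstm_out" is a hypothetical layer name):
# X_tr = module_features(model, "blstm_out", train_utts)
# clf = SVC(kernel="linear").fit(X_tr, y_tr)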
3.7.5.1 Quantitative Analysis
Table 3.5: The performances (UA (%)) of an SVM classifier trained on the spliced log-Mels, the
spliced MFCCs and the output of each module from all CLDNN-based models under the clean
condition.

Model (features)        Raw      CNN      BLSTM    MLP
T-CLDNN (MFCCs)         23.75    52.50    88.75    88.75
FST-CLDNN (MFCCs)       23.75    56.25    88.75    92.50
T-CLDNN (log-Mels)      27.92    59.17    93.33    93.33
S-CLDNN (log-Mels)      27.92    45.83    88.33    91.67
ST-CLDNN (log-Mels)     27.92    55.83    89.17    93.75
FST-CLDNN (log-Mels)    27.92    54.17    89.17    94.58
Table 3.5 summarizes the results of the module-wise evaluation. As shown in the second
column, even though the training and the testing are carried out under the clean condition, the
discrete cosine transformation degrades the performance once again. Nevertheless, most of the
CNN modules help to lift the discriminative power to around 55% regardless of the raw
features, except for a particularly under-performing model, the S-CLDNN (log-Mels), which, based
on the previous analysis, is known to suffer drastically from noise. One can easily observe that
each type of Conv layer learns a different representation and hence results in a different
level of discriminative power.

It is interesting to note that the SVMs trained on the activations of the CNN module in the
{T, ST}-CLDNN (log-Mels) give a better accuracy than that based on the FST-CLDNN (log-
Mels), but from a holistic perspective the FST-Conv based system is the most robust one. This
may reflect one of the biggest advantages of the end-to-end training approach over the traditional
layer-wise approach, which works on feature engineering and classifier training separately; i.e. a
greedy layer-wise training that forces the distribution of an intermediate layer to prematurely
approximate the distribution of the label is likely to result in a suboptimal system.
Going deeper into the networks, we can see that most of the BLSTM modules further improve
the discriminative power to the level of 88-89%, except for the T-CLDNN (log-Mels). In fact, as we
take a closer look at the T-CLDNN (MFCCs) and the T-CLDNN (log-Mels), we find that they both
attain one of their optimal forms of affective representation at the output of the BLSTM module.
Instead of implying that their MLP modules have done nothing based on the constant performance, it
may suggest that their MLP modules are integrating out irrelevant information while maintaining
the optimal representation. Finally, in the other CLDNN models, the MLP modules further refine
the representation to make the prediction an easier task. To sum up, in terms of the UA, the
contributions from the CNN module, the BLSTM module and the MLP module are 27.43 ± 5.18%,
35.63 ± 3.61% and 2.85 ± 2.32%, respectively.
3.8 Conclusion
In this chapter, we demonstrated the performance gain of end-to-end training for speech emo-
tion recognition through a direct comparison of the LDNN and CLDNN models with an SVM
trained on hand-crafted acoustic features. We reported the benchmarking of four types of convolu-
tional operations in deep convolutional recurrent neural networks for speech emotion recognition,
including the spectral-only, the temporal-only, the spectral-temporal, and the full-spectrum
temporal convolutional operations. We found that these types suffer from noise to varying degrees,
in which noise negatively influences the S-Conv the most, followed by the T-Conv and the ST-
Conv, and the FST-Conv the least. Under both conditions, the FST-Conv outperforms all of
the other three types, as well as one of the state-of-the-art models under the clean condition. A set of
extended experiments further shows that an insufficient amount of spectral information is the major
reason for the negative influence of noise on the S-Conv. However, without temporal
convolution, the S-Conv with larger filters is still not as robust to noise as the FST-Conv.
Even though the S-Conv is the weakest type, the comparison between the S-CLDNN (log-Mels)
and the LDNN (MFCCs) shows a significant performance gap between them, which can mostly be
attributed to the difference between the S-Conv and the discrete cosine transformation. On the
other hand, the FST-CLDNN (MFCCs) is still able to achieve a reasonably good accuracy. These
two experiments suggest that although the DCT may discard a certain amount of affective information,
the loss does not entirely account for the performance gap. However, we may link the mediocre
performance of the LDNN (MFCCs) to the inability of the DCT to adequately disentangle the affective
information from other correlated irrelevant factors of variation, such as speaker and gender
differences and those caused by noise. Based on previous studies of deep neural networks, it is
likely that the shallow and structurally simple architecture of the DCT-Conv and its task-independent
nature lead to this incapability of the DCT.
Meanwhile, we also found that temporal convolution and temporal recurrence are able
to learn complementary information, and the combination of both results in a robust model such
as the FST-CLDNN. Nevertheless, we only considered the architecture of a CNN module followed by
a BLSTM module. It would be interesting to see whether an architecture of a BLSTM module followed
by a CNN module would make any difference.

In order to understand the internal mechanism within a CLDNN model, we quantitatively
analyzed the module-wise discriminative power by training an SVM on the activations extracted
from the output of each module. The reported accuracy can be viewed as an approximate measure
of quality in the sense of readiness to exploit the affective information. From the results in Table
3.5, we found that the CNN module, the BLSTM module and the MLP module contribute refinements
of 27.43 ± 5.18%, 35.63 ± 3.61% and 2.85 ± 2.32% to the quality, respectively. This ranking is
not surprising, as studies from psychology [78] and computational paralinguistics [92, 111, 75, 61]
all point out that emotion is characterized by temporally dependent dynamics. Nevertheless, our
findings have shown that the CNN module is capable of significantly enhancing the separability
of emotional classes compared to raw features, particularly under a noisy condition.
Part III
Model-Based Data Augmentation
Chapter 4
Regularization Via Model-Based Perturbation
4.1 Introduction
Deep convolutional neural networks have been successfully applied to several pattern recognition
tasks such as image recognition [39], machine translation [30] and speech emotion recognition
[47]. Currently, to successfully train a deep neural network, one needs either a sufficient num-
ber of training samples to implicitly regularize the learning process, or to employ techniques like
weight decay and dropout [98] and its variants to explicitly keep the model from over-fitting. In
the previous chapter, we presented an in-depth study of feature-based data augmentation
for regularizing speech emotion recognition models. In this chapter, we focus on model-based
alternatives for data augmentation.
In recent years, one of the most popular and successful architectures has been the residual neural
network (ResNet) [39] and its variant ResNeXt [112] with multiple residual branches. The ResNet
architecture was designed based on a key assumption that it is more efficient to optimize the
residual term than the original task mapping. Since then, a great deal of effort in machine
learning and computer vision has been dedicated to studying the multi-branch architecture.
Deep convolutional neural networks have also gained much attention in the affective computing
community, mainly because of their outstanding ability to formulate discriminative features
for the top-layer classifier. Usually the number of parameters in a model far exceeds the
number of training samples, and thus training deep neural networks for affective computing
requires heavy regularization. However, since the introduction of batch normalization [52], the gains
obtained by using dropout for regularization have decreased [52, 115, 51]. A recent work dedicated
to studying the disharmony between dropout and batch normalization [63] suggests that dropout
introduces a variance shift between training and testing, which causes batch normalization to
malfunction if it is placed after dropout; this severely limits the application
of successful architectures such as ResNet, or the application of dropout to the top-most fully
connected layers. Yet, multi-branch architectures have emerged as a promising alternative for
regularizing convolutional layers.
Regularization techniques based on multi-branch architectures such as Shake-Shake [29] and
ShakeDrop [113] have delivered impressive performances on standard image datasets such as the
CIFAR-10 [56] dataset. In a clever way, both of them utilize multiple branches to learn different
aspects of the relevant information, followed by a summation for information alignment among
branches. Both Shake-Shake and ShakeDrop regularization also emphasize
the important interaction between batch normalization and shaking. However, neither of them
gave an explanation for this phenomenon, other than a brief discussion on limiting the strength
of shaking. Instead of using multiple branches, a recent work [109] based on a mixture of experts
showed that randomly projecting samples is able to break the structure of adversarial noise that
could easily confound the model and, as a result, mislead the learning process. Despite not being
an end-to-end approach, it shares the same idea of integrating multiple streams of model-based
diversity.
In addition, a recent trend of studies on data augmentation, based on the Vicinal Risk Min-
imization (VRM) [12] principle, proposed to interpolate and/or extrapolate training samples in
feature space, for example [21] and [13]. Szegedy et al. [100] studied regularization of training
by label smoothing. Furthermore, Mixup [117] performed convex combinations of pairs of fea-
tures and labels to demonstrate that expanding the coverage of training samples in feature space,
leaving little to no margin between classes, could not only improve the performance of a
model but also make it robust to adversarial samples. Effectively, Mixup reduces the uncertainty
in the prediction of a testing sample lying outside of the coverage of training samples in feature space
by linear interpolation. Based on Mixup, Manifold Mixup [108] called for mixing intermediate
representations instead of raw inputs.
Our work in this chapter follows the model-based representation augmentation thread of Shake-
Shake, ShakeDrop regularization and Manifold Mixup. In this work, we study the Shake-Shake
regularized ResNeXt for speech emotion recognition. In addition to shaking the entire spectral-
temporal feature maps with the same strength, we propose to address different spectral sub-bands
independently, based on our hypothesis of the non-uniform distribution of affective information
over the spectral axis. Furthermore, we investigate and offer an explanation for the
ability of shaking regularization to improve classification tasks and its crucial interaction with
batch normalization. To achieve our goal, we conduct ablation studies on the MNIST [59]
and CIFAR-10 datasets to highlight a subtle difference in the requirement of optimal embeddings
between classification tasks based on the VRM principle and verification tasks. In addition, we
identify a strong resemblance in the mathematical formulation between batch normalized residual
blocks and batch normalized recurrent neural networks, where both of them suffer from a shared
issue, faster convergence but more over-fitting, which can be fixed by the same technique.
The contributions of this chapter are multi-fold. First, our work explains, with visualizations,
the key factor behind the success of shaking regularization and the crucial role that batch nor-
malization plays in a shaking regularized architecture, drawing a connection between shaking and
the VRM principle, and between batch normalization and discriminative embedding learning.
Second, to the best of our knowledge, our work is the first to highlight the subtle difference in the
requirement of optimal embeddings between classification tasks and verification tasks. Third, our
work identifies the resemblance between batch normalized residual blocks and batch normalized
recurrent neural networks, and the shared issue they have. Based on the solution for batch normal-
ized recurrent neural networks, we demonstrate a significant reduction in the generalization gap,
i.e. reduced over-fitting, in a batch-normalized shaking-regularized ResNeXt for speech emotion
recognition without sacrificing the validation accuracy.
The outline of this chapter is as follows. We review related work in the next section, including
Shake-Shake regularization and its variants, discriminative feature learning in verification tasks
and the Vicinal Risk Minimization principle. Section 4.3 introduces sub-band shaking. Section 4.4
presents batch normalized shaking, including ablation studies on the MNIST and CIFAR-10 datasets,
and the identification of batch normalized residual blocks with batch normalized recurrent neural
networks. Sections 4.5 and 4.6 cover the datasets, the network architecture, the experimental setup
and the results for speech emotion recognition. Section 4.7 concludes our work.
4.2 Related Work
In this section, we review some of the recently developed techniques on which we base our work, as well as
those related to our findings. The first subsection covers Shake-Shake regularization [29]
and its variants, including Sub-band Shaking [48], ShakeDrop [113] and Stochastic Shake-Shake
[49] regularization. Another subsection is devoted to end-to-end discriminative feature learning
algorithms in face verification, in particular the thread of work that focuses on large-margin and
symmetrical representation learning in [66, 65, 110, 20, 83]. In the last subsection, we discuss
a recent success in supervised learning based on the Vicinal Risk Minimization (VRM) principle
[12], called Mixup [117]. Moreover, we highlight one subtle but definite difference in the desired
quality of intermediate representations between classification and verification tasks.
Figure 4.1: An overview of a 3-branch Shake-Shake regularized residual block. (a) Forward
propagation during the training phase. (b) Backward propagation during the training phase. (c)
Testing phase. The coefficients $\alpha$ and $\beta$ are sampled from the uniform distribution over [0, 1] to
scale down the forward and backward flows during the training phase.
4.2.1 Shake-Shake Regularization and Its Variants
Shake-Shake regularization is a recently proposed technique to regularize the training of deep convolu-
tional neural networks for image recognition tasks. This regularization technique, based on multi-
branch architectures, promotes stochastic mixtures of forward and backward propagations from
network branches in order to create a flow of model-based adversarial learning samples/gradients
during the training phase. Owing to its excellent ability to combat over-fitting even in the pres-
ence of batch normalization, the Shake-Shake regularized 3-branch residual neural network [29]
has achieved one of the current state-of-the-art performances on the CIFAR-10 image dataset.
An overview of a 3-branch Shake-Shake regularized ResNeXt is depicted in Fig. 4.1. Shake-
Shake regularization adds to the aggregate of the branch outputs an additional layer, called
the shaking layer, to randomly generate adversarial flows in the following way:

\[ \mathrm{ResBlock}_N(X) = X + \sum_{n=1}^{N} \mathrm{Shaking}\big(\{B_n(X)\}_{n=1}^{N}\big), \]

where in the forward propagation, for $a = [\alpha_1, \ldots, \alpha_N]$ sampled from the $(N-1)$-simplex (Fig. 4.1 (a)),

\[ \mathrm{ResBlock}_N(X) = X + \sum_{n=1}^{N} \alpha_n B_n(X), \]
while in the backward propagation, for $b = [\beta_1, \ldots, \beta_N]$ sampled from the $(N-1)$-simplex and $g$
the gradient from the top layer, the gradient entering $B_n(X)$ is $\beta_n g$ (Fig. 4.1 (b)). At testing
time, the expected model is evaluated for inference by taking the expectation over the random
sources in the architecture (Fig. 4.1 (c)). In each mini-batch, applying the scaling coefficients $\alpha$ or $\beta$
either to the entire mini-batch or to each individual sample independently can also make a
difference.

Figure 4.2: Shaking regularized ResNeXt architectures with different residual block layouts introduced in [40]:
(a) original, (b) ReLU-only pre-activation, (c) full pre-activation, (d) pre-activation + BN.
Instead of the commonly known original or fully pre-activation residual block, Shake-Shake
regularization was proposed with the ReLU-only pre-activation residual block; refer to Fig. 4.2
for more details. In addition, it has been shown that shaking in the absence of both batch
normalization layers can cause the training process to diverge. One remedy proposed in
[29] is to employ a shallower architecture and, more importantly, to reduce the range
of values $\alpha$ can take on, i.e. to reduce the strength of shaking.
ShakeDrop regularization [113], based on the same idea of model-based representation aug-
mentation but applied to the deep pyramidal residual architecture, reached an improved accuracy on the
CIFAR-10/100 datasets. The authors further empirically observed that each residual block should
end with batch normalization before the shaking layer to prevent training from diverging.

In our previous work on acoustic sub-band shaking [48] and stochastic Shake-Shake regulariza-
tion [49] for affective computing from speech, we found that in a fully pre-activation architecture
without a batch normalization layer right before shaking, the shaking mechanism contributes much
more to constraining the learning process than to boosting the generalization power. In addition,
we showed methods to relax or control the impact of shaking, either deterministically by sub-
band shaking or probabilistically by randomly turning off shaking, in order to trade off between
a higher accuracy and reduced over-fitting. Fortunately, with appropriate hyper-parameters, we
could achieve both with statistical significance compared to the baseline.

All these findings indicated that there is a close interaction between shaking and batch normal-
ization. However, these studies did not give an explanation for the crucial location of batch
normalization in a shaking regularized architecture, other than a brief discussion of the range of
$\alpha$.
4.2.2 End-to-End Discriminative Feature Learning
Recently, there has been a trend to focus on the design of loss functions so that a neural network
supervised by such a loss function is able to formulate more discriminative features, usually
compared to the ordinary softmax loss, for face verification. Inspired by the contrastive loss [35]
and the triplet loss [91], the main design objective aims to simultaneously minimize intra-class
dispersion and maximize inter-class margin. However, using the contrastive loss or the triplet
loss often involves training with pairs and triplets on the order of $O(N^2)$ and $O(N^3)$, respectively, when $N$ is
the number of training samples, or one has to rely on a carefully selected sampling strategy.

A series of works hence revisited the interpretation of the softmax loss as a normalized expo-
nential of inner products between the feature vector and class center vectors, and came up with var-
ious modifications for achieving the aforementioned design objective, including the large-margin
softmax [66], SphereFace [65], CosFace [110], ArcFace [20] and Centralized Coordinate Learning
(CCL) [83].
Different from the other four modifications, CCL distributes features dispersedly by central-
izing the features to the origin of the feature space during the learning process, so that feature vectors
from different classes can be more separable in terms of a large angle between neighboring classes
and, ideally, symmetrically distributed in the whole feature space. The CCL loss is presented as
follows:
\[ L = -\frac{1}{N}\sum_{i}^{N} \log \frac{e^{\|\phi(x_i)\|\cos(\theta_{y_i})}}{\sum_{j}^{K} e^{\|\phi(x_i)\|\cos(\theta_{y_j})}} \quad (4.1) \]

where

\[ \phi(x_i)_j = \frac{x_{ij} - o_j}{\sigma_j}. \quad (4.2) \]

$\phi(x_i)_j$ and $x_{ij}$ are the $j$-th coordinates of $\phi(x_i)$ and $x_i$, respectively, and $\theta_{y_j}$ is the angle between
$x_i$ and the class center vector $w_{y_j}$. It is immediately clear that Eq. (4.2) resembles the famous
batch normalization:

\[ \phi(x_i)_j = \gamma_j \frac{x_{ij} - o_j}{\sigma_j} + \beta_j, \quad (4.3) \]

except that the trainable affine transformation, defined by $\gamma$ and $\beta$, after the normalization is
missing from the formulation. In Eq. (4.2) and (4.3), $\sigma$ and $o$ are the running standard deviation and
the running mean updated per mini-batch during training. In fact, it has been shown that there is a
slight degradation in performance when the trainable affine transformation is employed. In the
rest of this work, we use batch normalization and CCL interchangeably to refer to this discriminative
feature learning.
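As a concrete reading of Eq. (4.2), the centralization step can be sketched as a per-coordinate normalization by running statistics, i.e. batch normalization without the trainable affine transform. This is my own illustrative sketch of that interpretation; the class name, momentum value and update rule are assumptions rather than the CCL reference implementation.

import numpy as np

class CoordinateCentralizer:
    """Centralize features to the origin and scale them by running statistics
    (a sketch of Eq. (4.2): batch normalization without gamma and beta)."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.o = np.zeros(dim)      # running mean o
        self.sigma = np.ones(dim)   # running standard deviation sigma
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        # x: (batch, dim); update the running statistics only during training.
        if training:
            self.o = (1 - self.momentum) * self.o + self.momentum * x.mean(axis=0)
            self.sigma = (1 - self.momentum) * self.sigma + self.momentum * x.std(axis=0)
        return (x - self.o) / (self.sigma + self.eps)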
4.2.3 Vicinal Risk Minimization and Mixup
In statistical learning theory, because the data distribution $P(x, y)$ is unknown in most practical
applications, one may approximate it by the empirical distribution

\[ P_{\delta}(x, y) = \frac{1}{N}\sum_{n} \delta(x = x_n, y = y_n), \quad (4.4) \]

where $\delta(x = x_n, y = y_n)$ is a Dirac mass centered at $(x_n, y_n)$, and $\mathcal{D} = \{(x_n, y_n) : 1 \le n \le N\}$ is a
given training set sampled from $P(x, y)$. This paradigm of learning via minimizing a risk based
on the approximation in Eq. (4.4) is referred to as Empirical Risk Minimization (ERM) [107].

Instead of a Dirac function, one may employ a vicinity function $\nu$ to take into consideration
the vicinity of a true training pair $(x_n, y_n)$; this alternative is therefore called the Vicinal
Risk Minimization (VRM) principle [12]. The Gaussian vicinity [12], $\nu(x, y \,|\, x_n, y_n) = \mathcal{N}(x - x_n; \sigma^2)\,\delta(y = y_n)$,
is one of the well-known examples, which is equivalent to data augmentation
with additive Gaussian noise.
Recently, a novel vicinity function, called Mixup [117], was proposed to cover the entire convex
hull of training pairs:

\[ \nu(x, y \,|\, x_n, y_n) = \frac{1}{N}\sum_{m} \mathbb{E}_{\lambda}\big[\delta(x = u_m, y = v_m)\big], \quad (4.5) \]
\[ u_m = \lambda x_n + (1 - \lambda) x_m, \quad (4.6) \]
\[ v_m = \lambda y_n + (1 - \lambda) y_m, \quad (4.7) \]

where $\lambda \in [0, 1]$. This data augmentation scheme by Mixup is believed to reduce the amount of
undesirable oscillations when predicting outside the original training examples, as it fills in the
feature space between classes with convex combinations of true training pairs, and hence contributes
both to improved classification accuracy and to robustness against adversarial examples.
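A minimal sketch of the Mixup vicinity in Eqs. (4.5)-(4.7), mixing a batch with a shuffled copy of itself; the Beta(0.2, 0.2) prior on lambda is a common choice reported for Mixup and is stated here as an assumption rather than a setting used in this work.

import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    lam = np.random.beta(alpha, alpha)          # lambda in [0, 1]
    perm = np.random.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]     # Eq. (4.6)
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]   # Eq. (4.7)
    return x_mix, y_mix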
Classification versus Verification One can easily observe that the directions for improving
classification tasks based on VRM and for refining embeddings in verification tasks based on the
aforementioned design objective are drastically different: the former strives to minimize or
eliminate the margin between classes, while the latter aims to ensure a definite minimum margin.
We provide more details in Section 4.4.
4.3 Sub-band Shaking
Shake-Shake regularization delivers different results depending on the strength of the shaking, in
terms of the range of values $\alpha$ (or $\beta$) takes on as well as whether the same $\alpha$ (or $\beta$) is shared within
a mini-batch. In addition to batch- or sample-wise shaking, when it comes to acoustic
processing there is another orthogonal dimension to consider: the spectral domain. Leveraging
domain knowledge, our first proposed models are based on a simple but plausible hypothesis that
affective information is distributed non-uniformly over the spectral axis [60]. Therefore, there is
no reason to enforce the entire spectral axis to be shaken with the same strength concurrently.
Furthermore, adversarial noise may exist and extend over the spectral axis. By deliberately
shaking spectral sub-bands independently, the structure of adversarial noise may be broken and
become less confounding to the model.
There has been prior work on multi-stream frameworks in speech processing. For example, Mallidi
et al. [71] designed a robust speech recognition system using multiple streams, each of them
attending to a different part of the feature space, to fight against noise. However, lacking both
multiple branches and the final information alignment, the design philosophy is fundamentally
different from that of multi-branch architectures. In fact, sub-band shaking can be viewed as a
bridge between the multi-stream framework and the multi-branch architecture.
Before we formally define the proposed models, we first introduce the definition of sub-bands.
Fig. 4.3 depicts the definition of sub-bands in a 3-branch residual block. Here we slightly abuse
the notations of frequency and time because after two convolutional layers these axes are not
exactly the same as those of the input to the branches; however, since convolution is a local operation,
they still retain the corresponding spectral and temporal nature. At the output of each branch, we
define the high-frequency half to be the upper sub-band and the low-frequency half to be the
lower sub-band. We take the middle point on the spectral axis to be the border line for simplicity.
The entire output is called the full band.

Figure 4.3: An illustration of the sub-band definitions.
Having defined these concepts, we denote by $X$ the input to a residual block, $X^i$ the full band
from the $i$-th branch, $X^i_u$ the upper sub-band from the $i$-th branch and $X^i_l$ the lower sub-band
from the $i$-th branch. Naturally, the relationship between them is given by $X^i = [X^i_u \,|\, X^i_l]$. We
also denote by $Y$ the output of a Shake-Shake regularized residual block.

To study the effectiveness of sub-band shaking for speech emotion recognition, we propose the
following models for benchmarking:
1. Shake the full band (Full):

\[ Y = X + \sum_{n=1}^{N} \mathrm{ShakeShake}\big(\{X^n\}_{n=1}^{N}\big). \quad (4.8) \]

2. Shake both sub-bands but independently (Both):

\[ Y = X + [Y_u \,|\, Y_l], \quad (4.9) \]
\[ Y_u = \sum_{n=1}^{N} \mathrm{ShakeShake}\big(\{X^n_u\}_{n=1}^{N}\big), \]
\[ Y_l = \sum_{n=1}^{N} \mathrm{ShakeShake}\big(\{X^n_l\}_{n=1}^{N}\big). \]
4.4 Batch Normalized Shaking
Figure 4.4: MNIST embeddings based on different layouts of residual blocks. We set the feature
dimension entering into the output layer to be two and train them in an end-to-end fashion.
The top and bottom rows depict embeddings of the training samples extracted in the train mode
(i.e. $\alpha \in [0, 1]$) without updating parameters, and testing samples extracted in the eval mode
($\alpha = 0.5$), respectively. (a,e) fully pre-activation (Fig. 4.2(c)) without shaking, test accuracy 98.28%;
(b,f) fully pre-activation (Fig. 4.2(c)) with shaking, test accuracy 93.22%; (c,g) fully pre-activation +
BN (Fig. 4.2(d)) without shaking, test accuracy 98.66%; (d,h) fully pre-activation + BN (Fig. 4.2(d))
with shaking, test accuracy 98.90%.
It has been shown that without batch normalization in a residual block, the shaking operation
can easily cause training to diverge [29]. Even with various methods to reduce the strength of
shaking, such as limiting the range of values $\alpha$ can take on or decreasing the number of residual
blocks and hence the number of shaking layers, applying shaking regularization without batch
normalization leads to a much inferior model compared to an unregularized ResNeXt model with
batch normalization.
Batch Normalization for Symmetrically Distributed Representations From the per-
spective of discriminative feature learning, one may assume that without batch normalization,
feature vectors from different classes may lie close to each other in an uneven fashion. In this situ-
ation, the appropriate strength of shaking may be limited by the margin of the two closest classes, as in
the experiments reported in [29]. In fact, a recent study [90] refuted the commonly held view
on the role of batch normalization in reducing the so-called internal covariate shift. Instead, the authors
pointed out that batch normalization actually helps to significantly smooth the optimization landscape
during training. Intuitively, it is expectedly easier to explore the neighborhood of well-separated
class center vectors, compared to a set of unevenly distributed class center vectors, since it is less
likely to have overlapping embeddings from different classes in the former case. Without batch
normalization, any inappropriately large strength of shaking may lead to overshot embeddings,
cause two or more classes to overlap and, most importantly, result in a slower and inferior conver-
gence. Empirically, we find that the key factor is to keep a distribution of dispersed representations
that spans the entire feature space in a symmetric fashion and extends away from the ori-
gin, even when shaking is applied. Therefore, we hypothesize that the batch normalization (or
CCL) layer right before shaking serves to disperse embeddings and thereby prevents them from
overlapping each other under the influence of the perturbation around class center vectors that is
encouraged by shaking.
Model-Based Representation Augmentation for Coverage of Between-Class Feature
Space From the review in Section 4.2, we have noticed that classification tasks based on the
VRM principle and verification tasks ask for subtly different distributions of representations. The
success of Mixup in classification tasks advocates minimizing or eliminating the gap between
classes in feature space by augmented training samples. By doing so, the chance of predicting
samples lying outside the coverage of the original or augmented training samples, which leaves
room for attacks by adversarial samples [101] or results in uncertain predictions, is reduced. On
the other hand, CCL, and the other discriminative feature learning algorithms for verification,
can leave a larger margin between classes (for example, cf. Fig. 4.4 (a) and (c)). We hypothesize
that shaking following a batch normalization layer expands the coverage of training samples and
effectively eliminates the margin between classes in feature space, achieving a distribution similar to
that of Mixup. One caveat is that shaking does not smooth labels during exploration, and is therefore
more similar to a predecessor of Mixup in which interpolation and extrapolation of the nearest
neighbors of the same class in feature space was proposed to enhance generalization [21]. However,
shaking operates solely on a single sample at a time, unlike [117] and [21].
4.4.1 Embedding Learning on MNIST and CIFAR-10
To demonstrate the close interaction between shaking and batch normalization in representation
learning, we present two ablation studies, one on the MNIST dataset and the other on the CIFAR-
10 dataset. For both sets of experiments, we employ the ordinary softmax loss to examine the
effectiveness of batch normalization in representation learning when it is not coupled with the
softmax function.

The first set aims to visualize the embeddings of handwritten digit images learned under the
influence of batch normalization and shaking. We employ a ResNeXt (20, 2×4d) architecture,
where the last residual block reduces the feature dimension to 2 for the purpose of visualization.
Fig. 4.4 depicts the embeddings produced by four layouts of residual blocks, where the top and bottom rows
correspond to embeddings of the training and testing samples, respectively. From left to right,
the columns represent embeddings learned by models of fully pre-activation (Fig. 4.2(c), PreAct)
without shaking, PreAct with shaking, fully pre-activation + BN (Fig. 4.2(d), PreActBN)
without shaking and PreActBN with shaking, respectively.
The first column serves as the baseline in this set of experiments. Immediately, we can observe
a severe degradation in separability when applying shaking to PreAct, comparing Fig. 4.4(a,e)
with Fig. 4.4(b,f). Also notice that shaking without a directly preceding batch normalization
can perturb or destroy the symmetric distribution, which is obvious when there is no shaking
(the symmetry in Fig. 4.4(a,e)). This is rather interesting, as batch normalization still exists in the
PreAct residual block, only not directly connected to the shaking layer. It seems the explo-
ration encouraged by shaking around each class center has expanded its coverage, but without a
directly preceding batch normalization to maintain a good dispersion between classes, each class
only expands to overlap with neighboring classes, and the resulting distribution is heavily tilted.
Consequently, PreAct with shaking delivers a much inferior performance compared to PreAct.
The comparison between PreAct (Fig. 4.4(a,e)) and PreActBN (Fig. 4.4(c,g)), both with-
out shaking, demonstrates the effectiveness of CCL in discriminative feature learning, even though
it is not coupled with the loss function. In Fig. 4.4(c), not only does it maintain a symmetric
distribution of classes, but it also encourages each class to expand outward and to leave more
margin between neighboring classes. As a result, PreActBN without shaking is able to reach a
better performance compared to the baseline.
Finally, PreActBN with shaking (Fig. 4.4(d,h)), on the contrary, does not lead to a larger
margin between classes as PreActBN without shaking does. Instead, the shaking appears to have
expanded the coverage of each class so that all of the classes are directly adjacent to each
other with a minimal or zero margin. We can also observe that, although batch normalization
tries to maintain a symmetric distribution of feature vectors, some of the classes in Fig. 4.4(d,h)
are slightly tilted around the outermost region. However, the most salient difference is in the
distribution of testing samples, where each class becomes more compact. The performance of
PreActBN with shaking is therefore the highest among all of the models.
Based on the VRM principle, the region close to the boundaries between classes in feature space is
covered by augmented training samples. Since in supervised learning we assume testing samples
are drawn from a similar or the same distribution as training samples, the majority of testing
samples are mapped to embeddings close to the class center, where most original training
samples are mapped to, and thus the distribution appears more compact. In Fig. 4.5, from the
comparison of training embeddings in the train and eval modes, it is visually clear that shaking
expands the coverage of training embeddings. To quantitatively show that this is the case,
for each layout we calculate the distances of the original training embeddings to their respective class
center vectors in the eval mode, and plot in Fig. 4.6 the percentage of class samples within distances relative
to the largest distance in the class, calculated in the train mode. It is clear that
with shaking most of the original training embeddings are concentrated around the class center.
For example, almost 100% of the original training embeddings lie within 0.3 of the largest distance for
PreActBN with shaking, and only a small number of original training samples lie close to
the boundaries.

Figure 4.5: Embeddings of training samples extracted in the (a) train and (b) eval mode from PreActBN with shaking.

Figure 4.6: Percentage of class samples within a relative distance to the class center.
Table 4.1: Test error (%) and model size on CIFAR-10

Model                            Depth    Params    Error (%)
ResNeXt (29, 16×64d) [112]       29       68.1M     3.58
Wide ResNet [115]                28       36.5M     3.80
Shake-Shake (26, 2×96d) [29]     26       26.2M     2.86
Shake-Shake (26, 2×64d) [29]     26       11.7M     2.98
ResNeXt (26, 2×64d) [29]         26       11.7M     3.76
PreActBN (26, 2×64d)*†           26       11.7M     2.95
BN-Shake (26, 2×64d)*            26       11.7M     3.65
PreAct (26, 2×64d)*†             26       11.7M     6.92

* average over three runs   † with shaking
The second set of experiments, on CIFAR-10, is designed to measure the contribution of the
last batch normalization in PreActBN with shaking. In order to do so, we remove the first
two batch normalization layers from the PreActBN residual block and rename the new one the BN-
Shake residual block (ReLU-Conv-ReLU-Conv-BN-Mul), assuming shaking is applied. Along
with PreActBN and PreAct, by presenting BN-Shake, all of them with shaking, we are able
to quantitatively demonstrate the crucial location of batch normalization in a shaking regularized
architecture.
We modify the open-sourced Torch-based Shake-Shake implementation released with [29]
(https://github.com/xgastaldi/shake-shake) to build these three architectures. All of the remaining
settings, such as the cosine learning rate schedule and the number of epochs, remain unchanged;
only the part that involves the residual block definition is modified to serve our needs. We run each
experiment three times with different random seeds to obtain a robust estimate of the performance.
Table 4.1 presents the results of our experiments as well as the quoted performances on CIFAR-10
from [29].
Note that the Shake-Shake ResNeXt in [29] is based on the ReLU-only pre-activation residual
block (Fig. 4.2(b), RPreAct) and is thus different from the PreAct we have here. Although the
ResNeXt-26 2×64d is based on the RPreAct structure, with a shallow depth of 26 it should be
comparable to one based on PreAct when no shaking is applied [40]. Therefore, we
also take it as the baseline for the pre-activation layout.

The performance of PreActBN with shaking (2.95%, mean of 2.89%, 3.00% and 2.95%) is
comparable to the reported performance of RPreAct with shaking (2.98%), where both of them
have a directly preceding batch normalization layer before shaking. As expected, the performance
of PreAct with shaking (6.92%, mean of 6.76%, 6.82% and 7.19%) is much worse than every
model in Table 4.1, including the baseline ResNeXt-26 2×64d.
On the other hand, the result of BN-Shake (3.65%, mean of 3.56%, 3.76% and 3.62%) is
rather positive. With only one batch normalization layer, it outperforms not only PreAct with
shaking but also the baseline ResNeXt-26 2×64d. This finding highlights the fact that the directly
preceding batch normalization plays a crucial role in keeping a good dispersion of intermediate
representations when shaking is applied to explore unseen feature space, while the dispersing effect
of any other batch normalization that is separated by convolutional layers from the shaking layer
is reduced or only auxiliary.

Based on these two sets of experiments, it is safe to state that in the close interaction between
batch normalization and shaking, batch normalization is mainly responsible for keeping a dispersed
symmetric distribution of intermediate representations from perturbation by shaking, while the
shaking mechanism expands the coverage of training samples to eliminate the margin between
classes by promoting stochastic convex combinations of model branches, similar to [117] and [21].
For speech emotion recognition, we conduct experiments with 3-branch ResNeXt based on the
original (Fig. 4.2(a) PostAct), RPreAct, PreActBN and PreAct layouts with and without
shaking to benchmark these architectures.
4.4.2 Relation with Batch Normalized Recurrent Neural Networks
So far, we have presented arguments and analyses of experiments to clarify the interplay of batch
normalization and shaking, and drawn a connection from each of them to discriminative feature
learning and VRM, respectively.

In our previous work on sub-band shaking [48] and stochastic Shake-Shake regularization [49]
in affective computing, we found that shaking regularization contributes more to constraining the
training process, measured in terms of the generalization gap between the training and validation
accuracies, than to improving generalization when using PreAct with shaking, which is under-
standable based on the previous analyses. However, when employing PreActBN for speech emotion
recognition, we further observe another interesting behavior of the resulting models. Compared
to PreAct, although the validation performance is improved thanks to the addition of batch nor-
malization, the gap between the training and validation accuracies also drastically increases, which
suggests a significant amount of increased over-fitting.
Table 4.2: Performances of PreAct and PreActBN with and without shaking for speech emotion
recognition (average over three runs)

Architecture    Shake    Valid UA (%)    Gap (%)
PreAct          no       61.342          7.485
PreAct          yes      62.989          -1.128
PreActBN        no       63.407          7.410
PreActBN        yes      66.194          8.348
For example, Table 4.2 gives a comparison between PreAct and PreActBN with and without
shaking. With shaking, PreActBN achieves an improvement of 3.205% in the unweighted accu-
racy (UA) at the cost of an increment of 9.476% in the generalization gap (Gap) between training
and validation UAs compared to PreAct, simply because of the addition of batch normalization. On the
other hand, the increments are 2.787% in UA and 0.938% in Gap, respectively,
for PreActBN due to the addition of shaking. Apparently, the addition of batch normalization
triggers something that causes this significant change in the generalization gap when shaking is
applied.
We suspect that this behavior partially resembles the situation reported for the application of batch
normalization to recurrent neural networks [58, 16], where Laurent et al. [58] attempted to apply
batch normalization to the recurrent formulation of temporal modeling and concluded that it
leads to faster convergence but more over-fitting. To address this issue, Cooijmans et al. [16]
attributed the difficulties with recurrent batch normalization to gradient vanishing due to improper
initialization of the batch normalization parameters, and proposed to initialize the standard
deviation parameter γ in Eq. (4.3) to a small value such as 0.1, contrary to the common practice
of unit initialization. With a proper initialization, they demonstrated successful applications of
batch normalized Long Short-Term Memory (LSTM) networks that converge faster and generalize
better.
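As a concrete illustration, the following minimal PyTorch sketch shows one way such an initialization could be applied to every batch normalization layer of a model; the function name and the restriction to 1-D and 2-D batch normalization layers are our own assumptions rather than part of the cited works.

    import torch.nn as nn

    def init_bn_gamma(model, gamma0=0.1):
        # Set the batch normalization scale (gamma) parameter to a small
        # constant, following the recurrent batch normalization recipe [16];
        # the shift (beta) parameter is left at zero.
        for module in model.modules():
            if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
                nn.init.constant_(module.weight, gamma0)   # gamma (scale)
                nn.init.constant_(module.bias, 0.0)        # beta (shift)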
In fact, the formulations of batch normalized residual blocks and batch normalized recurrent
neural networks are strikingly similar: both consist of a batch normalized sum of multiple batch
normalized streams. For example, in PreActBN with shaking the output of the l-th block is as
follows:
X_{l+1} = \alpha \, \mathrm{BN}(f^{1}_{l}(X_l)) + (1 - \alpha) \, \mathrm{BN}(f^{2}_{l}(X_l)) + X_l,    (4.10)
and in the (l + 1)-th block,
f^{i}_{l+1}(X_{l+1}) = \mathrm{Conv}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(Y^{i})))),    (4.11)

Y^{i} = \mathrm{ReLU}(\mathrm{BN}(X_{l+1})).    (4.12)
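A minimal PyTorch sketch of such a block is given below. It is only an illustration of Eq. (4.10)-(4.12): the two branches, the 3x3 kernels, the class name and the use of a single forward-pass coefficient (omitting the separate backward-pass coefficient of the full Shake-Shake method) are simplifying assumptions of our own.

    import torch
    import torch.nn as nn

    class PreActBNShakeBlock(nn.Module):
        # Two pre-activation branches as in Eq. (4.11)-(4.12), each closed by
        # its own batch normalization, combined convexly with a random alpha
        # and added to the identity path as in Eq. (4.10).
        def __init__(self, channels):
            super().__init__()
            def branch():
                return nn.Sequential(
                    nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                    nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                    nn.BatchNorm2d(channels))  # the BN applied right before shaking
            self.branches = nn.ModuleList([branch(), branch()])

        def forward(self, x):
            y1, y2 = self.branches[0](x), self.branches[1](x)
            # random convex combination at training time, expectation at test time
            alpha = torch.rand(1, device=x.device) if self.training else 0.5
            return alpha * y1 + (1.0 - alpha) * y2 + x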
On the other hand, the batch normalized LSTM is given by the following equations [16]:

\begin{pmatrix} f_t \\ i_t \\ o_t \\ g_t \end{pmatrix} = \mathrm{BN}(W_h h_{t-1}) + \mathrm{BN}(W_x x_t) + b,    (4.13)

c_t = \sigma(f_t) \odot c_{t-1} + \sigma(i_t) \odot \tanh(g_t),    (4.14)

h_t = \sigma(o_t) \odot \tanh(\mathrm{BN}(c_t)),    (4.15)

where x_t is the input, W_x and W_h are the input and recurrent weights, f_t, i_t, o_t and g_t are
the forget, input and output gates and the cell candidate, c_t and h_t are the cell and output vectors,
and \odot and \sigma are the Hadamard product and the logistic sigmoid function, respectively.
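For comparison, a simplified PyTorch sketch of such a batch normalized LSTM cell is shown below. Reusing a single BatchNorm1d module across time steps is a simplification; the per-time-step statistics and exact parameterization of [16] are not reproduced here.

    import torch
    import torch.nn as nn

    class BNLSTMCell(nn.Module):
        # One step of the batch normalized LSTM of Eq. (4.13)-(4.15).
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.W_x = nn.Linear(input_size, 4 * hidden_size, bias=False)
            self.W_h = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
            self.bias = nn.Parameter(torch.zeros(4 * hidden_size))
            self.bn_x = nn.BatchNorm1d(4 * hidden_size)
            self.bn_h = nn.BatchNorm1d(4 * hidden_size)
            self.bn_c = nn.BatchNorm1d(hidden_size)

        def forward(self, x_t, state):
            h_prev, c_prev = state
            # Eq. (4.13): batch normalized sum of the recurrent and input streams
            gates = self.bn_h(self.W_h(h_prev)) + self.bn_x(self.W_x(x_t)) + self.bias
            f, i, o, g = gates.chunk(4, dim=1)
            # Eq. (4.14) and (4.15)
            c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
            h_t = torch.sigmoid(o) * torch.tanh(self.bn_c(c_t))
            return h_t, c_t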
In these two batch normalized formulations, we find Y^i and h_t rather similar to each other, up
to only some component-wise scaling and clipping by activations. In addition to benchmarking
different layouts of residual blocks, we also present speech emotion recognition experiments to
investigate the convergence behavior of shaking regularized ResNeXt networks with the batch
normalization parameter γ initialized to different values.
4.5 Database and Experiments for Speech Emotion Recognition
We use six publicly available emotion corpora to demonstrate the effectiveness of the proposed
models, including the eNTERFACE'05 Audio-Visual Emotion Database [73], the Ryerson Audio-
Visual Database of Emotional Speech and Song (RAVDESS) [67], the Interactive Emotional
Dyadic Motion Capture (IEMOCAP) database [10], the Berlin Database of Emotional Speech
(Emo-DB) [8], the EMOVO Corpus [18] and the Surrey Audio-Visual Expressed Emotion (SAVEE)
database [37]. Some of these corpora are multi-modal, in which speech, facial expressions and
text all convey a certain degree of affective information. However, in this work we focus solely on
the acoustic modality for our experiments.
We formulate the experimental task as a sequence classification over 4 classes: joy, anger, sadness
and fear. We perform sub-utterance sampling [53] by dividing long utterances into several short
segments of 6.4 seconds with the same label, so as to limit the length of the longest utterance;
for example, a 10-second long angry utterance is replaced by two angry segments corresponding
to the first 6.4 and the last 6.4 seconds of the original utterance, with some overlapping part.
In this way, we also slightly benefit from data augmentation. As a result, we obtain 6803
emotional utterances from the aggregated corpora. Table 4.3 summarizes the information about
these six corpora.
Corpus       No. Actors   Joy    Anger   Sadness   Fear
eNTERFACE    42           207    210     210       210
RAVDESS      24           376    376     376       376
IEMOCAP      10           720    1355    1478      0
Emo-DB       10           71     127     66        69
EMOVO        6            84     84      84        84
SAVEE        4            60     60      60        60
Total        96           1518   2212    2274      799
Table 4.3: An overview of the selected corpora, including the number of actors and the distribution
of utterances across the emotional classes.
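The sub-utterance sampling described above can be sketched as follows; the function name and the waveform-level interface are illustrative assumptions, and the sketch mirrors the two-segment example in the text.

    def sub_utterance_sample(waveform, sample_rate, max_seconds=6.4):
        # An utterance longer than max_seconds is replaced by its first and
        # its last max_seconds-long segments (overlapping whenever the
        # utterance is shorter than twice the limit); both segments inherit
        # the original emotion label.
        max_len = int(max_seconds * sample_rate)
        if len(waveform) <= max_len:
            return [waveform]
        return [waveform[:max_len], waveform[-max_len:]]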
For the evaluation, we adopt a 4-fold cross validation strategy. To begin with, we split the actor
set into 4 partitions. In addition, we impose extra constraints to make sure that each partition
is as uniform as possible in terms of gender and corpus.
Corpus       Partition 1   Partition 2   Partition 3   Partition 4
eNTERFACE    2F, 8M        2F, 9M        3F, 8M        2F, 8M
RAVDESS      3F, 3M        3F, 3M        3F, 3M        3F, 3M
IEMOCAP      1F, 2M        1F, 1M        1F, 1M        2F, 1M
Emo-DB       1F, 2M        1F, 1M        1F, 1M        2F, 1M
EMOVO        1F, 0M        1F, 1M        1F, 1M        0F, 1M
SAVEE        0F, 1M        0F, 1M        0F, 1M        0F, 1M
Total        8F, 16M       8F, 16M       9F, 15M       9F, 15M
Table 4.4: The gender and corpus distributions in each actor set partition of the cross validation
(F: female, M: male).
Layer Name    Structure                  Stride   No. Params
prelim-conv   [2x16, 4] x1               [1, 1]   132
res-8         [2x16, 8; 2x16, 8] x3      [1, 1]   22.84K
res-16        [2x16, 16; 2x16, 16] x1    [1, 1]   24.88K
res-32        [2x16, 32; 2x16, 32] x1    [1, 1]   99.17K
average       -                          -        -
affine        256 -> 4                   -        1024
total         -                          -        148K
Table 4.5: ResNeXt (12, 2x8d) network architecture.
For example, each actor set partition is randomly assigned 2-3 female actors and 8-9 male actors
from the eNTERFACE'05 corpus. More details are provided in Table 4.4. By partitioning the actor
set, it becomes easier to maintain speaker independence between training and validation throughout
all of the experiments.
We extract the spectrograms of each utterance with a 25 ms window every 10 ms using the
Kaldi [82] library. Cepstral mean and variance normalization is then applied to the spectrogram
frames per utterance. To equip each frame with a certain amount of context, we splice it with 10
frames on the left and 5 frames on the right. Therefore, a resulting spliced frame has a resolution
of 16 x 257. Since emotion involves a longer-term mental state transition, we further down-sample
the frame rate by a factor of 8 to simplify and expedite the training process.
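The splicing and frame-rate reduction just described can be illustrated with the following NumPy sketch; the function name and the edge padding at utterance boundaries are our own assumptions.

    import numpy as np

    def splice_and_downsample(frames, left=10, right=5, stride=8):
        # frames: (num_frames, 257) spectrogram after per-utterance CMVN.
        # Each frame is stacked with 10 left and 5 right context frames,
        # giving 16 x 257 spliced frames, which are then kept every 8th step.
        num_frames, _ = frames.shape
        padded = np.pad(frames, ((left, right), (0, 0)), mode='edge')
        spliced = np.stack([padded[t:t + left + 1 + right]
                            for t in range(num_frames)])   # (T, 16, 257)
        return spliced[::stride]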
We build a ResNeXt (12, 2x8d) architecture that consists of only 3 residual stages, res-8,
res-16 and res-32, in which the filter size is fixed to 2x16. There are 3 residual blocks in res-8, and
one residual block each in res-16 and res-32. Before the affine transformation, we add a dropout layer
with probability 0.5. Table 4.5 contains details of the architecture. For each utterance, a simple
mean pooling is taken at the output of the final residual block to form an utterance representation
before it is mapped by the affine layer. We avoid explicit temporal modeling layers such as a long
short-term memory recurrent network because our focus is to investigate the effectiveness of
shaking the ResNeXt. Note that a shaking layer has no parameters to learn and hence the model
size in this work stays almost constant, changing only slightly when the number of batch
normalization layers differs, since batch normalization has a set of trainable scaling and
shift parameters.
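A sketch of this utterance-level read-out is shown below, assuming a 256-dimensional block output as in Table 4.5; the class name and the bias-free affine layer (which gives the 256 x 4 = 1024 parameters listed in the table) are illustrative choices.

    import torch.nn as nn

    class UtteranceHead(nn.Module):
        # Mean pooling over time forms the utterance embedding, followed by
        # dropout (p = 0.5) and a single affine map to the 4 emotion classes.
        def __init__(self, feat_dim=256, num_classes=4, p_drop=0.5):
            super().__init__()
            self.dropout = nn.Dropout(p_drop)
            self.affine = nn.Linear(feat_dim, num_classes, bias=False)

        def forward(self, features):          # features: (batch, time, feat_dim)
            pooled = features.mean(dim=1)     # simple mean pooling per utterance
            return self.affine(self.dropout(pooled))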
We implement the shaking layer as well as the entire network architecture using the PyTorch
[1] library. Only the Shake-Shake combination [29] is used and shaking is applied independently
per frame. We may also for simplicity refer to the Shake-Shake regularization as shaking regu-
larization. Due to class imbalance in the aggregated corpora, the objective function for training
is the weighted cross-entropy, where the class weight is inversely proportional to the class size.
The models are learned using the Adam optimizer [55] with an initial learning rate of 0.001, and
the training is carried out on an NVIDIA Tesla P100 GPU. We use a mini-batch of 8 utterances
across all model training and let each experiment run for 1000 epochs, with three runs using
different random seeds to obtain a robust estimate of the performance.
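The objective and optimizer setup can be sketched as follows; the placeholder model, the variable names and the normalization of the weights are assumptions, and the class counts are taken from Table 4.3.

    import torch
    import torch.nn as nn

    # Class counts for joy, anger, sadness, fear (Table 4.3).
    class_counts = torch.tensor([1518.0, 2212.0, 2274.0, 799.0])
    class_weights = 1.0 / class_counts                  # inversely proportional to class size
    class_weights = class_weights / class_weights.sum() * len(class_counts)

    model = nn.Linear(256, 4)                           # placeholder for the ResNeXt model
    criterion = nn.CrossEntropyLoss(weight=class_weights)   # weighted cross-entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)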
For the sub-band shaking experiments, we initially employ the fully pre-activation layout (Fig.
4.2(c)). PreAct denotes the model without shaking, while PreAct-Full and PreAct-Both
correspond to the models with the Full and Both shaking (Eq. 4.8, 4.9), respectively. We only
consider two sub-bands, defined by the middle point on the spectral axis, to make our point.
However, this does not rule out the feasibility of experiments with more sub-bands. In fact, in the
multi-stream framework for speech recognition, more sub-bands were considered to generate more
combinations and diversity.
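A minimal sketch of two-band shaking on a pair of branch outputs is given below, assuming the last tensor dimension is the spectral axis; the function name is illustrative, and the sketch corresponds to the Both variant (independent coefficients per sub-band), whereas the Full variant would use a single coefficient over the whole band.

    import torch

    def sub_band_shake(y1, y2, training=True):
        # y1, y2: branch outputs of shape (batch, channels, time, freq).
        if not training:
            return 0.5 * (y1 + y2)           # expectation at evaluation time
        mid = y1.size(-1) // 2               # split the spectral axis at the midpoint
        a_low = torch.rand(1, device=y1.device)
        a_high = torch.rand(1, device=y1.device)
        low = a_low * y1[..., :mid] + (1.0 - a_low) * y2[..., :mid]
        high = a_high * y1[..., mid:] + (1.0 - a_high) * y2[..., mid:]
        return torch.cat([low, high], dim=-1)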
We conduct experiments to benchmark different layouts, including PostAct (Fig. 4.2(a)),
RPreAct (Fig. 4.2(b)) and PreActBN (Fig. 4.2(d)), with and without (Full) shaking. Furthermore,
we choose three different initialization values of γ in batch normalization and apply
them to each of the PostAct, RPreAct and PreActBN layouts to investigate whether batch normalized
ResNeXt networks also behave like batch normalized recurrent neural networks in terms of faster
convergence and better generalization, in addition to the resemblance in mathematical
formulations. In this set of experiments, we use the affine transformation in batch normalization
even though CCL recommends not to use it, except in the last batch normalization before the output
layer, where we only standardize the input and do not rescale it.
Finally, we revisit sub-band shaking with the PreActBN layout and with different initialization
values of γ to present a complete comparison between shaking on the full band and shaking on the
sub-bands independently. With this set of experiments, we may also examine the effectiveness
of the trade-off between the un-weighted accuracy (UA) and the generalization gap (Gap) between
training and validation UAs when batch normalization is applied right before shaking.
4.6 Experimental Results
4.6.1 Sub-band Shaking
The results of sub-band shaking are presented in Table 4.6. It is clear that both PreAct-Full and
PreAct-Both are able to improve upon PreAct, unlike in the experiments on the MNIST and
CIFAR-10 datasets. For both shaking regularized models, however, the improvement on the UA is
comparably smaller than the reduction of the generalization gap, which suggests the models are
learning a harder problem in the presence of shaking. It could also be the difference in the nature
of the tasks or in the number of classes that makes PreAct with shaking improve instead of degrade.
Model          Valid UA (%)   Gap (%)
PreAct         61.342         7.485
PreAct-Full    †62.989        †1.128
PreAct-Both    †64.973        †1.791
* UA and Gap are averages over three runs.
† significantly (p < 0.05) outperforms PreAct
Table 4.6: Performances of PreAct, PreAct-Full and PreAct-Both.
Furthermore, we conduct statistical hypothesis testing to examine the significance of the observed
improvements. The resulting p-values from a one-sided paired t-test indicate that the
improvements of PreAct-Full and PreAct-Both over PreAct are statistically significant, in terms
of both the UA and the Gap. In addition, compared to the improvements of PreAct-Full,
PreAct-Both reaches a higher UA but also a larger gap. In fact, the improvement of PreAct-Both
over PreAct-Full is also significant in terms of the UA. On the other hand, PreAct-Full has a
significant reduction of the Gap, compared to the reduction of PreAct-Both on the Gap. More
details are presented in Table 4.7.
Model Pair                     Valid UA       Gap
PreAct vs PreAct-Full          1.44 x 10^-2   2.7 x 10^-4
PreAct vs PreAct-Both          1.22 x 10^-5   3.0 x 10^-3
PreAct-Full vs PreAct-Both     2.60 x 10^-3   9.9 x 10^-1
Table 4.7: P-values resulting from the one-sided paired t-test with df = 4 x 3 - 1 = 11.
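The testing procedure can be sketched as follows; the function name is illustrative, and the per-fold, per-run pairing (4 folds x 3 runs = 12 paired observations, so df = 11) is an assumption based on the setup described above.

    from scipy import stats

    def one_sided_paired_ttest(scores_a, scores_b):
        # scores_a, scores_b: paired per-fold, per-run metrics for two models;
        # tests whether model B is better than model A.
        t_stat, p_two_sided = stats.ttest_rel(scores_b, scores_a)
        p_one_sided = p_two_sided / 2.0 if t_stat > 0 else 1.0 - p_two_sided / 2.0
        return t_stat, p_one_sided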
Based on these observations, we can think of sub-band shaking as a method to relax the
strength of shaking. This way, one may trade off between the amount of improvement on the
UA and the amount of reduction of the Gap. Our previous work on stochastic Shake-Shake
regularization [49] also investigated another method to trade off between these two performance
metrics by randomly turning off shaking. Since sub-band shaking and stochastic Shake-Shake
behave similarly and the latter does not contribute directly to the focus of this work, we choose
to present only sub-band shaking here.
4.6.2 Experiments with Different Layouts
The results of benchmarking the layouts are presented in Table 4.8. First of all, we notice that
even without shaking, all of PostAct, RPreAct and PreActBN achieve higher UAs with
significance (denoted by ⋆), compared to that of PreAct, while the UAs of the first three
models are comparable, i.e., there is no statistical significance in their differences. These
improvements on the UA could be attributed to discriminative feature learning, i.e. CCL, since we
additionally apply batch normalization without rescaling before the final affine layer in the first
three models. We deliberately employ CCL in these three models to set up a set of competitive
baselines.
Next, we look at the layouts with shaking at γ0 = 1.0. We immediately observe that, even
with the common practice of unit initialization for γ, all three of these layouts give further
improvements on the UA. However, none of them is able to reduce the generalization gap with
statistical significance. The closest one is RPreAct with shaking at γ0 = 1.0, for which the p-value
of the test is 0.088 due to a high variation of the generalization gap across folds. With the
common practice of unit initialization, PreActBN converges to the most accurate model with
a UA as high as 66.194%, but it also results in the largest generalization gap.
Architecture   Shake   γ0     Valid UA (%)   Gap (%)
PreAct         no      -      61.342         7.485
PreAct         yes     -      †62.989        †1.128
PostAct        no      -      ⋆62.782        7.822
PostAct        yes     1.00   †65.536        7.657
PostAct        yes     0.20   †65.490        5.431
PostAct        yes     0.10   †65.483        7.255
PostAct        yes     0.05   †65.512        7.205
RPreAct        no      -      ⋆63.939        8.517
RPreAct        yes     1.00   †65.859        7.063
RPreAct        yes     0.20   †65.821        8.305
RPreAct        yes     0.10   †64.899        7.958
RPreAct        yes     0.05   †65.052        †5.821
PreActBN       no      -      ⋆63.407        7.410
PreActBN       yes     1.00   †66.194        8.348
PreActBN       yes     0.20   †65.789        ‡5.817
PreActBN       yes     0.10   †66.097        ‡6.040
PreActBN       yes     0.05   †66.418        †‡3.416
* UA and Gap are averages over three runs.
⋆ significantly (p < 0.05) outperforms PreAct without shaking
† significantly (p < 0.05) outperforms the same layout without shaking
‡ significantly (p < 0.05) outperforms the same layout at γ0 = 1.0
Table 4.8: Performances of PreAct, PostAct, RPreAct and PreActBN with and without
shaking for speech emotion recognition, where γ0 is the initialization value of the standard
deviation parameter γ in batch normalization.
In fact, we can see that, for all values of γ0 we choose, these three layouts all outperform
their respective baseline models (the same layout without shaking) in terms of the UA with
significance (denoted by †), and all of them show a trend of generalization gap reduction; the
majority of them, however, are unable to reduce the generalization gap with significance, except
for two settings. Both RPreAct and PreActBN outperform their baseline models simultaneously on
the UA and on the Gap, achieving a higher accuracy as well as reduced over-fitting at the same
time, when γ is initialized as 0.05. PreActBN with shaking at γ0 = 0.05 is also the best model,
both highest on the UA and lowest on the Gap (except for PreAct with shaking, which is known to
be difficult to learn). In other words, the result of this model validates our hypothesis with
statistical significance that batch normalized ResNeXt networks are locally similar in formulation
to batch normalized recurrent neural networks, and both of them require a careful selection of
initialization values for the batch normalization parameters to avoid the aforementioned
difficulties.
For the comparison between layouts, we find that PreActBN with shaking at γ0 = 0.05 also
outperforms most of the settings of the PostAct and RPreAct layouts with statistical significance,
except for the reduction of the generalization gap by PostAct at γ0 = 0.2, over which
PreActBN@0.05 is not significantly better. We summarize the complete comparison results in
Table 4.9.
Architecture Pair             γ0     Valid UA       Gap
PreActBN@0.05 vs PostAct      1.00   2.37 x 10^-3   4.12 x 10^-2
                              0.20   2.96 x 10^-3   6.44 x 10^-2
                              0.10   1.71 x 10^-2   2.15 x 10^-2
                              0.05   2.21 x 10^-2   1.27 x 10^-2
PreActBN@0.05 vs RPreAct      1.00   1.29 x 10^-2   1.00 x 10^-2
                              0.20   2.85 x 10^-2   2.44 x 10^-3
                              0.10   5.07 x 10^-4   1.34 x 10^-2
                              0.05   4.14 x 10^-3   4.23 x 10^-2
Table 4.9: P-values from the one-sided paired t-test between PreActBN at γ0 = 0.05 and PostAct
and RPreAct with various values of γ0.
Architecture   Shake   γ0     Valid UA (%)   Gap (%)
PreAct         yes     1.00   64.973         1.791
PreActBN       yes     1.00   65.544         8.034
PreActBN       yes     0.20   65.679         ‡4.408
PreActBN       yes     0.10   66.170         6.257
PreActBN       yes     0.05   ‡66.432        ‡5.539
* UA and Gap are averages over three runs.
‡ significantly (p < 0.05) outperforms the same layout at γ0 = 1.0
Table 4.10: Performances of PreAct and PreActBN with sub-band shaking for speech emotion
recognition, where γ0 is the initialization value of the standard deviation parameter γ in batch
normalization.
Finally, when comparing the γ0-initialized batch normalized networks (γ0 = 0.05, 0.1, 0.2)
with their unit-initialized counterparts, we find that although both RPreAct and PreActBN
show a tendency to reduce the Gap with a smaller γ0, only PreActBN manages to
outperform its unit-initialized counterpart (denoted by ‡). This may suggest that the extra batch
normalization between residual blocks in PreActBN makes it structurally more similar to batch
normalized recurrent neural networks. On the other hand, any two batch normalization layers in
PostAct or RPreAct are always separated by a convolutional layer.
In Section 4.6.1, we saw that sub-band shaking leads to an improvement on the UA
but also enlarges the generalization gap. However, we previously employed PreAct instead of
PreActBN and did not experiment with different initialization values of γ. To present a complete
comparison between shaking on the full band and shaking on the sub-bands independently, we
conduct experiments on sub-band shaking with the PreActBN layout and with the chosen
initialization values of γ. The results are summarized in Table 4.10. Again, comparing PreAct and
PreActBN both at γ0 = 1.00, we can easily observe a significant increment of the generalization
gap along with the additional batch normalization. With a smaller initialization value, PreActBN
not only reduces the generalization gap but also improves the UA when γ0 = 0.05, with statistical
significance. Yet, the resulting 5.539% Gap is still significantly larger than the 1.791% Gap of
the PreAct layout.
We further compare the performance of PreActBN with shaking on the full band (66.418%/3.416%)
and on the sub-bands independently (66.432%/5.539%). The p-values for these two settings are
0.51 and 0.079 for the tests between the UAs and between the Gaps, respectively. In the end,
shaking on the full band and shaking on the sub-bands independently give comparable performances,
with shaking on the full band slightly better at reducing the generalization gap.
4.7 Conclusion
We investigate the recently proposed Shake-Shake regularization and its variants for classification
in general and speech emotion recognition in particular. In order to explain the observed
interaction between batch normalization and shaking regularization, we base our ablation analysis of
batch normalization in the MNIST experiments on discriminative feature learning. Our
experiments show that the batch normalization right before shaking regularization is crucial in that
it keeps the dispersed, symmetric distribution of intermediate representations from being tilted by
the random perturbation due to shaking. Without this batch normalization, shaking regularization
could easily tilt the distribution of each class and cause the classes to overlap each other.
In addition, we highlight the subtle difference in the requirements on embeddings for classification
tasks and verification tasks: according to the vicinal risk minimization principle and the recent
success of Mixup [117], classification tasks should try to minimize the margin between classes in
feature space, while verification tasks aim to find an embedding distribution that has minimal
intra-class variation but maximal inter-class dispersion. From this perspective, we find that the
embeddings produced by PreActBN with shaking are indeed distributed with a small or zero
margin between classes (Fig. 4.4(d)), while the embeddings produced by PreActBN without shaking
are distributed with large margins (Fig. 4.4(c)). Moreover, since the original embeddings are
distributed close to the class center vectors, the distribution of testing embeddings, which is often
assumed to be the same as or similar to the distribution of the training samples, becomes more
compact and hence leads to a higher accuracy. This finding provides a direct explanation, based on
the VRM principle, for the ability of shaking regularization to help improve classification tasks.
Another set of experiments on CIFAR-10 further validates that it is the direct concatenation of
batch normalization and shaking that contributes the most to the improvement in classification
accuracy, and other layers of batch normalization are auxiliary.
In addition, we find that the formulation of batch normalized residual blocks resembles that of
batch normalized recurrent neural networks. Moreover, both of these architectures suffer from a
reported issue of fast convergence but more over-fitting. To reduce the observed increment of the
generalization gap in our speech emotion recognition experiments, we initialize the γ parameter
with a smaller value. The experimental results validate our hypothesis and give a significant
reduction of the generalization gap while achieving the same or a higher improvement on the UA.
A final comparison between shaking on the full band and shaking on the sub-bands independently
shows that, with the additional batch normalization in PreActBN and a proper initialization
value of γ, the difference between these two kinds of shaking is minimized, with shaking on the
full band being slightly better at reducing the generalization gap.
Part IV
Concluding Remarks
Chapter 5
Conclusion
Our work in this dissertation is divided into two major portions. The first half covers the
application of data augmentation with noise to raw emotional utterances and a thorough benchmarking
of the negative influence of noise with respect to the size and aspect of filters in convolutional
neural networks. The second half is dedicated to model-based representation augmentation, which
includes variants of Shake-Shake regularization as well as ablation studies to understand the
success of shaking in improving classification tasks.
In the first part, we presented several models to address the challenges in speech emotion
recognition outlined in Chapter 1, based on end-to-end deep learning for discriminative feature
formulation. Specifically, we trained several CLDNN-based models in an end-to-end fashion, which
outperform conventional models trained in a two-stage approach where feature engineering and
classifier training are carried out separately. In order to keep a CLDNN model from over-fitting,
we devised a data augmentation procedure to artificially increase the size of the training data
with a noise corpus while cautiously preventing any artificial patterns.
For the benchmarking of convolutional operations, we found that noise negatively influences
filters of a small size, while large filters demonstrate a certain degree of robustness. In addition,
two-dimensional filters that scan across multiple neighboring frames also show robustness
to noise. Discrete cosine transformation, on the other hand, suffers from its task-ignorant nature
and lack of non-linearity to disentangle factors of variation.
In the second part, we investigated the Shake-Shake regularization and its variants. Leveraging
domain knowledge, we proposed sub-band shaking to effectively control the regularization strength
and demonstrated trade-offs between the classification accuracy and the generalization gap. Our
experiments show that shaking contributes more to reducing training accuracy than to increasing
validation accuracy.
We further noted that batch normalization plays an important role in the success of shaking
regularization by keeping a model from diverging. We connected the batch normalization right
before shaking to discriminative feature learning and vicinal risk minimization, where batch
normalization disperses the feature distribution to keep classes from overlapping each other while
shaking creates augmented samples to expand the coverage of classes. Furthermore, we showed that
it is crucial for batch normalization to be directly connected to shaking in order to exhibit the
dispersing effect, and other batch normalization layers that are separated from shaking by
convolutional layers are auxiliary.
With the batch normalization directly connected to shaking, we observed a convergence issue
in batch normalized residual blocks that had been observed in batch normalized recurrent neural
networks as well. By inspecting the formulations, we found that these two architectures actually
bear a strong resemblance and that the shared convergence issue could be solved by a common
solution. By properly initializing batch normalization, our experiments on speech emotion
recognition show an improvement on the classification and a reduction of the generalization gap.
In this dissertation, we studied regularization based on data augmentation for speech emotion
recognition, either directly on the raw inputs or on the intermediate representations. We dived
into these two approaches, scrutinized the details that cause changes in performance, and strove
to find an explanation for them. Throughout the research process, we tried to make generalizable
assumptions whenever possible and hope our findings for speech emotion recognition tasks can be
applicable to other similar tasks as well.
Reference List
[1] Pytorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration,
2017. https://github.com/pytorch/pytorch.
[2] Ossama Abdel-Hamid, Abdel-Rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and
Dong Yu. Convolutional Neural Networks for Speech Recognition. IEEE/ACM Transac-
tions on Audio, Speech, and Language Processing, 22(10):1533{1545, October 2014.
[3] Guillaume Alain and Yoshua Bengio. Understanding Intermediate Layers Using Linear
Classifier Probes. arXiv:1610.01644, 2016.
[4] Namrata Anand and Prateek Verma. Convoluted Feelings Convolutional and Recurrent
Nets for Detecting Emotion from Audio Data. Technical Report, Stanford University, 2015.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression
Trees. Wadsworth and Brooks, 1984.
[6] Leo Breiman. Random Forests. Machine Learning, 45(1):5{32, October 2001.
[7] J. S. Bridle and M. D. Brown. An Experimental Automatic Word Recognition System.
JSRU Report No. 1003, 1974.
[8] Felix Burkhardt, Astrid Paeschke, M. Rolfes, Walter F. Sendlmeier, and Benjamin Weiss.
A Database of German Emotional Speech. In Proceedings of Interspeech, 2005.
[9] C. Busso, S. Mariooryad, A. Metallinou, and S. Narayanan. Iterative Feature Normalization
Scheme for Automatic Emotion Detection from Speech. IEEE Transactions on Affective
Computing, 4(4):386-397, 2013.
[10] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel
Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. IEMOCAP: Inter-
active Emotional Dyadic Motion Capture Database. Language Resources and Evaluation,
42(4):335, Nov 2008.
[11] William Chan and Ian Lane. Deep Convolutional Neural Networks for Acoustic Model-
ing in Low Resource Languages. In Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing, 2015.
[12] Olivier Chapelle, Jason Weston, Léon Bottou, and Vladimir Vapnik. Vicinal Risk Minimization.
In Advances in Neural Information Processing Systems, 2000.
[13] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic
Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 2012.
[14] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
[15] Taco S. Cohen and Max Welling. Group Equivariant Convolutional Networks. In Proceed-
ings of the International Conference on Machine Learning, 2016.
[16] Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville.
Recurrent Batch Normalization. In International Conference on Learning Representations,
2017.
[17] Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning,
20(3):273{297, September 1995.
[18] Giovanni Costantini, Iacopo Iaderola, andrea Paoloni, and Massimiliano Todisco. EMOVO
Corpus: an Italian Emotional Speech Database. In Proceedings of the 9th International
Conference on Language Resources and Evaluation, 2014.
[19] Y. Le Cun, B. Boser, J. S. Denker, R. E. Howard, W. Habbard, L. D. Jackel, and D. Hen-
derson. Handwritten Digit Recognition with A Back-propagation Network. In Advances in
Neural Information Processing Systems, 1990.
[20] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. ArcFace: Additive Angular Margin Loss
for Deep Face Recognition. arXiv:1801.07698, 2018.
[21] T. DeVries and G. W. Taylor. Dataset Augmentation in Feature Space. In International
Conference on Learning Representations Workshop, 2017.
[22] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting Cyclic Symmetry in
Convolutional Neural Networks. In Proceedings of the International Conference on Machine
Learning, 2016.
[23] Paul Ekman, E Richard Sorenson, Wallace V Friesen, et al. Pan-Cultural Elements in Facial
Displays of Emotion. Science, 164(3875):86{88, 1969.
[24] Moataz El Ayadi, Mohamed S. Kamel, and Fakhri Karray. Survey on Speech Emotion
Recognition: Features, Classification Schemes, and Databases. Pattern Recognition,
44(3):572-587, March 2011.
[25] Moataz El Ayadi, Mohamed S. Kamel, and Fakhri Karray. Survey on Speech Emotion
Recognition: Features, Classification Schemes, and Databases. Pattern Recognition,
44(3):572-587, March 2011.
[26] Jerey L. Elman. Finding Structure in Time. Cognitive Science, 14(2):179{211, 1990.
[27] Florian Eyben, Klaus Scherer, Björn Schuller, Johan Sundberg, Elisabeth Andre, Carlos
Busso, Laurence Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, and Khiet
Truong. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research
and Affective Computing. IEEE Transactions on Affective Computing, 7(2):190-202, 2016.
[28] Florian Eyben, Martin Wöllmer, and Björn Schuller. openSMILE: The Munich Versatile
and Fast Open-Source Audio Feature Extractor. In Proceedings of the ACM International
Conference on Multimedia, 2010.
[29] Xavier Gastaldi. Shake-Shake Regularization. In International Conference on Learning
Representations Workshop, 2017.
[30] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Con-
volutional Sequence to Sequence Learning, 2017. arXiv:1705.03122.
[31] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain Adaptation for Large-Scale
Sentiment Classification: A Deep Learning Approach. In Proceedings of the International
Conference on Machine Learning, 2011.
[32] Ian J. Goodfellow, Quoc V. Le, Andrew M. Saxe, Honglak Lee, and Andrew Y. Ng. Mea-
suring Invariances in Deep Networks. In Proceedings of the International Conference on
Neural Information Processing Systems, 2009.
[33] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. Bidirectional LSTM Networks
for Improved Phoneme Classification and Recognition. In Proceedings of the International
Conference on Artificial Neural Networks, 2005.
[34] Alfred Haar. Zur Theorie der Orthogonalen Funktionensysteme. Mathematische Annalen,
69(3):331{371, September 1910.
[35] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality Reduction by Learning An
Invariant Mapping. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2006.
[36] Kun Han, Dong Yu, and Ivan Tashev. Speech Emotion Recognition Using Deep Neural
Network and Extreme Learning Machine. In Proceedings of Interspeech, 2014.
[37] Sana-Ul Haq and Philip J.B. Jackson. Machine Audition: Principles, Algorithms and Sys-
tems, chapter Multimodal Emotion Recognition. IGI Global, 2010.
[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for
Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2016.
[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for
Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2016.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings in Deep
Residual Networks. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors,
Proceedings of the European Conference on Computer Vision (ECCV), 2016.
[41] H. Hermansky. Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the
Acoustical Society of America, 57(4):1738{52, April 1990.
[42] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen,
R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm
Slaney, Ron J. Weiss, and Kevin Wilson. CNN Architectures for Large-Scale Audio
Classification. arXiv:1609.09430, 2016.
[43] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation,
9(8), 1997.
[44] Che-Wei Huang and Shrikanth S. Narayanan. Flow of Renyi Information in Deep Neural
Networks. In Proceedings of the IEEE International Workshop of Machine Learning for
Signal Processing, 2016.
[45] Che-Wei Huang and Shrikanth S. Narayanan. Characterizing Types of Convolution in
Deep Convolutional Recurrent Neural Networks for Robust Speech Emotion Recognition.
arXiv:1706.02901v2, 2017.
[46] Che-Wei Huang and Shrikanth S. Narayanan. Deep Convolutional Recurrent Neural Net-
work with Attention Mechanism for Robust Speech Emotion Recognition. In Proceedings
of the IEEE International Conference on Multimedia and Expo, 2017.
[47] Che-Wei Huang and Shrikanth S. Narayanan. Deep Convolutional Recurrent Neural Net-
work with Attention Mechanism for Robust Speech Emotion Recognition. In IEEE Inter-
national Conference on Multimedia and Expo (ICME), 2017.
[48] Che-Wei Huang and Shrikanth S. Narayanan. Shaking Acoustic Spectral Sub-bands Can
Better Regularize Learning in Affective Computing. In Proceedings of the IEEE International
Conference on Audio, Speech and Signal Processing, 2018.
[49] Che-Wei Huang and Shrikanth S. Narayanan. Stochastic Shake-Shake Regularization for
Affective Learning from Speech. In Proceedings of Interspeech, 2018.
[50] G. B. Huang, Q. Y. Zhu, and C. K. Siew. Extreme Learning Machine: Theory and Appli-
cations. Neurocomputing, 70:489{501, 2006.
[51] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep Networks
with Stochastic Depth. In ECCV, 2016.
[52] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International
Conference on Machine Learning (ICML), 2015.
[53] Gil Keren, Jun Deng, Jouni Pohjalainen, and B. Schuller. Convolutional Neural Networks
with Data Augmentation for Classifying Speakers Native Language. In Proceedings of In-
terspeech, 2016.
[54] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping
Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and
Sharp Minima. In Proceedings of the International Conference on Learning Representations,
2017.
[55] Diederik P. Kingma and Jimmy Ba. ADAM: A Method for Stochastic Optimization.
arXiv:1412.6980, 2014.
[56] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical Report,
2009.
[57] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep
Convolutional Neural Networks. In Advances in Neural Information Processing Systems,
2012.
[58] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch Normalized Recurrent
Neural Networks. In Proceedings of the IEEE International Conference on Audio, Speech
and Signal Processing, 2016.
[59] Yann LeCun and Corinna Cortes. MNIST Handwritten Digit Database. 2010.
[60] Chul Min Lee, Serdar Yildirim, Murtaza Bulut, Abe Kazemzadeh, Carlos Busso, Zhigang
Deng, Sungbok Lee, and Shrikanth Narayanan. Emotion Recognition Based on Phoneme
Classes. In Proceedings of InterSpeech, 2004.
[61] Jinkyu Lee and Ivan Tashev. High-Level Feature Representation Using Recurrent Neural
Network for Speech Emotion Recognition. In Proceedings of Interspeech, 2015.
[62] Deng Li, Ossama Abdel-Hamid, and Dong Yu. A Deep Convolutional Neural Network
Using Heterogeneous Pooling for Trading Acoustic Invariance with Phonetic Confusion. In
Proceedings of the IEEE International Conference on Audio, Speech and Signal Processing,
2013.
[63] Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the Disharmony Between
Dropout and Batch Normalization by Variance Shift. arXiv:1801.05134, 2018.
[64] Wootaek Lim, Dae-young Jang, and Taejin Lee. Speech Emotion Recognition Using
Convolutional and Recurrent Neural Networks. In Proceedings of the Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference, 2016.
[65] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace:
Deep Hypersphere Embedding for Face Recognition. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017.
[66] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-Margin Softmax Loss for
Convolutional Neural Networks. In Proceedings of The 33rd International Conference on
Machine Learning, pages 507{516, 2016.
[67] S. R. Livingstone, K. Peck, and F. A. Russo. RAVDESS: The Ryerson Audio-Visual
Database of Emotional Speech and Song. In Proceedings of the 22nd Annual Meeting of
the Canadian Society for Brain, Behaviour and Cognitive Science, 2012.
[68] David G. Lowe. Object Recognition from Local Scale-Invariant Features. In Proceedings of
the International Conference on Computer Vision, 1999.
[69] Xingchen Ma, Hongyu Yang, Qiang Chen, Di Huang, and Yunhong Wang. DepAudioNet:
An Efficient Deep Model for Audio Based Depression Classification. In Proceedings of the
International Workshop on Audio/Visual Emotion Challenge, 2016.
[70] D. Maclaurin, D. Duvenaud, and R. P. Adams. Early Stopping Is Nonparametric Variational
Inference. In Proceedings of the International Conference on Articial Intelligence and
Statistics, 2016.
[71] Harish Mallidi and Hynek Hermansky. Novel Neural Network Based Fusion for Multistream
ASR. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing, March 2016.
[72] Qirong Mao, Ming Dong, Zhengwei Huang, and Yongzhao Zhan. Learning Salient Features
for Speech Emotion Recognition Using Convolutional Neural Networks. IEEE Transactions
on Multimedia, 16(8):2203{2213, 2014.
[73] Olivier Martin, Irene Kotsia, Benoit M. Macq, and Ioannis Pitas. The eNTERFACE'05
Audio-Visual Emotion Database. In Proceedings of the International Conference on Data
Engineering Workshops, 2006.
[74] R.K. McConnell. Method of and Apparatus for Pattern Recognition, January 1986. US
Patent 4,567,610.
[75] Angeliki Metallinou, Martin Wöllmer, Athanasios Katsamanis, Florian Eyben, Björn
Schuller, and Shrikanth Narayanan. Context-Sensitive Learning for Enhanced Audiovisual
Emotion Classification. IEEE Transactions on Affective Computing, 3(2):184-198, April
2012.
[76] Benjamin Milde and Chris Biemann. Using Representation Learning and Out-of-Domain
Data for A Paralinguistic Speech Task. In Proceedings of Interspeech, 2015.
[77] Shrikanth S. Narayanan and Panayiotis Georgiou. Behavioral Signal Processing: Deriv-
ing Human Behavioral Informatics from Speech and Language. Proceedings of IEEE,
101(5):1203{1233, May 2013.
[78] K. Oatley, D. Keltner, and J.M. Jenkins. Understanding Emotions. Blackwell, 1996.
[79] T. Ojala, M. Pietikainen, and D. Harwood. Performance Evaluation of Texture Measures
with Classification Based on Kullback Discrimination of Distributions. In Proceedings of
the 12th IAPR International Conference on Pattern Recognition, 1994.
[80] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to Con-
struct Deep Recurrent Neural Networks. In Proceedings of the International Conference on
Learning Representations, 2014.
[81] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research, 12:2825{2830, 2011.
[82] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Nagendra Goel, Mirko Hannemann, Yanmin
Qian, Petr Schwarz, and Georg Stemmer. The KALDI Speech Recognition Toolkit. In
Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding,
2011.
[83] Xianbiao Qi and Lei Zhang. Face recognition via centralized coordinate learning.
arXiv:1801.05678, 2018.
[84] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable Convolutional Neural
Networks. arXiv:1710.00935.
[85] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel Distributed Processing: Explo-
rations in the Microstructure of Cognition, vol. 1. chapter Learning Internal Representations
by Error Propagation, pages 318{362. 1986.
[86] S. S. Stevens, J. Volkmann, and E. B. Newman. A Scale for the Measurement of the
Psychological Magnitude Pitch. Journal of the Acoustical Society of America, 8:185-190, 1937.
[87] Tara N. Sainath and Bo Li. Modeling Time-Frequency Patterns with LSTM vs. Convolu-
tional Architectures for LVCSR Tasks. In Proceedings of Interspeech, 2016.
[88] Tara N. Sainath, Oriol Vinyals, Andrew W. Senior, and Hasim Sak. Convolutional, Long
Short-Term Memory, Fully Connected Deep Neural Networks. In Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing, 2015.
[89] Tara N. Sainath, Ron J. Weiss, Andrew Senior, Kevin W. Wilson, and Oriol Vinyals. Learn-
ing the Speech Front-End with Raw Waveform CLDNNs. In Proceedings of Interspeech,
2015.
[90] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How Does
Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift).
arXiv:1805.11604, 2018.
[91] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding
for Face Recognition and Clustering. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2015.
[92] B. Schuller, G. Rigoll, and M. Lang. Hidden Markov Model-Based Speech Emotion Recog-
nition. In Proceedings of the International Conference on Multimedia and Expo, 2003.
[93] Björn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. Recognising Realistic Emotions
and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge.
Speech Communication, 53(9-10):1062-1087, November 2011.
[94] Björn W. Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus R. Scherer,
Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi,
Marcello Mortillaro, Hugues Salamin, Anna Polychroniou, Fabio Valente, and Samuel Kim. The
INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict,
Emotion, Autism. In Proceedings of Interspeech, 2013.
[95] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale
Image Recognition. In Proceedings of the International Conference on Learning Represen-
tations, 2015.
[96] David Snyder, Guoguo Chen, and Daniel Povey. MUSAN: A Music, Speech, and Noise
Corpus. arXiv:1510.08484.
[97] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of
Machine Learning Research, 15:1929-1958, 2014.
[98] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach.
Learn. Res., 15(1), January 2014.
[99] Martin Sundermeyer, Tamer Alkhouli, Joern Wuebker, and Hermann Ney. Translation
Modeling with Bidirectional Recurrent Neural Networks. In Proceedings of the Conference
on Empirical Methods on Natural Language Processing, 2014.
[100] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception
Architecture for Computer Vision. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2016.
[101] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fer-
gus. Intriguing Properties of Neural Networks. In International Conference on Learning
Representations, 2014.
[102] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolu-
tions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2015.
[103] Naoya Takahashi, Michael Gygli, and Luc Van Gool. AEnet: Learning Deep Audio Features
for Video Analysis. arXiv:1701.00599.
[104] Theano Development Team. Theano: A Python Framework for Fast Computation of Math-
ematical Expressions. arXiv:1605.02688.
[105] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and
S. Zafeiriou. Adieu Features? End-to-End Speech Emotion Recognition Using A Deep
Convolutional Recurrent Network. In Proceedings of the IEEE International Conference on Audio,
Speech and Signal Processing, 2016.
[106] Panagiotis Tzirakis, George Trigeorgis, Mihalis A. Nicolaou, Björn Schuller, and Stefanos
Zafeiriou. End-to-End Multimodal Emotion Recognition Using Deep Neural Networks.
arXiv:1704.08619.
[107] V. Vapnik. Statistical Learning Theory. J. Wiley, 1998.
[108] Vikas Verma, Alex Lamb, Christopher Beckham, Aaron Courville, Ioannis Mitliagkas, and
Yoshua Bengio. Manifold mixup: Encouraging meaningful on-manifold interpolation as a
regularizer. arXiv:1806.05236, 2018.
[109] Nguyen Xuan Vinh, Sarah M. Erfani, Sakrapee Paisitkriangkrai, James Bailey, Christopher
Leckie, and Kotagiri Ramamohanarao. Training Robust Models Using Random Projection.
In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), 2016.
[110] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li,
and Wei Liu. CosFace:Large Margin Cosine Loss for Deep Face Recognition. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[111] Martin Wöllmer, Angeliki Metallinou, Florian Eyben, Björn Schuller, and Shrikanth
Narayanan. Context-Sensitive Multimodal Emotion Recognition from Speech and Facial
Expression Using Bidirectional LSTM Modeling. In Proceedings of Interspeech, 2010.
[112] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated Residual
Transformations for Deep Neural Networks. arXiv:1611.05431, 2016.
[113] Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. ShakeDrop Regularization.
arXiv:1802.02375, 2018.
[114] Jingjie Yan, Wenming Zheng, Qinyu Xu, Guanming Lu, Haibo Li, and Bei Wang. Sparse
Kernel Reduced-Rank Regression for Bimodal Emotion Recognition from Facial Expression
and Speech. IEEE Transactions on Multimedia, 18(7):1319{1329, 2016.
[115] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In BMVC, 2016.
[116] Matthew D. Zeiler and Rob Fergus. Visualizing and Understanding Convolutional Networks.
In Proceedings of the European Conference on Computer Vision, 2014.
[117] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond
Empirical Risk Minimization. In International Conference on Learning Representations,
2018.
[118] Shiqing Zhang, Shiliang Zhang, Tiejun Huang, and Wen Gao. Multimodal Deep Con-
volutional Neural Network for Audio-Visual Emotion Recognition. In Proceedings of the
International Conference on Multimedia Retrieval, 2016.
Abstract
Regularization is crucial to the success of many practical deep learning models, in particular in the more often than not scenario where there are only a few to a moderate number of accessible training samples, and speech emotion recognition, or in general computational paralinguistics, is undoubtedly one of them. Common practices of regularization include weight decay, Dropout [97] and data augmentation.

From the representation learning perspective, we first examine the negative influence of data augmentation with noise with respect to the size and aspect of filters in convolutional neural networks, from which we are able to design the optimal architecture under noisy and clean conditions, respectively, for speech emotion recognition.

Moreover, regularization based on multi-branch architectures, such as Shake-Shake regularization [29], has been proven successful in many applications and attracted more and more attention. However, beyond model-based representation augmentation, it is unclear how a stochastic mixture of model branches could help to provide further improvement on classification tasks, let alone the baffling interaction between batch normalization and shaking.

We present our investigation on regularization by model-based perturbation, drawing connections to the vicinal risk minimization principle [12] and discriminative feature learning in verification tasks. Furthermore, we identify a strong resemblance between batch normalized residual blocks and batch normalized recurrent neural networks, where both of them share a similar convergence behavior, which could be mitigated by a proper initialization of batch normalization.

Based on these findings, our experiments on speech emotion recognition demonstrate simultaneously an improvement on the classification accuracy and a reduction on the generalization gap, both with statistical significance.