CREATING CROSS-MODAL, CONTEXT-AWARE REPRESENTATIONS OF
MUSIC FOR DOWNSTREAM TASKS
by
Timothy Greer
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2022
Copyright 2022 Timothy Greer
Acknowledgments
I would like to thank my advisor Dr. Shrikanth Narayanan for guiding me
through this rewarding PhD journey and encouraging me to take on diverse research
problems. Thank you to my committee for valuable feedback as I progressed through
this program, and my family, who have been my support system every step of the
way.
My completion of this degree would not be possible without the help of family,
friends, and loved ones. Thank you, Mom, Dad, Doug, Brad, Tina, and Abhishek—
to name only a few who have helped provide support to me throughout this
journey.
Table of Contents
Acknowledgments ii
List of Tables vii
List of Figures xi
Abstract xiv
1 Introduction 1
1.1 Music Representations . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Music Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Musical Genre Classification . . . . . . . . . . . . . . . . . . 4
1.2.2 Music Emotion Recognition . . . . . . . . . . . . . . . . . . 5
1.2.3 Music’s Use Across Different Contexts . . . . . . . . . . . . 7
1.3 Scope of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Creating Novel Representations of Music for Downstream Tasks 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 M3BERT Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Transformer Encoder . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Pre-training and Training Objectives . . . . . . . . . . . . . 15
2.3 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Dataset Curation and Preprocessing . . . . . . . . . . . . . 20
2.3.2 Training Setup . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 Patch Masking, CFM and CCM . . . . . . . . . . . . . . . . 24
2.4.2 GTZAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.3 MTG-Jamendo Emotions and Themes in Music . . . . . . . 25
2.4.4 Extended Ballroom Genre Classification Dataset . . . . . . . 26
2.4.5 DEAM Music Emotion Recognition Task . . . . . . . . . . . 26
2.4.6 RWC Instrument Detection Task . . . . . . . . . . . . . . . 26
2.4.7 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.8 Correlational Analysis . . . . . . . . . . . . . . . . . . . . . 30
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Creating Non-Auditory Representations of Music for Downstream
Tasks 34
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Genre Classification Task . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.1 Collecting Billboard Songs . . . . . . . . . . . . . . . . . . . 43
3.5.2 Genre Classification Results . . . . . . . . . . . . . . . . . . 44
3.6 Emotion Recognition Task . . . . . . . . . . . . . . . . . . . . . . . 46
3.6.1 Collecting Annotations . . . . . . . . . . . . . . . . . . . . . 46
3.6.2 Statistics on Behavioral Responses . . . . . . . . . . . . . . 48
3.6.3 Emotion Regression Results . . . . . . . . . . . . . . . . . . 50
3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Studying Music from Multiple Views in a Context-Aware Manner 54
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Neural Response . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Physiological Response . . . . . . . . . . . . . . . . . . . . . 56
4.2.3 Affective Response . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.4 Multimodal Time Series Modeling . . . . . . . . . . . . . . . 58
4.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.1 Neural Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Psychophysiological Recordings . . . . . . . . . . . . . . . . 61
4.3.3 Emotion Ratings . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.4 Auditory Features . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Neural data . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.2 Physiology Data . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.3 Emotion Ratings . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5.1 Phase Synchronizations . . . . . . . . . . . . . . . . . . . . . 69
4.5.2 Physiology Data . . . . . . . . . . . . . . . . . . . . . . . . 70
4.5.3 Emotion Ratings . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6.1 Neural . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6.2 Physiological . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6.3 Emotional . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5 Embedding and Quantifying Responses to Music 78
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Annotation Fusion Methods in Affective Computing . . . . . . . . . 81
5.2.1 Time Alignment . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2 Fusion of Annotations . . . . . . . . . . . . . . . . . . . . . 83
5.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.1 Auditory Features . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.1 Parameter Selection for Annotation Fusion . . . . . . . . . . 86
5.4.2 Prediction Models . . . . . . . . . . . . . . . . . . . . . . . . 87
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Music’s Use in Film and Advertisements 92
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1.1 Music Use Across Film Genre . . . . . . . . . . . . . . . . . 94
6.1.2 Visual-Musical Cross-Modal Analysis . . . . . . . . . . . . . 95
6.1.3 Multiple Instance Learning . . . . . . . . . . . . . . . . . . . 96
6.2 Research Data Collection and Curation . . . . . . . . . . . . . . . . 98
6.2.1 Film and Soundtrack Collection . . . . . . . . . . . . . . . . 98
6.2.2 Automatically Extracting Musical Cues in Film . . . . . . . 99
6.2.3 Musical Feature Extraction . . . . . . . . . . . . . . . . . . 101
6.2.4 Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.1 Genre Prediction Model Training Procedure . . . . . . . . . 102
6.3.2 Model Architectures . . . . . . . . . . . . . . . . . . . . . . 103
6.3.3 Frame-level and Cue-level Features . . . . . . . . . . . . . . 104
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4.1 Genre Prediction . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4.2 Musical Feature Relevance Scoring . . . . . . . . . . . . . . 106
6.4.3 Musical-Visual Cross-Modal Analysis . . . . . . . . . . . . . 108
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Conclusions and Future Work 112
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.1 Context-Aware Music Generation . . . . . . . . . . . . . . . 113
7.2.2 Musical Feature Engineering Using Domain Knowledge . . . 114
Reference List 115
List of Tables
2.1 Proposed Model Parameters . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Datasets used and statistics for pre-training (above solid line) and
fine-tuning (below solid line) . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Acoustic features of music extracted by Librosa . . . . . . . . . . . 23
2.4 Parameter settings for downstream tasks . . . . . . . . . . . . . . . 24
2.5 Results of a genre classification task on the GTZAN dataset . . . . 25
2.6 Results of an auto-tagging task on the MTG-Jamendo dataset . . . 27
2.7 Results of a genre classification task on the Extended Ballroom
dataset. ∗ indicates that the model evaluates on different subsets
of the dataset than our work and hence numbers are not directly
comparable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.8 Results of a music emotion recognition task on the DEAM dataset.
∗ indicates that the model evaluates on different subsets of the dataset
than our work and hence numbers are not directly comparable. . . . 28
2.9 Results of an instrument detection task run on the RWC Instrument
dataset. ∗ indicates that the model evaluates on different subsets
of the dataset than our work and hence numbers are not directly
comparable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.10 Performance on MTG Autotagging with Ablation Study . . . . . . 29
2.11 Performance on MTG-Jamendo Mood-Theme Autotagging with
Different Starting Feature Sets . . . . . . . . . . . . . . . . . . . . . 30
3.1 Complex chords in UkuTabs, and their conversions after casting.
The left column has the unconverted complex chords and the right
column contains the corresponding basic chord. . . . . . . . . . . . 41
3.2 Statistics of chords- and lyrics-aligned dataset. . . . . . . . . . . . . 42
3.3 Number of songs listed in five Billboard charts that were also found
in the UkuTabs dataset. Pop was the most prominent genre, and
328 songs were labeled as “pop” only. The number of “crossover”
songs, or songs listed in more than one Billboard chart, are entries
in the off-diagonals. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 A list of models used for multi-label genre classification and their
performance on three metrics. The Chords & Lyrics model performs
best by all three metrics: the harsh exact match ratio (EMR),
label accuracy, and label-based, micro-averaged F1-score. Asterisks
indicate significance at the .05 level, when using a 2-sample t-test
with the best performing baseline in that category. . . . . . . . . . 45
3.5 Our model outperforms language models of chords or lyrics only, and
the models that use embeddings outperform n-gram models. These
BERT embeddings were tuned on the final dataset. Asterisks indicate
significantly better performance at the .05 level when compared to the
Average Baseline; daggers indicate significantly better performance
at the .05 level when compared to the best baseline. . . . . . . . . . 51
4.1 Auditory Features Used . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Feature Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 STG - Test RMSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Heschl’s Gyrus - Test RMSE . . . . . . . . . . . . . . . . . . . . . . 70
4.5 SCR - Test RMSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Heart Activity - Test RMSE (Entries × 10⁻¹) . . . . . . . . . . . . 71
4.7 Reported Emotion - Test RMSE . . . . . . . . . . . . . . . . . . . . 72
5.1 Auditory features used and their feature type . . . . . . . . . . . . 86
5.2 Reported Emotion, Sad Short Song — Validation RMSE . . . . . . 88
5.3 Reported Emotion, Sad Long Song — Validation RMSE . . . . . . 88
5.4 Reported Emotion Happy Song — Validation RMSE . . . . . . . . 89
5.5 Reported Enjoyment, Sad Short Song — Validation RMSE . . . . . 89
5.6 Reported Enjoyment, Sad Long Song — Validation RMSE . . . . . 89
5.7 Reported Enjoyment Happy Song — Validation RMSE . . . . . . . 89
6.1 A breakdown of the 110 films in our dataset. Only 33 of the films
have only one genre tag; the other 77 films are multi-genre. A list of
tags for every movie is given in Appendix S1. . . . . . . . . . . . . 99
6.2 Auditory features used and feature type. . . . . . . . . . . . . . . . 101
6.3 The six pooling functions, where x_i refers to the embedding vector
of instance i in a bag set B and k is a particular element of the
output vector h. In the multi-attention equation, L refers to the
attended layer and w is a learned weight. The attention module
outputs are concatenated before being passed to the output layer.
In the feature-level attention equation, q(·) is an attention function
on a representation of the input features, u(·). . . . . . . . . . . . . 104
6.4 Classification results on the 110-film dataset using VGGish features.
Five-fold cross validation mean and standard deviation on the macro-
averaged metrics for each model are reported. IMV stands for
Instance Majority Voting; FL Attn for Feature-Level Attention.
Simple MI and IMV results represent performance with the best
base classifier (kNN and SVM, respectively). . . . . . . . . . . . . . 106
6.5 Mean standardized brightness and contrast (×10¹) across all cues
for each genre label source (actual labels, all predictions, and false
positives only). Bold values are statistically different from the mean
(p < 0.01). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
List of Figures
2.1 M3BERT pre-training and fine-tuning. During pre-training,
the M3BERT transformer layers are updated and we use a Huber
Loss between the reconstructed signal and the original signal. During
fine-tuning, the M3BERT layers are frozen, and multi-task learning
is used to enrich the output representations. . . . . . . . . . . . . . 18
2.2 M3BERT Architecture. M3BERT has L layers which use multi-
head attention and normalization, similar to BERT’s architecture. . 19
2.3 Multi-task learning on a sample from a batch. For this sample,
there are only labels for the MTG-Jamendo task, so the weights for
other tasks are frozen, as is M3BERT. We use cross-entropy loss for
our classification tasks. . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 A workflow for another sample from the same batch. For
this sample, there are only labels for the DEAM task, so the weights
for other tasks are frozen, as is M3BERT. Weights are updated at
the end of the batch. We use MSE loss for this regression task. . . . 20
2.5 Centroid and cell activation. Certain outputs from the M3BERT
encoder correlate highly with auditory phenomena, like spectral
centroid. Pearson’s ρ =.831 between these two features. . . . . . . 30
2.6 Harmonicity and cell activation. Interpretable auditory fea-
tures like harmonicity were also correlated with certain outputs from
M3BERT’s encoder. The encoder is creating high-level representa-
tions that are not necessarily based on frequency, as in this case.
Pearson’s ρ =.823 between these two features. . . . . . . . . . . . . 31
3.1 Embedding chords and lyrics. Each lyric and chord predicts its
lyric and chord context. Here, the F minor chord (Fm) predicts
chords around it (denoted by dashed lines) and lyrics that are sung
during and around the F minor chord (denoted by solid and dotted
lines). The F minor chord is aligned with the lyric “greatest” because
they are played and sung at the same time, respectively. . . . . . . 39
3.2 A wordcloud of artists. A larger font size indicates that the artist
is featured more prominently in the dataset. . . . . . . . . . . . . . 40
3.3 Screenshot from UkuTabs showing a song excerpt. The “x4”
indicates that the bridge section is repeated three times. . . . . . . 41
4.1 SCR Plot of one listen of sad short song. SCR spikes seem to occur
after the entrance of new instruments or the start of a musical
crescendo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 SCR Plot of two listens of the happy song. When the KDE is
constructed with a bandwidth of 8 seconds, the plots are similar. . . 66
4.3 Prediction vs. actual value of the ISPS in the right STG in the sad
long song. The predictions were made on the first 20% of the song. 74
4.4 Emotion ratings for the sad short song. The signal is not stationary,
so models that use information of previous labels, like TPA, will
likely perform best. . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.1 The ScoreStamper pipeline. A film is partitioned into non-overlapping
five-second segments. For every segment, Dejavu will predict if a
track in the film’s soundtrack is playing. Cues, or instances of a
song’s use in a film, are built by combining window predictions. In
this example, the Cantina Band cue lasts for 15 seconds because it
was predicted by Dejavu in two nearby windows. . . . . . . . . . . . 100
6.2 Neural network model architecture. . . . . . . . . . . . . . . . . . . 103
6.3 Feature importance by genre and feature group. . . . . . . . . . . . 108
Abstract
With the ever-burgeoning market for music, film, television, and other consum-
able media, it has become supremely important to study human music listening and
multi-modal experiences. Advances in computational approaches offer new ways to
understand music content, and how it is experienced—both as a standalone medium
and in context with other forms of media—in nuanced ways. This dissertation work
identifies novel methods for representing music in a multi-modal, context-aware
fashion; we study the interaction between lyrics and chords in music, which has
potential applications in multimodal perception, music information retrieval, music
emotion recognition, and music recommendation systems. We show that using a
multi-view approach to music analysis opens up new avenues for studying music
perception, and we show that loss function approaches can improve upon state-of-
the-art methods for music genre classification. We also represent music using an
embedding scheme and fine-tune this representation on several downstream tasks.
Lastly, we investigate cross-cultural music perception, either on its own or when
consumed in conjunction with other media, such as advertisements and movies. By
creating cross-modal, context-aware representations of music, we can meaningfully
capture music and media-related perception, a boon to researchers in affective
computing, music information retrieval, and automatic music tagging.
Chapter 1
Introduction
Creating representations of music for downstream tasks is a challenging (and
interesting) research question. Music itself is a tremendously robust and versatile
medium, as it is capable of imparting overt or subtle human experiences, depending
on whether subjects are simply listening to music or listening to it as they perceive
another event simultaneously. Indeed, much of music’s power comes from its ability
to provide cues and context to its listeners: an audience member knows when an
antagonist has entered on-screen when a group of cellos drone; an opera-goer better
understands the intricacies of a star-crossed love affair when she hears Romeo sing
an aria to Juliet; a high school student attends junior prom and dances with her
friends to a particular song that will prove to be nostalgic to her decades in the
future. All of these instances show the pervasiveness, versatility, and raw emotional
power of music.
In spite of music’s utility in supporting these human experiences (or maybe
because of it), music perception has proven to be a difficult task. Music is nearly
a universally enjoyed art form, but listeners often respond to it in tremendously
different ways. The same song can bring one person great joy and another deep
sorrow. Leaving aside music emotion recognition tasks, automatically classifying
music poses its own great challenge. Furthermore, there
are no “gold-standard” labels for classification, and no one institution that can
canonically define how music should be tagged. Additionally, social media platforms
and music streaming services have made music and markets from cultures all over
the world even more accessible, which affords many musicians a chance to create
music that is “cross-over”; that is, belonging to many genres at once. An open
research challenge is to categorize and typify songs, which may belong
to more than one genre or tag.
Music listening is a subjective, complex, and multimodal human experience.
Lyrics, harmonies, and timbral qualities of a musical passage can separately trigger
different human responses. Although preferences for certain types of music differ
across cultures, music itself is almost universally enjoyed. An important question
in musical perception is how we can capture
information that is contained in and cued by music.
Music gives cues for movies, dance, rites of passage, and many other human
experiences (Maus, 1991). Its auditory nature allows more enjoyable and memorable
affective experiences, as it can serve either a primary or a secondary role.
Because of its ubiquity and compatibility with other forms of media, it is partic-
ularly important to develop methods to represent music. Qualitatively identifying
musical features can be helpful for studying music and its effects, and domain
knowledge can inform quantitative study of music (Greer et al., 2019). Recently,
deep learning representations of music have performed competitively on down-
stream tasks (Zhao & Guo, 2021a). The music information retrieval community
has explored the space of emotion recognition and automatic tagging, through its
study of quantifying and capturing music listening experiences, automatic playlist
generation (Logan, 2002), and music recommender systems (Schedl et al., 2015).
Speech-related features have also been successful in extracting musical informa-
tion for a variety of tasks (Coutinho et al., 2014; De Cheveigné & Kawahara, 2002).
MFCCs have shown utility for music-related tasks (Choi et al., 2017), and raw
audio techniques, such as time-dilated, convolutional neural nets have also been
useful (Oord et al., 2016). However, speech features alone should not suffice in the
study of music; to wit, many frequency bands in music (such as deep bass notes)
are not present in speech signals. One part of this thesis will show that applying
domain knowledge to music processing can introduce a novel view into studying
music listening experiences. Musical features and their emergent properties can
enable better capturing of musical information (Greer & Narayanan, 2019). We
will show that it is possible to classify and describe music and other media, such as
film and ads. Indeed, learning cross-modal, context-aware representations of music
allows us to meaningfully capture music and media-related perception.
1.1 Music Representations
Music representations can take on many forms, as music is a complex, multimodal
medium. Representations can be based on the lyrics sung during a song (a textual
medium), the harmonies that create the tonal framework of the song (an aural
medium), or the metadata of a musical piece (a descriptive medium). By studying
both the auditory modes of music and the extra-auditory modes of music (lyrics,
chord representations, meta-information, etc.) we can use those representations to
study and report on descriptions of music.
Audio-based representations of music can be sub-classified into different musical
types as well. For example, we can represent music by its spectral components,
which indicate the frequencies in the musical signal and their magnitudes. We can
also extract rhythmic features, dynamic features, timbral features, and any other
number of features that may well-represent music. A comprehensive study of music
perception and understanding requires extracting and using a combination of many
of these sub-classes; these feature sets are powerful on their own, but even more
powerful when used in tandem with each other.
1.2 Music Perception
1.2.1 Musical Genre Classification
Music is a complex, multifaceted, multimodal, perceptual experience that is
challenging to analyze. Can a musical structure bestow a new quality to lyrics?
How do we perceive music when we hear an instrumental passage and when we hear
that same passage with lyrics? How would we categorize a song with lyrics related
to pop music, but with chords steeped in the R&B tradition? We use techniques in
natural language processing (NLP) to address these questions in this thesis.
Many different ways to study music perception and judgment exist. Music
emotion is one such lens through which to look at human perception of song.
Another way to study music perception is through categorization: it is necessary
to make a judgment (perception) about a stimulus (music) in order to classify the
genre of a song. We use a music genre classification task as a way to investigate how
humans perceive musical stimuli. One interesting question concerns the cross-over
and multi-attribute aspects of musical genre that do not necessarily fit clearly into
one category. Many popular songs contain musical elements from disparate yet
complementary genres (as an example, take the eclectic 2019 hit “Old Town Road”
by Lil Nas X, which was produced like a hip-hop track but contains country-like
lyrics.) Among other things, studies in music perception have implications for how
music content is marketed and consumed.
We hypothesize that learning shared representations—that is, embedding words
from lyrics and chords in a shared vector space—captures how chord progressions
and lyrics affect each other. We create a genre classification task using Billboard
listings¹ to show the utility of these shared embeddings in predicting musical genre.
1.2.2 Music Emotion Recognition
Music is universally enjoyed perhaps because of the power it has to move us.
Listening to music can boost our mood, give us chills, and even make us cry. We
focus on three aspects of the complex human experience of music listening: neural
(how the brain responds to music), physiological (how the body responds to music),
and emotional (how people report happiness or sadness during music listening).
By computationally analyzing how music influences these three modes, we present
a more complete picture of music’s role in human experience. We apply a set of
multivariate time series (MTS) prediction models to neural, physiological, and
subjective responses, using auditory features as predictors, so we can study music
with a multi-modal, multi-view approach. We compare these models and comment
on what auditory features are important for these predictions. We hypothesized
that attention models, which nonlinearly synthesize aural information from previous
points in a song, would be best at predicting these responses.
There are many open research questions in music perception. For example,
how can we model phase synchronizations in the brain during instrumental music
listening? Is it possible to correlate other involuntary processes, like galvanic skin
response and heart rate, with musical features? Can human responses be modeled
using multi-modal autoregressive models? Can we predict affective response at a
highly granular level in order to pinpoint specific times in a song that elicit an
emotional response?
¹ https://www.billboard.com/charts
We use MTS prediction models to answer some of these open questions. We
measure many subjects to get a better understanding of how humans react to
music at various bio-behavioral levels. Finally, we use machine learning models
with regularization and attention to identify particular features in music that are
relevant to subjective human-reported emotion responses.
We also investigate how lyrics and chords may be used to predict human response
to music listening, a multimodal perceptual, cognitive and affective phenomenon.
Can sad lyrics neutralize the effect of an otherwise happy-sounding instrumental
passage? Can particular musical structures pair with a set of lyrics to bring about
different human responses? The work presented in this dissertation seeks to answer
some of these questions.
There are many different facets of, and lenses through which to study, music
listening experience. Delving into emotional response to music is one such lens
through which to look at this human experience. In this work, we use an exemplary
music emotion classification task to investigate how humans process these complex
musical stimuli. This work has implications for how music content is marketed,
consumed, and perceived.
As in musical genre classification, we hypothesize that embedding words from
lyrics and chords in a shared vector space captures how chord progressions and lyrics
affect each other and are useful for music analysis. We show results from a genre
classification task using Billboard listings (https://www.billboard.com/charts).
With this task, we reveal the utility of using shared embeddings to study music
listening experiences.
1.2.3 Music’s Use Across Different Contexts
Music plays a crucial role in the experience and enjoyment of other media, such
as film. While the narrative of movie scenes may be driven by non-musical audio
and visual information, a film’s music often carries a significant impact on audience
interpretation of the director’s intent and style. Musical moments may complement
the visual information in a film; other times, they flout the affect conveyed in film’s
other modalities (visual, linguistic). In every case, however, music influences a
viewer’s experience in consuming cinema’s complex, multi-modal stimuli. Analyzing
how these media interact can provide filmmakers and composers insight into how
to create particular holistic cinema-watching experiences.
We hypothesize that musical properties, such as timbre, pitch, and rhythm,
achieve particular stylistic effects in film and are reflected in the display and experi-
ence of a film’s accompanying visual cues, as well as its overall genre classification.
In this study, we characterize differences among movies of different genres based on
their film music scores. While this chapter focuses on how music is used to support
particular cinematic genres, created to engender particular film-watching experi-
ences, this work can be extended to study other multi-modal content experiences,
such as viewing television, advertisements, trailers, documentaries, music videos
and musical theatre.
1.3 Scope of the Thesis
In this thesis, it is shown that uncovering multi-modal, context-aware repre-
sentations of music can capture meaningful information contained and expressed
in human experiences, such as movie-going and music listening. Music is a rich,
complex medium, and through the careful identification and quantification of its
myriad features, we can better study music perception tasks. Such areas of study are
of great interest to the music information retrieval community, affective computing
researchers, and even music curators.
We demonstrate that in leveraging information from modalities that are con-
tained in and represented by music, we can better tackle downstream tasks such
as music emotion recognition and genre classification. While sometimes, these
modalities contain overlap, other times, these modalities can be complementary. We
show a task in which we represent chords and lyrics of music in a joint embedding
space, and show that music listening responses are better predicted using these
symbolic representations of music in conjunction with auditory features.
On the way, we show that the context in which a song is heard is paramount to
the way it is perceived. By curating and studying a dataset of music annotators
from all over the world, we see marked differences in expression of arousal along
several demographic dimensions, such as musical experience, age, and cultural
upbringing. These findings emphasize the need for robust systems to study music
perception.
Lastly, we show that by identifying and capturing the context in which a song is
traditionally heard, it is possible to capture music listening experiences and larger-
scale experiences, such as media-watching. Loss function approaches to music genre
classification are enabled through such techniques, and music is shown to support
film perception, especially in conjunction with other cues, such as visual contrast
and brightness. In the process of representing music in a context-aware, multi-modal
fashion, we can meaningfully capture music- and media-related perception.
Chapter 2
Creating Novel Representations
of Music for Downstream Tasks
2.1 Introduction
A strong performance on specific downstream application tasks, such as genre
and emotion recognition, is paramount to ensuring broad capabilities for computa-
tional music understanding and Music Information Retrieval (MIR). Traditional
approaches have relied on supervised learning to train models to support these
music-related tasks. However, such approaches require copious data, including data
annotations, and may only provide insight into one view of music—namely, that
related to the specific task at hand. We present a new model for supporting music
understanding that leverages self-supervision and cross-domain learning. After pre-
training using masked reconstruction and self-attention bi-directional transformers,
the model is fine-tuned using several downstream music understanding tasks. The
results show that our model, which we call M3BERT, generates features that result
in better performance on several music-related tasks, indicating the potential of
self-supervised and semi-supervised learning approaches toward a more generalized
and robust computational approach to modeling music. Our work can offer a
starting point for many music-related modeling tasks, with potential applications
in learning deep representations and enabling robust end technology applications.
The amount of consumable music has been growing rapidly over the past decades.
As an effective way of utilizing such massive music content, automatically providing
high-level descriptions of music (like genre, emotion, and theme) is a useful avenue
that is widely pursued within the MIR community (Bu et al., 2010;
Zhang et al., 2018). Prior approaches have relied largely on supervised learning
models (Ghosal & Kolekar, 2018; Hung et al., 2019; Koutini et al., 2019; Pons et
al., 2017), which are trained on human-annotated music datasets. However, the
performance of supervised learning is inherently limited by the size and scope of
labeled music datasets, which can be prohibitively expensive and time-consuming
to collect and organize, and to generalize to new contexts and tasks. Recently,
self-supervised pre-training models (Ling et al., 2020; Liu et al., 2020; Song et
al., 2019; Wang et al., 2020), particularly Bidirectional Encoder Representations
from Transformers (BERT), have been used extensively in the field of Natural
Language Processing (NLP). BERT involves learning representations of language
by reconstructing masked input sequences in pre-training. The intuition behind this
design is that a model that can recover missing content of an input has, in effect,
learned a robust contextual representation of the input. BERT and its variants (Liu
et al.; Yang et al., 2019; Zhang et al., 2019) have achieved significant improvements
on various NLP benchmark tasks (Wang et al., 2018). Compared to the text domain
(whose inputs are discrete word tokens), inputs are usually multi-dimensional feature
vectors in the audio (acoustics) domain: continuous and smoothly changing over
time. Therefore, some particular designs have been introduced to bridge the gap
between the original BERT model (which is trained on text) and audio-based
transformer models (which are trained on acoustic frames). In order to do this
for our domain of music audio, we use Contiguous Frame Masking (CFM) and
Contiguous Channel Masking (CCM), as proposed in Zhao & Guo (2021b). This
model learns powerful acoustic music representations through pre-training. Finally,
in order to adjust our model’s output representations for applications in downstream
tasks, we fine-tune M3BERT on several supervised music information retrieval
relevant tasks at once. Because of the variety of the possible downstream tasks in
the MIR community, creating a representation of music that is adaptable to diverse
end tasks is important for model generalization and robustness. We use a multi-task
learning approach to fine-tune the transformer-generated representations, ensuring
that they are useful for broader music understanding. Our paper’s contributions
are summarized below.
1. We present a new self-supervised pre-training model named M3BERT.
M3BERT builds upon the structure of multi-layer bidirectional self-attention trans-
formers; rather than relying on massive human-labeled data, this model can learn a
powerful music representation from a variety of unlabeled music data. The model
used in our experiments is based on data from 4,281 hours of music across four
large and diverse music datasets.
2. We present two pre-training objectives for M3BERT. Previous ablation
studies have shown that a combination of CFM and CCM objectives can effectively
improve the performance of an audio-based transformer in pre-training (Zhao &
Guo, 2021b). In this work, we apply a patch-masking objective and compare these
two objectives in maximizing downstream performance.
3. We fine-tune our model on five diverse downstream tasks which span popular
areas of research in MIR: genre classification, mood and theme detection, music
emotion recognition (MER), and instrument classification. The final model out-
performs other pre-trained models on these music-related tasks. The success of
M3BERT indicates the potential for applying transformer-based masked recon-
struction pre-training (with subsequent multi-task enrichment) within the MIR
field.
4. We conduct a correlational analysis with our encoder outputs, identifying
certain cell activations that are similar to interpretable high-level audio features.
This demonstrates that transformer models can generate features that are potentially
human-understandable, adding to their appeal as tools for music understanding and
representation.
2.1.1 Related Work
Transformer Models
In the past few years, pre-trained models and self-supervised representation
learning have yielded great success on NLP tasks. Many self-supervised pre-trained
models based on multi-layer self-attention transformers (Vaswani et al., 2017), such
as BERT (Devlin et al., 2018a), GPT (Radford et al., 2018), XLNet (Yang et al.,
2019), and Electra (Clark et al., 2020), have been used effectively. BERT is perhaps
the most popular model due to its simplicity and outstanding performance across
a variety of tasks. BERT reconstructs masked input sequences in its pre-training
stage; through reconstruction, the model learns a powerful contextual representation
of its input. More recently, the success of BERT in NLP has drawn attention
from researchers in acoustic signal processing. Some pioneering works (Baevski et
al., 2019; Ling et al., 2020; Liu et al., 2020; Song et al., 2019; Wang et al., 2020)
have shown the effectiveness of adapting BERT to Automatic Speech Recognition
(ASR). By designing pre-training objectives specific to the audio modality, it is
possible to adapt BERT-like models to music and other audio domains. In vq-
wav2vec (Baevski et al., 2019), input speech audio is first discretized to a K-way
quantized embedding space by learning discrete representation from audio samples.
However, the quantization process requires heavy computing resources and runs
counter to the continuous nature of acoustic frames. Some works (Chi et al.,
2021; Ling et al., 2020; Liu et al., 2020; Song et al., 2019; Wang et al., 2020)
designed a modified version of BERT that directly utilized continuous speech.
In Chi et al. (2021), Ling et al. (2020), and Liu et al. (2020), continuous frame-
level masked reconstructions were adapted in a BERT-like pre-training stage.
In Wang et al. (2020), SpecAugment (Park et al., 2019) was applied to mask input
frames, and Ling et al. (2020) learned by reconstruction after shuffling acoustic
frame orders rather than masking frames. Within the MIR realm, representation
learning has been popular for many years. Several convolutional neural network-
(CNN-) based supervised methods (Choi et al., 2017; Ghosal & Kolekar, 2018;
Hung et al., 2019; Koutini et al., 2019; Pons et al., 2017) have been proposed for
various music understanding tasks. These usually employ convolutional layers on
Mel-spectrogram-based representations or raw waveform signals of music audio
to learn effective music representations, and append fully connected layers to
predict relevant annotations such as music genres or moods. However, training
CNN-based models usually requires large datasets with reliable and consistent
human-annotated labels. Carmon et al. (2019); Hendrycks et al. (2019) showed that
using self-supervision on unlabeled data can significantly improve model robustness.
More recently, self-attention transformers have shown promising results in music
generation. For example, the Music Transformer (Huang et al., 2018) and Pop
Music Transformer (Huang & Yang, 2020) employed relative attention to capture
long-term structure from music MIDI data; however, compared with raw music
audio, the size of existing MIDI datasets is limited. Transcription from raw audio
to MIDI files is time-consuming and often not accurate, necessitating a transformer
system that accepts (continuous) audio input.
Multi-Task Learning
Multi-task learning (MTL) is an approach that involves assigning several tasks
to a model to train on simultaneously (Caruana, 1997). This approach has been
used to great extent in several music-related tasks, including frequency estima-
tion (Bittner et al., 2018), source separation (Hung & Lerch, 2020) and instrument
detection (Hung et al., 2019). Ideally, a system that can accomplish several music
tasks simultaneously would be highly desirable for music research, providing a “first
stop shop” to researchers attempting various tasks related to music understanding
and MIR.
In this work, we propose M3BERT, a universal music-acoustic encoder based
on transformers and multi-task learning. M3BERT is first pre-trained on massive
unlabeled music datasets, and then fine-tuned using an MTL approach on specific
downstream music annotation tasks using labeled data.
2.2 M3BERT Model
A universal transformer-based encoder named M3BERT is presented for music
representation learning. The system overview of the proposed M3BERT model is
shown in Fig. 2.2.
2.2.1 Transformer Encoder
A multi-layer bidirectional self-attention transformer encoder (Devlin et al.,
2018a; Vaswani et al., 2017) is used to encode the input music frames. Specifically,
an $L$-layer transformer is used to encode the input vectors $X = (x_i)_{i=1}^{N}$ as
$H^{l} = \mathrm{Transformer}_l(H^{l-1})$, where $l \in \{1, 2, \ldots, L\}$, $H^{0} = X$, and $H^{L} = [h_{1}^{L}, \ldots, h_{N}^{L}]$. We use
the hidden vector $h_{i}^{L}$ as the contextualized representation of the input token $t_{i}$.
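To make the layer-wise recursion concrete, the sketch below builds a comparable stack with PyTorch's built-in transformer encoder layers. The dimensions (324 input features, 768 hidden units, 4 layers, 12 heads, matching M3BERTSmall) are taken from later sections; the module names and the input projection are illustrative assumptions rather than the exact M3BERT implementation.

```python
import torch
import torch.nn as nn

class TinyM3BERTEncoder(nn.Module):
    """Minimal L-layer bidirectional self-attention encoder (illustrative sketch)."""
    def __init__(self, input_dim=324, hidden_dim=768, num_layers=4, num_heads=12):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, hidden_dim)   # map acoustic frames to H^0
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # x: (batch, N, input_dim) acoustic frames
        h0 = self.input_proj(x)              # projected input plays the role of H^0
        return self.encoder(h0)              # H^L: contextualized frame representations

# usage: two 30-second clips of 1,294 frames with 324 features each
frames = torch.randn(2, 1294, 324)
h_L = TinyM3BERTEncoder()(frames)            # shape (2, 1294, 768)
```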
2.2.2 Pre-training and Training Objectives
The main idea of masked reconstruction pre-training is to perturb inputs by
randomly masking tokens with some probability and then using the model to
reconstruct these masked tokens at the output. Intuitively, this is similar to
dropout (Srivastava et al., 2014), in which certain features and/or layers in a
neural network are set to zero in order to prevent overfitting. In the pre-training
process, a reconstruction module which consists of two feed-forward layers with
GeLU activation (Hendrycks & Gimpel, 2016) and layer-normalization (Ba et al.,
2016) is appended to the encoder-decoder architecture to predict the masked inputs.
The multi-task system then uses the output of the last M3BERT encoder layer as
its input.
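As a sketch of the reconstruction module described above (two feed-forward layers with GeLU activation and layer normalization), under the assumption that it maps the encoder's hidden vectors back to the 324-dimensional input features; the exact layer ordering in M3BERT may differ.

```python
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Two feed-forward layers with GeLU and layer normalization (illustrative sketch)."""
    def __init__(self, hidden_dim=768, feature_dim=324):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, feature_dim),   # predict the masked input frames
        )

    def forward(self, h):
        # h: (batch, N, hidden_dim) encoder outputs -> (batch, N, feature_dim) reconstructions
        return self.net(h)
```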
Several pre-training objectives are presented to enable M3BERT to learn music
representations.
Objective 1: Contiguous Frames Masking (CFM)
To prevent the model from exploiting local smoothness of acoustic frames,
we mask spans of consecutive frames dynamically. Given a sequence of input
frames $X = (x_1, x_2, \ldots, x_n)$, we select a subset $Y \subset X$ by iteratively sampling
contiguous input frames (spans) until the masking budget (in this case, 15% of $X$)
has been spent. At each iteration, a span length is first sampled from the geometric
distribution $l \sim \mathrm{Geo}(p)$. Then, the starting point of the masked span is randomly
selected. We set $p = 0.2$, $l_{\min} = 2$, and $l_{\max} = 7$.¹ In each masked span, the frames
are masked according to the following policy:
1) With 70% probability, replace all frames with zero. Since each dimension of
input frames is normalized to have zero mean, setting the masked value to zero is
equivalent to setting it equal to the mean.
2) Replace all frames with a random masking frame with 20% probability
(mutually exclusive from 1).
3) Keep the original frames unchanged in the remaining cases (this happens
10% of the time). Since M3BERT will only receive acoustic frames without masking
during inference time, this policy allows the model to receive real inputs during
pre-training, resolving the pre-train/fine-tune inconsistency problem (Devlin et al.,
2018a).
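A minimal NumPy sketch of the span sampling and 70/20/10 masking policy just described; details such as random-number handling and span bookkeeping are assumptions, not the reference implementation.

```python
import numpy as np

def contiguous_frame_mask(frames, budget=0.15, p=0.2, l_min=2, l_max=7, seed=None):
    """Contiguous Frame Masking sketch. `frames` is (num_frames, num_features)."""
    rng = np.random.default_rng(seed)
    out, n = frames.copy(), len(frames)
    masked = np.zeros(n, dtype=bool)
    while masked.sum() < budget * n:                          # spend the 15% masking budget
        span = int(np.clip(rng.geometric(p), l_min, l_max))   # span length ~ Geo(p), clipped
        start = int(rng.integers(0, n - span))                # random starting point
        masked[start:start + span] = True
        roll = rng.random()
        if roll < 0.7:                                        # 70%: zero out (== feature mean)
            out[start:start + span] = 0.0
        elif roll < 0.9:                                      # 20%: replace with a random frame
            out[start:start + span] = frames[rng.integers(0, n)]
        # remaining 10%: keep the original frames unchanged
    return out, masked
```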
Objective 2: Contiguous Channels Masking (CCM)
The intuition of channel masking is that a model that can predict the partial loss
of channel information has learned a high-level representation of such channels. For
log-mel spectrum and log-CQT features, a block of consecutive channels is randomly
masked to zero for all time steps across the input sequence of frames. Specifically,
the number of masked channels, $c$, is first sampled uniformly from $\{1, \ldots, H\}$, where
$H$ is the total number of channels (in our case, 272). Then a starting
channel index $h$ is sampled uniformly from $\{1, \ldots, H-c\}$ and the channels $h, \ldots, h+c$
are masked.
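A corresponding sketch of CCM, operating on the block of log-mel and log-CQT columns described above; how those columns are laid out within the full feature matrix is an assumption left to the caller.

```python
import numpy as np

def contiguous_channel_mask(spec, seed=None):
    """Contiguous Channel Masking sketch. `spec` is the (num_frames, 272) block of
    log-mel and log-CQT columns; a random contiguous block of channels is zeroed
    for every time step."""
    rng = np.random.default_rng(seed)
    out = spec.copy()
    H = spec.shape[1]
    c = int(rng.integers(1, H + 1))        # number of masked channels, sampled from 1..H
    h = int(rng.integers(0, H - c + 1))    # starting channel index
    out[:, h:h + c] = 0.0                  # mask the block across all frames
    return out, (h, c)
```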
¹ The corresponding mean span length is around 3.87 frames (179.6 ms). Other schemes were also tried (variable lengths with different averages, constant lengths, etc.), but this scheme yielded the highest performance on downstream tasks.
Objective 3: Patch Masking (PM)
Often, music can be tremendously dynamic and fast-changing. For this reason, it
can be prohibitively difficult to predict contiguous frames of features as in Objective
1, particularly over a long span of music. Prior work in audio-based transformers
has proposed patch-masking (Li et al., 2021), which involves masking a square
set of features and timesteps. In the patch masking paradigm, squares of equal
size are sampled with replacement until 15% of the input matrix is masked (see
Fig. 2.1). We use this policy in comparison with CCM and CFM, as PM provides a
less stringent alternative to contiguous masking of entire feature sets and timesteps.
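A sketch of the patch-masking routine as described (equal-sized square patches sampled with replacement until 15% of the matrix is covered); the patch size used here is an assumed value, not the one used in the experiments.

```python
import numpy as np

def patch_mask(frames, patch=16, budget=0.15, seed=None):
    """Patch Masking sketch: zero square time-feature patches until ~15% of the
    (num_frames, num_features) input matrix is masked."""
    rng = np.random.default_rng(seed)
    out = frames.copy()
    covered = np.zeros(frames.shape, dtype=bool)
    n_frames, n_feats = frames.shape
    while covered.mean() < budget:                       # patches sampled with replacement
        t = int(rng.integers(0, n_frames - patch + 1))   # top-left corner (time axis)
        f = int(rng.integers(0, n_feats - patch + 1))    # top-left corner (feature axis)
        covered[t:t + patch, f:f + patch] = True
        out[t:t + patch, f:f + patch] = 0.0
    return out, covered
```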
Pre-training Objective Function
$$\mathrm{Huber}(x,y) = \begin{cases} 0.5\,|x-y|^{2} & \text{if } |x-y| < 1 \\ |x-y| - 0.5 & \text{otherwise} \end{cases} \qquad (2.1)$$
We use Huber loss (Girshick, 2015) to minimize the reconstruction error between
masked input features and the corresponding encoder output. Huber loss is a robust
$\ell_1$ loss that is less sensitive to outliers (Wen et al., 2019). Additionally, Zhao &
Guo (2021b) found that using Huber loss made training converge faster than $\ell_1$ loss.
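For reference, Eq. 2.1 is the standard smooth-L1 form of the Huber loss; a minimal PyTorch sketch, applied to the masked positions as described above:

```python
import torch

def huber_reconstruction_loss(pred, target, mask):
    """Eq. 2.1: 0.5*(x - y)^2 if |x - y| < 1, else |x - y| - 0.5,
    averaged over the masked frames (equivalent to torch.nn.SmoothL1Loss)."""
    diff = (pred - target).abs()
    loss = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return loss[mask].mean()
```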
Figure 2.1: M3BERT pre-training and fine-tuning. During pre-training, the
M3BERT transformer layers are updated and we use a Huber Loss between the
reconstructed signal and the original signal. During fine-tuning, the M3BERT layers
are frozen, and multi-task learning is used to enrich the output representations.
M3BERT Model Parameters
We report experimental results on two models: M3BERTSmall and
M3BERTLarge. Model settings are listed in Table 2.1. The number of trans-
former block layers, the size of hidden vectors, and the number of self-attention
heads are represented as $L_{\mathrm{num}}$, $H_{\mathrm{dim}}$, and $A_{\mathrm{num}}$, respectively.
Figure 2.2: M3BERT Architecture. M3BERT hasL layers which use multi-head
attention and normalization, similar to BERT’s architecture.
Figure 2.3: Multi-task learning on a sample from a batch. For this sample,
there are only labels for the MTG-Jamendo task, so the weights for other tasks are
frozen, as is M3BERT. We use cross-entropy loss for our classification tasks.
Figure 2.4: A workflow for another sample from the same batch. For this
sample, there are only labels for the DEAM task, so the weights for other tasks are
frozen, as is M3BERT. Weights are updated at the end of the batch. We use MSE
loss for this regression task.
Table 2.1: Proposed Model Parameters
Model L_num H_dim A_num #Parameters
M3BERTSmall 4 768 12 29.3M
M3BERTLarge 8 1,024 16 93.1M
2.3 Experiment Setup
2.3.1 Dataset Curation and Preprocessing
As shown in Table 2.2, the pre-training data were aggregated from four different
datasets: Music4All (Santana et al., 2020), FMA-Large (Defferrard et al., 2016),
MTG-Jamendo (Bogdanov et al., 2019), and Million Song Dataset (Bertin-Mahieux
et al., 2011). Both the Music4all and FMA-Large datasets provide 30-second audio
clips in mp3 format for each song. The MTG-Jamendo dataset contains 55,700
musical tracks, each with a duration of at least 30s. Since the maximum sequence
length of M3BERT is set to 1,294 (30s), music tracks exceeding this length are split
up into 30-second chunks and treated as different samples.²
² If a song is more than 30s long but less than 60s long, it is split up into two equal parts
without overlap. This ensures that every example is at least 15s long and no more than 30s long.
Table 2.2: Datasets used and statistics for pre-training (above solid line) and
fine-tuning (below solid line)
Task Dataset # Examples Duration (hr)
Pre-training Music4All 109.2K 908.7
Pre-training FMA-Large 106.3K 886.4
Pre-training MTG-Jamendo 55.7K 464.2
Pre-training Million Song Dataset 242.7K 2023.0
Genre Classification GTZAN 1K 8.3
Genre Classification Extended Ballroom 4.6K 38.3
Instrument Recognition RWC 12.9K 91.6
Emotion Recognition DEAM 1.8K 18.3
Multi-Label Tagging MTG-Jamendo 18.5K 157.1
A variety of music-centered tasks including GTZAN music genre classification
task (Sturm, 2013), MTG-Jamendo music auto-tagging task (Bogdanov et al.,
2019), Real World Computing (RWC) Instrument Classification task (Goto et al.,
2003), Database for Emotional Analysis of Music (DEAM) task (Alajanki et al.,
2016a), and the Extended Ballroom task (Marchand & Peeters, 2016a) were used
to fine-tune M3BERT.
GTZAN consists of 1,000 music clips divided into ten different genres (blues,
classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock). Each genre
consists of 100 music clips in .wav format with a duration of 30s.
The MTG-Jamendo task consists of over 18,000 music clips, each with at least
one mood or theme label. These tags range from common (“Happy” and the
thirteen other most common tags are present in 68% of examples) to uncommon
(the “Sexy” tag is present in 0.64% of samples), and the imbalance factor (the count
of the most common tag divided by the count of the least common tag) is 15.7.
Extended Ballroom is an improved version of the Ballroom dataset (Cano et al.,
2006). This dataset contains 4,180 music clips divided into 13 genres representing
various ballroom dances (Cha Cha, Jive, Quickstep, etc). As these genres are
closely related to rhythmic patterns, they can be considered as rhythm classes.
This dataset’s imbalance factor is also quite high, at 23 (Waltz is the most common
label, and West Coast Swing is the least common). While other metadata are
available (for example, artist and beats per minute of each song), we leave the
possibility of leveraging such information for future work.
The RWC Musical Instrument Sound Database covers 50 musical instruments.
At least three musicians played each instrument and at least three different manu-
facturers’ models were used for each instrument. To further provide a wide variety
of musical instrument performances, the dataset includes samples from every tonal
and dynamic range of each instrument.
After breaking long songs into smaller 30s chunks, the DEAM dataset consisted
of 2,099 excerpts annotated for overall (per-excerpt) valence and arousal. Each
sample was appraised for (perceived) valence and arousal by at least five annotators,
and triplet embeddings of these labels were computed as in Booth et al. (2018a)
and Greer et al. (2020).
For GTZAN, we used the fault-filtered splits (Kereliuk et al., 2015); for MTG-
Jamendo, we organized the training, validation and testing sets as in Bogdanov et
al. (2020). For all other datasets, we could not find an agreed-upon set of splits
in prior work, so we split up our data randomly into five equal parts, using three
parts for training, one part for validation, and one part for testing.³
³ We split these data sets into equal parts according to number of songs. When longer songs
are broken up into 30s chunks, these splits are not exactly 60% training, 20% validation, and 20%
testing. This policy ensures that excerpts from the same song are not present in training and
testing.
Table 2.3: Acoustic features of music extracted by Librosa
Feature Characteristic Dimension
Chromagram Melody, Harmony 12
MFCCs Timbre 20
Delta MFCCs Timbre 20
Mel-scaled Spectrogram Raw Waveform 128
Constant-Q Transform Raw Waveform 144
Audio Preprocessing
The acoustic music analysis library Librosa (McFee et al., 2015) was used
to extract the following features from each song for pre-training: Mel-scaled
Spectrogram, Constant-Q Transform (CQT), Mel-frequency cepstral coefficients
(MFCCs), Delta MFCCs and Chromagrams (see Table 2.3). Each feature was
extracted at the sampling rate of 44,100Hz, with a Hamming window size of 2048
samples (46ms) and a hop size of 1024 samples (23ms). The Mel Spectrogram
and CQT features were transformed to log amplitude with $S_{\text{new}} = \ln(10S + 10^{-6})$,
where $S$ represents the original feature value. Then Cepstral Mean and Variance
Normalization (CMVN) (Pujol et al., 2006; Viikki & Laurila, 1998) were applied to
the extracted features to minimize the distortion caused by noise contamination.
Finally, these normalized features were concatenated to form a set of 324 features
per frame, which was later used as the pre-training input of M3BERT.
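A minimal librosa sketch of this preprocessing pipeline follows; the exact window handling, CQT parameters, and CMVN implementation used for M3BERT may differ, and the frame-count alignment step is an added safeguard rather than part of the described procedure.

```python
import librosa
import numpy as np

def extract_frame_features(path, sr=44100, n_fft=2048, hop=1024):
    """Extract the Table 2.3 features, apply the log transform, and normalize (sketch)."""
    y, sr = librosa.load(path, sr=sr)
    kw = dict(sr=sr, n_fft=n_fft, hop_length=hop, window="hamming")
    chroma = librosa.feature.chroma_stft(y=y, **kw)                      # 12 dims
    mfcc = librosa.feature.mfcc(y=y, n_mfcc=20, **kw)                    # 20 dims
    d_mfcc = librosa.feature.delta(mfcc)                                 # 20 dims
    mel = librosa.feature.melspectrogram(y=y, n_mels=128, **kw)          # 128 dims
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop,
                             n_bins=144, bins_per_octave=24))            # 144 dims
    mel, cqt = np.log(10 * mel + 1e-6), np.log(10 * cqt + 1e-6)          # S_new = ln(10*S + 1e-6)
    n = min(f.shape[1] for f in (chroma, mfcc, d_mfcc, mel, cqt))        # align frame counts
    feats = np.concatenate([f[:, :n] for f in (chroma, mfcc, d_mfcc, mel, cqt)], axis=0)
    # per-dimension mean/variance normalization (CMVN-style)
    feats = (feats - feats.mean(axis=1, keepdims=True)) / (feats.std(axis=1, keepdims=True) + 1e-8)
    return feats.T                                                       # (num_frames, 324)
```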
2.3.2 Training Setup
All of our experiments were conducted on two GTX 2080Ti GPUs and can be repro-
duced on any machine with more than 20 GB of GPU memory. In pre-training,
M3BERTSmall and M3BERTLarge were trained with an effective batch size of 128
for 200k and 500k steps, respectively. We applied an Adam optimizer (Kingma & Ba,
2014) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-6}$. The learning rate followed a warmup
Table 2.4: Parameter settings for downstream tasks
Parameter Candidate Values
Batch Size 16, 24, 32
Learning Rate 2e-5, 3e-5, 4e-5
Epoch 2, 3, 4
Dropout Rate .05, .1
schedule (Vaswani et al., 2017) according to the formula: l
rate
= min(
lmaxs
wT
,
lmax(T−s)
T(1−w)
)
where s represents the step number, w represents the warmup steps (set to 7% of
the total steps T), and l
max
represents the max learning rate (set to 2· 10
−4
). For
downstream tasks, we performed an exhaustive search on a set of parameters and
the model that performed best on the validation set was selected (see Table 2.4).
All other training parameters remained the same as those in the pre-training stage.
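For concreteness, the warmup schedule above can be written as a small helper function; the variable names here are illustrative and not taken from our codebase.

```python
def learning_rate(step: int, total_steps: int,
                  max_lr: float = 2e-4, warmup_frac: float = 0.07) -> float:
    """Linear warmup to max_lr over warmup_frac * total_steps, then linear decay to 0."""
    warmup = min(max(warmup_frac, 1e-8), 1.0)
    rise = max_lr * step / (warmup * total_steps)
    fall = max_lr * (total_steps - step) / (total_steps * (1.0 - warmup))
    return max(0.0, min(rise, fall))

# Example: a 500k-step schedule peaks at step 35k (7% of the total).
rates = [learning_rate(s, 500_000) for s in (0, 35_000, 250_000, 500_000)]
```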
2.4 Results
2.4.1 Patch Masking, CFM and CCM
We first survey the difference between patch masking, CFM, and CCM. When
testing Patch Masking, CFM, and CCM individually on the MTG Jamendo dataset,
we find that CFM outperforms the other two masking routines. When CFM and CCM are combined, performance increases further. A hybrid approach
of combining CCM, CFM, and Patch Masking simultaneously was not attempted
because CCM and CFM already involve contiguous channel and frame masking.
In subsequent experiments, we report on results that use CCM and CFM only.
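A rough sketch of how contiguous frame masking (CFM) and contiguous channel masking (CCM) can be applied to a (frames × features) input is shown below; the span lengths and masking proportions are placeholders rather than the values used for M3BERT.

```python
import numpy as np

def contiguous_frame_mask(x, span=7, mask_prob=0.15, rng=np.random):
    """Zero out random contiguous blocks of frames (rows) until ~mask_prob of frames are masked."""
    x = x.copy()
    n_frames = x.shape[0]
    n_to_mask = int(mask_prob * n_frames)
    masked = np.zeros(n_frames, dtype=bool)
    while masked.sum() < n_to_mask:
        start = rng.randint(0, max(1, n_frames - span))
        masked[start:start + span] = True
    x[masked] = 0.0
    return x, masked

def contiguous_channel_mask(x, max_channels=32, rng=np.random):
    """Zero out one random contiguous block of feature channels (columns)."""
    x = x.copy()
    n_channels = x.shape[1]
    width = rng.randint(1, max_channels + 1)
    start = rng.randint(0, n_channels - width + 1)
    x[:, start:start + width] = 0.0
    return x, (start, start + width)

# Example: mask a 1500-frame, 324-feature excerpt with both objectives.
features = np.random.randn(1500, 324)
masked, frame_mask = contiguous_frame_mask(features)
masked, channel_span = contiguous_channel_mask(masked)
```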
Table 2.5: Results of a genre classification task on the GTZAN dataset
Model Accuracy
MFCCs 44.8%
VGGish (Gemmeke et al., 2017) 53.8%
M2BERT w/o pretraining 56.1%
M2BERT 60.1%
Pretrained CNN (Lee & Nam, 2017) 72.0%
Jukebox Probe (Castellon et al., 2021) 79.7%
Swin-T (Zhao et al., 2022) 81.1%
M3BERTSmall 61.0%
M3BERTLarge 61.7%
2.4.2 GTZAN
Since the GTZAN dataset only contains 1,000 music clips, experiments were
conducted using a ten-fold cross-validation setup. For each fold, 80 songs of each
genre were randomly selected for training and the remaining 20 songs were placed
into the validation split. The ten-fold average accuracy score is shown in Table 2.5.
In previous work, Castellon et al. (2021) applied a probe to Jukebox (Dhariwal
et al., 2020) features to predict music genres. M2BERT (the pretrained model
without multi-task enrichment) does not perform as well as these models on the
GTZAN dataset. Although this small dataset is prone to overfitting (Sturm, 2013),
the multi-task paradigm does not bring our architecture close to the performances
of the best-performing models.
2.4.3 MTG-Jamendo Emotions and Themes in Music
For the Jamendo mood-theme auto-tagging task, ROC-AUC macro and PR-AUC
macro were used to measure performance. ROC-AUC can lead to over-optimistic
scores when data are imbalanced (Davis & Goadrich, 2006), and since the music
tags given in the MTG-Jamendo dataset are highly imbalanced (Knox et al., 2020,
2021), we also used PR-AUC for evaluation. The M3BERT model was compared
with other state-of-the-art models from MediaEval 2020: Emotion and Theme
Recognition in Music Using Jamendo (Bogdanov et al., 2020). We used the same
train-validation-test data splits as the challenge. The results are shown in Table 2.6.
2.4.4 Extended Ballroom Genre Classification Dataset
For the Extended Ballroom genre classification task, our performances were
compared against those of other models, although the splits differed. As evidenced by the best-performing model in Table 2.7, rhythmic features, which were not included in our model inputs, appear to be helpful for predicting ballroom music genres.
2.4.5 DEAM Music Emotion Recognition Task
In the DEAM music emotion recognition task, our representations were compared against other feature sets, including VGGish features and MFCCs. In Table 2.8, we see that MFCCs (timbral features) perform poorly on this music emotion recognition task, while hand-crafted features and the more general VGGish features perform better than our representations.
2.4.6 RWC Instrument Detection Task
In the RWC instrument classification task, our representations outperformed
the other results found in the literature (see Table 2.9). Understandably, timbral MFCC features perform better than VGGish features on instrument detection. The representations are enriched in the multi-task stage, as evidenced by the better performance of M3BERTLarge compared to M2BERT.
Table 2.6: Results of an auto-tagging task on the MTG-Jamendo dataset
Model ROC-AUC PR-AUC
MFCCs .695 .081
VGGish (Gemmeke et al., 2017) .725 .107
M2BERT w/o pre-training .724 .104
M2BERT .735 .109
CNN (2019 Winner) (Koutini et al., 2019) .773 .155
CNN + Loss-function (Knox et al., 2020) .781 .161
M3BERTSmall .777 .125
M3BERTLarge .774 .125
Table 2.7: Results of a genre classification task on the Extended Ballroom dataset. ∗ indicates that the model evaluates on different subsets of the dataset than our work and hence numbers are not directly comparable.
Model  Accuracy  Macro F1
MFCCs (ours)  .532  .381
MFCCs (Voss & Nguyen, 2019)  .623  -
VGGish  .757  .602
Frozen DenseNet (Pavlín, 2020)∗  .633  -
M2BERT w/o pre-training  .817  .685
Transfer Learning (Choi et al., 2017)∗  .819  -
M2BERT  .820  .685
3-layer DNN (Voss & Nguyen, 2019)  .830  -
Rhythmic Features (Marchand & Peeters, 2016b)∗  .949  -
M3BERTSmall  .704  .511
M3BERTLarge  .812  .661
For the baseline model (based on VGGish features (Gemmeke et al., 2017)) and
the 2019 MediaEval winner (Koutini et al., 2019), we directly used the evaluation
results posted in the competition leaderboard. For the 2020 winner (Knox et al.,
2020), we reproduced the work according to their implementation. The results
suggest that improvement over past state-of-the-art work on this music auto-tagging task may be possible if an architecture that integrates information over the temporal domain (such as a CNN) were used.
Table 2.8: Results of a music emotion recognition task on the DEAM dataset. ∗ indicates that the model evaluates on different subsets of the dataset than our work and hence numbers are not directly comparable.
Model  R² (Valence)  R² (Arousal)
MFCCs  .122  .327
M2BERT w/o pre-training  .261  .515
Hand-crafted Features (Kumar et al., 2014)∗  .278  .529
M2BERT  .345  .562
VGGish  .395  .582
M3BERTSmall  .332  .521
M3BERTLarge  .266  .537
Table 2.9: Results of an instrument detection task run on the RWC Instrument dataset. ∗ indicates that the model evaluates on different subsets of the dataset than our work and hence numbers are not directly comparable.
Model  Accuracy  Macro-F1
VGGish  .821  .735
Random Forest (Takahashi & Kondo, 2014)∗  .549  -
Partials (Barbedo & Tzanetakis, 2010)∗  -  .634
Cross-Dataset (Donnelly & Sheppard, 2015)∗  -  .823
MFCCs  .913  .875
M2BERT w/o pre-training  .930  .898
M2BERT  .954  .933
M3BERTSmall  .951  .912
M3BERTLarge  .966  .940
2.4.7 Ablation Study
Ablation studies were also conducted to better understand the performance
of M3BERT, similar to the work done by Zhao & Guo (2021b). The results are
shown in Table 2.10. According to our experiment, M2BERTSmall’s features can
still outperform other pre-trained feature extractors even without pre-training,
indicating the effectiveness of transformer-based architectures on auto-tagging
tasks. When M2BERTSmall is pre-trained with the patch-masking objective, the
performance is worse than when CCM and CFM masking are used in tandem.
Table 2.10: Performance on MTG Autotagging with Ablation Study
Missing Dataset ROC-AUC PR-AUC
MSD .7058 .0874
FMA .7216 .0977
M4A .7234 .1006
MTG .7267 .1035
None .7354 .1082
We also see from this improvement that properly setting the CCM and CFM pre-training objectives is key to enabling M2BERT and M3BERT to learn powerful music representations.
We removed datasets from pre-training to assess which datasets were most
crucial to good performance on downstream tasks. Removing any dataset from
pre-training results in a degradation in downstream performance on MTG-Jamendo
autotagging (evaluating performance on all of the downstream tasks was prohibitively time-consuming, so we opted to use the largest dataset for this ablation study); the larger the input dataset, the more severe the degradation. The
M2BERT model uses the diverse input datasets to inform its representations, and
each dataset is evidently bringing a rich set of features for informing pre-training.
We also explore the effect that model size has on downstream task accuracy.
In our experiments, M3BERTLarge generally outperforms M3BERTSmall, which
remains consistent with the findings of Zhao & Guo (2021b), although in tasks like
valence prediction we see that M3BERTSmall outperforms M3BERTLarge. For
other tasks, like mood-theme detection, we see comparable performance using either
set of features. This suggests that for certain tasks, using the relatively economical
M3BERTSmall features may be as effective as using M3BERTLarge features.
Table 2.11: Performance on MTG-Jamendo Mood-Theme Autotagging with Differ-
ent Starting Feature Sets
Feature Sets ROC-AUC PR-AUC
Mel Spectrogram .5615 .0400
MFCCs and Delta MFCCs .5615 .0406
All .7354 .1082
2.4.8 Correlational Analysis
Deep learning models can suffer from a lack of interpretability (Linardatos et al.,
2020). Toward finding representations of music that are phenomenologically interpretable (e.g., the 0th MFCC is closely tied to a signal's "loudness"), we used Librosa (McFee et al., 2015) to compute several high-level audio features, including brightness, loudness, and spectral flux. We then correlated these features with outputs from
the M3BERT encoder. Results and correlations are shown in Figs. 2.5 and 2.6.
Figure 2.5: Centroid and cell activation. Certain outputs from the M3BERT
encoder correlate highly with auditory phenomena, like spectral centroid. Pearson’s
ρ =.831 between these two features.
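This correlational analysis can be reproduced along the following lines; the encoder-output array here is a placeholder file, and the frame parameters are assumed to match those used for feature extraction.

```python
import librosa
import numpy as np
from scipy.stats import pearsonr

y, sr = librosa.load("song.wav", sr=44100)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=2048, hop_length=1024)[0]

# encoder_out: (n_frames, d_model) M3BERT encoder outputs for the same song (placeholder).
encoder_out = np.load("m3bert_outputs.npy")
n = min(len(centroid), encoder_out.shape[0])

# Correlate every output dimension with the spectral centroid and keep the strongest one.
rhos = [pearsonr(centroid[:n], encoder_out[:n, d])[0] for d in range(encoder_out.shape[1])]
best_dim = int(np.argmax(np.abs(rhos)))
print(f"dim {best_dim}: Pearson rho = {rhos[best_dim]:.3f}")
```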
2.5 Discussion
We see that on several downstream tasks, such as instrument detection and mood-theme autotagging, M3BERT performs on par with or better than the best performance found in the literature. We observe that M3BERT performs much
Figure 2.6: Harmonicity and cell activation. Interpretable auditory features
like harmonicity were also correlated with certain outputs from M3BERT’s encoder.
The encoder is creating high-level representations that are not necessarily based on
frequency, as in this case. Pearson’s ρ =.823 between these two features.
better on the mood-theme classification task than the M2BERT model: this may be
because the multi-task learning paradigm exploited some labels that were present
in both the mood-theme detection task and the genre classification tasks (for example, one label in GTZAN is "jazz" and one label in MTG-Jamendo is "jazzy"). Curiously,
the genre classification tasks did not benefit as much from multi-task learning;
these datasets are relatively small compared to MTG-Jamendo, so in the multi-
task paradigm, their samples are likely getting overwhelmed by the prevalence of
MTG-Jamendo samples. With multi-task-specific loss function adjustments, such
as those suggested by Kendall et al. (2018), it may be possible to improve on our
results.
In the classification and regression tasks, we averaged outputs across timesteps.
This architecture was used for the sake of simplicity in creating representations of
music, but it does not take advantage of the temporal dependencies of the musical
inputs. If an architecture that captures this temporal information—such as a CNN
or LSTM—were to be built upon the features that we created, we would expect to
see greater improvement on these downstream tasks.
We see that although M3BERT performed very well on the instrument clas-
sification task, it did not perform as well on the GTZAN genre classification or
DEAM music emotion recognition tasks. This may also be explained by the relative
paucity of data and the input features we used for pre-training, which may not have
spanned feature types that would be relevant for these prediction tasks. To wit, we
used many features that related to timbre, which sensibly would perform well on
an instrument classification task, but may not necessarily perform well on a music
emotion recognition task, for example. Rhythmic features are shown to be effective
in ballroom dance genre classification (Marchand & Peeters, 2016b), but were not
represented in our initial input features. We hypothesize that choosing a broad set of input audio features is important for creating robust, diverse representations of music.
In the interest of investigating the interpretability of our embeddings, we present two high-level features that are highly correlated with outputs from M3BERT: harmonicity and spectral centroid. While spectral centroid is a rough measure of a song's
pitch, other frequency-based features were also correlated with cell activations,
including brightness and spectral rolloff. Harmonicity and percussiveness were
both correlated to encoder outputs (ρ>.8), and relate to timbre and, proximally,
loudness (we did not analyze RMS because it is captured in our encoder inputs by
MFCC 0). Other features, including f0, spectral flatness and contrast, and zero
crossing rate, were not found to be highly correlated with encoder outputs.
2.6 Conclusion
We propose M3BERT, a universal music encoder based on transformers. Rather
than relying on massive human labeled data, which is expensive and time-consuming
to collect, M3BERT can learn representations of music from unlabeled data and
improve upon its representation with multi-task learning in fine-tuning. Contiguous
Frames Masking, Contiguous Channel Masking, and Patch Masking are applied as
pre-training objectives. Subsequently, using a multi-task approach, our model learns
from several disparate music information retrieval tasks at once. The effectiveness
of these proposed objectives, datasets, and input features is evaluated through
ablation studies. We find that M3BERT outperforms commonly used features for
music classification on a variety of music-related tasks. We also find that multi-task
learning enriches the representations generated by our encoder. Our work shows
the potential of adapting a transformer-based, masked reconstruction pre-training
scheme with multi-task learning to MIR interests. Beyond improving the model,
we plan to extend M3BERT to still other music understanding tasks, like key
estimation and cover song detection. This work shows that marrying large-scale
representation learning with diverse, supervised learning tasks can uncover powerful
representations that can provide researchers a “canonical” first step to feature
extraction for music-related tasks.
Chapter 3
Creating Non-Auditory
Representations of Music for
Downstream Tasks
3.1 Introduction
Analyzing how humans experience music listening is a research interest in diverse
areas such as music information retrieval (Trohidis et al., 2008b) and affective
computing (Schindler & Rauber, 2015). Automatically classifying musical genre or
identifying emotion in music can be used for music tagging or for providing insights
into the mechanisms of human cognition (Koelsch et al., 2006). While applicable
to many domains, these computational problems are especially challenging (and
interesting) because of the subjective nature of the music listening experience. We
collected arousal and valence responses to short musical clips from participants
from different parts of the world and curated a dataset which shows clear differences
in the perception of music across demographic dimensions. This dataset is released as well (https://github.com/timothydgreer/chord_lyric_reps), which we hope will encourage further inquiry into multi-modal music emotion recognition.
We hypothesize that embedding words from lyrics and chords in a shared vector
space captures how chord progressions and lyrics affect each other and is useful
for music analysis. We show results from two tasks: a genre classification task
using Billboard listings (https://www.billboard.com/charts) and a music emotion
recognition task using human annotators. Through these tasks we reveal the utility
of using shared embeddings to study music listening experiences.
3.2 Related work
Learning distributed word representations is a heavily researched topic in natural
language processing (NLP) (Mikolov et al., 2013; Pennington et al., 2014; Gouws et
al., 2015). Madjiheurem et al. applied the widely used “word2vec” architecture to
model chord progressions (Madjiheurem et al., 2016). Other research has extended
this architecture to a bilingual scenario (Luong et al., 2015; Gouws et al., 2015). In
this chapter, we adapt a similar architecture by creating bilingual word embeddings
using two “languages” found in music: lyrics and chords. We compare this system
to other embedding schemes, such as bidirectional encoder representations from
transformers (BERT) (Devlin et al., 2018b), which tend to require large datasets
and computational power in order to train (You et al., 2019).
Studying chords and their progressions in music has proven useful for automatic
genre classification (Cheng et al., 2008; Anglade et al., 2009). There also exist
studies on predicting musical genre by applying NLP techniques to lyrics (Mayer et
al., 2008b). Other work combines these NLP techniques with audio information to
determine if a particular multimodal approach is helpful for genre prediction (Mayer
et al., 2008a; Laurier et al., 2008).
A number of genre classification tasks exist in the literature (Li et al., 2003;
Tzanetakis & Cook, 2002; Scaringella et al., 2006). However, these tasks do not
use datasets that have lyrics and chord information aligned together. In previous
work, we curated a new dataset with chords and English lyrics side-by-side and
used learned embeddings on a pilot task related to genre classification (Greer &
Narayanan, 2019). We extend this work, using a larger dataset from the same
online source, and show improved results on a similar task.
The subjective nature of music perception calls for an objective, agreed-upon
metric for tasks such as genre classification. For this reason, we use the Billboard
charts for musical genre labels. Billboard determines genre by “key fan interactions
with music, including album sales and downloads, track downloads, radio airplay
and touring as well as streaming and social interactions on Facebook, Twitter, Vevo,
Youtube, Spotify and other popular online destinations for music” (bil, b). Social
tags provided by users of online music streaming sites have been shown to be an
effective method for classifying music based on its emotional content (Song et al.,
2012).
We use the aforementioned embeddings for a multi-label music genre classification task (it is multi-label because some songs can support more than one concurrent label, e.g., pop and R&B). Many techniques have been used for multi-
label classification, including simple k-nearest neighbors (k-NN) classifiers (Zhang
& Zhou, 2007), decision trees (Blockeel et al., 2006; Yi et al., 2011), and neural
networks (Zhang & Zhou, 2006). Here, we focus on state-of-the-art random k-
labelsets classifiers (RAkEL) because of their simplicity, fast training times, and
shown utility in multi-label classification tasks (Sanden & Zhang, 2011).
We also explore the use of the proposed embeddings in a music emotion recogni-
tion task. While many such tasks exist in the literature (Fan et al., 2017; Trohidis
et al., 2008a; Han et al., 2010; Pallesen et al., 2005), these tasks do not use datasets
that have lyrics and chord information aligned together. To this end, we created
our own valence and arousal annotation task: we asked 15 participants, who hailed
from eight different countries, to listen to 10-second segments of 500 songs found
in our dataset and report on their valence and arousal. By extending the work
of Greer et al. (2019b), we show the range of studies that can be conducted with
this dataset and demonstrate the utility of using joint chord-and-lyric embeddings
for studying music perception.
3.3 Model
The model we use in this chapter is an adaptation of the standard skip-gram
neural network architecture (Mikolov et al., 2013), which induces word representa-
tions by predicting the context words around a target word. Mathematically, the
autoencoder maximizes the monolingual objective function:
$$\mathrm{MONO}_L = \frac{1}{T}\sum_{t=1}^{T}\sum_{-w \le j \le w,\, j \neq 0} \log p(l_{t+j} \mid l_t) \tag{3.1}$$
where $l_1, l_2, \ldots, l_T$ are words in the training corpus $L$, $w$ is the size of the context window around target word $l_t$, and $p(l \mid l')$ is the probability that the corpus contains $l$ given that $l'$ is in its vicinity in the corpus.
The model we use induces representations for two symbolic musical “languages”
together: lyrics and chords. We implement a bilingual adaptation of the standard
skip-gram, introduced by Luong et al. (2015) and used by Greer et al. (2019b)
and Greer & Narayanan (2019).
In this case, our model predicts the neighbors of a lyric l in a lyric vocabulary
L if it is aligned with a chord c in a chord vocabulary C and vice versa. We train
a single skip-gram model with a joint vocabulary on parallel corpora and each
training example is enriched with both chordal and lyrical context. This bilingual
method, as a result, learns embeddings for lyrics that are dependent on the “context”
chords and vice versa. The training objective function for the lyric embeddings
is $\mathrm{MONO}_L + \mathrm{CROSS}_{LC}$, where $C$ and $L$ are the corpora for chords and lyrics, respectively. $\mathrm{CROSS}_{LC}$, a cross-lingual term, is defined as
$$\mathrm{CROSS}_{LC} = \frac{1}{T_l}\sum_{t=1}^{T_l}\Big(\sum_{-w_c \le j \le w_c} \log p(c_{k+j} \mid l_t)\Big). \tag{3.2}$$
$\mathrm{CROSS}_{CL}$, the other cross-lingual term, is similarly defined as
$$\mathrm{CROSS}_{CL} = \frac{1}{T_c}\sum_{t=1}^{T_c}\Big(\sum_{-w_l \le j \le w_l} \log p(w_{k+j} \mid c_t)\Big), \tag{3.3}$$
making the training objective function for the chord embeddings $\mathrm{MONO}_C + \mathrm{CROSS}_{CL}$. Here, $T_c$ and $T_l$ refer to the number of words in the training corpora of chords and lyrics, respectively, and $w_c$ and $w_l$ refer to the size of the context window around the chords and lyrics, respectively. The target index $k$ in the cross-lingual objectives is found by computing $[t \cdot S_t / S_r]$, where $S_t$ and $S_r$ are the sentence lengths
of the target language and source language, respectively. An example alignment of
chords with lyrics is shown in Fig 3.1.
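A minimal sketch of how cross-lingual (lyric/chord) training pairs can be generated from an aligned clip, following the target-index rule above, is given below; the tokenization and window size are illustrative.

```python
def cross_lingual_pairs(lyrics, chords, window=1):
    """For each lyric token l_t, pair it with chords near the aligned index k = floor(t * S_t / S_r)."""
    pairs = []
    s_r, s_t = len(lyrics), len(chords)    # source = lyrics, target = chords
    for t, lyric in enumerate(lyrics):
        k = (t * s_t) // s_r               # aligned chord position
        for j in range(-window, window + 1):
            if 0 <= k + j < s_t:
                pairs.append((lyric, chords[k + j]))
    return pairs

# Example clip: "you are the greatest" sung over C and Fm.
print(cross_lingual_pairs(["you", "are", "the", "greatest"], ["C", "Fm"]))
```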
To create the embeddings, we used multivec (Bérard et al., 2016) with the
following parameters: stochastic gradient descent (Robbins & Monro, 1951); a
learning rate of 0.01; exponential decay of 0.98 after 10,000 steps (where 1 step is
256 word pairs); negative sampling with 64 samples; a skip-gram window of size
five for lyrics and a skip-gram window of size one for chords; equal sampling of the
number of monolingual and cross-lingual word pairs to make a mini-batch at every
Figure 3.1: Embedding chords and lyrics. Each lyric and chord predicts its
lyric and chord context. Here, the F minor chord (Fm) predicts chords around it
(denoted by dashed lines) and lyrics that are sung during and around the F minor
chord (denoted by solid and dotted lines). The F minor chord is aligned with the
lyric “greatest” because they are played and sung at the same time, respectively.
step; and a 200-dimensional resultant embedding space (other hyperparameter settings were tried; this setup had the most predictive power, and the best alternatives were not appreciably different, so we use these parameters for the studies presented). This implementation is identical to that described in Greer et al. (2019b) and Greer & Narayanan (2019).
Other embedding schemes, such as BERT (Devlin et al., 2018b), have been pro-
posed for capturing contextualized meaning of words. We compared our embeddings
to BERT and report on these results.
3.4 Data
We curated our chord- and lyric-aligned dataset from UkuTabs arrange-
ments (Uku). UkuTabs gives users—generally ukulele players—access to an archive
of tablatures for over 7,000 popular songs. The wordcloud in Fig 3.2 shows artists
Figure 3.2: A wordcloud of artists. A larger font size indicates that the artist
is featured more prominently in the dataset.
whose songs have tablatures listed on this website. Arrangements are sourced by
musicians and all arrangement submissions are verified for quality by moderators.
Although websites like ultimate-guitar.com, chordie.com, and e-chords.com may
offer more song tablatures than UkuTabs, their arrangements either are not screened for accuracy or do not follow a standard format, two desiderata for creating a high-quality dataset for music research.
3.4.1 Data Collection
The text data was retrieved from every chord tablature in UkuTabs (Uku). We
then aligned the chords and the lyrics for each musical passage that contained
chords and lyrics (we will call such a musical passage a "clip").
Table 3.1: Complex chords in UkuTabs, and their conversions after casting. The
left column has the unconverted complex chords and the right column contains the
corresponding basic chord.
Complex chord Basic chord
Cmaj9, Cadd4, Csus2, C/G, C5, C6/9 C major
Cmin, Cm7, Eb/C, Cm, Cm13, Cm/Bb C minor
C7sus4, C9, Caug, Gm/E, Em7b5/C C seventh
Cdim7, B7/C, Cdim, Cmb5 C diminished
Figure 3.3: Screenshot from UkuTabs showing a song excerpt. The “x4”
indicates that the bridge section is repeated three times.
We converted every chord in the dataset into one of four basic chord types:
major, minor, dominant 7th, or diminished. 18,055 of the 442,181 chords in the
corpus were converted to one of these basic chord types (4.1%); the other chords
were already specified as a basic chord type in the tablature. Some examples of
complex chords and their conversions are listed in Table 3.1.
If less than 50% of a song's lyrics were in English, the song was not included in the dataset (this parameter is tunable in the code provided). If a section was
repeated, the lyrics and chords were repeated in the dataset as well. See Fig 3.3 for
an example of a repeated section. Some statistics of the final dataset we created
are given in Table 3.2.
To create useful chord representations, it is necessary to find each chord’s
relation to its song’s key, or tonal center. Noland and Sandler used hidden Markov
Table 3.2: Statistics of chords- and lyrics-aligned dataset.
Total Sample Points 196,337
Number of Songs 5,474
Average Number of Chords per Sample 2.25
Average Number of Words per Sample 7.95
Number of Artists 1,847
models to estimate musical key for songs by The Beatles (Noland & Sandler, 2006).
However, the authors only used major and minor chords and tested on just 110
songs from one artist, limiting the generalization power of the study. In an effort
to make a system that would be useful for other chord types, artists, and genres,
we developed a simple method to estimate the key of every song in our dataset,
identical to that described by Greer & Narayanan (2019): we tallied the number
of chords that are found in the scale of each of the twelve potential major keys
and selected the potential key with the highest such tally to be the estimate. As a
tiebreaker, the sum of the number of I, IV, V and vi chords was found and the key
with the highest sum was chosen.
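A compact version of this key-estimation heuristic is sketched below; the root extraction and scale-membership logic are simplified (sharp-based roots only, no enharmonic handling) and should be treated as illustrative rather than as the exact implementation.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]   # semitone offsets of a major scale
PRIMARY = [0, 5, 7, 9]                  # roots of the I, IV, V, and vi chords

def chord_root(chord):
    """Return the pitch class of a chord symbol like 'Fm', 'G7', or 'C#dim'."""
    root = chord[:2] if len(chord) > 1 and chord[1] == "#" else chord[:1]
    return NOTES.index(root)

def estimate_key(chords):
    scores, tiebreaks = [], []
    roots = [chord_root(c) for c in chords]
    for key in range(12):
        scale = {(key + off) % 12 for off in MAJOR_SCALE}
        primary = {(key + off) % 12 for off in PRIMARY}
        scores.append(sum(r in scale for r in roots))      # chords inside the key's scale
        tiebreaks.append(sum(r in primary for r in roots)) # I, IV, V, vi tiebreaker
    best = max(range(12), key=lambda k: (scores[k], tiebreaks[k]))
    return NOTES[best]

print(estimate_key(["C", "G", "Am", "F", "C", "G"]))  # -> "C"
```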
An analysis of 50 random songs from the dataset revealed that this method for
calculating the key of a song is effective: the method was 98% accurate on these
songs. The key was estimated incorrectly for one song (“A Whole New World” from
the Soundtrack of Aladdin) because it contained a key change. (While an HMM system would likely diagnose this key change, it requires training data, which we lacked, so we opted to use our key detector instead.)
Table 3.3: Number of songs listed in five Billboard charts that were also found
in the UkuTabs dataset. Pop was the most prominent genre, and 328 songs were
labeled as “pop” only. The number of “crossover” songs, or songs listed in more
than one Billboard chart, are entries in the off-diagonals.
Latin Country Pop Rock R&B/Hip-Hop
Latin 3 0 32 3 3
Country 0 105 13 0 0
Pop 32 13 328 62 35
Rock 3 0 62 266 1
R&B/Hip-Hop 3 0 35 1 17
3.5 Genre Classification Task
Once we learned representations for these musical passages, we created a musical
genre classification task. The Billboard charts were used to provide ground truth
for genre.
3.5.1 Collecting Billboard Songs
We collected the song titles and artists listed on the Latin, Country, Rock, R&B/Hip-Hop, and Pop Billboard charts from May 27, 1999 to May 27, 2019 (bil, a). These songs
were then matched to songs from the dataset collected from UkuTabs. Two songs
were matched if they had the same artist and title. 850 unique songs were found.
Table 3.3 lists the number of songs from each chart found in the dataset.
We computed the embeddings for each of these songs by summing the embed-
dings of every token in each song and dividing by the number of tokens in that
song. This 200-dimensional embedding was used as a feature vector for both genre
prediction and emotion classification.
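Computing a song-level vector from the learned token embeddings amounts to a simple average; `embedding_lookup` below is a hypothetical dict mapping each chord or lyric token to its 200-dimensional vector.

```python
import numpy as np

def song_embedding(tokens, embedding_lookup, dim=200):
    """Average the embeddings of all chord and lyric tokens in a song."""
    vectors = [embedding_lookup[t] for t in tokens if t in embedding_lookup]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Example with a toy lookup of random vectors:
lookup = {t: np.random.randn(200) for t in ["C", "G", "Am", "F", "love", "you"]}
x = song_embedding(["C", "love", "you", "G", "Am", "F"], lookup)  # shape (200,)
```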
3.5.2 Genre Classification Results
We compared our genre classification systems to two baseline models. The first
baseline used a classifier that “chooses” the most common label set. This classifier
naïvely predicted that every song belonged to the pop genre only (38.7%). For
the second baseline model, we created a Bag of Words model (Salton et al., 1975),
treating chords and lyrics as one language. While this model uses both lyrical and
chordal modes, it does not leverage more complex musical information, such as
chord progressions, lyric sequences, or chordal and lyrical interaction.
Using the standard monolingual word2vec model, we learned monolingual
embeddings for chords and lyrics and used the sum of these embeddings as features
for two additional models. A RAkEL classifier, similar to (Sanden & Zhang, 2011),
was trained on these features, after reducing the dimensionality using principal
component analysis (Abdi & Williams, 2010). We empirically set the post-PCA
dimensionality to be three for all classifiers and used five-fold cross-validation in all
experiments.
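A hedged sketch of this downstream setup is shown below: PCA to three components followed by a multi-label classifier under five-fold cross-validation. A one-vs-rest logistic regression stands in for the RAkEL classifier actually used, and `X`/`Y` are placeholders for the song embeddings and binary genre labels.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

X = np.random.randn(850, 200)                    # song embeddings (placeholder)
Y = (np.random.rand(850, 5) > 0.8).astype(int)   # binary labels for 5 genres (placeholder)

model = make_pipeline(PCA(n_components=3),
                      OneVsRestClassifier(LogisticRegression(max_iter=1000)))

emr_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], Y[train_idx])
    pred = model.predict(X[test_idx])
    emr_scores.append(np.mean(np.all(pred == Y[test_idx], axis=1)))  # exact match ratio
print(np.mean(emr_scores))
```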
Table 3.4 shows the results for the genre classification task. The Chords Only
model and Lyrics Only model refer to a RAkEL model that uses embeddings
learned only using chord progressions and lyric sequences, respectively. The Chords
& Lyrics model uses word embeddings learned jointly using lyrics word sequences
and chord progressions.
The Exact Matching Ratio (EMR) metric is the number of test examples that
have labelsets that exactly match the predicted labelsets, divided by the number of
test examples:
$$\mathrm{EMR} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(Y_i = Z_i) \tag{3.4}$$
Table 3.4: A list of models used for multi-label genre classification and their
performance on three metrics. The Chords & Lyrics model performs best by all
three metrics: the harsh exact match ratio (EMR), label accuracy, and label-based,
micro-averaged F1-score. Asterisks indicate significance at the .05 level, when using
a 2-sample t-test with the best performing baseline in that category.
Model EMR Label Accuracy Micro F1-score
Baselines
Most Common Set 38.7 % .465 .503
BoW 39.5 % .457 .511
Our models
Chords Only 37.5 % .455 .473
Lyrics Only 41.3 %* .484* .524*
Chords & Lyrics 42.6 %* .498* .528*
where $\mathbf{1}(\cdot)$ is the indicator function, $N$ is the number of examples, and $Y_i$ and $Z_i$ refer to the true labelset and the predicted labelset for sample $i$, respectively.
The Label Accuracy metric rewards correctly predicted labels and penalizes
incorrectly predicted labels. Concretely:
$$H = \frac{1}{N}\sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \tag{3.5}$$
Label-based, micro-averaged F1-score involves aggregating the contributions of
all classes to determine precision and recall measures and computing the F1-score
from these aggregated measures (Sorower).
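The three metrics can be computed directly from binary label matrices, as sketched here; `Y_true` and `Y_pred` are placeholders for the true and predicted labelsets.

```python
import numpy as np
from sklearn.metrics import f1_score

def exact_match_ratio(y_true, y_pred):
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def label_accuracy(y_true, y_pred):
    """Jaccard-style accuracy: |intersection| / |union| per example, averaged (Eq. 3.5)."""
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    return float(np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1))))

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(exact_match_ratio(Y_true, Y_pred))          # 0.5
print(label_accuracy(Y_true, Y_pred))             # (0.5 + 1.0) / 2 = 0.75
print(f1_score(Y_true, Y_pred, average="micro"))  # micro-averaged F1
```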
Across all three metrics chosen for multi-label classification (EMR, label accuracy, and F1-score), the Lyrics Only model outperformed the baseline models, and the Chords & Lyrics model outperformed the embedding models that used only one modality. That the embeddings-based models performed better than
baseline models by all metrics suggests that chordal and lyrical context are useful
for musical analysis. The Chords & Lyrics model outperformed all others in all
three metrics, showing the utility of using a multimodal approach to predict musical
genre.
3.6 Emotion Recognition Task
Several approaches have been taken to predict emotion during music listening.
Researchers have used lyrics (Chen et al., 2006; Xia et al., 2008) and multimodal
approaches (Schuller et al., 2010; Hu et al., 2009) in music emotion recognition
(MER). Authors in Kim et al. (2010) suggest that the most accurate MER systems
apply large-scale machine learning algorithms to relatively short musical selections,
using vast feature sets that span multiple domains (as in Turnbull et al. (2008)
and Bischoff et al. (2009)). We create an MER system with these qualities in mind
and present our results on a music emotion recognition task, further demonstrating
the usefulness of joint embeddings in music analysis.
3.6.1 Collecting Annotations
In order to test our embeddings on a music emotion recognition task, we had
to aggregate emotion annotations on a dataset that has a chord- and lyric-aligned
corpus. To our knowledge, there is no such dataset available in the literature, so
we set up our own task and gathered human annotations.
500 musical clips were randomly chosen without replacement from a dataset of
159,427 clips, and the audio stimuli of these clips were presented to English-speaking
annotators on Amazon's Mechanical Turk (https://www.mturk.com/). Every annotator had to complete a
questionnaire prior to completing a task, where they indicated their gender identity,
age, country of residence, and musical experience.
Sixty-nine (47 male, 22 female) subjects participated in our study. The average
age of participants was 34.1 years (standard deviation = 10.2). Participants
represented eight countries, and no more than 38 participants hailed from the
same country (the United States was the best represented). Subjects were asked to
indicate their musical experience using a five-point Likert scale, and the average
response was 3.0 (standard deviation = 1.3). Annotators listened to 10-second
clips on YouTube which were centered around a musical passage. After listening,
participants used a five-point Likert scale to label their arousal and valence. Xiao
et al. have shown that ten seconds is sufficient for emotion to stabilize (Xiao et al.,
2008).
Triplet Embeddings
In order to draw meaningful, accurate conclusions using annotations from this
data, it is necessary to understand and represent the latent states of the subjects’
responses. To this end, we use triplet embeddings (Booth et al., 2018b) as a way to
generate group-level models of the behavioral music experience; this aggregation yields labels that are more robust to noise and artifacts than simple averaging of valence and arousal measures. This approach has been shown to improve prediction in music emotion recognition (Greer et al., 2020), and we apply the same technique here to the
arousal and valence responses to better quantify how music listening affects human
experience. In particular, we use the approach proposed by Booth et al. (2018a).
The key assumption here is that annotators are better at annotating ordinal relations in time than ratings, and therefore they should be able to more easily answer the question
$$d(Y_i, Y_j) \overset{?}{\lessgtr} d(Y_i, Y_k), \tag{3.6}$$
where $i$, $j$, and $k$ are time-frames of the annotations $Y$ and $d(\cdot,\cdot)$ is a distance. Collecting a set of these comparisons (called triplets) from all annotators and deciding, based on majority vote, whether $Y_i$ is closer to $Y_j$ than to $Y_k$ (or the opposite), it is possible to find a 1-dimensional embedding that may be used as a fused annotation.
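A small sketch of the triplet-aggregation step is given below: each annotator contributes a vote on whether frame i is closer to frame j than to frame k, and the majority label is kept for the downstream embedding step. The distance and voting rule here are illustrative; the actual embedding follows Booth et al. (2018a).

```python
import numpy as np

def majority_vote_triplets(annotations, triplets):
    """annotations: (n_annotators, n_frames) array; triplets: list of (i, j, k) frame indices.
    Returns, per triplet, +1 if most annotators place Y_i closer to Y_j than to Y_k, else -1."""
    labels = []
    for i, j, k in triplets:
        votes = np.sign(np.abs(annotations[:, i] - annotations[:, k])
                        - np.abs(annotations[:, i] - annotations[:, j]))
        labels.append(1 if votes.sum() > 0 else -1)
    return labels

ann = np.array([[1.0, 1.2, 3.0, 0.9],
                [2.0, 2.1, 4.0, 1.8]])                      # two annotators, four frames
print(majority_vote_triplets(ann, [(0, 1, 2), (0, 2, 3)]))  # [1, -1]
```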
These arousal and valence triplet embeddings were compared to song-level
Spotify auditory features of "energy" and "valence," respectively (https://developer.spotify.com/documentation/web-api/), and the Pearson correlation coefficients were found to be .66 and .37, suggesting that there is a relationship between the arousal and valence reported in these 10-second segments and Spotify's corresponding "perceived" auditory metrics.
3.6.2 Statistics on Behavioral Responses
Before predicting average valence and arousal across participants from around
the world, it is important to identify differences in responses to the musical
stimuli we presented in our music emotion annotation task.
Using a Mann-Whitney U test, we found that participants living in countries with traditionally Western cultures (the United States, the United Kingdom, France, and the Netherlands) reported significantly higher ($p < 10^{-10}$) average arousal than other participants (who identified as living in Brazil, Pakistan, India, and the Philippines).
This finding is consistent with prior studies that suggested that Western cultures
tend to value and promote high-arousal emotions over low-arousal emotions (Lim,
2016; Cowen et al., 2020).
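This group comparison can be run with SciPy as follows; the per-participant arousal arrays below are placeholder values.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Mean arousal rating per participant, split by group (placeholder values).
western = np.array([3.8, 4.1, 3.9, 4.3, 3.7])
non_western = np.array([2.9, 3.1, 3.0, 2.8, 3.2])

stat, p = mannwhitneyu(western, non_western, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4g}")
```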
Another Mann-Whitney U test between participants below the median age (33)
and participants above that median found that older-than-median participants
rated significantly higher arousal ($p < 10^{-6}$) than participants who were younger
than the median age. Other studies have shown that ratings of valence and arousal
tend to become more extreme with age (Grühn & Scheibe, 2008) and that older
individuals prefer high-valence, low-arousal music (Cohrdes et al., 2017), which
suggests that our musical stimuli may have been rated as more extreme by older
individuals, particularly if these individuals do not normally expose themselves to
this kind of music. (We did not find a significant difference in valence ratings between the groups.)
Lastly, we looked at valence response ratings as a function of music experience
reported by the subjects. We split the participants into those who had higher-
than-average musical abilities (reporting a 4 or 5 on the Likert scale) and those
who were average or below average and ran a Mann-Whitney U test on these two
groups’ responses. We found that those with lower musical experience tended to
rate higher arousal to our musical stimuli than those with higher musical experience
($p < 10^{-4}$). Mikutta et al. (2014) showed that subjective arousal ratings of amateur
and professional musicians are not significantly different, while Kreutz et al. (2008)
showed that reported arousal ratings were in fact higher for individuals with high
musical expertise than for individuals with low musical expertise. These studies
were limited in scope, however; the former used a single musical stimulus, while the latter used participants from the same country. Our results may be explained in
part by familiarity; a seasoned musician has likely heard many of the songs in
our pop-leaning stimuli before, while non-musician participants may have not yet
“inured” themselves to the songs.
This dataset is released to encourage further study on cross-cultural music
perception, particularly of songs with lyrics.
3.6.3 Emotion Regression Results
In addition to the baseline models used for the genre classification task, we
compared our models to two other baselines. The first new baseline is a model
based on BERT embeddings (Devlin et al., 2018b), which are known to have good
performance on tasks in NLP. The purpose of this is to determine if our “lighter-
weight” embeddings perform on par with embeddings that generally require more
data for good performance (Ulčar & Robnik-Šikonja, 2020). For another baseline,
we scraped Spotify’s API for song-level audio features, such as “danceability,”
“energy,” and “valence.” We wanted to determine if the chords and lyrics of a
song could indicate human-reported arousal and valence better than using audio
features computed from Spotify. Lastly, we fine-tuned a BERT model for the
specific regression task (valence or arousal), and then used the generated sentence
embeddings as input features to a downstream model.
Table 3.5 shows the results for the regression task. We report the mean and
standard deviation of the given metrics after 5-fold cross validation.
3.7 Discussion
We begin the discussion with some of the limitations of the current work that
set the stage for future work. We first note that UkuTabs’ data has a bias towards
music that is playable by ukulele musicians, but shared representations like ours can be learned from any dataset that contains accurate lyrics and
chords in parallel. We also note that there are other websites with chords and
lyrics that are aligned, which could strengthen the embeddings. For example,
chords.cloud offers chord tabs on many popular songs from the 21st century (many
Table 3.5: Our model outperforms language models of chords or lyrics only, and the models that use embeddings outperform n-gram models. These BERT embeddings were tuned on the final dataset. Asterisks indicate significantly better performance at the .05 level when compared to the Average baseline; daggers indicate significantly better performance at the .05 level when compared to the best baseline.
Model  Valence MSE  Arousal MSE
Baselines
Average  .255±.026  .475±.030
TF-IDF (Chords Only)  .254±.015  .485±.024
TF-IDF (Chords Only, Song-Level)  .251±.014  .433±.033
TF-IDF (Lyrics Only)  .250±.018  .472±.022
TF-IDF (Lyrics Only, Song-Level)  .225±.017∗  .436±.033∗
BERT (Lyrics Only)  .251±.018  .472±.022
BERT (Lyrics Only, Song-Level)  .240±.019  .457±.024
Spotify Audio Features  .209±.038∗  .247±.033∗
Our models
Chords Only, 3 lines  .253±.022  .480±.023
Lyrics Only, 3 lines  .248±.026  .469±.032
Chords and Lyrics, 3 lines  .249±.027  .470±.037
Chords Only  .242±.022  .458±.032∗
Lyrics Only  .226±.026∗  .413±.026∗
Chords & Lyrics  .236±.011∗  .419±.020∗
Chords & Lyrics w. Spotify Features  .198±.040∗  .219±.030†
of which overlap with Ukutabs), although there is no mention of moderation for
accuracy.
As evidenced by Table 3.5, using audio information in addition to symbolic
representations of songs opens opportunities to better classify various aspects of
music such as affect conveyed, providing a deeper understanding of how we perceive
and process music. While our representations are shown to be useful for genre and
emotion classification, models that use these symbolic representations of music in
tandem with certain auditory features would likely result in even better performance,
a topic for future inquiry.
The chord tablatures submitted to UkuTabs, although moderated and checked
for accuracy, may contain errors. Furthermore, the chord casting method we devel-
oped may have been inaccurate for up to 4.1% of the chords in the UkuTabs corpus,
which might have led to inaccuracies in the chords of our dataset. Out of 50 songs
that were tested in our key estimation task, our system automatically estimated
49 correctly. If this method generalizes, it may be a valuable, computationally
inexpensive way to estimate musical key without using audio. In the case that
the audio for a song is available, chord detection algorithms (like those mentioned
by Pauwels & Peeters (2013)) and state-of-the-art speech-to-text algorithms can be
employed to train joint embeddings.
Discarding songs that contained less than 50% English lyrics did not have a
great effect on the performance of the classification models. In fact, only two
songs listed in the Billboard charts and the UkuTabs dataset were removed by this
measure (“Feliz Navidad” by José Feliciano and “Gangnam Style” by Psy).
In the genre classification task, we notice that the Chords Only model has the
lowest performance by all metrics. However, when this information enriches lyrical
information, the resultant model has the highest performance. This finding suggests that there may be emergent qualities that arise when chordal and lyrical information are combined.
For the purpose of using chords and lyrics in tandem to predict music emotion,
we set up a task in which 15 participants rated 500 musical segments, each lasting 10 seconds. The responses seem to support existing findings in the literature
about differences in arousal ratings among groups. These differences contribute
to much of the variability in average valence and arousal ratings, which were used
for our MER task. Even after using triplet embeddings to better calculate group
averages of valence and arousal, we found that our embeddings did not perform
as well as we had hoped on our MER task. Further study will center around
determining if certain lyrical or chordal features elicit demographic differences in
responses.
3.8 Conclusion
We curated a dataset that contains 232,206 musical segments from 6,809 pop
songs, with lyrics and corresponding chords aligned. Using this data, we developed
a shared vector representation of the lyrics and chords together. We tested our
representation on a genre classification task by using a RAkEL model on the
average of the embeddings to predict genre labels given by the Billboard charts. We
developed three models to predict musical genre and music emotion: a model using
only chord embeddings, a model using only lyric embeddings, and a model using joint
chord-and-lyric embeddings. The model that uses joint embeddings outperformed
the baseline models and monolingual embedding models in several metrics in a genre
classification task, demonstrating the utility of taking a multimodal approach to
understanding and modeling human music perception. Our embeddings also perform on par with state-of-the-art NLP embeddings in a music emotion recognition task
that we created, containing valence and arousal annotations for 500 10-second-long
song excerpts. When combined with auditory features, our text-based embeddings
show even more improvement over baselines. This work applies to many areas,
including multimodal human perception, automatic genre classification, and music
information retrieval.
Chapter 4
Studying Music from Multiple
Views in a Context-Aware
Manner
4.1 Introduction
Music is universally enjoyed perhaps because of the power it has to move us.
Listening to music can boost our mood, give us chills, and even make us cry.
This chapter focuses on three aspects of the complex human experience of music
listening: neural (how the brain responds to music), physiological (how the body
responds to music), and emotional (how people report happiness or sadness during
music listening). By computationally analyzing how music influences these three
modes, we present a more complete picture of music’s role in human experience.
We apply a set of multivariate time series (MTS) prediction models to neural,
physiological, and subjective responses, using auditory features as predictors. We
compare these models and comment on what auditory features are important
for these predictions. We hypothesized that attention models, which nonlinearly
synthesize aural information from previous points in a song, would be best at
predicting these responses.
There are many open research questions in music perception. For example,
how can we model phase synchronizations in the brain during instrumental music
listening? Is it possible to correlate other involuntary processes, like galvanic skin
response and heart rate, with musical features? Can human responses be better
modeled using autoregressive models? Can we predict affective response at a highly
granular level in order to pinpoint specific times in a song that elicit an emotional
response?
We use MTS prediction models to answer some of these open questions. We
measure many subjects to get a better understanding of how humans react to
music at various bio-behavioral levels. Finally, we use machine learning models
with regularization and attention to identify particular features in music that are
relevant to subjective human-reported emotion responses.
This work is helpful in identifying what elements of music affect us most
profoundly. It also offers more insight into how we feel emotion while listening
to music. This work can inform myriad applications, including music emotion
recognition and music information retrieval, as well as basic research in neuroscience
and psychology.
4.2 Related Work
4.2.1 Neural Response
Many studies have investigated music’s effect on the human brain. Toiviainen
et al. suggest that the auditory cortex is involved in the processing of musical
features during continuous listening to music (Toiviainen et al., 2014). Others found
that musical features related to timbre and rhythm are processed in the superior
temporal gyrus (STG) and Heschl’s gyrus (Singer et al., 2016; Samson et al., 2011).
It has also been shown that consonant music engages different brain structures from
dissonant music (Koelsch, 2005). Some have looked at the unfolding of musical
emotions and their temporal attributes (Singer et al., 2016).
Other papers have looked at brain connectivity for identifying the structures
that correlate with music listening (Menon & Levitin, 2005). Studies have explored
emotion recognition from brain signals using higher order crossings (Petrantonakis
& Hadjileontiadis, 2009, 2010). Intersubject correlations of listeners have been
useful for brain mapping during listening (Wilson et al., 2007).
This study looks at phase synchronizations in bilateral Heschl’s gyri and superior
temporal gyri (STG) to identify how we process basic auditory information, as
in Glerean et al. (2012). In the past, brain activity has been determined using logistic
regression models (Ryali et al., 2010). In this study, we use autoregressive models
to predict intersubject phase synchronizations. These models integrate auditory
information from already-perceived stimuli to make predictions. We hypothesize
that models like these will be effective in predicting fMRI phase synchronizations,
as studies show that hemodynamic lag in fMRI studies is about five seconds (Stocco,
2014). Modeling involuntary responses in this way might tell us more about how
humans respond to music listening, and what features of music are responsible for
specific neural patterns.
4.2.2 Physiological Response
Studies in the past have also investigated physiological responses in subjects
as they listen to music. Some researchers have predicted heart rate (Riganello
et al., 2010, 2008; Ellis & Brighouse, 1952) and some have predicted skin con-
ductance response (SCR) using musical or audio features (Khalfa et al., 2002;
Dillman Carpentier & Potter, 2007).
These studies have variously predicted physiological response minute by minute,
song by song, or session by session. In this study, we use MTS models to determine
how our bodies react to music at a finer temporal resolution. We are interested in
attention models for this task because they can model nonlinearities and integrate
prior information in our time series data (Shih et al., 2018). Music’s effect on
psychophysiological response may not be linear, so using attention models could be
useful for prediction. We report if it is feasible to model these involuntary human
responses at a fine temporal level and, if so, which auditory features are most
responsible for causing these physiological reactions.
4.2.3 Affective Response
Several approaches have been taken to predict emotions encoded in and conveyed
by music (Huron, 2006). Researchers have used chords (Bakker & Martin, 2015),
auditory features (Trohidis et al., 2008b; Siedenburg et al., 2019), and multimodal
approaches (Schuller et al., 2010; Hu et al., 2009; Greer et al., 2019b) in music
emotion recognition (MER). Many state-of-the-art MER systems, such as Turnbull
et al. (2008) and Bischoff et al. (2009), have applied machine learning algorithms to
relatively short musical selections, using feature sets that span multiple domains.
A number of emotion classification tasks for music listening exist in the litera-
ture (Fan et al., 2017; Trohidis et al., 2008a; Han et al., 2010; Lu et al., 2006). The
present study uniquely considers emotion expressed in music at a highly granular
level (40 Hz), and uses information from prior timesteps to predict how emotion
during music listening changes over time. In this chapter, we use MTS models to
correlate continuously rated descriptions of emotion with musical features.
4.2.4 Multimodal Time Series Modeling
In order to analyze how we experience music at a fine temporal resolution, it
is necessary to model the multimodal measurements as multivariate time series.
Vector autoregression (VAR), a generalization of autoregressive (AR) models, is
a well-known model in MTS forecasting (Hamilton, 1995). While effective for
some tasks, neither AR-based nor VAR-based models can capture nonlinearity in
time series. For this reason, nonlinear models for time-series forecasting based
on kernel methods (Chen et al., 2008), ensembles (Bouchachia & Bouchachia,
2008), or Gaussian processes (Frigola et al., 2014) have been introduced. One
drawback to these approaches is that they apply predetermined nonlinearities and
may fail to recognize different forms of nonlinearity for different MTS. Recently,
Long Short-Term Memory systems (LSTMs) (Hochreiter & Schmidhuber, 1997),
variants of recurrent neural networks (RNNs), have also been employed for MTS
forecasting. The long- and short-term time-series network (LSTNet) (Lai et al.,
2018) was designed specifically for MTS forecasting, with the capability of jointly
modeling hundreds of time series. LSTNet uses CNNs to capture short-term patterns, LSTM or GRU units to memorize relatively long-term patterns, and traditional
autoregression to help mitigate the scale insensitivity of neural networks. Shih et
al. proposed an attention mechanism which does not need parameter tuning and is
adaptable to nonperiodic and nonlinear datasets (Shih et al., 2018). While these
models have been used in other MTS tasks, they have not been applied to studies
on music’s effect on human response. We use MTS models with regularization to
determine which auditory features are correlated with music listening experiences
at various levels.
4.3 Data Collection
In order to identify musical stimuli with high affective content, we explored
online music streaming sites, such as Spotify and Last.fm, as well as social media
sites such as Reddit and Twitter, for songs with social tags with the words “happy”
or “sad” and their synonyms. Social tags provided by users of online music streaming
sites have been shown to be an effective method for classifying music based on their
emotional content and correlation with acoustic features known to be associated
with a particular emotion (Song et al., 2012). To minimize any influence of prior
exposure to the songs, we selected songs with “happy” or “sad” tags from the pieces
with lowest play counts. This resulted in a list of 120 pieces: 60 sad pieces and
60 happy pieces, some with lyrics and some without lyrics. Eight human coders
listened to 30-second clips from these pieces and rated whether they conveyed
either happiness or sadness. All pieces in which at least 75% of coders agreed on
the intended emotion were then included in an online survey that was completed
by 82 adult participants via Amazon’s Mechanical Turk. The survey included
60-second clips from 27 pieces of music and asked participants to rate how much
they enjoyed the piece, what emotion they felt in response to the piece (sadness,
happiness, calmness, anxiousness, boredom), and how familiar they were with the
piece using a 5-point Likert scale. Each participant was presented with only 12
clips of music selected at random to ensure that the Mechanical Turk workers were
not overloaded.
Due to the potential confounds associated with the semantic information con-
veyed through the lyrics of a song (Brattico et al., 2011), we only selected pieces
that did not contain lyrics. We additionally excluded pieces that were rated as
highly familiar to prevent bias. Based on these criteria from the survey, we selected
three pieces of music to be used for this study: (1) a shorter piece that reliably
induces sadness (Ólafur Arnalds’s “Fyrsta,” the “sad short song” with duration
256 seconds); (2) a longer piece that reliably induces sadness (Michael Kamen’s
“Discovery of the Camps,” the “sad long song” with duration 515 seconds); and (3)
a piece that reliably induces happiness (Lullatone’s “Race Against The Sunset,”
the “happy song” with duration 168 seconds). The wav files of these songs were
used for playing to participants and extracting auditory features.
4.3.1 Neural Data
A group of 40 healthy, right-handed, adult participants (19 female, average age
24.3, standard deviation 6.4) was recruited from the greater Los Angeles community
based on responses to an online survey in which they listened to a 60-second clip of
the final three pieces. Only participants who were not familiar with the pieces of
music and reported feeling happiness during the happy song’s clip or sadness during
the sad song’s clip were asked to participate in the scanning portion of the study.
All participants in this study had normal hearing, normal or corrected-to-normal
vision, and no history of neurological or psychiatric disorders. All participants
were instructed to listen attentively to the three songs with their eyes open. The
auditory stimuli were presented through MR-compatible OptoACTIVE headphones
with noise cancellation and the order of the pieces was counterbalanced across
participants.
MRI Data Acquisition
Imaging was conducted using a 3-T Siemens MAGNETOM Trio system with a 32-channel matrix head coil. Functional images were acquired using a multiband, gradient-echo, echo-planar, T2*-weighted pulse sequence with repetition time (TR) = 1000 ms, echo time (TE) = 25 ms, flip angle = 90°, and a 64 × 64 matrix. Forty slices covering the entire brain were acquired with a voxel resolution of 3.0 × 3.0 × 3.0 mm with no interslice gap. A T1-weighted high-resolution (1 × 1 × 1 mm) image was also acquired using a three-dimensional magnetization-prepared rapid acquisition gradient echo (MPRAGE) sequence (TR = 2530 ms, TE = 3.09 ms, flip angle = 10°, 256 × 256 matrix). Two hundred and eight coronal slices covering the entire brain were acquired with a voxel resolution of 1 × 1 × 1 mm.
4.3.2 Psychophysiological Recordings
Sixty different healthy adult participants (36 female, average age 19.5, standard
deviation 2.9) were recruited from the greater Los Angeles community based on
responses to the same online survey. Only participants who were not familiar with
the pieces of music and reported feeling happiness during the happy song’s clip
or sadness during the sad song’s clip were asked to participate. Participants had
normal hearing and no history of neurological or psychiatric disorders. None of the
participants in the fMRI study were recruited for this study.[1]
The auditory stimuli were presented through Sennheiser HD 280 PRO head-
phones. Heart activity and skin conductance of the participants were collected
during music listening using a BIOPAC MP150 system. Each subject listened to
each song twice, once rating for emotion and once rating for enjoyment.[2] The order of annotation was counterbalanced. The BIOPAC software processed the galvanic skin response (GSR) as the data were collected, subtracting out the tonic skin conductance level (SCL).
[1] Originally, we had intended to use only the fMRI participants for the entire study. However, after measurement of the psychophysiological responses, we noticed that artifacts created by the scanner rendered these recordings unusable.
[2] Results relating to enjoyment ratings are reported by Ma et al. (2019b).
4.3.3 Emotion Ratings
All sixty participants (from the physiological study) were instructed to lis-
ten attentively to the music and simultaneously report changes in their affective
experience using a fader with a sliding scale. Participants continuously reported
the intensity of felt emotion, from 0 to 10, depending on which piece was being
presented.[3] The order of the pieces was counterbalanced.
[3] In the sad short song and sad long song, 10 indicated "extremely sad" and 1 indicated "not at all sad." In the happy song, 10 indicated "extremely happy" and 1 indicated "not at all happy."
4.3.4 Auditory Features
Past research suggests that auditory features related to dynamics, timbre,
harmony, rhythm, and register have been correlated with emotion (Kim et al.,
2010). Dynamics refer to “loudness” and change in “loudness” of music, timbre
refers to tone quality of music, harmony refers to musical pitches, rhythm refers to
properties of the musical beat, and register refers to the concentration of frequencies
in music.
Seventy-four features that capture dynamics, timbre, harmony, rhythm, and
register were extracted using the MIRtoolbox in Matlab (Lartillot et al., 2008) (see
Table 4.1). These features were extracted using a sliding window with a duration
of 50 ms and a step size of 25 ms, similar to (Lu et al., 2006).
Mel frequency cepstral coefficients (MFCCs) were calculated with Matlab’s mfcc
function (MATLAB Audio Toolbox, 2019) using a Hamming window, pre-emphasis
coefficient of .97, frequency range of 100-6400 Hz, 20 filterbank channels, and 22
liftering parameters, similar to (de Leon & Martinez, 2014). Compressibility was
calculated by computing the ratio between the file size of each window’s wav format
and that same window's mp3 format, after being converted with ffmpeg (Developers, 2019).
Table 4.1: Auditory Features Used
Feature Name Type Notes
MFCCs Timbre 13 Features
ΔMFCCs Timbre 13 Features
ΔΔMFCCs Timbre 13 Features
HCDF Timbre
Spectral Flux Timbre
Skewness Timbre
Kurtosis Timbre
LPCs Timbre 13 Features
Chroma Harmony 10 Features
Key Strength Harmony
Spread Harmony
Key Mode Harmony Major or Minor
Centroid Register
Brightness Register
Compression Ratio Dynamics Used ffmpeg
RMS Dynamics
Pulse Clarity (Lartillot et al., 2008) Rhythm
Key strength is taken as the maximum value of the 24-dimensional output
vector from MIR Toolbox’s key_strength function. This function outputs a vector
containing the probability that a musical segment is in each major or minor key.
All other features were extracted using MIR Toolbox’s eponymous functions with
default parameters.
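To make the compressibility feature concrete, the following minimal Python sketch computes the wav-to-mp3 file-size ratio for a single analysis window using ffmpeg. The feature extraction in this study was performed in MATLAB, so the function below, its name, and its encoding settings are illustrative assumptions rather than the exact implementation.

    import os
    import subprocess
    import tempfile

    def compressibility(window_wav_path):
        """Ratio of a window's wav file size to its mp3 file size.

        Hypothetical helper: the window is assumed to have already been
        written to disk as a wav file; default ffmpeg mp3 settings are used.
        """
        mp3_path = tempfile.mktemp(suffix=".mp3")
        subprocess.run(
            ["ffmpeg", "-y", "-loglevel", "quiet", "-i", window_wav_path, mp3_path],
            check=True,
        )
        ratio = os.path.getsize(window_wav_path) / os.path.getsize(mp3_path)
        os.remove(mp3_path)
        return ratio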
4.4 Methods
4.4.1 Neural data
Phase synchronizations were conducted to assess the temporal dynamics of the
Heschl’s gyrus and STG in response to the musical stimuli. Standard pre-processing
steps were conducted on all data across the entire brain before extracting the time
series from the regions of interest.
Phase Synchronizations
To capture the time-varying patterns of these stimulus-driven responses, we
calculated a dynamic measure of neural synchronization. To avoid issues with
selecting an arbitrary window size and to increase the temporal resolution, we
used an approach that evaluates blood oxygenation level-dependent (BOLD)-signal
similarity across participants by calculating differences in phasic components of the
signal at each moment in time. The fMRI Phase Synchronization Toolbox was used
to calculate dynamic intersubject phase synchronization (ISPS) (Glerean et al.,
2012). The filtered, preprocessed data was first band-pass filtered through 0.025 Hz
(33 s) to 0.09 Hz (~11 s) because the concept of phase synchronization is meaningful
only for narrow-band signals. Using Hilbert transforms, the instantaneous phase
information of the signal was determined and an intersubject phase coefficient was
calculated for each voxel and echo planar image in the time series by evaluating the
phasic difference in the signal for every pair of participants and averaging. This
results in a value between 0 to 1 at each repetition time that represents the degree
of phase synchronization across all participants at that particular image.
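A rough Python analogue of this computation for a single voxel is sketched below (the study used the MATLAB-based fMRI Phase Synchronization Toolbox). The filter order and the use of the length of the mean phase vector as the synchronization index are assumptions: the toolbox evaluates pairwise phase differences, whereas the sketch uses the closely related mean-phase-vector length, which also lies between 0 and 1.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def dynamic_isps(voxel_ts, tr=1.0, band=(0.025, 0.09)):
        """Dynamic intersubject phase synchronization for one voxel.

        voxel_ts: array of shape (n_subjects, n_timepoints), one preprocessed
        BOLD series per subject (TR in seconds). Returns one value per TR.
        """
        fs = 1.0 / tr
        # Narrow-band filter, since phase is only meaningful for narrow-band signals.
        b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, voxel_ts, axis=1)

        # Instantaneous phase from the analytic signal (Hilbert transform).
        phase = np.angle(hilbert(filtered, axis=1))

        # Length of the mean phase vector across subjects at each timepoint, in [0, 1].
        return np.abs(np.exp(1j * phase).mean(axis=0))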
4.4.2 Physiology Data
Galvanic Skin Response
Each subject from the group of 60 individuals listened to each song twice and
their galvanic skin response was measured twice. We used an SCR tagger (Chaspari
et al., 2015) to identify each individual’s SCR arrivals for both listens. This
software identifies potential artifacts and plots the SCR spikes, which were manually inspected for correctness. SCR arrivals were culled into two separate, collective SCR arrival plots, one for each listen (see Figure 4.1). We then used a kernel density estimation (KDE) function (Scott, 2015) with several feasible bandwidths to model both collective arrival plots and chose the bandwidth that had the lowest Kullback-Leibler divergence (Cover & Thomas, 2006) between the two KDEs. The best bandwidth was found to be 8 seconds (see Figure 4.2). We averaged the two KDEs and resampled the resultant KDE at 40 Hz to line up the function with the auditory features. We removed the first 30 seconds of data (<18% of each song) so that the SCR would stabilize.

Figure 4.1: SCR plot of one listen of the sad short song. SCR spikes seem to occur after the entrance of new instruments or the start of a musical crescendo.
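A sketch of the bandwidth selection step described above is given below, assuming the two listens' SCR arrival times are available as one-dimensional arrays of seconds. It uses scikit-learn's kernel density estimator and SciPy's entropy function for the Kullback-Leibler divergence; the candidate bandwidths, the evaluation grid, and the direction of the divergence are assumptions.

    import numpy as np
    from scipy.stats import entropy
    from sklearn.neighbors import KernelDensity

    def kde_on_grid(arrival_times, bandwidth, grid):
        """Gaussian KDE of arrival times, evaluated and normalized on a time grid."""
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
        kde.fit(np.asarray(arrival_times).reshape(-1, 1))
        density = np.exp(kde.score_samples(grid.reshape(-1, 1)))
        return density / density.sum()

    def best_bandwidth(arrivals_listen1, arrivals_listen2, song_length_sec,
                       candidates=(2, 4, 8, 16)):
        """Choose the bandwidth (in seconds) minimizing the KL divergence
        between the two listens' collective SCR arrival KDEs."""
        grid = np.arange(0.0, song_length_sec, 1.0 / 40.0)  # 40 Hz, matching the features
        kl = [entropy(kde_on_grid(arrivals_listen1, bw, grid),
                      kde_on_grid(arrivals_listen2, bw, grid))
              for bw in candidates]
        return candidates[int(np.argmin(kl))]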
Heart Response
Each subject was also measured for heart activity as they listened to each song
twice. Similar to the GSR analysis above, we culled the heart beat arrivals of all
participants for each song and created a KDE with a bandwidth that resulted in the
lowest Kullback-Leibler divergence between the two distributions. The bandwidth
was found to be 0.5 seconds. We then averaged the two KDEs and resampled the
resultant KDE at 40 Hz to line up the function with the auditory features. We removed the first 30 seconds of data to ensure that the response had stabilized.

Figure 4.2: SCR plot of two listens of the happy song. When the KDE is constructed with a bandwidth of 8 seconds, the plots are similar.
We used averaged heart rate variability (HRV) as another response to predict from musical features. We first collected heartbeat arrivals of all participants for both
listens of each song. Then, HRV was calculated by taking the standard deviation
of the interbeat interval of this collection of heart beat arrivals and averaging this
value. Consistent with (Schaaff & Adam, 2013), we used a window of 40 samples
to create an ultra-short-term HRV measure (Shaffer & Ginsberg, 2017). We then
resampled this at 40 Hz to line up with auditory features. We removed the first 30
seconds of data so that HRV could stabilize.
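The ultra-short-term HRV measure described above can be sketched as a rolling standard deviation of interbeat intervals. The helper below assumes the pooled heartbeat arrival times are given in seconds; the subsequent 40 Hz resampling (e.g., with numpy.interp) is omitted for brevity.

    import numpy as np

    def ultra_short_hrv(beat_times, window=40):
        """Rolling SDNN over pooled heartbeat arrivals.

        beat_times: sorted 1-D array of heartbeat arrival times (seconds).
        Returns (centers, sdnn): the time at the center of each window of 40
        interbeat intervals and the standard deviation of the intervals in it.
        """
        ibi = np.diff(beat_times)  # interbeat intervals
        centers, sdnn = [], []
        for start in range(len(ibi) - window + 1):
            chunk = ibi[start:start + window]
            centers.append(beat_times[start + window // 2])
            sdnn.append(chunk.std(ddof=1))
        return np.array(centers), np.array(sdnn)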
4.4.3 Emotion Ratings
For each song, we averaged the emotion annotation at each time step across
participants. We removed the first 30 seconds of annotations so that annotations
could stabilize, as the fader started on the lowest annotation at the start of each
song. We resampled this signal to 40 Hz to match the auditory features’ sampling
frequency.
4.5 Results
We trained several baseline and state-of-the-art models for multivariate time series (MTS) prediction: AVG (a model which naïvely predicts the average of the training data), LASSO-T (least squares regression at each timestep with ℓ1-regularization), Ridge-T (least squares regression at each timestep with ℓ2-regularization), LASSO-VAR (a distributed lag model with ℓ1-regularization), Ridge-VAR (a distributed lag model with ℓ2-regularization), and Temporal Pattern Attention (TPA) (Shih et al., 2018). The
least squares models use only auditory information at the timestep of prediction
and the VAR models use information before the timestep of prediction. TPA
is an attention model that nonlinearly synthesizes data from previous timesteps.
Unlike the other models, TPA uses both auditory features and response values from
previous timesteps for prediction.
For the LASSO and Ridge models, the regularization parameters tried were 10^i and 5 × 10^i for i ∈ [−7, −6, ..., 6, 7]. The parameter with the lowest test root mean squared error (RMSE) was chosen for each experiment. Response variables were scaled by their maximum value and features were ℓ2-normalized. Standard loss functions for LASSO and ridge regression were used.
For autoregressive models, two parameters were set: the time horizon and the
autoregression length. A model with time horizon h and autoregression length
a predicts the response ŷ_t at timestep t according to

\hat{y}_t = \alpha_0 x_{t-h} + \alpha_1 x_{t-h-1} + \cdots + \alpha_a x_{t-h-a},

where α_i is a coefficient vector for a particular timestep and x_k is a vector of auditory features at timestep k.
For brain data, h = 3 (three seconds) and a = 6 (six seconds). This was
constructed to capture hemodynamic lag, which is around five seconds (Stocco,
2014). For SCRs, h = 20 (.5 seconds), a = 160 (4 seconds), consistent with (Khalfa
et al., 2002), which says that SCR lags about four seconds behind music. For HRV
and heartbeat arrival KDEs, h = 10, a = 80, to capture quick changes in heart
rate and HRV (.25-4.25 seconds). Xiao et al. (2008) and Fan et al. (2016) note that
mood stabilizes in about six seconds in music emotion recognition. We assume that
emotion stabilization in this task will be shorter than six seconds, as the songs
we selected are repetitive and contain more context than the clips used in MER
studies. We use h = 80 (two seconds), a = 80 (two seconds), to capture latency in
feeling and subsequently reporting felt emotion.
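A minimal sketch of how such a distributed lag problem can be assembled and fit is given below, using scikit-learn's Lasso. The horizon and autoregression length are given in samples, and all variable names (and the example regularization strength) are placeholders rather than the exact configuration used here.

    import numpy as np
    from sklearn.linear_model import Lasso

    def lagged_design(X, y, horizon, lags):
        """Build a distributed lag regression problem.

        X: (n_timesteps, n_features) auditory features; y: (n_timesteps,) response.
        Each target y[t] is paired with the feature frames x_{t-h}, ..., x_{t-h-a}.
        """
        rows, targets = [], []
        for t in range(horizon + lags, len(y)):
            window = X[t - horizon - lags: t - horizon + 1]  # a + 1 past frames
            rows.append(window[::-1].ravel())                # most recent frame first
            targets.append(y[t])
        return np.asarray(rows), np.asarray(targets)

    # Example: an SCR model with h = 20 samples (0.5 s) and a = 160 samples (4 s) at 40 Hz.
    # A, b = lagged_design(X_train, y_train, horizon=20, lags=160)
    # model = Lasso(alpha=1e-3).fit(A, b)  # alpha would be chosen from the grid described above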
We conducted a grid search over tunable parameters for TPA. Due to the time
required to train this model with large attention lengths, we used an attention
length equal to half of the autoregression length described above. We tried a range
of values for the number of hidden units: 4, 16, 32, 64, and 256. We found that
using 32 hidden units resulted in good performance and relatively fast training
times when predicting rated emotion in the happy song. We tried learning rates
1e-4, 3e-4, 1e-3, 3e-3, and 1e-2 and found that 1e-3 was the best learning rate when
predicting rated emotion in the happy song. These values were used for every
experiment. We normalized each time series by the maximum value in the series.
Lastly, we used the absolute loss function and Adam with a 1e-3 learning rate. We
used 3 layers for all RNNs, as done by Chuan & Herremans (2018), and did not fix
the trainable parameters to any specific number of units. TPA was not used for
neural responses, as there is not enough data for the model to perform well.
For every experiment, the first 30 seconds of each song were removed, ensuring
that responses would be stable. Then, the last 20% and the first 20% of these
Table 4.2: Feature Index
Number Feature Name Description
1-13 1st-13th MFCC Timbre
14-26 1st-13th ΔMFCC Change in timbre
27-39 1st-13th ΔΔMFCC Change in timbre
40 Pulse Clarity Strength of beats
41 Brightness % of high-end frequencies
42 Key Strength How likely a key is
43 RMS Loudness
47 Kurtosis Change in spectrum
48-59 1st-12th Chroma Strength of C, C#, D..., B
61 Compressibility Complexity
63 Spectral Flux Harmonic change
clipped songs were used as test sets and the other 80% of each song was used
as training sets, resulting in two “folds.” Features with statistically significant
coefficients (p < .01) in both folds of the best performing model are listed as
important in predicting responses. In the event that more than three features are
listed, we only report the three most significant features.
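For clarity, the two "folds" described above can be expressed as the following index split, assuming the response series has already had its first 30 seconds removed. This is a sketch of the splitting scheme only, not of the full evaluation code.

    import numpy as np

    def two_folds(n_timesteps):
        """Return (train_idx, test_idx) pairs: test on the last 20% or the
        first 20% of the clipped song, train on the remaining 80%."""
        cut = int(0.2 * n_timesteps)
        idx = np.arange(n_timesteps)
        fold_last = (idx[:-cut], idx[-cut:])   # train on first 80%, test on last 20%
        fold_first = (idx[cut:], idx[:cut])    # train on last 80%, test on first 20%
        return [fold_last, fold_first]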
4.5.1 Phase Synchronizations
Predictions
We found that LASSO-VAR was generally the best model for predicting phase
synchronizations using auditory features. See Tables 4.3 and 4.4.
Relevant Auditory Features
Pulse clarity, brightness, and RMS are particularly relevant features for pre-
dicting ISPS in the Heschl’s gyrus for the sad short song and sad long song, while
MFCCs contributed to the LASSO-VAR model in the happy song. A list of feature
numbers with descriptions are provided in Table 4.2.
Table 4.3: STG - Test RMSE
Left Right
SS SL H SS SL H
AVG .311 .296 .383 .339 .311 .416
LASSO-T .310 .288 .383 .324 .282 .408
Ridge-T .310 .287 .382 .327 .324 .416
LASSO-VAR .311 .268* .359* .291* .273* .379*
Ridge-VAR .364 .270 .364 .380 .279 .379*
Features N/A 40,43 4 43,51 40,43 4
Table 4.4: Heschl’s Gyrus - Test RMSE
Left Right
SS SL H SS SL H
AVG .335 .253 .277 .269 .253 .199
LASSO-T .307 .230* .262 .269 .248 .194
Ridge-T .330 .230* .271 .270 .250 .196
LASSO-VAR .289 .258 .223* .287 .242* .173*
Ridge-VAR .281* .306 .307 .286 .243 .176
Features 40,41 40,43 4 N/A 40,43 37,40
The left STG ISPS was very difficult to model in the sad short song. However,
the 4th MFCC contributed to VAR models predicting ISPS in both hemispheres in
the happy song, while pulse clarity and RMS contributed to modeling the ISPS in
the STG during the sad long song. Chroma features were useful for modeling the
right STG in the sad short song.
4.5.2 Physiology Data
Predictions
We found that the VAR models were the best models for predicting KDEs of SCR and heartbeat arrivals using auditory features. Ridge-T shared the lowest root mean squared error (RMSE) values, but did not significantly outperform the baseline AVG model in HRV prediction. See Tables 4.5 and 4.6.
Table 4.5: SCR - Test RMSE
Sad Short Sad Long Happy
AVG 0.164 0.159 0.108
LASSO-T 0.153 0.150 0.107
Ridge-T 0.142 0.137 0.105
LASSO-VAR 0.126* 0.138 0.098*
Ridge-VAR 0.128 0.102* 0.136
TPA 1733 1440 1197
Features 43,63,49 40 61
Table 4.6: Heart Activity - Test RMSE (Entries × 10^{-1})
KDE HRV
SS SL H SS SL H
AVG .253 .186 .541 .924 .888 .918
LASSO-T .230 .175 .536 .925 .888 .919
Ridge-T .240 .175 .545 .923 .888 .914
LASSO-VAR .240 .186 .510* .927 .890 .915
Ridge-VAR .218* .173* .531 .928 .891 .914
TPA 20.8 20.7 19.6 42.6 55.8 12.9
Features 41,42 54,61 4 N/A N/A N/A
Relevant Features
Brightness, key strength, compressibility, and pulse clarity are particularly
relevant for predicting the heartbeat arrival KDE.
Heart rate variability was difficult to predict from auditory features: none of
the models performed significantly better than the baseline model, and we do not
report any variables which contribute to HRV prediction.
Table 4.7: Reported Emotion - Test RMSE
Sad Short Sad Long Happy
AVG 0.213 0.117 0.0927
LASSO-T 0.215 0.104 0.0920
Ridge-T 0.215 0.105 0.0920
LASSO-VAR 0.214 0.095 0.0907
Ridge-VAR 0.225 0.107 0.0948
TPA 0.035* 0.031* 0.0342*
Features 17,35,47 21,37,55 6,21,39
Chroma, RMS and spectral flux were useful for predicting the SCR KDE for
the sad short song, while pulse clarity and compressibility were useful in the sad
long song and happy song, respectively.
4.5.3 Emotion Ratings
Predictions
TPA was the best model for predicting emotion ratings using auditory features.
See Table 4.7.
Relevant Features
We found that MFCCs, ΔMFCCs, and ΔΔMFCCs were used for predicting
emotion in all songs. Additionally, kurtosis was a relevant feature for modeling
emotion ratings in the sad short song, and the 8th chroma was important for
modeling emotion ratings in the sad long song.
4.6 Discussion
4.6.1 Neural
We found that LASSO-VAR was generally best at predicting phase synchro-
nizations in the STG and Heschl’s Gyrus. This was expected: we know from
the literature that there is a delay between neural activity and the hemodynamic
response that is collected in fMRI, so a model that accounts for this lag was likely
to be best for prediction.
The features that contributed most to the best performing models for the right
STG in the sad short song were the 4th chroma (the note Eb) and RMS. ISPS
in the STG in both hemispheres were best modeled by pulse clarity and RMS
in the sad long song, and the 4th MFCC contributed to predicting ISPS in the
STG in both hemispheres in the happy song. This finding is supported in the
literature (Toiviainen et al., 2014; Alluri et al., 2012), dynamic features and timbral
features were also found to be correlated with brain activity. It is interesting to
note that ISPS in the left STG was difficult to model for the sad short song, maybe
because the right STG has been shown to be particularly correlated with processing
music (Koelsch et al., 2005).
Figure 4.3 shows LASSO-VAR’s predictions for the right STG in the sad long
song. LASSO-VAR’s predictions mostly overestimate ISPS, but the predictions
seem to follow the contour of the actual values in the test set. Using pulse clarity
and RMS from previous timesteps as features, LASSO-VAR outperforms other
models.
In the Heschl’s gyrus, pulse clarity, RMS, and the 4th MFCC again contributed
the most to ISPS. It stands to reason that RMS would be an important feature for
prediction, as it indicates how “loud” a song sounds at a particular time. The 4th
MFCC contributed to modeling the ISPS during listening to the happy song, indicating that subjects were responding to timbre features in the lower frequencies.

Figure 4.3: Prediction vs. actual value of the ISPS in the right STG in the sad long song. The predictions were made on the first 20% of the song.
4.6.2 Physiological
VAR models were found to be best at predicting SCR. Pulse clarity, spectral
flux, and the second chroma were strong predictors for SCR spike frequency in the
sad short song. The second chroma refers to the presence of the Db note, which is a
dissonant note in the key of the song. Pulse clarity was found to be a contributing
feature for modeling SCR in the sad long song. Compressibility was found to be
important in predicting SCR spikes in the happy song. Harmony, dynamics, timbre,
and rhythm all contribute to physiological response.
VARs were also best at predicting the heartbeat arrival KDE. Key strength and
brightness (in the sad short song), the 7th Chroma and compressibility (in the sad
long song), and the 4th MFCC (in the happy song) were predictive of heartbeat
arrival. The 7th chroma (the note F#) is dissonant in the key of the sad long song.
Excitement and stress in a song, which are linked to increased heart rate (Cannon, 1916), can be generated by loudness and by rhythmic and tonal ambiguity.
There were no models that significantly outperformed the baseline model for
HRV prediction. This may be because many of our measurements were taken after
half of the participants had already heard the song before, which may diminish
the effect of new or unexpected musical events on participants’ reactions and
responses. Using an ultra-short-term measure of HRV may have also affected these
results. More subjects may be required for making improved predictions about
when subjects collectively experience a physiological response.
The TPA model performs very poorly on physiological data. This could be due to several reasons: 1) the parameters may have been set suboptimally, as the TPA parameters were tuned using the emotion ratings for the happy song; 2) the model could be overfitting, relying only on previous response values as features; or 3) the attention length was too short (it is one-half of the autoregression length in the other models), so TPA did not have as much data to use as the other models.
4.6.3 Emotional
TPA was remarkably good at modeling emotional response, perhaps because the emotion rating signal is not stationary (see Figure 4.4).
The TPA models for each song leveraged previous emotion labels to make
accurate predictions, but several other features contributed to the model. Most
of the variables were timbral features, like MFCCs, ΔMFCCs, and ΔΔMFCCs.
Additionally, in the sad short song, kurtosis was a relevant feature and in the sad
long song, the 8th chroma (the note G) was important for modeling emotion ratings.
The sad long song is in G minor, so the presence of the tonic note (G) is positively
correlated with higher sadness ratings.
Figure 4.4: Emotion ratings for the sad short song. The signal is not stationary, so
models that use information of previous labels, like TPA, will likely perform best.
It is interesting to note that TPA and VAR models use different features for
prediction. We found that LASSO-VAR was the second-best at predicting reported
emotions. In the sad long song, compressibility and brightness contributed to
the best LASSO-VAR model. In the happy song, compressibility was helpful in
modeling emotion. This makes sense: stronger emotion is evoked by complex
interplay of various musical components, and therefore, the complexity of a music
signal may be correlated with affect (Kumar et al., 2014).
4.7 Conclusion
Music affects our bodies and felt emotions. In this study, we investigated how
human brain, body, and emotions are affected by music listening. By looking at
eight human reaction modes within three types of human response, we presented a
more complete picture of how music affects human experience. We applied a set of
multivariate time series prediction models to our response modes using auditory
features in sad and happy musical pieces as predictors. We presented the utility of
using distributed lag models in predicting these responses, compared performance
of various prediction models, and commented on what auditory features are most
relevant to predicting these responses. We hypothesized that an attention model
would perform best on these tasks, as they can nonlinearly synthesize data from
previous musical moments to predict responses. However, attention models only
performed best at modeling emotion ratings; distributed lag models worked best on
involuntary human responses. This work offers a more holistic picture of how we
respond to emotional music and can be used in music emotion recognition, music
information retrieval, neuroscience, and psychophysiology.
Chapter 5
Embedding and Quantifying
Responses to Music
5.1 Introduction
Music emotion recognition tasks are of great interest to academic and industrial
researchers, due to applications in fields as disparate as music therapy (Sourina et al.,
2012) and music recommender systems (Ferwerda & Schedl, 2014). The relationship
between music and measures of experience, however, has several components (both
objective and subjective), which are related to the theory of music, physiological
responses of listeners, and the accuracy with which listeners are able to express
their experiences.
In the previous chapter we computationally analyzed the link between how
people report happiness and sadness during music listening as part of an effort to
form a more complete picture of music’s role in human affective experience (Greer
et al., 2019a). We applied a set of multivariate time series prediction models to
subjective responses and used auditory features as predictors, analyzing which
elements of the musical stimuli were important for predicting emotion and enjoyment.
However, in these studies (and many studies, in fact), the effect of the “ground
truth” labels on the task was not studied (Kim et al., 2010). If these deeply complex
constructs were more faithfully and/or accurately represented, perhaps avenues
would open up for better quantification, analysis, and explanation of results about
how music listening affects human experience.
A key aspect of supervised learning is the need for accurate labels that appropriately represent the construct one wants to study. In the
last chapter, we averaged the labels collected from music listeners in real time
to create a single label, providing a basic understanding of group-level models of
experience. However, different methods of merging the labels of multiple annotators
(usually called annotation fusion in the affective computing literature) have been
proposed, which aim to create aggregate labels in ways that are more robust to
noise and artifacts than simple averaging. These methods are usually designed and
employed to estimate a ground truth variable for a subjective construct, such as
a dimension of affect that a person is conveying through audio. In this chapter,
we study the effect of different annotation fusion methods on the study of two subjective, group-level emotional responses to music.
5.1.1 Related Work
Several widely used music emotion recognition (MER) tasks involve analyzing
reported human responses to music listening (Alajanki et al., 2016b; Chen et al.,
2015; Zhang et al., 2018). In order to draw meaningful, accurate conclusions using
annotations from these datasets, it is necessary to understand and represent the
latent states of the subjects’ responses. Therefore, it is necessary to investigate the
role of annotation fusion methods for group-level experiences during music listening,
such as emotion and enjoyment. In this chapter, we explore several annotation
fusion methods, using them to generate a single label when subjects have annotated
the same target (in this case, a musical stimulus). This label is a proxy to generate
group-level models of behavioral music experience.
Many researchers have proposed algorithms for fusing annotations to generate
a set of labels for use as ground truth in machine learning. A general method is
averaging individual annotations after first performing time alignment to remove
artifacts produced by latencies in annotation that vary from person to person.
Mariooryad (Mariooryad & Busso, 2015) demonstrates an improvement in clas-
sification performance by aligning the annotations in time with a uniform shift
computed per annotator based on mutual information, then averaging. Dynamic
time warping (DTW) (Sakoe & Chiba, 1978) is another popular time-alignment
method which warps time to maximize alignment with annotations (Müller, 2007).
Some methods, like canonical correlation analysis (Hotelling, 1936) and correlated
spaces regression (Nicolaou et al., 2013), learn to warp the fused annotation space so
the resultant fusion is more correlated with its associated features. Many combined
time and space warping methods have also been proposed: canonical time warp-
ing (Zhou & Torre, 2009), generalized time warping (Zhou & De la Torre, 2012),
and deep canonical time warping (Trigeorgis et al., 2017) are some examples. More
recent work explores other strategies for computing a “gold-standard.” Lopes et
al. (Lopes et al., 2017) show that in some cases the gradient in an annotation is more
informative than the annotation value, which can be exploited to produce a better
ground truth based on sudden changes in annotations. Booth et al. (Booth et al.,
2018b) hypothesize that trends in the annotation values contain more meaning than
the values themselves; they propose a fusion approach using additional comparative
information collected from humans to produce labels. These methods have been
extended to cases where only continuous annotations are present (Booth et al.,
2018a) or when the annotation scheme can be discretized (Mundnich et al., 2019).
These algorithms approach annotation fusion in many different ways, but the accuracy of any fused annotation fundamentally cannot be measured for effectiveness or quality when the underlying construct is a latent mental or behavioral state.
5.1.2 Contributions
We use four annotation fusion methods—time alignment, dynamic time warping,
expectation maximization, and triplet embeddings—to investigate the accuracy
of fused annotations of real time emotion and enjoyment reports to music. Some
anecdotal evidence from Booth et al. (2018b) suggests that consistency may not be
preserved during continuous real time annotation, but we aim to show that making
this simplifying assumption still produces quality fusions for behavioral responses
to music listening. Our findings suggest that certain fusion techniques can improve
prediction in music emotion recognition, thereby implying their broader utility in annotation tasks where a latent construct must be estimated.
5.2 Annotation Fusion Methods in Affective
Computing
We use several annotation fusion techniques here. Annotation fusion methods
usually contain two key steps: (1) time-aligning the annotations acquired in real
time to account for the reaction lags introduced in the annotation process, and
(2) combining the time-aligned annotations to generate a single annotation that
faithfully represents all annotators.
5.2.1 Time Alignment
Annotations that are collected in real-time usually have reaction-time lags
with respect to the features being annotated. Therefore, time-alignment of these
annotations with a (sub)set of features is required. We now give a brief review of
each one of these methods.
Dynamic time warping (DTW) (Sakoe & Chiba, 1978) has been traditionally
employed to perform time alignment. DTW aligns two signals to one another by
matching similar points in each series, thereby minimizing the distance between
the resultant warped signals. In the case of multiple annotation signals, a feature
correlated with the responses is often used to warp each annotation.
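As an illustration of this alignment step, the sketch below warps a single real-time annotation onto the timeline of a correlated reference feature (such as RMS) using librosa's dynamic time warping routine. The choice of library, the Euclidean local cost, and the handling of many-to-one matches are assumptions rather than the exact procedure used in the cited work.

    import numpy as np
    import librosa

    def dtw_align(annotation, reference):
        """Warp a 1-D annotation onto the timeline of a 1-D reference signal.

        Both signals are assumed to be sampled at the same rate; z-scoring
        them beforehand generally helps the alignment.
        """
        D, wp = librosa.sequence.dtw(X=reference[np.newaxis, :],
                                     Y=annotation[np.newaxis, :],
                                     metric="euclidean")
        warped = np.empty(len(reference))
        for i, j in wp[::-1]:          # the path is returned from end to start
            warped[i] = annotation[j]  # later matches overwrite earlier ones
        return warped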
A different method, called EvalDep and proposed by Mariooryad et al. (Mari-
ooryad & Busso, 2015), finds the lag by maximizing the mutual information between
a feature and the annotations. The estimation is done either non-parametrically by
using kernels to estimate the density functions, or by assuming that the features
and labels are Gaussian, in which case the mutual information is defined by

I(X;Y) = \frac{1}{2} \log\!\left( \frac{\det(\Sigma_{XX})\,\det(\Sigma_{YY})}{\det(\Sigma)} \right), \qquad \Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}, \qquad (5.1)

where X is a feature, Y is a label, and the Σ blocks are the covariance and cross-covariance matrices. Then, the estimated lag is given by

\hat{\tau} = \arg\max_{\tau} I(Y; X). \qquad (5.2)

This lag is applied to Y to obtain the time-aligned annotations.
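In the one-dimensional case used here (a single feature and a single averaged annotation), Eq. (5.1) reduces to I = −(1/2) log(1 − ρ²), and the lag search of Eq. (5.2) can be sketched as follows; the shift direction and the sample-level search grid are assumptions.

    import numpy as np

    def gaussian_mi(x, y):
        """Mutual information of two 1-D signals under a joint Gaussian assumption."""
        rho = np.corrcoef(x, y)[0, 1]
        return -0.5 * np.log(1.0 - rho ** 2)

    def estimate_lag(feature, annotation, max_lag):
        """EvalDep-style lag estimate: shift the annotation back by 0..max_lag
        samples and keep the shift maximizing the mutual information."""
        best_lag, best_mi = 0, -np.inf
        for lag in range(max_lag + 1):
            f = feature[: len(feature) - lag] if lag > 0 else feature
            a = annotation[lag:]
            n = min(len(f), len(a))
            mi = gaussian_mi(f[:n], a[:n])
            if mi > best_mi:
                best_lag, best_mi = lag, mi
        return best_lag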
To choose the features to which the annotations are aligned, a common empirical
approach is to compute the correlation ρ between the annotations and the features,
and time-align the annotations with respect to the feature with highest correlation.
5.2.2 Fusion of Annotations
Different approaches have been proposed in the literature to combine annotations from different annotators. We now give a brief review of each one of these methods.
An approach based on simple averaging
The most common and simple approach to annotation fusion is to assume
that for a given time step, all annotations have the same distribution. Under this
assumption, for each time step, the average of annotations is taken to obtain a
single fused annotation.
An approach based on expectation maximization
These approaches go a step further and model each annotator individually
(Gupta et al., 2016; Ramakrishna et al., 2016), and assume that each annotation is
a distorted version of the ground truth. These works pose the model as a graphical
model where parameters can be estimated using a maximum likelihood approach,
which is in practice solved using expectation maximization.
An approach based on triplet embeddings
A different set of approaches uses triplet embeddings (Booth et al., 2018b,a;
Mundnich et al., 2019) to generate a fused label. In particular, we use the approach
proposed in (Booth et al., 2018a). The key assumption here is that annotators
are better at annotating ordinal relations in time than ratings, and therefore, they
should be able to more easily answer the question:
d(Y_i, Y_j) \;\overset{?}{\lessgtr}\; d(Y_i, Y_k), \qquad (5.3)

where i, j, and k are time-frames of the annotations Y and d(·,·) is a distance. Collecting a set of these comparisons (called triplets) from all annotators and deciding based on majority vote whether Y_i is closer to Y_j than to Y_k (or the opposite), it is possible to find a 1-dimensional embedding that may be used as a fused annotation.
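A sketch of how the triplet comparisons of Eq. (5.3) can be formed directly from the real-time annotations and resolved by majority vote is given below; the subsequent step of solving for the 1-dimensional embedding (e.g., via t-STE) is omitted, and the distance and tie-breaking choices are assumptions.

    import numpy as np

    def majority_vote_triplets(annotations, triplets):
        """Resolve triplet comparisons across annotators by majority vote.

        annotations: (n_annotators, n_timesteps) array of real-time ratings.
        triplets: iterable of (i, j, k) time-frame indices.
        Returns (i, j, k) tuples ordered so that frame i was judged closer to j than to k.
        """
        resolved = []
        for i, j, k in triplets:
            votes = sum(abs(y[i] - y[j]) < abs(y[i] - y[k]) for y in annotations)
            if votes >= len(annotations) / 2:
                resolved.append((i, j, k))
            else:
                resolved.append((i, k, j))
        return resolved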
5.3 Data Collection
The musical stimuli are identical to the stimuli described in Chapter 4.
A group of 60 healthy, right-handed, adult participants was recruited from the
greater Los Angeles community based on responses to an online survey in which
they listened to a 60-second clip of the final three pieces.[1] All participants in this study had normal hearing, normal or corrected-to-normal vision, and no history of neurological or psychiatric disorders.[2]
Participants listened to all three songs twice: during one listen, they were
instructed to report changes in their affective experience using a fader with a sliding
scale. Participants continuously reported the intensity of felt emotion, from 0 to
10, depending on which piece was being presented. In the sad short song and sad
long song, 10 indicated “extremely sad” and 1 indicated “not at all sad.” In the
happy song, 10 indicated “extremely happy” and 1 indicated “not at all happy.”
[1] This group was 60% female, with an average age of 19.5 and a standard deviation of age of 2.86.
[2] This group of participants is the same as that used in the bio-behavioral analysis of Chapter 4.
During the other listen, subjects were instructed to listen attentively to the
music and simultaneously report the intensity of their enjoyment of the piece using a
fader with a sliding scale. Participants continuously rated their momentary feelings
of pleasure from 0 (no pleasure) to 10 (extreme pleasure). The order of the pieces
and the order of the tasks were counterbalanced.
5.3.1 Auditory Features
In order to predict response to music, we used auditory features related to
dynamics, timbre, harmony, and rhythm, similar to (Kim et al., 2010). Dynamics
refer to “loudness” and change in “loudness” of music, timbre refers to tone quality
of music, harmony refers to musical pitches, and rhythm refers to properties of the
musical beat.
Seventy-four features that capture dynamics, timbre, harmony, and rhythm
were estimated using Matlab’s MIRtoolbox (Lartillot et al., 2008) (see Table 5.1).
These features were extracted using a sliding window with a duration of 50 ms and
a step size of 25 ms, similar to (Lu et al., 2006).
Mel frequency cepstral coefficients (MFCCs) were calculated with Matlab’s mfcc
function (MATLAB Audio Toolbox, 2019) using a Hamming window, pre-emphasis
coefficient of .97, frequency range of 100-6400 Hz, 20 filterbank channels, and 22
liftering parameters, similar to (de Leon & Martinez, 2014). Compressibility was
calculated by computing the ratio between the file size of each window’s wav format
and that same window’s mp3 format, after conversion using ffmpeg (Developers,
2019). Key strength is computed as the maximum value of the 24-dimensional
output vector from MIR Toolbox’s key_strength function. This function outputs a
vector containing the probability that a sliding window is in each major or minor
Table 5.1: Auditory features used and their feature type
Feature Type Feature
Timbre
MFCCs, ΔMFCCs, ΔΔMFCCs, Brightness, HCDF
Skewness, Kurtosis, Spread, LPCs, Spectral Flux
Harmony Centroid, Chroma, Key Strength, Key Mode
Dynamics RMS, Compressibility
Rhythm Pulse Clarity (Lartillot et al., 2008)
key. Other features were extracted using MIR Toolbox’s eponymous functions with
default parameters.
5.4 Methods
5.4.1 Parameter Selection for Annotation Fusion
In order to compute dynamic time warping on the annotations, a time series
that is correlated with the annotations must be chosen. In order to find a time
series like this, we averaged the raw annotations and ran a correlation of this
signal with each of the musical features. The correlation coefficient ρ between the
average annotations and the RMS of the audio signal (computed at 1 Hz with no
overlap) was found to be between .29 and .38 for every response: the highest of
any auditory feature. We time-warped each annotation with this feature for the
DTW representation.
In order to use Mariooryad’s method, it was necessary to choose a maximum
lag parameter. Previous work has shown that mood classification has a high
latency (Laurier et al., 2008), and the additional task of reporting judged emotion
adds to that latency, so we decided to use 10 seconds (10 times the sampling rate)
for the maximum lag parameter.
5.4.2 Prediction Models
We compare three models for prediction: Baseline (a model which naively
predicts the average of the training data), LASSO-T (least squares regression, using
musical features at the timestep of the response as predictors, with ℓ1-regularization), and LASSO-DL (a distributed lag model with ℓ1-regularization). Greer et al. (2019a)
and Ma et al. (2019c) found that LASSO outperformed Ridge regression in most
experiments, so Ridge regression was not used. The least squares models use only
auditory information at the timestep of prediction and the distributed lag models
use information before the timestep of prediction.
For the LASSO models, the regularization parameters tried were identical to
those in Chapter 4.
For the distributed lag models, two parameters were set: the time horizon and
the autoregression length. A model with time horizon h and autoregression length
a predicts the response ŷ_t at timestep t according to

\hat{y}_t = \alpha_0 x_{t-h} + \alpha_1 x_{t-h-1} + \cdots + \alpha_a x_{t-h-a},

where α_i is a coefficient vector for a particular timestep and x_k is a vector of musical features at timestep k.
For DTW, EM, and triplet embeddings—which used dynamic time warping or
EvalDep for time-alignment—we omit RMS and the 0th MFCC from prediction.
RMS was used in the process of creating the fused annotations, and a model
predicting these annotations would inherently be biased; the 0th MFCC is closely
linked to RMS energy so it was also removed.
We assume that emotion stabilization will be shorter than six seconds, as the
songs we selected are repetitive and contain more context than the clips used in
MER studies. We use h = 80 (two seconds), a = 80 (two seconds), to capture
latency in feeling and subsequently reporting felt emotion.
For every experiment, the first 30 seconds of each song were removed, ensuring
that responses would be stable. Then, 10-fold cross-validation was used and we
report the average cross-validation error.
It is evident that the set-up here is very similar to that given in Chapter 4: this is intentional. We seek to determine whether better representations of emotion and enjoyment annotations can result in a better understanding of music perception, by comparing model performance on the labels presented in the previous chapter to performance on the labels presented here.
5.5 Results
Tables 5.2, 5.3, and 5.4 show the validation error for self-reported emotion
in the sad short song, sad long song, and happy song, respectively. The triplet
embeddings have the lowest error for the happy song, while EM-based methods
show the best performance for sad songs.
Table 5.2: Reported Emotion, Sad Short Song — Validation RMSE
Average TA DTW EM Triplet
Baseline 0.159 0.223 0.170 0.142 0.219
LASSO-T 0.161 0.233 0.130 0.130 0.229
LASSO-DL 0.158 0.217 0.127 0.140 0.216
Table 5.3: Reported Emotion, Sad Long Song — Validation RMSE
Average TA DTW EM Triplet
Baseline 0.158 0.185 0.185 0.146 0.213
LASSO-T 0.137 0.178 0.153 0.133 0.210
LASSO-DL 0.128 0.155 0.138 0.120 0.191
Table 5.4: Reported Emotion Happy Song — Validation RMSE
Average TA DTW EM Triplet
Baseline 0.143 0.116 0.163 0.112 0.026
LASSO-T 0.138 0.129 0.180 0.114 0.026
LASSO-DL 0.126 0.114 0.163 0.112 0.023
Tables 5.5, 5.6, and 5.7 show the validation error for self-reported enjoyment in
the sad short song, sad long song, and happy song, respectively. Again, we observe
that triplet embeddings have the lowest error for the happy song. However, for
the sad songs either simple averaging or time-alignment through DTW followed by
simple average shows the best performance.
Table 5.5: Reported Enjoyment, Sad Short Song — Validation RMSE
Average TA DTW EM Triplet
Baseline 0.160 0.202 0.175 0.151 0.155
LASSO-T 0.153 0.213 0.133 0.156 0.153
LASSO-DL 0.142 0.201 0.127 0.148 0.150
Table 5.6: Reported Enjoyment, Sad Long Song — Validation RMSE
Average TA DTW EM Triplet
Baseline 0.106 0.192 0.144 0.121 0.229
LASSO-T 0.106 0.200 0.126 0.123 0.240
LASSO-DL 0.094 0.192 0.122 0.116 0.223
Table 5.7: Reported Enjoyment Happy Song — Validation RMSE
Average TA DTW EM Triplet
Baseline 0.143 0.117 0.162 0.129 0.122
LASSO-T 0.128 0.135 0.163 0.126 0.109
LASSO-DL 0.123 0.111 0.162 0.128 0.110
5.6 Discussion
In the happy song, emotion and enjoyment were best predicted using triplet
embeddings. This may be because emotion and enjoyment in happy music may
be easier to agree upon by annotators than those same latent states in sad music.
That is, reporting judged emotion and enjoyment in happy music may be a more
straightforward task than reporting judged enjoyment and emotion in sad music.
The expectation maximization method and DTW method produced labels that
were well-predicted in the tasks related to the sad songs, with the exception of
reported enjoyment in the sad long song. The fusion method that averaged the
annotations at each timestep generated labels that were easiest to predict in the
task of reporting enjoyment in the sad long song. This suggests that fusion methods
that use dynamic time warping and EvalDep can be effective for generating labels
for music emotion recognition tasks, as the Time Align model failed to provide
best-predicted labels for any tasks. It also suggests that reported enjoyment during
the sad long song had low agreement among annotators. This may be because
of its repetitiveness and/or length (it is over 7 minutes long and keeps a motif
throughout that may wear thin on some listeners.)
In some tasks, predictions using the baseline model had lower error than the
LASSO models. In the instance in which the baseline model outperforms the
LASSO models, it can be assumed that the musical features are not predictive of
the aggregate signal. In effect, the fusion method is not valuable (see the DTW
results of Table 5.7 for an example). In almost every case, however, the baseline
model was outperformed by a LASSO model.
Lastly, the distributed lag models tended to outperform the regression models,
even in time-aligned fusion techniques. This suggests that there is a marked latency
in reporting judged emotion and enjoyment in music, even if that latency is different
per individual, as confirmed by Greer et al. (2019a) and Ma et al. (2019c).
5.7 Conclusion
Responses to music can vary greatly from person to person and group to group.
Aggregating human annotations, therefore, can be a very difficult undertaking.
Still, it serves a paramount role in studying tasks in music emotion recognition.
We use several methods to fuse annotations of human-reported emotion and enjoy-
ment responses to music listening, and show the utility of using these methods
by predicting these responses using auditory features from musical stimuli. By
demonstrating this, we suggest that these methods may be a boon to research in
the field of music emotion recognition.
Chapter 6
Music’s Use in Film and
Advertisements
Film music varies tremendously across genre in order to bring about different
responses in an audience. For instance, composers may evoke passion in a romantic
scene with lush string passages or inspire fear throughout horror films with inhar-
monious drones. While the music listening experience can be enjoyed as a solo
medium, many times music is experienced in conjunction with other media, such as
film. This chapter investigates such phenomena through a quantitative evaluation
of music that is associated with different film genres, connecting a movie-viewing experience to the music that is used within the film. We construct supervised neural
network models with attention pooling to predict a film’s genre from its soundtrack.
We use these models to compare handcrafted music information retrieval (MIR) fea-
tures against VGGish audio embedding features, finding similar performance with
the top-performing architectures. We examine the best-performing MIR feature
model through permutation feature importance (PFI), determining that MFCC
and Chroma features are most indicative of musical differences between genres. We
investigate the interaction between musical and visual features with a cross-modal
analysis, and do not find compelling evidence that music characteristic of a certain
genre implies low-level visual features associated with that genre. This chapter
adds to our understanding of music’s use in multi-modal contexts, providing more
understanding of how music’s pervasive nature contributes to a complex human
experience and offering potential for future inquiry into human affective experiences.
6.1 Introduction
Music plays a crucial role in the experience and enjoyment of film. While
the narrative of movie scenes may be driven by non-musical audio and visual
information, a film’s music carries a significant impact on audience interpretation
of the director’s intent and style (Austin et al., 2010). Musical moments may
complement the visual information in a film; other times, they flout the affect
conveyed in film’s other modalities (visual, linguistic). In every case, however,
music influences a viewer’s experience in consuming cinema’s complex, multi-modal
stimuli. Analyzing how these media interact can provide filmmakers and composers
insight into how to create particular holistic cinema-watching experiences.
We hypothesize that musical properties, such as timbre, pitch, and rhythm,
achieve particular stylistic effects in film, and are reflected in the display and expe-
rience of a film’s accompanying visual cues, as well as its overall genre classification.
In this chapter, we characterize differences among movies of different genres based
on their film music scores. While we focus on how music is used to support par-
ticular cinematic genres, created to engender particular film-watching experiences,
this work can be extended to study other multi-modal content experiences, such
as viewing television, advertisements, trailers, documentaries, music videos and
musical theatre.
6.1.1 Music Use Across Film Genre
Several studies have explored music use in cinema. Music has been such an
integral part of the film-watching experience that guides for creating music for
movies have existed since the Silent Film era of the early 20th century (Lang &
West, 1920). Gorbman (Gorbman, 1987) noted that music in film acts as a signifier
of emotion while providing referential and narrative cues, while Rodman (Rodman,
2017) points out that these cues can be discreetly “felt” or overtly “heard.” That
stylistic musical effects and their purposes in film are well-attested provides an opportunity to study how these musical structures are used.
Previous work has made preliminary progress in this direction. Brownrigg pre-
sented a qualitative study on how music is used in different film genres (Brownrigg,
2003). He hypothesized that film genres have distinctive musical paradigms existing
in tension with one another. By this token, the conventional score associated with
one genre can appear in a “transplanted” scene in another genre. As an example, a
science fiction movie may use musical conventions associated with romance to help
drive the narrative of a subplot that relates to love. We use a multiple-instance
machine learning approach to study how film music may provide narrative support
to scenes steeped in other film genres.
Other studies have taken a more quantitative approach, extracting audio from
movies to identify affective content (Xu et al., 2005; Hanjalic, 2006). Gillick
analyzed soundtracks from over 40,000 movies and television shows, extracting song
information and audio features such as tempo, danceability, instrumentalness, and
acousticness, and found that a majority of these audio features were statistically
significant predictors of genre, suggesting that studying music in film can offer
insights into how a movie will be perceived by its audience (Gillick & Bamman,
2018). In this chapter, we use musical features and state-of-the-art neural embeddings
to study film genre.
Another study found timbral features most discriminatory in separating movie
genres (Austin et al., 2010). In prior work, soundtracks were analyzed without
accounting for whether or for how long the songs were used in a film. We extend these
studies by investigating how timestamped musical clips that are explicitly used in
a film relate to that film’s genre.
6.1.2 Visual-Musical Cross-Modal Analysis
Cross-Modal Studies
Previous research has established a strong connection between the visual and
musical modes in film as partners in delivering a comprehensive narrative expe-
rience to the viewer (Chion et al., 1994; Cohen, 2001; Wingstedt et al., 2010).
Cohen (Cohen, 2001) argued that music “is one of the strongest sources of emotion
in film” because it allows the viewer to subconsciously attach emotional associations
to the visuals presented onscreen. Wingstedt (Wingstedt, 2005) advanced this
theory by proposing that music serves not only an “emotive” function, but also a
“descriptive” function, which allows the soundtrack to describe the setting of the
story-world (e.g., by using folk instruments for a Western setting). In combination
with its emotive function, music’s descriptive function is critical in supporting (or
undermining) the film genre characterized by the visuals of the film.
Low-Level Visual Features
In this study, we use screen brightness and contrast as two low-level visual
features to describe the visual mode of the film. Chen (Chen et al., 2012) found
that different film genres have characteristically different average brightness and
contrast values: Comedy and Romance films have higher contrast and brightness,
while Horror, Sci-Fi, and Action films were visually darker with less contrast.
Tarvainen (Tarvainen et al., 2015) established statistically significant correlations
between brightness and color saturation and feelings of “beauty” and “pleasantness”
in film viewers, while darkness and lack of color were associated with “ugliness”
and “unpleasantness.” This result is aligned with Chen’s finding, as Comedy and
Romance films likely try to evoke “beauty” and “pleasantness,” while Action, Horror,
and Sci-Fi tend to emphasize gritty, muddled, or even “unpleasant” and “ugly”
emotions.
6.1.3 Multiple Instance Learning
Multiple-instance learning (MIL) is a supervised machine learning method where
ground truth labels are not available for every instance; instead, labels are provided
for sets of instances, called “bags.” The goal of classification in this paradigm is to
predict bag-level labels from information spread over instances.
Strong assumptions about the relationship between bags and instances are
common, including the standard multiple instance (MI) assumption where a bag is
positive if and only if there exists at least one positive instance. Here, we make the
soft bag assumption, which allows for a negative-labeled bag to contain positive
instances (Herrera et al., 2016). This is consistent with the idea that a film can
contain musical moments characteristic of genres outside its own.
Simple MI
Simple MI is a MI method in which a summarization function is applied to all
instances within a bag, resulting in a single instance for the entire bag. Then, any
number of classification algorithms can be applied to the resulting single instance
classification problem. A straightforward summarization function averages the
instance vectors in the bag, as applied by Dong (2006).[1]
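A sketch of this Simple MI baseline in the present setting might look as follows: each film (bag) is summarized by averaging the feature vectors of its musical cues (instances), and a standard classifier is trained per genre tag. The variable names and the choice of logistic regression are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def simple_mi_summaries(bags):
        """bags: list of (n_instances_i, n_features) arrays, one per film.
        Returns a (n_films, n_features) matrix of averaged instance vectors."""
        return np.vstack([bag.mean(axis=0) for bag in bags])

    # One binary classifier per genre tag (multi-label, soft bag assumption):
    # X = simple_mi_summaries(film_cue_features)               # hypothetical list of arrays
    # action_clf = LogisticRegression(max_iter=1000).fit(X, y_action)  # y_action: 0/1 per film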
Instance Majority Voting
In Instance Majority Voting, each instance within a given bag is naïvely assigned
the labels of that bag, and a classifier is trained on all instances. Bag-level labels
are then assigned during inference using an aggregation scheme, such as majority
voting (Kong et al., 2019).
Neural Network Approaches
Neural network approaches within an MIL framework have been used extensively
for sound event detection (SED) tasks with weak labeling. Ilse et al. (Ilse et al., 2018)
proposed an attention mechanism over instances and demonstrated competitive
performance on several benchmark MIL datasets. Wang et al. (Wang et al., 2019)
compared the performance of five MIL pooling functions, including attention, and
found that linear softmax pooling produced the best results. Kong et al. (Kong
et al., 2019) proposed a new feature-level attention mechanism, where attention
is applied to the hidden layers of a neural network. Gururani et al. (Gururani et
al., 2019) used an attention pooling model for a musical instrument recognition
task, and found improved performance over other architectures, including recurrent
neural networks. We compare each of these approaches for the task of predicting a
film’s genre from its music.
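To make the attention-pooling approach concrete, a simplified PyTorch sketch in the spirit of Ilse et al. (2018) is shown below; the layer sizes, the single-bag forward pass, and the multi-label output head are assumptions, not the exact architecture evaluated in this chapter.

    import torch
    import torch.nn as nn

    class AttentionPoolingMIL(nn.Module):
        """Attention-based MIL pooling over a bag of musical-cue features."""

        def __init__(self, n_features, hidden=128, attn_dim=64, n_genres=6):
            super().__init__()
            self.embed = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
            self.attention = nn.Sequential(nn.Linear(hidden, attn_dim), nn.Tanh(),
                                           nn.Linear(attn_dim, 1))
            self.classifier = nn.Linear(hidden, n_genres)

        def forward(self, instances):
            # instances: (n_instances, n_features) -- one bag (film) of cues
            h = self.embed(instances)                    # instance embeddings
            w = torch.softmax(self.attention(h), dim=0)  # attention weight per instance
            bag = (w * h).sum(dim=0)                     # weighted bag embedding
            return self.classifier(bag)                  # multi-label genre logits

    # Training would pair these logits with the film's multi-hot genre tags
    # through nn.BCEWithLogitsLoss().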
[1] As discussed in the previous chapter, other strategies for combining labels can often lead to better results than simple averaging. This chapter, in part, dives into bagging schemes, their differences, and their strengths.
6.2 Research Data Collection and Curation
6.2.1 Film and Soundtrack Collection
Soundtracks
We collected the highest-grossing movies from 2014-2018 in-house (see Supplementary Material for details).[2] We identified 110 films from this database with commercially available soundtracks that include the original motion picture score and purchased these soundtracks as MP3 digital downloads.[3][4]
Film Genre
We labeled the genres of every film in our 110-film dataset by extracting genre
tags from IMDb.[5]
Although IMDb lists 24 film genres, we only collected the
tags of six genres for this study: Action, Comedy, Drama, Horror, Romance,
and Science Fiction (Sci-Fi). This reduced taxonomy is well-attested in previous
literature (Austin et al., 2010; Zhou et al., 2010; Rasheed & Shah, 2002; Simões et
al., 2016), and every film in our dataset represents at least one of these genres.
We use multi-label genre tags because many movies span more than one of the
genres of interest. Further, we conjecture that these movie soundtracks would combine music that has characteristics from each genre in a label set. Statistics of the dataset that we use are given in Table 6.1.
[2] boxofficemojo.com
[3] amazon.com
[4] Surprisingly, many commercially released soundtracks can contain very little music that actually appears in the given film, with orchestral score being eschewed in favor of popular music (Pool & Wright, 2010).
[5] imdb.com
Table 6.1: A breakdown of the 110 films in our dataset. Only 33 of the films have
only one genre tag; the other 77 films are multi-genre. A list of tags for every movie
is given in Appendix S1.
Genre Tag Number of Films
Action 55
Comedy 37
Drama 44
Horror 11
Romance 13
Science Fiction 36
6.2.2 Automatically Extracting Musical Cues in Film
We developed a methodology we call Score Stamper that, given a soundtrack to
a film, automatically identifies musical cues and timestamps where these soundtrack
songs are used in the film. A musical cue is a single timestamped instance of a
track from the soundtrack that plays in the film. A given track from the soundtrack
may be part of multiple cues if clips from that track appear in the film on multiple
occasions.
The Score Stamper methodology uses Dejavu's audio fingerprinting tool,[6] which
is robust to dialogue and sound effects. Default settings were used for all Dejavu
parameters. The Score Stamper pipeline is explained in Fig 6.1. At the end of the
Score Stamper pipeline, each film has several “cue predictions.”
We evaluated Score Stamper’s prediction performance on a test set of three
films, annotated manually in-house. We found an average precision of 0.94 (SD
= .012) and an average recall of 0.47 (SD = .086). We deemed these metrics
acceptable for the purposes of this study, as a high precision score indicates that
almost every cue prediction Dejavu provides will be correct, given that these test results generalize to the other films in our dataset. The recall is sufficient because the cues recognized are likely the most influential on audience response, as they are included on the commercial soundtrack and mixed clearly over sound effects and dialogue in the film.
Figure 6.1: The Score Stamper pipeline. A film is partitioned into non-overlapping five-second segments. For every segment, Dejavu will predict if a track in the film's soundtrack is playing. Cues, or instances of a song's use in a film, are built by combining window predictions. In this example, the Cantina Band cue lasts for 15 seconds because it was predicted by Dejavu in two nearby windows.
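To make the cue-building step concrete, the sketch below shows one way the per-window fingerprint matches could be merged into timestamped cues. It assumes Dejavu has already been queried on each non-overlapping five-second window and has returned either a matched track name or None; the function name, the gap-bridging rule, and the tolerance of one unmatched window are illustrative assumptions, not part of Dejavu's API or necessarily the exact rule used in this study.

```python
from typing import List, Optional, Tuple

def build_cues(window_matches: List[Optional[str]],
               window_len: float = 5.0,
               max_gap_windows: int = 1) -> List[Tuple[str, float, float]]:
    """Merge per-window fingerprint matches into timestamped cues.

    window_matches[i] is the soundtrack track matched in non-overlapping
    window i (or None if nothing matched). Nearby windows that match the
    same track are merged into one cue, bridging up to `max_gap_windows`
    unmatched windows. Returns (track, start_seconds, end_seconds) triples.
    """
    cues: List[Tuple[str, float, float]] = []
    current, start, last_hit = None, None, None
    for i, track in enumerate(window_matches):
        same_cue = (track is not None and track == current
                    and i - last_hit <= max_gap_windows + 1)
        if same_cue:
            last_hit = i                          # extend the current cue
        elif track is not None:
            if current is not None:               # close the previous cue
                cues.append((current, start * window_len, (last_hit + 1) * window_len))
            current, start, last_hit = track, i, i
        elif current is not None and i - last_hit > max_gap_windows:
            cues.append((current, start * window_len, (last_hit + 1) * window_len))
            current, start, last_hit = None, None, None
    if current is not None:
        cues.append((current, start * window_len, (last_hit + 1) * window_len))
    return cues

# Two nearby matches of the same track become one 15-second cue, as in Fig 6.1.
print(build_cues([None, "Cantina Band", None, "Cantina Band", None]))
# -> [('Cantina Band', 5.0, 20.0)]
```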
This result also suggests that Score Stamper overcomes a limitation encountered
in previous studies: in prior work, the whole soundtrack was used for analysis
(which could be spurious given that soundtrack songs are sometimes not entirely used, or not used at all, in a film) (Austin et al., 2010; Gillick & Bamman, 2018;
Shan et al., 2009). By contrast, only the music found in a film is used in this
analysis. Another benefit of this method is a timestamped ordering of every cue,
opening up opportunity for more detailed temporal analysis of music in film.
6.2.3 Musical Feature Extraction
MIR Features
Past research in movie genre classification suggests that auditory features related
to energy, pitch, and timbre are predictive of film genre (Jain & Jadon, 2009). We
apply a process similar to those of Austin et al. (2010), Eerola (2011), and Greer et al.
(2019) in this study: we extract features that relate to dynamics, pitch, rhythm,
timbre, and tone using MATLAB’s MIRtoolbox (Lartillot et al., 2008) and Audio
Toolbox (MATLAB Audio Toolbox, 2019) (see Table 6.2). These features were
then texture-windowed as in Tzanetakis & Cook (2002), using five-second windows
and 33% overlap.
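A minimal sketch of the texture-windowing step is given below, assuming the low-level features have already been extracted as a (frames × features) array. The five-second window and 33% overlap follow the text; summarizing each window by its mean and standard deviation follows the texture-window idea of Tzanetakis & Cook (2002) but is an assumption about the exact statistics used here.

```python
import numpy as np

def texture_window(features: np.ndarray, frame_rate: float,
                   win_sec: float = 5.0, overlap: float = 0.33) -> np.ndarray:
    """Summarize frame-level features over longer texture windows.

    features: array of shape (n_frames, n_features).
    Returns shape (n_windows, 2 * n_features): per-window mean and
    standard deviation of each feature.
    """
    win = int(round(win_sec * frame_rate))           # frames per texture window
    hop = max(1, int(round(win * (1.0 - overlap))))  # ~33% overlap between windows
    out = []
    for start in range(0, features.shape[0] - win + 1, hop):
        chunk = features[start:start + win]
        out.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.asarray(out)

# 60 s of features at 10 frames/s with 20 features per frame.
print(texture_window(np.random.rand(600, 20), frame_rate=10.0).shape)  # (17, 40)
```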
Table 6.2: Auditory features used and feature type.
Feature Type    Feature
Dynamics        RMS Energy
Pitch           Chroma
Rhythm          Pulse Clarity (Lartillot et al., 2008), Tempo
Timbre          MFCCs, ΔMFCCs, ΔΔMFCCs, Roughness, Spectral Centroid, Spectral Crest, Spectral Flatness, Spectral Kurtosis, Spectral Skewness, Spectral Spread, Zero-crossing Rate
Tone            Key Mode, Key Strength, Spectral Brightness, Spectral Entropy, Inharmonicity
VGGish Features
In addition to the aforementioned features, we also extract embeddings from
every cue using VGGish's pretrained model (Hershey et al., 2017). In this framework,
128 features are extracted from the audio every .96 seconds, which we resample to
1 Hz to align with the MIR features. These embeddings have shown promise in
tasks like audio classification (El Hajji et al., 2019), music recommendation (Lee et
al., 2018), and movie event detection (Ziai, 2019). We compare the utility of these
features with that of the MIR features.
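The alignment of the 0.96-second VGGish frames to the 1 Hz MIR features can be sketched as below. The sketch assumes the 128-dimensional embeddings have already been extracted with the released VGGish model; linear interpolation onto a 1 Hz grid is one plausible resampling choice and is not necessarily the one used in this study.

```python
import numpy as np

def resample_to_1hz(embeddings: np.ndarray, frame_hop: float = 0.96) -> np.ndarray:
    """Interpolate VGGish-style embeddings onto a 1 Hz time grid.

    embeddings: array of shape (n_frames, 128), one row every `frame_hop` seconds.
    """
    n_frames, dim = embeddings.shape
    src_times = np.arange(n_frames) * frame_hop             # original frame times
    tgt_times = np.arange(0.0, src_times[-1] + 1e-9, 1.0)   # 1 Hz grid
    out = np.empty((len(tgt_times), dim))
    for d in range(dim):  # interpolate each embedding dimension independently
        out[:, d] = np.interp(tgt_times, src_times, embeddings[:, d])
    return out

# Eleven VGGish frames (~10 s of audio) become ten 1 Hz rows.
print(resample_to_1hz(np.random.rand(11, 128)).shape)  # (10, 128)
```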
6.2.4 Visual Features
Following previous works in low-level visual analysis of films, we extract two
features from each film in our dataset: brightness and contrast (Tarvainen et al.,
2015; Chen et al., 2012). These features were sampled at 1 Hz to align with musical
features. Brightness and contrast were calculated as in Chen et al. (2012).
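As an illustration, the sketch below samples one frame per second from a video file and computes brightness and contrast from the HSV value channel (mean and standard deviation, respectively). This is a common convention; the exact definitions in Chen et al. (2012), and the OpenCV-based loading shown here, should be treated as assumptions rather than the study's implementation.

```python
import cv2

def brightness_contrast_1hz(video_path: str):
    """Sample one frame per second and compute brightness and contrast.

    Brightness is taken as the mean of the HSV value channel and contrast as
    its standard deviation (a common convention; Chen et al.'s (2012) exact
    definitions may differ). Returns two lists, one entry per sampled second.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 24.0        # fall back if metadata is missing
    brightness, contrast, frame_idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % int(round(fps)) == 0:       # ~1 Hz sampling
            value = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)[:, :, 2].astype(float)
            brightness.append(value.mean() / 255.0)
            contrast.append(value.std() / 255.0)
        frame_idx += 1
    cap.release()
    return brightness, contrast
```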
6.3 Methods
6.3.1 Genre Prediction Model Training Procedure
In order to select the model architecture which could best predict film genre
from musical features, the 110-film corpus was split into five folds of 22 films each,
allowing five-fold cross validation to be performed. As the ground truth label
for each movie can contain multiple genres, the problem of predicting associated
genres was posed as multi-label classification. For Simple MI and Instance Majority
Voting approaches, the multi-label problem is decomposed into training independent
models for each genre, in a method called binary relevance. The distribution of
genre labels is unbalanced, with 55 films receiving the most common label (Action),
and only 11 films receiving the least common label (Horror). In order to properly
evaluate model performance across all genres, we calculate precision, recall, and
F1-score separately for each genre, and then report the macro-average of each
metric taken over all genres.
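A minimal sketch of this evaluation is shown below: per-genre precision, recall, and F1 are computed from binary multi-label predictions and then macro-averaged. The scikit-learn call and the toy labels are illustrative; only the macro-averaging scheme comes from the text.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def macro_scores(y_true: np.ndarray, y_pred: np.ndarray):
    """Macro-averaged precision/recall/F1 over genres for multi-label output.

    y_true, y_pred: binary indicator arrays of shape (n_films, n_genres).
    """
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return p, r, f

# Toy example; columns: Action, Comedy, Drama, Horror, Romance, Sci-Fi.
y_true = np.array([[1, 0, 1, 0, 0, 1],
                   [0, 1, 0, 0, 0, 0],
                   [1, 0, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 1],
                   [0, 1, 0, 0, 0, 0],
                   [1, 1, 0, 0, 1, 0]])
print(macro_scores(y_true, y_pred))
```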
6.3.2 Model Architectures
For the genre prediction task, we compare the performance of several MIL model
architectures. First, we explore a Simple MI approach where instances are averaged
with one of the following base classifiers: random forest (RF), support vector
machine (SVM), or k-nearest neighbors (kNN). Using the same base classifiers, we
also report the performance of an instance majority voting approach.
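The Simple MI approach with binary relevance can be sketched as follows: each film's instances are averaged into a single vector and one binary classifier is trained per genre. The SVM hyperparameters and the toy data below are assumptions used only for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def simple_mi_binary_relevance(bags, y):
    """Simple MI with binary relevance: average each bag's instances into one
    vector, then fit one binary classifier per genre.

    bags: list of arrays, each of shape (n_instances_i, n_features).
    y:    binary label matrix of shape (n_bags, n_genres).
    """
    X = np.vstack([bag.mean(axis=0) for bag in bags])  # Simple MI: bag -> mean vector
    return [SVC(kernel="rbf").fit(X, y[:, g]) for g in range(y.shape[1])]

# Toy usage: four films ("bags") with varying numbers of 8-dim cue features.
rng = np.random.default_rng(0)
bags = [rng.normal(size=(n, 8)) for n in (3, 5, 2, 4)]
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])     # two genres for brevity
models = simple_mi_binary_relevance(bags, labels)
print(models[0].predict(bags[0].mean(axis=0, keepdims=True)))
```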
For neural network-based models, the six different pooling functions shown in
Table 6.3 are explored. We adopt the architecture given in Fig 6.2, which has
achieved state-of-the-art performance on sound event detection (SED) tasks (Kong
et al., 2019). Here, the input feature representation is first passed through three
dense embedding layers before going into the pooling mechanism. At the output
layer, we convert the soft output to a binary prediction using a fixed threshold of
0.5. A form of weighted binary cross-entropy was used as the loss function, where
weights for the binary positive and negative class for each genre are found by using
the label distribution for the input training set. An Adam optimizer (Kingma & Ba,
2015) with a learning rate of 5e-4 was used in training, and the batch size was set
to 16. Each model was trained for a maximum of 250 epochs, with checkpointing
for best validation loss after each epoch, and early stopping enabled with a patience
of 25 epochs.
Figure 6.2: Neural network model architecture.
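A minimal PyTorch sketch of this architecture with single-attention pooling is given below. The three dense embedding layers, the 0.5 decision threshold, the weighted binary cross-entropy, and the Adam learning rate of 5e-4 follow the text, and the 64 nodes per layer follow Section 6.4.2; the ReLU activations, the decision-level attention parameterization, and the example class weights are assumptions.

```python
import torch
import torch.nn as nn

class AttentionMILModel(nn.Module):
    """Dense embedding layers -> attention pooling over instances -> genre logits."""

    def __init__(self, in_dim: int, hidden: int = 64, n_genres: int = 6):
        super().__init__()
        self.embed = nn.Sequential(                # three dense embedding layers
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.attn = nn.Linear(hidden, n_genres)    # per-genre attention scores
        self.out = nn.Linear(hidden, n_genres)     # per-instance genre logits

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        """bag: (n_instances, in_dim) for one film; returns (n_genres,) logits."""
        h = self.embed(bag)                        # (n_instances, hidden)
        w = torch.softmax(self.attn(h), dim=0)     # attention weights over instances
        return (w * self.out(h)).sum(dim=0)        # attention-weighted pooling

model = AttentionMILModel(in_dim=128)
# Illustrative per-genre positive-class weights (in the study these come from
# the training label distribution).
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1.0, 2.0, 1.7, 9.0, 7.5, 2.1]))
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

logits = model(torch.randn(12, 128))               # one bag of 12 instances
pred = (torch.sigmoid(logits) > 0.5).float()       # fixed 0.5 decision threshold
loss = criterion(logits, torch.tensor([1., 0., 1., 0., 0., 1.]))
loss.backward()
optimizer.step()
```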
Table 6.3: The six pooling functions, where $x_i$ refers to the embedding vector of instance $i$ in a bag set $B$ and $k$ is a particular element of the output vector $h$. In the multi-attention equation, $L$ refers to the attended layer and $w$ is a learned weight. The attention module outputs are concatenated before being passed to the output layer. In the feature-level attention equation, $q(\cdot)$ is an attention function on a representation of the input features, $u(\cdot)$.

Function Name           Pooling Function
Max pooling             $h_k = \max_i x_{i,k}$
Average pooling         $h_k = \frac{1}{|B|} \sum_i x_{i,k}$
Linear softmax          $h_k = \frac{\sum_i x_{i,k}^2}{\sum_i x_{i,k}}$
Single attention        $h_k = \frac{\sum_i w_{i,k}\, x_{i,k}}{\sum_i w_{i,k}}$
Multi-attention         $h_k^{(L)} = \frac{\sum_i w_{i,k}^{(L)}\, x_{i,k}^{(L)}}{\sum_i w_{i,k}^{(L)}}$
Feature-level attention $h_k = \sum_{x \in B} q(x)_k\, u(x)_k$
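The pooling functions in Table 6.3 can be written compactly in NumPy; the sketch below mirrors the formulas for a bag of instance vectors of shape (instances × dimensions), with the attention weights supplied externally (how they are learned is not shown here).

```python
import numpy as np

def max_pool(x):             # h_k = max_i x_{i,k}
    return x.max(axis=0)

def avg_pool(x):             # h_k = (1/|B|) * sum_i x_{i,k}
    return x.mean(axis=0)

def linear_softmax_pool(x):  # h_k = sum_i x_{i,k}^2 / sum_i x_{i,k}
    return (x ** 2).sum(axis=0) / (x.sum(axis=0) + 1e-12)

def attention_pool(x, w):    # h_k = sum_i w_{i,k} x_{i,k} / sum_i w_{i,k}
    return (w * x).sum(axis=0) / (w.sum(axis=0) + 1e-12)

# Toy bag of 4 instances with 3-dimensional (e.g., per-genre) outputs in [0, 1].
x = np.array([[0.9, 0.1, 0.2],
              [0.8, 0.2, 0.1],
              [0.1, 0.7, 0.3],
              [0.2, 0.6, 0.2]])
w = np.exp(x)                # illustrative (softmax-like) attention weights
print(max_pool(x), avg_pool(x), linear_softmax_pool(x), attention_pool(x, w))
```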
6.3.3 Frame-level and Cue-level Features
For each cue predicted by Score Stamper, a sequence of feature vectors grouped
into frames is produced (either VGGish feature embeddings or hand-crafted MIR
features). For instance, a 10-second cue represented using VGGish features will
have a sequence length of 10 and a feature dimension of 128. One way to transform
the problem to an MIL-compatible representation is to simply treat all frames for
every cue as instances belonging to a movie-level bag, ignoring any ordering of the
cues. This approach is called frame-level representation.
A simplifying approach is to construct cue-level features by averaging frame-level
features per cue, resulting in a single feature vector for each cue.
Using MIL terminology, these cue-level feature vectors then become the instances
belonging to the film, which is a “bag.” We evaluate the performance of each model
type when frame-level features are used and when cue-level features are used.
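A minimal sketch of the two bag constructions described above is given below, assuming each cue arrives as a (frames × features) array; the function names are illustrative.

```python
import numpy as np

def frame_level_bag(cue_features):
    """Frame-level representation: every frame of every cue is an instance."""
    return np.vstack(cue_features)                                 # (total_frames, n_features)

def cue_level_bag(cue_features):
    """Cue-level representation: one averaged feature vector per cue."""
    return np.vstack([cue.mean(axis=0) for cue in cue_features])   # (n_cues, n_features)

# A film with three cues of 10, 7, and 15 one-second frames of 128-dim features.
cues = [np.random.rand(n, 128) for n in (10, 7, 15)]
print(frame_level_bag(cues).shape)  # (32, 128)
print(cue_level_bag(cues).shape)    # (3, 128)
```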
6.4 Results
6.4.1 Genre Prediction
Table 6.4 displays the performance of several model architectures on the 110-
film dataset, using either VGGish features or MIR features as input. We first
observe that cue-level feature representations outperform instance-level feature
representations across all models. We further observe that Simple MI and IMV
approaches perform better in terms of precision, recall, and F1-score when using
VGGish features than when using MIR features. This result makes sense, as VGGish
embeddings are already both semantically meaningful and compact, allowing for
these relatively simple models to produce competitive results. Indeed, we find
that Simple MI with an SVM as a base classifier on VGGish features produces the
highest precision of all the models we tested.
MIR features perform much more competitively with VGGish features as input
for the neural network approaches. Here, we note that average pooling and
single attention pooling produce the highest macro-averaged F1-score of all model
architectures, regardless of which input feature set is used. The higher complexity of
these models perhaps allows them to learn a more meaningful representation of the
MIR features that is competitive in performance with the VGGish representation,
for this task. Finally, multi-attention pooling using VGGish features produces the
highest macro-averaged recall of all models tried.
Interestingly, pooling mechanisms that are most consistent with the standard
MI assumption—Max Pooling and Linear Softmax Pooling (Wang et al., 2019)—
perform worse than other approaches. This result is consistent with the idea that
a film’s genre is characterized by all the musical cues in totality, and not by a
single musical moment. All models do perform significantly above both a random-guess baseline using class frequencies and a zero-rule baseline, where the most common (plurality) label set is predicted for all instances.
Table 6.4: Classification results on the 110-film dataset using VGGish features. Five-fold cross validation mean and standard deviation on the macro-averaged metrics for each model are reported. IMV stands for Instance Majority Voting; FL Attn for Feature-Level Attention. Simple MI and IMV results represent performance with the best base classifier (SVM and kNN, respectively).
Model Precision Recall F1
Random Guess .31±.07 .31±.06 .30±.06
Plurality Label .14±.05 .33±.00 .19±.05
SVM - Simple MI .66±.11 .60±.05 .60±.06
kNN - IMV .58±.08 .47±.08 .48±.06
Max Pooling .52±.06 .68±.08 .56±.06
Avg. Pooling .59±.10 .74±.13 .62±.10
Linear Softmax .60±.12 .56±.12 .54±.10
Single Attn .60±.09 .71±.11 .62±.09
Multi-Attn .55±.08 .75±.10 .60±.07
FL Attn .61±.10 .72±.09 .61±.08
6.4.2 Musical Feature Relevance Scoring
To determine the importance of different musical features toward predicting
each film genre, we used the method of Permutation Feature Importance (PFI),
as described in (Molnar, 2019). PFI scores the importance of each feature by
evaluating how prediction performance degrades after randomly permuting the
values of that feature across all validation set examples. The feature importance score $s_k$ for feature $k$ is calculated as:

$$s_k = 1 - \frac{F1_k^{\mathrm{perm}}}{F1^{\mathrm{orig}}} \qquad (6.1)$$

where $F1_k^{\mathrm{perm}}$ is the F1 score of the model across all 5 validation folds with feature $k$ permuted, and $F1^{\mathrm{orig}}$ is the F1 score of the model without any permutations. A high score $s_k$ means that the model's performance degraded heavily when feature $k$ was permuted, indicating that the model relies on that feature to make predictions.
To generate the F1 scores, we used our best performing model trained on MIR
features: a single-attention model with 64 nodes per layer (F1-score = 0.62). Since
we had a large feature set of 140 features, and many of our features were closely
related, we performed PFI on feature groups rather than individual features, as
in Ma et al. (2019a). We evaluated eight feature groups: MFCCs, ΔMFCCs,
ΔΔMFCCs, Dynamics, Pitch, Rhythm, Timbre, and Tone. One feature group was
created for each feature type in Table 6.2 (see Section 3.3.1). MFCCs, ΔMFCCs,
ΔΔMFCCs were separated from the “Timbre” feature type into their own feature
groups, in order to prevent one group from containing a majority of the total
features (and thus having an overwhelmingly high feature importance score). For
each feature group, we randomly permuted all features individually from the others
to remove any information encoded in the interactions between those features. We
report results averaged over 10 runs in order to account for the effects of randomness.
The results of our PFI analysis are shown in Fig 6.3.
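A minimal sketch of this group-wise permutation procedure is given below. It assumes a fitted multi-label model whose predict method returns a binary indicator matrix, a validation feature matrix X with aligned targets y, and a dictionary mapping each feature group to its column indices; the scoring follows Eq. 6.1 with the macro-averaged F1 described above.

```python
import numpy as np
from sklearn.metrics import f1_score

def group_permutation_importance(model, X, y, groups, n_runs=10, seed=0):
    """Permutation feature importance over feature groups (Eq. 6.1).

    groups: dict mapping group name -> list of column indices in X.
    Each feature in a group is permuted independently of the others, and
    s = 1 - F1_perm / F1_orig is averaged over `n_runs` runs.
    """
    rng = np.random.default_rng(seed)
    f1_orig = f1_score(y, model.predict(X), average="macro", zero_division=0)
    scores = {}
    for name, cols in groups.items():
        vals = []
        for _ in range(n_runs):
            X_perm = X.copy()
            for c in cols:  # permute each feature in the group independently
                X_perm[:, c] = rng.permutation(X_perm[:, c])
            f1_perm = f1_score(y, model.predict(X_perm),
                               average="macro", zero_division=0)
            vals.append(1.0 - f1_perm / max(f1_orig, 1e-12))
        scores[name] = float(np.mean(vals))
    return scores
```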
Figure 6.3: Feature importance by genre and feature group.
Figure 6.3 shows that MFCCs were the highest scoring feature group overall, followed by Pitch, ΔMFCCs, and ΔΔMFCCs. This corroborates past research finding MFCCs to be the best performing feature group for various music classification tasks (Kim et al., 2010; Eronen, 2001). MFCCs were the only feature group to
degrade performance when permuted for the Action and Sci-Fi genres; likewise in
Comedy for MFCCs and ΔMFCCs. This result suggests that Action, Comedy, and
Sci-Fi may be most characterized by timbral information encoded in MFCCs.
Interestingly, the model relied significantly on the Pitch feature group for
predicting Drama, Romance, and especially Horror. (Brownrigg, 2003) qualitatively
posits that atonal music or “between-pitch” sounds are characteristic of Horror
film music, and the model’s extreme reliance on Pitch features to make accurate
predictions supports this notion. Likewise, (Brownrigg, 2003) states that Romance
films—and to a lesser extent, Drama films—tend to use lush string beds with rich
harmonies, which could be identified through the Pitch feature group.
Finally, we note that the model’s predictions for Action, Comedy, Drama, and
Sci-Fi were much more robust to feature permutation than for Horror and Romance,
likely because Horror and Romance were under-represented in the 110-film corpus.
6.4.3 Musical-Visual Cross-Modal Analysis
To determine whether visual features associated with a genre correlate with
music characteristic of that genre, we compare average screen brightness and
contrast from film clips with labeled musical cues. We consider three different
sources of genre labels: the true labels, the output confidence scores for each genre
from the best performing musical predictor, and the same genre confidence scores
where only false positives are counted (that is, all scores for the actual genres of
a cue are set to 0). We use a single-attention pooling model trained on VGGish
features (F1-score = 0.62). We standardize the brightness and contrast values to
demonstrate each genre’s variation from the mean. Table 6.5 shows the results.
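One plausible reading of this comparison is sketched below: cue-level brightness (or contrast) values are z-scored across all cues and then averaged per genre, weighted either by actual-label indicators, by the model's confidence scores, or by confidences with the true genres zeroed out. The weighted-mean aggregation and the toy numbers are assumptions; the study's exact aggregation may differ.

```python
import numpy as np

def genre_means(values, weights_by_genre):
    """Per-genre weighted mean of standardized visual features.

    values: array of shape (n_cues,) with brightness (or contrast) per cue.
    weights_by_genre: dict mapping genre -> array of shape (n_cues,) of weights
    (1/0 indicators for "Actual", confidences for "Predicted", or confidences
    with true-genre scores zeroed for "False Positive").
    """
    z = (values - values.mean()) / values.std()   # standardize across all cues
    return {g: float(np.average(z, weights=w))
            for g, w in weights_by_genre.items() if w.sum() > 0}

# Toy usage with five cues and two genres under the "Actual" label source.
brightness = np.array([0.42, 0.55, 0.31, 0.60, 0.47])
weights = {"Comedy": np.array([1, 1, 0, 1, 0]),
           "Horror": np.array([0, 0, 1, 0, 1])}
print(genre_means(brightness, weights))
```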
From the “Actual” metrics, we observe that for both brightness and contrast, our
dataset follows the trends illustrated by Chen et al. (2012): namely, that Comedy
and Romance films have high average brightness and contrast, while Horror films
have the lowest values for both features. We note that clips from Sci-Fi films in
our dataset also have high contrast, which differs from the findings of Chen et al.
(2012).
When comparing the brightness and contrast of clips by their “Predicted,” rather
than “Actual,” genre, we note that the same general trends are present, but tend
more toward the global mean for both metrics. This suggests that the musical style
of a film clip does not necessarily correspond to its visual style; e.g., a clip with
music befitting Comedy may not keep the Comedy-style visual attributes of high
brightness and contrast.
To further support this notion, we present the “False Positive” measure, which
isolates the correlation between musical genre characteristics and visual features
in movies outside that genre. For instance, in an Action movie with significant
Romance musical characteristics (causing the model to assign a high Romance
confidence score), do we observe visual features associated with Romance? For
the majority of genres, we actually found the opposite: “False Positive” metrics
tended in the opposite direction to the “Actual” metrics. This unexpected result
warrants further study, but we suspect that even when musical style subverts genre
expectations in a film, the visual style may stay consistent with the genre, causing
the observed discrepancies between the two modes.
Table 6.5: Mean standardized brightness and contrast (×10^1) across all cues for each genre label source (actual labels, all predictions, and false positives only). Bold values are statistically different from the mean (p < 0.01).
Brightness
Actual Predicted False Positive
Action 0.22 -0.05 -1.26
Comedy 0.79 0.47 -0.37
Drama -0.43 0.06 0.35
Horror -2.82 -0.91 -0.16
Romance 0.44 0.04 -0.09
Sci-Fi 0.22 -0.14 -0.40
Contrast
Actual Predicted False Positive
Action 0.19 0.03 -0.56
Comedy 1.66 0.79 -0.40
Drama -1.13 -0.24 0.97
Horror -1.89 -0.79 -0.36
Romance 0.84 -0.01 -0.50
Sci-Fi 0.93 0.06 -0.64
6.5 Conclusion
In this study, we quantitatively support the notion that characteristic music
helps distinguish major film genres. We find that a supervised neural network
model with attention pooling produces competitive results for multi-label genre
classification. We use the best performing MIR feature model to show that MFCC
and Chroma features are most suggestive of differences between genres. Finally,
we investigate the interaction between musical and low-level visual features across
film genres, but do not find evidence that music characteristic of a genre implies
low-level visual features common in that genre. This work has applications in film,
music, and multimedia studies.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
In this dissertation, we proposed and experimentally investigated several machine
learning paradigms for leveraging the rich modalities of music for performance on
music-related tasks. We show that in the case where audio is not available for certain
music, it is possible to make use of other modalities of music. In Chapter 2, we show
that chords and lyrics can well-represent how a song is perceived by a music listener;
in many cases, using these modalities in tandem can improve representations, as
well. On a genre classification task and an emotion classification task, chords
and lyrics, coupled with MIR features, proved to be an effective combination. In
Chapter 3, we propose a novel method of representing the auditory modality of
music using transformers. Enriching this deep representation of music with a
multi-task learning paradigm, we show that transformers can create interpretable
encodings of musical inputs while providing a viable starting point for downstream
music-related tasks. We believe these representations can be built upon for better
prediction on a variety of song-level autotagging tasks, such as instrument detection,
music emotion recognition, and mood and theme tagging.
In Chapter 4 we show that when studying music from a variety of vantage points
(physiological, neural, and behavioral), we see that different musical features affect
different responses. It is imperative to take a multi-view, multi-modal approach to
studying music in order to capture and quantify these markers and responses. In
Chapter 5, we show one approach for quantifying and representing such outputs.
We identify a method of triplet embedding, which better captures response to music
listening when compared to other aggregation methods. Finally, in Chapter 6 we
show that cross-modal approaches to music perception can be particularly helpful
when music is used in conjunction with other media, such as film. By leveraging
visual features, we show that music and images can corroborate the narrative of
visual media such as cinema and advertisements.
7.2 Future Directions
7.2.1 Context-Aware Music Generation
While this thesis work has focused almost entirely on music analysis (studying
music that has already been created), there is an exciting opportunity to create
more compelling media using this work. Much of this dissertation has discussed
how music is perceived in different contexts, but not much of the dissertation has
discussed how to create such context-aware media. Concretely, if a set of chords is
particularly nostalgic, can we automatically generate music that adheres to this
progression? Will the results be perceived as nostalgic?
Some pioneering work by Dhariwal et al. (2020) has already delved into music
generation conditioned on some parameters. These outputs should, and perhaps must, be conditioned so as not to appear banal; music is too rich a medium not to need some parameters of constraint. By making these music generation systems context-aware,
it may be possible to further humanize and enrich the field of music generation.
Another application of this work is in music pedagogy. By identifying chordal
and lyrical trends, for example, we can inform songwriters of common practices for
creating their art. The discriminative models used in this dissertation can be
repurposed for providing generative approaches to creating music, which can be
used as a tool for songwriters to develop fledgling musical ideas into full-bodied
musical songs. Film directors and music directors can use the work in Chapter 6 to
find songs for licensing that agree with the affective experiences that they want to
provide to their audiences. This work was presented as a set of methods through
which to study the music listening experience, but this work can be used to inform
and build up these experiences as well.
7.2.2 Musical Feature Engineering Using Domain Knowledge
While we used many musical features throughout this work, we spent little time
engineering new musical features that could be immensely helpful for music-related
tasks. Using domain knowledge of music, it is possible to continue pioneering new
ways of studying music and capturing its infinite representations. Future work will
continue to explore and combine new ways of encoding music in its various forms
(sheet music, chord progressions, metadata, harmonic and melodic intent, etc.). In this dissertation, we looked at music through several views: natural language processing
enabled us to leverage chords and lyrics, while audio processing allowed us to make
sense of audio. By continuing to combine, distill, and learn from these different
views, we can move ever closer to capturing music’s essence and why it is such an
indispensable cornerstone to the human experience.
Reference List
Billboard https://www.billboard.com/charts Accessed: 2019-05-27.
Billboard-legend https://www.billboard.com/biz/billboard-charts-legend
Accessed: 2019-05-27.
Ukutabs http://ukutabs.com Accessed: 2017-10-27.
Abdi H, Williams LJ (2010) Principal component analysis. Wiley interdisciplinary
reviews: computational statistics 2:433–459.
Alajanki A, Yang YH, Soleymani M (2016a) Benchmarking music emotion recogni-
tion systems. PLOS ONE .
Alajanki A, Yang YH, Soleymani M (2016b) Benchmarking music emotion recogni-
tion systems. PLOS ONE pp. 835–838.
Alluri V, Toiviainen P, Jääskeläinen IP, Glerean E, Sams M, Brattico E (2012)
Large-scale brain networks emerge from dynamic processing of musical timbre,
key and rhythm. Neuroimage 59:3677–3689.
Anglade A, Ramirez R, Dixon S et al. (2009) Genre classification using harmony
rules induced from automatic chord transcriptions. In ISMIR, pp. 669–674.
Austin A, Moore E, Gupta U, Chordia P (2010) Characterization of movie genre
based on music score In 2010 IEEE International Conference on Acoustics,
Speech and Signal Processing, pp. 421–424. IEEE.
Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint
arXiv:1607.06450 .
Baevski A, Schneider S, Auli M (2019) vq-wav2vec: Self-supervised learning of
discrete speech representations. International Conference on Learning Represen-
tations .
Bakker DR, Martin FH (2015) Musical chords and emotion: Major and minor
triads are processed for emotion. Cognitive, Affective, & Behavioral Neuro-
science 15:15–31.
Barbedo JGA, Tzanetakis G (2010) Musical instrument classification using indi-
vidual partials. IEEE Transactions on Audio, Speech, and Language Process-
ing 19:111–122.
Bertin-Mahieux T, Ellis DP, Whitman B, Lamere P (2011) The million song
dataset .
Bischoff K, Firan CS, Paiu R, Nejdl W, Laurier C, Sordo M (2009) Music mood
and theme classification-a hybrid approach. In ISMIR, pp. 657–662.
Bittner RM, McFee B, Bello JP (2018) Multitask learning for fundamental frequency
estimation in music. arXiv preprint arXiv:1809.00381 .
Blockeel H, Schietgat L, Struyf J, Džeroski S, Clare A (2006) Decision trees for
hierarchical multilabel classification: A case study in functional genomics In
European Conference on Principles of Data Mining and Knowledge Discovery,
pp. 18–29. Springer.
Bogdanov D, Porter A, Tovstogan P, Won M (2020) Mediaeval 2020: Emotion and
theme recognition in music using jamendo In Larson M, Hicks S, Constantin MG,
Bischke B, Porter A, Zhao P, Lux M, Cabrera Quiros L, Calandre J, Jones G,
editors. MediaEval’20, Multimedia Benchmark Workshop; 2020. CEUR Workshop
Proceedings.
Bogdanov D, Won M, Tovstogan P, Porter A, Serra X (2019) The MTG-Jamendo
dataset for automatic music tagging .
Booth BM, Mundnich K, Narayanan S (2018a) Fusing annotations with majority
vote triplet embeddings In Proceedings of the 2018 on Audio/Visual Emotion
Challenge and Workshop, pp. 83–89.
Booth BM, Mundnich K, Narayanan SS (2018b) A novel method for human bias
correction of continuous-time annotations In 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 3091–3095. IEEE.
Bouchachia A, Bouchachia S (2008) Ensemble learning for time series prediction
na.
Brattico E, Alluri V, Bogert B, Jacobsen T, Vartiainen N, Nieminen SK, Tervaniemi
M (2011) A functional mri study of happy and sad emotions in music with and
without lyrics. Frontiers in psychology 2:308.
Brownrigg M (2003) Film music and film genre .
Bu J, Tan S, Chen C, Wang C, Wu H, Zhang L, He X (2010) Music recommen-
dation by unified hypergraph: combining social media information and music
content In Proceedings of the 18th ACM international conference on Multimedia,
pp. 391–400.
Bérard A, Servan C, Pietquin O, Besacier L (2016) MultiVec: a Multilingual and
Multilevel Representation Learning Toolkit for NLP In The 10th edition of the
Language Resources and Evaluation Conference (LREC 2016).
Cannon WB (1916) Bodily changes in pain, hunger, fear, and rage: An account of
recent researches into the function of emotional excitement D. Appleton.
Cano P, Gómez E, Gouyon F, Herrera P, Koppenberger M, Ong B, Serra X, Streich
S, Wack N (2006) ISMIR 2004 audio description contest. Music Technology
Group of the Universitat Pompeu Fabra, Tech. Rep .
Carmon Y, Raghunathan A, Schmidt L, Duchi JC, Liang PS (2019) Unlabeled
data improves adversarial robustness. Advances in Neural Information Processing
Systems 32.
Caruana R (1997) Multitask learning. Machine learning 28:41–75.
Castellon R, Donahue C, Liang P (2021) Codified audio language modeling learns
useful representations for music information retrieval. International Symposium
on Music Information Retrieval .
Chaspari T, Tsiartas A, Stein LI, Cermak SA, Narayanan SS (2015) Sparse
representation of electrodermal activity with knowledge-driven dictionaries. IEEE
Transactions on Biomedical Engineering 62:960–971.
Chen I, Wu F, Lin C (2012) Characteristic color use in different film genres.
Empirical Studies of the Arts 30:39–57.
Chen R, Xu Z, Zhang Z, Luo F (2006) Content based music emotion analysis and
recognition In Proc. of 2006 International Workshop on Computer Music and
Audio Technology, Vol. 68275, p. 2.
Chen S, Wang X, Harris CJ (2008) Narx-based nonlinear system identification
using orthogonal least squares basis hunting. IEEE Transactions on Control
Systems Technology 16:78–84.
Chen YA, Yang YH, Wang JC, Chen H (2015) The amg1608 dataset for music
emotion recognition In 2015 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 693–697. IEEE.
Cheng HT, Yang YH, Lin YC, Liao IB, Chen HH (2008) Automatic chord recognition
for music classification and retrieval In 2008 IEEE International Conference on
Multimedia and Expo, pp. 1505–1508. IEEE.
Chi PH, Chung PH, Wu TH, Hsieh CC, Chen YH, Li SW, Lee Hy (2021) Audio
albert: A lite bert for self-supervised learning of audio representation In 2021
IEEE Spoken Language Technology Workshop (SLT), pp. 344–350. IEEE.
Chion M, Gobman C, Murch W (1994) Audio-vision: sound on screen Columbia
University Press.
Choi K, Fazekas G, Sandler M, Cho K (2017) Transfer learning for music classi-
fication and regression tasks. International Symposium for Music Information
Retrieval .
Chuan CH, Herremans D (2018) Modeling temporal tonal relations in polyphonic
music through deep networks with a novel image-based representation In Thirty-
second AAAI conference on artificial intelligence.
Clark K, Luong MT, Le QV, Manning CD (2020) ELECTRA: Pre-training text
encoders as discriminators rather than generators In International Conference
on Learning Representations.
Cohen AJ (2001) Music as a source of emotion in film. Music and emotion: Theory
and research pp. 249–272.
Cohrdes C, Wrzus C, Frisch S, Riediger M (2017) Tune yourself in: Valence
and arousal preferences in music-listening choices from adolescence to old age.
Developmental psychology 53:1777.
Coutinho E, Deng J, Schuller B (2014) Transfer learning emotion manifestation
across music and speech In 2014 International Joint Conference on Neural
Networks (IJCNN), pp. 3592–3598. IEEE.
Cover T, Thomas J (2006) Elements of information theory 2nd edn (hoboken, nj,
john wiley & sons) .
Cowen AS, Fang X, Sauter D, Keltner D (2020) What music makes us feel: At least
13 dimensions organize subjective experiences associated with music across dif-
ferent cultures. Proceedings of the National Academy of Sciences 117:1924–1934.
Davis J, Goadrich M (2006) The relationship between precision-recall and roc
curves In Proceedings of the 23rd International Conference on Machine Learning,
pp. 233–240.
De Cheveigné A, Kawahara H (2002) Yin, a fundamental frequency estima-
tor for speech and music. The Journal of the Acoustical Society of Amer-
ica 111:1917–1930.
de Leon FA, Martinez K (2014) Music genre classification using polyphonic timbre
models In 2014 19th International Conference on Digital Signal Processing,
pp. 415–420. IEEE.
Defferrard M, Benzi K, Vandergheynst P, Bresson X (2016) Fma: A dataset for
music analysis. International Symposium on Music Information Retrieval .
Developers F (2019) ffmpeg tool http://ffmpeg.org/.
Devlin J, Chang MW, Lee K, Toutanova K (2018a) Bert: Pre-training of deep
bidirectional transformers for language understanding. North American Chapter
of the Association for Computational Linguistics: Human Language Technologies .
Devlin J, Chang MW, Lee K, Toutanova K (2018b) Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 .
Dhariwal P, Jun H, Payne C, Kim JW, Radford A, Sutskever I (2020) Jukebox: A
generative model for music. Computing Research Repository .
Dillman Carpentier FR, Potter RF (2007) Effects of music on physiological arousal:
Explorations into tempo and genre. Media Psychology 10:339–363.
Dong L (2006) A comparison of multi-instance learning algorithms .
Donnelly PJ, Sheppard JW (2015) Cross-dataset validation of feature sets in
musical instrument classification In 2015 IEEE International Conference on
Data Mining Workshop (ICDMW), pp. 94–101. IEEE.
Eerola T (2011) Are the emotions expressed in music genre-specific? An audio-based
evaluation of datasets spanning classical, film, pop and mixed genres. Journal of
New Music Research 40:349–366.
El Hajji M, Daniel M, Gelin L (2019) Transfer learning based audio classification
for a noisy and speechless recordings detection task, in a classroom context In
Proc. SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in
Education, pp. 109–113.
Ellis DS, Brighouse G (1952) Effects of music on respiration-and heart-rate. The
American journal of psychology 65:39–47.
Eronen A (2001) Comparison of features for musical instrument recognition In
Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing
to Audio and Acoustics (Cat. No.01TH8575), pp. 19–22.
Fan J, Tatar K, Thorogood M, Pasquier P (2017) Ranking-based emotion recognition
for experimental music In International Symposium on Music Information
Retrieval.
Fan J, Thorogood M, Pasquier P (2016) Automatic soundscape affect recogni-
tion using a dimensional approach. Journal of the Audio Engineering Soci-
ety 64:646–653.
Ferwerda B, Schedl M (2014) Enhancing music recommender systems with person-
ality information and emotional states: A proposal. In Umap workshops.
Frigola R, Chen Y, Rasmussen CE (2014) Variational gaussian process state-space
models In Advances in neural information processing systems, pp. 3680–3688.
Gemmeke JF, Ellis DP, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M,
Ritter M (2017) Audio set: An ontology and human-labeled dataset for audio
events In 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 776–780. IEEE.
Ghosal D, Kolekar MH (2018) Music genre recognition using deep neural net-
works and transfer learning. In Annual Conference of the International Speech
Communication Association, pp. 2087–2091.
Gillick J, Bamman D (2018) Telling stories with soundtracks: an empirical analysis
of music in film In Proc. of the 1st Workshop on Storytelling, pp. 33–42.
Girshick R (2015) Fast r-cnn In Proceedings of the IEEE international conference
on computer vision, pp. 1440–1448.
Glerean E, Salmi J, Lahnakoski JM, Jääskeläinen IP, Sams M (2012) Functional
magnetic resonance imaging phase synchronization as a measure of dynamic
functional connectivity. Brain connectivity 2:91–101.
Gorbman C (1987) Unheard melodies: Narrative film music Indiana University
Press.
Goto M, Hashiguchi H, Nishimura T, Oka R (2003) RWC music database: Music
genre database and musical instrument sound database .
Gouws S, Bengio Y, Corrado G (2015) Bilbowa: Fast bilingual distributed repre-
sentations without word alignments In International Conference on Machine
Learning, pp. 748–756.
Greer T, Ma B, Sachs M, Habibi A, Narayanan S (2019) A multimodal view into
music’s effect on human neural, physiological, and emotional experience In Proc.
of the 27th ACM International Conference on Multimedia, pp. 167–175.
Greer T, Mundnich K, Sachs M, Narayanan S (2020) The role of annotation
fusion methods in the study of human-reported emotion experience during music
listening In ICASSP 2020-2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 776–780. IEEE.
Greer T, Narayanan S (2019) Using shared vector representations of words and
chords in music for genre classification In Proc. Workshop on Speech, Music and
Mind 2019, pp. 36–40.
Greer T, Sachs M, Ma B, Habibi A, Narayanan S (2019a) A multimodal view
into music’s effect on human neural, physiological, and emotional experience In
ACM Transactions on Multimedia Computing, Communications, and Applications.
ACM.
Greer T, Singla K, Ma B, Narayanan S (2019b) Learning shared vector repre-
sentations of lyrics and chords in music In ICASSP 2019-2019 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 3951–3955.© 2019 IEEE. IEEE.
Grühn D, Scheibe S (2008) Age-related differences in valence and arousal ratings
of pictures from the international affective picture system (iaps): Do ratings
become more extreme with age? Behavior Research Methods 40:512–521.
Gupta R, Audhkhasi K, Jacokes Z, Rozga A, Narayanan SS (2016) Modeling
multiple time series annotations as noisy distortions of the ground truth: An
expectation-maximization approach. IEEE transactions on affective comput-
ing 9:76–89.
Gururani S, Sharma M, Lerch A (2019) An attention mechanism for musical
instrument recognition. Proc. of the International Society for Music Information
Retrieval Conference .
Hamilton JD (1995) Time series analysis. Economic Theory. II, Princeton University
Press, USA pp. 625–630.
Han B, Rho S, Jun S, Hwang E (2010) Music emotion classification and context-
based music recommendation. Multimedia Tools and Applications 47:433–460.
Hanjalic A (2006) Extracting moods from pictures and sounds: Towards truly
personalized TV. IEEE Signal Processing Magazine 23:90–100.
Hendrycks D, Gimpel K (2016) Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415 .
Hendrycks D, Mazeika M, Kadavath S, Song D (2019) Using self-supervised learning
can improve model robustness and uncertainty. Advances in Neural Information
Processing Systems 32.
Herrera F, Ventura S, Bello R, Cornelis C, Zafra A, Sánchez-Tarragó D, Vluymans
S (2016) Multiple Instance Learning: Foundations and Algorithms Springer.
Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M,
Platt D, Saurous RA, Seybold B et al. (2017) Cnn architectures for large-scale
audio classification In 2017 ieee international conference on acoustics, speech
and signal processing (icassp), pp. 131–135. IEEE.
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural computa-
tion 9:1735–1780.
Hotelling H (1936) Relation between two sets of variates. Biometrica .
Hu X, Downie JS, Ehmann AF (2009) Lyric text mining in music mood classification.
American music 183:2–209.
Huang CZA, Vaswani A, Uszkoreit J, Shazeer N, Simon I, Hawthorne C, Dai AM,
Hoffman MD, Dinculescu M, Eck D (2018) Music transformer. International
Conference on Learning Representations .
Huang YS, Yang YH (2020) Pop music transformer: Beat-based modeling and
generation of expressive pop piano compositions In Proceedings of the 28th ACM
International Conference on Multimedia, pp. 1180–1188.
Hung HT, Chen YH, Mayerl M, Vötter M, Zangerle E, Yang YH (2019) Mediaeval
2019 emotion and theme recognition task: A VQ-VAE based approach. In
MediaEval.
Hung YN, Chen YA, Yang YH (2019) Multitask learning for frame-level instrument
recognition In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 381–385. IEEE.
Hung YN, Lerch A (2020) Multitask learning for instrument activation aware music
source separation. International Symposium on Music Information Retrieval .
Huron DB (2006) Sweet anticipation: Music and the psychology of expectation MIT
press.
Ilse M, Tomczak J, Welling M (2018) Attention-based deep multiple instance
learning. CoRR abs/1802.04712.
Jain S, Jadon R (2009) Movies genres classifier using neural network In 2009 24th
International Symp. on Computer and Information Sciences, pp. 575–580. IEEE.
Kendall A, Gal Y, Cipolla R (2018) Multi-task learning using uncertainty to weigh
losses for scene geometry and semantics In Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 7482–7491.
Kereliuk C, Sturm BL, Larsen J (2015) Deep learning and music adversaries. IEEE
Transactions on Multimedia 17:2059–2071.
Khalfa S, Isabelle P, Jean-Pierre B, Manon R (2002) Event-related skin conductance
responses to musical emotions in humans. Neuroscience letters 328:145–149.
Kim YE, Schmidt EM, Migneco R, Morton BG, Richardson P, Scott J, Speck JA,
Turnbull D (2010) Music emotion recognition: A state of the art review In Proc.
ISMIR, Vol. 86, pp. 937–952.
Kingma D, Ba J (2015) Adam: A method for stochastic optimization In Interna-
tional Conference on Learning Representations.
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. International
Conference on Learning Representations .
Knox D, Greer T, Ma B, Kuo E, Somandepalli K, Narayanan S (2020) MediaEval
2020 emotion and theme recognition in music task: Loss function approaches for
multi-label music tagging .
Knox D, Greer T, Ma B, Kuo E, Somandepalli K, Narayanan S (2021) Loss function
approaches for multi-label music tagging In 2021 International Conference on
Content-Based Multimedia Indexing (CBMI), pp. 1–4. IEEE.
Koelsch S (2005) Investigating emotion with music: neuroscientific approaches.
Annals of the New York Academy of Sciences pp. 412–418.
Koelsch S, Fritz T, Müller K, Friederici AD et al. (2006) Investigating emotion
with music: an fMRI study. Human brain mapping pp. 239–250.
Koelsch S, Fritz T, Schulze K, Alsop D, Schlaug G (2005) Adults and children
processing music: an fmri study. Neuroimage 25:1068–1076.
Kong Q, Yu C, Xu Y, Iqbal T, Wang W, Plumbley M (2019) Weakly labelled
audioset tagging with attention neural networks. IEEE/ACM Transactions on
Audio, Speech, and Language Processing 27:1791–1802.
Koutini K, Chowdhury S, Haunschmid V, Eghbal-zadeh H, Widmer G (2019)
Emotion and theme recognition in music with frequency-aware rf-regularized
cnns Vol. Sophia Antipolis, France, 27-30 October 2019.
Kreutz G, Ott U, Teichmann D, Osawa P, Vaitl D (2008) Using music to induce
emotions: Influences of musical preference and absorption. Psychology of
music 36:101–126.
Kumar N, Gupta R, Guha T, Vaz C, Van Segbroeck M, Kim J, Narayanan SS
(2014) Affective feature design and predicting continuous affective dimensions
from music. In MediaEval. Citeseer.
Lai G, Chang WC, Yang Y, Liu H (2018) Modeling long-and short-term temporal
patterns with deep neural networks In The 41st International ACM SIGIR
Conference on Research & Development in Information Retrieval, pp. 95–104.
ACM.
Lang E, West G (1920) Musical Accompaniment of Moving Pictures: A Practical
Manual for Pianists and Organists and an Exposition of the Principles Underlying
the Musical Interpretation of Moving Pictures Boston Music Company.
Lartillot O, Eerola T, Toiviainen P, Fornari J (2008) Multi-feature modeling of pulse
clarity: Design, validation and optimization. In ISMIR, pp. 521–526. Citeseer.
Lartillot O, Toiviainen P, Eerola T (2008) A matlab toolbox for music information
retrieval In Data analysis, machine learning and applications, pp. 261–268.
Springer.
Laurier C, Grivolla J, Herrera P (2008) Multimodal music mood classification using
audio and lyrics In 2008 Seventh International Conference on Machine Learning
and Applications, pp. 688–693. IEEE.
Lee J, Nam J (2017) Multi-level and multi-scale feature aggregation using pretrained
convolutional neural networks for music auto-tagging. IEEE signal processing
letters 24:1208–1212.
Lee S, Lee J et al. (2018) Content-based feature exploration for transparent
music recommendation using self-attentive genre classification. arXiv preprint
arXiv:1808.10600 .
Li T, Ogihara M, Li Q (2003) A comparative study on content-based music
genre classification In Proceedings of the 26th annual international ACM SIGIR
conference on Research and development in informaion retrieval, pp. 282–289.
ACM.
Li Z, Chen Z, Yang F, Li W, Zhu Y, Zhao C, Deng R, Wu L, Zhao R, Tang M
et al. (2021) Mst: Masked self-supervised transformer for visual representation.
Advances in Neural Information Processing Systems 34.
Lim N (2016) Cultural differences in emotion: differences in emotional arousal level
between the east and the west. Integrative medicine research 5:105–109.
Linardatos P, Papastefanopoulos V, Kotsiantis S (2020) Explainable ai: A review
of machine learning interpretability methods. Entropy 23:18.
Ling S, Liu Y, Salazar J, Kirchhoff K (2020) Deep contextualized acoustic repre-
sentations for semi-supervised speech recognition In ICASSP 2020-2020 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 6429–6433. IEEE.
Liu AT, Yang Sw, Chi PH, Hsu Pc, Lee Hy (2020) Mockingjay: Unsupervised
speech representation learning with deep bidirectional transformer encoders In
ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 6419–6423. IEEE.
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: A robustly optimized BERT pretraining approach.
Logan B (2002) Content-based playlist generation: Exploratory experiments. In
ISMIR, Vol. 2, pp. 295–296.
Lopes P, Yannakakis GN, Liapis A (2017) RankTrace: Relative and unbounded affect
annotation In 2017 Seventh International Conference on Affective Computing
and Intelligent Interaction (ACII), pp. 158–163. IEEE.
Lu L, Liu D, Zhang HJ (2006) Automatic mood detection and tracking of music
audio signals. IEEE Transactions on audio, speech, and language process-
ing 14:5–18.
Luong T, Pham H, Manning CD (2015) Bilingual word representations with
monolingual quality in mind In Proc. of the 1st Workshop on Vector Space
Modeling for Natural Language Processing, pp. 151–159.
Ma B, Greer T, Sachs M, Habibi A, Kaplan J, Narayanan S (2019a) Predicting
human-reported enjoyment responses in happy and sad music In 2019 8th
International Conference on Affective Computing and Intelligent Interaction
(ACII), pp. 607–613.
Ma B, Greer T, Sachs M, Habibi A, Kaplan J, Narayanan S (2019b) Predicting
human-reported enjoyment responses in happy and sad music In 2019 8th
International Conference on Affective Computing and Intelligent Interaction
(ACII), pp. 607–613. IEEE.
Ma B, Greer T, Sachs M, Kaplan J, Narayanan S (2019c) Predicting human-reported enjoyment responses in happy and sad music In International Conference on Affective Computing and Intelligent Interaction.
Madjiheurem S, Qu L, Walder C (2016) Chord2vec: Learning musical chord
embeddings.
Marchand U, Peeters G (2016a) The extended ballroom dataset .
Marchand U, Peeters G (2016b) Scale and shift invariant time/frequency repre-
sentation using auditory statistics: Application to rhythm description In 2016
IEEE 26th International Workshop on Machine Learning for Signal Processing
(MLSP), pp. 1–6. IEEE.
Mariooryad S, Busso C (2015) Correcting time-continuous emotional labels by
modeling the reaction lag of evaluators. affective computing. IEEE Transactions
on 6:97–108.
MATLAB Audio Toolbox (2019) Matlab audio toolbox The MathWorks, Natick,
MA, USA.
Maus FE (1991) Music as narrative. Indiana theory review 12:1–34.
Mayer R, Neumayer R, Rauber A (2008a) Combination of audio and lyrics features
for genre classification in digital audio collections In Proceedings of the 16th
ACM international conference on Multimedia, pp. 159–168. ACM.
Mayer R, Neumayer R, Rauber A (2008b) Rhyme and style features for musical
genre classification by song lyrics. In Ismir, pp. 337–342.
McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O (2015)
librosa: Audio and music signal analysis in python In Proceedings of the 14th
Python in Science Conference, Vol. 8, pp. 18–25. Citeseer.
Menon V, Levitin DJ (2005) The rewards of music listening: response and physio-
logical connectivity of the mesolimbic system. Neuroimage 28:175–184.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed represen-
tations of words and phrases and their compositionality In Advances in neural
information processing systems, pp. 3111–3119.
Mikutta C, Maissen G, Altorfer A, Strik W, König T (2014) Professional musicians
listen differently to music. Neuroscience 268:102–111.
Molnar C (2019) Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.
Müller M (2007) Dynamic time warping. Information retrieval for music and
motion pp. 69–84.
Mundnich K, Booth BM, Girault B, Narayanan S (2019) Generating Labels
for Regression of Subjective Constructs using Triplet Embeddings. Pattern
Recognition Letters 128:385–392.
Nicolaou MA, Zafeiriou S, Pantic M (2013) Correlated-spaces regression for learning
continuous emotion dimensions In Proceedings of the 21st ACM international
conference on Multimedia, pp. 773–776. ACM.
Noland KC, Sandler MB (2006) Key estimation using a hidden markov model. In
ISMIR, pp. 121–126.
Oord Avd, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N,
Senior A, Kavukcuoglu K (2016) Wavenet: A generative model for raw audio.
arXiv preprint arXiv:1609.03499 .
Pallesen KJ, Brattico E, Bailey C, Korvenoja A, Koivisto J, Gjedde A, Carlson S
(2005) Emotion processing of major, minor, and dissonant chords. Annals of the
New York Academy of Sciences pp. 450–453.
Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, Le QV (2019) Specaug-
ment: A simple data augmentation method for automatic speech recognition.
Annual Conference of the International Speech Communication Association .
Pauwels J, Peeters G (2013) Evaluating automatically estimated chord sequences In
2013 IEEE International Conference on Acoustics, Speech and Signal Processing,
pp. 749–753. IEEE.
Pavlín T (2020) Dance recognition from audio recordings .
Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word
representation. In EMNLP, Vol. 14, pp. 1532–1543.
Petrantonakis PC, Hadjileontiadis LJ (2009) Emotion recognition from eeg
using higher order crossings. IEEE Transactions on Information Technology in
Biomedicine 14:186–197.
Petrantonakis PC, Hadjileontiadis LJ (2010) Emotion recognition from brain
signals using hybrid adaptive filtering and higher order crossings analysis. IEEE
Transactions on affective computing 1:81–97.
Pons J, Nieto O, Prockup M, Schmidt E, Ehmann A, Serra X (2017) End-to-end
learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520 .
Pool JG, Wright HS (2010) A Research Guide to Film and Television Music in the
United States Scarecrow press.
Pujol P, Macho D, Nadeu C (2006) On real-time mean-and-variance normalization
of speech recognition features In 2006 IEEE international conference on acoustics
speech and signal processing proceedings, Vol. 1, pp. I–I. IEEE.
Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language
understanding by generative pre-training .
Ramakrishna A, Gupta R, Grossman RB, Narayanan SS (2016) An expectation
maximization approach to joint modeling of multidimensional ratings derived
from multiple annotators. In Annual Conference of the International Speech
Communication Association, pp. 1555–1559.
Rasheed Z, Shah M (2002) Movie genre classification by exploiting audio-visual
features of previews In Object recognition supported by user interaction for service
robots, Vol. 2, pp. 1086–1089. IEEE.
Riganello F, Candelieri A, Quintieri M, Dolce G (2010) Heart rate variability,
emotions, and music. Journal of Psychophysiology .
Riganello F, Quintieri M, Candelieri A, Conforti D, Dolce G (2008) Heart rate
response to music: an artificial intelligence study on healthy and traumatic
brain-injured subjects. Journal of Psychophysiology 22:166–174.
Robbins H, Monro S (1951) A stochastic approximation method. The annals of
mathematical statistics pp. 400–407.
Rodman R (2017) The popular song as leitmotif in 1990s film In Changing tunes:
The use of pre-existing music in film, pp. 119–136. Routledge.
Ryali S, Supekar K, Abrams DA, Menon V (2010) Sparse logistic regression for
whole-brain classification of fmri data. NeuroImage 51:752–764.
Sakoe H, Chiba S (1978) Dynamic programming algorithm optimization for
spoken word recognition. IEEE transactions on acoustics, speech, and signal
processing 26:43–49.
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing.
Communications of the ACM 18:613–620.
Samson F, Zeffiro TA, Toussaint A, Belin P (2011) Stimulus complexity and
categorical effects in human auditory cortex: an activation likelihood estimation
meta-analysis. Frontiers in Psychology 1:241.
Sanden C, Zhang JZ (2011) Enhancing multi-label music genre classification
through ensemble techniques In Proceedings of the 34th international ACM SIGIR
conference on Research and development in Information Retrieval, pp. 705–714.
ACM.
Santana IAP, Pinhelli F, Donini J, Catharin L, Mangolin RB, Feltrim VD,
Domingues MA et al. (2020) Music4all: A new music database and its applica-
tions In 2020 International Conference on Systems, Signals and Image Processing
(IWSSIP), pp. 399–404. IEEE.
Scaringella N, Zoia G, Mlynek D (2006) Automatic genre classification of music
content: a survey. IEEE Signal Processing Magazine 23:133–141.
Schaaff K, Adam MT (2013) Measuring emotional arousal for online applica-
tions: Evaluation of ultra-short term heart rate variability measures In 2013
Humaine Association Conference on Affective Computing and Intelligent Interac-
tion, pp. 362–368. IEEE.
Schedl M, Knees P, McFee B, Bogdanov D, Kaminskas M (2015) Music recommender
systems In Recommender systems handbook, pp. 453–492. Springer.
Schindler A, Rauber A (2015) An audio-visual approach to music genre classification
through affective color features In European Conference on Information Retrieval,
pp. 61–67. Springer.
Schuller B, Dorfner J, Rigoll G (2010) Determination of nonprototypical valence
and arousal in popular music: features and performances. EURASIP Journal on
Audio, Speech, and Music Processing 2010:735854.
Scott DW (2015) Multivariate density estimation: theory, practice, and visualization
John Wiley & Sons.
Shaffer F, Ginsberg J (2017) An overview of heart rate variability metrics and
norms. Frontiers in public health 5:258.
Shan MK, Kuo FF, Chiang MF, Lee SY (2009) Emotion-based music recom-
mendation by affinity discovery from film music. Expert systems with applica-
tions 36:7666–7674.
Shih SY, Sun FK, Lee Hy (2018) Temporal pattern attention for multivariate time
series forecasting. arXiv preprint arXiv:1809.04206 .
Siedenburg K, Saitis C, McAdams S (2019) The present, past, and future of timbre
research In Timbre: Acoustics, Perception, and Cognition, pp. 1–19. Springer.
Simões GS, Wehrmann J, Barros RC, Ruiz DD (2016) Movie genre classification
with convolutional neural networks In 2016 International Joint Conference on
Neural Networks (IJCNN), pp. 259–266. IEEE.
Singer N, Jacoby N, Lin T, Raz G, Shpigelman L, Gilam G, Granot RY, Hendler
T (2016) Common modulation of limbic network activation underlies musical
emotions as they unfold. Neuroimage 141:517–529.
Song X, Wang G, Wu Z, Huang Y, Su D, Yu D, Meng H (2019) Speech-XLNet:
Unsupervised acoustic model pretraining for self-attention networks. Annual
Conference of the International Speech Communication Association .
Song Y, Dixon S, Pearce M (2012) A survey of music recommendation systems
and future perspectives In 9th International Symposium on Computer Music
Modeling and Retrieval, Vol. 4.
Sorower MS A literature survey on algorithms for multi-label learning .
Sourina O, Liu Y, Nguyen MK (2012) Real-time eeg-based emotion recognition for
music therapy. Journal on Multimodal User Interfaces 5:27–35.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout:
a simple way to prevent neural networks from overfitting. The Journal of Machine
Learning Research 15:1929–1958.
Stocco A (2014) Coordinate-based meta-analysis of fmri studies with r. The R
Journal 2:5–15.
Sturm BL (2013) The GTZAN dataset: Its contents, its faults, their effects on
evaluation, and its future use. arXiv preprint arXiv:1306.1461 .
Takahashi Y, Kondo K (2014) Comparison of two classification methods for musical
instrument identification In 2014 IEEE 3rd Global Conference on Consumer
Electronics (GCCE), pp. 67–68. IEEE.
Tarvainen J, Westman S, Oittinen P (2015) The way films feel: Aesthetic features
and mood in film. Psychology of Aesthetics, Creativity, and the Arts 9:254.
Toiviainen P, Alluri V, Brattico E, Wallentin M, Vuust P (2014) Capturing the
musical brain with lasso: Dynamic decoding of musical features from fmri data.
Neuroimage 88:170–180.
Trigeorgis G, Nicolaou MA, Schuller BW, Zafeiriou S (2017) Deep canonical time
warping for simultaneous alignment and representation learning of sequences.
IEEE transactions on pattern analysis and machine intelligence 40:1128–1138.
Trohidis K, Tsoumakas G, Kalliris G, Vlahavas IP (2008a) Multi-label classification
of music into emotions. In ISMIR, pp. 325–330.
Trohidis K, Tsoumakas G, Kalliris G, Vlahavas IP (2008b) Multi-label classification
of music into emotions. In ISMIR, Vol. 8, pp. 325–330.
Turnbull D, Barrington L, Torres D, Lanckriet G (2008) Semantic annotation and
retrieval of music and sound effects. IEEE Transactions on Audio, Speech, and
Language Processing 16:467–476.
Tzanetakis G, Cook P (2002) Musical genre classification of audio signals. IEEE
Transactions on speech and audio processing 10:293–302.
Ulčar M, Robnik-Šikonja M (2020) Finest bert and crosloengual bert: less is more
in multilingual models. arXiv preprint arXiv:2006.07890 .
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł,
Polosukhin I (2017) Attention is all you need In Advances in neural information
processing systems, pp. 5998–6008.
Viikki O, Laurila K (1998) Cepstral domain segmental feature vector normalization
for noise robust speech recognition. Speech Communication 25:133–147.
Voss N, Nguyen P (2019) End-to-end classification of ballroom dancing music using
machine learning In International Symposium on Computer Music Multidisci-
plinary Research, pp. 100–109. Springer.
Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR (2018) GLUE: A
multi-task benchmark and analysis platform for natural language understanding.
7th International Conference on Learning Representations .
Wang W, Tang Q, Livescu K (2020) Unsupervised pre-training of bidirectional
speech encoders via masked reconstruction In ICASSP 2020-2020 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 6889–6893. IEEE.
Wang Y, Li J, Metze F (2019) A comparison of five multiple instance learning
pooling functions for sound event detection with weak labeling In ICASSP 2019-
2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 31–35. IEEE.
131
Wen Q, Gao J, Song X, Sun L, Tan J (2019) RobustTrend: A huber loss with a
combined first and second order difference regularization for time series trend
filtering. Proceedings of the 28th International Joint Conference on Artificial
Intelligence .
Wilson SM, Molnar-Szakacs I, Iacoboni M (2007) Beyond superior temporal
cortex: intersubject correlations in narrative speech comprehension. Cerebral
cortex 18:230–242.
Wingstedt J (2005) Narrative music: towards an understanding of musical narrative
functions in multimedia Ph.D. diss., Luleå tekniska universitet.
Wingstedt J, Brändström S, Berg J (2010) Narrative music, visuals and meaning
in film. Visual Communication 9:193–210.
Xia Y, Wang L, Wong K (2008) Sentiment vector space model for lyric-based
song sentiment classification. International Journal of Computer Processing Of
Languages 21:309–330.
XiaoZ,DellandréaE,DouW,ChenL(2008) WhatistheBestSegmentDurationfor
Music Mood Analysis ? In International Workshop on Content-Based Multimedia
Indexing, CBMI 2008, pp. 17–24.
Xu M, Chia LT, Jin J (2005) Affective content analysis in comedy and horror videos
by audio emotional event detection In 2005 IEEE International Conference on
Multimedia and Expo, pp. 4–pp. IEEE.
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV (2019) Xlnet:
Generalized autoregressive pretraining for language understanding. Advances in
neural information processing systems 32.
YiW,LuM,LiuZ(2011) Multi-valuedattributeandmulti-labeleddatadecisiontree
algorithm. International Journal of Machine Learning and Cybernetics 2:67–74.
You Y, Li J, Reddi S, Hseu J, Kumar S, Bhojanapalli S, Song X, Demmel J, Keutzer
K, Hsieh CJ (2019) Large batch optimization for deep learning: Training bert in
76 minutes. arXiv preprint arXiv:1904.00962 .
Zhang K, Zhang H, Li S, Yang C, Sun L (2018) The pmemo dataset for music
emotion recognition In Proceedings of the 2018 ACM on International Conference
on Multimedia Retrieval, pp. 135–142. ACM.
Zhang ML, Zhou ZH (2006) Multilabel neural networks with applications to
functional genomics and text categorization. IEEE transactions on Knowledge
and Data Engineering 18:1338–1351.
132
Zhang ML, Zhou ZH (2007) Ml-knn: A lazy learning approach to multi-label
learning. Pattern recognition 40:2038–2048.
Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q (2019) ERNIE: Enhanced language
representation with informative entities. Proceedings of the 57th Annual Meeting
of the Association for Computational Linguistics .
Zhao H, Zhang C, Zhu B, Ma Z, Zhang K (2022) S3t: Self-supervised pre-training
with swin transformer for music classification. arXiv preprint arXiv:2202.10139 .
Zhao Y, Guo J (2021a) Musicoder: A universal music-acoustic encoder based on
transformer In International Conference on Multimedia Modeling, pp. 417–429.
Springer.
Zhao Y, Guo J (2021b) MusiCoder: A universal music-acoustic encoder based on
transformer In International Conference on Multimedia Modeling, pp. 417–429.
Springer.
Zhou F, De la Torre F (2012) Generalized time warping for multi-modal alignment
of human motion In 2012 IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1282–1289. IEEE.
Zhou F, Torre F (2009) Canonical time warping for alignment of human behavior
In Advances in Neural Information Processing Systems, pp. 2286–2294.
Zhou H, Hermans T, Karandikar AV, Rehg JM (2010) Movie genre classification
via scene categorization In Proceedings of the 18th ACM international conference
on Multimedia, pp. 747–750.
Ziai A (2019) Detecting kissing scenes in a database of hollywood films. arXiv
preprint arXiv:1906.01843 .
133
Supplemental Material
Title | Genre Tags
300: Rise of an Empire | Action, Drama
Aladdin (2019) | Romance
Alita: Battle Angel | Action, Sci-Fi
Annabelle | Horror
Ant-Man | Action, Comedy, Sci-Fi
Ant-Man and the Wasp | Action, Comedy, Sci-Fi
Aquaman (2018) | Action, Sci-Fi
Avengers: Age of Ultron | Action, Sci-Fi
Avengers: Endgame | Action, Drama, Sci-Fi
Avengers: Infinity War | Action, Sci-Fi
Beauty and the Beast (2017) | Romance
Black Mass | Drama
Blended | Comedy, Romance
Bohemian Rhapsody | Drama
Captain America: Civil War | Action, Sci-Fi
Captain America: The Winter Soldier | Action, Sci-Fi
Captain Marvel | Action, Sci-Fi
Chappaquiddick | Drama
Christopher Robin | Comedy, Drama
Cinderella (2015) | Drama, Romance
Collateral Beauty | Drama, Romance
Crazy Rich Asians | Comedy, Drama, Romance
Creed | Drama
Doctor Strange | Action, Sci-Fi
Dora and the Lost City of Gold | Comedy
Dumb and Dumber To | Comedy
Dunkirk | Action, Drama
Edge of Tomorrow | Action, Sci-Fi
Entourage | Comedy, Drama
First Man | Drama
Focus (2015) | Comedy, Drama, Romance
Geostorm | Action, Sci-Fi
Going in Style | Comedy
Halloween (2018) | Horror
Happy Death Day | Horror
Hitman: Agent 47 | Action
Horrible Bosses 2 | Comedy
How to be Single | Comedy, Drama, Romance
In the Heart of the Sea | Action, Drama
Incredibles 2 | Action, Comedy, Sci-Fi
Interstellar | Drama, Sci-Fi
Into the Woods (2014) | Comedy, Drama
It Chapter 2 | Drama, Horror
Johnny English Strikes Again | Action, Comedy
Jumanji: Welcome to the Jungle | Action, Comedy
Jupiter Ascending | Action, Sci-Fi
Justice League | Action, Sci-Fi
King Arthur: Legend of the Sword | Action, Drama
Kingsman: The Secret Service | Action, Comedy
Kong: Skull Island | Action, Sci-Fi
Krampus | Comedy, Drama, Horror
Lights Out (2016) | Drama, Horror
Mad Max Fury Road | Action, Sci-Fi
Magic Mike XXL | Comedy, Drama
Maleficent | Action, Romance
Me Before You | Drama, Romance
Megan Leavey | Drama
Miss Peregrine’s Home for Peculiar Children | Drama
Mission: Impossible - Fallout | Action
Mission: Impossible - Rogue Nation | Action
Moonlight | Drama
Mortal Engines | Action, Sci-Fi
Murder on the Orient Express | Drama
Need for Speed | Action
Oceans 8 | Action, Comedy
Paddington 2 | Comedy
Pan | Comedy
Parasite | Comedy, Drama
Pet Sematary | Horror
Pirates of the Caribbean: Dead Men Tell No Tales | Action
Pokemon Detective Pikachu | Action, Comedy, Sci-Fi
Queen of Katwe | Drama
Rambo: Last Blood | Action
Rampage | Action, Sci-Fi
Ready Player One | Action, Sci-Fi
Run All Night | Action, Drama
San Andreas | Action, Drama
Shazam! | Action, Comedy
Sicario | Action, Drama
Smallfoot | Comedy
Solo: A Star Wars Story | Action, Sci-Fi
Spotlight | Drama
Star Trek Beyond | Action, Sci-Fi
Star Wars: The Force Awakens | Action, Sci-Fi
Star Wars: The Last Jedi | Action, Sci-Fi
Stuber | Action, Comedy
Suicide Squad | Action, Sci-Fi
Sully | Drama
Tammy | Comedy, Romance
Teen Titans Go! To the Movies | Action, Comedy, Sci-Fi
The Boy | Horror
The Conjuring 2 | Horror
The House with a Clock in Its Walls | Comedy, Horror, Sci-Fi
The Hustle (2019) | Comedy
The Imitation Game | Drama
The Intern | Comedy, Drama
The Judge | Drama
The Legend of Tarzan | Action, Drama, Romance
The Man from U.N.C.L.E. | Action, Comedy
The Meg | Action, Horror, Sci-Fi
The Peanut Butter Falcon | Comedy, Drama
The Post | Drama
This Is Where I Leave You | Comedy, Drama
Thor Ragnarok | Action, Comedy, Sci-Fi
Tolkien | Drama, Romance
Tomorrowland | Action, Sci-Fi
Transcendence | Drama, Sci-Fi
Vacation | Comedy
Venom | Action, Sci-Fi
Wonder Woman | Action, Sci-Fi
Appendix S1. 110-film corpus summary
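The table above has a multi-label structure: each film carries one or more of six genre tags (Action, Comedy, Drama, Horror, Romance, Sci-Fi). The short sketch below is illustrative only and is not part of the dissertation's pipeline; it shows one way such annotations could be encoded as multi-hot target vectors for genre classification using scikit-learn. The corpus excerpt, the genre ordering, and all variable names are assumptions introduced for this example.

# A minimal sketch (assumption: scikit-learn is available) of encoding the
# Appendix S1 genre tags as multi-hot vectors for multi-label classification.
from sklearn.preprocessing import MultiLabelBinarizer

# A few (title, genre tags) pairs taken from the table above.
corpus = [
    ("300: Rise of an Empire", ["Action", "Drama"]),
    ("Annabelle", ["Horror"]),
    ("Ant-Man", ["Action", "Comedy", "Sci-Fi"]),
    ("Crazy Rich Asians", ["Comedy", "Drama", "Romance"]),
]

titles, tag_lists = zip(*corpus)

# Fix the label order so every film maps to a 6-dimensional binary vector.
genres = ["Action", "Comedy", "Drama", "Horror", "Romance", "Sci-Fi"]
mlb = MultiLabelBinarizer(classes=genres)
Y = mlb.fit_transform(tag_lists)  # shape: (num_films, 6), multi-hot matrix

for title, row in zip(titles, Y):
    print(title, row)

Each row of Y is a binary vector over the six genres, a target format commonly used by multi-label genre classifiers and their associated loss functions.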
Abstract
With the ever-burgeoning market for music, film, television, and other consumable media, it has become supremely important to study human music listening and multi-modal experiences. Advances in computational approaches offer new ways to understand music content, and how it is experienced---both as a standalone medium and in context with other forms of media---in nuanced ways. This dissertation work identifies novel methods for representing music in a multi-modal, context-aware fashion; we study the interaction between lyrics and chords in music, which has potential applications in multimodal perception, music information retrieval, music emotion recognition, and music recommendation systems. We show that using a multi-view approach to music analysis opens up new avenues for studying music perception, and we show that loss function approaches can improve upon state-of-the-art methods for music genre classification. We also represent music using an embedding scheme and fine-tune this representation on several downstream tasks. Lastly, we investigate cross-cultural music perception, either on its own or when consumed in conjunction with other media, such as advertisements and movies. By creating cross-modal, context-aware representations of music, we can meaningfully capture music and media-related perception, a boon to researchers in affective computing, music information retrieval, and automatic music tagging.