BIOLOGICALLY INSPIRED AUDITORY ATTENTION
MODELS WITH APPLICATIONS IN SPEECH AND
AUDIO PROCESSING
by
Ozlem Kalinli
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2009
Copyright 2009 Ozlem Kalinli
Dedication
To my parents, Necla and Mustafa, and my brother Özgür.
Acknowledgements
Pursuing a PhD degree is a long and challenging process, and I feel very lucky to have
the support and encouragement of many people over the years. I would like to take this
opportunity to thank them.
I am grateful to my advisor Prof. Shrikanth Narayanan for his guidance, nurturing,
and support during the course of my PhD. I am especially thankful to him for believing
in me and giving me a chance to be an independent researcher. He has given enough
guidance when it was really needed and the freedom to pursue my ideas. He is an
exceptional mentor and scientist, and he has been a role model to me and many others
with his energy, enthusiasm, and dedication to science. Also, thanks to his support, I
could participate in both national and international conferences where I had the op-
portunity of not only presenting my work but also meeting many great scientists in the
field. Without his support and trust, this thesis would not have been fruitful.
I would like to thank the members of my dissertation committee: Prof. C.-C. Jay
Kuo, Prof. Bartlett Mel, Prof. Chris Kyriakakis, and Prof. B. Keith Jenkins. I am
grateful to Prof. Bartlett Mel for his questions and valuable comments that really helped
to improve the quality of this work. In addition, the seeds of the attention models for
audition were sown during a short period of collaboration (a joint project) with
Prof. Bartlett Mel and Prof. Laurent Itti. My appreciation also goes to Prof. Laurent
Itti for providing insight to his visual attention models.
Outside the scope of my dissertation, my research has been enriched with an intern-
ship at Microsoft Research in the summer of 2008. I would like to thank the members of
the Speech Technology Group at Microsoft Research for welcoming me to the group and
turning this into a great experience in industrial research. In particular, I am grateful to
Mike Seltzer, who has been a great mentor even after my internship, and Jasha Droppo,
Ivan Tashev, Geoffrey Zweig, Li Deng, and Alex Acero for many technical discussions
and their advice.
I am thankful to the past and present members of the SAIL Lab for fruitful discus-
sions and making the lab a fun place. I especially thank Murtaza Bulut, Erdem Unal,
Abhinav Sethy, Kyu Jeong Han, Andreas Tsiartas, Matt Black and Emily Mower for
their friendship. I will never forget the conference trips with fellow SAILers which I
looked forward to and had many great memories about. Thanks to Mary Francis in
SIPI and Diane Demetras and Tim Boston in Electrical Engineering for their help in
making all administrative matters run smoothly. I met Tim in the first week after I
joined USC, and since then I have felt his familial friendship, encouragement, and support.
His door has always been open for me and all of the students. During graduate
school, I also had the opportunity to meet many other great people who made my life
in Los Angeles fun and joyful. I would like to thank them all.
My sincere appreciation goes to Prof. Deniz Pazarci, who has inspired, encouraged, and
supported me since I met her at Istanbul Technical University (ITU). Her faith and love
have been familial. Also, I am thankful to Prof. Melih Pazarci at ITU for his guidance
and valuable advice over the years.
My best friend, Elif Güzel Stichert, deserves special thanks for being there (mostly
online) for me all the time despite the distance between us. I also thank a close friend
and colleague Tarun Tandon who constantly supported me and always believed in me
throughout this process.
There is one person who has been with me at every step of my study and life since
I started at USC: my friend and colleague Shiva Sundaram. Shiva has given me more
encouragement and support than I could possibly hope for. I am grateful for sharing
my successes with him and having his boundless support at the moments of failure. I
have enjoyed countless long technical discussions with him, which helped me to sharpen
my ideas, and many other interesting discussions about life and the world. In summary,
having him in my life made this journey a lot easier and a lot more fun and fulfilling.
I cannot possibly thank my parents, Necla and Mustafa, and my brother, Özgür,
enough for their immeasurable love, support, and encouragement throughout my life.
There is no way I could have gone this far without them. They have always believed
in me and let me do my own things including living abroad. In summary, my eternal
gratitude goes to them for helping me to complete this part of my dream, and many
others.
Ozlem Kalinli
Los Angeles, California
October 2009
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Attention
  1.3 Review of Computational Attention Models
  1.4 Overview of Dissertation and Contributions
  1.5 Outline of the Dissertation

Chapter 2: A Bottom-Up Saliency-Based Auditory Attention Model
  2.1 Chapter Summary
  2.2 Introduction
  2.3 Auditory Saliency Map
    2.3.1 Multi-Scale Auditory Features
    2.3.2 Iterative Nonlinear Normalization
    2.3.3 Saliency Score
  2.4 Database
  2.5 Experiments and Results
    2.5.1 Saliency Score Threshold Selection
    2.5.2 Analysis of Features
    2.5.3 Analysis of Inhibition Window Size
  2.6 Conclusion

Chapter 3: A Top-Down Task-Dependent Auditory Attention Model with Acoustic Features
  3.1 Chapter Summary
  3.2 Introduction
  3.3 Top-Down Task-Dependent Model with Auditory Gist Features
    3.3.1 Multi-Scale Auditory Features
    3.3.2 Auditory Gist Features
    3.3.3 Task-Dependent Biasing of Acoustic Cues
  3.4 Experiments and Results
    3.4.1 Analysis of Scene Duration
    3.4.2 Analysis of Grid Size
    3.4.3 Analysis of Auditory Attention Features
  3.5 Conclusion

Chapter 4: A Top-Down Task-Dependent Auditory Attention Model with Acoustic and Higher Level Information
  4.1 Chapter Summary
  4.2 Introduction
  4.3 Probabilistic Approach for Task-Dependent Model
    4.3.1 Task-Dependent Model with Auditory Gist Features
    4.3.2 Task-Dependent Model with Lexical Cues
    4.3.3 Task-Dependent Model with Syntactic Cues
    4.3.4 Combined Model with Acoustic and Higher Level Cues
  4.4 Experiments and Results
    4.4.1 Task-Dependent Model Prediction with Only Auditory Features
    4.4.2 Task-Dependent Model Prediction with Only Lexical Features
    4.4.3 Task-Dependent Model Prediction with Only Syntactic Features
    4.4.4 Combined Model Prediction with Auditory, Syntactic and Lexical Features
  4.5 Conclusion

Chapter 5: Interaction Between Top-Down and Bottom-Up Auditory Attention Models
  5.1 Chapter Summary
  5.2 Introduction
  5.3 Integrated Top-Down and Bottom-up Attention Model 1
  5.4 Integrated Top-Down and Bottom-up Attention Model 2
  5.5 Experiments and Results
    5.5.1 Experiments with Integrated Model 1
    5.5.2 Experiments with Integrated Model 2
  5.6 Conclusion

Chapter 6: Continuous Speech Recognition Using Attention Shift Decoding with Soft Decision
  6.1 Chapter Summary
  6.2 Introduction
  6.3 Attention Shift Decoding
    6.3.1 Attention Shift Decoding Using Hard Decision
    6.3.2 Attention Shift Decoding Using Soft Decision
  6.4 Automatic Island Detection
  6.5 Experiments and Results
  6.6 Conclusion

Chapter 7: Saliency-Driven Unstructured Acoustic Scene Classification
  7.1 Chapter Summary
  7.2 Introduction
  7.3 Related Work
  7.4 Proposed Method
    7.4.1 Salient Event Detection
    7.4.2 Discussion
    7.4.3 Latent Perceptual Indexing
  7.5 Experiments and Results
  7.6 Conclusion

Chapter 8: Conclusion

Bibliography
List of Tables

2.1 Prominent Syllable Detection Performance; I: Intensity, T: Temporal contrast, F: Frequency contrast, O: Orientation, P: Pitch
2.2 Prominent Word Detection Performance
3.1 Prominent Syllable Detection Performance with Task-Dependent Auditory Attention Model
3.2 Prominent Syllable Detection Performance with Bottom-Up Saliency-Based Attention Model [37]
3.3 Prominent Syllable Detection Performance with only Pitch features
3.4 Mutual information between the gist feature vectors of the feature sets
3.5 MI between I, F, T, O features and Syllable Prominence Class
3.6 MI between Pitch Features and Syllable Prominence Class
3.7 Prominent Syllable Detection Performance with 0.6 s Scene
4.1 Prominent Syllable Detection Performance of Individual Acoustic, Lexical and Syntactic Cues
4.2 Combined Top-Down Model Performance for Prominent Syllable Detection
4.3 Previously Reported Results on Prominence Detection Task Using the BU-RNC
5.1 Prominent Syllable Detection Performance with Integrated Model 1 (th = 0.1 for the BU model)
5.2 Prominent Syllable Detection Performance with Integrated Model 2
6.1 Island-Gap Detection Features Ranked by Information
6.2 Island-Gap Detection Results
6.3 The Baseline ASR
6.4 The Results Using ASD with Hard Decision
6.5 The Results Using ASD with Soft Decision
7.1 Distribution of Clips Under Each Category
List of Figures

1.1 An example of interaction between selective attention and audio streaming as proposed by the hierarchical decomposition model in [16]
1.2 General framework of the dissertation including i) the early and central auditory processing of sound, ii) bottom-up and top-down auditory attention models, and iii) applications at three time scales: word, utterance, and acoustic scene level.
2.1 Auditory saliency map structure adapted from [32].
2.2 Spectro-temporal receptive filters (RF) mimicking the analysis stages in the central auditory system.
2.3 Prominent syllable detection performance vs. threshold (with IFTO features).
2.4 Prominent syllable detection performance vs. inhibition window size (with IFTO features).
3.1 Auditory gist feature extraction.
3.2 Top-down task-dependent model structure. Training phase: the weights are learned in a supervised manner. Testing phase: auditory gist features are biased with the learned weights to estimate the top-down model prediction.
3.3 Performance of prominence detection as a function of scene duration for grid sizes of 1-by-v and 4-by-5 (v: width of a feature map, Acc: Accuracy (%), F-sc: F-score (%)).
3.4 Pitch analysis of a speech scene with grid size of 1-by-v: (a) pitch; output obtained with (b) frequency contrast RF, (c) orientation RF with 45° rotation, (d) orientation RF with 135° rotation.
3.5 Pitch analysis of a speech scene with grid size of 4-by-5: (a) pitch; output obtained with (b) frequency contrast RF, (c) orientation RF with 45° rotation, (d) orientation RF with 135° rotation.
3.6 Pair-wise MI between raw gist features created using 4-by-5 grid size and 0.6 s scene duration. Only the gist vector extracted from the feature map with c = 2, s = 5 is illustrated. The diagonal elements are set to zero.
4.1 Backoff graph of the lexical-prominence language model moving from the trigram model p(c | c_1, s_1, c_2, s_2) down to p(c), where c denotes prominence class, and s represents syllable token. The most distant variable s_2/c_2 is dropped first.
4.2 Sausage lattice with only lexical evidence
5.1 Integrated bottom-up and top-down model 1
5.2 Integrated bottom-up and top-down model 2
6.1 Block diagram of attention shift decoding method
6.2 Sample word confusion network with islands and gaps
6.3 After processing the word confusion network in Fig. 6.2 with attention shift decoding using hard decision in the oracle experiments.
6.4 After processing the word confusion network in Fig. 6.2 with attention shift decoding using soft decision in the oracle experiments.
7.1 Block diagram of saliency-driven acoustic scene recognition method. W is the duration of the window that centers on the detected salient time point to extract a salient audio event.
7.2 Results of a sample sound clip tagged as "goat machine milked". The tiers show i) waveform of sound, ii) spectrum, iii) transcription where M represents machine noise and G represents goat voice, and iv) saliency score.
7.3 Amount of data reduction as a function of the number of retained salient points (N).
7.4 Clip accuracy results obtained using all frames and top N salient audio events using a neural network classifier.
7.5 Clip accuracy results obtained with LISA and LPI methods.
Abstract
Humans can precisely process and interpret complex scenes in real time despite the
tremendous amount of stimuli impinging on the senses and the limited resources of the
nervous system. One of the key enablers of this capability is a neural mechanism called
"attention". The focus of this dissertation is to develop computational algorithms that
emulate human auditory attention and to demonstrate their effectiveness in spoken
language and audio processing applications.
Attention allows primates to efficiently allocate their neural resources to the locations
of interest to precisely interpret a scene or to search for a target. In a scene, some stimuli
are inherently salient within the context, and they attract attention in a bottom-up
manner. Saliency-driven attention is a rapid, bottom-up, task-independent process, and
it detects the objects that perceptually pop out of a scene by significantly differing from
their neighbors. The second form of attention is a top-down task-dependent process
which uses prior knowledge and learned past experience to focus attention on the target
locations in a scene to enhance the processing.
One of the primary contributions of this thesis work is the development of a novel
bottom-up auditory attention model. An auditory saliency map is proposed to model
such saliency-driven bottom-up auditory attention. The feature extraction structure of
the attention model is inspired by the processing stages in the human auditory system.
It has been demonstrated with the experiments that the bottom-up auditory attention
model can successfully detect prominent syllables and words in speech. In addition,
the bottom-up auditory attention model is used to detect salient acoustic events in
complex acoustic scenes. It has been shown that using only the selected salient events
for acoustic scene classification performs better than the conventional audio content
processing algorithms, which process the whole signal fully and treat everything as
equally important.
The next contribution of this thesis work is an analysis of the effect of task-dependent
influences on auditory attention. For this, a biologically plausible top-down model
is proposed in this thesis. The top-down attention model shares the same front-end
with the bottom-up auditory attention model and biases the features to mimic the
task influences on neurons. In addition to the acoustic cues, the influence of higher
level task-dependent cues such as lexical and syntactic information is also incorporated
into the model. The combined model achieves the highest performance on prominent
syllable/word detection tasks, indicating the importance of a priori task information.
Finally, an attention shift decoding method inspired by human speech recognition is
proposed in this dissertation. In contrast to the traditional automatic speech recognition
systems, which decode speech fully and consecutively from left-to-right, the attention
shift decoding method decodes speech inconsecutively using reliability criteria. To detect
reliable regions of speech, a new set of features is proposed in this dissertation. The
attention shift decoding improves the automatic speech recognition performance.
Chapter 1:
Introduction
1.1 Motivation
Hearing is an essential part of human daily life and is one of the most highly devel-
oped senses. The human auditory system can robustly localize, segment, classify and
recognize sounds embedded in complex scenes. However, even the performance of the
state-of-the-art machine processing algorithms degrades drastically in various conditions
such as cluttered scenes with overlapping sources, presence of noise, speaker changes, etc.
Despite years of intensive research in speech production and psychoacoustic analysis of
the human auditory system, the artificial machine speech and audio processing methods
still remain poor cousins to their biological counterparts. Understanding and modelling
the information processing architectures in biological systems can offer the possibility
of reducing the performance gap between humans and machines in realistic conditions.
In this dissertation, we work towards this goal from the point of the human auditory
system.
Humans can precisely process and interpret complex scenes in real time, despite
the tremendous number of stimuli impinging on the senses and the limited resources of
the nervous system. One of the key enablers of this capability is believed to be a
neural mechanism called "attention". Even though the term "attention" is commonly
used in daily life, researchers in sensory processing and systems have not reached an
agreement on the structure of the attention process. Attention can be viewed as a gating
mechanism that allows only a small part of the incoming sensory signals to reach memory
and awareness [19]. It can also be viewed as a spotlight that is directed towards a target
of interest in a scene to enhance the processing in that area while ignoring the stimuli
that fall outside of the spotlighted area [29, 85]. Thus, it is believed that attention
usually serves for both selection and allocation of finite resources (capacity).
While computational attention models for vision have been successfully used for
detection, search, and recognition problems, analogous attention models for audition
have remained largely unexplored. The conventional signal processing algorithms for
speech and audio process the entire signal or acoustic scene sequentially and fully in
detail without considering the selective attention mechanism humans utilize. The focus
of this thesis work is human-like attention-driven signal processing for speech and audio.
For this, computational algorithms that emulate the auditory attention mechanism are
developed in this dissertation. The effectiveness of the proposed auditory attention
models is demonstrated with speech and audio processing applications. The specific
contributions of the dissertation will be discussed fully in Section 1.4.
1.2 Attention
Attention allows primates to efficiently allocate neural resources to the locations of
interest to precisely interpret a scene or to search for a target. It has been suggested that
attention can be oriented by a saliency-driven bottom-up task-independent mechanism
and a top-down task-dependent mechanism [33, 35, 56]. It is assumed that the bottom-
up process is based on scene-dependent features and may attract attention towards
conspicuous or salient locations of a scene in an unconscious manner, whereas the top-
down process shifts attention voluntarily toward locations of cognitive interest in a
task-dependent manner [33, 35, 56]. Further, psychophysical studies suggest that only
the selectively attended incoming stimuli are allowed to progress through the cortical
hierarchy for high-level processing to recognize and further analyze the details of the
stimuli [2, 29, 32].
Bottom-up attention is considered to be a rapid, saliency-driven mechanism, which
detects the objects that perceptually "pop out" of a scene by significantly differing from
their neighbors. For example, in vision, for an observer, a red flower among the green
leaves of a plant will be salient. Similarly, in audition, the sound of a gunshot in a street
will perceptually stand out from the traffic/babble noise of the street. The experiments
in [14] have indicated that bottom-up attention acts early, and top-down attention
takes control within an order of 100 milliseconds (ms) (usually after about 200 ms).
The top-down task-dependent (goal-driven) process is believed to use prior knowledge
and learned past expertise to focus attention on the target locations in a scene. For
example, in vision, it was shown that gaze patterns depend on the task performed
while viewing a scene [93]. The gaze of the observer fell on faces when estimating the
people's ages, but fell on clothing when estimating the people's material conditions.
For example, in audition, it is the selective attention that allows a listener to extract
a particular person's speech in the presence of others (the cocktail party phenomenon)
by focusing on a variety of acoustic cues such as pitch, timbre, and spatial location
[12, 25]. The bottom-up attention mechanism may play a vital role for primates by
making them quickly aware of possible dangers around them, whereas the top-down
attention mechanism may play a key role for extracting the signal of interest from
cluttered and noisy backgrounds.
There has been extensive research to explore the influence of attention on the neural
responses in the sensory systems. It has been revealed that the top-down task-dependent
influences modulate the neuron responses in the visual and auditory cortex [19, 30, 53,
55]. This modulation mostly occurs by enhancing the response of neurons tuned to the
features of the target stimulus, while attenuating the response of neurons to stimuli
that did not match the target feature [19, 23, 39, 53]. In addition to this, psychophysical
experiments on selective attention have demonstrated that top-down processing can
selectively focus attention on a limited range of acoustic feature dimensions [25]. An
extensive review of task-specific influences on the neural representation of sound is
presented in [23]. Psychophysical experiments on selective attention have been recently
reviewed in [25].
The interaction between top-down and bottom-up attention is still a debatable sub-
ject for scientists. The work on visual attention usually relies on reaction times, whereas
the work on auditory attention is done with dichotic experiments to measure the ability
of a subject to extract information from a noisy background for detection, recognition
or discrimination purposes. In the psychophysical experiments of [83] for vision, it was
demonstrated that during a task of search for a shape singleton, the attention of sub-
jects was drawn to irrelevant color singletons. This singleton effect was supported by the
recorded spike activity of neurons in area V4 while monkeys performed singleton (color
and shape) search tasks [59]. This was interpreted as indicating that bottom-up attention can never be
overridden by top-down attention [83]. However, the experiments in [59] demonstrated
that top-down attention is not completely overridden by bottom-up attention. For ex-
ample, the response of neurons to an irrelevant color singleton during shape search was up
initially and decreased after 200 ms, whereas it stayed up even after 200 ms when the task
was color search. This supports the theories that top-down and bottom-up processing
interact to control allocation of resources and attention [14]. Similarly, for audition it
was revealed that the brain processes the task-irrelevant streams up to some point, i.e.
parsing the acoustic scene into streams, analyzing novelty, etc. [20, 82, 87]. For example,
the dichotic experiments in [12] have shown that the listeners could almost always no-
tice when the gender of the speaker (hence, the pitch) in the unattended channel was
changed; however, detailed information, such as the language and individual words, was
unnoticed. Also, the cocktail party experiment in [12] has shown that the attention
of listeners was shifted when their own names were mentioned by another speaker on
the other side of the room. These experiments indicate that we are not completely
blind/deaf to the task irrelevant or unattended streams, and attention can be captured
by strong salient signals that are present in the unattended stream in a bottom-up
fashion, i.e. the physical sudden sound of a bang, a crack, or the semantic priority of
names.
1.3 Review of Computational Attention Models
In the literature, computational attention models have been mostly explored for vision.
In [32, 42], the concept of saliency map was proposed to model bottom-up visual atten-
tion in primates. A set of low-level features (such as color, intensity, and orientation)
is extracted in parallel from an image in multi-scales to produce topographic "feature
maps" and combined into a single saliency map which indicates the perceptual influence
of each part of the image. The saliency map is then scanned to find the locations
that attract attention, and it was verified by virtue of eye movement that the model
could replicate several properties of human overt attention, i.e. detecting traffic signs,
detecting colors, etc. [32].
In [57, 64, 89], the top-down influence of a task on visual attention was modelled.
A guided visual search model was proposed in [89] that combines the weighted feature
maps in a top-down manner, i.e. when the task is to detect a red bar, the feature maps
that are sensitive to red color get a larger gain factor. This top-down biasing is based
on the evidence that neural responses are modulated by task dependent attention [19].
The authors in [57] presented a model that tunes the bottom-up features based on the
properties of both the target and the distracter items for visual search problems, i.e.
finding the cell phone in a cluttered scene of a desk. A model that combines bottom-up
and top-down attention was proposed in [64] to predict where subjects' gaze patterns
fall while performing a task of interest. This method was shown to perform well when
tested with recorded eye movements of people while playing video games.
Analogies between auditory and visual perception have been widely discussed in
the neuroscience literature. The common principles of visual and auditory processing
are discussed in [72], and it is suggested that, although early pathways of visual and
auditory systems have anatomical differences, there exists a unified framework for cen-
tral visual and auditory sensory processing. Intermodal and cross-modal interaction
between auditory and visual attention is discussed in [23, 34]. Based on the assumption
of parallel processing between vision and audition, a saliency map which has identical
structure with the visual saliency map in [32] was proposed for audition in [40]. In [40],
intensity, temporal and frequency contrast features were extracted from the Fourier
spectrum of the sound in multi-scales and contributed to the nal saliency map in a
bottom-up manner. This model was able to replicate some overt properties of auditory
scene perception, i.e. the relative salience of short, long, and temporally modulated
tones in noisy backgrounds.
Humans can focus on a particular audio stream among many others as in the cocktail
party phenomenon. It was proposed in [8] that to understand complex audio sequences,
humans separate the audio sequences into different sound elements (audio segregation),
and then combine the elements which are likely to be emitted from the same acoustical
sources into an audio stream or object (audio grouping). Audio grouping can occur in
two different but complementary ways: primitive grouping and schema-driven group-
ing. Primitive grouping works in a purely bottom-up (data-driven) manner using the
Gestalt principles of perceptual organization (such as frequency proximity, good conti-
nuity, etc.). On the other hand, schema-driven grouping uses prior knowledge of past
experience about commonly exposed acoustic stimuli, i.e. music and speech. It is ar-
gued that audio streaming can occur without attention during the audio scene analysis,
Figure 1.1: An example of interaction between selective attention and audio streaming
as described by the hierarchical decomposition model proposed in [16].
and the role of attention is to bring one of the objects to consciousness. However, re-
cent work in [10] suggests that attention is required for audio streaming. The authors
in [16] further investigated the effect of attention on the buildup of audio streaming and
proposed a hierarchical decomposition model as shown in Fig. 1.1. For example, in a
real-world scenario, the listener first becomes aware of the initial grouping (perhaps
this is automatic, and pre-attentive), i.e. speech, music, and noise. The listener can
focus attention on the music at the expense of the speech and noise. Then, the internal
representation of the music is fragmented due to this attentional focusing, and several
streams become available for attention. If one of the streams is selected, the further
fragmentation continues. If attention is withdrawn from the music at some point, its
grouping is reset (if attention is switched to the speech). The advantage of this hier-
archical decomposition is that the listener is not bombarded with all components of all
sound sources on the arrival of an auditory scene, and resources are not wasted by the
organization/processing of unattended parts of an auditory scene.
A conceptual framework for selective attention was proposed and partially imple-
mented for the purpose of audio scene analysis in [90]. The model extracts periodicity
information from the sound spectrum and performs primitive grouping based on the pe-
riodicity. Then, a one dimensional neural oscillator accomplishes audio segregation and
grouping. The output from neural oscillators is fed to an attention leaky integrator after
being modulated by a set of weights assigned by conscious attention. This model could
replicate several perceptual phenomena including two-tone streaming and sequential
capturing.
1.4 Overview of Dissertation and Contributions
Attention is a mechanism that allows humans to selectively process the incoming sen-
sory stimuli. This is a desirable property for artificial machine learning systems, both in
terms of computational efficiency and for understanding how the human brain can
process/decode complex scenes very accurately, i.e. foreground-background segmen-
tation, object recognition, scene understanding, etc. The focus of this dissertation is
the auditory attention mechanism and its applications to spoken language and audio
processing.
The block diagram of the general framework for this dissertation is illustrated in
Fig. 1.2. A novel biologically inspired auditory attention model with bottom-up and
top-down processing mechanisms is proposed. First, sound is analyzed by mimicking
the processing stages in the early and central auditory system. The output of the central
auditory processing represents a set of multi-scale features that is used for both bottom-
up and top-down auditory attention models proposed here. It is believed that attention
plays a role at multiple time scales, i.e. at the word level, utterance level, and even at the
complete acoustic scene level, as shown in the triangle in Fig. 1.2. Thus, the auditory
attention models are tested with the spoken language and audio processing applications
that happen in multiple time scales. The proposed auditory attention models are generic
and can be used in many spoken language processing tasks and general computational
auditory scene analysis applications as discussed in each relevant chapter. However,
in this dissertation, prominent syllable detection, attention shift decoding of speech
and saliency-driven acoustic scene recognition are the proposed applications for human-
like attention-driven signal processing at the word, utterance and acoustic scene levels,
respectively. Each block in Fig. 1.2 is explained briefly in the rest of the section.
One of the contributions of this work is the development of a novel biologically in-
spired bottom-up auditory attention model. In Chapter 2, an auditory saliency map
that builds on the architecture of the visual saliency map in [32] is proposed to model
such saliency-driven auditory attention with the assumption of parallel processing be-
tween vision and audition. A set of low-level multi-scale features (such as intensity,
frequency contrast, temporal contrast, orientation and pitch) are extracted in parallel
from the auditory spectrum of sound based on the processing stages in the central audi-
tory system to produce feature maps. Then, the feature maps are combined into a single
saliency map which indicates the perceptual influence of each part of the acoustic scene.
The bottom-up attention model rapidly scans the scene and finds the salient acoustic
events, which perceptually pop out of the scene by significantly differing from their
neighbors. The bottom-up attention model is a task-independent process and works in
an unsupervised manner.
Figure 1.2: General framework of the dissertation including i) the early and central
auditory processing of sound, ii) bottom-up and top-down auditory attention models,
and iii) applications at three time scales: word, utterance, and acoustic scene level.
The top-down task-relevant process uses prior knowledge and learned expertise to
focus attention on the target locations in a scene. The next contribution of this thesis
work is an analysis of the effect of task-dependent influences on auditory attention. For
this, in Chapter 3, we propose a biologically inspired top-down model which guides at-
tention during an acoustic search for a target. The top-down attention model shares the
same feature maps with the bottom-up auditory attention model. As explained in Sec-
tion 1.2, the neuron responses are modulated by the task dependent in
uences. Hence,
the feature maps are converted into low-level auditory gist features that capture the
overall properties of a scene. Then, the auditory gist features are biased and combined
using a neural network to imitate the modulation eect of task on neuron responses to
reliably detect a target or to perform a specic task. A comprehensive analysis of the
features used in the auditory attention model is also presented in Chapter 3.
There is evidence that, while processing the incoming audio stimuli, the brain is also
influenced by higher level information such as lexical information, syntax (whether word
strings are formed grammatically correctly or not), and even the semantics of the words
[34, 67]. Auditory attention is a highly complex operation that involves the processing of
low-level acoustic features of incoming stimuli together with higher level information.
Next, the top-down influence of higher level information such as lexical and syntactic
information is incorporated into the proposed top-down auditory attention model in
Chapter 4.
The interaction between bottom-up and top-down attention models is a debatable
topic and not fully resolved. In Chapter 5, we model the interaction using two different
approaches. The first approach follows the hypothesis that the top-down model makes
a selection among the conspicuous locations pointed out by the saliency-driven bottom-
up model. The second model follows the hypothesis that, while the top-down model
is processing the input stimuli in a goal oriented manner, the bottom-up model can
override the top-down model.
In this dissertation, the proposed attention models are tested with the spoken lan-
guage and audio processing applications that happen in multiple time scales as shown
with the triangle in Fig. 1.2. The application for the first tier of the triangle is promi-
nent syllable detection in speech. During speech perception, a particular syllable can
be perceived to be more salient or prominent than the others due to factors such as
co-articulation between phonemes, accent, and the physical or emotional state of the
talker, and these syllables may carry important information for speech understanding.
The cues which make a syllable prominent in the acoustic signal are perceived by the
listeners, which supports our hypothesis that prominent syllables may attract
human attention. Hence, in the first part of this dissertation, in Chapters 2-5,
the proposed bottom-up and top-down attention models are tested in the context of
a prominent syllable detection task in speech using a corpus in which syllables are
labelled based on human perception. The prominent syllable/word detection is an im-
portant task in spoken language processing systems such as speech recognition, speech
synthesis and even in speech-to-speech translation systems.
Words segmented from running speech are often unintelligible even for humans, and
they become intelligible when they are heard in the context of an utterance [66]. Also,
usually there are certain words in an utterance one can recognize first while s/he is not
sure about the remaining words in the utterance, and then the information from the ini-
tially recognized words is used to resolve ambiguities in understanding the remaining
words in the utterance. Based on the attention theory and the experimental findings,
it is believed that humans decode speech non-sequentially in a selective way. An atten-
tion shift decoding (ASD) method inspired by human speech recognition is proposed in
Chapter 6. In contrast to the traditional automatic speech recognition (ASR) systems,
which decode speech fully and consecutively from left-to-right, ASD decodes speech in-
consecutively using reliability criteria; the method first finds the islands of continuous
speech and recognizes them. The islands consist of reliable regions of speech for an au-
tomatic speech recognizer. Then, the gaps (unreliable speech regions) are decoded with
the evidence of islands. ASD offers potentially better performance than the traditional
ASR systems, since it makes use of the information from the reliable regions of speech
to resolve the uncertainty in unreliable regions of speech.
ASD can be viewed as how attention might be playing a role at the utterance level,
i.e. the second tier of the triangle in Fig. 1.2. The final tier of the triangle in Fig.
1.2 is the acoustic scene level, in which many types of acoustic sources may be present in
a longer span of time. Humans can rapidly and precisely recognize a complex acoustic
scene despite the many overlapping sources that may be present in a scene. However, for
the machines, automatic categorization of complex, unstructured acoustic scenes (where
an acoustic scene may consist of any number of unknown sources which may also overlap
in time) is a difficult task, as the appropriate label eventually associated with the audio clip of
an unknown scene depends on the key acoustic event present in it. For example, an
audio clip labeled "crash" may have human conversation and/or other related sources
in it, but the highlight "crash" of the vehicle is used to categorize the acoustic scene.
When any (unknown) number of acoustic sources can be present in a scene but only
one or a handful of them are relevant, the approach adopted by conventional audio
classification systems would entail classifying each and every source in the scene and
then implementing application-specific post processing; hence, such systems are inefficient and
usually do not have robust performance. In Chapter 7, a novel method that emulates
human auditory attention for acoustic scene recognition is proposed. The method first
detects the salient acoustic events in a cluttered acoustic scene using the proposed
bottom-up auditory attention model in Chapter 2 and then processes only the selected
events with a previously learned representation for acoustic scene recognition. This
method is more efficient and potentially has better performance than the conventional
audio classification methods, since it filters out non-salient acoustic events in a scene.
1.5 Outline of the Dissertation
This dissertation starts with discussing the motivation behind this work. Chapter 1 also
presents an overview of the attention mechanism, together with a review of some related
neuroscience and psychophysics studies. This is followed by the review of computational
attention models presented in the literature. The overview of the proposed work and
the contributions are presented in Section 1.4.
In the chapters following the introduction, the proposed auditory attention models
and their applications are presented. Each chapter starts with a brief summary followed
by the discussion of the proposed model/method. After that, the conducted relevant
experiments and their results are presented. Each chapter ends by presenting conclu-
sions together with possible future work directions. This dissertation concludes with a
brief summary of the work presented and possible future directions of research.
Chapter 2:
A Bottom-Up Saliency-Based Auditory Attention Model
2.1 Chapter Summary
A bottom-up or saliency-driven attention allows the brain to detect nonspecific con-
spicuous targets in cluttered scenes before fully processing and recognizing the targets.
Here, a novel biologically plausible auditory saliency map is presented to model such
saliency-based auditory attention. Multi-scale auditory features are extracted based
on the processing stages in the central auditory system, and they are combined into a
single master saliency map. The usefulness of the proposed auditory saliency map in
detecting prominent syllable and word locations in speech is tested in an unsupervised
manner. The performance achieved with the bottom-up attention model compares well
to results reported on human performance on the prominence detection task.
2.2 Introduction
The brain is the most advanced information processing device, and a large portion of its
computation power is devoted to sensory processing with vision and hearing being the
two highly developed senses in humans. In [72], the common principles of visual and
auditory processing are discussed, and it is suggested that although early pathways of
visual and auditory systems have anatomical differences, there exists a unified framework
for central visual and auditory sensory processing.
Our nervous system is exposed to a tremendous amount of sensory stimuli, but our
brain cannot fully process all stimuli at once. A neural mechanism exists that selects
a subset of available sensory information before further processing it [2, 8, 32]. This
selection is a combination of rapid bottom-up saliency-driven (task-independent) at-
tention, as well as slower top-down cognitive (task dependent) attention [32]. First,
stimulus-driven rapid bottom-up processing of the whole scene occurs that attracts at-
tention towards conspicuous or salient locations in an unconscious manner. Then, the
top-down processing shifts the attention voluntarily towards locations of cognitive in-
terest. Only the selectively attended location is allowed to progress through cortical
hierarchy for high-level processing to analyze the details [2, 26, 32]. In vision, for ex-
ample, for an observer a red circle on a gray background will be salient (bottom-up,
saliency-driven analysis), but after consciously paying attention to the red spot the observer
will become aware that the red spot is actually a traffic sign (top-down analysis). Similarly, in
audition, one may hear people talking and music playing in a room (saliency-driven),
but it won't be immediately apparent what people are saying or what type of instru-
ments are producing the music. Only if the subject chooses to listen to the music, s/he
will be aware of what kinds of instruments are producing the music (top-down).
In [32, 42], the concept of saliency map was proposed to understand bottom-up
visual attention in primates. A set of low-level features is extracted in parallel from
the image in multi-scales to produce topographic "feature maps", and combined into a
single saliency map which indicates the perceptual influence of each part of the image.
The saliency map is scanned to find the locations that attract attention, and it was
verified by virtue of eye movement that the model could replicate several properties of
human attention, i.e. detecting traffic signs, detecting colors etc. [32]. Analogous to
visual saliency maps, a saliency map for audition was proposed in [40]. The structure
of the saliency map was identical to the visual saliency map in [32]. This model was
able to replicate basic properties of auditory scene perception, i.e. the relative salience
of short, long, temporally modulated tones in noisy backgrounds [40]. These results
clearly support the hypothesis that the mechanisms extracting conspicuous events from
a sensory representation are identical in the central auditory and visual systems, and
bottom-up human attention can be modelled with a saliency map.
In this chapter, we propose a novel biologically plausible auditory saliency map that
builds on the architectures proposed in [32, 40]. The contributions of this work are as
follows: an auditory spectrum of the sound is rst computed based on early stages of
the human auditory system. This two-dimensional (2D) spectrum is processed by the
auditory saliency model. In addition to the intensity, temporal and frequency contrast
features used in [40], orientation features extracted analogous to the dynamics of the
auditory neuron responses to moving ripples in the primary auditory cortex, and pitch,
which is a fundamental percept of sound, are included in the model as well. To integrate
the different features into a single saliency map, a biologically inspired nonlinear local
normalization algorithm is used. The normalization algorithm is adapted from the
model proposed for vision in [31] to a plausible model for the auditory system. The proposed
auditory saliency map is tested in the context of a prominent syllable detection task
in speech. The motivation behind choosing the prominent syllable detection task is that
during speech perception, a particular phoneme or syllable can be perceived to be more
salient than the others due to the coarticulation between phonemes, and other factors
such as accent, and the physical and emotional state of the talker [26]. This information
encoded in the acoustical signal is perceived by the listeners, and we propose to detect
these salient syllable locations using the proposed bottom-up auditory attention model.
The prominent syllable/word detection can play an important role in speech under-
standing. For instance, it is important in terms of nding salient regions in speech that
may carry critical semantic information. This has applications in speech synthesis for
generating more naturally sounding speech when used together with other cues, such as
boundary times and intonation patterns [9]. Similarly in speech-to-speech translation
systems where it is important to capture and convey concepts from the source to the
target language, the ability to handle such salient information contained in speech is
critical. Prominent syllable detection also plays a role in word disambiguation and hence
in word recognition and synthesis. For example, it has been shown in [27, 62] that inte-
grating prominence patterns into the automatic speech recognition improved the speech
recognizer performance. In summary, extraction of knowledge sources human use be-
yond segmental level and integration of them into current machine speech processing
systems can yield much improved performance. Also, the bottom-up saliency-driven au-
ditory attention model proposed here is not limited to the prominence detection task; it
is a general bio-inspired model and can be applied to other spoken language processing
tasks and computational auditory scene analysis applications as discussed in Chapter
2.6.
The chapter is organized as follows: first the auditory saliency map model is ex-
plained in Section 2.3, followed by experimental results in Section 2.5. The conclusions
drawn are presented in Section 2.6.
2.3 Auditory Saliency Map
The block diagram of the proposed auditory saliency model is given in Fig. 2.1. First,
the auditory spectrum of the sound is estimated based on the information processing
stages in the early auditory (EA) system [92]. The EA model used here consists of
cochlear filtering, inner hair cell (IHC), and lateral inhibitory stages mimicking the
process from the basilar membrane to the cochlear nucleus in the human auditory system.
The raw time-domain audio signal is filtered with a bank of 128 overlapping constant-
Q asymmetric band-pass filters with center frequencies that are uniformly distributed
along a logarithmic frequency axis, analogous to cochlear filtering. This is followed by
a differentiator, a nonlinearity, and low-pass filtering mimicking the IHC stage, and
finally a lateral inhibitory network [92]. Here, for analysis, audio frames of 20 ms with
a 10 ms shift are used, i.e. each 10 ms audio frame is represented by a 128-dimensional
vector.
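For concreteness, the short sketch below builds a toy 128-channel auditory-spectrum front end with the same framing (log-spaced channels, 20 ms analysis, 10 ms shift). It is only a simplified stand-in for the EA model of [92]: the Butterworth band shapes, the frequency range, the bandwidth factor, the rectification/low-pass pair, and the per-frame energy summary are illustrative assumptions, and the lateral inhibitory network stage is omitted entirely.

import numpy as np
from scipy.signal import butter, sosfilt

def auditory_spectrum(x, fs, n_channels=128, frame_shift=0.010):
    """Toy auditory spectrum: (n_channels x n_frames), one column per 10 ms."""
    f_lo, f_hi = 100.0, min(8000.0, 0.45 * fs)
    centers = np.geomspace(f_lo, f_hi, n_channels)            # log-spaced center frequencies
    hop = int(round(frame_shift * fs))
    n_frames = len(x) // hop
    spec = np.zeros((n_channels, n_frames))
    sos_lp = butter(2, 400.0 / (fs / 2.0), btype="low", output="sos")  # crude IHC low-pass
    for ch, fc in enumerate(centers):
        bw = 0.35 * fc                                        # constant-Q-like bandwidth
        lo = max(fc - bw / 2.0, 1.0) / (fs / 2.0)
        hi = min(fc + bw / 2.0, 0.49 * fs) / (fs / 2.0)
        sos = butter(2, [lo, hi], btype="band", output="sos") # one cochlear-like channel
        y = sosfilt(sos, x)
        y = np.maximum(y, 0.0)                                # half-wave rectification
        y = sosfilt(sos_lp, y)                                # temporal smoothing
        spec[ch] = [y[i * hop:(i + 1) * hop].mean() for i in range(n_frames)]
    return spec

# example: 1 s of noise at 16 kHz -> a (128, 100) time-frequency "auditory spectrum"
if __name__ == "__main__":
    fs = 16000
    print(auditory_spectrum(np.random.randn(fs), fs).shape)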
The output of the EA model is an auditory spectrum with time and frequency axes,
which is analogous to the input image in the visual saliency map. In the next stage, this
spectrum is analyzed by extracting a set of multi-scale features in a manner similar to the
information processing stages in the central auditory system.
Figure 2.1: Auditory saliency map structure adapted from [32].
Figure 2.2: Spectro-temporal receptive filters (RF) mimicking the analysis stages in
the central auditory system.
2.3.1 Multi-Scale Auditory Features
Auditory attention can be captured by (bottom-up) or selectively directed (top-down) to
a wide variety of acoustical features such as intensity, frequency, temporal, pitch, timbre,
FM direction or slope (called "orientation" in this thesis) and spatial location [23,
34]. Here, five features are included in the model to encompass all the aforementioned
features except spatial location, which is beyond the scope
of this thesis. The features included in the model are intensity (I), frequency contrast
(F), temporal contrast (T), orientation (O), and pitch (P), and they are extracted in
multi-scales using 2D spectro-temporal receptive filters mimicking the analysis stages
in the primary auditory cortex [17, 72]. All the receptive filters (RF) simulated here for
feature extraction are illustrated in Fig. 2.2. The excitation phase (positive values) and
inhibition phase (negative values) are shown with white and black color, respectively.
The intensity filter mimics the receptive fields in the auditory cortex with only an
excitatory phase selective for a particular region [17] and can be implemented with a
Gaussian kernel. The multi-scale intensity features I(σ) are created using a dyadic
pyramid: the input spectrum is filtered with a 6×6 Gaussian kernel [1, 5, 10, 10, 5, 1]/32
and reduced by a factor of two, and this is repeated [84]. If the scene duration D is
large (i.e. D > 1.28 s), the number of scales is determined by the number of band-pass
filters used in the EA model, hence eight scales σ = {1, ..., 8} are created, yielding size
reduction factors ranging from 1:1 (scale 1) to 1:128 (scale 8). Otherwise, there are
fewer scales.
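A rough sketch of this dyadic pyramid is given below, assuming separable smoothing with the quoted kernel, decimation by a factor of two, and nearest-value boundary handling; none of these implementation details are specified above.

import numpy as np
from scipy.ndimage import convolve1d

KERNEL = np.array([1, 5, 10, 10, 5, 1], dtype=float) / 32.0   # 6-tap Gaussian-like kernel

def dyadic_pyramid(spectrum, n_scales=8):
    """spectrum: 2D array (frequency x time); returns [scale 1, ..., scale <= n_scales]."""
    pyramid = [spectrum]
    current = spectrum
    for _ in range(n_scales - 1):
        # separable 6x6 smoothing: filter along frequency, then along time
        smoothed = convolve1d(current, KERNEL, axis=0, mode="nearest")
        smoothed = convolve1d(smoothed, KERNEL, axis=1, mode="nearest")
        current = smoothed[::2, ::2]                          # reduce by a factor of two
        pyramid.append(current)
        if min(current.shape) < 2:                            # short scene: fewer scales
            break
    return pyramid

# example: a 128-channel, 1.28 s (128-frame) spectrum yields scales from 1:1 down to 1:128
if __name__ == "__main__":
    spec = np.random.rand(128, 128)
    print([p.shape for p in dyadic_pyramid(spec)])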
Similar to I(σ), the multi-scale F(σ), T(σ), and O_θ(σ) features are extracted using the filters described below on eight scales (when the scene duration D is large enough), each being a resampled version (by a factor of two) of the previous one. The frequency contrast filters correspond to receptive fields with an excitatory phase and simultaneous symmetric inhibitory side bands, and the temporal contrast filters correspond to receptive fields with an inhibitory phase and a subsequent excitatory phase, as described in [17, 40]; they are shown in Fig. 2.2. The filters used for extracting frequency and temporal contrast features can be interpreted as the horizontal and vertical orientation filters used in the visual saliency map [32, 40]. These filters are implemented using a 2D Gabor filter (the product of a cosine function with a 2D Gaussian envelope [84]) with orientation θ = 0° for frequency contrast F(σ) and θ = 90° for temporal contrast T(σ). In the lowest scale, the frequency contrast filter has a 0.125 octave excitation with same-width inhibition side bands (24 channels/octave in the EA model), and the temporal contrast filter is truncated such that it has a 30 ms excitation phase flanked by a 20 ms inhibition phase. The orientation filters mimic the dynamics of auditory neuron responses to moving ripples [17, 72]. This is analogous to motion energy detectors in the visual cortex. To extract orientation features O_θ(σ), 2D Gabor filters with θ = {45°, 135°} are used. They cover approximately a 0.375 octave frequency band in the lowest scale. The exact shapes of the filters used here are not important as long as they can manifest the lateral inhibition structure, i.e., an excitatory phase with simultaneous symmetric inhibitory sidebands [70].
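To make the receptive-filter construction concrete, the following sketch builds a 2D Gabor filter, i.e., a cosine carrier under a 2D Gaussian envelope, whose orientation θ selects the frequency contrast RF (0°), the temporal contrast RF (90°), or the two orientation RFs (45°, 135°). The filter size, wavelength, and envelope width below are illustrative assumptions; the thesis specifies its RF sizes in octaves and milliseconds rather than these sample values.

```python
import numpy as np

def gabor_rf(size=13, theta_deg=0.0, wavelength=8.0, sigma=3.0):
    """2D Gabor receptive filter: cosine carrier x 2D Gaussian envelope.

    theta_deg = 0      -> excitation with symmetric inhibitory side bands
                          along the frequency axis (frequency contrast RF)
    theta_deg = 90     -> the same pattern along the time axis (temporal contrast RF)
    theta_deg = 45/135 -> oriented RFs for rising/falling spectro-temporal sweeps
    """
    theta = np.deg2rad(theta_deg)
    half = size // 2
    t, f = np.meshgrid(np.arange(-half, half + 1),   # columns: time
                       np.arange(-half, half + 1))   # rows: frequency
    # rotate coordinates so the carrier oscillates along the chosen orientation
    f_rot = f * np.cos(theta) + t * np.sin(theta)
    envelope = np.exp(-(f ** 2 + t ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * f_rot / wavelength)
    rf = envelope * carrier
    return rf - rf.mean()     # zero mean: excitation roughly balanced by inhibition

freq_contrast_rf = gabor_rf(theta_deg=0)
temporal_contrast_rf = gabor_rf(theta_deg=90)
orientation_rfs = [gabor_rf(theta_deg=a) for a in (45, 135)]
```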
Pitch information is also included in our model because it is an important property of sound; using a method of extracellular recording, it was shown that neurons of the auditory cortex also respond to pitch [7]. Further, in [12] it was shown that participants noticed a change in the pitch of the sound played in the unattended ear in a dichotic listening experiment, which indicates that pitch contributes to auditory attention. In general, there are two hypotheses for the encoding of pitch in the auditory system: temporal and spectral [72]. We extract pitch based on the temporal hypothesis, which assumes that the brain estimates the periodicity of the waveform in each auditory nerve fiber by autocorrelation [78]. Then, a piecewise linear (second order polynomial) model is fit to the estimated pitch values in voiced regions. We mapped the computed pitch values onto the tonotopic cortical axis, assuming that the auditory neurons at the cochlear location corresponding to the pitch are fired. Then, the multi-scale pitch features P(σ) are created using a dyadic Gaussian pyramid identical to the one used for extracting intensity features. In summary, six feature sets are computed in the model: one feature set for each of I, F, T and P, and two feature sets for O_θ since it has two angles, θ = 45° and θ = 135°.
As shown in Fig. 2.1, after extracting features at multiple scales, "center-surround" differences are calculated, resulting in "feature maps". The center-surround operation mimics the properties of local cortical inhibition, and it is simulated by across-scale subtraction (⊖) between a "center" fine scale c and a "surround" coarser scale s, followed by rectification [32, 71]:

M(c, s) = |M(c) \ominus M(s)|, \quad M \in \{I, F, T, O_\theta, P\}.   (2.1)

Here, c \in \{2, 3, 4\} and s = c + \delta with \delta \in \{3, 4\} are used. Thus, six feature maps are computed per feature set, resulting in a total of 36 feature maps when the features are extracted at eight scales. The center-surround operation detects local temporal and spectral discontinuities.
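A minimal sketch of the center-surround operation of Eq. 2.1 is given below, assuming the pyramid representation from the earlier sketch. Across-scale subtraction is implemented here by resizing the coarser surround map up to the size of the finer center map before point-wise subtraction and rectification; the choice of bilinear resizing is my own assumption, not a detail taken from the thesis.

```python
import numpy as np
from scipy.ndimage import zoom

def center_surround(pyramid, centers=(2, 3, 4), deltas=(3, 4)):
    """Across-scale subtraction M(c, s) = |M(c) (-) M(s)| of Eq. 2.1.

    pyramid[k] is the feature map at scale k+1 (so scale c is pyramid[c - 1]).
    The surround map is interpolated up to the center map's size, then the
    absolute difference (rectified contrast) is taken.
    """
    feature_maps = []
    for c in centers:
        for d in deltas:
            s = c + d
            if s > len(pyramid):
                continue                          # skip when too few scales exist
            center = pyramid[c - 1]
            surround = pyramid[s - 1]
            factors = (center.shape[0] / surround.shape[0],
                       center.shape[1] / surround.shape[1])
            surround_up = zoom(surround, factors, order=1)   # bilinear resize
            feature_maps.append(np.abs(center - surround_up))
    return feature_maps   # six maps per feature set when eight scales exist

# Example with a toy 8-level pyramid of a 128 x 128 auditory spectrum
toy_pyramid = [np.random.rand(128 // 2 ** k, 128 // 2 ** k) for k in range(8)]
maps_I = center_surround(toy_pyramid)
print(len(maps_I))   # 6 feature maps for this feature set
```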
The feature maps are combined to provide bottom-up input to the saliency map. However, the maps have to be normalized since they represent non-comparable modalities, i.e., different dynamic ranges and feature extraction mechanisms. An iterative nonlinear normalization algorithm N(·) is used to normalize the feature maps (discussed in detail in Section 2.3.2). The normalized feature maps are combined into "conspicuity maps" at scale σ = 3 using across-scale addition [32]:

\bar{M} = \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(M(c, s)), \quad M \in \{I, F, T, P\},   (2.2)

\bar{O} = \sum_{\theta \in \{45^\circ, 135^\circ\}} N\left( \bigoplus_{c=2}^{4} \bigoplus_{s=c+3}^{c+4} N(O_\theta(c, s)) \right).   (2.3)

Finally, the auditory saliency map is computed by combining the normalized conspicuity maps with equal weights:

S(w, t) = \frac{1}{5} \left( N(\bar{I}) + N(\bar{F}) + N(\bar{T}) + N(\bar{O}) + N(\bar{P}) \right),   (2.4)

where w represents the frequency axis and t the time axis. The maximum of the saliency map defines the most salient location in the 2-D time-frequency auditory spectrum.
2.3.2 Iterative Nonlinear Normalization
The normalization algorithm N(·) used here is an iterative, nonlinear operation simulating competition between neighboring salient locations. It was originally proposed in [31] for the visual saliency map, and is adapted to the auditory system here. As a result of the normalization, possibly noisy feature maps are reduced to sparse representations of only those locations which strongly stand out from their surroundings.
The normalization algorithm is as follows. Each feature map is first scaled to the range [0, 1] to eliminate dynamic-range differences across modalities. Then, each iteration step consists of self-excitation and inhibition induced by neighbors. This is implemented by convolving each map with a large 2D difference of Gaussians (DoG) filter and clamping the negative values to zero. A feature map M is transformed in each iteration step as follows:

M \leftarrow |M + M * DoG - C_{inh}|_{\geq 0},   (2.5)

where C_{inh} is 2% of the global maximum of the map. The details of the DoG filter parameters can be found in [31], except that the filter size is modified as follows.
The visual saliency model operates only on the spatial domain, while the auditory saliency map consists of temporal and frequency dimensions. This requires a different normalization process for the auditory model, especially along the temporal dimension. The normalization algorithm in [31] uses the same filters for both (x, y) axes since they are both spatial. First, we modify this part and design the temporal and frequency filters separately. The cortical neurons along the cochlea are connected locally [70], hence only neighboring basilar membrane filter outputs can inhibit each other. Based on this fact, a DoG filter that operates on the frequency dimension (y axis) is designed such that a single frequency channel output is self-excited and inhibited by the two lower and upper channel outputs next to it.
In [40], the auditory saliency model uses a 0.45 second (s) analysis window based on temporal masking facts in the auditory system. Also, it has been shown that, for the prominent syllable detection task, an analysis window centered on the syllable nucleus and encompassing the neighboring syllables (approximately 0.5 s window size) performs well [43]. To design the temporal DoG filter, we derived the statistics of the database used for the prominent syllable task to get an estimate. It is found that the mean syllable duration is approximately 0.2 s (μ) with a 0.1 s standard deviation (std). Hence, the temporal DoG filter used for normalization is implemented such that it comprises an excitation phase of approximately 0.2 s, followed and preceded by 0.2 s inhibitory regions (considering neighboring syllables), yielding a 0.6 s analysis window. There are also syllables with durations much longer than 0.2 s; for instance, the maximum syllable duration is 1.4 s in the database. Hence, C_{inh} is computed over a 3 s audio stream during normalization to take longer syllables into account as well. The normalization filter duration is further analyzed in Section 2.5.3.
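The following sketch illustrates the iterative normalization of Eq. 2.5 with separable temporal and frequency DoG filters as just described. The concrete gains and widths below (a narrow frequency excitation inhibited by two neighboring channels on each side, and roughly 0.2 s temporal excitation flanked by broader inhibition at 10 ms frames) are my reading of the text and of [31], and should be treated as assumptions rather than the exact parameters used in the thesis.

```python
import numpy as np
from scipy.ndimage import convolve

def dog_kernel(exc_sigma, inh_sigma, length, exc_gain=0.5, inh_gain=1.5):
    """1D difference-of-Gaussians: narrow excitation minus broad inhibition."""
    x = np.arange(length) - length // 2
    exc = exc_gain * np.exp(-x ** 2 / (2.0 * exc_sigma ** 2))
    inh = inh_gain * np.exp(-x ** 2 / (2.0 * inh_sigma ** 2))
    return exc - inh

def normalize_map(feature_map, n_iter=3, c_inh_ratio=0.02):
    """Iterative nonlinear normalization N(.) of Eq. 2.5 (sketch).

    Frequency axis: self-excitation with inhibition from the two neighboring
    channels on each side.  Temporal axis: ~0.2 s excitation flanked by
    inhibition, over a ~0.6 s window (10 ms frames).
    """
    m = feature_map - feature_map.min()
    m = m / (m.max() + 1e-12)                           # scale to [0, 1]
    dog_freq = dog_kernel(exc_sigma=0.7, inh_sigma=2.0, length=5)
    dog_time = dog_kernel(exc_sigma=10.0, inh_sigma=30.0, length=61)
    dog_2d = np.outer(dog_freq, dog_time)               # (frequency, time) kernel
    for _ in range(n_iter):
        c_inh = c_inh_ratio * m.max()                   # 2% of the global maximum
        m = m + convolve(m, dog_2d, mode='nearest') - c_inh
        m = np.maximum(m, 0.0)                          # clamp negatives to zero
    return m

normalized = normalize_map(np.random.rand(128, 300))    # 3 s map at 10 ms frames
```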
2.3.3 Saliency Score
The maximum of the saliency map defines the most salient location in the 2-D auditory spectrum. In vision, the visual saliency map is scanned sequentially to find locations in order of decreasing saliency. However, there is neither an available saliency ranking for prominent syllables, nor information regarding which frequency location makes a syllable prominent at a given time point. Here, we assume that saliency combines additively across frequency channels. The saliency map for the given sound frame is first scaled back to the original size (scale 1), then summed across frequency channels for each time point, and normalized to the [0, 1] range (over a 3 s window), yielding a saliency score S(t) for each time point t:
S(t) = \frac{1}{\max_t \sum_w S(w, t)} \sum_w S(w, t)   (2.6)
The local maxima of the saliency score mark, in time, the locations of conspicuous (prominent) events in the incoming sound.
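A compact sketch of the saliency-score computation of Eq. 2.6 and of the subsequent peak picking used in the experiments follows; the peak-picking details beyond "local maxima above a threshold" are my own assumptions.

```python
import numpy as np

def saliency_score(saliency_map):
    """Collapse a (frequency x time) saliency map into S(t), Eq. 2.6:
    sum across frequency channels, then normalize by the maximum over time."""
    s = saliency_map.sum(axis=0)
    return s / (s.max() + 1e-12)

def salient_time_points(score, threshold=0.2):
    """Return indices of local maxima of S(t) that exceed the threshold."""
    peaks = []
    for t in range(1, len(score) - 1):
        if score[t] > threshold and score[t] >= score[t - 1] and score[t] > score[t + 1]:
            peaks.append(t)
    return peaks

# Example: a syllable whose time span contains a peak would be marked prominent
S = saliency_score(np.random.rand(128, 300))   # 3 s of audio at 10 ms frames
print(salient_time_points(S, threshold=0.2))
```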
2.4 Database
To test our auditory saliency model, the Boston University Radio News Corpus (BU-RNC) database [61] was used in the experiments. Being one of the largest speech databases with manual prosodic annotations has made the radio news corpus highly popular for prosodic event detection experiments in the literature. The corpus contains recordings of broadcast news-style read speech from seven speakers (3 female and 4 male). Data for six speakers has been manually labelled with Tones and Break Indices (ToBI) [75] style prosodic tags, totaling about 3 hours of acoustic data. The database also contains the orthographic transcription corresponding to each spoken utterance, together with time alignment information at the phone and word level. To obtain the syllable-level time-alignment information, the orthographic transcriptions are syllabified using the rules of English phonology [36], and then the syllable-level time alignments are generated using the phone-level alignment information given with the corpus. Part-of-speech (POS) tags for the orthographic transcriptions are also provided with the corpus.
We mapped all pitch accent types (H*, L*, L*+H, etc.) to a single stress label, reducing the task to a two-class problem. Hence, the syllables annotated with any type of pitch accent were labelled "prominent", and "non-prominent" otherwise. The database consists of 48,852 syllables and 29,855 words. Also, we derived word-level prominence tags from the syllable-level prominence tags: the words that contain one or more prominent syllables are labelled as prominent, and non-prominent otherwise. The prominent syllable fraction in the BU-RNC corpus is 34.3%, and the prominent word fraction is 54.2%; these fractions determine the corresponding chance levels. We chose this database for two main reasons: i) syllables are stress labelled based on human perception, and ii) it allows an easy, concrete evaluation of our algorithm since stress labels and time alignment information are available.
2.5 Experiments and Results
This section presents the details of the experiments conducted together with prominence
detection test results. For each experiment, we report prominent syllable detection
accuracy (Acc) together with precision (Pr), recall (Re) and F-score (Fs) values. These
measures are
Acc = \frac{tp + tn}{tp + fp + tn + fn} \times 100,   (2.7)

Pr = \frac{tp}{tp + fp}, \quad Re = \frac{tp}{tp + fn}, \quad Fs = \frac{2 \cdot Pr \cdot Re}{Pr + Re},

where tp and tn denote true positives and true negatives, and fp and fn denote false positives and false negatives.
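For completeness, these measures can be computed directly from the confusion counts; a trivial sketch follows (the function name and the example counts are mine, not taken from the thesis).

```python
def detection_measures(tp, fp, tn, fn):
    """Accuracy (%), precision, recall and F-score from confusion counts (Eq. 2.7)."""
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    pr = tp / (tp + fp) if (tp + fp) else 0.0
    re = tp / (tp + fn) if (tp + fn) else 0.0
    fs = 2.0 * pr * re / (pr + re) if (pr + re) else 0.0
    return acc, pr, re, fs

# e.g. detection_measures(tp=120, fp=40, tn=300, fn=40) -> (84.0, 0.75, 0.75, 0.75)
```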
In order to detect prominent syllables in speech, we generated a saliency score as a function of time, S(t), as explained in Section 2.3.3. The local maxima of S(t) mark conspicuous prominent syllable locations. Thus, in the experiments, the local maxima of S(t) which are above a threshold (th) are found, and the syllable at the corresponding time point is marked as prominent.
This section is organized as follows: first, the effect of the saliency score threshold on the prominence detection performance is investigated in Section 2.5.1. Next, the contribution of features to the prominence detection task is examined in Section 2.5.2. Finally, the inhibition duration of the normalization filter is analyzed in Section 2.5.3.
2.5.1 Saliency Score Threshold Selection
To analyze the sensitivity of the algorithm to the chosen threshold, the threshold is varied from 0.05 to 0.95. The prominent syllable detection performance as a function of the threshold is shown in Fig. 2.3. As one would expect, as the threshold increases, the precision rate increases whereas the recall rate decreases. It can be concluded from Fig. 2.3 that the performance of the proposed algorithm is not sensitive to the selection of the threshold value; i.e., the accuracy does not change dramatically for varying thresholds, and it is well above chance level for all threshold values. Any threshold value between 0.1 and 0.3 can be considered reasonable. It is also important to note that more than 80% of the "most salient" locations, i.e., locations marked salient with th > 0.9, correspond to an actual prominent syllable (for th > 0.9, precision > 0.80). This supports the observation that prominent syllables attract auditory attention.
Figure 2.3: Prominent syllable detection performance vs. threshold (with IFTO features).
The contribution of each feature to the prominent syllable detection task and the window size used for the iterative nonlinear normalization are explored next.
2.5.2 Analysis of Features
The contribution of each feature to the prominent syllable detection task is examined, and the results are reported in Table 2.1. The reliability measures in Tables 2.1-2.2 are obtained after averaging across the threshold values between 0.1 and 0.3. The initial letter of each feature name is used in the tables and figures to denote the corresponding conspicuity map, i.e., I = Intensity, F = Frequency contrast, T = Temporal contrast, O = Orientation, P = Pitch. A combination of letters indicates the conspicuity maps that contribute to the saliency map in Eq. 2.4.

Table 2.1: Prominent Syllable Detection Performance; I: Intensity, T: Temporal contrast, F: Frequency contrast, O: Orientation, P: Pitch

Features   Acc.     Pr     Re     Fs
I          74.7%    0.63   0.74   0.68
P          65.9%    0.51   0.85   0.66
IFT        75.2%    0.63   0.77   0.69
IFTO       75.9%    0.64   0.79   0.71
IFTOP      73.0%    0.59   0.82   0.69

Table 2.2: Prominent Word Detection Performance

Features   Acc.     Pr     Re     Fs
IFTO       78.1%    0.78   0.86   0.82

The best performance for the prominent syllable detection task is 75.9% accuracy with an F-score of 0.71, obtained when the auditory saliency map consists of the I, F, T and O features (IFTO). Even though pitch is an important prosodic cue, the performance obtained with only the pitch feature (P) is low (Acc = 65.9%), and when it is combined with the rest of the features it causes performance degradation (IFTOP performs worse than IFTO). This can be due to two reasons: i) even though auditory experiments show that humans perceive pitch, where and how pitch is computed and decoded in the brain is still ambiguous [72], so the pitch feature may not be modelled correctly in the proposed framework; ii) in line with the findings of the study in [43], loudness (or intensity here) predicts syllable prominence, and pitch does not contribute much to the syllable prominence task.
The word prominence performance is evaluated similarly and is summarized in Table 2.2. We achieved 78.1% accuracy with an F-score of 0.82 for the word prominence task.
Figure 2.4: Prominent syllable detection performance vs. inhibition window size (with IFTO features).
2.5.3 Analysis of Inhibition Window Size
The inhibition duration used in the iterative normalization step of Section 2.3.2 was investigated. The inhibition duration is varied from 0.2 s (μ) to 0.5 s (μ + 3·std), resulting in a normalization window size from 0.6 s to 1.2 s (a 0.2 s excitation phase is preceded and followed by an inhibition phase, considering the previous and next syllables). The results are presented in Fig. 2.4. While the accuracy stays almost constant for varying inhibition durations, precision increases and recall decreases with increasing inhibition duration. This indicates that when the inhibition duration increases, the normalization algorithm promotes the most salient location and suppresses less conspicuous locations (i.e., precision increases).
These results are encouraging given that the average inter-transcriber agreement for manual annotators is 80-85% for stress labelling [61]. The results also compare well against previously reported performance levels for unsupervised prosody labeling with the BU-RNC database; e.g., in [3], an unsupervised method obtained 77% accuracy at the syllable level using acoustic, lexical, and syntactic features.
2.6 Conclusion
In this chapter, an auditory saliency map based on a model of bottom-up, stimulus-driven auditory attention is presented. A set of auditory features is extracted in parallel from the auditory spectrum of the sound and fed into a master saliency map in a bottom-up manner. This structure provides fast selection of conspicuous events in an acoustic scene, which can then be further analyzed by more complex and time-consuming processes. The model could successfully detect prominent syllable and word locations in read speech with 75.9% and 78.1% accuracy, respectively. The proposed bottom-up auditory attention model is task-independent and language-independent, and works in an unsupervised manner.
The auditory saliency model proposed here is not limited to prominence labelling. For example, it can be used in general computational auditory scene analysis applications to select conspicuous events rapidly. Similar to selective attention in humans [2], after a conspicuous location is selected (focused on), it can be analyzed further to recognize the details of the object. A saliency-driven acoustic scene recognition method which builds on this idea is proposed in Chapter 7.
Some of the other possible applications for the bottom-up attention model are salient
event/object detection, change point detection, and novelty detection; hence the model
can be used for detection and segmentation purposes in spoken language and audio
processing applications.
In this work, features are combined with equal weights to create the saliency map.
In the next chapter, the features will be weighted and combined to model the task-dependent influences on attention for a given task.
Chapter 3:
A Top-Down Task-Dependent Auditory Attention Model
with Acoustic Features
3.1 Chapter Summary
A top-down task-dependent model guides attention to likely target locations in cluttered scenes. Here, a novel biologically plausible top-down attention model is presented to model such task-dependent influences for a given task. First, multi-scale features are extracted from an acoustic scene based on the processing stages in the central auditory system and converted to low-level auditory gist features. The auditory gist features capture the overall properties of the scene. Then, the auditory attention model biases the gist features in a task-dependent way to maximize target detection. The proposed top-down attention model is tested on detecting prominent syllables in speech, and experimental results are reported. Also, a comprehensive analysis of the features used in the auditory attention model is presented in this chapter.
3.2 Introduction
The bottom-up saliency-driven attention detects the objects that perceptually stand out of the scene by differing significantly from their neighbors. For example, in vision, a red flower among the green leaves of a plant will be salient to an observer. Similarly, in audition, the sound of a gun shot in a street will perceptually stand out of the traffic/babble noise of the street. On the other hand, the top-down task-relevant (goal-driven) process uses prior knowledge and learned past expertise to focus attention on the target locations in a scene. For example, in vision it was shown that gaze patterns depend on the task performed while viewing the same scene [93]. The gaze of the observer fell on faces when estimating the people's age, but fell on clothing when estimating the people's material conditions. Similarly, in a cocktail party setting, the attention of a subject may shift to the speech sound if the task is "who is speaking, what?", while attention may shift to the music if the task is "which instruments are being played?".
As stated previously, the task-independent bottom-up attention finds the time points where there is a target/source that pops out perceptually. For example, the proposed bottom-up auditory attention model detailed in Chapter 2 could detect prominent syllables in speech. However, when humans are asked to find the prominent (stressed) syllable, they also use their prior task-relevant knowledge, such as the fact that prominent words have longer durations [28]. The motivation for this work is to analyze the effect of top-down task-dependent influences on auditory attention for a given task.
In [57, 64, 89], the top-down influence of task on visual attention was modelled. A guided visual search model was proposed in [89] that combines the weighted feature maps in a top-down manner; i.e., given the task of detecting a red bar, the feature maps which are sensitive to red color receive a larger gain factor. This top-down biasing is based on the evidence that neural responses are modulated by task-dependent attention [19]. The authors in [57] presented a model that tunes the bottom-up features based on the properties of both the target and distracter items for visual search problems, e.g., finding a cell phone in the cluttered scene of a desk. The authors of [64] built a separate top-down model to predict where subjects' gaze patterns fall while performing a task of interest. This method was tested with recorded eye movements of people playing video games and performed well.
In this chapter, we propose a novel biologically plausible top-down model which guides attention during acoustical search for a target. The auditory attention model proposed here is based on the "gist" phenomenon commonly studied in vision. Given a task and an acoustic scene, the attention model first computes biologically inspired low-level auditory gist features to capture the overall properties of the scene. Then, the auditory gist features are biased to imitate the modulation effect of the task on neuron responses so as to reliably detect a target or perform a specific task. It should be noted that the proposed top-down auditory attention model is a generic model with a variety of applications, e.g., speaker recognition, context recognition, etc., as discussed further in Chapter 3.5. However, in this chapter, the model is used for prominent syllable detection to connect it with the work in Chapter 2, and the experimental results indicate that the top-down model provides approximately 10% absolute improvement over the bottom-up attention model.
This chapter is organized as follows: the top-down auditory attention model with
gist feature extraction is explained in Section 3.3, followed by experimental results in
Section 3.4. The conclusions drawn are presented in Section 3.5.
3.3 Top-Down Task-Dependent Model with Auditory Gist Features
The block diagram of the auditory gist feature extraction is illustrated in Fig. 3.1. The feature maps are extracted from the input sound by using the front-end of the bottom-up auditory attention model detailed in Chapter 2, since it mimics the processing stages in the human auditory system. Hence, the bottom-up and top-down models share the multi-scale feature extraction module and the center-surround operation, which finally yields the feature maps. This also saves some computational cost when the bottom-up and top-down models are combined later as in Chapter 5.
In the top-down attention model as shown in Fig. 3.1, first, an auditory spectrum of the sound is computed based on the early stages of the human auditory system. This two-dimensional (2D) time-frequency auditory spectrum is akin to an image of a scene in vision. Then, a set of multi-scale features is extracted in parallel from the auditory spectrum of the sound based on the processing stages in the central auditory system (CAS) to produce feature maps. The auditory gist of the scene is extracted from the feature maps at low resolution to guide attention during target search, and the attention model biases the gist features to imitate the modulation effect of the task on neuron responses using weights learned for a given task. The steps of the auditory attention model, i.e., multi-scale feature maps and auditory gist extraction followed by task-dependent biasing of the gist features, are discussed next.
Figure 3.1: Auditory gist feature extraction.
3.3.1 Multi-Scale Auditory Features
The top-down auditory attention model presented here consists of intensity (I), frequency contrast (F), temporal contrast (T), orientation (O), and pitch (P) features. All features are extracted as detailed in Chapter 2, except pitch. Pitch features are extracted differently from the ones in Chapter 2, due to their poor performance in the earlier experiments with the bottom-up attention model. Instead of the absolute pitch value (as in Chapter 2), here pitch variation is captured using the following 2D spectro-temporal receptive filters:
- Frequency contrast RF: to capture pitch variations over the tonotopic axis, i.e., spectral pitch changes. These pitch features are denoted as P_F.
- Orientation RFs with θ = 45° and θ = 135°: to capture rising and falling pitch behavior. These pitch features are denoted as P_O.
The frequency contrast and orientation receptive filters used to capture pitch variation are the same as those described in Chapter 2.
The features are extracted at multiple scales as detailed in Chapter 2.3.1. Eight scales are created if the scene duration D is large enough; otherwise there are fewer scales. In the implementation of the center-surround differences, the across-scale subtraction between two scales is computed by interpolation to the finer scale followed by point-wise subtraction. As in the bottom-up auditory attention model, c ∈ {2, 3, 4} and s = c + δ with δ ∈ {3, 4} are used, which results in six feature maps for each feature set when features are extracted at eight scales. In summary, there are eight feature sets in the model: one feature set for each of I, F, T, two feature sets for O_θ (with θ = 45° and θ = 135°), and three feature sets for P (P_F, and P_O with θ = 45° and θ = 135°), which results in a total of 48 feature maps.
3.3.2 Auditory Gist Features
The top-down auditory attention model proposed here is based on the "gist" phenomenon commonly studied in vision. [60] defines two types of gist: perceptual gist and conceptual gist. Perceptual gist refers to the representation of a scene built during perception, and conceptual gist includes the semantic information inferred from a scene and stored in memory. Here, we focus on perceptual gist. A reverse hierarchy theory related to perceptual gist was proposed in [29] for vision. Based on this theory, gist processing is a pre-attentive process and guides attention to focus on a particular subset of stimulus locations to analyze the details of the target locations. The gist of a scene is captured by humans rapidly, within a few hundred milliseconds of stimulus onset, and describes the type and overall properties of the scene; i.e., after very brief exposure to a scene, a subject can report general attributes of the scene, such as whether it was indoors, outdoors, a kitchen, street traffic, etc. [29, 58]. In [26], a review of gist perception is presented, and it is argued that gist perception also exists in audition.
Our gist algorithm is inspired by the work in [29] and [73]. We formalize gist as a relatively low-dimensional acoustic scene representation which describes the overall properties of a scene at low resolution; hence we represent gist as a feature vector [73]. Then, the task-dependent top-down attention is assumed to focus processing on specific dimensions of the gist feature vector to maximize the task performance, which is implemented using a learner as discussed in Section 3.3.3. The details of the auditory gist extraction algorithm are presented next.
A gist vector is extracted from the feature maps of I, F, T, O_θ, and P such that it covers the whole scene at low resolution. A feature map is divided into an m-by-n grid of sub-regions, and the mean of each sub-region is computed to capture rough information about the region, which results in a gist vector of length m·n. For a feature map M_i with height h and width v, the computation of the gist features can be written as:

G_i^{k,l} = \frac{mn}{vh} \sum_{u = kv/n}^{(k+1)v/n - 1} \; \sum_{z = lh/m}^{(l+1)h/m - 1} M_i(u, z),   (3.1)

for k \in \{0, \ldots, n-1\}, l \in \{0, \ldots, m-1\}, where i is the feature map index, i.e., i \in \{1, \ldots, 48\} for features extracted at eight scales. Averaging is the simplest neuron computation, and the use of other second-order statistics such as variance did not provide any appreciable benefit for our application. An example of gist feature extraction with m = 4, n = 5 is shown in Fig. 3.1. After extracting a gist vector from each feature map, we obtain the cumulative gist vector by combining them. Then, principal component analysis (PCA) is used to reduce redundancy and dimensionality, making the subsequent machine learning more practical while still retaining 99% of the variance. The final auditory gist feature (after PCA) is denoted by f, and its dimension is denoted by d in the rest of the chapter.
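The grid averaging of Eq. 3.1 and the subsequent PCA step can be sketched as follows. This is a hedged illustration under stated assumptions: the integer-slicing approximation of the grid boundaries and the use of scikit-learn's PCA are my own choices, not the implementation used in the thesis; only the 99% retained-variance criterion is taken from the text.

```python
import numpy as np
from sklearn.decomposition import PCA

def gist_vector(feature_map, m=4, n=5):
    """Eq. 3.1 (sketch): average each cell of an m-by-n grid over a (h x v) map."""
    h, v = feature_map.shape
    gist = np.empty(m * n)
    for l in range(m):              # m frequency bands
        for k in range(n):          # n time segments
            rows = slice(l * h // m, (l + 1) * h // m)
            cols = slice(k * v // n, (k + 1) * v // n)
            gist[l * n + k] = feature_map[rows, cols].mean()
    return gist

def cumulative_gist(feature_maps, m=4, n=5):
    """Concatenate the gist vectors of all feature maps (48 maps -> 48*m*n values)."""
    return np.concatenate([gist_vector(fm, m, n) for fm in feature_maps])

# Example: reduce the cumulative gist vectors of many scenes with PCA,
# keeping 99% of the variance as stated above.
scenes = [[np.random.rand(32, 15) for _ in range(48)] for _ in range(200)]
G = np.vstack([cumulative_gist(maps) for maps in scenes])
f = PCA(n_components=0.99).fit_transform(G)    # final gist features, dimension d
print(f.shape)
```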
Figure 3.2: Top-down task-dependent model structure. Training phase: the weights are learned in a supervised manner. Testing phase: auditory gist features are biased with the learned weights to estimate the top-down model prediction.
3.3.3 Task-Dependent Biasing of Acoustic Cues
As stated in Section 1.2, top-down task-dependent influences modulate neuron responses in the auditory cortex while searching for a target [19, 30, 53, 55]. This modulation mostly occurs by enhancing the response of neurons tuned to the features of the target stimuli, while attenuating the response of neurons to stimuli that do not match the target features [19, 23, 39, 53]. Thus, we formalize the task-dependent top-down process as follows: given a task (which is prominence detection in this chapter), the top-down task-dependent auditory attention model biases the auditory gist features with weights learned in a supervised manner for the task, such that it enhances the specific dimensions of the gist features that are related to the task while attenuating the effect of dimensions which are not. Here, the weights are learned in a supervised manner as illustrated in Fig. 3.2: first the data is split into training and test sets. In the training phase, gist features f_i are extracted from the scenes in the training set and compiled together with their corresponding prominence class categories C_i. The features are stacked and passed through a "learner" (a machine learning algorithm) to discover the weights. In the testing phase, scenes that are not seen in the training phase are used to test the performance of the top-down model. For a given test scene, the gist of the scene is extracted and biased with the learned weights to estimate the top-down model prediction Ĉ. Here, a 3-layer neural network is used to implement the learner in Fig. 3.2, as discussed in detail in Section 3.4. The reason for using a neural network is that it is biologically well motivated: it mimics the modulation effect of task-dependent attention on the neural responses.
In this context, the term "scene" is used to refer to the sound around a syllable, and the task is to determine whether a prominent syllable exists in the scene. For the experiments, a scene is generated for each syllable in the database by extracting the sound surrounding the syllable with an analysis window of duration D centered on the syllable. The analysis of scene duration D is described later in Section 3.4.1.
3.4 Experiments and Results
To test our top-down task-dependent model with gist features, the Boston University Radio News Corpus database was used in the experiments. We chose this database for two main reasons: i) syllables are stress labelled based on human perception, and ii) since it carries labelled data, it helps us to learn the task-dependent influences in a supervised fashion.
All of the experimental results presented here are estimated using the average of five-fold cross-validation. We used all the data manually labelled with ToBI style prosody tags and randomly split it into five groups (G1, G2, G3, G4, G5) to create five cross-validation sets. In each set, four groups were used for training and one group was used for testing, i.e., 80% of the data was used for training and the remaining 20% was retained for testing. For example, G2, G3, G4, G5 were used for training when G1 was used for testing. On average, the number of syllables in the training and test sets was 39,082 and 9,770, respectively. The number of unique syllables in the training sets was 2,894, while it was 1,728 for the test sets (averaged over the five cross-validation sets). The average number of OOV syllables in the test sets was 220 (12.7% relative to the test vocabulary). The baseline prominence accuracy, which is the chance level, is 65.7% at the syllable level.
The Wilcoxon signed rank test is used to report the confidence level in terms of significance values (p-values) whenever we make comparisons throughout this dissertation. Here, the comparison is performed in terms of the achieved accuracy values for the samples. The Wilcoxon signed rank test is a non-parametric test, and it does not make any assumption about the distributions of the samples. The test is available as part of the MATLAB software, and more information about the test can be found in [48].
The learner in Fig. 3.2 is implemented using a 3-layer neural network with d inputs, (d + n)/2 hidden nodes and n output nodes, where d is the length of the gist feature vector after PCA dimension reduction, and n = 2 since this is a two-class problem. The output of the neural network can be treated as a class posterior probability, and the class with the higher probability is taken as the top-down model prediction. The reason for using a neural network is that it is biologically well motivated: it mimics the modulation effect of task-dependent attention on the neural responses.
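As an illustration of the learner described above, the sketch below builds a 3-layer network with d inputs, (d + n)/2 hidden nodes, and n = 2 output classes. The choice of scikit-learn's MLPClassifier and its training settings are assumptions made for the sketch, not the configuration used in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_learner(d, n=2):
    """3-layer network: d inputs, (d + n) / 2 hidden nodes, n output classes."""
    hidden = (d + n) // 2
    return MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=500)

# Example: train on gist features f (N x d) with prominence labels C in {0, 1},
# then take the class with the higher posterior as the model prediction C_hat.
d = 94
f_train, C_train = np.random.rand(1000, d), np.random.randint(0, 2, 1000)
f_test = np.random.rand(10, d)
learner = build_learner(d).fit(f_train, C_train)
posterior = learner.predict_proba(f_test)       # class posterior probabilities
C_hat = posterior.argmax(axis=1)                # top-down model prediction
```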
This section is organized as follows: first, an analysis of scene duration is presented in Section 3.4.1, followed by an analysis of grid size selection in Section 3.4.2. This is followed, in Section 3.4.3, by an analysis of the auditory attention features using mutual information and by prominence detection experiments conducted with each individual feature and their combinations.
3.4.1 Analysis of Scene Duration
The role of scene duration is investigated in the experiments, and the results are discussed in this section. A scene is generated for each syllable in the database. As explained earlier, scenes are produced by extracting the sound around each syllable with an analysis window that centers on the syllable. We used the statistics of the BU-RNC to determine a range of values for the scene window duration D. It was found that the mean syllable duration is approximately 0.2 s with a 0.1 s standard deviation, and the maximum duration is 1.4 s for the database. The scene duration is varied starting from 0.2 s, considering only the syllable by itself, up to 1.2 s, considering the neighboring syllables.
In order to get full temporal resolution while analyzing the scene duration, at the gist feature extraction stage each feature map is divided into 1-by-v grids, where v is the width of a feature map. This results in a (1 × v) = v dimensional gist vector for a single feature map, and v varies with scene duration and center scale c. For instance, when D = 0.6 s, the early auditory (EA) model outputs a 128 × 60 dimensional scene, since there are 128 band-pass filters used for cochlear filtering in the EA model and the analysis window is shifted by 10 ms.
Figure 3.3: Performance of prominence detection as a function of scene duration for grid sizes of 1-by-v and 4-by-5 (v: width of a feature map, Acc: Accuracy (%), F-sc: F-score (%)).
Then, we can extract features at up to six scales, which enables the center-surround operation at scales (c - s) = {(2-5), (2-6), (3-6)} and in turn generates three feature maps. The width v of a feature map is a function of the scene duration and the center scale c; given that the analysis window shift is 10 ms, v = (D/0.01)/2^(c-1). When the dimension of a scene is 128 × 60 and 1-by-v is the grid size, the dimension of the gist vector of a single feature set is (30 + 30 + 15) = 75 (since v is 30 at c = 2 and 15 at c = 3), finally resulting in a cumulative gist vector of (75 × 8) = 600 dimensions (one feature set for each of I, F, T, two sets for O_θ since θ = {45°, 135°}, and three sets for P (P_F, and P_O with θ = {45°, 135°}), for a total of 8 sets). After principal component analysis, the dimension of the gist vector is reduced to d = 60.
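The dimension bookkeeping in this example can be verified with a few lines; this is a trivial sketch that simply re-derives the numbers quoted above for D = 0.6 s with 1-by-v grids.

```python
# v = (D / 0.01) / 2**(c - 1): frames per feature map at center scale c (10 ms shift)
D = 0.6
frames = round(D / 0.01)                          # 60 frames in a 0.6 s scene
v = {c: frames // 2 ** (c - 1) for c in (2, 3)}   # {2: 30, 3: 15}

# three feature maps per set at (c, s) = (2,5), (2,6), (3,6) -> 30 + 30 + 15 = 75
per_set = v[2] + v[2] + v[3]
cumulative = per_set * 8        # 8 feature sets (I, F, T, 2 x O, 3 x P) -> 600
print(frames, v, per_set, cumulative)
```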
Table 3.1: Prominent Syllable Detection Performance with Task-Dependent Auditory Attention Model

            1-by-v grids          4-by-5 grids
D (s)       d    Acc.     F-sc    d    Acc.     F-sc
0.2         12   77.89%   0.64    36   81.08%   0.7
0.4         37   82.98%   0.74    84   84.63%   0.77
0.6         60   84.79%   0.77    94   85.59%   0.78
0.8         98   85.12%   0.78    134  85.67%   0.79
1.0         123  84.80%   0.77    130  85.44%   0.78
1.2         153  84.59%   0.77    138  84.12%   0.76
The performance of prominence detection obtained with the 1-by-v grid size is shown in Fig. 3.3 as a function of scene duration. It can be observed from Fig. 3.3 that the accuracy does not change significantly for varying scene durations when D ≥ 0.6 s. The best performance is achieved when D = 0.8 s. The prominence performance is poor for short scene durations (D < 0.6 s), especially for the case when D = 0.2 s (the scene approximately includes only the syllable by itself). This essentially indicates that the prominence of a syllable is affected by its neighboring syllables.
In Table 3.1, the results for selected values of scene duration are presented with the values of accuracy, F-score (F-sc), and the dimension of the gist features after PCA (d). The gist feature dimension d gets larger, requiring a larger neural network for training, when the scene duration is larger, as shown in Table 3.1. When the 1-by-v grid size is used for extracting auditory gist features, the best prominent syllable detection performance achieved is 85.1% accuracy with an F-score of 0.78, obtained when D = 0.8 s.
Table 3.2: Prominent Syllable Detection Performance with Bottom-Up Saliency-Based Attention Model [37]

D (s)   d    Acc     Pr     Re     F-sc
0.6     NA   75.9%   0.64   0.79   0.71
3.4.2 Analysis of Grid Size
In this section, the effect of grid size on the prominence detection performance is examined. The resolution in the frequency domain is increased by a factor of four while reducing the temporal resolution so that the dimension stays compact. At the gist feature extraction stage, each feature map is divided into 4-by-5 grids, resulting in a fixed (4 × 5) = 20 dimensional gist vector for a single feature map. This is different from Section 3.4.1, where the dimension of a gist vector for a feature map varies with v (hence with scene duration). As in the previous example in Section 3.4.1, when D = 0.6 s, the model generates 3 feature maps per feature set and hence a 20 × 3 × 8 = 480 dimensional cumulative gist feature vector. After principal component analysis, the dimension is reduced to 94. As listed in Table 3.1, for all scene durations, this selection results in a larger dimensional gist feature vector compared to the 1-by-v grid size; i.e., for a scene duration of 0.6 s the dimension of the gist feature with 1-by-v grid size is 60, whereas it is 94 with the 4-by-5 grid size. This indicates that the gist features obtained with 4-by-5 grids carry more diverse information about the scene compared to those obtained with 1-by-v grids. The results obtained with 4-by-5 grids for varying scene durations are also reported in Fig. 3.3 and Table 3.1. The best performance achieved with the 4-by-5 grid size is 85.7% accuracy with an F-score of 0.79, obtained again when D = 0.8 s.
The performance obtained with the 4-by-5 grid size selection is better than that obtained with the 1-by-v grid size (p ≤ 0.001), except for the scene duration of 1.2 s (ref. Fig. 3.3). This might be due to the fact that the temporal resolution is not adequate with the 4-by-5 grid size selection for large scene durations. This also indicates that the scene duration should be factored in while choosing the temporal grid size that determines the temporal resolution: larger scene durations might need larger temporal grids in order to obtain adequate temporal resolution. Also, even though the best performance obtained with both grid sizes is with a scene duration of D = 0.8 s, this is not significantly better than the results obtained with a scene duration of 0.6 s (at p ≤ 0.001). Hence, we fix the scene duration at 0.6 s in the rest of the experiments since it is computationally less expensive (the feature dimension is smaller, and so is the neural network).
The results obtained with the unsupervised bottom-up (BU) attention model from Chapter 2 are also summarized in Table 3.2 for comparison purposes. The top-down auditory attention model provides approximately 10% absolute improvement over the unsupervised bottom-up auditory attention model.
3.4.3 Analysis of Auditory Attention Features
In this subsection we present an analysis of the auditory attention features using mutual information and prominence detection experiments conducted with each individual feature and their combinations. The scene duration is fixed at 0.6 s for the analysis due to its sufficient performance as discussed in Sections 3.4.1 and 3.4.2. First, the pitch feature sets are analyzed to provide insight into the features extracted with different receptive filters. Then, mutual information estimates are presented to measure the amount of redundancy between features and also the amount of information each feature set and their combinations provide about syllable prominence.
3.4.3.1 Analysis of Pitch Features
As described in Chapter 2.3.1, pitch is extracted from the auditory spectrum and then mapped onto the tonotopic axis, assuming that the auditory neurons at the cochlear location corresponding to the pitch are fired. Then, this 2D representation is analyzed to capture pitch behavior using frequency contrast and orientation receptive filters. Pitch analysis results for a sample speech scene are illustrated in Fig. 3.4. The top panel shows the mapped pitch contour itself. The gist feature vectors obtained from this contour using frequency contrast and orientation filters are shown below it. Here, only the raw gist vector (without PCA) obtained from the feature map with center scale c = 2, surround scale s = 5 and grid size 1-by-v (v is the width of a feature map) is shown. The vector is interpolated to scale 1 for time-alignment purposes. It can be observed from the figure that, for the segments where pitch is rising, the gist values obtained with the 45° orientation RF show high activity, whereas for the segments where pitch is falling, the gist values obtained with the 135° orientation RF show high activity. Also, it can be observed that when the duration of pitch rising/falling is longer, in other words when the pitch variation is larger, the gist components are larger; i.e., in Fig. 3.4 the pitch rising from 0.1 s to 0.2 s results in gist values with O_45° that are larger compared to the gist values obtained with O_135° when pitch is falling for a shorter duration around 0.3 s or 0.5 s.
Figure 3.4: Pitch analysis of a speech scene with grid size of 1-by-v: (a) pitch; output obtained with (b) frequency contrast RF, (c) orientation RF with 45° rotation, (d) orientation RF with 135° rotation.
The gist vector extracted with the frequency contrast RF helps to detect voiced regions, especially the segments where there is a pitch plateau. The same speech segment is analyzed with the 4-by-5 grid selection, and this is illustrated in Fig. 3.5. It has similar characteristics to the gist features extracted with 1-by-v grids, except that with the 4-by-5 grid size the frequency resolution is higher and the range of pitch values is also roughly encoded in the gist features (place coding).
Figure 3.5: Pitch analysis of a speech scene with grid size of 4-by-5: (a) pitch; output obtained with (b) frequency contrast RF, (c) orientation RF with 45° rotation, (d) orientation RF with 135° rotation.
Table 3.3: Prominent Syllable Detection Performance with only Pitch Features

                                1-by-v grids          4-by-5 grids
Pitch Feature                   d    Acc.     F-sc    d    Acc.     F-sc
P_F                             21   73.90%   0.57    17   79.44%   0.67
P_O45°                          15   76.10%   0.60    30   79.48%   0.67
P_O135°                         14   74.99%   0.58    29   78.65%   0.65
P_O45° & P_O135°                26   78.88%   0.66    44   80.80%   0.69
P_F & P_O45° & P_O135°          42   80.13%   0.68    54   81.26%   0.70
These results indicate that the gist features obtained from the pitch contour using the F, O_45° and O_135° receptive filters capture the pitch variations and behavior. Also, there is no need for normalization, since the gist features capture variations rather than absolute values. Finally, the prominence detection performances obtained using only pitch features are detailed in Table 3.3 for both the 1-by-v and 4-by-5 grid sizes. The best performance with pitch is achieved when all three RFs (F, O_45°, O_135°) are used to extract the pitch gist features. Also, 4-by-5 grids perform better than 1-by-v grids, and the best achieved performance is 81.26% accuracy with an F-score of 0.69, obtained using 4-by-5 grids. In the rest of the chapter, the pitch features are extracted from the pitch contour using all three receptive filters (F, O_45°, O_135°) and are denoted by P to prevent confusion with the other features.
3.4.3.2 Feature Analysis with Mutual Information Measure
Here, we use the mutual information (MI) measure to analyze the intensity, frequency contrast, temporal contrast, orientation, and pitch features. In particular, the MI between the individual features (and their combinations) and the prominence classes is computed to measure the statistical dependency between each feature and syllable prominence. Also, the MI between feature pairs is computed to measure the redundancy between features.
The MI between continuous random variables X and Y can be written in terms of the differential entropies as

I(X; Y) = H(X) + H(Y) - H(X, Y),   (3.2)

where

H(X) = -\int p_x(x) \log p_x(x) \, dx   (3.3)

H(Y) = -\int p_y(y) \log p_y(y) \, dy   (3.4)

H(X, Y) = -\int\!\!\int p_{xy}(x, y) \log p_{xy}(x, y) \, dx \, dy.   (3.5)

The mutual information is always non-negative, and it is zero if and only if X and Y are independent. The joint probability density function p_{xy}(x, y) is required to estimate the mutual information between X and Y. When it is not available, X and Y are usually quantized with a finite number of bins, and the MI is approximated by a finite sum as follows:

I_{quan}(X; Y) = \sum_{i,j} p_{xy}(i, j) \log \frac{p_{xy}(i, j)}{p_x(i) p_y(j)}.   (3.6)

When the sample size is infinite and all bin sizes tend to zero, I_{quan}(X; Y) converges to I(X; Y). However, the amount of data is usually limited, as in our experiments. Also, these methods are usually limited to one- or two-dimensional variables, whereas here we use high-dimensional vectors in the experiments.
Figure 3.6: Pair-wise MI between raw gist features created using 4-by-5 grid size and 0.6 s scene duration. Only the gist vector extracted from the feature map with c = 2, s = 5 is illustrated. The diagonal elements are set to zero.
To avoid explicit estimation of the joint probability density function, we compute the mutual information using the method proposed in [44], which is based on entropy estimates from k-nearest-neighbor distances. This method is data efficient and adaptive, and the MI estimate has been shown numerically to be exact for independent random variables. As suggested in [44], a mid-range value of k (k = 3) is used in the experiments.
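As a rough illustration of the quantity being estimated, the sketch below computes the binned approximation of Eq. 3.6 with a 2D histogram. Note that the thesis itself uses the k-nearest-neighbor estimator of [44] rather than this binned approximation; the sketch is included only because it follows directly from the equations above, and the bin count is an arbitrary choice.

```python
import numpy as np

def quantized_mi(x, y, bins=16):
    """Histogram approximation of I(X; Y) for two 1-D samples (Eq. 3.6)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                       # joint probability per bin pair
    p_x = p_xy.sum(axis=1, keepdims=True)            # marginal over y
    p_y = p_xy.sum(axis=0, keepdims=True)            # marginal over x
    nz = p_xy > 0                                    # avoid log(0)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

# Example: MI between one gist component and a noisy copy of it
x = np.random.randn(5000)
print(quantized_mi(x, x + 0.5 * np.random.randn(5000)))
```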
First, the amount of redundancy in the gist features is analyzed. Fig. 3.6 illustrates the pair-wise MI between all raw gist components (without PCA) extracted using 4-by-5 grids when the scene duration is 0.6 s. Here, only the gist vector extracted from the feature map with center scale c = 2 and surround scale s = 5 is shown, to make the figure more readable, so the gist vector dimension is 4 × 5 = 20 (each square block in the figure is 20 × 20).
Table 3.4: Mutual information between the gist feature vectors of the feature sets

          I       F       T       O_45°   O_135°  P_O45°  P_O135°  P_F
I         x       5.026   5.576   3.674   3.672   1.391   1.380    1.238
F         5.026   x       3.940   3.434   3.763   1.612   1.673    1.640
T         5.576   3.940   x       3.469   3.571   1.325   1.329    1.115
O_45°     3.674   3.434   3.469   x       3.393   2.045   1.423    1.221
O_135°    3.672   3.763   3.571   3.393   x       1.517   2.162    1.283
P_O45°    1.391   1.612   1.325   2.045   1.517   x       3.795    2.744
P_O135°   1.380   1.673   1.329   1.423   2.162   3.795   x        2.978
P_F       1.238   1.640   1.115   1.221   1.283   2.744   2.978    x
The MI I(G_i^X, G_j^Y) is computed for each pair (G_i^X, G_j^Y), where G^X, G^Y ∈ {I, F, T, O_θ, P_O, P_F} and i, j ∈ {1, ..., 20}. It can be observed that there are non-zero MI results. In the figure, the diagonal square blocks have non-zero MI, i.e., I(G_i^X, G_j^X) ≠ 0 for some (i, j) pairs. In other words, as one would expect, the gist of a feature map has interdependencies with itself for each feature set, since the features are extracted using the same receptive filter. Also, there is redundancy across feature sets; i.e., I(G_i^I, G_j^F) is high for some (i, j) pairs. The comparison of redundancy across feature sets can be made more clearly from the MI results presented in Table 3.4. In Table 3.4, the mutual information between feature sets is computed using all the gist features extracted from all feature maps as a vector. In other words, the table holds the values I(G^X, G^Y), where G^X and G^Y are multi-component vectors, each 20 × 3 = 60 dimensional, since each set has 3 feature maps and each feature map generates a 20-dimensional gist vector. It can be concluded that the gist features extracted from the intensity feature set (I) are highly redundant with the ones extracted from the frequency contrast (F) and temporal contrast (T) feature sets. The orientation gist features (O_θ) have moderate redundancy with the other feature sets' gist features. Finally, the gist features of the pitch (P) feature sets have the least redundancy with the gist features of the remaining feature sets I, F, T, O_θ and more redundancy among themselves, since they are limited to representing only the pitch characteristics of the scene. Due to the redundancy in the gist features, PCA is applied to the cumulative gist feature vector in the proposed model, as shown in Fig. 3.1.
Next, the MI between the individual features and the prominence classes is computed to measure the amount of knowledge about syllable prominence provided by each feature. Hence, I(G^X; C) is estimated, where G^X is a multi-component vector, X ∈ {I, F, T, O_θ, P}, and C is the syllable prominence class. The MI results are listed in Table 3.5 for I, F, T, O_θ. In Table 3.6, the MI values between the pitch features and prominence are listed. The most informative pitch features about prominence are the ones captured with the orientation RFs, P_O45° and P_O135°. The contribution of the P_F features is smaller compared to P_O. These results are in agreement with the prominence detection results using only the pitch features, reported in Table 3.3. We can conclude from the results in Table 3.3 that the contribution of the P_F features is significant when the 1-by-v grid size is used. However, in the case of the 4-by-5 grid size, pitch values are roughly place coded (the frequency resolution is higher) and the P_F features do not contribute much to the prominence detection performance when combined with the P_O features.
We can conclude from the results in Tables 3.5 and 3.6 that, when the individual features are compared, the most informative feature about syllable prominence is the orientation (in the tables, O represents the combination of both directional orientations, i.e., it contains both O_45° and O_135°). Also, even though the features have cross-redundancy as listed in Table 3.4, adding each feature increases the amount of information about syllable prominence.
Table 3.5: MI between I, F, T, O Features and Syllable Prominence Class

Individual Feat.     Combined Features
I   0.2368           TI   0.2636   TO   0.3149   IFT    0.2926
F   0.2596           IF   0.2710   IO   0.3096   TOI    0.3138
T   0.2502           FT   0.2900   FO   0.3207   IFO    0.3229
O   0.3102                                       FTO    0.3278
                                                 IFTO   0.3300
Table 3.6: MI between Pitch Features and Syllable Prominence Class

Individual Feat.         Combined Features
P_F       0.2015         P_O    0.2650   IFTO & P_OF   0.3490
P_O45°    0.2323         P_OF   0.2700
P_O135°   0.2163
Table 3.7: Prominent Syllable Detection Performance with 0.6 s Scene

                    1-by-v grids          4-by-5 grids
Receptive Filter    d    Acc.     F-sc    d    Acc.     F-sc
I                   22   79.62%   0.69    21   80.92%   0.71
F                   15   78.31%   0.66    20   81.79%   0.72
T                   32   80.83%   0.71    25   81.77%   0.72
O                   24   82.52%   0.73    46   84.34%   0.76
P                   42   80.13%   0.68    54   81.26%   0.70
IFTO                58   84.62%   0.77    80   85.45%   0.78
IFTOP               60   84.79%   0.77    94   85.59%   0.78
The highest MI is obtained when all five features (IFTOP) are combined. These results are in agreement with the prominence detection results listed in Table 3.7. The individual feature which achieves the highest accuracy is the orientation, with 84.34% accuracy when the grid size is 4-by-5. The highest accuracy of 85.6% is achieved when all five features are combined. However, the performance achieved with the IFTO and IFTOP features is not significantly different at p ≤ 0.001.
The results obtained with the 1-by-v grid size are also listed in Table 3.7. We can conclude that the 4-by-5 grid size usually results in larger dimensional feature vectors after PCA, indicating that it carries more diverse information about the scene. Also, the prominence results obtained with the 4-by-5 grid size are significantly higher than those obtained with the 1-by-v grid size at p ≤ 0.001. It can be concluded that the scene duration for auditory gist feature extraction can be set to D = 0.6 s and the grid size can be set to 4-by-5 without significant performance degradation. Also, the combination of the I, F, T, O_θ features can be used; i.e., the pitch features can be excluded, since the IFTOP performance is not significantly different from that of IFTO (refer to Table 3.7).
3.5 Conclusion
In this chapter, a task-dependent top-down auditory attention model is presented. A set of multi-scale auditory features is extracted in parallel from the auditory spectrum of the sound and converted into low-level auditory gist features that capture the essence of a scene. The gist features are biased with weights learned in a supervised manner to incorporate the task-dependent influences on the neural responses. The top-down model successfully detects prominent syllables in read speech with up to 85.7% accuracy and provides approximately 10% absolute improvement over the bottom-up attention model. These results are encouraging given that the average inter-transcriber agreement for manual annotators is 85-90% for presence vs. absence of stress labelling [61].
It has been experimentally demonstrated that the prominence of a syllable is affected by its neighboring syllables. The performance obtained with scenes that include approximately only the syllable itself was poor, i.e., short scenes with D = 0.2 s. Considering both the performance and the computational cost, it is concluded that an analysis window duration of 0.6 s is reasonable for the prominent syllable detection task.
One of the strengths of the auditory attention model proposed here is that it is a generic model, and it can be used in other spoken language processing tasks and general computational auditory scene analysis applications such as scene understanding, context recognition, and speaker recognition. Based on the selected application, first, an optimal scene duration should be set. A finer grid size selection at the auditory gist feature extraction stage increases the resolution as well as the computational cost. In this work, two grid sizes, i.e., 1-by-v and 4-by-5, were selected in an ad-hoc manner for the prominence detection task, and they performed sufficiently well. However, for new applications, an appropriate grid size needs to be found considering the balance between resolution (hence task performance) and computational cost. Finding an optimal grid size is an open problem that we are planning to address as part of our future work.
It was shown with mutual information measurements that the raw auditory gist features extracted from the intensity, temporal contrast, frequency contrast, orientation, and pitch features have redundancies. Hence, we applied PCA to the gist features to reduce redundancy and also to reduce the feature dimension. PCA is optimal in the least-squares sense: it retains the components of the data set that contribute most to its variance, under the assumption that these components carry the most important aspects of the data. However, this may not always be the case. In the literature, there are examples of information maximization in neural codes [76]. Thus, as part of our future work, we plan to investigate information maximization criteria to select features.
The top-down information that comes with language has not been considered so
far. In the next chapter, top-down influences of lexical and syntactic knowledge will
be incorporated into the model. Also, the presented top-down model can be combined
with the bottom-up auditory attention model as discussed in Chapter 5.
Chapter 4:
A Top-Down Task-Dependent Auditory Attention Model
with Acoustic and Higher Level Information
4.1 Chapter Summary
Auditory attention is a complex mechanism that involves the processing of low-level
acoustic cues together with higher level cognitive cues. In this chapter, a novel method
is proposed that combines biologically inspired auditory attention cues with higher level
lexical and syntactic information to model task-dependent influences on a given spoken language processing task. The lexical information is modelled using a probabilistic language model, and the syntactic information is modelled using part-of-speech tags. Then, the acoustic model from Chapter 3 is combined with the lexical and syntactic models in a probabilistic framework. The experimental results obtained with the prominent syllable detection task are reported.
4.2 Introduction
Speech is one of the most important sound sources for human listeners. While processing speech stimuli, the brain is influenced by higher level information such as lexical information, syntax, semantics, and the discourse context [34, 67]. For example, one famous result from the dichotic listening experiments reported in [54] is that people may respond to messages on the unattended channel when they hear their own names. In the experiments of [54], 8% of the participants responded to the message "you may stop now" when it was presented on the unattended channel, whereas 33% of the participants responded when the message was preceded by the participant's name. This is similar to what happens in the cocktail party phenomenon. For example, one may hear her/his name being mentioned by someone else across the room, even though s/he wasn't consciously listening for it. In the experiments of [67], the recorded neurophysiologic brain response was larger for one's native language than for unfamiliar sounds. Also, experiments at the level of meaningful language units have revealed that real words elicit a larger brain response than meaningless pseudo-words [67]. These psychophysical experiments indicate that some words that carry semantically important information, e.g., one's name, can capture attention, as can some syllable/word strings that form a meaningful word/sentence [12, 34].
In addition, earlier studies have revealed that there is a dependency between prominence and lexical information, and also between prominence and syntax [28, 69]. The authors of [28] show that content words are more likely to be prominent than function words. Also, a statistical analysis presented in [4] indicates that some syllables have a higher chance of being prominent than others; e.g., the syllable "can" has an 80% chance of being prominent, whereas the syllable "by" has a 13% chance of being prominent.
In this chapter, a novel task-dependent model is proposed to combine the auditory attention cues discussed in Chapter 3 together with higher level lexical and syntactic information to model task-dependent influences on a given spoken language task. The lexical information is incorporated into the system by building a language model with syllables, and the syntactic knowledge is represented using part-of-speech tags. The top-down task-dependent influence of lexical and syntactic information is combined with the auditory attention cues using a probabilistic approach. The combined model is used to automatically detect prominent syllables in speech.
The remainder of this chapter is organized as follows: first, the task-dependent model is explained in Section 4.3, with the acoustic model in Section 4.3.1, the lexical model in Section 4.3.2, and the syntactic model in Section 4.3.3. The probabilistic framework that combines acoustic, lexical and syntactic cues is explained in Section 4.3.4. The experimental results on prominence detection are reported in Section 4.4, followed by the conclusion in Section 4.5.
4.3 Probabilistic Approach for Task-Dependent Model
The task-dependent model is influenced by acoustic and other higher level cues. In this section, we present a system to combine the auditory gist features discussed in Chapter 3 together with lexical and syntactic information in a probabilistic framework for prominence detection in speech. The probabilistic model is based on maximum a-posteriori (MAP) estimation; given acoustic, lexical and syntactic information, the model estimates the sequence of prominence labels that maximizes the posterior probability. First, we discuss modelling each piece of information separately, and then we discuss how they are combined in a MAP framework.
4.3.1 Task-Dependent Model with Auditory Gist Features
The auditory gist features, which are explained in Chapter 3, are used as acoustical
evidence here. A multilayer perceptron (MLP) is used to implement the learner in Fig.
3.2 to bias the auditory gist features to mimic the top-down influences of the task on neuron responses. We use the auditory gist features as the input to the neural network, and the output returns the class posterior probability p(c_i | f_i) for the i-th syllable, where f_i is the auditory gist feature and c_i ∈ {0, 1}, with 1 denoting that the syllable is prominent and 0 denoting that it is non-prominent. Then, the most likely prominence sequence C* = {c_1, c_2, ..., c_M} given the gist features F = {f_1, f_2, ..., f_M} can be found using a maximum a-posteriori framework as follows:

    C* = argmax_C p(C|F)                                    (4.1)

Assuming that the syllable prominence classes are independent, Eq. 4.1 can be approximated as:

    C* = argmax_C ∏_{i=1}^{M} p(c_i|f_i)                    (4.2)

when the gist features are the only information considered in the top-down model.
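To make Eqs. 4.1 and 4.2 concrete, here is a minimal sketch in which an MLP estimates p(c_i | f_i) and, under the independence assumption, the most likely sequence is obtained with a per-syllable argmax; the feature dimension, network size, and data are illustrative assumptions rather than the configuration used in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: one 80-dim gist vector per syllable,
# label 1 = prominent, 0 = non-prominent.
F_train = rng.normal(size=(4000, 80))
c_train = rng.integers(0, 2, size=4000)

# A single-hidden-layer perceptron standing in for the learner in Fig. 3.2.
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=0)
mlp.fit(F_train, c_train)

# Eq. 4.2: with independent syllables, the argmax over the product of
# per-syllable posteriors reduces to a per-syllable argmax.
F_test = rng.normal(size=(10, 80))
posteriors = mlp.predict_proba(F_test)   # p(c_i | f_i)
C_star = posteriors.argmax(axis=1)       # most likely prominence sequence
print(C_star)
```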
4.3.2 Task-Dependent Model with Lexical Cues
The lexical evidence is included in the top-down model using a probabilistic language model. Given only the lexical information of the syllable sequence S = {s_1, s_2, ..., s_M}, the most likely prominence sequence C* can be found as:

    C* = argmax_C p(C|S)                                    (4.3)
       = argmax_C p(C, S).                                  (4.4)
Here, p(C, S) is modelled within a bounded n-gram context as:

    p(C, S) = ∏_{i=1}^{M} p(c_i, s_i | c_{i-n+1}^{i-1}, s_{i-n+1}^{i-1})           (4.5)

For example, in a bigram context model, p(C, S) can be approximated as:

    p(C, S) = p(c_1, s_1) ∏_{i=2}^{M} p(c_i, s_i | c_{i-1}, s_{i-1}).              (4.6)
We assume that the utterance transcripts are available and use the actual transcripts provided with the database in the experiments. Then, the syllable s_i becomes known; hence, we replace the p(c_i, s_i | c_{i-n+1}^{i-1}, s_{i-n+1}^{i-1}) term in Eq. 4.5 with p(c_i | c_{i-n+1}^{i-1}, s_{i-n+1}^{i}). It is difficult to robustly estimate the language model, even within an n-gram context, due to the size of the available training database. Specifically, the BU-RNC database used in this study is small compared to the large number of syllables in the dictionary. Hence, a factored language model is built to overcome data sparsity. The factored language model is a flexible framework for incorporating various information sources, such as
morphological classes, stems, roots, and any other linguistic features, into language
modeling [41]. In a factored language model scheme, when the higher-order distribution cannot be reliably estimated, it backs off to lower-order distributions. Using a factored language model also helps with out-of-vocabulary (OOV) syllables seen in the test sets, because when an OOV syllable is observed on the right side of the conditioning bar in p(c_i, s_i | c_{i-n+1}^{i-1}, s_{i-n+1}^{i-1}), the backed-off estimate that does not contain that variable is used instead. The back-off graph used for creating the language model is shown in Fig. 4.1. We use a back-off path such that the most distant variable is dropped first from the set on the right side of the conditioning bar in p(c_i, s_i | c_{i-n+1}^{i-1}, s_{i-n+1}^{i-1}). Here, both s_{i-n+1} and c_{i-n+1} are the most distant variables. For instance, we first drop the most distant syllable variable s_{i-n+1}, then the most distant class variable c_{i-n+1}, and so on. As shown in Fig. 4.1, the graph includes both possible paths, starting the dropping from either s_{i-n+1} or c_{i-n+1}. The language model is built using the SRILM toolkit [79]. We use a 4-gram model for the prominence class history and a trigram model for the syllable token history. This n-gram order selection is also validated with the experiments.
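The back-off idea can be illustrated with a much simpler stand-in than the SRILM factored model: the sketch below estimates a bigram over joint (prominence, syllable) tokens from toy counts and backs off to a class-only history, and then to a unigram, when a context is unseen. The back-off order and the data are assumptions for the example only.

```python
from collections import Counter

# Toy training data: each utterance is a list of (prominence, syllable) pairs.
train = [[(1, "w aa n"), (0, "t ax d"), (1, "m ao r")],
         [(0, "t ax d"), (1, "w aa n")]]

joint_bigrams = Counter()   # counts of ((c_prev, s_prev), (c, s))
class_bigrams = Counter()   # counts of (c_prev, (c, s)) -- back-off level
unigrams = Counter()        # counts of (c, s)
context_joint = Counter()
context_class = Counter()

for utt in train:
    for prev, cur in zip(utt, utt[1:]):
        joint_bigrams[(prev, cur)] += 1
        class_bigrams[(prev[0], cur)] += 1
        context_joint[prev] += 1
        context_class[prev[0]] += 1
        unigrams[cur] += 1

def p(cur, prev):
    """p(c_i, s_i | c_{i-1}, s_{i-1}) with back-off to the class-only history."""
    if joint_bigrams[(prev, cur)] > 0:
        return joint_bigrams[(prev, cur)] / context_joint[prev]
    if class_bigrams[(prev[0], cur)] > 0:
        return class_bigrams[(prev[0], cur)] / context_class[prev[0]]
    total = sum(unigrams.values())
    return unigrams[cur] / total if total else 1e-6

print(p((1, "w aa n"), (0, "t ax d")))   # 0.5 with the toy counts above
```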
4.3.3 Task-Dependent Model with Syntactic Cues
In our model, the syntactic information is captured using part-of-speech (POS) tags
provided with the database. POS tagging provides details about a particular part of
speech within a sentence or a phrase by looking at its definition as well as its context
(relationship with adjacent words), i.e., whether a word is a noun, verb, adjective, etc.
The syntactic evidence is included in the model using a similar approach to the lexical
Figure 4.1: Back-off graph of the lexical-prominence language model, moving from the trigram model p(c | c_{-1}, s_{-1}, c_{-2}, s_{-2}) down to p(c), where c denotes the prominence class and s represents the syllable token. The most distant variable, s_{-2}/c_{-2}, is dropped first.
evidence. POS tags are associated with words, so the most likely prominence sequence C* for a word string W given only syntactic information can be computed as:

    C* = argmax_C p(C | POS(W)).                            (4.7)

Assuming that word tokens are independent, Eq. 4.7 can be re-written as

    C* = argmax_C ∏_i p(c_i | POS_i^L(w_i))                 (4.8)

In Eq. 4.8, c_i represents the prominence of the i-th word w_i in the word sequence, and POS_i^L is the POS tag sequence neighboring the i-th word:

    POS_i^L = (POS_{i-(L-1)/2}, ..., POS_i, ..., POS_{i+(L-1)/2})                  (4.9)
POS_i^L is chosen such that it contains syntactic information from a fixed window of L words centered at the i-th word, and L = 5 performs well for prominence detection [11]. The class posterior probability p(c_i | POS_i^L(w_i)) in Eq. 4.8 is computed using a neural network, as detailed in Section 4.4.3. In the rest of the chapter, POS is used to denote POS_i^L to simplify the notation.
The syntactic model is built at the word level since POS tags are associated with
words. Thus, the syntactic model indicates whether one or more syllables within the
word are prominent. However, it does not provide information regarding which one is
the prominent one. On the other hand, if a word is non-prominent, so are the syllables
composing the word.
Let us assume that the i-th word w_i consists of n_i syllables. Then the syllable string for w_i can be written as S_i = [s_1, s_2, ..., s_{n_i}]. The word w_i is non-prominent if and only if all the syllables within w_i are non-prominent. Hence,

    p(c_i = 0 | POS(w_i)) = ∏_{k=1}^{n_i} p(c_k = 0 | POS(s_k))                    (4.10)

where p(c_i = 0 | POS(w_i)) is the probability of w_i being non-prominent given the POS tags, and p(c_k = 0 | POS(s_k)) denotes the probability of the syllable s_k within the word w_i being non-prominent given the POS tags. From Eq. 4.10, we approximate the posterior probability of a syllable in a word being non-prominent as follows:

    p(c_k = 0 | POS(s_k)) ≈ [p(c_i = 0 | POS(w_i))]^{1/n_i}.                       (4.11)

Finally, the probability of the syllable s_k being prominent can be computed as:

    p(c_k = 1 | POS(s_k)) = 1 - p(c_k = 0 | POS(s_k))                              (4.12)
In practice, to bring the word-level syntactic information to the syllable level, Eqs. 4.11 and 4.12 are used in the experiments, as illustrated in the sketch below.
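A minimal sketch of the word-to-syllable conversion in Eqs. 4.11 and 4.12, assuming the word-level non-prominence posterior is already available from the POS-based network; the function name and the example numbers are illustrative.

```python
def syllable_prominence_posterior(p_word_nonprominent: float, n_syllables: int) -> float:
    """Approximate p(c_k = 1 | POS(s_k)) for each syllable of a word.

    Eq. 4.11: the syllable non-prominence posterior is the n_i-th root of the
    word non-prominence posterior; Eq. 4.12 converts it to a prominence posterior.
    """
    p_syl_nonprominent = p_word_nonprominent ** (1.0 / n_syllables)
    return 1.0 - p_syl_nonprominent

# Example: a three-syllable word judged non-prominent with probability 0.4
# gives each of its syllables a prominence posterior of about 0.26.
print(round(syllable_prominence_posterior(0.4, 3), 3))
```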
4.3.4 Combined Model with Acoustic and Higher Level Cues
The top-down task-dependent model uses acoustic and higher level cues while performing a task. These cues are combined using a probabilistic approach for the purpose of prominence detection. Given the auditory gist features F, lexical evidence S, and syntactic evidence POS, the most likely prominence sequence can be found as:

    C* = argmax_C p(C | F, S, POS)
       = argmax_C p(F, S, POS | C) p(C)                     (4.13)

Assuming that the auditory gist features are conditionally independent of the lexical and syntactic features given the prominence class information, Eq. 4.13 can be written as

    C* = argmax_C p(F | C) p(S, POS | C) p(C)               (4.14)

The joint distribution p(S, POS | C) in Eq. 4.14 cannot be robustly estimated since the vocabulary size is very large compared to the training data, so a naive Bayesian approximation is used to simplify it as follows:

    p(S, POS | C) ≈ p(S | C) p(POS | C).                    (4.15)

Then Eq. 4.14 can be re-written as

    C* = argmax_C p(F | C) p(S | C) p(POS | C) p(C)
       = argmax_C [p(C | F) / p(C)] p(S, C) [p(C | POS) / p(C)]                    (4.16)

The combined top-down model, which includes the auditory gist features and the lexical and syntactic information, finally reduces to the product of the individual probabilistic model outputs, as shown in Eq. 4.16.
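As a concrete reading of Eq. 4.16, the sketch below combines the three model outputs for a single syllable and class in log space; the probabilities shown are made-up placeholders for the acoustic MLP posterior, the factored language model score, and the POS-network posterior.

```python
import math

def combined_log_score(acoustic_post, lexical_joint, syntactic_post, prior):
    """Log of the Eq. 4.16 term for one syllable and one fixed class c:
    [p(c|F)/p(c)] * p(s, c) * [p(c|POS)/p(c)]."""
    return (math.log(acoustic_post) - math.log(prior)
            + math.log(lexical_joint)
            + math.log(syntactic_post) - math.log(prior))

# Illustrative numbers for one syllable and class c = 1 (prominent):
score = combined_log_score(acoustic_post=0.82,   # p(c=1 | f_i) from the MLP
                           lexical_joint=0.05,   # p(s_i, c=1 | history) from the LM
                           syntactic_post=0.60,  # p(c=1 | POS) from the POS network
                           prior=0.34)           # prior p(c=1)
print(score)
```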
4.4 Experiments and Results
This section presents the details of the experiments conducted, together with the automatic prominence detection test results on the BU-RNC. All of the experimental results presented here are estimated using the average of five-fold cross-validation. The same data partition described in Chapter 3.4 is used for the experiments in this section.
This section is organized as follows: first, the experiments and results obtained with the acoustic model are presented in Section 4.4.1. Next, the prominence detection results obtained with lexical and syntactic information are presented in Sections 4.4.2 and 4.4.3, respectively. Finally, the results with the combined acoustic, lexical and syntactic model are presented in Section 4.4.4.
4.4.1 Task-Dependent Model Prediction with Only Auditory Features
We first tested the top-down model using only the auditory gist features, assuming that syllables are independent, as in Eq. 4.2. The scene duration, grid size and the features were analyzed in detail in Chapters 3.4.1, 3.4.2, and 3.4.3, respectively. Based on those analyses, the scene duration for auditory gist feature extraction is set to D = 0.6 s, considering both the prominence detection performance and the feature dimension (hence the computational cost), and the grid size is set to 4-by-5 for the experiments presented here. Also, the combination of the I, F, T, O features is used; i.e., the pitch features are excluded in the rest of the experiments since the IFTOP performance is not significantly different from the results of IFTO (refer to Table 3.7).
The results are reported in Table 4.1, and as reported earlier, an accuracy of 85.45% with an F-score of 0.78 is achieved for the prominence detection task at the syllable level using only the auditory gist features. Table 4.1 also includes the results of the other types of top-down evidence, which are discussed next. We first present the experimental results with the lexical and syntactic models in Sections 4.4.2 and 4.4.3, respectively, followed by the combined model results in Section 4.4.4.
4.4.2 Task-Dependent Model Prediction with Only Lexical Features
The top-down model prediction using lexical information is implemented by creating
sausage lattices for the test sets using the test transcriptions. The lattice arcs hold the syllable tokens together with the possible prominence class categories. For example, a part of a lattice that includes the word "wanted" is shown in Fig. 4.2. The arcs carry the two syllables "w aa n" and "t ax d" that the word contains, together with the prominent (C-one) and non-prominent (C-zero) class categories. When the only available evidence is the lexical rule, the arcs of the lattices do not carry any acoustic score, i.e., they are all set to zero (a=0.0 in Fig. 4.2). After constructing the lattice, it is scored with the factored n-gram lexical language model detailed in Section 4.3.2. The most likely prominence sequence is obtained by Viterbi decoding through the lattices. The results obtained with only the lexical model are reported in Table 4.1; the prominence detection performance achieved using only lexical information is 83.85% with an F-score of 0.76. We observe that the auditory features alone perform 1.6% better than the lexical features (85.45% vs. 83.85%), and this result is significant at p < 0.001.

Figure 4.2: Sausage lattice with only lexical evidence

Table 4.1: Prominent Syllable Detection Performance of Individual Acoustic, Lexical and Syntactic Cues
TD Evidence              Acc.      Pr.    Re.    F-sc.
Auditory Feat. only      85.45%    0.82   0.75   0.78
Lexical only             83.85%    0.77   0.76   0.76
Syntactic only (word)    82.50%    0.82   0.87   0.84
Syntactic only (syl.)    68.01%    0.54   0.53   0.53
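To illustrate the Viterbi decoding over such a sausage lattice, the sketch below searches a two-slot lattice in which each slot carries both prominence classes of a known syllable; the bigram scores are arbitrary stand-ins for the factored language model, and the data structures are assumptions made for the example.

```python
import math

# Each slot of the sausage lattice holds both prominence hypotheses (0/1)
# for one known syllable of the word "wanted".
slots = [[(0, "w aa n"), (1, "w aa n")],
         [(0, "t ax d"), (1, "t ax d")]]

def log_bigram(prev, cur):
    # Stand-in for the factored LM score p(c_i, s_i | c_{i-1}, s_{i-1});
    # it favours a prominent first syllable followed by a non-prominent one.
    table = {((1, "w aa n"), (0, "t ax d")): 0.6}
    return math.log(table.get((prev, cur), 0.1))

def viterbi(slots):
    # best[h] = (accumulated log score, path) for each hypothesis h in the current slot
    best = {h: (0.0, [h]) for h in slots[0]}
    for slot in slots[1:]:
        new_best = {}
        for cur in slot:
            score, path = max(
                ((s + log_bigram(prev, cur), p + [cur]) for prev, (s, p) in best.items()),
                key=lambda x: x[0])
            new_best[cur] = (score, path)
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(slots))   # -> [(1, 'w aa n'), (0, 't ax d')]
```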
Table 4.2: Combined Top-Down Model Performance for Prominent Syllable Detection
TD Evidence                              Acc.      Pr.    Re.    F-sc.
Auditory Feat. + Lexical                 88.01%    0.83   0.82   0.82
Auditory Feat. + Syntactic               86.23%    0.81   0.79   0.80
Auditory Feat. + Syntactic + Lexical     88.33%    0.83   0.83   0.83
Combined Feat. (word level)              85.71%    0.87   0.86   0.87
4.4.3 Task-Dependent Model Prediction with Only Syntactic Features
The class posterior probability p(c_i | POS_i^L(w_i)) in Eq. 4.8 is computed using a neural network [11]. We use a 3-layer neural network with d_pos inputs and n output nodes, where d_pos is the length of the feature vector produced from the part-of-speech tags, and n = 2 since this is a two-class problem. In our implementation, the set of 34 POS tags used in the Penn Treebank [50] is adopted. Each POS tag is mapped into a 34-dimensional binary vector. The neural network therefore has d_pos = 34 × 5 inputs, since the syntactic information in our model includes the information from a window of L = 5 words.
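A minimal sketch of how such a POS input vector could be assembled for a window of L = 5 words; the tag inventory shown is truncated and illustrative, not the exact 34-tag Penn Treebank set used here.

```python
import numpy as np

# Illustrative (truncated) tag inventory; the real system uses 34 Penn Treebank tags.
TAGS = ["NN", "NNS", "VB", "VBD", "JJ", "DT", "IN", "PRP", "RB", "CC"]
TAG_INDEX = {t: i for i, t in enumerate(TAGS)}

def encode_window(pos_window):
    """Concatenate one one-hot vector per tag in a window of L words."""
    vec = np.zeros(len(TAGS) * len(pos_window))
    for k, tag in enumerate(pos_window):
        if tag in TAG_INDEX:                     # unknown tags stay all-zero
            vec[k * len(TAGS) + TAG_INDEX[tag]] = 1.0
    return vec

# Window of L = 5 tags centred on the word of interest.
x = encode_window(["DT", "JJ", "NN", "VBD", "IN"])
print(x.shape)   # (50,) here; 34 * 5 = 170 with the full tag set
```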
As mentioned earlier, POS tags are associated with words, so the neural network is trained using the word-level POS tags. Using only syntactic information, we achieve 82.50% accuracy (F-score = 0.84) for the prominence detection task at the word level, as detailed in Table 4.1. Then, using Eqs. 4.11 and 4.12, we convert the word-level posterior probability to the syllable-level posterior probability. To obtain the baseline performance for the syntactic model, we combine the prior chance level observed in the training data with the syntactic model posterior probabilities. The prominence detection accuracy achieved is 68.01% (F-score = 0.53) at the syllable level using the syntactic evidence. This is slightly better than the chance level for the BU-RNC, which is 65.7% accuracy at the syllable level. The syntactic features alone (68.01%) perform significantly worse than both the auditory features (85.45%) and the lexical features (83.85%) at the prominence detection task at the syllable level (p < 0.001). This is not surprising because POS carries information at the word level. When a multisyllabic word is detected as prominent, there is no information about which syllable(s) is/are prominent within the word. Hence, Eqs. 4.11 and 4.12 are only approximations. Nevertheless, the combination with the syntactic information leads to a statistically significant performance improvement, as shown in the next section.
4.4.4 Combined Model Prediction with Auditory, Syntactic and Lexical
Features
We combine the auditory gist features together with the syntactic and lexical information using the probabilistic approach presented in Eq. 4.16. First, the syllable-level syntactic and auditory gist feature model outputs are combined and embedded in the lattice arcs. Then, the lattices are scored with the lexical language model, and a Viterbi search is conducted to find the best sequence of prominence labels. The combined model achieves 88.33% accuracy with an F-score of 0.83, as listed in Table 4.2.
In addition to these experiments, we also investigated the combination of auditory features together with lexical information and the combination of auditory features together with syntactic information. The results are summarized in Table 4.2. Incorporating syllable token information into the top-down prediction with auditory features leads to a 2.56% (p < 0.001) accuracy improvement over the auditory-features-only model (88.01% vs. 85.45%). Also, the 0.78% improvement over the acoustic model prediction accuracy due to syntactic information is significant (86.23% vs. 85.45%, p < 0.001). The performance difference between the model that includes all three models and the one that does not include the syntactic model is also significant at p < 0.005 (88.33% vs. 88.01%). The best prominence detection accuracy is achieved using all three information streams: auditory features and lexical and syntactic evidence. Finally, the combined model achieves 85.71% prominence detection accuracy at the word level with an F-score of 0.87.
4.5 Conclusion
In this chapter, a novel model that combines bio-inspired auditory attention cues with higher level task-dependent lexical and syntactic cues is presented. All three information streams (acoustic, syntactic, and lexical) are combined within a probabilistic approach. The combined model was demonstrated to detect prominent syllables successfully in read speech with 88.33% accuracy. The results compare well with human performance on stress labelling reported for the BU-RNC: the average inter-transcriber agreement for manual annotators was 85-90% for presence vs. absence of stress labelling [61].
We can conclude from the experimental results summarized in Table 4.2 that the contribution of lexical information is significant in the prominence detection task. The contribution of the syntactic cues captured with POS tags is small compared to that of the lexical cues for prominence detection at the syllable level. This might be mainly due to the fact that the POS tags used to represent syntactic information are associated with words. Hence, the syntactic model is accurate for word-level prominence detection, but its contribution to prominent syllable detection is limited to detecting non-prominent words, and hence the non-prominent syllables those words contain. On the other hand, when a multi-syllabic word is prominent, we do not have any information regarding which syllables within the word are prominent.

Table 4.3: Previously Reported Results on Prominence Detection Task Using the BU-RNC
Previous Work                        Features                  Acc      d    Level
Wightman et al. [86]                 Acoustic + Prosodic LM    84.0%    12   syllable
Ross et al. [69] (single speaker)    Lex + Syn                 87.7%    NA   syllable
Hirschberg [28]                      Syn                       82.4%    NA   word
Chen et al. [11]                     Acoustic only             77.3%    15   word
Chen et al. [11]                     Acoustic + Syn            84.2%    15   word
Ananthakrishnan et al. [4]           Acoustic only             74.1%    9    syllable
Ananthakrishnan et al. [4]           Acoustic + Lex + Syn      86.1%    9    syllable
Our method                           Acoustic only             85.45%   80   syllable
Our method                           Acoustic + Lex + Syn      88.33%   80   syllable
Our method                           Acoustic only             83.11%   80   word
Our method                           Acoustic + Lex + Syn      85.71%   80   word
In Table 4.3, we compare the performance of our model with the results presented by other authors for this task using the BU-RNC database. The table shows the prominence detection accuracy obtained in these previously published papers, the features used for the experiments, the acoustic feature dimension (denoted as d in Table 4.3), and the level at which the prominence detection experiments were performed (syllable or word). The features used in the literature are syntactic features (referred to as Syn in Table 4.3), lexical features (Lex), and acoustic features (Acoustic). Also, in [86] a prosodic bigram language model (LM) was used together with acoustic features. In the literature, all the previous work that uses acoustic cues for prominence detection utilizes prosodic features consisting of pitch, energy, and duration features. Those acoustic feature dimensions vary between 9 and 15, whereas our model has a much higher acoustic feature dimension (80). However, the proposed acoustic-only model performs significantly better than all the previously reported methods that used only acoustic features, and it provides approximately 8-10% absolute improvement over the previous results. In summary, we achieve a significant performance gain, but at the cost of computation. These results are especially beneficial for cases where the utterance transcripts are not available and syntactic and lexical cues cannot be easily extracted. The proposed auditory attention cues, however, perform sufficiently well even without text-derived features. Finally, when we combine all three lexical, syntactic and acoustic cues, our model performs 2.32% better than the previously reported results in [4]. Although [69] reports 87.7% accuracy on this task, our results are not directly comparable, since their experiments are limited to a single speaker, whereas we used the entire data set (6 speakers) for the experiments.
The combined top-down task-dependent auditory attention model is used in this
chapter to detect prominent regions of speech. However, the prominence itself can
actually be a feature that may attract human attention in a bottom-up manner. In-
corporating prominence as a bottom-up attention cue into the current machine speech
processing systems can be beneficial. In Chapter 6, it is incorporated into an automatic
speech recognizer system.
Chapter 5:
Interaction Between Top-Down and Bottom-Up Auditory
Attention Models
5.1 Chapter Summary
The focus of this chapter is an analysis of the interaction between top-down and bottom-up attention. How is attention influenced by bottom-up and top-down processes? The answer to this question is still debated among scientists. Here, two hypotheses are modeled: i) top-down attention makes a selection among the conspicuous salient events/objects pointed out by bottom-up attention, and ii) task-dependent top-down attention can be overridden by a strong salient event/object that occurs in the unattended channel (outside of the focus of attention). The experiments in the previous chapters are repeated with the combined bottom-up and top-down attention models, and the results are reported.
5.2 Introduction
The interaction between top-down and bottom-up attention is still ambiguous to scientists. One hypothesis is that top-down attention behaves as a filter to select among the conspicuous events/objects detected by bottom-up attention [52, 90]. Another set of experiments indicates that top-down task-dependent attention can be overridden by a strong salient event/object; e.g., when a subject is listening to a speaker, his/her attention can be captured by the sudden sound of a car crash [14].
Here, we implement two models that analyze the interaction between the task-independent bottom-up (BU) model and the task-dependent top-down (TD) model. The first model follows the hypothesis that the top-down model makes a selection among the conspicuous locations detected by the saliency-driven bottom-up model. The second model follows the hypothesis that while the top-down model is processing the input stimuli in a goal-oriented manner, the bottom-up model can override the top-down model prediction. These two models and the experimental results obtained with them are detailed in the following sections.
5.3 Integrated Top-Down and Bottom-up Attention Model 1
The model presented here assumes that a number of simultaneous conspicuous acoustic events may occur and that task-dependent attention is required for these events to be perceived. In the integrated model, first the conspicuous events/objects in a scene are located using the auditory saliency map detailed in Chapter 2. Then, the top-down task-dependent attention behaves as a filter to decide whether these events are the target or not. The diagram illustrating this process is shown in Fig. 5.1.

Figure 5.1: Integrated bottom-up and top-down model 1
In practice, for prominence detection in speech we implemented this model as follows:
1. Process the auditory spectrum of a scene with the saliency-driven bottom-up
model and create an auditory saliency map.
2. Compute the saliency score and find local maxima.
3. Find the syllables corresponding to the local maxima, and extract auditory gist
features from the scenes containing these syllables.
4. Pass the auditory gist features of the candidate syllables pointed out by the
bottom-up model through the top-down model to decide whether they are promi-
nent or not.
In this integrated model, only the candidates obtained from the BU model are
processed by the TD model.
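A minimal sketch of integrated model 1, assuming helper functions for the saliency map, peak picking, gist extraction, and the top-down classifier already exist; all function names and the syllable attributes used here are placeholders, not part of the original implementation.

```python
def integrated_model_1(spectrum, syllables, saliency_map, find_peaks,
                       gist_features, td_classifier, threshold=0.1):
    """BU model proposes candidate syllables; the TD model accepts or rejects them."""
    smap = saliency_map(spectrum)            # step 1: auditory saliency map
    peaks = find_peaks(smap, threshold)      # step 2: local maxima of the saliency score
    prominent = []
    for t in peaks:                          # step 3: syllables at the maxima
        syl = min(syllables, key=lambda s: abs(s.center_time - t))
        f = gist_features(spectrum, syl)     # gist of the scene containing the syllable
        if td_classifier(f) == 1:            # step 4: TD filter decides target or not
            prominent.append(syl)
    return prominent
```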
5.4 Integrated Top-Down and Bottom-up Attention Model 2
The second model follows the hypothesis that while the top-down model is processing the input stimuli in a goal-oriented manner, the bottom-up model can override the top-down model. The diagram illustrating this process is shown in Fig. 5.2.

Figure 5.2: Integrated bottom-up and top-down model 2
In practice, for prominence detection in speech we implemented this model as follows:
1. Process the auditory spectrum with both top-down and bottom-up models in
parallel.
2. Compute the saliency score from the auditory saliency map and find local maxima.
3. For "strong" maxima of the saliency score, i.e., points whose saliency scores are larger than 0.5, label the corresponding syllables as prominent without considering the top-down model prediction (the top-down model output is overridden by the bottom-up model prediction).
4. For the syllables that do not have strong saliency evidence in a bottom-up manner,
use the top-down model prediction.
In this integrated model, the top-down model predictions are used for all the syllables
except the ones that have strong saliency indication from the bottom-up model.
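A matching sketch for integrated model 2, under the same assumptions about the helper functions and syllable attributes: syllables with a saliency peak above the threshold are forced to prominent, and everything else falls back to the top-down prediction.

```python
def integrated_model_2(spectrum, syllables, saliency_map, find_peaks,
                       gist_features, td_classifier, strong_threshold=0.5):
    """TD model decides by default; a strong BU saliency peak overrides it."""
    smap = saliency_map(spectrum)
    strong_peaks = find_peaks(smap, strong_threshold)
    labels = {}
    for syl in syllables:
        if any(abs(syl.center_time - t) < syl.half_width for t in strong_peaks):
            labels[syl] = 1                  # overridden by the BU model
        else:
            labels[syl] = td_classifier(gist_features(spectrum, syl))
    return labels
```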
5.5 Experiments and Results
In this section, the prominent syllable detection experiments detailed in the previous
chapters are repeated with the combined bottom-up and top-down attention models
and the results are reported. The BU-RNC was used in the experiments.
Table 5.1: Prominent Syllable Detection Performance with Integrated Model 1 (th = 0.1 for the BU model)
Process             Acc.      Pr.    Re.    F-sc
BU only             73.20%    0.58   0.89   0.72
TD only             84.46%    0.80   0.73   0.77
BU & TD together    83.79%    0.83   0.66   0.74
5.5.1 Experiments with Integrated Model 1
The results obtained with integrated model 1 are reported in Table 5.1. For these experiments, the threshold used for the local maxima selection of the saliency score was set to a low value, i.e., th = 0.1 for the listed results. The integrated model performs better than the bottom-up-only model (83.79% vs. 73.20%); however, it is 0.67% less accurate than the top-down model alone (83.79% vs. 84.46%; see Note 1), although this difference is not significant (p > 0.05). It can be observed that the integrated model improves the precision score at the expense of the recall value. It can be concluded from these results that there are points which are actually prominent but were not detected by the BU model, whereas the TD model alone correctly identifies them.

Note 1: The top-down model with only acoustic features was used in the experiments in this chapter. The dimension of the auditory gist features was 48 after dimension reduction with PCA.
5.5.2 Experiments with Integrated Model 2
The results obtained with integrated model 2 are reported in Table 5.2. For these experiments, the threshold for selecting "strong" local maxima of the saliency score was varied from 0.5 to 0.9 (the saliency score varies in the range [0, 1]). The integrated model always performs better than the bottom-up-only model; however, it is 0.25% less accurate than the top-down model alone (84.21% vs. 84.46%), although this difference is not significant (p > 0.05). It can be observed that the integrated model improves the recall value at the expense of the precision value. Here, the higher the threshold, the more the top-down model predictions are used. It can be concluded from these results that there are points which are actually prominent but were not identified by the TD model, whereas the BU model alone correctly detects them. However, there are not many such cases. Thus, the integrated model is less accurate than the TD-only model, since the BU model also introduces noisy prominence predictions into the integrated model.

Table 5.2: Prominent Syllable Detection Performance with Integrated Model 2
Threshold    Acc.      Pr.    Re.    F-sc
0.5          82.94%    0.74   0.78   0.76
0.6          83.46%    0.75   0.77   0.76
0.7          83.76%    0.77   0.76   0.76
0.8          83.98%    0.78   0.75   0.76
0.9          84.21%    0.79   0.75   0.76
5.6 Conclusion
The performance of both integrated models implemented here is limited by the performance of the top-down model. This can be explained by the task-dependent optimization of the gist features via a neural network. The bottom-up and top-down models share the same feature maps, so the top-down model can be thought of as already modulating the bottom-up features for the detection task. Hence, the bottom-up model contributes nothing but noise (confusion) to the top-down model, since the experiments are done for a specific task; e.g., the sound of a car crash is only a noisy, confusing sound if the performance evaluation metric for a listener is the number of correctly recognized words spoken by a speaker.
Chapter 6:
Continuous Speech Recognition Using Attention Shift
Decoding with Soft Decision
6.1 Chapter Summary
An attention shift decoding (ASD) method inspired by human speech recognition is presented in this chapter. In contrast to traditional automatic speech recognition (ASR) systems, ASD decodes speech non-consecutively using reliability criteria; the gaps (unreliable speech regions) are decoded with the evidence of islands (reliable speech regions). On the BU Radio News Corpus, ASD provides a significant improvement (2.9% absolute) over the baseline ASR results when it is used with oracle island-gap information. At the core of the ASD method is automatic island-gap detection. Here, a new feature set is proposed for automatic island-gap detection, which achieves 83.7% accuracy. To cope with the imperfect nature of the island-gap classification, we also propose a new ASD algorithm using soft decision. The ASD with soft decision provides a 0.4% absolute (2.2% relative) improvement over the baseline ASR results when it is used with automatically detected islands and gaps.
6.2 Introduction
Human-like speech processing has been an inspiration and motivation for researchers for many years to improve the performance of computational models and machine processing applications. Humans can successfully recognize speech with high accuracy despite conditions such as highly variable speaking styles, noise, overlapping sources, etc. In contrast, machine performance typically degrades drastically in such conditions. Existing automatic speech recognition (ASR) systems have modeled some parts of the human speech recognition process and found them to be beneficial; signal processing in the peripheral auditory system is a good example. There are, however, other possibilities that offer promise. One of those that can be considered within ASR systems is the attention mechanism humans utilize.
Humans can precisely process and interpret complex scenes in real time despite the tremendous number of stimuli impinging on the senses. One of the key enablers of this capability is the attention mechanism that selects a subset of the available sensory information before fully processing all stimuli at once [2]. Only the selectively attended incoming stimuli are allowed to progress through the cortical hierarchy for high-level processing to recognize the details of the stimuli. Thus, it is believed that humans process a scene non-consecutively in a selective way. In addition, the experiments in [66] have shown that words segmented from running speech are often unintelligible even for humans, and they become intelligible when heard in the context of an utterance. Also, the experiments in [74] showed that humans use a short-term memory buffer (about 1-2 sec long) which, when injured, causes sentence processing difficulty. These experiments indicate that i) humans use context information while decoding speech, and ii) humans use a buffer that stores a string of words while recognizing individual words within a sentence. Based on attention theory and the supporting experimental findings, it is believed that humans first process and recognize salient or prominent parts of speech. Then they finalize the recognition of non-salient parts of speech using the contextual information together with their segmental properties.
Prior research that has focused on the notion of attention in speech understanding dates back to the Hearsay system [21]. The Hearsay speech understanding system is one of the early works which proposed to resolve uncertainty using many knowledge sources in a selective structure [21]. One of the limitations of the Hearsay system is that it was rule-based and has not been implemented within a state-of-the-art machine learning framework. Human-like non-consecutive speech recognition has been the motivation of some other work in the past. In [38], an island-driven continuous speech recognition system that uses word spotting and word verification was proposed. The system described in [38] first detects a noun as an island in small-vocabulary continuous speech and then expands the island by verifying neighboring words predicted by a word-pair grammar until all parts of speech are filled. The island-driven search technique has been applied to handwriting recognition in [46, 65] and to parsing in [15]. However, [15, 46, 65] followed a different approach than [38] and used reliable parts of the signal (called islands) to determine unreliable parts of the signal (called gaps). Recently, the idea of island-of-reliability-driven search was applied to continuous speech recognition in [45], and it was concluded that the speech recognition performance was highly dependent on the accuracy of automatic detection of islands in continuous speech.
In this chapter, we explore the possibility of improving automatic speech recognition performance by using a human-like attention shift decoding (ASD) approach. The presented method builds on the ideas proposed in [45, 46]. The method first finds the islands of continuous speech and then recognizes them. The islands consist of reliable regions of speech for an automatic speech recognizer. Then, the islands are expanded by verifying the neighboring words using a statistical language model within a lattice search algorithm. Thus, the algorithm uses neither the left-to-right nor the right-to-left consecutive search paradigm of conventional ASR systems. It starts decoding from the islands of the speech and then fills in the gaps, using the contextual information to make a selection amongst the word hypotheses obtained from the segmental features.
The main contributions of this chapter are as follows. As mentioned earlier, the performance of an attention shift decoding algorithm highly depends on detecting islands in continuous speech automatically and with high accuracy. Hence, one of the main focuses of this work is to explore the parameters that lead to high island detection accuracy. Here, we propose a new set of features, inspired by both human and machine recognition of speech, for the detection of islands. In addition, we propose a novel attention shift decoding method using soft decision to cope with the imperfect nature of island detection. Finally, we present continuous speech recognition experiments and results with attention shift decoding using both soft and hard decision for completeness.
The chapter is organized as follows: the ASD method is explained in Section 6.3, followed by automatic island detection in Section 6.4. The experimental results and conclusions are presented in Sections 6.5 and 6.6, respectively.
Figure 6.1: Block diagram of attention shift decoding method
6.3 Attention Shift Decoding
Here, we present an attention shift decoding method that decodes speech non-consecutively based on reliability criteria inspired by human speech recognition. The block diagram of the ASD system is shown in Fig. 6.1. The method first decodes each speech utterance using an automatic speech recognizer and provides a word lattice output in addition to the 1-best sentence hypothesis for each utterance. A word lattice may contain a large number of competing word hypotheses; hence, the lattices are transformed into word confusion networks (CN) to easily obtain the competing words for each time interval [49]. A word confusion network for a sample utterance, together with its transcription (TRA) and the ASR 1-best output (HYP), is illustrated in Fig. 6.2. In a CN, the words in each time interval or slot (all of the arcs between two neighboring nodes) are sorted based on the normalized posterior probability, as shown in Fig. 6.2. The top words from each time interval form the 1-best output of the ASR. Then, the correctly recognized words form islands, and the incorrectly recognized words form gaps. After identifying islands and gaps for an utterance, the gaps are filled by decoding the utterance with the evidence of the neighboring islands. As mentioned before, we propose a novel ASD method using soft decision. For the sake of clarity, we first present the ASD method using hard decision [45].

Figure 6.2: Sample word confusion network with islands and gaps
6.3.1 Attention Shift Decoding Using Hard Decision
The method first detects islands for an automatic speech recognizer and finalizes the recognition of these reliable words by pruning out the alternative hypotheses for the island words. For example, in Fig. 6.2, in the second time interval only the top hypothesis (the arc that carries the word give) is kept, pruning out the other three word hypotheses, since this is an island. In other words, the recognition of island words is finalized, and they cannot be altered in later steps. At this stage, the hypotheses for the gaps are left intact. The new CN obtained after pruning the CN in Fig. 6.2 using hard decision is illustrated in Fig. 6.3. Next, the new pruned confusion networks are re-scored with a language model (LM). After re-scoring, the gap words carry new LM scores that are based on the island words; hence, they are believed to be more accurate. Finally, the 1-best recognition output is obtained using the new LM scores together with the normalized posterior probabilities from the original confusion network, since they are intact.

Figure 6.3: After processing the word confusion network in Fig. 6.2 with attention shift decoding using hard decision in the oracle experiments.
6.3.2 Attention Shift Decoding Using Soft Decision
At the heart of the ASD method is the automatic island-gap detection, which is an inherently challenging problem. Island-gap detectors are usually prone to errors; hence, making a hard decision, i.e., taking the island-gap detector output as a binary decision and pruning the confusion network accordingly, may not benefit enough from ASD. Thus, we propose an alternative ASD scheme using soft decision to deal with the imperfect nature of automatic island detection.
The island-gap detector is designed such that it returns the posterior probability of the top word hypothesis in a time slot being an island given the features, i.e., P(I|F), where I is the island label and F is the set of features explained in Section 6.4. Then, for the other alternative word hypotheses in the same time slot, the probability of being an island is computed as 1 - P(I|F); i.e., the more likely the top word is an island, the less likely the alternative words in the same slot can be the island (the correct word). We enrich
the confusion networks by embedding a new island score, modifying the standard ASR equation as:

    W* = argmax_W P(A|W)^{AS} · P(W)^{LS} · P(I|F)^{IS}          (6.1)

where W stands for the word, P(A|W) is the acoustic model score with scale AS, P(W) is the language model score with scale LS, and IS is the island scale.
is the language model score with scale LS, and IS is the island scale.
As discussed before this is a second-pass decoding, hence instead of acoustic model
score the normalized posterior scales are used. Enriching the confusion network by
adding an island score does nothing but re-ranking the hypotheses in each time slot
based on the combined posterior and island scores; i.e., for simplicity consider that the
posterior and the island scales are equal: AS = IS in Eq. 6.1. In other words, if
a rst-best word has high probability of being island then its posterior score will be
boosted, while the alterative words in the same slot will be penalized, otherwise the top
word will be penalized while candidacy of the alternative words in the same slot will be
promoted.
In our oracle experiments, where it is assumed that the islands and gaps are known
perfectly, i.e., P(IjF) = 1 for the islands and P(IjF) = 0 for the gaps, the new CN
in Fig. 6.4 is obtained after processing the CN in 6.2 with soft decision. In the oracle
experiments, the soft decision becomes similar to hard decision with one dierence: for
the islands the alternative words in the same time slot are going to be pruned as in the
hard decision; for the gaps the top word hypothesis is going to be also pruned since
P(IjF) = 0, while the remaining alternative words in a time slot will be left intact,
which is dierent than the one in the hard-decision.
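A minimal sketch of the soft-decision re-ranking implied by Eq. 6.1, reusing the slot-of-(word, posterior)-pairs representation from the earlier sketch; the island posterior and the scale values are illustrative numbers.

```python
import math

def rescore_soft(slot, p_island, posterior_scale=1.0, island_scale=1.0):
    """Re-rank one CN slot: the top word gets island score p_island,
    the alternatives get 1 - p_island."""
    rescored = []
    for rank, (word, post) in enumerate(slot):
        island_score = p_island if rank == 0 else 1.0 - p_island
        island_score = max(island_score, 1e-6)        # avoid log(0) in the oracle case
        score = (posterior_scale * math.log(post)
                 + island_scale * math.log(island_score))
        rescored.append((word, score))
    return sorted(rescored, key=lambda x: x[1], reverse=True)

slot = [("give", 0.45), ("gave", 0.35), ("live", 0.20)]
# A low island posterior penalizes the top word and promotes the alternatives.
print(rescore_soft(slot, p_island=0.2))
```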
Figure 6.4: After processing the word confusion network in Fig. 6.2 with attention
shift decoding using soft decision in the oracle experiments.
6.4 Automatic Island Detection
At the heart of the ASD method is island detection. The goal is to detect whether the top word hypothesis in each time slot of a confusion network is an island or not. Here, we propose a new set of features for automatic island-gap detection, inspired by both human and machine speech recognition. First, we summarize some key factors in human word recognition which also led us to select some of the features in our island-gap classifier. References and a review of the research on spoken word recognition can be found in [51]. It is not surprising that there is some commonality between the factors affecting human and machine recognition of speech. Where applicable, these similarities are addressed as this section evolves.
Successful human communication depends on word recognition [51]. There is no doubt that segmental features provide information about which sounds are in an utterance. For fluent speakers of a language, the words are usually stored in long-term memory, and hence lexical access is an essential part of word recognition [51]. Segmental and suprasegmental information are extracted from the signal and used in lexical access to activate a set of candidate words in the lexicon [51]. The factors affecting lexical access and word activation are as follows. It was found that segmental mismatch is more disruptive of lexical access in word-initial than in word-final position, since words with an initial mispronunciation have to recover from a poor start, while words with a final mispronunciation can already be highly activated before the mismatch occurs [51]. This is also valid for machine recognition of speech from the perspective of the search paradigm. Hence, we compared the first (f_syl) and last syllables (l_syl) of the first-best and second-best word hypotheses in a time slot; i.e., if the first syllables of the top two words are the same, then f_syl = 1; otherwise, f_syl = 0.
Second, mismatched segments in short words appear to be more disruptive than those in long words [51]. Hence, we used three word-length measures to capture this information for the top word hypothesis in a time slot: the word duration in milliseconds (duration), obtained from the CN, and the number of phones (n_ph) and syllables (n_syl) in the word. Syllabification software from NIST [22] is used for syllabifying words using their phoneme strings. The approximate phone duration (ph_dur) is also used (word duration in milliseconds divided by the number of phones). Third, lexical neighbors play a role in word recognition; the presence or absence of similar-sounding words influences the effect of segmental mismatch [51]. Thus, we captured distance information among the set of word hypotheses in a time slot. The phone-based distance scores are computed using the standard Levenshtein distance. We computed the minimum (min_dist) and maximum (max_dist) phone distance in a time slot after computing the phone distances between all possible pairs of words in the slot. We also computed the phone distance between the first-best and second-best hypothesis words in a time slot (compete distance: comp_dist), and the normalized compete distance (n_dist: the compete distance divided by the number of phonemes in the first-best word).
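To illustrate the distance features, the sketch below computes phone-level Levenshtein distances for the hypotheses in one slot and derives min_dist, max_dist, comp_dist, and n_dist; the phone strings are made up, while the feature names follow the text.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# One CN slot: phone strings of the hypotheses, best hypothesis first.
slot = [["g", "ih", "v"], ["g", "ey", "v"], ["l", "ih", "v"]]

pair_dists = [levenshtein(a, b) for a, b in combinations(slot, 2)]
min_dist = min(pair_dists)
max_dist = max(pair_dists)
comp_dist = levenshtein(slot[0], slot[1])     # first-best vs. second-best
n_dist = comp_dist / len(slot[0])             # normalized by the first-best length

print(min_dist, max_dist, comp_dist, round(n_dist, 2))
```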
When the number of similar-sounding words increases, word recognition becomes harder and is delayed for humans [51]. Similarly, the number of word hypotheses within a time slot is a clear indication of the level of difficulty an ASR system is having. For example, it was observed that when there is an out-of-vocabulary word in a sentence, the ASR system usually makes mistakes, and the number of hypotheses in these time slots is usually larger. Hence, we used the number of alternative word hypotheses within a time slot (ncomp) to capture the number of candidate words for the ASR.
Word frequency also affects lexical activation; i.e., humans recognize more accurately the words they use frequently [51]. This is also true for ASR systems; words which occur more frequently usually have more data samples during training and hence may be recognized more accurately. From the LM, we used the unigram probability (unigram) of the top hypothesis word in a time slot to capture the word occurrence frequency.
Suprasegmental information is another cue used during lexical access by humans. English listeners appear to be sensitive to whether a syllable is prominent (stressed) or not, since they expect content words in English to begin with prominent syllables [51]. Similarly, content words are more likely to be recognized correctly than function words in ASR. Hence, we used the prominence (prom) of the top word hypothesis in a time slot as a feature for the island-gap classifier. For this, we used the top-down attention model proposed in Chapter 3, which can detect prominent words in speech with high accuracy from the acoustic signal. As shown in Fig. 6.1, the acoustic signal is used to extract the prominence of the top word hypothesis in each slot using the boundaries extracted from the CN.
Some of the ASR output scores are also natural inputs to the island-gap classifier, measuring how confident the ASR is about the word hypotheses. The following features from the confusion networks are used: the normalized posterior probability (post), the normalized likelihood score (acoustic score per frame, as), and the language model score (ls). To measure the uncertainty within a time slot, we also used the entropy (entropy) of the probability distribution of the words within the time interval and the competing posterior probability (comp_post: the ratio between the posterior probabilities of the first-best and second-best hypotheses within a time interval).
During LM scoring, when there is no entry in the LM for the higher-order statistics of a word sequence, the speech recognizer uses the available lower-order statistics. Hence, this is evidence of how reliable the LM score is. The LM back-off values (NG) are printed in the lattice output; i.e., if the LM score is the result of 3-gram statistics then NG = 2, and if it is the result of unigram statistics then NG = 0. We used the following LM back-off related parameters: the value of NG for the top hypothesis word in the time interval (NG); the distance to the maximum (max_NG) and minimum (min_NG), i.e., the difference between the NG value of the top hypothesis and the maximum/minimum NG over all the words in a time slot; and the range of NG (range_NG: the difference between the maximum and minimum NG over all the words in a time slot).
6.5 Experiments and Results
The Boston University Radio News Corpus was used in the experiments. The pitch accent tags that are available with the database are used only during training for the prominence detection (using only the training set). After eliminating story repetitions from the same speaker, the remaining data was split into five folds, each with 50% train (14.5K words), 30% development (8.6K words), and 20% test (5.9K words) sets. The Hidden Markov Model Toolkit (HTK) is used for the baseline experiments [94]. We adapted context-dependent triphone acoustic models trained on the WSJ and TIMIT tasks with data from the training partitions of the BU-RNC using the MAP and MLLR algorithms. The adapted acoustic models were gender specific (not speaker dependent). The 39-dimensional standard MFCC features are used as acoustic features. A standard back-off trigram language model with Kneser-Ney smoothing, trained with data from the CSR project, was used. The language model vocabulary contained about 20K words. The OOV rates on the development and test sets were 3.8% and 3.7%, respectively. The development set was used for tuning the scale parameters. The ASR 1-best hypothesis output is used as the baseline result. Also, lattices created using HTK are transformed into word confusion networks using the SRILM toolkit. The Wilcoxon signed rank test is used to report the confidence level in terms of significance values (p-values) whenever we make comparisons.
The development and train sets are used for training the island-gap classifier. A 3-layer neural network is used as the classifier for island-gap detection. The neural network has d inputs, (d + n)/2 hidden nodes, and n output nodes, where d = 22 is the length of the feature vector and n = 2 since this is a two-class problem. Then, an information gain criterion is used to select the features with a forward algorithm; i.e., more features are added until the classifier accuracy starts to decrease. In Table 6.1, the features are ranked based on the information criterion. Among the features, entropy and prominence were, respectively, the most and the least informative about the island-gap classes. This indicates that even though prominence is an important cue for humans, since the previous stages of ASR ignore this cue, prominence carries no information about the reliability of the created word hypotheses. The number of selected features that gives the highest accuracy for each fold varied from nineteen to twenty-two. In Table 6.2, the island-gap detection results are presented. The chance levels for the development and test sets were 79.5% and 78.4%, respectively. With the proposed features, we achieved an overall 84.7% and 83.7% accuracy on the development and test sets, which is well above the chance level. These results are significantly higher than the previously reported results on island-gap detection; 63.47% island-gap classification accuracy was obtained in [45], though on a different database.

Table 6.1: Island-Gap Detection Features Ranked by Information
Rank  Feature        Rank  Feature
1     entropy        12    as
2     posterior      13    n_ph
3     comp_post      14    min_dist
4     ls             15    f_syl
5     ncomp          16    ph_dur
6     unigram        17    range_NG
7     NG             18    n_syl
8     max_dist       19    duration
9     max_NG         20    min_NG
10    comp_dist      21    l_syl
11    n_dist         22    prom

Table 6.2: Island-Gap Detection Results
System     Overall Acc         Island Acc          Gap Acc
           Dev.     Test       Dev.     Test       Dev.     Test
Predic.    84.7     83.7       93.9     94.5       48.9     44.9
In Table 6.3, the baseline results using the standard ASR 1-best output are presented.
18.6% and 18.4% word error rates (WER) were obtained on the development and test
sets, respectively. We also tried rescoring the word confusion networks with the LM
without using island-gap information, and this provides only 0.1% improvement over
the baseline in both development and test sets as shown in Table 6.3.
In Table 6.4, the results obtained using ASD with hard decision are presented. In the oracle experiments, where it is assumed that all the islands and gaps are known perfectly, the WER is reduced to 15.9% and 15.8%, providing 2.7% and 2.6% absolute improvements over the baseline on the development and test sets, respectively. When the predicted island-gap information from the classifier is used, we obtained 18.2% and 18.0% WER on the development and test sets, respectively. This provides 0.4% (2.2% relative) improvement over the baseline on both sets, which is significant at p < 0.001.
In Table 6.5, the results obtained using ASD with soft decision are presented. In the oracle experiments, the WER is further reduced to 15.4% and 15.5%, providing 3.2% and 2.9% absolute improvements over the baseline on the development and test sets, respectively. The improvement over the hard-decision oracle results is attributed to the pruning of the (incorrect) top word hypothesis for gaps. When the automatically detected island-gap information from the classifier is used, we obtained 18.2% and 18.0% WER on the development and test sets, respectively. Similar to hard decision, this provides 0.4% (2.2% relative) improvement over the baseline on both sets, and the
Table 6.3: The Baseline ASR
System Dev. WER Test WER
Baseline 18.6 18.4
CN Rescoring 18.5 18.3
Table 6.4: The Results Using ASD with Hard Decision
System WER Improvement Relative Improv.
Dev. Test Dev. Test Dev. Test
Oracle 15.9 15.8 2.7 2.6 14.5% 14.1%
Predic 18.2 18.0 0.4 0.4 2.2% 2.2%
Table 6.5: The Results Using ASD with Soft Decision
System WER Improvement Relative Improv.
Dev. Test Dev. Test Dev. Test
Oracle 15.4 15.5 3.2 2.9 17.2% 15.8%
Predic 18.2 18.0 0.4 0.4 2.2% 2.2%
improvement is significant at p < 0.001. We observed that ASD with soft decision performed better than hard decision with the automatically detected islands; however, the improvement was not statistically significant.
6.6 Conclusion
In this chapter, we presented an attention shift decoding method inspired by human speech recognition. In contrast to traditional ASR systems, ASD decodes speech inconsecutively using reliability criteria; the gaps (unreliable speech regions) are decoded with the evidence of islands (reliable speech regions). In the experiments with oracle information, ASD provides significant improvement (2.9% absolute) over the baseline ASR results, confirming the promise of the method. At the heart of the ASD method is the automatic island-gap detection. Hence, we proposed a new feature set for island-gap detection and obtained 83.7% accuracy, which is significantly higher than previously reported results. To cope with the imperfect nature of island-gap classification, we proposed a new ASD algorithm using soft decision rather than hard decision. The ASD with soft decision provided 2.2% relative improvement over the baseline, which is significant (p < 0.001). As part of future work, we plan to explore more features to improve the island-gap detection accuracy.
Chapter 7:
Saliency-Driven Unstructured Acoustic Scene
Classication
7.1 Chapter Summary
Automatic classification of real-life, complex, and unstructured acoustic scenes is a challenging task, as the number of acoustic sources present in the audio stream is unknown and the sources overlap in time. In this chapter, a novel saliency-driven approach is proposed for classification of such unstructured acoustic scenes. Motivated by the bottom-up attention model of the human auditory system, salient events of an audio clip are extracted in an unsupervised manner. Only the selected acoustic events are later presented to the classification system. Our results on the BBC Sound Effects Library indicate that the saliency-driven attention selection approach proposed here provides both a performance improvement and efficiency in terms of data size compared to conventional frame-based classification.
7.2 Introduction
Automatic categorization of complex, unstructured acoustic scenes (where an acoustic scene may consist of any number of unknown sources, which may also overlap in time) is a difficult task, as the appropriate label eventually associated with the audio clip of an unknown scene depends on the key acoustic event present in it. For example, an audio clip labeled "crash" may contain human conversation and/or other related sources, but the highlight "crash" of the vehicle is used to categorize the acoustic scene. When any (unknown) number of acoustic sources can be present in a clip but only one or a handful of them are relevant, the approach adopted by conventional audio classification systems would entail classifying every source in the scene and then implementing application-specific post-processing. This approach has two major drawbacks that can be immediately identified: (1) a large amount of computational resources is committed to processing feature-level information that is subsequently marginalized, hence it is inefficient; and (2) it is not possible to train for every possible acoustic source a priori, hence it is not robust.
It is well established that, in comparison to machine-based systems, humans can precisely process and interpret complex scenes rapidly. One of the key enablers of this capability is the attention mechanism, which processes a scene selectively and non-sequentially. As stated previously, one drawback of conventional audio content processing approaches is that they process the entire signal or acoustic scene fully and equally in detail (i.e., recognizing each and every source/event in an acoustic scene). This issue can be alleviated by taking advantage of a selective attention mechanism similar to the one humans employ. Thus, in this dissertation, we propose a novel method
that emulates human auditory attention for acoustic scene recognition. The algorithm first detects the salient audio events in a cluttered auditory scene in an unsupervised manner, and then processes only the selected events with a previously learned representation for acoustic scene recognition. In other words, the work presented here allows us to process only a subset of meaningful information in a complex acoustic scene rather than processing the whole acoustic scene fully and equally, as conventional audio content processing methods do. This has the potential to improve the classification accuracy of unstructured audio clips and, additionally, to reduce the computational bandwidth (in terms of the amount of training data and/or runtime memory requirements) required to process audio content.
In this work, we first present results using a neural network to learn the mapping between the salient events and scene categories, since neural networks are biologically well motivated. However, it is important to note that saliency (and its definition) does not depend on any individual acoustic source; hence we are interested in a class-independent representation approach for the classification framework. For class-independent representation of audio clips, we also use latent perceptual indexing (LPI) [80, 81], which seeks a single vector representation of an audio clip within a collection by using unit-document frequency measures. The main advantage of this approach is that it allows for comparison of arbitrary audio clips through a vector similarity measure that embodies both semantic and perceptual similarities [81]. By combining this with the saliency-based attention model, in a method called latent indexing using saliency (LISA), the work presented here allows us to process only a subset of the meaningful information in a complex acoustic scene.
We test our system on the BBC Sound Effects Library [1], which consists of a large variety of audio clips belonging to acoustic scene categories such as household, military, office, etcetera. This data set is particularly challenging for a machine-based classification task because almost all the clips contain multiple acoustic sources, the number of which is unknown.
The chapter is organized as follows: first, a comprehensive discussion of related work in audio content processing is presented in Section 7.3. Then, the saliency-driven acoustic scene classification method is explained in Section 7.4, followed by the discussion of salient event detection in Section 7.4.1 and latent perceptual indexing in Section 7.4.3. The experimental results and conclusions are presented in Sections 7.5 and 7.6, respectively.
7.3 Related Work
Starting from [88], typical examples of audio classification systems use category-based modeling for a selection of audio clips [24, 47, 88]. In [24, 88] the system is evaluated on a variety of categories such as animals, bells, crowds, female, laughter, machines, male voices, percussion instruments, telephone, water sounds, etcetera. While the performance of these systems is notable, they were trained and tested on homogeneous clips (audio clips or segments that contain only one acoustic source, for example an instance of laughter). Examples of similar approaches that deal with more complex acoustic scenes include sports highlighting [91], context-aware listening for robots [13], and background/foreground audio tracking [68]. While these methods target more complex
acoustic scenes, they are still based on the typical classification approach of category-based modeling, and therefore they are difficult to generalize to clips of unstructured acoustic scenes.
Examples of other approaches that deal with clips of unstructured acoustic scenes are
[5, 77]. In [77] the author improves on the naive labeling scheme by creating a mapping
from each node of a hierarchical model in the abstract semantic space to the acoustic
feature space. The nodes in the hierarchical model (represented probabilistically as
words) are mapped onto their corresponding acoustic models. In [5], the authors have
adopted a similar approach of modeling features with text labels in the captions. In such
cases, however, the focus has mainly been on relating the audio clips to their language-level descriptions.
In contrast to these approaches and the typical class-based training approaches pre-
sented earlier, in this work we focus on selectively processing the audio events similar
to the way humans detect important segments in a cluttered acoustic scene and subse-
quently use them to classify the given clip. Here, we use the bottom-up attention model
described in Chapter 2 to detect salient audio events present in an acoustic scene. To the best of our knowledge, there has been no prior work that applies saliency-based attention models to the recognition of unstructured acoustic scenes.
7.4 Proposed Method
The block diagram of the proposed method is shown in Fig. 7.1.
Figure 7.1: Block diagram of the saliency-driven acoustic scene recognition method. W is the duration of the window centered on the detected salient time point to extract the salient audio event.
First, the audio signal is fed into a salient event detector, which is described in Section 7.4.1. The output of the salient event detector is a one-dimensional saliency score time-aligned with the original acoustic signal. As explained in Section 7.5, the audio events for subsequent classification are selected in decreasing order of saliency. To capture the audio event corresponding to a salient point, the sound around each salient point is extracted using a window of duration W centered on that time point. In other words, we assume that an audio segment of duration W centered on a salient point (in time) corresponds to a salient audio event. The perceptually motivated features are extracted from the detected salient audio events and indexed into the latent space (the `Learner' in training) as explained in Section 7.4.3. Classification of an unknown test clip (the `Predictor') is performed by comparing it with the collection of labeled audio clips (the `Learner' in training) in the latent space. It is important to note that in LISA the collection of labeled training clips is used only to assess the performance of the approach presented here; the latent representation itself is derived in a class-independent manner. Details of the experiments are given in Section 7.5. Next, the salient audio event detector is explained.
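Before moving on, the windowing step described above can be sketched as follows; the function name, the use of peak times in seconds, and the default W = 1 sec are illustrative choices for this example rather than details of the actual implementation.

import numpy as np

def extract_salient_events(signal, sr, salient_times, W=1.0):
    """Cut a window of duration W (seconds) centered on each salient time point.

    signal: mono waveform, sr: sample rate, salient_times: peak times in seconds
    produced by the saliency model (most salient first).
    """
    half = int(0.5 * W * sr)
    events = []
    for t in salient_times:
        center = int(t * sr)
        start, end = max(0, center - half), min(len(signal), center + half)
        events.append(np.asarray(signal[start:end]))
    return events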
7.4.1 Salient Event Detection
At the core of the proposed method is our previously described bottom-up auditory attention model, which computes an auditory saliency map from the input sound. The details of the auditory saliency map model were described in Chapter 2, and the block diagram of the auditory attention model is given in Fig. 2.1.
First, an auditory spectrum of the sound is estimated using an early auditory system model. For analysis, audio frames of 20 ms with a 10 ms shift are used, i.e., each 10 ms audio frame is represented by a 128-dimensional vector. Next, the auditory spectrum is analyzed by extracting a set of multi-scale features. Here, intensity, frequency contrast, temporal contrast, and orientation features are extracted at multiple scales, and pitch is excluded. Eight scales are created (if the audio segment duration W ≥ 1.28 s; otherwise there are fewer scales). As shown in Fig. 2.1, after extracting features at multiple scales, "center-surround" differences are calculated, resulting in "feature maps". The center-surround operation mimics the properties of local cortical inhibition and detects the local temporal and spatial discontinuities in feature channels. Center-surround differences are computed as point-wise differences across scales using three center scales c ∈ {2, 3, 4} and two surround scales s = c + δ with δ ∈ {3, 4}, resulting in six feature maps for each feature set. In total, 30 feature maps are computed: six each for intensity, frequency contrast, and temporal contrast, and twelve for orientation since it uses two angles, θ ∈ {45°, 135°}. Each feature map is normalized in the order of within-scale, across-scale, and across-features. For this, the normalization algorithm, an iterative nonlinear operation detailed in Chapter 2.3.2, is used. As a result of normalization, possibly noisy feature maps are reduced to sparse representations of only those locations which strongly stand out from their surroundings. All normalized maps are then summed to provide the bottom-up input to the saliency map.
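The across-scale differencing can be pictured with the simplified sketch below; it is a schematic stand-in for the Chapter 2 implementation (the pyramid construction, border handling, and the iterative normalization are all simplified or omitted, and the scale indexing is an assumption).

import numpy as np
from scipy.ndimage import zoom

def dyadic_pyramid(feature_map, n_levels=9):
    """Build a dyadic pyramid over the (channels x frames) map; level 0 is the input."""
    levels = [feature_map]
    for _ in range(n_levels - 1):
        prev = levels[-1]
        if min(prev.shape) < 2:          # short or narrow inputs yield fewer scales
            break
        levels.append(zoom(prev, 0.5, order=1))
    return levels

def center_surround(levels, centers=(2, 3, 4), deltas=(3, 4)):
    """Point-wise across-scale differences: up to six maps per feature channel."""
    maps = []
    for c in centers:
        for d in deltas:
            s = c + d
            if s >= len(levels):
                continue
            factors = [cs / ss for cs, ss in zip(levels[c].shape, levels[s].shape)]
            surround = zoom(levels[s], factors, order=1)   # interpolate back to scale c
            h = min(levels[c].shape[0], surround.shape[0])
            w = min(levels[c].shape[1], surround.shape[1])
            maps.append(np.abs(levels[c][:h, :w] - surround[:h, :w]))
    return maps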
The saliency map holds non-negative values, and its maximum defines the most salient location in the 2D auditory spectrum. It is assumed that saliency combines additively across frequency channels, and the saliency score S(t) for each time point t is computed using Eq. 2.6. Then, the local maxima of S(t) are found, and the audio event at the corresponding time point is marked as salient together with its saliency score. Later, these salient points are selected in order of decreasing saliency score, as discussed in Section 7.5.
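Assuming, as stated above, that saliency sums additively across frequency channels (Eq. 2.6 itself is not reproduced here), the peak-picking step can be sketched as follows; the normalization to [0, 1] anticipates the score range used in Section 7.5 and is our assumption about how that range is obtained.

import numpy as np
from scipy.signal import find_peaks

def salient_points(saliency_map, frame_shift=0.010):
    """Collapse the 2D saliency map (channels x frames) to S(t) and pick local maxima.

    Returns (time in seconds, saliency score) pairs sorted by decreasing score.
    """
    S = saliency_map.sum(axis=0)                 # additive combination across channels
    S = S / (S.max() + 1e-12)                    # scores in [0, 1]
    peaks, _ = find_peaks(S)                     # local maxima of S(t)
    order = np.argsort(S[peaks])[::-1]
    return [(float(peaks[i]) * frame_shift, float(S[peaks[i]])) for i in order]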
7.4.2 Discussion
The bottom-up attention model is capable of detecting only the salient audio events represented in at least one of the four implemented features, i.e., the intensity, frequency contrast, temporal contrast, and orientation features. In Fig. 7.2, a sample sound clip tagged with "goat machine milked" is shown. In the figure, the first and second tiers show the waveform and the spectrum of the clip, respectively. The third tier shows the transcription, where M represents machine noise and G represents goat voice. The fourth tier shows the saliency score obtained from the bottom-up auditory attention model. For this clip, the model detects the locations of all goat voices in the sound clip.
Figure 7.2: Results for a sample sound clip tagged as "goat machine milked". The tiers show i) the waveform of the sound, ii) the spectrum, iii) the transcription, where M represents machine noise and G represents goat voice, and iv) the saliency score.
Although the third goat event from the left is drowned in the machine sound in the background, the model could successfully detect this event as well. This alludes to the fact that the model is not limited to the intensity feature. This scene can be summarized as follows: the goat's voice pops out perceptually while the machine noise becomes less prominent, as in the figure-ground phenomenon in visual perception.
Similarly, another example from the database we used is a sound clip tagged with "wigeon at pool". For this clip, the auditory attention model detects the locations of the bird tweets and suppresses the background water sound.
7.4.3 Latent Perceptual Indexing
In this work, latent perceptual indexing (LPI) [80] is used for class-independent representation of audio clips. An entire audio clip from a collection of audio clips is represented as a single vector in a latent perceptual space; this is similar to latent semantic indexing/mapping (LSI) [6, 18] for text documents. First, a bag of feature-vectors is extracted from a given audio clip. Then, this clip is characterized by counting the number of feature-vectors that are quantized into each of the reference clusters of signal features (analogous to the term-document frequency counts in information retrieval). Applying this procedure to the whole collection of clips results in a sparse matrix where each row represents a quantitative characterization of a complete clip in terms of the reference clusters. The reference clusters are obtained by unsupervised clustering of the whole collection of features extracted from the clips in the library, and they are assumed to represent distinct perceptual qualities. A reduced-rank approximation of this sparse representation is obtained by singular value decomposition, which maps the audio clips to points in a latent perceptual space. Thus each audio clip is represented as a single vector. The LPI approach is similar to LSI of text documents [6]; the units or reference clusters in LPI are taken to be equivalent to terms (or words) in LSI, and the audio clips in LPI are equivalent to text documents in LSI.
This method is implemented as follows. Let us assume that a collection of M audio clips is available in a database, with the i-th clip having T_i feature-vectors. Then, the procedure involved in obtaining a representation in the latent perceptual space is listed below [80]:

STEP 1. The collection of all the feature-vectors obtained from all the clips in the database is clustered using the k-means clustering algorithm. This results in C reference clusters.

STEP 2. Let the i-th audio clip have a total of T_i frames. FOR audio clip A_i, where i ∈ {1, ..., M}, DO:

i. Calculate f_{i,j} = (1/T_i) * sum_{t=1}^{T_i} I(lab(t) = j), for all j ∈ {1, ..., C}. Here I(.) ∈ {0, 1} is an indicator function: I(lab(t) = j) = 1 if the t-th frame is labeled to be in the j-th cluster, otherwise I(.) = 0.

ii. Assign F(i, j) = f_{i,j}, the (i, j)-th element of the sparse matrix F_{M×C}.

STEP 3. END FOR loop.

STEP 4. Obtain F_{M×C} = U_{M×M} S_{M×C} (V_{C×C})^T by SVD.

STEP 5. Obtain the approximation of F as F̃_{M×C} = Ũ_{M×R} S̃_{R×R} (Ṽ_{C×R})^T by retaining the R largest singular values.
In addition to the F matrix obtained at the end of Step 3, an entropy-based weighting term also weighs each column [6]. The approximation F̃ is obtained from the span of the basis vectors that have significant singular values. By retaining only the significant singular values, the randomness in quantization is eliminated. The similarity between a given test audio clip and the audio clips in the training set in the latent space is computed using the cosine vector similarity function [6, 80]. Using this measure, the k-nearest neighbor (KNN) classifier is used for classification of an unknown test audio clip.
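A compact sketch of this indexing and classification pipeline is given below, using scikit-learn's k-means and NumPy's SVD as stand-ins; the entropy-based column weighting of [6] is omitted for brevity, and the function names and default values (C, R, k) are our own illustrative choices.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def lpi_index(clip_features, C=1000, R=200):
    """Quantize frames onto C reference clusters, build the normalized
    term-document matrix F (M x C), and reduce its rank to R by SVD.

    clip_features: list of (T_i x d) feature arrays, one per audio clip.
    Returns the k-means model, the retained right singular vectors, and the
    clip vectors (rows of U_R * S_R) in the latent perceptual space.
    """
    kmeans = KMeans(n_clusters=C, n_init=10).fit(np.vstack(clip_features))
    F = np.zeros((len(clip_features), C))
    for i, X in enumerate(clip_features):
        counts = np.bincount(kmeans.predict(X), minlength=C)
        F[i] = counts / float(len(X))            # f_{i,j}: relative cluster frequency
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    clip_vectors = U[:, :R] * s[:R]
    # A held-out clip with frequency vector f folds in as f @ Vt[:R].T
    return kmeans, Vt[:R], clip_vectors

def lpi_classify(train_vectors, train_labels, test_vectors, k=7):
    """Cosine-similarity KNN in the latent space (cosine distance = 1 - similarity)."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_vectors, train_labels)
    return knn.predict(test_vectors)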
In LPI, all segments or feature-vectors of an audio clip are used for indexing. Here, we propose a modified method called latent indexing using saliency (LISA), which combines saliency-based audio event selection with the class-independent LPI method for audio scene recognition. In other words, LISA uses only selected salient segments of an audio clip, whereas the original LPI uses the whole audio clip for scene recognition. The information processed by the SVD of the term-document matrix in LPI is different from the segments selected by the saliency map. In LPI, one attempts to derive the underlying perceptual structure by eliminating the randomness caused by different recording conditions or realizations of the same acoustic source. The auditory saliency model, however, selects salient events in an audio clip while ignoring parts that would typically constitute the `background' of an acoustic scene. As our results illustrate, by combining the two in LISA, we attempt to use only a subset of the meaningful acoustic information in an unsupervised manner to classify a given acoustic scene. It is important to note that in the next section, for LPI and LISA, the category labels are only used to assess and compare the performance of the different approaches. The latent representation is obtained using only the unsupervised k-means algorithm for the reference clusters and the SVD of the term-document matrix.
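Schematically, LISA only changes which frames enter the indexing; the sketch below reuses the hypothetical helpers introduced in the earlier sketches (salient_points, extract_salient_events, and lpi_index) and is not the actual implementation.

import numpy as np

def lisa_features(signal, sr, saliency_map, mfcc_fn, N=35, W=1.0):
    """Keep only the frames inside the top-N salient windows, then index with LPI.

    mfcc_fn(segment, sr) should return a (T x d) array of frame features for a
    segment; salient_points and extract_salient_events are defined above.
    """
    times = [t for t, _ in salient_points(saliency_map)[:N]]   # top N by saliency
    segments = extract_salient_events(signal, sr, times, W=W)
    return np.vstack([mfcc_fn(seg, sr) for seg in segments])   # frames handed to lpi_index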
7.5 Experiments and Results
For the experiments, 2,491 whole audio clips from the BBC Sound Effects Library [1] were used. The sound clips consist of natural, unconstrained audio recorded in real environments and composed of many mixed audio events and sources. The duration of the clips varies from 1 second to 9.5 minutes. The database is available pre-organized according to high-level semantic categories and their corresponding subcategories. Each clip in the library is labeled with a semantically high-level category that best describes the acoustic properties of the scene. There are twenty-one categories with a varying number of sound clips under each category, as shown in Table 7.1.
Table 7.1: Distribution of Clips Under Each Category
Category No. of files Category No. of files
IMPACT 16 NATURE 85
OPEN 8 SPORTS 151
TRANSPORTATION 295 HUMAN 357
AMBIENCES 311 EXPLOSIONS 18
MILITARY 102 MACHINERY 117
ANIMALS 359 SCI-FI 121
OFFICE 144 POLICE 96
HORROR 98 PUBLIC 44
AUTOMOBILES 53 DOORS 4
MUSIC 25 HOUSEHOLD 38
ELECTRONICS 49
The twelve-dimensional Mel-frequency cepstral coefficients (MFCCs) (excluding the C0 energy feature) were extracted from each audio clip for the sound classification experiments. The MFCCs are based on the early auditory system of humans and have been successfully used for generic audio classification in the literature [63]. Instead of the features extracted from the front-end of the auditory saliency model, we preferred to use the standard MFCC features here, since the focus of this work is acoustic scene classification based on salient acoustic events rather than the definition and presentation of new features. The MFCC features were extracted every 10 ms with a Hamming window of 20 ms length. The length of the audio segment for the audio classification task was analyzed empirically in [63], and the best audio classification accuracy was obtained using a 1 second (sec) window. Thus, the mean and standard deviation of the MFCC features were estimated over a 1 sec window, resulting in a 24-dimensional feature vector representing each 1 sec audio segment.
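The feature extraction just described can be sketched with librosa as follows; the exact front end used in the experiments may differ, and the parameter choices below simply mirror the 20 ms / 10 ms analysis and 1 sec pooling stated above.

import numpy as np
import librosa

def segment_features(path, seg_sec=1.0):
    """24-dim features per 1 sec segment: mean and std of 12 MFCCs (C0 dropped),
    computed from 20 ms Hamming windows with a 10 ms shift."""
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.020 * sr), win_length=int(0.020 * sr),
        hop_length=int(0.010 * sr), window="hamming",
    )[1:]                                     # drop the 0th coefficient, keep 12
    per_seg = int(seg_sec / 0.010)            # 100 frames per 1 sec segment
    feats = []
    for start in range(0, mfcc.shape[1] - per_seg + 1, per_seg):
        seg = mfcc[:, start:start + per_seg]
        feats.append(np.concatenate([seg.mean(axis=1), seg.std(axis=1)]))
    return np.array(feats)                    # shape: (n_segments, 24)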
All classification performances are evaluated by ten-fold cross-validation: 10% of the whole database is chosen as the test set and the remaining 90% is retained as the train set. This is repeated ten times (without replacement), and the final result is the average over these repetitions. Chance-level performance, which depends on the data distribution amongst the categories, was estimated to be 14.4%.
First, we establish a baseline system based on the conventional approach of creating category-based models. A 3-layer neural network is used for the baseline classification experiments. The neural network had d_in inputs, (d_in + d_out)/2 hidden nodes, and d_out output nodes, where d_in = 24 is the length of the feature vector and d_out = 21 since there are twenty-one classes. The neural network is used together with the 1 sec audio segments. The outputs of the 1 sec frame classification are then combined by majority voting to obtain the sound clip classification result, since some clips are longer than 1 sec. The baseline result is obtained by using all the frames in all the clips (i.e., without using the auditory saliency model), and it was 40.0% accuracy.
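As an illustration, the baseline classifier and the clip-level majority vote could look like the sketch below; scikit-learn's MLPClassifier is used here as a stand-in for the 3-layer network actually trained, and the function names are ours.

from collections import Counter
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_baseline(X_train, y_train, n_in=24, n_out=21):
    """3-layer stand-in: n_in inputs, (n_in + n_out) // 2 hidden nodes, n_out classes."""
    clf = MLPClassifier(hidden_layer_sizes=((n_in + n_out) // 2,), max_iter=500)
    return clf.fit(X_train, y_train)

def classify_clip(clf, segment_features):
    """Label every 1 sec segment of a clip, then take a majority vote for the clip."""
    votes = clf.predict(np.asarray(segment_features))
    return Counter(votes).most_common(1)[0][0]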
For the first experiment, the saliency model is used to scan and detect salient audio events in a scene, as explained in Section 7.4.1, and only these salient segments are used for classification. The saliency score takes values between 0.0 and 1.0: a value close to 0.0 indicates no saliency, and 1.0 indicates the most salient point in an audio clip. The scores for each clip are sorted, and the top N locations are marked as salient. Then, a W = 1 sec window centered on a marked time location is used to extract the corresponding salient audio event. The 24-dimensional MFCC features are extracted from these segments as explained previously. The reduction in data gained by keeping only the top N salient events is illustrated in Fig. 7.3.
Figure 7.3: Amount of data reduction as a function of the number of retained salient points (N).
The number of retained salient audio events is varied from N = 1 to N = all_sal (all the detected salient points are used, irrespective of their saliency score). Retaining only the top salient point provides a 98.8% data reduction. Retaining all the salient audio events still provides more than 40% data reduction.
For classification, the `learner' in Fig. 7.1 is implemented using a 3-layer neural network to test the effectiveness of salient audio event detection for frame-based classification. The frame-based classification results after applying the saliency model are shown as a function of N in Fig. 7.4. Additionally, Fig. 7.4 presents the baseline result using all the frames (40% accuracy) and the chance level (14.4%) for comparison purposes. It can be observed that the performance obtained by retaining only the top salient location (N = 1) is better than using all of the frames (the whole sound clip).
Figure 7.4: Clip accuracy results obtained using all frames and the top N salient audio events with the neural network classifier.
Figure 7.5: Clip accuracy results obtained with the LISA and LPI methods.
The best result is 46.9% clip accuracy, obtained with N = 15. This provides approximately 7% absolute improvement over the baseline while reducing the amount of data processed by the classifier by more than 85%, as shown in Fig. 7.3. A reduction in performance is observed when all the salient locations are used for classification; this, however, is still above the baseline result.
Finally, the results obtained using LISA are illustrated in Fig. 7.5. In the experiments, the number of reference clusters in LISA is varied from C = 200 to C = 2000 with a step size of 100. For KNN, K = 7 nearest neighbors were found to give the best performance. In Fig. 7.5, we present the best accuracy results obtained with C reference clusters by retaining the top N salient points. The best performance of 49.7% was obtained with LISA by retaining the top 35 salient points and using C = 1700 clusters. LISA provides approximately 10% absolute improvement over the baseline frame-based classification and 3% absolute improvement over frame-based classification using only the salient segments. Results obtained using only LPI (i.e., using the whole audio clip without salient event selection) are also shown in Fig. 7.5. LPI achieves 50.4% classification accuracy using C = 2000 clusters. As a result, it can be seen that results comparable to LPI with all the feature vectors can be obtained by using only the top 35 salient points (a data reduction of approximately 74%). Consequently, we can also say that in most cases the salient segments of an audio clip are the defining moments of an unstructured acoustic scene clip.
7.6 Conclusion
In this chapter, a novel method that mimics human auditory attention for acoustic scene recognition was presented. The method first detects the salient audio events present in an unstructured audio clip using a bottom-up auditory attention model, and then processes only the selected salient events for acoustic scene recognition using latent perceptual indexing. The salient event detection algorithm is completely unsupervised; it can be used to obtain the salient, defining audio events in an acoustic scene cluttered with different (unknown) acoustic sources. This allows us to categorize unstructured audio clips of acoustic scenes without processing the whole clip. Additionally, for such scenes, using the term-document frequency measures to derive a representation is desirable as it makes no assumptions about the individual sources present in the scene. This makes the approach applicable to a variety of audio content processing problems. Hence, by combining latent perceptual indexing with the saliency-based attention model, a novel method called LISA was proposed in this chapter. The performance of the method is tested using the BBC Sound Effects Library, and it is shown that LISA provides 10% absolute (25% relative) improvement over the baseline by retaining only the top 35 salient points, while reducing the amount of data by approximately 74%. LPI and LISA perform approximately the same; however, LISA uses fewer data points and reference clusters since it only uses selected salient events in an audio clip.
The auditory saliency model behaves as a highlighting mechanism that selects only the events that perceptually pop out of an acoustic scene while ignoring segments or sources that are part of the background. For example, in the clip tagged with "goat machine milked" discussed in Section 7.4.2, the saliency model detects the locations of the goat sound and ignores the machine sound. Hence, the predicted label for this clip would be "Animal" when only salient segments are considered for classification. However, the high-level semantic label provided with the database for this clip was "Machine". As can be seen from the detailed tag, this is not completely incorrect, since the clip description includes an animal sound. Relating semantic descriptions (with multiple tags) to ranked salient segments of acoustic scenes is an interesting avenue to explore, with many applications in audio content processing. This is a part of our planned future work for this framework.
Chapter 8:
Conclusion
Human-like attention-driven acoustic signal processing has been explored in this thesis. The attention mechanism is one of the key enablers that helps humans process incoming sensory stimuli efficiently and accurately. Thus, computational algorithms emulating human auditory attention are desirable tools for machine processing of acoustic signals. In this dissertation, novel biologically inspired auditory attention models are proposed, and their effectiveness is demonstrated with experiments.
One of the main contributions of this dissertation is the bottom-up saliency-driven auditory attention model proposed in Chapter 2. The bottom-up (BU) auditory attention model processes the input signal based on the processing stages in the early and central auditory system and extracts multi-scale feature maps. Then, these feature maps are combined to provide the bottom-up input to the final auditory saliency map, which indicates the perceptual influence of each part of the acoustic scene. The maximum of the saliency map indicates the most salient location in the 2D auditory spectrum. The proposed bottom-up auditory attention model is task-independent and language-independent, and it works in an unsupervised manner.
The bottom-up auditory attention model provides fast detection of salient events in an acoustic scene. In other words, it detects the acoustic events that perceptually pop out of a scene by significantly differing from their neighbors. It has been demonstrated with the experiments in Chapter 2 that the BU model can successfully detect the prominent syllable and word locations in speech.
The other contribution of this dissertation is the top-down task-dependent attention model proposed in Chapter 3. The top-down auditory attention model shares the same feature maps with the bottom-up auditory attention model, since it is also biologically inspired and hence based on the processing stages in the human auditory system. Then, the feature maps are converted into low-level auditory gist features that capture the essence of a scene. Finally, the auditory gist features are biased to imitate the modulation effect of task on neuron responses, in order to reliably detect a target or to perform a specific task. In Chapter 3, the experiments showed that the top-down model successfully detects prominent syllables in speech and provides approximately 10% absolute improvement over the bottom-up auditory attention model. Also, the top-down model performs significantly better than all the previously reported methods using only acoustic features, providing approximately 8-10% absolute improvement over the previous results. However, it is computationally more expensive than the previous methods, which use 10-15 dimensional prosodic features.
One of the strengths of the auditory attention models proposed here is that the
models are generic, and they can be used in other spoken language processing tasks and
general computational auditory scene analysis applications. For instance, some of the
possible applications for the BU model are salient event/object detection, change point
detection, and novelty detection; hence it can be used for detection and segmentation purposes. The top-down auditory attention model is a powerful model due to its multi-scale features. It uses prior knowledge since it is task-dependent; hence it can be used in classification or recognition problems, e.g., scene understanding, context recognition, and speaker recognition.
The performance expected from the auditory attention models is limited by the five features used in the models: intensity, frequency contrast, temporal contrast, orientation, and pitch. In other words, the model will fail to perform tasks that require features which are not implemented here. For example, the current model uses a mono signal, and hence spatial cues are not implemented in the model. As a result, while the model will successfully work for tasks which are represented by at least one of the features of the model, it will fail in tasks which require spatial cues, such as localization and source separation.
It is well known that, while processing speech, the brain is also influenced by higher-level information such as lexical information, syntax, and semantics. In Chapter 4, a novel model is proposed that combines task-dependent influences captured via syntactic and lexical cues with the top-down auditory attention model presented in Chapter 3. The combined top-down task-dependent model is used to detect prominent syllables in speech, and it is shown that it performs better than the acoustic-only model. Also, the combined model outperformed the state-of-the-art results on this task.
Another contribution of this dissertation is the incorporation of attention-driven processing of speech into an automatic speech recognition system. In Chapter 6, an attention shift decoding method inspired by human speech recognition is proposed for automatic
speech recognition. In contrast to traditional ASR systems, ASD decodes speech inconsecutively using reliability criteria; the gaps (unreliable speech regions) are decoded with the evidence of islands (reliable speech regions). In the experiments with oracle information, where the locations of the islands are assumed to be known perfectly, ASD provides significant improvement over the baseline ASR results, confirming the promise of the method. At the heart of the ASD method is the automatic island-gap detection. Hence, a new feature set inspired by both human and machine recognition of speech is proposed for automatic island-gap detection in this dissertation. Also, to cope with the imperfect nature of the island-gap classification, we proposed a new ASD algorithm using soft decision rather than hard decision. The ASD with soft decision provides significant improvement over the baseline. As part of future work, we plan to explore more features to improve the island-gap detection accuracy.
Finally, a novel method that emulates human auditory attention for acoustic scene recognition is proposed in this dissertation. The BU model proposed in Chapter 2 behaves as a highlighting machine. For example, in an audio clip containing goat voice and machine noise and labeled as "goat machine milked", the BU model detects the locations of the goat voice while ignoring the machine noise. In other words, the clip can be summarized as follows: the goat voice perceptually stands out and becomes the "figure" while the machine noise becomes the "ground", as in the figure-ground phenomenon in visual perception. In Chapter 7, it is shown that detection of the salient acoustic events in a scene is beneficial for scene recognition, since they are often the defining/key moments of a scene. It has been demonstrated with experiments conducted on complex, unstructured audio clips that, by filtering out the non-salient events of an acoustic scene, the BU model improves the scene recognition performance while also reducing the amount of data. Relating semantic descriptions with multiple tags to salient segments of an acoustic scene is a promising research direction to explore as part of future work.
Bibliography
[1] The BBC Sound Effects Library Original Series. http://www.sound-ideas.com, May 2006.
[2] C. Alain and S. R. Arnott. Selectively attending to auditory objects. Frontiers in Bioscience, 5:d202-212, 2000.
[3] S. Ananthakrishnan and S. Narayanan. Combining acoustic, lexical, and syntactic
evidence for automatic unsupervised prosody labeling. In Proceedings of ICSLP,
Pittsburgh, PA, September 2006.
[4] S. Ananthakrishnan and S. Narayanan. Automatic prosodic event detection using
acoustic, lexical, and syntactic evidence. IEEE Transactions on Audio, Speech, and
Language Processing, 16(1), 2008.
[5] L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet. Audio Information Re-
trieval using Semantic Similarity. In Proceedings of ICASSP, Honolulu, Hawaii,
2007.
[6] J. R. Bellegarda. Latent Semantic Mapping: A Data-Driven Framework for Modeling Global Relationships Implicit in Large Volumes of Data. IEEE Signal Processing Magazine, 22:70-80, September 2005.
[7] D. Bendor and X. Wang. The neuronal representation of pitch in primate auditory cortex. Nature, 436:1161-1165, 2005.
[8] A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sounds.
The MIT Press, London, 1990.
[9] I. Bulyko and M. Ostendorf. Joint prosody prediction and unit selection for concatenative speech synthesis. In Proceedings of ICASSP, volume 2, 2001.
[10] R. P. Carlyon, R. Cusack, J. M. Foxton, and I. H. Robertson. Effects of Attention and Unilateral Neglect on Auditory Stream Segregation. Journal of Experimental Psychology, 27(1):115-127, 2001.
[11] K. Chen, M. Hasegawa-Johnson, A. Cohen, and J. Cole. A maximum likelihood
prosody recognizer. In Proceedings of Speech Prosody, Nara, Japan, 2004.
[12] E. C. Cherry. Some Experiments on the Recognition of Speech, with One and with Two Ears. The Journal of the Acoustical Society of America, 25:975-979, 1953.
[13] S. Chu, S. Narayanan, C. C. Kuo, and M. J. Mataric. Where am I? Scene Recog-
nition for Mobile Robots using Audio Features. In Proceedings of ICME, July
2006.
[14] C. E. Connor, H. E. Egeth, and S. Yantis. Visual Attention: Bottom-Up Versus Top-Down. Current Biology, 14(19):850-852, 2004.
[15] A. Corazza, R. De Mori, R. Gretter, and G. Satta. Stochastic context-free gram-
mars for island-driven probabilistic parsing. In Proceedings of IWPT, 1991.
[16] R. Cusack, J. Deeks, G. Aikman, and R. P. Carlyon. Effects of location, frequency region, and time course of selective attention on auditory scene analysis. Journal of Experimental Psychology: Human Perception and Performance, 30(4):643-656, 2004.
[17] R. C. deCharms, D. T. Blake, and M. M. Merzenich. Optimizing sound features for cortical neurons. Science, 280:1439-1443, 1998.
[18] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 6(41):391-407, 1990.
[19] R. Desimone and J. Duncan. Neural Mechanisms of Selective Visual Attention. Annual Review of Neuroscience, 18(1):193-222, 1995.
[20] J. A. Deutsch and D. Deutsch. Some Theoretical Considerations. Psychological Review, 70(1):80-90, 1963.
[21] L. D. Erman, F. Hayes-Roth, V. R. Lesser, and D. R. Reddy. The Hearsay-II
speech-understanding system: Integrating knowledge to resolve uncertainty. ACM
Computing Surveys (CSUR), 1980.
[22] B. Fisher. Syllabification software. http://www.itl.nist.gov/iad/mig//tools/, 1997. National Institute of Standards and Technology (NIST).
[23] J. B. Fritz, M. Elhilali, S. V. David, and S. A. Shamma. Auditory attention - focusing the searchlight on sound. Current Opinion in Neurobiology, 17(4):437-455, 2007.
[24] G. Guo and S. Z. Li. Content-Based Audio Classification and Retrieval by Support Vector Machines. IEEE Transactions on Neural Networks, 14(1):209-215, January 2003.
[25] E. R. Hafter, A. Sarampalis, and P. Loui. Auditory Attention and Filters. In
W. Yost, editor, Auditory Perception of Sound Sources, volume 29. Springer, 2007.
[26] S. Harding, M. P. Cooke, and P. Koenig. Auditory gist perception: An alternative
to attentional selection of auditory streams. In Proceedings of WAPCV, Hyderabad,
India, 2007.
[27] M. Hasegawa-Johnson, K. Chen, J. Cole, S. Borys, S. S. Kim, A. Cohen, T. Zhang, J. Y. Choi, H. Kim, T. Yoon, et al. Simultaneous recognition of words and prosody in the Boston University Radio Speech Corpus. Speech Communication, 46(3-4):418-439, 2005.
[28] J. Hirschberg. Pitch Accent in Context: Predicting Intonational Prominence from Text. Artificial Intelligence, 63(1-2):305-340, 1993.
[29] S. Hochstein and M. Ahissar. View from the Top: Hierarchies and Reverse Hierarchies in the Visual System. Neuron, 36(5):791-804, 2002.
[30] D. H. Hubel, C. O. Henson, A. Rupert, and R. Galambos. Attention Units in the Auditory Cortex. Science, 129(3358):1279-1280, 1959.
[31] L. Itti and C. Koch. Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging, 10:161-169, January 2001.
[32] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.
[33] W. James. The Principles of Psychology. Vol. 1. Dover Publications, 1950.
[34] A. Johnson and R. W. Proctor. Attention: Theory and Practice. Sage Publications,
2004.
[35] J. Jonides and S. Yantis. Uniqueness of abrupt visual onset in capturing attention. Perception & Psychophysics, 43(4):346-354, 1988.
[36] D. Kahn. Syllable-based generalizations in English phonology. PhD thesis, Univ.
Massachusetts, Boston, 1976.
[37] O. Kalinli and S. Narayanan. A saliency-based auditory attention model with
applications to unsupervised prominent syllable detection in speech. In Proceedings
of Interspeech, Antwerp, Belgium, August 2007.
[38] T. Kawabata and K. Shikano. Island-driven continuous speech recognizer using phone-based HMM word spotting. In Proceedings of ICASSP, 1989.
[39] C. Kayser and N. Logothetis. Vision: Stimulating Your Attention. Current Biology, 16(15):R581-R583, 2006.
[40] C. Kayser, C. Petkov, M. Lippert, and N. Logothetis. Mechanisms for allocating auditory attention: An auditory saliency map. Current Biology, 15(8):1943-1947, 2005.
[41] K. Kirchhoff, J. Bilmes, and K. Duh. Factored language models tutorial. Technical Report UWEETR-2007-0003, Dept. of EE, U. Washington, June 2007.
[42] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying circuitry. Human Neurobiology, 4:219-227, 1985.
[43] G. Kochanski, E. Grabe, J. Coleman, and B. Rosner. Loudness predicts prominence: fundamental frequency lends little. The Journal of the Acoustical Society of America, 118, 2005.
[44] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):66138, 2004.
[45] R. Kumaran, J. Bilmes, and K. Kirchhoff. Attention shift decoding for conversational speech recognition. In Proceedings of Interspeech, Antwerp, Belgium, August 2007.
[46] S. H. Lee, H. K. Lee, and J. H. Kim. On-line Cursive Script Recognition Using
an Island-Driven Search Technique. In Proceedings of International Conference on
Document Analysis and Recognition, 1995.
[47] L. Liu, H. J. Zhang, and H. Jiang. Content Analysis for Audio Classification and Segmentation. IEEE Transactions on Speech and Audio Processing, 10(7):504-516, October 2002.
[48] R. Lowry. Vassarstats: Wilcoxon signed-rank test.
http://faculty.vassar.edu/lowry/wilcoxon.html.
[49] L. Mangu, E. Brill, and A. Stolcke. Finding consensus in speech recognition: word
error minimization and other applications of confusion networks. Computer Speech
and Language, 2000.
[50] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1994.
[51] J. M. McQueen. Eight questions about spoken-word recognition. In G. Gaskell,
editor, The Oxford handbook of psycholinguistics. Oxford University Press, USA,
2007.
[52] C. M. Moore and H. Egeth. Perception Without Attention: Evidence of Grouping Under Conditions of Inattention. Perception, 23(2):339-352, 1997.
[53] J. Moran and R. Desimone. Selective attention gates visual processing in the extrastriate cortex. Science, 229(4715):782-784, 1985.
[54] N. Moray. Attention in dichotic listening: Affective cues and the influence of instructions. Quarterly Journal of Experimental Psychology, 11(1):56-60, 1959.
[55] B. C. Motter. Neural correlates of attentive selection for color or luminance in extrastriate area V4. Journal of Neuroscience, 14(4):2178-2189, 1994.
[56] H. J. Muller and P. M. A. Rabbitt. Reflexive and voluntary orienting of visual attention: Time course of activation and resistance to interruption. Journal of Experimental Psychology, 15(2):315-330, 1989.
[57] V. Navalpakkam and L. Itti. An integrated model of top-down and bottom-up
attention for optimal object detection. In Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), New York, NY, 2006.
[58] D. Navon. Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9:353-383, 1977.
[59] T. Ogawa and H. Komatsu. Target Selection in Area V4 during a Multidimensional
Visual Search Task. Journal of Neuroscience, 24(28):6371, 2004.
[60] A. Oliva. Gist of a scene. In L. Itti, G. Rees, and J. K. Tsotsos, editors, Neurobiology of Attention, pages 251-256. Elsevier, 2005.
[61] M. Ostendorf, P. Price, and S. Shattuck-Hufnagel. The Boston University Radio
Corpus. 1995.
[62] M. Ostendorf, I. Shafran, and R. Bates. Prosody models for conversational speech
recognition. In Proceedings of the 2nd Plenary Meeting and Symposium on Prosody
and Speech Processing, 2003.
[63] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa. Computational
auditory scene recognition. In Proceedings of ICASSP, 2002.
[64] R. J. Peters and L. Itti. Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, 2007.
[65] J. F. Pitrelli, J. Subrahmonia, and B. Maison. Toward island-of-reliability-driven very-large-vocabulary on-line handwriting recognition using character confidence scoring. In Proceedings of ICASSP, 2001.
[66] I. Pollack and J. M. Pickett. Intelligibility of Excerpts from Conversation. The
Journal of the Acoustical Society of America, 35:1900, 1963.
[67] F. Pulvermüller and Y. Shtyrov. Language outside the focus of attention: The mismatch negativity as a tool for studying higher cognitive processes. Progress in Neurobiology, 79(1):49-71, 2006.
[68] R. Radhakrishnan and A. Divakaran. Generative Process Tracking for Audio Anal-
ysis. In Proceedings of ICASSP, May 2006.
[69] K. Ross and M. Ostendorf. Prediction of abstract prosodic labels for speech synthesis. Computer Speech & Language, 10(3):155-185, 1996.
[70] P. Ru. Multiscale Multirate Spectro-Temporal Auditory Model. PhD thesis, Univer-
sity of Maryland, 2001.
[71] C. E. Schreiner, H. L. Read, and M. L. Sutter. Modular organization of frequency integration in primary auditory cortex. Annual Review of Neuroscience, 23:501-529, 2000.
[72] S. Shamma. On the role of space and time in auditory processing. Trends in Cognitive Sciences, 5:340-348, 2001.
[73] C. Siagian and L. Itti. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29:300-312, 2007.
[74] M. C. Silveri and A. Cappa. Segregation of the neural correlates of language and phonological short-term memory. Cortex, 39(4-5):913-925, 2003.
[75] K. Silverman, M. Beckman, J. Pierrehumbert, M. Ostendorf, C. Wightman, P. Price, and J. Hirschberg. ToBI: A standard scheme for labeling prosody. In Proceedings of the International Conference on Spoken Language Processing, pages 867-870, 1992.
[76] E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representation. Annual Reviews in Neuroscience, 24(1):1193-1216, 2001.
[77] M. Slaney. Semantic Audio Retrieval. In Proceedings of ICASSP, Orlando, May
2002.
[78] M. Slaney and R. F. Lyon. On the importance of time - a temporal representation of sound. Visual Representations of Speech Signals, pages 95-116, 1993.
[79] A. Stolcke. SRILM-an Extensible Language Modeling Toolkit. In Proceedings of
International Conference on Spoken Language Processing, September 2002.
[80] S. Sundaram and S. Narayanan. Audio Retrieval by Latent Perceptual Indexing.
In Proceedings of ICASSP, Las Vegas, USA, 2008.
[81] S. Sundaram and S. Narayanan. Classification of sound clips by two schemes: Using Onomatopoeia and Semantic labels. In Proceedings of ICME, Hannover, Germany, June 2008.
[82] E. S. Sussman. Integration and segregation in auditory scene analysis. The Journal
of the Acoustical Society of America, 117:1285, 2005.
[83] J. Theeuwes. Perceptual selectivity for color and form. Perception & Psychophysics, 51(6):599-606, 1992.
[84] D. Walther and C. Koch. Modeling attention to salient proto-objects. Neural Networks, 19:1395-1407, 2006.
[85] E. Weichselgartner and G. Sperling. Dynamics of automatic and controlled visual attention. Science, 238(4828):778-780, 1987.
[86] C. W. Wightman and M. Ostendorf. Automatic labeling of prosodic patterns. IEEE Transactions on Speech and Audio Processing, 2(4):469-481, 1994.
[87] I. Winkler, W. A. Teder-Sälejärvi, J. Horvath, R. Näätänen, and E. Sussman. Human auditory cortex tracks task-irrelevant sound sources. NeuroReport, 14(16):2053-2056, 2003.
[88] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-Based Classification, Search, and Retrieval of Audio. IEEE Multimedia, 3(3):27-36, Fall 1996.
[89] J. M. Wolfe. Guided Search 2.0: A revised model of guided search. Psychonomic Bulletin & Review, 1(2):202-238, 1994.
[90] S. N. Wrigley and G. J. Brown. A computational model of auditory selective attention. IEEE Transactions on Neural Networks, 15(5):1151-1163, 2004.
[91] Z. Xiong, R. Radhakrishnan, A. Divakaran, and T. S. Huang. Audio Events Detection Based Highlights Extraction from Baseball, Golf and Soccer Games in a Unified Framework. In Proceedings of ICASSP, Hong Kong, April 2003.
[92] X. Yang, K. Wang, and S. Shamma. Auditory representations of acoustic signals. IEEE Transactions on Information Theory, 38(2):824-839, 1992.
[93] A. Yarbus. Eye movements during perception of complex objects. Eye Movements and Vision, pages 171-196, 1967.
[94] S. Young. Hidden markov model toolkit (HTK). http://htk.eng.cam.ac.uk/,
1989.
Abstract
Humans can precisely process and interpret complex scenes in real time despite the tremendous amount of stimuli impinging on the senses and the limited resources of the nervous system. One of the key enablers of this capability is a neural mechanism called "attention". The focus of this dissertation is to develop computational algorithms that emulate human auditory attention and to demonstrate their effectiveness in spoken language and audio processing applications.