NOISE-ROBUST SPECTRO-TEMPORAL ACOUSTIC SIGNATURE
RECOGNITION USING NONLINEAR HEBBIAN LEARNING
by
Bing Lu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOMEDICAL ENGINEERING)
August 2009
Copyright 2009 Bing Lu
Dedication
To My Parents
Acknowledgments
My profound gratitude goes to Professor Theodore W. Berger, my mentor and research advisor, for his guidance, support, and encouragement. I salute Professor Berger's genuine passion for science, discovery and understanding, superior mentoring skills, and unparalleled help. My profound gratitude also goes to Assistant Professor Alireza Dibazar, my co-advisor, who was always open to unconventional ideas and generous with instruction and support. In the last several years, Professor Dibazar continuously gave me detailed and very helpful feedback.
My profound gratitude also goes to Professor Michel Baudry, Professor David Z. D'Argenio, and Professor Michael C. K. Khoo, who patiently discussed my research work, and gave me very positive and instructive guidance. The research for this thesis was done in the Department of Biomedical Engineering, University of Southern California, Los Angeles, California, USA. I am grateful to work in such a stimulating environment, rooted in the exciting research context of Biomedical Engineering. I would like to thank Professor Norberto M. Grzywacz, Professor Bartlett W. Mel, and Professor Stanley M. Yamashiro for their instructive and luminous lectures.
The Laboratory for Neural Dynamics has been host to many challenging projects, e.g., the Artificial Neural Learning project, the Real-time Vehicle Detection project, the Real-time Gunshot project, and the Smart Sensor Design project. I owe a great deal to the former members of this lab. In particular, Dr. Sageev George very kindly helped me prepare the databases and do the evaluation of the experiments. Dr. Dong Song patiently gave me very helpful research suggestions and often discussed my research work with me. Dr. Walter Yamada also gave me helpful feedback and good suggestions.
I am also grateful to members of the Safety Dynamics company. Portions of
the applications were developed in close cooperation with them.
Finally, I wish to thank my dear family, my parents and my husband, for their
continuous support and encouragement. Their untiring and warm care allowed me
to accomplish this long journey towards a PhD.
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Problem Statement: Why Noise-Robust Pattern Recognition and What Challenges
1.2 History and Limitations of Other Existing Algorithms
1.3 Objectives of the Thesis
1.4 Biological Motivation and Overview of Proposed Algorithms
1.5 Organization of the Thesis
Chapter 2: Gammatone Filterbanks
2.1 Introduction
2.2 Cochlear Impulse Response and Gammatone Filter
2.3 Gammatone vs. Mel Filterbanks
2.4 Decision Making: Radial Basis Function Neural Network
2.5 Real-World Acoustic Signal Recognition
2.6 Recognition Results of GTF vs. MFCC
Chapter 3: Spectro-Temporal Dynamics
3.1 Introduction
3.2 Incorporation of Temporal Dynamics
3.3 Recognition Results of STR vs. Spectral Analysis
Chapter 4: Unsupervised Learning
4.1 Biological Motivation
4.2 Linear Hebbian Learning
4.3 Blind Source Separation Using LHL
Chapter 5: Unsupervised Nonlinear Hebbian Learning
5.1 Nonlinear Hebbian Learning for Pattern Recognition
5.2 Nonlinear Hebbian Learning
5.3 Blind Source Separation Using NHL
5.4 Nonlinear Activation Function
5.4.1 Derivation
5.4.2 Neurobiological Motivation
5.4.3 Interpretation for Slope of Activation Function
5.5 Discussion of the Present NHL vs. Other ICAs
Chapter 6: Project I: Recognition Results
6.1 Real-World Acoustic Signal
6.2 Noise-Robust Acoustic Signal Recognition Results
6.2.1 Noise Robustness Analysis
Chapter 7: Project II: Identification Results
Chapter 8: Noise-Robust Real-Time Field Testing
8.1 Objectives
8.2 Overview of Hardware and Software
8.3 Field Testing
Chapter 9: Conclusion and Future Work
9.1 Conclusion
9.2 Future Work
9.2.1 The Need
9.2.2 Proposal for Identification Project Under Severe Environments
9.2.3 Other Future Directions
References
List of Tables

7.1 Identification results when vehicle mixed with AWGN
7.2 Identification results when vehicle mixed with human vowel utterances
7.3 Identification results when vehicle mixed with bird chirps
8.1 Real-time field testing results
List of Figures

1.1 The incoming signals may be mixed from several unknown sources. The mixing property is also unknown due to various unknown environments. A system can be designed to blindly separate independent sources out of observed signals.
1.2 The detailed mixing procedure by an unknown mixing matrix.
1.3 The detailed separating procedure with some unsupervised learning. The extracted components of y can be equal to the unknown sources s if the learning can ideally extract the independent sources.
1.4 Description of the proposed system.
2.1 An overview of the MFCC recognizer.
2.2 An overview of the GTF recognizer.
2.3 (a) An example of mel filterbanks. (b) An example of gammatone filterbanks.
2.4 (a) An example of vehicle coming, passing-by, and leaving waveform. (b) When vehicle sound from (a) is mixed with additive white Gaussian noise (AWGN) at signal-to-noise ratio (SNR) 17 dB.
2.5 Each line denotes the feature difference = features trained with noisy data minus features trained with clean data. The smaller the difference, the less the spectral analysis is influenced by noises. The upper three lines are feature differences when using mel filterbanks for various SNRs. The lower three lines are feature differences when using gammatone filterbanks. Gammatone filterbanks are less influenced by noises, and thus are more noise robust than mel filterbanks.
2.6 Radial basis function neural network is used for decision making.
2.7 An illustration of the vehicle recognition environment. Vehicles loaded with dangerous weapons may approach protected or restricted areas. The microphone and recognizer should be able to detect the approaching vehicles and immediately release alerts. Around the protected areas, many uncontrolled background sounds may exist, such as human voice, bird chirp, and wind.
2.8 Decision tree. Recognize any urban vehicle in city environments while rejecting background noises.
2.9 When vehicle data is mixed with AWGN, gammatone filterbanks can achieve slightly better performance than MFCC.
2.10 When vehicle data is mixed with human vowel sound, gammatone filterbanks can achieve slightly better performance than MFCC.
2.11 When vehicle data is mixed with bird chirp sound, gammatone filterbanks can achieve slightly better performance than MFCC.
3.1 An overview of the GTF+STR recognizer.
3.2 An example of vehicle waveform and its spectrogram. Waveform amplitude is normalized based on the recording microphone and pre-amplifier used. The frame size is 20 ms, and the overlap is 10 ms. The narrow black rectangle in the lower spectrogram figure represents one spectral feature vector, which is the input to the radial basis function neural network in Sect. 2.4. At each time only one spectral feature vector is used for network operation.
3.3 An example of vehicle waveform and its spectrogram. In the lower spectrum figure, a broader rectangle is marked, which includes both the current and past several feature vectors. At each time multiple feature vectors are used as an input to a RBF-NN. The duration of the broader rectangle is generally chosen on the order of hundreds of milliseconds.
3.4 The spectro-temporal dynamic representation. In the frequency domain, Q filtered spectral elements are listed. In the temporal domain, the current and M−1 past feature vectors are assembled.
3.5 RBF-NN with STR as an input. Interleave the spectro-temporal representation and use it as an input to a RBF-NN. Now both spectral and temporal information are correlated in a network operation.
3.6 When vehicle data is mixed with AWGN, STR can achieve better performance than single spectral analysis can.
3.7 When vehicle data is mixed with human vowel sound, STR can achieve slightly better performance than single spectral analysis can.
3.8 When vehicle data is mixed with bird chirp sound, STR can achieve better performance than single spectral analysis can.
4.1 (a) When the pre-synaptic input signal and post-synaptic output signal are positively correlated, the synaptic weight connecting pre-synapse and post-synapse is strengthened. (b) When the pre-synaptic input signal and post-synaptic output signal are negatively correlated, the synaptic weight connecting pre-synapse and post-synapse is weakened.
4.2 Linear Hebbian learning flowchart. The input is from the spectro-temporal dynamic representation. Then STR is interleaved into the network operation. This is unsupervised learning as the exact outputs are unknown. During the LHL procedure, patterns and synaptic weights are iteratively learned.
4.3 The incoming signals may be mixed from several unknown sources. The mixing property is also unknown due to the various unknown environments. A system can be designed to blindly separate independent sources out of observed signals.
4.4 The detailed mixing procedure by an unknown mixing matrix.
4.5 The detailed separating procedure with some unsupervised learning. The extracted components of y can be equal to the unknown sources s if the learning can ideally extract independent sources.
4.6 The only given information to the system is the observation data x in the x_1-x_2 space.
4.7 (a) The observation vector has observation signal x_1 along the time. (b) The histogram of signal x_1. (c) The observation vector has observation signal x_2 along the time. (d) The histogram of signal x_2.
4.8 The observation data is a mixture of two components. (a)(b) One is uniform-distributed within the range [−2,2]. (c)(d) The other is Gaussian-distributed with mean 0 and variance 1. These two components are composed with matrix A. The observed data has already lost the source distributions.
4.9 (a) The only given information is the observation data in the x_1-x_2 space. The system does NOT assume any knowledge of the components NOR their composing property. (b) In order to find the representative feature space, LHL optimizes the variance of data, and finds the components approximately along two diagonal lines. The observation data is projected to the LHL-learned feature space.
4.10 (a)(b) The histogram of LHL-learned y^L_1 is not equal to the distribution of either real component. (c)(d) The histogram of LHL-learned y^L_2 is not equal to the distribution of either real component. Both LHL-learned components are not the real components.
5.1 The flowchart when independent component extraction is used for pattern recognition. The first three blocks are related to blind source separation: the incoming signals may be mixed from several unknown sources; the mixing property is also unknown due to the various unknown environments. Then the recognizer can extract unknown independent sources, and compare them with trained patterns of interest.
5.2 (a) The vehicle waveform in a normal circumstance. (b) The AWGN waveform. (c) The vehicle waveform under a severely noisy circumstance. Vehicle data is corrupted by AWGN at SNR = 0 dB.
5.3 An overview of the proposed recognizer.
5.4 An illustration of nonlinear neural learning. Synaptic weights connecting input and output neurons are adaptively learned. The linear output neurons denote extracted representative independent components. And the nonlinear output neurons with an activation function play a crucial role in the statistical optimization of the NHL procedure.
5.5 Signal flow for synaptic weight adaptation: x represents excitatory input, y_1, y_2, ..., y_{l−1} contribute to inhibitory effects, the linear y_l provides a stabilizing effect, and the nonlinear part z_l is critical for higher-order statistical characteristics of y_l. For simplicity the spectral and temporal indices q, m are ignored.
5.6 (a) The only given information is the observation data in the x_1-x_2 space. The system does NOT assume any knowledge of the components NOR their composing property. (b) By optimizing higher-order statistics of data, NHL finds the two components, y^N_1, y^N_2. The NHL result is obtained by projecting the observation data using the synaptic weight matrix W^N.
5.7 (a)(b) The histogram of y^N_1 is a uniform distribution within [−2,2], corresponding to s_1. (c)(d) The histogram of y^N_2 is a Gaussian distribution with mean 0 and variance 1, corresponding to s_2.
5.8 Multiple vesicle pool framework. Immediately releasable pool: a fraction of apparently docked vesicles at active zones. Readily releasable pool: available for immediate replenishment of release sites that are vacated by exocytosis. Reserve pool: more distant vesicles that are unable to respond rapidly. Released pool: recycling ensures a constant supply of vesicles.
5.9 Mutual information is maximized when the high density part of the input is aligned with the slope of the activation function. And the distance between outputs is large after nonlinear mapping.
5.10 (a) A steep slope can favor learning convergence. The signal of interest x and noise n are farther separated. (b) On the other hand, the slope cannot grow too fast, as then the rate of outlier noise rejection is low. The signal of interest x and noise n are not easy to separate. The positions of x, n and the shape of the data are the same in (a) and (b).
6.1 An illustration of the vehicle recognition environment. Vehicles loaded with dangerous weapons may approach protected or restricted areas. The microphone and recognizer should be able to detect the approaching vehicles and immediately release alerts. Around the protected areas, many uncontrolled background sounds may exist, such as human voice, bird chirp, and wind.
6.2 Synaptic weight change along learning iterations.
6.3 Decision tree for vehicle recognition.
6.4 When vehicle data is mixed with AWGN.
6.5 When vehicle data is mixed with human vowel noise.
6.6 When vehicle data is mixed with bird chirp noise.
7.1 Decision tree for vehicle identification.
7.2 For vehicle type identification, one representative feature space and one projection synaptic weight matrix are learned for each type of vehicle. The decision is made based on maximum likelihood.
8.1 The practical system includes several hardware components and real-time software.
8.2 A circumstance of real-time vehicle detection. (a) When a vehicle is approaching. (b) When the vehicle is passing by the microphone. The designed recognizer can provide alerts immediately for both cases so that the command center can have an early alarm even in case (a).
8.3 (a) An example of vehicle waveform (including coming, passing-by, and leaving events). The x-axis in figures (b-d) is aligned with figure (a). The y-axis is a 0-or-1 decision: when it is 0, no vehicle is approaching; when it is 1, there is a vehicle approaching. (b) The recognition result by using MFCC. (c) The recognition result by using GTF+STR+LHL. (d) The recognition result by using GTF+STR+NHL. The recognition duration is about 7 seconds. For a vehicle moving at a speed of about 20 mph, the recognition distance is about ±78 = 20∗1600∗6∗3/3600/2 feet around the microphone center. Within this recognition distance, vehicle coming, passing-by, and leaving events are detected. Comparing the results using MFCC, GTF+STR+LHL, or GTF+STR+NHL, we can see that the proposed GTF+STR+NHL provides a better recognition result, since its recognition duration is broader than the others.
8.4 (a) An example of noisy vehicle waveform. In this case, the waveform is a mixture of vehicle sound and unknown AWGN noise. The SNR is 0 dB. (b) The recognition results by using MFCC. (c) The recognition results by using GTF+STR+LHL. (d) The recognition results by using GTF+STR+NHL. We can see that the first two systems could not recognize the approaching vehicle and have high false positive rates. The proposed GTF+STR+NHL can still detect the approaching vehicle, and provides almost the same recognition result as in the normal circumstance.
8.5 In the middle part, the camera has been slewed to the detected area and has taken a picture of the vehicle. In the marked bar of 'vehicle icon' in the lower part, there are continuous dots representing the alerts sent by the on-site vehicle detector.
9.1 The x-axis is the pattern dimension. The y-axis denotes the pattern value. At each pattern dimension, the mean is plotted, with an error bar to denote the variance. We can notice that both patterns are close to each other. The Earth Mover's distance between these two patterns is 93, which is much smaller than the pattern distance 1853 between generalized vehicle and human voice, and much smaller than the threshold 200 of generalized vehicle vs. non-vehicle recognition.
9.2 Within one class, the goal is infomax, maximizing input-output information transmission, which can be realized by the present NHL in the thesis. The independent component extraction of each class is one parallel structure of the figure, which is the same as in Fig. 5.4. Across classes, the goal is to minimize the correlative information, or mutual information, between class A and class B.
Abstract
How can we recognize the acoustic signal of interest in open environments where many other acoustic noises exist? Efficient auditory signal processing and intelligent neural learning contribute to this remarkable ability. We propose a nonlinear Hebbian learning (NHL), with several novelties, to implement noise-robust acoustic signal recognition. The proposed learning rule processes both time and frequency features of the input. The spectral analysis is realized by using auditory gammatone filterbanks. To address temporal dynamics, the network input incorporates not only the current gammatone-filtered feature vector, but also multiple past feature vectors. We refer to this established high-dimensional input as the spectro-temporal representation (STR). Given STR inputs, the exact acoustic signatures of signals of interest and the composing property among signatures are generally unknown. The nonlinear Hebbian learning rule is then employed to extract representative independent signatures, and to learn the weight vectors which transform data into the signature space. During learning, NHL also reduces feature dimensionality. Compared with linear Hebbian learning (LHL), which explores the second-order moment of data, the applied NHL involves higher-order statistics of data. Therefore, NHL can capture representative components that are more statistically independent than LHL can. Besides, the nonlinear activation function of NHL can be chosen to refer to the implicit distribution of many acoustic sounds, thus making the learning optimized in the sense of mutual information. The advantages of the proposed NHL over other ICA algorithms (which are often used for blind source separation) are also discussed, in terms of the criterion and optimization function.
Simulation results show that the whole proposed system can more accurately recognize signals of interest than its counterparts in severely noisy circumstances. The proposed system has been used in real-world noise-independent projects. One project is detecting moving vehicles. Noise-corrupted vehicle sound is recognized while background sounds are rejected. At low SNR = 0 dB, when vehicle data is contaminated by AWGN, human voice, and bird chirp, the proposed system dramatically decreases the error rate over the normally used acoustic feature extraction method MFCC by 16%, 25%, and 68%, respectively; and by 15.3%, 20%, and 2% over LHL (another normally used acoustic feature extraction method). Another applicable project is vehicle type identification. The proposed system achieves better performance than LHL, e.g., a 40% improvement when a gasoline heavy wheeled car is contaminated with AWGN at SNR = 5 dB.
Chapter 1
Introduction
1.1 Problem Statement: Why Noise-Robust Pattern Recognition and What Challenges
The word noise means unwanted sound or noise pollution. Noise can be considered
as meaningless data, which blocks or blurs other signals of interest, and degrades
the useful information that is transmitted in signals of interest.
Robustness in acoustic signal recognition refers to the need to maintain good
recognition accuracy even when the quality of the signals of interest is degraded,
or when the acoustical conditions in the training and testing environments differ.
In the present study, acoustical degradations produced by environmental noises
are mainly analyzed.
Normally used acoustic signal recognition methods, such as mel frequency cep-
stral computation (MFCC) plus Gaussian mixture model (GMM) [RS78, HAH01],
or MFCC plus principal component analysis (PCA) [WSK99, Mun04], work rea-
sonably well in quiet backgrounds. But they work poorly under noisy conditions.
For example, when using an acoustic signal recognition system to recognize what
one says over the phone, the accuracy of the system may be acceptable if one calls
from the phone in a quiet office, yet its performance can be unacceptable if one
tries to use the cellular phone in a shopping mall.
As acoustic signal recognition technologies are being transferred to real-world applications, the need for greater robustness in recognition technologies is becoming increasingly apparent. Nevertheless, the performance of even the best state-of-the-art system tends to deteriorate when signals of interest are severely contaminated by noises. It is necessary to design algorithms to improve the robustness of acoustic signal recognition systems, so that the system can work well in circumstances where noise levels are high. And generally, the noises under testing conditions may not be the ones present when the system was trained.
We study the problem when signals of interest are mixed with unknown noise sources. In real-world applications, there can be many noise sources, and these noises can be highly time-varying. Moreover, the mixing property of signals of interest and noises is unknown to the system. The CHALLENGES of unknown noise sources and unknown mixing property widely exist in real-world applications. When we want to use a microphone to record some signals of interest, other uncontrolled sounds may also be emitted by the surrounding background; the incoming sounds are then mixtures of signals of interest and noises. The only information known to the recognizer is the observed noisy data; the system knows neither the noise sources nor the mixing property. If the observed noisy data is directly used for pattern recognition, it may mistakenly be classified as a noise pattern, instead of the pattern of signals of interest. This raises the question of how to implement noise-robust pattern recognition.
1.2 History and Limitations of Other Existing Algorithms
With the difficulties of noise-robust acoustic signal recognition, people have made a lot of progress in algorithms so that the recognizer's accuracy does not degrade very much. Instead of cleaning the acoustic signal itself, these algorithms typically work on modeling or estimating noise sources.
A number of signal processing schemes have been developed for noise-robust acoustic signal recognition, such as robust speech recognition [RDP86, EW93, Gir98, JHH99, KLK99, YSFC00, WBWR03, AMB+04, KSAC06, XK06, CB07, SK07], and robust speaker verification [HI95, CPLTB96, LC00, Cho01, YV02, RH02, DH03, IAF04, YHW05, MHG06, SW06, Lun06, DN07, WYWL07, MHGR07, SSW07, KJK08, RMK08]. [EW93] brought up an estimation algorithm for noise-robust speech recognition, the minimum mean log spectral distance (MMLSD). The estimate was matched to the recognizer by seeking to minimize the average distortion as measured by a Euclidean distance between filterbank log-energy vectors, approximating the weighted-cepstral distance used by the recognizer. The estimate was computed using a clean speech spectral probability distribution, estimated from a database, and a stationary auto-regressive moving-average (ARMA) model for the noise. When trained on clean speech and tested with additive white noise at 10 dB SNR, the recognition accuracy with the MMLSD algorithm was comparable to that achieved by training the recognizer at the same constant 10 dB SNR. Also, [CB07] investigated a technique consisting of mean subtraction, variance normalization, and time sequence filtering. It applied ARMA filtering directly in the cepstral domain.
Then, [JHH99] studied a category of robust speech recognition problems in which mismatches exist between training and testing conditions, and no accurate knowledge of the mismatch mechanism is available. The only available information is the test data along with a set of pretrained Gaussian mixture continuous density hidden Markov models (CDHMMs). They investigated the problem from the viewpoint of Bayesian prediction. A simple prior distribution, namely a constrained uniform distribution, was adopted to characterize the uncertainty of the mean vectors of the CDHMMs. Two methods, namely a model compensation technique based on Bayesian predictive density and a robust decision strategy called Viterbi Bayesian predictive classification, were studied.
Next, [YSFC00] presented a model-based noise compensation algorithm for robust speech recognition in nonstationary noisy environments. The effect of noise was split into a stationary part, compensated by parallel model combination, and a time-varying residual. The evolution of residual noise parameters was represented by a set of state space models. The state space models were updated by Kalman prediction and the sequential maximum likelihood algorithm. Predictions of residual noise parameters from different mixtures were fused, and the fused noise parameters were used to modify the linearized likelihood score of each mixture. [KSAC06] also used Kalman filters as a front-end to the Aurora-2 speech recognition task. Kalman-filter based speech estimation algorithms assume autoregressive (AR) models for the speech and the noise signals.
Besides, the well-used acoustic signal feature extraction method, MFCC plus GMM, tried to find spectral subbands that noises may dominate and modeled noises in these subbands [HAH01]. Another approach, MFCC plus PCA, assumes noise is Gaussian distributed with low power levels [WSK99, Mun04].
In all, the above algorithms tried to model every noise source or to track noise variation. However, many noise sources exist in real-world environments, and generally they are highly time-varying or unknown to the acoustic signal recognizer. In addition, the mixing property between signals of interest and noises is also unknown to the recognizer. It is not quite feasible to model every noise source and its variation, or to track the mixing property. There should be some other potential approaches for noise-robust acoustic signal recognition.
1.3 Objectives of the Thesis
The present problem is how to perform noise-robust acoustic signal recognition in
real-world environments. Since there exist many unknown noise sources, and their
mixing property with signals of interest is generally unknown, it is not feasible to
model or track every noise source as other algorithms did. It is suggested that the
success in the following key problem area is likely to accelerate the development
and deployment of practical acoustic signal recognition applications.
In the present thesis, noise-robustness means that, regardless of noise type or its variation, and regardless of mixing property, we first perform blind source separation to separate signals of interest from noises, and then implement pattern recognition.
Blind source separation, also known as blind signal separation, is the separation of a set of signals from a set of mixtures, without the aid of information (or with very little information) about the source signals or the mixing process [Com94, BS95, CL96, PG97, PFP99, PJLL99, DL00, LBS00, Mon00, BH00, HO00, LL02, AIO+03, PSB03, YR04, CACA04, CCPL05, FG06, DGSM07, SZ07].
Blind signal separation relies on the assumption that the source signals are non-redundant; representatively, the signals may be mutually statistically independent. Blind signal separation thus separates a set of mixtures into a set of other signals, such that the regularity (statistical independence) of each resulting signal is maximized, and the regularity between the signals is minimized.
Figure 1.1: The incoming signals may be mixed from several unknown sources. The mixing property is also unknown due to various unknown environments. A system can be designed to blindly separate independent sources out of observed signals. (Diagram: unknown source vector s → mixer, x = As, in unknown environments → observation vector x → demixer, y = Wx, blind source separation → output vector y.)
As in Fig. 1.1, two unknown sources s_1, s_2 may be mixed by unknown environments with mixing process A. The only data observed by the recognizer is x. Blind source separation technologies can be designed to extract two components y_1, y_2 out of the observation data x, so that the extracted components equal the two unknown sources as closely as possible. The detailed mixing procedure by an unknown mixing matrix is illustrated in Fig. 1.2. And the detailed separating procedure with some unsupervised learning is shown in Fig. 1.3.
We used the concept of independent feature extraction in the task of pattern recognition. The given observation data is spectral features or temporal features. The output is extracted independent spectral features or independent temporal features. This is one usage of independent component analysis (ICA) technologies, namely independent feature extraction. The usage of ICA technologies in blind source separation views the given observation data from multiple microphones, i.e., spatial diversity, while in the present study, we consider the given observations based on frequency and/or temporal diversity. In the methods [WSK99] and [Mun04], spectral filters were used to realize frequency analysis and spectral features were obtained. Methods [WSK99] and [Mun04] then used principal component analysis (PCA) to extract principal (maximum-variance) spectral features from the filtered spectral features. In Fig. 1.3, the input data is the filtered spectral features, and the output is the extracted principal spectral features. The extracted outputs in the PCA algorithm are uncorrelated, and they are good representative features for the following discrimination or pattern recognition task. But uncorrelatedness does not imply independence. In order to extract better representative features, we would use another method to extract independent spectral (or temporal) features from filtered spectral (or temporal) features. It is expected that extracting independent features with an appropriate method can better describe the data in the task of pattern recognition.

Figure 1.2: The detailed mixing procedure by an unknown mixing matrix. (Diagram: unknown sources s_1, s_2 are combined through mixing coefficients a_11, a_12, a_21, a_22 into the observations x_1 = a_11 s_1 + a_12 s_2 and x_2 = a_21 s_1 + a_22 s_2.)

Figure 1.3: The detailed separating procedure with some unsupervised learning. The extracted components of y can be equal to the unknown sources s if the learning can ideally extract the independent sources. (Diagram: observations x_1, x_2 are combined through learned weights w_11, w_12, w_21, w_22 into the outputs y_1 = w_11 x_1 + w_12 x_2 and y_2 = w_21 x_1 + w_22 x_2.)
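To make Figs. 1.2 and 1.3 concrete, here is a minimal sketch in Python/NumPy of the mixing and demixing steps. The mixing matrix values are arbitrary illustrative choices, and the demixing matrix W is simply set to the inverse of A to show the ideal result; in the thesis, W is instead learned from the observations alone (Chaps. 4 and 5).

import numpy as np

rng = np.random.default_rng(0)

# Two statistically independent sources, matching the example used in
# Chap. 4: s_1 uniform on [-2, 2], s_2 Gaussian with mean 0, variance 1.
n = 10000
s = np.vstack([rng.uniform(-2.0, 2.0, n), rng.normal(0.0, 1.0, n)])

# Unknown mixing matrix A (hypothetical values); the recognizer never
# sees A, only the observations x = A s.
A = np.array([[0.8, 0.4],
              [0.3, 0.9]])
x = A @ s

# Demixing y = W x. Here W = A^{-1} only to show the ideal outcome;
# blind source separation must find W without knowing A.
W = np.linalg.inv(A)
y = W @ x
print(np.allclose(y, s))  # True: the sources are recovered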
1.4 Biological Motivation and Overview of Proposed Algorithms
How can we recognize the acoustic signal of interest in open environments where many other noisy sounds exist? It has been hypothesized that the brain may form a useful representation of its environment so that useful information is strengthened while unrelated noisy information is weakened. Efficient auditory signal processing and intelligent neural learning may contribute to this remarkable ability of noise-robust acoustic signal recognition.
The auditory feature extraction technology normally used in acoustic signal recognition, such as speech or speaker recognition, is mel frequency cepstral computation (MFCC) [RS78, HAH01, LYB6b]. It is based on the mel-frequency scale, and results in a good representation of the sound spectrum. In order to capture more spectral nonlinearities, people have turned to other filters, such as the gammatone filter. The gammatone filter more closely characterizes the physiological response of primary auditory fibers in cat and human ears [Joh72]. It was shown that the gammatone filter is somewhat more stable and noise-robust than the mel filter [CAS05]. In the present study, we use gammatone filterbanks for spectral analysis.
After spectral analysis, consecutive feature vectors are often assumed to be independent inputs for the following pattern recognition [RS78, WSK99, Mun04, TW05, CAS05]. The duration of each feature vector generally is on the order of tens of milliseconds. However, these spectral feature vectors may be sensitive to
background noise or channel distortion, and actually they can be correlated along time [Cam97, Rey02]. To be specific, the output response depends not only on the present value of the input feature vector, but also on past values. Time constitutes an essential ingredient of the learning process. Incorporation of time into the network operation enables a neural network to address non-stationary processes. How do we build time into the operation of a neural network? For a neural network to be dynamic, a structure of time delays is used [LH88]. Each time-delayed input has a different synaptic weight. This incorporation of time is motivated by physiological studies of the primary auditory cortex [SVK95, SDS98, KDSS00]. The authors have pointed out that neurons in the brain process both time and frequency components of signals, and that the temporal receptive field extends up to the order of hundreds of milliseconds. In the present study, we incorporate time-delayed feature vectors into the operation of a neural network.
Most importantly, Hebbian learning, a basic neural learning rule found in the brain, has motivated us to implement the noise-robust acoustic signature recognition. There is strong physiological evidence for Hebbian learning in the area of the brain called the hippocampus. The hippocampus plays an important role in certain aspects of learning or memory. This physiological evidence makes Hebbian learning all the more appealing. Hebb's postulate of learning is the oldest and most famous of all learning rules. Quoting from [Heb49, p. 62], a statement made in a neurobiological context: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency as one of the cells firing B, is increased." The experimental results were reported in [BL73, GWAH87, AS90, BAD+91, BBB+91, XBB92] and called long-term potentiation of synaptic strength. Next, it was hypothesized that the direction and magnitude of change in synaptic strength depend on the degree to which pre- and postsynaptic activities are correlated [BCM82, SS89], which assumes negative learning as well (together with Hebb's postulate, called bidirectional learning). The experimental results were reported in [XBB92, MM92, TBB94] and called long-term depression of synaptic strength. According to the mathematical abstraction [Pal82, Oja82, OK88], a Hebbian synapse increases its strength with positively correlated presynaptic and postsynaptic signals, and decreases its strength when these signals are negatively correlated. It has further been indicated that changes in the synaptic connections between neurons contribute to memory storage, and to the activity-dependent development of a neural network. To be specific, neurons employing Hebbian learning in a network can learn to code a set of patterns in such a way that important components are strengthened while unimportant noisy ones are weakened [Oja82, OK88, San89a, San89b, San90, OOW91, Oja95, Oja97, SH94, SH95, HO98, PW01]. In the present study, a nonlinear Hebbian learning is analyzed, and its advantage over the generalized linear Hebbian learning is stated.
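To make the mathematical abstraction above concrete, the following is a minimal sketch (Python/NumPy, illustrative toy data) of linear Hebbian learning with Oja-style weight normalization [Oja82]: the weight grows when input and output are positively correlated, shrinks when they are negatively correlated, and converges toward the dominant principal direction of the data.

import numpy as np

rng = np.random.default_rng(1)

# Toy zero-mean data whose dominant principal direction is roughly [0.9, 0.5].
X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])

w = rng.normal(size=2)        # synaptic weight vector
eta = 1e-3                    # learning rate

for _ in range(3):            # a few passes over the data
    for x in X:
        y = w @ x             # post-synaptic output
        # Hebbian term y*x strengthens positively correlated weights;
        # the -y^2*w term (Oja's normalization) keeps ||w|| bounded.
        w += eta * (y * x - y * y * w)

print(w / np.linalg.norm(w))  # ~ plus or minus [0.9, 0.5], the principal direction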
The whole recognizer is illustrated in Fig. 1.4, which includes auditory signal
processing and nonlinear neural learning in a neural network. A Hamming win-
dow is used to divide signals into frames with duration on the order of tens of
milliseconds.
A series of bandpass gammatone filters are applied to process each windowed
frame and to extract spectral features. Via gammatone filter processing, a spec-
tral feature vector is computed for each frame. The incoming acoustic waveforms
are generally non-stationary through consecutive feature vectors. The response
of a recognizer at the present time depends not only on the current feature vec-
tor, but also on the past ones. To recognize such a temporally dynamic pattern,
processing a sequence of feature vectors is necessary. A spectro-temporal representation (STR) is thus established as an input for the following process. This spectro-temporal input includes both the present spectral feature vector and the past ones.

Figure 1.4: Description of the proposed system.
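As a sketch of how such an STR input could be assembled (Python/NumPy; the gammatone filtering itself is abstracted away, and the history depth M = 10 is an illustrative choice spanning on the order of a hundred milliseconds at a 10 ms frame hop):

import numpy as np

def build_str(features, M):
    """Stack each spectral feature vector with its M-1 predecessors.

    features: (T, Q) array, one Q-dimensional gammatone feature vector
              per frame (e.g., Q = 30 filterbanks as in Sect. 2.5).
    Returns:  (T - M + 1, M * Q) spectro-temporal representation.
    """
    T, Q = features.shape
    return np.stack([features[t - M + 1 : t + 1].ravel()
                     for t in range(M - 1, T)])

# Example: 100 frames of 30-dimensional features.
feats = np.random.randn(100, 30)
str_input = build_str(feats, M=10)
print(str_input.shape)  # (91, 300)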
Next, a nonlinear Hebbian learning (NHL) is described to extract unknown acoustic signatures of signals of interest. Concurrent to this process, synaptic weights connecting input and output neurons are adaptively learned. Then, at the testing stage, both synaptic weights and signatures are used as trained patterns in an associative network, the radial basis function neural network (RBF-NN). The decision tree is defined when the proposed system is applied to specific real-world tasks, such as in Fig. 2.8 and in Fig. 7.1.
1.5 Organization of the Thesis
The rest of the thesis is organized as follows. In Chap. 2, auditory gammatone
filterbanks are applied to characterize spectral properties in Sect. 2.2. The noise
robustness of gammatone filterbanks and mel filterbanks is compared in Sect. 2.3.
And in Sect. 2.6 the performance of gammatone and mel filterbanks is compared
when real-world data is applied.
In Chap. 3, a spectro-temporal dynamic representation is established via col-
lecting multiple spectral feature vectors along time in Sect. 3.2. In this approach,
temporal dynamics is incorporated into the network operation. Hence, the perfor-
mance when using real-world data for noise-robust acoustic signal recognition is
further improved in Sect. 3.3.
Next, in Chap. 4, the modification from the blind source separation technology
into the noise-robust pattern recognition technology is described in Sect. 4.1.
The normally used linear Hebbian learning (or equivalently principal component
analysis) is studied in Sect. 4.2. A good example is then analyzed to show the
limitation of linear Hebbian learning in Sect. 4.3.
Then, the most important part of the thesis, nonlinear Hebbian learning, is proposed in Chap. 5. Why and how to use nonlinear Hebbian learning for the problem of noise-robust pattern recognition is described in Sect. 5.1. The derivation of nonlinear Hebbian learning is studied in Sect. 5.2. Nonlinear Hebbian learning plays a critical role in reducing the dimension of spectro-temporal features and in learning the unknown acoustic signatures of incoming sounds. Concurrent to this process, synaptic weight vectors connecting input and output neurons are adaptively learned, which project spectro-temporal features into a representative space of signals of interest. The same example on which linear Hebbian learning fails is presented in Sect. 5.3, where nonlinear Hebbian learning is applied and achieves correct results. And a description of the nonlinear activation function is given in Sect. 5.4.
With the acoustic signatures and synaptic weights learned by the algorithms of the previous several chapters, simulation results are shown in Chap. 6. In this chapter, urban vehicle vs. non-vehicle in city environments is recognized. Specifically, some normal background sounds, such as human voice, bird chirp, and wind, are also analyzed. The noisy vehicle data is generated by mixing clean vehicle sound with those noises. The proposed recognizer can achieve better performance than other existing works, even in severely noisy circumstances. The vehicle recognition is just an example to show how the proposed system can perform well in real-world applications. And the noise sources are also just noise examples. These noise source properties and their mixing process are always unknown to the recognizer.
The identification simulation is implemented in Chap. 7. In the present thesis, the identification is performed among four types of vehicles: gasoline light wheeled car, gasoline heavy wheeled car, diesel truck, and motorcycle. Again, noisy vehicle data is generated by mixing with unknown noises. Simulation results show that the proposed system significantly outperforms its counterpart, linear Hebbian learning, and is especially more robust against noise.
The real-time practical field testing results are presented in Chap. 8. The proposed software system is designed and combined with practical hardware, such as a microphone, pre-amplifier, and A/D sound card. The real processing time of the proposed system is about 100 ms, which is much less than the one-time data acquisition time of 300 ms. This means the proposed system can perform online data processing, with no buffer needed. We have tested the real-time practical system on sandy and paved roads, and achieved very good results. When we demonstrated the practical recognizer to the Transportation Security Administration (TSA), the Safety Dynamics company, and the Army Research Lab, we obtained 99 ∼ 100% correct vehicle recognition results.
Finally, some highlighted conclusions and future work discussions are given in Chap. 9.
Chapter 2
Gammatone Filterbanks
2.1 Introduction
It is known that auditory models can indeed provide better recognition accuracy than traditional signal processing representations (such as the Fast Fourier Transform, FFT) when the quality of the incoming acoustic signals degrades, or when training and testing conditions differ. The approach of auditory modeling continues to merit further attention, particularly with the goal of resolving issues of acoustic signal recognition.
The spectral feature extraction technology normally used in acoustic signal recognition, such as speech or speaker recognition, is mel frequency cepstral computation (MFCC) [RS78]. The overall diagram of the recognizer is given in Fig. 2.1. It is based on the mel-frequency scale, and results in a good representation of the sound spectrum.

Figure 2.1: An overview of the MFCC recognizer. (Pipeline: real-world signal → Hamming window → MFCC → RBF-NN with learned patterns from training → recognition decision at testing.)

However, MFCC features are not very robust in the presence of additive noise due to the limitation of the mel-frequency filter [TW05]. To solve this problem, one direction is to use filters that can capture more spectral nonlinearities, such as the gammatone filter. The gammatone filter more closely characterizes the physiological response of primary auditory fibers in cat and human ears [Joh72]. It was shown that the gammatone filter is more stable and noise-robust than the mel filter [CAS05]. In the present study, we use gammatone filterbanks for spectral analysis. The overview of the recognizer is shown in Fig. 2.2, where gammatone filterbanks replace the mel filterbanks.
Figure 2.2: An overview of the GTF recognizer. (Pipeline: real-world signal → Hamming window → gammatone filterbanks → RBF-NN with learned patterns from training → recognition decision at testing.)
2.2 Cochlear Impulse Response and Gammatone Filter
As illustrated in Fig. 2.2, a Hamming window is used to divide incoming acoustic waveforms into frames. Each frame has duration on the order of tens of milliseconds. A series of bandpass gammatone filters are applied to process each windowed frame and to extract spectral features. The gammatone function was first introduced by [Joh72], and characterizes the physiological impulse response of the cat's primary auditory fibers. The impulse response of the gammatone filter is

f(t) = t^{n-1} exp(-2π b t) cos(2π f_c t + φ),   (2.1)

where n is the filter order, b represents the filter bandwidth, f_c denotes the center frequency, and φ is the tone phase. The primary parameters of the filter are n and b. When the order of the filter is 4, the magnitude shape is very similar to that of the human auditory filter [Boe75, PM86, Med86, BEBK98].
A gammatone filter well represents the cochlear impulse responses of the auditory system [Boe75, GM90, PRH+92, Sla93, ER95, PH96, IU98, Cos98, CGG+99, AEL01, CH01, GK05, CS06, ZA07]. The bandwidth of the filter depends on the center frequency and is described by an equivalent rectangular bandwidth (ERB)

b(f_c) = 1.019 × 24.7 (1 + 4.37 f_c / 1000),   (2.2)

where 1.019 is the correction factor [PRH+92]. In order to derive the transfer function of the analog gammatone filter, the impulse invariant transformation (IIT) [OSB99, pp. 443-449] is applied, which is shown to have a smaller digital implementation error than other transformation methods [IP03]. The gammatone filter can extract more spectral nonlinearities than other conventional feature extraction approaches, such as the mel filter. It can thus achieve better performance under noisy environments [CAS05]. An example of gammatone filterbanks (GTF) is illustrated in Fig. 2.3(b).
Figure 2.3: (a) An example of mel filterbanks. (b) An example of gammatone filterbanks. (Both panels plot filter magnitude against frequency, 0-12,000 Hz.)
2.3 Gammatone vs. Mel Filterbanks
An example of mel filterbanks, a normally used spectral analysis technology, is shown in Fig. 2.3(a). Gammatone and mel filterbanks are compared in this part, especially when they are used for noisy data recognition. From Figs. 2.3(a) and (b), we can see that gammatone filters cover a broader frequency range than mel filters, and thereby more spectral information is involved in gammatone filters. This spectral assembling is good for noise robustness. For example, suppose that at 8000 Hz the mel filter provides a filtering value of 0.6, while the gammatone filter provides 0.45. When there is noise at this frequency bin, the mel filter applies a larger filtering coefficient than the gammatone filter, and thus is more impacted by this noise.
We show how gammatone filters are more noise robust than mel filters. An example of vehicle passing-by sound is given in Fig. 2.4(a). To produce noisy data, this clean vehicle waveform is mixed with additive white Gaussian noise (AWGN) at signal-to-noise ratios (SNRs) of 21 dB, 19 dB, and 17 dB, respectively. The noisy sound at SNR = 17 dB is illustrated in Fig. 2.4(b). For the clean vehicle data and each noisy data, we train a set of feature patterns via mel filterbanks or gammatone filterbanks. The feature difference between the patterns trained on noisy data and those trained on clean data is calculated and plotted in Fig. 2.5. The smaller the feature difference, the less the spectral analysis is influenced by noises. The upper three lines are feature differences when using mel filterbanks for various SNRs. The lower three lines are feature differences when using gammatone filterbanks. Gammatone filterbanks are less influenced by noises, and thus are more noise robust than mel filterbanks.
It is also noticed that both gammatone and mel filters effectively reduce spectral feature dimension with minimum loss of perceptual sensitivity. That is, the fast Fourier transform (FFT) has dimension on the order of hundreds, such as 512 in the present study. Both gammatone and mel filters reduce the spectral information down to 20. But mel filters, which have narrower triangular shapes, are not as stable as gammatone filters against noises.
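A sketch of the comparison just described (Python/NumPy). The thesis does not spell out the exact normalization behind the "normalized feature difference (dB)" axis of Fig. 2.5, so the dB conversion below is one plausible reading, labeled as an assumption.

import numpy as np

def feature_difference_db(noisy_feats, clean_feats):
    """Per-band difference between noisy- and clean-trained features.

    noisy_feats, clean_feats: (n_frames, Q) filterbank feature arrays.
    Returns a (Q,) vector; smaller values mean the filterbank is less
    influenced by the added noise (cf. Fig. 2.5). The dB scaling here
    is an assumed convention, not taken from the thesis.
    """
    diff = np.abs(noisy_feats.mean(axis=0) - clean_feats.mean(axis=0))
    return 20.0 * np.log10(diff + 1e-12)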
Figure 2.4: (a) An example of vehicle coming, passing-by, and leaving waveform (normalized amplitude vs. time, 0-30 s) in a normal circumstance. (b) The vehicle sound from (a) mixed with additive white Gaussian noise (AWGN) at signal-to-noise ratio (SNR) 17 dB.
Figure 2.5: Each line denotes the feature difference = features trained with noisy data minus features trained with clean data, plotted against feature index. The smaller the difference, the less the spectral analysis is influenced by noises. The upper three lines are feature differences when using mel filterbanks for various SNRs (21, 19, and 17 dB); the lower three lines are for gammatone filterbanks. Gammatone filterbanks are less influenced by noises, and thus are more noise robust than mel filterbanks.

2.4 Decision Making: Radial Basis Function Neural Network

In the present study, a radial basis function neural network (RBF-NN) is applied for decision making. In this network, Gaussian functions are used as kernels. There is a one-to-one correspondence between input data and Gaussian kernels. The number of Gaussian functions is smaller than the number of filterbanks. Therefore, the network maps and reduces the input data to a feature space with fewer degrees of freedom. At the output, a maximum likelihood decision is made.
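As an illustration of this decision stage, here is a minimal RBF-NN forward pass in Python/NumPy, assuming Gaussian kernels with a shared width; the training of centers, widths, and output weights is abstracted away, and all parameter values are hypothetical.

import numpy as np

def rbf_forward(x, centers, sigma, w):
    """One forward pass of an RBF network with Gaussian kernels.

    x:       (Q,) input feature vector (e.g., Q gammatone features)
    centers: (N, Q) Gaussian kernel centers, with N < Q so the input
             is mapped to a space with fewer degrees of freedom
    sigma:   shared kernel width (an assumption for simplicity)
    w:       (N,) output weights
    """
    d2 = np.sum((centers - x) ** 2, axis=1)  # squared distances to centers
    g = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian kernel activations
    return w @ g                             # scalar likelihood score

# Toy usage: Q = 30 features, N = 10 kernels; the class with the
# highest score would be chosen by the maximum likelihood decision.
rng = np.random.default_rng(2)
x = rng.normal(size=30)
score = rbf_forward(x, rng.normal(size=(10, 30)), 1.0, rng.normal(size=10))
print(score)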
2.5 Real-World Acoustic Signal Recognition
Acoustic signature recognition of moving vehicles has attracted increased attention recently. This acoustic signature recognition is intended for integration into a larger security context. It is normally assumed that there exists a fixed asset to protect, and a perimeter that defines the vicinity around that asset for surveillance. An automatic vehicle recognizer is necessary, as providing security by humans is dangerous or expensive. The sound recognizer of incoming vehicles is developed for perimeter protection in national, agricultural, airport, prison, military sites, and residential areas. For instance, the recognizer can be used to detect approaching vehicles that may be loaded with explosives or suicide bombers.

Figure 2.6: Radial basis function neural network is used for decision making. (Diagram: inputs x_1, ..., x_Q feed a hidden layer of N Gaussian functions G, whose outputs are combined through weights w_1, ..., w_N into the output y.)
The acoustic sound of interest from a moving vehicle is complicated and affected by multiple factors, such as vehicle type, gearing, number of cylinders, muffler choice, state of maintenance, moving speed, distance from the microphone, tires, and the road on which the vehicle travels. Moreover, the problem is complicated by the presence of uncontrolled interference emitted by the surrounding background, such as human voice, bird chirp, and wind. Real-world acoustic recognition of moving vehicles is thus very challenging.
Recently, some studies [CKGM96, MMSW97, Liu99, WSK99, Mun04, AZRS07] have been done on acoustic detection of moving vehicles. It is not easy to give a unified comparison among these studies, as their databases and testing environments are significantly different. Generally in acoustic signal processing, extracting representative features plays an important role in characterizing the unknown signature of moving vehicles. [CKGM96], [MMSW97], and [AZRS07] applied wavelet-based analysis for feature extraction of incoming waveforms. [WSK99] and [Mun04] used the short-term Fourier transform (STFT) to provide a precise representation of the acoustic data, and then used linear principal component analysis to convert the high-dimensional STFT features to low-dimensional ones. Besides, [Mun04] proposed a reliable probabilistic recognizer based on both principal subspace and complementary subspace, and compared his method with the baseline method MFCC.

Figure 2.7: An illustration of the vehicle recognition environment. Vehicles loaded with dangerous weapons may approach protected or restricted areas. The microphone and recognizer should be able to detect the approaching vehicles and immediately release alerts. Around the protected areas, many uncontrolled background sounds may exist, such as human voice, bird chirp, and wind.
The purpose of the designed recognizer is to detect approaching vehicles and identify their type with minimum error rates. The acoustic data of moving vehicles is recorded from one microphone. The purpose of using one microphone is to analyze the acoustic signatures of moving vehicles rather than to track or localize vehicles (which normally requires an array of microphones).

We recorded four types of vehicle sounds: gasoline light wheeled car, gasoline heavy wheeled car, diesel truck, and motorcycle. The road conditions were paved and sandy. The microphone was set 2 ∼ 5 feet away from the road edge, at a height of 0 ∼ 6 feet. For each vehicle type, 20 runs are used for training, where each run means a vehicle approaches, passes, and leaves the fixed microphone. Another 20 runs of each vehicle type are used for testing. In the non-vehicle class, white noise, human voice, and bird chirp are tested. These three sounds are used to test whether the proposed system correctly rejects them when no vehicle is mixed with them. The window size is 20 ms with 10 ms overlap, the sampling rate is 22,050 Hz, and the gammatone spectral range is 50 ∼ 11,025 Hz. The number of filterbanks Q = 30 is selected in order to cover enough high-frequency subbands within this spectral range.
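To make this set-up concrete, the sketch below computes the framing parameters and a plausible set of Q = 30 subband center frequencies from the values stated above. The ERB-rate spacing (with Glasberg-Moore constants) is an assumption made here for illustration; any spacing that covers 50 ∼ 11,025 Hz would match the description.

import numpy as np

# Framing and filterbank parameters stated above. The ERB-rate spacing of
# the Q = 30 gammatone center frequencies is an illustrative assumption.
fs = 22050                       # sampling rate (Hz)
win_s, hop_s = 0.020, 0.010      # 20 ms window, 10 ms shift (10 ms overlap)
f_lo, f_hi, Q = 50.0, 11025.0, 30

def erb_space(lo, hi, n):
    """Center frequencies equally spaced on the ERB-rate scale
    (Glasberg-Moore constants), returned from high to low."""
    ear_q, min_bw = 9.26449, 24.7
    c = ear_q * min_bw
    pts = np.arange(1, n + 1)
    return -c + np.exp(pts * (np.log(lo + c) - np.log(hi + c)) / n) * (hi + c)

centers = erb_space(f_lo, f_hi, Q)     # 30 subband centers within [50, 11025] Hz
frame_len = int(fs * win_s)            # 441 samples per 20 ms frame
frame_hop = int(round(fs * hop_s))     # ~220 samples between frame starts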
[Decision tree: sound waveform → urban vehicle / non-vehicle.]
Figure 2.8: Decision tree. Recognize any urban vehicle in city environments while rejecting background noises.
As shown in Fig. 2.8, any urban vehicle (gasoline light wheeled car, gasoline heavy wheeled car, diesel truck, or motorcycle) is recognized, while background sounds, such as human voice, bird chirp, and wind, are rejected.
2.6 Recognition Results of GTF vs. MFCC
The vehicle sound we recorded in low background noise has an SNR of 20 ∼ 30 dB, where SNR is the ratio of the averaged signal (vehicle) energy to the averaged noise energy. We then analyze −20 ∼ 20 dB as noisy environments. Given the averaged vehicle energy and an SNR value, for example 0 dB, we reversely compute the required averaged noise energy value, P_n. For white noise, the averaged energy value is its variance. For human voice, we compute the averaged energy of the selected human voice waveforms, P_h. There is a scalar factor between P_n and P_h, so we multiply the selected human voice waveforms by the square root of that factor. The scaled human voice waveforms then have the averaged energy value P_n, and we obtain the desired SNR value, 0 dB. The SNR computation is similar for the bird chirp test. The noise-corrupted vehicle waveforms are mixtures of vehicle data and noises. During pattern recognition, the recognizer does not assume any knowledge of the noise source or of how it is mixed with the vehicle data. Performance therefore depends on how accurately the extracted independent features of the signals of interest hold up against noises. The comparison standard is error rate, which is 1 − performance.
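As a concrete illustration of this SNR-targeted mixing, the sketch below scales a noise waveform so that its averaged energy equals the required P_n before adding it to the vehicle data. The function name and the equal-length assumption are ours, not the thesis code.

import numpy as np

def mix_at_snr(vehicle, noise, snr_db):
    """Scale `noise` so the mixture has the target SNR (dB), following the
    energy-matching procedure described above. Both inputs are 1-D arrays
    of equal length (an assumption for this sketch)."""
    p_signal = np.mean(vehicle**2)               # averaged vehicle energy
    p_n = p_signal / (10.0 ** (snr_db / 10.0))   # required noise energy P_n
    p_h = np.mean(noise**2)                      # current noise energy P_h
    scaled_noise = noise * np.sqrt(p_n / p_h)    # match energy to P_n
    return vehicle + scaled_noise                # noise-corrupted mixture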
Firstly, when vehicle data is mixed with unknown AWGN, the recognition results are given in Fig. 2.9; AWGN is just one noise example. We can see that GTF (gammatone filterbanks) is slightly better than MFCC (mel frequency cepstral computation): for example, when SNR = 14 dB, the error rate is decreased from 30% to 20%, a 10% improvement. Generally, when SNR > 16 dB, both MFCC and GTF perform well, i.e., the error rate is less than 10%.
[Plot: error rate (%) on a log scale vs. SNR (dB) over −20 ∼ 20 dB, for MFCC and GTF; vehicle mixed with AWGN.]
Figure 2.9: When vehicle data is mixed with AWGN, gammatone filterbanks can achieve slightly better performance than MFCC.
Secondly, vehicle sounds are corrupted by an unknown colored noise, human vowel voice; different vowels with various spectra are mixed with the vehicle sounds along time. In Fig. 2.10, the compared results indicate that GTF is slightly better than MFCC: for example, when SNR = 8 dB, the error rate of GTF is 25% while the error rate of MFCC is 30%. Generally, when SNR ≥ 20 dB, both MFCC and GTF perform well.

[Plot: error rate (%) on a log scale vs. SNR (dB) over −20 ∼ 20 dB, for MFCC and GTF; vehicle mixed with human vowel.]
Figure 2.10: When vehicle data is mixed with human vowel sound, gammatone filterbanks can achieve slightly better performance than MFCC.

Next, unknown bird chirp noise, another colored noise often present in normal environments, is mixed with the vehicle data. Various bird chirps from the dataset [Gau07] are used for mixing. In Fig. 2.11, we can see that GTF is slightly better than MFCC: for example, when SNR = 2 dB, the error rate is decreased from 29% to 19%, a 10% improvement. Generally, MFCC and GTF perform well when SNR ≥ 10 dB.
[Plot: error rate (%) on a log scale vs. SNR (dB) over −20 ∼ 20 dB, for MFCC and GTF; vehicle mixed with bird chirp.]
Figure 2.11: When vehicle data is mixed with bird chirp sound, gammatone filterbanks can achieve slightly better performance than MFCC.
Chapter 3
Spectro-Temporal Dynamics
3.1 Introduction
Consecutive feature vectors, after spectral analysis, are often assumed to be independent inputs for the subsequent pattern recognition [RS78, WSK99, Mun04, TW05, CAS05]. The duration of each feature vector is generally on the order of tens of milliseconds. However, these spectral feature vectors are sensitive to background noise or channel distortion, and they are correlated along time [Cam97, Rey02]. The output response depends not only on the present value of the input feature vector but also on its past values. Time constitutes an essential ingredient of the learning process; incorporating time into the operation enables a neural network to address non-stationary processes. How do we build time into the operation of a neural network? For a neural network to be dynamic, a structure of time delays is used [LH88], where each time-delayed input has a different synaptic weight. This incorporation of time is motivated by physiological studies of the primary auditory cortex: [SVK95, SDS98, KDSS00] have pointed out that neurons in the brain process both time and frequency components of signals, and that the temporal receptive field extends up to the order of hundreds of milliseconds. In the present study, we incorporate time-delayed feature vectors into the operation of a neural network. The overview of the recognizer is shown in Fig. 3.1: we add a spectro-temporal dynamic representation (STR) after the gammatone filters.
[Block diagram: real-world signal → Hamming window → gammatone filterbanks → spectro-temporal representation → RBF-NN; training produces learned patterns, testing yields decision level one.]
Figure 3.1: An overview of the GTF+STR recognizer.
3.2 Incorporation of Temporal Dynamics
Via gammatone filter processing, a spectral feature vector is computed for each frame. An example of a vehicle waveform and its spectrogram is shown in Fig. 3.2. The narrow black rectangle in the lower spectrogram represents one spectral feature vector, which is the input to the radial basis function neural network in Sect. 2.4. At each time, only one spectral feature vector is used for the network operation.
Figure 3.2: An example of a vehicle waveform and its spectrogram. The waveform amplitude is normalized based on the recording microphone and pre-amplifier used. The frame size is 20 ms, and the overlap is 10 ms. The narrow black rectangle in the lower spectrogram represents one spectral feature vector, which is the input to the radial basis function neural network in Sect. 2.4. At each time, only one spectral feature vector is used for the network operation.
The incoming acoustic waveforms are generally non-stationary across consecutive feature vectors. The response of a recognizer at the present time depends not only on the current input feature vector but also on the past ones. To recognize such a temporally dynamic pattern, processing a sequence of feature vectors is necessary. A spectro-temporal representation (STR) is thus established as the input for the following process. This spectro-temporal input includes both the present feature vector and the past ones. Integrating acoustic information along time can attenuate the drawback of spectral feature vectors, namely their sensitivity to changes in the acoustic environment. It is thus expected that performance can be improved by incorporating temporal dynamics, compared with spectral analysis alone. In the lower spectrum of Fig. 3.3, a broader rectangle is marked, which includes both the current and the past several feature vectors. At each time, multiple feature vectors are used as one input to the RBF-NN. The duration of the broader rectangle is generally chosen on the order of hundreds of milliseconds. The spectro-temporal representation is illustrated in Fig. 3.4: in the frequency domain, Q filtered spectral elements are listed; in the temporal domain, the current and M − 1 past feature vectors are assembled, spanning a duration on the order of hundreds of milliseconds.
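A minimal sketch of assembling this representation is given below: each STR input stacks the current and M − 1 past spectral feature vectors into one Q × M block and interleaves it into a single vector. The array shapes and function name are ours, chosen for illustration.

import numpy as np

def build_str(features, M=20):
    """Stack the current and M-1 past spectral feature vectors into one
    spectro-temporal input (cf. Fig. 3.4). `features` has shape (T, Q):
    T frames of Q gammatone subband features. Returns an array of shape
    (T - M + 1, Q * M), one interleaved STR input per valid frame."""
    T, Q = features.shape
    return np.stack(
        # features[t-M+1 : t+1] is the (M, Q) block ending at frame t;
        # transposing and raveling interleaves it q-major, m-minor
        [features[t - M + 1 : t + 1].T.ravel() for t in range(M - 1, T)]
    )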
Figure 3.3: An example of a vehicle waveform and its spectrogram. In the lower spectrum, a broader rectangle is marked, which includes both the current and the past several feature vectors. At each time, multiple feature vectors are used as one input to the RBF-NN. The duration of the broader rectangle is generally chosen on the order of hundreds of milliseconds.
By interleaving the spectro-temporal representation from Fig. 3.4 and using it as an input to a RBF-NN, the network is illustrated in Fig. 3.5. Now both spectral and temporal information are correlated in one network operation. Integrating acoustic information over multiple frames can greatly attenuate the drawback of spectral features, namely their sensitivity to changes in the acoustic environment such as background noise or channel distortion. Hence, the performance of the STR recognizer is expected to be better than that of an either-spectral-or-temporal recognizer, since incorporating features in both domains maintains more intrinsic properties of the input data. The STR input, however, requires a higher computational load than single spectral analysis (such as GTF or MFCC).

[Diagram: the Q-by-M array of feature elements x_q(m − i), q = 1, ..., Q, i = 0, ..., M − 1, laid out over frequency and time.]
Figure 3.4: The spectro-temporal dynamic representation. In the frequency domain, Q filtered spectral elements are listed. In the temporal domain, the current and M − 1 past feature vectors are assembled.
This spectro-temporal dynamic representation is motivated by physiological studies of the mammalian auditory cortex. These studies have determined that neurons in the brain process both time and frequency components of signals, and that the temporal receptive field extends up to the order of hundreds of milliseconds [SVK95, SDS98, KDSS00, SDK+07]. This technique has also been applied in speech recognition [MSS06].
3.3 Recognition Results of STR vs. Spectral Analysis
The application, the data, and the set-up are the same as in Sect. 2.6. The number of consecutive feature vectors M = 20 corresponds to a 200 ms temporal duration ((20 − 10) ms × 20 = 200 ms). Firstly, when vehicle data is mixed with unknown AWGN, the recognition results are given in Fig. 3.6. Using STR (the spectro-temporal dynamic representation) improves system performance over gammatone filterbanks (GTF): for example, at SNR = 10 dB, the error rate is decreased from 50% to 10%, a 40% improvement. When SNR ≥ 10 dB, STR performs well.

Secondly, vehicle sounds are corrupted by an unknown colored noise, human vowel voice. In Fig. 3.7, the results indicate that STR is slightly better than the single spectral analysis techniques, GTF or MFCC; for example, STR improves performance by 3% at SNR = 18 dB. Generally, when SNR ≥ 20 dB, STR performs well.

Next, unknown bird chirp noise, another colored noise often present in normal environments, is mixed with the vehicle data. In Fig. 3.8, we can see that STR is better than GTF: at SNR = 2 dB, it decreases the error rate by 11.5%. When SNR ≥ 2 dB, STR performs well.
[Diagram: the interleaved STR elements x_q(m − i) feed the input layer; a hidden layer of N Gaussian functions G with weights w_1, ..., w_N feeds the output y.]
Figure 3.5: RBF-NN with STR as input. The spectro-temporal representation is interleaved and used as an input to a RBF-NN. Now both spectral and temporal information are correlated in one network operation.
[Plot: error rate (%) on a log scale vs. SNR (dB) over −20 ∼ 20 dB, for MFCC, GTF, and GTF+STR; vehicle mixed with AWGN.]
Figure 3.6: When vehicle data is mixed with AWGN, STR can achieve better performance than single spectral analysis can.
[Plot: error rate (%) on a log scale vs. SNR (dB) over −20 ∼ 20 dB, for MFCC, GTF, and GTF+STR; vehicle mixed with human vowel.]
Figure 3.7: When vehicle data is mixed with human vowel sound, STR can achieve slightly better performance than single spectral analysis can.
[Plot: error rate (%) on a log scale vs. SNR (dB) over −20 ∼ 20 dB, for MFCC, GTF, and GTF+STR; vehicle mixed with bird chirp.]
Figure 3.8: When vehicle data is mixed with bird chirp sound, STR can achieve better performance than single spectral analysis can.
Chapter 4
Unsupervised Learning for Noise-Robust Pattern Recognition
When the testing conditions differ from the conditions under which the training signals were recorded, which is almost always the case in reality, commonly used algorithms may fail to recognize the signals of interest. Even worse, when signals of interest are corrupted by noises emitted by the surrounding background, the incoming sounds are mixtures of signals of interest and noises. We are motivated to design a system that is capable of eliminating unexpected noise effects for real-world applications. Considering real-world noises, there are two difficulties in designing such a system.

D1) The unexpected noises are generally highly time-varying, or unknown to the recognizer.

D2) The mixing property between signals of interest and noises is generally unknown.

Hence, it is not feasible to model every noise source, or to track the mixing property at each time stamp. The guiding ideas are:

I1) Find a representative feature space of signals of interest, which should not be affected by the various noises or the mixing property.

I2) The representative feature space should be determined by the inherent statistics of the signals of interest.
How can we find this feature space? In the following, we first describe generalized linear Hebbian learning (LHL), a commonly used algorithm, to find a feature space. Then, in Chap. 5, we propose using a nonlinear Hebbian learning, which calculates a more representative feature space than LHL can. Using these algorithms for training, a feature space is learned, and so is a set of signatures that represents the signals of interest. The learned signatures are the output values obtained when signals of interest are projected into the feature space. These output signatures are unknown to the system before training, so the learning process belongs to unsupervised learning.
During testing, with the trained feature space and signatures, when incoming noisy data (a mixture of signals of interest and noises) is projected into the feature space, the signals of interest are favored while the noises are attenuated or eliminated. All projected values are then compared with the trained signatures and a likelihood decision is made.
4.1 Biological Motivation
Hebbian learning, a basic neural learning rule found in the brain, has motivated us to implement the noise-robust acoustic signature recognition. There is strong physiological evidence for Hebbian learning in the area of the brain called the hippocampus. The hippocampus plays an important role in certain aspects of learning and memory. This physiological evidence makes Hebbian learning all the more appealing. Hebb's postulate of learning is the oldest and most famous of all learning rules. Quoting from [Heb49, p. 62], a statement made in a neurobiological context: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A's efficiency as one of the cells firing B, is increased." The corresponding experimental results were reported in [BL73, GWAH87, AS90, BAD+91, BBB+91, XBB92] and called long-term potentiation of synaptic strength. It was further hypothesized that the direction and magnitude of the change in synaptic strength depend on the degree to which pre- and postsynaptic activities are correlated [BCM82, SS89], which assumes negative learning as well (together with Hebb's postulate, called bidirectional learning). The corresponding experimental results were reported in [XBB92, MM92, TBB94] and called long-term depression of synaptic strength. According to the mathematical abstraction [Pal82, Oja82, OK88], a Hebbian synapse increases its strength when the presynaptic and postsynaptic signals are positively correlated, and decreases its strength when these signals are negatively correlated.
It has further been indicated that changes in the synaptic connections between neurons contribute to memory storage and to the activity-dependent development of a neural network. As illustrated in Fig. 4.1(a), when the pre-synaptic input signal and the post-synaptic output signal are positively correlated, the synaptic weight connecting pre-synapse and post-synapse is strengthened. On the contrary, as in Fig. 4.1(b), when the pre-synaptic input signal and the post-synaptic output signal are negatively correlated, the synaptic weight connecting pre-synapse and post-synapse is weakened.
Figure 4.1: (a) When the pre-synaptic input signal and the post-synaptic output signal are positively correlated, the synaptic weight connecting pre-synapse and post-synapse is strengthened. (b) When the pre-synaptic input signal and the post-synaptic output signal are negatively correlated, the synaptic weight connecting pre-synapse and post-synapse is weakened.
4.2 Linear Hebbian Learning
Many studies and applications [Oja82, San9a, San9b, San90, Mil90, OK95, FB95, RS97] have been based on generalized linear Hebbian learning. It has been proved that LHL is actually equivalent to generalized linear principal component analysis (PCA) [KD90, KJ93, XY95, WAW03], except that LHL is an adaptive computation while PCA is a batch computation. In the present study, the comparison between LHL and NHL is therefore equivalent to the one between PCA and NHL.
[Diagram: spectro-temporal presynaptic neurons x (the interleaved STR) → synaptic weights w → linear postsynaptic neurons y.]
Figure 4.2: Linear Hebbian learning flowchart. The input comes from the spectro-temporal dynamic representation; the STR is interleaved into the network operation. This is unsupervised learning, as the exact outputs are unknown. During the LHL procedure, patterns and synaptic weights are iteratively learned.
As illustrated in Fig. 4.2, the input to LHL comes from the spectro-temporal dynamic representation; the STR is interleaved into the network operation. This belongs to unsupervised learning, as the exact outputs are unknown. During the LHL procedure, patterns and synaptic weights are iteratively learned.
Define Q as the number of filterbanks, M as the number of feature vectors along time, and L as the number of output neurons. Define x_{qm} as the input feature element at the q-th spectral filter (q = 1, ..., Q) and the m-th temporal frame (m = 1, ..., M). A Q-by-M spectro-temporal representation, which includes both the present and the past M − 1 spectral feature vectors, is the input to the learning. Let y_l (l = 1, ..., L) denote an output. Then define w_{qml} as the synaptic weight connecting input neuron x_{qm} and output neuron y_l. A generalized LHL is realized by iteratively computing (4.1):
\[
\begin{aligned}
y_l &= \sum_{q=1}^{Q}\sum_{m=1}^{M} w_{qml}\, x_{qm}, \qquad l \in [1,L];\\
\Delta w_{qml} &= \eta\, y_l \Big( x_{qm} - \sum_{i=1}^{l-1} w_{qmi}\, y_i - w_{qml}\, y_l \Big), \qquad q \in [1,Q],\ m \in [1,M],\ l \in [1,L],
\end{aligned}
\tag{4.1}
\]
where η is the learning rate. It has been proved that this LHL maximizes the variance E{y^2} (where E{·} denotes expectation). Upon convergence, the variances of the output patterns y_l are the largest eigenvalues of the input data correlation, and these outputs are the LHL-projected signatures. The synaptic weight vectors are the corresponding eigenvectors, which project the data into the LHL-learned feature space [Oja82, San9a, San9b, San90].
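A minimal sketch of one iteration of eq. (4.1) is given below, with the Q × M input interleaved into a vector of dimension D = QM. The matrix shapes and function name are ours, and convergence handling (learning-rate decay, stopping test) is omitted.

import numpy as np

def lhl_update(W, x, eta=0.01):
    """One step of the generalized linear Hebbian rule of eq. (4.1)
    (a Sanger-type rule). W has shape (L, D) with D = Q*M interleaved
    spectro-temporal inputs; x has shape (D,). A sketch, not the
    thesis's exact implementation."""
    y = W @ x                                  # y_l = sum_qm w_qml x_qm
    for l in range(W.shape[0]):
        # input minus reconstructions from neurons 1..l-1 (competition)
        # and from neuron l itself (stabilization)
        resid = x - W[: l + 1].T @ y[: l + 1]
        W[l] += eta * y[l] * resid
    return W

Iterating this update over the training frames drives the rows of W toward the leading eigenvectors of the input correlation, in the order l = 1, ..., L.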
4.3 Blind Source Separation Using LHL
In real-world applications, signals of interest are often corrupted by uncontrolled noises emitted by the surrounding background at the same time stamps. The microphone may then record noise-corrupted data, i.e., mixtures of signals of interest and noises. When this noisy data is used for pattern recognition, it may be incorrectly classified as a noise pattern instead of a pattern of the signals of interest. In general, many noise sources exist in real-world environments; they can be highly time-varying and unknown to the acoustic signal recognizer. Moreover, the mixing property between signals of interest and noises is unknown. It is not feasible to model every noise source or to track the mixing property. Given only the observed noisy data, how can a system implement noise-robust pattern recognition?
The concept of blind source separation (BSS) can be used for noise-robust pattern recognition: before implementing pattern recognition, implement an independent component extraction, where the extracted components should represent the signals of interest. In Fig. 4.3, the incoming observed signals may be mixed from several unknown sources; the mixing property is also unknown due to the various unknown environments. A system can be designed to blindly separate the independent sources out of the observed signals. The detailed mixing procedure, by an unknown mixing matrix, is illustrated in Fig. 4.4, and the detailed separating procedure, using some unsupervised learning, is shown in Fig. 4.5. The extracted components y can be equal to the unknown sources s if the learning can ideally extract independent sources.
[Diagram: unknown source vector s → mixer (x = As) in unknown environments → observation vector x → demixer (y = Wx) by blind source separation → output vector y.]
Figure 4.3: The incoming signals may be mixed from several unknown sources. The mixing property is also unknown due to the various unknown environments. A system can be designed to blindly separate independent sources out of the observed signals.
Based on eq. (4.1), notice that the only information known to LHL is the observation data x. The LHL function models neither the noise source nor the mixing property. Therefore, regardless of the noise type or its variation, and regardless of the mixing property, the purpose is to separate signals of interest from noises, or to separate several mixed unknown sources.
[Diagram: x_1 = a_{11} s_1 + a_{12} s_2 and x_2 = a_{21} s_1 + a_{22} s_2, mapping the unknown sources s_1, s_2 to the observations x_1, x_2.]
Figure 4.4: The detailed mixing procedure by an unknown mixing matrix.
[Diagram: y_1 = w_{11} x_1 + w_{12} x_2 and y_2 = w_{21} x_1 + w_{22} x_2, mapping the observations x_1, x_2 to the outputs y_1, y_2.]
Figure 4.5: The detailed separating procedure with some unsupervised learning. The extracted components y can be equal to the unknown sources s if the learning can ideally extract independent sources.
One example is given here to illustrate the limitation of LHL. The only information given to the system is the observation data x in the x_1-x_2 space in Fig. 4.6. The observation signal x_1 along time is given in Fig. 4.7(a), and signal x_2 in Fig. 4.7(c); their corresponding histograms are shown in Figs. 4.7(b) and (d), respectively.
Which sources are in the observation vector is unknown, and how these sources are mixed is also unknown. But in order to provide comparison standards for the after-demixing components, they are presented in Fig. 4.8. The unknown sources and the unknown mixing matrix were randomly selected to generate the observation signals in Figs. 4.6 and 4.7. The observation data is composed of two independent components: one, s_1, is uniform-distributed within the range [−2, 2]; the other, s_2, is Gaussian-distributed with mean 0 and variance 1. The two components are composed as x = As, with A = [0.9397 −0.7660; −0.3420 0.6428]. The observed data has already lost the sources' inherent properties, as the observation data histograms are not equal to the source distributions.

[Scatter plot of the observation data in the x_1-x_2 plane.]
Figure 4.6: The only given information to the system is the observation data x in x_1-x_2 space.
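This toy observation data can be reproduced, up to the random draw, with the sketch below. The composing matrix A and the source distributions are taken from the text; the sample count is an assumption read off the plotted axes.

import numpy as np

rng = np.random.default_rng(0)
n = 8000                                    # samples, per the plotted time axis
s1 = rng.uniform(-2.0, 2.0, n)              # uniform source on [-2, 2]
s2 = rng.normal(0.0, 1.0, n)                # Gaussian source, mean 0, variance 1
S = np.vstack([s1, s2])                     # unknown sources, shape (2, n)
A = np.array([[0.9397, -0.7660],
              [-0.3420, 0.6428]])           # composing matrix from the text
X = A @ S                                   # observations x = A s, shape (2, n)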
The system does NOT know the component distributions NOR their composing property. In order to find the principal feature space, LHL maximizes the variance of the data, and extracts components approximately along the two diagonal lines of the observation data. The calculated synaptic weight matrix is W_L = [−0.8785 0.4807; 0.4778 0.8727]. Via W_L, the data can be projected into the LHL-learned feature space, as shown in Fig. 4.9(b).

The LHL components look like neither the [−2, 2] uniform nor the (0, 1) Gaussian distribution, as illustrated in Figs. 4.10(a)-(d). This indicates that LHL is not able to correctly extract the two components (one uniform, one Gaussian), as it is not able to find their real feature space. When the decomposing synaptic weight matrix multiplies the composing one, W_L A = [−1.1915 −0.2168; 0.6076 0.3966] ≠ I, where I is the identity matrix. Then y = W_L x = W_L A s ≠ s. The underlying reason is that the inherent properties of independent components involve higher-order statistics, and LHL, which only explores the second-order moment, fails to extract them.

[Plots: the two mixed observations over discrete time and their histograms.]
Figure 4.7: (a) Observation signal x_1 along time. (b) The histogram of signal x_1. (c) Observation signal x_2 along time. (d) The histogram of signal x_2.
[Plots: the two sources over discrete time and their distributions.]
Figure 4.8: The observation data is a mixture of two components. (a)(b) One is uniform-distributed within the range [−2, 2]. (c)(d) The other is Gaussian-distributed with mean 0 and variance 1. These two components are composed with matrix A. The observed data has already lost the source distributions.
[Scatter plots: the observation data in x_1-x_2 space and the LHL-learned results in y_1^L-y_2^L space.]
Figure 4.9: (a) The only given information is the observation data in x_1-x_2 space. The system does NOT assume any knowledge of the components NOR their composing property. (b) In order to find the representative feature space, LHL optimizes the variance of the data and finds components approximately along the two diagonal lines. The observation data is projected into the LHL-learned feature space.
[Plots: the two LHL-demixed signals over discrete time and their histograms.]
Figure 4.10: (a)(b) The histogram of the LHL-learned y_1^L is not equal to the distribution of either real component. (c)(d) The histogram of the LHL-learned y_2^L is not equal to the distribution of either real component. Neither LHL-learned component matches a real component.
Chapter 5
Unsupervised Nonlinear Hebbian Learning
5.1 Nonlinear Hebbian Learning for Pattern Recognition
For the specific task of acoustic signal recognition under noisy environments, the concept of blind source separation is applied to separate noises out of their mixtures with signals of interest. Not only is the BSS concept applied; we also modify BSS techniques to perform pattern recognition. If we project the mixed noisy data into a feature space of the signals of interest, the signals of interest and the noises are de-mixed. In this way, noise effects on the signals of interest are attenuated or eliminated.

In real-world testing, coexisting with many uncontrolled background noises, the testing data can be corrupted by unknown noise. Without noise attenuation or elimination, the testing data may be classified as some noise pattern, and the signal of interest may be missed. For the vehicle sound recognition application in noisy circumstances, for example, when the data is clean as in Fig. 5.2(a), the recognizer can provide accurate recognition results. However, when the data is severely corrupted by noise, e.g., by AWGN at SNR = 0 dB as in Fig. 5.2(b), the recognizer may classify the noisy vehicle data in Fig. 5.2(c) as the noise in Fig. 5.2(b), instead of the vehicle sound in Fig. 5.2(a).
[Diagram: unknown components s → synthesis (x = As) by an unknown generator and environments → observation vector x → decomposition and dimensionality reduction (y = Wx) by independent component extraction → output vector y → pattern recognition.]
Figure 5.1: The flowchart when independent component extraction is used for pattern recognition. The first three blocks are related to blind source separation: the incoming signals may be mixed from several unknown sources, and the mixing property is also unknown due to the various unknown environments. The recognizer can then extract the unknown independent sources and compare them with the trained patterns of interest.
As described in the example, we assume the given observation data is a composite of several independent components, and each independent component is a signature of the data. In order not to confuse the concept of mixing with the concept of composing in the present study, two comments are stated.

Mixture of signals of interest and noises:
This case refers to the testing stage. When an incoming noisy sound (a mixture of signals of interest and noise) is tested, the system does not know, and does not need to know, the noise distribution or the mixing property. But the system can project the noisy data into a learned feature space of the signals of interest, in which the noises are attenuated. In this way, the system can effectively separate signals of interest from noises and provide noise-robust recognition.

Composite of independent signatures:
This case refers to the training stage. The purpose of training is to find the representative feature space and the independent components (signatures) of the signals of interest. The system does not assume any knowledge of the component distributions or the composing property. Via efficient learning, the synaptic weight matrix is updated so as to project/decompose signals into a representative feature space, where the independent signatures are extracted.

[Plots: (a) a clean vehicle waveform, (b) an AWGN waveform, (c) the noisy vehicle waveform; normalized amplitude vs. time (s).]
Figure 5.2: (a) The vehicle waveform in a normal circumstance. (b) The AWGN waveform. (c) The vehicle waveform under a severely noisy circumstance: the vehicle data is corrupted by AWGN at SNR = 0 dB.
Neurons employing Hebbian learning in a network can learn to code a set of patterns in such a way that important components are strengthened while unimportant noisy ones are weakened [Oja82, San9a, San9b, San90, OOW91, SS92, SH94, Oja95, Oja97, SH95, CAT96, HO98, LLGS99, LS00, LLS00]. In this chapter, a nonlinear Hebbian learning is analyzed, and its advantage over generalized linear Hebbian learning is stated. The overview of the proposed system is presented in Fig. 5.3. After using gammatone filters for spectral nonlinearities and STR to involve temporal dynamics, a nonlinear Hebbian learning (NHL) is described to extract the unknown acoustic signatures of the signals of interest. Concurrent to this process, the synaptic weights connecting input and output neurons are adaptively learned. Then, at the testing stage, both the synaptic weights and the signatures are used as trained patterns in a supervised associative network, the radial basis function neural network (RBF-NN).
5.2 Nonlinear Hebbian Learning
In an information-theoretic context, the second-order moment is inadequate to describe data properties, as the mutual information between independent components of data generally involves statistics of all orders. LHL has been extended to nonlinear Hebbian learning with several different sets of equations [OOW91, SH94, Oja95, SH95, Oja97, HO98, Fio03, LS04]. We modify an NHL method [SH95] (used for data clustering) into a method for noise-robust acoustic signal recognition.
Figure 5.3: An overview of the proposed recognizer.
The output y_l, produced in response to a set of inputs, is given by
\[
y_l = \sum_{q=1}^{Q}\sum_{m=1}^{M} w_{qml}\, x_{qm}, \qquad l \in [1,L]. \tag{5.1}
\]
Define nonlinear neuron outputs \(\{z_l = g(y_l)\}_{l=1}^{L}\), with g(·) being a nonlinear function, as illustrated in Fig. 5.4.
Instead of directly using a nonlinear function as in the method of [SH95], it can be proven that a properly selected criterion E{g^2(y)}, with a proper nonlinear function g(y), approximates the data entropy. Intuitively, infomax, or maximum entropy, means maximum data information. Data information can be described by the moments of the data, and if the data has non-Gaussian information, then higher-order moments of the data are necessary. Instead of using a sum of higher-order moments, which is computationally complicated, sensitive to outlier noise, and mainly measures the tails of the distribution, a properly selected nonlinear function allows a simple operation. That a certain nonpolynomial function approximates the maximum entropy can be proven using Boltzmann theory (related to the maximum-entropy probability distribution) and Taylor expansion [JS87, Hyv98]. The exponential distribution family is related to the maximum-entropy probability distribution family; we will select a nonlinear function from the exponential distribution family.
The NHL optimizes the approximate entropy,
\[
\begin{aligned}
&\text{maximize } J = E\{g^2(y_l)\},\\
&\text{subject to } \mathbf{w}_{l_1}^{T}\mathbf{w}_{l_2} = \delta_{l_1 l_2}, \qquad l, l_1, l_2 \in [1,L],
\end{aligned}
\tag{5.2}
\]
where w_l = [w_{11l}, w_{12l}, ..., w_{1Ml}, ..., w_{Q1l}, w_{Q2l}, ..., w_{QMl}]^T is the interleaved vector of spectro-temporal synaptic weights for output neuron l, and (·)^T denotes transposition. The inner products of the synaptic weight vectors of different output neurons satisfy the constraint through the Kronecker delta: δ_{l_1 l_2} = 1 if l_1 = l_2, and δ_{l_1 l_2} = 0 otherwise. This condition prevents extensive growth of the synaptic weights and guarantees that the learned weight vectors are orthogonal. In particular, a proper nonlinear function g(y) can be Taylor expanded and then contains all-order moments of y. Thus the variance of the nonlinear function is able to capture higher-order moments, rather than just the second-order moment, of the data. With more statistical characteristics involved, NHL can extract components that are more representative than those LHL can extract.
[Diagram: spectro-temporal presynaptic neurons x (the interleaved STR) → synaptic weights w → linear postsynaptic neurons y with inhibitory effects among outputs → nonlinear postsynaptic neurons z through the nonlinear neural function.]
Figure 5.4: An illustration of nonlinear neural learning. The synaptic weights connecting input and output neurons are adaptively learned. The linear output neurons denote the extracted representative independent components, and the nonlinear output neurons with an activation function play a crucial role in the statistical optimization of the NHL procedure.
Using a Lagrange function for eq. (5.2) and differentiating it with respect to w_{qml}, the update of the synaptic weight can be derived in accordance with a stochastic approximation approach:
\[
\Delta w_{qml} = \eta\, g(y_l)\, g'(y_l) \Big( x_{qm} - \sum_{i=1}^{l-1} w_{qmi}\, y_i - w_{qml}\, y_l \Big), \qquad q \in [1,Q],\ m \in [1,M],\ l \in [1,L],
\tag{5.3}
\]
where g'(y_l) denotes the derivative of g(y_l).
The first term, g(y_l) g'(y_l) x_{qm}, on the right-hand side of eq. (5.3) is in accordance with the usual Hebbian modification of a synaptic weight, which responds to the excitatory input and provides self-amplification. The second term, −g(y_l) g'(y_l) Σ_{i=1}^{l−1} w_{qmi} y_i, represents the inhibitory connections from the other output neurons 1, 2, ..., l − 1 to neuron l. The third term, −g(y_l) g'(y_l) w_{qml} y_l, is responsible for stabilization. The whole signal flow is displayed in Fig. 5.5.
[Diagram: signal flow of the weight adaptation, with input x, outputs y_1, ..., y_l, nonlinear output z_l, and weight-derivative feedback.]
Figure 5.5: Signal flow for synaptic weight adaptation: x represents the excitatory input; y_1, y_2, ..., y_{l−1} contribute inhibitory effects; the linear y_l provides a stabilizing effect; and the nonlinear part z_l is critical for the higher-order statistical characteristics of y_l. For simplicity, the spectral and temporal indices q, m are ignored.
The w_{qml} learning is derived from the NHL method proposed by [SH95]. In the present study the difference is twofold. First, the real extracted output (signature) here is y, rather than z as in the method of [SH95]. We care about acoustic signal recognition of one-class data, while the method of [SH95] addresses multi-class clustering. Neuron z clusters data toward the 1 or −1 end points of the bounded range [−1, 1] [SH95], which is excellent for multi-class data clustering but not good for one-class pattern recognition, which needs to extract real representative features. Secondly, the nonlinear activation function in the present study is chosen based on the implicit acoustic signal distribution, which takes the outlier-rejection issue in pattern recognition into account. The method of [SH95], in contrast, considers the big-gap boundary issue in clustering, where outliers may be closely centered with the signals of interest. More discussion of the outlier issue appears under Slope of activation function in Sect. 5.4.3.
To summarize, NHL iteratively updates the neuron outputs and synaptic weights via the following two steps.

Step I) Neuron output computation:
\[ y_l = \sum_{q=1}^{Q}\sum_{m=1}^{M} w_{qml}\, x_{qm}, \qquad l \in [1,L]; \]

Step II) Synaptic weight update:
\[ \Delta w_{qml} = \eta\, g(y_l)\, g'(y_l)\Big( x_{qm} - \sum_{i=1}^{l-1} w_{qmi}\, y_i - w_{qml}\, y_l \Big), \qquad q \in [1,Q],\ m \in [1,M],\ l \in [1,L]. \]
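A minimal sketch of one pass over these two steps is given below. The activation g and its derivative g' are passed in as parameters (the specific cumulative-gamma choice is derived in Sect. 5.4); the shapes and function name are ours, and LHL sphering of the input beforehand is assumed, as described below in this section.

import numpy as np

def nhl_update(W, x, g, g_prime, eta=0.01):
    """One NHL iteration (Step I then Step II) on a single interleaved
    spectro-temporal input x of dimension D = Q*M, with W of shape (L, D).
    A sketch only, not the thesis's exact implementation."""
    y = W @ x                                   # Step I: y_l = sum_qm w_qml x_qm
    gain = g(y) * g_prime(y)                    # nonlinear Hebbian factor
    for l in range(W.shape[0]):                 # Step II, neuron by neuron
        # x minus reconstructions from neurons 1..l-1 (inhibition)
        # and from neuron l itself (stabilization)
        resid = x - W[: l + 1].T @ y[: l + 1]
        W[l] += eta * gain[l] * resid
    return W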
Upon convergence, the representative independent signatures {y_l}_{l=1}^{L} are extracted, and the synaptic weight vectors are nonlinearly learned. Multiplying the weight vectors with the input data projects the data into its principal space. Projection is a linear operation, which does not affect the inherent data properties; thus the projected signatures can be used to represent the input data. Moreover, the basis vectors that span the representative feature space are the transpositions of the synaptic weight vectors. This relation is discussed under Convergence of NHL in Sect. 5.4.1.
From the perspective of statistical pattern recognition, the practical value of the NHL result is that it provides an effective technique for dimensionality reduction. As in Fig. 5.4, the dimension of the spectro-temporal features can be very high, e.g., 600 for Q = 30 frequency bins and M = 20 temporal frames. Such high dimensionality may cause very complex computation at the testing stage if the features are used directly as patterns. Besides, high-dimensional features may be less useful than truly representative features, as high-dimensional ones may easily be mixed with unrelated noises. To tackle the curse of dimensionality, NHL projects this messy high-dimensional representation R^{Q×M} into a low-dimensional space R^L (L ≪ Q × M). During the learning, useful features related to the signals of interest are extracted while unrelated ones are removed. Moreover, acoustic signals of interest generally cannot be grouped into just one centered cluster; there may exist multiple distributed clusters, since the signals of interest may be generated by multiple levels of sources. Hence, NHL is used so as to be able to capture multiple independent components (clusters).
In addition, the proposed algorithm is sped up by including a sphering step prior to NHL. In the present study we use LHL as the sphering step; this computation works on the first- and second-order statistics of the data. The full representation of the synaptic weights is thus the product of the sphering result and the nonlinear learning result, W = W_L W_N. After using LHL for sphering, the input features are shifted (based on the mean) and normalized (based on the variance).
Distribution of more than one Gaussian component
Although the proposed nonlinear Hebbian learning algorithm can extract more than one sub-Gaussian component, or more than one super-Gaussian component, it cannot extract more than one Gaussian component. The reason is as follows. In probability theory, if X_1 is a Gaussian random variable with mean μ_1 and variance σ_1^2, X_2 is a Gaussian random variable with mean μ_2 and variance σ_2^2, and X_1 and X_2 are independent, then X = X_1 + X_2 is also normally distributed, with mean μ_1 + μ_2 and variance σ_1^2 + σ_2^2. This can be proved by convolving the probability density functions.
Therefore, when we use nonlinear Hebbian learning for Gaussian component extraction, it can only find the mean and variance of the components' sum if there is more than one Gaussian component. For super-Gaussian or sub-Gaussian components, by contrast, the probability density functions are described by higher-order statistics (beyond the variance), and the statistics of a sum of such random variables generally do not follow the Gaussian closure property described in the previous paragraph. Hence nonlinear Hebbian learning, which explores higher-order moments of the data, can efficiently extract those independent components (sub-Gaussian or super-Gaussian). On the other hand, Gaussian components have probability density functions described by just the mean and variance (second order), and their linear mixtures are still Gaussian. Hence nonlinear Hebbian learning cannot extract more than one Gaussian component, just like other ICA algorithms.
5.3 Blind Source Separation Using NHL
When NHL is used on the example of Fig. 4.6 in Sect. 4.3, it can correctly find the two components, y_1^N and y_2^N. The nonlinear activation function is defined as in Sect. 5.4.1. The synaptic weight matrix is W_N = [1.6613 1.9878; 0.9487 2.6608]. Via W_N, the data is projected into the NHL-learned feature space, as illustrated in Fig. 5.6(b). As in Figs. 5.7(a)-(b), the extracted component y_1^N is uniform-distributed within [−2, 2], corresponding to s_1; and in Figs. 5.7(c)-(d), y_2^N is Gaussian-distributed with mean 0 and variance 1, corresponding to s_2. When the decomposing synaptic weight matrix multiplies the composing one, W_N A = [0.8813 −0.0052; −0.0185 0.9837] ≅ I. Then y = W_N x = W_N A s ≅ s.
[Scatter plots: the observation data in x_1-x_2 space and the NHL-learned results in y_1^N-y_2^N space.]
Figure 5.6: (a) The only given information is the observation data in x_1-x_2 space. The system does NOT assume any knowledge of the components NOR their composing property. (b) By optimizing higher-order statistics of the data, NHL finds the two components y_1^N, y_2^N. The NHL result is obtained by projecting the observation data using the synaptic weight matrix W_N.
Hence NHL, which involves higher-order statistical characteristics of the data, can extract the real independent components. A detailed comparison of linear Hebbian learning and nonlinear Hebbian learning is given in our recent study [LDBed].

LHL vs. NHL
The super-Gaussian-like statistics of acoustic sounds indicate why LHL is inadequate for extracting their principal components. Gaussian-distributed data is described by mean and variance, so LHL, which optimizes the second-order moment of the data, is a good tool for its principal component extraction. However, in real-world environments, other non-Gaussian data generally exists, which cannot be described by just mean and variance. Hence an approach such as NHL, which explores higher-order moments of the data, is necessary.
[Plots: the two NHL-demixed signals over discrete time and their histograms.]
Figure 5.7: (a)(b) The histogram of y_1^N is a uniform distribution within [−2, 2], corresponding to s_1. (c)(d) The histogram of y_2^N is a Gaussian distribution with mean 0 and variance 1, corresponding to s_2.
5.4 Nonlinear Activation Function
5.4.1 Derivation
The nonlinear activation function g(·) is critical for NHL. Generally, a smooth, monotonically increasing, invertible, bounded, and differentiable function [Oja95, SH95, Hyv7a, Hyv7b] can be used. On the other hand, which representative components are picked up depends not only on the activation function but also on the implicit signal distribution. NHL is unsupervised learning, which does not assume any knowledge of the component distributions or the composing property. Nevertheless, some prior knowledge about the acoustic data distribution is helpful. It is hypothesized that general acoustic sound is approximately super-Gaussian distributed, with a higher peak and a longer tail than a Gaussian distribution [BS95, RB96, LV05]. In order to provide more stable learning, it is better to choose an activation function that reflects the inherent properties of the data distribution. When the slope of the activation function is aligned with the high-density portion of the input distribution, the mutual information of input and output is optimized [BS95, RB96, LGS99, FB01, BMS02, Fio2a, Fio2b, BPR04, VV05, TK07]. This alignment transforms the input data into a range that raises the sensitivity of the synaptic weight adaptation; such a transformation can help escape early trapping in premature saturation. Moreover, it is better if the selected nonlinear function belongs to the maximum-entropy probability distribution family, or the exponential distribution family.
Considering the general requirements for an activation function, and regarding the implicit statistics of acoustic data, we formulate an activation function in the range [−1, 1],
\[
g(y) =
\begin{cases}
h(y), & y \ge 0,\\
-h(|y|), & y \le 0,
\end{cases}
\tag{5.4}
\]
where h(y) is the cumulative gamma distribution, which belongs to the super-Gaussian class:
\[
h(y) = \frac{\gamma(\alpha, \beta y)}{\Gamma(\alpha)}, \quad y > 0, \quad \text{where} \quad
\gamma(\alpha, \beta y) = \int_{0}^{\beta y} \tau^{\alpha-1} e^{-\tau}\, d\tau, \quad \text{and} \quad
\Gamma(\alpha) = \int_{0}^{\infty} \tau^{\alpha-1} e^{-\tau}\, d\tau.
\tag{5.5}
\]
Here α denotes the shape, and 1/β represents the scale and slope [CW69]. The derivative of h is
\[
h'(y) = \frac{y^{\alpha-1} \beta^{\alpha} \exp(-\beta y)}{\Gamma(\alpha)}.
\tag{5.6}
\]
Around y = 0 this activation function is continuously differentiable. In particular, the incomplete gamma function γ(α, βy) in h(y) can be Taylor expanded in terms of all-order polynomials of y [Gau79] (see Taylor expansion of activation function in Sect. 5.4.1). As a result, the nonlinear activation function involves all-order moments of the data and leads to a statistical optimization.
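Equations (5.4)-(5.6) map directly onto standard special functions: h is the regularized lower incomplete gamma function, and h' is the gamma density. Below is a sketch with illustrative α and β values; the thesis adjusts the slope during learning (Sect. 5.4.3), so these constants are assumptions.

import numpy as np
from scipy.special import gammainc       # regularized lower incomplete gamma
from scipy.stats import gamma as gamma_dist

def g(y, alpha=2.0, beta=1.0):
    """Odd-symmetric activation of eq. (5.4): sign(y) * h(|y|), with h the
    cumulative gamma distribution of eq. (5.5). alpha, beta illustrative."""
    return np.sign(y) * gammainc(alpha, beta * np.abs(y))

def g_prime(y, alpha=2.0, beta=1.0):
    """Derivative of g, i.e. the gamma density of eq. (5.6) evaluated at |y|."""
    return gamma_dist.pdf(np.abs(y), a=alpha, scale=1.0 / beta)

These two functions can be passed directly into the NHL update sketched in Sect. 5.2.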
Introducing a nonlinear activation function into the Hebbian rule results in exploring all-order moments of the data through the function nonlinearities. In general, the choice of the nonlinear function is not unique; practically any smooth, differentiable nonlinear function may be used to perform NHL [Hyv7a, Hyv7b, HO98]. In the present study, choosing this specific function rather than others is also based on the approximate input distribution. Using this function behaves like a probabilistic filter of the input data: it better associates the slope of the function with the high-density portion of the input distribution. This association can be shown to optimize the mutual information of input and output [BS95, RB96, LGS99, FB01, BMS02, Fio2a, Fio2b, BPR04, VV05].
5.4.2 Neurobiological Motivation
How should the nonlinear activation function in the nonlinear Hebbian learning be chosen? Not only do we need to consider the optimization criterion and the acoustic data properties, but we can also draw some intuition from a neuron's input-output transformation. Normally, the sigmoid function [BS95, LGS99] is used as the nonlinear activation function in artificial neural networks. The sigmoid function can be viewed as a smoothed variant of the classical neurobiological threshold neuron [Mit97, chap. 4]. If synaptic transmission is assumed to approximate one exponential process, the input-output transformation can be approximated by the commonly used sigmoid function.
The activation function selected in the present study is one step forward from the sigmoid function. Neurobiological studies have pointed out one important concept: the neural signal transformation involves multiple synaptic vesicle pools [ZR02, LYB6a, LYB7a]. The multiple vesicle pooling kinetics contribute to nonlinear synaptic dynamics. We have proposed a computational "pool" framework for studying nonlinear dynamic synapses [LYB6a, LYB7a], and presented this study in the qualifying exam. The pool framework results in a model composed of three vesicle pools in the presynapse, as illustrated in Fig. 5.8, with each pool having a vesicle release rate. The total vesicle release event is affected by facilitation and depression in a multiple-order fashion between the presynapse and the postsynapse.
[Diagram: reserve pool → (slow) readily releasable pool → (fast) immediately releasable pool → released pool, with recycling back to the reserve pool.]
Figure 5.8: The multiple vesicle pool framework. Immediately releasable pool: a fraction of apparently docked vesicles at active zones. Readily releasable pool: available for immediate replenishment of release sites that are vacated by exocytosis. Reserve pool: more distant vesicles that are unable to respond rapidly. Released pool: recycling ensures a constant supply of vesicles.
The experimental data was recorded from Schaffer collateral – CA1 synapses under different experimental conditions [SWMB04]. The proposed pool framework, using fewer parameters, predicts better than the basic additive dynamic synapse (DS) model [LB96]; the DS model is based on one vesicle pool in the presynapse, i.e., one exponential process. Numerical study showed that the pool framework is stable and can efficiently explore the dynamic filtering property of the synapse. The pool framework captures biological reality with regard to the signal communication between the pre- and postsynapse, while signal communication between neurons encodes the information of cognition and consciousness throughout the cortices.
It is expected that the pool framework can be employed as a basic computational unit to explore neuron nonlinearities in an artificial neural network. For example, we have used the pool framework as a nonlinear function in the dynamic synapse neural network in order to implement the task of speaker identification [LYB6b], and presented this study in the qualifying exam. On the TIMIT database, the dynamic synapse neural network achieved approximately 2 ∼ 5% improvement (the correlation between speakers decreases about 11%) over the commonly used acoustic signal recognition method, mel frequency cepstral computation plus a Gaussian mixture model.
When we choose a nonlinear function for the nonlinear Hebbian learning, we view it as a nonlinear transformation between neurons, which can reflect the underlying synaptic transformation. Since synaptic transmission is not a mono-exponential process, it is intuitive to consider more neurobiological findings. Within the pool framework, synaptic transmission is affected by multiple pools. If the vesicle release event in one pool is approximately an independent exponential process, the neural transformation involves multiple independent exponential processes. The integrated effect of multiple independent exponential processes can be approximated by one super-exponential (cumulative gamma) distribution [Pap84, pp. 103-104]. In this functional approximation, the trade-off between the exact computational pool framework and the computational load is considered.
Besides, this nonlinear activation function can be used as a matched filter that implicitly matches the underlying acoustic data distribution. It is generally known that acoustic data belongs to the super-Gaussian family, and the gamma distribution belongs to the super-Gaussian family; hence, using a gamma function to describe acoustic data is a good choice. Mutual information in the NHL method is maximized when the high-density part of the input is aligned with the slope of the activation function. This is based on the study by [Lau81], which found that a neuron's transformation function is matched to the approximate signal distribution. More details on the usage of the nonlinear function in the NHL are given in Sect. 5.4.3.
Taylor expansion of activation function
To show that the chosen nonlinear activation function involves all-order statistics of the data, a Taylor series can be used to expand it in terms of polynomials of y [Gau79]:
\[
h(y) = (\beta y)^{\alpha} e^{-\beta y} \sum_{n=0}^{\infty} \frac{(\beta y)^{n}}{\Gamma(\alpha + n + 1)},
\tag{5.7}
\]
where the lower incomplete gamma function γ(α, βy) is the primary function. Alternatively, a Taylor series can be used to expand the upper incomplete gamma function \(\Gamma(\alpha, \beta y) = \int_{\beta y}^{\infty} \tau^{\alpha-1} e^{-\tau}\, d\tau\) in terms of polynomials of y, and the Taylor-expanded function is then obtained from the relation
\[
h(y) = 1 - \frac{\Gamma(\alpha, \beta y)}{\Gamma(\alpha)}.
\tag{5.8}
\]
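As a quick numerical check of the expansion (5.7), a partial sum of the series can be compared with the closed-form regularized incomplete gamma function; the term count and test values below are arbitrary choices of ours.

import numpy as np
from scipy.special import gammainc, gammaln

def h_series(y, alpha, beta, terms=60):
    """Partial sum of the Taylor series (5.7) for h(y)."""
    x = beta * y
    n = np.arange(terms)
    # terms x^n / Gamma(alpha + n + 1), computed in the log domain for stability
    log_terms = n * np.log(x) - gammaln(alpha + n + 1)
    return x**alpha * np.exp(-x) * np.exp(log_terms).sum()

alpha, beta, y = 2.0, 1.5, 1.7
print(h_series(y, alpha, beta))      # series partial sum
print(gammainc(alpha, beta * y))     # closed form h(y) = gamma(a, by) / Gamma(a)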
Convergence of NHL
The present NHL stops changing the synaptic weights when two conditions are satisfied. One is
\[
\mathbf{x} = W^{T}\mathbf{y} = \sum_{l=1}^{L} y_l \mathbf{v}_l,
\tag{5.9}
\]
where v_l := [w_{l,1}, w_{l,2}, ..., w_{l,QM}]^T is the l-th column vector of W^T (the l-th row vector of W). This equation means the original input can be reconstructed via the projected signatures y and the vectors v_l. It can thus be stated that the weight matrix W transforms/projects signals into a representative feature space, which is spanned by the basis vectors v_l. Besides, as y = Wx, we can derive
\[
\mathbf{x} = W^{-1}\mathbf{y}.
\tag{5.10}
\]
Combining eqs. (5.9) and (5.10), we obtain
\[
W^{T} = W^{-1},
\tag{5.11}
\]
or equivalently
\[
W^{T} W = I.
\tag{5.12}
\]
Equation (5.12) is actually the constraint in (5.2).
Deriving from (5.7), the learning also converges when
\[
e^{-2\beta y} \sum_{i,j} c_{ij}\, y^{i} y^{j} = 0, \qquad i, j = 0, 1, 2, \ldots,
\tag{5.13}
\]
where c_{ij} is a Taylor expansion coefficient. This condition involves all cross-moments. The fading factor e^{−2βy} and the LHL sphering of the data help the convergence.
5.4.3 Interpretation for Slope of Activation Function
Mutual information is maximized when the high-density part of the input is aligned with the slope of the activation function. This is based on the study by [Lau81], which found that a neuron's input-output function is matched to the approximate signal distribution. From Fig. 5.9, we can see that after the nonlinear mapping, the distance between the outputs is large.
[Sketch: inputs x_1, x_2, x_3 in the high-density part of the data distribution are mapped to well-separated outputs g(x_1), g(x_2), g(x_3).]
Figure 5.9: Mutual information is maximized when the high-density part of the input is aligned with the slope of the activation function. The distance between outputs is large after the nonlinear mapping.
How to choose the shape of the activation function is important. We need to consider the trade-off between the convergence of the learning procedure and the robustness against outliers; outliers are data that do not belong to the class of interest. Regarding convergence, the slope of the activation function must be sharp enough to project the independent components within a centered cluster [SH95]. However, the slope of the activation function cannot grow too fast, otherwise the rejection rate for outliers is low [Oja95, Hyv7a, Hyv7b, HO98]. In the present study, the slope of the activation function is adjusted in accordance with the learning speed of the synaptic weights. Outlier robustness also sheds light on why we use neuron y as the real output in the present study. Neuron z tries to cluster data toward the 1 or −1 extreme ends of the range [−1, 1] [SH95], which is excellent for multi-class data clustering. But data that do not belong to the class of interest are also centered around 1 or −1, which is not good for one-class pattern recognition and outlier rejection. There is thus an important trade-off between convergence and outlier rejection. A steep slope favors learning convergence: as illustrated in Fig. 5.10(a), the signal of interest x and the noise n are farther separated. On the other hand, the slope cannot grow too fast, or the rate of outlier noise rejection is low: in this case, shown in Fig. 5.10(b), the signal of interest x and the noise n are not easily separated.
5.5 Discussion of the Present NHL vs. Other ICAs
In order to capture essential structure that best represents data, generally data
can be viewed as being composed of several original independent sources, and
then these sources are linearly mixed (or composed) in an unknown way. Due to
the mixing, the given signals are dependent with each other, as in the example,
the given signal x
1
has information of sources s
1
and s
2
, and so does the given
72
a b
x
) (n g
Data distribution
n
Proper
slope
) (x g
) (n g
Data distribution
n
Too
steep
slope
x
) (x g
Figure 5.10: (a) The steep slope can favor leaning convergence. Signal of interest
x and noise n are farther separated. (b) On the other hand, the slope cannot grow
too fast, as the rate of outlier noise rejection is low. Signal of interest x and noise
n are not easy to be separated. The positions of x, n and the shape of data are
the same in (a) and (b).
signal $x_2$. The more statistically independent the extracted components are, the better the given signals are separated (or decomposed) into the original independent sources. For blind source separation, one of the major approaches [Com94, Hay98] uses independence as the criterion: the mutual information between outputs should be minimized. Mutual information measures the amount of information each signal contains about the other. This method needs to assume or estimate the marginal distribution of the data.
The second major approach is infomax, which maximizes the information conveyed to the output about the input; it includes maximum likelihood and maximum entropy. Entropy is a measure of the average amount of information conveyed per signal. There are several representative works on maximum likelihood [PGJ92, PG97, Car97]; they need to assume or estimate the source distribution. Representative works on maximum entropy are [BS95, LGS99, LLS00]; they assume that the nonlinear activation function perfectly matches the inherent data distribution.
It has been proven that the independence and infomax approaches are equivalent under the ideal condition of perfect decomposition [Car98, Hay98, HO00]. But these approaches need to assume or estimate the unknown data distribution. If the estimated distribution is the same as or similar to the real distribution, there is no problem; if not, the equations derived from the estimated distribution accumulate error. Moreover, it is generally hard to assume or estimate the exact data distribution. In addition, the inverse of the weight matrix computed at each iteration in these methods can be ill-conditioned if the matrix is singular or sparse.
How can estimating the exact data distribution be avoided? An approximation of negentropy was proposed [Oja97, HO98, HO00]. Assume there are non-Gaussian sources before mixing. By the central limit theorem, the distribution of a sum of independent random variables (even non-Gaussian ones) tends toward a Gaussian distribution. During unmixing, the less Gaussian the extracted component is, the more of the original source information is retrieved. Negentropy measures the non-Gaussian component in the data; specifically, it measures the entropy distance between the non-Gaussian component and a Gaussian component. By approximating negentropy, a nonlinear function is properly selected as the criterion. In this method, the nonlinear function must be an even function; otherwise there is no stable convergence. In the algorithm derivation, the energy-norm constraint on the weight vectors in the ICA learning equation is not complete, or an extra learning step for the orthogonality constraint on the weight vectors is needed.
Another ICA (or NHL) method [SH95] directly used an asymptotic variance, the second-order moment of a nonlinear function, to replace the variance of the data. But in that method the selected nonlinear function was the opposite of the inherent data distribution: the outlier noise rejection is poor, and the nonlinear mapping of the data loses information.
Advantages of the present NHL
We considered the above issues in other ICA algorithms and tried to deal with them in the present NHL, as described in this chapter.
Firstly, instead of directly using a nonlinear function, it can be proven that the properly selected criterion $E\{g^2(y)\}$ with a proper nonlinear function $g(y)$ approximates the data entropy. Intuitively, infomax or maximum entropy means maximum data information. Data information can be described by moments of the data, and if the data has non-Gaussian information, then higher-order moments of the data are necessary. Instead of using a sum of higher-order moments of the data, which is computationally complicated, sensitive to outlier noise, and mainly measures the tails of the distribution, a properly selected nonlinear function keeps the operation simple. That certain nonpolynomial functions approximate the maximum entropy can be proven using Boltzmann theory (related to the maximum-entropy probability distribution) and Taylor expansion [JS87, Hyv98]. The exponential distribution family is related to the maximum-entropy probability distribution family. We have selected the cumulative gamma distribution as the nonlinear function, and this selected function belongs to the exponential distribution family.
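A minimal sketch of this choice of nonlinearity is shown below: $g(y)$ is the cumulative gamma distribution (the regularized lower incomplete gamma function). The shape and scale values are illustrative assumptions, not the thesis's trained parameters:

```python
import numpy as np
from scipy.stats import gamma

shape, scale = 2.0, 1.0              # assumed gamma parameters
y = np.linspace(0.0, 8.0, 9)         # example grid of linear neuron outputs

g_y = gamma.cdf(y, a=shape, scale=scale)   # g(y): monotonic, bounded in [0, 1]
print(np.round(g_y, 3))
```

Because $g$ is a cumulative distribution function, it is monotonic, invertible (via `gamma.ppf`), bounded, and differentiable; its derivative $g'(y)$ is simply the gamma density, which gives the slope in closed form for the Hebbian weight update.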
Secondly, the criterion $E\{g^2(y)\}$ of the present NHL method is a simple operation. There is no need to estimate the exact data distribution, no matrix inverse computation at each iterative step, and no need to compute the summation of all higher-order moments of the data. During the learning process, the constraint on the weight vectors is learned together with the criterion via the Lagrange multiplier algorithm, so there is no need for an extra learning step for the weight vectors. There is also no information loss due to an improper nonlinear mapping at each iterative step, and the storage requirement of the present NHL method is not strict.
Moreover, the selected function belongs to the super-Gaussian family and implicitly matches the underlying acoustic data distribution, since most acoustic data is super-Gaussian distributed. The selected function is also motivated by synaptic transmission between neurons. In addition, the selected nonlinear function is monotonic, invertible, bounded, differentiable, and insensitive to outlier noise. We also give an intuitive neurobiological interpretation for each term in the present NHL equation. Most importantly, the present NHL can demix (or decompose) independent sources (or components) that are mixed in random directions (orthogonal or non-orthogonal).
Generally, a good method includes a correct objective function and an efficient optimization algorithm. We connect the objective function to the approximated maximum entropy, which determines the statistical properties, and then use the Lagrange multiplier algorithm with computational load and storage requirements in mind.
In addition, whitening is a good preprocessing step before NHL or other ICAs, since it speeds up and simplifies the difficult blind separation problem. Without whitening, the NHL unmixing matrix is not orthogonal; after whitening, the NHL unmixing matrix can be proven orthogonal [HO00]. The overall unmixing matrix is the product of the whitening matrix and the NHL unmixing matrix, and hence is not orthogonal even when the NHL unmixing matrix is. It is therefore reasonable to say that the non-orthogonal overall unmixing matrix makes it possible to invert a non-orthogonal mixing matrix, which indicates that the present method is not constrained to orthogonal mixing problems.
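A minimal whitening sketch under the usual assumptions (zero-mean data, whitening matrix $V = D^{-1/2}E^T$ from the eigendecomposition of the covariance); the data here is synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))  # mixed data
X = X - X.mean(axis=0)                       # center

C = np.cov(X, rowvar=False)                  # sample covariance
eigvals, E = np.linalg.eigh(C)               # C = E diag(eigvals) E^T
V = np.diag(1.0 / np.sqrt(eigvals)) @ E.T    # whitening matrix

Z = X @ V.T                                  # whitened data
print(np.round(np.cov(Z, rowvar=False), 2))  # approximately the identity
```

Any orthogonal matrix applied after $V$ leaves the covariance white, which is why the unmixing matrix learned on the whitened data can be taken orthogonal while the overall unmixing matrix (the product with $V$) generally is not.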
Chapter 6
Project I: Noise-Robust Acoustic Signal Recognition Results
6.1 Real-World Acoustic Signal
Acoustic signature recognition of moving vehicles has attracted increased attention recently. This acoustic signature recognition is intended for integration into a larger security context. It is normally assumed that there is a fixed asset to protect and a perimeter that defines the vicinity around that asset for surveillance. An automatic vehicle recognizer is necessary because providing security with human guards is dangerous or expensive. The sound recognizer for incoming vehicles is developed for perimeter protection at national, agricultural, airport, prison, and military sites, and in residential areas. For instance, the recognizer can be used to detect approaching vehicles that may be loaded with explosives or suicide bombers.
The acoustic sound of interest from a moving vehicle is complicated and affected by multiple factors, such as vehicle type, gearing, number of cylinders, muffler choice, state of maintenance, moving speed, distance from the microphone, tires, and the road on which the vehicle travels. Moreover, the problem is complicated by the presence of uncontrolled interference emitted by the surrounding background, such as human voice, bird chirp, and wind. Real-world acoustic recognition of moving vehicles is thus very challenging.
Figure 6.1: An illustration of the vehicle recognition environment. Vehicles loaded with dangerous weapons may approach protected or restricted areas. The microphone and recognizer should be able to detect the approaching vehicles and immediately release alerts. Around the protected areas, many uncontrolled background sounds may exist, such as human voice, bird chirp, and wind.
Recently, several studies [CKGM96, MMSW97, Liu99, WSK99, Mun04, AZRS07] have addressed the acoustic detection of moving vehicles. It is not easy to give a unified comparison among these studies, as their databases and testing environments differ significantly. Generally in acoustic signal processing, extracting representative features plays an important role in characterizing the unknown signature of moving vehicles. [CKGM96], [MMSW97], and [AZRS07] applied wavelet-based analysis for feature extraction from incoming waveforms. [WSK99] and [Mun04] used the short-term Fourier transform (STFT) to provide a precise representation of the acoustic data, and then used linear principal component analysis to convert the high-dimensional STFT features to low-dimensional ones. In addition, [Mun04] proposed a reliable probabilistic recognizer based on both a principal subspace and a complementary subspace, and compared his method with the baseline method, MFCC.
The purpose of the designed recognizer is to detect an approaching vehicle and identify its type with minimum error rates. The acoustic data of moving vehicles is recorded from one microphone; the purpose of using one microphone is to analyze the acoustic signatures of moving vehicles rather than to track or localize vehicles (which normally requires an array of microphones). We have studied the detailed perimeter security problem and the application of artificial intelligence this year [LDB8d].
We recorded four types of vehicle sounds: gasoline light wheeled car, gasoline heavy wheeled car, diesel truck, and motorcycle. The road conditions were paved and sandy. The microphone was set 2∼5 feet away from the road edge, at a height of 0∼6 feet. Twenty runs of each vehicle type are used for training, where each run means the vehicle comes, passes, and leaves the fixed microphone; another 20 runs of each vehicle type are used for testing. For the non-vehicle class, white noise, human voice, and bird chirp are tested; these three sounds are used to test whether the proposed system correctly rejects them when there is no vehicle mixed with them. The window size is 20 ms with 10 ms overlap, the sampling rate is 22,050 Hz, and the gammatone spectral range is 50∼11,025 Hz. The number of filterbanks Q = 30 is selected in order to cover enough high-frequency subbands within this spectral range. The number of consecutive feature vectors M = 20 corresponds to a 200 ms ((20−10)×20 = 200) temporal duration. L = 20 is chosen based on a coarse estimate of the number of dominant signatures. Similar to the method of [JG98, LV05], which computes histograms of speech signals, it can be shown that vehicle sound is super-Gaussian distributed: the kurtosis of vehicle sound is 3.09, which is greater than the Gaussian kurtosis of 0.
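A minimal sketch of this super-Gaussianity check; SciPy's `kurtosis` uses the Fisher (excess) definition, for which a Gaussian scores 0, matching the convention above. The waveforms here are synthetic stand-ins, not the recorded vehicle data:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
vehicle_like = rng.laplace(size=100_000)   # stand-in for a vehicle waveform
gaussian = rng.standard_normal(100_000)

print(f"super-Gaussian stand-in: {kurtosis(vehicle_like):5.2f}  (> 0)")
print(f"Gaussian reference:      {kurtosis(gaussian):5.2f}  (~ 0)")
```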
Using LHL for sphering, the input features are shifted by the mean and normalized by the variance. To provide a stable NHL, the learning rate is set to 0.01 for the first 10 iterations and is then decreased by a factor of 0.7 every 10 iterations. Convergence is declared when the synaptic weight vectors are orthogonal and the synaptic weight change is less than $10^{-4}$ over two consecutive iterations. The synaptic weight change along the learning iterations is plotted in Fig. 6.2; the synaptic weights converge after about 400 iterations. The convergence time on an Intel Core 2 Duo processor at 1.8 GHz is 50 min. NHL is run for 5 trials, each trial using a different set of initial synaptic weights, and the trial that provides the maximal variance of the nonlinear output $E\{g^2(y)\}$ is chosen as the result.
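A minimal sketch of this learning-rate schedule and stopping test (the function and variable names are illustrative):

```python
def learning_rate(iteration, eta0=0.01, decay=0.7, step=10):
    """0.01 for the first 10 iterations, then multiplied by 0.7 every 10."""
    return eta0 * decay ** (iteration // step)

for it in (0, 9, 10, 25, 100, 400):
    print(f"iteration {it:3d}: eta = {learning_rate(it):.2e}")

# Stopping test on a weight update dw (a NumPy array), checked over two
# consecutive iterations:
#   converged = abs(dw).max() < 1e-4
```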
Figure 6.2: Synaptic weight change along the learning iterations.
The recognizer works in two applications. First, as illustrated in Fig. 6.3, it recognizes urban vehicles (generalized urban vehicles in city environments) and rejects non-vehicles (human voice, bird chirp, and wind), which is similar to human speech recognition. Second, as in Fig. 7.1, it decides which type the vehicle is: gasoline light wheeled car, gasoline heavy wheeled car, diesel truck, or motorcycle, which is similar to human speaker identification.
6.2 Noise-Robust Acoustic Signal Recognition Results
The comparison metric is the error rate, which is 1 − performance. MFCC has been viewed as a baseline spectral feature extraction technology, and we compare the performance of the proposed algorithms with MFCC. We also compare the proposed system with the method used by [WSK99] and [Mun04], linear principal component analysis (or, equivalently, LHL). We have reported the compared results for vehicle recognition recently [LDB8a].
Figure 6.3: Decision tree for vehicle recognition (sound waveform → urban vehicle or non-vehicle).
For real-world testing, there are many unexpected noises emitted by the surrounding background. Both vehicle sounds and noises may be co-sensed by a microphone, so the incoming data are mixtures of vehicle sounds and noises. This refers to the concept of a mixture of signals of interest and noises in Sect. 4.2. The proposed system does not assume any knowledge of the noise sources or the mixing properties, but it can project noisy data into the feature space of vehicle sounds, in which the noises are weakened. To mimic the situation when incoming signals are mixtures of vehicle sounds and other noises, clean vehicle data is mixed with either white or colored noise at various SNRs. The vehicle sound we recorded in low background noise has an SNR of 20∼30 dB, where the SNR is the ratio of the averaged signal (vehicle) energy to the averaged noise energy. We then analyze −20∼20 dB as noisy environments. Given the averaged vehicle energy and an SNR value, for example 0 dB, we reversely compute the required averaged noise energy, $P_n$. For white noise, the averaged energy is its variance. For human voice, we compute the averaged energy of the selected human voice waveforms, $P_h$. There is a scalar difference between $P_n$ and $P_h$, so we multiply the selected human voice waveforms by the square root of this scalar ratio; the scaled waveforms then have averaged energy $P_n$, and we obtain the desired SNR value, 0 dB. The SNR computation for the bird chirp test is similar. The noise-corrupted vehicle waveforms are mixtures of vehicle data and noises. During pattern recognition, the recognizer does not assume any knowledge of the noise source or of how it is mixed with the vehicle data; the results therefore depend on how accurately the extracted independent features of the signals of interest hold up against the noises.
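A minimal sketch of this mixing procedure (the waveforms and names are illustrative stand-ins for the recorded data):

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the target SNR, then add it."""
    p_signal = np.mean(signal ** 2)                  # averaged vehicle energy
    p_n = p_signal / 10.0 ** (snr_db / 10.0)         # required noise energy P_n
    p_h = np.mean(noise ** 2)                        # current noise energy P_h
    return signal + noise * np.sqrt(p_n / p_h)       # scaled-noise mixture

rng = np.random.default_rng(0)
vehicle = rng.standard_normal(22050)   # stand-in for 1 s of vehicle sound
voice = rng.standard_normal(22050)     # stand-in for a human-voice waveform
noisy = mix_at_snr(vehicle, voice, snr_db=0.0)   # 0 dB mixture
```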
First, without knowledge of the noise source or mixing properties, we test how well the proposed system attenuates AWGN effects on vehicle sounds. Recognition results are given in Fig. 6.4. At the low SNR of 0 dB, LHL has an error rate of 32%, while NHL decreases it to 1.7%, a 30.3% improvement. When SNR ≥ 0 dB, the error rate of the proposed system stays low at 0.7∼1.5%. This near-plateau indicates that NHL has effectively attenuated the noise effects on the signals of interest. The other algorithms perform well only at very high SNRs: MFCC and GTF when SNR > 16 dB, STR when SNR ≥ 10 dB, and LHL when SNR ≥ 8 dB.
Figure 6.4: Error rate (%) vs. SNR (dB) when vehicle data is mixed with AWGN, for MFCC, GTF, GTF+STR, GTF+STR+LHL, and GTF+STR+NHL.
Second, we test the noise robustness of the proposed system when vehicle sounds are corrupted by an unknown colored noise, human vowel voice. Different vowels with various spectra are mixed with the vehicle sounds over time. In Fig. 6.5, at the very low SNR of −10 dB, the LHL error rate is 30%, while the NHL error rate is 5%, a 25% improvement. When SNR ≥ −10 dB, the error rate of the proposed system stays low at 2.5∼5%. NHL is more efficient at eliminating the human vowel effect on the signals of interest. The other algorithms perform well only at higher SNRs: MFCC, GTF, and STR when SNR ≥ 20 dB, and LHL when SNR > 16 dB. Compared with Fig. 6.4, NHL shows more robustness at low SNRs in the human vowel test than in the AWGN test. The implicit reason could be that human vowels and vehicles have some non-overlapping frequency subbands, even though they certainly share overlapping subbands, so human vowels can be attenuated more strongly in the feature space of vehicle signals via NHL. AWGN, in contrast, occupies all frequency subbands, so in every frequency subband that vehicle data dominates there is always an AWGN effect.
Figure 6.5: Error rate (%) vs. SNR (dB) when vehicle data is mixed with human vowel noise, for the same methods as in Fig. 6.4.
Next, the proposed system is tested against unknown bird chirp noise, another colored noise that often exists in normal environments. Various bird chirps from the dataset [Gau07] are mixed with the vehicle sounds. In Fig. 6.6, at the very low SNR of −10 dB, the LHL error rate is 36% while the NHL error rate is 3%, a 33% improvement. When SNR ≥ −10 dB, the error rate of the proposed system stays low at 0.9∼3%. The other algorithms perform well only in a higher SNR range: MFCC and GTF when SNR ≥ 10 dB, STR when SNR ≥ 2 dB, and LHL when SNR ≥ 0 dB. This low error-rate plateau of the proposed system again indicates that NHL has efficiently attenuated the bird chirp effects on the vehicle signals.
Figure 6.6: Error rate (%) vs. SNR (dB) when vehicle data is mixed with bird chirp noise, for the same methods as in Fig. 6.4.
6.2.1 Noise Robustness Analysis
AWGN, human vowel, and bird chirp influence
Comparing the results in Figs. 6.4-6.6, the noisy influence of AWGN, human vowel, and bird chirp differs. When vehicle data is mixed with human vowel or bird chirp, NHL dramatically decreases the error rate, even at very low SNRs. The underlying reason may be that vowel voice or bird chirp has some frequency components that vehicle sound does not have in its learned feature space; hence NHL, which involves all-order statistics of the data, can separate the vehicle components from the noisy ones independently and achieve low error rates. The other algorithms, including LHL, cannot reveal the properties of the independent components and have much higher error rates. The error-rate tendency is different when vehicle data is mixed with AWGN: the spectral components of AWGN are spread over all frequency subbands, so AWGN affects every frequency subband that vehicle data dominates, and it is to be expected that the recognition results are worse in this test. The error-rate curve of NHL is more like a decreasing slope than a plateau with increasing SNR, but when SNR > 0 dB, NHL still offers very robust performance with error rates of 0.7∼1.5%. This implies that NHL has effectively eliminated the noise when SNR > 0 dB.
With and without NHL
Moreover, we analyze how the proposed system works when vehicle data is mixed with human vowel sound at SNR = −3 dB. The Earth Mover's distance (EMD) is used to compute the distance between two patterns (each pattern is described by a mean vector and a variance vector). Without NHL, the EMD between the tested noisy vehicle pattern and the trained vehicle pattern is 1592.5. With NHL, the distance between the noisy data and the trained vehicle pattern is 186.5, which is less than the threshold of 200. This indicates that, via the trained decomposing matrix, the noisy data is transformed into the representative space of the vehicle and the vehicle-related components are successfully extracted, while the noise components are attenuated. We chose the decision threshold of 200 based on the distances between many patterns with and without noise effects.
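A schematic sketch of this threshold decision (not the thesis's exact EMD implementation): SciPy's one-dimensional Wasserstein distance over the pattern mean vectors stands in for the EMD on (mean, variance) patterns, and the pattern values below are illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

THRESHOLD = 200.0                          # decision threshold from the text

def is_vehicle(tested_pattern, trained_pattern):
    """Accept the test pattern as 'vehicle' if its distance is below threshold."""
    d = wasserstein_distance(tested_pattern, trained_pattern)
    return d < THRESHOLD, d

rng = np.random.default_rng(0)
trained = rng.normal(10.0, 5.0, size=20)        # stand-in trained pattern means
tested = trained + rng.normal(0.0, 1.0, 20)     # stand-in noisy test pattern
accept, d = is_vehicle(tested, trained)
print(f"distance = {d:.1f} -> {'vehicle' if accept else 'reject'}")
```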
Chapter 7
Project II: Noise-Robust Acoustic Signal Identification Results
As illustrated in Fig. 7.1, if the incoming waveform is recognized as a vehicle, we then need to decide what type of vehicle it is. The sounds from various types of vehicles may be generated by some common factors (such as moving speed, road condition, and muffler choice) and are thus highly correlated with each other, which makes vehicle identification difficult. For a vehicle coming toward, passing by, and leaving a microphone, we call these activities one cycle. The recognition duration for one cycle can last 4∼9 seconds; each STR is 200 ms, and one identification result is given per STR. If we choose the recognition duration as 4 seconds, 4/0.2 = 20 identification results are obtained for one cycle. Similar to speaker identification, which makes a robust decision on one speaker if that speaker is recognized in most STRs within a period of time, we claim one vehicle type if this type has more confirmed results than the others within the recognition duration (a sketch of this voting rule follows). Vehicle identification is likewise analyzed when the vehicle data is corrupted by unknown noises. We have reported the compared results for vehicle type identification recently [LDB8b].
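A minimal sketch of the voting rule just described (the labels and counts are illustrative):

```python
from collections import Counter

def identify(per_str_decisions):
    """Majority vote over the per-STR type decisions for one cycle."""
    vehicle_type, votes = Counter(per_str_decisions).most_common(1)[0]
    return vehicle_type, votes

# 4 s recognition duration / 0.2 s per STR = 20 decisions for one cycle:
decisions = ["diesel"] * 12 + ["motorcycle"] * 5 + ["gasoline light"] * 3
print(identify(decisions))   # ('diesel', 12)
```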
During the NHL procedure, one representative feature space and one projection synaptic weight matrix are learned for each type of vehicle; four feature spaces and four weight matrices are thus learned to represent the patterns of the four types, as illustrated in Fig. 7.2. Maximum likelihood is then used to decide which type the vehicle is. The identification results during testing are listed in Tables 7.1-7.3
Figure 7.1: Decision tree for vehicle identification (sound waveform → urban vehicle → diesel truck, motorcycle, gasoline light wheeled, or gasoline heavy wheeled).
when vehicle data is corrupted by unknown AWGN, human vowel, and bird chirp, respectively. The performance of LHL and NHL is compared. From these tables, we can see that NHL generally achieves better performance than LHL. For example, when the gasoline heavy wheeled car is mixed with AWGN at SNR = 5 dB, NHL improves the performance by 40% over LHL; when the diesel truck is mixed with human vowel at SNR = 0 dB, NHL improves the performance by 25% over LHL. Overall, the proposed system offers very robust identification results; for example, at SNR = 5, 10 dB, the performance for diesel truck and motorcycle is 95∼100%. At very low SNR = −5, 0 dB, the performance degrades, but the system is still in a workable state.
Table 7.1: Identification error rates when vehicle data is mixed with AWGN
Gasoline light Gasoline heavy Diesel truck Motorcycle
wheeled wheeled
LHL NHL LHL NHL LHL NHL LHL NHL
SNR=-5 dB 65% 55% 55% 50% 45% 35% 40% 35%
0 dB 60% 40% 50% 30% 30% 15% 25% 15%
5 dB 60% 20% 50% 10% 25% 5% 20% 5%
10 dB 50% 15% 45% 5% 20% 5% 20% 0%
Figure 7.2: For vehicle type identification, one representative feature space and one projection synaptic weight matrix are learned for each type of vehicle. Each of the four parallel branches performs independent component extraction: unknown components s are synthesized by an unknown generator and environment into the observation vector x = As, and decomposition and dimensionality reduction give the output vector y = Wx for the pattern recognition of one vehicle type (I-IV). The final decision is made by maximum likelihood.
Table 7.2: Identification error rates when vehicle data is mixed with human vowel utterances
Gasoline light Gasoline heavy Diesel truck Motorcycle
wheeled wheeled
LHL NHL LHL NHL LHL NHL LHL NHL
SNR= -5 dB 65% 55% 55% 60% 55% 50% 55% 50%
0 dB 50% 40% 50% 35% 45% 20% 40% 20%
5 dB 40% 15% 30% 10% 25% 5% 20% 5%
10 dB 30% 10% 25% 5% 15% 5% 15% 0%
Table 7.3: Identification error rates when vehicle data is mixed with bird chirps
Gasoline light Gasoline heavy Diesel truck Motorcycle
wheeled wheeled
LHL NHL LHL NHL LHL NHL LHL NHL
SNR= -5 dB 50% 40% 50% 30% 40% 20% 35% 20%
0 dB 35% 20% 30% 10% 25% 10% 25% 10%
5 dB 15% 15% 15% 10% 10% 5% 10% 5%
10 dB 10% 15% 5% 5% 5% 0% 0% 0%
Chapter 8
Noise-Robust Real-Time Field Testing
8.1 Objectives
For on-site field detection of approaching vehicles, the detector should process incoming waveforms in real time. Moreover, because many uncontrolled noises are emitted by the surrounding environment, the detector should be robust against them. These two requirements are considered when we convert the theoretical nonlinear-Hebbian-learning system into a practical detector. Various parameters are adjusted and detailed schemes are tested; with the parameters and schemes fixed, the designed practical detector performs better than other existing works. No matter which set of parameters or schemes is selected, the practical detector must satisfy the following three requirements.
(1) Real-time: The software processing time of the practical detector is less than or equal to the online data acquisition length.
(2) Noise robustness: When vehicle sound and surrounding noises are emitted at the same time, the practical detector can recognize the vehicle sound even if it is severely noise-corrupted.
(3) Low false alarm rate: The practical detector can reject noises that sound very similar to vehicle sound.
8.2 Overview of Hardware and Software
The system makes use of the CZM microphone to convert acoustic sound waveforms in the outdoor environment into electrical signals. Pre-amplifiers are used to amplify the microphone-recorded signals; the gains of the pre-amplifiers are adjustable, either manually or automatically in software. At the hardware-software interface there is a micro-controller A/D. The output of the pre-amplifier is sampled at 22,050 Hz, and the on-line recorded data is framed with a window size of 400 ms and an overlap of 100 ms. The acoustic signature recognizer of moving vehicles is an efficient parallel processor, which guarantees real-time detection and continuous data acquisition. Based on the maximum likelihood metric, the system decides whether there is an arriving vehicle and which class the vehicle belongs to. Positive detection results are sent to the command center, which slews the camera toward the range where the microphone sensed the positive signal. Details of each block are described below.
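A minimal sketch of the on-line framing just described (22,050 Hz sampling, 400 ms windows, 100 ms overlap, so the hop is 300 ms); the input array is a stand-in for the streamed pre-amplifier samples:

```python
import numpy as np

FS = 22050
WIN = int(0.400 * FS)   # 8820 samples per window
HOP = int(0.300 * FS)   # window minus overlap = 6615 samples

def frames(samples):
    """Yield successive overlapping analysis frames."""
    for start in range(0, len(samples) - WIN + 1, HOP):
        yield samples[start:start + WIN]

stream = np.zeros(5 * FS)              # stand-in for 5 s of recorded audio
print(sum(1 for _ in frames(stream)))  # number of frames in 5 s
```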
Microphone
Microphones distributed in the protected area form an array, and each microphone covers a surrounding radius of 50 to 300 feet. The microphone converts acoustic sound waveforms into electrical signals; these signals are processed by software and a real-time decision is made. Upon a positive decision, the camera slews toward the range where the microphone sent the positive signal.
Amplifier unit
The amplifier is capable of adjusting its gain value.
A/D card
Figure 8.1: The practical system includes several hardware units and real-time software. On-site sensor: CZM microphone → amplifier unit → A/D and on-line data framing → acoustic signature recognition of running vehicles → decisions (is there an approaching vehicle; which class of vehicle) → recognition results sent through a wireless transmitter. Command center: wireless receiver → received recognition results → slew the camera toward the positively detected field and create a log and video record.
The A/D card converts continuous-time signals to discrete time at a sampling rate of 22,050 Hz.
Wireless transmitter
The wireless device connects the on-site sensors with the remote command center and sends positive results to the command center, which then immediately activates the camera for image recording.
Software: Algorithms
As illustrated in Fig. 1.4 in Sect. 1.4, all the algorithms are proposed for real-time acoustic signal recognition. These algorithms, GTF, STR, and NHL, are biologically motivated auditory signal processing and nonlinear neural learning methods.
Linear Hebbian learning for variance normalization
Linear Hebbian learning is used to normalize the variances of the feature vectors; in this way, variance influences on the dominant elements are attenuated. On the other hand, some variances calculated by LHL are very small, and when these values are used for normalization their corresponding elements become very large and make the system unstable. It is therefore necessary to normalize only by the dominant variances and leave the others unchanged. By LHL convergence theory, these variances are eigenvalues of the data correlation matrix and their corresponding basis vectors are eigenvectors. If a variance (eigenvalue) is dominant, its corresponding eigenvector is also dominant and satisfies the unit-norm condition $w^Tw = 1$. We use this condition in practical real-time processing: when the LHL-learned synaptic weight vector has unit norm, the extracted variance is dominant.
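A minimal sketch of this selective normalization (the array names, tolerance, and demo values are illustrative):

```python
import numpy as np

def normalize_features(features, weights, variances, tol=1e-3):
    """Normalize only the elements whose LHL weight vector is unit-norm."""
    out = features.copy()
    for i, (w, var) in enumerate(zip(weights, variances)):
        if abs(w @ w - 1.0) < tol:          # w^T w = 1: dominant variance
            out[:, i] = features[:, i] / np.sqrt(var)
        # otherwise leave the element unchanged (variance too small)
    return out

rng = np.random.default_rng(0)
F = rng.standard_normal((100, 2))
W = [np.array([0.6, 0.8]), np.array([0.1, 0.1])]   # unit-norm vs. not
print(normalize_features(F, W, variances=[4.0, 1e-8]).std(axis=0).round(2))
```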
Lowest frequency choice for gammatone filters
Just as 133 Hz is selected as the lowest filterbank frequency for speech recognition, we determined which lowest-frequency value is best for vehicle sound detection. The lowest-frequency value affects the definition of the filterbanks, and hence the covered range and magnitude of the spectrum. Filterbanks with different lowest-frequency values provide different spectral features, some of which match the vehicle spectrum better than others and therefore extract better spectral information. We tested lowest-frequency values of 0, 30, 50, 100, 150, 200, and 250 Hz. For each value, the whole system is processed and a set of trained patterns is obtained. These trained patterns provide almost the same recognition results when vehicle data in normal circumstances is analyzed, but when vehicle data is mixed with AWGN at SNR = 0 dB, their responses differ markedly: the patterns trained with a lowest-frequency value of 30 Hz provide better performance than the others. The recognition is more accurate, and the detection range is about 50∼300 feet.
8.3 Field Testing
The online processing time is about 100 ms, while the one-time data acquisition length is 300 ms (window size 400 ms minus overlap size 100 ms). This fast real-time processing means the recognizer can immediately provide an early alarm, and no data buffering is needed.
A circumstance of real-time vehicle detection is illustrated in Fig. 8.2: (a) when a vehicle is approaching and (b) when the vehicle is passing by the microphone. The designed recognizer provides alerts immediately in both cases, so the command center has an early alarm even in the case of Fig. 8.2(a).
Figure 8.2: A circumstance of real-time vehicle detection. (a) A vehicle approaching. (b) The vehicle passing by the microphone. The designed recognizer provides alerts immediately in both cases, so that the command center has an early alarm even in case (a).
An example of a vehicle waveform (including the coming, passing-by, and leaving events) is given in Fig. 8.3(a). The recognition results using MFCC, GTF+STR+LHL, and GTF+STR+NHL are illustrated in Fig. 8.3(b-d), respectively. The y-axis of these figures is a 0-or-1 decision: 0 means no vehicle approaching, 1 means a vehicle is approaching. Notably, in Fig. 8.3(d) the recognition duration is about 7 seconds. For a vehicle moving at about 20 mph, the recognition distance is about ±93 feet ($20 \times 1600 \times 7 \times 3/3600/2$) around the microphone center; within this recognition distance, the vehicle coming, passing-by, and leaving events are detected. Comparing the results using MFCC, GTF+STR+LHL, and GTF+STR+NHL, the proposed GTF+STR+NHL provides the best recognition result, since its recognition duration is broader than the others.
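As a worked check of this arithmetic, using the rough conversions implicit in the formula (1 mile ≈ 1600 m, 1 m ≈ 3 ft):
\[
d \;\approx\; \frac{v\,t}{2}
\;=\; \frac{1}{2}\cdot\frac{20 \times 1600 \times 3\ \mathrm{ft}}{3600\ \mathrm{s}}\cdot 7\ \mathrm{s}
\;\approx\; \frac{26.7\ \mathrm{ft/s} \times 7\ \mathrm{s}}{2}
\;\approx\; 93\ \mathrm{ft}.
\]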
An example of a noisy vehicle waveform is shown in Fig. 8.4(a); in this case, the waveform is a mixture of vehicle sound and unknown AWGN at an SNR of 0 dB. The compared recognition results are illustrated in Fig. 8.4(b-d). Comparing the results using MFCC, GTF+STR+LHL, and GTF+STR+NHL, the first two systems cannot recognize the approaching vehicle and have high false positive rates, whereas the proposed GTF+STR+NHL can still detect the approaching vehicle and provides almost the same recognition result as in the normal circumstance. This again shows that the proposed system is more noise robust.
Moreover, we have connected the proposed system with a video-based image recognition system. The proposed vehicle recognition system sends the on-site recognition results to the camera located in the command center; a positive recognition activates the camera and slews it to the detected area. As illustrated in Fig. 8.5, the camera has been slewed to the detected area and has taken a picture of the vehicle. In the marked 'vehicle icon' bar in the lower part of Fig. 8.5, continuous dots denote the period of time during which the vehicle is detected.
In all, we have tested the proposed system in Panama City (FL), Joshua Tree (CA), at USC, and on non-busy streets in LA, under various conditions such as paved or sandy desert roads and various vehicle moving speeds. Overall, we achieved good vehicle recognition results of 99∼100% and good non-vehicle sound rejection results of 98∼100%, except that airplanes or scooters could cause false alarms at a rate of 5∼10% (i.e., out of 10 airplanes, roughly 0.5 to 1 would cause a false alarm). In low background noise, the average recognition distance is 50∼100 feet from the microphone, with a maximum of up to 300 feet; in noisy environments, the average recognition distance is 30∼70 feet. All details are listed in Table 8.1. Notably, in Panama City (FL), we demonstrated our real-time practical system in front of the Transportation Security Administration (TSA); we successfully recognized all passing-by vehicles on the streets (100% correct rate) and rejected all background sounds (100% rejection rate). In that circumstance, the surrounding background sounds included human voice, medium wind, bird chirp, explosive sound, and footsteps. We also obtained good results of 99∼100% when we demonstrated the recognizer to the Safety Dynamics company and the Army Research Lab.
Table 8.1: Real-time field testing results
Testing results Panama City (FL) Joshua Tree (CA) USC Non-busy street (LA)
Testing time daytime daytime night daytime
night night
Vehicle speed (mph) 10∼ 40 10∼ 20 10∼ 20 10∼ 35
Vehicle type any makes or models seven light wheeled three light wheeled any makes or models
on the street three heavy wheeled two heavy wheeled on the street
one motorcycle
Road condition paved sandy desert paved paved
Vehicle recognition 100% (60 runs) 100% (100 runs) 99% (60 runs) 99∼ 100% (300 runs)
Recognition range
from the microphone 50∼ 300 feet 50∼ 150 feet 50∼ 150 feet 50∼ 300 feet
Bird chirping rejection 100% 100% N/A 100%
Wind rejection 100% 98% N/A 99∼ 100%
Human voice rejection 100% 99∼ 100% 99∼ 100% 99∼ 100%
Building power station
sound rejection N/A N/A 100% N/A
Figure 8.3: (a) An example of a vehicle waveform (including the coming, passing-by, and leaving events). The x-axis in panels (b-d) is aligned with panel (a). The y-axis is a 0-or-1 decision: 0 means no vehicle approaching, 1 means a vehicle is approaching. (b) The recognition result using MFCC. (c) The recognition result using GTF+STR+LHL. (d) The recognition result using GTF+STR+NHL. The recognition duration is about 7 seconds; for a vehicle moving at about 20 mph, the recognition distance is about ±93 feet ($20 \times 1600 \times 7 \times 3/3600/2$) around the microphone center. Within this recognition distance, the vehicle coming, passing-by, and leaving events are detected. The proposed GTF+STR+NHL provides the best recognition result, since its recognition duration is broader than the others.
Figure 8.4: (a) An example of a noisy vehicle waveform: a mixture of vehicle sound and unknown AWGN at an SNR of 0 dB. (b) The recognition result using MFCC. (c) The recognition result using GTF+STR+LHL. (d) The recognition result using GTF+STR+NHL. The first two systems cannot recognize the approaching vehicle and have high false positive rates, whereas the proposed GTF+STR+NHL can still detect the approaching vehicle and provides almost the same recognition result as in the normal circumstance.
Figure 8.5: In the middle part, the camera has been slewed to the detected area and has taken a picture of the vehicle. In the marked 'vehicle icon' bar in the lower part, continuous dots denote the alerts sent by the on-site vehicle detector.
Chapter 9
Conclusion and Future Work
9.1 Conclusion
Intelligent auditory signal processing and neural learning provide heuristics for exploring acoustic signal recognition and identification tasks. We propose using gammatone filterbanks, spectro-temporal dynamic representation, and nonlinear Hebbian learning to perform noise-robust acoustic signal recognition and identification. Specifically, we modify the nonlinear Hebbian learning approach into a noise-robust signature recognition technology; that is, we apply the concept of blind source separation to real-world noise-robust pattern recognition. The proposed system can effectively extract representative independent components from high-dimensional input data. Concurrently with this process, the synaptic weight vectors that connect the input and output neurons are learned; they project noisy data into a representative feature space of the signals of interest. In such a projection, the signals of interest in the noisy data are favored, while the noises are attenuated or eliminated.
The recognizer is designed without needing to know the noise sources or mixing properties. To generate noisy data in simulation, clean vehicle data is mixed with additive white Gaussian noise, human vowel sound, and bird chirp sound; these noises are simply examples that normally exist in the surrounding background of the real-world application, vehicle detection. The proposed system is compared with many other existing algorithms. The compared simulation
results demonstrate that the proposed system can accurately recognize the signals of interest with much lower error rates than its counterparts, even when the signals are severely corrupted by white or colored noises. Moreover, vehicle type identification is performed under noisy environments, and again the proposed system shows stronger noise robustness than other works.
In addition, the proposed software system is designed and combined with practical hardware, such as the microphone, pre-amplifier, and A/D sound card. The real processing time of the proposed system is about 100 ms, which is much less than the one-time data acquisition length of 300 ms. This means the proposed system can perform online data processing, and no buffer is needed. We have tested the real-time practical system on sandy and paved roads and achieved very good results. In Panama City, FL, we demonstrated the practical recognizer to the Transportation Security Administration (TSA); in Tucson, AZ, we demonstrated the recognizer to the Safety Dynamics company; and in Los Angeles, CA, we demonstrated it to the Army Research Lab. In these demos, we obtained 99∼100% correct vehicle recognition results.
Neural encoding
In neurobiology, rate coding means that as the intensity of a stimulus increases, the frequency or rate of action potentials increases [AZ26]. The frequency of events, not the individual event magnitude, is the basis for most inter-neuronal communication. Input signals are trains of action potentials with identical shapes, and the inter-impulse intervals encode the input information; approximately, this rate encoding is based on $dt/dx$. Output signals are the quantities of neurotransmitter vesicles released from the presynaptic bouton, elicited by each action potential; approximately, this is based on the relation $dy/dt$. Multiple signaling pathways exist in one input-output neuron pair. If in each pathway the signaling $dy/dt$ is an exponential process, then with the linear part $dt/dx$, the whole signaling $dy/dx = (dt/dx)(dy/dt)$ is an exponential process (a linear process multiplied by an exponential process is an exponential process). It can then be shown that the multiple signaling pathways approximately follow a gamma function, an approximation of multiple exponential processes.
Hebbian learning and STDP
[vRBT00] explored a synaptic plasticity model incorporating the findings that potentiation and depression can be induced by precisely timed pairs of synaptic events and postsynaptic spikes. Spike-timing-dependent plasticity (STDP) is viewed as a Hebbian synaptic learning rule [PS00, CD08]; that is, the changes of synaptic connections are thought to occur through correlation-based, or Hebbian, plasticity. Precise plasticity rules should allow synaptic inputs to change in synaptic strength depending on their correlation with postsynaptic firing, or with the activity of other inputs. STDP realizes strength changes (plasticity) based on the timing correlations between synaptic input and postsynaptic output. We have studied STDP based on arbitrary pre- and postsynaptic spike protocols [LYB7b].
Generality of nonlinear Hebbian learning
The algorithm proposed in the present thesis can be used generally for acoustic signal recognition or for other kinds of signal recognition, such as image recognition and neurobiological signal identification. For example, it can be used to extract regular features from EEG (electroencephalogram) signals and to detect whether the EEG signals of some patients are normal or not.
9.2 Future Work
9.2.1 The Need
Although severely noisy environments may be uncommon, we always want to pursue more and better performance.
The NHL works very well for vehicle vs. non-vehicle recognition, even in severely noisy environments. The NHL also works well for vehicle type identification in noisy environments at SNR = 5, 10 dB. Under severely noisy environments, however, even though NHL improves performance by 25% over LHL for diesel truck identification at SNR = 0 dB, the absolute performance of NHL for vehicle type identification at SNR = 0, −5 dB is acceptable but not very good. The possible reason could be that different types of vehicles still share several common sources: their engines have similar sound-generating schemes, they may use the same choice of muffler, or the same speed and road condition could generate a similar vehicle-moving sound at the microphone.
The NHL extracts representative independent components for each type of vehicle. Some components from different types of vehicles could be similar to each other, as implied above by the fact that different vehicle types may share common sources. Consider the trained patterns (signatures) of two different vehicle types, the gasoline light wheeled car and the gasoline heavy wheeled car, as shown in Fig. 9.1. The x-axis is the pattern dimension, the y-axis denotes the pattern value, and at each pattern dimension the mean is plotted with an error bar denoting the variance. We can notice that both patterns are close to each other. The distance between these two patterns is 93, which is far smaller than the pattern distance of 1853 between a generalized vehicle and human voice, and far smaller than the threshold of 200 used for generalized vehicle vs. non-vehicle
Figure 9.1: The x-axis is the pattern dimension; the y-axis denotes the pattern value. At each pattern dimension, the mean is plotted with an error bar denoting the variance. Both patterns (gasoline heavy wheeled and gasoline light wheeled) are close to each other. The Earth Mover's distance between these two patterns is 93, which is far smaller than the pattern distance of 1853 between a generalized vehicle and human voice, and far smaller than the threshold of 200 for generalized vehicle vs. non-vehicle recognition.
recognition. The selected threshold distance between these two vehicle types is 93/2 = 46.5. This threshold is sufficient for vehicle type identification under noisy environments such as SNR = 5, 10 dB, where the NHL efficiently removes the noise effect. However, under severely noisy environments such as SNR = −5 dB, the noise effect may be too large: the NHL can remove some of it, but the remaining noise effect may push the signal across the between-class threshold of 46.5. Hence, a signal from a gasoline light wheeled car may cross the boundary and be misclassified as the gasoline heavy wheeled class.
9.2.2 Proposal for Identification Project Under Severe Environments
In order to discriminate highly correlated data between different classes, the preliminary idea is to decrease the correlative information between classes, normally by decreasing $E\{y_A y_B\}$, where $y_A$ is the component from class A and $y_B$ is the component from class B. However, the correlation $E\{y_A y_B\}$ is just a linear product of class A and class B; these two classes may also have higher-order correlation. So we need to consider the higher-order correlative information between classes in the future.
In the meantime, during the learning, the input-output information transmission should be maximized, since pattern recognition of class A (or class B) vs. outside classes remains a goal. As illustrated in Fig. 9.2, within one class, the goal is infomax, maximizing the input-output information transmission, which can be realized by the present NHL of this thesis. The independent component extraction of each class is one parallel branch of the figure, the same as in Fig. 5.4: GTF and STR are used for spectral and temporal dynamics analysis, and NHL is then used for representative signature extraction.
Across classes, the goal is to minimize the correlative information, or mutual information, between class A and class B. How can this cross-class goal be realized? In the future, a proper algorithm can be designed by following the criterion of minimizing the correlative information between classes (a hedged sketch of one such combined criterion is given after Fig. 9.2). In such a way, the pattern
Figure 9.2: Within one class, the goal is infomax, maximizing the input-output information transmission, which can be realized by the present NHL of this thesis; the independent component extraction of each class is one parallel branch (spectro-temporal input neurons, synaptic weights, linear output neurons, nonlinear activations, and within-class inhibitory effects), the same as in Fig. 5.4. Across classes, the goal is to minimize the correlative information, or mutual information, between the outputs of class A and class B (a between-class inhibitory effect).
distance between classes could become larger. Then, even under severely noisy environments, the vehicle type identification performance would be further improved.
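One hedged way to write this combined criterion (an assumption for illustration, not a finalized design) is to keep the within-class NHL objective while penalizing the cross-class correlation:
\[
\max_{W_A,\,W_B}\; E\{g^2(y_A)\} + E\{g^2(y_B)\} \;-\; \lambda\,\big(E\{y_A y_B\}\big)^2,
\qquad y_A = W_A x,\;\; y_B = W_B x,
\]
where $\lambda > 0$ weights the between-class inhibitory effect, and higher-order cross-cumulants could replace the squared correlation term to capture the higher-order correlative information mentioned above.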
9.2.3 Other Future Directions
In the research area of independent component analysis, there are several other possible directions for future work; recently, researchers have started to put much more complex conditions on ICA and to tackle difficult problems that are not yet fully addressed or solved.
One of the more difficult problems arises when the independent sources are nonlinearly mixed instead of linearly mixed: the unmixing process is then not unique, since there is no information about the mixing nonlinearity. Another difficult problem is that, when using ICA for blind source separation, the number of observed mixtures may in some cases be less than the number of independent sources; this is like having three unknown random variables but only two equations to solve. We will keep pace with state-of-the-art ICA algorithms and, through continued study, expect to make progress on these more difficult problems.
References
[AEL01] E. Ambikairajah, J. Epps, and L. Lin. Wideband speech and audio coding using gammatone filterbanks. In ICASSP, pages 773–776, Salt Lake City, UT, 2001.
[AIO+03] F. Asano, S. Ikeda, M. Ogawa, H. Asoh, and N. Kitawaki. Combined approach of array processing and independent component analysis for blind separation of acoustic signals. IEEE Transactions on Speech and Audio Processing, 11:204–215, 2003.
[AMB+04] S. Araki, S. Makino, A. Blin, R. Mukai, and H. Sawada. Underdetermined blind separation for speech in real environments with sparseness and ICA. In ICASSP, pages 881–884, 2004.
[AS90] A. Artola and W. Singer. The involvement of N-Methyl-D-Aspartate receptors in induction and maintenance of long-term potentiation in rat visual cortex. Journal of Neuroscience, 2:254–269, 1990.
[AZ26] E. D. Adrian and Y. Zotterman. The impulses produced by sensory nerve endings: part 3. Impulses set up by touch and pressure. Journal of Physiol., 61:465–483, 1926.
[AZRS07] A. Averbuch, V. Zheludev, N. Rabin, and A. Schclar. Wavelet based acoustic detection of moving vehicles, 2007.
[BAD+91] Z. I. Bashir, S. Alford, S. N. Davies, A. D. Randall, and G. L. Collingridge. Long-term potentiation of NMDA receptor-mediated synaptic transmission in the hippocampus. Nature, 349:156–158, 1991.
[BBB+91] N. Berretta, F. Berton, R. Bianchi, M. Brunelli, M. Capogna, and W. Francesconi. Long-term potentiation of NMDA receptor-mediated EPSP in guinea-pig hippocampal slices. Journal of Neuroscience, 3:850–854, 1991.
[BCM82] E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2:32–48, 1982.
[BEBK98] D. M. Bowman, J. J. Eggermont, D. K. Brown, and B. P. Kimberley. Estimating cochlear filter response properties from distortion product otoacoustic emission (DPOAE) phase delay measurements in normal hearing human adults. Hearing Research, 119:14–26, 1998.
[BH00] E. Bingham and A. Hyvärinen. A fast fixed-point algorithm for independent component analysis of complex valued signals. International Journal of Neural Systems, 10:1–8, 2000.
[BL73] T. W. Bliss and T. Lomo. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiol., 232:331–356, 1973.
[BMS02] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski. Face recognition by independent component analysis. IEEE Trans. on Neural Networks, 13(6):1450–1464, 2002.
[Boe75] E. Boer. Synthetic whole-nerve action potentials for the cat. Journal of the Acoustical Society of America, 58:1030–1045, 1975.
[BPR04] R. Boscolo, H. Pan, and V. P. Roychowdhury. Independent component analysis based on nonparametric density estimation. IEEE Transactions on Neural Networks, 15:55–65, 2004.
[BS95] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1004–1034, 1995.
[CACA04] S. A. Cruces-Alvarez, A. Cichocki, and S. Amari. From blind signal extraction to blind instantaneous signal separation: criteria, algorithms, and stability. IEEE Transactions on Neural Networks, 15:859–873, 2004.
[Cam97] J. P. Campbell. Speaker recognition: a tutorial. Proceedings of the IEEE, 85:1437–1462, 1997.
[Car97] J.-F. Cardoso. Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4:112–114, 1997.
[Car98] J.-F. Cardoso. Blind signal separation: statistical principles. Proceedings of the IEEE, 9:2009–2025, 1998.
[CAS05] O. Cheng, W. Abdulla, and Z. Salcic. Performance evaluation of front-end algorithms for robust speech recognition. In Proc. of the Eighth International Symposium on Signal Processing and its Applications, pages 711–714, 2005.
[CAT96] A. Cichocki, S. Amari, and R. Thawonmas. Blind signal extraction using self-adaptive non-linear Hebbian learning rule, 1996.
[CB07] C.-P. Chen and J. A. Bilmes. MVA processing of speech features. IEEE Trans. on Audio, Speech, and Language Processing, 15(1):257–270, 2007.
[CCPL05] S. Choi, A. Cichocki, H.-M. Park, and S.-Y. Lee. Blind source separation and independent component analysis: a review. Neural Information Processing - Letters and Reviews, 6:1–57, 2005.
[CD08] N. Caporale and Y. Dan. Spike timing-dependent plasticity: a Hebbian learning rule. Annual Review of Neuroscience, 31:25–46, 2008.
[CGG+99] T. Chi, Y. Gao, M. C. Guyton, P. Ru, and S. A. Shamma. Spectro-temporal modulation transfer functions and speech intelligibility. The Journal of the Acoustical Society of America, 106:2719–2732, 1999.
[CH01] G. Charestan and R. Heusdens. A gammatone-based psychoacoustical modeling approach for speech and audio coding, 2001.
[Cho01] K. H. Chon. Accurate identification of periodic oscillations buried in white or colored noise using fast orthogonal search. IEEE Transactions on Biomedical Engineering, 48:622–629, 2001.
[CKGM96] H. C. Choe, R. E. Karlsen, G. R. Gerhert, and T. Meitzler. Wavelet-based ground vehicle recognition using acoustic signal. Wavelet Applications III, 2762:434–445, 1996.
[CL96] J. F. Cardoso and B. H. Laheld. Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44:3017–3030, 1996.
[Com94] P. Comon. Independent component analysis, a new concept? Signal Processing, 36:287–314, 1994.
[Cos98] P. Cosi. Auditory modeling and neural networks, 1998.
[CPLTB96] M. J. Carey, E. S. Parris, H. Lloyd-Thomas, and S. Bennett. Robust prosodic features for speaker identification. In Fourth International Conference on Spoken Language, pages 1800–1803, 1996.
[CS06] T. Chi and S. A. Shamma. Spectrum restoration from multiscale auditory phase singularities by generalized projections. IEEE Transactions on Audio, Speech, and Language Processing, 14:1179–1192, 2006.
[CW69] S. C. Choi and R. Wette. Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics, 11:683–669, 1969.
[DGSM07] S. C. Douglas, M. Gupta, H. Sawada, and S. Makino. Spatio-temporal fastICA algorithms for the blind separation of convolutive mixtures. IEEE Transactions on Audio, Speech, and Language Processing, 15:1511–1520, 2007.
[DH03] R. I. Damper and J. E. Higgins. Improving speaker identification in noise by subband processing and decision fusion. Pattern Recognition Letters, 24:2167–2173, 2003.
[DL00] N. Delfosse and P. Loubaton. Adaptive blind separation of independent sources: a second-order stable algorithm for the general case. IEEE Transactions on Fundamental Theory and Applications, 47:1056–1071, 2000.
[DN07] P. Day and A. K. Nandi. Robust text-independent speaker verification using genetic programming. IEEE Transactions on Audio, Speech, and Language Processing, 15:285–295, 2007.
[ER95] D. Ellis and D. Rosenthal. Mid-level representations for computational auditory scene analysis. In International Joint Conference on Artificial Intelligence, Montreal, Quebec, Canada, 1995.
[EW93] A. Erell and M. Weintraub. Filterbank-energy estimation using mixture and Markov models for recognition of noisy speech. IEEE Trans. on Speech and Audio Processing, 1(1):68–76, 1993.
[FB95] C. Fyfe and R. Baddeley. Non-linear data structure extraction using simple Hebbian networks. Biological Cybernetics, 72:533–541, 1995.
[FB01] S. Fiori and P. Bucciarelli. Probability density estimation using adaptive activation function neurons. Neural Processing Letters, 13:31–42, 2001.
[FG06] C. Févotte and S. J. Godsill. A Bayesian approach for blind separation of sparse sources. IEEE Transactions on Audio, Speech, and Language Processing, 14:2174–2188, 2006.
115
[Fio2a] S.Fiori. NotesonBell-SejnowskiPDF-matchingneuron. NeuralCom-
putation, 14:2847–2855, 2002a.
[Fio2b] S. Fiori. Closed-form expressions of some stochastic adapting equa-
tions for non-linear adaptive activation function neurons, 2002b.
[Fio03] S. Fiori. Extended Hebbian learning for blind separation of complex-valued sources. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 50:195–202, 2003.
[Gau79] W. Gautschi. A computational procedure for incomplete gamma functions. ACM Transactions on Mathematical Software, 5:466–481, 1979.
[Gau07] D. Von Gausig. North American Bird Sounds, 2007.
[Gir98] M. Girolami. Noise reduction and speech enhancement via temporal
anti-Hebbian learning. In ICASSP, pages 1233–1236, Seattle, WA,
1998.
[GK05] Y. Gui and H. K. Kwan. Adaptive subband Wiener filtering for speech enhancement using critical-band gammatone filterbank. In IEEE 48th Midwest Symposium on Circuits and Systems, pages 732–735, 2005.
[GM90] B. R. Glasberg and B. C. J. Moore. Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47:103–138, 1990.
[GWAH87] B. Gustafsson, H. Wigstrom, W. C. Abraham, and Y. Y. Huang.
Long-term potentiation in the hippocampus using depolarizing cur-
rent pulses as the conditioning stimulus to single volley synaptic
potentials. Journal of Neuroscience, 7:774–780, 1987.
[HAH01] X. Huang, A. Acero, and H.-W. Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall, 2001.
[Hay98] S. Haykin. Neural Networks. Prentice Hall, 1998.
[Heb49] D. O. Hebb. The Organization of Behavior: A Neuropsychological
Theory. New York, Wiley, 1949.
[HI95] S. Hayakawa and F. Itakura. The influence of noise on the speaker recognition performance using the higher frequency band. In ICASSP, pages 321–324, 1995.
[HO98] A. Hyvärinen and E. Oja. Independent component analysis by general non-linear Hebbian-like learning rules. Signal Processing, 64:301–313, 1998.
[HO00] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13:411–430, 2000.
[Hyv98] A. Hyvärinen. New approximations of differential entropy for independent component analysis and projection pursuit. Advances in Neural Information Processing Systems, 10:273–279, 1998.
[Hyv97a] A. Hyvärinen. One-unit contrast functions for independent component analysis: a statistical analysis. Neural Networks for Signal Processing VII, pages 388–397, 1997a.
[Hyv97b] A. Hyvärinen. A family of fixed-point algorithms for independent component analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3917–3920, Munich, Germany, 1997b.
[IAF04] K. Iwano, T. Asami, and S. Furui. Noise-robust speaker verification
using F0 features. In ICSLP, pages 1417–1420, Jeju Island, Korea,
2004.
[IP03] L. V. Immerseel and S. Peeters. Digital implementation of linear gammatone filters: comparison of design methods. Acoustics Research Letters Online, 2003.
[IU98] T. Irino and M. Unoki. A time-varying, analysis/synthesis auditory filterbank using the gammachirp. In ICASSP, pages 3653–3656, Seattle, Washington, 1998.
[JG98] D. N. Joanes and C. A. Gill. Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society (Series D): The Statistician, 47(1):183–189, 1998.
[JHH99] H. Jiang, K. Hirose, and Q. Huo. Robust speech recognition based on a Bayesian prediction approach. IEEE Transactions on Speech and Audio Processing, 7(4):426–440, 1999.
[Joh72] P. I. M. Johannesma. The pre-response stimulus ensemble of neurons in the cochlear nucleus. In Proc. Symposium on Hearing Theory, pages 58–69, Eindhoven, Netherlands, 1972.
[JS87] M. C. Jones and R. Sibson. What is projection pursuit? Journal of the Royal Statistical Society, 150:1–36, 1987.
[KD90] S. Y. Kung and K. I. Diamantaras. A neural network learning algorithm for adaptive principal component extraction (APEX). In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 861–864, 1990.
[KDSS00] D. Klein, D. Depireux, J. Simon, and S. Shamma. Robust spectro-temporal reverse correlation for the auditory system: optimizing stimulus design. Journal of Computational Neuroscience, 9, 2000.
[KJ93] J. Karhunen and J. Joutsensalo. Learning of robust principal com-
ponent subspace. In IEEE International Joint Conference on Neural
Networks, pages 2409–2412, 1993.
[KJK08] S. Kim, M. Ji, and H. Kim. Noise-robust speaker recognition using
subband likelihoods and reliable-feature selection. ETRI Journal,
30:89–100, 2008.
[KLK99] D.-S. Kim, S.-Y. Lee, and R. M. Kil. Auditory processing of speech signals for robust speech recognition in real-world noisy environments. IEEE Transactions on Speech and Audio Processing, 7:55–69, 1999.
[KSAC06] V. Krishnan, S. M. Siniscalchi, D. V. Anderson, and M. A. Clements. Noise robust Aurora-2 speech recognition employing a codebook-constrained Kalman filter preprocessor. In ICASSP, pages 781–784, Toulouse, France, 2006.
[Lau81] S. Laughlin. A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36:910–912, 1981.
[LB96] J. S. Liaw and T. W. Berger. Dynamic synapse: a new concept
of neural representation and computation. Hippocampus, 6:591–600,
1996.
[LBS00] T.-W. Lee, A. J. Bell, and T. J. Sejnowski. A unifying information-
theoretic framework for independent component analysis. Computers
and Mathematics with Applications, 39:1–21, 2000.
[LC00] S.-Y. Lung and C.-C. T. Chen. A new approach for text-independent
speaker recognition. Pattern Recognition, 33:1401–1403, 2000.
[LDB08a] B. Lu, A. Dibazar, and T. W. Berger. Nonlinear Hebbian learning for noise-independent vehicle sound recognition. In IEEE International Joint Conference on Neural Networks, pages 1337–1344, Hong Kong, 2008a.
[LDB08b] B. Lu, A. Dibazar, and T. W. Berger. Perimeter security on detecting acoustic signature of approaching vehicle using nonlinear neural computation. In IEEE International Conference on Technologies for Homeland Security, pages 51–56, Waltham, MA, 2008b.
[LDB08d] B. Lu, A. Dibazar, and T. W. Berger. Perimeter security on noise-robust vehicle detection using nonlinear Hebbian learning. In A. Solanas and A. Martínez-Ballesté, editors, Advances in Artificial Intelligence for Privacy Protection and Security. Imperial College Press, World Scientific, 2008d.
[LDBed] B. Lu, A. Dibazar, and T. W. Berger. Noise-robust acoustic signature recognition using nonlinear Hebbian learning. Neural Networks, 2008c (submitted).
[LGS99] T.-W. Lee, M. Girolami, and T. J. Sejnowski. Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Computation, 11(2):417–441, 1999.
[LH88] K. G. Lang and G. E. Hinton. The development of the time-delay
neural network architecture for speech recognition. Technical Report
CMU-CS-88-152, Carnegie-Mellon University, Pittsburgh, PA, 1988.
[Liu99] L. Liu. Ground vehicle acoustic signal processing based on biologi-
cal hearing models. Master’s thesis, University of Maryland, College
Park, USA, 1999.
[LL02] T.-W. Lee and M. S. Lewicki. Unsupervised image classification, segmentation, and enhancement using ICA mixture models. IEEE Transactions on Image Processing, 11(3):270–279, 2002.
[LLGS99] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski. Blind
source separation of more sources than mixtures using overcomplete
representations. IEEE Signal Processing Letters, 6(4):87–90, 1999.
[LLS00] T.-W. Lee, M. S. Lewicki, and T. J. Sejnowski. ICA mixture models for unsupervised classification of non-Gaussian classes and automatic context switching in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1078–1089, 2000.
[LS00] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete represen-
tations. Neural Computation, 12:337–365, 2000.
[LS04] S.-J. Li and R.-M. Shen. Fuzzy cognitive map learning based on
improved nonlinear Hebbian rule. In International Conference on
Machine Learning and Cybernetics, pages 2301–2306, 2004.
[Lun06] S.-Y. Lung. Wavelet feature domain adaptive noise reduction using
learning algorithm for text-independent speaker recognition. Pattern
Recognition, 40:2003–2006, 2006.
[LV05] T. Lotter and P. Vary. Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model. EURASIP Journal on Applied Signal Processing, 7:1110–1126, 2005.
[LYB06a] B. Lu, W. Yamada, and T. W. Berger. Nonlinear queuing model for dynamic synapse with multiple transmitter pools. In IEEE International Joint Conference on Neural Networks, Vancouver, Canada, 2006a.
[LYB06b] B. Lu, W. Yamada, and T. W. Berger. Nonlinear dynamic neural network for text-independent speaker identification using information theoretic learning technology. In IEEE International Conference of the Engineering in Medicine and Biology Society, New York, NY, 2006b.
[LYB07a] B. Lu, W. M. Yamada, and T. W. Berger. Nonlinear high-order model for dynamic synapse with multiple vesicle pools. In L. I. Perlovsky and R. Kozma, editors, Neurodynamics of Cognition and Consciousness, pages 341–358. New York, Springer-Verlag, 2007a.
[LYB07b] B. Lu, W. Yamada, and T. W. Berger. Asymmetric synaptic plasticity based on arbitrary pre- and postsynaptic timing spikes using finite state model. In IEEE International Joint Conference on Neural Networks, Orlando, FL, 2007b.
[Med86] R. Meddis. Simulation of mechanical to neural transduction in the
auditory receptor. The Journal of the Acoustical Society of America,
79(3):702–711, 1986.
[MHG06] J. Ming, T. J. Hazen, and J. R. Glass. Speaker verification over handheld devices with realistic noisy speech data. In ICASSP, Toulouse, France, 2006.
[MHGR07] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds. Robust speaker recognition in noisy conditions. IEEE Transactions on Audio, Speech, and Language Processing, 15:1711–1723, 2007.
[Mil90] K. D. Miller. Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Computation, 2:321–333, 1990.
[Mit97] T. M. Mitchell. Machine Learning. WCB-McGraw-Hill, 1997.
[MM92] R. M. Mulkey and R. C. Malenka. Mechanism underlying induction of homosynaptic long-term depression in area CA1 of the hippocampus. Neuron, 9:967–975, 1992.
[MMSW97] H. Maciejewski, J. Mazurkiewicz, K. Skowron, and T. Walkowiak. Neural networks for vehicle recognition. In U. Ramacher, H. Klar, and A. Koenig, editors, Proc. of the 6th International Conf. on Microelectronics for Neural Networks, Evolutionary, and Fuzzy Systems, pages 292–296, 1997.
[Mon00] A. H. Monahan. Nonlinear principal component analysis by neural networks: Theory and application to the Lorenz system. Journal of Climate, 13:821–835, 2000.
[MSS06] N. Mesgarani, M. Slaney, and S. A. Shamma. Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations. IEEE Transactions on Audio, Speech, and Language Processing, 14:920–930, 2006.
[Mun04] M. E. Munich. Bayesian subspace methods for acoustic signature recognition of vehicles. In Proc. of the 12th European Signal Processing Conf., 2004.
[Oja82] E. Oja. A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267–273, 1982.
[Oja95] E. Oja. PCA, ICA, and nonlinear Hebbian learning. In ICANN, pages 89–94, 1995.
[Oja97] E. Oja. The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17:25–45, 1997.
[OK88] E. Oja and T. Kohonen. The subspace learning algorithm as formalism for pattern recognition and neural networks. In IEEE International Conf. on Neural Networks, pages 277–284, San Diego, CA, 1988.
[OK95] E. Oja and J. Karhunen. Signal separation by nonlinear Hebbian learning. In Proceedings IEEE ICNN 95, pages 83–97, 1995.
[OOW91] E. Oja, H. Ogawa, and J. Wangviwattana. Learning in non-linear constrained Hebbian networks. In ICANN, pages 385–390, 1991.
[OSB99] A. V. Oppenheim, R. W. Schafer, and J. R. Buck. Discrete-time
Signal Processing. Prentice Hall, Englewood Cliffs, 1999.
[Pal82] G. Palm. Neural Assemblies, An Alternative Approach to Artificial
Intelligence. New York, Springer-Verlag, 1982.
[Pap84] A. Papoulis. Probability, Random Variables, and Stochastic Processes. New York, McGraw-Hill, 1984.
[PFP99] E. Pomponi, S. Fiori, and F. Piazza. Complex independent component analysis by nonlinear generalized Hebbian learning with Rayleigh nonlinearity. In ICASSP, pages 1077–1080, 1999.
[PG97] D. T. Pham and P. Garat. Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Transactions on Signal Processing, 45:1712–1725, 1997.
[PGJ92] D.-T. Pham, P. Garat, and C. Jutten. Separation of a mixture of independent sources through a maximum likelihood approach. In EUSIPCO, pages 771–774, 1992.
[PH96] R. D. Patterson and J. Holdsworth. A functional model of neural
activity patterns and auditory images. Advances in Speech, Hearing,
and Language Processing, 3:547–563, 1996.
[PJLL99] H. M. Park, H. Y. Jung, T. W. Lee, and S. Y. Lee. Subband-based
blind signal separation for noisy speech recognition. Electronics Let-
ters, 35(23), 1999.
[PM86] R. D. Patterson and B. C. J. Moore. Auditory filters and excita-
tion patterns as representations of frequency resolution. In B. C. J.
Moore, editor, Frequency Selectivity in Hearing, pages 123–177. Aca-
demic Press Limited, London, 1986.
[PRH+92] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. H. Allerhand. Complex sounds and auditory images. In Y. Cazals, L. Demany, and K. Horner, editors, Auditory Physiology and Perception, pages 429–446. Pergamon, Oxford, 1992.
[PS00] O. Paulsen and T. J. Sejnowski. Natural patterns of activity and long-term synaptic plasticity. Current Opinion in Neurobiology, 10:172–180, 2000.
[PSB03] D.-T. Pham, C. Serviere, and H. Boumaraf. Blind separation of speech mixtures based on nonstationarity. In IEEE Seventh International Symposium on Signal Processing and Its Applications, pages 73–76, 2003.
[PW01] B. Porr and F. Wörgötter. Temporal Hebbian learning in rate-coded neural networks: A theoretical approach towards classical conditioning, 2001.
[RB96] Z. Roth and Y. Baram. Multidimensional density shaping by sigmoids. IEEE Transactions on Neural Networks, 7:1291–1298, 1996.
[RDP86] P. Rajasekaran, G. Doddington, and J. Picone. Recognition of speech
under stress and in noise. In ICASSP, pages 733–736, Dallas, Texas,
1986.
[Rey02] D. A. Reynolds. An overview of automatic speaker recognition tech-
nology. In ICASSP, pages 4072–4075, 2002.
[RH02] M. Roch and R. R. Hurtig. The integral decode: a smoothing technique for robust HMM-based speaker recognition. IEEE Transactions on Speech and Audio Processing, 10:315–324, 2002.
[RMK08] R. C. Rose, A. Miguel, and A. Keyvani. Improving robustness in frequency warping-based speaker normalization. IEEE Signal Processing Letters, 15:225–228, 2008.
[RS78] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice Hall, 1978.
[RS97] B. Ruf and M. Schmitt. Hebbian learning in networks of spiking neu-
rons using temporal coding. In Biological and artificial computation:
From neuroscience to technology, pages 380–389, 1997.
[San90] T. D. Sanger. Analysis of the two-dimensional receptive fields learned by the Hebbian algorithm in response to random input. Biological Cybernetics, 63:221–228, 1990.
[San89a] T. D. Sanger. An optimality principle for unsupervised learning. In Advances in Neural Information Processing Systems, pages 11–19, San Mateo, CA, 1989a.
[San89b] T. D. Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459–473, 1989b.
[SDK+07] J. Z. Simon, D. A. Depireux, D. J. Klein, J. B. Fritz, and S. A. Shamma. Temporal symmetry in primary auditory cortex: Implications for cortical connectivity. Neural Computation, 19:583–638, 2007.
[SDS98] J. Z. Simon, D. A. Depireux, and S. A. Shamma. Representation of
complex dynamic spectra in auditory cortex, 1998.
[SH94] A. Sudjianto and M. H. Hassoun. Nonlinear Hebbian rule: a sta-
tistical interpretation. In IEEE World Congress on Computational
Intelligence, pages 1247–1252, Orlando, FL, 1994.
[SH95] A. Sudjianto and M. H. Hassoun. Statistical basis of nonlinear Hebbian learning and application to clustering. Neural Networks, 8:707–715, 1995.
[SK07] Y. A. Solewicz and M. Koppel. Using post-classifiers to enhance fusion of low- and high-level speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15:2063–2071, 2007.
[Sla93] M. Slaney. An efficient implementation of the Patterson-Holdsworth
auditory filter bank. Apple Computer Technical Report #35, 1993.
[SS89] P. K. Stanton and T. J. Sejnowski. Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339:215–218, 1989.
[SS92] N. N. Schraudolph and T. J. Sejnowski. Competitive anti-Hebbian learning of invariants. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 1017–1024. Morgan Kaufmann, 1992.
[SSW07] Y. Shao, S. Srinivasan, and D. Wang. Incorporating auditory feature
uncertainties in robust speaker identification. In ICASSP, Honolulu,
Hawaii, 2007.
[SVK95] S. Shamma, H. Versnel, and N. Kowalski. Ripple analysis in ferret primary auditory cortex: I. Response characteristics of single units to sinusoidally rippled spectra. Aud. Neurosci., 1, 1995.
[SW06] Y. Shao and D. Wang. Robust speaker recognition using binary time-
frequency masks. In ICASSP, Toulouse, France, 2006.
[SWMB04] D. Song, Z. Wang, V. Z. Marmarelis, and T. W. Berger. A modeling
paradigm incorporating parametric and non-parametric methods. In
Proceedings of the 26th Annual International Conference of the IEEE
EMBS, pages 647–650, San Francisco, CA, 2004.
[SZ07] Z. Shi and C. Zhang. Blind source extraction using generalized auto-
correlation. IEEE Transactions on Neural Networks, 18:1516–1522,
2007.
[TBB94] E. Thiels, G. Barrionuevo, and T. W. Berger. Excitatory stimulation during postsynaptic inhibition induces long-term depression in hippocampus in vivo. Journal of Neurophysiology, 72:3009–3016, 1994.
[TK07] H. Takizawa and H. Kobayashi. Partial distortion entropy maximiza-
tion for online data clustering. Neural Networks, 20:819–831, 2007.
[TW05] V. Tyagi and C. Wellekens. On desensitizing the mel-cepstrum to spurious spectral components for robust speech recognition. In ICASSP, pages 529–532, Philadelphia, PA, 2005.
[vRBT00] M. C. W. van Rossum, G. Q. Bi, and G. G. Turrigiano. Stable Heb-
bian learning from spike timing-dependent plasticity. Journal of Neu-
roscience, 20:8812–8821, 2000.
[VV05] F. Vrins and M. Verleysen. Information theoretic versus cumulant-
based contrasts for multimodal source separation. IEEE Signal Pro-
cessing Letters, 12:190–193, 2005.
[WAW03] M. Welling, F. Agakov, and C. K. I. Williams. Extreme compo-
nents analysis. In Advances in Neural Information Processing Sys-
tems, 2003.
[WBWR03] S. N. Wrigley, G. J. Brown, V. Wan, and S. Renals. Feature selec-
tion for the classification of crosstalk in multi-channel audio. In
EUROSPEECH, Geneva, 2003.
[WSK99] H. Wu, M. Siegel, and P. Khosla. Vehicle sound signature recognition by frequency vector principal component analysis. IEEE Transactions on Instrumentation and Measurement, 48:1005–1009, 1999.
[WYWL07] J.-C. Wang, C.-H. Yang, J.-F. Wang, and H.-P. Lee. Robust speaker identification and verification. IEEE Computational Intelligence Magazine, 2:52–59, 2007.
[XBB92] X. Xie, T. W. Berger, and G. Barrionuevo. Isolated NMDA receptor-
mediated synaptic responses express both LTP and LTD. Journal of
Neurophysiology, 67:1009–1013, 1992.
[XK06] Y. Xia and M. S. Kamel. A cooperative recurrent neural network algorithm for parameter estimation of autoregressive signals. In IEEE International Joint Conference on Neural Networks, pages 4823–4829, Vancouver, Canada, 2006.
[XY95] L. Xu and A. L. Yuille. Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Transactions on Neural Networks, 6:131–143, 1995.
[YHW05] K.-H. Yuo, T.-H. Hwang, and H.-C. Wang. Combination of
autocorrelation-based features and projection measure technique for
speaker identification. IEEE Transactions on Speech and Audio Pro-
cessing, 13:565–574, 2005.
[YR04] O. Yilmaz and S. Rickard. Blind separation of speech mixtures via
time-frequency masking. IEEE Transactions on Signal Processing,
52:1830–1847, 2004.
[YSFC00] K. Yao, B. E. Shi, P. Fung, and Z. Cao. Residual noise compensation
for robust speech recognition in nonstationary noise. In ICASSP,
pages 1125–1128, Istanbul, Turkey, 2000.
[YV02] N. B. Yoma and M. Villar. Speaker verification in noise using a stochastic version of the weighted Viterbi algorithm. IEEE Transactions on Speech and Audio Processing, 10:158–166, 2002.
[ZA07] Y. Zhang and W. H. Abdulla. Eigenanalysis applied to speaker identification using gammatone auditory filterbank and independent component analysis. In IEEE 9th International Symposium on Signal Processing and Its Applications, pages 1–4, Sharjah, 2007.
[ZR02] R. S. Zucker and W. G. Regehr. Short-term synaptic plasticity.
Annual Review of Physiology, 64:355–405, 2002.
Abstract
How can an acoustic signal of interest be recognized in open environments where many other acoustic noises exist? Efficient auditory signal processing and intelligent neural learning contribute to this remarkable ability. We propose a nonlinear Hebbian learning (NHL) rule, with several novelties, to implement noise-robust acoustic signal recognition. The proposed learning rule processes both temporal and spectral features of the input. Spectral analysis is realized using auditory gammatone filterbanks. To address temporal dynamics, the network input incorporates not only the current gammatone-filtered feature vector but also multiple past feature vectors; we refer to this high-dimensional input as the spectro-temporal representation (STR). Given STR inputs, the exact acoustic signatures of the signals of interest, and the way those signatures combine, are generally unknown. The nonlinear Hebbian learning rule is therefore employed to extract representative independent signatures and to learn the weight vectors that transform data into the signature space; during learning, NHL also reduces feature dimensionality. Compared with linear Hebbian learning (LHL), which exploits only the second-order moment of the data, NHL involves higher-order statistics, so it can capture representative components that are more statistically independent than those found by LHL. Moreover, the nonlinear activation function of NHL can be chosen to match the implicit distribution of many acoustic sounds, making the learning optimal in a mutual-information sense. The advantages of the proposed NHL over other ICA algorithms, which are often used for blind source separation, are also discussed in terms of their criterion and optimization functions.
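To make the learning rule concrete, the following Python/NumPy fragment is a minimal sketch of a single-unit nonlinear Hebbian update of the kind the abstract describes: the weight change is the product of the input vector and a nonlinear function of the unit's output, followed by a normalization step, so that higher-order statistics of the data shape the learned signature direction. The tanh nonlinearity, learning rate, normalization, and toy data below are illustrative assumptions, not the dissertation's exact activation function or training procedure.

import numpy as np

def g(y):
    # Odd, saturating nonlinearity (an assumed generic choice); it injects
    # higher-order moments of the data into the weight update.
    return np.tanh(y)

def nhl_train(X, eta=0.01, epochs=50, seed=0):
    """Single-unit nonlinear Hebbian learning sketch.
    X: (n_samples, n_features) array of spectro-temporal feature vectors."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = w @ x                    # unit activation y = w.x
            w += eta * x * g(y)          # nonlinear Hebbian update w <- w + eta*x*g(y)
            w /= np.linalg.norm(w)       # renormalize to keep w bounded
    return w

# Toy usage: a heavy-tailed (super-Gaussian) component mixed into Gaussian noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 8))
X[:, 2] += rng.laplace(scale=2.0, size=2000)
w = nhl_train(X)
print(np.round(w, 2))  # weight mass should concentrate on feature 2

In the full system, the input x would be the high-dimensional STR vector (the current plus several past gammatone-filtered frames), and the activation function would be chosen to match the implicit distribution of the acoustic sounds rather than the generic tanh assumed here.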