Novel Variations of Sparse Representation Techniques
with Applications
by
Qun Feng Tan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2013
Copyright 2013 Qun Feng Tan
Dedication
To my family.
Acknowledgements
I would like to thank my advisor Prof. Shrikanth Narayanan for his guidance and support during my years at USC, and for motivating me to always challenge myself. He has encouraged and given advice on various research threads which made this thesis possible. I would also like to thank Prof. Urbashi Mitra, Prof. Antonio Ortega and Prof. Fei Sha for their thoughtful discussions, support and advice. Discussions with Prof. Mitra have helped guide the research in the Underwater Classification section of this thesis and have allowed me to look at problems from different perspectives. My class project (on genetic breakpoint detection with sparse representation techniques) and discussions with Prof. Ortega inspired my interest in Sparse Representation and Compressed Sensing. Prof. Ortega has also given numerous insightful suggestions which helped strengthen the research. My class project (on reweighted algorithms) and discussions with Prof. Sha directed this dissertation toward a Machine Learning orientation. Prof. Sha has also given good ideas to help improve some of the new algorithms I am developing. They have all helped shape my Ph.D. experience in many aspects.
The Signal Analysis and Interpretation Laboratory (SAIL) has been a great environment for developing my research, and I would like to thank all members for their direct and indirect contributions.

Last but not least, I would like to thank my family for their unwavering support, without which this thesis would not have been possible.
Table of Contents
Dedication ii
Acknowledgements iii
List of Figures viii
List of Tables x
Abstract xii
Chapter 1: Enhanced Sparse Imputation Techniques for a Robust Speech Recognition Front-End [59] 1
1.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Construction of Representation of Test Utterances . . . . . . . . 6
1.2.2 Signal Reliability Masks . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 Noise Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.4 Bounded Optimization . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.5 Formulation of optimization problem . . . . . . . . . . . . . . . . 12
1.2.6 Baseline - The LASSO solution . . . . . . . . . . . . . . . . . . . 13
1.2.7 Drawbacks of the classical LASSO solution for our recognition
task and dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.8 Proposed solution for the continuous digits task. . . . . . . . . . 16
1.3 Description of dataset, experimental setup, and experimental results . . 19
1.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.2 Tuning for parameters . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3.3 Experimental Results for Continuous Digit Recognition Task . . 21
1.3.4 Insight into the mechanism behind the imputation process . . . . 25
1.3.5 Investigation of various dictionary structures . . . . . . . . . . . 27
1.4 Discussion of Practicality of Implementation in Real-time Systems . . . 28
1.5 Conclusion and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.6 Evaluation of other regularization/optimization techniques. . . . . . . . 30
Chapter 2: Novel Variations of Group Sparse Regularization Techniques, with Applications to Noise Robust Automatic Speech Recognition [58] 36
2.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2.1 Notation description . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2.2 Basis Selection Techniques. . . . . . . . . . . . . . . . . . . . . . 41
2.2.3 Formulation of the “Group Elastic Net” algorithm . . . . . . . . 43
2.2.4 Formulation of “Group Sparse Bayesian Learning” algorithm . . 46
2.3 Application to Spectral Denoising in a Speech Recognition Framework . 52
2.4 Experimental Setup and Results . . . . . . . . . . . . . . . . . . . . . . 54
2.4.1 Description of Database . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.2 Description of ASR Features . . . . . . . . . . . . . . . . . . . . 55
2.4.3 Details of Algorithms Implementation . . . . . . . . . . . . . . . 55
2.4.4 Signal Reliability Mask . . . . . . . . . . . . . . . . . . . . . . . 55
2.4.5 Approaches for dictionary partitioning . . . . . . . . . . . . . . . 56
2.4.6 Evaluation Results for Aurora 2.0 . . . . . . . . . . . . . . . . . 57
2.4.7 Runtimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.4.8 Aurora 2.0 Results for other SNR levels . . . . . . . . . . . . . . 61
2.4.9 Evaluation of Algorithms on Real Noisy Data . . . . . . . . . . . 62
2.5 Conclusion and Final Remarks . . . . . . . . . . . . . . . . . . . . . . . 63
2.6 Derivation of Property 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 3: Combining Window Predictions Efficiently - A New Imputation Approach for Noise Robust Automatic Speech Recognition [60] 66
3.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Framework and Algorithmic Description . . . . . . . . . . . . . . . . . . 68
3.3.1 Feature Extraction Procedure . . . . . . . . . . . . . . . . . . . . 68
3.3.2 Linear formulation of Problem . . . . . . . . . . . . . . . . . . . 70
3.3.3 Denoising problem formulation . . . . . . . . . . . . . . . . . . . 70
3.3.4 Signal Reliability Masks . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.5 A Novel Formulation of the Optimization Problem . . . . . . . . 72
3.4 Experimental Setup and Results . . . . . . . . . . . . . . . . . . . . . . 76
3.4.1 Description of Database and Algorithm implementation . . . . . 76
3.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 4: New Methods for Sparse Representation Classification (SRC), with
applications to Underwater Object Classification and Face Recognition 79
4.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Problem Description and Derivation . . . . . . . . . . . . . . . . . . . . 83
4.2.1 Sparse Representation Formulation . . . . . . . . . . . . . . . . . 83
4.2.2 Baseline Sparse Representation-based Classification Schemes . . 85
4.3 An Inspiration - The Monty Hall Problem . . . . . . . . . . . . . . . . 88
4.4 New Algorithmic Approach: Coupling Optimization and Decision Making 88
4.5 The case of orthogonal subspaces . . . . . . . . . . . . . . . . . . . . . . 89
4.6 Feature Selection Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.1 Zernike Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.7 Sidescan Sonar Dataset Description and Preparation . . . . . . . . . . . 95
4.7.1 Underwater Sidescan Sonar Dataset Description . . . . . . . . . 95
4.7.2 Sidescan Sonar Image Dataset Preparation Methodology . . . . . 97
4.8 Experiments, Results and Interpretation . . . . . . . . . . . . . . . . . . 99
4.8.1 Underwater Sidescan Sonar Image Classification Results . . . . . 99
4.8.2 Face Recognition Results . . . . . . . . . . . . . . . . . . . . . . 102
4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Bibliography 105
List of Figures
1.1 Diagram of sparse imputation process . . . . . . . . . . . . . . . . . . . 3
1.2 Visualization of the L_0, L_1, and L_2 penalty functions. In the figure, we can see that the linear L_1 penalty function emulates the L_0 penalty function more closely than the quadratic L_2 penalty function . . . 17
1.3 The stem plot on the top shows the weight vector for 5 iterations of LARS-LASSO, and the one on the bottom shows the weight vector for 20 iterations of LARS-LASSO for a specific optimization problem. While 5 iterations of LARS-LASSO give a sparser representation than 20 iterations of LARS-LASSO from the diagram, 5 iterations of LARS-LASSO will have a higher error in ||B_r a − F_Ur||_2 compared to 20 iterations due to the nature of the LARS implementation . . . 23
1.4 Plot of ASR recognition error vs. average estimation error rate (average across all optimization problems) for the SNR 5 corruption level. We can see that a diminishing estimation error rate results in a much worse performance in terms of ASR recognition rate for both LARS-LASSO and LARS-EN. The highest points in both graphs correspond to the best performance that was evaluated with the algorithms for the tuning set . . . 24
1.5 Spectral plot of one particular imputed utterance using the LARS-LASSO and LARS-EN algorithms for the SNR 5 corruption level. As we can observe from the diagram, the imputed signals have a much closer resemblance to the clean signal than the noisy one with much of the noise artifacts removed. Moreover, we can observe that the imputed result of LARS-EN bears closer resemblance to the clean signal relative to LARS-LASSO, further testifying to the robustness of LARS-EN . . . 26
2.1 Block diagram illustrating a typical speech recognition system, but en-
hanced with our new front-end. The “Spectral Denoiser” module is an
extra module which we introduced into the feature extraction flow that
utilizes the group sparse regularization techniques with appropriate dic-
tionary partitioning for better recognition accuracies. . . . . . . . . . . . 37
2.2 The diagram above shows the spectral plots of a particular utterance. As we can see, LARS-EN improves greatly upon the noisy version with much of the noise artifacts eradicated, and the Group EN algorithm further improves upon the LARS-EN algorithm by a closer resemblance to the original clean spectrum . . . 60
3.1 Schematic of the Spectral Denoiser. The Regularization block can be
broadly split into 3 steps: 1) Windowing 2) Regularization of the Win-
dows (local optimization) 3) Reconciliation of predictions . . . . . . . . 69
3.2 The diagram above shows the spectral plots of a particular noisy utter-
ance. We can think of this as an image and employ denoising techniques
to clean up the noise artifacts evident in the image . . . . . . . . . . . . 70
4.1 A sample object (circled) in a partial sidescan sonar image with its shadow to the right (a colormap has been added to the original grayscale image for better viewing) . . . 81
4.2 Samples of the 4 object types in the NSWC sidescan sonar database . . 96
4.3 Samples of the 7 object types in the NURC sidescan sonar database . . 97
4.4 Segmentation results for a sample of the wedge object class from the
NURC database using the Mean-shift clustering (MSC) algorithm as de-
scribed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5 Demonstration of the sum-rule classification heuristic for a sample sparse representation computed on an overcomplete dictionary comprising 196 atoms for the NWSC database . . . 100
List of Tables
1.1 Parameter variation/tuning for the SNR 5 dB dataset for the Aurora 2.0 database on contents of testa/N1. The last column indicates whether the improvement over LARS-LASSO at the same value of λ_SNR is significant with the Difference of Proportions Test at the 95% confidence level. The best performing result of each algorithm is in bold . . . 32
1.2 Recognition results for different noise corruption values ranging over -5 dB, 0 dB, 5 dB and 10 dB for TEST1. The last column indicates whether the improvement over LARS-LASSO is significant with the Difference of Proportions Test at the 95% confidence level . . . 33
1.3 Recognition results for different noise corruption values ranging over -5 dB, 0 dB, 5 dB and 10 dB for TEST2. The last column indicates whether the improvement over LARS-LASSO is significant with the Difference of Proportions Test at the 90% confidence level . . . 34
1.4 Results for the SNR 5 dB dataset with λ_SNR = 20 dB for the Aurora 2.0 database, with N_train = 1000, 1500, 2000 for TEST1. The last column indicates whether the improvement over LARS-LASSO is significant with the Difference of Proportions Test at the 95% confidence level . . . 34
1.5 Number of Optimization Problems for the TEST1 dataset with SNR 5 dB noise for different values of λ_SNR . . . 35
1.6 Result for the SNR 5 dB tuning set with λ_SNR = 20 dB for the Aurora 2.0 database . . . 35
2.1 Spectral denoising results for the Aurora 2.0 database. . . . . . . . . . . 58
2.2 Average runtime results for the Group Sparse Regularization algorithms 61
2.3 Results for SNR -5 dB, SNR 0 dB, SNR 10 dB corruption and clean
conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.4 Results for Aurora 3.0 noisy dataset . . . . . . . . . . . . . . . . . . . . 63
3.1 Results for various levels of corruption.“CMN” refers to Cepstral Mean
Normalization. “EN Averaging” refers to the procedure where the Elastic
Net is applied to each window and contributions from the windows are
subsequently averaged. “EN Coupled” refers to the new procedure we de-
scribed where a second optimization formulation is employed to reconcile
the predictions. Runtimes are measured in seconds per optimization. Sig-
nificance testing is done at 95% confidence interval with the difference of
proportions test. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1 Object classification accuracies for a 4-fold cross validation for various
competing classifiers for the NWSC database . . . . . . . . . . . . . . . 100
4.2 Object classification accuracies for a 4-fold cross validation for various
competing classifiers for the NURC database . . . . . . . . . . . . . . . 101
4.3 Face recognition results for a 5-fold cross validation of the YaleB face
database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Abstract
This thesis proposes novel variations of Sparse Representation techniques and shows successful applications to a variety of fields such as Automatic Speech Recognition (ASR) denoising, Face Recognition and Underwater Image Classification. A section of the thesis makes new algorithmic contributions in Group Regularization which are able to better handle collinear dictionaries. A new ASR front-end is introduced, which applies these algorithms to feature denoising in the ASR pipeline. The thesis also explores effective ways of partitioning dictionaries for improved speech recognition results over a range of baselines. A new method for combining predictions from different feature windows is also explored. In addition to the denoising section, this thesis proposes new methods for Sparse Representation Classification (SRC) which better couple the regularization and decision making steps. The effectiveness of these new methods is shown in both Face Recognition and Underwater Image Classification. Since all these techniques are domain independent given the right feature extraction procedure and training set, they show great promise for application to a gamut of other areas.
Chapter 1: Enhanced Sparse Imputation Techniques for a Robust Speech Recognition Front-End [59]
1.1 Introduction
Missing data/feature techniques (MDT) have been proposed for noisy signal conditions to compensate for unreliable components of features corrupted by noise. By missing data/feature, we mean problems that are made difficult by the absence of portions of data which take on some known/hypothesized structure. Missing data techniques have been employed in statistics [23] long before their adoption into the speech processing field for Automatic Speech Recognition (ASR). In addition to speech processing, techniques for data imputation (i.e. filling in or substituting for missing data) have also been employed in many other areas for denoising noisy measurements. For example, in the field of genetics [29], the microarrays employed in measuring gene expressions often suffer from the problem of probe noise. Another example is the reconstruction of noisy images [42].
There have been a large number of works pertaining to the topic of MDT and imputation in the speech processing field. For example, in [16, 38], the authors have employed two different statistical methods to infer the unreliable speech data. The first is marginalization, where the likelihood of the incomplete data vector is computed. Using x_r and x_u to denote the reliable parts and the unreliable parts of the feature vector respectively, the method allows computation of p(x_r | C) instead of p(x_r, x_u | C), where C represents the states in a Hidden Markov Model (HMM). A further refinement of this marginalization technique is termed "bounded marginalization", where the integral of the probability density function is taken over a finite range rather than from −∞ to +∞. The second method is to compute the distribution of the unreliable segments of the feature vector instead of the likelihood of the data present. Experimental evaluation on the TIDigits corpus with non-stationary (car/helicopter/factory) noise corruption showed that with these proposed techniques, the performance is much better than the original performance before imputation. In particular, the bounded marginalization method outperforms the second method.
Sparse representation techniques have also been used in the realm of MDT, attempting data reconstruction under the assumption that the signal can be reconstructed by a sparse representation from a dictionary. Sparse representation techniques and Compressive Sensing techniques [11] (where the dictionary obeys the restricted isometry hypothesis) have been used widely, with applications including phonetic classification in Speech Processing [56], as well as Image Processing and Medical Imaging [12, 21, 40, 48, 66]. Recently, Gemmeke et al. [31, 34] and Börgstrom et al. [7, 8] have proposed the use of L_1 optimization techniques for spectral imputation. By L_1 optimization techniques, we are referring to techniques which optimize some error function subject to constraints on the L_1 norm of the solution vector a, defined as follows:

||a||_1 = \sum_n |a_n|    (1.1)

Figure 1.1: Diagram of sparse imputation process
In particular, Gemmeke et al. proposed an imputation framework based on a dictionary of exemplars, and refer to the process as "Sparse Imputation". Fig. 1.1 gives an illustration of the Sparse Imputation process. Both works have experimentally demonstrated the effectiveness of this technique when recovering missing speech components in adverse SNR conditions (SNR ≤ 0 dB). In [31, 34], the authors evaluated the imputation techniques on a time-normalized single digit recognition task. The formulation in [31, 34] assumes a well constructed dictionary for the sparse imputation process, and it was found through experimentation that a dictionary with 4,000 exemplar spectrogram representations yielded the best performance with the LASSO algorithm in terms of speed and accuracy for their dataset. The LASSO algorithm is essentially a variable selection procedure which imposes a constraint on the L_1 norm of the solution vector. Results in [34] have demonstrated considerable improvement of the Sparse Imputation technique with well constructed dictionaries over classical imputation techniques like per-Gaussian-conditioned imputation and cluster-based imputation. An attempt to extend this system to the continuous digit task using LASSO has been reported in [33].
A further extension of the system to Large Vocabulary Continuous Speech Recognition
(LVCSR) has also been explored in [32].
One of the goals of this chapter is to investigate a solution to better exploit the properties of collinear dictionaries of exemplars in the sparse imputation setting for the continuous digits recognition task. We typically desire a dictionary which is less collinear, by which we are referring to a dictionary B for which the value

C = \max_{1 \le i,j \le dim(B,2)} |G(i,j)|,  where G = B^T B    (1.2)

is small [20], which essentially means that the entries of the dictionary have a higher tendency to point in different directions. Here, dim(B,2) refers to the number of columns in B. In particular, in the event of a collinear dictionary, some variable selection procedures such as LASSO do not satisfy oracle properties, meaning that they do not identify the correct subset of predictors to model the observation, and they do not have an optimal estimation rate [70].
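As a concrete illustration of this collinearity measure, the short Python/NumPy sketch below computes the Gram matrix G = B^T B of a dictionary and reports the largest off-diagonal magnitude. It is only an illustrative aid to Equation (1.2), not code from the thesis; normalizing the columns and ignoring the diagonal are assumptions made here so that the value reads as a correlation.

import numpy as np

def dictionary_coherence(B):
    """Largest absolute correlation between distinct dictionary columns."""
    # Normalize columns to unit L2 norm so G holds cosine similarities.
    Bn = B / np.linalg.norm(B, axis=0, keepdims=True)
    G = Bn.T @ Bn                      # Gram matrix, G = B^T B as in Eq. (1.2)
    np.fill_diagonal(G, 0.0)           # ignore trivial self-correlations
    return np.max(np.abs(G))

# Example: two nearly parallel atoms yield a coherence close to 1.
B = np.array([[1.0, 0.99, 0.0],
              [0.0, 0.10, 1.0]])
print(dictionary_coherence(B))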
The two main algorithms we will investigate in this chapter are LASSO [61] and the Least Angle Regression implementation of the Elastic Net (LARS-EN) [71], which is essentially an enhanced version of LASSO and Ordinary Least Squares (OLS). The reason LASSO is chosen as our baseline is that it has been demonstrated in [33] to be an efficient algorithm in the Sparse Imputation framework. In addition, LASSO offers a LARS modification which greatly accelerates its implementation. LARS-EN is chosen because it is theoretically proven to better exploit the properties of a collinear dictionary compared to LASSO, and it offers the LARS modification which allows for fast execution [71].
study of how different degrees of sparsity will affect speech recognition rates and why
a good measure of sparsity is needed for optimal speech recognition results. We will
use the implementation details of both algorithms to explain why a good measure of
sparsity is necessary for optimal speech recognition rates. We demonstrate LARS-EN
to be a significant improvement over LASSO for the continuous digit recognition task
withestimatedmasks. Wealsosupplementtheresultsofevaluationwithsomepopular
regularization techniques for completeness. In this thesis, like many others in the re-
lated literature, we will adopt the Aurora 2.0 noisy digits database for evaluation. The
algorithms we introduce are incorporated into the speech recognition front-end, and
thedenoised/imputedversionofthespeechfeaturesareinturnusedforspeechrecogni-
tion. WewillbeworkingwiththestandardMel-Frequency Cepstral Coefficient (MFCC)
front-end in contrast to [31, 34], which use PROSPECT features [37] for recognition.
Wealsoevaluateourmethodsondictionariesofmultiplesizesincontrastto[31,34]. In
practicalscenarios,itwillbedifficulttotunefordictionarysizes,andthebasisselection
technique employed should ideally be able to select the appropriate sparse basis rep-
resentation regardless of the structure of the dictionary. We will show that LARS-EN
with small dictionary sizes outperforms LARS-LASSO with larger dictionaries for our
digit recognition task.
The organization of this chapter is as follows: Section 1.2 details the framework,
algorithm, and justification as to the choices of the particular algorithms. Section 1.3
provides a description of our experimental setup and the experimental results, as well
as a discussion of our parameter choices, and the results. Section 1.5 concludes with
possible extensions of this work.
1.2 Methodology
1.2.1 Construction of Representation of Test Utterances
We first need to construct a signal observation representation of the test spoken utterance U. In this work, we use frame-level spectral representations of speech rather than time-domain representations. The approaches in [31, 34] considered a fixed length vector representation for each digit. This is done by converting the acoustic feature representation to a time-normalized representation with a fixed number of acoustic feature frames. That reduces the digit recognition into a classification task, since the digit boundaries are assumed to be known. Working with the assumption that the digit boundaries are known is non-trivial in practical settings.

In this chapter, we consider the continuous digit recognition scenario as in [33]. Let the total number of frames for utterance U be denoted T_U. Let the feature vector corresponding to frame i of digit utterance U be denoted f_{U,i}. The feature vector f_{U,i} contains N_B (the number of frequency bands) spectral coefficients corresponding to frame i of utterance U. Let F_U be an N_B × T_U matrix defined as follows:

F_U = [ f_{U,1}  f_{U,2}  ...  f_{U,T_U} ]    (1.3)
We now consider a sliding window extraction of the data in this matrix representation F_U. Define a sliding matrix which has dimensions N_B × T_W, with T_W representing the duration of the sliding matrix. We also define a window shift parameter T_WS, which represents the number of frames by which we shift the sliding matrix.

Through this we obtain a total of ⌈(T_U − T_W)/T_WS⌉ + 1 matrices of feature vectors. For a more efficient implementation of the window extraction algorithm, we zero-pad F_U to be an N_B × k matrix where k = ⌈(T_U − T_W)/T_WS⌉ × T_WS + T_W. Let us denote the m-th window corresponding to utterance U by F_{U;m} = [ f_{U,m,1}  ...  f_{U,m,T_W} ]. Let us denote the linearization of the matrix F_{U;m} to be the following:

F_{U,m} = [ f_{U,m,1}^T  f_{U,m,2}^T  ...  f_{U,m,T_W}^T ]^T    (1.4)
Now, we make the assumption that we can write F_{U,m} as F_{U,m} = B a_m, where F_{U,m} is the observation (feature vector), B is a dictionary of exemplars, and a_m is a vector of weights. We are assuming that each test segment can be written as a linear combination of vectors from the dictionary. This is a reasonable assumption to make and follows the approaches in [31, 34, 48, 65] and many other signal processing applications where a regularized regression setting is desired for denoising. Also, the spectral representations for different realizations of the same word have energy concentrations in similar regions of the time-frequency domain, giving us a reason for using this linear representation. Thus, we will have the following linear representations from our windows:

F_{U,m} = B a_m,   m = 1, ..., ⌈(T_U − T_W)/T_WS⌉ + 1    (1.5)
After the sparse imputation process, we need to reconstruct an imputed representation of the original sliding matrix. Define a counter matrix of dimension N_B × k where k = ⌈(T_U − T_W)/T_WS⌉ × T_WS + T_W. This counter matrix counts the number of times each entry in the matrix F_{U;m} is imputed due to overlapping windows. Formation of the final imputed matrix involves first reshaping the imputed F_{U,m} (the solution to optimizing Equation 1.5) back to dimensions N_B × k, adding all the resulting reshaped frames together, and then doing element-wise division by the entries of the counter matrix. This in effect amounts to averaging the contributions of individual imputations coming from multiple windows. To simplify the notation, we will omit the subscript m in the remainder of this chapter when dealing with the m-th sliding matrix.
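The window extraction and the overlap-averaging reconstruction described above can be summarized in a short Python/NumPy sketch. This is an illustrative reimplementation rather than the thesis's MATLAB code; impute_window is a hypothetical stand-in for whatever routine solves Equation (1.5) for a single window, and the default parameter values follow the tuning reported later in Section 1.3.2.

import numpy as np

def impute_utterance(F_U, impute_window, T_W=35, T_WS=10):
    """Sliding-window imputation with averaging of overlapping predictions.

    F_U           : (N_B, T_U) spectral matrix of one utterance
    impute_window : callable mapping a linearized (N_B*T_W,) window to its
                    imputed version (placeholder for LARS-LASSO / LARS-EN)
    """
    N_B, T_U = F_U.shape
    n_win = max(1, int(np.ceil((T_U - T_W) / T_WS)) + 1)
    k = (n_win - 1) * T_WS + T_W
    F_pad = np.zeros((N_B, k))
    F_pad[:, :T_U] = F_U                     # zero-pad F_U to N_B x k

    accum = np.zeros_like(F_pad)             # sum of imputed windows
    counts = np.zeros_like(F_pad)            # counter matrix of Section 1.2.1

    for m in range(n_win):
        s = m * T_WS
        win = F_pad[:, s:s + T_W].reshape(-1, order='F')   # linearize frames
        accum[:, s:s + T_W] += impute_window(win).reshape(N_B, T_W, order='F')
        counts[:, s:s + T_W] += 1.0

    return (accum / counts)[:, :T_U]         # element-wise average, then crop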
Let us denote the number of utterances in the training data to be used in our dictionary by N_train. We then form a dictionary B = [ B_1  B_2  ...  B_{N_train} ] which consists of segments of clean spectral shapes. This will be our overcomplete dictionary of exemplar spectral segments. We now describe the procedure by which we obtain B_i for i = 1, ..., N_train. To motivate our dictionary choice, we will treat each digit as a recognition unit, and hence we will be putting exemplars of the clean spectral shapes of each recognition unit as entries of our dictionary. We thus only consider the single digit files in our training data for the formation of our dictionary, since we will have whole digit utterances without having to do any forced alignment. We then extract the N_B × T_U matrix containing the spectral coefficients corresponding to those digits. After extraction, a simple time normalization is performed by interpolating the T_U frames of the extracted spectral features of the single digit files to T_W frames. This is done by cubic spline interpolation [46], which retains the spectral shapes of the coefficients fairly well. We then linearize the interpolated matrix to an N_B · T_W × 1 vector which goes into each column of our dictionary B. Note that our dictionary construction retains boundary information to a great extent, which turns out to be instrumental in improving recognition rates. Section 1.3.5.1 will provide experimental evidence to substantiate our dictionary choice over randomly selected fixed-length exemplars for the continuous digit recognition task.
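The dictionary construction amounts to time-normalizing each clean single-digit spectrogram to T_W frames with a cubic spline and stacking the linearized result as one column of B. The NumPy/SciPy sketch below is an illustrative reimplementation under those assumptions (the helper names are hypothetical, and cubic interpolation needs at least four frames per digit):

import numpy as np
from scipy.interpolate import interp1d

def digit_to_atom(S, T_W=35):
    """Interpolate one clean digit spectrogram S of shape (N_B, T_U) to T_W
    frames (cubic spline along time) and linearize it into a single column."""
    N_B, T_U = S.shape
    f = interp1d(np.linspace(0.0, 1.0, T_U), S, kind='cubic', axis=1)
    return f(np.linspace(0.0, 1.0, T_W)).reshape(-1, order='F')   # (N_B*T_W,)

def build_dictionary(clean_digit_spectrograms, T_W=35):
    """Stack one atom per clean training digit: B is (N_B*T_W, N_train)."""
    return np.column_stack([digit_to_atom(S, T_W) for S in clean_digit_spectrograms])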
1.2.2 Signal Reliability Masks
Most works in speech processing MDT [16, 31, 34, 38] define some sort of signal reliability mask for the mel-frequency log-energy coefficients (the popularly used signal representation), which is a matrix the size of the original feature vectors with entries containing 1 to mean that the feature component is reliable and 0 to mean domination by noise. Basically, the mask is defined as:

M(k,t) = 1  if S(k,t)/N(k,t) > λ_SNR,  and 0 otherwise    (1.6)

where S and N stand for the signal and noise respectively, k is the index of the frequency bands, and t is the index of the time frames.

If the signal-to-noise ratio (SNR) is above a certain threshold we deem appropriate, we regard the component as reliable. However, if the SNR is below our threshold, the component is unreliable and will be replaced with the corresponding imputed version from our imputation algorithms.

An oracle mask is a signal reliability mask computed with perfect knowledge of what the noise signal is and what the underlying clean signal is. An estimated mask relaxes the assumption that we have oracle knowledge of the underlying noise characteristics by estimating the noise characteristics from the noisy speech signal itself.

There have been works to estimate masks from just the observed noisy speech data, like in [2], where MDT techniques were evaluated on a variety of masks such as Discrete SNR Masks, Soft SNR Masks, and combined Harmonicity and SNR masks. Another good overview on the topic of mask estimation is given in [15].

We adopt the estimated mask described in [3] rather than using oracle masks as described in [16]. Essentially, we get a local estimate of the SNR by averaging the first 10 frames of the spectral features of the utterance, which contain information preceding the voicing of the digits. This provides a reasonable estimate of the noise, under the condition that the noise is stationary [3], which will be assumed here. An estimate of the clean digit utterance is obtained by subtracting the noise estimate from the noisy digit utterance.
Now that we have a mask giving us an indication of the reliable/unreliable components in the spectrum, we are able to make modifications to our dictionary B. The main idea is that we will only be including the vectors in the dictionary which correspond to reliable components for our imputation process. Let N_r be the number of reliable components in F_U. We define the matrix R to be an N_r × N_B · T_W matrix containing 0's and 1's which extracts the rows of B corresponding to the reliable components of F_U as defined by our estimated mask. Thus, we have

B_r = R B,   F_Ur = R F_U,   F_Ur ≈ B_r a    (1.7)

where B_r is the new dictionary we use for our algorithms and F_Ur contains the reliable components of F_U. For the reconstruction, we will simply impute the components which are defined as unreliable by our SNR mask.
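A compact sketch of the estimated mask of Equation (1.6) and the row restriction of Equation (1.7) is given below in NumPy. It assumes the features are linear mel filterbank energies, a stationary noise floor estimated from the first 10 (pre-voicing) frames, and the 20 dB threshold chosen in Section 1.3.2; it is an illustration of the procedure described in the text, not the thesis code.

import numpy as np

def estimated_mask(F_noisy, lambda_snr_db=20.0, noise_frames=10):
    """Reliability mask per Eq. (1.6): noise estimated from the first frames,
    clean signal estimated by spectral subtraction, threshold in dB."""
    noise_est = F_noisy[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_est = np.maximum(F_noisy - noise_est, 1e-10)
    snr_db = 10.0 * np.log10(clean_est / np.maximum(noise_est, 1e-10))
    return snr_db > lambda_snr_db                  # True marks reliable entries

def restrict_to_reliable(B, F_window, mask_window):
    """Form B_r and F_Ur of Eq. (1.7) by keeping only reliable rows/entries."""
    reliable = mask_window.reshape(-1, order='F')  # linearize like F_{U,m}
    return B[reliable, :], F_window.reshape(-1, order='F')[reliable]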
1.2.3 Noise Model
When noise is added to the speech signal, the spectral coefficients are perturbed. Let us represent this perturbation in our model as

F_Ur = B_r a + z    (1.8)

where z is an N_r × 1 dimensional noise vector. Note that even though we have in principle been dealing with a noiseless signal corresponding to the reliable parts of the data, in practice there will still be a noise component associated with it, which we attempt to capture by the vector z.
We assume that we are dealing with additive noise in the time domain, and hence we have O(t) = S(t) + N(t), where O(t) is the observed speech signal, S(t) is the original clean speech signal, N(t) is the noise signal, and t refers to the time frame. To compute the spectral coefficients using the notation in [17], we do pre-emphasis, frame blocking, windowing (Hamming), and the Short-Time Discrete Fourier Transform (stDFT) to give S(k,l), where k refers to the frequency band index and l refers to the length of the window used for the stDFT. Since the stDFT is linear, the additive time-domain noise analogously becomes additive spectral noise N(k,l). Taking the logarithm, we have log|S(k,l) + N(k,l)|, which we can then write as

log| S(k,l) (1 + N(k,l)/S(k,l)) | = log|S(k,l)| + log| 1 + N(k,l)/S(k,l) |.

We can hence calculate the output Y(i) of the i-th critical band filter by:

Y(i) = \sum_{k=0}^{N'/2} [ log|S(k,l)| + log(1 + N(k,l)/S(k,l)) ] H_i(k 2π/N')    (1.9)

Here H_i is the impulse response of the i-th critical band filter and N' is the number of points used for computing the stDFT. Ideally, a high SNR means that N(k,l)/S(k,l) is close to zero, and so the term log(1 + N(k,l)/S(k,l)) H_i(k 2π/N') in Equation (1.9) will be approximately zero, meaning that the observed spectral component will be close to the true value. However, in reality, there will still be a mismatch between the estimated reliable components and their true values; thus we can attempt to model the term log(1 + N(k,l)/S(k,l)) H_i(k 2π/N') by some appropriate noise model in our optimization problem. Hence, we see that spectral imputation is closely equivalent to imputation of the output features.
1.2.4 Bounded Optimization
Bounded optimization refers to solving the optimization problem such that the optimized value is less than or equal to the original value. Unbounded optimization means that this constraint is ignored. Since we are approximating the additive noise in the time domain by additive noise in the spectral domain as well, the imputed values should technically be less than the original noisy version. However, our optimization problems are generally unbounded; hence this constraint is not guaranteed. To circumvent this problem, as in most works in MDT, we simply opted to impute only if the specific component to be imputed has an inferred value which is less than the original noisy component. In our preliminary experiments, this simple rule generally resulted in better recognition accuracies compared to those of the unbounded situation. This phenomenon has been observed in [34] as well.
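The bounding rule reduces to a simple guard on top of the reconstruction: an unreliable entry is replaced by its imputed value only when that value does not exceed the observed noisy value. A minimal NumPy sketch (names illustrative):

import numpy as np

def bounded_impute(F_noisy, F_imputed, mask):
    """Impute only unreliable entries (mask == False), and only when the
    imputed value is below the observed noisy value."""
    take_imputed = (~mask) & (F_imputed < F_noisy)
    return np.where(take_imputed, F_imputed, F_noisy)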
1.2.5 Formulation of optimization problem
If we consider a regularized least-squares approach for denoising the spectral coefficients, the vector a is assumed to be distributed according to a Gaussian distribution [6]. By a similar token, if we consider a regularized L_1 approach, a is assumed Laplacian. To ensure maximal sparsity, we would ideally like to solve the L_0 optimization problem. For our problem setup, this is equivalent to solving an optimization problem of the form:

\min_a ||B_r a − F_Ur||_2 + λ ||a||_0    (1.10)

Here, the parameter λ controls the sparsity of the vector a; specifically, when we increase the value of λ, a becomes more sparse.

It is a well-known fact that optimizing Equation (1.10) is an NP-hard problem, since it involves searching through \binom{N_train}{k} least-squares problems, where k is the degree of sparsity desired. This can be computationally expensive for our problem at hand since N_train could potentially be large. There have been alternatives proposed which try to get around this problem while still maintaining a good penalty curve approximation to the L_0 solution.
1.2.6 Baseline - The LASSO solution
Schemes presented in [7, 8, 31, 34] have employed the classical L_1 solution for sparse imputation borrowed from compressive sensing [9, 10], and have demonstrated its efficiency in the sparse imputation framework. Furthermore, the penalty curve for L_1 optimization emulates that of the L_0 solution fairly closely (see Fig. 1.2). We will likewise use the classical L_1 solution as our baseline. When applied to our problem setup, we can represent it as the following convex optimization problem:

\min_a ||B_r a − F_Ur||_2 + λ ||a||_1    (1.11)

Note that Equation (1.11) can be equivalently formulated as:

\min_a ||B_r a − F_Ur||_2   subject to   ||a||_1 ≤ t    (1.12)

Here t is the shrinkage parameter, which is inversely related to λ. As t gets smaller, the weight vector a becomes more sparse. Note that it is easy to see that Equations (1.11) and (1.12) are equivalent by the Karush-Kuhn-Tucker (KKT) conditions [9].
There are several efficient algorithms proposed for the solution of (1.11) or (1.12). [61] proposed the "Least Absolute Shrinkage and Selection Operator" (LASSO), which involves a series of quadratic programs. In fact, Equation (1.12) is equivalent to solving a least-squares problem with 2^{N_train} different inequality constraints corresponding to the signs of the components of a:

a_i ≤ t_i   or   a_i ≥ −t_i   for i = 1, ..., N_train    (1.13)
What Tibshirani [61] proposed is that, instead of solving all the inequality constraints at once, we can progressively incorporate the inequality constraints while seeking a solution which still satisfies the KKT conditions. Essentially, the iterative algorithm starts out with just the sign of the least-squares solution in its constraint set. If the next iteration solves the problem, the algorithm is terminated. Otherwise, the violated constraint is added to the constraint set and the algorithm continues.

Another solution proposed for the LASSO algorithm is the Least Angle Regression (LARS) modification for LASSO (LARS-LASSO) by Efron et al. [24]. One of the important results in that paper is the proof that the LARS algorithm yields all LASSO solutions. The LARS algorithm is much faster compared to Tibshirani's original proposal in [61]. It starts by setting all coefficients to zero and then finding the direction of highest correlation with the response vector. It then takes the largest step possible in that direction until some other predictor has as much correlation with the current residual. LARS then continues in the direction equiangular between the two predictors, and this procedure is repeated. This is actually an important property which we will capitalize upon to control sparsity, and the experimental justification for this will be presented in Section 1.3.

Another related work regarding L_1 optimization is Basis Pursuit [13]. We use the LARS-LASSO algorithm as our baseline.
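The baseline can be reproduced with off-the-shelf solvers; the sketch below uses scikit-learn's LassoLars in place of the Sparselab SolveLasso routine used in the thesis, with max_iter capping the number of LARS steps (the sparsity control discussed above). The tiny alpha and the step count are illustrative assumptions, not values prescribed by the text.

import numpy as np
from sklearn.linear_model import LassoLars

def lars_lasso_weights(B_r, F_Ur, n_steps=10):
    """Approximate solution of Eq. (1.11)/(1.12) via the LARS modification,
    stopped after n_steps LARS iterations to keep the weight vector sparse."""
    model = LassoLars(alpha=1e-6, fit_intercept=False, max_iter=n_steps)
    model.fit(B_r, F_Ur)
    return model.coef_          # sparse weight vector a

# The imputed window is then B @ a, with only unreliable entries replaced.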
1.2.7 Drawbacks of the classical LASSO solution for our recognition task and dataset
There have been several studies on the asymptotics and oracle properties of LASSO
such as in [41, 51, 69, 70].
In [69, 70], it has been independently demonstrated that LASSO does not satisfy oracle properties under certain circumstances. Let us define â to be the estimate returned by a variable selection procedure. We define the oracle properties of a variable selection procedure as follows [70]:

• Identify the right subset of predictors to model the observation

• Have an optimal estimation rate, given by √N_r (â − â*) → N(0, Σ), where â* is the estimate for which E[F_Ur | B_r] = B_r â* and Σ is the covariance matrix given that the oracle subset of predictors is known

Without loss of generality, assume that the first p entries of â are non-zero, with p < N_train, where here â is the solution to the optimization problem (1.11). Otherwise, we can easily reorder the columns of B_r = [ B_r_1 ... B_r_{N_train} ] to match this assumption. Also assume that the rest of the entries â_i = 0, for i > p.
Define the covariance matrix C to be:

C = (1/N_r) B_r^T B_r = [ C_11  C_12 ; C_21  C_22 ]    (1.14)

where C is a positive definite matrix. We set

C_11 = (1/N_r) [ B_r_1 ... B_r_p ]^T [ B_r_1 ... B_r_p ]
C_22 = (1/N_r) [ B_r_{p+1} ... B_r_{N_train} ]^T [ B_r_{p+1} ... B_r_{N_train} ]
C_12 = (1/N_r) [ B_r_1 ... B_r_p ]^T [ B_r_{p+1} ... B_r_{N_train} ]
C_21 = (1/N_r) [ B_r_{p+1} ... B_r_{N_train} ]^T [ B_r_1 ... B_r_p ]    (1.15)
We say that the estimator â is sign consistent if and only if

P( sign(a) = sign(â) ) → 1   as   N_r → ∞    (1.16)

Sign consistency is needed for the LASSO estimate to match the true model.

It is proven in [69, 70] that LASSO is sign consistent only if |C_21 C_11^{-1} sign([â_1 ... â_p]^T)| ≤ 1 − η (the Strong Irrepresentable Condition), where 1 is a p × 1 dimensional vector of ones. However, it is easy to see that this condition is easily violated when the columns of B_r are highly collinear or correlated.

In our case, we can expect the spectral profiles (coefficients) for the same digits to be similar. Thus we can expect contiguous entries in the dictionary to be highly collinear. Moreover, in our specific overcomplete dictionary, there are only 11 distinct digits. Thus, we will have a highly coherent dictionary, potentially leading to problems when using LASSO.
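To make the condition concrete, the NumPy sketch below evaluates the left-hand side |C_21 C_11^{-1} sign(â_1 ... â_p)| for a candidate support set; entries at or above 1 indicate a violation. It is only an illustration of the check implied by the discussion above (and assumes C_11 is invertible), not a routine from the thesis.

import numpy as np

def irrepresentable_lhs(B_r, support, signs):
    """Left-hand side of the Strong Irrepresentable Condition for the
    dictionary columns indexed by `support` with sign pattern `signs`."""
    N_r = B_r.shape[0]
    C = (B_r.T @ B_r) / N_r
    in_s = np.asarray(support)
    out_s = np.setdiff1d(np.arange(B_r.shape[1]), in_s)
    C11 = C[np.ix_(in_s, in_s)]           # assumed invertible for the check
    C21 = C[np.ix_(out_s, in_s)]
    return np.abs(C21 @ np.linalg.solve(C11, np.asarray(signs, dtype=float)))

# Highly collinear columns (e.g. repeated exemplars of the same digit)
# typically push some entries past 1, so LASSO is not sign consistent.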
1.2.8 Proposed solution for the continuous digits task
One of the possible alternatives for testing the suitability of the Gaussian noise model is the Ordinary Least-Squares (OLS) method. However, it is well known that OLS is generally inferior in terms of prediction and does not give parsimonious solutions [71]. Moreover, its penalty curve is less effective than LASSO's when used as an approximation of the L_0 penalty curve, since OLS has a quadratic penalty curve. LASSO has a linear penalty curve, which emulates the L_0 penalty curve better than the quadratic one (see Fig. 1.2).

To circumvent the disadvantages of LASSO for our problem as outlined in Section 1.2.7, and also those of OLS, several solutions have been proposed. They include the Elastic Net solution (a variation of regularized least-squares) [71], Sparse Bayesian Learning (SBL) [63, 64], Matching Pursuit [49], and Orthogonal Matching Pursuit [52].
Figure 1.2: Visualization of the L_0, L_1, and L_2 penalty functions. In the figure, we can see that the linear L_1 penalty function emulates the L_0 penalty function more closely than the quadratic L_2 penalty function.
SBL assumes a parametrized prior on the data using a Gamma distribution and, by choice of this distribution, is shown to enjoy sparsity. While SBL is guaranteed to converge to an optimum since it operates with the Expectation-Maximization (EM) algorithm, it could potentially get stuck in a local minimum. In fact, for high λ_SNR thresholds, our experiments with SBL also resulted in lower recognition accuracies compared to LARS-LASSO for the continuous digits recognition task (see Section 1.6 for results). Bayesian compressive sensing techniques have, however, demonstrated success in the phonetic classification task [56].

We have additionally evaluated Matching Pursuit and Orthogonal Matching Pursuit, which are popular L_0-approximation techniques, but both algorithms performed worse than LARS-LASSO in terms of recognition rates for our imputation task (see Section 1.6 for the results of our evaluation).
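For completeness, one of the greedy alternatives mentioned above can be run with scikit-learn's OrthogonalMatchingPursuit. This sketch only illustrates the kind of comparison reported in Section 1.6 and is not the exact implementation used there; the sparsity level of 50 is borrowed from the LARS-EN tuning purely for the example.

from sklearn.linear_model import OrthogonalMatchingPursuit

def omp_weights(B_r, F_Ur, sparsity=50):
    """Greedy L0-approximation baseline: select at most `sparsity` atoms."""
    model = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity, fit_intercept=False)
    model.fit(B_r, F_Ur)
    return model.coef_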
The Elastic Net is our choice for the optimization task due to the advantages advocated by the authors in [71]. In particular, the Elastic Net encourages a "grouping effect" more strongly than LASSO, where highly correlated predictors tend to be selected or excluded together in a more efficient manner. The Elastic Net is also more effective as a variable selection procedure when we encounter a matrix where the number of columns is much greater than the number of rows, as with our current framework. Moreover, the Elastic Net can be viewed as a more general framework that extends LASSO, since LASSO is a special case of the Elastic Net when the coefficient of the L_2 regularization term is set to zero. Computationally, the Elastic Net offers a LARS modification which allows us to create the entire solution path with complexity comparable to that of a single OLS optimization, and thus is efficient compared to OLS.

The "naive" Elastic Net formulation is given by the following:

\arg\min_a ||F_Ur − B_r a||_2^2 + λ_1 ||a||_1 + λ_2 ||a||_2^2    (1.17)
Note that the formulation in Equation (1.17) is very similar to our original formulation in Equation (1.11), with an additional regularization term on the L_2 norm.

The reason the formulation in Equation (1.17) is called "naive" is that experimental evidence by the authors in [71] showed that it does not perform up to expectations unless it is close to ridge regression or LASSO. The solution to this is scaling the solution of the "naive" Elastic Net as follows:

a_{Elastic Net} = (1 + λ_2) × a_{"naive" Elastic Net}    (1.18)

From this basic formulation, it is possible to prove that the Elastic Net overcomes some of the limitations of LASSO [71]; most importantly, it is able to better exploit the properties of highly collinear/correlated dictionary entries. Specifically, in the event of a group of highly correlated vectors, LASSO has a tendency to randomly pick one from the group without regard for which one is selected. The Elastic Net, however, has been demonstrated to select these "grouped" variables more efficiently.

We will be using the Least Angle Regression implementation of the Elastic Net (LARS-EN), which implements the Elastic Net in an efficient manner as mentioned above. Further details of this implementation can be found in [71].
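The Elastic Net can be obtained from any LASSO/LARS solver through the data augmentation argument in [71]: solving a LASSO on the augmented dictionary [B_r; sqrt(λ_2) I] with the augmented observation [F_Ur; 0] gives the "naive" Elastic Net, which is then rescaled by (1 + λ_2) as in Equation (1.18). The sketch below follows that construction with scikit-learn's LassoLars standing in for the LARS engine; the λ_2 value, the small alpha, and the early-stopping step count are illustrative assumptions rather than values from the thesis.

import numpy as np
from sklearn.linear_model import LassoLars

def lars_en_weights(B_r, F_Ur, lambda2=1.0, n_steps=50):
    """Elastic Net via augmented-data LASSO, solved with LARS and rescaled
    per Eq. (1.18). n_steps caps the LARS iterations (the sparsity degree)."""
    n, p = B_r.shape
    B_aug = np.vstack([B_r, np.sqrt(lambda2) * np.eye(p)])   # [B_r; sqrt(l2) I]
    F_aug = np.concatenate([F_Ur, np.zeros(p)])              # [F_Ur; 0]
    naive = LassoLars(alpha=1e-6, fit_intercept=False, max_iter=n_steps)
    naive.fit(B_aug, F_aug)
    return (1.0 + lambda2) * naive.coef_                     # corrected Elastic Net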
1.3 Description of dataset, experimental setup, and experimental results
1.3.1 Experimental Setup
1.3.1.1 Database
For our recognition system, we use all of the 8040 clean training files (containing single
and continuous digit utterances) provided in the Aurora 2.0 database training set to
train a continuous digit recognizer in HTK [67].
For the continuous digit recognition task, the Aurora database consists of test sets
labeled N1, N2, N3 and N4 (corresponding to subway, babble, car and exhibition noise
respectively) in the Test Set A subset. We used the N1 folder for tuning of our opti-
mization parameters. For the test sets, we created two test sets as follows:
• TEST1: merging N1, N2, N3 and N4, giving us a total of 4004 files
• TEST2: merging N2, N3 and N4 (exclusion of the N1 folder), giving us a total of
3003 files
We evaluated our algorithms on different SNR conditions: SNR -5 dB, SNR 0 dB,
SNR 5 dB and SNR 10 dB.
To form our dictionary B, we take all the single digit audio files in the training data, which total 2412 files, and interpolate them to T_W frames as described above. We then form B with each column representing the interpolated spectral components from the 2412 files, giving us a matrix B of dimensions N_B · T_W × 2412. We also experiment with different dictionary sizes, namely N_train = 1000, 1500, 2000, 2412.
1.3.1.2 ASR Features
We train the recognizer on MFCCs with the first and second derivatives, with 16 states total. We use 23 frequency bands (N_B = 23), a Hamming window size of 25 ms, and a frame shift of 10 ms. For the delta and delta-delta coefficients, we set the DELWINDOW and ACCWINDOW parameters in HTK to both equal 2 frames.

The feature extraction for the 23 spectral coefficients is done in MATLAB. We then optimize upon these spectral coefficients with the optimization algorithms described in Section 1.2.8. From the denoised spectral coefficients, we reconstruct (again using MATLAB) the 13 MFCC coefficients with the first and second derivatives, which are then fed to the HTK continuous digit recognizer that we have trained.
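The reconstruction from the denoised spectral coefficients back to the 39-dimensional MFCC feature vector can be sketched as follows. The sketch assumes the denoised coefficients are log mel filterbank energies and uses a generic DCT plus regression-delta recipe, so it approximates the MATLAB/HTK pipeline described above rather than replacing it; boundary frames are handled crudely by wrap-around.

import numpy as np
from scipy.fftpack import dct

def spectra_to_mfcc(log_mel, n_ceps=13, delta_win=2):
    """Map (N_B, T) log mel energies to stacked MFCC + delta + delta-delta."""
    ceps = dct(log_mel, type=2, axis=0, norm='ortho')[:n_ceps, :]   # (13, T)

    def deltas(x, w=delta_win):
        # HTK-style regression deltas over a +/- w frame window
        # (np.roll wraps at the edges, a simplification of HTK's padding).
        num = sum(t * (np.roll(x, -t, axis=1) - np.roll(x, t, axis=1))
                  for t in range(1, w + 1))
        return num / (2.0 * sum(t * t for t in range(1, w + 1)))

    d1 = deltas(ceps)
    return np.vstack([ceps, d1, deltas(d1)])    # (39, T) feature matrix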
1.3.1.3 Algorithm Implementation details
Both of our optimization algorithms are implemented using MATLAB. The LARS-LASSO baseline was implemented using Sparselab, available at http://sparselab.stanford.edu, which provides a MATLAB routine called SolveLasso to solve the LASSO formulation using the LARS modification.
1.3.2 Tuning for parameters
For parameter tuning, we used a smaller subset of the test files we have. We took 1001
test files from the test set testa/N1 as our tuning set. The tuning results are presented
in Table 1.1.
As evident from Table 1.1, we used the SNR 5 dataset to tune for a suitable reliability threshold λ_SNR. Since we are using an estimated mask, we want more accurate components to enter our optimization matrix. Thus we need to set our confidence level to be sufficiently high to eliminate bad estimates. At the same time, if we set the threshold λ_SNR too high, too few components will enter our optimization matrix and we will have an ill-defined optimization problem. Thus, it is important to strike a balance between these two factors. We decided upon a λ_SNR threshold of 20 dB after some experimentation (see results in Table 1.1).

Initial experimentation with several window lengths showed that T_W = 35 (which is also the average digit duration in the training set) is a good window length to choose for our particular database. For the frame shift parameter, we experimented with several values using LARS-LASSO as the tuning algorithm. We found that T_WS = 10 provided the best results for the SNR 5 set. Note that for our tuning set using LARS-LASSO, T_WS = 1 gave a recognition rate of 63.24%, T_WS = 5 a recognition rate of 65.04%, T_WS = 10 a recognition rate of 65.88%, and T_WS = 15 a recognition rate of 45.38%. The reason behind the differences with [33] is a different dictionary construction. Thus, T_WS has a different optimal point to ensure optimal superpositioning of the imputed results for the best speech recognition rates.

As for the algorithmic parameters, we decided upon 10 iterations of LARS-LASSO and a sparsity degree of 50 for LARS-EN by experimentation with our tuning set (see Table 1.1). Note that the number of iterations and the sparsity degree are in fact related; the more iterations of the LARS algorithm we take, the less sparse our solution vector will be (see Fig. 1.3).
1.3.3 Experimental Results for Continuous Digit Recognition Task
For the continuous digit task, we evaluated LARS-EN and LARS-LASSO (baseline).
We make the following observations from our experimental results given in Tables 1.1, 1.2, and 1.3:
1. LARS-EN consistently out-performs LARS-LASSO in terms of recognition accuracy. This can be explained by LARS-EN being more adept at handling a collinear dictionary as compared to LARS-LASSO.

2. As we start from a degree of sparsity of zero and continue increasing the value, the recognition rate increases. However, there is a point of saturation; any further increase in the number of non-zero components leads to degradation of the speech recognition performance. As we start increasing the number of non-zero entries in a, we keep getting a better representation of the observation vector F_Ur. However, after a point, the representation becomes poor, due mainly to the fact that the LARS modification is essentially a greedy approach. As is characteristic of most greedy algorithms (like Forward Selection/Forward Stagewise), each movement is in the most promising direction, and it is highly likely that as more iterations are taken, some important covariates are missed, resulting in errors in speech recognition. Each step forward in a different direction corresponds to an increase in one degree of sparsity. This testifies to the fact that a good sparsity measure helps in improving recognition rates.

Another reason why sparsity is needed is that it helps prevent overfitting by ensuring that the more relevant components are chosen.
3. As we increase λ_SNR, the recognition improves up to a point where it saturates. See Section 1.3.2 for further details.

4. The recognition rates generally do not depend on the magnitude of the estimation error ||F_Ur − B_r a||_2, but rather on the quality of the covariates selected. This is in fact an interesting and important observation. For example, for LARS-LASSO, we see from Table 1.1 that 10 iterations of the algorithm give a better performance than 20 iterations. In fact, a small number of iterations gives a sparser vector a compared to a larger number of iterations of the algorithm, but at the expense of a higher error in ||B_r a − F_Ur||_2, by the nature of the LARS implementation. The values of ||B_r a − F_Ur||_2 for the Sparselab implementation of LARS-LASSO can be verified by setting the Verbose option to True. Fig. 1.3 gives examples of stem plots of the solution vectors for a specific optimization. Fig. 1.4 shows a plot of the recognition rate vs. the average estimation error across all the optimizations of our tuning set. This can be explained by the fact that ASR accuracy is determined more by the relevant covariates that the regression technique selects than by the absolute error that the optimization problem seeks to minimize. In fact, it is apparent from our experiments that if bad covariates are included, the recognition rates suffer even when the estimation error is much lower.

Another connection to sparsity is to relate this observation to Equation 1.11. Note that an increase in the magnitude of λ implies an increase in sparsity. When λ becomes big, the role that ||B_r a − F_Ur||_2 plays in the optimization problem decreases. This further reinforces the fact that the appropriate degree of sparsity helps in improving recognition rates, rather than a low magnitude of the estimation error ||B_r a − F_Ur||_2.

Figure 1.3: The stem plot on the top shows the weight vector for 5 iterations of LARS-LASSO, and the one on the bottom shows the weight vector for 20 iterations of LARS-LASSO for a specific optimization problem. While 5 iterations of LARS-LASSO give a sparser representation than 20 iterations of LARS-LASSO from the diagram, 5 iterations of LARS-LASSO will have a higher error in ||B_r a − F_Ur||_2 compared to 20 iterations due to the nature of the LARS implementation.

Figure 1.4: Plot of ASR recognition error vs. average estimation error rate (average across all optimization problems) for the SNR 5 corruption level. We can see that a diminishing estimation error rate results in a much worse performance in terms of ASR recognition rate for both LARS-LASSO and LARS-EN. The highest points in both graphs correspond to the best performance that was evaluated with the algorithms for the tuning set.
1.3.4 Insight into the mechanism behind the imputation process
Fig. 1.5 shows the spectral plots of one particular utterance. As we can observe, the
imputed spectral plots bear much closer resemblance to the clean signal than the noisy
one, with much of the noise artifacts removed. In particular, we can distinctly observe
that the imputed result with LARS-EN bears a closer resemblance to the clean signal
Figure 1.5: Spectral plot of one particular imputed utterance using the LARS-LASSO
and LARS-EN algorithms for the SNR 5 corruption level. As we can observe from the
diagram, the imputed signals have a much closer resemblance to the clean signal than
the noisy one with much of the noise artifacts removed. Moreover, we can observe that
the imputed result of LARS-EN bears closer resemblance to the clean signal relative to
LARS-LASSO, further testifying to the robustness of LARS-EN.
than LARS-LASSO. This further testifies to the robustness of LARS-EN given our experimental setup/conditions.

In fact, the regularization techniques, when deployed in the spectral domain, can be viewed as spectral denoising. In each frame of imputation, we are essentially doing some form of spectral profile identification. To further reinforce this intuition, the first step of LARS-LASSO involves finding the projection of the observation vector onto the dictionary B_r. Thus, what we are doing is essentially a form of spectral profile identification, taking into account a noise model. The sliding window framework simply reconciles all the possible predictions by careful superposition of the predictions and then averaging them.
1.3.5 Investigation of various dictionary structures
1.3.5.1 Whole digits vs. randomly selected fixed length exemplars
For λ_SNR = 20 dB, we conducted experiments with both LARS-LASSO and LARS-EN using randomly selected fixed length exemplars from the training data as described in [33]. For our tuning set and a dictionary size of 2412, 10 iterations of LARS-LASSO gave a recognition rate of 58.70%, and LARS-EN with a sparsity degree of 50 gave a recognition rate of 61.50%. We see that the choice of whole digit exemplars yields a better recognition rate (65.88% for LARS-LASSO and 71.83% for LARS-EN). This can be explained by the fact that our dictionary choice retains digit boundary information to a greater extent than the dictionary choice of randomly selected fixed length exemplars. Moreover, the transition information is not as significant in the digit recognition setting as in the phoneme recognition/LVCSR setting, since the acoustic distance between digits is significantly more distinct than that between phonemes.
1.3.5.2 Investigation of varying dictionary sizes
We next investigate the effectiveness of the various basis selection methods with dictionaries of varying sizes and also investigate the possible relationship to recognition accuracies. For this section, we just consider the results on the dataset with SNR 5 dB corruption. For the dictionary sizes, we consider N_train = 1000, 1500, 2000. These dictionaries are random subsets of the original dictionary of size 2412.

From our results in Table 1.4, it is apparent that the LARS-EN algorithm is consistently better than LARS-LASSO in terms of improving the recognition accuracy of the ASR. Thus LARS-EN does the most robust job in basis selection, regardless of dictionary size. Note that LARS-EN with a smaller dictionary even outperforms LARS-LASSO with a bigger dictionary, as evident from Table 1.4. This demonstration can be useful when we want to port a similar framework to a full LVCSR system rather than the current continuous digits task. This is because optimization in those cases can be expensive in terms of computational power if numerous occurrences of each word have to be included in the dictionary, resulting in an overly large dictionary. Now, with this verification of an efficient basis selection technique for improved recognition, we see that dictionary construction can be a much easier task, since we can choose a smaller subset of the original overcomplete dictionary for our imputation process. Hence, it will definitely be wiser to opt for a smaller dictionary.
1.4 Discussion of Practicality of Implementation in Real-
time Systems
For our dataset of 4004 test files (TEST1), the number of complete optimization prob-
lemsrequiredtofullyimputeallwindowsisanywherebetween10000and56577(upper-
limit for our dataset) depending on the value of λ
SNR
chosen. See Table 1.5 for the
28
numberofoptimizationproblemsfortheSNR5dBdatasetfordifferentvaluesofλ
SNR
.
Specifically,forourthresholdof λ
SNR
=20dB,wehave51260optimizationproblemsto
solve. This is a computationally expensive procedure if we are considering implemen-
tation on a real-time system and renders iterative algorithms like the Adaptive LASSO
[70] and the Reweighted-L
1
algorithm [12] impractical.
TheLARSimplementationofLASSOandElasticNetprovidesanaccelerationover
classical implementations [24, 32]. For our MATLAB implementation of both LARS-
LASSO and LARS-EN, the entire sparse imputation process for a test utterance gen-
erally finishes between 1 to 20 seconds on a Core 2 Quad Processor with 8 gigabytes of
RAM for the SNR 5 dB corruption. However, MATLAB is generally slower due to it
being a high-level language. When porting this to a real ASR system, we can expect
execution time to improve when we are coding in lower level languages such as C. This
greatly increases the feasibility of porting our algorithms to an LVCSR system.
1.5 Conclusion and Extensions
We showed that the LASSO solution for sparse imputation is relatively less effective
(theoretically and experimentally) in improving the accuracies of the continuous digit
recognition task as compared to the Elastic Net algorithm. The LARS-EN algorithm
proved to be the more robust of the two algorithms under our specific experimental
conditions (test set, dictionary, training set, parameters choice). We have also seen
the effects of appropriate sparsity in helping speech recognition accuracies. The lesson
learntisthatitisthequalityofthecovariatesthatmatters,notthequantity. Moreover,
we believe that with appropriate noise models the quality of speech recognition can be
improved, and we intend on investigating that in our future work.
An immediate extension to the work in this chapter will be to extend our work to a
full LVCSR system similar to that described in [32]. The number of spectral segments
29
would increase likewise, and efficient dictionary creation would become a challenge.
Thus, basis selection, appropriate noise models, sparsity and algorithmic complexity
will play an even more important role in large systems, and the techniques that we
proposedtodeal withthedigitstask canbeanalogouslyextendedtodealwithalarger
and more general framework. Moreover, for the LVCSR system, it will be useful to
explore the effects of dictionary sizes on the overall imputation time and the effects on
recognition rates.
For the LVCSR system, transitions between words/phonemes can play a bigger role
than in the digits recognition case. Thus, rather than interpolating or doing a random
selection of exemplars for dictionary construction, a more informed choice of selecting
representative exemplars could potentially lead to improvements of recognition rates.
This is a line of work we intend to pursue in future.
FutureworkontheMFCCsparseimputationfront-endcanincludetheinvestigation
ofcombiningdimensionalityreductiontechniqueslikeHDA[43]withsparseimputation
to see if the effects of dimensionality reduction can be capitalized when we are doing
basisselection. Ifwedodimensionalityreduction, thenumberofrowsofthedictionary
B will decrease, and thus the number of entries in the basis can be reduced too. With
a smaller matrix B, we expect to reduce operation time which will be desirable in a
larger system.
1.6 Evaluation of other regularization/optimization tech-
niques
We implemented the algorithms in Table 1.6 in MATLAB. The SBL algorithm was
implemented using the Sparse Bayes toolbox available at
http://www.miketipping.com/index.php?page=rvm.
30
The MP and OMP algorithms were implemented using Sparselab available at
http://sparselab.stanford.edu.
31
Table 1.1: Parameter variation/tuning for SNR 5 dB dataset for the Aurora 2.0
database on contents of testa/N1. The last column indicates whether the improve-
ment over LARS-LASSO at the same value of λ
SNR
is significant with the Difference of
ProportionsTestat95%confidencelevel. Thebestperformingresultofeachalgorithm
is in bold
Alg
λ
SNR
in
dB
Degree of
Sparsity
Iterations
Accuracy
(%)
Significant
?
Unimputed NA NA NA 52.48 NA
LARS-LASSO -13.98 NA 5 22.45 NA
LARS-LASSO 3 NA 5 38.50 NA
LARS-LASSO 7 NA 5 54.66 NA
LARS-LASSO 7 NA 10 60.49 NA
LARS-LASSO 10 NA 5 58.92 NA
LARS-LASSO 10 NA 10 63.64 NA
LARS-LASSO 10 NA 20 63.52 NA
LARS-LASSO 20 NA 10 65.88 NA
LARS-LASSO 20 NA 15 63.19 NA
LARS-LASSO 30 NA 10 58.59 NA
LARS-EN 3 50 NA 53.65 YES
LARS-EN 7 50 NA 65.32 YES
LARS-EN 10 5 NA 60.38 NO
LARS-EN 10 10 NA 63.19 NO
LARS-EN 10 15 NA 66.33 YES
LARS-EN 10 30 NA 67.23 YES
LARS-EN 10 50 NA 67.12 YES
LARS-EN 20 30 NA 71.72 YES
LARS-EN 20 50 NA 71.83 YES
LARS-EN 20 70 NA 71.83 YES
LARS-EN 30 50 NA 69.25 YES
32
Table 1.2: Recognition results for different noise corruption values ranging from -
5 dB, 0 dB, 5 dB and 10 dB for TEST1. The last column indicates whether the
improvement over LARS-LASSO is significant with the Difference of Proportions Test
at 95% confidence level
Alg
Degree of
Sparsity
Iterations
Accuracy
(%)
Significant
?
SNR = -5 dB
Unimputed NA NA 7.89 NA
LARS-LASSO NA 10 29.74 NA
LARS-EN 50 NA 31.04 YES
SNR = 0 dB
Unimputed NA NA 18.59 NA
LARS-LASSO NA 10 44.54 NA
LARS-EN 50 NA 46.83 YES
SNR = 5 dB
Unimputed NA NA 43.82 NA
LARS-LASSO NA 10 64.28 NA
LARS-EN 50 NA 66.36 YES
SNR = 10 dB
Unimputed NA NA 68.89 NA
LASSO NA 10 79.17 NA
LARS-EN 50 NA 82.16 YES
33
Table 1.3: Recognition results for different noise corruption values ranging from -
5 dB, 0 dB, 5 dB and 10 dB for TEST2. The last column indicates whether the
improvement over LARS-LASSO is significant with the Difference of Proportions Test
at 90% confidence level
Alg
Degree of
Sparsity
Iterations
Accuracy
(%)
Significant
?
SNR = -5 dB
Unimputed NA NA 7.95 NA
LARS-LASSO NA 10 29.33 NA
LARS-EN 50 NA 30.61 YES
SNR = 0 dB
Unimputed NA NA 17.59 NA
LARS-LASSO NA 10 43.41 NA
LARS-EN 50 NA 45.35 YES
SNR = 5 dB
Unimputed NA NA 40.95 NA
LARS-LASSO NA 10 63.69 NA
LARS-EN 50 NA 64.35 NO
SNR = 10 dB
Unimputed NA NA 65.77 NA
LASSO NA 10 78.31 NA
LARS-EN 50 NA 80.92 YES
Table 1.4: Results for SNR 5 dB dataset with λ
SNR
= 20 dB for the Aurora 2.0
database,withN
train
=1000,1500,2000forTEST1. Thelastcolumnindicateswhether
the improvement over LARS-LASSO is significant with the Difference of Proportions
Test at 95% confidence level
Alg
Degrees of
sparsity
Iterations
Accuracy
(%)
Significant?
Unimputed NA NA 43.82 NA
N
train
=1000
LARS-LASSO NA 10 62.62 NA
LARS-EN 50 NA 65.55 YES
N
train
=1500
LARS-LASSO NA 10 63.47 NA
LARS-EN 50 NA 66.36 YES
N
train
=2000
LARS-LASSO NA 10 64.01 NA
LARS-EN 50 NA 66.24 YES
34
Table 1.5: Number of Optimization Problems for the TEST1 dataset with SNR 5 dB
noise for different values of λ
SNR
λ
SNR
in dB Optimizations
0 56568
5 56534
10 56223
15 54536
20 51260
25 44051
Table 1.6: Result for SNR 5 dB tuning set with λ
SNR
= 20 dB for the Aurora 2.0
database.
Algorithm
Stopping
condition
Accuracy (%)
OMP N
r
iterations max 25.81
SBL 100 iterations max 38.27
MP 100 iterations max 63.30
LASSO 10 iterations 63.64
LARS-EN Sparsity degree of 50 71.83
35
Chapter 2:
Novel Variations of Group Sparse
Regularization Techniques, with
Applications to Noise Robust
Automatic Speech Recognition
[58]
2.1 Introduction
Regularization techniques are commonly employed in statistics, natural sciences and
engineering. Inthisthesis,weareinterestedinthespecificapplicationofregularization
techniques to spectral denoising, with the aim of improving Automatic Speech Recog-
nition [17] (ASR). We begin with a brief survey of relevant efforts in this direction.
Recently, Gemmeke et al. [33] and B¨ orgstrom et al. [7] have proposed the use of L
1
36
Figure2.1: Blockdiagramillustratingatypicalspeechrecognitionsystem,butenhanced
with our new front-end. The “Spectral Denoiser” module is an extra module which we
introduced into the feature extraction flow that utilizes the group sparse regularization
techniques with appropriate dictionary partitioning for better recognition accuracies.
optimization techniques for spectral denoising in speech recognition. Gemmeke et al.
have coined the process “Sparse Imputation” and demonstrated its efficiency over clas-
sical missing data techniques. However, it is well known that L
1
techniques do not
generallyfairwellunlesscertainpropertiesofthedictionariesaresatisfied,andananal-
ysis of these properties is reported in [59]. In particular, we [59] proposed the use of
the Elastic Net for better exploiting the characteristics of a coherent dictionary, and
alsoprovidedarigorousjustifcationofwhysparsityisnecessaryfortheimprovementof
speech recognition rate. We demonstrated significant improvement over LASSO-based
strategiesforspectraldenoising. Ablockdiagramrepresentationoftheprocessisgiven
in Fig. 2.1.
In most regularization applications, a linear model is assumed:
f =Φx (2.1)
Here,f istheobservedfeaturevector,Φisabasis/dictionary,andxistheactivation.
Frequently,thegoalistofindasparseormaximallysparsexwhichbestreconstructsf.
Bysparsesignals,wemeansignalsthatarezeroeverywhereexceptonaminimalsupport
37
of the solution space [35]. Ideally, we would like to solve the following optimization
problemwhichoptimizestheL
p
norm(theL
p
normisdefinedas||x||
p
=(
P
n
k=1
|x
k
|
p
)
1
p
):
min
x
kΦx−fk
2
2
+λkxk
p
(2.2)
For a maximally sparse solution vector, we ideally would like to solve the following:
min
x
kΦx−fk
2
2
+λkxk
0
(2.3)
However,itiswellknownthatthisproblemisNP-hard,sincesolvingEquation(2.3)
willinvolvesearchingthrough
n
k
least-squareproblems,where ndenotesthedimen-
sions of the activation x, assuming that x is k-sparse (having k nonzero activations).
However, there are some good greedy-based solutions which seek to approximate the
solution to Equation (2.3), with examples including Matching Pursuit (MP) [49] and
Orthogonal Matching Pursuit (OMP) [52].
It is typical that we consider a convex relaxation of the above problem to optimize
the L
1
norm instead:
min
x
kΦx−fk
2
2
+λkxk
1
(2.4)
There exist many efficient solutions for solving Equation (2.4), examples including
the Least Absolute Shrinkage and Selection Operator (LASSO)[61]andthe Least Angle
Regression (LARS) algorithm [24].
In this thesis, we are more interested in a particular extension of the LASSO model
to incorporate grouping information, given by the following:
min
x
N
X
i=1
Φ
i
x
i
−f
2
2
+λ
N
X
l=1
kx
l
k
1
(2.5)
38
Here, Φ
i
stands for partitions of the dictionary andx
i
stands for the corresponding
activations. Yuan and Lin [68] proposed an efficient solution for resolving Equation
(2.5). Meier et al. [50] have proposed a variation which solves the problem efficiently
for logistic regression models. However, it has been pointed out in [28] that the group
LASSO does not yield sparsity within a group. Thus, whenever a group has some non-
zero parameters, they will likely be all non-zero. Hence, the proposal is to consider the
following problem formulation instead:
min
x
N
X
i=1
Φ
i
x
i
−f
2
2
+λ
1
N
X
l=1
kx
l
k
1
+λ
2
N
X
l=1
kx
l
k
2
(2.6)
This is known as the Sparse Group LASSO (SGL). It is a more general formulation
than the Group LASSO, since when λ
2
= 0, we will have the Group LASSO. Experi-
mentalsimulationin[28]hasprovidedpromisingevidenceofamorerobustperformance
in the presence of collinear dictionaries over the Group LASSO and the LASSO algo-
rithms. While [28] has proposed a 1-dimensional optimization based on the Golden
Section Search algorithm to capitalize on the separable penalty function, there are
other efficient solutions to the formulation in Equation (2.6). Notably, the software
SLEP[47]implementstheSGLformulationusingaversionof Fast Iterative Shrinkage-
Thresholding Algorithm (FISTA) [5].
The first contribution of this thesis chapter is to propose novel variations along the
lines of the SGL formulation to better enforce sparsity within a group. Specifically,
we propose two algorithms, one which follows the theme of the Least Angle Regression
implementationoftheElasticNet[71],andtheother,alongthelinesofSparse Bayesian
Learning (SBL)[63]. The SBL algorithm boasts of a prior which is able to enforce
sparsity efficiently, and in this chapter, we will experimentally justify this robustness
for our spectral denoising task for improving speech recognition.
39
The second contribution of this chapter is the study of the effects of grouping spec-
tralatomsinthedictionaryonspeechrecognitionrates. Wediscussgroupingtechniques
for enhancing the spectral denoising process for a dictionary consisting of a variety of
spectral exemplars. In particular, we explore groupings based on speaker identity, and
alsobasedonL
2
distancebetweenthefeaturevectors. ThroughexperimentsontheAu-
rora 2.0 noisy digits database and the Aurora 3.0 real noisy data, we demonstrate that
clustering based on both strategies will lead to an appreciable increase in speech recog-
nitionrates. Infact,experimentalevidenceshowthatthesegroupingstrategies,coupled
with an appropriate group sparse regularization technique yield better accuracies than
the Elastic Net algorithm.
The structure of this chapter is as follows: Section 2.2 details the derivation of our
proposedalgorithms,theoreticaljustificationastothesuitabilityofthesealgorithmsin
the denoising framework, and a description of the feature extraction process. Section
2.4describesourdataset,experimentalresultsandinterpretationoftheresults. Finally,
Section 2.5 concludes with some possible consequences and extensions of our work in
this thesis chapter.
2.2 Methodology
2.2.1 Notation description
Before delving into main results of this chapter, we first introduce the notational con-
ventions adopted in the chapter.
For matrices, we use a bold uppercase font - for example, F denotes a matrix of
features; in speech recognition the columns typically would correspond to sequence of
spectral frame information (See Sec 2.4). For vectors, we use a bold lowercase font -
40
for example f refers to a vector of features. Otherwise, a normal font refers to a scalar
value - for example, f can refer to a particular entry in our feature vector.
Whenweaddahatontopofavariable,unlessstatedotherwise,werefertoaresult
returned from an optimization formulation, for example, b x can refer to the solution
returned from Equation (2.4).
2.2.2 Basis Selection Techniques
2.2.2.1 Partitioning the dictionary
In order to form our dictionary Φ = [φ
1
φ
2
... φ
N
train
], where N
train
is the number of
training samples in the dictionary, we stack spectral exemplars as the columns of the
dictionary.
In order to further capitalize on the structure of the dictionary, we assume that we
can partition the dictionary as
Φ = [φ
1
φ
2
... φ
N
train
]
= [Φ
1
|Φ
2
| ...|Φ
N
C
] (2.7)
In equation (2.7), “|” represents the partition boundaries, Φ
i
is a matrix comprised
of some columns from Φ, and N
C
denotes the number of partitions (clusters) we have.
When N
C
=N
train
,wejusthaveeachpartitiontoconsistofacolumnofthedictionary.
Later sections in this chapter will describe efficient partitioning schemes that will be
crucial for improving speech recognition rates.
In view of the dictionary partitioning, we can reformulate our problem in Equation
(2.2) as the following:
41
min
x
N
C
X
l=1
Φ
l
x
l
−f
2
2
+λ
N
C
X
l=1
s
l
kx
l
k
p
(2.8)
s
l
refers to a penalty rescaling factor with respect to the dimensionality of x
l
.
2.2.2.2 Better exploitation of a collinear dictionary
In most speech data, we can expect that the spectral profiles for the same
phonemes/words(ingeneral,asoundunit)tobesimilar. Thuswecanexpectgroupsof
contiguous entries in the dictionary to be highly collinear. As a result of the collinear
dictionary, a method like LASSO will not select the relevant atoms efficiently due to
the fact that it does not discriminate between the collinear entries well enough [59]. In
particular, let us define the covariance matrix C to be:
C=
1
dim(Φ)
Φ
T
Φ=
C
11
C
12
C
21
C
22
(2.9)
where C is a positive definite matrix. Suppose we arrange the columns of Φ such
that Φ=
φ
1
...φ
N
train
and φ
1
,...,φ
k
corresponding to the k non-zero activations. We
set
C
11
=
1
dim(Φ)
[φ
1
... φ
k
]
T
[φ
1
... φ
k
]
C
22
=
1
dim(Φ)
φ
k+1
... φ
N
train
T
φ
k+1
... φ
N
train
C
12
=
1
dim(Φ)
[φ
1
... φ
k
]
T
φ
k+1
... φ
N
train
C
21
=
1
dim(Φ)
φ
k+1
... φ
N
train
T
[φ
1
... φ
k
] (2.10)
We say that the estimator b x is sign consistent if and only if
42
P(sign(x)=sign(b x))→1 as rank(Φ)→∞ (2.11)
Sign consistency is a neccessary condition for the LASSO estimate to match the
true model. It has been demonstrated in [69, 70] that LASSO is sign consistent only if
|C
21
C
−1
11
sign([b x
1
... b x
p
]
T
)|≤ 1−η (Strong Irrepresentable Condition). However, this
condition has a high possibility of being violated when the columns of Φ are highly
collinear, a pressing issue in contending with speech data.
Theworkin[59]hasshownthattheElasticNetsignificantlyoutperformsLASSOin
termsofspeechrecognitionaccuraciesinthespectraldenoisingframeworkbybeingable
tobetterexploitthepropertiesofacollineardictionary. Hence,whenweareconsidering
group sparse regularization techniques, this will be motivation to consider algorithms
alongthelinesoftheSGLalgorithm[28]ratherthantheGroupLASSOalgorithm[68].
Moreover, as the authors in [28] pointed out, the Group LASSO algorithm does not
yieldsparsitywithinagroup. Inparticular,whenagroupofactivationsisdropped,the
entire group is dropped, and when a group is non-zero, they will all be non-zero.
2.2.3 Formulation of the “Group Elastic Net” algorithm
Following the background discussed in Section 2.2.2.2, we now set out to formulate the
“Group Elastic Net” and the “Group Sparse Bayesian Learning” algorithms for our
problem setup. In particular, the first formulation extends the Elastic Net formulation
to a more general setting, while retaining the speed (complexity) advantages of the
Elastic Net, and the second formulation further promotes sparsity enforcement within
a partition.
MotivatedbyEquation(2.7)andtheElasticNet,wecanhaveadifferentformulation
of our optimization problem as follows:
43
min
x
N
C
X
l=1
Φ
l
x
l
−f
2
2
+λ
2
N
C
X
l=1
s
l
kx
l
k
2
2
+λ
1
N
C
X
l=1
r
l
kx
l
k
1
(2.12)
r
l
ands
l
arerescalingfactorsfortheL
1
normandL
2
norm,respectively. Duetothe
fact that our penalty function is separable, we can take the usual route of considering
each group as an individual optimization problem. Suppose at the current iteration we
are considering group i. Define the following:
Φ
i
=
φ
i,1
φ
i,2
... φ
i,N
i
x
i
=
x
i,1
x
i,2
... x
i,N
i
T
(2.13)
N
i
denotes the number of columns in group i. Let us define the residual r =
F−
P
j6=i
Φ
j
x
j
.
We will then need to solve a successive series of optimization problems of the fol-
lowing form:
min||r−Φ
i
x
i
||
2
2
+λ
2
s
i
||x
i
||
2
2
+λ
1
r
i
||x
i
||
1
(2.14)
Define r
′
to be a (dim(r)+dim(x
i
))×1 vector as follows:
r
′
=
r
0
(2.15)
Wecanmanipulatetheobjectivefunction(denoteitJ)inEquation(2.14)asfollows:
44
J = ||r−Φ
i
x
i
||
2
2
+λ
2
s
i
||x
i
||
2
2
+λ
1
r
i
||x
i
||
1
= ||r
′
−
Φ
i
√
λ
2
s
i
I
x
′
i
||
2
2
+λ
1
r
i
||x
′
i
||
1
= ||r
′
−
1
√
1+λ
2
s
i
Φ
i
√
λ
2
s
i
I
x
′′
i
||
2
2
+
λ
1
r
i
√
1+λ
2
s
i
||x
′′
i
||
1
(2.16)
ThesolutiontoEquation(2.16)givesthesolutiontotheoriginaloptimizationprob-
lemstatedinEquation(2.14). Specifically,wecanscale
c
x
′′
i
togive
c
x
′
i
=
1
√
1+λ
2
s
i
c
x
′′
i
and
b x
i
is equal to
c
x
′
i
. Moreover, the formulation in Equation (2.16) increases the rank of
the dictionary (and hence reduces the coherence), which will contribute to mitigating
the issue of highly coherent dictionary atoms.
If we proceed to solve Equation (2.16) by the LASSO algorithm, we will have the
Elastic Net formulation [71]. This can be further sped up to be equivalent to the order
ofone Ordinary Least Squares (OLS)fitbyemployingamodifiedversionoftheLARS-
LASSO solution [24]. We will call this method the “Group Elastic Net” algorithm
(Group EN) and the gist of the algorithm is given in Algorithm 1.
Algorithm 1 “Group Elastic Net” Algorithm
1: Initialize the algorithm with x = Φ
+
f. Here, we are intializing with the least
squares solution and Φ
+
is the Pseudo-Inverse.
2: For each group i=1...N
C
, we define the residual r =f−
P
j6=i
Φ
j
x
j
. We proceed
to the Elasic Net algorithm to solve the following optimization problem:
min
x
i
||r−Φ
i
x
i
||
2
2
+λ
1
s
i
||x
i
||
2
2
+λ
2
r
i
||x
i
||
1
(2.17)
3: Iterate over all the groups repeatedly until convergence is reached.
45
2.2.4 Formulation of “Group Sparse Bayesian Learning” algorithm
Sinceourgoalinthegroupedregularizationsettingistoenforcesparsitywithinagroup
as well as between the groups, there are alternatives to LARS-LASSO for resolving
the optimization of Equation (2.16). In particular, here we advocate the use of Sparse
Bayesian Learning[63,64]duetothefactthatitmakesuseofadataparametrizedprior
whichcanbeeffectiveforenforcingsparsity. Byassumingaparametrizedprior, agood
measure of sparsity relative to both L
1
and L
2
optimization can be obtained.
We assume a Gaussian likelihood model for the residual:
p(r
′
|x
′′
i
,σ
2
)=
1
2πσ
2
d
2
exp
−
1
2πσ
2
1
√
1+λ
2
s
i
Φ
i
√
λ
2
s
i
I
x
′′
i
−r
′
2
(2.18)
where d=dim(r)+dim(x
i
). The SBL also assumes a parametrized prior from the
training data, which is given by
p(x
′′
i
|γ)=
N
i
Y
j=1
1
(2πγ
j
)
1
2
e
−
x
′′2
i
j
2γ
j
(2.19)
Here,γ =[γ
1
γ
2
...γ
N
i
]arethehyperparameterswhichregulatethepriorvarianceof
eachweight. Theinverseofthehyperparametersarechosentobedistributedaccording
to a gamma distribution [63]:
p(γ
−1
)=
N
i
Y
j=1
Γ(γ
−1
j
|m
γ
,n
γ
) (2.20)
m
γ
and n
γ
are the parameters of the Gamma distribution. We will see later (in
Property 1) that γ controls the degree of sparsity of the vector x
i
, and thus can be
46
likenedtotheroleof λ
1
. However, sinceγ isadaptedateachiterationofthealgorithm
described below, we do not have to concern ourselves with the explicit form of γ.
Bychoosingtheappropriatepriorasin(2.19),theposteriordensityoftheactivation
weights x
i
will be distributed according to a Gaussian, which is given by:
p(x
′′
i
|r,γ,σ
2
)=N(μ,Σ
x
′′
i
) (2.21)
with
μ=
1
σ
2
√
1+λ
2
s
i
Σ
x
′′
i
h
Φ
T
i
p
λ
2
s
i
I
i
r (2.22)
and
Σ
x
i
′′ =
1
σ
2
(1+λ
2
s
i
)
(Φ
T
i
Φ
i
+λ
2
s
i
I)+diag(γ
−1
)
−1
(2.23)
To compute the cost function for SBL, note that to find p(r
′
|γ,σ
2
), we marginalize
to get:
p(r
′
|γ,σ
2
) =
Z
p(r
′
|x
i
,σ
2
)p(x
i
|γ)dx
i
=
1
2π
d
2
|Σ
r
′|
1
2
e
−
1
2
r
′T
Σ
−1
r
′
r
′
(2.24)
Here, we have
47
Σ
r
′ = σ
2
I+
N
train
X
j=1
γ
j
1+λ
2
s
i
Φ
i,j
Φ
T
i,j
√
λ
2
s
i
Φ
i,j
I
j
√
λ
2
s
i
I
j
Φ
T
i,j
λ
2
s
i
I
j
= σ
2
I+
N
train
X
j=1
γ
j
1+λ
2
s
i
B
j
(2.25)
I
j
denotes an indicator matrix of all zeros except at the (j,j) location, where the
valueis1. Tomaximizep(r
′
|γ,σ
2
),wewillbeequivalentlyminimizing−logp(r
′
|γ,σ
2
).
Thus, our cost function is:
−logp(r
′
|γ,σ
2
)∝log|Σ
r
′|+r
′
T
Σ
−1
r
′
r
′
(2.26)
Minimizing the expression in Equation (2.26) will give us an update rule of the
hyperparameters. Alternatively,wecantreattheactivationsx
i
ashiddenvariables[63,
64]andthenmaximizeE
x
′′
i
|r
′
,γ,σ
2
p(r
′
,x
′′
i
,γ,σ
2
)
togiveanExpectation Maximization
(EM) formulation for SBL.
However,ithasbeendemonstratedbyTippingetal. in[62]thatthisEMimplemen-
tation of the SBL algorithm is generally slower than if we considered the atoms in the
dictionary sequentially. In fact, experimental verification in [62] hasdemonstrated that
the Sequential Sparse Bayesian Learning (SSBL) performs more than 15 times faster
as compared to the EM implementation. In particular, for our case, this speedup will
be even more relevant for two reasons: firstly, our dictionary is augmented to be even
larger as given in Equation (2.16); secondly, our algorithm consists of a nested loop,
which requires a fast algorithm to be practically feasible.
Following the trend in [27, 62], we are able to write Equation (2.25) as:
48
Σ
r
′ = σ
2
I+
N
train
X
j=1,j6=k
γ
j
1+λ
2
s
i
B
j
+
γ
k
1+λ
2
s
i
B
k
= Σ
′
r
′ +
γ
k
1+λ
2
s
i
B
k
(2.27)
For subsequent expression simplification, let us define the following variables:
c
k
= r
′T
Σ
′
−1
r
′B
k
Σ
′
−1
r
′ r
′
d
k
=
h
Φ
T
k
p
λ
2
s
i
I
i
Σ
′
r
′
−1
h
Φ
k
p
λ
2
s
i
I
i
(2.28)
After algebraic manipulations, we can write the right-hand side of Equation (2.26)
as:
P =log|Σ
r
′|+r
′
T
Σ
−1
r
′
r
′
=log|γ
k
Σ
′
r
′(
1
γ
k
+
1
1+λ
2
s
i
Σ
′
−1
r
′B
k
)|+r
′
T
Σ
−1
r
′
r
′
=log|Σ
′
r
′|+r
′
T
Σ
′
−1
r
′r
′
+logγ
k
+log
1
γ
k
+
1
1+λ
2
s
i
d
k
−
c
k
1+λ
2
s
i
γ
k
+d
k
(2.29)
NotethatinverseofΣ
−1
r
′
canbefoundusingtheWoodburyIdentity. Takingpartial
derivatives w.r.t γ
−1
k
,
49
∂P
∂γ
−1
k
=−
1
γ
−1
k
+
1+λ
2
s
i
γ
−1
k
(1+λ
2
s
i
)+d
k
+
c
k
(1+λ
2
s
i
)
γ
−1
k
(1+λ
2
s
i
)+d
k
2
(2.30)
Setting
∂P
∂γ
−1
k
to be zero and denoting Δ=γ
−1
k
(1+λ
2
s
i
), we have
−
1
γ
−1
k
+
1+λ
2
s
i
γ
−1
k
(1+λ
2
s
i
)+d
k
+
c
k
(1+λ
2
s
i
)
γ
−1
k
(1+λ
2
s
i
)+d
k
2
=0
−
1
Δ
+
1
Δ+d
k
+
c
k
(Δ+d
k
)
2
=0
−(Δ+d
k
)
2
+Δ
2
+d
k
Δ+c
k
Δ=0
−d
2
k
−Δd
k
+c
k
Δ=0
Δ=
d
2
k
c
k
−d
k
γ
−1
k
(1+λ
2
s
i
)=
d
2
k
c
k
−d
k
γ
−1
k
=
d
2
k
(1+λ
2
s
i
)(c
k
−d
k
)
(2.31)
We see that when c
k
−d
k
< 0, γ
−1
k
is positive. However, when c
k
−d
k
≤ 0, γ
−1
k
is
undefined, thus we default it to infinity.
Thus, from the analysis of the stationary points, we can see that Equation (2.29)
has a minimum w.r.t γ
k
given by:
γ
−1
k
=
∞ if c
k
≤d
k
d
2
k
(1+λ
2
s
i
)(c
k
−d
k
)
otherwise
(2.32)
In fact, when the parameters of the Gamma distribution in Equation (2.20) m
γ
,n
γ
approach zero, we will have p(x
′′
i
)→ c
Q
N
train
j=1
1
|x
′′
i j
|
where c is some constant [63]. We
nowstateapropertythatjustifieswhyweselectedtheSBLalgorithmtoenforcesparsity
within a group:
50
(SBL Sparsity Property) WhentheparametersoftheGammadistributioninEqua-
tion (2.20) m
γ
,n
γ
approach zero, we have p(x
′′
i
) → c
Q
N
train
j=1
1
|x
i
′′
j
|
where c is some
constant. (See Section 2.6 for derivation)
This evidently induces greater sparsity since the distribution peaks sharply at zero
andhasheavytails. Moreover,fromtheanalysisgivenin[64],weknowthatevenwhen
the algorithm gets to a local minima instead of the desired global minima, we will still
obtain a solution with maximal sparsity.
Thus, we can now formulate the “Group Sparse Bayesian Learning” (Group SBL)
algorithm as summarized in Algorithm 2.
Algorithm 2 “Group Sparse Bayesian Learning” (Group SBL) Algorithm
1: Initialize the algorithm with x = Φ
+
f. Here, we are intializing with the least
squares solution and Φ
+
is the Pseudo-Inverse.
2: For each group i = 1...N
C
, we define the residual r = f−
P
j6=i
Φ
j
x
j
. We then
proceed as follows:
3: Initialize σ
2
4: Initializeforsome j∈{1...N
train
}, γ
−1
j
=
||Φ
i,j
||
2
+λ
2
s
i
(1+λ
2
s
i
)
||[Φ
T
i,j
√
λ
2
s
i
I
j
]r||
2
||Φ
i,j
||
2
+λ
2
s
i
−σ
2
!
,and γ
−1
k
=
∞ if k6=j
5: Compute Σ
x
′′
i
and μ.
6: For j =1...N
train
, we check the following: If c
j
> d
j
and γ
−1
j
<∞, we re-estimate
γ
−1
i
. Else if c
j
>d
j
and γ
−1
j
=∞, we add Φ
i,j
to the model. Otherwise, if c
j
≤d
j
and γ
−1
j
is finite, remove Φ
i,j
and set γ
−1
j
to infinity. Re-estimate Σ
x
′′
i
and μ from
the relevant updated model. Repeat until convergence or some iteration limit.
7: Iterate over all the groups until convergence is reached.
51
2.3 Application to Spectral Denoising in a Speech Recog-
nition Framework
Now that we have described the derivations and details of the algorithms, we will con-
sideranapplicationofthegroupsparseregularizationtechniquesinthesettingofspec-
tral denoising within a speech recognition framework [17]. In particular, for ASR, we
need to extract features from the audio data, which constitute the speech recognition
front-end. Fig2.1showsatypicalspeechrecognitionsystem,butwithamodifiedfront-
end which incorporates a spectral denoiser. The spectral denoiser will incorporate the
group sparse regularization techniques with appropriate dictionary partitioning. How-
ever, the original feature matrices that we extract from the audio will have variable
number of columns (dependent on the duration of the audio), and thus we will need an
alternative feature extraction procedure which we will now describe.
We consider a framework for extraction of features as in [59]. Denote the number
of frames for an utterance as T. Let the feature vector corresponding to frame k in an
utteranceberepresentedbyf
k
,wheref
k
isaN
B
×1dimensionalvectorwhichcontains
the spectral coefficients, and N
B
is the number of frequency bands. Define
˜
F to be a
N
B
×T matrix as follows:
˜
F=
f
1
f
2
... f
T
(2.33)
We define the linearization of a matrix F=[f
1
...f
n
] as follows:
¯
f =
f
1
f
2
...
f
n
(2.34)
52
When we try to linearize
˜
F, we will obtain vectors of different lengths due to the
different audio durations. Thus, we need to consider the features in blocks to ensure
conformity of dimensions. Consider a sliding window extraction of the data in this
matrixrepresentation
˜
F. Defineaslidingmatrix(window)whichhasdimensions N
B
×
T
W
, T
W
represents the length in frames of the sliding matrix. We also define a shift
parameter T
S
, which represents the number of frames by which we shift the sliding
matrix.
In doing so, we obtain a total of⌈
(T−T
W
)
T
S
⌉+1 matrices of feature vectors. We zero-
pad
˜
F to be a N
B
×c matrix where c=⌈
(T−T
W
)
T
S
⌉×T
S
+T
S
. Let us denote the m-th
sliding window to be
˜
F
m
=
f
1+(m−1)T
S
... f
T
W
+(m−1)T
S
.
We now suppose that we can write
¯
f
m
as
¯
f
m
=Φx
m
, where
¯
f
m
is the observation
(feature vector) linearized from
˜
F
m
, Φ is the dictionary of exemplars, and x
m
is a
vectorofactivation. Weareassumingthateachtestsegmentcanbewrittenasalinear
combination of the basis vectors. This is a reasonable assumption since the spectral
representations for different realizations of the same word have energy localizations in
similar regions in the time-frequency domain.
We thus obtain the following type of linear representations from our windows:
¯
f
m
=Φx
m
, m=1,...,
T−T
W
T
S
+1 (2.35)
After the sparse imputation process, we need to reconstruct a denoised representa-
tion of the original sliding matrix. Define a counter matrix of dimension N
B
×c where
c is defined as above. This counter matrix counts the number of times each entry in
the matrix
˜
F
m
is optimized due to overlapping window shifts. Formation of the final
denoised matrix will involve first reshaping
d
˜
F
m
(the solution to optimizing Equation
(2.35))backtodimensionsN
B
×c,summingalltheresultingreshapedframes,andthen
doing component-wise division by the entries of the counter matrix.
53
Let us denote the number of columns in our dictionary by N
train
. This is the num-
ber of linearized exemplars used for the spectral denoising process. We then form a
dictionary Φ = [φ
1
φ
2
... φ
N
train
] which consists of segments of clean spectral shapes.
Thiswillbeourovercompletedictionaryofexemplarspectralsegments. Section2.4will
discuss procedures by which we obtain Φ
i
for i=1,...,N
train
as efficiently as possible.
2.4 Experimental Setup and Results
2.4.1 Description of Database
For the Aurora 2.0 recognition system, we use all of the 8040 clean training files (con-
taining single and continuous digit utterances) provided in the Aurora 2.0 database
training set to train a continuous digit recognizer in HTK [67].
For the continuous digit recognition task, the Aurora database consists of test sets
labeled N1, N2, N3 and N4 (corresponding to subway, babble, car and exhibition noise
respectively) in the Test Set A subset. We merge all audio files from each of N1, N2,
N3 and N4 to give us a total of 4004 files in our test set for a specific noise SNR. We
will be using SNR levels -5 dB, 0 dB, 5 dB and 10 dB.
To form our dictionary Φ, we get all the single digit audio files in the training
data, extract the spectral features corresponding to those files, and then construct
a dictionary of fixed window-size spectral exemplars (window length T
W
) comprising
wholedigitinterpolatedspectralexemplarslikein[59]. Wethenlinearizethesespectral
exemplars which will go into the columns of our dictionary Φ.
Since the Aurora 2.0 is a synthetically corrupted dataset, in order to further stress
testouralgorithms, weevaluatedthemontheAurora3.0databaseaswell. Wetrained
an Italian continuousdigit recognizer using961 continuous digit trainingfiles and eval-
uated the algorithms on a test set comprising 160 continuous digit utterances.
54
2.4.2 Description of ASR Features
We train the speech recognizer on MFCCs with the first and second derivatives, with
16 states total for each digit model. We use 23 frequency bands (N
B
= 23), a ham-
ming window size of 25 ms, and a frame shift of 10 ms. For the delta and delta-delta
coefficients, we set the respective parameters in HTK to be both equal to 2 frames.
The feature extraction for the 23 spectral coefficients is done in MATLAB. We
then find a sparse representation for these spectral coefficients with the optimization
algorithmsdescribedabove. Fromthedenoisedspectralcoefficients,wereconstructthe
13 MFCC coefficients with the first and second derivatives, which are then passed to
the HTK implemented continuous digit recognizer for the speech recognition.
2.4.3 Details of Algorithms Implementation
Our optimization algorithms are implemented using MATLAB. The Group
Sparse LASSO algorithm was implemented using SLEP 4.0 available at
www.public.asu.edu/˜jye02/Software/SLEP/ which provides an optimized MAT-
LAB routine for solving the Sparse Group LASSO formulation.
For comparison with some widely known state-of-the-art denoising techniques, we
have also included results obtained from Cepstral Mean Normalization (CMN) and the
ETSI Advanced Front-end (ETSI AFE) [26].
2.4.4 Signal Reliability Mask
ForourobservedvectorF,therewillbecomponentsthataremorecorruptedwithnoise
as compared to the rest. Thus, if we estimated x from these unreliable components,
theestimated b xwillnotbeveryaccurateforthereconstructionofF. Thus,weemploy
a hard signal reliability mask [16] to denote reliable parts of the observed data. In
particular,themaskmatrixwillbethesamedimensionsasthatoftheextractedfeature
55
matrix,with1toindicateareliableentry,and0toindicateanoisyentry. Inthischapter,
liketheprevious,wewilluseamaskwhichisestimatedfromthenoisyspeechdataitself
[3]. ThisinvolvesgettingalocalestimateoftheSNRbyaveragingthefirst10framesof
thespectralfeaturesoftheutterance,whichcontainsinformationprecedingthevoicing
of the digits. An estimate of the clean digit utterance is obtained by subtraction of the
noise estimate from the noisy digit utterance.
After we have an indication of which component is reliable and which is not, we
removethecomponentsdeemedunreliablefromourobservedvector
˜
F. Wealsomodify
our dictionary Φ to remove the unreliable rows corresponding to the unreliable compo-
nents of
˜
F. By varying the threshold SNR values we will be able to get more reliable
estimates of x. In this chapter, like in [59], we adopt an SNR threshold of 20 dB.
2.4.5 Approaches for dictionary partitioning
There are many ways to partition the dictionary, of which we will explore two intuitive
approaches:
• We partition the dictionary according to specific speaker identity. Here, we sort
the columns of the dictionary to have the spectral segments of similar speakers
to be grouped together, and then we do chunking to ensure that each chunk will
essentially have utterances from the same speaker. This is a knowledge-based
approach for dictionary partitioning.
• Partition the dictionary according to proximity based on the L
2
distance. In
this respect, we perform some form of clustering on the atoms of the dictionary
basedontheL
2
distancemetrictoformapre-determinednumberofclusters,and
then run our grouped regularization technique based on these clusters. This is a
data-driven approach for dictionary partitioning.
56
For the second method, we perform the clustering using the popular K-means clus-
tering algorithm [6]. In fact, there is a relationship between sparse representation and
clustering. We can think of clustering as sparse representation in its extreme, with one
atom (the mean of the cluster) allowed in the representation, and the coefficient of the
atom is 1. The K-means algorithm has also been generalized in the K-SVD algorithm
[1] for designing overcomplete dictionaries for sparse represenation.
2.4.6 Evaluation Results for Aurora 2.0
We first evaluate the following algorithms on the SNR 5 dB noise corruption setting -
SparseBayesianLearning(SBL),LeastAngleRegressionImplementationoftheElastic
Net (LARS-EN), FISTA implementation of the Sparse Group LASSO (FISTA SGL),
Group Elastic Net (Group EN), and lastly Group Sparse Bayesian Learning (Group
SBL). We additionally evaluate LARS-EN, Group EN and Group SBL for SNR -5 dB,
SNR 0 dB and SNR 10 dB.
Table 2.1 presents the results for optimized versions of the algorithms for the SNR
5 dB corruption. Table 2.3 presents the results for SNR -5 dB, 0 dB, 10 dB and clean
conditions.
From Table 2.1, we can see that when the dictionary is grouped appropriately, cou-
pled with a suitable group sparse regularization algorithm, speech recognition perfor-
mance can be greatly improved over the original base algorithm. In particular, for our
evaluationoftheSNR5dataset,wecanseethattheGroupENalgorithmperformsthe
bestwitharecognitionaccuracyof67.78%withthespeakergroupclusteringtechnique.
We see that the SBL algorithm did not perform as well as either the LASSO al-
gorithm or the Elastic Net algorithm. However, with our Group SBL modification, it
performssignificantlybetterthantheSBLalgorithm. Thiscanbeexplainedbythefact
thattheoriginalSBLformulationcouldbelessadeptathandlingcollineardictionariesof
57
Table 2.1: Spectral denoising results for the Aurora 2.0 database.
Algorithm Grouping
Accuracy (%)
Before denoising NA 43.82
SBL NA 39.70
LASSO NA 64.24
LARS-EN NA 66.36
Group EN Speaker 67.78
Group EN L
2
67.42
FISTA SGL Speaker 56.81
FISTA SGL L
2
53.38
Group SBL Speaker 55.80
Group SBL L
2
48.16
ETSI AFE NA 72.67
CMN NA 55.74
spectral exemplars. In our formulation of the Group SBL algorithm, we are doing rank
augmentation to our dictionary matrix (as evident from Equation (2.16)). Thus, when
we are applying SBL as an intermediate step, the augmented dictionary has a higher
chance of being of sufficient rank to result in better performance as demonstrated by
our results in Table 2.1 and Table 2.3.
Thus, we see that both our proposed algorithms (Group EN and Group SBL) with
the appropriate groupings perform better than the Elastic Net and the SBL algorithm
respectively. Moreover,theproposedalgorithmsarealsocomparabletotheFISTASGL
algorithm, and in the case of the Group EN, outperforming it.
The results in Table 2.1 also show that clustering by knowledged-based speaker
identity performs slightly better than the data-driven K-means clustering with the L
2
distance metric to partition the dictionary. Intuitively, similar spectral profiles will
tend to cluster closer together with the L
2
distance. Thus, during the group sparse
regularizationprocess,irrelevantgroups(consistingofspectralprofileswhicharefurther
away in the L
2
sense) will be zeroed out, and the appropriate partition will have non-
zero coefficients with sparsity enforced by our proposed algorithms. When we partition
58
thedictionarybasedonspeakeridentity,weareretainingspeakercharacteristicswithin
each partition, and distinct spectral profiles coming from different utterances will be
present within a single partition. Thus, partitioning according to speaker identity can
belikenedtoperformingregularizationwithaseriesofmorefocuseddictionaries,which
explains why it is slightly outperforming the K-means clustering approach.
Another interpretation of K-means clustering is the following: K-means can be
likened to an extreme version of sparse regularization as mentioned earlier. Thus, the
additionalgroupsparseregularizationstepcanbeviewedasarefinementstepoftheK-
means, which has been empirically demonstrated to yield good recognition rates given
the appropriate refinement algorithm in Tables 2.1 and 2.3.
In fact, our algorithms can be likened to a subset of the feature compression algo-
rithmintheETSIAFE[26]. Specifically,thedictionarylearningalgorithmsisanalogous
tothevectorquantizationstepinthecompressionalgorithmpipelineofETSIAFE.Due
to the fact that our system is not incoporating an error protection step that the ETSI
AFE has at this stage, it is not an entirely direct one to one comparison with ETSI
AFE, but we are providing the results obtained from the ETSI AFE for comparison
purposes.
Regularization in the spectral domain amounts to spectral denoising [59]. From
the spectral plots, we can see that LARS-EN is improving upon the noisy version with
successfulremovalofportionsofthenoiseartifacts,andtheGroupENalgorithmfurther
improves upon the LARS-EN algorithm by closer resemblance to the original clean
spectrum, as illustrated in the example of Fig. 2.2.
2.4.7 Runtimes
Table 2.2 presents runtime results for our algorithms and the Sparse Group LASSO
algorithm.
59
Figure 2.2: The diagram above shows the spectral plots of a particular utterance. As
we can see, LARS-EN improves greatly upon the noisy version with much of the noise
artifactseradicated,andtheGroupENalgorithmfurtherimprovesupontheLARS-EN
algorithm by a closer resemblance to the original clean spectrum.
60
Table 2.2: Average runtime results for the Group Sparse Regularization algorithms
Algorithm Runtime (secs/optimization)
FISTA SGL 0.0641
Group EN 0.0725
Group SBL 0.1142
We can see that FISTA SGL runs the fastest, followed closely by Group EN, and
lastly by Group SBL. To ensure that we are comparing the algorithms fairly, the algo-
rithms are ensured to have converged sufficiently and are evaluated on the same plat-
form (Core 2 Quad Processor with 8 gigabytes of RAM). All algorithms are evaluated
in MATLAB.
In our group regularization formulations, we can see that we are retaining the com-
plexity advantages of the base algorithms. In particular, if we cap the iterations to a
limit, the complexity is the same as that of the base algorithms.
IthasbeenproventhattheIterativeShrinkageThresoldingAlgorithm(ISTA)hasa
sublinearconvergence(O(
1
i
)where iistheiteration)andFISTAfurtherimprovesonit
by a quadratic factor (O(
1
i
2
)) [5]. In this chapter, we have experimentally verified that
our Group EN formulation runs at comparable speed as the FISTA implementation.
2.4.8 Aurora 2.0 Results for other SNR levels
As evident from the results in Table 2.3, we see that the Group EN algorithm is con-
sistently better than the Elastic Net under different levels of noise corruption, which
shows that grouping helps in improving speech recognition rate.
Animportantaspectofdenoisingalgorithmsisperformanceundercleanconditions.
We evaluated the proposed algorithms in clean conditions, and results show that the
Group EN algorithm suffers from some amount of deterioration, while the Group SBL
61
Table 2.3: Results for SNR -5 dB, SNR 0 dB, SNR 10 dB corruption and clean
conditions
Algorithm Accuracy (%)
SNR -5 dB
Before denoising 7.89
LARS-EN 31.04
Group EN 31.31
Group SBL 26.15
SNR 0 dB
Before denoising 18.59
LARS-EN 46.83
Group EN 47.80
Group SBL 38.10
SNR 10 dB
Before denoising 68.89
LARS-EN 82.16
Group EN 83.24
Group SBL 70.86
Clean Conditions
Clean 98.99
Group EN 97.78
Group SBL 98.98
exhibits good robustness in clean conditions. For good reconstruction under clean con-
ditions, sparsity is important, and since the Group SBL algorithm is enforcing sparsity
differently from the Group EN, we see that the better performance of the Group SBL
algorithm when compared to the Group EN algorithm under cleaner conditions could
imply that the Group SBL way of enforcing sparsity is more robust under cleaner con-
ditions.
2.4.9 Evaluation of Algorithms on Real Noisy Data
We additionally evaluate our algorithms on the Aurora 3.0 Italian database. Our test
set comprise of 160 continuous digit utterances, randomly selected from the following
noise types: low speed rough road, town traffic, stop motor running and high speed
62
good road. Table 2.4 shows recognition results after denoised by Lars-EN, Group EN,
Group SBL, the ETSI Advanced Front-end, and lastly Cepstral Mean Normalization.
Table 2.4: Results for Aurora 3.0 noisy dataset
Algorithm Accuracy (%)
Before denoising 62.87
LARS-EN 75.26
Group EN 77.50
Group SBL 76.46
ETSI AFE 77.21
CMN 67.09
Onthisrealisticdataset,weseethattheGroupENisthebestperforming,followed
verycloselybytheETSIAFEandtheGroupSBLalgorithm. Weseethatgenerally,our
proposed dictionary-based sparse regularization algorithms are comparable with state-
of-the-art techniques, and in particular for the Group EN algorithm, we are slightly
outperforming ETSI AFE. For Aurora 2.0, since the dataset is synthetically corrupted
with additive noise, we can expect to observe some algorithms perform better than
others due to their handling of additive noise better. However, in the presence of the
more complicated real noisy conditions, the robustness of our proposed algorithms is
made apparent.
2.5 Conclusion and Final Remarks
In this chapter, we introduced two novel variations of group sparse regularization tech-
niques: the Group Elastic Net algorithm and the Group Sparse Bayesian Learning
algorithm. We see that while the Group EN algorithm is the best performing out of
both algorithms for the Aurora 2.0 and 3.0 dataset. We also see that the Group SBL
algorithmexhibited leastdeterioration withclean data, and improvedgreatly uponthe
performance of SBL. Moreover, on the Aurora 3.0 database, the Group SBL algorithm
63
exhibitedgreaterrobustnesscomparedtoElasticNetunderthemorepracticalrealnoise
condition. In terms of runtime, we see that the Group EN and Group SBL algorithms
are comparable with FISTA SGL. We also see that the results obtained from our algo-
rithms are comparable with the ETSI AFE. We believe that further improvements can
be achieved with an adaptive dictionary update approach. In this work, the overcom-
plete dictionary is formed by a simple strategy. In adaptive dictionaries, essentially we
would update the dictionary as we perform regularization. This can ensure that only
representative spectral segments are retained in the dictionary, and irrelevant spectral
segments are eliminated.
This chapter also explored two intuitive strategies for partitioning the dictionary
of exemplars, namely by K-means clustering with the L
2
distance metric, and also by
speaker identity. We find that the speaker clustering strategy performs better than the
clustering with K-means using the L
2
distance metric.
Future work in terms of investigating better partitioning strategies can be directed
towards finding meaningful hybrids between data-driven approaches and knowledge-
based approaches.
Another interesting extension will be to expand our framework to be able to do
speaker identification. In this work, the speakers in the dictionary do not generally
overlapwiththespeakersinthetestset, sospeakeridentificationisnotpossiblewithin
the current framework. However, due to the fact that the partitioning is doing some
form of discrimination, our techniques could potentially be adapted for this task.
64
2.6 Derivation of Property 1
From Equation (2.20) we have
P(γ
−1
j
)=
1
Γ(m
γ
)
n
γ
mγ
γ
1−mγ
j
e
−nγγ
−1
j
where Γ(.) is the Gamma function. Thus,
p(x
i
′′
j
) =
Z
p(x
i
′′
j
|γ
j
)p(γ
j
)dγ
−1
j
=
Z
1
(2πγ
j
)
1
2
e
−
x
i
′′
j
2
2γ
j
Γ(m
γ
)
n
γ
mγ
γ
1−mγ
j
e
−nγγ
−1
j
dγ
−1
j
∝
Z
γ
1
2
−mγ
j
e
−
x
i
′′
j
2
2
+nγ
γ
j
dγ
−1
j
=
Z
x
i
′′
j
2
2
+n
γ
!
−mγ−
1
2
(γ
′
j
)
mγ−
1
2
e
−γ
′
j
dγ
′
j
=
x
i
′′
j
2
2
+n
γ
!
−mγ−
1
2
Γ
m
γ
+
1
2
Where γ
′
j
=
x
i
′′
j
2
2
+nγ
γ
j
. Thus we have:
p(x
i
′′
j
)∝
x
ij
′′2
2
+n
γ
−mγ−
1
2
As m
γ
,n
γ
→0, we have p(x
i
′′
j
)→c
′ 1
|x
i
′′
j
|
where c
′
is some constant.
65
Chapter 3:
Combining Window Predictions
Efficiently - A New Imputation
Approach for Noise Robust
Automatic Speech Recognition
[60]
3.1 Introduction
Compressed sensing/Sparse Representation techniques have recently been employed in
spectraldenoisingforspeechrecognitionapplications. Priorworks[7,30]haveproposed
the use of L
1
optimization techniques and demonstrated appreciable gains in speech
recognition rates over the original noisy conditions. The L
1
optimization considers an
objective function of the following form:
66
min
a
kDa−Fk
2
2
+λkak
1
(3.1)
In Equation (3.1), F is the observed feature vector, a is the vector of activations,
and D is the dictionary. This process is known as “Sparse Imputation” [30]. In this
chapter,weproposeatechniquetofurtherimproveupontheSparseImputationprocess.
3.2 Related Work
Previously [59], we have suggested the use of Elastic Net [71] as a solution to better
exploitthepropertiesofadictionaryofcollinearspectralexemplarsintheMDTsetting,
and also provided a study of why sparsity is important in the regularization framework
for spectral denoising. In addition, we [58] showed the importance of grouping spectral
atoms for improving speech recognition.
Whendealingwithutterancesofdifferentlengths,priorworks[30,58,59]haveused
the classical sliding window approach to extract fixed-duration frames for input to an
appropriate regularization algorithm for noise removal. Subsequently, to reconcile pre-
dictions (i.e. recombine the individual imputed results from the contributing windows)
from the analysis frames, an averaging strategy was proposed whereby the predictions
are all added up and then averaged by the number of overlapping frames. While such
astrategyimprovesperformancecomparedtowhenjustoptimizingonnon-overlapping
frames,amorerobustwayofcombiningthepredictionsispossible,whichwewillintro-
duce in this chapter.
The main contribution of this thesis chapter is to propose an extension to the sim-
ple averaging reconciliation approach for spectral denoising. Instead of just simple
averaging, we propose an alternative framework which tightly couples the frame-level
67
optimizationandthelocalpredictionreconciliationstep. Followingtheevaluationprac-
ticesetbypriorworks,wewillusetheAurora2.0noisydigitsdatabaseforourdenoising
experiments. We will demonstrate that the proposed framework yields an appreciable
improvement in ASR accuracies.
The structure of this chapter is as follows: Section 3.3 details the framework for
speech spectral denoising and also our algorithm formulation. Besides describing our
algorithm in detail, we also provide justification as to why this algorithm is well-suited
to perform in the setting of spectral exemplars. Section 3.4 provides a description of
our ASR system setup/settings and also provides the results of our experiments with
interpretation. Section 3.5 concludes with possible future work and extensions to our
proposed system.
3.3 Framework and Algorithmic Description
Fig. 3.1 shows the schematic of the ASR pipeline with a breakdown of the feature
extractionmodule. Inthischapter,weproposeimprovementtotheregularizationblock
by a more robust reconciliation method.
3.3.1 Feature Extraction Procedure
At the feature extraction stage in the ASR pipeline, after the spectral features are
extracted, we obtain a matrix of features of dimensions N
B
×T
F
, where N
B
stands for
the number of frequency bands in the extraction process and T
F
, the duration of the
utterance in number of analysis frames.
68
Figure3.1: SchematicoftheSpectralDenoiser. TheRegularizationblockcanbebroadly
splitinto3steps: 1)Windowing2)RegularizationoftheWindows(localoptimization)
3) Reconciliation of predictions
69
Figure3.2: Thediagramaboveshowsthespectralplotsofaparticularnoisyutterance.
Wecanthinkofthisasanimageandemploydenoisingtechniquestocleanupthenoise
artifacts evident in the image
3.3.2 Linear formulation of Problem
IfF isafeaturevectorofspectralexemplars(priortotakingthelogarithm),weassume
the following linear model for our problem:
F =Da (3.2)
D is composed of a dictionary of spectral exemplars in this setting.
3.3.3 Denoising problem formulation
We can treat the problem at hand as an image denoising problem, where the feature
values can be likened to the pixel intensities of an image (as in Fig. 3.2).
The problem formulation is given by the following equation:
min
a
kDa−Fk
2
2
+λkak
1
(3.3)
We will now proceed to describe the procedure by which we obtain F.
When D is comprised of spectral exemplars extracted from speech, D will have
a tendency of being collinear, since spectral images have energy localizations in simi-
lar regions for similar sounding utterances. Thus, we [59] demonstrated the need for
70
more robust solutions to this problem, and showed the effectiveness of the Elastic Net
formulation:
min
a
kDa−Fk
2
2
+λ
1
kak
1
+λ
2
kak
2
2
(3.4)
For each utterance, due to different durations, we will have different values of T
F
.
Toensurethatwehavematricesofequaldimensionseachtimewerunouroptimization,
we adopt the strategy of a sliding window matrix extraction. In particular, we define a
windowofsizeN
B
×T andshiftthewindowatregularintervalswhichispredetermined.
The vector F is then determined by linearizing the extracted window.
To represent the window extraction process algebraically, let us define a matrix R
i
whichisofdimensions T
F
×T. iindicatesthewindowcount. Thus,byputtingIdentity
matrix blocks and zero blocks in appropriate locations in R
i
, we can write the window
extraction process as follows:
F
s
=F
o
R
i
(3.5)
In Equation (3.5), F
s
refers to the window extracted subset and F
o
refers to the
original feature matrix.
Prior works have attempted to reconcile the predictions by an averaging strategy:
at the end of all regularizations for a particular utterance, the predictions are summed
appropriately and then divided by the number of times the sliding window overlaps at
the particular location. While this has yielded good results, we propose a more tightly
coupled way of optimization which further improves upon the averaging framework as
shown in a later section (Sec 3.4).
71
3.3.4 Signal Reliability Masks
Before we describe the proposed optimization formulation, we briefly review Signal
Reliability Masks since we will be integrating them in our optimization formulation.
In this chapter, like in the previous ones, we adopt a simple binary mask [3] which is
simplyamatrixofdimensionsN
B
×T
F
consistingofzerosandoneswithzeroindicating
an unreliable component and one indicating a reliable component.
Inparticular,letusdefineamatrixEwhichextractsthereliablerowsofthefeature
vectorF withrespecttothereliabilityindicationbythebinarymask. Then, thelinear
model as given in Equation (3.2) can be rewritten as follows:
F
reliable
=EDa (3.6)
For subsequent notational convenience, we define:
D
reliable
=ED (3.7)
Equation (3.4) can be rewritten as:
min
a
kD
reliable
a−F
reliable
k
2
2
+λ
1
kak
1
+λ
2
kak
2
2
(3.8)
3.3.5 A Novel Formulation of the Optimization Problem
In section 3.3.3, we drew the analogy that the spectral denoising problem can be vi-
sualized as an image denoising problem. Elad et al [25] have proposed an alternative
frameworkforimagedenoisingwhichdealswiththeproblemofimagepatches. Wewill
likewise be motivated by that framework and propose a new system which reconciles
the individual predictions by optimization.
72
LetusdefineanewmatrixGwhichhasthesamedimensionsasF
o
,representingthe
denoised version of F
o
. In addition, denote the number of extracted windows that we
have by N
W
. A natural generalization of the Maximum A Posteriori (MAP) estimate
in this case will be the following optimization:
min_{a_i, i ∈ {1,...,N_W}, G}   λ||G − F_o||_2^2 + Σ_{i=1}^{N_W} ( ||Da_i − GR_i||_2^2 + λ_{0i} ||a_i||_0 )    (3.9)
Note that the formulation in Equation (3.9) is an NP-hard problem (due to the L_0-norm term), and thus a convenient convex relaxation can be formulated as follows:
min_{a_i, i ∈ {1,...,N_W}, G}   λ||G − F_o||_2^2 + Σ_{i=1}^{N_W} ( ||Da_i − GR_i||_2^2 + λ_{1i} ||a_i||_1 )    (3.10)
Since Equation (3.10) is convex, there are a variety of fast solutions. However, to better handle a dictionary D of collinear spectral exemplars, we hereby propose the following formulation, which naturally ties in with the formulation of the Elastic Net:
min_{a_i, i ∈ {1,...,N_W}, G}   λ||G − F_o||_2^2 + Σ_{i=1}^{N_W} ( ||Da_i − GR_i||_2^2 + λ_{1i} ||a_i||_1 + λ_{2i} ||a_i||_2^2 )    (3.11)
To solve the formulation in Equation (3.11), we decouple the expression into a series of smaller optimization problems (by optimizing over each unknown sequentially). In particular, we employ the Elastic Net algorithm to solve the following series of N_W optimization problems:
min_{a_i} ||Da_i − F||_2^2 + λ_{1i} ||a_i||_1 + λ_{2i} ||a_i||_2^2    (3.12)
From the a_i's obtained from the Elastic Net, we can now fix them and proceed to optimize for G. From Equation (3.11), we see that we need to further solve the following optimization problem:
min_G λ||G − F_o||_2^2 + Σ_{i=1}^{N_W} ||Dâ_i − GR_i||_2^2    (3.13)
For subsequent notational convenience, let us denote J = λ||G − F_o||_2^2 + Σ_{i=1}^{N_W} ||Dâ_i − GR_i||_2^2.
We can write:
||Dâ_i − GR_i||_2^2 = â_i^T D^T D â_i − â_i^T D^T G R_i − R_i^T G^T D â_i + R_i^T G^T G R_i    (3.14)
Hence, taking the partial derivative of J with respect to G:

∂J/∂G = 2λ(G − F_o) − Σ_{i=1}^{N_W} ( 2Dâ_i R_i^T − 2GR_i R_i^T )    (3.15)
Setting the RHS of Equation (3.15) to be zero gives the following:
Ĝ = ( λF_o + Σ_{i=1}^{N_W} Dâ_i R_i^T ) ( λI + Σ_{i=1}^{N_W} R_i R_i^T )^{-1}    (3.16)
Note that initial inspection of the term ( λI + Σ_{i=1}^{N_W} R_i R_i^T )^{-1} might suggest that such a large matrix inversion would lead to a significant increase in complexity and might not be worthwhile. However, note that R_i is a block extraction matrix, and thus R_i R_i^T is essentially block diagonal. Hence, the expression λI + Σ_{i=1}^{N_W} R_i R_i^T is block diagonal as well, and there are efficient ways to invert such a matrix. Moreover, all the prior steps to estimate the a_i in Equation (3.12) are essentially repeated applications of the Elastic Net, so our approach still retains the complexity advantages of the Elastic Net.
Let us denote the final estimate for a particular window (after reshaping back to the dimensions N_B × T) by W_estimate. We then proceed to solve the following optimization problem to reconcile all predictions from the individual windows:
min_G λ||G − F_o||_2^2 + Σ_{i=1}^{N_W} ||W_estimate − GR_i||_2^2    (3.17)
whose solution is given by the following:
Ĝ = ( λF_o + Σ_{i=1}^{N_W} W_estimate R_i^T ) ( λI + Σ_{i=1}^{N_W} R_i R_i^T )^{-1}    (3.18)
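To make the reconciliation step concrete, the following is a minimal MATLAB sketch of Equation (3.18), under the assumption that window i covers the frame indices idx{i} of the N_B × T_F matrix F_o and that W_est{i} holds the reshaped estimate W_estimate for that window (lambda, idx, W_est, acc and counts are illustrative names, and the value of lambda is arbitrary). For this kind of window extraction, λI + Σ_i R_i R_i^T is in fact diagonal, so the matrix inverse reduces to a per-column division:

lambda = 0.1;                        % illustrative tradeoff with the noisy observation F_o
acc    = lambda * F_o;               % accumulates lambda*F_o + sum_i W_est{i} * R_i'
counts = zeros(1, T_F);              % diagonal of sum_i R_i * R_i'
for i = 1:N_W
    acc(:, idx{i}) = acc(:, idx{i}) + W_est{i};
    counts(idx{i}) = counts(idx{i}) + 1;
end
G_hat = acc ./ (lambda + counts);    % right multiplication by the diagonal inverse

This also illustrates why the reconciliation adds little latency relative to simple averaging, which corresponds to the special case lambda = 0 of the division above (up to the handling of frames not covered by any window).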
3.4 Experimental Setup and Results
3.4.1 Description of Database and Algorithm implementation
For our ASR system, we use 8040 clean training files (containing single and continuous digit utterances) provided in the Aurora 2.0 database training set to train a continuous digit recognizer in HTK [67].
For the continuous digit recognition task, we take a random subset of 4000 digit utterances from Test Sets A, B and C, giving us subway, babble, car, exhibition, restaurant, street, airport, train station, subway (MIRS), and street (MIRS) noise.
We train the ASR on MFCCs with the delta and delta-delta coefficients. We use 23 frequency bands (N_B = 23), a Hamming window size of 25 ms, and a frame shift of 10 ms. For the delta and delta-delta coefficients, we set the respective parameters in HTK to be equal to 2.
All algorithms are implemented in MATLAB.
3.4.2 Experimental Results
As shown in Table 3.1, we ran our experiments on a variety of SNR levels, namely SNR 0 dB, 5 dB, 10 dB and 15 dB. We present recognition results for the original noisy signal, the denoised version using the Elastic Net and simple averaging, and our newly proposed coupled strategy (labeled "EN Coupled" in the table).
As evident from our results, the proposed strategy performs consistently better than the simple averaging strategy. Moreover, from our experiments, we see that the new strategy has generally comparable runtimes relative to the simple averaging method. As mentioned before, the main latency involved in our proposed method is the time needed to invert the large square matrix λI + Σ_{i=1}^{N_W} R_i R_i^T. Because this matrix is block diagonal, the inversion is efficient compared to inverting a general square matrix, which keeps the runtime comparable to that of simple averaging.
3.5 Conclusion and Future Work
We showed that a more tightly coupled optimization that integrates the local optimization per window and the reconciliation step yields improved results in general compared to the commonly adopted simple averaging strategy (7.67% improvement on average in recognition accuracies). Our formulation also retains the complexity savings of the Least Angle Regression implementation of the Elastic Net, while speeding up the execution at the reconciliation step.

An immediate extension to the scheme proposed in this chapter is to incorporate the global structure of the entire spectral image into the optimization process. The sliding window framework already does so to a small extent by utilizing the overlapping portions, but better targeted strategies could be developed.
Table 3.1: Results for various levels of corruption. "CMN" refers to Cepstral Mean Normalization. "EN Averaging" refers to the procedure where the Elastic Net is applied to each window and contributions from the windows are subsequently averaged. "EN Coupled" refers to the new procedure we described, where a second optimization formulation is employed to reconcile the predictions. Runtimes are measured in seconds per optimization. Significance testing is done at 95% confidence with the difference of proportions test.

Algorithm        Accuracy (%)   Runtime (s)   Significant?
SNR 0 dB
Unimputed            9.63           NA            NA
CMN                 27.78           NA            NA
EN Averaging        26.64         0.0158          NA
EN Coupled          40.39         0.0197          Yes
SNR 5 dB
Unimputed           36.78           NA            NA
CMN                 55.91           NA            NA
EN Averaging        64.38         0.0209          NA
EN Coupled          72.57         0.0215          Yes
SNR 10 dB
Unimputed           61.41           NA            NA
CMN                 87.82           NA            NA
EN Averaging        83.85         0.0296          NA
EN Coupled          89.99         0.0364          Yes
SNR 15 dB
Unimputed           81.99           NA            NA
CMN                 95.67           NA            NA
EN Averaging        93.25         0.0409          NA
EN Coupled          95.84         0.0356          Yes
Chapter 4:
New Methods for Sparse
Representation Classification
(SRC), with applications to
Underwater Object Classification
and Face Recognition
4.1 Introduction
Previous chapters of this thesis have dealt with the problem of denoising using Sparse Representation (SR) techniques. We have shown successful applications of newly developed algorithms to the problem of Automatic Speech Recognition (ASR) denoising. In this chapter of the thesis, we consider an analogous problem – Sparse Representation Classification (SRC). We develop new algorithms that are motivated towards handling collinear dictionaries, as well as new algorithms which better couple regularization and decision making. In this chapter, we show successful application of these newly developed algorithms to the domains of underwater object classification and face recognition.
In the domain of classifying underwater sidescan sonar images, the problem of low resolution object images is quite significant [45]. Sidescan sonars allow large portions of the seabed to be scanned at once, making them invaluable to applications like mine countermeasures (MCM), where speed is a key factor. However, the objects of interest in these images are usually very difficult to detect. These objects sit camouflaged on the seafloor and are often difficult to notice except for the faint trace of a characteristic shadow adjacent to a bright highlight region (Figure 4.1). From an image processing standpoint, these objects are generally much smaller in size compared to the image dimensions. Additionally, sidescan sonar images are very sensitive to the grazing angle, which can make the same underwater object appear completely different depending on its surrounding scene. These and other factors, like the texture of the object and the characteristics of the medium, cause significant differences in patterns within the class.
Since classical supervised approaches do not perform adequately on these datasets (see the experimental section of this chapter for more details), this problem calls for a more targeted solution. A variety of machine learning techniques have also been employed to improve the adaptability of the algorithms to unseen features in the test data. These include methods like active learning [22] and classifier fusion [18]. In spite of the application of the aforementioned methods, the false alarm rate is generally quite high. This might still be acceptable as long as the false negatives are controlled. Typically, detection is therefore followed by a second classification stage of the detected object; for example, a binary classification task as a mine or a non-mine. Some of these works have tried to push the bar even higher by trying to further classify the detected object according to its shape [54]. If the object of interest is a mine, this extra information might in some cases aid in developing more accurate methods for neutralizing the threat.

Figure 4.1: A sample object (circled) in a partial sidescan sonar image with its shadow to the right (a colormap has been added to the original grayscale image for better viewing).
In this work we discuss an approach to solve the latter object classification problem using Sparse Representation techniques. These methods come under the broad category of non-parametric methods; this class of supervised methods does not compute any parameters or train any models from the data. Instead of trying to compute a sufficient statistic from the features, these methods use the entire set of features during classification. We will show that this approach is flexible enough to capture much of the variation in real data. Sparse representation methods are able to exploit the fact that all samples from one class are essentially distortions of some reference exemplar, and hence must lie on a lower dimensional subspace [65]. In the context of underwater objects, this means that features from objects of the same class would, in spite of their intra-class differences, share a common subspace which can be exploited for classification.
In most practical applications, the dictionary can be effectively formed by stacking training data exemplars column-wise into the dictionary [45, 58, 59]. However, one of the issues that arises from the dictionary formation process is the problem of collinearity [59]. In particular, in the event of a (highly) collinear dictionary, the convenient L_1 solution is not the best to deploy, due to potential failure to satisfy the oracle conditions given in [70]. In Chapter 1 of this thesis [45], we have shown that the Elastic Net algorithm [71] is able to handle collinear dictionaries more efficiently than the LASSO algorithm, and is able to yield higher classification accuracy. The Elastic Net algorithm has additionally been demonstrated to be effective in a variety of applications, such as in denoising [58, 59]. In this chapter, we further testify to the robustness of the Elastic Net algorithm by evaluation on an additional database (the NURC database).
Typically in a Sparse Representation Classification procedure, after we obtain the coefficients, we need a good method to interpret them to effectively obtain the classification result. Wright et al. employed a minimal residue decision rule [65], and we have previously employed voting on the sum of coefficients from a particular class [45]. In this work, we will demonstrate the effectiveness of the coefficient-sum voting scheme over the minimal residue decision rule for the underwater object feature sets. Additionally, we will propose a novel optimization scheme which tightly couples regularization and decision making, and using this new scheme we are able to further improve upon the Elastic Net combined with either decision rule.
Besides the two sidescan sonar image datasets that we evaluate on, we also include results for the YaleB face database to reinforce the effectiveness of our algorithms. In fact, experimental evidence shows the potential applicability of our algorithms to other domains, given the right feature extraction procedure and a good training set.
This chapter is structured as follows: Section 4.2 provides the problem formulation, algorithmic description and derivation. Section 4.8 provides the experimental results and their interpretation/implications. Finally, Section 4.9 concludes the chapter with possible ramifications and extensions of this work.
4.2 Problem Description and Derivation
4.2.1 Sparse Representation Formulation
Subspace models assume that samples from a single class lie on a lower dimensional linear subspace. In fact, given sufficient training data from each of the classes, we assume that it is possible to represent any test sample as a linear combination of training samples from its class (Linearity Assumption).
In particular, if {x_i^(k)}_{i=1}^{n} are training samples for object k and y^(k) is a test sample from the same class, then y^(k) will approximately lie in the span of the above training samples:

y^(k) = Σ_{i=1}^{n} α_i x_i^(k)    (4.1)

for some real scalars α_i.
In the context of object recognition these subspace assumptions have been studied
in depth for objects under various lighting and illuminations [4]. When applied to face
recognition these are usually called face subspaces and have been shown to capture
variations in expressions and different conditions of illumination [65].
Given that this assumption holds, we now construct a dictionary A comprising training samples from all classes. For classifying a test sample y, we would like to exploit the fact that y can be represented as a sparse linear combination of a few columns of A (the ones containing samples corresponding to object k). In other words, if we write y as follows
y =Av (4.2)
then v should be a sparse vector with only a few nonzero values (parsimony property). To ensure sparsity of v, this can be posed as a minimization problem in which we seek a vector v with small norm that also (approximately) satisfies Equation (4.2):
min_v ||y − Av||_2^2 + λ||v||_0    (4.3)
Since the above formulation is NP-hard, Equation (4.3) is frequently relaxed to

min_v ||y − Av||_2^2 + λ||v||_p ,   p > 0    (4.4)
Choosing the kind of norm for v is crucial here. Minimizing the L_2 norm, for example, has been shown to return non-sparse solutions due to its poor emulation of the L_0 penalty function [59]. The more commonly taken route is p = 1. There exists some evidence that, provided v is sparse enough, the solution to the L_1 minimization problem approaches the solution to the L_0 minimization problem under the right conditions [19]. However, in most practical situations such conditions and the oracle conditions in [70] are rarely fulfilled, and hence, for our baseline, we choose to adopt a more robust alternative in the form of the Elastic Net (mixed L_1−L_2 norm):
min_v ||y − Av||_2^2 + λ_1 ||v||_1 + λ_2 ||v||_2^2    (4.5)
The Elastic Net is able to ameliorate the problem of dictionary collinearity by effectively achieving rank augmentation:

||y − Av||_2^2 + λ_1 ||v||_1 + λ_2 ||v||_2^2 = || [y; 0] − (1/√(1+λ_2)) [A; √λ_2 I] v ||_2^2 + λ_1 ||v||_1    (4.6)

where [y; 0] and [A; √λ_2 I] denote y stacked with a zero vector and A stacked with the scaled identity matrix √λ_2 I, respectively.
Note that by setting λ_2 = 0, we effectively get back the LASSO solution. Another interpretation as to why the Elastic Net solution could be more effective in general is in terms of the priors: the LASSO formulation corresponds to a Laplacian prior, while the L_2 (ridge) penalty corresponds to a Gaussian prior. The Elastic Net sports a mixed prior, which gives us the leeway to find a good tradeoff between the two. In terms of implementation, the Elastic Net can be implemented very efficiently using the Least Angle Regression solution [24], which is a greedy algorithm. Moreover, being a greedy algorithm allows us to control the sparsity constraint more efficiently.
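As a concrete illustration of the augmentation in Equation (4.6), the following minimal MATLAB sketch builds the augmented system and hands it to an L_1 solver; lasso() from the Statistics Toolbox is used purely as an example of such a solver (its internal scaling of the penalty differs from λ_1), so this is a sketch of the construction rather than a reproduction of our implementation, which uses Least Angle Regression:

[~, nAtoms] = size(A);
A_aug = [A; sqrt(lambda2) * eye(nAtoms)] / sqrt(1 + lambda2);   % augmented dictionary
y_aug = [y; zeros(nAtoms, 1)];                                  % y stacked with zeros
v = lasso(A_aug, y_aug, 'Lambda', lambda1);                     % L1-penalized fit on the augmented data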
4.2.2 Baseline Sparse Representation-based Classification Schemes
After solving the Elastic Net formulation in Equation (4.5), we obtain the solution vector v returned from the optimization problem.

Wright et al. [65] have employed a minimum residual rule for decision making. In particular, assuming we have a total of m classes, for each i ∈ {1···m} we compute the residual error ||y − Â_i v̂_i||_2^2, where Â_i is the column subset of the dictionary corresponding to the i-th class, and v̂_i is the entry subset of the vector corresponding to the i-th class. The class that the observation y lies in is then the class which minimizes this residual error. Wright's original algorithm is given in Algorithm 3. In addition, one of the baselines we will use for this chapter is the extension of Wright et al.'s Sparse Representation Classification (SRC) [65] to the Elastic Net formulation, summarized in Algorithm 4. The reason we do this is that the L_1 formulation is actually a special case of the Elastic Net formulation (the Elastic Net is equivalent to the L_1 formulation when λ_2 = 0), and since Algorithm 3 can essentially be viewed as a regularization step followed by a decision rule, we can swap out the regularization algorithm to give Algorithm 4.
Algorithm 3 Wright's original Sparse Representation Classification (SRC) algorithm
1: Step 1: Input: a matrix of training samples A = [A_1 ··· A_m], where m is the number of classes.
2: Step 2: Solve the L_1 formulation:
     min_v ||y − Av||_2^2 + λ_1 ||v||_1
3: Step 3: For each i ∈ {1···m}, compute the residual:
     r_i = ||y − Â_i v̂_i||_2^2
4: Step 4: The class occupancy of y is decided by the i for which r_i is the smallest over i ∈ {1···m}. In the event of a tie, a class is randomly chosen out of the set of minimal r_i values.
Algorithm 4 Extension of Sparse Representation Classification to the Elastic Net
1: Step 1: Input: a matrix of training samples A = [A_1 ··· A_m], where m is the number of classes.
2: Step 2: Solve the Elastic Net formulation:
     min_v ||y − Av||_2^2 + λ_1 ||v||_1 + λ_2 ||v||_2^2
   This includes the case where λ_2 = 0, which reduces to Algorithm 3's L_1 formulation.
3: Step 3: For each i ∈ {1···m}, compute the residual:
     r_i = ||y − Â_i v̂_i||_2^2
4: Step 4: The class occupancy of y is decided by the i for which r_i is the smallest over i ∈ {1···m}. In the event of a tie, a class is randomly chosen out of the set of minimal r_i values.
We will also consider a scheme that we previously proposed in [45]. In particular, we compute the score C_k for each class k from the coefficient vector v:

C_k = Σ_{∀ i | L_i = k} v_i    (4.7)

where L_i denotes the class label of the i-th dictionary column.
For the underwater object classification context, we experimentally found this scoring scheme to be superior to other similar heuristic schemes, such as the simple intuitive way of voting by just the maximum coefficient or the minimum residual rule detailed above (see the experimental section, Section 4.8). The rule we propose can be motivated as follows: if we exponentiate the objective function in Equation (4.5), we see that the part corresponding to the solution vector in fact corresponds to the prior probabilities. Hence, adding up the coefficients corresponding to each class gives the "strength" of that class for the classification task.

Hence, we can now formulate our second baseline, as outlined in Algorithm 5.
Algorithm 5 Elastic Net based Sparse Representation Classification (SRC) with the sum-coeff decision rule
1: Step 1: Input: a matrix of training samples A = [A_1 ··· A_m], where m is the number of classes.
2: Step 2: Solve the Elastic Net formulation:
     min_v ||y − Av||_2^2 + λ_1 ||v||_1 + λ_2 ||v||_2^2
3: Step 3: Compute the score C_i for each class i ∈ {1···m} from the coefficient vector v:
     C_i = Σ_{∀ l | L_l = i} v_l
4: Step 4: The class occupancy of y is decided by the i for which C_i is the greatest over i ∈ {1···m}. In the event of a tie, a class is randomly chosen out of the set of maximal C_i values.
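For concreteness, a minimal MATLAB sketch of the two decision rules (Steps 3-4 of Algorithms 3-5) is given below; v is the coefficient vector returned by the regularizer, and labels(j), an illustrative name, holds the class label of the j-th dictionary column (tie handling is omitted for brevity):

r = zeros(m, 1); C = zeros(m, 1);
for i = 1:m
    in_i = (labels == i);                       % atoms belonging to class i
    r(i) = norm(y - A(:, in_i) * v(in_i))^2;    % residual rule of Algorithms 3-4
    C(i) = sum(v(in_i));                        % sum-coeff rule of Algorithm 5
end
[~, class_minres]   = min(r);
[~, class_sumcoeff] = max(C);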
4.3 An Inspiration - The Monty Hall Problem
Before we proceed to describe our new algorithmic formulations, we would first like to draw attention to the famous Monty Hall Problem [55], which is the inspiration for the algorithms we are about to propose. In its classic form, the Monty Hall problem presents three doors to a contestant, two of which are empty while a third contains a prize. The host, known as Monty Hall, asks us to pick one of the doors. We then choose a door but do not open it. Monty Hall now opens one of the two unchosen doors, careful to choose one which is empty (since Monty Hall knows where the prize lies). Monty Hall then gives us the option to either stick with our original door or to switch to the other unopened door. By Bayes' rule, to maximize our chances of getting the prize, we should in fact always switch. The Monty Hall problem and a variety of its variants can be found in Rosenhouse [55]. While not exactly the same, the Monty Hall problem is the inspiration for our algorithmic formulations in the next section, where we propose new classification algorithms in which the decision making step is deferred with an iterative approach.
4.4 New Algorithmic Approach: Coupling Optimization
and Decision Making
Our experiments (see Section 4.8) testify to the effectiveness of the sum-coeff rule (Algorithm 5) over the min-residual rule (Algorithm 4) for the underwater object classification task. In this section, we further expand upon the sum-coeff rule to formulate a new algorithm which more tightly couples decision making and optimization. Experimental results testify to the robustness of this algorithm on two underwater sidescan sonar image datasets. We additionally formulate an analogous version for the minimum residual rule (the iterated max-res algorithm – see Algorithm 7), and show its effectiveness over the vanilla min-res version.
In particular, we have mentioned before the connection of the sum rule to prior probabilities – a higher sum corresponds to a higher probability. By the same token, a lower sum indicates a lower probability and makes it less likely that the observation y takes on that particular class label. Hence, we propose that instead of making the decision immediately after one round of optimization, we discard from the dictionary the class having the smallest C_i value. This gives us a dictionary with one class fewer, and we re-evaluate the optimization/regularization step with this new dictionary. We continue to iterate the process of optimization and discarding classes until we are left with one class. We shall show experimentally that this strategy yields superior classification results compared to the schemes outlined in Algorithms 4 and 5. A detailed outline of the algorithm is given in Algorithm 6.
Following a similar trend as laid out above, we propose an iterated max-res algorithm (the analog of the vanilla min-res algorithm) which defers decision making to the end rather than instantly coming to a decision after one iteration of the optimization step. In particular, for each iteration of the algorithm, we choose to eliminate the least likely class by removing the atoms corresponding to the class which gives the maximum residual value. Again, we see that this is equivalent to narrowing the search space by eliminating the unlikely class as measured by the L_2 error norm. The algorithmic description is given in Algorithm 7.
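As a concrete illustration, a minimal MATLAB sketch of the iterated sum-rule scheme (Algorithm 6 below) is given here; elastic_net() is a placeholder for any Elastic Net solver (e.g. a LARS-EN implementation), labels(j) holds the class label of dictionary column j, and the remaining names are illustrative (tie handling is simplified):

remaining = 1:m;                                        % classes still in contention
A_cur = A; lab_cur = labels;
for iter = 1:m-1
    v = elastic_net(A_cur, y, lambda1, lambda2);        % regularization step
    C = accumarray(lab_cur(:), v(:), [m, 1]);           % per-class coefficient sums
    C(setdiff(1:m, remaining)) = Inf;                   % never re-select removed classes
    [~, worst] = min(C);                                % least supported class
    keep = (lab_cur ~= worst);                          % drop its atoms from the dictionary
    A_cur = A_cur(:, keep); lab_cur = lab_cur(keep);
    remaining = setdiff(remaining, worst);
end
predicted_class = remaining;                            % the last standing class

The iterated max-res variant (Algorithm 7) is obtained by replacing the per-class sums with the per-class residuals and discarding the class with the largest residual at each iteration.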
4.5 The case of orthogonal subspaces
When the classification accuracy is perfect, we can see that the iterated scheme would coincide with the vanilla version. One situation where this will definitely happen is the case where all the classes lie on orthogonal subspaces.
Algorithm 6 Proposed algorithm for tightly-coupled optimization and decision making for robust classification by the iterated sum-rule approach
1: Step 1: Input: a matrix of training samples A = [A_1 ··· A_m], where m is the number of classes.
2: Initialize A′ = A and rc = {}, where rc is short for "Removed Class".
3: Step 2: For iter = 1 ··· m−1:
   • Solve the Elastic Net formulation:
       min_v ||y − A′v||_2^2 + λ_1 ||v||_1 + λ_2 ||v||_2^2
   • Compute the score C_i for each class i ∈ {1···m} \ rc from the coefficient vector v:
       C_i = Σ_{∀ l | L_l = i} v_l
   • Add to rc the i for which C_i is the smallest over i ∈ {1···m} \ rc (in the event of a tie, a class is randomly chosen out of the set of minimal C_i values), i.e. rc → rc + {i}. In addition, set A′ to be the dictionary without the columns corresponding to class label i, i.e. A′ → A′ \ {columns corresponding to class i}.
4: Step 3: Make the decision by assigning the label of y to the class label of the last standing class.
Algorithm 7 Proposed algorithm for tightly-coupled optimization and decision making for robust classification by the iterated max-res approach
1: Step 1: Input: a matrix of training samples A = [A_1 ··· A_m], where m is the number of classes.
2: Initialize A′ = A and rc = {}, where rc is short for "Removed Class".
3: Step 2: For iter = 1 ··· m−1:
   • Solve the Elastic Net formulation:
       min_v ||y − A′v||_2^2 + λ_1 ||v||_1 + λ_2 ||v||_2^2
   • Compute the score r_i for each class i ∈ {1···m} \ rc from the coefficient vector v:
       r_i = ||y − Â_i v̂_i||_2^2
   • Add to rc the i for which r_i is the maximal over i ∈ {1···m} \ rc (in the event of a tie, a class is randomly chosen out of the set of maximal r_i values), i.e. rc → rc + {i}. In addition, set A′ to be the dictionary without the columns corresponding to class label i, i.e. A′ → A′ \ {columns corresponding to class i}.
4: Step 3: Make the decision by assigning the label of y to the class label of the last standing class.
To illustrate this, let us define a series of matrices H_1, H_2, ···, H_i with the same dimensions (the case where H_1, H_2, ···, H_i have different dimensions follows similarly), where the entries of H_k, k ∈ {1,···,i}, can be any real numbers (in our toy experiment we simply use the MATLAB command rand to generate a random matrix). We then form a dictionary B = diag(H_1, H_2, ···, H_i), where B is a block diagonal dictionary matrix formed by diagonalizing H_1, H_2, ···, H_i. We then define a series of test samples such that we have one sample lying in each subspace spanned by the columns of B. To illustrate with a concrete example, let us say that we have the following dictionary matrix (before normalization):
B = [ 1.2  3.4   0    0    0    0     0    0
      2.1  0.3   0    0    0    0     0    0
      0    0    0.3  1.6   0    0     0    0
      0    0    2.1  1.6   0    0     0    0
      0    0     0    0   0.6  0.8    0    0
      0    0     0    0   2.6  0.88   0    0
      0    0     0    0    0    0    10   12
      0    0     0    0    0    0    12  45.1 ]    (4.8)
Note that in Eq. (4.8), the first two columns span a subspace (Class 1), the next
two span another orthogonal subspace (Class 2), and so on. Hence, one possible test
set for this dictionary can be the following:
(1, 1, 0, 0, 0, 0, 0, 0)^T,  (0, 0, 1, 1, 0, 0, 0, 0)^T,  (0, 0, 0, 0, 1, 1, 0, 0)^T,  (0, 0, 0, 0, 0, 0, 1, 1)^T    (4.9)
where we have one sample from each class. When we run the baseline algorithms (the vanilla SRC algorithms) discussed above, e.g. Algorithm 5, we get a perfect accuracy of one hundred percent. This is because the results returned from the Elastic Net algorithm have coefficients activated only in the relevant group. Hence, when we use the sum rule, we get the maximal sum for the group to which the observation belongs. Likewise, for the minimum residual rule, we get the minimum residue (in fact zero) for the group that the observation belongs to. Hence this ensures perfect classification accuracy.
In this case, our proposed algorithms will also give perfect classification accuracy. Because only the relevant class is activated, for the iterated sum rule (Algorithm 6) all the irrelevant classes at each step have sum zero and hence are randomly eliminated until we are left with the relevant class. For the iterated max-res rule (Algorithm 7), the irrelevant classes have maximal residue and are eliminated at each iteration, leaving the correct class at the end of the algorithm. To verify this, we wrote a MATLAB script which generates the dictionary matrix according to the specifications laid out above, along with the corresponding test set. We experimented with a variety of block sizes for the random blocks, and for all the algorithms we ran, the accuracy was one hundred percent.
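A minimal sketch of such a generator is shown below (this is not the actual script used for the experiments; nCls, blk and the other names are illustrative):

nCls = 4; blk = 2;                                         % number of classes and block size
blocks = arrayfun(@(k) rand(blk), 1:nCls, 'UniformOutput', false);
B = blkdiag(blocks{:});                                    % block diagonal dictionary as in Eq. (4.8)
B = B ./ sqrt(sum(B.^2, 1));                               % column-normalize the atoms
Y = kron(eye(nCls), ones(blk, 1));                         % one test sample per class, as in Eq. (4.9)
labels = repelem(1:nCls, blk);                             % class label of each dictionary column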
Hence, we see that the vanilla versions and the iterated versions of the algorithms coincide in performance for the case of orthogonal subspaces. For other cases, where the classification accuracy is less than perfect, the iterated versions can be much more accurate than the vanilla SRC algorithms, as shown in the experimental section.
4.6 Feature Selection Procedure
The role of feature extraction in pattern recognition problems is usually considered paramount. The general consensus seems to be that there is no single feature that works best for every problem. The common approach has been to find appropriate features suitable for particular datasets, and this is no different in the domain of underwater sonar images [57].
4.6.1 Zernike Moments
We have previously demonstrated the effectiveness of Zernike moment features over a variety of other features in the domain of classifying underwater sidescan images [45]. Zernike moments of an image are computed via an orthogonal transform in the polar domain, where the order of the representation controls the degree of generalizability. We use the magnitudes of Zernike moments, which have been shown to have rotational invariance properties for object recognition [39]. Their robustness to variabilities in underwater images has also been well established [53][44]. To compute the Zernike moment Z_nm of order (n,m), we find the projection of the image onto the basis function V_nm(x,y) as follows:
V_nm(x,y) = V_nm(ρ,θ) = R_nm(ρ) e^{−jmθ}    (4.10)

R_nm(ρ) = Σ_{s=0}^{(n−|m|)/2} [ (−1)^s (n−s)! / ( s! ((n+|m|)/2 − s)! ((n−|m|)/2 − s)! ) ] ρ^{n−2s}    (4.11)

Z_nm = ((n+1)/π) Σ_x Σ_y f(x,y) V*_nm(ρ,θ),   x^2 + y^2 ≤ 1    (4.12)
where the index n is constrained to be a nonnegative integer, while m can take positive or negative integer values subject to the constraints that n−|m| is even and |m| ≤ n. The range of n selects the order of the Zernike moments and the degree of generalizability of the description. For our current work, we found that representations at order n = 10 contained sufficient information for classification and also generalized well.
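For illustration, a minimal MATLAB sketch of Equations (4.10)-(4.12) for a single moment magnitude is given below; it assumes a square image patch f whose pixels are mapped onto the unit disc, and a valid (n,m) pair with n−|m| even (the function name and the pixel-to-disc mapping are illustrative; a full feature vector loops over all valid (n,m) up to n = 10):

function Zmag = zernike_magnitude(f, n, m)
    f = double(f);
    N = size(f, 1);                                % assumes a square patch
    [x, y] = meshgrid(linspace(-1, 1, N));
    rho = sqrt(x.^2 + y.^2); theta = atan2(y, x);
    inside = (rho <= 1);                           % restrict to the unit disc
    R = zeros(size(rho));
    for s = 0:(n - abs(m))/2                       % radial polynomial of Eq. (4.11)
        R = R + (-1)^s * factorial(n - s) / ...
            (factorial(s) * factorial((n + abs(m))/2 - s) * ...
             factorial((n - abs(m))/2 - s)) * rho.^(n - 2*s);
    end
    V = R .* exp(-1j * m * theta);                 % basis function of Eq. (4.10)
    Z = (n + 1) / pi * sum(f(inside) .* conj(V(inside)));   % Eq. (4.12)
    Zmag = abs(Z);                                 % rotation-invariant magnitude
end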
4.7 Sidescan Sonar Dataset Description and Preparation
4.7.1 Underwater Sidescan Sonar Dataset Description
Two datasets are employed for evaluation in this work – the NURC dataset and the NSWC PCD Scrubbed Images. These datasets contain sidescan sonar imagery in the form of 8-bit grayscale images, each containing one or more synthetic mine-like objects. Each of the objects is approximately 10-20 pixels in diameter and can belong to different classes based on its shape. Objects in the NURC dataset can be from any of 7 classes: Cone, Cylinder, Junk, Rock, Sphere, Wedding Cake or Wedge, while objects in the NSWC dataset are labeled as belonging to classes 1-4. This work deals with the problem of object classification, given that the object location has already been detected (detection is an entirely different problem, which is out of the scope of this chapter). For the NURC dataset, we have a total of 800 objects, with 206 samples from the Cone class, 172 samples from the Cylinder class, 70 samples from the Junk class, 92 samples from the Rock class, 64 samples from the Sphere class, 38 samples from the Wedding Cake class, and finally 158 samples from the Wedge class. For the NSWC dataset, we have 79, 90, 83, and 44 samples from Classes 1, 2, 3 and 4, respectively. Fig. 4.2 presents examples of the object classes from the NSWC database and Fig. 4.3 presents examples from the NURC database.
Figure 4.2: Samples of the 4 object types in the NSWC sidescan sonar database
While the object might usually be clearly visible as a bright highlight because of the strong reflection of sonar waves, substantial variations in object appearance can occur as a result of different sea-bed conditions around the object. In addition, different viewing angles of the object can make the identification task challenging even to the human eye.
One popular class of techniques employed for this problem compares the shapes against expected shapes generated through exhaustive simulations of their 3-dimensional templates. However, our datasets provide no additional information about the object shapes, ruling out the possibility of using such methods that rely on expert knowledge. No other information besides the object location and class was provided in these datasets.

Figure 4.3: Samples of the 7 object types in the NURC sidescan sonar database
4.7.2 Sidescan Sonar Image Dataset Preparation Methodology
Segmentation of highlights for the NSWC dataset is relatively trivial because the object location is known and the objects are most often symmetrically placed around the given location. Thus, we simply choose a radial area of 10 pixels around the object position to compute features on the object. In the NURC dataset, the objects are often more arbitrarily shaped, and the object location is only known within a margin of error. Hence, we employ a segmentation algorithm in this case to extract the object shape and compute features on it.
In this work we propose a segmentation scheme based on Mean Shift Clustering (MSC) [14] that adaptively clusters the intensities in these images, making the segmentation easier. MSC is a mode seeking algorithm that is popular for use in unsupervised clustering. Given initial points in a data distribution, the MSC algorithm climbs in the direction of the density gradient to the nearest local mode of the distribution. In general MSC, the density at a point can be measured using a kernel function. In our case, we measure the density at a point in terms of the number of points within a circle of fixed radius, referred to as the bandwidth in MSC. The algorithm proceeds by computing the mean of all points within a specified radius (bandwidth) of the initial point. This point is then updated to the mean. This is repeated until the mean converges to a local mode of the distribution. All points traversed in the process are assigned to this mode/cluster:
x_{n+1} = ( Σ_s K(s − x_n) s ) / ( Σ_s K(s − x_n) )

K(s − x) = 1 if ||s − x|| ≤ R_b, and 0 otherwise
This is repeated for several random initial points until all the points have been assigned to some cluster. The bandwidth parameter provides control over the fineness of clustering. In particular, a smaller bandwidth might lead to a large number of clusters, while a larger bandwidth might instead merge similar clusters. To extract the highlight we use a large bandwidth parameter which undersegments the image, flooding all regions around the bright highlight into a single dark cluster. The rationale for doing this is based on the observation that the highlight in an image is typically much brighter compared to the shadow or clutter around the object. All pixels that do not belong to this large dark cluster are predicted to be part of the object highlight. Fig. 4.4 shows an example of this segmentation process on a sample of a wedge object from the NURC database.
Figure4.4: SegmentationresultsforasampleofthewedgeobjectclassfromtheNURC
database using the Mean-shift clustering (MSC) algorithm as described.
4.8 Experiments, Results and Interpretation
4.8.1 Underwater Sidescan Sonar Image Classification Results
Fig. 4.5 gives a graphical depiction of the sum-rule classification rule that we described; in particular, it shows the coefficients obtained from regularization and their sums. In this case, the classification result would be class 4 because it accumulates the maximum coefficient weight.
For various comparison points (baselines), we compare our proposed algorithms against standard classifiers like Support Vector Machines (SVM) and K Nearest Neighbors (KNN) using the same set of features. We use the WEKA [36] implementations for the classical classifiers.
The results in Table 4.1 present a 4-fold cross-validation classification on the segmented NSWC database for the various competing classification schemes, and the results in Table 4.2 similarly depict a 4-fold cross-validation classification of all schemes on the NURC database.

Figure 4.5: Demonstration of the sum-rule classification heuristic for a sample sparse representation computed on an overcomplete dictionary comprising 196 atoms for the NSWC database.

Table 4.1: Object classification accuracies for a 4-fold cross validation for various competing classifiers for the NSWC database

Classifier                        Accuracy (%)
KNN                                  94.6
SVM                                  95.9
Elastic Net with Min-res             95.6
Elastic Net with Sum-rule            96.6
Proposed algorithm (Max-res)         95.9
Proposed algorithm (Sum-rule)        97.0
As we can clearly see from our experimental results, for both datasets Sparse Representation Classification (SRC) performs better than classical classifiers like KNN and SVM. In addition, we observe that for our particular underwater object classification application, the sum-rule formulation generally performs better than the min-res formulation (or, analogously, the max-res formulation for the iterative version).
Table 4.2: Object classification accuracies for a 4-fold cross validation for various competing classifiers for the NURC database

Classifier                        Accuracy (%)
KNN                                  53.3
SVM                                  50.8
Elastic Net with Min-res             52.4
Elastic Net with Sum-rule            52.4
Proposed algorithm (Max-res)         53.9
Proposed algorithm (Sum-rule)        54.0
Interpretation-wise, the sum-rule formulation is motivated in terms of the prior probabilities (this can be seen by exponentiating the optimization objective function; a higher corresponding sum indicates a higher chance of occupancy in that class), while the min-res formulation is motivated in terms of a smaller L_2 error between the observation and the reconstructed vector. As we can see, minimization of the L_2 error does not necessarily translate into better classification accuracy.
In addition, we see that our proposed algorithms outperform both the classical algorithms and the vanilla SRC algorithms. This can be intuitively explained by the fact that what we are proposing is an iterated L_p scheme – instead of doing the optimization in one pass and then immediately coming to a decision on the class occupancy, we optimize multiple times and, instead of coming to a decision, eliminate the least likely class at each step. This allows for a more intensive optimization at each iteration and, as our results show, leads to more accurate classification results.
In terms of complexity, our proposed algorithm depends on the number of classes in the classification task. In our classification tasks, we only have 4 and 7 classes, which multiplies the optimization time for a single sample by roughly 4 and 7, respectively. However, when the number of classes grows large, this could potentially lead to overhead in computation times. Because our regularization algorithms are implemented with the Least Angle Regression technique, the overhead in our underwater object classification experiments is manageable.
4.8.2 Face Recognition Results
We additionally include face recognition results to reinforce the effectiveness of our algorithms, and also because Sparse Representation Classification is popularly applied to the face recognition problem setup [65]. Wright et al. [65] have already demonstrated the effectiveness of Sparse Representation techniques (and particularly L_1-based strategies) over classical classification techniques for this setting. A fundamental assumption that the face setup makes is that the faces belonging to a class lie on a single subspace, given sufficient training samples. This is related to the guarantee of sparsity. Additionally, to ensure that the dictionary is more overcomplete (a fatter matrix), it makes sense to reduce the size of the images in the database.
The database which we used for this experiment is the YaleB face database, which contains a total of 2414 images and 38 face classes, each class containing a person under different lighting conditions (illumination). This gives us an average of about 64 sample images per face class. The images are of size 192×168, and we resize them to 12×10 using bicubic interpolation.
To construct our dictionary, we simply linearize the resized images and stack them into the columns of the dictionary matrix. For the 2414 images, a five-fold cross-validation is carried out for the following techniques: minimum residual LARS-LASSO (L_1), sum-rule LARS-LASSO (L_1), minimum residual Elastic Net (L_1−L_2), sum-rule Elastic Net (L_1−L_2), our proposed algorithm with the iterated maximum residual heuristic, and our proposed algorithm with the sum-rule heuristic. Table 4.3 gives the results of the experiments.
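A minimal MATLAB sketch of this dictionary construction is shown below; faces is a hypothetical cell array holding the 192×168 images, and the column normalization is a standard preprocessing step rather than something mandated above:

A = zeros(12 * 10, numel(faces));
for j = 1:numel(faces)
    small = imresize(faces{j}, [12 10], 'bicubic');   % Image Processing Toolbox
    A(:, j) = double(small(:));                       % linearize and stack as a column
end
A = A ./ sqrt(sum(A.^2, 1));                          % column-normalize the atoms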
Table 4.3: Face recognition results for a 5-fold cross validation of the YaleB face database.

Classifier                        Accuracy (%)
LARS-LASSO (Min-res)                 92.9
LARS-LASSO (Sum-rule)                92.2
Elastic Net (Min-res)                92.9
Elastic Net (Sum-rule)               92.9
Proposed algorithm (Max-res)         94.0
Proposed algorithm (Sum-rule)        93.8
As we can see from Tables 4.1, 4.2 and 4.3, our proposed algorithms consistently perform better than the vanilla versions of the algorithms, which further testifies to the effectiveness of our algorithms on an additional dataset, given the right feature extraction procedure and an appropriate training set. However, we observe that for the underwater sidescan sonar feature sets the sum-rule versions of the algorithms generally perform better than the min-res/max-res versions, whereas for the face database the reverse is true. A way to determine which method will work better for a given feature set is to experiment on a small pre-annotated tuning set. We can thus find out beforehand which of the sum-rule or min-res decision rules will yield better classification results for the feature set in question.
4.9 Conclusion
In this chapter we propose a series of new methods based on Sparse Representation to deal with the intra-class variabilities of objects in sidescan sonar images for object classification. Training samples from each class form a lower dimensional subspace which is closest to the samples from that class. We see that Sparse Representation techniques exploit this inherent structure to provide discriminative ability without explicit training. We additionally propose schemes to couple regularization and decision making (class assignment) in a more motivated fashion, while keeping the computational overhead manageable in terms of the number of classes for our experiments. Results on the NSWC and NURC sidescan sonar databases suggest that our methods are robust (in terms of recognition accuracies) in the presence of variabilities. We additionally demonstrated that our algorithms are more robust than the vanilla sparse classification algorithms even on the face recognition task, indicating that, with an appropriate feature extraction procedure and a good training set, these proposed algorithms can generalize to a variety of classification tasks.
In future work, it would be interesting and helpful to develop algorithms or techniques that couple classification with decision making without overhead in terms of the number of classes. While all iterated L_p schemes are inherently more computationally intensive than their non-iterative counterparts, it would be interesting to look into ways to overcome this issue. This will be particularly relevant when the number of classes grows large (e.g. in a face database where there could potentially be hundreds or thousands of face classes). Our algorithms, while applied in this chapter to underwater object classification and face recognition, are generally domain independent; hence it would be helpful to think of extensions/solutions for when the number of classes becomes unmanageable.
In addition, it would be interesting to bring in other sources of information which could potentially lead to improvements in classification accuracy for the underwater task (e.g. information from shadows cast by the object, ambient temperatures, etc.). It would be interesting to develop extensions which could integrate such information into the Sparse Representation Classification framework effectively and seamlessly.
Bibliography
[1] M. Aharon, M. Elad, A. Bruckstein, and Y. Katz. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006.
[2] J. Barker, M. Cooke, and P. Green. Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise. In Seventh European Conference on Speech Communication and Technology, 2001.
[3] J. Barker, P. D. Green, and M. Cooke. Linking auditory scene analysis and robust ASR by missing data techniques. Proceedings - Institute of Acoustics, 23(3):295-308, 2001.
[4] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 218-233, 2003.
[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[6] C. M. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.
[7] B. J. Borgström and A. Alwan. Missing feature imputation of log-spectral data for noise robust ASR. Workshop on DSP in Mobile and Vehicular Systems, 2009.
[8] B. J. Borgström and A. Alwan. Utilizing Compressibility in Reconstructing Spectrographic Data, With Applications to Noise Robust ASR. IEEE Signal Processing Letters, 16(5), 2009.
[9] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.
[10] E. Candès and M. Wakin. People hearing without listening: An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21-30, 2008.
[11] E. J. Candès, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207, 2006.
[12] E. J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877-905, 2008.
[13] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, pages 129-159, 2001.
[14] Y. Cheng. Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(8):790-799, 1995.
[15] M. Cooke, P. Green, L. Josifovski, and A. Vizinho. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication, 34(3):267-285, 2001.
[16] Martin Cooke, Phil Green, Ljubomir Josifovski, and Ascension Vizinho. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication, 34:267-285, 2001.
[17] John R. Deller, Jr., John G. Proakis, and John H. Hansen. Discrete Time Processing of Speech Signals. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1993.
[18] G. J. Dobeck. Algorithm fusion for automated sea mine detection and classification. In OCEANS, 2001. MTS/IEEE Conference and Exhibition, volume 1, pages 130-134. IEEE, 2001.
[19] D. L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797-829, 2006.
[20] D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2005.
[21] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83-91, 2008.
[22] E. Dura, Y. Zhang, X. Liao, G. J. Dobeck, and L. Carin. Active learning for detection of mine-like objects in side-scan sonar imagery. Oceanic Engineering, IEEE Journal of, 30(2):360-371, 2005.
[23] B. Efron. Missing Data, Imputation, and the Bootstrap. Journal of the American Statistical Association, 89(426):83-127, 1994.
[24] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-451, 2004.
[25] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736-3745, 2006.
[26] ETSI ES 201 108. v1.1.3 Speech Processing, Transmission and Quality aspects (STQ); Distributed speech recognition; Frontend feature extraction algorithm; Compression algorithms, September 2003.
[27] A. Faul and M. E. Tipping. Analysis of Sparse Bayesian learning. In NIPS, volume 14, 2002.
[28] J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Preprint, 2010.
[29] X. Gan, A. W. C. Liew, and H. Yan. Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Research, 34(5):1608-1619, 2006.
[30] J. Gemmeke, L. ten Bosch, L. Boves, and B. Cranen. Using sparse representations for exemplar based continuous digit recognition. EUSIPCO Glasgow, Scotland, 2009.
[31] J. F. Gemmeke and B. Cranen. Using sparse representations for missing data imputation in noise robust speech recognition. In Proc. of EUSIPCO, 2008.
[32] J. F. Gemmeke, B. Cranen, and U. Remes. Sparse imputation for large vocabulary noise robust ASR. Computer Speech & Language, 2010.
[33] Jort Gemmeke and Bert Cranen. Missing data imputation using compressive sensing techniques for connected digit recognition. Proceedings of the 16th International Conference on Digital Signal Processing, pages 37-44, 2009.
[34] Jort Florent Gemmeke, Hugo Van Hamme, Bert Cranen, and Lou Boves. Compressive sensing for missing data imputation in noise robust speech recognition. IEEE Journal of Selected Topics in Signal Processing, 4(2):272-286, April 2010.
[35] Irina F. Gorodnitsky and Bhaskar D. Rao. Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Trans. Signal Processing, pages 600-616, 1997.
[36] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18, 2009.
[37] H. V. Hamme. PROSPECT features and their application to missing data techniques for robust speech recognition. In Eighth International Conference on Spoken Language Processing, 2004.
[38] Ljubomir Josifovski, Martin Cooke, Phil Green, and Ascension Vizinho. State based imputation of missing data for robust speech recognition and speech enhancement. In Proc. Eurospeech, pages 2837-2840, 1999.
[39] A. Khotanzad and Y. H. Hong. Invariant image recognition by Zernike moments. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(5):489-497, 1990.
[40] Y. C. Kim, S. S. Narayanan, and K. S. Nayak. Accelerated three-dimensional upper airway MRI using compressed sensing. Magnetic Resonance in Medicine, 61(6):1434-1440, 2009.
[41] K. Knight and W. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356-1378, 2000.
[42] A. Kokaram and S. Godsill. A system for reconstruction of missing data in image sequences using sampled 3D AR models and MRF motion priors. Computer Vision ECCV '96, pages 613-624, 1996.
[43] N. Kumar and A. G. Andreou. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26(4):283-297, 1998.
[44] N. Kumar, A. C. Lammert, B. Englot, F. S. Hover, and S. S. Narayanan. Directional descriptors using Zernike moment phases for object orientation estimation in underwater sonar images. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 1025-1028. IEEE, 2011.
[45] N. Kumar, Q. F. Tan, and S. S. Narayanan. Object classification in sidescan sonar images with sparse representation techniques. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2012.
[46] G. Lindfield, J. E. T. Penny, and J. Penny. Numerical Methods Using MATLAB. Prentice Hall, 1999.
[47] J. Liu, S. Ji, and J. Ye. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009.
[48] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72-82, 2008.
[49] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
[50] L. Meier, S. Van De Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53-71, 2008.
[51] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246-270, 2009.
[52] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 40-44, 1993.
[53] O. Pizarro and H. Singh. Toward large-area mosaicing for underwater scientific applications. Oceanic Engineering, IEEE Journal of, 28(4):651-672, 2003.
[54] S. Reed, Y. Petillot, and J. Bell. Automated approach to classification of mine-like objects in sidescan sonar using highlight and shadow information. In Radar, Sonar and Navigation, IEE Proceedings-, volume 151, pages 48-56. IET, 2004.
[55] Jason Rosenhouse. The Monty Hall Problem: The Remarkable Story of Math's Most Contentious Brain Teaser. Oxford University Press, USA, 2009.
[56] T. N. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran. Bayesian compressive sensing for phonetic classification. In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 4370-4373. IEEE, 2010.
[57] J. Stack. Automation for underwater mine recognition: current trends and future strategy. In Proceedings of SPIE, volume 8017, page 80170K, 2011.
[58] Q. F. Tan and S. S. Narayanan. Novel variations of group sparse regularization techniques with applications to noise robust automatic speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(4):1337-1346, May 2012.
[59] Q. F. Tan, P. G. Georgiou, and S. Narayanan. Enhanced sparse imputation techniques for a robust speech recognition front-end. IEEE Transactions on Audio, Speech, and Language Processing, 19(8):2418-2429, Nov. 2011.
[60] Q. F. Tan and S. Narayanan. Combining window predictions efficiently - a new imputation approach for noise robust automatic speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[61] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society (Series B), 58:267-288, 1996.
[62] M. E. Tipping and A. Faul. Fast marginal likelihood maximisation for sparse Bayesian models. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, volume 1, 2003.
[63] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.
[64] D. P. Wipf and B. D. Rao. Sparse Bayesian learning for basis selection. IEEE Transactions on Signal Processing, 52(8):2153-2164, 2004.
[65] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 210-227, 2008.
[66] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry. Feature selection in face recognition: A sparse representation perspective. UC Berkeley Technical Report UCB/EECS-2007-99, 2007.
[67] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book, 2000.
[68] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.
[69] P. Zhao and B. Yu. On model selection consistency of LASSO. The Journal of Machine Learning Research, 7:2563, 2006.
[70] H. Zou. The adaptive LASSO and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418-1429, 2006.
[71] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301-320, 2005.
Abstract
This thesis proposes novel variations of Sparse Representation techniques and shows successful applications to a variety of fields such as Automatic Speech Recognition (ASR) denoising, Face Recognition and Underwater Image Classification. A section of the thesis makes new algorithmic contributions in Group Regularization which are able to better handle collinear dictionaries. A new ASR front-end is introduced, applying these algorithms to feature denoising in the ASR pipeline. The thesis also explores effective ways of partitioning the dictionary for improved speech recognition results over a range of baselines. A new method for combining predictions from different feature windows is also explored. In addition to the denoising section, this thesis also proposes new methods for Sparse Representation Classification (SRC) which better couple the regularization and decision making steps. The effectiveness of these new methods is shown in both Face Recognition and Underwater Image Classification. Since all these techniques are domain independent given the right feature extraction procedure and training set, they show great promise for application to a gamut of other areas.
Linked assets: University of Southern California Dissertations and Theses
Conceptually similar:
Representation, classification and information fusion for robust and efficient multimodal human states recognition
Matrix factorization for noise-robust representation of speech data
Visual representation learning with structural prior
Multimodality, context and continuous dynamics for recognition and analysis of emotional states, and applications in healthcare
Compression algorithms for distributed classification with applications to distributed speech recognition
Noise aware methods for robust speech processing applications
Sparse representation models and applications to bioinformatics
Human behavior understanding from language through unsupervised modeling
A computational framework for diversity in ensembles of humans and machine systems
Speech recognition error modeling for robust speech processing and natural language understanding applications
On the theory and applications of structured bilinear inverse problems to sparse blind deconvolution, active target localization, and delay-Doppler estimation
Efficient coding techniques for high definition video
Context-aware models for understanding and supporting spoken interactions with children
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
Knowledge-driven representations of physiological signals: developing measurable indices of non-observable behavior
A computational framework for exploring the role of speech production in speech processing from a communication system perspective
On optimal signal representation for statistical learning and pattern recognition
Syntax-aware natural language processing techniques and their applications
Modeling expert assessment of empathy through multimodal signal cues
Toward understanding speech planning by observing its execution—representations, modeling and analysis
Asset Metadata
Creator: Tan, Qun Feng (author)
Core Title: Novel variations of sparse representation techniques with applications
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 04/22/2013
Defense Date: 03/13/2013
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: algorithms, convex optimization, machine learning, OAI-PMH Harvest, pattern recognition, sparse representation, speech processing
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Narayanan, Shrikanth S. (committee chair), Mitra, Urbashi (committee member), Ortega, Antonio K. (committee member), Sha, Fei (committee member)
Creator Email: qunfengtan@hotmail.com, tanqunfeng@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-241019
Unique identifier: UC11294527
Identifier: etd-TanQunFeng-1573.pdf (filename), usctheses-c3-241019 (legacy record id)
Legacy Identifier: etd-TanQunFeng-1573.pdf
Dmrecord: 241019
Document Type: Dissertation
Rights: Tan, Qun Feng
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA