A COMPUTATIONAL FRAMEWORK FOR EXPLORING
THE ROLE OF SPEECH PRODUCTION IN SPEECH
PROCESSING FROM A COMMUNICATION SYSTEM
PERSPECTIVE
by
Prasanta Kumar Ghosh
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2011
Copyright 2011 Prasanta Kumar Ghosh
Dedication
To my dear parents for their unconditional love, support, and inspiration.
ii
Acknowledgements
I thank my advisor Professor Shrikanth S. Narayanan for his guidance and support
throughout my graduate work. He always challenged me to give more and demanded
the best. His philosophy and work principles have immensely influenced my academic
growth. I also thank the other committee members Professor Antonio Ortega and
Professor Louis Goldstein for their thoughtful comments and discussions.
Signal Analysis and Interpretation Laboratory (SAIL) has always been a fertile
environment for research with friendly group discussions and collaborative work culture.
Many thanks to all the SAIL members for that.
I am so fortunate to have constant moral support from my parents. They have
been a source of encouragement for me through thick and thin. This dissertation is
dedicated to them.
iii
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vii
List of Figures viii
Abstract xi
Chapter 1: Introduction 1
Chapter 2: Exemplar-specific approach to explore the role of speech
production in speech processing – motivation and principles 7
Chapter 3: Articulatory Databases and features 11
3.1 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 EMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Tract variables (TV) features . . . . . . . . . . . . . . . . . . . . 13
Chapter 4: Optimal filterbank for computing acoustic features - an in-
formation theoretic link between production and perception 18
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Articulatory gesture representation . . . . . . . . . . . . . . . . . . . . . 25
4.4 Representation of the filterbank output . . . . . . . . . . . . . . . . . . 26
4.5 Filterbank optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
iv
Chapter 5: Generalized smoothness criterion (GSC) for acoustic-to-
articulatory inversion 38
5.1 Dataset and pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Empirical frequency analysis of articulatory data . . . . . . . . . . . . . 44
5.3 Generalized smoothness criterion for the inversion problem . . . . . . . 46
5.4 Recursive solution to the inversion problem . . . . . . . . . . . . . . . . 51
5.5 Selection of acoustic features for the inversion problem . . . . . . . . . . 54
5.6 Experiments and results . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Chapter 6: Analysis of inter-articulator correlation in acoustic to artic-
ulatory inversion using GSC 66
6.1 Dataset and pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.2 Generalized smoothness criterion (GSC) for articulatory inversion . . . . 68
6.3 Correlation among estimated articulator trajectories . . . . . . . . . . . 69
6.4 Modified GSC to preserve inter-articulator correlation . . . . . . . . . . 73
6.4.1 Transformation of the articulatory position vector . . . . . . . . 74
6.4.2 Frequency analysis of the transformed variables . . . . . . . . . . 74
6.4.3 Inversion using transformed articulatory features . . . . . . . . . 75
6.4.4 Articulatory inversion results using modified GSC . . . . . . . . 76
6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 7: Subject-independent acoustic-to-articulatory inversion using
GSC 79
7.1 Dataset and pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2 Proposed subject-independent inversion . . . . . . . . . . . . . . . . . . 81
7.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.3.1 EMA features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.3.2 TV features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 8: Automatic speech recognition using articulatory features
from subject-independent acoustic-to-articulatory inversion 90
Chapter 9: Comparison of recognition accuracies for using original and
estimated articulatory features 94
9.1 Datasets and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9.2 Subject-independent inversion . . . . . . . . . . . . . . . . . . . . . . . . 97
9.3 Automatic Speech Recognition experiments . . . . . . . . . . . . . . . . 99
9.4 Discussion of the recognition results . . . . . . . . . . . . . . . . . . . . 100
9.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
v
Chapter 10: Discussions - link to Motor Theory of speech perception 103
Chapter 11: Future works 106
11.1 The direction of causality in the information theoretic production-perception
link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11.2 Complementary characteristics of acoustic and articulatory features . . 107
11.3 The role of speech production in speech recognition by a non-native listener 107
11.4 Comparison to visual articulation . . . . . . . . . . . . . . . . . . . . . . 108
Bibliography 109
Appendix 116
vi
List of Tables
5.1 Number of frames of articulatory data available for training, development,
and test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 The mean f_c (standard deviation in bracket) of articulatory data of two
speakers in the MOCHA-TIMIT database. . . . . . . . . . . . . . . . . . 46
5.3 Mutual information between various acoustic features and articulatory
position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4 Best choices of γ_c and C for all articulatory positions optimized on dev set. 59
5.5 Accuracy of inversion in terms of RMS error E and correlation ρ (Female
speaker). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.6 Accuracy of inversion in terms of RMS error E and correlation ρ (Male
speaker). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.1 HMM based phonetic recognition accuracies on TIMIT corpus using acoustic-
only and acoustic-articulatory features. . . . . . . . . . . . . . . . . . . . 92
9.1 Average correlation coefficient (ρ) between original and estimated TV
features for different talker and exemplar combinations. . . . . . . . . . 98
9.2 Average phonetic recognition accuracy using acoustic and acoustic-articulatory
(both measured and estimated) features separately for each English sub-
ject. p-values indicate the significance in the change of recognition accu-
racy from the acoustic to acoustic-articulatory feature based recognition. 98
vii
List of Figures
2.1 An illustration of the proposed exemplar-specific framework: each exem-
plar uses his/her speech production knowledge to estimate the exemplar-
specific articulatory features for the speech utterance received from a talker
and then, uses both acoustic features from the received speech signal and
the estimated articulatory features to recognize the utterance. As the ar-
ticulatory features are exemplar-specific, the acoustic-articulatory model
is also exemplar-specific. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 An illustrative diagram showing the sensors in EMA data collection on the
midsagittal plane. The sensors are indicated by circles at upper lip (UL),
lower lip (LL), upper incisor (UI), lower incisor (LI), tongue tip (TT),
tongue body (TB), tongue dorsum (TD), and velum (V). The diagram
also illustrates the features motivated by the tract variables, namely, lip
aperture (LA), lip protrusion (PRO), jaw opening (JAW-OPEN), tongue
tip constriction degree (TTCD), tongue body constriction degree (TBCD),
tongue dorsum constriction degree (TDCD), and velum opening (VEL). 14
3.2 An illustration of the procedure to compute TTCD and TBCD for the fe-
male subject in MOCHA-TIMIT database. The TT locations are shown
for phonemes /t/, /d/, and /s/; similarly, the TB locations are shown
for /ch/ and /zh/. These phonemes being alveolar and post-alveolar, an
approximate location and shape of the palate can be estimated from the
location of the TT and TB sensors. The two straight lines are determined
manually so that all the TT and TB locations are below those lines. Sep-
arate lines are determined for other subjects’ data in a similar fashion. . 17
4.1 A communication system view of human speech communication. Talker's
speech production system consists of speech articulators such as lips, tongue,
velum etc. Outer ear (ear pinna), middle ear, and inner ear (cochlea) are
main components of listener’s auditory system [77]. . . . . . . . . . . . 20
viii
4.2 Cochlea and its characteristics: a. the coiled and uncoiled cochlea [49]
with the illustration of the frequency selectivity shown by damped sinusoid
of different frequencies. b. the relation between the natural frequency
axis and the frequency map on the basilar membrane (i.e., the auditory
frequency axis) over 0 to 8kHz. . . . . . . . . . . . . . . . . . . . . . . . 21
4.3 Articulator data from X-ray MicroBeam speech production database [75]
a. Pellets are placed on eight critical points of the articulators viewed
on the midsagittal plane of the subject, namely, upper lip (UL), lower lip
(LL), tongue tip (T1), tongue body (T2 and T3), tongue dorsum (T4),
mandibular incisors (MNI), and mandibular molars (MNM). The XY
co-ordinate trajectories of these pellets are recorded. b. The X and Y
co-ordinate values of eight pellets are shown when a male subject was
uttering ‘but special’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 A tree-structure illustration of the greedy approach for optimizing the fil-
terbank. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.5 Result of the filterbank optimization a. Filterbank corresponding to max-
imum MI at each level of the greedy optimization. b. The empirical
auditory filterbank and the optimal filterbank (i.e., the filterbank corre-
sponding to the maximum MI at level 20) c. The warping function be-
tween the frequency axis and the warped frequency axis obtained by the
empirical auditory filterbank and the optimal filterbank. . . . . . . . . . . 32
4.6 The warping functions corresponding to the optimal filterbanks for all 45
subjects (40 English + 3 Cantonese + 2 Georgian) considered from the
production databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.7 The ratio (in percentage) of the MI computed for the auditory filterbank
(Aud. FB) and the optimal filterbank (Opt. FB) for all 45 subjects (40
English + 3 Cantonese + 2 Georgian) considered from the production
databases. The average of all ratio values (in percentage) is 91.6% (blue
dashed line). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1 Illustrative example of inverse mapping: randomly chosen examples of the
test articulator trajectory (dash-dotted) and the corresponding estimated
trajectory for 14 articulatory positions. . . . . . . . . . . . . . . . . . . . 61
6.1 ρ⋆_ji / ρ_ji for all pairs of articulators for (a) female (b) male subjects in the
MOCHA database. i and j vary over articulatory variable index 1,...,14. 73
ix
6.2 (a)-(b): The average RMS Error (E) and Corr. Coef. (ρ) of inversion
accuracy using GSC [24] and modified GSC (MGSC) on the test set for
the female subject; (c)-(d): (a)-(b) repeated for the male subject. Error
bars indicate SD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
7.1 Bar diagram of the mean ρ obtained using IS-1, IS-2, and IS-3 for various
EMA features separately over all male and female subjects’ test utter-
ances. Errorbar indicates standard deviations of ρ across respective test
utterances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2 Illustrative examples of the estimates of the TV feature trajectories using
IS-1 and IS-3. These trajectories are randomly selected from the female
subject's test set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.3 Bar diagram of the mean ρ obtained using IS-1, IS-2, and IS-3 for various
TV features separately over all male and female subjects’ test utterances. 89
9.1 Illustration of the TV features in the midsagittal plane. . . . . . . . . . . 96
x
Abstract
This thesis focuses on exploring the role of speech production in automatic speech recognition from a communication system perspective. Specifically, I have developed a generalized smoothness criterion (GSC) for talker-independent acoustic-to-articulatory inversion, which estimates speech production/articulation features from the speech signal of any arbitrary talker. GSC requires parallel articulatory and acoustic data from a single subject only (the exemplar), and this exemplar need not be any of the talkers. Using both theoretical analysis and experimental evaluation, it is shown that the estimated articulatory features provide a recognition benefit when used as additional features in an automatic speech recognizer. As we require a single exemplar for the acoustic-to-articulatory inversion, we overcome the need for articulatory data from multiple subjects during inversion. Thus, we demonstrate a feasible way to utilize production-oriented features for speech recognition in a data-driven manner. Due to the concept of the exemplar, the production-oriented features and, hence, the speech recognition become exemplar-dependent. Preliminary recognition results with different talker-exemplar combinations show that the recognition benefit due to the estimated articulatory features is greater when the talker's and exemplar's speaking styles are matched, indicating that the proposed exemplar-dependent recognition approach has potential to explain the variability in recognition across human listeners.
xi
Chapter 1:
Introduction
Machine processing of speech signals is often inspired by human auditory processing.
The human auditory system can be interpreted as the receiver in the human speech
communication system, with the human vocal organ as the transmitter. From the principle
of maximum information transfer in a communication system, the design of the
receiver needs to be informed by the characteristics of the transmitter. Motivated by
this communication theoretic viewpoint, in this thesis work, we ask two main questions:
1) Is there any optimum relationship between the human auditory system (receiver)
and the human speech production system (transmitter)? 2) Can we design a computational
mechanism of processing speech informed by speech production knowledge
for better speech processing applications? Based on the research work carried out in
this thesis, we find an optimum information theoretic relationship between the human
speech production and auditory systems. This further enables us to design a computational
framework to derive production-oriented speech features. We demonstrate the
utility of these production-oriented speech features for automatic speech recognition.
Automatic speech recognition (ASR) is the task of phonetically (and lexically) tran-
scribing the spoken utterance from a continuous stream of acoustic speech signal. It
1
is well-known that the recognition accuracy improves when the recognizer is built with
the additional knowledge beyond what the acoustic data provide such as the expected
language patterns of the talker (i.e., language model) or how the speech is produced
by the talker (i.e., production knowledge) [44]. In the statistical pattern recognition
approach to ASR, a language model (LM) typically is specified in terms of the proba-
bility of a sequence of words in the talker’s language. Although at the word level the
knowledge provided by the LM is used to decode a meaningful sequence of words from
the speech signal, at the signal level, a probabilistic model using the features derived
from the acoustic signal acts as the fundamental element [often called acoustic model
(AM)] for the recognition of the sequence of sounds in a given speech signal. The more
discriminative the acoustic feature is for different sounds, the better is the acoustic
recognition, which in turn improves the overall ASR performance. Thus, the search for
better and representative acoustic features has always been an active topic of research
in the ASR community.
The acoustic features for most of the speech sounds are vulnerable to the change in
talker, environment, and background noise. The variability in the acoustic signal has
triggered researchers to develop methodologies for the adaptation of the acoustic model
to the acoustic environment under consideration. In a radically different viewpoint,
some researchers believe that the invariant representations of speech sounds may not be
available in the acoustic signal space. Rather, the acoustic signal is a highly redundant
carrier of what is called the speech code [38], which is related to the production of speech
sounds, i.e., the description of the articulatory movements that produce the acoustic
speech signal. According to articulatory phonology [2, 3], speech results from articu-
latory gestures, which can potentially offer an invariant representation of the speech
signal. In fact, according to many theories of speech perception [20, 38, 39], listeners
recognize the speech received from a talker using the underlying articulatory gestures.
2
However, it is not clear how a listener can derive articulatory gesture information
from what the listener receives, i.e., the acoustic speech signal (we note that
visible articulation can offer additional information, but for the purpose of
this work we restrict ourselves to audio speech recognition).
Kirchhoff et al. [34], in fact, estimated a set of linguistically-meaningful articulatory
features such as voicing, place and manner of articulation from the acoustic signal and
used them in addition to the acoustic features for achieving improved speech recogni-
tion. Improved speech recognition using the estimated articulatory features indicates
that the use of the knowledge about speech production can boost the recognition per-
formance by providing information that is not captured in the acoustic signal features.
Similar promising results were reported by King and Taylor [31, 32], who described an
approach in which phonetic features are estimated from the acoustics using an artificial
neural network (ANN). As the canonical voicing, place, and manner information are
known for different phonemes, these articulatory features are tagged for each phoneme
in the training data, which were, in turn, used to train an ANN to estimate the artic-
ulatory features from a given test acoustic signal. In a similar approach, Vikram et al.
used parallel acoustic and articulatory features obtained from an articulatory speech
synthesizer for training an ANN [45] and then used the trained ANN during recogni-
tion for estimating articulatory features from the acoustic features obtained from a test
subject's voice. However, the acoustic characteristics of synthesized speech and a
human subject's natural speech are different, and this could affect the estimates of the
articulatory features and, hence, the recognition accuracy.
There is an extensive body of work by Deng et al. [14, 15, 16], where articula-
tory configuration knowledge is incorporated in building a generative speech model like
HMM. McDermott et al. [44] described this set of work by Deng et al. in the fol-
lowing way: ‘This is a knowledge-oriented way of designing context-dependent states,
3
that explicitly allows for desynchronization of feature movements, within bounds. A
danger with the approach is that the state networks risk becoming very large if too
much leeway is allowed for feature de-synchronization’. Compared to HMMs, dynamic
Bayesian networks (DBNs) allow a more natural integration of production knowledge
in statistical pattern recognition [85]. Work by Livescu et al. [40] has reported an im-
provementinrecognition performanceusingDBNsfortheAurorataskunderbothclean
and noisy conditions. Several studies have also made use of of dynamic system model
for modeling articulation and acoustics [13]. A detailed summary of all these different
approaches of using articulatory configuration within the (statistical model) state net-
work can be found in [44]. In general, these generative model based approaches do not
use any real-world production data from human subjects, rather the knowledge about
speech production is derived from linguistic or rule-driven articulatory constraints.
In contrast to the linguistically defined articulatory features or abstract articulatory
knowledge based or rule-driven articulatory constraints based approaches for speech
recognition, in this work we focus on utilizing real articulatory movement data (or
features derived from them) recorded during speech production. Advances in speech
production measurement, such as real-time magnetic resonance imaging (rt-MRI)
[48] or electromagnetic articulography (EMA) [80], provide direct capture and character-
ization of the articulatory movements when a subject speaks. Such speech production
data from rt-MRI or EMA can be potentially used to learn the variability in the speech
production systems across subjects. EMA speech production data has been directly
used for recognition in addition to the acoustic speech signal [21] and has been shown
to outperform the recognition based on the acoustic speech signal only. However, the
availability of such direct speech articulation data during the test phase of a recognizer
is not feasible in practice and, hence, their direct use for recognition is not a realistic
4
approach. Frankel et al. [22] designed a speech recognizer by using talker’s EMA ar-
ticulatory traces estimated from the acoustic speech signal using an ANN and showed
improvement in the recognition performance compared to that using acoustic feature
alone. However, the raw articulatory position data is subject-specific because the size
and configurations of the speech articulators vary from person to person. Thus, it is
in general difficult to obtain a reliable estimate of the articulatory trajectories of any
unknown talker based on his/her speech signal without training the ANN using the
target talker’s articulatory data. Hence, such a recognition system using the estimated
articulatory trajectories is also not scalable in practice.
Estimation of the articulatory movements of a subject from the acoustic speech
signal of the same subject is known as acoustic-to-articulatory inversion [24]. To train
an inversion algorithm, in general, parallel acoustic and articulatory movement data
from the subject under consideration are required. But, in practice, a recognizer only has
access to the talker's voice and does not have direct access to the articulatory movement
of the talker. Thus, in the absence of the talker's articulatory data, it is not possible to
perform an inversion to recover the talker's articulatory movement. To alleviate the
need for the talker's articulatory data, we advantageously use subject-independent acoustic-
to-articulatory inversion [25], which requires parallel articulatory-acoustic data from a
single subject only, who can, in general, be different from the talker. We refer to this
subject as the ‘exemplar’ in this work. Further, we refer to the inversion as exemplar-
specific talker-independent (ESTI) inversion. This is because the inversion scheme uses
training data specific to one exemplar but needs to perform acoustic-to-articulatory
inversion independent of any talker. Using the ESTI inversion technique, we propose a
front-end articulatory feature estimation scheme in ASR. We show that by using the
estimated articulatory features in addition to the standard acoustic features, the speech
recognition accuracy can be improved compared to that obtained by acoustic features
5
alone. As the estimated articulatory features are exemplar specific, we find that these
estimated features and, hence, the recognized output may change depending on the
chosen exemplar even for the same talker’s speech. We begin with the description of
the proposed exemplar-specific speech recognition framework.
6
Chapter 2:
Exemplar-specific approach to
explore the role of speech
production in speech processing –
motivation and principles
We have provided a schematic diagram of the proposed exemplar-specific speech recog-
nition framework in Fig. 2.1. The principle of the proposed framework lies in using
a single exemplar’s speech production knowledge for estimating articulatory features
in the ASR from any talker’s speech. The exemplar represents a single subject, from
whom parallel speech acoustic and articulatory movement data are available. These parallel
acoustic-articulatory data provide the knowledge about the forward articulatory-to-
acoustic map of the exemplar, which is further used in exemplar-specific and talker-
independent acoustic-to-articulatory inversion for estimating articulatory features from
any arbitrary talker’s speech. The estimated articulatory features are used to augment
7
the acoustic features extracted from the talker’s speech. The augmented feature vector
is finally used to build a statistical model using a Hidden Markov Model (HMM) for speech
recognition. We refer to this acoustic and articulatory feature based statistical model
as the acoustic-articulatory model.
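As a hedged illustration of this augmentation step, the sketch below simply concatenates frame-synchronous acoustic feature vectors with the estimated articulatory feature vectors before HMM training. MFCCs are used here only as a stand-in for the standard acoustic features, and the function name and array layout are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def augment_features(acoustic, articulatory_est):
    """Concatenate frame-synchronous acoustic and estimated articulatory features.

    acoustic         : (T, D_a) array, e.g. MFCCs computed every 10 ms.
    articulatory_est : (T, D_p) array of exemplar-specific estimated articulatory
                       features (EMA or TV) for the same frames.
    Returns a (T, D_a + D_p) array used to train the acoustic-articulatory model.
    """
    assert acoustic.shape[0] == articulatory_est.shape[0], "frame rates must match"
    return np.hstack([acoustic, articulatory_est])
```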
The acoustic-to-articulatory inversion is a critical component in the proposed
exemplar-specific recognition framework. This is because we expect that the estimated
articulatory features, when used along with the acoustic features, will provide additional
discriminatory power among different phonological units, which is impossible without
an accurate estimate of the articulatory features. We also observed that the accuracy of
the estimate depends on the type of quantitative representations adopted for both the
articulatory and the acoustic spaces. This is because the mapping between the acoustic and
articulatory spaces is non-linear [74] and also suffers from non-uniqueness [53]. Also, the
articulatory and acoustic features will have an inherent statistical variation even for a
specific exemplar. Therefore, the acoustic-to-articulatory inversion should be robust to
such statistical variations. The articulator-to-acoustic mapping is built using the paral-
lel articulatory and acoustic data (shown within circles in Fig. 2.1), which is unique to a
specific exemplar. However, in the exemplar-specific recognition framework, an exemplar
needs to estimate articulatory features for the acoustic signal, which, in general, can
come from any talker. Thus, the acoustic-to-articulatory inversion should be robust to
such inter-talker acoustic variations too.
Note that the design of ASR is nothing but the design of a decoder or receiver to
decode the message embedded in the speech signal transmitted by the talker’s speech
production system. The principle of the proposed exemplar-specific approach is to use
the knowledge from the speech production system of an exemplary subject in designing
the ASR. This is equivalent to using a copy of the transmitter inside the decoder. In
fact, use of a copy of the encoder in the decoder is common in many communication
8
systems such as differential pulse code modulation (DPCM) or adaptive DPCM (AD-
PCM). Motivation for using the exemplar's transmitter in designing the receiver (ASR) also
came from the observation that the human speech perception system is the best known re-
ceiver in human speech communication and a human listener has access to his own
articulatory-to-acoustic map (but not that of the talker), which could be useful in deriving
some production-oriented representations from the speech signal. Such a proposition is con-
sistent with the findings by Wilson et al., who showed that the motor areas involved in
speech production are activated when a subject listens to speech [79].
The use of production knowledge in speech recognition using the proposed exemplar-
specific framework is significantly different from the existing production-oriented speech
recognition approaches [44] in several ways: 1) The proposed recognition framework is
exemplar-specific. It provides a production-knowledge driven account of the variability
in speech recognition across exemplars. 2) The proposed framework is built on direct
speech production data and is scalable for automatic speech recognition to any speech
corpus; this relies on an exemplary subject's production model for whom we have a good
characterization of their articulatory details. We need to acquire new articulatory
data only when a specific exemplar perspective needs to be reflected in the recognition.
In this thesis work we have implemented the exemplar-specific speech recognition
framework for three different exemplars for whom we have adequate speech production
data. One can easily extend the proposed recognition framework for any arbitrary
exemplar given that parallel acoustic and articulatory data are available for the target
exemplar. The parallel acoustic and articulatory data of an exemplar are required to
learn the mapping between the acoustic and articulatory spaces for the exemplar. In the
next chapter we describe the details of the acoustic and articulatory data for the
three exemplars used in our experiments.
9
Figure 2.1: An illustration of the proposed exemplar-specific framework: each exemplar
uses his/her speech production knowledge to estimate the exemplar-specific articulatory
features for the speech utterance received from a talker and then, uses both acoustic
features from the received speech signal and the estimated articulatory features to rec-
ognize the utterance. As the articulatory features are exemplar-specific, the acoustic-
articulatory model is also exemplar-specific.
10
Chapter 3:
Articulatory Databases and
features
Articulatory databases are required for this thesis work to gain knowledge about articulatory movement during speech production. At the same time, the representation of the articulatory movement is crucial for better analysis and interpretation. Below we describe both the articulatory databases and the features.
3.1 Databases
Of the three subjects used in our experiments, two (one male and one female) corre-
spond to the subjects available in the Multichannel Articulatory (MOCHA) database
[80]. They will be denoted by MOCHA M and MOCHA F in this thesis unless otherwise
stated. The MOCHA database provides the acoustic and the corresponding articula-
tory EMA data for both subjects and, hence, is appropriate for learning the acoustic-
to-articulatory mapping in our proposed recognition framework. The acoustic and ar-
ticulatory data were collected while each talker read a set of 460 phonetically-diverse
11
British TIMIT sentences. The articulatory data consist of X and Y coordinates of nine
receiver sensor coils attached to nine points along the midsagittal plane, namely the
lower incisor or jaw (li x, li y), upper lip (ul x, ul y), lower lip (ll x, ll y), tongue tip
(tt x, tt y), tongue body (tb x, tb y), tongue dorsum (td x, td y), velum (v x, v y), up-
per incisor (ui x, ui y), and bridge of the nose (bn x, bn y). The sensor locations on the
midsagittal plane are illustrated in Fig. 3.1. The last two are used as reference coils.
Thus, the first seven coils provide 14 channels of articulatory position information. The
position of each coil was recorded at 500 Hz with 16-bit precision. The corresponding
speech was collected at 16 kHz. The articulatory data is pre-processed following the
steps outlined in [24] so that the processed data has minimal noise due to measurement
error or error due to head motion.
The third subject used in our experiment is a male native talker of American English
from another articulatory database [66], collected at the University of Southern California
(USC) as a part of the MURI project. We denote this subject by MURI M. In contrast to the
read speech in the MOCHA database, the articulatory data in [66] was collected when the
subject was engaged in a spontaneous-speech dialogue on different topics spanning over
14 sessions, each lasting about 8 minutes. The EMA technique was used to measure the
movements (X and Y coordinates) of the following articulators in the midsagittal plane
- the jaw (li x, li y), lower lip (ll x, ll y), upper lip (ul x, ul y), tongue tip (tt x, tt y),
tongue body (tb x, tb y), and tongue dorsum (td x, td y). Three reference sensors were
also placed on the bridge of the nose (bn x, bn y), and at the back of the left and right ears.
The acoustic signal and articulatory traces were simultaneously recorded at sampling
rates of 16 kHz and 200 Hz, respectively. Thus, six articulatory sensors provide 12
channels of articulatory position information. The articulatory data was pre-processed
to remove errors due to head movement and to reduce measurement errors.
12
3.2 Features
The articulatory features are computed from the samples of the articulatory trajec-
tories available in the EMA data. The EMA data provides trajectories of a few critical
articulators' locations and, hence, does not provide a complete description of the vocal
tract configuration. From this partial description of the vocal tract configuration, we
compute two types of articulatory features, namely, EMA and tract variables (TV), as
described below.
3.2.1 EMA
This set of features is nothing but the raw articulatory data available through the EMA tech-
nique. As described above, the EMA feature vector is 14 dimensional for the first two
subjects and 12 dimensional for the third subject considered for modeling in our exper-
iment. Although EMA features are raw position values of a few critical articulators,
we would like to investigate the representation capability of the EMA features for es-
timating the articulatory trajectories and also for exemplar-specific speech recognition
task.
3.2.2 Tract variables (TV) features
This set of features is motivated by the concept of tract variables in the articulatory
phonology [2, 3]. According to articulatory phonology, speech is represented using an
ensemble of gestures, which are defined as the dynamical control regimes for constriction
actions in eight different constriction tract variables, consisting of five constriction degree
variables, lip aperture (LA), tongue body (TBCD), tongue tip (TTCD), velum (VEL),
glottis (GLO), and three constriction location variables, lip protrusion (PRO), tongue
tip (TTCL), and tongue body (TBCL).
13
Figure 3.1: An illustrative diagram showing the sensors in EMA data collection on
the midsagittal plane. The sensors are indicated by circles at upper lip (UL), lower
lip (LL), upper incisor (UI), lower incisor (LI), tongue tip (TT), tongue body (TB),
tongue dorsum (TD), and velum (V). The diagram also illustrates the features motivated
by the tract variables, namely, lip aperture (LA), lip protrusion (PRO), jaw opening
(JAW-OPEN), tongue tip constriction degree (TTCD), tongue body constriction degree
(TBCD), tongue dorsum constriction degree (TDCD), and velum opening (VEL).
14
Based on the available EMA data we estimate a subset of the above tract variables
and use them as articulatory features. Note that the nature of the EMA data is not
identical across all subjects because the first two subjects are chosen from the MOCHA-
TIMIT database and the third one is chosen from a different database. Thus, based on
the availability of EMA sensors, we estimate different TV features as follows:
LA = |ul_y − ll_y|     (3.1)

PRO = (ul_x + ll_x) / 2     (3.2)

JAW-OPEN = |ui_y − li_y| when ui_y is available; |bn_y − li_y| when bn_y is available     (3.3)

TDCD = |bn_y − td_y| (when bn_y is available)     (3.4)

VEL = [ v_x  v_y ] V_e     (3.5)

where V_e is a 2×1 unit norm eigenvector corresponding to the highest eigenvalue of the correlation matrix of the random vector [ v_x  v_y ]^T. [·]^T is the transpose operator. The correlation matrix is estimated using the available X and Y co-ordinate values of velum. Since the palate information in the EMA databases was not available, we manually determine straight lines representing the palate to compute TTCD and TBCD. This is illustrated in Fig. 3.2. These lines are determined separately for each subject in EMA databases. TTCD and TBCD are computed by the perpendicular distance from TT and TB positions to the respective lines as follows:
TTCD = |a_tt·tt_x + b_tt·tt_y + c_tt| / √(a_tt² + b_tt²)     (3.6)

TBCD = |a_tb·tb_x + b_tb·tb_y + c_tb| / √(a_tb² + b_tb²)     (3.7)

where a_tt·x + b_tt·y + c_tt = 0 and a_tb·x + b_tb·y + c_tb = 0 are the equations of the straight lines, as shown in Fig. 3.2.
A visual illustration of the TV features is shown in Fig. 3.1. The TV features need
not be identical to the constriction variables as in articulatory phonology. For example,
one can optimize on the slope and intercept of the straight lines (in Fig. 3.2) without
manually choosing them. The focus here is to obtain some representative features and
not to exactly replicate TV as defined in articulatory phonology.
Since the EMA sensors in the case of the third subject are not identical to those of
the first two subjects, we use LA, VEL, JAW-OPEN, TTCD, TBCD, and PRO for the
first two subjects and LA, JAW-OPEN, TTCD, TBCD, TDCD, and PRO for the third
subject.
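To make the above computations concrete, the following is a minimal sketch of how the TV features could be derived from per-frame EMA coordinates, assuming the trajectories are stored in a dictionary of NumPy arrays keyed by sensor channel (ul_y, ll_y, v_x, ...) and that the palate-line coefficients (a, b, c) have already been chosen manually as described; the function name and data layout are illustrative, not taken from the thesis.

```python
import numpy as np

def tv_features(ema, palate_tt, palate_tb):
    """Compute a subset of the TV features (Eqs. 3.1-3.7) from EMA trajectories.

    ema       : dict of 1-D arrays (one value per frame), e.g. ema['ul_y'], ema['v_x'].
    palate_tt : (a, b, c) of the manually chosen palate line a*x + b*y + c = 0 for TT.
    palate_tb : (a, b, c) of the corresponding line for TB.
    """
    feats = {}
    feats['LA'] = np.abs(ema['ul_y'] - ema['ll_y'])                   # Eq. (3.1)
    feats['PRO'] = (ema['ul_x'] + ema['ll_x']) / 2.0                  # Eq. (3.2)
    ref_y = ema['ui_y'] if 'ui_y' in ema else ema['bn_y']             # Eq. (3.3)
    feats['JAW-OPEN'] = np.abs(ref_y - ema['li_y'])
    if 'bn_y' in ema and 'td_y' in ema:                               # Eq. (3.4)
        feats['TDCD'] = np.abs(ema['bn_y'] - ema['td_y'])

    # VEL (Eq. 3.5): project (v_x, v_y) onto the principal eigenvector of its
    # correlation matrix, when velum data are available.
    if 'v_x' in ema:
        V = np.vstack([ema['v_x'], ema['v_y']])                       # 2 x T
        w, U = np.linalg.eigh(np.corrcoef(V))
        feats['VEL'] = V.T @ U[:, np.argmax(w)]

    # TTCD / TBCD (Eqs. 3.6-3.7): perpendicular distance to the palate lines.
    for name, (a, b, c), x, y in [('TTCD', palate_tt, ema['tt_x'], ema['tt_y']),
                                  ('TBCD', palate_tb, ema['tb_x'], ema['tb_y'])]:
        feats[name] = np.abs(a * x + b * y + c) / np.sqrt(a ** 2 + b ** 2)
    return feats
```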
16
Figure 3.2: An illustration of the procedure to compute TTCD and TBCD for the female
subject in MOCHA-TIMIT database. The TT locations are shown for phonemes /t/,
/d/, and /s/; similarly, the TB locations are shown for /ch/ and /zh/. These phonemes
being alveolar and post-alveolar, an approximate location and shape of the palate can
be estimated from the location of the TT and TB sensors. The two straight lines are
determined manually so that all the TT and TB locations are below those lines. Separate
lines are determined for other subjects’ data in a similar fashion.
17
Chapter 4:
Optimal filterbank for computing
acoustic features - an information
theoretic link between production
and perception
4.1 Introduction
Speech is one of the important means for communication among humans. In human
speech communication, a talker produces speech signal, which contains the message of
interest. A listener receives and processes the speech signal to retrieve the message
of the talker. Hence, in the standard communication systems terminology, the talker
plays the role of a transmitter and the listener plays the role of a receiver to establish
the interpersonal communication. It is well known that human speech communication
is robust to different speaker and environmental conditions and channel variabilities.
18
Such a robust communication system suggests that the human speech receiver and
transmitteraredesignedtoachieveoptimaltransferofinformationfromthetransmitter
to the receiver.
Fig. 4.1 illustrates a communication system view of the human speech exchange.
The talker uses his/her speech production system to convert the message of interest into
the speech signal, which gets transmitted. The speech is produced by choreographed
and coordinated movements of several articulators, including the glottis, tongue, jaw,
and lips of the talker’s speech production system. According to articulatory phonology
[4], speech can be decomposed into basic phonological units called articulatory gestures.
Thus, articulatory gestures could be used as an equivalent representation of the message
that the talker (transmitter) intends to transmit to the listener (receiver). The speech
signal transmitted by the talker, in general, gets distorted by environmental noises
and other channel factors as the acoustic speech wave travels in the air (channel).
This distorted speech signal is received by the listener’s ear. The ear works as the
signal processing unit in the listener’s auditory system. As the speech wave passes
through various segments of the ear canal, namely outer, middle, and the inner ear,
the speech signal is converted to electrical impulses which are carried by the auditory
nerve and, finally, get decoded in the brain to infer talker’s message. For an efficient
communication between talker and listener, we hypothesize that the components in the
listener’s auditory system (receiver) should be designed such that maximal information
about the talker’s message (transmitter) can be obtained from the auditory system
output. In other words, the uncertainty about the talker’s message should be minimum
at the output of the listener’s auditory system.
Among the various components in the human speech communication system’s re-
ceiver, the cochlea (Fig. 4.1) is known to be the time-frequency analyzer of the speech
signal. The structure and characteristics of the cochlea in a typical listener’s auditory
19
Figure 4.1: A communication system view of human speech communication. Talker's
speech production system consists of speech articulators such as lips, tongue, velum etc.
Outer ear (ear pinna), middle ear, and inner ear (cochlea) are main components of
listener’s auditory system [77].
system are illustrated in Fig. 4.2. Fig. 4.2(a) shows the coiled and uncoiled cochlea.
The basilar membrane inside the cochlea performs a running spectral analysis on the
incoming speech signal [29, pp. 51–52]. This process can be conceptualized as a bank
of tonotopically-organized cochlear filters operating on the speech signal. It should be
noted that the physiological frequency analysis is different from the standard Fourier
decomposition of a signal into its frequency components. A key difference is that the
auditory system’s frequency response is not linear [29, pp. 51–52]. The relationship
between the center frequency of the analysis filters and their location along the basilar
membrane is approximately logarithmic in nature [29, pp. 51–52]. Also, the bandwidth
of the filter is a function of its center frequency [86]. The higher the center frequency,
the wider the bandwidth. A depiction of these filters is shown in Fig. 4.2(b) over a
frequency range 0 to 8kHz; these ideal rectangular filters are drawn using the center
20
Figure 4.2: Cochlea and its characteristics: a. the coiled and uncoiled cochlea [49]
with the illustration of the frequency selectivity shown by damped sinusoid of different
frequencies. b. the relation between the natural frequency axis and the frequency map
on the basilar membrane (i.e., the auditory frequency axis) over 0 to 8kHz.
frequency and the bandwidth data from Zwicker et al. [86]. The bandwidths of the
brick-shaped filters represent the equivalent rectangular bandwidths (ERB) [86] of the
cochlear filters at each chosen center frequency along the frequency axis. This non-
uniform filterbank in the natural frequency axis is referred to as auditory filterbank.
However, the auditory filterbank appears uniform when plotted in the auditory fre-
quency scale (the frequency scale on the basilar membrane). The relation between the
natural and the auditory frequency axis is shown by the red dashed curve in Fig. 4.2(b).
The relation is approximately logarithmic in nature.
The auditory filterbank is a critical component in the auditory system since it con-
verts the incoming speech signal into different frequency components which are finally
converted to electrical impulses by hair cells. The output of the auditory filterbank is
used to decode the talker’s message. In this work, we focus on the optimality of the
filterbank used in the receiver (listener’s ear). Our goal is to design a filterbank that
achieves efficient speech communication in the sense that the output of the designed
21
filterbank provides the least uncertainty about the talker’s message (or equivalently
talker’s articulatory gestures). Finally, we would like to compare the canonical empir-
ically known auditory filterbank with the optimally designed filterbank to check if the
auditory filterbank satisfies the minimum uncertainty principle. We follow the defini-
tion of uncertainty from the mathematical theory of communication [8, 64]. From an
information theoretic viewpoint, it turns out that minimizing uncertainty is identical
to maximizing mutual information (MI) between two corresponding random quantities.
Therefore, our goal is to design a filterbank such that the filterbank output at the
receiver provides maximum MI with the articulatory gestures in the transmitter; we
assume that the articulatory gestures can be used as a representation for encoding the
talker’s message.
For quantitative description of the articulatory gestures, we need to use a database
which provides recordings of the movements of the speech articulators involved in speech
production (such as tongue, jaw, lips etc.) when the subject talks. We also need the
recording of the speech signal in a concurrent fashion so that the filterbank output
can be computed using the recorded speech signal. We describe such a production
database in Section 4.2. Since articulatory gestures can vary across languages, for
the present investigation, we have used production databases collected from subjects
in three different languages - namely, English, Cantonese, and Georgian. In Sections
4.3 and 4.4, we explain the features or the quantitative representations used for the
articulatory gestures and the filterbank output. Such quantitative representations are
essential for computing MI between them. In Section 4.5, we describe a greedy algorithm
for obtaining the optimal filterbank in a language and subject specific manner so that
the MI between articulatory gestures and the filterbank output is maximized. We find
that the optimal filterbank varies across subjects but the variation is not drastic; rather,
the optimal filterbanks are similar to the empirically established auditory filterbank. A
22
communication theoretic explanation for the optimality of the auditory-like filterbank
for speech processing is presented in Section 4.6. Conclusions are drawn in Section 4.7.
4.2 Dataset
We use the X-ray MicroBeam speech production database collected at the University
of Wisconsin [75] for the experiments related to the subjects in the English language.
This corpus provides temporal trajectories of the movement of the upper lip (UL), lower
lip (LL), tongue tip (T1), tongue body (T2 and T3), tongue dorsum (T4), mandibular
incisors (MNI), and mandibular molars (MNM) of subjects obtained using x-ray mi-
crobeam technique [75] during their speech. We have chosen 40 subjects (17 males, 23
females) from this database for our experiment among the 47 available subjects (we exclude
some subjects not having sufficient amount of data required for our experiments). The
approximate locations of the articulators are shown in Fig. 4.3(a) viewed on the mid-
sagittal plane of a subject. The locations of each of these eight articulators at any given
time are represented by the X and Y co-ordinates in the midsagittal plane, and the
articulator trajectories are sampled at 145.65 Hz. For our analysis, we downsampled
the microbeam pellet data to a rate of 100 samples/second. A few co-ordinate values
of some pellets at some time points are missing. If any pellet data at any time point is
missing, we discard all other pellet data at those time points.
We also include data from two languages distinct from English, namely Cantonese
[82] and Georgian [26], to explore the generalization of the proposed experiments. Un-
like English, Cantonese is a tonal language. The realizations of phonological units in
Cantonese and Georgian are different from those in English and this implies that the
languages employ different articulatory gestures and/or different combinations of ges-
tures. Weexpectthatthekinematicsinthoselanguageswillreflectthegesturalpatterns
specific to the respective language. Thus, as a secondary data source for our study, we
23
Figure 4.3: Articulator data from X-ray MicroBeam speech production database [75] a.
Pellets are placed on eight critical points of the articulators viewed on the midsagittal
plane of the subject, namely, upper lip (UL), lower lip (LL), tongue tip (T1), tongue
body (T2 and T3), tongue dorsum (T4), mandibular incisors (MNI), and mandibular
molars (MNM). The XY co-ordinate trajectories of these pellets are recorded. b. The
X and Y co-ordinate values of eight pellets are shown when a male subject was uttering
‘but special’.
have used the articulatory movements of three (two male and one female) Cantonese
and two (both male) Georgian subjects recorded at 500Hz using the EMMA [52] tech-
nique, while they spoke. The recorded articulators are UL, LL, jaw (JAW), tongue tip
(TT), tongue body (TB), and tongue dorsum (TD). Similar to the articulatory data
from English subjects, we pre-processed the EMMA data from Cantonese and Georgian
subjects to achieve a frame rate of 100 Hz.
Note that, in parallel to the articulatory movements in the corpora described above,
the speech signal is recorded at 21739 Hz for English subjects and 20000 Hz for Cantonese
and Georgian subjects. We downsampled the speech signals to 16 kHz for our analysis.
Using these parallel acoustic and articulatory data from Cantonese and Georgian, we
will be able to examine our communication theoretic hypothesis in a scenario where the
24
acoustic properties of sounds and the respective articulatory gestures are different from
those in English. In the following sections, we describe how the articulatory corpora are
used for specifying the articulatory representations during speech production and how
the signal representation at the output of a generic filterbank is specified.
4.3 Articulatory gesture representation
Speech gestures can be modeled as the formation and release of constrictions [61] by
particular constricting organs of the vocal tract (lips, tongue tip, etc.). The unfolding
of these constrictions over time causes motion in the vocal tract articulators, whose
positions are tracked using markers in the x-ray and EMMA data. For example, Fig.
4.3(b) illustrates the X and Y co-ordinate trajectories of eight articulators corresponding
to an English male subject's utterance of ‘but special’. We make a generic assumption
that the trajectories of UL, LL, T1, T2, T3, T4, MNI, and MNM provide information
about critical articulatory gestures involved in producing various speech sounds [5].
For example, in Fig. 4.3(b), the Y co-ordinates of UL and LL decrease and increase,
respectively, to create the lip closure gesture while producing the sound /b/ (at around
0.2 sec) in the word ‘but’ (This is indicated by ↓ and ↑ in Fig. 4.3(b)). The tongue tip
goes up to the palate and creates constriction for producing /t/ (at around 0.53 sec);
this gesture is indicated by the peak in the Y co-ordinate of tongue tip T1 (This is
indicated by ↑ in Fig. 4.3(b)). We use all the measured movement co-ordinate values of
8 sensors and construct a 16-dimensional vector (Y) as a representation of articulatory
gestures every 10 msec.
In a similar fashion, the co-ordinate values of the available articulators of subjects
in Cantonese and Georgian languages are used to construct articulatory position vector
(Y) as a representation of articulatory gestures.
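As a concrete, minimal sketch (under assumed names and data layout, not taken from the thesis) of assembling this per-frame articulatory representation: the pellet trajectories are assumed to be already downsampled to 100 Hz, so stacking the X and Y channels of the eight pellets gives one 16-dimensional vector Y every 10 ms, with frames containing missing pellet samples discarded as described in Section 4.2.

```python
import numpy as np

# Hypothetical layout: one (T, 2) array of X/Y samples per pellet, already at 100 Hz.
PELLETS = ['UL', 'LL', 'T1', 'T2', 'T3', 'T4', 'MNI', 'MNM']

def articulatory_vectors(pellet_xy):
    """Stack the X/Y trajectories of the 8 pellets into 16-dim vectors Y_j (one per
    10 ms frame), discarding frames where any pellet sample is missing (NaN)."""
    Y = np.hstack([pellet_xy[name] for name in PELLETS])   # (T, 16)
    valid = ~np.isnan(Y).any(axis=1)                       # drop frames with missing data
    return Y[valid]
```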
25
4.4 Representation of the filterbank output
The filterbank comprised of brick-shaped multi-channel bandpass filters can be specified
by the upper and lower cut-off frequencies of each bandpass filter. Note that the upper
cut-off frequency of a bandpass filter is identical to the lower cut-off frequency of the
next bandpass filter in the filterbank (Fig. 4.2(b)). Thus, to specify a filterbank with L
filters it is sufficient to know the set of lower cut-off frequencies only; let us denote the
set of cut-off frequencies, or equivalently the filterbank, by B_L.
In the auditory system, the output of the cochlear filterbank gets converted to a series
of electrical impulses; however, a complete mechanism for this conversion is not clearly
understood. But, it has been reported that the rate of the pulses varies depending on
the intensity of the output of the filters [6, 50]. Hence, we use the energy of each individual
filter's output as a representation of the filterbank output. We process the speech signal
using each filter and compute the logarithm of the energies at the output of each filter
in the filterbank over a short duration every 10 ms. Let X = {x(n) : 0 ≤ n ≤ N − 1} be
the samples of a segment of the speech signal (over a short duration) at the sampling
frequency F_s. In our experiments F_s = 16 kHz and N/F_s = 0.02 sec. The energy of the
output of the k-th filter is given by

S_k = log Σ_{l=η^L_{k−1}}^{η^L_k} | Σ_{n=0}^{N−1} x[n] exp(−j 2π l n / N_F) |² ,   k = 1, ..., L     (4.1)

where N_F is the order of the Discrete Fourier transform (DFT) for computing the
spectrum of the signal x[n]. (F_s/N_F) η^L_{k−1} and (F_s/N_F) η^L_k are the lower and upper cut-off
frequencies of the k-th filter in the filterbank, which comprises L band-pass filters. Note
that B_L = { η^L_0, η^L_1, ..., η^L_L }, where η^L_0 = 0 and η^L_L = N_F/2 − 1. We construct a vector
by using S_k, k = 1, ..., L and call it a ‘feature vector’ X_{B_L} representing the output of
the filterbank B_L, i.e., X_{B_L} = [S_1, ..., S_L]^T, where [·]^T is the transpose operator. We
compute the feature vectors on the speech signal concurrently recorded with the artic-
ulator tracking. By computing these features every 10 ms for a chosen filterbank B_L,
we obtain a sequence of feature vectors as an equivalent representation of the filterbank
output over time. Depending on the chosen filterbank, the output of the filterbank
will vary and hence, the amount of information that this filterbank output can provide
about articulatory gestures will also vary.
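For concreteness, here is a minimal sketch of the log filterbank-energy computation of Eq. (4.1), assuming a 20 ms frame sampled at 16 kHz and a filterbank specified, as in the text, by its DFT-bin cut-off indices η_0 < η_1 < ... < η_L; the function name, the FFT order, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def log_filterbank_energies(x, cutoffs, n_fft=512):
    """Eq. (4.1): log energy at the output of each brick-shaped band-pass filter.

    x       : 1-D array holding one 20 ms speech frame (e.g. 320 samples at 16 kHz).
    cutoffs : DFT-bin indices [eta_0, ..., eta_L], eta_0 = 0, eta_L = n_fft // 2 - 1,
              defining L rectangular bands.
    """
    spectrum = np.abs(np.fft.rfft(x, n=n_fft)) ** 2            # |DFT|^2, bins 0..n_fft/2
    return np.array([np.log(np.sum(spectrum[cutoffs[k - 1]:cutoffs[k] + 1]))
                     for k in range(1, len(cutoffs))])         # S_k, k = 1..L

# Sketch of usage: the feature vector X_{B_L} computed every 10 ms over a longer signal.
# frame = speech[i:i + 320]; feature_vector = log_filterbank_energies(frame, eta)
```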
4.5 Filterbank optimization
As we have quantitatively defined the description of the talker's articulatory gestures as
well as the output of a generic filterbank at the receiver, we can now compute the amount
of information that the filterbank output provides about the articulatory gestures. We
use the mutual information I(X_{B_L}; Y) between two random variables for this purpose
[8, pp. 12–49]. MI measures how much information a random variable can provide about
another random variable (see Appendix). In our case, we treat X_{B_L} and Y as random
quantities because the realizations of the filterbank output, X_{B_L}, and the articulatory
position, Y, are not identical when a subject utters the same sound at different times
due to differences in a variety of linguistic, contextual, and environmental factors.
For an efficient communication between the talker and the listener, the filterbank
at the receiver (listener) needs to be selected in such a way that its output provides
maximum information regarding talker’s articulatory gestures, i.e.,
B_L^⋆ = argmax_{B_L} I(X_{B_L}; Y)     (4.2)

It is easy to show that maximizing I(X_{B_L}; Y) is equivalent to minimizing H(Y | X_{B_L}),
i.e., the conditional uncertainty of articulatory gestures (Y) given the filterbank output
X_{B_L} (see Appendix for details).
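The MI estimator itself is not spelled out in this excerpt; as one hedged illustration, the sketch below evaluates I(X_{B_L}; Y) from paired feature and articulatory vectors under a joint-Gaussian assumption, for which the MI reduces to a ratio of covariance determinants. This is an assumption made for illustration, not necessarily the estimator used in the thesis.

```python
import numpy as np

def gaussian_mi(X, Y, eps=1e-9):
    """Estimate I(X; Y) in nats assuming (X, Y) is jointly Gaussian:
    I = 0.5 * [ log det(C_X) + log det(C_Y) - log det(C_XY) ].

    X : (T, L) filterbank feature vectors X_{B_L}, one row per 10 ms frame.
    Y : (T, d) articulatory position vectors for the same frames.
    """
    Z = np.hstack([X, Y])
    d_x = X.shape[1]
    C = np.cov(Z, rowvar=False) + eps * np.eye(Z.shape[1])    # regularized joint covariance
    _, logdet_x = np.linalg.slogdet(C[:d_x, :d_x])
    _, logdet_y = np.linalg.slogdet(C[d_x:, d_x:])
    _, logdet_z = np.linalg.slogdet(C)
    return 0.5 * (logdet_x + logdet_y - logdet_z)
```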
In our production database (Section 4.2), each subject can be treated as a talker.
We optimize the filterbank for each subject in the production database, i.e., we design
the filterbank at the receiver for achieving efficient communication with each talker
separately in the database. Performing subject-specific optimization provides us the
opportunity to analyze the variability in optimal filterbank structure across talkers. For
our experiment, we consider speech signals sampled at F_s = 16 kHz, i.e., F_s/2 = 8 kHz. There
are 20 critical bands over 0 to 8 kHz as given by the critical bandwidth data [86]. Hence,
for the filterbank optimization, we have chosen L = 20.
The optimization in Eq. (4.2) does not have a closed form solution, rather it is a
combinatorial problem (and, hence, computationally prohibitive) becausethe optimiza-
tion in Eq. (4.2) in equivalent to finding the cut-off frequencies
η
l
k
L
k=0
from the
N
F
2
frequency points such that 0 = η
L
0
< η
L
1
<···η
L
L−1
< η
L
L
=
N
F
2
−1. In the absence of
any closed-form optimization due to analytical intractability of the objective function
[in Eq. (4.2)], we adapt a greedy algorithm where the optimal filterbank is obtained
usingaseriesofdecompositionusingalow-pass andhigh-passfiltersforeveryfrequency
band under consideration. This is similar to the dyadic filterbank analysis in wavelets
[71]. A tree-structured illustration of this greedy approach to filterbank optimization is
shown in Fig. 4.4.
Figure 4.4: A tree-structured illustration of the greedy approach for optimizing the filterbank.

Level 1 of the tree corresponds to the root node, and there are $(l-1)$ nodes at level $l$ ($l > 1$). Note that the root node corresponds to an all-pass filter, i.e., a filter of constant magnitude from 0 to 8 kHz. A node at level $l$ corresponds to a filterbank having $l$ band-pass filters. These bands at level $l$ are determined by a low-pass and high-pass decomposition of each band of a filterbank chosen from the previous level (level $l-1$) using the MI criterion. There are $(l-1)$ bands in the filterbank chosen from the previous level $(l-1)$; hence, there are $(l-1)$ possible decompositions [or $(l-1)$ possible filterbanks or nodes] at level $l$ of the tree. Let us denote the $(l-1)$ filterbanks at level $l$ by $B^l_k$, $k = 1,\dots,l-1$. For each $B^l_k$, let $M^l_k \triangleq I(X_{B^l_k}; Y)$. The filterbank (or node) at level $l-1$ which yields the maximum MI among $M^{l-1}_k$, $k = 1,\dots,l-2$, is used for further decomposition at level $l$. Nodes corresponding to the maximum MI are indicated in Fig. 4.4 by circles around the maximum $M^l_k$ at each level. Since our goal is to find the best 20-band filterbank, we run this optimization until the algorithm finds the
filterbank with the maximum MI at level 20 of the tree structure. The filterbank with the maximum MI at level 20 is reported as the optimal filterbank. This approach to filterbank optimization is greedy because at every successive level of the tree structure we select the filterbank which yields the maximum MI. In general, the resulting filterbank can be sub-optimal because of the greedy nature of the optimization. In the absence of any other time-efficient algorithm (apart from a full search) for obtaining the optimal solution to the optimization problem [Eq. (4.2)], we assume that the solution obtained by the proposed greedy algorithm is sufficient for our conclusions and interpretations.
For each subject in the production database, let us suppose that a total of $T$ realizations of articulatory position vectors (Section 4.3) and the corresponding short-time speech segments (20 ms) [Section 4.4] are denoted by $Y_j$ and $X_j$, $j = 1,\dots,T$, respectively. Note that, given a filterbank $B^l_k$, we can compute the realization $X^j_{B^l_k}$ from $X_j$, $j = 1,\dots,T$, using the method outlined in Section 4.4. Given $\{Y_j, X_j;\ j = 1,\dots,T\}$, the algorithm for filterbank optimization is described in Algorithm 1.
Algorithm 1: Greedy tree-structured filterbank optimization
1: $\eta^1_0 = 0$, $\eta^1_1 = \frac{N_F}{2}-1$, $l = 2$
2: $\eta^l_0 = 0$, $\eta^l_l = \frac{N_F}{2}-1$, and $\eta^l_1 = \frac{\eta^{l-1}_0 + \eta^{l-1}_{l-1}}{2}$
3: $B^l_1 = \{\eta^l_0, \dots, \eta^l_l\}$; compute $X^j_{B^l_1}$ from $X_j$, $j = 1,\dots,T$
4: Compute $M^l_1 = I(X_{B^l_1}; Y)$ using $\{X^j_{B^l_1}, Y_j;\ j = 1,\dots,T\}$
5: for $l = 3$ to $20$ do
6:   for $k = 1$ to $l-1$ do
7:     $B^l_k \leftarrow \left\{B^{l-1}_{k^\star},\ \frac{\eta^{l-1}_{k-1} + \eta^{l-1}_{k}}{2}\right\}$
8:     Compute $X^j_{B^l_k}$ from $X_j$, $j = 1,\dots,T$
9:     Compute $M^l_k = I(X_{B^l_k}; Y)$ using $\{X^j_{B^l_k}, Y_j;\ j = 1,\dots,T\}$
10:   end for
11:   $k^\star = \arg\max_k M^l_k$
12: end for
13: Return $B^{20}_{k^\star}$ and $M^{20}_{k^\star}$
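To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the greedy band-splitting loop. The function mi_fn is a placeholder for whatever mutual information estimator is used (not specified here), and log_band_energies refers to the earlier sketch in Section 4.4; both are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def greedy_filterbank(frames, artic, mi_fn, n_fft=512, n_bands=20):
    """Greedy tree-structured search for band edges maximizing I(X_{B_L}; Y).

    frames : list of short-time speech frames (one per articulatory sample)
    artic  : matrix of articulatory position vectors, one row per frame
    mi_fn  : callable estimating MI between feature rows and articulatory rows
    """
    edges, best_mi = [0, n_fft // 2 - 1], None      # level 1: a single all-pass band
    while len(edges) - 1 < n_bands:
        level_best = (-np.inf, None)
        for k in range(1, len(edges)):              # try splitting each existing band
            mid = (edges[k - 1] + edges[k]) // 2
            if mid <= edges[k - 1]:
                continue                            # band too narrow to split further
            cand = sorted(edges + [mid])
            feats = np.array([log_band_energies(f, cand, n_fft) for f in frames])
            mi = mi_fn(feats, artic)
            if mi > level_best[0]:
                level_best = (mi, cand)
        if level_best[1] is None:
            break                                   # no band can be split any further
        best_mi, edges = level_best                 # keep the split with maximum MI
    return edges, best_mi
```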
4.6 Results and Discussion
Fig. 4.5(a) illustrates how the filterbank corresponding to the maximum MI evolves with an increasing number of levels in the greedy optimization for a randomly chosen English subject from the production database. The rectangular filterbanks are drawn over the frequency axis (horizontal axis) for increasing levels; the level numbers are indicated on the left side of the drawn filterbanks. The maximum MI corresponding to the filterbank at each level is also shown. Note that the maximum MI need not be strictly increasing with the level number, due to the greedy nature of the algorithm: the algorithm only ensures that the filterbank corresponding to the maximum MI is picked at each level, and there is no guarantee that the maximum MI at level l is greater than that at level l-1. For example, the maximum MI at level 17 is 0.91 while that at level 16 is 0.92 in Fig. 4.5(a). The greedy algorithm could be modified by running it back and forth to ensure strict monotonicity in the MI values across levels; however, this would require a much longer optimization time than the approach presented here.

The optimal filterbank at level 20 is shown in Fig. 4.5(b) below the empirically established auditory filterbank (the auditory filterbank is identical to the one shown in Fig. 4.2(b)). It is clear that the filters in the empirical auditory filterbank are not strictly identical to those in the optimal filterbank. What is interesting to observe, however, is that both the optimal and the empirical auditory filterbank have filters with high frequency resolution at low center frequencies and low frequency resolution at higher center frequencies. Thus, the frequency axes corresponding to these filterbanks are warped with respect to the standard frequency axis. This warping function can be obtained using the center frequencies and the bandwidths of the filters in the filterbank. The warping functions for both the auditory and the optimal filterbanks are shown in Fig. 4.5(c). These warping functions are quite similar to each other in terms of the nature of the frequency warping they produce. However, we need to examine the variability of the frequency warping corresponding to the optimal filterbanks across all subjects.
[Figure 4.5 appears here. Panel (a) reports the maximum MI at each level of the greedy optimization (levels 2–20): 0.48, 0.61, 0.72, 0.77, 0.83, 0.86, 0.88, 0.88, 0.88, 0.90, 0.90, 0.91, 0.91, 0.91, 0.91, 0.92, 0.91, 0.91, 0.92; panels (b) and (c) span the frequency range 0–8000 Hz.]
Figure 4.5: Result of the filterbank optimization. (a) Filterbank corresponding to the maximum MI at each level of the greedy optimization. (b) The empirical auditory filterbank and the optimal filterbank (i.e., the filterbank corresponding to the maximum MI at level 20). (c) The warping function between the frequency axis and the warped frequency axis obtained by the empirical auditory filterbank and the optimal filterbank.
Figure 4.6: The warping functions corresponding to the optimal filterbanks for all
45 subjects (40 English + 3 Cantonese + 2 Georgian) considered from the production
databases.
Fig. 4.6 shows the warping functions obtained using the optimal filterbanks separately for all 45 subjects (40 English + 3 Cantonese + 2 Georgian), plotted against the warping function obtained by the empirically established auditory filterbank (dashed line in Fig. 4.6). It is important to note that the optimal filterbanks are similar to the auditory one in the sense that the frequency resolution of the filters in the filterbank decreases
with the increasing center frequencies of the filters. This finding is consistent with the findings of our previous work [23], although there the filterbank optimization was not performed in a generic fashion; rather, a specific parameterization of the filterbank was considered. However, the warping functions corresponding to the optimal filterbanks are not identical to the one corresponding to the empirically established auditory filterbank. This can be attributed to the fact that each optimal filterbank is tuned only for the speech signal of a specific subject, whereas the auditory filterbank in the human ear is exposed not only to speech but also to a variety of other sounds, including environmental and natural sounds. For example, Lewicki et al. [68] demonstrated that the auditory filters turn out to be efficient from an information encoding perspective when considering various natural sounds including speech. Since our objective function in this work is derived from a communication theoretic principle for achieving efficient speech communication between a talker and a listener, we have not considered any sounds other than speech. Also, the empirically established cochlear filterbank is obtained from data recorded from different human subjects, and there is inherent variability in the critical bandwidths across human subjects. Thus, the empirically established auditory filterbank represents only a notional filterbank for processing speech and does not capture the variability in speech processing across subjects. From that perspective, it is expected that the optimal filterbank corresponding to each subject would not be identical to the empirically established auditory filterbank; rather, on average, the empirically established auditory filterbank would be similar to the optimal filterbanks.
As the computationally determined optimal filterbanks are not identical to the empirical auditory filterbank, we investigate how close the empirical auditory filterbank is to the optimal filterbank for each subject considered in our experiment; this is done by computing the objective function value $I(X_{B_L}; Y)$ [Eq. (4.2)] when the empirical auditory filterbank is used as $B_L$. Note that the empirical auditory filterbank is different from the optimal filterbank and, hence, sub-optimal with respect to the objective function in Eq. (4.2). Let the objective function (or MI) value for the empirical auditory filterbank be denoted by $M_{\mathrm{Aud}}$. The optimized filterbank yields the maximum MI at level 20 of the tree-structured greedy algorithm, and this maximum MI value is $M^{20}_{k^\star}$ (see Algorithm 1). We examine the closeness between $M_{\mathrm{Aud}}$ and $M^{20}_{k^\star}$ in a subject-specific manner. For each subject in the production database, the ratio $\frac{M_{\mathrm{Aud}}}{M^{20}_{k^\star}}$ is shown as a bar graph in Fig. 4.7.

Figure 4.7: The ratio (in percentage) of the MI computed for the auditory filterbank (Aud. FB) and the optimal filterbank (Opt. FB) for all 45 subjects (40 English + 3 Cantonese + 2 Georgian) considered from the production databases. The average of all ratio values (in percentage) is 91.6% (blue dashed line).

It is clear that the ratio $\frac{M_{\mathrm{Aud}}}{M^{20}_{k^\star}} < 1$, i.e., the MI obtained using the optimal filterbank is always greater than that obtained using the auditory filterbank. This is expected, since the filterbank optimization is performed to maximize the mutual information between the filterbank output and the articulatory gestures. It is interesting to observe that, although the proposed greedy algorithm may result in a sub-optimal solution $B_L^\star$, the MI value corresponding to the auditory filterbank never exceeds the MI value corresponding to $B_L^\star$ for any subject. Rather, $M_{\mathrm{Aud}}$ is on average (across all subjects) 91.6% of the maximum possible MI [i.e., $I(X_{B_L^\star}; Y)$] achieved by the optimal filterbank. In other words, the output of the auditory filterbank provides more than 90% of the maximum possible information about the talker's message (or, equivalently, the articulatory gestures). Thus, the auditory filterbank in the listener's ear acts as a near-optimal speech processor for decoding the talker's message and achieving an efficient speech communication between the talker and the listener.
4.7 Conclusions
Our experimental results reveal that there is an inherent similarity among the filterbanks optimized in a talker-specific manner for retrieving maximum information about a talker's (transmitter's) message from the speech signal. Interestingly, the empirically established auditory filterbank in the human ear is similar to the computationally determined optimal filterbanks. This indicates that, as far as an efficient speech communication between a talker and a listener is concerned, an auditory-like filterbank offers a near-optimal choice as the speech processing unit at the receiver.

In our experiment we have used talkers from three different languages (English, Cantonese, and Georgian) with different phonological structures and, hence, different acoustic characteristics of sounds. The experiment was designed to find the best filterbank at the receiver so that maximum information about the transmitter's message can be obtained from the filterbank output. Our results show that, consistently for all languages, the auditory filterbank output provides more than 90% of the mutual information about the talker's articulatory gestures provided by a talker-specific optimal filterbank. Thus, from a communication theoretic perspective, an auditory-like filterbank is a near-optimal choice for processing the speech signal and inferring maximal information about the talker's message.
Chapter 5:
Generalized smoothness criterion
(GSC) for
acoustic-to-articulatory inversion
Having obtained an information theoretic relationship between the representations in the speech production and auditory systems, we now investigate how well we can estimate the representation of the articulatory space from the representation of the acoustic signal space; this problem is known as acoustic-to-articulatory inversion. The acoustic space is typically defined by one of several popular spectro-temporal features or model parameters derived from the acoustic speech signal. Similarly, the articulatory space can be represented in a variety of ways, including through 1) stylized models such as Maeda's model [41, 42] or the lossless tube model [30] of the vocal tract, 2) linguistic rule based models [2, 3, 5], or 3) direct physiological data-based representations of articulatory information [80]. In this work, we consider the physiological data based representation of the articulatory space, where articulatory data (e.g., position of the
lips, jaw, tongue, velum etc.) during speech production are obtained directly from the
talkers by means of a specialized instrument such as an electromagnetic articulography
(EMA), ultrasound, or magnetic resonance imaging. Hence, in this work, by acoustic-
to-articulatory inversion we refer to the problem of estimating the articulatory positions
(physiological data) from a given acoustic speech signal.
Acoustic-to-articulatory inversion has received a great deal of attention from re-
searchers over the last several decades, notably motivated by potential applications to
speech technology development. All acoustic-to-articulatory inversion solutions are su-
pervised, i.e., they require some knowledge about the possible articulatory positions
for a given acoustic signal from some training data. Such solutions often provide com-
plementary information to acoustics and, thus, can help improve the performance of
current automatic speech recognition systems, especially in cases such as with noisy,
spontaneous, or pathological speech [22, 33, 81, 84]. In addition, articulatory gesture
representations are considered to have a parsimonious description of the underlying dynamics for producing the acoustic speech signal [2, 3, 5] and hence deriving these gestures
from the speech signal or from the estimated articulatory positions or tract variables
[83] can provide insight into linguistic phonology.
It is widely known that the difficulty in the acoustic-to-articulatory mapping lies in
its ill-posed nature. It has been shown that multiple distinct articulatory configurations can result in the same or very similar acoustic effects; an empirical investigation of such non-uniqueness in the acoustic-to-articulatory mapping can be found in [53]. Atal et al. [1] also showed that an infinite number of articulatory configurations can generate three identical formant frequencies. The problem is highly non-linear, too; two somewhat similar articulatory states may give rise to totally different acoustic signals [74]. One of the reasons for this non-unique mapping may come from the limitations of the modeling or parametric representation of both the articulatory and acoustic spaces. For example, non-uniqueness in the mapping arises when using only a formant based acoustic representation, but additional knowledge about bandwidths in the acoustic representation reduces the non-uniqueness. Nonetheless, non-uniqueness in the inverse mapping poses a serious problem in the estimation of articulatory parameters from acoustic ones and, hence, motivates the investigation of better solutions to the inversion problem.
A common approach to address this ill-posed problem is to use regularization [46] or dynamic constraints while estimating the inverse mapping [27, 55, 62, 69, 72]. Sorokin et al. [69] chose a regularizing term that prevents inverse solutions from deviating too much from the neutral position of the articulators. Schroeter and Sondhi [62] presented a method based on dynamic programming (DP) to search articulatory codebooks with a penalty factor for large "articulatory efforts", that is, fast changes in the vocal tract, so that the estimated articulator trajectories evolve smoothly. They used LPC derived cepstral coefficients as the acoustic feature and introduced a lifter in the computation of the acoustic distance and a dynamic cost for making a transition from one vocal tract shape to another. Toda et al. [72] used a Gaussian mixture model (GMM) to perform the inversion mapping but formulated it as a statistical trajectory model by augmenting the observations (mel cepstral coefficients) with first and second derivative features. In [60], Richmond proposed a trajectory model based on a mixture density network for estimating maximum likelihood trajectories which respect the constraints between the
static and derived dynamic features. Similar methods using dynamical constraints have
been proposed based on Kalman filtering and smoothing [56, 65, 78]. Dusan et al. in
[18] extended previous studies of estimating articulator trajectories by Kalman filtering
by implementing phonological constraints by modeling different articulatory-acoustic
sub-functions, each corresponding to a phonological coproduction model.
The essence of the regularization or smoothness constraints lies in the physical movement of the articulators. The trajectory of the articulators during speech production is
in general smooth and slowly varying. Demanding smooth changes in the articulators
can reduce the non-uniqueness in the inversion problem [62]. For example in [72], Toda
et al. reported that with lowpass filtering of the solution of the GMM based mapping,
they achieved lower RMS error. Similarly, Richmond in [59], performed lowpass filter-
ing as a postprocessing step. It was shown that low pass filtering of the MLP output
by articulator specific cut-off frequencies indeed moderately improved the result, i.e.
the RMS error decreased and the correlation score improved. In [58], Richmond et
al. discussed the usefulness of using low-pass filtering on the articulator trajectory as a
smoothness constraint in the optimization. For example, in [55], one such constraint was
used as a part of the DP search through the output of their network, which constrained
the articulator trajectories to be as smooth as possible. Also in [7, 35], the articulator
trajectories are constrained such that articulators move as slowly as possible.
The smoothing of a signal can be interpreted as linear time-invariant (LTI) filtering, in which the high frequency components of the signal are suppressed and the low frequency components are preserved so that the signal becomes smooth. For example, the authors in [36, 57, 62] minimize a DP cost function which contains $(A_t - A_{t-1})^2$, where $A_t$ is the articulator variable at time frame $t$. By minimizing $(A_t - A_{t-1})^2$ over the entire time, the energy of the difference of the articulator variable is minimized. $\sum_t (A_t - A_{t-1})^2$ can be interpreted as the energy of the output of a discrete-time LTI filter with impulse response $h = [1\ {-1}]$ whose input is $A_t$. $h = [1\ {-1}]$ is a high pass filter whose 3 dB cut-off frequency is $\frac{F_s}{4}$, where $F_s$ is the sampling frequency. By minimizing the energy of the output of this filter, the high frequency component of the articulator trajectory is suppressed. However, a particular high-pass filter with a fixed cut-off frequency may not be optimal for all articulators. A more systematic approach would be to design appropriate high pass filters for individual articulators and include them in the optimization. Note, however, that an arbitrary high pass filter might have a large finite or an infinite impulse response. The complexity of DP increases exponentially with the length of the filter $K$ and, hence, it becomes computationally expensive even for an FIR filter with $K > 2$. When the smoothness constraint in the cost function involves an IIR filter, the cost function cannot be minimized using DP at all.
In this work, we derive a formulation where any arbitrary high pass filter can be used
in the inversion problem for smoothing articulator trajectories. The cut-off frequency
of the filter can be adaptively tuned in such a generalized smoothness setting and,
hence, this formulation can provide a more realistic articulator trajectory compared
to that obtained by a filter with fixed cut-off frequency. The formulation is similar
to the codebook search approach but under a general smoothness criterion. Another
key advantage of this formulation is that the solution of the articulator trajectory need
not be computed all at once; rather, a recursive solution can be derived without any
degradation in performance.
5.1 Dataset and pre-processing
The Multichannel Articulatory (MOCHA) database [80] is used for the analysis and experiments in this work. The MOCHA database consists of acoustic and corresponding articulatory ElectroMagnetic Articulography (EMA) data from two speakers: one male (with a Northern English accent) and one female (with a Southern English accent). The acoustic and articulatory data were collected while each speaker read a set of 460 phonetically-diverse British English TIMIT sentences. The articulatory data consist of the X and Y coordinates of nine receiver sensor coils attached to nine points along the midsagittal plane, namely the lower incisor or jaw (li_x, li_y), upper lip (ul_x, ul_y), lower lip (ll_x, ll_y), tongue tip (tt_x, tt_y), tongue body (tb_x, tb_y), tongue dorsum (td_x, td_y), velum (v_x, v_y), upper incisor (ui_x, ui_y) and bridge of the nose (bn_x, bn_y). The last two are used as reference coils; thus, the first seven coils provide 14
channels of articulatory position information. The position of each coil was recorded
at 500 Hz with 16 bit precision. The corresponding speech was collected at 16KHz
sampling rate.
Although the position data of seven articulators in the MOCHA database have
been already processed to compensate for head movement, the data in this raw form is
still not suitable for analysis or modeling [59]. The position data have high frequency
noise resulting from EMA measurement error, while the articulatory movements are
predominantly low-pass in nature (we will see in the next section that 99% of the energy is contained below ∼21 Hz for all the articulators). Hence the articulatory data of each channel is low pass filtered with a cut-off frequency of 35 Hz. Since the articulatory data is low-pass in nature, owing to the physical movement of the articulators, the choice of 35 Hz
is sufficient to keep the articulatory position information unaltered. To avoid any phase
distortion due to the low pass filtering on the articulatory data, the filtering process is
performed twice (“zero-phase filtering”) - the data is initially filtered and then reversed
and filtered again and reversed once more finally. After filtering, the articulatory data
is downsampled by a factor of 5 so that the frame rate is 100 per second. Since the low
pass cut-off frequency was 35 Hz, no aliasing occurs due to downsampling.
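The zero-phase filtering and downsampling described above can be reproduced with standard signal-processing tools. The sketch below uses SciPy's Butterworth design and filtfilt as one plausible realization; the exact filter type and order used for the 35 Hz low-pass are not specified in the text, so those are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_ema(channel, fs_in=500, cutoff=35.0, fs_out=100):
    """Zero-phase low-pass filter an EMA channel and downsample to 100 Hz.

    channel : 1-D array of raw EMA positions sampled at fs_in (500 Hz in MOCHA)
    """
    # 5th-order Butterworth low-pass (the order here is an illustrative assumption)
    b, a = butter(5, cutoff / (fs_in / 2.0), btype="low")
    smoothed = filtfilt(b, a, channel)      # forward-backward filtering => zero phase
    step = fs_in // fs_out                  # 500 / 100 = 5
    return smoothed[::step]

# Example with synthetic data standing in for one articulator channel
raw = np.cumsum(np.random.randn(5000)) * 0.01
print(preprocess_ema(raw).shape)            # (1000,) frames at 100 Hz
```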
Each utterance of both speakers has silence in the initial portion and towards the end of the utterance. Since during non-speech portions the articulators can assume any position, considering data from these regions can increase the variability in the inverse mapping. Hence, the silence portions were manually selected and the corresponding articulatory data were omitted. Of the 460 utterances available from each speaker, data from 368 utterances (80%) are used for training, 37 utterances (8%) as the development set (dev set), and the remaining 55 utterances (12%) as the test set. In summary, for the two speakers, the numbers of frames of available articulatory data are shown in Table 5.1.
# Articulator frames
Speaker Training Set Dev Set Test Set
Male 85673 8866 14553
Female 98666 10298 16454
Table 5.1: Number of frames of articulatory data available for training, development,
and test set.
The mean position for each articulator changes from utterance to utterance [59]. A
few reasons for this variation of mean articulatory position have been stated in [59],
namely change in temperature and shift in the location of the EMA helmet and trans-
mitter coil relative to the subject’s head. This means that even after low-pass filtering
and downsampling, the articulatory data are still not directly ready for the modeling
purpose. To make the data ready for such use, we first subtract the mean articula-
tor location from the articulatory position for every utterance in a way similar to [59].
Finally, we add the mean articulatory position, averaged over all utterances. These
pre-processed articulator trajectories are used for further analysis and experiments.
5.2 Empirical frequency analysis of articulatory data
The articulators in the human speech production system move to create distinct vocal
tract shapes to generate different acoustic signals. The articulators, i.e. tongue, lips,
jaw, velum, are in general slow moving and thus the articulatory data are low-pass
in nature [47]. The purpose of analyzing the spectrum of the articulatory data is to
understand the nature of the articulatory movement and quantify the effective maximum frequency content of such slowly varying signals. This in turn would inform us about the
smoothness of the articulatory movement for designing appropriate smoothing criteria
for different articulatory data.
The frequency domain analysis is performed separately on the articulator trajectories of each utterance in the training set. There are 14 different articulator trajectories for every utterance. Let $\{x[n]; 1 \leq n \leq N\}$ denote any one of these 14 trajectories for a particular utterance. We compute the samples of its spectrum $S[k]$, $k = 0,\cdots,N_F-1$, using the discrete Fourier transform (DFT) with a DFT order $N_F = 2^{14} = 16384$ as follows:

$$S[k] = \left| \sum_{n=1}^{N} x_0[n] \exp\left(-j\frac{2\pi}{N_F}kn\right) \right|^2, \qquad (5.1)$$

where $x_0[n] = x[n] - \frac{1}{N}\sum_{n=1}^{N} x[n]$ is the dc-removed articulator trajectory. $S[k]$ of all 14 articulator trajectories are found to be low-pass, as expected. Since the sampling frequency of $x[n]$ is 100 Hz, the frequency resolution of the spectrum is $\frac{100}{N_F} = 0.0061$ Hz. The total energy of $x[n]$ is $\sum_{n=1}^{N} x^2[n] = \frac{1}{N_F}\sum_{k=1}^{N_F} S[k]$ (by Parseval's theorem). We would like to calculate the frequency below which a certain percentage (say $\alpha\%$) of the total energy is contained. This is done by finding $N_c$ such that $\frac{S[0] + 2\sum_{k=1}^{N_c} S[k]}{\sum_{k=1}^{N_F} S[k]} = \frac{\alpha}{100}$; the corresponding frequency is $f_c = \frac{N_c \cdot 100}{N_F}$ Hz. The mean $f_c$ (along with its standard deviation (SD)) averaged over all utterances for $\alpha$ = 90, 95, and 99 is tabulated in Table 5.2 for all 14 articulators of both speakers.
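The following sketch shows one way to compute $f_c$ for a single trajectory as described above; the variable names and the use of NumPy's FFT are illustrative assumptions.

```python
import numpy as np

def cutoff_frequency(x, alpha=99.0, fs=100.0, n_fft=2**14):
    """Frequency below which alpha% of the trajectory's energy is contained (Eq. 5.1)."""
    x0 = x - np.mean(x)                         # remove dc
    S = np.abs(np.fft.fft(x0, n=n_fft)) ** 2    # power spectrum samples S[k]
    half = S[: n_fft // 2]
    cum = S[0] + 2.0 * np.cumsum(half[1:])      # cumulative energy over positive frequencies
    total = np.sum(S)
    n_c = np.searchsorted(cum, (alpha / 100.0) * total) + 1
    return n_c * fs / n_fft                     # f_c in Hz

# Example: a slowly varying synthetic trajectory sampled at 100 Hz
t = np.arange(0, 3.0, 0.01)
traj = np.sin(2 * np.pi * 2.0 * t) + 0.2 * np.sin(2 * np.pi * 6.0 * t)
print(round(cutoff_frequency(traj), 2), "Hz")
```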
From Table 5.2, it can be seen that the mean $f_c$ of a particular articulator is similar for both speakers, except for ul_x, ll_x, and li_x. For a particular speaker, not all articulators have the same mean $f_c$ for a given $\alpha$. For example, for $\alpha = 90$, the mean $f_c$ varies from 3.33 Hz (ul_y) to 4.52 Hz (v_x) in the case of the male speaker; for $\alpha = 99$, this variation is even larger. The same is true for the data of the female speaker.

It is well known that articulatory movements are for the most part slow and smooth [47]. However, not all articulators have equal degrees of smoothness, as demonstrated by the above empirical frequency analysis. These results will
Mean f_c (SD) [in Hz]
                        Male                                    Female
Articulator     α=90         α=95         α=99          α=90         α=95         α=99
ul x 4.03 (1.7) 6.11 (2.1) 11.71 (3.1) 2.67 (0.7) 3.62 (0.9) 7.67 (3.3)
ll x 4.02 (0.7) 5.07 (0.9) 9.63 (3.2) 2.88 (0.7) 3.89 (0.9) 8.20 (3.1)
li x 4.15 (1.3) 5.81 (1.7) 11.00 (3.0) 2.69 (0.6) 3.66 (0.9) 7.77 (3.3)
tt x 3.75 (0.6) 4.71 (0.7) 9.13 (3.6) 3.36 (0.6) 4.29 (0.7) 7.60 (2.5)
tb x 3.64 (0.6) 4.60 (0.7) 8.64 (3.0) 3.27 (0.6) 4.14 (0.7) 7.15 (2.1)
td x 3.56 (0.7) 4.53 (0.8) 8.12 (2.0) 3.43 (0.7) 4.43 (0.8) 7.81 (3.0)
v x 4.52 (1.3) 6.93 (2.2) 21.68 (12.7) 3.94 (1.3) 5.97 (2.7) 20.63 (15.5)
ul y 3.33 (0.6) 4.35 (0.8) 8.99 (4.4) 3.10 (0.6) 4.00 (0.7) 7.69 (3.1)
ll y 4.40 (0.5) 5.23 (0.6) 9.27 (3.2) 4.11 (0.5) 4.92 (0.6) 7.74 (1.9)
li y 3.37 (0.6) 4.23 (0.7) 8.26 (3.6) 3.49 (0.5) 4.37 (0.6) 7.75 (2.9)
tt y 4.13 (0.7) 5.07 (0.7) 8.54 (2.3) 4.30 ( 0.7) 5.35 (0.7) 8.64 (1.6)
tb y 3.60 (0.6) 4.43 (0.6) 7.44 (1.9) 3.38 (0.5) 4.19 (0.6) 7.06 (2.3)
td y 3.71 (0.5) 4.52 (0.5) 7.55 (2.5) 3.46 (0.5) 4.38 (0.6) 8.57 (3.0)
v y 3.88 (0.9) 5.62 (1.6) 15.53 (9.3) 3.80 (1.1) 5.34 (2.1) 15.74 (12.3)
Table 5.2: The mean f_c (standard deviation in brackets) of the articulatory data of the two speakers in the MOCHA-TIMIT database.
be invoked in selecting parameter values for smoothness constraints for the different
articulators in implementing the inversion problem.
5.3 Generalized smoothness criterion for the inversion
problem
Let $\{\mathbf{z}_i; 1 \leq i \leq T\}$ represent the acoustic feature vectors in the training set. Also let $x_i$ denote the corresponding position value of any one of the 14 articulator channels. Now suppose, for the inversion problem, a (test) speech utterance is given and the acoustic feature vectors computed for this utterance are denoted by $\{\mathbf{u}_n; 1 \leq n \leq N\}$. The goal is to find the corresponding position values of each articulator channel, denoted by $\{x[n]; 1 \leq n \leq N\}$, from $\{\mathbf{u}_n; 1 \leq n \leq N\}$.
We need to minimize the high frequency components in $x[n]$ to ensure that the estimated articulatory position is smooth and slowly varying. Hence, the smoothness requirement is equivalent to minimizing the energy of the output of a high pass filter with input $\{x[n]; 1 \leq n \leq N\}$. Also suppose that, based on the knowledge of the frequency content of the articulator trajectory, the high pass filter $h$ is given. $h$ can be an FIR or IIR filter. For an FIR filter the impulse response $h[n]$ is specified, and for an IIR filter the rational transfer function $H(z)$, the $Z$-transform of $h[n]$, is specified. Let $y[n]$ denote the output of $h$ with input $\{x[n]; 1 \leq n \leq N\}$, i.e.,

$$y[n] = \sum_{k=1}^{N} x[k]\, h[n-k] \qquad (5.2)$$
Let the $L$ possible values of the articulatory position at the $n$-th frame of the test speech utterance be denoted by $\{\eta^l_n; 1 \leq l \leq L\}$. These are obtained using the training set $\{(\mathbf{z}_i, x_i); 1 \leq i \leq T\}$ and $\mathbf{u}_n$. Let $p^l_n$ denote the probability that $\eta^l_n$ is the value of the articulatory position at the $n$-th frame given that $\mathbf{u}_n$ is the acoustic feature. $L$ can, in general, be equal to $T$. Then the inversion problem can be stated as follows:

$$\{x^\star[n]; 1 \leq n \leq N\} = \arg\min_{\{x[n]\}} J(x[1], \cdots, x[N]) \triangleq \arg\min_{\{x[n]\}} \left\{ \sum_n (y[n])^2 + C \sum_n \sum_l \left( x[n] - \eta^l_n \right)^2 p^l_n \right\} \qquad (5.3)$$
where J denotes the cost function to be minimized and y[n] is given in eqn. (5.2).
The first term $\sum_n (y[n])^2$ in the cost function is the energy of the output of the filter $h$. The second term $\sum_n \sum_l (x[n] - \eta^l_n)^2 p^l_n$ is the weighted cost of how different $x[n]$ is from $\eta^l_n$, $1 \leq l \leq L$, where the weights are $p^l_n$ ($\eta^l_n$ and $p^l_n$ are determined from the training set). For example, if $p^l_n = 1$ for $l = 1$ and $p^l_n = 0$ for $l > 1$, then $x[n]$ has to be as close as possible to $\eta^1_n$. In other words, if it turns out that the probability of the articulatory position being $\eta^1_n$ is very high based on the training set, the solution $x^\star[n]$ has to be as close as possible to $\eta^1_n$. More generally, the probability of $x[n]$ being equal to $\eta^l_n$ is $p^l_n$, $1 \leq l \leq L$.
$C\,(>0)$ is the trade-off parameter between these two terms. For minimization, we set $\frac{\partial J}{\partial x[m]} = 0$, $m = 1,\cdots,N$:

$$\Rightarrow 2\left\{ \sum_n \left( \sum_k x[k]h[n-k] \right) h[n-m] + C \sum_l \left( x[m] - \eta^l_m \right) p^l_m \right\} = 0, \quad m = 1,\cdots,N$$

$$\Rightarrow \sum_k x[k] \left( \sum_n h[n-k]h[n-m] \right) + \left( C \sum_l p^l_m \right) x[m] = C \sum_l \eta^l_m p^l_m, \quad m = 1,\cdots,N$$

$$\Rightarrow \sum_{k=1}^{N} x[k]\, R_h[m-k] + \left( C \sum_l p^l_m \right) x[m] = C \sum_l \eta^l_m p^l_m, \quad m = 1,\cdots,N$$

where $R_h[m-k] \triangleq \sum_n h[n-k]h[n-m]$ is the autocorrelation sequence of $h[n]$. The above set of $N$ equations can be written in matrix-vector form as follows:
$$\begin{bmatrix}
R_h[0] + C\sum_l p^l_1 & R_h[1] & \cdots & R_h[N-1] \\
R_h[-1] & R_h[0] + C\sum_l p^l_2 & \cdots & R_h[N-2] \\
\vdots & \vdots & \ddots & \vdots \\
R_h[-(N-1)] & R_h[-(N-2)] & \cdots & R_h[0] + C\sum_l p^l_N
\end{bmatrix}
\begin{bmatrix} x[1] \\ x[2] \\ \vdots \\ x[N] \end{bmatrix}
=
\begin{bmatrix} C\sum_l \eta^l_1 p^l_1 \\ C\sum_l \eta^l_2 p^l_2 \\ \vdots \\ C\sum_l \eta^l_N p^l_N \end{bmatrix} \qquad (5.4)$$
Assuming the $p^l_n$ are normalized such that $\sum_l p^l_n = 1\ \forall n$ (this does not alter the solution, since any constant can be absorbed in $C$), we can rewrite eqn. (5.4) as

$$(\mathbf{R} + C\mathbf{I})\mathbf{x} = \mathbf{d} \qquad (5.5)$$

where $\mathbf{R} = \{R_{ij}\} = \{R_h[j-i]\} = \{R_h[\,|j-i|\,]\}$ (since the autocorrelation is symmetric), $\mathbf{I}$ is the $N \times N$ identity matrix, $\mathbf{x} = [x[1], \cdots, x[N]]^T$, and $\mathbf{d} = \left[ C\sum_l \eta^l_1 p^l_1, \cdots, C\sum_l \eta^l_N p^l_N \right]^T$. $[\cdot]^T$ denotes the transpose operation.
Note that if $C = 0$, the solution is $x^\star[n] = 0$; i.e., when there is no information about $\eta^l_n$ and $p^l_n$, or we do not consider any information from the training data, the solution is zero. This is because the only way to minimize the energy of $y[n]$ is to feed a zero signal to the input of the filter $h$. On the other hand, if $h = 0$, i.e., no filter is provided or no smoothing criterion is imposed, then $x^\star[n] = \sum_l \eta^l_n p^l_n$, i.e., it is the convex combination of the possible values of the articulatory positions learned from the training data. If $p^1_n = 1$ and $p^l_n = 0$ for $l > 1$, the solution is $x^\star[n] = \eta^1_n$, the only possible value of the articulatory position. Thus, in general, the second term of the objective function (eqn. (5.3)) constrains the solution to lie in the convex hull of $\eta^l_n$, $1 \leq l \leq L$. It is easy to show that the second term, in turn, ensures that the acoustic feature vector corresponding to $x^\star[n]$ is also in the convex hull of the acoustic feature vectors corresponding to $\eta^l_n$, $1 \leq l \leq L$, under the assumption of local linearity of the non-linear mapping between the acoustic and articulatory spaces. Thus, the acoustic proximity between the estimated and the possible articulatory configurations is indirectly considered in our proposed optimization framework, although we do not directly include an acoustic proximity term in the objective function, unlike the dynamic programming formulation [62].
If both $C$ and $h$ are non-zero, then the solution of eqn. (5.5) can be found as follows:

$$\mathbf{x}^\star = (\mathbf{R} + C\mathbf{I})^{-1}\mathbf{d} \qquad (5.6)$$

Since $\mathbf{R}$ is an autocorrelation matrix (hence symmetric Toeplitz and positive semidefinite) and since $C > 0$, $(\mathbf{R} + C\mathbf{I})$ is always invertible, and hence the solution $\mathbf{x}$ always exists.
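A compact numerical sketch of the closed-form solution in Eq. (5.6) is given below. Building the Toeplitz matrix from the filter autocorrelation with SciPy is an implementation choice assumed here, not prescribed by the text; the helper names are illustrative.

```python
import numpy as np
from scipy.linalg import toeplitz

def gsc_batch_solve(h, C, eta, p):
    """Solve (R + C I) x = d of Eqs. (5.5)-(5.6) for one articulator channel.

    h   : impulse response of the high-pass smoothness filter (FIR, or a truncated IIR)
    C   : trade-off parameter (> 0)
    eta : (N, L) candidate articulatory positions eta_n^l per frame
    p   : (N, L) candidate weights p_n^l (rows assumed to sum to 1)
    """
    N = eta.shape[0]
    r_full = np.correlate(h, h, mode="full")   # autocorrelation of the filter
    mid = len(h) - 1
    R_h = np.zeros(N)
    n_lags = min(N, len(h))
    R_h[:n_lags] = r_full[mid:mid + n_lags]
    A = toeplitz(R_h) + C * np.eye(N)          # (R + C I)
    d = C * np.sum(eta * p, axis=1)            # d_n = C * sum_l eta_n^l p_n^l
    return np.linalg.solve(A, d)

# Toy usage: first-difference filter, 3 frames, 2 candidates per frame
x_hat = gsc_batch_solve(np.array([1.0, -1.0]), C=0.1,
                        eta=np.array([[1.0, 1.2], [1.1, 1.3], [1.2, 1.4]]),
                        p=np.full((3, 2), 0.5))
print(x_hat)
```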
Before concluding this section, we describe the strategy used to determine $\eta^l_n$ and $p^l_n$, $l = 1,\cdots,L$, from the training set.

$\mathbf{u}_n$ denotes the acoustic feature vector at the $n$-th frame of the test speech utterance, and $\{(\mathbf{z}_i, x_i); 1 \leq i \leq T\}$ are the pairs of acoustic feature and articulatory position values in the training set. Let $\delta_{n,i} = \|\mathbf{u}_n - \mathbf{z}_i\|$, $1 \leq i \leq T$. At each frame $n$, the $\delta_{n,i}$, $1 \leq i \leq T$, are computed and sorted in ascending order. The articulatory positions $x_i$ in the training set corresponding to the top $L$ sorted $\delta_{n,i}$ are denoted by $\{\eta^l_n;\ 1 \leq l \leq L\}$. That is, $\{\eta^l_n;\ 1 \leq l \leq L\}$ are the $L$ articulatory positions in the training set whose corresponding acoustic features are closest to $\mathbf{u}_n$. Let the top $L$ sorted $\delta_{n,i}$ be denoted by $\{\delta_l; 1 \leq l \leq L\}$. Then the $p^l_n$ are computed as $p^l_n = \frac{\delta_l^{-1}}{\sum_l \delta_l^{-1}}$, which ensures that $\sum_l p^l_n = 1$. Computing $p^l_n$ in this way implies that if the test acoustic feature vector $\mathbf{u}_n$ is closer to the training acoustic feature vector $\mathbf{z}_{l_1}$ than to some other $\mathbf{z}_{l_2}$, then $x_{l_1}$ is more likely than $x_{l_2}$ to be the articulatory position at the $n$-th frame of the test utterance.
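The candidate selection just described is essentially an inverse-distance-weighted nearest-neighbour lookup; the sketch below is one straightforward way to implement it (a brute-force search is assumed here; a KD-tree or similar structure could equally be used).

```python
import numpy as np

def candidates_for_frame(u_n, Z_train, x_train, L=200, eps=1e-8):
    """Return (eta, p): L candidate positions and their normalized weights for one frame.

    u_n     : acoustic feature vector of the test frame
    Z_train : (T, D) training acoustic features
    x_train : (T,)  training positions of one articulator channel
    """
    dist = np.linalg.norm(Z_train - u_n, axis=1)   # delta_{n,i}
    idx = np.argsort(dist)[:L]                     # top-L nearest neighbours
    eta = x_train[idx]
    inv = 1.0 / (dist[idx] + eps)                  # eps guards against a zero distance
    p = inv / inv.sum()                            # p_n^l, sums to 1
    return eta, p

# Toy usage with random "training" data (illustrative only)
Z, x = np.random.randn(1000, 13), np.random.randn(1000)
eta, p = candidates_for_frame(np.random.randn(13), Z, x, L=5)
print(eta, p.sum())
```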
As an alternative to the normalized sorted distances, we considered Parzen window based density estimation for determining $p^l_n$. In this approach, a probability density function is estimated over the entire training space (the joint space of $\mathbf{z}_i$ and $x_i$) using a sum of Gaussian windows placed at each data point. The probability density values at the $(\mathbf{z}_i, x_i)$ corresponding to the top $L$ sorted $\delta_{n,i}$ were used as $p^l_n$. However, this approach did not result in a better estimate of the articulatory positions. This could be due to the fact that Parzen window based pdf estimation is efficient only when a large number of data samples is available, particularly if the related space is high dimensional. Also, the relation between $\mathbf{z}_i$ and $x_i$ is non-linear, and hence the probability in the joint space might not be a good measure of $p^l_n$.
5.4 Recursive solution to the inversion problem
The goal of the recursion in the inversion problem is to estimate the articulatory position at the $(N+1)$-th frame using the acoustic feature at the $(N+1)$-th frame and the estimated articulatory positions up to the $N$-th frame, i.e., $x[1],\cdots,x[N]$.
Let $\mathbf{x}_N = [x[1]\ \cdots\ x[N]]^T$ and let $\mathbf{R}_N$ be the $N \times N$ autocorrelation matrix of the filter $h$, so that (using eqn. (5.6))

$$\mathbf{x}_N = (\mathbf{R}_N + C\mathbf{I})^{-1}\mathbf{d}_N \qquad (5.7)$$

where $\mathbf{d}_N$ is an $N \times 1$ vector. When the $(N+1)$-th frame arrives we obtain $\mathbf{d}_{N+1} = \left[ \mathbf{d}_N^T \;\; d_{N+1} \right]^T$, and we need to solve $\mathbf{x}_{N+1} = (\mathbf{R}_{N+1} + C\mathbf{I})^{-1}\mathbf{d}_{N+1}$ using $\mathbf{x}_N$.

Let $\mathbf{A}_N = \mathbf{R}_N + C\mathbf{I}$. The last column of $\mathbf{A}_{N+1}$ above the corner element is $[R_h[N], \cdots, R_h[1]]^T = \mathbf{J}\mathbf{r}_N$, so $\mathbf{A}_{N+1}$ can be partitioned as follows:

$$\mathbf{A}_{N+1} = \begin{bmatrix} \mathbf{A}_N & \mathbf{J}\mathbf{r}_N \\ (\mathbf{J}\mathbf{r}_N)^T & R_h[0]+C \end{bmatrix} \qquad (5.8)$$
where $\mathbf{r}_N = [R_h[1], \cdots, R_h[N]]^T$ and $\mathbf{J}$ is the $N \times N$ exchange matrix (ones on the anti-diagonal, zeros elsewhere). Using matrix partitioning [43],

$$\mathbf{A}_{N+1}^{-1} = \begin{bmatrix} \mathbf{A}_N^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{P_N} \begin{bmatrix} \mathbf{b}_N \\ 1 \end{bmatrix} \begin{bmatrix} \mathbf{b}_N^T & 1 \end{bmatrix} \qquad (5.9)$$
where $\mathbf{b}_N = -\mathbf{A}_N^{-1}\mathbf{J}\mathbf{r}_N$ and $P_N = R_h[0] + C + \mathbf{r}_N^T\mathbf{J}\mathbf{b}_N$. So

$$\mathbf{x}_{N+1} = \mathbf{A}_{N+1}^{-1}\mathbf{d}_{N+1} = \mathbf{A}_{N+1}^{-1}\begin{bmatrix} \mathbf{d}_N \\ d_{N+1} \end{bmatrix} = \left( \begin{bmatrix} \mathbf{A}_N^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{P_N}\begin{bmatrix} \mathbf{b}_N \\ 1 \end{bmatrix}\begin{bmatrix} \mathbf{b}_N^T & 1 \end{bmatrix} \right) \begin{bmatrix} \mathbf{d}_N \\ d_{N+1} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_N \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_N \\ 1 \end{bmatrix}\alpha_N \qquad (5.10)$$

where $\alpha_N = \frac{-\mathbf{x}_N^T\mathbf{J}\mathbf{r}_N + d_{N+1}}{P_N}$. So if $\mathbf{b}_N$ is known, we can compute $\mathbf{x}_{N+1}$ from $\mathbf{x}_N$ without any matrix inversion; thus we need to derive a recursion for $\mathbf{b}_N$.

Let us define

$$\mathbf{a}_N \triangleq \mathbf{J}\mathbf{b}_N \;(\Rightarrow \mathbf{b}_N = \mathbf{J}\mathbf{a}_N) = -\mathbf{J}\mathbf{A}_N^{-1}\mathbf{J}\mathbf{r}_N = -\mathbf{A}_N^{-1}\mathbf{r}_N \qquad (5.11)$$

where the last equality holds because $\mathbf{A}_N$ is symmetric Toeplitz and therefore $\mathbf{J}\mathbf{A}_N^{-1}\mathbf{J} = \mathbf{A}_N^{-1}$. Thus we need to compute $\mathbf{a}_{N+1}\ (= -\mathbf{A}_{N+1}^{-1}\mathbf{r}_{N+1})$ from $\mathbf{a}_N$.
$$\mathbf{a}_{N+1} = -\mathbf{A}_{N+1}^{-1}\mathbf{r}_{N+1} = -\mathbf{A}_{N+1}^{-1}\begin{bmatrix} \mathbf{r}_N \\ R_h[N+1] \end{bmatrix} = -\left( \begin{bmatrix} \mathbf{A}_N^{-1} & \mathbf{0} \\ \mathbf{0}^T & 0 \end{bmatrix} + \frac{1}{P_N}\begin{bmatrix} \mathbf{b}_N \\ 1 \end{bmatrix}\begin{bmatrix} \mathbf{b}_N^T & 1 \end{bmatrix} \right)\begin{bmatrix} \mathbf{r}_N \\ R_h[N+1] \end{bmatrix} = \begin{bmatrix} \mathbf{a}_N \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_N \\ 1 \end{bmatrix}\gamma_N \qquad (5.12)$$

where $\gamma_N = -\frac{\mathbf{a}_N^T\mathbf{J}\mathbf{r}_N + R_h[N+1]}{P_N}$. Thus, if we know $\mathbf{a}_N$ (or $\mathbf{b}_N = \mathbf{J}\mathbf{a}_N$), we can compute $\mathbf{a}_{N+1}$ without matrix inversion. Hence, if we know $\mathbf{x}_N$, we can compute $\mathbf{x}_{N+1}$ using eqns. (5.10) and (5.12), and no explicit matrix inversion is required at any step. The steps of the recursive solution of eqn. (5.3) are given below:
1. Step 1 (Initialization): $n = 1$. Estimate $\eta^l_1$ and $p^l_1$, $l = 1,\cdots,L$, from $\mathbf{u}_1$, and set $d_1 = C\sum_l \eta^l_1 p^l_1$. Then
   $\mathbf{x}_1 = x[1] = \frac{d_1}{R_h[0]+C}$, $\quad \mathbf{r}_1 = R_h[1]$, $\quad \mathbf{b}_1 = -\frac{\mathbf{r}_1}{R_h[0]+C}$ and $\mathbf{a}_1 = \mathbf{b}_1$, $\quad P_1 = R_h[0] + C + \mathbf{r}_1^T\mathbf{J}\mathbf{b}_1$.
   Set $n = 2$.

2. Step 2 (Recursion): Estimate $\eta^l_n$ and $p^l_n$, $l = 1,\cdots,L$, from $\mathbf{u}_n$, and set $d_n = C\sum_l \eta^l_n p^l_n$. Then
   $\gamma_{n-1} = -\frac{\mathbf{a}_{n-1}^T\mathbf{J}\mathbf{r}_{n-1} + R_h[n]}{P_{n-1}}$, $\quad \alpha_{n-1} = \frac{-\mathbf{x}_{n-1}^T\mathbf{J}\mathbf{r}_{n-1} + d_n}{P_{n-1}}$,
   $\mathbf{x}_n = \begin{bmatrix} \mathbf{x}_{n-1} \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_{n-1} \\ 1 \end{bmatrix}\alpha_{n-1}$, $\quad \mathbf{r}_n = \begin{bmatrix} \mathbf{r}_{n-1} \\ R_h[n] \end{bmatrix}$, $\quad \mathbf{a}_n = \begin{bmatrix} \mathbf{a}_{n-1} \\ 0 \end{bmatrix} + \begin{bmatrix} \mathbf{b}_{n-1} \\ 1 \end{bmatrix}\gamma_{n-1}$,
   $\mathbf{b}_n = \mathbf{J}\mathbf{a}_n$ and $P_n = R_h[0] + C + \mathbf{r}_n^T\mathbf{J}\mathbf{b}_n$.

3. Step 3: Increment $n$ to $n+1$ and go to Step 2.
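A NumPy sketch of this recursion is given below; it mirrors Steps 1–3 and eqns. (5.10)–(5.12). Assembling the $d_n$ values from the candidate positions and weights is assumed to be done as in Section 5.3; the helper names are illustrative, not part of the original formulation.

```python
import numpy as np
from scipy.linalg import toeplitz

def gsc_recursive(R_h, C, d):
    """Recursive GSC solution (Steps 1-3, eqns. (5.10)-(5.12)) for one articulator channel.

    R_h : filter autocorrelation values R_h[0], ..., R_h[N-1] (zeros beyond the filter length)
    C   : trade-off parameter (> 0)
    d   : d_n = C * sum_l eta_n^l p_n^l for frames n = 1..N
    """
    N = len(d)
    x = np.array([d[0] / (R_h[0] + C)])          # Step 1: x_1
    if N == 1:
        return x
    r = np.array([R_h[1]])                       # r_1
    b = -r / (R_h[0] + C)                        # b_1 = -A_1^{-1} J r_1
    a = b.copy()                                 # a_1 = J b_1 (J is the identity for size 1)
    P = R_h[0] + C + r @ a                       # P_1 = R_h[0] + C + r_1^T J b_1
    for n in range(2, N + 1):                    # Step 2, repeated via Step 3
        alpha = (-(x @ r[::-1]) + d[n - 1]) / P
        x = np.concatenate([x, [0.0]]) + np.concatenate([b, [1.0]]) * alpha   # eqn. (5.10)
        if n == N:
            break
        gamma = -(a @ r[::-1] + R_h[n]) / P
        r = np.append(r, R_h[n])
        a = np.concatenate([a, [0.0]]) + np.concatenate([b, [1.0]]) * gamma   # eqn. (5.12)
        b = a[::-1]                              # b_n = J a_n
        P = R_h[0] + C + r @ a                   # r_n^T J b_n = r_n^T a_n
    return x

# Sanity check against the direct solution of (R + C I) x = d
h = np.array([1.0, -1.0])
N = 6
R_h = np.zeros(N)
R_h[:len(h)] = np.correlate(h, h, mode="full")[len(h) - 1:]
d = np.random.randn(N)
x_rec = gsc_recursive(R_h, 0.1, d)
x_dir = np.linalg.solve(toeplitz(R_h) + 0.1 * np.eye(N), d)
print(np.allclose(x_rec, x_dir))                 # expected: True
```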
5.5 Selection of acoustic features for the inversion problem
Appropriate acoustic feature selection is crucial for the inversion problem because, in every analysis frame, the acoustic feature is used to determine the possible articulatory positions from the training set. In turn, from these possible positions the smoothness criterion estimates the best position so that the articulator trajectory is as smooth as possible for a given $h$. The possible articulatory positions at every test frame are chosen such that the corresponding acoustic vectors in the training set are in the neighborhood of the acoustic vector of the test frame (as discussed in Section 5.3). The greater the correlation or dependency between the acoustic feature and the corresponding articulatory position, the more accurate are the possible articulatory positions. Therefore, quantifying the dependency between the acoustic feature and the articulatory position is essential in order to compare different acoustic features and select the best one for the inversion problem.
We compute the statistical dependency between the acoustic feature and the articulatory position using mutual information (MI). Let Z denote the acoustic feature vector and X the vector whose elements are the position values of all articulators at every frame. Since there are 7 articulators, each with x and y coordinates, in our experimental data, X has 14 dimensions; the dimension of Z depends on the chosen acoustic feature. For acoustic features, we consider mel frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), the cepstral representation of LPC (LPCC), and variants of LPC, i.e., line spectral frequencies (LSF), reflection coefficients (RC), and log area ratios (LAR). Each of these features was computed every 10 msec to match the rate of the articulatory position data. The speech signal is pre-emphasized and windowed using a 20 msec Hamming window before computing the frame-based features. For MFCC, Z is a 13 dimensional vector. LPCs were computed using an order of 12; thus, Z for LPC and LPCC is 13 dimensional, while for LSF, RC, and LAR, Z is 12 dimensional. In this work, we only consider static features; no dynamic features have been used.
Since the probability density functions of Z and X are not directly known, we estimate the MI by quantizing the spaces of Z and X from the training dataset with a finite number of quantization bins, then estimating the joint distribution of Z and X in the resulting finite alphabet space using the standard maximum likelihood criterion (frequency counts) [17], and finally applying the discrete version of the MI [8]. More precisely, let us denote the pairs of acoustic feature and articulatory position vectors in the training set by $\{(\mathbf{z}_i, \mathbf{x}_i);\ i = 1,\cdots,T\}$, where $\mathbf{z}_i$ and $\mathbf{x}_i$ take values in $\mathbb{R}^{K_1}$ and $\mathbb{R}^{K_2}$. The quantizations of these spaces are denoted by $Q(Z) : \mathbb{R}^{K_1} \rightarrow A_z$ and $Q(X) : \mathbb{R}^{K_2} \rightarrow A_x$, where $|A_z| < \infty$ and $|A_x| < \infty$. Then the MI is given by:

$$I(Q(Z), Q(X)) = \sum_{q_z \in A_z,\, q_x \in A_x} P(Q(Z) = q_z, Q(X) = q_x) \times \log \frac{P(Q(Z) = q_z, Q(X) = q_x)}{P(Q(Z) = q_z)\, P(Q(X) = q_x)} \qquad (5.13)$$

It is well known that $I(Q(Z), Q(X)) \leq I(Z, X)$, because quantization reduces the level of dependency between random variables. On the other hand, increasing the resolution of $Q(\cdot)$ implies that $I(Q(Z), Q(X))$ converges to $I(Z, X)$ as the number of bins tends to infinity [10]. However, this result assumes that we know the joint distribution, which implies having an infinite amount of training data and a consistent learning approach. Consequently, for the finite training data scenario there is a trade-off between how precisely we want to estimate $I(Q(Z), Q(X))$ and how close we want to be to the analytical upper bound $I(Z, X)$. We decided to use a resolution of $Q(\cdot)$ that guarantees a good estimate of the joint distribution, and consequently a precise lower bound estimate of $I(Z, X)$. K-means vector quantization was used to characterize the quantization mapping [8, 17].
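The sketch below shows this MI estimation procedure in Python, using scikit-learn's K-means for the quantization; the choice of library, the codebook size in the toy example, and the use of the natural logarithm (MI in nats) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantized_mi(Z, X, n_codes=512, seed=0):
    """Estimate I(Q(Z); Q(X)) of Eq. (5.13) via vector quantization and frequency counts."""
    qz = KMeans(n_clusters=n_codes, n_init=4, random_state=seed).fit_predict(Z)
    qx = KMeans(n_clusters=n_codes, n_init=4, random_state=seed).fit_predict(X)
    joint = np.zeros((n_codes, n_codes))
    for a, b in zip(qz, qx):                   # maximum-likelihood (count) estimate
        joint[a, b] += 1.0
    joint /= joint.sum()
    pz, px = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pz[:, None] * px[None, :])[nz])))

# Toy usage: correlated Gaussian "features" and "positions" (illustrative only)
rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 13))
X = Z @ rng.normal(size=(13, 14)) + 0.5 * rng.normal(size=(2000, 14))
print(quantized_mi(Z, X, n_codes=64))
```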
For each acoustic feature vector and the articulatory position vector, K-means vector quantization with 512 prototypes was used, i.e., $|A_z| = |A_x| = 512$. Table 5.3 shows the mutual information between the various acoustic features and the articulatory positions for both the male and the female speaker. It can be observed that the mutual information between

I(Q(Z),Q(X))
Z            MALE        FEMALE
MFCC 1.8179 1.8594
LPC 1.3394 1.3931
LPCC 1.3339 1.4936
LSF 1.7025 1.6080
RC 1.6148 1.5309
LAR 1.6921 1.5834
Table 5.3: Mutual information between various acoustic features and the articulatory position.
MFCC and the articulatory position is the maximum among all the acoustic features, for the data of both speakers. LSF has the second highest MI with the articulatory position, and the lowest MI occurs for LPC. It should be noted that changing the number of prototypes in K-means does not alter the relative ranking of the MI values for the different acoustic features. For example, we computed the MI using $|A_z| = |A_x| = 64, 128, 256, 1024$, and we found MFCC to have the maximum MI with the articulatory position in all cases. This is consistent for both speakers. It is interesting to note that Qin et al. [54] also achieved the maximum correlation between the original and estimated articulator trajectories by using MFCC features. Based on this observation, we use MFCC as the acoustic feature for all of the following experiments.
5.6 Experiments and results
The acoustic-to-articulatory inversion experiments are performed separately for the male and female speaker data. The accuracy of the inverse mapping is evaluated on the test set of each speaker in terms of both the root mean squared (RMS) error and the correlation between the actual articulatory position in the test set, $x_r[n]$, and the position estimated by the inverse mapping, $x^\star[n]$. The RMS error $E$ reflects the average closeness between $x_r[n]$ and $x^\star[n]$. The correlation $\rho$ indicates how similar the actual and estimated articulator trajectories are. A minimum $E$ does not always mean that the trajectories are similar, since the estimated one can be very jagged even though it may be close to the actual one. Jagged trajectories are physically less likely during speech production, since the articulators cannot move in such a way in real life. Such jagged trajectories can be identified by poor $\rho$ values. We use the Pearson correlation $\rho$ between the actual and estimated trajectory for each utterance, where
$$\rho = \frac{N\sum_n x_r[n]\, x^\star[n] - \sum_n x_r[n] \sum_n x^\star[n]}{\sqrt{N\sum_n (x_r[n])^2 - \left(\sum_n x_r[n]\right)^2}\ \sqrt{N\sum_n (x^\star[n])^2 - \left(\sum_n x^\star[n]\right)^2}}. \qquad (5.14)$$
The development set is used to tune the cut-off frequency $\gamma_c$ of the filter $h$ and the trade-off parameter $C$. For our experiment we used $L = 200$; increasing $L$ further did not improve the result.

We considered an IIR high pass filter with cut-off frequency $\gamma_c$ and a stop-band ripple 40 dB below the pass-band ripple. A rational transfer function with order 5 for both the numerator and denominator polynomials is constructed for the desired specification; the MATLAB function cheby2 is used for this purpose. We chose an IIR filter so that the roll-off of the high-pass filter is steep and, hence, the filter is close to an articulator-specific ideal high-pass filter. We chose $\gamma_c$ and $C$ from sets of values so as to yield the best performance on the development set. From Section 5.2, we observe that most of the energy of the spectrum of the articulator trajectories lies below 9–10 Hz; hence, we consider the set of values for $\gamma_c$ to be $\{\gamma_c\} = \left\{1.5 + \frac{(k-1)}{19}\,(7.5);\ k = 1,\cdots,20\right\}$, i.e., 20 equally spaced points between 1.5 Hz and 9 Hz. Similarly, the set of values for $C$ was chosen to be $\{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100\}$; these values were chosen to span a wide range of orders of magnitude. For every $\gamma_c$ and $C$ combination, eqn. (5.6) was solved recursively using eqns. (5.10) and (5.12) for each utterance of the development set. As the metric of performance of the inverse mapping, we measure $E$ between the actual values of the articulatory positions and the estimated positions.
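Since the text specifies only the MATLAB cheby2 call, the SciPy equivalent below is an assumed but plausible way to obtain the same kind of filter and its autocorrelation for use in the GSC; the impulse-response truncation length is an illustrative choice.

```python
import numpy as np
from scipy.signal import cheby2, lfilter

def highpass_autocorr(gamma_c, fs=100.0, order=5, atten_db=40, n_lags=200, n_imp=512):
    """Chebyshev type-II high-pass (order 5, 40 dB stop-band) and its autocorrelation R_h.

    The IIR impulse response is truncated to n_imp samples, so R_h is an approximation.
    """
    b, a = cheby2(order, atten_db, gamma_c / (fs / 2.0), btype="highpass")
    imp = np.zeros(n_imp)
    imp[0] = 1.0
    h = lfilter(b, a, imp)                      # truncated impulse response of the IIR filter
    r_full = np.correlate(h, h, mode="full")
    mid = n_imp - 1
    return r_full[mid:mid + n_lags]

R_h = highpass_autocorr(gamma_c=4.0)            # e.g., a 4 Hz cut-off at the 100 Hz frame rate
print(R_h[:3])
```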
The $\gamma_c$ and $C$ for which the minimum value of the average $E$ (averaged over all utterances of the dev set) was obtained are shown in Table 5.4 for each articulator and for both speakers. We can see that the velum requires a slightly higher $\gamma_c$ than the other articulators to achieve the lowest $E$. The values of the best $C$ for the different articulatory positions do not differ much in their order of magnitude.

To estimate the position of a particular articulator from the acoustics in the test set, we use the corresponding $\gamma_c$ and $C$ optimized on the dev set. For each utterance in the test set, the articulatory positions are estimated from the acoustic signal by solving eqn. (5.6) recursively, as outlined in Section 5.4. As a baseline, we estimated the articulatory positions using the fixed filter $h = [1\ {-1}]$ with $\gamma_c = 25$ Hz $= \frac{F_s}{4}$; $C$ is optimized on the dev set. The purpose of choosing such a baseline is to investigate the change in performance when articulator-specific $\gamma_c$ are used compared to a fixed $\gamma_c$.
Best choices of γ_c (in Hz) and C
                  Female Speaker          Male Speaker
Articulator       γ_c        C            γ_c        C
ul x 3.07 0.50 3.47 0.05
ll x 3.47 0.10 4.65 0.10
li x 3.47 0.10 4.65 0.05
tt x 3.47 0.10 4.26 0.05
tb x 3.47 0.10 3.86 0.10
td x 3.86 0.10 3.07 0.50
v x 5.05 0.50 7.02 0.10
ul y 5.05 0.10 4.26 0.50
ll y 4.65 0.10 4.26 0.50
li y 3.86 0.50 4.26 0.50
tt y 5.05 0.10 4.26 0.50
tb y 3.07 0.50 3.86 0.50
td y 3.07 0.50 3.47 0.50
v y 6.23 0.50 6.23 0.10
Table 5.4: Best choices of γ_c and C for all articulatory positions, optimized on the dev set.
We also implemented a dynamic programming (DP) based inversion mapping with a cost function similar to that outlined in the work by Richards et al. [57]. The cost function which is minimized is as follows:

$$D = \sum_{n=1}^{N} \left( K\|\mathbf{u}_n - \mathbf{z}_n\|^2 + \|\mathbf{x}_n - \mathbf{x}_{n-1}\|^2 \right) \qquad (5.15)$$
At each frame $n$, the possible articulatory positions were $\eta^l_n$, $1 \leq l \leq L$, through which the best path was found. The $\mathbf{z}_n$ are chosen from the acoustic feature vectors in the training set corresponding to $\eta^l_n$, $1 \leq l \leq L$. $K$ was optimized on the dev set to achieve the least average $E$. The solution of the DP based inversion is low-pass filtered following the work by Toda et al. [73]; the cut-off frequencies of the low-pass filters used for this post-processing are the ones given by Toda et al. [73].
The cost $D$ in dynamic programming (eqn. (5.15)) is different from the cost function in our proposed approach (eqn. (5.3)); thus, they are not directly comparable in terms of their cost functions. The motivation for including DP followed by low-pass filtering in our experiment is to analyze the quality of the articulatory positions estimated using the proposed generalized smoothness approach with respect to the positions obtained by the well-established DP approach with smoothing as a post-processing step.

Fourteen trajectories corresponding to the 14 different articulatory positions are randomly picked from the test set, and their estimates using both the proposed approach and the DP approach are shown in Fig. 5.1, overlaid on the actual positions. It can be seen that the estimated trajectories are smooth and, on average, they follow the actual trajectories. The closeness of the estimated trajectory to the actual one depends on the corresponding $\{\eta^l_n; 1 \leq l \leq L\}$ and $\{p^l_n; 1 \leq l \leq L\}$. The trajectories estimated using the DP approach are also very close to the actual ones. For the examples chosen in Fig. 5.1, the trajectories estimated by the proposed approach and by DP appear similar. For clarity, we have not shown the trajectories estimated by our proposed approach with a fixed $\gamma_c$. We evaluate the performance of the different approaches through an error analysis over the entire test set.
For a comprehensive error analysis, we computed $E$ and $\rho$ for all utterances in the test set. The mean $E$ and $\rho$ (with their SDs) between the actual trajectories and the trajectories estimated by the inverse mapping using the generalized smoothness criterion (for both fixed $\gamma_c$ and articulator-specific $\gamma_c$) and by the DP (followed by low-pass filtering) approach are tabulated in Tables 5.5 and 5.6 for the female and male speaker, respectively. The tables also show the range of the position values for each articulator so that the quality of the inverse mapping can be judged relative to the mean $E$.
From Tables 5.5 and 5.6, it can be observed that the averaged $E$ values obtained by the generalized smoothness criterion are of the order of 10% of the range of the corresponding
Figure 5.1: Illustrative example of inverse mapping: randomly chosen examples of the
test articulator trajectory (dash-dotted) and the corresponding estimated trajectory for
14 articulatory positions.
Mean (SD) of E (in mm) and ρ
                          GSC, artic.-specific γ_c     GSC, fixed γ_c      DP, low-pass filtered
Articulator   Range (mm)      E          ρ               E         ρ           E          ρ
ul x 6.6 0.82(0.2) 0.58(0.1) 0.83(0.2) 0.52(0.1) 0.85(0.2) 0.51(0.2)
ll x 10.7 1.27(0.3) 0.53(0.1) 1.30(0.3) 0.46(0.1) 1.38(0.3) 0.35(0.2)
li x 7.2 0.75(0.1) 0.57(0.1) 0.77(0.1) 0.52(0.1) 0.84(0.2) 0.39(0.2)
tt x 23.4 2.39(0.4) 0.76(0.1) 2.54(0.4) 0.70(0.1) 2.60(0.5) 0.69(0.1)
tb x 26.1 2.24(0.4) 0.76(0.1) 2.35(0.4) 0.72(0.0) 2.41(0.4) 0.72(0.1)
td x 24.4 1.95(0.3) 0.74(0.1) 2.04(0.3) 0.71(0.1) 2.15(0.4) 0.69(0.1)
v x 5.3 0.33(0.1) 0.73(0.1) 0.34(0.0) 0.70(0.1) 0.34(0.1) 0.70(0.1)
ul y 8.9 1.23(0.2) 0.58(0.1) 1.26(0.2) 0.54(0.1) 1.31(0.3) 0.49(0.2)
ll y 29.2 2.78(0.6) 0.79(0.1) 2.87(0.6) 0.75(0.1) 3.27(0.6) 0.67(0.1)
li y 13.2 1.23(0.2) 0.80(0.1) 1.27(0.2) 0.78(0.1) 1.41(0.3) 0.76(0.1)
tt y 23.3 2.46(0.4) 0.78(0.1) 2.60(0.4) 0.75(0.1) 2.70(0.4) 0.75(0.1)
tb y 20.8 2.38(0.4) 0.78(0.1) 2.49(0.4) 0.74(0.1) 2.64(0.5) 0.72(0.1)
td y 18.6 2.38(0.4) 0.69(0.1) 2.45(0.4) 0.64(0.1) 2.64(0.5) 0.57(0.1)
v y 4.6 0.36(0.1) 0.78(0.1) 0.37(0.1) 0.75(0.1) 0.40(0.1) 0.74(0.1)
Table 5.5: Accuracy of inversion in terms of RMS error E and correlation ρ (female speaker).
articulator. Consistently higher values of $\rho$ in the case of articulator-specific $\gamma_c$ compared to fixed $\gamma_c$ indicate that the estimated articulatory trajectories are more similar to the actual ones when they are smoothed in an articulator-specific fashion. Similarly, the lower values of $E$ demonstrate that, on average, the generalized smoothness criterion indeed improves the inverse mapping accuracy compared to a fixed smoothing. The mean $E$ and mean $\rho$ obtained by the DP (followed by low-pass filtering) approach are of a similar order for most of the articulators. Note that the solution of DP is optimal according to the DP cost function, but once the solution is low-pass filtered it is no longer necessarily optimal; furthermore, it is, in general, difficult to establish what cost function the low-pass filtered trajectory might be optimal for, if any. In contrast, our proposed optimization yields an optimal solution with respect to the objective function (eqn. (5.3)) for any arbitrary filter. The use of higher order articulator-specific
Mean (SD) of E (in mm) and ρ
                          GSC, artic.-specific γ_c     GSC, fixed γ_c      DP, low-pass filtered
Articulator   Range (mm)      E          ρ               E         ρ           E          ρ
ul x 7.4 0.76(0.1) 0.45(0.1) 0.78(0.1) 0.37(0.1) 0.80(0.1) 0.29(0.3)
ll x 10.7 1.15(0.2) 0.70(0.1) 1.19(0.2) 0.64(0.1) 1.34(0.2) 0.51(0.2)
li x 8.4 0.59(0.1) 0.63(0.1) 0.60(0.1) 0.58(0.1) 0.64(0.1) 0.53(0.2)
tt x 25.5 2.41(0.6) 0.73(0.1) 2.54(0.6) 0.66(0.1) 2.79(0.6) 0.58(0.2)
tb x 26.1 2.39(0.5) 0.69(0.1) 2.48(0.5) 0.62(0.1) 2.68(0.6) 0.51(0.2)
td x 24.4 2.20(0.5) 0.67(0.1) 2.26(0.5) 0.62(0.1) 2.42(0.5) 0.57(0.2)
v x 5.6 0.79(0.2) 0.60(0.1) 0.81(0.2) 0.55(0.1) 0.87(0.2) 0.40(0.3)
ul y 9.7 1.20(0.2) 0.65(0.1) 1.23(0.2) 0.59(0.1) 1.34(0.2) 0.49(0.2)
ll y 29.2 1.92(0.3) 0.81(0.1) 2.02(0.3) 0.76(0.1) 2.36(0.4) 0.67(0.1)
li y 15.0 1.02(0.2) 0.73(0.1) 1.05(0.2) 0.70(0.1) 1.13(0.2) 0.68(0.1)
tt y 25.2 3.08(0.7) 0.77(0.1) 3.23(0.6) 0.72(0.1) 3.50(0.6) 0.69(0.1)
tb y 22.2 2.32(0.4) 0.78(0.1) 2.43(0.4) 0.73(0.1) 2.63(0.4) 0.71(0.1)
td y 19.3 2.38(0.5) 0.71(0.1) 2.45(0.5) 0.66(0.1) 2.72(0.5) 0.59(0.2)
v y 4.7 0.80(0.1) 0.56(0.1) 0.81(0.1) 0.51(0.1) 0.85(0.2) 0.46(0.2)
Table 5.6: Accuracy of inversion in terms of RMS error E and correlation ρ (male speaker).
smoothing filters in the DP cost function (eqn. (5.15)) can further improve the accuracy of the estimated articulatory positions, but the complexity order increases exponentially with the length of the filter. DP has a complexity order of $L^K N$, where $L$ is the number of possible articulatory positions in each frame, $K$ is the length of the impulse response of the filter, and $N$ is the number of frames. Even for our experiment, where we choose $L = 200$, the choice of an FIR filter $h$ of length 5 makes the complexity order $3.2 \times 10^{11} N$. Hence, we have not reported any results of applying DP with a higher order smoothness filter in eqn. (5.15). In contrast, the order of complexity of the proposed optimization scheme does not change with the filter type.
5.7 Conclusions
The generalized smoothness criterion proposed in this work can be useful to estimate any smooth trajectory beyond the articulator trajectory. As long as the mapping between the two spaces under consideration can be locally linearly approximated, the smoothness criterion will find the best possible smooth trajectory, using the knowledge about the possible solutions {η_n^l; 1≤l≤L}. The flexibility in choosing the filter h for the smoothness criterion is advantageous since it provides a good way to analyze various degrees of smoothness requirement for the trajectory to be estimated. Note that, in the DP approach to articulatory inversion, an acoustic proximity term ||u_n − z_n||^2 is directly considered in the optimization; this is indirectly performed in our proposed optimization by choosing the candidate articulators based on acoustic proximity. The recursive version of the solution of the articulator trajectory estimate is a key feature of the formulation presented in this work. Recursive algorithms are very useful for online processing and suitable for speech applications that need an estimate of articulators on-the-fly.
We observed that the correlation between the original trajectory and the trajectory estimated using the generalized smoothness criterion is better than that obtained with a fixed smoothing filter, indicating the effectiveness of using the articulator-specific smoothing filter. It should be noted that for each frame of the test utterance, the DP (without any post-processing) approach selects the best possible articulator position from what was seen in the training set, while the proposed technique does not. Rather, it provides a real-valued solution that best fits the smoothness criterion and data consistency. In this work, we analyzed the smoothness of articulators in a speaker-specific manner; a study of smoothness over a large set of speakers can be performed to obtain a generic smoothness parameter for each articulator. We estimate each articulator in an independent fashion and do not use their correlation explicitly, although the candidate positions of different articulators from the training data have correlations between themselves. The correlation between different articulators can be utilized to appropriately extend the proposed optimization for estimating more realistic articulator trajectories.
Chapter 6:
Analysis of inter-articulator
correlation in acoustic to
articulatory inversion using GSC
As we have seen in the previous chapter, GSC estimates each articulator trajectory
in an independent fashion using the acoustic feature sequence from a test utterance.
However, it is well known that many of the measured articulators’ movements can be
correlated with one another [67]. For example, the upper and lower lips' movements are correlated since they move together to create lip opening and closure. Similarly, since TT, TB, and TD are three locations on the same physical tongue organ, their movements are expected to be correlated too. It is, however, not clear whether the GSC formulation preserves the inter-articulator correlation since it estimates each articulator independently.
In this work, we perform a theoretical analysis of the correlations among articulators estimated using GSC and derive their relations to the correlations among measured articulatory trajectories. Based on the analysis of inter-articulator correlation in GSC, we show that, theoretically, there is no guarantee that GSC preserves the inter-articulator correlation as observed in the training data. However, when the theoretical analysis is examined with respect to real articulatory data, we find that the differences between the correlations among estimated articulators and those among measured articulators are not significant. Thus, it turns out that, in practice, the correlations among articulators estimated by GSC are approximately similar to those among measured articulators. To further validate this inter-articulator correlation property of GSC, we develop a way within the GSC framework by which inter-articulator correlation can be explicitly imposed in the acoustic-to-articulatory inversion such that the empirical correlation coefficient between any two estimated articulatory trajectories is identical to that between the respective (measured) articulatory variables in the training data. The accuracy of the inversion obtained by exploiting the inter-articulator correlation is compared experimentally against the accuracy obtained by treating each articulator independently during inversion using GSC [24]. Based on the comparison, we observe that there is no significant benefit in explicitly imposing correlation in the inversion using GSC, which further justifies the validity of the theoretical analysis.
6.1 Dataset and pre-processing
For the analysis and experiments of this work, we use the Multichannel Articulatory
(MOCHA) database [80] containing acoustic and corresponding articulatory Electro-
Magnetic Articulography (EMA) data from one male and one female subject. The
articulatory position data have high frequency noise resulting from EMA measurement
error. Also the mean position of the articulators changes from utterance to utterance;
hence, the position data needs pre-processing before it can be used for analysis. Fol-
lowing the pre-processing steps as outlined by Ghosh et al. [24], we obtain parallel
acoustic and articulatory data at a frame rate of 100 observations per second. Of the
460 utterances available from each subject, data from 368 utterances (80%) are used
for training, 37 utterances (8%) as the development set (dev set), and the remaining 55
utterances (12%) as the test set.
6.2 Generalized smoothness criterion (GSC) for articulatory inversion
In this section, we briefly describe the principle of GSC [24] for inversion. Let {(z_i, x_i); 1≤i≤T} represent the parallel acoustic feature vector and articulatory position vector pairs in the training set, where z_i and x_i represent the i-th acoustic and articulatory vector respectively. x_n = [x_n^1 x_n^2 ··· x_n^14]^T, where the 14 elements correspond to the X and Y co-ordinates of the seven articulators considered in this work. [·]^T denotes the transpose operator. Now suppose, for the acoustic-to-articulatory inversion, a (test) speech utterance is given and the acoustic feature vectors computed for this utterance are denoted by {u_n; 1≤n≤N}. The GSC is used to estimate the j-th articulatory position trajectory {x_n^j; 1≤n≤N} by solving the following optimization problem [24]:

\[
\arg\min_{\{x_n^j\}} \left[ \sum_n \left( y_n^j \right)^2 + C_j \sum_n \sum_l \left( x_n^j - \eta_n^{l,j} \right)^2 p_n^l \right], \tag{6.1}
\]

where y_n^j = Σ_{k=1}^{N} x_k^j h^j[n−k] and h_n^j is an articulator-specific high-pass filter with cut-off frequency f_c^j. Thus, the first term on the right hand side of Eq. (6.1) is used to minimize the high frequency components in x_n^j so that the articulator position trajectory is smooth. {η_n^{l,j}; 1≤l≤L} are the L possible values of the j-th articulatory position at the n-th frame of the test speech utterance and p_n^l are their probabilities [24]. C_j is the trade-off parameter between the two terms in the objective function (Eq. (6.1)).
The solution of Eq. (6.1) for the j-th articulator can be written as

\[
\mathbf{x}^{j\star} = \left( R^j + C_j I \right)^{-1} \mathbf{d}^j, \tag{6.2}
\]

where x^{j⋆} = [x_1^{j⋆} ... x_N^{j⋆}]^T is the optimal articulatory trajectory, R^j = {R^j_{pq}} with R^j_{pq} = R^j(p−q) := Σ_n h^j[n−p] h^j[n−q], I is the identity matrix, and d^j = [C_j Σ_l η_1^{l,j} p_1^l, ···, C_j Σ_l η_N^{l,j} p_N^l]^T. Note that the solution (Eq. (6.2)) can be obtained recursively with frame index n without any loss in accuracy [24].
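For concreteness, the batch form of Eq. (6.2) can be sketched in a few lines of NumPy. This is a minimal illustration only, assuming an FIR smoothing filter h and precomputed candidate positions η_n^{l,j} and probabilities p_n^l (the recursive, frame-by-frame variant of [24] is not shown); the function name and array layout are hypothetical.

```python
import numpy as np

def gsc_batch_estimate(h, eta, p, C):
    """Minimal batch sketch of Eq. (6.2): x* = (R + C I)^{-1} d.

    h   : (K,) impulse response of the articulator-specific high-pass filter (FIR assumed)
    eta : (N, L) candidate positions eta[n, l] for each frame n (drawn from training data)
    p   : (N, L) corresponding candidate probabilities (rows sum to 1)
    C   : scalar trade-off parameter C_j
    """
    N = eta.shape[0]
    K = len(h)
    auto = np.correlate(h, h, mode="full")        # autocorrelation of h; zero lag at index K - 1
    R = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            lag = i - j
            if abs(lag) < K:
                R[i, j] = auto[K - 1 + lag]       # R(p - q) = sum_n h[n - p] h[n - q]
    d = C * np.sum(eta * p, axis=1)               # d_n = C * sum_l eta[n, l] * p[n, l]
    return np.linalg.solve(R + C * np.eye(N), d)  # optimal trajectory x*
```

Because R + C_j I depends only on the filter and C_j, it can in principle be factorized once per articulator and reused across utterances of the same length.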
6.3 Correlation among estimated articulator trajectories
We theoretically investigate how the correlations among estimated articulatory trajec-
tories in GSC differ from those among the measured articulatory trajectories.
From Eq. (6.2), we can write (for i ≠ j) the estimated articulatory trajectories in the following way:

\[
x_n^{j\star} = \sum_{m_1=1}^{N} a_{m_1} C_j \sum_{l=1}^{L} \eta_{m_1}^{l,j}\, p_{m_1}^{l},
\qquad
x_n^{i\star} = \sum_{m_2=1}^{N} b_{m_2} C_i \sum_{l=1}^{L} \eta_{m_2}^{l,i}\, p_{m_2}^{l},
\tag{6.3}
\]

where {a_{m_1}, 1 ≤ m_1 ≤ N} and {b_{m_2}, 1 ≤ m_2 ≤ N} are the n-th rows of (R^j + C_j I)^{-1} and (R^i + C_i I)^{-1} respectively. Let us assume that η_n^l, i.e., [η_n^{l,1}, ···, η_n^{l,14}], is an i.i.d. random vector ∀ l, n. Also let the j-th and i-th articulators, i.e., η_n^{l,j} and η_n^{l,i}, have correlation coefficient ρ_{ji}. In other words,

\[
E\!\left[\eta_n^{l,j}\right] = \mu_j, \qquad
V\!\left[\eta_n^{l,j}\right] = E\!\left[\left(\eta_n^{l,j}-\mu_j\right)^2\right] = \sigma_j^2,
\]
\[
\mathrm{COV}\!\left[\eta_{n_1}^{l_1,j},\, \eta_{n_2}^{l_2,i}\right]
= E\!\left[\left(\eta_{n_1}^{l_1,j}-\mu_j\right)\left(\eta_{n_2}^{l_2,i}-\mu_i\right)\right]
= \rho_{ji}\,\sigma_i\sigma_j\,\delta_{n_1,n_2}\,\delta_{l_1,l_2},
\tag{6.4}
\]

where E[·], V[·], and COV[·,·] denote the mean, variance, and covariance of the random variables, and δ_{m,n} is the Kronecker delta, i.e., δ_{m,n} = 1 when m = n and δ_{m,n} = 0 when m ≠ n. Therefore, the means of x_n^{j⋆} and x_n^{i⋆} are as follows:

\[
E\!\left[x_n^{j\star}\right]
= \sum_{m_1=1}^{N} a_{m_1} C_j \sum_{l=1}^{L} E\!\left[\eta_{m_1}^{l,j}\right] p_{m_1}^{l}
= \mu_j C_j \sum_{m_1=1}^{N} a_{m_1} \sum_{l=1}^{L} p_{m_1}^{l}
= \mu_j C_j \sum_{m_1=1}^{N} a_{m_1}
\quad\left(\text{since } \textstyle\sum_{l=1}^{L} p_{m_1}^{l}=1\right)
\tag{6.5}
\]

\[
\text{Similarly,}\qquad
E\!\left[x_n^{i\star}\right] = \mu_i C_i \sum_{m_2=1}^{N} b_{m_2}.
\tag{6.6}
\]

Therefore,

\[
\begin{aligned}
V\!\left[x_n^{j\star}\right]
&= E\!\left[\left(x_n^{j\star}-E\!\left[x_n^{j\star}\right]\right)^2\right]\\
&= C_j^2 \sum_{m_{11}=1}^{N}\sum_{m_{12}=1}^{N}\sum_{l_1=1}^{L}\sum_{l_2=1}^{L}
a_{m_{11}} a_{m_{12}}\, p_{m_{11}}^{l_1} p_{m_{12}}^{l_2}\,
E\!\left[\left(\eta_{m_{11}}^{l_1,j}-\mu_j\right)\left(\eta_{m_{12}}^{l_2,j}-\mu_j\right)\right]\\
&= C_j^2 \sum_{m_{11}=1}^{N}\sum_{m_{12}=1}^{N}\sum_{l_1=1}^{L}\sum_{l_2=1}^{L}
a_{m_{11}} a_{m_{12}}\, p_{m_{11}}^{l_1} p_{m_{12}}^{l_2}\,
\sigma_j^2\,\delta_{m_{11},m_{12}}\,\delta_{l_1,l_2}
\qquad\text{(using Eq. (6.4))}\\
&= C_j^2\,\sigma_j^2 \sum_{m_1=1}^{N}\sum_{l=1}^{L}\left(a_{m_1}\, p_{m_1}^{l}\right)^2
\end{aligned}
\tag{6.7}
\]

\[
\text{and}\qquad
V\!\left[x_n^{i\star}\right]
= C_i^2\,\sigma_i^2 \sum_{m_2=1}^{N}\sum_{l=1}^{L}\left(b_{m_2}\, p_{m_2}^{l}\right)^2.
\tag{6.8}
\]

And the covariance between the j-th and i-th estimated articulators is

\[
\begin{aligned}
\mathrm{COV}\!\left[x_n^{j\star}, x_n^{i\star}\right]
&= E\!\left[\left(x_n^{j\star}-E\!\left[x_n^{j\star}\right]\right)\left(x_n^{i\star}-E\!\left[x_n^{i\star}\right]\right)\right]\\
&= C_i C_j \sum_{m_1=1}^{N}\sum_{m_2=1}^{N}\sum_{l_1=1}^{L}\sum_{l_2=1}^{L}
a_{m_1} b_{m_2}\, p_{m_1}^{l_1} p_{m_2}^{l_2}\,
E\!\left[\left(\eta_{m_1}^{l_1,j}-\mu_j\right)\left(\eta_{m_2}^{l_2,i}-\mu_i\right)\right]\\
&= C_i C_j \sum_{m_1=1}^{N}\sum_{m_2=1}^{N}\sum_{l_1=1}^{L}\sum_{l_2=1}^{L}
a_{m_1} b_{m_2}\, p_{m_1}^{l_1} p_{m_2}^{l_2}\,
\rho_{ji}\,\sigma_i\sigma_j\,\delta_{m_1,m_2}\,\delta_{l_1,l_2}\\
&= C_i C_j\,\rho_{ji}\,\sigma_i\sigma_j \sum_{m=1}^{N}\sum_{l=1}^{L} a_m b_m \left(p_m^{l}\right)^2.
\end{aligned}
\]

Hence, the correlation coefficient between the estimated j-th and i-th articulators, i.e., x_n^{j⋆} and x_n^{i⋆}, is

\[
\rho_{ji}^{\star}
= \frac{\mathrm{COV}\!\left[x_n^{j\star}, x_n^{i\star}\right]}
{\sqrt{V\!\left[x_n^{j\star}\right]}\,\sqrt{V\!\left[x_n^{i\star}\right]}}
= \frac{\sum_{m=1}^{N}\sum_{l=1}^{L} a_m b_m \left(p_m^{l}\right)^2}
{\sqrt{\sum_{m=1}^{N}\sum_{l=1}^{L}\left(a_m p_m^{l}\right)^2}\;
 \sqrt{\sum_{m=1}^{N}\sum_{l=1}^{L}\left(b_m p_m^{l}\right)^2}}\;\rho_{ji}.
\tag{6.9}
\]
Thus, the correlation ρ*_ji between the j-th and i-th articulator trajectories estimated using GSC is not necessarily identical to the inter-articulator correlation ρ_ji in the training set. It is important to note that when the cut-off frequencies of the j-th and i-th articulator-specific filters are nearly identical, their impulse responses (i.e., h_n^j and h_n^i) and hence the correlation matrices R^j and R^i are approximately the same. In addition, if the trade-off parameters of the respective articulators, i.e., C_j and C_i, are similar, then a_m ≈ b_m, 1 ≤ m ≤ N, because {a_{m_1}, 1 ≤ m_1 ≤ N} and {b_{m_2}, 1 ≤ m_2 ≤ N} are the n-th rows of (R^j + C_j I)^{-1} and (R^i + C_i I)^{-1} respectively. Under such circumstances (i.e., a_m ≈ b_m), it is easy to see that ρ*_ji ≈ ρ_ji; this means that when a_m ≈ b_m, the correlation is approximately preserved among the articulators estimated by GSC. However, as shown by Ghosh et al. [24], neither f_c^j nor C_j is identical across different articulators and, hence, we compute ρ*_ji/ρ_ji (from Eq. (6.9)) for each test utterance.
Fig. 6.1 demonstrates the average values of ρ*_ji/ρ_ji for each pair of articulatory variables (j-th and i-th, 1 ≤ j, i ≤ 14) separately for each subject in the MOCHA database. Note that the standard deviation (SD) of ρ*_ji/ρ_ji over all test utterances is minimal (the maximum SD among all pairs of articulatory variables is 5.54×10^{-5}).
It is clear from Fig. 6.1 that ρ*_ji/ρ_ji is very close to 1 for all pairs of articulators and hence ρ*_ji is approximately the same as ρ_ji for each subject in the MOCHA database.
[Figure: two heat-map panels, (a) Female Subject and (b) Male Subject, of ρ*_ji/ρ_ji over articulatory variable indices i, j = 1,...,14; the colour scale spans roughly 0.985–1.]
Figure 6.1: ρ*_ji/ρ_ji for all pairs of articulators for (a) female (b) male subjects in the MOCHA database. i and j vary over articulatory variable index 1,...,14.
Therefore, in practice GSC approximately preserves the inter-articulator correlation, although theoretically the correlation among estimated articulators is not identical to that among measured articulatory variables.
To further validate this conclusion from the theoretical analysis, we develop a modified version of the GSC framework in which inter-articulator correlation is explicitly imposed among estimated articulators during inversion. The goal is to examine whether there is any significant benefit in inversion from explicitly preserving the inter-articulator correlation and thereby examine the validity of the theoretical analysis above.
6.4 Modified GSC to preserve inter-articulator correlation
From Eq. (6.9) it is worth noting that if the variables in the training set (η_n^{l,j} and η_n^{l,i}) were uncorrelated (i.e., ρ_ji = 0), then the estimated trajectories (x_n^{j⋆} and x_n^{i⋆}) would also be uncorrelated (i.e., ρ*_ji = 0). This observation motivates us to transform the articulatory position variables, {x_n^j; j = 1,...,14}, into another set of variables, {x̃_n^j; j = 1,...,14}, where x̃_n^j and x̃_n^k are uncorrelated ∀ j, k. The GSC can be used in the transformed variable domain for inversion and, after inversion, the correlation between variables can be imposed by transforming them back to the original articulatory position variable domain. Whitening is one of the approaches in which a random vector (x_n, in our case) is linearly transformed to make its components uncorrelated [70]. We transform x_n to obtain x̃_n.
6.4.1 Transformation of the articulatory position vector
Let μ_x and K_xx, respectively, be the mean vector and the covariance matrix of the random vector x_n. Let the eigen decomposition of K_xx be K_xx = V Λ V^T, where V is the orthogonal eigenvector matrix (V^T V = I) and Λ is the diagonal eigenvalue matrix. The following linear transformation whitens x_n to the random vector x̃_n; the components of x̃_n are uncorrelated, as it is easy to show that x̃_n has a diagonal covariance matrix K_x̃x̃:

\[
\tilde{\mathbf{x}}_n = V^{T} \mathbf{x}_n, \tag{6.10}
\]

where μ_x̃ = E(x̃_n) = V^T μ_x and K_x̃x̃ = V^T K_xx V = Λ.

Note that the components of x̃_n no longer correspond to any physically meaningful articulatory parameters. However, x_n can be recovered from x̃_n by x_n = (V^T)^{-1} x̃_n = V x̃_n.
6.4.2 Frequency analysis of the transformed variables
In the GSC formulation [24], the cut-off frequencies of the articulator-specific high-pass filters h_n^j are determined based on an analysis of the frequency content of all articulatory position variables {x_n^j; j = 1,...,14}. Similarly, to estimate x̃_n^j using GSC, we need to analyze the frequency content of the transformed variables {x̃_n^j; j = 1,...,14} to determine the cut-off frequency of the high-pass filters in the transformed variable domain. We calculate the frequency f_c^j below which α% of the total energy of the transformed variable trajectory is contained. This is done for each utterance in the training set, and the mean f_c^j (along with the standard deviation (SD)) over all utterances is calculated for α = 90, 95. It is found that the range of f_c^j for most of the transformed variables x̃_n^j is similar to what was observed in the frequency analysis of the articulator position variables x_n^j [24]. This is expected since x̃_n^j is a linear combination of the articulator position variables x_n. This analysis helps us choose the cut-off frequencies while designing the filters h_n^j, ∀j, in GSC.
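As a concrete illustration of this frequency analysis, the sketch below computes, for one trajectory, the frequency below which a fraction α of the spectral energy lies. Removing the mean before the FFT and using the 100 Hz frame rate of Section 6.1 are assumptions made here for illustration.

```python
import numpy as np

def energy_cutoff_frequency(x, fs=100.0, alpha=0.90):
    """Sketch: frequency below which a fraction alpha of the trajectory's spectral energy lies.

    x  : 1-D articulatory (or transformed) trajectory sampled at fs Hz
    """
    X = np.fft.rfft(x - np.mean(x))               # one-sided spectrum (mean removed: an assumption)
    energy = np.abs(X) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    cum = np.cumsum(energy) / np.sum(energy)      # cumulative fraction of energy
    return freqs[np.searchsorted(cum, alpha)]     # first frequency reaching the alpha fraction
```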
6.4.3 Inversion using transformed articulatory features
Since we do not know the true covariance matrix K_xx of x_n, we estimate K_xx using the realizations of x_n in the training set as follows:

\[
K_{xx} = \frac{1}{T-1} \sum_{n=1}^{T} \left(\mathbf{x}_n - \bar{\mathbf{x}}\right)\left(\mathbf{x}_n - \bar{\mathbf{x}}\right)^{T}, \tag{6.11}
\]

where x̄ = (1/T) Σ_{n=1}^{T} x_n. Using the eigen-decomposition of the estimated K_xx, we obtain the eigenvector matrix V, which is used to transform the x_n vectors of the training, dev, and test sets to x̃_n (using Eq. (6.10)), where the variables are uncorrelated. Parallel acoustic vectors and transformed articulatory vectors x̃_n of the training set are used to estimate η_n^{l,j} and p_n^l in a way similar to that described by Ghosh et al. [24]. The GSC (Eq. (6.1)) is used to estimate the trajectories x̃_n^{j⋆}, ∀ j = 1,...,14, separately using the acoustic data of the test set. Finally, x_n^{j⋆}, ∀ j = 1,...,14, are obtained by transforming x̃_n^⋆ back using x_n^⋆ = [x_n^{1⋆} x_n^{2⋆} ··· x_n^{14⋆}]^T = V x̃_n^⋆ = V [x̃_n^{1⋆} x̃_n^{2⋆} ··· x̃_n^{14⋆}]^T. By this reverse transformation, we correlate the different variables so that the correlation among them is similar to the inter-articulator correlation observed in the training data. Note that the transformation matrix V is learned on the training set. Therefore, it is assumed that the correlations between different articulator positions in the test set are similar to those in the training set.
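A minimal sketch of this transform pipeline (Eqs. (6.10) and (6.11)) is given below, assuming the 14-dimensional articulatory frames are stacked row-wise in a NumPy array; the function names are illustrative only.

```python
import numpy as np

def learn_decorrelation(X_train):
    """Estimate K_xx on the training articulatory frames (Eq. (6.11)) and return
    the eigenvector matrix V of its eigen-decomposition K_xx = V diag(.) V^T."""
    Xc = X_train - X_train.mean(axis=0)
    K_xx = Xc.T @ Xc / (len(X_train) - 1)     # unbiased covariance estimate
    _, V = np.linalg.eigh(K_xx)               # columns of V are eigenvectors
    return V

def to_transformed(X, V):
    return X @ V                              # x~_n = V^T x_n, applied row-wise

def to_articulatory(X_tilde, V):
    return X_tilde @ V.T                      # x_n = V x~_n (reverse transformation)
```

Applying to_transformed to the training, dev, and test articulatory matrices and to_articulatory to the GSC estimates mirrors the procedure described in this section.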
6.4.4 Articulatory inversion results using modified GSC
The proposed approach of utilizing inter-articulator correlation for acoustic-to-articulatory inversion is evaluated separately on the male and female subjects' data in the MOCHA-TIMIT corpus [80]. The accuracy of the inversion is evaluated separately on the test set for both subjects in terms of both the root mean squared (RMS) error and the Pearson correlation co-efficient [9] between the actual articulatory position in the test set and the position estimated by GSC. The RMS error E reflects the average closeness between the actual and estimated articulator trajectories. The correlation ρ indicates how similar the actual and estimated articulator trajectories are.
The dev set is used to tune the cut-off frequency f_c^j of the filter h_n^j and the trade-off parameter C_j. For our experiment we considered L=200; increasing L further did not improve the result. For designing the articulator-specific high-pass filters h_n^j, we considered an IIR high-pass filter with cut-off frequency f_c^j and stop-band ripple 40 dB down compared to the pass-band ripple, similar to that used by Ghosh et al. [24]. From Section 6.4.2, we observed that most of the energy of the transformed variables is below 9-10 Hz and, hence, we choose values of f_c^j from the set {f_c^j} = {1.5 + ((k−1)/19)·7.5; k = 1,···,20}. Similarly, the set of values for C_j was chosen from the set {.001, .005, .01, .05, .1, .5, 1, 5, 10, 50, 100}. The f_c^j and C_j combination which yielded the minimum value of the averaged E (averaged over all utterances of the dev set) is selected separately for each subject.
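This dev-set tuning amounts to a small grid search; a sketch is given below, where dev_error is a hypothetical callable that runs the modified GSC on the dev set for one articulatory variable with the given (f_c, C) and returns the averaged E.

```python
from itertools import product

FC_GRID = [1.5 + (k - 1) / 19.0 * 7.5 for k in range(1, 21)]      # 1.5 ... 9.0 Hz, as above
C_GRID = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100]

def tune_parameters(dev_error):
    """Return the (f_c, C) pair minimizing the averaged dev-set RMS error E.

    dev_error : callable (fc, C) -> float, assumed to wrap the inversion + scoring."""
    return min(product(FC_GRID, C_GRID), key=lambda pair: dev_error(*pair))
```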
These best choices of f_c^j and C_j are used to perform inversion on the test set using the modified GSC. E and ρ are computed between the actual trajectories and the estimated trajectories for every utterance in the test set. The mean E and ρ (averaged over all utterances in the test set), along with the standard deviation (SD), are shown in Fig. 6.2 for both the female and male subjects. For comparison, Fig. 6.2 also shows E and ρ when GSC is directly used in the articulator position variable domain [24].
From Fig. 6.2, it is clear that in most cases the accuracy of the estimates is similar or higher (i.e., lower E or higher ρ) when inter-articulator correlation is utilized via the transformation of variables, but there are cases where GSC in the transformed domain either increases the mean E or decreases the mean ρ (e.g., ul x, ll x, ul y for the male subject and tt x for the female subject). However, considering the SD of E and ρ, the changes in accuracy in terms of E and ρ are insignificant. Thus, the gain in performance from explicitly using inter-articulator correlation through the transformation-of-variables approach is not significant. This supports the conclusion from the theoretical analysis (Section 6.3) that the correlations among articulators are approximately preserved in GSC and, hence, there is no further benefit in imposing inter-articulator correlation explicitly.
6.5 Conclusions
The analysis of inter-articulator correlation in this work reveals that acoustic-to-articulatory inversion using GSC approximately preserves inter-articulator correlations, although articulatory inversion in GSC is performed for each articulatory trajectory separately. Using both theoretical and experimental analyses, we observe that the smoothness constraints for different articulators are similar, and this could be the reason why GSC approximately preserves correlation in articulatory inversion in practice.
[Figure: grouped bar charts of RMS Error and Corr. Coef. for GSC vs. MGSC over the articulatory variables UL_X, LL_X, LI_X, TT_X, TB_X, TD_X, V_X, UL_Y, LL_Y, LI_Y, TT_Y, TB_Y, TD_Y, V_Y.]
Figure 6.2: (a)-(b): The average RMS Error (E) and Corr. Coef. (ρ) of inversion accuracy using GSC [24] and modified GSC (MGSC) on the test set for the female subject; (c)-(d): (a)-(b) repeated for the male subject. Error bars indicate SD.
Chapter 7:
Subject-independent
acoustic-to-articulatory inversion
using GSC
In this chapter, we describe an acoustic-to-articulatory inversion technique which requires acoustic-articulatory training data from only one subject and can be used to perform inversion on any other subject's acoustic signal. The proposed inversion technique works on the principle of representing an acoustic feature with respect to a generic acoustic space, obtained using speech data from a pool of talkers. Thus, when a test subject's acoustic data is given for inversion, it is matched with the training subject's acoustic data with respect to the generic acoustic space. This enables us to obtain the articulatory feature trajectory using the articulatory-to-acoustic mapping learned from the data of the exemplary training subject. It should be noted that the range and values of the estimated articulatory trajectory correspond to the training subject and not the test subject. The estimated articulatory trajectories can be interpreted as the articulatory movement expected when the training subject tries to mimic the utterance spoken by the test subject. It is hypothesized that, for a given utterance, the reference articulatory trajectory (corresponding to the test subject) and the estimated trajectory (corresponding to the training subject) will have similar shapes and, thus, the correlation between these trajectories can be used to measure the accuracy of inversion. We investigate the efficacy of the proposed inversion using experimentally obtained articulatory data. We find that the accuracy of the articulatory trajectory estimated using the proposed approach is close to the accuracy obtained by an existing inversion technique in which the parallel articulatory and acoustic data from the test subject is available to perform inversion.
7.1 Dataset and pre-processing
For the analysis and experiments of this work, we use the Multichannel Articulatory
(MOCHA) database [80] that contains acoustic and corresponding articulatory Electro-
Magnetic Articulography (EMA) data from one male and one female talker of British
English. The articulatory position data have high frequency noise resulting from EMA
measurement error. Also the mean position of the articulators changes from utterance
to utterance; hence, the position data needs pre-processing before it can be used for
analysis. Following the pre-processing steps outlined in [24], we obtain parallel acous-
tic and articulatory data at a frame rate of 100 observations per second. Of the 460
utterances available from each speaker, data from 368 utterances (80%) are used for
training, 37 utterances (8%) as the development set (dev set), and the remaining 55
utterances (12%) as the test set for the subject-dependent inversion procedure. For the
proposed subject-independent inversion, one subject’s data is used for training and the
other subject’s test utterances are used during testing.
7.2 Proposed subject-independent inversion
Based on the generalized smoothness criterion (GSC) for the acoustic-to-articulatory inversion introduced in [24], we develop a formulation to estimate the articulatory feature vector sequence for a given acoustic feature vector sequence corresponding to a test speech utterance. However, unlike GSC, in our proposed subject-independent inversion we assume access to a generic acoustic space generated by speech signal features (denoted by {c_j; 1≤j≤R}) obtained from various subjects (the TIMIT corpus [11] is chosen for this purpose), in addition to the parallel acoustic and articulatory features ({(z_i, x_i); 1≤i≤T}) from an exemplary training subject. Note that the acoustic data from the test subject for inversion need not be part of this generic acoustic space. We will show that this additional knowledge about the generic acoustic space plays a crucial role in estimating the articulatory features from the acoustics of any arbitrary test subject.
Let the test acoustic feature sequence be denoted by u_n, 1 ≤ n ≤ N (n denotes the frame index). Let the time sequence of an articulatory feature that we need to estimate from u_n, 1 ≤ n ≤ N, be denoted by x[n], 1 ≤ n ≤ N. Since the articulatory feature trajectory is, in general, smooth, the best estimate of an articulatory feature sequence is obtained using a smoothness criterion as follows [24]:

\[
\{x^{\star}[n];\ 1 \le n \le N\}
= \arg\min_{\{x[n]\}} J\!\left(x[1],\cdots,x[N]\right)
\triangleq \arg\min_{\{x[n]\}}
\left\{ \sum_n \left(y[n]\right)^2
+ C \sum_n \sum_l \left(x[n]-\eta_n^{l}\right)^2 p_n^{l} \right\},
\tag{7.1}
\]

where J denotes the cost function to be minimized and y[n] = Σ_{k=1}^{N} x[k] h[n−k], where h[n] is the articulatory feature-specific high-pass filter (FIR or IIR). {η_n^l; 1 ≤ l ≤ L} is the set of L possible values of the articulatory feature at the n-th frame. p_n^l denotes the probability that η_n^l is the value of the articulatory position at the n-th frame given that u_n is the test acoustic feature. η_n^l and p_n^l are obtained using {(z_i, x_i)}, {c_j}, and u_n. L can be, in general, equal to T. C is the trade-off parameter between the first term (smoothness constraint) and the second term (data proximity) in J.
A closed-form solution for x^⋆[n] can be computed once h[n], η_n^l, and p_n^l are determined [24]. Also, it was shown in [24] that the solution can be obtained recursively over time without any loss in performance. h[n] and its cut-off frequency are designed here following [24]. Finally, the articulator-specific cut-off frequency and the trade-off variable C are optimized for the training subject on a development set (30% of the entire parallel acoustic and articulatory data of the training subject) so that the mean squared error (MSE) between the actual articulator trajectories and the estimated ones is minimized.
The basic principle of determining η_n^l and p_n^l, 1 ≤ l ≤ L, in [24] was to choose articulatory features from the training data such that the corresponding acoustic features in the training data are close to the test acoustic feature in a Euclidean sense. Since the test and training acoustic features were from the same subject, such an acoustic proximity-based approach was appropriate for the subject-dependent acoustic-to-articulatory inversion. However, in the proposed subject-independent inversion framework, the test subject is different from the training subject and their acoustic spaces are, in general, different. Hence, a Euclidean distance measure d_E(z_i, u_n) may not be a reliable metric of acoustic proximity due to inter-subject acoustic variability. We therefore need to transform the acoustic feature vectors to another space where the closeness measure between two points, d(Φ(z_i), Φ(u_n)), is robust to such inter-subject variability, where Φ(·) is the transformation function from the acoustic feature space to the new space and d is an appropriate measure of closeness between two points in the new space.
Also, computing the distance between u_n and z_i, 1 ≤ i ≤ T, at each frame is computationally expensive because T (the number of parallel acoustic and articulatory features of the training subject) is of the order of 10^5. For example, when a 14-dimensional MFCC vector with the zero-th coefficient is considered as the acoustic feature and T = 5×10^5, the computation of the distances between u_n and z_i, 1 ≤ i ≤ T, at each frame (n) requires 14T multiplications and 27T additions (and takes ∼0.4 second in MATLAB software on a desktop computer). Therefore, a computationally efficient closeness measure is desirable. Below we propose a transformation function Φ and a new closeness measure d in the range space of Φ(·).
Let A be the acoustic space represented by c_j, 1 ≤ j ≤ R. We perform K-means clustering with K clusters; let A_k denote the k-th cluster. Note that ∪_{k=1}^{K} A_k = A. The density of the data points in each cluster is modeled using an M-mixture Gaussian mixture model (GMM), i.e.,

\[
p(v \mid \mathcal{A}_k) = p(v \mid v \in \mathcal{A}_k)
= \sum_{m=1}^{M} w_m\, \mathcal{N}\!\left(v;\ \mu_m^{k}, \Sigma_m^{k}\right),
\qquad k = 1,\ldots,K,
\]

where v is the acoustic feature vector, and μ_m^k and Σ_m^k are the mean vector and the covariance matrix of the m-th component of the GMM in the k-th cluster respectively. μ_m^k and Σ_m^k are estimated using the expectation maximization algorithm [12]. w_m is the weight of the m-th component. Given an acoustic feature vector v, Φ(v) := (1/Z)[p(v|A_1) ··· p(v|A_K)]^T, where Z = Σ_{k=1}^{K} p(v|A_k) is a normalization constant. Thus, Φ(v) is a K-dimensional vector representing the probabilities of v being generated from each of the K clusters in the acoustic space^1. We used K=32 for our experiment (we also tried K=50, 64, but no significant performance benefit was obtained).

^1 Any quantity other than p(v|A_k) can also be considered, e.g., the posterior probabilities p(A_k|v). However, in this work, we have performed all experiments using p(v|A_k). We do not get any significant improvement in recognition performance for the chosen corpus by using p(A_k|v) as against p(v|A_k).
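A minimal sketch of this cluster-likelihood representation using scikit-learn is given below. The number of mixtures per cluster (M), the diagonal covariances, and the function names are assumptions for illustration; only K = 32 is fixed by the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def build_acoustic_space(C, K=32, M=8, seed=0):
    """K-means partition of the pooled acoustic features C (R x D array of c_j,
    e.g., MFCCs from TIMIT) plus one GMM per cluster. M is an assumed value."""
    labels = KMeans(n_clusters=K, random_state=seed).fit_predict(C)
    return [GaussianMixture(n_components=M, covariance_type="diag",
                            random_state=seed).fit(C[labels == k])
            for k in range(K)]

def phi(v, gmms):
    """Phi(v): normalized vector of likelihoods p(v | A_k), k = 1..K."""
    loglik = np.array([g.score_samples(v.reshape(1, -1))[0] for g in gmms])
    lik = np.exp(loglik - loglik.max())     # common scaling cancels in the normalization
    return lik / lik.sum()                  # division by Z
```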
To determine η_n^l, one approach could be to measure the proximity between Φ(u_n) and Φ(z_i), 1 ≤ i ≤ T, using a distance metric, say, the Euclidean distance. Since this can be prohibitive due to the large T, we propose a closeness measure based on the highest-valued element of the Φ(v) vector. Note that Φ(z_i), 1 ≤ i ≤ T, can be computed a-priori. Let B_r = {z_i | r = argmax_k p(z_i|A_k)}, 1 ≤ r ≤ K. Also let π_i^r = max_k p(z_i|A_k) / Σ_k p(z_i|A_k), ∀ z_i ∈ B_r, 1 ≤ r ≤ K. Given u_n, Φ(u_n) can be computed; suppose that the r_1-th element of Φ(u_n) is the largest among all elements, i.e., r_1 = argmax_k p(u_n|A_k). Then we assume that u_n is acoustically closer to {z_i | z_i ∈ B_{r_1}} than to {z_i | z_i ∉ B_{r_1}}, i.e., when the probability of two acoustic features being generated by the same cluster is higher than that for the other clusters in the acoustic space, those two feature vectors are assumed to be acoustically close. This reduces the search space of the acoustic features. We compute δ_i = | p(u_n|A_{r_1}) / Σ_k p(u_n|A_k) − π_i^{r_1} |, ∀ z_i ∈ B_{r_1}. The δ_i, ∀ z_i ∈ B_{r_1}, are sorted in ascending order. Note that δ_i is the absolute difference of two scalar values and, hence, is computationally cheaper to evaluate than the norm of the difference of two vectors. Let z'_i, 1 ≤ i ≤ L, be the acoustic features corresponding to the top L sorted δ_i. The articulatory features corresponding to z'_i, 1 ≤ i ≤ L, are used as η_n^l. The p_n^l are computed using the Euclidean distance between Φ(u_n) and Φ(z'_i), 1 ≤ i ≤ L, i.e., p_n^i = Δ_i^{-1} / Σ_{i=1}^{L} Δ_i^{-1}, where Δ_i = ||Φ(u_n) − Φ(z'_i)||. Note that the Euclidean distance (Δ_i) is computed only for L acoustic features and, in practice, we choose L << T (L=200 in our experiment). Therefore, the computation of η_n^l and p_n^l and, hence, the acoustic-to-articulatory mapping can happen in real time. For example, when a 14-dimensional MFCC vector is considered as the acoustic feature and T = 5×10^5, the computation of η_n^l and p_n^l at each frame requires, on average, only (1/K)·T additions (and takes ∼0.01 second in MATLAB software on a computer).
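The candidate selection for a single test frame can be sketched as follows, assuming Φ(z_i) has been precomputed for all training frames and stored row-wise; the function and variable names are illustrative only.

```python
import numpy as np

def select_candidates(phi_u, train_phi, train_art, L=200):
    """Sketch of the candidate selection for one test frame.

    phi_u     : (K,) Phi(u_n), cluster-likelihood vector of the test frame
    train_phi : (T, K) precomputed Phi(z_i) for all training frames (rows sum to 1)
    train_art : (T,) one articulatory feature aligned with the training frames
    Returns eta (candidate positions) and p (their probabilities).
    """
    r1 = int(np.argmax(phi_u))                                    # dominant cluster of the test frame
    members = np.where(np.argmax(train_phi, axis=1) == r1)[0]     # indices of B_{r1}
    pi = train_phi[members, r1]                                   # pi_i^{r1}
    delta = np.abs(phi_u[r1] - pi)                                # scalar closeness measure delta_i
    top = members[np.argsort(delta)[:L]]                          # top-L candidates
    dist = np.linalg.norm(train_phi[top] - phi_u, axis=1)         # Delta_i in the Phi space
    w = 1.0 / np.maximum(dist, 1e-12)                             # inverse-distance weights
    return train_art[top], w / w.sum()                            # eta_n^l and p_n^l
```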
7.3 Experimental results
Note that there are only two subjects in the MOCHA corpus. We run the proposed
inversion procedure to estimate articulatory feature trajectories of each subject in the
corpus using the other subject’s data for training (i.e. {z
i
,x
i
}). For comparison, we
have considered two different types of inversion procedure - 1) the subject-dependent
inversion using GSC; this is identical to the one reported in [24], which is referred here
as inversion scheme-1 (IS-1), 2) the subject-independent inversion using GSC as in [24]
except that the training data is obtained from one subject and the other subject’s data
is used for testing and evaluation; we refer to this as inversion scheme-2 (IS-2). We
refer to the proposed inversion procedure as inversion scheme-3 (IS-3). For consistency
across three inversion schemes, we have used 12% of the available utterances (Chapter
3) for each subject as testset. Note that IS-1 is a subject-dependent inversion scheme
and hence, it is expected that the best inversion accuracy will be obtained with IS-1,
among the three inversion schemes considered.
We use Pearson’s correlation coefficients ρ as a measure of accuracy between the
estimated and reference articulatory trajectories. The estimated articulatory trajectory
values will correspond to the shape and size of the training subject’s articulators and,
hence, will not be similar to the test subject’s articulators, in general. Thus, we have
not used root mean square (RMS) as a measure since it will not reflect the accuracy
of inversion. Since the training and test subjects are not identical, it is important to
choose appropriate features and an evaluation metric to analyze the quality of inversion. Ranges of the raw EMA feature values for the same articulator may be different between subjects, but the shapes of the articulator trajectories are expected to be similar when two subjects utter the same utterance; this similarity will be more apparent for the TV features. For example, if the acoustic signal has a stop consonant /t/, the tongue tip will go up to form the constriction against the palate and will come down for the release. Thus, it is expected that both the reference and estimated trajectories of tt y will have a peak corresponding to /t/ (similarly, the TTCD trajectory will have a dip). However, the actual trajectory values may not be identical since the reference and estimated trajectory values correspond to the test subject and training subject respectively. Thus, the quality of inversion can be measured by the similarity or correlation between the reference and the estimated trajectories. Hence, the higher the ρ, the better is the inversion quality.
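As a sketch of this evaluation, the correlation-based accuracy over a set of test utterances could be computed as follows; this uses SciPy's pearsonr and is illustrative rather than the exact evaluation script.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_correlation(ref_trajs, est_trajs):
    """Pearson's rho between reference and estimated trajectories, averaged over
    test utterances; RMS error is deliberately not used since the estimates live
    in the exemplar's articulatory range rather than the test subject's."""
    rhos = [pearsonr(r, e)[0] for r, e in zip(ref_trajs, est_trajs)]
    return float(np.mean(rhos)), float(np.std(rhos))
```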
Below we report the accuracy of inversion using IS-1, IS-2, and IS-3 separately for
the cases when the EMA features and the TV features are used as the representations
of the articulatory space.
7.3.1 EMA features
Fig. 7.1(a) and (b) show averaged ρ using IS-1, IS-2, and IS-3 on the male and female
subject’stestset, respectively, whenEMAfeaturesareusedtorepresentthearticulatory
space. It is clear that IS-1 yields the highest ρ for all EMA features due to its subject-
[Figure: bar charts of the mean ρ for IS-1, IS-2, and IS-3 over the EMA features ul_x, ll_x, li_x, tt_x, tb_x, td_x, v_x, ul_y, ll_y, li_y, tt_y, tb_y, td_y, v_y; one panel each for the female and male subjects' testsets.]
Figure 7.1: Bar diagram of the mean ρ obtained using IS-1, IS-2, and IS-3 for various EMA features separately over all male and female subjects' test utterances. Error bars indicate standard deviations of ρ across respective test utterances.
dependent nature of inversion. ρ for IS-3 is greater than that for IS-2 for most of the EMA features, indicating the effectiveness of the proposed inversion procedure. Lower ρ for IS-2 is expected since we select η_n^l based on the closeness between the acoustic features of the two subjects, and the acoustic spaces of the two subjects are, in general, different. Since, for IS-3, η_n^l are chosen based on the closeness between the acoustic features in a transformed probability space, IS-3 is more robust to inter-subject acoustic variations compared to IS-2.
7.3.2 TV features
TV features capture various constriction events during speech production. Thus, the
TV features, to a certain extent, mitigate inter-subject vocal tract shape and articula-
tory configuration differences. Depending on the way we compute various TV features
(Chapter 3), some of them such as PRO, VEL may suffer from greater inter-subject
variability than LA, TTCD, and TBCD. This becomes clear from Fig. 7.2, which il-
lustrates a set of randomly selected TV feature trajectories from the Female subject’s
testset and their estimates using IS-1 and IS-3. For clarity, estimates using IS-2 are not
shown. It is evident from Fig. 7.2 that IS-1 yields more accurate trajectories since it is
subject-dependent inversion. It is also clear from the figure that the computed TTCD,
TBCD, and LA measures are invariant across subjects and hence the estimates of these
feature trajectories using IS-3 are close to the reference trajectories although estimated
and reference trajectories correspond to two different subjects. In contrast, the values
of the estimated trajectories of VEL, PRO, and JAW OPEN using IS-3 are different
from those of the reference ones. However, there are similarities between the shapes of
the trajectories. For a more comprehensive evaluation of the inversion quality for the
TV features, we report the averaged ρ obtained using IS-1, IS-2, and IS-3 for both the
maleandfemalesubjects’testsetinFig. 7.3. Fordifferentinversionschemes, weobserve
[Figure: six panels showing, against frame number, the original trajectory (female subject) and the trajectories estimated using IS-1 and IS-3 for LA, VEL, PRO, JAW_OPEN, TTCD, and TBCD, for test utterances such as "Once you finish greasing your chain, be sure to wash thoroughly" and "They assume no burglar will ever enter here".]
Figure 7.2: Illustrative examples of the estimates of the TV feature trajectories using IS-1 and IS-3. These trajectories are randomly selected from the female subject's testset.
a trend in the ρ values similar to that for the EMA features (Fig. 7.1) indicating the
effectiveness of the proposed inversion procedure in the case of TV features too.
7.4 Conclusions
It is important to note that the proposed subject-independent inversion scheme that uses a transformed acoustic representation performs better than IS-2 but worse than IS-1 (subject-dependent inversion). It should also be noted that the proposed scheme is more
[Figure: bar charts of the mean ρ for IS-1, IS-2, and IS-3 over the TV features LA, VEL, PRO, JAW_OPEN, TTCD, and TBCD; one panel each for the male and female subjects' testsets.]
Figure 7.3: Bar diagram of the mean ρ obtained using IS-1, IS-2, and IS-3 for various TV features separately over all male and female subjects' test utterances.
efficient computationally than IS-1 and IS-2 (∼40 times faster). Thus, the proposed inversion procedure (IS-3) is attractive when articulatory features need to be estimated in a subject-independent fashion in a real-time scenario for speech or speaker recognition. We estimate the trajectories for each articulator independently; however, the correlations among different articulator trajectories are well known. Thus, the accuracy of the proposed inversion scheme can be further improved by exploiting inter-articulator correlation. Finally, subject-independent inversion provides us the opportunity to investigate the similarity of the acoustic-articulatory map across subjects. If the subject-independent inversion performs well on a specific set of test subjects, then their acoustic-articulatory map might be similar to that of the training subject.
Chapter 8:
Automatic speech recognition
using articulatory features from
subject-independent
acoustic-to-articulatory inversion
In this chapter, we perform a recognition experiment to examine the utility of articulatory
features obtained using subject-independent acoustic-to-articulatory inversion to evalu-
ate the proposed exemplar-specific speech recognition framework. We used the TIMIT
database [11] to train and test the acoustic-articulatory recognizer. TIMIT contains
broadband recordings (at 16kHz) of 630 talkers of eight major dialects of American
English, each reading ten phonetically rich sentences. Among 6300 sentences, 4620 sen-
tences are used for training and the remaining are used as the test set. From the training set we excluded utterances (e.g., sa1.wav, sa2.wav) which were used for initial calibration of the subject prior to recording, leading to 3696 sentences for training. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions and is, hence, ideal for phonetic classification or recognition experiments.
We perform a phonetic recognition experiment on the TIMIT corpus using the
exemplar-specific articulatory features in addition to the acoustic features. The pur-
pose of the experiment is to investigate if there is any improvement in recognition
performance by using the exemplar-specific articulatory features. The phonetic recog-
nition is performed by using three different exemplar (MOCHA M, MOCHA F, and
MURI M) models. Therefore, we obtain three different phonetic recognition accuracies.
The exemplar-specific articulatory features are appended to the acoustic feature vector
to obtain a higher dimensional acoustic-articulatory feature vector. Unlike the acoustic
model in traditional speech recognition, we build an acoustic-articulatory model us-
ing the acoustic-articulatory feature vectors of all frames in the training corpus. Since
three exemplars are investigated, three different acoustic-articulatory models are built
using mono-phone 3-states left-to-right HMM. The emission probabilities of the HMM
states are modeled using Gaussian mixture model (GMM). The number of mixture
components is increased to obtain highest recognition accuracy. HMM based phonetic
recognition performance obtained by acoustic-articulatory features are reported and
compared against that obtained by only acoustic features.
The TV features are estimated for each utterance in the TIMIT corpus using the
method outlined in Chapter 7. The entire TIMIT training set is used to build the
acoustic-articulatory model and the TIMIT test set is used to perform the recognition
experiment. The TIMIT corpus has phonetic labels of the speech utterances using 61
different phonemes. However, following the phonetic recognition experiment in [37, 63],
we merged similar-sounding phonemes into 39 broad phonetic categories. All recognition results are reported on these 39 phonetic categories.
Acoustic             Without articulatory   With estimated articulatory features
features             features (acoustic     Exemplar    Articulatory   Acoustic +
                     only)                              only           articulatory
MFCC                 47.70%                 MOCHA M     34.29%         48.42%
                                            MOCHA F     33.41%         48.57%
                                            MURI M      38.64%         51.18%
MFCC+ΔMFCC+ΔΔMFCC    64.51%                 MOCHA M     46.10%         65.75%
                                            MOCHA F     44.47%         65.88%
                                            MURI M      48.22%         66.46%
Table 8.1: HMM based phonetic recognition accuracies on the TIMIT corpus using acoustic-only and acoustic-articulatory features.
Table 8.1 shows the phonetic recognition accuracy on the TIMIT corpus using acoustic-only and acoustic-articulatory features without any phonotactic model. In Table 8.1 we separately report results with MFCC as acoustic features and with MFCC plus its Δ and ΔΔ, together with the corresponding articulatory features derived from them. TV features with their velocity and acceleration coefficients are used as articulatory features. Since there are six TV features, the articulatory feature vector is 18-dimensional. As shown in Table 8.1, the phonetic recognition accuracy using only MFCC features (13-dim) is 47.70%, while that using MFCC+ΔMFCC+ΔΔMFCC features (39-dim) is 64.51%. When the articulatory features are estimated from MFCC using subject-independent acoustic-to-articulatory inversion and used in addition to MFCC, we obtain different recognition accuracies (48.42%, 48.57%, and 51.18%) depending on the exemplar chosen for the inversion. Similarly, when the articulatory features are estimated from MFCC+ΔMFCC+ΔΔMFCC, the recognition accuracies are 65.75%, 65.88%, and 66.46% respectively for the three exemplars MOCHA M, MOCHA F, and MURI M. It is clear from the table that by using estimated articulatory features we obtain a recognition benefit on top of acoustic-only recognition, although the amount of the benefit depends on the exemplar chosen for acoustic-to-articulatory inversion. Note that the dimension of the feature vector combining MFCC+ΔMFCC+ΔΔMFCC and the articulatory features is 39+18=57; to ensure a fair comparison between acoustic-only and acoustic-articulatory recognition, we perform linear discriminant analysis (LDA) on the phoneme classes and reduce the feature vector dimension to 39. Thus we obtain approximately 2% absolute improvement in recognition accuracy by using estimated articulatory features over the traditional acoustic features MFCC+ΔMFCC+ΔΔMFCC. Further, when a bigram phonotactic model is used while decoding the phoneme lattice for the best acoustic and acoustic-articulatory recognition accuracy, we obtain 66.41% and 67.92% recognition accuracy respectively (a 1.5% absolute improvement).
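A hedged sketch of such a supervised LDA-based reduction using scikit-learn is shown below. Note that Fisher LDA yields at most (number of classes − 1) discriminant directions, so with 39 phone classes scikit-learn caps the projection at 38 dimensions; the exact projection used in the experiments above is not specified, so this is an approximation for illustration only.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_features(feats, phone_labels, target_dim=39):
    """Project 57-dim acoustic-articulatory frames to a lower dimension using LDA
    supervised by frame-level phone labels (illustrative approximation)."""
    n_classes = len(set(phone_labels))
    lda = LinearDiscriminantAnalysis(n_components=min(target_dim, n_classes - 1))
    return lda.fit_transform(feats, phone_labels)
```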
It is interesting to observe that, depending on the chosen exemplar, we get different amounts of recognition benefit. The maximum benefit is obtained for the MURI M exemplar. This could be because we have the largest amount of parallel articulatory and acoustic data for MURI M among the exemplars, which could help us estimate the articulatory-to-acoustic map reliably for this subject. To gain a better understanding of which exemplar can help improve recognition for what kind of talkers, we perform a more controlled experiment in the next chapter.
Chapter 9:
Comparison of recognition
accuracies for using original and
estimated articulatory features
The goal in this chapter is to experimentally study the effectiveness of using articula-
tory features estimated through subject-independentinversion for speech recognition in
different talker-exemplar combinations and have an understanding about which exem-
plar performs better for which talkers. The experiments are performed using parallel
acoustic and articulatory data from three native speakers of English from two distinct
databases. Automatic speech recognition experiments using both acoustic-only speech
features and joint acoustic-articulatory features are performed for each subject (talker)
separately. To experimentally explore the effect of using estimates derived from dif-
ferent articulatory-acoustic maps (i.e., exemplars), we cross-test each exemplar-based
model against the data of the others. Thus, for each subject in our study, we have
three different estimates of the articulatory features (using two other subjects and the
talker itself as exemplars) as well as the original articulatory features – overall four dif-
ferent versions of the articulatory features for each subject. We investigate the nature
of acoustic-articulatory recognition accuracy compared to acoustic-only recognition ac-
curacy for the different versions of the articulatory features. The availability of direct
articulatory data allows us to investigate the extent and nature of the recognition ben-
efit we can obtain when we replace the original articulatory features by the estimated
ones. We next describe the articulatory datasets used in this work.
9.1 Datasets and features
The present study uses articulatory-acoustic data drawn from two different sources. The first one is from the Multichannel Articulatory (MOCHA) database [80] that contains ElectroMagnetic Articulography (EMA) data for 460 utterances (∼20 minutes) read by a male and a female talker of British English. We refer to these subjects as EN RD MALE
and EN RD FEMALE respectively. The EMA data consists of trajectories of sensors
placed in the midsagittal plane of the subject on upper lip (UL), lower lip (LL), jaw
(LI), tongue tip (TT), tongue body (TB), tongue dorsum (TD), and velum (V).
The second source of parallel articulatory-acoustic data comes from the EMA data
collected at University of Southern California (USC) from a male talker of American
English (EN SP MALE) as a part of the Multi-University Research Initiative (MURI)
project [66]. In contrast to the read speech in MOCHA database, the articulatory data
in the MURI database was collected when the subject was engaged in a spontaneous
conversation (∼50 minutes) with an interlocutor. Unlike MOCHA, the second corpus
has articulatory position data only for UL, LL, LI, TT, TB, and TD. The articulatory
data from MURI corpus are preprocessed, in a manner similar to that used for the
MOCHA database, to obtain a frame rate of 100 Hz.
Figure 9.1: Illustration of the TV features in the midsagittal plane.
To specify articulatory features, we have used the tract variable (TV) definition^1 motivated by the basic role of constriction formation in articulatory phonology [2]. The data from the three subjects we have considered in this study do not correspond to an identical set of articulators; thus, for consistency, we have chosen five TV features for each subject, namely, lip aperture (LA), lip protrusion (PRO), jaw opening (JAW OPEN), tongue tip constriction degree (TTCD), and tongue body constriction degree (TBCD). These TV features are illustrated in Fig. 9.1 and are computed from the raw position values of the sensors using the definitions given by Ghosh et al. [25]. We use 13-dim mel frequency cepstral co-efficients (MFCCs) as speech acoustic features at a frame rate (100 Hz) identical to the rate of the articulatory features.

^1 For articulatory representation, one can also use the raw X, Y values of the sensor positions of common articulators across subjects. TV features represent a "low-dimensional" (5×1) control regime for constriction actions in speech production and are considered more invariant in a linguistic sense.

The implementation of subject-independent inversion [25] requires a generic acoustic model, the design of which requires a large acoustic speech corpus. For this purpose, we have considered the speech data from the TIMIT [11] corpus. Since TIMIT is a phonetically
balanced database of English and our experiments are limited to English talkers, we
assume the TIMIT training corpus adequately represents all variabilities in acoustic
space required for subject-independent inversion.
9.2 Subject-independent inversion
In subject-independent acoustic-to-articulatory inversion [25], the articulatory-to-
acoustic map of an ‘exemplar’ is used to estimate the articulatory trajectory corre-
sponding to any arbitrary talker’s speech. Since the acoustics of the ‘exemplar’ and
the talker can be, in general, different, the basic idea of enabling subject-independent
inversion [25] is to normalize this inter-subject acoustic variability by computing the
likelihood of the acoustic features for both the ‘exemplar’ and the target talker using a
general acoustic model and predict the articulatory position values based on the close-
ness between likelihood scores. Since the articulatory configuration of the ‘exemplar’
is in general different from that of the talker, it was shown in [25] that the range and
values of the estimated articulatory trajectories correspond to those of the exemplar’s
articulatory trajectories as if the exemplar spoke the target utterance spoken by the
talker. Although the talker and exemplar articulatory trajectories are not identical, a
significant correlation was observed between the measured articulatory trajectories of
the talker and the estimated ones.
To examine the correlation values between the original (x) and estimated (x̂) TV features, we report in Table 9.1 the average correlation coefficients (ρ) computed over all utterances, obtained by considering in turn each of the three subjects in our data set as an exemplar and the other two as the talker. For each exemplar and talker combination, we also performed a linear regression analysis x̂ = ax + b + ε, ε ∼ N(0, σ²), and a hypothesis test on the slope a (H_0: a = 0, H_a: a ≠ 0) for each TV feature. We found that the estimated feature values have a significant correlation (p-value = 10^{-5}) with the original ones (i.e., there is sufficient evidence to reject the null hypothesis H_0).
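A sketch of this per-feature check using SciPy is shown below; the function name is illustrative.

```python
from scipy.stats import linregress, pearsonr

def slope_test(x_orig, x_est):
    """Pearson's rho between original and estimated TV trajectories, plus a linear
    regression x_est = a * x_orig + b + eps and the p-value of the test H0: a = 0."""
    rho, _ = pearsonr(x_orig, x_est)
    fit = linregress(x_orig, x_est)     # fit.slope is a; fit.pvalue tests H0: a = 0
    return rho, fit.slope, fit.pvalue
```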
To investigate the potential benefit of these estimated articulatory features for automatic speech recognition, we performed experiments using the estimated TV features and compared the results with those obtained by using the measured (original) TV features.
                                      ρ for different TV features
Test subject     Exemplar             LA     PRO    Jaw Open   TTCD   TBCD
EN RD MALE       EN RD FEMALE         0.45   0.20   0.58       0.60   0.63
                 EN SP MALE           0.50   0.29   0.59       0.59   0.62
EN RD FEMALE     EN RD MALE           0.45   0.16   0.63       0.69   0.65
                 EN SP MALE           0.60   0.15   0.66       0.73   0.59
EN SP MALE       EN RD FEMALE         0.55   0.28   0.52       0.70   0.62
                 EN RD MALE           0.36   0.29   0.44       0.65   0.64
Table 9.1: Average correlation coefficient (ρ) between original and estimated TV features for different talker and exemplar combinations.
                 Acoustic-only    Acoustic-articulatory recognition (p-value)
                 recognition      using original    using exemplar
Talker           accuracy         TV features       EN RD MALE      EN RD FEMALE    EN SP MALE
EN RD MALE       76.79%           79.05% (0.002)    77.37% (0.002)  77.99% (0.002)  77.23% (0.020)
EN RD FEMALE     79.10%           81.28% (0.002)    80.25% (0.002)  80.16% (0.002)  79.75% (0.010)
EN SP MALE       74.84%           76.29% (0.002)    74.87% (0.084)  74.97% (0.131)  75.17% (0.002)
Table 9.2: Average phonetic recognition accuracy using acoustic and acoustic-articulatory (both measured and estimated) features separately for each English subject. p-values indicate the significance of the change in recognition accuracy from acoustic to acoustic-articulatory feature based recognition.
9.3 Automatic Speech Recognition experiments
Our experiment focuses on frame-based broad-class phonetic recognition with GMM classifiers, using data from each of the three subjects. The phone boundaries were obtained using the forced-alignment module in the SONIC recognizer [51] with an acoustic model set of 51 phones. Since the total number of frames corresponding to some phones was too small in our data to build individual GMMs for them, we grouped the data into five broad phone classes, namely, vowel, fricative, stop, nasal, and silence. 90% of each subject's data (acoustic as well as articulatory) was used for training and the remaining 10% was used as the test set in a ten-fold cross-validation setup. Note that, in addition to the original articulatory features, we also have three different estimates of the articulatory features for each subject from subject-independent acoustic-to-articulatory inversion, where the remaining two subjects as well as the talker itself are used as exemplars.
The feature space corresponding to each broad phone class is modeled using GMMs
with 4 mixtures and full covariance matrices. The GMM parameters were estimated
using the expectation maximization (EM) algorithm [12]. Each test frame is classi-
fied as one of the five broad phonetic categories using a max-a-posteriori (MAP) rule.
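A minimal sketch of this classifier using scikit-learn is given below; the use of empirical class priors in the MAP rule is an assumption, and the names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_map_classifier(frames_by_class):
    """One 4-mixture, full-covariance GMM per broad phone class plus a class prior.
    frames_by_class maps a class name (e.g., 'vowel') to an (N_c, D) array of
    training frames (acoustic or acoustic-articulatory)."""
    total = sum(len(f) for f in frames_by_class.values())
    models, priors = {}, {}
    for c, frames in frames_by_class.items():
        models[c] = GaussianMixture(n_components=4, covariance_type="full",
                                    random_state=0).fit(frames)
        priors[c] = len(frames) / total        # empirical prior (an assumption)
    return models, priors

def classify_frame(x, models, priors):
    """MAP rule: pick the class maximizing log p(x | class) + log P(class)."""
    scores = {c: models[c].score_samples(x.reshape(1, -1))[0] + np.log(priors[c])
              for c in models}
    return max(scores, key=scores.get)
```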
Table 9.2 shows the acoustic and acoustic-articulatory feature (using both original and estimated articulatory features) based recognition accuracies for each English talker separately. The average recognition accuracy using just acoustic features (MFCC) over ten-fold cross validation is reported in Table 9.2². In addition, average recognition accuracies for each talker are reported when estimated articulatory features are used to augment the acoustic feature vectors. This allows us to compare the recognition performance for the different exemplar choices. Table 9.2 also shows the average recognition accuracy when the acoustic feature vector is augmented with the directly measured (original)
² We did not explore delta and delta-delta MFCC as acoustic features due to the increase in feature dimension, which in turn requires more data for reliable estimates of GMM parameters; this is not afforded by the corpus limitations of the present study.
99
articulatory features. We perform the Wilcoxon test [28] between acoustic-only and
acoustic-articulatory feature based recognition to investigate whether there is any sig-
nificantimprovement for includingarticulatory features. Thep-value resultingfrom the
Wilcoxon test is reported next to the average recognition accuracy. A lower p-value
indicates more significant improvement in the recognition accuracy. In Table 9.2, we
mark the recognition accuracies by ‘bold’ when the average recognition accuracy using
acoustic-articulatory featureissignificantlybetterat95%significancelevelthanjustthe
acoustic feature. We also mark the highest acoustic-articulatory recognition accuracy
for individual subject by ’underline’.
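For concreteness, the paired test over the ten per-fold accuracies can be run roughly as follows; the fold accuracies shown are made-up placeholder values for illustration only, not numbers from Table 9.2.

```python
# Sketch of the paired Wilcoxon signed-rank test over the ten cross-validation folds.
# The per-fold accuracies below are hypothetical placeholders, not thesis results.
from scipy.stats import wilcoxon

acc_acoustic  = [0.762, 0.771, 0.755, 0.768, 0.759, 0.773, 0.749, 0.766, 0.758, 0.770]
acc_augmented = [0.783, 0.790, 0.772, 0.781, 0.774, 0.795, 0.760, 0.784, 0.778, 0.786]

stat, p_value = wilcoxon(acc_acoustic, acc_augmented)
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.3f}")
```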
9.4 Discussion of the recognition results

The results in Table 9.2 show that the recognition accuracy for all the English talkers improves significantly when the measured (original) TV features are used to augment the acoustic features. When the original TV features are replaced with the estimated TV features, the improvement obtained depends on the characteristics of the 'exemplar' used to estimate the TV features. We also observe that, for each talker considered in this experiment, the recognition benefit due to the original articulatory features is larger than that due to the estimated articulatory features. This observation suggests that better estimates of the articulatory features could lead to better recognition accuracy. The selection of the 'exemplar' plays a crucial role in determining the quality of the articulatory estimates and, hence, the recognition benefit. For example, when we consider EN RD MALE or EN RD FEMALE as the talker, the recognition benefits due to EN RD FEMALE and EN RD MALE as exemplars are larger than that due to EN SP MALE as the exemplar. This could be because EN RD MALE and EN RD FEMALE are British English subjects while EN SP MALE is an American English subject. Furthermore, MOCHA-TIMIT contains read speech while the USC MURI database contains spontaneous speech, and this increased talker-exemplar speaking style mismatch could contribute to poorer estimates. It is also interesting to observe that although the EN SP MALE exemplar and the EN RD MALE talker are of the same gender, an exemplar of a different gender, EN RD FEMALE, provided a larger recognition benefit for the EN RD MALE talker. When we consider the American English talker (EN SP MALE), the recognition benefits obtained using the British English subjects as exemplars are not statistically significant. This suggests that the general acoustic space used in subject-independent inversion can account for gender differences more effectively than for dialectal and speaking style differences between the talker and the exemplar.
Finally, recognition experiments with articulatory features estimated using an identical exemplar-talker combination were performed to examine the extent of the recognition benefit when the talker's own articulatory-to-acoustic map is used for subject-independent inversion. For every training-test set combination of an individual talker, the parallel articulatory and acoustic data of the training set is used to estimate the articulatory features for the test sentences. It appears that using an identical talker and exemplar does not always guarantee the maximum recognition benefit among the different exemplars. This may reflect data limitations in deriving articulatory-acoustic maps that can cover the range of expected test feature variability. For example, when EN RD MALE is considered as the talker, the exemplar EN RD FEMALE, which has a similar speaking style, resulted in a larger recognition benefit than the identical talker-exemplar scenario. The same holds for the EN RD FEMALE talker with the EN RD MALE exemplar. For the EN SP MALE talker, however, the identical talker-exemplar combination provides the highest recognition benefit among the exemplars.

Thus, the choice of 'exemplar' plays a critical role in improving recognition for a given talker. Our results suggest that when the 'exemplar' is chosen to have the same speaking style as the talker, there is a significant benefit in using estimated articulatory features in addition to the acoustic features to improve speech recognition.
9.5 Conclusions

We investigated the potential of using articulatory features estimated through acoustic-to-articulatory inversion in automatic speech recognition. We conducted subject-specific broad-class phonetic recognition experiments using data from three different native English speaking subjects. We find that the selection of the 'exemplar' for the subject-independent acoustic-to-articulatory inversion has a critical impact on the quality of the articulatory feature estimates and, hence, on the final speech recognition accuracy. In particular, the recognition results suggest that when the talker and the 'exemplar' are matched in their speaking styles, the recognition benefit due to the estimated articulatory features is significant.
Chapter 10:
Discussions - link to Motor
Theory of speech perception
The proposed exemplar-specific approach to speech processing demonstrates a principled way of utilizing an exemplary subject's articulatory-to-acoustic map to derive articulatory features from any arbitrary talker's speech acoustics. The acoustic mismatch between the exemplar and the talker is normalized through a general acoustic space during acoustic-to-articulatory inversion using GSC. Due to such a normalization, the estimated articulatory trajectories obtained through inversion can be interpreted as the articulatory movements of the exemplar when he/she imitates what the talker says. Under such an interpretation, when the exemplar imitates a sentence spoken by different talkers, the different versions of the estimated articulatory movement (one per talker) turn out to be similar. The exemplar-specific approach to speech processing requires two main components: 1) the exemplar's articulatory-to-acoustic map, and 2) the variability in the acoustic space due to talker variability. When the exemplar is interpreted as a listener in a real-world human speech communication scenario, this exemplar-specific framework for speech processing closely resembles the philosophy of the Motor theory of speech perception [38].
The Motor theory of speech perception [38] claims that human listeners do not recognize speech sounds from the talker's speech acoustics alone; rather, the listener derives from the acoustic speech signal the articulatory gestures required to produce the speech sound and uses the derived articulatory information to recognize it. The rationale is that there is more variability in speech acoustics than in speech articulation. This has been illustrated, for example, with the articulation of stops under different vowel contexts, where the formant transitions change depending on the vowel context but the articulation of the stops, as well as their perception, remains invariant. Although the Motor theory suggests the concept of deriving articulatory features at the listener's end, how the articulatory features can be estimated for any arbitrary talker has not been shown. That has posed a serious limitation on computationally extending the Motor theory of speech perception toward machine recognition of speech.
The proposed exemplar-specific approach to deriving articulatory features from the speech acoustics of any arbitrary talker provides an opportunity to extend a philosophy similar to the Motor theory to real-world machine recognition of speech. When the exemplar in the proposed framework is interpreted as a listener, the articulatory features are estimated at the listener's end using listener-specific and talker-independent acoustic-to-articulatory inversion. Thus, the proposed approach offers a computational methodology for estimating articulatory features at the listener's end, which is the main component of the Motor theory of speech perception.
With the proposed computational framework, we can now address further questions related to how the Motor theory might be extended when there is a mismatch between the talker and the listener. Given a talker and a listener, can we determine the extent to which the listener can reliably derive articulatory features from the talker's speech? For a given listener, what characteristics should the talker have so that the listener can recognize the talker's speech to a desired accuracy? How do language differences between the talker and the listener play a role in recognition, from the viewpoint of the production of sounds in the two languages? Answering questions of this kind could provide a theoretical as well as computational basis for understanding the complex human speech communication system.
Chapter 11:
Future works
Based on the preliminary analyses and results described in the previous chapters, I provide a list of possible directions for future work, which are expected to offer further insights into the broader question of the role of speech production in speech processing:
11.1 The direction of causality in the information-theoretic production-perception link

Our experiment, described in Chapter 4, indicates that the auditory filterbank is near-optimal in the sense that its output provides maximal information (or least uncertainty) about the articulatory gestures underlying the acoustic speech signal. The results of our information-theoretic analysis indicate that the characteristics of the human speech production and perception systems are linked. Inspired by such a production-perception link, we would like to investigate the role of the speech production system in the development of the auditory system, if any, and vice versa. The goal of such an investigation is to gain insight into the fundamental principles behind the development of the speech production and auditory systems and into the role of each system in the development of the other.
11.2 Complementary characteristics of acoustic and articulatory features

The improvement in recognition when articulatory features are used in addition to acoustic features suggests that articulatory features provide information for phonetic discrimination that is complementary to that provided by acoustic features. We also obtain a recognition benefit when we use articulatory features estimated from the speech acoustics using GSC-based subject-independent inversion. Thus, the nature of the map between acoustic and articulatory features determines the extent to which they provide complementary information. Understanding the map between the acoustic and articulatory spaces will also lead to the design of better acoustic as well as articulatory features for inversion and recognition. Further, this understanding will help improve the acoustic-to-articulatory inversion strategy.
11.3 The role of speech production in speech recognition by a non-native listener

It is well known that, in practice, speech recognition accuracy by non-native or second-language listeners is worse than that by native or first-language listeners [19, 76]. Thus, recognition accuracy varies depending on the listener's and the talker's language. This variation could be due, in part, to the listener's lack of production knowledge for some of the sounds in the talker's language. Using our exemplar-based recognition framework, we can conduct cross-language recognition experiments in which the exemplar is interpreted as the listener. We would like to examine whether the proposed exemplar-specific speech recognition framework can explain such variation in recognition across listeners in practice. This can be done by developing a recognizer in the context of a listener whose articulatory database is collected in a language different from the language of the speech recognition corpus. In fact, we have conducted some preliminary experiments using a Korean-English talker-exemplar combination and have found that a Korean exemplar performs worse than an English exemplar on an English talker.
11.4 Comparison to visual articulation

It has been shown several times that automatic speech recognition accuracy improves when features related to lip or jaw movement, derived from video of the talker's face, are used in addition to the speech acoustics. In this thesis, we propose a computational mechanism in which the articulatory features are estimated without any need for such visual information about the talker. It will be interesting to investigate whether the derived articulatory features provide equal or greater discriminatory power for recognition compared to visible articulatory features.
Bibliography
[1] B. S. Atal, J. J. Chang, M. V. Mathews, and J. W. Tukey. Inversion of articulatory-
to-acoustic transformation in the vocal tract by a computer-sorting technique. J.
Acoust. Soc. Am, 63:1535–1555, May 1978.
[2] C. P. Browman and L. Goldstein. Towards an articulatory phonology. Phonology
Yearbook, 3:219–252, 1986.
[3] C. P. Browman and L. Goldstein. Articulatory gestures as phonological units.
Phonology, 6:201–251, 1989.
[4] C. P. Browman and L. Goldstein. Articulatory gestures as phonological units.
Phonology, 6(2):201–251, 1989.
[5] C. P. Browman and L. Goldstein. Gestural specification using dynamically-defined
articulatory structures. Journal of Phonetics, 18:299–320, 1990.
[6] M. Chatterjee and J. J. Zwislocki. Cochlear mechanisms of frequency and intensity
coding. II. Dynamic range and the code for loudness. Hearing research, 124(1-
2):170–181, 1998.
[7] S. Chennoukh, D. Sinder, G. Richard, and J. Flanagan. Voice mimic system using
an articulatory codebook for estimation of vocal tract shape. Proc. Eurospeech,
Rhodes, Greece, pages 429–432, September 1997.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Inter-
science, New York, 1991.
[9] D. R. Cox and D. V. Hinkley. Theoretical Statistics. Chapman & Hall, 1974.
[10] G. A. Darbellay and I. Vajda. Estimation of the information by an adaptive parti-
tion of the observation space. IEEE Transactions on Information Theory, 45:1315–
1321, 1999.
[11] DARPA-TIMIT. Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Disc
1-1.1. 1990.
[12] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the Royal Statistical Society, Series
B (Methodological), 39(1):1–38, 1977.
[13] L. Deng. Switching dynamic system models for speech articulation and acoustics.
Chapter in M. Johnson, M. Ostendorf, S. Khudanpur, and R. Rosenfeld (eds.),
Springer Verlag, 3:115–134, 2003.
[14] L. Deng and K. Erler. Structural design of hidden Markov model speech recognizer
using multivalued phonetic features: Comparison with segmental speech units. J.
Acoust. Soc. Am., 92(6):3058–3069, 1992.
[15] L. Deng and D. Sun. Phonetic classification and recognition using hmm represen-
tation of overlapping articulatory features for all classes of english sounds. Proc.
ICASSP, 1:45–48, 1994.
[16] L. Deng and D. X. Sun. A statistical approach to automatic speech recognition
using the atomic speech units constructed from overlapping articulatory features.
J. Acoust. Soc. Am., 95(4):2702–2719, 1994.
[17] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. New
York:Wiley-Interscience, 2000.
[18] S. Dusan and L. Deng. Acoustic-to-articulatory inversion using dynamic and phono-
logical constraint. The 5th Speech Production Seminar, Munich, Germany, pages
237–240, 2000.
[19] J. E. Flege, O. S. Bohn, and S. Jang. Effects of experience on non-native speakers’
production and perception of english vowels. Journal of Phonetics, 25:437–470,
1997.
[20] C. A. Fowler. An event approach to the study of speech perception from a direct-
realist approach. J. Phonetics, 14:3–28, 1986.
[21] J. Frankel and S. King. ASR - articulatory speech recognition. Proc. Eurospeech,
Scandinavia, pages 599–602, 2001.
[22] J. Frankel, K. Richmond, S. King, and P. Taylor. An automatic speech recognition
system using neural networks and linear dynamic models to recover and model
articulatory traces. Proc. ICSLP, Beijing, China, 4:254–257, October 2000.
[23] P. K. Ghosh, L. M. Goldstein, and S. S. Narayanan. Processing speech signal
using auditory-like filterbank provides least uncertainty about articulatory gestures.
accepted in Journal of the Acoustical Society of America, 2011.
[24] P. K. Ghosh and S. Narayanan. A generalized smoothness criterion for acoustic-
to-articulatory inversion. J. Acoust. Soc. Am., 128(4):2162–2172, 2010.
[25] P. K. Ghosh and S. S. Narayanan. A subject-independent acoustic-to-articulatory
inversion. accepted in ICASSP, Prague, Czech Republic, 2011.
[26] L. Goldstein, I. Chitoran, and E. Selkirk. Syllable structure as coupled oscillator
modes: Evidence from Georgian vs. Tashlhiyt Berber. Proceedings of the 16th
International Congress of Phonetic Sciences, Saarbrucken, Germany, pages 241–
244, 2007.
[27] J. Hogden, A. Lofqvist, V. Gracco, I. Zlokarnik, P. Rubin, and E. Saltzman. Ac-
curate recovery of articulator positions from acoustics: New conclusions based on
human data. J. Acoust. Soc. Am., 100(3):1819–1834, 1996.
[28] M. Hollander and D. A. Wolfe. Nonparametric Statistical Methods. NJ: John Wiley
& Sons, Inc., 1999.
[29] K. Johnson. Acoustic and auditory phonetics. Wiley-Blackwell, 2 edition, January
17 2003.
[30] J. L. Kelly Jr. and C. C. Lochbaum. Speech synthesis. Proc. Fourth Int. Congr.
Acoust., Copenhagen, pages 1–4, 1962.
[31] S. King, T. Stephenson, S. Isard, P. Taylor, and A. Strachan. Speech recognition
via phonetically featured syllables. Proc. ICSLP, 3:1031–1034, 1998.
[32] S. King and P. Taylor. Detection of phonological features in continuous speech
using neural networks. Computer. Speech Lang., 14:333–345, 2000.
[33] K. Kirchhoff. Robust Speech Recognition using Articulatory Information. PhD
Thesis, University of Bielefeld, 1999.
[34] K. Kirchhoff, G. A. Fink, and G. Sagerer. Combining acoustic and articulatory
feature information for robust speech recognition. Speech Communication, 37:303–
319, 2002.
[35] R. Kuc, F. Tutuer, and J. R. Vaisnys. Determining vocal tract shape by applying
dynamic constraints. Proc. ICASSP, Tampa, Florida, pages 1101–1104, 1985.
[36] A. Lammert, D. P. W. Ellis, and P. Divenyi. Data-driven articulatory inversion in-
corporating articulator priors. ISCA Tutorial and Research Workshop on Statistical
And Perceptual Audition, SAPA, Brisbane, Australia, 21 September 2008.
[37] K. F. Lee and H. W. Won. Speaker-independent phone recognition using hidden
markov models. IEEE Transactions on Acoustic, Speech, and Signal Processing,
31(11):1641–1648, 1989.
[38] A. M. Liberman, F. S. Cooper, D. P. Shankweiler, and M. Studdert-Kennedy.
Perception of the speech code. Psychol. Rev., 74:431–461, 1967.
[39] A. M. Liberman and I. G. Mattingly. The motor theory of speech perception revised. Cognition,
21:1–36, 1985.
[40] K. Livescu, J. Glass, and J. Bilmes. Hidden feature models for speech recognition
using dynamic bayesian networks. Proc. Eurospeech, (3):2529–2532, Sep 2003.
[41] S. Maeda. Un modele articulatoire de la langue avec des composantes lineaires (an
articulatory model of the tongue with linear components). Actes 10emes Journees
d’Etude sur la Parole (Grenoble, France), pages 152–162, 1979.
[42] S. Maeda. Compensatory articulation during speech: Evidence from the analysis
and synthesis of vocal tract shapes using an articulatory model. Speech production
and speech modelling, edited by W. Hardcastle and A. Marchal (Kluwer Academic
Publishers, Dordrecht, The Netherlands), pages 131–149, 1990.
[43] D. G. Manolakis, V. K. Ingle, and S. M. Kogon. Statistical and Adaptive Signal
Processing: Spectral Estimation, Signal Modeling, Adaptive Filtering and Array
Processing. Artech House Publisher, April 30 2005.
[44] E. McDermott and A. Nakamura. Production-oriented models for speech recogni-
tion. IEICE Trans. Inf. & Syst., E89-D(3):1006–1014, 2006.
[45] V. Mitra, H. Nam, C. Espy-Wilson, E. Saltzman, and L. Goldstein. Articulatory
information for noise robust speech recognition. to appear in IEEE Trans. ASLP,
2011.
[46] V. A. Morozov. Regularization Methods for Ill-Posed Problems. Florida: CRC Press,
1993.
[47] E. Muller and G. MacLeod. Perioral biomechanics and its relation to labial motor
control. J. Acoust. Soc. Am., 71:S33–S33, April 1982.
[48] S. S. Narayanan, K. Nayak, S. Lee, A. Sethy, and D. Byrd. An approach to real-
time magnetic resonance imaging for speech production. J. Acoust. Soc. Am.,
115(4):1771–1776, 2004.
[49] C. R. Nave. Place Theory. Accessed 13/03/2011.
[50] J. S. Pathmanathan and D. O. Kim. A computational model for the AVCN marginal
shell with medial olivocochlear feedback: Generation of a wide dynamic range.
Neurocomputing, 38:807–815, 2001.
[51] B. Pellom and K. Hacioglu. Sonic: The university of colorado continuous speech
recognizer. Technical Report TR-CSLR-2001-01, May 31 2005.
[52] S. J. Perkell, M. Cohen, M. Svirsky, M. Matthies, I. Garabieta, and M. Jackson.
Electro-magnetic midsagittal articulometer systems for transducing speech artic-
ulatory movements. Journal of the Acoustical Society of America, 92:3078–3096,
1992.
[53] C. Qin and M. A. Carreira-Perpinan. An empirical investigation of the nonunique-
ness in the acoustic-to-articulatory mapping. Proc. Interspeech, 2007.
[54] C. Qin and M. A. Carreira-Perpinan. An empirical investigation of the nonunique-
ness in the acoustic-to-articulatory mapping. Proc. Interspeech, pages 74–77, 2007.
[55] M. G. Rahim, W. B. Kleijn, J. Schroeter, and C. C. Goodyear. Acoustic-to-
articulatory parameter mapping using an assembly of neural networks. Proc.
ICASSP, pages 485–488, 1991.
[56] G. Ramsay and L. Deng. Maximum-likelihood estimation for articulatory speech
recognition using a stochastic target model. Proc. EUROSPEECH, pages 1401–
1404, 1995.
[57] H. B. Richards, J. S. Mason, M. J. Hunt, and J. S. Bridle. Deriving articulatory
representations from speech with various excitation modes. Proc. ICSLP, Philadel-
phia, PA, USA, pages 1233–1236, October 3-6 1996.
[58] K. Richmond. Mixture density networks, human articulatory data and acoustic-to-
articulatory inversion of continuous speech. Proceedings Workshop on Innovation
in Speech Processing WISP, 2001.
[59] K. Richmond. Estimating articulatory parameters from the acoustic speech signal.
Ph.D. Thesis, The Centre for Speech Technology Research, Edinburgh University,
2002.
[60] K. Richmond. A trajectory mixture density network for the acoustic-articulatory
inversion mapping. Proc. ICSLP, Pittsburgh, USA, pages 577–580, September 2006.
[61] E. L. Saltzman and K. G. Munhall. A dynamical approach to gestural patterning
in speech production. Ecological Psychology, 1:333–382, 1989.
[62] J. Schroeter and M. M. Sondhi. Dynamic programming search of articulatory
codebooks. Proceedings ICASSP, Glasgow, UK, pages 588–591, 1989.
[63] F. Sha and L. K. Saul. Large margin gaussian mixture modeling for phonetic
classification and recognition. Proceedings of ICASSP, Toulouse, France, pages
265–268, 2006.
[64] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical
Journal, 27:379–423, 1948.
[65] K. Shirai and M. Honda. Estimation of articulatory motion. Dynamic Aspects of
Speech Production, Tokyo Univ Press, pages 279–302, 1976.
[66] J. Silva, V. Rangarajan, V. Rozgic, and S. S. Narayanan. Information theoretic
analysis of direct articulatory measurements for phonetic discrimination. Proc.
ICASSP, pages 457–460, 2007.
[67] V. D. Singampalli and P. J. B. Jackson. Statistical identification of articulation
constraints in the production of speech. Speech Communication, 51(8):695–710,
2009.
[68] E. C. Smith and M. S. Lewicki. Efficient auditory coding. Nature, 439:978–982,
February 2006.
[69] V. N. Sorokin, A. Leonov, and A. V. Trushkin. Estimation of stability and accuracy
of inverse problem solution for the vocal tract. Speech Communication, 30:55–74,
2000.
[70] H. Stark and J. W. Woods. Probability and Random Processes with Applications
to Signal Processing. Prentice Hall, 3rd edition, August 3, 2001.
[71] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley, MA: Wellesley-
Cambridge Press, 1996.
[72] T. Toda, A. Black, and K. Tokuda. Acoustic-to-articulatory inversion mapping
with gaussian mixture model. Proc. ICSLP, Jeju Island, Korea, pages 1129–1132,
2004.
[73] T. Toda, A. Black, and K. Tokuda. Statistical mapping between articulatory move-
ments and acoustic spectrum using a gaussian mixture model. Speech Communi-
cation, 50:215–217, 2008.
[74] A. Toutios and K. Margaritis. Acoustic-to-articulatory inversion of speech: a review.
Proceedings of the International 12th TAINN, 2003.
[75] J. R. Westbury. X-ray microbeam speech production database user’s handbook
version 1.0. http://www2.uni-jena.de/∼x1siad/uwxrmbdb.html (date last viewed
6/15/2010), June 1994.
[76] S. J. V. Wijngaarden. Intelligibility of native and non-native dutch speech. Speech
Communication, 35:103–113, 2001.
[77] Wikibooks. Anatomy and Physiology of Animals/The Senses. Accessed
13/03/2011.
[78] R. Wilhelms, P. Meyer, and H. W. Strube. Estimation of articulatory trajectory
by kalman filter. I.T. Young et al., editor, Signal Processing III: Theories and
Application, pages 477–480, 1986.
[79] S. M. Wilson, A. P. Saygin, M. I. Sereno, and M. Iacoboni. Listening to speech
activates motor areas involved in speech production. Nature Neuroscience, 7:701–
702, 2004.
[80] A. A. Wrench and K. Richmond. Continuous speech recognition using articulatory
data. Proc. ICSLP, Beijing, China, pages 145–148, 2000.
[81] A. A. Wrench and K. Richmond. Continuous speech recognition using articulatory
data. Proc. ICSLP, Beijing, China, pages 145–148, 2000.
[82] M. Yanagawa. Articulatory timing in first and second language: a cross-linguistic
study. Doctoral dissertation, Yale University, 2006.
[83] X. Zhuang, H. Nam, M. Hasegawa-Johnson, L. Goldstein, and E. Saltzman. The
entropy of the articulatory phonological code: Recognizing gestures from tract
variables. Proc. Interspeech, Brisbane, Australia, pages 1489–1492, 2008.
[84] I. Zlokarnik. Adding articulatory features to acoustic features for automatic speech
recognition. J. Acoust. Soc. Am., 97:3246(A), 1995.
[85] G. Zweig. Speech recognition with Dynamic Bayesian Networks. PhD Thesis, Uni-
versity of California, Berkeley, Computer Science, 2002.
[86] E. Zwicker and E. Terhardt. Analytical expressions for critical-band rate and
critical bandwidth as a function of frequency. Journal of the Acoustical Society
of America, 68:1523–1525, 1980.
Appendix
Mutual Information
Mutual information (MI) between two random variables measures the amount of information one random variable provides about the other. If the two random variables are mathematically identical, then knowing one of them is equivalent to having full information about the other; thus, in such a case, the MI between the two random variables attains its maximum value. On the other hand, if two random variables are independent, then knowing one does not provide any information about the other, and hence the MI is zero.
Let two random variables U and V have a joint probability mass function p(u,v)
and marginal probability mass functions p(u) and p(v). The MI between U and V is
defined as [8, pp. 12–49]:
I(U;V) = \sum_{u}\sum_{v} p(u,v)\,\log\frac{p(u,v)}{p(u)\,p(v)}    (A-1)
It is easy to show that I(U;V) = H(V) − H(V|U), where H(V) is the entropy of V and H(V|U) is the conditional entropy of V given U. H(V|U) measures the amount of uncertainty that remains about V when U is known. Note that when H(V) is fixed, maximizing I(U;V) is equivalent to minimizing H(V|U).
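As a quick numerical illustration of this identity, the following toy computation on a small, hypothetical joint pmf checks that I(U;V) = H(V) − H(V|U); the probability values are arbitrary.

```python
# Toy check of I(U;V) = H(V) - H(V|U) for a small, made-up joint pmf p(u, v).
import numpy as np

p_uv = np.array([[0.30, 0.10],   # rows index u, columns index v
                 [0.15, 0.45]])
p_u = p_uv.sum(axis=1, keepdims=True)   # marginal p(u)
p_v = p_uv.sum(axis=0, keepdims=True)   # marginal p(v)

mi = np.sum(p_uv * np.log(p_uv / (p_u * p_v)))        # eqn. (A-1)
h_v = -np.sum(p_v * np.log(p_v))                      # H(V)
h_v_given_u = -np.sum(p_uv * np.log(p_uv / p_u))      # H(V|U)
assert np.isclose(mi, h_v - h_v_given_u)
```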
For computing I(X_{B_L}; Y) we need to know p(X_{B_L}), p(Y), and p(X_{B_L}, Y). But, in our case, we do not have access to the probability density functions of X_{B_L} and Y. Hence we consider MI estimation by quantization of the spaces of X_{B_L} and Y. This quantization is performed on the data points in both spaces with a finite number of quantization bins. We then estimate the joint distribution of X_{B_L} and Y in the newly quantized finite-alphabet space using the standard maximum likelihood criterion, i.e., frequency counts [17, pp. 85–92], and finally apply the discrete version of the MI given by eqn. (A-1). More precisely, we know that X_{B_L} and Y take values in the spaces R^L and R^16, respectively. The quantizations of these spaces are denoted by Q(X_{B_L}) : R^L → A_x and Q(Y) : R^16 → A_y, where |A_x| < ∞ and |A_y| < ∞. Then the estimate of MI is given by:

I(Q(X_{B_L}); Q(Y)) = \sum_{q_x \in A_x,\, q_y \in A_y} p(Q(X_{B_L})=q_x,\, Q(Y)=q_y)\,\log\frac{p(Q(X_{B_L})=q_x,\, Q(Y)=q_y)}{p(Q(X_{B_L})=q_x)\, p(Q(Y)=q_y)}    (A-2)

It is well known that I(Q(X_{B_L}); Q(Y)) ≤ I(X_{B_L}; Y), because quantization reduces the level of dependency between the random variables. On the other hand, increasing the resolution of Q(·) implies that I(Q(X_{B_L}); Q(Y)) converges to I(X_{B_L}; Y) as the number of bins tends to infinity [10]. For both spaces, we perform K-means vector quantization with 128 prototypes, i.e., |A_x| = |A_y| = 128. Increasing the number of prototypes yields similar results.
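A minimal sketch of this estimator, assuming X holds the filterbank outputs as an (N, L) array and Y the corresponding articulatory vectors as an (N, 16) array, might look as follows; the variable and function names are illustrative and are not taken from the code used for this analysis.

```python
# Sketch of the quantization-based MI estimate of eqn. (A-2): K-means quantization
# of both spaces followed by a frequency-count (maximum-likelihood) joint pmf.
import numpy as np
from sklearn.cluster import KMeans

def quantized_mi(X, Y, n_prototypes=128, random_state=None):
    """Estimate I(Q(X); Q(Y)) with |A_x| = |A_y| = n_prototypes."""
    qx = KMeans(n_clusters=n_prototypes, n_init=10,
                random_state=random_state).fit_predict(X)
    qy = KMeans(n_clusters=n_prototypes, n_init=10,
                random_state=random_state).fit_predict(Y)
    joint = np.zeros((n_prototypes, n_prototypes))
    np.add.at(joint, (qx, qy), 1.0)          # joint frequency counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)    # marginal of Q(X)
    py = joint.sum(axis=0, keepdims=True)    # marginal of Q(Y)
    nz = joint > 0                           # empty cells contribute 0 (0 log 0 = 0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px * py)[nz])))
```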
The subjects in the speech production databases in the three languages [26, 75, 82] have different numbers of parallel Y and X_{B_L} vectors, depending on the duration of their recordings. However, to estimate I(Q(X_{B_L}); Q(Y)), we pick approximately 100,000 parallel vectors of X_{B_L} and Y for each subject so that the amount of data used in our analysis is balanced across subjects.
To calculate a realization of I(Q(X_{B_L}); Q(Y)) for a subject, we select parallel Y and X_{B_L} vectors of the target subject and quantize them (with random initialization) to Q(X_{B_L}) and Q(Y), which are finally used in eqn. (A-2). We repeat this process several times for a chosen filterbank B_L to capture the inherent variability in the process of quantizing the articulatory and acoustic spaces. It turns out that the standard deviation (SD) of the MI estimates is of the order of ∼0.1% of the actual MI values. Hence, we use the average estimated MI for our experiments.
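As a usage note on the sketch above, the repeated estimation and averaging could be expressed as follows; X and Y again stand for one subject's parallel acoustic and articulatory vectors, here replaced by random stand-in data so the snippet is self-contained (and smaller than the roughly 100,000 vectors used in the actual analysis).

```python
# Repeat the quantization with different random initializations and average,
# mirroring the procedure above (quantized_mi is the sketch given earlier).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 40))   # stand-in for filterbank output vectors
Y = rng.standard_normal((10_000, 16))   # stand-in for articulatory vectors

mi_runs = [quantized_mi(X, Y, n_prototypes=128, random_state=seed) for seed in range(10)]
mi_mean, mi_sd = float(np.mean(mi_runs)), float(np.std(mi_runs))
```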
Abstract
This thesis focuses on exploring the role of speech production in automatic speech recognition from a communication system perspective. Specifically, I have developed a generalized smoothness criterion (GSC) for a talker-independent acoustic-to-articulatory inversion, which estimates speech production/articulation features from the speech signal of any arbitrary talker. GSC requires parallel articulatory and acoustic data from a single subject only (exemplar) and this exemplar need not be any of the talkers. Using both theoretical analysis and experimental evaluation, it is shown that the estimated articulatory features provide recognition benefit when used as additional features in an automatic speech recognizer. As we require a single exemplar for the acoustic-to-articulatory inversion, we overcome the need for the articulatory data from multiple subjects during inversion. Thus, we demonstrate a feasible way to utilize production-oriented features for speech recognition in a data-driven manner. Due to the concept of exemplar, the production-oriented features and, hence, the speech recognition become exemplar-dependent. Preliminary recognition results with different talker-exemplar combinations show that the recognition benefit due to the estimated articulatory feature is greater when the talker’s and exemplar’s speaking styles are matched, indicating that the proposed exemplar-dependent recognition approach has potential to explain the variability in recognition across human listeners.