TOWARD UNDERSTANDING SPEECH PLANNING BY
OBSERVING ITS EXECUTION—REPRESENTATIONS,
MODELING AND ANALYSIS
by
Vikram Ramanarayanan
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
June 2014
Copyright 2014 Vikram Ramanarayanan
Rig Veda 3.62.10 (Gayatri Mantra): We meditate on the effulgent glory of the divine Light;
please inspire our understanding and enrich our intellect.
Quran 26:83: O Lord! Bestow wisdom on me, and join me with the righteous.
Proverbs 18:15: The heart of the prudent getteth knowledge; and the ear of the wise seeketh knowledge.
Taittriya Upanishad, Second Anuvaka
Dedication
To my family, who have supported me throughout this wonderful odyssey of knowledge,
wisdom and discovery; especially to my father, who gave up his dreams of higher educa-
tion to support his family.
To the memory of Mary Francis, aunt-like mentor and friend. You will be missed.
Acknowledgements
My mother has been my rock of solace during the ups and downs of my doctoral studies.
This thesis is a testament to her love and support. I would like to dedicate this thesis
to my father, who gave up his dreams of a higher education in order to support his
family. I am fortunate to enjoy the blessings of grandparents and elders, as well as the
support of my siblings and family both in India and the United States, without which
this journey would not be as rewarding. Thank you all.
This thesis would not have been possible without the excellent mentorship of my
primary advisor, Dr. Shri Narayanan. Apart from his able guidance on a range of
research topics, his advice on time and project management, interaction with people,
and leadership will stay with me for the rest of my life. Thank you Shri, for, as you
put it, acting as my agent and looking out for me. I would be remiss if I did not
acknowledge the invaluable advice and support of my other PhD advisors, Drs. Louis
Goldstein and Dani Byrd. Thank you, Louis and Dani, for teaching me the basics
of Phonetics and Phonology, which form a crucial interdisciplinary component of this
thesis, and for making the research process so much more enjoyable. I would also like
to acknowledge Dr. Panos Georgiou for his support and willingness to have an informal
chat at any time about any issue bugging us. Thank you, Pano, for your support and
friendship. The other members of my thesis committee, Drs. Krishna Nayak, Antonio
Ortega and Stefan Schaal have offered crucial and timely insights into multiple aspects
of this thesis. Thank you all.
This thesis has been built on the foundation of excellent teaching of basic concepts
by many brilliant teachers, both at USC as well as NIT Trichy. I would like to specially
acknowledge Dr. Bart Kosko, who tutored me on subjects ranging from probability
and statistics to law and intellectual property. More importantly, he taught me how
to reason critically, communicate effectively, and articulate ideas clearly and concisely.
Thank you, Professor. I would also like to thank Dr. T.V. Sreenivas at the Indian
Institute of Science, Bangalore, for sowing the seeds of research in me. I have come a
long way from the basic speaker recognition ideas that started me on my odyssey to
understand more about speech communication to where I am today. To all my teachers,
thank you all.
Last but not the least, I would like to thank all my friends and colleagues who
have made my doctoral journey unforgettable. My SAIL and SPAN labmates have all
contributed to an intellectually diverse and vibrant environment for both work and play.
Thank you all. I am and will be a SAILer for life. I would like to give a special shout-
out to Vidushak, the Indian improv group at USC. In more ways than one everyone
in Vidushak has been my extended family, serving as a home away from home. I am
privileged to have been an actor and director of Vidushak, and I am indebted to the
filmmaking, theatrical, leadership and general life lessons I have learnt from everyone in
Vidushak during the last six years. I am also thankful to all the members of InQuiZitive,
the Quizzing and Literary Club at USC, for all the conversations and timepass sessions.
A big thank you to all my roommates, neighbors and guinea pigs of my cooking, who've
put up with me and my idiosyncrasies and been incredibly loyal friends. Thank you all.
Table of Contents
Dedication iii
Acknowledgements iv
List of Figures x
List of Tables xviii
Abstract xx
Chapter 1: Introduction 1
Chapter 2: Articulatory data 5
I Knowledge-driven approaches: Leveraging existing models of motor
control to form and test hypotheses that can in turn inform those models 8
Chapter 3: Kinematic planning case study: planning in spontaneous speech 9
3.1 Pauses during spontaneous speech . . . . . . . . . . . . . . . . . . . . . 9
3.2 Data acquisition and preparation . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Chapter 4: Postural motor control case study: articulatory setting 19
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Vocal tract airway contour extraction . . . . . . . . . . . . . . . 23
4.2.3 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.4 Phonetic alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.5 Extracting frames of interest . . . . . . . . . . . . . . . . . . . . 28
4.2.6 Statistical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Results and observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Chapter 5: Articulatory setting and mechanical advantage 40
5.1 Mechanical Advantage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.1 Direct & Differential Kinematics . . . . . . . . . . . . . . . . . . 42
5.2.2 Calculating Mechanical Advantage . . . . . . . . . . . . . . . . . 43
5.2.3 Example: Simulations on a Planar Robot Arm . . . . . . . . . . 44
5.2.4 Extracting frames of interest from production data . . . . . . . . 45
5.2.5 Statistical Analyses . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Discussion & Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 49
II Direct data-driven approaches: Toward optimal representations and
models of speech motor control and planning 58
Chapter 6: Data-driven representations of speech articulation: articulatory move-
ment primitives 59
6.1 Movement primitives and motor control . . . . . . . . . . . . . . . . . . 59
6.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2 Review of data-driven methods to extract movement primitives . . . . . 65
6.3 Validation strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.5 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.5.1 Nonnegative Matrix Factorization and its extensions . . . . . . . 72
6.5.2 Extraction of primitive representations from data . . . . . . . . . 74
6.5.3 Selection of optimization free parameters . . . . . . . . . . . . . 76
6.6 Results and validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.6.1 Quantitative performance metrics. . . . . . . . . . . . . . . . . . 78
6.6.2 Qualitative comparisons of TaDA model predictions with the pro-
posed algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.6.3 Visualization of extracted basis functions . . . . . . . . . . . . . 83
6.7 Discussion and future work . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 7: Exploring production-perception links in speech: Data-driven articula-
tory gesture-like representations retain discriminatory information about phone
categories 98
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.3 Extraction of primitive movements . . . . . . . . . . . . . . . . . . . . . 102
7.4 Algorithm performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4.1 Quantitative metrics . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.4.2 Visualization of extracted basis functions . . . . . . . . . . . . . 105
7.5 Interval-based broad phone classification setup . . . . . . . . . . . . . . 105
7.5.1 Codebook generation . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.5.2 Computing histograms of co-occurrences . . . . . . . . . . . . . . 107
7.5.3 Classification experiments . . . . . . . . . . . . . . . . . . . . . . 107
7.6 Observations and results . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.7 Discussion and outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 8: Control Primitives 120
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.3 Dynamical systems model of the vocal tract . . . . . . . . . . . . . . . . 122
8.4 Computing control synergies. . . . . . . . . . . . . . . . . . . . . . . . . 124
8.4.1 Computing optimal control signals . . . . . . . . . . . . . . . . . 125
8.4.2 Extraction of control primitives . . . . . . . . . . . . . . . . . . . 127
8.5 Experiments and Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.6 Conclusions and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Chapter 9: Conclusions and outlook 136
Bibliography 138
List of Figures
1.1 The integrated state feedback model of speech production (after Hickok
et al. [38]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Conceptual outline of this thesis' contributions. . . . . . . . . . . . . . . 4
3.1 An illustration of the gradient energy calculation process: first, contour
outlines are obtained from the MRI images in panel A, and are then
converted to binary masks (panel B); these are then used to compute the
gradient images (panel C), the energy of which is then calculated by a
simple addition operation (of all white pixels). This serves as a simple
measure of the speed of articulators. . . . . . . . . . . . . . . . . . . . . 13
3.2 Pause length distributions for grammatical and ungrammatical pauses. . 15
3.3 Time-normalized average gradient frame energies (in squared pixels) of
grammatical and ungrammatical pauses and their neighborhoods (blue
bars) pooled across all 7 speakers (Red bars on top of each blue bar rep-
resent standard deviation measurements). Corresponding average local
phone rates (phones/sec) are also shown to the right of each gradient
frame energy panel. Each panel consists of 2 pause groups on the x-axis
1: Grammatical and 2: Ungrammatical. Group 1 consists of, in order, bars
for two neighborhoods immediately before the grammatical pause (~250
ms), followed by one bar for the pause itself (not shown for phone rate
graph), followed by two bars for the neighborhoods following the pause
(~250 ms); this set of five bars is followed by a parallel sequence of five
bars for the ungrammatical pauses (Group 2). . . . 16
3.4 A schematic depicting the levels (grammatical, ungrammatical) and sites
(pre-pause, pause, post-pause) at which the ANOVA statistical analyses
were performed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1 (a) Contour outlines extracted for each image of the vocal tract. Note
the template definition such that each articulator is described by a sep-
arate contour. (b) A schematic depicting the concept of vocal tract area
descriptors or VTADs (adapted from [10]). These VTADs are bounded
by cross-distances (depicted by white lines), and are, in order, from lips
to glottis: lip aperture, tongue tip constriction degree, tongue dorsum
constriction degree, velic aperture, tongue root constriction degree and
the epiglottal-pharyngeal wall cross-distance respectively. . . . 37
4.2 Schematic showing how the jaw angle (denoted by ) is computed. . . 38
4.3 Mean vocal tract images for all speakers calculated on all frames corre-
sponding to different positions in the utterance and speaking style. . . 39
5.1 A schematic illustration of the analysis procedure. . . . . . . . . . . . . 52
5.2 An illustration of modeling with Locally-Weighted Regression (LWR).
For a particular point (black cross) a local region is defined in articulator
space by a Gaussian-shaped kernel (gray dashed curve). A line is fit in
the local region using a weighted least-squares solution, indicated by the
black dashed line. The global fit is generated by repeating this procedure
at a large number of local regions. The resulting fit can be quite complex
(gray curve), and depends on the width of the kernel. . . . . . . . . . . 53
5.3 Planar robot arm configurations corresponding to the top eight (a) high-
est and (b) lowest average Jacobian values. . . . 54
5.4 Configurable articulatory synthesizer (CASY) configurations correspond-
ing to the top eight (a) highest and (b) lowest average Jacobian values. . . . 54
5.5 (a) Cross-distances in more detail (lip aperture (LA), velic aperture (VEL),
and constrictions of the tongue tip (TTCD), tongue dorsum (TDCD) and
tongue root (TRCD)). (b) Articulatory posture variables: jaw angle (JA),
tongue centroid (TC) and length (TL), and upper and lower lip centroids
(ULC and LLC). . . . 55
5.6 Histograms of the sum-squared values of Jacobians computed for different
consonants on speaker Eng5's data. . . . 56
5.7 Histograms of the sum-squared values of Jacobians computed for different
vowel and pause categories on speaker Eng5's data. . . . 57
6.1 Vocal tract articulators (marked on a midsagittal image of the vocal tract). 61
6.2 Gestural score for the word "team". Each gray block corresponds to a
vocal tract action or gesture. See Figure 6.1 for an illustration of the
constricting organs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.3 Flow diagram of TaDA, as depicted in [86]. . . . . . . . . . . . . . . . . 69
6.4 A screenshot of the Task Dynamics Application (or TaDA) software GUI
[after 85]. In the center is the temporal display consisting of gestural
scores that are input to the task dynamic model as well as tract variables
and articulator time-trajectories which the model outputs. Displayed
to the left is the instantaneous vocal tract shape and area function at
the time marked by the cursor in the temporal display. Note especially
the pellets corresponding to different pseudo vocal-tract flesh-points in
the top left display, movements of which are recorded and used for our
experiments. . . . 89
6.5 Schematic illustrating the proposed cNMFsc algorithm. The input ma-
trix V can be constructed either from real (EMA) or synthesized (TaDA)
articulatory data. In this example, we assume that there are M = 7
articulator fleshpoint trajectories. We would like to find K = 5 basis
functions or articulatory primitives, collectively depicted as the big red
cuboid (representing a three-dimensional matrix W). Each vertical slab
of the cuboid is one primitive (numbered 1 to 5). For instance, the white
tube represents a single component of the 3rd primitive that corresponds
to the first articulator (T samples long). The activation of each of these
5 time-varying primitives/basis functions is given by the rows of the ac-
tivation matrix H in the bottom right hand corner. For instance, the 5
values in the t-th column of H are the weights which multiply each of the
5 primitives at the t-th time sample. . . . 90
6.6 Schematic illustrating how shifted and scaled primitives can additively
reconstruct the original input data sequence. Each gold square in the
topmost row represents one column vector of the input data matrix, V,
corresponding to a single sampling instant in time. Recall that our basis
functions/primitives are time-varying. Hence, at any given time instant t,
we plot only the basis functions/primitives that have non-zero activation
(i.e., the corresponding rows of the activation matrix at time t have non-
zero entries). Notice that any given basis function extends T = 4 samples
long in time, represented by a sequence of 4 silver/gray squares each.
Thus, in order to reconstruct, say, the 4th column of V, we need to consider
the contributions of all basis functions that are "active" starting anywhere
between time instant 1 to 4, as shown. . . . 91
6.7 Akaike Information Criterion (AIC) values for different values of K (the
number of bases) and T (the temporal extent of each basis) computed for
speaker fsew0. We observe that an optimal model selection prefers the
parameter values to be as low as possible since the number of parameters
in the model far exceeds the contribution of the log likelihood term in
computing the AIC. . . . 92
6.8 Histogram of the number of non-zero constriction task variables at any
sampling instant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.9 Root mean squared error (RMSE) for each articulator and phone class
(categorized by ARPABET symbol) obtained as a result of running the
algorithm on all 460 sentences spoken by male speaker msak0. . . . . . . 93
6.10 Root mean squared error (RMSE) for each articulator and phone class
(categorized by ARPABET symbol) obtained as a result of running the
algorithm on all 460 sentences spoken by female speaker fsew0. . . . . . 93
6.11 Histograms of the fraction of variance unexplained (FVU) by the pro-
posed cNMFsc model for MOCHA-TIMIT speakers msak0 (left) and
fsew0 (right). The samples of the distribution were obtained for each
speaker by computing the FVU for each of the 460 sentences. (The algo-
rithm parameters used in the model were S_h = 0.65, K = 8 and T = 10). . . . 94
6.12 Original (dashed) and cNMFsc-estimated (solid) articulator trajectories
of selected TaDA articulator variables (left) and EMA (MOCHA-TIMIT)
articulator variables (obtained from speaker msak0) (right) for the sen-
tence "this was easy for us." The vertical axis in each subplot depicts the
value of the articulator variable scaled by its range (to the interval [0,1]),
while the horizontal axis shows the sample index in time (sampling rate =
100 Hz). The algorithm parameters used were S_h = 0.65, K = 8 and
T = 10. See Table 6.1 for an explanation of articulator symbols. . . . 95
6.13 Spatio-temporal basis functions or primitives extracted from MOCHA-
TIMIT data from speaker msak0. The algorithm parameters used were
S_h = 0.65, K = 8 and T = 10. The front of the mouth is located
toward the left hand side of each image (and the back of the mouth on
the right). Each articulator trajectory is represented as a curve traced
out by 10 colored markers (one for each time step) starting from a lighter
color and ending in a darker color. The marker used for each trajectory
is shown in the legend. . . . 96
6.14 Average activation pattern of the K = 8 basis functions or primitives
for (a) voiceless stop consonants, and (b) British English vowels obtained
from speaker msak0's data. For each phone category, 8 colorbars are
plotted, one corresponding to the average activation of each of the 8
primitives. This was obtained by collecting all columns of the activation
matrix corresponding to each phone interval (as well as T - 1 columns
before and after) and taking the average across each of the K = 8 rows. . . . 97
7.1 Schematic of the experimental setup. The input matrix V is constructed
from real (EMA) articulatory data. In this example, we assume
that there are M = 7 articulator fleshpoint trajectories. We would like to
find K = 5 basis functions or articulatory primitives, collectively depicted
as the big red cuboid (representing a three-dimensional matrix W). Each
vertical slab of the cuboid is one primitive (numbered 1 to 5). For in-
stance, the white tube represents a single component of the 3rd primitive
that corresponds to the first articulator (T samples long). The activation
of each of these 5 time-varying primitives/basis functions is given by the
rows of the activation matrix H in the bottom right hand corner. For
instance, the 5 values in the t-th column of H are the weights which multi-
ply each of the 5 primitives at the t-th time sample. The activation matrix
is used as input to the classification module, which consists of 3 steps:
(i) dimensionality reduction using agglomerative information bottleneck
(AIB) clustering, (ii) conversion to a histogram of cooccurrence (HAC)
representation to capture dependence information across timeseries, and
(iii) a final support vector machine (SVM) classifier. . . . 112
7.2 Root mean squared error (RMSE) for each articulator and broad phone
class obtained as a result of running the algorithm on all 460 sentences
spoken by male speaker msak0. . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 Root mean squared error (RMSE) for each articulator and broad phone
class obtained as a result of running the algorithm on all 460 sentences
spoken by female speaker fsew0. . . . 113
7.4 Histograms of the fraction of variance unexplained (FVU) by the pro-
posed cNMFsc model for MOCHA-TIMIT speakers msak0 (left) and
fsew0 (right). The samples of the distribution were obtained by comput-
ing the FVU for each of the 460 sentences. (The algorithm parameters
used in the model were S_h = 0.65, K = 40 and T = 10). . . . 114
7.5 Spatio-temporal basis functions or primitives extracted from MOCHA-
TIMIT data from speaker msak0 corresponding to different English monoph-
thong (first and third columns) and diphthong (second column) vowels.
Each panel is denoted by ARPABET phone symbol. The algorithm pa-
rameters used were S_h = 0.65, K = 40 and T = 10. The front of the
mouth is located toward the left hand side of each image (and the back of
the mouth on the right). Each articulator trajectory is represented as a
curve traced out by 10 colored markers (one for each time step) starting
from a lighter color and ending in a darker color. The marker used for
each trajectory is shown in the legend (see Table 6.1 for the list of EMA
trajectory variables). . . . 115
7.6 Spatio-temporal basis functions or primitives extracted from MOCHA-
TIMIT data from speaker msak0 corresponding to stop (first two rows),
nasal (third row) and approximant (last row) consonants. All rows except
the last are arranged in order of labial, coronal and dorsal consonant,
respectively. Each panel is denoted by ARPABET phone symbol. The
algorithm parameters used were S_h = 0.65, K = 40 and T = 10. The
front of the mouth is located toward the left hand side of each image
(and the back of the mouth on the right). Each articulator trajectory
is represented as a curve traced out by 10 colored markers (one for each
time step) starting from a lighter color and ending in a darker color. The
marker used for each trajectory is shown in the legend. . . . 116
7.7 Spatio-temporal basis functions or primitives extracted from MOCHA-
TIMIT data from speaker msak0 corresponding to fricative and affricate
consonants. Each panel is denoted by ARPABET phone symbol. The
algorithm parameters used were S_h = 0.65, K = 8 and T = 10. The
front of the mouth is located toward the left hand side of each image
(and the back of the mouth on the right). Each articulator trajectory
is represented as a curve traced out by 10 colored markers (one for each
time step) starting from a lighter color and ending in a darker color. The
marker used for each trajectory is shown in the legend. . . . 117
7.8 Mutual information I(A; Ĥ) between the quantized activation space Ĥ and
the space of acoustic features A as a function of the cardinality of Ĥ (in
other words, the number of quantization levels). . . . 118
7.9 Schematic depicting the computation of histogram of articulatory cooc-
currences (HAC) representations. For a chosen lag value τ and a time-
step t, if we find labels m and n occurring τ time-steps apart (marked in
gold), we mark the entry of the lag-τ cooccurrence matrix correspond-
ing to row (m, n) and the t-th column with a 1 (corresponding entry also
marked in gold). We sum across the columns of this matrix (across time)
to obtain the lag-τ HAC representation. . . . 119
8.1 A visualization of the Configurable Articulatory Synthesizer (CASY) in
a neutral position, showing the outline of the vocal tract model. Overlain
are the key points (black crosses) and geometric reference lines (dashed
lines) used to dene the model articulator parameters (black lines and
angles), which are: lip protrusion (LX), vertical displacements of the
upper lip (UY) and lower lip (LY) relative to the teeth, jaw angle (JA),
tongue body angle (CA), tongue body length (CL), tongue tip length
(TL), and tongue angle (TA). . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2 Schematic illustrating the proposed cNMFsc algorithm. The input ma-
trix V can be constructed either from real (EMA) or synthesized (TaDA)
articulatory data. In this example, we assume that there are M = 7
articulator fleshpoint trajectories. We would like to find K = 5 basis
functions or articulatory primitives, collectively depicted as the big red
cuboid (representing a three-dimensional matrix W). Each vertical slab
of the cuboid is one primitive (numbered 1 to 5). For instance, the white
tube represents a single component of the 3rd primitive that corresponds
to the first articulator (T samples long). The activation of each of these
5 time-varying primitives/basis functions is given by the rows of the ac-
tivation matrix H in the bottom right hand corner. For instance, the 5
values in the t-th column of H are the weights which multiply each of the
5 primitives at the t-th time sample. . . . 126
8.3 (a) Histograms of root mean squared error (RMSE) computed on the
reconstructed control signals assuming a known model of the dynamics
using the cNMFsc algorithm over all 972 VCV utterances, and (b) the
corresponding RMSE in reconstructing articulator movement trajectories
from these control signals using Equation 8.5. . . . 129
8.4 (a) Histograms of root mean squared error (RMSE) computed on the
reconstructed control signals (dynamics model learnt using LWPR) using
the cNMFsc algorithm over all 972 VCV utterances, and (b) the corre-
sponding RMSE in reconstructing articulator movement trajectories from
these control signals using Equation 8.5. . . . 130
8.5 Spatio-temporal movements of the articulator dynamical system effected
by 8 different control primitives for a given choice of initial position when
a pre-defined dynamics model is used. Each row represents a sequence of
vocal tract postures plotted at 20 ms time intervals, corresponding to one
control primitive sequence. The initial position in each case is represented
by the first image in each row. The cNMFsc algorithm parameters used
were S_h = 0.65, K = 8 and T = 10. The front of the mouth is located
toward the right hand side of each image (and the back of the mouth on
the left). . . . 132
8.6 Spatio-temporal movements of the articulator dynamical system effected
by 8 different control primitives for an initial position with a low flat tongue
when the dynamics model is learned using the LWPR algorithm. Each
row represents a sequence of vocal tract postures plotted at 20 ms time
intervals, corresponding to one control primitive sequence. The initial
position in each case is represented by the first image in each row. The
cNMFsc algorithm parameters used were S_h = 0.65, K = 8 and T = 10.
The front of the mouth is located toward the right hand side of each
image (and the back of the mouth on the left). . . . 133
8.7 Spatio-temporal movements of the articulator dynamical system effected
by 8 different control primitives for a more dorsal initial tongue position
when the dynamics model is learned using the LWPR algorithm. Each
row represents a sequence of vocal tract postures plotted at 20 ms time
intervals, corresponding to one control primitive sequence. The initial
position in each case is represented by the first image in each row. The
cNMFsc algorithm parameters used were S_h = 0.65, K = 8 and T = 10.
The front of the mouth is located toward the right hand side of each
image (and the back of the mouth on the left). . . . 134
8.8 Median activations of the 8 bases plotted in Figure 8.6 contributing to
the production of different sounds computed over all 972 VCV utterances,
for (a) select stop consonants and (b) selected vowels. . . . 135
List of Tables
2.1 Articulatory measurement techniques. . . . . . . . . . . . . . . . . . . . 7
3.1 Standard deviation (in squared pixels) of the gradient energies for gram-
matical and ungrammatical pauses and their neighborhoods pooled across
all speakers (Here the two 250 ms neighborhoods before and after the pause
are pooled together to get one 500 ms neighborhood before and after). . . 17
4.1 Number of pause samples per speaker used in the statistical analysis. . . 29
4.2 Means and standard deviations of all VTADs, jaw angle (JA), lip aper-
ture (LA), tongue tip constriction degree (TTCD), tongue dorsum con-
striction degree (TDCD), tongue root constriction degree (TRCD) and
velic aperture (VEL) rounded to two significant digits. Also shown are
the results of performing pairwise comparisons between different levels of
the fixed factor. If a pairwise test for a mean is statistically significant
at the 95% level, we indicate this by. Similarly, if a pairwise test for a
standard deviation is significant, ? is used. . . . 31
5.1 Medians of sum-squared values of the Jacobians tabulated by category and
speaker (left). Also shown (right), for each pair of categories, are the num-
ber of speakers (out of 5) that returned a statistically significant difference
on the Mann-Whitney U test for pairwise differences in medians at the
95% level. (Abbreviations: HF = High Front, HB = High Back, LF
= Low Front, LB = Low Back, Lab = Labial, Cor = Coronal, Dor =
Dorsal, App = Approximant). . . . . . . . . . . . . . . . . . . . . . . . . 48
6.1 Articulator fleshpoint variables that comprise the post-processed synthetic
(TaDA) and real (EMA) datasets that we use for our experiments. . . . 71
6.2 Top 5 canonical correlation values between the gestural activation matrix
G (generated by TaDA) and the estimated activation matrix H for both
TaDA and EMA cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7.1 Performance of various feature sets on an interval-based phone classification
experiment (after appropriate transformation to HAC representations).
For clarity of understanding we also show the entropy of the feature set
X, denoted by H(X), along with the mutual information between the
feature set and the phone labels L (39 classes), denoted by I(X; L), in each
case. We also performed classification experiments on a 5-class broad
phone set L_broad (where each of the original 39 phones was categorized
as either vowel, stop, fricative, nasal or approximant). . . . 110
Abstract
This thesis proposes a balanced framework toward understanding speech motor planning
and control by observing aspects of its behavioral execution. To this end, it proposes
representing, modeling, and analyzing real-time speech articulation data from both `top-
down' (or knowledge-driven) as well as `bottom-up' (or data-driven) perspectives.
The first part of the thesis uses existing knowledge from linguistics and motor con-
trol to extract meaningful representations from real-time magnetic resonance imaging
(rtMRI) data, and further, posit and test specific hypotheses regarding kinematic and
postural planning during pausing behavior. In the former case, we propose a measure
to quantify the speed of articulators during pauses as well as during their immediate
neighborhoods. Using appropriate statistical analysis techniques, we find support for
the hypothesis that pauses at major syntactic boundaries (i.e., grammatical pauses),
but not ungrammatical (e.g., word search) pauses, are planned by a high-level cognitive
mechanism that also controls the rate of articulation around these junctures. In the
latter case, we present a novel automatic procedure to characterize vocal posture from
rtMRI data. Statistical analyses suggest that articulatory settings differ during rest
positions, ready positions and inter-speech pauses, and might, in that order, involve an
increasing degree of active control by the cognitive speech planning mechanism. We
show that this may be due to the fact that postures assumed during pauses are signif-
icantly more mechanically advantageous than postures assumed during absolute rest.
In other words, inter-speech postures allow for a larger change in the space of motor
control tasks/goals for a minimal change in the articulatory posture space as compared
to postures at absolute rest. We argue that such top-down approaches can be used to
augment models of speech motor control.
The second part of the thesis presents a computational, data-driven approach to
derive interpretable movement primitives from speech articulation data in a bottom-
up manner. It puts forth a convolutive Nonnegative Matrix Factorization algorithm
with sparseness constraints (cNMFsc) to decompose a given data matrix into a set of
spatiotemporal basis sequences and an activation matrix. The algorithm optimizes a
cost function that trades off the mismatch between the proposed model and the in-
put data against the number of primitives that are active at any given instant. The
method is applied to both measured articulatory data obtained through electromagnetic
articulography (EMA) as well as synthetic data generated using an articulatory synthe-
sizer. The thesis then describes how to evaluate the algorithm performance quantita-
tively and further performs a qualitative assessment of the algorithm's ability to recover
compositional structure from data. The results suggest that the proposed algorithm
extracts movement primitives from human speech production data that are linguisti-
cally interpretable. We further examine how well derived representations of "primitive
movements" of speech articulation can be used to classify broad phone categories, and
thus provide more insights into the link between speech production and perception.
We finally show that such primitives can be mathematically modeled using nonlinear
dynamical systems in a control-theoretic framework for speech motor control. Such a
primitives-based framework could thus help inform practicable theories of speech motor
control and coordination.
CHAPTER 1
Introduction
Speech or spoken language production is arguably the most complicated motor act per-
formed by any species, and is in many ways uniquely human. In order to produce the
variations in air pressure and flow that we hear while listening to speech, the vocal
tract apparatus must be controlled and coordinated such that the acoustic variations in
the signal reliably encode desired linguistic information [66]. Understanding how hu-
mans control and coordinate speech production is critical to understanding the speech
communication process. This understanding is also essential to diagnosing and rehabil-
itating patients who suffer from disorders of the speech motor system (e.g., apraxia) or
other biomechanical impairments to the system, e.g., people who have had part of
their vocal apparatus removed or damaged following cancer (glossectomy of the tongue).
Further, from a technological standpoint, such an understanding of speech motor con-
trol can help inform speech synthesis models, possibly leading to the generation of more
realistic-sounding synthetic speech.
This complex act can be analyzed along a number of levels of organization and rep-
resentative processes: cognitive (transformation of an abstract concept into a signal
encoding linguistic structure), neural (how encoding is achieved at the level of neu-
rons), neuromotor (how articulatory subsystems interact to produce coordinated pat-
terns within a dynamic behavioral environment), and acoustic (signal characteristics
arising from complex aerodynamic manipulations of the vocal apparatus) [33, 74]. We
will analyze the speech motor control system at a broad algorithmic level. For instance,
consider the integrated state feedback model proposed by Hickok et al. [38] (see Figure
1.1). The model consists of a feedforward branch in which the motor controller sends sig-
nals to coordinate movements of the vocal articulators, which in turn produce speech. In
addition, there is an internal model of the vocal tract which makes forward predictions
about the dynamic state of the vocal tract and also allows feedback control of the sys-
tem. It is important to realize that each block in Figure 1.1 is a complex component
in its own right, and thus an examination of the full system is beyond the
scope of this thesis. We therefore do not explicitly consider the internal model of the
vocal tract in our analysis. Instead, we will focus on the relation between the controller
and observed behavior (articulatory movements) and see how we can move towards
better models of the controller by examining this link in both a top-down as well as a
bottom-up manner.
Figure 1.1: The integrated state feedback model of speech production (after Hickok et
al. [38]).
Figure 1.2 schematizes the contributions of this thesis and how it fits into the broader
context of speech motor control modeling. We propose a balanced two-pronged approach
to the problem of designing better control models: (1) a knowledge-driven `top-down'
approach that uses existing knowledge from linguistics and motor control to extract
meaningful representations from articulatory data, and further, posit and test specific
hypotheses regarding kinematic and postural planning during pausing behavior; and (2)
a data-driven `bottom-up' approach to deriving `primitive' articulatory movements that
can be used to build models of coordination of speech movements that can potentially
operate in an inherently lower-dimensional control subspace. In the next few chapters,
we will first describe the articulatory data we use that allows us to (partially) observe
the behavioral movement of various vocal tract articulators (Chapter 2), followed by two
top-down case studies pertaining to kinematic (Chapter 3) and postural motor planning
(Chapters 4 and 5). We will then describe a more data-driven framework for extracting,
interpreting and validating articulatory movement primitives (Chapter 6) and examine
links between production and perception within this framework (Chapter 7). We further
extend this framework to the control domain (Chapter 8). Finally, we will present future
extensions to existing work and an outlook on how we can move toward better models
of speech motor control (Chapter 9).
Figure 1.2: Conceptual outline of this thesis' contributions.
CHAPTER 2
Articulatory data
In order to develop realistic models of speech motor control, it is imperative to have
access to real-time measurements of articulatory movements during speech production.
Table 2.1 lists current state-of-the-art articulatory measurement techniques and their
relative advantages and disadvantages. Techniques that have been used to measure
articulation include x-ray microbeam (XRMB) [132], electropalatography (EPG), elec-
tromagnetic articulography (EMA) [135] and ultrasound [133]. These techniques, al-
though some are invasive, are able to capture articulatory information at high sampling
rates. However, none of these modalities offers a complete view of all vocal tract ar-
ticulators at a sufficient spatial resolution, which is important for studying vocal tract
posture. More recently, developments in real-time MRI (rtMRI) have allowed for an
examination of shaping along the entirety of the vocal tract during speech production
and provide a means for quantifying the choreography of the articulators, including
structural/morphological characteristics of speakers in conjunction with their articulation
and acoustics [87]. However, rtMRI has an intrinsically lower frame rate than the
other modalities. Another potential challenge in studies of articulatory setting (AS)
using rtMRI is the effect of gravity due to the supine position subjects assume in order
to be scanned using MRI. In an x-ray microbeam study of two Japanese subjects, [123]
concluded that the supine posture caused non-critical articulators to fall with gravity
(avoiding unnecessary effort opposing gravity), while critical articulators (with
acoustically sensitive targets) are held in position even if against gravity. Observed
posture effects were greatest for sustained vowel production but were minimal for
running speech production.
For the studies described in this thesis, we primarily use articulatory data obtained
through the EMA and rtMRI modalities. These data are time-synchronized with corre-
sponding noise-cancelled acoustic speech recordings. Specific details of the database
used and subsequent processing performed in each case will be presented alongside the
studies themselves in the following chapters.
Table 2.1: Articulatory measurement techniques.
Characteristic | XRMB | EMA | Ultrasound | EPG | rtMRI
Order of typical sampling rate (Hz) | 100 | 500 | 50 to 300 | 100 | 20 to 30
Relative spatial resolution | Low | Low | Medium | High | High
View of vocal tract | Fleshpoints | Fleshpoints | Tongue | Tongue-palate contact | Full view
Supine position? | No | No | No | No | Yes
Invasive? | Yes | Yes | No | Yes | No
Example database (with citation) | Wisconsin x-ray microbeam database [132] | Edinburgh MOCHA database [135] | Haskins HOCUS database [133] | Edinburgh MOCHA database [135] | USC MRI-TIMIT database [88]
Part I
Knowledge-driven approaches:
Leveraging existing models of motor
control to form and test hypotheses
that can in turn inform those models
CHAPTER 3
Kinematic planning case study: planning in spontaneous
speech
This chapter and the next will present knowledge-driven `top-down' approaches that use
existing knowledge from linguistics and motor control to extract meaningful represen-
tations from articulatory data, and further, posit and test specific hypotheses regarding
kinematic and postural planning during pausing behavior. In this chapter, we will focus
on the hypothesis that grammatical pauses during spontaneous speech involve some
kinematic planning by the central speech controller.
3.1 Pauses during spontaneous speech
Pausing in natural speech can be considered from a listener perspective (how do pauses
aid or impair speech understanding?) or from a speaker perspective (how do pauses
reflect the speech planning process, either operating well or encountering difficulties?).
In this chapter, we use real-time MRI to provide a non-invasive view of the entire length
of the moving vocal tract and to examine pauses from a speech production perspective.
Pauses can be broadly categorized into planned or grammatical pauses and unplanned or
ungrammatical pauses. (For our purposes, we make no distinction between planned and
grammatical, and likewise unplanned and ungrammatical pauses.) Grammatical pauses
generally occur at the boundary of a clause, presumably due to the need to parse and
plan the sentence. Ungrammatical pauses can indicate a breakdown in composing the
speech stream and occur at inappropriate locations as the planning, production, and/or
lexical access process is disrupted (see [93, 104]).
The framework of Articulatory Phonology [12, 13], in conjunction with the prosodic-
gesture model [14] of phrase boundaries, offers one approach for considering the nature
of grammatical and ungrammatical pauses in articulation. In this framework, the act
of speaking is decomposable into atomic units of vocal tract action, called gestures, that
can be defined as an equivalence class of goal-directed movements, such as those made
by a set of articulators in the vocal tract (see the Task Dynamics model, [109]). Byrd
and Saltzman [14] view phrase junctures as phonologically planned intervals of controlled
local slowing of speech timing around a phrase edge, with the articulatory slowing
increasing as the boundary approaches and the speech stream resuming speed as the
boundary recedes (i.e., immediately post-boundary). This clock slowing, at its extreme,
can be understood to result in a pause, as the clock controlling articulation slows to a
near-stop and then speeds up again as the post-pause interval is initiated. In contrast,
ungrammatical pauses (which may be filled or unfilled depending on the state of voicing)
abruptly interrupt the execution of the planned speech stream, interfering with the vocal
tract articulators reaching their targets. Under this approach, a grammatical pause, then,
is viewed as a planned event under cognitive control with explicit consequences for
the spatiotemporal behavior of the articulators over an interval; consequences that are
distinct from ungrammatical pauses that abruptly perturb articulation. In this chapter,
we will examine direct articulatory evidence for this hypothesis.
Although pauses can contribute information about speech planning, few joint acous-
tic and articulatory studies of pausing behavior have been carried out, one reason being
the difficulty of acquiring data on vocal tract movement during running speech. Recent
progress in real-time magnetic resonance imaging (MRI) [87] allows for a more compre-
hensive investigation of pauses in speech than does study of the acoustic signal alone.
Since the technique allows for a complete view of the moving vocal tract, providing syn-
chronized audio in conjunction, it is possible to examine the supraglottal articulators
during not only the spoken portion but also the silent portions of the speech stream.
3.2 Data acquisition and preparation
The data we examined comprise spontaneous speech utterances and the corresponding
time-synchronized movies of the moving vocal tract, elicited in response to queries from
the experimenter. Seven healthy native speakers of American English were asked to
answer simple questions on general topics (such as "what music do you listen to?", "tell
me more about your favorite cuisine", etc.) while lying inside an MRI scanner. For each
of the stimulus questions, time-synchronized audio responses and MRI videos of speech
articulation were recorded for 30 seconds. Further details regarding the recording/imaging
setup can be found in Narayanan et al. [87] and Bresch et al. [11]. Midsagittal real-time
MR images of the vocal tract were acquired with an MR pulse repetition time of TR
= 6.5 ms on a GE Signa 1.5T scanner with a 13-interleaf spiral gradient echo pulse
sequence. The slice thickness was approximately 3 mm. A sliding window reconstruc-
tion at a rate of 22.44 frames per second was employed. The field of view (FOV) was
adjusted depending on the subject's head size, so that images covered an area of 18.4
cm by 18.4 cm at a resolution of 68 by 68 pixels.
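A couple of quantities implied by these acquisition parameters can be computed directly, as in the short Python sketch below; the constant names are ours (hypothetical, for illustration only), and the values are the ones quoted in the paragraph above.

```python
# Derived quantities from the rtMRI acquisition parameters quoted above.
FOV_CM, MATRIX = 18.4, 68          # field of view (cm) and image matrix size (pixels)
TR_MS, N_INTERLEAVES = 6.5, 13     # pulse repetition time (ms) and spiral interleaves

print(f"in-plane resolution: {FOV_CM * 10 / MATRIX:.2f} mm/pixel")          # ~2.7 mm
print(f"time for one full set of interleaves: {TR_MS * N_INTERLEAVES:.1f} ms")  # 84.5 ms
# A sliding-window reconstruction nevertheless yields ~22.44 frames per second.
```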
For the manual annotation of the audio waveform for this experiment, a grammatical
pause was defined to be a silent or filled pause that occurred between overt syntactic
constituents (including sentence end). Examples include pauses at (1) clause boundaries
such as relative clause boundaries, (2) subject-verb or verb-object boundaries, and (3)
prepositional phrases offset from another constituent. Any pause other than the above,
i.e., generally those occurring within a clause, was marked as an ungrammatical pause.
Such pauses are atypical in natural speech and do not mark the juncture between
obvious syntactic or semantic word-groups in the sentence; they do not appear to encode
linguistic information. For each speaker's utterances, grammatical and ungrammatical
pauses were manually annotated by the first author according to this definition and
verified by a linguist for accuracy.
3.3 Analyses
In order to examine the articulatory characteristics at and around pauses, we extract
a gradient energy measure that captures the speed of articulatory motion (of all
articulators) from the image sequences. In order to study the time evolution
of vocal tract shaping, for each set of image sequences, the air-tissue boundary of the
articulatory structures needs to be clearly delineated. This contour tracing process is
time-consuming and tedious when carried out by a human, so an algorithm using Fourier
region segmentation was used to automatically carry out the task (see [10]).
In order to observe articulatory effects of pausing behavior more comprehensively,
it is important to study articulator dynamics not only in the pause frames, but also in
the interval preceding and following the pause, particularly since models such as the
prosodic-gesture model [14] predict spatiotemporal effects during neighboring intervals.
Since most appreciable effects, including construction of a rough plan for the utterance
[53], occur in a time window of 500 ms before and after the pause, neighborhoods of
that order were analyzed for global range of movements of articulators. Since in our
experimental setup the frame rate is about 22.44 frames per second, this approximately
translates to neighborhoods consisting of about 12 frames. Thus, for analysis, although
the length (in number of frames) of each pause was variable, the analysis neighborhoods
before and after the pause were of fixed lengths.
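As a quick sanity check on the neighborhood sizing just described, the conversion from a 500 ms window to a frame count at the reported reconstruction rate is a one-line computation; the Python sketch below (with constant names of our own choosing) makes the arithmetic explicit.

```python
# Convert the 500 ms pre-/post-pause analysis windows into frame counts at the
# sliding-window reconstruction rate reported above (22.44 frames per second).
FRAME_RATE_HZ = 22.44   # reconstructed video frame rate
WINDOW_SEC = 0.5        # 500 ms neighborhood on each side of the pause

frames_per_window = FRAME_RATE_HZ * WINDOW_SEC
print(round(frames_per_window))  # 11, i.e. roughly the 12-frame neighborhoods used here
```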
Figure 3.1: An illustration of the gradient energy calculation process: first, contour
outlines are obtained from the MRI images in panel A, and are then converted to
binary masks (panel B); these are then used to compute the gradient images (panel
C), the energy of which is then calculated by a simple addition operation (of all white
pixels). This serves as a simple measure of the speed of articulators.
Once the contour outlines have been extracted from the MR images, they are used
to create binary mask images, with all pixels enclosed by these contour outlines as-
signed a normalized value of one, and the rest zeroes, such that the mid-sagittal
section of the vocal tract appears white on a black background. A gradient energy
measure was calculated (see Figure 3.1 for a schematic) for every pair of contiguous
mask images in a pause/neighborhood frame sequence, by subtracting them, taking
the absolute value of the difference, and computing the pixel energy of the result
(by finding the number of pixels of value 1). The overall gradient energy value for a
pause/neighborhood is then computed by averaging over all gradient energies obtained
during the pause/neighborhood period. This is done to obtain an entropy measure that
can capture variability in articulator movement and thus give an estimate of the speed
of articulatory motion during such periods, which this measure does well on a global
level. In a similar manner, one can compute delta gradient measures that will capture
the acceleration of the articulators during pauses/neighborhoods.
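The sketch below is a minimal NumPy rendering of this gradient energy measure. It assumes the contour outlines have already been rasterized into a sequence of binary masks stacked in a (frames, height, width) array; the function names and array layout are illustrative, not taken from the original processing pipeline.

```python
import numpy as np

def gradient_energy(masks):
    """Mean gradient frame energy for a pause/neighborhood frame sequence.

    `masks` is a (T, H, W) array of binary vocal-tract masks (1 = tissue
    enclosed by the contour outlines, 0 = background), one per video frame.
    """
    masks = np.asarray(masks, dtype=float)
    # Absolute difference of every pair of contiguous masks: pixels that
    # changed between frames show up as ones in the gradient image.
    gradients = np.abs(np.diff(masks, axis=0))
    # Pixel energy of each gradient image = number of "white" (changed) pixels.
    energies = gradients.reshape(len(gradients), -1).sum(axis=1)
    # Average over the pause/neighborhood period -> a global speed estimate.
    return energies.mean()

def delta_gradient_energy(masks):
    """One possible reading of the 'delta gradient' (acceleration) measure:
    the same computation applied to second differences of the masks."""
    masks = np.asarray(masks, dtype=float)
    deltas = np.abs(np.diff(masks, n=2, axis=0))
    return deltas.reshape(len(deltas), -1).sum(axis=1).mean()
```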
A two-factor parametric analysis of variance (ANOVA) was conducted on the depen-
dent variable of gradient energy with data pooled across speakers and with the factors:
site (levels: pre-pause, pause, post-pause) and grammaticality (levels: grammatical,
ungrammatical). It should be noted that since the number of occurrences of grammat-
ical and ungrammatical pauses (especially the latter) was too small for a repeated
measures ANOVA, analyses were carried out with data pooled across speakers.
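One way to set up such a two-factor ANOVA in Python is sketched below with statsmodels; the data frame is filled with randomly generated placeholder values purely so the snippet runs, standing in for the per-pause and per-neighborhood gradient energies pooled across speakers.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Placeholder observations (illustrative values only): one gradient-energy
# measurement per pause or neighborhood, pooled across speakers.
rng = np.random.default_rng(0)
sites = ["pre-pause", "pause", "post-pause"]
gram_levels = ["grammatical", "ungrammatical"]
rows = [(s, g, rng.normal(900, 400))
        for s in sites for g in gram_levels for _ in range(20)]
df = pd.DataFrame(rows, columns=["site", "grammaticality", "gradient_energy"])

# Two-factor ANOVA on gradient energy with factors site and grammaticality
# (including their interaction).
model = ols("gradient_energy ~ C(site) * C(grammaticality)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```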
3.4 Results
Histograms of grammatical and ungrammatical pause durations across all speakers were
plotted, and while these pauses cannot be reliably separated based on duration values
alone, a statistical t-test indicated that grammatical pauses tended to be significantly
longer on average (p < 0.01); the mean and standard deviation of pause durations were
found to be 13.62 frames and 8.2 frames respectively for grammatical, and 9.76 frames
and 5.8 frames respectively for ungrammatical pauses (1 frame = 0.0446 seconds). See
Figure 3.2.
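The duration comparison reported above can be reproduced in outline with SciPy as follows; the frame counts below are placeholders (the real values come from the manually annotated pauses of all seven speakers), and the frame-to-seconds conversion uses the 0.0446 s frame duration quoted in the text.

```python
import numpy as np
from scipy import stats

FRAME_SEC = 0.0446  # duration of one reconstructed MRI frame, in seconds

# Placeholder per-pause durations in frames, standing in for the annotated data.
grammatical_frames = np.array([14, 9, 21, 12, 17, 8, 25, 11])
ungrammatical_frames = np.array([7, 12, 9, 6, 14, 8, 10])

# Two-sample t-test on pause durations (Welch's variant, which does not
# assume equal variances across the two pause categories).
t_stat, p_value = stats.ttest_ind(grammatical_frames, ungrammatical_frames,
                                  equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print(f"mean grammatical pause:   {grammatical_frames.mean() * FRAME_SEC:.2f} s")
print(f"mean ungrammatical pause: {ungrammatical_frames.mean() * FRAME_SEC:.2f} s")
```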
Figure 3.2: Pause length distributions for grammatical and ungrammatical pauses.
The time-normalized average gradient frame energy for each pause and for the neigh-
borhoods before and after it, pooled across all 7 speakers, is plotted as a bar graph in
Figure 3.3. Corresponding time-normalized average local phone rates are also plotted to
the right of each of these graphs. The derived measure of mean gradient frame energy
captures articulator speeds well and is a good indicator of the local phone rate (assuming
that articulator speeds directly inform the local phone rate to a certain extent).
Figure 3.3: Time-normalized average gradient frame energies (in squared pixels) of
grammatical and ungrammatical pauses and their neighborhoods (blue bars) pooled
across all 7 speakers (red bars on top of each blue bar represent standard deviation
measurements). Corresponding average local phone rates (phones/sec) are also shown
to the right of each gradient frame energy panel. Each panel consists of 2 pause groups
on the x-axis, 1: Grammatical and 2: Ungrammatical. Group 1 consists of, in order,
bars for two neighborhoods immediately before the grammatical pause (~250 ms),
followed by one bar for the pause itself (not shown for the phone rate graph), followed
by two bars for the neighborhoods following the pause (~250 ms); this set of five bars
is followed by a parallel sequence of five bars for the ungrammatical pauses (Group 2).
In order to examine the effects of speech planning on the structure of pauses, we
have to examine how the gradient frame energies vary moving into and out of the pause.
Figure 3.4 schematically summarizes the statistical comparisons (double-headed arrows)
performed on the data. Significant differences (p < 0.01) were found between the means
of the gradient energies in the (pre-pausal) neighborhood before grammatical pauses and
those of the pauses themselves. That is, the gradient energy means of the grammatical
pauses themselves were significantly lower than these pre-pausal neighborhood gradient
energy means. In contrast, ungrammatical pauses displayed no significant differences
between the means of the pre-pausal and pausal gradient energies. However, there
was a significant increase in the gradient energy for the neighborhood immediately
following both grammatical and ungrammatical pauses, which was often slightly higher
than the pre-pausal energy value. There were no significant differences in the means
of the grammatical and ungrammatical (i) pause gradient energies or (ii) pre-pausal
neighborhood gradient energies. However, ungrammatical pauses showed a significantly
higher variation in the values of gradient energies compared to the grammatical case,
especially during and after the pause (see Table 3.1), which is expected, since such a
pause would hypothetically serve to interrupt the flow of speech (irrespective of the
speech rate), and hence would have a much higher gradient energy variance compared
to grammatical pauses.
Figure 3.4: A schematic depicting the levels (grammatical, ungrammatical) and sites
(pre-pause, pause, post-pause) at which the ANOVA statistical analyses were performed.
Table 3.1: Standard deviation (in squared pixels) of the gradient energies for grammat-
ical and ungrammatical pauses and their neighborhoods pooled across all speakers (Here
the two 250 ms neighborhoods before and after the pause are pooled together to get one
500 ms neighborhood before and after).
Grammatical Pauses Ungrammatical Pauses
500ms before Pause 500ms after 500ms before Pause 500ms after
366.54 369.25 423.57 374.95 445.37 538.11
For both grammatical and ungrammatical pauses, there was no trend found that distinguished filled from unfilled pauses, as the gradient energies of these cases can be highly context-dependent. In some cases, the filled pause gradient energies for some speakers were much higher than their unfilled counterparts, while the opposite effect was found for other speakers. Furthermore, there was no observed trend of gradient frame energy variation within a pause, be it grammatical or ungrammatical, filled or unfilled, suggesting that instantaneous values of these gradient energies may be context-dependent.
The results obtained suggest that grammatical pauses are part of a more globally choreographed plan of articulatory movement, since at the pause, the speed of the articulators drops (as indicated by the reduction in mean gradient energy), which finally, towards the end, increases to around the level where it was at the start of the pause. Also, the results suggest that ungrammatical pauses are essentially unplanned, with the articulator speed dropping slightly (but not significantly) early into the pause, following which there is a sudden jump in the gradient energy. In addition, there is a large variance associated with the gradient energy values in this case (Table 3.1). Such pauses in spontaneous speech, which occur without the linguistic structuring of the speaker, are characterized by a sudden increase of articulator speed on a global level when the speaker eventually succeeds in lexical access or planning.
3.5 Conclusions
In this paper, our principal hypothesis that grammatical pauses are a result of a higher-level cognitive plan of articulatory movement, while ungrammatical ones are not, has been validated via direct observation of articulatory behavior. Measures that help distinguish between the two in the articulatory domain were developed.

It has long been recognized that pauses are relevant to cognitive processing and are related to affect, style, and lexical and grammatical structure (e.g., [55, 61, 104, 137]). Direct observation of articulation along the entire vocal tract offers an important new source of data for investigation of speech planning, since it allows a view of how the speech flow is altered in either a cognitively planned way or interrupted by a perturbation when normal planning fails. It can also inform as to how much time it takes to recover from the effect of a sudden unplanned pause that perturbs the linguistic structural integrity of the utterance.
CHAPTER 4
Postural motor control case study: articulatory setting
Along the lines of the knowledge-driven case study presented in the previous chapter, here we present another case study regarding postural planning during pausing behavior. Specifically, the primary objective of this chapter is to explore the relatively understudied phenomenon of articulatory setting in human speech production and obtain insight into the characteristics of its postural motor control using real-time vocal tract imaging data.
4.1 Introduction
Articulatory setting (also called phonetic setting, organic basis of articulation, or voice quality setting; henceforth referred to as AS) may be defined as the set of postural configurations (which can be language-specific and/or speaker-specific) that the vocal tract articulators tend to be deployed from and return to in the process of producing fluent and natural speech [23, 39, 58, 121]. A postural configuration might be, for example, a tendency to keep the lips in a rounded position throughout speech, or a tendency to keep the body of the tongue slightly retracted into the pharynx while speaking [59].
Historically AS has been the subject of linguists' intrigue, but due to the lack of reliable articulation measurement techniques, it has not been studied extensively until recently [for example, see studies by 32, 77, 134].¹ [90] and [97] have postulated the existence of AS-like default positions for speech by observing vocal tract postures during speech pauses as opposed to absolute rest positions. [97] further mentions a "pre-speech" or "speech-ready" posture that vocal tract articulators tend to assume as the speaker gets ready to speak. However, issues pertaining to the nature of control exercised by the speech "planner"² during the execution of these postures have not been addressed yet in a comprehensive manner using speech articulation data. For example, what are the articulatory or acoustic variables that are controlled to achieve these postures? How variable is this control at different points in the utterance (as measured by an appropriate function of the control variables, e.g., variance)? Most studies of AS have focused on differences observed in AS between different languages such as English and French [see 77, for a review]. This study, in contrast, focuses on understanding the manifestations of AS within spoken American English, considering the effects of speaking style (read vs. spontaneous) and position within an utterance and analyzing its postural motor control characteristics. Further, we look specifically at postures assumed during silent pauses both before speech (absolute rest and speech-ready) as well as during speech. Since the acoustic correlates of these are silences, this process eliminates confounds due to motor control goals for different sounds to a large extent.
Articulatory setting is closely related to the concept of voice quality [59]. Voice quality has been defined in the literature as the characteristic auditory coloring of an individual speaker's voice that reflects characteristic traits of the speaker such as identity, personality, health and emotional state [59, 118, 119]. Different settings may impose specific patterns of use of the speech organs, resulting in different `voice qualities'. Supralaryngeal articulatory settings in combination with laryngeal articulatory settings might generate formant and harmonic structure of the acoustic speech signal that impart a particular voice quality to the speech signal. Note however that this AS study focuses only on the supralaryngeal vocal tract. At every time instant, different ASs might exert a varying influence on the tract shape depending on the transient context conditions. This might bias the formant structure of the acoustic speech signal towards a particular global timbre, thus imparting a particular voice quality to the speech signal.

¹ Note that prior to the development of articulation measurement techniques, acoustic quantities such as the long-term average spectrum were used to study AS. While the consequences of AS may also be acoustic in nature, since this is an inherently articulatory phenomenon, we choose to focus on articulation measurement techniques.

² By the term "speech planner," we mean a cognitive control system that directs and regulates the behavior of the speech motor apparatus.
AS has been variously discussed as a language-specific phonological phenomenon or a functional by-product of the execution of the speech plan. [32] have argued for the existence of a language-specific AS and have further speculated that speech rest positions are specified in a manner similar to actual speech targets. They compared the standard deviations of vocal tract measurements taken during inter-utterance rest positions to those taken from the target vowel /i/ to test whether the accuracy of movements into an inter-utterance rest position was similar to that of a specified articulatory target and not just a transition position solely determined by the immediately surrounding sounds. They found no significant differences in the standard deviations of the two groups, leading them to suggest that a language's inter-speech posture may be linguistically specified as part of the phonetic or phonological inventory of the language in question. Further exploration of AS with respect to position in the utterance and speaking style could have important implications for understanding the speech motor planning process, especially in models of motor planning following a `constraint hierarchy,' i.e., a set of prioritized goals defining the task to be performed [e.g., 106].
In this paper, we present a novel method to analyze articulatory setting (and vocal tract posture in general). We further apply the proposed method to answer the following three broad questions: (1) Do ASs assumed during grammatical inter-speech pauses (ISPs) differ from an absolute resting vocal tract position and, further, from a speech-ready posture [or pre-speech posture, after 97]? (2) What can be inferred regarding the degree of active control exerted by the cognitive speech planner (as measured by the variance of appropriate variables that capture vocal tract posture) in each case? (3) Does articulatory setting vary between read and spontaneous speech?
4.2 Method
4.2.1 Data
Five female native speakers of American English were engaged in a simple dialog with the experimenter on topics of a general nature (e.g., "what music do you listen to ...", "tell me more about your favorite cuisine ...," etc.) to elicit spontaneous spoken responses while inside the MR scanner. For each speech turn, audio responses and MRI videos of vocal tract articulation were recorded for 30 seconds and time-synchronized with the audio. The same speakers were also recorded/imaged while reading TIMIT shibboleth sentences and the rainbow passage during a separate scan. The spontaneous and read speech data represent the two speaking styles considered in this study. Details regarding the recording and imaging setup can be found in [87] and [11]. Midsagittal real-time MR images of the vocal tract were acquired with a repetition time of TR = 6.5 ms on a GE Signa 1.5T scanner with a 13-interleaf spiral gradient echo pulse sequence. The slice thickness was approximately 3 mm. A sliding window reconstruction at a rate of 22.4 frames per second was employed. Field-of-view (FOV), which can be thought of as a zoom factor, was set depending on the subject's head size. Since MRI scanners generate a lot of noise, the recorded audio was post-processed using a custom noise-cancellation algorithm [11] before use. Further details and sample MRI movies can be found at http://sail.usc.edu/span.
4.2.2 Vocal tract airway contour extraction
We automatically extracted the air-tissue boundary of the articulatory structures using an algorithm that hierarchically optimizes the observed image data fit to an anatomically informed object model using a gradient descent procedure [10]. The object model consists of 3 regions (R1, R2 and R3 in Figure 4.1a) corresponding to the mandible-tongue, pharyngeal wall and upper head. We chose the object model such that air-tissue boundaries of different regions of interest, such as the palate, tongue, velum, pharyngeal wall, etc., are each defined by a dedicated poly-line contour. For each image to be segmented, we initialized the optimization process with a single manually-traced contour outline for a vocal tract posture that corresponds to the /"/ vowel and then hierarchically optimized the fit in three steps: (1) only allowing rotation and translation of the entire three-region geometry, thus compensating for head motion; (2) allowing for rotation and translation within the three regions, to fit the contour outlines to the specific vocal tract shape; and finally (3) allowing for independent movement of all poly-lines in all regions to make the fit more accurate. The algorithm takes a long time to run but provides good results overall [see 10, for examples]. Note however that poor signal-to-noise ratio in the lower pharyngeal region compromises the quality of the segmentation at times. Thus, structures like the epiglottis are not segmented accurately in some frames. For this reason, we perform an outlier-removal procedure, i.e., we do not consider frames with contour shapes whose Euclidean distance from the mean contour shape is greater than three standard deviations.
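To make the outlier-removal step concrete, the minimal sketch below shows one way such a screen could be implemented. It is illustrative only: the array layout (one row of flattened x-y contour coordinates per frame) and the particular reading of the three-standard-deviation criterion (thresholding the distribution of frame-to-mean distances) are assumptions, not the thesis implementation.

```python
import numpy as np

def remove_contour_outliers(contours, n_std=3.0):
    """Drop frames whose contour shape lies far from the mean shape.

    contours : (n_frames, n_points) array, each row a flattened set of
               x-y contour coordinates for one MRI frame (assumed layout).
    Returns the retained frames and a boolean mask of kept frames.
    """
    mean_shape = contours.mean(axis=0)
    # Euclidean distance of every frame's contour from the mean contour.
    dists = np.linalg.norm(contours - mean_shape, axis=1)
    # One reading of the criterion: keep frames whose distance lies within
    # n_std standard deviations of the mean distance.
    keep = dists <= dists.mean() + n_std * dists.std()
    return contours[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(100, 400))   # 100 synthetic frames
    frames[3] += 50.0                      # simulate a badly segmented frame
    kept, mask = remove_contour_outliers(frames)
    print(kept.shape, int((~mask).sum()), "frame(s) discarded")
```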
4.2.3 Feature extraction
In this section, we explain how relevant features for AS measurement were extracted from the MRI videos using the automatically-determined air-tissue boundary information, and how they were used for visualization and inference. Desirable characteristics of AS features are that (1) they should sufficiently characterize vocal tract postures, (2) they should be robust to rotation and translation, and to inaccuracies introduced by the contour extraction procedure, (3) they should involve as little manual intervention as possible, and (4) they should allow for meaningful comparison across speakers.
First, let us briefly review some measures that have been used in the literature to capture vocal tract posture. A popular measure is the aperture function or area function [65, 69, 78]. This is obtained by first imposing a semi-polar grid on the midsagittal image of the vocal tract and then finding the intersections between each gridline and the vocal tract contour outlines found earlier. Finally, the distances between the intersection coordinates on each gridline are computed, right from the lips to the glottis, and this ordered set of cross-distances is used as a feature vector to capture vocal tract posture. Note that although elegant, this procedure suffers from one major disadvantage: it is only semi-automatic, since one has to manually choose the initial parameters of the semi-polar grid to be fitted to the vocal tract (such as the number of gridlines, spacing between gridlines, and gridline orientation angle, to name a few). This also means that there is minimal guarantee that one will be able to compare gridlines at the same position across different subjects. [32] instead chose to measure specific cross-distances in their AS study. They manually measured from X-ray films the following cross-distances: pharynx width, velic aperture, tongue body distance from the hard palate (or tongue dorsum constriction degree), tongue tip distance from the alveolar ridge (or tongue tip constriction degree), lower-to-upper jaw distance, and the upper and lower lip protrusion. Both techniques mentioned above rely on the accurate computation of cross-distances.
We decided to extract features similar to some of those extracted by [32], but in an automatic manner that will be described below. Further, we will append to these a set of area features that capture the airway shape. These features are computed in a manner such that they are comparable across subjects. Also, since they are areas and not point-measures (like distances), they are more robust to noise in the contour tracking procedure. First, we will describe the computation of the cross-distance features given the vocal tract contour outlines corresponding to an MRI image. We compute the following cross-distances: lip aperture, velic aperture, tongue tip constriction degree, tongue dorsum constriction degree and tongue root constriction degree. These cross-distances are represented by white lines in Figure 4.1b. Lip aperture is computed as the minimum distance between the contours corresponding to the lower and upper lip. Similarly, velic aperture is calculated as the minimum distance between the velum and pharyngeal wall contours. Notice that this is possible since the upper and lower lips, the velum and the pharyngeal wall are each defined by their own contour (Figure 4.1a). However, in the case of the tongue-related cross-distances, the computation is not as straightforward. This is because in these cases it is not clear how the coordinates on the palate and pharyngeal wall must be chosen such that the cross-distances computed to the tongue are both meaningful and reproducible across subjects. To solve this problem, we first compute "constriction locations": points on the palate and pharyngeal wall where the vocal tract can be maximally (and ideally, completely) constricted. For example, during the production of coronal stops like /t, d/, the tongue tip makes contact with the alveolar ridge, resulting in a palatal point of zero distance between tongue and palate. Thus, by isolating a /t/ or /d/ token, and finding the coordinate location of palatal contact, we can find the constriction location for that frame. This process can be repeated for all /t, d/ tokens in the database, and the mean of these coordinates can be found to give us a mean tongue tip constriction location on the palate. Once this coordinate location is found, we can compute the tongue tip constriction degrees for all MRI frames as the minimum distance from that coordinate location to the tongue. In a similar manner, we can compute the tongue dorsum constriction degree by first finding the mean palatal point of contact for all dorsal stops like /k, g/ and then computing the minimum distance from that point to the tongue for all frames. The tongue root constriction degree computation is more challenging since there is no pharyngeal stop in English. In this case we consider tokens where the tongue is maximally (but not completely) constricted with respect to the pharyngeal wall, like the low back vowel /a:/. Finally, for each frame, we compute the lowermost boundary of the vocal tract as the minimum distance between the root of the epiglottis and the pharyngeal wall contour. This is for purposes of computing areas only (described in the next paragraph). Note that the cross-distances derived are independent of any particular theory of speech motor control; they are computed at points where constrictions are made in the vocal tract during normal speech production. Hence they are conducive to meaningful comparison across subjects.
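As an illustration of how such minimum-distance cross-measures can be computed from poly-line contours, a sketch using NumPy/SciPy is given below. The contour format (arrays of x-y points), the function names, and the toy data are assumptions for illustration, not the thesis code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def min_cross_distance(contour_a, contour_b):
    """Minimum Euclidean distance between two poly-line contours, each an
    (n_points, 2) array of x-y coordinates. Usable for measures such as
    lip aperture (lower vs. upper lip) or velic aperture."""
    return cdist(contour_a, contour_b).min()

def constriction_location(tongue_contours, palate_contour):
    """Mean palatal point of (near-)contact over a set of closure frames,
    e.g. the tongue contours of all /t, d/ tokens."""
    points = []
    for tongue in tongue_contours:
        d = cdist(palate_contour, tongue)
        points.append(palate_contour[d.min(axis=1).argmin()])
    return np.mean(points, axis=0)

def constriction_degree(tongue_contour, location):
    """Minimum distance from a fixed constriction location to the tongue."""
    return np.linalg.norm(tongue_contour - location, axis=1).min()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    palate = np.column_stack([np.linspace(0, 10, 50), np.full(50, 5.0)])
    tongues = [np.column_stack([np.linspace(0, 10, 50),
                                rng.uniform(3.5, 5.0, 50)]) for _ in range(4)]
    loc = constriction_location(tongues, palate)
    print("tongue tip constriction degree:", constriction_degree(tongues[0], loc))
```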
Once these cross-distances are computed, we can use them to "partition" the airway into four areas: A1, A2, A3 and A4 (see Figure 4.1b). (Note that although we are not using the cross-distance between the epiglottis and pharyngeal wall as a feature, we need to define it in order to compute A4.) We call these features vocal tract area descriptors or VTADs. We compute the numerical value of the area enclosed by each polygon by invoking the planar form of Stokes' Theorem. Consider a simply connected area in the plane and any two functions P(x, y) and Q(x, y) [9]. Then Stokes' Theorem says:

\int_{area} \left( -\frac{\partial P}{\partial y} + \frac{\partial Q}{\partial x} \right) dx\,dy = \int_{boundary} (P\,dx + Q\,dy).    (4.1)

Applying the theorem to the case of a polygon and substituting P = 0, Q = x gives:

\int_{area} da = \int_{boundary} x\,dy.    (4.2)

If the polygon's vertices are specified in x-y coordinates and numbered counter-clockwise from 1 to N, then we obtain the expression:

\text{Area} = \frac{1}{2}\left( x_1 y_2 - x_2 y_1 + x_2 y_3 - x_3 y_2 + \cdots + x_N y_1 - x_1 y_N \right).    (4.3)
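Equation 4.3 is the familiar shoelace formula; a minimal implementation is sketched below (illustrative only, with the counter-clockwise vertex ordering assumed, as in the text).

```python
import numpy as np

def polygon_area(vertices):
    """Signed area of a simple polygon via the shoelace formula (Eq. 4.3).

    vertices : (N, 2) array of x-y coordinates listed counter-clockwise;
               a clockwise ordering would simply flip the sign.
    """
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)

if __name__ == "__main__":
    # Unit square listed counter-clockwise: area = 1.0.
    square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
    print(polygon_area(square))
```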
Once these areas (VTADs) are obtained, we can formalize the differences in vocal tract shaping more concretely. This is because there is large variability in different realizations of the same utterance, especially in spontaneous speech [see, for e.g., 45], that cannot be effectively captured by the cross-distance features alone. The area features allow us to capture this variability in speech production while providing more robustness than cross-distance features to rotation and shift errors.
Lastly, we compute the jaw angle as the obtuse angle between linear regression lines fitted to the pharyngeal wall contour and the chin contour (see Figure 4.2). This is a robust measure of jaw displacement since the pharyngeal wall has been shown to be relatively rigid [70].
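A sketch of this jaw-angle computation is given below: straight lines are fit to the two contours by least squares and the obtuse angle between them is returned. The contour parameterization and example values are assumptions for illustration.

```python
import numpy as np

def jaw_angle(pharynx_contour, chin_contour):
    """Obtuse angle (degrees) between regression lines fitted to the
    pharyngeal wall and chin contours, each an (n_points, 2) x-y array."""
    def direction(contour):
        # Slope of the least-squares line y = a*x + b, as a unit vector.
        a, _ = np.polyfit(contour[:, 0], contour[:, 1], 1)
        v = np.array([1.0, a])
        return v / np.linalg.norm(v)

    v1, v2 = direction(pharynx_contour), direction(chin_contour)
    acute = np.degrees(np.arccos(np.clip(abs(np.dot(v1, v2)), 0.0, 1.0)))
    # The two lines form supplementary angle pairs; report the obtuse one.
    return 180.0 - acute

if __name__ == "__main__":
    pharynx = np.column_stack([0.1 * np.arange(20.0), np.arange(20.0)])
    chin = np.column_stack([np.arange(20.0), 0.3 * np.arange(20.0)])
    print(round(jaw_angle(pharynx, chin), 1))
```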
4.2.4 Phonetic alignment
Although we analyze articulatory setting directly from the MRI image sequences, the noise-canceled audio signal is important in that we use it to phonetically align the synchronized signals (given sentence-level transcriptions) of the data corpus using the SONIC automatic speech recognizer [95].³ The alignment accuracy score returned by the automatic speech recognizer gives an indication of the alignment quality. We empirically find that an alignment score of 90% and above corresponds to good alignment quality. When the alignment score falls below this value, we perform a second-pass manual correction of these alignments. We observe that most misalignments occurred at the beginning of the utterance and were apparent on manual inspection of the alignments using an appropriate software editor. These are mainly due to the presence of a noise burst caused by the MRI scanner gradients turning on, long before the subject starts to speak. The final alignments obtained after the manual correction are then used to determine time-boundaries of inter-speech pauses (ISPs) and utterance onsets and endings.

³ Note that other alignment tools such as SailAlign [49] are freely available for phonetics research.
4.2.5 Extracting frames of interest
We automatically extract all frames of ISPs⁴ from the read and spontaneous speech samples [see 101]. For the purposes of this study, we consider only grammatical ISPs, i.e., silent or filled pauses that occur between overt syntactic constituents (including sentence end). In other words, we exclude pauses that are due to hesitation or word-search. Also note that we do not control for phonetic context adjacent to these pause boundaries. This is because we want to observe characteristics of articulatory setting during these pauses that are generic, i.e., not specific to any particular phonetic context. In addition, we extract `speech-ready' frames from each image sequence immediately before an utterance (a window of 100-200 ms before the start of the utterance as determined by phonetic alignment). Finally, we also extract the first and last frames of each utterance's MRI data acquisition interval as representatives of absolute rest position⁵ in the two speaking styles.

⁴ The SONIC speech recognizer uses a general heuristic of 170 ms between words before detecting and labeling a pause between those words.

⁵ Since subjects are cued to start speaking after they hear the MRI system "switch on," we assume that the speaker's articulators will be in a "rest" position for the first frame of every acquisition.

Table 4.1: Number of pause samples per speaker used in the statistical analysis.

Speaker   Rest   Speech-ready   Read ISP   Spon ISP
Eng1        5         21            26        371
Eng2        9         20            31        201
Eng3        8         21            56        221
Eng4       10         22            52        256
Eng5       25         77           395        554
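The sketch below illustrates the bookkeeping involved in pulling out image frames for such intervals, given pause boundaries in seconds from the forced alignment and the 22.4 frames-per-second reconstruction rate. It is a simplified illustration (with approximate rounding at interval edges), not the pipeline used in the thesis.

```python
FRAME_RATE = 22.4  # MRI reconstruction rate (frames per second)

def frames_for_interval(t_start, t_end, frame_rate=FRAME_RATE):
    """Approximate indices of MRI frames falling inside [t_start, t_end)."""
    first = int(round(t_start * frame_rate))
    last = int(round(t_end * frame_rate))
    return list(range(first, max(first + 1, last)))

def speech_ready_frames(utterance_onset, window=(0.2, 0.1)):
    """Frames in a window 100-200 ms before the utterance onset."""
    return frames_for_interval(max(0.0, utterance_onset - window[0]),
                               max(0.0, utterance_onset - window[1]))

if __name__ == "__main__":
    # Hypothetical grammatical ISP from 3.41 s to 3.87 s in an alignment.
    print(frames_for_interval(3.41, 3.87))
    print(speech_ready_frames(1.50))
```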
For all extracted frames for a given speaker, we compute cross-distances (namely, lip aperture, velic aperture, tongue tip constriction degree, tongue dorsum constriction degree and tongue root constriction degree), VTADs (areas A1-A4), and jaw angle. We then normalize each variable by its range such that the transformed variable takes values between 0 and 1. For example, if the tongue root constriction degree has a minimum value of 0.7 units and a maximum value of 2.5 units, then these values will correspond to 0 and 1 respectively after transformation. This allows us to compare variables across speakers while accounting for speaker-specific attributes, such as vocal tract geometry and gender. In addition, this type of transformation allows for more interpretable comparisons between different categories. These variables are the dependent variables used for subsequent statistical analysis.
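The range normalization described above is a per-variable min-max rescaling computed over each speaker's data; a minimal sketch follows (illustrative, with a hypothetical array layout of one column per dependent variable).

```python
import numpy as np

def range_normalize(X):
    """Min-max rescale each column of X (frames x variables) to [0, 1].
    Applied per speaker so that speaker-specific vocal tract geometry
    does not dominate cross-speaker comparisons."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

if __name__ == "__main__":
    # e.g. a tongue root constriction degree ranging from 0.7 to 2.5 units
    trcd = np.array([[0.7], [1.6], [2.5]])
    print(range_normalize(trcd).ravel())   # -> [0.  0.5 1. ]
```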
4.2.6 Statistical analysis
We use the SPSS software to conduct all statistical analyses. For each dependent variable, we perform a 2-way parametric analysis of variance (α = 0.05) to test the null hypotheses that the means of all samples of that variable extracted for each speaker (random factor) and for each inter-speech pause type based on speaking style (fixed factor with 4 levels: read ISP, spontaneous ISP, rest and ready positions) are equal.⁶ We further perform post-hoc Tukey tests (α = 0.05) to test for differences in means, and Levene's tests (α = 0.05) to test for differences in standard deviations, between different levels of the fixed factor. Table 4.1 shows the number of samples of each dependent variable extracted for different pause types for all 5 speakers. Note that the imbalance in number of data samples is not by design but rather a characteristic of the data corpus.

⁶ We used Kolmogorov-Smirnov tests to test the dependent variable samples for normality assumptions. Many of the variables do not pass the test. Hence we perform non-parametric Kruskal-Wallis tests and post-hoc Mann-Whitney U tests in these cases, results of which are found to conform to those of the parametric ANOVA analysis. Hence we report the latter for uniformity.
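For readers who prefer a scriptable alternative to SPSS, a roughly equivalent analysis could be set up in Python as sketched below. This is only an approximation of the design described above (speaker is treated here as a plain blocking factor rather than a random effect), and the data-frame column names are hypothetical.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import levene

def analyze(df, dependent):
    """df has columns 'speaker', 'pause_type' (rest/ready/read_isp/spon_isp)
    and one column per normalized dependent variable (assumed layout)."""
    # Two-way ANOVA on pause type and speaker (speaker as a blocking factor).
    model = smf.ols(f"{dependent} ~ C(pause_type) + C(speaker)", data=df).fit()
    anova = sm.stats.anova_lm(model, typ=2)

    # Post-hoc Tukey test for pairwise mean differences across pause types.
    tukey = pairwise_tukeyhsd(df[dependent], df["pause_type"], alpha=0.05)

    # Levene's test for differences in variance across pause types.
    groups = [g[dependent].values for _, g in df.groupby("pause_type")]
    lev_stat, lev_p = levene(*groups)
    return anova, tukey, (lev_stat, lev_p)
```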
4.3 Results and observations
Table 4.2 summarizes the means and standard deviations of the dependent variables as well as the result of pairwise statistical significance tests conducted at the 95% level. For these tests, statistically significant differences in means (p < 0.05) are indicated by asterisks (*), and in the case of variances, by stars (⋆). Keep in mind that each variable is expressed as a percentage of its range, with 0 being the minimum value and 1 being the maximum value that the variable assumes over the entire corpus of speech data for each speaker.
Table 4.2: Means and standard deviations of all VTADs, jaw angle (JA), lip aperture (LA), tongue tip constriction degree (TTCD), tongue dorsum constriction degree (TDCD), tongue root constriction degree (TRCD) and velic aperture (VEL), rounded to two significant digits. Also shown are the results of performing pairwise comparisons between different levels of the fixed factor. If a pairwise test for a mean is statistically significant at the 95% level, we indicate this by an asterisk (*). Similarly, if a pairwise test for a standard deviation is significant, a star (⋆) is used.

Variable   Position   Mean (in % range)   SD (in % range)   Significance (pairwise vs. Ready / Read / Spon)
A1         Rest       0.31                0.17              ⋆ ⋆
           Ready      0.41                0.11              ⋆
           Read       0.38                0.10              ⋆
           Spon       0.38                0.15
A2         Rest       0.28                0.18              ⋆
           Ready      0.43                0.16              ⋆
           Read       0.40                0.09              ⋆
           Spon       0.42                0.16
A3         Rest       0.28                0.20              ⋆ ⋆
           Ready      0.36                0.17
           Read       0.29                0.15
           Spon       0.39                0.15
A4         Rest       0.37                0.18              ⋆ ⋆
           Ready      0.60                0.14              ⋆
           Read       0.60                0.14              ⋆
           Spon       0.51                0.19
JA         Rest       0.33                0.18              ⋆ ⋆
           Ready      0.58                0.14              ⋆
           Read       0.48                0.12              ⋆
           Spon       0.50                0.17
LA         Rest       0.23                0.25              ⋆ ⋆
           Ready      0.51                0.19              ⋆
           Read       0.42                0.18              ⋆
           Spon       0.47                0.26
TTCD       Rest       0.24                0.23              ⋆ ⋆
           Ready      0.36                0.19              ⋆
           Read       0.29                0.15              ⋆
           Spon       0.34                0.18
TDCD       Rest       0.21                0.19
           Ready      0.35                0.19              ⋆
           Read       0.34                0.19              ⋆
           Spon       0.36                0.16
TRCD       Rest       0.41                0.18              ⋆ ⋆
           Ready      0.55                0.14              ⋆ ⋆
           Read       0.59                0.11              ⋆
           Spon       0.52                0.16
VEL        Rest       0.23                0.24              ⋆
           Ready      0.18                0.15              ⋆ ⋆
           Read       0.22                0.19
           Spon       0.20                0.21
The first important result we observed is that vocal tract postures adopted during absolute rest positions are more extreme and significantly different from those adopted during inter-speech pauses (ISPs). In other words, the mean values of all dependent variables other than the velic aperture during both read and spontaneous ISPs are significantly higher than those during non-speech rest intervals, indicating adoption of a more closed vocal tract position, with a relatively small jaw angle and a narrow pharynx, at absolute rest compared to ASs adopted just prior to speech (speech-ready) and during speech (ISPs). These differences can be qualitatively observed in Table 4.3. This may indicate that during the non-speech rest interval the tongue is resting somewhat more nestled in the pharynx of the individual and that the mouth is quite closed.

Secondly, these rest positions also display relatively high variances compared to the ready and ISP positions (significant in many cases). This trend is especially seen for the read ISPs. This may indicate that rest positions are not under active control in the way that the ready and read ISP intervals presumably are. Note however that the small number of samples per speaker might also give rise to the large variability observed; that said, we observed a similar large variability in rest-position means even in the case of subject Eng5, where we have a larger sample size, suggesting that this variability might be due to a stochastic source that is not under active cognitive control.

We further note that the means of the dependent variables calculated for ISP intervals do not differ consistently in large measure from those calculated for speech-ready intervals. However, notice that the mean A1, A2, jaw angle, lip aperture, tongue tip constriction degree and tongue root constriction degree are significantly larger for speech-ready postures compared to read and spontaneous ISPs. This suggests that the vocal tract is slightly more open on average as the speaker is getting ready to speak. This trend is clearly seen in the case of speaker Eng4 in Table 4.3. Postures adopted during read ISPs also exhibit lesser variability (as measured by the variance of the dependent variables) than is observed for speech-ready postures (significantly so in many cases), and far less than that observed for absolute rest postures. This may indicate a trend for the control regimes during the active read speech intervals, including pauses, being far stricter than the rest intervals and somewhat stricter than the speech-ready intervals. In other words, the articulators may be under active control during ISPs occurring within utterances [as suggested by 14].
Thirdly, we note significant differences between postures adopted for read and spontaneous speech. Spontaneous ASs have a slightly higher jaw (larger jaw angle), along with higher values of the A2 VTAD and lower values of the A4 VTAD. This is consistent with spontaneous ASs being characterized by a relatively elevated jaw and lowered tongue position as compared to ASs in read speech. Given that recent studies [e.g., 101] have presented quantitative articulatory evidence of linguistic and motor speech planning differences in different speaking styles, the current work provides more knowledge about how the constraints on the speech motor control system vary from formal read speech to spontaneous discourse.
4.4 Discussion
We have presented a methodology to capture vocal tract posture (and thus analyze articulatory setting) that is generally robust to rotation and translation, involves little manual intervention, and allows for meaningful comparison across speakers. This is important since traditional methods that capture vocal tract posture, such as area functions, although elegant, suffer from certain disadvantages; for instance, they are generally semi-automatic and difficult to generalize across different subjects.

There remain several areas for improvement and open research questions. For instance, one limitation of this study is that it only looks at the questions of articulatory setting within a small female subject sample. Whether these results generalize across gender and for a larger, more balanced subject pool has not yet been examined, and is a subject for future research. From an algorithmic perspective, the method relies on vocal tract contours to derive postural features; hence a robust segmentation of the vocal tract is required prior to feature extraction.
Articulatory setting is a relevant concept from multiple theoretic perspectives. Supralaryngeal articulatory settings in combination with laryngeal articulatory settings might generate formant and harmonic structure of the acoustic speech signal that impart a particular voice quality to the speech signal. Such ideas can be cast into a communication-theoretic framework such as that proposed by [125]. The theory suggests that speech signals are the result of manipulating articulatory gestures such that they modulate a phonetically-neutral "carrier" signal that captures the voice quality and, in turn, the AS of the speaker. This suggests that a comprehensive understanding of AS and its production is necessary in order to understand the speaker-dependent and speaker-invariant characteristics of speech. Further, let us consider the implications from a speech motor control perspective. An important question in speech planning is the extent of control exerted by the cognitive speech planner as an utterance (read or spontaneous) progresses. Earlier in this paper, we observed that ASs during rest positions, ready positions and read inter-speech pauses, in that order, exhibit a trend for decreasing variability and thus a possibly increasing degree of active control by the cognitive speech planning mechanism. Another question central to theories of motor control is whether the human brain explicitly codes for higher task-level parameters [for example, articulator movement directions in task coordinates 109] or for intrinsic parameters such as muscle forces [see for e.g., 27]. A deeper understanding of AS might help inform this question; one way to examine this further would be to model AS within a dynamical systems model of speech motor control such as the Task Dynamics model [14, 109]. The Task Dynamics model provides an explicit model of articulatory setting (AS). In this approach, articulatory setting can be modeled using the concept of a "neutral attractor" (an attractor is a set of stable states towards which a variable moving according to the dictates of a dynamical system evolves over time). Each articulator in the model articulator space is associated with such an attractor, and the result of the entire set of them is a neutral vocal tract configuration that can be language-specific. This is important since evidence has been put forth in the literature for language-specific ASs, as mentioned earlier. In this model, achievement of a phonetic target task (either a constriction degree or location) is controlled by a dynamical system consisting of an active attractor that achieves the task and neutral attractors associated with each articulator in the task's coordinative structure. Without a neutral attractor, articulators could simply remain `stuck' in a constricted posture if not called away by another gesture. For example, in simulation experiments performed by [110], the release of a gesture can be governed by the neutral attractor corresponding to the articulators in question, whose activation strength trajectory is simply the complement of the active constriction's activation strength. But this kind of passive release has since been shown to be inadequate to capture the observed kinematics [84].
With regard to speech planning and execution, it would be useful to understand whether AS is a phonological ("targeted") phenomenon or whether it is a by-product of the execution of the speech plan. Gick and colleagues argue that if the AS for a language is indeed determined to be a specific target, then this target must be acquired and stored as part of the phonological inventory associated with that language [134, pg. 228]. However, the observations presented in the present paper suggest that this issue might be much more complex. This is especially the case for ASs associated with read versus spontaneous ISPs, where we observe significant postural differences between speaking styles. This raises the question that if a particular language's AS is indeed specified in the language's grammar or phonological inventory, then do separate such ASs have to be learnt (in the same inventory) for different speaking styles? At this point, the nature of specification of an AS is an intriguing area for future research.
In conclusion, we have presented a novel automatic procedure to analyze articulatory setting in speech production. We have further demonstrated using rt-MRI measurements of vocal tract posture that (1) articulatory settings are significantly different for default rest postures as compared to speech-ready and inter-speech pause postures; (2) there is a trend, significant in several cases, for variance in AS to differ between inter-speech pauses, which appear to be more controlled in their execution, as compared to rest and speech-ready postures; and (3) read and spontaneous speaking styles also exhibit differences in articulatory setting.
Figure 4.1: (a) Contour outlines extracted for each image of the vocal tract. Note the template definition such that each articulator is described by a separate contour. (b) A schematic depicting the concept of vocal tract area descriptors or VTADs (adapted from [10]). These VTADs are bounded by cross-distances (depicted by white lines), which are, in order from lips to glottis: lip aperture, tongue tip constriction degree, tongue dorsum constriction degree, velic aperture, tongue root constriction degree and the epiglottal-pharyngeal wall cross-distance, respectively.

Figure 4.2: Schematic showing how the jaw angle is computed.
[Figure 4.3 image grid: one row per subject (Eng1-Eng5), with columns for Rest, Ready, ISP (Read) and Gram. ISP (Spon) positions.]

Figure 4.3: Mean vocal tract images for all speakers calculated on all frames corresponding to different positions in the utterance and speaking style.
CHAPTER 5
Articulatory setting and mechanical advantage
In the previous chapter, we observed that vocal postures assumed by people at absolute rest were significantly different and more variable as compared to postures during inter-speech pauses. However, the reason for this difference is not immediately clear. In this chapter, we examine how some postures in a language may be more "mechanically advantageous" than other postures for motor control and action, using the concept of speed ratio or mechanical advantage.
5.1 Mechanical Advantage
If speech motor control is optimized, in any sense of the term, it is reasonable to expect that key controlled postures have important mechanical advantages. Because AS represents a base posture for deploying speech articulators, it might ideally provide some mechanical optimality and/or advantage toward achieving speech motor tasks. Mechanical optimization can take many forms, depending on the situation and whether a dynamical (e.g., force/energy expenditure) or purely kinematical (e.g., duration of movements) perspective is being considered. Given the rapidity of motor actions associated with human speech, an important mechanical advantage might be the speed with which motor tasks can be achieved. Kinematic criteria of this kind have been quantitatively explored in a wide variety of mechanical systems, everything from simple levers to robot arms, as the speed ratio, which is the ratio of velocities in task space (the space of goal variables of motor control) to those in articulatory postural space (the space of controllable variables) [2, 112]. Ratios with large numerical values are said to be mechanically advantageous because small changes in postures can result in relatively large changes toward tasks. Perhaps the simplest example of this situation is provided by a class two lever, which amplifies force and speed on different sides of the fulcrum according to the ratio of lengths of those sides. Indeed, amplification of force and speed are the same under the assumption of preservation of power from articulators to tasks:
\text{Power} = F_{task} \, v_{task} = F_{artic} \, v_{artic},    (5.1)
where F_{task}, v_{task}, F_{artic}, v_{artic} are the forces and speeds associated with the tasks and articulators, respectively. Following directly from this, it is possible to write more precisely that speed ratios are connected to mechanical advantage, as follows:
MA = \frac{F_{task}}{F_{artic}} = \frac{v_{artic}}{v_{task}}.    (5.2)
Equation 5.2 states the classical "Law of the Lever" discovered by Archimedes. More generally, mechanical advantage quantifies how the rate of change of input variables (in this case measured by the articulator speed, v_{artic}) affects the rate of change of output/goal variables (as measured by the task speed v_{task}). In other words, if a system has a large mechanical advantage, that implies that a small change in the space of articulators/controllable parameters results in a large change in the space of tasks/motor control goals.
With this in mind, the specific hypothesis of this study can be stated more precisely. The central hypothesis of this study is that postures assumed during pauses in speech, as well as speech-ready postures, have a much higher overall mechanical advantage or speed ratio when compared with postures at absolute rest. In other words, inter-speech postures allow for a larger change in the space of motor control tasks/goals for a minimal change in the articulatory posture space as compared to postures at absolute rest. This study is aimed at quantitatively testing this hypothesis using articulatory vocal tract data of real human speech acquired with rtMRI. Postures are described in terms of the spatial location of various speech articulators as well as the angle of the jaw, while task/goal variables are considered to be constriction degrees at various points along the vocal tract. Please see the Methods section for further details.

In sum, we aim to answer the following question in this paper: are articulatory settings adopted during speech production more advantageous than rest positions in a kinematic sense, i.e., in facilitating efficient motor control of vocal tract articulators?
5.2 Method
5.2.1 Direct & Differential Kinematics
Given a vector q, representing n low-level articulator variables of the system, and a vector x, representing m high-level task variables of the system, the relationship between them is commonly expressed by the direct kinematics equation, of the form:

x = f(q)    (5.3)
where the function f(·) represents the forward map, a transformation from articulator space to task space. In addition to the direct kinematics, we are interested in the differential kinematics, which relate articulator space velocities to task space velocities. Of particular interest is modeling f(·) so as to facilitate derivation of the Jacobian matrix:

J(q) = \begin{bmatrix} \partial x_1 / \partial q_1 & \cdots & \partial x_1 / \partial q_n \\ \vdots & \ddots & \vdots \\ \partial x_m / \partial q_1 & \cdots & \partial x_m / \partial q_n \end{bmatrix}    (5.4)
The Jacobian is a compact representation of the posture-specific first-order partial derivatives of the forward map. These can also be interpreted as speed ratios. The Jacobian allows us to conveniently write the differential kinematics equation in the following way:

\dot{x} = J(q) \, \dot{q}    (5.5)
5.2.2 Calculating Mechanical Advantage
It is possible to accurately estimate the kinematics of the vocal tract in a data-driven fashion. It was recently shown that Locally-Weighted Linear Regression (LWR) is useful for this purpose, producing accurate estimation and offering practical advantages [56]. LWR is a method that uses locally-defined, low-order polynomials to approximate globally nonlinear functional relationships. See Figure 5.2 for a graphical illustration of this function approximation technique. Training the model has a closed-form solution via the generalized least squares solution. The method also has few free parameters, making accurate training even more feasible. The result is a Jacobian matrix, relating the velocities of each task to each articulator. Each value in the Jacobian represents a speed ratio that could be used to characterize MA in the system. As an overall measure of MA, we computed the sum of squares of all Jacobian values. We also computed the condition number of the Jacobian matrix, defined as the ratio of the largest to the smallest eigenvalue.
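A minimal sketch of this estimation step is given below: a Gaussian kernel defines the local neighborhood around a query posture, a weighted least-squares fit gives the local linear map, and its slope matrix serves as the Jacobian estimate, from which the sum-of-squares and condition-number summaries are computed. The variable names, kernel bandwidth, and the use of singular values for the conditioning measure are illustrative assumptions, not the exact formulation in [56].

```python
import numpy as np

def lwr_jacobian(Q, X, q_star, bandwidth=0.5):
    """Estimate the task-space Jacobian at posture q_star with
    locally-weighted linear regression.

    Q : (T, n) array of articulator/posture variables (training data)
    X : (T, m) array of task variables (e.g., constriction degrees)
    q_star : (n,) query posture at which the Jacobian is evaluated
    """
    # Gaussian kernel weights centred on the query posture.
    w = np.exp(-0.5 * np.sum((Q - q_star) ** 2, axis=1) / bandwidth ** 2)
    A = np.column_stack([Q - q_star, np.ones(len(Q))])   # local linear model
    sw = np.sqrt(w)[:, None]
    # Weighted least squares; the slope rows form the Jacobian estimate.
    coeffs, *_ = np.linalg.lstsq(sw * A, sw * X, rcond=None)
    return coeffs[:-1].T                                  # shape (m, n)

def mechanical_advantage_summary(J):
    """Overall MA measures used in the text: the sum of squared Jacobian
    entries, and the conditioning of J (computed here from singular values)."""
    s = np.linalg.svd(J, compute_uv=False)
    return float(np.sum(J ** 2)), float(s.max() / s.min())
```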
5.2.3 Example: Simulations on a Planar Robot Arm
In order to better understand and visualize how the notion of mechanical advantage might be useful in describing the posture of a motor system and understanding its control, we ran simulations on kinematic models of a multi-link planar robot arm with revolute joints. We would expect that postures with the highest mechanical advantage are more open, while postures with the lowest mechanical advantage are more constricted.

The simulated robot arm used in this application comprised three rigid links, three revolute joints and a single end-effector. The links were labeled from the base to the end-effector, and the lengths of these links were fixed to the values l_1 = 1.0, l_2 = 0.62 and l_3 = 0.38. The corresponding joint angles were labeled in similar fashion as q_1, q_2 and q_3. The Jacobian for this system can be written as:
J(q, l) = \begin{bmatrix} -l_1 s_1 - l_2 s_{12} - l_3 s_{123} & -l_2 s_{12} - l_3 s_{123} & -l_3 s_{123} \\ l_1 c_1 + l_2 c_{12} + l_3 c_{123} & l_2 c_{12} + l_3 c_{123} & l_3 c_{123} \end{bmatrix}    (5.6)

where s_{abc} and c_{abc} are shorthand for the sine and cosine of angle summations; for example, s_{123} = \sin(q_1 + q_2 + q_3) and c_{123} = \cos(q_1 + q_2 + q_3).
Simulations involved manipulating the angles of the revolute joints to produce a wide variety of arm postures, and subsequently calculating the mechanical advantage from the Jacobian at each of those postures. The joint angles considered for each joint spanned the range from 0 to 2π in increments of π/16. All possible combinations of angles for the three joints were considered, creating uniform coverage of joint space. This large number of considered postures was sorted by their respective mechanical advantage values, and the highest- and lowest-valued postures were selected for visualization.
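The simulation loop is easy to reproduce; the sketch below builds the Jacobian of Equation 5.6, sweeps the joint angles as described, and ranks postures by the sum-squared-Jacobian measure. It is a simplified re-implementation for illustration, not the original simulation code.

```python
import itertools
import numpy as np

L = (1.0, 0.62, 0.38)  # link lengths l1, l2, l3

def arm_jacobian(q, l=L):
    """Jacobian of the 3-link planar arm end-effector position (Eq. 5.6)."""
    q1, q12, q123 = q[0], q[0] + q[1], q[0] + q[1] + q[2]
    s1, s12, s123 = np.sin(q1), np.sin(q12), np.sin(q123)
    c1, c12, c123 = np.cos(q1), np.cos(q12), np.cos(q123)
    return np.array([
        [-l[0]*s1 - l[1]*s12 - l[2]*s123, -l[1]*s12 - l[2]*s123, -l[2]*s123],
        [ l[0]*c1 + l[1]*c12 + l[2]*c123,  l[1]*c12 + l[2]*c123,  l[2]*c123],
    ])

if __name__ == "__main__":
    angles = np.arange(0.0, 2 * np.pi, np.pi / 16)        # 0 to 2*pi, step pi/16
    postures = list(itertools.product(angles, repeat=3))  # uniform joint-space grid
    ma = [np.sum(arm_jacobian(q) ** 2) for q in postures]
    order = np.argsort(ma)
    print("lowest-MA posture :", postures[order[0]], "MA =", round(ma[order[0]], 3))
    print("highest-MA posture:", postures[order[-1]], "MA =", round(ma[order[-1]], 3))
```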
Figure 5.3 shows the model configurations corresponding to the top eight highest and lowest Jacobian values of the robot arm. We observe that postures with higher sum-squared Jacobian values are more open configurations, while more constricted, convoluted configurations have lower values of the same metric, in conformity with our expectations. We further replicated the above experiment using the Configurable Articulatory Synthesizer (CASY), which is part of the TaDA application. Figure 5.4 shows the model configurations corresponding to the top eight highest and lowest Jacobian values of the vocal tract. Again, we observe similar trends, with more open vocal postures possessing the highest sum-squared Jacobian values, while more constricted, closed vocal postures had the lowest values, signifying a lower mechanical advantage.
5.2.4 Extracting frames of interest from production data
We analyzed the same read and spontaneous MRI data with synchronized, noise-cancelled audio from 5 American English speakers as in the original AS study presented in the last chapter.

Once we have prepared and preprocessed our data (articulatory data with synchronized speech audio), our next step is to define pauses and phonetic categories of interest using the acoustic signal. In order to extract data frames corresponding to different categories of interest, a phonetic alignment of the data corpus was performed using the SONIC speech recognizer [95]. Based on this alignment, we first automatically extracted all frames of ISPs¹ from the read and spontaneous speech samples [101]. For the purposes of this study, we considered only grammatical ISPs, i.e., silent or filled pauses that occurred between overt syntactic constituents (including sentence end). In other words, we excluded pauses that were due to hesitation, word-search, etc., which do not appear to encode phonological information. Also note that phonetic context adjacent to these pause boundaries was not controlled. This was to allow for observation of articulatory setting characteristics during these pauses that were generic, i.e., not specific to any particular phonetic context. In addition, `speech-ready' frames were extracted from each image sequence immediately before an utterance (a window of 100-200 ms before the start of the utterance as determined by phonetic alignment). Finally, the first and last frames of each utterance's MRI data acquisition interval were extracted as representatives of absolute rest position² in the two speaking styles. The phonetic alignment also allowed the extraction of frames corresponding to different phones categorized by manner and place of articulation.

¹ The SONIC speech recognizer uses a general heuristic of 170 ms between words before detecting and labeling a pause between those words.

² Since subjects are cued to start speaking after they hear the MRI system "switch on," it is assumed that the speaker's articulators will be in a "rest" position for the first frame of every acquisition.
Based on the phonetic alignments, the dataset was divided into 11 mutually exclusive, linguistically-meaningful categories: inter-speech pauses (ISP), absolute rest, speech-ready, 4 vowel categories categorized along height (high-low) and fronting (front-back) dimensions, 3 consonant categories categorized by place of articulation (labial, coronal and dorsal), and approximants.
For all extracted frames for a given speaker, cross-distances were computed (namely, lip aperture, velic aperture, tongue tip constriction degree, tongue dorsum constriction degree and tongue root constriction degree) as representative constriction task variables, and jaw angle and tongue length as representative articulatory posture variables. See Figure 5.5 for a visual schematic and [102] for more details on how these were extracted. Each variable was then normalized by its range such that the transformed variable took values between 0 and 1. For example, if the tongue root constriction degree has a minimum value of 0.7 units and a maximum value of 2.5 units, then these values will correspond to 0 and 1 respectively after transformation. This allows us to compare variables across speakers while accounting for speaker-specific attributes, such as vocal tract geometry and gender. In addition, this type of transformation allows for more interpretable comparisons between different categories.
5.2.5 Statistical Analyses
We now want to statistically quantify how the mechanical advantages of vocal tract postures assumed during different phonetic categories of interest differ from each other. For each category of interest (such as ISP, absolute rest, speech-ready, low front vowels, and so on) in a given speaker's data, Jacobian matrices were estimated using a bootstrapping procedure with N = 100 bootstrap samples. In each bootstrap iteration, a posture was randomly sampled from all the postures in that category to be used as a "test" posture. The LWR model was then fit to the rest of the data (training data), which was then used to estimate a Jacobian matrix for the test posture. Thus, at the end of the bootstrapping procedure we obtained N Jacobian estimates, and therefore N sum-squared values of the Jacobian, for each category of interest (for a given speaker).
A non-parametric 2-way analysis of variance (Friedman's test) was performed to test the hypothesis that the medians of the 11 different linguistic categories of interest were different.³ Note that in this case, the random factor is speaker (S = 5 speakers) and there were N = 100 replicates in each block, corresponding to the 100 bootstrap samples obtained earlier. Non-parametric Mann-Whitney U tests were also performed post-hoc for multiple-comparison tests.

³ The data samples failed to pass Kolmogorov-Smirnov tests of normality. Hence, nonparametric tests were used here.
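The bootstrap and the subsequent non-parametric tests can be sketched as follows. The data layout is an assumption for illustration, and the Jacobian estimator is passed in as a callable (for instance, the LWR sketch shown in Section 5.2.2 could be used).

```python
import numpy as np
from scipy.stats import friedmanchisquare, mannwhitneyu

def bootstrap_ss_jacobian(Q, X, estimate_jacobian, n_boot=100, seed=0):
    """N bootstrap estimates of the sum-squared Jacobian for one category of
    one speaker's data. estimate_jacobian(Q_train, X_train, q_star) returns
    an (m, n) Jacobian estimate at the held-out posture q_star."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_boot):
        i = rng.integers(len(Q))                      # held-out "test" posture
        train = np.delete(np.arange(len(Q)), i)
        J = estimate_jacobian(Q[train], X[train], Q[i])
        values.append(np.sum(J ** 2))
    return np.array(values)

def compare_categories(ss_by_category):
    """Friedman test across categories followed by pairwise Mann-Whitney U
    tests. ss_by_category maps category name -> array of bootstrap values
    (one such dictionary per speaker, or pooled, depending on the design)."""
    names = list(ss_by_category)
    chi2, p = friedmanchisquare(*[ss_by_category[n] for n in names])
    pairwise = {(a, b): mannwhitneyu(ss_by_category[a], ss_by_category[b]).pvalue
                for i, a in enumerate(names) for b in names[i + 1:]}
    return (chi2, p), pairwise
```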
5.3 Results
The Friedman's test showed that the medians of the dependent variable (sum-squared values of the Jacobian) were significantly different across the different categories of interest (p = 0).
Table 5.1: Medians of sum-squared values of the Jacobians tabulated by category and speaker (left). Also shown (right), for each pair of categories, is the number of speakers (out of 5) that returned a statistically significant difference on the Mann-Whitney U test for pairwise differences in medians at the 95% level. (Abbreviations: HF = High Front, HB = High Back, LF = Low Front, LB = Low Back, Lab = Labial, Cor = Coronal, Dor = Dorsal, App = Approximant.)

           Medians of SS Jacobian                  Number of speakers with significant pairwise differences in median
Category   Eng1    Eng2    Eng3    Eng4    Eng5    Rest  Ready  HF  HB  LF  LB  Lab  Cor  Dor  App
ISP        11.28   13.16    6.42   24.43   20.73    3     3     4   4   4   5   5    4    5    4
Rest        1.50   10.49    4.74   17.69   22.28          4     3   5   5   4   5    4    4    4
Ready      11.98   10.60    7.68   19.45   20.85                4   5   5   4   5    5    5    5
HF          9.94    7.83    6.41   17.56   20.44                    4   4   3   5    4    3    4
HB         11.23    5.56    9.68   18.22   19.97                        4   4   2    4    4    4
LF          8.95    5.07    6.74   18.14   17.50                            3   4    4    4    5
LB          9.60    6.72    7.71   18.86   19.35                                3    3    0    2
Lab        10.97    5.39    9.87   18.89   18.68                                     2    3    2
Cor        10.93    8.21    7.17   18.89   19.28                                          3    2
Dor        10.00    6.62    7.26   19.33   19.53                                               2
App        10.69    7.04    8.72   19.26   19.40
Table 5.1 shows the medians of the sum-squared values of all Jacobian entries, listed by linguistic category as well as speaker. We also tabulate the number of speakers for which we observed a pairwise difference in medians as determined by a post-hoc Mann-Whitney U test.
Speech-ready postures are generally more mechanically advantageous than postures assumed during inter-speech pauses, which are in turn significantly more mechanically advantageous as compared to postures assumed at absolute rest. The only case where the latter effect is not observed is for speaker Eng5, where the median for rest postures is higher than that for ISPs, but not significantly so.⁴ Speech-ready postures and inter-speech postures are also generally more advantageous than vowel and consonant postures, on the whole, while vowel and consonant postures may be seen as relatively equally advantageous, in general.

⁴ Interestingly, in the case of Eng5, although the median for rest postures is higher, the mean is lower than that for ISPs.
However, we observed that when we considered only the consonants spoken by a speaker, interesting patterns emerged. Coronal fricatives and stops, which may require more task precision as compared to their labial counterparts, were found to be less mechanically advantageous (and thus, more stable) than the latter. These trends are depicted in Figure 5.6 for speaker Eng5. This, the reader will recall, is because the mechanical advantage measure reflects how much change is observed in task/goal space for a unit change in articulatory space. Furthermore, when we just consider the vowels and pause postures of speakers, we observe a similar effect (see Figure 5.7). Specifically, vowels that are produced with a smaller degree of constriction (such as IY, OW, and UW) are less mechanically advantageous as compared to vowels produced with more open vocal tract postures (such as AE, UH and EH).

Such observations have been attributed in the literature to non-linear quantal effects or saturation effects (Fujimura and Kakita, 1979; Perkell et al., 1997) between motor control commands and articulatory movements, which might help determine motor control goals. We postulate that the concept of mechanical advantage generalizes this notion of saturation effects, in that postures with low mechanical advantage are stable. This can be seen for more constricted vowels like /i/ as compared to other vowels, as well as for coronal fricatives as opposed to labial fricatives.
5.4 Discussion & Conclusions
This paper has motivated the importance of applying the notion of mechanical advantage to questions of interest regarding the speech production apparatus. MA is a basic mechanical concept with its origins in kinematic analysis but, to our knowledge, this concept has not been utilized for examinations in the domain of speech production. We presented a methodology for quantifying the mechanical advantage provided by different vocal tract postures by proposing methods to extract relevant task and articulator variables from rtMRI videos and for computing the Jacobian of the differential kinematic relationship between the two sets of variables.
We then explored a specific hypothesis of linguistic interest concerning articulatory settings which can be tested by quantifying and comparing the MA of different classes of vocal tract postures. We found support for the central hypothesis that postures assumed during inter-speech pauses ("articulatory settings") are more mechanically advantageous than absolute rest postures with respect to speech articulation. In other words, articulatory setting postures afford large changes with respect to speech tasks for relatively small changes in low-level speech articulators. In the course of examining this hypothesis, we also find evidence that articulatory settings and speech-ready postures are more mechanically advantageous overall than other classes of vocal tract postures, including those assumed during different vowels, consonants and during absolute rest.
The observed differences between vowels and consonants are intriguing and suggest a way of grounding the traditional idea [47, 90] that consonant production is overlaid on a base formed by the production of vowels. In MA terms, the vowel could form an advantageous "launch-pad" for consonant constriction actions. Relatedly, MA could be one of the bases for the sonority hierarchy, which governs syllabification in languages. Differences among individual consonants and vowels may also provide some insight into their linguistic function, their acquisition, or their sensitivity to speech disorders. As an initial step in this direction, we uncovered some interesting patterns with respect to the mechanical advantage properties of certain fricatives and stops. For instance, we found that coronal fricatives were less mechanically advantageous (and thus, more stable) as compared to labial fricatives. Such observations have been attributed in the literature to non-linear quantal effects or saturation effects [29, 96] between motor control commands and articulatory movements, which might help determine motor control goals. We postulate that the concept of mechanical advantage generalizes this notion of saturation effects, in that postures with low mechanical advantage are stable.
There are many other exciting avenues for future study. For instance, it is important to observe that the specific measures of mechanical advantage computed here (i.e., sum of squared Jacobian values) are dependent on the choice of articulatory and task variables used for the differential kinematics estimation. This underscores the need for complementary ways of proceeding further: (i) finding an optimal set of task and articulatory variables with respect to MA and (ii) finding more expository measures of mechanical efficiency.
Figure 5.1: A schematic illustration of the analysis procedure.
Figure 5.2: An illustration of modeling with Locally-Weighted Regression (LWR). For a particular point (black cross) a local region is defined in articulator space by a Gaussian-shaped kernel (gray dashed curve). A line is fit in the local region using a weighted least-squares solution, indicated by the black dashed line. The global fit is generated by repeating this procedure at a large number of local regions. The resulting fit can be quite complex (gray curve), and depends on the width of the kernel.
Figure 5.3: Planar robot arm configurations corresponding to the top eight (a) highest and (b) lowest average Jacobian values.
Figure 5.4: Configurable articulatory synthesizer (CASY) configurations corresponding to the top eight (a) highest and (b) lowest average Jacobian values.
Figure 5.5: (a) Cross-distances in more detail (lip aperture (LA), velic aperture (VEL), and constrictions of the tongue tip (TTCD), tongue dorsum (TDCD) and tongue root (TRCD)). (b) Articulatory posture variables: jaw angle (JA), tongue centroid (TC) and length (TL), and upper and lower lip centroids (ULC and LLC).
Figure 5.6: Histograms of the sum-squared values of Jacobians computed for different consonants on speaker Eng5's data.
Figure 5.7: Histograms of the sum-squared values of Jacobians computed for different vowel and pause categories on speaker Eng5's data.
Part II
Direct data-driven approaches:
Toward optimal representations and
models of speech motor control and
planning
CHAPTER 6
Data-driven representations of speech articulation:
articulatory movement primitives
In contrast to the previous couple of chapters that presented knowledge-driven `top-down' case-studies which can inform models of speech motor control, in this chapter we switch gears and present a more data-driven `bottom-up' approach to deriving a small number of `primitive' articulatory movements from speech articulation data. The ultimate goal is to use these ideas to develop models of speech motor control and coordination that can potentially operate in an inherently lower-dimensional control subspace.
6.1 Movement primitives and motor control
Articulatory movement primitives may be defined as a dictionary or template set of articulatory movement patterns in space and time, weighted combinations of the elements of which can be used to represent the complete set of coordinated spatio-temporal movements of vocal tract articulators required for speech production. Extracting interpretable movement primitives from raw articulatory data is important for better understanding, modeling and synthetic reproduction of the human speech production process. Support for this view is well-grounded in the literature on neurophysiology and motor control. For instance, [83] argue that in order to generate and control complex behaviors, the brain does not need to explicitly solve systems of coupled equations. Instead, a more plausible mechanism is the construction of a vocabulary of fundamental patterns, or primitives, that are combined sequentially and in parallel for producing a broad repertoire of coordinated actions. An example of how these could be neurophysiologically implemented in the human body could be as functional units in the spinal cord that each generate a specific motor output by imposing a specific pattern of muscle activation [8]. The authors argue that this representation might simplify the production of movements by reducing the degrees of freedom that need to be specified by the motor control system. In this paper, we: (i) present a data-driven approach to extract a spatio-temporal dictionary of articulatory primitives from real and synthesized articulatory data using machine learning techniques; (ii) propose methods to validate^1 the proposed approach both quantitatively (using various performance metrics) as well as qualitatively (by examining how well it can recover evidence of compositional structure from (pseudo) articulatory data); and (iii) show that such an approach can yield primitives that are linguistically interpretable on visual inspection.
[50] defines a synergy to be a functional grouping of structural elements (like muscles or neurons) which, together with their supporting metabolic networks, are temporarily constrained to act as a single functional unit. The idea that there exist structural-functional organizations (or synergies) that facilitate motor control, coordination and exploitation of the enormous degrees of freedom in complex systems is not a new one. Right from the time of [6], researchers have been trying to understand the problem of coordination: compressing a high-dimensional movement state space into a much lower-dimensional control space [for a review, see 127]. Researchers have discovered that
1 Note that validation of experimentally-derived articulatory primitives, especially in the absence of absolute ground truth, is a difficult problem.
a small number of synergies can be used to perform simple tasks such as reaching [e.g., 20, 68] or periodic tasks such as finger tapping [e.g., 36]. However, the question of how more complex tasks are orchestrated remains an open one (complex tasks could be, for example, combinations of reaching and periodic movements, such as those performed by a skilled guitarist/percussionist). In other words, can we discover a set of synergies that can be used to perform a given complex task? In the following sections, we consider this question for the case of human speech production.
Figure 6.1: Vocal tract articulators (marked on a midsagittal image of the vocal tract).
One can approach the problem of formulating a set of primitive representations of the human speech production process in either a knowledge-driven or a data-driven manner. An example of the former from the linguistics (and more specifically, phonology) literature is the framework of Articulatory Phonology [12] which theorizes that the act of speaking is decomposable into units of vocal tract actions termed "gestures." Under this gestural hypothesis, the primitive units out of which lexical items are assembled are constriction actions of the vocal organs (see Figure 6.1 for an illustration of vocal tract constriction organs). When the gestures of an utterance are coordinated with each other and produced, the resulting pattern of gestural timing can be captured in a display called a gestural score. The gestural score of a given utterance specifies the particular gestures that compose the utterance and the times at which they occur. For
example, Figure 6.2 depicts the hypothesized gestural score for the word "team". It is important to note that the gestural score doesn't directly specify a set of lower-level raw articulatory movement trajectories, but how different vocal tract articulators are "activated" in a spatio-temporally coordinated manner with respect to each other at a higher level. Hence these are more akin to the [8] idea of specific patterns of muscle activation, which in turn are realized as specific articulatory movement trajectory patterns (or articulatory primitives). Having said this, it is important to experimentally examine the applicability of such knowledge-driven theories vis-a-vis real speech production data. In this paper, we adopt the less-explored data-driven approach to extract sparse primitive representations from measured and synthesized articulatory data and examine their relation to the gestural activations for the same data predicted by the knowledge-based model described above. We further explore other related questions of interest, such as: (i) how many articulatory primitives might be used in speech production, and (ii) what might they look like? We view these as first steps towards our ultimate goal of bridging and validating knowledge-driven and data-driven approaches to understanding the role of primitives in speech motor planning and execution.
Electromagnetic articulography (EMA) data of vocal tract movements [see for example, 135] offer a rich source of information for deriving articulatory primitives that underlie speech production. However, one problem in extracting articulatory movement primitives from this kind of real-world data is the lack of ground truth for validation, i.e., we do not know what the actual primitives were that generated the data in question. Hence, although experiments on real data are necessary in order to further our understanding of motor control of articulators during speech production, they are not sufficient.
We therefore also analyze synthetic data generated by a configurable articulatory speech synthesizer [42, 107] that interfaces with the Task Dynamics model of articulatory
Figure 6.2: Gestural score for the word "team". Each gray block corresponds to a vocal tract action or gesture. See Figure 6.1 for an illustration of the constricting organs.
control and coordination in speech [109] within the framework of Articulatory Phonology [12]. This functional coordination is accomplished with reference to speech `tasks', which are defined in the model as constricting primitives or `gestures' accomplished by the various vocal tract constricting devices. Constriction formation is modeled using task-level point attractor dynamical systems that guide the dynamic behavior of individual articulators and their coupling. The entire ensemble is implemented in a software package called Task Dynamics Application (or TaDA) [85, 108]. The advantage of using such a model is that it allows us to evaluate the similarity of the "information content"^2 encoded by (i) the hypothesized gestures, and (ii) the algorithm-extracted primitives (since the model hypothesizes what gestural primitives generate a given set of articulator movements in the vocal tract). This in turn affords us a better understanding of the strengths and drawbacks of both the model as well as the algorithm.
2 Although the term `information content' is a loaded term with neurocognitive underpinnings, we operationally use this term to abstractly refer to the structure encoded in the multivariate signal of interest.
Note that real-time magnetic resonance imaging [or rt-MRI, see for example, 87] is another technique which offers a very high spatial coverage of the vocal tract at the cost of low temporal resolution. In fact, in earlier work, we presented a technique to extract articulatory movement primitives from rt-MRI data [100]. However, validating primitives obtained from rt-MRI using the TaDA synthetic model is not as straightforward as in the case of EMA, where we can directly compare measured flesh-point trajectories to those generated by the synthetic model. Moreover, the temporal resolution in EMA and TaDA is much higher (100-500 Hz) as opposed to rt-MRI (20-30 Hz). Hence we restrict ourselves to the use of EMA data for our analyses in this paper.
6.1.1 Notation
We use the following mathematical notation to present the analysis described in this paper. Matrices are represented by bold uppercase letters (e.g., X), vectors are represented using bold lowercase letters (e.g., x), and scalars are represented without any boldface (either upper or lower case). We use the notation X† to denote the matrix transpose of X. Further, if x is an N-dimensional vector, we use the notation x ∈ R^N to denote that x takes values from the N-dimensional real-valued set. Similarly, X ∈ R^{M×N} denotes that X is a real-valued matrix of dimension M×N. We use the symbols ⊗ and ⊘ to denote element-wise matrix multiplication and division, respectively. Finally, we use the notation X = [x_1 | x_2 | … | x_K] to denote that matrix X is formed by collecting the vectors x_1, x_2, …, x_K together as its columns.
6.2 Review of data-driven methods to extract movement
primitives
Recently there have been studies that have attempted to further our understanding of primitive representations in biological systems using ideas from machine learning and sparsity theory. Sparsity in particular has been shown to be an important principle in the design of biological systems. For example, studies have suggested that neurons encode sensory information using only a few active neurons at any point of time, allowing an efficient way of representing data, forming associations and storing memories [91, 92]. [41] have put forth quantitative evidence for sparse representations of sounds in the auditory cortex. Their results are compatible with a model in which most auditory neurons are silent (i.e., not active or spiking) for much of the time, and in which neural representations of acoustic stimuli are composed of small dynamic subsets of highly active neurons. As far as speech production is concerned, phonological theories such as Articulatory Phonology [12] support the idea that speech primitives (or `gestures') are sparsely activated in time, i.e., at any given time instant during the production of a sequence of sounds, only a few gestures are "active" or "on" (for example, see Figure 6.2). However, to our knowledge, no practical computational studies have been conducted into uncovering the primitives of speech production thus far.
Modeling data vectors as sparse linear combinations of basis vectors^3 is a general computational approach (termed variously as dictionary learning or sparse coding or sparse matrix factorization depending on the exact problem formulation) which we will
3 In linear algebra, a basis is a set of linearly independent vectors that, in a linear combination, can represent every vector in a given vector space. Such a set of vectors can be collected together as columns of a matrix; a matrix so formed is called a basis matrix. More generally, this concept can be extended from vectors to functions, i.e., a basis in a given function space would consist of a set of linearly independent basis functions that can represent any function in that function space. For further details, see [120].
use to solve our problem. To recapitulate, the problem is that of extracting articulatory movement primitives, weighted and time-shifted combinations of which can be used to synthesize any spatio-temporal sequence of articulatory movements. Note that we use the terms `basis' and `primitive' to mean the same thing in the mathematical and scientific sense, respectively, for the purposes of this paper. Such methods have been successfully applied to a number of problems in signal processing, machine learning, and neuroscience. For instance, [19] used nonnegative matrix factorization [or NMF, see 60] and matching pursuit [73] techniques to extract synchronous and time-varying muscle synergy patterns from electromyography (EMG) data recorded from the hind-limbs of freely-moving frogs. [126] further compared the performance of various matrix factorization algorithms on such synergy extraction tasks for both real and synthetic datasets. [51] formulated the problem of extracting spatio-temporal primitives from a database of human movements as a tensor factorization problem with tensor group norm constraints on the primitives themselves and smoothness constraints on the activations. [138] also proposed an algorithm to temporally segment human motion-capture data into motion primitives using a generalization of spectral clustering and kernel K-means clustering methods for time-series clustering and embedding. For speech modeling, [4] presented an algorithm to perform temporal decomposition of log area parameters obtained from linear prediction analysis of speech. This technique represents the continuous variation of these parameters as a linearly-weighted sum of a number of discrete elementary components. More recently, [115] presented a convolutive NMF algorithm to extract "phone"-like vectors from speech spectrograms which could be used to characterize different speakers (or audio sources) for speech (or music) separation problems. [89] included the notion of sparsity in this formulation and showed that this gave more intuitive results. Note that we can view all these formulations as optimization problems with a cost function that involves (1) a data-fit term (which penalizes how accurately^4 appropriately weighted and time-shifted primitives can represent input data) and (2) a regularization term (which enforces sparsity and/or smoothness constraints).
Mathematically, we say that a signal x in R^m admits a sparse approximation over a basis set of vectors or `dictionary' D in R^{m×k} with k columns referred to as `atoms' when one can find a linear combination of a small number of atoms from D that is as "close" to x as possible (as defined by a suitable error metric) [71]. Note that sparsity constraints can be imposed over either the basis/dictionary or the coefficients of the linear combination (also called `activations') or both. In this paper, since one of our main goals is to extract interpretable^5 basis or dictionary elements (or primitives) from observed articulatory data, we focus on matrix factorization techniques such as Nonnegative Matrix Factorization (NMF) and its variants [40, 60, 89, 115]. We use NMF-based techniques since these have been shown to yield basis vectors that can be assigned a meaningful interpretation^6 depending on the problem domain [76, 89, 115]. In addition, we would like to find a factorization such that only a few basis vectors (or primitives) are "active" at any given point of time (as observed in Figure 6.2), i.e., a sparse activation matrix. In other words, we would like to represent data at each sampling point in time using a minimum number of basis vectors. Hence we formulate our problem such that sparsity constraints are imposed on the activation matrix.
4 As measured by a suitable metric such as a norm distance.
5 By interpretable we mean a basis that a trained speech researcher can assign linguistic meaning to on visual inspection; for example, a basis of articulator flesh-point trajectories, or sequences of rt-MRI images of the vocal tract.
6 It is worth noting that [22] give specific mathematical conditions required for NMF algorithms to give a "correct" decomposition into parts, which affords us some mathematical insight into the decomposition. Presentation of the exact conditions here requires a level of mathematical sophistication that is beyond the scope of this paper and is hence omitted. Interested readers are directed to [22] for further details.
6.3 Validation strategy
Direct validation of experimentally-derived articulatory primitives, especially in the absence of absolute ground truth, is a difficult problem. That being said, we can assess the extent to which these primitives provide a valid compositional model of the observed data. There are two important conceptual questions that arise during such a validation of experimentally-derived articulatory primitives. First, does speech have a compositional structure that is reflected in its articulation? Second, if we are presented with a set of waveforms or movement trajectories that have been generated by a compositional structure, then can we design and validate algorithms that can recover this compositional structure? The first question is one that we are not in a position to address fully yet, at least with the datasets at our current disposal^7. However, we can answer the second question, and we address it in this paper.
The synthetic TaDA model is generated by a known compositional task model. Figure 6.3 illustrates the flow of information through the TaDA model. Articulatory control and functional coordination is accomplished with reference to speech `tasks' which are composed and sequenced together in space and time. The temporal activations of each constricting primitive or `gesture' required to perform a speech task can be obtained from the model (i.e., we can recover a representation similar to that shown in Figure 6.2). Hence the TaDA model provides a testbed to investigate how well a primitive-extraction algorithm can recover the compositional structure that underlies (pseudo) behavioral movement data. Note that we do not claim here that the TaDA model mimics the human speech production mechanism or is some kind of ground truth. Nevertheless, it has been shown that it is possible to use the model to learn a mapping from acoustics (MFCCs) to gestural activations [79, 80]. When this mapping is applied to natural speech, the
7 Note that some phonological models of speech do support the hypothesis that speech sounds have a compositional structure. For more details, see [46] and [17].
resulting gestural activations can be added as inputs to speech recognition systems with a sharp decrease in error rate. Furthermore, the TaDA model is a compositional model of speech production, which we can use to test how well algorithms can recover evidence of compositional structure from (pseudo) articulatory data. Following this, we can pose another question: to what extent does the algorithm extract a compositional structure in measured articulatory data that is similar to that extracted in the synthetic TaDA case?
Figure 6.3: Flow diagram of TaDA, as depicted in [86].
6.4 Data
We analyze ElectroMagnetic Articulography (EMA) data from the Multichannel Articulatory (MOCHA) database [135], which consists of data from two speakers, one male and one female. Acoustic and articulatory data were collected while each (British English) speaker read a set of 460 phonetically-diverse TIMIT sentences. The articulatory channels include EMA sensors directly attached to the upper and lower lips, lower incisor (jaw), tongue tip (5-10 mm from the tip), tongue blade (approximately 2-3 cm posterior to the tongue tip sensor), tongue dorsum (approximately 2-3 cm posterior to the tongue blade sensor) and soft palate. Each articulatory channel was sampled at 500 Hz with 16-bit precision. Since the data in its native form is unsuitable for processing, each channel was zero-phase low-pass filtered with a cut-off frequency of 35 Hz [31]. Next, for every utterance, we subtracted the mean value from each articulatory channel [31, 103]. Then we added the mean value of each channel averaged over all utterances to that corresponding channel. Finally, we downsampled each channel by a factor of five to 100 Hz and further normalized data in each channel (by its range) such that all data values lie between 0 and 1. These pre-processed articulator trajectories were used for further analysis and experiments (see Table 6.1).
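This preprocessing chain is straightforward to reproduce. The following is a minimal sketch in Python/NumPy (not part of the original work): the 35 Hz zero-phase low-pass filter, the per-utterance mean handling, the downsampling factor of five and the range normalization follow the description above, while the filter order and the decision to compute the normalizing range over the whole corpus are illustrative assumptions.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def preprocess_channels(raw_utterances, fs=500, cutoff=35.0, factor=5):
        """Low-pass filter, mean-adjust, downsample and range-normalize EMA channels.
        `raw_utterances` is a list of (num_samples x num_channels) arrays at `fs` Hz."""
        # Zero-phase low-pass filtering at 35 Hz (the filter order here is illustrative).
        b, a = butter(5, cutoff / (fs / 2.0), btype="low")
        filtered = [filtfilt(b, a, utt, axis=0) for utt in raw_utterances]

        # Remove the per-utterance mean, then add back each channel's mean over all utterances.
        grand_mean = np.vstack(filtered).mean(axis=0)
        centered = [utt - utt.mean(axis=0) + grand_mean for utt in filtered]

        # Downsample by a factor of five (500 Hz -> 100 Hz) and scale each channel to [0, 1].
        down = [utt[::factor] for utt in centered]
        stacked = np.vstack(down)
        lo, hi = stacked.min(axis=0), stacked.max(axis=0)
        return [(utt - lo) / (hi - lo) for utt in down]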
We also analyze synthetic data generated by the Task Dynamics Application (or TaDA) software [85, 86, 108], which implements the Task Dynamic model of inter-articulator speech coordination within the framework of Articulatory Phonology [12] described earlier in this paper. It also incorporates a coupled-oscillator model of inter-gestural planning, a gestural-coupling model, and a configurable articulatory speech synthesizer [42, 107]. TaDA generates articulatory and acoustic outputs from orthographical input. Figure 6.4 shows a screenshot of the TaDA graphical user interface. The orthographic input is converted to a phonetic string using a version of the Carnegie Mellon pronouncing dictionary that also provides syllabification. The syllabified string is then parsed into gestural regimes and inter-gestural coupling relations using hand-tuned dictionaries and then converted into a gestural score. The obtained gestural score is an ensemble of gestures for the utterance, specifying the intervals of time during which particular constriction gestures are active. This is finally used by the Task Dynamic model implementation in TaDA to generate the tract variable and articulator time functions, which are further mapped to the vocal tract area function (sampled at 200 Hz). See Figure 6.3. The articulatory trajectories are downsampled to 100 Hz in a manner
similar to the case of the MOCHA data described earlier. We further normalize data in each channel (by its range) such that all data values lie between 0 and 1. Gestural scores, articulatory trajectories and corresponding acoustics were each synthesized (and normalized, in a manner similar to the EMA data) for 460 sentences corresponding to those used in the MOCHA-TIMIT [135] database. The pre-processed articulator trajectory variables we use for analysis are listed in Table 6.1.
Table 6.1: Articulator flesh point variables that comprise the post-processed synthetic (TaDA) and real (EMA) datasets that we use for our experiments.

Symbol              Articulatory parameter   TaDA   MOCHA-TIMIT
UL(x,y)             Upper lip                 X        X
LL(x,y)             Lower lip                 X        X
JAW(x,y)            Jaw                       X        X
TT(x,y)             Tongue tip                X        X
TF(x,y) / TB(x,y)   Tongue front/body         X        X
TD(x,y)             Tongue dorsum             X        X
TR(x,y)             Tongue root               X
VEL(x,y)            Velum                              X
UI(x,y)             Upper incisor                      X
6.5 Problem formulation
The primary aim of this research is to extract dynamic articulatory primitives, weighted combinations of which can be used to resynthesize the various dynamic articulatory movements in the vocal tract. Techniques from linear algebra such as non-negative matrix factorization (NMF), which factor a given non-negative matrix into a linear combination of (non-negative) basis vectors, offer an excellent starting point to solve our problem.
6.5.1 Nonnegative Matrix Factorization and its extensions
The aim of NMF [as presented in 60] is to approximate a non-negative input data matrix V ∈ R_{≥0}^{M×N} as the product of two non-negative matrices, a basis matrix W ∈ R_{≥0}^{M×K} and an activation matrix H ∈ R_{≥0}^{K×N} (where K ≤ M) by minimizing the reconstruction error as measured by either a Euclidean distance metric or a Kullback-Leibler (KL) divergence metric. Although NMF provides a useful tool for analyzing data, it suffers from two drawbacks of particular relevance in our case. First, it fails to account for potential dependencies across successive columns of V (in other words, to capture the (temporal) dynamics of the data); thus a regularly repeating dynamic pattern would be represented by NMF using multiple bases, instead of a single basis function that spans the pattern length. Second, it does not explicitly impose sparsity constraints on the factored matrices, which is important for our application since we want only a few bases "active" at any given sampling instant. These drawbacks motivated the development of convolutive NMF [115], where we instead model V as:

V ≈ ∑_{t=0}^{T−1} W(t) · (H→t) = Ṽ        (6.1)
where W is a basis tensor^8, i.e., each column of W(t) ∈ R_{≥0}^{M×K} is a time-varying basis vector sampled at time t, each row of H ∈ R_{≥0}^{K×N} is its corresponding activation vector, T is the temporal length of each basis (number of image frames) and the (·)→i operator is a shift operator that moves the columns of its argument by i spots to the right, as detailed in [115]:

if H = [1 3 5; 2 4 6], then (H→1) = [0 1 3; 0 2 4]
8 Note that a multidimensional matrix is also called a tensor. In this case we have a three-dimensional basis tensor, with the third dimension representing time.
In this case the author uses a KL divergence-based error criterion and derives iterative update rules for W(t) and H based on this criterion. This formulation was extended by [89] to impose sparsity conditions on the activation matrix (i.e., requiring that a certain number of entries in the activation matrix are zeros). However, the parameter which trades off sparsity of the activation matrix against the error criterion in their case (λ) is not readily interpretable, i.e., it is not immediately clear what value λ should be set to in order to yield optimal interpretable bases. We instead choose to use a sparseness metric based on a relationship between the l_1 and l_2 norms [as proposed by 40] as follows:

sparseness(x) = (√n − (∑_i |x_i|) / √(∑_i x_i²)) / (√n − 1)        (6.2)
where n is the dimensionality of x. This function equals unity iff x contains only one non-zero component and 0 iff all components are equal up to signs, and smoothly interpolates between the extremes. More recently, [131] showed that using a Euclidean distance-based error metric was more advantageous (in terms of computational load and accuracy on an audio object separation task) than the KL divergence-based metric and further derived the corresponding multiplicative update rules for the former case. It is this formulation along with the sparseness constraints on H (as defined by Equation 6.2) that we use to solve our problem. Note that incorporation of the sparseness constraint also means that we can no longer directly use multiplicative update rules for H; so we use gradient descent followed by a projection step to update H iteratively [as proposed by 40]. The added advantage of using this technique is that it has been shown to find a unique solution of the NMF problem with sparseness constraints [122]. The final formulation of our optimization problem, which we term `convolutive NMF with sparseness constraints' or cNMFsc, is as follows:
min_{W,H}  ‖V − ∑_{t=0}^{T−1} W(t) · (H→t)‖²   s.t.   sparseness(h_i) = S_h, ∀i        (6.3)
where h_i is the i-th row of H and 0 ≤ S_h ≤ 1 is user-defined (for example, setting S_h = 0.65 roughly corresponds to requiring 65% of the entries in each row of H to be zero). Figure 6.5 provides a graphic illustration of the input and outputs of the model, while Figure 6.6 pictorially depicts how weighted and shifted additive combinations of the basis reconstruct the original input data sequence.
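To make the notation above concrete, the following minimal Python/NumPy sketch (not part of the original work) implements the column-shift operator, the sparseness measure of Equation 6.2 and the reconstruction and objective of Equations 6.1 and 6.3; the toy dimensions at the bottom are arbitrary.

    import numpy as np

    def shift_right(H, t):
        """Return H with its columns shifted t places to the right, zero-padding on the left."""
        out = np.zeros_like(H)
        out[:, t:] = H[:, :H.shape[1] - t]
        return out

    def hoyer_sparseness(x):
        """Sparseness measure of Eq. 6.2: 1 for a vector with a single non-zero entry,
        0 for a vector whose entries all have equal magnitude."""
        n = x.size
        return (np.sqrt(n) - np.abs(x).sum() / np.sqrt((x ** 2).sum())) / (np.sqrt(n) - 1)

    def reconstruct(W, H):
        """Convolutive reconstruction of Eq. 6.1; W has shape (T, M, K), H has shape (K, N)."""
        return sum(W[t] @ shift_right(H, t) for t in range(W.shape[0]))

    def cnmfsc_objective(V, W, H):
        """Squared reconstruction error of Eq. 6.3 (the sparseness constraint on each
        row of H is enforced separately, by projection)."""
        return np.sum((V - reconstruct(W, H)) ** 2)

    # Toy example with random non-negative data.
    rng = np.random.default_rng(0)
    M, N, K, T = 14, 200, 8, 10
    V, W, H = rng.random((M, N)), rng.random((T, M, K)), rng.random((K, N))
    print(cnmfsc_objective(V, W, H), hoyer_sparseness(H[0]))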
6.5.2 Extraction of primitive representations from data
If x_1, x_2, …, x_M are the M time-traces (consisting of N samples each; represented as column vectors of dimension N×1) corresponding to the M articulator (fleshpoint) trajectory variables (obtained from either TaDA or MOCHA-TIMIT), then we can design our data matrix V to be:

V = [x_1 | x_2 | … | x_M]† ∈ R^{M×N}        (6.4)
where † is the matrix transpose operator. We now aim to find an approximation of this matrix V using a basis tensor W and an activation matrix H. A practical issue which arises here is that in our dataset, there are 460 files corresponding to different sentences, each of which results in an M×N data matrix V (where N is equal to the number of frames in that particular sequence). However, we would like to obtain a single basis tensor W for all files so that we obtain a primitive articulatory representation for any sequence of articulatory movements made by that speaker. One possible way to do this is to concatenate all 460 sequences into one huge matrix, but the dimensionality of this matrix makes computation intractably slow. In order to avert this problem we propose
a second method that optimizes W jointly for all files and H individually per file. The algorithm is as follows:
1. Initialize W to a random tensor of appropriate dimension.

2. W Optimization.
   for Q of the N files in the database do
      (a) Initialize H to a random matrix of requisite dimensions.
      (b) PROJECT. Project each row of H to be non-negative, have unit l_2 norm and l_1 norm set to achieve the desired sparseness [40].
      (c) ITERATE.
          i. H Update.
             for t = 1 to T do
                Set Ĥ(t) = H − μ_H · W(t)† · ((Ṽ←t) − (V←t)).
                PROJECT H.
             H ← (1/T) ∑_t Ĥ(t).
          ii. W Update.
             for t = 1 to T do
                Set W(t) = W(t) ⊗ [V · (H→t)†] ⊘ [Ṽ · (H→t)†].

3. for the rest of the files in the database do
      H Update keeping W constant.

Here Ṽ denotes the current reconstruction from Equation 6.1, and (X→t) and (X←t) denote the matrix X with its columns shifted t places to the right and left, respectively. Steps 2 and 3 are repeated for an empirically-specified number of iterations. The step-size parameter μ_H of the gradient descent procedure (described in Step 2) and the number of files Q used for the W optimization are also set manually based on empirical observations.
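The following Python/NumPy sketch illustrates one pass of the alternating procedure above; it is illustrative rather than the implementation used in this work. In particular, project_rows below is only a crude stand-in for the exact non-negative l1/l2 projection of [40], and the step size and initialization are arbitrary.

    import numpy as np

    def shift_right(X, t):
        out = np.zeros_like(X)
        out[:, t:] = X[:, :X.shape[1] - t]
        return out

    def shift_left(X, t):
        out = np.zeros_like(X)
        out[:, :X.shape[1] - t] = X[:, t:]
        return out

    def project_rows(H, sparseness):
        """Crude stand-in for the projection of [40]: clip negatives, zero out the smallest
        entries of each row so that roughly `sparseness` of them are zero, and rescale each
        row to unit l2 norm."""
        H = np.maximum(H, 0.0)
        n_zero = int(round(sparseness * H.shape[1]))
        for row in H:
            if n_zero > 0:
                row[np.argsort(row)[:n_zero]] = 0.0
            norm = np.linalg.norm(row)
            if norm > 0:
                row /= norm
        return H

    def cnmfsc_pass(V, W, H, S_h=0.65, mu_H=1e-3):
        """One W/H update pass. W: (T, M, K) basis tensor, H: (K, N) activation matrix."""
        T = W.shape[0]
        V_hat = sum(W[t] @ shift_right(H, t) for t in range(T))

        # H update: one projected gradient step per lag t, then average the candidates.
        candidates = []
        for t in range(T):
            grad = W[t].T @ shift_left(V_hat - V, t)
            candidates.append(project_rows(H - mu_H * grad, S_h))
        H = np.mean(candidates, axis=0)

        # W update: multiplicative rule with element-wise multiplication and division.
        V_hat = sum(W[t] @ shift_right(H, t) for t in range(T))
        for t in range(T):
            Ht = shift_right(H, t)
            W[t] *= (V @ Ht.T) / (V_hat @ Ht.T + 1e-12)
        return W, H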
6.5.3 Selection of optimization free parameters
In this section we briefly describe how we performed model selection, i.e., choosing the values of the various free parameters of the algorithm. The Akaike Information Criterion [or AIC, 1] and Bayesian Information Criterion [or BIC, 111] are two popular model selection criteria that are measures of the relative goodness of fit of a statistical model. They trade off the likelihood of the model (which is proportional to the objective function value) against the model complexity (which is proportional to the number of parameters in the model). They are used as a criterion for model selection among a finite set of models. The formula for AIC is:
AIC = −2 ln(L) + 2k        (6.5)
where L is the maximized value of the likelihood function for the estimated model, and k is the total number of free parameters in the model. Under the assumption that the likelihood L of the data is Gaussian-distributed, one can show that the log-likelihood, ln(L), in Equation 6.5 is proportional to the objective function of the convolutive NMF (or cNMF) formulation, presented earlier in Equations 6.1 and 6.3:

ln(L) ∝ −(1/2) ‖V − Ṽ‖² = −(1/2) ‖V − ∑_{t=0}^{T−1} W(t) · (H→t)‖²        (6.6)
For further details, see [25, 54]. Hence, Equation 6.5 reduces to:

AIC ≈ ‖V − Ṽ‖² + 2k        (6.7)

The value of k equals the number of parameters in the model, i.e., the number of entries in the M×K×T-dimensional basis matrix (W) and the K×N-dimensional activation matrix (H), respectively. Also noting the imposition of sparseness constraints on the activation matrix, we find the final expression for AIC as follows:

AIC ≈ ‖V − Ṽ‖² + 2(MKT + S_h·K·N)        (6.8)
where S_h is the sparseness parameter, which requires that each row of H needs to have a sparseness of S_h, as defined in Equation 6.2. Since performing AIC and BIC computations for the whole corpus is time- and resource-consuming, we computed these criteria for a subset of the data. Figure 6.7 shows the AIC computed for different values of K (number of bases/primitives) and T (the temporal extent of each basis) over a 5% subset of subject fsew0's data (the BIC and AIC trends are similar for both speakers, hence we only present the AIC computed for subject fsew0 in the interest of brevity). We see that the AIC tends to overwhelmingly prefer models that are less complex, since the model complexity term far outweighs the log likelihood term in the model as the values of K and T increase.
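As a concrete illustration (again in Python/NumPy, and not part of the original work), the scan of Figure 6.7 amounts to evaluating Equation 6.8 for each candidate (K, T) pair on a small data subset and inspecting the resulting surface; the helper below assumes the reconstruction Ṽ has already been computed for that parameter setting.

    import numpy as np

    def cnmfsc_aic(V, V_hat, K, T, S_h):
        """Approximate AIC of Eq. 6.8: squared reconstruction error plus twice the
        (sparseness-adjusted) number of free parameters in W and H."""
        M, N = V.shape
        error = np.sum((V - V_hat) ** 2)
        num_params = M * K * T + S_h * K * N
        return error + 2 * num_params

    # e.g., evaluate over a grid of candidate (K, T) values on a small data subset and
    # keep the pair with the smallest AIC.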
In light of the AIC analysis, we would like to choose K and T to be as small as possible, but in a meaningful manner. Therefore we decided to set the temporal extent of each basis sequence (T) to 10 samples (since this corresponds to a time period of approximately 100 ms, factoring in a sampling rate of 100 samples per second) to capture effects of the order of the length of a phone on average. We chose the number of bases, K, to be equal to the number of time-varying constriction task variables generated (by TaDA) for each file, i.e., 8.
The value of the sparseness parameter S_h was set based on the percentage of constriction tasks (generated by TaDA) that were active at any given time instant. Figure 6.8 shows a histogram of the number of constriction tasks active at any sampling time instant (computed over all TaDA-generated constriction task variables). We observe that most of the time only 2 or 3 task variables have non-zero activations, suggesting that choosing S_h in the range of 0.65-0.75 would be optimal for our experiment.
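A rough heuristic for this choice is sketched below in Python/NumPy (illustrative only, and not part of the original work); note that the sparseness of Equation 6.2 is defined over rows of H, so the fraction of simultaneously inactive tasks is only an approximate guide to S_h. The gestural activations are assumed to be available as a tasks-by-time matrix.

    import numpy as np

    def suggest_sparseness(G):
        """Histogram of how many constriction task variables are simultaneously non-zero
        in the gestural activations G (rows = tasks, columns = time), and the fraction of
        inactive tasks at the most common activity level, used as a rough guide for S_h."""
        num_active = (G > 0).sum(axis=0)
        counts = np.bincount(num_active, minlength=G.shape[0] + 1)
        typical = int(np.argmax(counts))
        return counts, 1.0 - typical / G.shape[0]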
6.6 Results and validation
A significant hurdle to validating any proposed model of articulatory primitives is the lack of ground truth data or observations, since these entities are difficult to observe and/or measure directly given that they exist. However, we can evaluate quantitative metrics of algorithm performance such as the fraction of variance explained by the model and root mean squared error (RMSE) performance of the algorithm for different articulator trajectories and phonemes. In addition, we can evaluate how well the model performs for measured articulatory data vis-a-vis synthetic data generated by an articulatory synthesizer (TaDA). Therefore, in the following section we first present quantitative evaluations of our proposed model, and then follow it up with qualitative comparisons with the Articulatory Phonology-based TaDA model.
6.6.1 Quantitative performance metrics
We first present a quantitative analysis of the convolutive NMF with sparseness constraints (cNMFsc) algorithm described earlier. In order to see how the algorithm performs for different phone classes, we need to first perform a phonetic alignment of the audio data corresponding to each set of articulator trajectories. We did this using the Hidden Markov Model toolkit [HTK 136].

Figures 6.9 and 6.10 show the root mean squared error (RMSE) for each articulator and phone class (categorized by ARPABET symbol) for MOCHA-TIMIT speakers msak0 and fsew0 respectively. Recall that since we are normalizing each row of the original data matrix to the range [0,1] (and hence each articulator trajectory), the error
values in Figures 6.9 and 6.10 can be read on a similar scale. We see that in general, error values are very high. Among the articulator trajectories, the errors were highest (0.15-0.2) for tongue-related articulator trajectories. On the other hand, trajectories of the lip (LLx and LLy) and jaw (JAWx and JAWy) sensors were reconstructed with lower error (≈ 0.1). As far as the phones were concerned, errors were comparatively higher for the voiced alveolo-dental stop DH. One reason for this can be attributed to the way we form the data matrix V. Recall that we construct V by concatenating articulatory data corresponding to several (say, Q = 20) sentences into a single matrix. In other words, there will be Q − 1 discontinuities in the articulatory trajectories contained in V, one corresponding to each sentence boundary. Moreover, we found that roughly a third of all instances of DH occurred in the beginning of the sentences (as the first phone). The RMSE errors for these sentence-initial instances of DH were significantly higher than the RMSE errors of DH instances that occurred in other positions in the sentence. This suggests that the higher reconstruction error observed in DH instances that occur sentence-initially is likely not an artifact of the cNMFsc algorithm itself, but rather due to the presence of discontinuities in the original data matrix.
We further computed for each speaker the fraction of variance that was not explained (FVU) by the model for each sentence in the database. The histograms of these distributions are plotted in Figure 6.11. The mean and standard deviation of this distribution were 0.079 ± 0.028 for speaker msak0 (i.e., approximately 7.9% of the original data variance was not accounted for on average) and 0.097 ± 0.034 for speaker fsew0, respectively. These statistics suggest that the cNMFsc model accounts for more than 90% of the original data variance.
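These summary statistics are simple to compute once the reconstruction is available; a minimal Python/NumPy sketch follows (not part of the original work, and using one common definition of FVU).

    import numpy as np

    def fraction_variance_unexplained(V, V_hat):
        """FVU for one sentence: residual energy relative to the variance of the
        measured trajectories about their (per-articulator) means."""
        residual = np.sum((V - V_hat) ** 2)
        total = np.sum((V - V.mean(axis=1, keepdims=True)) ** 2)
        return residual / total

    def per_articulator_rmse(V, V_hat):
        """Root mean squared reconstruction error, one value per articulator row of V."""
        return np.sqrt(np.mean((V - V_hat) ** 2, axis=1))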
6.6.2 Qualitative comparisons of TaDA model predictions with the
proposed algorithm
Figure 6.12 shows selected measured articulator trajectories superimposed on those obtained by reconstruction based on the estimated cNMFsc model for the TaDA and EMA data respectively. Notice that while the reconstructed curves approximate the shape of the original curves, they are not as smooth. This is likely due to the imposition of sparseness constraints in our problem formulation in Equation 6.3, where the optimization procedure trades off reconstruction accuracy for sparseness of rows in the activation matrix. This could explain the higher phone error rates that we observed in Figures 6.9 and 6.10 as well. Note that although we are plotting only a subset of the articulatory trajectories, we generally observe that synthetic (TaDA) data is reconstructed with a smaller error as compared to the measured (EMA) data, which makes sense, considering that a greater amount of within-speaker variability is observed in actual speech articulation. This extra variability may not be captured very well by the model. Also notice that panels on the right (corresponding to EMA measurements) have a smaller total utterance duration (≈ 1.2 seconds) as compared to the TaDA panels on the left (≈ 1.7 seconds), which further reinforces the earlier argument that synthetic speech does not account for phenomena such as phoneme reduction, deletion, etc., which contribute to signal variability.
6.6.2.1 Comparison with gestural scores
Now that we have generated spatio-temporal basis functions or synergies of articulator trajectories, linear combinations of which can be used to represent the input, the next step is to compare the activations/weights of these spatio-temporal bases generated for each sentence to the gestural activations hypothesized by TaDA for that sentence. However, comparison of time-series data is not trivial in general, and in our case in particular, since the basis functions extracted by the algorithm need not represent the gestures themselves in form, but might do so in information content. In other words, the two sets of time-series represented by the gestural score matrix G and the estimated activation matrix H cannot be directly compared to each other. Notice that this is a particularly difficult problem in signal processing and time-series analysis in general, especially because of its abstract nature. To our knowledge, there is no easy or direct method of solving this problem to date.
That being said, we present here a sub-optimal, indirect approach to attacking this complex problem using well-established signal modeling techniques. Specifically, we propose a two-step procedure in order to perform this comparison: first, we model each set of time-series (hypothesized gestural activations and extracted activation traces) by an autoregressive (AR) model using linear prediction [for example, see 72]. Once this is done, we find the canonical correlation between the sets of AR coefficient matrices to examine the similarity between the two sets of time-series. Canonical correlation linearly projects both sets of signals (contained in the matrices) into a common signal space where they are maximally correlated to each other. Hence, by examining the magnitude of (canonical) correlation values, we can get an estimate of how maximally correlated the two sets of multidimensional time series are to each other^9. In the following paragraphs, we describe the procedure in more detail.
The technique of linear prediction [72] models a given discrete time-varying signal as a linear combination of its past values (this is also known as autoregressive (AR) system
9 Note that instead of this two-step process, one can also consider using functional canonical correlation analysis [fCCA, 62]. However, since the time-series under consideration are sparse, this method may not provide useful results in practice. This is because the technique relies on appropriate smoothing of the time-series before finding their optimal linear projections. Another altogether different technique one can think of using is based in information theory, i.e., computing the mutual information [see 18] between the 2 sets of signals. However, the sparsity of the signals, coupled with the lack of proven methods of computing mutual information of arbitrary multidimensional time-series (not vectors), makes reliable estimation of this quantity difficult.
Table 6.2: Top 5 canonical correlation values between the gestural activation matrix G (generated by TaDA) and the estimated activation matrix H for both TaDA and EMA cases.

n-th highest canonical correlation value   TaDA     MOCHA-TIMIT
n = 1                                      0.9089   0.9407
n = 2                                      0.7717   0.8548
n = 3                                      0.7158   0.7817
n = 4                                      0.6350   0.6017
n = 5                                      0.4947   0.4409
modeling). Mathematically, if the given signal (for example, an articulator trajectory in our case) is x, then:

x[n] = ∑_{i=1}^{P} a_i · x[n−i]        (6.9)
where P is the order of the model and a_i, ∀i are the linear prediction or AR model coefficients. Recall from Equation 6.4 that each row of our K×N activation matrix H represents the activation of a different time-varying basis. Using linear prediction, we can model each row of this matrix by an LP model, thus giving us a K×P matrix H_LP. If G is the K_g×N_g matrix of gestural activations (where again, rows represent gestural activations associated with different constriction task variables and columns represent time), these can also be modeled as a K_g×P matrix G_LP in a similar fashion.
If we have two sets of variables, x_1, …, x_n and y_1, …, y_m, and there are correlations among the variables, then canonical correlation analysis will enable us to find linear combinations of the x's and the y's which have maximum correlation with each other. We thus examine the canonical correlation values between the rows of matrices G_LP and H_LP respectively. This allows us to observe how much linear correlation there is between the two spaces and thus obtain an estimate of how similar the "information content" of the two spaces is; in addition, it allows us to estimate a linear mapping between the two variable spaces, which will allow us to convert the estimated activation matrices into gestural activations. Table 6.2 shows the top five canonical correlation values obtained in the cases of both TaDA and EMA data. In general we observe high values of canonical correlation, which supports the hypothesis that the estimated activation matrices capture the important information structure contained in the gestural activations.
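The two-step comparison can be prototyped as follows (a Python/NumPy sketch, not the original implementation): each row of the activation and gestural-score matrices is fitted with a least-squares AR model, and the canonical correlations between the two sets of AR coefficients are read off the singular values of the whitened cross-product. The AR order P = 25 shown here is an illustrative choice in the 200-250 ms range mentioned in the following paragraph.

    import numpy as np

    def ar_coefficients(x, P):
        """Least-squares fit of an order-P autoregressive model (Eq. 6.9) to a 1-D series."""
        rows = np.array([x[n - P:n][::-1] for n in range(P, len(x))])
        coeffs, *_ = np.linalg.lstsq(rows, x[P:], rcond=None)
        return coeffs

    def ar_matrix(X, P):
        """Fit an AR(P) model to each row of X; returns a (num_rows x P) coefficient matrix."""
        return np.vstack([ar_coefficients(row, P) for row in X])

    def canonical_correlations(A, B):
        """Canonical correlations between two sets of variables whose observations are the
        rows of A and B (here, the P AR coefficients act as observations)."""
        A = A - A.mean(axis=0)
        B = B - B.mean(axis=0)
        Qa, _ = np.linalg.qr(A)
        Qb, _ = np.linalg.qr(B)
        return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

    # Toy usage: G_LP and H_LP are (num_series x P); their transposes give P "observations".
    rng = np.random.default_rng(1)
    G, H_est, P = rng.random((10, 400)), rng.random((8, 400)), 25
    rho = canonical_correlations(ar_matrix(G, P).T, ar_matrix(H_est, P).T)
    print(rho[:5])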
However, as we have mentioned earlier, this technique of comparison is at best sub-optimal, for many reasons. First, recall that the activation matrices are sparse. Modeling sparse time-series in general, and specifically using AR modeling, is prone to errors since the time-series being modeled are not generally smooth. Second, the optimal choice of model parameters, such as the number of coefficients in the LPC/AR analysis (P in Equation 6.9), is not clear. In our case, we chose a value that captured temporal effects of the order of approximately 200-250 ms.
6.6.2.2 Significance of extracted synergies
To check that our algorithm indeed captures some structure in the data, we compared the reconstruction error of the extracted activation matrix (synergies) H to those obtained by substituting a random matrix of the same sparsity structure for it in Equation 6.1. This procedure was repeated 50 times. A right-sided Student t-test found that the mean square error objective function value for the random matrices was significantly higher than for the case of the estimated H matrix (p = 0).
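A sketch of this check (Python/SciPy, illustrative only) is given below; `reconstruct` refers to the convolutive reconstruction of Equation 6.1, as in the earlier sketch.

    import numpy as np
    from scipy import stats

    def random_like(H, rng):
        """Random non-negative matrix with the same sparsity pattern as H: zero entries
        stay zero, non-zero entries are redrawn uniformly."""
        R = rng.random(H.shape)
        R[H == 0] = 0.0
        return R

    def structure_test(V, W, H, reconstruct, num_repeats=50, seed=0):
        """Compare the reconstruction error of the estimated H against errors obtained
        when H is replaced by random matrices with the same sparsity structure."""
        rng = np.random.default_rng(seed)
        err_true = np.sum((V - reconstruct(W, H)) ** 2)
        err_rand = [np.sum((V - reconstruct(W, random_like(H, rng))) ** 2)
                    for _ in range(num_repeats)]
        # Right-sided one-sample t-test: are the random-H errors larger than err_true?
        t, p_two_sided = stats.ttest_1samp(err_rand, err_true)
        p_right = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
        return err_true, float(np.mean(err_rand)), p_right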
6.6.3 Visualization of extracted basis functions
Figure 6.13 shows exemplar basis functions extracted from MOCHA-TIMIT data from speaker msak0. We observe that the bases are interpretable and capture important and diverse articulatory movements from a phonetics perspective; 7 of the 8 correspond to articulatory patterns associated with the formation of constrictions of the vocal organs. Basis 1 exhibits the articulatory pattern expected when a labial constriction (note vertical movement of the lower lip) is formed in the context of a (co-produced) front vowel (note vertical movements of the front tongue markers), while basis 3 exhibits the pattern expected for a labial constriction co-produced with a back vowel (note the backwards and raising motions of the tongue markers). Comparably, bases 4 and 8 show patterns expected of a coronal (tongue tip) constriction co-produced with back and front vowels respectively. Basis 5 shows the expected pattern for a tongue dorsum constriction with a velar or uvular location, while 6 shows the pattern of a dorsum constriction with a palatal location. Basis 7 shows the expected pattern for a tongue root constriction in the pharynx. Basis 2 is the only one that does not appear to represent a constriction, per se, but rather appears to capture horizontal movement of all the receivers.
Since different phonetic segments are formed with distinct constrictions, we would expect that they should activate distinct bases. To test this, we plot the average activation patterns of selected segments in Figure 6.14. We did this by collecting all columns of the activation matrix corresponding to each phone interval (as well as (T − 1) = 9 columns before and after, since the primitives are spatio-temporal in nature with temporal length T = 10) and taking the average across each of the K = 8 rows. Let us first consider the activation patterns of the three voiceless stops in Figure 6.14a. Each of the consonants has a different average gestural activation pattern. For /p/, the basis with the highest activation is the one identified above as a labial constriction co-produced with a back vowel (3 or (c)); the labial-front vowel pattern (1 or (a)) is also highly active (3rd highest). For /t/, the two patterns identified as coronal constriction patterns (4 (d) and 8 (h)) have the highest activation, while for /k/, the dorso-palatal pattern is most active (6 or (f)). We might have expected more activation of the dorso-velar constriction pattern, but in English the dorsal stops are quite front, particularly in front vowel contexts. Similarly, in Figure 6.14b, we qualitatively observe that the average activation patterns of different vowels are different. For the vowels that are produced with narrow constrictions (IY, AA, OW, and UW), the highest activated bases are those that produce the appropriate constriction. The bases most highly activated for IY are the dorsal constriction bases (5 and 6). For AA and OW, it is the pharyngeal constriction basis (7) and the labial constriction in back vowel context (3) (here capturing lip-rounding) that are most active. For UW, the dorso-velar (5) and labial constriction (1 and 3) bases are most active. For the less constricted vowels (IH, EH, AE, UH), there is, perhaps unsurprisingly, a lack of clearly dominant bases, except for the back-vowel rounding basis (3) for UH, and perhaps the horizontal movement basis for EH. This suggests that although the cNMFsc algorithm does extract some discriminatory phonetic structure from articulatory data, there is still plenty of room for improvement.
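The averaging step just described can be written compactly; the sketch below (Python/NumPy, illustrative only) mirrors that procedure, where `phone_intervals` is a hypothetical alignment structure mapping each phone label to its start/end column indices in H.

    import numpy as np

    def average_activation(H, phone_intervals, T=10):
        """Average activation of each of the K bases over all instances of each phone,
        including T-1 columns of context before and after each interval (the bases are
        spatio-temporal with length T)."""
        K, N = H.shape
        profiles = {}
        for phone, spans in phone_intervals.items():
            columns = []
            for start, end in spans:
                lo, hi = max(start - (T - 1), 0), min(end + (T - 1), N)
                columns.append(H[:, lo:hi])
            profiles[phone] = np.hstack(columns).mean(axis=1)  # one value per basis
        return profiles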
We notice that in general there are some similarities in movement between different bases but also differences, i.e., each basis captures some distinct aspect of speech articulation, mostly the formation of distinct constrictions. The bases extracted by the algorithm depend on the choice of parameters, and will change accordingly. Thus we do not claim to have solved the problem of finding the "correct" set of articulatory primitives (under the assumption that they do exist). However, we have proposed an algorithm that can extract interpretable spatio-temporal primitives from measured articulatory data that are similar in information content to those derived from a well-understood model of the speech production system. The extracted bases are similar to the bases of Articulatory Phonology, which also represent constrictions of the vocal organs. One difference is that there are distinct bases extracted for distinct constriction locations of the tongue body (palatal, velar/uvular, and pharyngeal), while in the standard version of Articulatory Phonology [e.g. 13], these are distinguished by parameter values of a single basis (Tongue Body Constriction Location). However, this parametric view has been independently called into question by recent data [e.g. 43, 44] that argues that distinct constriction locations are qualitatively distinct actions. A second difference is that distinct bases appear to be extracted for labial and coronal constrictions in different vowel contexts (front vs. back), while in AP, these contextual differences result from the temporal overlap of activation of consonant and vowel bases. Why the consonant and vowel constrictions are conflated in the extracted bases is not clear. Perhaps with fewer bases, this would not be the case. A thorough validation of these primitives is a subject for future research.
6.7 Discussion and future work
In order to fully understand any cognitive control system which is not directly observable, it is important to experimentally examine the applicability of any knowledge-driven theory or data-driven model of the system vis-a-vis some actual system measurements/observables. However, comparatively little work has been done with respect to data-driven models, especially in the case of the speech production system. In this paper, we have presented some initial efforts toward that end, by proposing a data-driven approach to extract sparse primitive representations from real and synthesized articulatory data. We further examined their relation to the gestural activations for the same data predicted by the knowledge-based Task Dynamics model. We view this as a first step towards our ultimate goal of bridging and validating knowledge-driven and data-driven approaches.

There remain several open research directions. From an algorithmic perspective, for example, we need to consider nonparametric approaches that do not require a priori choice of parameters such as the temporal dimension of each basis or the number of bases. We also need to design better techniques to validate and understand the properties of articulatory movement primitives. Such methods should obey the rules and constraints imposed by the phonetics and phonology of a language while being able to reconstruct the repertoire of articulatory movements with high fidelity.
There are many applications of these threads of research. Consider the case of coarticulation in speech, where the position of an articulator/element may be affected by the previous and following target [94]. Using the idea of motor primitives, we can explore how the choice, ordering and timing of a given movement element within a well-rehearsed sequence can be modified through interaction with its neighboring elements (co-articulation). For instance, through a handwriting-trajectory learning task, [117] demonstrate that extensive training on a sequence of planar hand trajectories passing through several targets results in the co-articulation of movement components, and in the formation of new movement primitives.

Let us further consider the case of speech motor control. One popular theory of motor control is the inverse dynamics model, i.e., in order to generate and control complex behaviors, the brain needs to explicitly or implicitly solve systems of coupled equations. [82] and [37] instead argue for a less computationally complex viewpoint wherein the central nervous system uses a set of primitives (in the form of force fields acting upon controlled articulators to generate stable postures) to "solve" the inverse dynamics problem. Constructing internal neural representations from a linear combination of a reduced set of modifiable basis functions tremendously simplifies the task of learning new skills, generalizing to novel tasks or adapting to new environments [26]. Further, particular choices of basis functions might further reduce the number of functions required to represent learned information successfully. Thus, by understanding and deriving a meaningful set of articulatory primitives, we can develop better models of speech motor control, and possibly, at an even higher level, move toward an understanding of the language of speech actions [see for example work by 34].
6.8 Conclusions
We have presented a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to automatically extract interpretable articulatory movement primitives from human speech production data. We found that the extracted activation functions or synergies corresponding to different basis functions (from both synthetic as well as measured articulatory data) captured the important information structure contained in the gestural scores in general, and further estimated linear transformation matrices to convert the estimated activation functions to gestural scores. Since gestures may be viewed as a linguistically-motivated theoretical set of primitives employed for speech production, the results presented in this paper suggest that (a) the cNMFsc algorithm successfully extracts movement primitives from human speech production data, and (b) the extracted primitives are linguistically interpretable in an Articulatory Phonology [12] framework.
Figure 6.4: A screenshot of the Task Dynamics Application (or TaDA) software GUI [after 85]. In the center is the temporal display consisting of gestural scores that are input to the task dynamic model as well as tract variables and articulator time-trajectories which the model outputs. Displayed to the left is the instantaneous vocal tract shape and area function at the time marked by the cursor in the temporal display. Note especially the pellets corresponding to different pseudo vocal-tract flesh-points in the top left display, movements of which are recorded and used for our experiments.
Figure 6.5: Schematic illustrating the proposed cNMFsc algorithm. The input matrix V can be constructed either from real (EMA) or synthesized (TaDA) articulatory data. In this example, we assume that there are M = 7 articulator fleshpoint trajectories. We would like to find K = 5 basis functions or articulatory primitives, collectively depicted as the big red cuboid (representing a three-dimensional matrix W). Each vertical slab of the cuboid is one primitive (numbered 1 to 5). For instance, the white tube represents a single component of the 3rd primitive that corresponds to the first articulator (T samples long). The activation of each of these 5 time-varying primitives/basis functions is given by the rows of the activation matrix H in the bottom right hand corner. For instance, the 5 values in the t-th column of H are the weights which multiply each of the 5 primitives at the t-th time sample.
90
Figure 6.6: Schematic illustrating how shifted and scaled primitives can additively reconstruct the original input data sequence. Each gold square in the topmost row represents one column vector of the input data matrix, V, corresponding to a single sampling instant in time. Recall that our basis functions/primitives are time-varying. Hence, at any given time instant t, we plot only the basis functions/primitives that have non-zero activation (i.e., the corresponding rows of the activation matrix at time t have non-zero entries). Notice that any given basis function extends T = 4 samples long in time, represented by a sequence of 4 silver/gray squares each. Thus, in order to reconstruct, say, the 4th column of V, we need to consider the contributions of all basis functions that are "active" starting anywhere between time instants 1 to 4, as shown.
Figure 6.7: Akaike Information Criterion (AIC) values for different values of K (the number of bases) and T (the temporal extent of each basis) computed for speaker fsew0. We observe that an optimal model selection prefers the parameter values to be as low as possible, since the number of parameters in the model far exceeds the contribution of the log likelihood term in computing the AIC.
Figure 6.8: Histogram of the number of non-zero constriction task variables at any
sampling instant.
Figure 6.9: Root mean squared error (RMSE) for each articulator and phone class
(categorized by ARPABET symbol) obtained as a result of running the algorithm on
all 460 sentences spoken by male speaker msak0.
Figure 6.10: Root mean squared error (RMSE) for each articulator and phone class
(categorized by ARPABET symbol) obtained as a result of running the algorithm on
all 460 sentences spoken by female speaker fsew0.
Figure 6.11: Histograms of the fraction of variance unexplained (FVU) by the proposed cNMFsc model for MOCHA-TIMIT speakers msak0 (left) and fsew0 (right). The samples of the distribution were obtained for each speaker by computing the FVU for each of the 460 sentences. (The algorithm parameters used in the model were S_h = 0.65, K = 8 and T = 10.)
(a) JAWy (b) JAWy
(c) LLy (d) LLy
(e) TTy (f) TTy
(g) TDy (h) TDy
(i) TRx (j) VELy
Figure 6.12: Original (dashed) and cNMFsc-estimated (solid) articulator trajectories of selected TaDA articulator variables (left) and EMA (MOCHA-TIMIT) articulator variables (obtained from speaker msak0) (right) for the sentence "this was easy for us." The vertical axis in each subplot depicts the value of the articulator variable scaled by its range (to the interval [0,1]), while the horizontal axis shows the sample index in time (sampling rate = 100 Hz). The algorithm parameters used were S_h = 0.65, K = 8 and T = 10. See Table 6.1 for an explanation of articulator symbols.
Figure 6.13: Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0. The algorithm parameters used were S_h = 0.65, K = 8 and T = 10. The front of the mouth is located toward the left hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step) starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.
Figure 6.14: Average activation pattern of the K = 8 basis functions or primitives for (a) voiceless stop consonants, and (b) British English vowels obtained from speaker msak0's data. For each phone category, 8 colorbars are plotted, one corresponding to the average activation of each of the 8 primitives. This was obtained by collecting all columns of the activation matrix corresponding to each phone interval (as well as T-1 columns before and after) and taking the average across each of the K = 8 rows.
CHAPTER 7
Exploring production–perception links in speech:
Data-driven articulatory gesture-like representations retain discriminatory information about phone categories
7.1 Introduction
Consider an information-based model of speech communication where the aim is to optimally and robustly convey a piece of information from speaker to listener. Scientists are still unclear about how the speech communication system has evolved in humans. One possibility is that the human auditory system has evolved to perceive speech produced by talkers, while another is that speakers' articulatory mechanisms have evolved to produce sounds that can be perceived by the listener's auditory system. A more likely possibility is that these systems have evolved together, the development of each bootstrapped by the other. This is because speech articulation is not the only action that can be produced by the human vocal organs, and likewise speech is not the only class of sounds that can be perceived by the auditory system. The human speech production system can perform actions other than those required for producing speech sounds (swallowing, chewing, etc.), while the auditory system can perceive natural sounds in the 20–20,000 Hz range, including those that have distinct spectro-temporal characteristics not found in human speech. Articulatory Phonology [12] theorizes that the act of speaking is decomposable into units of vocal tract action called "gestures," and suggests that lexical items are assembled from these dynamic primitive units, i.e., constriction actions of the vocal organs. Note that these representations are essentially low dimensional in nature. The theory that the speech production and perception systems co-evolved to jointly optimize their (information encoding/decoding) performance with respect to each other, among other criteria, posits two broad predictions. First, the auditory system in listeners must process speech so as to preserve maximal information about the "intended" speech gestures of the speaker. Second, speakers must encode information (linguistic or paralinguistic) into speech gestures (and thereby speech) in such a manner that it can be robustly extracted by listeners.
With respect to the first prediction, researchers have presented evidence suggesting that the objects of speech perception are the intended gestures of the speaker, which could be represented, for instance, as invariant motor commands for linguistically significant movements [28, 64]. [116] found that the filterbank model of the cochlea has high coding efficiency for conveying maximal information to the brain for a wide range of natural sounds and, in particular, speech. It was in fact mathematically shown by [113] that a cochlear-like filterbank provides the Bayes-optimal phonetic classification. The research of [30] and [7] has shown that processing speech signals using an auditory cochlea-like filterbank preserves maximal mutual information between articulatory gestures and the processed speech signals. In other words, auditory filterbank-like transformations improve speech perception/recognition performance because they maximize the articulatory information that speakers transmit. Hence there is substantial evidence in favor of the hypothesis that the human auditory system has evolved to maximally and robustly perceive information regarding the talker's speech gestures.
Now the second prediction posits that speech gestures must encapsulate information such that listeners can optimally perceive it. Such information, in many cases, must also include categorical information regarding underlying learned phonological structures (such as phonemes or syllables) of a language. This information must be discriminative such that these discrete constructs or categories can be teased apart from the continuous acoustic signal by listeners. Hitherto there has been little empirical evidence for such a claim, for two reasons. For one, it was not until recently that major developments have been made in speech articulation measurement [see 98, for a review of recent developments in this field]. Furthermore, speech gestures are theoretically defined in terms of abstract constriction-producing dynamical systems, and it is not clear how to extract these from speech articulation data in a principled manner. However, we recently showed qualitatively and quantitatively that one can robustly extract gesture-like movement primitives from speech articulation data using knowledge-informed machine learning techniques [99]. If these primitive representations are truly gesture-like and our hypothesis is true, they should contain discriminative information regarding underlying linguistic structure, such as phone categories. This leads us to the central question of this paper: do data-driven "activation functions" of gesture-like movement primitives contain information to robustly classify broad phone classes?

Scientific understanding apart, answering such a research question is important for speech technology applications such as automatic speech recognition; finding efficient representations is a key building block for such efforts [for a more detailed discussion, please see 98]. Some reasons for this include: (i) improved noise robustness [105], (ii) better performance on spontaneous speech, which exhibits a greater degree of coarticulation, due to factored representations [21, 24, 75], (iii) better modeling of different sources of variability, e.g., vocal tract morphology [57], (iv) provision of a complementary view of the information captured by acoustic features alone [3], and (v) the significantly lower-dimensional space of articulatory-based feature representations [12, 52]. To motivate the final argument in particular, observe that the speech signal at the acoustic level has a much higher bit rate (e.g., 64 kbits/sec assuming 8 kHz sampling rate and 8 bits/sample encoding) as compared to that of the underlying sound patterns, which have an information rate of less than 100 bits/sec [5]. The presence of this large redundancy in the speech signal means that we first need to extract a lower-dimensional representation of the signal that best captures the discriminative information required for a given task at hand. For example, in the case of a phone discrimination task, we would want to extract a representation that is able to capture the differences between various sounds in a language. Extracting such a representation from speech data is not straightforward. However, if we are able to extract from speech discriminative information about articulatory gestures [see 12], which we know are useful in distinguishing different sounds in a language, we might be better positioned to solve this problem. With that additional motivation, we explore the question of how well low-dimensional "articulatory movement primitives" derived from data by imposing sparsity constraints can discriminate between broad phone categories. That having been said, it is extremely important to stress that the main goal of this paper is not to improve the state of the art in speech classification/recognition, but to enhance our understanding of the scientific link between speech production and perception using computational means.

In earlier work [99, 100], we formulated and defined articulatory movement primitives (or, exemplars) as a dictionary or template set of articulatory movement patterns in space and time, weighted combinations of the elements of which can be used to represent the set of coordinated spatio-temporal movements of vocal tract articulators required for speech production. Although we do not claim that this is a completely validated model for human speech production, we showed that the primitives-extraction method empirically captures articulatory gesture-like components and therefore compositional
elements within speech. Moreover, such a representation captures information regarding movement synergies, i.e., combinations that simplify the production of movements by reducing the degrees of freedom that need to be specified by the motor control system [50].
Figure 7.1 presents a schematic overview of this chapter. We describe the articulatory data used for experiments in Section 7.2. Section 7.3 presents the mathematical formalism used for primitive extraction. Next, in Section 7.5, we describe the classification setup, including appropriate feature preprocessing steps. Finally, we present our experimental observations along with a brief discussion of possible implications in Sections 7.6 and 7.7.
7.2 Data
We analyze the same database as used in the previous chapter on extracting movement primitives: ElectroMagnetic Articulography (EMA) data from the Multichannel Articulatory (MOCHA) database [135], which consists of data from two (British English) speakers, one male and one female. Acoustic and articulatory data were collected while each speaker read a set of 460 phonetically-diverse TIMIT sentences.
7.3 Extraction of primitive movements
If x_1, x_2, ..., x_M are the M time-traces (represented as column vectors of dimension N×1) of EMA articulator trajectory variables, then we can design our data matrix V to be:

V = [x_1 \,|\, x_2 \,|\, \cdots \,|\, x_M]^{\dagger} \in \mathbb{R}^{M \times N}     (7.1)

where \dagger is the matrix transpose operator. We now formulate the cNMFsc optimization problem just as we did earlier, with one important addition. In order to derive primitives that are maximally discriminative of different phone classes, we augment the data matrix V with phone label information (after [129]):

\begin{bmatrix} V \\ V_{\mathrm{lab}} \end{bmatrix} = \sum_{t=0}^{T-1} \begin{bmatrix} W(t) \\ W_{\mathrm{lab}}(t) \end{bmatrix} \overset{t\rightarrow}{H}     (7.2)

where each column of V_lab is a 40×1 vector whose entries are all 0 save for one: we set the entry corresponding to the phone label of the current frame^1 to 1 (there are 40 phone labels in all annotated for this dataset). To force the training algorithm to extract one unique primitive for each phone, we (i) added a (weak) supervision step to the multiplicative update rules of the cNMF training algorithm by forcing the W_lab matrix to be a 40×40 identity matrix, and (ii) set the number of primitives K equal to the number of unique phone classes (40).

The final optimization problem is as follows:

\min_{W,H} \Big\| \begin{bmatrix} V \\ V_{\mathrm{lab}} \end{bmatrix} - \sum_{t=0}^{T-1} \begin{bmatrix} W(t) \\ W_{\mathrm{lab}}(t) \end{bmatrix} \overset{t\rightarrow}{H} \Big\|^2 \quad \text{s.t. } \mathrm{sparseness}(h_i) = S_h, \; \forall i.     (7.3)

Note that the level of sparseness (0 ≤ S_h ≤ 1) is user-defined. See Ramanarayanan et al. [99, 100] for the details of an algorithm that can be used to solve this problem.

^1 Phone labels of each frame were obtained through automatic phonetic alignment of the data.
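As a concrete illustration of this label augmentation, the sketch below builds a data matrix from articulator trajectories (Equation 7.1) and stacks a one-hot phone-label block underneath it (Equation 7.2). All names, dimensions and data here are illustrative assumptions; the actual weakly supervised cNMFsc solver is described in [99, 100] and is not reproduced here.

import numpy as np

M, N, n_phones = 16, 1000, 40        # articulator channels, frames, phone classes (assumed)

ema = np.random.rand(M, N)                        # stand-in for normalized EMA trajectories
frame_labels = np.random.randint(0, n_phones, N)  # stand-in frame-level phone labels

# Equation 7.1: data matrix V, one articulator trajectory per row.
V = ema

# Equation 7.2: one-hot label block V_lab (n_phones x N), one column per frame.
V_lab = np.zeros((n_phones, N))
V_lab[frame_labels, np.arange(N)] = 1.0

# Augmented matrix passed to the (weakly supervised) cNMFsc algorithm.
V_aug = np.vstack([V, V_lab])
print(V_aug.shape)   # (M + n_phones, N) = (56, 1000)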
7.4 Algorithm performance
7.4.1 Quantitative metrics
In order to choose model parameters appropriately, we computed the Akaike Information Criterion [AIC; 1], which overwhelmingly selected parameter values that resulted in low model complexity. Based on this analysis we decided to set the temporal extent of each basis sequence (T) to 10 samples (since this corresponds to a time period of approximately 100 ms, factoring in a sampling rate of 100 samples per second) to capture effects of the order of the length of a phone on average. As mentioned earlier, we chose the number of bases, K, to be equal to the number of phone classes, i.e., 40. The sparseness parameter S_h was set to 0.65 based on experiments with synthetic data.

In order to see how the algorithm performs for different phone classes, we first performed a phonetic alignment of the audio data corresponding to each set of articulator trajectories (using the Hidden Markov Model toolkit, HTK [136]) to enable association with different phone classes. Figure 7.2 shows the root mean squared error (RMSE) for each articulator and broad phone class for MOCHA-TIMIT speaker msak0. Recall that since we are normalizing each row of the original data matrix to the range [0, 1] (and hence each articulator trajectory), the error values in Figure 7.2 can be read on a similar scale. We see that in general, error values are high. The errors were highest (0.13–0.2) for tongue-related articulator trajectories and the upper incisor variable. On the other hand, trajectories of the lip (LLx and LLy) and jaw (JAWx and JAWy) sensors were reconstructed with lower error (approximately 0.1). We further computed the fraction of variance that was not explained (FVU) by the model for each sentence in the database. The histograms of these distributions are plotted in Figure 7.4. The mean and standard deviation of this distribution were 0.079 ± 0.028 for speaker msak0 (i.e., approximately 7.9% of the original data variance was not accounted for on average). These statistics suggest that the cNMFsc model accounts for more than 90% of the original data variance.
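The per-sentence fraction of variance unexplained can be computed as in the short sketch below (illustrative variable names and data; V is the data matrix for one sentence and V_recon its cNMFsc reconstruction, and the variance is taken about the row means here, which is one common convention).

import numpy as np

def fraction_variance_unexplained(V, V_recon):
    # Residual sum of squares divided by total sum of squares about the row means.
    rss = np.sum((V - V_recon) ** 2)
    tss = np.sum((V - V.mean(axis=1, keepdims=True)) ** 2)
    return rss / tss

# Toy example for a single sentence (16 articulator channels, 300 frames).
V = np.random.rand(16, 300)
V_recon = V + 0.05 * np.random.randn(*V.shape)
print(fraction_variance_unexplained(V, V_recon))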
7.4.2 Visualization of extracted basis functions
Figures 7.5, 7.6 and 7.7 show exemplar basis functions extracted from MOCHA-TIMIT data from speaker msak0 corresponding to vowels, stops and fricatives, respectively. We observe that the bases are interpretable and capture the salient articulatory movements required to produce each phone.
7.5 Interval-based broad phone classification setup

In this section, we describe how activation matrices obtained using the algorithm described above are transformed into features suitable for phone classification experiments. We can hypothesize the sequence of phones corresponding to a given utterance, along with their corresponding time-boundaries, by phonetically aligning the audio. Therefore, in this work, the phone categories are entirely based on categorical information obtained from the audio signal. At this point we would like to reiterate that the main goal of this paper is not to improve the state of the art in speech classification/recognition, but to enhance our understanding of the scientific link between speech production and perception using computational means.

Since the activation matrices are sparse by formulation, it does not make sense to use columns of the activation matrix (one per frame) as feature prototypes in a frame-based phone classification experiment (since there will be zeros corresponding to time-frames where no basis is activated). Instead, we choose to compute one feature per phone interval. This way, we are formulating the classification problem as an interval-based phone classification experiment. Therefore, given a segment of activation columns for a given phone interval (i.e., a block subset of columns of the activation matrix), we have to compute a single feature. First, we quantize the space of activation vectors (columns of the activation matrix) to generate a codebook representation of the time-series using an agglomerative information bottleneck-based clustering technique; second, we compute histograms of co-occurrences (denoted HAC [128]) of the codebook indices over the time-series^2. HAC representations are useful since they explicitly model co-occurrences of articulatory feature instances over time. We describe the procedure in more detail below.
7.5.1 Codebook generation
We perform vector quantization (VQ) of the columns of the activation matrices using the agglomerative information bottleneck (AIB) principle [114]. We formulate the problem as that of finding a quantization or compressed representation Ĥ of the activation space H that minimizes the mutual information I(H; Ĥ) between them, while simultaneously maximizing the mutual information I(A; Ĥ) between Ĥ and the space of acoustic features A. In other words, we would like to find that quantization of the activation space that achieves maximal compression while retaining as much discriminative information as possible about acoustic features^3. We use the VLFeat software [130] to perform this clustering. Figure 7.8 plots the mutual information I(A; Ĥ) as a function of the number of clusters/codebook entries. We observe a rapid drop in mutual information as the number of clusters drops below 20. Based on empirical observation of this graph, we choose a codebook size of 32 clusters for our experiments.
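The greedy merge step at the heart of AIB can be illustrated as follows: starting from a joint count table over (acoustic cluster, activation cluster) pairs, the two activation clusters whose merge loses the least mutual information I(A; Ĥ) are fused repeatedly until the desired codebook size is reached. This is a minimal sketch with toy inputs; in practice we used the VLFeat implementation [130] rather than this code.

import numpy as np

def mutual_information(P):
    # I(A; H) for a joint probability table P[a, h].
    Pa = P.sum(axis=1, keepdims=True)
    Ph = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float(np.sum(P[nz] * np.log2(P[nz] / (Pa @ Ph)[nz])))

def aib_merge(counts, n_clusters):
    # Greedily merge columns (activation clusters) until n_clusters remain.
    P = counts / counts.sum()
    cols = [P[:, j] for j in range(P.shape[1])]
    while len(cols) > n_clusters:
        best = None
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                merged = cols[:i] + cols[i + 1:j] + cols[j + 1:] + [cols[i] + cols[j]]
                mi = mutual_information(np.stack(merged, axis=1))
                if best is None or mi > best[0]:
                    best = (mi, i, j)   # keep the merge that preserves the most I(A; H)
        _, i, j = best
        cols = cols[:i] + cols[i + 1:j] + cols[j + 1:] + [cols[i] + cols[j]]
    return np.stack(cols, axis=1)

# Toy joint counts: 5 acoustic clusters (rows) x 12 initial activation clusters (columns).
counts = np.random.randint(1, 20, size=(5, 12)).astype(float)
print(aib_merge(counts, n_clusters=4).shape)   # (5, 4)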
7.5.2 Computing histograms of co-occurrences
We first replace each frame of the activation matrix H with the best matching centroid of the codebook. This way, the activation matrix is now represented by a single row vector of VQ-labels, H_quant. A HAC representation of lag τ is then defined as a vector where each entry corresponds to the number of times all pairs of VQ-labels are observed τ frames apart. In other words, we construct a vector of lag-τ co-occurrences where each entry (m, n) signifies the number of times that the input sequence of activation frames is encoded into VQ-label m at time t (in the row vector H_quant), while encoded into VQ-label n at time t + τ [129]. By stacking all (m, n) combinations, each phone interval can be represented by a single column vector where the elements express the sum of all K² possible lag-τ co-occurrences (where K is the number of VQ clusters). See Figure 7.9. We can repeat the procedure for different values of τ, and stack the results into one "supervector". Note, however, that the dimensionality of the HAC feature increases by a factor of K² for each lag value that we want to consider. In our case, we empirically found that choosing four lag values of 2, 3, 4 and 5 frames gave optimal classification performance.
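A minimal sketch of this co-occurrence computation (assuming the frame-wise VQ labels H_quant are already available) is given below; label values, codebook size and lags are illustrative.

import numpy as np

def hac_features(vq_labels, K, lags=(2, 3, 4, 5)):
    # Histogram of lag-tau co-occurrences of VQ labels, stacked over all chosen lags.
    feats = []
    for tau in lags:
        cooc = np.zeros((K, K))
        for t in range(len(vq_labels) - tau):
            cooc[vq_labels[t], vq_labels[t + tau]] += 1
        feats.append(cooc.ravel())
    return np.concatenate(feats)

# Toy example: a phone interval of 30 frames quantized with a 32-entry codebook.
labels = np.random.randint(0, 32, size=30)
print(hac_features(labels, K=32).shape)   # (32*32*4,) = (4096,)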
7.5.3 Classification experiments

We used support vector machine (SVM) classifiers to perform classification experiments [15] with 10-fold cross-validation. We experimented with both linear as well as radial basis function (RBF) kernels and empirically found that the former gave better classification accuracy. This could be due to the large dimensionality of the HAC feature space. Hyperparameters were tuned using a grid-search method.
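Schematically, the classification step looks like the sketch below, assuming the HAC supervectors and phone-interval labels are already computed; scikit-learn is used here purely for illustration and is not necessarily the toolkit used for the experiments reported in this chapter.

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Stand-in data: one HAC supervector per phone interval, with its (broad) phone label.
X = np.random.rand(200, 4096)
y = np.random.randint(0, 5, size=200)

# Grid search over the regularization constant of a linear-kernel SVM,
# followed by 10-fold cross-validation of the selected model.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
scores = cross_val_score(grid.best_estimator_, X, y, cv=10)
print(f"mean 10-fold accuracy: {scores.mean():.3f}")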
^2 Notice that the initial quantization step is needed because the column entries are not discrete-valued, making it impractical to compute meaningful co-occurrences directly.
^3 Note that we are not adding any extra information from acoustics to the activation features obtained from articulatory data. We are just clustering them differently using acoustic information. Thus the argument that we are using only articulatory information to cluster phone categories still holds.
7.6 Observations and results
Table 7.1 shows the performance of the activation features (after appropriate HAC-feature transformation) on an interval-based phone classification task. Also shown for reference purposes are the performances of the raw EMA pellets themselves (16-dimensional), as well as mel-frequency cepstral coefficient (MFCC) features along with delta and double-delta coefficients (39-dimensional) on the same task. Note that each of these feature sets, i.e., primitive activation features, MFCCs and raw EMA features, was passed through the same classification module (box to the right in Figure 7.1), in order to ensure a fair point of reference. Recall that the MFCC and EMA experiments were performed mainly as a reference, and the main purpose of the classification study is to see how well articulatory gesture-like activation features can be used to classify different phone classes. We report results on both a full-blown 39-class classification task as well as a simpler broad-phone task consisting of 5 classes (namely vowels, stops, fricatives and affricates, nasals, and approximants). As might be expected, the performance for all features on the latter task is significantly better than that on the full-blown classification task. Our initial experiments also suggest that the activation features learnt by the cNMFsc algorithm significantly outperform both raw MFCC and raw EMA features in terms of classification accuracy. However, one cause for this may be that the classification process (AIB clustering, followed by HAC extraction) might not be as well suited to MFCC and EMA features as it is to the sparse activation features. Further note that results on the interval-based classification task described here are different from the state-of-the-art performance numbers reported on traditional frame-based classification tasks, where MFCCs perform competently [for further details on these numbers, please see e.g., 35, 48, 67].
For a deeper understanding of what the classification accuracy numbers in Table 7.1 actually mean, we also computed the entropy of each feature set and the mutual information^4 (MI) between each feature set and the phone labels. We observe that although the entropy (and consequently bit rate, assuming a fixed encoding scheme) of primitive activation features is lower than that of the raw MFCC or EMA features, the mutual information between the phone labels and the different features considered is still comparable. This, along with the weak supervision during the learning process, suggests that primitive activations are a useful, low-dimensional representation capable of discriminating phone classes. In addition, we can see that although the MFCC and EMA features have a similar entropy value, the former has a higher MI. This is in agreement with the observation of a higher classification accuracy. The continuing challenge for future work will be finding representations that push the classification accuracy envelope while minimizing the required bitrate.
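The entropy and mutual-information estimates reported in Table 7.1 follow the clustering-based procedure described in footnote 4; a schematic version with stand-in data is sketched below (k-means from scikit-learn is used here as an illustrative substitute for whatever implementation details were actually employed).

import numpy as np
from sklearn.cluster import KMeans

def entropy_and_mi(features, labels, n_clusters=128):
    # Assign each feature vector to a k-means cluster, then use maximum-likelihood
    # estimates of the cluster probabilities to compute H(X) and I(X; L) in bits.
    assignments = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit_predict(features)
    joint = np.zeros((n_clusters, labels.max() + 1))
    for c, l in zip(assignments, labels):
        joint[c, l] += 1
    joint /= joint.sum()
    px, pl = joint.sum(axis=1), joint.sum(axis=0)
    h_x = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log2(joint[nz] / np.outer(px, pl)[nz]))
    return h_x, mi

X = np.random.rand(2000, 39)              # stand-in feature vectors (e.g., MFCCs)
L = np.random.randint(0, 39, size=2000)   # stand-in phone labels
print(entropy_and_mi(X, L))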
7.7 Discussion and outlook
Our results suggest that articulatory movement primitives offer information for discriminating between broad phone classes. It is important to note that the performance of these features is contingent upon the way they are extracted, and therefore, algorithmic choices, such as the sparseness value S_h, the number of primitives, and the temporal extent of each primitive, will greatly influence the outcome of subsequent classification experiments. The primitives we extract will depend on the cost function that we formulate and optimize. Development of better problem formulations and algorithms to extract primitives is an exciting area of ongoing and future research. Having said that, it is encouraging to observe that the computationally estimated lower-dimensional primitive representations of speech articulation contain useful information to distinguish between broad phone categories. Note also that the (EMA) articulatory data used in the experiments offers only a limited view of the complex articulatory mechanisms. Although they do encode information about phonetic categories, these movements represent only a part of the picture with respect to the phonetic categories.

^4 To estimate the probability of a given feature vector: (i) we clustered the data using k-means clustering (K = 128) and assigned each feature to a cluster; (ii) we set the probability of occurrence of a feature to be equal to the maximum likelihood estimate of the probability of occurrence of its corresponding cluster.
Table 7.1: Performance of various feature sets on an interval-based phone classification experiment (after appropriate transformation to HAC representations). For clarity of understanding we also show the entropy of the feature set X, denoted by H(X), along with the mutual information between the feature set and the phone labels L (39 classes), denoted by I(X;L), in each case. We also performed classification experiments on a 5-class broad phone set L_broad (where each of the original 39 phones was categorized as either vowel, stop, fricative, nasal or approximant).

Feature set X                Speaker   Accuracy (Full)   Accuracy (Broad)   H(X)   I(X;L)   I(X;L_broad)
MFCC+Δ+ΔΔ                    msak0     46.46%            71%                6.9    1.68     0.39
                             fsew0     49.85%            65.63%             6.9    1.78     0.43
Raw EMA pellets              msak0     36.87%            61.78%             6.9    1.59     0.375
                             fsew0     40.23%            57.20%             6.9    1.66     0.40
Primitive activations        msak0     80.59%            85.01%             6.5    1.63     0.40
                             fsew0     84.16%            87.77%             6.56   1.93     0.51
Phone labels L                         100%              -                  4.9    4.9      -
Broad phone labels L_broad             -                 100%               1.96   -        1.96
The results presented here allow us to re-examine the problem of optimal feature extraction for phone classification/recognition in a new light. In other words, good features for phone classification and recognition would be those that retain maximal mutual information with the different phone classes of interest as well as with the speech articulation process. We have shown that appropriately post-processed activation features of articulatory primitives exhibiting this property perform well on classification experiments while retaining linguistic interpretability. The challenge for future work is to extend the ideas presented here from phone classification to full-blown phone recognition. Also, since we have presented results for only two speakers (as the MOCHA-TIMIT public release is limited to data from two speakers), future work will also look at examining the robustness of these results with articulatory and acoustic data from more speakers. Furthermore, extending these results to a speaker-independent setting (as opposed to the speaker-specific setting described here) would be a useful and important future research direction.
We can now return to our predictions regarding the joint optimization and co-evolution of the speech production and perception systems in light of our results. The first prediction was that the auditory system in listeners must process speech so as to preserve maximal information regarding the "intended" speech gestures of the speaker. This position is supported by both theoretical work [28, 64] as well as recent empirical results [7, 30]. The second prediction was that speakers must encode information (linguistic or paralinguistic) into speech gestures (and thereby speech) in such a manner that it can be robustly extracted by listeners. This paper has shown empirical results supporting this position. The Motor Theory of speech perception [64] also predicts that the human speech production system must produce just the right maneuvers to fit the demands of the categories imposed by the auditory system. Assuming that articulatory movement primitives can be considered as a surrogate for at least a subset of these maneuvers, our results are in agreement with the theory. This is because we find that experimentally-derived articulatory primitives are not only good surrogate representations of articulatory gestures (as observed in Figures 7.5, 7.6 and 7.7), but also contain discriminatory information regarding different phone categories (as seen in Table 7.1). This work presents a first step toward a more complete understanding of whether information transfer during speech production is performed so as to effect efficient perception of auditory categories.
Figure 7.1: Schematic of the experimental setup. The input matrix V is constructed from real (EMA) articulatory data. In this example, we assume that there are M = 7 articulator fleshpoint trajectories. We would like to find K = 5 basis functions or articulatory primitives, collectively depicted as the big red cuboid (representing a three-dimensional matrix W). Each vertical slab of the cuboid is one primitive (numbered 1 to 5). For instance, the white tube represents a single component of the 3rd primitive that corresponds to the first articulator (T samples long). The activation of each of these 5 time-varying primitives/basis functions is given by the rows of the activation matrix H in the bottom right hand corner. For instance, the 5 values in the t-th column of H are the weights which multiply each of the 5 primitives at the t-th time sample. The activation matrix is used as input to the classification module, which consists of 3 steps: (i) dimensionality reduction using agglomerative information bottleneck (AIB) clustering, (ii) conversion to a histogram of co-occurrences (HAC) representation to capture dependence information across the time-series, and (iii) a final support vector machine (SVM) classifier.
[Figure 7.2 image: RMSE heat map with broad phone classes (Vowels, Stops, Frics, Nasals, Liquids) on the horizontal axis, articulatory variables (ULx, ULy, LLx, LLy, JAWx, JAWy, TTx, TTy, TBx, TBy, TDx, TDy, VELx, VELy, UIx, UIy) on the vertical axis, and a color scale spanning approximately 0.08 to 0.18.]
Figure 7.2: Root mean squared error (RMSE) for each articulator and broad phone
class obtained as a result of running the algorithm on all 460 sentences spoken by male
speaker msak0.
[Figure 7.3 image: RMSE heat map with broad phone classes (Vowels, Stops, Frics, Nasals, Liquids) on the horizontal axis, articulatory variables (ULx through UIy) on the vertical axis, and a color scale spanning approximately 0.08 to 0.18.]
Figure 7.3: Root mean squared error (RMSE) for each articulator and broad phone class obtained as a result of running the algorithm on all 460 sentences spoken by female speaker fsew0.
Figure 7.4: Histograms of the fraction of variance unexplained (FVU) by the proposed cNMFsc model for MOCHA-TIMIT speakers msak0 (left) and fsew0 (right). The samples of the distribution were obtained by computing the FVU for each of the 460 sentences. (The algorithm parameters used in the model were S_h = 0.65, K = 40 and T = 10.)
Figure 7.5: Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to different English monophthong (first and third columns) and diphthong (second column) vowels. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were S_h = 0.65, K = 40 and T = 10. The front of the mouth is located toward the left hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step) starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend (see Table 6.1 for the list of EMA trajectory variables).
Figure 7.6: Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to stop (first two rows), nasal (third row) and approximant (last row) consonants. All rows except the last are arranged in order of labial, coronal and dorsal consonant, respectively. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were S_h = 0.65, K = 40 and T = 10. The front of the mouth is located toward the left hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step) starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.
Figure 7.7: Spatio-temporal basis functions or primitives extracted from MOCHA-TIMIT data from speaker msak0 corresponding to fricative and affricate consonants. Each panel is denoted by its ARPABET phone symbol. The algorithm parameters used were S_h = 0.65, K = 40 and T = 10. The front of the mouth is located toward the left hand side of each image (and the back of the mouth on the right). Each articulator trajectory is represented as a curve traced out by 10 colored markers (one for each time step) starting from a lighter color and ending in a darker color. The marker used for each trajectory is shown in the legend.
Figure 7.8: Mutual information I(A; Ĥ) between the quantized activation space Ĥ and the space of acoustic features A as a function of the cardinality of Ĥ (in other words, the number of quantization levels).
Figure 7.9: Schematic depicting the computation of histogram of articulatory co-occurrences (HAC) representations. For a chosen lag value τ and a time-step t, if we find labels m and n occurring τ time-steps apart (marked in gold), we mark the entry of the lag-τ co-occurrence matrix corresponding to row (m, n) and the t-th column with a 1 (the corresponding entry is also marked in gold). We sum across the columns of this matrix (across time) to obtain the lag-τ HAC representation.
CHAPTER 8
Control Primitives
8.1 Introduction
In previous chapters, we proposed a method to extract interpretable articulatory movement primitives from raw speech production data. Although this topic has remained relatively unexplored in the domain of speech production, there has been significant work on uncovering motor primitives in the general motor control community. For instance, [20] and [8] proposed a variant of a nonnegative matrix factorization algorithm to extract muscle synergies from frogs that performed various movements. More recently, [16] extended these ideas to the control domain, and showed that the various movements of a two-joint robot arm could be effected by a small number of control primitives.

In this work, we propose an extension of these ideas to a control systems framework. Specifically, we: (i) present a data-driven approach to extract a spatio-temporal dictionary of control primitives (sequences of control parameters), which can then be used to control the vocal tract dynamical system to produce any desired sequence of movements; (ii) propose methods to validate^1 the proposed approach both quantitatively (using various performance metrics) as well as qualitatively (by examining how well it can recover evidence of compositional structure from data generated by an articulatory synthesizer); and (iii) show that such an approach can yield primitives that are linguistically interpretable on visual inspection.

We adopt the less-explored data-driven approach to extract control primitives from model articulatory synthesizer data and examine their relation to the gestural activations for the same data predicted by the knowledge-based model described above. The computational approach we adopt imposes the requirement of sparsity based on the rationale that we would like any given sequence of movements to be effected by a combination of as few primitives as possible, i.e., each primitive represents a specific observed movement pattern, as far as possible. We further explore other related questions of interest, such as: (i) how many control primitives might be used in speech production, and (ii) what might the vocal tract movements they effect look like? We view these as first steps toward our ultimate goal of bridging and validating knowledge-driven and data-driven approaches to understanding the role of primitives in speech motor planning and execution.

While real speech movement/physiological data will be the ultimate testbed for these kinds of algorithmic methods, we will focus particularly on synthetic data for this chapter because: (i) deriving the "correct" set of vocal tract control variables from real data is an involved problem in itself; (ii) we know the generative model of the articulatory synthesizer, which allows us to validate and interpret the results of the method; and (iii) an in-depth analysis and validation of synthesized data will allow us to extend the results to real data more easily in a future study. Specifically, we analyze synthetic data generated by CASY (the configurable articulatory speech synthesizer) [42, 107], which interfaces with the Task Dynamics model of articulatory control and coordination in speech [109] within the framework of Articulatory Phonology [12].

^1 Note that validation of experimentally-derived primitives, especially in the absence of absolute ground truth, is a difficult problem.
8.2 Data
We analyzed synthetic VCV (vowel-consonant-vowel) data generated by the Task Dynamics Application (or TaDA) software [85, 108], which implements the Task Dynamic model of inter-articulator coordination in speech within the framework of Articulatory Phonology [12]. We chose to analyze synthetic data since (i) the articulatory data is generated by a known compositional model of speech production, and (ii) we can generate a balanced dataset of VCV observations. We generated 972 Vowel-Consonant-Vowel sequences (VCVs) corresponding to all combinations of 9 English vowels (or monophthongs) and 12 consonants (including stops, fricatives, nasals and approximants). We then downsampled the articulator variable trajectory outputs to 100 Hz. We further normalized the data in each channel (by its range) such that all data values lie between 0 and 1.
8.3 Dynamical systems model of the vocal tract
We need to define a suitable model of vocal tract dynamics, which we can then use to simulate the vocal tract's response to applied control inputs. We will adopt and extend the Task Dynamics model of speech articulation [after 109]:

M\ddot{z} + B\dot{z} + Kz = \hat{\tau}     (8.1)
Figure 8.1: A visualization of the Configurable Articulatory Synthesizer (CASY) in a neutral position, showing the outline of the vocal tract model. Overlain are the key points (black crosses) and geometric reference lines (dashed lines) used to define the model articulator parameters (black lines and angles), which are: lip protrusion (LX), vertical displacements of the upper lip (UY) and lower lip (LY) relative to the teeth, jaw angle (JA), tongue body angle (CA), tongue body length (CL), tongue tip length (TL), and tongue angle (TA).
z = f(\phi)     (8.2)

\dot{z} = J(\phi)\,\dot{\phi}     (8.3)

\ddot{z} = J(\phi)\,\ddot{\phi} + \dot{J}(\phi, \dot{\phi})\,\dot{\phi}     (8.4)

where z refers to the task variable (or goal variable) vector, which is defined in TaDA as a set of constriction degrees (such as lip aperture, tongue tip constriction degree, velic aperture, etc.) or locations (such as tongue tip constriction location). M is the mass
matrix, B is the damping coefficient matrix, and K is the stiffness coefficient matrix of the second-order dynamical system model. \hat{\tau} is a control input. However, in motor control, we typically cannot directly control these task variables. We can only control so-called articulatory variables, \phi, which can be nonlinearly related to the task variables using the 'direct kinematics' relationship (Equations 8.2–8.4). Using this relationship, we can derive the equation of the corresponding dynamical system for controlling the model articulators (e.g., see Figure 8.1):

\ddot{\phi} + \big(J^{\dagger}M^{-1}BJ + J^{\dagger}\dot{J}\big)\,\dot{\phi} + J^{\dagger}M^{-1}K\,f(\phi) = \tau     (8.5)

where J is the Jacobian matrix (which can be obtained from TaDA). In our experiments, we chose M = I, B = 2\omega I, and K = \omega^2, where I is the identity matrix and \omega is the critical frequency of the (critically-damped) spring-mass dynamical system, which we set as 0.6 (this value was chosen empirically as the mean of the \omega values that TaDA uses for consonant and vowel gestures, respectively). For simplicity, in the initial experiments described in this paper we chose the derivative of the Jacobian matrix, \dot{J}, to be the zero matrix, and assumed that f(\phi) is locally linear.
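To make the role of Equation 8.5 concrete, the sketch below integrates such a critically damped second-order articulator model forward in time under a given control input, using the simplifying assumptions stated above (zero Jacobian derivative and a locally linear f). The Jacobian, dimensions and control values are placeholders rather than the actual TaDA quantities.

import numpy as np

def simulate_articulators(phi0, tau, J, omega=0.6, dt=0.01):
    # Forward-Euler integration of
    #   phi'' + J+ B J phi' + J+ K f(phi) = tau,  with M = I, B = 2*omega*I,
    #   K = omega**2 * I, Jdot = 0 and f(phi) approximated locally by J @ phi.
    Jp = np.linalg.pinv(J)                    # J-dagger (Moore-Penrose pseudoinverse)
    B = 2.0 * omega * np.eye(J.shape[0])
    K = omega ** 2 * np.eye(J.shape[0])
    phi = phi0.astype(float).copy()
    dphi = np.zeros_like(phi)
    traj = [phi.copy()]
    for t in range(tau.shape[1]):
        ddphi = tau[:, t] - Jp @ (B @ (J @ dphi)) - Jp @ (K @ (J @ phi))
        dphi = dphi + dt * ddphi
        phi = phi + dt * dphi
        traj.append(phi.copy())
    return np.array(traj)

# Toy example: 10 model articulators, 8 task variables, 100 control frames at 100 Hz.
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 10))
controls = 0.1 * rng.standard_normal((10, 100))
print(simulate_articulators(np.zeros(10), controls, J).shape)   # (101, 10)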
8.4 Computing control synergies
At this point we would like to reiterate the advantages of a synergy-based control as opposed to a generalized dynamics-based control. First, it is less computationally expensive, both in terms of space and time: in order to achieve a certain vocal tract target configuration, we do not have to search through the space of all possible configurations. Instead, all we need to do is estimate the optimal sequencing times and relative activation weights of a small number of control synergies in order to produce that configuration. Second, it requires control only of articulators critical to the production of the desired vocal tract configuration (fewer controlled degrees of freedom), which is in agreement with various theoretical and experimental observations on real data [45]; in some sense, this reduces task-critical variability while allowing leeway in variability not critical to the production of a given vocal tract configuration.
In order to find primitive control signals, we first need to use optimal control techniques to compute appropriate control inputs that can drive the dynamical system given in Equation 8.5 to produce the set of articulatory data trajectories corresponding to each of our synthesized VCVs. Once we estimate the control inputs, we can use these as input to algorithms that learn spatiotemporal dictionaries, such as the cNMFsc algorithm [99], to obtain control primitives.
8.4.1 Computing optimal control signals
To find the optimal control signal for a given task, a suitable cost function must be minimized. Such a cost function typically trades off the accuracy of the solution against the magnitude of the control input (since we typically do not want very large input control values) as well as a regularization factor controlling the degree of sparsity (or smoothness) in the solution. Unfortunately, when using nonlinear systems such as the vocal tract system described above, this minimization is computationally intractable. Researchers typically resort to approximate methods to find locally optimal solutions. We will use one such method, developed by [124], called the iterative linear quadratic Gaussian (iLQG) method [see also 16, 63].

The iLQG method starts with an initial guess of the optimal control signal and iteratively improves it. The method uses iterative linearizations of the nonlinear dynamics around the current trajectory, and improves that trajectory via modified Riccati equations. For further details, see [124].
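Although the full derivation is beyond our scope here, the kind of finite-horizon cost being (locally) minimized can be written schematically as

\[
J(u_{1:T-1}) \;=\; (x_T - x^{*})^{\top} Q_f\,(x_T - x^{*}) \;+\; \sum_{t=1}^{T-1}\Big[\, u_t^{\top} R\, u_t \;+\; \lambda\, \rho(u_t) \,\Big],
\]

where x_T is the terminal state, x* the desired articulatory target, Q_f and R are weighting matrices penalizing terminal tracking error and control effort, and \rho(\cdot) is a sparsity or smoothness regularizer weighted by \lambda. This particular form is an illustrative assumption rather than the exact cost used in [124].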
Figure 8.2: Schematic illustrating the proposed cNMFsc algorithm. The input matrix V can be constructed either from real (EMA) or synthesized (TaDA) articulatory data. In this example, we assume that there are M = 7 articulator fleshpoint trajectories. We would like to find K = 5 basis functions or articulatory primitives, collectively depicted as the big red cuboid (representing a three-dimensional matrix W). Each vertical slab of the cuboid is one primitive (numbered 1 to 5). For instance, the white tube represents a single component of the 3rd primitive that corresponds to the first articulator (T samples long). The activation of each of these 5 time-varying primitives/basis functions is given by the rows of the activation matrix H in the bottom right hand corner. For instance, the 5 values in the t-th column of H are the weights which multiply each of the 5 primitives at the t-th time sample.
However, iLQG in its basic form still requires a model of the system dynamics given by the equation \dot{x} = f(x, u), where x is the articulatory state and u is the control input. In order to eliminate this need and enable the algorithm to adapt to changes in the system dynamics in real time, Mitrovic et al. proposed an extension, called iLQG with Learned Dynamics, or iLQG-LD, wherein we learn the mapping f using a computationally efficient machine learning technique such as Locally Weighted Projection Regression, or LWPR [81]. Hence we evaluate this more general system, which is minimally dependent on an explicit and accurate model of the dynamics, in addition to the system which requires the dynamics model described earlier.

In either case, we pass as input to the iLQG algorithm articulator trajectories (see Section 6.4), and obtain as output a set of control signals (time series) that can effect that sequence of movements (one time series per articulator trajectory). In order to initialize the LWPR model of the dynamics, we used the linear, second-order critically-damped model of vocal tract articulator dynamics described earlier in Equation 8.5.
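The forward-dynamics learning step amounts to fitting a regression model to observed (state, control) to state-derivative tuples. LWPR itself is not sketched here; the snippet below substitutes a generic regressor purely to illustrate the data flow, and all names and data are assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_state = rng.standard_normal((1000, 20))   # stand-in states: 10 articulator positions + velocities
U = rng.standard_normal((1000, 10))         # stand-in control inputs
X_dot = rng.standard_normal((1000, 20))     # stand-in observed state derivatives

# Learn the mapping x_dot = f(x, u); iLQG-LD then linearizes this learned model
# around the current trajectory instead of relying on an analytic dynamics model.
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(np.hstack([X_state, U]), X_dot)
print(model.predict(np.hstack([X_state[:1], U[:1]])).shape)   # (1, 20)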
8.4.2 Extraction of control primitives
If \tau_1, \tau_2, ..., \tau_N are the N = 972 control matrices obtained using iLQG for each of the 972 VCVs, then we will first stack these matrices together to form a large data matrix V = [\tau_1 \,|\, \tau_2 \,|\, \cdots \,|\, \tau_N].

In order to find the primitives, we use the same cNMFsc formulation proposed earlier:

\min_{W,H} \Big\| V - \sum_{t=0}^{T-1} W(t)\,\overset{t\rightarrow}{H} \Big\|^2 \quad \text{s.t. } \mathrm{sparseness}(h_i) = S_h, \; \forall i.     (8.6)

where each column of W(t) \in \mathbb{R}_{\geq 0}^{M \times K} is a time-varying basis vector sequence, each row of H \in \mathbb{R}_{\geq 0}^{K \times N} is its corresponding activation vector (h_i is the i-th row of H), T is the temporal length of each basis (number of data frames), and the \overset{i\rightarrow}{(\cdot)} operator is a shift operator that moves the columns of its argument by i spots to the right, as detailed in [115].
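The reconstruction implied by this formulation can be written compactly as in the sketch below: the shift operator moves the activation matrix t columns to the right (zero-padding on the left), and the reconstruction sums the shifted activations filtered through each temporal slice of W. Dimensions and data are illustrative.

import numpy as np

def shift_right(H, t):
    # Move the columns of H by t positions to the right, zero-padding on the left.
    if t == 0:
        return H.copy()
    shifted = np.zeros_like(H)
    shifted[:, t:] = H[:, :-t]
    return shifted

def cnmf_reconstruct(W, H):
    # V_recon = sum_t W(t) @ shift_t(H), with W of shape (T, M, K) and H of shape (K, N).
    T = W.shape[0]
    V_recon = np.zeros((W.shape[1], H.shape[1]))
    for t in range(T):
        V_recon += W[t] @ shift_right(H, t)
    return V_recon

# Toy example: 10 control channels, 8 primitives of temporal extent 10, 500 frames.
W = np.abs(np.random.randn(10, 10, 8))
H = np.abs(np.random.randn(8, 500))
print(cnmf_reconstruct(W, H).shape)   # (10, 500)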
8.5 Experiments and Results
The three-dimensional W matrix and the two-dimensional H matrix described above allow us to form an approximate reconstruction, V_recon, of the original control matrix V. This matrix V_recon can be used to reconstruct the original articulatory trajectories for each VCV by simulating the dynamical system in Equation 8.5. Figures 8.3a and 8.3b show the performance of the algorithm in recovering the original control signals and movement trajectories, respectively, when we assume a pre-specified second-order model of the dynamics. Similarly, Figures 8.4a and 8.4b show the performance of the algorithm when the model of the dynamics is learned using LWPR. We not only observe that learning the dynamics is more robust and accurate than when a pre-specified model of the dynamics is assumed, but also that the former model accounts for a large amount of variance in the original data. The root mean squared errors of the original movements and controls were 0.16 and 0.29, respectively, on average^3. The cNMFsc algorithm parameters used were S_h = 0.65, K = 8 and T = 10. The sparseness parameter was chosen empirically to reflect the percentage of gestures that were active at any given sampling instant (approximately 35%), while the number of bases was selected based on the Akaike Information Criterion or AIC [1], which in this case tends to prefer more parsimonious models. The temporal extent of each basis was chosen to capture effects of the order of 100 ms, as described in the previous chapters.

^3 Recall that earlier we normalized each row of both the articulatory and control matrices to the proportion of its respective range (which will in turn be different for the articulatory matrix versus the control matrix), and so the RMSE values can be interpreted accordingly.
Figure 8.3: (a) Histograms of root mean squared error (RMSE) computed on the reconstructed control signals, assuming a known model of the dynamics, using the cNMFsc algorithm over all 972 VCV utterances, and (b) the corresponding RMSE in reconstructing articulator movement trajectories from these control signals using Equation 8.5.
Note that each control primitive could effect different movements of the vocal tract articulators depending on their initial position/configuration. For example, Figures 8.5 and 8.6 show 8 movement sequences effected by 8 control primitives for one particular choice of starting position when a pre-specified and an LWPR-learned model of the dynamics is used, respectively. In order to appreciate the importance of the initial position (which can be thought of as an articulatory setting of sorts), Figure 8.7 shows the movement sequences effected for a different, more dorsal initial tongue position for the learned dynamics model case. We observe that learning the dynamics allows us to capture a wider range of movements, as captured by different primitives, than when a pre-specified model of the dynamics is used. In both figures, each row of plots was generated by taking one control primitive sequence, using it to simulate the dynamical system learned using either the iLQG or iLQG-LD algorithm, respectively, and visualizing the resulting movement sequence^4. Figure 8.8 shows the median activations of each of the eight bases in Figure 8.6 for selected phones of interest.

^4 The extreme overshoot/undershoot in some cases could be an artifact of normalization. Having said that, it is important to remember that the original data will be reconstructed by a scaled-down version of these primitives (weighted down by their corresponding activations).
Figure 8.4: (a) Histograms of root mean squared error (RMSE) computed on the reconstructed control signals (dynamics model learned using LWPR) using the cNMFsc algorithm over all 972 VCV utterances, and (b) the corresponding RMSE in reconstructing articulator movement trajectories from these control signals using Equation 8.5.
We see that the primitives produce movements that are interpretable: for instance, the bases that are activated the most for P, T, and K are those involved in lip, tongue tip, and tongue dorsum constrictions, respectively. For vowels, we also observe linguistically-meaningful patterning: IY, AA and UW involve high activations of controls that produce palatal, pharyngeal and velar/uvular constrictions, respectively.
8.6 Conclusions and Outlook
We have described a technique to extract synergies of control signal inputs that actuate a learned dynamical systems model of the vocal tract. We further observe, using data generated by the TaDA configurable articulatory synthesizer, that this method allows us to extract control primitives that effect linguistically-meaningful vocal tract movements.
Work described in this paper can help in formulating speech motor control theories that are control synergy- or primitives-based. The idea of motor primitives allows us to explore many longstanding questions in speech motor control in a new light. For instance, consider the case of coarticulation in speech, where the position of an articulator/element may be affected by the previous and following target [94]. Using the idea of motor primitives, we can explore how the choice, ordering and timing of a given movement element within a well-rehearsed sequence can be modified through interaction with its neighboring elements (coarticulation). In other words, different movement sequences could result from changes in the timing and ordering of the same set of control primitives. Constructing internal control representations from a linear combination of a reduced set of modifiable basis functions tremendously simplifies the task of learning new skills, generalizing to novel tasks or adapting to new environments [26]. Further, particular choices of basis functions might further reduce the number of functions required to represent learned information successfully. Thus, by deriving a meaningful set of articulatory control primitives, we can develop better models of speech motor control, and possibly, at an even higher level, move toward an understanding of the language of speech actions [34].
Figure 8.5: Spatio-temporal movements of the articulator dynamical system effected by 8 different control primitives for a given choice of initial position when a pre-defined dynamics model is used. Each row represents a sequence of vocal tract postures plotted at 20 ms time intervals, corresponding to one control primitive sequence. The initial position in each case is represented by the first image in each row. The cNMFsc algorithm parameters used were S_h = 0.65, K = 8 and T = 10. The front of the mouth is located toward the right hand side of each image (and the back of the mouth on the left).
Figure 8.6: Spatio-temporal movements of the articulator dynamical system effected by 8 different control primitives for an initial position with a low flat tongue when the dynamics model is learned using the LWPR algorithm. Each row represents a sequence of vocal tract postures plotted at 20 ms time intervals, corresponding to one control primitive sequence. The initial position in each case is represented by the first image in each row. The cNMFsc algorithm parameters used were S_h = 0.65, K = 8 and T = 10. The front of the mouth is located toward the right hand side of each image (and the back of the mouth on the left).
Figure 8.7: Spatio-temporal movements of the articulator dynamical system effected by 8 different control primitives for a more dorsal initial tongue position when the dynamics model is learned using the LWPR algorithm. Each row represents a sequence of vocal tract postures plotted at 20 ms time intervals, corresponding to one control primitive sequence. The initial position in each case is represented by the first image in each row. The cNMFsc algorithm parameters used were S_h = 0.65, K = 8 and T = 10. The front of the mouth is located toward the right hand side of each image (and the back of the mouth on the left).
[Figure 8.8 image: bar plots of median activation of the 8 bases for (a) stops P, T, K (activation scale on the order of 10^-5) and (b) vowels IY, EH, AA, OW, UW (activation scale on the order of 10^-3), with one bar per basis for each phone.]
Figure 8.8: Median activations of the 8 bases plotted in Figure 8.6 contributing to the production of different sounds, computed over all 972 VCV utterances, for (a) selected stop consonants and (b) selected vowels.
CHAPTER 9
Conclusions and outlook
We have proposed a balanced two-pronged approach to the problem of designing better speech motor control models: (1) a knowledge-driven 'top-down' approach that uses existing knowledge from linguistics and motor control to extract meaningful representations from articulatory data, and further, posit and test specific hypotheses regarding kinematic and postural planning during pausing behavior, with the view that they can help inform modeling; and (2) a data-driven 'bottom-up' approach to deriving 'primitive' articulatory movements that can be used to build models of coordination of speech movements that can potentially operate in an inherently lower-dimensional control subspace. Figure 1.2 schematizes the contributions of this thesis and how it fits into the broader context of speech motor control modeling. Notice that the ultimate goal of both these approaches is to gain insights that can be used to either design better models of speech motor control or to augment existing models. To this effect, here are a few promising avenues for further research:
1. Relation between biomechanical and kinematic motor constraints and phonological
organization in languages: We observed earlier that different vocal tract postures
have different mechanical advantage properties, which could be an important fac-
tor in understanding speech motor control. Different languages have a different
phonological inventory, as well as a different set of vocal postures that will be used
to produce the phones in that inventory. A natural question that arises here
is whether there is a causal relationship between these two factors.
2. Production-perception links: We observed that articulatory movement primitive-
based representations perform favorably in phone classification experiments.
There is, however, much work to be done in finding an optimal representation
of information content in speech that captures the links between speech produc-
tion and perception, while simultaneously serving as a useful feature for robust
speech recognition.
3. Articulatory settings as "launchpad" or initial positions for control models: We
can leverage the findings regarding mechanically advantageous postures and ar-
ticulatory settings in order to better understand and model initial conditions (or
postures), toward developing more integrative models of speech motor control.
4. Toward a more complete understanding of speech motor control: We have pre-
sented a systems-level analysis and model of speech motor control that is based
on the notion of control primitives. Analyzing the validity of these representations
at the neural level is an important next step.
Bibliography
[1] H. Akaike. Likelihood of a model and information criteria. Journal of Econometrics, 16(1):3-14, 1981.
[2] E.K. Antonsson and J. Cagan. Formal Engineering Design Synthesis. Cambridge University Press, 2005.
[3] R. Arora and K. Livescu. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In Int. Conf. on Acoustics, Speech, and Signal Processing, 2013.
[4] B. Atal. Efficient coding of LPC parameters by temporal decomposition. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'83, volume 8, pages 81-84. IEEE, 1983.
[5] B.S. Atal. Automatic speech recognition: A communication perspective. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, volume 1, pages 457-460, 1999.
[6] N.A. Bernstein. The co-ordination and regulation of movements. Pergamon Press Ltd., Oxford, London, 1967.
[7] A. Bertrand, K. Demuynck, V. Stouten, and H. Van hamme. Unsupervised learning of auditory filter banks using non-negative matrix factorisation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4713-4716, 2008.
[8] E. Bizzi, V.C.K. Cheung, A. d'Avella, P. Saltiel, and M. Tresch. Combining modules for movement. Brain Research Reviews, 57(1):125-133, 2008.
[9] S.F. Bockman. Generalizing the formula for areas of polygons to moments. American Mathematical Monthly, 96(2):132, 1989.
[10] E. Bresch and S. Narayanan. Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images. Medical Imaging, IEEE Transactions on, 28(3):323-338, 2009.
[11] E. Bresch, J. Nielsen, K. Nayak, and S. Narayanan. Synchronized and noise-robust audio recordings during realtime magnetic resonance imaging scans. The Journal of the Acoustical Society of America, 120:1791, 2006.
[12] C.P. Browman and L. Goldstein. Dynamics and articulatory phonology. In T. van Gelder and B. Port (Eds.), Mind as motion: Explorations in the dynamics of cognition, pages 175-193, 1995.
[13] C.P. Browman, L. Goldstein, et al. Articulatory phonology: An overview. Phonetica, 49(3-4):155-180, 1992.
[14] D. Byrd and E. Saltzman. The elastic phrase: modeling the dynamics of boundary-adjacent lengthening. Journal of Phonetics, 31(2):149-180, 2003.
[15] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[16] M. Chhabra and R.A. Jacobs. Properties of synergies arising from a theory of optimal motor behavior. Neural Computation, 18(10):2320-2342, 2006.
[17] G. Nick Clements and Rachid Ridouane. Where do phonological features come from?: Cognitive, physical and developmental bases of distinctive speech categories. John Benjamins Publishing, Philadelphia, PA (347 pages), 2011.
[18] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, John Wiley and Sons, Hoboken, NJ (776 pages), 2012.
[19] A. d'Avella and E. Bizzi. Shared and specific muscle synergies in natural motor behaviors. Proceedings of the National Academy of Sciences of the United States of America, 102(8):3076, 2005.
[20] A. d'Avella, A. Portone, L. Fernandez, and F. Lacquaniti. Control of fast-reaching movements by muscle synergy combinations. The Journal of Neuroscience, 26(30):7791-7810, 2006.
[21] L. Deng, G. Ramsay, and D. Sun. Production models as a structural basis for automatic speech recognition. Speech Communication, 22(2):93-111, 1997.
[22] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? Advances in Neural Information Processing Systems, 16, 2004.
[23] J.H. Esling and R.F. Wong. Voice quality settings and the teaching of pronunciation. TESOL Quarterly, 17(1):89-95, 1983.
[24] E. Farnetani. Coarticulation and connected speech processes. The Handbook of Phonetic Sciences, pages 371-404, 1997.
[25] Cédric Févotte and A. Taylan Cemgil. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proc. EUSIPCO, volume 47, pages 1913-1917, 2009.
[26] T. Flash and B. Hochner. Motor primitives in vertebrates and invertebrates. Current Opinion in Neurobiology, 15(6):660-666, 2005.
[27] T. Flash and T.J. Sejnowski. Computational approaches to motor control. Current Opinion in Neurobiology, 11(6):655-662, 2001.
[28] Carol A. Fowler and Bruno Galantucci. The relation of speech perception and speech production. The Handbook of Speech Perception, pages 632-652, 2005.
[29] O. Fujimura and Y. Kakita. Remarks on quantitative description of lingual articulation. Frontiers of Speech Communication Research, pages 17-24, 1979.
[30] P.K. Ghosh, L.M. Goldstein, and S.S. Narayanan. Processing speech signal using auditory-like filterbank provides least uncertainty about articulatory gestures. The Journal of the Acoustical Society of America, 129:4014, 2011.
[31] P.K. Ghosh and S. Narayanan. A generalized smoothness criterion for acoustic-to-articulatory inversion. The Journal of the Acoustical Society of America, 128:2162, 2010.
[32] B. Gick, I. Wilson, K. Koch, and C. Cook. Language-specific articulatory settings: Evidence from inter-utterance rest position. Phonetica, 61:220-233, 2004.
[33] V.L. Gracco. Characteristics of speech as a motor control system. Cerebral Control of Speech and Limb Movements, pages 3-28, 1990.
[34] G. Guerra-Filho and Y. Aloimonos. A language for human action. Computer, 40(5):42-51, 2007.
[35] Asela Gunawardana, Milind Mahajan, Alex Acero, and John C. Platt. Hidden conditional random fields for phone classification. In INTERSPEECH, pages 1117-1120, 2005.
[36] H. Haken, J.A.S. Kelso, and H. Bunz. A theoretical model of phase transitions in human hand movements. Biological Cybernetics, 51(5):347-356, 1985.
[37] C.B. Hart and S.F. Giszter. A neural basis for motor primitives in the spinal cord. The Journal of Neuroscience, 30(4):1322-1336, 2010.
[38] G. Hickok, J. Houde, and F. Rong. Sensorimotor integration in speech processing: computational basis and neural organization. Neuron, 69(3):407-422, 2011.
[39] B. Honikman. Articulatory settings. In D. Abercrombie, D.B. Fry, P.A.D. MacCarthy, N.C. Scott and J.L.M. Trim (eds.), In Honour of Daniel Jones, pages 73-84, 1964.
[40] P.O. Hoyer. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457-1469, 2004.
[41] T. Hromádka, M.R. DeWeese, and A.M. Zador. Sparse representation of sounds in the unanesthetized auditory cortex. PLoS Biol, 6(1):e16, 2008.
[42] K. Iskarous, L. Goldstein, D.H. Whalen, M.K. Tiede, and P.E. Rubin. CASY: The Haskins configurable articulatory synthesizer. In International Congress of Phonetic Sciences, Barcelona, Spain, pages 185-188, 2003.
[43] Khalil Iskarous. Patterns of tongue movement. Journal of Phonetics, 33(4):363-381, 2005.
[44] Khalil Iskarous, Christine H. Shadle, and Michael I. Proctor. Articulatory-acoustic kinematics: The production of American English /s/. The Journal of the Acoustical Society of America, 129:944, 2011.
[45] P.J.B. Jackson and V.D. Singampalli. Statistical identification of articulation constraints in the production of speech. Speech Communication, 51(8):695-710, 2009.
[46] Roman Jakobson, Gunnar Fant, and Morris Halle. Preliminaries to speech analysis: The distinctive features and their correlates. MIT Press, Cambridge, MA (58 pages), 1951.
[47] Martin Joos. Acoustic phonetics. Language, 24(2):5-136, 1948.
[48] Peter Karsmakers, Kristiaan Pelckmans, Johan A.K. Suykens, and Hugo Van hamme. Fixed-size kernel logistic regression for phoneme classification. In INTERSPEECH, pages 78-81, 2007.
[49] A. Katsamanis, M. Black, P. Georgiou, L. Goldstein, and S. Narayanan. SailAlign: Robust long speech-text alignment. In Workshop on New Tools and Methods for VLSPR, 2011.
[50] J.A.S. Kelso. Synergies: atoms of brain and behavior. Progress in Motor Control, pages 83-91, 2009.
[51] T. Kim, G. Shakhnarovich, and R. Urtasun. Sparse coding for learning interpretable spatio-temporal primitives. Advances in Neural Information Processing Systems, 22, 2010.
[52] S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. The Journal of the Acoustical Society of America, 121:723-742, 2007.
[53] G. Kochanski, C. Shih, and H. Jing. Quantitative measurement of prosodic strength in Mandarin. Speech Communication, 41(4):625-645, 2003.
[54] Deguang Kong, Chris Ding, and Heng Huang. Robust nonnegative matrix factorization using L21-norm. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 673-682. ACM, 2011.
[55] E.J. Kutik, W.E. Cooper, and S. Boyce. Declination of fundamental frequency in speakers' production of parenthetical and main clauses. The Journal of the Acoustical Society of America, 73:1731, 1983.
[56] A. Lammert, L. Goldstein, S. Narayanan, and K. Iskarous. Statistical methods for estimation of direct and differential kinematics of the vocal tract. Journal of Speech Communication, in press.
[57] A. Lammert, M. Proctor, and S. Narayanan. Morphological variation in the adult vocal tract: A study using rtMRI. Proc. 9th ISSP, 2011.
[58] J. Laver. The concept of articulatory settings: a historical survey. Historiographia Linguistica, 5(1-2):1-14, 1978.
[59] J. Laver. The phonetic description of voice quality. Cambridge University Press, Cambridge, England and New York, US (186 pages), 1980.
[60] D.D. Lee and H.S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 2001.
[61] I. Lehiste. Suprasegmentals. MIT Press, 1970.
[62] Sue E. Leurgans, Rana A. Moyeed, and Bernard W. Silverman. Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society, Series B (Methodological), pages 725-740, 1993.
[63] Weiwei Li and Emanuel Todorov. Iterative linear-quadratic regulator design for nonlinear biological movement systems. In Proceedings of the First International Conference on Informatics in Control, Automation, and Robotics, pages 222-229, 2004.
[64] A.M. Liberman and I.G. Mattingly. The motor theory of speech perception revised. Cognition, 21(1):1-36, 1985.
[65] B.E.F. Lindblom and J.E.F. Sundberg. Acoustical consequences of lip, tongue, jaw, and larynx movement. The Journal of the Acoustical Society of America, 50:1166, 1971.
[66] A. Löfqvist and B. Lindblom. Speech motor control. Current Opinion in Neurobiology, 4(6):823-826, 1994.
[67] Carla Lopes and Fernando Perdigão. Phone recognition on the TIMIT database. Speech Technologies/Book, 1:285-302, 2011.
[68] S. Ma and A.G. Feldman. Two functionally different synergies during arm reaching movements involving the trunk. Journal of Neurophysiology, 73(5):2120-2122, 1995.
[69] S. Maeda. Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model. Speech Production and Speech Modelling, pages 131-149, 1990.
[70] H.S. Magen, A.M. Kang, M.K. Tiede, and D.H. Whalen. Posterior pharyngeal wall position in the production of speech. Journal of Speech, Language, and Hearing Research, 46(1):241, 2003.
[71] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19-60, 2010.
[72] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561-580, 1975.
[73] S.G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. Signal Processing, IEEE Transactions on, 41(12):3397-3415, 1993.
[74] D. Marr. Vision: A computational investigation into the human representation and processing of visual information. W.H. Freeman and Company, San Francisco, 1982.
[75] E. McDermott and A. Nakamura. Production-oriented models for speech recognition. IEICE Transactions on Information and Systems, 89(3):1006-1014, 2006.
[76] B.W. Mel. Computational neuroscience: Think positive to find parts. Nature, 401(6755):759-760, 1999.
[77] I. Mennen, J.M. Scobbie, E. de Leeuw, S. Schaeffler, and F. Schaeffler. Measuring language-specific phonetic settings. Second Language Research, 26(1):13-41, 2010.
[78] P. Mermelstein. Articulatory model for the study of speech production. Journal of the Acoustical Society of America, 53(4):1070-1082, 1973.
[79] Vikramjit Mitra, Hosung Nam, Carol Espy-Wilson, Elliot Saltzman, and Louis Goldstein. Recognizing articulatory gestures from speech for robust speech recognition. The Journal of the Acoustical Society of America, 131:2270, 2012.
[80] Vikramjit Mitra, Hosung Nam, Carol Y. Espy-Wilson, Elliot Saltzman, and Louis Goldstein. Articulatory information for noise robust speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 19(7):1913-1924, 2011.
[81] Djordje Mitrovic, Stefan Klanke, and Sethu Vijayakumar. Adaptive optimal feedback control with learned internal dynamics models. In From Motor Learning to Interaction Learning in Robots, pages 65-84. Springer, 2010.
[82] F.A. Mussa-Ivaldi, N. Gantchev, and G.N. Gantchev. Motor primitives, force-fields and the equilibrium point theory. From Basic Motor Control to Functional Recovery, Academic Publishing House "Prof. M. Drinov", Sofia, Bulgaria, pages 392-398, 1999.
[83] F.A. Mussa-Ivaldi and S.A. Solla. Neural primitives for motion control. Oceanic Engineering, IEEE Journal of, 29(3):640-650, 2004.
[84] H. Nam. Articulatory modeling of consonant release gesture. In International Congress on Phonetic Sciences XVI, pages 625-628, 2007.
[85] H. Nam, L. Goldstein, C. Browman, P. Rubin, M. Proctor, and E. Saltzman. TADA (Task Dynamics Application) manual. Manual, Haskins Laboratories, 2006.
[86] Hosung Nam, Vikramjit Mitra, Mark Tiede, Mark Hasegawa-Johnson, Carol Espy-Wilson, Elliot Saltzman, and Louis Goldstein. A procedure for estimating gestural scores from speech acoustics. The Journal of the Acoustical Society of America, 132(6):3980-3989, 2012.
[87] S. Narayanan, K. Nayak, S. Lee, A. Sethy, and D. Byrd. An approach to real-time magnetic resonance imaging for speech production. The Journal of the Acoustical Society of America, 115:1771, 2004.
[88] Shrikanth Narayanan, Erik Bresch, Prasanta Ghosh, Louis Goldstein, Athanasios Katsamanis, Yoon Kim, Adam Lammert, Michael Proctor, Vikram Ramanarayanan, and Yinghua Zhu. A multimodal real-time MRI articulatory corpus for speech research. In Proc. 12th Conf. Intl. Speech Communication Assoc. (Interspeech 2011), Florence, Italy, 2011.
[89] P.D. O'Grady and B.A. Pearlmutter. Discovering speech phones using convolutive non-negative matrix factorisation with a sparseness constraint. Neurocomputing, 72(1-3):88-101, 2008.
[90] S. Öhman. Peripheral motor commands in labial articulation. Speech Transmission Laboratory - Quarterly Progress and Status Report (Royal Institute of Technology (KTH), Stockholm), 4/1967, pages 30-63, 1967.
[91] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.
[92] B.A. Olshausen and D.J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481-487, 2004.
[93] D. O'Shaughnessy. Recognition of hesitations in spontaneous speech. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 1, pages 521-524. IEEE, 1992.
[94] D.J. Ostry, P.L. Gribble, and V.L. Gracco. Coarticulation of jaw movements in speech production: is context sensitivity in speech kinematics centrally planned? The Journal of Neuroscience, 16(4):1570-1579, 1996.
[95] B. Pellom and K. Hacioglu. SONIC: The University of Colorado continuous speech recognizer. Technical Report #TR-CSLR-2001-01, University of Colorado, Boulder, Colorado, 2001.
[96] Joseph Perkell, Melanie Matthies, Harlan Lane, Frank Guenther, Reiner Wilhelms-Tricarico, Jane Wozniak, and Peter Guiod. Speech motor control: Acoustic goals, saturation effects, auditory feedback and internal models. Speech Communication, 22(2):227-250, 1997.
[97] J.S. Perkell. Physiology of speech production: Results and implications of a quantitative cineradiographic study. Research Monograph No. 53 (120 pages). MIT Press: Cambridge, MA, 1969.
[98] V. Ramanarayanan, P. Ghosh, A. Lammert, and S. Narayanan. Exploiting speech production information for automatic speech and speaker modeling and recognition - possibilities and new opportunities. In Fourth Annual Conference of the Asia-Pacific Signal and Information Processing Association, 2012.
[99] V. Ramanarayanan, L. Goldstein, and S. Narayanan. Spatio-temporal articulatory movement primitives during speech production - extraction, interpretation and validation. The Journal of the Acoustical Society of America, 134(2):1378-1391, 2013.
[100] V. Ramanarayanan, A. Katsamanis, and S. Narayanan. Automatic data-driven learning of articulatory primitives from real-time MRI data using convolutive NMF with sparseness constraints. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
[101] Vikram Ramanarayanan, Erik Bresch, Dani Byrd, Louis Goldstein, and Shrikanth S. Narayanan. Analysis of pausing behavior in spontaneous speech using real-time magnetic resonance imaging of articulation. The Journal of the Acoustical Society of America, 126(5):EL160-EL165, 2009.
[102] Vikram Ramanarayanan, Louis Goldstein, Dani Byrd, and Shrikanth S. Narayanan. An investigation of articulatory setting using real-time magnetic resonance imaging. Journal of the Acoustical Society of America, pages 510-519, 2013.
[103] K. Richmond. Estimating articulatory parameters from the acoustic speech signal. PhD thesis, University of Edinburgh, 2002.
[104] S.R. Rochester. The significance of pauses in spontaneous speech. Journal of Psycholinguistic Research, 2(1):51-81, 1973.
[105] R.C. Rose, J. Schroeter, and M.M. Sondhi. The potential role of speech production models in automatic speech recognition. The Journal of the Acoustical Society of America, 99:1699, 1996.
[106] D.A. Rosenbaum, R.J. Meulenbroek, J. Vaughan, and C. Jansen. Posture-based motion planning: Applications to grasping. Psychological Review, 108(4):709-734, 2001.
[107] P. Rubin, E. Saltzman, L. Goldstein, R. McGowan, M. Tiede, and C. Browman. CASY and extensions to the task-dynamic model. In 1st ETRW on Speech Production Modeling: From Control Strategies to Acoustics; 4th Speech Production Seminar: Models and Data, 1996.
[108] E. Saltzman, H. Nam, J. Krivokapic, and L. Goldstein. A task-dynamic toolkit for modeling the effects of prosodic structure on articulation. In Proceedings of the 4th International Conference on Speech Prosody (Speech Prosody 2008), Campinas, Brazil, 2008.
[109] E.L. Saltzman and K.G. Munhall. A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4):333-382, 1989.
[110] E.L. Saltzman and K.G. Munhall. A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4):333-382, 1989.
[111] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
[112] B. Siciliano and O. Khatib, editors. Springer Handbook of Robotics. Springer, 2008.
[113] Jorge Silva and Shrikanth S. Narayanan. Discriminative wavelet packet filter bank selection for pattern recognition. Signal Processing, IEEE Transactions on, 57(5):1796-1810, 2009.
[114] N. Slonim and N. Tishby. Agglomerative information bottleneck. Advances in Neural Information Processing Systems, 12:617-623, 1999.
[115] P. Smaragdis. Convolutive speech bases and their application to supervised speech separation. Audio, Speech, and Language Processing, IEEE Transactions on, 15(1):1-12, 2007.
[116] Evan C. Smith and Michael S. Lewicki. Efficient auditory coding. Nature, 439(7079):978-982, 2006.
[117] R. Sosnik, B. Hauptmann, A. Karni, and T. Flash. When practice leads to co-articulation: the evolution of geometrically defined movement primitives. Experimental Brain Research, 156(4):422-438, 2004.
[118] B.H. Story and I.R. Titze. A preliminary study of voice quality transformation based on modifications to the neutral vocal tract area function. Journal of Phonetics, 30(3):485-509, 2002.
[119] B.H. Story, I.R. Titze, and E.A. Hoffman. The relationship of vocal tract shape to three voice qualities. The Journal of the Acoustical Society of America, 109:1651-1667, 2001.
[120] Gilbert Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, Wellesley, MA (571 pages), 2003.
[121] H. Sweet. A Primer of Phonetics. Clarendon Press, United Kingdom (144 pages), 1890.
[122] F.J. Theis, K. Stadlthanner, and T. Tanaka. First results on uniqueness of sparse non-negative matrix factorization. In Proceedings of the 13th European Signal Processing Conference (EUSIPCO 2005). Citeseer, 2005.
[123] M.K. Tiede, S. Masaki, and E. Vatikiotis-Bateson. Contrasts in speech articulation observed in sitting and supine conditions. Proceedings of the 5th Seminar on Speech Production, Kloster Seeon, Bavaria, pages 25-28, 2000.
[124] Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference, pages 300-306. IEEE, 2005.
[125] H. Traunmüller. Conventional, biological and environmental factors in speech communication: A modulation theory. Phonetica, 51(1-3):170-183, 1994.
[126] M.C. Tresch, V.C.K. Cheung, and A. d'Avella. Matrix factorization algorithms for the identification of muscle synergies: evaluation on simulated and experimental data sets. Journal of Neurophysiology, 95(4):2199-2212, 2006.
[127] M.T. Turvey. Coordination. American Psychologist, 45(8):938, 1990.
[128] H. Van hamme. HAC-models: a novel approach to continuous speech recognition. In Interspeech, 2008.
[129] M. Van Segbroeck and H. Van hamme. Unsupervised learning of time-frequency patches as a noise-robust representation of speech. Speech Communication, 51(11):1124-1138, 2009.
[130] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
[131] W. Wang, A. Cichocki, and J.A. Chambers. A multiplicative algorithm for convolutive non-negative matrix factorization based on squared Euclidean distance. Signal Processing, IEEE Transactions on, 57(7):2858-2864, 2009.
[132] J. Westbury, P. Milenkovic, G. Weismer, and R. Kent. X-ray microbeam speech production database. The Journal of the Acoustical Society of America, 88:S56, 1990.
[133] D.H. Whalen, K. Iskarous, M.K. Tiede, D.J. Ostry, H. Lehnert-LeHouillier, E. Vatikiotis-Bateson, and D.S. Hailey. The Haskins optically corrected ultrasound system (HOCUS). Journal of Speech, Language, and Hearing Research, 48(3):543, 2005.
[134] I.L. Wilson and B. Gick. Articulatory settings of French and English monolinguals and bilinguals. The Journal of the Acoustical Society of America, 120(5):3295-3296, 2006.
[135] A.A. Wrench. A multi-channel/multi-speaker articulatory database for continuous speech recognition research. In Workshop on Phonetics and Phonology in ASR, 2000.
[136] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X.A. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al. The HTK Book (for HTK version 3.4), 2006.
[137] B. Zellner. Pauses and the temporal structure of speech. In E. Keller (Ed.), Fundamentals of Speech Synthesis and Speech Recognition, pages 41-62. Chichester: John Wiley, 1994.
[138] F. Zhou, F. Torre, and J.K. Hodgins. Aligned cluster analysis for temporal segmentation of human motion. In Automatic Face & Gesture Recognition, 2008 (FG'08), 8th IEEE International Conference on, pages 1-7. IEEE, 2008.