HIERARCHICAL METHODS IN AUTOMATIC PRONUNCIATION EVALUATION

by

Joseph Tepperman

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2009

Copyright 2009 Joseph Tepperman

Acknowledgements

I have to start by thanking Shri Narayanan for being the best advisor I or anyone ever could have asked for. His sharp scientific mind is matched only by his constant optimism, excitement, and generosity. Sungbok Lee met with me every week for something like six years. If not for his valuable ideas and astute criticism, this whole thesis thing might never have happened. The advice of Louis Goldstein has been instrumental in helping me flesh out many of my latest research directions - I am beyond thankful for it. In their multidisciplinary wisdom, Abeer Alwan and Patti Price helped me keep sight of who I'm really doing this for and why. And how many of the ideas here were born while working under Bryan Pellom and Kadri Hacioglu? Too many to list them all. Much gratitude is due to Jerry Mendel, David Traum, and Keith Jenkins for serving on my committee(s) and pointing out the visible seams, the Emperor's invisible clothes, et cet.

Matt Black and Abe Kazemzadeh have been incomparable collaborators and sounding boards for years now, bless their patient hearts. Much of this thesis was reviewed in its early stages by the always cheerful and insightful Carlos Busso. I definitely can't forget to thank Erik Bresch and Yoon-Chul Kim for the many weekend hours toiling over a hot MRI scanner, no joke. I would not have gotten into this without Abhinav Sethy's inspiration and mentoring seven years ago. "Supportive" falls way short of being an adequate word to describe my mom's care for me over the past 6 years, 10 years, 27 years... And if I am sane at all then it is only because of Pamela Douglas.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Significance of this Work
  1.2 Brief Summary of the Approach
  1.3 Contributions
  1.4 Known Limitations and Open Questions
  1.5 Thesis Outline

Chapter 2: A Review of Speech Hierarchies
  2.1 Articulatory Production
  2.2 Pronunciation Perception
  2.3 Prosodic Structure
  2.4 Interactions Among These Theories
  2.5 Conclusion: The Insight of Hierarchical Representations

Chapter 3: Detecting Phoneme-level Pronunciation Errors
  3.1 Introduction
  3.2 Corpora
  3.3 Articulatory Representations
    3.3.1 Previous Work
    3.3.2 Choice of Representation and Modeling
  3.4 Model Training
  3.5 Derivation of Features
    3.5.1 Traditional HMM Confidence Scores
    3.5.2 Articulatory Confidence Scores
    3.5.3 Articulatory Recognition
  3.6 Experiments and Results
    3.6.1 Experiments with Nonnative Speech
    3.6.2 Native Speaker Experiments
  3.7 Discussion
  3.8 Conclusion
  3.9 Future Directions
  3.10 Epilogue: Articulatory Evidence of Phonological Transfer
    3.10.1 Introduction
    3.10.2 Corpus
    3.10.3 Experimental Methods
    3.10.4 Results and Discussion
    3.10.5 Conclusion

Chapter 4: Assessing Word-level Pronunciation and Reading Skills
  4.1 Introduction
  4.2 Background
    4.2.1 The TBALL Project and its Context
    4.2.2 Bayesian Networks
  4.3 Perceptual Evaluations
  4.4 Feature Estimation
    4.4.1 Hidden Variables
    4.4.2 Evidence
    4.4.3 Underlying Features
  4.5 Network Structure
    4.5.1 Hypothesized Structure
    4.5.2 Structure Training and Refinement
  4.6 Experiments and Results
    4.6.1 Baselines
    4.6.2 Network Structure Optimization
    4.6.3 Feature Comparison
  4.7 Discussion
    4.7.1 Automatic Performance Comparison
    4.7.2 Automatic Structure Refinement
    4.7.3 Comparison of Proposed Features
    4.7.4 Bias and Disagreement Analysis
    4.7.5 Remaining Questions
  4.8 Conclusion
  4.9 Epilogue: A Reformulation of Word-level Scoring by way of Articulatory Phonology
    4.9.1 Introduction
    4.9.2 Articulatory Phonology: Background
    4.9.3 Speech Corpus
    4.9.4 Pronunciation Modeling
    4.9.5 Experiments
    4.9.6 Discussion
    4.9.7 Conclusion

Chapter 5: Phrase-level Intonation Scoring
  5.1 Acoustic Models
    5.1.1 Introduction
    5.1.2 Corpora and Annotation
    5.1.3 Baseline: Decision Tree Score Models
    5.1.4 HMM Intonation Models
    5.1.5 Conclusion
  5.2 Prosodic Structure Models
    5.2.1 Introduction
    5.2.2 Trees in Prosody
    5.2.3 Corpus Preparation
    5.2.4 Regular Tree Grammars
    5.2.5 Training and Experiments
    5.2.6 Discussion
    5.2.7 Conclusion

Chapter 6: Conclusion and Future Directions

Bibliography

Appendix

List of Tables

1.1 Details regarding the three example applications presented in this thesis, including the scales of the acoustic models, the theories used as inspiration, and the corresponding computational framework.

3.1 Articulatory feature space. Note the numerical mapping and gradual physical progression among classes within a given stream, proportional to the integers chosen to represent those classes.

3.2 Relative substitution probabilities for commonly mispronounced phonemes, for the ISLE corpus' native German speakers. The probabilities for a given target may not add up to 1 because all substitutions of less than 0.01 probability have been disregarded.

3.3 Relative substitution probabilities for commonly mispronounced phonemes, for the ISLE corpus' native Italian speakers. The probabilities for a given target may not add up to 1 because all substitutions of less than 0.01 probability have been disregarded.

3.4 Pseudo-articulatory-recognition accuracy, in %.

3.5 Target segments and their substitutions, for the native English speaker test set.

3.6 Results for German speakers, reported as false rejection rate / false acceptance rate (in %). Entries in bold were significantly better than the baseline (P_subs) with p ≤ 0.05 using McNemar's test.

3.7 Results for Italian speakers, reported as false rejection rate / false acceptance rate (in %). Entries in bold were significantly better than the baseline (P_subs) with p ≤ 0.05 using McNemar's test.

3.8 Native British English speaker results, reported as false rejection rate / false acceptance rate (in %). Entries in bold were significantly better than the baseline (P_L) with p ≤ 0.05 using McNemar's test.
3.9 Results of two-tailed t-test comparing the mean difference between any two IN-L2 tokens and the mean difference between any IN-L2 token and any OUT token. Sample means were determined to be equal or unequal (denoted by = and ≠) on the 95% level.

3.10 Results of two-tailed t-test comparing the mean difference between any two IN-L1 tokens and the mean difference between any IN-L1 token and any OUT token. Sample means were determined to be equal or unequal (denoted by = and ≠) on the 95% level.

3.11 Results of one-tailed t-test of the alternative hypothesis that the mean difference between any two OUT tokens is greater than or less than the mean difference between any two IN-L2 tokens. Inequalities are given on the 95% level.

3.12 Results of one-tailed t-test of the alternative hypothesis that the mean difference between any two OUT tokens is greater than or less than the mean difference between any two IN-L1 tokens. Inequalities are given on the 95% level.

4.1 Inter-listener agreement in reading scores, in terms of percent agreement and Kappa for binary item-level scores, and correlation between overall list-level scores.

4.2 Summary of Evidence, Hidden, and Underlying variables used in this chapter.

4.3 Hypothesized parent-child arcs in the Bayesian Network student model. Only for pairs marked with an 'X' is the column variable considered conditionally dependent on the row variable - all others are assumed to be independent.

4.4 Demographic makeup of the test data used in all experiments. Total speakers may not add up to 189 since some of this information was missing for certain students for reasons explained in Section 4.4.3.

4.5 Item-level agreement and list-level overall score correlation between automatic results and human reading scores. For threshold-based item-level classification, this agreement is the result closest to the EER.

4.6 Using the forward selection procedure outlined in Section 4.5.2, these are the total number of times each of the hypothesized network arcs was selected for the final refined network, over 5 crossvalidation training sets.

4.7 Amount of data used in this epilogue.

4.8 The number of acoustic and duration models required for each of the words in the CDT vocabulary.

4.9 Correlation coefficients between automatic and median listener ratings. Entries in bold were significantly better than the baseline with p ≤ 0.05.

5.1 Relative sizes of the training and test sets.

5.2 Correlation and error of scores derived from different proposed modeling methods.

5.3 Sizes of the training, development, and test sets. "Tags" refers to the prosodic symbols in the transcripts. "Trees" means complete four-level prosodic trees.
5.4 PPL results for 4-tier setup, over dev and test sets.

5.5 PPL results for 2-tier setup, over dev and test sets.

5.6 Tag classification error (in %) on the development set. FINB = final boundary tones, INTB = intermediate boundary tones, PACC = pitch accents.

A.1 Expected British English Vowel Articulations

A.2 Expected British English Consonant Articulations

List of Figures

1.1 General pronunciation evaluation flowchart for all applications. Hierarchical theories are used both in the choice of acoustic models for recognition and the structure of the computational framework in the next step.

3.1 British English (BBC newscaster) vowel chart with numerical mappings, after [52]. This illustrates the relationship between the Tongue Frontness and Tongue Height streams.

3.2 A 3D representation of the relation between the Lip Rounding, Tongue Frontness, and Tongue Height streams in cardinal and secondary English vowels, after [52]. Think of this as a rotation of Fig. 3.1, to reveal the Lip Rounding dimension. This also explains the numerical mapping assignments for the Lip Rounding stream.

3.3 An illustration of the three steps involved in automatically deriving an articulatory-level representation from phone-level transcriptions, as explained in Section 3.3.2.

3.4 An example of recognizing 8 HAMM streams and then extracting the 8-dimensional soft and centered articulatory feature vectors from them. This correct realization of the target /w/ will be grouped with others so as to distinguish them from the class of substitution pronunciations.

3.5 DET curves for all nonnative speakers, German and Italian, over the three HMM-based confidence measures.

3.6 German speakers' classification results over various combinations of features, for the complete test set.

3.7 Italian speakers' classification results over various combinations of features, for the complete test set.

3.8 DET curves for the Native British English speaker, over both HMM-based confidence measures.

3.9 Native British English speaker classification results over various combinations of features, for the complete test set.

3.10 Illustration of finding the pixel-by-pixel difference between tokens of /d/ and /D/ (X and Y, respectively), including masked versions of the five organs of interest.

4.1 Graphical illustration of the student model, over 2 test items. Shaded nodes denote hidden variables. The dashed lines are not probabilistic relations, but indicate how the overall score for item t is derived from the combined previous item and overall scores.

4.2 DET curves over a varying threshold, for item-level binary classification on several different scoring methods and network structures.

4.3 Gestural score for the word "then." The sequence of gestural pattern vectors is shown along the bottom, with each vector assigned an arbitrary number.
5.1 A prosodic tree derived from sequential ToBI transcripts and syllable segmentation of the BURNC [71]. The four tiers of the tree represent prosodic units on increasingly large time-scales, each with a unique linguistic meaning. The ASCII representation of the top three tiers would be TOP(L%(L-(H* H*) L-(NULL NULL H* H* H*))). The text is not part of the tree structure but is included here for illustration. Note that the pitch accents' bounds match the foot rather than the word, following [41].

Abstract

Technology that can automatically categorize pronunciations and estimate scores of pronunciation quality has many potential applications, most notably for second-language learners interested in practicing their pronunciation along with a machine tutor, or for automating the standard assessments elementary school teachers use to measure a child's emerging reading skills. The many sources of variability in speech and the subjective perception of pronunciation make this a complex problem. Linguistic hierarchies - in speech production, perception, and prosodic structure - help to conceive of the variability as existing on multiple simultaneous scales of representation, and offer an explanatory order of precedence to those scales. These theories are beginning to gain widespread attention and use in traditional speech recognition, but experimenters in pronunciation evaluation have been slow to embrace them. This work proposes using theories of hierarchical structure in speech to inform a chosen computational framework and scale of analysis when performing automatic pronunciation evaluation, on the assumption that they will offer improvements over non-hierarchical methods and can be used to rate pronunciation with performance comparable to that of inter-human agreement.

Three example applications illustrate novel hierarchical approaches over three different standard scales of analysis - the phoneme, the word, and the phrase. Each one makes use of hierarchical knowledge in at least two ways. First, the acoustic models for evaluation are defined on time-scales below the one of interest (similar to the common practice of using strings of phoneme models to represent words in speech recognition), based on a nested conception of parallel linguistic scales. Then recognition results obtained from these models are aggregated in an ordered structure appropriate to the task, using a computational framework best suited to instantiate the hierarchy. Results show statistically significant improvements over baseline methods that neither use these novel time-scales for modeling variability nor make use of a structured hierarchy in combining the cues derived from those models.

Chapter 1: Introduction

1.1 Significance of this Work

Pronunciation evaluation is the discrimination of categories in pronunciation and the modeling of qualitative judgments of pronunciation, done according to the variability expected in speech. Where traditional automatic speech recognition (ASR) is concerned with identifying patterns of acoustic similarity common to all spoken realizations of a linguistic unit in spite of the variability (caused by speaker age, gender, native language, and many other sources), the automatic pronunciation evaluation task aims to detect the often subtle contrast between one realization and another, and to quantify this difference.
In its most basic form, pronunciation evaluation assumes the role of speech verification - a binary decision of whether or not a given spoken realization matches some target model of pronunciation, useful in applications like detecting pronunciation errors for foreign language students, but also in speaker identification and keyword spotting. A natural extension of this verification, pronunciation assessment (or scoring), which estimates the degree to which the speech matches the target model, is also used in educational settings for language learners who want to practice their pronunciation with a machine tutor, or for assessing a child's emerging reading skills.

Reading assessment is somewhat different from pronunciation assessment because it requires a model of a listener's perception of a speaker's mental state (i.e. the degree to which they have learned to read certain words), which is observable only indirectly through the sounds they make when reading aloud, and those sounds can potentially demonstrate adequate reading skills through nonstandard pronunciation. In a sense, each speaker determines what reading pronunciation is "correct" for their voice, accent, and style of speaking. This built-in subjectivity is especially present when working with bilingual children - one would not want to estimate a lower reading score just because a child's pronunciation does not match some arbitrary target model for standard native speech. Pronunciation is only one consideration when assessing reading skills, and so one must account for as many available cues to perceived ability as possible, including factoring in prior knowledge about the speaker, as a careful teacher would do. Listeners, too, are not necessarily conscious of how their decision is made based on these sources, and may disagree on which of the available factors are relevant in their decision [51].

It is apparent that all pronunciation evaluation tasks, in whatever variation, are inherently subjective, due to the large variability in both production and perception of pronunciation. Automatic results can only be measured in terms of their agreement or correlation with human evaluations, and the upper bound on performance is always determined by the level of agreement that is found among human listeners. This necessitates building variation into the structure of the models - the use of perceptual models of cognitive pronunciation processing to account for the uncertainty in this decision-making process, and knowledge of speech production and information structure to account for the many sources of variability seen. A hierarchical structure affords insight into grouping the exemplars of variability, and giving precedence to certain types of cues, as necessitated by the structured order theorized in both the speech perception and production literature.

The Oxford English Dictionary defines a hierarchy as "a ranking system ordered according to status or authority." In all discussion related to this term, it is assumed that speech can be thought of as several parallel sequences of units on as many parallel (but interacting) time-scales, varying according to the size of the unit, with smaller units nested within and overlapping upon larger ones, and with dependencies among them. In this work the word hierarchical will be used broadly to refer to all analysis of pronunciation that makes use of models on time-scales in addition to the one in question, and will also be used as a type of structured ranking either between or within those time-scales.
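As a concrete, purely illustrative sketch of this nested, parallel-tier view of speech, the following Python fragment represents one utterance as time-aligned tiers of different-sized units. It is my own illustration, not a structure used in this thesis; all tier names, labels, and times in it are hypothetical.

```python
# Illustrative sketch (not from the thesis): one utterance represented as
# parallel, time-aligned tiers of different-sized units, with smaller units
# nested within larger ones. All names and values are hypothetical.

from dataclasses import dataclass

@dataclass
class Unit:
    label: str     # e.g. a phoneme symbol, word, or boundary tone
    start: float   # start time in seconds
    end: float     # end time in seconds

# Parallel tiers for the phrase "the cat", from coarse to fine time-scales.
utterance = {
    "phrase":  [Unit("L-L%", 0.00, 0.62)],
    "word":    [Unit("the", 0.00, 0.18), Unit("cat", 0.18, 0.62)],
    "phoneme": [Unit("DH", 0.00, 0.06), Unit("AH", 0.06, 0.18),
                Unit("K", 0.18, 0.30), Unit("AE", 0.30, 0.48), Unit("T", 0.48, 0.62)],
    # One of several articulatory streams overlapping the phoneme tier.
    "voicing": [Unit("voiced", 0.00, 0.18), Unit("unvoiced", 0.18, 0.30),
                Unit("voiced", 0.30, 0.48), Unit("unvoiced", 0.48, 0.62)],
}

def units_within(tier, parent):
    """Return the units on a finer tier nested inside a coarser unit."""
    return [u for u in utterance[tier] if u.start >= parent.start and u.end <= parent.end]

# Phonemes nested within the word "cat":
print([u.label for u in units_within("phoneme", utterance["word"][1])])  # ['K', 'AE', 'T']
```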
The three levels of analysis of interest here are the phoneme, the word, and the (prosodic) phrase, though the analysis also makes use of information on other scales (e.g. the articulatory, the prosodic foot). Pronunciation evaluation within each one relies on models for either that scale's units or, if hierarchical, the smaller or larger units overlapping upon that scale, or both. A theoretical hierarchy can manifest itself in technological models in a number of ways. Analyses on smaller units can be aggregated into analyses of the larger ones they compose - a bottom-up approach to pronunciation evaluation that makes use of the hierarchy of nested unit sizes. By conditioning models on context outside of their tier of analysis, a hierarchy can be used to capture the sources of variability in acoustic modeling, in fact limiting the confusion among acoustic models. Likewise, a hierarchy of perceptual importance can inform a pronunciation decision by weighting each time-scale's analysis according to its relevance in listener perception.

Traditional speech recognition, using phoneme- or sometimes syllable-level models to perform recognition and analysis on the word level, is the most common example of a hierarchical structure regularly used today [26]. In recent years, improvements in general ASR have come from the use of speech production knowledge and theories of a hierarchical organization of linguistic units. Articulatory features and models have been shown to improve recognition accuracy when combined with traditional abstract phonemic (or segmental) units, and new model topologies such as Dynamic Bayesian Networks (DBNs) have allowed for the dependencies among different scales of linguistic units - articulatory, segmental, and suprasegmental - to be incorporated as model parameters [48]. Further improvements have come from training models that are dependent on the multi-level encoding of information in speech through prosodic structure [38].

These advances have been applied extensively in experiments in speech recognition, though in practice real-world ASR systems still rely heavily on traditional models built from abstract segmental units that are not optimal for capturing all the variability inherent in pronunciation and all the sources of variability that can be conveyed on finer or coarser time-scales. Pronunciation evaluation experiments have been perhaps even slower to embrace modeling on multiple scales. One reason is the focus on read speech in computer-based second-language pedagogy. When the target text to be recognized is known a priori (as it typically is in many real-world applications), scores derived from alignment of student speech with phoneme-level models can achieve a high correlation with human perception [5, 70]. The subjective nature of the human scores creates an upper bound for performance that phoneme models can approach given such a constrained search space, and there is a common belief that the addition of prosodic or articulatory features or models can offer no improvement, though this may be due to sub-optimal feature sets or modeling techniques [85]. Another explanation for the reluctance to use articulatory or prosodic knowledge for pronunciation evaluation is the emphasis on correct segment-level pronunciation in language instruction in general.
In a language like 4 English, which does not usually rely on suprasegmental contrasts to determine lexical identity, and in which intonation errors are not ranked high in importance in pronun- ciation perceptual studies, it seems reasonable by some sources to exclude teaching or evaluating prosodic structure for other than the most advanced students [20, 94]. In this work I intend to show that automatic pronunciation evaluation can benefit bothfromtheuseofinformationoutsidethetraditionalsegmentlevel,andfromastruc- ture to that information that accounts for the variability seen on the multiple levels. The goal of this thesis is to show a human-like performance in standard pronunciation evaluationtaskslikedetectingerrorsorgivingsubjectivescores-onthephoneme,word, andphraselevels-throughtheuseofhierarchicalknowledgederivedfromlinguisticthe- ories of perception, articulatory production, and the prosodic structure of information. From the improvements seen in pronunciation evaluation, I will draw conclusions about the explanatory power of the theories on which the models are based. 1.2 Brief Summary of the Approach This thesis presents work toward automatic analysis of pronunciation on multiple time- scales. One overall theme of the approach is that each scale’s analysis begins with traditional speech recognition of some sort, though usually with a highly-constrained grammar that befits pronunciation evaluation rather than recognition of continuous speech. Pronunciation evaluation on each scale relies on models for the smaller units nested within that scale - the first instance of hierarchical theory used in this approach. The recognition results at that point are in themselves unimportant, but are turned into pronunciation evaluation scores or decisions through a second step. The second step implements the hierarchical representation upon the recognition results through an appropriate computational framework for modeling a hierarchy, and the two-step 5 Recognition Computational Framework Acoustic Models Grammars Production Perception Prosodic Structure Speech Waveform Feature Extraction Decision or Score Features Recog. Results Figure1.1: Generalpronunciationevaluationflowchartforallapplications. Hierarchical theories are used both in the choice of acoustic models for recognition and the structure of the computational framework in the next step. procedure itself can be likened to a type of hierarchical model. Depending on the time- scale,thesecondstepgenerallytakestheformofaclassificationdecision,basedonusing the first step’s results in an ordered structure. A flowchart that illustrates these steps is shown in Figure 1.1. Starting on the level of the phoneme - the abstractions of the fundamental speech sounds which, if changed, can change the meaning of a word they compose - I propose to improve detection of second-language learner errors in phoneme-level pronunciation by incorporating models of the underlying articulations of those phonemes - relying on a hierarchy of variations in the articulatory domain as indicative of perceived and produced variations on the phoneme level [90]. Normally this would be done by set- ting a threshold on a traditional confidence score (an estimate of the target phoneme’s posterior probability given the speech) obtained through alignment of the speech with phoneme-level models. 
With articulatory models, structured and decoded similarly to the phoneme models, one can calculate an entire vector of articulatory confidence scores (one for each organ of articulation), for use in disambiguating very subtle differences in pronunciation among segments. The expected sequence of articulatory gestures can be extrapolated from the phoneme representation by way of tables of expected articulatory configurations and phonological rules for coarticulation, without requiring true articulatory measurements. An appropriate phonological hierarchy for each target phoneme's mispronunciation is generated uniquely using automatic Decision Tree classifiers trained on the combined articulatory and phoneme-level confidence scores. It is not any new observation but only the expanded formulation of the phoneme-level model into the articulatory domain that accounts for an improvement in pronunciation error detection. Some supplemental analysis of articulatory evidence seen in real-time MRI videos of the vocal tract confirms that articulatory models should be useful for disambiguating close phoneme-level contrasts, even in nonnative speech.

Similarly, word-level pronunciation variants are modeled in terms of variable sequences of the phonemes that can compose a word, as is ordinarily done in pronunciation modeling for speech recognition. This word-level assessment, applied to the task of automatically assessing a child's ability to read isolated words from a list, makes use of a model for a teacher's perception of acceptable or unacceptable reading skills, as demonstrated by a child's production of the elicited words [86]. The closed set of expected phoneme-level pronunciations for a given word is divided into several categories that would be known to teachers a priori - expected canonical (dictionary-based) pronunciations, expected pronunciations generated by common reading mistakes, and L1-accented pronunciations (for children with a native language other than English). A recognition network models the teacher's acoustic comparison among these variants in pronunciation, similar to certain cognitive models of lexical access. Since the pronunciation variants are often too close to be reliably discriminated even with state-of-the-art techniques, and since it is often not clear (and teachers cannot agree) exactly how these cues all contribute and interact in making a word-level pronunciation decision, a Bayesian network combines these various recognition results (in a hierarchy consistent with the recognition architecture) so that an overall word-level binary decision of reading ability can be made. Here the hierarchical model is a two-step perceptual one, in which pronunciation recognition cues are retrieved and then aggregated to form an overall impression. A similar study of word-level scoring for adult nonnative speakers demonstrates a reformulation of the articulatory modeling strategy described above by way of the theory of Articulatory Phonology.

A robust phrase-level pronunciation score must make use of suprasegmental cues from sub-phrase intonation - one manifestation of the prosodic information structure of language, correlated perceptually with temporal variations in fundamental frequency.
Because of the complex nature of prosody in English (both in its structure and its realization through intonation), and the challenge of potentially scoring spontaneous speech, I propose a text-free model for phrase accents and boundary tones, using HMMs trained on processed continuous f0 estimates and energy contours from native speech that has been prosodically transcribed [87]. These intonation events can then be decoded from a nonnative utterance's prosodic cues using a pre-defined grammar for tone sequences. Each recognized tone's posterior probability given the observed features is then estimated, and all posterior probabilities over a phrase are combined to give a score of intonation "nativeness" over the whole phrase. Final experiments with tree grammars of prosodic structure [89] also suggest that a hierarchical model for tone labels has better predictive power than an n-gram sequential model.

1.3 Contributions

This thesis is concerned with both the computational and theoretical (linguistically speaking) aspects of automatic pronunciation evaluation. The unifying theme is that of the hierarchy - the idea of speech as a graded series - and so theoretical hierarchies and computational frameworks for implementing those hierarchies are of interest here. The conception of speech as a hierarchy of parallel time-scales of different-sized linguistic units is implicit in this analysis of variability on multiple time-scales and the use of acoustic models below each of those chosen scales. This work presents an example application for each of the three chosen levels of analysis - the phoneme, the word, and the phrase - making use of theoretical and computational hierarchies appropriate to the given application. Linguistic theories of hierarchies inform the approach to solving each particular application, and the best computational framework is determined based on the suggestion of the theory and the demands of the task. Table 1.1 outlines these theories and frameworks as they are used in each application.

The detection of phoneme-level pronunciation errors is strictly a verification task: the observed phoneme is either accepted or rejected as belonging to the target phoneme model. This work introduces the idea of performing pronunciation evaluation on the phoneme level by factoring the target model into several asynchronous streams of articulatory models, and making a verification decision based on information from the combination of articulatory and segment-level representations. The theoretical basis for this is the phonological link between the segmental and articulatory domains, a production-based hierarchy in which small changes in articulatory timing or constriction can result in common phoneme-level substitutions by nonnative speakers [13]. Decision Trees [100] are the framework of choice for modeling this type of error, since they are a way of hierarchically extending the traditional single verification threshold to the case of multiple confidence scores. With a set of scores derived from models of positions of articulatory organs, decision trees can be made to resemble the distinctive feature trees describing phonological contrast [36] that are organized into a hierarchy by the geometry of the vocal tract.

Literacy assessment on the word level is very much a subjective task, without a single target model of pronunciation that applies to all speakers.
A listener's prior knowledge of expected pronunciation variants and of the speaker's background can go a long way toward explaining their perception of the speaker's reading skills, but how this knowledge is combined and weighted in the decision is not usually clear. One novel contribution of work in this application is the factoring of word-level pronunciation into sequences of phonemes in several categories relevant to reading perception, and using cues to the presence of these categories (based on automatic recognition among them) as factors that influence perceived reading skills. Perceptual theories of lexical access [58] offer inspiration for the formulation of a hierarchy among these cues in terms of their competition and dependency in the listener's cognitive decision process. A Bayesian Network classifier [28] is capable of modeling the complex cause-and-effect relationships among these cues, and of nonlinearly combining them to reflect their interactions with one another and with the overall perception of reading ability.

The estimation of a phrase-level pronunciation score for nonnative speech should rely on cues and models from many time-scales below the phrase level. The work toward that end presented here makes use of sub-phrase models of text-free intonation events - pitch accents and boundary tones. With an n-gram model for a sequence of tones, or a tree grammar for a hierarchy of tones, an intonation-based phrase-level score can be estimated in much the same way acoustic and language models are traditionally combined to give a score for a sequence of words [26]. This is a type of hierarchical formulation of the task, making use of a novel computational framework and knowledge from multiple suprasegmental scales.

Application                         Acoustic Models         Population                                          Hierarchical Theory      Computational Framework
Detection of phoneme-level errors   Phoneme, Articulation   L1 and L2 British English                           Production               Decision Trees
Word-level assessment               Phoneme, Articulation   Pre-literate children, L1 and L2 American English   Perception, Production   Bayesian Networks
Phrase-level scoring                Intonation              L2 British English, L1 American English             Prosody                  Tree Grammars

Table 1.1: Details regarding the three example applications presented in this thesis, including the scales of the acoustic models, the theories used as inspiration, and the corresponding computational framework.

1.4 Known Limitations and Open Questions

All speech acoustic models in this work are limited to the traditional Hidden Markov Model (HMM) framework. An HMM formulates speech as a sequence of hidden states (each representing an acoustically homogeneous part of an abstract linguistic unit such as a phoneme, a syllable, etc.), with the observed speech generated by probability distributions attached to those hidden states. An HMM is defined by the parameters of the distribution at each hidden state, and a matrix of probabilities for transitions among those states [26]. These parameters are not trivial to estimate since the underlying state sequence is not known a priori, but they can be trained iteratively from transcripts (ordinarily on the phoneme level), and once defined they allow for linguistic units to be decoded directly from observed speech.
Though improvement in automatic recognition performance has been achieved with articulatory models or features in more complex generative structures (e.g. DBNs, of which HMMs are a special case), such models usually require large amounts of training data and annotation other than on the phoneme level (a combination rarely available for nonnative speech), as well as considerably more processing time for training and decoding. Here I am not concerned with recognition so much as with evaluation, and can demonstrate improvements in automatic pronunciation evaluation with traditional HMMs. Furthermore, the methods presented here should be generalizable to other types of acoustic models such as DBNs or Neural Nets. The acoustic models themselves are not so important, since these are methods that could be applied to any existing acoustic modeling approach.

Similarly, many of the hierarchical techniques exploited here are specific to the structure of the English language and the perception of native English speakers. These techniques could potentially be applied to pronunciation evaluation of other languages, though of course with appropriate language-specific modifications. However, applications to pronunciation evaluation in languages other than English will not be investigated here.

Though all of the techniques developed and outlined in this work have applications for language instruction or assessment in mind and build upon existing applications for speech technology in education, relatively few of them ([86], for example) have been incorporated into real-world systems. None of the new ideas here are for new types of assessments or teaching strategies, but for improvements to those that have already been applied in real-world tasks. All performance is measured in terms of agreement or correlation with human scores, but the extent to which these methods will improve learning in actual students or will be of better use to teachers than the existing methods, though an interesting question, is beyond the scope of this thesis.

1.5 Thesis Outline

This thesis is organized as follows: the next chapter reviews linguistic theories of hierarchical structure in speech, and attempts to explain why these hierarchical structures can be useful in pronunciation evaluation. Chapter 3 gives details on the experiments in using articulatory representations to detect phoneme-level pronunciation errors. A hierarchical perceptual model of the decision a teacher makes in assessing a child's reading is presented in Chapter 4. In Chapter 5, some methods for modeling foot-level intonation for use in phrase-level suprasegmental pronunciation scoring are examined. The final chapter concludes and summarizes these findings, and offers some suggestions for future work.

Chapter 2: A Review of Speech Hierarchies

2.1 Articulatory Production

Linguistic literature speaks of hypothesized hierarchies in several senses. Of these, there are three senses which will be used in this thesis. One is the hierarchy of phonological representation at and below the level of the phoneme segment. In traditional phonology this takes the form of a hierarchical structure of the distinctive features that uniquely describe a phoneme and partition the phoneme set [36, 81]. Its purpose is to provide a structured explanation of the variation on the segment level in terms of variation in properties common to a subset of phonemes. The hierarchy allows for an order of precedence in allophonic variation, as opposed to assuming all changes in features to be equal, and provides constraints on the types of variation expected, capturing the phonological rules exhibited in spoken language.
All features that partition a phoneme set are speech production-based; whether linked with a specific vocal tract organ or not, they are based on fundamental articula- tory actions that result in specific categorical acoustic and perceptual effects [82]. The basis for the structures of feature hierarchies such as delineated in [7] and [36] are the 13 relative importance of the feature in making a distinction between two large classes of speech sounds, and, in the case of articulator-bound features, the anatomical demands of the vocal tract’s constituent organs. As it descends and splits into branches, a hier- archical tree of features proceeds from more general to more specific phoneme classes; traversed in the opposite direction, articulators known to be physically contiguous form groups. The idea that the phonological structure of language is linked to the physical properties of articulation is made a more crucial assumption in Articulatory Phonol- ogy, which posits not distinctive features but the articulatory gestures of constriction themselves as the fundamental units that, through variations in their temporal overlap, account for the variations in realization on the segment level [13]. The coordination of these gestures - captured in their phase relations in time - leads to the realization of larger linguistic units, with the hypothesis that the strength of the coordinating bond between gestures (based partly on the physical dependencies between the articulators performing those gestures) can result in a type of hierarchy of linguistic units of various sizes [31]. 2.2 Pronunciation Perception Another type of hierarchy often described is that of pronunciation perception. It dif- fers from the hierarchy for speech production in that linguistic units and perceptual phenomena are grouped according to perceptual saliency rather than temporal resolu- tionorphonologicalfunction,thoughoftentheperceptualhierarchyispartitionedalong time-scalelines. Manydifferentmodelsoflexicalaccess(spokenwordrecognition)exist. Mostcanbeschematizedasamulti-levelarchitecturewithinteractionsbetweenatierof competing lexical entries and a tier of perceived sublexical (or phonological) units that compose those entries [58]. The evidence for this hierarchy - activation of the lexicon 14 basedonstimulationofsublexicalunits-comesfromlisteningexperimentsinwhichdis- tinguishing between phonetically similar words requires more cognitive processing time than phonetically dissimilar words - the phonetic similarities activate competing lexical entries that are more closely related and therefore require more time to disambiguate. Essentially this is the same problem of word confusion in traditional ASR, in which sub-lexical acoustic models are trained and the recognition search space is limited to those sequences of phonemes that form lexical entries, often with recognition confusions that can be explained in phonetic terms. 
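To make the notion of competing lexical entries slightly more concrete, here is a small, purely illustrative Python sketch; the toy lexicon, the similarity measure, and the threshold are my own assumptions and are not drawn from any model discussed in this thesis.

```python
# Illustrative sketch (not from the thesis): a toy pronunciation lexicon and a
# crude measure of phonetic similarity between entries. Phonetically close
# entries "compete" during recognition, as in models of lexical access.

from difflib import SequenceMatcher

LEXICON = {            # word -> hypothetical phoneme sequence
    "cat":  ["K", "AE", "T"],
    "cap":  ["K", "AE", "P"],
    "bat":  ["B", "AE", "T"],
    "cast": ["K", "AE", "S", "T"],
    "dog":  ["D", "AO", "G"],
}

def similarity(a, b):
    """Phoneme-sequence similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()

def competitors(word, threshold=0.6):
    """Lexical entries phonetically close enough to compete with `word`."""
    target = LEXICON[word]
    return sorted(
        (w for w in LEXICON if w != word and similarity(LEXICON[w], target) >= threshold),
        key=lambda w: -similarity(LEXICON[w], target),
    )

print(competitors("cat"))   # ['cast', 'cap', 'bat'] -- 'dog' is not a competitor
```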
In nonnative speech pronunciation, several sources speak of a "hierarchy of error" [20, 61, 94] as perceived by native listeners (and documented through surveys), in which errors are usually divided into three categories along a continuum of importance:

• errors that make speech unintelligible, leading to confusion of minimal pairs in phonetics or lexical stress

• errors that listeners find irritating or amusing, especially foreign mistakes that have been specifically stigmatized

• errors that provoke little reaction and may not be noticed

This hierarchy explains the emphasis in language instruction (both computer-assisted and otherwise) on segmental and word-level pronunciation (a top-down approach to pronunciation teaching), since those will have the most impact on intelligibility. Ordinarily, "errors" in English intonation (insofar as concrete rules for English intonation can be set forth) are grouped into the third category of least important errors, though the exact hierarchy can depend on the listener's dialect, age, gender, personal biases, and the specific demands of the task. A hierarchy of this sort really represents a general or specific listener model for the perception and assessment of errors.

2.3 Prosodic Structure

The third type of hierarchy used in this work is that of speech prosody, defined as the organization of linguistic and para-linguistic information above the phoneme level and manifested acoustically as rhythm, loudness, and intonation (or variations in pitch, the perceptual correlate of the fundamental frequency contour). The hierarchy of prosodic content is similar to the production- and articulation-based hierarchies of the sub-segmental features, functioning to provide structure to the dependency between the prosodic and segmental content of speech, and to explain the coordination and nested synchronization among different scales of prosodic units. Prosodic hierarchies are not based on speech production in the sense of constriction of vocal tract organs, but on speech production in a larger sense, one that also encompasses subtle fluctuations in vocal fold vibration rates, phrasing through breath control, and even the cognitive planning of information to be encoded on various levels, each representing a different type of information though all present concurrently in intonation. One model of intonation decomposes the fundamental frequency contour as the summation of tone components on either the phrase or word-accent level [29], much like the overlapping and coordinated gestures of Articulatory Phonology. Others employ a hypothesized tree structure to schematize the way multiple levels of information can be superimposed on one another in a final realization of intonation [41].

2.4 Interactions Among These Theories

These three types of hierarchical theories are each distinct in scope, but highly interrelated. Speech production and perception are known to be connected in many ways. The set of distinctive features of phonetic segments and their hierarchical structure are based on both the physical properties of speech production and the categorical perception of the acoustic properties associated with that production [82]. Many experiments have shown common perceptual confusion of phonemes of the same production-based feature class, as well as segment-level confusions based on variations in articulatory gesture.
One source argues that the physical gestures of articulation are the "common currency" between production and perception, calling for a necessary-for-perception mental representation of speech in terms of the articulatory gestures used to produce it, and that this explains, for example, how newborn babies know how to imitate adult facial gestures [31]. Most models of lexical access (spoken word perception) posit some form of cognitive resynthesis (a speech production of sorts) based on the perceived sub-phonemic events, for comparison with the incoming acoustics [82]. One would expect a hierarchy of articulatory production to inform perceptual modeling, and vice-versa.

Recent work has expanded Articulatory Phonology to include the influence of prosodic phrase boundary gestures (or π-gestures) on the rate of the articulatory gestures on either side of those boundaries, thus linking the prosodic hierarchy and articulatory production (and, hence, speech perception) [15]. Advances in phoneme- and articulatory-based speech recognition have made use of prosodic context as a conditioning variable for more robust acoustic models [38]. The connection between articulation and intonation runs in the other direction, too: the realization of intonation through the fundamental frequency (f0) is highly dependent on the phonetic context, which is to say that the tune changes slightly based on the phonetic content of the words over which it is set. Stop consonants create sharp discontinuities in intonation [84], and vowels will have varying intrinsic f0 - that is, when asked to match frequency in production of high and low vowels, speakers will typically produce high vowels with higher frequency than low vowels [31].

A link between perception and prosodic structure is also well-established. This is most obviously seen in syllable stress - often marked by increased pitch, energy, syllable duration, and full vowels [52] - which can determine the lexical identity of an isolated word (e.g. "project," "content"). Boundary tones and pauses at the ends of phrases delineate syntactic units of various sizes, offering the listener hints for appropriate processing and interpretation [93]. Phrase-level pitch accents are also well-correlated with speaker intentions and listener perception on the level of dialogue acts within a conversation, intimating whether the information offered is new, contrastive, accessible, or even uncertain [95]. Even a speaker's emotional state can be inferred from phrase-level prosodic features [14], and listeners can discern a speaker's regional accent from hearing the prosody alone in low-pass filtered speech [43].

2.5 Conclusion: The Insight of Hierarchical Representations

Variability in pronunciation can come from many sources. A speaker's articulation will change based on the temporal overlap of gestures (co-articulatory effects) and the prosodic context, abiding by the phonological rules of the language. Some of the variability in prosody is built into the nature of a language like English, in which the position and choice of pitch accents, for example, can vary based on the speaker's intentions and dialogue strategy [9], and certain lexical items will have multiple accepted pronunciations. A further level of variability is seen in nonnative speech, which will almost always show the influence of the speaker's native language (with its unique phonological rules and coarticulatory effects), and which will perhaps be subject to its listeners' preconceived biases [20, 94].
Children's speech when reading aloud will exhibit the high variability associated with a child's rapid physical development, as well as their fast-growing ability to decode sounds from text [54]. A speaker's degree of spontaneity or preparedness in speech can be a factor explaining variability, since often the cognitive plan for a spontaneous message will undergo revision while the associated speech sounds are being produced [29].

On top of all this potential variability in production, the task of evaluating a speaker's pronunciation is inherently subjective. How does a speaker's variability come across as cues? And which of these cues best determine perceptions of pronunciation quality? The answers to these questions can depend on many factors as well: the listener's dialect, their exposure to foreign languages, teaching experience, age, personal biases, and the nature of the assessment task itself [94]. Furthermore, the interactions among these sources of variability and subjectivity (as indicated above) make attributing a pronunciation phenomenon to any one source difficult.

The notion of factoring speech into a hierarchy - into sequences of nested variable-sized units on parallel time-scales - is a most helpful asset when faced with subjectivity and variability to this degree. First of all, the variability does not exist on any one scale - its various sources lie not just on the phoneme level but above and below it as well. To conceive of speech in terms of only one scale is to ignore the relevant cues present on other scales of analysis, cues used by listeners and therefore necessary for holistic automatic analysis of pronunciation. But it is not always enough to simply use multiple cues. The structure of the hierarchy - the dependencies among the parallel units - is important because it tells us where to look for sources of the variability seen. Knowing which cues tend to co-occur in speech can give models more explanatory power, eliminating some of the uncertainty due to context and other factors contributing to variation that would otherwise not be accounted for. By constraining the search space, the hierarchy tells us what kinds of variability to expect, and sets an order of precedence once that variability is detected. In this way, the structure of the hierarchy makes models of pronunciation more specific, less subject to unexpected influences.

Chapter 3: Detecting Phoneme-level Pronunciation Errors

3.1 Introduction

Pronunciation evaluation calls for the establishment of a reference abstract pronunciation model - the canonical form - against which all realizations can be compared, and naturally lends itself to applications in the area of second-language acquisition, in which a student's pronunciation will be assessed relative to a "gold standard." With these pedagogical applications in mind, in this chapter I define the canonical reference to be what linguists often call the single "citation form" [52] - the formalized lexical pronunciation of words when spoken in isolation, which students of a foreign language are expected to produce when practicing their new tongue.

The expected deviations from the canonical form can be modeled in a number of ways and on various time-scales, traditionally referred to as the segmental, suprasegmental, and articulatory feature levels. A segment-level pronunciation error is defined here as the substitution of a phoneme or sequence of phonemes, with respect to the canonical form.
A phoneme is an abstract but unambiguous unit (or “segment”) of speech, and the term phone is given to a specific realization of that unit; by definition, a sound is a phoneme if simply changing that sound can change the meaning of a word [52]. In English, certain events within a phone will not change the meaning of the word they compose nor the phonemes that make up that word, so these variants - called “allo- phonic” - necessarily occur below the segmental level. Pronunciation mistakes in the sub-phonemic or articulatory feature scale are those which result in a segment- or word- level substitution; for example, a lengthy voice onset time (VOT) in the consonant /d/ can lead to the perceived substitution of /t/ [47, 52]. Suprasegmental pronunciation errors may encompass several phonemes and involve longer-scale variations in prosody, including those of syllabic or word-level stress [88]. The set of expected errors may be arbitrarily large, of course. To constrain the search space, past work in modeling pronunciation variants or errors has derived these expected forms with rule-based or statistical methods of transforming the canonical representation. The rule-based meth- ods [57, 83] are grounded in firm linguistic theory and expected usage, and allow for a generalized and unsupervised approach to future, unseen datasets, but statistical anal- ysis [23, 65, 103] can estimate probabilistic models for each rule’s application, and will sometimes show a certain rule to be statistically insignificant for specific cases. Inthischapter,Ichoosetofocusonsegmentalpronunciationerrorsand,morespecif- ically, those which can be modeled as systematic - i.e. phonologically or statistically predictable-basedonpriorknowledgeofthespeaker’snativelanguageandthedataset. The goal is to discriminate between a canonical segment and any of its substitutions in nonnative speech, to disambiguate between a close approximation of the canonical form and a true pronunciation error. It is the systematic substitutions - the ones a speaker produces unconsciously by the “phonological transfer” from his native tongue - which are most difficult to correct and therefore most important to practice [3]. 21 This evaluation of pronunciation on the phone level is essentially one formulation of the traditional hypothesis verification task. In verifying that an observation utterance, O, fits a target model, M t , one can estimate the posterior probability P(M t |O)= P(O|M t )P(M t ) P(O) (3.1) by approximating P(O)≈ P(O|M f )P(M f ) where M f is a general or specific substitu- tion (or “filler”) model. Assuming equal priors, this becomes the likelihood ratio (or “confidence score”) τ = P(O|M t ) P(O|M f ) (3.2) and one decides in favor of verification if τ ≥ T for some threshold T. Many different types of filler models have been proposed, depending on the application. A typical baseline approach is to make the filler be a generalized “garbage” model for all speech, either on the phone or word level, though a specific set of “cohort” models can improve these scores’ reliability [17, 25, 73, 98]. In the domain of speaker identification and speaker verification, a complex set of “impostor” models is used as the discriminative filler [44]. In the past, the detection and correction of pronunciation errors on the phone level has relied on this general approach to verification and scoring, given a set of trained phoneme models [23, 65]. 
However, even when performing verification on the segment level, a strictly phonemic representation might not tell the “whole story” about the nature of a systematic mistake. Consider the common substitution of /s/ for /z/ made by native German speakers in such English words as “dessert” and “warnings,” as predicted by German orthography and phonology [3]. Though distinct phonetic units, both phonemes in question share a common Place and Manner assignment in terms of articulation - Alveolar Fricative - and differ only in that /z/ is voiced and /s/ is 22 not. This type of “close” substitution, this overlap between the canonical form and its common substitutions once factored into the articulatory domain, is the rule rather than the exception when it comes to nonnative speech. To model such an error in terms of parameterized articulatory motion has more explanatory power than to treat it as a simple substitution. This chapter begins with the hypothesis that the sorts of insights offeredbyarticulatoryinformationwillallowdiscriminationbetweenthecanonicalform and its close segmental substitutions in nonnative speech more astutely than with the more ambiguous phonemic models alone. Articulatory representations of speech have been used to improve accuracy in such tasksasspeakerverification[55],generalspeechrecognition[49,76],pronunciationmod- eling [57], and spectrally-impoverished or whispery speech recognition [45]. However, directarticulatorymeasurementandtranscriptionisnotalways available (especiallyfor nonnativespeech),andnotallarticulatoryrepresentationsaresuitableforthepronunci- ation evaluation task. In this chapter, I present a rule-based method for deriving useful articulatory representations from phoneme-level transcriptions. With these represen- tations, I train models for these articulatory derivations as well as standard phoneme models using the same traditional spectral features. I then show how these models can be used to generate novel features and confidence scores for segment-level pronuncia- tion verification. My intention is to show that features derived from an articulatory representation, when combined in a hierarchical framework with traditional phonemic features, will improve verification accuracy in phone-level evaluation. Moreover, I ap- proach it with the dearth of human input demanded by automated language-learning applications. Section 3.2 presents the corpora used in this chapter, and Section 3.3 gives some background on work in articulatory representations, and explains the representation 23 chosen here. Section 3.4 describes the model training procedure for these representa- tions, while Section 3.5 outlines several new methods for deriving features from these models. Section3.6explainsthepronunciationevaluationexperimentsperformed,based on these features, and Section 3.7 discusses the results. Sections 3.8 and 3.9 conclude and present some ideas for future work in this area. 3.2 Corpora The data used in these experiments were compiled by the University of Leeds in their ISLEcorpus[3]. Theserecordingsconsistof46adultIntermediateBritishEnglishlearn- ers who are native speakers of either Italian or German - 23 of each. Utterance prompts were complete sentences designed to highlight specific difficulties English learners typi- cally encounter in pronouncing single phone minimal pairs, phone clusters, and primary stress minimal pairs. The recordings were automatically segmented by a forced-aligner. 
These transcriptions were then augmented on the phone level by a team of five linguists to reflect each speaker's pronunciation. However, no effort was made to correct discrepancies in the automatic segmentation times. Consequently, this means that the derivation of these features is fully automated, though the test set annotations provide a reference for assessing this method's performance.

For comparison with the nonnative results, additional experiments on native speakers were done using the MOCHA-TIMIT corpus [101] and the IViE corpus [32], both of which are composed of read British English of the sort the ISLE corpus students were learning to speak - 43 Southern British speakers were used in all. Though the pronunciation of native speakers does not diverge from the canonical in the same way as that of nonnative speakers, experiments on these corpora can legitimize the proposed features for assessment of native speech, in which the usefulness of articulatory information has already been well-documented elsewhere.

stream             classes                                            cardinality
jaw                0: Nearly Closed, 1: Neutral,                      4
                   2: Slightly Lowered, 3: Lowered
lip separation     0: Closed, 1: Slightly Apart,                      4
                   2: Apart, 3: Wide Apart
lip rounding       0: Rounded, 1: Slightly Rounded,                   4
                   2: Neutral, 3: Spread
tongue frontness   0: Back, 1: Slightly Back,                         5
                   2: Neutral, 3: Slightly Front, 4: Front
tongue height      0: Low, 1: Mid, 2: Mid-High, 3: High               4
tongue tip         0: Low, 1: Neutral, 2: Dental,                     5
                   3: Nearly Alveolar, 4: Alveolar
velum              0: Closed, 1: Open                                 2
voicing            0: Unvoiced, 1: Voiced                             2

Table 3.1: Articulatory feature space. Note the numerical mapping and gradual physical progression among classes within a given stream, proportional to the integers chosen to represent those classes.

3.3 Articulatory Representations

3.3.1 Previous Work

Let me begin by defining some terms. By an articulatory representation I mean any convention of spoken language transcription that uses symbols denoting an abstraction of an underlying speech production mechanism or position (as opposed to symbols that represent abstractions of perceptual or semantic phenomena). Often an articulatory representation will span multiple streams, which is to say that the symbols used in the representation can be grouped based on their relevant components. I will refer to the symbols that compose these streams as articulatory classes. One example stream could be the Manner of articulation, and its classes may include Fricative, Stop, and Vowel. Another stream could be the degree of Lip Rounding, with classes of perhaps Rounded, Neutral, and Spread.

Many variations on the idea of representing speech through an articulatory framework already exist, though the fundamental methodology has relied either on articulatory measurement, or a mapping to the articulatory feature domain from a higher-level representation (ordinarily that of phonemes). Electromagnetic or electropalatal measurement as used in [33] is sometimes costly, though still appealing for the pronunciation evaluation task because of the concrete physical referents for any models derived therefrom - it allows us to “point to” the observable differences in production among realizations, and one may make class assignments as specific as the resolution of the vocal tract imaging permits. As yet, no known corpus of direct articulatory measurement possesses the same scope of nonnative variability as the ISLE recordings.
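For reference in what follows, the stream inventory of Table 3.1 can be written out explicitly as a small data structure; the integer positions are exactly the numerical mapping used throughout this chapter. The snippet below is only a transcription of the table into Python (the variable name is mine), not part of any toolkit used in these experiments.

# Articulatory feature space of Table 3.1: each stream maps integer class
# labels to physical positions; the cardinality is simply len(classes).
ARTICULATORY_STREAMS = {
    "jaw":              ["Nearly Closed", "Neutral", "Slightly Lowered", "Lowered"],
    "lip separation":   ["Closed", "Slightly Apart", "Apart", "Wide Apart"],
    "lip rounding":     ["Rounded", "Slightly Rounded", "Neutral", "Spread"],
    "tongue frontness": ["Back", "Slightly Back", "Neutral", "Slightly Front", "Front"],
    "tongue height":    ["Low", "Mid", "Mid-High", "High"],
    "tongue tip":       ["Low", "Neutral", "Dental", "Nearly Alveolar", "Alveolar"],
    "velum":            ["Closed", "Open"],
    "voicing":          ["Unvoiced", "Voiced"],
}

for stream, classes in ARTICULATORY_STREAMS.items():
    print(f"{stream:16s} (cardinality {len(classes)}): "
          + ", ".join(f"{i}: {c}" for i, c in enumerate(classes)))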
As for the rule-based approaches to generating an expected articulatory representation from phonemic transcriptions, several class and feature configurations have been proposed, all grounded in articulatory feature theory [52]. The feature assignments defined by Kirchoff in [49] - Voicing, Manner, Place, Front-Back, and Rounding - have served as an appropriate baseline for the experiments reported in [34, 45, 55], but are perhaps too abstract and coarse to be of much use in pronunciation evaluation or language learning applications. An ideal articulatory representation for this task should allow for differences between vowels subtler than simply Low or High, Rounded or Unrounded, and so on. In this paradigm, the model configurations for certain pairs of often substituted vowels are not always distinct, making them theoretically impossible to classify automatically. A good example is the substitution of /i:/ for /I/ frequently made by Italian learners of English - [49]'s feature space renders them both as high, front vowels, without distinction. In addition, all classes within a given articulatory stream should, when taken as a group, delineate a graduated physical progression within that stream. This is necessary to ensure that the models derived from each stream will represent pronunciation-dependent variations in articulatory motion over time; the overlapping motion of each vocal tract component should be visualizable, so as to parameterize the unique qualities of each pronunciation realization effectively. This isn't possible, though, if all abstract Place models are treated as members of the same stream, as in [49]; a Labial articulation concerns the lips, whereas a Velar articulation concerns the tongue and soft palate - they should be evaluated and tracked separately. As another example, if an abstract Manner classifier requires a mutual exclusion between Nasal and Vowel classes (as [49] has done), this disallows the possibility of classifying a nasalized vowel into both categories.

Acoustic modeling of an articulatory representation can be done using traditional spectral features - Cepstral Coefficients, for example - and any statistical classifier. Previous studies have shown success in articulatory modeling using Neural Networks [49]. Hidden-Articulator Markov Models (or HAMMs) were first proposed by Richardson et al. in [76] as a method of incorporating articulatory information into the existing Hidden Markov Model framework that dominates traditional speech recognition. HAMMs have been used to improve speech recognition performance when used in combination with traditional HMMs, and also show a robustness to noise previously unseen in phoneme-based acoustic modeling [75]. This is remarkable in light of the fact that both HAMMs and phonemic HMMs are trained on the same spectral features, so it is really just the augmented representation which is responsible for this improvement in model performance.

A Hidden Markov Model consists of j states, each with an output density b_j(o_t) modeling the probability that state j “generates” the speech observation o_t, and a set of transition probabilities such that a_{ij} models the probability of a transition from state i to state j [26]. The assumption is that, given a model M, the joint probability of a sequence of observations O = o_1, o_2, ..., o_T and their underlying “hidden” sequence of states X = x(1), x(2), ..., x(T) can be calculated as the product of all output and transition probabilities over that sequence:

P(O,X|M) = \prod_{t=1}^{T} b_{x(t)}(o_t)\,a_{x(t)x(t+1)}.    (3.3)
The state sequence is not in general known, so an observation's likelihood given a model must be computed by summing over all possible state sequences:

P(O|M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\,a_{x(t)x(t+1)}    (3.4)

where x(0) and x(T+1) are model entry and exit states, respectively. In the case of an HAMM, the hidden state sequence represents changes in the speech spectrum over time within a particular articulatory class or vector of such classes, rather than over the more abstract phoneme units ordinarily represented by HMMs.

Each HAMM state proposed in [76] represents what is called an articulatory configuration - a vector of integers C = {c_1, c_2, ..., c_N} over N streams, where 0 ≤ c_a < M_a and M_a is the cardinality of stream a. A similar “feature-bundle” state was proposed by Sun and Deng in [83] but without the numerical mapping. This integer representation is advantageous for several reasons. It treats each stream as a set of discrete articulatory classes, isomorphically mapping numerical values to physical positions for factored tracking of movement over time. This permitted [76] to mathematically impose a set of static and dynamic constraints on state transitions, so that the streams could move asynchronously but still adhere to the physical properties of the vocal tract.

In [76], the articulatory representation was derived from phone-level transcripts such that some configurations necessary for constrained transition would not be seen in the training data - i.e. no phoneme mapped to many configurations which were still physically possible and were needed for modeling the transitions between phones. The integer representation allowed [76] to interpolate the models for these meta-phonemic states as the Cartesian product of the phoneme-trained states. This required the estimation of several hundred thousand HAMM parameters - one disadvantage of modeling a complete configuration vector rather than the individual streams.

One related approach, implemented by Livescu and Glass in [57] and Wester et al. in [96], is to model the articulatory representation as a Dynamic Bayesian Network in which each hidden state is factored into its respective streams, allowing for trainable probabilistic dependencies among them. As a modification to the original HAMM definition stated above, this means that P(O,X|M) can be calculated as the product of output and transition probabilities as well as the probability of asynchronous movement among the parameterized states in X. For speech recognition purposes, this is an improvement over the approach in [76, 83] because it not only allows for fixed constraints in allowable articulatory configurations, but succeeds in modeling the relative probability of one configuration over another.

Figure 3.1: British English (BBC newscaster) vowel chart with numerical mappings, after [52]. This illustrates the relationship between the Tongue Frontness and Tongue Height streams.

3.3.2 Choice of Representation and Modeling

Chosen for its fine-grained taxonomy in concrete physical terms, the articulatory representation used in this chapter (enumerated in Table 3.1) is based primarily on the mapping and modeling proposed in [76], but with some important differences. First, many modifications to [76]'s phone-to-articulator mapping are made based on an interest in modeling speech that is specifically nonnative.
The students in the ISLE corpus were learning British English, so most of the mappings in [76] for American English do not serve as appropriate reference-points - particularly those for the relative tongue positions of the vowels. These were determined anew by drawing from Ladefoged's charts of BBC newscaster English and relative lip rounding of the cardinal and secondary vowels [52], here reflected in Figs. 3.1 and 3.2. Ladefoged's rules for allophonic variability in English were also incorporated, but many of them were omitted because they do not apply to the chosen representation, or are not generalizable to nonnative speakers, or both. For example, the rule that a glottal stop /P/ substitutes for /t/ when preceding an alveolar nasal is omitted because neither my articulatory representation nor the ISLE corpus' phoneme set accounts for the glottal stop; furthermore, Ladefoged suggests this rule applies to “many” - but not all - accents of English. The final set of four contextual rules used is:

• A vowel preceding a nasal consonant will also be nasalized
• Voiced stops and affricates become unvoiced when syllable initial
• Stops are unexploded when occurring immediately before another stop
• Alveolar consonants become dental in anticipation of a subsequent dental consonant

The algorithm for expanding phone-level transcriptions to a sequence of expected articulatory representations involves three steps. First, a phone is mapped directly to its expected articulatory configuration, in eight dimensions; charts of this mapping, derived from [52, 76], are given in the Appendix. Then, based on its context, the expected configuration is perhaps changed to one of its allophonic variants, based on the above rules. Finally, the overall transcript for the entire utterance is interpolated so as to adhere to the physical constraints of the vocal tract. No model states may be skipped in the transcripts' numerical transitions (unless transcribed Silence intervenes) because they represent a discrete sequence of positions. For example, a transition of the Tongue Frontness stream from Back (class 0) to Front (class 4) must first pass through all intermediate classes (1 through 3), because that is what a real tongue would do (a short sketch of this interpolation step is given below). In this way, without the need for vocal tract imaging, a series of articulations for a phone sequence can be generated, for use as an abstract standard of expected behavior from which all model parameters for this articulatory representation may be derived.

Figure 3.2: A 3D representation of the relation between the Lip Rounding, Tongue Frontness, and Tongue Height streams in cardinal and secondary English vowels, after [52]. Think of this as a rotation of Fig. 3.1, to reveal the Lip Rounding dimension. This also explains the numerical mapping assignments for the Lip Rounding stream.

This representation is ultimately synchronous with the original phoneme labels, though the models derived from them need not perform synchronously. In a sense it is also non-causal because the present articulatory class is always dependent on future contextual information. See Fig. 3.3 for a graphical depiction of this transcription expansion technique.

The modeling method also owes a debt to the HAMMs of [76] but, like the method of representation, the models have been adapted to suit the domain of nonnative speech evaluation.
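To make the interpolation step of the expansion algorithm concrete, the sketch below inserts every intermediate integer class between successive labels in one stream, so that no discrete position is skipped. It is a simplified, hypothetical rendering of the idea: it treats a single stream in isolation, ignores the Silence exception and the frame timing of the real transcripts, and its example labels are illustrative rather than the exact phone-to-articulator mapping used in the experiments.

def interpolate_stream(labels):
    """Expand a per-phone sequence of integer class labels for one stream so that
    consecutive labels never differ by more than one step, e.g. Tongue Frontness
    moving Back (0) -> Front (4) must pass through classes 1, 2 and 3."""
    if not labels:
        return []
    out = [labels[0]]
    for nxt in labels[1:]:
        cur = out[-1]
        if nxt == cur:
            out.append(nxt)        # the phone holds the same position
            continue
        step = 1 if nxt > cur else -1
        # append every intermediate class and then the target class itself
        out.extend(range(cur + step, nxt + step, step))
    return out

# Hypothetical Tongue Tip labels for "tenth" (/t E n T/): Alveolar, Low, Alveolar, Dental
print(interpolate_stream([4, 0, 4, 2]))   # [4, 3, 2, 1, 0, 1, 2, 3, 4, 3, 2]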
Rather than treating each possible configuration of articulatory classes as a unique hidden state - something of a “meta-phoneme” - I designed a separate set of Hidden-Articulator Markov Models for each of the eight streams (Jaw, Lip Separation, etc.), and trained each set independently. This allowed for free asynchrony among the articulators, so that the results might mimic the overlapping behavior of a true vocal tract's constituent parts. It also permitted the generation of independent bigram models specific to each stream, and simplified the training and testing process, since in this quantization scheme no feature has more than five classes, compared to the several thousand states trained in [76]'s previous work.

These simplifications rest on the assumption of independent motion among these eight articulator streams, which in a different study might not be valid - it allows for results that could potentially violate the fundamental physical constraints of the human vocal tract (e.g. dependencies between the jaw position and lip separation, tongue tip and tongue body, etc.). However, the point of this project is not to build an articulator-based speech recognizer or even a general phoneme recognizer, in both of which such constraints would be more important. Rather, I intend to demonstrate improved discrimination between canonical and noncanonical pronunciations in articulatory feature space, regardless of the accuracy in recognizing any of the individual articulatory streams (the true transcripts of which are not known). Verification in this domain may in fact perform better under an assumption of independence. The true articulation may diverge from its expected mapping, especially in the case of a segmental substitution. Disallowing physically impossible articulatory configurations from the recognition results may limit the representation of fine pronunciation distinctions within articulatory feature space. With well-trained models, if the results point toward a physically unlikely articulation, it could signify the presence of a pronunciation mistake, and that is exactly what I intend to detect.

Figure 3.3: An illustration of the three steps involved in automatically deriving an articulatory-level representation from phone-level transcriptions, as explained in Section 3.3.2. (The example expands the word “tenth” - /t E n T/ - showing the expected, allophonic, and interpolated class sequences for the velum and tongue tip streams.)

3.4 Model Training

The goal of the experiments in this chapter is to show that features derived from models trained on an articulatory representation, when combined with features derived from traditional phoneme models, can be used to evaluate a nonnative speaker's segment-level pronunciation more accurately than with segmental features alone. This required the training of traditional phoneme models as well as articulatory-based models, both based on the Block D recordings of the ISLE corpus of nonnative speech, described in Section 3.2. Block D consists of read sentences in both question-and-answer and “I said X, not Y” forms, designed to contrast spellings ordinarily difficult for nonnative speakers. For the Italian and German speakers combined, this Block of recordings totals about 5 hours of training speech.

Separate phone-level Hidden Markov Models for the native German and Italian speakers in the ISLE corpus were trained using standard techniques.
The first 12 MFCCs (plus delta, acceleration, and normalized energy coefficients) were extracted every 10 msec using a window size of 16 msec. These features were then used to initialize three-state context-independent HMMs over the ISLE phone set (plus Silence and generalized phone-level filler models), using segmentation times from the automatic alignment provided by the corpus' transcriptions. The models were then refined with embedded re-estimation and by updating the number of mixtures per state to 16. Context-dependent models were not trained because all of the HMM alignment-based features were to be decoded over segments taken in isolation, and not over an entire word or phrase.

Eight different sets of HAMMs - one set for each of the articulatory streams enumerated in Table 3.1 - were also trained, again keeping the native German and Italian speakers separate. All model parameters and topology were the same as for the phone models except the training procedure was slightly different. The ISLE corpus does not provide transcripts below the segment level, so the phoneme transcriptions were mapped to a sequence of articulatory classes - parameterized in eight streams - according to the convention outlined in Section 3.3.2. For this new articulatory expansion, there was not a reliable segmentation (especially for articulatory motion within a phone), so the HAMMs were initialized using a flat-start procedure, then updated with embedded re-estimation.

For purposes of articulatory recognition over an entire utterance, a bigram articulation model for transitions among articulatory classes in each of the eight streams was also trained, based on the interpolated transcripts. A bigram model assumes that the current articulatory position for a given stream depends only on the position immediately preceding it. The purpose of this was to steer the recognizer away from physically unlikely transitions within each of the eight streams, e.g. if the current Tongue Tip recognition result is Low (class 0) then the next recognition result can either stay in that position or move to Neutral (class 1). The interpolated transcripts were also constrained as such, therefore the bigram model derived from them would be as well. And a simple bigram model is sufficient to capture these transition rules.

All model training was done using HTK [26]. Comparable models for native speakers of British English were trained using the Southern speakers of the IViE corpus and half of the MOCHA-TIMIT database, which both consist of phonetically-balanced read speech, totaling about 1 hour of training data.

3.5 Derivation of Features

I propose three types of features for this verification problem. The first type are confidence scores derived from likelihoods based on alignment of traditional phone-level HMMs. Though they have been shown to provide complementary information to phonemic models, HAMMs alone do not perform as well as phone models in basic recognition tasks [76]. These HMM scores serve as a tool for both baseline experiments and ones in conjunction with features derived from articulatory models, to be combined in a Decision Tree framework in Section 3.6. Articulatory-based features are split into two categories: those based on articulatory recognition results (represented by integers as
explained in Section 3.3); and those which are confidence scores for the expected articulation in each of the eight streams, directly analogous to the alignment-based HMM scores. Details about these feature sets are explained below.

3.5.1 Traditional HMM Confidence Scores

Section 3.1 gave some background on pronunciation evaluation using confidence scores estimated from appropriate filler models, for verification of the canonical form. Here I investigate three different HMM-based fillers, based partly on those used in [56], so as to select the best of these derived confidence scores as a baseline feature.

The best use of prior knowledge of expected substitutions is to let the filler be a recognition network over all likely substitutions in the target domain for the canonical phone in question, with relative arc transitions derived from corpus statistics. Formally, the confidence measure's denominator in Eqn. 3.2 is defined to be

\max_i \{ P(O|p_i)\,P(p_i) \}    (3.5)

where O is the segmental speech observation, p_i is one phoneme model (or sequence of phoneme models) in the set of expected substitutions (i takes all values in this set), and P(p_i) is the prior probability of the substitution p_i. The resulting ratio used as a confidence score is

\tau = \frac{P(O|p_t)\,P(p_t)}{\max_i \{ P(O|p_i)\,P(p_i) \}}    (3.6)

where p_t is the target phoneme's model and P(p_t) = 1 on the assumption that there is only one target. This particular confidence score, based on expectations of common phone-level substitutions, will be referred to as P_subs in the remainder of this chapter. For a concise list of substitution statistics used in constructing these filler networks, please see Tables 3.2 and 3.3.

target    subst.    prob.
v         f         0.76
          w         0.21
t         SIL       0.73
          d         0.21
z         s         0.91
          Iz        0.01
          Is        0.01
          SIL       0.01
2         @         0.9
          6         0.02
          I         0.01
          O:        0.01
@         U         0.21
          6         0.18
          u:        0.12
          æ         0.11
          E         0.07
          @U        0.07
          I         0.06
          eI        0.04
          SIL       0.02
          2         0.02
          3:        0.01

Table 3.2: Relative substitution probabilities for commonly mispronounced phonemes, for the ISLE corpus' native German speakers. The probabilities for a given target may not add up to 1 because all substitutions of less than 0.01 probability have been disregarded.

The second proposed HMM-derived confidence measure is based on what [56] calls a “phoneme loop” filler, which is identical to that of the denominator in Eqn. 3.6 except i can take values over all phone-level HMMs, and P(p_i) = P(p_j) for every i and j in the complete HMM set, i.e. all substitutions are weighted equally. The confidence score derived from this phoneme loop filler is abbreviated as PL from now on.

A third filler is defined as a generalized segment-level HMM trained on all phonemes. Its confidence measure is calculated as

\tau = \frac{P(O|p_t)}{P(O|f_g)}    (3.7)

where p_t is the target phoneme and f_g is the generalized filler.

3.5.2 Articulatory Confidence Scores

In a manner directly analogous to the derivation of the HMM alignment scores, articulatory alignment confidence scores were generated over all eight articulation streams by constructing ratios of target to filler likelihoods.
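For reference, the phone-level P_subs score of Eq. 3.6 can be sketched as follows in the log domain, using substitution priors of the kind listed in Table 3.2; the articulatory scores defined next are built analogously. The likelihood values here are placeholders (in practice they come from HMM alignment), and the function name is mine.

import math

def p_subs_score(loglik_target, subst_logliks, subst_priors):
    """Log-domain version of Eq. 3.6: log P(O|p_t) - max_i [log P(O|p_i) + log P(p_i)],
    with P(p_t) taken to be 1 since there is a single target."""
    best_filler = max(ll + math.log(p) for ll, p in zip(subst_logliks, subst_priors))
    return loglik_target - best_filler

# Hypothetical example for target /z/ with German-speaker substitutions from Table 3.2:
score = p_subs_score(loglik_target=-310.0,
                     subst_logliks=[-305.0, -330.0],   # e.g. /s/ and /I z/
                     subst_priors=[0.91, 0.01])
print(score)   # negative: the /s/ filler wins, so a substitution would be suspected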
However, unlike the P_subs HMM-based confidence score, these target and filler models required only a general knowledge of articulatory phonology and no special prior knowledge of the corpus statistics for likely substitutions. The target articulatory model for each phone of interest is defined as a recognition network of allowable articulatory classes for that particular phone and stream. For example, all vowels are allowed to decode the Velum stream as either Open or Closed (1 or 0 by the numerical mapping), since in English the difference is not contrastive - a nasalized vowel is only an allophonic variant of that vowel, not an entirely new phonetic unit. The filler model against which the target is compared is defined as the recognition network of all unallowable articulation classes within that stream, for the phone in question - the complement of the target network. This includes decoding Silence as one unallowable articulation. No weights are assigned to the arcs of these recognition networks because, unlike the phonemic substitutions, the articulation statistics for the ISLE corpus are not available.

target    subst.    prob.
t         t@        0.47
          SIL       0.36
          d         0.06
          T         0.01
          d@        0.01
          @         0.01
N         Ng        0.56
          Ng@       0.26
          n         0.06
          N@        0.05
          Nk        0.02
I         i         0.81
          @         0.06
          E         0.04
          aI        0.01
          æ         0.01
          SIL       0.01
          2         0.01
2         @         0.59
          A:        0.11
          6         0.08
          E         0.03
          U         0.03
          u:        0.03
          2p        0.01
          æ         0.01
          @U        0.01
3:        3:ɹ       0.44
          Eɹ        0.14
          O:ɹ       0.1
          u:ɹ       0.07
          6ɹ        0.03
          ɹ         0.02
          O:        0.01
          Uɹ        0.01
          @ɹ        0.01
@         6         0.2
          u:        0.14
          E         0.14
          U         0.1
          æ         0.1
          2         0.07
          @U        0.06
          eI        0.02
          I         0.02
          i         0.01
          A:        0.01
          SIL       0.01
U         u:        0.68
          2         0.14
          u:l       0.05
          Ul        0.05
          @U        0.01
          A:        0.01
          6         0.01

Table 3.3: Relative substitution probabilities for commonly mispronounced phonemes, for the ISLE corpus' native Italian speakers. The probabilities for a given target may not add up to 1 because all substitutions of less than 0.01 probability have been disregarded.

Formally, for a segment-level speech observation O, define a vector of articulatory confidence scores in N streams, A = {\tau_1, \tau_2, ..., \tau_N}, where

\tau_a = \frac{\max_i \{ P(O|t_i) \}}{\max_j \{ P(O|f_j) \}},    (3.8)

t_i is one of I allowable (target) articulatory classes for this segment, and f_j is one of J unallowable (filler) articulatory classes for this segment. In this case, N = 8, and I and J vary depending on the stream and the segment. Silence is included in the filler network, so I + J = M_a + 1, where M_a is the cardinality of stream a. All scores in vector A can be used as classification features, and together the whole vector of scores will be denoted as A_conf in the rest of the chapter.

3.5.3 Articulatory Recognition

A second novel method of deriving useful features from articulatory models for segmental pronunciation discrimination is as follows. All articulatory classes within each of the eight streams are represented as integers, consistent with the linguistically-based method outlined in Section 3.3. The magnitude of these integers is proportional to the discrete positions they represent, so the results of articulatory recognition over these eight streams within a segment of interest can in themselves serve as an eight-dimensional feature vector in articulatory space - an articulatory factorization of that segment. I propose two methods of deriving these recognition-based feature vectors.

First, I argue that articulatory recognition over an utterance (rather than just the isolated segment of interest) using a bigram articulation model is necessary to ensure that the appropriate degree of articulatory motion within a phoneme is captured.
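The bigram articulation model invoked here (trained as described in Section 3.4) can be pictured as a simple count-based estimate over the interpolated transcripts: because those transcripts never skip a class, only self-transitions and moves to an adjacent class ever receive non-zero probability. The snippet below is an illustrative reconstruction under that assumption, not the HTK implementation actually used, and the example label sequences are hypothetical.

from collections import Counter, defaultdict

def train_stream_bigram(transcripts):
    """Estimate P(c_i | c_{i-1}) for one articulatory stream from interpolated
    class-label sequences. No smoothing is applied, so transitions never seen
    (e.g. skipping a class) keep zero probability - the desired constraint."""
    counts = defaultdict(Counter)
    for seq in transcripts:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {prev: {cur: n / sum(c.values()) for cur, n in c.items()}
            for prev, c in counts.items()}

# Hypothetical interpolated Tongue Tip sequences (classes 0-4):
bigram = train_stream_bigram([[0, 0, 1, 2, 3, 4, 4, 3], [4, 3, 2, 2, 1, 0]])
print(bigram[3])   # from Nearly Alveolar (3), only the observed neighbours 2 and 4 get mass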
The recognition results need not be synchronous with the ISLE corpus' phone-level segmentation, which is based on automatic alignment with the canonical pronunciation and should therefore not be blindly trusted on the level of articulation, especially if the pronounced phone differs from the canonical form. If I were to perform recognition only over the interval of each phone of interest, the classifier might miss crucial articulatory transitions immediately before or after the phone's transcribed start and end points. A bigram model captures local context important for classification, and, though the streams are modeled as independent, it ensures that a physically impossible transition is not likely to be made within any of the streams, as explained in Sections 3.3 and 3.4. The articulatory recognition segmentation may not line up perfectly with the phone-level alignment, and may include a sequence of articulatory classes within a given phone's interval. This allows for multiple interpretations of what should be considered the ultimate vector of recognition results. Two such methods are investigated, referred to as centered and soft and explained in detail in the remainder of this section.

For a speech sequence O = o_1, o_2, ..., o_n in n frames, the utterance-level articulatory recognition result for stream a with cardinality M_a is given by

\hat{C}_a = \arg\max_{C_a} \{ P(C_a|O) \} = \arg\max_{C_a} \frac{P(O|C_a)\,P(C_a)}{P(O)} = \arg\max_{C_a} \{ P(O|C_a)\,P(C_a) \}    (3.9)

where C_a = c_a^1, c_a^2, ..., c_a^m is a sequence of m integer articulatory class labels such that 0 ≤ c_a^i < M_a and m ≤ n, i.e. these class labels can span more than one frame. Because this is with a bigram articulations model, the approximation

P(C_a) = P(c_a^1, c_a^2, ..., c_a^m) ≈ P(c_a^1) \prod_{i=2}^{m} P(c_a^i | c_a^{i-1})    (3.10)

is computed, where each class depends only on the preceding class in its stream. Once recognized, \hat{C}_a is expressed as the sequence of integer class labels by frame - s_a^1, s_a^2, ..., s_a^n - with the same frame-level indices as in O. Now, for a segment within this utterance beginning at frame b and ending at frame e, I define the segment-level articulatory recognition result in stream a two ways -

• centered:
\hat{S}_a = s_a^{\lfloor (b+e)/2 \rfloor}    (3.11)

• soft:
\hat{S}_a = \frac{1}{e-b} \sum_{i=b}^{e} s_a^i    (3.12)

Figure 3.4: An example of recognizing 8 HAMM streams and then extracting the 8-dimensional soft and centered articulatory feature vectors from them. This correct realization of the target /w/ will be grouped with others so as to distinguish them from the class of substitution pronunciations. (The example shows an alignment window spanning one three-frame occurrence of /w/, with the recognized class of each stream per frame and the resulting soft and centered feature vectors.)

Combining the eight streams, one segment's vector of recognition results is R = {\hat{S}_1, \hat{S}_2, ..., \hat{S}_8}. To make this vector target-independent as a set of features, one must subtract each stream's recognition result from the expected articulation: \hat{R} = {L_1^t - \hat{S}_1, L_2^t - \hat{S}_2, ..., L_8^t - \hat{S}_8}, where L_a^t is the expected integer class label for segment t in stream a; if t is expected to have articulatory motion below the phone level, then L_a^t is the mean of all expected integer class labels within t for stream a. The final feature vector \hat{R} reflects the distance of the recognition result from the expected mapping of the canonical, regardless of the target.
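To make the two recognition-based feature definitions concrete, the sketch below extracts the centered (Eq. 3.11) and soft (Eq. 3.12) values for a single stream from a frame-level label sequence, and then forms the target-relative difference that enters \hat{R}. Everything here is illustrative: the frame labels and segment boundaries are hypothetical, and in the experiments they come from bigram-constrained HAMM decoding.

def centered_feature(frame_labels, b, e):
    """Eq. 3.11: the recognized class at the center frame of segment [b, e] (frames inclusive)."""
    return frame_labels[(b + e) // 2]

def soft_feature(frame_labels, b, e):
    """Eq. 3.12 (up to the edge convention): the mean recognized class over frames b..e."""
    span = frame_labels[b:e + 1]
    return sum(span) / len(span)

def target_relative(expected_label, recognized):
    """One component of R-hat: expected integer class label minus the recognition result."""
    return expected_label - recognized

# Hypothetical Lip Separation labels for a three-frame /w/ occurrence at frames 10-12:
labels = [0] * 10 + [1, 2, 1] + [0] * 5
b, e = 10, 12
print(centered_feature(labels, b, e))                              # 2
print(round(soft_feature(labels, b, e), 2))                        # 1.33
print(round(target_relative(1, soft_feature(labels, b, e)), 2))    # -0.33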
The centered method (its features denoted by A_cent) is an accepted one for evaluating articulatory recognition results, though it ignores any articulatory motion within a phone. For this reason, other studies that have followed [49] have concatenated features from adjacent frames in making an overall articulatory classification decision, on an assumption of continuity in a local context. The soft method (its features denoted by A_soft) is also appropriate because it compensates for the artifice of quantizing articulatory motion into discrete levels. Due to the bigram constraints of the recognition decoding procedure and the transcript interpolation training, motion of the recognition results within a phoneme could either signify corresponding motion of the articulators themselves, or simply an articulation “between” the rigidly assigned quantization levels. Even for the Voicing stream, which has only two classes (On or Off), a delay in voice onset time within a phoneme - especially a stop consonant - can signal a nonnative pronunciation, as investigated in [47]. Either way, for the sake of phoneme discrimination based on articulatory information, it may be important to incorporate this evidence of sub-phonemic motion, especially in the case of phones of interest which have expected articulatory motion within them, e.g. /t/. See Fig. 3.4 for a graphical depiction of this recognition-based feature extraction algorithm.

As a final note, one may be interested in the accuracy of these experiments in articulatory recognition. Such results cannot be obtained other than by artificially comparing the recognizer's output with these interpolated articulation transcripts, and it isn't really known whether this articulatory expansion reflects the true articulation of these speakers, especially since they are all nonnative. This is one limitation of the data and evaluation presented here, and in the absence of a true reference the expected mapping for the phone-level transcriptions is used. These results are reported in Table 3.4, simply as a preliminary test of model suitability. They appear to be consistent with similar baseline results such as the “segment error” reported in [34], though for the nonnative Velum and Voicing streams there was an excess of insertions. Compared to the native speaker results, the better relative performance of many of the nonnative speech streams can be attributed to the larger size of the nonnative training set.

                    native speaker    nonnative speakers
jaw                 52.83             65.71
lip separation      65.29             55.48
lip rounding        41.38             52.27
tongue frontness    51.54             57.24
tongue height       57.81             59.30
tongue tip          47.73             48.99
velum               76.33             44.48
voicing             81.04             69.10

Table 3.4: Pseudo-articulatory-recognition accuracy, in %.

3.6 Experiments and Results

3.6.1 Experiments with Nonnative Speech

For binary classification, the Italian and German test sets consisted of equal proportions of canonical and substitution pronunciations for every target segment of interest. All canonical realizations of all segments were regarded as members of one class, and all substitution errors were regarded as members of a second class, independent of the target segment to which each error belonged. The most consistently difficult English phonemes for the German and Italian speakers in the ISLE corpus are listed in [3] - these include phonemes difficult for students to remember because of the idiosyncrasies of English orthography, and phonemes difficult for nonnative speakers to produce because they do not exist in their native language.
Those used in the test sets are given in Tables 3.2 and 3.3, along with their substitution probabilities. Test items were taken from the ISLE corpus' Block E, F, and G recordings - read sentences typical of the kind language learners might need while traveling on a holiday or ordering food in a restaurant. Though all speakers were included in both training and test sets, this is permissible given that these methods will most likely be incorporated into a language-learning system that makes use of speaker-dependent features from registered students.

Figure 3.5: DET curves for all nonnative speakers, German and Italian, over the three HMM-based confidence measures. (Axes: false acceptance rate (%) vs. false rejection rate (%), with the EER line marked; curves shown for the substitution, phoneme loop, and generalized fillers.)

Though only 12 difficult target phonemes were examined, their systematic errors account for about 30-40% of all phone-level substitutions in the ISLE corpus. Moreover, they encompass different types of mistakes over various articulatory streams. The expected articulation for /v/ differs from its most common substitution, /f/, only in terms of Voicing. The phone /I/ and its most common substitution, /i/, differ by Jaw position, Lip Separation, and Tongue Height. The Italian speakers would often insert a velar stop, /g/, after pronunciation of the velar nasal, /N/, via sub-phonemic motion in just the Tongue Height and Velum. The test set can be regarded as representative of the corpus at large, over many different types of substitutions.

From among the three phone-level confidence scores described in Section 3.5.1, a baseline was selected by plotting their Detection Error Tradeoff (DET) curves illustrating rates of false rejection vs. false acceptance of the canonical pronunciation over various operating points. These curves were generated by varying the verification threshold T as explained in Section 3.1. Results for all nonnative speakers are shown in Figure 3.5. The curve with the lowest error on the EER line, P_subs, was taken as a nonnative baseline feature.

To combine these baseline scores with the new articulatory scores proposed in Section 3.5, the Weka Toolkit's implementation of the C4.5 algorithm was used [100]. This decision tree method struck us as the best of the available automatic techniques for several reasons. One, because it allowed for a combination of continuous and discrete features - the A_soft and A_cent articulatory recognition results alongside phonemic and articulatory confidence scores. In a study such as [56], all combinations of confidence scores were defined by optimizing a threshold on the sum of those scores, but this was not desirable in this case, since the A_cent and A_conf features were defined on entirely different scales. However, this did require reporting each feature combination result for just one operating point - the one determined by the trained decision tree - rather than over a range of points. The tree also facilitated the pruning of redundant features, and some features were expected to be unnecessary in certain cases. For example, all the expected substitutions for /2/ when uttered by German learners of English are also vowels (/@/, /6/, /I/, and /O:/), and so one would not expect much discrimination between canonical and erroneous pronunciations in terms of Voicing stream features.
Finally, for a verification problem such as this, a decision tree’s method of setting multiple sequential thresholds on many confidence-like features seemed like a logical extension of the traditional verification method of setting one threshold on just one confidence score, as introduced in Section 3.1. In the baseline case of just one feature - the phone-level confidence score - the tree would consist of one root node and two leaves (one each for the canonical and substi- tution classes), with threshold optimized by the same information gain maximization criterionusedtosetallthresholdsinalargertree. Thetreesweretrainedandevaluated 47 target @ I E i t n s d D substitutions 3: i @ I p m S b v E eI eI eI k N T g Z @U E 3: E f z Table 3.5: Target segments and their substitutions, for the native English speaker test set. using a ten-fold crossvalidation of the entire test set. Classification results in terms of false acceptance and false rejection of the canonical for various combinations of features areplottedinFigures3.6and3.7. Additionalresultswereobtainedoversubsetsdivided by target segment, with the same test procedure, and are reported in Tables 3.6 and 3.7. 3.6.2 Native Speaker Experiments The native speaker test set, taken from the remaining half of the MOCHA-TIMIT database not used in training (one speaker), was constructed somewhat differently. The strict definition of the canonical form in Section 3.1 does not apply to literate native speakers, who will frequently introduce non-lexical pronunciation variants into readspeech-these are notreally“errors” inthe same sense as those made bynonnative speakers. Thesubstitutioninstancesforeachtargethadtobeselectedartificially, based on segments which are perceptually confusable and produced in the test database, but which the speaker did not actually substitute erroneously when faced with the target prompt. Based on the most common phonemes in spoken English as reported in [24], four vowels and five consonants were selected as targets. Substitutions for the consonants were other consonants which had the same expected manner of articulation and voicing features. Vowel substitutions were chosen based on each target’s three closest vowels in 48 P subs + P subs + P subs + P subs + P subs + target P subs A conf A soft Acent A soft +A conf Acent +A conf z 62.1 / 12.1 61.4 / 15.4 40.3 / 33.6 47.3 / 26.2 38.3 / 36.9 43.6 / 30.9 @ 8.9 / 77.3 32.6 / 50.7 38.0 / 60.3 28.9 / 51.2 40.1 / 38.5 42.2 / 38.5 v 64.5 / 13.5 39.4 / 30.3 29.7 / 27.7 37.4 / 22.6 36.1 / 29.0 39.4 / 22.6 2 42.5 / 18.8 33.3 / 23.4 33.0 / 18.0 32.6 / 26.1 34.5 / 22.6 32.6 / 28.0 t 20.3 / 56.4 23.8 / 23.3 27.9 / 14.5 37.8 / 23.3 22.7 / 20.9 17.4 / 26.2 all 30.5 / 49.6 43.3 / 36.6 43.9 / 29.3 40.7 / 34.1 46.1 / 26.2 40.6 / 33.7 Table 3.6: Results for German speakers, reported as false rejection rate / false accep- tance rate (in %). Entries in bold were significantly better than the baseline (P subs ) with p≤0.05 using McNemar’s test. Figure 3.1. All targets and substitutions used in this native test set are given in Table 3.5. As in the nonnative test sets, equal proportions of canonical and substitution instances for each target were maintained. Figure 3.8 reports the DET curves for the potential native speaker baseline scores. The confidence score derived from the substitution filler (P subs ) was not available in the native case because the test set was not composed of true substitutions, and so no corpus statistics were used. 
Numerical and graphical classification results for the native speaker are presented in Table 3.8 and Fig. 3.9; the same decision tree feature combination method as in the nonnative experiments was used here. 3.7 Discussion In Figure 3.5, the score with the lowest EER over the set of nonnative speech is that of the substitution filler (P subs ) which were taken as the baseline feature, though it performs only slightly better than the generalized filler, indicating that knowledge of corpus statistics is maybe not necessary to calculate an acceptable baseline confidence measure. From the two nonnative plots of various feature combinations (Figures 3.6 and 3.7) one can discern that, in general, the new articulatory features alone (A conf , A soft , and A cent ) do not consistently perform with combined error rate lower than the 49 20 25 30 35 40 45 50 55 60 65 70 75 20 25 30 35 40 45 50 55 60 65 70 75 false rejection rate (%) false acceptance rate (%) Psubs Asoft Acent Aconf Psubs + Aconf Psubs + Asoft Psubs + Acent Psubs + Asoft + Aconf Psubs + Acent + Aconf EER Figure3.6: Germanspeakers’classificationresultsovervariouscombinationsoffeatures, for the complete test set. baseline (which is in keeping with the findings of [56, 76]), but when the phone-level and HAMM-derived features are combined, false acceptance is usually reduced and the combined overall error rate is lower. In foreign language practice, too many false rejections will frustrate and discourage the student, but too many false acceptances is probably less desirable, as that may undermine the learning process. Which error is morecostlyreallydependsonthestudentandthetask,butonefeaturecombinationwas judged better than another if the combined classification error rate was lower, i.e. if the operating point in Figs 3.6 and 3.7 was closer to the origin. The baseline classification result (P subs ) is skewed to the side of more false acceptance than rejection, but the addition of articulatory information serves to balance the error rates, and in these plots one can see a certain reduction in the false acceptance rate for those feature sets. The numerical nonnative classification results reported in Tables 3.6 and 3.7 show overall statistically significant improvements over the baseline with the addition of HAMM-derived features, but this is not consistent for every target segment subset. Here a significant reduction in the combined error rate is defined by a p-value of 0.05 50 P subs + P subs + P subs + P subs + P subs + target P subs A conf A soft Acent A soft +A conf Acent +A conf U 0.6 / 87.8 59.0 / 22.4 43.6 / 47.4 32.1 / 50.6 67.9 / 12.8 39.7 / 48.1 I 44.2 / 34.9 29.3 / 42.4 24.8 / 44.6 25.1 / 45.6 22.2 / 48.8 30.5 / 43.7 2 31.4 / 31.4 34.2 / 25.1 29.6 / 32.7 26.4 / 34.6 31.2 / 32.7 29.6 / 34.2 @ 32.1 / 54.9 32.6 / 51.2 33.5 / 50.2 32.6 / 51.4 31.4 / 55.2 37.4 / 43.9 t 20.6 / 53.9 20.3 / 29.5 30.5 / 34.9 30.0 / 41.2 23.3 / 25.2 19.7 / 32.3 N 31.9 / 36.1 21.8 / 50.4 37.0 / 32.8 32.8 / 35.3 32.8 / 42.9 32.8 / 46.2 3: 41.7 / 30.4 36.4 / 38.2 48.1 / 27.6 45.6 / 24.0 37.1 / 32.5 37.5 / 35.7 all 35.5 / 42.8 31.9 / 43.9 41.6 / 33.3 33.9 / 40.4 32.8 / 40.0 37.6 / 36.8 Table3.7: ResultsforItalianspeakers, reportedasfalserejectionrate/falseacceptance rate (in %). Entries in bold were significantly better than the baseline (P subs ) with p≤0.05 using McNemar’s test. 
30 35 40 45 50 55 30 35 40 45 50 55 false rejection rate (%) false acceptance rate (%) Psubs Asoft Acent Aconf Psubs + Aconf Psubs + Asoft Psubs + Acent Psubs + Asoft + Aconf Psubs + Acent + Aconf EER Figure 3.7: Italian speakers’ classification results over various combinations of features, for the complete test set. 51 or less using McNemar’s test (denoted in bold in the tables). Even in the case of an overall improvement, often a large reduction in false rejection is accompanied by a small increase in false acceptance, or vice versa. If some of the target-dependent sub- sets allowed for less (or no) improvement over the baseline, this is probably because the baseline score generalizes well to discriminate between the canonical form and its substitutions in such a limited test set. Over the complete test set, just one baseline threshold may not be optimal for all target words, and so the addition of articulatory features offers significant reductions in classification error. The best feature set overall, for both Italian and German speakers, was the P subs +A soft +A conf combination, which offered a 3-4% absolute reduction in combined classification error rate over the baseline for the complete test set, and as high as a 16- 17% absolute improvement for individual segments (e.g. /t/ for the German speakers). However,thiscombinationdidnotperformsignificantlybetterthanP subs +A cent +A conf , soonecannotconcludethatthesoft methodofextractingarticulatoryrecognition-based features was more appropriate than the centered method. More than the Germans, the Italian speakers had several individual segments with error rates not reduced by the addition of articulatory information. This difference in results for the two speaker types can be attributed to the fact that the Italian speakers were generally less proficient in English than the German speakers [3]. One can infer that the Italians’ mistakes in pronunciation were more dramatic and obvious than those of the Germans who, with higher proficiency and a native language more closely related toEnglish, probablyintroducedmoresubtlesubstitutions. Classificationoftheseminor differences in articulation stands to benefit the most from articulatory features which bear pronunciation details not captured by phone-level HMMs. If the nonnative results seem modest in light of the fact that these were all binary classifications tested on equal proportions of each class, consider a couple of things. 52 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 false rejection rate (%) false acceptance rate (%) phoneme loop filler generalized filler EER Figure 3.8: DET curves for the Native British English speaker, over both HMM-based confidence measures. PL+ PL+ PL+ PL+ PL+ target PL A conf A soft Acent A soft +A conf Acent +A conf all 26.0 / 18.7 23.1 / 18.6 22.3 / 16.5 22.3 / 15.5 19.3 / 19.9 19.9 / 16.8 Table 3.8: Native British English speaker results, reported as false rejection rate / false acceptance rate (in %). Entries in bold were significantly better than the baseline (PL) with p≤0.05 using McNemar’s test. First, in [76] the overall absolute improvement in basic word recognition offered by combining traditional and articulatory models similar to these was on the order of 1-2%. Second, that [3] reported an inter-annotator agreement of “at best” 70% when simplydecidingthephone-levellocationofapronunciationerror(butnotdecidingwhat the substitution is), which is in essence what my automatic classifier did. 
This indicates that often it is not a trivial decision to attribute an abstract substitution to a single phone location, and that perhaps pronunciation error annotation on the sub-phone or suprasegmental level would achieve higher inter-labeler agreement. 53 10 15 20 25 30 35 40 10 15 20 25 30 35 40 false rejection rate (%) false acceptance rate (%) PL Asoft Acent Aconf PL + Aconf PL + Asoft PL + Acent PL + Asoft + Aconf PL + Acent + Aconf EER Figure 3.9: Native British English speaker classification results over various combina- tions of features, for the complete test set. SimilarresultswerefoundforthesupplementarynativeEnglishspeakerexperiments (Figure 3.9 and Table 3.8), though with slightly lower error rates overall, probably be- cause the “substitution” instances chosen artificially were easier to discriminate from their so-called targets because they were not true pronunciation mistakes but phonemes known to be perceptually confusable. For example, while the artificial errors for conso- nants shared the target’s voicing and manner, real errors for consonants were in many casespronouncedbothwiththesamemannerandplaceofarticulationasthetarget(see Tables 3.2 and 3.3), therefore making real errors more difficult to classify automatically. In the native speaker case, the score derived from the phoneme loop filler (PL) was an obviouschoiceforabaselinefeature,fromFigure3.8. Justasforthenonnativespeakers, the addition of articulatory features offered significant improvements over the baseline error rates, both for false acceptance and rejection. This suggests the usefulness of the proposed features for utterance verification outside the domain of second-language pedagogy. 54 3.8 Conclusion I have demonstrated a useful and unique method of assigning an articulatory repre- sentation to a phone-level transcription. Hidden Articulator Markov Models trained on this representation were used to generate novel articulatory confidence measures and recognition-based feature vectors. These vectors successfully reduced error rates in segment-level pronunciation classification when combined with traditional segmental confidence scores, for two types of nonnative speakers and over a native speaker test set aswell. Itisremarkabletoconsiderthisimprovementinlightofthefactthatthearticu- latory models had the same topology as the phoneme models, were trained on the same spectral features as the phoneme models, needed no direct articulatory measurement or transcription, and required no prior knowledge of the corpus substitution statistics. This augmented representation helped to model subtler differences in pronunciation by factoring each segment into its relevant discriminative components. 3.9 Future Directions Thismethodoflocatingasegment-levelpronunciationmistakerepresentsoneimportant step in the complex pronunciation evaluation task. Pronunciation results generated by this algorithm should in the future be used in combination with other - perhaps suprasegmental - scores to derive an overall pronunciation score for a given speaker or utterance. The feature sets could benefit from prosodic information when, for example, distinguishing between full and reduced vowels, or stressed and unstressed syllables. Theproposedtranscriptionexpansionmethodisbynomeanstheonlymappingthat improves close phonemic discrimination over traditional models - there is still room for optimization, depending on the task. 
Though I incorporated a large amount of prior information about expected English substitutions based on the speaker set, future researchers in this area might also include phonological constraints from the speakers' native language - and not just the language they are learning - when expanding and interpolating the articulatory transcriptions. For example, what allophonic variants are native German and Italian speakers likely to produce when reading English? In addition to segment-level substitutions, what articulation should one expect based on Italian or German phonology? These are relevant open questions.

Future work in this area may also choose to rely on direct articulatory measurement for model training purposes. The models would better reflect the true articulation, rather than just the expected form. Such models could potentially have a finer-grain resolution than those set forth in Table 3.1. Both these improvements may serve to make the discrimination between canonical and erroneous segments more accurate. No existing collection of articulatory data has the same degree of nonnative speaker variability as seen in the ISLE corpus, nor the same focus on second language acquisition and pedagogy, and until such a corpus is created, the use of direct articulatory measurement in this domain will be limited.

Finally, though I have shown that an assumption of dependence among the streams in this representation is not necessary to effect an improvement in verification accuracy, it may be worthwhile to try modeling such a dependency in, for example, a Dynamic Bayesian Network framework. As explained in Section 3.3, this approach would probably make articulatory recognition more accurate, but may also decrease the variation in recognition results - variation that may help signify the presence of a pronunciation mistake. Such a network could be implemented without any changes to the centered or soft methods of calculating features from the articulatory recognition results.

3.10 Epilogue: Articulatory Evidence of Phonological Transfer

3.10.1 Introduction

It is widely known that learners of a foreign language will produce speech variants that reflect the phonology of their native language [59]. The systematic substitution, deletion, or insertion of phonemes when speaking a second language (L2), as predicted by the phonological rules of the speaker's first language (L1), is called "phonological transfer". Though not all errors in foreign speech can be attributed to transfer from the speaker's L1, phonological speech errors are exemplified by (but not limited to) transfer in the form of the substitution of a close L1 phoneme for a target L2 phoneme nonexistent in the speaker's L1. Such a substitution may be "close" to the target either in terms of perception (acoustic similarity) or production (articulatory similarity), or both, depending on the speaker's L1 phoneme set.

In the preceding parts of this chapter, it was shown that an articulatory representation of speech could provide additional discriminatory information to a phonetic representation when automatically detecting these kinds of close segment-level errors produced by English learners. Though that work used no true articulatory data, the true articulation could be inferred from the known phoneme sequence, and acoustic models trained to represent these inferred articulations were not redundant, even when used alongside phoneme models trained on the same acoustics.
My interpretation was that representing these close errors in terms of articulation had more explanatory power than simply conceiving of them as full-on substitutions - the difference between the target and its substitution existed perhaps on only a subset of the articulatory organs that produced the acoustics. For example, a German speaker's common error of substituting /d/ for /D/ in English demonstrates articulatory contrast only in terms of the tongue tip - the rest of the vocal tract should be identical in both cases, ceteris paribus. This is a partial explanation for the improvement seen when using articulatory models.

However, several details about nonnative articulation remained obscure after those experiments. What was the nature of these substitutions? Did they show the influence of transfer, or were these articulations seemingly unrelated to the speaker's L1? Did their true articulations follow the phoneme-derived expectations, or were they somewhere in between the L2 target and the L1 substitution? The answers to these questions will help to better model articulation in foreign-accented speech errors, and to determine what pseudo-articulatory models can really capture. Real-time Magnetic Resonance Imaging (MRI) of the vocal tract [69] has recently proven to be a useful tool for analyzing variation in fricative production [11], emotional speech [53], and articulatory coordination of nasals [16]; I intend to use real-time MRI to shed light on nonnative production.

The main question I want to address in this epilogue is: does nonnative speech show evidence of phonological transfer from the speaker's L1? If so, in what ways, and along what articulatory dimensions? More specifically, the questions asked of the real-time MRI data are the following:

1. When prompted to produce a target phoneme outside their L1 set, do nonnative speakers employ an articulation indistinguishable from the one they use when prompted to produce the cohort "close" phoneme in their L1 set?

2. Do nonnative speakers demonstrate more variability in their articulation of out-of-L1 targets than for their closest in-L1 counterparts?

If the answer to Question 1 is yes, then that means that articulatory models would be of no help in discriminating close phoneme-level errors, and that the pseudo-articulatory models that worked so well in [76, 90] probably did not represent the true articulation. However, nonnative articulation of out-of-set phonemes is expected to be similar to close in-set targets at least for some vocal tract organs; otherwise a difference in articulation is not the cause of a listener's perception of a foreign accent. One might expect the answer to Question 2 to be yes, since an increase in articulatory variability for an out-of-set phoneme would reflect a speaker's unfamiliarity with producing it. Either way, the presence of a diversity of variability in production, if seen, will have to be incorporated into articulatory models of speech. This epilogue will compare articulatory contrasts between close phonemes for both native English and German speakers in the hopes of illuminating some of these issues.

3.10.2 Corpus

3.10.2.1 Speakers and Stimuli

This work used MRI and synchronized audio data from 3 native speakers of German and, for comparison, 3 native speakers of American English. The Germans were advanced learners of English who had been living in Los Angeles for more than 6 months. All but one of them (native English speaker H5) were male. All subjects were asked to read two standard English texts for phonetic elicitation: The Rainbow Passage, and The North Wind and the Sun.
Additionally, the German speakers read a phonetically-balanced German translation of each passage. All readings were repeated once, for a total of 31.3 minutes of speech for the English natives and 43.1 minutes for the German natives.

I chose to look only at two well-documented phonological errors made by German speakers of English that are usually explained as L1 transfer: the substitution of /v/ for /w/, and the substitution of /d/ for /D/. They represent contrasts of different articulatory organs - /v/ and /w/ are expected to differ in the tongue body, lips, and velum, while /d/ and /D/ should differ only in the tongue tip - and so would yield complementary results. Contexts in which /d/ becomes dental (i.e. before dental consonants) were not included. For each German speaker, all tokens were divided into three categories: phonemes outside their L1 (OUT), phonemes present in both languages and elicited in their L2 (IN-L2), and phonemes present in both that are elicited in their L1 (IN-L1). Though the English speakers had all phonemes in their L1, for simplicity they were given the same category names. In the stimuli, including repetitions, per speaker there were roughly 35 tokens of /w/, 45 of German /v/, and 30 of English /v/; the stimuli also had roughly 65 tokens each of /D/ and English /d/, and about 115 tokens of German /d/.

3.10.2.2 Imaging and Image Tracking

"Real-time" MRI refers to the generation of MR scans at a sufficiently high frame rate to capture dynamic vocal tract shaping in the midsagittal view of the upper airway during natural speech production [69]. The computational demands of MRI and the scientific demands of speech analysis necessitate a tradeoff in favor of finer temporal resolution at the expense of image resolution - this compromise results in the generation of a series of 68 x 68 pixel images reconstructed at a rate of 22.4 frames per second. All MR data used in this work was acquired on a GE Signa 1.5-T scanner using fast gradient echo pulse sequences and a 13-interleaf spiral acquisition technique. With a custom-made multichannel upper airway receiver coil, the images were reconstructed using standard sliding window gridding and inverse Fourier transform techniques.

The contours of the vocal tract articulators were automatically segmented in 2-D space based on an anatomical object model's fit to the data in the spatial frequency domain [10]; the fit of the model to the image was optimized using a hierarchical and anatomically-informed version of gradient descent. Based on this segmentation, I derived polygonal masks enclosed by these contours and representing the regions occupied by each of 5 articulators: the upper lip (UL), lower lip (LL), velum (VE), tongue body (TB), and tongue tip (TT). The boundaries of all but the tongue tip were defined by the segmentation algorithm's object model, and the tongue tip was defined as the polygon enclosed by the leftmost four points of the tongue body contour (out of 11 points used in the object model of the tongue body).

3.10.2.3 Audio Processing and Alignment

Simultaneous audio recordings during each MRI scan were collected using a fiber-optical microphone, at a sampling rate of 20 kHz. This audio was made usable by way of a noise-canceling filter based on a model of the MRI scanner's gradient noise and pulse sequence [12]. In the absence of phoneme-level transcripts, I used forced alignment of the expected sequence of phonemes to generate segmentation times.
Phoneme-level acoustic models were Hidden Markov Models trained on 39-dimensional MFCC feature vectors, with 3 hidden states and 8 Gaussian mixtures per state. The window length was a standard 25 msec and the frame rate was shorter than usual (5 msec) so as not to miss any rapid changes in articulation. Training and alignment of these models were done using an iterative bootstrap procedure like that described in [26].

These automatic segmentations were potentially inaccurate if the speaker paused at an unexpected place while reading the stimuli. In those cases, the alignment would include the pause as part of an abnormally long segmentation for the preceding phoneme. To eliminate these alignment errors, all outlier phonemes with a duration more than 2 standard deviations from the mean for that target phoneme were removed. This eliminated no more than about 10% of the tokens for each speaker.

3.10.3 Experimental Methods

3.10.3.1 Overview

Because of a wide variety of vocal tract shapes and no standard way to warp one onto another, these experiments were restricted to intra-speaker comparisons between pairs of phonemic MRI tokens.

Figure 3.10: Illustration of finding the pixel-by-pixel difference between tokens of /d/ and /D/ (X and Y, respectively), including masked versions of the five organs of interest. [Panels show the means of frames c-1, c, c+1 for tokens X and Y, their absolute difference |X-Y|, and the masks for the upper lip, lower lip, velum, tongue body, and tongue tip.]

To obtain a representative MRI image of each phoneme's articulation, the token of one phoneme was defined as the mean of the center frame, c, of that phoneme (determined from the automatic segmentation times) and one frame on either side of it, frames c+1 and c-1. When the middle of the segmentation boundaries fell between two MRI frames, it was rounded down to the nearest frame, since the characteristic articulatory closures typically occurred more toward the side of the start boundary. Most phonemes examined here (/v/, /w/, /d/, and /D/) were automatically segmented as being between 2 and 4 MRI frames in length.

Once mean tokens of two productions are obtained, they could be compared by summing the absolute values of their pixel-by-pixel differences. Difference values for each of the organs of interest (UL, LL, VE, TB, and TT) were localized using the polygonal masks derived from the image tracking, as explained in Section 3.10.2.2. A mask for one token was defined as the union of the three masks from frames c-1, c, and c+1. Similarly, to capture all regions of potential difference, the mask for the difference of two tokens was defined as the union of both tokens' masks. Since the sizes of the masks might vary from one token to another and from one frame to another, the sum of all pixel-by-pixel differences for each masked image was normalized by dividing by the number of pixels in the mask. Figure 3.10 depicts the generation of two tokens, the calculation of their difference frame, and the masked versions of that difference frame for each organ.

3.10.3.2 Place of Articulation

The Place of Articulation experiments were designed to answer Question 1 posed in Section 3.10.1: do nonnative speakers produce their OUT tokens with place of articulation identical to their IN-L2 tokens, and, in the case of the German speakers, to their IN-L1 tokens? For all speakers separately, I calculated pixel-by-pixel differences between every IN token and every other IN token, and between every OUT token and every IN token.
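To make this comparison concrete, the following is a minimal NumPy sketch of the masked, normalized pixel-by-pixel difference between two mean tokens. It assumes the mean frames and per-organ boolean masks have already been extracted; all array and variable names are illustrative, not taken from the actual analysis code.

    import numpy as np

    def mean_token(frames, c):
        # Average the center frame c with its immediate neighbors (frames c-1, c, c+1).
        return frames[c - 1:c + 2].mean(axis=0)

    def masked_difference(token_x, token_y, mask_x, mask_y):
        # token_x, token_y: 2-D mean MRI frames (e.g. 68 x 68 arrays).
        # mask_x, mask_y: boolean arrays marking one organ's region for each token
        # (each already the union of that token's masks over frames c-1, c, c+1).
        union = mask_x | mask_y                       # all regions of potential difference
        diff = np.abs(token_x - token_y)[union].sum()
        return diff / union.sum()                     # normalize by the mask size in pixels

    # Hypothetical usage: frames_x, frames_y are (T, 68, 68) image sequences, and
    # organ_masks_x['TT'], organ_masks_y['TT'] are the two tokens' tongue-tip masks.
    # x, y = mean_token(frames_x, c_x), mean_token(frames_y, c_y)
    # d_tt = masked_difference(x, y, organ_masks_x['TT'], organ_masks_y['TT'])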
If the mean of the differences between all IN-L2 tokens and OUT tokens is equal to the mean of the differences within all pairs of IN-L2 tokens, then that suggests that the OUT tokens are articulated just like the IN-L2 tokens, and the same goes for the IN-L1 tokens. This difference in means was assessed statistically using a two-sample t-test. Over all speakers and masks, and for both phoneme contrasts, Tables 3.9 and 3.10 show the mean differences that were statistically equal or unequal on the 95% confidence level.

3.10.3.3 Articulatory Variability

The Articulatory Variability experiments were intended to answer Question 2 from Section 3.10.1: do nonnative speakers produce their OUT tokens with more variability than their IN-L2 and IN-L1 tokens? Pixel-by-pixel differences between every pair of IN-L1 tokens were measured, and similarly for every pair of IN-L2 tokens and OUT tokens. The means of each set's pairwise differences were then compared using a one-tailed t-test. Table 3.11 indicates for which organs and speakers the pairwise mean of differences for the OUT tokens was statistically greater than, less than, or equal to the IN-L2 mean pairwise difference, on the 95% confidence level. Table 3.12 displays the same thing, but for OUT vs. IN-L1 sets.

                      English speakers    German speakers
                      J2    G1    H5      JS    DH    CW
  /d/ : /D/
    overall           ≠     ≠     ≠       ≠     ≠     ≠
    upper lip         ≠     ≠     ≠       ≠     ≠     ≠
    lower lip         ≠     ≠     ≠       ≠     =     =
    velum             ≠     ≠     ≠       ≠     =     ≠
    tongue body       ≠     ≠     ≠       ≠     ≠     ≠
    tongue tip        ≠     ≠     ≠       ≠     ≠     ≠
  /v/ : /w/
    overall           ≠     ≠     ≠       ≠     ≠     ≠
    upper lip         ≠     ≠     ≠       ≠     ≠     ≠
    lower lip         ≠     ≠     ≠       ≠     ≠     ≠
    velum             ≠     =     ≠       =     ≠     =
    tongue body       ≠     ≠     ≠       ≠     ≠     =
    tongue tip        ≠     ≠     ≠       ≠     =     =

Table 3.9: Results of two-tailed t-test comparing the mean difference between any two IN-L2 tokens and the mean difference between any IN-L2 token and any OUT token. Sample means were determined to be equal or unequal (denoted by = and ≠) on the 95% level.

                      /d/ : /D/           /v/ : /w/
                      JS    DH    CW      JS    DH    CW
    overall           ≠     =     ≠       ≠     ≠     ≠
    upper lip         ≠     =     =       ≠     ≠     ≠
    lower lip         ≠     =     ≠       ≠     ≠     ≠
    velum             =     ≠     ≠       ≠     ≠     ≠
    tongue body       ≠     ≠     ≠       ≠     ≠     ≠
    tongue tip        ≠     ≠     ≠       ≠     ≠     ≠

Table 3.10: Results of two-tailed t-test comparing the mean difference between any two IN-L1 tokens and the mean difference between any IN-L1 token and any OUT token. Sample means were determined to be equal or unequal (denoted by = and ≠) on the 95% level.

3.10.4 Results and Discussion

Table 3.9 demonstrates that in general, for native English speakers, the mean difference between any OUT token and any IN-L2 token was not equal to the mean difference between any two IN-L2 tokens, with 95% confidence. This indicates that native speakers use statistically different places of articulation to distinguish /D/ from /d/, and /w/ from /v/, as one would expect. This finding is encouraging for articulatory modeling because it suggests that these two pairs of close phonemes are potentially separable in articulatory space using statistical models. The results for the German speakers are quite different - there is more of a tendency for the mean difference between OUT and IN-L2 tokens to be equal to the mean difference between any pair of IN-L2 tokens, indicating that in some sense the OUT phonemes are being articulated like a typical IN-L2 phoneme - this implies phonological transfer from the L2 articulation of the closest L1 phoneme. However, as with the native English speech, the places of articulation did differ on enough dimensions for them to be potentially separable with some set of articulatory models.
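The equal/unequal entries in Tables 3.9 and 3.10 come down to a two-sample t-test on two sets of pairwise differences. Below is a minimal sketch of that decision, assuming SciPy is available and that the difference lists have already been computed (for example with the masked_difference sketch above); all names are illustrative.

    from itertools import combinations, product
    from scipy import stats

    def place_comparison(within_in_diffs, in_vs_out_diffs, alpha=0.05):
        # Two-sample t-test comparing the mean of within-IN pairwise differences
        # against the mean of IN-to-OUT pairwise differences for one organ mask.
        t_stat, p_value = stats.ttest_ind(within_in_diffs, in_vs_out_diffs)
        return '=' if p_value > alpha else '!='   # as tabulated in Tables 3.9 and 3.10

    # Hypothetical usage, where each token is a (mean_frame, organ_mask) pair:
    # within = [masked_difference(x, y, mx, my)
    #           for (x, mx), (y, my) in combinations(in_l2_tokens, 2)]
    # across = [masked_difference(x, y, mx, my)
    #           for (x, mx), (y, my) in product(in_l2_tokens, out_tokens)]
    # print(place_comparison(within, across))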
These equal means were not seen across the board but specifically for the velum and lower lip in the /d/ : /D/ contrast, and for the velum and tongue in the /v/ : /w/ contrast. The velum and tongue should contrast between tokens of /v/ and /w/, but only the tongue tip of /D/ should show the transfer from /d/. It is possible that the image segmentation algorithm could not always distinguish the lower lip from the tongue tip for dental consonants, where they would be likely to come into contact (see token Y in Figure 3.10 for an example).

Similarly, in Table 3.10 the mean difference between any L2 /D/ and any L1 /d/ is found in some cases to be equal to the mean difference between any two L1 /d/ tokens, specifically for the lips and velum. This suggests phonological transfer from the closest phoneme not only as produced in the speaker's L2 (according to Table 3.9) but in their L1 articulation as well. The same was not true for L2 /w/ and L1 /v/ - these were observed to be separable in articulatory space in all dimensions. The implication is that L1 /v/ is articulated significantly differently from L2 /v/, which is closer to /w/ than the L1 version. This may be evidence for L1 attrition for certain consonants but not others, or it may indicate a non-uniform transfer within a phoneme set due to reasons such as markedness [59] - a larger study would be necessary to know for sure.

There was a tendency for differences between OUT pairs and differences between IN-L2 pairs to have unequal means with 95% confidence, as Table 3.11 shows, but this was true for native speakers as well as nonnative ones. Furthermore, the set with the greater mean difference sometimes flipped depending on the articulatory organ, though overall OUT seemed to have more pairwise variability than IN-L2. The English speakers generally agreed with one another about as much as they agreed with the German speakers, or as much as the Germans agreed among themselves. The main conclusion to gather from Table 3.11 is that this difference in variability of articulation between close phonemes is present in both native and nonnative speech, but the direction of the difference may be organ- or even speaker-dependent. Whatever the cause of this variation, articulatory models of speech will have to account for it.

                      English speakers    German speakers
                      J2    G1    H5      JS    DH    CW
  /d/ : /D/
    overall           >     >     =       =     >     >
    upper lip         >     >     <       =     >     >
    lower lip         <     <     <       <     <     >
    velum             =     >     >       >     >     >
    tongue body       <     =     <       <     <     >
    tongue tip        <     <     <       <     <     >
  /v/ : /w/
    overall           >     >     >       <     >     >
    upper lip         >     >     >       <     >     >
    lower lip         >     >     >       =     =     >
    velum             >     <     =       >     >     <
    tongue body       =     =     >       <     <     <
    tongue tip        <     <     >       =     <     <

Table 3.11: Results of one-tailed t-test of the alternative hypothesis that the mean difference between any two OUT tokens is greater than or less than the mean difference between any two IN-L2 tokens. Inequalities are given on the 95% level.

                      /d/ : /D/           /v/ : /w/
                      JS    DH    CW      JS    DH    CW
    overall           <     <     >       <     <     >
    upper lip         <     <     <       <     <     >
    lower lip         <     <     >       <     <     >
    velum             <     <     <       <     <     >
    tongue body       <     <     >       <     >     <
    tongue tip        <     <     >       >     <     <

Table 3.12: Results of one-tailed t-test of the alternative hypothesis that the mean difference between any two OUT tokens is greater than or less than the mean difference between any two IN-L1 tokens. Inequalities are given on the 95% level.

Table 3.12 tells a somewhat different story.
For two of the German speakers (JS and DH) the OUT set overall showed less mean pairwise difference than the IN-L1 set, indicating that the L1 articulations were actually more variable than those of the OUT tokens - the opposite of the L2 versions. One interpretation is that speakers show more versatility of articulation - and therefore more appropriate variability according to context - in their L1 than in an L2, and out-of-L1 articulations are somewhat static in comparison. As with the results in Tables 3.9 and 3.10, this shows that the native German speakers articulate /v/ and /d/ differently depending on whether they are using them in German or English. The third German speaker, CW, showed the opposite effect from the other two in most cases, but more data is needed to determine if CW is an outlier or within the expected range of German speakers. At any rate, speaker-dependent variability is observed here.

3.10.5 Conclusion

In summary, articulatory evidence of German phonological transfer can be found in German-accented English speech where it isn't seen in native English speech. Native speakers tend to produce their close pairs of contrasting phonemes with more contrast than nonnative speakers do, but both populations produce close phonemes through contrasts in articulation on some level. This validates past studies in pseudo-articulatory modeling of speech [76, 90] that have shown improved discrimination between close phonemes through an articulatory representation, even without any real articulatory data, though it does imply that any mapping from phonemes to expected articulations in nonnative speech does need to account for the possibility of L1 transfer. Pairwise variability between phonemes can be rather phoneme-, speaker-, and organ-dependent, but it is seen that, in nonnative English, phonemes not found in German are in many cases produced with more variability than their closest substitutions that are in German, and articulatory models will have to capture this variability. Much work remains in analysis of intra-phoneme articulatory dynamics, and in similar studies with speakers who are native to other languages.

Chapter 4: Assessing Word-level Pronunciation and Reading Skills

4.1 Introduction

How does a teacher judge reading skills from hearing a child read words out loud? Each student's pronunciation is, of course, relevant evidence of reading ability. For individual words read in isolation, a new reader's skills are best demonstrated through various pronunciation-related cues, including their correct application of English letter-to-sound (or LTS) rules [97] - rules that describe the complex mapping from orthography to phonemes in English. Certain types of letter-to-sound decoding mistakes clearly testify to incorrect reading, such as the common tendency in young readers of English to make vowels "say" their own names [1]. Hesitancy and disfluency in decoding sounds from text are also indicative of underlying reading difficulties, and are manifested through suprasegmental pronunciation cues when reading aloud. What about the case of a child whose native language (or L1) is not English, or who speaks with a foreign accent? How would the teacher know if a particular word's variant in pronunciation is due to the child's inability to apply English LTS rules, or is simply typical of the child's pronunciation trends in general, when reading or speaking?
A conscientious teacher, in an effort to remain unbiased in their assessments, would know what variants in pronunciation to expect of their student's accented speech (hopefully distinct from those caused by true reading errors), and would apply different assessment criteria based on what they know about the child's background and their own past experience teaching similar children.

Ultimately, in assessing reading skills on the word level, a teacher makes an inference as to their student's hidden cognitive state - the state of identifying a string of characters as the intended word, or not. This inference is based on the available evidence spoken by the child as well as what they know about the child's demographics, the target words in the test, and about accented speech and children in general. This is not, strictly speaking, the same as assessing the child's pronunciation (i.e. comparing their pronunciation with some predefined reference), since in some sense every speaker implicitly determines their own "correct" reference pronunciation when reading or speaking. A child with a foreign accent can, of course, be capable of reading English correctly and fluently. It is then the teacher's task to decide if a read pronunciation is consistent with the child's own personal reference - the child's phonological trends when speaking overall - or is the result of mis-applied LTS decoding.

The main benefit of using automatic reading assessment in the classroom is that it can free up teachers' time and energy to do what they do best: teach. A system that standardizes the regular assessments teachers would otherwise be conducting by hand can not only save them time, but can eliminate any potential teacher biases, provide a fine-grained pronunciation analysis, track long-term trends over a large number of students, and offer diagnoses of different types of reading difficulties, allowing teachers to focus on child-specific instruction and additional interventions. One goal of this chapter is to demonstrate part of such a system, a new development toward automatically mimicking teacher judgments in list-based word reading tasks, a very common test format for evaluating young readers.

In theory, isolated word reading skills should be simple to model and replicate automatically. From a standard pronunciation dictionary there can be some notion of acceptable variants of a target word when read by native speakers of English. We can also assume a closed set of variants resulting from common LTS mistakes, and another set based on foreign-accented speech - these could be either determined empirically or derived from rules well-documented by experts in child pedagogy. With acoustic models, it can be chosen which one of the available variants best matches an unknown pronunciation. If the child is a native speaker, then only from recognizing a native variant should one infer acceptable reading. If the child is nonnative, then both the native and foreign-accented variants should indicate acceptable reading skills. In either case, automatically detecting any of the variants commonly coming from LTS mistakes will suggest that the child does not know how to read the target word.

For automatically assessing reading skills, this is an unrealistic and ineffectual model for many reasons. First off, it presumes that there will be no overlap among pronunciation categories, so that the source of the pronunciation will be obvious, and for many words this is not the case.
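For concreteness, here is that closed-set accept/reject rule in a minimal Python sketch; the category labels and the assumption that some upstream decoder has already picked the best-matching variant set are made up for illustration, and the problems discussed above and below apply to exactly this kind of rule.

    def naive_reading_decision(recognized_category, child_is_native):
        # recognized_category: which predefined variant set best matched the child's
        # pronunciation, e.g. 'NA' (native variant), 'SP' (foreign-accent variant),
        # or 'RD' (letter-to-sound reading-error variant), from some acoustic decoder.
        if recognized_category == 'RD':
            return 'Unacceptable'        # a reading-error variant matched best
        if recognized_category == 'NA':
            return 'Acceptable'          # native variants count as acceptable for everyone
        # 'SP': accent variants indicate acceptable reading only for a nonnative child
        return 'Unacceptable' if child_is_native else 'Acceptable'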
For example, with the standard ARPAbet English phoneme set, the word "mop" may be pronounced as /M OW P/ both by children of any background who make the mistake of decoding a "long" vowel for the 'o' as well as by students with a Mexican Spanish accent - for children who speak with this accent, the source of this variant remains obscure.[1] Even for the target words for which there is no overlap in pronunciation categories, there are further problems with this method. The pronunciation variants of a given target are often so close as to remain difficult to reliably distinguish with state-of-the-art acoustic models [90] (or with human ears, as listening tests show [4]). Aside from this, segment-level pronunciation is only one observable cue to underlying reading ability - the accept/reject algorithm described above does not account for other evidence such as speaking rate or suprasegmental manifestations of fluency or hesitation. Knowledge of the child's background will change how this evidence is interpreted, and teachers are not necessarily conscious of how all these variables interact to inform their inference of a student's cognitive state. Clearly there is a very complex set of implicit decisions at work here. This is all just by way of outlining some of the many difficulties in creating an automatic method to judge reading skills.

[1] Of course, the ARPAbet vowel /OW/ (the IPA diphthong /oʊ/) does not exist in Spanish, but when working exclusively with read English data, it is the closest vowel to the Spanish monophthong /o/.

In light of the complexity of the task and its cognitive modeling goals, I propose using a Bayesian Network to model the generative interactions among these many disparate cues in a framework that would represent how I hypothesize teachers conceive of a child's reading skills. Bayesian inference on a hidden cognitive state variable would then reflect the degree to which all the evidence and background knowledge of the child combine to color a teacher's judgment. The Bayesian framework is attractive for its flexibility in creating a hypothesized causal structure among variables, and has been used in past studies for student modeling [8, 21, 74]. The novel aspects of this work lie in the careful distinction between pronunciation and reading skills, the application to nonnative-speaking readers, and the use of pronunciation variant categories and student background in a unified generative student model. Here I intend to answer the following relevant questions about the chosen model and the perceptual data used to train it:

• How subjective are human assessments of reading ability?

• Does a subjective cognitive model perform better than a categorical decision based strictly on pronunciation?

• Which cues are most useful in making an automatic assessment, and how does inclusion of child demographic information affect the model?

• Can the model's generative structure be optimized for the training data, and does this improve automatic score performance?

• What biases, if any, does this automatic scoring method present, and how do these compare with biases in listeners' assessments?

Section 4.2 will give some background on the data and the modeling framework. A perceptual study on a subset of this corpus is described in Section 4.3. Section 4.4 explains how the feature set is estimated and Section 4.5 suggests how those features could be unified in a network structure.
Results of experiments on various network structures and non-Bayesian baselines will be reported in Section 4.6, and Section 4.7 will interpret these results in light of the perceptual evaluations and the above questions about the model. Section 4.8 concludes with some ideas for future improvements and other potential applications for a student model such as this.

4.2 Background

4.2.1 The TBALL Project and its Context

This work was done as part of the TBALL (Technology-Based Assessment of Language and Literacy) project [2], a UCLA-USC-UC Berkeley collaboration in response to a growing demand for diagnostic reading assessments in US schools [72]. The project's goal is to develop components for a system that would administer tests and collect data in a classroom environment, automatically provide assessment scores, organize reports of the results for teachers, and recommend further assessments and interventions. TBALL concentrates on children in Kindergarten through Grade 2, younger than those of most related studies. Due to the demographic makeup of Los Angeles, one emphasis of the project has become the development of such a system specifically with speakers of Mexican Spanish or Chicano English dialects in mind [27]; dealing with dialectal variability is a challenge for robust automated processing of speech.

The most well-known studies in the area of automatic reading assessment [6, 35, 64, 99] typically use automatic scoring as one component of a computer reading tutor that provides feedback to children in real-time for pedagogical purposes. The focus of these past projects has (with a few exceptions) been on sentence- and paragraph-level reading and comprehension, in which a machine tutor will follow a child word-for-word and indicate if any reading errors are detected as the passage progresses. Reading errors in all of these studies are defined strictly in terms of segment-level pronunciation mistakes - the causal link from reading skills to pronunciation evidence is made to be quite direct, probably because these studies do not account for nonnative or accented children's speech, nor for the subjectivity in judgment that must occur in those cases. Hence, improvements in these methods have come from creative use of sentence decoding grammars and pronunciation variants (usually in terms of read word fragments), as well as expert knowledge of reading mistakes and appropriate acoustic model training and adaptation. This assessment of reading skills strictly in terms of speech decoding on a closed set of predetermined "correct" pronunciations will serve as a baseline method in this chapter.

TBALL differs from these other studies in several ways. Its goals are focused on assessment (rather than tutoring) and feedback to the student is omitted in favor of fine-grained results reported to the teacher. The TBALL assessment battery does include passage-level reading, but also many simpler tests that are common in reading assessment and are based on lists of items designed to measure, for example, a child's ability to identify isolated words on sight, to blend syllables or phonemes together, or to recite the names and sounds of English letters. The work in this chapter is designed for automatic assessment of isolated words, and here I work exclusively with data elicited from the K-1 High Frequency (K1HF) and Beginning Phonic Skills Test (BPST) word lists [37, 78] - these will be described in more detail in Section 4.4. However, the modeling framework could easily be extended to other list-based assessments.
Due to the wide range of "correct" pronunciations when reading, TBALL also treats reading assessment as a more complex task than simply deciding between close variants, as explained in Section 4.1. Advances in this chapter and elsewhere have come through the use of multiple pronunciation categories (beyond those expected of native speakers or reading errors), features beyond the segment level, classification algorithms beyond automatic decoding of speech, and information about the child's background [2, 86, 91].

The student model proposed in this chapter is inspired mostly by the Knowledge Tracing (KT) model of student knowledge demonstration and acquisition during a tutoring session [22]. KT posits that every answer a student gives when taking a tutor-guided test is either a direct demonstration of their actual knowledge (or lack thereof), or else a lucky guess or a temporary slip. The tutor's intervention can affect the child's inner knowledge state by actually teaching them, or the tutor can simply scaffold the child's answer without really imparting any understanding to them. A student model based on KT was used in [8] to generate automatic sentence reading scores in cases when the child's observed answers may be corrupted by errors in ASR results. That study showed each child's automatic scores from this model correlated well with their performance on standardized tests that they took at the end of the school year.

4.2.2 Bayesian Networks

A Bayesian Network is a graphical model that defines the joint probability of a set of variables X_1, X_2, ..., X_n as

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | Pa(X_i))     (4.1)

where Pa(X_i) are X_i's "parents" in the network - the variables on which one would expect X_i to be conditionally dependent, either because there is a causal relationship between the parents and the "child," or because knowledge of the parents' values would influence one's expectation of the child's [28]. This pre-defined conditional dependence simplifies the inference of any variable's value given the others:

P(X_1 | X_2, X_3, \ldots, X_n) = P(X_1, X_2, \ldots, X_n) / P(X_2, X_3, \ldots, X_n)
                               = \prod_{i=1}^{n} P(X_i | Pa(X_i)) / \prod_{i=2}^{n} P(X_i | Pa(X_i))     (4.2)

where X_1 is excluded from possible parents in this example's denominator. In a discrete classification task such as item-level reading assessment, the value of X_1 can be estimated as

\arg\max_{x_1} P(X_1 = x_1 | X_2 = x_2, X_3 = x_3, \ldots, X_n = x_n)     (4.3)

or x_1 can be taken as X_1's value if

P(X_1 = x_1 | X_2 = x_2, X_3 = x_3, \ldots, X_n = x_n) \geq T     (4.4)

for some appropriate threshold T.

Bayesian Networks are versatile as automatic classifiers in many ways. They allow for both continuous and discrete variables, and can be trained on instances that are occasionally missing one or more of the variables' values (as is often the case for real-world data). Conditional dependencies among variables can be specified a priori or optimized based on a method such as the Tree-Adjoining Naive Bayesian (TAN) algorithm [28]. Dynamic Bayesian Networks (i.e. ones that track sequences of variables over time) have been used extensively to incorporate articulatory, prosodic, and audio-visual features in Automatic Speech Recognition (ASR) [38, 48]. Studies such as [21, 74] have also used Dynamic Bayesian Networks to track student knowledge acquisition in intelligent tutoring systems without ASR capability, and have been extended to automatically scoring sentence-level reading [8], though without the focus on nonnative speakers.
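As a toy illustration of Eqns. 4.1-4.4, the following sketch performs exact inference on a three-variable discrete network (an Underlying variable U, the hidden state Q, and one Evidence variable E) by enumerating the joint probability. The conditional probability tables are invented for the example and are not the trained parameters of the model described later.

    # Toy network U -> Q -> E with invented conditional probability tables.
    P_U = {'K': 0.3, '1st': 0.7}                                          # P(U)
    P_Q_given_U = {('Acc', 'K'): 0.6, ('Unacc', 'K'): 0.4,
                   ('Acc', '1st'): 0.8, ('Unacc', '1st'): 0.2}            # P(Q | U)
    P_E_given_Q = {('NA', 'Acc'): 0.7, ('SP', 'Acc'): 0.2, ('RD', 'Acc'): 0.1,
                   ('NA', 'Unacc'): 0.2, ('SP', 'Unacc'): 0.2, ('RD', 'Unacc'): 0.6}  # P(E | Q)

    def joint(u, q, e):
        # Eqn. 4.1: the joint is the product of each node's probability given its parents.
        return P_U[u] * P_Q_given_U[(q, u)] * P_E_given_Q[(e, q)]

    def posterior_q(u, e):
        # Eqn. 4.2: condition the joint on the observed values of U and E.
        evidence = sum(joint(u, q, e) for q in ('Acc', 'Unacc'))
        return {q: joint(u, q, e) / evidence for q in ('Acc', 'Unacc')}

    post = posterior_q('1st', 'RD')
    map_value = max(post, key=post.get)      # Eqn. 4.3: most probable value of Q
    accept = post['Acc'] >= 0.5              # Eqn. 4.4: threshold decision with T = 0.5
    print(post, map_value, accept)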
The probability distribution of each variable in the network is also an open design parameter, and must depend on the conditional dependencies built into the model. In this work I follow the methods used in the Bayes Net Toolkit [66], in which all these models were implemented. Continuous nodes were modeled as Gaussian distributions. With discrete parents, they take the form of a table of Gaussians - one for each combination of parent values. With continuous parents, these were modeled as linear Gaussians, in which the mean is a linear combination of the parents' values and the variance remains constant. Discrete variables with discrete parents are modeled simply as a table of conditional probability values over all combinations of parent values, but discrete variables with continuous parents are not so straightforward. One option is to artificially discretize the parents, though this usually results in poor parameter estimates and is sometimes computationally unfeasible. Instead I chose to model these variables as continuous multinomial logistic (or "softmax") distributions [67], which are commonly used in Neural Networks and behave like soft decision thresholds between discrete values. Further details are presented in Sections 4.4 and 4.5.

4.3 Perceptual Evaluations

Since this chapter is concerned with automatic generation of subjective judgments, I organized some formal listening tests to determine the level of agreement between annotators as an upper bound on the automatic results, and to determine any sources of bias or disagreement in human perception of reading skills. Five listeners were asked to give binary accept/reject scores to word-list recordings of twelve children collected with the TBALL child interface (an average of 14 items per child). One of the listeners (#1) was a professional speech transcriber, another (#2) was an expert in linguistics, second-language acquisition, and literacy assessment, and the remaining three were PhD students of speech technology with many hours of experience assessing data from the TBALL corpus.

The stimuli from each child's list were presented to the listeners in chronological order so that they would hear the test items in the same sequence a teacher in the classroom would have heard them. To maximally use the data, perhaps as an automatic scoring system would, they were allowed to listen to each item as many times as they wanted, and they could go back to previous items from the current child before moving on to the next child. Along with the word-level recordings, the listeners were given the intended target word for each test item, but were not told any background information about the children, so as to minimize any a priori bias in scoring. Their judgment of each item was then based on the child's pronunciation of that item, their performance on previous items, the relative difficulty of the item, anything the listener could infer about the child, and their past experience scoring other children.
The recordings chosen were balanced for the following potential sources of bias:

• gender
• L1 (English or Spanish)
• grade level (Kindergarten, 1st, or 2nd)

                  Listener #1   Listener #2   Listener #3   Listener #4
  Listener #2
    % agreement      87.9
    Kappa            0.725
    correlation      0.852
  Listener #3
    % agreement      94.2          92.9
    Kappa            0.845         0.844
    correlation      0.917         0.925
  Listener #4
    % agreement      88.3          91.9          96.5
    Kappa            0.779         0.821         0.910
    correlation      0.889         0.930         0.987
  Listener #5
    % agreement      91.2          96.0          94.7          94.7
    Kappa            0.773         0.913         0.869         0.867
    correlation      0.911         0.990         0.944         0.950

Table 4.1: Inter-listener agreement in reading scores, in terms of percent agreement and Kappa for binary item-level scores, and correlation between overall list-level scores.

Inter-annotator agreement is reported in Table 4.1 in terms of percent agreement, Kappa statistic, and correlation in list-level scores, defined as the percentage of items tagged Acceptable in the list. Overall, agreement between all pairs of listeners was consistently high. The percent agreement ranged from 87.9% to 96.5% and the correlation was anywhere from 0.852 to 0.990. This indicates that the listeners generally agreed not only on how many items were Acceptable for each child, but on which those Acceptable items were. Similarly, high Kappa statistics indicated that the high percent agreement was not simply due to chance. These findings do set a high standard for automatic scores, but are consistent with other previous perceptual studies on this corpus [86, 91]. They show that subjectivity is indeed present in this scoring task, but that overall agreement is higher than one might expect for other tasks such as, for example, strict pronunciation scoring [3].

To measure bias in these scores, the percentage of Acceptable-rated items was compared between demographic categories using a one-tailed z-test for difference in independent proportions. Only one listener (#2) gave significantly higher scores to Kindergarteners over 1st or 2nd graders (p ≤ 0.03). Besides that, the only significant difference in scores was neither in gender nor grade level but native language - all five listeners gave native English speakers higher scores than nonnative ones (p ≤ 0.04). Since this was common across all listeners and they were not told what each child's background was, this difference in proportions was probably not bias but simply indicates that the nonnative speakers chosen really were worse readers. However, the number of speakers is not large enough to draw any conclusions correlating native language with reading ability in general (though other studies like those reviewed in [30] have confirmed this link with larger sample sizes).

What about subjectivity between demographic categories - did these listeners ever agree more often for one type of student than another? The answer is yes, in every case of demographic duality. They agreed more for male students than for female students, for native English speakers than for nonnative speakers, and for Kindergarteners than for 1st or 2nd graders (all with p ≤ 0.0001). This could indicate that there is more speech variability among female, nonnative, and older children, perhaps due to variations in child physical development or second-language acquisition. The proportion of Acceptable items was not different based on gender or grade level - only the item-level agreement among listeners changed. This means that they assigned roughly the same number of Acceptable scores to both categories, but their disagreement about which items were Acceptable was higher for one category than the other.
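The percent agreement and Kappa values in Table 4.1 are standard measures; the following is a minimal pure-Python sketch of both, for two listeners' binary accept/reject scores over the same items (the example inputs are illustrative).

    def agreement_stats(scores_a, scores_b):
        # scores_a, scores_b: lists of 0/1 item-level scores (1 = Acceptable)
        # from two listeners over the same test items.
        n = len(scores_a)
        p_observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
        # Chance agreement from each listener's marginal rate of Acceptable labels.
        p_a, p_b = sum(scores_a) / n, sum(scores_b) / n
        p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
        kappa = (p_observed - p_chance) / (1 - p_chance)
        return 100 * p_observed, kappa

    # Illustrative use:
    # pct, kappa = agreement_stats([1, 1, 0, 1, 0], [1, 1, 0, 0, 0])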
Note, though, that all agreement levels were still quite high regardless (89.6% and up, depending on demographic category), even when the difference in agreement between categories was significant.

Listener #1 provided reference scores for the rest of the data used in this chapter (beyond this 12-child subset). This listener had 93.6% agreement with the majority vote scores from the other four listeners (a somewhat objective reference), and the remaining four agreed with the others' majority scores 96.1% of the time. This difference in agreement proportions was significant only with p ≤ 0.08, and so Listener #1's scores were seen to agree with the majority vote reference about as much as those from the rest of the listeners did - enough for them to serve as a reliable reference. In light of studies such as [3, 85] which claim a maximum of 70% agreement or 0.8 correlation between human annotators when judging pronunciation, any of these listeners' scores could have served as a good reference because of the high inter-listener agreement overall. All automatic results will be reported in comparison with Listener #1's scores.

4.4 Feature Estimation

Three types of variables for student modeling are used in this chapter - one type will be called Hidden variables, another type will be called Evidence, and a third will be called Underlying features. Hidden variables are the scores for a child's cognitive state when reading that I intend to estimate automatically. These variables are literally hidden because a child may or may not know how to read a target word, and this knowledge state may or may not manifest itself in their pronunciation - i.e. they might know how to read but could accidentally say it wrong, or they might not know how to read but could guess the correct pronunciation. To estimate this hidden ability one can only gather the relevant cues and make an inference based on them, from a teacher's point of view.

The Evidence variables are what a child would demonstrate at the time of a test and that a teacher would observe firsthand when conducting it. These are cues related to the child's pronunciation and speaking style, and are derived from robust speech acoustic models and prior notions of pronunciation categories. Other types of Evidence such as visual cues might be useful in scoring, but they will not be investigated in this work. The Underlying features are extra information that may influence judgments of reading ability, but are known before the test and before its Evidence is elicited. These are things like the child's background, or information specific to the test items, such as their relative difficulty. All these variables will be summarized in Table 4.2.

4.4.1 Hidden Variables

The automatic word-level scoring model is essentially that of a two-step decision process: first, values for the Evidence demonstrated by the child in reading a target word t out loud are estimated (Section 4.4.2 below will discuss these in detail). This first step is really just a model for teacher perception of this Evidence. In the second step, all these Evidence features E_t = e_t^1, ..., e_t^5 are combined with the Underlying features U_t = u_t^1, ..., u_t^6 in a Bayesian Network to synthesize an item-level binary score for that item's Hidden variable, q_t - this second step is the student model proper, conceived from a teacher's perspective. An additional consideration in estimating this binary score is another hidden variable, r_{t-1}: the child's overall reading ability on the given word list.
This overall skill level is modeled as a continuous variable defined as the percentage of Acceptably-read items in the list up to and including t-1, a "running score" for test performance. It is used here on the assumption that, for example, if a child can read 16 out of 18 items in the list Acceptably, then the 19th item will probably also be Acceptable, though this could potentially propagate errors in the automatic scores. Strictly speaking, r_{t-1} is not exactly a hidden variable, since a pseudo-observed value for it is estimated from the inferred value of the hidden q_t and the previous overall score, r_{t-2}.

An automatic score through Bayesian inference on an item's hidden binary variable q_t is then estimated by evaluating the conditional probability that it is Acceptable reading:

S_{q_t} = P(q_t = \text{Acc.} | E_t, U_t, r_{t-1})
        = P(q_t = \text{Acc.}, E_t, U_t, r_{t-1}) / P(E_t, U_t, r_{t-1})     (4.5)

This leads to a continuous score for each item, which can be left as-is or thresholded and made binary for comparison with human binary scores. The overall reading score is initialized to r_0 = 0, and in test mode the running score that estimates it is automatically updated based on S_{q_t} as the test progresses. The joint probabilities in Eqn. 4.5 will be specified with assumptions of conditional independence explained below, in Section 4.5.

4.4.2 Evidence

Section 4.1 introduced the idea of comparing an unknown pronunciation to one or more predefined pronunciation lexica, each capturing a type of expected variation in pronunciation due to a specific source, like reading errors or foreign accent. I hypothesize that these pronunciation lexica are useful to guess the source of a pronunciation variant when judging words read out loud, and consequently to infer the child's reading ability that generated that source of variation. In other words, one can estimate the degree to which the pronunciation comes from the set of common letter-to-sound (LTS) decoding errors or from the child's expected phonological patterns, as this will help to link the pronunciation evidence to a cognitive state of word reading ability (though this is not always so clear a link, as Section 4.1 explains). This idea is similar to theories of Lexical Access in perception of spoken words [82], in which an incoming sequence of phonetic segments is compared with similar sequences that can form words in the listener's vocabulary.

For a speech observation O, the likelihood that it belongs to a pronunciation lexicon is estimated by decoding all pronunciations in that lexicon and choosing the one that is most likely given the model:

P(O | \lambda_p) = \max_n P(O | \lambda_p^n)     (4.6)

where \lambda_p^n is the model for pronunciation n in lexicon p. In this chapter three pronunciation lexica are used, defined in terms of phoneme-level substitutions, insertions, and deletions: variants common to native English speakers (NA), variants common in Mexican Spanish accents (SP), and variants arising from predictable reading errors (RD). For example, if the target test word is "can" then most native children will pronounce it either as /K AE N/, or sometimes /K EH N/ when speaking quickly. A child with a Spanish accent might say /K AA N/, and one who makes the common LTS mistake of having the 'a' say its name might pronounce it as /K EY N/. These variants are determined based on rules observed in heldout data [103] as well as input from experts in child literacy.
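A small sketch of Eqn. 4.6 in the log domain may help: the per-variant scores below stand in for acoustic log-likelihoods from forced alignment and are invented for the example; the final line anticipates the recognized-lexicon feature defined below as Eqn. 4.8.

    # Invented per-variant log-likelihoods for one recording of the target word "can".
    loglik = {
        'NA': {'K AE N': -410.2, 'K EH N': -423.9},   # native-English variants
        'SP': {'K AA N': -431.5},                     # Mexican Spanish accent variant
        'RD': {'K EY N': -428.7},                     # letter-to-sound reading error
    }

    # Eqn. 4.6: a lexicon's score is the score of its best-matching variant.
    lexicon_loglik = {p: max(variants.values()) for p, variants in loglik.items()}

    # Recognized lexicon: the argmax over lexica (introduced below as Eqn. 4.8).
    p_hat = max(lexicon_loglik, key=lexicon_loglik.get)
    print(lexicon_loglik, p_hat)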
These lexicon-based likelihoods were then used to estimate posterior probabilities for each pronunciation lexicon:

P(\lambda_p | O) = \frac{P(O | \lambda_p) P(\lambda_p)}{\sum_p P(O | \lambda_p) P(\lambda_p)}     (4.7)

Here I follow [70] in using posterior probabilities rather than likelihoods for pronunciation scores, since they have been shown to have higher correlation with listener perception of pronunciation quality. Equal priors are assumed for all values of P(\lambda_p) because most of the dataset was not transcribed on the phoneme level and so information about the relative frequencies of each pronunciation variant in the data was not known. All acoustic models were monophone Hidden Markov Models with three states and 16 Gaussian mixtures per state. They were trained on 19 hours of children's speech collected in classrooms as part of the TBALL project, of which only a small amount was annotated on the phoneme level to bootstrap automatic transcription of the remainder. Viterbi decoding was used to estimate the likelihood of the observed speech given each model.

In addition to these posterior scores for each of the three pronunciation lexica, two more features were used as Evidence. One was the child's rate of speaking (ROS), defined as the number of phonemes read per second. The second was the pronunciation recognition result, a discrete variable for the pronunciation lexicon with the highest likelihood:

\hat{p} = \arg\max_p P(O | \lambda_p)     (4.8)

If each pronunciation lexicon's posterior represents a teacher's perception of an unknown pronunciation's distance to the pronunciations in that lexicon, then this pronunciation recognition result represents which pronunciation lexicon a teacher would choose if they had to pick only one.

4.4.3 Underlying Features

There are many things that one might expect to affect a teacher's inference as to a child's reading ability. Some of these are related to what they may already know about the child in question. In this work the child's native language, gender, and grade level (K, 1st, or 2nd) are used as three discrete Underlying variables to inform these automatic assessments.

           symbol     variable                                  cardinality
  E_t      e_t^1      rate of speaking: ROS                     cont.
           e_t^2      native lexicon posterior: NA              cont.
           e_t^3      Spanish lexicon posterior: SP             cont.
           e_t^4      reading mistake lexicon posterior: RD     cont.
           e_t^5      recognized lexicon: \hat{p}               3
           r_{t-1}    list-level running score                  cont.
           q_t        item-level score                          2
  U_t      u_t^1      item index                                cont.
           u_t^2      word length                               cont.
           u_t^3      word list                                 4
           u_t^4      native language: L1                       2
           u_t^5      gender                                    2
           u_t^6      grade level                               3

Table 4.2: Summary of Evidence, Hidden, and Underlying variables used in this chapter.

The vast majority of the children in this dataset were native speakers of either English or Spanish, so I chose to represent native language as a binary variable. Their native language was determined through questionnaires filled out by the children's parents, and some of them chose not to respond - in those cases, this variable's value was left unspecified in the Bayesian Network. The same went for the few children whose native language was something other than Spanish or English. The pronunciation lexica only accounted for variability related to a native English, L.A. Chicano English, or Mexican Spanish accent, and children with another native language were too few to justify changing the cardinality of this variable. Several of the children were described as bilingual - equally native to both English and Spanish. These were tagged as Spanish-speaking because they could potentially have the influence of the only foreign accent examined here.
The reader should keep in mind that, whatever the child's native language, an objective measure of the presence of a foreign accent in their speech was not available, and moreover, children who are native Spanish speakers do not necessarily demonstrate any discernible foreign accent.

Other Underlying features that may potentially sway a child's reading ability are factors related to the design of the test itself. Some words are more difficult than others, and this difficulty may manifest itself in those words' Evidence and Hidden variables. These experiments used recordings from four different word lists - three lists of K1HF words and one BPST list (see Section 4.2.1) - and BPST is for most items the more difficult list. The K1HF lists are made of high-frequency words that beginning readers have hopefully already read many times: "I," "like," "can," "see," etc. The BPST words are designed to elicit distinctions between phonemes, and so contain many minimal pairs of less-common words like "map" and "mop," or "rip" and "lip." Because of a potential difference in word list difficulty, the word list was included as a discrete variable of cardinality 4. For similar reasons, the length of each word (in characters) was included as a continuous variable to account for increased difficulty proportional to the word length. Lastly, the item index divided by the total number of items in the list was used as another continuous variable. This made sense because the BPST list steadily increased in difficulty as the items progressed, and the item index also worked in conjunction with the overall score hidden variable r_{t-1}. For example, a running score after only 2 items should have less influence on the next item's binary score than the same running score after 15 items.

4.5 Network Structure

Section 4.4 described the feature set and how each variable's value was derived. This section now explains the conditional dependencies in the model among these variables, used in performing Bayesian inference on the item-level Hidden variable q_t, as in Eqn. 4.5: P(q_t | E_t, U_t, r_{t-1}).

4.5.1 Hypothesized Structure

To reiterate, the student model proposed in this chapter takes what teachers perceive during the test (the Evidence, E_t), what teachers know beforehand about the student or the test (the Underlying variables, U_t), and the student's performance on prior test items as an estimate of their overall reading skill (the running score, r_{t-1}), and it uses all these things to construct a student model from a teacher's point of view that can be used to infer whether or not the student read the current test item Acceptably, cognitively speaking (i.e. not just in terms of pronunciation). Figure 4.1 shows graphically in a high-level Bayes Net structure what I hypothesize the conditional dependencies among these different types of variables should be, based on how I expect teachers to conceive of reading ability and its demonstration.

Figure 4.1: Graphical illustration of the student model, over 2 test items. Shaded nodes denote hidden variables. The dashed lines are not probabilistic relations, but indicate how the overall score for item t is derived from the combined previous item and overall scores. [Nodes for test item t: Underlying Variables (u_t^{1:2} and u_{\forall t}^{3:6}), Overall Reading r_{t-1}, Evidence E_t, Item-level Reading q_t; and correspondingly r_t, E_{t+1}, q_{t+1}, u_{t+1}^{1:2} for item t+1.]

On these assumptions of conditional independence, Eqn. 4.5 simplifies to:
P(q_t \mid E_t, U_t, r_{t-1}) = \frac{P(q_t, E_t, U_t, r_{t-1})}{P(E_t, U_t, r_{t-1})}
  = \frac{P(q_t \mid U_t, r_{t-1})\,P(E_t \mid q_t, U_t, r_{t-1})\,P(r_{t-1} \mid u_{\forall t}^{3:6})\,P(U_t)}{P(E_t \mid U_t, r_{t-1})\,P(r_{t-1} \mid u_{\forall t}^{3:6})\,P(U_t)}
  = \frac{P(q_t \mid U_t, r_{t-1})\,P(E_t \mid q_t, U_t, r_{t-1})}{P(E_t \mid U_t, r_{t-1})}    (4.9)

The item-level pronunciation Evidence, E_t, is a consequence of the student's hidden reading ability states (both item-level and overall running score), though how those states are manifested in all the Evidence is not deterministic. The hidden binary item-level state q_t is conditionally dependent on the overall skill level r_{t-1} estimated from the previous test items, as well as all the Underlying variables, U_t, since one would expect item-level inference to change based on the value of these parent variables. Note that this does not mean, for example, that the student's L1 is modeled as the cause of their reading ability - this is just one variable on which I would expect a teacher's inference into the student's cognitive state to depend. The overall reading skill, too, is modeled as a child of the Underlying variables, but only those that apply globally (u_{\forall t}^{3:6}: word list, L1, gender, and grade level) rather than those that apply only to an individual test item (u_t^{1:2}: item index and word length). The Evidence is also modeled as a child of all the Underlying variables, so that the observed pronunciation features are conditionally dependent on both the student's hidden reading ability and other external factors such as word difficulty. The dotted lines from q_t and r_{t-1} to r_t do not denote probabilistic relations but rather show how the running score is updated at each item with the newly-inferred value of q_t. Table 4.3 shows the hypothesized child-parent relationships among all these variables in more detail.

Though this method does not use any intelligent tutor feedback, this student model is in many ways analogous to that of Knowledge Tracing (KT) as introduced in Section 4.2.1. The Hidden student cognitive state for each word item represents whether they know how to read the target word Acceptably or Unacceptably, and is equivalent to KT's "Student Knowledge" state for a particular skill. Both student models use observed "Student Performance" during the test (in this case, pronunciation Evidence) as a variable generated by the Hidden knowledge state. The Tutor Intervention variable in KT is similar to the Underlying features, which are in both models considered to be parents of the Hidden knowledge state and the observed Evidence. The novel aspect of this student model compared to KT and other similar models in [21, 74] is mainly in the feature set - the use of Underlying demographic and test item information as well as Evidence scores over several different pronunciation lexica is unique to this work.

4.5.2 Structure Training and Refinement

The many hypothesized arcs in this graphical model (Fig. 4.1 and Table 4.3) may result in a sub-optimal network for several reasons. First of all, some of the variables thought to be dependent may not in fact be, and modeling such "dependencies" would be useless.
This would be a failure to follow the Occam's Razor principle of model succinctness, and would unnecessarily increase the computational complexity involved in estimating an automatic reading score. Beyond that, with finite training instances and an overly complex model there is always the possibility that true dependencies might not be estimated properly due to a dearth of training instances representative of all combinations of dependent variables.

    Parents \ Children            r_{t-1}  q_t   ROS   NA   SP   RD   p̂
    U_t  u_t^1  item index           -      X     X     X    X    X    X
         u_t^2  word length          -      X     X     X    X    X    X
         u_t^3  word list            X      X     X     X    X    X    X
         u_t^4  L1                   X      X     X     X    X    X    X
         u_t^5  gender               X      X     X     X    X    X    X
         u_t^6  grade level          X      X     X     X    X    X    X
    r_{t-1}                          -      X     X     X    X    X    X
    q_t                              -      -     X     X    X    X    X

Table 4.3: Hypothesized parent-child arcs in the Bayesian Network student model. The columns ROS, NA, SP, RD, and p̂ are the Evidence variables e_t^1 through e_t^5. Only for pairs marked with an 'X' is the column variable considered conditionally dependent on the row variable - all others are assumed to be independent.

For these reasons I propose an alternative forward-selection greedy search algorithm to refine the network structure of the hypothesized arcs. The algorithm begins with just one arc in the network representing a baseline dependency: the arc from q_t to the pronunciation lexicon recognition Evidence variable, p̂. Then it proceeds in a random order to add each hypothesized arc individually, keeping an arc if it improves the likelihood of the training variables given the Bayesian Network. This process is looped until it has been shown that adding any remaining hypothesized arc will decrease the model likelihood. This method is also useful in that analysis of the refined network may reveal the necessary inter-variable dependencies of the data.

Because of occasionally missing data in some of the demographics variables, and the potential for some variables to be modeled as "softmax" distributions (for discrete features with continuous parents), all model parameters were estimated using the Expectation-Maximization (EM) algorithm that these conditions require. The likelihood of the training set given the model was defined as the log-likelihood after EM convergence, and convergence was defined as either 10 iterations of EM or the number of iterations required to make the following inequality true:

\frac{|LL(i) - LL(i-1)|}{\mathrm{mean}\{|LL(i)|, |LL(i-1)|\}} < 0.001    (4.10)

Here LL(i) is the log-likelihood after iteration i. In most cases, EM on the data met this inequality within 3 or 4 iterations.

4.6 Experiments and Results

To paraphrase the overall questions first posed in Section 4.1, the experiments with this new student model were intended to answer the following:

• Does the Bayes Net student model offer improvements over a baseline pronunciation-based paradigm for reading scoring?
• How useful are the novel features proposed here in making a reading score?
• Can the structure of the student model be improved automatically?
• How do the automatic scores compare with human scores, both in terms of agreement and bias?

The evaluation dataset used in all these experiments consisted of 6.85 hours of read words from 189 children - a total of 9620 word items from 658 word lists (an average of 15 items per list).

                    speakers   word items   word lists
    Male                80         3950          278
    Female             100         5051          349
    English L1          71         3041          209
    Spanish L1          82         4660          323
    Kindergarten        35          882          104
    1st grade           72         3854          253
    2nd grade           82         4884          301

Table 4.4: Demographic makeup of the test data used in all experiments.
Total speakers may not add up to 189 since some of this information was missing for certain students, for reasons explained in Section 4.4.3.

These children were all distinct from those in the 19 hours of speech used for acoustic model training described in Section 4.4.2. The demographic makeup of this dataset is given in Table 4.4. Automatic scoring on the evaluation set was done using a five-fold crossvalidation procedure in which, for each fold, four-fifths of the speakers in the eval set were used to train and refine the student model, which was then tested on the remaining one-fifth of the speakers. All variables were estimated using the acoustic models and background information as explained in Section 4.4.

4.6.1 Baselines

Based on the methods in [6, 8, 35, 64, 99] of doing automatic reading assessment as a pronunciation recognition/verification task, I propose two baseline methods. In both, I assume that only a pronunciation from the NA lexicon (those common to native-speaking readers) can qualify as a demonstration of Acceptable reading ability, as past studies have done. The recognition method takes the Viterbi-decoded pronunciation recognition result for each item, p̂, and makes an item-level reading score: if the recognized pronunciation is in the NA lexicon, then the item is considered Acceptable. The threshold method takes the posterior of the NA pronunciations, P(\lambda_{NA} \mid O), and gives an Acceptable item-level score if the posterior is greater than some threshold. For both of these baselines, overall scores for a word list are taken as the percentage of Acceptable item-level scores in the list. The point of using baselines such as these is to show improvement in automatic reading scores with the addition of the novel aspects of this work: the pronunciation Evidence based on multiple pronunciation lexica, the Underlying features, and the student model's network structure that unites them. Note, however, that although neither baseline explicitly used features from multiple pronunciation lexica, the recognition baseline required multiple pronunciations in its decoding grammar, and the threshold baseline used all the pronunciation lexica in estimating its posterior probability, as seen in Eqn. 4.7.

For comparison against these baselines, all experiments using the Bayesian Network student model (of any structure) estimate scores based on the conditional probability of Acceptable reading given all the features and the structure of the network: P(q_t = Acceptable \mid E_t, U_t, r_{t-1}). For item-level scores, this probability is made binary by comparison with a threshold, as in the threshold baseline method. List-level overall scores are taken as the mean of all the item-level Acceptable probabilities on the word list, not the binary scores generated from them. Figure 4.2 shows detection-error tradeoff (DET) curves for the item-level results over varying threshold values, and Table 4.5 gives overall score correlation as well as the item-level agreement nearest to the DET plot's Equal Error Rate (EER) for each scoring method (except for the recognition baseline, since it did not use a threshold).
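A minimal sketch of these scoring rules, assuming the per-item quantities (recognized lexicon, NA posterior, and inferred P(q_t = Acceptable)) have already been computed; the function names are illustrative and not part of the actual system.

```python
def recognition_baseline(recognized_lexica):
    """Item is Acceptable iff the Viterbi-recognized pronunciation is in the NA lexicon."""
    item_scores = [1 if p_hat == "NA" else 0 for p_hat in recognized_lexica]
    list_score = sum(item_scores) / len(item_scores)   # percentage of Acceptable items
    return item_scores, list_score

def threshold_baseline(na_posteriors, threshold=0.5):
    """Item is Acceptable iff P(lambda_NA | O) exceeds a threshold."""
    item_scores = [1 if p > threshold else 0 for p in na_posteriors]
    list_score = sum(item_scores) / len(item_scores)
    return item_scores, list_score

def bayes_net_scores(p_acceptable, threshold=0.5):
    """Item-level scores are thresholded probabilities; the list-level score is the
    mean of the continuous probabilities themselves, not of the binary decisions."""
    item_scores = [1 if p > threshold else 0 for p in p_acceptable]
    list_score = sum(p_acceptable) / len(p_acceptable)
    return item_scores, list_score
```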
[Figure 4.2: DET curves (false positive rate vs. false negative rate) over a varying threshold, for item-level binary classification on several different scoring methods and network structures: the refined network (all features, no U_t, no E_t, no r_{t-1}), the full network, and the threshold baseline, with the EER line marked.]

4.6.2 Network Structure Optimization

As explained in detail in Section 4.5.2, the structure of the student model's network was automatically refined using a forward-selection procedure on the hypothesized conditional dependencies outlined in Section 4.5. This refinement was done five times - once for each fold in the crossvalidation. Table 4.6 gives the total number of times each hypothesized arc was selected for the final network, out of five crossvalidation folds. Figure 4.2 and Table 4.5 compare performance of the full and refined network structures with item-level DET curves and list-level score correlation.

4.6.3 Feature Comparison

Are pronunciation features beyond simple native-accent pronunciation scores necessary when scoring reading automatically? This is what the baseline experiments were intended to answer. If the answer is yes, then which of the new features (E_t, U_t, and r_{t-1}) are most useful in estimating q_t? To measure this, the scoring experiments on the refined network were redone, leaving each of these three feature subsets out. DET curves for these feature sets are shown in Fig. 4.2, and Table 4.5 gives their EER item agreement and list-level correlation performance.

                                  Baselines                          Refined Net
                            recognition  threshold  Full Net  all features  no E_t  no U_t  no r_{t-1}
    Item-level % agreement      82.2        81.2      64.3        76.4       60.1    82.0      74.9
    Kappa                       0.492       0.413     0.213       0.426      0.161   0.504     0.408
    List-level correlation      0.592       0.577     0.238       0.762      0.582   0.577     0.742

Table 4.5: Item-level agreement and list-level overall score correlation between automatic results and human reading scores. For threshold-based item-level classification, this agreement is the result closest to the EER.

4.7 Discussion

4.7.1 Automatic Performance Comparison

As reported in Table 4.5, the better of the two baseline methods was recognition. Both baselines had comparable agreement levels (82.2% and 81.2%) but apparently each one agreed with the human labels on different items: the item-level matched-pair results were significantly different using McNemar's test with p ≤ 0.01. However, the difference in list-level correlations between the two baselines was not significant. Neither the Kappa agreement nor correlation of either baseline came close to the inter-listener agreement reported in Table 4.1, though the 0.492 recognition Kappa agreement meant that the item-level results were well above chance levels.

According to both Fig. 4.2 and Table 4.5, the best-performing of the Bayesian Network student models was the Refined Network with all the features. This had a worse item-level agreement than the recognition baseline (76.4% - significantly different from the recognition baseline with McNemar's test and p ≤ 0.01), but the correlation in list-level scores was much higher (0.762, compared to recognition's 0.592 - a significant difference with p ≤ 0.01). This proved to be the main benefit of using the novel Bayes Net student model: better overall list-level scores at the expense of item-level agreement. This may seem counter-intuitive, as one might expect better list-level scores to accompany higher item-level agreement on a list.
Recall, however, that Section 4.6.1 explained that the list-level scores for the student model are averages not of the number of list items deemed Acceptable, but of the probabilities of Acceptable reading over all items in the list. This allows for a much more fine-grained score, since it is calculated as the average of continuous-valued probabilities rather than a ratio of integers (Acceptable items divided by total items). This is how it is possible to have lower item-level agreement but better list-level correlation.

4.7.2 Automatic Structure Refinement

The forward selection procedure to automatically refine the network worked as expected: the refined structure performed dramatically better than the complete hypothesized structure both on the item and list levels. According to Table 4.6, an average of 6 out of the 51 hypothesized arcs were excluded from the network in each fold of crossvalidation. This indicates that the hypothesized structure is for the most part an accurate model of the data's conditional dependencies. Interestingly, the dependencies excluded were generally ones that would have required variables to be modeled with softmax distributions - i.e. ones that linked continuous parents to discrete children. For example, the two continuous Underlying variables (word length and item index) were rarely ever allowed to be parents of q_t (the hidden binary reading variable) or p̂ (the discrete recognition result). Similarly, the continuous-valued overall score r_{t-1} was generally not selected as the parent of either of those variables. Perhaps this illustrated a limitation in estimating the softmax distribution's parameters (which required EM), or in using the softmax distribution itself. It is also important to keep in mind that the omission of any of these dependencies might be due to data sparsity rather than true independence between variables.

4.7.3 Comparison of Proposed Features

In at least four of the five folds, the running score r_{t-1} was omitted as a parent of two variables through forward selection, and so retraining the refined structure without that variable did not degrade model performance very much - its list-level score had correlation 0.742, which was not significantly less than that of the best network structure (the one with all the variables). This implies, surprisingly, that there was not a sequential dependency between test items, but the reason it was omitted might be solely due to the failure of the softmax distribution required by the hypothesized structure. Predictably, leaving out the Evidence resulted in very low item-level agreement (0.161 Kappa) but remarkably high list-level correlation (0.582, not significantly less than the recognition baseline) based on the Underlying variables alone. Omitting the Underlying variables did reduce the correlation to 0.577, but the item-level Kappa agreement was the highest in Table 4.5: 0.504. This suggests that, in making a reading score, the Evidence variables are far more important than the Underlying ones. However, it is interesting to note that surprisingly accurate list-level scores can be predicted merely with prior knowledge of the child and the test, without even observing how the child performs on the test. Furthermore, including the Underlying variables does significantly improve the list-level score correlation.
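As a concrete restatement of the forward-selection refinement of Sections 4.5.2 and 4.7.2, here is a schematic sketch. The train_and_score callback stands in for EM training of the Bayesian Network followed by evaluation of the training-set log-likelihood; it and the (parent, child) arc representation are illustrative assumptions, not the actual implementation used here.

```python
import random

def refine_structure(hypothesized_arcs, baseline_arc, train_and_score):
    """Greedy forward selection over hypothesized (parent, child) arcs.

    train_and_score(arcs) -> training-set log-likelihood after EM convergence.
    The search starts from the single baseline arc (q_t -> p_hat) and keeps an
    arc only if adding it improves the likelihood; it loops until no remaining
    hypothesized arc improves the model.
    """
    selected = [baseline_arc]
    remaining = [a for a in hypothesized_arcs if a != baseline_arc]
    best_ll = train_and_score(selected)

    improved = True
    while improved and remaining:
        improved = False
        random.shuffle(remaining)            # arcs are tried in a random order
        for arc in list(remaining):
            ll = train_and_score(selected + [arc])
            if ll > best_ll:                 # keep the arc if likelihood improves
                selected.append(arc)
                remaining.remove(arc)
                best_ll = ll
                improved = True
    return selected
```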
4.7.4 Bias and Disagreement Analysis

Section 4.3 reported that listeners gave higher reading scores to native English than to Spanish-speaking students, over a small subset of the data (based on a one-tailed test of difference in proportions with p ≤ 0.05). For Listener #1 (the reference), this bias over the entire corpus extended also to higher scores for females than for male students, and for older students than for younger (2nd grade over 1st, and 1st grade over Kindergarten). This background information was not given to any of the listeners, but it can be assumed they guessed each child's gender, age, and native language from their voices. A conscientious listener would try to judge objectively regardless, so it's possible that these differences in proportions are simply a mirror of the children's performance in the data. In the automatic scores, these same biases (if they were in fact biases) were retained by the baseline recognition method (which used neither prior knowledge of demographics nor listener scores) and the Refined Network (which used both). Even the network that excluded the Underlying features ("no U_t" in Table 4.5) had these same biases - again, without using demographics. There was, however, one exception: the Refined Network did not give significantly higher scores to male or female students. In this one respect the best student model was less biased than both the baseline and human listeners.

    Parents \ Children            r_{t-1}  q_t   ROS   NA   SP   RD   p̂
    U_t  u_t^1  item index           -      0     5     5    5    5    0
         u_t^2  word length          -      1     5     5    5    5    0
         u_t^3  word list            5      5     5     5    5    5    5
         u_t^4  L1                   5      5     5     5    5    5    5
         u_t^5  gender               5      4     5     5    5    5    5
         u_t^6  grade level          5      4     5     5    5    5    5
    r_{t-1}                          -      1     5     5    5    5    0
    q_t                              -      -     5     5    5    5    5

Table 4.6: Using the forward selection procedure outlined in Section 4.5.2, these are the total number of times each of the hypothesized network arcs was selected for the final refined network, over 5 crossvalidation training sets. Rows are parents and columns are children, as in Table 4.3; dashes mark arcs that were not hypothesized.

Just as the listeners agreed with significantly different proportions (p ≤ 0.05) for all demographic dualities in the perceptual experiments of Section 4.3, so did the recognition baseline agree with Listener #1's reference more for native speakers than for nonnative speakers, and more for older students than for younger students. The Refined Network only exhibited a significant difference in agreement between scores for male and female students. Again, the new model proposed in this chapter proved to give in some sense more balanced results: disagreements between it and the reference could not be statistically differentiated along lines of student grade level or native language.

4.7.5 Remaining Questions

A number of lingering questions remain. First, the best results are still below the inter-listener agreement. How can this be improved? In the student model results, there was a dramatic decline in performance with the exclusion of the Evidence or Underlying features. The forward selection procedure on the network structure did not completely exclude any one feature from the model - they all seemed worthwhile in making an automatic reading decision. These findings suggest that the addition of more features - especially along the lines of pronunciation Evidence - can potentially help make the automatic scores more human-like.

What about the automatic elimination of any dependencies that would have required softmax distributions - is there another type of PDF that could be used for those discrete variables that require continuous parents?
Should the parent variables in these situations be discretized, or could the children be made continuous without using softmax functions? Seeing as forward selection did not eliminate these variables entirely, the ones in question must be valuable to the model. Finding a proper way to represent their probability distributions would be another potential source of improved performance.

4.8 Conclusion

This chapter proposed a new student model with a number of unique features for use in automatic scoring of reading skills when demonstrated by isolated words read out loud. First, I explained why this was not simply a pronunciation evaluation or verification task, then I suggested some new features that should be useful in such a model - cues that I expected teachers to use when judging reading ability, like pronunciation Evidence and Underlying information about the child or test. Then I described a hypothesized Bayesian Network structure that would account for the potential conditional dependencies among all these features and reflect the way I expect teachers might conceive of a student's cognitive state when reading. I also proposed a method for automatically refining this network by using a greedy forward-selection of the conditional dependencies.

Experiments on the TBALL dataset revealed that the use of a student model such as this, instead of simple pronunciation recognition or verification, can result in a significant increase in correlation of overall automatic scores with human perception. This spoke to the usefulness of the proposed features based on various pronunciation lexica that illustrated different reading or accent phenomena, as well as the child demographics. It was found that the network refinement algorithm worked as expected, but that it did not choose to exclude very many of the hypothesized arcs to improve performance, which testifies to the sound basis of the initial proposed network. The best network model exhibited fewer potential biases than the baseline or the human scores, and its amount of agreement with the human scores was not as polarized as the baseline on demographic dualities like student grade level or native language.

The methods presented here may seem specific to reading assessment, but could they be used elsewhere with only some minor modifications? To construct a similar student/cognitive model, the designer needs only to specify the Evidence and Underlying features that apply to their given task. Such features can then be united using the network and refinement algorithm presented here, with no real changes necessary. It could potentially be used for other types of assessment and pedagogy (pronunciation training, math tutoring, etc.), or even user modeling for a dialogue system or emotion recognition.

4.9 Epilogue: A Reformulation of Word-level Scoring by way of Articulatory Phonology

4.9.1 Introduction

In [90] and Chapter 3, it was shown that an articulatory representation of speech could be used, along with a more traditional phonetic representation, to automatically discriminate between close phoneme-level errors made by nonnative speakers of English, and also between perceptually confusable English phonemes produced by native speakers - the addition of these pseudo-articulatory models resulted in better performance than with phonetic models alone. Since that study did not work with any real articulatory data (e.g.
derived from magnetometry or magnetic resonance imaging), a new mapping from phonemes to articulatory features based on [52, 76] was proposed, and it was clearly this additional articulatory representation (though artificial) that accounted for the improvement in classification. The use of several streams of Hidden Markov Models representing articulatory configurations seemed especially suited for the task of discriminating between close phonemes, since that representation had more explanatory power than the traditional notion of speech as a sequence of segments - a close error in nonnative production could be represented as a subtle dynamic change across some subset of eight overlapping articulatory streams, rather than as a full-on substitution of an entire phoneme. This also had implications for second-language instruction in that these pseudo-articulatory models could potentially be used to provide feedback to students in concrete anatomical terms - i.e. they could be taught to sound more like a native speaker through specific physical changes to their vocal tract configuration.

However, there were a few obvious limitations to that method. Most importantly, with a set of Hidden Markov Models for articulatory acoustics, the characteristic variants in duration that would probably have improved classification accuracy were not modeled. This method of deriving expected articulatory sequences from phonemes, though it accounted for context and was grounded in phonological theory, assumed very rigid and unrealistically quantized positions of the vocal tract constituents. Finally, the eight derived articulatory streams were assumed to move independently and were decoded as such - this was for the sake of computational simplicity, and didn't reflect the anatomical dependencies among organs (e.g. jaw and tongue) nor the phonological synchronization of these organs when producing linguistic units like phonemes.

For these reasons, a reformulation of this problem from an Articulatory Phonological point-of-view seemed in order. The theory of Articulatory Phonology [13] proposes a set of overlapping articulatory gestures as the fundamental units of speech that, through coordinated action, give rise to larger units such as phonemes and syllables. These gestures specify only a constriction location and a degree of constriction for each vocal tract variable (e.g. tongue tip, tongue body) - beyond that they allow for a degree of physical abstraction that suits acoustic-to-articulatory inversion without true articulatory data, as in this task. As implemented in the Task Dynamics Application (TaDA) model [68], each gesture is associated with an underlying timing oscillator that controls the activation and de-activation of an ensemble of gestures, and the phase relations among these oscillators constitute a model of coordinated planning of the tract variables' constriction tasks. From these hypothesized inter-gestural coupling relations, Articulatory Phonology provides some notion of the durational and coarticulatory effects that are characteristic of native speech - this was one component missing from the work in Chapter 3.

With this new approach I don't intend to replicate Chapter 3's work in segment-level error detection, but rather to estimate a word-level rating of pronunciation quality for nonnative speech. Here a pronunciation rating is defined as a real number that is proportional to the subjective degree of nativeness in a speaker's English production. I propose using Bayesian Inference on a continuous Gaussian hidden variable to estimate this rating.
The abstract perceptual property called "nativeness" manifests itself on many simultaneous levels, and I expect that the insights of Articulatory Phonology will allow for automatic ratings better correlated with listener perception than similar ratings derived from phoneme-level acoustic and duration models like those outlined in [70].

4.9.2 Articulatory Phonology: Background

The primitive phonological unit in Articulatory Phonology is the articulatory gesture, defined as the dynamic constriction action of a distinct vocal tract organ (or tract variable) [13]. These gestures are uniquely specified by constriction degree/location pairs for the lips (LA/LP), tongue tip (TTCD/TTCL), and tongue body (TBCD/TBCL), and by constriction degree descriptions alone for those tract variables that cannot change location: the glottis (GLO) and velum (VEL). For an input sequence of phonemes, the implementation in TaDA [68] will generate the corresponding gestural score, where "score" is used in the same sense as that of a musical transcription - it is an expected constellation of gesture activations across tract variables, derived from a phoneme-to-constriction mapping, with allophonic exceptions based on position in the syllable.

Gestures can overlap within one tract variable or among several tract variables, signifying the presence of multiple simultaneous constriction efforts. The expected overlap between any two gestures is computed by a coupled oscillator model of inter-gestural coordination. Each gesture is activated and de-activated by an oscillator equation with parameters defined by the tract variable, the type of constriction involved, and the syllabic structure; gestures in a constriction degree/location pair share the same oscillator. Phasal relations among these oscillators determine the sequence in which their gestures are activated and hence their overlap in the gestural score. Currently TaDA only allows for 0, 90, 180, and 360 degrees of relative phasing between oscillators. The coupling of these oscillators is the way in which TaDA models the coordination of gestures into larger linguistic units such as phonemes and syllables.

Though Articulatory Phonology was first formally proposed in 1986, it has not been until very recently, with the widespread interest in articulatory modeling of speech, that speech engineers have started to investigate the potential of the articulatory gesture as a viable unit for automatic speech recognition. Two studies have worked with synthetic time-varying physical realizations of gestural scores as generated by TaDA's task dynamic model of inter-articulator coordination. One has automatically estimated these vocal tract time functions based on the corresponding synthetic acoustics [63], and the other has demonstrated automatic estimation of the gestural score based on the synthetic tract variable time functions [104]. As of yet there has been no published work in decoding a gestural score from real speech acoustics, but the idea of inferring articulatory behavior from real acoustics is not new. Studies such as [34] have done this not with gestures but with an articulatory feature-based representation in which each speech frame is described by the discrete values of an ensemble of articulatory features (e.g. manner, place, rounding, nasalization, etc.) rather than through Articulatory Phonology's overlapping gestures.

    native language   speakers   hours   words
    English              39       2.54    4569
    Arabic               18       1.25    2240

Table 4.7: Amount of data used in this epilogue.
The main advantages of Articulatory Phonology over articulatory feature-based models are in its level of physical abstraction, its patterns of coordinated temporal overlap among tract variables, and its capacity to connect cognitive aspects of speech planning with the physical realizations of those planned constrictions.

In assigning ratings to nonnative speech, the expected gestural score generated by TaDA is assumed to be that of a native speaker, and so it can serve as a reference against which all incoming test utterances are compared. Dynamic changes in gestural overlap correspond to acoustic changes in real speech, and the couplings among gestures denote relative activations that are characteristic of native-like coarticulatory timing - these relative gestural onsets can be represented in duration models. Here I follow [104] in using the gestural pattern vector as the unit for encoding the gestural score in acoustic model form. The gestural pattern vector is the pattern of activation across all variables in a gestural score, at any one instant. Encoding the models this way ensured that synchronous gestures (those sharing an oscillator, or with a 0 degree phase relation) would be decoded simultaneously, an improvement over previous work in Chapter 3 in which the articulatory streams were decoded independently. Nonnative speech is hence rated along two time-varying dimensions: the degree to which the acoustics match a gestural pattern vector's acoustic model trained on native speech, and the degree to which the differences in onsets of the coupled gestures fall within native-like distributions.

4.9.3 Speech Corpus

All speech data used in this epilogue comes from the CDT corpus. This consists of a small vocabulary of English words (12 total) spoken in isolation by 24 native speakers of Arabic and 39 native speakers of English. The words were designed to elicit varying degrees of English proficiency, some with intentionally difficult consonant clusters (e.g. "racetrack"). Each word was repeated roughly 10 times per speaker (in a random order), and integer pronunciation ratings on a 1 to 7 scale were elicited from 8 native English listeners for all tokens from 18 of the Arabic speakers. The average inter-listener correlation over all nonnative files was 0.470, and the average correlation between any listener and the median of the other listeners' ratings was 0.627 - these medians were then taken as the reference set of ratings against which to measure automatic performance. Statistics for the size of this data set are given in Table 4.7, while the number of phonemes, gestures, and gestural couplings for the words in this corpus are given in Table 4.8.

4.9.4 Pronunciation Modeling

This section explains the phoneme and gesture models for native English pronunciation used in this work. These consist of acoustic and duration models, as well as pronunciation rating models in which all acoustic and duration measures are combined to synthesize an overall word-level pronunciation rating.

4.9.4.1 Acoustic Models

All acoustic models in this work - whether for phonemes or gestural pattern vectors - were designed as Hidden Markov Models trained on 39-dimensional MFCC feature vectors, with 3 hidden states and 32 Gaussian mixtures per state. The window length was standard (25 msec) and the frame rate was shorter than usual (5 msec) so as to capture very fine changes in gestural overlap. An expected sequence of phonemes for each word came from TaDA's lexicon.
Similarly, an expected sequence of gestural pattern vectors was derived from TaDA's gestural score specification for each phoneme sequence (as explained in Section 4.9.2): the sequence consisted of a concatenation of all regions in the gestural score for which the gestural pattern vector did not change. Vectors resulting from a gestural overlap less than 2 frames long (at TaDA's default 10 msec frame rate) were discarded from the sequence. Similarly, any intra-variable overlap between a release gesture and the gesture immediately preceding it was ignored. See Figure 4.3 for an illustration of deriving the sequence of vectors from the gestural score.

Among the twelve words in the task vocabulary, there were a total of 28 unique phonemes and 115 unique gestural pattern vectors, 96 of which only appeared in one word - 15 appeared in two words, 1 appeared in three words, 2 appeared in four words, and 1 appeared in five of the twelve words. Note that the gestural score does not simply specify more linguistic units per word (see Table 4.8) but maps those units to the words with a different distribution than that of phonemes. For example, though "then" and "thing" share no common phonemes, they have two gestural pattern vectors in common due to their similar articulations.

The CDT corpus has no transcribed segmentations on the phoneme level (and certainly not on the gestural pattern vector level), so all acoustic models were trained using an iterative bootstrap procedure like that described in [26]. After Viterbi decoding of the expected sequence using the trained models, a phoneme-level or gestural vector-level pronunciation quality measure based on these models was defined for phoneme or gesture n as the log-likelihood ratio

A_n = \log\left[\frac{P(O_n \mid M_n)}{P(O_n \mid f_i)}\right]    (4.11)

where P(O_n \mid M_n) is the likelihood of the speech observation given the target HMM, and P(O_n \mid f_i) is the likelihood of the same observed speech given a generic filler HMM. Two filler models, f_p and f_g, were trained - one for phonemes and one for gestural pattern vectors - based on all training data.

    word         phonemes   gestural vectors   coupling pairs
    believe          6             13                 6
    chin             3              7                 3
    drag             4             10                 5
    forgetful        9             19                12
    go               3              6                 3
    paper            5             11                 6
    racetrack        8             16                 7
    then             3              6                 3
    thing            3              6                 3
    typical          7             21                11
    understand       9             18                12
    wood             3              8                 3

Table 4.8: The number of acoustic and duration models required for each of the words in the CDT vocabulary.

4.9.4.2 Duration Models

From the automatic segmentation determined with acoustic models as described in Section 4.9.4.1, duration information from relevant segments was extracted. Two different approaches were investigated: one, D_seg, in which the durations of all segments (whether they represent phonemes or gestural pattern vectors) were measured relative to native distributions of durations; the second, D_coupl, in which only the delays in activation times between all pairs of coupled gestures were compared to native distributions (this applied only when using gestural acoustic models, of course).
Following the method used in [70], the measure of nativeness for segment duration or activation delay n was formally defined as D_n = P(f(d_n) \mid M_n), where d_n is the n-th duration or delay in the sequence, f(·) is the duration normalization function, and M_n is the model for the n-th segment or activation delay. The probability of the duration was modeled as a Gaussian distribution estimated from all native speech for that particular word. Duration normalization was computed using f(d_n) = d_n · ROS_w, where ROS_w is the rate of speaking (in phonemes or gestural vectors per second) for word w.

[Figure 4.3: Gestural score for the word "then." Rows show the tract variables (velum, tongue tip constriction degree and location, tongue body constriction degree and location) over time, with gesture labels such as "closed," "critical," "release," "wide," "alveolar," "dental," "palatal," and "vowel." The sequence of gestural pattern vectors (v86, v90, v88, v93, v85, v111) is shown along the bottom, with each vector assigned an arbitrary number.]

4.9.4.3 Pronunciation Rating Models

To combine acoustic and duration measures in estimating an overall score, a Naive Bayes framework was used. The hidden node, Q_w, representing the 1 to 7 subjective nativeness score for word w, was modeled as the generative parent of two feature nodes: A_w = {A_1, ..., A_N} for acoustic measures, and D_w = {D_1, ..., D_N} for duration measures. All three nodes were modeled as linear Gaussian distributions with diagonal covariances. The inferred value of Q_w was calculated as the mean of the marginal distribution of Q_w given the features: \hat{Q}_w = E[Q_w \mid A_w, D_w]. The cardinalities of A_w and D_w were word-dependent, as each word had a unique number of phonemes, gestural pattern vectors, and coupled gestures (as outlined in Table 4.8); consequently, a unique network had to be trained for each word (using only nonnative examples, to reflect a range of ratings), though the overall structure was identical across all words.

4.9.5 Experiments

Experiments were designed to address the following:

• How do phoneme and gestural models compare in performance to each other and to a baseline pronunciation rating?
• For gestural models, were D_coupl duration measures (activation delay measures based on hypothesized inter-gestural coupling) better than gestural D_seg measures that were coupling-blind?
• How did phoneme and gestural models perform in combination? Was it better than with phonemes alone?

In all cases, performance was evaluated in terms of correlation with the median of the 8 listeners' scores. The baseline word-level rating was taken as \frac{1}{N}\sum_{n=1}^{N} A_n, i.e. the mean of all the phoneme-level acoustic measures in a word, without using Bayesian inference, following [70]. Both phoneme and gestural vector acoustic models were compared, the latter with either coupling duration measures or segment duration measures. All data were trained and tested using a leave-one-speaker-out crossvalidation procedure. Only native English speech was used for training the acoustic and duration models, and only rated Arabic-speaker data was used to train the pronunciation rating model. When combining phoneme and gestural information, the acoustic measures A_w included both phoneme and gestural pattern vector measures, and the duration measures D_w were comprised of both phoneme D_seg duration measures and gestural D_coupl measures (but not gestural D_seg measures).
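A minimal sketch of the D_seg duration measures just described, assuming native duration statistics have already been collected per word; scipy.stats.norm supplies the Gaussian density, and all names here are illustrative assumptions rather than the actual implementation.

```python
from scipy.stats import norm

def duration_measures(durations, ros_w, native_stats):
    """Nativeness measures D_n = P(f(d_n) | M_n) for one word token.

    durations:    segment durations d_n (seconds) from the Viterbi alignment
    ros_w:        rate of speaking for the word (segments per second)
    native_stats: list of (mean, std) of ROS-normalized durations, one per
                  expected segment, estimated from native productions of the word
    """
    scores = []
    for d_n, (mu, sigma) in zip(durations, native_stats):
        f_dn = d_n * ros_w                          # duration normalization f(d_n)
        scores.append(norm.pdf(f_dn, mu, sigma))    # Gaussian native duration model
    return scores
```

The D_coupl measures would follow the same pattern, with the activation-time differences between coupled gesture pairs substituted for the segment durations.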
                              Phoneme acoustic models    Gesture acoustic models
                   baseline     alone      w/ D_seg      alone   w/ D_seg   w/ D_coupl   Combined   listener agreement
    all speakers     0.510      0.500        0.517       0.593     0.591       0.582       0.674          0.801
    Arabic only      0.428      0.463        0.492       0.510     0.513       0.502       0.547          0.627

Table 4.9: Correlation coefficients between automatic and median listener ratings. Entries in bold were significantly better than the baseline with p ≤ 0.05.

Correlation results for different models and feature sets are reported in Table 4.9. The overall automatic scores for each row were simply the concatenation of the automatic scores from each word's individual pronunciation rating model. Entries in bold were significantly better than the baseline with p ≤ 0.05 using a z-test for difference in correlation coefficients. Listener agreement with the median of the other 7 listeners is provided as an upper bound on automatic performance. In the "all speakers" case, the native English speakers were artificially assigned the highest pronunciation score, 7.

4.9.6 Discussion

According to Table 4.9, there are a number of general trends. First, improvements over the baseline are in general only ever achieved either through the use of gestural models, or the combination of phoneme and gestural models - phoneme models were not enough to achieve a significant improvement without duration measures. Using the D_coupl duration measures was not significantly better or worse than using the D_seg measures with the gestural models, and neither one was statistically worse than just using the gesture acoustic measures alone, with no duration measures. In both populations, the best result came from combining both phoneme and gestural acoustic measures along with both phoneme D_seg and gesture D_coupl duration measures.

All results fall significantly below the inter-listener agreement upper bound. However, the agreement for "all speakers" is artificially inflated since it includes many of the native English examples for which all listeners' scores were assumed to be 7 (and so there could be no disagreement). Because the performance trend is similar in both populations, this indicates that the scoring model is capable both of assigning an appropriate range of scores to nonnative speech, and of giving high scores to native speech.

In a word-dependent analysis of the results, the words with the poorest performance overall - "go" and "then" - were also the ones with the lowest inter-listener agreement, and were also among the shortest words in the set and therefore the most difficult to judge due to a dearth of evidence on which to base a rating. However, other short words like "wood" and "thing" did not show this effect, and so perhaps some of the words were more difficult for Arabic speakers to pronounce, or for English-speaking listeners to rate.

4.9.7 Conclusion

This epilogue has shown the usefulness of the articulatory gesture as a unit for automatically assigning subjective ratings to both nonnative and native English. With gestural acoustic and duration models, improvements were demonstrated both separately and in combination with phoneme models, as in previous work using pseudo-articulatory features [90]. Gestural vectors do have the disadvantage of demanding many more acoustic models than phonemes; over a larger vocabulary, the number of unique vector models to train might become intractable. Future work is needed to properly incorporate Articulatory Phonology's inter-gestural coupling as a cue to nativeness - no improvement was seen through its addition in this case.
These findings still suggest that Articulatory Phonology is a promising avenue not just for pronunciation rating, but for speech recognition as well.

Chapter 5: Phrase-level Intonation Scoring

5.1 Acoustic Models

5.1.1 Introduction

Why might one expect intonation - the patterns of pitch in speech - to inform automatic assessment of an English learner's pronunciation quality? Though English is not a tone language like Mandarin (i.e. one in which intonation can determine the meaning of an isolated word), intonational variation in English conveys a wide variety of information. The placement and choice of pitch accents and boundary tones - manifested through the shape and range of the fundamental frequency (f0) contour - is well-correlated with speaker intentions and listener perception on the phrase level. Boundary tones within or at the ends of phrases delineate syntactic units of various sizes, offering the listener hints for appropriate processing and interpretation [93]. Pitch accents within a phrase can intimate if the information offered is new, contrastive, accessible, or uncertain [95]. Even a speaker's emotional state can be inferred from phrase-level suprasegmental features [14], and listeners can discern a speaker's regional accent from hearing the intonation alone in filtered speech [43]. It follows that the extent to which an English learner sounds native must be due partly to their intonation, and that intonation modeling is potentially useful for automatic score generation in a second-language practice environment.

Perhaps the most complete study in using text-independent intonation-based features to generate pronunciation scores was reported by Teixeira et al. in [85]. The proposed solution was to derive many text-free features from the phrase-level f0 contour, and then train a decision tree to assign each feature vector an integer score on a 1 to 5 scale. Relying only on these pronunciation scores as training class labels, this method used no prosodic annotation, which requires expert linguistic knowledge and suffers from low inter-annotator agreement. It also allowed a certain versatility in the selection of features and the size of the feature set, though the linguistic relevance of many of the features is not clear since they are chosen ad hoc. Ultimately, features derived from alignment of the text and from other sources resulted in automatic scores better correlated with listener scores than the text-free intonation-based features did. The modest performance in scoring pronunciation based on intonation in [85] is perhaps due both to the ad hoc choice of features and the fact that the models trained represent uncertain perceptual scores rather than relevant linguistic units of intonation (i.e. pitch accents and boundary tones).

Previous work in [87] introduced improved intonation-based scoring by training Hidden Markov Models for categorical intonation units on continuous f0 and energy contours from native speech. These units could then be decoded from nonnative speech in the same way that words commonly are, to estimate scores for how well the nonnative features fit the native models. In this chapter I expand on that work by investigating several new methods for improving HMM-based intonation models for the pronunciation evaluation task, and by reproducing the baseline decision tree scoring procedure from [85] for comparison on this dataset.
My hypothesis (and main finding) is that significant improvements in automatic score correlation with listener perception can come from the use of linguistic theory about intonation and prosodic structure, in the form of proper f0 processing, meaningful feature sets, theoretical grammars for recognition of intonation events, and accounting for contextual effects. Intonation is only one factor contributing to a listener's qualitative assessment of phrase-level pronunciation, so improvements in this domain are expected to result in correlation still well below that of agreement among listeners for scoring pronunciation in general.

5.1.2 Corpora and Annotation

The nonnative speech in this work comes from the ISLE corpus of English learners [3]. It consists of read sentences by native speakers of Italian and German, at various levels of British English proficiency. This corpus was divided into training and test sets, each with an equal number of German and Italian speakers per set, and no speakers in common between them. Information about their relative sizes is given in Table 5.1. The sentences in these sets were scored by one native listener for overall pronunciation (taking into account intonation and all other cues) on a 1 to 5 scale, as in [85]. To measure inter-annotator agreement, 138 sentences from this corpus were scored by five other native listeners. Average inter-labeler correlation was 0.657, which can be considered an upper bound on all automatic scores' performance. The one native listener who scored all sentences had a correlation of 0.732 with the medians of the other five listeners' scores; since this exceeds the inter-labeler agreement, that listener's scores can be considered a reliable reference.

                         native train set   nonnative train set   nonnative test set
    corpora                 BURNC, IViE            ISLE                  ISLE
    total speakers               7                   8                     8
    total sentences           3657                1238                   307
    total minutes              80.2                82.8                  25.5

Table 5.1: Relative sizes of the training and test sets.

For training native models of intonation events, the Boston University Radio News Corpus [71] and the IViE corpus [32] were used. The former consists of read news by American English speakers, transcribed for intonation using the ToBI system; the latter uses a similar transcription convention for its read southern British English. Previous work in [87] has shown that, due to broad similarities in tone realization (if not placement), both dialects can be used together for training categorical prosodic models for nonnative pronunciation evaluation. Because of low transcriber agreement for some of the finer tone categories, and to reconcile minor differences in the two transcription conventions, all sub-categories within both accents and boundaries were collapsed into only two: high and low (i.e. high and low pitch accents, high and low phrase boundaries, etc.). Intonational "silence" was inserted into the transcripts at the start and end of each phrase, and at every ToBI break of 3 or higher.

5.1.3 Baseline: Decision Tree Score Models

For comparison with the proposed improvements in Section 5.1.4, I reproduced the method in [85] of pronunciation score modeling with decision trees trained on text-independent features derived from the f0 contour. The f0 contour of each sentence in the nonnative training and test sets was estimated using the standard autocorrelation method with a frame size of 10 msec. The curves were then smoothed as described in [46] using a 5-point median filter (i.e. each frame's value was replaced by the median of itself and that of the four frames immediately surrounding it) to minimize any phonetic segment effects, and then log-scaled.
Then a piecewise linear stylization was fitted to each voiced segment. Using this processed representation of the f0 contour, for each sentence the same 23 "pitch signal" features used in [85] were derived, mainly related to f0 range and slope. Note that, though all these features are text-independent (i.e. they can be derived from any spoken phrase without accounting for differences in lexical items or sentence structure), they are not entirely text-free. The rate of speaking (ROS) used for normalizing some of the features is derived from alignment of the text, though unsupervised methods of estimating ROS exist and could potentially be used here.

A decision tree performs classification by asking of a feature vector a sequence of questions which can be thought of as "branches" of the tree. The results of the final branches in the question sequence (called the "leaf" nodes of the tree) represent subsets of the training data, and one can use these subsets to obtain estimates of the probability of each class for every leaf. As it relates to pronunciation scoring, the posterior probability of the human score h (integers from 1 to 5) given the feature vector f = {f_1, f_2, ..., f_22} can be estimated as P(h \mid f) \approx P(h \mid l_f), where l_f is the leaf node that the questions asked of f lead to. In [85], these posteriors are used to estimate pronunciation scores as continuous values based on the minimum error criterion,

E[h \mid l_f] = \sum_{i=1}^{5} h_i \cdot P(h_i \mid l_f)    (5.1)

The study in [85] evaluated these automatic scores in terms of correlation with listener scores and the mean absolute error between the automatic and listener scores (as a percentage of the maximum possible error, which is 4). The optimal tree is grown as in [70] by specifying the minimum size of a leaf subset that maximizes correlation on the test set, and then for that minimum subset size choosing a pruning confidence factor that again maximizes correlation. Baseline results for this optimal tree are given in Table 5.2 (row 2), alongside results for random scores generated using the same proportions of 1 to 5 scores found in the nonnative train set and averaged over ten random realizations. These best decision tree results in Table 5.2, though better than random, are lower than those obtained on the corpus used in [85] (they reported 0.247 correlation, 25.4% error), but their listener agreement (0.8 score correlation) was higher than for the test set used here.

5.1.4 HMM Intonation Models

This section explains the intonation modeling and score generation paradigm, as a contrast to the purely score-based models of the baseline method in Section 5.1.3. Some potential improvements inspired by prosodic theory and intonation modeling in Mandarin are then described, and experimental results are presented.

5.1.4.1 Score Generation with HMMs

Native HMMs were trained for eight different intonation units: high and low pitch accents (H* and L*), high and low intermediate phrase boundaries (H- and L-), high and low phrase-final boundaries (H% and L%), a high initial boundary tone (%H), and a silence model (SIL). Without needing to know the text spoken, these units can be decoded from a nonnative speaker's suprasegmental features to estimate scores that represent how native the speaker's intonation sounds.
My approach is simply to estimate posterior probabilities of each decoded tonal unit and then take their product over the phrase to estimate a phrase-level posterior score. This is similar to the method of scoring with phoneme models used in [70]. See [87] for the models' accuracy in tone label recognition. Assuming a bigram model of intonation, each decoded unit's posterior is calculated as

P(M_t \mid O_t, M_{t-1}) = \frac{P(O_t \mid M_t)\,P(M_t \mid M_{t-1})}{\sum_n P(O_t \mid M_n)\,P(M_n \mid M_{t-1})}    (5.2)

where O_t is the speech observation in suprasegmental features at time t, M_t is the recognized unit, P(M_t \mid M_{t-1}) is its bigram probability given the previous segment, and n takes values over all HMMs. In the case of an unweighted tone network (as will be seen in certain experiments), P(M_t \mid O_t, M_{t-1}) reduces to P(M_t \mid O_t), and P(M_t \mid M_{t-1}) reduces to its prior, P(M_t). Then an overall utterance score \rho is approximated as the product of the posteriors of the T decoded intonation segments:

\rho = P(M_1, M_2, \ldots, M_T \mid O) \approx \prod_{t=1}^{T} P(M_t \mid O_t, M_{t-1})    (5.3)

To make these scores comparable with those of the baseline method, all log-posteriors for the test set were normalized so that their mean and variance matched that of the 1 to 5 scores in the nonnative train set. Then all scores below 1 were set equal to 1, and all scores above 5 were set equal to 5. It is possible that other score calibration heuristics might give better results, but this one was simple and required no further training. All context-independent (CI) HMMs were trained with a flat-start initialization and five iterations of embedded re-estimation. Three hidden states and 16 Gaussian mixtures per state were arbitrarily chosen, though these parameters could be optimized on a development set. A bigram tone recognition model and three features - f0, plus its first and second derivatives (estimated with the standard regression formulae) - were used for all initial experiments (rows 3-6 in Table 5.2).

5.1.4.2 Mandarin-style f0 Preprocessing

Studies in automatic recognition of Mandarin speech have made good use of f0-related features, because of the importance of tone in lexical disambiguation. There is currently no standard method of preprocessing or normalizing estimates of the f0 contour (or at least nothing akin to the MFCCs of spectral features), but studies of Mandarin have developed a number of techniques aimed to compensate for variations in raw f0 [42]. Some of them are used in the baseline approach's preprocessing step (e.g. median smoothing to remove discontinuities caused by segment-level production). When working with HMM models trained on continuous f0 contours, it helps to interpolate regions of unvoiced speech - areas which would in theory have had voiced intonation if the phonemes of the utterance had been different. Here simple linear interpolation was used. Next, instead of only log-scaling the frequency contour, it was converted to the ERB scale, to better match the human ear's perception of frequency. Gradual declination in f0 over the course of a phrase is common in many languages, including both English and Mandarin. Normalization for declination was done by subtracting from each f0 frame the mean within a 1.5 second window. Finally, 7-point median filtering was performed to smooth out discontinuities. Row 3 in Table 5.2 presents results for CI intonation HMMs using the baseline preprocessing explained in Section 5.1.3; the correlation results are comparable to those obtained using the decision tree scoring method.
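Before turning to how this preprocessing changes the results, here is a minimal sketch of the f0 processing chain just described (interpolation through unvoiced regions, ERB scaling, declination removal, and 7-point median smoothing). The specific ERB-rate formula (Glasberg and Moore), the frame rate, and the function names are illustrative assumptions, not necessarily the exact implementation used here.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess_f0(f0, frame_rate=100, decl_window_sec=1.5):
    """f0: raw pitch track in Hz, one value per frame, 0 where unvoiced."""
    f0 = np.asarray(f0, dtype=float)

    # 1. Linear interpolation through unvoiced regions (assumes some voiced frames).
    voiced = f0 > 0
    f0 = np.interp(np.arange(len(f0)), np.flatnonzero(voiced), f0[voiced])

    # 2. Convert Hz to an ERB-rate scale (Glasberg & Moore approximation).
    erb = 21.4 * np.log10(1.0 + 0.00437 * f0)

    # 3. Remove gradual declination: subtract a running mean over ~1.5 seconds.
    win = max(1, min(int(decl_window_sec * frame_rate), len(erb)))
    erb = erb - np.convolve(erb, np.ones(win) / win, mode="same")

    # 4. 7-point median filtering to smooth out discontinuities.
    return medfilt(erb, kernel_size=7)
```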
Row 3 in Table 5.2 presents results for CI intonation HMMs using the baseline preprocessing explained in Section 5.1.3; the correlation results are comparable to those obtained using the decision tree scoring method. Correlation improves dramatically with the benefit of techniques first used for Mandarin (row 4). This indicates that the f0 preprocessing of the baseline method (row 2) is not ideal for HMM intonation models.

5.1.4.3 Additional Features

So far, the only features used have been the f0 contour and its first and second derivatives. This was simply for comparison with the baseline method, which used only features derived from f0. However, some studies (such as [50]) have emphasized the importance of energy in the perception of pitch accents. Experiments in automatic recognition of tones in Mandarin have found improvement through using MFCCs along with the f0-related features [92]. Could the addition of these features improve intonation modeling in English? Rows 5 and 6 in Table 5.2 report the results. Some modest improvement in correlation is seen with the use of RMS energy (and its first and second derivatives), but performance declines dramatically once MFCCs are introduced. This suggests that intonation units in English are not realized through contrasts in spectral characteristics so much as through suprasegmental features like f0 and energy, though feature weighting could offer some improvement.

5.1.4.4 Intonation Grammars

In previous work in [87], I proposed two types of intonation grammars for proper recognition of tone units before scores could be calculated. The better of these two was a bigram model: based on the transcripts, it estimated the probability P(M_t|M_{t-1}), where M_t is the current tone unit and M_{t-1} is the unit immediately preceding it. The other, poorer-performing finite-state grammar (FSG) was an unweighted network that decoded intonation models based on theories of prosodic structure in an English phrase [102]. At that time, the poor performance of this theory-based grammar seemed to indicate that FSGs for native intonation could not apply to nonnative speech. However, those acoustic models did not then include intermediate phrase boundaries (H- and L-) as they do now. The new theory-based FSG dictated by the current choice of models requires SIL at the beginning and end of each utterance, allows an optional initial high boundary (%H) after the initial SIL, and then decodes zero or more intermediate phrase sequences:

<H*|L*>(H-|L-)[SIL]

followed by a required phrase-final sequence:

<H*|L*>(H-|L-)(H%|L%)

where square brackets denote optional elements, angle brackets denote one or more repetitions, and vertical lines mean "or". For the sake of comparison, a simple "tone loop" grammar that allowed for any sequence of tone events was also evaluated, making use of no statistical or theoretical information. The two non-bigram models assumed unweighted arcs for all decoding paths.

Results for these grammars are compared in Table 5.2. Both the bigram model (row 5) and the new theory-based FSG (row 8) performed better than the simple "tone loop" grammar (row 7), showing that decoding can benefit from knowledge of linguistic theory or corpus statistics. Since there was no statistically significant difference in correlation performance between the bigram and the theory-based FSG, all subsequent experiments were conducted with the latter, since it requires no training and is more in accord with prosodic theory than the bigram model - i.e. it conceives of intonation within a phrase as a whole sequence of tones consistent with the structure of prosodic information, rather than each tone depending only on the one immediately preceding it.
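The structure of this theory-based grammar can be illustrated with a regular expression over the decoded tone labels, as in the sketch below. This is only an illustration of the grammar's form, not the recognition network actually used for decoding; the label spellings are assumed to match the model names above.

import re

# <H*|L*>(H-|L-)[SIL] : one or more accents, an intermediate boundary, optional silence
INTERMEDIATE = r"(?:(?:H\* |L\* )+(?:H- |L- )(?:SIL )?)"
# required phrase-final sequence: accents, an intermediate boundary, a final boundary tone
FINAL = r"(?:H\* |L\* )+(?:H- |L- )(?:H% |L% )"
PATTERN = re.compile(r"^SIL (?:%H )?" + INTERMEDIATE + r"*" + FINAL + r"SIL$")

def accepted(tone_sequence):
    """True if a list of decoded tone labels fits the theory-based FSG."""
    return bool(PATTERN.match(" ".join(tone_sequence)))

print(accepted(["SIL", "%H", "H*", "H*", "L-", "H*", "L-", "L%", "SIL"]))  # True
print(accepted(["SIL", "H-", "H*", "SIL"]))                               # False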
5.1.4.5 Context-Dependent Tone Models

Again making use of ideas first proposed for Mandarin, new HMMs were trained that were dependent on their context. For Mandarin, tone models are often trained based on the syllable over which they occur [92]. This is appropriate for Mandarin, since most words are one syllable in length and a single tone is realized over the rhyme of each syllable. For English, however, the domain of the pitch accent is the prosodic foot [40], consisting of one accented syllable and all following unaccented syllables until the next accented syllable is reached. So, instead of conditioning the models on syllable- or phoneme-level context, they were conditioned on their left and right intonation contexts. For example, a low accent surrounded by two high accents (H*-L*+H*) would be trained as a different acoustic model than a low accent surrounded by low accents (L*-L*+L*). This compensates for tone sandhi [52] - the change in tone realization based on its prosodic context, often important phonologically in tone languages like Mandarin.

To train these context-dependent tone models, similar HMM states were tied together using the decision tree-based model clustering method normally used for training CD phoneme models. The "questions" for the decision tree to test in determining a model's cluster were related to groupings of tone contexts (e.g. right tone is high, right tone is a boundary, etc.). All contexts unseen in the training data were then synthesized using the tree that maximized the likelihood of the native speech given the models. As with the context-independent HMMs, these models had 3 hidden states and 16 mixtures per state. Decoding tone events was still done using the context-independent models and the theory-based finite-state intonation grammar explained in Section 5.1.4.4; this avoided training an overly complex bigram model for context-dependent tones. The context-dependent models were then used in calculating the posterior scores by assuming that the decoded context was correct and limiting the denominator sum \sum_n P(O|M_n)P(M_n) in Eqn. 5.2 to only those tone models that share the same context. For example, if a decoded tone unit had L* decoded on its left and H* decoded on its right, then M_n could only take the form of L*-H*+H* or L*-L%+H* or L*-SIL+H*, etc.

Results for the context-dependent (CD) models are shown in Table 5.2, row 9. They are the best models presented in this work, and outperform the baseline decision tree method by 16.2% correlation and 1.7% error. For pronunciation scoring, correlation is a more relevant metric than mean absolute error, since giving meaningless flat scores near the test set's mean for all items could lead to very low error but also poor correlation. Using the standard statistical test for the difference between two correlations, the difference between row 9 and row 2 is significant with p ≤ 0.001.

modeling method                               corr.    error
(1) random scores                            -0.002    38.5
(2) baseline decision tree scores             0.156    29.1
(3) CI HMMs: preprocessing from (2)           0.153    34.1
(4) (3) w/ Mandarin preprocessing             0.304    31.2
(5) (4) + E and ΔE and ΔΔE                    0.320    29.4
(6) (5) + 12 MFCCs (+ Δ and ΔΔ)               0.054    35.7
(7) (5) w/ tone loop grammar                  0.223    32.2
(8) (5) w/ theory-based FSG                   0.318    29.9
(9) (8) w/ CD HMMs                            0.398    27.4

Table 5.2: Correlation and error of scores derived from different proposed modeling methods.
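The context-restricted posterior described above can be sketched as follows. The log-likelihoods and priors passed in are placeholders for the values that would come from the context-dependent HMMs and the tone grammar, so this is an illustration of the normalization only, not the decoder itself.

import math

def cd_posterior(loglikes, priors, decoded, left, right):
    """
    loglikes: {(left, center, right): acoustic log-likelihood of a CD tone model}
    priors:   {(left, center, right): prior or transition probability}
    decoded:  center label of the decoded unit; left/right: its decoded context
    """
    # limit the denominator of Eq. 5.2 to models sharing the decoded context
    same_context = {k: v for k, v in loglikes.items()
                    if k[0] == left and k[2] == right}
    denom = sum(math.exp(v) * priors[k] for k, v in same_context.items())
    key = (left, decoded, right)
    return math.exp(loglikes[key]) * priors[key] / denom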
5.1.5 Conclusion

With the baseline method from [85], the optimal decision tree for scoring English learners' pronunciation based on intonation-based features achieved only 0.156 correlation with listener scores on this test set. Using HMMs representing intonation units, plus linguistic theory about prosodic structure in English and methods already developed for robust Mandarin modeling, that correlation was raised to 0.398. This method had the disadvantage of requiring specialized intonation transcripts rather than simple phrase-level pronunciation scores, but it needs no prior knowledge of the target text and could potentially be used to assess intonation in spontaneous speech. As expected, the correlation of these best intonation-based scores still falls well below the inter-listener agreement in deciding pronunciation scores based on intonation and all other available cues. Future work is needed to combine these intonation-based assessments with those of pronunciation cues on other time-scales.

5.2 Prosodic Structure Models

5.2.1 Introduction

Intonation - the pattern of pitch in speech - progresses sequentially, in a way. If one takes intonation to be made of categorical prosodic events (the pitch accents and boundary tones that convey linguistic information through suprasegmental cues such as pitch), then one can imagine one pitch accent following another until an intermediate or final boundary tone intervenes, marking a division in the phrase. This is the model on which a prosodic transcription system like ToBI (Tones and Break Indices) is based [80] - accent and boundary phenomena are modeled not so much as "beads on a string" (as with phonemes), but more like notches on a stick; they are transcribed as discrete events occurring in sequence, denoted by the instants in time perceived as their centers.

Is this really the best way to look at intonation? Consider two phrases, one ending in a low boundary and the other ending in a high one. Anticipation of the boundary motion in each could result in contrasting phrase-level frequency contours, affecting the shape of any within-phrase pitch accents. This is the basis for a superpositional model like Fujisaki's [29], in which the frequency contour is decomposed into the summation of tone components on the phrase and accent levels. Other theories focus on this nested hierarchy in intonation [41], employing tree structures to schematize the way multiple levels of information are superimposed, and to explain the coordination and synchronization among different scales of prosodic units.

In recent years, tree grammars for sentence syntax have shown some promise in structure-based language modeling for text translation and speech recognition [18, 19]. Tree grammars are capable of capturing long-term context that n-gram models would miss, and are versatile in their modeling of an entire tree as a context-dependent set of subtree structures. Even so, their use has been limited - the ordinary left-to-right decoding of most speech recognition frameworks has favored simpler n-gram language models (especially for real-time processing), and training tree grammars requires expert part-of-speech annotation of sentences. However, tree grammars of prosody rather than syntax do seem well-suited for modeling the sort of structure hypothesized in linguistic theory. The symbol set is small compared to words or part-of-speech tags (potentially requiring less training data), and prosodic tree structures can be derived directly from more common sequential ToBI transcripts.
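To make the superpositional idea mentioned above concrete, the toy sketch below builds a log-f0 contour as the sum of a slowly declining phrase component and localized accent bumps. It is only an illustration of superposition, not Fujisaki's actual filter equations; the shapes and constants are invented for the example.

import numpy as np

def toy_superposition(n_frames, accents, phrase_slope=-0.3, accent_width=20):
    """accents: list of (center_frame, amplitude) pairs for pitch accents."""
    t = np.arange(n_frames)
    contour = phrase_slope * t / n_frames           # phrase-level declination
    for center, amp in accents:
        # raised-cosine bump standing in for an accent-level component
        d = np.clip((t - center) / accent_width, -1.0, 1.0)
        contour = contour + amp * 0.5 * (1.0 + np.cos(np.pi * d))
    return contour  # log-f0 relative to an arbitrary baseline

contour = toy_superposition(200, accents=[(40, 0.2), (120, 0.15)])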
This section intends to answer two essential questions: given a set of prosodic tags over an utterance, how can their joint probability be estimated, accounting for all interactions and dependencies? And, secondly, can a tree-based model provide "better" probability estimates than a sequential model?

Perhaps a better question to start with is: why would anyone want to do this at all? With an estimate of the probability of a set of prosodic tags, the best set of tags can potentially be chosen by searching over all possible sets. In speech synthesis this means, for a string of words, the ability to choose the best prosodic structure to match those words, so that synthesis can sound more natural, with the best pitch accents and boundaries in appropriate places [39]. Using these grammars to decode prosodic tags from acoustic-prosodic features can help to resolve ambiguities in decoding words, or to tag dialog acts for improved speech understanding [79]. They should also be useful in estimating a prosodic pronunciation score for nonnative speakers practicing English as a foreign language - once decoded, a set of prosodic tags common in native speech would receive a higher score than those not common in native speech [87]. Any task that uses categorical prosodic tags (like those in the ToBI system) is a potential application for these tree grammars.

This work starts off by giving some background on linguistic theories of tree structure in prosody. Then details about data preparation are explained. Next is a short review of tree grammars in general, as they relate to this work. The following section describes some experiments comparing tree-based models with sequential n-gram models. Finally, I draw some conclusions about the benefits of using tree structures for this task.

Figure 5.1: A prosodic tree derived from sequential ToBI transcripts and syllable segmentation of the BURNC [71], for the phrase "Chief Justice of the Massachusetts Supreme Court". The four tiers of the tree (syllabic stress, pitch accent, intermediate phrase boundary, final phrase boundary) represent prosodic units on increasingly large time-scales, each with a unique linguistic meaning. The ASCII representation of the top three tiers would be TOP(L%(L-(H* H*) L-(NULL NULL H* H* H*))). The text is not part of the tree structure but is included for illustration. Note that the pitch accents' bounds match the foot rather than the word, following [41].

5.2.2 Trees in Prosody

For many years linguists have organized the syllables of English into a hierarchy of prosodic units on various levels, each corresponding to a unique time-scale and function as information [41]. Many claim that the fundamental rhythmic/melodic unit of English prosody is the foot (a term borrowed from studies of poetry), consisting of a stressed syllable and all subsequent unstressed syllables before the next stress. Stress is a syllabic prominence realized through increased duration, energy, or pitch, and is used to mark lexical contrasts between words (e.g. "contract"). On the level of the prosodic foot, pitch accents - manifested through changes in pitch and energy - are perceptually relevant to discourse, denoting whether the information offered is new, contrastive, accessible, or uncertain [95]. Above that, phrase boundary tones on the intermediate and final
levels act like prosodic punctuation, offering the listener cues for interpretation and syntactic processing [93]. All of these levels of information are present simultaneously in the natural pitch and energy, and their organization is schematized well as a tree structure to illustrate the way the units nest and co-occur.

Uses of these hierarchical theories in computational models of prosody have been relatively rare. A few studies in syntax-based language modeling have combined syntactic and prosodic trees, with improvements in predicting boundary locations [39, 62]. One study in French prosody used a tree structure of an entirely different kind, based on syllable grouping according to pitch range and slope [77]. Generally, most modeling of intonation has remained sequential rather than tree-structured, in line with the ToBI system [80] that seems to dominate prosodic transcription.

5.2.3 Corpus Preparation

All prosodic tag data in this work come from the ToBI annotation of the Boston University Radio News Corpus (BURNC) [71], which consists of read news reports by professional radio announcers - the intonation they employ is highly regularized and is representative of a generic standard for American read speech. In addition to ToBI labels, the BURNC has transcripts for syllable-level stress as well as syllable boundary locations. All transcripts from one speaker were divided into training, development, and test sets; their sizes can be found in Table 5.3.

                train set   dev set   test set
total tags        37792       4973       4685
total trees        3756        493        465
total minutes      62.0        8.1        7.7

Table 5.3: Sizes of the training, development, and test sets. "Tags" refers to the prosodic symbols in the transcripts. "Trees" means complete four-level prosodic trees.

To define tree structures, I needed to align all levels of annotation with shared beginning and ending boundaries, though in the transcripts only the syllable boundaries were defined. The end of an intermediate phrase was defined as the end of the syllable in which the boundary tone's center was transcribed; its beginning was either the beginning of the utterance or the end of the previous intermediate phrase. Full phrase boundaries were determined the same way, and since full boundaries require a concurrent intermediate boundary, this syllable demarcation synchronized the two boundary levels and established the tree structure between them. Following [41], it is assumed that a pitch accent in English extends throughout its prosodic foot. I defined the foot as beginning with the syllable in which the accent's center was transcribed, which was ordinarily the foot's stressed syllable (but occasionally fell on an unstressed syllable in the BURNC). The end of that foot was then defined as synchronous with the beginning of the next stressed syllable, intermediate phrase, or silence. All leftover unaccented feet within a phrase were assigned a NULL accent tag. This established the nesting of stressed or unstressed syllables within a pitch-accented prosodic foot, and of the pitch accents within the intermediate phrases. See Fig. 5.1 for an example of a derived tree. The top three tiers, when put in sequential order, would be {H* H* L- NULL NULL H* H* H* L- L%}.

The full and intermediate phrase boundaries can each take high and low symbols - {H%, L%} and {H-, L-}, respectively - and because of low inter-annotator agreement for fine-grained pitch accents, these were distilled down to two categories, H* and L*, in addition to the accentless NULL. Similarly, all syllable stress labels were binary, STR or UNS, with secondary stress within a word considered simply STR. Any transcribed silence between phrases was assigned the tree TOP(SIL1(SIL2(SIL3(SIL4)))), in keeping with the four tiers of sequential symbols within the tree. Rarely, some of the accent or boundary tone labels were unspecified in the transcripts due to uncertainty on the transcriber's part - these were mapped to the most common tags, H* and L-, respectively.
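Since the sections that follow treat each prosodic tree as a nested set of production rules, it may help to show how the bracketed ASCII form given in the caption of Fig. 5.1 can be read into a nested (label, children) structure. The small parser below is illustrative only and assumes well-formed input.

def parse_tree(s):
    """Parse e.g. 'TOP(L%(L-(H* H*) L-(NULL NULL H* H* H*)))' into nested tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        label = tokens[pos]
        pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1                      # consume '('
            while tokens[pos] != ")":
                children.append(parse())
            pos += 1                      # consume ')'
        return (label, children)

    return parse()

tree = parse_tree("TOP(L%(L-(H* H*) L-(NULL NULL H* H* H*)))")
# ('TOP', [('L%', [('L-', [('H*', []), ('H*', [])]), ('L-', [...])])])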
5.2.4 Regular Tree Grammars

A Probabilistic Context Free Grammar (PCFG) specifies a set of terminal and nonterminal symbols for which, beginning with a starting symbol, a sequence of tree production rules for replacing the nonterminal symbols can be performed, each with an associated probability [60]. The probability of a tree T is then the product of the probabilities for all n production rules α → β that generated it,

P(T) = \prod_{i=1}^{n} P(\alpha_i \to \beta_i \mid \alpha_i)        (5.4)

This of course assumes all rules (and all subtrees generated by those rules) are independent, allowing for the versatility of modeling a larger tree implicitly through a sequence of smaller tree production rules. A Weighted Regular Tree Grammar (WRTG) is a finite-state acceptor of PCFG trees, representing all nonterminal symbols through states in the recognition network. Probabilities of production rules (i.e. state transitions through the WRTG) are estimated from the training data as

\hat{P}(\alpha \to \beta \mid \alpha) = \frac{Count(\alpha \to \beta)}{Count(\alpha)}        (5.5)

where Count(α → β) and Count(α) are the occurrences of the production rule α → β and the symbol α, respectively.

In the case of the prosodic trees being modeled here, it is clear that the starting symbol is TOP, and the only terminal symbols are {STR, UNS, SIL4}. Unlike the parse trees for part-of-speech tags, these prosodic trees are simpler in that the nonterminal symbols are not recursive. No nonterminal prosodic tag can produce itself in the tree the way that certain parse tags (like the noun phrase, NP) can - for example, there can never be an H* inside of an H*, but there can be noun phrases inside of noun phrases.

Equation 5.4 defines how P(T) is calculated when the tags in T are arranged in a tree structure. When taken sequentially, a traditional n-gram model estimates P(T) as

P(T) = P(t_1, \ldots, t_{|T|}) = \prod_{m=1}^{M} P(t_m)        (5.6)

where t_m is one of M parallel but independent sequences of symbols in the set T (e.g. tiers of the tree). A separate n-gram model for each sequence defines P(t_m) as

P(t_m) = P(s_m^1, \ldots, s_m^Q) = \prod_{q=n}^{Q} P(s_m^q \mid s_m^{q-1}, \ldots, s_m^{q-n+1})        (5.7)

where q is the symbol index in the sequence, and n is the order of the n-gram model.

5.2.5 Training and Experiments

The main thing I wanted to learn from this study was whether models for prosodic tags can be improved by using tree grammars instead of sequential n-gram models. "Improvement" will be measured using perplexity (PPL), the standard metric for comparing two different language models' abilities to assign high probabilities to previously unseen strings of symbols or words. Ultimately, classification or detection error rates offer a truer comparison, but PPL is a useful metric for estimating relative model performance. In this case, the PPL is defined as

PPL = 2^{-\log_2\{P(T)\}/|T|}        (5.8)

where P(T) is the probability the model assigns to the set of tags T, and |T| is the number of tags in the set.

               n-gram order                          trees
          1      2      3      4      5         PI      PD
dev      2.72   2.53   2.49   2.48   2.49      2.29    2.27
test      -      -      -     2.41    -         -      2.27

Table 5.4: PPL results for the 4-tier setup, over dev and test sets.
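A schematic sketch of the relative-frequency estimate in Eq. 5.5, the rule product of Eq. 5.4, and the perplexity of Eq. 5.8 is given below, operating on trees in the nested (label, children) form used in the earlier sketch. Smoothing of unseen rules is omitted, and the toy tree at the bottom is invented, not BURNC data.

from collections import Counter
import math

def rules(tree):
    """Yield (parent, tuple of child labels) production rules from a nested tree."""
    label, children = tree
    if children:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from rules(child)

def train_wrtg(trees):
    """Eq. 5.5: count(rule) / count(parent symbol as a rule's left-hand side)."""
    rule_counts, parent_counts = Counter(), Counter()
    for t in trees:
        for parent, rhs in rules(t):
            rule_counts[(parent, rhs)] += 1
            parent_counts[parent] += 1
    return {r: c / parent_counts[r[0]] for r, c in rule_counts.items()}

def tree_log2prob(tree, rule_probs):
    """Eq. 5.4 in the log domain (raises KeyError for unseen, unsmoothed rules)."""
    return sum(math.log2(rule_probs[r]) for r in rules(tree))

def perplexity(log2_prob, num_tags):
    """Eq. 5.8: PPL = 2 ** (-log2 P(T) / |T|)."""
    return 2.0 ** (-log2_prob / num_tags)

toy = ("TOP", [("L%", [("L-", [("H*", [("STR", []), ("UNS", [])])])])])
probs = train_wrtg([toy])
print(perplexity(tree_log2prob(toy, probs), num_tags=5))  # 1.0 for a single training tree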
For the sequential n-gram models, the performance was evaluated over two different experimental setups. One, the "4-tier" setup, took the four tiers of prosodic tags to be independent parallel sequences. The other, the "2-tier" setup, combined the top three tiers into one sequence encompassing pitch accents, intermediate boundary tones, and final boundary tones, just as in the ToBI transcripts. The syllable stress tier had to remain separate, since many syllables are simultaneous with the pitch accents and boundary tones rather than sequential. The tree grammars were identical in each case, but to keep the number of tags the same, the silence trees for the two-tier setup were changed to TOP(SIL1(SIL4)). It should go without saying that the TOP symbols were not included in the tree perplexity calculations, since they are not in the sequential transcripts.

Similarly, two types of tree grammars were trained. Parent-dependent (or PD) tree grammars were conditioned on knowledge of the parent symbol one level up in the tree. For example, instances of the production rule {H* → STR UNS} would be split into either {H*|L- → STR|H* UNS|H*} or {H*|H- → STR|H* UNS|H*}, depending on H*'s parent. Parent-independent (or PI) tree grammars simply did not use this knowledge of the parents, instead assuming, for example, that all examples of the production rule {H* → STR UNS} were to be modeled as one, regardless of H*'s parent.

The best sequential model for each setup was found by increasing the order of the n-gram until perplexity on the development set no longer decreased. These n-grams were trained using the SRI language modeling toolkit with Good-Turing smoothing. All tree grammars were implemented using Tiburon [60], with probabilities trained using the method in Eqn. 5.5. The subtree production rules in the dev and test sets not seen in the training set were assigned a count of 1 before all probabilities were normalized - this is known to be a simple and sub-optimal smoothing method. The better of the two tree grammars on the dev set was evaluated on the test set, for comparison with the best n-gram model. These results are given in Tables 5.4 and 5.5.

               n-gram order                          trees
          1      2      3      4      5         PI      PD
dev      3.80   2.74   2.63   2.61   2.63      2.46    2.44
test      -      -      -     2.53    -         -      2.44

Table 5.5: PPL results for the 2-tier setup, over dev and test sets.

As a preliminary experiment beyond perplexity, prosodic tag classification was also performed by answering this question: given all tags but one in a tree or sequence, what should that missing tag be? The classification result for a set of tags T was given by

\hat{R} = \arg\max_R \{P(t_1, t_2, \ldots, R, \ldots, t_{|T|})\}        (5.9)

where R can be any other tag from the missing tag's tier. The intermediate and final phrase boundary tone classification was binary - either low or high - and classification of pitch accents was three-way: H*, L*, or NULL. Essentially this is very much like the speech synthesis task of assigning natural-sounding prosody to text - it is decided what type of accent or boundary tone to use, given that it is already known where the accent or boundary should be (perhaps based on the syntactic phrases, or the lexically-defined syllable stress sequence). Results for this classification are given in Table 5.6, using the best n-gram and tree grammars from the PPL experiments, alongside a "majority choice" baseline in which all test items were assigned the most common tag.
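A sketch of the parent-dependent ("PD") transformation is shown below: every node below the root is re-labeled as "label|parent" before the rule counts of Eq. 5.5 are collected, so that, for example, an H* under L- and an H* under H- become different symbols. The function operates on the nested (label, children) form of the earlier sketches and is illustrative only.

def annotate_parents(tree, parent=None):
    """Return a copy of the tree with each non-root label rewritten as 'label|parent'."""
    label, children = tree
    new_label = label if parent is None else f"{label}|{parent}"
    return (new_label, [annotate_parents(child, label) for child in children])

pd = annotate_parents(("L-", [("H*", [("STR", []), ("UNS", [])])]))
# ('L-', [('H*|L-', [('STR|H*', []), ('UNS|H*', [])])])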
5.2.6 Discussion

In the PPL experiments (Tables 5.4 and 5.5) it is seen that the best models for both setups were 4th-order n-grams and PD trees, with both the PD and PI trees outperforming the best n-gram models. This indicates that the long-term and multi-scaled context captured by the tree grammars is better suited for modeling prosodic structure, and is evidence in favor of the linguistic theories on which the tree structures are based. The improvement seen with context-dependent models in both the sequential and tree grammars illustrates the importance of accounting for as much of the prosodic structure as possible. However, the difference in PPL between the PI and PD trees was not as large as that between the 2nd- and 3rd-order n-grams, suggesting that vertical dependencies in the tree are not as important as horizontal ones. With more training data, higher-order n-grams might yield some improvement, but more training data would probably make the tree grammars better as well. In general the perplexity for all models was quite low, due to the small set of prosody labels.

The 2-tier and 4-tier setups are not directly comparable in the PPL experiments because they have different symbol sets (due to the combination of SIL tags). For the tag classification experiments reported in Table 5.6, it is seen that the 2-tier sequential setup outperformed the 4-tier one for all three tag types, indicating that the classification of pitch accents can benefit from knowledge of phrase boundaries, and vice versa - one assumption that motivated the use of tree grammars to begin with. As for the PD tree grammars, they outperformed the sequential grammars and the "majority choice" baseline in all cases, with a margin of 2-5% over the best n-grams. Improvement over the baseline with PD trees was most dramatic for the pitch accents, partly because the three-way classification made for higher baseline error. Neither of the sequential methods beat the baseline for the phrase-final boundaries, suggesting independence between successive phrase-final tones in the sequence.

         # test    majority    4-tier     2-tier     PD
         items     choice      4-grams    4-grams    trees
FINB      275      28.33       35.84      29.69      24.92
INTB      389      21.88       20.19      15.63      11.78
PACC     1203      46.22       33.59      27.04      24.58

Table 5.6: Tag classification error (in %) on the development set. FINB = final boundary tones, INTB = intermediate boundary tones, PACC = pitch accents.

5.2.7 Conclusion

These experiments have shown that using tree grammars to model the structure of prosodic tags has several advantages over sequential models, including lower perplexity measures and lower prosodic tag classification error rates. This seems to justify the linguistic theories behind schematizing prosody in a tree structure, and is potentially useful in applications as diverse as speech synthesis, dialog act classification, and pronunciation scoring. The next step would be to combine these tree-based "language models" for a set of prosodic tags with acoustic models for the suprasegmental manifestations of those tags, so that they can be decoded from speech. Doing this within the framework of traditional left-to-right speech decoding will be challenging, and is one potential drawback of tree grammars.

Chapter 6: Conclusion and Future Directions

The goal of the work presented here was to demonstrate improvements in automatic pronunciation evaluation using both linguistic theories of hierarchies in speech and their complementary computational frameworks to implement suitable hierarchical representations.
Three example applications, on as many scales of analysis, were presented. The first of these was the detection of errors in phoneme-level pronunciation by nonnative learners of English, potentially in a computer-assisted second-language practice environment. A baseline non-hierarchical approach used only phoneme-level acoustic models for estimating a score that reflects how close an observed segment is to the target, and how far it is from expected substitutions. The hierarchical approach factored each target phoneme into its articulatory configuration and estimated comparable scores for each vocal tract organ, then combined those with the phoneme-level scores in a Decision Tree classifier that automatically instituted an appropriate hierarchy for error discrimination. The improvement in classification accuracy over the non-hierarchical approach was statistically significant for nonnative speakers, and even for some supplemental experiments with native speech.

The hierarchical theory most often used in speech technology is the idea that a word is composed of a series of phonemes - abstract units that can be used to make a contrast between two words' meanings. It was this hierarchy that was used to model an elementary school teacher's decision to accept or reject a child's reading of words in isolation. With prior knowledge of expected categories of phoneme-level pronunciation variants, standard phoneme Hidden Markov Models established the presence or absence of the cues to these pronunciation categories, and a Bayesian Network modeled their complex interaction in contributing to an overall reading assessment decision - a perceptual hierarchy related to cognitive models of lexical access and based on the chain of cause-and-effect in deriving the cues. This Bayes Net offered results approaching inter-teacher agreement, and significantly outperformed a baseline classifier (one without the hierarchical network structure).

As a step toward estimating a phrase-level pronunciation score for English language learners based on all available cues within the prosodic hierarchy, I discussed work in modeling sub-phrase intonation events. Because of the high variability in what is considered "correct" intonation, and the added variability of nonnative speakers, these intonation models were purposely text-independent HMMs trained on appropriately processed continuous fundamental frequency curves, rather than on the features often derived ad hoc from those curves. Phrase-level scores for nonnative speech (reflecting the degree to which the intonation matched the native model) were based on the decoding of pitch accents and boundaries using these acoustic models and an n-gram model of intonation sequences. The reported correlations with human pronunciation scores, though far below inter-listener agreement, are promising in that they are consistent with those of previous ad hoc approaches and require no prior knowledge of the target text. A hierarchical tree grammar for these intonation labels proved to have lower perplexity and greater predictive power than the initial n-gram models - yet another example of the insight of hierarchical modeling based on linguistic theory.

Where do we go from here? The logical next step in automatic pronunciation evaluation would be to make these methods useable in real-world applications. For example, tree grammars are an exciting new development for modeling intonation, but incorporating them into a real-time recognition framework will demand some ingenuity.
The same goes for extensions of the Articulatory Phonology study discussed in Section 4.9: it remains to be seen how articulatory gesture units should best be adapted to a large-vocabulary recognition or assessment system. What is known for sure, after the several studies presented in this thesis, is that linguistic information on multiple scales - from articulatory gestures up to phrase-level pitch boundaries - can be combined in intelligent ways to make automatic judgments about pronunciation more human-like.

Bibliography

[1] M. J. Adams. Beginning to read: Thinking and learning about print. The MIT Press, Cambridge, 1990.
[2] A. Alwan, Y. Bai, M. Black, L. Casey, M. Gerosa, M. Heritage, M. Iseli, B. Jones, A. Kazemzadeh, S. Lee, S. Narayanan, P. Price, J. Tepperman, and S. Wang. A system for technology based assessment of language and literacy in young children: the role of multiple information sources. In Proc. of MMSP, Chania, Greece, October 2007.
[3] E. Atwell, P. Howarth, and C. Souter. The isle corpus: Italian and german spoken learners english. ICAME Journal, 27:518, 2003.
[4] T. M. Bailey and U. Hahn. Phoneme similarity and confusability. Journal of Memory and Language, 52:347–370, 2005.
[5] J. Balogh, J. Bernstein, J. Cheng, and B. Townshend. Automatic evaluation of reading accuracy: Assessing machine scores. In Proc. of SLaTE Workshop, October 2007.
[6] S. Banerjee, J. E. Beck, and J. Mostow. Evaluating the effect of predicting oral reading miscues. In Proc. of Eurospeech, Geneva, 2003.
[7] R. A. Bates, M. Ostendorf, and R. A. Wright. Symbolic phonetic features for modeling of pronunciation variation. Speech Communication, 49:83–97, 2007.
[8] J. E. Beck and J. Sison. Using knowledge tracing to measure student reading proficiencies. In Proc. of the 7th International Conference on Intelligent Tutoring Systems, September 2004.
[9] D. Bolinger. Accent is predictable (if you're a mind-reader). Language, 48:633–644, September 1972.
[10] E. Bresch and S. Narayanan. Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images. IEEE Trans. Med. Imaging, 28:323–338, March 2009.
[11] E. Bresch, D. Riggs, L. Goldstein, D. Byrd, S. Lee, and S. Narayanan. An analysis of vocal tract shaping in english sibilant fricatives using real-time magnetic resonance imaging. In Proc. of Interspeech, Brisbane, 2008.
[12] E. Bresch, J. Nielsen, K. Nayak, and S. Narayanan. Synchronized and noise-robust audio recordings during realtime magnetic resonance imaging scans. J. Acoust. Soc. Amer., 120:1791–1794, October 2006.
[13] C. P. Browman and L. Goldstein. Articulatory phonology: An overview. Phonetica, 49:155–180, 1992.
[14] M. Bulut, S. Lee, and S. Narayanan. Analysis of emotional speech prosody in terms of part of speech tags. In Proc. of Interspeech ICSLP, Antwerp, Belgium, August 2007.
[15] D. Byrd and E. Saltzman. The elastic phrase: Dynamics of boundary-adjacent lengthening. Journal of Phonetics/Academic Press, 31:149–180, 2003.
[16] D. Byrd, S. Tobin, E. Bresch, and S. Narayanan. Timing effects of syllable structure and stress on nasals: a real-time mri examination. Journal of Phonetics, 37:97–110, 2009.
[17] J. Caminero, C. de la Torre, L. Villarrubia, C. Matin, and L. Hernandez. On-line garbage modeling with discriminant analysis for utterance verification. In Proc. of ICSLP, Philadelphia, 1996.
[18] E. Charniak, K. Knight, and K. Yamada. Syntax-based language models for statistical machine translation. In MT Summit IX, Int'l. Assoc. for Machine Translation, 2003.
[19] C.
Chelba and P. Xu. Richer syntactic dependencies for structured language modeling. In Proc. of ASRU, 2001. [20] B. Collins and I. M. Mees. Practical Phonetics and Phonology: A Resource Book for Students. Routledge, London, 2003. [21] C. Conati, A. Gertner, and K. VanLehn. Using bayesian networks to manage uncertainty in student modeling. User Modeling and User-Adapted Interaction, 12:371–417, 2002. [22] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition ofproceduralknowledge. UserModelingandUser-AdaptedInteraction,4:253–278, 1994. [23] R. Delmonte. Slim prosodic automatic tools for selflearning instruction. Speech Communication, 30:145–166, 2000. 141 [24] P. Denes. On the statistics of spoken english. J. Acoust. Soc. Am., 35:892–904, June 1963. [25] J. Dolfing and A. Wendemuth. Combination of confidence measures in isolated word recognition. In Proc. of ICSLP, Sydney, 1998. [26] S. Young et al. The htk book, 2002. [27] C. Fought. Chicano English in Context. Palgrave MacMillan, New York, 2003. [28] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Ma- chine Learning, 29:131–163, 1997. [29] H. Fujisaki. Prosody, models, and spontaneous speech. In Y. Sagisaka, N. Camp- bell, and N. Higuchi, editors, Computing Prosody. Springer, New York, 1997. [30] C. Goldenberg. Teaching english language learners. American Educator, 32:8–21, 2008. [31] L. Goldstein and C. A. Fowler. Articulatory phonology: A phonology for public languageuse. InN.O.SchillerandA.S.Meyer, editors, Phonetics and Phonology in Language Comprehension and Production. Mouton, Berlin, 2003. [32] E. Grabe. Intonational variation in urban dialects of english spoken in the british isles. In P. Gilles and J. Peters, editors, Regional Variation in Intonation, pages 9–31. Niemeyer, Tuebingen, 2004. [33] A. Gutkin and S. King. Detection of symbolic gestural events in articulatory data for use in structural representations of continuous speech. In Proc. ICASSP, Philadelphia, 2005. [34] K. Hacioglu, B. Pellom, and W. Ward. Parsing speech into articulatory events. In Proc. ICASSP, Montreal, 2004. [35] A. Hagen, B. Pellom, and R. Cole. Children’s speech recognition with application to interactive books and tutors. In Proc. ASRU, St. Thomas, 2003. [36] M. Halle, B. Vaux, and A. Wolfe. On feature spreading and the representation of place of articulation. Linguistic Inquiry, 31:387–444, 2000. [37] A. J. Harris and M. D. Jacobson. Basic Reading Vocabularies. MacMillan, New York, 1982. [38] M.Hasegawa-Johnson, J.Cole, K.Chen, L.Partha, A.Juneja, T.Yoon, S.Borys, and X. Zhuang. Prosodic hierarchy as an organizing framework for the sources of context in phone-based and articulatory-feature-based speech recognition. In S. Tseng, editor, Linguistic Patterns in Spontaneous Speech. Academica Sinica, Taiwan, 2008. 142 [39] J. Hirschberg and O. Rambow. Learning prosodic features using a tree represen- tation. In Proc. of Eurospeech, Aalborg, 2001. [40] D. Hirst. Intonation in british english. In D. Hirst and A. Di Cristo, editors, Intonation Systems: A Survey of Twenty Languages. CUP, Cambridge, 1998. [41] D. Hirst and A. Di Cristo. A survey of intonation systems. In D. Hirst and A. Di Cristo, editors, Intonation Systems: A Survey of Twenty Languages. CUP, Cambridge, 1998. [42] M.-Y. Hwang, X. Lei, W. Wang, and T. Shinozaki. Investigation on mandarin broadcast news speech recognition. In Proc. of Interspeech ICSLP, Pittsburgh, September 2006. [43] A. Ikeno and J. H. L. Hansen. 
The role of prosody in the perception of us native english accents. In Proc. of Interspeech ICSLP, Pittsburgh, September 2006. [44] Jr. J. P. Campbell. Speaker recognition: A tutorial. Proceedings of the IEEE, 85:1437–1462, 1997. [45] S.-C. Jou, T. Schultz, and A. Waibel. Whispery speech recognition using adapted articulatory features. In Proc. ICASSP, Philadelphia, 2005. [46] L. Heck K. Sonmez, E. Shriberg and M. Weintraub. Modeling dynamic prosodic variation for speaker verification. In Proc. of ICSLP, 1998. [47] A. Kazemzadeh, J. Tepperman, J. Silva, H. You, S. Lee, A. Alwan, and S. Narayanan. Automatic detection of voice onset time contrasts for use in pro- nunciation assessment. In Proc. of Interspeech ICSLP, Pittsburgh, September 2006. [48] S. King, J. Frankel, K. Livescu, E. McDermott, K. Richmond, and M. Wester. Speech production knowledge in automatic speech recognition. J. Acoust. Soc. Am., 121:723–742, 2007. [49] K. Kirchhoff. Robust speech recognition using articulatory information. PhD thesis, University of Bielefeld, 1999. [50] G. Kochanski, E. Grabe, J. Coleman, and B. Rosner. Loudness predicts promi- nence: fundamental frequency lends little. J. Acoust. Soc. Am., 118:1038–1054, 2005. [51] W. Labov and B. Baker. What is a reading error? 2003. [52] P. Ladefoged. A Course in Phonetics. Thomson, Boston, 2006. 143 [53] S. Lee, E. Bresch, and S. Narayanan. An exploratory study of emotional speech productionusingfunctionaldataanalysistechniques. InProc. of 7th International Seminar On Speech Production, Ubatuba,Brazil, 2006. [54] S. Lee, A. Potamianos, and S. Narayanana. Acoustics of children’s speech: De- velopmental changes of temporal and spectral parameters. J. Acoust. Soc. Am., 105:1455–1468, 1999. [55] K. Leung, M. Mak, and S. Kung. Applying articulatory features to telephone- based speaker verification. In Proc. ICASSP, Montreal, 2004. [56] K.-Y. Leung and M. Siu. Articulatory-feature-based confidence measures. Com- puter Speech and Language, 20:542–562, 2006. [57] K. Livescu and J. Glass. Feature-based pronunciation modeling for speech recog- nition. In Proc. HLT/NAACL, Boston, 2004. [58] P. A. Luce and C. T. McLennan. Spoken word recognition: The challenge of variation. In D. B. Pisoni and R. E. Remez, editors, The Handbook of Speech Perception. Blackwell, Oxford, 2005. [59] R. C. Major. Foreign Accent: The Ontogeny and Phylogeny of Second Language Phonology. Lawrence Erlbaum, Mahwah, 2001. [60] J. May and K. Knight. Tiburon: A weighted tree automata toolkit. In Proc. of the Eleventh International conference on Implementation and Application of Automata (CIAA), 2006. [61] W. Menzel, E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton, and C.Souter. Theislecorpusofnon-nativespokenenglish. InProc.ofLREC,Athens, 2000. [62] S. Minnis. The parsody system: Automatic prediction of prosodic boundaries for text-to-speech. In Proc. of the International Conference on Computational Linguistics, Kyoto,Japan, 1994. [63] V. Mitra, I. Y. Ozbek, H. Nam, X. Zhou, and C. Y. Espy-Wilson. From acoustics to vocal tract time functions. In Proc. of ICASSP, Taipei, 2009. [64] J. Mostow, S. F. Roth, A. G. Hauptmann, and M. Kane. A prototype reading coach that listens. In Proc. of AAAI-94, Seattle, 1994. [65] N.Mote,A.Sethy,J.Silva,S.Narayanan,andL.Johnson.Detectionandmodeling oflearnerspeecherrors: Thecaseofarabictacticallanguagetrainingforamerican english speakers. In Proceedings of InStil, Venice,Italy, 2004. 144 [66] K. Murphy. The bayes net toolbox for matlab. 
Computing Science and Statistics, 33, 2001. [67] K. P. Murphy. A variational approximation for bayesian networks with discrete and continuous latent variables. In Proc. of the Conf. on Uncertainty in AI, 1999. [68] H. Nam, L. Goldstein, E. Saltzman, and D. Byrd. Tada: An enhanced, portable task dynamics model in matlab. J. Acoust. Soc. Amer., 115:2430, 2004. [69] S. Narayanan, K. Nayak, S. Lee, A. Sethy, and D. Byrd. An approach to real- time magnetic resonance imaging for speech production. J. Acoust. Soc. Amer., 115:1771–1776, 2004. [70] L. Neumeyer, H. Franco, V. Digalakis, and M. Weintraub. Automatic scoring of pronunciation quality. Speech Communication, 30:83–94, 1999. [71] M. Ostendorf, P. J. Price, and S. Shattuck-Hufnagel. The boston university radio news corpus. Technical Report ECS-95-001, Boston University, March 1995. [72] National Reading Panel. Teaching children to read: An evidence-based assess- ment of the scientific research literature on reading and its implication for reading instruction. Technical Report Tech. Rep. 00-4769, National Institute for Child Health and Human Development, National Institute of Health, Washington, DC, 2000. [73] P. Ramesh, C.-H. Lee, and B.-H. Juang. Context dependent anti subword model- ing for utterance verification. In Proc. of ICSLP, Sydney, 1998. [74] J. Reye. Student modelling based on belief networks. International Journal of Artificial Intelligence in Education, 14:1–33, 2004. [75] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator markov models: Performance improvements and robustness to noise. In Proc. of ICSLP, Beijing, 2000. [76] M. Richardson, J. Bilmes, and C. Diorio. Hidden-articulator markov models for speech recognition. Speech Communication, 41, October 2003. [77] N. Segal and K. Bartkova. Prosodic structure representation for boundary detec- tion in spontaneous french. In Proc. of ICPhS XVI, Saarbrucken, 2007. [78] J. Shefelbine. Bpst - beginning phonics skills test, 1996. [79] E. Shriberg, R. Bates, P. Taylor, A. Stolcke, D. Jurafsky, K. Ries, N. Coccaro, R. Martin, M. Meteer, and C. Van Ess-Dykema. Can prosody aid the auto- matic classification of dialog acts in conversational speech? Language and Speech, 41:439–487, 1998. 145 [80] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg. Tobi: A standard for labeling english prosody. In Proc. of ICSLP, Banff,Canada, 1992. [81] K. Stevens. Acoustic Phonetics. MIT Press, Cambridge, 1998. [82] K. Stevens. Features in speech perception and lexical access. In D. B. Pisoni and R. E. Remez, editors, The Handbook of Speech Perception. Blackwell, Oxford, 2005. [83] J. Sun and L. Deng. An overlapping-feature-based phonological model incorpo- rating linguistic constraints: Applications to speech recognition. J. Acoust. Soc. Am., 111:1086–1101, 2002. [84] P. Taylor. Analysis and synthesis of intonation using the tilt model. J. Acoust. Soc. Am., 107:1697–1714, 2000. [85] C.Teixeira,H.Franco,E.Shriberg,K.Precoda,andK.Sonmez. Prosodicfeatures for automatic text-independent evaluation of degree of nativeness for language learners. In Proc. ICSLP, 2000. [86] J. Tepperman, M. Black, P. Price, S. Lee, A. Kazemzadeh, M. Gerosa, M. Her- itage, A. Alwan, and S. Narayanan. A bayesian network classifier for word-level reading assessment. In Proc. of InterSpeech ICSLP, Antwerp,Belgium, August 2007. [87] J. Tepperman, A. Kazemzadeh, and S. Narayanan. A text-free approach to as- sessing nonnative intonation. In Proc. 
of InterSpeech ICSLP, Antwerp,Belgium, August 2007. [88] J. Tepperman and S. Narayanan. Automatic syllable stress detection using prosodic features for pronunciation evaluation of language learners. In Proc. of ICASSP, Philadelphia, March 2005. [89] J. Tepperman and S. Narayanan. Tree grammars as models of prosodic structure. In Proc. of Interspeech ICSLP, Brisbane,Australia, September 2008. [90] J.TeppermanandS.Narayanan. Usingarticulatoryrepresentationstodetectseg- mental errors in nonnative pronunciation. IEEE Transactions on Audio, Speech, and Language Processing, 16:8–22, January 2008. [91] J. Tepperman, J. Silva, A. Kazemzadeh, H. You, S. Lee, A. Alwan, and S. Narayanan. Pronunciation verification of children’s speech for automatic liter- acy assessment. In Proc. of InterSpeech ICSLP, Pittsburgh, 2006. 146 [92] Y. Tian, J. Zhou, M. Chu, and E. Chang. Tone recognition with fractionized models and outlined features. In Proc. of ICASSP, Montreal, 2004. [93] J. Vaissiere. Perception of intonation. In D. B. Pisoni and R. E. Remez, editors, The Handbook of Speech Perception. Blackwell, Oxford, 2005. [94] W. Z. van den Doel. How Friendly are the Natives? An Evaluation of Native- speaker Judgments of Foreign-accented British and American English. LOT, Utrecht, 2006. [95] A. Wennerstrom. The Music of Everyday Speech. OUP, New York, 2001. [96] M.Wester,J.Frankel,andS.King. Asynchronousarticulatoryfeaturerecognition usingdynamicbayesiannetworks.InProc.IEICIBeyondHMMWorkshop,Kyoto, 2004. [97] P. Westwood. Reading and Learning Difficulties: Approaches to Teaching and Assessment. ACER, Camberwell,Victoria, 2003. [98] D. Willett, A. Worm, C. Neukirchen, and G. Rigoll. Confidence measures for hmm-based speech recognition. In Proc. of ICSLP, Sydney, 1998. [99] S.M.Williams,D.Nix,andP.Fairweather. Usingspeechrecognitiontechnologyto enhanceliteracyinstructionforemergingreaders. InProc. of Fourth International Conference of the Learning Sciences, Mahwah,New Jersey, 2000. [100] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2005. [101] A. Wrench. The mocha-timit articulatory database, 1999. [102] H.WrightandP.A.Taylor. Modellingintonationalstructureusinghiddenmarkov models. In Proc. of ESCA Workshop on Intonation: Theory, Models, and Appli- cations, 1997. [103] H. You, A. Alwan, A. Kazemzadeh, and S. Narayanan. Pronunciation variations of spanish-accented english spoken by young children. In Proc. of Eurospeech, Lisbon,Portugal, 2005. [104] X. Zhuang, H. Nam, M. Hasegawa-Johnson, L. Goldstein, and E. Saltzman. The entropy of the articulatory phonological code: Recognizing gestures from tract variables. In Proc. of Interspeech, Brisbane, 2008. 147 Appendix Tables A.1 and A.2 display the expected articulatory mappings for British English vow- els and consonants, respectively, derived from [52, 76]. For those with sub-phonemic motion, only the start and end positions are shown here. See Table 3.1 for the articu- latory classes represented by these integers. 
IPA     ISLE    example     jaw   lip         lip        tongue      tongue   tongue   velum   voicing
phone   phone   word              separation  rounding   frontness   height   tip
ɑː      AA      bard         3     2           1          1           0        0        0       1
æ       AE      bad          3     3           2          3           0        0        0       1
ʌ       AH      bud          2     2           2          2           0        0        0       1
ɔː      AO      bawd         3     2           1          0           2        0        0       1
aʊ      AW      bowed        3     2           2          2           0        0        0       1
                             1     2           0          1           2        0        0       1
ə       AX      about        2     2           2          2           1        0        0       1
aɪ      AY      bide         3     2           2          2           0        0        0       1
                             1     2           3          3           2        0        0       1
ɛ       EH      bed          3     2           2          3           1        0        0       1
ɜː      ER      bird         2     2           2          2           1        0        0       1
eɪ      EY      bayed        1     2           3          4           2        0        0       1
ɪ       IH      bid          3     2           3          4           2        0        0       1
i       IY      bead         0     1           3          4           3        0        0       1
ɒ       OH      body         2     1           2          0           1        0        0       1
əʊ      OW      bode         3     2           1          3           2        0        0       1
                             2     1           0          1           2        0        0       1
ɔɪ      OY      boy          2     2           1          0           1        0        0       1
                             1     2           3          3           2        0        0       1
ʊ       UH      buddhist     1     2           1          1           2        0        0       1
uː      UW      booed        1     1           0          1           3        0        0       1

Table A.1: Expected British English Vowel Articulations

IPA     ISLE    example     jaw   lip         lip        tongue      tongue   tongue   velum   voicing
phone   phone   word              separation  rounding   frontness   height   tip
b       B       bet          1     0           2          2           1        1        0       1
                             1     2           2          2           1        1        0       1
d       D       debt         1     1           2          4           3        4        0       1
                             1     2           2          4           2        3        0       1
g       G       get          1     2           2          0           3        1        0       1
                             1     2           2          0           2        1        0       1
p       P       pet          1     0           2          2           1        1        0       0
                             1     2           2          2           1        1        0       0
t       T       tat          1     1           2          4           3        4        0       0
                             1     2           2          4           2        3        0       0
k       K       cat          1     2           2          0           3        1        0       0
                             1     2           2          0           2        1        0       0
ð       DH      that         2     2           2          4           2        2        0       1
θ       TH      thin         2     2           2          4           2        2        0       0
v       V       van          2     0           2          2           1        1        0       1
f       F       fan          2     0           2          2           1        1        0       0
z       Z       zoo          1     2           2          3           3        3        0       1
s       S       sue          1     2           2          3           3        3        0       0
ʒ       ZH      measure      2     2           1          3           3        0        0       1
ʃ       SH      shoe         2     2           1          3           3        0        0       0
dʒ      JH      jeep         2     2           2          4           3        4        0       1
                             1     2           1          3           3        0        0       1
tʃ      CH      cheap        2     2           2          4           3        4        0       0
                             1     2           1          3           3        0        0       0
m       M       met          1     0           2          2           1        1        1       1
n       N       net          1     1           2          2           3        4        1       1
ŋ       NG      thing        1     2           2          0           3        1        1       1
l       L       led          1     2           2          3           2        4        0       1
ɹ       R       red          1     2           1          2           2        3        0       1
w       W       wed          1     2           0          0           3        1        0       1
j       Y       yet          1     2           2          4           3        3        0       1
h       HH      hat          2     2           2          2           1        1        0       0

Table A.2: Expected British English Consonant Articulations
Abstract
Technology that can automatically categorize pronunciations and estimate scores of pronunciation quality has many potential applications, most notably for second-language learners interested in practicing their pronunciation with a machine tutor, or for automating the standard assessments elementary school teachers use to measure a child's emerging reading skills. The many sources of variability in speech and the subjective perception of pronunciation make this a complex problem. Linguistic hierarchies - in speech production, perception, and prosodic structure - help to conceive of the variability as existing on multiple simultaneous scales of representation, and offer an explanatory order of precedence to those scales. These theories are beginning to gain widespread attention and use in traditional speech recognition, but experimenters in pronunciation evaluation have been slow to embrace them. This work proposes using theories of hierarchical structure in speech to inform a chosen computational framework and scale of analysis when performing automatic pronunciation evaluation, on the assumption that they will offer improvements over non-hierarchical methods and can be used to rate pronunciation with performance comparable to that of inter-human agreement.