ROBUST AUTOMATIC SPEECH RECOGNITION FOR CHILDREN by Prashanth Gurunath Shivakumar A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE (ELECTRICAL ENGINEERING) December 2014 Copyright 2014 Prashanth Gurunath Shivakumar Dedication To my parents. ii Acknowledgments FirstIwanttothankmyparents(Prof. G.K.ShivakumarandMrs. Banashankari Shivakumar),theyhavebeenaninspirationtomeandaconstantsourceofsupport and strength throughout my life. I would like to thank my advisor Prof. Shrikanth Narayanan of the Signal and Analysis laboratory (SAIL), University of Southern California (USC) for showing confidence in me and encouraging me throughout this work. I am grateful to him forsupportingmyideasandprovidingallthenecessaryresourcesthroughtheSAIL Lab. Ialsowanttothankmyco-advisorsProf. AlexandrosPotamianosofNational Technical University of Athens, Athens, Greece and Prof. Sungbok Lee of SAIL, University of Southern California for supporting me throughout my thesis work andgivingmechallengingideastoexplore. IowemygratitudetoProf. Panayiotis G. Georgiou for his guidance and feedback. IthankDoganCan, Ph.DstudentinSAIL,forhispatientmentorshipthrough- outmythesiswork. Hehasbeenextremelyhelpfulinguidingmethroughpractical problems and giving me valuable inputs. My gratitude also goes to the Post Doc- toral students and Doctoral students at SAIL. IwouldalsoliketothankmyfriendsatUSCandbackinIndia,especiallyNeha for being a source of encouragement. iii Contents Dedication ii Acknowledgments iii List of Tables vi List of Figures viii Abstract x Introduction 1 1 Theory 4 1.1 Automatic Speech Recognition System . . . . . . . . . . . . . . . . 4 1.1.1 Front-end feature extraction . . . . . . . . . . . . . . . . . . 5 1.1.2 Acoustic Modeling . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.3 Language Model (LM) . . . . . . . . . . . . . . . . . . . . . 9 1.1.4 Pronunciation Dictionary . . . . . . . . . . . . . . . . . . . . 10 1.1.5 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2 Our approach 12 2.1 Front-End Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.1 Mel Frequency Cepstral Coefficients (MFCC) . . . . . . . . 12 2.1.2 Perceptual Linear Predictive Analysis (PLP) . . . . . . . . . 14 2.1.3 Delta components . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Speaker Normalization Algorithms . . . . . . . . . . . . . . . . . . 17 2.2.1 Cepstral Mean and Variance Normalization (CMVN) . . . . 17 2.2.2 Vocal Tract Length Normalization (VTLN) . . . . . . . . . . 18 2.3 Acoustic Model Adaptation Techniques . . . . . . . . . . . . . . . . 19 2.3.1 Maximum Likelihood Linear Transform (MLLT) . . . . . . . 19 2.3.2 Speaker Adaptive Training (SAT) . . . . . . . . . . . . . . . 20 2.3.3 Linear Discriminant Analysis (LDA) . . . . . . . . . . . . . 22 2.4 Pronunciation Modeling . . . . . . . . . . . . . . . . . . . . . . . . 24 iv 3 Our system 34 3.1 Speech Recognition Setup . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4 Recognition Experiments and Results 40 4.1 Baseline System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Complexity Analysis of Sentences . . . . . . . . . . . . . . . . . . . 41 4.3 Front-End Feature Analysis . . . . . . . . . . . . . . . . . . . . . . 41 4.4 Speaker Normalization Algorithms . . . . . . . . . . . . . . . . . . 
43 4.5 Acoustic Model Adaptation Techniques . . . . . . . . . . . . . . . . 43 4.6 Age Dependent Results . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.7 Pronunciation Modeling . . . . . . . . . . . . . . . . . . . . . . . . 45 5 Duration Modeling 48 5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 52 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6 Subword Modeling 57 6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.3.1 Generating subwords from words . . . . . . . . . . . . . . . 58 6.3.2 Recognition Experiments and Results . . . . . . . . . . . . . 62 6.3.3 Converting Subwords to Words . . . . . . . . . . . . . . . . 64 6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 7 Conclusions and Future Work 71 7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 7.2.1 Pronunciation Modeling . . . . . . . . . . . . . . . . . . . . 72 7.2.2 Duration Modeling . . . . . . . . . . . . . . . . . . . . . . . 72 7.2.3 Subword Modeling . . . . . . . . . . . . . . . . . . . . . . . 73 Bibliography 74 v List of Tables 2.1 Toy-example: lexicon . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.2 Toy-example: mapping . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.1 Age Distribution of Training and Testing Data . . . . . . . . . . . . 36 3.2 Age Distribution of CIDMIC development and test data . . . . . . 36 4.1 Baseline System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.2 Performance Analysis of Five Sentences in CID . . . . . . . . . . . 41 4.3 Front-end Feature Selection . . . . . . . . . . . . . . . . . . . . . . 42 4.4 Performance for MFCC features . . . . . . . . . . . . . . . . . . . . 42 4.5 Speaker Normalization Techniques . . . . . . . . . . . . . . . . . . . 43 4.6 Acoustic Modeling and Adaptation . . . . . . . . . . . . . . . . . . 45 4.7 Results: Pronunciation Modeling . . . . . . . . . . . . . . . . . . . 47 5.1 Baseline: Duration Modeling . . . . . . . . . . . . . . . . . . . . . . 53 5.2 Position independent duration modeling for phones . . . . . . . . . 54 5.3 Position dependent duration modeling for phones . . . . . . . . . . 55 5.4 Position and state independent duration modeling for phones . . . . 55 6.1 Baseline system for word level ASR for colorado data . . . . . . . . 63 6.2 Baseline system for subword level ASR for colorado data . . . . . . 64 vi 6.3 Subword modeling step-wise performance analysis . . . . . . . . . . 69 vii List of Figures 1.1 An ASR system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 HMM based acoustic modeling . . . . . . . . . . . . . . . . . . . . . 6 2.1 MFCC Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 PLP Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3 MFCC (left) vs PLP (right) Calculation . . . . . . . . . . . . . . . 
30 2.4 Insertions, Deletions, Substitutions . . . . . . . . . . . . . . . . . . 31 2.5 Mapping Transducer . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.6 Lexicon Transducer . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.7 Inverted Lexicon Transducer . . . . . . . . . . . . . . . . . . . . . . 32 2.8 Composed Transducer . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.9 Confusion Matrices over Age . . . . . . . . . . . . . . . . . . . . . . 33 3.1 CHIMP data distribution . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2 Colorado data distribution . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 CIDMIC data distribution . . . . . . . . . . . . . . . . . . . . . . . 38 3.4 CIDMIC development data distribution . . . . . . . . . . . . . . . . 38 3.5 CIDMIC test data distribution . . . . . . . . . . . . . . . . . . . . . 39 4.1 Age vs average VTLN scaling factor obtained for CIDMIC data . . 44 viii 4.2 Age Dependency Results . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3 Pronunciation Modeling Results . . . . . . . . . . . . . . . . . . . . 47 5.1 Probability distribution for 3 states for the phone "‘iy"’ . . . . . . . 51 5.2 Probability distribution for 3 states for the phone "‘aa"’ . . . . . . . 52 5.3 Gaussian approximated version of figure 5.1 . . . . . . . . . . . . . 53 5.4 α vs WER% for position dependent phone duration modeling . . . 56 6.1 Generation of Subwords . . . . . . . . . . . . . . . . . . . . . . . . 59 6.2 Computation of Relative position (a)(above), Score (b)(below) . . . 60 6.3 n-gram LM vs subword error rate subword baseline system . . . . . 64 6.4 The lexicon transducer L . . . . . . . . . . . . . . . . . . . . . . . . 66 6.5 The input FSA I . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.6 The determinized FST O 2 . . . . . . . . . . . . . . . . . . . . . . . 67 6.7 The FST F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.8 The FST O 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.9 The grammar FST G . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.10 The final FST O 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 ix Abstract Developing a robust ASR system for children is a challenging task because of increased variability in acoustic and linguistic correlates as function of young age. The acoustic variability is mainly due to the developmental changes associated with vocal tract growth. On the linguistic side, the variability is associated with limited knowledge of vocabulary, pronunciations and other linguistic constructs. Thisstudyaimstodevelopamorerobustacousticmodelingframeworkthrough a better front-end speech processing, pronunciation modeling, duration modeling, subword modeling for more accurate children’s speech recognition. The thesis study focuses on the comprehensive tests of pre-existing methods in acoustic mod- eling for children’s speech recognition. The results are presented as a function of age to study the effect of age on the performance. The inter-database effects are takenintoaccountbyusingmultipledatabasesofchildren’sspeech. Pronunciation modeling is introduced for children’s speech. The results show promising perfor- mance improvements over the baseline. To capture pronunciation mistakes and stammering specific to children, a modified version of a pre-existing subword mod- eling is used. The novel contribution to the subword modeling lies in FSM based subword based decoding. 
Motivated by the increased temporal variability found in children, duration modeling is applied for children’s speech and its effectiveness is presented. x Introduction ASR for children has many significant applications, in educational domain as a tutoring tool for reading and pronunciation as well as in entertainment and communication domains in the form of interactive games. However, even though ASRtechnologyhascomealongwaywiththestateofthearttechnologiesyielding highlyaccurateresultsforadultspeech,thefieldofASRforchildrenusershasbeen laggingbehindwitharelativelypoorperformance. Theincreasedinter-speakerand intra-speaker variability in children’s speech has complicated the speech recogni- tion task. Previousstudiesshowdegradationinwordrecognitionaccuracywhenthemodelis trained on adult speech. Models trained on children speech performs significantly betterasshowninElenius & Blomberg(2005). Combinedmodelstrainedonadult and children speech along with speaker normalization and adaptation techniques perform almost as good as the models trained with children’s speech. Even for the matched training and testing conditions there is a significant performance gap rel- ative to adult ASRs (Elenius & Blomberg, 2005). On the acoustic side, there is a reduction in pitch, formant frequency magnitude and within-subject variability of spectral parameters with age for children. Vowel durations and formant frequen- cies decrease approximately linearly with age. Fundamental frequency or pitch drops as age increases, the drop is more gradual for female subjects compared to 1 male subjects. Temporal variability is also significant in the case of children and might account for speaking rate, reading ability and pause durations. Vowel and sentence durations decrease with age significantly (Lee et al., 1999). The acous- tic variability can be accounted by the developmental changes in vocal tract and immature speech production skill in growing children. Front end frequency warping, speaker normalization, spectral adaptations tech- niques like Vocal Tract Length Normalization (VTLN) have all proved use- ful to deal with the aforementioned speech variability in children speakers (Elenius & Blomberg, 2005; Potamianos et al., 1997). On the linguistic side, performance degradation is partly due also to pronuncia- tion variability associated in children (Li & Russell, 2002). Children’s pronuncia- tions diverge from the canonical and adult patterns. Creating a custom dictionary based on actual children’s pronunciation can help the performance. Studies have shown that the mispronunciations of younger children (8-10 years) was twice as high as for older children (11-14 years) (Potamianos & Narayanan, 1998). Dis- fluency phenomena like breathing were 60% more prominent in younger children. In contrast to the above, filled pauses were twice as common for older children (Potamianos & Narayanan, 1998) . Series of front-end experiments in Li & Russell (2001) indicated that the degra- dation in performance is relatively small for sampling frequencies until 6 KHz. A drasticlossofperformancewasobservedwhenbandwidthwasreducedfrom4KHz to 2 KHz. The degradation is much larger for children than adults. In this paper, we concentrate on five aspects of speech recognition: acoustic modeling, duration modeling, subword modeling, front-end processing and pronunciation modeling for building robust ASR for children. 2 The thesis is organized as follows. Chapter 1 deals with the basic theory and understanding of a HMM based ASR system. 
In Chapter 2, we present our approachtotheproblem,thealgorithmsusedandtheirtheory. Chapter3describes our experimental setup. Chapter 4 presents the recognition experiments and their results. Chapter 5 deals with duration modeling and Chapter 6 with Subword modeling. Finally we conclude our views and future ideas in Chapter 7. 3 Chapter 1 Theory 1.1 Automatic Speech Recognition System An ASR can be thought of as a system which maps the sequence of acoustic features to a most likely sequence of words. It can be summarized in a single equation as shown below: W ∗ =arg max W P(W|X) (1.1) where X is the sequence of acoustic feature vectors or observations, W is a word sequence and W ∗ is the most likely word sequence. Using Baye’s rule: P(W|X) = P(X|W)P(W) P(X) ∝P(X|W)P(W) W ∗ =arg max W P(X|W)P(W) (1.2) whereP(X|W)istheobservationofacousticfeaturevectorswhenauseruttersthe word W, this forms the acoustic model. P(W) is the probability of the sequence of word W particular to the language, this forms the language model. Figure 1.1 shows the components that make up an ASR. Each component is described briefly below. 4 Figure 1.1: An ASR system 1.1.1 Front-end feature extraction AnASRsystemconstitutesofafrontendwhichisusedtoextracttheAcoustic Features, X, from the raw speech signal. Feature extraction is a process of finding appropriaterepresentationfortherawspeechsignalsuchthatitcanberepresented in a lower dimensional space retaining the latent lexical information. The features are desired to be robust to noise. One of the most common front-end feature used in the ASRs are Mel frequency cepstrum coefficents (MFCC) (Tervo & Pätynen, 2010; Ding). Some of the feature extraction methods incorporated in our study pertaining to children are described in more detail in Chapter 2.1. 5 Figure 1.2: HMM based acoustic modeling 1.1.2 Acoustic Modeling Acoustic Modeling usually uses continuous density Hidden Markov Models (HMM) (Ganitkevitch; Gales & Young, 2008). HMMs are markov process in which the underlying states are "hidden". In ASR sense, HMMs are used to capture the temporal spectral variations present in the acoustic features. They model the relationship between the acoustic features and the underlying phonemic structures. Each phone is modeled by a HMM. Fig 1.1.2 shows the structure of a typical HMM with the observation sequence O = (o 1 ,o 2 ,...,o T ) and the underlying hidden states S = (s 1 ,s 2 ,...,s T ). {a ij } denotes the transi- tion probabilities for the transition from the state i to state j. {b t } denotes the emissionprobabilities, definedastheprobabilitytoobserveasymbolo t atstates t . 6 GaussianMixtureModels(GMM)areusedtomodeltheemissionprobabilities. GMMs use mixtures of gaussians which are nothing but the normalized weighted sum of gaussian distributions, given by. b(o t ) = K ∑ k=1 w k N(μ k ,Σ k ) (1.3) where N(μ k ,Σ k ) = 1 (2π) n 2 |Σ k | 1 2 exp− 1 2 ( (o t −μ k ) T Σ −1 k (o t −μ k ) ) (1.4) Substituting equation 1.4 in equation 1.3 gives: b(o t ) = K ∑ k=1 w k 1 (2π) n 2 |Σ k | 1 2 exp− 1 2 ( (o t −μ k ) T Σ −1 k (o t −μ k ) ) (1.5) where b(o t ) is the emission probability for the symbol o t , K is the total number of gaussian components, w k is the weight assigned to the k th distribution, μ k is the mean of the k th gaussian component and Σ k its covariance. The acoustic model training is done using Expectation-Maximization (EM). 
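As a concrete illustration of equation 1.5, the short sketch below evaluates the emission probability b(o_t) of a single observation under a two-component diagonal-covariance GMM. The weights, means and variances are toy values chosen only for illustration, not trained model parameters; in practice such densities are evaluated in the log domain for numerical stability.

```python
import numpy as np

def gmm_emission_prob(o_t, weights, means, variances):
    """Evaluate b(o_t) = sum_k w_k N(o_t; mu_k, Sigma_k) for a
    diagonal-covariance GMM (equation 1.5), for one observation vector."""
    o_t = np.asarray(o_t, dtype=float)
    prob = 0.0
    for w_k, mu_k, var_k in zip(weights, means, variances):
        mu_k, var_k = np.asarray(mu_k, float), np.asarray(var_k, float)
        norm = 1.0 / np.sqrt((2 * np.pi) ** len(o_t) * np.prod(var_k))
        exponent = -0.5 * np.sum((o_t - mu_k) ** 2 / var_k)
        prob += w_k * norm * np.exp(exponent)
    return prob

# Toy 2-component GMM over a 3-dimensional feature vector.
weights = [0.6, 0.4]
means = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.5]]
variances = [[1.0, 1.0, 1.0], [0.5, 2.0, 1.0]]
print(gmm_emission_prob([0.2, 0.8, -0.7], weights, means, variances))
```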
The Maximization step is given by: Q(λ, ˆ λ) = ∑ θ∈Θ L(θ|O,λ)log(L(O,θ| ˆ λ)) (1.6) Q(λ, ˆ λ) =C− 1 2 K ∑ k=1 T (r) ∑ t=1 γ (rk) t [ C k +log(|Σ k |)+(o t −μ k ) T Σ −1 k (o t −μ k ) ] (1.7) whereC andC k arethe normalization constantsdependenton transition prob- abilities and gaussian component k respectively. 7 Then the Expectation step is given by: ˆ μ (j) = ∑ R r=1 ∑ T (r) t=1 γ (rj) t o (r) t ∑ R r=1 ∑ T (r) t=1 γ (rj) t (1.8) ˆ Σ (j) = ∑ R r=1 ∑ T (r) t=1 γ (rj) t (o (r) t − ˆ μ (j) )(o (r) t − ˆ μ (j) ) T ∑ R r=1 ∑ T (r) t=1 γ (rj) t (1.9) where γ (rj) t =P(θ t =s j |Y (r) ;λ) (1.10) where γ (rj) t is defined as the probability of the model λ, occupying state s j at time t for any given utterance r. Y (r) ,r = 1,...R is the r th utterance, R is the total number of utterance, T (r) is the length of the r th utterance. θ is the state sequence and λ is the model with parameters λ = [{a ij },{b j }] For an ASR, EM is performed by Baum-Welch algorithm also known as the forward-backward algorithm. The algorithm can be derrived in terms of transition and observation probabilities. The forward probability is given by: α (rj) t =p(Y (r) 1:t ,θ t =s j ;λ) (1.11) The backward probability is given by: β (ri) t =p(Y (r) t+1:T (r) |θ t =s i ;λ) (1.12) Equation 1.11 can be recursively calculated by: α (rj) t = [ ∑ i α (ri) t−1 a ij ] b j (o (r) t ) (1.13) 8 and equation 1.12 by: β (ri) t = ∑ j a ij b j (o (r) t+1 )β (rj) t+1 (1.14) where i,j are summed over all states. The Expectation step is the same as in equation 1.8 and 1.9. The Maximiza- tion step is given by: ˆ α ij = ∑ R r=1 1 P (r) ∑ T (r) t=1 α (ri) t a ij b j (o (r) t+1 )β (rj) t+1 o (r) t ∑ R r=1 ∑ T (r) t=1 γ (ri) t (1.15) where γ (rj) t =P(θ t =s j |Y (r) ;λ) = 1 P (r) α (rj) t β t (rj) (1.16) P (r) =p(Y (r) ;λ) (1.17) 1.1.3 Language Model (LM) Language modeling refers to the statistical modeling particular to a language whichassignsdistributionstoeachword. Thesimplestformofthelanguagemodel is the unigram model which assigns probability based on the relative frequency of occurrence of the word in the training data. Higher degree of LMs make use of the informationoftheneighboringdata,i.e.,thewordsbeforeandafterthewordunder consideration. Ingeneral, thepriorprobabilityofawordsequencew =w 1 ,...,w K is given by (Gales & Young (2008)): P(w) = K ∏ k=1 P(w k |w k−1 ,...,w 1 ) (1.18) 9 where K is the number of words. For a large vocabulary, the above equation can be approximated by reducing the word history to N −1, for a N-gram model. P(w) = K ∏ k=1 P(w k |w k−1 ,...,w k−N+1 ) (1.19) where P(w k |w k−1 ,...,w k−N+1 )≈ Count(w k−N+1 ,...,w k−1 ,w k ) Count(w k−N+1 ,...,w k−1 ) (1.20) Typically, the degree of LM used for ASR task is between 2 and 4. Higher degrees lead to data sparsity issues. To handle data sparsity, back-off models and smooth- ing is applied. Back-off models work by backing off to a lower degree LM, when data sparsity is encountered. The most common one is the Katz smoothing. The performance of the LM is evaluated in terms of perplexity of the model. It is given by: H =− lim K→∞ 1 K log 2 (P(w 1 ,...,w K )) ≈− 1 K K ∑ k=1 log 2 (P(w k |w k−1 ,...,w k−N+1 )) (1.21) 1.1.4 Pronunciation Dictionary The acoustic modeling works on the scale of phones. The pronunciation dic- tionary synthesizes the sequence of phones to words. The pronunciation dictio- nary in ASR should be able to handle pronunciation variations, i.e., more than one sequence of phones for a single word. 
If P (w) = p 1 ,...,p L is the phonemic decomposition of the word w, which consists of L phonemes, then pronunciation likelihood P(O|w) can be computed by: P(O|w) = ∑ P p(O|P)P(P|w) (1.22) 10 where P is a particular sequence of phones for the word w and the summation is over all the pronunciation variations. The probability, P(P|w), of the pronuncia- tion sequence P given the word w can be computed as follows: P(P|w) = L ∏ i=1 P(P (w i ) |w i ) (1.23) 1.1.5 Decoder The basic task of the decoder is to find the most likely sequence of words correspondingtothefeaturevectorsgivenanacousticmodelandalanguagemodel. Thisisdonebysearchingallpossiblestatesequencesarisingfromallpossibleword sequences and the one which is most likely to have generated is chosen. It can be viewed as maximizing the probability of observing the partial sequence o 1:t and being in state s j at time t, given the model λ: ϕ (j) t =max θ p(o 1:t ,θ t =s j ;λ) (1.24) where O 1:T = (o 1 ,...,o T ) is the feature vector, θ is the state sequence. The above probability can be efficiently computed using the Viterbi algorithm: ϕ (j) t =max i { ϕ (i) t−1 a ij } b j (o t ) (1.25) where a ij is the transition probability from state i to state j, b j (o t ) is the emission probability of state j emitting symbol o t . Initially ϕ (j) 0 is set to 1 for the first non-emitting state and 0 to all the other states. The term max j {ϕ (j) T } is recorded for every decision and its traceback gives the most likely state/word sequence. 11 Chapter 2 Our approach Thetechniquesandalgorithmsweadoptedinoursystemaredescribedindetail in this chapter. 2.1 Front-End Features FrontendfeaturesareanimportantpartofanyASRsystem. Mostofthemare usuallyinspiredfromthewaythehumansproducespeech. Manyfactsfrompsycho- physical and psycho-acoustics of humans are taken into analysis. We compare three such features - Filter bank, Mel Frequency cepstral coefficients (MFCC), Perceptual linear predictive (PLP) analysis and evaluate their effectiveness on children’s speech for ASRs. 2.1.1 Mel Frequency Cepstral Coefficients (MFCC) Basic procedure for the calculation of MFCC features are as follows (Tervo & Pätynen, 2010; Ding): • The raw speech signal is passed through a pre-emphasis filter to apply loud- ness equalization across all the frequencies. • The signal is subjected to short time analysis which consists of framing and windowing operation. The frames are usually of 15ms - 30ms in width and overlapping. 12 • Each frame is transformed into the frequency domain by taking the discrete fourier transform (DFT) of the frame. X w = N−1 ∑ n=0 x n e − 2πi N kn k = 0,...,K−1 (2.1) wherek = 0,...,K−1arethediscretizedfrequenciesandnisthediscretized time index. • The signal in the frequency domain are warped into the mel-scale. The mel-scale is inspired from the hair spacings along the basilar membrane of the human ear which are responsible for reception of particular frequency of sound. The mel-scale can be obtained as: m = 2595log 10 ( 1+ f 700 ) (2.2) • Triangular filter bank spaced uniformly over the mel-spectrum is applied to the signal. The weighted sum of each filter is considered and their logarithm is calculated. E k = k 2 ∑ k=k 1 log 10 { |X k | 2 } W k (2.3) where E k is the energy output from the k th filter, W k is the triangular filter • The output from the mel-frequency filter banks are further subjected to Dis- crete Cosine Transform (DCT) to give the final mel-cepstral frequency coef- ficients. 
MFCC w = K−1 ∑ k=0 E k cos [ π K ( n+ 1 2 ) w ] w = 0,...,K−1 (2.4) where w is another frequency index. 13 Figure2.1summarizesthegenerationofMFCCfeaturesintheformofaflowchart. Figure 2.1: MFCC Calculation 2.1.2 Perceptual Linear Predictive Analysis (PLP) PLP features were reported to be more robust when there was an acoustic mismatch between the training and testing dataset (Woodland et al., 1996). But 14 when there are no mismatches MFCC are said to be slightly better performing (Hönig et al., 2005). The intuition behind using PLP features is that children’s speech contains higher mismatch and variability. MFCC and PLP techniques have many similarities. The basic procedure for the calculation of PLP features are as follows: • UnlikeMFCC,heretherawspeechsignalisdirectlyprocessedwithshorttime analysis with overlapping windowing operation instead of passing through a pre-emphasis filter. • The DFT is performed exactly as in the case of MFCC calculation (See. Equation 2.1). • The frequency domain signal is warped into Bark scale. Bark = 13arctan(0.00076f)+3.5arctan(( f 7500 ) 2 ) (2.5) • The warped signal is passed through a critical band filter to simulate the critical bands in the human peripheral auditory system. CBR = 26.81f 1960+f −0.53 if CBR< 2 Add 0.15∗(2−CBR) CBR> 20.1 Add 0.22∗(CBR−20.1) (2.6) • The resulting signal is down-sampled to meet 1 Bark intervals. • ThePLPcalculationinvolvesanadditionstageinvolvingapre-emphasisfilter to simulate the frequency-dependent loudness sensitivity of hearing (equal loudness), found in humans. 15 • Theequalizedvaluesaresubjectedtothepowerlawbyraisingittothepower of 0.33. • Linear Prediction analysis is applied on the resulting signal giving the pre- dictor coefficients. • Finally, cepstral coefficients are computed from predictor coefficients in a recursive manner. Figure2.2showstheflowchartforthecalculationofPLPfeaturesandfigure2.3 shows the comparison between the PLP and MFCC calculations. 2.1.3 Delta components The first order (delta) and the second order derivatives (delta-delta) of the acoustic features are calculated and appended to the feature vector. The first order delta features are calculated by (Gales & Young (2008)): ∆ o t = Σ n i=1 w i (o t+i −o t−i ) 2Σ n i=1 w 2 i (2.7) The second order, delta-delta features are calculated by: ∆ 2 o t = Σ n i=1 w i (∆ o t+i −∆ o t−i ) 2Σ n i=1 w 2 i (2.8) where n is the window width, o t is the original feature vector, w i is the regression coefficients. The resulting feature vector is of the form: o t = [o T t ∆ o T t ∆ 2 o T t ] T (2.9) 16 The addition of the delta components is to compensate for the assumption of conditional independence by HMM-based acoustic models as well as to capture dynamics in phonetic transitions (Furui, 1986). 2.2 Speaker Normalization Algorithms Previous studies have showed us that the increased inter-speaker and intra-speaker variability in children can be tackled with effective normalization techniques. We evaluate the importance of Cepstral Mean and Variance Normal- ization (CMVN) and Vocal Tract Length Normalization (VTLN) techniques. 2.2.1 Cepstral Mean and Variance Normalization (CMVN) CMVN is a normalization technique used to reduce the raw cepstral features to zero mean and unit variance (Viikki & Laurila, 1998). In our implementation CMVN is applied in the speaker dependent sense. 
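Applied per speaker, this normalization reduces to subtracting the speaker's mean and dividing by the speaker's standard deviation in each cepstral dimension, as formalized in the equations that follow. The snippet below is a minimal numpy sketch of that per-speaker operation, not the Kaldi routine used in our experiments; the feature matrix here is random data standing in for pooled frames of one speaker.

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization over one speaker's frames
    (equations 2.10-2.12): each feature dimension is shifted to zero mean
    and scaled to unit variance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, 1e-10)  # guard against zero variance

# Pool all frames belonging to one speaker, then normalize them together.
speaker_frames = np.random.randn(500, 39) * 3.0 + 1.5   # fake (T, D) features
normalized = cmvn(speaker_frames)
print(normalized.mean(axis=0)[:3], normalized.std(axis=0)[:3])  # ~0 and ~1
```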
Given a frame of acoustic features like MFCC or PLP, O T = {o 1 ,o 2 ,...,o T }, the mean vector is computed as: μ = 1 T T ∑ t=1 o t (2.10) and variance vector as: σ 2 = 1 T T ∑ t=1 diag(o t o T t ) (2.11) Then, CMVN is computed as: ˜ o t (d) = o t (d)−μ(d) σ(d) (2.12) 17 2.2.2 Vocal Tract Length Normalization (VTLN) VTLN is a speaker dependent transform aimed to reduce inter-speaker vari- ability. It involves the calculation of speaker dependent frequency warping factors using maximum likelihood estimation. The warping factors are used to warp the frequency axis during extraction of front-end features. The VTLN is expected to normalize the variability found in vocal tract structures from speaker to speaker. Generic procedure used to perform VTLN is described below • A scaling factor α is chosen based on grid search over discrete set of possible values obtained by experimentation (0.88 to 1.12). • The scaling factor is used to warp the frequency axis during the feature extraction phase. This is achieved by changing the spacing and width of the filters. • The scaling factor which maximizes the likelihood of the warped signal given the acoustic model is assigned to the particular speaker. According to the maximum likelihood criteria: ˆ α =arg max α P(o α |W,λ) (2.13) where o α = [o α 1 ,...,o α T ] is the acoustic data obtained after warping the origi- nal data o = [o 1 ,...,o T ] using the scaling factor α. ˆ α is the optimum scaling factor for the speaker, W is the uttered word sequence, λ is the acoustic model. • Each speaker is associated with a scaling factor which seems to capture the vocal tract variation in relative to other speakers. VTLN used in our system is based on (Kim et al., 2004). 18 2.3 Acoustic Model Adaptation Techniques AcousticModelAdaptationTechniqueslikeMaximumLikelihoodLinearTrans- form (MLLT), Speaker Adaptive Training (SAT) have shown improvements with children speech in the past (Elenius & Blomberg, 2005; Potamianos et al., 1997; Ganitkevitch). We experiment the effectiveness of both speaker independent and speaker dependent techniques. We use MLLT as a standard for speaker indepen- dent acoustic adaptation. 2.3.1 Maximum Likelihood Linear Transform (MLLT) MLLTworksbytransformingtheparametersoftheHMMmodelsuchthatthey are better adapted to the new speaker by using maximum likelihood adaptation (Leggetter & Woodland, 1995). The transformation is linear and is applied on the speaker independent model to give an estimate of the Gaussian model parameters such that the observation’s likelihood is maximized. The estimated mean is calculated by: ˆ μ =Aμ+b =Wξ (2.14) where μ is the mean of the speaker independent model, ˆ μ is the estimated mean, A is the transformation, b is the bias. W is the extended transformation given by [b T A T ] T and ξ is the extended mean vector given by [1 μ T ] T . The estimated variance is calculated by: ˆ Σ =LHL T (2.15) 19 OR ˆ Σ =HΣH T (2.16) whereH isthetransformationmatrix, Σisthecovariancematrix,ListheCholeski factor of Σ and ˆ Σ is the estimated covariance matrix. MLLT in our system is based on (Gales, 1999), which differs from the tradi- tional method by using semi-tied covariance matrices where a few full covariance matrices are shared over many distributions with each distribution having its own diagonal covariance matrices. 2.3.2 Speaker Adaptive Training (SAT) SAT systems deals with the phonetic variations and speaker induced variations separately (Ganitkevitch). 
The typical speaker adaptation techniques use: ˆ λ =argmax λ R ∏ r=1 L(O (r) |λ) (2.17) where λ is the model, O (r) observation sequence from the speaker r. The SAT calculates the transforms specific to each speaker. It is estimated by: ( ˆ λ c , ˆ H) =argmax (λc,H) R ∏ r=1 L(O (r) |H (r) (λ c )) (2.18) where R is the number of speakers, s r is the r th speaker in the set{s 1 ,s 2 ,...,s R }, H r is the transform applied for speaker s r , H models the variations induced by speakers and λ c models the phonetically induced variations. 20 TheSATincorporatedinoursystemisbasedonConstrainedMaximumLikeli- hoodLinearRegression(CMLLR).CMLLRisverysimilartoMLLT,theconstraint lies in the transformation applied to the variance which should correspond to the transform applied to the means (Gales, 1998) (Ganitkevitch). The estimated mean is calculated by: ˆ μ =Aμ−b (2.19) and the variance by: ˆ Σ =AΣA T (2.20) Here, the same transformation matrix A is applied to both the mean and the variance. Substituting Equations 2.19 and 2.20 into Equation 1.7 we obtain: Q(λ, ˆ λ) =C− 1 2 K ∑ k=1 T (r) ∑ t=1 γ (rk) t [ C k +log(|Σ k |)−log(|A| 2 )+(ˆ o t −μ k ) T Σ −1 k (ˆ o t −μ k ) ] (2.21) where ˆ o t =A −1 o t +A −1 b =Ao t +b =Wξ t (2.22) W is the extended transformation matrix similar to the case of unconstrained MLLT given by [ b T A T ] T and ξ t is the extended observation vector given by [ 1 o T t ] T . The i th row of the transform shown in equation 2.22 can be iteratively found out using: w i = ( αp i +k (i) ) G (i) −1 (2.23) 21 where p i is the extended co-factor row vector [ 0c i1 ... c in ] and c ij =cof(A i j). G (i) = K ∑ k=1 1 σ (k)2 i T ∑ t=1 (r) γ (k) t ξ t ξ T t (2.24) k (i) = K ∑ k=1 1 σ (k)2 i μ ik T ∑ t=1 (r) γ (k) t ξ t ξ T t (2.25) where α is dependent on G (i) , p i and k (i) . 2.3.3 Linear Discriminant Analysis (LDA) Since the children speech is subjected to increased variability, we apply Linear Discriminant Analysis to reduce the intra-class variability and increase the inter- class variability. LDA works by transforming the features such that they are of unit variance but not necessarily zero mean. LDA also reduces the dimensionality of the features which might lead to a better selection of the features. Operations involved in LDA are listed below (Balakrishnama & Ganapathiraju (1998)): 1. Mean (μ 1 ,...,μ n ) of data sets (x 1 ,...,x n ) is computed. μ i = 1 N N ∑ j x ij (2.26) 2. Mean of entire data set is computed as follows: μ T = n ∑ i=1 p i μ i (2.27) where μ T is the mean of entire data set, p i is the a-priori probability of the i th data set. 22 3. Covariance of the dataset is computed given by: Σ i = (x i −μ i )(x i −μ i ) T (2.28) 4. Intra-class scatter S w , is computed as: S w = ∑ i p i Σ i (2.29) It is nothing but the expected covariance of each of the classes. 5. Inter-class scatter S b , is computed as: S b = ∑ i (μ i −μ T )(μ i −μ T ) T (2.30) It can be viewed as the covariance of the mean vectors of each class. 6. Criteria for optimization is given by: criterion =S −1 w S b (2.31) 7. The eigen vectors corresponding to the non-zero eigen values of the above criterion gives the transformation. 8. The transformed data set is obtained by: transformed_data =transformation T × (x 1 ,...,x n ) T (2.32) 23 2.4 Pronunciation Modeling Acoustic Modeling has certain limitations when subjected to a lexicon with definite canonical transcriptions. In reality, speech is not always an exact match with the canonical transcriptions. 
This is especially observed in spontaneousconversations(Wester,2003),foreignaccents(Humphries et al., 1996), dysarthric speakers (Morales & Cox, 2009; Seong et al., 2012). Pro- nunciationmodelinghasprovedtohelpimprovetheASRperformanceinthe above cases. The fact that children are limited in linguistic knowledge and pronunciation skill (Li & Russell, 2001) poses an interesting problem on how to tackle the pronunciation differences. We study the pronunciation differences that are found in children and eval- uate how these pronunciation differences affects children of different age classes. The procedure implemented in our system is described as follows: (a) The pronunciation differences are obtained by running a free phone decoding task. i. The free phone decoding is performed by using a lexicon with phones mapped to itself instead of the typical word to phone transcriptions found in ASRs. The lexicon would look like: ax ax ae ae b b d d hh hh s s t t 24 ii. The training reference transcriptions are converted from words to phones. Toachievethis, foreachwordinthereferencetheirrespec- tive phone composition is looked up in the dictionary and replaced by its phone transcripts. A reference of the form “He has a blue pen” is converted to its equivalent phone form “hh iy hh ae z ah b l uw p eh n”. iii. A model is trained using the above phone-to-phone lexicon and phone level reference transcripts. iv. Aphonelevelgrammarisconstructed. Atri-phonelanguagemodel trained on training reference transcripts is used. (b) The decoded transcript for each utterance is obtained by applying a best path algorithm to the decoded lattices. (c) The decoded phonemic transcripts are aligned with the reference transcripts by using Needleman-Wunsch global alignment algorithm (Needleman & Wunsch, 1970). (d) Theinsertions, deletionsandsubstitutionsobtainedfromalignmentare plotted in the figure 2.4. (e) In our study, we only consider substitutions and ignore deletions and insertions. (f) A phone confusion matrix is constructed using reference transcripts vs decoded transcripts for all the phones in the dictionary. (g) The confusion matrix is pruned to retain only the pronunciation dif- ferences with high frequency of occurrence. We retain top 10 mapping 25 word phones her hh er tall t ao l Table 2.1: Toy-example: lexicon er r t dh Table 2.2: Toy-example: mapping rules. An example obtained from our experiments is shown below: m n s l k hh y iy t dh iy ih iy ay er r ay iy s n n s (h) The pronunciation alternatives are weighed based on their weights obtained from confusion matrix in the maximum likelihood estimation sense. (i) A weighted Finite State Transducer is used to generate the pronuncia- tionvariantsforallthewordsinthetestingvocabularyduringdecoding. The steps involved are described below using a toy example with two words and two mapping rules shown in table 2.1 and table 2.2 respec- tively: 26 i. A mapping transducer is created to incorporate the mapping rules generated. Figure 2.5 shows the structure of the FST for the toy example. ii. A lexicon transducer is created using the word to phone lexicon. Figure 2.6 shows the structure of the FST for the toy example. iii. The lexicon transducer is inverted and output symbol sorted. Fig- ure 2.7 shows the inverted transducer. iv. The mapping transducer is composed with the lexicon transducer to generate the pronunciation variants as per the mapping rule. Figure 2.8 shows the composed transducer. 
(j) The decoding is performed using the lexicon with newly added pronun- ciation variants. The performance is reported over each age class to observe the trend over age. Figure 2.9 shows the confusion matrices for different age classes. The plots show the confusion of the ASR system, as to how each phone in the dictionaryisconfusionwitheveryotherphone. AnidealASRwouldproduce just a diagonal matrix with each phone mapped to itself as an ideal case yielding100%accuracy. Inotherwordsthesparsityofthematrixdefinehow confused the system is. It is evident that the matrices for younger children show higher error rates compared to the older children, with the matrix for age 6 group showing the most confusion, while the least is observed in the case of children of age 14. For experimentation purposes, the CID database was split into two, one as a developmental dataset and the other as the testing dataset. Table 3.1 shows the distribution of data according to age for development and testing 27 datasets. The confusion matrix and the phone mapping rules were obtained from the development dataset. Decoding was performed on the test dataset with the lexicon containing the pronunciation alternatives estimated from the development dataset. 28 Figure 2.2: PLP Calculation 29 Figure 2.3: MFCC (left) vs PLP (right) Calculation 30 Figure 2.4: Insertions, Deletions, Substitutions Figure 2.5: Mapping Transducer 31 Figure 2.6: Lexicon Transducer Figure 2.7: Inverted Lexicon Transducer Figure 2.8: Composed Transducer 32 Figure 2.9: Confusion Matrices over Age 33 Chapter 3 Our system 3.1 Speech Recognition Setup All the recognition experiments were conducted using the Kaldi toolkit (Povey et al., 2011). The standard front-end of the setup used standard MFCC features with 13 mel-cepstrum coefficients with their first and second order derivatives. The MFCCs were extracted using 23-channel filter banks usingframe-lengthof25msandframe-shiftof10ms. Thesamplingfrequency of 16 KHz was used for all the experiments. For front-end experimentation a variation in the above parameters were used and are described later in Chapter 4.3. Kaldi was configured to model Hidden Markov models (HMM), one per each position dependent phones. Each phone was modeled with a HMM of 3 states, whereas silence was modeled with a 5 state HMM. A total of 1000 Gaussian densities are shared among HMMs. The British English Example Pronunciation (BEEP) dictionary (Robinson, 2010) containing British English pronunciations was used because of its extensivevocabulary. TheBEEPdictionaryconsistsof257065lexicalentries with 237749 unique words, 52 non-silent phones and 3 silent phones. Two language models (LM) were trained: one using a generic english LM from cmu-sphinx-5.0 (Sphinx, 2011) and the other using the reference tran- scriptions from the training data. The two LMs were then interpolated and 34 the resulting LM was used for the experiments. After experimenting using unigram, bigram and trigram models, the trigram was chosen to give the best performance. The perplexity test for the LM over the test utterances gave a perplexity of 268.67 with 0 out-of-vocabulary words. 
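For reference, the sketch below shows how such a perplexity figure is obtained from per-word conditional probabilities (equation 1.21), using the maximum-likelihood trigram estimate of equation 1.20 on an invented toy corpus. The language models in our experiments additionally use interpolation and smoothing, which are omitted here.

```python
import math
from collections import Counter

def perplexity(word_probs):
    """Perplexity from per-word conditional probabilities P(w_k | history),
    following equation 1.21: PP = 2^H, H the mean negative log2 probability."""
    H = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** H

# Maximum-likelihood trigram estimate (equation 1.20) on a toy corpus.
corpus = "he has a blue pen he has a red pen she has a blue car".split()
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
bi = Counter(zip(corpus, corpus[1:]))

def p_trigram(w3, w1, w2):
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

probs = [p_trigram("blue", "has", "a"), p_trigram("pen", "a", "blue")]
print(perplexity(probs))  # ~1.73 for this toy example
```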
3.2 Databases Threechildrenspeechdatabaseswereusedinthiswork: TheChildren’sInter- active Multimedia Project (CHIMP) (Narayanan & Potamianos, 2002), The CURead,PromptedSpeechCorpus(Cole et al.,2006a)alongwithCUStory Corpus (Cole et al., 2006b) and speech data collected from the joint effort of Southwestern Bell Technology Resources and Central Institute for the Deaf (CID) (Lee et al., 1999). CHIMP is a communication agent application and a computer game controlled by a conversational animated chimpanzee char- acter. The data consists of verbal interaction of children ranging between 6 years and 14 years with the computer. The CU Read, Prompted Speech Corpus consists of children through grade 1 to 5 (6 years to 11 years) read- ing sentences and isolated words. The CU Story Corpus consists of read and summarized stories from children ranging from 3rd through 5th grade. CID consistsoffivesentencesreadoutby436children(5-18years)and56adults (25 - 50 years). For our work, we sample out the data limited to children of 6 years to 14 years. The five sentences read by the subjects are: • “He has a blue pen.” • “I am tall.” • “She needs strawberry jam on her toast.” • “Chuck seems thirsty after the race.” 35 • “Did you like the zoo this spring?” TheCHIMPandCUKids’CorpuswereusedfortrainingandCIDfortesting. Tables3.1-3.2andFigures 3.2-3.2showstheagedistributionoftrainingand testing databases. Testing was conducted using data from speakers ranging between age 6 to 14 years from the CID database. Age CHIMP CU CIDMIC # of utterances # of speakers # of utterance # of speakers # of utterance # of speakers 6 yrs 674 3 6620 70 244 27 7 yrs 218 1 23501 144 333 37 8 yrs 6068 23 577 7 347 36 9 yrs 7804 31 1641 27 477 49 10 yrs 4925 19 2717 32 380 39 11 yrs 4908 19 3834 40 419 43 12 yrs 3511 14 0 0 413 43 13 yrs 2937 12 0 0 287 29 14 yrs 1608 6 0 0 209 21 Total 32653 128 38890 320 3109 324 Table 3.1: Age Distribution of Training and Testing Data Age CIDMIC Development-set Test-set # of utterances # of speakers # of utterance # of speakers 6 yrs 117 13 127 14 7 yrs 169 19 164 18 8 yrs 170 18 177 18 9 yrs 237 24 240 25 10 yrs 187 19 193 20 11 yrs 203 21 216 22 12 yrs 205 22 208 21 13 yrs 138 14 149 15 14 yrs 99 10 110 11 Total 1525 160 1584 164 Table 3.2: Age Distribution of CIDMIC development and test data 36 Figure 3.1: CHIMP data distribution Figure 3.2: Colorado data distribution 37 Figure 3.3: CIDMIC data distribution Figure 3.4: CIDMIC development data distribution 38 Figure 3.5: CIDMIC test data distribution 39 Chapter 4 Recognition Experiments and Results 4.1 Baseline System ThebaselinesystemwasconstructedbytrainingoncombineddataofCHIMP and CU Kid’s Corpus. The testing was performed on CID database for chil- dren age ranging between 6 to 14 years. A trigram interpolated language model is used. For the baseline experiments we use Cepstral Mean and Vari- ance Normalization (CMVN) as a standard practice. Monophone, triphone and quinphone models are modeled and evaluated. Model WER Monophone 54.73% Triphone 44.23% Quinphone 44.70% Table 4.1: Baseline System Table 4.1 shows the performance of our baseline models. Triphone model provides a significant reduction in WER of about 10.5% absolute compared to Monophone model. Quinphone modeling doesn’t prove useful over the triphone models. Thus the triphone model forms our baseline system. 
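All results in the following chapter are reported as word error rate (WER): the number of substitutions, deletions and insertions in the best alignment of the hypothesis against the reference, divided by the number of reference words. The minimal scoring sketch below illustrates that computation; the reported numbers come from the Kaldi scoring pipeline, not from this snippet.

```python
def wer(reference, hypothesis):
    """Word error rate = (substitutions + deletions + insertions) / #reference
    words, computed with a standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1] / len(ref)

print(wer("he has a blue pen", "he has blue ten"))  # 2 errors / 5 words = 0.4
```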
40 4.2 Complexity Analysis of Sentences Sentence WER “He has a blue pen.” 42.74% “I am tall.” 22.92% “She needs strawberry jam on her toast.” 57.54% “Chuck seems thirsty after the race.” 51.27% “Did you like the zoo this spring?” 35.13% Table 4.2: Performance Analysis of Five Sentences in CID The complexity of the five sentences in CID is analyzed in terms of ASR performance for a baseline triphone model and can be seen in Table 4.2. For sentence 1, the relatively poor performance might be due to the suc- cessive similar sounding (pronunciation) words “He has”, which might be more error prone in the case of children. Sentence 3 and 4 have few verbally challenging pronunciations and the presence of proper nouns, for example: “strawberry”, “Chuck”, which might prove challenging for young children because of their limited vocabulary knowledge. This explains for their poor performance. It can be inferred that the sentence length is not a factor for performance degrade. Sentences containing common and easy words show good performance as in the case of sentence 2 and 5. 4.3 Front-End Feature Analysis We conduct experiments using different acoustic features like MFCC, PLP andfilter-bankfeaturestoevaluatetheirperformancewithchildren’sspeech. Alltheexperimentsinthissectionareconductedonbaselinetriphonemodels. The features were calculated using 13 coefficients, 23 channel filter banks using frame width of 25ms with 10ms frame shift. 41 Table 4.3 shows the performance obtained from using different front-end features. The best results are obtained for the MFCC features. Thus the rest of the paper uses MFCC as standard front-end feature. Features WER MFCC 44.23% PLP 49.20% Filter Bank 65.25% Table 4.3: Front-end Feature Selection coefficients log-energy window size filter-banks WER 11 NO 25ms 23 42.72% 12 NO 25ms 23 40.73% 13 NO 25ms 23 44.23% 14 NO 25ms 23 43.13% 15 NO 25ms 23 43.23% 13 NO 20ms 23 42.93% 13 NO 30ms 23 42.42% 13 NO 35ms 23 40.78% 13 NO 40ms 23 42.21% 13 NO 25ms 22 43.47% 13 NO 25ms 24 42.77% 13 YES 25ms 23 49.25% Table 4.4: Performance for MFCC features Table 4.4 shows the results obtained for variation of MFCC parameters like number of mel-cepstrum coefficients, log energy, window size and number of channel filter banks. It can be seen that adding log-energy decreases the performance by 5.02% absolute. Superior performance is observed when the number of MFCC coefficients are reduced to 12 resulting in a gain of 3.5% over the baseline. Increasing the frame width also seems to help the performance, a gain of 3.45% absolute was observed for a frame width of 35ms. Increasing frame width and decreasing MFCC coefficients provides 42 some smoothing and helps decrease the variability in speech which seems to translate to better performance in the case of children speech. 4.4 Speaker Normalization Algorithms Figure 4.4showsthescalingfactorscalculatedoveragefortheCIDMICdata. We can see a linear increase of the scaling factor with age. Table 4.5 shows the performance improvements achieved using CMVN and VTLN. Overall both CMVN and VTLN bring significant improvements. An improvement of 19.18%absoluteisobtainedwithCMVNwhereasVTLNadds4.05%absolute improvement. Using VTLN in testing further reduces WER but not by a big margin (0.35% absolute). 
Model CMVN VTLN WER Triphone NO NO 63.09% Triphone YES NO 44.23% Triphone YES Training only 40.18% Triphone YES Training + Testing 39.84% Table 4.5: Speaker Normalization Techniques 4.5 Acoustic Model Adaptation Techniques Table 4.6 shows different speaker adaptation techniques and its effectiveness in terms of WER for children speech. LDA is not effective and makes little to no change to the performance. MLLT gives a net gain in performance of 4.72% absolute, whereas SAT reduces WER by 11.94% absolute. Speaker Independent SAT degrades the performance of the baseline system. Among 43 Figure 4.1: Age vs average VTLN scaling factor obtained for CIDMIC data acoustic model adaptation techniques SAT gives a bigger improvement mar- gin. The best results are obtained when MLLT, SAT and VTLN are used together to achieve 27.26% WER, an improvement of 16.65% absolute. 4.6 Age Dependent Results Toinvestigatehowtheperformanceofacousticmodelingtechniques,normal- ization techniques and acoustic adaptation effects each age class, the testing data is split according to the age groups ranging from 6 years to 11 years. The Figure 4.3 shows the performance variation across the age groups of 44 Model VTLN LDA MLLT SAT WER Triphone X X X X 44.25% Triphone X X X X 39.51% Triphone X X X X 36.51% Triphone X X X X(SI) 45.27% Triphone X X X X 32.29% Triphone X X X X(SI) 40.33% Triphone X X X X 29.34% Triphone X X X X(SI) 41.06% Triphone X X X X 29.86% Triphone X X X X(SI) 35.53% Triphone X X X X 27.26% Table 4.6: Acoustic Modeling and Adaptation SI: Speaker Independent children and the effectiveness of various adaptation and normalization tech- niques for each age class. The Word Error Rate (WER) decreases over age from6yearsto11years. Acousticadaptationandvariationtechniquesfollow the same trend. There is a performance difference of around 17% absolute between the age class of 6 years and 11 years. Approximately linear increase inperformanceisobservedoverageclassesusingvariousacousticadaptation and normalization techniques. 4.7 Pronunciation Modeling Table 4.7 shows the results obtained with and without pronunciation mod- eling and the relative improvement achieved. A consistent improvement is observed for all the acoustic modeling techniques. An average performance of 1.185% absolute is gained over the best results using pronunciation mod- eling. 45 Figure 4.2: Age Dependency Results Figure 4.3 shows the age dependency in ASR performance with the pronun- ciation modeling technique specifically developed for the CID test database. The results are shown with the pronunciation modeling (dotted lines) and without (solid lines). 46 Model Baseline PM % Gain Monophone 54.66% 53.94% 1.32% Triphone 42.89% 40.77% 4.94% Tri + VTLN 38.64% 37.50% 2.95% Tri-MLLT 38.18% 37.15% 2.7% Tri + SAT 31.03% 30.07% 3.09% Tri-MLLT + SAT 28.83% 28.03% 2.77% Tri-MLLT + VTLN 35.57% 33.53% 5.74% Tri-MLLT + SAT + VTLN 25.51% 24.84% 2.63% Table 4.7: Results: Pronunciation Modeling Tri-MLLT: Triphone + LDA + MLLT Figure 4.3: Pronunciation Modeling Results 47 Chapter 5 Duration Modeling 5.1 Motivation Temporal variability is significant in the case of children and might account for speaking rate, reading ability and pause durations. Vowel and sentence durations decrease with age significantly (Lee et al., 1999). The increased duration variability in the case of children might be a factor for the relative performance difference between adults and children speech. 
To capture the durationinformationorthespeakingrate,thereisaneedfortheHMM-based acoustic models to accurately model the duration probabilities. The HMM inherently models the duration probabilities but are limited to an exponential distribution (Juang et al., 1985). The state durations in the HMMs are defined in terms of the transition probability of staying in the same state, given by: p j (d) = (1−a jj )a d−1 jj (5.1) where a jj is the transition probability of staying in the same state j, d is the number of frames spent in state j. To model more complex duration information, there is a need to explicitly include the duration information into the acoustic modeling. 48 5.2 Introduction Duration modeling have shown significant improvements in the past in application to ASR systems (Pylkkönen & Kurimo, 2004) (Gadde, 2000). (Levinson, 1986) introduced continuously variable duration hidden markov models (CVDHMM) for ASR, which models the durations with continu- ous probability density functions. They also proved that the gamma dis- tribution is better suited for modeling durations rather than the inherent exponential distributions. In Gadde (2000) durations at a word level were modeled. The word durations were included along with the front-end fea- ture vectors and GMM were used to model duration. Performance improve- ments upto 1% WER was observed. In Bonafonte et al. (1993), a hidden semi-Markovmodel(HSMM)isintroduced,wherethetransitionprobabilities are ignored from the HMM and is replaced by state duration distributions that are explicitly modeled using gamma distribution. In Noll & Ney (1987) (Russell & Cook, 1987), expanded state HMM (ESHMM) is introduced, in which the framework of the HMM is expanded into a sub-HMM. The topol- ogy of the sub-HMM is such that it models the correct state duration dis- tribution while retaining the same emission probabilities. In Juang et al. (1985), a post-processor duration modeling scheme was introduced in which the log-likelihood from the viterbi algorithm is augmented by log of duration probabilities and re-ranking the paths emulated duration modeling. 49 5.3 Implementation (Pylkkönen & Kurimo,2004)reviewstheeffectivenessofthevariousduration models. ItwasshownthatthemethodsinvolvingchangesinHMMstructure likeHSMMandESHMMneedamodificationtotheViterbialgorithmtokeep the complexity of the system practical. Explicitly modeling duration into the HMM often results in increase in Real-time factor to gain any reasonable increaseinASRperformance. Whereasthepost-processordurationmodeling relies on augmenting likelihood scores which doesn’t need modifications to HMMstructureortheviterbialgorithm. Itwasalsoshowntoperformbetter relatively compared to other methods especially at lower real-time factors. Hence, in our system we adopted the post-processor method considering the computational complexity, performance and ease of implementation. Implementation details for the post-processor duration modeling are described below: • Run alignment on the training data to get the state sequence for each phone in the training data. • Estimatetheprobabilitydistributionforeachstateforeachphone. This is done by constructing a histogram with 25 bins for the occupancy of each state and normalizing. Figure 5.1 shows the probability distribu- tion for 3 states for the phone "‘iy"’. Figure 5.2 shows the probability distribution for 3 states for the phone "‘aa"’. 
• The raw probability distribution is smoothed and approximated using different well-defined distributions like Gaussian, Gamma. Figure 5.3 shows the gaussian approximated version of figure 5.1. 50 Figure 5.1: Probability distribution for 3 states for the phone "‘iy"’ • During decoding of the test data, the n-best path algorithm is applied to the decoded lattices to obtain n-best hypotheses based on their like- lihood. • The word level lattices are converted to the phone level lattices. • Theloglikelihoodofthephonesarethenaugmentedusingthelogdura- tion probabilities as shown below: log ˆ f =logf +α N ∑ j=1 log(p j (l j /T)) (5.2) wherelog ˆ f is the augmented log likelihood, logf is the log likelihood, α is the emperical scaling factor, N is the total number of unique states 51 Figure 5.2: Probability distribution for 3 states for the phone "‘aa"’ in the state sequence, T is the number of frames in the phone, l j is the number of frames in state j and p j (l j /T) is the probability of being in state j for (l/T) of the word. • The n-hypotheses are re-ranked to give the best path based on the augmented log likelihood scores. 5.4 Experiments and Results The duration modeling experiments were conducted on only CHIMP database. The database was divided into two sets training and testing such thattheyhavemutuallyexclusivespeakers. Thetrainingsetcontained29829 52 Figure 5.3: Gaussian approximated version of figure 5.1 Baseline Results WER 22.22% Table 5.1: Baseline: Duration Modeling utterances and the testing set contained 6117. The language model was trained on the training data reference transcripts. The acoustic model used was a monophone model. The experiment was conducted on position independent and position depen- dentphones. Thepositionindependentversionsimplymeansthatthedistri- butions of the phone estimated were irrespective of the position of the phone 53 α Quantized bins (WER %) Gaussian Distribution (WER %) Gamma Distribution (WER %) 0.1 - 21.94 21.99 0.2 - 21.92 21.99 0.3 - 21.97 21.98 0.4 - 21.93 21.96 0.5 22.04 21.93 21.93 0.6 - 21.97 21.97 0.7 - 22.01 21.99 1.0 22.15 22.05 22.04 2.0 22.39 22.26 22.18 3.0 22.68 22.62 22.33 Table 5.2: Position independent duration modeling for phones in the word. Whereas the position dependent version had 4 distributions associated with a single phone depending on its position in the word - begin- ning, intermediate, ending and single. This meant there was a possibility of data scarcity issues. In case of data scarcity, the duration scores were simply neglected. The experiment was conducted for the combinations of different distributions - discrete quantized bins, gaussian and gamma distributions. The emperical scaling factor α was also varied to check its effects. Table 5.2 shows the results obtained for position independent version of the experiments and Table 5.3 shows the results for position dependent phone duration modeling. It can be seen from both tables 5.3, 5.2, that approximating the quantized bins using a distribution helps. The best results were obtained for position dependent modeling with a gaussian distribution. More specifically for α = 0.5 leading to relative 2% decrease in word error rate (WER). Figure 5.4 shows an overall trend of the variation of WER with α with position dependent quantized bins as an example. 
α     Quantized bins (WER %)   Gaussian (WER %)   Gamma (WER %)
0.1   22.00                    21.95              21.99
0.2   21.98                    21.87              21.95
0.3   21.96                    21.90              21.93
0.4   21.97                    21.88              21.93
0.5   22.00                    21.81              21.91
0.6   22.04                    21.85              21.93
0.7   22.06                    22.89              -
1.0   22.15                    21.96              -
2.0   22.37                    22.19              -
3.0   22.68                    22.45              -
Table 5.3: Position-dependent duration modeling for phones

As seen in Figure 5.4, the WER first decreases as α increases, reaches its minimum at the optimal α, and then increases monotonically.

We also investigated state-independent duration modeling, in which a single distribution is assigned to the whole phone instead of assigning a distribution to each of its states. Table 5.4 shows the results for state-independent, position-independent duration modeling.

α     Gaussian (WER %)   Gamma (WER %)
0.1   21.97              21.97
0.2   21.99              21.99
0.3   21.99              21.99
0.4   21.98              21.98
0.5   22.02              21.02
0.6   22.03              21.03
0.7   22.02              22.02
1.0   22.06              21.06
2.0   22.19              22.19
3.0   22.29              22.29
Table 5.4: Position- and state-independent duration modeling for phones

Figure 5.4: WER (%) as a function of α for position-dependent phone duration modeling

5.5 Conclusion

Even though a small improvement (relative 2%) was obtained, the results were found to be very sensitive to the acoustic and empirical duration scaling factors, so the results themselves cannot be used to draw strong conclusions.

Chapter 6
Subword Modeling

6.1 Motivation

Children are known to make pronunciation mistakes because of their limited vocabulary skills. The most common mistakes include stammering, uttering incomplete words and skipping parts of words; a typical case of stammering is the word "summer" read as "sum-summer". To capture these mistakes we need to operate at a level smaller than words but larger than phones. Subwords are sequences of phones that make up part of a word, and two or more subwords form a word. Subwords allow us to model out-of-vocabulary words, which can be useful in applications such as reading tutors for detecting a child having difficulty pronouncing a word. This could also help bridge the performance gap between children's and adults' ASR.

6.2 Introduction

Previous work has shown ASR performance improvements from using meaningful subwords instead of words to model out-of-vocabulary words. NIST provides a syllabification tool that generates syllables for a given word (NIST). Bazzi (2002) introduced a method for generating subword units based on mutual information (MI), with the aim of efficiently modeling out-of-vocabulary words. Creutz & Lagus (2002) introduced a method for finding subword units, called morphs, using the minimum description length (MDL) principle. In Hagen et al. (2007), subwords are generated using a modified version of the Lempel-Ziv-Welch (LZW) text-compression algorithm; these units were shown to correlate better with syllables and to have lower false-alarm rates than the MI and MDL methods.

6.3 Implementation

Our approach differs from Hagen et al. (2007) in that their system uses lexical trees to model subwords, whereas we use finite state transducers (FSTs). Another important distinction is that their system incorporates the modifications during decoding, while our system uses FSTs to model the linguistic mistakes after the decoded subword lattices have been obtained.
6.3.1 Generating Subwords from Words

Since the LZW algorithm has been shown to be better than, or comparable to, the other methods (MDL, MI) for generating subword units, we implement it in our system. The details are described below; a code sketch of the relative-position and sequence-scoring computations is given after the word-level baseline (Table 6.1).

(a) Figure 6.1 shows the flow chart for subword-unit generation. A typical word-to-phone dictionary is input to the system, and the system generates tables containing the subwords along with their phoneme-sequence lengths, word positions and frequencies of occurrence (hits). Three such tables are generated according to the subword's position in the word: initial, medial and end.

Figure 6.1: Generation of subwords
Figure 6.2: Computation of relative position (a, above) and score (b, below)

i. The tables are initialized with the non-silence phones, the hits are initialized to zero for all phonemes, and the text buffer B is cleared.
ii. For each word in the input dictionary and its corresponding phoneme sequence, the phones are iterated over and transferred to the text buffer, B = B*A, where * denotes concatenation of the new phone A onto the contents of B.
iii. The position of B in the word (pos) and the number of phonemes in B (#phonemes) are also stored.
iv. If B is already in the dictionary table, the hits for that entry are incremented.
v. If B is not present in the dictionary table, it is added and its hits are initialized to 1; the text buffer is then cleared and set to A.
vi. When the end of the phoneme sequence is reached, the buffer B is cleared and the above process is repeated for the next word in the input lexicon.

(b) Figure 6.2a shows the calculation of the "relative position" of each subword unit in the generated tables.
i. The tables generated in step (a) are sorted according to the number of hits of each entry, separately for each position.
ii. For each subword unit U in a dictionary table, its relative position is computed as

RP(U) = index / total    (6.1)

where total is the total number of subword units in U's table and index is the index of U within that table.

(c) The third step is to choose the best subword sequence for each word; Figure 6.2b shows the process.
i. For each word W in the input lexicon there may be multiple sequences of subword units that make up the word.
ii. The best sequence is the one that maximizes the score

Score(i) = (1/m_i) \sum_{j=1}^{m_i} RP(u_{ij})    (6.2)

where i is the sequence index, u_{i1}, u_{i2}, ..., u_{im_i} are the candidate subword units of the i-th sequence, and RP(u_{ij}) is the relative position of unit u_{ij} in its dictionary table.
iii. The sequence with the best score is selected and stored in the database.
iv. The split information for word W is output to create a word-to-subword lexicon.

6.3.2 Recognition Experiments and Results

To compare the effectiveness of our method with that of Hagen et al. (2007), all experiments were conducted on the Colorado database, whose training set consists of 35,990 utterances and whose testing set consists of 2,900 utterances. The word-level baseline system uses a trigram language model trained on the training transcripts of the Colorado dataset. The word-level baseline results are shown in Table 6.1.

Model                          Word Error Rate (WER) %
Monophone                      20.46
Triphone                       15.19
Triphone + LDA + MLLT + SAT    14.35
Table 6.1: Baseline system for word-level ASR on the Colorado data
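As a concrete illustration of Eqs. (6.1) and (6.2), the following minimal sketch scores two hypothetical segmentations of the word "summer". The position tables, hit counts and orthographic subword labels (used instead of phone sequences for readability) are illustrative assumptions, the sort direction assumes that more frequent units should receive a higher relative position, and the LZW-style table construction itself is omitted.

```python
# Hypothetical position-specific tables: subword unit -> hit count.
tables = {
    "initial": {"sum": 40, "su": 12, "s": 5},
    "end":     {"mer": 38, "mmer": 9, "er": 20},
}

def relative_position(unit, position):
    """Eq. (6.1): RP(U) = index / total, after sorting the table by hits
    so that more frequent units get a higher relative position."""
    table = sorted(tables[position], key=lambda u: tables[position][u])
    return (table.index(unit) + 1) / len(table)

def score(candidate):
    """Eq. (6.2): mean relative position of the units in one candidate
    segmentation, given as a list of (unit, position) pairs."""
    return sum(relative_position(u, p) for u, p in candidate) / len(candidate)

# Two hypothetical segmentations of the word "summer":
candidates = [
    [("sum", "initial"), ("mer", "end")],
    [("su", "initial"), ("mmer", "end")],
]
best = max(candidates, key=score)
print([u for u, _ in best])   # ['sum', 'mer']
```

In the full system, the winning segmentation of every word is written to the word-to-subword lexicon used in the experiments below.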
To implement the subword-level ASR, the following steps were performed:

• The typical word-to-phoneme lexicon was replaced by the subword-to-phoneme lexicon.
• The training reference transcripts were converted from the word level to the subword level.
• The language model was trained on the subwords seen in the training dataset.
• Since subword units produce a smaller vocabulary than words, data scarcity is less of a concern, which enabled us to experiment with higher-order n-gram language models.
• The decoded lattices then have their output symbols at the level of the lexicon entries, which are now subwords rather than words.

The results obtained for the baseline subword-level recognition task are shown in Table 6.2, and Figure 6.3 shows the variation of the subword error rate with the order of the n-gram language model.

Model                          Subword error rate (%) by n-gram LM order
                               3-gram   4-gram   5-gram   6-gram
Monophone                      31.42    21.90    19.34    18.83
Triphone                       18.21    14.32    13.36    13.18
Triphone + LDA + MLLT + SAT    15.91    12.56    11.97    11.73
Table 6.2: Baseline system for subword-level ASR on the Colorado data

Figure 6.3: Subword error rate versus n-gram language model order for the subword baseline system

6.3.3 Converting Subwords to Words

Ultimately, the ASR output needs to be in terms of words for easy human readability and perception, and we use FSTs to convert the subwords back to words. To handle stammering, repetition, pauses and partially uttered words, we create custom FSTs as described below.

Idea

The idea is to design a subword-to-word conversion system that can also handle left-over subwords that are not part of any word. The result is a sequence of words plus subwords (in the case of any stray subwords).

Procedure

The procedure is described using a toy example for the phrase "summer water". Note that the subwords in this example are chosen for ease of readability and may not correspond to valid phoneme sequences.

Lexicon (L):
The lexicon L is obtained after generating the subwords; it maps each word to its corresponding subword sequence:

summer   sum mer
water    wat ter

To handle stray subwords, every subword is also mapped to a unique tag "<sw>" and these mappings are added to the lexicon, giving:

<sw>     mer
summer   sum mer
<sw>     sum
<sw>     ter
water    wat ter
<sw>     wat

Graphically, the lexicon FST has the form shown in Figure 6.4.

Figure 6.4: The lexicon transducer L

Conversion:
To convert the subwords to words, the decoded subword lattice is composed with the lexicon FST L; the result is a mixed sequence of words and stray subwords. To illustrate the process on the toy example, let the decoded subword phrase be "sum sum mer wat wat wat ter"; the desired output is "sum summer wat wat water". Structurally, the subword input can be viewed as a finite state acceptor (FSA), shown in Figure 6.5.

Figure 6.5: The input FSA I

The composition of the input FSA (I) with the lexicon FST (L) is denoted

O_1 = I ∘ L    (6.3)

The resulting FST contains multiple paths and needs to be determinized to obtain the dominating path. Determinization is performed on O_1:

O_2 = det(O_1)    (6.4)

The resulting determinized FST is shown in Figure 6.6.

Figure 6.6: The determinized FST O_2
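To show the intended effect of the composition O_1 = I ∘ L and the subsequent selection of the surviving path on the toy example without an FST toolkit, here is a minimal Python stand-in: a greedy left-to-right segmentation of the decoded subword string against the lexicon, with stray subwords kept as-is (they would carry the <sw> tag on the output side of L). It reproduces the mapping described above but is not the FST implementation used in this work; in practice these operations are carried out with standard FST composition and determinization.

```python
# Word -> subword-sequence lexicon from the toy example; any single subword
# may also surface on its own as a stray unit tagged <sw>.
LEXICON = {("sum", "mer"): "summer", ("wat", "ter"): "water"}
SUBWORDS = {"sum", "mer", "wat", "ter"}

def subwords_to_words(units):
    """Greedy stand-in for composing the subword string with the lexicon
    transducer L: prefer full words, fall back to stray <sw> units."""
    out, i = [], 0
    while i < len(units):
        matched = False
        # Try the longest lexicon entry starting at position i first.
        for seq, word in sorted(LEXICON.items(), key=lambda kv: -len(kv[0])):
            if tuple(units[i:i + len(seq)]) == seq:
                out.append(word)
                i += len(seq)
                matched = True
                break
        if not matched:              # stray subword: keep the unit itself
            assert units[i] in SUBWORDS
            out.append(units[i])
            i += 1
    return out

decoded = "sum sum mer wat wat wat ter".split()
print(subwords_to_words(decoded))   # ['sum', 'summer', 'wat', 'wat', 'water']
```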
Re-scoring LM:
To re-score with the word-level language model G, the determinized FST O_2 should not contain any subwords on its output side. To achieve this, the <sw> tags are replaced with <eps> tags by composing O_2 with an FST F, shown in Figure 6.7, in which every word has a self-loop with input equal to output, except for <sw>, which is mapped to <eps>. The process can be written as

O_3 = O_2 ∘ F    (6.5)

Figure 6.7: The FST F

The resulting FST O_3 has <eps> tags in place of the <sw> tags while preserving the actual subword on the input side. Figure 6.8 shows the resulting O_3.

Figure 6.8: The FST O_3

The grammar FST G is a unigram language model consisting of only the two words "summer" and "water"; it is shown in Figure 6.9.

Figure 6.9: The grammar FST G

Rescoring is then simply the composition of O_3 with the grammar FST G:

O_4 = minimize(determinize(O_3 ∘ G))    (6.6)

The resulting FST, determinized and then minimized for readability, is shown in Figure 6.10. It can be seen from the figure that the FST produces only the two output words "summer water". Thus the input "sum summer wat wat water" was converted to the word-level output expected under the grammar.

Figure 6.10: The final FST O_4

6.4 Results

The experiments were conducted with an acoustic model trained on monophones and a 6-gram language model trained on the subwords in the training dataset. Performance was recorded after each composition step and is shown in Table 6.3.

Hypothesis           Subword error rate (%)
Baseline (word)      20.46 (WER)
Baseline (subword)   18.83
O_1                  33.46
O_2                  33.04
O_3                  32.22
O_4                  53.68
Table 6.3: Step-wise performance of the subword modeling pipeline

6.5 Conclusions

The results show that although the subword baseline improves on the word-level baseline, performance degrades when the subwords are converted back to words, with a large drop after re-scoring. The database itself may not be well suited to this experiment; a database with extensive annotation of pronunciation mistakes would allow a more detailed evaluation. A hybrid system able to handle both words and subwords, as in Hagen et al. (2007), might also prove more useful: such a system would use a typical word-level system when there are no errors and the subword system when the targeted mistakes are detected.

Chapter 7
Conclusions and Future Work

7.1 Conclusions

Acoustic adaptation and normalization techniques such as VTLN, MLLT and SAT lead to a large improvement in performance over the baseline, and the speaker-adaptive techniques (VTLN, SAT) proved more effective than the speaker-independent adaptation technique (MLLT). Pronunciation modeling can further improve performance by learning the common linguistic mistakes made by children, and evaluating pronunciation mistakes as a function of age gives insight into where the potential improvements in pronunciation modeling lie for children. The preliminary results obtained with pronunciation modeling point to an area with potential performance gains for children's ASR.

7.2 Future Work

Deep neural networks (DNNs) are becoming more effective at capturing latent acoustic features and are outperforming typical GMM-HMM acoustic models (Pan et al., 2012). It would be interesting to examine the performance of DNNs on children's speech. The increased acoustic variability that children exhibit motivates using DNNs for effective feature learning that minimizes this variability. The effect of DNNs on cross-evaluation between children's and adults' speech is another interesting research problem.
7.2.1 Pronunciation Modeling

Although the pronunciation modeling used in our system was preliminary, the promising improvements hint at the potential performance to be gained. The pronunciation modeling can be extended by incorporating context information to make it more robust, and the likelihood estimates can be smoothed using different probability-smoothing techniques. More complex, state-of-the-art pronunciation modeling could also be applied and evaluated on children's speech. It would also be worthwhile to compare and study the pronunciation-variation patterns of adults and children.

7.2.2 Duration Modeling

The duration modeling used in our system proved to be very sensitive to scaling parameters such as the acoustic scale, the language model scale and the empirical duration-score scale. More explicit duration modeling can be embedded in the HMM architecture during the acoustic modeling stage; that is, ESHMM, CVDHMM and HSMM could be applied to children's speech and their performance compared. Since speaker adaptive training already appears to capture temporal variability effectively, we cannot expect a huge leap in performance from duration modeling alone.

7.2.3 Subword Modeling

A hybrid approach could improve the results by handling words and stray subwords differently. The hybrid system would use the subword system only when pronunciation mistakes occur; in the absence of mistakes, typical word-level ASR decoding would be performed. Pronunciation mistakes could be detected using the ASR confidence scores of both the subword and the word models. Moreover, subword modeling could be extended to capture pronunciation variants at the subword level. This might prove more interesting and intuitive than phone-based pronunciation models, since humans have been shown to make pronunciation errors and variations at the level of syllables (Greenberg, 1999), which the subwords represent fairly closely.

Bibliography

Balakrishnama S, Ganapathiraju A (1998) Linear discriminant analysis: a brief tutorial. Institute for Signal and Information Processing.
Bazzi I (2002) Modelling out-of-vocabulary words for robust speech recognition. Ph.D. dissertation, Massachusetts Institute of Technology.
Bonafonte A, Ros X, Mariño JB (1993) An efficient algorithm to find the best state sequence in HSMM. In EUROSPEECH.
Cole R, Hosom P, Pellom B (2006a) University of Colorado prompted and read children's speech corpus. Technical Report TR-CSLR-2006-02, University of Colorado.
Cole R, Hosom P, Pellom B (2006b) University of Colorado prompted and read children's speech corpus. Technical Report TR-CSLR-2006-02, University of Colorado.
Creutz M, Lagus K (2002) Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Volume 6, pp. 21–30. Association for Computational Linguistics.
Ding JJ. Time frequency analysis tutorial. R99942057.
Elenius D, Blomberg M (2005) Adaptation and normalization experiments in speech recognition for 4 to 8 year old children. In INTERSPEECH, pp. 2749–2752.
Furui S (1986) Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech and Signal Processing 34:52–59.
Gadde VR (2000) Modeling word duration for better speech recognition. In Proc. NIST Speech Transcription Workshop, College Park, MD. Citeseer.
Gales MJF (1998) Maximum likelihood linear transformations for HMM-based speech recognition. Vol. 12, pp. 75–98. Elsevier.
Gales MJF (1999) Semi-tied covariance matrices for hidden Markov models. Vol. 7, pp. 272–281. IEEE.
Gales M, Young S (2008) The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing 1:195–304.
Ganitkevitch J. Speaker adaptation using maximum likelihood linear regression. Citeseer.
Greenberg S (1999) Speaking in shorthand: a syllable-centric perspective for understanding pronunciation variation. Speech Communication 29:159–176.
Hagen A, Pellom B, Cole R (2007) Highly accurate children's speech recognition for interactive reading tutors using subword units. Speech Communication 49:861–873.
Hönig F, Stemmer G, Hacker C, Brugnara F (2005) Revising perceptual linear prediction (PLP). In INTERSPEECH, pp. 2997–3000.
Humphries JJ, Woodland PC, Pearce D (1996) Using accent-specific pronunciation modelling for robust speech recognition. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), Vol. 4, pp. 2324–2327. IEEE.
Juang B, Rabiner L, Levinson S, Sondhi M (1985) Recent developments in the application of hidden Markov models to speaker-independent isolated word recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), Vol. 10, pp. 9–12. IEEE.
Kim DY, Umesh S, Gales MJF, Hain T, Woodland PC (2004) Using VTLN for broadcast news transcription. In Proc. ICSLP, Vol. 4.
Lee S, Potamianos A, Narayanan S (1999) Acoustics of children's speech: Developmental changes of temporal and spectral parameters. Vol. 105, pp. 1455–1468. Acoustical Society of America.
Leggetter CJ, Woodland PC (1995) Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Vol. 9, pp. 171–185. Elsevier.
Levinson SE (1986) Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech & Language 1:29–45.
Li Q, Russell MJ (2001) Why is automatic recognition of children's speech difficult? In Seventh European Conference on Speech Communication and Technology.
Li Q, Russell MJ (2002) An analysis of the causes of increased error rates in children's speech recognition. In Seventh International Conference on Spoken Language Processing.
Morales SOC, Cox SJ (2009) Modelling errors in automatic speech recognition for dysarthric speakers. Vol. 2009, p. 2. Hindawi Publishing Corp.
Narayanan S, Potamianos A (2002) Creating conversational interfaces for children. Vol. 10, pp. 65–78. IEEE.
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Vol. 48, pp. 443–453. Elsevier.
Noll A, Ney H (1987) Training of phoneme models in a sentence recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '87), Vol. 12, pp. 1277–1280. IEEE.
National Institute of Standards and Technology (NIST). NIST syllabification software.
Pan J, Liu C, Wang Z, Hu Y, Jiang H (2012) Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMs in acoustic modeling. In ISCSLP, pp. 301–305.
Potamianos A, Narayanan S (1998) Spoken dialog systems for children. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 197–200. IEEE.
Potamianos A, Narayanan S, Lee S (1997) Automatic speech recognition for children. In Eurospeech, Vol. 97, pp. 2371–2374.
Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, Hannemann M, Motlicek P, Qian Y, Schwarz P, Silovsky J, Stemmer G, Vesely K (2011) The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. IEEE Catalog No. CFP11SRW-USB.
Pylkkönen J, Kurimo M (2004) Duration modeling techniques for continuous speech recognition. In INTERSPEECH.
Robinson A (2010) The British English Example Pronunciation (BEEP) dictionary.
Russell MJ, Cook A (1987) Experimental evaluation of duration modelling techniques for automatic speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '87), Vol. 12, pp. 2376–2379. IEEE.
Seong WK, Park JH, Kim HK (2012) Dysarthric speech recognition error correction using weighted finite state transducers based on context-dependent pronunciation variation. Springer.
CMU Sphinx (2011) Open source toolkit for speech recognition.
Tervo S, Pätynen J (2010) Tutorial and examples on Pure Data externals: Real-time audio signal processing and analysis. Department of Media Technology, Aalto University School of Science and Technology. http://www.tml.tkk.fi/tervos
Viikki O, Laurila K (1998) Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication 25:133–147.
Wester M (2003) Pronunciation modeling for ASR: knowledge-based and data-derived methods. Vol. 17, pp. 69–85. Elsevier.
Woodland P, Gales M, Pye D (1996) Improving environmental robustness in large vocabulary speech recognition. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), Vol. 1, pp. 65–68. IEEE.
Abstract
Developing a robust ASR system for children is a challenging task because of the increased variability in acoustic and linguistic correlates as a function of young age. The acoustic variability is mainly due to the developmental changes associated with vocal tract growth. On the linguistic side, the variability is associated with limited knowledge of vocabulary, pronunciation and other linguistic constructs.

This study aims to develop a more robust acoustic modeling framework through better front-end speech processing, pronunciation modeling, duration modeling and subword modeling for more accurate children's speech recognition. The thesis focuses on comprehensive tests of pre-existing acoustic modeling methods for children's speech recognition. Results are presented as a function of age to study its effect on performance, and multiple children's speech databases are used to account for inter-database effects. Pronunciation modeling is introduced for children's speech, and the results show promising improvements over the baseline. To capture pronunciation mistakes and stammering specific to children, a modified version of a pre-existing subword modeling scheme is used; the novel contribution to the subword modeling lies in the FST-based decoding of subwords. Motivated by the increased temporal variability found in children, duration modeling is applied to children's speech and its effectiveness is presented.