Speech Recognition Error Modeling for Robust Speech Processing and Natural Language Understanding Applications

by

Prashanth Gurunath Shivakumar

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

May 2021

Copyright 2021 Prashanth Gurunath Shivakumar

Dedication

To Amma and Papa.

Acknowledgements

First and foremost, I would like to express my gratitude to my advisors, Prof. Panayiotis Georgiou and Prof. Shrikanth Narayanan, for always being supportive during my Ph.D. and research, and for providing complete freedom for my research. With their immense knowledge, they have always directed my research with patience and motivation.

My sincere thanks also goes to my dissertation committee members, Prof. Keith Jenkins and Prof. Maja Mataric, for guiding me with insightful comments and profound questions.

I thank my fellow colleagues and lab-mates in the Signal Processing for Communication Understanding and Behavior Analysis (SCUBA) lab and the Signal Analysis and Interpretation Laboratory (SAIL) for stimulating discussions.

Finally, I would like to thank my family, my parents Mrs. M. N. Banashankari and Prof. G. K. Shivakumar, for always being supportive and motivating throughout my life.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Error Sources
    1.1.1 Input speech signal
    1.1.2 Machine Processing and Learning Limitations
    1.1.3 Limitations of human-evolved language encoding
  1.2 Our Contribution: Error Modeling

Chapter 2: Prior and Existing Work
  2.1 Automatic Speech Recognition
    2.1.1 Error Correction for ASR
  2.2 Natural Language Processing
  2.3 Spoken Language Understanding

Chapter 3: Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context Modeling
  3.1 Introduction
  3.2 Hypotheses
    3.2.1 Re-scoring Lattices
    3.2.2 Recovering Pruned Lattices
    3.2.3 Recovery of Unseen Phrases
    3.2.4 Better Recovery during Poor Recognitions
    3.2.5 Improvements under all Acoustic Conditions
    3.2.6 Adaptation
    3.2.7 Exploit Longer Context
    3.2.8 Regularization
  3.3 Methodology
    3.3.1 Previous related work
      3.3.1.1 Word-based Noisy Channel Modeling
      3.3.1.2 Phrase-based Noisy Channel Modeling
    3.3.2 Noisy-Clean Phrase Context Modeling
    3.3.3 Our Other Enhancements
      3.3.3.1 Neural Language Models
      3.3.3.2 Minimum Error Rate Training (MERT)
  3.4 Experimental Setup
    3.4.1 Database
    3.4.2 System Setup
      3.4.2.1 Automatic Speech Recognition System
      3.4.2.2 Pre-processing
      3.4.2.3 NCPCM
    3.4.3 Baseline Systems
    3.4.4 Evaluation Criteria
  3.5 Results and Discussion
    3.5.1 Re-scoring Lattices
    3.5.2 Recovering Pruned Lattices
    3.5.3 Recovery of Unseen Phrases
    3.5.4 Better Recovery during Poor Recognitions
    3.5.5 Improvements under all Acoustic Conditions
    3.5.6 Adaptation
    3.5.7 Exploit Longer Context
    3.5.8 Regularization
  3.6 Conclusions & Future Work

Chapter 4: Confusion2Vec: Towards Enriching Vector Space Word Representations with Representational Ambiguities
  4.1 Introduction
  4.2 Motivation
    4.2.1 Human speech production, perception and hearing
    4.2.2 Machine Learning Algorithms
  4.3 Case Study: Application to Automatic Speech Recognition
    4.3.1 Related Work
  4.4 Proposed Models
    4.4.1 Baseline Word2Vec Model
    4.4.2 Intra-Confusion Training
    4.4.3 Inter-Confusion Training
    4.4.4 Hybrid Intra-Inter Confusion Training
  4.5 Training Schemes
    4.5.1 Model Initialization/Pre-training
    4.5.2 Model Concatenation
    4.5.3 Joint Optimization
      4.5.3.1 Unrestricted
      4.5.3.2 Fixed Word2Vec
  4.6 Evaluation Methods
    4.6.1 Analogy Tasks
      4.6.1.1 Semantic&Syntactic Analogy Task
      4.6.1.2 Acoustic Analogy Task
      4.6.1.3 Semantic&Syntactic-Acoustic Analogy Task
    4.6.2 Similarity Ratings
      4.6.2.1 Word Similarity Ratings
      4.6.2.2 Acoustic Similarity Ratings
  4.7 Data & Experimental Setup
    4.7.1 Database
    4.7.2 Experimental Setup
      4.7.2.1 Automatic Speech Recognition
      4.7.2.2 Confusion2Vec
    4.7.3 Creation of Evaluation Datasets
      4.7.3.1 Acoustic Analogy Task
      4.7.3.2 Semantic&Syntactic-Acoustic Analogy Task
      4.7.3.3 Acoustic Similarity Task
    4.7.4 Performance Evaluation Criterion
  4.8 Results
    4.8.1 Baseline Word2Vec Model
    4.8.2 Intra-Confusion
    4.8.3 Inter-Confusion
    4.8.4 Hybrid Intra-Inter Confusion
    4.8.5 Model Initialization/Pre-training
    4.8.6 Model Concatenation
    4.8.7 Joint Optimization
      4.8.7.1 Fixed Word2Vec
      4.8.7.2 Unrestricted Optimization
    4.8.8 Results Summary
  4.9 Vector Space Analysis
    4.9.1 Semantic Relationships
    4.9.2 Syntactic Relationships
    4.9.3 Acoustic Relationships
  4.10 Discussion
  4.11 Potential Applications
  4.12 Conclusion
  4.13 Future Work

Chapter 5: Spoken Language Intent Detection using Confusion2Vec
  5.1 Introduction
  5.2 Proposed Technique
    5.2.1 Confusion2vec Word Embedding
    5.2.2 Intent Classification Model
  5.3 Database & Experimental Setup
    5.3.1 Database
    5.3.2 Experimental Setup
    5.3.3 Baseline Systems
  5.4 Results and Discussion
    5.4.1 Training on Reference Clean Transcripts
    5.4.2 Training on ASR
  5.5 Conclusion and Future Work

Chapter 6: Confusion2Vec 2.0: Enriching Ambiguity Representations with Subwords
  6.1 Introduction
  6.2 Confusion2Vec
  6.3 Confusion2Vec 2.0 subword model
    6.3.1 Intra-Confusion Model
    6.3.2 Inter-Confusion Model
    6.3.3 Training Loss and Objective
  6.4 Data and Experimental Setup
    6.4.1 Database
    6.4.2 Experimental Setup
      6.4.2.1 Automatic speech recognition
      6.4.2.2 Confusion2Vec 2.0
    6.4.3 Evaluation Metrics
  6.5 Results
    6.5.1 Model Concatenation
  6.6 Spoken Language Intent Detection with Confusion2Vec 2.0
    6.6.1 Intent classification
    6.6.2 Database and Experimental Setup
      6.6.2.1 Database
      6.6.2.2 Experimental Setup
      6.6.2.3 Baselines
    6.6.3 Results
  6.7 Conclusion
  6.8 Future Work

Conclusion
References
Appendix A: Confusion2Vec

List of Tables

3.1 Database split and statistics
3.2 Analysis of selected sentences
3.3 Noisy-Clean Phrase Context Model (NCPCM) results
3.4 Results for out-of-domain adaptation using Noisy-Clean Phrase Context Models (NCPCM)
3.5 Results for Noisy-Clean Phrase Context Models (NCPCM) with Neural Network Language Models (NNLM) and Neural Network Joint Models (NNJM)
4.1 Few examples from Acoustic Analogy Task test set
4.2 Few examples from Semantic & Syntactic - Acoustic Analogy Task test set
4.3 Examples of Acoustic Similarity Ratings
4.4 Statistics of evaluation datasets
4.5 Results: Different proposed models
4.6 Results with pre-training/initialization
4.7 Model concatenation and joint optimization results
4.8 Cosine similarity between the ASR ground-truth and ASR output in application to ASR error correction, for baseline pre-trained word2vec and the proposed Confusion2Vec
5.1 Results with training on reference: Classification Error Rates (CER) for reference and ASR transcripts
5.2 Results with training and testing on ASR transcripts
6.1 Results: Different proposed models. C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy. For the analogy tasks, the accuracies of baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [148]). For the similarity tasks, all the correlations (Spearman's) are statistically significant with p < 0.001.
6.2 Results: Different proposed models. C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy. For the analogy tasks, the accuracies of baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [148]). For the similarity tasks, all the correlations (Spearman's) are statistically significant with p < 0.001.
6.4 Results: Model trained and evaluated on ASR transcripts. C2V 1.0 corresponds to C2V-1 + C2V-c (JT) in Tables 6.1 and 6.2
A.1 Analogy Task results with Semantic & Syntactic splits: Different proposed models
A.2 Similarity Task results: Different proposed models
A.3 Analogy Task results with Semantic & Syntactic splits: Model pre-training/initialization
A.4 Similarity Task results: Model pre-training/initialization
A.5 Analogy Task results: Model concatenation and joint optimization
A.6 Similarity Task results: Model concatenation and joint optimization

List of Figures

3.1 Overview of NCPCM
3.2 Top-Good, Bottom-Bad WER splits
3.3 Length of ASR hypotheses vs. absolute WER change (NCPCM)
4.1 An example confusion network for the ground-truth utterance "I want to sit."
4.2 Baseline Word2Vec training scheme for confusion networks
4.3 Proposed Intra-Confusion training scheme for confusion networks
4.4 Proposed Inter-Confusion training scheme for confusion networks
4.5 Proposed Hybrid-Confusion training scheme for confusion networks
4.6 Flow charts for proposed training schemes
4.7 Confusion network examples
4.8 Computation of lattice feature vector
4.9 2D plot after PCA of word vector representation on baseline pre-trained word2vec. Demonstration of semantic relationships on randomly chosen pairs of countries and cities
4.10 2D plot after PCA of word vector representation on jointly optimized pre-trained word2vec + intra-confusion models. Demonstration of semantic relationships on randomly chosen pairs of countries and cities
4.11 2D plot after PCA of word vector representation on baseline pre-trained word2vec. Demonstration of syntactic relationships on randomly chosen 30 pairs of adjective-adverb, opposites, comparative, superlative, present-participle, past-tense, plurals
4.12 2D plot after PCA of word vector representation on jointly optimized pre-trained word2vec + intra-confusion models. Demonstration of syntactic relationships on randomly chosen 30 pairs of adjective-adverb, opposites, comparative, superlative, present-participle, past-tense, plurals
4.13 2D plot after PCA of word vector representation on baseline pre-trained word2vec. Demonstration of vector relationships on randomly chosen 20 pairs of acoustically similar sounding words
4.14 2D plot after PCA of word vector representation on jointly optimized pre-trained word2vec + intra-confusion models. Demonstration of vector relationships on randomly chosen 20 pairs of acoustically similar sounding words
5.1 2D vector space illustration after PCA dimension reduction for word2vec and Confusion2vec
5.2 Intent classification RNN model
5.3 Comparison of CER for different systems
6.1 Example confusion network output by ASR

Abstract

Automatic Speech Recognition (ASR) is gaining a lot of importance in everyday life. ASR has become a core component of human computer interaction. It is a key part of many applications involving virtual assistants, voice assistants, gaming, robotics, natural language understanding, education, communication-pronunciation tutoring, call routing, interactive media entertainment, etc.
The growth of such applications and their adoption in everyday scenarios points to ASR becoming a ubiquitous part of our daily life in the foreseeable near future. This has become possible partly due to the high performance achieved by state-of-the-art speech recognition systems. However, the errors resulting from ASR can often have a negative impact on downstream applications. In this work, we focus on modeling the errors of the ASR with the hypothesis that accurate modeling of such errors can be used to recover from the ASR errors and alleviate the negative consequences for its downstream applications.

We model the ASR as a phrase-based noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR and attempt to invert those. The proposed system can exploit long-term context and can re-introduce previously pruned or unseen phrases, in addition to better choosing between existing ASR output possibilities. We show that the system can provide improvements over a range of different ASR conditions without degrading any accurate transcription. We also show that the proposed system provides consistent improvements even on out-of-domain tasks, as well as over highly optimized ASR models re-scored by recurrent neural language models. Further, we propose a sequence-to-sequence neural network for modeling the ASR errors by incorporating much longer contextual information. We propose different optimal architectures and feature representations, in terms of subwords, and demonstrate improvements over the phrase-based noisy channel model.

Additionally, we propose a novel word vector representation, Confusion2Vec, motivated from human speech production and perception, that encodes representational ambiguity. The representational ambiguity of acoustics, which manifests itself in word confusions, is often resolved by both humans and machines through contextual cues. We present several techniques to train an acoustic perceptual-similarity representation of ambiguity, learned on unsupervised-generated data from ASR confusion networks or lattice-like structures. Appropriate evaluations are formulated for gauging acoustic similarity, in addition to semantic-syntactic and word similarity evaluations. Confusion2Vec is able to model word confusions efficiently without compromising the semantic-syntactic word relations, thus effectively enriching the word vector space with extra task-relevant ambiguity information. The proposed Confusion2Vec can also contribute and extend to a range of representational ambiguities that emerge in various domains beyond acoustic perception, such as morphological transformations, word segmentation and paraphrasing for natural language processing tasks like machine translation, and visual perceptual similarity for image processing tasks like image summarization, optical character recognition, etc.

This work also contributes towards efficient coupling of the ASR with various downstream algorithms operating on ASR outputs. We prove the efficacy of Confusion2Vec by proposing a recurrent neural network based spoken language intent detection system that achieves state-of-the-art results under noisy ASR conditions. We demonstrate through experiments and our proposed model that ASR often makes errors relating to acoustically similar words, and that Confusion2Vec, with its inherent model of acoustic relationships between words, is able to compensate for these errors. Improvements are also demonstrated when training the intent detection models on noisy ASR transcripts.
This work opens new opportunities for incorporating the Confusion2Vec embeddings into a whole range of full-fledged applications.

Further, we extend the previously proposed Confusion2Vec by encoding each word in the Confusion2Vec vector space by its constituent subword character n-grams. We show that the subword encoding helps better represent the acoustic perceptual ambiguities in human spoken language via information modeled on lattice-structured ASR output. The efficacy of the subword Confusion2Vec is evaluated using semantic, syntactic and acoustic analogy and word similarity tasks. We demonstrate the benefits of subword modeling for acoustic ambiguity representation on the task of spoken language intent detection. The results significantly outperform existing word vector representations, as well as the non-subword Confusion2Vec word embeddings, when evaluated on erroneous ASR outputs. We demonstrate that Confusion2Vec subword modeling eliminates the need for retraining/adapting the natural language understanding models on ASR transcripts.

Chapter 1

Introduction

Spoken communication is the most natural way of interaction for humans. This makes spoken language communication arguably the most preferred means of human computer interaction. Spoken natural language understanding (SLU) typically comprises two fundamental components: (i) speech processing & recognition, and (ii) natural language processing & understanding. However, errors can arise throughout the system due to various inconsistencies in human speech and language, operating environments, as well as the intricate interconnects of machine processing and learning algorithms. In this work, we first identify the sources of errors and attempt to address potential bottlenecks present in each of the stages of the spoken language understanding framework.
1.1 Error Sources

1.1.1 Input speech signal

Several errors are induced into SLU systems due to the complexity of speech signals. Challenges in speech signal modeling are largely attributed to the vast amount of variability present in the signal. The variability can be broadly categorized into three types: (i) acoustic variability, (ii) pronunciation variability, and (iii) language variability.

One of the primary sources of acoustic variability is the wide range of inter-speaker variability. Speaker variability in acoustics manifests in terms of speaker age, vocal tract structure, speech articulation and expression. For example, kids' spectral characteristics are found to be vastly different from those of adults [129, 94, 53]. Children's speech is associated with shifted spectral content and formant frequencies, and with high within-subject and inter-subject variabilities attributed to developmental changes in the vocal tract. Children's ASR was found to be 2 to 5 times worse than adults' [129]. Acoustic variability can also result from a speaker's health conditions. Speech disabilities including dysarthria, stroke, tongue cancer, etc., can have adverse effects on speech modeling. Moreover, paralinguistic phenomena like emotion and sentiment also pose challenges and induce additional errors. Other than speaker-related variability, speaker background and environment-induced acoustic variability is a major source of speech modeling errors. Varying amounts of noise present in the speaking environment can have complex interactions with speech signals, often resulting in heightened errors. The spectral characteristics of noise can generate varying error conditions. For example, spoken noise such as overlapped speech generates errors that are vastly different from the ones generated due to power line noise. Channel characteristics such as reverberation, and the inherent spectral signature of the speech recording/capturing devices, are an additional source of error resulting from speech signals.

Pronunciation variability refers to the differences in the phonological processes involved in pronunciation among different speakers, which is also a prime source of errors. Pronunciation variability manifests in terms of different dialects, accents, non-native speakers and the speaker's linguistic knowledge. Non-native speakers often project phonological processes and pronunciation rules from their native language onto the target non-native language. For example, native Arabic speakers often confuse the phoneme ih with eh, leading to potential confusion and errors between words such as "sit" and "set". Developing linguistic knowledge in children can result in highly varying pronunciations, resulting in increased errors during speech recognition.

The use of language can vary from person to person depending on the speaker's nativity, origin and general linguistic knowledge. New learners can induce errors resulting from the mismatch between the speaker's language constructs and the statistical language models. Developmental stages in linguistic knowledge, especially found in children, can pose serious challenges in speech modeling. On the other hand, an extensive speaker vocabulary can also prove challenging.

Addressing these errors has been the main focus of researchers in the ASR community. Some of the existing ASR technologies, and the research in acoustic, pronunciation and language modeling, are presented in Section 2.1. Prior work in error correction, and its effectiveness in terms of error reduction, is discussed in Section 2.1.1.

1.1.2 Machine Processing and Learning Limitations

Another source of errors is the practical limitations imposed by the machine processing and learning algorithms.
Two main sources of limitations are computational complexity and memory constraints. For instance, in an ASR, a decoding beam is adopted to prevent memory explosion during the generation of decoding graphs. The implication is that the ASR output can itself be non-optimal due to the potential dropping of a better hypothesis during lattice pruning. Moreover, ASR systems often make unrecoverable errors due to subsystem pruning (acoustic, language and pronunciation models), for example pruning words based on acoustics prior to re-scoring with the pronunciation and language models. This can lead to the aggregation of errors through each module. Further, the three modules of the ASR (acoustic, language and pronunciation) typically operate with a local view on varying contextual information. Acoustic models typically make decisions using short-term context, prior to re-scoring with longer-term context based on pronunciation and language. The varying context can induce unrecoverable errors; for instance, sub-optimal decisions based on short-term context may not be recovered at a later stage. Finally, the ASR and the NLU operate fairly independently of each other, with bottlenecks associated with the flow of information between the two modules and potential for more errors.

1.1.3 Limitations of human-evolved language encoding

Human language is complex because of the vast information encoded and certain ambiguities associated with it. One such ambiguity in human spoken language is due to the lack of correlation between language semantics and acoustics. For example, words such as right & write and see & sea sound identical but can have different meanings associated with them. On the contrary, words such as blue & cyan and king & queen are semantically close but have vastly different acoustic characteristics.
For such reasons, owing to this lack of correlation between semantics and acoustics, human language encoding is not optimal. The non-optimality is a potential source of errors during spoken language processing and understanding. Prior and existing work on the representation of human language for machine processing and learning is discussed in Section 2.2. Some of the challenges and effects of non-optimal human language encoding in application to the task of spoken language intent detection are presented in Section 2.3.

1.2 Our Contribution: Error Modeling

Our study focuses on building systems that address the challenges in alleviating the effects of errors under each of the three categories: (i) input speech signal induced errors, (ii) machine processing and learning limitation induced errors, and (iii) errors induced due to limitations in human-evolved language encoding.

In this work we model the ASR as a phrase-based noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR and attempt to invert those. The proposed system can not only recover speech signal induced errors (discussed in Section 1.1.1) but also overcome the limitations imposed by machine processing and learning algorithms (discussed in Section 1.1.2). Our approach is elaborated and presented in Chapter 3.

On the aspect of human language encoding, in this work we propose a novel word vector representation, Confusion2Vec, motivated by human speech production and perception, that encodes representational ambiguity. Humans employ both acoustic similarity cues and contextual cues to decode information, and we focus on a model that incorporates both sources of information.
We present several techniques to train an acoustic perceptual similarity representation of ambiguity, learning on unsupervised-generated data from Automatic Speech Recognition confusion networks or lattice-like structures. The proposed language encoding, Confusion2Vec, is presented in Chapter 4.

Next, we demonstrate the superiority of the newly proposed human language encoding on the task of spoken language intent detection under noisy conditions imposed by automatic speech recognition (ASR) systems. We demonstrate the capabilities of the proposed language encoding to compensate for the errors made by ASR and to increase the robustness of the SLU system. We hypothesize that ASR often makes errors relating to acoustically similar words, and that Confusion2Vec, with its inherent model of acoustic relationships between words, is able to compensate for these errors. The study is presented in Chapter 5.

Further enhancements to Confusion2Vec are explored by encoding each word in the Confusion2Vec vector space by its constituent subword character n-grams. We show that the subword encoding helps better represent the acoustic perceptual ambiguities as well as capture language semantics and syntax, evaluating with semantic, syntactic and acoustic analogy and word similarity tasks. We demonstrate the benefits of subword modeling for acoustic ambiguity representation in application to spoken language intent detection operating on the speech recognition output. The subword modeling for Confusion2Vec and its efficacy towards spoken language intent detection are presented in Chapter 6.

Chapter 2
Prior and Existing Work

2.1 Automatic Speech Recognition

Due to the complexity of human language and the quality of speech signals, improving the performance of automatic speech recognition (ASR) is still a challenging task.
The traditional ASR comprises three conceptually distinct modules: acoustic modeling, dictionary and language modeling. The three modules are fairly independent of each other in research and operation.

In terms of acoustic modeling, Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM) systems [134, 133] were a standard for ASR for a long time and are still used in some current ASR systems. Lately, advances in Deep Neural Networks (DNN) led to the advent of Deep Belief Networks (DBN) and hybrid DNN-HMM systems [71, 34], which essentially replaced the GMM with a DNN and employed an HMM for alignments. Deep Recurrent Neural Networks (RNN), particularly Long Short-Term Memory (LSTM) networks, replaced the traditional DNN and DBN systems [60]. Connectionist Temporal Classification (CTC) [59] proved to be effective with the ability to compute the alignments implicitly under the DNN architecture, thereby eliminating the need for GMM-HMM systems for computing alignments.

The research efforts for developing efficient dictionaries or lexicons have been mainly in terms of pronunciation modeling. Pronunciation modeling was introduced to handle intra-speaker variations [160, 176], non-native accent variations [160, 176], speaking rate variations found in conversational speech [176] and the increased pronunciation variations found in children's speech [150]. Various linguistic knowledge and data-derived phonological rules were incorporated to augment the lexicon.

Research efforts in language modeling are shared with the Natural Language Processing (NLP) community. By estimating the distribution of words, statistical language modeling (SLM), such as n-gram, decision tree models [8] and linguistically motivated models [117], amounts to calculating the probability distribution of different linguistic units, such as words, phrases [88], sentences, and whole documents [137].
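As a concrete illustration of the statistical language modeling mentioned above, the sketch below estimates a bigram model with maximum-likelihood counts. The toy corpus and probabilities are invented for illustration; real SLMs are trained on large corpora and use smoothing for unseen n-grams.

```python
from collections import Counter

# Minimal bigram language model with maximum-likelihood estimates.
# Toy corpus (an illustrative assumption, not from the thesis data).
corpus = "i was born in iraq i was born in december".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# "was" is always followed by "born" in this corpus, so P(born | was) = 1.
print(p_bigram("born", "was"))  # -> 1.0
print(p_bigram("iraq", "in"))   # -> 0.5
```

Note that without smoothing, any bigram unseen in training gets probability zero, which is one reason practical SLMs back off to lower-order statistics.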
Recently, Deep Neural Network based language models [5, 112, 163] have also shown success in terms of both perplexity and word error rate.

Very recently, state-of-the-art ASR systems have employed end-to-end neural network models, such as sequence-to-sequence [165] in an encoder-decoder architecture. These systems are trained end-to-end from acoustic features as input to predict phonemes or characters [7, 23]. Such systems can be viewed as an integration of the acoustic and lexicon pronunciation models. The state-of-the-art performance can be attributed to the joint training (optimization) of the acoustic and lexicon models (end-to-end), enabling them to overcome the shortcomings of the former independently trained models.

2.1.1 Error Correction for ASR

Several research efforts have been carried out for error correction using post-processing techniques. Much of the effort involves user input used as a feedback mechanism to learn the error patterns [2, 121]. Other work employs multi-modal signals to correct the ASR errors [162, 121]. Word co-occurrence information based error correction systems have proven quite successful [142]. In [135], a word-based error correction technique was proposed. The technique demonstrated the ability to model the ASR as a noisy channel. In [77], a similar technique was applied to a syllable-to-syllable channel model along with maximum entropy based language modeling. In [39], a phrase-based machine translation system was used to adapt a generic ASR to a domain-specific grammar and vocabulary. The system, trained on words and phonemes, was used to re-rank the n-best hypotheses of the ASR. In [33], a phrase-based machine translation system was used to adapt the models to domain-specific data obtained from manual user-corrected transcriptions.
In [167], an RNN was trained on various text-based features to exploit long-term context for error correction. Confusion networks from the ASR have also been used for error correction. In [191], a bi-directional LSTM based language model was used to re-score the confusion network. In [118], a two-step process for error correction was proposed in which words in the confusion network are re-ranked: errors present in the confusion network are detected by conditional random fields (CRFs) trained on n-gram features, and subsequently long-distance context scores are used to model the long contextual information and re-rank the words in the confusion network. [21, 52] also make use of confusion networks along with semantic similarity information for training CRFs for error correction.

2.2 Natural Language Processing

Decoding human language is challenging for machines. It involves the estimation of efficient, meaningful representations of words. Machines represent words in the form of real vectors and the language as a vector space. Vector space representations of language have applications spanning the natural language processing (NLP) and human-computer interaction (HCI) fields. More specifically, word embeddings can act as features for machine translation, automatic speech recognition, document topic classification, information retrieval, sentiment classification, emotion recognition, behavior recognition, question answering, etc.

Early work employed words as the fundamental unit of feature representation. This could be thought of as each word representing an orthogonal vector in the n-dimensional vector space of a language with n words (often referred to as a one-hot representation). Such a representation, due to the inherent orthogonality, lacks crucial information regarding inter-word relationships such as similarity.
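The orthogonality of the one-hot representation described above can be made concrete with a few lines of code: every distinct word pair has dot product zero, so the representation carries no notion of similarity. The four-word vocabulary is an illustrative assumption.

```python
import numpy as np

# One-hot representation: each word in an n-word vocabulary is a basis vector.
vocab = ["right", "write", "blue", "cyan"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Every distinct pair is orthogonal, so the representation encodes no
# similarity: "blue"/"cyan" are exactly as far apart as "blue"/"write".
dot_bc = one_hot["blue"] @ one_hot["cyan"]    # 0.0
dot_bw = one_hot["blue"] @ one_hot["write"]   # 0.0
print(dot_bc, dot_bw)
```

This is precisely the limitation that the distributed representations discussed next are designed to overcome.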
Several techniques found that using co-occurrence information of words yields a better feature representation (e.g., n-gram language modeling).

Subsequent studies introduced matrix factorization based techniques to estimate a more efficient, reduced-dimensional vector space based on word co-occurrence information. Latent Semantic Analysis (LSA) assumes an underlying vector space spanned by an orthogonal set of latent variables closely associated with the semantics/meanings of the particular language. The dimension of this vector space is much smaller than that of the one-hot representation [35]. LSA was proposed initially for information retrieval and indexing, but soon gained popularity for other NLP tasks. [73] proposed Probabilistic LSA, replacing the co-occurrence information with a statistical class based model, leading to better vector space representations. Another popular matrix factorization method, Latent Dirichlet Allocation (LDA), assumes a generative statistical model where the documents are characterized as a mixture of latent variables representing topics, which are described by word distributions [16].

Recently, neural networks have gained popularity. They often outperform n-gram models [11, 112] and enable the estimation of more complex models incorporating much larger data than before. Various neural network based vector space estimations of words were proposed. [11] proposed feed-forward neural network based language models which jointly learned the distributed word representation along with the probability distribution associated with the representation. Estimating a reduced-dimension continuous word representation allows for efficient probability modeling, thereby resulting in much lower perplexity compared to an n-gram model.
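The matrix factorization view above (LSA) can be sketched with a truncated SVD of a toy term-document matrix: terms that co-occur in similar documents land close together in the reduced latent space. The matrix values and vocabulary are invented for illustration; real LSA operates on large corpora, typically with TF-IDF weighting.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
# (Illustrative values only, not from any real corpus.)
X = np.array([
    [2, 0, 1, 0],   # "speech"
    [1, 0, 2, 0],   # "recognition"
    [0, 3, 0, 1],   # "market"
    [0, 1, 0, 2],   # "stock"
], dtype=float)

# LSA: truncated SVD keeps the top-k latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]   # reduced-dimensional term representations

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Terms sharing document contexts end up close in the latent space.
sim_related = cos(term_vectors[0], term_vectors[1])    # "speech" vs "recognition"
sim_unrelated = cos(term_vectors[0], term_vectors[2])  # "speech" vs "market"
print(sim_related, sim_unrelated)
```

Here "speech" and "recognition" occur in the same documents and come out nearly collinear in the latent space, while "speech" and "market" remain nearly orthogonal.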
Recurrent neural network based language models, with their inherent memory, allowed for the exploitation of much longer context, providing further improvements compared to feed-forward neural networks [112].

[113] proposed a new technique for estimating vector representations (popularly termed word2vec) which showed promising results in preserving the semantic and syntactic relationships between words. Two novel architectures based on simple log-linear modeling are introduced: (i) continuous skip-gram and (ii) continuous bag-of-words. Both models are trained to model the local context of word occurrences. The continuous skip-gram model predicts the surrounding words given the current word, whereas the continuous bag-of-words model predicts the current word given its context. The task evaluation is based on answering various analogy questions testing semantic and syntactic word relationships. Several training optimizations and tips were proposed to further improve the estimation of the vector space [115, 116]. Such efficient representations of words directly influence the performance of NLP tasks like sentiment classification [83], part-of-speech tagging [99], text classification [98, 79], document categorization [180] and many more.

Subsequent research efforts on extending word2vec involve expanding the word representation to phrases [115], sentences and documents [93]. Similarly, training on contexts derived from the syntactic dependencies of a word is shown to produce useful representations [96]. Using morphemes for word representations can enrich the vector space and provide gains, especially for unknown, rarely occurring, complex words and for morphologically rich languages [104, 18, 131, 32, 155]. Likewise, incorporating subword representations of words in the estimation of the vector space is beneficial [17].
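The subword idea above [17] can be sketched as follows: a word vector is the sum of vectors for its character n-grams, so morphologically related or unseen words still receive meaningful representations through shared subwords. The class, dimensions and random initialization below are illustrative assumptions (in practice the n-gram vectors are learned, e.g., with a skip-gram objective).

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=4):
    """Extract character n-grams, with '<' and '>' marking word boundaries."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

class SubwordVectors:
    """Hypothetical subword-vector store; real systems learn these vectors."""

    def __init__(self, dim=50, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.table = {}  # n-gram -> vector

    def vec(self, gram):
        if gram not in self.table:
            self.table[gram] = self.rng.standard_normal(self.dim)
        return self.table[gram]

    def word_vector(self, word):
        # Word representation = sum of its character n-gram vectors,
        # so even unseen words get a vector from shared subwords.
        return np.sum([self.vec(g) for g in char_ngrams(word)], axis=0)

sv = SubwordVectors()
print(char_ngrams("write"))  # shares n-grams such as 'rit' with "writes"
```

Because "write" and "writes" share most of their n-grams, their summed vectors overlap heavily, which is the mechanism behind the gains for rare and morphologically complex words noted above.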
Similar studies using the characters of words have also been tried [26]. [187] explored ensemble techniques for exploiting complementary information over multiple word vector spaces. Studies by [114, 47] demonstrate that vector space representations are extremely useful in extending a model from one language to another (or in multilingual extensions), since the semantic relations between words are invariant across languages. Some have tried to combine the advantages of both matrix factorization based techniques and local-context word2vec models: [127] proposes a global log-bilinear model capturing global statistical information, as in global matrix factorization techniques, along with the local context information, as in word2vec.

2.3 Spoken Language Understanding

Spoken Language Understanding (SLU) systems aim at extracting semantic information from human spoken utterances. Such systems play a significant role in practical applications like personal AI voice assistants (e.g., Alexa, Siri, etc.), phone-call routing, booking systems and so on. An SLU system is typically modeled as two separate components: an ASR front-end, which translates the acoustic signal into text, followed by a Natural Language Understanding (NLU) module that performs inference for downstream tasks. Typical tasks include domain classification, intent detection and slot filling.

In this work, we focus on an SLU system that performs intent detection, the task of identifying a speaker's intent from speech. This task is usually treated as an utterance classification problem [186]. In light of the success of deep learning techniques, applying deep neural networks to intent detection has been shown to be effective, often outperforming conventional classifiers such as support vector machines [63].
In recent years, the NLU community has applied various techniques to improve intent detection performance on manual transcripts. [183, 61, 189] jointly model intent detection with slot filling, simplifying the NLU task with a unified model. [65, 84] extend the joint modeling with domain knowledge, which enables information from multiple tasks to benefit the individual tasks and allows the NLU model to be applied to multiple-domain tasks. Going one step further, [101, 188] adapt a domain-specific language model (LM) while performing intent detection and slot filling, improving the performance on both the LM and the language understanding task. [100] explores strategies for jointly modeling intent classification and slot filling using explicit alignment information provided by slot filling in an attention-based encoder-decoder structure. Building on the attention-based model, [58] connects context information from intent detection with slot filling using a gate mechanism. [97] employs a similar intent-augmented gating mechanism to guide the learning of the slot filling task; it further incorporates character-level embeddings along with word-level embeddings, achieving state-of-the-art results in intent detection.

However, suffering from ASR front-end errors such as mis-recognized words, insertions and deletions, the performance of such systems degrades significantly, as shown in [69, 36, 111], and this remains the bottleneck in SLU systems. On one hand, in order to make systems more ASR-robust, ASR hypotheses can be incorporated into the model's training corpus. [90, 171, 107] exploit word confusion networks to efficiently connect NLU models with ASR hypotheses. [154] simulated ASR errors by randomly substituting words with their linguistically and acoustically similar candidates. On the other hand, there have been works that aim to jointly perform NLU tasks and ASR error adaptation.
[146, 192] employ a Recurrent Neural Network (RNN) based encoder-decoder structure to reconstruct correct utterances from ASR hypotheses while performing intent detection and slot filling. [158] builds richer feature representations by adding acoustic pitch accent flags to the word embeddings.

Chapter 3
Learning from Past Mistakes: Improving Automatic Speech Recognition output via Noisy-Clean Phrase Context Modeling

3.1 Introduction

Several research efforts have been carried out for error correction using post-processing techniques. Much of the effort involves user input used as a feedback mechanism to learn the error patterns [2, 121]. Other work employs multi-modal signals to correct the ASR errors [162, 121]. Word co-occurrence information based error correction systems have proven quite successful [142]. In [135], a word-based error correction technique was proposed. The technique demonstrated the ability to model the ASR as a noisy channel. In [77], a similar technique was applied to a syllable-to-syllable channel model along with maximum entropy based language modeling. In [39], a phrase-based machine translation system was used to adapt a generic ASR to a domain-specific grammar and vocabulary. The system, trained on words and phonemes, was used to re-rank the n-best hypotheses of the ASR. In [33], a phrase-based machine translation system was used to adapt the models to domain-specific data obtained from manual user-corrected transcriptions. In [167], an RNN was trained on various text-based features to exploit long-term context for error correction. Confusion networks from the ASR have also been used for error correction. In [191], a bi-directional LSTM based language model was used to re-score the confusion network. In [118], a two-step process for error correction was proposed in which words in the confusion network are re-ranked.
Errors present in the confusion network are detected by conditional random fields (CRFs) trained on n-gram features, and subsequently long-distance context scores are used to model the long contextual information and re-rank the words in the confusion network. [21, 52] also make use of confusion networks along with semantic similarity information for training CRFs for error correction.

The scope of this chapter is to evaluate whether subsequent transcription corrections can take place on top of a highly optimized ASR. We hypothesize that our system can correct the errors by (i) re-scoring lattices, (ii) recovering pruned lattices, (iii) recovering unseen phrases, (iv) providing better recovery during poor recognitions, (v) providing improvements under all acoustic conditions, (vi) handling mismatched train-test conditions, (vii) exploiting longer contextual information and (viii) text regularization.

We aim to satisfy the above hypotheses by proposing a Noisy-Clean Phrase Context Model (NCPCM). We introduce context of past errors of an ASR system, considering all the automated system's noisy transformations. These errors may come from any of the ASR modules or even from the noise characteristics of the signal. Using these errors we learn a noisy channel model, and apply it for error correction of the ASR output.

Compared to the above efforts, our work differs in the following aspects:

Error corrections take place on the output of a state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) system trained on matched data. This differs from adapting to constrained domains (e.g., [33, 39]) that exploit domain mismatch. This provides additional challenges both due to the larger error-correcting space (spanning a larger vocabulary) and the already highly optimized ASR output.
We evaluate on a standard LVCSR task, thus establishing the effectiveness, reproducibility and generalizability of the proposed correction system. This differs from past work where speech recognition was on a large-vocabulary task but subsequent error corrections were evaluated on a much smaller vocabulary.

We analyze and evaluate multiple types of error corrections (including but not restricted to out-of-vocabulary (OOV) words). Most prior work is directed towards the recovery of OOV words.

In addition to evaluating a large-vocabulary correction system on in-domain data (Fisher, 42k words), we evaluate on an out-of-domain, larger vocabulary task (TED-LIUM, 150k words), thus assessing the effectiveness of our system in challenging scenarios. In this case the adaptation is to an even bigger vocabulary, a much more challenging task compared to past work that only considered adaptation from large to small vocabulary tasks.

We employ multiple hypotheses of the ASR to train our noisy channel model.

We employ state-of-the-art neural network based language models under the noisy-channel modeling framework, which enable the exploitation of longer context.

Additionally, our proposed system comes with several advantages: (1) the system could potentially be trained without an ASR by creating a phonetic model of corruption and emulating an ASR decoder on generic text corpora; (2) the system can rapidly adapt to new linguistic patterns, e.g., it can adapt to words unseen during training via contextual transformations of erroneous LVCSR outputs.

Further, our work is different from discriminative training of acoustic models [177] and discriminative language models (DLM) [136], which are trained directly to optimize the word error rate using the reference transcripts.
DLMs in particular involve optimizing (tuning) the weights of the language model with respect to the reference transcripts, and are often utilized in re-ranking n-best ASR hypotheses [136, 141, 184, 15, 22]. The main distinction and advantage of our method is that the NCPCM can potentially re-introduce unseen or pruned-out phrases. Our method can also operate when there is no access to lattices or n-best lists. The NCPCM can also operate on the output of a DLM system.

The rest of the paper is organized as follows: Section 3.2 presents various hypotheses and discusses the different types of errors we expect to model. Section 3.3 elaborates on the proposed technique, and Section 3.4 describes the experimental setup and the databases employed in this work. Results and discussion are presented in Section 3.5, and we finally conclude and present future research directions in Section 6.7.

3.2 Hypotheses

In this section we analytically present cases that we hypothesize the proposed system could help with. In all of these, the errors of the ASR may stem from realistic constraints of the decoding system and pruning structure, while the proposed system could exploit very long context to improve the ASR output. Note that the vocabulary of an ASR doesn't always match that of the error correction system. Let's consider, for example, an ASR that does not have lexicon entries for Prashanth or Shivakumar but has the entries Shiva and Kumar. Let's also assume that this ASR consistently makes the error Pression when it hears Prashanth. Given training data for the NCPCM, it will learn to transform Pression Shiva Kumar into Prashanth Shivakumar; thus it will have a larger vocabulary than the ASR and learn to recover such errors. This demonstrates the ability to learn out-of-vocabulary entries and to rapidly adapt to new domains.

3.2.1 Re-scoring Lattices

1.
I was born in nineteen ninety three in Iraq
2. I was born in nineteen ninety three in eye rack
3. I was born in nineteen ninety three in I rack

Phonetic transcription: ay . w aa z . b ao r n . ih n . n ay n t iy n . n ay n t iy . th r iy . ih n . ay . r ae k

Example 1

In Example 1, all three samples have the same phonetic transcription. Let us assume sample 1 is the correct transcription. Since all three examples have the same phonetic transcription, they are indistinguishable by the acoustic model. The language model is likely to down-score sample 3. It is possible that sample 2 will score higher than sample 1 under a short-context LM (e.g., a bi-gram or 3-gram), i.e., in might be followed by eye more frequently than by Iraq in the training corpora. This will likely result in an ASR error. Thus, although the oracle WER can be zero, the output WER is likely going to be higher due to LM choices.

Hypothesis A: An ideal error correction system can select correct options from the existing lattice.

3.2.2 Recovering Pruned Lattices

A more severe case of Example 1 would be that the word Iraq was pruned out of the output lattice during decoding. This is often the case when there are memory and complexity constraints in decoding with large acoustic and language models, where the decoding beam is a restricting parameter. In such cases, the word never ends up in the output lattice. Since the ASR is constrained to pick among the only existing possible paths through the decoding lattice, an error is inevitable in the final output.

Hypothesis B: An ideal error correction system can generate words or phrases that were erroneously pruned during the decoding process.
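The pruning effect described above can be sketched with a toy beam search: a hypothesis whose prefix scores poorly is discarded early, even though its full path would ultimately score best. All scores and words below are invented for illustration; real decoders operate on weighted lattices, not two-step toys.

```python
import heapq

# Toy two-step decoding problem. Each step returns candidate words with
# log-probability scores conditioned on the previous word (a bigram-like
# dependency). Numbers are illustrative assumptions.

def step1(prev):
    # "eye" looks locally better than "Iraq" at this step.
    return {"eye": -0.5, "Iraq": -1.2}

def step2(prev):
    # Only "Iraq" has a strong continuation; "eye" must pair with "rack".
    return {"</s>": -0.1} if prev == "Iraq" else {"rack": -2.5}

def beam_decode(steps, beam_width):
    """Keep only the top `beam_width` partial hypotheses at each step."""
    beams = [(0.0, [])]  # (total logprob, word sequence)
    for step in steps:
        candidates = []
        for logp, seq in beams:
            prev = seq[-1] if seq else "<s>"
            for word, s in step(prev).items():
                candidates.append((logp + s, seq + [word]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda b: b[0])[1]

narrow = beam_decode([step1, step2], beam_width=1)  # "Iraq" pruned at step 1
wide = beam_decode([step1, step2], beam_width=2)    # wider beam recovers it
print(narrow, wide)
```

With beam width 1, the globally best path ("Iraq </s>", total −1.3) is lost because its prefix scored worse than "eye" (−0.5); with beam width 2 it survives. An error correction system operating on the final output is one way to recover from exactly this kind of pruning.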
3.2.3 Recovery of Unseen Phrases

On the other hand, an extreme case of Example 1 would be that the word Iraq was never seen in the training data (i.e., is out-of-vocabulary), thereby not appearing in the ASR lattice. This would mean the ASR is forced to select among the other hypotheses even with a low confidence (or output an unknown, <unk>, symbol), resulting in a similar error as before. This is often the case due to the constant evolution of human language, or in the case of a new domain. For example, names such as Al Qaeda or ISIS were non-existent in our vocabularies a few years ago.

Hypothesis C: An ideal error correction system can generate words or phrases that are out of vocabulary (OOV) and thus not in the ASR output.

3.2.4 Better Recovery during Poor Recognitions

An ideal error correction system would provide more improvements for poor recognitions from an ASR. Such a system could potentially offset the ASR's low performance, providing consistent performance over varying audio and recognition conditions. In real-life conditions, the ASR often has to deal with varying levels of mismatched train-test conditions, where relatively poor recognition results are commonplace.

Hypothesis D: An ideal error correction system can provide more corrections when the ASR performs poorly, thereby offsetting the ASR's performance drop (e.g., during mismatched train-test conditions).

3.2.5 Improvements under all Acoustic Conditions

An error correction system which performs well during tough recognition conditions, as per Hypothesis 3.2.4, is no good if it degrades good recognizer output. Thus, in addition to Hypothesis 3.2.4, an ideal system would cause no degradation on good ASR output. Such a system can be hypothesized to consistently improve upon and provide benefits over any ASR system, including state-of-the-art recognition systems.
An ideal system would provide improvements over the entire spectrum of ASR performance (WER).

Hypothesis E: An ideal error correction system can not only provide improvements during poor recognitions, but also preserve good speech recognition.

3.2.6 Adaptation

We hypothesize that the proposed system would help in adaptation over mismatched conditions. The mismatch could manifest in terms of acoustic conditions and lexical constructs. The adaptation can be seen as a consequence of Hypotheses 3.2.4 & 3.2.5. In addition, the proposed model is capable of capturing patterns of language use manifesting in specific speaker(s) and domain(s). Such a system could eliminate the need to retrain the ASR model for mismatched environments.

Hypothesis F: An ideal error correction system can aid in mismatched train-test conditions.

3.2.7 Exploit Longer Context

Eyes melted, when he placed his hand on her shoulders.
Ice melted, when he placed it on the table.

Example 2

The complex construct of human language and understanding enables recovery of lost or corrupted information over different temporal resolutions. For instance, in Example 2, both of the phrases Eyes melted, when he placed and Ice melted, when he placed are valid when viewed within their shorter context and have identical phonetic transcriptions. The succeeding phrases, underlined, help in discerning whether the first word is Eyes or Ice. We hypothesize that an error correction model capable of utilizing such longer contexts is beneficial. As new models for phrase based mapping, such as sequence-to-sequence models [165], become applicable, this becomes even more possible and desirable.

Hypothesis G: An ideal error correction system can exploit longer context than the ASR for better corrections.

3.2.8 Regularization

1. I guess 'cause I went on a I went on a ...
   I guess because I went on a I went on a ...
2. i was born in nineteen ninety two
   i was born in 1992
3. i was born on nineteen twelve
   i was born on 19/12

Example 3

As per the 3 cases shown in Example 3, although both the hypotheses for each of them are correct, there are some irregularities present in the language syntax. Normalization of such surface-form representations can increase the readability and usability of the output. Unlike traditional ASR, where there is a need to explicitly program such regularizations, our system is expected to learn, given appropriate training data, and incorporate regularization into the model.

Hypothesis H: An ideal error correction system can be deployed as an automated text regularizer.

3.3 Methodology

The overview of the proposed model is shown in Figure 3.1. In our paper, the ASR is viewed as a noisy channel (with transfer function $H$), and we learn a model of this channel, $\hat{H}^{-1}$ (an estimate of the inverse transfer function $H^{-1}$), by using the corrupted ASR outputs (equivalent to a signal corrupted by $H$) and their reference transcripts. Later on, we use this model to correct the errors of the ASR. Noisy channel modeling can mainly be divided into word-based and phrase-based channel modeling. We will first introduce previous related work, and then our proposed NCPCM.

3.3.1 Previous related work

3.3.1.1 Word-based Noisy Channel Modeling

In [135], the authors adopt a word-based noisy channel model borrowing ideas from word-based statistical machine translation developed by IBM [19]. It is used as a post-processor module to correct the mistakes made by the ASR.

[Figure 3.1: Overview of NCPCM. The ASR (dictionary, acoustic and language models) acts as the noisy channel H, producing a corrupted signal (1-best or k-best Viterbi paths through the lattice) from the audio input; the noisy-clean phrase context and transition models, together with a language model (ARPA/neural), are trained against the reference signal to estimate Ĥ⁻¹, which yields the corrected ASR output at evaluation time.]
The word-based noisy channel modeling can be represented as:

$$\hat{W} = \arg\max_{W_{clean}} P(W_{clean} \mid W_{noisy}) = \arg\max_{W_{clean}} P(W_{noisy} \mid W_{clean})\, P_{LM}(W_{clean})$$

where $\hat{W}$ is the corrected output word sequence, $P(W_{clean} \mid W_{noisy})$ is the posterior probability, $P(W_{noisy} \mid W_{clean})$ is the channel model, and $P_{LM}(W_{clean})$ is the language model. In [135], the authors hypothesized that introducing many-to-one and one-to-many word-based channel modeling (referred to as a fertility model) could be more effective, but it was not implemented in their work.

3.3.1.2 Phrase-based Noisy Channel Modeling

Phrase-based systems were introduced in application to phrase-based statistical translation [85] and were shown to be superior to word-based systems. Phrase-based transformations are similar to word-based models, with the exception that the fundamental unit of observation and transformation is a phrase (one or more words). They can be viewed as a super-set of the word-based [19] and the fertility [135] modeling systems.

3.3.2 Noisy-Clean Phrase Context Modeling

We extend these ideas by proposing a complete phrase-based channel modeling for error correction which incorporates many-to-one, one-to-many, and many-to-many word (phrase) channel modeling. This also allows the model to better capture errors of varying resolutions made by the ASR. As an extension, it uses distortion modeling to capture any re-ordering of phrases during error correction. Even though we do not expect big benefits from the distortion model (i.e., the order of the ASR output is usually in agreement with the audio representation), we include it in our study for examination. It also uses a word penalty to control the length of the output.
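To make the channel modeling concrete, the following is a minimal sketch of a monotone (no distortion) phrase-based corrector: it segments the noisy ASR output into phrases, substitutes clean candidates from a phrase table, and scores each segmentation with the channel probability, a language model, and a word penalty. This is an illustrative toy, not the Moses-based implementation used in this work; the phrase table and unigram LM probabilities are invented for illustration.

```python
import math

# Toy phrase table: noisy phrase -> [(clean phrase, P(noisy|clean))].
# Entries are invented for illustration.
PHRASE_TABLE = {
    ("ridiculous", "lee"): [(("ridiculously",), 0.6)],
    ("ridiculous",): [(("ridiculous",), 0.9)],
    ("lee",): [(("lee",), 0.8)],
    ("cold",): [(("cold",), 0.95)],
}

# Toy unigram LM (a real system would use an n-gram or neural LM).
LM = {"ridiculously": 0.02, "ridiculous": 0.01, "lee": 0.001, "cold": 0.03}

WORD_PENALTY = -0.5   # log-domain penalty per output word
MAX_PHRASE_LEN = 2

def lm_logprob(words):
    return sum(math.log(LM.get(w, 1e-6)) for w in words)

def correct(noisy):
    """Monotone phrase-based decoding via dynamic programming:
    best[j] holds the best (score, output) covering noisy[:j]."""
    n = len(noisy)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, ())
    for i in range(n):
        score_i, out_i = best[i]
        if out_i is None:
            continue
        for j in range(i + 1, min(i + MAX_PHRASE_LEN, n) + 1):
            src = tuple(noisy[i:j])
            # Unknown phrases pass through with a small channel probability.
            for clean, channel_p in PHRASE_TABLE.get(src, [(src, 0.1)]):
                s = (score_i + math.log(channel_p)
                     + lm_logprob(clean) + WORD_PENALTY * len(clean))
                if s > best[j][0]:
                    best[j] = (s, out_i + clean)
    return list(best[n][1])

print(correct(["ridiculous", "lee", "cold"]))  # -> ['ridiculously', 'cold']
```

Note how the many-to-one mapping ("ridiculous lee" → "ridiculously") outscores the word-by-word pass-through, which is exactly the behavior word-based models cannot express.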
The phrase-based noisy channel modeling can be represented as:

$$\hat{p} = \arg\max_{p_{clean}} P(p_{clean} \mid p_{noisy}) \qquad (3.1)$$
$$\phantom{\hat{p}} = \arg\max_{p_{clean}} P(p_{noisy} \mid p_{clean})\, P_{LM}(p_{clean})\, w_{length}(p_{clean})$$

where $\hat{p}$ is the corrected sentence, and $p_{clean}$ and $p_{noisy}$ are the reference and noisy sentences respectively. $w_{length}(p_{clean})$ is the output word sequence length penalty, used to control the output sentence length, and $P(p_{noisy} \mid p_{clean})$ is decomposed into:

$$P(p^{I}_{noisy} \mid p^{I}_{clean}) = \prod_{i=1}^{I} \phi(p^{i}_{noisy} \mid p^{i}_{clean})\, D(start_i - end_{i-1}) \qquad (3.2)$$

where $\phi(p^{i}_{noisy} \mid p^{i}_{clean})$ is the phrase channel model or phrase translation table, $p^{I}_{noisy}$ and $p^{I}_{clean}$ are the sequences of $I$ phrases in the noisy and reference sentences respectively, and $i$ refers to the $i$-th phrase in the sequence. $D(start_i - end_{i-1})$ is the distortion model, where $start_i$ is the start position of the noisy phrase that was corrected to the $i$-th clean phrase, and $end_{i-1}$ is the end position of the noisy phrase corrected to be the $(i-1)$-th clean phrase.

3.3.3 Our Other Enhancements

In order to effectively demonstrate our idea, we employ (i) neural language models, to introduce long-term context and show that longer contextual information is beneficial for error correction; and (ii) minimum error rate training (MERT) to tune and optimize the model parameters using development data.

3.3.3.1 Neural Language Models

Neural-network-based language models have been shown to model higher-order n-grams more efficiently [5, 112, 163]. In [77], a more efficient language model using maximum entropy was shown to help in noisy-channel modeling of a syllable-based ASR error correction system. Incorporating such language models would aid error correction by exploiting longer context information. Hence, we adopt two types of neural network language models in this work.
(i) A feed-forward neural network, trained using a sequence of one-hot word representations along with the specified context [172]. (ii) A neural network joint model (NNJM) language model [37]. This is trained in a similar way to (i), but the context is augmented with noisy ASR observations within a specified context window. Both of the models employed are feed-forward neural networks, since they can be incorporated directly into the noisy channel modeling. A recurrent neural network LM could potentially be used during phrase-based decoding by employing certain caching and approximation tricks [3]. Noise Contrastive Estimation was used to handle the large output vocabulary size.

3.3.3.2 Minimum Error Rate Training (MERT)

One of the downsides of noisy channel modeling is that the model is trained to maximize the likelihood of the seen data, with no direct optimization for the end criterion of WER. MERT optimizes the model parameters (in our case, the weights for the language, phrase, length and distortion models) with respect to the desired end evaluation criterion. MERT was first introduced in application to statistical machine translation, providing significantly better results [122]. We apply MERT to tune the model on a small set of development data.

3.4 Experimental Setup

3.4.1 Database

For training, development, and evaluation, we employ the Fisher English Training Part 1, Speech (LDC2004S13) and Fisher English Training Part 2, Speech (LDC2005S13) corpora [29]. Fisher English Training Part 1 is a collection of conversational telephone speech with 5850 speech samples of up to 10 minutes, approximately 900 hours of speech data. Fisher English Training Part 2 contains an additional 5849 speech samples, approximately 900 hours of conversational telephone speech.
The corpora are split into training, development and test sets for experimental purposes as shown in Table 3.1. The splits of the datasets are consistent over both the ASR and the subsequent noisy-clean phrase context model. The development dataset was used for tuning the phrase-based system using MERT.

Table 3.1: Database split and statistics

| Database       | Train Hours | Train Utterances | Train Words | Dev Hours | Dev Utterances | Dev Words | Test Hours | Test Utterances | Test Words |
| Fisher English | 1,890.5     | 1,833,088        | 20,724,957  | 4.7       | 4906           | 50,245    | 4.7        | 4914            | 51,230     |
| TED-LIUM       | -           | -                | -           | 1.6       | 507            | 17,792    | 2.6        | 1155            | 27,512     |

We also test the system under mismatched training-usage conditions on TED-LIUM. TED-LIUM is a dedicated ASR corpus consisting of 207 hours of TED talks [140]. The dataset was chosen as it is significantly different from the Fisher corpus. Mismatch conditions include: (i) variations in channel characteristics: Fisher, being a telephone conversation corpus, is sampled at 8 kHz, whereas TED-LIUM is originally 16 kHz; (ii) noise conditions: the Fisher recordings are significantly noisier; (iii) utterance lengths: TED-LIUM has longer conversations, since they are extracted from TED talks; (iv) lexicon sizes: the vocabulary of TED-LIUM is much larger, with 150,000 words, whereas Fisher has 42,150 unique words; (v) speaking intonation: Fisher, being telephone conversations, is spontaneous speech, whereas the TED talks are more organized and well articulated. Factors (i) and (ii) mostly affect the performance of the ASR due to acoustic differences, while (iii) and (iv) affect the language aspects; (v) affects both the acoustic and linguistic aspects of the ASR.

3.4.2 System Setup

3.4.2.1 Automatic Speech Recognition System

We used the Kaldi Speech Recognition Toolkit [130] to train the ASR system. In this paper, the acoustic model was trained as a DNN-HMM hybrid system.
A tri-gram maximum likelihood estimation (MLE) language model was trained on the transcripts of the training dataset. The CMU pronunciation dictionary [175] was adopted as the lexicon. The resulting ASR is state-of-the-art both in architecture and performance, and as such, additional gains on top of this ASR are challenging.

3.4.2.2 Pre-processing

The reference transcripts of the ASR corpus contain non-verbal signs, such as [laughter], [noise], etc. These event signs might corrupt the phrase context model, since there is little contextual information between them. Thus, in this paper, we cleaned our data by removing all these non-verbal signs from the dataset. The text data is subjected to traditional tokenization to handle special symbols. Also, to prevent data sparsity issues, we restricted all of the sample sequences to a maximum length of 100 tokens (given that the database contained only 3 sentences exceeding this limit). The NCPCM has two distinct vocabularies, one associated with the ASR transcripts and the other pertaining to the ground-truth transcripts. The ASR vocabulary is often smaller than that of the ground-truth transcripts, mainly because pronunciation-phonetic transcriptions are unavailable for certain words, which is usually the case for names, proper nouns, out-of-language words, broken words, etc.

3.4.2.3 NCPCM

We use the Moses toolkit [86] for phrase-based noisy channel modeling and MERT optimization. The first step in the training process of the NCPCM is the estimation of the word alignments. IBM models are used to obtain the word alignments in both directions (reference-ASR and ASR-reference). The final alignments are obtained using heuristics (starting with the intersection of the two alignments and then adding additional alignment points from the union of the two alignments).
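The intersection-plus-union symmetrization heuristic can be sketched as follows. This is a simplified version of the growing heuristics used in alignment toolkits (the real implementations apply additional diagonal-neighborhood rules); the alignment points are invented for illustration.

```python
def symmetrize(fwd, bwd):
    """Symmetrize two directional word alignments.

    fwd: set of (src, tgt) index pairs from the reference->ASR direction.
    bwd: set of (src, tgt) index pairs from the ASR->reference direction.
    Start from the intersection (high precision), then repeatedly add
    union points that connect a so-far-unaligned word (adds recall).
    """
    alignment = fwd & bwd
    union = fwd | bwd
    added = True
    while added:
        added = False
        for (s, t) in sorted(union - alignment):
            s_free = all(p[0] != s for p in alignment)
            t_free = all(p[1] != t for p in alignment)
            if s_free or t_free:
                alignment.add((s, t))
                added = True
    return alignment

# Toy example: 3-word source and target, invented directional alignments.
fwd = {(0, 0), (1, 1), (2, 2)}
bwd = {(0, 0), (1, 1), (1, 2)}
print(sorted(symmetrize(fwd, bwd)))  # -> [(0, 0), (1, 1), (1, 2), (2, 2)]
```

The resulting point set is what phrase extraction consumes: every phrase pair consistent with these points (no alignment link crossing the phrase boundary) becomes a candidate entry in the phrase translation table.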
For computing the alignments, mgiza, a multi-threaded version of the GIZA++ toolkit [123], was employed. Once the alignments are obtained, the lexical translation table is estimated in the maximum likelihood sense. Then, all possible phrases along with their word alignments are generated. A maximum phrase length of 7 was set for this work. The generated phrases are scored to obtain a phrase translation table with estimates of phrase translation probabilities. Along with the phrase translation probabilities, word penalty scores (to control the translation length) and re-ordering/distortion costs (to account for possible re-ordering) are estimated. Finally, the NCPCM model is obtained as in equation 3.2. During decoding, equation 3.1 is utilized.

For training the MLE n-gram models, the SRILM toolkit [159] was adopted. Further, we employ the Neural Probabilistic Language Model Toolkit [172] to train the neural language models. The neural network was trained for 10 epochs with an input embedding dimension of 150 and an output embedding dimension of 750, with a single hidden layer. The weighted average of all input embeddings was computed for padding the lower-order estimates, as suggested in [172].

The NCPCM is an ensemble of a phrase translation model, language model, translation length penalty, and re-ordering models. Thus the tuning of the weights associated with each model is crucial in the case of the proposed phrase-based model. We adopt the line-search-based method of MERT [13]. We try two optimization criteria with MERT, i.e., BLEU (B) and WER (W).

3.4.3 Baseline Systems

We adopt four different baseline systems because of their relevance to this work:

Baseline-1: ASR Output: The raw performance of the ASR system, because of its relevance to the application of the proposed model.
Baseline-2: Re-scoring lattices using RNN-LM: In order to evaluate the performance of the system with more recent re-scoring techniques, we train a recurrent neural network with an embedding dimension of 400 and sigmoid activation units. Noise contrastive estimation is used for training the network, and it is optimized on the development data set, which is used as a stopping criterion. The Faster-RNNLM 1 toolkit is used to train the recurrent neural network. For re-scoring, 1000-best ASR hypotheses are decoded and the old LM (MLE) scores are removed. The RNN-LM scores are computed from the trained model and interpolated with the old LM. Finally, the 1000-best hypotheses are re-constructed into lattices, scored with the new interpolated LM, and decoded to get the new best-path hypothesis.

Baseline-3: Word-based noisy channel model: For comparison with the prior work described in Section 3.3.1.1, which is based on [135]. The word-based noisy channel model is created in a similar way to the NCPCM model, with three specific exceptions: (i) the maximum phrase length is set to 1, which essentially converts the phrase-based model into a word-based one; (ii) a bi-gram LM is used instead of a tri-gram or neural language model, as suggested in [135]; (iii) no re-ordering/distortion model or word penalties are used.

Baseline-4: Discriminative Language Modeling (DLM): Similar to the proposed work, DLM makes use of the reference transcripts to tune language model weights based on specified feature sets in order to re-rank the n-best hypotheses. Specifically, we employ the perceptron algorithm [136] for training DLMs. The baseline system is trained using unigrams, bigrams and trigrams (as in [15, 184, 141]) for a fair comparison with the proposed NCPCM model. We also provide results with an extended feature set comprising rank-based features and ASR LM and AM scores.
Refr (Reranker framework) [14] is used for training the DLMs, following most recommendations from [15]. 100-best ASR hypotheses are used for training and re-ranking purposes.

3.4.4 Evaluation Criteria

The final goal of our work is to show improvements in terms of the transcription accuracy of the overall system. Thus, we report word error rate (WER), as it is the standard in the ASR community. Moreover, the Bilingual Evaluation Understudy (BLEU) score [126] is used for evaluating our work, since our model can also be treated as a transfer-function (translation) system from ASR output to NCPCM output.

1 https://github.com/yandex/faster-rnnlm

1. REF: oysters clams and mushrooms i think
   ASR: wasters clams and mushrooms they think
   ORACLE: wasters clams and mushrooms i think
   NCPCM: oysters clams and mushrooms they think
   Example of hypotheses B

2. REF: yeah we had this awful month this winter where it was like a good day if it got up to thirty it was ridiculously cold
   ASR: yeah we had this awful month uh this winter where it was like a good day if i got up to thirty was ridiculous lee cold
   ORACLE: yeah we had this awful month this winter where it was like a good day if it got up to thirty it was ridiculous the cold
   NCPCM: yeah we had this awful month uh this winter where it was like a good day if i got up to thirty it was ridiculously cold
   Example of hypotheses A, B, G

3. REF: oh well it depends on whether you agree that al qaeda came right out of afghanistan
   ASR: oh well it depends on whether you agree that al <unk> to came right out of afghanistan
   ORACLE: oh well it depends on whether you agree that al <unk> to came right out of afghanistan
   NCPCM: oh well it depends on whether you agree that al qaeda to came right out of afghanistan
   Example of hypotheses C

4.
REF: they laugh because everybody else is laughing and not because it's really funny
   ASR: they laughed because everybody else is laughing and not because it's really funny
   ORACLE: they laugh because everybody else is laughing and not because it's really funny
   NCPCM: they laugh because everybody else is laughing and not because it's really funny
   Example of hypotheses A, G

5. REF: yeah especially like if you go out for ice cream or something
   ASR: yeah it specially like if you go out for ice cream or something
   ORACLE: yeah it's especially like if you go out for ice cream or something
   NCPCM: yeah especially like if you go out for ice cream or something
   Example of hypotheses A

6. REF: we don't have a lot of that around we kind of live in a nicer area
   ASR: we don't have a lot of that around we kinda live in a nicer area
   ORACLE: we don't have a lot of that around we kind of live in a nicer area
   NCPCM: we don't have a lot of that around we kind of live in a nicer area
   Example of hypotheses A, H

Table 3.2: Analysis of selected sentences. REF: reference ground-truth transcripts; ASR: output ASR transcriptions; ORACLE: best path through the output lattice given the ground-truth transcript; NCPCM: transcripts after NCPCM error correction. Green color highlights correct phrases; orange color highlights incorrect phrases.

3.5 Results and Discussion

In this section we demonstrate the ability of the proposed NCPCM to validate our Hypotheses A-H from Section 3.2, along with the experimental results. The experimental results are presented via three different tasks: (i) overall WER experiments, highlighting the improvements of the proposed system, presented in Tables 6.1, 3.4 & 3.5; (ii) detailed analysis of WERs over subsets of data, presented in Figures 3.3 & 3.2; and (iii) analysis of the error corrections, presented in Table 3.2.
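WER, the primary criterion reported in the following tables, is the word-level Levenshtein distance between reference and hypothesis, normalized by the reference length. A minimal implementation (not the scoring scripts used in these experiments), applied to case 1 from Table 3.2:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Case 1 from Table 3.2: two substitutions over six reference words.
ref = "oysters clams and mushrooms i think"
asr = "wasters clams and mushrooms they think"
print(round(wer(ref, asr), 3))
```

Here the ASR hypothesis has two substitutions ("wasters" for "oysters", "they" for "i") over six reference words, i.e., a WER of 2/6.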
The assessment and discussion of each task is structured similarly to Section 3.2 to support the respective claims.

3.5.1 Re-scoring Lattices

Table 3.2 shows selected samples passed through the proposed error correction system. In addition to the reference, the ASR output and the proposed system's output, we provide the ORACLE transcripts to assess the presence of the correct phrase in the lattice. Cases 4-6 from Table 3.2 have the correct phrase in the lattice, but it gets down-scored in the final ASR output and is then recovered by our system, as hypothesized in Hypothesis 3.2.1.

3.5.2 Recovering Pruned Lattices

In cases 1 and 2 from Table 3.2, we see that the correct phrases are not present in the ASR lattice, although they were seen in training and are present in the vocabulary. However, the proposed system manages to recover the phrases, as discussed in Hypothesis 3.2.2. Moreover, case 2 also demonstrates an instance where the confusion occurs due to identical phonetic transcriptions ("ridiculously" versus "ridiculous lee"), again supporting Hypothesis 3.2.1.

3.5.3 Recovery of Unseen Phrases

Case 3 of Table 3.2 demonstrates an instance where the word "qaeda" is absent from the ASR lexicon (vocabulary) and hence absent from the decoding lattice. This forces the ASR to output an unknown-word token (<unk>). We see that the system recovers the out-of-vocabulary word "qaeda" successfully, as claimed in Hypothesis 3.2.3.

3.5.4 Better Recovery during Poor Recognitions

To justify the claim that our system can offset the performance deficit of the ASR in tougher conditions (as per Hypothesis 3.2.4), we formulate a sub-problem as follows:

[Figure 3.2: Top-Good, Bottom-Bad WER splits, plotting ASR WER against utterance length for (a) in-domain (Fisher) and (b) out-of-domain (TED-LIUM). As we can see, the WER for top-good is often 0%, which leaves no margin for improvement. We will see the impact of this later, as in Fig. 3.3.]

Problem Formulation: We divide our development and test datasets equally, per sentence length, into good recognition results (top-good) and poor recognition results (bottom-bad) subsets based on the WER of the ASR, and analyze the improvements and any degradation caused by our system.

Figure 3.3 shows plots of the above-mentioned analysis for different systems, as captioned. The blue lines represent the improvements provided by our system for the top-good subset over different utterance lengths, i.e., they indicate the difference between our system and the original WER of the ASR (negative values indicate improvement and positive values indicate degradation resulting from our system). The green lines indicate the same for the bottom-bad subset of the database. The red lines indicate the difference between the bottom-bad WERs and the top-good WERs, i.e., negative values of red indicate that the system provides more improvement to the bottom-bad subset relative to the top-good subset. The solid lines represent the respective trends, obtained by simple linear regression (line fitting).

For poor recognitions, we are concerned with the bottom-bad subset, i.e., the green lines in Figure 3.3. Firstly, we see that the solid green line is always below zero, which indicates that there are always improvements for the bottom-bad, i.e., poor recognition results. Second, we observe that the solid red line usually stays below zero, indicating that the performance gains made by the system contribute more to the bottom-bad poor recognition results than to the top-good subset (good recognitions).
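The problem formulation above — dividing each dataset equally, per sentence length, into top-good and bottom-bad halves by ASR WER — can be sketched as follows. The utterance data here is invented for illustration; the actual analysis uses per-utterance WERs from the experiments.

```python
from collections import defaultdict

def split_top_bottom(utterances):
    """Split utterances into top-good / bottom-bad halves per sentence length.

    utterances: list of (length, wer) pairs. Within each length bucket,
    the lower-WER half is 'top-good' and the rest is 'bottom-bad'.
    """
    buckets = defaultdict(list)
    for length, utt_wer in utterances:
        buckets[length].append(utt_wer)
    top_good, bottom_bad = {}, {}
    for length, wers in buckets.items():
        wers.sort()
        half = len(wers) // 2
        top_good[length] = wers[:half]
        bottom_bad[length] = wers[half:]
    return top_good, bottom_bad

# Toy data: (utterance length, ASR WER) pairs, invented for illustration.
data = [(5, 0.0), (5, 0.4), (5, 0.1), (5, 0.6), (10, 0.2), (10, 0.3)]
top, bottom = split_top_bottom(data)
print(top[5], bottom[5])  # -> [0.0, 0.1] [0.4, 0.6]
```

The per-length WER change of the corrected output relative to these two subsets is what the blue (top-good) and green (bottom-bad) curves in Figure 3.3 plot.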
Further justification is provided later in the context of the out-of-domain task (Section 3.5.6), where the high mismatch results in a tougher recognition task.

[Figure 3.3 panels: (a) Dev: NCPCM + MERT(W); (b) Test: NCPCM + MERT(W); (c) Dev: NCPCM + 5gram NNLM + MERT(W); (d) Test: NCPCM + 5gram NNLM + MERT(W); (e) Out-of-Domain Dev: NCPCM + generic LM + MERT(W); (f) Out-of-Domain Test: NCPCM + generic LM + MERT(W). Each panel plots utterance length against absolute WER change for the top-good and bottom-bad subsets, their difference, and the corresponding linear trends.]

Figure 3.3: Length of ASR hypotheses vs. absolute WER change (NCPCM). Blue & green lines represent the difference between the WER of our system and the baseline ASR, for top-good and bottom-bad hypotheses, respectively. In an ideal scenario, all these lines would be below 0, thus all providing a change in WER towards improving the system. However, we see in some cases that the WER increases, especially when the hypothesis length is short and when the performance is good. This is as expected, since from Fig.
3.2 some cases are at 0% WER due to the already highly-optimized nature of our ASR. The red line represents the aggregate error over all data for each word length, and as we can see, in all cases the trend is one of improving the WER, justifying Hypotheses D, E, F, G.

In-domain testing on Fisher data:

| Method | Dev WER | Dev BLEU | Test WER | Test BLEU |
| ASR output (Baseline-1) | 15.46% | 75.71 | 17.41% | 72.99 |
| ASR + RNNLM re-scoring (Baseline-2) | 16.17% | 74.39 | 18.39% | 71.24 |
| Word based + bigram LM (Baseline-3) | 16.23% | 74.28 | 18.10% | 71.76 |
| Word based + bigram LM + MERT(B) | 15.46% | 75.70 | 17.40% | 72.99 |
| Word based + bigram LM + MERT(W) | 15.39% | 75.65 | 17.40% | 72.77 |
| Word based + trigram LM + MERT(B) | 15.48% | 75.59 | 17.47% | 72.81 |
| Word based + trigram LM + MERT(W) | 15.46% | 75.46 | 17.52% | 72.46 |
| DLM (Baseline-4) | 23.65% | 63.35 | 25.36% | 61.19 |
| DLM w/ extended feats | 24.48% | 62.92 | 26.12% | 60.98 |
| Proposed NCPCM | 20.33% | 66.70 | 22.32% | 63.81 |
| NCPCM + MERT(B) | 15.11% | 76.06 | 17.18% | 73.00 |
| NCPCM + MERT(W) | 15.10% | 76.08 | 17.15% | 73.05 |
| NCPCM + MERT(B) w/o re-ordering | 15.27% | 76.02 | 17.11% | 73.33 |
| NCPCM + MERT(W) w/o re-ordering | 15.19% | 75.90 | 17.18% | 73.04 |
| NCPCM + 10best + MERT(B) | 15.19% | 76.12 | 17.17% | 73.22 |
| NCPCM + 10best + MERT(W) | 15.16% | 75.91 | 17.21% | 73.03 |

Table 3.3: Noisy-Clean Phrase Context Model (NCPCM) results (uses exactly the same LM as the ASR)

3.5.5 Improvements under all Acoustic Conditions

To justify the claim that our system can consistently provide benefits over any ASR system (Hypothesis 3.2.5), we need to show that the proposed system (i) does not degrade the performance of good recognitions, and (ii) provides improvements for poor recognition instances of the ASR. The latter has been discussed and confirmed in the previous Section 3.5.4. For the former, we provide evaluations from two points of view: (1) assessment of the WER trends of the top-good and bottom-bad subsets (as in the previous Section 3.5.4), and (2) the overall absolute WER of the proposed systems.
Firstly, examining Figure 3.3, we are mainly concerned with the top-good subset, pertaining to the degradation/improvement of good recognition instances. We observe that the solid blue line is close to zero in all cases, which implies that the degradation of good recognitions is extremely minimal. Moreover, we observe that the slope of the line is almost zero in all cases, which indicates that the degradation is minimal and mostly consistent over different utterance lengths. Furthermore, assessing the degradation from the absolute WER perspective, Figure 3.2a shows the WER over utterance lengths for the top-good and bottom-bad subsets in the in-domain case. The top-good WER is small, at times even 0% (perfect recognition), thereby allowing very little margin for improvement. In such a case, we see minimal degradation. Although we lose a little on very good recognitions, which is extremely minimal, we gain significantly in the case of 'bad' recognitions. To summarize, the damage this system can cause under the best ASR conditions is minimal and is offset by the potential significant gains present when the ASR hits tough recognition conditions.

WER experiments: Secondly, examining the overall WER, Table 6.1 gives the results of the baseline systems and the proposed technique. Note that we use the same language model as the ASR. This helps us evaluate a system that does not include additional information. We provide the performance measures on both the development and held-out test data. The development data is used for MERT tuning.

Baseline results: The output of the ASR (Baseline-1) suggests that the development data is less complex compared to the held-out test set. In our case, the RNN-LM based lattice re-scoring (Baseline-2) doesn't help.
This result shows that even with higher-order context, the RNN-LM is unable to recover the errors present in the lattice, suggesting that the errors stem from pruning during decoding. We note that the word-based system (Baseline-3) doesn't provide any improvements. Even when we increase context (trigram LM) and use MERT optimization, the performance is just on par with the original ASR output. Further, DLM re-ranking (Baseline-4) fails to provide any improvements in our case. This result is in line with the finding in [15], where the DLM provides improvements only when used in combination with ASR baseline scores. However, we believe the introduction of ASR scores into the NCPCM could be beneficial, as it would be in the case of DLMs. Thus, to demonstrate the independent contribution of NCPCM vs. DLMs, rather than investigate fusion methods, we don't utilize baseline ASR scores for either of the two methods. We plan to investigate the benefits of multi-method fusion in our future work. When using the extended feature set for training the DLM, we don't observe improvements. With our setup, none of the baseline systems provide noticeable significant improvements over the ASR output. We believe this is due to the highly optimized ASR setup, and the nature of the database itself, being noisy telephone conversational speech. Overall, the baseline results highlight: (i) the difficulty of the problem for our setup, and (ii) that re-scoring is insufficient, emphasizing the need to recover words pruned out of the output lattice.

NCPCM results: The NCPCM is an ensemble of a phrase translation model, language model, word penalty model and re-ordering models. Thus the tuning of the weights associated with each model is crucial in the case of phrase-based models [119].
The NCPCM without tuning, i.e., assigning random weights to the various models, performs very poorly, as expected.

Cross-domain testing on TED-LIUM data:

| Method | Dev WER | Dev BLEU | Test WER | Δ1 | Δ2 | Test BLEU |
| Baseline-1 (ASR) | 26.92% | 62.00 | 23.04% | 0% | -10.9% | 65.71 |
| ASR + RNNLM re-scoring (Baseline-2) | 24.05% | 64.74 | 20.78% | 9.8% | 0% | 67.93 |
| Baseline-3 (Word-based) | 29.86% | 57.55 | 25.51% | -10.7% | -22.8% | 61.79 |
| Baseline-4 (DLM) | 33.34% | 53.12 | 28.02% | -21.6% | -34.8% | 58.50 |
| DLM w/ extended feats | 30.51% | 57.14 | 29.33% | -27.3% | -41.1% | 57.60 |
| NCPCM + MERT(B) | 26.06% | 63.30 | 22.51% | 2.3% | -8.3% | 66.67 |
| NCPCM + MERT(W) | 26.15% | 63.10 | 22.74% | 1.3% | -9.4% | 66.36 |
| NCPCM + generic LM + MERT(B) | 25.57% | 63.98 | 22.38% | 2.9% | -7.7% | 66.97 |
| NCPCM + generic LM + MERT(W) | 25.56% | 63.83 | 22.33% | 3.1% | -7.5% | 66.96 |
| RNNLM re-scoring + NCPCM + MERT(B) | 23.36% | 65.88 | 20.40% | 11.5% | 1.8% | 68.39 |
| RNNLM re-scoring + NCPCM + MERT(W) | 23.32% | 65.76 | 20.57% | 10.7% | 1% | 68.07 |
| RNNLM re-scoring + NCPCM + generic LM + MERT(B) | 23.00% | 66.48 | 20.31% | 11.8% | 2.3% | 68.52 |
| RNNLM re-scoring + NCPCM + generic LM + MERT(W) | 22.80% | 66.19 | 20.23% | 12.2% | 2.6% | 68.49 |

Table 3.4: Results for out-of-domain adaptation using Noisy-Clean Phrase Context Models (NCPCM). Δ1: relative % improvement w.r.t. Baseline-1; Δ2: relative % improvement w.r.t. Baseline-2.

The word-based model lacks re-ordering/distortion modeling and word penalty models, and is hence less sensitive to weight tuning. Thus it is unfair to compare the un-tuned phrase-based models with the baseline or their word-based counterpart. Hence, for all our subsequent experiments, we only include results with MERT. When employing MERT, all of the proposed NCPCM systems significantly outperform the baseline (statistically significant with p < 0.001 for both word error and sentence error rates [57], with 51,230 word tokens and 4,914 sentences in the test data). We find that MERT optimized for WER consistently outperforms MERT with the optimization criterion of BLEU score.
We also perform trials with distortion modeling disabled and find that the results remain relatively unchanged. This is as expected, since the ASR preserves the sequence of words with respect to the audio and there is no re-ordering effect over the errors. The phrase-based context modeling provides a relative improvement of 1.72% (see Table 6.1) over baseline-3 and the ASR output. Using multiple hypotheses (10-best) from the ASR, we hope to capture more relevant error patterns of the ASR model, thereby enriching the noisy channel modeling capabilities. However, we find that the 10-best gives about the same performance as the 1-best. In this case we considered the 10-best as 10 separate training pairs for training the system. In the future we want to exploit the inter-dependency of this ambiguity (the fact that all the 10-best hypotheses represent a single utterance) for training and error correction at test time.

3.5.6 Adaptation

WER experiments: To assess the adaptation capabilities, we evaluate the performance of the proposed noisy-clean phrase context model on an out-of-domain task, the TED-LIUM database, shown in Table 3.4.

Baseline results: Baseline-1 (the ASR performance) confirms the heightened mismatch between the training Fisher Corpus and the TED-LIUM database. Unlike in matched, in-domain evaluation, RNNLM re-scoring provides drastic improvements (9.8% relative improvement in WER) when tuned with the out-of-domain development set. The mismatch in cross-domain evaluation is reflected in considerably worse performance for the word-based and DLM baselines (compared to matched conditions).

NCPCM results: We see that the phrase context modeling provides modest improvements over baseline-1 of approximately 2.3% relative (see Table 3.4) on the held-out test set.
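As a quick check on how the relative percentages in Table 3.4 are derived, the relative WER improvement of a system over a reference is (WER_ref - WER_sys) / WER_ref. The sketch below (our illustration, not code from the dissertation) reproduces two of the table's relative figures from its test-set WERs:

```python
def relative_improvement(wer_ref, wer_sys):
    """Relative % WER improvement of a system over a reference WER."""
    return 100.0 * (wer_ref - wer_sys) / wer_ref

# Test-set WERs from Table 3.4 (TED-LIUM).
baseline1 = 23.04      # Baseline-1 (ASR)
ncpcm_mert_b = 22.51   # NCPCM + MERT(B)
best = 20.23           # RNNLM re-scoring + NCPCM + generic LM + MERT(W)

print(round(relative_improvement(baseline1, ncpcm_mert_b), 1))  # 2.3
print(round(relative_improvement(baseline1, best), 1))          # 12.2
```

The 2.3% and 12.2% figures match the Rel.(1) column of Table 3.4 for the corresponding rows.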
We note that the improvements are consistent with the earlier in-domain experiments in Table 6.1. Moreover, since the previous LM was trained on the Fisher Corpus, we adopt a more generic English LM, which provides further improvements of up to 3.1% (see Table 3.4). We also experiment with NCPCM over the re-scored RNNLM output. We find the NCPCM to always yield consistent improvements over the RNNLM output (see (1) & (2) in Table 3.4). An overall relative gain of 2.6% is obtained over the RNNLM re-scored output (baseline-2), i.e., 12.2% over the ASR (baseline-1). This confirms that the NCPCM is able to provide improvements in parallel with, and in conjunction with, the RNNLM or any other system that may improve ASR performance, and therefore supports Hypothesis 3.2.5 in yielding improvements in highly optimized ASR environments. This also confirms the robustness of the proposed approach and its applicability to out-of-domain data. More importantly, the result confirms Hypothesis 3.2.6, i.e., our claim of rapid adaptability of the system to varying mismatched acoustic and linguistic conditions. The extreme mismatched conditions involved in our experiments support the possibility of going one step further and training our system on artificially generated data of noisy transformations of phrases, as in [168, 141, 22, 89, 40, 184], thus possibly eliminating the need for an ASR for training purposes.

Further, comparing the WER trends from the in-domain task (Figure 3.3b) to the out-of-domain task (Figure 3.3f), we first find that the improvements in the out-of-domain task are obtained for both the top-good (good recognition) and bottom-bad (bad recognition) subsets, i.e., both the solid blue line and the solid green line are always below zero.
Secondly, we observe that the improvements are more consistent throughout all the utterance lengths, i.e., all the lines have near-zero slopes compared to the in-domain task results. Thirdly, comparing Figure 3.2a with Figure 3.2b, we observe more room for improvement, both for the top-good portion as well as for the bottom-bad WER subset of the data. The three findings are fairly meaningful considering the high mismatch of the out-of-domain data.

In-domain testing on Fisher data

Method                                 Dev WER  Dev BLEU  Test WER  Test BLEU
Baseline-1 (ASR output)                 15.46%    75.71    17.41%     72.99
Baseline-2 (ASR + RNNLM re-scoring)     16.17%    74.39    18.39%     71.24
Baseline-3 (Word-based + 5-gram NNLM)   15.47%    75.63    17.41%     72.92
Word-based + 5-gram NNLM + MERT(B)      15.46%    75.69    17.40%     72.99
Word-based + 5-gram NNLM + MERT(W)      15.42%    75.58    17.38%     72.75
NCPCM + 3-gram NNLM + MERT(B)           15.46%    75.91    17.37%     73.24
NCPCM + 3-gram NNLM + MERT(W)           15.28%    75.94    17.11%     73.31
NCPCM + 5-gram NNLM + MERT(B)           15.35%    75.99    17.20%     73.34
NCPCM + 5-gram NNLM + MERT(W)           15.20%    75.96    17.08%     73.25
NCPCM + NNJM-LM (5,4) + MERT(B)         15.29%    75.93    17.13%     73.26
NCPCM + NNJM-LM (5,4) + MERT(W)         15.28%    75.94    17.13%     73.29

Table 3.5: Results for Noisy-Clean Phrase Context Models (NCPCM) with Neural Network Language Models (NNLM) and Neural Network Joint Models (NNJM).

3.5.7 Exploit Longer Context

Firstly, inspecting the error correction results from Table 3.2, cases 2 and 4 hint at the ability of the system to select appropriate word suffixes using long-term context information. Secondly, from the detailed WER analysis in Figure 3.3, we see that the bottom-bad (solid green line) improvements decrease with increasing length in most cases, hinting at potential improvements to be found by using higher contextual information in the error correction system, as a future research direction.
Moreover, on closer inspection across different models, comparing the trigram MLE model (Figure 3.3b) with the 5-gram NNLM (Figure 3.3d), we find that the NNLM provides minimal degradation and better improvements, especially for longer utterances, by exploiting more context (the blue solid line for the NNLM has a smaller intercept value as well as a higher negative slope). We also find that for the bottom-bad, poor recognition results (green solid line), the NNLM gives consistent (smaller positive slope) and better improvements, especially for longer utterances (smaller intercept value), thus emphasizing the gains provided by the higher-context NNLM.

WER experiments: Thirdly, Table 3.5 shows the results obtained using neural network language models of higher orders (also trained only on the in-domain data). For a fair comparison, we adopt a higher-order (5-gram) NNLM for the baseline-3 word-based noisy channel modeling system. Even with a higher-order NNLM, baseline-3 fails to improve upon the ASR. We do not include baseline-4 results in this section, since the DLM does not include a neural network model. Comparing results from Table 6.1 with Table 3.5, we note the benefits of higher-order LMs, with the 5-gram neural network language model giving the best results (a relative improvement of 1.9% over baseline-1), outperforming the earlier MLE n-gram models, as per Hypothesis 3.2.7. Moreover, an experimental comparison of the baseline-3 (word-based) and NCPCM models, both incorporating identical 5-gram neural network language models, confirms the advantages of NCPCM (a relative improvement of 1.7%). However, the neural network joint model LM with a target context of 5 and a source context of 4 did not show significant improvements over the traditional neural LMs. We expect the neural network models to provide further improvements with more training data.
3.5.8 Regularization

Finally, the last case in Table 3.2 is one of text regularization, as described in Section 3.2, Hypothesis 3.2.8. Overall, in our experiments, we found that approximately 20% of the corrections were cases of text regularization and the rest were cases of the former hypotheses.

3.6 Conclusions & Future Work

In this work, we proposed a phrase-based noisy channel model for error correction. The system post-processes the output of an automatic speech recognition system, and as such any contributions to improving ASR act in conjunction with NCPCM. We presented and validated a range of hypotheses. We supported our claims with apt problem formulations and their respective results. We showed that our system can improve the performance of the ASR by (i) re-scoring the lattices (Hypothesis 3.2.1), (ii) recovering words pruned from the lattices (Hypothesis 3.2.2), (iii) recovering words never seen in the vocabulary and training data (Hypothesis 3.2.3), (iv) exploiting longer context information (Hypothesis 3.2.7), and (v) regularizing language syntax (Hypothesis 3.2.8). Moreover, we also claimed and justified that our system can provide more improvement in low-performing ASR cases (Hypothesis 3.2.4), while keeping the degradation to a minimum in cases where the ASR performs well (Hypothesis 3.2.5). In doing so, our system can effectively adapt (Hypothesis 3.2.6) to changing recognition environments and provide improvements over any ASR system.

In our future work, the output of the noisy-clean phrase context model will be fused with the ASR beliefs to obtain a new hypothesis. We also intend to introduce ASR confidence scores and signal SNR estimates to improve the channel model. We are investigating introducing the probabilistic ambiguity of the ASR, in the form of lattices or confusion networks, as inputs to the channel-inversion model.
Further, we will utilize sequence-to-sequence (Seq2seq) translation modeling [165] to map ASR outputs to reference transcripts. The Seq2seq model has been shown to have benefits especially in cases where the training sequences are of variable length [27]. We intend to employ a Seq2seq model to encode the ASR output into a fixed-size embedding and to decode this embedding to generate the corrected transcripts.

Chapter 4

Confusion2Vec: Towards Enriching Vector Space Word Representations with Representational Ambiguities

4.1 Introduction

The goal of this study is to come up with a new vector space representation for words which incorporates uncertainty information, in the form of word confusions, present in lattice-like structures (e.g., confusion networks). Here, the word confusions are any word-level ambiguities resulting from algorithms such as machine translation, ASR, etc.; they can also be knowledge-based, like word segmentation information, or data-driven. An example is acoustically confusable words in ASR lattices: "two" and "to" (see Figure 4.1). A word lattice is a compact representation (a directed acyclic weighted graph) of the different word sequences that are likely possible. A confusion network is a special type of lattice in which each word sequence is made to pass through each node of the graph. Lattices and confusion networks embed word confusion information. The study takes motivation from human perception, i.e., the ability of humans to decode information based on two fairly independent information streams (see Section 4.2.1 for examples): (i) linguistic context (modeled by word2vec-like word vector representations), and (ii) acoustic confusability (relating to phonology). However, present word vector representations like word2vec only incorporate the contextual confusability during modeling.
Hence, in order to handle confusability and to decode human language/speech successfully, there is a need to model both dimensions. Although the motivation is primarily derived from human speech and perception, the confusions are not constrained to acoustics and can be extended to any confusions parallel to the linguistic contexts, for example, confusions present in lattices. Most machine learning algorithms output predictions as a probability measure. This uncertainty information stream can be expressed temporally in the form of a lattice or a confusion network, and is often found to contain useful information for subsequent processing and analysis. The scope of this work is to introduce a complementary (ideally orthogonal) subspace in addition to the underlying word vector space representation captured by word2vec. This new subspace captures the word confusions orthogonal to the syntax and semantics of the language. We propose the Confusion2Vec vector space, operating on lattice-like structures, specifically word confusion networks. We introduce several training configurations and evaluate their effectiveness. We also formulate appropriate evaluation criteria to assess the performance of each orthogonal subspace, first independently and then jointly. An analysis of the proposed word vector space representation is carried out.

The rest of the paper is organized as follows. The motivation for Confusion2vec, i.e., the need to model word confusions in word embeddings, is provided through the lenses of human speech & perception, machine learning, and potential applications in Section 4.2. A particular case study is chosen and the problem is formulated in Section 4.3. In Section 4.4, different training configurations for efficient estimation of the word embeddings are proposed. Additional tuning schemes for the proposed Confusion2vec models are presented in Section 4.5.
The evaluation criterion formulation and evaluation database creation are presented in Section 4.6. The experimental setup and baseline system are described in Section 4.7. Results are tabulated and discussed in Section 6.5. Word vector space analysis is performed and the findings are presented in Section 4.9. Section 4.10 discusses, with the help of a few toy examples, the benefits of the Confusion2vec embeddings for the task of ASR error correction. Section 6.7 draws the conclusions of the study, and finally future research directions are discussed in Section 6.8.

4.2 Motivation

One efficient way to represent words as vectors is to represent them in a space that preserves the semantic and syntactic relations between the words of the language. Word2vec describes a technique for achieving such a representation by trying to predict the current word from its local context (or vice versa) over a large text corpus. The estimated word vectors are shown to efficiently encode syntactic-semantic language information. In this work we propose a new vector space for word representation which incorporates various forms of word confusion information in addition to the semantic & syntactic information. The new vector space is inspired and motivated by the following factors from human speech production & perception and from machine learning.

4.2.1 Human speech production, perception and hearing

In our everyday interactions, confusability can often result in the need for context to decode the underlying words.

"Please ____ a seat."    (Example 1)

In Example 1, the missing word can be guessed from its context and narrowed down to either "have" or "take". This context information is modeled through language models. More complex models such as word2vec also use the contextual information to model word vector representations.
On the other hand, confusability can also originate from other sources, such as acoustic representations.

"I want to seat"    (Example 2)

In Example 2, the word "seat" is mispronounced/misheard and grammatically incorrect. In this case, considering the context, there exist many possible correct substitutions for the word "seat", and hence the context is less useful. The acoustic construct of the word "seat" can provide additional information in terms of acoustic alternatives/similarity, such as "sit" and "seed".

"I want to s"    (Example 3)

Similarly, in Example 3 the final word is incomplete. The acoustic confusability information can be useful in the above case of broken words. Thus, since the confusability is acoustic, purely lexical vector representations like word2vec fail to encode or capture it. In this work, we propose to additionally encode the word (acoustic) confusability information to learn a better word embedding. Although the motivation is specific to acoustics in this case, it can be extended to other inherent sources of word confusions spanning various machine learning applications.

4.2.2 Machine Learning Algorithms

Most machine learning algorithms output hypotheses as a probability measure. Such a hypothesis can be represented in the form of a lattice, a confusion network, or n-best lists. It is often useful to consider the uncertainty associated with the hypothesis for subsequent processing and analysis (see Section 4.11 for potential applications). The uncertainty information is often orthogonal to the contextual dimension and is specific to the task attempted by the machine learning algorithm. Along this direction, there have recently been several efforts concentrated on introducing lattice information into neural network architectures.
Initially, Tree-LSTMs were proposed, enabling tree-structured network topologies to be input to RNNs [166]; these could be adapted and applied to lattices [156]. LatticeRNN was proposed for processing word-level lattices for ASR [91]. Lattice-based Gated Recurrent Units (GRUs) [161] and lattice-to-sequence models [169] were proposed for reading word lattices as input, specifically lattices with tokenization alternatives for machine translation models. LatticeLSTM was adopted for a lattice-to-sequence model incorporating lattice scores for the task of speech translation by [156]. [20] proposed neural lattice language models, which make it possible to incorporate many possible meanings for words and phrases (paraphrase alternatives). Thus, a vector space representation capable of embedding relevant uncertainty information, in the form of the word confusions present in lattice-like structures or confusion networks, along with the semantics & syntax, can potentially be superior to the word2vec space.

Figure 4.1: An example confusion network for the ground-truth utterance "I want to sit" (word alternatives include I/eye, want/what/won't/wand, to/two/tees, seat/sit/eat/seed), spanning a contextual content axis and an acoustic confusability axis.

4.3 Case Study: Application to Automatic Speech Recognition

In this work, we consider the ASR task as a case study to demonstrate the effectiveness of the proposed Confusion2vec model in modeling acoustic word confusability. However, the technique can be adopted for the lattice or confusion network output of potentially any algorithm to capture various patterns, as discussed in Section 4.11, in which case the confusion subspace (the vertical ambiguity in Figure 4.1) is no longer constrained to acoustic word confusions. An ASR lattice contains multiple paths over acoustically similar words.
A lattice can be transformed and represented as a linear graph by forcing every path to pass through all the nodes [185, 105]. Such a linear graph is referred to as a confusion network. Figure 4.1 shows a sample confusion network output by an ASR for the ground truth "I want to sit". The confusion network can be viewed along two fundamental dimensions of information (see Figure 4.1): (i) the contextual axis, the sequential structure of a sentence, and (ii) the acoustic axis, similarly sounding word alternatives. Traditional word vector representations such as word2vec only model the contextual information (the horizontal (red) direction in Figure 4.1). The word confusions, for example the acoustic contextualization in Figure 4.1 (the vertical (green) direction), are not encoded. We propose to additionally capture the co-occurrence information along the acoustic axis, orthogonal to word2vec. This is the main focus of our work, i.e., to jointly learn the vertical word-confusion context and the horizontal semantic and syntactic context. In other words, we hypothesize deriving relationships between the semantics and syntax of the language and the word confusions (acoustic confusions).

4.3.1 Related Work

[10] trained a continuous word embedding of acoustically alike words (using an n-gram feature representation of words) to replace the state space models (HMMs), decision trees and lexicons of an ASR, and through the use of such an embedding and a lattice re-scoring technique demonstrated improvements in ASR word error rates. The embeddings have also been shown to be useful in application to the task of ASR error detection by [56]. A few evaluation strategies were also devised to evaluate the phonetic and orthographic similarity of words.
Additionally, there have been studies concentrating on estimating word embeddings from acoustics [80, 28, 95, 68], with evaluations based on acoustic similarity measures. In parallel, word2vec-like word embeddings have been used successfully to improve ASR error detection performance [54, 55]. We believe the proposed exploitation of both information sources, i.e., acoustic relations and linguistic relations (semantics and syntax), will be beneficial in ASR and in error detection and correction tasks. The proposed Confusion2vec operates on the lattice output of the ASR, in contrast to the work on acoustic word embeddings [80, 28, 95, 68], which is trained directly on audio. The proposed Confusion2vec also differs from the works by [10] and [56], which likewise utilize audio data, under the hypothesis that the layer right below the softmax layer of a deep end-to-end ASR contains acoustic similarity information of words. Confusion2vec can also potentially be trained without an ASR, on artificially generated data emulating an ASR [168, 141, 22, 89, 40, 184]. Thus, Confusion2vec can potentially be trained in a completely unsupervised manner, and with appropriate model parameterization it can incorporate various degrees of acoustic confusability, e.g., stemming from noise or speaker conditions. Further, in contrast to the prior works on lattice-encoding RNNs [166, 156, 91, 161, 169, 20], which concentrate on incorporating the uncertainty information embedded in the word lattices by modifying the input architecture of the recurrent neural network, we propose to introduce the ambiguity information from the lattices into the word embedding explicitly. We expect similar advantages as with the lattice-encoding RNNs when using the pre-trained Confusion2vec embedding towards various tasks like ASR, machine translation, etc. Moreover, our architecture does not require memory, which has significant advantages in terms of training complexity. We propose to train the embedding in a similar way to the word2vec models [113]. All the well-studied previous efforts towards optimizing the training of such models [115, 116] should apply to our proposed model.

Figure 4.2: Baseline word2vec training scheme for confusion networks. C(t) is a unit word confusion in the confusion network at time stamp t, i.e., C(t) represents a set of arcs between two adjacent nodes of the confusion network, representing a set of confusable words. w_{t,i} is the i-th most probable word in the confusion C(t). Word confusions are sorted in decreasing order of their posterior probability: P(w_{t,1}) > P(w_{t,2}) > P(w_{t,3}) > ...

4.4 Proposed Models

4.4.1 Baseline Word2Vec Model

The popular word2vec work [113] proposed log-linear models, i.e., neural networks consisting of a single linear layer (projection matrix) without non-linearity. These models have significant advantages in training complexity. [113] found the skip-gram model to be superior to the bag-of-words model in a semantic-syntactic analogy task. Hence, we only employ the skip-gram configuration in this work. Appropriately, the skip-gram word2vec model is also adopted as the baseline for this work. However, we strongly believe the proposed concept (introducing word ambiguity information) is independent of the modeling technique itself and should translate to relatively newer techniques like GloVe [127] and fastText [17].

Figure 4.3: Proposed intra-confusion training scheme for confusion networks (notation as in Figure 4.2; the dotted curved lines denote that self-mapping is disallowed).
We adapt the word2vec contextual modeling to operate on the confusion network (in our case, the confusion network of an ASR). Figure 4.2 shows the training configuration of the skip-gram word2vec model on the confusion network. The baseline model (traditional skip-gram) only considers the context of the top hypothesis of the confusion network (a single path) for training. The words w_{t-2,1}, w_{t-1,1}, w_{t+1,1} and w_{t+2,1} (i.e., the most probable words in the confusions C(t-2), C(t-1), C(t+1) and C(t+2), respectively) are predicted from w_{t,1} (i.e., the most probable word in C(t)) for a skip window of 2, as depicted in Figure 4.2.

4.4.2 Intra-Confusion Training

Next, we explore the direct adaptation of skip-gram modeling along the confusion dimension (i.e., considering word confusions as contexts) rather than the traditional sequential context. Figure 4.3 shows the training configuration over a confusion network. In short, every word is linked with every other alternative word along the confusion dimension (i.e., within a set of confusable words) through the desired network (as opposed to the temporal context dimension in word2vec training). Note that we disallow any word being predicted from itself (this constraint is indicated with the curved dotted lines in the figure).
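The linking just described can be sketched as a small pair generator (our illustration, not code from the dissertation). We represent the confusion network as a list of time bins, each holding (word, posterior) alternatives sorted by decreasing probability; each bin of D_i words then contributes D_i * (D_i - 1) ordered (current word, confusion-context word) pairs, self-pairs excluded:

```python
def intra_confusion_pairs(confusion_network):
    """Generate (current word, confusion-context word) training pairs
    within each confusion bin; self-mapping is disallowed (Figure 4.3)."""
    pairs = []
    for bin_ in confusion_network:       # one bin = one set of confusable words
        words = [w for w, _ in bin_]     # drop the posterior probabilities
        for j, current in enumerate(words):
            for i, context in enumerate(words):
                if i != j:               # disallow predicting a word from itself
                    pairs.append((current, context))
    return pairs

# A toy confusion network with illustrative posteriors, as in Figure 4.1.
net = [[("I", 0.7), ("eye", 0.3)],
       [("want", 0.5), ("what", 0.3), ("wand", 0.2)],
       [("to", 0.6), ("two", 0.4)]]
pairs = intra_confusion_pairs(net)
# Sample count is sum_i D_i * (D_i - 1) = 2 + 6 + 2 = 10
assert len(pairs) == sum(len(b) * (len(b) - 1) for b in net)
```

The count check at the end mirrors the sample-count expression derived next in the text.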
As depicted in Figure 4.3, the word w_{t,i} (confusion context) is predicted from w_{t,j} (current word), where i = 1, 2, 3, ..., length(C(t)) and j != i, for each j = 1, 2, 3, ..., length(C(t)), for confusion C(t), for all t. We expect such a model to capture inherent relations over the different word confusions. In the context of an ASR lattice, we expect it to capture intrinsic relations between similarly sounding (acoustically similar) words. However, the model would fail to capture any semantic and syntactic relations associated with the language. The embedding obtained from this configuration can be fused (concatenated) with the traditional skip-gram word2vec embedding to form a new space representing both of the independently trained subspaces. The number of training samples generated with this configuration is:

#Samples = \sum_{i=1}^{n} D_i (D_i - 1)    (4.1)

where n is the number of time steps and D_i is the number of confusions at the i-th time step.

Figure 4.4: Proposed inter-confusion training scheme for confusion networks (notation as in Figure 4.2).

4.4.3 Inter-Confusion Training

In this configuration, we propose to model both the linguistic contexts and the word confusion contexts simultaneously. Figure 4.4 illustrates the training configuration. Each word in the current confusion is predicted from each word of the succeeding and preceding confusions over a predefined local context.
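A sketch of the corresponding pair generation (our illustration; edge bins are simply truncated at the utterance boundaries, whereas the closed-form count below ignores that truncation):

```python
def inter_confusion_pairs(confusion_network, skip_window=2):
    """Generate (current word, context word) pairs across neighboring
    confusion bins within the skip window (Figure 4.4)."""
    n = len(confusion_network)
    pairs = []
    for t in range(n):
        lo, hi = max(0, t - skip_window), min(n, t + skip_window + 1)
        for c in range(lo, hi):
            if c == t:
                continue                 # neighboring bins only, not the current bin
            for current, _ in confusion_network[t]:
                for context, _ in confusion_network[c]:
                    pairs.append((current, context))
    return pairs

# Toy network with bin sizes D = [1, 2, 3] and illustrative posteriors.
net = [[("I", 1.0)],
       [("want", 0.5), ("what", 0.5)],
       [("to", 0.4), ("two", 0.3), ("tees", 0.3)]]
pairs = inter_confusion_pairs(net, skip_window=1)
# Count = 1*2 + (2*1 + 2*3) + 3*2 = 16
assert len(pairs) == 16
```

Each pair couples a word of the current bin with a word of a neighboring bin, so acoustic alternatives and sequential context are modeled jointly.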
To elaborate, the words w_{t-t',i} (context) are predicted from w_{t,j} (current word), for i = 1, 2, 3, ..., length(C(t-t')), j = 1, 2, 3, ..., length(C(t)), and t' in {-2, -1, 1, 2} for a skip window of 2, for the current confusion C(t), for all t, as per Figure 4.4. Since we assume the acoustic similarities of a word to be co-occurring, we expect to jointly model the co-occurrence of both the context and the confusions. This also has the additional benefit of generating more training samples than intra-confusion training. The number of training samples generated is given by:

#Samples = \sum_{i=1}^{n} \sum_{j=i-S_w, j \neq i}^{i+S_w} D_i D_j    (4.2)

where n is the total number of time steps, D_i is the number of word confusions at the i-th time step, and S_w is the skip-window size (i.e., we sample S_w words from the history and S_w words from the future context of the current word).

Figure 4.5: Proposed hybrid-confusion training scheme for confusion networks (notation as in Figure 4.2; the dotted curved lines denote that self-mapping is disallowed).

4.4.4 Hybrid Intra-Inter Confusion Training

Finally, we merge both the intra-confusion and inter-confusion training. This can be seen as a super-set of the word2vec, inter-confusion and intra-confusion training configurations. Figure 4.5 illustrates the training configuration.
The words w_{t-t',i} (context) are predicted from w_{t,j} (current word), for i = 1, 2, 3, ..., length(C(t-t')), j = 1, 2, 3, ..., length(C(t)), and t' in {-2, -1, 0, 1, 2}, such that if t' = 0 then i != j, for a skip window of 2, for the current confusion C(t), for all t, as depicted in Figure 4.5. We simply combine the training samples from the above two proposed techniques (i.e., the number of samples is the sum of equation 4.1 and equation 4.2).

Figure 4.6: Flowcharts for the proposed training schemes: (a) pre-training/initializing models, (b) concatenating models, (c) joint optimization using the unrestricted and fixed word2vec configurations.

4.5 Training Schemes

4.5.1 Model Initialization/Pre-training

Very often, better model initializations have been found to lead to better model convergence [46]. This is more significant in the case of under-represented words. Moreover, for training the word confusion mappings, it is beneficial to build upon the contextual word embeddings, since our final goal involves both contextual and confusion information. Hence, we experiment with initializing all our models with the original Google word2vec model (https://code.google.com/archive/p/word2vec/), trained on the Google News dataset with 100 billion words, as described by [115]. The pre-training rules are explained in the flowchart in Figure 4.6(a). For the words present in the Google word2vec vocabulary, we directly initialize the embeddings with word2vec. The embeddings for the rest of the words are randomly initialized following a uniform distribution.
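The initialization rule of Figure 4.6(a) can be sketched as below. This is our illustration: `pretrained` stands in for the Google News word2vec lookup (treated here as a plain dict), and the uniform range is a common word2vec-style choice, not a value specified in the text.

```python
import numpy as np

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Initialize in-vocabulary words from a pretrained word2vec table;
    the remaining words are drawn uniformly at random (Section 4.5.1)."""
    rng = np.random.default_rng(seed)
    emb = np.empty((len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            emb[i] = pretrained[word]                         # copy pretrained vector
        else:
            emb[i] = rng.uniform(-0.5 / dim, 0.5 / dim, dim)  # random uniform init
    return emb

pretrained = {"want": np.full(4, 0.25)}          # stand-in for Google word2vec
emb = init_embeddings(["want", "xzqy"], pretrained, dim=4)
assert np.allclose(emb[0], 0.25)                 # in-vocabulary word: copied
assert emb.shape == (2, 4)
```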
4.5.2 Model Concatenation

The hypothesis behind model concatenation is that the two subspaces, one representing the contextual subspace (word2vec) and the other capturing the confusion subspace, can be trained independently and concatenated to give a new vector space which manifests both types of information, and hence a potentially useful vector word representation. The flowchart for model concatenation is shown in Figure 4.6(b). Model concatenation can be mathematically represented as:

NEW_{n x (e_1 + e_2)} = [ W2V_{n x e_1} | C2V_{n x e_2} ]    (4.3)

where NEW is the new concatenated vector space of dimensions n x (e_1 + e_2), n is the vocabulary size, and e_1 and e_2 are the embedding sizes of the W2V and C2V subspaces respectively.

¹ https://code.google.com/archive/p/word2vec/

4.5.3 Joint Optimization

Further to the model concatenation scheme, one could fine-tune the new vector space representation to better optimize for the task criterion (fine-tuning involves re-training end-to-end with a relatively lower learning rate than usual). This can be viewed as relaxing the strict independence between the two subspaces imposed by model concatenation. The fine-tuning itself could follow any of the aforementioned proposed techniques. We specifically try two configurations of joint optimization:

4.5.3.1 Unrestricted

In this configuration, we optimize both subspaces, i.e., the contextual (word2vec) and the confusion subspaces. The hypothesis is that fine-tuning allows the two subspaces to interact to achieve the best possible representation. The flowchart for the unrestricted joint optimization is displayed in Figure 4.6(c).

4.5.3.2 Fixed Word2Vec

In this configuration, we fix the contextual (word2vec) subspace and fine-tune only the confusion subspace. Since word2vec already provides a robust contextual representation, any fine-tuning of the contextual space could possibly lead to a sub-optimal state.
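Concretely, the concatenation of Equation 4.3 and the fixed-word2vec update amount to stacking the two embedding matrices column-wise (assuming both share the same vocabulary ordering) and masking the gradient on the contextual columns. A minimal numpy sketch under our own naming (the actual system in this work is trained in TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)
n, e1, e2 = 4, 3, 2                 # vocabulary size and subspace sizes
w2v = rng.normal(size=(n, e1))      # contextual subspace (W2V)
c2v = rng.normal(size=(n, e2))      # confusion subspace (C2V)
new = np.concatenate([w2v, c2v], axis=1)   # Equation 4.3: n x (e1 + e2)

def fixed_w2v_step(emb, grad, lr, e1):
    """One SGD step that updates only the confusion dimensions,
    leaving the first e1 (word2vec) columns untouched."""
    mask = np.zeros(emb.shape[1])
    mask[e1:] = 1.0                 # gradient flows only into C2V columns
    return emb - lr * grad * mask

updated = fixed_w2v_step(new, np.full_like(new, 0.5), lr=0.1, e1=e1)
```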
Keeping the word2vec subspace fixed also allows the model to concentrate more specifically on the confusions, since the fixed subspace compensates for all the contextual mappings during training. This allows us to constrain the number of updatable parameters during joint optimization. It also allows for the possibility of directly using available word2vec models without modification. The flowchart for the fixed word2vec joint optimization is displayed in Figure 4.6(c).

4.6 Evaluation Methods

Prior literature suggests two prominent ways of evaluating vector space representations of words. One is based on the Semantic & Syntactic analogy task introduced by [113]. The other common approach has been to assess word similarities by computing the rank correlation (Spearman's correlation) on human-annotated word similarity databases [143] like WordSim-353 [49]. Although these two evaluations can judge the vector representations of words efficiently with respect to the semantics and syntax of a language, we need to devise an evaluation criterion for word confusions, specifically for our case scenario: the acoustic confusions of words. For this, we formulate evaluations for acoustic confusions parallel to the analogy task and the word similarity task.

Word Pair 1        Word Pair 2
i'd / eyed         phi / e
seedar / cedar     rued / rude
air / aire         spade / spayed
scent / cent       vile / vial
cirrus / cirrous   sold / soled
curser / cursor    pendant / pendent
sensor / censor    straight / strait

Table 4.1: A few examples from the Acoustic Analogy Task test set.

4.6.1 Analogy Tasks

4.6.1.1 Semantic & Syntactic Analogy Task

[113] introduced an analogy task for evaluating the vector space representation of words.
The task is based on the intuition that the word king is similar to man in the same sense that queen is similar to woman, and thus relies on answering questions about such analogies by performing algebraic operations on the word representations. For example, the analogy is correct if vector(queen) is most similar to vector(king) − vector(man) + vector(woman). The analogy question test set is designed to test both syntactic and semantic word relationships. It contains five types of semantic questions (8869 questions) and nine types of syntactic questions (10675 questions). Finally, the efficiency of the vector representation is measured using the accuracy achieved on the analogy test set. We employ this task for testing the Semantic & Syntactic relationships (the contextual axis in terms of Figure 4.1) inherent in the vector space.

4.6.1.2 Acoustic Analogy Task

The primary purpose of the acoustic analogy task is to independently gauge the acoustic similarity information captured by the embedding model, irrespective of the inherent semantic and syntactic linguistic information. Adopting a similar idea and extending it to the evaluation of word confusions, we formulate the acoustic confusion analogy task (the vertical context test in terms of Figure 4.1) as follows. For the similar-sounding word pairs see & sea and red & read, the word vector see is similar to sea in the same sense that the word vector red is similar to read. We set up an acoustic analogy question set over acoustically similar sounding words, more specifically homophones. Table 4.1 lists a few examples from our data set. A detailed description of the creation of the dataset is presented in Section 4.7.3.1.

4.6.1.3 Semantic & Syntactic - Acoustic Analogy Task

Further, rather than evaluating the Semantic & Syntactic tasks and the acoustic analogy tasks independently, we can test for both together.
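The analogy evaluation above amounts to a nearest-neighbor search around vector(b) − vector(a) + vector(c) under cosine similarity. A toy sketch with a hand-built two-dimensional space (all names and vectors are illustrative, not from the evaluation sets used in this work):

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?': return the word whose vector has
    the highest cosine similarity to emb[b] - emb[a] + emb[c],
    excluding the three query words themselves."""
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    sims = {
        w: float(np.dot(q, v / np.linalg.norm(v)))
        for w, v in emb.items() if w not in (a, b, c)
    }
    return max(sims, key=sims.get)

# Hand-built space: a 'royalty' offset [0, 2] separates king/man and
# queen/woman, so king - man + woman lands exactly on queen.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 2.0]),
    "queen": np.array([-1.0, 2.0]),
    "apple": np.array([0.5, -1.0]),
}
```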
Intuitively, the word vectors in each of the two subspaces should interact together. For the analogy see : saw :: take : took, the word see has a homophone alternative in sea; thus there is a possibility of the word see being confused with sea in the new vector space. An algebraic operation such as vector("see") − vector("saw") + vector("take") should therefore be similar to vector("took") as before, and moreover vector("sea") − vector("saw") + vector("take") should also be similar to vector("took"). This is because we expect vector("sea") to be similar to vector("see") under the acoustic subspace. We also take into account the more challenging possibility of more than one homophone word substitution. For example, vector("see") − vector("saw") + vector("allow") is similar to vector("allowed") and vector("aloud"), and so is vector("sea") − vector("saw") + vector("allow"). The hypothesis is that to arrive at such a representation the system must jointly model both the language semantic-syntactic relations and the acoustic word similarity relations between words. The task is designed to test the Semantic-Acoustic and the Syntactic-Acoustic relationships.
In other words, in terms of Figure 4.1, the task evaluates both the horizontal & vertical context together. A few examples of this task are listed in Table 4.2. Section 4.7.3.2 details the creation of the database.

Type of Relationship              Word Pair 1              Word Pair 2
Currency                          India Rupee              Korea One (Won)
                                  Canada Dollar            Denmark Krona (Krone)
                                  Japan Yen                Sweden Krone (Krona)
Family                            Buoy (Boy) Girl          Brother Sister
                                  Boy Girl                 King Quean (Queen)
                                  Boy Girl                 Sun (Son) Daughter
Adjective-to-Adverb               Calm Calmly              Sloe (Slow) Slowly
Opposite                          Aware Unaware            Possible Impassible (Impossible)
Comparative                       Bad Worse                High Hire (Higher)
Superlative                       Bad Worst                Grate (Great) Greatest
Present Participle                Dance Dancing            Rite (Write) Writing
Past Tense                        Dancing Danced           Flying Flu (Flew)
Plural                            Banana Bananas           Burred (Bird) Birds
Plural Verbs                      Decrease Decreases       Fined (Find) Finds
Multiple Homophone Substitutions  Wright (Write) Writes    Sea (See) Sees
                                  Rowed (Road) Roads       I (Eye) Ayes (Eyes)
                                  Si (See) Seize (Sees)    Right (Write) Writes

Table 4.2: A few examples from the Semantic & Syntactic - Acoustic Analogy Task test set. The words in parentheses are the original ones, as in the analogy test set of [113], which have been replaced by their homophone alternatives.

Word1        Word2        Acoustic Rating   WordSim353
I            Eye          1.0               -
Adolescence  Adolescents  0.9               -
Allusion     Illusion     0.83              -
Sewer        Sower        0.66              -
Fighting     Defeating    0.57              7.41
Day          Dawn         0.33              7.53
Weather      Forecast     0.0               8.34

Table 4.3: Examples of acoustic similarity ratings. Acoustic rating: 1.0 = identically sounding, 0.0 = highly acoustically dissimilar. WordSim353: 10.0 = high word similarity, 0.0 = low word similarity. Word pairs not present in WordSim-353 are denoted by '-'.
4.6.2 Similarity Ratings

4.6.2.1 Word Similarity Ratings

Along with the analogy task, the word similarity task [49] has been popular for evaluating the quality of word vector representations in the NLP community [127, 104, 76, 143]. In this work we employ the WordSim-353 dataset [49] for the word similarity task. The dataset consists of 353 word pairs with a diverse range of human-annotated scores relating to the similarity/dissimilarity of the two words. The rank-order correlation (Spearman correlation) between the human-annotated scores and the cosine similarity of the word vectors is computed. Higher correlation corresponds to better preservation of the word similarity order represented by the word vectors, and hence better quality of the embedding vector space.

4.6.2.2 Acoustic Similarity Ratings

Employing an idea analogous to the word similarity ratings and extending it to reflect the quality of word confusions, we formulate an acoustic word similarity task. The aim is to have word pairs scored as in the WordSim-353 database, but with the scores reflecting acoustic similarity. Table 4.3 lists a few randomly picked examples from our dataset. The dataset generation is described in Section 4.7.3.3.

4.7 Data & Experimental Setup

4.7.1 Database

We employ the Fisher English Training Part 1, Speech (LDC2004S13) and Fisher English Training Part 2, Speech (LDC2005S13) corpora [29] for training the ASR. The corpora consist of approximately 1915 hours of telephone conversational speech sampled at 8 kHz, with a total of 11972 speakers involved in the recordings. The speech corpora are split into three speaker-disjoint subsets for training, development and testing for ASR modeling purposes. A subset of the speech data containing approximately 1905 hours was segmented into 1871731 utterances to train the ASR.
Both the development set and the test set consist of 5000 utterances, worth 5 hours of speech data each. The transcripts contain approximately 20.8 million word tokens with 42150 unique entries.

4.7.2 Experimental Setup

4.7.2.1 Automatic Speech Recognition

The Kaldi toolkit is employed for training the ASR [130]. A hybrid DNN-HMM based acoustic model is trained on high-resolution (40-dimensional) Mel Frequency Cepstral Coefficients (MFCC) along with i-vector features, which provide speaker and channel information for robust modeling. The CMU pronunciation dictionary [175] is pruned to the corpora's vocabulary and used as the lexicon for the ASR. A trigram language model is trained on the transcripts of the training subset. The ASR system achieves word error rates (WER) of 16.57% on the development set and 18.12% on the test set. The decoded lattice is used to generate confusion networks based on the minimum Bayes risk criterion [182]. The ASR output transcriptions resulted in a vocabulary of 41274 unique word tokens.

4.7.2.2 Confusion2Vec

For training Confusion2Vec, the training subset of the Fisher corpora is used. The total number of tokens resulting from the multiple paths over the confusion networks is approximately 69.5 million words, i.e., an average of 3.34 alternative word confusions for each word in the confusion network. A minimum frequency threshold of 5 is set to prune rarely occurring tokens from the vocabulary, which reduced the vocabulary size from 41274 to 32848. Further, we also subsample the word tokens as suggested by [115], which was shown to be helpful.
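The frequent-word subsampling of [115] discards an occurrence of word w with probability 1 − sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold. A short sketch under our own naming (the threshold value here is illustrative):

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-3, seed=0):
    """Keep each occurrence of word w with probability
    min(1, sqrt(t / f(w))): very frequent words are aggressively
    thinned while rare words are always kept."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        if rng.random() < min(1.0, math.sqrt(t / f)):
            kept.append(w)
    return kept

# A dominant word is heavily thinned; the rare word always survives.
tokens = ["the"] * 1000 + ["lattice"]
kept = subsample(tokens)
```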
Both the frequency thresholding and the downsampling reduced the number of word tokens from 69.5 million to approximately 33.9 million.

Task                                       Total Samples   Retained
Semantic & Syntactic Analogy               19544           11409
Acoustic Analogy                           20000           2678
Semantic & Syntactic - Acoustic Analogy    7534            3860
WordSim-353                                353             330
Acoustic Confusion Ratings                 1372            943

Table 4.4: Statistics of the evaluation datasets.

Confusion2Vec and word2vec are trained using the TensorFlow toolkit [1]. The negative sampling objective is used for training, as suggested for better efficiency [115]. For the skip-gram training, a batch size of 256 was chosen and 64 negative samples were used for computing the negative sampling loss. The skip-window was set to 4 and the models were trained for a total of 15 epochs, since these settings provided optimal performance with traditional word2vec embeddings, evaluated on the word analogy task, for the size of our database. During fine-tuning, the models were trained with a reduced learning rate and with the other parameters unchanged. All the above parameters were fixed for consistent and fair comparison.

4.7.3 Creation of Evaluation Datasets

4.7.3.1 Acoustic Analogy Task

We collected a list of homophones in English², and created all possible combinations of pairs of acoustic confusion analogies. For homophone groups with more than 2 words, we list all possible confusion pairs. A few examples from the dataset are listed in Table 4.1. We emphasize that considering only homophones in the creation of the dataset makes for a strict and difficult task to solve, since the ASR lattice contains more relaxed word confusions.

4.7.3.2 Semantic & Syntactic - Acoustic Analogy Task

We construct an analogy question test set by substituting the words in the original analogy question test set from [113] with their respective homophones.
Considering all the 5 types of semantic questions and 9 types of syntactic questions, for any words in the analogies with homophone alternatives, we swap in the homophone. We prune all the original analogy questions having no words with homophone alternatives. For analogies having more than one word with homophone alternatives, we list all permutations. We found that the number of questions generated by the above method, being exhaustive, was large, and hence we randomly sample from the list to retain 948 semantic questions and 6586 syntactic questions. Table 4.2 lists a few examples with single and multiple homophone substitutions for the Semantic & Syntactic - Acoustic Analogy Task from our data set.

² http://homophonelist.com/homophones-list/ (Accessed: 2018-04-30)

4.7.3.3 Acoustic Similarity Task

To create a set of word pairs scored by their acoustic similarity, we add all the homophone word pairs with an acoustic similarity score of 1.0. To get a more diverse range of acoustic similarity scores, we also utilize all 353 word pairs from the WordSim-353 dataset and compute the normalized phone edit distance using the CMU Pronunciation Dictionary [175]. The normalized phone edit distance ranges between 0 and 1. An edit distance of 1 means the word pair has almost no overlap between the respective phonetic transcriptions and is thus completely acoustically dissimilar, and vice versa. We use 1 − (normalized phone edit distance) as the acoustic similarity score for the word pair. Thus a score of 1.0 signifies that the two words sound identical, whereas 0 refers to words sounding drastically dissimilar. If a word has more than one phonetic transcription (pronunciation alternatives), we use the minimum normalized edit distance. Table 4.3 shows a few randomly picked examples from the generated dataset.
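The acoustic similarity score above can be computed with a standard Levenshtein distance over phone sequences. In this sketch we normalize by the length of the longer transcription (one reasonable choice; the exact normalization is not specified above) and take the minimum over pronunciation alternatives. The toy pronunciations are CMUdict-style with stress markers dropped.

```python
def phone_edit_distance(p1, p2):
    """Levenshtein (insert/delete/substitute) distance between two
    phone sequences, via the classic dynamic-programming table."""
    m, n = len(p1), len(p2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def acoustic_similarity(prons1, prons2):
    """1 - minimum normalized phone edit distance over all
    pronunciation alternatives of the two words."""
    best = min(
        phone_edit_distance(a, b) / max(len(a), len(b))
        for a in prons1 for b in prons2
    )
    return 1.0 - best

see = [["S", "IY"]]
sea = [["S", "IY"]]
saw = [["S", "AO"]]
```

Homophones such as see/sea score 1.0, while see/saw, differing in one of two phones, score 0.5 under this normalization.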
Finally, for evaluation, the respective corpora are pruned to match the in-domain training dataset vocabulary. Table 4.4 lists the number of samples in each evaluation dataset before and after pruning.

4.7.4 Performance Evaluation Criterion

In the original work by [113], the efficiency of the vector representation is measured using the accuracy achieved on the analogy test set. In our case, however, note that the Semantic & Syntactic analogy task and the Semantic & Syntactic - Acoustic analogy task are mutually exclusive of each other. In other words, the model can get only one of the two analogies correct, meaning any increment on one task results in a decrement on the other. Moreover, while jointly modeling two orthogonal information streams, (i) contextual co-occurrences and (ii) acoustic word confusions, finding the single word vector nearest to the specific analogy is no longer an optimal evaluation strategy. This is because the word vector nearest to the analogy operation can lie along either the contextual axis or the confusion axis, i.e., each analogy could possibly have two correct answers. For example, the analogy write : wrote :: read : ? can be right when the nearest word vector is either read (contextual dimension) or red (confusion dimension).
To incorporate this, we provide the accuracy over the top-2 nearest vectors, i.e., we count an analogy question as correct if either of the top-2 nearest vectors satisfies the analogy. This also holds for the acoustic confusion analogy tasks, especially for relations involving triplet homophones. For example, the analogy write : right :: road : ? can be right when the nearest word vector is either rode or rowed (for the triplet homophones road/rode/rowed). Thus, we present evaluations comparing the top-1 (nearest vector) evaluation for the baseline word2vec models against the top-2 evaluation for the proposed Confusion2Vec models. To maintain consistency, we also provide the top-2 evaluations for the baseline word2vec models in the appendix.

                                   Analogy Tasks                           Similarity Tasks
Model                 S&S      Acoustic   S&S-Acoustic   Average    Word Sim.   Acoustic Sim.
Google Word2Vec       61.42%   0.9%       16.99%         26.44%     0.6893      -0.3489
Word2Vec GroundTruth  35.15%   0.3%       7.86%          14.44%     0.5794      -0.2444
Baseline Word2Vec     34.27%   0.7%       11.27%         15.41%     0.4992      0.1944
Intra-Confusion       22.03%   52.58%     14.61%         29.74%     0.105*      0.8138
Inter-Confusion       36.15%   60.57%     20.44%         39.05%     0.2937      0.8055
Hybrid Intra-Inter    30.53%   53.55%     29.35%         37.81%     0.0963*     0.7858

Table 4.5: Results for the different proposed models. For the analogy tasks, the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in Section 4.6.1). Detailed semantic and syntactic analogy accuracies, and the top-1 and top-2 evaluations for all models, are available in Appendix Table A.1. For the similarity tasks, all correlations (Spearman's) are statistically significant with p < 0.001 except the ones marked with asterisks; detailed p-values are presented in Appendix Table A.2. S&S: Semantic & Syntactic Analogy.
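The top-2 criterion above, with multiple acceptable answers per analogy, can be sketched as follows (illustrative names; the toy vectors are constructed so that two candidates sit near the analogy point, mimicking a contextual and an acoustic answer):

```python
import numpy as np

def analogy_topk_correct(emb, a, b, c, answers, k=2):
    """Count the analogy 'a : b :: c : ?' as correct if any acceptable
    answer (e.g. both 'rode' and 'rowed') appears among the k words
    whose vectors are nearest (cosine) to emb[b] - emb[a] + emb[c]."""
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    sims = {
        w: float(np.dot(q, v / np.linalg.norm(v)))
        for w, v in emb.items() if w not in (a, b, c)
    }
    topk = sorted(sims, key=sims.get, reverse=True)[:k]
    return any(ans in topk for ans in answers)

emb = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([2.0, 0.0]),
    "c": np.array([1.0, 1.0]),
    "contextual": np.array([2.0, 1.0]),   # exactly b - a + c
    "acoustic":   np.array([2.0, 0.9]),   # a close second candidate
    "far":        np.array([0.0, 1.0]),
}
```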
Moreover, since we have 3 different analogy tasks, we provide the average accuracy over the 3 tasks in order to allow an easy assessment of the performance of the various proposed models.

4.8 Results

Table 4.5 lists the results for the various models. We provide evaluations on three different analogy tasks and two similarity tasks, as discussed in Section 4.6. Further, more thorough results with the semantic and syntactic accuracy splits are provided in the appendix to give deeper insights.

4.8.1 Baseline Word2Vec Model

We consider 3 variations of the word2vec baseline model. First, we provide results with the Google word2vec model³, which is trained on orders of magnitude more data and thus serves as a high upper bound on the Semantic & Syntactic task. The Google word2vec model was pruned to match the vocabulary of our corpora to make the evaluations comparable. Second, we consider the word2vec model trained on the in-domain ground truth transcripts. Third, for a fairer comparison with the other proposed models, we provide evaluations of the word2vec model trained on the noisy ASR output transcripts. All three baseline models give good performance on the Semantic & Syntactic analogy task and the word similarity task, as expected. The Google model achieves an accuracy of 61.42% on the Semantic & Syntactic analogy task. We note that its syntactic accuracy (70.79%) is much higher than its semantic accuracy (28.98%) (see Appendix Table A.1). This could be due to our pruned evaluation test set (see Table 4.4). Both in-domain models improve on the semantic accuracy while losing on the syntactic accuracy relative to the Google model (see Appendix Table A.1).

³ https://code.google.com/archive/p/word2vec
The shortcomings of the in-domain models compared to the Google word2vec on the Semantic & Syntactic analogy task can be attributed to the amount of training data and its extensive vocabulary. The in-domain models are trained on 20.8 million words versus the 100 billion of the Google News dataset. Moreover, the vocabulary of the in-domain models is approximately 42,150 versus the 3 million of Google [115], which makes a direct comparison with the rest of the models unfair. Comparing the two in-domain models, we observe that the model trained on clean data performs better than the one trained on ASR transcripts, as expected. However, the performance difference is minimal, which is encouraging. We see that the noisy transcripts negatively affect the semantic accuracies while the syntactic accuracy remains identical, which makes sense. Further, evaluating on the acoustic analogy and Semantic & Syntactic - Acoustic analogy tasks, all three baseline models perform poorly. An unusual thing we note is that the Google W2V model performs comparatively better than the other baseline models on the Semantic & Syntactic - Acoustic analogy task. A deeper examination revealed that the model compensates well for homophone substitutions in Semantic & Syntactic analogies which have very similar spellings. This suggests that typographical errors present in the training data of the Google model result in a small peak in performance on the Semantic & Syntactic - Acoustic analogy task. On the similarity evaluations, all the baseline models perform well on the word similarity task, as expected. However, they exhibit poor results on the acoustic similarity task. One interesting observation is that the Google word2vec model and the in-domain word2vec model trained on clean transcripts show negative correlation, whereas the model trained on noisy transcripts shows a small positive correlation.
One possible reason for this is the influence of the ASR language model on the word confusions in the lattice: it enforces contextual constraints during ASR decoding and hence results in a positive correlation. Overall, the results indicate that the baseline models are largely incapable of capturing any relationships over the acoustic word confusions present in a confusion network or a lattice. In our specific case, the baseline models are poor at capturing relationships between acoustically similar words.

4.8.2 Intra-Confusion

With intra-confusion training, we expect the model to capture acoustically similar word relationships while completely ignoring any contextual relations. Hence, we expect the model to perform well on the acoustic analogy and acoustic similarity tasks and to perform poorly on the Semantic & Syntactic analogy and word similarity tasks. Table 4.5 lists the results obtained using intra-confusion training. The results are in line with our expectations. The model gives the worst results on the Semantic & Syntactic analogy task. However, we observe the syntactic analogy accuracy to be a fair amount higher than the semantic accuracy (see Appendix Table A.1). We think this is mainly because syntactically similar words appear along the word confusion dimension in the confusion networks, a result of the constraints enforced on the confusion network by the (ASR) language model, which is known to perform better on syntactic tasks [113]. The model also gives the highest correlation on the acoustic similarity task, while performing poorly on the word similarity task.

4.8.3 Inter-Confusion

With inter-confusion training, we hypothesized that the model is capable of jointly modeling both the contextual statistics and the word confusion statistics.
Hence, we expect the model to perform well on both the Semantic & Syntactic analogy and acoustic analogy tasks and, in doing so, to achieve better performance on the Semantic & Syntactic - Acoustic analogy task. We also expect the model to give high correlations on both the word similarity and acoustic similarity tasks. From Table 4.5, we observe that, as hypothesized, inter-confusion training shows improvements on the Semantic & Syntactic analogy task. Quite surprisingly, inter-confusion training shows better performance than intra-confusion training on the acoustic analogy task, hinting that a good contextual representation could mutually benefit the confusion representation. However, we do not observe any improvements on the Semantic & Syntactic - Acoustic analogy task. Evaluating on the similarity tasks, the results support the observations drawn from the analogy tasks, i.e., the model fares relatively well on both word similarity and acoustic similarity.

                                 Analogy Tasks                          Similarity Tasks
Model               S&S      Acoustic   S&S-Acoustic   Average   Word Sim.   Acoustic Sim.
Baseline Word2Vec   61.13%   0.9%       16.66%         26.23%    0.6036      -0.4327
Intra-Confusion     63.97%   16.92%     43.34%         41.41%    0.5228      0.62
Inter-Confusion     65.45%   27.33%     38.29%         43.69%    0.5798      0.5825
Hybrid Intra-Inter  65.19%   20.35%     42.18%         42.57%    0.5341      0.6237

Table 4.6: Results with pre-training/initialization. For the analogy tasks, the accuracies of the baseline word2vec model are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in Section 4.6.1). Detailed semantic and syntactic analogy accuracies, and the top-1 and top-2 evaluations for all models, are available in Appendix Table A.3. For the similarity tasks, all correlations (Spearman's) are statistically significant.
Detailed p-values for the correlations are presented in Appendix Table A.4. S&S: Semantic & Syntactic Analogy.

4.8.4 Hybrid Intra-Inter Confusion

The hybrid intra-inter confusion training shows comparable performance in jointly modeling both the Semantic & Syntactic and acoustic analogy tasks. One crucial observation is that it gives significantly better performance on the Semantic & Syntactic - Acoustic analogy task. This suggests that jointly modeling both the intra-confusion and inter-confusion word mappings is useful. However, it achieves these results by compromising on the semantic analogy accuracy (see Appendix Table A.1) and hence also negatively affects the word similarity task. The model achieves good correlation on the acoustic similarity task. Overall, our proposed Confusion2Vec models capture significantly more useful information than the baseline models, judging by the average accuracy over the analogy tasks. One observation across all the proposed models is that performance remains fairly poor on the Semantic & Syntactic - Acoustic analogy task, suggesting that this task is inherently hard to solve. We believe that to achieve better results on Semantic & Syntactic - Acoustic analogies, it is necessary to have robust performance on one of the constituent tasks (Semantic & Syntactic analogies or acoustic analogies) to begin with, i.e., better model initialization could help. Next, we experiment with model initializations/pre-training.

4.8.5 Model Initialization/Pre-training

Table 4.6 lists the results with model initialization/pre-training. The baseline model is initialized from the Google word2vec model.
The rest of the models are initialized from the baseline word2vec model (i.e., the baseline model initialized from the Google word2vec), since this enables full compatibility with the vocabulary. Since the Google word2vec model is 300-dimensional, all the pre-trained models (in Table 4.6) are forced to be 300-dimensional, as opposed to the 256 dimensions in Table 4.5. Pre-training the baseline model provides improvements, with Semantic & Syntactic analogy results close and comparable to those of the Google word2vec model. For the intra-confusion model, pre-training provides drastic improvements on the Semantic & Syntactic analogy task at the expense of the acoustic analogy task. Even though the accuracy on the acoustic analogy task decreases relative to training without pre-training, it remains significantly better than the baseline model. More importantly, the Semantic & Syntactic - Acoustic analogy accuracy doubles. The inter-confusion model does not compromise on the Semantic & Syntactic analogy task, giving performance comparable to the baseline model, and additionally does well on the acoustic and Semantic & Syntactic - Acoustic analogy tasks, as was the case without pre-training. For the hybrid intra-inter confusion model, similar trends are observed as without pre-training, but with considerable improvements in accuracy. Pre-training also helps boost the correlations on the word similarity task for all the models. Overall, we find pre-training to be extremely useful.

4.8.6 Model Concatenation

The first 4 rows of Table 4.7 show the results with model concatenation. We concatenate each of the three proposed models (from Table 4.5) with the pre-trained baseline word2vec. The resulting vector space is thus 556-dimensional (300 (pre-trained baseline word2vec) + 256 (proposed models from Table 4.5) = 556).
In our case, we believe the dimension expansion of the vector space is insignificant in terms of performance, considering the relatively low amount of training data compared to Google's word2vec model. To be completely fair in judgment, we create a new baseline model with a 556-dimensional embedding space for comparison. To train the new baseline model, the 556-dimensional embedding was initialized with the 300-dimensional Google word2vec embedding and the remaining dimensions as zeros (null space). Comparing the 556-dimensional baseline from Table 4.7 with the previous 300-dimensional baseline from Table 4.6, the results are almost identical, which confirms that the dimension expansion is insignificant with respect to performance. With model concatenation, we see slightly better results (average analogy accuracy) compared with the pre-trained models from Table 4.6, an absolute increase of up to approximately 5% among the best results. The correlations on the similarity tasks are similar and comparable to the earlier results with the pre-trained models.

                                                    --------- Analogy Tasks ----------   - Similarity Tasks -
 #   Model                                  Scheme  S&S     Acoustic  S&S-Ac.  Average   Word      Acoustic
 1   Baseline Word2Vec (556 dim.)           -       61.13%   0.93%    16.53%   26.2%     0.5973   -0.4341
     Model Concatenation
 2   Word2Vec (F) + Intra-Confusion (F)     -       67.03%  25.43%    40.36%   44.27%    0.5102    0.7231
 3   Word2Vec (F) + Inter-Confusion (F)     -       70.84%  35.25%    35.18%   47.09%    0.5609    0.6345
 4   Word2Vec (F) + Hybrid Intra-Inter (F)  -       68.08%  11.39%    41.3%    40.26%    0.4142    0.5285
     Fixed Word2Vec Joint Optimization
 5   Word2Vec (F) + Intra-Confusion (L)     inter   71.65%  20.54%    33.76%   41.98%    0.5676    0.4437
 6   Word2Vec (F) + Intra-Confusion (L)     intra   67.37%  28.64%    39.09%   45.03%    0.5211    0.6967
 7   Word2Vec (F) + Intra-Confusion (L)     hybrid  70.02%  25.84%    37.18%   44.35%    0.5384    0.6287
 8   Word2Vec (F) + Inter-Confusion (L)     inter   72.01%  35.25%    33.58%   46.95%    0.5266    0.5818
 9   Word2Vec (F) + Inter-Confusion (L)     intra   69.7%   39.32%    39.07%   49.36%    0.5156    0.7021
10   Word2Vec (F) + Inter-Confusion (L)     hybrid  72.38%  37.75%    37.95%   49.36%    0.5220    0.6674
11   Word2Vec (F) + Hybrid Intra-Inter (L)  inter   71.36%   8.55%    33.21%   37.71%    0.5587    0.302
12   Word2Vec (F) + Hybrid Intra-Inter (L)  intra   66.85%  13.33%    40.1%    40.09%    0.4996    0.5691
13   Word2Vec (F) + Hybrid Intra-Inter (L)  hybrid  68.32%  11.61%    38.19%   39.37%    0.5254    0.4945
     Unrestricted Joint Optimization
14   Word2Vec (L) + Intra-Confusion (L)     inter   62.12%  46.42%    36.4%    48.31%    0.5513    0.7926
15   Word2Vec (L) + Intra-Confusion (L)     intra   64.85%  40.55%    42.38%   49.26%    0.5033    0.7949
16   Word2Vec (L) + Intra-Confusion (L)     hybrid  31.65%  61.91%    23.55%   39.04%    0.1067    0.8309
17   Word2Vec (L) + Inter-Confusion (L)     inter   64.98%  52.99%    34.79%   50.92%    0.5763    0.7725
18   Word2Vec (L) + Inter-Confusion (L)     intra   65.88%  49.4%     41.51%   52.26%    0.5379    0.7717
19   Word2Vec (L) + Inter-Confusion (L)     hybrid  37.86%  67.21%    25.96%   43.68%    0.2295    0.8294
20   Word2Vec (L) + Hybrid Intra-Inter (L)  inter   65.54%  27.97%    36.87%   43.46%    0.5338    0.6953
21   Word2Vec (L) + Hybrid Intra-Inter (L)  intra   64.42%  20.05%    42.56%   42.34%    0.4920    0.6942
22   Word2Vec (L) + Hybrid Intra-Inter (L)  hybrid  65.79%  22.63%    41.3%    43.24%    0.4967    0.6986

Table 4.7: Model concatenation and joint optimization results. Acronyms: (F): fixed embedding, (L): embedding learned during joint training, S&S: Semantic & Syntactic Analogy. For the analogy tasks, the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in Section 4.6.1). Detailed semantic and syntactic analogy accuracies, and the top-1 and top-2 evaluations for all the models, are available in Appendix Table A.5. For the similarity tasks, all the correlations (Spearman's) are statistically significant with p < 0.001 except the ones marked with asterisks. Detailed p-values for the correlations are presented in Appendix Table A.6.

4.8.7 Joint Optimization

4.8.7.1 Fixed Word2Vec

Rows 5-13 of Table 4.7 display the results of joint optimization with concatenated, fixed Word2Vec embeddings and learnable confusion2vec embeddings. As hypothesized, with the fixed Word2Vec subspace the results indicate better accuracies on the Semantic&Syntactic analogy task. The improvements thereby also reflect on the overall average accuracy of the analogy tasks. This also confirms the need for joint optimization, which boosts the average accuracy by up to approximately 2% absolute over the unoptimized concatenated model.

4.8.7.2 Unrestricted Optimization

The last 9 rows of Table 4.7 display the results obtained by jointly optimizing the concatenated models without constraints. Both subspaces are fine-tuned to convergence with the various proposed training criteria. We consistently observe improvements with unrestricted optimization over the unoptimized model concatenations. In terms of average accuracy, we observe an increase of up to approximately 5% absolute over the unoptimized concatenated models.
Moreover, we obtain improvements over the fixed-Word2Vec joint-optimized models, of up to 2-3% (absolute) in average accuracies. The best overall model in terms of average accuracy is obtained by unrestricted joint optimization of the concatenated baseline word2vec and inter-confusion models, fine-tuned with the intra-confusion training scheme.

4.8.8 Results Summary

Firstly, comparing the different training schemes (see Table 6.1), inter-confusion training consistently gives the best Acoustic analogy accuracies, whereas the hybrid training scheme often gives the best Semantic&Syntactic-Acoustic analogy accuracies. As far as the Semantic&Syntactic analogy task is concerned, intra-confusion is often found to favor syntactic relations, while inter-confusion boosts the semantic relations and the hybrid scheme balances both (see Appendix Table A.1). Next, pre-training/initializing the model gives drastic improvements in the overall average accuracy on the analogy tasks. Concatenating the baseline word2vec with the confusion2vec model gives slightly better results. Further optimization and fine-tuning over the concatenated model gives by far the best results. Overall, the best results are obtained with unrestricted joint optimization of the baseline word2vec and inter-confusion models, i.e., fine-tuning using the intra-confusion training mode. In terms of average analogy accuracy, the confusion2vec model outperforms the baseline by an absolute 26.06%. The best performing confusion2vec model outperforms the word2vec model even on the Semantic&Syntactic analogy tasks (by a relative 7.8%). Moreover, even the comparison between the top-2 evaluations of both word2vec and confusion2vec suggests very similar performance on the Semantic&Syntactic analogy tasks (see Appendix Table A.5).
This confirms and emphasizes that confusion2vec does not compromise the information captured by word2vec, but succeeds in augmenting the space with word confusions. Another highlight observation is that modeling the word confusions boosts the semantic and syntactic scores on the Semantic&Syntactic analogy task (compared to word2vec), suggesting inherent information in word confusions which could be exploited for better semantic-syntactic word relation modeling.

4.9 Vector Space Analysis

In this section, we compare the vector space plots of the typical word2vec space and the proposed confusion2vec vector space for specifically chosen sets of words. We choose a subset of words representing three categories to reflect semantic relationships, syntactic relationships and acoustic relationships. The vector space representations of the words are then subjected to dimension reduction using principal component analysis (PCA) to obtain 2D vectors which are used for plotting.

4.9.1 Semantic Relationships

For analyzing the semantic relationships, we compile random word pairs (constrained by their availability in our training data) representing Country-City relationships. The 2D plot for the baseline pre-trained word2vec model is shown in Figure 4.9, and that for the proposed confusion2vec model, specifically an arbitrarily chosen one, the jointly optimized word2vec + intra-confusion model (corresponding to row 6 in Table 4.7), is displayed in Figure 4.10. The following observations can be made comparing the two PCA plots: Examining the baseline word2vec model, we find the cities clustered over the upper half of the plot (highlighted with blue hue in Figure 4.9) and the countries clustered together at the bottom half (highlighted with red hue in Figure 4.9).
Similar trends are observed with the proposed confusion2vec model, where the cities are clustered together over the right half of the plot (highlighted with blue hue in Figure 4.10) and the countries are grouped together towards the left half (highlighted with red hue in Figure 4.10). In the Word2Vec space, the vectors of Country-City word pairs are roughly parallel, pointing north-east (i.e., the vectors are approximately similar). Similar to the word2vec space, with Confusion2Vec we observe that the vectors of Country-City word pairs are fairly parallel and point to the east (i.e., the vectors are highly similar). These four observations indicate that Confusion2Vec preserves the semantic relationships between words (similar to the Word2Vec space).

4.9.2 Syntactic Relationships

To analyze the syntactic relationships, we create 30 pairs of words composed of Adjective-Adverb, Opposites, Comparative, Superlative, Present-Participle, Past-tense and Plural relations. The PCA 2D plots for the baseline pre-trained word2vec model and the proposed confusion2vec model are illustrated in Figure 4.11 and Figure 4.12 respectively. The following inferences can be made from the two plots: Inspecting the baseline word2vec model, we observe that the word pairs depicting syntactic relations often occur close by (highlighted with red ellipses in Figure 4.11). A few semantic relations are also apparent and are highlighted with blue ellipses in Figure 4.11; for example, animals are clustered together. Similarly, with the Confusion2Vec model, we observe syntactic clusters of words, highlighted with red ellipses in Figure 4.12. The semantic relations apparent in the case of word2vec are also evident with Confusion2Vec, highlighted with blue ellipses in Figure 4.12.
Additionally, with the Confusion2Vec model, we find clusters of acoustically similar words (with similar phonetic transcriptions). These are highlighted using a green ellipse in Figure 4.12. The above findings confirm that the confusion2vec models preserve the syntactic relationships similarly to word2vec models, supporting our hypothesis.

4.9.3 Acoustic Relationships

In order to analyze the relationships of similar sounding words in the word vector spaces under consideration, we compose 20 pairs of acoustically similar sounding words with similar phonetic transcriptions. The 2D plot obtained after PCA for the baseline word2vec model is shown in Figure 4.13 and that for the proposed confusion2vec model is shown in Figure 4.14. We make the following observations from the two figures: Observing the baseline word2vec model, no apparent trends are found between the acoustically similar words. For example, no trivial relationship is apparent from the plot in Figure 4.13 between the words no and know, or try and tri, etc.
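Table 4.8, which follows, reports cosine similarities between lattice-derived sentence representations. The underlying computation, a probability-weighted sum of word vectors over the slots of a confusion network (Figure 4.8, Section 4.10), can be sketched as below. The 8-dimensional random embeddings are toy stand-ins for the trained vectors, and the slot weights follow Example 1 as stated in the Table 4.8 caption.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 8
emb = {w: rng.standard_normal(DIM)
       for w in ["yes", "right", "write", "answer"]}

def lattice_feature(confusion_network):
    """Sum, over confusion slots, of probability-weighted word vectors
    (the computation illustrated in Figure 4.8)."""
    feat = np.zeros(DIM)
    for slot in confusion_network:          # slot: [(word, prob), ...]
        for word, prob in slot:
            feat += prob * emb[word]
    return feat

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Example 1: ground truth "yes right answer" vs. ASR confusion network
# "yes [right/write] answer" with weights 0.75 / 0.25 per the caption.
truth = lattice_feature([[("yes", 1.0)], [("right", 1.0)], [("answer", 1.0)]])
asr = lattice_feature([[("yes", 1.0)],
                       [("right", 0.75), ("write", 0.25)],
                       [("answer", 1.0)]])
sim = cosine(truth, asr)
assert -1.0 <= sim <= 1.0
```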
Example  Ground-truth       ASR output                     W2V Similarity  C2V Similarity
1.1      yes right answer   yes [right/write] answer       0.96190         0.96218
1.2      yes right answer   yes write answer               0.93122         0.93194
1.3      yes write answer   yes [right/write] answer       0.99538         0.99548
1.4      yes rite answer    yes [right/write] answer       0.84216         0.88206
1.5      yes rite answer    yes right answer               0.86003         0.87085
1.6      yes rite answer    yes write answer               0.82073         0.87034
2.1      she likes sea      [she/shea] likes [see/sea]     0.91086         0.92130
2.2      she likes sea      shea likes see                 0.73295         0.77137
2.3      shea likes see     [she/shea] likes [see/sea]     0.94807         0.95787
2.4      shea likes see     [she/shea] likes [see/rocket]  0.93560         0.93080
2.5      she likes sea      [she/shea] likes [see/rocket]  0.85853         0.85757

Table 4.8: Cosine similarity between the ASR ground-truth and the ASR output, in application to ASR error correction, for the baseline pre-trained word2vec and the proposed confusion2vec (jointly optimized intra-confusion + baseline word2vec) models. Examples 1.1-1.6 inherit the structure in Figure 4.7a, i.e., yes [right/write] answer assigns a weight of 1.0 to yes and answer, 0.75 to right and 0.25 to write. Similarly, Examples 2.1-2.5 inherit the structure in Figure 4.7b.

Figure 4.7: Confusion network examples. (a) Example 1: yes/1.0 -> [write/0.75, right/0.25] -> answer/1.0. (b) Example 2: [she/0.4, shea/0.6] -> likes/1.0 -> [sea/0.45, see/0.55].

However, inspecting the proposed confusion2vec model, there is an obvious trend apparent: the acoustically similar words are grouped together in pairs and occur at roughly similar distances. The word pairs are highlighted with blue ellipses in Figure 4.14. Additionally, in Figure 4.14, as highlighted with a green ellipse, we observe that the four words no, not, knot and know occur in close proximity.
The word pair no and not portrays a semantic/syntactic relation, whereas the pairs knot & not and no & know are acoustically related. The above findings suggest that the word2vec baseline model fails to capture any acoustic relationships, whereas the proposed confusion2vec successfully models the confusions present in the lattices, in our specific case the acoustic confusions from the ASR lattices.

4.10 Discussion

In this section, we demonstrate why the proposed embedding space is superior for modeling word lattices, with the support of toy examples. Let us consider a simple task of ASR error correction. As shown by [4, 124, 151], the information needed to correct the errors is often embedded in the lattices. The toy examples in Figures 4.7a & 4.7b depict real scenarios encountered in ASR. The lattice feature representation is a weighted vector sum of all the words in the confusion and its context present in the lattice (see Figure 4.8). We compare the proposed confusion2vec embeddings with the popular word2vec, using cosine similarity as the evaluation measure. Table 4.8 lists the evaluation for the following cases: (i) the ASR output is correct, (ii) the ASR output is wrong and the correct candidate is present in the lattice, (iii) the ASR output is wrong and the correct candidate is absent from the lattice, and (iv) the ASR output is wrong and no lattice is available. The following observations are drawn from the results: 1. Confusion2vec shows higher similarity with the correct answers when the ASR output is correct (see Table 4.8, examples 1.1, 2.1). 2. Confusion2vec exhibits higher similarity with the correct answers when the ASR output is wrong - meaning the representation is closer to the correct candidate and therefore more likely to correct the errors (see Table 4.8, examples 1.2, 2.2, 1.3, 2.3). 3.
Confusion2vec yields high similarity even when the correct word candidate is not present in the lattice - meaning confusion2vec leverages inherent word representation knowledge to aid the re-introduction of pruned or unseen words during error correction (see Table 4.8, examples 1.4, 1.5, 1.6). 4. Confusion2vec shows low similarity in the case of fake lattices with highly unlikely word alternatives (see Table 4.8, examples 2.4, 2.5). All the above observations support the proposed confusion2vec word representation and are in line with the expectations for the task of ASR error correction.

Figure 4.8: Computation of the lattice feature vector: for each confusion slot C(t-1), C(t), C(t+1), the word vectors w_{t,i} are weighted by their posteriors P(w_{t,i}) and summed into a single feature vector.

4.11 Potential Applications

In addition to the above discussed ASR error correction task, other potential applications include:

Machine Translation: In machine translation, word lattices are used to provide multiple sources for generating a single translation [144, 44]. Word lattices derived from reordered hypotheses [31, 120, 66], morphological transformations [43, 66], word segmentations [41] and paraphrases [125] are used to introduce ambiguity and alternatives for training machine translation systems [178, 42, 44]. Source language alternatives can also be exploited by introducing ambiguity derived from the combination of multiple machine translation systems [110, 139, 138]. In the case of machine translation, the word-confusion subspace is associated with morphological transformations, word segmentations, paraphrases, part-of-speech information, etc., or a combination of them. Although the word-confusion subspace is not orthogonal, the explicit modeling of such ambiguity relationships is beneficial.
NLP: Other NLP applications like paraphrase generation [132], word segmentation [87] and part-of-speech tagging [87] also operate on lattices. As discussed in Section 4.2.2, confusion2vec can exploit the ambiguity present in the lattices for the betterment of these tasks.

ASR: In ASR systems, word lattices and confusion networks are often re-scored using various algorithms to improve their performance by exploiting ambiguity [164, 105, 181, 102]. In the case of ASR, the word-confusion subspace is associated with the acoustic similarity of words, which is often orthogonal to the semantic-syntactic subspace as discussed in Section 4.2.1. Example 1, Example 2 and Example 3 are prime cases supporting the need for jointly modeling acoustic word confusions and the semantic-syntactic subspace.

Spoken Language Understanding: Similarly, as in the case of ASR, confusion2vec could exploit the inherent acoustic word-confusion information for keyword spotting [105], confidence score estimation [105, 147, 81, 78], domain adaptation [151], lattice compression [105], spoken content retrieval [24, 74], system combination [105, 72] and other spoken language understanding tasks [64, 170, 106] which operate on lattices.

Speech Translation: In speech translation systems, incorporating the word lattices and confusion networks (instead of the single top hypothesis) is beneficial for better integrating the speech recognition system with the machine translation system [12, 108, 109, 145]. Similarly, exploiting uncertainty information between the ASR - machine translation - speech synthesis systems is useful for speech-to-speech translation [92, 174].
Since speech translation involves the combination of ASR and machine translation systems, the word-confusion subspace is associated with a combination of acoustic word similarity (for ASR) and morphological-segmentation-paraphrase ambiguities (for machine translation).

"See son winter is here" -> "voir ls hiver est ici" (Example 4)
"Season winter is here" -> "saison hiver est ici" (Example 5)

Example 4 and Example 5 demonstrate a case of speech translation of identically sounding English phrases to French. The words See son and Season demonstrate ambiguity in terms of word segmentation, while the phrases See son and Season also exhibit ambiguity in terms of acoustic similarity. By modeling both word-segmentation and acoustic-confusion through word vector representations, confusion2vec can provide the crucial information that the French words voir and saison are confusable under a speech translation framework.

Optical Character Recognition: In optical character recognition (OCR) systems, the confusion axis is related to the pictorial structures of characters/words. For example, say the characters a and o are easily confusable, thus leading to similar character vectors in the embedding space. Word-level confusions would lead to words such as ward and word being similar under confusion2vec (word2vec would have the words word and ward fairly dissimilar). Having this crucial optical confusion information is useful during OCR decoding over sequences of words when used in conjunction with the linguistic contextual information.

Image/Video Scene Summarization: The task of scene summarization involves generating descriptive text summarizing the content in one or more images. Intuitively, the task would benefit from linguistic contextual knowledge during text generation.
However, with confusion2vec, one can model and expect to capture two additional information streams: (i) pictorial confusion of the image/object recognizer, and (ii) pictorial context, i.e., modeling objects occurring together (e.g., we can expect an oven to often appear near a stove or other kitchen appliances). The additional streams of valuable information embedded in the lattices can contribute to better decoding. In other words, for example, word2vec can exhibit high dissimilarity between the words lifebuoy and donuts; however, confusion2vec can capture their pictorial similarity in a better word space representation, thus aiding the end application of scene summarization.

4.12 Conclusion

In this work, we proposed a new word vector representation, motivated by human speech & perception and aspects of machine learning, for incorporating word confusions from lattice-like structures. The proposed confusion2vec model is meant to capture additional word-confusion information and improve upon the popular word2vec models without compromising the inherent information captured by the word2vec models. Although the word confusions can be domain/task specific, we present a case study on ASR lattices where the confusions are based on the acoustic similarity of words. Specifically, with respect to ASR related applications, the aim is to capture the contextual statistics, as with word2vec, and additionally also capture the acoustic word confusion statistics. Several training configurations are proposed for the confusion2vec model, each differing in the utilization of the embedded information present in the lattice or confusion network for modeling the word vector space. Further, techniques like pre-training/initialization, model concatenation and joint optimization are proposed and evaluated for the confusion2vec models.
Appropriate evaluation schemes are formulated for the domain specific application. The evaluation schemes are inspired by the popular analogy-based question test set and word similarity tasks. A new analogy task and word similarity tasks are designed for the acoustic confusion/similarity scenario. A detailed tabulation of results is presented for the confusion2vec model and compared to the baseline word2vec models. The results show that confusion2vec can augment additional task-specific word confusion information without compromising on the semantic and syntactic relationships captured by the word2vec models. Next, detailed analysis is conducted on the confusion2vec vector space through PCA-reduced 2-dimensional plots for three independent word relations: (i) semantic relations, (ii) syntactic relations, and (iii) acoustic relations. The analysis further supports our aforementioned experimental inferences. A few toy examples are presented for the task of ASR error correction to support the adequacy of confusion2vec over the word2vec word representations. The study validates, through various hypotheses and test results, the potential benefits of the confusion2vec model.

4.13 Future Work

In the future, we plan to improve the confusion2vec model by incorporating the sub-word and phonemic transcriptions of words during training. Sub-word and character transcription information is shown to improve word vector representations [17, 26]. We believe the sub-words and phoneme transcriptions of words are even more relevant to confusion2vec than characters. In addition to the improvements expected for the semantic and syntactic representations (word2vec), since the sub-words and phoneme transcriptions of acoustically similar words are similar, they should help in modeling the confusion2vec space to a much greater extent.
Apart from concentrating on improving the confusion2vec model, this work opens new opportunities for incorporating the confusion2vec embeddings into a whole range of full-fledged applications such as ASR error correction, speech translation tasks, machine translation, discriminative language models, optical character recognition, image/video scene summarization, etc.

Figure 4.9: 2D plot after PCA of the word vector representations of the baseline pre-trained word2vec model, demonstrating semantic relationships on randomly chosen pairs of countries and cities. The Country-City vectors are almost parallel/similar. Countries are clustered together in the bottom half (highlighted with red hue) and the cities in the upper half (highlighted with blue hue).

Figure 4.10: 2D plot after PCA of the word vector representations of the jointly optimized pre-trained word2vec + intra-confusion model, demonstrating semantic relationships on randomly chosen pairs of countries and cities. Observe that the semantic relationships are preserved as in the case of the word2vec model: the Country-City vectors are almost parallel/similar. Countries are clustered together in the left half (highlighted with red hue) and the cities in the right half (highlighted with blue hue).

Figure 4.11: 2D plot after PCA of the word vector representations of the baseline pre-trained word2vec model, demonstrating syntactic relationships on 30 randomly chosen pairs of Adjective-Adverb, Opposites, Comparative, Superlative, Present-Participle, Past-tense and Plural relations. Observe the clustering of syntactically related words (e.g., highlighted with red ellipses).
A few semantically related words are highlighted with blue ellipses (e.g., animals).

Figure 4.12: 2D plot after PCA of the word vector representations of the jointly optimized pre-trained word2vec + intra-confusion model, demonstrating syntactic relationships on 30 randomly chosen pairs of Adjective-Adverb, Opposites, Comparative, Superlative, Present-Participle, Past-tense and Plural relations. Syntactic clustering is preserved by Confusion2Vec similarly to Word2Vec. Red ellipses highlight a few examples of syntactically related words. Similar to Word2Vec, semantically related words (e.g., animals), highlighted with blue ellipses, are also clustered together. Additionally, Confusion2Vec clusters acoustically similar words together (indicated with a green ellipse).

Figure 4.13: 2D plot after PCA of the word vector representations of the baseline pre-trained word2vec model, demonstrating vector relationships on 20 randomly chosen pairs of acoustically similar sounding words. No apparent relations between acoustically similar words are evident (no obvious clustering).

Figure 4.14: 2D plot after PCA of the word vector representations of the jointly optimized pre-trained word2vec + intra-confusion model, demonstrating vector relationships on 20 randomly chosen pairs of acoustically similar sounding words. Confusion2Vec clusters acoustically similar words together (highlighted with blue ellipses). Additionally, inter-relations between syntactically related words and acoustically related words are also evident (highlighted with a green ellipse).
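The 2D projections in Figures 4.9-4.14 are obtained with standard PCA. A minimal numpy sketch of that reduction, with toy random vectors standing in for the trained embeddings:

```python
import numpy as np

def pca_2d(X):
    """Project word vectors X (n_words x dim) onto their first two
    principal components, as used for the 2D plots in this chapter."""
    Xc = X - X.mean(axis=0)                  # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                     # scores on PC1 and PC2

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 300))           # 10 toy 300-dim word vectors
Y = pca_2d(X)
assert Y.shape == (10, 2)
# PC scores are uncorrelated: the off-diagonal covariance is ~0.
assert abs(np.cov(Y.T)[0, 1]) < 1e-9
```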
Chapter 5: Spoken Language Intent Detection using Confusion2Vec

5.1 Introduction

In this chapter, we specifically target the task of spoken language intent detection on noisy ASR transcripts. In contrast to the majority of works, which mostly deal with innovation of the classification models [101, 100, 58, 97], in our study we concentrate on robust word feature representations. We propose to employ the confusion2vec [149] word vector representation to compensate for the errors made by an ASR and to provide enhanced and robust performance for the task of spoken language intent detection. Confusion2vec captures acoustic similarity information of words in addition to the semantic-syntactic relations, and is trained in a completely unsupervised manner on ASR lattices decoded on an out-of-domain corpus. Moreover, unlike the studies which adapt the ASR to the target datasets and tasks [146, 101], we treat the ASR as a generic independent module, but contribute towards bridging the gap between the ASR and the NLU model. We demonstrate with our experiments, on the benchmark ATIS dataset [70], the vital role of confusion2vec in the robustness of intent classification. The rest of the chapter is structured as follows. In Section 5.2, we present the proposed methodology and provide a brief description of the confusion2vec word embedding and the intent classification model. Section 5.3 describes the databases employed, our experimental setup and the baseline systems. In Section 5.4, we present and discuss the experimental results. Finally, Section 5.5 concludes the study and discusses future work.

Figure 5.1: 2D vector space illustration after PCA dimension reduction for (a) the word2vec space and (b) the confusion2vec space. The blue ellipses indicate syntactic word relations. The red ellipses indicate acoustic similarity relations.
The blue arrows illustrate the semantic relationships. The red arrows illustrate the interaction of acoustic similarity with semantic relationships. The word2vec space is rich in semantic and syntactic word relations, but no trivial acoustic similarity is evident. The confusion2vec space preserves the semantic and syntactic word relations, and moreover captures additional acoustic similarity information.

5.2 Proposed Technique

In this section, we first describe the confusion2vec word vector representation for the task of spoken language intent detection, and then introduce the recurrent neural network intent classification model.

5.2.1 Confusion2vec Word Embedding

The role of word vector representations is crucial for NLP [30]. Efficient and information-rich word embeddings like word2vec [115] and GloVe [127] are shown to capture the semantics and syntactics of language. Using such efficient word representations has proven to be beneficial in NLU tasks like named entity detection [153] and intent detection [67]. SLU tasks like intent detection [82, 103, 51, 9], slot-filling [50] and spoken dialogue systems [48] have also benefited from using information-rich word embeddings. However, they are less than optimal in the case of erroneous transcriptions, for example ASR transcriptions [146, 101], since the errors corrupt the semantic-syntactic space over the local context of occurrence and thereby introduce noise into the model. In this work, we propose to employ the recently proposed confusion2vec word vector representation [149] for the task of intent detection, to counter the errors present in the spoken transcriptions. Motivated by human speech production and perception, confusion2vec models the acoustic relations of words in addition to the semantic and syntactic relationships of words [149].
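As a toy illustration of how acoustic alternatives can enter training alongside context (our sketch only; the precise intra-/inter-confusion training schemes are defined in [149] and Chapter 4), skip-gram style (input, context) pairs can be drawn from a confusion network so that each word is paired with the alternatives in neighboring slots:

```python
# Each slot of the confusion network holds acoustically confusable
# alternatives; pairing a word with all alternatives in nearby slots
# exposes both co-occurrence and acoustic-confusion statistics.
confusion_net = [["she", "shea"], ["likes"], ["see", "sea"]]

def confusion_pairs(net, window=1):
    """Draw (input, context) pairs across slots within the given window."""
    pairs = []
    for t, slot in enumerate(net):
        lo, hi = max(0, t - window), min(len(net), t + window + 1)
        for w in slot:
            for c in range(lo, hi):
                if c == t:
                    continue              # contexts come from other slots
                pairs.extend((w, ctx) for ctx in net[c])
    return pairs

pairs = confusion_pairs(confusion_net)
assert ("she", "likes") in pairs          # contextual pair
assert ("likes", "sea") in pairs          # pairs reach into confused slots
```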
Confusion2vec uses unsupervised training techniques similar to the skip-gram of word2vec, but operates on lattice-like structures or confusion networks output by the ASR. Since the confusion networks of a typical ASR exhibit confusions between words along two principal axes, (i) contextual and (ii) acoustic similarity, confusion2vec is devised to operate on both axes, thereby modeling local context information (like word2vec) as well as acoustic similarity information. Figure 5.1 illustrates the 2-dimensional word vector spaces for word2vec and confusion2vec after dimension reduction using principal component analysis (PCA). From the figure (and from the extensive analysis done in [149]), it is evident that the confusion2vec space captures acoustic similarity between words without compromising the information captured by word2vec. Complex, meaningful and useful interactions between the acoustic subspace and the semantic-syntactic subspace are also observed. For more information, we point the interested reader to [149], which describes and analyzes the confusion2vec embedding in detail. In application to the spoken language intent detection task, the nature of ASR errors is often acoustically related. Confusion2vec incorporates real, unsupervised ASR output as its training corpus; thus the feature representation places confusions (errors) nearby in its embedding space. In other words, we hypothesize that the embedded acoustic similarity information in confusion2vec limits the impact of errors made by the ASR, and thus allows subsequent NLP tasks to be minimally affected. We expect the following with respect to the intent detection task: We expect our model to be less affected by ASR errors and thus achieve better performance in the case of noisy ASR transcriptions.
- We expect our model to be at least on par with word2vec under clean conditions.

5.2.2 Intent Classification Model

Since the contribution of this work is towards word feature representations, we employ a fairly simple recurrent neural network model for the classification task. However, we believe the contributions on feature representations are orthogonal to the classification model, and thus expect even better performance with more complex models like those in [101, 100, 58, 97]. In this work, we use Bi-directional Long Short-Term Memory (LSTM) units, as shown in Figure 5.2.

[Figure 5.2: Intent Classification RNN Model — a confusion2vec embedding layer feeds a bi-directional LSTM over the input words (e.g., "find flight from phoenix to ..."); the concatenated final states pass through a linear layer and a softmax to predict the intent label.]

Given an input utterance w_0, w_1, ..., w_T, each word in the input sequence is mapped to its word vector representation x_0, x_1, ..., x_T by embedding lookup. We formulate the model outputs as

\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}_{t-1}, x_t; \overrightarrow{\theta}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}_{t+1}, x_t; \overleftarrow{\theta})    (5.1)

\hat{P}_{\mathrm{intent}} = \mathrm{Softmax}(W[\overrightarrow{h}_T; \overleftarrow{h}_0] + b)    (5.2)

where h_t is the LSTM output of each direction at each time step t, and θ is the parameter set of the LSTM. We feed the concatenation of the two directional LSTM outputs at the last time step into the linear output layer (with weights W and bias b), which projects it into the intent label space. Finally, the intent label is predicted from the softmax-normalized probability distribution over all intent classes.

5.3 Database & Experimental Setup

5.3.1 Database

The ASR is trained on the Fisher English Training (LDC2004S13 and LDC2005S13) speech corpora [29]. The confusion2vec is trained on the output of the ASR, i.e., the confusion networks generated from the Fisher English corpora. The database setup for the ASR and confusion2vec is identical and explained in detail in [149].
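As a minimal sketch of the classification head defined in Eqs. (5.1)–(5.2), the following plain-Python fragment concatenates the two directional LSTM states and applies the linear-plus-softmax projection. The state values and the weights W and b are made-up illustrative numbers, not the learned parameters of the actual model.

```python
import math

def softmax(logits):
    # Numerically stable softmax over the intent logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def intent_distribution(h_fwd_last, h_bwd_first, W, b):
    """Eq. (5.2): concatenate [h_T(forward); h_0(backward)], apply the linear
    layer with weights W and bias b, then softmax over intent classes."""
    h = list(h_fwd_last) + list(h_bwd_first)
    logits = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

# Hypothetical 2-dimensional states per direction and 3 intent classes:
p = intent_distribution([0.2, -0.1], [0.4, 0.3],
                        W=[[0.5, 0.1, -0.2, 0.3],
                           [-0.1, 0.4, 0.2, 0.0],
                           [0.2, -0.3, 0.1, 0.5]],
                        b=[0.0, 0.1, -0.1])
```

The returned vector is a valid probability distribution over the intent labels, from which the argmax gives the predicted intent.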
We trained the intent detection model on the ATIS (Airline Travel Information Systems) dataset [70], which comes with audio recordings and corresponding manual transcripts of humans asking for flight information. Following [65, 58], we apply the same train, development and test split setup. The setup contains 4478, 500 and 893 intent-labeled reference utterances in the train, development and test sets respectively. In order to evaluate our model's robustness to ASR outputs, we also construct an ASR output set by decoding the corresponding audio recording for each of the data splits using the ASR. In cases where an utterance is labeled with multiple intent labels, the top intent was selected as the true label, yielding 18 intents in total.

5.3.2 Experimental Setup

The training setup for the ASR and the confusion2vec is identical to our previous work [149]. For decoding the ATIS dataset through our ASR, the audio samples were down-sampled from 16 kHz to 8 kHz. The ASR achieves a WER of 18.54% on the ATIS test set. We choose the confusion2vec model yielding the best performance in [149], i.e., independently trained C2V-1 and C2V-c models are concatenated and jointly optimized with the intra-confusion2vec scheme (556 dimensions).

For intent detection, we train models on the 4478 utterances in the training set, and tune hyper-parameters based on the classification accuracy on the 500 reference utterances in the development set. The model with the best performance on the development set is chosen and evaluated on both the reference test set and the ASR test set. The hyper-parameter space we experimented with is as follows: the batch size is set to 1, i.e., each sentence is viewed as an independent sample; the hidden dimension of the LSTM unit is tuned over {256, 128, 64, 32}, and dropout is tuned over {0.1, 0.2, 0.25}. We select the Adam optimizer, with the learning rate set to be among {0.001, 0.0005}.
The maximum number of epochs is set to 50 with an early-stopping strategy.

5.3.3 Baseline Systems

The first set of baselines compares different conventional word embeddings. They include: (i) random initialization (556 dimensions) sampled from a uniform distribution, (ii) vanilla GloVe¹ (300 dimensions) as in [127], and (iii) skip-gram word2vec² (556 dimensions) fine-tuned on the Fisher English corpus reference transcripts (for fair comparison with confusion2vec). We also tried the vanilla Google word2vec². However, its performance was found to be consistently lower than the fine-tuned version; thus, we do not include it in the comparisons. Note that only the randomly initialized word embedding is trainable, while all other embeddings are fixed throughout training. All the above baselines use the identical RNN architecture for intent classification described in Section 5.2.2.

The second set of baselines compares our proposed model with recent state-of-the-art models, including: (i) a joint intent detection, slot filling & LM model [101], (ii) an attention-based joint model that incorporates alignment information provided by the slot filling task [100], and (iii) an intent-augmented gating mechanism based model which further incorporates character-level embedding along with word-level embedding [97]. The baselines are trained under our experiment setup using the same hyper-parameters reported in their original papers. We also reproduce the result of each model under their original experiment settings³, and report the obtained scores in parentheses for reference.

¹ https://nlp.stanford.edu/projects/glove/
² https://code.google.com/archive/p/word2vec
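For concreteness, the hyper-parameter grid described in Section 5.3.2 can be enumerated exhaustively as below. This is an illustrative sketch only; the variable names are ours, and the thesis selects the winning configuration by development-set classification accuracy.

```python
from itertools import product

# Grid from Section 5.3.2: LSTM hidden size, dropout, and learning rate.
hidden_sizes = [256, 128, 64, 32]
dropouts = [0.1, 0.2, 0.25]
learning_rates = [0.001, 0.0005]

grid = [{"hidden": h, "dropout": d, "lr": lr}
        for h, d, lr in product(hidden_sizes, dropouts, learning_rates)]
# 4 * 3 * 2 = 24 candidate configurations; each is trained with batch size 1
# for up to 50 epochs with early stopping, and the best dev accuracy wins.
```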
For the adapted model trained on ASR, we consider as the baseline a joint slot filling and intent detection model which performs sentence reconstruction from ASR hypotheses [146], and report the scores on ASR outputs and the corresponding ASR WER from the original paper.

5.4 Results and Discussion

5.4.1 Training on Reference Clean Transcripts

Table 5.1 and Figure 5.3 illustrate the results obtained by training on clean transcripts. First, we compare the results between the different word feature embeddings (refer to the upper half of Table 5.1). On clean reference transcripts, GloVe embeddings provide the best performance (as found in [51]). Both word2vec and random initialization provide identical results. The proposed confusion2vec gives considerably lower CER compared to word2vec and random initialization. Although GloVe outperforms confusion2vec, we believe the comparison of confusion2vec with word2vec is fairer, since both use skip-gram modeling. With the proposed confusion2vec system, we do not expect improvements on clean reference transcripts, since acoustic similarity/confusion is less relevant there. As expected, we observe no degradation in performance with confusion2vec, and it is on par with the popular, leading word vector representations for the task of intent detection on clean transcriptions.

On noisy ASR transcripts, we see an increase in CER with all models. Although random initialization performed identically to word2vec on clean transcriptions, we see word2vec performs relatively better on ASR transcriptions. This observation confirms that better word feature representations exhibit higher robustness to errors.
Table 5.1: Results with Training on Reference: Classification Error Rates (CER) for Reference and ASR Transcripts. diff is the absolute degradation of the model from clean to ASR. The numbers inside parentheses indicate the CER obtained by reproducing the result of each model under their original experiment settings³.

Model                   Reference      ASR            diff
Random                  2.69           10.75          8.06
GloVe [127]             1.90           8.17           6.27
Word2Vec [115]          2.69           8.06           6.16
C2V (proposed)          2.46           6.38           3.92
Liu and Lane [101]      1.90 (1.57)    9.41 (8.29)⁴   7.51 (6.72)
Liu and Lane [100]      1.79 (1.90)    8.06 (8.40)    6.27 (6.50)
Li et al. [97]          2.02 (1.34)    9.18 (9.07)    7.16 (7.73)

A similar trend is apparent with GloVe embeddings in comparison with random initialization, although we observe slightly higher CER and degradation (between clean and noisy transcripts) compared to word2vec. The proposed confusion2vec gives the lowest CER among all the models (a relative improvement of 20.84% over word2vec, 21.9% over GloVe and 40.65% over random initialization). Moreover, C2V displays higher robustness going from clean to noisy ASR transcriptions (the degradation, diff, is minimal). A relative improvement in robustness of 37.48%, 36.36% and 51.36% compared to GloVe, word2vec and random initialization respectively is observed with C2V (in terms of diff). The confusion2vec word feature representation is able to use the embedded acoustic similarity information to recover from errors resulting from acoustically confusable words in the ASR output transcriptions.

Further, we compare our proposed system with the recent state-of-the-art works on SLU (see the bottom part of Table 5.1). Note that the recent works employ much more complex modeling techniques compared to ours. Thus, as expected, the recent works outperform our simple RNN architecture when testing on clean transcriptions.
However, on noisy ASR transcriptions, even with a much simpler model, the proposed confusion2vec achieves significantly lower CER (a relative improvement of at least 20.84%) compared to the state-of-the-art models. Moreover, again, the degradation with confusion2vec is the least among all the models: a relative 37.48% lower degradation compared to the recent works. The results highlight the potent robustness of the confusion2vec word feature representation. In addition, we believe that the gains from complex classification modeling are orthogonal to the gains from the confusion2vec word feature representations, and thus incorporating more complex models with confusion2vec should result in additional gains.

³ The original settings of [101, 100, 97] make use of train + dev data for training. They pre-process the data by substituting digits with a token.

Table 5.2: Results with Training and Testing on ASR transcripts.

Model                               WER %    CER %
Random                              18.54    5.15
GloVe [127]                         18.54    6.94
Word2Vec [115]                      18.54    5.49
C2V (proposed)                      18.54    4.70
Schumann and Angkititrakul [146]    10.55    5.04⁴

5.4.2 Training on ASR

Further, we also perform additional experiments by training the intent classification models on noisy ASR transcripts. A more robust feature representation should theoretically help in reducing the noise in the model, translating to better performance. From Table 5.2, it is evident that all the models improve under the matched noisy train and test conditions. The proposed confusion2vec based model gives the lowest CER among all the models.

[Figure 5.3: Comparison of CER for different systems — classification error rates of Random Init., GloVe, Word2Vec, Liu and Lane 2016 [101], Liu and Lane 2016 [100], Li et al. 2018 [97], C2V (proposed) and Schumann and Angkititrakul 2018 [146] under the conditions Train: Reference / Test: Reference, Train: Reference / Test: ASR, Train: Reference / Degradation, and Train: ASR / Test: ASR.]

The confusion2vec feature representation is better able to explain the (acoustic) errors, and in doing so reduces confusion and noise in the intent classification model, thereby resulting in better and more robust performance. Moreover, comparing it with the recent study by Schumann and Angkititrakul [146], although the results are not directly comparable due to differences in the WERs of the ASRs, our proposed method achieves a lower CER in spite of much worse WER conditions⁴. This suggests that explicitly modeling in-domain ASR errors as in [146] has a smaller effect than modeling the general acoustic signatures between words in a language, as is the case with confusion2vec.

⁴ We do not domain-constrain, optimize or rescore our ASR, as in [146, 101]. We treat the ASR as an independent module for fair comparison with the other models and for domain-generalization and portability of our system and conclusions.

Finally, comparing the results from Table 5.1 and Table 5.2, it is encouraging to see that the proposed confusion2vec model trained on clean transcripts is able to inherit enough robustness to achieve a lower CER (than GloVe) and a CER comparable to models (word2vec) trained on ASR output, possibly reducing the need for adaptation on ASR and allowing for more generalizable systems.

5.5 Conclusion and Future Work

In this paper, we proposed an intent detection model based on the confusion2vec word vector representation, targeting noisy ASR transcriptions. The proposed word embeddings significantly outperform the popular leading word vector representations like word2vec and GloVe in the case of noisy ASR output.
Comparisons are made with various recent state-of-the-art studies, and we find the proposed method improves over them by a considerable margin despite using relatively simple RNN architectures for classification. The robustness of confusion2vec also extends to models trained on noisy ASR, achieving the lowest CER among the conventional word embeddings as well as the recent studies. The encouraging results suggest that the robustness of confusion2vec to errors eliminates the need for adapting the intent classification models on noisy ASR outputs.

In the future, we plan to apply and evaluate the proposed confusion2vec on additional SLU tasks like slot-filling, domain classification and named entity recognition. We believe the proposed model should provide similar advantages, especially under noisy conditions. Addressing multiple SLU tasks also allows us to use more complex joint-modeling systems with confusion2vec. The better, more complex models should provide improvements orthogonal to the confusion2vec feature representations, and we thus expect to see further improvements. We also plan to conduct more in-depth analysis of how the signal conditions and ASR performance affect each model; we expect confusion2vec to provide more gains as the ASR performance deteriorates. Representation of multiple path outputs from the ASR with confusion2vec, instead of only the best path, is also a possible future direction.

Chapter 6

Confusion2Vec 2.0: Enriching Ambiguity Representations with Subwords

6.1 Introduction

Decoding human language is a core component of spoken language understanding. Although it comes very naturally to humans, it is a challenging task for machines. Human language is a complex construct involving multiple dimensions of information, including semantics and syntax, and often contains ambiguities which make it difficult for machine inference.
Several word vector representations have been proposed in the natural language processing community for effectively describing human language. Neural networks have proven to be an effective tool for the estimation of such encodings. Contextual modeling techniques like language modeling, i.e., predicting the next word in the sentence given a window of preceding context, have been shown to yield meaningful word representations [11, 112]. Bag-of-words based contextual modeling, where the current word is predicted given both its left and right (local) contexts, has been shown to capture language semantics and syntax [113]. Similarly, predicting the local context from the current word, referred to as skip-gram modeling, is shown to better represent semantic and syntactic distances between words [115]. In [127], log bi-linear models combining global word co-occurrence information and local context information, termed global vectors (GloVe), are shown to produce a meaningful structured vector space. Bi-directional language models are proposed in [128], where the internal states of deep neural networks are combined to model complex characteristics of word use and its variance over linguistic contexts. The advantages of bi-directional modeling are further exploited along with self-attention using transformer networks [173] to estimate a representation, termed BERT (Bidirectional Encoder Representations from Transformers), that has proven its efficacy on a multitude of natural language understanding tasks [38]. Models such as BERT and ELMo estimate word representations that vary depending on the context, whereas context-free representations including GloVe and word2vec generate a single representation irrespective of the context.
However, most word vector representations infer knowledge through contextual modeling, and many of the ambiguities present in human language often go unrecognized or ignored. For instance, from the perspective of spoken language, ambiguities can be associated with how similar words sound; for example, the words see and sea sound acoustically identical but have different meanings. Ambiguities can also be associated with the underlying speech signal itself, due to the wide range of acoustic environments involving noise, overlapped speech, and channel and room characteristics. These ambiguities often project themselves as errors through ASR systems. Most of the existing word vector representations, such as word2vec [115, 113], fastText [17], GloVe [127], BERT [38] and ELMo [128], do not account for the ambiguities present in speech signals and thus degrade while processing on top of ASR transcripts.

Confusion2vec was recently proposed to handle the ambiguity information present in human language from the aspects of human speech production and perception [148]. In application to the domain of speech and acoustics, confusion2vec is estimated by unsupervised skip-gram training on ASR output lattices and confusion networks. Analysis of the inherent acoustic ambiguity information of the embeddings displayed meaningful interactions between the semantic-syntactic subspace and the acoustic similarity subspaces. In [152], the efficacy of confusion2vec was confirmed on the task of spoken language intent detection. Confusion2vec significantly outperformed typical word embeddings, including word2vec and GloVe, when evaluated on top of ASR transcripts, reducing the classification error rate by approximately 20% relative.
Although there have been a few attempts at leveraging the information present in word lattices and word confusion networks for several tasks [166, 91, 169, 179, 157, 75], the main downside of these works is that the word representations estimated by such techniques are task dependent and are restricted to a particular domain and dataset. Moreover, the availability of most task-specific datasets is limited, and task-specific speech data is expensive to collect. The advantage of Confusion2Vec is that it estimates task-independent word vector representations by unsupervised learning on lattices or confusion networks generated by an ASR on random speech conversations.

In this chapter, we incorporate subwords to represent each word for modeling both the acoustic ambiguity information and the contextual information. Each word is modeled as a sum of constituent n-gram characters. Our motivations behind the use of subwords are the following: (i) subwords incorporate morphological information of the words by encoding the internal structure of words [17], (ii) the bags of character n-grams often have a high overlap between acoustically ambiguous words, (iii) subwords help model under-represented words more efficiently, thereby leading to more robust estimation with limited available data, which is the case since training Confusion2Vec is restricted to ASR lattice outputs, and (iv) subwords enable representations for out-of-vocabulary words, which are commonplace with end-to-end ASR systems outputting characters.

[Figure 6.1: Example confusion network output by an ASR. Each time step t of the network carries a set of acoustically ambiguous word alternatives w_{t,1}, w_{t,2}, ....]

The rest of the chapter is organized as follows. Confusion2vec is reviewed in Section 6.2. The proposed subword modeling is presented in Section 6.3.
Section 6.4 gives details of the datasets employed, the experimentation setup and the evaluation methodology. The results are presented in Section 6.5. Section 6.6 demonstrates the efficacy of the proposed word embedding model for the application of the spoken language intent detection task. Finally, the chapter is concluded in Section 6.7 and future work is discussed in Section 6.8.

6.2 Confusion2Vec

In the field of psycho-acoustics, it is established that humans also relate words by how they sound [6], in addition to semantics and syntax. Inspired by psycho-acoustics and human speech production and perception, we previously proposed confusion2vec [148]. The core idea is to estimate a hyper-space that not only captures the semantics and syntax of human language, but also augments the vector space with acoustic ambiguity information, i.e., word acoustic similarity information. In other words, word2vec and GloVe can be viewed as a subspace of the confusion2vec vector space.

Several different methodologies are proposed for capturing the ambiguity information. The methodologies are an adaptation of skip-gram modeling to word confusion networks or lattice-like structures. Word lattices are directed acyclic weighted graphs of all the word sequences that are likely possible. A confusion network is a specific type of lattice with the constraint that each word sequence passes through each node of the graph. Such lattice-like structures can be derived from machine learning algorithms that output probability measures, for example, an ASR. Figure 6.1 illustrates a confusion network that can possibly result from a speech recognition system. Unlike typical sentences, which are used for training word embeddings like word2vec, GloVe, BERT, ELMo, etc., the information in the confusion network can be viewed along two dimensions: (i) a contextual dimension, and (ii) an acoustic ambiguity dimension.
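As a concrete illustration (not from the original text), a confusion network like the one in Figure 6.1 can be stored as a list of slots, each holding (word, posterior) alternatives: reading across the list follows the contextual dimension, while reading within a slot follows the acoustic ambiguity dimension. The words below follow the running example of Figure 6.1; the posterior values are invented for the sketch.

```python
# Toy confusion network: one "slot" per time step, each slot holding
# acoustically ambiguous alternatives with hypothetical posteriors.
confusion_network = [
    [("i", 0.7), ("eye", 0.3)],                                     # slot t-1
    [("want", 0.5), ("wand", 0.2), ("won't", 0.2), ("what", 0.1)],  # slot t
    [("two", 0.4), ("to", 0.4), ("tees", 0.2)],                     # slot t+1
]

def top_confusion_path(net):
    """Contextual dimension only: keep the most probable word per slot."""
    return [max(slot, key=lambda wp: wp[1])[0] for slot in net]

def ambiguity_sets(net):
    """Acoustic ambiguity dimension: the competing words at each slot."""
    return [{w for w, _ in slot} for slot in net]
```

Here `top_confusion_path` recovers a single best-path sentence, while `ambiguity_sets` exposes the per-slot confusions that the confusion2vec training schemes exploit.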
More specifically, four configurations of skip-gram modeling algorithms were proposed in our recent work [148], namely: (i) top-confusion, (ii) intra-confusion, (iii) inter-confusion, and (iv) the hybrid model. The top-confusion version considers only the most probable path of the ASR confusion network and applies the typical skip-gram model to it. The intra-confusion version applies skip-gram modeling to the acoustic ambiguity dimension of the confusion network and ignores the contextual information, i.e., each ambiguous word alternative is predicted by the others over a pre-defined local context. The inter-confusion version applies skip-gram modeling to the contextual dimension, but over each of the acoustically ambiguous words. The hybrid model is a combination of both the intra- and inter-confusion configurations. More information on the training configurations is available in [148].

6.3 Confusion2Vec 2.0 Subword Model

Subword encoding of words has been popular in modeling the semantics and syntax of language using word vector representations [17, 38, 128]. The use of subwords is mainly motivated by the fact that subwords incorporate morphological information, which can be helpful, for example, in relating prefixes, suffixes and the word root. In this work, we apply subword representations to encode the word ambiguity information in human language. We believe we have a much stronger case for the use of subwords for representing the acoustic similarities (ambiguities) between the words in a language, since similarly sounding words often have highly overlapping subword representations. This helps the model ascertain the level of overlap and, in doing so, estimate the magnitude of acoustic similarity robustly. Moreover, the use of subwords should help in efficient encoding of under-represented words in the language.
This is crucial in the case of confusion2vec because we are restricted to ASR lattices for training data, limiting word-word co-occurrence, in contrast to typical word vector representations which can be trained on large amounts of easily available plain text data. Another important aspect is the ability to represent out-of-vocabulary words, which are a commonplace occurrence with end-to-end ASR systems outputting character sequences.

In the proposed model, each word w is represented as a sum of its constituent n-gram character subwords. This enables the model to infer the internal structure of each word. For example, the word want is represented with the vector sum of the following subwords:

<wa, wan, ant, nt>, <wan, want, ant>, <want, want>, <want>

The symbols < and > are used to represent the beginning and end of the word. The n-grams are generated for n=3 up to n=6. It is apparent that an acoustically ambiguous, similar-sounding word wand has a high degree of overlap with this set of n-gram characters.

In this chapter, we consider two modeling variations: the (i) intra-confusion and (ii) inter-confusion versions of confusion2vec with subword encoding.

6.3.1 Intra-Confusion Model

The goal of the intra-confusion model is to estimate the inter-word relations between the acoustically ambiguous words that appear in the ASR lattices. For this, we perform skip-gram modeling over the acoustic similarity dimension (see Figure 6.1) and ignore the contextual dimension of the utterance. The objective of the intra-confusion model is to maximize the following log-likelihood:

\sum_{t=1}^{T} \sum_{\hat{a} \in \hat{A}_t} \sum_{a \in A_t} \log p(w_{t,a} \mid w_{t,\hat{a}})    (6.1)

where T is the length of the utterance (confusion network) in terms of the number of words, w_{i,j} is the word in the confusion network output by the ASR at time step i, and j is the index of the word among the ambiguous alternatives.
\hat{A}_t is the set of indices of all ambiguous words at time step t, \hat{a} is the index of the current word along the acoustic ambiguity dimension, and A_t = \hat{A}_t \setminus \{\hat{a}\} is the subset of ambiguous words barring \hat{a} at the current word t; for example, from Figure 6.1, for the current word w_{t,\hat{a}} = want, A_t = {wand, won't, what}. Additionally, for subword encoding, each word input is represented as:

w_{i,j} = \sum_{s \in S_w} x_s    (6.2)

where S_w is the set of all character n-grams ranging from n=3 to n=6 together with the word itself, and x_s is the vector representation of the n-gram subword s. A few training samples (input, target) generated for this configuration, pertaining to the input confusion network in Figure 6.1, are (I, eye), (eye, I), (want, wand), (want, won't), (won't, what), (wand, what), etc.

6.3.2 Inter-Confusion Model

The aim of the inter-confusion model is to jointly model the contextual co-occurrence information and the acoustic ambiguity co-occurrence information along both axes depicted in the confusion network. Here, skip-gram modeling is performed over the time context and over all the possible acoustic ambiguities. The objective of the inter-confusion model is to maximize the following log-likelihood:

\sum_{t=1}^{T} \sum_{\hat{a} \in \hat{A}_t} \sum_{c \in C_t} \sum_{a \in A_c} \log p(w_{c,a} \mid w_{t,\hat{a}})    (6.3)

where C_t corresponds to the set of indices of the nodes of the confusion network around the current word t along the time axis, and c is the current context index. A_c is the set of indices of acoustically ambiguous words at context c. For example, for the current word w_{t,\hat{a}} = want in Figure 6.1, A_c covers {I, eye, two, tees, to, seat, sit, seed, eat} and A_t = {wand, won't, what, want}. Note that each word input is subword encoded as in equation 6.2.
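As a minimal illustrative sketch (our own simplification, not the actual training code), the subword decomposition of Eq. (6.2) and the two pair-generation schemes of Eqs. (6.1) and (6.3) can be written as follows. The fixed context window of 2 is a simplification of the sampled window used in the real setup, and the toy network uses the word alternatives of Figure 6.1.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' (Eq. 6.2's S_w); in the full model the
    word itself is additionally kept as its own token."""
    padded = "<" + word + ">"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def intra_pairs(net):
    """(input, target) pairs over the ambiguity dimension only (Eq. 6.1)."""
    return [(w, a) for slot in net for w in slot for a in slot if a != w]

def inter_pairs(net, window=2):
    """(input, target) pairs over the contextual dimension, covering every
    ambiguous alternative at each context slot (Eq. 6.3)."""
    pairs = []
    for t, slot in enumerate(net):
        for w in slot:
            for c in range(max(0, t - window), min(len(net), t + window + 1)):
                if c == t:
                    continue
                pairs.extend((w, a) for a in net[c])
    return pairs

# Toy confusion network with the word alternatives of Figure 6.1:
net = [["i", "eye"], ["want", "wand", "won't", "what"], ["two", "to", "tees"]]
```

Note that `char_ngrams("want")` reproduces the subword list given above for want, and that it shares three n-grams (`<wa`, `wan`, `<wan`) with the acoustically confusable wand, which is exactly the overlap the model exploits.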
A few training samples (input, target) generated for this configuration are (want, I), (want, eye), (want, two), (want, to), (want, tees), (what, I), (what, eye), (what, to), (what, tees), (what, two), (won't, eye), etc.

6.3.3 Training Loss and Objective

Negative sampling is employed for training the embedding model. Negative sampling was first introduced for training the word2vec representation [115]. It is a simplification of the Noise Contrastive Estimation objective [62]. Negative sampling for training the embedding can be posed as a set of binary classification problems operating on two classes: the presence of signal, or its absence (noise). In the context of word embeddings, the presence of the context words is treated as the positive class, and the negative class is randomly sampled from the unigram distribution of the vocabulary. The negative sampling objective for the subword model can be expressed using the binary logistic loss as:

\log \sigma\Big( \sum_{s \in S_{w_i}} x_s^{\top} o_{w_t} \Big) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \log \sigma\Big( -\sum_{s \in S_{w_i}} x_s^{\top} o_{w_k} \Big)    (6.4)

where \sigma(x) = 1/(1 + e^{-x}), w_i is the input word, w_t is the output word, S_{w_i} is the set of n-gram character subwords of the word w_i, x_s is the vector representation of the character n-gram subword s, and o_{w_t} is the output vector representation of the target word w_t. K is the number of negative samples drawn from the negative-sample (noise) distribution P_n(w). The noise distribution P_n(w) is chosen to be the unigram distribution of words in the vocabulary raised to the 3/4th power
Note, for confusion2v ec the input w ord w i and target w ord w t are deriv ed according to equations 6.1 and 6.3 for implemen ting the resp ectiv e training congurations 6.4 Data and Experimental Setup 6.4.1 Database Fisher English T raining P art 1, Sp eec h (LDC2004S13) and Fisher English T raining P art 2, Sp eec h (LDC2005S13) corp ora [29] are used for b oth training the ASR and the confusion2v ec 2.0 em- b eddings. The c hoice of database is based on [148] for direct comparison purp oses. The corpus consists of sp on taneous telephonic con v ersations b et w een 11,972 nativ e English sp eak ers. The sp eec h data amoun ts to appro ximately 1,915 hours sampled at 8 kHz. The corpus is divided in to 3 parts for training (1,905 hours, 1,871,731 utterances), dev elopmen t (5 hours, 5000 utterances) and test (5 hours, 5000 utterances). Ov erall, the transcripts con tain appro ximately 20.8 million w ord tok ens and v o cabulary size of 42,150. 6.4.2 Exp erimen tal Setup The exp erimen tal setup is main tained iden tical to [148] for direct comparison. Brief detail of the setup is as follo ws: 6.4.2.1 Automatic sp eec h recognition A h ybrid HMM-DNN based acoustic mo del is trained on the train subset of the sp eec h corpus using the KALDI sp eec h recognition to olkit [130]. 40 dimensional mel frequency cepstral co ecien ts (MF CC) features are extracted along with the i-v ector features for training the acoustic mo del. The i-v ector features are used to pro vide sp eak er and c hannel c haracteristics to aid acoustic mo deling. The acoustic mo del, DNN, comprises of 7 la y ers with P-norm non-linearit y (p=2) eac h with 350 units [190]. The DNN is trained using 5 MF CC frame splices with left and righ t con text of 2 to classify among 7979 Gaussian mixtures with sto c hastic gradien t descen t optimizer. CMU pron unciation dictionary [175] is utilized as the w ord-pron unciation transcription lexicon. 
A tri-gram language model is trained on the training subset of the Fisher English corpus. The ASR yields word error rates (WER) of 16.57% and 18.12% on the development and test sets, respectively. Lattices are generated during ASR decoding with a decoding beam size of 11 and a lattice beam size of 6. The lattices are converted to confusion networks with the minimum Bayes risk criterion [182] for training the confusion2vec embeddings. The resulting confusion networks have a vocabulary size of 41,274 and 69.5 million words, with an average of 3.34 alternative (ambiguous) words per edge in the graph.

6.4.2.2 Confusion2Vec 2.0

For training the embedding, the most frequent words are sub-sampled as suggested in [115], with the rejection threshold set to 10^-4. A minimum frequency threshold of 5 is applied, and rarely occurring words are pruned from the vocabulary. The context window size for both the acoustic ambiguity and contextual dimensions is sampled uniformly between 1 and 5. The dimension of the word vectors is set to 300. The number of negative samples for negative sampling is chosen to be 64. The learning rate is set to 0.01 and the model is trained for a total of 15 epochs using stochastic gradient descent. All hyper-parameters are chosen empirically for optimal performance. We implemented confusion2vec 2.0 by modifying the source code of fastText [17] (https://github.com/facebookresearch/fastText). We make our source code and trained models available.

6.4.3 Evaluation Metrics

To evaluate the inherent semantic and syntactic knowledge of the word embeddings, we employ two tasks: (i) the semantic-syntactic analogy task, and (ii) the word similarity task. The word analogy task was first proposed in [113] and comprises word-pair analogy questions of the form "W1 is to W2 as W3 is to W4". The analogy is answered correctly if vec(W2) - vec(W1) + vec(W3) is most similar to vec(W4).
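The analogy test above reduces to a nearest-neighbor search under cosine similarity; a minimal, stdlib-only sketch (the embedding dictionary and top-k convention are assumptions for illustration):

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def answer_analogy(emb, w1, w2, w3, topk=1):
    """Top-k candidates for 'w1 is to w2 as w3 is to ?' using
    vec(w2) - vec(w1) + vec(w3), excluding the query words themselves."""
    query = [b - a + c for a, b, c in zip(emb[w1], emb[w2], emb[w3])]
    scores = sorted(
        ((cosine(query, v), w) for w, v in emb.items() if w not in (w1, w2, w3)),
        reverse=True,
    )
    return [w for _, w in scores[:topk]]
```

The analogy counts as correct when the expected word W4 appears among the returned candidates (top-1 for the word2vec baselines, top-2 for the confusion2vec models, as discussed in [148]).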
Another prominent approach is the word similarity task, where the rank correlation between the cosine similarities of pairs of word vectors and human-annotated word similarity scores is assessed [143]. For the word similarity task, we use the WordSim-353 database [49], consisting of 353 pairs of words annotated with scores from 1 to 10 according to the magnitude of word similarity as perceived by humans.

For assessing the word acoustic ambiguity (similarity) information, we conduct the acoustic analogy task, the semantic&syntactic-acoustic analogy task and the acoustic similarity task proposed in [148]. The acoustic analogy task comprises word-pair analogies compiled using homophones, answering questions of the form: "W1 sounds similar to W2 as W3 sounds similar to W4". It is designed to assess the ambiguity information embedded in the word vector space [148]. The semantic&syntactic-acoustic analogy task is designed to assess semantic, syntactic and acoustic ambiguity information simultaneously; its analogies are formed by replacing certain words with their homophone alternatives in the original semantic and syntactic analogy task [148].
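The correlation reported for these similarity tasks is Spearman's rank correlation between the model's cosine similarities and the reference scores. A small self-contained sketch of the statistic, assuming simple average-rank tie handling:

```python
def rankdata(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

In practice one would feed it the per-pair cosine similarities and the WordSim-353 (or phone-edit-distance) scores.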
The acoustic word similarity task is analogous to the word similarity task: it contains word pairs rated on their acoustic similarity, based on normalized phone edit distances. A value of 1.0 means the two words sound identical and 0.0 means the word pair is acoustically dissimilar. More details regarding the evaluation methodologies are available in [148]. The evaluation datasets are made available.

Model | S&S Analogy | Acoustic Analogy | S&S-Acoustic Analogy | Average Accuracy | Word Similarity | Acoustic Similarity
Google W2V [115] | 61.42% | 0.90% | 16.99% | 26.44% | 0.6893 | -0.3489
In-domain W2V | 59.17% | 0.60% | 8.15% | 22.64% | 0.4417 | -0.4377
fastText [17] | 75.93% | 0.46% | 17.40% | 31.26% | 0.7361 | -0.3659
Confusion2Vec 1.0 (word) [148], C2V-a | 63.97% | 16.92% | 43.34% | 41.41% | 0.5228 | 0.6200
Confusion2Vec 1.0 (word) [148], C2V-c | 65.45% | 27.33% | 38.29% | 43.69% | 0.5798 | 0.5825
Confusion2Vec 2.0 (subword), C2V-a | 56.74% | 50.79% | 44.67% | 50.73% | 0.3181 | 0.8108
Confusion2Vec 2.0 (subword), C2V-c | 56.87% | 51.00% | 44.98% | 50.95% | 0.2893 | 0.8106

Table 6.1: Results of the different proposed models. C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy. For the analogy tasks, the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [148]). For the similarity tasks, all correlations (Spearman's) are statistically significant with p < 0.001.

6.5 Results

Table 6.1 lists the results in terms of accuracies for the analogy tasks and rank correlations for the similarity tasks. The first two rows correspond to results with the original word2vec. The Google W2V model is the open-source model released by Google (https://code.google.com/archive/p/word2vec/), trained on the 100-billion-word Google News dataset.
We also train an in-domain version of the original word2vec on the Fisher English corpus for fair comparison with the confusion2vec models, referred to as In-domain W2V in Table 6.1. The fastText model employed is the open-source model released by Facebook (https://fasttext.cc/docs/en/pretrained-vectors.html), trained on Wikipedia dumps with a vocabulary of more than 2.5 million words. The middle two rows of the table correspond to confusion2vec embeddings without subword encoding and are taken directly from [148]. The bottom two rows correspond to the results obtained with subword encoding. Note that confusion2vec 1.0 is initialized from the Google word2vec model for better convergence, while the confusion2vec 2.0 model is initialized from the fastText model to maintain compatibility with subword encodings. We normalize the vocabulary for all experiments, meaning the same vocabulary is used to evaluate the analogy and similarity tasks to allow fair comparisons.

Comparing the baseline word2vec and fastText embeddings to confusion2vec, we observe that the baseline embeddings perform well on the semantic&syntactic analogy task and provide good positive correlation on the word similarity task, as expected. However, they perform poorly on the acoustic analogy and semantic&syntactic-acoustic analogy tasks, and give a small negative correlation on the acoustic similarity task. All the confusion2vec models perform relatively well on the semantic&syntactic analogy and word similarity tasks but, more importantly, give high accuracies on the acoustic and semantic&syntactic-acoustic analogy tasks and a high positive correlation on the acoustic similarity task.

Model | S&S Analogy | Acoustic Analogy | S&S-Acoustic Analogy | Average Accuracy | Word Similarity | Acoustic Similarity
Google W2V [115] | 61.42% | 0.90% | 16.99% | 26.44% | 0.6893 | -0.3489
In-domain W2V | 59.17% | 0.60% | 8.15% | 22.64% | 0.4417 | -0.4377
fastText [17] | 75.93% | 0.46% | 17.40% | 31.26% | 0.7361 | -0.3659
Confusion2Vec 1.0 (word) [148], C2V-1 + C2V-a | 67.03% | 25.43% | 40.36% | 44.27% | 0.5102 | 0.7231
Confusion2Vec 1.0 (word) [148], C2V-1 + C2V-c | 70.84% | 35.25% | 35.18% | 47.09% | 0.5609 | 0.6345
Confusion2Vec 1.0 (word) [148], C2V-1 + C2V-c (JT) | 65.88% | 49.40% | 41.51% | 52.26% | 0.5379 | 0.7717
Confusion2Vec 2.0 (subword), fastText + C2V-a | 76.10% | 22.67% | 49.15% | 49.31% | 0.5744 | 0.7577
Confusion2Vec 2.0 (subword), fastText + C2V-c | 76.16% | 22.56% | 49.12% | 49.12% | 0.5732 | 0.7573

Table 6.2: Results of the different proposed concatenated models. C2V-a: Intra-Confusion, C2V-c: Inter-Confusion, S&S: Semantic & Syntactic Analogy. For the analogy tasks, the accuracies of the baseline word2vec models are for top-1 evaluations, whereas those of the other models are for top-2 evaluations (as discussed in [148]). For the similarity tasks, all correlations (Spearman's) are statistically significant with p < 0.001.

Specifically for Confusion2Vec 2.0, among the analogy tasks we observe that the subword encoding enhances the acoustic ambiguity modeling. On the acoustic analogy task we find a relative improvement of up to 46.41% over the non-subword counterpart. Moreover, even for the semantic&syntactic-acoustic analogy task, we observe improvements with subword encoding. However, we find a small reduction in performance on the original semantic and syntactic analogy task. Despite this small dip, the accuracies remain acceptable in comparison to the in-domain word2vec model. Overall, taking the average accuracy over all the analogy tasks, we obtain an increase of approximately 16.62% relative over the non-subword confusion2vec models. Investigating the results for the similarity tasks, we find a significant and high correlation of 0.81 on the acoustic similarity task with the subword encoding.
Again, a small degradation is observed on the word similarity task, obtaining a correlation of 0.3181 against 0.4417 for the in-domain baseline word2vec model. Overall, the results of the analogy and similarity tasks suggest that subword encoding greatly enhances the ambiguity modeling of confusion2vec.

6.5.1 Model Concatenation

Further, the confusion2vec model can be concatenated with other word embedding models to produce a new word vector space with better representations, as seen in [148]. Table 6.2 lists the results of the concatenated models. For the previous, non-subword version of confusion2vec, the vector models are concatenated with a word2vec model trained on the ASR output transcripts (C2V-1); the choice of C2V-1 over Google W2V for concatenation was based on empirical findings. To maintain compatibility of subword encodings, the confusion2vec 2.0 models are concatenated with the fastText model.

First, comparing the non-concatenated versions in Table 6.1 with the concatenated versions in Table 6.2 for the non-subword models, we observe a decent improvement of approximately 7.22% relative in average analogy accuracy after concatenation. We do not observe significant improvement in average analogy accuracy for the subword-based models after concatenation. However, we observe different dynamics between the acoustic ambiguity subspace and the semantic-syntactic subspace: concatenation improves the semantic and syntactic evaluations at the expense of a drop in acoustic analogy accuracy. We also note improvements (9.27% relative) on the semantic&syntactic-acoustic analogy task after concatenation, confirming the meaningful co-existence of both ambiguity and semantic-syntactic relations.
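Concatenation here simply stacks the two vector spaces over the shared vocabulary; a minimal sketch, assuming plain Python lists as vectors and dropping words missing from either space (out-of-vocabulary handling is a design choice omitted here):

```python
def concatenate_embeddings(emb_a, emb_b):
    """Stack two word-vector spaces over their common vocabulary.

    emb_a, emb_b : dicts mapping word -> list of floats
    Returns a dict whose vectors have dimension dim(a) + dim(b).
    """
    return {w: emb_a[w] + emb_b[w] for w in emb_a if w in emb_b}
```

With 300-dimensional confusion2vec and 300-dimensional fastText vectors, the concatenated space is 600-dimensional; each subspace can also be L2-normalized beforehand so that neither dominates cosine similarities.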
Moreover, the word similarity task also yields better correlation after concatenation. Next, comparing confusion2vec 1.0 (non-subword) with the subword version, we observe significant improvements on the semantic&syntactic analogy task (7.51% relative) as well as the semantic&syntactic-acoustic analogy task (21.78% relative). Moreover, the subword models outperform the non-subword version on both similarity tasks. The subword models slightly under-perform on the acoustic analogy task but, more crucially, significantly outperform the Google W2V and fastText baselines. Further, the concatenated models can be fine-tuned and optimized to exploit additional gains, as found in [148]. The row corresponding to Confusion2Vec 1.0 - C2V-1 + C2V-c (JT) is the best result obtained in [148], which involves two training passes. Confusion2Vec 2.0 with subword modeling and a single training pass gives comparable performance to the 2-pass approach; we therefore skip the 2-pass approach with the subword model in favor of ease of training and reproducibility.

6.6 Spoken Language Intent Detection with Confusion2Vec 2.0

In this section, we apply the proposed word vector embedding to the task of spoken language intent detection to assess its practicality in real-world scenarios. Spoken language intent detection is the process of decoding the speaker's intent in contexts involving voice commands, call routing and other human-computer interactions. Most spoken language technologies employ an ASR to convert the speech signal to text, a process that introduces errors into the pipeline, conditioned on varying speaker and noise environments. However, popular approaches to spoken intent detection in the natural language processing community assume clean text as input to the intent classification systems.
Erroneous ASR outputs degrade the intent detection process. A few efforts have focused on handling ASR errors to make the subsequent intent detection more robust, often by training the intent classification systems on noisy ASR transcripts. The downside of training intent classifiers on ASR output is that such systems are limited by the amount of speech data available; moreover, varying speech signal conditions and the use of different ASR models make such classifiers non-optimal and less practical. In many scenarios, speech data is not available at all to enable adaptation on ASR transcripts.

In our previous work [152], we applied the non-subword version of confusion2vec to the task of spoken language intent detection. We demonstrated that confusion2vec performs as efficiently as popular word embeddings like word2vec and GloVe on clean manual transcripts, giving comparable classification error rates. More importantly, we illustrated the robustness of the confusion2vec embeddings when evaluated on noisy ASR transcripts: confusion2vec gives significantly better accuracies (up to 20% relative improvement) on ASR transcripts compared to word2vec, GloVe and state-of-the-art models involving more complex neural network intent classification architectures. We also showed that confusion2vec undergoes the least degradation between clean and ASR transcripts, and that it consistently provides the lowest classification error rates even when the intent classifier is trained on ASR transcripts. The experiments indicated that the difference in accuracy between training the intent classifier on clean versus ASR transcripts is reduced from 2.57% to 0.89% absolute.
Overall, these results illustrate that confusion2vec has inherent knowledge of the acoustic ambiguity (similarity) word relations that correlate with ASR errors, using which the classifier is able to recover from certain errors more efficiently.

In this chapter, we incorporate the confusion2vec 2.0 embeddings, with their inherent knowledge of acoustic ambiguity, to allow robust intent classification. Given the enhanced capability of subword modeling in capturing acoustic ambiguity, verified by the preceding evaluations, we believe the proposed model can further improve spoken utterance classification. In doing so, we aim to eliminate the need for re-training the classifiers on ASR outputs.

6.6.1 Intent classification

For intent classification we adopt a simple RNN architecture identical to [152] for direct comparison. The architecture of the neural network is intentionally kept simple for effective inference of the efficacy of the proposed embedding word features. The classifier comprises an embedding layer, followed by a single layer of bi-directional recurrent neural network (RNN) with long short-term memory (LSTM) units, followed by a linear dense layer with a softmax function that outputs a probability distribution over the intent categories. The embedding layer is fixed throughout training, except in the randomly initialized case, where the embedding is estimated on the in-domain data specific to the intent detection task.

6.6.2 Database and Experimental Setup

6.6.2.1 Database

We conduct experiments on the Airline Travel Information Systems (ATIS) benchmark dataset [70]. The dataset comprises humans making flight-related inquiries to an automated answering machine, with the audio recorded and its transcripts manually annotated. ATIS consists of 18 intent categories.
The dataset is divided into train (4,478 samples), development (500 samples) and test (893 samples) subsets, consistent with previous works [152, 65, 58]. For ASR evaluations, the audio recordings are down-sampled from 16 kHz to 8 kHz and then decoded using the ASR setup described in Section 6.4.2.1, using the audio mappings available at https://github.com/pgurunath/slu_confusion2vec. The ASR achieves a WER of 18.54% on the ATIS test set.

6.6.2.2 Experimental Setup

The intent classification models are trained on the 4,478 samples of the training subset, and the hyper-parameters are tuned on the development set. We choose the set of hyper-parameters yielding the best results on the development set and then apply it to the unseen held-out test subset of both the manual clean transcripts and the ASR transcripts, and report the results. For training, we treat each utterance as a single sample (batch size = 1). The hyper-parameter space explored is as follows: the hidden dimension size of the LSTM is tuned over {32, 64, 128, 256}, the learning rate over {0.0005, 0.001}, and the dropout over {0.1, 0.15, 0.2, 0.25}. The Adam optimizer is employed and the models are trained for a total of 50 epochs, with early stopping when the loss on the development set does not improve for 5 consecutive epochs.

6.6.2.3 Baselines

We include results from several baseline systems to compare Confusion2Vec 2.0 against popular context-free word embeddings, contextual embeddings, established NLU systems and the current state-of-the-art.

1. Context-Free Embeddings: GloVe (https://nlp.stanford.edu/projects/glove/) [127], skip-gram word2vec [115] and fastText [17] word representations are employed. They are referred to as context-free embeddings since the word representations are static irrespective of context.

2. ELMo: Peters et al.
[128] proposed deep contextualized word representations based on a character-based deep bidirectional language model trained on a large text corpus. The model effectively captures the syntax and semantics of language across varying linguistic contexts. Unlike context-free embeddings, ELMo embeddings have a different representation for each word depending on the word's context. We employ the original model trained on the 1 Billion Word Benchmark, with 93.6 million parameters (https://allennlp.org/elmo). For intent classification, we add a single bi-directional LSTM layer with attention for multi-task joint intent and slot prediction.

3. BERT: Devlin et al. [38] introduced BERT, bidirectional contextual word representations based on the self-attention mechanism of Transformer models. BERT models make use of masked language modeling and next-sentence prediction to model language. As with ELMo, the word embeddings are contextual, i.e., they vary according to context. We employ the bert-base-uncased model (https://github.com/google-research/bert) with 12 layers of 768 dimensions each, trained on BookCorpus and English Wikipedia. For intent classification, we add a single bi-directional LSTM layer with attention for multi-task joint intent and slot prediction.

4. Joint SLU-LM: Liu and Lane [101] employed joint modeling of next-word prediction along with intent and slot labeling. The unidirectional RNN model updates intent states for each word input and uses them as context for slot labeling and language modeling.

5. Attn. RNN Joint SLU: Liu and Lane [100] proposed an attention-based encoder-decoder bidirectional RNN model for joint intent detection and slot-filling.
A weighted average of the encoder bidirectional LSTM hidden states provides information from parts of the input word sequence, which is used together with the time-aligned encoder hidden state by the decoder to predict the slot labels and intent.

6. Slot-Gated Attn.: Goo et al. [58] introduced a slot-gated mechanism that adds a gate to improve slot and intent prediction by leveraging the intent context vector for the slot-filling task.

7. Self Attn. SLU: Li et al. [97] proposed a self-attention model with a gate mechanism for joint learning of intent classification and slot filling, utilizing the semantic correlation between slots and intents. The model estimates embeddings augmented with intent information using self-attention, which is used as a gate for the slot-filling task.

8. Joint BERT: Chen et al. [25] proposed using BERT embeddings for joint modeling of intent and slot-filling. The pre-trained BERT embeddings are fine-tuned for (i) a sentence prediction task (intent detection), and (ii) a sequence prediction task (slot filling). The Joint BERT model lacks the bi-directional LSTM layer of the earlier BERT-based baseline.

9. SF-ID Network: E et al. [45] introduced a bi-directional interrelated model for joint intent detection and slot-filling. An iteration mechanism is proposed in which the SF subnet introduces intent information into the slot-filling task while the ID subnet applies slot information to the intent detection task. For slot-filling, a conditional random field layer derives the final output.

10. ASR Robust ELMo: Huang and Chen [75] proposed ASR-robust contextualized embeddings for intent detection. ELMo embeddings are fine-tuned with a novel loss function that minimizes the cosine distance between acoustically confused words found in ASR confusion networks.
Two techniques, based on supervised and unsupervised extraction of word confusions, are explored. The fine-tuned contextualized embeddings are then utilized for spoken language intent detection.

6.6.3 Results

Table 6.3 lists the results of intent detection in terms of classification error rate (CER). The Reference column corresponds to results on the manually annotated transcripts of ATIS, and the ASR column corresponds to evaluations on the noisy speech recognition transcripts.

Model | Reference | ASR | diff
Random | 2.69 | 10.75 | 8.06
GloVe [127] | 1.90 | 8.17 | 6.27
Word2Vec [115] | 2.69 | 8.06 | 5.37
fastText [17] | 1.90 | 8.40 | 6.50
ELMo [128] †* | 1.46 | 7.05 | 5.59
BERT [38] †* | 1.12 | 6.16 | 5.04
Joint SLU-LM [101] † | 1.90 | 9.41 | 7.51
Attn. RNN Joint SLU [100] † | 1.79 | 8.06 | 6.27
Slot-Gated Attn. [58] † | 3.92 | 10.64 | 6.72
Self Attn. SLU [97] † | 2.02 | 9.18 | 7.16
Joint BERT [25] †* | 2.46 | 7.73 | 5.27
SF-ID Network [45] † | 3.14 | 10.53 | 7.39
ASR Robust ELMo (unsup.) [75] * | 3.24 | 5.26 | 2.02
ASR Robust ELMo (sup.) [75] * | 3.46 | 5.03 | 1.57
C2V 1.0 [148] | 2.46 | 6.38 | 3.92
C2V-c 2.0 | 3.36 | 5.82 | 2.46
C2V-a 2.0 | 2.46 | 4.37 | 1.91
fastText + C2V-c 2.0 | 1.79 | 4.70 | 2.91
fastText + C2V-a 2.0 | 1.90 | 5.04 | 3.14

Table 6.3: Results for models trained on clean reference transcripts: classification error rates (CER) for reference and ASR transcripts. diff is the absolute degradation of the model from clean to ASR. C2V 1.0 corresponds to C2V-1 + C2V-c (JT) in Tables 6.1 and 6.2. † indicates joint modeling of intent and slot-filling; * indicates contextual embeddings.

First, evaluating on the clean Reference transcripts, we observe that confusion2vec 2.0 with subword encoding achieves performance similar to the popular word embedding models and the state-of-the-art. The best-performing confusion2vec 2.0 achieves a CER of 1.79%. Among the different versions of the proposed subword-based confusion2vec, we find that the concatenated versions are slightly better.
We believe this is because the concatenated models exhibit better semantic and syntactic relations (see Tables 6.1 and 6.2) than the non-concatenated ones. Among the baseline models, the contextual embeddings BERT and ELMo give the best CER. Note that the proposed confusion2vec embeddings are context-free and still outperform the other context-free embedding models such as GloVe, word2vec and fastText.

Second, evaluating performance on the erroneous ASR transcripts, we find that all the subword-based confusion2vec 2.0 models outperform the popular word vector embeddings by a large margin. The subword confusion2vec gives a drastic improvement of approximately 45.78% relative over the best-performing context-free word embeddings. The proposed embeddings also improve over the contextual embeddings, including BERT and ELMo (relative improvement of 29.06%).

Model | WER % | CER %
Random | 18.54 | 5.15
GloVe [127] | 18.54 | 6.94
Word2Vec [115] | 18.54 | 5.49
Schumann and Angkititrakul [146] | 10.55 | 5.04
C2V 1.0 | 18.54 | 4.70
C2V-c 2.0 | 18.54 | 4.82
C2V-a 2.0 | 18.54 | 4.26
fastText + C2V-c 2.0 | 18.54 | 3.70
fastText + C2V-a 2.0 | 18.54 | 4.26

Table 6.4: Results for models trained and evaluated on ASR transcripts. C2V 1.0 corresponds to C2V-1 + C2V-c (JT) in Tables 6.1 and 6.2.

Moreover, the results are a good improvement over the non-subword confusion2vec word vectors (31.50% improvement). This confirms our initial hypothesis that subword encoding is better able to represent the acoustic ambiguities in human language. Comparing the different versions of the proposed confusion2vec, the intra-confusion configuration yields the lowest CER.
Inspecting the degradation, diff (the drop in performance between the clean and ASR evaluations), we find that all the confusion2vec 2.0 models with subword information undergo the least degradation, re-affirming their robustness to noise in the transcripts.

Table 6.4 presents the results obtained by training the models on ASR transcripts and evaluating on ASR transcripts. Here we omit all the joint intent-slot-filling baseline models, since training on ASR transcripts requires an aligned set of slot labels in the presence of insertion, substitution and deletion errors, which is out of the scope of this study. We note that the confusion2vec models give significantly lower CER. The subword-based confusion2vec models also improve over the non-subword confusion2vec model (21.28% improvement). Comparing the results in Tables 6.3 and 6.4, we highlight that the subword confusion2vec model trained on clean transcripts gives a minimum CER of 4.37%, which is better than the CER obtained by popular word embeddings like word2vec, GloVe and fastText even when those are trained on the ASR transcripts (15.15% better, relatively). These results demonstrate that subword confusion2vec models can eliminate the need for re-training natural language understanding and processing algorithms on ASR transcripts for robust performance.

6.7 Conclusion

In this chapter, we proposed subword encoding for modeling acoustic ambiguity information in word vector representations, along with the semantics and syntax of the language. Each word in the language is represented as the sum of its constituent character n-gram subwords.
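The decomposition above can be sketched as follows; the n-gram range and the boundary markers follow the fastText convention [17] and are assumptions for illustration:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract the character n-gram subwords of a word, fastText-style.

    '<' and '>' mark the word boundaries; the full word (with markers)
    is included as its own subword.
    """
    marked = "<" + word + ">"
    grams = {marked}
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams
```

The vector for "want" is then the sum of the vectors of subwords such as "<wa", "wan", "ant" and "nt>", which lets acoustically similar spellings share parameters across words.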
The advantages of the subwords are confirmed by evaluating the proposed models on various word analogy and word similarity tasks designed to assess the acoustic ambiguity, semantic and syntactic knowledge inherent in the models. Finally, the proposed subword models are applied to a spoken language intent detection system. The intent classification results suggest that the proposed subword confusion2vec models greatly enhance classification performance when evaluated on noisy ASR transcripts. The highlight of the results is that the subword confusion2vec models entirely eliminate the need for re-training the classifier on ASR transcripts.

6.8 Future Work

In the future, we plan to model ambiguity information using deep contextual modeling techniques such as BERT; we believe bidirectional modeling with attention can further enhance ambiguity modeling. On the application side, we plan to implement and assess the effect of confusion2vec models on a wide range of natural language understanding and processing applications, such as speech translation and dialogue tracking. On the analysis front, we would like to apply the proposed embeddings and evaluate the effect of WER on the performance of spoken language understanding tasks and the improvements provided by confusion2vec. Assessing the benefits of confusion2vec over a wide range of underlying speech signal environments, including the type and amount of noise and transferability across different ASR systems, can be very useful for the domain of spoken language understanding.

Conclusion

This thesis addresses the important problem of handling speech recognition errors for robust speech processing and spoken language understanding.
The problem is tackled from the perspective of three basic underlying error sources that eventually result in speech recognition errors: (i) variations in the underlying speech signal, (ii) limitations of several machine learning algorithms, and (iii) ambiguities present in human language. In this thesis, we proposed a noisy channel model for error correction based on phrases, termed Noisy-Clean Phrase Context Modeling (NCPCM). The system post-processes the output of an automatic speech recognition system and is effective in correcting and recovering from several different types of speech recognition errors. NCPCM provides improvements over a wide range of underlying input signal variations and a wide spectrum of ASR word error rates, and is able to adapt to certain limitations and restrictions imposed to keep computational complexity and memory tractable. The NCPCM system is shown to improve speech recognition even over a highly optimized ASR. Further, we introduced a new human language encoding in the form of a word vector representation, termed Confusion2Vec, that takes into account several ambiguities associated with human spoken language; specifically, we model the acoustic ambiguity (similarity) in human language. The acoustic ambiguity is introduced into the word vector representation by unsupervised modeling of the confusions present in speech recognition output lattices. The acoustic ambiguity is shown to co-exist with the semantics and syntax of human language, resulting in a word vector representation that is robust to ASR errors. The efficacy of the word vector representation is confirmed on the task of spoken language intent classification: Confusion2Vec achieves significantly lower classification error rates when evaluated on noisy ASR transcripts compared to popular word vector representations.
With its inherent knowledge of acoustically confusable words, which correlates with ASR errors, Confusion2Vec is able to recover from such errors. Finally, to enhance the ambiguity modeling capacity of Confusion2Vec, we proposed encoding each word by its constituent character n-grams. This increases the model's ability to capture acoustic ambiguity information, which is reflected in improved performance on the acoustic, semantic and syntactic analogy tasks and the word similarity tasks. The enhancement from subword modeling is confirmed on spoken language intent classification, which yields improved classification error rates compared with the non-subword version of Confusion2Vec. The subword Confusion2Vec model completely eliminates the need for retraining the spoken utterance classifier on ASR transcripts.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265-283, 2016.

[2] William A. Ainsworth and S. R. Pratt. Feedback strategies for error correction in speech recognition systems. International Journal of Man-Machine Studies, 36(6):833-842, 1992.

[3] Tamer Alkhouli, Felix Rietig, and Hermann Ney. Investigations on phrase-based decoding with recurrent neural network language and translation models. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 294-303, 2015.

[4] Alexandre Allauzen. Error detection in confusion network. In Eighth Annual Conference of the International Speech Communication Association, 2007.

[5] Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. Deep neural network language models.
In Pr o c e e dings of the 2015 Confer enc e of the North A meric an Chapter of the Asso ciation for Computational Linguistics: Human L anguage T e chnolo gies , pages 16271637, 2015. [156] Matthias Sp erb er, Graham Neubig, Jan Nieh ues, and Alex W aib el. Neural lattice-to- sequence mo dels for uncertain inputs. arXiv pr eprint arXiv:1704.00559 , 2017. [157] Matthias Sp erb er, Graham Neubig, Ngo c-Quan Pham, and Alex W aib el. Self-atten tional mo dels for lattice inputs. arXiv pr eprint arXiv:1906.01617 , 2019. [158] Sabrina Steh wien and Ngo c Thang V u. First step to w ards enhancing w ord em b eddings with pitc h accen t features for dnn-based slot lling on recognized text. In Konfer enz Elektr onische Spr achsignalver arb eitung 2017, Saarbrcken , 2017. [159] Andreas Stolc k e. SRILM-an extensible language mo deling to olkit. In Seventh international c onfer enc e on sp oken language pr o c essing , 2002. [160] Helmer Strik and Catia Cucc hiarini. Mo deling pron unciation v ariation for ASR: A surv ey of the literature. Sp e e ch Communic ation , 29(2):225246, 1999. [161] Jinsong Su, Zhixing T an, Deyi Xiong, Rongrong Ji, Xiao dong Shi, and Y ang Liu. Lattice- based recurren t neural net w ork enco ders for neural mac hine translation. In AAAI , pages 33023308, 2017. [162] Bernhard Suhm, Brad My ers, and Alex W aib el. Multimo dal error correction for sp eec h user in terfaces. A CM tr ansactions on c omputer-human inter action (TOCHI) , 8(1):6098, 2001. [163] Martin Sundermey er, Ralf Sc hlter, and Hermann Ney . LSTM neural net w orks for language mo deling. In Pr o c e e dings of Intersp e e ch , pages 194197, 2012. [164] Martin Sundermey er, ZoltÆn Tsk e, Ralf Sc hlter, and Hermann Ney . Lattice deco ding and rescoring with long-span neural net w ork language mo dels. In Fifte enth A nnual Confer enc e of the International Sp e e ch Communic ation Asso ciation , 2014. [165] Ily a Sutsk ev er, Oriol Vin y als, and Quo c V Le. 
Sequence to sequence learning with neural net w orks. In A dvanc es in neur al information pr o c essing systems , pages 31043112, 2014. [166] Kai Sheng T ai, Ric hard So c her, and Christopher D Manning. Impro v ed seman tic rep- resen tations from tree-structured long short-term memory net w orks. arXiv pr eprint arXiv:1503.00075 , 2015. 112 [167] Yik-Cheung T am, Y un Lei, Jing Zheng, and W en W ang. Asr error detection using recurren t neural net w ork language mo del and complemen tary ASR. In A c oustics, Sp e e ch and Signal Pr o c essing (ICASSP), 2014 IEEE International Confer enc e on , pages 23122316. IEEE, 2014. [168] Qun F eng T an, Kartik Audhkhasi, P ana yiotis G Georgiou, Emil Ettelaie, and Shrik an th S Nara y anan. Automatic sp eec h recognition system c hannel mo deling. In Eleventh A nnual Confer enc e of the International Sp e e ch Communic ation Asso ciation , 2010. [169] Zhixing T an, Jinsong Su, Boli W ang, Yidong Chen, and Xiao dong Shi. Lattice-to-sequence atten tional neural mac hine translation mo dels. Neur o c omputing , 284:138147, 2018. [170] Gokhan T ur, Jerry W righ t, Allen Gorin, Giusepp e Riccardi, and Dilek Hakk ani-Tr. Impro v- ing sp ok en language understanding using w ord confusion net w orks. In Seventh International Confer enc e on Sp oken L anguage Pr o c essing , 2002. [171] Gkhan Tr, Ano op Deoras, and Dilek Hakk ani-Tr. Seman tic parsing using w ord confusion net w orks with conditional random elds. In INTERSPEECH , pages 25792583. Citeseer, 2013. [172] Ashish V asw ani, Yinggong Zhao, Victoria F ossum, and Da vid Chiang. Deco ding with large- scale neural language mo dels impro v es translation. In Pr o c e e dings of the 2013 Confer enc e on Empiric al Metho ds in Natur al L anguage Pr o c essing , pages 13871392. EMNLP , 2013. [173] Ashish V asw ani, Noam Shazeer, Niki P armar, Jak ob Uszk oreit, Llion Jones, Aidan N Gomez, Luk asz Kaiser, and Illia P olosukhin. A tten tion is all y ou need. 
In A dvanc es in neur al information pr o c essing systems , pages 59986008, 2017. [174] W olfgang W ahlster. V erbmobil: foundations of sp e e ch-to-sp e e ch tr anslation . Springer Science & Business Media, 2013. [175] R W eide. The cm u pron unciation dictionary , release 0.6, 1998. [176] Mirjam W ester. Pron unciation mo deling for ASRkno wledge-based and data-deriv ed meth- o ds. Computer Sp e e ch & L anguage , 17(1):6985, 2003. [177] P .C. W o o dland and D. P o v ey . Large scale discriminativ e training of hidden Mark o v mo dels for sp eec h recognition. Computer Sp e e ch & L anguage , 16(1):25 47, 2002. ISSN 0885- 2308. doi: h ttp://dx.doi.org/10.1006/csla.2001.0182. URL http://www.sciencedirect. com/science/article/pii/S0885230801901822 . [178] Jo ern W uebk er and Hermann Ney . Phrase mo del training for statistical mac hine translation with w ord lattices of prepro cessing alternativ es. In Pr o c e e dings of the Seventh W orkshop on Statistic al Machine T r anslation , pages 450459. Asso ciation for Computational Linguistics, 2012. [179] F engsh un Xiao, Jiangtong Li, Hai Zhao, Rui W ang, and Kehai Chen. Lattice-based trans- former enco der for neural mac hine translation. arXiv pr eprint arXiv:1906.01282 , 2019. [180] Chao Xing, Dong W ang, Xuew ei Zhang, and Chao Liu. Do cumen t classication with distri- butions of w ord v ectors. In Signal and Information Pr o c essing Asso ciation A nnual Summit and Confer enc e (APSIP A), 2014 Asia-Pacic , pages 15. IEEE, 2014. [181] W a yne Xiong, Jasha Dropp o, Xuedong Huang, F rank Seide, Mik e Seltzer, Andreas Stolc k e, Dong Y u, and Georey Zw eig. A c hieving h uman parit y in con v ersational sp eec h recognition. arXiv pr eprint arXiv:1610.05256 , 2016. 113 [182] Haih ua Xu, Daniel P o v ey , Lidia Mangu, and Jie Zh u. Minim um ba y es risk deco ding and system com bination based on a recursion for edit distance. Computer Sp e e ch & L anguage , 25(4):802828, 2011. 
[183] Puy ang Xu and Ruhi Sarik a y a. Con v olutional neural net w ork based triangular crf for join t in ten t detection and slot lling. In 2013 IEEE W orkshop on A utomatic Sp e e ch R e c o gnition and Understanding , pages 7883. IEEE, 2013. [184] Puy ang Xu, Brian Roark, and Sanjeev Kh udanpur. Phrasal cohort based unsup ervised discriminativ e language mo deling. In Thirte enth A nnual Confer enc e of the International Sp e e ch Communic ation Asso ciation , 2012. [185] Jian Xue and Y unxin Zhao. Impro v ed confusion net w ork algorithm and shortest path searc h from w ord lattice. In A c oustics, Sp e e ch, and Signal Pr o c essing, 2005. Pr o c e e d- ings.(ICASSP’05). IEEE International Confer enc e on , v olume 1, pages I853. IEEE, 2005. [186] Sib el Y aman, Li Deng, Dong Y u, Y e-Yi W ang, and Alex A cero. An in tegrativ e and discrim- inativ e tec hnique for sp ok en utterance classication. IEEE T r ansactions on A udio, Sp e e ch, and L anguage Pr o c essing , 16(6):12071214, 2008. [187] W enp eng Yin and Hinric h Sc htze. Learning w ord meta-em b eddings. In Pr o c e e dings of the 54th A nnual Me eting of the Asso ciation for Computational Linguistics, A CL 2016, A ugust 7-12, 2016, Berlin, Germany, V olume 1: L ong Pap ers , 2016. URL http://aclweb.org/ anthology/P/P16/P16- 1128.pdf . [188] Huifeng Zhang, Su Zh u, Sh uai F an, and Kai Y u. Join t sp ok en language understanding and domain adaptiv e language mo deling. In International Confer enc e on Intel ligent Scienc e and Big Data Engine ering , pages 311324. Springer, 2018. [189] Xiao dong Zhang and Houfeng W ang. A join t mo del of in ten t determination and slot lling for sp ok en language understanding. In Pr o c e e dings of the Twenty-Fifth International Joint Confer enc e on A rticial Intel ligenc e , IJCAI’16, pages 29932999. AAAI Press, 2016. ISBN 978-1-57735-770-4. [190] Xiaoh ui Zhang, Jan T rmal, Daniel P o v ey , and Sanjeev Kh udanpur. 
Impro ving deep neural net w ork acoustic mo dels using generalized maxout net w orks. In 2014 IEEE International Confer enc e on A c oustics, Sp e e ch and Signal Pr o c essing (ICASSP) , pages 215219. IEEE, 2014. [191] D. Zheng, Z. Chen, Y. W u, and K. Y u. Directed automatic sp eec h transcription error correction using bidirectional LSTM. In 2016 10th International Symp osium on Chinese Sp oken L anguage Pr o c essing (ISCSLP) , pages 15, Oct 2016. doi: 10.1109/ISCSLP .2016. 7918446. [192] Su Zh u, Ouyu Lan, and Kai Y u. Robust sp ok en language understanding with unsup ervised asr-error adaptation. In 2018 IEEE International Confer enc e on A c oustics, Sp e e ch and Signal Pr o c essing (ICASSP) , pages 61796183. IEEE, 2018. 114 Appendix A Confusion2V ec Mo del Analogy T asks Seman tic&Syn tactic Analogy A coustic Analogy Seman tic&Syn tactic-A coustic Analogy A v erage A ccuracy Seman tic Syn tactic Seman tic&Syn tactic Seman tic-A coustic Syn tactic-A coustic Seman tic&Syn tactic-A coustic Go ogle W ord2V ec 28.98% (35.75%) 70.79% (78.74%) 61.42% (69.1%) 0.9% (1.42%) 6.54% (14.38%) 17.9% (27.46%) 16.99% (26.42%) 26.44% (32.31%) W ord2V ec GroundT ruth 42.39% (51.57%) 33.14% (43.14%) 35.15% (44.98%) 0.3% (0.6%) 5.17% (10.69%) 8.13% (11.93%) 7.86% (11.82%) 14.44% (19.13%) Baseline W ord2V ec 38.33% (46.7%) 33.1% (42.36%) 34.27% (43.33%) 0.7% (1.16%) 11.76% (14.38%) 11.23% (15.11%) 11.27% (15.05%) 15.41% (19.85%) In tra-Confusion 0.51% (0.78%) 18.59% (28.17%) 14.54% (22.03%) 41.93% (52.58%) 0.98% (2.29%) 9.62% (15.67%) 8.94% (14.61%) 21.8% (29.74%) In ter-Confusion 16.15% (23.7%) 26.14% (39.74%) 23.9% (36.15%) 48.58% (60.57%) 3.27% (6.86%) 12.13% (21.61%) 11.42% (20.44%) 27.97% (39.05%) Hybrid In tra-In ter 2.07% (2.58%) 28.91% (38.6%) 22.89% (30.53%) 40.78% (53.55%) 1.96% (2.94%) 20.99% (31.63%) 19.48% (29.35%) 27.72% (37.81%) T able A.1: Analogy T ask Results with Seman tic & Syn tactic splits: Dieren t prop osed mo dels Num b ers inside paren thesis 
indicate top-2 ev aluation accuracy; Num b ers outside paren thesis represen t top-1 ev aluation accuracy . Go ogle W ord2V ec, W ord2V ec Groundtruth (trained on in-domain) and Baseline W ord2V ec (trained on ASR transcriptions) p erform b etter with the Seman tic&Syn tactic tasks, but fares p o orly with A coustic analogy task. In tra-Confusion p erforms w ell on A coustic analogy task while compromising on Seman tic&Syn tactic task. In ter-Confusion p erforms w ell on b oth the A coustic analogy and Seman tic&Syn tactic tasks. Hybrid In tra-In ter training p erforms fairly w ell on all the three analogy tasks (A coustic, Seman tic&Syn tactic and Seman tic&Syn tactic-A coustic). 115 Mo del Similarit y T asks W ord Similarit y A coustic Similarit y Go ogle W ord2V ec 0.6893 (7.9e-48) -0.3489 (2.2e-28) W ord2V ec GroundT ruth 0.5794 (4.2e-29) -0.2444 (1e-10) Baseline W ord2V ec 0.4992 (3.3e-22) 0.1944 (1.7e-9) In tra-Confusion 0.105 (0.056) 0.8138 (5.1e-224) In ter-Confusion 0.2937 (5.4e-8) 0.8055 (5.1e-216) Hybrid In tra-In ter 0.0963 (0.08) 0.7858 (1.5e-198) T able A.2: Similarit y T ask Results: Dieren t prop osed mo dels Similarit y in terms of Sp earman’s correlation. Num b ers inside paren thesis indicate correlation pvalue for similarit y tasks Go ogle W ord2V ec, Baseline W ord2V ec and W ord2V ec Groundtruth, all sho w high correlations with w ord similarit y , while sho wing p o or correlations on acoustic similarit y . Go ogle W ord2V ec and W ord2V ec Groundtruth mo dels trained on clean data exhibit negativ e acoustic similarit y correlation. Baseline W ord2V ec trained on noisy ASR sho ws a small p ositiv e acoustic similarit y correlation. In tra-Confusion, In ter-Confusion and Hybrid In tra-In ter training sho w higher correlations on A coustic similarit y . 
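The analogy accuracies reported in these tables follow the standard vector-arithmetic protocol: for a question "a : b :: c : d", the answer is correct at top-k if the target word d appears among the k nearest cosine neighbors of vec(b) - vec(a) + vec(c), with the question words excluded. A minimal sketch of that evaluation is below; the toy vectors and the acoustic-analogy question ("see : sea :: write : right") are illustrative assumptions, not data from the thesis.

```python
import numpy as np

def top_k_analogy_accuracy(questions, vectors, k=1):
    """Fraction of analogy questions (a, b, c, d) for which d is among the
    k nearest cosine neighbors of vec(b) - vec(a) + vec(c)."""
    words = list(vectors)
    mat = np.stack([vectors[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)   # unit rows
    index = {w: i for i, w in enumerate(words)}
    correct = 0
    for a, b, c, d in questions:
        query = vectors[b] - vectors[a] + vectors[c]
        sims = mat @ (query / np.linalg.norm(query))          # cosines
        for w in (a, b, c):                                   # exclude question words
            sims[index[w]] = -np.inf
        top = np.argsort(-sims)[:k]
        correct += any(words[i] == d for i in top)
    return correct / len(questions)

# Hypothetical toy vectors where homophones lie close together,
# mimicking what a Confusion2Vec-style space would encode.
toy = {
    "see":   np.array([1.0, 0.0, 0.9]),
    "sea":   np.array([1.0, 0.1, 0.9]),
    "write": np.array([0.0, 1.0, 0.8]),
    "right": np.array([0.0, 1.1, 0.8]),
    "cat":   np.array([0.5, 0.5, 0.0]),
}
acc = top_k_analogy_accuracy([("see", "sea", "write", "right")], toy, k=1)
print(acc)
```

The top-2 numbers in parentheses throughout the appendix correspond to running the same routine with k=2.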
Table A.3: Analogy task results with semantic and syntactic splits, with model pre-training/initialization. Numbers outside parentheses are top-1 evaluation accuracy; numbers in parentheses are top-2.

| Model | Semantic | Syntactic | Semantic&Syntactic | Acoustic | Semantic-Acoustic | Syntactic-Acoustic | Semantic&Syntactic-Acoustic | Average |
|---|---|---|---|---|---|---|---|---|
| Baseline Word2Vec | 34.92% (41.96%) | 68.7% (78.82%) | 61.13% (70.56%) | 0.9% (1.46%) | 14.38% (19.28%) | 16.85% (24.25%) | 16.66% (23.86%) | 26.23% (31.96%) |
| Intra-Confusion | 11.5% (15.53%) | 67.56% (77.96%) | 54.99% (63.97%) | 9.04% (16.92%) | 7.84% (10.46%) | 36.92% (46.17%) | 34.61% (43.34%) | 32.88% (41.41%) |
| Inter-Confusion | 25.77% (33.12%) | 60.1% (74.79%) | 52.4% (65.45%) | 16.54% (27.33%) | 10.78% (14.05%) | 28.9% (40.38%) | 27.46% (38.29%) | 32.13% (43.69%) |
| Hybrid Intra-Inter | 15.64% (21.94%) | 66.73% (77.68%) | 55.28% (65.19%) | 10.49% (20.35%) | 6.86% (11.11%) | 35.4% (44.85%) | 33.13% (42.18%) | 36.27% (42.57%) |

Pre-training is helpful in all cases: it boosts the Semantic&Syntactic analogy accuracy for every model, and for the Intra-Confusion, Inter-Confusion and Hybrid Intra-Inter models it also boosts the Semantic&Syntactic-Acoustic analogy accuracies. A small dip in Acoustic analogy accuracy is observed, but overall average accuracy improves.

Table A.4: Similarity task results with model pre-training/initialization, given as Spearman's correlation; numbers in parentheses are the correlation p-values.

| Model | Word Similarity | Acoustic Similarity |
|---|---|---|
| Baseline Word2Vec | 0.6036 (3.8e-34) | -0.4327 (2.5e-44) |
| Intra-Confusion | 0.5228 (1.4e-24) | 0.62 (2.95e-101) |
| Inter-Confusion | 0.5798 (4.9e-31) | 0.5825 (9.1e-87) |
| Hybrid Intra-Inter | 0.5341 (9.8e-26) | 0.6237 (8.8e-103) |

Pre-training boosts the word-similarity correlation for all models. The correlation improves considerably for the Intra-Confusion, Inter-Confusion and Hybrid Intra-Inter models while maintaining good correlation on acoustic similarity.

Table A.5: Analogy task results for model concatenation and joint optimization. Numbers outside parentheses are top-1 evaluation accuracy; numbers in parentheses are top-2. (F): fixed embedding; (L): embedding learned during joint training.

| Model | Fine-tuning scheme | Semantic | Syntactic | Semantic&Syntactic | Acoustic | Semantic-Acoustic | Syntactic-Acoustic | Semantic&Syntactic-Acoustic | Average |
|---|---|---|---|---|---|---|---|---|---|
| Baseline Word2Vec (556 dim.) | - | 34.49% (41.53%) | 68.7% (78.82%) | 61.13% (70.25%) | 0.93% (1.46%) | 15.36% (19.93%) | 16.63% (23.89%) | 16.53% (23.57%) | 26.2% (31.76%) |
| Model concatenation | | | | | | | | | |
| Word2Vec (F) + Intra-Confusion (F) | - | 6.22% (9.5%) | 71.03% (83.65%) | 56.51% (67.03%) | 13.59% (25.43%) | 6.54% (11.76%) | 33.91% (42.82%) | 31.74% (40.36%) | 33.95% (44.27%) |
| Word2Vec (F) + Inter-Confusion (F) | - | 36.53% (47.01%) | 57.94% (77.72%) | 53.14% (70.84%) | 20.99% (35.25%) | 10.46% (16.01%) | 26.31% (36.83%) | 25.05% (35.18%) | 33.06% (47.09%) |
| Word2Vec (F) + Hybrid Intra-Inter (F) | - | 11.85% (17.32%) | 71.85% (82.74%) | 58.4% (68.08%) | 6.35% (11.39%) | 7.84% (12.18%) | 34.38% (43.78%) | 32.28% (41.3%) | 32.34% (40.26%) |
| Fixed Word2Vec joint optimization | | | | | | | | | |
| Word2Vec (F) + Intra-Confusion (L) | inter | 22.96% (32.42%) | 66.19% (82.98%) | 56.5% (71.65%) | 12.73% (20.54%) | 13.4% (18.3%) | 26.22% (35.09%) | 25.21% (33.76%) | 31.48% (41.98%) |
| Word2Vec (F) + Intra-Confusion (L) | intra | 6.69% (11.58%) | 69.79% (83.48%) | 55.65% (67.37%) | 17.03% (28.64%) | 8.17% (13.73%) | 31.85% (47.64%) | 29.97% (39.09%) | 34.22% (45.03%) |
| Word2Vec (F) + Intra-Confusion (L) | hybrid | 11.69% (19.79%) | 69.31% (84.53%) | 56.39% (70.02%) | 14.86% (25.84%) | 9.8% (16.67%) | 30.02% (38.94%) | 28.42% (37.18%) | 33.22% (44.35%) |
| Word2Vec (F) + Inter-Confusion (L) | inter | 39.19% (50.57%) | 58.35% (78.21%) | 54.05% (72.01%) | 23.33% (35.25%) | 12.42% (18.3%) | 24.45% (34.89%) | 23.5% (33.58%) | 33.63% (46.95%) |
| Word2Vec (F) + Inter-Confusion (L) | intra | 22.76% (32.85%) | 62.07% (80.34%) | 53.26% (69.7%) | 24.76% (39.32%) | 7.52% (11.11%) | 29.97% (41.47%) | 28.19% (39.07%) | 35.40% (49.36%) |
| Word2Vec (F) + Inter-Confusion (L) | hybrid | 30.54% (43.21%) | 61.56% (80.81%) | 54.61% (72.38%) | 23.6% (37.75%) | 8.5% (14.71%) | 28.25% (39.95%) | 26.68% (37.95%) | 34.96% (49.36%) |
| Word2Vec (F) + Hybrid Intra-Inter (L) | inter | 27.02% (35.9%) | 67.52% (81.6%) | 58.45% (71.36%) | 5.04% (8.55%) | 11.76% (16.67%) | 26.28% (34.64%) | 25.13% (33.21%) | 29.54% (37.71%) |
| Word2Vec (F) + Hybrid Intra-Inter (L) | intra | 10.48% (15.84%) | 70.44% (81.57%) | 57.00% (66.85%) | 7.21% (13.33%) | 6.21% (12.09%) | 34.07% (42.52%) | 31.87% (40.1%) | 32.03% (40.09%) |
| Word2Vec (F) + Hybrid Intra-Inter (L) | hybrid | 15.41% (23.31%) | 70.56% (82.61%) | 58.2% (68.32%) | 6.39% (11.61%) | 8.17% (12.09%) | 32.36% (40.43%) | 30.44% (38.19%) | 31.68% (39.37%) |
| Unrestricted joint optimization | | | | | | | | | |
| Word2Vec (L) + Intra-Confusion (L) | inter | 8.6% (14.74%) | 57.96% (75.8%) | 46.9% (62.12%) | 30.73% (46.42%) | 5.88% (12.75%) | 26.79% (38.44%) | 25.13% (36.4%) | 34.25% (48.31%) |
| Word2Vec (L) + Intra-Confusion (L) | intra | 4.97% (7.9%) | 69.27% (81.30%) | 54.86% (64.85%) | 23.86% (40.55%) | 7.84% (11.44%) | 34.92% (45.02%) | 32.77% (42.38%) | 37.16% (49.26%) |
| Word2Vec (L) + Intra-Confusion (L) | hybrid | 1.1% (1.64%) | 26.54% (40.32%) | 20.83% (31.65%) | 49.25% (61.91%) | 2.29% (3.92%) | 15.05% (25.24%) | 14.04% (23.55%) | 28.12% (39.04%) |
| Word2Vec (L) + Inter-Confusion (L) | inter | 33.01% (43.72%) | 50.81% (71.13%) | 46.82% (64.98%) | 37.15% (52.99%) | 9.48% (16.01%) | 23.16% (36.41%) | 22.07% (34.79%) | 35.35% (50.92%) |
| Word2Vec (L) + Inter-Confusion (L) | intra | 21.9% (30.43%) | 58.99% (76.12%) | 50.68% (65.88%) | 33.05% (49.4%) | 7.52% (10.46%) | 31.23% (44.12%) | 29.35% (41.51%) | 37.69% (52.26%) |
| Word2Vec (L) + Inter-Confusion (L) | hybrid | 10.48% (15.72%) | 30.0% (44.25%) | 25.63% (37.86%) | 52.73% (67.21%) | 3.27% (4.9%) | 16.09% (27.77%) | 15.08% (25.96%) | 31.15% (43.68%) |
| Word2Vec (L) + Hybrid Intra-Inter (L) | inter | 19.24% (26.59%) | 61.57% (76.8%) | 52.08% (65.54%) | 17.85% (27.97%) | 7.52% (12.75%) | 28.81% (38.94%) | 27.12% (36.87%) | 32.35% (43.46%) |
| Word2Vec (L) + Hybrid Intra-Inter (L) | intra | 10.09% (13.77%) | 68.76% (79.06%) | 55.61% (64.42%) | 10.34% (20.05%) | 5.88% (9.48%) | 36.13% (45.41%) | 33.73% (42.56%) | 33.23% (42.34%) |
| Word2Vec (L) + Hybrid Intra-Inter (L) | hybrid | 12.98% (17.91%) | 68.26% (79.62%) | 55.87% (65.79%) | 11.73% (22.63%) | 5.88% (10.46%) | 35.28% (43.92%) | 32.95% (41.3%) | 33.52% (43.24%) |

Model concatenation provides gains on the Acoustic analogy task, and thereby in average accuracy, compared to the results in Table A.3 for the Intra-Confusion and Inter-Confusion models. Fixed-Word2Vec and unrestricted joint optimization further improve over model concatenation. The best average accuracy is obtained with unrestricted joint optimization, an absolute improvement of 10%. The Confusion2Vec models surpass Word2Vec even on the Semantic&Syntactic analogy task (top-2 evaluation accuracy).

Table A.6: Similarity task results for model concatenation and joint optimization, given as Spearman's correlation; numbers in parentheses are the correlation p-values. (F): fixed embedding; (L): embedding learned during joint training.

| Model | Fine-tuning scheme | Word Similarity | Acoustic Similarity |
|---|---|---|---|
| Baseline Word2Vec (556 dim.) | - | 0.5973 (2.8e-33) | -0.4341 (1.3e-44) |
| Model concatenation | | | |
| Word2Vec (F) + Intra-Confusion (F) | - | 0.5102 (2.9e-23) | 0.7231 (2.2e-153) |
| Word2Vec (F) + Inter-Confusion (F) | - | 0.5609 (9.8e-29) | 0.6345 (2.3e-107) |
| Word2Vec (F) + Hybrid Intra-Inter (F) | - | 0.4142 (4.1e-15) | 0.5285 (5.6e-69) |
| Fixed Word2Vec joint optimization | | | |
| Word2Vec (F) + Intra-Confusion (L) | inter | 0.5676 (1.6e-29) | 0.4437 (9.1e-47) |
| Word2Vec (F) + Intra-Confusion (L) | intra | 0.5211 (2.3e-24) | 0.6967 (6.5e-138) |
| Word2Vec (F) + Intra-Confusion (L) | hybrid | 0.5384 (3.4e-26) | 0.6287 (6.7e-105) |
| Word2Vec (F) + Inter-Confusion (L) | inter | 0.5266 (6.1e-25) | 0.5818 (1.6e-86) |
| Word2Vec (F) + Inter-Confusion (L) | intra | 0.5156 (8.3e-24) | 0.7021 (6.3e-141) |
| Word2Vec (F) + Inter-Confusion (L) | hybrid | 0.5220 (1.8e-24) | 0.6674 (1.4e-122) |
| Word2Vec (F) + Hybrid Intra-Inter (L) | inter | 0.5587 (1.7e-28) | 0.302 (2.5e-21) |
| Word2Vec (F) + Hybrid Intra-Inter (L) | intra | 0.4996 (3.1e-22) | 0.5691 (4.7e-82) |
| Word2Vec (F) + Hybrid Intra-Inter (L) | hybrid | 0.5254 (8.2e-25) | 0.4945 (2.6e-59) |
| Unrestricted joint optimization | | | |
| Word2Vec (L) + Intra-Confusion (L) | inter | 0.5513 (1.3e-27) | 0.7926 (2.4e-204) |
| Word2Vec (L) + Intra-Confusion (L) | intra | 0.5033 (1.4e-22) | 0.7949 (2e-206) |
| Word2Vec (L) + Intra-Confusion (L) | hybrid | 0.1067 (0.0528) | 0.8309 (8.5e-242) |
| Word2Vec (L) + Inter-Confusion (L) | inter | 0.5763 (1.3e-30) | 0.7725 (8.2e-188) |
| Word2Vec (L) + Inter-Confusion (L) | intra | 0.5379 (3.8e-26) | 0.7717 (3.5e-187) |
| Word2Vec (L) + Inter-Confusion (L) | hybrid | 0.2295 (2.6e-5) | 0.8294 (3.6e-240) |
| Word2Vec (L) + Hybrid Intra-Inter (L) | inter | 0.5338 (1e-25) | 0.6953 (3.7e-137) |
| Word2Vec (L) + Hybrid Intra-Inter (L) | intra | 0.4920 (1.6e-21) | 0.6942 (1.5e-136) |
| Word2Vec (L) + Hybrid Intra-Inter (L) | hybrid | 0.4967 (5.8e-22) | 0.6986 (5.9e-139) |

Good correlations are observed for both word similarity and acoustic similarity with model concatenation, with and without joint optimization. All the correlations are found to be statistically significant.
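The similarity results in Tables A.2, A.4 and A.6 are Spearman rank correlations between model-predicted similarity scores and reference (human) ratings. Spearman's rho is just the Pearson correlation of the two rank vectors; a minimal pure-Python sketch is below (the human and model score lists are hypothetical illustrations, not data from the appendix).

```python
def rankdata(xs):
    """Average ranks (1-based); ties share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical human word-similarity ratings vs. model cosine scores.
human = [9.2, 8.5, 6.3, 3.1, 1.0]
model = [0.91, 0.80, 0.55, 0.20, 0.05]
print(spearman(human, model))
```

Because the two hypothetical score lists are rank-consistent, this example yields a correlation of 1.0; in practice one would also report the p-value, as the tables above do.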
Abstract
Automatic Speech Recognition (ASR) is gaining importance in everyday life. ASR has become a core component of human-computer interaction and a key part of many applications involving virtual assistants, voice assistants, gaming, robotics, natural language understanding, education, communication and pronunciation tutoring, call routing, and interactive media entertainment. The growth of such applications and their adoption in everyday scenarios point to ASR becoming a ubiquitous part of daily life in the foreseeable future. This has become possible in part because of the high performance achieved by state-of-the-art speech recognition systems. However, the errors resulting from ASR can have a negative impact on downstream applications. In this work, we focus on modeling the errors of the ASR, with the hypothesis that an accurate model of such errors can be used to recover from them and alleviate their negative consequences for downstream applications.

We model the ASR as a phrase-based noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR and attempt to invert them. The proposed system can exploit long-term context, re-introduce previously pruned or unseen phrases, and better choose among existing ASR output possibilities. We show that the system provides improvements over a range of ASR conditions without degrading any accurate transcription, and that it yields consistent gains even on out-of-domain tasks as well as over highly optimized ASR models re-scored by recurrent neural language models. Further, we propose a sequence-to-sequence neural network for modeling the ASR errors that incorporates much longer contextual information.
We propose different optimal architectures and feature representations, in terms of subwords, and demonstrate improvements over the phrase-based noisy channel model.

Additionally, we propose a novel word vector representation, Confusion2Vec, motivated by human speech production and perception, that encodes representational ambiguity. The representational ambiguity of acoustics, which manifests itself in word confusions, is often resolved by both humans and machines through contextual cues. We present several techniques to train a representation that captures acoustic perceptual similarity alongside this ambiguity, learned without supervision from ASR confusion networks or lattice-like structures. Appropriate evaluations are formulated for gauging acoustic similarity in addition to semantic-syntactic and word similarity evaluations. Confusion2Vec is able to model word confusions efficiently without compromising the semantic-syntactic word relations, thus effectively enriching the word vector space with task-relevant ambiguity information. The proposed Confusion2Vec can also extend to a range of representational ambiguities that emerge in various domains beyond acoustic perception, such as morphological transformations, word segmentation, and paraphrasing for natural language processing tasks like machine translation, and visual perceptual similarity for image processing tasks like image summarization and optical character recognition.

This work also contributes towards efficient coupling of ASR with the various downstream algorithms operating on its outputs. We prove the efficacy of Confusion2Vec by proposing a recurrent neural network based spoken language intent detection system that achieves state-of-the-art results under noisy ASR conditions.
We demonstrate through experiments that ASR errors often involve acoustically similar words, and that Confusion2Vec, with its inherent model of acoustic relationships between words, is able to compensate for these errors. Improvements are also demonstrated when training the intent detection models on noisy ASR transcripts. This opens new opportunities for incorporating the Confusion2Vec embeddings into a whole range of full-fledged applications.

Further, we extend Confusion2Vec by encoding each word in the vector space through its constituent subword character n-grams. We show that the subword encoding better represents the acoustic perceptual ambiguities of human spoken language, via information modeled on lattice-structured ASR output. The efficacy of subword-Confusion2Vec is evaluated using semantic, syntactic and acoustic analogy and word similarity tasks. We also demonstrate the benefits of subword modeling for acoustic ambiguity representation on the task of spoken language intent detection; the results significantly outperform existing word vector representations as well as the non-subword Confusion2Vec word embeddings when evaluated on erroneous ASR outputs. We show that Confusion2Vec subword modeling eliminates the need for retraining/adapting the natural language understanding models on ASR transcripts.
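The subword encoding described above represents each word by its constituent character n-grams, in the style popularized by fastText. A minimal sketch follows; the boundary markers and the 3-to-6 n-gram range are illustrative assumptions, not the exact configuration used in this work.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with '<' and '>' marking word boundaries.
    A word's vector is then composed (e.g. summed) from its n-gram vectors,
    so morphologically related words such as 'write'/'writes' share most
    subword units, and unseen or misrecognized words still get a vector."""
    marked = f"<{word}>"
    grams = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    if marked not in grams:
        grams.append(marked)        # keep the whole word as its own unit
    return grams

print(char_ngrams("sea"))
```

Composing word vectors from these shared units is what lets the subword model cover out-of-vocabulary ASR outputs without retraining.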
Asset Metadata

Creator: Gurunath Shivakumar, Prashanth (author)
Core Title: Speech recognition error modeling for robust speech processing and natural language understanding applications
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 02/01/2021
Defense Date: 12/10/2020
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: automatic speech recognition; Confusion2Vec; error modeling; spoken language understanding
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Narayanan, Shrikanth (committee chair); Georgiou, Panayiotis (committee member); Jenkins, Keith (committee member); Mataric, Maja (committee member)
Creator Email: pgurunat@usc.edu, prashanth.g.s@ieee.org
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-419993
Document Type: Dissertation
Rights: Gurunath Shivakumar, Prashanth
Source: University of Southern California Dissertations and Theses (collection)
Repository: University of Southern California Digital Library