EMOTIONAL SPEECH RESYNTHESIS

by

Murtaza Bulut

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2008

Copyright 2008 Murtaza Bulut

Dedication

To my family.

Acknowledgments

I would like to thank my adviser, Prof. Shrikanth Narayanan, the SAIL group members, and all of my friends for their unconditional support.

Table of Contents

Dedication
Acknowledgments
Abstract

Chapter 1: Introduction
1.1 Motivation
1.2 Problem statement
1.3 Proposed approach
1.4 Dissertation contribution
1.5 Research methodology
1.6 Assumptions
1.7 Hypotheses
1.8 Definitions

Chapter 2: Emotion to emotion transformation system (ETET): A summary
2.1 Introduction
2.2 Multi-level emotion to emotion transformation system (ETET)
2.2.1 Spectral envelope modifications
2.2.2 Part of speech (POS) tags' prosody modifications
2.2.3 Voiced/unvoiced regions' prosody modifications
2.3 Conclusion

Chapter 3: Literature review: Background and related work
3.1 Theories of emotion
3.2 Dimensions of emotional space
3.3 Collection of emotional speech data
3.3.1 How to collect emotional speech data?
3.3.2 Data collection for concatenative speech synthesis
3.4 Evaluation of emotional speech
3.5 Analysis of emotional speech
3.5.1 Acoustic correlates of emotions
3.6 Synthesis of emotional speech
3.6.1 Concatenative speech synthesis (data-driven synthesis)
3.6.2 Formant speech synthesis (rule-driven synthesis)
3.6.3 Articulatory speech synthesis (model-driven synthesis)
3.7 Emotional speech resynthesis

Chapter 4: Expressive speech synthesis using a concatenative synthesizer
4.1 Introduction
4.2 Database collection
4.3 Synthesis of emotional sentences
4.4 Results
4.5 Discussion
4.6 Summary and conclusions

Chapter 5: Investigating the role of phoneme-level modifications in emotional speech resynthesis
5.1 Introduction
5.2 Dataset description
5.3 System description
5.3.1 Test stimuli
5.4 Evaluation
5.4.1 Listening experiment
5.4.2 Listening test results
5.5 Discussion
5.5.1 Prosody modifications
5.5.2 Spectrum modifications
5.5.3 Prosody and spectrum modification combination
5.6 Conclusion

Chapter 6: Prosody of part of speech tags in emotional speech: Statistical approach for analysis and synthesis
6.1 Introduction
6.1.1 Background
6.1.2 Outline
6.2 Emotional dataset and feature extraction
6.2.1 Emotional dataset
6.2.2 Part of speech (POS) tagging
6.2.3 Acoustic feature calculation
6.3 Analysis of POS tags' prosody characteristics
6.3.1 Standard analysis: Comparisons of feature values
6.3.2 Probabilistic analysis: Comparisons of tags in terms of probabilities
6.4 ANOVA analysis: Repeated measures design on emotions
6.4.1 ANOVA analysis results
6.4.2 Analysis of tag duration, energy and F0 contours
6.4.3 Post hoc tests for emotions and POS tags
6.4.4 Statistical modeling of emotional differences
6.5 Discussion of analysis results
6.6 Analysis by synthesis: A resynthesis experiment of emotional speech
6.6.1 Optimal tag order estimation
6.6.2 Parameter value estimation
6.6.3 Smoothing
6.6.4 Speech conversion
6.7 Listening test results
6.7.1 Listening test setup
6.7.2 Listening test results
6.8 Discussion of the results
6.9 Conclusions
6.10 Modifications in the POS tag parameter value estimation for the ETET system implementation
6.10.1 Maximum/minimum tag estimation and tag ordering
6.10.2 Parameter value generation

Chapter 7: Recognition for synthesis: Automatic parameter selection for resynthesis of emotional speech from neutral speech
7.1 Introduction
7.2 Recognition for Synthesis (RFS) system description
7.2.1 Prosody modifications
7.2.2 Automatic emotion recognition using neural networks
7.3 Modification factor selection
7.4 Listening tests
7.4.1 Listening test structure
7.4.2 Listening test results
7.5 Discussion
7.6 Conclusion

Chapter 8: Analysis of effects of F0 modifications on emotional speech
8.1 Introduction
8.2 Data preparation
8.2.1 Data collection
8.2.2 F0 modifications
8.3 Listening tests
8.4 Emotional regions in F0 mean-range space
8.5 Statistical analysis of emotion and speech quality perception
8.5.1 Factors influencing emotion perception
8.5.2 Factors influencing speech quality perception
8.6 Effects of F0 modifications on emotional content
8.7 Discussion
8.8 Conclusion

Chapter 9: Evaluation of emotion to emotion transformation (ETET) system
9.1 Modification parameters
9.2 Original utterances
9.3 Listening test structure
9.4 Listening test results
9.5 Discussion
9.5.1 Neutral to angry transformation
9.5.2 Neutral to sad transformation
9.5.3 Neutral to happy transformation
9.5.4 Some listener comments
9.5.5 Speech quality issues: How to improve the quality

Chapter 10: Future directions
10.1 Emotional speech representation and evaluation
10.2 Emotional speech synthesis

Chapter 11: Conclusion

Chapter 12: Publications
12.1 Book Chapters
12.2 Journals
12.3 Conferences
12.4 Abstracts

Bibliography

Appendix A: Modification algorithms: PSOLA and LPC
A.1 Pitch Synchronous Overlap and Add (PSOLA)
A.1.1 Pitch-synchronous analysis
A.1.2 Pitch-synchronous modifications
A.1.3 Pitch-synchronous overlap-add synthesis
A.2 Linear Predictive Coding (LPC)
Appendix B: Spectral conversion using GMMs
B.1 Spectral conversion
B.2 Spectral conversion using Gaussian mixture models (GMM)

Appendix C: Test utterances: Spectrogram and F0 contour plots

Appendix D: ETET evaluation results for individual sentences

List of Figures

2.1 Emotion resynthesis. Input emotional speech is resynthesized to possess new emotional content.
2.2 Multi-level Emotion to Emotion Transformation (ETET) system based on the modification of speech acoustic features at different time scales.
3.1 Plutchik's emotion wheel (adapted from [34]).
3.2 The functional diagram of a general text-to-speech conversion system (adapted from [40]).
4.1 Recognition accuracy for natural target files: 89.1% angry, 89.7% sad, 67.3% happy, 92.1% neutral.
4.2 Recognition accuracy for synthesized emotions: 86.1% angry, 89.1% sad, 44.2% happy, 81.8% neutral. "A", "S", "H", "N" denote anger, sadness, happiness and neutral, respectively; "p" indicates prosody and "i" indicates inventory.
4.3 Recognition rates observed for matched synthetic sentences of each emotion for female, male, native and non-native listeners. There were no significant group differences.
5.1 F1-F2 plots for sentence vowels. The top six plots are for sentence 1 vowels. Numbers indicate the position of the vowel in the sentence. Happy is represented by ?, angry is o, sad is , neutral is ∇.
5.2 Pitch contour (left) and energy contour of the sentences. The top two plots are for sentence 1; the first of these top plots shows the happy and angry (dashed) emotions and the second shows the sad and neutral (dashed) emotions.
5.3 LPC and TD-PSOLA based emotion conversion system. s[n] and t[n] are the source and target, respectively. A(z) indicates the inverse filter used for the calculation of the residual e[n]. s1[n], s2[n], and s21[n] are the outputs obtained by modifying the input signal using only TD-PSOLA, only LPC synthesis, and both LPC and TD-PSOLA, respectively.
6.1 Plots of the tag values for different emotions. Note that the order of tags is different for different parameters. The tags were sorted in ascending order based on the neutral tag averages (dotted line), without differentiating based on the position. Each symbol represents a different emotion: neutral: , anger: , happiness: 4, sadness: ♦. The figures show the main effect of POS tag type, emotion, position, and their interactions.
6.2 Shows the main effect of emotion and position. Cases where position=1, position=2, and position=1 or 2 are marked with , 4, and , respectively. Symbols N, A, H, S represent neutral, angry, happy, and sad emotions, respectively.
6.3 Displayed in these figures are the probabilities of tags having the maximum parameter value in an utterance. The probabilities were calculated by counting the cases when a tag had a maximum parameter value and then dividing this number by the total occurrences of that tag. The green line shows the mean probability of a tag having a maximum value (that is, the average over all emotion values). Also note that the tags were sorted based on that average; therefore the order of the tags on the x axis is different for each parameter. Each symbol represents a different emotion: neutral: , anger: , happiness: 4, sadness: ♦.
6.4 Probabilities that a neutral tag will have a higher value than its emotional counterpart. Note that the tags are sorted by the probabilities calculated for the energy maximum parameter. Each symbol represents the comparison results for a different parameter. In this figure, energy maximum: , energy median: ∗, F0 median: , F0 range: 4, duration: ♦.
6.5 Displayed are the probabilities of the POS tags on the x axis having greater F0 median values than the tags on the y axis. The size of the circles is proportional to the probability, that is, the largest probability is shown as the largest circle. Probabilities for each emotion are plotted in a different color and in a different line style. Neutral: blue circle, solid line; angry: red circle, dashed line; happy: magenta circle, dash-dot line; sad: black circle, dotted line.
6.6 The histogram and the approximated normal distribution curve for emotional feature differences for nouns (NN). Plots (a), (d), (g), and (j) are for Neutral−Angry; (b), (e), (h), and (k) for Neutral−Happy; and (c), (f), (l), and (i) for Neutral−Sad differences in energy maximum (a,b,c), F0 median (d,e,f), F0 range (g,h,i) and tag durations (j,k,l).
7.1 Recognition for Synthesis (RFS) system architecture. The emotion recognizer is used to assess the emotional quality of the resynthesized utterances. Based on the results, parameters for synthesis are selected and applied to add emotional quality to the input speech.
7.2 Recognition results for a sample test set with emotion recognition performance of 81.25%.
7.3 Recognition results for 5 different training and test sets.
8.1 (color online) F0 contours of all 16 utterances that were recorded. (H, A, N, S, spk, and sent denote happy, angry, neutral, sad, speaker, and sentence, respectively.)
8.2 (color online) Stylization example for the happy utterance, speaker = 1, sentence = 1. Blue circles = original F0 contour, red squares = 2 semitones stylization, black triangles = 10 semitones stylization, green dots = 40 semitones stylization.
8.3 (color online) The Gaussian emotional regions for each emotion, speaker, and sentence. x axis = F0 mean (Hz), y axis = F0 range (Hz).
8.4 (color online) Emotional regions for different speech quality (ρ) requirements for the angry emotion. The area of the emotional regions decreases as quality requirements increase. Green: all cases (same as Fig. 8.3), red: ρ ≥ 3, blue: ρ ≥ 3.5, magenta: ρ ≥ 4, black: ρ ≥ 4.5. Small circles show the resynthesized utterances. x axis = F0 mean (Hz), y axis = F0 range (Hz).
8.5 (color online) The estimated Euclidean emotional regions. x axis = F0 mean (Hz), y axis = F0 range (Hz).
8.6 (color online) The perceived Gaussian emotional regions and estimated Euclidean emotional regions. x axis = F0 mean (Hz), y axis = F0 range (Hz).
8.7 (color online) Figures (a), (b), (c), (d): The differences between the emotion recognition percentages of original and modified utterances. Happy = circle, angry = filled circle, sad = square, neutral = filled square, other = filled triangle. Figures (e), (f): The differences between the average speech qualities (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad) of original and modified utterances. Speaker 1 = circle, speaker 2 = filled square.
8.8 (color online) Relation between average quality, similarity, and percentage parameters. Note that the quality is normalized: 1 = excellent, 0.8 = good, 0.6 = fair, 0.4 = poor, 0.2 = bad. Squares () are used for similarity, circles (◦) for percentage, and (x) for quality variables.
9.1 Web based listening test structure. The first file is the reference file. It is defined as neutral with a speech quality rating of 5. The other files are the resynthesized files (one for each of the happy, angry, or sad emotions), presented in a different random order for each rater. The raters were required to select the emotion and quality for these files.
9.2 Percentages of other responses. Note that when the target emotion was happy, many raters selected the other option.
9.3 Emotion evaluation results for different modifications. (a) neutral to angry conversion, (b) neutral to happy conversion, (c) neutral to sad conversion. The emotion responses are displayed together with other responses.
9.4 The percentages of matching emotional responses for each transformation condition. For each one of the modifications, the first bar shows the percentage of angry responses for neu2ang, the second bar shows the percentage of happy responses for neu2hap, and the third bar shows the percentage of sad responses for neu2sad.
9.5 Confusion between different emotions for modification x1.
9.6 Confusion between different emotions for modification pos1.
9.7 Confusion between different emotions for modification uv1.
9.8 Confusion between different emotions for modification x1+pos1.
9.9 Confusion between different emotions for modification x1+uv1.
9.10 Confusion between different emotions for modification pos1+uv1.
9.11 Confusion between different emotions for modification x1+pos1+uv1.
9.12 Listening test results for neutral-to-angry (neu2ang) conversion. (a) Angry emotion recognition percentages. (b) Average speech ratings for utterances labeled as angry. (20% chance level, 3 = fair speech quality)
9.13 Listening test results for neutral-to-sad (neu2sad) conversion. (a) Sad emotion recognition percentages. (b) Average speech ratings for utterances labeled as sad. (20% chance level, 3 = fair speech quality)
9.14 Listening test results for neutral-to-happy (neu2hap) conversion. (a) Happy emotion recognition percentages. (b) Average speech ratings for utterances labeled as happy. (20% chance level, 3 = fair speech quality)
9.15 Happy+Other emotion recognition percentages for individual sentences.
9.16 Top figure: the current system; bottom figure: for future implementation. It can be expected that performing the prosody modifications directly on the residual signal (bottom figure) will produce better quality results.
C.1 Sentence 1: (nfjoyNew112 1.wav) See how funny our puppy looks in the photo.
C.2 Sentence 2: (nfjoyNew192.wav) You always come up with pathological examples to fancy the audience.
C.3 Sentence 3: (nfjoyNew195.wav) Leave off the marshmallows and look what you have done with the vanilla pie.
C.4 Sentence 4: (nfjoyNew213.wav) Summertime supper outside is a natural thing.
C.5 Sentence 5: (nfjoyNew228.wav) The fifth jar contains big juicy peaches.
C.6 Sentence 6: (nfjoyNew245.wav) Keep the desserts simple fruit does nicely.

List of Tables

3.1 Emotion words and their activation, evaluation (as defined by Whissell) and angle (as defined by Plutchik) values. Table is adapted from [34].
3.2 (Table continued) Summary of acoustic features for angry, happy and sad emotions, taken from [85] and [34].
3.3 The values of the acoustical parameters that were used in the Affect Editor [27].
4.1 Recognition rates in percent and average success ratings (5 = excellent and 1 = bad) for the 16 possible prosody and inventory combinations. (Note: ApHi rate is not above chance.)
5.1 Listening test results for the two original sentences (κ = 0.70, α < 0.01). Emotion categories are listed in the first column and the results are presented in percentages.
5.2 Listening test results (in 10% multiples) for some selected modifications. h, a, s, n indicate happy, angry, sad and neutral, respectively. The numbers in parentheses refer to the modification type as explained in section 5.3.1. h2a means that the source is happy and the target is angry.
6.1 List of the analyzed tags. A subset of Penn Treebank POS tags [78].
6.2 Greenhouse-Geisser test statistics for a 4-factor mixed-design ANOVA experiment are shown. Independent variables are emotion (emo, 4 levels), tag type (tag, 13), position (poz, 2) and speaker (spk, 3). Significant results are highlighted.
6.4 Contrast analysis of emotions and post hoc comparisons of POS tags. Only significant pairs are shown. Results for position=1 and position=2 are highlighted and in italic, respectively, for easy differentiation. Note that some tags are more helpful than others to differentiate between emotions. Also note that the patterns of differences between emotions are, in general, more consistent in the second half.
6.5 The estimated normal distribution mean and standard deviation values for modeling the differences in the energy maximum, F0 median and tag duration parameters of neutral(n)-angry(a), neutral-happy(h) and neutral-sad(s) POS pairs.
6.6 Listening test results for natural speech files. The first row shows the results for unmodified natural utterances. The other rows show the results for the utterances resynthesized with the angry, happy, and sad probability models. Displayed numbers show the emotion recognition percentage and the average success of the speaker in expressing the perceived emotion (in italic) (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad).
6.7 Listening test results for the kal voice from the Festival speech synthesis software. The first row shows the results for the unmodified utterances and the others for the resynthesized utterances. The numbers shown are the emotion recognition percentages and average success (in italic) as calculated from the human responses.
7.1 Confusion matrix of NN recognition results summing the test results of 5 different runs (112 test utterances in each run). Displayed are the numbers of files, and percentages in parentheses. Emotion-NN indicates the emotions recognized by the NN.
7.2 Results of listening tests with humans. Recognition percentages (average confidence) are shown. The symbols n1, n2, n3, n4, n5 represent neutral utterances that were modified. The symbols (h2, h5, h4, h2, h1) and (a5, a1, a4, a4, a4) represent the best performing happy and angry modifications, respectively.
7.3 The modification factor values that worked best for different utterances are shown. (Fm = F0 mean, Fr = F0 range, Vd = voiced duration, Ve = voiced energy, Ud = unvoiced duration, Ue = unvoiced energy.)
8.1 Summary of the performed F0 contour modifications. The values for mean and range are in Hz and the values for stylization are in semitones.
8.2 Cochran's Q statistics calculated for the emotion selection dependent variable. Significant results are in italic form.
8.3 Repeated measures ANOVA statistics calculated for the quality dependent variable. Reported are the F values for Greenhouse-Geisser tests.
8.4 Repeated measures ANOVA statistics calculated for the quality dependent variable. Reported are the F values for Greenhouse-Geisser tests. Significant results are shown in italic for easy differentiation.
9.1 The modification factor values that were used for modifying the voiced or unvoiced region prosody characteristics. (Fm = F0 mean, Fr = F0 range, Vd = voiced duration, Ve = voiced energy, Ud = unvoiced duration, Ue = unvoiced energy.)
9.2 The percentage values used to restrict the possible output values generated for POS tags. (Fm = F0 mean, Fr = F0 range, Emax = energy maximum, Dur = duration.)
9.3 Tested modifications.
9.4 Neutral to angry conversion results for different modification conditions. Emotion recognition percentages and mean quality ratings are displayed.
9.5 Neutral to happy conversion results for different modification conditions. Emotion recognition percentages and mean quality ratings are displayed.
9.6 Neutral to sad conversion results for different modification conditions. Emotion recognition percentages and mean quality ratings are displayed.
9.7 Confusion matrix of the subjective human evaluations. Displayed are the recognition percentages. For this test 25 utterances were randomly selected from each emotion category. These utterances were rated by 4 naive native English speakers. (Table adapted from [147].)
D.1 Neutral to angry conversion results for sentence 1.
D.2 Neutral to happy conversion results for sentence 1.
D.3 Neutral to sad conversion results for sentence 1.
D.4 Neutral to angry conversion results for sentence 2.
D.5 Neutral to happy conversion results for sentence 2.
D.6 Neutral to sad conversion results for sentence 2.
D.7 Neutral to angry conversion results for sentence 3.
D.8 Neutral to happy conversion results for sentence 3.
D.9 Neutral to sad conversion results for sentence 3.
D.10 Neutral to angry conversion results for sentence 4.
D.11 Neutral to happy conversion results for sentence 4.
D.12 Neutral to sad conversion results for sentence 4.
D.13 Neutral to angry conversion results for sentence 5.
D.14 Neutral to happy conversion results for sentence 5.
D.15 Neutral to sad conversion results for sentence 5.
D.16 Neutral to angry conversion results for sentence 6.
D.17 Neutral to happy conversion results for sentence 6.
D.18 Neutral to sad conversion results for sentence 6.

Abstract

Emotions play an important role in human life. They are essential for communication, for decision making, and for survival. They pose a challenging research area across diverse disciplines such as psychology, sociology, philosophy, medicine and engineering. One realm of inquiry relates to emotions expressed in speech. In this study our focus is on angry, happy, sad, and neutral emotions in speech. We investigate the speech acoustic correlates that are important for emotion perception in utterances and propose techniques to synthesize emotional speech which will be correctly recognized by human listeners. The motivation for our research comes from the desire to impart emotion processing capabilities to machines in order to make human-machine interactions more pleasant, effective and productive. Instead of generating the emotional speech from text, in our approach we start with a natural neutral utterance and modify its acoustic features to impart the targeted emotion. As shown by the analysis and recognition studies, spectral and prosodic (F0, duration, energy) parameters can be successfully used to describe and recognize emotions. In this study we utilize these acoustic parameters for emotion resynthesis and follow an experimental methodology to investigate how they should be modified in order to produce one of the angry, happy or sad emotions in human speech. Based on the experiment results, a multi-level emotion to emotion transformation (ETET) system is proposed. This is a novel system which is capable of generating good quality emotional speech.
It consists of three main components that modify speech acoustic parameters at different time scales. First, spectral conversion is applied at the phoneme level; then prosody parameters are statistically estimated and modified at the part of speech (POS) tag level; and finally, automatically selected modification factors are applied to voiced and unvoiced regions. The proposed ETET system is robust and can be easily adapted to new emotions and speakers. The field of emotional speech synthesis is a challenging new research area. We believe that the ideas, results, and discussions presented in this study will be beneficial for the rapidly developing and growing research on emotions in speech.

Chapter 1
Introduction

1.1 Motivation

We, humans, speak to express ourselves and to communicate with other people. While speaking, we normally utter words clearly and understandably, so that our listeners can make intelligible guesses about what we are trying to express. However, these guesses do not always match what we are really trying to say, because others cannot always understand how we feel. Intelligibility, that is, how recognizable the speech sounds to the audience, is certainly the most important feature that normal human speech should possess. However, hardly anyone could listen to intelligible speech devoid of variability and naturalness for long periods of time. Variability, which could be matched with the "personality" of the voice, and naturalness, which could be thought of as "how real" (or "how human") the speech is, are other important characteristics that speech should possess besides intelligibility [89].

As human beings we communicate with each other through our feelings. Feelings are expressions shaped by the experience and knowledge each human has, and every single state of a human can be related to a particular feeling, or in other words, to a particular emotion. Therefore, the role of emotions in human life should not be underestimated. Emotions are essential for decision making and for motivating someone to participate in a particular action [122]. Studies of the human brain have shown the impossibility of making appropriate decisions when the emotion-controlling centers in the brain were damaged, even if logical reasoning skills were intact [38]. Interestingly, it is also argued that success in life is more related to emotional intelligence than to IQ [105]. For all of these reasons (and many others [98]), the incorporation of emotions into applications and systems trying to model human-like behaviors and effects (like speech synthesis) is important and necessary.

1.2 Problem statement

In this work, our focus is on natural emotional human speech. We focus on three emotional categories (anger, happiness and sadness) plus the neutral state. These emotions are three of the basic emotions specified by Ekman et al. [46]. The question that we try to answer is how neutral speech should be modified in order to be perceived as emotional. In other words, we investigate the speech acoustic parameters that are effective for the perception of emotions, and propose a multi-level emotion modification system to transform neutral speech into emotional speech. The transformation is achieved by modifying the acoustic parameters at different time scales.
The requirements that need to be satisfied are that the final output should be perceived as emotional by the human raters, and that the quality and intelligibility of the transformed signal should be similar to those of natural speech.

The immediate application of the proposed emotion to emotion transformation (ETET) system will be in the field of text to speech (TTS) synthesis. The speech output generated by state of the art text to speech synthesizers [106–109] is comparable to human speech in terms of its intelligibility and quality. However, it still sounds different from natural human speech, mainly because it lacks the expressiveness and variability of human speech. In other words, regardless of what the content is, the synthesizer output is always emotionally the same, i.e., neutral, or in the emotion in which the speech corpus was recorded. The proposed ETET system can be used to post-process the synthesizer output to add a different emotional flavor to it.

1.3 Proposed approach

The system that we propose for the modification of neutral speech into emotional speech is a multi-level emotion to emotion transformation (ETET) system, where the spectral envelope, F0 contour, duration, and energy features of an input speech signal are modified at different time scales. Spectral envelope modifications are performed at the phoneme level; F0 contour, duration, and energy modifications are performed both at the part of speech (POS) tag level and at larger scales, which are based on the voiced and unvoiced regions of the input speech. This approach is suggested as a result of our research [17–22,64,73,136,147], which is described in greater detail in the following chapters. A detailed description of the proposed ETET system is given in chapter 2.

1.4 Dissertation contribution

We propose an ETET system that can be used to automatically modify the emotional content of an input utterance. It is a novel system which modifies the speech acoustic features at different time scales (phoneme, POS tags, voiced/unvoiced regions). Some of the important features of the ETET system can be listed as follows.

• It is automatic.
• It is robust and can easily be adapted to new data, new emotions, and new speakers.
• It can be easily extended to include additional modifications at new time scales.
• The probabilistic parameter generation module helps to produce different outputs at different run times, just like the production of speech in humans. Since the prosody parameter values are modified based on values derived automatically from an estimated probabilistic distribution, the final output speech has a variability analogous to natural human speech (a small sketch of this idea follows this list).
• Recognition for Synthesis: automatic parameter selection for resynthesis of emotional speech from neutral speech. Using a neural network system which models human emotion perception, the best set of prosody parameters is automatically selected and used for the modification of the prosody of voiced and unvoiced regions.
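To make the probabilistic parameter generation idea concrete, the following is a minimal sketch of how such a module could draw prosody modification values. It assumes that the differences between neutral and emotional prosody have already been modeled as per-POS-tag normal distributions (models of this kind are estimated in chapter 6); the tag names, feature names and numeric values are placeholders for illustration, not values from this dissertation.

    import random

    # Hypothetical per-POS-tag Gaussian models of neutral-to-angry differences
    # in F0 median (Hz), energy maximum (dB) and duration (scaling ratio).
    # The numbers are placeholders, not estimates from the collected database.
    DIFF_MODELS = {
        "NN": {"f0_median": (25.0, 8.0), "energy_max": (3.0, 1.0), "duration": (0.95, 0.05)},
        "VB": {"f0_median": (20.0, 6.0), "energy_max": (2.5, 1.2), "duration": (0.90, 0.06)},
    }

    def sample_modification(tag, rng=random):
        """Draw one set of prosody modification values for a POS tag.

        Because the values are sampled rather than fixed, repeated runs produce
        slightly different prosody targets, mimicking natural variability.
        """
        model = DIFF_MODELS[tag]
        return {feature: rng.gauss(mean, std) for feature, (mean, std) in model.items()}

    if __name__ == "__main__":
        for run in range(3):
            print(run, sample_modification("NN"))

Because each run draws fresh samples, repeated transformations of the same utterance differ slightly, which is the variability referred to in the list item above.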
In addition, the large emotional database that was collected and analyzed also provides an important and novel contribution. The results derived from the extensive analysis and resynthesis experiments conducted using the collected data reflect the relation between acoustical parameters (such as pitch, duration, intensity and spectral envelope) and emotions (happy, angry, sad and neutral) from a synthesis viewpoint. These results are a valuable resource for the young but fast developing research on emotions in engineering applications.

1.5 Research methodology

An experimental method was followed. First, a database of emotional speech was collected and analyzed in terms of the acoustical correlates of emotions. Next, the speech parameters were modified and new emotional speech samples were resynthesized. The results were then evaluated by conducting listening tests with human raters. Based on the responses, the algorithms for acoustical feature modification were revised and the process described above was repeated.

Note that while developing the necessary algorithms we worked with natural emotional speech collected from native speakers of American English. For data collection, the speakers (with theatrical background) were asked to read different sentences with angry, happy, sad and neutral emotions. The emotional content of the collected utterances was validated by listening experiments with naive human raters.

The reasons for developing the modification algorithms on natural speech, instead of on synthetic speech, were the following. In synthetic speech, the discontinuities between the selected concatenation units can cause quality degradations which could distract the listeners while evaluating the emotional content and assessing the performance of the modifications. In order to avoid the undesired artifacts of synthetic speech, the emotion synthesis algorithms were tested and formulated using natural speech. Importantly, we believe that if a natural speech utterance can be successfully modified, then the same techniques can also be applied to synthetic speech.

1.6 Assumptions

Ekman et al. [47] define six fundamental emotions for facial expressions. These emotions are happiness, anger, sadness, fear, disgust and surprise. Our work focuses on three of these emotions: happiness, anger, and sadness. In addition, we also use neutral speech, i.e., speech without any particular emotional quality, as a reference. For ease of description, in the rest of the document we will refer to the neutral case also as emotional. Our work is based on the following assumptions:

1. Happy, angry, sad and neutral emotions can be expressed at the sentence level.
2. Happy, angry, sad and neutral emotions lie in the space spanned by pitch, duration, energy and spectral envelope parameters.
3. These emotions can be discriminated in this acoustic parameter space.

1.7 Hypotheses

Based on the above assumptions, our hypotheses are the following.

1. The emotional content (i.e., happy, angry, sad, neutral) of any utterance can be altered by modifying its pitch, duration, energy, and spectral envelope parameters.
2. The pitch, duration, energy, and spectral envelope parameters of any neutral utterance can be modified to produce a specific emotion, i.e., one of the happy, angry and sad emotions.

1.8 Definitions

We focus on 4 emotional labels in our work. These emotions are happiness, sadness, anger and the neutral state. We consider these terms to be subjective labels of speech, in the sense that different people may label the same speech signal differently in different environments. In order to determine the emotional label for an utterance we conduct listening tests with human raters. For a particular speech signal to be labeled as happy, for instance, it is considered sufficient if it is classified as happy by the majority of raters who participated in the listening tests.

The test environment within which the emotions are evaluated consists of listening tests, where utterances are presented in a random order and listeners are asked to choose one of 5 choices. The choices presented to the listeners are happy, angry, sad, neutral and other. The option other is included to account for the additional emotional nuances that are not represented by the four concrete emotional labels. The listeners can listen to the same utterance as many times as they like before making their decision. In addition to the emotional content we also evaluate the speech quality in a similar manner. Listeners are asked to choose one of the options 1 (= poor quality), 2 (= bad), 3 (= fair), 4 (= good), or 5 (= natural speech quality).
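As a minimal illustration of how the listener responses described above could be turned into labels, the sketch below applies the majority rule to the emotion choices and averages the quality ratings. The input format is assumed here for illustration only and does not reproduce the actual test software.

    from collections import Counter

    def label_utterance(emotion_votes, quality_ratings):
        """Aggregate listener responses for one utterance.

        emotion_votes   : list of strings from {"happy", "angry", "sad", "neutral", "other"}
        quality_ratings : list of integers on the 1-to-5 quality scale
        Returns the majority emotion label (None if there is no majority)
        and the mean quality rating.
        """
        counts = Counter(emotion_votes)
        label, count = counts.most_common(1)[0]
        majority = label if count > len(emotion_votes) / 2 else None
        mean_quality = sum(quality_ratings) / len(quality_ratings)
        return majority, mean_quality

    if __name__ == "__main__":
        votes = ["happy", "happy", "other", "happy", "neutral"]
        ratings = [4, 4, 3, 5, 4]
        print(label_utterance(votes, ratings))   # ('happy', 4.0)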
Note that, as pointed out by Cowie et al. [34], "Everyday usage divides emotional states into categories that are related to time, reflecting the fact that emotional life has a definite temporal structure. Emotion in its narrow sense — full-blown emotion — is generally short lived and intense. Mood describes an emotional state that is underlying and relatively protracted. Emotional traits are more or less permanent dispositions to enter certain emotional states. Emotional disorders - such as depression or pathological anxiety - also fall within the broad category of emotional states. They may involve full-blown and protracted emotion. ...It is interesting that emotion words can refer to many categories, e.g., happy may describe a full-blown emotion, a mood, or a trait. Dealing with the temporal structure of emotional life is a difficult challenge, but an interesting one. There are obvious reasons for trying to assess whether a user is briefly angry or constitutionally bad tempered."

In this work we regard the emotional labels (happy, angry, sad, neutral) as describing short-lived, full-blown emotions.

Resynthesis is a term used to indicate that the output speech is re-generated from another speech signal. In that sense it is different from synthesis, which is usually used to indicate that the output is generated from a text input. In this thesis, we use synthesis as a synonym for resynthesis. This should not cause any confusion, since the cases where the input is text are specifically indicated or clear from the context.

Chapter 2
Emotion to emotion transformation system (ETET): A summary

2.1 Introduction

In this chapter we briefly summarize the characteristics of the multi-level emotion to emotion transformation (ETET) system that we propose for emotion resynthesis. This system is designed as a result of different experiments that will be described in detail in the following chapters. We decided to present the outline of the final system first, so that the reader can better follow the steps that led to the final system design.

The multi-level emotion to emotion transformation (ETET) system is based on the modification of the speech acoustic features at different time scales. The system was designed as a result of extensive experiments, which showed the parameters that had significant influences on the listeners' emotion perception and how they should be modified. All of these findings are combined and implemented in the ETET system.

2.2 Multi-level emotion to emotion transformation system (ETET)

The concept of emotion resynthesis is shown in Fig. 2.1. The idea is to modify the input speech signal expressing a particular emotion so that it expresses another emotion at the output. This thesis presents a method to achieve this transformation.

Figure 2.1: Emotion resynthesis. Input emotional speech is resynthesized to possess new emotional content.

The schema of the proposed ETET system is shown in Fig. 2.2.
There are 3 main blocks, denoted as 1, 2, and 3, which perform spectral envelope modifications at the phoneme level, prosody modifications at the part of speech (POS) tag level, and prosody modifications at the voiced/unvoiced region level, respectively.

The required inputs are a speech signal (an utterance), the input speech emotion, phoneme labels and boundaries, part of speech tags and boundaries, and the desired output emotion. The output is a resynthesized version of the input signal, and it possesses the desired emotion characteristics.

In its current state, the system is designed to operate on isolated input utterances, and the input emotion is assumed to be neutral. The target emotions for which the system performance was evaluated were anger, happiness, and sadness. Note, however, that the proposed ETET system is robust and can easily be trained and modified to handle longer speech segments, such as paragraphs, and to include new input and/or target emotion categories.

Figure 2.2: Multi-level Emotion to Emotion Transformation (ETET) system based on the modification of speech acoustic features at different time scales. The neutral input speech, together with its phone labels/boundaries and the POS tags/boundaries obtained from text parsing and natural language processing (semantic processing is future work), passes through (1) phone-level spectrum conversion (GMM, LSF), (2) POS tag level duration, F0 range and energy modifications (PSOLA), and (3) voiced/unvoiced region level F0 mean/range, duration and energy modifications (PSOLA), producing output speech in the target emotion (e.g., "I love my job" spoken happily).

As stated above, each one of the modification blocks (shown as 1, 2, and 3 in Fig. 2.2) is explained in detail in the following chapters. In this chapter, only a brief summary is given.
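Before the individual blocks are summarized, the following structural sketch shows one way the three modification stages could be chained in software. The data layout, function names and placeholder bodies are illustrative assumptions, not the dissertation's implementation; the actual blocks are described in chapters 5, 6 and 7 and in the appendices.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Utterance:
        samples: List[float]                       # neutral input waveform
        phones: List[Tuple[str, float, float]]     # (phone, start_s, end_s)
        pos_tags: List[Tuple[str, float, float]]   # (POS tag, start_s, end_s)

    def modify_phone_spectra(utt, target):
        # Block 1: GMM-based spectral envelope conversion per phoneme
        # (chapter 5, appendix B); placeholder body.
        return utt

    def modify_pos_prosody(utt, target):
        # Block 2: statistically estimated F0/duration/energy changes per
        # POS tag (chapter 6); placeholder body.
        return utt

    def modify_voicing_prosody(utt, target):
        # Block 3: automatically selected modification factors applied to
        # voiced/unvoiced regions (chapter 7); placeholder body.
        return utt

    def etet_transform(utt: Utterance, target_emotion: str) -> Utterance:
        """Run the three modification blocks in sequence on a neutral utterance."""
        for block in (modify_phone_spectra, modify_pos_prosody, modify_voicing_prosody):
            utt = block(utt, target_emotion)
        return utt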
2.2.1 Spectral envelope modifications

The spectral envelope modifications were included in the final ETET system as a result of our experiments [17,22,73], which showed that the spectral characteristics play an important role in emotion perception.

In [22] we showed that for the angry and happy emotions the effect of inventory was significant. In particular, for anger it was shown that spectral modifications can improve the results significantly. For sadness, the effect of spectral modifications was less obvious, indicating that spectral modifications may not be essential to synthesize sad speech. This experiment and its results are discussed in greater detail in chapter 4.

The concept of modifying the spectral characteristics at the phoneme level, from one emotion to another, was analyzed in [17], where we showed that spectral modifications must be performed in order to achieve acceptable angry or happy speech generation at the phoneme level. The results also showed that modifications of the spectral characteristics alone were not sufficient to generate the targeted emotion; they were effective only when combined with prosody modifications. However, as shown in [17], performing the prosody modification at the phoneme level can degrade the speech quality significantly. This suggests that prosody should be modified at larger scales. A detailed discussion of these results is given in chapter 5.

Another proof that emotions can be differentiated in terms of phoneme spectral characteristics was given in [73]. Using the same database that was used for the synthesis experiments, it was shown that emotions have different effects on different phonemes. In particular, it was observed that the lower vowels /ae/ and /uw/ were more affected by emotions than the high vowel /iy/. Recognition performance of about 75% was achieved using HMM models trained for different phoneme classes.

For the modification of the spectral characteristics, the spectral conversion method described in appendix B was used. In this method the spectral parameters are transformed using a linear transformation based on Gaussian mixture models (GMM), whose parameters are trained on the joint density of input and output vectors and estimated using the expectation maximization algorithm [67,68,81].
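As a rough illustration of this conversion step, the sketch below implements the standard linear mapping used with joint-density GMMs, assuming the mixture weights, means and covariances have already been trained (for example with EM). Training, time alignment and the surrounding analysis/synthesis are omitted, and the array layout is an assumption made for the sketch.

    import numpy as np

    def convert_frame(x, weights, means, covs):
        """Map one source spectral vector x to the target space with a joint-density GMM.

        weights : (M,) mixture weights
        means   : (M, 2D) joint means, source part first, target part second
        covs    : (M, 2D, 2D) joint covariances
        Implements y = sum_i p(i|x) * (mu_y_i + C_yx_i * inv(C_xx_i) * (x - mu_x_i)).
        """
        D = x.shape[0]
        # posterior p(i | x) from the source marginal of each component
        post = np.empty(len(weights))
        for i, (w, mu, C) in enumerate(zip(weights, means, covs)):
            mu_x, C_xx = mu[:D], C[:D, :D]
            diff = x - mu_x
            quad = diff @ np.linalg.solve(C_xx, diff)
            post[i] = w * np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * C_xx))
        post /= post.sum()

        # weighted sum of the per-component linear regressions
        y = np.zeros(D)
        for i, (mu, C) in enumerate(zip(means, covs)):
            mu_x, mu_y = mu[:D], mu[D:]
            C_xx, C_yx = C[:D, :D], C[D:, :D]
            y += post[i] * (mu_y + C_yx @ np.linalg.solve(C_xx, x - mu_x))
        return y

In the ETET context, x and y would be spectral envelope parameters (such as the LSFs noted in Fig. 2.2) of time-aligned neutral and emotional frames from the same speaker.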
2.2.2 Part of speech (POS) tags' prosody modifications

The part of speech (POS) level modifications were included as a result of our analysis and synthesis studies [18,19], which showed that a probabilistic model based on POS tags can be useful to account for the variability present in the prosody of natural human speech [14,32].

The interplay between the linguistic role of words and their expressive modulation in spoken language is complex and not completely understood. Research shows that different words contribute differently to the emotional quality of speech [145]. In addition to the individual effects of the words, the interaction between words in a sentence is also effective in the expression and perception of emotions [98]. Therefore it is important to include all of these factors while studying the emotional characteristics of speech. Especially for synthesis and understanding applications, integrating the models of linguistic and acoustic components is essential [4,93].

In order to represent the linguistic level information, part of speech (POS) tags can be used [4,12,93,141]. POS tags are few in number and they can be identified with high accuracy (more than 90%) with no manual intervention [31], which makes them an attractive option for linguistic modeling.

We analyzed the interaction between emotions and POS tags, and the factors influencing this interaction. The interaction between emotions and POS tags was investigated from two perspectives: statistical analysis and analysis-by-synthesis. First, statistical methods (i.e., ANOVA) were used to quantitatively analyze the relation between POS tags and emotion categories [18]. Second, using speech signal modification techniques (TD-PSOLA [82]; see appendix A), the effect of modifying prosody characteristics at the POS tag level (based on the statistical analysis results) was investigated [19]. Based on the results it was observed that although POS tag modifications alone were not sufficient for emotion resynthesis, they can be successfully used to add variability and to make the final synthesis results more natural. A detailed discussion of the performed analysis studies and synthesis results is given in chapter 6.

2.2.3 Voiced/unvoiced regions' prosody modifications

Prosody modifications based on voiced and unvoiced regions were included as a result of the experiments which showed that they can be successfully used for emotional speech resynthesis [20].

As detailed in chapter 7, modifying the prosody characteristics at the voiced/unvoiced region level has important advantages. First, the modification parameter values can be automatically estimated using emotion recognizers. Second, the quality degradations due to the acoustic parameter modifications were minimal: the average quality was approximately 4, which denotes good speech quality on the MOS scale (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad).

We used an automatic emotion recognizer as a preprocessing step to narrow down the size of the evaluation set before it is presented to human raters. The overall system consisted of a prosody modification module which generated a large number of synthetic utterances that were then evaluated using a neural network (NN) emotion recognizer. The output of the recognizer was used to select the parameter combinations performing consistently well, and only these modifications were submitted for evaluation with human subjects. The results showed that the parameters automatically predicted by the NN recognizer can be used to generate angry speech. The results for happy speech were not completely successful; however, they provided important insights about what needs to be done to synthesize happy speech. This system was not tested on sad speech, because the previous experiments showed that sad speech synthesis was possible using some deterministically specified values.
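The selection step just described can be pictured with the following sketch, in which a grid of candidate prosody modification factors is scored by an emotion recognizer and only the best-scoring combinations are kept for human evaluation. The grid values and the two callables are placeholders; the actual factors, the PSOLA-based modification and the neural network recognizer are described in chapter 7.

    import itertools

    def select_candidate_factors(neutral_utts, target, resynthesize, recognizer, top_k=5):
        """Recognition-for-synthesis style pre-selection (chapter 7 idea, simplified).

        resynthesize(utt, factors) -> modified utterance   (stand-in for the
            PSOLA-based voiced/unvoiced prosody modification)
        recognizer(utt) -> dict of emotion probabilities   (stand-in for the
            neural network emotion recognizer)
        Only the returned factor combinations would be passed on to listening
        tests with human raters.
        """
        # Hypothetical grid over F0 mean, F0 range and voiced-duration factors.
        grid = itertools.product([1.0, 1.2, 1.4], [1.0, 1.5, 2.0], [0.8, 1.0, 1.2])
        scored = []
        for f0_mean, f0_range, v_dur in grid:
            factors = {"f0_mean": f0_mean, "f0_range": f0_range, "voiced_dur": v_dur}
            score = 0.0
            for utt in neutral_utts:
                modified = resynthesize(utt, factors)
                score += recognizer(modified).get(target, 0.0)
            scored.append((score / len(neutral_utts), factors))
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]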
2.3 Conclusion

In this thesis, an ETET system that can be used to automatically modify the emotional content of an input utterance is described. It is a novel system which modifies the speech acoustic features at different time scales (phoneme, POS tags, voiced/unvoiced regions).

The system performance was evaluated by conducting listening tests with human raters. For 6 utterances spoken with neutral emotion by a female speaker (who had a degree from the USC theater school), the following modifications were applied: (1) only spectral, (2) only POS tags, (3) only voiced/unvoiced region, (4) spectral and POS tags, (5) spectral and voiced/unvoiced regions, (6) POS tags and voiced/unvoiced region, (7) spectral, POS tags and voiced/unvoiced regions. The purpose was to synthesize either anger, happiness, or sadness. The results and their detailed discussion are presented in chapter 9.

Chapter 3
Literature review: Background and related work

Emotions have been one of the most interesting and most researched topics in human history. They continue to be very popular today and will without doubt remain so, because as human beings we are emotional creatures, and every study targeted at humans should incorporate and consider emotions at some level.

Naturally, the emotion literature is huge, as it comprises many different fields, including (but not limited to) psychology, philosophy, medicine, sociology and engineering. In this chapter, our concentration will be on the methods employed for studying emotions in speech processing, with a brief introduction to the theories of emotion.

This chapter is organized in five main sections that aim to give the required background for the material presented in the next chapters. The first two sections are general introductions: in section 3.1 we outline the basic theories of emotion and mention how they can be useful for engineering studies (such as this one), and in the next section (sec. 3.2) we show how emotions can be represented in terms of activation, valence and intensity dimensions.

The following sections describe the studies of emotions in speech. Following the natural path of research, we first outline how the collection of emotional speech data is performed (in section 3.3) and then present some of the techniques used to validate whether the collected data is in fact representative of the intended emotions (in section 3.4), by discussing how evaluations are and how they should be performed. In the next section, we present various analysis results (sec. 3.5) that show how acoustical speech properties are influenced by the emotions. Applications of the analysis results are given in section 3.6, where we review the basic techniques employed in the synthesis of emotional speech and discuss their successes and capabilities.

3.1 Theories of emotion

What are emotions? How can they be defined? What is the way to study and explain them? In this section we address these questions by presenting the major theories of emotion.

This kind of theoretical study is useful for engineering because it can be referenced while defining the methods for studying emotions. For example, a research project based on the Darwinian approach will prefer to work with the basic emotions and interpret the other emotions as combinations of these basic states. On the contrary, if the study targets differences in the expressiveness of speech of different groups and cultures, it would extensively use the results of the Social Constructivist view. The best method to study emotions can be found only by an appropriate amalgamation of theory and engineering research.

To answer the question of what an emotion is, Cornelius [33] looked at emotions from four different perspectives, which he calls the Darwinian, Jamesian, Cognitive and Social Constructivist perspectives. Darwinians represent emotions as adaptive responses, which have evolved with time in a way that helped living organisms to solve certain problems that they have faced. The basic idea (that originated from Darwin) is that emotions should be analyzed in terms of their functionality and survival value. From that perspective, all organisms sharing an evolutionary past should share the same emotions as well. In adopting and supporting that view, Ekman et al. [47] define six universal fundamental categories for facial expressions. These emotional categories are: happiness, sadness, fear, disgust, anger and surprise. Although the number of fundamental emotions that have been cited varies across research studies, the common unifying point is that there are some basic or primary emotions and all other emotions are derived from them.

If that is true, we can think of each human being as a "truly emotional artist": just as a painter mixes various colors to get the desired effect, each person acts with his/her emotions. And the way each "emotional color" is formed depends on that particular person's survival needs. This also suggests that there are no concrete boundaries among different emotions; rather, one may conceptualize "fuzzy boundaries" between emotional categories.

"We are afraid because we run away, we do not run because we are afraid"; or in other words, "we feel sorry because we cry, we do not cry because we feel sorry". This means that we can view emotions as results of bodily changes. This is the Jamesian perspective [63] on analyzing the emotions. According to William James, our bodies have evolved (note the connection to the Darwinian view) to respond automatically to external and internal changes so that we can survive, and it is that response which causes us to experience emotions.

Actually, it is an interesting viewpoint to consider emotions as responses of a complex system (which we can call "black box 1") whose input is defined under the general term "bodily changes".
Now, if you think of a second system (let us call it "black box 2") that can output the required "bodily changes" when the input stimulus is a text, a speech sample, a picture, a silent movie, or a TV show (i.e., multi-modal input), then theoretically we could make people feel whatever we want them to feel by changing the inputs appropriately. Undoubtedly, such systems would have to be designed according to the target's specifications, because each target's bodily changes connected with a particular emotion would be different.

According to the cognitive perspective of emotion production, thought is enough to drive the first system (i.e., "black box 1"). Stated more clearly, the cognitive view is that thought and emotions cannot be separated and that what one feels depends on how one understands and judges the events and the environment. In other words, every emotion is connected to a specific pattern of appraisal [2], which means that if an appraisal is changed the emotion should also change. According to the supporters of the cognitive view, the organism becomes aware of the situation through appraisal and becomes ready to respond to the environment. In this perspective, the "bodily changes" that were the origin of emotions according to James have been replaced by the mental activities that occur in the brain.

Another view of the source of emotion is the social constructivist perspective. Unlike all the other theories, social constructivists tie emotions to culture and social rules [3,56]. That is a totally new outlook, because the claim is that emotion originates from an "outside" source. One example of the social construction of emotion is how people from different social groups and genders express their emotions. It is expected, and in reality it usually is the case, that a man would respond differently than a woman, or that a child would act more like other children and an adult like other adults. Even in large social groups, sometimes, we can see differences in the appraisal of particular events. For instance, a stereotypical American would not feel the same excitement as a stereotypical Turk when watching a soccer game. Here we need to underline the strong connection between the cognitive and the constructivist views: the development of thoughts is tightly connected to culture and social status.

3.2 Dimensions of emotional space

An ideal study of emotions requires all of the relevant factors, such as context, speaker, time, environment and many others, to be taken into account. This multi-dimensional nature of emotions makes them difficult to study and analyze. Because of that, reducing the number of dimensions to a manageable level should be considered. In this section we show how various dimensional markers can be used to categorize emotions. (Note that the categorization of emotions in terms of the emotional labels proposed by Ekman et al. [47], as discussed in section 3.1, can also be considered a dimensionality reduction method.)

One of the popular approaches is to represent emotions in a three-dimensional space [49], where the axes are labeled as activation (also referred to as arousal), valence (also referred to as evaluation) and intensity (also referred to as power). Activation can be thought of as a measure of how active or passive the emotion is (i.e., whether it energizes or not). In other words, it can be considered the likelihood of taking some action rather than doing nothing.
Activation is relatively easy to evaluate because it can be associated with the physiological changes occurring in the human body. For instance, in the case of anger, respiration deepens, the eyes open wider than usual, the heart beats faster, muscles tense, and there is an increase in physical strength and willingness to do something, while in the case of sadness there is little association with a will to act.

Valence indicates whether the effect is positive or negative. If the emotion "feels" good then it is positive (for example, happiness); if it "feels" bad then it is negative (for example, fear). Many emotions can be well differentiated on the valence axis, and having such a variation enables the implementation of systems (such as an automated call center application) that require differentiating positive and negative emotions [74,108] (so that the caller can be transferred to an agent if negative emotions are detected).

In the intensity dimension, emotions are compared in terms of the intensity (or power) they possess. A good example of such differentiation is the pair fear and terror: although both are negative, they differ in the power they carry. In general, evaluation of the intensity dimension is more difficult than the other two dimensions because of its highly subjective nature. In Table 3.1, valence and activation values of some words, based on the study by Whissell [145], are presented as examples. The intensity dimension can be considered the distance from the (imaginary) origin in the activation-valence space.

Another method, introduced by Plutchik [101], is to represent emotions in a circular pattern. Using Plutchik's emotion wheel, shown in Figure 3.1, it is possible to define an emotional orientation. Defining the zero degree point approximately at acceptance, the 180 degree point close to disgust, the 90 degree point at apathetic and 270 degrees close to curious, emotional orientation values for some emotional words are presented in Table 3.1 as examples.

Representing emotions in a limited number of dimensions is an effective and efficient approach for research purposes. However, it is "dangerous" as well. As pointed out in [34], "There is inevitably loss of information and, worse still, different ways of making the collapse lead to substantially different results. This is well illustrated in the fact that fear and anger are at opposite extremes in Plutchik's emotion wheel [101], but close in Whissell's [145] activation-emotion space. Extreme care, thus, is needed to ensure that collapsed representations are consistent."

Figure 3.1: Plutchik's emotion wheel (adapted from [34])

Emotion      Act.  Eval.  Angle     Emotion        Act.  Eval.  Angle
Agreeable    4.3   5.2    5         Panicky        5.4   3.6    67.7
Anxious      6     2.3    78.3      Apathetic      3     4.3    90
Sad          3.8   2.4    108.5     Puzzled        2.6   3.8    138
Disgusted    5     3.2    161.3     Dissatisfied   4.6   2.7    183
Jealous      6.1   3.4    184.7     Angry          4.2   2.7    212
Daring       5.3   4.4    260.1     Amused         4.9   5      321
Happy        5.3   5.3    323.7     Sociable       4.8   5.3    296

Table 3.1: Emotion words and their Activation, Evaluation (as defined by Whissell) and Angle (as defined by Plutchik) values. Table adapted from [34].

Labeling speech as angry, happy or any other emotion is hardly enough to describe it correctly. In order to illustrate how more descriptive information about each emotional word can be collected, Cowie et al. [36] asked subjects to select 16 words from a longer list and then to locate each word in the activation-emotion space and answer questions about what the emotional word implied.
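To make the dimensional representation concrete, the sketch below computes an intensity (distance from an assumed neutral point) and an orientation angle for a few of the words in Table 3.1. The neutral point (4.0, 4.0) and the angle convention are assumptions made only for illustration; they are not Whissell's or Plutchik's published definitions, and the resulting angles are not meant to reproduce the wheel angles in the table.

```python
import math

# Activation/evaluation values for a few words from Table 3.1 (Whissell's scales).
WORDS = {"Sad": (3.8, 2.4), "Angry": (4.2, 2.7), "Happy": (5.3, 5.3), "Apathetic": (3.0, 4.3)}

# Assumed neutral point of the scales: the "imaginary origin" mentioned in the text.
# Whissell's ratings here roughly span 1-7, so the scale midpoint is used as a placeholder.
NEUTRAL = (4.0, 4.0)

def intensity_and_orientation(activation, evaluation, origin=NEUTRAL):
    """Distance from the assumed origin (intensity) and an angle (orientation, degrees)."""
    da, dv = activation - origin[0], evaluation - origin[1]
    intensity = math.hypot(da, dv)
    # Angle measured from the positive evaluation axis; an illustrative convention only.
    orientation = math.degrees(math.atan2(da, dv)) % 360.0
    return intensity, orientation

for word, (act, ev) in WORDS.items():
    inten, angle = intensity_and_orientation(act, ev)
    print(f"{word:10s} intensity={inten:4.2f}  orientation={angle:6.1f} deg")
```

Such a toy mapping already shows why collapsed representations need care: words that are far apart on the wheel can end up with similar distances from the neutral point.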
That was an attempt to form a basic English emotion vocabulary (BEEV). Positions in the activation space were represented by two numerical values, labeled as emotional orientation and strength of emotion. Answers to the questions regarding the appraisal of the emotional words were collected under the following categories: disposed to engage or withdraw, open or closed minded, own perceived power, oriented to surroundings, oriented to other time, and oriented elsewhere.

In this section we introduced some parameters that can be employed in defining the expressive content of speech. A correct definition of expressiveness is of essential importance for the production of good quality synthetic speech. We need to point out, however, that due to the huge literature on emotion, it is impossible to cover every possible view and detail. The reader is strongly advised to refer to the documents cited within the representative bibliography of this chapter.

3.3 Collection of emotional speech data

When we consider research on emotional speech in general, probably the biggest challenge is the lack of common emotional databases. Because of that, every emotion research group records its own databases. As underlined in [42], however, the collected data usually consist of small numbers of sentences recorded under different conditions. Unsurprisingly, the reported results are strictly specific to the analyzed dataset and therefore difficult to generalize. Moreover, since these datasets are not distributed, it is practically impossible to replicate the results.

Until recently, there was only one accessible database of emotional speech collected for studying prosodic variations: the "Emotional Prosody Speech and Transcripts" database distributed by the Linguistic Data Consortium (LDC) [76]. The database consists of dates and numbers recorded by professional actors. With the increase of research on emotional speech, there is a considerable effort toward generating new emotional databases, and as seen from recent efforts, several groups are in the process of distributing their databases, which is a very promising development for future emotion-related research [42].

The problem of emotional speech databases is discussed in detail in [42], where a table of some of the available datasets is shown. In that paper, four main issues with respect to emotional database generation are underlined: the scope, the naturalness, the context of the content, and the kinds of descriptor it is appropriate to use.

The scope of a database is a term that reflects the "number of different speakers; language spoken; type of dialect (e.g. standard or vernacular); gender of speakers; types of emotional state considered; tokens of a given state; social/functional setting" [42]. Naturalness describes how the data is collected, i.e., whether actors are employed or not; we will discuss the naturalness issue in more detail below. Context is given as another important factor because of evidence that listeners' judgments are affected by the context in which emotions are presented. In [42], four broad types of context are listed: semantic context, structural context, intermodal context and temporal context.

The issue of how to describe the emotions is another very important factor. Our experiments, which will be presented in the next chapters (see Chapter 8 for example), show that using appropriate emotional labels is essential in the analysis of acoustic parameters.
We have found that when emotions are described in terms of the broad emotional labels (i.e., angry, happy, sad, neutral), there is significant overlap between the parameter values for different emotions, making it difficult to associate particular parameter values with particular emotions.

Clearly, the collected emotional databases should be easily accessible by all researchers so that improvements can be achieved in a collaborative manner. In order to analyze all acoustical feature variations present in emotional speech, a very large database representing many emotions is necessary. However, collection of such a database is both time consuming and costly. Due to these limitations, the number of emotions to be analyzed will be restricted in all cases, and the emotional categories therefore need to be carefully determined. The popular choices for emotional categories are usually among the "universal" emotions, as defined by Ekman et al. [47]. Popularly considered emotional categories can be listed as anger, sadness, happiness, frustration, surprise, disgust and content [42].

After deciding on the emotion categories to be studied, the question is how to collect the emotional data. Should the emotions be expressed in an excessive sense (i.e., full-blown) or in a normal way? Should actors be employed, or should non-trained people be used? Next, we address these questions and briefly discuss the advantages and disadvantages of each approach.

3.3.1 How to collect emotional speech data?

The basic methodology employed for data collection can be summarized as follows:

1. Purpose determination: What is the purpose of collecting the data? This is a simple but very important question, because emotional data necessary for recognition may not be suitable for synthesis and vice versa. If you are planning to build an emotion recognizer, then you will definitely need many variations of each emotion; if instead you are planning to build a limited domain speech synthesizer, then limited coverage will be sufficient. Also, the decision of whether you will be recording short (or long) sentences or short (or long) paragraphs should be made. In sum, the inventory should be designed to match the purpose.

2. Preparation of the inventory to be recorded: The data should be prepared according to the defined purpose. It should be free of any grammatical mistakes and include instructions for the actors and/or actresses to follow. Considering the different approaches to emotional data collection, in some cases there is no need for prior inventory preparation, as is the case where real-life situations are recorded [29,30].

3. Recording: Do we employ professionals or amateurs? The choice depends on the application requirements, but one should make sure that there is consistency in the recording. Inclusion of various control measures, such as detailed instructions and explanations, while designing the inventory may help to minimize subjectivity. It is also preferable for the recording environment to be noise free. Since employing professionals may be expensive, the recording process should be carefully designed so that cost is minimized. In many cases, for example, it is useful to collect multi-modal data, where facial expressions and audio are recorded simultaneously.

3.3.2 Data collection for concatenative speech synthesis

In this section we address emotional data collection specifically for concatenative speech synthesis. Two approaches are discussed.
In one of them professional actors are employed, while the other supports using data from real life, where natural interactions of people are recorded. Our concentration will be on discussing the advantages and disadvantages of each approach.

For concatenative speech synthesis [9,11] we require all of the concatenation units (that will be used in synthesis) to be recorded beforehand and available for processing. Assuming that the emotion set, i.e., the emotions that we are interested in, has been decided, the next step is to ensure complete phonetic coverage. The need for complete coverage requires that the content be designed carefully. Specifically, if the emotional recordings will consist of sentences, the sentences should be constructed appropriately; the same is true if paragraphs will be recorded. Having a specific context to record and the need to obtain specific emotional effects suggests that professional actors and actresses should be employed. In that way, we can be more confident that the target emotions have been correctly expressed. To test how the recordings are perceived, listening tests can be performed with naive listeners, and if the recognition rates are not satisfactory, the recordings can be repeated. Using professional actors to record the emotional inventory will also minimize the burden of having to subjectively classify emotions after the recording.

Another approach to emotional recording, strongly supported by Campbell [29,30], is recording real-life situations with no specifically designed scripts. The advantage of this approach is that more natural emotional expressions will be collected; the disadvantage is that a large inventory may need to be recorded due to the lack of direct control over the recorded content. In addition, the classification of the emotions in the "real life" scene becomes necessary. In other words, there is a need to go over all of the recordings and categorize the speech into emotional categories. Considering the lack of a precise definition of emotions, and the interpretation differences that may result from personal and cultural factors, it can be argued that categorization of the recorded emotions in an objective way cannot be easily achieved. However, by employing a large number of listeners and eliminating bad and confusing data, it is possible to minimize subjectivity to an acceptable level. (The need for objective evaluation of emotional data is discussed in detail in section 3.4.)

At first glance, the use of professional actors seems to be the better choice because of the relatively smaller amount of work necessary for data collection. However, the question that arises is whether the "laboratory" recordings are natural enough. Do they sound sincere to ordinary people? Certainly, the expressions of a professional actor and a regular person differ from each other, and in real life we exaggerate our emotions only in extreme situations, unlike in a play or a movie. This may be one reason why people assume that "laboratory" recordings would convey insincere feelings and that they may distract ordinary listeners. In short, the question is one of believability.

It is not difficult to think of scenarios where one approach would be preferred to the other. For example, if we are planning to use the synthesizer in a movie production, then employing professional actors would be more suitable.
However, if the synthesizer is going to be used for interacting with other people (e.g., the world-famous physicist Professor Stephen Hawking uses a speech synthesis system to talk), then voices recorded in real life may be more appropriate. Depending on the application, either approach can clearly have its advantages and disadvantages.

One essential point that also needs to be considered is the interaction between humans and computers [15,98,100,105]. That is, is it really feasible to try to make computers sound like real human beings, or would it be better if computers had no emotions? Whether computers should have exactly human-like natural communication capabilities should also be addressed in terms of its possible sociological effects. In many cases, having a sound quality that is not disturbing may be sufficient. Do we really need a machine that can imitate humans? What would the advantages and disadvantages of having such a system be? These are questions that we do not discuss further in this proposal, but they are essential questions that will become more important once that level of technology is achieved. Nevertheless, we believe that being able to synthesize natural sounding emotional speech will be useful for many applications that will be of great help to people. Some examples include talking devices for people with vision and speech disorders; instructional, self-teaching devices for car drivers and students; and book or webpage readers.

At this point, it will be useful to give an example of how data collection is performed for a generic application, specifically for limited domain synthesis. What follows is a specific case study which illustrates the issues described above.

An example of domain-specific expressive data recording, without exactly specifying the target emotions, is described in [64], where military-type speech, specifically command speech, was recorded to be used for the generation of synthetic speech for an automated Mission Rehearsal Exercise (MRE) experiment. In the example described in the paper, a real-life scenario is dramatized with the multi-modal combination of animated characters, a speech recognition engine and an expressive limited domain speech synthesizer.

The expressive speech synthesizer, called ESpeech, was designed to provide the voice of a platoon sergeant and incorporated multiple sets of units, each representing a different speaking style. Data were recorded for four expressive styles: shouted commands, shouted conversation, spoken commands and spoken conversation. The amount of the recorded inventory was determined by the needs of the dramatized scenario. The greatest vocabulary coverage was provided for spoken conversation, 316 words; the second most coverage was provided for shouted commands, 72 words.
The emotional content of the recordings can best be described through the role of the platoon sergeant. The sergeant needs to speak in a tone of voice that expresses realistic emotional reactions to the situation: distress about the condition of an injured boy, a desire to act quickly and urgently, alarm if the trainee makes a decision that may jeopardize the mission, and disappointment if the goals of the mission are not met. He needs to speak with the lieutenant in a conversational tone of voice consistent with his pedagogical role as the lieutenant's coach and advisor and his team role as subordinate officer. He also must shout orders to the troops, or shout to the trainee if the background noise becomes too loud. The qualities that the sergeant must possess in such an application suggest that professional actors should be employed, since it is hard to imagine that a regular sergeant would have all of the required qualities together. Note that for this particular example, the only way to avoid recording an actor would be to record a real military sergeant in situations resembling the desired scenario, which is very unlikely to happen.

3.4 Evaluation of emotional speech

In this section we briefly discuss the techniques commonly employed (for engineering purposes) for the evaluation of emotional speech.

Classification of natural speech according to emotional categories is one of the biggest challenges in expressive speech technology research. Automated emotion recognition systems and/or human subjects can be employed for that purpose. However, when we consider synthetic speech, the most suitable subjects are humans. Since it is targeted at humans, it is natural for the evaluation of synthetic speech to be performed by humans. Considering today's level of technology, however, it is quite surprising (and disappointing) that there is no expert system that could evaluate synthetic speech.

Employing humans as research subjects is a difficult task. First, it consumes both time and money. Second, if the subjects are not carefully selected, there is a possibility of results being highly subjective and biased. Third, since it is impossible for each researcher to use the same group of people, direct comparison of results is not possible due to factors related to the human subjects. Fourth, humans can be easily deceived by using specific psychological techniques [121].

There are several methods that can be applied to minimize the subjectivity and increase the objectivity of the experiments. The ideal approach is to employ a large number of subjects chosen from different educational and cultural backgrounds. However, it is hardly possible in practice to employ a group with such qualifications. Usually, objectivity is compromised and the number of listeners is decreased to a more manageable number. This change increases the risk of results being biased towards the views of the employed listener groups. In some particular situations, for example where a synthesis system designed for military personnel is evaluated by military people, this kind of bias may be desirable. This suggests that, when it is not possible to employ a large number of listeners, a good strategy is to use the listeners who will be interacting with the evaluated synthetic voice the most.

Another way to increase the reliability of the results is to design clever evaluation environments and experiments. The most popular method for emotional speech evaluation is the forced choice test. In this test, the listeners are presented with all of the possible choices and required to select one of them. An alternative approach is to use dummy response categories in addition to the actual ones [84]. It is arguable, however, whether these are the correct methods, because in some cases they can be regarded merely as a discrimination task rather than an identification method. This problem can be addressed by designing tests where listeners are asked open-ended questions [62], such as "What do you feel?", instead of being presented with a list of items to choose from. Also, virtual environments may be constructed where listener and system interact in a manner similar to how people interact in real life, i.e., in a conversation.
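As a small illustration of how forced-choice listening test responses are typically summarized, the sketch below tallies per-emotion recognition rates and confusions against chance level. The response data are hypothetical placeholders, not results from this thesis or any cited study.

```python
from collections import Counter, defaultdict

# Hypothetical forced-choice responses: (intended emotion, listener's choice).
responses = [
    ("angry", "angry"), ("angry", "disgusted"), ("happy", "surprised"),
    ("happy", "happy"), ("sad", "sad"), ("sad", "sad"), ("neutral", "neutral"),
]

labels = sorted({intended for intended, _ in responses})
confusion = defaultdict(Counter)
for intended, chosen in responses:
    confusion[intended][chosen] += 1

for intended in labels:
    total = sum(confusion[intended].values())
    rate = 100.0 * confusion[intended][intended] / total
    # Chance level for an N-way forced choice is 100/N percent.
    print(f"{intended:8s} recognition: {rate:5.1f}%  "
          f"(chance = {100.0 / len(labels):.1f}%)  confusions: {dict(confusion[intended])}")
```

Reporting the full confusion pattern, rather than only the correct-choice rate, is one way to make the forced-choice results easier to compare across studies.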
In short, the method of testing is essential, and it should be carefully considered when evaluating and generalizing the results. Regardless of the employed techniques, the setup, the test and the participants should be clearly detailed when presenting the results. Unfortunately, this is one of the main weaknesses of the engineering emotion literature: although human subjects are widely used, the listening test conditions are often described only in minor detail.

3.5 Analysis of emotional speech

In this section a summary of the results of several emotional speech analysis studies is presented, with the purpose of showing how emotional speech is studied and what kinds of acoustical parameters are used to describe it. Our concentration will be mainly on the angry, happy, sad and neutral emotional labels, because these are the emotions included in our research. We will mostly refer to these labels as emotions for ease of description. However, as pointed out in section 3.4, it should always be kept in mind that they are merely descriptive labels assigned (by human raters) to the speech signal we analyze. These terms are considered subjective labels of speech, in the sense that different people may make different choices [33,98,121] and therefore label the same speech signal differently. For a particular speech signal to be labeled as happy, for instance, it is considered sufficient if it is classified as happy by the majority of raters in a listening test.

Among these labels, neutral is usually used as a reference to determine the baseline against which other acoustic parameter values will be compared. It is used to describe a state where there is no particular dominant emotional condition. It can be likened to the normal way of talking, i.e., when the speaker is not emotionally charged. Like every other label, it is subjective, and every speaker may have a different neutral speech.

In the analysis papers, unfortunately, there are no explicit definitions of emotions, which makes the results extremely difficult to compare. For example, the angry emotion can have many different "shapes", which may be described by adjectives such as annoyed, hostile, impatient, intolerant, nervous and so on. All of these variations are labeled as angry. Similarly, the label sad may comprise expressions such as bored, depressed, discontented, fed up, not in high spirits, helpless, hesitant, tired, sorry, weak, worried, melancholic, nostalgic and so on. Happy may include affectionate, cheerful, eager, excited, fascinated, impressed, in high spirits, lively and so on.

The procedure commonly employed for analysis consists of (1) recording emotional data (for a review of emotional databases and the issues involved in their generation see [42]), which is either simulated or natural (CREST-ESP project [30], Reading-Leeds database [54,112]), and (2) analyzing the F0 statistics for the collected emotions. The analysis studies are mainly focused on the statistical evaluation of pitch, duration and energy, with significant emphasis on the first two features. These parameters are usually analyzed independently of each other, and their relation to the emotional content of speech is described using terms showing the directional change (i.e., increase or decrease) in their values. A general review of these parameter and emotion relations can be found in [34].
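The kind of F0 statistics referred to above can be computed directly from a frame-level pitch track. The sketch below is a minimal example, assuming a pitch tracker that marks unvoiced frames with zeros and a 10 ms frame shift; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical frame-level F0 track in Hz (one value per 10 ms frame);
# zeros mark unvoiced frames, as many pitch trackers do.
f0 = np.array([0, 0, 210, 215, 222, 230, 0, 0, 190, 185, 180, 0, 205, 240, 238, 0])
frame_shift_s = 0.010

voiced = f0[f0 > 0]
stats = {
    "F0 mean (Hz)": float(voiced.mean()),
    "F0 median (Hz)": float(np.median(voiced)),
    "F0 range (Hz)": float(voiced.max() - voiced.min()),
    "F0 std (Hz)": float(voiced.std()),
    # A crude timing-related measure: proportion of voiced frames.
    "voiced fraction": float(len(voiced)) / len(f0),
    "utterance duration (s)": len(f0) * frame_shift_s,
}
for name, value in stats.items():
    print(f"{name:24s} {value:8.3f}")
```

Statistics such as these, computed per utterance and per emotion, are what the directional descriptions ("higher mean F0", "wider range") in the studies below summarize.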
Note that since there is no standard emotional database commonly used by everyone, almost every research group records and studies its own emotional data. That is one reason why the results can vary. For instance, as stated in [42], "...The findings for hot-anger seem consistent, as do those for joy/elation. But there are inconsistencies for most other emotions and emotion-related states that have been studied at all frequently. Sadness generally seems to be marked by a decrease in mean F0, but there are cases where there is no change...", or there is "...an increase. It is often reported that fear is marked by an increase in F0 range and speech rate; but there are contradictory findings for both variables. Studies of disgust report both an increase in mean F0 and a decrease in mean F0."

3.5.1 Acoustic correlates of emotions

A summary of the relations between emotions and speech parameters is given in Table 3.2. In addition to the results in the table, below we present results from additional studies.

In [97] the prosodic acoustic properties of two semantically neutral sentences, uttered in English by one male and one female actor in angry ("cold" anger and "hot" anger), happy, sad and neutral tones, are analyzed. The results showed that hot anger and happiness have large F0 and RMS (i.e., energy) range and mean values, while these values were low for the sad emotion. Happy and angry emotions were differentiated based on the F0 contour shape, while cold and hot anger were differentiated by F0 mean. In terms of duration, it was observed that for sadness the consonants are longer and the vowels shorter than in the other emotions. The same database was also analyzed in terms of the arousal, pleasure and power dimensions by conducting listening tests with 31 normally hearing subjects [96], concluding that emotions can be described in terms of these 3 dimensions. In addition, it was observed that the arousal dimension has a significant positive correlation with F0 range, F0 mean and mean intensity for both the male and the female speaker. For the male speaker, a correlation between pleasure and F0 mean, and between power and F0 range, F0 mean and mean intensity, was also observed. For the female speaker, mean intensity correlated with the power dimension.

Another study [35] uses a system called ASSESS to analyze fear, anger, sadness, happiness and neutral emotions as they are expressed in passages which took 25-30 seconds to read. There were 40 speakers, 20 male and 20 female, aged between 18 and 69, all from the Belfast area. Based on the intensity analysis, these emotions were grouped into two broad categories: the first consists of afraid, angry and happy, and the second of sad and neutral. It was also observed that pauses were longer in happy and sad emotions. Spectral analysis showed variations for different emotions, especially for anger; however, the differences were not explicitly interpreted. The distribution of pitch height and the timing of pitch movements were also different for these emotions. It was observed that happiness and anger had wide pitch ranges, while the neutral passages were characterized by the lowest range values. In addition, for happy speech, pitch plateaus were shorter, their duration lay within a narrower range, pitch falls lasted longer, pitch rises were faster, and in general the pitch contour "was not only wide but constant".
Analyzing the emotional categories in terms of dimensional labels is discussed in [120] for the Belfast Naturalistic Emotion Database [41]. In the paper, the relation between the three dimensions (activation, evaluation, power) and acoustic parameters, which were calculated from the pitch and intensity contours and voice quality analyses generated using the semi-automatic speech analysis tool ASSESS [37], was investigated. As a result of correlation analysis, it was found that high activation was highly correlated with "high F0 mean and range, longer phrases, shorter pauses, larger and faster F0 rises and falls, increased intensity and flatter spectral slope". Valence dimension correlations were observed for negative emotions, which were associated with "longer pauses, faster F0 falls, increased intensity and more prominent intensity maxima". In the power dimension, higher power was correlated with lower F0, with reduced intensity for female and increased intensity for male speakers. Also, "for female speakers, F0 rises and falls were less steep, and F0 falls had a smaller magnitude".

The variations that occurred in the F0 contours of fear, disgust, happiness, boredom, sadness and hot anger emotions, in single and two-phrase sentences, were reported in [94]. The analyses were performed both on stylized F0 contours, to determine the variations at the sentence level, and on unstylized contours, to find fine changes that happen due to emotion. In terms of the F0 mean values, the emotions ranked as sad, boredom < neutral < disgust < anger < fear < happiness, and the F0 standard deviation comparisons were sad < fear, boredom < neutral, disgust < angry, happy. In addition, a connection between emotions and syllable-level F0 and stylized sentence F0 was described. For example, it was stated that F0 for anger "progresses on a higher level, its rises have a greater steepness, and the maximum F0 is reached later than in the neutral utterance", while for happiness the "rise of F0 begins earlier time and forms a wide rounded summit", and for sadness there were only few changes. Additionally, comparison results for microprosodic variations were also presented.

The analysis of intonation patterns for 7 emotions in Dutch speech [83] led to the conclusion that particular intonation patterns increased the perception of specific emotions; however, no specific emotion-intonation relation was presented. Also in the same experiment, the pitch range and mean were modified according to values given in previous studies. The results indicated a significant interaction between pitch mean and range values and subject responses, and between intonation shape and subject responses. Since no relation between the intonation shape and the pitch values was found, it was concluded that they were independent.

In [95] the relations between focus (initial, final), modality (statement) and emotions (happy, angry, sad, and neutral) were investigated, and it was observed that emotions had little effect on the initial, intermediate and final word F0 values. Happy and angry emotions had higher F0 values than sad, but these values did not alter the overall shape. For questions, there were no statistical differences resulting from emotion change. Questions had higher overall F0 values than statements. When analyzed in terms of sentence-level values for F0 mean and range, the order from largest to smallest was happy, angry, neutral and sad. For the initial and final focus sentences, sad speech had considerably lower F0 ranges than the other emotions.
Speech rate was found to be both sentence-length and emotion dependent. Long utterances had a faster rate than short utterances. Neutral and happy utterances were articulated faster than angry and sad ones.

In a study presented in [65], the data were collected by asking participants to complete the phrase "At this moment, I feel..." while (still) playing a computer game. As the continuation of the sentence, the participants selected one of the words "irritated, disappointed, contented, stressed, surprised, relieved, helpless and alarmed". In addition to the audio, electroglottograph, respiration, electrocardiogram and surface electromyogram (muscle tension) signals were also recorded. Based on the results, it was found that the F0 floor was lowest for bored and depressed emotions (these two emotions also had the lowest RMS energy) and highest for happy and anxious emotions. In terms of F0 range values, depressed speech had a limited range, while a broad range was observed for bored speech; happy speech also had a broad range. A limited pitch range was observed for tense, irritated and anxious speech. In addition, it was noted that glottis opening time was not significantly affected by emotions, while closing time was affected: "...for those (high arousal) emotions characterized by high F0 and high RMS energy, the glottis closes faster, as a proportion of the fundamental period".

An attempt to describe and categorize pitch contours in the framework of the Generalized Linear Alignment model [140] is described in [69]. In the analysis, foot-based pitch contours were hierarchically clustered into 6 clusters, and it was found that apart from the standard declining phrase curve, there were phrase curves consisting of an incline, an optional plateau and a decline. For the analyzed corpus, it was also found that the "continuation rise which was always assumed to be present at minor phrase boundaries was only observed in fewer than 10% of feet occuring at the minor phrase boundary". It is underlined, however, that since the results were based on a small dataset collected from one speaker, they should be interpreted with caution.

Table 3.2: Summary of acoustic features (acoustic correlates of emotions) for angry, happy and sad emotions, taken from [85] and [34].

From [85] (a summary of human vocal emotion effects; descriptions are given relative to neutral speech):
  F0 — Anger: pitch average very much higher; pitch range much wider; abrupt pitch changes on stressed syllables. Happiness: pitch average much higher; pitch range much wider; smooth pitch changes, upward inflections. Sadness: pitch average slightly lower; pitch range slightly narrower; downward inflections.
  Duration — Anger: speech rate slightly faster. Happiness: speech rate faster or slower. Sadness: speech rate slightly slower.
  Energy — Anger: higher intensity. Happiness: higher intensity. Sadness: lower intensity.
  Other — Anger: breathy, chest-tone voice quality; tense articulation. Happiness: breathy, blaring voice quality; normal articulation. Sadness: resonant voice quality; slurring articulation.

From [34] (a summary of many papers as presented in [34]; values are compared against neutral speech):
  F0 — Anger: increase in mean, median, range and variability; angular frequency curve, stressed syllables ascend frequently and rhythmically, irregular up-and-down inflection, level average pitch except for jumps of about a musical fourth or fifth on stressed syllables. Happiness: increase in mean, range and variability; descending contour line, with the melody ascending frequently and at irregular intervals. Sadness: mean and range below normal; downward inflections, long pitch falls.
  Duration — Anger: high rate or reduced rate. Happiness: increased rate or slow rate. Sadness: slightly slow.
  Energy — Anger: raised intensity. Happiness: increased intensity. Sadness: decreased intensity.
  Other — Anger: tense, breathy voice quality with heavy chest tone, blaring; falling tones, clipped speech, irregular rhythm; basic opening and closing articulatory gestures for vowel-consonant alternation more extreme. Happiness: tense, breathy, blaring voice quality; irregular stress distribution, capriciously alternating level of stressed syllables. Sadness: lax, resonant voice quality; slurring, rhythm with irregular pauses.

3.6 Synthesis of emotional speech

Speech synthesis techniques can be categorized into three major groups. These methods differ from each other in terms of the rules followed to produce speech. Formant synthesis is rule-based synthesis that allows parametrization flexibility. Concatenative synthesis is a data-driven method which can be used to produce natural sounding speech from pre-recorded units, at the expense of increased storage size and loss of flexibility. Articulatory synthesis is a model-driven approach where the human organs involved in speech production are explicitly modelled to synthesize the speech waveform.

The basic structure of a text-to-speech synthesis system is displayed in Figure 3.2. There are two main modules in the system, the Natural Language Processing (NLP) module and the Digital Signal Processing (DSP) module. Text normalization, morpho-syntactic analysis, phonetization and generation of prosody are tasks performed by the NLP module, while speech production and digital-to-analog conversion are completed by the DSP module. In the following sub-sections, a short introduction to each synthesis method is provided.

Figure 3.2: The functional diagram of a general text-to-speech conversion system (adapted from [40]).

3.6.1 Concatenative speech synthesis (Data-driven synthesis)

The most popular synthesis technique currently is concatenative, i.e., data-driven, speech synthesis. The basic advantage of this method over formant synthesis is that more human-like, natural-sounding synthetic speech can be produced. However, one must sacrifice the great flexibility of formant synthesis for the limited parametrization afforded by concatenative synthesis. Since the output is created by concatenating the pre-recorded speech units, the modification of the output is mostly at the prosodic level. While fundamental frequency and duration parameters can be adjusted relatively easily, there is limited control over "segmental" properties such as voice quality. In addition, it is hardly possible to modify the formant trajectories with concatenative speech synthesis techniques.

The reason that concatenative synthesizers can produce relatively higher quality speech than formant synthesizers is that only a limited number of modifications are performed on the recorded speech units, while in formant synthesis, speech is created entirely from the given input parameters. That is why rule-based synthesis methods are considered to be more flexible than data-driven methods.

The quality of the output in concatenative synthesizers mostly depends on the type and size of the concatenated units. Quality increases with unit size; however, with the use of longer units, the ability to produce any arbitrary desired output decreases and the size of the database increases.
Recording long speech units (words, sentences and so on) is impractical both in terms of storage and applicability. That is why the units that have been used most are diphones. It is argued that a minimum inventory of 1400 diphones could be enough to synthesize unrestricted English text. However, in order to achieve higher quality and to avoid coarticulation effects, increasing the number and variations of diphones in the inventory is necessary. Systems where we have more than one variation of the same diphone, as well as larger units, are called unit selection systems. While synthesizing speech in these systems, the units to concatenate are chosen according to calculated target and concatenation cost values.

The Festival [9,11] speech synthesis software, developed at the Center for Speech Technology Research at the University of Edinburgh, is the major software used by many research groups for concatenative speech synthesis. Freely available code written in C++ and Scheme, and detailed online documentation, provide a well supported environment for research and development. Festival supports British and American English as well as Spanish and Welsh, which are synthesized by diphone-based residual excited LPC and PSOLA [82] waveform synthesis techniques. A version of Festival called Flite (festival-lite) [10] is a small, fast, run-time synthesis system built for small embedded devices.

There are a number of research and commercial TTS systems that are also available. Some representative references include AT&T Natural Voices [106], the IBM Text-to-speech system [108], Loquendo [109], FlexVoice [110] and CHATR [71], among many others. To listen to examples of synthetic voices generated by different speech synthesizers (and for several languages), visit the website of speech synthesis examples [79].

For the production of emotional speech using concatenative speech synthesis techniques, there are two major methods one might follow. These two methods differ in terms of the inventory that will be used in the production of synthetic speech. As discussed above, in general, when the coverage is large and when there is more than one variation of the concatenation units, fewer modifications are required. In the first approach, the expressive speech is produced by modifying the prosodic features of recorded neutral speech units (or, more correctly, of only one type of emotional speech units), while in the second approach the concatenation inventory used to produce the final output consists of specifically recorded emotional speech units. Taking the second method one step further, domain-specific emotional speech units can also be used to produce high quality expressive speech, as was shown by [64].

Each of these approaches has its own merits and disadvantages. For instance, in the former approach, the size of the database will be more manageable and the synthesis will be faster than in the latter approach. However, because the database in the first approach consists of only one type of expressive units, a great deal of modification may be required at the prosodic level to produce the desired emotional effects, and increasing the number of modifications applied to the natural speech samples will degrade the quality. In the latter method, in contrast, one may avoid the necessity for significant modifications by recording many variations of the same units. This may require complex search and classification techniques.
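To make the cost-based selection mentioned above concrete, the sketch below runs a minimal Viterbi-style search over hypothetical candidate diphones, scoring each candidate with a target cost (mismatch to the desired pitch and duration) and each join with a concatenation cost (pitch discontinuity). The unit names, features and cost weights are made-up placeholders; real systems such as Festival use far richer features and costs.

```python
# Minimal sketch of unit selection as a dynamic-programming search.
# Units, features and cost weights are hypothetical placeholders.

def target_cost(candidate, target_spec):
    # Penalize mismatch between a candidate unit and the desired pitch/duration.
    return (abs(candidate["f0"] - target_spec["f0"]) / 50.0
            + abs(candidate["dur"] - target_spec["dur"]) / 0.05)

def concat_cost(prev, cur):
    # Penalize pitch discontinuity at the join between consecutive units.
    return abs(prev["f0"] - cur["f0"]) / 25.0

def select_units(candidates_per_slot, target_specs):
    """Viterbi-style search: pick one candidate per slot minimizing total cost."""
    best = [(target_cost(c, target_specs[0]), [c]) for c in candidates_per_slot[0]]
    for slot, spec in zip(candidates_per_slot[1:], target_specs[1:]):
        new_best = []
        for cand in slot:
            prev_cost, prev_path = min(
                ((pc + concat_cost(pp[-1], cand), pp) for pc, pp in best),
                key=lambda t: t[0])
            new_best.append((prev_cost + target_cost(cand, spec), prev_path + [cand]))
        best = new_best
    return min(best, key=lambda t: t[0])

# Two diphone slots, each with two hypothetical candidates.
candidates = [
    [{"name": "d-ih_a", "f0": 180, "dur": 0.09}, {"name": "d-ih_b", "f0": 230, "dur": 0.11}],
    [{"name": "ih-s_a", "f0": 175, "dur": 0.12}, {"name": "ih-s_b", "f0": 240, "dur": 0.10}],
]
targets = [{"f0": 220, "dur": 0.10}, {"f0": 225, "dur": 0.11}]
total, path = select_units(candidates, targets)
print(total, [u["name"] for u in path])
```

The same machinery is what makes an emotional-inventory approach attractive: if well-matched units exist, the search finds them and little signal modification is needed afterwards.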
In terms of database preparation time and the required processing power and storage, the former approach is obviously advantageous. But considering the developments in computing technology, computation power and storage would hardly be a real problem. The real challenge would be to achieve sufficient coverage. Considering the fact that there are numerous variations of human speech, it is almost impossible to build a database that will be able to cover all possible variations. Careful design of inventories and limiting the application domain (as in [64]) can help to overcome this problem.

There are several algorithms for prosody modification. The most commonly used ones are the Time Domain Pitch Synchronous Overlap Addition (TD-PSOLA) [82] and Multi Band Resynthesis Overlap Addition (MBROLA) [48] methods. Both of these methods have low computational costs and are simple to implement. Some of the other important algorithms are the Harmonic plus Noise method (HNM) [129,130], sinusoidal approaches [77] and HMM-based approaches [134]. In appendix A we briefly present the technical details of the TD-PSOLA method, since it is the method that we use for pitch and duration modifications.

At this point it will be useful to provide some examples of the two basic emotional speech synthesis techniques introduced in the preceding paragraphs, namely using an emotional inventory and modifying the speech parameters to produce expressive speech. In order to produce emotional speech by modifying neutral speech units, it is necessary to extract basic parameters that describe emotions. As outlined in the previous section, the multi-dimensional nature of the expressive content of speech makes it difficult to parameterize emotional speech completely. In concatenative synthesis research, we choose our parameters (i.e., dimensions) in most cases to be various prosodic features, such as pitch, intensity and duration (mostly because they can be modified without "significantly" degrading the speech quality). A summary of the relationships between basic emotional categories and prosodic features was given in the previous section, where we described the analysis results for emotional speech. Examining the results, we note, for example, that increasing the speech rate, intensity, pitch average and pitch range may be helpful to induce anger. Note that similar changes were listed for happiness (i.e., according to the analysis results, anger and happiness affect the speech parameters similarly). This clearly shows the challenge in trying to modify the speech parameters to produce a particular emotion; an illustrative sketch of such rule-driven modification follows below.

An example of using neutral diphones and triphones in order to synthesize emotional speech is presented in [86]. By modifying the parameters of BT's Laureate concatenative speech synthesizer and by applying new intonation contours for pitch and duration changes, a speech output was synthesized. The produced output was manually modified (using the Goldwave waveform editing package) to successfully synthesize emotional speech. Another interesting example, where demi-syllable-like units spoken in isolation were used, is described in [142]. In this approach, the units were recorded with different pitch values. It was found that using inventories consisting of several versions of each syllable (but with different pitch values) improved the quality of the results, since fewer modifications were performed while applying PSOLA during the synthesis (of angry, happy and neutral speech).
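The sketch below shows what a rule-driven prosody target might look like: multiplicative factors applied to a neutral utterance's F0 mean, F0 range and duration for a chosen emotion. The scaling factors are illustrative placeholders chosen only to reflect the directional trends discussed above, not values measured in this thesis; an actual modification would then be imposed on the waveform by a PSOLA-style algorithm, while here only the target contour is computed.

```python
import numpy as np

# Illustrative rule-style prosody targets per emotion: multiplicative factors
# applied to a neutral utterance's F0 statistics and duration. Placeholder values.
RULES = {
    "angry": {"f0_mean": 1.30, "f0_range": 1.6, "duration": 0.90},
    "happy": {"f0_mean": 1.25, "f0_range": 1.5, "duration": 0.95},
    "sad":   {"f0_mean": 0.90, "f0_range": 0.7, "duration": 1.15},
}

def modify_prosody_targets(neutral_f0, emotion):
    """Return a target F0 contour with scaled mean and range (voiced frames only)."""
    rule = RULES[emotion]
    f0 = neutral_f0.astype(float).copy()
    voiced = f0 > 0
    mean = f0[voiced].mean()
    # Expand or compress excursions around the mean, then shift the mean itself.
    f0[voiced] = (f0[voiced] - mean) * rule["f0_range"] + mean * rule["f0_mean"]
    return f0, rule["duration"]

neutral = np.array([0, 200, 210, 220, 205, 0, 195, 215, 0])
target_f0, dur_factor = modify_prosody_targets(neutral, "angry")
print(np.round(target_f0, 1), dur_factor)
```

Because anger and happiness call for similar directional changes, a rule table like this one alone cannot separate them; that is exactly the challenge noted above.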
Using emotional speech units to generate expressive speech is currently the most popular technique. The most important reason behind its popularity is the high quality output. There are numerous techniques that can be employed on emotional inventories to synthesize emotional speech. A popular method to obtain high quality output while working at the diphone level is to restrict the domain; in that way, a sufficient number of emotional units can be collected and the need for modification can be minimized. This was demonstrated in [64] to produce military-style speech in a training environment for a mission rehearsal exercise. A limited domain synthesizer was constructed based on the Festival [9] platform, and multiple sets of units, each representing a different speaking style, were used for synthesis.

An interesting approach based on mixing emotional and neutral units is suggested in [58]. Three emotional databases consisting of happy, angry and neutral emotions were recorded, and during synthesis (while selecting the concatenation units) the values from the Dictionary of Affect [145] (introduced in section 3.2) were also considered by modifying the target cost according to the word's affect. Based on evaluation tests with 13 raters, it was concluded that emotions are perceived as more intense when a greater number of emotional units is used for the synthesis of a sentence, and that by varying the number of emotional units used it is possible to vary the perceived emotional intensity. Another alternative for producing intermediate emotional nuances is mixing different emotional inventories and prosodies. For example, in our work, which is described in Chapters 4 and 5, the copy synthesis method was applied on "emotional" diphones to generate the desired target sentences. In copy synthesis, instead of generating the prosodic parameters (F0 and duration), they are copied from the actor's portrayal of the target sentence.

Application of copy synthesis to Spanish, reported in [80], showed that prosodic (supra-segmental) information alone was not enough to portray emotions. In [92], vowel-consonant-vowel segments were recorded for anger, sadness and joy, and these segments were then combined using TD-PSOLA to synthesize emotional Japanese speech. Another work where Japanese speech was synthesized using CHATR [61] also supports using an emotional inventory. Experiments on synthesizing German emotional speech [57] showed that fundamental frequency and duration modification were not enough to synthesize emotions and underlined the importance of emotional inventories. Increasing the parameter space by including voice quality parameters, spectral energy distribution, harmonics-to-noise ratio and articulatory precision has been shown to improve the quality of Austrian German speech [103]. A spectral interpolation approach [138], applied to soft, modal and loud voice diphone databases in German, showed that intermediate levels of vocal effort can be successfully generated. Additional details of emotional speech synthesis will be given in section 3.7, where we discuss emotional speech resynthesis techniques.

3.6.2 Formant speech synthesis (Rule-Driven Synthesis)

Formant synthesizers are rule-based algorithms that literally create the speech signal from rules. The rules are obtained from recorded speech data. While the highly robust nature of this approach is an advantage, the somewhat unnatural quality of the resulting synthetic speech is its basic drawback. Formant synthesis techniques are useful for examining the relations between different parameters and emotions.
One of the pioneering applications in emotional speech synthesis research within this approach is "The Affect Editor" developed by Janet Cahn [27]. The Affect Editor is a program which implements an acoustical model of speech and generates synthesizer instructions to produce the desired emotional effect. The acoustical model is represented by independently varying parameters that correspond to the speech correlates of target emotions. The parameters are grouped into four categories: pitch, timing, voice quality and articulation. The pitch parameters are: accent shape, average pitch, contour slope, final lowering, pitch range and reference line; the timing parameters are: exaggeration, fluent pauses, hesitation pauses, speech rate and stress frequency; the voice quality parameters are: breathiness, brilliance, loudness, pause discontinuity, pitch discontinuity and tremor; and the only articulation parameter is precision (refer to [27] for a detailed description of these parameters).

The inputs to the Affect Editor are an emotion and an utterance. The program models the input emotion by assigning a value between -10 and 10 to each acoustical parameter. Since it is a good way of visualizing the acoustical differences among various emotions, we have included a table summarizing the Affect Editor's parameters (see Table 3.3). The utterance, which is represented as a set of clauses (sentence, agent, action, object and locative) arranged in a tree structure, is examined to decide on the location of emotion effects. The Affect Editor has two output sets. One set contains information for the synthesizer enabling it to adjust prosody and voice quality, while the second output is the utterance itself, consisting of a combination of English text, ARPAbet phonemes, phoneme durations, pauses and intonation markings. These outputs served as input to a DECtalk3 formant synthesizer. The recognition rates for the emotions produced with the Affect Editor were as follows: angry 43.9%, disgusted 42.1%, glad 48.2%, sad 91%, scared 51.8%, surprised 43.9%.

Parameter                  Angry  Disgusted  Glad  Sad  Scared  Surprised
Accent Shape                 10       0       10     6    10        5
Average Pitch                -5       0       -3     0    10        0
Contour Slope                 0       0        5     0    10       10
Final Lowering               10       0       -4    -5   -10        0
Pitch Range                  10       3       10    -5    10        8
Reference Line               -3       0       -8    -1    10       -8
Fluent Pauses                -5       0       -5     5   -10       -5
Hesitation Pauses            -7     -10       -8    10    10      -10
Speech Rate                   8      -3        2   -10    10        4
Stress Frequency              0       0        5     1    10        0
Breathiness                  -5       0       -5    10     0        0
Brilliance                   10       5       -2    -9    10       -3
Laryngealization              0       0        0     0   -10        0
Loudness                     10       0        0    -5    10        5
Pause Discontinuity          10       0      -10   -10    10      -10
Pitch Discontinuity           3      10      -10    10    10        5
Precision of Articulation     5       7       -3    -5     0        0

Table 3.3: The values of the acoustical parameters that were used in the Affect Editor [27].

Similar to the Affect Editor, the HAMLET (Helpful Automatic Machine for Language and Emotional Talk) system designed by Murray and Arnott [88] alters the voice of the DECtalk synthesizer according to the simulated emotions. HAMLET was built to synthesize six emotions (anger, happiness, sadness, fear, disgust and grief) by altering the voice quality at the utterance level and prosodic parameters such as pitch and duration at the phoneme level [87,88]. EMOSYN (Emotional Synthesizer) [24] is another system that is driven by an input consisting of a list of phonemes with assigned prosody descriptors similar to the MBROLA [48] format. The output of the system, which serves as input to the KLSYN88 formant synthesizer, consists of the F0 contour, intensity, voice quality, articulatory features and vowel precision parameters.
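The abstract -10..10 settings of Table 3.3 have to be mapped onto concrete synthesizer control values at some point. The sketch below shows one simple way such a mapping could be done by linear interpolation; the parameter ranges are hypothetical placeholders and do not reproduce the Affect Editor's or DECtalk's actual mapping.

```python
# Sketch of mapping abstract -10..10 settings (as in Table 3.3) onto concrete
# synthesizer control values. Ranges below are hypothetical placeholders, not
# the Affect Editor's or DECtalk's real parameter mapping.

PARAM_RANGES = {              # (value at -10, value at +10)
    "pitch_range_hz": (20.0, 250.0),
    "speech_rate_wpm": (120.0, 250.0),
    "loudness_db": (55.0, 75.0),
}

# Abstract settings for "angry" taken from the corresponding rows of Table 3.3.
ANGRY = {"pitch_range_hz": 10, "speech_rate_wpm": 8, "loudness_db": 10}

def scale(setting, lo, hi):
    """Linearly map a -10..10 setting onto the interval [lo, hi]."""
    return lo + (setting + 10.0) / 20.0 * (hi - lo)

controls = {name: scale(ANGRY[name], *PARAM_RANGES[name]) for name in PARAM_RANGES}
print(controls)  # pitch range fully widened, near-maximum rate and loudness
```

The point of the exercise is simply that a single emotion becomes a vector of synthesizer settings, which is what makes rule-driven synthesis so easy to manipulate and study.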
In [16] emotional speech was synthesized as a voice for an experimental robot called Kismet using the DECtalk v4.0 formant speech synthesizer. The synthesized emotions were anger, calmness, disgust, fear, happiness, sadness and surprise. In that work, the rules and methodology employed for the synthesis of the emotions were adapted from Cahn's work [27]. As a result of the modifications, the synthesized emotions showed the following characteristics. (Note that these descriptions are for the synthesized speech.)

• Fear: Very fast, wide pitch contour, large pitch variance, very high mean pitch, normal intensity, breathy voice quality.

• Anger: Loud, slightly fast, wide pitch range and high variance, low mean pitch. The low pitch gives a threatening voice quality.

• Sadness: Slower speech rate, longer pauses, low mean pitch, narrow pitch range and low variance, breathy quality which makes the voice sound tired. The pitch contour falls at the end.

• Happiness: Relatively fast, high mean pitch, wide pitch range, wide pitch variance, loud, smooth undulating inflections.

• Disgust: Slow, with long pauses, low mean pitch, slightly wide range, fairly quiet, slightly creaky quality; the contour has a global downward slope.

• Surprise: Fast, high mean pitch, wide pitch range, fairly loud, steep rising contour on the stressed syllable of the final word.

A listening test conducted with 9 people showed that all emotions except fear could be recognized above chance level (the numbers in parentheses reflect recognition rates). Anger (75%) was confused with disgust or surprise/excitement; disgust (50%) was confused with sadness and anger; fear (25%) and happiness (67%) were confused with surprise/excitement; sadness (84%) was confused with anger and disgust; and surprise (59%) was confused with fear.

3.6.3 Articulatory speech synthesis (Model-driven synthesis)

Due to the poor quality of the currently available articulatory speech synthesizers (and due to the small number of ongoing research efforts, compared to the other categories), there has not been a big development in this area. Articulatory emotional speech synthesis can be achieved by defining models for the relation between articulatory organs and the acoustical correlates of emotion, and then by controlling the articulators appropriately. We believe that, although the dominant speech technology at present is data-based concatenative speech synthesis, articulatory synthesis methods hold great promise and potential for the future. Elaborate analysis and modelling of the articulators can eventually result in building systems that would be able to produce "exactly" human-like speech. Recently there has been an increased effort toward the analysis and modeling of articulators for emotional speech. One such project [75] is ongoing in our laboratory, the Speech Analysis and Interpretation Laboratory (SAIL) [91].

3.7 Emotional speech resynthesis

Speech resynthesis, which we will also refer to as speech conversion in many instances, is used to describe a modification system whose input is a speech signal and whose output is a modified version of the input signal. It is a different approach than text-to-speech (TTS) synthesis, because the input is a speech signal instead of text.

The research described in this proposal is based on the speech resynthesis technique. The main motivation for using the resynthesis technique is that it can be used as a post-processing module to modify the output of TTS synthesizers and produce the desired emotional effect.
As pointed out in section 3.6, concatenative speech synthesizers are capable of producing high quality and intelligible speech which resembles natural human speech. However, everything is database dependent. That is, emotional speech can be synthesized only if there are emotional units in the database; in all other cases the output is neutral. The idea of emotional speech resynthesis is to modify the synthesizer output, which is neutral, and convert it to emotional speech. While doing the conversion, however, it is essential that the quality, intelligibility and naturalness of the input (i.e., TTS output) speech are maintained.
To summarize, in emotional speech resynthesis the input is a speech signal (which can possess any emotion) and the output is a modified version of the input signal. Our purpose is to generate emotional speech; therefore the modifications that we perform on the input speech are tuned according to the intended emotion. The performed modifications can be at both the prosodic and the spectral level.
Emotion conversion is a novel research area which resembles voice conversion (VC) in terms of the underlying techniques, the most important distinction being that the effect of prosody cannot be ignored or understudied (because speech emotion and prosody are strongly tied to each other [34,116]), as is usually done in conventional VC algorithms, where pitch modifications are generally limited to matching the converted average pitch to the average target pitch. Applying segmental pitch modifications [137], copying the target pitch contours [139], or using heterogeneous training vectors including both spectral coefficients and normalized pitch [90] has shown that incorporating prosody modifications improves the VC results. Note that in VC the aim is to convert the voice of one speaker into another (e.g., male to female). It is for that reason that the main modifications are mostly applied to spectral envelope parameters such as MFCC [128], LSF [114,146] and LPC [67,139].
For prosody modification (for emotional speech resynthesis) there are many techniques that can be applied. Two of them are model-synthesis and copy-synthesis. In model-synthesis, there is a specific model for each emotion and the parameter values are modified according to the parameters specified by the model. In copy-synthesis, the modification parameters are extracted from a target utterance which has matching lexical content (to the input) but a different emotion. Considering that copy-synthesis needs the target utterance's parameters, this technique can be regarded as a learning algorithm for developing parametrized models for emotion resynthesis. By modifying the input sentence using target emotion parameter values, we learn whether the parameters are effective and whether they can be used for the synthesis of emotional speech. Note that an alternative method of learning model parameter values is experimenting with many different parameters and values and choosing the most appropriate ones based on the evaluation results. However, instead of experimenting with random values it is more reasonable to start with the target emotions' values and modify them if necessary. The final aim is to build a model.
Copy-synthesis experiments for happy, sad, cold anger and surprise (in Spanish) reported in [80] show that the effects of prosody modifications are different for different emotions. It was observed that sadness and surprise are more supra-segmental, while happiness and anger were described as segmental.
It was concluded that "prosodic modelling of emotional speech is not enough to make it recognizable (it does not convey enough emotional information in the supra segmental level)". Different results were achieved for Dutch speech experiments based on neutral, joy, boredom, anger, sadness, fear and indignation emotions [143]. By copying the pitch and duration of original utterances to a monotonous utterance, it was shown that emotions can be successfully synthesized by modifying pitch and duration.
The effects of F0, energy and overall duration on emotional speech in German were analyzed in [57] for joy, fear, anger, neutral, disgust and sadness. A listening test performed on sawtooth signals (constructed from the pitchmarks), which included only the prosodic information represented by F0, energy and overall duration, showed that these signals cannot be discriminated across emotions. The conclusion drawn was that "the problem of resynthesizing emotions does not lie in the synthesis as such, but in the fact that emotions are not always prosodically marked, or at least not marked enough to be easily recognizable."
In another experiment on German emotional speech, the pitch mean, pitch range, speech rate, phonation type and vowel precision features of a neutral utterance were modified to generate emotional speech using a system called "emoSyn" that generated parameters to drive the KLSYN88 formant synthesizer [24]. Modified utterances were evaluated by 30 subjects who had to select one of the neutral, fear, anger, joy, sadness or boredom labels. Based on the evaluation results, "utterances were judged as fearful when they had a high pitch, a broad range, falsetto voice and a fast speech rate". For joy, although the recognition rates were low, it was observed that "a precise articulation enhances a joyful impression and an imprecise one reduces it"; moreover, it is noted that "a broader pitch range and a faster speech rate as well as modal or tense phonation are more often judged as joyful than the other characteristics". For boredom, a lowered F0 mean, narrow F0 range and breathy or creaky voice were effective. Sadness was "revealed by a narrow range and a slow speech rate as well as a breathy articulation." It was observed that high pitch values caused utterances to be classified as sad as well; the sadness in these high-pitch cases was indicated to be "crying despair". Anger was mostly characterized by timing and voice quality modifications; as stated in the paper, "a faster speech rate and tense phonation is judged by the majority as angry". In summary, the conclusion was that a neutral sentence can be converted into emotional utterances that can be recognized as well as emotional speech spoken by actors. The generated output files and many other examples are available online at "http://emosamples.syntheticspeech.de/" [23].
Another copy-synthesis experiment on German speech [118] investigates whether emotions can be synthesized without voice quality modifications. The conclusion reached was that they can be synthesized with "relatively convincing" rates and that more careful interpretation of the results is required because the contributions of duration and intonation varied across speakers.
In a paper [45] where the limitations of concatenative speech synthesizers are underlined, a copy-synthesis experiment for anger, happiness, fear, sadness and boredom emotions (expressed in English) was conducted. The results showed that resynthesized emotions can be recognized with 40-60% accuracy.
Comparing these rates with the 80-100% recognition rates achieved for natural speech, it was hypothesized that possible causes for the reduction in the rates may be that (1) "the chosen prosodic parameters do not carry sufficient information to clearly identify the emotion", (2) "lack of control of voice source characteristics is confounding perception of emotion", or (3) "the speech modification process is introducing excessive distortion for the prosodic parameter range required". Based on these, it was stated that synthesizing good quality emotional speech by doing time-domain modification may be limited in terms of the achievable success. It is interesting how the results are interpreted, because by doing resynthesis, recognition rates of 68%, 38%, 56% and 58% were achieved for neutral, anger, happiness and boredom, respectively. Note that these results were higher than the chance level, which was 16.6%, and such results are generally interpreted positively in the emotional speech synthesis literature. For instance, in [86] results in the range of 2-3 times higher than chance level were considered as indicators that "vocal emotion can be implemented successfully in concatenated speech, and possibly to a more realistic degree than in formant-based speech". Note, however, that these results were based on a small test set and the need for their verification through more detailed experiments was therefore suggested in the paper.
An implementation of a unit selection approach for F0 modeling in the Festival speech synthesis system and its application to emphasis generation was presented in [104]. Selecting the F0 contours from a contour database was shown to produce better results than using hand-written rule-specified F0 contours. For labeling of the different F0 contours, word emphasis, accent, stress, syllable position, nature of the following syllable break, syllable structure and position in syllable were taken into account. Considering that there are variations in the F0 contours of emotions [69], this kind of F0 contour selection approach can also be applied to emotional speech synthesis.
As can be observed from the presented references, the results vary. They vary for several reasons: (1) speaker dependency, (2) emotion dependency, (3) language dependency, and (4) interpretation dependency. The interpretation differences (of the same results) are due to the intended applications. For instance, while for analysis purposes recognition rates higher than chance level may be sufficient, the results still may not be suitable for use in real-life human-computer conversation systems. However, despite the differences we can summarize the results as follows: (1) In general, pitch and duration modifications alone were not completely sufficient to synthesize all types of emotions, such as variations of anger and happiness. Sadness and fear and their variations were well synthesizable by pitch and duration modifications. (2) The results varied from speaker to speaker. That is, conversion of the emotions of some speakers gave better results than others. (3) Energy modifications were mostly performed by modifying the mean energy at the sentence level. The intensity contour was not paid any specific attention. We believe that modifying the intensity contour is necessary and that it will improve the results for emotions such as happiness and anger. In general, it was observed that prosody modifications changed the emotional content of the original utterances, i.e., they introduced new emotional nuances.
In some instances, these modifications made the output sound very different from the input while in others the output was resembling to the input. This is an impor- tant result, because it is a clear indication that modifying prosody can be used to modify emotional content of speech. Reasonably, we believe that when combined 58 together with the additional parameters such as intensity contour, spectral mod- ifications and sentence syntax and grammar considerations they can be used to successfully synthesize a specific emotion. 59 Chapter 4 Expressive speech synthesis using a concatenative synthesizer The content of this chapter was published in ICSLP 2002 [22]. 4.1 Introduction It is usually rare to mistake synthetic speech for human speech. The complex nature of human speech, which comes from the fact that it varies depending on the speaking style and emotion of the speaker, makes it difficult to be imitated by synthetic speech. Intelligibility, naturalness and variability are three features used to compare synthetic speech with human speech [89]. In terms of intelligibility, which is a measure of how understandable speech is to humans, recent research has shown thatsyntheticspeechcanreachintelligibilitylevelsofhumanspeech. However,lack of variability, representing the changes in speech rate, voice quality, and natural- ness, i.e. how ”human” a synthesizer sounds, has impeded the general acceptance ofsyntheticspeech,especiallyforextendedlistening. Incompleteknowledgeofhow factors such as the type of the material read, behavior of the audience, speaker’s social standing, attitude and emotions, affect the speech signal has been a major problem in building more ”human” sounding systems. 60 The need for human sounding text-to-speech synthesis (TTS) comes from the fact that it can greatly enhance applications based on human-machine interaction and make them simpler and more compelling. Imagine that a pleasant voice is reading your e-mails, web sites and books for you. When you have a question, you can ask a ”virtual teacher” that adapts his voice depending on the topic and the natureofyourresponsesandquestions. Then, youcanplaygamesandwatchfilms without realizing that you are hearing a synthetic voice. Early attempts at imparting emotional quality to synthetic speech were based on rule-based TTS, including the pioneering efforts of Cahn [27]. The lack of naturalness in the speech synthesized using such schemes however poses a serious drawback. Inthispaperwedescribetheproductionofsyntheticspeechbyconcate- nation of ”emotional diphones” using Time-Domain Pitch Synchronous Overlap Addition (TD-PSOLA) [82] as the concatenation method. A similar approach has been applied to synthesize emotional speech in Spanish [80] [8]. Listeners’ recog- nition of emotion for Spanish showed that prosodic (supra-segmental) information alone was not enough to portray emotions. It was found that supra-segmental information characterized sadness and surprise while segmental components were dominant for cold anger and happiness. Studies on German emotional speech [57] [6] showed that prosodic parameters, fundamental frequency and duration, were not enough to synthesize emotions. Increasing the parameter space by including voice quality parameters, spectral energy distribution, harmonics-to-noise ratio and articulatory precision has been shown to improve the recognition results for emotional Austrian German speech [103] [13]. 
Experiments on synthesizing emo- tional speech using Japanese emotional corpora with CHATR [60] [7] also support using an emotional inventory to synthesize emotional speech. 61 4.2 Database collection For the purpose of emotional speech synthesis reported in this paper, we chose to work with four target emotional states: anger, happiness, sadness, and neutral. First, we constructed five emotionally unbiased target sentences, i.e., sentences suitable to be uttered with any of the four emotions. The sentences we prepared were ”I don’t want to play anymore”, ”She said the story was a lie”, ”It was the chance of a lifetime”, ”They are talking about rain this weekend” and ”OK, I’m coming with you”. Target diphones necessary to synthesize the five target sentences were deter- mined. Next, four different source text scripts (one for each emotional state) were preparedforrecording;eachsourcescriptincludedallthetargetdiphones. Ouraim in preparing the source text was to build emotionally biased sentences that could easily be uttered with the required emotion. The source sentences were declara- tive and on average 7 words long. The five target utterances were also included in each of the four inventories. In order to motivate and focus the speaker, each of the source sentences was accompanied by a one or two sentence scenario. These brief scenarios were prepared for eliciting happy, sad and angry inventories, and were not used for the neutral sentences. For example, ”This is a wonderful life” and ”I earned fifty million bucks” were scenario and source sentences, respectively, belonging to the happy inventory. Having such elicitation scenarios also helps to minimizetheinterpretationvariations[95]thatmayresultfromspeakertospeaker. It also increases the probability of getting the same effect from different speakers or from the same speaker at different times. This assumption is closely related to the cognitive emotion definition perspective that every emotion is associated with a particular appraisal [33]. 62 In this paper we give results based on recordings obtained from a semi- professional female actress. A total of 357 source utterances (97 angry, 107 sad, 97 happy, 56 neutral) were recorded. The recordings were made in a sound-proof room at 48kHz sampling rate using a unidirectional, condensor, head-worn B&K capsulemicrophone. Forsynthesis, allfileswerelaterdownsampledto16kHz. The phonetic segmentation and alignment was first performed automatically with the Entropics’ Aligner software that used a phonetic transcription dictionary prepared at AT&T Labs-Research. Labeling for all sentences was manually checked and corrected when necessary. 4.3 Synthesis of emotional sentences In most studies, human emotions are categorized into basic distinct categories such as anger, happiness, sadness, fear, disgust, surprise and neutral. Although this approach is correct, especially from a pragmatic sense, it can be regarded as a somewhat gross representation because it ignores the variations that exist within each emotion. For example, both hot-anger (rage) and cold-anger (hostility) are treated under the same category, although they show different acoustical and psy- chological characteristics; similar examples can be provided for other emotions. The lack of a complete formal definition for each emotion and variations result- ing from gender, personality and cultural differences (see Social Constructivist perspective [33,115]) makes it impossible to account for every small variation. 
Despite some disadvantages, interpreting a perceived emotion as one of the seven basic emotions has the major advantage of the Darwinian perspective [47], which holds that there are certain universal basic emotions, and all other emotions can be derived from them.
For this experiment we decided to test the possibility of producing basic emotions by mixing prosodic information and diphones belonging to distinct emotional states. Interpreting "set 1" as comprising prosodic information corresponding to angry, happy, sad and neutral target sentences and "set 2" as the diphone inventory for angry, happy, sad and neutral sentences, we produced 80 synthetic sentences by combining "set 1" and "set 2" properties for the five target sentences:

set 1 = Prosody of (angry, happy, sad, neutral) target sentences
set 2 = Inventory of (angry, happy, sad, neutral) sentences
set 1 x set 2 x (5 target sentences) = 80 synthetic sentences

For the synthesis of these 80 sentences, the Festival Speech Synthesis System [9] provided a simple method of diphone concatenation using an implementation of TD-PSOLA [82] that produces good quality synthetic speech with easy modification of pitch and duration and a low computational load. In the generation of the 80 synthetic test utterances, there were three basic steps: analysis, modification and concatenation. In the analysis step, the required prosodic (i.e., pitch and duration) information was calculated from the target sentences. The speech segments extracted from the source (inventory) sentences were then modified to match the prosodic target data and, finally, were concatenated. Diphones, selected manually, were the basic concatenation units used in this experiment.

4.4 Results

Web-based listening tests were conducted with 33 adults who were unaware of the identity of the test stimuli. Fourteen participants (5 of 8 females and 9 of 25 males) were native speakers of English and 19 were nonnative. The listeners were allowed to play the test files as many times as they wished, and were asked to choose for each the most suitable emotion among angry, happy, sad and neutral options in a "forced choice" task. They also rated the success of expressing the emotion they had selected along a 5-point scale: excellent (5), good (4), fair (3), poor (2), bad (1). A total of 100 files, consisting of 80 synthetic sentences and 20 original (recorded) target sentences, were presented in a random order that was different for each listener.
Figures 4.1 and 4.2 show, respectively, listeners' emotion recognition rates for the 20 original sentences and for 20 "matched" synthesized sentences in which both prosody and inventory were extracted from the same emotion. In the figures, the suffix "-L" marks listeners' choices; for example "Angry" indicates the intended emotion, while "Angry-L" represents what emotion listeners heard. Emotions are represented by their first initials, "p" indicates "prosody" and "i" indicates "inventory"; for instance ApAi represents the matched synthetic sentences that used Angry prosody information and the Angry inventory.

Figure 4.1: Recognition accuracy for natural target files: 89.1% Angry, 89.7% Sad, 67.3% Happy, 92.1% Neutral.

Recognition rates for all original sentences (Figure 4.1) and for all matched synthetic sentences of each emotion (Figure 4.2) are significantly above the 25% chance level (as tested by one-sample t-tests).
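For reference, the chance-level comparison can be reproduced with a one-sample t-test of per-listener accuracies against the 25% guessing rate. The sketch below uses made-up accuracy values, since the individual listener responses are not reproduced here; only the form of the test follows the text.

```python
# Sketch of the chance-level check: one-sample t-test of per-listener recognition
# accuracy against the 25% guessing rate. The accuracies below are hypothetical;
# the actual test used the responses of the 33 listeners.

from scipy import stats

per_listener_accuracy = [0.80, 0.90, 0.85, 0.75, 0.95, 0.70, 0.88, 0.92]  # hypothetical

t_stat, p_two_sided = stats.ttest_1samp(per_listener_accuracy, popmean=0.25)
# One-sided p-value (testing "above chance"), valid here because t_stat > 0.
print(f"t = {t_stat:.2f}, one-sided p = {p_two_sided / 2:.4g}")
```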
Recognition accuracy for original and matched synthetic sentences was analyzed using a repeated-measures ANOVA. There were significant differences in recognition accuracy among the four emotions: recognition rates observed for the Happy set were significantly lower than rates for the other three emotions in both the original and matched synthetic sets; either the speaker was relatively less successful in expressing happiness, or happiness is more difficult to recognize in isolated utterances. For the matched synthetic sentences, recognition accuracy for the Sad set was significantly higher than for the Neutral set. Happy and Neutral emotions were recognized more accurately in original sentences than in matched synthetic sentences, but Angry and Sad recognition rates were equivalent between original and matched synthetic versions.

Figure 4.2: Recognition accuracy for synthesized emotions: 86.1% Angry, 89.1% Sad, 44.2% Happy, 81.8% Neutral. "A", "S", "H", "N" denote Anger, Sadness, Happiness and Neutral, respectively; "p" indicates prosody and "i" indicates inventory.

It is difficult to attribute the lower recognition of Happy and Neutral emotions in matched synthetic sentences than in the original sentences to the same cause. We can deduce that the lower recognition rate for synthetic happiness is in part due to the less successfully conveyed intended emotion of happiness in the Happy original target sentences and inventory. However, the same explanation does not hold for the Neutral emotion, the recognition accuracy for which in the original sentences was the highest of the four emotions.
Recognition rates and Average Success for the 16 different contour and inventory combinations are presented in Table 4.1. The two measures were significantly correlated (r = 0.60). Average Success was calculated by weighing excellent, good, fair, poor and bad responses by 5, 4, 3, 2 and 1, respectively.
According to Table 4.1, combinations of the Angry inventory (Ai) with Angry prosody (Ap), Neutral prosody (Np) and Happy prosody (Hp) were classified, in most cases, as "angry", with the highest rate for the matched combination ApAi. Combinations of Sad prosody (Sp) with all other inventories were recognized as "sad" with an average of 80.6% accuracy. Synthetic sentences produced by employing Neutral prosody (Np) and Neutral, Happy and Sad inventories (Ni, Hi, Si) were recognized as "neutral", while NpAi was recognized as "angry" the majority of the time. Results for "happiness" show that it was mostly mistaken for "sadness" or a "neutral" emotional state. We observe that the most successful recognition results were achieved for matched synthetic sets, i.e., when inventory and contour belonging to the same emotion were used together. It is also interesting to note that the combination of Neutral prosody and Angry inventory (NpAi) was recognized mostly as "anger", and the combinations of Sad prosody and Neutral inventory (SpNi) and Happy prosody and Neutral inventory (HpNi) as "sadness".
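The Average Success measure and its correlation with recognition rate can be computed as sketched below. Only the 5-to-1 weighting scheme follows the text; the per-stimulus rating counts are hypothetical, and the two short arrays simply take the dominant-response rate/rating pairs of a few Table 4.1 rows to illustrate the correlation computation, not to reproduce the reported r = 0.60.

```python
# Sketch of the Average Success rating and its correlation with recognition rate.
# Rating counts are hypothetical; the arrays below use a handful of Table 4.1
# values purely to illustrate the computation.

import numpy as np
from scipy import stats

WEIGHTS = {"excellent": 5, "good": 4, "fair": 3, "poor": 2, "bad": 1}

def average_success(counts: dict) -> float:
    """Weighted mean of the 5-point success ratings for one stimulus set."""
    total = sum(counts.values())
    return sum(WEIGHTS[label] * n for label, n in counts.items()) / total

ratings = {"excellent": 10, "good": 12, "fair": 6, "poor": 3, "bad": 2}  # hypothetical
print(round(average_success(ratings), 2))

# Correlation between the two measures (illustrative subset of combinations).
recognition = np.array([86.1, 63.0, 59.4, 89.1, 89.1, 82.4, 61.8, 46.7])
avg_rating  = np.array([4.1, 3.7, 3.4, 3.7, 3.6, 3.2, 3.2, 3.3])
r, p = stats.pearsonr(recognition, avg_rating)
print(f"r = {r:.2f} (p = {p:.3f})")
```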
        Angry-L    Sad-L      Happy-L    Neutral-L
ApAi    86.1-4.1   1.2-3.0    6.1-3.1    6.7-2.7
NpAi    63.0-3.7   3.6-3.2    1.2-3.0    32.1-3.2
HpAi    59.4-3.4   15.8-2.7   11.5-2.7   13.3-2.7
SpSi    2.4-3.3    89.1-3.7   4.8-2.6    3.6-2.8
SpNi    0.0-0.0    89.1-3.6   6.7-2.7    4.2-2.3
SpHi    1.8-3.0    82.4-3.2   11.5-3.3   4.2-2.1
SpAi    28.5-3.3   61.8-3.2   3.0-2.4    6.7-2.3
HpSi    15.2-3.3   46.7-3.3   23.6-3.2   14.5-3.1
ApSi    32.1-3.2   37.6-3.0   7.9-2.8    22.4-2.8
ApNi    15.2-2.9   35.8-2.7   17.0-2.8   32.1-3.0
HpNi    7.3-3.5    35.2-3.2   34.6-3.2   24.2-3.2
HpHi    10.3-2.9   27.3-3.0   44.2-3.0   18.2-3.1
ApHi    20.6-3.0   25.5-3.1   29.7-3.2   24.2-3.0
NpNi    3.0-3.2    10.9-3.3   4.2-3.0    81.8-3.5
NpHi    10.3-2.8   9.7-2.7    8.5-2.9    71.5-3.3
NpSi    13.3-3.5   17.6-3.1   2.4-3.7    63.7-3.2

Table 4.1: Recognition rates in percent and average success ratings (5=excellent and 1=bad), written as "rate-rating", for the 16 possible prosody and inventory combinations. (Note: the ApHi rate is not above chance.)

Results based on the gender and language background of listeners are illustrated in Figure 4.3. Although the experiment was not designed to systematically study possible group differences, ANOVA results yielded no significant gender or language effects on recognition accuracy of either the original or the matched synthetic sentences.

Figure 4.3: Recognition rates observed for matched synthetic sentences of each emotion for female, male, native and non-native listeners. There were no significant group differences.

4.5 Discussion

The recognition results presented in Figure 4.2 and Table 4.1 show that anger, sadness and happiness can be synthesized fairly successfully by applying concatenative synthesis techniques. Recognition rates achieved for anger and sadness suggest that they were easier to synthesize than happiness. Although not directly comparable because of differences in the set of emotions and languages used, these results agree with those presented by other researchers [1,57,60,80].
As seen in Table 4.1, taking the inventory of one emotion and mixing it with prosodic information of a different emotion gave lower recognition results than when the inventory and contour combination belonged to the same emotion. Based on these results, the ideal way to portray a particular basic emotion appears to be to use a separate database and separate prosodic models for each emotion.
We also observe that the ApAi, NpAi and HpAi combinations were recognized as "anger". This suggests that segmental components (which include vocal quality and phonetic characteristics) were dominant in synthesizing anger. From the recognition results for SpNi, SpSi, SpHi and SpAi, we conclude that supra-segmental information determined "sadness". Listener ambiguity in the recognition of "happiness" prevented us from drawing similar conclusions. These results generally agree with experiments on synthesizing emotions by mixing diphones and prosody for the Spanish language [80] [8], where anger and happiness have been classified as segmental emotions, and sadness as a prosodic emotion.
It is also seen (Table 4.1) that for most mismatched prosody and inventory combinations, the two emotions recognized with highest accuracy were the two used in the combination in question. For example, for NpAi, "anger" and "neutral", and for SpHi, "sadness" and "happiness" were the most frequently recognized emotions. This shows that both prosody and inventory are important in conveying emotions. The most common feedback given by our listening test subjects was that some sentences conveyed different flavors of emotions than the ones listed as choices.
This kind of listener feedback is promising and exciting because it is consistent with the Darwinian approach [33,47] that all emotions can be derived from basic emotions. Future listening tests where listeners will be given an opportunity to choose among a larger set of emotions will be helpful in validating this hypothesis. A better understanding of this issue may reduce the need to record a separate inventory for each derived emotion. The difficulty in expressing happiness for both original and synthesized sen- tences indicates the need for new experimental approaches. In addition to the text scenarios, use of visual aids such as pictures, videos, sounds, may help the actor/actress to express the required emotion more successfully. Since synthetic utterances depend on inventory, it is hoped such techniques will improve the arti- ficial expression of happiness. 70 4.6 Summary and conclusions Demand for more ”human-sounding” speech synthesis has created the need to synthesize emotions. In this paper we show that by using separately recorded inventories for anger, happiness, sadness and neutral emotions, and basic diphone concatenationsynthesiswithTD-PSOLAwithintheFestivalSystem,someofthese synthesized emotions can be reliably recognized by listeners. The recognition rate forangerwas86.1%with4.1AverageSuccessrating(max=5), forsadness, 89.1% with 3.7, for neutral emotion, 81.8% with 3.5, and for happiness, 44.2% with 3.0. Happinesswasthemostdifficultemotiontoconveywitheithernaturalorsynthetic speech. Segmental information was dominant in conveying anger, while prosody best characterized sadness and neutral emotion. Different combinations of inven- tory and prosody of basic emotions may provide synthesis of various intermediate emotional nuances. This is a topic of future research. 71 Chapter 5 Investigating the role of phoneme-level modifications in emotional speech resynthesis The content of this chapter was published in Eurospeech 2005 [17]. 5.1 Introduction Emotion resynthesis (or conversion) is an adaptation technique where the input emotional speech is modified so that the output speech is perceived as conveying a new emotion. The parameters of the input speech emotion are adapted to the targetemotionandthenthefinaloutputisresynthesizedusingthenewparameters. Emotion conversion is a novel research area which resembles voice conversion (VC) in terms of the underlying techniques, the most important distinction being that the effect of the prosody can not be ignored or understudied (because the speech emotion and prosody are strongly tied to each other [34,116]), as it is usually done in the conventional VC algorithms where pitch modifications are generally utilized only by matching the converted average pitch to the average target pitch. Applying segmental pitch modifications [137], copying the target 72 pitch contours [139], or using heterogeneous training vectors including both spec- tral coefficients and normalized pitch [90] has shown that incorporation of prosody modifications improves the VC results. Motivated by the fact that distinct emotional coloring is also present at the phoneme level (especially for back and low vowels), as shown in our own recent emotionalspeechanalysisstudies [73,147],inthischapterweinvestigatetheappli- cability of segmental modification of duration, pitch, energy and spectral envelope parameters for the resynthesis of four emotions, happy, angry, sad and neutral. 
We study the effect of each parameter on emotion perception, when applied on various source-target emotional pairs using TD-PSOLA [82] and LPC synthesis methods, to show that by modifying phonemes’ acoustic features we can change the emotional content of the whole sentence. Although here we tested only at phoneme-level, such segmental modifications can be extended to diphones, sylla- bles and words, and they can be integrated in concatenative speech synthesizers as pre-processingofconcatenationunitsbeforesynthesis,withthepurposeofimprov- ing the synthesized emotion’s quality for the whole sentence. For data collection we recorded sentences uttered by a professional actress who was instructed to produce four full-blown emotions, anger, happiness sadness and neutral,forthesentencesofidenticalcontent. Thevalidityofwhethertherecorded data can be correctly identified in terms of its emotional content was validated by conducting listening tests with 10 naive raters. The same approach was followed for the evaluation of the final resynthesized output utterances as well. Considering the fact that the interpretation of emotions can sometimes differ due to personal, cultural and many other experiences, it is not unusual that the raters will disagree ontheemotionalcontentofapresentedutterance. Aspointedoutinmanystudies, the boundaries between emotions are fuzzy and during the evaluation this fact 73 should be taken into account [28,126]. A good review of these emotion research issues can be found in [116]. In the rest of the paper, we first describe the used dataset and then introduce the proposed conversion system in section 5.3. Evaluation results are presented in section 5.4 and discussed in section 5.5. 5.2 Dataset description The sentences that we used for this study are (1) ”This is some union we’ve got” and (2) ”The store closes at twelve”. Formant, pitch and energy plots for these two sentences are presented in Figure 5.1 and 5.2. In terms of the pitch contour and energy plots (Figure 5.2) we see that anger is somewhat similar to happiness and sadness is similar to neutral. The plots of the first two formants (Figure 5.1) for the vowels show that the vowel formants vary based on sentence emotions. 5.3 System description A schematic diagram of our emotion conversion system is shown in Figure 5.3. All of the modifications are performed on the source signal (s[n]), based on the features extracted from the target signal (t[n]). For pre-processing, the source and target speech are normalized to the same intensity level (i.e., 70db) and the label boundaries for each utterance are manually extracted. Next, we cal- culate the pitchmarks of both signals and use the target pitchmarks to modify the pitch contour and duration of the source by applying the TD-PSOLA [82] algorithm. The energy modification is performed next, by normalizing the power of the TD-PSOLA output, so that it will match the target signal’s power (i.e., Output = Output∗(targetpower/outputpower) 0.5 ). 
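The energy modification just described amounts to a simple power matching between the prosody-modified output and the target signal. A minimal sketch of that single step is given below (NumPy assumed; the surrounding TD-PSOLA and segmentation context is omitted).

```python
# Sketch of the energy normalization step: scale the prosody-modified output so
# that its mean power matches the target's, i.e.
#     output = output * (target_power / output_power) ** 0.5

import numpy as np

def match_power(output: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Scale `output` so that its average power equals that of `target`."""
    output_power = np.mean(output.astype(np.float64) ** 2)
    target_power = np.mean(target.astype(np.float64) ** 2)
    if output_power == 0.0:
        return output  # silent segment; nothing to scale
    return output * np.sqrt(target_power / output_power)
```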
The new output, s1[n], is the prosody-modified version of the source. For spectral envelope modification, we calculate the linear prediction coefficients (LPC) using a prediction error filter (A(z)) of order 16 and Hanning-windowed 20 ms long frames interlaced with 50% overlap. The pre-emphasis coefficient was set to 0.9. To change the spectral characteristics of the source, we filter the source residual (es[n]) with the inverse of the target error filter to get s2[n], which is then further prosody modified, as described above, to produce the final result, s12[n]. Alignment of source and target signals is performed automatically, by adding or deleting frames from the mid-regions (where the spectrum is relatively stable) of the processed segments. While all modifications are performed for voiced phonemes, only duration and energy are changed for the rest.

Figure 5.1: F1-F2 plots for the sentence vowels. The top 6 plots are for sentence 1 vowels (/IH/ 2, /IH/ 4, /AH/ 7, /UW/ 10, /IY/ 16, /AA/ 19) and the remaining plots are for sentence 2 vowels (/AX/ 2, /AO/ 5, /OW/ 9, /IH/ 11, /AE/ 13, /EH/ 17); numbers indicate the position of the vowel in the sentence. Happy is represented by ?, angry by o, and neutral by ∇; sad uses the remaining marker.

5.3.1 Test stimuli

The proposed conversion was applied on the two target test sentences. For each of the sentences there were 12 source-target pairs: (1) happy-angry, (2) happy-sad, (3) happy-neutral, (4) angry-happy, (5) angry-sad, (6) angry-neutral, (7) sad-happy, (8) sad-angry, (9) sad-neutral, (10) neutral-happy, (11) neutral-angry, (12) neutral-sad.
In order to observe the effect of the individual parameter changes, each of these pairs was modified in a controlled manner by changing one parameter at a time. The feature modifications we investigated were: (1) only pitch, (2) pitch and energy, (3) only duration, (4) duration and energy, (5) duration and pitch, (6) duration, pitch and energy, (7) only spectrum, (8) pitch and spectrum, (9) pitch, energy and spectrum, (10) duration and spectrum, (11) duration, energy and spectrum, (12) duration, pitch and spectrum, (13) duration, pitch, energy and spectrum. In total we synthesized 156 (12x13) sentences for each test sentence. Together with the 312 (2x156) synthetic signals and with the inclusion of the original sentences (2 happy, 2 angry, 2 sad and 2 neutral), our test set consisted of 320 stimuli.

5.4 Evaluation

In this section we describe the listening test setup and present the evaluation results.

5.4.1 Listening experiment

Assessment of the output emotion categories was achieved by conducting subjective listening tests with naive listeners. Ten listeners participated in the experiment and each listener was presented with 320 sentences, which consisted of the results of all modifications as well as the original utterances. The test stimuli were presented in random order to eliminate any correlative effects in decision making.
Headphones were used and the listeners were given the freedom to adjust the volume and to listen to the current sentence as many times as they wanted; however, once done they were not given the opportunity to go back. The test was organized as a forced-choice experiment, where the raters were required to decide on one of the following five choices: (1) happy, (2) angry, (3) sad, (4) neutral and (5) other. The "other" option was included to provide a way of capturing intermediate (i.e., fuzzy) emotional categories that can arise as a result of the modifications. The average test duration was 25 minutes per listener.

5.4.2 Listening test results

Results for the original (unmodified) and synthesized utterances are presented in Table 5.1 and Table 5.2, respectively.

Original sentences

Listening test results for the original sentences, displayed in Table 5.1, show that our speaker successfully elicits all of the emotions. The kappa statistics, κ = 0.70, α < 0.01, show strong rater agreement. These results serve as an upper bound.

          H1   A1   S1   N1   H2   A2   S2   N2
Happy     70   20    0   10  100    0    0    0
Angry     10   80    0   10    0  100    0    0
Sad        0   10   70   20    0    0   80   20
Neutral    0    0    0  100    0    0    0  100

Table 5.1: Listening test results for the two original sentences (κ = 0.70, α < 0.01). Emotion categories are listed in the first column and the results are presented in percentages.

Resynthesized sentences

The two best results for each possible pair are shown in Table 5.2. For the full table of results please see the website listed in the footnote below. Kappa statistics calculated for all 312 rater responses are as follows: Sentence 1: κ = 0.25, α < 0.01; Sentence 2: κ = 0.36, α < 0.01. These values are much lower compared to the original sentences, and they indicate the inherent difficulties in evaluating emotional [126] and synthetic speech. In Table 5.2, the results for sentences 1 and 2 are presented separately. The number shown in parentheses indicates the performed modification, as described in section 5.3.1. The results are in 10% multiples; for instance, for happy-to-sad conversion (h2s, i.e., source is happy, target is sad) of sentence 1, when method (9) was applied, the final resynthesized output was recognized as 20% happy, 60% sad, 10% neutral and 10% other (the "other" responses are not shown in the table). The results indicate successful conversion, except for the following pairs: Sent.2 a2h, s2h, n2h. Apparently, synthesis of happiness is not achieved at an acceptable level, emphasizing the fact, as shown in many other papers, that special attention should be paid to it, because the signal distortions arising from the modifications influence listeners' judgments especially for the happy emotion.

1. http://sail.usc.edu/~mbulut/euro05.html
It is also interesting to note the positive correlation between neutral-other responses (τb = 0.276, α < 0.01), indicating that raters tend to choose neutral whenever the appropriate emotion choice is not listed in the evaluation test.

Figure 5.2: Pitch contour (left) and energy contour of the sentences. The top two plots are for sentence 1; the first of these top plots shows happy and angry (dashed), and the second shows sad and neutral (dashed).

Figure 5.3: LPC and TD-PSOLA based emotion conversion system. s[n] and t[n] are the source and target, respectively. A(z) indicates the inverse filter used for the calculation of the residual e[n]. s1[n], s2[n], and s21[n] are the outputs obtained by modifying the input signal using only TD-PSOLA, only LPC synthesis, and both LPC and TD-PSOLA, respectively.

         h2a                  h2s                  h2n                  a2h                  a2s                  a2n
Sent.1   (11) h:4 a:4 s:0 n:1 (9) h:2 a:0 s:6 n:1  (11) h:0 a:2 s:1 n:6 (7) h:3 a:4 s:1 n:1  (8) h:0 a:3 s:6 n:0  (12) h:0 a:2 s:2 n:5
         (13) h:6 a:4 s:0 n:0 (12) h:1 a:0 s:5 n:1 (13) h:0 a:1 s:1 n:6 (11) h:4 a:2 s:0 n:3 (12) h:0 a:2 s:4 n:2 (13) h:0 a:2 s:1 n:6
Sent.2   (11) h:3 a:6 s:0 n:1 (12) h:1 a:0 s:3 n:4 (10) h:5 a:0 s:0 n:3 (12) h:1 a:8 s:0 n:1 (7) h:0 a:1 s:5 n:1  (10) h:0 a:4 s:2 n:3
         (13) h:4 a:5 s:1 n:0 (13) h:3 a:0 s:3 n:2 (11) h:4 a:0 s:2 n:3 (13) h:1 a:9 s:0 n:0 (9) h:0 a:1 s:5 n:0  (13) h:0 a:3 s:1 n:5

         s2h                  s2a                  s2n                  n2h                  n2a                  n2s
Sent.1   (8) h:3 a:2 s:0 n:3  (9) h:0 a:4 s:2 n:2  (12) h:0 a:0 s:2 n:7 (11) h:1 a:3 s:1 n:4 (12) h:1 a:4 s:0 n:3 (10) h:0 a:0 s:5 n:3
         (13) h:2 a:1 s:3 n:3 (13) h:1 a:5 s:0 n:3 (13) h:0 a:0 s:1 n:8 (12) h:1 a:2 s:3 n:2 (13) h:1 a:5 s:0 n:2 (12) h:1 a:0 s:5 n:3
Sent.2   (12) h:0 a:4 s:3 n:1 (9) h:0 a:5 s:1 n:0  (11) h:0 a:0 s:3 n:6 (12) h:0 a:4 s:3 n:1 (11) h:0 a:5 s:1 n:2 (7) h:0 a:0 s:8 n:2
         (13) h:3 a:1 s:4 n:1 (13) h:1 a:6 s:0 n:0 (12) h:0 a:0 s:3 n:6 (13) h:3 a:3 s:2 n:0 (13) h:2 a:6 s:0 n:1 (8) h:0 a:0 s:4 n:5

Table 5.2: Listening test results (in 10% multiples) for some selected modifications. h, a, s, n indicate happy, angry, sad and neutral, respectively. The numbers in parentheses refer to the modification type as explained in section 5.3.1. h2a means that the source is happy and the target is angry.

In addition, we note that the happy emotion was usually confused with the angry emotion. This confusion can be related to the similarity of the acoustic features of happy and angry sentences, as shown in Figure 5.2. Kendall's tau-b correlation calculations (for all 312 sentences) show a positive correlation between the numbers of happy and angry responses (τb = 0.074, α < 0.108). Although we see that sad and neutral sentences also have resembling acoustic features, the correlation between neutral-sad responses was insignificant (τb = 0.018, α < 0.683). For all other pairs a significant negative correlation was observed (angry-neutral: τb = -0.433, α < 0.01; angry-sad: τb = -0.424, α < 0.01; angry-other: τb = -0.213, α < 0.01; happy-neutral: τb = -0.290, α < 0.01; happy-sad: τb = -0.308, α < 0.01; happy-other: τb = -0.107, α < 0.025).

5.5 Discussion

In this study we copy all of the information from the target, and thus our approach should not be directly compared to unsupervised and automated processes. Here, we aim to provide insights into the usefulness and feasibility of phoneme-level modifications for emotion control, with a goal toward creating fully automated conversion rules in the future. This section details the effects of individual and combined application of prosody and spectral modifications on the emotions.

5.5.1 Prosody modifications

Prosody parameters (duration, pitch and energy) at the supra-segmental level are considered the defining factor for emotions [22,34,116]. However, to make concatenative synthesis systems more useful and flexible we need to be able to modify segmental-level properties as well. Our analyses have reported distinct emotion effects in segmental-level speech features [147], which suggests that by performing segmental modifications we can alter the emotion of the whole sentence.
The results (available on the website noted above) show that, when applied without any spectral envelope modifications, prosody parameter modifications done locally at the phoneme level are not effective in changing the emotion perception. This is because it is not possible to exactly match and reproduce the target sentence prosody only by doing local prosody modifications.
Local modifications are not sufficient due to two main reasons: (1) speech prosody features, such as inter-word silences, fluent pauses, hesitation pauses, stress pattern and shimmer [26], that we did not modify here, have significant effect on the sentence prosody; (2) TD-PSOLA modifications at phoneme level, especially when the ratio between source and target phone parameters is larger 83 (smaller) than 2 (0.5), introduce perceptible and visible artifacts, which must be smoothed by further sentence level processing. Thus, the evaluation results sup- port our hypothesis that for the local prosody modifications to be effective and meaningful they should always be followed by additional supra-segmental level modifications. 5.5.2 Spectrum modifications As outlined in the introduction, voice personality can be successfully changed by transforming frequency parameters [90,137,139]. Tenseness, creakiness, laxness, breathiness which are all attributes of the voice quality can be also changed by spectral modifications. These attributes are closely related to the emotional con- tent of speech [53], so one can expect that changing them will be useful for emotion re-synthesis. We test this by directly using the target LPCs to modify source speech. The complete table of results (available on the website) show that spectrum modifications increase the ambiguity between the emotion classes (i.e., emotions are more confused with one another due to the new emotional coloring) but not to any significant level (except for Sent 2: a2s, n2s) to cause change in the emotion category. We also observe that for phonemes, spectral modifications are more effective than local prosody modifications in shifting the source emotion (in the emotional space) closer to the target emotion. This justifies our hypothesis that transforming the phonemes’ spectral properties can be effectively used in speech synthesizers to add new emotional content to the synthesized speech. 84 5.5.3 Prosody and spectrum modification combination Concurrently applied prosodic and spectral modifications give better results than their individual applications. As expected, when all 4 variables (spectrum, dura- tion, pitch and energy) are modified the results improve. From Table 5.2 we see that the best results are achieved for following pairs, Sent 1: h2s, h2n, a2s, a2n, s2n; Sent 2: a2s, s2a, s2n, n2a, n2s. Our evaluation experiments do not show a particular trend in the results that can be associated with each parameter individually. For example, starting with only spectrally modified sentences, when we changed the pitch the recognition rates for some pairs (Sent1: h2a, h2s, a2s, a2n, s2h, n2s; Sent2: a2s, s2a, n2s) improved while for the others there was no improvement. The same observation is also valid when only duration modifications were applied to the spectrum-only modified sentences (pairs that improve are sent1: h2a, h2s, a2s, a2n, s2a, s2n, n2h, n2a, n2s; sent2: h2a, h2n, a2n, s2a, n2h, n2a). Inclusion of the modification of an additional prosodic feature (such as duration, pitch and energy), generally speaking, improved the results. In addition, duration changes proved to be more effective than the pitch changes, which in turn were more effective than the energy changes when applied on spectrally modified phonemes. 
The results indicate following trend, in terms of producing successful conver- sion, among proposed feature modification methods: (13) duration, energy, pitch, spectrum > (12) duration, pitch, spectrum > (11) duration, energy, spectrum > (9) pitch, energy, spectrum > (10) duration, spectrum > (8) pitch, spectrum > (7) spectrum. 85 5.6 Conclusion In this paper we studied the conversion of one emotion into another by modifying prosody and spectral envelope at phoneme-level. Our results show that individu- allymodifiedlocalprosodyandspectrumaddnewemotionalcoloringtothesource emotion. However, they are not sufficient by themselves to elicit the target emo- tion. Comparing the two, we see that at phoneme-level, spectral envelope modifi- cations are more effective than local prosodic modifications, and for local prosody, duration modifications are more effective than pitch modifications. When applied together, local prosody and spectrum modifications successfully transformed the emotion of the source speech to the target emotion. These results support our hypothesis that combining phoneme level and supra- segmental level modifications can be a useful framework to model the emotional content of synthesized speech. Furthermore, for an emotion synthesizer to be fully successful additional linguistic layers -from phonemic and lexical context to syntactic and discourse structures- should be carefully accounted for. Much of those details are still unknown and are a subject of ongoing work. 86 Chapter 6 Prosody of part of speech tags in emotional speech: Statistical approach for analysis and synthesis The main content of this chapter was published in ICASSP 2007 [19], and Inter- speech 2007 [18]. 6.1 Introduction The interplay between the linguistic role of words, and their expressive modula- tion in spoken language is complex, and not completely understood. Research shows that different words have different contribution to the emotional quality of speech [145]. In addition to the individual effects of the words, the interac- tion between words in a sentence also is effective in the expression and perception of emotions [98]. Therefore it is important to include all of these factors while studying emotional characteristics of speech. Especially for synthesis and under- standingapplicationsintegratingthemodelsoflinguisticandacousticcomponents is essential [4,93]. 87 In order to represent the linguistic level information part of speech (POS) tags canbeused[4,12,93,141]. POStagsarefew innumbers andtheycanbeidentified withhighaccuracy(morethan90%)withnomanualintervention[31],whichmakes them an attractive option for linguistic modeling. Inthischapteranalysesofemotionalspeechprosody(duration,F0,andenergy) characteristicsarepresented. Theseanalysesareperformedinpartofspeech(POS) tag level, with the purpose of investigating the interaction between emotions and POS tags, and the factors influencing this interaction. The interaction between emotions and POS tags is investigated in two per- spectives: Statistical analyses and analysis-by-synthesis. First, statistical methods (i.e., ANOVA) are used to quantitatively analyze the relation between POS tags and emotion categories. Second, using speech signal modification techniques (TD- PSOLA [82,139]), the effects of modifying prosody characteristics in POS tag level (based on the statistical analysis results) is investigated. 
The results provide help- ful insights about both quantitative and qualitative (i.e., perceptive) aspects of emotional speech representation and synthesis. Representation of emotional speech characteristics in a statistical framework can capture and model the complex nature (due to variability and uncertainty) of emotions in a simple and effective way. Similar to the language modeling in speech recognition, such methods will be useful to learn patterns from data, which will help to better understand (and implement) the complicated phenomenon of emotions. 6.1.1 Background Part of speech tags are predefined lexical labels that are used to group words into specific classes. In that sense, they provide an eloquent way to represent a large 88 number of words in terms of only few categories. The number of the tags used can vary significantly depending on the application. For example, there were 45 lexical tags defined in the Penn Treebank project [78], 87 tags in the Brown corpus [50], and 146 tags in [52]. In this work we use 13 tags (shown in Table 6.1), which are a subset of Penn Treebank tags. As explained in section 6.2.2 these tags were selected because of their linguistic significance. Partofspeechtagsaregeneratedandusedduringtexttospeech(TTS)synthe- sis for task such as phrase break assignment [12], prosody generation [141], homo- graphic disambiguation, and target cost calculations [44]. For example, in [12] it is shown that using Markov models trained based on POS tag information, phrase breaks (i.e., break and non-break) locations can be predicted with high success(∼ 79%). In this chapter, using the probabilities calculated for acoustic features of different POS tags, we first predict how POS tags should be ordered (in terms of acoustic feature values) and then use this information during the generation of new values for synthesis. UsageoflinguisticinformationforgenerationofF0contoursisdescribedin[51, 141]. From these, Fujisaki model [51] is an simple but effective method of F0 contour generation using phrase and accent marks. It has been shown to perform well for many languages. In [141] it is shown that using POS tags, (French) intonation contours can be reliably generated using POS tag information. Our focus, in this chapter, is on how POS tags can be used to generate emotional effect. We start from a neutral utterance and modify its prosodic features (F0, energy, duration) in POS tag level to generate emotional speech. The parameter values are estimated automatically using Gaussian distributions that model the difference between emotions [19]. 89 In emotional speech research, POS information has been used in emotional speechrecognition[6]andonlypartially(forcontent/functionworddiscrimination) in emotional speech synthesis [27]. Statistical parametric synthesizers are a rising new research area [8], pioneered by Tokuda et.al. [133]. They provide an effective way to learn patterns from data. The ease of voice quality modification and adaptation provided by these parametric methods makes them an attractive tool for emotional speech synthesis. As shown in [136] even with limited data successful results can be achieved. The main limitation of the HMM-based synthesis techniques, however, is the vocoder (buzzy) speech quality. Since it is not clear how speech quality influences the perceptionofemotions,inthischapterinsteadofsynthesizingthespeechfromtext, wefocusofhowneutralspeechcanbemodifiedtosoundemotional. 
Doingthishas the advantage that quality degradations are less, enabling us to more effectively measure the effects of the applied modification (on the emotion perception). Studies of emotional speech have shown that emotion change can be associ- ated with changes in prosodic and spectral characteristics of speech signals. The concentration has been mainly on prosody parameters, such as F0, duration, and energy [34,116]. Comparison of different statistics of these parameters has been used as a measures to describe and discriminate between emotions. However, looking from the synthesis perspective, it is not a simple procedure to implement these results to synthesize emotional speech. One of the main challenges, is due to the fact that these results are presented individually, without taking the effects of linguistic components into account. In this chapter, we represent the linguis- tic content using individual POS tags and analyze how using this information can help the synthesis of emotional speech. Of course, this representation is not necesary sufficient. In many cases, more complex concepts such as constituency, 90 grammatical relations, andsubcategorization and dependencies [66]mayneedtobe implemented. In this chapter, individual POS tags based modeling was selected mainly due to the limited amount of the training data. 6.1.2 Outline The chapter consists of four main parts: Database description (sec. 6.2) Analysis (sec. 6.3), synthesis (sec. 6.6), and listening test results (sec. 6.7). In the analysis partweinvestigatetheeffectofprosodyparameterdifferencesinPOSlevelinterms of emotion style, part of speech tag type, speaker, and position in a sentence. This is done using a 4-factor (4x13x2x3) mixed-design ANOVA (with repeated measuresonemotionstyle)analyzingduration, energymaximumandmedian, and F0 median and range dependent variables [18]. In addition, we also calculate the probabilitiesofobservingdifferencesinPOStagparametervaluesacrossemotions, and show that these differences can be parameterized by fitting into Gaussian distributions [19]. The discussion of the analysis results is given in section 6.5. In the synthesis part the details of the proposed probabilistic approach are presented. In summary, it is shown how an optimal tag order can be estimated for a given target emotion, and then how prosody parameter values can be generated based on the estimated order. The estimation of the values is performed using Gaussiandistributionsmodelingthedifferencesbetweeninputandtargetemotions POS tags. Next, the listening text results are presented (sec. 6.7). Discussionsofthelisteningtestresults, anddirectionsforfutureworkaregiven in section 6.8, and conclusion is presented in section 6.9. 91 6.2 Emotional dataset and feature extraction In this section we first describe the emotional dataset we use and then give details of how statistical tests were performed. 6.2.1 Emotional dataset The results in this study are based on 72 sentences uttered by 3 speakers (who werenativespeakersofEnglish)infouremotionstyles(happy,angry,sad,neutral). The sentences were designed to be semantically neutral, however the distribution of different POS tags was not specifically controlled. The same sentences were used for the all three speakers. Of these, speaker 1 was a male speaker in his late twenties with no professional acting experience. Speakers 2 and 3 were females in their late twenties with degrees from a theater school. 
The speakers were asked to utter each sentence in 4 emotion styles, which were happiness, anger, sadness and neutral (i.e., no particular intended emotion). The recordings were performed emotion by emotion, that is, all sentences were first recorded in one style and then in another. Listening tests were conducted with naive listeners (minimum 4 listeners per file) to evaluate the expressed emotions. The tests were designed as forced choice tests where listeners had to choose one of the happy, angry, sad, neutral, or other options. Results showed that in more than 80% of instances, the perceived emotion matched the intended emotion.

The average sentence length was 6.81 words. The 0.25, 0.5, and 0.75 quantiles for sentence word counts were 5, 6, and 8 words, respectively. Word boundaries were estimated automatically using HMM models trained using the HTK software.

6.2.2 Part of speech (POS) tagging

POS tags can also be grouped among themselves into two broad categories: closed (i.e., a fixed set, which is unlikely to grow in number) and open (i.e., a set that can grow) classes. Some of the words (also called function words) that fall in the closed classes are determiners, prepositions, pronouns, etc. The common characteristic of these words is that they occur frequently and they are not very likely to change. In contrast, open class words (also called content words), such as nouns, verbs, adjectives, and adverbs, can increase in number with the addition of new words to the language. The analyses performed in this chapter are mainly based on content words, because they provide significant information about both the lexical [66] and emotional [27] content of the sentence, and about the prominence [144] of words. In addition, we also analyze the characteristics of possessive and personal pronouns, because they provide useful information about the words in their vicinity. For example, possessive pronouns are likely to be followed by a noun, while personal pronouns are likely to be followed by a verb [66].

Part of speech tagging of the sentences was performed using the Charniak POS parser [31]. This parser is based on a probabilistic generative model and it achieves 90.1% average precision/recall performance on sentences shorter than 40 words. For a given input sentence it generates Penn Treebank style parse trees [78]. For an example input sentence, "Count the number of teaspoons of soysauce that you add" (a sentence from the TIMIT database), the output generated by the parser is: "(S1 (FRAG (-LRB- -LRB-) (“ “) (S (VP (VB count) (NP (NP (DT the) (NN number)) (PP (IN of) (NP (NP (NNS teaspoons)) (PP (IN of) (NP (NN soysauce) (SBAR (IN that) (S (NP (PRP you)) (VP (VBP add))))))))))) (. .) (” ”) (-RRB- -RRB-)))". From the generated output one can easily identify the Verb Phrase (VP), Noun Phrase (NP), etc., boundaries and the POS tags for individual words.

As explained in the introduction, the specific focus of our analysis was on the POS tags of content words and pronouns, as they are known to affect the emotional and perceptual quality the most [27,44]. The complete list of these tags is given in Table 6.1.
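As a rough illustration of how such Penn Treebank tags can be obtained automatically, the minimal sketch below uses NLTK's off-the-shelf tagger. This is only an assumption for illustration; the parser actually used in this work was the Charniak parser [31].

    # Minimal Penn Treebank tagging sketch (illustrative only; the thesis used
    # the Charniak parser). Requires nltk.download('punkt') and
    # nltk.download('averaged_perceptron_tagger').
    import nltk

    # The 13 analyzed tags (content words plus personal/possessive pronouns).
    ANALYZED_TAGS = {"JJ", "NN", "NNP", "NNS", "PRP", "PRP$", "RB",
                     "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

    def tag_content_words(sentence):
        """Return (word, tag) pairs, keeping only the analyzed tags."""
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)            # Penn Treebank tag set
        return [(w, t) for w, t in tagged if t in ANALYZED_TAGS]

    print(tag_content_words("Count the number of teaspoons of soysauce that you add"))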
  Symbol  POS tag name                             No.  Example
  JJ      adjective                                25   big
  NN      singular or mass nouns                   70   book
  NNP     proper singular nouns                    17   USC
  NNS     plural nouns                              5   books
  PRP     personal pronouns                        59   she
  PRP$    possessive pronouns                      17   her
  RB      adverbs                                  37   swiftly
  VB      base form verbs                          17   eat
  VBD     past tense verbs                         12   ate
  VBG     gerund or present participle verbs       17   eating
  VBN     past participle verbs                     9   eaten
  VBP     non-3rd person singular present verbs    10   eat
  VBZ     3rd person singular present verbs         7   eats

Table 6.1: List of the analyzed tags, a subset of the Penn Treebank POS tags [78].

As can be seen from Table 6.1, instead of combining NN, NNP, and NNS, and VB, VBD, VBG, VBN, VBP, VBZ, into single categories, we decided to include them as separate categories. This was preferred as a result of preliminary analyses, where it was observed that they had different distributions.

6.2.3 Acoustic feature calculation

Utterance F0 contours were calculated using the Praat software [13]. After smoothing the utterance F0 vector with a median filter of length 3, the mean, median, maximum, and minimum statistics were calculated for each POS tag.

The average magnitude function was used for the energy contour calculations. In this method, instead of the squares of individual values (as in RMS energy calculation), their absolute values are summed over a shifting short-time window [102]. A Hamming window of 240 samples (0.015 seconds) was used. The average magnitude function was preferred over the standard (RMS) energy calculation because of its smaller dynamic range. After smoothing the utterance energy vector with a median filter of length 3, the mean, median, maximum, and minimum statistics were calculated for each POS tag.

Duration values were calculated from the time-domain word boundaries estimated automatically using adapted HMM models trained (using the HTK toolkit [148]) on the TIMIT database and adapted by maximum likelihood linear regression using our emotional speech data. The accuracy of the automatic boundary detection procedure was evaluated by comparing the alignment results with manually aligned boundaries for 16 randomly selected utterances, and it was concluded that the automatically generated boundaries are accurate enough for our purposes.

Speaker dependent normalizations were performed on all variables in order to minimize the effects of speakers. For each speaker, the mean and standard deviation of the parameter values were calculated at the utterance level across all emotions. These values were used to normalize the tag level parameters. In order to ensure fair comparison among the tag energies, the energy values were also normalized at the utterance level so that each utterance had an energy median equal to 1. The normalizations were performed as follows (a brief code sketch illustrating these computations is given below):

• tag duration = tag duration / mean(utterance duration / utterance word number)
• tag energy maximum = (tag energy maximum / utterance energy median) / mean(utterance energy maximum / utterance energy median)
• tag energy median = tag energy median / utterance energy median
• tag F0 median = (tag F0 median - mean(utterance F0 median)) / std(utterance F0 median)
• tag F0 range = tag F0 range / mean(utterance F0 range)

6.3 Analysis of POS tags' prosody characteristics

In this section the results of the statistical analyses are presented.

6.3.1 Standard analysis: Comparisons of feature values

In this section the plots of the acoustic features used in the ANOVA analysis (section 6.4) are presented. Figures are also provided to visualize the effect of each parameter. The plots in Fig. 6.1 are helpful to visualize the main effects of emotion and tag type, and also their interaction.
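For concreteness, the short-time average magnitude energy and the tag-level normalizations described in section 6.2.3 can be sketched as below. This is only a minimal illustration under assumed data layouts (NumPy arrays and the dictionary field names are hypothetical); the actual computations in this work were performed with Praat and custom scripts.

    import numpy as np

    def average_magnitude(signal, fs=16000, win_len=0.015):
        """Short-time average magnitude contour: Hamming-weighted mean of |x|
        over a sliding window (240 samples, i.e. 0.015 s, at 16 kHz).
        The 50% hop is an assumption; the hop size is not given in the text."""
        n = int(round(win_len * fs))
        window = np.hamming(n)
        hop = n // 2
        starts = range(0, len(signal) - n + 1, hop)
        return np.array([np.sum(np.abs(signal[i:i + n]) * window) / np.sum(window)
                         for i in starts])

    def normalize_tag_features(tag, utt, spk):
        """Tag-level normalizations of section 6.2.3 (field names illustrative).
        tag: raw statistics of one POS tag instance (scalars)
        utt: statistics of the utterance containing that tag (scalars)
        spk: arrays of utterance-level statistics over all of the speaker's
             utterances (used for the per-speaker means and stds)."""
        return {
            "duration": tag["dur"] / np.mean(spk["utt_dur"] / spk["utt_words"]),
            "energy_max": (tag["e_max"] / utt["e_med"])
                          / np.mean(spk["utt_e_max"] / spk["utt_e_med"]),
            "energy_median": tag["e_med"] / utt["e_med"],
            "f0_median": (tag["f0_med"] - np.mean(spk["utt_f0_med"]))
                         / np.std(spk["utt_f0_med"]),
            "f0_range": tag["f0_range"] / np.mean(spk["utt_f0_range"]),
        }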
The position variable had two levels and it was used to mark whether the word was in the first or second half of a sentence. For example, for a 5 (or 6) word sentence, the first 3 words were considered as belonging to the first half and the last 2 (3) words to the second half. Position was included as a factor as a result of our analysis prior to the design, where it was found that position-dependent differences in the acoustic parameter values were significant.

The main effect of position (and also of emotion) is shown in Fig. 6.2, where the values of the dependent variables are plotted for two conditions: position=1 (i.e., only considering the tags in the first half of sentences) and position=2 (i.e., tags in the second half). The figure is also helpful to visualize the interaction between emotion and position.

Figure 6.1: Plots of the tag values for different emotions, for (a) duration, (b) energy maximum, (c) F0 range, and (d) F0 median. Note that the order of the tags is different for different parameters: the tags were sorted in ascending order based on the neutral tag averages (dotted line), without differentiating based on position. Each symbol represents a different emotion (neutral, anger, happiness, sadness). The figures show the main effects of POS tag type, emotion, position, and their interactions.

Figure 6.2: Shows the main effect of emotion and position for (a) duration, (b) energy maximum, (c) energy median, and (d) F0 median. Cases where position=1, position=2, and position=1 or 2 are marked with different symbols. The labels N, A, H, S represent the neutral, angry, happy, and sad emotions, respectively.

6.3.2 Probabilistic analysis: Comparisons of tags in terms of probabilities

In this section we compare tags and emotions in terms of the calculated probabilities. Having lexically matching utterances expressed in different emotions, as in our case, enables us to directly compare the acoustic features for different emotions. For every POS tag and speaker, we analyzed the differences in acoustic features for each one of the neutral-angry, neutral-happy, neutral-sad, angry-happy, angry-sad and happy-sad emotional pairs. The goal was to learn patterns to inform the design of statistical emotion conversion schemes.

The results (averaged over all speakers) comparing the energy, duration and F0 feature values across emotions for the content POS tags are shown in Figs. 6.3, 6.4, and 6.5. Shown in Fig. 6.3 is the probability of a tag having the maximum parameter value in the utterance. Shown in Fig. 6.4 is the probability that, for an emotional pair (emotion1-emotion2), the first emotion (emotion 1) has higher feature values than the second emotion (emotion 2). The probabilities were computed by counting the number of instances where the target event occurred (i.e., emotion 1 feature > emotion 2 feature) between matched utterances and dividing the total count by the total number of compared utterances. Between-tag comparisons are given in Fig. 6.5. Displayed are the probabilities of the POS tags on the x axis having greater F0 median values than the tags on the y axis.
The size of the circles is proportional to the probability (i.e., the largest probability is shown as the largest circle).

6.4 ANOVA analysis: Repeated measures design on emotions

In order to analyze the effect of emotions on the prosodic features of part of speech tags, a 4-factor mixed-design (4x13x2x3) ANOVA was used. There were four independent variables. On one of these, intended emotion, we had repeated measures. There were four levels of the intended emotion: happy, angry, sad and neutral. The three other independent variables, which were used as between-subject variables, were POS tag type, position of the tag in the sentence, and speaker. The POS tag type variable had 13 levels. The position variable had two levels and it was used to mark whether the word was in the first or second half of a sentence (as explained in sec. 6.3.1). The speaker variable had 3 levels. The dependent variables were (POS) tag duration, tag energy maximum, tag energy median, tag F0 median, and tag F0 range. Note that the figures for these variables were given in section 6.3.

6.4.1 ANOVA analysis results

In this section we present the results of the ANOVA tests. The implications of these results are discussed in more detail in section 6.5.

Results of the ANOVA tests are reported in Table 6.2 and Table 6.4. The values are the results of the Greenhouse-Geisser statistical test (which was preferred in order to account for violations of the sphericity assumption) performed using the SPSS software [125].

In order to provide tag-level discrimination for emotions, contrast tests were performed to analyze the differences between all 6 emotional pairs. Since position was calculated to be an important factor, these comparisons were performed separately for position=1 and position=2. The results for some POS tags (JJ, NN, VB, RB, PRP) are reported in Table 6.4. The table also shows the results of the post hoc test for the POS tag type variable.

Figure 6.3: Displayed in these figures are the probabilities of tags having the maximum parameter value in the utterance, for (a) duration, (b) energy maximum, (c) F0 range, and (d) F0 median. The probabilities were calculated by counting the cases when a tag had the maximum parameter value, and then dividing this number by the total occurrences of that tag. The green line shows the mean probability of a tag having a maximum value (that is, the average over all emotion values). Also note that the tags were sorted based on that average; therefore the order of the tags on the x axis is different for each parameter. Each symbol represents a different emotion (neutral, anger, happiness, sadness).

Figure 6.4: Probabilities that a neutral tag will have a higher value than its emotional counterpart, for (a) neutral > angry, (b) neutral > happy, and (c) neutral > sad. Note that the tags are sorted by the probabilities calculated for the energy maximum parameter. Each symbol represents the comparison results for a different parameter.
In this figure, energy maximum, energy median, F0 median, F0 range, and duration are each plotted with a different symbol.

Figure 6.5: Displayed are the probabilities of the POS tags on the x axis (JJ, NN, VB, RB, PRP) having greater F0 median values than the tags on the y axis. The size of the circles is proportional to the probability, that is, the largest probability is shown as the largest circle. The probabilities for each emotion are plotted in a different color and line style (neutral: blue circle, solid line; angry: red circle, dashed line; happy: magenta circle, dash-dot line; sad: black circle, dotted line).

Table 6.2: Greenhouse-Geisser test statistics for the 4-factor mixed-design ANOVA experiment. The independent variables are emotion (emo, 4 levels), tag type (tag, 13), position (poz, 2) and speaker (spk, 3). Each cell lists the F statistic and p-value per dependent variable; significance can be read from the p-values.

  emo:      Duration F(2.75,2279.08)=21.53, p<.001;  Energy max F(2.85,2362.06)=47.47, p<.001;  Energy median F(2.79,2312.69)=2.60, p=.054;  F0 median F(2.84,2354.07)=110.00, p<.001;  F0 range F(2.78,2301.44)=9.71, p<.001
  tag:      Duration F(12,828)=46.82, p<.001;  Energy max F(12,828)=15.01, p<.001;  Energy median F(12,828)=3.41, p<.001;  F0 median F(12,828)=1.89, p=.032;  F0 range F(12,828)=12.11, p<.001
  poz:      Duration F(1,828)=18.22, p<.001;  Energy max F(1,828)=67.77, p<.001;  Energy median F(1,828)=60.04, p<.001;  F0 median F(1,828)=40.24, p<.001;  F0 range F(1,828)=5.59, p=.018
  spk:      Duration F(2,828)=.02, p=.970;  Energy max F(2,828)=4.32, p=.014;  Energy median F(2,828)=.54, p=.582;  F0 median F(2,828)=2.79, p=.062;  F0 range F(2,828)=5.65, p=.004
  emo*tag:  Duration F(33.03,2279.08)=1.17, p=.220;  Energy max F(34.23,2362.06)=1.29, p=.122;  Energy median F(33.51,2312.69)=1.44, p=.049;  F0 median F(34.11,2354.07)=.92, p=.596;  F0 range F(33.35,2301.44)=1.77, p=.004
  emo*poz:  Duration F(2.75,2279.08)=6.91, p<.001;  Energy max F(2.85,2362.06)=7.75, p<.001;  Energy median F(2.79,2312.69)=3.65, p=.014;  F0 median F(2.84,2354.07)=10.81, p<.001;  F0 range F(2.78,2301.44)=1.38, p=.25
  emo*spk:  Duration F(5.50,2279.08)=44.73, p<.001;  Energy max F(5.70,2362.06)=18.99, p<.001;  Energy median F(5.58,2312.69)=1.32, p=.249;  F0 median F(5.68,2354.07)=15.34, p<.001;  F0 range F(5.56,2301.44)=4.07, p=.001
  tag*poz:  Duration F(12,828)=2.73, p=.001;  Energy max F(12,828)=4.17, p<.001;  Energy median F(12,828)=3.72, p<.001;  F0 median F(12,828)=3.27, p<.001;  F0 range F(12,828)=1.67, p=.069
  tag*spk:  Duration F(24,828)=.29, p=1.00;  Energy max F(24,828)=.39, p=.996;  Energy median F(24,828)=.33, p=.999;  F0 median F(24,828)=.77, p=.776;  F0 range F(24,828)=.67, p=.866
  spk*poz:  Duration F(2,828)=2.39, p=.093;  Energy max F(2,828)=.91, p=.400;  Energy median F(2,828)=.33, p=.718;  F0 median F(2,828)=6.79, p=.001;  F0 range F(2,828)=4.87, p=.008

6.4.2 Analysis of tag duration, energy and F0 contours

The results in Table 6.2 show that the main effects of emotion (Figs. 6.1 and 6.2), tag type (Fig. 6.1), and position (Fig. 6.2) were significant for all dependent variables (except the main effect of emotion on energy median, which was insignificant because of the utterance-level energy normalizations that were performed). The main effect of speaker was either not significant or small. This was mainly due to the speaker-level normalizations.

Significant effects of emotion and tag type were expected. However, it was particularly interesting to observe the strong effect of position. As can clearly be seen in Fig. 6.2, the tags located in the first half of sentences (i.e., position=1) had shorter durations, higher energy, and higher F0 values than the tags in the second half (i.e., position=2). In general, this was true for all tags. As expected, there were significant differences between the patterns of differences, due to position, in the values of duration, energy maximum and median, and F0 median for some tags (i.e., the tag*position interaction was significant). The effect of position was significant for all emotions (Fig. 6.2). Note also that the effect of position was emotion dependent (i.e., the emotion*position interaction was significant) for all variables except F0 range. This can be seen from Fig. 6.2 and Table 6.2.
The results show that we do not have enough evidence to conclude that the effect of emotion on certain parameters (duration, energy maximum, and F0 median) is tag dependent. For the energy median and F0 range parameters it was found that the effect of emotion is significantly dependent on tag type. This is an indication that emotion change affected some tags more than others. Note, however, that the size of the effect was small.

Analysis of the 3-way interactions (not shown in the tables), emotion*tag*position and emotion*tag*speaker, showed that they were insignificant for all acoustic features except duration. For duration, the interactions were small but significant (emotion*tag*position: F(33.03, 2279.08)=1.58, p=0.019; emotion*tag*spk: F(66.06, 2279.08)=1.37, p=0.026).

6.4.3 Post hoc tests for emotions and POS tags

Contrast analyses of emotions (Table 6.4) show that emotions can be differentiated from each other at the POS tag level. Moreover, there were differences in the information (for emotion differentiation) inherent in different tags. For example, verbs (VB) had less emotion related information than the other tags. These results should be interpreted with caution, however. Considering the relatively small size of the analyzed dataset, they need to be tested on larger databases as well.

The patterns of differences between emotions were dependent on position (Table 6.2 and Fig. 6.2). From Table 6.4 we observe that, in general, for different acoustic features, the differences between emotions were more consistent in the second half of sentences. For example, in the second half, a relation (between emotions) observed for the duration variable was also observed for energy maximum and F0 median.

As emphasized in section 6.5, the results discussed in the previous two paragraphs may be specific to the type of dataset (i.e., the same sentences for all emotions) that was used in this experiment. It will be interesting to see if these relations still hold for emotional data where all emotions are not expressed with the same sentences.

Table 6.4: Contrast analysis of emotions and post hoc comparisons of POS tags. Only significant pairs are shown (n, a, h, s denote neutral, angry, happy, and sad). Results are listed separately for position=1 and position=2. Note that some tags are more helpful than others to differentiate between emotions. Also note that the patterns of differences between emotions are, in general, more consistent in the second half.

  Duration, poz=1:        JJ: —;  NN: h<a/s;  VB: h<s;  RB: —;  PRP: n/h<a;  All tags: n/h<a/s;  Post hoc (tags): PRP/PRP$/RB<JJ/NN
  Duration, poz=2:        JJ: n<a/h/s;  NN: n<a/h/s, s<a;  VB: n<h;  RB: n<a/h, s<h;  PRP: n<a/h/s, s<h/a;  All tags: n<a/h/s, s<a;  Post hoc (tags): VB/VBD/VBG/VBP/VBZ<JJ/NN, PRP/PRP$/VB<NNP/NNS, RB/VBP/VBZ<NNS, PRP (PRP$)<all but PRP$ (PRP)
  Energy maximum, poz=1:  JJ: n/h<a;  NN: s<a;  VB: —;  RB: h/s<n/a;  PRP: n/h/s<a, s<n/h;  All tags: s<n, n/h/s<a;  Post hoc (tags): NN/NNS/PRP/PRP$/RB/VB<NNP
  Energy maximum, poz=2:  JJ: n<a/h/s, h/s<a;  NN: n<a/h/s, h/s<a;  VB: n<a;  RB: n<a/h/s, s<a;  PRP: n<a/h, s<a;  All tags: n<a/h/s, h/s<a;  Post hoc (tags): PRP<RB/VB/VBD/VBN/VBP, PRP/PRP$<JJ/NN/VBG
  Energy median, poz=1:   JJ: —;  NN: —;  VB: h<a;  RB: —;  PRP: s<n/a/h;  All tags: h/s<n;  Post hoc (tags): VB/VBN/VBZ<NNP
  Energy median, poz=2:   JJ: n<a/h/s;  NN: n<a/s, a<s;  VB: —;  RB: n<s;  PRP: n<h/s;  All tags: n/h<a;  Post hoc (tags): VB/VBN/VBZ<NNP/JJ/NN/NNS, RB/VBZ<PRP$, NNS<VBD/VBG/VBP, VBZ<VBG/VBP
  F0 median, poz=1:       JJ: n<a/h, h/s<a;  NN: n<a/h/s, s<h, h/s<a;  VB: n<a/h, s<a;  RB: n<a/h, s<a/h;  PRP: n<a/h, s<a/h;  All tags: n<a/h/s, h/s<a, s<h;  Post hoc (tags): JJ<PRP/VBG/VBP
  F0 median, poz=2:       JJ: n<a/h/s, s<a/h;  NN: n<a/h/s, s<a/h;  VB: n<a/h, s<h;  RB: n<a/h/s, s<h;  PRP: n<a/h, s<a/h;  All tags: n<a/h/s, s<a/h
  F0 range, poz=1:        JJ: —;  NN: n<a/h/s;  VB: n<a/h;  RB: n<a/h/s;  PRP: —;  All tags: n<a/h/s;  Post hoc (tags): RB/VB<JJ/NN, VBD<JJ
  F0 range, poz=2:        JJ: n/s<h;  NN: n<a/h/s;  VB: s<a/h;  RB: n<h/s;  PRP: —;  All tags: n<s;  Post hoc (tags): PRP/PRP$<JJ/NN/NNP/NNS/VBN, PRP$<VBZ, VB<NNS/VBN
6.4.4 Statistical modeling of emotional differences

From the analysis of the distribution of acoustic features conditioned on individual POS tags for various emotion types, it was observed that they do not fit any particular distribution in a simple way. For that reason, instead of considering the individual emotion-dependent variations in the acoustic features, we decided to concentrate on the acoustic feature differences between emotions. In contrast, the patterns of these differences were easier to fit with statistical distributions. As can be seen from the example plot (Fig. 6.6), it was found that a normal distribution can be used to approximate the differences between acoustic features well (as determined by statistical chi-square tests at the 5% significance level). In Table 6.5, the approximated mean (μ) and standard deviation (σ) values for some tags and emotional pairs are shown as examples. It is important to note that the normal distribution fit to the data is only an approximation. For some pairs it is a very good fit (e.g., Fig. 6.6d), while for some other cases it provides only a moderate fit (e.g., Fig. 6.6f).
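As a rough illustration of the difference modeling just described, the sketch below fits a normal distribution to paired neutral-minus-emotional feature differences for one tag and checks the fit with a chi-square test. It is only a sketch under assumed inputs (SciPy and NumPy are assumed; the binning and test details of the actual analysis are not specified here).

    import numpy as np
    from scipy import stats

    def fit_difference_gaussian(neutral_vals, emotional_vals, n_bins=8):
        """Fit N(mu, sigma) to paired feature differences (e.g., neutral minus
        angry F0 medians of NN tags) and return the fit with a chi-square p-value."""
        diffs = np.asarray(neutral_vals) - np.asarray(emotional_vals)
        mu, sigma = stats.norm.fit(diffs)

        # Compare observed histogram counts with the counts expected under the
        # fitted normal (the text uses a 5% significance level).
        counts, edges = np.histogram(diffs, bins=n_bins)
        cdf = stats.norm.cdf(edges, loc=mu, scale=sigma)
        expected = len(diffs) * np.diff(cdf)
        expected *= counts.sum() / expected.sum()      # match totals for the test
        chi2, p_value = stats.chisquare(counts, expected, ddof=2)  # 2 fitted params
        return mu, sigma, p_value

A p-value above 0.05 would then be read as "the normal approximation is not rejected" for that tag and emotion pair.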
6.5 Discussion of analysis results

Results show that emotions can be differentiated at the POS level. This was an expected result, because many emotional speech analysis studies show that word, syllable and phoneme features change when the emotional style changes. The more interesting result was that the emotion and tag type interaction was either insignificant (duration, energy maximum, F0 median) or small (energy median, F0 range). This means that the effect of emotion on the duration, energy maximum and F0 median acoustic features was not significantly dependent on POS tag type.

Figure 6.6: The histograms and the approximated normal distribution curves for the emotional feature differences for nouns (NN). Plots (a), (d), (g), and (j) are for Neutral−Angry; (b), (e), (h), and (k) for Neutral−Happy; and (c), (f), (i), and (l) for Neutral−Sad differences in energy maximum (a,b,c), F0 median (d,e,f), F0 range (g,h,i), and tag duration (j,k,l).

However, as shown in Table 6.4, the emotion related information inherent in different POS tags was different. For instance, verbs (VB) provided less information than adjectives (JJ) or nouns (NN).

In order to better understand the interaction between POS tags and emotions, for 10 randomly selected sentences (i.e., 4x10 utterances), we asked 2 native English speakers to label the two most prominent words. The purpose was to investigate how the position of prominence changed with emotion. The analysis based on position showed that, in general, the same words were salient across different emotions. This means that the prominence marks mostly fell on the same POS tags even when the emotion changed. It is interesting to note that these preliminary analyses were in accord with the results (for the interaction between POS tag and emotion for different acoustic features) in this chapter. Not observing any significant interaction between emotions and POS tags may be due to the type of dataset used. Note that, in this experiment, in order to have well controlled comparisons between emotions, the expressions of different emotions were constrained to specific semantically neutral sentences.

Table 6.5: The estimated normal distribution mean (μ) and standard deviation (σ) values for modeling the differences in the energy maximum, F0 median and tag duration parameters of the neutral(n)-angry(a), neutral-happy(h) and neutral-sad(s) POS pairs.

                E. Max            F0 Median          Tag Dur.
                μ       σ         μ        σ         μ       σ
  JJ   n-a     -.232    .334     -1.241    1.298    -.302    .503
  JJ   n-h     -.091    .296      -.868    1.418    -.248    .592
  JJ   n-s     -.131    .343      -.348    1.078    -.255    .693
  NN   n-a     -.143    .293     -1.304    1.551    -.231    .442
  NN   n-h     -.064    .289     -1.166    1.647    -.164    .502
  NN   n-s     -.047    .366      -.563    1.456    -.150    .582
  VB   n-a     -.148    .338     -1.108    1.863    -.092    .393
  VB   n-h     -.041    .314     -1.171    2.169    -.140    .547
  VB   n-s      .015    .372      -.528    1.626    -.074    .523
  RB   n-a     -.015    .423     -1.18     1.44     -.118    .469
  RB   n-h      .015    .363      -.865    1.57     -.096    .447
  RB   n-s      .029    .419      -.417    1.63     -.016    .594
  PRP  n-a     -.113    .344      -.977    1.940    -.144    .269
  PRP  n-h     -.023    .312      -.964    2.127    -.091    .356
  PRP  n-s      .035    .282      -.367    1.908    -.077    .334

Another important conclusion of this study is that the position of words (and consequently, of POS tags) was a significant factor. It is seen that the POS tags in the first half of sentences had shorter durations, higher energy maximum and median, and higher F0 median values than the POS tags in the second half of sentences. One reason for having lower F0 values toward the end may be that declarative sentences were used. A reason for having lower energy values (in the second half of utterances) may be that the speakers uttered a single sentence; therefore, toward the end of the utterance, the subglottal pressure may have decreased and so did the energy. Similar results on duration reported in [149] suggest that there may be an "inherent effect of word position in a segment on its duration".

From Tables 6.2 and 6.4 we note (as expected) that the main effect of tag type was significant for all dependent variables (and especially for the duration, energy maximum and F0 range parameters, as can be seen from the size of the F values). This was because different POS tags carry different prominence information [144], and prominence is related to loudness, duration and F0 [70].

6.6 Analysis by synthesis: A resynthesis experiment of emotional speech

In this section we describe how the probabilities calculated for POS tags (presented in section 6.3.2) can be used to model the dynamic interaction among tags and then to resynthesize a new emotional utterance. The proposed probabilistic approach consists of 4 main steps, which are (1) optimal tag order estimation, (2) parameter value estimation, (3) smoothing, and (4) resynthesis. The details of these steps are given in the next sections. However, first an overview of the proposed approach is explained on a specific example, and the assumptions that need to be satisfied are listed.

6.6.1 Optimal tag order estimation

In order to find the tag with the maximum value, we compute a merit function (F(w)), defined as in Eqn. 6.1, for each tag.
This function was selected over other possible functions because it gave the best performance for the analyzed dataset.

$\hat{w} = \arg\max_{w \in W} F(w), \qquad F(w) = P(w_{\max}) \prod_{i=1}^{n-1} P(w > o_i)$    (6.1)

$P(w_{\max}) = \frac{1}{n-1} \sum_{i=1}^{n-1} P(w = \max \mid o_i)$    (6.2)

In these equations, $w$ is the tag for which the merit function is evaluated. The symbols $o_i$ represent all of the remaining POS tags in the sentence, $W$ is the set of all tags in the sentence, $n$ is the number of tags in the sentence, and $P$ represents probability. The probability $P(w > o_i)$ is calculated by counting the number of occurrences of tag $w$ having a greater value than tag $o_i$, given that they exist in the same sentence, and then dividing this by the total number of times these two tags were seen in the same sentence. The probability $P(w = \max \mid o_i)$, which is the probability of tag $w$ having the maximum value given that tag $o_i$ also occurs in the same sentence, was calculated in a similar manner. The tag with the highest merit function value is selected as the tag that will be assigned the highest parameter value.

The next step is to order the remaining tags. In this case the merit function is defined as in Eqn. 6.3:

$\hat{w} = \arg\max_{w} F(w), \qquad F(w) = \prod_{i=1}^{n-2} P(w > o_i)$    (6.3)

6.6.2 Parameter value estimation

Once the order of the tag values is estimated, the next step is the estimation of the parameter values.

Parameter value generation

Starting from the maximum tag, and continuing in the tag order estimated in the previous step, the parameter values are determined probabilistically for all tags in the sentence. (To be more precise, currently the parameter values are generated only for the tags listed in Table 6.1, and for the others they are estimated from these generated values.) The parameters are generated from the corresponding probability distribution. They are generated until the imposed conditions are satisfied. The conditions are imposed based on the relations shown in Fig. 6.4, and on minimum and maximum values calculated from the target emotion corpora. The generation of F0 values (est_F0) is shown below as an example. In this example, the terms Input and Target represent the input and target emotions, respectively; min(target_F0) and max(target_F0) are the minimum and maximum F0 values observed in the target emotion corpora.

    if (P(Input > Target) > 0.7){
        continue until (min(target_F0) < est_F0 < max(target_F0)
                        & est_F0 < input_F0);
    } elseif (P(Input > Target) < 0.3){
        continue until (min(target_F0) < est_F0 < max(target_F0)
                        & est_F0 > input_F0);
    } else {
        continue until (min(target_F0) < est_F0 < max(target_F0));
    }
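A rough Python rendering of this generation loop is given below. It is only a sketch: it assumes the Gaussian models of the input-minus-target differences from section 6.4.4, the 0.7/0.3 thresholds and the 200-iteration cap mentioned in the text, and illustrative argument names; the fallback clamping at the end is a simplification of the relaxed exit condition.

    import random

    def generate_tag_value(input_val, p_input_gt_target, mu, sigma,
                           target_min, target_max, upper_bound=None, max_iter=200):
        """Draw one target-emotion tag value by sampling the modeled
        input-minus-target difference N(mu, sigma) until the constraints of
        section 6.6.2 are met. `upper_bound` enforces the estimated tag order
        (the value must stay below the previously generated tag's value)."""
        est = input_val
        for _ in range(max_iter):
            est = input_val - random.gauss(mu, sigma)   # candidate target value
            if not (target_min < est < target_max):
                continue
            if upper_bound is not None and est >= upper_bound:
                continue
            if p_input_gt_target > 0.7 and not est < input_val:
                continue
            if p_input_gt_target < 0.3 and not est > input_val:
                continue
            return est
        # After max_iter tries only the min/max constraint is kept (here
        # approximated by clamping the last candidate).
        return min(max(est, target_min), target_max)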
For this experiment the threshold was set to 200, and after 200 iterations, the generated parameter value was required to be only larger than the minimum and smaller than the maximum values of the target emotion. As stated above, this method was used only for the tags in Table 6.1, let us call them parameterized tags. For the remaining tags (non-parameterized), the values were estimated as follows. For energy maximum and energy median parameters, the values of the non-parameterized tags were determined by linearly interpolating the values estimated for parameterized tags. For other parameters, wecalculatethemeanvalueofthedifferencebetweentheinputandtargetemotion 114 for parameterized tags, and then find the target non-parameterized tag values by adding the calculated mean difference to the input non-parameterized tag values. 6.6.3 Smoothing After parameters values for all tags in the sentence are estimated, a final smooth- ing is applied to ensure a good quality output. This smoothing depends on the algorithm that will be used for synthesis. SinceinthiscaseTD-PSOLA[82]wasusedforthemodificationofdurationand F0 values, for each tag, it is checked if the ratio between the input and estimated values is larger (smaller) than 1.8 (1/1.8). For the TD-PSOLA algorithm, it is known that the quality degrades significantly, if the ratio is close to 2 (0.5). If it is is beyond this range, the estimated value is adjusted so that the ratio is equal to 1.8 (or 1/1.8). For consistency, the same adjustments was applied to all parameters (i.e., energy, F0, and duration were adjusted so that input/target< (>)1.8). 6.6.4 Speech conversion Having estimated all of the parameter in the next stage the input speech file is modifiedandanewfileisgenerated. First, energycontourwasmodifiedbyscaling the speech signal at the appropriate POS tag locations according to the estimated energy maximum values. Then, TD-PSOLA [82] was used for F0 and duration modifications. Additional smoothing was performed in time domain to ensure a continuous waveform and smooth transitions at POS tag boundaries. 115 6.7 Listening test results In this section we first describe how the listening tests were conducted and then present the results. 6.7.1 Listening test setup In order to test the conversion, 15 utterances (i.e., 5 sentences per each speaker) were randomly selected from the neutral natural speech corpora. These utterances were the input to the conversion system. The sentences were the following: Toby and George stole the game., This hat makes me look like an aardvark. Mickey ate all raisins., Today is the last day., and This approach is a lunacy.. In addition, the same sentences were also synthesized using the kal voice from the Festival speech synthesis software [9]. The conversion system was tested also with these utterances. The listening results for natural speech, and synthesis speech input utterances are shown in Tables 6.6 and 6.7, respectively. The listening test were conducted using a web based interface. The interface consisted of a single page where all of the utterances were listed. There were 80 utterances in total: 15 unmodified natural speech utterances, 5 unmodified synthesized utterances, 60 utterances resynthesized using the POS probabilistic models trained from angry, happy, and sad natural speech corpora. The order of the utterances was randomly determined, and it was different for every listener. 
For each one of the files, the listeners were asked to answer the following questions: (1) What emotion is the speaker trying to express? and (2) How successful is the speaker? The test was designed as a forced-choice test. For the first question, the listeners were required to choose one of the angry, happy, sad, neutral, or other options. For the second question, they had to choose one of the excellent, good, fair, poor, or bad options. Note that since all of the files were presented on a single page, the listeners had the option to listen to the files as many times as they wanted and to compare different files with each other before submitting their responses.

In total 18 people (7 females, 9 males) participated in the listening test. All of them were fluent in English and did not have any known hearing problems. 5 of the listeners listened to the files using loudspeakers, and the rest used headphones. All of the participants were sent a link to the webpage, and they completed the tests at their own convenience.

Table 6.6: Listening test results for the natural speech files. The first row shows the results for the unmodified natural utterances. The other rows show the results for the utterances resynthesized using the angry, happy, and sad probability models. Each cell shows the emotion recognition percentage and, in parentheses, the average success of the speaker in expressing the perceived emotion (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad).

              Neutral-L       Angry-L        Happy-L        Sad-L          Other-L
  orig        64.81 (3.87)    20.37 (3.55)    4.07 (3.36)    9.63 (3.85)    1.11 (3.00)
  orig2ang    41.48 (3.19)    20.74 (3.55)    7.78 (3.52)   25.93 (3.43)    4.07 (1.90)
  orig2hap    37.55 (2.98)    20.82 (3.38)   13.38 (3.33)   20.07 (3.15)    8.18 (2.40)
  orig2sad    18.52 (2.94)     9.26 (3.40)    5.93 (3.13)   56.67 (3.41)    9.63 (2.42)

6.7.2 Listening test results

The results presented in Table 6.6 show that the results of the conversion are acceptable for the sad emotion, but not for the angry and happy emotions.

Table 6.7: Listening test results for the kal voice from the Festival speech synthesis software. The first row shows the results for the unmodified utterances and the others for the resynthesized utterances. Each cell shows the emotion recognition percentage and, in parentheses, the average success as calculated from the human responses.

              Neutral-L       Angry-L        Happy-L        Sad-L          Other-L
  kal         61.11 (2.64)     3.33 (2.67)    0    (-)      26.67 (3.26)    8.89 (1.50)
  kal2ang     31.11 (2.61)     2.22 (2.00)   14.44 (2.85)   34.44 (2.67)   17.78 (1.25)
  kal2hap     38.89 (2.23)     3.33 (3.00)   11.11 (2.50)   24.44 (2.77)   22.22 (1.00)
  kal2sad     27.78 (2.63)     2.22 (3.00)    4.44 (2.50)   43.33 (2.56)   22.22 (1.20)

When the probabilities from the sad speech database were used to modify the input speech, the generated speech files were recognized as sad in 56.67% of the cases. Interestingly, many of the listeners perceived sadness also when the angry or happy probabilities were used for conversion. When the angry probabilities were used, the percentage of angry responses (compared to the original input utterances) did not change significantly. Instead, note that the sad (from 9.63% to 25.93%), happy (from 4.07% to 7.78%), and other (from 1.11% to 4.07%) responses increased. Similar behavior was observed when the happy probabilities were used. In this case, the happy responses increased from 4.07% to 13.38%, the other responses to 8.18%, and the sad responses to 20.07%.

It was particularly interesting to observe that many of the original files were perceived as angry. Note that when these files were evaluated in the emotional corpus they were classified as neutral in more than 80% of the instances.
One possible reason for this difference may be the different corpus within which they were evaluated (the perception of emotions is relative and dependent on the environment [135]). In addition, the listeners who evaluated the original emotional corpus (i.e., the corpus which was used to estimate the POS tag probabilities) were different from the listeners for whom the results are presented in Table 6.6.

Since it was observed that, in general, the applied modifications cause significant changes in the emotion of the original utterance, it will be useful to conduct additional listening tests in different environments. For instance, an utterance synthesized using original-2-happy conversion can be evaluated in terms of how it will be perceived in a happy, angry, or sad context, e.g., in a paragraph (see sec. 6.8 for additional discussion).

Evaluation results for the same sentences synthesized using the kal voice of the Festival speech synthesis software [9] show many similarities to the natural speech data. From Table 6.7 it is seen that many of the utterances were again perceived as sad. Conversion of the input voice using the sad POS models proved to be successful; in this case 43.33% of the utterances were perceived as sad. Using the angry or happy probability models caused the emotional content of the original voice to change. It was especially interesting to see the significant increase in the other and happy responses; the increase in other responses is particularly noticeable. These results clearly show that the modifications cause the original emotion to change.

In contrast to the results for natural speech, the synthetic speech was rarely perceived as angry. It was especially interesting (and unexpected) to see that the angry responses were fewer than the happy responses. There may be many reasons why this is the case. One reason may be that no spectrum modifications were performed (which are very important for the angry emotion). Another reason may be the characteristics of the kal voice. As discussed in [131], not every voice performs similarly when used in concatenative synthesizers. Some voices sound more pleasant than others, and normally the system developers would prefer the more pleasant ones. It is necessary to repeat the same tests with additional synthetic voices in order to understand this behavior better.

It should be noted that, in general, the evaluation of emotional speech is a difficult task. The difficulty comes from the fact that it is not completely known how people evaluate and perceive emotions. Clearly the perception of emotions is related to many factors. These factors are related not only to speech signal characteristics but also to the environment and context. There is a need for more advanced techniques for the evaluation of synthesized emotions.

6.8 Discussion of the results

The listening test results presented in the previous section showed that the best results were achieved for the models trained from sad utterances. This was no surprise, since previous research has also shown that prosody modifications can be successfully used to synthesize sadness. The results also showed that the applied modifications were effective in changing the emotional content of the original utterance. However, this emotional quality change was not predictable in any simple way. For example, modifying the original utterances with the happy models caused new emotional nuances (from a range of emotions) to be perceived.
One possible reason for this variability in the human listeners' responses may be that only individual POS tag prosody was modified, and this may have resulted in utterances whose emotional content was difficult to classify.

Note, for instance, that no specific modifications of the spectrum (except the unintended changes that happen as a result of prosody modifications) were performed. No specific adjustments were made at the utterance level, either. It can be expected that the inclusion of such modifications will increase the strength of the targeted emotions.

In fact, there are some resemblances between the results of text generation using N-grams and the listening test results. As shown in [66] (where unigrams, bigrams, trigrams, and quadrigrams trained on Shakespeare or Wall Street Journal corpora were compared), when unigrams were used, the generated text showed the characteristics of the training corpora, but there were hardly any meaningful sentences. Similarly here, it may be the case that the target emotions were represented in the generated output utterances, but they were difficult to perceive because of the other factors which were not controlled (and which may have masked the effects of the performed modifications). For future research, the modifications may concentrate on two or more tags simultaneously and try to calculate the likelihood of a sequence of parameter values conditioned on bigrams or trigrams. Note that currently the likelihood of the tag values is conditioned only on a single tag.

During the analysis of the parameters, it was shown that the position of the tags was also an important factor. Due to the insufficient amount of data we did not calculate separate probabilities for the tags located in the first or second halves of the sentences. In other words, we did not make any differentiation based on position while calculating the probabilities. In the future, increasing the amount of data will give us the opportunity to do that.

Another issue is whether the tag set that was used was optimal or not. For instance, some of the tag classes could be combined and modeled together. This is another issue that needs to be investigated in future research.

One of the main limitations of this study is the size of the analyzed dataset. Considering the small size of the analyzed dataset (note that, similar to language modeling, the calculated probabilities are strongly dependent on the corpus), it is a necessity to test the proposed method with a larger database. It can be expected that with a larger set, the probabilistic modeling of the interaction of tags will be more accurate, which in turn can be expected to improve the synthesis results. This is one of the main tasks we are planning to address in the future.

Another question that needs to be examined is the effect of the type of emotional corpora. In this chapter, with the purpose of having well controlled repeated measures on the relation of emotions and POS tags, we used semantically neutral sentences, which were acted in four different emotions. The next step will be to perform similar analyses with more natural emotional data [42]. Although we can expect to see many similarities, especially in the way acoustic features are modulated, it will be interesting to observe how the interaction among POS tags changes.
It is our expectation that for this kind of data the differences between some of the tags will be more pronounced than what is currently observed (since naturally different words are selected for the expression of different emotions), which may improve the emotion resynthesis results.

6.9 Conclusions

Emotions have been one of the most challenging topics for many research disciplines. The main challenge comes from the fact that they are influenced by many factors that need to be taken into account during their evaluation. For the expression of emotions in speech, context and sentence syntax are two of the factors influencing how they are expressed. Trying to express all of these factors using manually defined rules is a nearly impossible task. Statistical methods provide a good tool that can be easily trained and applied.

In this chapter, we first showed how statistics can be used to analyze emotional speech characteristics in terms of POS tag prosody, and then described a probabilistic system that can be used for the resynthesis of emotional speech. The results showed that such statistical techniques have a promising future. The usage of probabilistic models for emotional speech synthesis and analysis is a growing new research area full of many possibilities and opportunities.

6.10 Modifications in the POS tag parameter value estimation for the ETET system implementation

For the implementation of the POS tag modifications in the ETET system, several changes were made in the POS tag parameter estimation described in this chapter. The changes were made to improve the system performance. In this section, we describe these changes and show how the POS tag parameter values were estimated.

6.10.1 Maximum/minimum tag estimation and tag ordering

It was observed that the POS tag that had the maximum parameter value was usually the same in the neutral and emotional utterances. Therefore the procedure for maximum tag selection was modified. First, the two tags that had the greatest values in the neutral utterance were determined. Then, comparing the calculated probabilities (the probability that tag1 will have the maximum value in an utterance when tag2 is present in the same utterance), one of the tags (the tag with the maximum probability) was selected as the tag that will have the maximum parameter value.

A similar procedure was used for the estimation of the POS tag that will have the minimum value. First, the two tags with the minimum parameter values were determined from the neutral input utterance. Then the probabilities (the probability that tag1 will have the minimum value in an utterance when tag2 also exists in the same utterance) for each tag were calculated. The tag with the highest probability was selected as the tag that will have the minimum parameter value. Note that this new maximum/minimum tag estimation method was preferred because it performs better than the previously mentioned method.

After the tags that will possess the maximum and minimum parameter values were determined, the parameter estimation was started. Note that for the remaining tags no tag ordering was performed. The only requirement for them was that the values estimated for them should fall between the values estimated for the maximum and minimum tags.

6.10.2 Parameter value generation

The new parameter generation was the following. Note that in this example the parameter for which a value is estimated is denoted as F0.
    if (P(Input > Target) > 0.99){
        continue until (min(target_F0)*0.1 < est_F0 < max(target_F0)*3
                        & est_F0 < input_F0);
    } elseif (P(Input > Target) < 0.01){
        continue until (min(target_F0)*0.1 < est_F0 < max(target_F0)*3
                        & est_F0 > input_F0);
    } else {
        continue until (min(target_F0)*0.1 < est_F0 < max(target_F0)*3);
    }

As can be seen, differently from the old procedure, while generating the parameter values the conditioning based on the probabilities of the input emotion having a greater value than the output emotion was effectively removed by setting the threshold probability to 0.99 (0.01) instead of 0.7 (0.3). Also, in order to allow more freedom in the automatic parameter generation, the minimum (min(target_F0)) and maximum (max(target_F0)) values calculated from the database were relaxed (by multiplying them by 0.1 and 3, respectively). Note that the effect of these modifications is to give more freedom (by easing the restrictions) in the automatic parameter estimation.

Chapter 7

Recognition for synthesis: Automatic parameter selection for resynthesis of emotional speech from neutral speech

The content of this chapter was accepted for ICASSP 2008 [20].

7.1 Introduction

Speech synthesis, and specifically emotional speech synthesis, is a challenging research topic. Two of the main challenges are that (1) there are numerous parameter values that can be selected during the generation of pitch, duration, and energy contours, and that (2) human evaluators are needed to evaluate the synthesizer's performance. The need for human subjects requires that evaluation experiments be carefully designed to minimize the cost, both in terms of time and resources. At the same time, in order to find the best balance between different parameter values, many combinations need to be tested. Clearly, there is a trade-off between the design requirements and the cost.

In this chapter we address the synthesis of angry and happy emotional speech and propose using an automatic emotion recognizer as a preprocessing step to narrow down the size of the evaluation set before it is presented to human raters. The system consists of a prosody modification module which generates a large number of synthetic utterances that are then evaluated using a neural network (NN) emotion recognizer. The output of the recognizer is used to select the parameter combinations performing consistently well, and only these modifications are submitted for evaluation with human subjects.

The emotion characteristics of speech can be partly associated with changes in the prosody (pitch, duration, and energy) parameters [34,116] and partly with the spectral characteristics of speech [17,22]. However, as explained in [116,135], it should be noted that human emotion perception is a complex process which involves many other factors. In this chapter the concentration is just on the prosody parameters.

For synthesis (which in this chapter we use as a synonym for resynthesis) of some emotions, such as sadness, simple prosody modification rules can be utilized. For instance, by lowering the F0 mean (by ≈30%), decreasing the F0 range (by ≈100%), and increasing the duration (by ≈30%), it may be possible to synthesize low-activation and dull speech which would be perceived as sad (or bored, depressed, discontented, fed up, not in high spirits, or weak) under appropriate conditions. Conversely, if the F0 mean is increased beyond a certain level (more than 50% of its original value), speech which would also be perceived as sad (a different type of sadness, however) might be synthesized.
This type of sadness can be described as an extreme sadness which has a distinct cry-like speech quality [17,24].

Synthesis of high activation emotions, such as happiness and anger, is more challenging. Although it has been suggested that large F0 range and mean variations can be beneficial for the synthesis of these emotions [34], in practice this is not usually the case. In many cases, the large F0 mean and range modifications would add a cry-like (due to high pitch) and trembling (due to high jitter) quality to speech, which in turn would favor the perception of sadness. Clearly, there is a fine balance between the different prosody modifications that are needed to successfully synthesize speech that will be perceived as angry or happy. In order to achieve this balance, many parameter value combinations need to be evaluated. Since using human raters for this process is costly, a more efficient technique is needed, for instance a technique that will perform the evaluations automatically. Having such an automatic emotion evaluation system will be beneficial for finding the best modification combinations specific to each sentence, each emotion, and each speaker.

In this chapter we test how successfully machine recognition of emotions can be used to assist human recognition of emotions. Results for two emotions, happiness and anger, are reported. The goal of this work is to inform designs of Recognition for Synthesis (RFS) systems for the automatic evaluation of synthetic speech.

7.2 Recognition for Synthesis (RFS) system description

The proposed algorithm consists of five main stages, which are briefly outlined in this section (see Fig. 7.1) and explained in more detail in the following sections. In stage one (see Sec. 7.2.1), the pitch, duration, and energy of natural utterances are modified using an empirically selected set of parameter values. In stage two (Sec. 7.2.2), using a neural network emotion recognizer, these resynthesized utterances are classified into one of the angry, happy, sad or neutral emotion categories. In stage three (Sec. 7.3), the classification results are used to select the best parameters, that is, those which produce successful recognition results from a machine recognition perspective. In the fourth stage, using the selected parameters, the input utterance prosody is modified, and in the final stage (Sec. 7.4), listening experiments assessing emotional quality are conducted with human raters to determine the set of parameters that perform best from a human recognition perspective.

Figure 7.1: Recognition for Synthesis (RFS) system architecture (block diagram: natural speech, TD-PSOLA pitch/duration/energy modifications, emotion recognizer, pitch/duration/energy parameter selection, TD-PSOLA applied to the input speech, emotional synthetic speech). The emotion recognizer is used to assess the emotional quality of the resynthesized utterances. Based on the results, parameters for synthesis are selected and applied to add emotional quality to the input speech.

7.2.1 Prosody modifications

For modification of the input speech, only prosody modifications were performed, using the TD-PSOLA algorithm as implemented in the Praat software. The modifications were performed on the voiced (V) and unvoiced (U) regions of utterances by scaling the original utterance values by the factors listed below. The voiced and unvoiced region boundaries were automatically detected using the Praat software.
The scaling factors for duration modifications were [0.7, 1, 1.4], for energy modifications [0.5, 1, 2], for F0 median modifications [0.8, 1, 1.25], and for F0 range modifications [0.4, 1, 2.5].

All of the possible modification factor combinations were tested. F0 modifications were applied only on voiced regions, while duration and energy modifications were applied on both voiced and unvoiced regions. The modified values were calculated by multiplying the modification factor and the original value. For example, if the F0 range, voiced region duration and unvoiced region duration modification factors were set to 2.5, 0.7, and 1.4, respectively, and the rest to 1, then for a given input utterance the F0 range would be increased by a factor of 2.5 (i.e., to 250% of the original). The voiced regions' durations would be decreased by a factor of 0.7 (i.e., by 30%) and the unvoiced regions' durations would be increased by 40%.

The range of the tested modification factors was chosen to be large to allow for broader coverage. In order to minimize the test time, the number of the tested factors was kept minimal. Note, however, that a finer resolution of modification factors may be necessary in the future for more advanced and detailed analyses, and (probably, but not necessarily) for better selection of different parameter combinations. Increasing the number of factors increases the number of synthesized utterances exponentially. Even in this case, for example, for a given input utterance, 729 (= 3^6) new utterances were synthesized.

The most accurate and reliable way to evaluate the emotional content of synthesized utterances is through listening tests with human raters. However, this is not feasible due to the large size of the evaluation set. The emotion recognizer, described next, is proposed as a preprocessing step, to aid human listeners, for generating a more manageable evaluation set.

7.2.2 Automatic emotion recognition using neural networks

Neural networks (NN) are popular in machine learning applications because they are able to learn and model non-linear data with high success rates.

Our goal in this chapter was to design an emotion recognition system capable of distinguishing four emotion (angry, happy, sad, neutral) types. For that purpose, we built a 1-hidden-layer feed-forward neural network, with 31 inputs, 5 hidden units and 2 output units, using Matlab's Neural Networks toolbox.

Input variables

The input variables used to train the NN (31 in total) included a variety of prosody parameters (calculated in Praat), as detailed below.

For the whole speech file the following parameters were calculated: (1) F0 mean, (2) F0 median, (3) F0 range, (4) F0 std, (5) F0 minimum, (6) F0 maximum, (7) energy, (8) intensity, (9) duration, (10) 25% quantile of F0, (11) 75% quantile of F0.

Next, the voiced regions were extracted and concatenated together to generate a voiced-regions-only file. For this file, the following parameters were calculated: (12) F0 mean, (13) F0 median, (14) F0 range, (15) F0 std, (16) F0 minimum, (17) F0 maximum, (18) energy, (19) intensity, (20) duration, (21) 25% quantile of F0, (22) 75% quantile of F0, (23) intensity minimum, (24) intensity maximum, (25) intensity minimum time, (26) intensity maximum time, (27) intensity contour mean, (28) intensity contour std.

Finally, the unvoiced regions were extracted and concatenated together to generate an unvoiced-regions-only file. For this file, the (29) intensity, (30) duration, and (31) energy parameters were calculated.
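A minimal sketch of assembling utterance-level F0 statistics like items (1)-(11) above, together with the [0, 1] input scaling mentioned next, is given below. It is illustrative only: in this work the parameters were computed in Praat, whereas here a precomputed voiced-frame F0 contour and NumPy are assumed.

    import numpy as np

    def f0_statistics(f0_contour, frame_shift=0.01):
        """Summary statistics of a voiced-frame F0 contour (Hz). Unvoiced frames
        are assumed to have been removed; the 10 ms frame shift is an assumption."""
        f0 = np.asarray(f0_contour, dtype=float)
        q25, q75 = np.percentile(f0, [25, 75])
        return {"f0_mean": f0.mean(), "f0_median": float(np.median(f0)),
                "f0_range": f0.max() - f0.min(), "f0_std": f0.std(),
                "f0_min": f0.min(), "f0_max": f0.max(),
                "f0_q25": q25, "f0_q75": q75,
                "duration": len(f0) * frame_shift}   # rough duration proxy

    def minmax_normalize(feature_matrix):
        """Scale each input variable to the [0, 1] range before NN training."""
        x = np.asarray(feature_matrix, dtype=float)
        mins, maxs = x.min(axis=0), x.max(axis=0)
        return (x - mins) / np.maximum(maxs - mins, 1e-12)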
Before training the system, all of the input variables were normalized to be in the [0 1] range. Output variables Two output variables were used to represent each emotion. They were (1,1) for happy, (1,-1) for sad, (-1,-1) for angry, and (-1,1) for neutral. We used 2 dimen- sional output vectors because they provide a nice visualization of the four emo- tional spaces. In this case, each quadrant of the Cartesian coordinate system can be regarded as a distinct emotional space. Neural network system design For the construction of the NN system we used the Matlab’s Neural Network tool- box. The network was trained with backpropagation using the following options: trainrp, learngd, mse, and 0.1 learning rate. We experimented with a large number of NNs, by varying the number of hid- den units, the number of hidden layers and type of the activation functions before deciding to use 1-hidden layer (with 5 units) with logarithmic sigmoidal (logsig) function and linear activation (purelin) at the output. This network was chosen because of its high performance, robustness and simple structure. The error func- tionthatwasusedforcomparingdifferentneuralnetworkswasthemisclassification rate of the test emotional utterances. 132 Training data and system performance Theemotionaldatausedfortrainingthenetworkisdescribedin[147]. Thetraining data consist of 408 utterances (102 for each emotion) and test data consist of 112 utterances(28foreachemotion)recordedbyaprofessionalactressinangry,happy, sad, or neutral emotions. The training and test sets were randomly split, and did not have any common utterances. The NN network was trained and tested with 5 different training and test sets, and the recognition accuracies for these test sets were 81.25%, 74.14%, 70.00%, 82.00%, and 80.17%, averaging 77.77%. The average recognition accuracy for the 5 training sets was 94.43%. The confusion matrix summing the test results for these different runs is given in Table 7.1. Emotion Happy-NN Sad-NN Angry-NN Neutral-NN Happy 90 (62.93%) 17 3 33 Sad 7 109 (76.22%) 6 21 Angry 2 1 139 (97.20%) 1 Neutral 17 12 7 107 (74.83%) Table 7.1: Confusion matrix of NN recognition results summing the test results of 5 different runs (112 test utterances in each run). Displayed are the number of the files, and percentages in parenthesis. Emotion-NN indicates the emotions recognized by the NN. 7.3 Modification factor selection Forfurtheranalysisweconcentratedontestset1, theNNrecognitionperformance for which was 81.25%. First, the utterances falling inside the unit circle centered on the [−1 1] point (which corresponds to the neutral output) were selected for further processing. These selected neutral utterances will be referred to as SelNeu. This set consisted of 21 utterances (out of possible 28). 133 −2 −1 0 1 2 −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 HAPPY SAD ANGRY NEUTRAL Figure 7.2: Recognition results for a sample test set with emotion recognition performance of 81.25%. To all of the utterances in SelNeu, the prosody modifications explained in Sec. 7.2.1 were applied. As a result 15309 (= 21 x 729) utterances were resynthe- sized. They will be referred to as ModNeu. (The NN performance on the ModNeu setwas14.64%happy, 22.59%sad, 7.28%angry, and55.49%neutral, showingthat the most of the modifications did not alter the neutral input emotion.) Next, 5 neutral utterances 1 were randomly selected from the SelNeu. Let us call this set EvalNeu. 
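A minimal sketch of how the two-dimensional output coding and the SelNeu selection described above can be realized is given below. The nearest-target decoding rule (which, off the axes, coincides with the quadrant interpretation) and all variable names are illustrative assumptions, not the dissertation's Matlab code.

```python
# Hedged sketch of the 2-D output coding and the SelNeu selection described above.
import numpy as np

TARGETS = {
    "happy":   np.array([ 1.0,  1.0]),
    "sad":     np.array([ 1.0, -1.0]),
    "angry":   np.array([-1.0, -1.0]),
    "neutral": np.array([-1.0,  1.0]),
}

def decode(output):
    """Assign the emotion whose target vector is closest to the 2-D NN output."""
    return min(TARGETS, key=lambda emo: np.linalg.norm(output - TARGETS[emo]))

def select_neutral(outputs):
    """Return indices of utterances falling inside the unit circle centred on
    the neutral target (-1, 1), i.e. the SelNeu set."""
    neutral = TARGETS["neutral"]
    return [i for i, out in enumerate(outputs)
            if np.linalg.norm(out - neutral) < 1.0]

# Example with made-up network outputs for three utterances.
outputs = np.array([[-0.8, 0.9], [0.7, 0.6], [-1.2, 1.4]])
print([decode(o) for o in outputs])   # ['neutral', 'happy', 'neutral']
print(select_neutral(outputs))        # [0, 2]
```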
For EvalNeu utterances, the modification factor combina- tions (SelMod) that made them classified as happy or angry were automatically 1 The utterances that were tested were the following: n1: I am going shopping., n2: Lucy ate all the chocolate., n3: Mickey ate all the raisins., n4: The cat’s meow always makes my finger twitch., n5: The saw is broken so chop the wood instead.. 134 1 2 3 4 5 70 75 80 85 90 95 100 Set number Accuracy Training data Test data Figure 7.3: Recognition results for 5 different training and test sets determined. Note that naturally different modifications were selected for different target emotions and for different utterances. In order to select among the large number of successful modifications, the SelMod modifications were sorted based on their performance on the SelNeu set. Theprocedurewasasfollows. First, theeffectofeachoftheselectedmodifications (SelMod)ontheutterancesofSelNeu wasdetermined. Next,theSelMod modifica- tions were sorted in descending order based on the number of instances for which they produced the target emotion. A separate sorting was performed for each target emotion. After the sorting, the most consistently performing modifications were on the top of the stack. The first five of the sorted SelMod modifications were selected to be used in the human listening experiments. These modifications will be represented as h1, h2, h3, h4, h5 (for happy), and a1, a2, a3, a4, a5 (for angry), and referred to as BestSelMod forth in the chapter. 135 7.4 Listening tests The utterances in set EvalNeu were modified according to the BestSelMod mod- ification factors (different modification factors for every utterance and for every emotion) that were found in the previous step. As a result of these modifications 50utterances(={5neutralutterances}x{5BestSelMod modifications}x{happy, angry}) were resynthesized and presented to human raters for the listening test. 7.4.1 Listening test structure Awebbasedinterface(awebpage 2 preparedusingPerlCGI)showingatablewith 50 rows and 3 columns was used for listening tests. Three speech files were shown on each row. The first file was defined as a reference file and it was indicated that this utterance had neutral emotion with confidence 5 (= the highest confidence). The other two files were two randomly selected modified versions of the reference file. One of these files was synthesized using the parameters selected (as explained in the previous section) for happy emotion, and the other for angry emotion. In a similar manner, all of the remaining utterances were presented on the same web page. The order of the utterances in each row and each column was randomly determined and it was different for every evaluator and every sentence. Listeners were given 5 emotion options but were allowed to select only one of them. These options were Neutral, Angry, Happy, Sad, and Other. Confidence level (for the emotion choice that the rater selects) was measured on a 5 point scale, 5 showing high confidence and 1 low confidence. A total of 27 naive raters (10 female, 17 male) participated in the test. They were not given any detailed information about the nature of the test, except the 2 http://sail.usc.edu/∼mbulut/cgi-bin/evalJul25/comp evalNN.cgi 136 fact that they needed to listen to some utterances and then select the emotions they perceived. All of the subjects had advanced English language skills and they were mostly engineering graduate students. Headphones were used by 19 of them, while the remaining 8 preferred loud speakers. 
The average test duration was 10 minutes. 7.4.2 Listening test results ThetestresultsarepresentedinTable7.2,andTable7.3(matrices(7.1)and(7.2)). For each of the input utterances (n1, n2, n3, n4, n5) the most successful happy (h)(matrix(7.1))orangry(a)(matrix(7.1))modificationsweredeterminedbased onthehumanraters’responses. Therecognitionpercentagesandconfidencescores are shown in Table 7.2. The parameter factor values for these modifications are shown in Table 7.3. The Table 7.2 also shows the average recognition for the best 2, and the best 3 of happy (h1, h2, h3, h4, h5), and of angry (a1, a2, a3, a4, a5) modifications. Sent. Mod. Neutral Angry Happy Sad Other n1 h2 11.11 (4.3) 55.56 (3.9) 18.52 (4.0) 00.00 (–) 14.81 (3.3) a5 25.93 (4.1) 62.96 (3.6) 11.11 (3.7) 00.00 (–) 00.00 (–) n2 h5 38.46 (3.6) 46.15 (3.5) 3.85 (4.0) 7.69 (3.5) 3.85 (4.0) a1 11.11 (4.7) 85.19 (3.7) 3.70 (5.0) 00.00 (–) 00.00 (–) n3 h4 30.77 (4.1) 23.08 (3.3) 15.38 (3.0) 19.23 (3.2) 11.54 (3.7) a4 37.04 (3.8) 51.85 (3.5) 3.70 (4.0) 3.70 (4.0) 3.70 (3.0) n4 h2 18.52 (4.0) 22.22 (3.0) 33.33 (3.4) 00.00 (–) 25.93 (4.2) a4 37.04 (4.4) 22.22 (2.8) 7.41 (2.0) 3.70 (4.0) 29.63 (3.4) n5 h1 42.31 (3.9) 00.00 (–) 19.23 (3.0) 30.77 (3.1) 7.69 (3.0) a4 3.85 (3.0) 92.31 (4.0) 00.00 (–) 00.00 (–) 3.85 (3.0) all 2 best-h 30.93 (3.9) 28.03 (–) 16.92 (3.23) 11.37 (–) 12.75 (3.5) 2 best-a 27.29 (4.03) 55.83 (3.41) 6.00 (–) 4.16 (–) 6.72 (–) all 3 best-h 32.29 (3.9) 21.92 (–) 15.59 (3.3) 18.64 (–) 11.56 (3.2) 3 best-a 29.13 (4.0) 46.35 (–) 5.00 (–) 14.02 (–) 5.50 (–) Table 7.2: Results of listening tests with humans. Recognition percentages (aver- age confidence) are shown. The symbols n1, n2, n3, n4, n5 represent neutral utterances that were modified. The symbols (h2, h5, h4, h2, h1), and (a5, a1, a4, a4, a4) representthebestperforminghappy,andangrymodifications,respectively. 137 Fm Fr Vd Ve Ud Ue n1−h2 1.0 2.5 0.7 0.5 0.7 2.0 n2−h5 1.0 2.5 0.7 0.5 1.4 1.0 n3−h4 1.0 2.5 1.4 2.0 0.7 1.0 n4−h2 1.0 2.5 1.0 0.5 0.7 0.5 n5−h1 1.0 2.5 1.4 0.5 0.7 1.0 (7.1) Fm Fr Vd Ve Ud Ue n1−a5 1.0 1.0 0.7 2.0 1.4 0.5 n2−a1 1.0 2.5 0.7 2.0 1.4 0.5 n3−a4 1.0 1.0 0.7 2.0 1.4 0.5 n4−a4 1.0 1.0 0.7 2.0 1.4 0.5 n5−a4 1.0 1.0 0.7 2.0 1.4 0.5 (7.2) Table 7.3: The modification factor values that worked the best for different utter- ances are shown. (Fm = F0 mean, Fr = F0 range, Vd = Voiced duration, Ve = Voiced energy, Ud = Unvoiced duration, Ue = Unvoiced energy). 7.5 Discussion The results show that the proposed system can successfully select the modification parameters for angry emotion synthesis. For example (see Table 7.2) for n5 one of the selected modifications (a4) by the system was confidently (4.0) perceived as angry by 92.31% of the listeners. Similarly, forn1, n2, n3, at least one of the mod- ifications automatically selected by the system made the synthesized utterances perceived as angry above the chance rate (20%). The average values measured for thebest2modifications(55.83%)andthebest3modifications(46.35%)showthat some of the other selected modifications were also successful in converting neutral speech into angry speech. As seen from the matrix 7.2 (in Table 7.3), for angry speech generation one need to decrease the voiced speech duration (Vd), and unvoiced speech energy (Ue), and increase voiced speech energy (Ve) and unvoiced speech duration (Ud). 138 For happy speech synthesis, only the result for the modification (h2) selected for neutral utterance n4 was above the chance level. 
In general, we observe that automatically selected modification for happy speech synthesis, caused the synthe- sized utterances to be perceived with wide range of emotions, mostly as neutral (e.g., n2, n3, n5), angry (e.g., n1, n2, n3, n4), or sad (e.g., n3, n5). The low performance achieved for happy emotion can be attributed to several causes. First, examining the NN recognition results in Table 7.1 we note that nat- ural happy utterances were recognized with 62.94% accuracy (cf. anger 97.20%), indicating that the NN recognizer might have misclassified some modifications which might be perceptually important for happiness, causing the BestSelMod for happy synthesis to be insufficient. Second, as examined in detail in [147], for the naturalemotionalspeech,thespeaker’sexpressionofhappinesswassometimescon- fusablewithanger. Asimilarconfusionbetweenangerandhappinessisobservedin the results in Table 7.2, which indicates that an improved emotional database may perform better. Third, for synthesis of happiness simple modifications on voiced and unvoiced speech regions might not be sufficient. It can be expected that finer modifications taking the word, phrase and stress pattern structures into account would improve the results [22,34,116]. Also importantly, note that there are other factors beyond prosody that can influence emotion perception, e.g., spectral enve- lope characteristics [17,22] and the transmission medium [116,135]. Using just the prosody factors may be one of the causes of the limited performance observed for happy emotion. Note however that the idea (of using a recognizer to select data parts) itself is general. If the listening test structure was altered and only three options, happy, angry, other, were presented to the raters, then the results may have been different. In that case, the listeners would psychologically be concentrated on the contrast 139 between the angry and happy emotions, which could be expected to increase the recognition of both emotions. The current method (with 5 options) was selected in order to be consistent with the evaluation of the natural speech of the same speaker done in [147]. The results show that there are clear differences between the automated emo- tion classification and human perception. For instance, many of the parameters selected by the system (even for angry speech) were not useful for the synthesis of the targeted emotion. Also, although not examined here, it may be the case that some perceptually important parameter combinations were not selected. 7.6 Conclusion Considering the wide range of possible modifications that can be applied on a speech signal to synthesize emotional speech, there is a need for a system that can select the parameters that are expected to perform well, thus narrowing down the sample set that needs to be evaluated by human raters. In this chapter, such a recognition for synthesis (RFS) system, combining emotion recognition and synthesis, is described. The results show that the proposed RFS system is promising for selecting parametersforemotionalspeechresynthesis. Consideringthesignificantlydifferent performances for different emotions, and the differences observed between human and machine perception of emotions, however, at this stage we prefer to view the proposed automated evaluation more as a preprocessing step than a replacement to human evaluations. 
Our future research will be directed towards the design of more robust systems, more sophisticated parameter modifications, and experi- menting with different parameter selection techniques and additional emotions. 140 Chapter 8 Analysis of effects of F0 modifications on emotional speech The content of this chapter is accepted for publication in JASA [21]. 8.1 Introduction Studies of emotional speech have shown that emotion change can be associated with changes in the prosodic and spectral characteristics of speech signals [17,22, 24,27,80]. The concentration has been mainly on prosody parameters, such as F0, duration, and energy. Among these, significant attention has been paid to F0 contour modulations occurring as a result of emotion change [34,85,116]. Acoustic analyses of angry or happy speech show that, in general, their F0 mean,median,rangeandvariancevaluesarelargerthantheirneutralspeechcoun- terparts, which are larger than sad emotion F0 values [39,61,85]. The F0 contours of happy and angry speech, in most cases, are more variable than neutral speech, showing fast and irregular up and down movements, while the sad speech F0 con- tours show smaller variation and downward inflections [39,85]. Although these findings are fairly consistent across different studies, differences are not uncom- mon. For instance, in [147] sad speech had a higher F0 mean than neutral speech. Despite having a powerful descriptive value, the aforementioned technique for studying emotions has several limitations. For example its implementation in 141 emotionalspeechsynthesisislimited[34]becauseitdoesnotspecificallyaccountfor thevariabilitypresentinthenaturalspeech[14,32,95]. Inthetraditionaltechnique, an emotional utterance is represented as a point in the parameter space. We suggestanewmodelwhere each utteranceis representedbyan ”emotional region” intheparameterspace. WeshowhowF0mean,rangeandshapecharacteristicscan varyinemotionalutterances, howthesevariationscanbemodeled, andthefactors that cause the variability. The results also show the role of F0 characteristics in emotion and speech perception. The concept of the variability of prosodic patterns was studied in a database composed of two repetitions of 1000 sentences recorded with 6 months separation by [32]. TheresultsshowedwidevariationsinF0values,sometimescorresponding to 50% of the dynamical range of the speaker. In an another study [14] iterative mimicrywasemployedtoobservewhetherF0contoursconvergetospecificEnglish intonation patterns, referred to as “attractors”. It was only after several iterations that F0 branching (i.e., clustering) patterns were seen. However, even then the variability of F0 contours was noticeable. This was due to the fact that “human variability places a lower limit on the width of the branches” [14]. In this paper, focusing on emotional speech, the concept of emotional regions is introduced to model the variability in the F0 characteristics of emotional utter- ances. A model based on F0 mean, range, and standard deviation statistics is proposed. Following an analysis-by-synthesis approach it is shown that the pro- posed model gives reliable estimation of how F0 contours of emotional utterances can be modified without significantly affecting the perceived emotional content and speech quality. This representation is helpful to better assess the role of F0 142 contours in emotion perception. 
Also, when applied together with the F0 genera- tion models such as ToBI [123] or Tilt [132], it can be used to better predict the intonation events in emotional speech synthesis. Emotion perception is a result of the interplay between acoustical, lexical, and environmentalfactors[135]. Thesefactorscanbeexpectedtohaveanaffectonthe emotional regions. We show how the speaker and utterance characteristics affect the emotional regions and analyze the interaction between different factors using statistical tests. The effects of modifying F0 contour shape, F0 range, and voice quality characteristics of emotional utterances (in German) were statistically analyzed by[72]forarousalrelated(relaxed/aroused,open/deceitful,annoyed/content,inse- cure/arrogant,andindifferent/involved)andcognitionrelated(emphasis,coopera- tiveness, contradiction, surprise, and reproach) emotions. The results showed that text (i.e., sentencecontent)hadasignificanteffectonlistenerjudgment. Similarly, the speaker factor (i.e., who uttered the utterance) also had significant effect for allemotioncategories, exceptarrogant. TheresultsalsoshowedthatmodifyingF0 range had a significant (and continuous) effect on emotion perception, especially on speaker arousal. The effects of contour changes were less prominent than range modifications. Similarly, in this paper, we also analyze the effects of F0 range and contour modifications, and also consider the sentence and speaker as independent factors. However our focus, in contrast, is is on how F0 acoustic features interact with sentence, speaker, and emotion factors to influence the perception of emotion and speech quality. We use angry, happy, sad, and neutral labels, which are a subset of the emotional labels suggested by [46], to describe the emotions. Inthispaper,theresultswerealsoanalyzedfromtheemotionsynthesisperspec- tive[24,27,104,119]. ItwasobservedthatF0modificationcausedtheperceptionof 143 sad and neutral emotions to increase, and angry and happy emotions to decrease. The effects of F0 range modifications were continuous [72] and more significant than F0 mean modifications. F0 contour shape modifications were also effective but only when performed in large semitone scales. It was also observed that the listeners were still able to perceive the emotions in a manner similar to that of the unmodified natural utterances even when the speech quality was distorted. In the next sections we first describe the performed F0 mean, range and shape modifications (Sec. 8.2) and how they were evaluated (Sec. 8.3). The concept of emotional regions is introduced in Sec. 8.4 and the statistical analyses results are presented in Sec. 8.5 and Sec. 8.6. The discussion and conclusion follow in Sec. 8.7 and Sec. 8.8, respectively. 8.2 Data preparation InthissectiontheemotionaldatacollectionandtheF0modificationsareexplained. 8.2.1 Data collection Two sentences, She told me what you did. (sentence 1) and This hat makes me look like an aardvark. (sentence 2) were recorded by a female speaker (speaker 1) and a male speaker (speaker 2). Both of the speakers were in their late twenties. Speaker 1 had some professional acting experience, while speaker 2 did not. The speakers were instructed to utter the two sentences in angry, happy, sad, and neutral (i.e., no particular emotion) emotion styles, resulting in a total of 16 utterances (see Fig. 8.1). However, no specific instructions were given about how the emotions should be expressed. 
In other words, the interpretation and expression of emotions was left to the speakers themselves. The speech was recorded in a quiet room at a 48 kHz sampling rate using unidirectional head-worn dynamic Shure (model SM10) microphones, and was later downsampled to 16 kHz. Listening tests were conducted afterwards to evaluate the success of emotion production. The results showed that human listeners were able to correctly identify (with approximately 80% success on average) the emotions expressed by the speakers. Figure 8.1: (color online) F0 contours of all 16 utterances that were recorded. (H, A, N, S, spk, and sent denote happy, angry, neutral, sad, speaker, and sentence, respectively.) 8.2.2 F0 modifications Several modifications manipulating the mean, range, and shape of the natural F0 contours were applied to all of the recorded emotional utterances (which will also be referred to as original utterances). The F0 mean, range, and shape modifications were performed using the TD-PSOLA algorithm [82] as implemented in the Praat software [13]. The applied modifications can be categorized into three groups: mean, range, and stylization modifications (summarized in Table 8.1).

            F0 Mean       Range        Stylization
Increase    m1: +10%      r3: +50%
            m2: +15%      r4: +100%
            m3: +25%
            m4: +50%
Decrease    m5: -10%      r1: -50%
            m6: -15%      r2: -25%
            m7: -25%
            m8: -50%
Set value   m9: =50       r5: =10      s1: =2
            m10: =100     r6: =30      s2: =5
            m11: =150     r7: =50      s3: =10
            m12: =200     r8: =80      s4: =15
            m13: =250     r9: =110     s5: =40
            m14: =300     r10: =150

Table 8.1: Summary of the performed F0 contour modifications. The values for mean and range are in Hz and the values for stylization are in semitones.

Modifications in F0 mean: The mean was modified by shifting the F0 contour up or down. The following modifications were applied: (1) Increasing/decreasing the original F0 mean by 10%, 15%, 25%, and 50%, (2) Setting the F0 mean equal to 50, 100, 150, 200, 250, and 300 Hz. Modifications in F0 range: The range was modified by multiplying the F0 contour with a constant and then shifting the contour up or down so that the mean would be the same as the original mean value. The following modifications were applied: (1) Scaling the range by 0.5, 0.75, 1.5, and 2, (2) Setting the F0 range equal to 10, 30, 50, 80, 110, and 150 Hz. Figure 8.2: (color online) Stylization example for the happy utterance, speaker = 1, sentence = 1. Blue circles = original F0 contour, red squares = 2 semitone stylization, black triangles = 10 semitone stylization, green dots = 40 semitone stylization. Stylization modifications: The shape of the F0 contour of the utterances was altered by stylizing the F0 contour. 
The following modifications were applied: Stylizing the F0 contour by a 2, 5, 10, 15, and 40 semitone frequency resolution. Stylization of the F0 contour was performed using Praat software. The logic behind the stylization algorithm is to try to represent the F0 contour using linear segments. The length of the linear segments is determined by the frequency reso- lution component. For instance, while 2 semitones resolution corresponds to fairly short linear segments, thus preserving the general contour shape, 40 semitones resolution may cause the whole utterance F0 contour to be a line (see Fig. 8.2 for an example). Asaresultofapplyingtheaforementionedmodifications,29utterances,allhav- ing exactly the same duration as the original utterance but different F0 contours, 148 were resynthesized for each of the original (i.e., recorded, natural) utterances. In total, including the original utterances, there were (30x16 =) 480 utterances. 8.3 Listening tests All natural and resynthesized utterances were evaluated by listening experiments with na¨ ıve listeners that included both native and non-native American English speakers. Before evaluation, all speech files were normalized so that the maximum digitized waveform amplitude was 1. In the listening tests - conducted in a quiet room, using headphones and with a single rater at a time - first the speech file was presented and then the raters were asked to choose among the following options: Happy, angry, sad, neutral and other. The raters were particularly instructed to choose other if their choice of emotion was not listed or if they could not decide on the emotional content, or if the speech sounded to them as a mixture of several emotions. Theywereallowedtolistentoeachutteranceasmanytimesastheyliked before making their decision. After the raters had chosen the emotion, they were asked to rate the naturalness (i.e., speech quality) of the utterance on a scale from 1to5, with5correspondingtothemostnatural. Theywerespecificallyinstructed to give low values if the speech was perceived to be different from natural human speech in terms of quality. Again, the raters were able to listen to the speech as many times as they liked. The files were presented in a different random order for each rater. In order to limit the time of any single test to around 20 minutes (it is known that the listeners’ judgment abilities are negatively affected for tests lasting long periods of time), the test set was divided into 10 groups of 48 utterances, each 149 consisting of 3 variations of the 16 original utterances (which were chosen ran- domly). After the completion of a set, listeners were given the opportunity to rest (or to continue some other time), or to continue with a different test set. The average number of raters per set was 9.2. In total, there were 14 different people that participated. Of these, 7 people (3 female, and 4 male listeners) evaluated all utterances. Mostoftheratersweregraduatestudentsintheirmidtolatetwenties. 8.4 Emotional regions in F0 mean-range space One of the basic characteristics of natural speech is its variability [14,32]. In order togeneratemodelsforspeechproduction,synthesis,andperceptionthisvariability shouldbeaccountedfor. Inthissection,weshowexamplesofthevariabilitypresent in emotional speech and propose a model to parametrize it. For each of the resynthesized utterances, the F0 contour was calculated using Praatsoftware. 
Afterremovingtheoutliersandsmoothingusingamedianfilterof length3,F0mean,F0range(=(0.975quantile)-(0.025quantile)),andF0standard deviation (std) statistics were calculated. Based on the results of the listening tests, all of the resynthesized utterances were assigned an emotional label using majority voting. Then, each one of these utterances was grouped together with its original version only if its emotion was the same as the original utterance. As a result, 16 (= 2 speakers x 2 sentences x 4 emotions) groups (one for each original utterance) were generated. The utterances in these groups were used to construct the emotional regions. We introduce the idea of emotional regions to model the variability in the F0 parameter values of emotional utterances. Using emotional regions one can theoreticallyrepresenthowtheF0contourofanutterancecanbemodifiedwithout 150 significantly affecting its emotion and speech quality. Note that, the dimension of these regions is dependent on the number of parameters that are used. In this paper, for easy visualization we worked with two dimensional (2D) regions, which were estimated based on the F0 mean and range values. If F0 contour shape was also considered as a factor the emotional regions would be three dimensional (3D). Grouping the utterances into 16 groups based on speaker, sentence, and emo- tion, as explained above, for each group the group mean vector and covariance matrix were calculated and constant Mahalanobis distance contours – equal to 3 – were determined. The center and shape of these contours are determined by the mean vectors and by the covariance matrices, respectively. The contours are ellipses (Fig. 8.3) and they represent the equal probability density Gaussians [43]. The Mahalanobis distance was set to 3 as a results of experiments that showed that these contours were reliable estimates for the distribution of resynthesized utterances as it can be seen in Fig. 8.4. Each of these Gaussian emotional regions, shown in Fig. 8.3, represent a subset of possible F0 values with which a given original utterance can be modified to maintain the same emotion perception by the majority of the listeners. Note that the Gaussian emotional regions in Fig. 8.3 are considered a subset of the true emotional regions because they were estimated based on a limited set of modifications (listed in Table 8.1). Speech quality can be also included as one of the factors determining the emo- tional regions. In this case in addition to the requirement that the utterances need to be perceived with a certain emotion they are also required to be perceived with a certain minimum average speech quality. For example, denoting the average speech quality by ρ, it may be required for each of them to satisfy ρ≥ 3, ρ≥ 3.5, ρ ≥ 4 or ρ ≥ 4.5 conditions. Under these requirements, the area of emotional 151 0 200 400 0 200 400 N spk=1, sent=1 A H S 0 200 400 0 100 200 300 N spk=1, sent=2 A H S 0 100 200 300 50 100 150 200 250 N spk=2, sent=1 A H S 0 100 200 300 0 100 200 N spk=2, sent=2 A H S Figure 8.3: (color online) The Gaussian emotional regions for each emotion, speaker, and sentence. x axis = F0 mean (Hz), y axis = F0 range (Hz). regions can be expected to decrease as quality requirements increase. An exam- ple is shown in Fig. 8.4, which shows the emotional regions for angry utterances. Although not show, when higher quality conditions were applied, the size of the emotional regions (shown in Fig. 8.3) decreased in a similar manner for the other emotions as well. 
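To make the construction of the Gaussian emotional regions concrete, the following numpy sketch computes a group's mean vector and covariance and tests whether a candidate [F0 mean, F0 range] point falls inside the constant Mahalanobis-distance-3 contour. It is an illustrative reconstruction under the definitions above, not the code used in the study, and the numeric values are made up.

```python
# Minimal sketch of the Gaussian emotional regions: each region is the set of
# [F0 mean, F0 range] points within Mahalanobis distance 3 of a group's mean
# vector under the group covariance.
import numpy as np

def gaussian_region(points):
    """points: (N, 2) array of [F0 mean, F0 range] values for one
    speaker/sentence/emotion group. Returns (mean vector, inverse covariance)."""
    mu = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    return mu, np.linalg.inv(cov)

def inside_region(candidate, mu, cov_inv, max_dist=3.0):
    """True if the candidate [F0 mean, F0 range] point lies inside the
    constant-Mahalanobis-distance contour that defines the region."""
    d = candidate - mu
    return float(np.sqrt(d @ cov_inv @ d)) <= max_dist

# Illustrative values only: F0 mean / F0 range pairs (Hz) for one group.
group = np.array([[210.0, 180.0], [230.0, 210.0], [250.0, 160.0],
                  [220.0, 190.0], [240.0, 200.0]])
mu, cov_inv = gaussian_region(group)
print(inside_region(np.array([235.0, 195.0]), mu, cov_inv))  # True for this data
```

The Euclidean regions introduced next can be viewed as replacing this covariance-based distance with a simple circle whose radius equals the utterance F0 standard deviation.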
The emotional regions shown in Fig. 8.3 and Fig. 8.4 were estimated using the resynthesized utterances. In order for them to be used in real life applications they need to be estimated automatically for individual utterance F0 contours. A covariance matrix cannot be calculated using only the F0 contour vector, but 152 0 200 400 100 200 300 400 spk=1, sent=1 100 200 300 400 0 100 200 300 spk=1, sent=2 0 200 400 100 200 300 spk=2, sent=1 100 200 300 50 100 150 200 spk=2, sent=2 Figure8.4: (coloronline)Emotionalregionsfordifferentspeechquality(ρ)require- mentsforangryemotion. Theareaofemotionalregionsdecreaseasqualityrequire- ments increase. Green: all cases (same as Fig. 8.3), Red: ρ ≥ 3, Blue: ρ ≥ 3.5, Magenta: ρ≥ 4, Black: ρ≥ 4.5. Small circles show the resynthesized utterances. x axis = F0 mean (Hz), y axis = F0 range (Hz). standard deviation (std) can be used instead. And using the F0 mean, range and std values together, Euclidean emotional regions can be constructed (Fig. 8.5). AsshowninFig.8.5,contoursofonestdEuclideandistancefromthe[F0mean, F0 range] point were plotted and considered as Euclidean emotional regions for a given utterance. Note that the regions generated by this technique were circles with radius equal to the utterance F0 std. The circle center was defined by the utterance F0 mean and F0 range. 153 0 200 400 600 100 200 300 400 500 spk=1, sent=1 N A H S 200 300 400 100 200 300 spk=1, sent=2 N A H S 100 200 300 50 100 150 200 spk=2, sent=1 N A H S 50 100 150 200 250 50 100 150 200 spk=2, sent=2 N A H S Figure 8.5: (color online) The estimated Euclidean emotional regions. x axis = F0 mean (Hz), y axis = F0 range (Hz). In order to determine how well the Euclidean regions can approximate the Gaussian regions, they were plotted together as shown in Fig. 8.6. Comparing the two regions we note that for most of the groups Euclidean regions lie inside the Gaussian regions. This shows that the regions estimated by the Euclidean method are reasonable and accurate representative of a subset of Gaussian regions. This is more clearly seen in Fig. 8.4 where the Gaussian regions for different quality con- ditions were plotted together with the Euclidean regions (shown as dotted circles). While observing the plots note the similarity between the Euclidean emotional regions and higher quality (ρ ≥ 4.5) Gaussian regions. The Fig. 8.4 shows the 154 0 100 200 300 0 50 100 150 200 N spk=2, sent=2 A H S 0 100 200 300 50 100 150 200 250 N spk=2, sent=1 A H S 100 200 300 400 500 0 100 200 300 N spk=1, sent=2 A H S 0 200 400 600 0 100 200 300 400 500 N spk=1, sent=1 A H S Figure8.6: (coloronline)TheperceivedGaussianemotionalregionsandestimated Euclidean emotional regions. x axis = F0 mean (Hz), y axis = F0 range (Hz). results for angry utterances only, but the results are similar for other emotions as well. From the figures it can be seen that the emotional regions were different for different speakers. For example in Fig. 8.3 note that for speaker 1, the happy regions did not lie inside sad or neutral region, while for speaker 2 it did. Also note that for speaker 2, the intersection between angry and neutral regions was smallerincomparisontotheirintersectionforspeaker1. Inadditiontothespeaker related differences, differences due to sentences were also observed. For instance, 155 forspeaker1,sentence2theneutralregionwasinsidesadregion,whileforsentence 1 it was not. Different utterances had different emotional regions because they had different F0contours(seeFig.8.1). 
Since,theF0contourisdependentonsentence,speaker, andemotioncharacteristicstheemotionalregionsaredependentonthemtoo. The effect of these different factors are examined in more detail in the next section (Sec. 8.5) where statistical analyses results are presented. It is important to note that the present emotional regions are proposed as models to represent the variability of single utterance F0 parameter values, and therefore they are specific to the utterance itself. In other words, they show the rangeswithinwhichtheutteranceF0parameterscanbemodifiedwithoutaffecting its emotional and speech quality. But they do not necessary show how these parameters should be modified to synthesize speech with a new emotion. For example, if a happy utterance is modified so that its new F0 values fall outside the happy emotional region, it is known that it will not be perceived as happy anymore. However, it is not necessarily true that if the new point is in the neutral (or any other) region then the utterance will be perceived as neutral. This is due to the fact that the perception of emotions is based not only on F0, but on the combined effects of prosodic, spectral, and linguistic factors. Therefore only when all of these factors are used to construct multi dimensional regions one can predict the emotion only from the region itself. Note that the present emotional regions can be considered as the projections of the hypothetical multi-dimensional regions on the linguistic (sentence), spectral (speaker), and F0 planes. As shown in Fig. 8.4, in general, the Euclidean emotional regions can be con- sidered to be reasonable estimates of the high quality Gaussian regions. As seen 156 in Fig. 8.6 and Fig. 8.4, in the Euclidean method the assumption is that the vari- ations in F0 mean and F0 range directions are equal. If needed, another model which estimates possible F0 range and F0 mean variations separately can be also constructed. For example, if in addition to the F0 contour, word (and/or syllable) boundaries are also know one can calculate the F0 mean and range values for each word (syllable). Then, these two vectors can be used to calculate the covariance matrix, which can be used to form Gaussian emotional regions. 8.5 Statistical analysis of emotion and speech quality perception In order to examine the effects of utterance emotion, speaker, sentence, and mod- ification factors on emotion and quality perception a 4-way (4x2x2x30) repeated measures ANOVA model was designed. The model consisted of 4 independent variables (original utterance emotion (4), speaker (2), sentence (2), and modifi- cation (30)) used as repeated measures, and 2 dependent variables (emotion and speech quality). The model was fully counterbalanced across 7 subjects (i.e., lis- teners) that evaluated all of the 480 utterances (which correspond to all possible combinations of the independent variables). Ofthewithin-subjectindependentvariables,utteranceemotiontype has4levels which reflect the emotions intended by the speakers. These levels correspond to happy, angry, sad, and neutral emotions. The speaker variable has 2 levels, corresponding to speaker 1 (who was a female) and speaker 2 (who was a male). The sentence variable has two levels, sentence 1 (She told me what you did.), and sentence2(Thishatmakesmelooklikeanaardvark.). 
Andfinally,themodification variablehas30levelsthatcorrespondtoalloftheperformedF0modificationscases 157 (29), plus the no modification (i.e., original) case (see Table 8.1 for the complete list of the modifications). There are two dependent variables. Of these, emotion selection is a nominal variablethatwasdefinedasadichotomousoutcomereflectingwhethertheemotion selected by a listener for a resynthesized utterance was the same as the emotion of the original utterance. If they were the same the variable was set to 1, if they were different it was set to 0. A dichotomous variable was used, because the purpose of the experiment was to investigate specifically the role of the F0 component in the perception of the original emotion. The quality dependent variable was used as a measure of the perceived speech quality, which was evaluated on a 5 point scale as explained in Sec. 8.3. 8.5.1 Factors influencing emotion perception The null hypothesis tested was the following. The probability of intended (i.e., original) emotions correctly perceived by listeners is equal across different variants in a group. The variants in this case, as explained above, consisted of all possible combinations of independent variables. There were 480 different variants in total, which were grouped based on the sentence, speaker, emotion, and modification (mean, range, or stylization) factors, resulting in 48 (= 2 x 2 x 4 x 3) groups. Inourexperimentalsetupeachoneofthe7listeners(i.e.,subjects)evaluatedall of the utterances. Thus, the subjects were treated as related samples. Therefore, to test the null hypothesis, Cochran’s Q test was used. The required condition for the application of the Cochran’s Q test, that the number of the conditions (K) and number of the listeners (N) are such that KN>30, was satisfied for all of the analyzedgroups. TheresultsofthesetestsareshowninTable8.2. Thestatistically significant (p< 0.05) results are shown in italic for ease of differentiation. 158 Modification Mean Range Stylization Emotion Spk./Sent. sent1 sent2 sent1 sent2 sent1 sent2 Happy spk1 Q(14)=16.26 Q(14)=27.06 Q(10)=20.00 Q(10)=17.61 Q(5)=19.52 Q(5)=10.77 p=0.298 p=0.019 p=0.029 p=0.062 p=0.002 p=0.056 spk2 Q(14)=29.63 Q(14)=32.36 Q(10)=16.79 Q(10)=9.00 Q(5)=10.91 Q(5)=4.00 p=0.009 p=0.004 p=0.079 p=0.532 p=0.053 p=0.549 Angry spk1 Q(14)=28.24 Q(14)=11.41 Q(10)=38.31 Q(10)=10.00 Q(5)=5.00 Q(5)=10 p=0.013 p=0.654 p<0.001 p=0.440 p=0.416 p=0.075 spk2 Q(14)=14.25 Q(14)=30.84 Q(10)=33.08 Q(10)=14.70 Q(5)=3.46 Q(5)=18.10 p=0.431 p=0.006 p<0.001 p=0.144 p=0.629 p=0.003 Sad spk1 Q(14)=11.17 Q(14)=31.31 Q(10)=7.78 Q(10)=4.24 Q(5)=3.40 Q(5)=5.56 p=0.673 p=0.005 p=0.651 p=0.936 p=0.639 p=0.352 spk2 Q(14)=32.62 Q(14)=16.63 Q(10)=15.22 Q(10)=7.14 Q(5)=8.23 Q(5)=15.00 p=0.003 p=0.277 p=0.124 p=0.712 p=0.144 p=0.010 Neutral spk1 Q(14)=29.38 Q(14)=30.92 Q(10)=24.00 Q(10)=29.34 Q(5)=15.85 Q(5)=21.07 p=0.009 p=0.006 p=0.008 p=0.001 p=0.007 p=0.001 spk2 Q(14)=21.33 Q(14)=26.80 Q(10)=11.72 Q(10)=10.00 Q(5)=15.29 Q(5)=6.30 p=0.093 p=0.020 p=0.304 p=0.440 p=0.009 p=0.278 Table 8.2: Cochran’s Q statistics calculated for emotion selection dependent variable. Significant results are in italic form. 159 Note that, since the purpose of this analysis was to investigate whether F0 modificationsaresufficienttoaltertheemotionalcontentoforiginalutterances,for each of the cases compared in Table 8.2, the original (i.e., unmodified) utterances were also included. 
For example, the results reported in the lower right corner (of Table 8.2) are for the group consisting of neutral sentence 2 recorded by speaker 2 anditsstylizationmodifications,intotal6utterances(5modifiedandoneoriginal). The size of the groups comparing mean, range, and stylization modifications were 15, 11, and 6, respectively. From the results it is observed that the effects of F0 modifications on emo- tionperceptionweredependenton emotion, speaker and sentence factors, showing the complex interactions between these parameters. For example, it is seen that sentence 2 uttered in angry emotion was not significantly influenced by the range modifications, whileincontrast, thesamemodificationscausedtheperceivedemo- tions for angry sentence 1 to be significantly different than its original. Note however that when sentence 2 uttered by speaker 1 in neutral style was modified by the same F0 range modifications, the perceived emotions were different than the emotions perceived for the original utterance. Many such examples can be observed from the results in Table 8.2. Also it is notable that especially for speaker 1 modifying the F0 characteristics ofneutralutterancescausedtheperceptionofnewemotionalnuances. Incontrast, this result was less common for angry and happy emotions, and the least common for sad emotion. 8.5.2 Factors influencing speech quality perception The null hypothesis tested was following: The mean of the perceived quality is the same under different conditions. The repeated measures ANOVA results are 160 reported in Table 8.3 (the significant results are shown in italic). Shown in the tables are the F values calculated from Greenhouse-Geisser tests. This test was preferred because it accounts – by adjusting the degrees of freedom – for the violations of sphericity condition. Factor Greenhouse-Geisser statistics Emotion F(1.72,10.30)=1.87, p=0.203 Speaker F(1,6)=0.96, p=0.366 Sentence F(1,6)=0.02, p=0.890 Modification F(3.84,23.02)=28.27, p<0.001 Emo * Spk F(2.50,15.03)=3.64, p=0.043 Emo * Sent F(1.20,7.21)=7.70, p=0.024 Emo * Modif F(5.40,32.40)=6.07, p<0.001 Spk * Sent F(1,6)=3.144, p=0.127 Spk * Modif F(4.24,25.45) = 13.25, p<0.001 Sent * Modif F(4.12, 24.74) = 5.16, p=0.003 Emo * Spk * Sent F(1.22, 7.30)=7.21, p=0.027 Emo * Spk * Modif F(5.43, 32.60) = 2.64, p=0.037 Emo * Sent * Modif F(5.20, 31.22)=2.76, p=0.034 Spk * Sent * Modif F(4.67, 28.06) = 3.36, p=0.019 Emo * Spk * Sent * Modif F(4.89, 29.35)=2.84, p=0.034 Table 8.3: Repeated measures ANOVA statistics calculated for quality dependent variable. The reported are the F values for Greenhouse-Geisser tests. The results show that the main effects of emotion, speaker and sentence fac- tors were insignificant, while the main effect of modification was significant (see Table 8.3). Interesting results were found from the interaction analysis of the within-subject factors. Note for instance that the effect of F0 modifications (on theperceivedspeechquality)wassignificantlydependentonemotion,speaker,and sentencevariables. Alsonotethattheeffectofspeakerswassignificantlydependent on emotion, but not on sentence. InordertoanalyzetheeffectsofF0modifications, speakerandsentencefactors for different emotion conditions, statistical analyses were performed separately for differentemotions. TheseresultsareshowninTable8.4. Foralloftheemotions, it 161 is seen that the main effect of speaker was not statistically significant. In contrast, the main effect of modification was significant in all cases. 
Interestingly, we also observe that the effect of sentence was significant for angry and neutral emotions, but not for happy and sad. In fact, note that the patterns of significant results were the same for happy or sad emotions and somewhat similar between angry or neutral emotions. 162 Factor Happy Angry Sad Neutral Speaker F(1,6)=1.66 F(1,6)=0.75 F(1,6)=2.13 F(1,6)=0.03 p=0.245 p=0.421 p=0.195 p=0.864 Sentence F(1,6)=0.85 F(1,6)=11.97 F(1,6)=0.55 F(1,6)=21.26 p=0.392 p=0.013 p=0.486 p=0.004 Modification F(3.68,22.06)=16.84 F(4.23,25.38)=26.93 F(4.63,27.79)=7.97 F(4.52,27,12)=24.55 p<0.001 p<0.001 p<0.001 p<0.001 Spk * Sent F(1,6)=1.92 F(1,6)=33.51 F(1,6)=3.67 F(1,6)=9.72 p=0.215 p=0.001 p=0.104 p=0.021 Spk * Modif F(4.43,26.56)=6.98 F(4.21,25.27)=8.46 F(4.72,28.33)=3.05 F(5.01,30.07)=7.59 p<0.001 p<0.001 p=0.027 p<0.001 Sent * Modif F(4.75,28.50)=2.12 F(4.69,28.12)=8.36 F(4.72,28.29)=1.87 F(5.02,30.13)=2.35 p=0.095 p<0.001 p=0.135 p=0.065 Spk * Sent * Modif F(4.14,24.82)=1.55 F(4.41,26.44)=2.64 F(4.51,27.03)=2.57 F(4.74,28.46)=5.23 p=0.217 p=0.051 p=0.055 p=0.002 Table 8.4: Repeated ANOVA statistics calculated for quality dependent variable. The reported are the F values for Greenhouse-Geisser tests. Significant results are shown in italic for easy differentiation. 163 8.6 EffectsofF0modificationsonemotionalcon- tent In this section, the effects of F0 modifications are compared in terms of emotional contentthatwasperceived. Allofthe14listeners’responseswereincludedinthese evaluations. In Fig. 8.7 the changes in the emotion recognition percentages observed after each modification are shown. The change was defined as the difference between recognition percentages of unmodified and modified utterances. Chi-square tests with 95% confidence interval were used to calculate whether or not the change was significant. The discussions below focus mainly on the significant modifications. 164 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 −20 0 20 40 speaker 1 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 s1 s2 s3 s4 s5 −20 0 20 40 speaker 1 (a) (b) m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 −20 0 20 40 speaker 2 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 s1 s2 s3 s4 s5 −20 0 20 40 speaker 2 (c) (d) m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 m13 m14 −3 −2 −1 0 quality r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 s1 s2 s3 s4 s5 −3 −2 −1 0 quality (e) (f) Figure 8.7: (color online) Figures (a), (b), (c), (d): The differences between the emotion recognition percentages of original and modified utterances. Happy = circle, Angry = filled circle, Sad = square, Neutral = filled square, Other = filled triangle. Figures (e), (f): The differences between the average speech qualities (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad) of original and modified utterances. Speaker 1 = circle, Speaker 2 = filled square. 165 The mean modifications that caused significant (p < 0.05) emotion changes for speaker 1 were m4, m8, m9, m10, m11 (Fig. 8.7a). These results show that speaker 1 was quite robust against the F0 mean modifications. It was only when the F0 mean was changed by±50% significant changes were observed. Increasing F0 mean caused the neutral and angry recognition percentages to drop, and sad and other recognition percentages to increase. Interestingly, adjusting the mean to be in 50-150Hz range caused increase in happy and other responses. Note that in all of these instances the speech quality degraded significantly (Fig. 8.7e). 
For speaker 2 – as seen with speaker 1 – increasing or decreasing F0 mean by 50% caused an increase in sad and other perception percentages (Fig. 8.7c). It was also observed that some of the modifications caused an increase in the neutral and other responses, but not in the happy or angry responses. The statistically significant modifications in this case were m1, m4, m7, m8, m9, m10, m13, and m14. All of these (except m1) caused a significant drop in the speech quality. The effects of F0 range modifications were more prominent than F0 mean. For speaker 1, decreasing the F0 range by more than 50% caused a significant increase in sad responses (r1, r5, r6, r7, r8, r9, r10). The effect of the F0 range on the sad emotion percentages was continuous [72] and it could be easily parametrized (Fig.8.7b). ThedropintheperceivedspeechqualitywaslessseverethanF0mean modifications, suggesting that one should perform range and not mean modifica- tions during the synthesis of emotional speech. The effects of range modifications on speaker 2 were also significant, however not as strong as they were for speaker 1 (Fig. 8.7d). This can be attributed to the lower F0 range of this speaker. The modifications that caused significant emotion perception difference were r4, r5, r6, r7, r8. These modifications increased sad 166 perception, and decreased happy and angry perception. In contrast to speaker 1, some of them (r1, r6) also increased the neutral perception. Interesting results were observed for stylization modifications. For speaker 1 (Fig. 8.7b), only s4 and s5 caused significantly different results. An increase in the sad andother responsesanddropinqualitywasseenforthesecases. Theseresults show that eliminating the small prosodic variations (s1, s2, s3) in the F0 contour shape did not significantly decrease the perception of the original emotions. It was only when the F0 contour in sentence level was fully linearized (as seen in Fig. 8.2) – eliminating any accents and foot patterns [69] – that the percentages of happy and angry emotions started to decrease. In these cases the utterances were mostly perceived as sad or other. This is an important result which has implications for emotion synthesis. As shown in our previous work [17,22], for emotions such as anger and happiness, in additiontoprosody,spectralcharacteristicsalsoplayanimportantrole. Therefore, duringsynthesisoftheseemotionsoneneedstoconcentratemoreontheoverallF0 contour shape, F0 range, and spectral characteristics and need not worry about the small prosodic variations in the F0 contour shape. As we show later, these small prosodic variations were more important for high quality perception than emotion perception. Theargumentsabovewerealsovalidforthespeaker2,forwhichonlysomepar- ticular stylization modifications (s2, s3, s5) caused significant changes (Fig. 8.7d), with minimal degradation in quality (Fig. 8.7f). In these cases increase in the other responses was accompaniedeitherbyincrease in sad or neutral responses. It was particularly interesting to note that both s1 and s4, which caused significantly different F0 contour shapes did not cause any significant emotion changes. That 167 the s4 modification was not significant, while s3 was, was unexpected and it can be attributed to the subjective nature of the listening tests. As expected, the modifications that received high quality responses were the ones that did not cause any significant changes in the emotional content. 
Significant emotion change was in general accompanied by degradation in quality. However, in some instances, especially when the F0 range of speaker 2 was modified, the quality was not affected despite a significant emotion change. In almost all instances, the original emotions were correctly recognized by the majority (i.e., 50% or more) of the listeners. This shows that despite the quality distortions the original emotions were still well perceived. In order to visualize these relations between quality and emotion perception, we define two new variables, percentage and similarity (see Fig. 8.8). The percentage variable represents the percentage of listeners that perceived the same emotion as the original emotion. The similarity variable is the cosine of the angle between two vectors, which has a large value (i.e., close to 1) when the vectors point in the same direction [43]. It is calculated using Eqn. 8.1, where x and y are vectors of size [5x1] showing the fractions of perceived emotions for an original (x) and a modified (y) utterance, respectively. For example, a vector y = [0.5 0.3 0.1 0 0.1] was used for an utterance that was perceived as happy, angry, sad, neutral, and other by 50%, 30%, 10%, 0%, and 10% of the human raters, respectively.

s(x, y) = \frac{x^{T} y}{\|x\| \, \|y\|}    (8.1)

In summary, the effects of F0 range modifications were more significant than F0 mean modifications. Stylization modifications were also effective, but only when performed at large semitone scales. They showed that small prosodic variations in the F0 contour shape were more connected to the quality of speech than to its emotional content. Figure 8.8: (color online) Relation between the average quality, similarity, and percentage parameters. Note that the quality is normalized: 1 = Excellent, 0.8 = Good, 0.6 = Fair, 0.4 = Poor, 0.2 = Bad. Squares are used for the similarity, circles (◦) for the percentage, and (x) for the quality variables. 8.7 Discussion The acoustic parameters should be studied in connection with the human perception of prosodic and paralinguistic features [98,100,113,135]. The results in this paper show that in order to describe the F0 variations occurring in emotional speech, the sentence, speaker, and emotion factors must also be considered. These are the factors that determine how the emotional regions (Sec. 8.4) will be shaped. The sentence (i.e., linguistic content) should be taken into account because – together with speaker and emotion characteristics – the sentence structure (i.e., focus, modality, length [95]) determines how the pitch (and also duration, energy, formant frequencies, and meaning) will be generated. Instead of including the linguistic content as a factor, a different approach may be to minimize its effects. One method to do that is to use nonsense sentences [5]. This method eliminates the semantic effects; however, it may also cause the acoustic parameters (e.g., F0, duration, energy) to be modulated in an unnatural fashion. Therefore, the results may not be easily generalizable to real life utterances. Probably a better parameterization of emotions can be achieved not by restricting the variance in the different features but by restricting the emotion space itself. This may be achieved by defining more homogeneous emotion categories. 
One good example is given by [5], who, in addition to the classic categorical emotion labels, also used activation level differences [55] to describe the emotions. This suggests that in order to better relate the acoustic parameter variation to partic- ular emotions a hybrid labeling scheme combining categorical [46] and attribute descriptions [117] can be utilized. For example, considering the findings showing that valence, activation and intensity dimensions are correlated with the acous- tic features of emotional speech [55,120], an angry utterance can be described as angry, high (low, medium) activation, high (low,medium) valence, high (low, medium) intensity, instead of just angry. Having a better description for emotions can be expected to produce smaller emotional regions. Smaller regions can be expected to overlap less, which in turn will help to better parameterize and differentiate between different emotions in terms of their acoustic features. For example, evaluating the angry speech as high 170 or low activation anger would have created two emotional regions instead of one, which theoretically would have helped to better describe how F0 characteristics relatetotheangryemotionalcontent. Asshowninthispaper, thesignificantover- lap between the regions of emotions labeled using the categorical labels indicates that a hybrid labeling technique is necessary for the future research. Considering the small number of sentences and speakers that were analyzed in this study, our future plan is to perform similar studies on a larger dataset. Also, we plan to perform similar analyses for the duration and energy parame- ters. Increasing the number of different sentences, speakers, emotions and acoustic parameters will provide better information about how the interactions between different factors can be described and parametrized. 8.8 Conclusion Thevariabilityofpitch(andthereforeF0contour)isoneofthebasiccharacteristics of natural human speech. It has been shown that the same text recorded at different times can have very different F0 characteristics. In this study using an analysis by synthesis method we showed the variability that exist in the F0 mean, range, and shape parameters of emotional speech. The results showed that even significant variation in F0 parameters did not mask the original emotion perception. It was observed that F0 modification caused sad, neutral or other emotion perception to increase, and angry or happy perception to decrease. The effectsofF0rangemodificationsonemotionperceptionweremoreprominentthan F0meanmodifications. Also,forF0rangemodifications,thedropintheperceived speechqualitywaslessthanF0meanmodifications. Theseresultssuggestthatone shouldfocusonrangeandnotmeanmodificationsduringthesynthesisofemotional 171 speech. The results were significantly dependent on speaker and original utterance characteristics. InordertomodeltheobservedvariabilityintheF0contouramodelcalledemo- tional regions approach was introduced. The observed emotional regions derived from the data were represented as 2D Gaussian ellipses which showed the limits within which the F0 contour of a given utterance can be modified. In order to model these observed regions Euclidean emotion regions based on F0 statistics (mean, range, std) were proposed. It was shown that the Euclidean regions can be used as reliable estimators to the high quality Gaussian emotional regions. The emotional regions concept can be applied to the other acoustic parame- ters as well. 
If duration and spectral envelope variations are modeled together with energy and F0 variations, it will be possible to build multi-dimensional emotional regions for each emotion, which can then be used in emotion to emotion transformation and synthesis. This is a task for our future research.

Chapter 9
Evaluation of emotion to emotion transformation (ETET) system

In this chapter the evaluation results of the multi-level emotion to emotion transformation system are presented.

9.1 Modification parameters

The spectral modifications were performed using the trained GMMs. Refer to chapter 2 and appendix B for the details. The modification parameters that were used for modifying the voiced or unvoiced regions are shown in Table 9.1. These modification factors were selected based on the results described in chapter 7. Note that in chapter 7 the recognition for synthesis (RFS) approach was used to select modification parameters for the angry and happy emotions only. It was not applied to the sad emotion because previous results showed that the parameters needed to successfully synthesize sad speech can be determined empirically. Thus, the modification factors for the sad emotion were empirically selected as shown in Table 9.1.

The factors displayed in Table 9.1 were multiplied by the corresponding input emotion parameters to modify the input utterance. Note that F0 modifications were applied only to voiced regions.

          Fm    Fr    Vd    Ve    Ud    Ue
neu2ang   1.0   1.0   0.7   2.0   1.4   0.5
neu2hap   1.0   1.5   1.2   0.5   0.7   1.0
neu2sad   1.2   0.2   1.4   0.5   1.4   1.0

Table 9.1: The modification factor values that were used for modifying the voiced or unvoiced region prosody characteristics. (Fm = F0 mean, Fr = F0 range, Vd = Voiced duration, Ve = Voiced energy, Ud = Unvoiced duration, Ue = Unvoiced energy.)

The prosody parameter values for POS tags were generated automatically using the Gaussian probability distribution functions that were trained to model the differences between emotion pairs. As explained in chapter 6, in order to eliminate large deviations from the original values, the generated values were required to be within a specific proximity (percentage) of the original values. Table 9.2 shows these percentages for the different prosody parameters. The values in the table should be interpreted as follows. If x is the displayed number, then the output parameter was required to be inside the [(1-x)O, (1+x)O] region, where O represents the parameter value of the input emotion. For example, if the input utterance's POS tag energy maximum was 100, then the estimated value was required to vary by at most ±50%; in other words, it was required to be in the [50, 150] interval. The parameter generation was repeated until this condition was satisfied. In order to eliminate an infinite loop condition, if the estimated value was still outside the required interval after 300 repetitions, it was set equal to one of the boundaries, i.e., either to the lower boundary (if the estimated value was smaller than it) or to the upper boundary (if the estimated value was larger than it). (A sketch of this bounded generation procedure is given after Table 9.2.)

The modification conditions that were tested are displayed in Table 9.3.

          Emax   Fr    Fm    Dur
neu2ang   0.5    1.0   0.0   0.3
neu2hap   0.5    1.0   0.0   0.3
neu2sad   0.5    1.0   0.2   0.3

Table 9.2: The percentage values used to restrict the possible output values generated for POS tags. (Fm = F0 mean, Fr = F0 range, Emax = Energy maximum, Dur = Duration.)
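A minimal sketch of the bounded generation procedure described above (illustrative only: the trained Gaussians, the parameter being generated, and whether the Gaussian models the target value directly or its difference from the input are abstracted away here):

    import random

    def generate_bounded(mu, sigma, orig_value, x, max_tries=300):
        # Draw a POS-tag prosody value from N(mu, sigma), constrained to the
        # [(1 - x) * orig_value, (1 + x) * orig_value] interval described in the text.
        lo, hi = (1.0 - x) * orig_value, (1.0 + x) * orig_value
        value = random.gauss(mu, sigma)
        for _ in range(max_tries - 1):
            if lo <= value <= hi:
                return value
            value = random.gauss(mu, sigma)
        # After max_tries draws, clip to the nearest boundary (the fallback used in the text).
        return min(max(value, lo), hi)

    # Example: POS-tag energy maximum of 100 with x = 0.5 must stay in [50, 150].
    print(generate_bounded(mu=120.0, sigma=40.0, orig_value=100.0, x=0.5))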
Symbol         Modification
x1             Spectral conversion
pos1           POS tag based modification
uv1            Voiced/unvoiced region based modification
x1+pos1        x1 and pos1 applied together
x1+uv1         x1 and uv1 applied together
pos1+uv1       pos1 and uv1 applied together
x1+pos1+uv1    x1, pos1, and uv1 applied together

Table 9.3: Tested modifications.

9.2 Original utterances

Six utterances were randomly selected from the natural neutral dataset to evaluate the ETET system. These utterances were selected from the JoyNew dataset. The speaker that recorded these sentences was a professional actress who held a degree from the USC theater school. She was in her mid-twenties at the time of recording. A detailed analysis of this database can be found in [147]. (Note that the spectral/POS tag/voiced-unvoiced region models were specifically trained/adapted for this speaker. If a different speaker were used, the modification parameter values might need to be modified accordingly.)

The neutral utterances that were used for the evaluation of the ETET system were the following.

• nfjoyNew112 1.wav: See how funny our puppy looks in the photo.

• nfjoyNew192.wav: You always come up with pathological examples to fancy the audience.

• nfjoyNew195.wav: Leave off the marshmallows and look what you have done with the vanilla pie.

• nfjoyNew213.wav: Summertime supper outside is a natural thing.

• nfjoyNew228.wav: The fifth jar contains big juicy peaches.

• nfjoyNew245.wav: Keep the desserts simple fruit does nicely.

The prosody and spectral feature characteristics of these natural neutral sentences are shown in appendix C. The spectrogram and F0 contour plots (shown in appendix C) were generated using the Wavesurfer software [124].

9.3 Listening test structure

A web based interface (a web page prepared using Perl CGI) showing a table with 126 rows and 4 columns was used for the listening tests. Four speech files were shown on each row (see Fig. 9.1). The first file (in a row) was defined as a reference file, and it was indicated that this utterance had neutral emotion and a speech quality of 5 (= excellent). The other three files were the modified versions (using the same type of modification (Table 9.3)) of the reference file. These files were synthesized using the parameters selected (as explained in the previous section) for the happy, angry, and sad emotions.

In other words, in each row the first file was the original unmodified file, and the other three files were the resynthesized files obtained using the neutral-to-angry (neu2ang), neutral-to-happy (neu2hap), and neutral-to-sad (neu2sad) modification techniques. The order of the resynthesized files was randomly determined and it was different for different raters.

Figure 9.1: Web based listening test structure. The first file is the reference file. It is defined as neutral with a speech quality rating of 5. The other files are the resynthesized files (one for each of the happy, angry, or sad emotions) presented in a different random order for each rater. The raters were required to select the emotion and quality for these files.

In a similar manner, all of the remaining utterances were presented on the same web page. The order of the utterances in each row and each column was randomly determined and it was different for every evaluator and every sentence. Listeners were given 5 emotion options but were allowed to select only one of them. These options were Neutral, Angry, Happy, Sad, and Other. Speech quality was evaluated on a 5 point scale, with 5, 4, 3, 2, and 1 representing excellent, good, fair, poor, and bad speech quality, respectively.

Figure 9.2: Percentages of other responses for each modification and target emotion. Note that when the target emotion was happy, many raters selected the other option.
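The original test page was generated with Perl CGI; purely to illustrate the per-rater randomization described above (the file naming and row grouping are hypothetical, and this is not the original code), a small sketch:

    import random

    utterances = ["nfjoyNew112_1", "nfjoyNew192", "nfjoyNew195",
                  "nfjoyNew213", "nfjoyNew228", "nfjoyNew245"]
    modifications = ["x1", "pos1", "uv1", "x1+pos1", "x1+uv1", "pos1+uv1", "x1+pos1+uv1"]
    targets = ["neu2ang", "neu2hap", "neu2sad"]

    def build_rows(rater_seed):
        # One row per (utterance, modification): the neutral reference first,
        # then the three resynthesized versions in a random order per rater.
        rng = random.Random(rater_seed)
        rows = []
        for utt in utterances:
            for mod in modifications:
                modified = [f"{utt}_{mod}_{t}.wav" for t in targets]
                rng.shuffle(modified)
                rows.append([f"{utt}.wav"] + modified)
        rng.shuffle(rows)   # row order also differs between raters
        return rows

    print(build_rows(rater_seed=7)[0])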
A total of 21 naive raters (11 female, 10 male) participated in the test. They were not given any detailed information about the nature of the test, except the fact that they needed to listen to some utterances and then select the emotions they perceived. All of the subjects had advanced English language skills and they were mostly engineering graduate students (average age was 27.76). There were 7 native English speakers and 14 non-native speakers. Eight of the raters used loudspeakers, while 13 of them used headphones to listen to the utterances. The average test duration was 40 minutes.

9.4 Listening test results

The test results are shown in Tables 9.4, 9.5, and 9.6. Different characteristics of the results presented in the tables are also plotted in the figures shown below. Note that the results presented in this chapter represent the average results over all 6 sentences. The results for individual sentences are presented in appendix D.

Emotion recognition percentages (%):
Mod.          Neutral  Angry   Happy   Sad     Other
x1            71.54    19.51   4.88    1.63    2.45
pos1          54.76    30.16   8.73    0       6.35
uv1           21.60    68.80   4.00    0.80    4.80
x1+pos1       28.80    54.40   6.40    1.60    8.80
x1+uv1        22.22    57.14   5.56    3.97    11.11
pos1+uv1      33.06    52.42   4.84    0.81    8.87
x1+pos1+uv1   30.40    49.60   7.20    1.60    11.20

Mean quality ratings:
Mod.          Neutral  Angry   Happy   Sad     Other
x1            3.89     3.71    4.00    3.50    4.33
pos1          4.09     4.00    4.09    –       3.75
uv1           3.44     3.84    3.60    3.00    4.33
x1+pos1       3.19     3.26    3.50    2.50    3.30
x1+uv1        3.07     3.10    3.50    2.80    2.71
pos1+uv1      3.17     3.26    3.17    4.00    2.36
x1+pos1+uv1   2.50     2.57    2.56    2.00    1.71

Table 9.4: Neutral to angry conversion results for different modification conditions. Emotion recognition percentages and mean quality ratings are displayed.

Emotion recognition percentages (%):
Mod.          Neutral  Angry   Happy   Sad     Other
x1            73.60    16.80   4.80    1.60    3.20
pos1          64.80    12.80   6.40    6.40    9.60
uv1           74.59    4.10    9.84    7.38    4.10
x1+pos1       52.80    9.60    7.20    17.60   12.80
x1+uv1        69.84    11.11   10.32   5.56    3.17
pos1+uv1      45.60    10.40   14.40   12.80   16.80
x1+pos1+uv1   48.80    8.80    16.80   5.60    20.00

Mean quality ratings:
Mod.          Neutral  Angry   Happy   Sad     Other
x1            3.70     3.33    3.83    3.50    4.50
pos1          4.00     3.75    4.13    3.63    3.58
uv1           3.93     3.20    3.58    3.89    4.00
x1+pos1       3.17     2.75    2.67    2.82    2.31
x1+uv1        3.05     2.79    2.92    2.86    2.75
pos1+uv1      2.82     3.08    2.89    2.69    2.33
x1+pos1+uv1   2.21     2.55    2.90    2.57    1.60

Table 9.5: Neutral to happy conversion results for different modification conditions. Emotion recognition percentages and mean quality ratings are displayed.

Emotion recognition percentages (%):
Mod.          Neutral  Angry   Happy   Sad     Other
x1            80.00    12.00   3.20    1.60    3.20
pos1          30.40    12.80   14.40   32.80   9.60
uv1           8.87     0       0.81    82.26   8.06
x1+pos1       26.40    11.20   12.00   36.80   13.60
x1+uv1        7.14     0.79    0.79    77.78   13.49
pos1+uv1      5.56     0.79    4.76    80.95   7.94
x1+pos1+uv1   6.35     1.59    4.76    74.60   12.70

Mean quality ratings:
Mod.          Neutral  Angry   Happy   Sad     Other
x1            3.12     3.33    3.67    3.00    3.75
pos1          3.92     3.69    3.67    3.49    3.42
uv1           1.91     –       2.00    2.98    2.30
x1+pos1       2.48     2.29    3.13    2.30    2.41
x1+uv1        1.33     2.00    2.00    2.21    1.41
pos1+uv1      1.29     3.00    3.50    2.60    2.00
x1+pos1+uv1   2.00     1.50    1.60    1.80    1.44

Table 9.6: Neutral to sad conversion results for different modification conditions. Emotion recognition percentages and mean quality ratings are displayed.

9.5 Discussion

Using the ETET system, neutral utterances were transformed into angry, happy, or sad emotions. These utterances were evaluated by human raters. The evaluation results were presented in the previous section. In this section a discussion of the evaluation results is given.

9.5.1 Neutral to angry transformation

Neutral to angry conversion results are shown in Table 9.4.
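For reference, the entries in Tables 9.4–9.6 are simple per-modification tallies over the rater responses; a minimal sketch of that computation, assuming a hypothetical list of (modification, perceived emotion, quality) responses for one target emotion:

    from collections import defaultdict

    # Hypothetical raw responses: (modification, perceived emotion, quality 1-5).
    responses = [("uv1", "Angry", 4), ("uv1", "Angry", 3),
                 ("uv1", "Neutral", 4), ("x1", "Neutral", 4)]

    counts = defaultdict(lambda: defaultdict(int))
    qualities = defaultdict(lambda: defaultdict(list))
    for mod, emo, q in responses:
        counts[mod][emo] += 1
        qualities[mod][emo].append(q)

    for mod, emo_counts in counts.items():
        total = sum(emo_counts.values())
        for emo, n in emo_counts.items():
            pct = 100.0 * n / total                 # recognition percentage
            mean_q = sum(qualities[mod][emo]) / n   # mean quality rating
            print(f"{mod:12s} {emo:8s} {pct:6.2f}%  quality {mean_q:.2f}")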
The results (summarized in Fig. 9.12) show that for most of the modifications the angry emotion recognition percentages were larger than the chance rate (which was 20%). The most successful modification was uv1 (see Table 9.3 and Fig. 9.12), which produced angry utterances that were recognized with 68.80% success. Also note that the average speech quality in this case was 3.84, which corresponds to good quality.

The results show that when two or more modifications were combined, there was an increase in the other responses. For these cases a drop in speech quality is also observed. (Note that later in this chapter we discuss the possible reasons for the quality degradation and propose a method that can be used to improve it.)

Figure 9.3: Emotion evaluation results for different modifications: (a) neutral to angry conversion, (b) neutral to happy conversion, (c) neutral to sad conversion. The target emotion responses are displayed together with the other responses (the best conditions reach 86.1% for angry+other and 89.1% for sad+other).

The results show the important effect of spectral modifications. We note that even spectral modifications (x1) by themselves caused a significant increase in the angry responses. Also note that pos1 modifications were perceived as 30.16% angry, but when spectral characteristics were also modified (x1+pos1) the recognition rate increased to 54.40%.

Figure 9.4: The percentages of matching emotional responses for each transformation condition. For each one of the modifications, the first bar shows the percentage of angry responses for neu2ang, the second bar shows the percentage of happy responses for neu2hap, and the third bar shows the percentage of sad responses for neu2sad.

Figure 9.5: Confusion between different emotions for modification x1.

Interestingly, note that uv1 performed better than x1+uv1. It may be the case that combining both of these modifications created new emotional nuances which made the listeners select the other option more often. For example, we note that for uv1 the percentage of other responses was 4.80%, while for x1+uv1 it was 11.11%. This indicates that combining the two modifications caused confusion among listeners.

Figure 9.6: Confusion between different emotions for modification pos1.

Figure 9.7: Confusion between different emotions for modification uv1.

Also note that a similar confusion was observed when all three modifications were combined (x1+pos1+uv1).
Figure 9.8: Confusion between different emotions for modification x1+pos1.

Figure 9.9: Confusion between different emotions for modification x1+uv1.

This confusion may be due to the fact that the different modifications were performed independently of each other (i.e., in a serial configuration). Therefore, in some cases a following modification may have altered or masked the effects of a previous modification. Also note that having the modifications implemented in a serial manner caused some quality degradation, which also had an effect on the emotion perception. These are some of the issues that we plan to investigate in future studies.

Figure 9.10: Confusion between different emotions for modification pos1+uv1.

9.5.2 Neutral to sad transformation

The neutral to sad transformation results are shown in Table 9.6 and Fig. 9.13. The results show that the sad emotion was recognized at high rates (around 80%), which were significantly different from chance (20%). Two of the most successful modifications were uv1 and pos1+uv1. Note that both of these cases correspond to prosody modifications only.

Figure 9.11: Confusion between different emotions for modification x1+pos1+uv1.

Figure 9.12: Listening test results for neutral-to-angry (neu2ang) conversion. (a) Angry emotion recognition percentages, (b) average speech ratings for utterances labeled as angry. (20% chance level, 3 = fair speech quality.)

Figure 9.13: Listening test results for neutral-to-sad (neu2sad) conversion. (a) Sad emotion recognition percentages, (b) average speech ratings for utterances labeled as sad. (20% chance level, 3 = fair speech quality.)

It is observed that performing spectral modifications did not have a big effect on the sad emotion recognition. Note for instance that the sad emotion recognition percentage for x1 was the lowest (1.60%). Also note that combining x1 and pos1 produced results similar to pos1 alone. (Recall that, in contrast, for the angry emotion, combining x1 and pos1 caused a significant increase in the angry responses.)

Similar to the neutral to angry conversion results, we note that spectral modifications caused the number of other responses to increase. Also note that when they were applied together with the prosody modifications, a drop in speech quality was observed. However, despite the low quality, the listeners were still able to correctly perceive the sad emotion. This fact is also underlined in the listeners' comments, which are presented in section 9.5.4 of this chapter.
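The recognition rates above are compared against the 20% chance level. The thesis does not state which statistical test was used; as one hedged illustration with hypothetical counts, the probability of reaching a given number of matching responses purely by guessing can be computed from the binomial tail:

    from math import comb

    def p_at_least(k, n, p=0.2):
        # P(X >= k) for X ~ Binomial(n, p): chance of k or more matching responses
        # out of n ratings if listeners were guessing at the 20% chance level.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    # Hypothetical example: 100 "sad" responses out of 125 ratings.
    print(p_at_least(k=100, n=125))   # vanishingly small, i.e. far above chance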
9.5.3 Neutral to happy transformation

Neutral to happy transformation results are shown in Table 9.5 and Fig. 9.14. The results show that the best performance for the neutral to happy transformation was achieved when all 3 modifications were combined (x1+pos1+uv1). However, note that even in this case the recognition performance was below chance.

Figure 9.14: Listening test results for neutral-to-happy (neu2hap) conversion. (a) Happy emotion recognition percentages, (b) average speech ratings for utterances labeled as happy. (20% chance level, 3 = fair speech quality.)

It is interesting, however, to observe the significant number of other responses. (In fact, the number of other responses is the greatest when the neutral to happy conversion is applied, as shown in Fig. 9.3 and Fig. 9.2.) This is an indication that the listeners were confused (maybe because of the difficulty of recognizing happiness from speech only, or because of the degraded speech quality) and, in addition to selecting the happy response, they also selected the other response. Also, it may have been the case that they felt another positive emotion (such as excitement, content, surprise, etc.), but not happiness, and that was the reason for them selecting the other option. (In fact, in the evaluation of the natural happy utterances the listeners also confused the happy and other options. This is shown in Table 9.7.) These are some issues that we will focus on in our future research. To address these issues the resynthesized utterances will be evaluated in terms of valence and activation dimensions. Additionally, evaluation tests where multiple positive emotion categories are presented will also be conducted.

If we assume that other responses represent a positive (i.e., high valence) emotion category, then happy and other responses can be combined. The results for this combination are shown in Fig. 9.3b. Note that in this case the system performs better than the chance level. In fact, the 36.80% (happy+other for x1+pos1+uv1) recognition accuracy is comparable to the 44.2% happy emotion synthesis performance of the concatenative speech synthesizer described in chapter 4. (Recall that in chapter 4, when the emotions were evaluated, listeners were given 4 choices (angry, happy, sad, neutral). Also note that the speaker used for ETET system training and testing is different from the speaker in chapter 4.)

Another interesting observation is the confusion between the angry and happy emotions. This was a characteristic of this particular speaker, as shown in [147]. A similar confusion was observed in the neutral to happy resynthesis results, i.e., many listeners also selected the angry option in addition to the happy or other options. (For completeness, the evaluation results of the natural emotional speech database that was used in the training/testing of the ETET system are shown in Table 9.7. This table is adapted from [147].) Also, as can be seen from the comparisons of the best modification results for individual utterances, the happy conversion results were sentence dependent.

          Neutral  Angry  Happy  Sad  Other
Neutral   74       8      1      14   3
Angry     3        82     2      1    12
Happy     7        12     56     6    19
Sad       20       5      1      61   13

Table 9.7: Confusion matrix of the subjective human evaluations.
Displayed are the recognition percentages. For this test, 25 utterances were randomly selected from each emotion category. These utterances were rated by 4 naive native English speakers. (Table adapted from [147].)

This sentence dependence means that inclusion of the linguistic factors in the ETET system can improve the results for happy emotion resynthesis. This is clearly seen from Fig. 9.15.

Figure 9.15: Happy+Other emotion recognition percentages for individual sentences (the 44.2% and 56% reference levels are marked).

In the figure, for each sentence the best performing modification was selected. For sentences 1, 2, 3, 4, 5, and 6 they were x1+pos1+uv1, x1+pos1, x1+pos1+uv1, pos1+uv1, pos1+uv1, and x1+uv1, respectively. The chance rate was 20%, and the concatenative happy synthesis performance reported in chapter 4 was 44.2%. Note, however, that in that case the speaker who recorded the emotional data was a different female speaker than the one used for training/testing of the ETET system. The recognition of natural happy utterances reported in Table 9.7 was 56%; this was the same speaker used in the ETET system.

In general, the synthesis of the happy emotion is the most difficult. Similar results were reported in other studies as well (see chapter 3). Also note that, in general, the recognition and perception of the happy emotion can be better done by using visual cues [115,116]. In other words, the recognition of happiness from speech only is more challenging than the recognition of happiness from image/video only data. This indicates that, in order to better assess the neutral to happy transformation performance of the ETET system, a multi-modal evaluation system (combining speech and video) may be more appropriate. This is one of the issues that we plan to address in our future research.

9.5.4 Some listener comments

After the listening test the listeners were allowed to leave their comments. These are some examples of their comments.

"Types of emotions may be increased. Ex: Excited or so..."

"I felt that the low quality of the speech hinder my recognition of emotion."

"Some of the sad (my interpretation) files that have lower quality are more emotionally expressive (if this makes any sense). Perhaps it is the artifacts that give the speech the halting choking quality of sadness."

"Often times, the manipulated utterances make the speaker sound like she recently suffered from a stroke. However, these are the files that have (at least in my opinion) higher fidelity."

"Need a better speech quality of test waveforms, at least no trill-like sounds."

9.5.5 Speech quality issues: How to improve the quality

The listeners complained about the bad speech quality. It may be the case that insufficient speech quality obstructed the perception of some modifications/emotions. Therefore, there is a need to modify the system so that a better quality is achieved.

In order to improve the speech quality of the ETET system, the following modifications are planned for the future. More training data will be collected. With more data we will be able to train better spectral models, which can be expected to produce better results. Having more emotional data for training will also be beneficial for training better models for POS tags and voiced/unvoiced region modifications. For example, for POS tags, bi-gram or tri-gram probabilistic models may be constructed.
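The bi-gram idea mentioned above could, for example, condition a POS tag's prosody on its left neighbour; a minimal sketch of what such a model could look like (hypothetical data, not the thesis implementation):

    from collections import defaultdict
    from statistics import mean, stdev

    def fit_bigram_prosody(samples):
        # Per (previous tag, tag) pair, fit a simple Gaussian (mean, std) over an observed
        # prosody value (here, the tag's F0 mean), conditioning prosody on left context.
        buckets = defaultdict(list)
        for tags, values in samples:          # one utterance: POS tags and per-tag F0 means
            for pair, v in zip(zip(tags[:-1], tags[1:]), values[1:]):
                buckets[pair].append(v)
        return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0) for k, v in buckets.items()}

    # Hypothetical utterances: POS tags with per-tag F0 means in Hz.
    model = fit_bigram_prosody([(["DT", "NN", "VB"], [180.0, 210.0, 195.0]),
                                (["DT", "NN", "JJ"], [175.0, 220.0, 205.0])])
    print(model[("DT", "NN")])   # Gaussian for NN when preceded by DT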
Currently, the spectral modifications were applied to all phonemes. In the future, spectral modifications may be restricted to only some phoneme classes (e.g., vowels). This will improve the quality. In fact, note that the quality degraded mostly after the spectral modifications. In the current design, the spectral modifications were performed on all phonemes because it was observed (by the author, MB) that modifying only vowels was not sufficient to generate the targeted emotional effect. However, since the evaluation experiments showed that modifying all phonemes does more harm than good, in the future only the spectral characteristics of particular phonemes may be transformed.

Additionally, another way to obtain better quality resynthesis results may be to focus on prominent words, phrases, or regions in an utterance. In that way the amount of modification performed can be minimized (limited), which will improve the speech quality. However, since the main purpose is to generate emotional speech, in some cases limited modifications may not be sufficient to produce the targeted emotional effect. In other words, there is a fine balance (i.e., trade-off) between the amount of modification necessary for good emotion and good quality synthesis. Our future research will focus on investigating how a good quality emotion can be generated without degrading (or by minimally degrading) the natural speech quality.

In the current design of the system, a speech file was resynthesized at every stage of the system, and the next modifications were applied to that resynthesized file. (In other words, a serial modification configuration was used.) Although this method was suitable for tracking (i.e., evaluating) the output at every stage, it caused degradation in speech quality.

In order to improve the quality, all of the modules can be integrated together. (In other words, a parallel configuration can be used.) For instance, the POS tag and voiced/unvoiced region modifications can easily be combined into one module (since both of them were used for prosody modifications). Going one step further, the spectral and prosody modifications can also be combined by performing the prosody modifications directly on the residual (see Fig. 9.16). Having done this, only one output will be generated by the system, and this output can be expected to be of higher quality. (However, this parallel system configuration also needs to be verified by listening tests.)

As shown in Fig. 9.16, in the current design the prosody modifications are applied after a spectrally modified speech signal is resynthesized. In the future, the prosody modifications may be performed on the residual and only then may the residual be filtered. This may produce better quality results. However, it may also cause some quality artifacts, because in order to match the durations (of input and target) LPC coefficients would have to be replicated or deleted. Listening tests will show which method will perform better.

Figure 9.16: Block diagrams (with elements P_s(z), A_s(z), the residual e_s[n], the synthesis filter 1/A_t(z), and TD-PSOLA). Top figure: the current system; bottom figure: for future implementation. It can be expected that performing the prosody modifications directly on the residual signal (bottom figure) will produce better quality results.

Chapter 10
Future directions

The results presented in the previous chapters show the complex interaction between emotions and speech acoustic parameters.
Of these, especially the synthesis of happy speech appears to be the most challenging. In chapter 4 we showed that when happy units were combined with prosody information extracted from natural happy utterances, good quality happy speech synthesis was possible. However, even in these cases the best happy speech synthesis performance was 44.2%. It was also interesting to note that even the original happy utterances were recognized with only 67.3% accuracy. Also note that for the speaker analyzed in [147] the natural happy speech recognition was 56%. These numbers show that recognition of happiness from isolated utterances was not easy. In contrast, when expressed at the utterance level, sadness and anger were easier to recognize. This fact was also reflected in the synthesis results.

In order to improve the system performance for happy speech (and also for other emotions), in the future we are planning to experiment with new techniques to represent, evaluate, and synthesize the emotions. These techniques are outlined next.

10.1 Emotional speech representation and evaluation

In real life emotions are evaluated in context. Perception of the information present in speech signals is dependent on linguistic, expressive, organic and perspectival quality factors (Modulation Theory [135]). The first three of these factors have significant effects on the acoustic features of speech and therefore should be accounted for during the analysis, synthesis and recognition of emotional information in speech.

One of the future objectives of emotional speech research should be investigating the best way to evaluate emotions. This is especially needed for the happy emotion. It can be argued that when happy synthesis results are evaluated in context, the results will improve. Also note that in all of the experiments described in this document, semantically neutral sentences were used to express the emotions. Using such sentences did not have any significant effect on the angry or sad emotions, but it may have been one of the reasons why the happy emotion recognition percentages were low. Therefore, it is necessary to repeat the same experiment using semantically happy sentences to express the happy emotion.

To summarize, one of our main future objectives will be the investigation of more effective emotional speech evaluation techniques. Following are the issues that need to be addressed.

• Evaluation in context.

• Comparison of in-context and out-of-context evaluations.

• Usage of sentences that are suitable to be expressed in the happy emotion vs. semantically neutral sentences. One may expect that the results will improve when semantically happy sentences are used.

• Usage of multiple positive emotion categories. Currently, only the happy label is used as a positive emotion during the evaluations. It may be the case that some emotion modifications increased the valence (i.e., made the listener "feel" more positive), but not to the level where utterances would be labeled as happy. Therefore, inclusion of additional positive labels may be beneficial to better track the effects of spectrum and prosody modifications.

• Evaluation in terms of Valence, Activation and Dominance (VAD) dimensions.

• Evaluation of the emotional content of speech in a multi-modal environment which comprises speech, face and body. In daily life, we process all of the available information before deciding how the speaker feels. It will be interesting to assess the ETET system performance in more realistic settings.
For example, animated agents with the appropriate facial and body gestures can be used together with the expressive speech. Having such advanced techniques for the evaluation of the emotional quality of speech may be beneficial to better assess the effects of neutral to happy transformations.

In this thesis we worked only with acted emotional speech which was collected from trained actors. For future research, it will be interesting to investigate the differences between acted and spontaneous emotions. The new emotional database IEMOCAP [25], collected in the Speech Analysis and Interpretation Lab (SAIL) at the University of Southern California (USC), will be useful for that purpose. This database includes both acted and spontaneous speech data collected from the same actors.

Based on the experiments presented in this paper, we suggest the following improvements for systems analyzing the variations of acoustic parameter characteristics in emotional speech.

Use of more descriptive linguistic labels: A hybrid labeling scheme combining categorical and attribute descriptions can be utilized. For example, considering the findings showing that valence, activation and intensity dimensions are correlated with the acoustic features of emotional speech [55,120], an angry utterance can be described as angry, high (low, medium) activation, high (low, medium) valence, high (low, medium) intensity, instead of just angry. As a recent example, this technique was used by [5] to study F0 variations in emotional speech.

Clear definitions of the labeling specifications: Each of the chosen emotional labels should be described in detail in terms of what type of emotions they represent. For example, emotional adjectives such as annoyed, hostile, impatient, intolerant, nervous, etc. can all be used to describe the angry emotion. This is required because, due to the subjective nature of emotions, identical labels used by different research studies may correspond to very different emotions.

Careful design of sentence structure for corpus collection: Concepts such as constituency, grammatical relations, and subcategorization and dependencies [66] can be employed to restrict the "text space". The sentences can be chosen to have similar syntax and grammar, for example. In addition, the lexicon can also be restricted, because different words may have different effects, as suggested by "The dictionary of affect" [145]. Note that imposing such restrictions on text may be difficult to generalize, but it will be helpful for modeling the acoustical correlates of emotional speech. Once successful parametrization is achieved, the models can be adapted to more general datasets.

The acoustical parameter values are both emotion and sentence related. Emotions are expressed differently in different sentences. Thus these factors should be analyzed simultaneously.

These suggestions are in line with the four main issues, "the scope", "naturalness", "context of the content" and "the kinds of descriptor it is appropriate to use", emphasized by [42] for more systematic database construction and analysis. They also support the idea that acoustic parameters should be studied in connection with the human perception of prosodic and paralinguistic features [113].

In summary, perception of the information present in speech signals is dependent on linguistic, expressive, organic and perspectival quality factors (Modulation Theory [135]).
The first three of these factors have significant effects on the acoustic features of speech and therefore should be accounted for during the analysis of emotional information in speech.

The Facial Action Coding System (FACS) developed by Ekman, Friesen and their colleagues [46,47] consists of rules for reading and interpreting facial emotions. It is desirable to have a similar system for speech. However, as seen in chapter 8, the emotional labels used for facial expressions are too general to represent all possible variations in speech. For example, they are not adequate to represent the different levels of activation, valence, and intensity that may be present in emotional speech. Therefore, more expressive emotional labels, combining dimensionality approaches [120] and Darwinian approaches [46,47], can be developed. Our hope is that there will be more research on defining expressive labels particular to emotional speech.

10.2 Emotional speech synthesis

In addition to better representing and evaluating the emotional content of the synthesized utterances, in the future we are also planning to improve the speech synthesis techniques. Some of the possible new additions to the ETET system can be listed as follows.

• Syllable level and phrase level modifications. The new database (IEMOCAP) will be analyzed with the purpose of formulating models of prosody at the syllable and phrase level. In addition to the existing models, prosody characteristics will also be modified based on these models.

• Improved spectral conversion training. The models for conversion of spectral characteristics will be trained using the new data. Considering the fact that the size of the IEMOCAP database [25] is considerably larger than the current dataset, we can expect better models to be trained. This will improve the system performance.

• Design and implementation of modification techniques specific to the input utterance's intonation contour. Having a larger database will also be beneficial for creating better models for intonation contours. Utilizing such models can be expected to improve the synthesis performance.

• Speech quality improvements. As discussed in chapter 9, section 9.5.5, the speech quality can be improved by concentrating on the prominent words and regions and by combining different modifications together. For example, POS tag and voiced/unvoiced modifications can be combined and applied directly on the residual of the input utterance. After the residual prosody characteristics are modified, it can be filtered using the LPC filter and a new utterance can be synthesized.

• Synthesis of new emotional categories can also be tried. For example, emotions such as surprise, disgust, and frustration can also be synthesized. It will be interesting to observe how the system will perform for these emotions.

The field of emotional speech synthesis is a challenging new research area. We believe that the ideas, results, and discussions presented in this study will be beneficial for improving the rapidly developing and growing research on emotions in speech.

Chapter 11
Conclusion

As eloquently stated by Picard et al. [99], "Emotional intelligence consists of the ability to recognize, express, and have emotions, coupled with the ability to regulate these emotions, harness them for constructive purposes, and skillfully handle the emotions of others." Working with emotions is highly challenging. In this thesis we attacked the problem of generating emotional speech from the perspective of resynthesis.
We showed the effect of spectral and prosody modifications on the emotional content, and proposed a multi-level emotion to emotion transformation (ETET) system that can be used to modify the emotional content of utterances. The results showed that emotions can be successfully transformed from one to another. Overall, this work provides important insight into how the emotion in speech can be parametrized, synthesized, and evaluated. We believe that this work will be a helpful resource for all emotion researchers.

Chapter 12
Publications

12.1 Book Chapters

M. Bulut, S. Narayanan, and L. Johnson, Text to speech synthesis: New paradigms and advances. Prentice Hall, 2004, ch. Synthesizing expressive speech: Overview, challenges and open questions, pp. 175–198.

C. Busso, M. Bulut, and S. Narayanan, The role of prosody in the expression of emotions in English and French. Peter Lang, 2008, ch. Expressive speech prosody: Analysis and Applications, in preparation (to appear in 2008).

12.2 Journals

M. Bulut and S. Narayanan, "Analysis of effects of F0 modifications on emotional speech," Journal of the Acoustical Society of America, 2008, accepted.

M. Bulut, S. Lee, and S. Narayanan, "Prosody of part of speech tags in emotional speech: Statistical approach for analysis and synthesis," in preparation.

C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, 2008, in review.

Z. Deng, U. Neumann, J. Lewis, T.-Y. Kim, M. Bulut, and S. Narayanan, "Expressive facial animation synthesis by learning speech co-articulation and expression spaces," IEEE Transactions on Visualization and Computer Graphics (TVCG), vol. 12, no. 6, pp. 1523–1534, Nov/Dec 2006.

12.3 Conferences

M. Bulut, S. Lee, and S. Narayanan, "Recognition for Synthesis: Automatic parameter selection for resynthesis of emotional speech from neutral speech," in ICASSP, Las Vegas, Nevada, 2008.

M. Bulut, S. Lee, and S. Narayanan, "Analysis of emotional speech prosody in terms of part of speech tags," in Interspeech, Antwerp, Belgium, August 2007.

M. Bulut, S. Lee, and S. Narayanan, "A statistical approach for modeling prosody features using POS tags for emotional speech synthesis," in ICASSP, Honolulu, Hawaii, April 2007.

M. Bulut, C. Busso, S. Yildirim, A. Kazemzadeh, C. M. Lee, S. Lee, and S. Narayanan, "Investigating the role of phoneme-level modifications in emotional speech resynthesis," in Eurospeech, Interspeech, Lisbon, Portugal, September 2005.

M. Bulut, S. Narayanan, and A. K. Syrdal, "Expressive speech synthesis using a concatenative synthesizer," in ICSLP, Denver, CO, September 2002.

S. Narayanan, P. G. Georgiou, A. Sethy, D. Wang, M. Bulut, S. Sundaram, E. Ettalaie, S. Ananthakrishnan, H. Franco, K. Precoda, D. Vergyri, J. Zheng, W. Wang, R. R. Gadde, M. Graciarena, V. Abrash, M. Frandsen, and C. Richey, "Speech recognition engineering issues in speech to speech translation system design for low resource languages and domains," in ICASSP, Toulouse, France, May 2006.

Z. Deng, M. Bulut, U. Neumann, and S. Narayanan, "Automatic dynamic expression synthesis for speech animation," in IEEE 17th International Conference on Computer Animation and Social Agents (CASA 2004), Geneva, Switzerland, July 2004.

S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan, "An acoustic study of emotions expressed in speech," in ICSLP, Jeju, Korea, October 2004.

C. M. Lee, S. Yildirim, M.
Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan, “Emotion recognition based on phoneme classes,” in ICSLP, Jeju, Korea, October 2004. R. Tsuzuki, H. Zen, K. Tokuda, T. Kitanura, M. Bulut, and S. Narayanan, “Constructing emotional speech synthesizers with limited speech database,” in ICSLP, Jeju, Korea, October 2004. C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, “Analysis of emotion recognition using facial expressions, speech and multimodal information,” in International Conference on Multimodal Interfaces (ICMI’04), State Park, PA, October 2004. L. Johnson, S. Narayanan, R. Whitney, R. Das, M. Bulut, and C. LaBore, “Limited domain synthesis of expressive military speech for animated characters,” in IEEE Speech Synthesis Workshop, Santa Monica, CA, September 2002. 205 12.4 Abstracts M. Bulut, S. Yildirim, S. Lee, C. M. Lee, C. Busso, A. Kazemzadeh, and S. Narayanan, ”Emotion to emotion speech conversion in phoneme level”, J. Acoust. Soc. Am., 2004. S. Yildirim, M. Bulut, C. Busso, C. M. Lee, A. Kazemzadeh, S. Lee, and S. Narayanan, ”Study of acoustic correlates associate with emotional speech”, J. Acoust. Soc. Am., 2004. C. M. Lee, S. Yildirim, M. Bulut, C. Busso, A. Kazemzadeh, S. Lee, and S. Narayanan, ”Effects of emotion on different phoneme classes”, J. Acoust. Soc. Am., 2004. C. M. Lee, S. Yildirim, M. Bulut, C. Busso, A. Kazemzadeh, S. Lee, and S. Narayanan, ”Effects of emotion on different phoneme classes”, J. Acoust. Soc. Am., 2004. 206 Bibliography [1] E. Abadjieva, I. R. Murray, and J. L. Arnott. Applying analysis of human emotional speech to enhance synthetic speech. In Eurospeech, Berlin, Ger- many, September 1993. [2] M.B. Arnold. Emotion and personality. Columbia University Press, New York, NY, 1960. [3] J.R. Averill. A constructivist view of emotion. In R. Plutchik and H. Keller- man, editors, Emotion: Theory, research, and experience, volume 1, pages 305–339. Academic Press, New York, NY, 1980. [4] J. Bachenko and E. Fitzpatrick. A computational grammar of discourse- neutralprosodicphrasinginEnglish. Computational Linguistics,16:155–167, September 1990. [5] Tanja Banziger and Klaus R. Scherer. The role of intonation in emotional expressions. Speech Communication, 46:252–267, 2006. [6] AntonBatliner, StefanSteidl, BjornSchuller, DinoSeppi, KornelLaskowski, Thurid Vogt, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loic Kessous, and Vered Aharonson. Combining efforts for improving automatic classification of emotional user states. In IS-LTC, Ljubljana, Slovenia, 2006. [7] J. A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Applications to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. International Computer Science Institute, 1997. [8] A. Black. CLUSTERGEN: A statistical parametric synthesizer using trajec- tory modeling. In Interspeech-ICSLP, Pittsburgh, PA, 2006. [9] A. Black and P. Taylor. The festival speech synthesis system: System docu- mentation. Technical Report Technical report HCRC/TR-83, Human Com- munications Research Centre, University of Edinburgh, Scotland, UK, Jan- uary 1997. 207 [10] A. W. Black and Kevin A. Lenzo. Flite: a small, fast run time synthesis engine, November 2007. http://www.speech.cs.cmu.edu/flite/. [11] A. W. Black, P. Taylor, and K. A. Lenzo. Festvox, November 2007. http://www.festvox.org/. [12] Alan W. Black and Paul A. Taylor. Assigning phrase breaks from part-of- speech sequences. In Eurospeech, Rhodes, Greece, 1997. 
[13] Paul Boersma and David Weenink. Praat: doing phonetics by computer, November 2007. http://www.fon.hum.uva.nl/praat/. [14] Bettina Braun, Greg Kochanski, Esther Grabe, and Burton S. Rosner. Evi- denceforattractorsinEnglishintonation. J. Acoust. Soc. Am.,119(6):4006– 4015, June 2006. [15] S. Brave and C. Nass. ”Handbook of human-computer interaction”, chapter ”Emotion in human-computer interaction”, pages ”251–271”. ”Lawrence Erlbaum Associates”, ”New York, NY”, ”2003”. [16] Cynthia Breazeal. Emotive qualities in robot speech. In IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems, volume 3, pages 1388– 1394, 2001. [17] Murtaza Bulut, Carlos Busso, Serdar Yildirim, Abe Kazemzadeh, Chul Min Lee, Sungbok Lee, and Shrikanth Narayanan. Investigating the role of phoneme-level modifications in emotional speech resynthesis. In Proc. of Eurospeech, Interspeech, Lisbon, Portugal, 2005. [18] Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. Analysis of emo- tional speech prosody in terms of part of speech tags. In Interspeech- Eurospeech, Antwerp, Belgium, 2007. [19] Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. A statistical approachformodelingprosodyfeaturesusingPOStagsforemotionalspeech synthesis. In ICASSP, Honolulu, Hawaii, April 2007. [20] Murtaza Bulut, Sungbok Lee, and Shrikanth Narayanan. Recognition for synthesis: Automaticparameterselectionforresynthesisofemotionalspeech from neutral speech. In ICASSP (in review), Las Vegas, NV, April 2008. [21] Murtaza Bulut and Shrikanth Narayanan. Analysis of effects of F0 modi- fications on emotional speech. Journal of the Acoustic Society of America, 2007. accepted. 208 [22] MurtazaBulut,ShrikanthNarayanan,andAnnK.Syrdal. Expressivespeech synthesis using a concatenative synthesizer. In ICSLP, Denver, CO, 2002. [23] Felix Burkhardt. Simulation of emotion with speech synthesis, April 2006. http://emosamples.syntheticspeech.de/. [24] Felix Burkhardt and Walter F. Sendlmeier. Verification of acoustical cor- relates of emotional speech using formant-synthesis. In ISCA workshop on speech and emotion, Northern Ireland, 2000. [25] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Emily Mower, Samuel Kim, Abe Kazemzadeh, Jeannette N. Chang, Sungbok Lee, and Shrikanth Narayanan. IEMOCAP: Interactive and emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 2007. in review. [26] J. E. Cahn. Generating expressions in synthesized speech. Master’s thesis, MIT, 1989. http://www.media.mit.edu/∼cahn/masters-thesis.html. [27] Janet E. Cahn. The generation of affect in synthesized speech. Journal of the American Voice I/O Society, 8:1–19, July 1990. [28] N. Campbell. Perception of affect in speech - towards an automatic process- ing of paralinguistic information in spoken conversation. In ICSLP, 2004. [29] Nick Campbell. Databases of emotional speech. In Proc. of ISCA Workshop on speech and emotion, Northern Ireland, 2000. [30] Nick Campbell. FEAST: Feature Extraction and Analysis for Speech Tech- nology, March 2006. http://feast.atr.jp/. [31] EugeneCharniak. Amaximum-entropy-inspiredparser. InProc. of NAACL, 2000. [32] MinChu,YongZhao,andEricChang. Modelingstylizedinvarianceandlocal variability of prosody in text-to-speech synthesis. Speech Communication, 48:716–726, 2006. [33] R. R. Cornelius. Theoretical approaches to emotion. In Proc. of ISCA Workshop on Speech and Emotion, Belfast, September 2000. [34] R. Cowie, E. D. Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor. 
Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32–80, January 2001. [35] R. Cowie and E. Douglas-Cowie. Automatic statistical analysis of the signal and prosodic signs of emotion in speech. In ICSLP, Philadelphia, 1996. 209 [36] R.Cowie,E.Douglas-Cowie,B.Apolloni,J.Taylor,A.Romano,andW.Fel- lenz. Whataneuralnetneedstoknowaboutemotionwords. InCSCC,1999. [37] R. Cowie, M. Sawey, and E. Douglas-Cowie. A new speech analysis system: Assess. In ICPhS, Stockholm, 1995. [38] A.R. Damasio. Descartes’ error. G.P. Putnam’s Sons, New York, NY, 1994. [39] J. R. Davitz. The communication of emotional meaning. McGraw-Hill, New York, NY, 1964. [40] S. Deketelaere, O. Deroo, and T. Dutoit. Speech processing for communica- tions: What’s new? Revue HF, pages 5–24, March 2001. [41] E. Douglas-Cowie, R. Cowie, and M. Schroeder. A new emotion database: considerations,sourcesandscope.InISCAWorkshoponspeechandemotion, Belfast, 2000. [42] EllenDouglas-Cowie, NickCampbell, RoddyCowie, andPeterRoach. Emo- tional speech: Towards a new generation of databases. Speech Communica- tion, 40:33–60, 2003. [43] RichardO.Duda, PeterE.Hart, andDavidG.Stork. Pattern Classification. John Willey & Sons, INC., a Wiley-Interscience Publication, second edition, 2001. [44] T. Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, 1996. [45] M. Edington. Investigating the limitations of concatenative synthesis. In Eurospeech, 1997. [46] P. Ekman and W. Friesen. Facial action coding system. Consulting Psychol- ogist Press, 1977. [47] P. Ekman, W. Friesen, M. Sullivan, A. Chan, A. I. Diacoyanni-Tarlatzis, K. Heider, R. Krause, W. LeCompte, T. Pitcairn, P. Ricci-Bitti, K. Scherer, and M. Tomita. Universal and cultural differences in the judgments of facial expressions of emotions. Journal of Personality and Social Psychology, 53:712–717, 1987. [48] MULTITEL-TCTS Lab Faculte Polytechnique de Mons. The MBROLA project homepage, March 2006. http://tcts.fpms.ac.be/synthesis/mbrola.html. 210 [49] L. A. Feldman. Valence focus and arousal focus: Individual differences in the structure of affective experience. Journal of Personality and Social Psy- chology, 69:153–166, 1995. [50] W. N. Francis. Studies in English linguistics for Randolph Quirk, chapter A tagged corpus - problems and prospect, pages 192–209. Longman, London and New York, 1979. [51] H. Fujisaki. Speech perception production and linguistic structure, chapter Modelling the process of fundamental frequency contour generation, pages 313–328. IOS press, 1992. [52] R. Garside, G. Leech, and A. McEnery. Corpus Annotation. Longman, London and New York, 1997. [53] C. Gobl and A. N Chasaide. The role of voice quality in communicating emotion, mood and attitude. Speech Communication, 40:189–212, 2003. [54] P.Greasley, J.Setter, M.Waterman, C.Sherrard, P.Roach, S.Arnfield, and D. Horton. Representation of prosodic and emotional features in a spoken language database. In Proc. of ICPhS, volume 1, pages 242–245, Stockholm, 1995. [55] MichaelGrimm,EmilyMower,KristianKroschel,andShrikanthNarayanan. Primitives based estimation and evaluation of emotions in speech. Speech Communication (in press), 2007. [56] R. Harre. The Social Construction of Emotions. Oxford, Basil Blackwell, 1986. [57] B. Heuft, T. Portele, and M. Rauth. Emotions in time domain synthesis. In ICSLP, Philadelphia, USA, October 1996. [58] GregorO.Hofer,KorinRichmond,andRobertA.J.Clark.Informedblending of databases for emotional speech synthesis. 
In Eurospeech, Lisbon, 2005. [59] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken language pro- cessing: A guide to theory, algorithm and system development. Prentice Hall PTR, 2001. [60] A. Iida, N. Campbell, S. Iga, F. Higuchi, and M. Yasumura. A speech synthesis system for assisting communication. In ISCA Workshop on speech and emotion, Belfast, 2000. 211 [61] Akemi Iida, Nick Campbell, Fumito Higuchi, and Michiaki Yasumura. A corpus-basedspeech synthesissystem withemotion. Speech Communication, 40:161–187, 2003. [62] Carroll E. Izard. Human Emotions. Plenum Press, New York, NY, 1977. [63] W. James. What is an emotion? Mind, 19:188–205, 1884. [64] L. Johnson, S. Narayanan, R. Whitney, R. Das, M. Bulut, and C. LaBore. Limited domain synthesis of expressive military speech for animated charac- ters. In IEEE Speech Synthesis Workshop, Santa Monica, CA, 2002. [65] Tom Johnstone and Klaus R. Scherer. The effects of emotions on voice quality. In Proc. of the XIV Int. Congress of Phonetic Sciences, 1999. [66] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Prentice-Hall, Inc., Upper Saddle River, New Jersey, 2000. [67] A. Kain. High resolution voice transformation. PhD thesis, OGI school of science and engineering at Oregon Health and Science University, 2001. [68] A. Kain and M. W. Macon. Spectral voice conversion for text-to-speech synthesis. In ICASSP, pages 285–288, May 1998. [69] E. Klabbers and J. P.H. van Santen. Clustering of foot-based pitch contours in expressive speech. In Proc. of the 5th ISCA Speech Synthesis Workshop, Pittsburg, PA, June 2004. [70] G. Kochanski, E. Grabe, J. Coleman, and B. Rosner. Loudness predicts prominence: Fundamentalfrequencylendslittle.JASA,118(2),August2005. [71] ATR Interpreting Telecommunications Research Labs. Chatr speech synthe- sis, April 2006. http://feast.atr.jp/chatr/chatr/index.html. [72] D. Robert Ladd, Kim E. A. Silverman, Frank Tolkmitt, Gunther Bergmann, and Klaus R. Scherer. Evidence for the independent function of intonation contourtype,voicequality,andf0rangeinsignalingspeakeraffect.J.Acoust. Soc. Am., 78(2):435–444, August 1985. [73] C.M.Lee,S.Yildirim,M.Bulut,A.Kazemzadeh,C.Busso,Z.Deng,S.Lee, andS.Narayanan.Emotionrecognitionbasedonphonemeclasses.InICSLP, 2004. 212 [74] Chul Min Lee and Shri Narayanan. Towards detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Proc.,13(2):293–302,2005. [75] Sungbok Lee, Serdar Yildirim, Shrikanth Narayanan, and Abe Kazemzadeh. An articulatory study of emotional speech production. In Eurospeech- Interspeech, Lisbon, Portugal, 2005. [76] Mark Liberman, Kelly Davis, Murray Grossman, Nii Martey, and John Bell. Emotional prosody speech and transcripts. speech LDC2002S28, The Lin- guistics Data Consortium, 2002. [77] M.W. Macon. Speech synthesis based on sinusoidal modeling. PhD thesis, Georgia Institute of Technology, 1996. [78] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotatedcorpusofEnglish: ThePenntreebank. ComputationalLinguistics, 19(2):313–330, 1993. [79] GregorMohler. Speechsynthesisexamples,April2006. http://www.ims.uni- stuttgart.de/∼moehler/synthspeech/. [80] J. M. Montero, J. Gutierrez-Arriola, J. Colas, E. Enriquez, and J. M. Pardo. Analysis and modelling of emotional speech in spanish. In ICPhS, San Fran- cisco, CA, 1999. [81] AthanasiosMouchtaris,ShrikanthS.Narayanan,andChrisKyriakakis. 
Mul- tichannel audio synthesis by subband-based spectral conversion and param- eter adaptation. IEEE transactions on speech and audio processing, 13:263– 274, March 2005. [82] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communica- tion, 9:453–467, December 1990. [83] S. J. L. Mozziconacci and D. J. Hermes. Role of intonation patterns in conveying emotion in speech. In ICPhS, San Francisco, CA, 1999. [84] I. R. Murray and J. L. Arnott. Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Communication, 16(4):175–205, 1995. [85] Iain R. Murray and John L. Arnott. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoust. Soc. Am., 93(3):1097–1108, February 1993. 213 [86] IainR.Murray,MikeD.Edgington,DianeCampion, andJustinLynn. Rule- based emotion synthesis using concatenated speech. In Proc. of the ISCA Workshop on Emotion and Speech, pages 173–177, Newcastle, Northern Ire- land, September 2000. [87] I.R. Murray. Simulating emotion in synthetic speech. PhD thesis, University of Dundee, UK, 1989. [88] I.R. Murray and J.L. Arnott. Synthesizing emotions in speech: Is it time to get excited? In ICSLP, Philadelphia, 1996. [89] I.R. Murray, J.L. Arnott, and E.A. Rohwer. Emotional stress in synthetic speech: Progress and future directions. Speech Communication, 20:3–12, November 1996. [90] T. Najjary, O. Rosec, and T. Chonavel. A voice conversion method based on joint pitch and spectral envelope transformation. In ICSLP, 2004. [91] ShrikanthNarayanan. Speechanalysisandinterpretationlaboratory,Novem- ber 2007. http://sail.usc.edu/. [92] Y.Niimi,M.Kasamatsu,T.Nishinoto,andM.Araki. Synthesisofemotional speech using prosodically balanced vcv segments. In 4th ISCA tutorial and research workshop on speech synthesis, Scotland, 2001. [93] M. Ostendorf and N. Veilleux. A hierarchical stochastic model for auto- matic prediction of prosodic boundary location. Computational Linguistics, 20(1):27–54, March 1994. [94] A. Paeschke, M. Kienast, and W. F. Sendlmeier. F0-contours in emotional speech. In Proc. of the 14th International Congress of Phonetic Sciences, pages 929–932, San Francisco, CA, 1999. [95] M. D. Pell. Influence of emotion and focus location prosody in matched statements and questions. Journal of the Acoustical Society of America, 109(4):1668–1680, 2001. [96] C. Pereira. Dimensions of emotional meaning in speech. In ISCA Workshop on speech and emotion, Belfast, 2000. [97] Cecile Pereira and Catherine Watson. Some acoustic characteristics of emo- tion. In ICSLP, Sydney, 1998. [98] R. Picard. Affective Computing. MIT Press, Cambridge, MA, 1997. 214 [99] R. Picard, E. Vyzas, and J. Healey. Toward machine emotional intelligence: Analysisofaffectivephysiologicalstate. IEEETransactionsonPatternAnal- ysis and Machine Intelligence, 23(10), October 2001. [100] R. W. Picard, S. Papert, W. Bender, B. Blumberg, C. Breazeal, D. Cavallo, T. Machover, M. Resnick, D. Roy, and C. Strohecker. Affective learning–a manifesto. BT Technical Journal, 22(4), oct 2004. [101] R.Plutchik. Emotion: A Psychoevolutionary Synthesis. Harper&Row,New York, NY, 1980. [102] Lawrence R. Rabiner and Ronald W. Schafer. Digital processing of speech signals. Prentice-Hall, 1978. [103] E. Rank and H. Pirker. Generating emotional speech with a concatenative synthesizer. In ICSLP, Sydney, Australia, 1998. [104] A. Raux and A. Black. 
A unit selection approach to F0 modeling and its application to emphasis. In Proc. of ASRU, St. Thomas, US Virgin Islands, December 2003.
[105] B. Reeves and C. Nass. The Media Equation. Cambridge Univ. Press, Center for the Study of Language and Information, 1996.
[106] ATT Research. ATT Natural Voices, December 2005. http://www.naturalvoices.att.com/.
[107] Cepstral Research. Cepstral text-to-speech synthesis, December 2005. http://cepstral.com/.
[108] IBM Research. IBM research text-to-speech, December 2005. http://www.research.ibm.com/tts/.
[109] Loquendo Research. Loquendo interactive TTS demo, December 2005. http://www.loquendo.com/en/demos/interactive tts demo.htm.
[110] Mindmaker Research. FlexVoice text to speech technology, April 2006. http://www.flexvoice.com/.
[111] D. A. Reynolds and R. C. Rose. Robust text independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3(1):72–83, January 1995.
[112] P. Roach, R. Stibbard, J. Osborne, S. Arnfield, and J. Setter. Transcription of prosodic and paralinguistic features of emotional speech. J. Int. Phonetic Assoc., 28:83–94, 1998.
[113] Peter Roach. Techniques for the phonetic description of emotional speech. In ISCA Workshop on Speech and Emotion, Newcastle, 2000.
[114] O. Salor, M. Demirekler, and B. Pellom. A system for voice conversion based on adaptive filtering and line spectral frequency distance optimization for text-to-speech synthesis. In Eurospeech, 2003.
[115] K. R. Scherer. A cross-cultural investigation of emotion inferences from voice and speech: Implications for speech technology. In ICSLP, Beijing, China, 2000.
[116] K. R. Scherer. Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1-2):227–256, 2003.
[117] H. Schlosberg. Three dimensions of emotion. Psychological Review, 61(2):81–88, 1954.
[118] M. Schroder. Can emotions be synthesized without controlling voice quality? Phonus 4, Research Report of the Institute of Phonetics, Saarland University, 1999.
[119] M. Schroder. Emotional speech synthesis - a review. In Eurospeech, Aalborg, 2001.
[120] Marc Schroder, Roddy Cowie, Ellen Douglas-Cowie, Machiel Westerdijk, and Stan Gielen. Acoustic correlates of emotion dimensions in view of speech synthesis. In Eurospeech, Aalborg, 2001.
[121] Barry Schwartz. The Paradox of Choice: Why More Is Less. Harper Perennial, New York, NY, 2005.
[122] T. Shibata, M. Yoshida, and J. Yamato. Artificial emotional creature for human-machine interaction. In Proc. of the IEEE Systems, Man, and Cybernetics, pages 2269–2274, 1997.
[123] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg. ToBI: A standard for labeling English prosody. In International Conference on Spoken Language Processing, pages 867–870, 1992.
[124] Kare Sjolander and Jonas Beskow. WaveSurfer 1.8.3/0504131354, November 2007. http://www.speech.kth.se/wavesurfer/.
[125] SPSS for Windows, Release 10.1.0. (9 Sep. 2000), Chicago: SPSS Inc.
[126] S. Steidl, M. Levit, A. Batliner, E. Noth, and H. Niemann. "Of all things the measure is man": Automatic classification of emotions and inter-labeler consistency. In ICASSP, 2005.
[127] Y. Stylianou. Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification. PhD thesis, Ecole Nationale Superieure des Telecommunications, Paris, France, January 1996.
[128] Y. Stylianou, O. Cappe, and E. Moulines. Continuous probabilistic transform for voice conversion.
IEEE Transactions on Speech and Audio Processing, 6(2):131–142, March 1998.
[129] Y. Stylianou, T. Dutoit, and J. Schroeter. Diphone concatenation using a harmonic plus noise model of speech. In Proc. of Eurospeech, 1997.
[130] A. Syrdal, Y. Stylianou, L. Garrison, A. Conkie, and J. Schroeter. TD-PSOLA versus harmonic plus noise model in diphone based speech synthesis. In ICASSP, Seattle, Washington, 1998.
[131] Ann K. Syrdal, Alistair Conkie, and Yannis Stylianou. Exploration of acoustic correlates in speaker selection for concatenative synthesis. In ICSLP, Sydney, Australia, Nov. 1998.
[132] Paul Taylor. Analysis and synthesis of intonation using the tilt model. Journal of the Acoustical Society of America, 107(3):1697–1714, 2000.
[133] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In ICASSP, June 2000.
[134] Keiichi Tokuda, Heiga Zen, Junichi Yamagishi, Alan W. Black, Takashi Masuko, Shinji Sako, Tomoki Toda, Takashi Nose, and Keiichiro Oura. HMM-based speech synthesis system (HTS), March 2006. http://hts.ics.nitech.ac.jp/.
[135] Hartmut Traunmuller. Speech considered as modulated voice. Revised manuscript, 2005.
[136] R. Tsuzuki, H. Zen, K. Tokuda, T. Kitamura, M. Bulut, and S. Narayanan. Constructing emotional speech synthesizers with limited speech database. In Proc. of ICSLP, Jeju, Korea, 2004.
[137] Oytun Turk and Levent M. Arslan. Voice conversion methods for vocal tract and pitch contour modification. In EUROSPEECH, 2003.
[138] Oytun Turk, Marc Schroder, Baris Bozkurt, and Levent M. Arslan. Voice quality interpolation for emotional text-to-speech synthesis. In Eurospeech-Interspeech, Lisbon, Portugal, 2005.
[139] H. Valbret, E. Moulines, and J. P. Tubach. Voice transformation using PSOLA technique. Speech Communication, 11:175–187, 1992.
[140] J. van Santen and B. Mobius. Intonation: Analysis, Modeling and Technology, chapter A quantitative model of F0 generation and alignment, pages 269–288. Kluwer Academic Publishers, 1999.
[141] Jean Veronis, Philippe Di Cristo, Fabienne Courtois, and Benoit Lagrue. A stochastic model of intonation for French text-to-speech synthesis. In Eurospeech, September 1997.
[142] D. Vine and R. Sahandi. Synthesizing emotional speech by concatenating multiple pitch recorded speech units. In ISCA Workshop on Speech and Emotion, Belfast, 2000.
[143] J. Vroomen, R. Collier, and S. J. L. Mozziconacci. Duration and intonation in emotional speech. In Eurospeech, 1993.
[144] Dagen Wang and Shrikanth Narayanan. An acoustic measure for word prominence in spontaneous speech. IEEE Transactions on Speech and Audio Processing, 15(2):690–701, February 2007.
[145] C. M. Whissel. The dictionary of affect in language. In R. Plutchik and H. Kellerman, editors, Emotion: Theory, Research and Experience: Vol. 4, The Measurement of Emotions. Academic, New York, 1989.
[146] H. Ye and S. Young. High quality voice morphing. In ICASSP, 2004.
[147] S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S. Narayanan. An acoustic study of emotions expressed in speech. In Proc. of ICSLP, Jeju, Korea, October 2004.
[148] S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book 3.2, 2002.
[149] Jiahong Yuan, Mark Liberman, and Christopher Cieri. Towards an integrated understanding of speaking rate in conversation. In Interspeech, Pittsburgh, Pennsylvania, 2006.
Appendix A
Modification algorithms: PSOLA and LPC

A.1 Pitch Synchronous Overlap and Add (PSOLA)

PSOLA [59,82] is a technique for modifying speech prosody and for concatenating speech waveforms in a pitch-synchronous manner. When the modifications are applied in the frequency domain, the algorithm is referred to as FD-PSOLA; when they are performed in the time domain, it is called TD-PSOLA. While frequency-domain modifications are very flexible for altering the spectral characteristics, they are computationally expensive. In contrast, the time-domain approach can be implemented efficiently in real time.

There are three main steps in the PSOLA algorithm: analysis, modification, and synthesis.

A.1.1 Pitch-synchronous analysis

For a given input signal x[n], short-time signals x_m[n] are obtained by multiplying the speech waveform x[n] by a pitch-synchronous analysis window h_m[n]:

x_m[n] = h_m[t_m - n] \, x[n]    (A.1)

The windows are centered around the pitchmarks t_m, which are set at a pitch-synchronous rate (i.e., periodically) if x[n] is voiced and at a constant rate if it is unvoiced. The windows are always longer than a single pitch period, so that consecutive windows overlap. (For TD-PSOLA they are usually chosen to be Hanning windows two pitch periods long.)

A.1.2 Pitch-synchronous modifications

At this stage the analysis pitchmarks are synchronized with a new set of pitchmarks, called synthesis pitchmarks \tilde{t}_q. In TD-PSOLA this synchronization is achieved by replicating or removing short-time signals. In other words, the algorithm consists of selecting and copying a certain number of analysis short-time signals x_m[n] and translating them by the sequence of delays \delta_q = \tilde{t}_q - t_m:

\tilde{x}_q[n] = x_m[n - \delta_q] = x_m[n + t_m - \tilde{t}_q]    (A.2)

A.1.3 Pitch-synchronous overlap-add synthesis

The synthesis is achieved by combining the windowed signals:

\tilde{x}[n] = \frac{\sum_q \alpha_q \tilde{x}_q[n] \, \tilde{h}_q[\tilde{t}_q - n]}{\sum_q \tilde{h}_q^2[\tilde{t}_q - n]}    (A.3)

In the equation above, \tilde{h}_q[n] denotes the synthesis windows, and the normalization factor \alpha_q is introduced to compensate for the energy modifications. Note that when the window lengths are chosen to be twice the synthesis periods, the equation simplifies to

\tilde{x}[n] = \sum_q \tilde{x}_q[n]    (A.4)

The result is a new signal \tilde{x}[n] that has spectral characteristics similar to those of x[n] but a different pitch and/or duration (as determined by the synthesis pitchmarks).

A.2 Linear Predictive Coding (LPC)

"The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past speech samples. By minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones, a unique set of predictor coefficients can be determined. (The predictor coefficients are the weighting coefficients used in the linear combination.)" [102]

In speech production, the composite spectral effects of radiation, vocal tract, and glottal excitation can be represented by a time-varying digital filter whose system response is

H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (A.5)

The input u[n] to this system is an impulse train for voiced speech or random noise for unvoiced speech, G is a gain parameter, and the a_k are the coefficients of the digital filter.
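To make the source-filter relation in Eqn. A.5 concrete, the short Python sketch below synthesizes a voiced-like frame by exciting an all-pole filter with an impulse train. It is only an illustration of the model, not code used in this work; the sampling rate, pitch period, gain, and predictor coefficients are arbitrary placeholder values, and numpy/scipy are assumed to be available.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000            # sampling rate in Hz (placeholder)
T0 = 100              # pitch period in samples, i.e. 160 Hz (placeholder)
N  = 1600             # frame length: 0.1 s at fs

# Excitation u[n]: an impulse train models voiced speech; for the unvoiced
# case one would use white noise instead, e.g. np.random.randn(N).
u = np.zeros(N)
u[::T0] = 1.0

# All-pole filter of Eqn. A.5, H(z) = G / (1 - sum_k a_k z^{-k}).
# scipy.signal.lfilter expects the denominator as [1, -a_1, ..., -a_p].
G = 1.0
a = np.array([1.3, -0.8, 0.3])                    # placeholder a_k (stable filter)
s = lfilter([G], np.concatenate(([1.0], -a)), u)  # synthesized frame s[n]
```

The recursion that lfilter evaluates here is exactly the difference equation given next in Eqn. A.6.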
Based on this system, the speech signal s[n] can be produced as

s[n] = \sum_{k=1}^{p} a_k s[n-k] + G u[n]    (A.6)

When the linear prediction coefficients \alpha_k are not exact, we get the approximate output

\tilde{s}[n] = \sum_{k=1}^{p} \alpha_k s[n-k]    (A.7)

The prediction error (also called the residual) is

e[n] = s[n] - \tilde{s}[n]    (A.8)

The basic problem in LPC is the estimation of the predictor coefficients \alpha_k so that the error e[n] is minimized. Because of the time-varying nature of the speech signal, the predictor coefficients are estimated from short-time segments of the signal using methods such as autocorrelation, covariance, lattice, inverse filtering, spectral estimation, maximum likelihood, and inner products.

Appendix B
Spectral conversion using GMMs

B.1 Spectral Conversion

In this appendix we present the properties of the spectral conversion algorithm that was used to transform the input emotion's spectral characteristics to resemble those of the target emotion. In this method, the spectral parameters are mapped using a linear transformation based on Gaussian mixture models (GMM) whose parameters are trained by joint density estimation. This method was used by Kain et al. [67,68] for voice conversion, where the purpose was to convert one speaker's characteristics into another's. The method was also successfully applied to multichannel audio synthesis by Mouchtaris et al. [81].

The spectral conversion method mentioned above was implemented in Matlab by Athanasios Mouchtaris [81] when he was at the University of Southern California (USC). We modified his code and used it for our purposes. Below we present only the basic details of the method.

B.2 Spectral conversion using Gaussian mixture models (GMM)

In spectral conversion, the goal is to transform the input (source) into the required output (target) with the aid of specifically designed and appropriately trained transfer functions. However, in most real-life cases it is not possible to find an exact mapping from the provided input domain into the output domain. Because of that, the required transfer function is designed according to some optimization criterion, where the goal is to minimize a predefined error function. The use of different error definitions leads to different transfer functions. The most commonly used error functions are the mean squares and least squares errors.

If we define x and y to indicate the input and output data, and F and E the transfer and error functions, respectively, then we seek to minimize the error defined in Eqn. B.1:

\varepsilon = E(F(x), y)    (B.1)

In general the sizes of the input x and the output y do not need to be the same, and the error function E could be any function. Usually, however, x and y are formed so that they have the same size, and the function E is chosen so that it can be minimized easily.

If we represent the input data with n vectors of dimension p, x = {x_k, k = 1, 2, ..., n}, and similarly the output data with another set of n vectors of the same dimension, y = {y_k, k = 1, 2, ..., n}, and assume that the input and output vectors are time aligned, then we can define the transfer function F as the function that brings the vector x_k closest to y_k in terms of the defined error measure. Employing the least squares error criterion, we get Eqn. B.2:

\varepsilon = \sum_{k=1}^{n} \| y_k - F(x_k) \|^2    (B.2)

The transfer function F is chosen so that the least squares error \varepsilon is minimized.
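As a toy illustration of the criterion in Eqn. B.2 only (the conversion actually used in this work is the GMM-based mapping described next), the sketch below fits a single global affine transform F(x) = Wx + b to time-aligned source/target vectors by ordinary least squares. The data are random placeholders, and numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 12                           # placeholder: 200 frames, 12-dim spectral vectors
X = rng.standard_normal((n, p))          # source vectors x_k, one per row
Y = X @ rng.standard_normal((p, p)) \
    + 0.1 * rng.standard_normal((n, p))  # target vectors y_k (synthetic)

# Append a column of ones so the bias b is estimated together with W.
Xa = np.hstack([X, np.ones((n, 1))])
Wb, *_ = np.linalg.lstsq(Xa, Y, rcond=None)  # minimizes Eqn. B.2 for affine F

Y_hat = Xa @ Wb                     # F(x_k) for every frame
err = np.sum((Y - Y_hat) ** 2)      # value of the error in Eqn. B.2
```

A single affine map is usually too rigid for spectral conversion, which motivates the mixture-based transfer function that follows.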
In the Gaussian mixture model (GMM), the assumption is that the probability density function g(x) of the random vector x can be modeled with M Gaussian probability density functions, as shown in Eqn. B.3:

g(x) = \sum_{i=1}^{M} p(w_i) \, N(x; \mu_i^x, \Sigma_i^{xx})    (B.3)

In Eqn. B.3, N(x; \mu, \Sigma) represents the normal distribution and p(w_i) represents the prior probability of class w_i. For each one of the classes there are three unknown parameters, \mu, \Sigma, and p(w), that need to be estimated. They are estimated using the expectation maximization (EM) algorithm [7,111].

The transformation function F(x) is defined as in Eqn. B.4:

F(x_k) = \sum_{i=1}^{M} p(w_i | x_k) \left[ \nu_i + \Gamma_i (\Sigma_i^{xx})^{-1} (x_k - \mu_i^x) \right]    (B.4)

In Eqn. B.4 the conditional probability p(w_i | x_k) is the probability that a vector x_k belongs to class w_i. This conditional probability can be computed using Bayes' theorem (Eqn. B.5):

p(w_i | x_k) = \frac{p(w_i) \, N(x_k; \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{M} p(w_j) \, N(x_k; \mu_j^x, \Sigma_j^{xx})}    (B.5)

There are two sets of unknown parameters in Eqn. B.4, \nu_i and \Gamma_i (i = 1, ..., M). They are found by minimizing Eqn. B.2. This requires solving normal equations whose solutions require the inversion of large and sometimes poorly conditioned matrices [68,127].

An alternative to the transfer function in Eqn. B.4 is to use another transfer function in which the source and target vector parameters are used jointly. This is the method applied in our spectral conversion. In this case, assuming that x and y are jointly Gaussian for each class w_i, the new transfer function is defined as the expectation (denoted by E) of y given x (Eqn. B.6):

F(x_k) = E(y | x_k) = \sum_{i=1}^{M} p(w_i | x_k) \left[ \mu_i^y + \Sigma_i^{yx} (\Sigma_i^{xx})^{-1} (x_k - \mu_i^x) \right]    (B.6)

Let us define a new vector z as z = [x^t \; y^t]^t, where t denotes transposition. Then all of the required parameters in Eqn. B.6 can be found by estimating the GMM parameters of z using the EM algorithm. Note that the parameters of z can be represented as in Eqns. B.7 and B.8:

\Sigma_i^{zz} = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}    (B.7)

\mu_i^z = \begin{bmatrix} \mu_i^x \\ \mu_i^y \end{bmatrix}    (B.8)
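The joint-density GMM mapping of Eqns. B.3–B.8 can be sketched as follows. This is an illustrative Python rendering of the procedure, not the Matlab implementation used in this work; it assumes numpy, scipy, and scikit-learn, and it uses random placeholder data in place of time-aligned source- and target-emotion spectral vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, p, M = 500, 10, 4                    # placeholder: 500 frames, 10-dim vectors, 4 classes
X = rng.standard_normal((n, p))         # source vectors x_k
Y = 0.8 * X + 0.2 * rng.standard_normal((n, p))  # target vectors y_k (synthetic)

# Fit one GMM on the joint vectors z = [x^t y^t]^t (Eqns. B.7, B.8).
Z = np.hstack([X, Y])
gmm = GaussianMixture(n_components=M, covariance_type="full", random_state=0).fit(Z)

mu_x = gmm.means_[:, :p]                # mu_i^x
mu_y = gmm.means_[:, p:]                # mu_i^y
S_xx = gmm.covariances_[:, :p, :p]      # Sigma_i^xx
S_yx = gmm.covariances_[:, p:, :p]      # Sigma_i^yx

def convert(x):
    """Apply the conversion function of Eqn. B.6 to a single source vector x."""
    # p(w_i | x): Bayes' theorem over the marginal source model (Eqn. B.5)
    lik = np.array([multivariate_normal.pdf(x, mu_x[i], S_xx[i]) for i in range(M)])
    post = gmm.weights_ * lik
    post /= post.sum()
    # E(y | x): posterior-weighted class-conditional regressions
    y = np.zeros(p)
    for i in range(M):
        y += post[i] * (mu_y[i] + S_yx[i] @ np.linalg.solve(S_xx[i], x - mu_x[i]))
    return y

Y_hat = np.array([convert(x) for x in X[:5]])   # convert a few frames
```

Fitting the GMM once on the stacked vectors provides all the class priors, means, and covariance blocks needed by Eqns. B.5–B.8 in a single EM run, which is the main practical appeal of the joint-density formulation.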
Appendix C
Test utterances: Spectrogram and F0 contour plots

The prosody and spectral characteristics of the test utterances can be visualized in the spectrogram and pitch contour plots displayed below. These plots were generated using the Wavesurfer [124] software.

Figure C.1: Sentence 1: (nfjoyNew112 1.wav) See how funny our puppy looks in the photo.
Figure C.2: Sentence 2: (nfjoyNew192.wav) You always come up with pathological examples to fancy the audience.
Figure C.3: Sentence 3: (nfjoyNew195.wav) Leave off the marshmallows and look what you have done with the vanilla pie.
Figure C.4: Sentence 4: (nfjoyNew213.wav) Summertime supper outside is a natural thing.
Figure C.5: Sentence 5: (nfjoyNew228.wav) The fifth jar contains big juicy peaches.
Figure C.6: Sentence 6: (nfjoyNew245.wav) Keep the desserts simple fruit does nicely.

Appendix D
ETET evaluation results for individual sentences

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            66.67    14.29  9.52   9.52   0
          pos1          57.14    19.05  19.05  0      4.76
          uv1           20.00    75.00  0      0      5.00
          x1+pos1       4.76     61.90  9.52   0      23.81
          x1+uv1        28.57    61.90  9.52   0      0
          pos1+uv1      42.86    42.86  9.52   4.76   0
          x1+pos1+uv1   52.38    33.33  4.76   0      9.52
Quality   x1            4.07     3.33   4.00   3.50   –
          pos1          4.08     4.25   4.00   –      4.00
          uv1           3.00     3.80   –      –      4.00
          x1+pos1       3.00     3.46   3.50   –      3.20
          x1+uv1        3.33     3.15   4.00   –      –
          pos1+uv1      3.22     3.44   2.50   4.00   –
          x1+pos1+uv1   2.82     3.14   2.00   –      1.00
Table D.1: Neutral to angry conversion results for sentence 1.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            71.43    9.52   4.76   9.52   4.76
          pos1          71.43    0      9.52   9.52   9.52
          uv1           63.16    0      10.53  26.32  0
          x1+pos1       52.38    4.76   9.52   28.57  4.76
          x1+uv1        76.19    0      4.76   14.29  4.76
          pos1+uv1      38.10    4.76   19.05  14.29  23.81
          x1+pos1+uv1   28.57    0      23.81  9.52   38.10
Quality   x1            3.67     3.50   5.00   3.50   4.00
          pos1          3.73     –      4.00   3.50   4.00
          uv1           3.83     –      3.50   4.00   –
          x1+pos1       3.36     4.00   2.50   2.83   2.00
          x1+uv1        3.06     –      2.00   3.67   3.00
          pos1+uv1      2.88     2.00   3.25   3.33   2.60
          x1+pos1+uv1   2.17     –      2.80   2.50   1.88
Table D.2: Neutral to happy conversion results for sentence 1.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            66.67    9.52   9.52   9.52   4.76
          pos1          9.52     0      14.29  71.43  4.76
          uv1           10.53    0      0      89.47  0
          x1+pos1       19.05    4.76   14.29  52.38  9.52
          x1+uv1        4.76     0      4.76   90.48  0
          pos1+uv1      0        0      0      90.48  9.52
          x1+pos1+uv1   4.76     4.76   14.29  66.67  9.52
Quality   x1            3.29     3.50   3.50   3.00   4.00
          pos1          3.00     –      3.67   3.47   3.00
          uv1           1.50     –      –      3.24   –
          x1+pos1       2.75     3.00   3.33   2.09   3.50
          x1+uv1        1.00     –      2.00   2.72   –
          pos1+uv1      –        –      –      2.63   2.00
          x1+pos1+uv1   2.00     2.00   2.00   2.36   1.00
Table D.3: Neutral to sad conversion results for sentence 1.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            80.95    19.04  0      0      0
          pos1          71.42    14.28  0      0      14.29
          uv1           14.29    76.19  4.76   4.76   0
          x1+pos1       28.57    61.90  4.76   0      4.76
          x1+uv1        23.81    57.14  9.52   4.76   4.76
          pos1+uv1      15.00    75.00  5.00   0      5.00
          x1+pos1+uv1   14.28    57.14  9.52   0      19.05
Quality   x1            4.29     4.25   –      –      –
          pos1          3.40     3.67   –      –      4.33
          uv1           2.33     3.69   4.00   3.00   –
          x1+pos1       3.00     3.46   3.00   –      4.00
          x1+uv1        2.60     3.17   3.00   3.00   4.00
          pos1+uv1      3.00     3.13   3.00   –      3.00
          x1+pos1+uv1   1.67     2.73   2.00   –      1.50
Table D.4: Neutral to angry conversion results for sentence 2.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            85.71    9.52   0      0      4.76
          pos1          76.19    4.76   0      9.52   9.52
          uv1           71.43    0      14.29  4.76   9.52
          x1+pos1       47.62    0      9.52   14.29  28.57
          x1+uv1        85.71    9.52   0      0      4.76
          pos1+uv1      33.33    4.76   19.05  28.57  14.29
          x1+pos1+uv1   60.00    10.00  5.00   5.00   20.00
Quality   x1            4.00     3.50   –      –      4.00
          pos1          4.31     5.00   –      4.00   3.00
          uv1           4.07     –      3.33   4.00   3.00
          x1+pos1       3.00     –      3.00   3.00   2.33
          x1+uv1        3.33     3.00   –      –      3.00
          pos1+uv1      2.57     2.00   2.25   2.67   1.67
          x1+pos1+uv1   2.75     3.50   2.00   3.00   1.50
Table D.5: Neutral to happy conversion results for sentence 2.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            80.95    9.52   0      0      9.52
          pos1          38.10    0      14.29  33.33  14.29
          uv1           9.52     0      0      85.71  4.76
          x1+pos1       33.33    0      0      57.14  9.52
          x1+uv1        9.52     4.76   0      71.43  14.29
          pos1+uv1      4.76     0      0      95.24  0
          x1+pos1+uv1   4.76     0      0      76.19  19.05
Quality   x1            3.29     4.00   –      –      4.00
          pos1          4.25     –      3.00   3.43   3.00
          uv1           1.50     –      –      2.83   2.00
          x1+pos1       2.43     –      –      2.67   2.50
          x1+uv1        1.50     2.00   –      2.33   1.33
          pos1+uv1      1.00     –      –      2.75   –
          x1+pos1+uv1   5.00     –      –      1.81   1.25
Table D.6: Neutral to sad conversion results for sentence 2.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            76.19    14.29  4.76   0      4.76
          pos1          19.05    66.67  4.76   0      9.52
          uv1           19.05    76.19  4.76   0      0
          x1+pos1       19.05    76.20  0      0      4.76
          x1+uv1        9.52     66.67  0      9.52   14.29
          pos1+uv1      42.86    42.86  4.76   0      9.52
          x1+pos1+uv1   19.05    66.67  4.76   4.76   4.76
Quality   x1            3.38     2.67   1.00   –      3.00
          pos1          3.75     4.21   4.00   –      3.50
          uv1           3.50     4.13   4.00   –      –
          x1+pos1       2.00     3.06   –      –      2.00
          x1+uv1        2.00     3.29   –      2.50   2.33
          pos1+uv1      3.11     3.44   4.00   –      3.00
          x1+pos1+uv1   1.75     2.36   3.00   1.00   1.00
Table D.7: Neutral to angry conversion results for sentence 3.
Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            71.43    28.57  0      0      0
          pos1          47.62    23.81  9.52   4.76   14.29
          uv1           85.00    5.00   5.00   0      5.00
          x1+pos1       42.86    14.29  0      23.81  19.05
          x1+uv1        80.95    14.29  0      4.76   0
          pos1+uv1      42.86    33.33  0      9.52   14.29
          x1+pos1+uv1   42.86    14.29  9.52   0      33.33
Quality   x1            2.73     2.33   –      –      –
          pos1          3.40     3.40   3.50   3.00   3.00
          uv1           3.82     3.00   3.00   –      5.00
          x1+pos1       2.11     1.67   –      2.60   2.00
          x1+uv1        2.35     2.67   –      3.00   –
          pos1+uv1      2.67     3.57   –      3.00   2.67
          x1+pos1+uv1   1.44     2.33   3.00   –      1.29
Table D.8: Neutral to happy conversion results for sentence 3.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            75.00    25.00  0      0      0
          pos1          14.29    0      4.76   71.43  9.52
          uv1           9.52     0      0      71.43  19.05
          x1+pos1       14.29    0      0      76.19  9.52
          x1+uv1        14.29    0      0      71.43  14.29
          pos1+uv1      14.29    0      9.52   71.43  4.76
          x1+pos1+uv1   4.76     0      4.76   80.95  9.52
Quality   x1            2.27     3.00   –      –      –
          pos1          3.33     –      3.00   3.53   3.50
          uv1           1.00     –      –      2.80   2.75
          x1+pos1       2.00     –      –      2.19   2.00
          x1+uv1        1.33     –      –      1.53   1.67
          pos1+uv1      1.67     –      3.50   2.80   2.00
          x1+pos1+uv1   2.00     –      –      1.24   1.50
Table D.9: Neutral to sad conversion results for sentence 3.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            75.00    15.00  5.00   0      5.00
          pos1          42.86    42.86  9.52   0      4.76
          uv1           28.57    52.38  9.52   0      9.52
          x1+pos1       61.90    14.29  4.76   9.52   9.52
          x1+uv1        19.05    71.43  4.76   0      4.76
          pos1+uv1      19.05    52.38  4.76   0      23.81
          x1+pos1+uv1   40.00    35.00  15.00  0      10.00
Quality   x1            3.73     4.00   5.00   –      5.00
          pos1          4.00     4.00   4.00   –      2.00
          uv1           2.83     3.82   3.00   –      5.00
          x1+pos1       3.00     2.00   1.00   2.50   4.00
          x1+uv1        3.25     3.27   5.00   –      2.00
          pos1+uv1      2.75     3.10   3.00   –      2.40
          x1+pos1+uv1   2.50     2.14   3.00   –      1.50
Table D.10: Neutral to angry conversion results for sentence 4.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            61.90    23.81  9.52   0      4.76
          pos1          71.43    9.52   4.76   4.76   9.52
          uv1           65.00    10.00  10.00  10.00  5.00
          x1+pos1       57.14    14.29  14.29  9.52   4.76
          x1+uv1        71.43    4.76   19.05  0      4.76
          pos1+uv1      38.10    14.29  23.81  0      23.81
          x1+pos1+uv1   38.10    9.52   19.05  9.52   23.81
Quality   x1            3.77     3.80   2.50   –      5.00
          pos1          3.87     3.00   5.00   4.00   3.50
          uv1           3.92     4.00   4.00   3.50   4.00
          x1+pos1       3.08     2.33   2.00   3.00   2.00
          x1+uv1        3.07     4.00   3.25   –      2.00
          pos1+uv1      2.88     2.67   2.80   –      2.40
          x1+pos1+uv1   2.00     2.50   3.00   2.00   1.60
Table D.11: Neutral to happy conversion results for sentence 4.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            90.48    4.76   0      0      4.76
          pos1          10.00    35.00  35.00  0      20.00
          uv1           9.52     0      0      76.19  14.29
          x1+pos1       23.81    33.33  14.29  14.29  14.29
          x1+uv1        4.76     0      0      61.90  33.33
          pos1+uv1      4.76     0      14.29  66.67  14.29
          x1+pos1+uv1   4.76     4.76   0      76.19  14.29
Quality   x1            2.21     4.00   –      –      3.00
          pos1          3.50     3.86   4.00   –      3.75
          uv1           1.50     –      –      2.81   2.67
          x1+pos1       2.20     1.71   3.00   1.67   2.00
          x1+uv1        1.00     –      –      2.08   1.14
          pos1+uv1      1.00     –      3.67   2.43   2.00
          x1+pos1+uv1   1.00     1.00   –      1.94   2.33
Table D.12: Neutral to sad conversion results for sentence 4.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            73.68    26.32  0      0      0
          pos1          71.43    9.52   14.29  0      4.76
          uv1           28.57    66.67  0      0      4.76
          x1+pos1       15.00    65.00  10.00  0      10.00
          x1+uv1        28.57    28.57  9.52   4.76   28.57
          pos1+uv1      61.90    38.10  0      0      0
          x1+pos1+uv1   19.05    61.90  0      4.76   14.29
Quality   x1            3.93     3.60   –      –      –
          pos1          4.33     4.50   4.00   –      4.00
          uv1           4.17     3.71   –      –      5.00
          x1+pos1       3.33     3.38   3.50   –      3.50
          x1+uv1        3.00     2.17   3.00   3.00   2.33
          pos1+uv1      3.54     3.38   –      –      –
          x1+pos1+uv1   3.00     2.46   –      3.00   2.00
Table D.13: Neutral to angry conversion results for sentence 5.
Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            80.95    14.29  4.76   0      0
          pos1          60.00    10.00  10.00  10.00  10.00
          uv1           85.71    4.76   4.76   0      4.76
          x1+pos1       50.00    10.00  0.00   30.00  10.00
          x1+uv1        66.67    14.29  9.52   9.52   0
          pos1+uv1      47.62    4.76   4.76   23.81  19.05
          x1+pos1+uv1   71.43    4.76   14.29  4.76   4.76
Quality   x1            3.88     4.33   4.00   –      –
          pos1          4.50     4.00   4.00   3.50   4.50
          uv1           4.22     2.00   4.00   –      5.00
          x1+pos1       3.60     4.00   –      2.83   3.00
          x1+uv1        3.21     2.00   3.00   2.50   –
          pos1+uv1      2.40     3.00   3.00   2.20   2.00
          x1+pos1+uv1   2.27     1.00   3.33   3.00   2.00
Table D.14: Neutral to happy conversion results for sentence 5.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            85.71    9.52   4.76   0      0
          pos1          52.38    9.52   14.29  19.05  4.76
          uv1           14.29    0      4.76   76.19  4.76
          x1+pos1       40.00    5.00   10.00  15.00  30.00
          x1+uv1        0        0      0      85.71  14.29
          pos1+uv1      4.76     0      0      80.95  14.29
          x1+pos1+uv1   9.52     0      0      85.71  4.76
Quality   x1            3.83     4.00   –      –      –
          pos1          4.09     3.00   3.67   3.50   3.00
          uv1           3.33     –      2.00   3.25   1.00
          x1+pos1       2.50     2.00   2.50   3.33   2.33
          x1+uv1        –        –      –      2.61   1.67
          pos1+uv1      1.00     –      –      2.29   2.33
          x1+pos1+uv1   2.00     –      –      1.89   1.00
Table D.15: Neutral to sad conversion results for sentence 5.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            57.14    28.57  9.52   0      4.76
          pos1          66.67    28.57  4.76   0      0
          uv1           19.05    66.67  4.76   0      9.52
          x1+pos1       42.86    47.62  9.52   0      0
          x1+uv1        23.81    57.14  0      4.76   14.29
          pos1+uv1      15.00    65.00  5.00   0      15.00
          x1+pos1+uv1   38.10    42.86  9.52   0      9.52
Quality   x1            3.92     4.00   5.00   –      5.00
          pos1          4.71     3.33   5.00   –      –
          uv1           4.50     3.86   4.00   –      3.50
          x1+pos1       4.11     3.30   5.00   –      –
          x1+uv1        3.60     3.00   –      3.00   3.67
          pos1+uv1      2.33     3.23   4.00   –      1.67
          x1+pos1+uv1   2.50     2.78   2.50   –      3.00
Table D.16: Neutral to angry conversion results for sentence 6.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            70.00    15.00  10.00  0      5.00
          pos1          61.90    28.57  4.76   0      4.76
          uv1           76.19    4.76   14.29  4.76   0
          x1+pos1       66.67    14.29  9.52   0      9.52
          x1+uv1        38.10    23.81  28.57  4.76   4.76
          pos1+uv1      75.00    0      20.00  0      5.00
          x1+pos1+uv1   52.38    14.29  28.57  4.76   0
Quality   x1            4.07     3.33   4.50   –      5.00
          pos1          4.08     4.00   5.00   –      4.00
          uv1           3.69     3.00   3.67   4.00   –
          x1+pos1       3.62     3.00   3.50   –      2.50
          x1+uv1        3.50     3.00   2.83   1.00   3.00
          pos1+uv1      3.27     –      3.25   –      3.00
          x1+pos1+uv1   2.36     2.67   2.83   3.00   –
Table D.17: Neutral to happy conversion results for sentence 6.

Emo/Qual  Mod.          Neutral  Angry  Happy  Sad    Other
Emotion   x1            80.95    14.29  4.76   0      0
          pos1          57.14    33.33  4.76   0      4.76
          uv1           0        0      0      95.24  4.76
          x1+pos1       28.57    23.81  33.33  4.76   9.52
          x1+uv1        9.52     0      0      85.71  4.76
          pos1+uv1      4.76     4.76   4.76   80.95  4.76
          x1+pos1+uv1   9.52     0      9.52   61.90  19.05
Quality   x1            3.82     2.67   4.00   –      –
          pos1          3.92     3.71   4.00   –      4.00
          uv1           –        –      –      2.95   1.00
          x1+pos1       2.83     3.00   3.29   1.00   2.50
          x1+uv1        1.50     –      –      1.83   2.00
          pos1+uv1      1.00     3.00   3.00   2.65   1.00
          x1+pos1+uv1   1.00     –      1.00   1.62   1.25
Table D.18: Neutral to sad conversion results for sentence 6.
Abstract
Emotions play an important role in human life. They are essential for communication, for decision making, and for survival. They constitute a challenging research area across diverse disciplines such as psychology, sociology, philosophy, medicine, and engineering. One realm of inquiry relates to emotions expressed in speech. In this study our focus is on angry, happy, sad, and neutral emotions in speech. We investigate the speech acoustic correlates that are important for emotion perception in utterances and propose techniques to synthesize emotional speech that is correctly recognized by human listeners. The motivation for our research comes from the desire to impart emotion processing capabilities to machines in order to make human-machine interactions more pleasant, effective, and productive. Instead of generating emotional speech from text, our approach starts with a natural neutral utterance and modifies its acoustic features to impart the targeted emotion.