ENRICHING SPOKEN LANGUAGE PROCESSING: REPRESENTATION AND MODELING OF SUPRASEGMENTAL EVENTS

by Vivek Kumar Rangarajan Sridhar

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2008

Copyright 2008 Vivek Kumar Rangarajan Sridhar

Acknowledgments

I could not have asked for a better advisor than Shrikanth Narayanan. He has shown an unbridled optimism and enthusiasm for my research efforts throughout my graduate student life. I thank him for giving me the opportunity and entrusting me with many responsibilities, small and big. I would like to thank another person who has had a deep involvement in shaping my dissertation, Srinivas Bangalore. He has been a great source of expertise and insight. There have also been some excellent professors at USC who have helped me reach this point, especially Dani Byrd, Antonio Ortega, Keith Jenkins, Kevin Knight, Urbashi Mitra and Gandhi Puvvada. My research has also benefitted greatly from my interactions and collaborations with my lab mates. I learned a lot from Abhinav Sethy during my early years in the lab. I have enjoyed the discussions and debates with Shankar, Jorge Silva, Viktor Rozgic, Tom Murray and SAILors all. I have immensely enjoyed joking and laughing with Arjun Bharadwaj, Shiva Sundaram and all my roommates during my life at USC. Finally, I would like to thank Kavitha, who has shown a great deal of patience and understanding during the course of my doctoral program, and I'd like to thank my family (Amma, Appa, Girish, Rangarajans and Srinivasans all), who have shown me nothing but love and support for all these years.

Table of Contents

Acknowledgments ii
List of Figures vi
List of Tables viii
Abstract xi

Chapter 1: Introduction 1
1.1 Prosodic attributes: prominence and phrasing 2
1.1.1 Representations of prosody 4
1.1.1.1 Categorical descriptions of prosody 4
1.1.1.2 ToBI annotation scheme 6
1.1.1.3 Raw acoustic correlates of prosody 7
1.1.1.4 Parametric representation of prosody 9
1.1.1.5 Tilt Intonation model 10
1.1.2 Algorithms for automatic prosody labeling 11
1.1.2.1 Generative models 11
1.1.2.2 Discriminative models 13
1.2 Discourse context 14
1.2.1 Representation of discourse context through dialog acts 14
1.2.2 Algorithms for automatic dialog act tagging 16
1.3 Contributions of thesis 17
1.4 Outline of the thesis 20

Chapter 2: Automatic detection of prosodic structure 24
2.1 Introduction 25
2.1.1 Contributions of this work 26
2.1.1.1 Syntactic features 26
2.1.1.2 Acoustic features 27
2.1.1.3 Modeling 27
2.2 Categorical representation of prosody 28
2.2.1 Mapping ToBI labels to binary classes 28
2.3 Related Work 30
2.3.1 Pitch accent and boundary tone labeling 30
2.3.2 Prosodic phrase break labeling 32
2.4 Maximum Entropy discriminative model for prosody labeling 35
2.5 Lexical, syntactic and acoustic features 37
2.5.1 Lexical and syntactic features 38
2.5.2 Acoustic-prosodic features 40
2.6 Experimental Evaluation 43
2.6.1 Data 43
2.7 Pitch accent and boundary tone labeling 44
2.7.1 Baseline Experiments 44
2.7.1.1 Acoustic baseline (chance) 45
2.7.1.2 Prosody labels derived from lexical stress 45
2.7.1.3 Prosody labels predicted using TTS systems 45
2.7.2 Maximum entropy pitch accent and boundary tone classifier 46
2.7.2.1 Maximum entropy syntactic-prosodic model 46
2.7.2.2 Maximum entropy acoustic-prosodic model 47
2.7.3 HMM acoustic-prosodic model 48
2.8 Prosodic break index labeling 50
2.8.1 Baseline Experiments 51
2.8.1.1 Break index prediction in Festival 51
2.8.2 Maximum Entropy model for break index prediction 52
2.8.2.1 Syntactic-prosodic model 52
2.8.2.2 Acoustic-prosodic model 52
2.9 Discussion 53
2.9.1 Prominence prediction 54
2.9.2 Phrase structure prediction 56
2.10 Summary, conclusions, and future work 56

Chapter 3: Automatic recognition of discourse context 59
3.1 Introduction 60
3.2 Maximum entropy model for dialog act tagging 61
3.3 Data 64
3.4 Features for Dialog Act Classification 65
3.4.1 Lexical and syntactic features 66
3.4.2 Acoustic-prosodic features 67
3.5 Dialog Act classification using true transcripts 68
3.6 Dialog Act classification using acoustic prosody 70
3.6.1 Related work 71
3.6.2 Maximum entropy intonation model 72
3.6.3 Comparison with acoustic correlates of prosody 74
3.7 Dialog Act tagging using recognized transcripts 76
3.8 Dialog Act tagging using history 78
3.9 Dialog Act tagging using right context 80
3.10 Discussion 82
3.11 Conclusion and Future Work 84

Chapter 4: Enriching speech translation with prosody 89
4.1 Introduction 89
4.2 Automatic prominence labeling 91
4.3 Data 93
4.4 Enriching translation with prosody 94
4.4.1 Factored translation models for incorporating prominence 95
4.5 Experiments and Results 97
4.6 Discussion and Future Work 99

Chapter 5: Enriching speech translation with dialog acts 100
5.1 Introduction 100
5.2 Dialog act tagger 103
5.3 Enriched translation using DAs 104
5.3.1 Phrase-based translation with dialog acts 105
5.3.2 Bag-of-Words lexical choice and permutation reordering model 107
5.4 Data 108
5.5 Experiments and Results 109
5.5.1 Analysis of results 111
5.6 Discussion and Future Work 112

Chapter 6: Summary and Conclusion 114

References 117

List of Figures

1.1 Illustration of the four parallel tiers of ToBI for a sample utterance (courtesy: ToBI tutorial) 6
1.2 Illustration of the Tilt intonational model (from Taylor [123]) 10
2.1 Illustration of the quantized feature input to the maxent classifier. "|" denotes feature input conditioned on preceding values in the acoustic-prosodic sequence 41
2.2 Illustration of the FST composition of the syntactic and acoustic lattices and resulting best path selection. The syntactic-prosodic maxent model produces the syntactic lattice and the HMM acoustic-prosodic model produces the acoustic lattice. 50
3.1 The distribution of utterances by dialog act tag category in the Switchboard-DAMSL corpus. 64
3.2 Illustration of syntax-based prosody predicted by the prosody labeler. The prosody labeler uses lexical and syntactic context from surrounding words to predict the prosody labels for the current word. 68
3.3 Illustration of n-gram feature encoding of lexical, syntactic and syntax-based prosody cues. The n-gram features represent the feature input space of the maximum entropy classifier. "|" denotes feature input conditioned on the history. 69
3.4 Illustration of the quantized feature input to the maxent classifier. "|" denotes feature input conditioned on the value of the preceding element in the acoustic-prosodic sequence 73
3.5 Enabling transfer of meaning and style from source language to target language using enriched and integrated models of translation and synthesis. For clarity, only one direction of the two-way path between the dyad in S2S interaction is shown in the figure. 88
4.1 Illustration of the proposed scheme in comparison with conventional approaches 91
4.2 Example of a factored translation model (borrowed from [64]). The arcs represent conditional dependencies between the nodes. 95
4.3 Illustration of the proposed factored translation models to incorporate prominence 96
4.4 Illustration of the process used to calculate prosodic accuracy 97
5.1 Example of enriched speech translation output with dialog act 102
5.2 Formulation of the proposed enriched speech-to-speech translation framework 105
5.3 Distribution of dialog acts in the test data of each corpus 111

List of Tables

1.1 Acoustic correlates of prosody used in speech applications. Both raw as well as normalized features are typically used. 8
1.2 Automatic prosody labeling algorithms that have used generative and discriminative models 12
1.3 Examples of dialog act tags in the Switchboard-DAMSL annotation scheme. 16
1.4 Generative and discriminative models that have been used for automatic dialog act tagging 17
2.1 ToBI label mapping used in experiments. The decomposition of labels is illustrated for pitch accents, phrasal tones and break indices 29
2.2 Summary of previous work on pitch accent and boundary tone detection (coarse mapping). Level denotes the orthographic level (word or syllable) at which the experiments were performed. The results of Hasegawa-Johnson et al. and our work are directly comparable as the experiments are performed on an identical dataset 30
2.3 Summary of previous work on break index detection (coarse mapping). Detection is performed at word-level for all experiments 33
2.4 Lexical, syntactic and acoustic features used in the experiments. The acoustic features were obtained over 10 ms frame intervals 38
2.5 Illustration of the supertags generated for a sample utterance in the BU corpus. Each sub-tree in the table corresponds to one supertag. 38
2.6 Statistics of Boston University Radio News and Boston Directions corpora used in experiments 41
2.7 Baseline classification results of pitch accents and boundary tones (in %) using Festival and the AT&T Natural Voices speech synthesizer 42
2.8 Classification results (%) of pitch accents and boundary tones for different syntactic representations. Classifiers with cardinality V=2 learned either accent or btone classification, classifiers with cardinality V=4 classified accent and btone simultaneously. The variable (k) controlling the length of the local context was set to k=3. 44
2.9 Classification results of pitch accents and boundary tones (in %) with acoustics only, syntax only and acoustics+syntax using both our models. The syntax based results from our maximum entropy syntactic-prosodic classifier are presented again to view the results cohesively. In the table A = Acoustics, S = Syntax 47
2.10 Classification results of break indices (in %) with syntax only, acoustics only and acoustics+syntax using the maximum entropy classifier. In the table A = Acoustics, S = Syntax 52
3.1 Illustration of POS tags and supertags generated for a sample utterance 66
3.2 Dialog act tagging accuracies (in %) for lexical and syntactic cues obtained from true transcripts with the maximum entropy model. Only the current utterance was used to derive the n-gram features. 69
3.3 Examples of misclassifications due to lexical ambiguity from the Switchboard-DAMSL corpus 70
3.4 Acoustic correlates used in the experiment, organized by duration, pitch and energy categories. 75
3.5 Accuracies (%) of DA classification experiments on the Switchboard-DAMSL corpus for different prosodic representations 76
3.6 Dialog act tagging accuracies (in %) using lexical and prosodic cues for true and recognized transcripts with the maximum entropy model. Only the current utterance was used to derive the n-gram features. 77
3.7 Dialog act tagging accuracies (in %) using preceding context. current utterance refers to lexical+syntactic+prosodic cues of the current transcribed utterance. prev utterance refers to the lexical+syntactic+prosodic cues from the previous utterance. 79
3.8 Dialog act tagging accuracies (in %) using preceding context. Recognized utterance refers to lexical+syntactic+prosodic cues of the current ASR hypothesized utterance. prev utterance refers to the lexical+syntactic+prosodic cues from the preceding ASR hypotheses. 80
3.9 Dialog act tagging accuracies (in %) using preceding context. current utterance refers to lexical+syntactic+prosodic cues of the current transcribed utterance. next utterance refers to the lexical+syntactic+prosodic cues from the succeeding utterance and recognized utterance refers to the utterance hypothesized by ASR. 82
4.1 Pitch accent detection accuracies for various cues on the prosodically labeled Switchboard corpus. 93
4.2 Statistics of the training and test data used in the experiments. 93
4.3 Evaluation metrics for the two corpora used in experiments (all scores are in %) 98
5.1 Dialog act tagging accuracies for various cues on the SWBD-DAMSL corpus. 104
5.2 Statistics of the training and test data used in the experiments. 108
5.3 F-measure and BLEU scores for the two different translation schemes with and without use of dialog act tags. 109
5.4 BLEU scores (%) per DA tag for the phrase-based translation scheme with and without use of dialog act tags for the Farsi-English corpus 112

Abstract

While machine processing of speech has advanced significantly, it is still insufficient in capturing and utilizing rich contextual information, such as prosodic prominence, phrasing and discourse information, that is conveyed beyond the words. The work presented in this dissertation focuses on automatic enrichment of spoken language processing through the representation and modeling of suprasegmental events such as prosody and discourse context. First, we demonstrate the suitability of maximum entropy models for the automatic recognition of these events from speech and text. The techniques that we have developed achieve state-of-the-art performance. Second, we introduce a novel framework for enriching speech translation with rich information. Our approach of incorporating rich information in speech translation is motivated by the fact that it is important to capture and convey not only what is being communicated (the words) but how something is being communicated (the context). We show that promising improvements in translation quality can be obtained by exploiting rich annotations in conventional speech translation approaches.

Chapter 1: Introduction

Spoken language processing involves the analysis, representation and modeling of lexical, syntactic and acoustic events in speech at various temporal scales. The acoustic theory of speech contends that a given language can be constructed from a basic set of identifiable atomic units (phonemes). These phonemes are designed to be maximally distinct in the acoustic domain, with phones being their acoustic realization. Words are realized as a sequence of phones and in turn are used together to form meaningful sentences or utterances. This complex process of producing speech is influenced both at the segmental and suprasegmental level. Stress, rhythm and intonation are phonological realities that manifest themselves at the suprasegmental level, and prosody is the general term used to describe them.
These suprasegmental events confer prosodic prominence to phonologically larger units (syllables, words and utterances). At the conceptual level, humans typically produce speech based on certain inten- tions (or speech acts [110]). In conversational dialog (two or more participants), a single speaker has temporary control of the dialog and produces one or more utterances that conveys his/her intention. The intentions are reflected in many aspects of its linguistic realization such as use of cue phrases, as well as other lexical, syntactic and prosodic factors. Reliable representation and automatic recognition of these speech acts in addi- tion to prosodic prominence offers valuable structural information that can be exploited in several spoken language applications. Our objective in this thesis is to derive meaningful representations of suprasegmental events such as prosody and discourse context followed by subsequent modeling and 1 automatic detection. We believe that the accurate automatic detection of these events can offer rich information that is complementary to the acoustic and lexical information typically available. In this chapter, we lay the groundwork for the rest of the thesis by providing a brief introduction to suprasegmental events utlilized in this work . We also provide a synopsis of our overall contribution in this thesis proposal. 1.1 Prosodic attributes: prominence and phrasing Prosody is generally used to describe aspects of a spoken utterance which are not adequately explained by segmental acoustic correlates of sound units (phones). The prosodic information associated with a unit of speech, say, syllable, word, phrase or clause, influences all the segments of the unit in an utterance. In this sense they are also referred to as suprasegmentals [68] that transcend the properties of local phonetic context. One of the main functions of prosody is to endow the rich spatio-temporal organization for enabling spoken language communication. Prosody encoded in the form of intonation, rhythm and lexical stress patterns of spoken language, conveys linguistic and paralinguistic information such as, emphasis, intent, attitude and emotion of a speaker. Prosody is also used by speakers to provide cues to the listener and aid in the appropriate interpretation of their speech. This facili- tates a method to convey the intent of the speaker through meaningful chunking or phras- ing of the sentence, and is typically achieved by breaking long sentences into smaller prosodic phrases. These two different prosodic attributes are referred to as prominence and phrasing. Prosody in spoken language can be characterized through acoustic features, syn- tactic features, or both. Acoustic correlates of duration, intensity and pitch, such as syllable nuclei duration, short time energy and fundamental frequency (f0) are some 2 of the acoustic features that are perceived to confer prosodic prominence or stress in English. Lexical and syntactic features such as parts-of-speech, syllable nuclei identity, syllable stress of neighboring words have also been shown to exhibit high degree of correlation with prominence. Humans realize phrasing acoustically by pausing after a major prosodic phrase, accentuating the final syllable in a phrase, and/or by lengthen- ing the final syllable nuclei before a phrase boundary. 
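To make these acoustic correlates concrete, the sketch below computes a few word-level duration, pause, pitch and energy statistics from a time-aligned transcript and frame-level f0/energy tracks. It is only an illustrative sketch: the Word structure, the 10 ms frame rate and the convention that unvoiced frames carry f0 = 0 are assumptions made for this example, not the feature extraction pipeline used later in this thesis.

```python
from dataclasses import dataclass
import numpy as np

FRAME_RATE = 100  # assumed 10 ms hop for the f0 and energy tracks

@dataclass
class Word:
    text: str
    start: float  # seconds, e.g. from a forced alignment (hypothetical)
    end: float

def word_prosodic_features(word, next_word, f0, energy):
    """Duration, pause, pitch and energy statistics for a single word.

    f0 and energy are frame-level arrays; unvoiced frames are assumed to
    carry f0 = 0 and are excluded from the pitch statistics.
    """
    s, e = int(word.start * FRAME_RATE), int(word.end * FRAME_RATE)
    f0_seg = f0[s:e]
    voiced = f0_seg[f0_seg > 0]
    en_seg = energy[s:e]

    feats = {
        "duration": word.end - word.start,
        # pause before the following word, a strong cue to a phrase break
        "pause_after": max(0.0, next_word.start - word.end) if next_word else 0.0,
        "energy_mean": float(np.mean(en_seg)) if len(en_seg) else 0.0,
        "energy_max": float(np.max(en_seg)) if len(en_seg) else 0.0,
    }
    if len(voiced) >= 2:
        feats["f0_mean"] = float(np.mean(voiced))
        feats["f0_range"] = float(np.max(voiced) - np.min(voiced))
        # crude f0 slope over the voiced frames of the word
        feats["f0_slope"] = float(np.polyfit(np.arange(len(voiced)), voiced, 1)[0])
    return feats
```

Features of this kind, suitably normalized per speaker, are what is meant by raw acoustic correlates of prominence and phrasing in the remainder of this chapter.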
While some studies conjecture that prosodic phrase breaks typically coincide with syntactic boundaries [63], others have posited that the prosodic phrase structure is not isomorphic to the syntactic struc- ture [118, 69]. Incorporating prosodic information can be beneficial in speech applications such as, text-to-speech synthesis, speech recognition, natural language understanding, dialog act detection and even speech-to-speech translation. Accounting for the correct prosodic structure is essential in text-to-speech synthesis to produce natural sounding speech with appropriate pauses, intonation and duration. Speech understanding applications can benefit from being able to interpret the recognized utterance through the placement of correct prosodic phrasing and prominence. Speech-to-speech translation systems can also greatly benefit from the marking of prosodic phrase boundaries, for e.g., providing this information could directly help in building better phrase-based statistical machine translation systems. The integration of prosody in these applications is preempted by two main requirements: 1. A suitable and appropriate representation of prosody (e.g., categorical or continu- ous) 2. Algorithms to automatically detect and seamlessly integrate the detected prosodic structure in speech applications 3 1.1.1 Representations of prosody Prosody is highly dependent on individual speaker style, gender, dialect and a variety of other phonological factors. Non-uniform acoustic realizations of prosody are charac- terized by distinct intonation patterns and prosodic constituents using either, symbolic or parametric prosodic labeling standards like, Tones and Break Indices (ToBI) [117], TILT intonational model [123], Fujisaki model [36], Intonational Variation in English (IViE) [38], International Transcription System for Intonation (INTSINT) [54], etc. These prosodic labeling standards provide a common framework for characterizing prosody and hence facilitate development of tractable algorithms and modeling frame- works for automatic detection and subsequent integration of prosody in various speech applications. In general, prosodic representation can be categorized into three broad categories: • Discrete categorical representations of prosody • Raw acoustic correlates of pitch, intensity and duration, or transformations • Parametric or sequential representation of the prosodic contour 1.1.1.1 Categorical descriptions of prosody Prosody is highly dependent on speaker style, gender, dialect and various phonologi- cal factors. Further, the suprasegmental nature of prosody poses difficulty in modeling the relationship between segmental acoustic correlates, orthographic information and syntactic structure, all of which lend prosodic prominence within an utterance. Rep- resentation of prosodic events through categorical or symbolic labels (akin to the IPA system for segmental transcription) can offer a standardized framework which in turn can encourage quantitative computational modeling from large amounts of prosodically transcribed data. However, such a transcription scheme must be designed carefully with 4 expert linguistic knowledge to ensure reliability, adequate coverage and easy integration in speech applications. The Tones and Break Indices (ToBI) [117] annotation scheme is a popular scheme for categorical representation of prosodic events (both prominence and phrasing). 
The gross categorical descriptions within the ToBI framework offer a level of uncertainty in the human annotation to be incorporated into the labeling scheme and hence provide some generalization, considering that the prosodic structure is highly speaker dependent. They also provide more general-purpose description of prosodic events encompassing acoustic correlates of pitch, duration and energy compared to some prosodic standards that exclusively model the pitch contour. Furthermore, the availability of large prosodi- cally labeled corpora with manual ToBI annotations, such as the Boston University (BU) Radio Speech Corpus [86] and Boston Directions Corpus (BDC) [50], offer a convenient and standardized avenue to design and evaluate automatic ToBI-based prosody labeling algorithms. However, ToBI is not language independent and thus requires expert human knowledge for the characterization of prosodic events in each language (e.g., Span- ish ToBI [13], Japanese ToBI [14]). Alternative categorical transcription systems for prosody include Intonational Variation in English (IViE) [38], International Transcrip- tion System for Intonation (INTSINT) [54], Rhythm and Pitch (RaP) [20]. Several linguistic theories have been proposed to represent phrasing – the grouping of prosodic constituents [94, 132, 117]. In the simplest representation, prosodic phrasing constituents can be grouped into word, minor phrase, major phrase, utterance [68]. The ToBI break index representation [117] uses indices between 0 and 4 to denote the perceived disjuncture between each pair of words, while the perceptual labeling system described in [94] represents a superset of prosodic constituents by using labels between 0 and 6. In general, these representations are mediated by rhythmic and segmental analysis in the orthographic tier and associate each word with an appropriate index. 5 Categorical descriptions of prosody have been demonstrated to be useful in text- to-speech synthesis [22], spoken language understanding [131], sentence disambigua- tion [128, 61], dialog act tagging [16], and speech translation [83] 1.1.1.2 ToBI annotation scheme The Tones and Break Indices (ToBI) [117] framework consists of four parallel tiers that reflect the multiple components of prosody. Each tier consists of discrete categorical symbols that represent prosodic events belonging to that particular tierA concise sum- mary of the four parallel tiers is presented below. The reader is referred to [117] for a more comprehensive description of the annotation scheme. Figure 1.1: Illustration of the four parallel tiers of ToBI for a sample utterance (courtesy: ToBI tutorial) • Orthographic tier: The orthographic tier contains the transcription of the ortho- graphic words of the spoken utterance. 6 • Tone tier: Two types of tones are marked in the tonal tier: pitch events associated with intonational boundaries, phrasal tones or boundary tones and pitch events associated with accented syllables, pitch accents. The basic tone levels are high (H) and low (L), and are defined based on the relative value of the fundamental frequency in the local pitch range. There are a total of five pitch accents that lend prominence to the associated word :{H*, L*, L*+H, L+H*, H+!H*}. The phrasal tones are divided in two coarse categories, weak intermediate phrase boundaries {L-, H-} and full intonational phrase boundaries{L-L%, L-H%, H-H%, H-L%} that group together semantic units in the utterance. 
• Break index tier: The break-index tier marks the perceived degree of separation between lexical items (words) in the utterance and is an indicator of prosodic phrase structure. Break indices range in value from 0 through 4, with 0 indicating no separation, or cliticization, and 4 indicating a full pause, such as at a sentence boundary. This tier is strongly correlated with phrase tone markings on the tone tier.

• Miscellaneous tier: This tier is used to annotate any other information relevant to the utterance that is not covered by the other tiers. This may include annotation of non-speech events such as disfluencies, laughter, etc.

1.1.1.3 Raw acoustic correlates of prosody

Prosodic representations are primarily driven by the nature and architecture of the speech application they are intended to be integrated into. While detailed categorical representations are suitable for text-to-speech synthesis, speech and natural language understanding tasks, simpler prosodic representations in terms of raw or speaker-normalized acoustic correlates of prosody have been shown to be beneficial in many speech applications [119, 65, 43, 115, 51]. As long as the acoustic correlates are reliably extracted under identical conditions during training and testing, there is no necessity for an intermediate symbolic representation, though it may provide additional discriminative information if available. The raw acoustic correlates, or simple transformations, of pitch, intensity and duration, extracted respectively from the fundamental frequency (f0) contour, the energy contour and segmental durations derived from automatic alignment, have been demonstrated to be beneficial in disfluency detection [103, 115, 70], named entity detection [104, 65], topic segmentation [51, 114], sentence boundary detection [71] and dialog act detection [112].

Table 1.1: Acoustic correlates of prosody used in speech applications. Both raw as well as normalized features are typically used.
Pitch: f0 onset and offset; f0 range, mean, slope, min, max; difference in f0 statistics between neighboring words or utterances
Energy: rms energy mean, slope, min, max; SNR
Duration: pause duration, vowel duration, rhyme duration; duration of voiced segments

Typically, these raw features are derived heuristically and tuned to the particular application. Feature selection schemes such as the scheme described in [114] are used to select the most discriminative features for the classification task. The derived features are also normalized through a variety of speaker- and utterance-specific normalization techniques to account for the variability across speakers. Some of the acoustic correlates of prosody used in these tasks are shown in Table 1.1. The major drawback of such a representation is that it is extremely lossy and is contrary to the suprasegmental theory of prosody, which advocates a sequential or continuous model of acoustic correlates over longer durations.

1.1.1.4 Parametric representation of prosody

Parametric representation schemes for prosody are data-driven and aim only at providing a configurational description of the macroscopic pitch contour, in such a way that the representations can be automatically derived from the acoustics of an utterance, and the acoustics can be synthesized from the representation.
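As a toy illustration of what such an analysis-resynthesis description can look like, the sketch below fits a coarse piecewise-linear stylization to an f0 contour over a fixed number of uniform segments and regenerates an approximate contour from the fitted parameters. The uniform segmentation and the segment count are arbitrary simplifications chosen for illustration; none of the specific parametric models discussed in this section works exactly this way.

```python
import numpy as np

def stylize_f0(f0, n_segments=6):
    """Piecewise-linear stylization of an f0 contour.

    Splits the contour into n_segments equal-length pieces, fits a line to
    each piece by least squares, and returns the per-segment parameters
    from which an approximate contour can be regenerated.
    """
    f0 = np.asarray(f0, dtype=float)
    bounds = np.linspace(0, len(f0), n_segments + 1, dtype=int)
    params = []  # (start_frame, end_frame, slope, intercept) per segment
    for s, e in zip(bounds[:-1], bounds[1:]):
        if e - s < 2:
            continue
        x = np.arange(s, e)
        slope, intercept = np.polyfit(x, f0[s:e], 1)
        params.append((s, e, slope, intercept))
    return params

def resynthesize(params, n_frames):
    """Rebuild an approximate contour from the stylization parameters."""
    contour = np.zeros(n_frames)
    for s, e, slope, intercept in params:
        contour[s:e] = slope * np.arange(s, e) + intercept
    return contour
```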
While the linguistic interpretation provided by these models is debatable due to their non- phonological basis, they have nevertheless demonstrated reasonable robustness in mod- eling the intonation contour. Further, the numerical parametric representation makes such a scheme more or less language independent compared to categorical descriptions like ToBI that are language dependent. Parametric schemes typically obtain a set of intonational contour templates by clustering and use the derived templates as codebook entries to represent prosody through an abstraction of the natural contour. Some of the popular parametric representation schemes for prosody are f0 stylization [55], Fujisaki model [36], Rise/Fall/Connection (RFC) model [125] and Tilt [123]. Parametric representation of prosody has been mainly used in text-to-speech synthe- sis [126, 35, 109, 32] as they are naturally well suited for an analysis-synthesis paradigm. Recently, Ag¨ uero et. al [2] have proposed parametric representation of the f0 contour in a speech-to-speech translation framework for transferring prosody from the source to target language. We provide a brief description of the tilt intonation model in the following section. 1.1.1.5 Tilt Intonation model Tilt [123] is a phonetic model of intonation that represents the intonational contour as a sequence of continuously parameterized events. The model is an extension of the RFC model [125] in that the parameters obtained through the RFC model are transformed into a set of Tilt parameters. The model is well suited for text-to-speech applications 9 as it offers both an analysis and synthesis scheme. The model considers two types of intonational events: (i) pitch accents (denoted by the letter a) and (ii) boundary tones (denoted by the letter b). The choice of the these intonational events or combination of both is considered to produce different global intonational tunes which can indicate questions, statements, emotion, etc. to the listener. The intonational events themselves need to be obtained first either through human annotation or an automatic event labeler (for e.g. HMM-based event detection). Given, the segmentation of the intonational events in the pitch contour, the model parameterizes the events through a set of rise/fall amplitude and duration parameters which are finally converted into tilt parameters. A diagrammatic illustration of the tilt intonation model is shown in Figure 1.2. Figure 1.2: Illustration of the Tilt intonational model (From Taylor [123]) The advantage of such a representation of prosody is that it is language independent and encodes prosodic information that is significant to the linguistic interpretation of the utterance only. In other words, it excludes effects that are redundant, or events which affect the f0 contour but which are not important in the interpretation. The major drawback of such a parametric approach is that it considers only the intonational contour and not the associated energy and durational aspects of an utterance. Moreover, the initial segmentation obtained through automatic methods are still not accurate enough to completely exploit the technique. 10 1.1.2 Algorithms for automatic prosody labeling The above discussed prosodic representation schemes provide an avenue for the devel- opment of algorithms that can automatically detect the prosodic events from the acoustic signal. 
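Before surveying these algorithms, the distinction between the two model families treated in the following subsections can be made concrete with a small sketch: a generative Gaussian naive Bayes classifier and a discriminative logistic regression classifier (equivalent to a maximum entropy model) trained on the same hypothetical word-level features for pitch accent detection. The feature matrix and labels below are synthetic and purely illustrative, and scikit-learn is assumed to be available; this is not one of the systems surveyed in this chapter.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Hypothetical word-level features: [duration, f0 range, energy mean, pause after]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Synthetic labels: 1 = accented word, loosely tied to duration and f0 range
y = (X[:, 1] + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Generative: models P(X, Y) = P(Y) P(X | Y), then applies Bayes' rule
gen = GaussianNB().fit(X, y)

# Discriminative: models P(Y | X) directly
disc = LogisticRegression(max_iter=1000).fit(X, y)

print("generative     P(accent | x0) =", gen.predict_proba(X[:1])[0, 1])
print("discriminative P(accent | x0) =", disc.predict_proba(X[:1])[0, 1])
```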
While the raw acoustic correlates of prosody are directly used in classification schemes within a variety of speech applications, the categorical descriptions of prosody require careful design of algorithms that can learn the discrete representations from labeled data through either supervised or unsupervised machine learning techniques. Automatic prosody labeling has been achieved through various machine learning techniques, such as decision trees [48, 131], rule-based systems [111], bagging and boosting on decision trees [121], hidden Markov models [28], coupled HMMs [4], neural networks [46], maximum entropy models [21] and conditional random fields [40]. These algorithms typically exploit lexical, syntactic and acoustic features in a supervised learning scenario to predict prosodic constituents characterized through one of the prosodic representation standards. In general, automatic prosody labeling algorithms fall under two categories: generative models and discriminative models.

1.1.2.1 Generative models

In generative models, a system's input features and output classes are represented homogeneously through a joint probability distribution. Once the joint distribution has been obtained, standard marginalization or conditioning operations through Bayes' rule are applied for use in classification and regression problems. Typically, generative models use the span of exponential family distributions and mixtures of the exponential family. Popular models include Gaussians, naive Bayes, mixtures of Gaussians, mixtures of experts, hidden Markov models, Bayesian networks and Markov random fields, to name a few. More formally, given an observation feature vector X and a discrete categorical label Y, one typically has access to observations T = \{(x_i, y_i) : i = 1, \ldots, n\}, with x_i taking values in \mathbb{R}^N and y_i in a discrete alphabet denoted by A_y. Generative classifiers learn a model of the joint probability P(X, Y) from the training data T and make their predictions by applying Bayes' rule to calculate P(Y|X), and then picking the most likely label \hat{y}:

\hat{y} = \arg\max_{y \in A_y} P(Y|X) = \arg\max_{y \in A_y} \frac{P(X, Y)}{P(X)} \approx \arg\max_{y \in A_y} P(Y)\, P(X|Y) \quad (1.1)

Classical approaches to learning generative models include Bayesian inference, maximum a posteriori and maximum likelihood estimation.

Table 1.2: Automatic prosody labeling algorithms that have used generative and discriminative models
Generative models:
Ostendorf and Wightman [131]: Markov model/CART
Ross and Ostendorf [108]: Markov model/CART
Taylor and Black [17]: HMM
Veilleux and Ostendorf [128]: Markov model/CART
Sun and Applebaum [122]: Markov model/CART
Conkie et al. [28]: HMM
Ananthakrishnan et al. [4]: Coupled HMM
Hasegawa-Johnson et al. [26]: GMM/Neural networks
Discriminative models:
Hirschberg [48]: CART
Wang and Hirschberg [130]: CART
Muller and Hoffman [78]: Neural network
Sun [121]: Ensemble learning (boosting and bagging)
Gregory and Altun [40]: Conditional random fields
Rangarajan et al. [95, 96, 100]: Maximum entropy model

Generative classifiers have advantages such as straightforward Expectation Maximization (EM) methods for handling missing data, and often demonstrate better performance when the training sets are small. They are also versatile, as one can explicitly model the relationships between variables, independencies, prior distributions, etc. However, generative models can be wasteful and non-robust for a variety of applications and they need to be used with caution, as the true distribution almost never
However, generative models could be wasteful and non-robust for a variety of applications and they need to be used with caution as the true distribution almost never 12 coincides with the distribution constructed from the training data. To prevent sparse- ness issues they are also required to make certain conditional independence assumptions which may not be necessarily true. Automatic prosody labeling using generative models has been presented in [131, 28, 4, 26, 108, 17, 128, 122]. A snapshot of generative models used for automatic prosody labeling is presented in Table 1.2. 1.1.2.2 Discriminative models Discriminative classifiers model the posterior P(Y|X) directly, or learn a direct map from the inputs X to the class labels Y (f: X −→ Y ). Even though from a purist per- spective, using conditional distributions is not strictly discriminative, they can generally be considered under the broad category of discriminative models. Strict discrimina- tive approaches only learn the mapping from the inputs to the outputs without making any explicit attempt to model the underlying distributions of features and output classes. Popular discriminative models include logistic regression, Classification and Regression trees (CART), Gaussian processes, support vector machines (SVM), maximum entropy models, boosting algorithms and neural networks. Discriminative models are in general robust to structural assumptions and concentrates on the computational resources for a given task, thus providing better performance. However, discriminative models usually lack flexibility and methods for incorporating prior knowledge. Furthermore, they may not be easy to train as they require simultaneous use of data from all classes. Automatic prosody labeling through discriminative models has been proposed in [48, 130, 78, 121, 40, 95] (see Table 1.2). 13 1.2 Discourse context In both human-to-human and human-computer speech communication, identifying whether an utterance is a statement, question, greeting, etc., is integral to producing, sustaining and understanding natural dialogs. Discourse structure is typically described through either a hierarchical [91, 116] or surface form representation [59, 24]. The sur- face level representation of discourse context has been generally preferred in spoken language applications due to the easy integration in spoken language applications. Sur- face level discourse context in dialog can be approximated by assigning labels to the communicative acts associated with each of them. For example, one can design a tag set that classifies utterances into statements and questions. The tag set is typically chosen according to a combination of various pragmatic, semantic and syntactic criteria. While such a surface representation of discourse context does not offer understanding of dialog in a deep sense, it can be beneficial in many spoken language processing applications such as automatic speech recognition, spoken language understanding, dialog modeling and speech translation. 1.2.1 Representation of discourse context through dialog acts Dialog acts [6] or speech acts [110] are labels that are used to represent surface level communicative acts in a conversation or dialog. While they may not provide a deep understanding of discourse structure, dialog acts (DAs) can serve as intermediate repre- sentations that can be useful in several speech and language processing applications. 
For example, in human-machine dialogs, constraining automatic speech recognition hypotheses by using a model of the DAs likely to be expected at a dialog turn has been shown to improve recognition accuracy [119, 124]. Dialog acts have also been found to be useful in spoken language understanding [112] and, more recently, in the annotation of archived conversations and meetings [5, 139], which in turn can help improve speech summarization [79] and retrieval. Incorporating DAs in speech-to-speech (S2S) translation [67, 106] has also aided in the resolution of ambiguous communication. Conceptually, the process of designing an automatic DA prediction system can be seen as comprising three steps:

• Initializing a dialog act vocabulary for the task and procuring supervised data
• Identifying the lexical, syntactic and acoustic cues that are most useful in distinguishing among the various DAs
• Combining the multiple cues in an algorithmic framework to implement their accurate recognition

One of the popular schemes to annotate dialog acts in discourse is the Dialog Act Markup in Several Layers (DAMSL) annotation scheme [29]. The scheme defines a set of primitive communicative actions that can be used to analyze dialogs. The dialog acts represent shallow discourse structure and are defined at the intention or speech act level. The Switchboard-DAMSL [59] annotation scheme is an augmentation of the DAMSL tag set that extends it to account for characteristics of conversational speech. Table 1.3 shows some examples of utterances that are tagged with dialog acts from the Switchboard-DAMSL tag set.

Table 1.3: Examples of dialog act tags in the Switchboard-DAMSL annotation scheme.
"Me, I'm in the legal department.": Statement-non-opinion
"Uh-huh.": Acknowledge (Backchannel)
"I think it's great": Statement-opinion
"Do you have to have any special training?": Yes-No-Question
"Well, how old are you?": Wh-Question
"That's exactly it.": Agree/Accept

1.2.2 Algorithms for automatic dialog act tagging

Automatic interpretation of dialog acts has been addressed through two main approaches: the AI-style plan-inferential interpretation of dialog acts, designed through plan-inference heuristics [3], and the cue-based interpretation that uses knowledge sources such as lexical [60, 119, 129], syntactic [7], prosodic [73, 112, 124] and discourse-structure [58] cues. Even though the plan-inference method can theoretically account for all variations in discourse, it is time-consuming in terms of manual design and computational overhead. On the contrary, data-driven cue-based approaches are computationally friendly and offer a reasonably robust framework to model and detect dialog acts automatically. Automatic dialog act tagging is typically performed by using either generative or discriminative models. Table 1.4 lists some of the generative and discriminative modeling frameworks that have been used for DA tagging. These schemes typically exploit linguistic features such as cue words and lexical, syntactic and prosodic markers. While the discriminative models directly learn the conditional density of the classes given the cues, generative models learn the joint distribution of the cues and classes. A classic example of a generative model for DA tagging is an HMM-like representation [119], where the states represent DA tags and the observations represent the words of the utterance. Such a scheme is similar to that used in speech recognition.
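A minimal sketch of that HMM-style formulation is given below: hidden states are DA tags, a tag bigram plays the role of the discourse language model, and each utterance contributes a per-tag likelihood score standing in for a real lexical/prosodic utterance model. The tag set and all probabilities here are invented for illustration and do not correspond to the Switchboard-DAMSL systems cited above.

```python
import math

TAGS = ["statement", "question", "backchannel"]        # toy tag set
# Toy discourse "language model": P(tag_t | tag_{t-1})
TRANS = {
    "statement":   {"statement": 0.6, "question": 0.2, "backchannel": 0.2},
    "question":    {"statement": 0.7, "question": 0.1, "backchannel": 0.2},
    "backchannel": {"statement": 0.6, "question": 0.3, "backchannel": 0.1},
}
PRIOR = {"statement": 0.6, "question": 0.2, "backchannel": 0.2}

def viterbi(utterance_likelihoods):
    """Decode the best DA sequence from per-utterance P(cues | tag) scores."""
    # initialise with the first utterance
    scores = {t: math.log(PRIOR[t]) + math.log(utterance_likelihoods[0][t]) for t in TAGS}
    backptrs = []
    for obs in utterance_likelihoods[1:]:
        new_scores, ptrs = {}, {}
        for t in TAGS:
            # best previous tag under the tag bigram model
            prev, best = max(((p, scores[p] + math.log(TRANS[p][t])) for p in TAGS),
                             key=lambda pair: pair[1])
            new_scores[t] = best + math.log(obs[t])
            ptrs[t] = prev
        scores = new_scores
        backptrs.append(ptrs)
    # trace back the best path
    tag = max(scores, key=scores.get)
    path = [tag]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

# Example: made-up likelihoods of each utterance's cues under each tag
print(viterbi([
    {"statement": 0.5, "question": 0.1, "backchannel": 0.4},
    {"statement": 0.2, "question": 0.7, "backchannel": 0.1},
    {"statement": 0.6, "question": 0.1, "backchannel": 0.3},
]))
```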
Instead of using language models over words, a discourse language model over the DA tags is used. 16 Table 1.4: Generative and discriminative models that have been used for automatic dia- log act tagging Model Authors Algorithm Jurafsky et. al [59] HMM Taylor et. al [124] HMM Generative Stolcke et. al [119] Markov model Ji and Bilmes [56] Graphical models Ries [107] Neural network Fernandez and Picard [33] Support vector machines Discriminative Wu et. al [134] Fuzzy systems Rangarajan et. al [97, 102, 98] Maximum entropy model 1.3 Contributions of thesis We demonstrate that current spoken language processing techniques can greatly benefit from improved representation and modeling of events at different temporal resolution. The contributions of this proposal can be categorized into two tiers, a suprasegmen- tal tier that focuses on the representation and modeling of prosodic prominence and phrasing at a global temporal scale, and the other, a segmental tier that focuses on the representation and modeling of raw articulatory measurements at a local temporal scale. We believe that both are critical to improving the state-of-the-art in spoken language processing and the rest of this thesis proposal is aimed at bringing this to the fore with our contribution in each of these tiers. Our proposed work for the future then aims at integrating these contributions at multiple tiers. The following points reflect the list of specific contributions proposed in this work. Automatic detection of prosody: 17 • We propose the use of novel syntactic features for prosody labeling in the form of supertags which represent dependency analysis of an utterance and its predicate- argument structure, akin to a shallow syntactic parse. We demonstrate that inclu- sion of supertag features can further exploit the prosody-syntax relationship com- pared to that offered by using parts-of-speech tags alone. • We also propose a novel representation scheme for the modeling of acoustic- prosodic features such as energy and pitch. We model the continuous valued feature stream through a quantized feature representation that is integrated in the maximum entropy classification scheme. Such a model of quantized continuous features is similar to representing the acoustic-prosodic features with a piecewise linear fit as done in parametric approaches to modeling intonation. • Our proposed maximum entropy discriminative model [95, 96, 100] outperforms previous work. On the BU corpus, with syntactic information alone we achieve pitch accent and boundary tone accuracy of 85.2% and 91.5%. Further, the cou- pled model with both acoustic and syntactic information results in accuracies of 86.0% and 93.1% respectively, compared to 84.2% and 93.0% accuracies reported in [26] (identical dataset was used for this comparison). On the BDC corpus, we achieve pitch accent and boundary tone accuracies of 79.8% and 90.3%. • We achieve a break index accuracy of 83.95% and 87.18% on the BU and BDC corpora using lexical and syntactic information alone. Our combined maximum entropy acoustic-prosodic model achieves a break index accuracy of 84.01% and 87.58%, respectively. These results are significantly better than previously reported results on automatic phrase structure detection. • The inter-annotator agreement for pitch accent, boundary tone and break index labeling on the BU corpus [86] are 81-84% (on 487 words), 93% (on 207 words) 18 and 95% (on 989 words) respectively. 
Our achieved accuracies of 80-86%, 90- 93.1% and 84-87% for the three prosody detection tasks are quite close to the inter-labeler agreements. Automatic classification of discourse context: • We demonstrate that extracting n-gram features from the quantized prosodic con- tour can exploit intonational characteristics of dialog acts better than conventional approaches that use coarse representation of the prosodic contour through sum- mative statistics of the prosodic contour. • We use only static features and approximate the previous dialog act tags in terms of lexical, syntactic and prosodic information extracted from previous utterances. Our approach is different from traditional DA systems that use the entire conversa- tion for offline dialog act decoding with the aid of a discourse model. In contrast, our framework facilitates online decoding. • The proposed n-gram feature extraction for exploiting prosody results in an abso- lute improvement of 8.7% on the Switchboard-DAMSL data set over the use of most other widely used representations of acoustic correlates of prosody. • The proposed use of dialog context from lexical, syntactic and prosodic cues of previous utterances performs well in comparison with previous work [119] that used the entire conversation for offline optimal decoding. • The proposed techniques for automatically recognizing dialog acts achieves state- of-the-art performance [97, 102, 98]. Enriching spoken language translation with prosody and dialog acts • We propose a novel framework for enriching speech-to-speech translation. Our framework focuses on the transfer of context in addition to content. We capture 19 the context through the accurate detection of prosody and dialog acts and exploit these rich representations in translation [99, 101]. • Conventional S2S systems rely on extracting prosody dependent cues from hypothesized (possibly erroneous) translation output using only words and syn- tax. In contrast, we propose the use of factored translation models to integrate the assignment and transfer of pitch accents (tonal prominence) during transla- tion [101]. • We demonstrate the integration of the DA tags in two different statistical trans- lation frameworks, phrase-based translation and a bag-of-words lexical choice model. In addition to producing interpretable DA annotated target language trans- lations, our framework also offers improvements in terms of automatic evaluation metrics such as lexical selection accuracy and BLEU score [99]. • The proposed techniques for enriching statistical translation with prosody and dialog acts are novel and state-of-the-art. 1.4 Outline of the thesis The dissertation is organized into two parts. Part 1.4 addresses the representation, mod- eling and detection of suprasegmental phenomena for spoken language processing and part 3.11 addresses the application of rich representations in spoken language transla- tion. Part 1.4 consists of two chapters, each of them with the structure of a self-contained paper. In chapter 2, we address the categorical representation of prosody. We present a maximum entropy discriminative classifier that performs automatic prominence and phrasing detection within the ToBI annotation framework by exploiting lexical, syntac- tic and acoustic information. Chapter 3 addresses the automatic recognition of discourse 20 context in dialogs through dialog acts. Part 3.11 focuses on the integration of rich rep- resentations described in part 1.4 in spoken language translation. 
It comprises of two chapters, each with the structure of a self-contained paper. Chapter 4 describes factored translation models for enriching speech translation with prosodic information and chap- ter 5 focuses on enriching spoken language translation with dialog act tags. Finally, chapter 6 provides summary and conclusions of the work described in this dissertation. 21 Suprasegmental events: Representation and Modeling 22 In the first part of this thesis, we address the representation and modeling of suprasegmental events such as prosody and discourse structure. We propose novel algo- rithms for automatically detecting these rich structures from speech and language infor- mation. We investigate suitable representations of suprasegmental events that can be reliably learned and automatically detected in various spoken language processing applications. We are particularly interested in automatically recognizing prosodic attributes and dis- course context of utterances. The main challenge in reliably learning rich representa- tions is that their structure is reflected in many aspects of their linguistic realization, including lexical, morpho-syntactic and intonational features. The lack of consensus on a generative model leads to difficulty in automatically detecting these events. Fur- ther, jointly exploiting the various cues in a structured learning approach has been a challenging task. In this work, we propose the use of maximum entropy models that have recently been shown to perform well in various natural language processing tasks. The fundamental difference of the approach presented here from other ongoing efforts in automatic recognition of rich representations is that our discriminative framework robustly combines lexical, syntactic and prosodic cues to estimate the conditional den- sity of the classes directly. The framework also integrates feature selection within the training procedure and is not hurt by strong independence assumptions unlike gener- ative models. We demonstrate the use of simple local contextual features for reliably inferring prosodic prominence, phrasing and discourse context from speech and text using the proposed scheme. Thus, the models proposed in this thesis are also favorable for online processing, enabling the recognition of rich representations alongside (or in lockstep) automatic speech recognition, speech translation and text-to-speech synthesis. 23 Chapter 2 Automatic detection of prosodic structure In this chapter we describe a maximum entropy based automatic prosody labeling frame- work that exploits both language and speech information. We apply the proposed frame- work to both prominence and phrase structure detection through categorical represen- tation of prosody within the ToBI annotation scheme. Our framework utilizes novel syntactic features in the form of supertags and a quantized acoustic-prosodic feature rep- resentation that is similar to linear parameterizations of the prosodic contour. The pro- posed model is trained discriminatively and is amenable to the addition of new features and also very robust in the selection of appropriate features for the task of prosody detec- tion. We evaluate our proposed algorithm on Boston University Radio News and Boston Directions corpus, two publicly available corpora intended for automatic prosody exper- iments. 
The proposed maximum entropy syntactic-prosodic model achieves accuracies of 85.2% and 92% for pitch accent and boundary tone labeling, respectively, on the Boston University Radio News corpus with lexical and syntactic information alone. The coupled acoustic-syntactic model achieves significantly improved pitch accent and boundary tone classification accuracies of 86.0% and 93.1% respectively. The phrase structure detection through prosodic break index labeling also performs well with the proposed framework with accuracies ranging from 84-87%. The reported results are significantly better than previously reported results for prominence and phrase structure detection. 24 2.1 Introduction Automatic prosody labeling has been achieved through a variety of machine learning techniques, such as decision trees [48, 131], rule-based systems [111], bagging and boosting on decision trees [121], hidden markov models [28], coupled HMMs [4], neural networks [46] and conditional random fields [40]. These algorithms typically exploit lexical, syntactic and acoustic features in a supervised learning scenario to pre- dict prosodic constituents characterized through one of the aforementioned prosodic representations. The interplay between acoustic, syntactic and lexical features in characterizing prosodic events has been successfully exploited in text-to-speech synthesis [22, 72], dia- log act modeling [124, 119], speech recognition [46] and speech understanding [131]. The procedure in which the lexical, syntactic and acoustic features are integrated plays a vital role in the overall robustness of automatic prosody detection. While generative models using HMMs typically perform a front-end acoustic-prosodic recognition and integrate syntactic information through back-off language models [46, 4], stand-alone classifiers use a concatenated feature vector combining the three sources of informa- tion [52, 40]. We believe that a discriminatively trained model that jointly exploits lexi- cal, syntactic and acoustic information would be the best suited for the task of prosody labeling. The chapter is organized as follows. In section 2.2 we describe the categorical ToBI representation of prosody and the mapping of ToBI labels into simpler classes for subsequent detection. We discuss related work in automatic prosody labeling in section 2.3 followed by a description of the proposed maximum entropy algorithm for prosody labeling in section 2.4. Section 2.5 describes the lexical, syntactic and acoustic- prosodic features used in our framework and section 2.6.1 describes the data used. We present results of pitch accent and boundary tone detection, and break index detection 25 in sections 2.7 and 2.8, respectively. We provide discussion of our results in section 2.9 and conclude in section 2.10 along with directions for future work. We present a brief synopsis of our contribution in the following section. 2.1.1 Contributions of this work We present a discriminative classification framework using maximum entropy modeling for automatic prosody detection. The proposed classification framework is applied to both prominence and phrase structure prediction, two important prosodic attributes that convey vital suprasegmental information beyond the orthographic transcription. The prominence and phrase structure prediction is carried out within the Tones and Breaks Indices (ToBI) framework designed for categorical prosody representation. 
We perform automatic pitch accent and boundary tone detection, and break index prediction, that characterize prominence and phrase structure, respectively, with the ToBI annotation scheme. The primary motivation for the proposed work is to exploit lexical, syntactic and acoustic-prosodic features in a discriminative modeling framework for prosody model- ing that can be easily integrated in a variety of speech applications. The following are some of the salient aspects of our work: 2.1.1.1 Syntactic features • We propose the use of novel syntactic features for prosody labeling in the form of supertags which represent dependency analysis of an utterance and its predicate- argument structure, akin to a shallow syntactic parse. We demonstrate that inclu- sion of supertag features can further exploit the prosody-syntax relationship com- pared to that offered by using parts-of-speech tags alone. 26 2.1.1.2 Acoustic features • We propose a novel representation scheme for the modeling of acoustic-prosodic features such as energy and pitch. We use n-gram features derived from the quantized continuous acoustic-prosodic sequence that is integrated in the maxi- mum entropy classification scheme. Such an n-gram feature representation of the prosodic contour is similar to representing the acoustic-prosodic features with a piecewise linear fit as done in parametric approaches to modeling intonation. 2.1.1.3 Modeling • We present a maximum entropy framework for prosody detection that jointly exploits lexical, syntactic and prosodic features. Maximum entropy modeling has been shown to be favorable for a variety of natural language processing tasks such as part-of-speech tagging, statistical machine translation, sentence chunking, etc. In this work we demonstrate the suitability of such a framework for automatic prosody detection. The proposed framework achieves state-of-the-art results in pitch accent, boundary tone and break index detection on the Boston University (BU) Radio News Corpus [86] and Boston Directions Corpus (BDC) [50], two publicly available read speech corpora with prosodic annotation. • Our framework for modeling prosodic attributes using lexical, syntactic and acoustic information is at the word level, as opposed to syllable level. Thus, the proposed automatic prosody labeler can be readily integrated in speech recog- nition, text-to-speech synthesis, speech translation and dialog modeling applica- tions. 27 2.2 Categorical representation of prosody Automatic detection of prosodic prominence and phrasing requires appropriate repre- sentation schemes that can characterize prosody in a standardized manner and hence facilitate design of algorithms that can exploit lexical, syntactic and acoustic features in detecting the derived prosodic representation. This has led to the development of many prosody annotation schemes that range from generic representation of prosody to exclu- sive categorization of certain prosodic events (refer to section 1.1.1 for more detailed description). We use categorical description of prosody within the ToBI framework in this work. We evaluate our automatic prosody algorithm on the Boston University Radio Speech Corpus and Boston Directions Corpus, both of which are hand annotated with ToBI labels. We perform both prominence and phrase structure detection that are char- acterized within the ToBI framework through the following parallel tiers: (i) a tone tier, and (ii) a break-index tier. 
2.2.1 Mapping ToBI labels to binary classes The Tones and Break Indices (ToBI) [117] framework consists of four parallel tiers that reflect the multiple components of prosody. Each tier consists of discrete cate- gorical symbols that represent prosodic events belonging to that particular tier 1 (see section 1.1.1.2). The detailed representation of prosodic events in the ToBI framework however, suf- fers from the drawback that all the prosodic events are not equally likely and hence a prosodically labeled corpus would consist of only a few instances of one event while comprising a majority of another. This in turn creates serious data sparsity problems for 1 On a variety of speaking styles, Pitrelli et al. [93] have reported inter-annotator agreements of 83- 88%, 94-95% and 92.5%, respectively for pitch accent, boundary tone and break index detection within the ToBI annotation scheme 28 Table 2.1: ToBI label mapping used in experiments. The decomposition of labels is illustrated for pitch accents, phrasal tones and break indices ToBI Labels Intermediate Mapping Coarse Mapping H* High L+H* !H*, H+!H* Downstepped accent L+!H*,L*+!H L* Low L*+H *,*?,X*? Unresolved L-L%,!H-L%,H-L% H-H% Final L-H% Boundary tone %?,X%?,%H btone L-,H-,!H- Intermediate -X?,-? Phrase (IP) boundary <,>,no label none none 0 0 1,1-,1p 1 NB 2,2-,2p 2 3,3-,3p 3 4,4- 4 B automatic prosody detection and identification algorithms. This problem has been cir- cumvented to some extent by decomposing the ToBI labels into intermediate or coarse categories such as presence or absence of pitch accents, phrasal tones, etc., and per- forming automatic prosody detection on the decomposed inventory of labels. Such a grouping also reduces the effects of labeling inconsistency. A detailed illustration of the label decompositions is presented in Table 2.1. In this work, we use the coarse repre- sentation (presence versus absence) of pitch accents, boundary tones and break indices to alleviate the data sparsity and compare our results with previous work. 29 2.3 Related Work In this section, we survey previous work in prominence and phrase break prediction with an emphasis on ToBI-based pitch accent, boundary tones and break index predic- tion. We present a brief overview of speech applications that have used such prosodic representations along with algorithms and their corresponding performance on the vari- ous prosody detection and identification tasks. 2.3.1 Pitch accent and boundary tone labeling new Table 2.2: Summary of previous work on pitch accent and boundary tone detection (coarse mapping). Level denotes the orthographic level (word or syllable) at which the experiments were performed. The results of Hasegawa-Johnson et. al and our work are directly comparable as the experiments are performed on identical dataset Accuracy (%) Authors Algorithm Corpus Level Pitch accent Boundary tone Wightman and Ostendorf [131] HMM/CART BU syllable 83.0 77.0 Ross and Ostendorf [108] HMM/CART BU syllable 87.7 66.9 Ananthakrishnan et al. [4] Coupled HMM BU syllable 75.0 88.0 Gregory and Altun [40] Conditional random fields Switchboard word 76.4 - Nenkova et al. [81] Decision Tree Switchboard word 76.6 - Harper et al. Decision Trees/ Switchboard word 80.4 - (JHU Workshop) [45] Random Forest Hirschberg [48] CART BU word 82.4 - Wang and Hirschberg [130] CART ATIS word - 90.0 Ananthakrishnan et al. [4] Coupled HMM BU word 79.5 82.1 Hasegawa Johnson et al. 
[46] Neural networks/GMM BU word 84.2 93.0 Proposed work Maximum entropy model BU and BDC word 86.0 93.1 Automatic prominence labeling through pitch accents and boundary tones, has been an active research topic for over a decade. Wightman and Ostendorf [131] devel- oped a decision-tree algorithm for labeling prosodic patterns. The algorithm detected phrasal prominence and boundary tones at the syllable level. Bulyko and Ostendorf [22] used a prosody prediction module to synthesize natural speech with appropriate pitch accents. Verbmobil [83] incorporated prosodic prominence into a translation framework for improved linguistic analysis and speech understanding. 30 Pitch accent and boundary tone labeling has been reported in many past studies [48, 46, 4]. Hirschberg [48] used a decision-tree based system that achieved 82.4% speaker dependent accent labeling accuracy at the word level on the BU corpus using lexical features. Wang and Hirschberg [130] used a CART based labeling algorithm to achieve intonational phrase boundary classification accuracy of 90.0%. Ross and Ostendorf [108] also used an approach similar to [131] to predict prosody for a TTS system from lexical features. Pitch accent accuracy at the word-level was reported to be 82.5% and syllable-level accent accuracy was 87.7%. Hasegawa-Johnson et al. [46] pro- posed a neural network based syntactic-prosodic model and a gaussian mixture model based acoustic-prosodic model to predict accent and boundary tones on the BU corpus that achieved 84.2% accuracy in accent prediction and 93.0% accuracy in intonational boundary prediction. With syntactic information alone they achieved 82.7% and 90.1% for accent and boundary prediction, respectively. Ananthakrishnan and Narayanan [4] modeled the acoustic-prosodic information using a coupled hidden markov model that modeled the asynchrony between the acoustic streams. The pitch accent and boundary tone detection accuracy at the syllable level were 75% and 88% respectively. Yoon [137] has recently proposed memory-based learning approach and has reported accuracies of 87.78% and 92.23% for pitch accent and boundary tone labeling. The experiments were conducted on a subset of the BU corpus with 10,548 words and consisted of data from same speakers in the training and test set. More recently, pitch accent labeling has been performed on spontaneous speech in the Switchboard corpus. Gregory and Atlun [40] modeled lexical, syntactic and phono- logical features using conditional random fields and achieved pitch accent detection accuracy of 76.4% on a subset of words in the Switchboard corpus. Ensemble machine learning techniques such as bagging and random forests on decision trees were used in the 2005 JHU Workshop [45] to achieve pitch accent detection accuracy of 80.4%. The 31 corpus used was a prosodic database consisting of spontaneous speech from the Switch- board corpus [87]. Nenkova et al. [81] have reported a pitch accent detection accuracy of 76.6% on a subset of the Switchboard corpus using a decision tree classifier. Our proposed maximum entropy discriminative model outperforms previous work on prosody labeling on the BU and BDC corpora. On the BU corpus, with syntactic information alone we achieve pitch accent and boundary tone accuracy of 85.2% and 91.5% on the same training and test sets used in [26, 46]. These results are statistically significant by a difference of proportions test 2 . 
Further, the coupled model with both acoustic and syntactic information results in accuracies of 86.0% and 93.1% respec- tively. The pitch accent improvement is statistically significant compared to results reported in [26] by a difference of proportions test. On the BDC corpus, we achieve pitch accent and boundary tone accuracies of 79.8% and 90.3%. The proposed work uses speech and language information that can be reliably and easily extracted from the speech signal and orthographic transcription. It does not rely on any hand-coded fea- tures [81] or prosody labeled lexicons [46]. The results of previous work on pitch accent and boundary tone detection on the BU corpus are summarized in Table 2.2. 2.3.2 Prosodic phrase break labeling Automatic intonational phrase break prediction has been addressed mainly through rule-based systems developed by incorporation of rich linguistic rules, or, data-driven statistical methods that use labeled corpora to induce automatic labeling informa- tion [131, 52, 17, 122]. Typically, syntactic information like POS tags, syntactic struc- ture (parse features), as well as acoustic correlates like duration of preboundary syl- lables, boundary tones, pauses and f0 contour have been used as features in automatic detection and identification of intonational phrase breaks. Algorithms based on machine 2 Results at a level≤ 0.001 were considered significant 32 learning techniques such as decision trees [131, 52, 128], HMM [17] or combination of these [122] have been successfully used for predicting phrase breaks from text and speech. Automatic detection of phrase breaks has been addressed mainly from the intent of incorporating the information in text-to-speech systems [52, 17], to generate appropri- ate pauses and lengthening at phrase boundaries. Phrase breaks have also been modeled from the interest of their utility in resolving syntactic ambiguity [128, 127, 61]. Into- national phrase break prediction is also important in speech understanding [131] where the recognized utterance needs to be interpreted correctly. Table 2.3: Summary of previous work on break index detection (coarse mapping). Detection is performed at word-level for all experiments Accuracy (%) Authors Algorithm Corpus Break index Wightman and Ostendorf [131] HMM/CART BU 84.0 Ostendorf and Veilleux [127] HMM/CART ATIS 70.0 Wang and Hirschberg [130] CART ATIS 81.7 Spoken Taylor and Black [17] HMM English 79.2 corpus Sun and Applebaum [122] CART BU 85.2 Harper et al. Decision Trees/ Switchboard 83.2 (JHU Workshop) [45] Random Forest Proposed work Maximum BU and BDC 84.0-87.5 entropy model One of the first efforts in automatic prosodic phrasing was presented by Ostendorf and Wightman [131]. Using the seven level break index proposed in [94], they achieved an accuracy of 67% for exact identification and 89% correct identification within±1. They used a simple decision tree classifier for this task. Wang and Hirschberg [130] have reported an overall accuracy of 81.7% in detection of phrase breaks through a 33 CART based scheme. Ostendorf and Veilleux [127] achieved 70% accuracy for break correct prediction, while, Taylor and Black [17], using their HMM based phrase break prediction based on POS tags have demonstrated 79.27% accuracy in correctly detect- ing break indices. Sun and Applebaum [122] have reported F-score of 77% and 93% on break and non-break prediction. 
Recently, ensemble machine learning techniques such as bagging and random forests that combine decision tree classifiers were used at the 2005 JHU workshop [45] to perform automatic break index labeling. The classifiers were trained on spontaneous speech [87] and resulted in a break index detection accuracy of 83.2%. Kahn et al. [61] have also used prosodic break index labeling to improve parsing. Yoon [137] has reported a break index accuracy of 88.06% in a three-way classification between break indices using only lexical and syntactic features.

We achieve break index accuracies of 83.95% and 87.18% on the BU and BDC corpora using lexical and syntactic information alone. Our combined maximum entropy acoustic-prosodic model achieves break index detection accuracies of 84.01% and 87.58%, respectively, on the two corpora. The results from previous work are summarized in Table 2.3.

2.4 Maximum Entropy discriminative model for prosody labeling

Discriminatively trained classification techniques have emerged as one of the dominant approaches for resolving ambiguity in many speech and language processing tasks. Models trained using discriminative approaches have been demonstrated to outperform generative models as they directly optimize the conditional distribution without modeling the distribution of all the underlying variables. The maximum entropy approach can model the uncertainty in labels in typical NLP tasks and hence is desirable for prosody detection due to the inherent ambiguity in the representation of prosodic events through categorical labels. A preliminary formulation of the work in this section was presented by the authors in [95, 96].

We model prosody prediction as a classification task as follows: given a sequence of words w_i in an utterance W = \{w_1, \cdots, w_n\}, the corresponding syntactic information sequence S = \{s_1, \cdots, s_n\} (e.g., parts-of-speech, syntactic parse, etc.), a set of acoustic-prosodic features A = \{a_1, \cdots, a_n\}, where a_i = (a_i^1, \cdots, a_i^{t_{w_i}}) is the acoustic-prosodic feature vector corresponding to word w_i with a frame length of t_{w_i}, and a prosodic label vocabulary L = \{l_1, \cdots, l_V\}, the best prosodic label sequence L^* = \{l_1, l_2, \cdots, l_n\} is obtained as follows,

L^* = \arg\max_L P(L \mid W, S, A) \qquad (2.1)

We approximate the string-level global classification problem, using conditional independence assumptions, by a product of local classification problems as shown in Eq. (2.3). The classifier is then used to assign to each word a prosodic label conditioned on a vector of local contextual features comprising the lexical, syntactic and acoustic information.

L^* = \arg\max_L P(L \mid W, S, A) \qquad (2.2)
\approx \arg\max_L \prod_{i=1}^{n} p(l_i \mid w_{i-k}^{i+k}, s_{i-k}^{i+k}, a_{i-k}^{i+k}) \qquad (2.3)
= \arg\max_L \prod_{i=1}^{n} p(l_i \mid \Phi(W, S, A, i)) \qquad (2.4)

where \Phi(W, S, A, i) = (w_{i-k}^{i+k}, s_{i-k}^{i+k}, a_{i-k}^{i+k}) is the set of features extracted within a bounded local context k. \Phi(W, S, A, i) is shortened to \Phi in the rest of the section.

To estimate the conditional distribution P(l_i \mid \Phi) we use the general technique of choosing the maximum entropy (maxent) distribution that matches the empirical average of each feature over the training data [15]. This can be written in terms of the Gibbs distribution parameterized with weights \lambda_l, where l ranges over the label set and V is the size of the prosodic label set. Hence,

P(l_i \mid \Phi) = \frac{e^{\lambda_{l_i} \cdot \Phi}}{\sum_{l=1}^{V} e^{\lambda_l \cdot \Phi}} \qquad (2.5)

To find the global maximum of the concave function in Eq. (2.5), we use the Sequential L1-Regularized Maxent algorithm (SL1-Max) [31].
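To make Eqs. (2.4)-(2.5) concrete, the sketch below builds the local feature vector from a +/-k window of words and POS tags and trains a conditional model over word-level accent labels. It is only an illustrative approximation under stated assumptions: the toy utterances and the window_features helper are invented for the example, acoustic and supertag features are omitted, and scikit-learn's LogisticRegression (an L2-regularized multinomial maxent trained with a generic solver) stands in for the SL1-Max/LLAMA setup used in this work.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def window_features(words, pos, i, k=3):
    """Lexical/syntactic features for word i from a +/-k context window."""
    feats = {}
    n = len(words)
    for off in range(-k, k + 1):
        j = i + off
        feats[f"w[{off}]"] = words[j] if 0 <= j < n else "<pad>"
        feats[f"pos[{off}]"] = pos[j] if 0 <= j < n else "<pad>"
    return feats

# Toy word-level data (invented): (words, POS tags, pitch-accent labels).
corpus = [
    (["the", "mayor", "announced", "a", "new", "budget"],
     ["DT", "NN", "VBD", "DT", "JJ", "NN"],
     ["none", "accent", "accent", "none", "accent", "accent"]),
    (["we", "will", "vote", "on", "friday"],
     ["PRP", "MD", "VB", "IN", "NNP"],
     ["none", "none", "accent", "none", "accent"]),
]

X, y = [], []
for words, pos, labels in corpus:
    for i, label in enumerate(labels):
        X.append(window_features(words, pos, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)  # L2-regularized maxent stand-in
clf.fit(vec.fit_transform(X), y)

# Per-word posteriors p(l_i | Phi), as in Eq. (2.5).
test = window_features(corpus[0][0], corpus[0][1], i=1)
probs = clf.predict_proba(vec.transform([test]))[0]
print(dict(zip(clf.classes_, probs.round(3))))
```

The actual system estimates the conditional distribution with SL1-Max rather than the generic solver used in this sketch.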
Compared to Iterative Scaling (IS) and gradient descent procedures, this algorithm results in faster convergence and provides L1-regularization as well as efficient heuristics to estimate the regularization meta-parameters. We use the machine learning toolkit LLAMA [42] to estimate the conditional distribution using maxent. LLAMA encodes multiclass maxent as binary maxent to increase the training speed and to scale the method to large data sets. We use here V one-versus-other binary classifiers. Each output label l is projected onto a bit string, with components b j (l). The probability of each component is estimated independently: P(b j (l)|Φ)=1− P( ¯ b j (l)|Φ)= e λ j .Φ e λ j .Φ + e λ¯ j .Φ = 1 1+ e −(λ j −λ¯ j ).Φ (2.6) where λ¯ j is the parameter vector for ¯ b j (y). Assuming the bit vector components to be independent, we have, 36 P(l i |Φ)= V j=1 P(b j (l i )|Φ) (2.7) Therefore, we can decouple the likelihoods and train the classifiers independently. In this work, we use the simplest and most commonly studied code, consisting of V one- versus-others binary components. The independence assumption states that the output labels or classes are independent. 2.5 Lexical, syntactic and acoustic features In this section, we describe the lexical, syntactic and acoustic features that we use in our maximum entropy discriminative modeling framework. We use only features that are derived from the local context of the text being tagged, referred to as static features here on. One would have to perform a Viterbi search if the preceding prediction context were to be added. Using static features is especially suitable for performing prosody labeling in lockstep with recognition or dialog act detection, as the prediction can be performed incrementally instead of waiting for the entire utterance or dialog to be decoded. Table 2.4: Lexical, syntactic and acoustic features used in the experiments. The acoustic features were obtained over 10ms frame intervals Category Features used Lexical features Word identity (3 previous and next words) POS tags (3 previous and next words) Syntactic features Supertags (3 previous and next words) function/content word distinction (3 previous and next words) Acoustic features Speaker normalized f0 contour (+delta+acceleration) Speaker normalized energy contour (+delta+acceleration) 37 2.5.1 Lexical and syntactic features The lexical features used in our modeling framework are simply the words in a given utterance. The BU and BDC corpora that we use in our experiments are automatically labeled (and hand-corrected) with part-of-speech (POS) tags. The POS inventory is the same as the Penn treebank which includes 47 POS tags: 22 open class categories, 14 closed class categories and 11 punctuation labels. We also automatically tagged the utterances using the AT&T POS tagger. The POS tags were mapped into function and content word categories 3 and were added as a discrete feature. Table 2.5: Illustration of the supertags generated for a sample utterance in BU corpus. Each sub-tree in the table corresponds to one supertag. But now seventy minicomputer makers compete for customers S Conj But S* S Adv now S* NP DT seventy NP* N N minicomputer N* NP N makers S NP↓ VP V compete PP↓ PP P for NP↓ NP N customers In addition to the POS tags, we also annotate the utterance with Supertags [11]. Supertags encapsulate predicate-argument information in a local structure. They are the elementary trees of Tree-Adjoining Grammars (TAGs) [57]. 
Similar to part-of-speech tags, supertags are associated with each word of an utterance, but provide much richer information than part-of-speech tags, as illustrated in the example in Table V . Supertags can be composed with each other using substitution and adjunction operations [57] to derive the predicate-argument structure of an utterance. There are two methods for creating a set of supertags. One approach is through the creation of a wide coverage English grammar in the lexicalized tree adjoining grammar formalism, called XTAG [136]. An alternate method for creating supertags is to employ 3 Function and content word features were obtained through a look-up table based on POS 38 rules that decompose the annotated parse of a sentence in Penn Treebank into its ele- mentary trees [25, 135]. This second method for extracting supertags results in a larger set of supertags. For the experiments presented in this work, we employ a set of 4,726 supertags extracted from the Penn Treebank. There are many more supertags per word than part-of-speech tags, since supertags encode richer syntactic information than part-of-speech tags. The task of identifying the correct supertag for each word of an utterance is termed as supertagging [11]. Different models for supertagging that employ local lexical and syntactic information have been proposed [8]. For the purpose of this work, we use a Maximum Entropy supertagging model that achieves a supertagging accuracy of 87% [9] 4 . While there have been previous attempts to employ syntactic information for prosody labeling [128, 47], which mainly exploited the local constituent information provided in a parse structure, supertags provide a different representation of syntactic information. First, supertags localize the predicate and its arguments within the same local representation (e.g. give is a di-transitive verb) and this localization extends across syntactic transformations (relativization, passivization, wh-extraction), i.e., there is a different supertag for each of these transformations for each of the argument positions. Second, supertags also factor out recursion from the predicate-argument domain. Thus modification relations are specified through separate supertags as shown in Table V . For this work we use the supertags as labels, even though there is a potential to exploit the internal representation of supertags as well as the dependency structure between supertags as demonstrated in [53]. Table 3.1 shows the supertags generated for a sample utterance in the BU corpus. 4 The model is trained to disambiguate among the supertags of a word by using the lexical and part- of-speech features of the word and of six words in the left and right context of that word. The model is trained on 1 million words of supertag annotated text. 39 2.5.2 Acoustic-prosodic features The BU corpus contains the corresponding acoustic-prosodic feature file for each utter- ance. The f0 and RMS energy (e) of the utterance along with features for distinc- tion between voiced/unvoiced segment, cross-correlation values at estimated f0 value and ratio of first two cross correlation values are computed over 10 msec frame inter- vals. The pitch values for unvoiced regions are smoothed using linear interpolation. In our experiments, we use these values rather than computing them explicitly which is straightforward with most audio processing toolkits. Both the energy and the f0 levels were range normalized (znorm) with speaker specific means and variances. 
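A minimal sketch of this per-speaker normalization step is given below, assuming the 10 ms frame-level f0 and RMS energy tracks are available as NumPy arrays; the array values, the zero-valued unvoiced convention and the helper names are assumptions of the example rather than properties of the corpus feature files.

```python
import numpy as np

def interpolate_unvoiced(f0):
    """Linearly interpolate f0 over unvoiced frames (marked here as 0)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if voiced.any():
        idx = np.arange(len(f0))
        f0 = np.interp(idx, idx[voiced], f0[voiced])
    return f0

def znorm(x, mean, std):
    """Normalize a contour with speaker-specific mean and deviation."""
    return (np.asarray(x, dtype=float) - mean) / (std if std > 0 else 1.0)

# Example: frames sampled every 10 ms for one utterance of speaker "f2b".
f0_raw = np.array([0.0, 180.0, 185.0, 0.0, 0.0, 190.0, 170.0])
energy = np.array([55.0, 60.0, 63.0, 58.0, 52.0, 61.0, 57.0])

# Speaker statistics would normally be pooled over all of the speaker's
# utterances; here they come from this single utterance for illustration.
f0 = interpolate_unvoiced(f0_raw)
f0_norm = znorm(f0, f0.mean(), f0.std())
e_norm = znorm(energy, energy.mean(), energy.std())
print(np.round(f0_norm, 2), np.round(e_norm, 2))
```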
Delta and acceleration coefficients were also computed for each frame. The final feature vector has 6 dimensions comprising f0, ∆ f0, ∆ 2 f0, e, ∆ e, ∆ 2 e per frame. We model the frame level continuous acoustic-prosodic observation sequence as a discretized sequence through quantization (see Figure 3.4). We perform this on the normalized pitch and energy extracted from the time segment corresponding to each word. The quantized acoustic stream is then used as a feature vector. For this case, Eq.(5.2) becomes, L ∗ ≈ arg max L n i p(l i |Φ) = arg max L n i p(l i |a i ) (2.8) where a i =(a 1 i ,··· ,a tw i i ), the acoustic-prosodic feature vector corresponding to word w i with a frame length of t w i . The quantization while being lossy, reduces the vocabulary of the acoustic-prosodic features, and hence offers better estimates of the conditional probabilities. The quan- tized acoustic-prosodic cues are then modeled using the maximum entropy model described in Section 2.4. The n-gram representation of quantized continuous features is similar to representing the acoustic-prosodic features with a piecewise linear fitas 40 Figure 2.1: Illustration of the quantized feature input to the maxent classifier. “|” denotes feature input conditioned on preceding values in the acoustic-prosodic sequence done in the tilt intonational model [123]. Essentially, we leave the choice of appropriate representations of the pitch and energy features to the maximum entropy discriminative classifier, which integrates feature selection during classification. Table 2.6: Statistics of Boston University Radio News and Boston Directions corpora used in experiments BU BDC Corpus statistics f2b f1a m1b m2b h1 h2 h3 h4 # Utterances 165 69 72 51 10 999 # words (w/o punc) 12608 3681 5058 3608 2234 4127 1456 3008 # pitch accents 6874 2099 2706 2016 1006 1573 678 1333 # boundary tones (w IP) 3916 1059 1282 1023 498 727 361 333 # boundary tones (w/o IP) 2793 684 771 652 308 428 245 216 # breaks (level 3 & above) 3710 1034 11721 1016 434 747 197 542 The proposed scheme of quantized n-gram prosodic features as input to the maxent classifier is different from previous work [113]. Shriberg et al. [113] have proposed N- grams of Syllable-based Nonuniform Extraction Region Features (SNERF-grams) for speaker recognition. In their approach, they extract a large set of prosodic features such as maximum pitch, mean pitch, minimum pitch, durations of syllable onset, coda, nucleus, etc. and quantize these features by binning them. The resulting syllable-level features, for a particular bin resolution, are then modeled as either unigram (using cur- rent syllable only), bigram (current and previous syllable or pause) or trigram (current and previous two syllables or pauses). They use support vector machines (SVMs) for 41 Table 2.7: Baseline classification results of pitch accents and boundary tones (in %) using Festival and AT&T Natural V oices speech synthesizer Accuracy Corpus Speaker Set Prediction Module Pitch accent Boundary tone Chance 54.33 81.14 Lexical stress 72.64 - Entire Set AT&T Natural Voices 81.51 89.10 Festival 69.55 89.54 Chance 56.53 82.88 Lexical stress 74.10 - BU Hasegawa-Johnson et al. set AT&T Natural Voices 81.73 89.67 Festival 68.65 90.21 Chance 57.60 88.90 Lexical stress 67.42 - BDC Entire Set AT&T Natural Voices 68.49 84.90 Festival 64.94 85.17 subsequent classification. 
Our framework, on the other hand, models the macroscopic prosodic contour in its entirety by using n-gram feature representation of the quantized prosodic feature sequence. This representation coupled with the strength of the maxent model to handle large feature sets and in avoiding overtraining through regularization makes our scheme attractive for capturing characteristic pitch movements associated with prosodic events. 2.6 Experimental Evaluation 2.6.1 Data All the experiments reported in this work are performed on the Boston University (BU) Radio News Corpus [86] and the Boston Directions Corpus (BDC) [50], two publicly available speech corpora with manual ToBI annotations intended for experiments in automatic prosody labeling. The BU corpus consists of broadcast news stories includ- ing original radio broadcasts and laboratory simulations recorded from seven FM radio 42 announcers. The corpus is annotated with orthographic transcription, automatically gen- erated and hand-corrected part-of-speech tags and automatic phone alignments. A sub- set of the corpus is also hand annotated with ToBI labels. In particular, the experiments in this work are carried out on 4 speakers similar to [26], 2 males and 2 females referred to hereafter as m1b, m2b, f1a and f2b. The BDC corpus is made of elicited monologues produced by subjects who were instructed to perform a series of direction-giving tasks. Both spontaneous and read versions of the speech are available for four speakers h1, h2, h3 and h4 with hand-annotated ToBI labels and automatic phone alignments, similar to the BU corpus. Table 2.6 shows some of the statistics of the speakers in the BU and BDC corpora. In all our prosody labeling experiments we adopt a leave-one-out speaker validation similar to the method in [46] for the four speakers with data from one speaker for testing and those from the other three for training. For the BU corpus, speaker f2b was always used in the training set since it contains the most data. In addition to performing exper- iments on all the utterances in BU corpus, we also perform identical experiments on the train and test sets reported in [26] which is referred to as Hasegawa-Johnson et al. set. 2.7 Pitch accent and boundary tone labeling In this section, we present pitch accent and boundary tone labeling results obtained through the proposed maximum entropy prosody labeling scheme. We first present some baseline results, followed by the description of results obtained from our classification framework. 43 2.7.1 Baseline Experiments We present three baseline experiments. One is simply based on chance where the major- ity class label is predicted. The second is a baseline only for pitch accents derived from the lexical stress obtained through look-up from a pronunciation lexicon labeled with stress. Finally, the third baseline is obtained through prosody detection in current off-the-shelf speech synthesis systems. The baseline using speech synthesis systems is comparable to our proposed model that uses lexical and syntactic information alone. For experiments using acoustics, our baseline is simply chance. Table 2.8: Classification results (%) of pitch accents and boundary tones for differ- ent syntactic representations. Classifiers with cardinality V=2 learned either accent or btone classification, classifiers with cardinality V=4 classified accent and btone simul- taneously. 
The variable (k) controlling the length of the local context was set to k =3 V=2 V=4 Corpus Speaker Set Syntactic features Pitch accent Boundary Pitch accent Boundary tone tone correct POS tags 84.75 91.39 84.60 91.34 BU Entire Set POS tags 83.71 90.52 83.50 90.36 POS + supertags 84.59 91.34 84.48 91.22 correct POS tags 85.22 91.33 85.03 91.29 Hasegawa-Johnson et al. set POS tags 83.91 90.14 83.72 90.04 POS + supertags 84.95 91.21 84.85 91.24 BDC Entire Set POS + supertags 79.81 90.28 79.57 89.76 2.7.1.1 Acoustic baseline (chance) The simplest baseline we use is chance, which refers to the majority class label assign- ment for all tokens. The majority class label for pitch accents is presence of a pitch accent (accent) and that for boundary tone is absence (none). 2.7.1.2 Prosody labels derived from lexical stress Pitch accents are usually carried by the stressed syllable in a particular word. Lexicons with phonetic transcription and lexical stress are available in many languages. Hence, 44 one can use these lexical stress markers within the syllables and evaluate the correlation with pitch accents. Even when the lexicon has a closed vocabulary, letter-to-sound rules can be derived from it for unseen words. For each word carrying a pitch accent, we find the particular syllable where the pitch accent occurs from the manual annotation. For the same syllable, we assign a pitch accent based on the presence or absence of a lexical stress marker in the phonetic transcription. The CMU pronunciation lexicon was used for predicting lexical stress through simple lookup. Lexical stress for out- of-vocabulary words was predicted through a CART based letter-to-sound rule derived from the pronunciation lexicon. The results are presented in Table 2.7. 2.7.1.3 Prosody labels predicted using TTS systems We perform prosody prediction using two off-the-shelf speech synthesis systems, namely, AT&T NV speech synthesizer and Festival. The AT&T NV speech syn- thesizer [1] is a half phone speech synthesizer. The toolkit accepts an input text utterance and predicts appropriate ToBI pitch accent and boundary tones for each of the selected units (in this case, a pair of phones) from the database. The toolkit uses a rule-based procedure to predict the ToBI labels from lexical informa- tion [48]. We reverse mapped the selected half phone units to words, thus obtain- ing the ToBI labels for each word in the input utterance. The pitch accent labels predicted by the toolkit are L accent {H∗,L∗,none} and the boundary tones are L btone {L-L%,H-H%,L-H%,none}. Festival [18] is an open-source unit selection speech synthesizer. The toolkit includes a CART-based prediction system that can predict ToBI pitch accents and boundary tones for the input text utterance. The pitch accent labels predicted by the toolkit are L accent {H∗,L + H∗, !H∗, none} and the boundary tones are 45 L btone {L-L%,H-H%,L-H%,none}. The prosody labeling results obtained through both the speech synthesis engines are presented in Table 2.7. 2.7.2 Maximum entropy pitch accent and boundary tone classifier In this section, we present results of our maximum entropy pitch accent and boundary tone classification. We first present a maximum entropy syntactic-prosodic model that uses only lexical and syntactic information for prosody detection, followed by a maxi- mum entropy acoustic-prosodic model that uses an n-gram feature representation of the quantized acoustic-prosodic observation sequence. 
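As a concrete illustration of that representation (described in Section 2.5.2), the sketch below rounds a speaker-normalized contour to the first decimal place and collects n-grams over the resulting symbol sequence. The helper names and the toy contour are assumptions of the example; in the actual system the same treatment is applied to the energy track and the resulting counts are fed to the maxent model as sparse features.

```python
import numpy as np
from collections import Counter

def quantize(contour, decimals=1):
    """Map a speaker-normalized contour to discrete symbols by rounding
    each frame value (here, to one decimal place)."""
    return [f"{v:.{decimals}f}" for v in np.round(contour, decimals)]

def ngram_features(symbols, n_max=3):
    """Count 1..n_max-grams of the quantized sequence; these counts become
    sparse features for the maxent classifier."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(symbols) - n + 1):
            feats["_".join(symbols[i:i + n])] += 1
    return feats

# Toy normalized f0 contour for the frames of a single word (assumed input).
f0_word = np.array([-0.32, -0.11, 0.08, 0.31, 0.52, 0.49, 0.18])
symbols = quantize(f0_word)
print(symbols)                  # e.g. ['-0.3', '-0.1', '0.1', ...]
print(ngram_features(symbols))  # unigram, bigram and trigram counts
```

The n-gram counts over the quantized symbols act as a coarse, piecewise description of the contour shape, in the spirit of the linear parameterizations discussed earlier.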
2.7.2.1 Maximum entropy syntactic-prosodic model The maximum entropy syntactic-prosodic model uses only lexical and syntac- tic information for prosody labeling. Our prosodic label inventory consists of L accent {accent,none} for pitch accents, and L btone {btone,none} for boundary tones. Such a framework is beneficial for text-to-speech synthesis that relies on lexical and syntactic features derived predominantly from the input text to synthesize natural sounding speech with appropriate prosody. The results are presented in Table 2.8. In Table 2.8, correct POS tags refer to hand-corrected POS tags present in the BU corpus release and POS tags refers to parts-of-speech tags predicted automatically. Prosodic prominence and phrasing can also be viewed as joint events occurring simultaneously. Previous work by [131] suggests that a joint labeling approach may be more beneficial in prosody labeling. In this scenario, we treat each word to have one of the four labels l i L ={accent-btone, accent-none, none-btone, none-none}.We trained the classifier on the joint labels and then computed the error rates for individual classes. The joint modeling approach provides a marginal improvement in the boundary tone prediction but is slightly worse for pitch accent prediction. 46 Table 2.9: Classification results of pitch accents and boundary tones (in %) with acous- tics only, syntax only and acoustics+syntax using both our models. The syntax based results from our maximum entropy syntactic-prosodic classifier are presented again to view the results cohesively. In the table A = Acoustics, S = Syntax Pitch accent Boundary tone Corpus Speaker Set Model A S A+S A S A+S Entire Set Maxent acoustic model 80.09 84.60 84.63 84.10 91.36 91.76 HMM acoustic model 70.58 84.60 85.13 71.28 91.36 92.91 BU Hasegawa-Johnson et al. set Maxent acoustic model 80.12 84.95 85.16 82.70 91.54 91.94 HMM acoustic model 71.42 84.95 86.01 73.43 91.54 93.09 BDC Entire Set Maxent acoustic model 74.51 79.81 79.97 83.53 90.28 90.49 HMM acoustic model 68.57 79.81 80.01 74.28 90.28 90.58 2.7.2.2 Maximum entropy acoustic-prosodic model We quantize the continuous acoustic-prosodic values by binning, and extract n-gram features from the resulting sequence. The quantized acoustic-prosodic n-gram features are then modeled with a maxent acoustic-prosodic model similar to the one described in section 5. Finally, we append the syntactic and acoustic features to model the com- bined stream with the maxent acoustic-syntactic model, where the objective criterion for maximization is Eq.(2.1). The two streams of information were weighted in the com- bined maximum entropy model by performing optimization on the training set (weights of 0.8 and 0.2 were used on the syntactic and acoustic vectors respectively). The pitch accent and boundary tone prediction accuracies for quantization performed by consid- ering only the first decimal place is reported in Table 2.9. As expected, we found the classification accuracy to drop with increasing number of bins used in the quantization due to the small amount of training data. In order to compare the proposed maxent acoustic-prosodic model with conventional approaches such as HMMs, we also trained continuous observation density HMMs to represent pitch accents and boundary tones. This is presented in detail in the following section. 47 2.7.3 HMM acoustic-prosodic model In this section, we compare the proposed maxent acoustic-prosodic model with a tra- ditional HMM approach. 
HMMs have been demonstrated to capture effectively the time-varying pitch patterns associated with pitch accents and boundary tones [4, 28]. We trained separate context-independent HMMs with a 3-state left-to-right topology and uniform segmentation. The segmentations need to be uniform due to the lack of an acoustic-prosodic model trained on the features pertinent to our task with which to obtain forced segmentation. The acoustic observations of the HMM were the unquantized acoustic-prosodic features described in Section 2.5.2. The label sequence was decoded using the Viterbi algorithm. The final label sequence using the maximum entropy syntactic-prosodic model and the HMM-based acoustic-prosodic model was obtained by combining the syntactic and acoustic probabilities. Essentially, the prosody labeling task reduces to the following:

L^* = \arg\max_L P(L \mid A, W) = \arg\max_L P(L \mid W) \cdot P(A \mid L, W) \approx \arg\max_L P(L \mid \Phi(W)) \cdot P(A \mid L)^{\gamma} \qquad (2.9)

where \Phi(W) is the syntactic feature encoding of the word sequence W. The first term in Eq. (2.9) corresponds to the probability obtained through our maximum entropy syntactic model. The second term in Eq. (2.9), computed by an HMM, corresponds to the probability of the acoustic data stream, which is assumed to depend only on the prosodic label sequence. \gamma is a weighting factor that adjusts the relative weight of the two models.

The syntactic-prosodic maxent model outputs a posterior probability for each class per word. We formed a lattice out of this structure and composed it with the lattice generated by the HMM acoustic-prosodic model. The best path was chosen from the composed lattice through a Viterbi search. The procedure is illustrated in Figure 2.2.

Figure 2.2: Illustration of the FST composition of the syntactic and acoustic lattices and the resulting best path selection. The syntactic-prosodic maxent model produces the syntactic lattice and the HMM acoustic-prosodic model produces the acoustic lattice.

The acoustic-prosodic probability P(A \mid L, W) was raised to the power \gamma to adjust the weighting between the acoustic and syntactic models. The value of \gamma was chosen as 0.008 and 0.015 for pitch accent and boundary tone, respectively, by tuning on the training set. The results of the HMM acoustic-prosodic model and the coupled model are shown in Table 2.9. The weighted combination of the maximum entropy syntactic-prosodic model and the HMM acoustic-prosodic model performs best in pitch accent and boundary tone classification. We conjecture that the generalization provided by the acoustic HMM model is complementary to that provided by the maximum entropy model, resulting in slightly better accuracy for this combination than for the combined maxent-based acoustic and syntactic model.

2.8 Prosodic break index labeling

We presented pitch accent and boundary tone labeling results using our proposed maximum entropy classifier in the previous section. In the following section, we address phrase structure detection by performing automatic break index labeling within the ToBI framework.
Prosodic phrase break prediction has been especially useful in text- to-speech [17] and sentence disambiguation [128, 127] applications, both of which rely on prediction based on lexical and syntactic features. We follow the same format as the prominence labeling experiments, presenting baseline experiments followed by our maximum entropy syntactic and acoustic classification schemes. All the experiments are performed on the entire BU and BDC corpora. 2.8.1 Baseline Experiments We present baseline experiments, both chance and break index labeling results using an off-the-shelf speech synthesizer. The AT&T Natural V oices speech synthesizer does not have a prediction module for prosodic break prediction and hence we present results from using the Festival [18] speech synthesizer alone. Festival speech synthesizer pro- duces simple binary break presence or absence distinction, as well as more detailed ToBI-like break index prediction. 2.8.1.1 Break index prediction in Festival Festival can predict break index at the word level based on the algorithm presented in [17]. The toolkit can predict both, ToBI like break values (L tobi break {0,1,2,3,4}) 50 and simple presence versus absence (L binary break {B,NB}). Only lexical and syntac- tic information is used in this prediction without any acoustics. Baseline classification results are presented in Table 2.10. 2.8.2 Maximum Entropy model for break index prediction 2.8.2.1 Syntactic-prosodic model The maximum entropy syntactic-prosodic model uses only lexical and syntactic infor- mation for prosodic break index labeling. Our prosodic label inventory consists of L tobi break {0,1,2,3,4} for ToBI based break indices, and L binary break {B,NB} for binary break versus no-break distinction. The{B,NB} categorization was obtained by grouping break indices 0,1,2 into NB and 3,4 into B [117]. The classifier is then applied for break index labeling as described in Section 2.7.2.1 for the pitch accent pre- diction. We assume knowledge of sentence boundary through the means of punctuation in all the reported experiments. Table 2.10: Classification results of break indices (in %) with syntax only, acoustics only and acoustics+syntax using the maximum entropy classifier. In the table A = Acoustics, S = Syntax ToBI break indices B/NB Corpus Chance Festival A S A+S Chance Festival A S A+S BU 61.25 64.22 64.73 72.32 72.90 71.91 77.58 73.98 83.95 84.01 BDC 60.01 66.56 58.95 69.25 69.81 82.26 82.31 75.94 87.18 87.58 2.8.2.2 Acoustic-prosodic model Prosodic break index prediction is typically used in text-to-speech systems and syntactic parse disambiguation. Hence, the lexical and syntactic features are crucial in the auto- matic modeling of these prosodic events. Further, they are defined at the word level and 51 do not demonstrate a high degree of correlation with specific pitch patterns. We thus use only the maximum entropy acoustic-prosodic model described in Section 2.7.2.2. The combined maximum entropy acoustic-syntactic model is then similar to Eq.(5.1), where the prosodic label sequence is conditioned on the words, POS tags, supertags and quan- tized acoustic-prosodic features. A binary flag indicating the presence or absence of a pause before and after the current word was also included as a feature. The results of the maximum entropy syntactic, acoustic and acoustic-syntactic model for break index prediction are presented in Table 2.10. 
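As a small illustration of the label grouping and the pause feature just described, the sketch below maps ToBI break indices to the binary B/NB classes and builds the pause flags. The word sequence, break indices, pause durations and the 50 ms threshold are all invented for the example; only the grouping of indices 0-2 into NB and 3-4 into B follows the text.

```python
def break_to_binary(index):
    """Collapse ToBI break indices into break / no-break:
    0, 1, 2 -> NB and 3, 4 -> B, following the grouping used in the text."""
    return "B" if index >= 3 else "NB"

def pause_flags(pause_before_ms, pause_after_ms, threshold_ms=50):
    """Binary pause features for a word (the threshold is an assumption)."""
    return {"pause_before": int(pause_before_ms >= threshold_ms),
            "pause_after": int(pause_after_ms >= threshold_ms)}

# Invented example: break indices and pauses for a short word sequence.
words = ["the", "city", "council", "met", "yesterday"]
breaks = [1, 2, 3, 1, 4]
pauses = [(0, 0), (0, 0), (0, 120), (0, 0), (0, 400)]  # (before, after) ms

for w, b, (pb, pa) in zip(words, breaks, pauses):
    print(w, break_to_binary(b), pause_flags(pb, pa))
```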
The maxent syntactic-prosodic model achieves break index detection accuracies of 83.95% and 87.18% on the BU and BDC corpora. The addition of acoustics to the lexical and syntactic features does not result in a sig- nificant improvement in detection accuracy. In these experiments, we used only pitch and energy features and did not use duration features such as rhyme duration, duration of final syllable, etc., used in [131]. Such features require both phonetic alignment and syllabification and therefore are difficult to obtain in speech applications that require automatic prosody detection to be performed in lockstep. Additionally, in the context of TTS systems and parsers, the proposed maximum entropy syntactic-prosodic model for break index prediction performs with high accuracy compared to previous work. 2.9 Discussion The automatic prosody labeling presented in this work is based on ToBI-based cate- gorical prosody labels but is extendable to other prosodic representation schemes such as IViE [38] or INTSINT [54]. The experiments are performed on decompositions of the original ToBI labels into binary classes. However, with the availability of sufficient training data, we can overcome data sparsity and provide more detailed prosodic event detection (refer to Table 2.1). We use acoustic features only in the form of pitch and 52 energy contour for pitch accent and boundary tone detection. Durational features, which are typically obtained through forced alignment of the speech signal at the phone level in typical prosody detection tasks have not been considered in this work. We concen- trate only on the energy and pitch contour that can be robustly obtained from the speech signal. However, our framework is readily amenable to the addition of new features. We provide discussions on the prominence and phrase structure detection presented in sections 2.7 and 2.8 below. 2.9.1 Prominence prediction The baseline experiment with lexical stress obtained from a pronunciation lexicon for prediction of pitch accent yields substantially higher accuracy than chance. This could be particularly useful in resource-limited languages where prosody labels are usually not available but one has access to a reasonable lexicon with lexical stress markers. Off-the- shelf speech synthesizers like Festival and AT&T speech synthesizer have utilities that perform reasonably well in pitch accent and boundary tone prediction. AT&T speech synthesizer performs better than Festival in pitch accent prediction while the latter per- forms better in boundary tone prediction. This can be attributed to better rules in the AT&T synthesizer for pitch accent prediction. Boundary tones are usually highly cor- related with punctuation and Festival seems to capture this well. However, both these synthesizers generate a high degree of false alarms. The maximum entropy model syntactic-prosodic proposed in section 2.7.2.1 out- performs previously reported results on pitch accent and boundary tone classification. Much of the gain comes from the strength of the maximum entropy modeling in cap- turing the uncertainty in the classification task. Considering the inter-annotator agree- ment for ToBI labels is only about 81% for pitch accents and 93% for boundary tones, the maximum entropy framework is able to capture the uncertainty present in manual 53 annotation. The supertag feature offers additional discriminative information over the part-of-speech tags (also demonstrated by Rambow and Hirschberg [53]). 
The maximum entropy acoustic-prosodic model discussed in section 2.7.2.2 per- forms well in isolation compared to the traditional HMM acoustic-prosodic model. This is a simple method and the quantization resolution can be adjusted based on the amount of data available for training. However, the model performs with slightly lower accuracy when combined with the syntactic features compared to the combined maxent syntactic- prosodic and HMM acoustic-prosodic model. We conjecture that the generalization pro- vided by the acoustic HMM model is complementary to that provided by the maximum entropy acoustic model, resulting in slightly better accuracy when combined with the maxent syntactic model compared the maxent acoustic-syntactic model. We attribute this behavior to better smoothing offered by the HMM compared to the maxent acoustic model. We also expect this slight difference would not be noticeable with a larger data set. The weighted maximum entropy syntactic-prosodic model and HMM acoustic- prosodic model performs the best in pitch accent and boundary tone classification. The classification accuracies are comparable to the inter-annotator agreement for the ToBI labels. Our HMM acoustic-prosodic model is a generative model and does not assume the knowledge of word boundaries in predicting the prosodic labels as in pre- vious approaches [48, 131, 46]. This makes it possible to have true parallel prosody prediction during speech recognition. However, the incorporation of word boundary knowledge, when available, can aid in improved detection accuracies [27]. This is also true in the case of our maxent acoustic-prosodic model that assumes word segmentation information. The weighted approach also offers flexibility in prosody labeling for either speech synthesis or speech recognition. While the syntactic-prosodic model would be 54 more discriminative for speech synthesis, the acoustic-prosodic model is more appro- priate for speech recognition. 2.9.2 Phrase structure prediction The baseline results from Festival speech synthesizer are relatively modest for the break index prediction and only slightly better than chance. The break index prediction module in the synthesizer is mainly based on punctuation and parts-of-speech tag information and hence does not provide a rich set of discriminative features. The accuracies reported on the BU corpus are substantially higher compared to chance than those reported on the BDC corpus. We found that the distribution of break indices was highly skewed in the BDC corpus and the corpus also does not contain any punctuation markers. Our proposed maximum entropy break index labeling with lexical and syntactic information alone achieves 83.95% and 87.18% accuracy on the BU and BDC corpora. The syntactic model can be used in text-to-speech synthesis and sentence disambiguation (for parsing) applications. We also envision the use of prosodic breaks in speech translation by aiding in the construction of improved phrase translation tables. 2.10 Summary, conclusions, and future work In this chapter, we described a maximum entropy discriminative modeling framework for automatic prosody labeling. We applied the proposed scheme to both prominence and phrase structure detection within the ToBI annotation scheme. The proposed max- imum entropy syntactic-prosodic model alone resulted in pitch accent and boundary tone accuracies of 85.2% and 91.5% on training and test sets identical to [26]. 
As far as we know, these are the best results on the BU and BDC corpus using syntactic information alone and a train-test split that does not contain the same speakers. We 55 have also demonstrated the significance of our approach by setting reasonable base- line from out-of-the-box speech synthesizers and by comparing our results with prior work. Our combined maximum entropy syntactic-prosodic model and HMM acoustic- prosodic model performs the best with pitch accent and boundary tone labeling accu- racies of 86.0% and 93.1% respectively. The results of collectively using both syntax and acoustic within the maximum entropy framework are not far behind at 85.2% and 92% respectively. The break index detection with the proposed scheme is also promis- ing with detection accuracies ranging from 84-87%. The inter-annotator agreement for pitch accent, boundary tone and break index labeling on the BU corpus [86] are 81-84%, 93% and 95%, respectively. The accuracies of 80-86%, 90-93.1% and 84-87% achieved with the proposed framework for the three prosody detection tasks are comparable to the inter-labeler agreements. In summary, the experiments of this chapter demonstrate the strength of using a maximum entropy discriminative model for prosody prediction. Our framework is also suitable for integration into state-of-the-art speech applications. The supertag features in this work were used as categorical labels. The tags can be unfolded and the syntactic dependencies and structural relationship between the nodes of the supertags can be exploited further as demonstrated in [53]. We plan to use these more refined features in future work. As a continuation of our work, we have integrated our prosody labeler in a dialog act tagging scenario and we have been able to achieve modest improvements [?]. We are also working on incorporating our automatic prosody labeler in a speech-to-speech translation framework. Typically, state-of-the-art speech translation systems have a source language recognizer followed by a machine translation system. The translated text is then synthesized in the target language with prosody pre- dicted from text. In this process, some of the critical prosodic information present in the source data is lost during translation. With reliable prosody labeling in the source lan- guage, one can transfer the prosody to the target language (this is feasible for languages 56 with phrase level correspondence). The prosody labels by themselves may or may not improve the translation accuracy but they provide a framework where one can obtain prosody labels in the target language from the speech signal rather than depending only on a lexical prosody prediction module in the target language. 57 Chapter 3 Automatic recognition of discourse context In this chapter, we propose a discriminative framework for automatically detecting dis- course context in dialogs through dialog act tags. The proposed work jointly exploits lexical, syntactic and prosodic cues in a maximum entropy framework. We show that modeling the sequence of acoustic-prosodic values as n-gram features with a maxi- mum entropy model for dialog act (DA) tagging can perform better than conventional approaches that use coarse representation of the prosodic contour through summative statistics of the prosodic contour. The proposed scheme for exploiting prosody results in an absolute improvement of 8.7% over the use of most other widely used representations of acoustic correlates of prosody. 
The proposed scheme is discriminative and exploits context in the form of lexical, syntactic and prosodic cues from preceding discourse segments. Such a decoding scheme facilitates online DA tagging and offers robustness in the decoding process, unlike greedy decoding schemes that can potentially propagate errors. Our approach is different from traditional DA systems that use the entire conver- sation for offline dialog act decoding with the aid of a discourse model. In contrast, we use only static features and approximate the previous dialog act tags in terms of lexical, syntactic and prosodic information extracted from previous utterances. Experiments on the Switchboard DAMSL corpus, using only lexical, syntactic and prosodic cues from 3 previous utterances, yield a DA tagging accuracy of 72% compared to the best case 58 scenario with accurate knowledge of previous DA tags (oracle), which results in 74% accuracy. 3.1 Introduction Methods for automatic cue-based identification of dialog acts typically exploit multiple knowledge sources in the form of lexical [60, 119], syntactic [7], prosodic [112, 124] and discourse structure [58] cues. These cues have been modeled using a variety of methods including Hidden Markov models [59], neural networks [107], fuzzy systems [134] and maximum entropy models [7, 97]. Conventional dialog act tagging systems rely on the words and syntax of utterances [49]. However, in most applications that require transcriptions from an automatic speech recognizer, the lexical information obtained is typically noisy due to recognition errors. Moreover, some utterances are inherently ambiguous based on just lexical information. For example, an utterance such as “okay” can be used in the context of a statement, question or acknowledgment [39]. While lexical information is a strong cue to DA identity, the prosodic information contained in the speech signal can provide a rich source of complementary informa- tion. In languages such as English and Spanish, discourse functions are characterized by distinct intonation patterns [19, 30]. These intonation patterns can either be final fundamental frequency (f0) contour movements or characteristic global shapes of the pitch contour. For example, yes-no questions in English show a rising f0 contour at the end and wh- questions typically show a final falling pitch. Modeling the intonation pat- tern can thus be useful in discriminating sentence types. Previous work on exploiting intonation for DA tagging has mainly been through the use of representative statistics of the raw or normalized pitch contour, duration and energy such as mean, standard deviation, slope, etc. [119, 112]. However, these acoustic correlates of prosody provide 59 only a coarse summary of the macroscopic prosodic contour and hence may not exploit the prosodic profile fully. In this work, we model the prosodic contour by extracting n-gram features from the acoustic-prosodic sequence. This n-gram feature representa- tion is shown to yield better dialog act recognition accuracy compared to other methods that use summative statistics of acoustic-prosodic features. Further details of prosodic representations is provided in Section 3.6. We also present a discriminatively trained maximum entropy modeling framework using the n-gram prosodic features that is suitable for online classification of DAs. Traditional DA taggers typically combine the lexical and prosodic features in a HMM framework with a Markovian discourse grammar [119, 59]. 
The HMM representation facilitates optimal decoding through the Viterbi algorithm. However, such an approach limits DA classification to offline processing, as it uses the entire conversation during decoding. Even though this drawback can be overcome with a greedy decoding approach, the resulting decoder is sensitive to noisy input and can propagate errors. In contrast, our approach uses contextual features captured in the form of only lexical and prosodic cues from previous utterances. Such a scheme is computationally inexpensive and facilitates robust online decoding that can be performed alongside automatic speech recognition. We evaluate our proposed approach through experiments on the Maptask [24] and Switchboard-DAMSL [59] corpora.

3.2 Maximum entropy model for dialog act tagging

We use a maximum entropy sequence tagging model for automatic DA tagging. We model the prediction problem as a classification task: given a sequence of utterances u_i in a dialog U = u_1, u_2, ..., u_n and a dialog act vocabulary (d_i \in D, |D| = K), we need to predict the best dialog act sequence D^* = d_1, d_2, ..., d_n:

D^* = \arg\max_D P(D|U) = \arg\max_{d_1,\cdots,d_n} P(d_1,\cdots,d_n \mid u_1,\cdots,u_n)   (3.1)

We approximate this sequence-level global classification problem, using conditional independence assumptions, by a product of local classification problems, as shown in Eq. (3.3). The classifier then assigns to each utterance a dialog act label conditioned on a vector of local contextual features comprising the lexical, syntactic and acoustic information:

D^* = \arg\max_D P(D|U)   (3.2)
    \approx \arg\max_D \prod_{i=1}^{n} P(d_i \mid \Phi(u_{i-k}^{i+l}))   (3.3)
    = \arg\max_D \prod_{i=1}^{n} P(d_i \mid \Phi(W_{i-k}^{i+l}, S_{i-k}^{i+l}, A_{i-k}^{i+l}))   (3.4)

where W is the word sequence, S is the syntactic feature sequence and A the acoustic-prosodic observations belonging to utterances u_{i-k}, ..., u_{i+l}.

To estimate the conditional distribution P(d|\Phi) we use the general technique of choosing the maximum entropy (maxent) distribution that matches the average of each feature over the training data [15]. This can be written as a Gibbs distribution parameterized with weights \lambda_l, where l ranges over the label set and K is the size of the dialog act vocabulary:

P(d|\Phi) = \frac{e^{\lambda_d \cdot \Phi}}{\sum_{l=1}^{K} e^{\lambda_l \cdot \Phi}}   (3.5)

To find the global maximum of the concave objective associated with Eq. (3.5), we use the Sequential L1-Regularized Maxent algorithm (SL1-Max) [31]. Compared to Iterative Scaling (IS) and gradient descent procedures, this algorithm converges faster and provides L1 regularization as well as efficient heuristics to estimate the regularization meta-parameters. We use the machine learning toolkit LLAMA [42] to estimate the conditional distribution using maxent. LLAMA encodes multiclass maxent as binary maxent to increase training speed and to scale the method to large data sets. We use K one-versus-others binary classifiers. Each output label d is projected onto a bit string with components b_j(d), and the probability of each component is estimated independently:

P(b_j(d)|\Phi) = 1 - P(\bar{b}_j(d)|\Phi) = \frac{e^{\lambda_j \cdot \Phi}}{e^{\lambda_j \cdot \Phi} + e^{\bar{\lambda}_j \cdot \Phi}} = \frac{1}{1 + e^{-(\lambda_j - \bar{\lambda}_j)\cdot\Phi}}   (3.6)

where \bar{\lambda}_j is the parameter vector for \bar{b}_j(d). Assuming the bit vector components to be independent, we have

P(d|\Phi) = \prod_{j=1}^{K} P(b_j(d)|\Phi)   (3.7)

Therefore, we can decouple the likelihoods and train the classifiers independently. In this work, we use the simplest and most commonly studied code, consisting of K one-versus-others binary components.
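To make the decomposition in Eqs. (3.5)-(3.7) concrete, the following sketch combines K one-versus-others binary components into a label posterior. It is a plain Python/NumPy stand-in, not the LLAMA implementation; the weight vectors, feature vector and label set are hypothetical, and the final renormalization over labels is a convenience added here.

```python
import numpy as np

def binary_posterior(lam_pos, lam_neg, phi):
    """P(b_j(d) = 1 | phi) for one one-vs-others component, as in Eq. (3.6)."""
    return 1.0 / (1.0 + np.exp(-(lam_pos - lam_neg).dot(phi)))

def label_posteriors(weights, phi):
    """Combine the K independent binary components as in Eq. (3.7).

    For a one-vs-others code the bit string of label d is 1 at position d and
    0 elsewhere, so P(d|phi) is proportional to
    P(b_d = 1|phi) * prod_{j != d} P(b_j = 0|phi).
    """
    pos = {d: binary_posterior(lp, ln, phi) for d, (lp, ln) in weights.items()}
    raw = {}
    for d in pos:
        p = pos[d]
        for j in pos:
            if j != d:
                p *= (1.0 - pos[j])
        raw[d] = p
    z = sum(raw.values())                     # renormalize over the label set
    return {d: p / z for d, p in raw.items()}

# Toy example with 3 dialog acts and 4 features (all values hypothetical).
rng = np.random.default_rng(0)
weights = {d: (rng.normal(size=4), rng.normal(size=4))
           for d in ("statement", "question", "backchannel")}
phi = np.array([1.0, 0.0, 1.0, 1.0])
print(label_posteriors(weights, phi))
```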
The independence assumption of the bit vector components states that the output labels or classes are independent. 3.3 Data The Maptask [24] and Switchboard-DAMSL [59] corpora have been extensively used for dialog act tagging studies. Maptask [24] is a cooperative dialog task involving two 62 participants. The two speakers, instruction giver and instruction follower are engaged in a dialog with the goal of reproducing the instruction giver’s route on the instruction follower’s map. The original dataset was slightly modified for the experiments of the present study. The raw move information was augmented with the speaker information while the non-verbal content (e.g., laughs, background noise) was removed. The Map- task tagging scheme has 12 unique dialog acts; augmented with speaker information this results in 24 tags. The corpus consists of 128 dialogs and 26181 utterances. We used ten-fold cross validation for testing. 1 2 3 4 5 6 7 0 10 20 30 40 50 60 Simplified dialog act classes Number of utterances (%) Statement Acknowledgement Agreement Abandoned Question Appreciation Other Figure 3.1: The distribution of utterances by dialog act tag category in the Switchboard- DAMSL corpus. The Switchboard-DAMSL (SWBD-DAMSL) corpus consists of 1155 dialogs and 218,898 utterances from the Switchboard corpus of telephone conversations, tagged with discourse labels from a shallow discourse tag set. The original tag set of 375 unique tags was clustered to obtain 42 dialog act tags that distinguish mutually exclu- sive utterance types [59]. The interlabeler agreement for this 42-label tag set is 84%, with the labeling performed at the utterance level. In our experiments, we used a set of 63 173 dialogs, selected at random for testing. The test set consisted of 29869 discourse segments. The experiments were performed on the 42 tag vocabulary as well as a simpli- fied tag set consisting of 7 tags. We grouped the 42 tags into 7 disjoint classes, based on the frequency of the classes and grouped the remaining classes into an ”other” category constituting less than 3% of the entire data. This grouping is similar to that presented in [112]. Such a simplified grouping is more generic and hence useful in speech appli- cations that require only a coarse level of DA representation. It can also offer insights into common misclassifications encountered in the DA system. Figure 3.1 shows the distribution of the simplified tag set in the Switchboard-DAMSL corpus. Statements are the most frequent (more than 50%) tags, followed by acknowledgements, abandoned or incomplete utterances and agreements. Questions and appreciations account for roughly 6% and 4% of the total utterances. In the next section, we describe the maximum entropy modeling framework that is used for automatic DA identification in the rest of the work. 3.4 Features for Dialog Act Classification In this section, we describe the lexical, syntactic and prosodic cues used with the pro- posed maximum entropy modeling framework for DA tagging. The lexical, syntactic and prosodic cues extracted from the utterance text and speech signal are encoded as n-gram features and used as input to the maximum entropy framework. We only use features that are derived from the local context of the text being tagged, referred to as static features here on. One would have to perform a Viterbi search if the preceding pre- diction context (dialog act history) were to be used. 
Using static features is especially suitable for performing dialog act tagging in lockstep with automatic speech recogni- tion, as the prediction can be performed incrementally instead of waiting for the entire utterance or dialog to be decoded. This is explained in more detail in Section 3.8. 64 3.4.1 Lexical and syntactic features The lexical features used in our modeling framework are simply the words in a given utterance. We tag the utterances with part-of-speech tags using the AT&T POS tagger. The POS inventory is the same as the Penn treebank which includes 47 POS tags: 22 open class categories, 14 closed class categories and 11 punctuation labels. In addition to the POS tags, we also annotate the utterance with Supertags [11]. Supertags encapsulate predicate-argument information in a local structure. They are the elementary trees of Tree-Adjoining Grammars (TAGs) [57]. Similar to part-of- speech tags, supertags are associated with each word of an utterance, but provide much richer information than part-of-speech tags, as illustrated in the example in Table 3.1. Supertags can be composed with each other using substitution and adjunction opera- tions [57] to derive the predicate-argument structure of an utterance. Table 3.1: Illustration of POS tags and supertags generated for a sample utterance Words: But now seventy minicomputer makers compete for customers POS tags: CC RB NN NN NNS VBP IN NN Supertags: S Conj But S* S Adv now S* NP DT seventy NP* N N minicomputer N* NP N makers S NP↓ VP V compete PP↓ PP P for NP↓ NP N customers There are two methods for creating a set of supertags. One approach is through the creation of a wide coverage English grammar in the lexicalized tree adjoining grammar formalism, called XTAG [136], wherein supertags are the resulting elementary struc- tures. An alternate method for creating supertags is to employ rules that decompose the annotated parse of a sentence in Penn Treebank into its elementary trees [25, 135]. This second method for extracting supertags results in a larger set of supertags. For the experiments presented in this work, we employ a set of 4,726 supertags extracted from the Penn Treebank. 65 In addition to the lexical and syntactic cues, we also use categorical prosody labels predicted from our previously developed maximum entropy automatic prosody labeler [95, 96] to tag the utterances with prosodic labels. The prosody labeler uses lexical (words) and syntactic (parts-of-speech tags and supertags) information to pre- dict binary pitch accent (accent, none) and boundary tone (btone, none) labels for each word. Our prosody labeler was trained on the entire Boston University Radio News cor- pus. Even though the domain is not the same as that of our test corpora, we expect that the syntactic information in the form of POS tags and supertags can offer good gener- alization to circumvent the disparity in the domains. Moreover, we expect the syntax- based prosody labeler to offer additional discriminatory evidence beyond the lexical and syntactic features, as the mapping between prosody and syntax is non-linear. Figure 3.2: Illustration of syntax-based prosody predicted by the prosody labeler. The prosody labeler uses lexical and syntactic context from surrounding words to predict the prosody labels for the current word. 3.4.2 Acoustic-prosodic features Exploiting utterance level intonation characteristics in DA tagging presumes the capa- bility to automatically segment the input dialog into discourse segments. 
Many studies have addressed the problem of automatically detecting utterance boundaries in a dialog using lexical and prosodic cues [114, 71, 5, 77]. However, we do not attempt to address utterance segmentation in this work, and assume access to utterance segmentations marked either automatically or by human labelers. We compute the pitch (f0) and RMS energy (e) of the utterance over 10 msec frame intervals. The pitch values in unvoiced segments are smoothed using linear interpolation. Both the energy and the pitch are normalized with speaker-specific mean and variance (z-norm).

3.5 Dialog Act classification using true transcripts

We first perform DA tagging experiments on clean transcribed data. While such transcripts are typically not available in automated applications, this is a useful preliminary step and offers insight into the classification mistakes made by the classifier when it is trained on lexical information alone. The lexical cues we use are word trigrams from the current utterance; part-of-speech and supertagged utterances constitute the syntactic cues, and prosody-tagged utterances comprise the prosodic cues. In addition, we use speaker identity information (speaker A or B for the particular dialog). The lexical and syntactic cues are encoded as n-gram features and used as input to the maximum entropy classifier. The feature encoding is illustrated in Figure 3.3.

Figure 3.3: Illustration of n-gram feature encoding of lexical, syntactic and syntax-based prosody cues. The n-gram features represent the feature input space of the maximum entropy classifier. "|" denotes a feature input conditioned on the history.

Results of DA tagging using lexical and syntactic features from reference transcripts are presented in Table 3.2.

Table 3.2: Dialog act tagging accuracies (in %) for lexical and syntactic cues obtained from true transcripts with the maximum entropy model. Only the current utterance was used to derive the n-gram features.

  Cues used (current utterance)                   Maptask (12 moves)   SWBD-DAMSL (42 tags)   SWBD-DAMSL (7 tags)
  Chance (majority tag)                           15.6                 39.9                   54.4
  Lexical                                         65.7                 69.7                   81.9
  Lexical + Syntactic                             66.1                 70.0                   82.4
  Lexical + Syntactic + Syntax-based prosody      66.6                 70.4                   82.5

Analysis of the confusion matrix obtained from the 7-way classification of dialog acts in the SWBD-DAMSL corpus indicates that the most common misclassifications are: agreements as acknowledgments, questions as statements, and abandoned utterances as statements. 58% of the agreements in the test set are misclassified as acknowledgments, and 61% of the questions are wrongly classified as statements. These misclassifications predominantly occur because of ambiguous lexical choices for these discourse functions. Table 3.3 shows examples of such misclassifications from the Switchboard-DAMSL corpus when only lexical information is used. In the next section, we demonstrate how one can model the intonation characteristics associated with DA types for improved classification.
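Before turning to acoustic prosody, the encoding of Figure 3.3 can be made concrete with a small sketch. The helper names and exact n-gram templates below are hypothetical (the actual feature templates used with the toolkit may differ); the sketch simply flattens the word, POS, supertag and syntax-based prosody streams of one utterance into the indicator features consumed by the maxent classifier.

```python
def ngram_features(tokens, prefix, n=3):
    """Emit 1..n-gram indicator features from a token stream,
    e.g. 'w:okay_then' for the word bigram 'okay then'."""
    feats = []
    for order in range(1, n + 1):
        for i in range(len(tokens) - order + 1):
            feats.append(f"{prefix}:{'_'.join(tokens[i:i + order])}")
    return feats

def encode_utterance(words, pos_tags, supertags, prosody, speaker):
    """Combine the per-stream n-grams into one sparse binary feature set,
    mirroring the layout of Figure 3.3."""
    feats = [f"spk:{speaker}"]
    feats += ngram_features(words, "w")
    feats += ngram_features(pos_tags, "pos")
    feats += ngram_features(supertags, "stag")
    feats += ngram_features(prosody, "pros")
    return feats

# Toy utterance; the tags are illustrative, not real tagger output.
print(encode_utterance(
    words=["okay", "then"],
    pos_tags=["UH", "RB"],
    supertags=["t_okay", "t_then"],
    prosody=["accent", "none"],
    speaker="A"))
```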
Table 3.3: Examples of misclassifications due to lexical ambiguity from the Switchboard-DAMSL corpus

  Utterance                                        Reference tag   Hypothesized tag
  Yeah                                             Agreement       Acknowledgement
  right                                            Agreement       Acknowledgement
  you just needed a majority                       Question        Statement
  someone had to figure out what was going on      Question        Statement

3.6 Dialog Act classification using acoustic prosody

Given that dialog act classification is typically a downstream application operating on speech input, in this section we present a maximum entropy framework for modeling acoustic-prosodic features in dialog act tagging. As mentioned in Section 3.4.2, we do not attempt to address utterance segmentation in this work. The experiments are performed only on the SWBD-DAMSL corpus, since the Maptask corpus is not accompanied by utterance-level segmentation. The utterance-level segmentation for the SWBD-DAMSL annotations was obtained from the Mississippi State resegmentation of the Switchboard corpus [44]. These segmentations were checked for inconsistencies and cleaned up further. The pitch and energy contours were extracted as explained in Section 3.4.2.

3.6.1 Related work

Before describing our proposed prosodic representation for DA tagging, we present a brief overview of previous work that has used prosodic cues for dialog act classification. The use of prosodic cues in DA classification hinges on two main factors: the type of prosodic representation (categorical or continuous) and the framework used to integrate the prosodic representation with lexical and syntactic cues. Three main representations of the intonation contour have been used in previous work:

(i) raw or normalized acoustic correlates of intonation, such as pitch contour, duration and energy, or transformations thereof [119, 112];
(ii) discrete categorical representations of prosody through pitch accents and boundary tones [16, 106];
(iii) parametric representations of the pitch contour [138, 124].

Stolcke et al. [119] used prosodic decision trees to model raw/normalized acoustic correlates of prosody. They used correlates of duration, pause, pitch, energy and speaking rate as features in the classification. An HMM-based generative model was used for classification: the likelihoods due to multiple knowledge sources were decoupled, and a prosodic decision tree classifier was used during training to estimate the likelihood of the dialog acts (obtained from the posterior probability through Bayes' rule). On the Switchboard-DAMSL dataset, they report a dialog act labeling accuracy of 38.9% using prosody alone (chance being 35%). Using the reference word transcripts and preceding discourse context in an n-gram modeling framework, they obtained 71% accuracy. The combined use of prosody, discourse context and lexical cues from erroneous recognition output resulted in an accuracy of 65%. Ries [107] and Fernandez & Picard [33] have also used raw acoustic correlates of prosody for DA classification, with neural networks and support vector machines, respectively.

Discrete categorical representations can be effective in characterizing pitch excursions associated with sentence types [92]. Reithinger et al. [106] and Black and Campbell [16] have used symbolic representations of prosodic events as additional features in dialog act tagging for S2S translation and text-to-speech synthesis, respectively. However, automatic detection of detailed categorical representations is still a topic of ongoing research.
Parametric approaches that are data-driven provide a configurational description of the macroscopic intonation contour. Yoshimura et al. [138] proposed clustering utterances based on vector-quantized f0 patterns and regression fits. Taylor et al. [124] demonstrated the use of parametric representations of the pitch contour for dialog act modeling in speech recognition. On a subset of the Maptask corpus (the DCIEM Maptask corpus), they achieved an accuracy of 42% using intonation alone; using both intonation and dialog history, their system correctly identified dialog acts 64% of the time. The drawback of such an approach is that it requires segmentation of the prosodic contour into intonational events, which is not easy to obtain automatically.

In the next section, we propose an n-gram feature representation of the prosodic contour that is subsequently used within the maxent framework for DA tagging. We also compare the proposed maximum entropy intonation model with the summary-statistics representation of acoustic correlates used in previous work [112].

3.6.2 Maximum entropy intonation model

We quantize the continuous acoustic-prosodic values by binning and extract n-gram features from the resulting sequence. Such a representation differs from the approach commonly used in DA tagging, where representative statistics of the prosodic contour are computed [112]. The n-gram features derived from the pitch and energy contours are modeled using the maxent framework described in Section 3.2. For this case, Eq. (3.3) becomes

D^* \approx \arg\max_D \prod_{i=1}^{n} p(d_i \mid \Phi(A, i)) = \arg\max_D \prod_{i=1}^{n} p(d_i \mid a_i)   (3.8)

where a_i = \{a_i^1, \cdots, a_i^{k_{u_i}}\} is the acoustic-prosodic feature sequence for utterance u_i and k_{u_i} is the number of frames used in the analysis.

Figure 3.4: Illustration of the quantized feature input to the maxent classifier. "|" denotes a feature input conditioned on the value of the preceding element in the acoustic-prosodic sequence.

We fixed the analysis window to the last 100 frames (k_{u_i}) of the discourse segment, corresponding to 1 second. The length of the window was determined empirically through optimization on a held-out set. The normalized prosodic contour was uniformly quantized into 10 bins, and bigram features were extracted from the sequence of frame-level acoustic-prosodic values (higher-order n-grams did not result in any significant improvement). Even though the quantization is lossy, it reduces the "vocabulary" of the acoustic-prosodic features and hence offers better estimates of the conditional probabilities. To test the sensitivity of the proposed framework to errors in utterance segmentation, we also varied the end points of the actual boundary by ±20 frames. There was no significant degradation in performance within this window; however, performance dropped for segmentation errors beyond ±20 frames. Thus, the proposed model also offers some robustness to errors in utterance segmentation. The results of the maxent intonation model are presented in Table 3.5.

3.6.3 Comparison with acoustic correlates of prosody

Acoustic correlates of prosody refer to simple transformations of pitch, intensity and duration extracted from the fundamental frequency (f0) contour, the energy contour and segmental durations derived from automatic alignment, respectively. Such features have been shown to be beneficial in disfluency detection [70], topic segmentation [51], sentence boundary detection [71] and dialog act detection [112].
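Returning briefly to the representation of Section 3.6.2, the sketch below illustrates the quantized n-gram features under the settings stated in the text (last 100 frames, 10 uniform bins, bigrams). The function name is hypothetical, and a simple window-level z-norm stands in for the speaker-specific normalization described in Section 3.4.2.

```python
import numpy as np

def quantized_ngram_features(contour, prefix, n_bins=10, window=100, order=2):
    """Quantize the last `window` frames of a prosodic contour into `n_bins`
    uniform bins and emit n-gram features over the bin indices."""
    x = np.asarray(contour, dtype=float)[-window:]
    x = (x - x.mean()) / (x.std() + 1e-8)          # stand-in z-normalization
    edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]  # 10 uniform bins
    symbols = np.digitize(x, edges)                # frame-level bin indices 0..9
    feats = []
    for i in range(len(symbols) - order + 1):
        gram = "_".join(str(s) for s in symbols[i:i + order])
        feats.append(f"{prefix}:{gram}")
    return feats

# Toy example: a rising pitch contour, as at the end of a yes-no question.
f0 = np.linspace(-1.0, 2.0, 120) + 0.05 * np.random.randn(120)
features = quantized_ngram_features(f0, prefix="f0")
print(features[:5], "...", len(features), "bigram features")
```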
The derived features are also normalized through a variety of speaker and utterance specific normalization techniques to account for the variability across speakers. The major drawback of such a representation is that it is lossy and is not consistent with the suprasegmental theory of prosody that advocates a sequential or continuous model of acoustic correlates over longer durations [85]. The primary motivation for this experiment is to compare the n-gram feature rep- resentation of the prosodic contour with previous approaches that have used acoustic correlates of prosody [112]. We extracted a set of 28 features from the pitch and energy contour of each utterance. These included duration of utterance, statistics of the pitch contour (e.g., mean and range of f0 over utterance, slope of f0 regression line) and energy contour (e.g., mean and range of rms energy). The features are directly borrowed from [112] and a decision tree classifier (J48 in WEKA toolkit [133]) was trained on the prosodic features for DA classification. The features that were used are summarized in Table 3.4. In order to compare the n-gram feature representation (presented in Section 3.6.2) with that of using acoustic correlates, we also fit a decision tree to the n-gram features. The results are presented in Table 3.5. Results indicate that the n-gram feature represen- tation performs better than using acoustic correlates, and offers an absolute improvement of 6.4% in classification accuracy. The maxent model with the n-gram features offers 73 Table 3.4: Acoustic correlates used in the experiment, organized by duration, pitch and energy categories. Features used Description ling dur duration of utterance f0 mean good utt mean of f0 values above f0 min f0 mean n difference between mean f0 of utterance and mean f0 of convside for f0 values > f0 min f0 mean ratio ratio of f0 mean in utterance to f0 mean in convside f0 mean zcv f0 mean in utterance normalized by mean and std dev of f0 values in convside f0 sd good utt std dev of f0 values in utterance f0 sd n log ratio of std dev of f0 values in utterance and in convside f0 max n log ratio of max f0 values in utterance and in convside f0 max utt maximum f0 value in utterance (no smoothing) max f0 smooth maximum f0 value in smoothed f0 contour f0 min utt minimum value of f0 in utterance (no smoothing) utt grad linear regression slope over all points over utterance pen grad linear regression slope over penultimate 200ms of utterance end grad linear regression slope over final 200ms of utterance end f0 mean mean f0 in final 200ms region pen f0 mean mean f0 in penultimate 200ms region abs f0 diff difference between mean f0 of end and penultimate regions rel f0 diff ratio of f0 of final and penultimate regions norm end f0 mean mean f0 in final region normalized by mean and std deviation in convside norm pen f0 mean mean f0 in penultimate region normalized by mean and std deviation in convside norm f0 diff difference between mean f0 of final and penultimate regions, normalized by mean and std dev of f0 from convside utt nrg mean mean RMS energy in utterance abs nrg diff difference between RMS energy of final and penultimate 200ms regions end nrg mean mean RMS energy in the final 200ms region norm nrg diff normalized difference between mean RMS energy of final and penultimate regions rel nrg diff ratio of mean RMS energy of final and penultimate regions further improvement compared to the decision tree classifier. 
This may be attributed to the integrated feature selection and modeling offered by the maxent framework. The results also clearly demonstrate the suitability of the proposed n-gram representation for exploiting prosody in DA tagging. Closer analysis of the predictions made by the maxent intonation model (for the simplified SWBD-DAMSL tag set) indicates that the majority of correct predictions are for statements and acknowledgements, with per-dialog-act accuracies of 76% and 56%, respectively. The precision and recall for the other categories are very low (less than 1%). In other words, even though the maxent intonation model performs much better than chance, the majority of its correct predictions are limited to the two most frequent tags in the DA vocabulary. To evaluate the complementarity of our intonation model with respect to lexical information, in the next section we perform DA tagging on both clean and recognized transcripts in conjunction with the n-gram prosodic contour representation.

Table 3.5: Accuracies (%) of DA classification experiments on the Switchboard-DAMSL corpus for different prosodic representations

  Prosodic representation                     42 tags   7 tags
  Chance (majority tag)                       39.9      54.4
  Acoustic correlates + decision tree         45.7      60.5
  n-gram acoustic features + decision tree    52.1      66.3
  n-gram acoustic features + maxent           54.4      69.4

3.7 Dialog Act tagging using recognized transcripts

In most speech processing applications, dialog act tagging is performed either simultaneously with front-end automatic speech recognition (ASR) or as a post-processing step. The lexical information at the output of ASR is typically noisy due to recognition errors. Thus, modeling intonational characteristics of discourse segments that are independent of the hypothesized words can offer robustness in DA classification. To evaluate our framework on ASR output, the 29869 test utterances were decoded with an ASR setup. The acoustic model for first-pass decoding was a speaker-independent model trained on 220 hours of telephone speech from the Fisher English corpus. The language model (LM) was interpolated from the SWBD-DAMSL training set (182K words) and the Fisher English corpus (1.5M words). The final hypothesis was obtained after speaker-adaptive training using constrained maximum likelihood linear regression on the first-pass lattice. The word error rate (WER) for the test utterances was 34.4% (the decoding was performed on all 29K utterances for comparison across experiments; the standard deviation of the WER was 14.0%). While this is a relatively high WER, the experiment is intended to provide insight into DA tagging on noisy text.

Table 3.6: Dialog act tagging accuracies (in %) using lexical and prosodic cues for true and recognized transcripts with the maximum entropy model. Only the current utterance was used to derive the n-gram features.

  Cues used (current utterance)      SWBD-DAMSL (42 tags)   SWBD-DAMSL (7 tags)
  True transcripts                   69.7                   81.9
  Recognition output                 52.3                   65.7
  Recognition output + acoustics     55.1                   69.9

Table 3.6 presents DA tagging results using lexical information from reference transcripts (true words) and recognition hypotheses. The accuracy using recognized words is 52.3%, compared to 69.7% using the true transcript. The use of prosodic information in conjunction with the words obtained from the recognition output provides a relative improvement of 5.35%. The maxent models described so far use cues from the current utterance only. In the next section, we demonstrate how dialog context can be exploited in our framework.
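As a preview, the context features used in the next section are purely "static": no dialog act predictions are fed back into the model. The sketch below (hypothetical helper name, taking per-utterance feature sets such as those produced by the encoder sketched earlier) simply prepends position-tagged features from the k previous utterances to the current utterance's feature set.

```python
def encode_with_history(utterance_feats, history_feats, k=3):
    """Build the static context feature set for one utterance.

    utterance_feats: n-gram features of the current utterance.
    history_feats:   list of feature sets for preceding utterances
                     (most recent last); dialog act tags are never used.
    k:               number of previous utterances to include (3 in the text).
    """
    feats = list(utterance_feats)
    for dist, prev in enumerate(reversed(history_feats[-k:]), start=1):
        # Prefix each history feature with its distance from the current
        # utterance so the model can weight near and far context differently.
        feats += [f"prev{dist}:{f}" for f in prev]
    return feats

# Toy usage: the current utterance plus two preceding utterances.
cur = ["w:right", "pros:accent"]
hist = [["w:you_know"], ["w:is_that_so", "pros:btone"]]
print(encode_with_history(cur, hist, k=3))
```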
3.8 Dialog Act tagging using history

The dialog act tags that characterize discourse segments in a dialog typically depend on the preceding context. For example, Questions are usually followed by Statements or Acknowledgments, and Agreements often follow Statement-opinions. This aspect of dialog acts is usually captured by modeling the prior distribution of dialog act tags as a k-th order Markov process, k being the number of preceding dialog act labels. Such an n-gram discourse model of DA tags, coupled with locally decomposable likelihoods, can be viewed as a k-th order hidden Markov model (HMM). An HMM-based representation of DA tagging, with states corresponding to DAs and observations corresponding to utterances, coupled with a discourse LM, facilitates efficient dynamic programming to compute the most probable DA sequence using the Viterbi algorithm [119, 124, 56]. Mathematically, HMM-based DA tagging can be expressed as

D^* = \arg\max_D P(D|U) = \arg\max_D P(U|D) \cdot P(D)   (3.9)

The main drawback of such an approach is that one has to wait for the completion of the entire conversation before decoding; optimal decoding can therefore be performed only offline. One way to overcome this problem is to use a greedy decoding approach that applies a discourse LM over the DA tag predicted at each utterance. However, such an approach is clearly suboptimal and can be further exacerbated when applied to noisy ASR output. The results of such a greedy decoding scheme are presented in Tables 3.7 and 3.8.

In contrast to the above methods, we argue for a DA tagging model that uses context history in the form of only n-gram lexical and prosodic features from the previous utterances. Our objective is to approximate discourse context information indirectly using acoustic and lexical cues. Such a scheme facilitates online DA tagging, and consequently the decoding can be performed incrementally during automatic speech recognition. Even though the proposed scheme may still be suboptimal, it offers robustness in the decoding process, unlike greedy decoding schemes that can propagate errors. We compare the proposed use of "static" contextual features with the scenario in which one has accurate knowledge of the previous DA tags. Such a comparison illustrates the gap between the best-case scenario (optimal decoding with a bigram discourse LM using the Viterbi algorithm will be less than or equal to this oracle performance; the greedy approach may be worse) and the performance that can be achieved using only the lexical and prosodic cues from previous utterances. The results are presented in Table 3.7.

Table 3.7: Dialog act tagging accuracies (in %) using preceding context. "Current utterance" refers to the lexical+syntactic+prosodic cues of the current transcribed utterance; "prev utterance" refers to the lexical+syntactic+prosodic cues from a previous utterance.
  Model            Cues used                                        Maptask (12 moves)   SWBD-DAMSL (42 tags)   SWBD-DAMSL (7 tags)
  Greedy decoding  current utterance + bigram discourse LM          60.1                 54.4                   76.4
  Greedy decoding  current utterance + trigram discourse LM         58.2                 54.9                   76.8
  Maxent           current utterance                                66.6                 70.4                   82.5
  Maxent           current utterance + 1 prev DA tag (oracle)       74.3                 74.4                   82.9
  Maxent           current utterance + 2 prev DA tags (oracle)      75.1                 75.8                   83.0
  Maxent           current utterance + 3 prev DA tags (oracle)      75.2                 76.0                   83.1
  Maxent           current utterance + 1 prev utterance             70.1                 71.2                   82.7
  Maxent           current utterance + 2 prev utterances            70.0                 71.8                   82.6
  Maxent           current utterance + 3 prev utterances            69.9                 72.0                   82.6

Table 3.8: Dialog act tagging accuracies (in %) using preceding context. "Recognized utterance" refers to the lexical+syntactic+prosodic cues of the current ASR-hypothesized utterance; "prev utterance" refers to the lexical+syntactic+prosodic cues from the preceding ASR hypotheses.

  Model            Cues used                                           Maptask (12 moves)   SWBD-DAMSL (42 tags)   SWBD-DAMSL (7 tags)
  Greedy decoding  recognized utterance + trigram discourse LM         -                    47.63                  57.27
  Maxent           recognized utterance + 3 prev DA tags (oracle)      -                    59.7                   73.9
  Maxent           recognized utterance + 3 prev utterances            -                    56.2                   70.8

The best-case scenario, assuming accurate knowledge of the words and the previous dialog act tag (bigram discourse context), results in a DA classification accuracy of 74.4% (see Table 3.7). A greedy decoding approach with the HMM-based framework and a bigram discourse language model yields a DA tagging accuracy of 54.4%, which is much lower than when oracle information about the previous dialog act tag is available. On the other hand, using only the lexical and prosodic information from one previous utterance yields 71.2%. The use of only static features from previous utterances is computationally inexpensive, and the framework is more robust than using greedy DA predictions for each utterance. Adding context from three previous utterances results in a classification accuracy of 72% (context beyond three previous utterances did not result in any significant improvement). Similar trends are observed for DA classification using the ASR output. It is interesting that there is an accuracy drop of only 3-4% when the context is represented by the lexical and prosodic content of previous utterances, compared to accurate (oracle) knowledge of the previous DAs. Such a scheme is clearly beneficial in speech applications that require online decoding of dialog act tags.

3.9 Dialog Act tagging using right context

Conventional dialog act tagging schemes [58, 119, 56] typically use a dialog act grammar to predict the most probable next dialog act based on the previous ones. Exploiting discourse context in this manner offers a convenient way of modeling the prior distribution of dialog acts in a generative model for dialog act tagging. Often, an n-gram model is chosen as a computationally convenient discourse grammar, as it allows for efficient decoding in the HMM framework. While the HMM-based approach to DA tagging is intuitive and desirable in many left-to-right decoding systems, in this section we are interested in evaluating the usefulness of right context in DA tagging. Since our maximum entropy model decomposes the sequence labeling problem into local classification problems, we can exploit the right context of the current utterance during tagging.
In this case, Eq. (3.3) becomes

D^* = \arg\max_D P(D|U) \approx \arg\max_D \prod_{i=1}^{n} P(d_i \mid \Phi(u_i^{i+l}))
    = \arg\max_D \prod_{i=1}^{n} P(d_i \mid \Phi(W_i^{i+l}, S_i^{i+l}, A_i^{i+l}))   (3.10)

where W is the word sequence, S is the syntactic feature sequence and A the acoustic-prosodic observations belonging to utterances u_i, ..., u_{i+l}, i.e., the current and future utterances. Table 3.9 shows the results of using right context (words, part-of-speech tags, supertags, syntax-based prosody and acoustics-based prosody of future utterances) in the maximum entropy framework. As in Section 3.8, we use only the lexical, syntactic and prosodic information instead of the actual dialog act tags. The results indicate that, for the Switchboard-DAMSL corpus, the improvement from adding right context to the current utterance follows a similar trend to that from adding left context. However, for the Maptask corpus, adding right context (one next utterance) results in a degradation of about 2.6% in DA tagging accuracy compared with using left context (one previous utterance). Hence, the experimental results indicate that right context is not as beneficial for DA tagging as left context.

Table 3.9: Dialog act tagging accuracies (in %) using right (following) context. "Current utterance" refers to the lexical+syntactic+prosodic cues of the current transcribed utterance, "next utterance" refers to the lexical+syntactic+prosodic cues from the succeeding utterance, and "recognized utterance" refers to the utterance hypothesized by ASR.

  Cues used                                   Maptask (12 moves)   SWBD-DAMSL (42 tags)   SWBD-DAMSL (7 tags)
  Current utterance                           66.6                 70.4                   82.5
  Current utterance + 1 next utterance        67.4                 71.4                   82.8
  Current utterance + 2 next utterances       67.3                 71.4                   82.7
  Current utterance + 3 next utterances       67.0                 71.3                   82.6
  Recognized utterance + 3 next utterances    -                    56.1                   70.7

3.10 Discussion

The maximum entropy framework for DA tagging presented here is not restricted to the data sets used in this work. It is generalizable and can be used for multiple tasks that require the joint use of lexical, syntactic, prosodic and additional cues for identifying dialog acts. Previous work on automatic DA tagging has mainly used lexico-syntactic information in the form of orthographic words and parts-of-speech. In this work, we exploited richer syntactic information, such as supertags, as well as prosody predicted from lexical and syntactic cues. These features offer a relative improvement of about 1-1.3% over using lexical information alone.

The proposed n-gram representation of the prosodic contour is trained with a regularized maximum entropy classifier and thus avoids overfitting. In previous work, we have also demonstrated the suitability of such a representation for categorical prosody detection [96] and achieved state-of-the-art results. The prosodic representation coupled with the maxent model achieves an accuracy of 54.4% on the SWBD-DAMSL corpus. Previous work on the SWBD-DAMSL corpus with intonational cues [119] achieved an accuracy of 38.9% (chance being 35%). While a direct comparison with our work is not possible due to different training and test splits of the data, our test set consists of about 29K utterances, much larger than the 4K test set used in [119].

To evaluate the complementarity of the lexico-syntactic and prosodic evidence, we performed a correlation analysis on the DA predictions made using the two streams of information. We computed Yule's Q statistic [66] for the two classifiers with different features.
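For reference, Yule's Q can be computed from the joint per-sample correctness counts of the two classifiers. The formula used below, Q = (N11 N00 − N10 N01) / (N11 N00 + N10 N01), is the standard definition and is our addition; it is not spelled out in the text.

```python
def yules_q(correct_a, correct_b):
    """Yule's Q statistic between two classifiers, given boolean vectors
    marking which test samples each classifier labeled correctly."""
    pairs = list(zip(correct_a, correct_b))
    n11 = sum(a and b for a, b in pairs)                  # both correct
    n00 = sum((not a) and (not b) for a, b in pairs)      # both wrong
    n10 = sum(a and (not b) for a, b in pairs)            # only A correct
    n01 = sum((not a) and b for a, b in pairs)            # only B correct
    num = n11 * n00 - n10 * n01
    den = n11 * n00 + n10 * n01
    return num / den if den else 0.0

# Toy example: two classifiers that often agree -> Q around 0.33 here.
a = [True, True, True, False, True, False, True, False]
b = [True, True, False, False, True, True, True, True]
print(round(yules_q(a, b), 3))
```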
The value of Q can vary between -1 and 1, with Q taking a value of 0 for sta- tistically independent classifiers. Classifiers that tend to recognize the same samples correctly will thus have positive values of Q. The value of Q for classifiers using lexico- syntactic (true transcripts) and prosodic evidence is 0.85, indicating that the outputs of the two classes are highly correlated. This also explains the relatively small improve- ment (0.7%) when the prosodic features are added to the classifier using only lexical and syntactic cues. On the other hand, the Q value between recognized transcripts and prosodic cues is 0.64, which in turn can be attributed to the higher improvement (2.8%) when prosodic features are added to the recognition output. The DA tagging experiments reported on ASR output were performed on the entire test set for consistency across experiments. Our primary motivation was to evaluate the contribution of our intonation model when used with noisy text. We were not concerned with tuning the recognizer to obtain the best performance. However, it is easy to see that the DA tagging accuracy is directly related to the WER of the recognition system. For example, the DA tagging accuracy on a subset of SWBD-DAMSL utterances with 22.0% WER was 64.6%, in comparison with 52.3% accuracy on the entire test set with 34.4% WER. The proposed use of dialog context from lexical, syntactic and prosodic cues of pre- vious utterances performs well in comparison with previous work [119] that used the entire conversation for offline optimal decoding. On the SWBD-DAMSL corpus, Stol- cke et al. [119] achieved DA tagging accuracy of 71.0% with a bigram discourse model on true transcripts, while our framework achieves 72.0% accuracy. The best accuracy of 70.1% reported on the Maptask corpus also compares favorably to previous work [24] 82 reported on this corpus. The results indicate that exploiting discourse history informa- tion through actual lexical, syntactic and prosodic evidence is as good as representing them through a dialog act discourse model. Further, such a discourse context is lim- ited to about 3 previous utterances. Adding further context does not offer additional knowledge in predicting the dialog act tag of the current utterance. 3.11 Conclusion and Future Work We presented a maximum entropy discriminative model that jointly exploits lexical, syntactic and prosodic cues for automatic dialog act tagging. First, we presented a novel representation scheme for exploiting the intonational properties associated with certain dialog act categories. The n-gram feature representation of the prosodic contour, coupled with the maximum entropy learning scheme is favorable for the task of dis- tinguishing dialog acts based on intonation alone. The proposed feature representation outperforms conventional techniques such as extracting representative statistics such as mean, slope, variance, etc., from the acoustic correlates of prosody. It also supports the suprasegmental theory of prosody that advocates a sequential or continuous model of acoustic correlates over longer durations. Specifically, the n-gram feature represen- tation resulted in an absolute improvement of 6.4% over using the acoustic correlates used in most previous work [112, 119]. We also demonstrated the use of preceding context in terms of lexical, syntactic and prosodic cues from previous utterances for facilitating online DA tagging. 
Our maxi- mum entropy framework approximates the previous dialog act state in terms of observed 83 evidence and hence is not limited to offline DA classification that uses the entire con- versation during the decoding process. Such a scheme also offers more robustness com- pared to greedy decoding procedures, which use a discourse model over DA tag predic- tions at each state. The proposed maxent model achieves DA tagging accuracy of 72% on the SWBD-DAMSL corpus, comparable to the 71% accuracy reported in [119] using offline optimal decoding with a discourse model. Thus, the proposed framework can be used in a variety of speech applications that require online decoding of DA tags. The methods and algorithms presented in this work were supervised. We plan to investigate unsupervised classification of dialog acts with the help of intonation as part of our future work. Another limitation of the current work is that we assume the knowl- edge of utterance boundaries for DA tagging. The problem of automatic sentence bound- ary detection has been well addressed in literature and we intend to evaluate our frame- work on boundaries hypothesized by such a detector. Finally, the HMM-based frame- work and maximum entropy model (with left context) for DA tagging can be applied directly to ASR lattices and thus can enrich the lattices. Such an enriched lattice could be potentially used in speech-to-speech translation. We plan to perform DA tagging on ASR lattices as part of future work. 84 Application to spoken language translation 85 We have presented our contributions in automatic recognition of rich information such as prosodic prominence, phrasing and discourse context. In this part of the thesis, we propose a novel framework for enriching speech-to-speech translation with the rich representations discussed so far. The focus of this part of the thesis is in the realm of human language technology solutions for facilitating cross-lingual interaction. This type of interaction, which is dif- ferent from conventional human-machine dialogs, is computer-mediated interpersonal communication. It is critical that the communicating participants can understand each other. Notably, it is important to capture and convey not only what is being communi- cated (the words) but how something is being communicated (the context). The central premise of our work is that capturing and utilizing rich context – beyond what is con- veyed by words – is essential for facilitating successful cross-lingual interactions in terms of improved information transfer, communication efficiency and social presence. The fundamental difference of the approach presented here from other ongoing efforts in the domain of speech recognition and translation is that it takes a human-centric per- spective that aims at modeling and optimizing end-to-end communication between indi- viduals rather than being driven merely by a technology-centric end goal of improving individual underlying component performances. We argue that the latter, while nec- essary, it is not sufficient. In particular, in addition to lexical information, we aim at capturing, modeling and transferring rich contextual information to aid robust transla- tion and accurate and expressive synthesis in the target language. We investigate the importance of prosodic and discourse information conveyed through spoken language interactions and design algorithms for automatically capturing and incorporating them within the translation framework. 
86 Figure 3.5: Enabling transfer of meaning and style from source language to target lan- guage using enriched and integrated models of translation and synthesis. For clarity, only one direction of the two way path between the dyad in S2S interaction is shown in the figure. 87 Chapter 4 Enriching speech translation with prosody 4.1 Introduction Current speech translation approaches predominantly rely on a pipeline model wherein a speech recognizer transcribes the source language speech into text. Typically, the 1-best ASR hypothesis text is considered for machine translation followed by synthesis into speech in the target language. Such an approach loses the rich information contained in the source speech signal that may be vital for successful communication. It is well known that prosodic and affective aspects of speech are highly correlated with the com- municative intent(s) of the speaker and often complement the information present in the lexical stream. Disregarding such information may result in ambiguous concept transfer in translation (e.g., providing improper utterance chunking; erroneously emphasizing a target language word or phrase). In other cases, key contextual information such as word prominence, emphasis, and contrast can be lost in the speech-to-text conversion. In this chapter, we investigate issues related to accurate capture and transfer of prosodic information – properties that signify aspects of intonation, phrasing, rhythm and empha- sis. Prosodic information has mainly been used in speech translation for utterance seg- mentation [74, 34] and disambiguation [83]. The VERBMOBIL speech-to-speech trans- lation (S2S) system [83] utilized prosody through clause boundaries, accentuation and 88 sentence mood for improving the linguistic analysis within the speech understanding component. The use of clause boundaries improved the decoding speed and disambigua- tion during translation. More recently Aguero et al. [2] have proposed a framework for generating target language intonation as a function of source utterance intonation. They used an unsupervised algorithm to find intonation clusters in the source speech similar to target speech. However, such a scheme assumes some notion of prosodic isomorphism either at word or accent group level. In this work, we incorporate prosodic prominence (represented through categorical pitch accent labels) in a statistical speech translation framework by injecting these labels into the target side of translation. Our approach generates enriched tokens on the tar- get side in contrast to conventional systems that predict prosody from the output of the statistical machine translation using just hypothesized text and syntax. The proposed framework integrates the assignment of prominence to word tokens within the trans- lation engine. Hence, the automatic prosody labeler can exploit lexical, syntactic and acoustic-prosodic information. Furthermore, the enriched target language output can be used to facilitate prosody enriched text-to-speech synthesis, the quality of which is typically preferred by human listeners [120]. A system level illustration of the proposed framework in comparison with conventional S2S systems is presented in Figure 4.1. The rest of the chapter is organized as follows: Section 4.2 describes the automatic prosody labeler used in this work. Section 5.4 contains a summary of the parallel corpora used in the translation experiments. 
Section 4.4 formulates the problem and describes the factored translation models used in our experiments. Section 4.5 summarizes the results of our experiments, and Section 4.6 concludes the chapter with a discussion and directions for future work.

Figure 4.1: Illustration of the proposed scheme in comparison with conventional approaches: (a) conventional speech-to-speech translation, (b) prosody-enriched speech-to-speech translation.

4.2 Automatic prominence labeling

In this section, we describe the classifier used for automatic prominence detection in the rest of the chapter. The classifier was trained on a subset of the Switchboard corpus that had been hand-labeled with pitch accent markers [88]. The corpus is based on about 4.7 hours of hand-labeled conversational speech (excluding silence and noise) from 63 conversations of the Switchboard corpus and one conversation from the CallHome corpus, and contains about 67k word instances (excluding silences and noise). Prominent syllables were marked with "*", indicating a pitch accent (tonally cued prominence), or "*?", indicating possible prominence (i.e., uncertainty about the presence of a pitch accent). We mapped the pitch accent labels on syllables to words for training a word-level pitch accent classifier with two classes, accent and none.

We use a maximum entropy model for prominence labeling, similar to that proposed in Chapter 2. Given a sequence of words w_i in an utterance W = {w_1, ..., w_n}, the corresponding syntactic information sequence S = {s_1, ..., s_n} (e.g., part-of-speech tags, syntactic parse, etc.), a set of acoustic-prosodic features A = {a_1, ..., a_n}, where a_i = (a_i^1, ..., a_i^{t_{w_i}}) is the acoustic-prosodic feature vector corresponding to word w_i with frame length t_{w_i}, and a prosodic label vocabulary (l_i \in L, |L| = V), the best prosodic label sequence L^* = l_1, l_2, ..., l_n is obtained by approximating the sequence classification problem, using conditional independence assumptions, by a product of local classification problems, as shown in Eq. (4.2). The classifier then assigns a prosodic label to each word conditioned on a vector of local contextual features comprising the lexical, syntactic and acoustic information:

L^* = \arg\max_L P(L \mid W, S, A)   (4.1)
    \approx \arg\max_L \prod_{i=1}^{n} p(l_i \mid w_{i-k}^{i+k}, s_{i-k}^{i+k}, a_{i-k}^{i+k})   (4.2)
    = \arg\max_L \prod_{i=1}^{n} p(l_i \mid \Phi(W, S, A, i))   (4.3)

where \Phi(W, S, A, i) = (w_{i-k}^{i+k}, s_{i-k}^{i+k}, a_{i-k}^{i+k}) is the set of features extracted within a bounded local context k. In this work, the lexical features are word trigrams, the syntactic features are trigrams of part-of-speech tags and supertags [12], and the acoustic-prosodic features are the (utterance-)normalized f0 and energy values extracted over 10 ms frames.

We use the machine learning toolkit LLAMA [42] to estimate the conditional distribution P(l_i|\Phi) using maxent. The 10-fold cross-validation performance of the classifier on the subset of the Switchboard corpus described above is presented in Table 4.1 (chance = 67.48%). The pitch accent detection accuracy reported here is close to the state of the art for spontaneous speech from the Switchboard corpus [45]. More details about the automatic prosody labeler can be found in [100].

Table 4.1: Pitch accent detection accuracies for various cues on the prosodically labeled Switchboard corpus.
  Cues used (k=3)                     Pitch accent accuracy (%)
  Lexical                             72.68
  Lexical + Syntactic                 75.90
  Prosodic                            74.34
  Lexical + Syntactic + Prosodic      78.52

4.3 Data

We report experiments on two different parallel corpora of spoken dialogs: Farsi-English and Japanese-English. The Farsi-English data used in this work was collected for doctor-patient mediated interactions in which an English-speaking doctor interacts with a Persian-speaking patient [80]. The corpus consists of 9315 parallel sentences with corresponding audio for each English sentence. The conversations are spontaneous, and the audio was recorded through a microphone (22.5 kHz).

Table 4.2: Statistics of the training and test data used in the experiments (source / English counts for each language pair).

                    Training                          Test
                    Farsi-Eng        Jap-Eng          Farsi-Eng      Jap-Eng
  Sentences         8066             12239            925            604
  Running words     76321 / 86756    64096 / 77959    5442 / 6073    4619 / 6028
  Vocabulary        6140 / 3908      4271 / 2079      1487 / 1103    926 / 567
  Singletons        2819 / 1508      2749 / 1156      903 / 573      638 / 316

The Japanese-English parallel corpus is a part of the "How May I Help You" (HMIHY) [37] corpus of operator-customer conversations related to telephone services. The corpus consists of 12239 parallel sentences with corresponding English-side audio. The conversations are spontaneous, and the audio was recorded over a telephone channel (8 kHz). The statistics of the two corpora are summarized in Table 4.2.

4.4 Enriching translation with prosody

In this section, we formulate the problem of using rich prosodic annotations in speech translation. Let S_s, T_s and S_t, T_t be the speech signals and the equivalent textual transcriptions in the source and target language, respectively, and let L_t be the enriched representation (prosody) for the target speech. We formalize our proposed enriched S2S translation as follows:

S_t^* = \arg\max_{S_t} P(S_t \mid S_s)   (4.4)

P(S_t \mid S_s) = \sum_{T_t, T_s, L_t} P(S_t, T_t, T_s, L_t \mid S_s)   (4.5)
              \approx \sum_{T_t, T_s, L_t} P(S_t \mid T_t, L_t) \cdot P(T_t, L_t \mid T_s) \cdot P(T_s \mid S_s)   (4.6)

where Eq. (4.6) is obtained through conditional independence assumptions. Even though recognition and translation can be performed jointly [75], typical S2S translation frameworks compartmentalize the ASR, MT and TTS components, with each maximized for performance individually:

\max_{S_t} P(S_t \mid S_s) \approx \max_{S_t} P(S_t \mid T_t^*, L_t^*) \times \max_{T_t, L_t} P(T_t, L_t \mid T_s^*) \times \max_{T_s} P(T_s \mid S_s)   (4.7)

where T_s^* is the output of speech recognition, and T_t^* and L_t^* are the target text and enriched prosodic representation obtained from translation. While conventional approaches address the detection of L_t^* separately through postprocessing, we integrate it within the translation process, thereby enabling the use of acoustic-prosodic information in training the translation engine (see Figure 4.1). In this work, we do not address the speech synthesis component and assume that we have access to the reference transcripts or the 1-best recognition hypothesis of the source utterances. The rich annotations (L_t) can be syntactic or semantic concepts [41, 64] or, as in this work, pitch accent labels predicted from the model described in Section 4.2.

4.4.1 Factored translation models for incorporating prominence

Factored translation models [64] have been proposed recently to integrate linguistic information such as part-of-speech, morphology and shallow syntax into conventional phrase-based statistical translation. The framework allows multiple levels of information to be integrated into the translation process itself, instead of incorporating linguistic markers in either preprocessing or postprocessing.
For example, in morphologically rich languages it may be preferable to translate lemma, part-of-speech and morphological information separately and combine the information on the target side to generate the output surface words.

Figure 4.2: Example of a factored translation model (borrowed from [64]). The arcs represent conditional dependencies between the nodes.

Factored translation models have been used primarily to improve word-level translation accuracy by incorporating the factors into phrase-based translation. In contrast, we are interested in integrating factors such as pitch accent labels into speech translation with the objective of maximizing the accuracy of the output factors themselves. By facilitating factored translation with pitch accent labels predicted from prosodic, syntactic and lexical cues, our enriched translation scheme can produce output with improved pitch accent assignment accuracy. Predicting prominence at the output of a conventional S2S system, by contrast, is subject to greater error because of typically noisy translations and the lack of direct acoustic-prosodic information. Figure 4.3 illustrates the types of factored models used in this work.

Figure 4.3: Illustration of the proposed factored translation models to incorporate prominence: (a) Factored model 1, (b) Factored model 2.

Factored model 1 represents joint translation of words and prominence. The phrase translation table obtained for such a model has compound tokens (word+prominence) on the target side. However, with a factored approach we can build the alignments based on the words alone, thus avoiding the data sparsity typically introduced by compound tokens. Factored model 2 translates input words to output words and then generates prominence labels from the output word forms through a generation step.

4.5 Experiments and Results

The translation experiments reported in this work were conducted using the Moses toolkit (http://www.statmt.org/moses) for statistical phrase-based translation. We report results for three scenarios that vary in how the prominence labels are produced in the target language:

1. Postprocessing: the pitch accent labels are produced at the output of the translation block using lexical and syntactic cues from the hypothesized text.
2. Factored model 1: a factored model that translates source words to target words and pitch accents.
3. Factored model 2: a factored model that translates source words to target words, which in turn generate pitch accents.

Table 4.3 summarizes the results in terms of BLEU score [89], lexical selection accuracy and prosodic accuracy. Lexical selection accuracy is measured by the F-measure derived from recall (|Res ∩ Ref| / |Ref| × 100) and precision (|Res ∩ Ref| / |Res| × 100), where Ref is the set of words in the reference translation and Res is the set of words in the translation output. Prosodic accuracy is defined as (# correct pitch accents ∈ (Res ∩ Ref)) / |Res ∩ Ref| × 100. Figure 4.4 illustrates the computation of prosodic accuracy for an example utterance.
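The two evaluation metrics just defined can be sketched directly from their set-based definitions. Treating Res and Ref as word sets and the pitch accent labels as per-word dictionaries is our simplification of the procedure in Figure 4.4; the functions return fractions rather than percentages (multiply by 100 for the scores reported in Table 4.3).

```python
def lexical_fscore(res_words, ref_words):
    """F-measure of the translated word set against the reference word set."""
    res, ref = set(res_words), set(ref_words)
    overlap = res & ref
    if not overlap:
        return 0.0
    recall = len(overlap) / len(ref)
    precision = len(overlap) / len(res)
    return 2 * precision * recall / (precision + recall)

def prosodic_accuracy(res_accents, ref_accents):
    """Fraction of correctly accented words among words present in both the
    hypothesis and the reference (the Res intersected with Ref set).

    res_accents / ref_accents: dicts mapping words to 'accent' or 'none'.
    """
    overlap = set(res_accents) & set(ref_accents)
    if not overlap:
        return 0.0
    correct = sum(res_accents[w] == ref_accents[w] for w in overlap)
    return correct / len(overlap)

# Toy example with a hypothetical hypothesis/reference pair.
ref = {"the": "none", "doctor": "accent", "will": "none", "see": "accent", "you": "none"}
hyp = {"the": "none", "doctor": "accent", "sees": "accent", "you": "accent"}
print(round(lexical_fscore(hyp, ref), 2), round(prosodic_accuracy(hyp, ref), 2))
```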
Figure 4.4: Illustration of the process used to calculate prosodic accuracy.

The reference pitch accent labels for the English sentences were obtained from the automatic prominence labeler described in Section 4.2 using lexical, syntactic and prosodic cues. The language models were trigram models created only from the training portion of each corpus.

Table 4.3: Evaluation metrics for the two corpora used in the experiments (all scores are in %).

                             Farsi-English                      Japanese-English
Translation model    Lexical    BLEU    Prosodic        Lexical    BLEU    Prosodic
                     F-score            accuracy        F-score            accuracy
Postprocessing        56.46     22.90     74.51          78.98     54.01     68.57
Factored model 1      56.18     22.93     80.83          79.00     54.04     80.12
Factored model 2      56.07     22.85     80.57          78.56     53.97     79.56

The results in Table 4.3 indicate that the assignment of correct pitch accents to the target words improves with the use of factored translation models. Factored model 1, which translates input word forms to output word forms and pitch accents, achieves the best performance. We obtain relative improvements of 8.4% and 16.8% in prosodic accuracy for the two corpora in comparison with the postprocessing approach. In the postprocessing approach, the pitch accent classifier was trained on lexical, syntactic and acoustic-prosodic features from clean sentences, but evaluated on possibly erroneous machine translation output. Furthermore, the lack of acoustic-prosodic information at the output of machine translation results in lower prosodic assignment accuracy. The factored models, on the other hand, integrate the pitch accent labels derived from lexical, syntactic and acoustic-prosodic features within the translation framework. Thus, the prosodic accuracy obtained is consistently higher than with the postprocessing scheme.

Table 4.3 also reports translation performance at the word level. For both factored translation models, the word-level BLEU score and lexical selection accuracy are close to those of the baseline model that uses no pitch accent labels within the translation framework. Thus, the improvement in prosodic assignment accuracy is obtained with no significant degradation of the word-level translation performance.

4.6 Discussion and Future Work

It is important to note that the pitch accent labels used in our translation system are predictions from the maxent-based prosody labeler described in Section 4.2. We do not have access to the true reference labels; thus, some amount of error is to be expected in the predictions. Improving the current prosody labeler and developing suitable adaptation techniques are part of future work.

The models proposed in this work may be especially useful for tonal languages such as Chinese, where it is important to associate accurate tones with syllables. Our framework can produce enriched target output by integrating the acoustic-prosodic information during translation, in contrast with conventional S2S translation systems that postprocess the output to predict prominence.

While we have demonstrated that our framework can improve the accuracy of prominence labels in the target language, it can potentially be used to integrate any word-level rich annotation dependent on acoustic-prosodic features (e.g., boundary tones, emotion, etc.). We have not used optimization techniques such as minimum error rate training (MERT) in this work due to the relatively small size of the corpora. The use of such techniques could potentially lead to further improvements.
Finally, the proposed framework needs to be evaluated by including a speech synthesis system that can make use of prosodic markers. We plan to address this as part of future work.

Chapter 5

Enriching speech translation with dialog acts

5.1 Introduction

Machine processing of speech, while it has advanced significantly, is still largely compartmentalized. For instance, automatic speech recognition typically deals with orthographic transcription of the speech and hence is insufficient for capturing context beyond the words. Enriched transcription has emerged as a unifying theme in spoken language processing, combining automatic speech recognition, speaker identification and natural language processing with the goal of producing richly annotated speech transcriptions that are useful both to human readers and to automated programs for indexing, retrieval and analysis. For example, these include punctuation detection [23], topic segmentation, disfluency detection and clean-up [70], semantic annotation, dialogue act tagging [102, 98], pitch accent and boundary tone detection [131, 100], as well as speaker segmentation, recognition, and annotation of speaker attributes. These meta-level tags can be considered an intermediate representation of the context of the utterance, alongside the content provided by the orthography. In this chapter, we are interested in enriching speech translation through the annotation and transfer of dialog acts detected from the source speech signal.

Recent approaches to statistical speech translation have relied on improving translation quality with the use of phrase translation [84, 62]. The quality of phrase translation is typically measured using n-gram precision based metrics such as BLEU [89] and NIST scores. However, in many dialog-based speech translation scenarios, vital information beyond what is robustly captured by words and phrases is carried by the communicative act (e.g., question, acknowledgement, etc.) representing the function of the utterance. Our approach of incorporating dialog act tags in speech translation is motivated by the fact that it is important to capture and convey not only what is being communicated (the words) but also how something is being communicated (the context). Augmenting current statistical translation frameworks with such dialog act tags can potentially improve translation quality and facilitate successful cross-lingual interactions in terms of improved information transfer and communication efficiency.

Dialog act tags have been previously used in the VERBMOBIL statistical speech-to-speech translation system [106]. In that work, the predicted DA tags were mainly used to improve the speech recognition, semantic evaluation, and information extraction modules. A dialog act based translation module in VERBMOBIL was presented in [105]. The module was mainly designed to provide robustness in the translation process in case of defective input from the speech recognition system. Ney et al. [82] proposed a statistical translation framework to facilitate the translation of spoken dialogues in the VERBMOBIL project. Their framework was integrated into the VERBMOBIL prototype system along with the dialog act based approach developed in [105]. Discourse information in the form of speech acts has also been used in interlingua translation systems [67, 76] to map input text to semantic concepts, which are then translated to target text.
In contrast with the approaches that exploit DA tags in the speech recognition module or in an interlingua framework, in this chapter we demonstrate how dialog act tags can be directly exploited in statistical speech translation. We present two speech translation frameworks for exploiting DA tags. The first is a standard phrase-based statistical translation system [62] and the second is a global lexical selection and reordering approach based on translating the source utterance into a bag-of-words [10]. The dialog act tags in our work are obtained from a maximum entropy dialog act tagger (described in Section 5.2) trained on the Switchboard DAMSL corpus [59]. The framework presented in this work is particularly suited for human-human and human-computer interactions in a dialog setting, where information loss due to erroneous content may be compensated to some extent through the correct transfer of the appropriate dialog act. The dialog acts can also potentially be used for imparting correct utterance-level intonation during speech synthesis in the target language. Figure 5.1 shows an example situation where the detection and transfer of dialog act information is beneficial in resolving the ambiguous intention associated with the translation output.

Figure 5.1: Example of enriched speech translation output with dialog act.

The remainder of this chapter is organized as follows: Section 5.2 describes the dialog act tagger used in this work, Section 5.3 formulates the problem, Section 5.4 describes the parallel corpora used in our experiments, and Section 5.5 summarizes our experimental results and provides a performance analysis per dialog act. Section 5.6 concludes the chapter with a discussion and an outline of future work.

5.2 Dialog act tagger

In this work, we use a dialog act tagger trained on the Switchboard DAMSL corpus [59] using a maximum entropy (maxent) model. The Switchboard-DAMSL (SWBD-DAMSL) corpus consists of 1155 dialogs and 218,898 utterances from the Switchboard corpus of telephone conversations, tagged with discourse labels from a shallow discourse tagset. The original tagset of 375 unique tags was clustered to obtain 42 dialog act tags as in [59]. In addition, we grouped the 42 tags into 7 disjoint classes based on their frequency, collecting the remaining classes into an "Other" category constituting less than 3% of the entire data. The simplified tagset consisted of the following classes: statement, acknowledgment, abandoned, agreement, question, appreciation, other.

We use a maximum entropy sequence tagging model for the purpose of automatic DA tagging. Given a sequence of utterances U = u_1, u_2, ..., u_n and a dialog act vocabulary (d_i \in D, |D| = K), we need to predict the best dialog act sequence D^* = d_1, d_2, ..., d_n. The classifier is used to assign to each utterance a dialog act label conditioned on a vector of local contextual feature vectors (\Phi) comprising lexical, syntactic and acoustic-prosodic information:

    D^* = \arg\max_D P(D | U)                                                                  (5.1)

        \approx \arg\max_D \prod_{i=1}^{n} P(d_i | \Phi(u_{i-k}^{i+l}))                        (5.2)

        = \arg\max_D \prod_{i=1}^{n} P(d_i | \Phi(W_{i-k}^{i+l}, S_{i-k}^{i+l}, A_{i-k}^{i+l}))   (5.3)

where W is the word sequence, S is the syntactic feature sequence and A the acoustic-prosodic observation sequence belonging to utterances u_{i-k}, ..., u_{i+l}.

The lexical cues are the words from the current utterance; parts-of-speech and supertagged utterances constitute the syntactic cues.
The acoustic-prosodic cues used were utterance-normalized pitch (f0) and RMS energy (e), computed over 10 msec frame intervals. We used the machine learning toolkit LLAMA [42] to estimate the conditional distribution using maxent. The performance of the maxent dialog act tagger on a test set comprising 29K utterances of the SWBD-DAMSL dataset is shown in Table 5.1.

Table 5.1: Dialog act tagging accuracies for various cues on the SWBD-DAMSL corpus.

                                              Accuracy (%)
Cues used (current utterance)             42 tags      7 tags
Lexical                                     69.7         81.9
Lexical+Syntactic                           70.0         82.4
Lexical+Syntactic+Prosodic                  70.4         82.9

5.3 Enriched translation using DAs

In this section, we formulate the problem of using rich annotations in speech translation. The general problem of enriched statistical speech-to-speech translation can be summarized as follows. If S_s, T_s and S_t, T_t are the speech signals and equivalent textual transcriptions in the source and target language, and L_s is the enriched representation for the source speech, we can formalize our proposed S2S translation as shown in Figure 5.2:

    S_t^* = \arg\max_{S_t} P(S_t | S_s)                                                        (5.4)

    P(S_t | S_s) = \sum_{T_t, T_s, L_s} P(S_t, T_t, T_s, L_s | S_s)
                 = \sum_{T_t, T_s, L_s} P(S_t | T_t, T_s, L_s, S_s) \cdot P(T_t | T_s, L_s, S_s) \cdot P(L_s | T_s, S_s) \cdot P(T_s | S_s)   (5.5)

                 \approx \sum_{T_t, T_s, L_s} P(S_t | T_t, L_s) \cdot P(T_t | T_s, L_s) \cdot P(L_s | T_s, S_s) \cdot P(T_s | S_s)            (5.6)

    \max_{S_t} P(S_t | S_s) \approx \underbrace{\max_{S_t} P(S_t | T_t^*, L_s^*)}_{\text{Augmented Text-to-Speech}} \cdot \underbrace{\max_{T_t} P(T_t | T_s^*, L_s^*)}_{\text{Enriched Machine Translation}} \cdot \underbrace{\max_{L_s} P(L_s | T_s^*, S_s)}_{\text{Rich Annotation}} \cdot \underbrace{\max_{T_s} P(T_s | S_s)}_{\text{Speech Recognition}}   (5.7)

Figure 5.2: Formulation of the proposed enriched speech-to-speech translation framework.

Eq. (5.6) is obtained from Eq. (5.5) through conditional independence assumptions. Even though the recognition and translation can be performed jointly [75], typical S2S translation frameworks compartmentalize the ASR, MT and TTS, with each component maximized for performance individually. In Eq. (5.7), T_s^*, T_t^* and S_t^* are the arguments maximizing the ASR, MT and TTS components respectively, and L_s^* is the rich annotation detected from the source speech signal and text, S_s and T_s^* respectively. In this work, we do not address the speech synthesis part and assume that we have access to the reference transcripts or 1-best recognition hypothesis of the source utterances. The rich annotations (L_s) can be syntactic or semantic concepts [41], prosody [2], or, as in this work, dialog act tags.

5.3.1 Phrase-based translation with dialog acts

One of the currently popular and predominant schemes for statistical translation is the phrase-based approach [62]. Typical phrase-based SMT approaches obtain word-level alignments from a bilingual corpus using tools such as GIZA++ [84] and extract phrase translation pairs from the bilingual word alignment using heuristics. If the SMT system had access to source language dialog acts (L_s), the translation problem may be reformulated as

    T_t^* = \arg\max_{T_t} P(T_t | T_s, L_s) = \arg\max_{T_t} P(T_s | T_t, L_s) \cdot P(T_t | L_s)   (5.8)

The first term in Eq. (5.8) corresponds to a dialog act specific MT model and the second term to a dialog act specific language model. Given a sufficient amount of training data, such a system can possibly generate hypotheses that are more accurate than the scheme without the use of dialog acts. However, for small-scale and limited-domain applications, Eq. (5.8) leads to an implicit partitioning of the data corpus and might generate inferior translations in terms of lexical selection accuracy or BLEU score.
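The implicit partitioning implied by Eq. (5.8) can be pictured with the following minimal sketch, which groups the training bitext by its predicted dialog act so that a DA-specific translation model and language model could be trained on each partition. The data structures, placeholder sentences and tag names are illustrative assumptions rather than the exact pipeline used in this work.

```python
from collections import defaultdict

def partition_by_dialog_act(bitext, da_tags):
    """bitext: list of (source_sentence, target_sentence) pairs;
    da_tags: predicted dialog act tag for each pair (from the tagger of Section 5.2)."""
    partitions = defaultdict(list)
    for pair, tag in zip(bitext, da_tags):
        partitions[tag].append(pair)
    return partitions

# Toy example with placeholder source text; each partition would feed a
# DA-specific phrase table and a DA-specific language model (Eq. 5.8).
bitext = [("<src utterance 1>", "what is the telephone number"),
          ("<src utterance 2>", "okay thank you")]
da_tags = ["question", "acknowledgment"]
for tag, pairs in partition_by_dialog_act(bitext, da_tags).items():
    print(tag, len(pairs))
```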
A natural step to overcome the sparsity issue is to employ an appropriate back-off mechanism that exploits the phrase translation pairs derived from the complete data. A typical phrase translation table consists of five scores for each pair of phrases: the source-to-target phrase translation probability (λ_1), the target-to-source phrase translation probability (λ_2), the source-to-target lexical weight (λ_3), the target-to-source lexical weight (λ_4) and the phrase penalty (λ_5 = 2.718). The lexical weights are the product of word translation probabilities obtained from the word alignments. To each phrase translation table belonging to a particular DA-specific translation model, we append those entries from the baseline model that are not present in the phrase table of the DA-specific translation model. The appended entries are weighted by a factor α:

    (T_s → T_t)_{L_s^*} = (T_s → T_t)_{L_s} ∪ { α · (T_s → T_t)  s.t.  (T_s → T_t) ∉ (T_s → T_t)_{L_s} }   (5.9)

where (T_s → T_t) is a shorthand notation for a phrase translation table, i.e., the mapping between source alphabet sequences and target alphabet sequences, in which every pair (t_{s_1}, ..., t_{s_n}, t_{t_1}, ..., t_{t_m}) has a weight sequence λ_1, ..., λ_5 (five weights). (T_s → T_t)_{L_s} is the DA-specific phrase translation table, (T_s → T_t) is the phrase translation table constructed from the entire data, and (T_s → T_t)_{L_s^*} is the newly interpolated phrase translation table. The interpolation factor α is used to weight each of the four translation scores (phrase translation and lexical probabilities for the bilanguage), with the phrase penalty remaining a constant. Such a scheme ensures that phrase translation pairs belonging to a specific DA model are weighted higher, and also ensures better coverage than a partitioned data set.

5.3.2 Bag-of-words lexical choice and permutation reordering model

The bag-of-words (BOW) translation approach was introduced in previous work [10]. In this section, we extend the bag-of-words approach for enriching translation, where we treat the target sentence as a bag-of-words (BOW) assigned to the source sentence and its corresponding dialog act tag. The objective here is, given a source sentence and its dialog act tag, to estimate the probability of finding a given word in the target sentence. Since each word in the target vocabulary is detected independently, one can use simple binary static classifiers. The classifier is trained with word n-grams and the dialog act from the source sentence T_s, i.e., (BOW_grams(T_s), L_s). During decoding, the words with conditional probability greater than a threshold θ are considered the result of lexical choice decoding. We use a binary maximum entropy technique with L1-regularization for training the bag-of-words lexical choice model:

    BOW^*_{T_t} = { T_t | P(T_t | BOW_grams(T_s), L_s) > θ }                                   (5.10)

For reconstructing the correct order of words in the target sentence, we consider all permutations of the words in BOW^*_{T_t} and weight them by a target language model.
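A minimal sketch of the two decoding steps just described is given below, assuming the per-word probabilities of Eq. (5.10) are already available from the binary classifiers. The toy language model stands in for the DA-specific target language model discussed next, and exhaustive permutation is only practical for short bags of words.

```python
from itertools import permutations

def bow_decode(word_probs, theta, lm_score):
    """word_probs: {target word: P(word | source n-grams, DA tag)} from the binary
    maxent classifiers; theta: selection threshold of Eq. (5.10); lm_score: stand-in
    for the (DA-specific) target language model used to rank permutations."""
    bag = [w for w, p in word_probs.items() if p > theta]        # lexical choice
    return max(permutations(bag), key=lm_score) if bag else ()   # reordering

def toy_lm(seq):
    # Counts "good" bigrams; a real system would use an interpolated n-gram LM.
    good = {("what", "time"), ("time", "is"), ("is", "it")}
    return sum(1 for bigram in zip(seq, seq[1:]) if bigram in good)

word_probs = {"is": 0.9, "what": 0.8, "time": 0.85, "weather": 0.1}
print(bow_decode(word_probs, theta=0.5, lm_score=toy_lm))  # ('what', 'time', 'is')
```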
In this work, we used a separate language model for each dialog act, created by interpolating the DA-specific language model with the baseline language model obtained from the complete data. We control the length of the target sentences either by varying the parameter θ or by adding optional deletion arcs to the final step of the decoding process.

5.4 Data

We report experiments on three different parallel corpora: Farsi-English, Japanese-English and Chinese-English. The Farsi-English data used in this work was collected for doctor-patient mediated interactions in which an English-speaking doctor interacts with a Persian-speaking patient [80]. The corpus consists of 9315 parallel sentences. The Japanese-English parallel corpus is a part of the "How May I Help You" (HMIHY) [37] corpus of operator-customer conversations related to telephone services. The corpus consists of 12239 parallel sentences. The Chinese-English corpus corresponds to the IWSLT06 training set and the 2005 development set, comprising 46K and 506 sentences respectively [90]. The data are traveler task expressions. The statistics of the three corpora are summarized in Table 5.2.

Table 5.2: Statistics of the training and test data used in the experiments.

                                     Training                                               Test
                 Farsi    Eng     Jap     Eng    Chinese     Eng         Farsi   Eng    Jap    Eng   Chinese   Eng
Sentences            8066            12239           46311                   925          604          506
Running words    76321   86756   64096   77959   351060   376615          5442   6073   4619   6028    3826   3897
Vocabulary        6140    3908    4271    2079    11178    11232          1487   1103    926    567     931    898
Singletons        2819    1508    2749    1156     4348     4866           903    573    638    316     600    931

5.5 Experiments and Results

In all our experiments we assume that the same dialog act is shared by a parallel sentence pair. Thus, even though the dialog act prediction is performed for English, we use the predicted dialog act as the dialog act for the source language sentence. We used the Moses toolkit (http://www.statmt.org/moses) for statistical phrase-based translation. The machine learning toolkit LLAMA [42] was used to train the maxent-based BOW model. The language models were trigram models and were trained only on the training portion of each corpus. Due to the relatively small size of the corpora used in the experiments, we could not devote a separate development set to tuning the parameters of either translation scheme presented. Hence, the experiments are performed strictly on the training and test sets reported in Table 5.2 (a very small subset of the data was reserved for optimizing the interpolation factor α and the threshold θ described in Sections 5.3.1 and 5.3.2, respectively).

The lexical selection accuracy and BLEU scores for the three parallel corpora using the two translation schemes described in Sections 5.3.1 and 5.3.2 are presented in Table 5.3. Lexical selection accuracy is measured in terms of the F-measure derived from recall (|Res ∩ Ref| / |Ref| × 100) and precision (|Res ∩ Ref| / |Res| × 100), where Ref is the set of words in the reference translation and Res is the set of words in the translation output.

Table 5.3: F-measure and BLEU scores for the two different translation schemes with and without the use of dialog act tags.

                                                     F-score (%)                           BLEU (%)
Framework                  Language pair    w/o DA      w/ DA tags              w/o DA      w/ DA tags
                                             tags     7 tags   42 tags            tags     7 tags   42 tags
Phrase-based translation   Farsi-Eng        56.46     57.32     57.74            22.90     23.50     23.75
                           Japanese-Eng     79.05     79.40     79.51            54.15     54.21     54.32
                           Chinese-Eng      65.85     67.24     67.49            48.59     52.12     53.04
BOW model                  Farsi-Eng        58.00     59.14     59.35            15.95     16.99     17.12
                           Japanese-Eng     79.50     79.82     79.93            42.54     44.70     44.98
                           Chinese-Eng      68.83     69.70     69.91            54.76     55.98     56.14

For both statistical translation frameworks, adding dialog act tags (with either the 7- or 42-tag vocabulary) consistently improves both the lexical selection accuracy and the BLEU score for all the language pairs. While the BOW model provides higher lexical selection accuracy, the phrase-based translation provides a better BLEU score.
In the BOW model, we detect each word in the target vocabulary independently and reorder the bag-of-words separately. The framework focuses on maximizing the occurrence of target words in the context of a given source sentence. Further, the permutation model used for reordering is still inferior to state-of-the-art reordering techniques. Hence, the lexical selection accuracy reported in this work is high in comparison with the BLEU score. On the other hand, phrase-based translation produces a bag-of-phrases in the target language which are reordered using a distortion model. That framework focuses on maximizing the occurrence of target phrases in the context of source phrases and can potentially generate target hypotheses with both high lexical selection accuracy and a high BLEU score (weighted n-gram precision).

The improvements for the Farsi-English and Chinese-English corpora are more pronounced than the improvements for the Japanese-English corpus. This is due to the skewed distribution of dialog acts in the Japanese-English corpus: 80% of the test data are statements, while the other and question categories form 16% and 3.5% of the data, respectively. The most important observation is that appending DA tags in the form described in this work can improve translation performance even in terms of conventional automatic evaluation metrics. However, a performance gain measured in terms of automatic metrics that are designed to reflect only the orthographic accuracy of the translation is not a fair evaluation of the translation quality of the proposed framework. We are currently planning a human evaluation to bring to the fore the usefulness of such rich annotations in interpreting and supplementing typically noisy translations.

5.5.1 Analysis of results

In this section, we present a preliminary investigation into the contribution of each dialog act to the overall improvement in translation quality. We analyze the performance in terms of BLEU score improvements per dialog act. The analysis in this section is performed only for the phrase-based translation results reported in Table 5.3.

Figure 5.3 shows the distribution of dialog acts in the 7-tag dialog act vocabulary across the three corpora used in our experiments. Statements are the most frequent dialog acts, followed by question, other and acknowledgment. Dialog acts such as agreement, appreciation and abandoned occur quite infrequently in the corpora.

Figure 5.3: Distribution of dialog acts (abandoned, acknowledgment, agreement, appreciation, other, question, statement) in the test data of each corpus (DLI, HMIHY, IWSLT); the vertical axis shows the percentage of utterances.

In Table 5.4 we report the BLEU scores per dialog act for the Farsi-English corpus. The table compares the per-DA performance of the phrase-based translation model without and with the use of dialog act information in the translation process. The results indicate that knowledge of discourse context such as question or acknowledgment is most beneficial to the translation process. Knowledge that an utterance is a statement does not offer any significant improvement in the translation. This may be attributed to a lack of systematic structural (syntactic) information or cue words that differentiate statements from other dialog acts. Deeper analysis using the 42-tag vocabulary indicates that dialog acts such as yes-no questions, wh-questions and open questions contribute the most to the BLEU score improvement. Similar trends hold for the Chinese-English corpus.
On the other hand, the BLEU score improvement for the Japanese-English corpus is largely insignificant due to the high proportion of statements in the test corpus.

Table 5.4: BLEU scores (%) per DA tag for the phrase-based translation scheme with and without the use of dialog act tags on the Farsi-English corpus.

                             Translation model
Dialog act           w/o DA tags      w/ DA tags
Statement               20.58            20.57
Question                24.12            26.36
Other                   37.84            41.19
Acknowledgement         51.21            69.30
Appreciation            46.92            73.02
Agreement               18.46            50.00
Abandoned               58.41            58.41

The analysis of the informativeness of dialog acts presented in this section has been based only on automatic evaluation metrics. As we have stressed before, the knowledge of dialog acts in translation may be much more beneficial in a cross-lingual human-computer or human-human interaction scenario that is not dependent on just word (phrase) level objective metrics. We plan to extend our preliminary analysis as part of our future work.

5.6 Discussion and Future Work

It is important to note that the dialog act tags used in our translation system are predictions from the maxent-based DA tagger described in Section 5.2. We do not have access to the reference tags; thus, some amount of error is to be expected in the DA tagging. Despite the lack of reference DA tags, we are still able to achieve modest improvements in translation quality. Improving the current DA tagger and developing suitable adaptation techniques are part of future work.

The phrase translation interpolation described in Section 5.3.1 uses the same interpolation factor (α) for each of the four translation model weights. Further, the same factor is used across all the dialog acts. Optimizing each of the translation model weights separately and independently for each dialog act subset of the data could possibly generate better translations.

The work described here is better suited for translation scenarios that do not involve multiple sentences as part of a turn (e.g., lectures or parliamentary addresses). However, this is not a principled limitation of the proposed work; it can be overcome by using the DA tagger on segmented utterances (or sentences separated by punctuation). Furthermore, the experiments in this work have been performed on reference transcripts. We plan to evaluate our framework on speech recognition output as well as lattices as part of our future work.

While we have demonstrated that using dialog act tags can improve translation quality in terms of word-based automatic evaluation metrics, the real benefits of such a scheme would be manifested through human evaluations. We are currently working on conducting subjective evaluations. Finally, the main objective of this work was to demonstrate the utility of dialog acts in translation. Hence, the systems are not overly tuned or optimized to maximize the evaluation metrics. The results should be interpreted as a comparison between systems that do not have access to dialog acts and those that do.

Chapter 6

Summary and Conclusion

The first part of this dissertation focused on the representation and modeling of suprasegmental events. We focused specifically on prosodic prominence and phrasing, as well as discourse context represented through dialog acts. The derived representations are typically exploited in a classification framework through either a generative or discriminative model, depending on the speech or language processing task.
In this direction, we first presented a discriminative framework for automatic prosody detection through maximum entropy modeling in chapter 2. The prosody detector can predict both prominence and phrasing within the ToBI annotation scheme. Our proposed scheme exploits lexical, syntactic and acoustic information to perform pitch accent, boundary tone and break index detection. Hence, the model is suitable for use in language-only applications that typically have access only to lexical and syntactic information, as well as in speech processing applications that usually have access to acoustics and possibly incorrect lexical information (from ASR output).

In chapter 3 we presented a maximum entropy discriminative model that jointly exploits lexical, syntactic and prosodic cues for automatic dialog act tagging. First, we presented a novel representation scheme for exploiting the intonational properties associated with certain dialog act categories. The n-gram feature representation of the prosodic contour, coupled with the maximum entropy learning scheme, is favorable for the task of distinguishing dialog acts based on intonation alone. The proposed feature representation outperforms conventional techniques that extract representative statistics such as mean, slope, variance, etc., from the acoustic correlates of prosody. It also supports the suprasegmental theory of prosody, which advocates a sequential or continuous model of acoustic correlates over longer durations. We also demonstrated the use of preceding context in terms of lexical, syntactic and prosodic cues from previous utterances for facilitating online DA tagging. Our maximum entropy framework approximates the previous dialog act state in terms of observed evidence and hence is not limited to offline DA classification that uses the entire conversation during the decoding process. Such a scheme also offers more robustness than greedy decoding procedures, which use a discourse model over DA tag predictions at each state.

The second part of the dissertation focused on the use of the rich representations proposed in the first part in spoken language translation. We proposed a novel framework for enriching speech-to-speech translation with prosody and dialog acts. In chapter 4 we proposed the use of factored translation models to integrate the assignment and transfer of pitch accents (tonal prominence) during translation. Our framework incorporates prosodic prominence (represented through categorical pitch accent labels) in a statistical speech translation framework by injecting these labels into the target side of translation. Our approach generates enriched tokens on the target side, in contrast to conventional systems that predict prosody from the output of the statistical machine translation using just the hypothesized text and syntax. The proposed framework integrates the assignment of prominence to word tokens within the translation engine. Hence, the automatic prosody labeler can exploit lexical, syntactic and acoustic-prosodic information.

In chapter 5 we demonstrated the use of dialog acts in statistical speech translation. We presented two speech translation frameworks for exploiting DA tags. The first was a standard phrase-based statistical translation system [62] and the second was a global lexical selection and reordering approach based on translating the source utterance into a bag-of-words [10].
The novel framework proposed in this work produces interpretable dialog act annotated target language translations that can improve translation quality and facilitate successful cross-lingual interactions in terms of improved information transfer and communication efficiency. We also demonstrated the utility of dialog acts in improving objective evaluation metrics such as lexical selection accuracy and BLEU in translation.

References

[1] AT&T Natural Voices speech synthesizer. http://www.naturalvoices.att.com.

[2] P. D. Agüero, J. Adell, and A. Bonafonte. Prosody generation for speech-to-speech translation. In Proceedings of ICASSP, Toulouse, France, May 2006.

[3] J. Allen, G. Ferguson, B. Miller, and E. Ringer. Spoken dialogue and interactive planning. In Proceedings of ARPA Speech and Natural Language Workshop, pages 202–207, Austin, Texas, 1995.

[4] A. Ananthakrishnan and S. Narayanan. An automatic prosody recognizer using a coupled multi-stream acoustic model and a syntactic-prosodic language model. In Proceedings of ICASSP, Philadelphia, PA, March 2005.

[5] J. Ang, Y. Liu, and E. Shriberg. Automatic dialog act segmentation and classification in multiparty meetings. In Proc. of ICASSP, 2005.

[6] J. L. Austin. How to do Things with Words. Clarendon Press, Oxford, 1962.

[7] S. Bangalore, G. Di Fabbrizio, and A. Stent. Learning the structure of task-driven human-human dialogs. In Proceedings of ACL, pages 201–208, Sydney, Australia, July 2006.

[8] S. Bangalore, A. Emami, and P. Haffner. Factoring global inference by enriching local representations. Technical report, AT&T Labs-Research, 2005.

[9] S. Bangalore and P. Haffner. Classification of large label sets. In Proceedings of the Snowbird Learning Workshop, 2005.

[10] S. Bangalore, P. Haffner, and S. Kanthak. Statistical machine translation through global lexical selection and sentence reconstruction. In Proceedings of ACL, 2007.

[11] S. Bangalore and A. K. Joshi. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2), 1999.

[12] S. Bangalore and A. K. Joshi. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2), June 1999.

[13] M. E. Beckman, M. Diaz-Campos, J. T. McGory, and T. A. Morgan. Intonation across Spanish, in the Tones and Break Indices framework. Probus, 14:9–36.

[14] M. E. Beckman and J. B. Pierrehumbert. Intonational structure in Japanese and English. Phonology Yearbook, 3:255–309.

[15] A. Berger, S. D. Pietra, and V. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

[16] A. W. Black and N. Campbell. Predicting the intonation of discourse segments from examples in dialogue speech. In Proceedings of the ESCA Workshop on Spoken Dialogue Systems, May 1995.

[17] A. W. Black and P. Taylor. Assigning phrase breaks from part-of-speech sequences. In Proc. of EUROSPEECH, volume 2, pages 995–998, Rhodes, Greece, 1997.

[18] A. W. Black, P. Taylor, and R. Caley. The Festival speech synthesis system. http://festvox.org/festival, 1998.

[19] D. L. Bolinger. Intonation across languages. In J. P. Greenberg, C. A. Ferguson, and E. A. Moravcsik, editors, Universals of Human Language, volume 2 of Phonology. Stanford: Stanford University Press, 1978.

[20] M. Breen, L. Dilley, E. Gibson, M. Bolivar, and J. Kraemer. Advances in prosodic annotation: A test of inter-coder reliability for the RaP (Rhythm and Pitch) and ToBI (Tones and Break Indices) transcription systems.
In 19th Annual CUNY Confer- ence on Human Sentence Processing, New York, NY , 2006. [21] J. M. Brenier, D. Cer, and D. Jurafsky. The detection of emphatic words using acoustic and lexical features. In In Proceedings of Eurospeech, 2005. [22] I. Bulyko and M. Ostendorf. Joint prosody prediction and unit selection for con- catenative speech synthesis. In Proc. of ICASSP, 2001. [23] D. Byron, E. Shriberg, and A. Stolcke. Automatic punctuation and disfluency detection in multi-party meetings using prosodic and lexical cues. In Proc. of ICSLP, volume 2, pages 949–952, Denver, 2002. [24] J. Carletta, A. Isard, S. Isard, J. Kowtko, G. Doherty-Sneddon, and A. Anderson. The reliability of a dialogue structure coding scheme. Computational Linguistics, 23:13–31, 1997. 117 [25] J. Chen and K. Vijay-Shanker. Automated extraction of tags from the penn tree- bank. In Proceedings of the 6th International Workshop on Parsing Technologies, Trento, Italy, 2000. [26] K. Chen, M. Hasegawa-Johnson, and A. Cohen. An automatic prosody labeling system using ANN-based syntactic-prosodic model and GMM-based acoustic- prosodic model. In Proceedings of ICASSP, 2004. [27] K. Chen, M. Hasegawa-Johnson, A. Cohen, S. Borys, Sung-Suk Kim, J. Cole, and Jeung-Yoon Choi. Prosody dependent speech recognition on radio news cor- pus of American English. IEEE Transactions on Audio, Speech and Language Processing, 14(1):232–245, 2006. [28] A. Conkie, G. Riccardi, and R. C. Rose. Prosody recognition from speech utter- ances using acoustic and linguistic based models of prosodic events. In Proc. Eurospeech, pages 523–526, Budapest, Hungary, 1999. [29] M. Core and J. Allen. Coding dialogs with the damsl annotation scheme. In Working Notes: AAAI Fall Symposium, Menlo Park, CA, 1997. [30] A. Cruttenden. Intonation. Cambridge University Press, 1989. [31] M. Dudik, S. Phillips, and R. E. Schapire. Performance guarantees for regularized maximum entropy density estimation. In Proceedings of COLT, Banff, Canada, 2004. Springer Verlag. [32] K. Dusterhoff and A. W. Black. Generating f0 contours for speech synthesis using the tilt intonation theory. In Proc. ESCA Workshop on Intonation, pages 107–110, Athens, Greece., 1997. [33] R. Fernandez and R. W. Picard. Dialog act classification from prosodic features using support vector machines. In Proceedings of Speech Prosody, pages 291– 294, 2002. [34] C. F¨ ugen and M. Kolss. The influence of utterance chunking on machine transla- tion performance. In Proceedings of Interspeech, Antwerp, Belgium, 2007. [35] K. Fujii, H. Kashioka, and N. Campbell. Target cost of f0 based on polynomial regression in concatenative speech synthesis. In Proceedings of ICPhS. ICPhS. [36] H. Fujisaki and K Hirose. Modelling the dynamic characteristics of voice fun- damental frequency with application to analysis and synthesis of intonation. In Proceedings of 13th International Congress of Linguists, pages 57–70, 1982. [37] A. Gorin, G. Riccardi, and J. Wright. How May I Help You? Speech Communi- cation, 23:113–127, 1997. 118 [38] E. Grabe, F. Nolan, and K. Farrar. IViE - a comparative transcription system for intonational variation in English. In Proceedings of ICSLP, Sydney, Australia, 1998. [39] A. Gravano, B. Benus, J. Ch´ avez, Hirschberg, and L. Wilcox. On the role of context and prosody in the interpretation of okay. In Proceedings of ACL, Prague, Czech Republic, 2007. [40] M. Gregory and Y . Altun. Using conditional random fields to predict pitch accent in conversational speech. 
In 42nd Annual Meeting of the Association for Compu- tational Linguistics (ACL), 2004. [41] L. Gu, Y . Gao, F. H. Liu, and M. Picheny. Concept-based speech-to-speech trans- lation using maximum entropy models for statistical natural concept generation. IEEE Transactions on Audio, Speech and Language Processing, 14(2):377–392, March 2006. [42] P. Haffner. Scaling large margin classifiers for spoken language understanding. Speech Communication, 48(iv):239–261, 2006. [43] D. Hakkani-Tur, G. Tur, A. Stolcke, and E. Shriberg. Combining words and prosody for information extraction from speech. In Proc. Eurospeech, pages 1991–1994, Budapest, Hungary, 1999. [44] J. Hamaker, N. Deshmukh, A. Ganapathiraju, and J. Picone. Resegmentation and transcription of the SWITCHBOARD corpus. In Proceedings of Speech Tran- scription Workshop, 1998. [45] M. Harper, B. Dorr, B. Roark, J. Hale, Z. Shafran, Y . Liu, M. Lease, M. Snover, L. Young, R. Stewart, and A. Krasnyanskaya. Parsing speech and structural event detection. Technical report, JHU Summer Workshop, 2005. [46] M. Hasegawa-Johnson, K. Chen, J. Cole, S. Borys, S. S. Kim, A. Cohen, T. Zhang, J. Y . Choi, H. Kim, T. J. Yoon, and S. Chavara. Simultaneous recogni- tion of words and prosody in the boston university radio speech corpus. Speech Communication, 46:418–439, 2005. [47] M. Hasegawa-Johnson, J. Cole, C. Shih, K. Chen, A. Cohen, S. Chavarria, H. Kim, T. Yoon, S. Borys, and Jeung-Yoon Choi. Speech recognition models of the interdependence among syntax, prosody, and segmental acoustics. In Pro- ceedings of HLT/NAACL, Workshop on Higher-Level Knowledge in Automatic Speech Recognition and Understanding, May 2004. [48] J. Hirschberg. Pitch accent in context: Predicting intonational prominence from text. Artificial Intelligence, 63(1-2), 1993. 119 [49] J. Hirschberg and D. Litman. Empirical studies on the disambiguation of cue phrases. Computational Linguistics, 19(3):501–530, 1993. [50] J. Hirschberg and C. Nakatani. A prosodic analysis of discourse segments in direction-giving monologues. In Proceedings of the 34th conference on Associa- tion for Computational Linguistics, pages 286–293, 1996. [51] J. Hirschberg and C. Nakatani. Acoustic indicators of topic segmentation. In Proc. Inter. Conf. on Spoken Language Proc., pages 976–979, 1998. [52] J. Hirschberg and P. Prieto. Training intonational phrasing rules automatically for English and Spanish text-to-speech. Speech Commun., 18(3):281–290, 1996. [53] J. Hirschberg and O. Rambow. Learning prosodic features using a tree represen- tation. In Proceedings of Eurospeech, pages 1175–1180, Aalborg, 2001. [54] D. J. Hirst, N. Ide, and J. Vronis. Coding fundamental frequency patterns for multilingual synthesis with INTSINT in the MULTEXT project. In Proceedings of the 2nd ESCA/IEEE Workshop on Speech Synthesis, pages 77–81, September 1994. [55] J.Hart. F0 sylisation in speech : Straight lines versus parabolas. Journal of Acoustics Society of America, 6, 1991. [56] G. Ji and J. Bilmes. Dialog act tagging using graphical models. In Proc. of ICASSP, 2005. [57] A. Joshi and Y . Schabes. Tree-adjoining grammars. In A. Salomaa and G. Rozen- berg, editor, Handbook of Formal Lanaguages and Automata. Springer-Verlag, Berlin, 1996. [58] D. Jurafsky, R. Bates, N. Coccaro, R. Martin, M. Meteer, K. Ries, E. Shriberg, A. Stolcke, P. Taylor, and C. Van Ess-Dykema. Automatic detection of discourse structure for speech recognition and understanding. In Proceedings of ASRU, pages 88–95, Santa Barbara, CA, December 1997. [59] D. 
Jurafsky, R. Bates, N. Coccaro, R. Martin, M. Meteer, K. Ries, E. Shriberg, S. Stolcke, P. Taylor, and C. Van Ess-Dykema. Switchboard discourse language modeling project report. Technical report research note 30, Center for Speech and Language Processing, Johns Hopkins University, Baltimore, MD, 1998. [60] D. Jurafsky, E. Shriberg, B. Fox, and T. Curl. Lexical, prosodic, and syntactic cues for dialog acts. In Proc. ACL/COLING Workshop on Discourse Relations and Discourse Markers, pages 114–120, Montreal, Canada, August 1998. 120 [61] J. G. Kahn, M. Lease, E. Charniak, M. Johnson, and M. Ostendorf. Effective use of prosody in parsing conversational speech. In Proceedings of HLT/EMNLP, 2005. [62] P. Koehn. Pharaoh: A beam search decoder for phrasebased statistical machine translation models. In Proceedings of AMTA-04, Berlin/Heidelberg, pages 115– 124, 2004. [63] P. Koehn, S. Abney, J. Hirschberg, and M. Collins. Improving intonational phras- ing with syntactic information. In Proceedings of ICASSP, 2000. [64] P. Koehn and H. Hoang. Factored translation models. In Proceedings of EMNLP, 2007. [65] K. Koumpis and S. Renals. The role of prosody in a voicemail summarization system. In Proc. ISCA Workshop on Prosody in Speech Recognition and Under- standing, Red Bank, NJ, USA, 2001. [66] L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles. Machine Learning, 51:181–207, 2003. [67] A. Lavie, L. Levin, Y . Qu, A. Waibel, D. Gates, M. Gavalada, L. Mayfield, and M. Taboada. Dialogue processing in a conversational speech translation system. In Proc. of ICSLP, pages 554–557, Oct 1996. [68] I. Lehiste. Suprasegmentals. MIT Press, Cambridge, MA, 1970. [69] M. Liberman and A. Prince. On stress and linguistic rhythm. Linguistic Inquiry, 8(2):249–336, 1977. [70] Y . Liu, E. Shriberg, and A. Stolcke. Automatic disfluency identification in con- versational speech using multiple knowledge sources. In Proc. Eurospeech, pages 957–960, Geneva, September 2003. [71] Y . Liu, E. Shriberg, A. Stolcke, H. Hillard, M. Ostendorf, and M. Harper. Enrich- ing speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech and Language Processing, 14(5):1526–1540, September 2006. [72] X. Ma, W. Zhang, Q. Shi, W. Zhu, and L. Shen. Automatic prosody labeling using both text and acoustic information. In Proceedings of ICASSP, volume 1, pages 516–519, April 2003. [73] M. Mast, R. Kompe, S. Harbeck, A. Kiessling, and V . Warnke. Dialog act clas- sification with the help of prosody. In Proceedings of ICSLP, pages 1732–1735, 1996. 121 [74] E. Matusov, D. Hillard, M. Magimai-Doss, D. Hakkani-T¨ ur, M. Ostendorf, and H. Ney. Improving speech translation with automatic boundary prediction. In Proceedings of Interspeech, Antwerp, Belgium, 2007. [75] E. Matusov, S. Kanthak, and H. Ney. On the integration of speech recognition and statistical machine translation. In Proc. of Eurospeech, 2005. [76] L. Mayfield, M. Gavalda, W. Ward, and A. Waibel. Concept-based speech trans- lation. In Proc. of ICASSP, volume 1, pages 97–100, May 1995. [77] J. Mrozinski, E. W. D. Whittaker, P. Chatain, and S. Furui. Automatic sentence segmentation of speech for automatic summarization. In Proceedings of ICASSP. [78] A. F. Muller and R. Hoffman. A neural network model and hybrid approach for accent label prediction. In Proceedings of 4th ISCA tutorial and research workshop on speech synthesis, 2001. [79] G. Murray, S. Renals, J. Moore, and J. Carletta. 
Incorporating speaker and dis- course features into speech summarization. In Proceedings of HLT-NAACL,New York City, USA, June 2006. [80] S. Narayanan et. al. Speech recognition engineering issues in speech to speech translation system design for low resource languages and domains. In Proc. of ICASSP, Toulose, France, May 2006. [81] A. Nenkova, J. Brenier, A. Kothari, S. Calhoun, L. Whitton, D. Beaver, and D. Jurafsky. To memorize or to predict: Prominence labeling in conversational speech. In Proceedings of NAACL-HLT 2007, 2007. [82] H. Ney, F. J. Och, and S. V ogel. Statistical translation of spoken dialogues in the verbmobil system. In Workshop on Multilingual Speech Communication, pages 69–74, Kyoto, 2000. [83] E. N¨ oth, A. Batliner, A. Kießling, R. Kompe, and H. Niemann. VERBMOBIL: The use of prosody in the linguistic components of a speech understanding sys- tem. IEEE Transactions on Speech and Audio processing, 8(5):519–532, 2000. [84] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003. [85] J. D. O’Connor and G. F. Arnold. Intonation of Colloquial English. Longman, 2 edition, 1973. [86] M. Ostendorf, P. J. Price, and S. Shattuck-Hufnagel. The Boston University Radio News Corpus. Technical Report ECS-95-001, Boston University, March 1995. 122 [87] M. Ostendorf, I. Shafran, S. Shattuck-Hufnagel, L. Carmichael, and W. Byrne. A prosodically labeled database of spontaneous speech. In ISCA Workshop on Prosody in Speech Recognition and Understanding, pages 119–121, 2001. [88] M. Ostendorf, I. Shafran, S. Shattuck-Hufnagel, L. Carmichael, and W. Byrne. A prosodically labeled database of spontaneous speech. In ISCA Workshop on Prosody in Speech Recognition and Understanding, pages 119–121, 2001. [89] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a method for automatic evaluation of machine translation. Technical report, IBM T.J. Watson Research Center, 2002. [90] M. Paul. Overview of the IWSLT 2006 Evaluation Campaign. In Proc. of the International Workshop on Spoken Language Translation, pages 1–15, Kyoto, Japan, 2006. [91] C. R. Perrault and J. Allen. A plan-based analysis of indirect speech acts. Amer- ican Journal of Computational Linguistics,6. [92] J. Pierrehumbert. The Phonology and Phonetics of English Intonation. PhD thesis, MIT, 1980. [93] J. F. Pitrelli, M. E. Beckman, and J. Hirschberg. Evaluation of prosodic transcrip- tion labeling reliability in the tobi framework. In Proceedings of ICSLP, pages 123–126, 1994. [94] P. J. Price, M. Ostendorf, S. Shattuck-Hufnagel, and C. Fong. The use of prosody in syntactic disambiguation. The Journal of the Acoustical Society of America, 90(6):2956–2970, 1991. [95] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Acoustic-syntactic maximum entropy model for automatic prosody labeling. In Proceedings of IEEE/ACL Spoken Language Technology, Aruba, December 2006. [96] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Exploiting acoustic and syntactic features for prosody labeling in a maximum entropy framework. In Proceedings of NAACL-HLT, 2007. [97] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Exploiting prosodic features for dialog act tagging in a discriminative modeling framework. In Pro- ceedings of InterSpeech, Antwerp, 2007. [98] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Combining lexical, syntactic and prosodic cues for improved online dialog act tagging. 
Computer Speech and Language, In press, 2008. 123 [99] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Enriching spoken language translation with dialog acts. In In Proceedings of ACL, 2008. [100] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Exploiting acous- tic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE Transactions on Audio, Speech and Language Processing, 16(4):797–811, 2008. [101] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Factored translation models for enriching spoken language translation with prosody. In In Proceedings of Interspeech, Brisbane, Australia, 2008. [102] V . K. Rangarajan Sridhar, S. Bangalore, and S. Narayanan. Modeling the intona- tion of discourse segments for improved online dialog act tagging. In Proceedings of ICASSP, Las Vegas, 2008. [103] V . K. Rangarajan Sridhar and S. Narayanan. Analysis of disfluent repetitions in spontaneous speech recognition. In In Proceedings of EUSIPCO, Florence, Italy, September 2006. [104] V . K. Rangarajan Sridhar and S. Narayanan. Detection of non-native named enti- ties using prosodic features for improved speech recognition and translation. In In Proceedings of MULTILING, Stellenbosch, South Africa, 2006. [105] N. Reithinger and R. Engel. Robust content extraction for translation and dialog processing. In W. Wahlster, editor, Verbmobil: Foundations of Speech-to-Speech Translation, pages 430–439. Springer, 2000. [106] N. Reithinger, R. Engel, M. Kipp, and M. Klesen. Predicting dialogue acts for a speech-to-speech translation system. In Proc. of ICSLP, volume 2, pages 654– 657, Oct 1996. [107] K. Ries. HMM and neural network based speech act detection. In Proc. of ICASSP, volume 1, pages 497–500, March 1999. [108] K. Ross and M. Ostendorf. Prediction of abstract prosodic labels for speech synthesis. Computer Speech and Language, 10:155–185, Oct. 1996. [109] S. Sakai and J. Glass. Fundamental frequency modeling for corpus-based speech synthesis based on a statistical learning technique. In Proceedings of IEEE ASRU. ASRU. [110] J. R. Searle. Speech Acts. Cambridge University Press, Cambridge, 1969. [111] P. Shimei and K. McKeown. Word informativeness and automatic pitch accent modeling. In In Proceedings of EMNLP/VLC, College Park, Maryland, 1999. 124 [112] E. Shriberg, R. Bates, A. Stolcke, P. Taylor, D. Jurafsky, K. Ries, N. Coccaro, R. Martin, M. Meteer, and C. Van Ess-Dykema. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech, 41(3-4):439–487, 1998. [113] E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, and A. Stolcke. Model- ing prosodic feature sequences for speaker recognition. Speech Communication, 46:455–472, 2005. [114] E. Shriberg, A. Stolcke, D. Hakkani-Tur, and G. Tur. Prosody-based automatic segmentation of speech into sentences and topics. In Speech Communication, number 32 in Special Issue on Accessing Information in Spoken Audio, pages 127–154, 2000. [115] E. E. Shriberg, R. A. Bates, and A. Stolcke. A prosody-only decision-tree model for disfluency detection. In Proc. Eurospeech 97, Rhodes, Greece, 1997. [116] C. L. Sidner. Plan parsing for intended response recognition in discourse. Com- putational Intelligence, 1(1), 1985. [117] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg. ToBI: A standard for labeling English prosody. In Proceedings of ICSLP, pages 867–870, 1992. [118] M. Steedman. 
Information structure and the syntax-phonology interface. Lin- guistic inquiry, 31(4):649–689, 2000. [119] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, and M. Meteer. Dialogue act modeling for auto- matic tagging and recognition of conversational speech. Computational Linguis- tics, 26(3):339–373, September 2000. [120] V . Strom, A. Nenkova, R. Clark, Y . Vazquez-Alvarez, J. Brenier, S. King, and D. Jurafsky. Modelling prominence and emphasis improves unit-selection syn- thesis. In Proceedings of Interspeech, Antwerp, Belgium, 2007. [121] X. Sun. Pitch accent prediction using ensemble machine learning. In Proc. of ICSLP, 2002. [122] X. Sun and T. H. Applebaum. Intonational phrase break prediction using decision tree and n-gram model. volume 1, pages 537–540, Aalborg, Denmark, 2001. [123] P. Taylor. The tilt intonation model. In Proc. ICSLP, volume 4, pages 1383–1386, 1998. 125 [124] P. Taylor, S. King, S. Isard, and H. Wright. Intonation and dialogue context as constraints for speech recognition. Language and Speech, 41(34):493–512, 2000. [125] P. A. Taylor. The rise/fall/connection model of intonation. Speech Communica- tion, 15:169–186, 1995. [126] K. Tokuda, H. Zen, and A. W. Black. An hmm-based speech synthesis system applied to english. Sep 2002. [127] N. M. Veilleux and M. Ostendorf. Prosody/parse scoring and its application in atis. In HLT ’93: Proceedings of the workshop on Human Language Technology, pages 335–340, Morristown, NJ, USA, 1993. Association for Computational Lin- guistics. [128] N. M. Veilleux, M. Ostendorf, and C. W. Wightman. Parse scoring with prosodic information. In Proc. 1992 Intl. Conf. on Spoken Language Processing, pages 1605–1608, 1992. [129] A. Venkataraman, A. Stolcke, and E. Shriberg. Automatic dialog act labeling with minimal supervision. In Proc. 9th Australian International Conference on Speech Science and Technology, Melbourne, December 2002. [130] M. Q. Wang and J. Hirschberg. Automatic classification of intonational phrase boundaries. Computer Speech and Language, 6:175–196, 1992. [131] C. W. Wightman and M. Ostendorf. Automatic labeling of prosodic patterns. IEEE Transactions on Speech and Audio Processing, 2(3):469–481, 1994. [132] C. W. Wightman, S. Shattuck-Hufnagel, M. Ostendorf, and P. J. Price. Segmen- tal durations in the vicinity of prosodic phrase boundaries. J. of the Acoustical Society of America, 91(3):1707–1717, 1992. [133] Ian H. Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2005. [134] C. H. Wu, G. L. Yan, and C. L. Lin. Speech act modeling in a spoken dialog system using a fuzzy fragment-class markov model. Speech Communication, 38(1-2):183–199, 2002. [135] F. Xia, M. Palmer, and A. K. Joshi. A uniform method of grammar extraction and its applications. In Proceedings of Empirical Methods in Natural Language Processing, 2000. [136] XTAG. A lexicalized tree-adjoining grammar for English. Technical report, Uni- versity of Pennsylvania, http://www.cis.upenn.edu/ xtag/gramrelease.html, 2001. 126 [137] Tae-Jin Yoon. Predicting prosodic boundaries using linguistic features. In ICSA International Conference on Speech Prosody, Dresden, Germany, 2006. [138] T. Yoshimura, S. Hayamizu, H. Ohmura, and K. Tanaka. Pitch pattern clustering of user utterances in human-machine dialogue. In Proc. of ICSLP, volume 2, pages 837–840, 1996. [139] M. Zimmermann, Y . Liu, E. Shriberg, and A. Stolcke. 
A* based joint segmen- tation and classification of dialog acts in multiparty meetings. In Proc. IEEE Speech Recognition and Understanding Workshop, pages 215–219, San Juan, Puerto Rico, November 2005. 127
Abstract
Machine processing of speech, while it has advanced significantly, is still insufficient in capturing and utilizing rich contextual information such as prosodic prominence, phrasing and discourse information that are conveyed beyond words. The work presented in this dissertation focuses on automatic enrichment of spoken language processing through the representation and modeling of suprasegmental events such as prosody and discourse context. First, we demonstrate the suitability of maximum entropy models for the automatic recognition of these events from speech and text. The techniques that we have developed achieve state-of-the-art performance. Second, we introduce a novel framework for enriching speech translation with rich information. Our approach of incorporating rich information in speech translation is motivated by the fact that it is important to capture and convey not only what is being communicated (the words) but how something is being communicated (the context). We show that promising improvements in translation quality can be obtained by exploiting rich annotations in conventional speech translation approaches.