CONCEPT CLASSIFICATION WITH APPLICATION TO SPEECH TO
SPEECH TRANSLATION
by
Emil Ettelaie
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2011
Copyright 2011 Emil Ettelaie
Dedication
This dissertation is dedicated to the memory of my mother, Dr. Neli Tamraz and to my
father, Dr. Youile Ettelaie.
Acknowledgements
I would like to express my sincere appreciation to my advisor Dr. Shrikanth Narayanan
for his exceptional mentoring and support during my time at the University of Southern
California. Many thanks to Dr. Panayiotis Georgiou for his guidance and help throughout
my research work. I would also like to thank my committee members, Dr. Bart Kosko
and Dr. Aiichiro Nakano for their precious comments.
I am grateful to the students and staff of the Signal Analysis and Interpretation Laboratory for their help and feedback. My thanks must go also to the faculty and staff of the Ming Hsieh Department of Electrical Engineering.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Motivation
1.2 Challenges
1.3 Approach
1.4 Outline
Chapter 2: Classifier Structure and Training with Sparse Data
2.1 Introduction
2.2 Concept classifier and background model
2.3 Handling sparsity by statistical machine translation
2.4 Data and Experiments
2.4.1 Data
2.4.2 Classification Accuracy Measures
2.4.3 Experiments
2.5 Discussion
2.6 Summary
Chapter 3: Cross-lingual Dialog Model and Rejection Mechanism
3.1 Introduction
3.2 Concept Classification and Understanding Model
3.3 Dialog Model
3.4 Data and System
3.4.1 Categorical Understanding Model: Concept Classification
3.4.2 Dialog Model
3.4.3 Combining Understanding and Dialog Models
3.5 Results
3.6 Summary
Chapter 4: Data-driven Training
4.1 Introduction
4.2 Training the concept classifier
4.3 Unsupervised Data Processing
4.3.1 Utterance level distance
4.3.2 Distance Measure
4.3.3 Clustering Algorithms
4.3.3.1 Exchange Method
4.3.3.2 Affinity Propagation
4.3.4 Concept Representatives
4.3.5 Training
4.4 Evaluation
4.4.1 Clustering Evaluation
4.4.2 Overall Evaluation
4.5 Data and Experiments
4.5.1 Classifier Data
4.5.2 SMT Data
4.5.3 Clustering with n-best Lists
4.5.4 The Effect of n-best List Size
4.6 Results and Discussion
4.6.1 Clustering with New Metrics
4.6.2 Clustering and the Size of the n-best Lists
4.7 Summary
Chapter 5: Training Based on Topic Modeling
5.1 Introduction
5.2 Clustering with SMT n-best Lists
5.3 Distance Metrics
5.3.1 Language Model Distance
5.3.2 Topic Modeling
5.3.3 Combination of Two Distances
5.4 Experiments and Results
5.4.1 Data
5.4.2 Clustering with Topic Modeling Metric
5.4.3 Clustering with Combined Metric
5.5 Summary
Chapter 6: Hierarchical Structure
6.1 Introduction
6.2 Hierarchical Classifier
6.2.1 A Two-layered Classifier
6.2.2 Categorical Partition
6.2.3 Category Detection
6.2.3.1 Maximum Likelihood Category Classifier
6.2.3.2 Category Detection by Topic Modeling
6.2.4 Buffering
6.3 Experiments and Results
6.3.1 Data
6.3.2 Oracle Test
6.3.3 Topic Modeling of Categories
6.3.4 Effect of Buffer Size on Category Detection
6.4 Summary
References
List of Tables

2.1 Classification accuracy for the conventional method and the proposed method with different lengths of n-best list
3.1 Training data for understanding model
3.2 Development and testing data (both data sets were manually tagged for performance measurement)
3.3 Classifier accuracy with and without dialog model for Experiment 1
3.4 Classifier accuracy with and without dialog model for Experiment 2
4.1 The data sets used for clustering and classification
4.2 The results of different clustering schemes for Transonics data
4.3 The results of different clustering schemes for BBN data
4.4 The effect of the size of the SMT n-best list using the Exchange Method and Affinity Propagation with the KLD metric
5.1 The data sets used for clustering
5.2 Results of the experiments with classifiers trained on the clustered data
List of Figures

2.1 Training process using n-best lists in the intermediate language
2.2 The effect of the background model on classification accuracy
3.1 The dialog model statistically connects the concepts uttered from the two sides of a dialog
3.2 The English-Farsi S2S system with a dual translation scheme involving a classifier and an SMT
4.1 Overview of the proposed data preparation procedure
5.1 The training process with the distance metric based on topic modeling
5.2 Purity of clustering BBN data with LM and TM metrics and different topic numbers
5.3 Purity of clustering Transonics data with LM and TM metrics and different topic numbers
5.4 Purity of clustering BBN data with the combined metric
5.5 Purity of clustering Transonics data with the combined metric
6.1 Partition of the classifier domain
6.2 Sample of categorized training data for concept classification
6.3 Two-layered classification scheme
6.4 (a) Number of classes per category for training data, (b) overall number of sentences per category for training data, (c) overall number of sentences per category for the testing set, (d) number of errors per category for the testing set
6.5 Category detection using topic modeling with different parameters
6.6 Effect of buffer size on the LM-based classification and Topic Modeling
Abstract
The central goal in interactive speech-to-speech translation applications is to facilitate the accurate exchange of the semantic content (or "concept") of the speech between the interlocutors, rather than producing a word-by-word literal translation of the source utterance. While conventional Statistical Machine Translation (SMT) methods are mainly developed and optimized for translating text, speech understanding through concept classification offers a way of translation in speech-to-speech translation systems that suits this purpose. A correct concept classification offers the promise of obtaining well-formed target-language speech output, although it cannot accurately cover the entire dialog domain due to the limited number of concept classes.
Here, the task of spoken utterance classification is presented as a MAP estimation problem. The formulation of the understanding model and the data collection methods are presented. To cope with the inherent sparsity in the training data, the use of a background model is introduced and its effects are investigated. For further sparsity mitigation, a new method of lexical enhancement using an SMT system is introduced.

To improve the overall accuracy of the classification task, a method for incorporating contextual information is also presented. Specifically, for a two-way speech translation system, a classification scheme is derived that utilizes this information from both sides of the conversation through a dialog model. Empirical results show that the proposed dialog model provides a modest improvement in classification accuracy and a significant improvement in the accuracy of the rejection task.
The main bottleneck in achieving acceptable performance with concept classifiers is the tedious task of annotating large amounts of training data. Any attempt to develop a method to assist in, or to completely automate, data annotation should involve the clustering of sentences based on the meaning they convey. This requires a distance measure to compare sentences at the concept level. Here, a new method of sentence comparison is introduced that is motivated from the translation point of view. In this method the imperfect translations produced by a phrase-based SMT system are used to compare the concepts of the source sentences. The distances among the utterances are measured using two alternative types of metrics. The first metric aims to capture local dependencies among words, based on Markov chain modeling of the text (Language Models). The second metric is computed based on word associations in a wider view; such associations are learned through Topic Modeling of the translation lists of the data utterances. Two clustering methods are adapted to support the concept-based distance. Experimental evaluations show the effectiveness of the proposed methods.
The effectiveness of concept classifiers depends on the size of the domain that they cover. An obstacle in expanding the classifier domain, however, is the degradation in accuracy as the number of classes increases. A hierarchical classification process that aims to scale up the domain without compromising the accuracy is introduced here. This method exploits the categorical associations that naturally appear in the training data to split the domain into sub-domains with fewer classes. In a two-layered structure, first the best category for the discourse is detected, and then a sub-domain classifier, limited to that category, is deployed. For category detection the discourse information is used as input. For that purpose, two alternative methods based on language models and topic modeling are introduced. Results from experiments show higher accuracy for the proposed method compared to a single-layered classifier.
Chapter 1
Introduction
1.1 Motivation
Speech-to-speech (S2S) translation systems are developed to mediate communication between people who do not share a common language. What is critical in this group of applications is the faithful transfer of the semantic context (also known as the gist) of the spoken utterances between the interlocutors.

Different machine translation (MT) approaches have been adopted in current S2S systems. Translation engines based on statistical machine translation (SMT) methods [32], utterance classification based on understanding [2,49], or inter-lingua [45] have been integrated into S2S systems.

SMT methods are the most commonly used translation technique for S2S translation systems [20,30]. The statistical models used in these methods make them flexible enough to provide good coverage of the dialog domain. The fluency of the translation, however, is not guaranteed. Disfluencies of spoken utterances plus speech recognizer errors degrade the translation quality even more. All of these ultimately affect the quality of the synthesized speech output in the target language, and the effectiveness of the overall system.
Existing SMT systems are typically optimized for faithful translation rather than a form of interpretation that guarantees an accurate exchange of the semantic context of the speech. Therefore it is quite common to use other means of translation in parallel with the SMT methods [17,40]. Also, since SMT engines were developed for text translation, they cannot benefit from the presence of the "human in the loop" that is typical in S2S applications.

The focus of this dissertation is the development of a translation mechanism with the objective of faithfully transferring the semantic context. This, of course, can be achieved only if the system provides a proper level of human-machine collaboration to set and maintain the course of the dialog.
1.2 Challenges
Spoken utterances are not always well-structured. Lexical and syntactic errors are made by the speaker or introduced through the speech recognition process. Also, due to the lack of punctuation and capitalization, the boundaries of sentences are not clearly marked. All of these, plus the inherent disfluency of speech, make the task of speech translation more challenging than the translation of well-structured text.

Utterance-level evaluation metrics have been introduced and widely used for the text translation task [33]. These metrics compare the hypothesis translations with references (produced by human translators) and measure the lexical and ordering (syntactic) similarity between them. Therefore, they are not sufficient for the S2S translation task, in which the goal is concept transfer.

Compared to the text translation task, the data that can be used for S2S translation are very limited and more expensive to process. The recorded speech must be transcribed and translated for SMT training. To train the translation engines that work based on the semantic context, the data utterances must also be grouped according to the concept that they convey.

Finally, S2S translation systems are applied to real-time tasks in which low latency is vital.
1.3 Approach
A well-defined dialog domain, e.g., doctor-patient dialog, can be partly covered by a number of concept classes. If a fluent representative sentence in the target language is assigned to each concept class, the translation task is simplified to classifying the source utterances based on the concepts they convey [11,30]. Therefore a classifier can be used as an interpreter by mapping the input utterance to one of the predefined concepts. Upon a successful classification of the input utterance, the previously stored representative of the selected class is synthesized. This method is a better fit for the purpose of exchanging the semantic context, as long as the input utterance falls within the coverage of the classifier.

The above process can be viewed as a quantization of a continuous "semantic" sub-space. The classifier is adequate when the quantization error is small, i.e., when (1) the derived concept and input utterance are good matches, and (2) the utterance falls in the same sub-space (domain) that the quantizer attempts to cover. Obviously it is not feasible to accurately cover the entire dialog domain with concept classes, because of the large number of quantization levels that would be necessary due to the infinite number of concepts that could be exchanged in a dialog. A very large number of concept classes will also decrease the classifier accuracy. This is the major disadvantage of the classifier-based translation method.
In spite of this shortcoming, since the concept representatives are preselected well-formed sentences, correct classification assures the high quality of the output. In speech applications the input to the translation engine is neither well-formed nor error-free. Disfluency in the original speech plus recognition errors create a "noisy" utterance at the input of the translation unit. The lexical and syntactic errors in the input of an SMT system often cause a severe degradation in the translation quality, while the "hard decision" nature of the classification process makes the classifier engine more robust to input errors.

Therefore, the classifier-based translator is an attractive option for S2S applications because of its tolerance to "noisy" input and the fluency of its output, when it operates close to its design parameters. In practice this is attainable for structured dialog interactions with high levels of predictability. In addition, it can provide the users with both accurate feedback and different translation options to choose from. The latter feature, especially, is useful for applications like doctor-patient dialog. Moreover, the classification task is much less computationally demanding than the statistical methods, and is therefore more suitable for applications with a small tolerable latency, or when the system is implemented on a small device.

The parallel combination of concept classification and SMT can leverage the strengths of both methods [2,29,30]: the first provides high quality in a small domain and the other covers a much wider domain. The idea is to use the classifier as long as it works, and fall back to the SMT otherwise.
Concept classification is also used in a range of other applications, such as systems with virtual interactive characters or machine spoken dialog systems, to implement speech understanding [25,43]. The concept classification in these examples falls under the general category of Spoken Language Understanding (SLU). SLU methods are developed for applications with human-machine interaction. In S2S translation, however, the machine is a mediator through which the users interact (human-machine-human interaction). Therefore concept classification for translation differs from the typical SLU task. Unlike human-machine interactions, in a human-human interaction the dialog domain is much wider and the concepts are more complex. Therefore, from a practical point of view, in translation applications the number of concepts is much larger than in an SLU application. Concept selection and data annotation are also much more difficult tasks for the former set of applications. Another difference is the presence of human confirmation in the former applications, which can be used for on-the-fly learning and domain expansion.
To build a usable classifier-based translator, two separate but related problems must be addressed. Classifier structure is the first issue. The classifier must be designed based on models that encompass the information from the training data as much as possible, without overfitting. The second challenge is data processing and training the models, in other words, finding feasible ways to extract the information from the training data. This work focuses on both of these problems.
Building a concept classifier starts with identifying the desired concepts and representing them with canonical utterances that express those concepts. A good set of concepts should consist of the ones that are most frequent in a typical interaction in the domain. For instance, in a doctor-patient dialog the utterance "Where does it hurt?" is quite common, and therefore its concept is a good choice. Phrase books, websites, and experts' judgment are some of the resources that can be used for concept selection. Other frequently used concepts include those that correspond to basic communicative and social aspects of the interaction, such as greeting, acknowledgment, and confirmation.

After forming the concept space, utterances that convey each class's concept must be gathered. Hence, this training corpus consists of a group of paraphrases for each class. This form of data is often very difficult to collect as the number of classes grows.
1.4 Outline
The basic classifier structure is explained in Chapter 2. Often, the available data sets that could be used for training do not include a sufficient number of instances for each class. This data sparsity leads to poor classification performance. It is shown in Chapter 2 that an SMT can be used as a means of information enhancement to address this problem [14]. Also, a background model has been used to improve the discrimination ability of a given concept class model.

When the classifier is used in parallel with another translation engine (almost always an SMT), a rejection mechanism can help identify the cases in which the input utterance falls outside the classifier coverage [13]. A high level of accuracy is essential for such a mechanism to be practically useful. In Chapter 3 the classifier models are expanded in a way that utilizes the dialog information to achieve higher accuracy, especially for the rejection cases.
Preparing data for training the classifier is a cumbersome task. Chapters 4 and 5 deal with data preparation and training. An automatic method of data preparation based on utterance clustering is introduced in Chapter 4. To measure the distance between utterances, information-theoretic metrics are adapted in a way that is more meaningful from the perspective of the translation task [15]. Then suitable clustering algorithms that can function with these metrics are selected.

The metric introduced in Chapter 4 is based on divergence among Language Models [21]. Markov chain modeling of a stream of words, known as Language Modeling, has been extensively used in speech and natural language processing. As language models capture the local lexical dependencies, the metrics that are based on them encapsulate such relations.

Topic Modeling has emerged as a powerful tool for semantic analysis [4,18,41]. The global semantic associations of words are modeled in this method. In Chapter 5, a metric is deployed that is based on such models. Using a combination of the two different types of models, i.e., language and topic models, proves to be quite beneficial. Such a combination creates a metric that contains the information from both local and global associations of words.
A major obstacle in domain expansion is the fact that classifiers lose accuracy as the number of classes increases. In Chapter 6 the issue of domain expansion is addressed through splitting the domain, taking advantage of the inherent categorical partitions that are observed in data generated or collected from human interlocutors.
Chapter 2
Classifier Structure and Training with Sparse Data
2.1 Introduction
The available training data are usually sparse and cannot, on their own, produce the classification accuracy that is otherwise possible. Since the classifier range is limited, high accuracy within that range is quite crucial for its effectiveness. One of the main issues is therefore dealing with data sparsity. Other techniques have also been proposed to improve the classification rates [13].

In this chapter a novel method for handling the sparsity is introduced. This method utilizes an SMT engine to map a single utterance to a group of utterances [14]. Furthermore, the effect of the background model on classification accuracy is investigated.

Section 2.2 explains the concept classification process and the background model. In Section 2.3 the sparsity handling method using an SMT is introduced. Data and experiments are described in Section 2.4. The results are discussed in Section 2.5.
2.2 Concept classifier and background model

The concept classifier based on the maximum likelihood criterion can be implemented as a language model (LM) scoring process. For each class a language model is built using data expressing the class concept. The classifier scores the input utterance using the class LMs and selects the class with the highest score. In other words, if $C$ is the set of concept classes and $e$ is the input utterance, the classification process is

$\hat{c} = \arg\max_{c \in C} \{ P_c(e \mid c) \}$   (2.1)

where $P_c(e \mid c)$ is the score of $e$ from the LM of class $c$. The translation job is concluded by playing out a previously constructed prompt that expresses the concept $\hat{c}$ in the target language.

It is clear that for a class with a limited number of training utterances, the associated LM will have poor coverage. Even with a large training set for each class, the class language models will have a large number of out-of-vocabulary words, due to the fact that classes usually have a limited lexicon. In practice such a model fails to produce a usable LM score and leads to poor classification accuracy. Interpolating the LM with a background language model results in a smoother model [42] and increases the overall accuracy of the classifier. The background model should be built from a larger corpus that fairly covers the domain vocabulary. The interpolation level can be optimized for the best performance on a held-out set.
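To make the scoring scheme concrete, the following is a minimal sketch of Eq. (2.1) with background interpolation, $P(w) = \lambda P_c(w) + (1 - \lambda) P_{bg}(w)$. It assumes unigram models stored as word-probability dictionaries purely for illustration; the actual system used interpolated trigram LMs built with SRILM.

```python
import math

def interp_prob(word, class_lm, background_lm, lam=0.9, floor=1e-9):
    """P(w) = lam * P_class(w) + (1 - lam) * P_background(w), floored for the log."""
    p = lam * class_lm.get(word, 0.0) + (1.0 - lam) * background_lm.get(word, 0.0)
    return max(p, floor)

def classify(utterance, class_lms, background_lm, lam=0.9):
    """Eq. (2.1): pick the class whose smoothed LM scores the utterance highest."""
    words = utterance.lower().split()
    best_class, best_score = None, float("-inf")
    for c, lm in class_lms.items():
        score = sum(math.log(interp_prob(w, lm, background_lm, lam)) for w in words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```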
While the training of an SMT is mainly done through the use of bilingual parallel data, to train a concept classifier, sets of sentences with the same (or very similar) concept are required. These are often generated by deciding on a set of canonical utterances and then manually generating, for each of them, a large number of paraphrases with significantly similar concepts [29]. This procedure can be extremely time consuming.

[Figure 2.1: Training process using n-best lists in the intermediate language. The original source-language training data are passed through an SMT, and the resulting n-best lists in the intermediate language are used to train the classifier.]
2.3 Handling sparsity by statistical machine translation
The goal is to employ techniques that limit the effects of data sparsity. What is proposed here is to generate multiple utterances, possibly of lower quality, from a single original one. One approach is to use an SMT to generate n-best lists of translation candidates for the original utterances. Such lists are ranked based on a combination of scores from different models [32]. The hypothesis here is that, for an SMT trained on a large corpus, the quality of the candidates does not degrade rapidly as one moves down the n-best list. Therefore a list of appropriate length would consist of translations of acceptable quality without containing many poor candidates. This process results in more data available for training, at the cost of using noisier data (Figure 2.1).

Although the source language of the SMT must be the same as the classifier's, its target language can be selected deliberately. Clearly, a language with large available resources (in the form of parallel corpora with the source language) should be selected. For simplicity this language is called the "intermediate language" here.

A classifier in the intermediate language can be built by first generating an n-best list for every source utterance in the classifier's training corpus. Then the n-best lists associated with each class are combined to form a new training set. The class LMs are now built from these training sets rather than the original sets of source utterances.
To classify a source utterance $e$, first the SMT is deployed to generate an n-best list (in the intermediate language) from it. The list will consist of candidates $f_1, f_2, \ldots, f_n$. The classification process can be reformulated as

$\hat{c} = \arg\max_{c \in C} \left\{ \prod_{i=1}^{n} \tilde{P}_c(f_i \mid c) \right\}$   (2.2)

Here, $\tilde{P}_c(f_i \mid c)$ is the score of the $i$-th candidate $f_i$ from the LM of class $c$. The scores are considered in the probability domain.

The new class LMs can also be smoothed by interpolation with a background model in the intermediate language.
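A minimal sketch of Eq. (2.2): since the product of candidate probabilities corresponds to a sum of log scores, each class is scored by summing over the n-best candidates. The `score` callback is an assumed stand-in for any class-LM scorer (e.g., the interpolated scorer sketched in Section 2.2), not the system's actual interface.

```python
def classify_nbest(nbest_candidates, class_lms, score):
    """Eq. (2.2): maximize prod_i P~_c(f_i | c), computed in log space.

    nbest_candidates: candidate translations f_1..f_n of one source utterance.
    score(candidate, lm): log-probability of a candidate under one class LM.
    """
    best_class, best_total = None, float("-inf")
    for c, lm in class_lms.items():
        total = sum(score(f, lm) for f in nbest_candidates)
        if total > best_total:
            best_class, best_total = c, total
    return best_class
```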
2.4 Data and Experiments
2.4.1 Data
The data used in this work were originally collected for, and used in, the Transonics project [2,29,30]. The goal was to develop a two-way S2S translator system in the doctor-patient interaction domain with the English/Farsi language pair. For the doctor side (English), 1,269 concept classes were carefully chosen using experts' judgment and medical phrase books.

Then paraphrases were collected from human subjects in different ways at the Information Sciences Institute of the University of Southern California. A website was designed to collect online data by showing a sentence to the users and asking them to provide as many paraphrases as they could. A similar tool was developed to collect speech data by letting users say and record paraphrases. Also, an online game was designed in which the players were scored based on the number and quality of the paraphrases they entered [6]. Paraphrasing sessions were also held, in which a small group of volunteers provided data for some of the concepts. The total data set consists of 9,893 English phrases.

As the test corpus for this work, 1,000 phrases were randomly drawn from the above set and the rest were used for training. To make sure that the training set covered every class, one phrase per class was excluded from the test-set selection process.

To generate the n-best lists, a phrase-based SMT [23] was used. The intermediate language was Farsi, and the SMT was trained on a parallel English/Farsi corpus with 148K lines (1.2M words) on the English side. This corpus was also used to build the classification background models in both languages. The SMT was optimized using a parallel development set with 915 lines (7.3K words) on the English side.
2.4.2 Classification Accuracy Measures
Classifier accuracy is often used as the quality indicator of the classification task. However, it is common in speech-to-speech translation systems to provide the user with a short list of potential translations to choose from. For example, the user of the system in [30] is provided with the top four classifier outputs. In such cases, it is practically useful to measure the accuracy of the classifier within its n-best outputs (e.g., n = 4 for the above system). In this work the classification accuracy was measured on both the single output and the 4-best outputs.
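As a minimal sketch of these two measures, the helper below counts how often the reference class appears among the top-n classifier outputs; the ranked-output interface `rank_classes` is a hypothetical stand-in for the classifier.

```python
def nbest_accuracy(examples, rank_classes, n=4):
    """Fraction of examples whose reference class is among the top-n outputs.

    examples: list of (utterance, reference_class) pairs.
    rank_classes(utterance): classes sorted best-first (assumed classifier API).
    """
    hits = sum(1 for utt, ref in examples if ref in rank_classes(utt)[:n])
    return hits / len(examples)

# 1-best accuracy is simply the n = 1 case:
# acc1 = nbest_accuracy(test_set, rank_classes, n=1)
# acc4 = nbest_accuracy(test_set, rank_classes, n=4)
```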
2.4.3 Experiments
To compare the proposed method with the conventional classification, a classifier based on each method was put to the test. In the proposed method, the accuracy is expected to be affected by the length of the n-best lists. To observe that, n-best lists of lengths 100, 500, 1,000, and 2,000 were used in the experiments. The results are shown in Table 2.1. In all of the above experiments the background interpolation factor was set to 0.9, which is close to the optimum value obtained in [13].

To examine the effect of the background model, the conventional and proposed methods were tried with different values of the interpolation factor (the background model is weighted by $1 - \lambda$). For this comparison the length of the n-best list was set to 500.
Table 2.1: Classification accuracy for the conventional method and the proposed method with different lengths of n-best list

                                 Conventional              n-best length
                                 (baseline)     100      500      1,000    2,000
Accuracy [%]                     74.9           77.4     77.5     76.8     76.4
Relative error reduction [%]     0.0            10.0     10.4     7.6      6.0
Accuracy in 4-best [%]           88.6           90.7     91.0     91.3     90.5
Relative error reduction [%]     0.0            18.4     21.1     23.7     16.7
Figure 2.2 shows the accuracy changes with respect to the interpolation factor for these two methods.
2.5 Discussion
Table 2.1 shows the advantage of the proposed method over the conventional classification, with a relative error rate reduction of up to 10.4% (achieved when the length of the SMT n-best list was 500). However, as expected, this number decreases with longer SMT n-best lists due to the increased noise present in lower-ranked outputs of the SMT.

Table 2.1 also shows the accuracy within the 4-best classifier outputs for each method. In that case the proposed method showed an error rate that was relatively 23.7% lower than the error rate of the conventional method. That was achieved at the peak of the accuracy within 4-best, when the length of the SMT n-best list was 1,000. In this case too, further increase in the length of the n-best list led to an accuracy degradation as the classifier models became noisier.

The effect of the background model on classifier accuracy is shown in Figure 2.2. The figure shows the one-best accuracy and the accuracy within the 4-best outputs, versus the background interpolation factor ($\lambda$), for both the conventional and proposed methods.
[Figure: classification accuracy (45%-95%) versus the background interpolation factor λ (0.0-1.0), with curves for the conventional ("Conv.") and proposed ("New") methods, each shown for 1-best and 4-best ("4-options") accuracy.]

Figure 2.2: The effect of the background model on classification accuracy
As the curves indicate, with $\lambda$ equal to zero the classifier has no discriminating feature, since all the class scores are driven solely by the background model. However, a slight increase in $\lambda$ leads to a large jump in accuracy. The reason is that the background model was built from a large general-domain corpus and hence had no bias toward any of the classes. With a small $\lambda$, the score from the background model dominates the overall class scores. In spite of that, the score differences caused by the class LMs are notable in improving the classifier performance.
As $\lambda$ increases, the role of the class LMs becomes more prominent. This makes the classifier models more discriminative and increases the accuracy, as shown in Figure 2.2. When the factor is in the close vicinity of one, the smoothing effect of the background model diminishes and leaves the classes with spiky models with very low vocabulary coverage (lots of zeros). This leads to a rapid drop in accuracy as $\lambda$ reaches one.

Both the conventional and proposed methods follow the above trend, as Figure 2.2 shows, although the proposed method maintains its superiority throughout the examined range of $\lambda$. The maximum measured accuracies for the conventional and proposed methods were 75.2% and 78.7% respectively, measured at $\lambda = 0.999$ for both methods. Therefore, the error rate of the proposed method was relatively 14.1% lower than its counterpart from the conventional method.

Figure 2.2 also indicates that when the accuracy is measured within the 4-best outputs, the proposed method again outperforms the conventional one. The maximum 4-best accuracy for the conventional method was measured at the sample point $\lambda = 0.9$ and was equal to 88.6%. For the proposed method, that number was 91.5%, achieved at the sample point $\lambda = 0.999$. In other words, considering the 4-best classifier outputs, the error rate of the proposed method was relatively 25.4% lower.
2.6 Summary
The proposed language-model-based method can be used to improve the accuracy of concept classifiers, especially in the case of sparse training data. It outperformed the conventional classifier, trained on the original source-language paraphrases, in the experiments. With this method, when the input utterance is within the classification domain, the classifier can be viewed as a filter that produces fluent translations (removes the "noise") from the SMT output.

The experiments also emphasized the importance of the background model, although they indicated that the classification accuracy was not very sensitive to the value of the background interpolation factor. This relieves the developers from fine-tuning that factor and eliminates the need for a development data set when a suboptimal solution is acceptable.

Significant improvements to the technique can be made through the use of n-best lists weighted by the SMT scores. In addition, using a much richer SMT engine could provide significant gains through increased diversity in the output vocabulary.
Chapter 3
Cross-lingual Dialog Model and Rejection Mechanism
3.1 Introduction
To design a system that aims to effectively combine a classifier with a traditional phrase-based statistical machine translator (SMT), the following issues need to be considered: 1) improving the classifier accuracy, and 2) enabling a preference mechanism that helps the system make robust selections between the two translation methods during an S2S interaction, or rank the options provided by each.

This raises the motivation to seek ways of using additional sources of information, besides the text from the recognized speech, to improve the classification performance and to enhance the ability to reject low-confidence classification results in favor of the statistical MT output. In an attempt to achieve those goals, a method of using cross-lingual dialog information in conjunction with a method for rejecting low-confidence classifier output is proposed. In such systems, utterances from each side of the conversation are statistically mapped to predefined concepts. For each side, these mappings carry information about the potential concept of choice on the other side, and are tracked and exploited during classification.
[Figure: an example doctor-patient exchange in which each utterance is mapped to a concept tag: "Hello" (c23), "Good morning" (r23), "How can I help you?" (r48), "I am sick." (c102), "Are you in pain?" (r11), "I have headache." (c231), "Do you have fever?" (r461), "My thermometer is broken." (c_null), "Let me measure your temperature" (r961); arrows labeled P(c38|r23), P(c102|r48), P(c231|r11), P(c_null|r461) show the flow of concepts over time.]

Figure 3.1: The dialog model statistically connects the concepts uttered from the two sides of a dialog
This sort of dialog information has been used previously in different applications to enhance the performance of speech utterance classification. For instance, in [34] the classification problem has been formulated in a way that the use of dialog information increases the accuracy of the classifier in a categorical classification task. A similar MAP classification approach is used here to utilize the information carried by the concept history sequence from both sides of the conversation [13].

Taking the MAP classification formulation leads to a practical framework that converts the problem into two straightforward modeling problems. The first model, i.e., the understanding model, represents the statistical relation between the transcribed utterances and the concepts. This type of modeling has been used for other applications in [5,36].
The second model is the statistical dialog model, which connects the concepts expressed by both sides of the conversation, as shown in Figure 3.1. The proposed method is evaluated with a corpus of doctor-patient English-Farsi dialogs [2,3].

This chapter is organized as follows. In Section 3.2, the concept classification approach for speech translation is explained in detail and the understanding model is described. Section 3.3 starts with the formulation of the problem for the S2S mediation context and continues with the derivation of a method that uses the dialog model. The system and the experimental setup are explained in Section 3.4, and the results are discussed in Section 3.5.
3.2 Concept Classification and Understanding Model
Using concept classifiers for speech translation has been investigated based on covering the target dialog domain with several concept classes. For example, in the medical-domain dialogs of [1,2], the doctor side of the conversation was mapped into 1,200 pre-specified classes and the patient side with around 400 classes. Associated with each class are representative surface-form instantiations that best convey the concept in the target language. The classifier then tries to map the source-language spoken utterances into one of the predefined concepts. Upon a successful mapping, the representative phrase of the winning concept class is played out in the target language.

If the dialog domain is partitioned into a set of concept classes $C = \{c^{(1)}, c^{(2)}, \ldots, c^{(|C|)}\}$, the classification task can be formulated as the following maximum a posteriori estimation:

$\hat{c}_t = \arg\max_{c \in C} \{ P(c \mid o_t) \}$   (3.1)
where $o_t$ is the acoustic observation of the spoken utterance in the source language at turn $t$, and $\hat{c}_t$ is the estimated concept of that utterance.

In practice, the above estimation is implemented in two well-known steps. First, the acoustic signal is transcribed by an automatic speech recognizer (ASR), and then a text classifier maps the transcription to a concept by using its words as classification features. With no prior knowledge about the concepts, this reduces to a maximum likelihood classifier, i.e.,

$\hat{c}_t = \arg\max_{c \in C} \{ P(\hat{e}_t \mid c) \}$   (3.2)

Here, $\hat{e}_t$ is the vector of words generated by the ASR as the transcription of $o_t$. The likelihood function $P(\hat{e}_t \mid c)$ can be approximated by a language model (LM) built specifically for each $c \in C$. All of these concept-specific language models form what is known as the Understanding Model [34]. In Section 3.4, building the understanding model is explained in detail.

A two-way speech-to-speech translation system needs two classifiers similar to Eq. (3.2). Each of them should have an understanding model in the language of its corresponding side. However, depending on the application, the concept sets can be different for each one, which will make the system asymmetric. An example of such a system is explained in [2] and [30]. It facilitates doctor-patient dialogs in which one side (the doctor) often drives the conversation by asking questions.
3.3 Dialog Model
The system described in Section 3.2 has the drawback of not utilizing any dialog context information, especially in an asymmetric two-way system where the driver side's utterances (such as those of the doctor, who controls the dialog flow in typical doctor-patient interactions) seem to carry some (if not much) information about what the other interlocutor says. An approach to incorporating such information is presented in [34] for a human-machine dialog system. A similar method can be used to incorporate dialog information in a two-way concept classification task.

Let us consider a two-way mediation system where each interlocutor has his own concept set (such as in a Q-A task). Such an asymmetric system will have a concept set $R = \{r^{(1)}, r^{(2)}, \ldots, r^{(|R|)}\}$ for the driving side (side A) and a different set $C = \{c^{(1)}, c^{(2)}, \ldots, c^{(|C|)}\}$ for the other side (B). The goal is to come up with a new classifier for side B that uses the decisions made by side A, with the hope that the extra information will lead to a better classification accuracy for side B.

With the assumption that the chain of dependency is limited to only one cycle of conversation (note that a first-order Markov model for dialog history was found to be effective in [34]), the classifier for side B can be rewritten as the following maximum a posteriori estimator:

$\hat{c}_t = \arg\max_{c \in C} \{ P(c \mid o_t, r_t) \}$   (3.3)
where $r_t \in R$ and $o_t$ is the acoustic observation of side B in conversation turn $t$. Since in practice the transcription of $o_t$ is needed for classification, we reformulate Eq. (3.3) as the following maximization:

$\max_{c \in C,\, e} \{ P(c, e \mid o_t, r_t) \}$   (3.4)

where $e$ is a potential transcription of $o_t$. In the absence of any prior information, the above maximization is equivalent to

$\max_{c \in C,\, e} \{ P(o_t \mid e, r_t, c) \cdot P(e) \cdot P(e \mid c, r_t) \cdot P(c \mid r_t) \}$   (3.5)

Two assumptions are made here to make the maximization problem practically feasible:

1. The first assumption is $P(o_t \mid e, r_t, c) \approx P(o_t \mid e)$, which means that the acoustic observations do not depend on either side's chosen concepts [34].

2. It is also assumed that the transcription of side B's utterance and side A's concept are independent, i.e., $P(e \mid c, r_t) = P(e \mid c)$.

These assumptions help split Eq. (3.5) into the following two-step maximization:

$\hat{e}_t = \arg\max_{e} \{ P(o_t \mid e) \cdot P(e) \}$   (3.6)

$\hat{c}_t = \arg\max_{c \in C} \{ P(\hat{e}_t \mid c) \cdot P(c \mid r_t) \}$   (3.7)

This decomposition is greatly beneficial from a practical point of view. According to Eq. (3.6), $\hat{e}_t$ can be the output of an ASR with acoustic and language models that estimate
$P(o_t \mid e)$ and $P(e)$, respectively. Eq. (3.7) shows that the concept of the utterance spoken by side B can be estimated using the ASR output and the concept chosen by side A. While the estimator of Eq. (3.2) uses the information from only one side, Eq. (3.7) uses the statistical dependency between the concepts on both sides of the dialog, i.e., $P(c \mid r_t)$. This dependency can be estimated by the dialog model.

Hence, in practice, Eq. (3.7) is used for concept estimation as

$\hat{c}_t = \arg\max_{c \in C} \{ P_U(\hat{e}_t \mid c) \cdot [P_D(c \mid r_t)]^{\alpha} \}$   (3.8)

where $P_U$ and $P_D$ are the understanding model and the dialog model, respectively. These functions are the imperfect estimates of the probabilities in Eq. (3.7). The exponential weight $\alpha$ is introduced here to emphasize (or deemphasize) the effect of the dialog model; Eq. (3.8) is in fact a log-linear combination of the scores from the two models. Section 3.4 contains the details of tuning $\alpha$ to reach the best performance.
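As an illustration of the log-linear combination in Eq. (3.8), here is a minimal sketch in log space. The two score callbacks and the weight name `alpha` are assumptions of this sketch, standing in for the understanding and dialog models described above, not the system's actual interfaces.

```python
def classify_with_dialog(asr_words, r_t, concepts, log_p_u, log_p_d, alpha=1.0):
    """Eq. (3.8): combine understanding-model and dialog-model scores.

    log_p_u(words, c): log P_U(e_t | c), the understanding-model score (assumed).
    log_p_d(c, r_t):   log P_D(c | r_t), the dialog-model score (assumed).
    alpha weights the dialog model; the result is a log-linear combination.
    """
    return max(
        concepts,
        key=lambda c: log_p_u(asr_words, c) + alpha * log_p_d(c, r_t),
    )
```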
3.4 Data and System
The Transonics system described in the previous chapter was chosen as the framework to test the new modeling of Eq. (3.8). The languages were English and Farsi for the doctor and patient sides, respectively. Besides a statistical MT engine, the system also uses concept classification as an alternative translation method (Figure 3.2). The system is asymmetric in the sense that the concept sets for the two sides are not similar. The concepts were manually assigned using expert resources. In the type of conversations that the system is designed for in [2] and [30], the doctor predominantly controls the flow of the dialog.
[Figure: block diagram of the two-way system. English and Farsi speech recognizers produce text for an English-to-Farsi and a Farsi-to-English translator; each translator contains an SMT and a classifier whose outputs feed a decision module, with a dialogue manager supplying context information; Farsi and English synthesizers render the selected translation as speech.]

Figure 3.2: The English-Farsi S2S system with a dual translation scheme involving a classifier and an SMT
While the acoustic and language models of Eq. (3.6) are embedded in the system's two ASRs [38], the focus of this task was to build the understanding and dialog models and compare the system performance under two conditions: with and without the dialog model.
3.4.1 Categorical Understanding Model: Concept Classification
The understanding model consists of the LMs built for each concept class. The data needed to build an LM for a specific concept are in the form of paraphrases that convey that concept. Such data were collected in different ways. For instance, to collect paraphrases for the doctor side, a website was set up and several people were invited to enter paraphrases for each concept. Data were also collected from standardized-patient-based methods [3]. Data were also collected directly from a few small groups of Farsi speakers for each concept class on the patient side. Table 3.1 shows the collected data statistics.

Table 3.1: Training data for understanding model

Specification                    Doctor               Patient
Language                         English              Farsi
Type                             Text                 Text
Number of concept classes        1,269                364
Training data for                9,894 lines          27,459 lines
class-specific LMs               (60,050 words)       (182,751 words)
Training data for                25,305 lines         42,870 lines
background LM                    (224,642 words)      (263,683 words)
For each class a trigram LM was built using the SRILM toolkit with Ristad's natural discounting law [42]. Transcriptions from in-domain conversations were also used to build a background LM. Each class-specific LM was interpolated with that background model to create a smoother model for each class [42]. That sort of smoothing was used in the task reported in [5], and a similar method was introduced in [36]. The interpolation weight of the two models requires optimization on a development set representative of the usage conditions.

In addition to the concept classes, a rejection or null class was also included in the concept set for each side. That class represents the utterances that do not convey any of the covered concepts and therefore should be rejected. The LM of the null class was trained only on the background data. The presence of a null class is a great advantage when the concept set does not cover the whole dialog domain. From a practical point of view, this was especially critical because in such an unrestricted conversation scenario the classifier, at best, is designed for, and can capture, only a fraction of the concepts conveyed.

For each side of the dialog an understanding model was built using the class-specific LMs. Manually transcribed and translated data from human-human, monolingual (English), non-mediated in-domain interactions were annotated and used as a development set. Thus the interpolation weight of the understanding model was not optimized for the case of noisy ASR outputs. The data are described in Table 3.2.
3.4.2 Dialog Model
In this task, the dialog model represents the statistical dependency of the patient's concept on the concept expressed by the doctor's utterance in the same conversation cycle. For training the model, the manual transcriptions of audio data recorded from doctor-patient conversations were used [3]. A set of 15,411 conversation turns (i.e., a doctor's utterance followed by the patient's response) was selected for unsupervised training of the dialog model. Each utterance pair was first converted to a pair of concept tags using the classifier of the corresponding side. The tag pairs were then used to train a bigram dialog model using SRILM with no discounting scheme.
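The following sketch shows the shape of this unsupervised step: tagging turn pairs with the existing per-side classifiers and counting tag bigrams to estimate P(c | r). The classifier callbacks are assumptions; the actual model was a bigram LM trained with SRILM.

```python
from collections import Counter, defaultdict

def train_dialog_model(turn_pairs, classify_doctor, classify_patient):
    """Estimate P(c | r) from (doctor_utterance, patient_utterance) pairs.

    classify_doctor / classify_patient map an utterance to a concept tag
    (assumed to be the per-side classifiers described above).
    """
    pair_counts, r_counts = Counter(), Counter()
    for doc_utt, pat_utt in turn_pairs:
        r, c = classify_doctor(doc_utt), classify_patient(pat_utt)
        pair_counts[(r, c)] += 1
        r_counts[r] += 1
    p_d = defaultdict(dict)
    for (r, c), n in pair_counts.items():
        p_d[r][c] = n / r_counts[r]  # relative frequency, no discounting
    return p_d
```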
3.4.3 Combining Understanding and Dialog Models
The concept classifier makes decisions based only on its understanding model. The goal of this task was to improve the accuracy of the decisions on the patient side by using the information from the doctor's decisions through the dialog model.
Table 3.2: Development and testing data (both data sets were manually tagged for performance measurement)

         Side       Language   Type                    Size [utterances]
Set A    Doctor     English    Manual transcription    106
         Patient    Farsi      Human translation       106
Set B    Doctor     English    Manual transcription    252
         Patient    Farsi      ASR output              252
To apply the dialog model, the scores generated by the classifier must be combined with the dialog model scores, and the overall scores should be used to select the right class. The effect of the dialog model can be emphasized (or deemphasized) through the exponential weight $\alpha$ of Eq. (3.8). To set this parameter to an optimal value, a development data set was prepared (Table 3.2). The development data were manually annotated for performance measurement. A one-dimensional search for the parameter led to a setting that gave the minimum classification error on the development data. That setup was then frozen and used to test the system.
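A minimal sketch of this one-dimensional tuning step, sweeping candidate weights on the development set; the candidate grid, the error function, and the `classify_with_alpha` interface are assumptions of this sketch.

```python
def tune_alpha(dev_examples, classify_with_alpha, candidates=None):
    """Pick the dialog-model weight that minimizes dev-set classification error.

    dev_examples: list of (utterance, r_t, reference_concept) triples.
    classify_with_alpha(utterance, r_t, alpha) -> predicted concept (assumed).
    """
    candidates = candidates or [i / 10 for i in range(0, 21)]  # 0.0 .. 2.0
    def error(alpha):
        wrong = sum(
            1 for utt, r_t, ref in dev_examples
            if classify_with_alpha(utt, r_t, alpha) != ref
        )
        return wrong / len(dev_examples)
    return min(candidates, key=error)
```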
3.5 Results
To validate the benefits of applying a dialog model, two sets of experiments were conducted:

1. Set A, as described in Table 3.2, was used for development, and Set B as a test set. As a result there is a mismatch between training and testing conditions.

2. Sets A and B were combined and, using the leave-one-out technique, about two thirds of the data were used for development and the remainder for testing; the test was repeated three times.

Table 3.3: Classifier accuracy with and without dialog model for Experiment 1

Data                      Experiment 1                     Development   Testing
All data                  Baseline (w/o dialog model)      50.00%        41.30%
                          w/ dialog model                  53.80%        44.80%
                          Relative error reduction         7.60%         5.96%
Data annotated as         Baseline                         41.90%        48.10%
"null" (rejection)        w/ dialog model                  54.80%        64.40%
                          Relative error reduction         22.20%        31.41%
Excluding the data        Baseline                         51.30%        33.30%
recognized as "null"      w/ dialog model                  54.80%        35.40%
                          Relative error reduction         7.19%         3.15%
In Table 3.2, the patient side of Set A was manually translated and transcribed, while Set B was generated by running a Farsi ASR [38] on the recorded audio files.

The classifier was evaluated on these transcripts and compared with human class annotations. The performance of the classifier without the dialog model was acquired as the baseline. Then the classifier with the dialog model was applied to the same data.

Table 3.3 presents the results of Experiment 1, where there is a mismatch between development and test conditions in the optimization of the understanding model; it shows the accuracy obtained by the patient-side classifier before and after engaging the dialog model. According to these results, a relative decrease of 5.96% in the classification error was achieved by applying the dialog model. The improvement is due to a more elaborate modeling, based on Eq. (3.3), that utilizes some of the dialog information.
Table 3.4: Classifier accuracy with and without dialog model for Experiment 2

Data                      Experiment 2                     Development   Testing
All data                  Baseline (w/o dialog model)      56.80%        58.37%
                          w/ dialog model                  58.62%        60.37%
                          Relative error reduction         4.21%         4.80%
Data annotated as         Baseline                         90.00%        89.55%
"null" (rejection)        w/ dialog model                  91.37%        91.13%
                          Relative error reduction         13.74%        15.07%
Excluding the data        Baseline                         55.52%        55.60%
recognized as "null"      w/ dialog model                  59.91%        60.18%
                          Relative error reduction         9.87%         10.31%
also shows a signicant improvement (relative reduction of 22.2% for development and
31.41% for testing data) in the rejection accuracy achieved by applying the dialog model.
However, looking at the overall improvement by itself can be misleading, since it could
merely be due to a higher score given to the null concept by the dialog model. Therefore,
the accuracy was also measured without the utterances that were classified as the null
concept. Since an accuracy improvement was also observed in this case (3.15%), we
can deduce that the dialog model has helped the classification of non-null utterances in
addition to improving the rejection accuracy (31.41%).
The results of Experiment 2 are shown in Table 3.4, where we again observe a 4.80%
relative error reduction. The relative improvement in rejection accuracy (15.07%) is
smaller than in Experiment 1 but still significant, likely due to the optimization of the
understanding model, which leaves smaller margins for potential improvement.
3.6 Summary
The formulation of concept classification as maximum a posteriori estimation was
adopted and extended in a practical way suitable for S2S mediation scenarios. The
formulation decomposes the estimation task into two separate, well-known steps of
speech recognition and text classification, with four familiar components: acoustic model,
language model, understanding model, and dialog model. The resulting classification
scheme is not only based on an understanding model but also uses a dialog model, which
provides a way to incorporate dialog information directly into the speech translation task.
For a two-way S2S MT system based on concept classification, the deployment of the
dialog model opens up a way to use information from both sides of the conversation. For
the English-Farsi doctor-patient dialog system used in this work, a relative error reduction
of 4.80-5.96% in classifier performance indicates that using information from one side of
the conversation (the driving side, in this task) can improve the quality of the translations
of the other interlocutor's speech.
More notably, applying the dialog model also significantly improved the rejection accuracy
(15.07-31.41% relative error reduction). This leads to more accurate detection of the cases
in which the classifier fails to produce a reliable result, indicating that the fallback selection
of the output from the statistical MT will be more appropriate.
Chapter 4
Data-driven Training
4.1 Introduction
While the training of an SMT is mainly done through the use of bilingual parallel data,
training a concept classifier requires sets of sentences with the same (or very similar)
concept. These are often generated by deciding on a set of canonical utterances and then,
for each of them, manually generating multiple paraphrases. This procedure can be
extremely time consuming.
The size of currently available data corpora is an incentive to seek automatic ways
of both selecting the concept classes and clustering the training data into these classes.
The reason that two sentences are placed in the same concept class motivates the
definition of a distance metric between them. From a translation point of view, sentences
share the same concept if they have "similar" translations. Therefore, the similarity
of two sentences can be judged by the similarity of their translations. To cope with
possible lexical mismatch, one can provide more than one translation for each sentence and
compare them. In other words, from each sentence in the source language a document is
created that consists of the various possible translations of that sentence. To check whether
two original sentences have matching concepts, one can compare their associated translation
documents.
With the existence of such documents and an appropriate distance metric to compare
them, the problem of unsupervised training of the classifier reduces to a clustering
problem [15]. The focus of this chapter is twofold: (1) identify a cross-sentence distance
metric that correlates well with the concept closeness of the two sentences in question,
and (2) identify and employ clustering techniques that rely on relative rather than global
distance metrics.
The following section explains the supervised training procedure of the concept
classifier. In Section 4.3 the proposed method for unsupervised training is explained in
detail. Section 4.4 covers the evaluation of the training procedure; both intermediate and
end-to-end evaluation measures are discussed. Section 4.5 describes the experiments and
the associated data used in this work; both the proposed method and the k-means
clustering algorithm were investigated. The results are compared and discussed in
Section 4.6, which is followed by conclusions in Section 4.7.
4.2 Training the concept classifier
For a system based on Eq. 2.1, training starts with the selection of concepts. Usually
a set of canonical sentences that represent these concepts is hand picked from different
available resources. These canonical representations should cover a reasonable portion
of the dialog domain. After forming the concept set, groups of paraphrases are needed
to train the class language models. These paraphrases are usually provided by human
annotators.
For example, in the Transonics project the canonical concepts were manually selected
using medical phrase books, websites, and human judgment, as mentioned before.
Then human subjects were asked to provide paraphrases through a website, a web-based
game, and paraphrasing sessions [30]. The lessons learned can be summarized as follows:
1. Manually selecting the concepts can lead to poor coverage of the dialog domain.
2. Since the selected concepts are not derived from real data, some of them might be
uncommon in real dialogs.
3. Overlapping concepts are very difficult to avoid.
4. Even for a moderate number of classes, the training data are difficult to collect from
human resources, and the method is not practical for a large number of classes.
5. The paraphrases provided by human subjects are not always common sentences in
the dialog domain.
However, the demonstrated suitability of classifier-based translation engines for the
Transonics system is a strong incentive to seek new approaches to overcome some of the
above obstacles.
4.3 Unsupervised Data Processing
The available data corpora for S2S applications cannot be directly used for classifier
training without additional processing. These corpora are usually collected from example
conversations. At best, the data can be transcribed to train system components such as
monolingual language models for the speech recognizer. Also, large portions of the
collected data are often translated for SMT training (bilingual). The goal is to identify
the common concepts in these data sets and, for each concept, create a cluster containing
all the utterances that convey that concept. Human input could be employed at the
final step to represent the concepts in a canonical form in the target language. The obvious
requirements for implementing this procedure are, first, specifying a distance metric among
phrases and, second, choosing a clustering method.
4.3.1 Utterance level distance
Since the goal is translation, it is intuitive to think that if two sentences have similar
translations in another language, they convey the same concept. Therefore, the similarity
of two sentences can be judged by the similarity of their translations. This motivates the
definition of a distance metric that is based on comparing the translations. In fact, two
utterances from the training corpus are put together in the same concept class because
they have the same translation.
However, two sentences with the same concept might have translations that share
no common word. Therefore, such a comparison must be based on multiple translations
of each original sentence; otherwise it would not be discriminative enough for any practical
application. Using multiple translations can partially address problems like lexical
mismatch. If a group of translations is available for every data utterance, defining a
distance metric for them can be seen as a document comparison problem.
Various methods of document comparison have been introduced and deployed in a
wide range of applications. The essence of these methods is a measure that indicates the
similarity of the documents. For instance, in text clustering, documents are represented
by points in a vector space: for each document, a vector is generated in which the words
define the dimensions and the occurrence counts the scales [10]. Then distance measures,
e.g., the Euclidean distance, can be used as a similarity metric. The vectors, however,
contain no word-ordering information.
In practice the sparsity of such vectors makes the comparison inaccurate. For
example, the vocabulary of a medium-sized domain could contain more than 6,000 words
[31], but a sentence, represented by an M-dimensional (M > 6,000) vector s, will contain
at most 10-20 words. As a result, direct comparison of two such vectors is unlikely to
provide any meaningful distance measurement.
The proposed solution here can be viewed as "fuzzifying" the vector representations
of sentences by adding up a large number of noisy measurements to make a new
representation vector. If s_i is the vector representing a valid utterance, a corrupted version
of that utterance will be represented by a vector x_i that can be modeled as a noisy
measurement (x_i = s_i + v_i). The document consisting of corrupted versions of n
utterances is represented by the vector x_1 + ... + x_n. The goal is to reduce the sparsity
of the measurement while attempting to keep the noise, at the concept level, to a minimum.
One way to create such noisy measurements x_i is to employ an SMT engine for a
language pair with plentiful available resources, i.e., on the order of millions of words. In
that case x_i would be the representation of a sentence in the target or intermediate
language. Again, the assumption here is that for an SMT built on vast amounts of data,
there will be little quality degradation as one moves down the n-best list, provided the
length of the list is chosen appropriately. Therefore x_1, x_2, ..., x_n will contain acceptably
low levels of noise.
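To make the fuzzification concrete, the short sketch below builds the summed representation x_1 + ... + x_n as a word-count vector over an n-best list. The function name and the whitespace tokenization are illustrative assumptions, not the dissertation's actual implementation.

```python
from collections import Counter

def nbest_document_vector(nbest_list):
    """Sum the (sparse) word-count vectors of all hypotheses in an n-best
    list, producing the denser document representation x_1 + ... + x_n."""
    vector = Counter()
    for hypothesis in nbest_list:
        vector.update(hypothesis.split())
    return vector

# Even a tiny list is denser than any single hypothesis:
doc = nbest_document_vector(["please sit down", "have a seat", "you may sit down"])
```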
4.3.2 Distance Measure
The classifier of Eq. 2.1 is aimed at comparing utterances based on class LMs. The
documents can also be compared on an LM basis, mainly to incorporate some level of
word-ordering information into the distance metric. By building an LM for each n-best
list, each utterance in the data is given an associated LM in the target (or intermediate)
language. Since these models are approximations of probability density functions, they
can be compared using information-theoretic measures like relative entropy (the
Kullback-Leibler divergence, KLD). Although relative entropy is not commutative and
therefore cannot be used as a metric directly, it can be modified in the following way to
serve the purpose:
\mathrm{KLD}_{\mathrm{sym}}(P, Q) = \frac{1}{2} D(P \| Q) + \frac{1}{2} D(Q \| P) \qquad (4.1)
The Jensen-Shannon divergence (JSD) [26] is another symmetric and smoother derivative
of the relative entropy. It is defined as
\mathrm{JSD}(P, Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M) \qquad (4.2)
where M = \frac{1}{2}(P + Q). A recursive algorithm presented in [35] efficiently
calculates the relative entropy (and hence the KLD and JSD) between language models.
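For intuition, the sketch below computes the two divergences for explicit discrete distributions over a shared vocabulary. The dissertation instead applies the recursive algorithm of [35] directly to backoff n-gram LMs, so this is only an illustrative stand-in with hypothetical names.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Relative entropy D(P || Q) between two discrete distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

def kld_sym(p, q):
    """Symmetrized KL divergence of Eq. (4.1)."""
    return 0.5 * kl(p, q) + 0.5 * kl(q, p)

def jsd(p, q):
    """Jensen-Shannon divergence of Eq. (4.2), with M = (P + Q) / 2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```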
4.3.3 Clustering Algorithms
Selecting concepts and forming classes of data for classifier training is in fact a clustering
problem. While each cluster is assumed to represent a concept class, the utterances
forming these clusters can be used to train the classifier.
Hierarchical techniques (e.g., agglomerative methods) are not very popular in text
clustering applications. This is mainly due to their inherent chaining effect, which lets a
very dissimilar item appear in a cluster merely because of its closeness to another member.
Experiments with the agglomerative algorithm in the early stages of this work did not
show any promising results either.
The partitional algorithms that are common in document clustering mostly rely on
the representation of the items in a coordinate system. For instance, the k-means
algorithm and all its variations are based on centroid computations that are only
meaningful when the items are represented in a vector space. This holds despite the
variety of ways the vectors might be defined and the variations in processing details such
as normalization, inverse document frequency (idf) adjustment, etc.
When the items are compared based on the language models' mismatch (Eqs. 4.1
and 4.2), the clustering algorithm must take the distances among the items as its sole
form of information. With no definition of item coordinates, centroid computation is not
possible. Also, since the information-theoretic measures do not satisfy the triangle
inequality, the clustering must not rely on that property.
39
We selected the Exchange Method and the Affinity Propagation clustering algorithms
to use with the measures of Eqs. 4.1 and 4.2. Both methods use only the relative
distances among the items as their sole form of information (the items' coordinates are
not defined, so centroid computation is not possible) and were derived with no
assumptions regarding the triangle inequality.
4.3.3.1 Exchange Method
The Exchange Method, introduced in [37], reformulates the clustering problem as an
optimization task. This algorithm has also been used for summarization [19] and for
clustering semantically related adjectives [28].
If C_1, C_2, ..., C_K are the clusters, the goal is to minimize the cost function

\Phi(\mathcal{C}) = \sum_{i=1}^{K} \frac{1}{|C_i|} \sum_{x, y \in C_i,\, x \neq y} d(x, y) \qquad (4.3)

\mathcal{C} \triangleq \{ C_1, C_2, \ldots, C_K \} \qquad (4.4)
Here, d(x, y) is the distance between items x and y, and |.| denotes cluster cardinality.
The algorithm is as follows:
1. Randomly assign one item to each cluster (to ensure no cluster is initialized
empty).
2. Assign the remaining items randomly to the clusters.
3. Select the first item.
4. If the selected item is the only member of its cluster, select the next item.
5. Tentatively move the selected item from its cluster to each of the other clusters and
check the change in the cost of Eq. 4.3. If the move lowers the cost, keep the item in the
new cluster. Recalculating the total cost after each move is not necessary: if item t
originally belongs to cluster C_k (so that |C_k| > 1), its move to cluster C_i will decrease
the total cost if and only if

\frac{1}{|C_k| (|C_k| - 1)} \Bigg[ \sum_{x, y \in C_k,\, x \neq y} d(x, y) - |C_k| \sum_{x \in C_k,\, x \neq t} d(x, t) \Bigg] < \frac{1}{|C_i| (|C_i| + 1)} \Bigg[ \sum_{x, y \in C_i,\, x \neq y} d(x, y) - |C_i| \sum_{x \in C_i} d(x, t) \Bigg]

The above condition must be checked for every cluster C_i \in \mathcal{C}, i \neq k.
6. Repeat steps 4 and 5 for the next items until all of them have been checked.
7. Start over from step 3 until no cost change is observed.
To avoid local minima, the above algorithm should be run several times with different
random initializations in step 1.
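The following is a minimal, unoptimized sketch of the algorithm above. It takes the distance table as a nested dict and counts each unordered pair twice inside the cost, which only rescales Eq. (4.3) and leaves the exchange decisions unchanged for symmetric distances; it assumes k >= 2 and all names are illustrative.

```python
import random

def intra(cluster, d):
    """Sum of pairwise distances inside a cluster (each unordered pair twice)."""
    return sum(d[x][y] for x in cluster for y in cluster if x != y)

def move_delta(t, src, dst, d):
    """Exact change in the Eq. (4.3) cost if item t moves from src to dst."""
    s_src, s_dst = intra(src, d), intra(dst, d)
    t_src = sum(d[t][x] + d[x][t] for x in src if x != t)
    t_dst = sum(d[t][x] + d[x][t] for x in dst)
    removed = (s_src - t_src) / (len(src) - 1) - s_src / len(src)
    added = (s_dst + t_dst) / (len(dst) + 1) - s_dst / len(dst)
    return removed + added

def exchange_method(items, d, k, seed=0):
    """Steps 1-7: random initialization, then keep moving items between
    clusters as long as a move lowers the total cost."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    clusters = [{x} for x in items[:k]]        # step 1: no empty clusters
    for x in items[k:]:                        # step 2
        rng.choice(clusters).add(x)
    improved = True
    while improved:                            # step 7
        improved = False
        for t in items:                        # steps 3 and 6
            src = next(c for c in clusters if t in c)
            if len(src) == 1:                  # step 4
                continue
            best = min((c for c in clusters if c is not src),
                       key=lambda c: move_delta(t, src, c, d))
            if move_delta(t, src, best, d) < 0:  # step 5
                src.remove(t)
                best.add(t)
                improved = True
    return clusters
```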
4.3.3.2 Affinity Propagation
Affinity propagation [16] was introduced based on the max-sum algorithm in factor
graphs [24]. The clusters form gradually through an iterative message-passing procedure.
Similar to the k-centers algorithm, every cluster is represented by one of the data items,
called an "exemplar". Every item is initially a potential exemplar. As the algorithm
proceeds, some of the items emerge as stronger exemplars and every other item shows a
tendency to be represented by one of these exemplars.
The input to the algorithm is a table of pairwise distances between the data items and
a "preference" number that indicates each item's potential to become an exemplar.
Every data item sends messages, called "responsibilities", to the other items, indicating
how well it could be represented by each of them. As a potential exemplar, each
data item shows its adequacy to represent another item by sending it a message called
an "availability". In every iteration these two types of messages are exchanged among all
the items.
If D is the set of all items, then for two items x, y in D with distance d(x, y) between
them, we denote the responsibility and availability by r(x, y) and a(x, y), respectively,
when y is considered the potential exemplar. The preference of x is represented by
-d(x, x) (not an actual distance). The algorithm can be described as follows.
1. \forall x, y \in D: initialize a(x, y) = 0.
2. \forall x, y \in D, update the responsibilities as

   r(x, y) \leftarrow \lambda\, r(x, y) + (1 - \lambda) \Big[ -d(x, y) - \max_{t \in D,\, t \neq y} \{ a(x, t) - d(x, t) \} \Big]

3. \forall x, y \in D, update the availabilities as

   a(x, y) \leftarrow \lambda\, a(x, y) + (1 - \lambda) \min \Big\{ 0,\; r(y, y) + \sum_{t \in D,\, t \neq x, y} \max\{0, r(t, y)\} \Big\} \quad \text{if } x \neq y,

   a(x, x) \leftarrow \lambda\, a(x, x) + (1 - \lambda) \sum_{t \in D,\, t \neq x} \max\{0, r(t, x)\} \quad \text{otherwise.}

4. \forall x \in D, update the exemplars as

   e_x \leftarrow \arg\max_{t \in D} \{ a(x, t) + r(x, t) \}

5. While the stop criterion is not met, go to step 2.

6. \forall x \in D with e_x = x, select the corresponding cluster as

   C^{(x)} = \{ t \in D : e_t = x \}
The damping factor \lambda was included in the update rules to guarantee the convergence
of the algorithm [16]. An item x is an exemplar if e_x = x. The stop criterion can be a
fixed number of iterations, a threshold on the change of the messages between two
consecutive iterations, or the exemplars remaining unchanged for a certain number of
iterations.
The algorithm does not take the number of clusters as a prefixed parameter but
determines it automatically. However, this number is mainly affected by the choice of the
preferences: higher preferences mean a greater chance for items to emerge as exemplars
and therefore a larger number of clusters.
Since the algorithm does not need a random initialization, a single run suffices to
obtain the optimum result, which significantly reduces the overall clustering time.
Besides the fact that the triangle inequality is not necessary for the Affinity Propagation
method, it is not even restricted to commutative distance measures and therefore
accepts measures like relative entropy with no modification.
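A compact NumPy sketch of these update rules is given below. It follows the standard formulation with similarities s = -d and the preference on the diagonal; the damping value, iteration count, and function name are illustrative, and a production run might instead use an existing implementation such as scikit-learn's AffinityPropagation with a precomputed similarity matrix.

```python
import numpy as np

def affinity_propagation(dist, preference, damping=0.9, iterations=200):
    """Message passing on a pairwise distance table `dist` (which need not
    be symmetric). Returns, for each item, the index of its exemplar."""
    n = dist.shape[0]
    s = -np.asarray(dist, float)            # similarities
    np.fill_diagonal(s, preference)         # self-similarity = preference
    r = np.zeros((n, n))                    # responsibilities
    a = np.zeros((n, n))                    # availabilities
    rows = np.arange(n)
    for _ in range(iterations):
        # r(x, y) <- s(x, y) - max_{t != y} [a(x, t) + s(x, t)], damped
        tmp = a + s
        best = np.argmax(tmp, axis=1)
        first = tmp[rows, best]
        tmp[rows, best] = -np.inf
        second = tmp.max(axis=1)
        r_new = s - first[:, None]
        r_new[rows, best] = s[rows, best] - second
        r = damping * r + (1 - damping) * r_new
        # a(x, y) <- min{0, r(y, y) + sum_{t != x, y} max(0, r(t, y))}, damped
        rp = np.maximum(r, 0)
        np.fill_diagonal(rp, np.diag(r))    # keep r(y, y) unclipped
        col = rp.sum(axis=0)
        a_new = np.minimum(0.0, col[None, :] - rp)
        np.fill_diagonal(a_new, col - np.diag(r))
        a = damping * a + (1 - damping) * a_new
    return np.argmax(a + r, axis=1)         # e_x = argmax_t [a(x,t) + r(x,t)]
```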
4.3.4 Concept Representatives
Each data cluster is associated with a concept that needs to be represented by a canonical
utterance. A human supervisor can generate this representative or manually draw one
from the cluster members. When the number of clusters is too large, a method similar
to the one used in [48] can be applied to select the representatives automatically, i.e.,
\forall C \in \mathcal{C}: \quad r_C = \arg\min_{x \in C} \sum_{t \in C,\, t \neq x} d(x, t) \qquad (4.5)
In each cluster C, the utterance r_C with the least accumulated distance to the other
members of that cluster is selected. The two methods can be combined so that a few
automatically selected utterances are given to a human supervisor for identifying the
concept representative.
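As a sketch, the selection rule of Eq. (4.5) is a one-liner over the distance table (hypothetical names; the table is a nested dict as in the earlier sketches):

```python
def representative(cluster, d):
    """Eq. (4.5): pick the member with the smallest accumulated distance
    to all other members of the cluster (its 'medoid')."""
    return min(cluster, key=lambda x: sum(d[x][t] for t in cluster if t != x))
```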
4.3.5 Training
What was described in the above subsections can be put together as an automatic
method for concept selection and training-data preparation for a concept classifier. The
steps are illustrated in Figure 4.1 and are implemented as follows:
1. Domain definition: utterances from the source-language data are selected.
2. Sparseness reduction: an SMT system is used to translate these utterances into the
target or any other language.
3. Statistical representation: for each source utterance, a language model is built
from the associated n-best list provided by the SMT system.
4. Distance metrics: a table of JSD or KLD measures is built for all possible
language-model pairs.
5. Clustering: using the above distance information, the Exchange Method or Affinity
Propagation is applied to cluster the original utterances.
6. Representative selection: for each cluster, a representative is chosen. The
representative might be translated (manually or by an SMT) or, in the case of a parallel
corpus, pulled from the target-language part of the data. If the classifier selects a certain
class, the translated representative of that class is the output of the overall system.

[Figure 4.1: Overview of the proposed data preparation procedure. M data utterances
are translated by an SMT into M n-best lists; a language model is built from each list;
the KLD or JSD between every pair of models fills an M x M distance table, which drives
the clustering and representative selection that yield clusters #1 through #K.]
Complex data sentences can be broken into smaller segments through a preprocessing
stage. This reduces, but does not eliminate, the number of utterances with multiple
concepts. The presence of such utterances in large proportions may lead to clusters with
ambiguous concepts.
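Putting the six steps together, the training pipeline could look like the hypothetical sketch below, where `smt_nbest`, `build_lm`, and `lm_divergence` stand in for the Moses decoder, the SRILM model builder, and the recursive divergence computation of [35], and `exchange_method` and `representative` are as sketched earlier. None of these names comes from the dissertation's actual code.

```python
def prepare_classifier_data(utterances, smt_nbest, build_lm, lm_divergence,
                            num_clusters, n=1000):
    """Steps 1-6 of Section 4.3.5 in one pass: translate, model, measure,
    cluster, and pick one representative per concept cluster."""
    lms = {u: build_lm(smt_nbest(u, n)) for u in utterances}        # steps 2-3
    d = {u: {v: lm_divergence(lms[u], lms[v]) for v in utterances}  # step 4
         for u in utterances}
    clusters = exchange_method(utterances, d, num_clusters)         # step 5
    return [(representative(c, d), c) for c in clusters]            # step 6
```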
4.4 Evaluation
The classifier accuracy is the main evaluation measure; however, measuring the
clustering quality as an intermediate-level assessment is beneficial for developing the
training method. Despite its popularity in the MT world, the BLEU score [33] is not a
useful measure for the concept classification task. The BLEU score is based on lexical
matching between the hypothesized translation and its reference, yet two lexically
disjoint utterances can express the same concept. For instance, "have a seat" and
"please sit down" match in concept while having no word in common.
4.4.1 Clustering Evaluation
Two methods for evaluating the quality of the clustering task have been used in this work.
The first, introduced in [44], is based on computing the percentage of binary decisions
that are common between the clustered and reference data. Every possible pair of data
items yields a correct/wrong outcome depending on whether the reference and the
hypothesis agree on the pair's same-cluster status. The percentage of decisions in
agreement with the reference data can be used as an indicator of the quality of the
clustering task. In other words, the agreement can be written as
\mathrm{Agr}(\mathcal{C}, \mathcal{R}) = \frac{\mathrm{agreement}(\mathcal{C}, \mathcal{R})}{\binom{N}{2}} \qquad (4.6)

where agreement(\mathcal{C}, \mathcal{R}) is the number of common binary decisions between
the cluster set \mathcal{C} and the reference \mathcal{R}, and N is the number of data items.
The denominator is the number of possible pairs and hence the number of decisions to
be made.
Note that this measure considers placing two items from different classes into two
different clusters a correct decision, and hence is highly biased toward correct
measurements. For example, with 100 classes, a decision about two given items from
different classes has a 99% chance of being right. The measure therefore saturates as the
number of items grows and is not expected to distinguish performance improvements
very well. Despite the use of the agreement measure in the early stages of this work,
because of its marginal benefit, the proposed methods have been developed mainly based
on a second evaluation measure.
The second evaluation method is based on measuring the cluster purity, i.e., the
average entropy of the clusters. If \mathcal{R} is the set of reference classes, the average
entropy is defined for cluster set \mathcal{C} as
E = -\sum_{C \in \mathcal{C}} \frac{|C|}{\sum_{C' \in \mathcal{C}} |C'|} \sum_{R \in \mathcal{R}} P_{CR} \log(P_{CR}) \qquad (4.7)

where

P_{CR} \triangleq \frac{|C \cap R|}{\Big| C \cap \big( \bigcup_{\rho \in \mathcal{R}} \rho \big) \Big|} \qquad (4.8)
Experimental results presented in Section 4.5 show that the cluster purity measure is
more suitable for this task.
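The two measures, as reconstructed in Eqs. (4.6)-(4.8), can be computed as in the sketch below; labelings are assumed to be dicts mapping each item to a cluster or class id, and the helper names are illustrative.

```python
from itertools import combinations
from math import log2

def agreement(hyp, ref):
    """Eq. (4.6): fraction of item pairs on which the hypothesis and the
    reference agree about same-cluster membership."""
    pairs = list(combinations(hyp, 2))
    same = sum((hyp[a] == hyp[b]) == (ref[a] == ref[b]) for a, b in pairs)
    return same / len(pairs)

def average_entropy(hyp, ref):
    """Eqs. (4.7)-(4.8): cluster-size-weighted entropy (in bits) of the
    reference-class distribution inside each hypothesized cluster."""
    clusters = {}
    for item, c in hyp.items():
        clusters.setdefault(c, []).append(item)
    total = sum(len(m) for m in clusters.values())
    e = 0.0
    for members in clusters.values():
        counts = {}
        for item in members:
            counts[ref[item]] = counts.get(ref[item], 0) + 1
        h = -sum((k / len(members)) * log2(k / len(members))
                 for k in counts.values())
        e += (len(members) / total) * h
    return e
```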
4.4.2 Overall Evaluation
After training the classifier with the clustered data, an annotated test set can be used
to measure the success of the automatic training. To measure the classification
accuracy, it suffices to count the number of cases in which the input utterance and the
classifier output are from the same class in the reference annotations.
Table 4.1: The data sets used for clustering and classification

  Data Set             Transonics   BBN
  Language             English      English
  Domain               Medical      Family and Background Query
  Number of classes    1,269        117
  Number of sentences  9,893        2,393
It is common for S2S translation systems to provide the user with multiple options.
For instance, in the system of [30] the user is given a list of the top four classifier outputs
to choose from. In such cases it is practically useful to measure the accuracy of the
classifier within its n-best outputs (e.g., n = 4 for the above system). However, to avoid
biasing the accuracy when the n-best list contains multiple correct answers, only one of
them is counted.
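The counting rule above can be expressed as a short helper (an illustrative sketch; `nbest_outputs` holds each utterance's ranked class hypotheses):

```python
def accuracy_in_nbest(nbest_outputs, references, n=4):
    """An utterance counts as correct if the reference class appears
    anywhere in the top-n outputs; duplicates are counted only once."""
    hits = sum(ref in set(outputs[:n])
               for outputs, ref in zip(nbest_outputs, references))
    return hits / len(references)
```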
4.5 Data and Experiments
4.5.1 Classier Data
We used the two data sets shown in Table 4.1 to evaluate the performance of the proposed
methods.
The first set was originally collected for, and used in, the Transonics project [2] to
develop an English/Farsi S2S translator in the doctor-patient interaction domain. For
the doctor side (English), concept classes were carefully chosen using experts' judgment
and medical phrase books [29]. Then, for each concept, English data were collected
from a website, a web-based game, and multiple paraphrasing sessions, as explained in
Section 2.4.
The second set was provided by BBN Technologies and consisted of paraphrases of
questions about people's family backgrounds.
4.5.2 SMT Data
Two intermediate languages, Farsi and Iraqi Arabic, were chosen for this work. To train
and optimize the statistical translation systems we used two parallel corpora.
The English/Farsi corpus was prepared for the Transonics project, with a total size
of 149,881 sentences (1,183,600 English running words). Most of the data were collected
from general-domain conversation sessions and the rest from interactions between
medical students and actors performing as patients. This corpus was therefore biased
toward the medical domain.
The second corpus was used in DARPA's Transtac project, which involved the
development of English/Arabic S2S translators (for more details see [7]). This corpus
consists of 654,181 sentences (5,517,656 English running words).
4.5.3 Clustering with n-best Lists
First, we used the Transonics data, selecting the 97 classes that contained at least
four paraphrases. The associated 1,207 English utterances were randomly split into 500
for training and 707 for testing, with the assurance that each class had at least one
sentence in the training set. To generate the n-best lists, we trained and optimized the
Moses system [22]. The intermediate language was Farsi, and the SMT was trained on
the parallel English/Farsi corpus with 147,691 sentences (1,168,856 English words), as we
removed the parts of the corpus that overlapped with the classifier data. This corpus was
also used to build the classification background models in both languages. The SMT was
optimized using a parallel development set of 915 lines (7,281 English words). The size
of the lists was set to n = 1,000.
The language models were generated using the SRILM toolkit [42]. The KLD and
JSD distance tables were formed by applying the algorithm from [35] to every pair of
such language models. The distance tables were processed by both the Exchange Method
and the Affinity Propagation algorithm to cluster the utterances. Throughout this work
the number of classes was always treated as a known parameter; when it is not known,
this number can be determined by the Affinity Propagation method. Here, by adjusting
the preferences in Affinity Propagation, we fixed the number of clusters to the correct
value. For each case, the Exchange Method was run 100,000 times with random
initialization, and the outcome with the least cost (Eq. 4.3) was selected.
We also tried the spherical k-means algorithm using the gmeans software [9]. For that
purpose, the MC toolkit [8] was first used to create the vector models from the
documents (n-best lists). For comparison, k-means was applied to both the original
English utterances and their associated n-best-list documents (again n = 1,000).
For comparison, random clustering, in which the input utterances were randomly and
uniformly dispersed over the 97 output clusters, was also included in the experiment.
Table 4.2 shows the clustering agreement and cluster purity for these different clustering
methods, along with the results of supervised training with annotated data. For the
random clustering, the table shows the average of measurements from 10,000 runs.
Since the main goal was to build a classifier, the outcome clusters from each method were
used to train the classifier of Eq. 2.2 with n = 50.
Table 4.2: The results of different clustering schemes for Transonics data

  Method                                   Agreement [%]  Purity [bits]  Acc. [%]  Acc. 4-best [%]
  Random                                   97.47          3.780          14.43     35.08
  Exchange Method with KLD (n = 1,000)     98.54          5.245          50.07     65.91
  Exchange Method with JSD (n = 1,000)     98.39          5.106          46.11     63.08
  Affinity Prop. with KLD (n = 1,000)      98.33          5.099          49.36     65.63
  Affinity Prop. with JSD (n = 1,000)      98.33          5.077          47.95     62.66
  Spherical k-means with original data     97.92          4.763          38.61     54.60
  Spherical k-means with n-best documents  98.06          4.890          42.01     55.16
  Reference annotation                     100.0          6.213          72.84     87.98
Table 4.2 also shows the classification accuracy and the accuracy within the 4-best
outputs for each case, measured using the testing data. For all cases we selected the
class representatives using the method of Section 4.3.4. For the k-means, random, and
reference cases we used the JSD distance tables in Eq. 4.5.
We repeated the same set of experiments on the BBN data set using all 117 classes.
For training and testing, 500 and 1,000 sentences respectively were randomly selected,
with at least one sentence for each class in the training set.
Two intermediate languages were tried in this set of experiments. For Farsi, the Moses
system was trained on 145,885 sentences (1,155,775 English words) from the English/Farsi
corpus with 2,000 sentences (15,872 English words) for development.
Table 4.3: The results of different clustering schemes for BBN data

  Method                                     Intermediate  Agr. [%]  Purity [bits]  Acc. [%]  Acc. 4-best [%]
  Random                                     --            98.07     4.334          13.20     29.10
  Exchange Method with KLD (n = 1,000)       Farsi         98.51     5.240          37.50     54.20
                                             Arabic        98.55     5.269          34.90     53.70
  Exchange Method with JSD (n = 1,000)       Farsi         98.36     5.031          31.40     48.20
                                             Arabic        98.31     4.989          31.60     49.50
  Affinity Propagation with KLD (n = 1,000)  Farsi         98.63     5.350          40.50     57.50
                                             Arabic        98.62     5.327          39.10     56.20
  Affinity Propagation with JSD (n = 1,000)  Farsi         98.46     5.099          33.70     52.30
                                             Arabic        98.39     5.086          32.10     50.60
  Spherical k-means with original data       English       98.40     5.264          33.20     51.70
  Spherical k-means with n-best documents    Farsi         98.54     5.321          36.30     54.70
  Reference annotation                       --            100.0     6.561          60.20     77.90
To train the Moses system for English-to-Arabic translation, the Transtac corpus was
split into a 651,181-sentence (5,486,547 English words) training set and a 2,000-sentence
(22,632 English words) development set.
The results of the experiments on the BBN data are shown in Table 4.3. In all
cases, the classification of the test set was performed with Farsi lists of size n' = 50
(Eq. 2.2).
4.5.4 The Effect of the n-best List Size
To study the effect of the n-best list size on the clustering process, we used the same
setup explained in the previous section for the Transonics data and generated n-best lists
of 50, 100, 500, 1,000, 2,000, and 3,000 hypotheses per source sentence.
Table 4.4: The effect of the size of the SMT n-best list using the Exchange Method and
Affinity Propagation with the KLD metric

  Method           n-best length       50     100    500    1,000  2,000  3,000
  Exchange Method  Agreement [%]       98.48  98.56  98.63  98.54  98.60  98.55
                   Purity [bits]       5.222  5.281  5.323  5.245  5.308  5.273
                   Accuracy [%]        45.40  50.21  51.63  50.07  50.92  48.51
                   Acc. in 4-best [%]  59.97  63.65  66.62  65.91  65.06  64.22
  Affinity Prop.   Agreement [%]       98.33  98.28  98.33  98.33  98.31  98.32
                   Purity [bits]       5.100  5.079  5.091  5.099  5.080  5.097
                   Accuracy [%]        49.22  48.51  48.94  49.36  47.81  48.66
                   Acc. in 4-best [%]  65.06  63.65  64.22  65.63  65.06  68.03
Then, for each size, the KLD distance table was generated. The Exchange Method and
Affinity Propagation were run using these tables, and the clustering agreement and
cluster purity were measured for each case. These measurements are reported in
Table 4.4, along with the corresponding classifier accuracy on the testing data.
4.6 Results and Discussion
4.6.1 Clustering with New Metrics
The results in Tables 4.2 and 4.3 provide insights that can help the further development
of an unsupervised training method. The cluster purity and the agreement are included
to show the quality of the clustering process in each case. These numbers indicate that,
with some exceptions, better clustering leads to a more accurate classifier, although a
rigorous proof would require statistical analysis over the outcomes of numerous
experiments. The cluster purity appears to be the more useful measure for evaluating the
quality of the clusters, as the agreement measure saturates rapidly.
The tables clearly show the superiority of the methods that use SMT n-best lists for
distance calculation. Using n-best lists (Farsi as the intermediate language with
n = 1,000) in the k-means algorithm improved the classification accuracy by about 9%
(relatively) for both the Transonics and BBN data sets.
As is clear from Tables 4.2 and 4.3, the clustering methods with an
information-theoretic distance have an advantage over the k-means algorithm, which is
based on vector representations and the Euclidean distance. In both data sets, KLD
yielded better results in terms of both the clustering quality and the classification
accuracy that followed it. Since JSD is a smoother measure, it is less likely to capture
the small LM differences that are crucial in the clustering process.
For the Transonics data, the Exchange Method gave the best results by all measures.
The classifier accuracy in that case was relatively 30% better than the accuracy of
spherical k-means. With the same methods and the same intermediate language, this
accuracy improvement was 13% for the BBN data. For the latter data set, the best
results by all measures were achieved by Affinity Propagation clustering with Farsi as the
intermediate language; in that case, a 22% relative increase over the k-means accuracy
was observed. A better result for the Transonics data was expected because of the match
with the SMT domain (medical). It is worth mentioning that our experiments with
French and Spanish as the intermediate languages, using the Europarl corpora, gave
inferior results due to the domain mismatch.
For the BBN data, using Farsi and Arabic as the intermediate languages produced very
close results (except for the Exchange Method with KLD), although Farsi produced
slightly better numbers. The Arabic corpus used to train the SMT was more than four
times larger than its Farsi counterpart; however, Arabic, as a highly inflected language,
seemed to be a less efficient choice as an intermediate language.
In both data sets, the utterances are not evenly distributed over the classes, as some
of the concepts, e.g., greetings, are much more frequent than others. The original
distribution is more or less preserved when sampling the training and testing sets, which
leaves some classes with more items. With random clustering, these items dominate
some of the clusters. During testing, items from the more frequent classes are classified
into these dominated clusters and labeled correctly. Therefore, even with random
clustering, the average accuracy was 14.43% and 13.20% for the Transonics and BBN
data sets, respectively.
4.6.2 Clustering and the Size of the n-best Lists
Table 4.4 shows how the size of the n-best list affects the clustering process. To decouple
the size impact on clustering from that on classification, in all cases the classifier was
built using n-best lists of size n = 50.
Clustering with the Exchange Method seemed to follow the more expected pattern. An
increase in accuracy was observed as the size of the n-best lists grew from 50 to 500,
where an accuracy of 51.63% was achieved; a decline follows as the lists grow larger. As
more SMT hypotheses are included in the distance measurements, better clustering is
achieved due to the lexical diversity. However, for larger lists the assumption that all the
hypotheses are quality translations of the source sentence loses its validity. Low-quality
hypotheses at the bottom of the list increase the noise in the distance table and cause
inferior clustering results. Consequently, there is a trade-off between lexical enhancement
and noise control. The same trend can be observed for the accuracy within 4-best and,
more or less, for the cluster purity.
The results of the Affinity Propagation algorithm did not follow the same trend,
although the peaking effect occurred there too. Since the algorithm uses the original
(asymmetric) KLD as the distance measure, it is more sensitive to distance errors. In
fact, Eq. 4.1 can be viewed as a mild smoothing process that gives some level of noise
tolerance to the Exchange Method, which uses the symmetric KLD. As Table 4.4 shows,
the accuracy peak with Affinity Propagation clustering occurred at n = 1,000.
4.7 Summary
Clustering data utterances based on their concepts is necessary for the unsupervised
training of a classifier-based translation system. The main focus of this work was to
propose a distance measure that compares utterances based on their concepts. The
utterances are compared through their multiple translations: n-best lists from an SMT
system are used to generate these translations, language models are built from the lists,
and information-theoretic metrics quantify the difference between these language models.
We also showed that the Exchange Method and the Affinity Propagation algorithm
can be adapted as appropriate clustering methods for use with these metrics. The n-best
lists can also be used directly in the k-means algorithm. The effectiveness of these
methods was compared through a set of experiments.
In experiments with two different data sets, clustering with the proposed metrics
showed superior results compared to the classic k-means. The classifiers trained on the
outcomes of these clustering tasks also showed more than 20% (up to 30%) higher
accuracy compared to those trained on the outputs of k-means. For accuracy within
4-best, the improvement was at least 11% and up to 21%.
The effect of the size of the SMT-generated n-best lists was also examined. The
experiments show that the classifier accuracy peaks and then drops as the n-best list
length grows.
With the proposed methods, the concept classifier can be trained automatically.
However, as expected, the classification accuracy is significantly lower than what can be
achieved with supervised training. This work is the first step in the development of
unsupervised training methods for classifier-based translation systems. We are currently
focused on improving the quality of the clusters through different methods, including
filtering at different stages of the process, such as original data selection, n-best list
generation, and clustering.
Chapter 5
Training Based on Topic Modeling
5.1 Introduction
In Chapter 4 we introduced a method to group utterances based on their concepts [15].
In that method, each original sentence was associated with a list of its n best
translations, generated by an available SMT engine. In spite of including some erroneous
information, these n-best lists often have a richer lexical content than the original
sentences. A language model (LM) can encapsulate the local word dependencies in each
of these lists and represent the original sentence from which the list was derived.
Such LMs were used in our method to build a table of distances between the utterances;
that table was used in the clustering algorithm.
In this chapter, by means of topic modeling [41], we improve the distance metric
by capturing (latent) semantic associations of the words across the n-best lists. After
generating n-best lists for the training utterances, the topic distributions, or gist [18], of
the lists are used for distance measurements. Our experiments demonstrate improvements
in clustering quality, and in the accuracy of the classifiers trained on the clustered data,
when the topical information is included in the distance metric.
The next section gives an overview of the clustering method with n-best lists. In
Section 5.3 the distance metrics used in our previous and current methods are explained.
The details of the experiments and results are presented in Section 5.4, followed by
conclusions in Section 5.5.
5.2 Clustering with SMT n-best Lists
For S2S translation and SLU applications, we desire a clustering method that groups
paraphrases, i.e., sentences with common concepts, together. Conventional clustering
techniques, e.g., k-means, rely merely on the lexical overlap among documents, which in
this case consist of only one sentence each. For instance, take the utterance "Have a
seat". Although it conveys the same concept as the sentence "You may sit down", the
two share no lexical elements. The metric used in a conventional clustering method,
usually some distance between word-frequency vectors, would fail to detect their
similarity.
On the other hand, translating the above two sentences with a phrase-based SMT
engine [22] would most probably result in n-best lists with some matching translations.
Even if the generated lists contain no identical translations, some common words or
phrases are likely to exist in both. The transformation of an original sentence into its
n-best list can be viewed as a single-sentence-to-document mapping, and these n-best
lists can be treated as the input documents of a clustering algorithm. The language in
which these documents are produced (the "intermediate" language) can be chosen based
on the availability of a high-quality SMT system.
Obviously, the quality of the hypotheses in an n-best list degrades with rank; there is
therefore an inherent trade-off between list quality and the chance of matching. In
practice, the length of the lists greatly affects the performance of the whole clustering
process and must be chosen with care [15].
In summary, the proposed method is carried out in three major steps:
1. Generate n-best lists in an intermediate language via an SMT system.
2. Measure the distance between lists with an appropriate metric.
3. Cluster based on the above distances.
The distance evaluation methods presented here and in [15] do not involve representing
the documents in a coordinate system, and the adopted metrics do not satisfy the
triangle inequality. Therefore, the choice of clustering algorithms was limited to
techniques that do not rely on these properties. Similar to [12], we used the Exchange
Method [37] and Affinity Propagation [16] for clustering. These algorithms use only the
distances between elements as input. The Affinity Propagation algorithm can also
operate with asymmetric distances and therefore does not require symmetrization. The
hierarchical class of clustering methods, e.g., agglomerative algorithms, is not suitable for
this type of task, mainly because of the chaining effect.
5.3 Distance Metrics
It is common in text clustering methods to represent the elements, i.e., documents or
sentences, as points in a vector space and to use a metric such as the Euclidean distance,
or the cosine of the angle between two vectors, as the measure of dissimilarity or
similarity. However, such metrics are not discriminative enough to produce the desired
concept-based clusters [15].
5.3.1 Language Model Distance
In Chapter 4, to represent each utterance in the corpus, we used an LM built from the
n-best list that the SMT produced in the intermediate language. Language models
capture local word ordering; this can be viewed as capturing a shallow level of the
semantic and syntactic dependencies among words that occur in the vicinity of one
another throughout the corpus.
The distance between two n-best lists, and hence the dissimilarity between the two
original utterances, can be measured by calculating the distance between their
corresponding LMs in an information-theoretic fashion [35]. For instance, by using the
symmetric Kullback-Leibler (KL) divergence as the metric and the Exchange Method for
clustering, we demonstrated improvements in clustering and classification performance
compared to spherical k-means [15].
5.3.2 Topic Modeling
Although LMs have been shown to be quite useful in representing the concepts of
sentences, they do not explicitly model the (latent) semantic relations among words
throughout documents (here, n-best lists). Topic models, on the other hand, are
powerful tools to capture and manifest this sort of association [41]. In contrast to
conventional topic modeling, which relies on word co-occurrences within a document,
here we learn the topics, and the association of words with them, from the word
co-occurrences in the multiple translations of a single sentence, i.e., an n-best list.

[Figure 5.1: The training process with the distance metric based on topic modeling.
M data utterances are translated by an SMT into M n-best lists; a topic model is
estimated for each list; the KLD between the topic distributions fills an M x M distance
table, which drives the clustering and representative selection that yield clusters #1
through #K.]
Here we used Latent Dirichlet Allocation (LDA) [4], a common method of topic
modeling, to extract and represent the gist [18] of the n-best lists and, hence, the gist
of the sentences they are derived from.
The basic assumption in LDA is that the words of a document are independently
generated from a set of "topics". A topic is a multinomial distribution over all the words
in the corpus vocabulary, so words can be randomly generated from the topics. There is
also a multinomial distribution over topics associated with each document, which can be
regarded as the gist of that document. Each word in a document is considered to be
generated as follows: first a topic is drawn according to the gist of the document; that
topic then dictates the distribution from which the word is sampled. The topic
distributions (gists) are themselves assumed to be samples of a Dirichlet random vector.
The topics, their distributions, and the Dirichlet hyperparameters are all learned from
the corpus in an unsupervised manner; the only human input besides the words is the
document boundaries. Implementations based on either the variational method [4] or
Markov chain Monte Carlo (MCMC) [41] are available for training and inference.
The KL divergence between the topic distributions of two n-best lists (learned via
LDA) quantifies the distance between the original sentences from which the lists were
generated; experiments with the Hellinger distance showed inferior results. These
distances can be used for clustering, similarly to the ones calculated from the LMs. The
training process is shown in Figure 5.1.
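The sketch below illustrates this step with scikit-learn's variational LDA standing in for Mallet's Gibbs sampler (which the experiments in Section 5.4 actually use); each n-best list is joined into one document, and the symmetrized KL divergence between the inferred topic distributions fills the distance table. Function names and parameter defaults are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def gist_distance_table(nbest_lists, num_topics=100):
    """Fit LDA on the n-best lists (one document per list) and return the
    table of symmetrized KL divergences between their topic distributions."""
    docs = [" ".join(hyps) for hyps in nbest_lists]
    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=0)
    gists = lda.fit_transform(counts)       # one topic distribution per doc
    p = gists + 1e-12
    p /= p.sum(axis=1, keepdims=True)
    logp = np.log(p)
    # kl[i, j] = D(P_i || P_j); symmetrize as in Eq. (4.1)
    kl = (p * logp).sum(axis=1, keepdims=True) - p @ logp.T
    return 0.5 * (kl + kl.T)
```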
5.3.3 Combination of Two Distances
Inspection of the distance tables calculated from the above two metrics revealed
some level of discrepancy between them. For instance, in multiple cases, two utterances
with different concepts were mistakenly gauged to be very close by one metric, while
the other metric indicated their mismatch by estimating a large distance between them.
Such observations suggest that the metrics may contain some complementary
information. This inspires seeking a method to combine the information extracted from
the local lexical dependencies contained in the LM with the information from topic
modeling, learned from the global semantic associations of words.
Table 5.1: The data sets used for clustering

  Data Set             Transonics   BBN
  Language             English      English
  Domain               Medical      Family and Background Query
  Number of classes    1,269        117
  Number of sentences  9,893        2,393
If the LM and the gist of a document were statistically independent, it would be
justified to simply add the distances from the two metrics, since in both cases the
measurements are based on the KL divergence. In practice, a linear combination of the
two distances is a simple way of incorporating information from both sources. Such a
combination produced better results than using either metric exclusively, as shown in
Section 5.4 (simply adding the squares of the two distances did not appear to be
beneficial in our experiments).
It is noteworthy that some extensions of LDA have been introduced that can capture
some level of word-ordering information [46]. Our preliminary experiments with one of
these methods, called Topical N-grams, did not lead to any promising outcome.
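A sketch of the combination is given below. The default factor mirrors the scale-matching initialization described in Section 5.4.3 (the ratio of the largest entries of the two tables); the function name is illustrative.

```python
import numpy as np

def combine_distance_tables(lm_table, tm_table, mixing_factor=None):
    """Scale the LM-metric table by the mixing factor and add it to the
    TM-metric table. By default, match the scales of the two metrics via
    the ratio of their largest entries."""
    if mixing_factor is None:
        mixing_factor = tm_table.max() / lm_table.max()
    return mixing_factor * lm_table + tm_table
```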
5.4 Experiments and Results
5.4.1 Data
To evaluate the benefits of the metric derived from topic modeling, we experimented
with the two data corpora shown in Table 5.1. Both data sets consist of multiple classes
of paraphrases in English. The first set is from the Transonics project [15]; the second
was prepared for DARPA's Transtac project and provided to us by BBN Technologies.
To generate the n-best lists, we trained the Moses SMT engine [22] using a parallel
English/Farsi (intermediate language) corpus with 1.18M words on the English side.
The corpus was collected for the Transonics project and contains general-domain
conversations and doctor-patient interactions.
5.4.2 Clustering with Topic Modeling Metric
The first set of experiments was intended to examine whether the metric derived from
topic modeling (the TM metric) can be used effectively for clustering.
For the Transonics data we used only the 97 classes that have at least four utterances.
We selected 500 utterances from each corpus, randomly drawn from the 97 and 117
classes of the Transonics and BBN sets, respectively. The n-best lists were generated
with length n = 50; in our previous experiments this list size had produced relatively
good results [15], and the LDA models did not take much processing time to train with
lists of this size.
For topic modeling, we used the Mallet toolkit [27], which offers the Gibbs sampling
method [41] for training LDA models. Separate models were trained for each data set
through 100,000 iterations of Gibbs sampling, with the Dirichlet hyperparameters
optimized every 2,000 iterations. To observe the effects of different choices of the number
of topics, we trained models with 50, 100, 200, 300, and 400 topics. For each case a
distance table was built by computing the KL divergence between the gists of every two
lists.
[Figure 5.2: Purity of clustering BBN data with LM and TM metrics and different topic
numbers (50-400). Curves: TM metric with Affinity Propagation, TM metric with
Exchange Method, LM metric with Affinity Propagation, LM metric with Exchange
Method.]
We ran the Exchange Method and Affinity Propagation clustering algorithms on these
tables. To evaluate the quality of the clustering outcome, the cluster purity was
calculated for each case.
Figure 5.2 shows the purity of clustering the BBN set with the TM metric for different
numbers of topics. The results from both the Exchange Method and Affinity Propagation
are plotted alongside the purity of clustering with the LM metric. The graphs show that,
for some choices of the topic number, each clustering method produced a better result
with the TM metric than the same method with the LM metric; this gain was especially
large for the Affinity Propagation algorithm.
Similar graphs for the Transonics set are shown in Figure 5.3. For this data set, both
algorithms produced better results with the LM metric; however, for 100 topics, the
purities of the resulting clusters are close to those obtained with the LM metric.
[Figure 5.3: Purity of clustering Transonics data with LM and TM metrics and different
topic numbers (50-400). Curves: TM metric with Affinity Propagation, TM metric with
Exchange Method, LM metric with Affinity Propagation, LM metric with Exchange
Method.]
The better performance with the LM metric is due to the domain match between the
Transonics data and a portion of the SMT's training set, which was drawn from the
medical domain. This led to better word orderings in the translations and therefore
better LMs for the Transonics data.
These experiments show that the TM metric, when used for clustering, can produce
results comparable to the LM metric. For a comparison of the above two algorithms,
using the LM metric, with other clustering methods, see [15].
Table 5.2: Results of the experiments with classifiers trained on the clustered data

                        Affinity Propagation   Exchange Method
  Corpus  Metric        Acc.    Rel. Acc.      Acc.    Rel. Acc.
  BBN     LM            27.8%   --             31.5%   --
          TM (T=100)    26.7%   -4.0%          33.5%   6.0%
          TM (T=200)    36.5%   31.3%          --      --
  Trans.  LM            49.2%   --             46.0%   --
          TM (T=100)    43.1%   -12.4%         49.2%   7.1%
[Figure 5.4: Purity of clustering BBN data with the combined metric as a function of
the mixing factor (1.5-4), with the LM-only and TM-only purities shown for reference.]
We also used the clustered data to train a set of classifiers for each corpus. For that
purpose the LM-based classification method [15] was chosen. To test the classifiers, 707
and 1,000 utterances were randomly drawn from the rest of the Transonics and BBN
sets, respectively, disjoint from their training sets. Table 5.2 shows the concept
classification accuracy results; for the cases with the TM metric, the relative change
with respect to the accuracy from the LM metric is also shown. The TM metric produced
better results when used with the Exchange Method. For the BBN corpus, an
improvement was also achieved with Affinity Propagation when the distance table for
200 topics (the peak of the purity curve in Figure 5.2) was used.
5.4.3 Clustering with Combined Metric
In the second set of experiments, a combination of both metrics was used with the
Affinity Propagation algorithm, which is much faster than the Exchange Method. For
each data set, we combined the two distance tables from the previous experiment: one
with the LM metric and the other with the TM metric derived with the number of topics
set to 100. We scaled the entries of the table with the LM metric by a mixing factor and
added them to their counterparts in the table with the TM metric.
Initially, to make the metrics comparable, the mixing factor was chosen as the ratio
of the largest entries in the two tables. This ratio was very close for the two data sets:
1.695 for BBN and 1.699 for Transonics. The purities of the clusters produced using the
combined tables are shown in Figures 5.4 and 5.5 for the BBN and Transonics data sets,
respectively; the figures show the purity for different values of the mixing factor around
its initially selected value. For reference, we also included the results of using each metric
exclusively. It is clear that the combined metric led to a much better clustering outcome
than the sole use of either metric. For both data sets, a mixing factor of 3.2 delivered the
best results; we also ran the Exchange Method for that value of the mixing factor.
With the mixing factor of 3.2, we trained classifiers on the resulting clusters from both
methods. For the BBN set, the classifier trained on the clusters from Affinity
Propagation showed an accuracy of 39.9%, a relative improvement of 43.5% over the
result obtained with the LM metric (Table 5.2). For the Exchange Method, the accuracy
was 32.6%, i.e., a relative improvement of 3.5% over the LM-metric baseline.
For the Transonics set, with the clusters from Affinity Propagation and the combined
metric, the classifier showed an accuracy of 49.4%, only 0.3% better than that obtained
with the LM metric. However, using the Exchange Method led to a relative improvement
of 9.9% (again with respect to the LM-metric case, Table 5.2), as the classification
accuracy reached 50.5%.
[Figure 5.5: Purity of clustering Transonics data with the combined metric as a function
of the mixing factor (1.5-4), with the LM-only and TM-only purities shown for reference.]
5.5 Summary
This work is a continuation of efforts toward the development of an unsupervised
training method for concept-based classifiers. We have shown that, by using topic
modeling, the information extracted from the semantic associations of words can improve
the quality of the sentence clustering method that we had previously introduced.
We intend to continue this work by examining the use of other topic modeling
methods and more sophisticated distance measurement techniques. We are also
investigating more elaborate strategies for combining the metrics.
Chapter 6
Hierarchical Structure
6.1 Introduction
The set of concepts that a classifier handles defines its domain. In practice, the classifier
is effective only if its domain is large enough. As the domain grows, the number of classes
increases, which decreases the discrimination power of the single-layer classifier of (2.1)
due to the higher number of competing hypotheses. This leads to an increase in the
classifier's error rate; in other words, this drop in accuracy is the bottleneck in scaling
up the classifier domain.
An intuitive solution to this problem is to divide a large classication domain into
smaller sub-domains. In this paper, we introduce a hierarchical classication scheme that
exploits a categorical division of the domain. Such divisions naturally exist in almost every
domain and are usually the result of the data collection process. For instance in patient
care domain, exchange utterances can be grouped in categories such as greeting, medicine
dosage instructions, diet, pediatrics, cardiology, and so on. We created the hierarchy by
building a category detection layer in to the classier structure. This can be achieved
by using a maximum likelihood classifier implemented by n-gram language models at the
category level. Another method is to capture the semantic relations between words and
categories through topic modeling. We used Latent Dirichlet Allocation (LDA) [4] for
topic modeling [41].
Document classification has been used as an example in [4] to show the usefulness of
topic modeling. In [1], topic distributions were used as additional features in a document
classification task, using text mined from the web for topic modeling. Classification of web
pages was addressed in [39] by using topic distributions to compute a feature set. Based
on that feature set, a Hierarchical Support Vector Machine (HSVM) was trained as the
classifier. In contrast to these methods, the focus of our work has been the classification
of utterances rather than documents. Specifically, the purpose of this work is to explore
a feasible strategy for scaling up classifier-based spoken language processing for S2S
applications.
In the next section we introduce the hierarchical classification scheme based on topic
modeling. Section 6.3 covers the details of the experiments performed to test the proposed
method, along with a discussion of the outcomes. This is followed by a summary in
Section 6.4.
6.2 Hierarchical Classifier
6.2.1 A Two-layered Classifier
The idea of hierarchical classification is to perform the task in different stages. Assume
that the domain of a classifier in (2.1) with $N$ classes is split into $K$ disjoint sub-domains
with $N_1, N_2, \ldots, N_K$ classes each, such that $\sum_{i=1}^{K} N_i = N$ (Figure 6.1). We call the classifiers that operate on these sub-domains sub-classifiers. This division does not change the class LMs.
[Figure 6.1: Partition of the classifier domain. The full set of classes (here, C1 through C10) is partitioned into sub-domains (here, four).]
Assume that the original $N$-ary classifier chooses class $c_e$ for an input utterance $e$, i.e.,

$$c_e = \arg\max_{c \in C} \{ P_c(e) \} \qquad (6.1)$$

where $P_c(e)$ is the score of $e$ from the LM of class $c$. If one selects the sub-classifier with domain $C_l \subset C$ such that $c_e \in C_l$, then

$$\max_{c \in C_l} \{ P_c(e) \} = \max_{c \in C} \{ P_c(e) \} = P_{c_e}(e) \qquad (6.2)$$
This means that a sub-classifier will make the same decision as the original classifier
as long as its domain contains that decision. This implies that a sub-classifier will not
introduce additional errors if 1) its domain contains the correct class and 2) the original
classifier makes a correct decision. Therefore, the overall error rate of the $K$ sub-classifiers
will be less than or equal to the error rate of the flat $N$-ary classifier, if the right
sub-classifier is selected for each input, i.e., if the top-level classifier is perfectly accurate.
This is supported by the empirical results, as we will see in Section 6.3.
Category: Bio
Class 1:
What is his full name?
What is his name?
What is his complete name?
...........
Class 6:
What is his address?
Where does he live?
What street does he live on?
...........
Category: Electricity
Class 203:
How is your electricity?
How is your electricity these days?
How is your electricity working?
...........
Class 204:
Where do you get your electricity from?
What is the source of your electricity?
Where does your electricity come from?
...........
Figure 6.2: Sample of categorized training data for concept classification
By splitting the original domain into sub-domains we build a two-layered classifier. The
first layer selects the proper sub-domain, while the second layer consists of a group
of sub-classifiers. For each input, only one of these sub-classifiers is active, based on
the first layer's selection. The hierarchical classification process remains accurate as
long as each layer of the process remains accurate. Since the sub-classifiers operate on
a relatively small number of classes, the overall domain can be expanded by adding more
sub-domains without compromising the accuracy.
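As an illustration (our sketch, not the thesis implementation), the two-layered decision can be written compactly; class_lms, sub_domains, and detect_category are hypothetical stand-ins for the class LMs, the categorical partition, and the first-layer detector of Section 6.2.3.

```python
from typing import Callable, Dict, List

# A class LM is anything that scores an utterance, i.e., returns P_c(e)
# (or a log-probability) for class c.
ClassLM = Callable[[str], float]

def classify_two_layer(
    utterance: str,
    class_lms: Dict[int, ClassLM],          # class id -> scoring function
    sub_domains: Dict[str, List[int]],      # category -> class ids in it
    detect_category: Callable[[str], str],  # first layer (Section 6.2.3)
) -> int:
    """Two-layered decision: pick a sub-domain, then argmax within it.

    The sub-classifier reuses the same class LMs as the flat classifier,
    restricted to one sub-domain, so by (6.2) it agrees with the flat
    decision of (6.1) whenever the detected sub-domain contains it.
    """
    category = detect_category(utterance)  # layer 1: sub-domain selection
    candidates = sub_domains[category]     # restrict the hypothesis space
    return max(candidates, key=lambda c: class_lms[c](utterance))  # layer 2
```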
6.2.2 Categorical Partition
The rules by which we split the domain need to be simple and lead to a feasible and
accurate selection method in the first layer. Such a split is often natural in S2S applications,
as the data often belong in categorical groups. For example, data from the patient care
domain [14] have categories like emergency, cardiology, neurology, etc. The data that we
used in this work was collected for DARPA's Transtac project and consists of questions
about people (biographic information, work, education), places (houses, land), civil affairs
(water, sewage, electricity), and so on (Figure 6.2). Therefore, domains are naturally split
across category borders. In addition, since the category rarely changes
during a discourse, an accurate mechanism (the first layer) can be developed to
capture and track the categorical association of the exchanged utterances.
6.2.3 Category Detection
As mentioned above, for the lower layer of classification (the concept level), maximum
likelihood classifiers can be employed as in (2.1). For the top layer we explore two alternative
methods, as follows.
6.2.3.1 Maximum Likelihood Category Classifier
The classifier of (2.1) can be trained to operate at the category level by using all the
data from the different classes in each category to build the LM for that category. Here,
conceptual classes are replaced with categorical ones.
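As a toy illustration of this pooling, here is a sketch with add-one-smoothed unigram LMs; a real system would use higher-order n-gram LMs built with a toolkit such as SRILM [42], so the function names and the smoothing are our own simplifications.

```python
import math
from collections import Counter
from typing import Dict, List

def train_category_lms(category_data: Dict[str, List[str]], alpha: float = 1.0):
    """Pool all sentences of each category into one add-alpha unigram LM."""
    vocab = {w for sents in category_data.values()
               for s in sents for w in s.split()}
    lms = {}
    for cat, sents in category_data.items():
        counts = Counter(w for s in sents for w in s.split())
        total = sum(counts.values()) + alpha * len(vocab)
        lms[cat] = {w: (counts[w] + alpha) / total for w in vocab}
    return lms

def detect_category_ml(utterance: str, lms) -> str:
    """Maximum likelihood category decision over the pooled category LMs."""
    def log_score(lm):
        # Out-of-vocabulary words are simply skipped in this sketch.
        return sum(math.log(lm[w]) for w in utterance.split() if w in lm)
    return max(lms, key=lambda cat: log_score(lms[cat]))
```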
6.2.3.2 Category Detection by Topic Modeling
The semantic association of words within the categories of training data can be captured
by topic modeling. If we consider the data in each sub-domain as a document, LDA can
be applied to learn the topics and represent each sub-domain (document) by a probability
distribution over these topics [4].
With LDA, each word of a document is assumed to be associated with a topic. For each
topic, words are drawn from a multinomial distribution that is specific to that topic.
For each document, the probability distribution over topics is also multinomial
and specific to that document. The parameters of the latter distribution are considered to
be samples of a Dirichlet random vector. Variational methods [4] or Markov chain Monte
Carlo (MCMC) [41] can be applied to extract the topics and the associated distributions.
Here we used the latter method both for training the models and for inferring the topic
distributions throughout a discourse.
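For concreteness, here is a brief sketch of the training and inference flow using the gensim toolkit's variational LDA as a stand-in (the experiments in this chapter used Gibbs sampling via Mallet [27]; the function names and the one-document-per-sub-domain packaging follow the description above).

```python
from gensim import corpora, models

def train_topic_models(sub_domain_texts, num_topics=70):
    """Treat each sub-domain's pooled text as one document and fit LDA.

    sub_domain_texts: list of token lists, one per sub-domain.
    Returns the model, the dictionary, and the per-sub-domain topic
    distributions p_i.
    """
    dictionary = corpora.Dictionary(sub_domain_texts)
    bows = [dictionary.doc2bow(doc) for doc in sub_domain_texts]
    lda = models.LdaModel(bows, id2word=dictionary, num_topics=num_topics)
    p = [lda.get_document_topics(b, minimum_probability=0.0) for b in bows]
    return lda, dictionary, p

def infer_discourse_distribution(lda, dictionary, utterances):
    """Infer the topic distribution p_d of a buffered discourse."""
    bow = dictionary.doc2bow(" ".join(utterances).split())
    return lda.get_document_topics(bow, minimum_probability=0.0)
```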
Assume a dialog is in progress within a specific category, e.g., dermatology. Using a
sufficient number of utterances, we can infer the topic distribution $p_d$ for that collection
using the above method. By comparing this distribution to the distribution of each
sub-domain, obtained from training, we can assess the category of the discourse. The
distributions can be compared using the Kullback–Leibler divergence. If $p_i$ represents the
topic distribution of sub-domain $i$, the category of the discourse can be estimated as the
category of sub-domain $\hat{g}$, where

$$\hat{g} \triangleq \arg\min_{i=1 \ldots K} \{ D_{KL}(p_i \,\|\, p_d) \} \qquad (6.3)$$

and $K$ is the number of categories.
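A minimal numpy sketch of this selection, assuming the topic distributions are already available as arrays; the clipping constant is our own guard against zero probabilities and is not part of the original formulation.

```python
import numpy as np

def select_category(p_subdomains: np.ndarray, p_d: np.ndarray,
                    eps: float = 1e-12) -> int:
    """Pick the sub-domain whose topic distribution is closest to the
    discourse distribution p_d in KL divergence, per (6.3).

    p_subdomains: (K x T) array whose row i is p_i over T topics.
    p_d: (T,) topic distribution inferred from the buffered discourse.
    """
    p = np.clip(p_subdomains, eps, None)
    q = np.clip(p_d, eps, None)
    kl = np.sum(p * np.log(p / q), axis=1)  # D_KL(p_i || p_d) for each i
    return int(np.argmin(kl))
```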
With topic modeling in the first layer, we capture the global semantic association of
the words in a sub-domain or discourse. We use this information to find the category of
the discourse and select the correct classifier, exclusively trained for that category. The
classifiers in the second layer, however, operate based on the local semantic and syntactic
dependencies via the class LMs.

[Figure 6.3: Two-layered classification scheme. A buffer of incoming utterances feeds the category detector, which activates one of the K sub-domain classifiers.]
6.2.4 Buffering
Without highly accurate category detection, the two-layered structure will not be
beneficial. The category of a dialog often remains constant over multiple rounds of utterance
exchange. Therefore, it is reasonable to base detection on a few buffered utterances
rather than relying on a single one (Figure 6.3).
For the first couple of conversational exchanges, the category cannot be detected
reliably; it is therefore more appropriate to use the outcome of a single-layered classifier.
It is customary in S2S applications to provide the user with a short list of hypothesized
translations. In that case, for the first few sentences, the results from both the two-layered
and single-layered classifiers could be presented to the user. As the discourse progresses
and more information is gathered, the detection becomes more accurate.
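A minimal sketch of such a buffer (our own illustration; detect_fn stands for either first-layer method, and the buffer and fallback sizes are arbitrary choices).

```python
from collections import deque

class BufferedCategoryDetector:
    """Keep a short buffer of recent utterances and detect the dialog
    category from the pooled buffer contents."""

    def __init__(self, detect_fn, size: int = 4, min_fill: int = 2):
        self.detect_fn = detect_fn        # LM-based or topic-model-based
        self.buffer = deque(maxlen=size)  # sliding window of utterances
        self.min_fill = min_fill          # below this, trust the flat classifier

    def observe(self, utterance: str):
        self.buffer.append(utterance)

    def category(self):
        """Return the detected category, or None while the buffer is too
        short to be reliable (the caller then falls back to the
        single-layered classifier)."""
        if len(self.buffer) < self.min_fill:
            return None
        return self.detect_fn(list(self.buffer))
```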
6.3 Experiments and Results
6.3.1 Data
We used the data collected for phase one of DARPA's Transtac project. The
data is in the form of 64,342 questions grouped as paraphrases into 621 classes and 38
categories, such as biographic information, education, family, finance, etc. Most of the
categories include fewer than 2,000 questions. We removed 4 categories with more than
4,000 questions each to avoid biasing the models toward them. The data from the remaining
34 categories, consisting of 11,409 questions in 441 classes, were used here. Figure 6.2
shows a sample of that data set. From the English side of the Transtac parallel corpus,
654K sentences (5.5M words) were used to build the background LM for the classifiers.
6.3.2 Oracle Test
To confirm that a two-layered classification was potentially more accurate than a
conventional one, we ran an oracle test in which the correct category was always chosen.
For that purpose, we randomly selected a test set of 2,000 questions and used the rest for
training. One question per class was kept out of the selection process to guarantee that
every class had at least one sentence for training.
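A sketch of this sampling procedure, assuming the data is available as a mapping from class to questions (the function name and the fixed seed are ours).

```python
import random

def split_with_class_floor(classes, test_size=2000, seed=0):
    """Random train/test split that first reserves one question per class
    for training, as described above.

    classes: dict mapping class id -> list of questions.
    Returns (train, test).
    """
    rng = random.Random(seed)
    train, pool = [], []
    for questions in classes.values():
        qs = list(questions)
        rng.shuffle(qs)
        train.append(qs[0])   # guaranteed training sentence for this class
        pool.extend(qs[1:])   # the rest enter the random selection pool
    rng.shuffle(pool)
    return train + pool[test_size:], pool[:test_size]
```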
The distributions of classes and sentences over categories are shown in Figures 6.4 (a)
and (b), respectively, for the training data. For the categories that were eliminated from
the original data, the charts show zero classes and sentences. It is clear that
more classes in a category are not necessarily accompanied by a higher number of training
sentences for that category. As Figure 6.4 (c) shows, the distribution of sentences in
the test set closely resembles that of the training set, which was the purpose of the random
selection.
[Figure 6.4: (a) Number of classes per category for the training data; (b) overall number of sentences per category for the training data; (c) overall number of sentences per category for the testing set; (d) number of errors per category for the testing set.]
The error rate of the conventional single-layered classifier was 7.6%, while that of the
sub-domain classifiers with oracle category selection was lower by 25% relative (1.9%
absolute). This significant error rate reduction shows the benefit of splitting the domain,
provided an accurate category detection mechanism exists. The number of errors for each
sub-classifier is shown in Figure 6.4 (d).
[Figure 6.5: Category detection using topic modeling with different parameters; number of correct answers (out of 34) vs. number of topics, with/without stop words and with/without Dirichlet parameter optimization.]
6.3.3 Topic Modeling of Categories
We used the same data sets as in the previous experiment for topic modeling of the categories,
using the Mallet toolkit [27]. Mallet adopts the Gibbs sampling method [41] for
LDA model training and inference. We used the training data to build the topic models,
which were then used to find the topic distribution of the test set. The detection was
performed as described by (6.3).
We initially set the number of topics to 100, a common choice in the literature
[41], removed the stop words from the data, and turned off the procedure that optimizes the
Dirichlet parameters. We ran 100,000 iterations of Gibbs sampling for training and 10,000
for inference. Using the whole test set, the detection process identified all 34 categories
correctly.
To investigate the effect of the number of topics, we ran a set of experiments with
different values for that parameter. In this case we used at most two sentences per
category from the test set, which was equivalent to using a buffer of size two in Figure 6.3.
That set of experiments was also repeated with parameter optimization and/or with
the stop words left in the data. For each experiment, the number of iterations was the same
as in the experiment above. Where Dirichlet parameter optimization was involved, it was
carried out every 500 iterations after an initial burn-in period of 1,000 iterations.
The results are shown in Figure 6.5. It is clear that keeping the stop words in the data
was quite advantageous. Also, optimization of the Dirichlet parameters significantly
improved the performance. The best results were achieved with 70 topics, optimized
parameters, and the stop words left in.
6.3.4 Effect of Buffer Size on Category Detection
We examined the performance of both category detection methods for different buffer
sizes. For testing, we selected five random sentences from each category and used the
rest for training. To avoid having classes with no training data, one sentence per class
was held out of the selection pool. The training data were used to build LMs for the
categorical classifier as well as topic models for each category. For topic modeling, we set
the number of topics to 70 and ran the Gibbs sampling method with optimization for
100,000 iterations, without removing the stop words. For buffer sizes of one and two, we
repeated the test five times and twice, respectively, with different test utterances, and
averaged the results.
[Figure 6.6: Effect of buffer size on the LM-based classification and topic modeling; correct detections [%] vs. buffer size (1 to 5) for the two methods.]
Figure 6.6 shows the rate of correct decisions by the two methods for different buffer
sizes. Both methods detected all the categories correctly for buffer sizes of four and five.
This indicates that the hierarchical classification method could reach the oracle results
with a very small buffer size. With no buffering involved, the LM-based method had
better accuracy than detection with topic models (at the 0.05 significance level).
6.4 Summary
Scaling up the concept classifier is possible through a hierarchical structure without
significantly degrading accuracy. Such structures can be built based on categorical
partitions of the domain. We showed the benefit of such partitioning and proposed two
promising category detection methods for a two-layered classification scheme. These
two methods, language-model-based classification and topic modeling, use the discourse
information to select the correct concept classifier that operates on a sub-domain.
The results of the above two methods reveal cases of mismatch in their errors; therefore,
combining their results to benefit from both is a part of our ongoing work. We are
planning to deploy the topical n-gram modeling method [46] to incorporate more ordering
information in the process. We are also investigating the use of more sophisticated metrics,
such as the Hellinger distance [47], for a more precise comparison of topic distributions.
References
[1] S. Banerjee, "Improving text classification accuracy using topic modeling over an additional corpus," in Proc. of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore, July 2008, pp. 867–868.
[2] R. Belvin, E. Ettelaie, S. Gandhe, P. Georgiou, K. Knight, M. Marcu, S. Millward, S. Narayanan, H. Neely, and D. Traum, "Transonics: A practical speech-to-speech translator for English-Farsi medical dialogs," in Proc. of the Association for Computational Linguistics, Interactive Poster and Demonstration Sessions, Ann Arbor, MI, USA, June 2005, pp. 89–92.
[3] R. Belvin, W. May, S. Narayanan, P. Georgiou, and S. Ganjavi, "Creation of a doctor-patient dialogue corpus using standardized patients," in Proc. of the Fourth International Conference on Language Resources and Evaluation (LREC '04), Lisbon, Portugal, May 2004, pp. 187–190.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, March 2003.
[5] C. Chelba, M. Mahajan, and A. Acero, "Speech utterance classification," in Proc. of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 1, Hong Kong, April 2003, pp. I-280–I-283.
[6] T. Chklovski, "Collecting paraphrase corpora from volunteer contributors," in Proc. of the Third International Conference on Knowledge Capture (K-CAP), Banff, Canada, October 2005, pp. 115–120.
[7] F. Choi, S. Tsakalidis, S. Saleem, C.-L. Kao, R. Meermeier, K. Krstovski, C. Moran, K. Subramanian, R. Prasad, and P. Natarajan, "Recent improvements in BBN's English/Iraqi speech-to-speech translation system," in Proc. of the Second IEEE Workshop on Spoken Language Technology (SLT), Goa, India, December 2008, pp. 245–248.
[8] I. S. Dhillon, J. Fan, and Y. Guan, "Efficient clustering of very large document collections," in Data Mining for Scientific and Engineering Applications, R. Grossman, C. Kamath, V. Kumar, and R. Namburu, Eds. Kluwer Academic Publishers, 2001, pp. 357–381.
[9] I. S. Dhillon, Y. Guan, and J. Kogan, "Iterative clustering of high dimensional text data augmented by local search," in Proc. of the IEEE International Conference on Data Mining (ICDM), Maebashi City, Japan, 2002, pp. 131–138.
[10] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1, pp. 143–175, January 2001.
[11] F. Ehsani, J. Kinzey, D. Master, K. Sudre, D. Domingo, and H. Park, "S-MINDS 2-way speech-to-speech translation system," in Proc. of the Medical Speech Translation Workshop, Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), New York, NY, USA, June 2006, pp. 44–45.
[12] E. Ettelaie, P. G. Georgiou, and S. S. Narayanan, "Unsupervised data processing for classifier-based speech translator," submitted to ISCA Journal of Computer Speech and Language.
[13] E. Ettelaie, P. G. Georgiou, and S. S. Narayanan, "Cross-lingual dialog model for speech to speech translation," in Proc. of the Ninth International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006, pp. 1173–1176.
[14] E. Ettelaie, P. G. Georgiou, and S. S. Narayanan, "Mitigation of data sparsity in classifier-based translation," in Proc. of the Coling 2008 Workshop on Speech Processing for Safety Critical Translation and Pervasive Applications, Manchester, UK, August 2008, pp. 1–4.
[15] E. Ettelaie, P. G. Georgiou, and S. S. Narayanan, "Towards unsupervised training of the classifier-based speech translator," in Proc. of the International Conference on Spoken Language Processing (ICSLP), Brisbane, Australia, September 2008, pp. 2739–2742.
[16] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, pp. 972–976, February 2007.
[17] Y. Gao, L. Gu, B. Zhou, R. Sarikaya, M. Afify, H. Kuo, W. Zhu, Y. Deng, C. Prosser, W. Zhang, and L. Besacier, "IBM MASTOR system: Multilingual automatic speech-to-speech translator," in Proc. of the Medical Speech Translation Workshop, Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), New York, NY, USA, June 2006, pp. 53–56.
[18] T. L. Griffiths, J. B. Tenenbaum, and M. Steyvers, "Topics in semantic representation," Psychological Review, vol. 114, pp. 211–244, 2007.
[19] V. Hatzivassiloglou, J. Klavans, M. Holcombe, R. Barzilay, M. Kan, and K. McKeown, "Simfinder: A flexible clustering tool for summarization," in Proc. of the NAACL Workshop on Automatic Summarization, Pittsburgh, PA, USA, 2001, pp. 41–49.
[20] R. Hsiao, A. Venugopal, T. Kohler, Y. Zhang, P. Charoenpornsawat, A. Zollmann, S. Vogel, A. W. Black, T. Schultz, and A. Waibel, "Optimizing components for handheld two-way speech translation for an English-Iraqi Arabic system," in Proc. of the Ninth International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006, pp. 765–768.
[21] F. Jelinek, Statistical Methods for Speech Recognition (Language, Speech, and Communication). Cambridge, MA, USA: The MIT Press, 1998.
[22] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in Proc. of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proc. of the Demo and Poster Sessions, Prague, Czech Republic, June 2007, pp. 177–180.
[23] P. Koehn, F. Och, and D. Marcu, "Statistical phrase-based translation," in Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), vol. 1, Edmonton, AB, Canada, May–June 2003, pp. 48–54.
[24] F. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, vol. 47, pp. 498–519, February 2001.
[25] A. Leuski, J. Pair, D. Traum, P. J. McNerney, P. Georgiou, and R. Patel, "How to talk to a hologram," in Proc. of the Eleventh International Conference on Intelligent User Interfaces (IUI), Sydney, Australia, January–February 2006, pp. 360–362.
[26] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, January 1991.
[27] A. K. McCallum, "MALLET: A machine learning for language toolkit," 2002, http://mallet.cs.umass.edu.
[28] K. McKeown and V. Hatzivassiloglou, "Augmenting lexicons automatically: Clustering semantically related adjectives," in Proc. of the ARPA Workshop on Human Language Technology (HLT '93), Princeton, NJ, USA, March 1993, pp. 272–277.
[29] S. Narayanan, S. Ananthakrishnan, R. Belvin, E. Ettelaie, S. Gandhe, S. Ganjavi, P. G. Georgiou, C. M. Hein, S. Kadambe, K. Knight, D. Marcu, H. E. Neely, N. Srinivasamurthy, D. Traum, and D. Wang, "The Transonics spoken dialogue translator: An aid for English-Persian doctor-patient interviews," in Proc. of the American Association for Artificial Intelligence Fall Symposium on Dialog Systems for Health Communication (AAAI), Arlington, VA, USA, October 2004.
[30] S. Narayanan, S. Ananthakrishnan, R. Belvin, E. Ettelaie, S. Ganjavi, P. Georgiou, C. Hein, S. Kadambe, K. Knight, D. Marcu, H. Neely, N. Srinivasamurthy, D. Traum, and D. Wang, "Transonics: A speech to speech system for English-Persian interactions," in Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), St. Thomas, U.S. Virgin Islands, November–December 2003, pp. 670–675.
[31] S. S. Narayanan, P. G. Georgiou, A. Sethy, D. Wang, M. Bulut, S. Sundaram, E. Ettelaie, S. Ananthakrishnan, H. Franco, K. Precoda, D. Vergyri, J. Zheng, W. Wang, R. R. Gadde, M. Graciarena, V. Abrash, M. Frandsen, and C. Richey, "Speech recognition engineering issues in speech to speech translation system design for low resource languages and domains," in Proc. of the Thirty-First IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, Toulouse, France, May 2006, pp. 1209–1212.
[32] H. Ney, S. Nießen, F. J. Och, C. Tillmann, H. Sawaf, and S. Vogel, "Algorithms for statistical translation of spoken language," IEEE Transactions on Speech and Audio Processing, Special Issue on Language Modeling and Dialogue Systems, vol. 8, no. 1, pp. 24–36, January 2000.
[33] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "BLEU: A method for automatic evaluation of machine translation," Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[34] A. Potamianos, S. Narayanan, and G. Riccardi, "Adaptive categorical understanding for spoken dialogue systems," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 321–329, May 2005.
[35] A. Sethy, S. Narayanan, and B. Ramabhadran, "Measuring convergence in language model estimation using relative entropy," in Proc. of the Eighth International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, October 2004, pp. 1057–1060.
[36] A. Sethy, P. G. Georgiou, and S. Narayanan, "Building topic specific language models from web data using competitive models," in Proc. of the Ninth European Conference on Speech Communication and Technology (Interspeech - Eurospeech), Lisbon, Portugal, October 2005, pp. 1293–1296.
[37] H. Späth, Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1985.
[38] N. Srinivasamurthy and S. Narayanan, "Language-adaptive Persian speech recognition," in Proc. of the Eighth European Conference on Speech Communication and Technology (Interspeech - Eurospeech), Geneva, Switzerland, September 2003, pp. 3137–3140.
[39] W. Sriurai, P. Meesad, and C. Haruechaiyasak, "Hierarchical web page classification based on a topic model and neighboring pages integration," International Journal of Computer Science and Information Security (IJCSIS), vol. 7, no. 2, pp. 166–173, February 2010.
[40] D. Stallard, F. Choi, K. Krstovski, P. Natarajan, R. Prasad, and S. Saleem, "A hybrid phrase-based/statistical speech translation system," in Proc. of the Ninth International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006, pp. 757–760.
[41] M. Steyvers and T. Griffiths, "Probabilistic topic models," in Handbook of Latent Semantic Analysis, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds. Mahwah, NJ, USA: Lawrence Erlbaum Associates, Inc., 2007.
[42] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proc. of the International Conference on Spoken Language Processing (ICSLP), Denver, CO, USA, September 2002, pp. 901–904.
[43] D. Traum, A. Roque, A. Leuski, P. Georgiou, J. Gerten, B. Martinovski, S. Narayanan, S. Robinson, and A. Vaswani, "Hassan: A virtual human for tactical questioning," in Proc. of the Eighth SIGdial Workshop on Discourse and Dialogue, Antwerp, Belgium, September 2007, pp. 75–78.
[44] K. Wagstaff and C. Cardie, "Clustering with instance-level constraints," in Proc. of the Seventeenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., June–July 2000, pp. 1103–1110.
[45] A. Waibel, A. Badran, A. W. Black, R. Frederking, A. Lavie, L. Levin, K. Lenzo, L. M. Tomokiyo, J. Reichert, T. Schultz, M. Woszczyna, and J. Zhang, "Speechalator: Two-way speech-to-speech translation on a consumer PDA," in Proc. of the Eighth European Conference on Speech Communication and Technology (Interspeech - Eurospeech), Geneva, Switzerland, September 2003, pp. 369–372.
[46] X. Wang, A. McCallum, and X. Wei, "Topical N-grams: Phrase and topic discovery, with an application to information retrieval," in Proc. of the Seventh IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, October 2007, pp. 697–702.
[47] G. L. Yang and L. M. Le Cam, Asymptotics in Statistics: Some Basic Concepts. Springer, 2000.
[48] H. Ye and S. Young, "A clustering approach to semantic decoding," in Proc. of the Ninth International Conference on Spoken Language Processing (ICSLP), Pittsburgh, PA, USA, September 2006, pp. 5–8.
[49] B. Zhou, D. Dechelotte, and Y. Gao, "Two-way speech-to-speech translation on handheld devices," in Proc. of the Eighth International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, October 2004, pp. 1637–1640.