EXPLOITING COMPARABLE CORPORA
by
Dragos Stefan Munteanu
A Dissertation Proposal Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2006
Copyright 2006 Dragos Stefan Munteanu
Dedication
To my father, who is my foundation; and my wife Anamaria, who is my guiding light.
Acknowledgements
My thesis work would not have been possible without the advice and support of many
colleagues, friends and family.
I am most grateful to my adviser, Daniel Marcu, for the guidance he provided
throughout my graduate career. He taught me a lot of fundamental lessons about research,
either through his excellent advice, or through personal example. I often walked into his
office feeling discouraged, stuck, and confused; and always left with my head full of
new ideas and research directions, and my heart full of confidence and excitement.
I am also thankful to the members of my thesis committee - Eduard Hovy, Kevin
Knight, Paul Rosenbloom and Shrikanth Narayanan - for their useful feedback, and for
helping me improve the coverage of my work.
I was fortunate to work in a great environment, and I owe a lot to my colleagues
from the Natural Language Group at ISI. Many thanks to Radu Soricut, my long-time
room- and office-mate; we were together so often that many people started confusing
us. His enthusiasm and sound advice were invaluable. I also benefited from many
conversations (some of them during happy hour on the ocean front in Venice Beach)
with other colleagues: Tim Chklovski, Hal Daumé, Abdessamad Echihabi, Alexander
Fraser, Philipp Koehn, Ion Muslea, Franz Och, Patrick Pantel, Deepak Ravichandran,
Ignacio Thayer, Liang Zhou.
I thoroughly enjoyed my summer internship at Language Weaver, where I worked
with great people and was exposed to many interesting problems that I would not have
encountered in a research environment. The work I did there helped me better under-
stand what path to choose for the final stage of my research.
Last but certainly not least, I want to thank my family for their support, both during
the thesis and before it. My parents encouraged me intellectually, provided me with a
great education, and supported me every step of the way. My love goes out to my wife
Anamaria, who fills my life with joy and harmony.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables viii
List of Figures ix
Abstract xii
1 Introduction 1
1.1 Parallel Data: A Scarce Resource . . . . . . . . . . . . . . . . . . . . . 1
1.2 Comparable Data: A Much Richer Resource . . . . . . . . . . . . . . . 3
1.3 What is Comparable? . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Exploiting Comparable Corpora . . . . . . . . . . . . . . . . . . . . . 6
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Previous Work 11
2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Finding Word Translations in Comparable Corpora . . . . . . . . . . . 12
2.2.1 Choosing Between Alternative Translations . . . . . . . . . . . 12
2.2.2 Learning New Translations . . . . . . . . . . . . . . . . . . . . 13
2.3 Sentence Alignment Algorithms . . . . . . . . . . . . . . . . . . . . . 16
2.4 Finding Parallel Sentences in Comparable Corpora . . . . . . . . . . . 19
2.4.1 Finding Parallel Documents in Comparable Corpora . . . . . . 22
2.4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 A Framework for Empirical Evaluation of Parallel Data Detection Algorithms 26
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Experimental Resources . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Parallel data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Comparable data . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.3 Statistical Machine Translation Systems . . . . . . . . . . . . . 30
3.4 The Data Extraction Framework . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Selecting similar document pairs . . . . . . . . . . . . . . . . . 33
3.4.3 Selecting candidate sentence pairs . . . . . . . . . . . . . . . . 34
3.4.4 The Parallel Data Detection Models . . . . . . . . . . . . . . . 35
3.4.5 Lexicons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.6 Analysis of Computational Complexity . . . . . . . . . . . . . 41
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 Finding Parallel Sentences in Comparable Corpora 46
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 A Maximum Entropy Classifier for Parallel Sentence Identification . . . 48
4.3.1 The Maximum Entropy Classification Framework . . . . . . . . 48
4.3.2 Features for Parallel Sentence Identification . . . . . . . . . . . 49
4.3.3 Word Alignment Model . . . . . . . . . . . . . . . . . . . . . 52
4.3.4 Training and Testing . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.5 Algorithmic complexity . . . . . . . . . . . . . . . . . . . . . 56
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.2 Classifier Evaluation . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.3 Machine Translation Improvements . . . . . . . . . . . . . . . 62
4.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Finding Parallel Documents in Comparable Corpora 69
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3 Finding Parallel Documents . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.2 Evaluating Against a Gold Standard . . . . . . . . . . . . . . . 73
5.4.3 Machine Translation Improvements . . . . . . . . . . . . . . . 76
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6 Finding Parallel Sub-sentential Fragments in Comparable Corpora 81
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Finding Parallel Sub-Sentential Fragments . . . . . . . . . . . . . . . . 84
6.3.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.2 Algorithmic Complexity . . . . . . . . . . . . . . . . . . . . . 87
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.4.2 Experiments on the BBC Corpus . . . . . . . . . . . . . . . . . 89
6.4.3 Experiments on the Gigaword Corpora . . . . . . . . . . . . . 91
6.5 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7 Bootstrapping 97
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.2 Bootstrapping Experiments . . . . . . . . . . . . . . . . . . . . . . . . 97
7.3 Translation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
8 Conclusions 104
Reference List 107
Appendix A Detailed Experiment Description 113
List of Tables
3.1 Size of the Gigaword comparable corpora. . . . . . . . . . . . . . . . . 30
3.2 Example of translation probabilities from the ML-Lex lexicon. Words
marked with ’*’ are incorrect translations. . . . . . . . . . . . . . . . . 38
3.3 Example of translation probabilities from the LLR-Lex lexicon. Words
marked with ’*’ are incorrect translations. . . . . . . . . . . . . . . . . 42
4.1 Amounts of data extracted by the parallel sentence detection algorithm
from the Gigaword corpora, measured in number of English tokens. . . 62
5.1 Amounts of data extracted by the parallel document detection algorithm
from the Gigaword corpora. For each language pair, the table presents
the number of parallel document pairs found, the amount of parallel data
extracted from them (measured in number of English tokens), and the
extra amount of parallel data gained by doing parallel document identi-
fication versus simply looking for sentence pairs in isolation (Chapter 4). 77
6.1 Amounts of parallel data extracted by the parallel fragment detection
algorithm from the BBC corpus, measured in number of English tokens. 89
6.2 BLEU scores obtained using various datasets extracted by the parallel
fragment detection algorithm from the BBC corpus. . . . . . . . . . . . 91
6.3 Amounts of data extracted by the parallel fragment detection algorithm
from the Gigaword corpora, measured in number of English tokens. . . 92
List of Figures
1.1 Currently available amounts of parallel data. . . . . . . . . . . . . . . . 2
1.2 Two comparable texts. The connected phrases are mutual translations. . 4
3.1 A framework for empirical evaluation of parallel data detection algorithms. 28
3.2 A framework for parallel data extraction from comparable corpora. . . . 32
4.1 Example of parallel sentences in non-parallel documents. . . . . . . . . 47
4.2 Word-level alignments between two parallel sentences. . . . . . . . . . 50
4.3 Word-level alignments between two non-parallel sentences. . . . . . . . 50
4.4 Performance of the Arabic-English parallel sentence classifiers. . . . . . 60
4.5 Performance of the Chinese-English parallel sentence classifiers. . . . . 60
4.6 Arabic-English MT performance improvements using parallel sentences
automatically extracted from the Gigaword corpus. . . . . . . . . . . . 64
4.7 Chinese-English MT performance improvements using parallel sentences
automatically extracted from the Gigaword corpus. . . . . . . . . . . . 64
4.8 Examples of automatically extracted Arabic-English parallel sentence
pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.9 Examples of automatically extracted Chinese-English parallel sentence
pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Example of sentence-level alignments between parallel and non-parallel
documents. The thin lines are true links (produced by a human), while
the thick ones are produced automatically. . . . . . . . . . . . . . . . . 71
5.2 Performance of the parallel document detection method against a gold
standard. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Arabic-English MT performance improvements using parallel documents
automatically extracted from the Gigaword corpus. . . . . . . . . . . . 78
5.4 Chinese-English MT performance improvements using parallel docu-
ments automatically extracted from the Gigaword corpus. . . . . . . . . 78
5.5 Arabic-English MT performance improvements using parallel documents
and sentences automatically extracted from the Gigaword corpus. . . . . 80
5.6 Chinese-English MT performance improvements using parallel docu-
ments and sentences automatically extracted from the Gigaword corpus. 80
6.1 Example of parallel fragments in non-parallel documents. . . . . . . . . 82
6.2 Example of parallel fragments in non-parallel sentences. . . . . . . . . 83
6.3 Example of parallel fragments in non-parallel sentences: information
from the lexicon. The underlined words are translated, according to the
lexicon. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.4 A signal-filtering approach for detecting parallel fragments. . . . . . . . 85
6.5 MT performance improvements using parallel fragments and sentences
automatically extracted from the BBC corpus. . . . . . . . . . . . . . . 90
6.6 Arabic-English MT performance improvements using parallel fragments
automatically extracted from the Gigaword corpus. . . . . . . . . . . . 93
6.7 Chinese-English MT performance improvements using parallel frag-
ments automatically extracted from the Gigaword corpus. . . . . . . . . 93
6.8 Examples of automatically extracted parallel sub-sentential fragments. . 95
7.1 Arabic-English MT performance results obtained using bootstrapping.
The baseline (B) system is trained on 10k English tokens. The results of
the comparative systems are grouped according to the type of extracted
data they use: sentences, documents, fragments, sentences plus docu-
ments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.2 Chinese-English MT performance results obtained using bootstrapping.
The baseline (B) system is trained on 10k English tokens. The results of
the comparative systems are grouped according to the type of extracted
data they use: sentences, documents, fragments, sentences plus docu-
ments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.3 Examples of translations produced by SMT systems obtained from suc-
cesive bootstrapping iterations. . . . . . . . . . . . . . . . . . . . . . . 103
A.1 Detailed description of a data extraction experiment. For each mod-
ule of the data extraction system, the figure shows its running time, the
amounts of data it needs to process, and the values of the parameters
that control its execution. . . . . . . . . . . . . . . . . . . . . . . . . . 114
Abstract
One of the major bottlenecks in the development of Statistical Machine Translation
systems for most language pairs is the lack of bilingual parallel training data. Currently
available parallel corpora span relatively few language pairs and very few domains;
building new ones of sufficiently large size and high quality is time-consuming and
expensive.
In this thesis, I propose methods that enable automatic creation of parallel corpora by
exploiting a rich, diverse, and readily available resource: comparable corpora. Compa-
rable corpora are bilingual texts that, while not parallel in the strict sense, are somewhat
related and convey overlapping information. Such texts exist in large quantities on the
Web; a good example are the multilingual news feeds produced by news agencies such
as Agence France Presse, CNN, and BBC.
I present novel methods for extracting parallel data of good quality from such com-
parable collections. I show how to detect parallelism at various granularity levels, and
thus find parallel documents (if there are any in the collection), parallel sentences, and
parallel sub-sentential fragments.
In order to demonstrate the validity of this approach, I use my method to extract
data from large-scale comparable corpora for various language pairs, and show that the
extracted data helps improve the end-to-end performance of a state-of-the art machine
translation system.
Chapter 1
Introduction
1.1 Parallel Data: A Scarce Resource
Parallel texts – texts that are translations of each other – are an important resource in
many cross-lingual NLP applications. They were found useful in research on automatic
lexical acquisition (Gale & Church 1991; Melamed 1997), cross-language information
retrieval (Davis & Dunning 1995; Oard 1997), and annotation projection (Diab & Resnik
2002; Yarowsky & Ngai 2001; Yarowsky, Ngai, & Wicentowski 2001). However, their
importance is paramount in the field of Statistical Machine Translation (SMT) (Brown
et al. 1990; Och & Ney 2002), as they provide the training data from which all the
translation knowledge is learned.
The state of the art in SMT is advanced enough that, given sufficient parallel data (i.e.
a few million words) for any language pair in a given domain, a generic SMT system
trained on it will achieve reasonable translation performance in that domain. The main
reason why SMT systems exist only for a handful of languages is that, for most language
pairs, parallel training data is simply not available. Figure 1.1 shows the approximate
amounts of such data available today to the research community, from the Linguistic
Data Consortium (http://www.ldc.upenn.edu) and other publicly available, sentence-aligned
resources. The vertical axis shows the size of the data, measured in millions of English
words (as it happens, all publicly available parallel corpora have an English half), while the horizontal axis
lists the language of the other half of the corpora. As can be seen, although parallel
corpora exist for the major European languages plus Chinese and Arabic, there are still
notable absentees, even among languages with a large number of speakers such as Hindi,
Japanese, and Russian.
Figure 1.1: Currently available amounts of parallel data.
The graph also indicates the genre of the data, and shows that most of the available
data comes from political discourse, i.e. translated proceedings of various organizations
and parliaments (United Nations, European Parliament, Hong Kong, etc.). The
genre factor has a strong impact on translation performance. SMT systems are heavily
influenced by the domain of their training data; if there is a large mismatch between
training and testing domains, translation quality decreases significantly. And getting
more parallel data from the same “wrong” domain does not help much. Therefore, we
not only need more parallel data, but also more diverse parallel data.
1.2 Comparable Data: A Much Richer Resource
One way to alleviate this lack of parallel data is to exploit a much richer and more diverse
resource: comparable corpora, texts which are not strictly parallel but related. The
prototypical example of comparable texts is a pair of news articles in different languages which
report on the same event. They are (most often) produced independently, but express
overlapping content, and are therefore likely to contain some parallel data. Consider the
articles in Figure 1.2, published on the BBC’s English and Romanian web sites in
November 2005. They were published 25 minutes apart, and report on the one-year anniversary
of Ukraine’s Orange Revolution. They have slightly different yet overlapping content;
as the lines in the figure show, although much of the text is not translated, some parallel
sentences and phrases do exist. My goal is to develop algorithms to automatically find
such parallel sentences and phrases.
Comparable texts can be found in large quantities on the Web. They exist for many
language pairs and a fair number of domains, and are continually updated and enriched.
As the Web continues to diversify and grow, even low-density languages will start having a
significant presence. I believe that the ability to exploit this resource will enable the
creation of SMT systems for many language pairs, which in turn will open up many
new and exciting research avenues in the field of Machine Translation.

Figure 1.2: Two comparable texts. The connected phrases are mutual translations.
1.3 What is Comparable?
The concept of “comparable corpus” is rather vague; most researchers define such
corpora simply as “texts that convey overlapping information”. To my knowledge, the
only attempt at a more precise definition is due to Fung and Cheung (2004a),
who contrast and analyze bilingual corpora of various degrees of parallelism. They
define noisy-parallel corpora, which are almost parallel but not quite; comparable
corpora, which contain topic-aligned (and thus fairly similar) documents; and
very-non-parallel corpora, which contain disparate, non-parallel bilingual documents.
The corpora that are of interest for the work presented in this thesis are those that
are not parallel, but do contain parallel data which can be of use for Statistical Machine
Translation. Broadly speaking, I consider parallel those bilingual corpora for which
the large majority of the text in one language has an easily identifiable translation in
the other language. In contrast, comparable corpora have a lot of non-translated data;
however, they also contain fair amounts of parallel data. As for corpora which contain
little or no parallel data (which might include some of the very-non-parallel corpora
mentioned by Fung and Cheung (2004a)), I would call them unrelated; the methods
presented in this thesis might not work very well on such corpora.
Let me try to be more precise. A bilingual corpus usually consists of two collections
of documents, one for each language. In order to obtain SMT training data, one must
extract from the corpus a set of parallel “segment pairs”, where a segment is usually a
sentence but can also be a phrase, a clause, or several sentences.
A corpus is parallel if it fulfills the following conditions: it has (explicit or implicit)
information about which documents are parallel; the parallel documents are fairly literal
translations of each other (i.e. without too many re-orderings, omissions, etc); and the
majority of the data in one language has a translation in the other. In such a corpus, the
parallel segments are easily identifiable. One merely needs to take each pair of parallel
documents and apply dynamic-programming sentence alignment algorithms (such as
those described in Section 2.3).
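To make the dynamic-programming idea concrete, the following is a length-based toy aligner in the spirit of Gale and Church (1991). The normalized length-difference cost and the fixed skip penalty are illustrative simplifications, not the models described in Section 2.3; real aligners use a probabilistic length model and allow 2-1 and 1-2 matches as well.

```python
def align_sentences(src_lens, tgt_lens, skip_cost=3.0):
    """Align two sentence sequences by length using dynamic programming.

    src_lens/tgt_lens are sentence lengths (e.g., in characters). Only
    1-1 matches and skips (1-0 / 0-1) are allowed; the cost of a 1-1
    match is the normalized length difference. A toy stand-in for
    Gale-Church-style alignment.
    """
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    # cost[i][j] = best cost of aligning the first i source and j target sentences
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + abs(src_lens[i] - tgt_lens[j]) / max(src_lens[i], tgt_lens[j])
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j, "match")
            if i < n:  # skip a source sentence
                c = cost[i][j] + skip_cost
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j, "skip-src")
            if j < m:  # skip a target sentence
                c = cost[i][j] + skip_cost
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j, "skip-tgt")
    # Trace back the 1-1 matches from the final state
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((pi, pj))
        i, j = pi, pj
    return list(reversed(pairs))
```

For example, aligning length sequences [20, 5, 30] and [21, 31] matches the first and last source sentences to the two target sentences and skips the short middle one.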
When the conditions described above are not fulfilled, we enter the realm of com-
parable corpora. In a typical comparable corpus there is no information about which
document pairs are parallel; many of the documents in one language will have no corre-
spondent in the other; and when such correspondents do exist, the documents in question
are usually not literal translations of each other. Extracting parallel data from such cor-
pora is a difficult task, which requires algorithms and models specifically designed for
this problem.
1.4 Exploiting Comparable Corpora
Different comparable corpora exhibit different levels of parallelism, and contain par-
allel data at different levels. At the “parallel” end of the scale are corpora for which
documents in one language are either fully translated into the other language, or have no
correspondent. An example is “Le Monde Diplomatique”, whose news articles are
produced in 17 languages; some of the articles are translated into several languages,
while others are region-specific and only exist in one language. In order to get
parallel data from such corpora, it is sufficient to identify the parallel document pairs
and apply sentence-alignment algorithms to them.
Further down on the scale are corpora which have some documents translated, others
not fully translated but still related (and therefore sharing parallel sentences), and others
not translated at all. We have found this to be the case for multilingual news feeds
produced by agencies like Xinhua News and Agence France Presse. In these corpora,
most of the parallel data can be found at sentence level.
Finally, at the other end of the scale are corpora which exhibit little parallelism at
either document or sentence level. An example is the collection of news articles produced by the
BBC. The documents shown in Figure 1.2 come from this corpus; the boxes and lines in
the figure indicate the parts which are translations of each other. As the example shows,
even article pairs that report on the same event, and that have been published almost at
the same time, share few or no fully parallel sentences. In order to make use of these
corpora we need the ability to extract parallel “fragments”, i.e. sub-sentential segments.
In my thesis, I will present algorithms that, using knowledge gathered only from an
initial (possibly quite small) parallel corpus, are capable of finding and extracting paral-
lel data at all these levels of granularity: documents, sentences, or sentence fragments.
The text that I extract is of sufficiently high quality that, when used as additional SMT
training data, it yields improved translation performance.
1.5 Contributions
The main contribution of this thesis is the development of three new algorithms designed
to find translationally equivalent data in non-parallel, comparable corpora.
The first algorithm is designed to find parallel sentences in comparable corpora. It
achieves this by analyzing bilingual sentence pairs individually (without relying on any
context), and classifying them as parallel or non-parallel. The classification is based on
a set of features derived from a simple word-level alignment between the two sentences.
Despite its simplicity, the method achieves high performance levels, even with few initial
resources.
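As an illustration of the general idea, the sketch below derives coverage features from a greedy word-level alignment and applies a threshold-based decision rule. The feature set, lexicon format, and decision rule are toy stand-ins for the maximum entropy classifier and the features of Chapter 4.

```python
def alignment_features(src, tgt, lexicon):
    """Toy stand-in for the word-alignment features of the parallel-sentence
    classifier: link each source word to any translation in the target
    sentence, per a seed lexicon, and summarize the resulting alignment.
    `lexicon` maps source words to sets of target words; it is a placeholder
    for the probabilistic lexicons of Chapter 3. Sentences are assumed
    non-empty token lists."""
    links = [(i, j) for i, s in enumerate(src) for j, t in enumerate(tgt)
             if t in lexicon.get(s, ())]
    linked_src = {i for i, _ in links}
    linked_tgt = {j for _, j in links}
    return {
        "len_ratio": min(len(src), len(tgt)) / max(len(src), len(tgt)),
        "src_coverage": len(linked_src) / len(src),  # fraction of source words with a link
        "tgt_coverage": len(linked_tgt) / len(tgt),  # fraction of target words with a link
    }

def looks_parallel(feats, threshold=0.5):
    """Decision stub standing in for the trained maximum entropy classifier."""
    return (feats["src_coverage"] >= threshold
            and feats["tgt_coverage"] >= threshold
            and feats["len_ratio"] >= 0.5)
```

A sentence pair whose words largely translate each other passes the thresholds; an unrelated pair, with near-zero coverage, does not.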
The second algorithm extends this approach in order to find parallel document pairs,
with equally high performance. Using the sentence pairs judged to be parallel, it cre-
ates a sentence-level alignment between the documents, and uses it to recognize which
document pairs are literal translations of each other.
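The intuition can be sketched as follows; the coverage and monotonicity thresholds below are illustrative placeholders, not the actual decision procedure of Chapter 5.

```python
def is_parallel_document_pair(n_src, n_tgt, pairs, min_coverage=0.5):
    """Toy document-pair decision. `pairs` holds (src_sentence_index,
    tgt_sentence_index) pairs that the sentence classifier judged parallel.
    The documents are accepted when enough of their sentences are covered
    and the linked sentences appear in roughly the same order (a crude
    monotonicity test on the induced sentence-level alignment)."""
    if not pairs:
        return False
    src_cov = len({i for i, _ in pairs}) / n_src
    tgt_cov = len({j for _, j in pairs}) / n_tgt
    tgt_seq = [j for _, j in sorted(pairs)]
    if len(tgt_seq) > 1:
        # fraction of adjacent links that do not go backwards in the target
        monotone = sum(b >= a for a, b in zip(tgt_seq, tgt_seq[1:])) / (len(tgt_seq) - 1)
    else:
        monotone = 1.0
    return src_cov >= min_coverage and tgt_cov >= min_coverage and monotone >= 0.8
```

Two short documents linked sentence by sentence in order are accepted; a long document pair sharing a single parallel sentence is not.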
The third algorithm tackles a harder problem: finding parallel sub-sentential frag-
ments, even from sentence pairs that are not translations of each other. To my knowl-
edge, this is the first published method that addresses this problem. It makes use of a
word-level alignment between the input sentences, and attempts to find fragments from
each sentence for which most of the words have a translation in the other sentence. I
show that, when dealing with a noisy comparable corpus, which contains few parallel
sentences, this is the best way of obtaining useful parallel data from that corpus.
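The fragment detector of Chapter 6 is based on a signal-filtering view of the word-level alignment; the following sketch illustrates that view. The smoothing window, minimum fragment length, and lexicon format are illustrative assumptions, not the thesis settings.

```python
def extract_fragments(src, tgt, lexicon, window=3, min_len=3):
    """Sketch of the signal-filtering idea: mark each source word +1 if the
    lexicon gives it a translation occurring in the target sentence, -1
    otherwise; smooth the signal with a moving average; keep maximal runs
    of words where the smoothed signal is positive. `lexicon` maps source
    words to sets of target words."""
    signal = [1.0 if any(t in lexicon.get(w, ()) for t in tgt) else -1.0 for w in src]
    half = window // 2
    smoothed = []
    for i in range(len(src)):
        chunk = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    fragments, run = [], []
    for word, s in zip(src, smoothed):
        if s > 0:
            run.append(word)
        else:
            if len(run) >= min_len:
                fragments.append(run)
            run = []
    if len(run) >= min_len:
        fragments.append(run)
    return fragments
```

A run of translated words surrounded by untranslated material survives the filter as one fragment, while isolated translated words are smoothed away.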
The work in this thesis also improves over the current state of the art in the more
practical matters of scale and evaluation. My algorithms are applied to very large-scale
corpora of several hundred million words, several orders of magnitude larger than the
corpora used in any other published work. Their performance is evaluated by measuring the impact
of their output data on the performance of a state-of-the-art Statistical Machine Transla-
tion (SMT) system. This is an important contribution: my work is the first (and so far,
the only) one that successfully improved end-to-end SMT performance using training
data automatically extracted from non-parallel corpora.
1.6 Thesis Outline
The rest of this thesis is structured as follows. I start by reviewing previous work on
comparable corpora in Chapter 2. Chapter 3 presents the experimental framework used
in the rest of the thesis: the evaluation methodology, the data, and the general archi-
tecture of a parallel data extraction system. All the methods presented in subsequent
chapters are evaluated in the context of this framework.
Next, in Chapter 4, I present my method for finding parallel sentences in compara-
ble corpora. Given an arbitrary bilingual pair of sentences, I show how to define and
extract features which are indicative of their degree of parallelism. I also show how to
train a classifier able to distinguish, with high precision, between parallel sentence pairs
and non-parallel ones. Using this method I extract data from large (Gigaword) Chi-
nese, Arabic and English newswire corpora, use it as additional SMT training data, and
obtain better translation performance. This work led to the first published results that
showed end-to-end increases in the performance of a state-of-the-art MT system using
information extracted from comparable corpora.
Chapter 5 describes how the parallel sentence detection method can be extended
in order to find parallel document pairs that might exist in a comparable corpus. The
approach is based on the intuition that if two documents share many parallel sentences,
it is likely that they are translations of each other.
Then, in Chapter 6, I present a new model that aims to discover parallel sub-
sentential fragments. By going below the sentence level, I can extract useful data even
from corpora which have few parallel document or sentence pairs; of course, most avail-
able comparable corpora fall into this category.
Chapter 7 describes bootstrapping experiments: by using data automatically
extracted from the comparable corpus to perform another extraction from the same
corpus, I can obtain even more data, and better MT performance. Finally, Chapter 8
provides conclusions and suggestions for future work.
I also present, in Appendix A, a detailed description of one of my experiments. I
show the running times of the various modules involved in the extraction system, the
amounts of data they process, and also make explicit the various parameters that control
the execution of the systems, as well as their values.
Chapter 2
Previous Work
2.1 Overview
The goal of the research that I review in this chapter is to extract translational informa-
tion from comparable corpora. The earliest such efforts, which I review in Section 2.2,
focused on finding translations for unknown words. Subsequent research developed
methods for finding full parallel sentences. However, some of these methods are exten-
sions of sentence-alignment algorithms – which find parallel sentences in parallel doc-
uments – to noisy-parallel documents. Therefore, I first review some of the sentence-
alignment literature in Section 2.3, and continue with the comparable corpora work in
Section 2.4. Then, in Section 2.4.1, I present efforts aimed at discovering parallel doc-
ument pairs. I conclude in Section 2.4.2 by discussing the relationship between my
research and the previous work presented in the rest of the chapter.
2.2 Finding Word Translations in Comparable Corpora
2.2.1 Choosing Between Alternative Translations
The work presented in this section attempts to use comparable corpora in order to disam-
biguate between alternative translations of a word. More precisely, the problem is this:
given a source word in a given context and a list of several possible target translations,
use a target language monolingual corpus to help choose the right one. This involves
representing the context, translating it, and then choosing the best target word according
to the translated context.
Dagan and Itai (1994) represent the context as syntactic tuples, which consist of
syntactic relations between words, together with the base forms of those words. Tuples
are mapped into the target language by considering all possible translations of the words,
and by using hand-written rules to decide how to translate the syntactic relations. Target
tuples are then scored using the maximum likelihood estimator (i.e. frequency). The
resulting translation is selected by computing a confidence interval of the ratio between
estimates of different alternatives.
Kikui (1998) presents a method for translating “term lists”, groups of content words
which characterize a concept. Each word in the list has a list of translation alternatives.
By translating the whole list at once, the coherence of all suggested translations can be
taken into account in order to produce better translations. Word context is defined as a
co-occurrence vector, and coherence is measured by computing the geometric proximity
of these vectors.
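Geometric proximity between such vectors is typically measured by cosine similarity. A minimal sketch over sparse count vectors follows; the dictionary representation is an assumption for illustration, not Kikui's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse co-occurrence vectors,
    represented as {context_word: count} dictionaries. Returns 0.0
    when either vector is empty."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Identical vectors score 1.0, vectors with no shared context words score 0.0, and candidate translations can be ranked by this score.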
The work reported in Koehn and Knight (2000) has a somewhat different goal: to
put probabilities on the possible translations given by the lexicon. They use the expec-
tation maximization (EM) algorithm and a bigram target language model in order to
estimate the best sequence of translations for all the words in a sentence. They apply
this method to translate the whole source corpus, and then use maximum likelihood to
estimate translation probabilities.
2.2.2 Learning New Translations
A more difficult problem is that of learning new translations for unknown source words.
Efforts to solve this problem are based on the assumption that translationally equivalent
words appear in similar contexts, even in unrelated texts. Thus, the general approach
is this: compute the context of the source word (over a monolingual source language
corpus) and transfer it in the space of the target language; compute the context of all
target words in a monolingual target corpus, and choose the one most similar to the
context of the initial source word. This requires a method for computing word contexts,
a method for translating source contexts into the target language, and a measure of
similarity between contexts.
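The general approach just described can be sketched as follows. This is an illustrative reconstruction rather than any one published system: the seed lexicon, the window size, and the use of cosine similarity are all assumptions of this sketch, and the systems surveyed below differ precisely in these choices.

```python
from collections import Counter

def context_vector(word, corpus, window=3):
    """Count cooccurrences of `word` with words within `window` positions."""
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for c in sent[lo:i] + sent[i + 1:hi]:
                    vec[c] += 1
    return vec

def translate_vector(vec, seed_lex):
    """Transfer a source context vector into target space via the seed lexicon."""
    return Counter({seed_lex[w]: c for w, c in vec.items() if w in seed_lex})

def best_translation(src_word, src_corpus, tgt_corpus, tgt_words, seed_lex):
    """Pick the target word whose context is most similar to the source word's."""
    src_vec = translate_vector(context_vector(src_word, src_corpus), seed_lex)
    def cosine(a, b):
        num = sum(a[k] * b[k] for k in a)
        den = (sum(v * v for v in a.values()) ** 0.5) * \
              (sum(v * v for v in b.values()) ** 0.5)
        return num / den if den else 0.0
    return max(tgt_words,
               key=lambda t: cosine(src_vec, context_vector(t, tgt_corpus)))
```

The three components named in the text map directly onto the three functions: computing contexts, translating them, and measuring similarity.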
The various research efforts presented below differ mainly in the method used to
compute contexts and their similarities. All evaluate their results intrinsically, either
by manual inspection or by creating test data from a bilingual lexicon. In particular,
none of the systems have been shown to produce lexicons which helped improve the
performance of another application.
The hypothesis that translationally equivalent words have similar contexts was first
studied by Rapp (1995). He takes 100 words together with their translations, represents
their context as a cooccurrence vector, and shows that the resulting source and target
cooccurrence matrices are most similar when the word order in the matrix is the same in
both languages. Searching through all possible orderings is computationally infeasible,
so this approach cannot be used to actually find translations. However, it does provide
support to the similar context assumption.
Fung (1995) uses a very simple notion of context, which simply counts how many
different words precede or follow the word of interest, and divides the count by the
frequency of occurrence. Thus, context is a pair of numbers, and context similarity is
measured by Euclidean distance.
Fung and McKeown (1997) use seed words whose translations are known, and rep-
resent context as a vector of coocurrences with these seed words. More precisely, each
value in the context vector is the weighted mutual information between the word of
interest and a seed. They measure context similarity using the cosine measure. In a
subsequent publication, Fung and Yee (1998) experiment with several other context
similarity measures, which employ the term frequency and inverse document frequency
of the seed words. And, in a similar attempt, Rapp (1999) computes context by using
the log likelihood ratio score (Dunning 1993) between the word of interest and the seed
words; and measures similarity using the “city block metric”, which is the sum of the
absolute differences of corresponding vector positions.
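As a concrete illustration, two of the similarity measures mentioned above can be written for dictionary-shaped context vectors. The choice of vector weighting (mutual information, TF-IDF, or log-likelihood ratio) is orthogonal to the measure itself, so plain scores are assumed here:

```python
def cosine(a, b):
    """Cosine similarity between two sparse vectors (dict: word -> score)."""
    keys = set(a) | set(b)
    num = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def city_block(a, b):
    """Rapp's "city block metric": sum of absolute coordinate differences.
    Unlike cosine, smaller values mean more similar vectors."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)
```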
Koehn and Knight (2002) find new translations by using, besides context vectors,
additional knowledge sources. Some simple and useful clues are spelling (identical or
similarly spelled words are likely to be translations of each other), and frequency of
usage (translations of frequent words should also be frequent words). A more complex
feature is “similarity”, which is based on the assumption that words which are used
similarly in the source language (such as Wednesday and Thursday) should have trans-
lations which are similar in the target language. This idea was first presented by Diab
and Finch (2000); however, they only used it to perform proof-of-concept experiments
between two English corpora.
Gaussier et al. (2004) address the issues of polysemy and synonymy of the seed
words. If two seed words are synonyms, the contextual information they provide is
wrongly split in two. If a seed word has several meanings, some of which are not
translated in the seed lexicon, then its untranslated senses supply misleading contextual
information. In order to alleviate these problems, Gaussier et al. compute context vec-
tors not in terms of the seed words themselves, but in terms of their context vectors. In
this new space, synonymous seed words are close to each other, and polysemous entries
are further apart.
Comparable corpora have also been used for finding transliterations of named enti-
ties (i.e. names of people and locations). Sproat et al. (2006), and Klementiev and
Roth (2006), make use of the temporal structure of comparable data to help choose the
correct translation of named entities. The intuition is that names that cooccur often in
documents from the same time period are more likely to be mutual translations. For
each source language name, a set of target language candidate translations are first cho-
sen according to a transliteration model; then, the temporal structure information and
the transliteration model are used together to discover the best candidate translation.
The biggest comparable corpus is, of course, the World Wide Web; and it also has
been mined for translations. Huang et al. (2005) attempt to translate “key phrases”
(content-bearing strings such as named entities or medical terms) by submitting queries
to web search engines. They submit a first query with the source phrase that needs to
be translated; retrieve the snippets returned by the search engine, and expand the query
by translating words from those snippets; submit the expanded query; and try to extract
the target translation from the second round of snippets using phonetic, semantic and
frequency-distance features.
2.3 Sentence Alignment Algorithms
The goal of sentence alignment (SA) algorithms is to create a sentence-level alignment
between two parallel documents. The assumption that the documents are parallel is
crucial; it allows the problem to be efficiently solved with dynamic programming algo-
rithms.
The main approach is to consider that the documents are a succession of minimal
translated units, or beads (Brown, Lai, & Mercer 1991). An (m,n) bead consists of
m source language sentences and n target language sentences (m and n can also
be zero). Thus, a perfectly literal translation would be a sequence of (1,1) beads; the
existence of an untranslated source sentence would be explained with a (1,0) bead; one
target sentence which translated two source ones would be represented as a (2,1) bead;
and so on. The goal of the sentence alignment algorithm is to find the best sequence
of beads that generates the two documents. This can be achieved by defining a score
for each bead, and using dynamic programming to perform a search over possible bead
sequences.
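The search described above can be sketched with a small dynamic program. The bead inventory and the cost function are parameters of this illustration, not details of any particular published aligner; `score` is assumed to return a cost per bead, lower being better:

```python
def align(src, tgt, score):
    """Find the minimum-cost bead sequence covering both sentence lists."""
    BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    INF = float("inf")
    # best[i][j] = minimal cost of aligning src[:i] with tgt[:j]
    best = [[INF] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    best[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if best[i][j] == INF:
                continue
            for m, n in BEADS:
                if i + m <= len(src) and j + n <= len(tgt):
                    c = best[i][j] + score(src[i:i + m], tgt[j:j + n])
                    if c < best[i + m][j + n]:
                        best[i + m][j + n] = c
                        back[i + m][j + n] = (i, j)
    # recover the bead sequence by following back-pointers
    beads, i, j = [], len(src), len(tgt)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(beads))
```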
In the algorithms of Gale and Church (1991) and Brown et al. (1991), the score of
each bead is computed based only on sentence length, measured in characters by Gale
and Church and words by Brown et al. They assume that the ratio of the two lengths
in a bead is a normally distributed random variable. The score of a sentence bead can
be computed by integrating this distribution, whose mean and variance are estimated
empirically (by analyzing the two texts to be aligned). Both algorithms produce align-
ments of high quality, having an error rate of less than 5%. The major drawback of the
approach is that it requires some form of paragraph boundaries, in order to constrain the
dynamic programming search. Both researchers obtain these boundaries by exploiting
regularities in their respective corpora.
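A hedged sketch of such a length-based bead cost: the normalized length difference is treated as a normal variable and scored by the negative log of its two-tailed tail probability. The default constants (mean target characters per source character `c` and variance `s2`) are illustrative; in the actual algorithms both are estimated empirically from the texts being aligned:

```python
import math

def length_cost(ls, lt, c=1.0, s2=6.8):
    """Cost of a bead whose sides have ls and lt characters (lower = better)."""
    if ls == 0 and lt == 0:
        return 0.0
    mean = (ls + lt / c) / 2.0
    # normalized length difference, assumed ~ N(0, 1)
    delta = (lt - ls * c) / math.sqrt(mean * s2)
    # two-tailed probability of a deviation at least this large
    p = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(p, 1e-300))
```

A perfectly proportional bead (delta = 0) costs nothing, and the cost grows as the length ratio becomes less plausible.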
Further work on sentence alignment studied ways of enhancing this basic, length-
based framework with additional bilexical knowledge sources. Wu (1994) proposed the
use of “lexical cues”, i.e. high-precision translation pairs. For each sentence bead, he
counts the number of occurrences of each cue in both sides of the bead, and defines
probability distributions over these counts. Thus, the score of a sentence bead becomes
a product of probabilities: one for the lengths of the sentences, and then one for every
lexical cue. He shows that his method outperforms the simpler, length-based one, for
an English-Chinese parallel corpus. In a similar manner, Davis et al. (1995) define a
bead score that combines even more diverse knowledge sources: unordered character
n-gram matching, number matching, and string comparisons. They apply this framework to a
noisy parallel English-Spanish corpus, and again obtain significant improvements over
the simple length-based approach.
One problem concerning the use of additional knowledge sources for sentence
alignment is that it increases computational complexity. In order to alleviate this,
Moore (2002) suggests a two-pass algorithm. The first pass applies the simple and
fast length-based method; from that, reliable 1-to-1 alignments are extracted, and used
to learn a bilingual lexicon; the second pass uses the lexicon to compute better scores for
the sentence beads (using the IBM Model 1 (Brown et al. 1993)). However, the second
pass only searches through the possible alignments that were assigned non-negligible
probability in the first pass; thus, it searches through a much smaller space.
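A bead score in the spirit of the IBM Model 1 scoring used in the second pass can be sketched as follows. The lexicon `t` is assumed to be a dictionary from (source word, target word) pairs to translation probabilities, and the NULL-word probability is an illustrative value:

```python
import math

def model1_logprob(src, tgt, t, null_prob=1e-4):
    """log P(tgt | src) under Model 1 with uniform alignment probabilities."""
    lp = 0.0
    for f in tgt:
        # each target word may align to any source word, or to NULL
        s = null_prob + sum(t.get((e, f), 0.0) for e in src)
        lp += math.log(s / (len(src) + 1))
    return lp
```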
The work of Deng et al. (2006) goes beyond sentence alignment, to chunk align-
ment. Their procedure, divisive clustering, segments the parallel text into sub-sentence
units, and allows them to be reordered to improve the alignment quality. The algorithm
starts by producing a coarse alignment, consisting of segments that may contain several
sentences each. It then attempts to split each segment in two parts, such that the parts
align well to each other. Splitting is done using a set of predefined break markers (such
as punctuation marks), and the two parts produced by a split may reverse their order.
The resulting segments are split again, recursively, as long as the resulting alignment
improves. The procedure produces as output parallel segments that are quite short, and
mostly sub-sentential.
2.4 Finding Parallel Sentences in Comparable Corpora
The research most relevant to the work presented in this thesis is that focused on finding
parallel sentences in comparable corpora. There are two main approaches for solving
this problem.
The first one, adopted by Zhao and Vogel (2002a) and Utiyama and Isahara (2003), is
an extension of the sentence-alignment algorithms presented in Section 2.3. These meth-
ods essentially try to find pairs of parallel documents, and then sentence-align them. To
account for less-literal document translations, they perform a wider dynamic program-
ming search, by allowing larger sentence beads, i.e. more reordering.
Zhao and Vogel compute the score of a document pair by defining a generative model
of a target document given a source document. They consider as parallel all document
pairs whose score is above a certain threshold. These pairs are then sentence-aligned; the
score of a sentence bead is computed as a combination of length information and IBM
Model 1 lexical score. Zhao and Vogel evaluate the extracted sentences by showing
they improve the accuracy of automatically computed word alignments. In a subsequent
publication, Vogel (2003) evaluates these sentences in the context of an MT system, and
shows that they bring improvement under special circumstances (i.e. language model
constructed from reference translations), designed to reduce the noise introduced by the
automatically extracted corpus.
Utiyama and Isahara adopt a similar approach; they use the BM25 score (Robert-
son & Walker 1994) to find parallel documents, and sentence-align them using a score
based on the number of translated words in each sentence bead. Next, they use the
sentence similarity scores to compute new document matching scores, and the new doc-
ument scores to compute new sentence similarity scores, and show that this yields data
of higher quality. They evaluate their sentences by analyzing randomly sampled align-
ments.
The second approach to the problem of parallel sentence discovery, developed con-
currently in my own research (Munteanu & Marcu 2005) and in that of (Fung & Che-
ung 2004a; 2004b; Cheung & Fung 2004; Wu & Fung 2005), is designed specifically
for non-parallel corpora. The idea is to match each source document with several target
ones, and analyze all possible sentence pairs from each document pair. This enables the
discovery of parallel sentences even from non-translated documents.
Fung and Cheung (2004a; 2004b) pair each source document with all target docu-
ments whose similarity score (computed using cosine similarity) is higher than a certain
threshold. From each document pair, they extract all sentence pairs with high cosine
similarity. However, their list of document pairs is not fixed. After one round of sen-
tence extraction, the list is enriched with additional documents, and the system iterates.
Thus, they manage to include in the search document pairs which are dissimilar (but
hopefully still contain good data). Wu and Fung (2005) also show how the system’s
output can be further improved by filtering the extracted sentence pairs using bracketing
Inversion Transduction Grammar.
The problem of aligning sentences in comparable corpora was also addressed for
monolingual texts. Barzilay and Elhadad (2003) present a method of aligning sentences
in two comparable English corpora, for the purpose of building a training set of text-to-
text rewriting examples. Monolingual parallel sentence detection presents a particular
challenge: there are many sentence pairs that have low lexical overlap, but are neverthe-
less parallel. Therefore, context modelling becomes a crucial component of the method.
The researchers employ context information by clustering together the paragraphs of
each corpus independently, and then learning rules for mapping clusters from one cor-
pus to another. Thus, they obtain “parallel paragraph pairs”, and align the sentences of
these paragraphs using dynamic programming.
2.4.1 Finding Parallel Documents in Comparable Corpora
There is a large body of work in the Cross-Lingual Information Retrieval (CLIR)
field (Oard & Gey 2002) that is related to the problem of finding document pairs that
are similar in content, or that are relevant for a given query. However, since my interest
lies in finding documents that are translations of each other (and not merely similar), I
review here only the research efforts directed towards this goal.
The standard approach to the problem of finding translated document pairs is as fol-
lows: define a document similarity score, compute it for all document pairs from the
collection, and apply a score threshold in order to decide which pairs should be declared
parallel. This is best shown in the work of Resnik and Smith (2003), which aims to
discover parallel document pairs on the Web by measuring the similarity between their
URLs, HTML structure, and content. The content-based match is performed by com-
puting a similarity score based on a word-level alignment between the two documents.
The more links are shared by two documents, the higher their similarity.
All other algorithms published in the literature take the same basic approach, and
differ only in the choice of the similarity metric. The “CLIR” method translates source
documents in the target language (usually by dictionary term lookup), and computes the
similarity score in the space of the target language as a function of the term frequency
(TF) and inverse document frequency (IDF) of the words in the documents. This is
the method employed by Collier et al. (1997), and Utiyama and Isahara (2003). The
latter work also takes into account similarity at sentence level, by refining the initial
document similarity score with a component based on a sentence-level alignment of
the two documents. Another way of defining document similarity is the “generative
model”, presented by Zhao and Vogel (2002a). They compute the probability that the
target document (viewed as a bag of words) was generated from the source one, by a
statistical generative model.
Pike and Melamed (2004) adopt a different approach, attempting to distinguish doc-
ument pairs which are parallel from those which are merely similar. They look for points
of correspondence (i.e. pairs of translated words) in the two documents from a pair, and
perform a geometric analysis of these points. Their method exploits the intuition that
parallel documents should not only have many translated words, but these words should
appear in roughly monotonic order in the two documents.
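One simple way to operationalize this intuition is to measure how monotonic the points of correspondence are, using a Kendall-tau-like statistic. This is an illustration of the idea, not Pike and Melamed's actual geometric analysis; the thresholds are assumptions:

```python
def monotonicity(points):
    """Fraction of point pairs (i, j) that are in increasing order."""
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    if not pairs:
        return 1.0
    concordant = sum(1 for (i1, j1), (i2, j2) in pairs
                     if (i1 - i2) * (j1 - j2) > 0)
    return concordant / len(pairs)

def looks_parallel(points, min_points=5, min_mono=0.9):
    """Require both many translated word pairs and near-monotonic order."""
    return len(points) >= min_points and monotonicity(points) >= min_mono
```

Merely similar documents may share many translated words, but their correspondence points tend to be scattered rather than forming a roughly diagonal band.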
The main drawback of this standard approach is that, since it performs computations
essentially at word level, it has difficulties in distinguishing document pairs that are truly
translation equivalents from those that are only similar. Thus, it is likely to produce
many false positives when applied to a noisy document collection.
2.4.2 Discussion
The aim of the work presented in this thesis is to find parallel data in comparable cor-
pora. From that perspective, it is quite similar to the STRAND system of Resnik and
Smith (2003), which also tries to find parallel data (i.e. web pages) in the big corpus
which is the World Wide Web. However, the kind of corpora that STRAND has been
designed to work on is rather different from the corpora that are of interest for my work.
STRAND runs on the Web, and can therefore make use of URL addresses (in order to
narrow down its search space) and of HTML structure (that helps when analyzing doc-
ument pairs). My system runs on collections of plain text files, and all its algorithms
rely only on the documents’ contents. Moreover, STRAND has a narrower scope;
it merely attempts to discover web pages which are fully parallel. In contrast, I present
algorithms for finding translated data at finer levels of granularity. Besides parallel doc-
uments, I also show how to find pairs of translated sentences even from documents that
are non-parallel; and how to find translated sub-sentential fragments, even from sen-
tences that are non-parallel.
All these algorithms are based on ideas that are very different from the approaches
tried in previous research. For finding parallel sentences (Chapter 4), instead of extend-
ing sentence-alignment algorithms or ranking pairs based on some similarity metric, I
design a classifier able to judge which pairs are parallel and which are not, with high
precision and recall. For finding parallel documents (Chapter 5), instead of measuring
document similarity at word level, I do it at sentence level, which enables me to reject
“false positives” (i.e. documents that have similar content, but are not literal translations
of each other).
As for the method for finding parallel fragments (Chapter 6), it is the first published
attempt at solving this problem. The work of Deng et al. (2006) (described in Sec-
tion 2.3) is somewhat related, because it also produces as output parallel sub-sentential
fragments. However, their method finds parallel fragments in parallel texts, by split-
ting parallel segments and aligning the results. My algorithm tackles a harder problem,
attempting to find translated phrases within potentially non-parallel sentence pairs.
I conclude by pointing out that the research described in Section 2.2 has no relation-
ship with the work presented in this thesis. None of my algorithms are designed to find
translations for individual words. Even when trying to find parallel fragments (which
could potentially be as short as one word), my method discards fragments shorter than
three words, since I have empirically found that short phrase pairs contain mostly incor-
rect correspondences.
Chapter 3
A Framework for Empirical
Evaluation of Parallel Data Detection
Algorithms
3.1 Overview
The algorithms and models described in this thesis are designed to analyze compara-
ble texts in order to detect parallel data. In this chapter, I describe the experimental
framework used to measure the performance of these models. I start by describing the
evaluation methodology (Section 3.2), and by providing details about the various cor-
pora used in the experiments (Section 3.3). I then present, in Section 3.4, the architecture
of a general parallel data extraction system, on top of which I apply all the parallel data
detection models described in the rest of this thesis.
3.2 Evaluation Methodology
I evaluate the data extracted by my algorithms by measuring its impact on the perfor-
mance of a state-of-the-art Statistical Machine Translation (SMT) system. Essentially,
I use the data as additional training material, and verify whether this leads to better
performance.
As Figure 3.1 shows, each data extraction experiment makes use of an initial par-
allel corpus and a comparable corpus. The initial parallel corpus is the only source
of bilingual knowledge that is available to the extraction system; mainly, it is used for
computing lexicons (Section 3.4.5). It is also the only resource available for building
an SMT system, which will be the baseline system for our evaluation. After extracting
additional data from the comparable corpus, I then train a comparative MT system on
both the initial and extracted corpora. The quality of the extracted data is measured by
its impact on MT performance, i.e. by the difference in the performances of the baseline
and comparative MT systems.
The size of the initial parallel corpus (i.e. of the lexicon) has a significant impact
on the amount and quality of the data extracted by my system, and also on the out-
come of the MT experiments. If the initial corpus is small, the system will have little
bilingual knowledge (or coverage), and may extract relatively little data. However, the
baseline MT performance will also be weak, and easier to improve upon. In contrast, a
large initial corpus will yield more data from the comparable corpus, but also a stronger
baseline.
Figure 3.1: A framework for empirical evaluation of parallel data detection algorithms.
Therefore, all experiments presented in the rest of the thesis will be replicated with
initial corpora of various sizes, ranging from very small (10k words) to very large (100M
words). This will enable me to estimate the impact of the extraction algorithms on both
resource-scarce and resource-rich language pairs.
3.3 Experimental Resources
3.3.1 Parallel data
Most of the experiments from this thesis are performed in the context of Arabic-English
and Chinese-English machine translation. The initial parallel corpora that seed these
experiments are parts of the training data available for the 2006 NIST MT evaluation
(http://www.nist.gov/speech/tests/mt/).
I use initial seed corpora, with sizes (measured in number of tokens on the English side)
of 10k, 100k, 1M, and 100M tokens.
For Chinese-English, I also use as initial corpus a dictionary of 80k entries, available
from the Linguistic Data Consortium
(www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L27). Since a
regular dictionary and a parallel corpus
have distinct properties, it is instructive to use both as seed for the parallel data extraction
algorithms.
The parallel MT training corpora consist of data from two distinct genres: news
(translations of news articles) and political (translations of United Nations
proceedings). Both the comparable data and the MT test corpus (described in the
following sections) are from the news genre; therefore, in order to maintain consistency,
I construct the seed corpora by using as much news data as possible.
3.3.2 Comparable data
The comparable corpora consist of collections of news stories published by the Agence
France Presse (AFP), Xinhua News, and Central News Agency (CNA) news agencies.
They are parts of the Arabic, English and Chinese Gigaword corpora (second edition),
which are available from the Linguistic Data Consortium. From these collections, for
each language pair, I create the Gigaword comparable corpora by putting together arti-
cles coming from the same agency and the same time period. Table 3.1 presents in detail
the sources and sizes of the resulting comparable corpora.
                                                    Foreign              English
Language pair     News agency and period         # articles  # tokens  # articles  # tokens
Arabic-English    AFP, 1994-1997, 2001-2004;       600k        115M      1.6M       330M
                  Xinhua News, 2001-2004
Chinese-English   CNA, 1997-2004;                  1.6M        456M      983k       247M
                  Xinhua News, 1995-2004

Table 3.1: Size of the Gigaword comparable corpora.
3.3.3 Statistical Machine Translation Systems
The SMT systems used in my experiments are trained using a variant of the alignment
template model of Och and Ney (2004). All systems use the same two trigram language
models: a very large one trained on the whole English Gigaword corpus, and a smaller
one trained on the English side of available MT training data. Thus, any difference in
performance between a baseline and a contrastive system is caused only by differences
in their parallel training data.
All systems were tested on the (news domain) test corpus used for the NIST 2005
MT evaluation. Since all articles from the Gigaword comparable corpus (which are
potentially training data) were published no later than 2004, they should not overlap with
the test data. Translation performance was measured using the automatic BLEU (Pap-
ineni et al. 2002) evaluation metric, against 4 reference translations.
3.4 The Data Extraction Framework
3.4.1 Introduction
The comparable corpora that we work with consist of (loosely independent) monolin-
gual collections of documents. The parallel data that we seek could be anywhere in
those collections; in any two documents, and any two sentences from them. The frame-
work presented in this section is designed to restrict this potentially huge search space,
and produce a reasonably-sized set of document/sentence pairs that are likely to contain
good data. It applies simple and efficient operations on large amounts of data, in order to
filter out clearly non-parallel pairs. The more advanced models presented in the rest of
the thesis are then used to analyze this set of candidates, in order to find parallel data at
various levels of granularity: parallel documents, sentences, or sub-sentential fragments.
Figure 3.2 depicts this general framework. The input, as explained in the previ-
ous Section, consists of an initial parallel corpus and two monolingual corpora in the
two languages of interest (which we will refer to as source and target), divided into
documents. In the first stage of the pipeline, document selection (Section 3.4.2), I use
information retrieval techniques to select a subset of document pairs which are likely to
contain parallel data. From each such pair, I then generate all possible sentence pairs
(since I make no assumption that the documents should have similar sentence order)
and discard those which are very unlikely to be similar, thus obtaining candidate sen-
tence pairs (Section 3.4.3). These candidates are further processed by the parallel data
Figure 3.2: A framework for parallel data extraction from comparable corpora.
detection models, as described in Section 3.4.4. Each step in the pipeline has some
parameters that control its execution, which are shown in the lower part of the figure;
these parameters are described in the respective sections. The components make use of
two bilingual lexicons, ML-Lex and LLR-Lex, which are presented in Section 3.4.5.
I conclude the description of the framework in Section 3.4.6, with a discussion of
the computational complexity of the algorithms involved at each stage.
3.4.2 Selecting similar document pairs
This step is designed to select, for each source document, target documents that are
likely to exhibit similar content. I perform document matching using the Lemur infor-
mation retrieval toolkit (http://www-2.cs.cmu.edu/~lemur; Ogilvie & Callan 2001).
I index all target documents into a
database, and transform all source documents into target language queries by glossing
their words. For each source document, I take the top 5 translations of each of its words
(according to my lexicon), and create a query in the target language. The translation
probabilities are only used to choose the word translations; they do not appear in the
query. I then run the query against the collection, retrieve the TopK most similar target
documents, and pair them with the source query document.
This step of the process emphasizes recall rather than precision. For each source
document, I do not attempt to find one best matching target document, but rather a set of
similar ones. The subsequent components of the system are robust enough to filter out
the extra noise introduced by the selection of additional (possibly bad) target candidates.
Since my corpora consist of news articles, I can also make use of the articles’ pub-
lication dates. I consider it likely that documents with similar content have publication
dates that are close to each other. Thus, each query is actually run only against target
documents published within a window of DateWindow days around the publication date
of the source query document; I retrieve the best TopK of these documents.
3.4.3 Selecting candidate sentence pairs
From each document pair, I consider all possible sentence pairs, i.e. the Cartesian prod-
uct of the sentences from the two documents. Clearly, most of these pairs will not
contain any parallel data; this step attempts to filter out as many of them as possible. I
achieve this by looking at the difference in the length of the two sentences, and at the
number of words they have in common.
Thus, for each sentence pair, I first check that the ratio of their lengths is no greater
than the LengthRatio parameter. Then, I compute their word overlap: the percentage of
words from the source sentence that have a translation in the target one (according to
the probability distributions of my lexicon) and, conversely, the percentage of target words
that are translated in the source sentence. Both these percentages should be higher than
the WordOverlap parameter. Pairs that do not fulfill these conditions are discarded; the
others constitute the output.
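A minimal sketch of this word-overlap filter, with the lexicon simplified to plain translation sets (the actual system consults the probability distributions of its lexicon) and with illustrative parameter values:

```python
def passes_filter(src, tgt, lex_src2tgt, lex_tgt2src,
                  length_ratio=2.0, word_overlap=0.5):
    """Keep a sentence pair only if lengths and word overlap look plausible."""
    if not src or not tgt:
        return False
    # reject pairs whose length ratio exceeds the LengthRatio parameter
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > length_ratio:
        return False
    def coverage(sent, other, lex):
        # fraction of words with at least one known translation in `other`
        covered = sum(1 for w in sent if lex.get(w, set()) & set(other))
        return covered / len(sent)
    # overlap must hold in both translation directions
    return (coverage(src, tgt, lex_src2tgt) >= word_overlap and
            coverage(tgt, src, lex_tgt2src) >= word_overlap)
```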
This step acts as a filter, removing many of the bad sentence pairs introduced by our
loose document selection procedure and by the Cartesian product. I will subsequently
refer to it as the word-overlap filter. Naturally, this filter also rejects some parallel
sentence pairs, which fail to fulfill the word overlap conditions because the lexicon does
not contain the necessary entries; but those pairs could not have been handled reliably
anyway, so the overall effect of the filter is to improve the efficiency, precision and
robustness of the system. Typically, the filter discards around 99% of the pairs from its
input (see Appendix A).
3.4.4 The Parallel Data Detection Models
The subsequent chapters of this thesis will describe models that detect parallel data
at either document, sentence, or sub-sentence level. The parallel sentence detection
model (described in Chapter 4), analyzes each candidate sentence pair produced by the
word-overlap filter and decides whether the sentences in the pair are translations of each
other. The pairs which were found to be parallel are further employed by the parallel
document detection model (Chapter 5), which regards them as links between documents,
thus finding those documents which are literal translations of each other. The parallel
fragment detection model (Chapter 6) analyzes the candidate sentence pairs which were
found non-parallel, and attempts to exploit them by finding sub-sentential fragments
which are still parallel.
3.4.5 Lexicons
Introduction
As explained in Section 3.4.1, the extraction pipeline consists of two parts that have
distinct purposes. The first one (which is responsible for finding similar documents and
candidate sentence pairs) is concerned with filtering out useless data, while the second
(the parallel data detection models) performs a more detailed analysis. For the first part,
broader coverage is more important than precision: I should avoid filtering out parallel
data, even at the expense of allowing some amount of non-parallel data to pass through.
For the second part, which produces the final output of the system, high precision is
crucial. I therefore employ two lexicons, one for each of the two parts: one which is less
accurate but contains more entries (higher recall), and one which has fewer but more
accurate entries (higher precision).
The probabilistic lexicons are computed from the initial parallel corpus: I align the
parallel corpus at the word level, and estimate translation probabilities from the word
links. I compute the word alignment using the GIZA++ (4) implementation of the IBM
translation models (Brown et al. 1993). Since these models can only produce one-to-
many alignments, which do not explain the data well, I follow Och and Ney (2003) and
compute one alignment for each translation direction, and then symmetrize them using
the “refined” method. Then, for the first lexicon I estimate translation probabilities
using Maximum Likelihood (Section 3.4.5), while for the second lexicon I use the Log-
Likelihood Ratio statistic (Section 3.4.5).
Maximum Likelihood Lexicon: ML-Lex
Using the word-aligned corpus, for each source word s and each target word t, I use
maximum-likelihood estimation to compute the conditional probabilities

P(t|s) = links(s,t) / links(s)        P(s|t) = links(s,t) / links(t)

where links(s) is the number of links involving word s, links(t) is the number of links
involving word t, and links(s,t) is the number of links that connect both words. The
ML-Lex lexicon consists of these probability distributions for all words from both
languages.

(4) www.fjoch.com/GIZA++.html
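This maximum-likelihood estimation amounts to simple counting over alignment links; a minimal sketch, where the list-of-pairs representation of the links is an illustrative assumption:

```python
from collections import Counter

def ml_lexicon(links):
    """Estimate P(t|s) = links(s,t) / links(s) from a list of
    (source_word, target_word) alignment links, as in ML-Lex."""
    pair_counts = Counter(links)               # links(s,t)
    src_counts = Counter(s for s, _ in links)  # links(s)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}
```

The reverse distribution P(s|t) is obtained in the same way by counting over the target side of the links.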
Clearly, any two words linked at least once in the parallel corpus appear as an entry
in this lexicon. This property provides the higher recall that is useful in the first part
of the parallel data extraction pipeline. However, it is also a source of errors, since
automatically computed alignments inevitably exhibit a number of incorrect links. For
instance, words which occur very frequently (such as function words) tend to have many
links, and will therefore appear (incorrectly) in many entries in the lexicon; often with
unreasonably high probabilities.
Let us consider an example. I computed ML-Lex from a Romanian-English parallel
corpus of 10M words. Table 3.2 presents a few of the possible translations of the
Romanian (source) word rezervata, together with their probabilities. The source word
occurs in the corpus 118 times (so links(rezervata) = 118), and translates as reserved,
restricted or silent. The entries from the last two rows are incorrect translations (and
are therefore marked with a star); they appear in the lexicon because of alignment errors,
caused by the high frequency of occurrence of the respective English words.

Note that the last column of the table, links(t), is not used in computing the
probabilities; however, it provides an indication that the last two entries are wrong. The
number of links that the and to share with the Romanian source word is very small
when compared to their total number of occurrences.
Target translation t    links(rezervata,t)    P(t|rezervata)    links(t)
reserved                54                    0.45              174
restricted              8                     0.07              398
silent                  2                     0.02              76
*to                     9                     0.07              230178
*the                    2                     0.01              608638

Table 3.2: Example of translation probabilities from the ML-Lex lexicon. Words marked
with '*' are incorrect translations.
Log-Likelihood Ratio Lexicon: LLR-Lex
In order to obtain a higher-quality lexicon, I employ the Log-Likelihood Ratio (LLR)
statistic (Dunning 1993), which was also used by Moore (2004a; 2005) and
Melamed (2000) as a measure of word association. Generally speaking, this statistic
gives a measure of the likelihood that two samples are not independent (i.e. generated
by the same probability distribution). In this case, I use it to estimate the independence
of pairs of words which cooccur (i.e. are linked together) in my word-aligned parallel
corpus.
If source word s and target word t are independent (i.e. they are not translations of
each other), one can expect that p(t|s) = p(t|¬s), i.e. the distribution of t given that
s is present is the same as the distribution of t when s is not present. The LLR
statistic gives a measure of the likelihood of this hypothesis. The LLR score of a word
pair is low when these two distributions are very similar (i.e. the words are independent),
and high otherwise (i.e. the words are strongly associated). However, high LLR scores
can indicate either a positive association or a negative one; and one can distinguish
between them by checking whether p(t|s) > p(t|¬s) (Moore 2004a). Thus, I split the
set of cooccurring word pairs into positively and negatively associated pairs, and obtain
a measure for each of the two association types.
The set of positively associated word pairs will constitute my second lexicon, LLR-Lex.
The negatively associated pairs make up a companion lexicon, which contains probabilities
of words not being translations of each other. I will use these "negative" probabilities in
my parallel fragment detection model (Chapter 6).
The LLR score can be computed using the following formula (Dunning 1993):

LLR(t,s) = Σ_{t? ∈ {t, ¬t}}  Σ_{s? ∈ {s, ¬s}}  C(t?, s?) · log( p(t?|s?) / p(t?) )

where:

- p(t|s) = C(t,s) / C(s) defines the probability of observing word t when word s is
  present; C(t,s) is the number of links between t and s in the word-aligned corpus,
  and C(s) is the number of links containing s.
- p(t|¬s) = C(t,¬s) / C(¬s) defines the probability of observing t when s is not
  present; C(t,¬s) is the number of links containing t but not s, and C(¬s) is the
  number of links not containing s.
- the remaining quantities are defined analogously: p(¬t|s) = 1 − p(t|s),
  p(¬t|¬s) = 1 − p(t|¬s), and p(t) = C(t)/N, p(¬t) = 1 − p(t), where N is the total
  number of links.
For a more detailed discussion of the relationship between this LLR formula and the
frequencies and probabilities that arise in word-association problems, see Moore (2004b).
Unfortunately, for this particular use of the LLR scores, namely that of finding
strongly associated word pairs, one of the terms in the formula raises problems. That
term is C(¬s), the number of links that do not contain source word s, i.e. the total number
of links in the corpus minus the links that contain s. The total number of links in a
reasonably-sized parallel corpus is very large compared to the number of links that con-
tain any given word; therefore, this term has a high value, which renders all the other
terms in the LLR formula insignificant and makes all the LLR scores relatively similar.
The solution that I adopt is to pretend that the total number of links in the corpus is
much smaller. More precisely: for each source word s, let T(s) be the set of all target
words linked to s. I compute a word-specific total N_s as the number of links that
contain either s or any word in T(s). Then, when computing C(¬s), I use
C(¬s) = N_s − C(s). Note that LLR scores computed in this manner
are no longer symmetrical: LLR(t,s) ≠ LLR(s,t).
I use the LLR values to compute two conditional probability distributions:
P+(t|s), the probability that source word s translates into target word t, and
P−(t|s), the probability that s does not translate into t. I obtain these distributions
by normalizing the LLR scores for each source word.
The lexicon estimation procedure is the following:
- Compute LLR(t,s) scores for all pairs of words (t,s) which are linked at least once in
  the word-aligned corpus.
- Classify each LLR(t,s) as either LLR+(t,s) (positive association) if p(t|s) > p(t|¬s),
  or LLR−(t,s) (negative association) otherwise.
- For each s, compute the normalizing factors Σ_t LLR+(t,s) and Σ_t LLR−(t,s).
- Divide all LLR+(t,s) terms by the corresponding normalizing factors to obtain P+(t|s).
- Divide all LLR−(t,s) terms by the corresponding normalizing factors to obtain P−(t|s).
- Reverse the source and target languages and repeat the steps outlined above in
  order to obtain P+(s|t) and P−(s|t).
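The LLR score itself is often computed from the 2×2 contingency table of link counts, a form equivalent to the conditional formula above; a sketch, where n would be the word-specific total described earlier (the function and argument names are illustrative):

```python
import math

def llr(c_st, c_s, c_t, n):
    """Log-likelihood ratio for a word pair (Dunning 1993), from the 2x2
    contingency table: c_st = links containing both s and t, c_s / c_t =
    links containing s / t, n = (word-specific) total number of links."""
    k = [[c_st, c_t - c_st],                    # t with s, t without s
         [c_s - c_st, n - c_s - c_t + c_st]]    # s without t, neither
    row = [k[0][0] + k[0][1], k[1][0] + k[1][1]]
    col = [k[0][0] + k[1][0], k[0][1] + k[1][1]]
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if k[i][j] > 0:
                score += k[i][j] * math.log(k[i][j] / expected)
    return 2.0 * score
```

A pair whose observed cooccurrence matches the independence expectation scores near zero; strongly associated pairs get large scores, which are then sign-split and normalized as in the steps above.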
Table 3.3 presents the LLR-Lex probabilities for the example in the previous section.
The target words from the last two rows of the table are determined to be negatively
associated with our initial source word; they no longer appear as entries in P+(t|s),
but only in P−(t|s).

Target translation t    P(t|rezervata)    links(t)    LLR(t,rezervata)    positive?    P+/P−(t|rezervata)
reserved                0.45              174         798.91              yes          0.82
restricted              0.07              398         68.17               yes          0.07
silent                  0.02              76          18.02               yes          0.02
*to                     0.07              230178      16.61               no           0.08
*the                    0.01              608638      175.45              no           0.85

Table 3.3: Example of translation probabilities from the LLR-Lex lexicon. Words
marked with '*' are incorrect translations.

3.4.6 Analysis of Computational Complexity

In order to analyze the complexity of the algorithms involved in the parallel data extrac-
tion framework, I must first define the size of their input. The input consists of the
two monolingual corpora which make up the comparable corpus. These corpora are
structured into documents, sentences, and words. Since different algorithms operate at
different levels in this structure, I need various measures for the "size" of a corpus.
For instance, the document selection module (Section 3.4.2) operates mainly at doc-
ument level. When running queries against collections of documents, the lengths of
the documents have little impact on the running time, because the collection is indexed
beforehand. What matters is the number of documents (and of queries); therefore, the
complexity of the query process should be measured in terms of the number of doc-
uments in each collection. In contrast, the selection of candidate sentence pairs (Sec-
tion 3.4.3) operates on individual sentences, so its complexity can be measured as a
function of the number of words in the corpus.
I therefore use two measures for the size of a corpus: D, its total number of documents,
and W, the average number of words in a document. For the sake of clarity, I assume that
the source and target language corpora are similar in size. I will express the complexity
of the algorithms from my parallel data extraction system using these variables.
The first stage in the extraction pipeline, document selection, consists of several
steps. First, the target language collection needs to be indexed; this takes time linear
in the number of words from the corpus, i.e. O(DW). Then, each source document
is turned into a target language query by translating all its words; this also takes
O(DW). Lastly, each query (i.e. source language document) is compared against all target
documents, and this has complexity O(D²). Thus, the complexity of the document
selection stage is O(DW + D²).

In the subsequent stage, that of selecting candidate sentence pairs (Section 3.4.3),
each source document is paired with a constant number (TopK) of target documents.
Each source sentence from a document d is therefore paired with all the sentences from
the TopK target documents that are similar to d; and all these pairs are passed
through the word-overlap filter. For each sentence pair, the filter goes over all the words
in each sentence, checking for a correspondence in the other sentence.

The complexity of this operation is proportional to the number of words that are
analyzed. On the source language side, each word from d will be analyzed once for
each sentence in the TopK similar target documents; since TopK is a constant and the
number of sentences in a document is proportional to its length, for the whole source
language corpus this leads to a complexity of O(DW²). On the target language side, a
particular target document may be considered similar to many source documents (and
processed several times), or to no source documents. The number of words that are
analyzed depends therefore on the number of source language documents and the length
of the target language documents, and is likewise O(DW²). Thus, the complexity of the
candidate sentence pair selection stage is O(DW²) + O(DW²) = O(DW²).

All subsequent stages in the extraction pipeline operate on the sentence pairs from
the output of the word-overlap filter. Although in practice the filter discards the large
majority of the sentence pairs it analyzes, for the purposes of worst-case analysis I may
assume that it discards nothing. However, as I will describe in later chapters,
the complexity of my parallel data detection algorithms is at most the same as that of the
filter: linear in the number of words from the two sentences. Thus, the total asymptotic
complexity of the extraction framework is:

O(DW + D²) + O(DW²) = O(D² + DW²)

3.5 Summary
I presented a general framework for evaluating parallel data detection algorithms. The
quality of the data extracted by these algorithms from a comparable corpus is measured
in terms of its impact on the end-to-end performance of a Statistical Machine Translation
system. I described the general architecture of a parallel data extraction system, and
showed how the various models described in this thesis work together in the context
of this system in order to fully exploit the comparable corpus. I also showed how to
estimate a high-quality lexicon from a parallel corpus, using the Log-Likelihood Ratio
statistic.
Chapter 4
Finding Parallel Sentences in
Comparable Corpora
4.1 Overview
In this chapter I present a novel method for finding parallel sentences in comparable cor-
pora. I show how to design and train a classifier which can reliably decide whether two
sentences are translations of each other. I evaluate the method both intrinsically, by mea-
suring the precision and recall of the classifier, and extrinsically, by showing that data
extracted with this method improves the performance of an end-to-end SMT system.
This chapter is based on work previously published in the Computational Linguistics
Journal (Munteanu & Marcu 2005).
4.2 Introduction
The corpora that are of interest for this work are those whose documents, even when they
express similar content and have sentences which are translations of each other, exhibit
differences at sentence level. As an example, consider the two newspaper articles in
PEKIN, 14 oct (AFP) - Une épidémie de choléra venue de la côte
occidentale de la Corée du Nord a fait au cours des dernières
semaines une dizaine de morts à Pyongyang, ont rapporté
vendredi des visiteurs étrangers de retour de la capitale nord-
coréenne.
Les premiers cas ont été découverts dans le port de Nampo (sud-
ouest de Pyongyang), où des habitants ont affirmé avoir été
contaminés par du poisson pêché en mer, ont indiqué ces témoins.
L'agence russe Itartass avait rapporté fin septembre que ce port
avait été fermé sans explication officielle.
"A Pyongyang, les autorités ont affirmé qu'il ne s'agissait que d'une
épidémie de diarrhée, mais on a entendu dire qu'une dizaine de
personnes étaient déjà mortes du choléra dans la capitale", ont-ils
déclaré.
"Les habitants de Pyongyang nous ont conseillé de ne pas manger
de poisson et accusent les Chinois d'avoir contaminé le nord de la
Mer Jaune en rejetant à la mer les cadavres atteints de choléra",
ont ajouté ces visiteurs.
A Pékin, un responsable de l'Organisation Mondiale de la Santé
(OMS) a déclaré vendredi qu'à sa connaissance, aucun cas de
choléra n'avait été signalé dans le nord de la Chine.
Toutefois, selon des rumeurs non confirmées officiellement, un
pêcheur serait mort du choléra au mois d'août dans la région de
Beidaihe, une station balnéaire située à 250 km à l'est de Pékin,
sur les rives du golfe de Bohai.
Selon l'équipage du bateau de pêche sur lequel il travaillait, le
pêcheur aurait succombé après avoir mangé du poisson cru.
A Séoul, les services secrets sud-coréens avaient annoncé fin
septembre qu'une grave épidémie de choléra se répandait dans le
nord de la péninsule, touchant de vastes zones autour de
Pyongyang et sur la côte orientale.
Foreign travellers returning from Pyongyang said Friday that
about a dozen people had died in the North Korean capital in a
cholera epidemic that first broke out on the country's western
coast.
"The authorities in Pyongyang are saying that it's only a diarrhoea
epidemic, but we heard that about a dozen people had already
died in the city," one said.
"People living in Pyongyang advised us not to eat fish, and
accuse the Chinese of having contaminated the northern part of
the Yellow Sea by throwing cholera-tainted corpses in the water,"
the visitor said.
The first cases of cholera apparently were recorded in the port of
Nampo, southwest of Pyongyang, where residents were infected
by eating sea fish, the sources said.
The Russian news agency ITAR-TASS reported late last month
that Nampo had been closed without official explanation.
That report coincided with an announcement by the South
Korean secret service that a major outbreak of cholera had
occurred in Pyongyang and the western coast of North Korea.
Agence France Presse, French edition; Agence France Presse, English edition
Figure 4.1: Example of parallel sentences in non-parallel documents.
Figure 4.1. They have been published by the English and French editions of Agence
France Presse, and report on the same event – an epidemic of cholera in Pyongyang.
The lines in the figure connect sentence pairs which are approximate translations of
each other.
Finding the parallel sentences from such documents is hard. Sentence-alignment
algorithms (such as those described in Section 2.3), which work under the assumption
that sentence order is approximately the same in both documents, are unlikely to work
well. Since we cannot make assumptions about sentence order, we need to analyze all
possible sentence pairs, and make reliable judgments about each pair independently.
My approach is to compute, for each sentence pair in isolation, features indicative
of the degree of similarity between the sentences, and to use these features to classify
the pair as parallel or non-parallel. I describe how to compute such features, and how
to use them with a Maximum Entropy-based classifier, in Section 4.3. I then present
performance evaluation results in Section 4.4, and conclude by showing a few parallel
sentence pairs automatically detected by this approach (Section 4.5).
4.3 A Maximum Entropy Classifier for Parallel Sentence Identification
4.3.1 The Maximum Entropy Classification Framework
In the Maximum Entropy (ME) statistical modeling framework, we impose constraints
on the model of our data by defining a set of feature functions. These feature functions
emphasize properties of the data that we believe to be useful for the modeling task.
For example, for a sentence pair sp, the word overlap (the percentage of source words
that have a translation in the target sentence) might be a useful indicator of whether the
sentences are parallel. We therefore define a feature function f(sp) whose value is the
word overlap of the sentences in sp.
According to the ME principle, the optimal parametric form of the model of our
data, taking into account the constraints imposed by the feature functions, is a log-linear
combination of these functions. Thus, for our classification problem, we have:

P(c | sp) = (1/Z(sp)) · exp( Σ_i λ_{c,i} f_{c,i}(sp) )

where c is the class (c = "parallel" or c = "non-parallel"), Z(sp) is a normalization fac-
tor, and f_{c,i} are the feature functions (indexed both by class and by feature). The resulting
model has free parameters λ_{c,i}, the feature weights. The parameter values that maximize
the likelihood of a given training corpus can be computed using various optimization
algorithms (see Malouf (2002) for a comparison of such algorithms).
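A minimal sketch of this log-linear computation; the per-class weight layout and the names are illustrative choices, not the thesis implementation:

```python
import math

def me_classify(feature_values, weights):
    """P(c|sp) = exp(sum_i lambda_{c,i} * f_i(sp)) / Z(sp) for a two-class
    Maximum Entropy model; `weights` maps each class to its weight list."""
    scores = {c: math.exp(sum(w * f for w, f in zip(ws, feature_values)))
              for c, ws in weights.items()}
    z = sum(scores.values())  # normalization factor Z(sp)
    return {c: s / z for c, s in scores.items()}
```

The returned probabilities sum to one by construction; the pair is labeled "parallel" when that class receives the higher probability (or exceeds a chosen threshold).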
4.3.2 Features for Parallel Sentence Identification
For my particular classification problem, I need to find feature functions that distinguish
between parallel and non-parallel sentence pairs. For this purpose, I compute and exploit
word-level alignments between the sentences in each pair. A word alignment between
two sentences specifies translational correspondences between the individual words in
the sentences. Word alignments were first introduced in the context of statistical MT,
where they are used to estimate the parameters of a translation model (Brown et al.
1990). Since then, they have been found useful in many other NLP applications (e.g. word
sense tagging (Diab & Resnik 2002) and question answering (Echihabi & Marcu 2003)).
Figures 4.2 and 4.3 give examples of word alignments between two English-Arabic
sentence pairs from our comparable corpus. Each figure contains two alignments. The
one on the left is a correct alignment, produced by a human, while the one on the right
was computed automatically. As can be seen from the gloss next to the Arabic words,
the sentences in Figure 4.2 are parallel while the sentences in Figure 4.3 are not.
Figure 4.2: Word-level alignments
between two parallel sentences.
Figure 4.3: Word-level alignments
between two non-parallel sentences.
In a correct alignment between two non-parallel sentences, most words would have
no translation equivalents; in contrast, in an alignment between parallel sentences, most
words would be aligned. Automatically computed alignments, however, may have incor-
rect connections; for example, on the right side of Figure 4.2, Arabic word issue is linked
to the English comma; and in Figure 4.3, Arabic word at is linked to the English phrase
its case to the. Such errors are due to noisy dictionary entries and to shortcomings of the
model used to generate the alignments. Thus, merely looking at the number of uncon-
nected words, while helpful, is not discriminative enough. Still, automatically produced
alignments have certain additional characteristics that can be exploited.
I follow Brown et al. (1993) in defining the fertility of a word in an alignment as
the number of words it is connected to. The presence, in an automatically computed
alignment between a pair of sentences, of words of high fertility (such as the Arabic
word at in Figure 4.3) is indicative of non-parallelism. Most likely, these connections
were produced because of a lack of better alternatives.
Another aspect of interest is the presence of long contiguous connected spans, which
I define as pairs of bilingual substrings in which the words in one substring are connected
only to words in the other substring. Such a span may contain a few words without any
connection (a small percentage of the length of the span), but no word with a connection
outside the span. Examples of such spans can be seen in Figure 4.2: the English strings
after saudi mediation failed or to the international court of justice together with their
Arabic counterparts. Long contiguous connected spans are indicative of parallelism,
since they suggest that the two sentences have long phrases in common. And, in contrast,
long substrings whose words are all unconnected are indicative of non-parallelism.
To summarize, the classifier uses the following features, defined over two sentences
and an automatically computed alignment between them.

General features (independent of the word alignment):

- lengths of the sentences, as well as the length difference and length ratio;
- percentage of words on each side that have a translation on the other side (according
  to the lexicon).

Alignment features:

- percentage and number of words that have no connection;
- the top 3 largest fertilities;
- length of the longest contiguous connected span;
- length of the longest unconnected substring.
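Some of the alignment features can be sketched as follows; for simplicity, this illustrative version treats a connected span as strictly contiguous, whereas the definition above tolerates a few unconnected words:

```python
from collections import Counter

def alignment_features(src_len, links):
    """Sketch of alignment-based features, computed on the source side of
    an alignment given as (src_idx, tgt_idx) pairs."""
    fertility = Counter(i for i, _ in links)
    top_fertilities = sorted(fertility.values(), reverse=True)[:3]
    connected = set(fertility)
    longest_conn = longest_unconn = run_c = run_u = 0
    for i in range(src_len):
        if i in connected:
            run_c, run_u = run_c + 1, 0
        else:
            run_c, run_u = 0, run_u + 1
        longest_conn = max(longest_conn, run_c)
        longest_unconn = max(longest_unconn, run_u)
    return {
        "frac_unconnected": (src_len - len(connected)) / src_len,
        "top_fertilities": top_fertilities,
        "longest_connected_span": longest_conn,
        "longest_unconnected_substring": longest_unconn,
    }
```

In the full system these values are extracted from each of the five alignments, for both sentences of the pair.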
4.3.3 Word Alignment Model
In order to compute word alignments I need a simple and efficient model, capable of
aligning a large number of sentences in reasonable time. I also want the model to have
as few parameters as possible; preferably only word-for-word translation probabilities.
One such model is IBM Model 1 (Brown et al. 1993). According to this model,
given a foreign sentence f = f_1 ... f_m, an English sentence e = e_1 ... e_n, and translation
probabilities t(f | e), the best alignment is obtained by linking each foreign word f_j to
its most likely English translation, argmax_{e_i} t(f_j | e_i). Thus, each foreign word is aligned
to exactly one English word (or to a special NULL token).
Due to its simplicity, this model has many shortcomings (see Moore (2004a) for a
discussion). Thus, I augment it with two simple heuristics, which attempt to alleviate
some of these shortcomings.
One heuristic concerns English words which appear more than once in a sentence.
According to the model, a foreign word which prefers to be aligned with such an English
word could equally well be aligned with any instance of that word. In such situations,
instead of arbitrarily choosing the first instance or a random instance, I attempt to make
a ”smarter” decision. First, I create links only for those English words which appear
exactly once; next, for words which appear more than once, I choose connections that
minimize the number of crossings with already existing links.
The second heuristic attempts to improve the choice of the most likely English trans-
lation of a foreign word. The translation probabilities are automatically learned from
parallel data, and I learn values for both t(f | e) and t(e | f). I can therefore pick the
most likely English translation of foreign word f to be argmax_e t(f | e) · t(e | f). Using both sets
of probabilities is likely to help me make a better informed decision.
Using this alignment strategy, I follow Och and Ney (2003) and compute one align-
ment for each translation direction (foreign-to-English and English-to-foreign), and then
combine them. Och and Ney present three combination methods: intersection, union,
and refined, which is a form of intersection expanded with certain additional neighboring
links; I use all of them.
Thus, for each sentence pair I compute five alignments (two modified-IBM-Model-1
plus three combinations), and extract one set of general features and five sets of align-
ment features (as described in the previous section).
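The core greedy alignment step, without the two heuristics, can be sketched as follows; the probability-table interface is an illustrative assumption:

```python
def greedy_align(src, tgt, t_prob):
    """Link each source word to its most probable target word, IBM-Model-1
    style; a word with no known translation stays unaligned (NULL)."""
    links = []
    for i, s in enumerate(src):
        best_p, best_j = max((t_prob.get((s, t), 0.0), j)
                             for j, t in enumerate(tgt))
        if best_p > 0.0:
            links.append((i, best_j))
    return links
```

The two directed alignments produced this way are then combined by intersection, union, and the refined method, yielding the five alignments mentioned above.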
4.3.4 Training and Testing
I create training instances for the classifier from a small parallel corpus. The simplest
way to obtain classifier training data from a parallel corpus is to generate all possible
sentence pairs from the corpus (cartesian product). This generates n² training instances
(where n is the number of sentences in the corpus), out of which n are positive (belong
to the class "parallel") and the rest are negative.
One drawback of this approach is that the resulting training set is very imbalanced,
i.e. it has many more negative examples than positive ones. Classifiers trained on such
data do not achieve good performance; they generally tend to predict the majority class,
i.e. they classify most sentences as non-parallel (which has indeed been the case in
my experiments). My solution to this is to downsample, i.e. eliminate a number of
(randomly selected) negative instances.
Another issue is that a lot of the sentence pairs from the cartesian product have low
word overlap (i.e. most of their words have no translation, according to our lexicon).
As explained in Section 3.4.3 (and shown in Figure 3.2), when we extract data from
a comparable corpus, all such pairs are filtered out; they never get to be classified. I
therefore exclude them from the classifier training data by applying the word-overlap
filter (Section 3.4.3) to the pairs from the cartesian product. Actually, by training only on
sentence pairs with many words in common, the classifier learns to make more refined
distinctions, which cannot be made based on word overlap alone.
To summarize, I prepare the classifier training set in the following manner: starting
from the parallel corpus, I generate the cartesian product; I discard the pairs that do not
fulfill the conditions of the word-overlap filter; if the resulting set is imbalanced, i.e.
the ratio of non-parallel to parallel pairs is greater than five, I balance it by removing
randomly chosen non-parallel pairs; I then compute word alignments and extract feature
values.
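The preparation steps can be sketched as follows; the filter callback, and the assumption that the i-th source and i-th target sentences form the parallel (positive) pairs, are illustrative:

```python
import itertools
import random

def make_training_pairs(src_sents, tgt_sents, passes_filter, max_ratio=5):
    """Cartesian product -> word-overlap filter -> downsampling, as in the
    training-set preparation described above. Positive pairs are assumed
    to be the i-th source sentence with the i-th target sentence."""
    pos, neg = [], []
    for i, j in itertools.product(range(len(src_sents)), range(len(tgt_sents))):
        if passes_filter(src_sents[i], tgt_sents[j]):
            (pos if i == j else neg).append((src_sents[i], tgt_sents[j]))
    # balance: keep at most max_ratio negatives per positive
    if pos and len(neg) > max_ratio * len(pos):
        neg = random.sample(neg, max_ratio * len(pos))
    return pos, neg
```

Word alignments and feature values would then be computed for every surviving pair.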
Using the training set, I compute values for the classifier feature weights using the
YASMET (1) implementation of the GIS algorithm (Darroch & Ratcliff 1974). Since I am
dealing with few parameters and have sufficiently many training instances, using more
advanced training algorithms is unlikely to bring significant improvements.
I can test the performance of the classifier by generating test instances from a differ-
ent parallel corpus, and checking how many of these instances are correctly classified. I
prepare the test set in a manner similar to the training set: I create the cartesian product
of the sentences in the test parallel corpus, and apply the word-overlap filter (I do not
perform any balancing of the test set). Conceptually, this can be regarded as a two-stage
classification process: all pairs discarded by the filter are classified as ”non-parallel”,
and for the rest I obtain predictions from the classifier.
(1) http://www.fjoch.com/YASMET.html
4.3.5 Algorithmic Complexity
As discussed in Section 3.4.6, I will show that the complexity of my parallel sentence
detection algorithms is linear in the number of words in the two sentences that are being
classified. In order to analyze this complexity, it is useful to break down the classification
process into several stages, as follows:

- Compute two word alignments, source-target and target-source. Since the align-
  ment algorithm is essentially greedy (Section 4.3.3), aligning each word to its best
  correspondent according to the lexicon, the complexity of this step is linear in the
  number of words from the two sentences.
- Combine the alignments, computing their union, intersection, and "refined" com-
  bination. Each of these can be computed in time linear in the number of links from
  the two original alignments; and since the original alignments are one-to-many,
  the number of links is of the same order as the number of words.
- Extract features from the alignments. The values of all the features described in
  Section 4.3.2 can be computed by traversing each of the sentences sequentially;
  therefore, the complexity is again linear in the number of words of both sentences.
- Perform the classification: compute the Maximum Entropy probability of the sen-
  tence pair according to the feature values and the (previously trained) feature
  weights. This operation takes constant time, since its complexity depends only on
  the number of features, which is constant relative to input size.
Thus, the complexity of the classification process is linear in the number of words
from the two sentences. Since this complexity is the same as that of the word-overlap
filter that is applied before classification in the extraction pipeline (Section 3.4.1), adding
the parallel sentence detection stage does not change the asymptotic behavior of the data
extraction system. The system's complexity is O(D² + DW²), where D is the number
of documents in one input corpus, and W is the average number of words per document.
4.4 Experiments
4.4.1 Introduction
I evaluate my parallel sentence detection approach in two ways. The intrinsic evaluation
(Section 4.4.2) measures the precision and recall of the classifier on a test set generated
from parallel data (as described in Section 4.3.4). For the second, extrinsic evaluation
(Section 4.4.3), I use the classifier to extract parallel sentences from my Gigaword com-
parable corpora (Section 3.3.2) and measure the impact of the extracted data on MT
performance. As explained in Section 3.2, each experiment is defined by the initial par-
allel corpus used to compute the lexicon, and is replicated with several initial parallel
corpora of various sizes.
4.4.2 Classifier Evaluation
There are three factors that might influence a classifier’s performance: language pair,
lexicon coverage and similarity between the domains of the training and test instances.
I perform evaluation experiments to account for all these factors.
I account for the language factor by experimenting with two language pairs: Arabic-
English and Chinese-English. Arabic and Chinese are sufficiently different from each
other (and from English) to ensure that the results are indeed language-independent.
The lexicons used in the experiments are automatically learned from the initial parallel
corpus; thus, as I vary the size of that corpus, I also vary the lexicon coverage. I use
initial corpora of sizes 10k, 100k, 1M, and 100M English tokens; for Chinese-English,
I also use a dictionary (see Section 3.3).
I measure the impact of the domain similarity by generating training instances from
parallel corpora that come from different domains. I have available parallel MT training
data from two different genres: news, and political discourse (translated proceedings of
the United Nations). Thus, I generate one classifier training set from news data and one
from political data (further referred to as UN data), and a common test set from news
data. The data used to generate classifier training and test sets is the same for all initial
parallel corpora, and is distinct from all of them.
In summary, for each language pair, I use the following corpora:
% several initial parallel corpora of various sizes, used for learning lexicons
% one UN-domain parallel corpus for generating classifier training examples
% one news-domain parallel corpus for generating classifier training examples
% one news-domain parallel corpus used to generate classifier test data
From each initial corpus I compute a lexicon. I then take the classifier training
and test corpora and, using the method described in the previous section, create two
sets of training instances and one set of test instances. I train two classifiers (one on
each training set) and evaluate both of them on the test set. The parallel corpora used for
generating training and test instances have around 5k sentence pairs each (approximately
150k English tokens), and generate around 10k training instances (for each training set)
and 8k test instances.
The performance of the classification process can be measured by computing its
precision and recall. Precision is the ratio of sentence pairs correctly identified as par-
allel to the total number of pairs judged as parallel by the classifier. Recall is the ratio
of sentence pairs correctly identified as parallel by the classifier to the total number
of truly parallel pairs, i.e. the number of pairs in the parallel corpus used to gener-
ate the test instances. Both numbers are expressed as percentages. More formally:
let N_judged be the total number of sentence pairs from the test set that the
classifier judged as parallel; N_correct be the number of pairs that the classifier
correctly judged as parallel; and N_true be the total number of parallel pairs in
the test set. Then:

Precision = 100 * N_correct / N_judged
Recall = 100 * N_correct / N_true
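In code, these two measures amount to simple set arithmetic (a minimal sketch; the set-of-pair-ids representation and the function name are my own, not the thesis implementation):

```python
def precision_recall(judged_parallel, truly_parallel):
    """Compute classifier precision and recall, as percentages.

    judged_parallel: set of sentence-pair ids the classifier labeled parallel.
    truly_parallel:  set of sentence-pair ids that are actually parallel.
    """
    correct = judged_parallel & truly_parallel
    precision = 100.0 * len(correct) / len(judged_parallel)
    recall = 100.0 * len(correct) / len(truly_parallel)
    return precision, recall

# Hypothetical example: 4 pairs judged parallel, 3 of them correct,
# 5 truly parallel pairs in the test set.
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6})
# p = 75.0, r = 60.0
```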
Figures 4.4 and 4.5 show the recall and precision of the classifiers, for Arabic-
English and Chinese-English respectively. The results show that the precision of the
classifier is robust with respect to lexicon coverage and training domain. Even when
starting from a very small initial parallel corpus, the classifier can attain high preci-
sion. Having a larger lexicon and training data from the right domain does help though,
mainly with respect to recall.
Figure 4.4: Performance of the Arabic-English parallel sentence classifiers.
Figure 4.5: Performance of the Chinese-English parallel sentence classifiers.
The classifiers achieve high precision because their positive training examples are
clean parallel sentence pairs, with high word overlap (since the pairs with low over-
lap are filtered out); thus, the classification decision frontier is pushed toward
"good-looking" alignments. Training on UN-domain data yields higher precision and lower
recall because those translations are more literal, and therefore the classifier learns to
accept only such sentence pairs.
The low recall results are mostly due to the word-overlap filter (the first stage of the
classification process) which discards many parallel pairs. The loss in recall is bigger
for smaller initial corpora, because the lexicons trained from those corpora have very
low coverage. If the filter were not applied, the recall results would increase by
about 20% (with very little loss in precision). However, as explained in Section 3.4.3,
the filter plays an important role in keeping the extraction pipeline robust and efficient,
by filtering out a very large number of mostly non-parallel sentence pairs.
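The overlap filter described above can be sketched as follows (an illustrative reconstruction; the dict-of-sets lexicon representation and the function name are my own simplifications, not the actual system):

```python
def passes_overlap_filter(src_words, tgt_words, lexicon, word_overlap=0.5):
    """Word-overlap filter sketch: keep a sentence pair only if at least a
    `word_overlap` fraction of the words in EACH sentence have a translation
    in the other, according to `lexicon` (here a simple dict mapping a source
    word to a set of possible target translations)."""
    tgt_set = set(tgt_words)
    # A source word is covered if any of its translations occurs in the target.
    src_covered = sum(1 for s in src_words if lexicon.get(s, set()) & tgt_set)
    # A target word is covered if some source word translates to it.
    covered_tgts = set()
    for s in src_words:
        covered_tgts |= lexicon.get(s, set()) & tgt_set
    tgt_covered = sum(1 for t in tgt_words if t in covered_tgts)
    return (src_covered >= word_overlap * len(src_words)
            and tgt_covered >= word_overlap * len(tgt_words))
```

With a small, precise hand dictionary, lowering `word_overlap` (e.g. to 0.25, as discussed below for Chinese-English) lets more good pairs through.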
It is worthwhile to mention the (relatively high) performance of the classifier that uses
the Chinese-English dictionary (the “Dict” column in Figure 4.5). Unlike the lexicons
computed automatically from parallel corpora, a dictionary has relatively few, but pre-
cise entries; this has a strong impact on the behavior of the word-overlap filter. For
all my experiments, the filter’s parameter (WordOverlap, described in Section 3.4.3)
requires that at least 50% of the words in each sentence have a translation in the other.
However, when using the dictionary, very few (less than 1%) of the parallel sentences
from the classifier training or test sets satisfied this condition. I therefore lowered the
parameter value to 25%, which allowed more than half of the good pairs to pass through
the filter, and yielded the results shown in Figure 4.5.
Classifier evaluations using different subsets of features show that most of the clas-
sifier performance comes from the general features together with the alignment features
concerning the percentage and number of words that have no connection. However, I
expect that in real data, the differences between parallel and non-parallel pairs are less
clear than in my test data, and can no longer be accounted for only by counting the
linked words; thus, the other features should become more important.
4.4.3 Machine Translation Improvements
I apply the news-trained classifiers in the context of the extraction system described in
Section 3.4, and obtain additional parallel data from the Gigaword comparable corpora
(Section 3.3.2). Table 4.1 shows the amounts of data extracted for each language pair
and initial corpus size, measured in million English tokens.
Size of initial      Size of automatically extracted corpora
parallel corpus      Arabic-English    Chinese-English
Dictionary                 -                75M
10k                       1M                25M
100k                     14M                76M
1M                       50M                59M
100M                     42M                41M
Table 4.1: Amounts of data extracted by the parallel sentence detection algorithm from
the Gigaword corpora, measured in number of English tokens.
It is interesting to note that for Chinese-English, smaller initial corpora (like the
100k one) yield more data than larger ones (1M, or 100M). The relationship between
the size of the initial corpus and the amount of data extracted from a comparable corpus
is defined by two factors: the coverage of the lexicon computed from the initial corpus,
and the precision and recall of the classifier. The lexicon determines the behavior of
the word-overlap filter (Section 3.4.3): the smaller the lexicon, the fewer sentence pairs
pass through the filter. The classifier performance indicates how many of these pairs are
produced as output; in particular, lower precision results in more extracted data.
Thus, the only reason why the 100k initial corpus would yield more data than the
100M one is that the 100k-trained classifier has lower precision: 79%, as opposed to
92% for the 100M-trained classifier (see Figure 4.5). Clearly, the lower classifier preci-
sion is more important than the lower lexicon coverage, so the system ends up producing
more data. To test this hypothesis, I performed Chinese-English extraction experiments
using the UN-trained classifiers, which have roughly the same recall as the news-trained
ones, but much higher precision (see Figure 4.5). The resulting amounts of extracted
data were correlated with the size of the initial parallel corpus, indicating that classifier
precision is indeed the cause of this phenomenon.
In order to evaluate the quality of the extracted data I train, for each initial parallel
corpus, a Baseline system on that parallel data, and a PlusSentences system on the initial
corpus plus the data extracted using that corpus. Figures 4.6 and 4.7 present the perfor-
mance of these systems. The statistical significance intervals (computed using bootstrap
Figure 4.6: Arabic-English MT performance improvements using parallel sentences automatically extracted from the Gigaword corpus.
Figure 4.7: Chinese-English MT performance improvements using parallel sentences automatically extracted from the Gigaword corpus.
resampling (Koehn 2004)) are shown for each data point. For convenience, the table
below the graph also shows (in the last row) the amounts of additional extracted data
used by each PlusSentences system, measured in million English tokens; these are the
same numbers from Table 4.1.
As the results show, the automatically extracted additional training data yields signif-
icant improvements in performance, regardless of the size of the initial parallel corpus;
the only exception is Chinese-English, initial corpus of 100M words, for which the addi-
tional data does not help. The improvement obtained from the extracted data is largest
for the 100k initial parallel corpus, and then decreases as the size of the initial corpus
increases. This seems to indicate that the data extracted using the 10k-word initial
corpus is of relatively low quality; in order to get really good data, a larger initial parallel
corpus is necessary.
4.5 Examples
I conclude the description of my method by presenting a few sentence pairs extracted by
my system. I chose the examples by randomly looking for cases where a given foreign
sentence was judged parallel to several different English sentences. Figures 4.8 and 4.9
show the foreign sentence in Arabic and Chinese respectively, followed by a human-
produced translation in italic and bold font, followed by the automatically extracted
matching English sentences in normal font. Each matching English sentence is preceded
by the probability associated by the classifier to the hypothesis that the English sentence
is parallel to the foreign one. The sentences are picked from the datasets extracted using
the 100M initial parallel corpus.
Figure 4.8: Examples of automatically extracted Arabic-English parallel sentence pairs.
The examples reveal the two main types of errors. The first type concerns cases when
the system classifies as parallel sentence pairs that, although they share many content
words, express slightly different meanings - as in Figure 4.8, example 1. The second
concerns pairs in which the two sentences convey different amounts of information.
In such pairs, one of the sentences contains a translation of the other, plus additional
phrases (Figure 4.8, example 4). These errors are caused by the noise present in the
Figure 4.9: Examples of automatically extracted Chinese-English parallel sentence
pairs.
automatically learned dictionaries, and by the use of a weak word alignment model for
extracting the classifier features.
The examples also show that parallel sentence pairs which are extracted automat-
ically are not always good mutual translations (such as Figure 4.8 example 2, or Fig-
ure 4.9 example 4). Although this does not seem to be a big problem for SMT, other
applications might require parallel data of higher quality. Such data can be obtained by
cleaning up the output of the parallel sentence classifier in two simple ways. First, one
can choose only sentence pairs that were classified as parallel with confidence higher
than a particular threshold (the confidence values returned by the Maximum Entropy
classifier range between 0.5 and 1). Second, in cases when a source sentence was
deemed parallel with several different target ones (as is the case in all these examples),
one can keep only the highest-confidence target translation.
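The two clean-up steps can be sketched together (a hypothetical helper; the triple layout and the 0.7 threshold are illustrative, chosen within the 0.5–1 confidence range mentioned above):

```python
def clean_extracted_pairs(pairs, min_conf=0.7):
    """Clean up classifier output in two steps, as described in the text:
    (1) drop pairs whose classifier confidence is below `min_conf`;
    (2) for each source sentence, keep only the highest-confidence target.

    `pairs` is a list of (source, target, confidence) triples."""
    best = {}
    for src, tgt, conf in pairs:
        if conf < min_conf:
            continue  # step 1: confidence threshold
        if src not in best or conf > best[src][1]:
            best[src] = (tgt, conf)  # step 2: keep the best target so far
    return [(src, tgt, conf) for src, (tgt, conf) in best.items()]
```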
4.6 Summary
I have shown how sentence pairs can be reliably classified as parallel or non-parallel
using features extracted from a simple word alignment between the two sentences. I
have shown that data extracted with this method helps improve the performance of an
end-to-end Statistical Machine Translation system, under a variety of experimental con-
ditions. In particular, I’ve demonstrated that starting with a very small parallel corpus,
or even a dictionary, I can exploit a large comparable corpus in order to obtain a good-
quality SMT system.
Chapter 5
Finding Parallel Documents in
Comparable Corpora
5.1 Overview
In this chapter I describe a new approach for finding parallel documents in comparable
corpora. By using my parallel sentence detection algorithm (described in Chapter 4), I
create sentence-level alignments between document pairs, and use these alignments to
find the pairs which are translations of each other. I evaluate the method both intrin-
sically, by measuring its precision and recall on a human-annotated gold standard, and
extrinsically, by using it to find parallel document pairs in the Gigaword comparable
corpus. I show that by extending the parallel sentence detection method to parallel doc-
uments, I can extract more, and higher-quality, data from the comparable corpus.
5.2 Introduction
Although the parallel sentence classifier (Chapter 4) is appropriate for extracting data
from non-parallel documents, it is sub-optimal when the documents are actually literal
translations of each other. Any approach that looks at sentences in isolation is likely
to miss some sentence pairs (due, for example, to poor lexicon coverage), or to pro-
pose false associations; errors which might be avoided in the presence of information
about document-level parallelism. A better way to process documents which are literal
translations of each other is to apply sentence alignment algorithms (such as Gale and
Church (1991) or Moore (2002); see Section 2.3 for a more detailed discussion). Such
algorithms have been shown to perform with high precision, find most of the correct sen-
tence correspondences, and produce not only one-to-one sentence alignments, but also
many-to-many. Thus, the ability to detect parallelism at document level would enable a
better exploitation of the parallel documents, and of the comparable corpus.
It is important to consider as parallel only document pairs which are literal translations
of each other and, in particular, have similar sentence order; only such pairs will yield
good-quality parallel data. Current document matching algorithms, that look for word-
level correspondences (such as Zhao and Vogel (2002b) or Utiyama and Isahara (2003),
discussed in Section 2.4.1), are therefore unlikely to have sufficient precision. My
approach is to perform a sentence-level analysis, by using my parallel sentence detec-
tion algorithm to find translated sentences between the two documents. I regard those
sentence pairs as links between their respective documents; by analyzing the number
and ordering of the links, I can infer whether two documents are parallel or not.
The assumption is that if two documents are parallel, I will be able to link (i.e. clas-
sify as parallel) many of their sentences; moreover, the links will be mostly monotone
(i.e. will not cross each other). Conversely, if the documents are not parallel, I should not
find too many translated sentence pairs; and the links that I do find will be more likely
to cross each other. These assumptions are exemplified in Figure 5.1, which shows two
examples of documents with sentence-level links. The documents on the left side of the
figure are literal translations of each other, while the documents on the right side are
only partial translations (they contain some common content, but each document also
has non-translated parts). The links drawn with thick lines are the ones produced by my
sentence analysis module; the thin lines are “true” links, produced by a human.
Figure 5.1: Example of sentence-level alignments between parallel and non-parallel
documents. The thin lines are true links (produced by a human), while the thick ones
are produced automatically.
5.3 Finding Parallel Documents
In order to determine whether two documents are parallel, I consider all possible sen-
tence pairs from the documents, and classify them as parallel or non-parallel. By linking
the sentences judged to be parallel, I obtain an alignment between the documents. I then
verify certain conditions of the documents and their alignment; if they are satisfied, I
consider the documents to be parallel.
The conditions are as follows:
% The difference between the lengths of the documents (measured in number of
sentences) should not be too big; in my experiments, I require that it is smaller
than 25% of the length of each document.
% At least a certain percentage of the sentences in each document should be aligned
(this parameter is set to 30% in my experiments).
% At least a certain percentage of the links in the alignment should be monotone.
The thresholds were tuned on a development corpus; however, experimental results indi-
cate that the system's performance is robust to changes of these values.
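The three conditions can be sketched as a single check (an illustrative reconstruction; the parameter names are mine, and the monotone threshold value, which is not recoverable from the text, is a placeholder):

```python
def documents_parallel(len_src, len_tgt, links,
                       max_len_diff=0.25, min_aligned=0.30, min_monotone=0.90):
    """Sketch of the three document-level parallelism conditions.
    `links` is a list of (i, j) sentence-index pairs judged parallel.
    The 25% and 30% values come from the text; min_monotone is a placeholder."""
    # 1. Document lengths (in sentences) must not differ too much.
    diff = abs(len_src - len_tgt)
    if diff > max_len_diff * len_src or diff > max_len_diff * len_tgt:
        return False
    # 2. Enough sentences on each side must be aligned.
    src_aligned = len({i for i, _ in links})
    tgt_aligned = len({j for _, j in links})
    if (src_aligned < min_aligned * len_src or
            tgt_aligned < min_aligned * len_tgt):
        return False
    # 3. Most links must be monotone (not crossing the previous kept link).
    monotone, prev_j = 0, -1
    for _, j in sorted(links):
        if j > prev_j:
            monotone += 1
            prev_j = j
    return monotone >= min_monotone * len(links)
```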
When applied in the context of my parallel data extraction system, this method is a
natural extension of the parallel sentence detection algorithm. The document selection
step (Section 3.4.2) provides a list of "similar" document pairs; the subsequent steps
analyze all their sentence pairs, and determine which are parallel. The document align-
ments are therefore available, and I only need to verify the conditions (as described
above) and output the parallel documents.
5.4 Experiments
5.4.1 Introduction
I evaluate my parallel document detection method both intrinsically, by running it
against a human-annotated gold standard of document pairs, and extrinsically, by run-
ning it on my comparable corpora (Section 3.3.2) and using the resulting data as addi-
tional MT training data.
For the extrinsic evaluation, I measure the MT performance impact of two datasets:
first the data obtained only from the parallel document pairs, and then the data obtained
both from parallel sentences (the PlusSentences corpus, Chapter 4) and parallel docu-
ments. The first dataset will indicate the value of the parallel document detection algo-
rithm, while the second will show how well I am able to mine the comparable corpus.
5.4.2 Evaluating Against a Gold Standard
In this section I evaluate the performance of the parallel document detection algorithm
on a human-annotated gold standard of document pairs. I compare the precision and
recall of the method with that of several baselines that use an information-retrieval-
based approach.
The baseline approach is very similar to the document selection step employed in my
extraction system (Section 3.4.2). I turn each source language document into a query by
translating each of its words independently, run the query using the Lemur IR toolkit,
and obtain the most similar target document, together with a similarity score. I consider
the two documents to be parallel if the score is above a certain threshold. By vary-
ing the threshold, I can trade off between precision and recall; thus, I will present the
baseline performance in the form of a precision-versus-recall curve. I compute three dif-
ferent baselines, which make use of different query methods implemented in the Lemur
toolkit: TF-IDF [1], Okapi (Walker et al. 1998) and LM-KL, which is a language modeling
algorithm based on KL-divergence (Cover & Thomas 1991).
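Tracing the baselines' precision-versus-recall curves by sweeping the threshold can be sketched generically (the (score, label) data layout is illustrative and independent of the Lemur toolkit):

```python
def precision_recall_curve(scored_pairs):
    """Sweep the similarity threshold over the baseline's scores to trace a
    precision/recall curve. `scored_pairs` is a list of (score, is_parallel)
    tuples, one per candidate document pair. Returns (threshold, precision %,
    recall %) triples, one per distinct cutoff."""
    scored_pairs = sorted(scored_pairs, reverse=True)  # highest score first
    total_true = sum(1 for _, y in scored_pairs if y)
    curve, correct = [], 0
    for k, (score, y) in enumerate(scored_pairs, start=1):
        correct += y  # pairs above this threshold that are truly parallel
        curve.append((score, 100.0 * correct / k, 100.0 * correct / total_true))
    return curve
```

Lowering the threshold admits more pairs, trading precision for recall; plotting all such points yields the baseline curves of Figure 5.2.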
The test data consists of a Romanian-English document collection, containing news
articles downloaded from the web sites of Romanian newspaper Ziua. In this collection,
the articles from the English-language edition are either translations of Romanian ones,
or summaries of one or more Romanian articles. The gold standard was created by
analyzing all documents published in one month, and finding those pairs which are
mutual translations. It contains 192 English documents and 2000 Romanian ones, out
of which 131 pairs are parallel.
[1] http://www.cs.cmu.edu/ lemur/tfidf.ps
In order to obtain the document alignments, I run my extraction system using two
different initial parallel corpora. The first one consists of 1M English tokens, which
is the training data for the Romanian-English word alignment task from the Workshop
on Building and Using Parallel Corpora [2]. The second contains, in addition, Romanian
translations of the European Union law documents, which I mined from the Web. Its
size is about 10M English tokens.
The precision and recall results are presented in Figure 5.2. My method (represented
in the figure by a single dot) has precision and recall close to 90%, out-performing all
baselines. Its performance is high even when the initial parallel corpus is relatively
small, which indicates that the method is robust to low lexicon coverage. Note that the
baseline performance is somewhat over-estimated, because in this experiment I look
at all possible thresholds, which cannot be done in practice.
Figure 5.2: Performance of the parallel document detection method against a gold standard.
[2] http://www.statmt.org/wpt05/
5.4.3 Machine Translation Improvements
This section presents the results obtained by applying the parallel document detection
method on the Gigaword corpora. The experiments are designed to address two ques-
tions: how useful is the data from the parallel documents found by my method, and how
much do I gain by adding parallel document detection on top of the parallel sentence
detection algorithm.
To answer the first question, I create the PlusDocuments parallel corpus, consisting
of data obtained from the automatically found parallel documents. In order to obtain
parallel data from the document pairs I run a sentence alignment algorithm similar to that
of Moore (2002). For the second question, I add to this corpus a subset of the sentences
produced by the parallel sentence detection algorithm, namely those that originate from
document pairs that were determined to be non-parallel. I thus obtain the PlusSD corpus,
which consists of the best data that I can extract from the comparable corpus.
Table 5.1 contains information about the sizes of these corpora. As explained in
Section 3.2, all extraction experiments are replicated using initial parallel corpora of
various sizes. For each language pair and initial corpus size, the table shows the number
of document pairs identified as parallel, the amount of parallel data obtained from them
by sentence-alignment, and the gain obtained by exploiting parallel documents in addi-
tion to extracting parallel sentences (i.e. the difference in size between the PlusSD and
PlusSentences corpora presented in Section 4.4.3). The amounts of data are measured in
million English tokens. As the results show, identifying parallel documents consistently
helps obtain more data from the comparable corpus.
When the initial parallel corpus is very small (10k words), few document pairs are
identified as being parallel, because only few sentences are classified parallel. However,
as the initial corpus (and therefore the lexicon coverage) grows, more and more parallel
documents are found. The last two rows indicate that with as little as 1M words of seed
data, almost all parallel documents can be identified. This agrees with the results from
Figure 5.2, which show that the performance of the method is about the same when
starting with either 1M words or 100M words of initial parallel data.
Size of initial      Size of automatically extracted corpora
parallel corpus      Arabic-English               Chinese-English
                     Doc pairs  Words   Gain      Doc pairs  Words   Gain
Dictionary               -        -       -          20380     3M      1M
10k                    604     0.1M   0.06M           1530   0.5M    0.2M
100k                 27334       6M      3M          14288     4M      1M
1M                   92918      20M      8M          21896     5M      2M
100M                 98967      21M      8M          25605     6M      3M
Table 5.1: Amounts of data extracted by the parallel document detection algorithm from
the Gigaword corpora. For each language pair, the table presents the number of paral-
lel document pairs found, the amount of parallel data extracted from them (measured
in number of English tokens), and the extra amount of parallel data gained by doing
parallel document identification versus simply looking for sentence pairs in isolation
(Chapter 4).
Figures 5.3 and 5.4 present the performance of the MT systems using the PlusDoc-
uments corpus; the last row of the table below the graphs contains the extra amount of
data (in addition to the initial corpus) used in the respective PlusDocuments system (and
are the same as in Table 5.1). The improvements are similar to those obtained using
Figure 5.3: Arabic-English MT performance improvements using parallel documents automatically extracted from the Gigaword corpus.
Figure 5.4: Chinese-English MT performance improvements using parallel documents automatically extracted from the Gigaword corpus.
the automatically extracted parallel sentences (the PlusSentences systems, Figures 4.6
and 4.7); for all experimental conditions except Chinese-English 100M, the additional
data brings significant MT improvements. One important aspect is that the amounts of
additional data used in the PlusDocuments systems are much smaller than those from
the PlusSentences systems; however, the two sets of systems have very similar perfor-
mance. This indicates that data from (automatically detected) parallel documents is of
higher quality than that obtained by the parallel sentence detection algorithm.
Figures 5.5 and 5.6 present the performance of the MT systems using the PlusSD
corpus. For convenience, the table below the graph also shows the scores of the respec-
tive PlusDocuments and PlusSentences systems, as well as the extra amount of data
(i.e. in addition to the initial corpus) used for training the PlusSD system. Using both
sentences and documents yields improvements for the Arabic-English systems, but not
for the Chinese-English ones. One possible reason for this is that the Chinese-English
collection yielded far fewer parallel documents than the Arabic-English one.
5.5 Summary
I presented a simple and efficient method for finding parallel documents in comparable
collections. By analyzing document pairs at sentence level (instead of word level) I
am able to identify, with high precision and recall, pairs which are literal translations
of each other. I have shown that documents identified with this approach yield high-
quality parallel data; in addition, by making use of document-level parallelism I can
extract more useful data from a comparable corpus.
Figure 5.5: Arabic-English MT performance improvements using parallel documents and sentences automatically extracted from the Gigaword corpus.
Figure 5.6: Chinese-English MT performance improvements using parallel documents and sentences automatically extracted from the Gigaword corpus.
Chapter 6
Finding Parallel Sub-sentential
Fragments in Comparable Corpora
6.1 Overview
In this chapter I present a novel method for finding parallel fragments in comparable
corpora. By analyzing sentence pairs using a signal processing-inspired approach, I
detect which segments of the source language sentence are translated into segments in
the target language sentence, and which are not. This allows me to extract parallel data
even from sentence pairs which are not translations of each other. This chapter is based
on work presented at COLING/ACL 2006 (Munteanu & Marcu 2006).
6.2 Introduction
One limitation of the methods presented in previous chapters is that they require the
presence of parallel sentences in the comparable corpus. The large majority of compa-
rable corpora are likely to have few or no such sentence pairs. As an example, consider
Figure 6.1, which is the same one we presented in Chapter 1. The documents in the
Figure 6.1: Example of parallel fragments in non-parallel documents.
figure report on the same event, have been published almost at the same time, and yet
have no parallel sentences. Still, as the lines and boxes from the figure show, translated
fragments of data do exist; but the parallelism is present at the sub-sentential level.
The goal of this work is to detect such parallel fragments. Figure 6.2 illustrates the
problem more clearly. It shows two sentences taken from the articles in Figure 6.1,
and highlights and connects their parallel fragments. Although both sentences express
the same message, each of them has content which is not translated on the other side.
The English phrase reports the BBC’s Helen Fawkes in Kiev, as well as the Romanian
one De altfel, vorbind inaintea aniversarii (which means Besides, speaking before the
anniversary) have no translation correspondent, either in the other sentence or anywhere
in the whole document. Since the sentence pair contains so much untranslated text, it is
unlikely that any parallel sentence detection method would consider it useful. And, even
if the sentences were used for MT training, considering the amount of noise they
contain, they might do more harm than good for the system’s performance. The best
way to make use of this sentence pair is to extract and use for training just the translated
(highlighted) fragments.
Figure 6.2: Example of parallel fragments in non-parallel sentences.
I describe, in Section 6.3, an approach for finding such parallel fragments. I
then present experimental results (Section 6.4) which show that such fragments help
improve MT performance, and even outperform automatically extracted parallel sen-
tences (obtained with the algorithm presented in Chapter 4) on a noisy comparable cor-
pus. I conclude by showing a few example fragments extracted by the method, which
highlight some of the strengths and weaknesses of the approach.
6.3 Finding Parallel Sub-Sentential Fragments
6.3.1 Approach
The aim of this method is to distinguish between source fragments that have a transla-
tion on the target side, and fragments that do not, using the information available in the
lexicon. As an example, consider Figure 6.3. It shows the same sentence pair as Fig-
ure 6.2, in which I underlined those words of each sentence which have a translation in
the other sentence, according to my lexicon. Thus, boldface indicates true translations,
while underline indicates translation according to the lexicon. The phrases “to focus
on the past year’s achievements, which,” and “sa se concentreze pe succesele anului
trecut, care,” are mostly underlined (the lexicon is unaware of the fact that “achieve-
ments” and “succesele” are in fact translations of each other). The rest of the sentences
are mostly not underlined, although there are occasional connections, some correct and
some wrong. The best that can be done in this case is to infer that these two phrases are
parallel, and discard the rest. Doing this does gain a bit of new knowledge: the lexicon
entry (achievements, succesele).
Figure 6.3: Example of parallel fragments in non-parallel sentences: information from
the lexicon. The underlined words are translated, according to the lexicon.
Figure 6.4: A signal-filtering approach for detecting parallel fragments.
The difficulty lies in quantifying the notions of “mostly translated” and “mostly not
translated”. My approach is to consider the target sentence as a numeric signal, where
translated words correspond to positive values (the translation probabilities from the
lexicon), and the others to negative values. I want to retain the parts of the sentence
where the signal is mostly positive; this can be achieved by applying a smoothing filter
to the signal, and selecting those fragments of the sentence for which the corresponding
filtered values are positive.
The details of the procedure are presented below, and also illustrated in Figure 6.4.
Let the Romanian sentence be the source sentence S, and the English one be the target,
T. I compute a word alignment using the alignment model described in Section 4.3.3
(which is IBM Model 1 plus some heuristics). For each of the linked target words, the
corresponding signal value is the probability of the link (the alignment model is defined
such that there can be at most one link for each target word). Thus, if target word t is
linked to source word s, the signal value corresponding to t is p(t|s) (the translation
probability distribution described in Section 3.4.5), i.e. the probability that t is the
translation of s.
For the remaining target words, the signal value should reflect the probability that
they are not translated; for this, I employ the negative-association lexicon (Section 3.4.5).
This lexicon contains word pairs that were found to be negatively associated. Although
they were linked together in the word-aligned initial parallel corpus used to compute the
lexicon, those links are accidental; the lexicon assigns them a probability p_neg(s, t) of
not being translations of each other. Of course, word pairs that are never linked in the
initial corpus have probability 1 of not being translations of each other.
For each non-linked target word t, I look for the source word s that is "closest" to
being its translation, or least likely to be its non-translation, i.e. the one that
minimizes p_neg(s, t). If such an s exists, I set the signal value for t to -p_neg(s, t);
otherwise, I set it to -1. This is the initial signal. I obtain the filtered signal by
applying an averaging filter, which sets the value at each point to be the average of the
surrounding 5 values. I then simply retain those fragments of T for which the
corresponding filtered signal values are positive; these fragments are circled in the
figure.
Unfortunately, this algorithm will sometimes wrongly mark short fragments as positive;
examples in Figure 6.4 are the fragments "president", "democracy", and
", reports". To avoid such errors, I discard all fragments with fewer than 3
words.
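The smoothing-and-selection procedure above can be sketched in Python. This is an illustrative reimplementation, not the thesis code; the function name and the toy inputs below are mine, and the per-word signal values are assumed to have already been computed from the alignment and the negative-association lexicon:

```python
def extract_fragments(words, signal, window=5, min_len=3):
    """Smooth a per-word translation signal with an averaging filter and
    return the maximal runs of words whose smoothed value is positive.

    words  -- target-sentence tokens
    signal -- one value per token: p(t|s) for linked words, a negative
              value (down to -1) for unlinked ones
    """
    n = len(words)
    half = window // 2
    # Averaging filter: each point becomes the mean of the surrounding
    # `window` values, clipped at the sentence boundaries.
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    # Retain the maximal runs of positive smoothed values.
    fragments, current = [], []
    for word, value in zip(words, smoothed):
        if value > 0:
            current.append(word)
        elif current:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    # Heuristic from the text: discard fragments shorter than min_len words.
    return [f for f in fragments if len(f) >= min_len]
```

For instance, with a mostly positive prefix followed by unlinked words, only the prefix survives the filter; a run of four positive values followed by two -1 values yields a single four-word fragment.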
In order to obtain the fragments for the source sentence, I repeat the procedure in the other direction
(swapping the roles of source and target). For the sentence pair from my example (Figure 6.4), the system will output
the pair:
people to focus on the past year’s achievements, which, he says
sa se concentreze pe succesele anului trecut, care, printre
which is fairly correct.
6.3.2 Algorithmic Complexity
All the steps involved in the algorithm described above can be performed in time linear
in the number of words from the two input sentences. These steps are:
- compute the word alignment
- compute the initial signal values
- filter the signal
- extract the output
As I argued in Section 4.3.5, since the word alignment is one-to-many, the complex-
ity of computing it is linear in the number of words being aligned. As for the other steps,
they can clearly be performed by traversing the input sentences sequentially.
Thus, as argued in Section 3.4.6, the fragment detection algorithm does not increase
the total complexity of the extraction pipeline, which is O(D · w) (where D is
the number of documents in the input corpus, and w is the average number of words
per document).
6.4 Experiments
6.4.1 Introduction
Extracting parallel data at sub-sentence level is most useful when applied to comparable
corpora that have little or no parallelism at sentence level. However, the Gigaword com-
parable corpora, which I used for most of my experiments so far, have plenty of parallel
sentences and documents (as demonstrated by the experimental results presented in the
previous chapters). This renders them inappropriate for demonstrating the advantages
of the parallel fragment detection algorithm.
I therefore present, in Section 6.4.2, experiments performed on a noisier corpus,
which contains few parallel sentences: the Romanian-English BBC comparable corpus.
In order to preserve uniformity, I then also report on experiments performed on the
Gigaword corpora, in Section 6.4.3.
6.4.2 Experiments on the BBC Corpus
The BBC corpus was obtained by downloading articles from the English and Romanian
online editions of the BBC news agency. The Romanian side of the corpus has 6k
articles and 2.5M tokens, while the English side has 200k articles and 118M tokens.
This corpus is precisely the kind of corpus that my method is designed to exploit; as
the example from Figure 6.1 shows, even closely related documents have few parallel
sentence pairs, so the parallel data must be extracted at the sub-sentence level.
I run extraction experiments with two different initial parallel corpora, whose sizes
are 1M words and 10M words; these are the same corpora used in the experiments
from Section 5.4.2. Using each of the initial parallel corpora, I apply both the fragment
extraction (to obtain the PlusFragments corpus) and the sentence extraction method
(Chapter 4, which produces the PlusSentences corpus). In order to extract the parallel
fragments, I apply my model on all the candidate sentence pairs produced by the word-
overlap filter (see Figure 3.2). The sizes of the extracted datasets, measured in million
English tokens, are in Table 6.1.
Initial corpus Fragments Sentences
1M 0.4M 0.3M
10M 1.3M 0.9M
Table 6.1: Amounts of parallel data extracted by the parallel fragment detection algo-
rithm from the BBC corpus, measured in number of English tokens.
I evaluate the data by measuring its impact on MT performance. For each initial cor-
pus size I train 3 MT systems: Baseline, PlusFragments, and PlusSentences. I measure
MT performance on a 1000-sentence test corpus consisting of news articles from the
TimeBank corpus that were translated into Romanian.
The results are presented in Figure 6.5. The statistical significance intervals (which,
for clarity, are shown only for the Baseline results) indicate that a difference of more
than 1 BLEU% is significant. As the results show, the extracted parallel fragments yield
significant performance improvements, while the parallel sentences fail to do so. Thus,
for noisy comparable corpora, detecting parallelism at sub-sentence level is more helpful
than attempting to find full sentences.
Figure 6.5: MT performance improvements using parallel fragments and sentences auto-
matically extracted from the BBC corpus.
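The significance intervals reported in these experiments can be computed by bootstrap resampling of the test set, in the style of Koehn (2004). A minimal sketch of the paired variant follows; `corpus_metric` is a placeholder for any corpus-level scorer such as BLEU, and the trial count is illustrative:

```python
import random

def paired_bootstrap(hyps_a, hyps_b, refs, corpus_metric,
                     trials=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores B.

    corpus_metric(hypotheses, references) -> float, e.g. corpus BLEU.
    A fraction near 1.0 (or 0.0) indicates a significant difference.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(trials):
        # Resample the test set with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = corpus_metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
        score_b = corpus_metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / trials
```

If system A wins on, say, 95% or more of the resampled test sets, its advantage over B is unlikely to be an artifact of the particular test set.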
Table 6.2 presents the results of two additional experiments. First, I joined the par-
allel fragments and parallel sentences, and used them together as additional data for
training a PlusSF MT system. This yielded slightly better performance than the Plus-
Fragments system, but not by a significant margin.
The second experiment was designed to test the importance of the lexicon quality.
As mentioned in Section 3.4.5, all parallel data detection algorithms use LLR-Lex,
a high-quality lexicon computed using the Log-Likelihood Ratio statistic. The other
components of the extraction pipeline, however, use a noisier lexicon, ML-Lex. I
ran the fragment extraction algorithm using this noisier lexicon, and obtained the Plus-
Fragments ML-Lex dataset. This dataset failed to improve over the Baseline system,
indicating that having a good lexicon, such as LLR-Lex, is important for the success
of the method.
Initial corpus Baseline PlusFragments PlusSentences PlusSF PlusFragments ML-Lex
1M 22.5 24.04 22.92 24.28 21.45
10M 30.3 31.51 30.85 31.99 30.59
Table 6.2: BLEU scores obtained using various datasets extracted by the parallel frag-
ment detection algorithm from the BBC corpus.
6.4.3 Experiments on the Gigaword Corpora
This section presents the results obtained by applying the parallel fragment detection
algorithm on the Gigaword corpora (Section 3.3.2). Table 6.3 presents the amounts of
data extracted from these corpora, for the various initial parallel corpora used in the
experiments, measured in million English tokens.
Size of initial Size of automatically extracted corpora
parallel corpus Arabic-English Chinese-English
Dictionary 7M
10k 0.2M 2.6M
100k 4M 22M
1M 37M 81M
100M 74M 285M
Table 6.3: Amounts of data extracted by the parallel fragment detection algorithm from
the Gigaword corpora, measured in number of English tokens.
Figures 6.6 and 6.7 present the performance of the MT systems trained using the
PlusFragments corpora. The table below the graph also shows the amounts of automatically
extracted data used by each system. They are the same as in Table 6.3, with one
exception: for Chinese-English with the 100M-token initial corpus, the amount of extracted
data (285M tokens) was too large to train on, so I only used 100M tokens from it.
The parallel fragments yield significant improvements over the baseline for most
experimental conditions, providing further evidence that my method produces good par-
allel data. However, these improvements are smaller than those obtained using automat-
ically extracted sentences (Section 4.4.3) or documents (Section 5.4.3); clearly, the data
from parallel fragments has lower quality.
The parallel fragments fail to improve over the baseline not only on the Chinese-
English, 100M tokens corpus (where all my other methods failed as well), but also
on the Chinese-English, 10k tokens one. This is caused by poor lexicon quality. The
lexicon is computed from the automatically word-aligned parallel corpus; and for such
a small corpus, the quality of the word alignment is quite poor. As I have shown in the
experiments from Table 6.2 (the PlusFragments ML-Lex column), a poor lexicon has a
strong negative impact on the method's performance.

Figure 6.6: Arabic-English MT performance improvements using parallel fragments
automatically extracted from the Gigaword corpus.

Figure 6.7: Chinese-English MT performance improvements using parallel fragments
automatically extracted from the Gigaword corpus.
In another experiment, I trained MT systems on data extracted using all three meth-
ods presented in this thesis: from parallel sentences, documents and fragments. How-
ever, their performance was very similar to that of the systems trained on data extracted
only from sentences and documents (i.e. the PlusSD systems presented in Figures 5.5
and 5.6). It is likely that for the Gigaword corpora, most of the parallel data can be
found at sentence and document level, so the extracted fragments cannot bring addi-
tional improvements. As I mentioned in Section 6.4.1, it is only for noisy comparable
corpora that the parallel fragment detection model can outperform the other parallel
data extraction methods presented in this thesis.
6.5 Examples
Figure 6.8 presents a few fragment pairs extracted by my algorithm from the Romanian-
English BBC corpus (Section 6.4.2). For each example, the figure contains the input,
i.e. the initial sentence pair, and the output, the two fragments that were judged parallel.
Each Romanian language sentence or fragment is followed by a human-produced trans-
lation, printed in bold and italic font. When printing the output, I used the “[...]” token
to mark places where words from the original sentence were discarded. This token does
not appear in the system output; it is only included to make the figure clearer.
The examples point out both positive and negative effects of the parallel fragment
detection approach. In example 1, the model successfully discards the English phrase
“or their enthusiasm with the PDSR propositions” which has no translation on the Roma-
nian side. In example 4, it manages to correctly extract translated fragments from two
non-parallel sentences.
Figure 6.8: Examples of automatically extracted parallel sub-sentential fragments.
On the other hand, as example 5 shows, the method often produces completely non-
parallel fragment pairs; or, as in example 4, merely joins together sequences of frequent
(often closed-class) words. This behaviour is caused by the presence of noisy lexicon
entries, which wrongly give positive signal values to many function words. Thus, the
approach might be improved by distinguishing content words from non-content words,
and handling them differently in the model; maybe by giving less weight to the signal
values corresponding to non-content words.
6.6 Summary
I presented a method for finding parallel sub-sentential fragments from pairs of (poten-
tially non-parallel) sentences. This approach, which is the first published attempt at
tackling this problem, performs a signal-processing-inspired analysis of the two sentences.
I have shown that this method can be used to exploit noisy comparable corpora,
on which full-sentence-based extraction methods fail. Finally, even though the paral-
lel data obtained with this method may not always be coherent or grammatical, it is of
sufficiently high quality that it helps improve SMT performance.
Chapter 7
Bootstrapping
7.1 Introduction
The amount of data that my algorithms can extract from a comparable corpus is
adversely affected by poor lexicon coverage. When the initial parallel corpus (and there-
fore the lexicon) is small, only very little data is obtained from the comparable corpus.
In this chapter, I show how this problem can be alleviated by using bootstrapping: after
one round of data extraction, I add the extracted data to the initial corpus, compute a
new lexicon, and perform another extraction. This should yield more data from the cor-
pus, as well as a better-performing MT system. In order to give the reader an idea of
how higher MT performance scores are reflected in the output translation quality, I also
provide examples from the output of MT systems trained on bootstrapped datasets.
7.2 Bootstrapping Experiments
As can be seen from the experiments performed on the Gigaword corpora (i.e. Fig-
ure 4.6), the amounts of data extracted using a small initial corpus (i.e. 10k tokens) are
much smaller than the amounts extracted with larger initial resources. This is mostly due
to the low coverage of the lexicon, which causes the word-overlap filter (Section 3.4.3)
to discard many sentence pairs (because they share too few words in common, according
to the lexicon).
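The word-overlap filter just mentioned can be sketched as follows. This is an illustrative simplification: the set-based lexicon representation and the 0.5 default for the Overlap parameter are my assumptions, not the thesis's actual settings:

```python
def passes_overlap_filter(src_words, tgt_words, lexicon, overlap=0.5):
    """Keep a sentence pair only if a large-enough fraction of the words
    in each sentence has a possible translation in the other sentence,
    according to the lexicon (here, a set of (src, tgt) word pairs).
    """
    src_set, tgt_set = set(src_words), set(tgt_words)
    # Count target words with at least one lexicon translation in the source.
    covered_tgt = sum(1 for t in tgt_words
                      if any((s, t) in lexicon for s in src_set))
    # And source words with at least one lexicon translation in the target.
    covered_src = sum(1 for s in src_words
                      if any((s, t) in lexicon for t in tgt_set))
    return (covered_src / len(src_words) >= overlap and
            covered_tgt / len(tgt_words) >= overlap)
```

Lowering `overlap` lets noisier pairs through, which is exactly the trade-off discussed above: more candidates, but lower precision and a heavier classification load.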
One possible solution is to make the filter less strict by lowering the value of its
Overlap parameter, and thus allowing through sentence pairs that share a smaller percentage
of words. However, a lexicon computed from a small word-aligned parallel
corpus has not just low coverage, but also low quality; automatic word alignment algo-
rithms perform poorly on small datasets, because they lack sufficient statistics for their
parameters. Thus, lowering the precision of the filter, in conjunction with a noisy lexi-
con, would result in poor output quality. Besides, a lower Overlap value would greatly
increase the number of sentence pairs that need to be classified, which would cause
serious computational problems.
A better solution is to use bootstrapping: after an initial extraction, I can use the data
it produced to compute a new lexicon and go back and extract parallel data again. The
new lexicon should have much better coverage of the comparable corpus: not only is it
computed from more parallel data, but part of that data comes from the comparable corpus
itself. Bootstrapping was first applied to this problem by Fung and Cheung (2004a).
I perform bootstrapping iterations starting from my smallest initial corpora, of
10k English tokens. Each extraction iteration produces three corpora: PlusSentences,
obtained using the parallel sentence classifier (Chapter 4); PlusDocuments, obtained
by using the parallel sentences as alignment links between documents (Chapter 5);
and PlusFragments, obtained with the parallel fragment detection method described in
Chapter 6. I train (and evaluate) MT systems on all these corpora, plus combinations;
I then take the best-performing parallel corpus, and use it as initial corpus for the next
extraction iteration. Since in all iterations I extract data from the same comparable corpus,
I do not use data extracted in an earlier iteration for MT training; it is most likely
a (lower-quality) subset of the current dataset.
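The overall loop can be sketched as follows; the lexicon trainer, the extraction components, and the MT scoring function are passed in as placeholders for the modules described in earlier chapters, so this is a structural sketch rather than the actual pipeline code:

```python
def bootstrap_extraction(initial_corpus, comparable_corpus,
                         train_lexicon, extractors, mt_score,
                         iterations=2):
    """Iteratively re-extract parallel data, rebuilding the lexicon from
    the best corpus found so far. One initial pass (no bootstrapping yet)
    plus `iterations` further rounds are performed.
    """
    best = initial_corpus
    for _ in range(iterations + 1):
        lexicon = train_lexicon(best)
        # Run every extraction method with the current lexicon.
        extracted = [extract(comparable_corpus, lexicon)
                     for extract in extractors]
        # Train only on initial + newly extracted data; data from earlier
        # iterations is likely a lower-quality subset of the current one.
        best = max((initial_corpus + data for data in extracted),
                   key=mt_score)
    return best
```

Each round enlarges the lexicon's coverage of the comparable corpus, which in turn lets the filters and classifiers accept more candidate pairs in the next round.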
I refer to the first extraction as Iteration 0, since no bootstrapping is involved yet.
The results from this extraction are the ones that have been reported in the previous
chapters of this thesis. Subsequent iterations are named Iteration 1, Iteration 2, etc. In
my experiments, due to time and resource limitations, I only performed two bootstrap-
ping iterations.
Figures 7.1 and 7.2 present the results of these experiments. In order to emphasize
the improvements produced by each iteration, the results are grouped by extracted cor-
pus rather than by iteration. The first bar represents the baseline score; the next group
of bars corresponds to the scores obtained using data extracted with the parallel sentence
classifier (PlusSentences) for each iteration (Iteration 0, 1 or 2); the next group repre-
sents the scores obtained using data from parallel documents (PlusDocuments) in each
iteration; and then PlusFragments, and then a combination of sentences and documents,
Plus Sentences and Documents. (Since combinations involving the parallel fragments
had relatively poor performance, I did not include them in the figures.) The last row
of the table below the graph presents the amounts of automatically extracted data used
for each of the systems, measured in million English tokens. The bars of the graph also
show the statistical significance intervals of the scores.

Figure 7.1: Arabic-English MT performance results obtained using bootstrapping. The
baseline (B) system is trained on 10k English tokens. The results of the comparative systems
are grouped according to the type of extracted data they use: sentences, documents,
fragments, sentences plus documents.
As the results show, bootstrapping significantly increases both the amounts of
extracted data, and the performance of the MT systems. For Arabic-English in partic-
ular, by starting with 10k tokens of parallel data and performing two rounds of extrac-
tion from the comparable corpora, I obtain MT performance similar to that of a system
trained on 100M words of parallel data.
Figure 7.2: Chinese-English MT performance results obtained using bootstrapping. The
baseline (B) system is trained on 10k English tokens. The results of the comparative sys-
tems are grouped according to the type of extracted data they use: sentences, documents,
fragments, sentences plus documents.
The extracted data is of increasingly lower quality; in later iterations I extract larger
amounts of data, but obtain smaller BLEU improvements. As might be expected, this
effect is strongest for the PlusFragments corpora.
However, the parallel document detection method seems robust to this increasing
amount of noise; although the PlusDocuments systems use the smallest amounts of
automatically extracted data, they have some of the highest scores. This is partly due
to the sentence-alignment algorithm used to obtain parallel sentences from the parallel
documents. When applied to documents which are not fully parallel, the algorithm is
able to hypothesize 1-0 alignments (i.e. decide that some sentences in one document are
insertions, and have no translation in the other document), thus removing some of the
noise.
The good quality of the data from the PlusDocuments corpora, indicated by the ratio
between their MT performance and data size, is an important result. It shows that even
when starting with very little parallel data, my algorithms can produce not only good
MT performance, but also large amounts of high-quality parallel data.
7.3 Translation Examples
Figure 7.3 shows a few example translations produced by the MT systems presented in
Figures 7.1 and 7.2. The source sentences are from the Arabic-English and Chinese-
English test sets used in all my experiments (Section 3.3.3). Each example consists
of a source sentence and its reference translation (in bold and italic font), followed by
the translation produced by the Baseline system, and the translations produced by the
PlusDocuments systems trained on datasets obtained in bootstrapping iterations 0 and
1.
Figure 7.3: Examples of translations produced by SMT systems obtained from successive
bootstrapping iterations.
Chapter 8
Conclusions
In this dissertation I describe novel models for detecting parallel data in comparable,
unrelated corpora. I show how to find parallel data at various levels of granularity: par-
allel documents, sentences, or sub-sentential fragments. I describe a robust and efficient
framework for extracting such data, which is capable of processing very large corpora.
I also show that the parallel data extracted with my algorithms helps improve the end-
to-end performance of a state-of-the-art Statistical Machine Translation system.
This research has significantly advanced the state of the art in the field of compa-
rable corpus processing. The methods (and experiments) described here make several
important contributions.
They enable a thorough exploitation of the corpus, because they find parallel data
at various levels, and because of their ability to recognize parallelism in non-parallel
contexts (i.e. find parallel sentences within non-parallel documents, or parallel phrases
within non-parallel sentences). This ability also increases the range of comparable cor-
pora that can be usefully exploited, to include texts that contain few translated docu-
ments or even sentences.
My work is the first to prove that automatically extracted parallel data can be used
to improve the performance of an NLP task, namely SMT. It is my hope that other tasks
will also benefit from such resources; since my algorithms provide good indications
about the quality of the data they discover, they should be helpful even for applications
that have stricter requirements for the quality of their input parallel data.
Finally, I have shown that the exploitation of comparable corpora can be useful for
both resource-scarce and resource-rich languages. By performing extensive experiments
using small initial resources (a parallel corpus as small as 10k words, or just a dictio-
nary), I demonstrated that my methods are robust to low-data conditions, and can still
obtain large amounts of high-quality parallel data which yield very good SMT performance.
On the other hand, I have also shown that given sufficient amounts of compara-
ble data, even the performance of MT systems trained on large parallel corpora (100M
words) can be improved upon.
My parallel data detection algorithms can be improved in a number of ways.
Although the paradigm of using a simple word-level alignment to classify sentence pairs
works well, the alignment model used in this thesis is weak. Developing a better model,
which would for example take into account word order, should help not only the par-
allel sentence classifier but also the parallel fragment detection module, which relies
on the same word-alignment. The parallel document detection model, which has been
shown to produce the highest-quality data, can also be improved to increase its efficacy.
Instead of distinguishing only between parallel and non-parallel document pairs, it can
also look for partially translated documents, and try to identify those parts which are
translationally equivalent.
As for the parallel fragment detection approach, it is suboptimal in several aspects.
The signal filtering function is very simple; more advanced filters might work better,
and eliminate the need to apply additional heuristics (such as the requirement that
the extracted fragments have at least 3 words). The fact that the source and target signal
are filtered separately is also a weakness; a joint analysis should produce better results.
Despite the better lexicon, the greatest source of errors is still related to false word cor-
respondences, generally involving punctuation and very common, closed-class words.
Giving special attention to such cases should help get rid of these errors, and improve
the precision of the method.
I consider this work to be a first step towards automatic creation of parallel corpora.
Large amounts of multilingual comparable texts are published on the Web every day;
using my algorithms, they can be successfully mined for parallel data. Thus, I hope this
work will facilitate the creation of parallel corpora, and machine translation systems, for
many language pairs; which in turn will foster new avenues of research in several fields
of Natural Language Processing.
Reference List
[2003] Barzilay, R., and Elhadad, N. 2003. Sentence alignment for monolingual com-
parable corpora. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP 2003).
[1990] Brown, P. F.; Cocke, J.; Pietra, S. D.; Pietra, V. J. D.; Jelinek, F.; Lafferty, J. D.;
Mercer, R. L.; and Roossin, P. S. 1990. A statistical approach to machine translation.
Computational Linguistics 16(2):79–85.
[1993] Brown, P. F.; Pietra, S. A. D.; Pietra, V. J. D.; and Mercer, R. L. 1993. The
mathematics of statistical machine translation: Parameter estimation. Computational Linguis-
tics 19(2):263–311.
[1991] Brown, P. F.; Lai, J. C.; and Mercer, R. L. 1991. Aligning sentences in parallel
corpora. In Proceedings of the 29th Annual Meeting of the Association for Computa-
tional Linguistics, 169–176.
[2004] Cheung, P., and Fung, P. 2004. Sentence alignment in parallel, comparable, and
quasi-comparable corpora. In Proceedings of the LREC2004 Workshop.
[1997] Collier, N.; Hirakawa, H.; and Kumano, A. 1997. Creating a noisy parallel cor-
pus from newswire articles using cross-language information retrieval. Transactions
of Information Processing Society of Japan 38(9):1234–1244.
[1991] Cover, T., and Thomas, J. 1991. Elements of Information Theory. New
York:Wiley.
[1994] Dagan, I., and Itai, A. 1994. Word sense disambiguation using a second lan-
guage monolingual corpus. Computational Linguistics 20(4):563–596.
[1974] Darroch, J. N., and Ratcliff, D. 1974. Generalized iterative scaling for log-linear
models. Annals of Mathematical Statistics 43:95–144.
[1995] Davis, M. W., and Dunning, T. E. 1995. A TREC evaluation of query translation
methods for multi-lingual text retrieval. In Fourth Text Retrieval Conference, 483–
498.
[1995] Davis, M. W.; Dunning, T. E.; and Ogden, W. C. 1995. Text alignment in
the real world: Improving alignments of noisy translations using common lexical
features, string matching strategies and n-gram comparisons. In Proceedings of the
7th European Conference of the ACL, 67–74.
[2000] Diab, M., and Finch, S. 2000. A statistical word-level translation model for
comparable corpora. In Proceedings of the Conference on Content-Based Multimedia
Information Access.
[2002] Diab, M., and Resnik, P. 2002. An unsupervised method for word sense tagging
using parallel corpora. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics, 255–262.
[1993] Dunning, T. 1993. Accurate methods for the statistics of surprise and coinci-
dence. Computational Linguistics 19(1):61–74.
[2003] Echihabi, A., and Marcu, D. 2003. A noisy-channel approach to question
answering. In Proceedings of the 41st Annual Meeting of the Association for Com-
putational Linguistics, 16–23.
[2004a] Fung, P., and Cheung, P. 2004a. Mining very non-parallel corpora: Parallel
sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP 2004),
57–63.
[2004b] Fung, P., and Cheung, P. 2004b. Multi-level bootstrapping for extracting par-
allel sentences from a quasi-comparable corpus. In Proceedings of the 20th Interna-
tional Conference on Computational Linguistics (COLING 2004), 1051–1057.
[1997] Fung, P., and McKeown, K. 1997. Finding terminology translations from non-
parallel corpora. In Proceedings of the 5th Annual Workshop on Very Large Corpora,
192–202.
[1998] Fung, P., and Yee, L. Y. 1998. An IR approach for translating new words from
nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the
Association for Computational Linguistics, 414–420.
[1995] Fung, P. 1995. Compiling bilingual lexicon entries from a non-parallel English-
Chinese corpus. In Proceedings of the 3rd Annual Workshop on Very Large Corpora,
173–183.
[1991] Gale, W. A., and Church, K. W. 1991. A program for aligning sentences in
bilingual corpora. In Proceedings of the 29th Annual Meeting of the Association for
Computational Linguistics, 177–184.
[2004] Gaussier, E.; Renders, J.-M.; Matveeva, I.; Goutte, C.; and Dejean, H. 2004. A
geometric view on bilingual lexicon extraction from comparable corpora. In Proceed-
ings of the 42nd Annual Meeting of the Association for Computational Linguistics,
527–534.
[2005] Huang, F.; Zhang, Y.; and Vogel, S. 2005. Mining key phrase translations from
web corpora. In Proceedings of the Human Language Technology Conference and
Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP),
483–490.
[1998] Kikui, G. 1998. Term-list translation using mono-lingual co-occurrence vectors.
In Proceedings of the 17th International Conference on Computational Linguistics,
670–674.
[2006] Klementiev, A., and Roth, D. 2006. Named entity transliteration and discov-
ery from multilingual comparable corpora. In Proceedings of the Human Language
Technology Conference of the North American Chapter of the ACL, 82–88.
[2000] Koehn, P., and Knight, K. 2000. Estimating word translation probabilities
from unrelated monolingual corpora using the EM algorithm. In Proceedings of the
National Conference on Artificial Intelligence, 711–715.
[2002] Koehn, P., and Knight, K. 2002. Learning a translation lexicon from monolin-
gual corpora. In Proceedings of the Workshop of the ACL Special Interest Group on
the Lexicon (SIGLEX), 9–16.
[2004] Koehn, P. 2004. Statistical significance tests for machine translation evalua-
tion. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 388–395.
[2002] Malouf, R. 2002. A comparison of algorithms for maximum entropy parameter
estimation. In Sixth Conference on Natural Language Learning.
[1997] Melamed, I. D. 1997. A portable algorithm for mapping bitext correspondence.
In Proceedings of the 35th Annual Meeting of the Association for Computational
Linguistics, 305–312.
[2000] Melamed, I. D. 2000. Models of translational equivalence among words. Com-
putational Linguistics 26(2):221–249.
[2002] Moore, R. C. 2002. Fast and accurate sentence alignment of bilingual corpora.
In Proceedings of the 5th Conference of the Association for Machine Translation in
the Americas, 135–144.
[2004a] Moore, R. C. 2004a. Improving IBM word-alignment model 1. In Proceedings
of the 42nd Annual Meeting of the Association for Computational Linguistics, 519–
526.
[2004b] Moore, R. C. 2004b. On log-likelihood-ratios and the significance of rare
events. In Proceedings of the 2004 Conference on Empirical Methods in Natural
Language Processing, 333–340.
[2005] Moore, R. C. 2005. A discriminative framework for bilingual word align-
ment. In Proceedings of Human Language Technology Conference and Conference
on Empirical Methods in Natural Language Processing, 81–88.
[2005] Munteanu, D. S., and Marcu, D. 2005. Improving machine translation perfor-
mance by exploiting non-parallel corpora. Computational Linguistics 31(4).
[2006] Munteanu, D. S., and Marcu, D. 2006. Extracting parallel sub-sentential frag-
ments from non-parallel corpora. In Proceedings of the 21st International Confer-
ence on Computational Linguistics and 44th Annual Meeting of the Association for
Computational Linguistics (COLING/ACL), 81–88.
[2002] Oard, D. W., and Gey, F. C. 2002. The TREC-2002 Arabic/English
CLIR track. In Proceedings of the 2002 Text Retrieval Conference.
http://trec.nist.gov/pubs/trec11/t11_proceedings.html.
[1997] Oard, D. W. 1997. Cross-language text retrieval research in the USA. In Third
DELOS Workshop on Cross-Language Information Retrieval, 1–10.
[2002] Och, F. J., and Ney, H. 2002. Discriminative training and maximum entropy
models for statistical machine translation. In Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics, 295–302.
[2003] Och, F. J., and Ney, H. 2003. A systematic comparison of various statistical
alignment models. Computational Linguistics 29(1):19–51.
[2004] Och, F. J., and Ney, H. 2004. The alignment template approach to statistical
machine translation. Computational Linguistics 30(4):417–450.
[2001] Ogilvie, P., and Callan, J. 2001. Experiments using the Lemur toolkit. In
Proceedings of the Tenth Text REtrieval Conference, 103–108.
[2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for
automatic evaluation of machine translation. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics, 311–318.
[2004] Pike, C., and Melamed, I. D. 2004. An automatic filter for non-parallel texts.
In Proceedings of the 42nd Annual Conference of the Association for Computational
Linguistics.
[1995] Rapp, R. 1995. Identifying word translation in non-parallel texts. In Proceed-
ings of the Conference of the Association for Computational Linguistics, 320–322.
[1999] Rapp, R. 1999. Automatic identification of word translations from unrelated
English and German corpora. In Proceedings of the 27th Annual Meeting of the
Association for Computational Linguistics, 519–526.
[2003] Resnik, P., and Smith, N. A. 2003. The web as a parallel corpus. Computational
Linguistics 29(3):349–380.
[2006] Richard Sproat, T. T., and Zhai, C. X. 2006. Named entity transliteration with
comparable corpora. In Proceedings of the 21st International Conference on Com-
putational Linguistics and 44th Annual Meeting of the ACL, 73–80.
[1994] Robertson, S., and Walker, S. 1994. Some simple effective approximations to
the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th
Annual ACM SIGIR, 232–241.
[2003] Utiyama, M., and Isahara, H. 2003. Reliable measures for aligning Japanese-
English news articles and sentences. In Proceedings of the 41st Annual Meeting of
the Association for Computational Linguistics, 72–79.
[2003] V ogel, S. 2003. Using noisy bilingual data for statistical machine translation. In
Proceedings of the 10th Conference of the European Chapter of the Association for
Computational Linguistics, 175–178.
[1998] Walker, S.; Robertson, S.; Boughamen, M.; Jones, G.; and Sparck-Jones, K.
1998. Okapi at TREC-6: Automatic ad hoc, VLC, routing, filtering and QSDR. In
TREC-6, 125–136.
[2005] Wu, D., and Fung, P. 2005. Inversion transduction grammar constraints for
mining parallel sentences from quasi-comparable corpora. In Second International
Joint Conference on Natural Language Processing (IJCNLP-2005), 257–268.
[1994] Wu, D. 1994. Aligning a parallel English-Chinese corpus statistically with
lexical criteria. In Proceedings of the 32nd Annual Meeting of the Association for
Computational Linguistics, 80–87.
[2001] Yarowsky, D., and Ngai, G. 2001. Inducing multilingual POS taggers and NP
bracketers vid robust projection across aligned corpora. In Proceedings of the 2nd
Meeting of the North American Association for Computational Linguistics, 200–207.
111
[2001] Yarowsky, D.; Ngai, G.; and Wicentowski, R. 2001. Inducing multilingual text
analysis tools via robust projection across aligned corpora. In Proceedings of the
First International Conference on Human Language Technology Research, 161–168.
[2006] Yonggang Deng, S. K., and Byrne, W. 2006. Segmentation and alignment of
parallel text for statistical machine translation. Journal of Natural Language Engi-
neering 12(4):1–26.
[2002a] Zhao, B., and V ogel, S. 2002a. Adaptive parallel sentences mining from web
bilingual news collection. In 2002 IEEE International Conference on Data Mining,
745–748.
[2002b] Zhao, B., and V ogel, S. 2002b. Full-text story alignment models for Chinese-
English bilingual news corpora. In Proceedings of the International Conference on
Spoken Language PRocessing.
112
Appendix A
Detailed Experiment Description
In this appendix I provide a detailed description of one of my extraction experiments, in
terms of the running times of the various components of the extraction system, as well
as the amounts of data that they process. The experiment consists of an extraction from
the Arabic-English Gigaword comparable corpus (Section 3.3.2), starting with an initial
parallel corpus of 100M English tokens (Section 3.3.1).
The relevant information is depicted in Figure A.1. Besides running times and amounts of processed data, the figure also makes explicit, in the lower part, the values used for the various parameters that control the system's execution. The experiment was run on Linux machines with 4 GB of RAM and 2.8 GHz processors. Although I used several machines in parallel, the running times are expressed in terms of a single processor.
As the figure shows, the comparable corpus consists of two monolingual corpora: an English one of 1.6 million documents, and an Arabic one of 600 thousand documents. The first step in the extraction pipeline selected pairs of similar documents (Section 3.4.2). Each Arabic document was transformed into an English-language query, which was run against the English documents that were published within a window of 3 days of the Arabic one. The best 20 results were returned and paired with the original Arabic document; this produced 11 million document pairs, and took 30 CPU-days to complete.

Figure A.1: Detailed description of a data extraction experiment. For each module of the data extraction system, the figure shows its running time, the amounts of data it needs to process, and the values of the parameters that control its execution.
Next, all possible sentence pairs from each document pair were passed through the word-overlap filter (Section 3.4.3). The filter processed 2.5 billion sentence pairs, verifying that at least 50% of the words in each sentence have a translation in the other sentence, and that their length ratio is no greater than 2. It took 17 days to run, and produced 4.5 million candidate pairs.
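A minimal sketch of the filter's two tests, assuming a dictionary `lexicon` that maps each source word to a set of candidate translations (a simplification of the probabilistic lexicon actually used):

```python
def passes_word_overlap_filter(src_words, tgt_words, lexicon,
                               min_overlap=0.5, max_len_ratio=2.0):
    """Sketch of the word-overlap filter: keep a sentence pair only if the
    length ratio is at most max_len_ratio and at least min_overlap of the
    words on each side have a translation on the other side."""
    if not src_words or not tgt_words:
        return False
    # Length-ratio test.
    ratio = max(len(src_words), len(tgt_words)) / min(len(src_words), len(tgt_words))
    if ratio > max_len_ratio:
        return False
    # Fraction of source words with a translation in the target sentence.
    tgt_set = set(tgt_words)
    src_covered = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_set)
    # Fraction of target words that translate some source word.
    translations_of_src = set().union(*(lexicon.get(w, set()) for w in src_words))
    tgt_covered = sum(1 for w in tgt_words if w in translations_of_src)
    return (src_covered / len(src_words) >= min_overlap
            and tgt_covered / len(tgt_words) >= min_overlap)
```

The point of the filter is cheapness: both tests need only set lookups, which is what makes it feasible to apply to billions of candidate pairs before the expensive classifier runs.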
The candidate pairs were then analyzed by the parallel sentence classifier (Section 4.3). This involved computing word alignments (Section 4.3.3), extracting feature values (Section 4.3.2), and running the classifier on those values. The process took 5 days; 2M sentence pairs were judged to be parallel, with a confidence higher than 0.5 (the confidence values returned by the Maximum Entropy classifier range between 0.5 and 1).
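The parenthetical remark follows from the classifier being binary: the natural confidence measure is the probability of the chosen class, which by construction lies between 0.5 and 1. A one-line illustration (the function name is mine, not the system's):

```python
def classifier_confidence(p_parallel):
    """Confidence of a binary decision: the probability of the class the
    classifier actually picks, which always lies in [0.5, 1]."""
    return max(p_parallel, 1.0 - p_parallel)
```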
The parallel pairs were further used by the parallel document detection module (Section 5.3) to create sentence-level alignments between documents. In order to be considered parallel, two documents were required to have at least 30% of their sentences aligned, more than 90% of the alignment links monotone, and a length difference (measured in number of sentences) smaller than 25% of the length of each document. The module ran for 12 hours, and found 100k parallel document pairs.
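These three document-level criteria can be sketched as follows. The representation of the alignment as (source index, target index) links, and all names and thresholds' exact placement, are illustrative assumptions rather than the system's actual code:

```python
def documents_parallel(links, n_src, n_tgt,
                       min_aligned=0.3, min_monotone=0.9, max_len_diff=0.25):
    """Sketch of the parallel-document test. `links` is a list of
    (src_idx, tgt_idx) sentence alignment links; `n_src` and `n_tgt` are
    the documents' lengths in sentences."""
    if not links or n_src == 0 or n_tgt == 0:
        return False
    # Length difference must be smaller than 25% of each document's length.
    if abs(n_src - n_tgt) >= max_len_diff * min(n_src, n_tgt):
        return False
    # At least 30% of the sentences on each side must be aligned.
    src_aligned = len({i for i, _ in links})
    tgt_aligned = len({j for _, j in links})
    if src_aligned / n_src < min_aligned or tgt_aligned / n_tgt < min_aligned:
        return False
    # More than 90% of consecutive links (in source order) must be monotone.
    ordered = sorted(links)
    if len(ordered) == 1:
        return True
    monotone = sum(1 for a, b in zip(ordered, ordered[1:]) if b[1] >= a[1])
    return monotone / (len(ordered) - 1) > min_monotone
```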
The parallel fragment detection module (Section 6.3) attempted to make further use of the candidate sentence pairs that were classified as non-parallel. For this experiment, it ran for 7 hours and found 500k parallel fragment pairs.
Abstract
One of the major bottlenecks in the development of Statistical Machine Translation systems for most language pairs is the lack of bilingual parallel training data. Currently available parallel corpora span relatively few language pairs and very few domains