SYNTACTIC ALIGNMENT MODELS FOR LARGE-SCALE
STATISTICAL MACHINE TRANSLATION
by
Jason A. Riesa
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2012
Copyright 2012 Jason A. Riesa
Table of Contents

List Of Tables
List Of Figures
Acknowledgements
Abstract

Chapter 1: Introduction
    1.1 Machine Translation
    1.2 The Alignment Problem
    1.3 Thesis Outline and Contributions

Chapter 2: Machine Translation and Discriminative Training – an Overview
    2.1 Background
        2.1.1 Classical Statistical Machine Translation
        2.1.2 Modern Statistical Machine Translation
        2.1.3 Evaluation
        2.1.4 Word alignment
    2.2 Current Issues
        2.2.1 Consistency with Translation Model
        2.2.2 Linguistic assumptions
        2.2.3 Efficiency
        2.2.4 Beyond unsupervised learning – automatically acquiring features for word alignment and translation
    2.3 Alignment for other tasks

Chapter 3: Hierarchical Search for Word Alignment
    3.1 Introduction
    3.2 Word Alignment as a Hypergraph
    3.3 Hierarchical search
    3.4 Discriminative training
    3.5 Parallelization
    3.6 Features
        3.6.1 Local features
        3.6.2 Nonlocal features
    3.7 Related Work
    3.8 Experiments
        3.8.1 Alignment Quality
        3.8.2 Translation Quality
    3.9 Discussion

Chapter 4: Feature-Rich Syntax-Based Alignment for Statistical Machine Translation at Scale
    4.1 Introduction
    4.2 A Feature-Rich Syntax-Aware Alignment Model
        4.2.1 Search Overview
            4.2.1.1 Initialization
            4.2.1.2 Combination
    4.3 Automatically Exploiting Syntactic Features for Alignment
        4.3.1 Target Syntax Features
        4.3.2 Source Syntax Features and Joint Features
            4.3.2.1 Source-Target Coordination Features
    4.4 Learning
    4.5 Iterative Approximate Viterbi Inference
    4.6 Evaluation
        4.6.1 Alignment Quality
        4.6.2 Translation Quality
    4.7 Discussion and Implications
        4.7.1 Content Words vs. Function Words in Alignment
        4.7.2 A Concrete Example
        4.7.3 Desiderata
    4.8 Conclusion

Chapter 5: Automatic Parallel Fragment Extraction from Noisy Data
    5.1 Introduction
    5.2 Detecting Noisy Data
        5.2.1 Alignment Model as Noisy Data Detector
        5.2.2 A Brief Example
        5.2.3 Parallel Fragment Extraction
            5.2.3.1 A Hierarchical Alignment Model and its Derivation Trees
            5.2.3.2 Using the Alignment Model to Detect Parallel Fragments
    5.3 Evaluation
    5.4 Future Work
    5.5 Discussion

Chapter 6: Conclusion and Future Work
    6.1 Contributions
    6.2 Future work
        6.2.1 Regularization for Generalization and Scaling
        6.2.2 Better Standards for Word Alignment
        6.2.3 Unsupervised Discriminative Learning

References
List Of Tables

3.1 A sampling of learned weights for the lexical zero feature. Negative weights penalize links never seen before in a baseline alignment used to initialize lexical p(e | f) and p(f | e) tables. Positive weights outright reward such links.

3.2 F-measure, Precision, Recall, the resulting BLEU score, and number of unknown words on a held-out test corpus for three types of alignments. BLEU scores are case-insensitive IBM BLEU. We show a 1.1 BLEU increase over the strongest baseline, Model-4 grow-diag-final. This is statistically significant at the p < 0.01 level.

4.1 A sampling of the highest- and lowest-weighted coordination features applied when scoring partial alignments at nodes in the tree. Preterminal tags inside parentheses indicate the POS tags on the left and right edge of a given constituent.

4.2 F-measure, Precision, Recall for GIZA++ Model-4, and for alignments from this work. GIZA++ was trained on 223M words for Arabic-English, and 261M words for Chinese-English. We observe very large gains in accuracy of 15 points for both language pairs. Iterative inference results in a large effect on Chinese-English recall, and a modest improvement in Arabic-English.

4.3 IBM BLEU scores using a syntax-based MT system. We show statistically significant gains in both language pairs over unsupervised GIZA++ Model 4 trained on very large corpora. An asterisk (*) denotes a statistically significant improvement with p < 0.01 over the number immediately above; a (+) denotes p < 0.05.

5.1 Descriptive statistics about fragments extracted from each parallel corpus.

5.2 End-to-end translation experiments with and without extracted fragments. BLEU score gains are significant with p < 0.05 for Arabic-English and p < 0.01 for Chinese-English.
List Of Figures

2.1 The classical Statistical Machine Translation decoding pipeline. A translation model uses word or phrase probabilities to translate from source to target language. Because word-order differs across languages, a Language Model is used to rank possible translations for fluency. This dissertation is primarily concerned with learning the latent structure of alignments between language pairs that allow us to learn the translation rules of the Translation Model.

2.2 Matrix representation of a word alignment used throughout this dissertation. In this figure, filled cells in the matrix denote a translational correspondence between English and Chinese words.

2.3 Four different rulesets for four different decoders.

3.1 Model-4 alignment vs. a gold standard. Circles represent links in a human-annotated alignment, and black boxes represent links in the Model-4 alignment. Bold gray boxes show links gained after fully connecting the alignment.

3.2 Example of approximate search through a hypergraph with beam size = 5. Each black square implies a partial alignment. Each partial alignment at each node is ranked according to its model score. In this figure, we see that the partial alignment implied by the 1-best hypothesis at the leftmost NP node is constructed by composing the best hypothesis at the terminal node labeled “the” and the 2nd-best hypothesis at the terminal node labeled “man”. (We ignore terminal nodes in this toy example.) Hypotheses at the root node imply full alignment structures.

3.3 Cube pruning with alignment hypotheses to select the top-k alignments at node v with children ⟨u1, u2⟩. In this example, k = 3. Each box represents the combination of two partial alignments to create a larger one. The score in each box is the sum of the scores of the child alignments plus a combination cost.

3.4 Correct version of Figure 3.1 after hypergraph alignment. Subscripts on the nonterminal labels denote the branch containing the head word for that span.

3.5 Depiction of the parallelization used in this work for averaged perceptron learning.

3.6 Learning curves for 10 random restarts over time for parallel averaged perceptron training. These plots show the current F-measure on the training set as time passes. Perceptron training here is quite stable, converging to the same general neighborhood each time.

3.7 Features PP-NP-head, NP-DT-head, and VP-VP-head fire on these tree-alignment patterns. For example, PP-NP-head fires exactly when the head of the PP is aligned to exactly the same f words as the head of its sister NP.

3.8 This figure depicts the tree/alignment structure for which the feature PP-from-prep fires. The English preposition “from” is aligned to the Arabic word من. Any aligned words in the span of the sister NP are aligned to words following من. English preposition structure commonly matches that of Arabic in our gold data. This family of features captures these observations.

3.9 Model robustness to the initial alignments from which the p(e | f) and p(f | e) features are derived. The dotted line indicates the baseline accuracy of GIZA++ Model 4 alone.

4.1 Approximate search through a hypergraph with beam size k = 5. Each black square represents a partial alignment; larger grey-shaded boxes are links in an alignment. Each partial alignment at each node is ranked according to its model score. The root node, S, contains a k-best list of full alignments.

4.2 Translation rules as features extracted during Arabic-English alignment. These rules show that we learn to reorder adjectives and nouns inside noun phrases, and that prepositions before sister NPs prefer to be translated monotonically. For Chinese-English, we learn the opposite. We learn both lexicalized and non-lexicalized features.

4.3 Two examples of joint features over monolingual parse trees. The value of the feature depends on the shaded areas.

4.4 Learning feature-rich alignment models. Figure 4.4(a) shows learning curves on heldout data for five different beam sizes. Figure 4.4(b) shows how the models dynamically grow over time. In Figure 4.4(b) we notice that less accurate models with narrower beams need to add more complexity in an attempt to make up for their many more mistakes.

4.5 Example depicting two different ways to align the English phrase “the big book” to its Chinese translation “大书” (gloss: big book). We extract more translation rules under the strategy shown in Figure 4.5(b). Links in the word alignment are shaded matrix cells, with extractable bilingual phrases outlined.

5.1 Example of a word alignment resulting from noisy parallel data. The structure of the resulting alignment makes it difficult to find parallel fragments simply by inspection. How can we discover automatically those parallel fragments hidden within such data?

5.2 From LDC2004T08, when the NP fragment shown here is combined to make a larger span with a sister PP fragment, the alignment model objects due to non-parallel data under the PP, voicing a score of -0.5130. We extract and append to our training corpus the NP fragment depicted, from which we later learn 5 additional translation rules.
Acknowledgements
In September 2001, I stepped into the office of David Yarowsky for the first time as a wide-eyed
first-year university student ready to dive into my chosen field of study, Computer Science,
but also with a deep, yet still-informal appetite for language and linguistics. Any potentially
bare wallspace was covered by bookshelves and books, epitomizing my idealistic notion of a
university professor. As he finished up rattling off an email to a colleague, I took stock of
my surroundings and noticed that almost every single book – old, new, fat, thin, dusty and
not – was a dictionary, a grammatical reference, or language textbook. I held my breath and
wondered if I was in the right office. This was the Computer Science building, wasn’t it? Could
a Computer Science professor be so involved in and harbor as deep an interest in human language
as me? David Yarowsky showed me that indeed one could be interested in both computation and
language, introduced me to Natural Language Processing and Machine Translation, nurtured my
beginnings as a researcher, and encouraged me to visit for the first time the USC Information
Sciences Institute.
At USC/ISI I found a welcoming, open-minded, and friendly environment by the beach
where I have met so many smart people who have challenged me and helped me to become a
better scientist. There are many people to thank. In particular, I thank my advisor, Daniel Marcu,
for giving me the independence to do things my own way but also for keeping me away from
tempting but stray garden paths in my research with very frank advice and invaluable wisdom.
Daniel taught me the fundamentals of being a good scientist and good citizen in the academic
community – the essentials of accessible technical communication, to see the big picture, to
sniff out weak arguments, to challenge mediocrity without arrogance, to aim for practical and
impactful research objectives.
Kevin Knight consistently reinforced for me the importance of impactful research and taught
me that excellent scientists can still be cool, talks can be entertaining, and language and infor-
mation should be plain and digestible to anyone who wishes to devour it.
Many thanks to David Chiang, whose door is always open, for always providing thoughtful
feedback and answers to my constant barrage of questions in a fast-moving field. Thanks also
to Liang Huang with whom I have had many illuminating technical discussions.
Many thanks also to my inquisitive and incredibly intelligent officemates Victoria Fossum
and Oana Nicolov for often serving as a sounding board for ideas. During my time at ISI, I have
also greatly enjoyed my interactions with Ulf Hermjakob, Alex Fraser, Jonathan May, Steve
DeNeefe, Michael Pust, Rahul Bhagat, Sujith Ravi, Dirk Hovy, Ashish Vaswani, Gully Burns,
Zornitsa Kozareva, and Philipp Koehn. Thank you to Erika Lance, Kary Lau, Alma Nava, and
Peter Zamar.
Thank you to the other members of my committee – Eduard Hovy, Shri Narayanan, and
Stefan Schaal – who asked of me insightful questions which only contributed to making this
work more significant.
Finally, I thank my parents for their unwavering support at points high, low and everywhere
in between; for providing me a wonderful education; for understanding and patience. This
dissertation is dedicated to them.
To all those who helped me along the way, academically and otherwise, I am forever grate-
ful.
Abstract
Word alignment, the process of inferring the implicit links between words across two languages,
serves as an integral piece of the puzzle of learning linguistic translation knowledge. It enables
us to acquire automatically from data the rules that govern the transformation of words, phrases,
and syntactic structures from one language to another. Word alignment is used in many tasks in
Natural Language Processing, such as bilingual dictionary induction, cross-lingual information
retrieval, and distilling parallel text from within noisy data. In this dissertation, we focus on
word alignment for statistical machine translation.
We advance the state-of-the-art in search, modeling, and learning of alignments and show
empirically that, when taken together, these contributions significantly improve the output qual-
ity of large-scale statistical machine translation, outperforming existing methods. We show re-
sults for Arabic-English and Chinese-English translation.
Ultimately, the work we describe herein may be used for any language-pair, supporting ar-
bitrary and overlapping features from varied sources. Finally, our features are learned automat-
ically without any human intervention, facilitating rapid deployment for new language-pairs.
Chapter 1
Introduction
This dissertation proposes search, modeling, and learning ideas for solving one of the central
problems in statistical machine translation – identifying translations in parallel corpora. This
chapter provides a high-level, nontechnical introduction to the field of statistical machine
translation, with Chapter 2 providing additional technical background needed to absorb the
novel contributions of Chapters 3–5. We outline our contributions at the end of this chapter and
conclude this work in Chapter 6.
1.1 Machine Translation
Rooted in statistical learning theory, today’s state-of-the-art machine translation systems learn
from observed data to predict the future. That is, we learn translations from parallel corpora of
millions of words of text in order to translate new sentences never seen by the system before. For
open-domain translation, this is a challenging problem – natural language is rife with ambiguity.
Further complicating things, the structure of language varies wildly from tongue to tongue.
However, since the time of Weaver (1949) we have made immense progress. The
most successful modern translation systems no longer solely make use of long lists of manually
produced translation rules. Rather, we learn these rules automatically from parallel corpora –
two or more documents having the same content, but each in a different language.
These corpora are made available by large data clearinghouses like the Linguistic Data Con-
sortium (LDC), but also exist naturally in the world on the web in the form of news articles,
government proceedings, and personal websites. Indeed, finding such corpora is a hard problem
and a research area unto itself (Fung & Cheung, 2004; Munteanu & Marcu, 2006; Uszkoreit
et al., 2010).
1.2 The Alignment Problem
Machine translation systems generally learn from sentence-aligned parallel corpora. That is,
once we have gathered a parallel corpus on which to train our translation model, we first pair up
sentences in each corpus that are direct translations of each other. This is also a nontrivial and
sometimes noisy process (Zhao et al., 2003; Xu et al., 2005; Deng et al., 2006) as sentences in
one document may appear in a slightly different order than in its corresponding translated doc-
ument. There may be extra information in one document, and missing information in another;
two sentences in one document may appear as a single sentence in another when translated due
to the authors’ style differences or simply due to pragmatic linguistic reasons.
This thesis is concerned with what to do with these corpora once we have gathered them and
performed sentence alignment. We operate on sentence pairs rather than document pairs, and call
this process word alignment to differentiate. How do we learn the implicit links between words
in each language? In answering this question, we will exploit the syntactic structure of natural
language, learning not only mappings between words, but also between higher-level syntactic
structures for language pairs as divergent as English and Chinese. We call a collection of links
between words representing the translational correspondence between two sentence pairs an
alignment.
1.3 Thesis Outline and Contributions
This dissertation addresses important problems in alignment for statistical MT, and makes the
following contributions:
1. In Chapter 2, we give a deeper view of the Statistical Machine Translation problem –
where we have been, where we are today, and how, in many ways, advances in word
alignment have failed to keep up. Understanding the current state of affairs and experimental pipeline for this complex science and engineering problem allows us to go deeper
in the subsequent chapters.
2. Chapter 3 describes our hierarchical search algorithm for word alignment. It is here
where we make novel connections to bottom-up parsing and develop the idea of align-
ment forests. In conjunction with discriminative training and a linear model, our search
process affords us the ability to make efficient use of truly arbitrary, overlapping features.
We show at this point that we make significant gains in downstream translation quality
over a strong baseline system.
3. In Chapter 4, we describe our featureset in detail, from very simple binary identity features
to complex features fully encoding whole syntactic transformations and translations. We
make use of millions of data-driven syntactically motivated features, directly exploiting
the search space of the downstream translation task in which our word alignments are
ultimately to be used. It is in this way that we tie our syntax-based word alignment system
to a downstream syntax-based machine translation decoder. We end with experimental
results showing large, significant gains over state-of-the-art translation systems under a
wide range of experimental conditions.
3. Chapter 5 shows that, given the right model, word alignment can be used for more than
just learning translation rules. This chapter describes our work in using word alignment
derivations to distill parallel fragments from within noisy (i.e., not strictly parallel) cor-
pora. Gathering parallel data for translation systems is an active area of research unto
itself.
5. The final portions of Chapter 4 develop some important ideas going forward for anyone
building supervised word alignment systems. We give evidence that not all training data
can be considered equal, and make the case for the existence of some universal truths with
respect to what makes a useful gold-standard alignment for learning a translation model.
We outline desiderata for future creators of such data.
6. Finally, in Chapter 6 we summarize and conclude this dissertation, point to future work,
and describe preliminary results. We have made publicly available the algorithms de-
scribed herein as a packaged toolkit called Nile.[1]

[1] Software available at: http://code.google.com/p/nile/
Chapter 2
Machine Translation and Discriminative Training – an Overview
This chapter presents an overview of a modern statistical machine translation pipeline, and the
central machine learning framework used to train the parameters of the translation model within
the pipeline.
2.1 Background
Statistical machine translation (MT) systems, at a high level, have historically consisted of three components: a translation model to model the likelihood that two sequences of words in two different languages are translations of each other; a language model to model the
fluency of a sequence of words in a given language; and a search procedure to find the best
translation according to these two models, which we call a decoder.

Figure 2.1: The classical Statistical Machine Translation decoding pipeline. A translation model uses word or phrase probabilities to translate from source to target language. Because word-order differs across languages, a Language Model is used to rank possible translations for fluency. This dissertation is primarily concerned with learning the latent structure of alignments between language pairs that allow us to learn the translation rules of the Translation Model.
Modern MT systems, however, consider these components as features in a log-linear model.
2.1.1 Classical Statistical Machine Translation
Using the noisy channel model from Information Theory, Brown et al. (1993) formulate and
derive The Fundamental Equation of Statistical Machine Translation, giving a mathematical
model which encompasses the ideas of a language model, a translation model, and a need to find
the best translation according to both these models:
arg max_e  P(f | e) P(e)    (2.1)
Brown et al. arrive at this formulation by supposing that we wish to translate a sequence of
French words f into a sequence of English words e; that is, we wish to find:

arg max_e  P(e | f)    (2.2)
But, motivated by the noisy channel formulation of Information Theory and using Bayes Rule,
we can decompose Equation 2.2 into an equivalent product of factors:
arg max_e  P(f | e) P(e) / P(f)    (2.3)

= arg max_e  P(f | e) P(e)    (2.4)
Then, P(f | e) is our translation model, P(e) denotes our language model, and arg max says that we wish to find the sequence of words e that maximizes the product of both the translation model and language model.
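As an illustration, the decision rule of Equation 2.4 can be sketched in a few lines of code. The probability tables, sentences, and function name below are toy values invented purely for this example, not output of any real system:

```python
import math

def noisy_channel_decode(f, candidates, tm, lm):
    """Pick the candidate e maximizing P(f|e) * P(e), as in Equation 2.4.

    tm[(f, e)] stands in for the translation model P(f|e);
    lm[e] stands in for the language model P(e). Both are toy tables here.
    """
    def score(e):
        # Sum of log-probabilities rather than a product, to avoid underflow.
        return math.log(tm[(f, e)]) + math.log(lm[e])
    return max(candidates, key=score)

# The translation model scores both word orders equally;
# the language model prefers the fluent order and breaks the tie.
tm = {("le livre", "the book"): 0.7, ("le livre", "book the"): 0.7}
lm = {"the book": 0.01, "book the": 0.0001}
best = noisy_channel_decode("le livre", ["the book", "book the"], tm, lm)
```

A real decoder searches an enormous hypothesis space rather than a fixed candidate list, but the decision rule it applies is the same.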
2.1.2 Modern Statistical Machine Translation
Modern data-driven models of translation, beginning with (Och & Ney, 2003), largely make use of a log-linear model of translation, where the features h of the model decompose over the e and f words and a derivation, D, consisting of all of the translation rules or phrases applied to transform f into e:

ê = arg max_e  θ · h(e, f, D)    (2.5)

ê = arg max_e  Σ_j θ_j · h_j(e, f, D)    (2.6)
This formulation moves us to a feature-based model, where the score of a predicted translation ê is a weighted sum of features h_j. The weights θ_j of each feature h_j are typically learned discriminatively given a set of reference translations.
The most common type of learning procedure for translation, Minimum Error Rate Training
(MERT) (Och & Ney, 2003), has the goal of minimizing the loss between each correct transla-
tion and its predicted translation. We might use just a handful of features (Och & Ney, 2003)
or perhaps thousands (Chiang et al., 2009). The language model component of Equation 2.4
becomes a single feature in the full log-linear model.
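The weighted sum of Equation 2.6 is simple to state in code. Below is a minimal sketch; the feature names, values, and weights are invented for illustration only:

```python
def loglinear_score(weights, features):
    """Score a hypothesis as sum_j theta_j * h_j(e, f, D), as in Equation 2.6."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature vectors for two competing translation hypotheses.
weights = {"log_tm": 1.0, "log_lm": 0.5, "word_penalty": -0.2}
hyp_a = {"log_tm": -2.0, "log_lm": -4.0, "word_penalty": 3}
hyp_b = {"log_tm": -2.5, "log_lm": -2.8, "word_penalty": 3}

# hyp_b wins here: its better language-model score outweighs its
# worse translation-model score under these particular weights.
best = max([hyp_a, hyp_b], key=lambda h: loglinear_score(weights, h))
```

Tuning procedures such as MERT then adjust the weights θ so that the argmax under this score agrees with the reference translations as often as possible.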
2.1.3 Evaluation
Automatic metrics for evaluating translation quality have helped the field progress rapidly, en-
abling rapid experimentation and comparable quantitative analysis of results across systems.
Translation output quality evaluation remains in itself an active area of research.
Throughout this work we measure machine translation output quality using the Bilingual
Evaluation Understudy (BLEU) metric (Papineni et al., 2001). Despite known issues (Chiang
et al., 2008a), it is currently the de facto standard automatic metric in the literature for measuring output quality. The BLEU metric is a precision-based metric that looks for phrasal matches among a given set of reference, or correct, translations.[1]
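The core of BLEU's phrasal matching is modified (clipped) n-gram precision. The following is an illustrative sketch of that single component only, not the full metric, which combines several n-gram orders and a brevity penalty:

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram's count is clipped
    to its count in the reference before matching."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(1, sum(cand.values()))

# Clipping stops a degenerate candidate from being rewarded for
# repeating a correct word: only one "the" in the reference can match.
p1 = clipped_precision("the the the cat", "the cat sat", n=1)
```

Here the candidate matches two of its four unigrams ("the" once, clipped, plus "cat"), giving a modified unigram precision of 0.5.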
2.1.4 Word alignment
A word alignment denotes a translational correspondence between two sequences of different
languages. Although there are increasingly different notions of alignment for translation (e.g.
string-to-tree alignment (Pauls & Klein, 2010)), this dissertation largely deals with word align-
ment – denoting translational correspondence between two sequences of words.
Visualizing an alignment A word alignment may be represented in several ways. We describe
two equivalent representations:
• An undirected bipartite graph, with vertices of the graph being words in each language.
Each edge in the graph connects a word in one language to a word in another. There are
no edges between words in the same language. No other structural assumptions are made.
In particular, some words may be unaligned (having degree 0), and some words may be connected to two or more words in the opposite language. In the translation task, one language is called the source language; the other the target language. We translate from the source language into the target language.

• An adjacency matrix, in which the rows of the matrix correspond to individual words in the target language, and the columns of the matrix correspond to individual words in the source language (or vice-versa). Cells in the matrix M_ij are filled with a 1 if there is an alignment link between target-word i and source-word j; 0 otherwise. Figure 2.2 shows an example.

Figure 2.2: Matrix representation of a word alignment used throughout this dissertation. In this figure, filled cells in the matrix denote a translational correspondence between English and Chinese words.

[1] In practice, generally between 1–4 reference translations are provided.
Throughout the rest of this dissertation when we visualize an alignment, we opt for the latter
representation.
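The two representations carry the same information. A small sketch converting a set of links (the edges of the bipartite graph) into the 0/1 matrix view; the sentence lengths and links are toy values chosen for illustration:

```python
def links_to_matrix(links, n_target, n_source):
    """Build the adjacency-matrix view of an alignment from its link set.

    links: set of (i, j) pairs linking target-word i to source-word j.
    """
    m = [[0] * n_source for _ in range(n_target)]
    for i, j in links:
        m[i][j] = 1
    return m

# A toy 3x3 alignment: words 0 and 1 cross, word 2 aligns monotonically.
links = {(0, 1), (1, 0), (2, 2)}
matrix = links_to_matrix(links, n_target=3, n_source=3)
```

An unaligned word shows up as an all-zero row or column, and a word linked to two or more words in the opposite language fills multiple cells in its row or column.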
Bilingual alignment serves as an integral step in, and the foundation of, building any state-of-the-art statistical machine translation system. It enables us to automatically learn and extract translation rules from hundreds of millions of words of bilingual text.
In this dissertation, we recognize that word-to-word translation models (Brown et al., 1993; Germann et al., 2001) are no longer state of the art; translation models have become much more complex, while word alignment models have largely not kept pace. We also recognize that there
is now a wealth of new data and information sources that were not readily available twenty years
ago. We now have syntactic treebanks, accurate parsers, morphological analyzers, and a great
deal of human-aligned “gold-standard” alignments.
We focus in this thesis on models that can easily and efficiently make use of this information.
Because of the difficulty of incorporating new information into existing generative models, we
propose a discriminative framework for learning alignments and encode our ideas as to what
makes a good alignment as features in the model. We train our models with human-annotated
alignments, which we treat as a gold standard. We show empirically that our novel models and
inference algorithms developed in the subsequent chapters give rise to translation systems with
significantly better output quality.
More formally, in our work, we use a linear model parameterized by some vector θ, learned discriminatively; we wish to find the best alignment between e and f under the model P_θ(a | f, e). The framework we present in the next chapter provides an umbrella under which we can work on search, modeling, and learning with respect to word alignment.
But first, we begin by discussing current issues which we aim to address with our work.
2.2 Current Issues
Current bilingual word alignment systems exhibit suboptimal qualities. In this dissertation, we identify several current deficiencies and propose to eliminate these problems. In achieving this goal, we develop new models of bilingual alignment and efficient search algorithms for working with such models.
2.2.1 Consistency with Translation Model
Given a parallel corpus, the IBM Models of word alignment (Brown et al., 1993) all emit a word-
to-word bilingual alignment and have led to the development of word-to-word decoding algo-
rithms for translation, e.g. (Germann et al., 2001). Yet, as we alluded to in the previous section,
later advances in translation modeling and decoding, including phrase-based models (Koehn,
2004), hierarchical phrase-based models (Chiang, 2005), and syntax-based models, e.g. (Galley
et al., 2004; Galley et al., 2006), all use much more complex models than used by Germann et
al., taking into account more and more context. All use word-to-word alignments as a first step
in extracting the translation rules that will later be used to decode. Thus, bilingual alignment has
fallen out of sync with the translation model. The IBM Models, and specifically the implemen-
tation offered in the open-source toolkit GIZA++ (Al-Onaizan et al., 1999; Och & Ney, 2003),
remain the de-facto standard for alignment – from those interested in cross-lingual information
retrieval, to researchers and practitioners building large-scale MT systems.
In this work, we employ well-studied ideas from both k-best parsing (Klein & Manning,
2001; Huang & Chiang, 2005; Huang, 2008) as well as forest-based and hierarchical phrase-
based translation (Huang & Chiang, 2007; Chiang, 2007), and apply them in a novel way to the
alignment problem. We treat bilingual alignment as a parsing problem, using CKY-style chart
parsing to combine partial alignment hypotheses as we move bottom-up along a syntactic parse
tree to construct the full alignment. As in related work in decoding and parse-forest reranking,
our search algorithm emits an acyclic hypergraph structure and allows us to make use of local,
nonlocal, and global features. It is in the search and generation of alignments that we have thus
far brought the alignment problem in line with state-of-the-art decoding algorithms. A continued
focus of future work is the modeling aspect – making the connection, at alignment time, between
translation rules used in decoding and the alignments that yield such translation rules.
Figure 2.3 shows four different types of rulesets for four different types of translation models.
The rules shown in the figure are rules that could be used under different formalisms to translate
the Chinese phrase 在 事发 后 into the English after the event.
Word-based models Word-based models are essentially word-replacement translation models. The rules depicted here represent word-replacement operations. The second and fourth rules indicate that in order to translate this phrase correctly, we will need to have learned that, for this model, it's a good idea to delete 在 and insert "the." Knight (1999) shows, with reductions to the Hamilton Circuit Problem (Hopcroft & Ullman, 1979) and the Minimum Set Cover
[Figure 2.3 content: example rules translating 在 事发 后 into "after the event" under each formalism — word-based (Germann et al., 2001; Och, 2001), phrase-based (Koehn et al., 2003; Och and Ney, 2004), hierarchical phrase-based (Chiang, 2005, 2007), and syntax-based (Galley et al., 2004, 2006).]
Figure 2.3: Four different rulesets for four different decoders.
Problem (Garey & Johnson, 1979), that complexity for translation with word-based models is
NP-complete due to the combinatorial interaction of translation and word-reordering. Even this
basic model of translation is computationally hard.
Phrase-based models Phrase-based models generalize word-based models to substitute, in-
sert, and delete entire phrases. This model can learn to translate the entire phrase 在 事发 后
into English in one chunk.
Hierarchical phrase-based models These models permit phrases that contain subphrases,
allowing us to further generalize rules in the translation grammar. In the example in Figure 2.3,
we make use of a rule that makes the generalization that the first and last Chinese words translate
into the English word after. The text within this Chinese circumfix is translated into the object
of after, namely the event.
Syntax-based models These models make use of syntactic parse trees on the source or target
side of the text. In our work, we use a string-to-tree system that, e.g. reads in a Chinese string and
outputs an English parse tree. In this case, for example, the system knows when it is constructing
specific types of syntactic phrases, and uses this information to make more informed lexical and
reordering choices. For example, the Chinese word 出现 may manifest as English appear when
part of a verb phrase, or as English appearance when part of a noun phrase. Likewise, as in the
example in Figure 2.3, the system learns that objects of English prepositions appear after the
preposition itself.
2.2.2 Linguistic assumptions
A significant problem with most approaches to word-to-word alignment, as Fraser and Marcu (2007) point out, is that they are based on false linguistic assumptions. For reasons of tractability, the IBM models assume that a single word in one language translates to zero or more words in another language. Users generally deal with this shortcoming with a range of post-processing heuristics.
Empirically, the restriction to one-to-many alignment structures means that almost 10%
of Arabic-English alignments cannot be modeled; for Chinese-English, over 16%. We show
in (Riesa & Marcu, 2010) that our model is not forced to emit alignments that are subject to the
one-to-many constraint, and can theoretically model many-to-many alignments.
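The restriction is easy to state operationally: in a one-to-many alignment, each word on the "many" side may link to at most one word on the "one" side. A small illustrative check (a hypothetical helper, not code from this work) of whether a gold alignment is even representable under that constraint:

```python
# Sketch (illustrative, not the dissertation's code): test whether a set of
# links (i, j) respects the one-to-many restriction, i.e. each j-side word
# links to at most one i-side word. Gold alignments failing this test
# cannot be produced by the IBM models in that direction; swapping the
# roles of i and j tests the opposite direction.

from collections import Counter

def is_one_to_many(links):
    counts = Counter(j for _, j in links)
    return all(c <= 1 for c in counts.values())

# A many-to-many gold alignment: j = 0 links to both i = 0 and i = 1.
assert not is_one_to_many({(0, 0), (1, 0), (1, 1)})
# i = 0 linking to several j's is permitted in this direction.
assert is_one_to_many({(0, 0), (0, 1), (1, 2)})
```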
2.2.3 Efficiency
Using GIZA++ (Al-Onaizan et al., 1999; Och & Ney, 2003) to train the commonly used IBM Model-4 generally takes over six days, or 150 hours.2 This poor scalability is a severe drag on the experimental cycle of statistical machine translation research, limiting progress in the field. Each time we modify our input data in any way, we must wait another week to generate alignments. We seek an alignment model that is fast and efficient to train, and one that can be learned online – we wish to avoid having to retrain over all the data every time new data becomes available.
The nature of the search algorithm developed in our alignment work allows for simple parallelization of the search process, and was developed with speed and efficiency in mind. Because of the tree structure of the search space, every node or state in the search space can be computed independently of its sister states. We can distribute independent computations across multiple CPUs. Likewise, we learn parameters with an online algorithm for structured prediction: a parallelized version of averaged perceptron modified for structured outputs (Collins, 2002). We
distribute training examples at random to a large, fixed number of nodes in a cluster, each node
maintaining its own perceptron learner. After each epoch over the training data, we average
the updated parameters from each perceptron and each starts a new training epoch with the new
averaged parameter vector.
It turns out that this learning procedure will converge in the same neighborhood regardless of the number of CPUs used during computation. McDonald et al. (2010)
2 This figure is for large bilingual corpora on the order of hundreds of millions of words. Corpora of this size are currently used in state-of-the-art machine translation systems.
have concurrently shown that the strategy outlined above is a good one. Finally, while we
have focused on making search and learning easily distributable over a large cluster, we also
demonstrate that these conditions hold when we scale from hundreds of features to hundreds
of thousands of features, while increasing the amount of training data to estimate these new
parameters accordingly. More on this topic in Section 3.5.
2.2.4 Beyond unsupervised learning – automatically acquiring features for word
alignment and translation
Current generative alignment models cannot easily exploit extra sources of knowledge apart
from a single parallel corpus. Today, human-annotated gold alignments are available for several
language pairs. Accurate natural language parsers, e.g. (Petrov et al., 2006), provide syntactic
analyses. Morphological analyzers, e.g. (Buckwalter, 2004; Riesa & Yarowsky, 2006), provide
subword-level clues to lexical function and meaning. How can we learn from and take advantage
of all of this new data at our fingertips?
We encode information from multiple knowledge sources as features in a discriminative
model – lexical probabilities, syntactic parse trees, shallow morphological information, and joint
syntactic/alignment patterns as seen in gold data. In Chapter 3 we discuss modeling alignments
based on features encoding target-side English syntactic and lexical information, and only lexical
information on the source side. We go further in Chapter 4 – there is much more data waiting to be exploited, especially in the context of source-side features. This chapter focuses on incorporating more source-side features, and jointly modeling their interaction with target-side features.
2.3 Alignment for other tasks
Apart from translation, word alignments are also used for other language-related tasks, from
Cross-Lingual Information Retrieval (Hiemstra & de Jong, 1999; Xu et al., 2001) to Annotation
Projection (Yarowsky et al., 2001; Hwa et al., 2002) and bilingual dictionary induction. Here
we propose to use alignment to distill sentence fragments from within noisy data.
The benefits of succeeding here are at least two-fold: we can be rid of noisy data inside of
our parallel training corpus, and be able to find new parallel data with which to augment our
training corpus. Both are difficult tasks in and of themselves, and both have implications for the
translation task we are largely concerned about in this thesis.
Chapter 3
Hierarchical Search for Word Alignment
In this chapter, we present a simple yet powerful hierarchical search algorithm for automatic
word alignment. Our algorithm induces a forest of alignments from which we can efficiently
extract a ranked k-best list. We score a given alignment within the forest with a flexible, lin-
ear discriminative model incorporating hundreds of features, and trained on a relatively small
amount of annotated data. We report results on Arabic-English word alignment and translation
tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.
3.1 Introduction
Automatic word alignment is generally accepted as a first step in training any statistical machine
translation system. It is a vital prerequisite for generating translation tables, phrase tables, or
syntactic transformation rules. Generative alignment models like IBM Model-4 (Brown et al.,
1993) have been in wide use for over 15 years, and while not perfect (see Figure 3.1), they
Figure 3.1: Model-4 alignment vs. a gold standard. Circles represent links in a human-annotated
alignment, and black boxes represent links in the Model-4 alignment. Bold gray boxes show
links gained after fully connecting the alignment.
are completely unsupervised, requiring no annotated training data to learn alignments that have powered many current state-of-the-art translation systems.
Today, there exist human-annotated alignments and an abundance of other information for
many language pairs potentially useful for inducing accurate alignments. How can we take ad-
vantage of all of this data at our fingertips? Using feature functions that encode extra informa-
tion is one good way. Unfortunately, as Moore (2005) points out, it is usually difficult to extend
a given generative model with feature functions without changing the entire generative story.
This difficulty has motivated much recent work in discriminative modeling for word alignment
(Moore, 2005; Ittycheriah & Roukos, 2005; Liu et al., 2005; Taskar et al., 2005b; Blunsom &
Cohn, 2006; Lacoste-Julien et al., 2006; Moore et al., 2006).
We present here a discriminative alignment model trained on relatively little data, with a
simple, yet powerful hierarchical search procedure. We borrow ideas from both k-best parsing (Klein & Manning, 2001; Huang & Chiang, 2005; Huang, 2008) and forest-based and hierarchical phrase-based translation (Huang & Chiang, 2007; Chiang, 2007), and apply them to word alignment.
Using a foreign string and an English parse tree as input, we formulate a bottom-up search on
the parse tree, with the structure of the tree as a backbone for building a hypergraph of possible
alignments. Our algorithm yields a forest of word alignments, from which we can efficiently
extract the k-best. We handle an arbitrary number of features, compute them efficiently, and
score alignments using a linear model. We train the parameters of the model using averaged
perceptron (Collins, 2002) modified for structured outputs, but can easily fit into a max-margin
or related framework. Finally, we use relatively little training data to achieve accurate word
alignments. Our model can generate arbitrary alignments and learn from arbitrary gold align-
ments.
3.2 Word Alignment as a Hypergraph
Algorithm input The input to our alignment algorithm is a sentence pair (e_1^n, f_1^m) and a parse tree over one of the input sentences. In this work, we parse our English data, and for each sentence E = e_1^n, let T be its syntactic parse. To generate parse trees, we use the Berkeley parser (Petrov et al., 2006), and use Collins head rules (Collins, 2003) to head-out binarize each tree.
Figure 3.2: Example of approximate search through a hypergraph with beam size = 5. Each black square implies a partial alignment. Each partial alignment at each node is ranked according to its model score. In this figure, we see that the partial alignment implied by the 1-best hypothesis at the leftmost NP node is constructed by composing the best hypothesis at the terminal node labeled "the" and the 2nd-best hypothesis at the terminal node labeled "man". (We ignore terminal nodes in this toy example.) Hypotheses at the root node imply full alignment structures.
Overview We present a brief overview here and delve deeper in Section 3.3. Word alignments are built bottom-up on the parse tree. Each node v in the tree holds partial alignments sorted by score. Each partial alignment comprises the columns of the alignment matrix for the e-words spanned by v, and each is scored by a linear combination of feature functions. See Figure 3.2 for a small example.
Initial partial alignments are enumerated and scored at preterminal nodes, each spanning a single column of the word alignment matrix. To speed up search, we can prune at each node, keeping a beam of size k. In the diagram depicted in Figure 3.2, the beam is size k = 5.
From here, we traverse the tree nodes bottom-up, combining partial alignments from child nodes until we have constructed a single full alignment at the root node of the tree. If we are interested in the k-best, we continue to populate the root node until we have k alignments.1
We use one set of feature functions for preterminal nodes, and another set for nonterminal nodes. This is analogous to the local and nonlocal feature functions for parse-reranking used by Huang (2008). Using nonlocal features at a nonterminal node incurs a combination cost for composing a set of child partial alignments.
Because combination costs come into play, we use cube pruning (Chiang, 2007) to approximate the k-best combinations at some nonterminal node v. Inference is exact when only local features are used.
1 We use approximate dynamic programming to store alignments, keeping only scored lists of pointers to initial single-column spans. Each item in the list is a derivation that implies a partial alignment.
Assumptions There are certain assumptions related to our search algorithm that we must make: (1) that using the structure of 1-best English syntactic parse trees is a reasonable way to frame and drive our search, and (2) that F-measure approximately decomposes over hyperedges.
We perform an oracle experiment to validate these assumptions. We find the oracle for a given (T, e, f) triple by proceeding through our search algorithm, forcing ourselves to always select correct links with respect to the gold alignment when possible, breaking ties arbitrarily. The F_1 score of our oracle alignment is 98.8%, given this "perfect" model.
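The F_1 score here is computed over alignment links, treating the hypothesis and gold alignments as sets. A minimal sketch of that computation (illustrative, not the evaluation code used in this work):

```python
# Sketch: precision, recall, and F1 between a hypothesis alignment and a
# gold alignment, both given as sets of (i, j) links. This is the standard
# link-level F-measure used to score alignments against a gold standard.

def alignment_f1(hyp, gold):
    if not hyp or not gold:
        return 0.0
    correct = len(hyp & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(hyp)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 0), (1, 1), (2, 2)}
hyp = {(0, 0), (1, 1), (2, 1)}   # 2 of 3 hypothesis links are correct
score = alignment_f1(hyp, gold)  # precision = recall = 2/3, so F1 = 2/3
```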
3.3 Hierarchical search
Initial alignments We can construct a word alignment hierarchically, bottom-up, by making use of the structure inherent in syntactic parse trees. We can think of building a word alignment as filling in an M×N matrix (Figure 3.1), and we begin by visiting each preterminal node in the tree. Each of these nodes spans a single e word (Line 2 in Algorithm 1).
From here we can assign links from each e word to zero or more f words (Lines 6–14). At this level of the tree the span size is 1, and the partial alignment we have made spans a single column of the matrix. We can make many such partial alignments depending on the links selected. Lines 5 through 9 of Algorithm 1 enumerate either the null alignment, single-link alignments, or two-link alignments. Each partial alignment is scored and stored in a sorted heap (Lines 9 and 13).
Algorithm 1: Hypergraph Alignment
Input:
    Source sentence e_1^n
    Target sentence f_1^m
    Parse tree T over e_1^n
    Set of feature functions h
    Weight vector w
    Beam size k
Output:
    A k-best list of alignments over e_1^n and f_1^m

1   function Align(e_1^n, f_1^m, T)
2     for v ∈ T in bottom-up order do
3       α_v ← ∅
4       if is-PreterminalNode(v) then
5         i ← index-of(v)
6         for j = 0 to m do
7           links ← {(i, j)}
8           score ← w · h(links, v, e_1^n, f_1^m)
9           Push(α_v, ⟨score, links⟩, k)
10          for k = j + 1 to m do
11            links ← {(i, j), (i, k)}
12            score ← w · h(links, v, e_1^n, f_1^m)
13            Push(α_v, ⟨score, links⟩, k)
14          end
15        end
16      else
17        α_v ← GrowSpan(children(v), k)
18      end
19    end
20  end
21  function GrowSpan(⟨u_1, u_2⟩, k)
22    return CubePruning(⟨α_{u_1}, α_{u_2}⟩, k, w, h)
23  end
In practice enumerating all two-link alignments can be prohibitive for long sentence pairs; we set a practical limit and score only pairwise combinations of the top n = max{|f|/2, 10} scoring single-link alignments.
We limit the number of total partial alignments α_v kept at each node to k. If at any time we wish to push onto the heap a new partial alignment when the heap is full, we pop the current worst off the heap and replace it with our new partial alignment if its score is better than the current worst.
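This bounded Push operation is naturally implemented with a size-k min-heap whose top element is the current worst survivor. A sketch (illustrative names, not the dissertation's code) using Python's heapq, with higher model scores being better:

```python
# Sketch of the bounded Push(alpha_v, <score, links>, k) used in
# Algorithm 1: keep at most k partial alignments per node, evicting the
# current worst when a better-scoring hypothesis arrives. Replacing the
# worst item is O(log k).

import heapq

def push(heap, score, links, k):
    """heap is a min-heap of (score, links); the worst item is heap[0]."""
    if len(heap) < k:
        heapq.heappush(heap, (score, links))
    elif score > heap[0][0]:
        heapq.heapreplace(heap, (score, links))  # pop worst, push new

alpha_v = []
for score in [2.0, 5.0, 1.0, 4.0]:
    push(alpha_v, score, links=(), k=2)
# Only the two best-scoring hypotheses survive: 4.0 and 5.0.
```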
        u11   u12   u13
u21     2.2   4.1   5.5
u22     2.4   3.5   7.2
u23     3.2   4.5  11.4

(The same grid is shown at three stages of expansion.)
(a) Score the left corner alignment first. Assume it is the 1-best. Numbers in the rest of the boxes are hidden at this point.
(b) Expand the frontier of alignments. We are now looking for the 2nd best.
(c) Expand the frontier further. After this step we have our top k alignments.

Figure 3.3: Cube pruning with alignment hypotheses to select the top-k alignments at node v with children ⟨u_1, u_2⟩. In this example, k = 3. Each box represents the combination of two partial alignments to create a larger one. The score in each box is the sum of the scores of the child alignments plus a combination cost.
Building the hypergraph We now visit internal nodes (Line 16) in the tree in bottom-up order. At each nonterminal node v we wish to combine the partial alignments of its children u_1, ..., u_c. We use cube pruning (Chiang, 2007; Huang & Chiang, 2007) to select the k-best combinations of the partial alignments of u_1, ..., u_c (Line 19). Note that Algorithm 1 assumes a binary tree,2 but
2 We find empirically that using binarized trees reduces search errors in cube pruning.
Figure 3.4: Correct version of Figure 3.1 after hypergraph alignment. Subscripts on the nonterminal labels denote the branch containing the head word for that span.
is not necessary. In the general case, cube pruning will operate on a d-dimensional hypercube, where d is the branching factor of node v.
We cannot enumerate and score every possibility; without the cube pruning approximation, we will have k^c possible combinations at each node, exploding the search space exponentially. Figure 3.3 depicts how we select the top-k alignments at a node v from its children ⟨u_1, u_2⟩.
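The frontier expansion of Figure 3.3 can be sketched as a best-first search over the grid of child-hypothesis combinations: pop the best unexplored cell, emit it, and push its right and lower neighbors. The sketch below (illustrative, not the dissertation's implementation) treats lower scores as better, as in the figure; with nonzero combination costs the grid is not strictly monotonic, so the result is approximate, exactly as discussed in the text.

```python
# Sketch of cube-pruning-style frontier search over two child hypothesis
# lists, each sorted best (lowest cost) first. Instead of scoring all
# |u1| x |u2| combinations, we lazily expand neighbors of popped cells.

import heapq

def cube_top_k(costs_u1, costs_u2, combo_cost, k):
    """combo_cost(i, j): extra cost of combining u1[i] with u2[j].
    Returns up to k combined costs, best (lowest) first."""
    def cost(i, j):
        return costs_u1[i] + costs_u2[j] + combo_cost(i, j)
    seen = {(0, 0)}
    frontier = [(cost(0, 0), 0, 0)]      # min-heap keyed on combined cost
    out = []
    while frontier and len(out) < k:
        c, i, j = heapq.heappop(frontier)
        out.append(c)
        for ni, nj in ((i + 1, j), (i, j + 1)):   # expand the frontier
            if ni < len(costs_u1) and nj < len(costs_u2) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (cost(ni, nj), ni, nj))
    return out

# With zero combination cost the result is exactly the k best sums.
top = cube_top_k([1.0, 3.0], [1.5, 2.0], lambda i, j: 0.0, k=3)
```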
3.4 Discriminative training
We incorporate all our new features into a linear model and learn weights for each using the
online averaged perceptron algorithm (Collins, 2002) with a few modifications for structured
outputs inspired by Chiang et al. (2008b). We define:
γ(y) = ℓ(y_i, y) + w · (h(y_i) − h(y))     (3.1)

where ℓ(y_i, y) is a loss function describing how bad it is to guess y when the correct answer is y_i. In our case, we define ℓ(y_i, y) as 1 − F_1(y_i, y). We select the oracle alignment according to:

y+ = argmin_{y ∈ cand(x)} γ(y)     (3.2)

where cand(x) is a set of hypothesis alignments generated from input x. Instead of the traditional oracle, which is calculated solely with respect to the loss ℓ(y_i, y), we choose the oracle that jointly minimizes the loss and the difference in model score to the true alignment. Note that Equation 3.2 is equivalent to maximizing the sum of the F-measure and model score of y:

y+ = argmax_{y ∈ cand(x)} (F_1(y_i, y) + w · h(y))     (3.3)

Let ŷ be the 1-best alignment according to our model:

ŷ = argmax_{y ∈ cand(x)} w · h(y)     (3.4)
Then, at each iteration our weight update is:

w ← w + η (h(y+) − h(ŷ))     (3.5)

where η is a learning rate parameter.3 We find that this more conservative update gives rise to a much more stable search. After each iteration, we expect y+ to get closer and closer to the true y_i.
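Equations 3.1–3.5 translate directly into a short update routine. A minimal sketch with sparse feature vectors as dicts (all names are illustrative, not from this work):

```python
# Sketch of the modified perceptron update of Equations 3.1-3.5: choose
# the oracle y+ that jointly maximizes F-measure and model score
# (Eq. 3.3), find the model-best y-hat (Eq. 3.4), and update w toward
# the oracle (Eq. 3.5).

def dot(w, h):
    return sum(w.get(feat, 0.0) * val for feat, val in h.items())

def perceptron_update(w, candidates, features, f1, eta=1.0):
    """candidates: hypothesis alignments (e.g. a k-best list);
    features(y): sparse feature vector h(y); f1(y): F1 against gold y_i."""
    y_plus = max(candidates, key=lambda y: f1(y) + dot(w, features(y)))
    y_hat = max(candidates, key=lambda y: dot(w, features(y)))
    w = dict(w)
    for feat, val in features(y_plus).items():
        w[feat] = w.get(feat, 0.0) + eta * val
    for feat, val in features(y_hat).items():
        w[feat] = w.get(feat, 0.0) - eta * val
    return w

# Toy run: candidate "b" scores best under the model, but "a" is closer
# to the gold alignment, so the update shifts weight away from "b".
feats = {"a": {"x": 1.0}, "b": {"x": 2.0}}
gold_f1 = {"a": 1.0, "b": 0.0}
w = perceptron_update({"x": 0.5}, ["a", "b"], feats.get, gold_f1.get)
```

When y+ and ŷ coincide, the update is a no-op, which is what makes the update conservative.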
Model Selection Among models from iteration 1 up to convergence, we select as our model for
alignment the best performing model as measured by F-measure on a held-out set of alignments.
3.5 Parallelization
The nature of the search algorithm developed in this alignment work allows for simple paral-
lelization of the search process. Because of the tree structure of the search space, alignment
generation and scoring at each node or state in the search space can be computed independently
of its sister nodes. We can distribute independent computations across multiple CPUs. Likewise,
we learn parameters with an online algorithm for structured prediction: a parallelized version
of averaged perceptron modified for structured outputs (Collins, 2002). We distribute training
examples at random to a large, fixed number of nodes in a cluster, each node maintaining its own
perceptron learner. After each epoch over the training data, we average the updated parameters
3 In initial versions of published work relating to experiments discussed in this chapter, we set η to 0.05 in our experiments. In our later work, including the experiments presented in the following chapter, we find that setting the learning rate η to 1 suffices.
[Figure 3.5 content: training examples are randomly divided among 40 processors, each running its own perceptron learner; at the end of each epoch, the individual weight vectors are sent to a master node, and a new averaged weight vector w′ is rebroadcast to all nodes.]
Figure 3.5: Depiction of the parallelization used in this work for averaged perceptron learning.
A master node averages the parameter vectors received from each perceptron, and each starts a
new training epoch with the new averaged parameter vector.
It turns out that this learning procedure converges in the same neighborhood regardless of the
number of CPUs used during computation, though time-to-convergence will vary. McDonald
et al. (2010) have concurrently shown that the strategy outlined above is a good one. Finally,
while we have focused on making search and learning easily distributable over a large cluster,
we also demonstrate that these conditions hold when we scale from hundreds of features to
hundreds of thousands of features, while increasing the amount of training data to estimate these
new parameters accordingly.

Figure 3.6: Learning curves for 10 random restarts over time for parallel averaged perceptron
training. These plots show the current F-measure on the training set as time passes. Perceptron
training here is quite stable, converging to the same general neighborhood each time.
It is important to note that the framework we have presented affords us multiple axes of
parallelization. It is possible to parallelize work across training examples, giving each parallel
learner a subset of the data in any given pass over the training corpus. It is also possible to
parallelize computation within each training example in at least two ways: (1) Since decisions
made at each node of the parse tree are made independently of any sister nodes, we can distribute
computation of any partial alignment items whose corresponding treespan is not a subset of
another. (2) We can also distribute computation of feature values across the learners in our pool.
Although any combination of these axes of parallelization is possible, in this work, for sim-
plicity, we distribute whole training examples to single nodes. If there are k CPUs available,
there will be k perceptron learners running in parallel learning from k random disjoint subsets
of the data.
Figure 3.5 depicts the case for a single epoch when k = 10. Each parallel learner receives
a random disjoint subset of the training corpus. Each returns a new parameter vector and sends
it back to a master. In Figure 3.5, the master is CPU 0. A single average is computed across
all newly learned parameter vectors and redistributed to each of the k learners to start the next
epoch. This is essentially the averaged perceptron of Collins (2002). McDonald et al. (2010)
call this strategy Iterative Parameter Mixing, and concurrently show that this is the optimal
strategy over several other training strategies for the perceptron.
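This iterative parameter mixing loop can be sketched in a few lines of Python. The shard format and the dictionary-based feature vectors below are illustrative stand-ins for the aligner's actual data structures, and the 1-best prediction is assumed to be precomputed rather than produced by hypergraph search:

```python
def perceptron_epoch(weights, shard):
    """One epoch of structured-perceptron updates over one shard.

    Each example pairs the gold feature vector with the feature vector
    of the model's 1-best prediction; the standard update adds the
    former and subtracts the latter.
    """
    w = dict(weights)
    for gold_feats, pred_feats in shard:
        for f, v in gold_feats.items():
            w[f] = w.get(f, 0.0) + v
        for f, v in pred_feats.items():
            w[f] = w.get(f, 0.0) - v
    return w


def iterative_parameter_mixing(shards, epochs):
    """After each epoch, average the per-shard weight vectors and
    rebroadcast the average as every learner's starting point."""
    weights = {}
    for _ in range(epochs):
        learned = [perceptron_epoch(weights, shard) for shard in shards]
        keys = {f for w in learned for f in w}
        weights = {f: sum(w.get(f, 0.0) for w in learned) / len(learned)
                   for f in keys}
    return weights
```

In the distributed setting, each call to perceptron_epoch runs on its own CPU, and only the weight vectors travel between the learners and the master.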
Convergence criterion At the end of each epoch we compute F-measure on a heldout set of
data. If the change in accuracy is small and within δ for at least e epochs, then we say that we
have converged and learning is terminated. In our experiments, we use δ = 0.05 and e = 5.
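A minimal sketch of this stopping rule, under the stated settings δ = 0.05 and e = 5 (the function name and history-list interface are our own):

```python
def has_converged(f_history, delta=0.05, patience=5):
    """True once heldout F-measure has changed by less than delta
    between consecutive epochs for at least `patience` epochs."""
    if len(f_history) < patience + 1:
        return False
    recent = f_history[-(patience + 1):]
    return all(abs(b - a) < delta for a, b in zip(recent, recent[1:]))
```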
3.6 Features
Our simple, flexible linear model makes it easy to throw in many features, mapping a given
complex alignment structure into a single high-dimensional feature vector. Our hierarchical
search framework allows us to compute these features when needed, and affords us extra useful
syntactic information.
We use two classes of features: local and nonlocal. Huang (2008) defines a feature h to
be local if and only if it can be factored among the local productions in a tree, and non-local
otherwise. Analogously for alignments, our class of local features are those that can be factored
among the local partial alignments competing to comprise a larger span of the matrix, and
nonlocal otherwise. These features score a set of links and the words connected by them.
Feature development Our features are inspired by analysis of patterns contained among our
gold alignment data and automatically generated parse trees. We use both local lexical and
nonlocal structural features as described below.
3.6.1 Local features
These features fire on single-column spans.
• From the output of GIZA++ Model 4, we compute lexical probabilities p(e | f) and p(f | e),
as well as a fertility table ϕ(e). From the fertility table, we fire features ϕ0(e), ϕ1(e),
and ϕ2+(e) when a word e is aligned to zero, one, or two or more words, respectively.
Lexical probability features p(e | f) and p(f | e) fire when a word e is aligned to a word
f.
• Based on these features, we include a binary lexical-zero feature that fires if both p(e | f)
and p(f | e) are equal to zero for a given word pair (e, f). Negative weights essentially
penalize alignments with links never seen before in the Model 4 alignment, and positive
weights encourage such links. We employ a separate instance of this feature for each
English part-of-speech tag: p(f | e, t).

We learn a different feature weight for each. Critically, this feature tells us how much
to trust alignments involving nouns, verbs, adjectives, function words, punctuation, etc.
from the Model 4 alignments from which our p(e | f) and p(f | e) tables are built.
Table 3.1 shows a sample of learned weights. Intuitively, alignments involving English
parts-of-speech more likely to be content words (e.g. NNPS, NNS, NN) are more trust-
worthy than those likely to be function words (e.g. TO, RP, EX), where the use of such
words is often radically different across languages.

        Penalty
NNPS    −1.11
NNS     −1.03
NN      −0.80
NNP     −0.62
VB      −0.54
VBG     −0.52
JJ      −0.50
JJS     −0.46
VBN     −0.45
...
POS     −0.0093
EX      −0.0056
RP      −0.0037
WP$     −0.0011
TO       0.037
        Reward

Table 3.1: A sampling of learned weights for the lexical-zero feature. Negative weights penalize
links never seen before in a baseline alignment used to initialize lexical p(e | f) and p(f | e)
tables. Positive weights outright reward such links.
• We also include a measure of distortion. This feature returns the distance to the diagonal
of the matrix for any link in a partial alignment. If there is more than one link, we return
the distance of the link farthest from the diagonal.
• As a lexical backoff, we include a tag probability feature, p(t | f), that fires for some
link (e, f) if the part-of-speech tag of e is t. The conditional probabilities in this table are
computed from our parse trees and the baseline Model 4 alignments.
• We find that binary identity and punctuation-mismatch features are important. The
binary identity feature fires if e = f, and proves useful for untranslated numbers, symbols,
names, and punctuation in the data. Punctuation-mismatch fires on any link that causes
nonpunctuation to be aligned to punctuation.
Additionally, we include fine-grained versions of the lexical probability, fertility, and distortion
features. These fire for each link (e, f) and part-of-speech tag. That is, we learn a separate
weight for each feature for each part-of-speech tag in our data. Given the tag of e, this affords
the model the ability to pay more or less attention to the features described above depending on
the tag given to e.
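Several of the local features above are simple functions of a single link and the baseline probability tables. A sketch follows, with hypothetical table and predicate arguments standing in for the Model 4 tables and the tokenizer:

```python
def distortion(links):
    """Distance to the matrix diagonal of the link farthest from it."""
    return max(abs(i - j) for i, j in links) if links else 0


def lexical_zero(e, f, p_e_given_f, p_f_given_e):
    """Fires when both lexical probabilities are zero, i.e. the link
    was never seen in the baseline Model 4 alignment."""
    return int(p_e_given_f.get((e, f), 0.0) == 0.0
               and p_f_given_e.get((f, e), 0.0) == 0.0)


def identity(e, f):
    """Fires for untranslated numbers, symbols, names, punctuation."""
    return int(e == f)


def punctuation_mismatch(e, f, is_punct):
    """Fires when nonpunctuation is aligned to punctuation."""
    return int(is_punct(e) != is_punct(f))
```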
Arabic-English specific features We describe here language-specific features we implement
to exploit shallow Arabic morphology.
• We observe the Arabic prefix و, transliterated w- and generally meaning and, to prepend
to most any word in the lexicon, so we define features p:w(e | f) and p:w(f | e). If
f begins with w-, we strip off the prefix and return the values of p(e | f) and p(f | e).
Otherwise, these features return 0.
• We also include analogous feature functions for several functional and pronominal pre-
fixes and suffixes.
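A sketch of the w- backoff feature over transliterated tokens (the table argument stands in for the baseline p(e | f) table; the real implementation operates on Arabic script):

```python
def p_strip_w(e, f, p_e_given_f):
    """Back off to the lexical probability of f with the conjunction
    prefix w- removed; return 0 when f carries no w- prefix."""
    if f.startswith("w-"):
        return p_e_given_f.get((e, f[2:]), 0.0)
    return 0.0
```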
Figure 3.7: Features PP-NP-head, NP-DT-head, and VP-VP-head fire on these tree-alignment
patterns. For example, PP-NP-head fires exactly when the head of the PP is aligned to exactly
the same f words as the head of its sister NP.
3.6.2 Nonlocal features
These features comprise the combination cost component of a partial alignment score and may
fire when concatenating two partial alignments to create a larger span. Because these features
can look into any two arbitrary subtrees, they are considered nonlocal features as defined by
Huang (2008).
Figure 3.8: This figure depicts the tree/alignment structure for which the feature PP-from-prep
fires. The English preposition “from” is aligned to the Arabic word من. Any aligned words in the
span of the sister NP are aligned to words following من. English preposition structure commonly
matches that of Arabic in our gold data. This family of features captures these observations.
• Features PP-NP-head, NP-DT-head, and VP-VP-head (Figure 3.7) all exploit head-
words on the parse tree. We observe English prepositions and determiners to often align
to the headword of their sister. Likewise, we observe the head of a VP to align to the head
of an immediate sister VP.
In Figure 3.4, when the search arrives at the left-most NPB node, the NP-DT-head fea-
ture will fire given this structure and links over the span [the ... tests]. When
search arrives at the second NPB node, it will fire given the structure and links over the
span [the ... missile], but will not fire at the right-most NPB node.
• Local lexical preference features compete with the headword features described above.
However, we also introduce nonlocal lexicalized features for the most common types of
English and foreign prepositions to also compete with these general headword features.
PP features PP-of, PP-from, PP-to, PP-on, and PP-in fire at any PP whose left child is a
preposition and right child is an NP. The head of the PP is one of the enumerated English
prepositions and is aligned to any of the three most common foreign words to which it
has also been observed aligned in the gold alignments. The last constraint on this pattern
is that all words under the span of the sister NP, if aligned, must align to words following
the foreign preposition. Figure 3.8 illustrates this pattern.
• Finally, we have a tree-distance feature to avoid making too many many-to-one (from
many English words to a single foreign word) links. This is a simplified version of and
similar in spirit to the tree distance metric used in (DeNero & Klein, 2007). For any pair
of links (e_i, f) and (e_j, f) in which the e words differ but the f word is the same token in
each, return the tree height of the first common ancestor of e_i and e_j.
This feature captures the intuition that it is much worse to align two English words at
different ends of the tree to the same foreign word, than it is to align two English words
under the same NP to the same foreign word.
To see why a string distance feature that counts only the flat horizontal distance from e_i
to e_j is not the best strategy, consider the following. We wish to align a determiner to the
same f word as its sister head noun under the same NP. Now suppose there are several
intermediate adjectives separating the determiner and noun. A string distance metric, with
no knowledge of the relationship between determiner and noun, will levy a much heavier
penalty than its tree distance analog.
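The tree-distance computation can be sketched as follows; the child-to-parent map encoding of the parse tree is our own choice for illustration:

```python
def ancestors(parent, node):
    """Chain from a node up to the root, inclusive."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def first_common_ancestor(parent, a, b):
    """Lowest node dominating both a and b."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in seen:
            return node
    return None


def height(children, node):
    """Height of a tree node: 0 for a leaf, else 1 + max child height."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(height(children, k) for k in kids)


def tree_distance(parent, e_i, e_j):
    """Tree height of the first common ancestor of two English words."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    return height(children, first_common_ancestor(parent, e_i, e_j))
```

Two words under the same NP thus incur a small value, while words at opposite ends of the tree meet only at a high (and heavily penalized) ancestor.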
3.7 Related Work
Recent work has shown the potential for syntactic information encoded in various ways to sup-
port inference of superior word alignments. Very recent work in word alignment has also started
to report downstream effects on BLEU score.
Cherry and Lin (2006) introduce soft syntactic ITG (Wu, 1997) constraints into a discrimi-
native model, and use an ITG parser to constrain the search for a Viterbi alignment. Haghighi
et al. (2009) confirm and extend these results, showing BLEU improvement for a hierarchical
phrase-based MT system on a small Chinese corpus. As opposed to ITG, we use a linguistically
                        F     P     R     Arabic/English  # Unknown
                                          BLEU            Words
M4 (union)              .665  .636  .696  45.1            2,538
M4 (grow-diag-final)    .688  .702  .674  46.4            2,262
Hypergraph alignment    .751  .780  .724  47.5            1,610

Table 3.2: F-measure, Precision, Recall, the resulting BLEU score, and number of unknown
words on a held-out test corpus for three types of alignments. BLEU scores are case-insensitive
IBM BLEU. We show a 1.1 BLEU increase over the strongest baseline, Model-4 grow-diag-final.
This is statistically significant at the p < 0.01 level.
motivated phrase-structure tree to drive our search and inform our model. And, unlike ITG-
style approaches, our model can generate arbitrary alignments and learn from arbitrary gold
alignments.
DeNero and Klein (2007) refine the distortion model of an HMM aligner to reflect tree
distance instead of string distance. Fossum et al. (2008) start with the output from GIZA++
Model-4 union, and focus on increasing precision by deleting links based on a linear discrimi-
native model exposed to syntactic and lexical information.
Fraser and Marcu (2007) take a semi-supervised approach to word alignment, using a small
amount of gold data to further tune parameters of a headword-aware generative model. They
show a significant improvement over a Model-4 union baseline on a very large corpus.
Pauls and Klein (2010) use unsupervised methods to learn a bilingual alignment for syntax-
based translation; the authors induce an alignment between source-language words and target-
language tree nodes.
Figure 3.9: Model robustness to the initial alignments from which the p(e | f) and p(f | e)
features are derived. The dotted line indicates the baseline accuracy of GIZA++ Model 4 alone.
3.8 Experiments
We evaluate our model and resulting alignments on Arabic-English data against those induced
by IBM Model-4 using GIZA++ (Och & Ney, 2003) with both the union and grow-diag-final
heuristics. We use 1,000 sentence pairs and gold alignments from LDC2006E86 to train
model parameters: 800 sentences for training, 100 for testing, and 100 as a second held-out
development set to decide when to stop perceptron training. We also align the test data using
GIZA++ [4] along with 50 million words of English.
3.8.1 Alignment Quality
We empirically choose our beam size k from the results of a series of experiments, setting k = 1,
2, 4, 8, 16, 32, and 64. We find setting k = 16 to yield the highest accuracy on our held-out test
[4] We use a standard GIZA++ training procedure: 5 iterations of Model-1, 5 iterations of
HMM, 3 iterations of Model-3, and 3 iterations of Model-4.
data. Using wider beams results in higher F-measure on training data, but those gains do not
translate into higher accuracy on held-out data.
The first three columns of Table 3.2 show the balanced F-measure, Precision, and Recall of
our alignments versus the two GIZA++ Model-4 baselines. We report an F-measure 8.6 points
higher than Model-4 union, and 6.3 points higher than Model-4 grow-diag-final.
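Balanced F-measure here treats the hypothesis and gold alignments as sets of (e, f) links; a minimal sketch:

```python
def alignment_prf(hypothesis, gold):
    """Precision, recall, and balanced F-measure over link sets."""
    hyp, ref = set(hypothesis), set(gold)
    tp = len(hyp & ref)                      # links both agree on
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```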
Figure 3.6 shows the stability of the search procedure over ten random restarts of parallel
averaged perceptron training with 40 CPUs. Training examples are randomized at each epoch,
leading to slight variations in learning curves over time but all converge into the same general
neighborhood.
Figure 3.9 shows the robustness of the model to initial alignments used to derive lexical
features p(e | f) and p(f | e). In addition to IBM Model 4, we experiment with alignments
from Model 1 and the HMM model. In each case, we significantly outperform the baseline
GIZA++ Model 4 alignments on a heldout test set.
3.8.2 Translation Quality
We align a corpus of 50 million words with GIZA++ Model-4, and extract translation rules
from a 5.4 million word core subset. We align the same core subset with our trained hypergraph
alignment model, and extract a second set of translation rules. For each set of translation rules,
we train a machine translation system and decode a held-out test corpus for which we report
results below.
We use a syntax-based translation system for these experiments. This system transforms
Arabic strings into target English syntax trees. Translation rules are extracted from (e-tree, f-
string, alignment) triples as in (Galley et al., 2004; Galley et al., 2006).
We use a randomized language model (similar to that of Talbot and Brants (2008)) of 472
million English words. We tune the parameters of the MT system on a held-out develop-
ment corpus of 1,172 parallel sentences, and test on a held-out parallel corpus of 746 parallel
sentences. Both corpora are drawn from the NIST 2004 and 2006 evaluation data, with no
overlap at the document or segment level with our training data.
Columns 4 and 5 in Table 3.2 show the results of our MT experiments. Our hypergraph
alignment algorithm allows us a 1.1 BLEU increase over the best baseline system, Model-4
grow-diag-final. This is statistically significant at the p < 0.01 level. We also report a 2.4
BLEU increase over a system trained with alignments from Model-4 union.
3.9 Discussion
We have cast alignment as a parsing problem, opening up the word alignment task to advances in
hypergraph algorithms currently used in parsing and machine translation decoding. By taking
advantage of English syntax and the hypergraph structure of our search algorithm, we report
significant increases in both F-measure and BLEU score over standard baselines in use by most
state-of-the-art MT systems today.
In the following chapter, we develop significant improvements intended for using the align-
ment framework presented here on very large datasets for different language pairs.
Chapter 4
Feature-Rich Syntax-Based Alignment for Statistical Machine
Translation at Scale
This chapter further develops the ideas presented in Chapter 3. As opposed to hand-crafted fea-
tures based on linguistic insight, we describe learning those features and hundreds of thousands
more automatically from data, heavily exploiting source and target-language syntax. Using our
previously developed discriminative framework and efficient bottom-up search algorithm, we
train a model of hundreds of thousands of syntactic features. This new model (1) helps us to
very accurately model syntactic transformations between two languages; (2) yet, is language-
independent; and (3) with automatic feature extraction, assists system developers in obtaining
good word-alignment performance off-the-shelf when tackling new language pairs.
We analyze the impact of our features, describe inference under the model, and demon-
strate significant alignment and translation quality improvements over already-powerful base-
lines trained on very large corpora. We observe translation quality improvements corresponding
to 1.0 and 1.3 BLEU for Arabic-English and Chinese-English, respectively, over already strong
baseline systems.
4.1 Introduction
In recent years, several state-of-the-art statistical machine translation (MT) systems have incor-
porated both source and target syntax into the grammars that they generate and use to translate.
While some tree-to-tree systems parse source and target sentences separately (Galley et al., 2006;
Zollman & Venugopal, 2006; Huang & Mi, 2010), others project syntactic parses across word
alignments (Li et al., 2009). In both approaches, as in largely all statistical MT, the quality of
the alignments used to generate the rules of the grammar are critical to the success of the sys-
tem. However, to date, most word alignment systems have not considered the same degree of
syntactic information that MT systems have.
Extending unsupervised models, like the IBM models (Brown et al., 1993), generally re-
quires changing the entire generative story. The additional complexity would likely make train-
ing such models quite expensive. Already, with ubiquitous tools like GIZA++ (Och & Ney,
2003), training accurate models on large corpora takes upwards of 5 days.
Recent work in discriminative alignment has focused on incorporating features that are
unavailable or difficult to incorporate within other models, e.g. (Moore, 2005; Ittycheriah &
Roukos, 2005; Liu et al., 2005; Taskar et al., 2005b; Blunsom & Cohn, 2006; Lacoste-Julien
et al., 2006; Moore et al., 2006). Even more recently, motivated by the rise of syntax-based
translation models, others have sought to inform alignment decisions with syntactic information
(Fraser & Marcu, 2007; DeNero & Klein, 2007; Fossum et al., 2008; Haghighi et al., 2009;
Burkett et al., 2010; Pauls & Klein, 2010; Riesa & Marcu, 2010).
Motivated by the wide modeling gap that still remains between syntax-based translation and
word-alignment models, in this chapter we expand on previous work in discriminative align-
ment, and move forward in three key areas:
1. We heavily exploit both source and target syntax in ways that most models can not. In
addition, during training we extract and learn hundreds of thousands of features automat-
ically, learning both the structure and parameters for the model at the same time.
2. Our model and inference support arbitrary features, and easily scale to millions of such
features.
3. Having strengthened the synchronicity between alignment and syntax-based translation
models, we advance state-of-the-art performance in terms of both alignment and transla-
tion quality over already-powerful baselines on very large corpora.
4.2 A Feature-Rich Syntax-Aware Alignment Model
We follow Riesa and Marcu (2010) for efficient inference with arbitrary features, but do not
rely upon hand-crafted syntactic patterns; rather, we extract syntactic features automatically
from training data. We also introduce, in Section 4.5, an iterative approximate Viterbi inference
procedure to deal with the asymmetry of the model. We show that this boosts both alignment
and downstream translation quality even further.
The model itself is a linear combination of features, whose parameters are learned online via
a structured perceptron (Collins, 2002). However, as we describe in Section 4.3, the features of
the model are not known a priori. In what follows, we describe the search algorithm so that the
reader has an understanding of the domain of locality before we begin to describe features and
how they are learned.
4.2.1 Search Overview
We formulate the search for the best alignment as bottom-up parsing. Given a syntactic parse
tree on one side of a parallel sentence, we use the structure of the tree to guide the search process.
The key idea is that complex interactions between alignments are less likely to cross constituents,
so we search recursively on the tree.
As an illustrative example, we point to the structure of the hypergraph search depicted in
Figure 4.1. Here we are aligning the sentence pair:
(4.1) the flag hung from the stage
台 上 挂 着 国旗
tái shàng guà zhe guóqí
The figure shows the search process for a small example with beam size k. Each black square
represents a partial alignment. Each partial alignment at each node is ranked according to
its model score. In this figure, the 1-best hypothesis at the leftmost NP node is constructed by
composing the best hypothesis at its child NP and the 2nd-best hypothesis at its child NN. At
the root node, we have a k-best list of full alignments.
Figure 4.1: Approximate search through a hypergraph with beam size k = 5. Each black square
represents a partial alignment; larger grey-shaded boxes are links in an alignment. Each partial
alignment at each node is ranked according to its model score. The root node, S, contains a
k-best list of full alignments.
We continue with a procedural description of the algorithm.
4.2.1.1 Initialization
We begin by visiting each preterminal node in sequence. We enumerate and score all one-to-one
links as well as the unaligned link (aligned to null). Next, for a given preterminal node, we use
cube pruning (Chiang, 2007) to find the top k one-to-two alignments, given the scores of the
one-to-one links. We perform additional iterations of cube pruning to find top k sets of one-to-m
links. In theory, we could increase m to the length of the foreign sentence and enumerate top k
lists for each English word aligned to between 0 and all foreign words. However, in practice we
set m to limit time spent here, while maintaining acceptable recall. In our experiments we set
m = 2 for both Arabic-English and Chinese-English.
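The initialization step can be sketched as follows. For clarity this version enumerates all one-to-m link sets exhaustively and ranks them; the actual algorithm builds the one-to-two lists with cube pruning from the one-to-one scores. The score function is a stand-in for the model's dot product of features and weights.

```python
import heapq
import itertools

def init_preterminal(e_index, n_foreign, score_links, k, m=2):
    """Top-k partial alignments for one English word: the unaligned
    hypothesis plus every set of up to m links for that word."""
    candidates = [frozenset()]                       # aligned to null
    for size in range(1, m + 1):
        for js in itertools.combinations(range(n_foreign), size):
            candidates.append(frozenset((e_index, j) for j in js))
    return heapq.nlargest(k, candidates, key=score_links)
```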
4.2.1.2 Combination
We continue traversing the tree bottom-up. At each nonterminal node, the k-best lists of partial
alignments from its child nodes are combined into a larger span. We use cube pruning to
do this efficiently. [1] Nodes in different subtrees are processed independently of one another; i.e.,
for any node, alignment information at that node's sister is unavailable. For example, in Fig-
ure 4.1, alignment information at the leftmost NP is unavailable to us while we are constructing
partial alignments at the PP. Search continues recursively up the tree, until we have reached
the root node. The root node again computes the top k alignments from its children, and these
comprise our final k-best list of full alignments.

[1] Cube pruning is approximate when we have nonlocal combination features, and most of our
features are of this type.
In our experiments we only make use of the 1-best alignment for evaluation and transla-
tion. Previous work has shown that deep k-best lists of alignments are not especially useful in
improving final downstream translation grammar extraction (Venugopal et al., 2008; Liu et al.,
2009b), though they may have other uses.
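The combination step at a nonterminal node can be sketched as a ranked product of the children's k-best lists. Real cube pruning explores this grid lazily with a priority queue rather than materializing it, so the exhaustive version below is for illustration only; combine_score stands in for the model score, including nonlocal combination features.

```python
import heapq

def combine_children(left_kbest, right_kbest, combine_score, k):
    """Keep the top-k unions of partial alignments from two sibling
    spans, ranked by the combined (local + nonlocal) model score."""
    grid = [(combine_score(a, b), a | b)
            for a in left_kbest for b in right_kbest]
    return [links for _, links in
            heapq.nlargest(k, grid, key=lambda item: item[0])]
```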
4.3 Automatically Exploiting Syntactic Features for Alignment
Up to now, previous work in syntax-based alignment has largely modeled alignments based on
features encoding target-side English syntactic and lexical information, but only lexical informa-
tion on the source side.
However, there is much more data waiting to be exploited, and the flexible model and ef-
ficient and modular learning framework of hierarchical discriminative alignment afford us this
possibility. Here, we discuss our target-side features, source-side features, and features that
jointly take into account both source- and target-side information.
4.3.1 Target Syntax Features
Most alignment systems currently function without explicit regard to the downstream translation
model. Some notable exceptions are May and Knight (2007) who generate syntactic alignments
by re-aligning word-to-word alignments with a syntactic model; and Pauls and Klein (2010)
who generate syntactic alignments with a synchronous ITG (Wu, 1997) approach. We depart
from ITG-based models (Cherry & Lin, 2006; Haghighi et al., 2009) because of their complexity
(O(n^6) in the synchronous case), requiring heavy pruning or computing outside cost esti-
mates (DeNero & Klein, 2010). Instead, we use linguistically motivated target-side parse trees
to constrain search, as described above. These trees are output from the Berkeley parser (Petrov
& Klein, 2007) and fixed at alignment time. We use these trees not only as a vehicle for search,
but also for features.
A significant motivation for this work is the desire to make the connection, at alignment
time, between translation rules used in decoding and the alignments that yield such translation
rules. To do this, we fold the rule extraction process into the alignment search. At each step in
the search process, we can extract translation rules from a given partial alignment and encode
them as binary features.
Importantly, the rule extraction process itself is not directly tied to the alignment system, but
rather to the downstream translation model. We can drop in any type of rule extraction we like
into the alignment system. In this work we focus on string-to-tree translation and the translation
rule space described in (Galley et al., 2004; Galley et al., 2006).
During training and inference, we are constantly scoring partial alignments. Every time we
have a partial alignment to score, we can extract all potential translation rules implied by that
alignment, and encode those rules as features. In this case, we are doing two important things:
(1) informing the alignment search with the rules of the translation model, and (2) modeling
actual translation rules – the model parameters give us a way to quantify the relative importance
of each rule. For example, we learn that:
(4.2) Chinese VP and NP tend to be reordered around the 的 particle when translating to
English.

      feature                           weight
      NP(NP:1 VP:2) ↔ 2 的 1           1.01304

(4.3) When translating an Arabic NP as part of a VP, we often insert “is”.

      feature                           weight
      VP((VBZ is) NP:1) ↔ 1            0.67252
From this process we extract and learn 326,239 translation rule features in our Arabic-
English model; 234,972 in our Chinese-English model. Those features for which a positive
weight is learned tend to generalize well over the training data; negatively weighted features do
not, and are generally learned from alignments with mistakes during search. See Figure 4.2 for
additional examples of rule features learned for Arabic-English alignment.
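Firing extracted rules as indicator features can be sketched as follows; extract_rules is a hypothetical hook for a GHKM-style extractor, and rules are identified here by opaque string keys:

```python
def rule_features(partial_alignment, extract_rules, weights):
    """Fire one binary indicator feature per translation rule licensed
    by a partial alignment; unseen rules enter the model with weight
    zero, so the feature set grows as search encounters new rules."""
    fired = ["RULE:" + rule_id
             for rule_id in extract_rules(partial_alignment)]
    score = sum(weights.get(feat, 0.0) for feat in fired)
    return fired, score
```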
Negative evidence Nearly 67% of the rule features we learn for Chinese-English, and 55%
of the rule features we learn for Arabic-English are negatively weighted. Early experiments
involved firing indicator rule features only when an extracted rule at alignment time matched
one in a set of rules extracted offline from our hand-aligned data. However, coverage from such rules
will always be limited; firing every rule as a feature as it is encountered during search gives us
many more darts to throw. Using only rule features extracted from gold data lowers F-measure
by close to 5 points.
Figure 4.2: Translation rules as features extracted during Arabic-English alignment. These rules
show that we learn to reorder adjectives and nouns inside noun phrases, and that prepositions
before sister NPs prefer to be translated monotonically. For Chinese-English, we learn the op-
posite. We learn both lexicalized and nonlexicalized features.
4.3.2 Source Syntax Features and Joint Features
Source syntactic trees have recently been shown to be helpful in machine translation decoding
(Zhang et al., 2008; Liu et al., 2009a; Chiang, 2010), but to our knowledge have not been used in
alignment models other than (Burkett et al., 2010). We parse the source side of our data using the
Berkeley parser (Petrov & Klein, 2007), and encode information provided by the source syntax
as features in the model in two ways: (1) as tree-distance features [2], and (2) joint source-target
syntax features.
     Ara-Eng Model                    Chi-Eng Model
     eng   ara           w            eng   chi           w
[1]  SBAR  SBAR        6.40           PP    PP          10.3
[2]  S     S(CC,PU)    4.91           NP    NP           9.38
[3]  PP    PP          4.20           SBAR  VP(VV,PU)    6.97
[4]  VP    VP          3.90           NP    NP(DT,NN)    6.67
[5]  SBAR  PP          2.58           PP    PP(P,LC)     6.38
[6]  NP    S          -2.80           NP    PP          -6.82
[7]  NP    VP         -3.01           S     IP(PU,PU)   -7.44
[8]  NP    NP(NN,IN)  -4.52           PP    IP          -7.33
[9]  PP    VP         -5.13           SBAR  VP          -7.72
[10] PP    S          -7.37           NP    IP          -7.83

Table 4.1: This table shows a sampling of the highest and lowest-weighted coordination features
applied when scoring partial alignments at nodes in the tree. Preterminal tags inside parentheses
indicate the POS tags on the left and right edge of a given constituent.
(a) Source/target tree feature firing at node NP, with value ⟨NP ; NP⟩. The maximal-depth
source tree node that spans every link also spanned by the shaded target tree NP is also
labeled NP.
(b) Source/target tree feature firing at node IN, returning value ⟨IN ; PP⟩.
(c) In this figure, depicting an incorrect alignment, the same feature value is fired as for
the correct alignment in 4.3(b): ⟨IN ; PP⟩. We need more contextual annotation to create more
discriminative power.
Figure 4.3: Two examples of joint features over monolingual parse trees. The value of the
feature depends on the shaded areas. (Each panel aligns the English PP "in china 's foreign
trade" with the Chinese PP "在 中国 对外贸易 中".)
4.3.2.1 Source-Target Coordination Features
Drawing on work by Chiang (2010) in stochastically rewriting syntactic constituents across
languages in a translation model, we adapt the general idea to alignment modeling. Chiang
calls these features fuzzy syntax features; here, we simply call them coordination features in our
adaptation for alignment, so as to avoid the implication that we are rewriting.
This class of feature is a set of binary features that may fire at any nonterminal node in the
tree during bottom-up search. A feature fires for each combination of two nonterminal source
and target nodes s and t, respectively, that match the following conditions:
1. t is the label of the current target tree node in the bottom-up search.
2. s is the label of the source tree node of highest depth (i.e. closest to the leaf nodes) that spans
all links also spanned by t.
Figure 4.3 shows three examples of this joint feature over source and target trees. In Figure 4.3(a),
the maximal-depth source tree node that spans every link also spanned by the shaded
target tree NP is also labeled NP, so the feature returns a value of ⟨NP ; NP⟩. In Figure 4.3(b),
PP is the label of the maximal-depth source tree node that spans every link also spanned by the
shaded target tree IN node; the feature fires with value ⟨IN ; PP⟩. We might expect this pairing of
IN with PP, or of IN with P, but we would expect to learn a penalizing parameter weight for the
pairing of, say, IN with NP.
² These features parameterize the intuition that if two source words align to a single target
word, we prefer them to be members of the same constituent, or to have a short path through
the tree from one word to the other, e.g. (in, 在...中), where 在 and 中 are the first and last
Chinese words in the examples in Figure 4.3.
Adding more context Powerful as this feature is, it is not quite discriminative enough; it
may return the same feature value for both a correct and an incorrect alignment, as shown in
Figure 4.3(c). To overcome this, we introduce additional features annotated with the left-most and
right-most tags in the current span. For example, in this figure, we also fire ⟨IN ; PP(P,NP)⟩,
and learn a negative weight of −0.638, denoting a poor choice of alignment. We also find it
helpful to keep the original unannotated feature as a poor man's backoff.
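The two conditions above, together with the context-annotated variant, can be sketched as follows. This is a minimal illustration under assumed representations: the `Node` class, the `(source_index, target_index)` link encoding, and the traversal are inventions for the sketch, not the dissertation's implementation. The example reconstructs the sentence pair of Figure 4.3.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    lo: int                      # first word index covered (inclusive)
    hi: int                      # last word index covered (exclusive)
    children: list = field(default_factory=list)

def coordination_feature(t_node, src_root, links):
    """Return the <t ; s> feature value: s is the deepest source node whose
    span covers the source side of every link under target node t."""
    src_side = {i for (i, j) in links if t_node.lo <= j < t_node.hi}
    if not src_side:
        return None
    best, stack = None, [src_root]
    while stack:
        n = stack.pop()
        if all(n.lo <= i < n.hi for i in src_side):
            best = n                  # covering nodes nest, so later hits are deeper
            stack.extend(n.children)
    return (t_node.label, best.label) if best else None

def annotated_feature(t_node, src_root, links, left_tag, right_tag):
    """Context-annotated variant, e.g. <IN ; PP(P,NP)>; the unannotated
    feature is kept alongside it as a backoff."""
    base = coordination_feature(t_node, src_root, links)
    if base is None:
        return None
    return (base[0], "%s(%s,%s)" % (base[1], left_tag, right_tag))

# Figure 4.3's sentence pair: "in china 's foreign trade" / "在 中国 对外贸易 中".
src = Node("PP", 0, 4, [
    Node("P", 0, 1),
    Node("LCP", 1, 4, [
        Node("NP", 1, 3, [Node("NR", 1, 2), Node("NP", 2, 3)]),
        Node("LC", 3, 4)])])
links = {(0, 0), (3, 0),          # 在...中 <-> in
         (1, 1),                  # 中国 <-> china
         (2, 3), (2, 4)}          # 对外贸易 <-> foreign trade
np_t = Node("NP", 1, 5)           # target NP over "china 's foreign trade"
in_t = Node("IN", 0, 1)           # target IN over "in"
```

On this example the sketch reproduces the two feature values from Figures 4.3(a) and 4.3(b): ⟨NP ; NP⟩ at the target NP and ⟨IN ; PP⟩ at the target IN.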
Some examples Table 4.1 shows some of the highest and lowest weighted features learned.
As the higher-weighted features show, both models learn to prefer alignments that result in the
coordination of similar constituent labels. For example, the Chinese model learns a very high
weight for aligning sets of English words that form prepositional phrases to sets of Chinese
words that also form prepositional phrases.³
Inversely, we learn high negative weights for model features that fire for alignments that
oblige the firing of features of very dissimilar nonterminal labels, and that often yield
asynchronous bracketing. For example, the Arabic model learns that English words that form
prepositional phrases should not align to sets of Arabic words that form entire sentences or
verb phrases.⁴
³ In Table 4.1, Chinese feature [1].
⁴ In Table 4.1, Arabic features [6] and [7].
(a) Learning curves (Arabic-English): F-measure accuracy on heldout development data over
time (in epochs) for five different beam settings, k=2, k=4, k=16, k=64, and k=128. For
Arabic-English, improvements are minimal with beams larger than k=128; for Chinese-English,
with beams larger than k=256.
(b) Model size (number of features) as a function of training time for the same five beam
settings (Arabic-English): we see steep initial growth, which then trails off as the number of
new unique extractable features and negative evidence we encounter diminishes.
Figure 4.4: Learning feature-rich alignment models. Figure 4.4(a) shows learning curves on
heldout data for five different beam sizes. Figure 4.4(b) shows how the models dynamically
grow over time. In Figure 4.4(b) we notice that less accurate models with narrower beams need
to add more complexity in an attempt to make up for their many more mistakes.
In total, we learn 127,932 syntactic coordination features in our Arabic-English model;
59,239 for Chinese-English.
4.4 Learning
We learn feature weights using a parallelized implementation of the online averaged perceptron
(Collins, 2002). We distribute training examples to CPUs in a cluster and essentially run several
perceptron learners in parallel. We communicate and average the weight vectors of each learner
according to the Iterative Parameter Mixing strategy described by McDonald et al. (2010).
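The shard-then-mix structure of this loop can be sketched as follows. This is a serial stand-in for the parallel setup, with hypothetical `decode` and `features` callbacks, and it shows only the parameter mixing; the within-run weight averaging of the averaged perceptron is omitted for brevity.

```python
def perceptron_epoch(weights, shard, decode, features):
    """One structured-perceptron pass over a data shard; returns new weights."""
    w = dict(weights)
    for cands, y_gold in shard:
        y_hat = decode(cands, w)           # model's current best guess
        if y_hat != y_gold:
            for f, v in features(y_gold).items():
                w[f] = w.get(f, 0.0) + v
            for f, v in features(y_hat).items():
                w[f] = w.get(f, 0.0) - v
    return w

def iterative_parameter_mixing(shards, epochs, decode, features):
    """McDonald et al. (2010)-style loop: train one epoch per shard (in
    parallel, in the real system), then average the weight vectors and
    redistribute the mixture for the next epoch."""
    mixed = {}
    for _ in range(epochs):
        results = [perceptron_epoch(mixed, shard, decode, features)
                   for shard in shards]
        keys = set().union(*(r.keys() for r in results))
        mixed = {k: sum(r.get(k, 0.0) for r in results) / len(results)
                 for k in keys}
    return mixed

# Toy usage: each example is (candidate_list, gold); unigram indicator features.
def feats(y):
    return {("out", y): 1.0}

def decode(cands, w):
    return max(cands, key=lambda y: w.get(("out", y), 0.0))

shards = [[(["bad", "good"], "good")],
          [(["worse", "good"], "good")]]
w = iterative_parameter_mixing(shards, epochs=3, decode=decode, features=feats)
```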
At each iteration, our perceptron update is:

    w ← w + h(y_i) − h(ŷ)    (4.4)

And we define:

    ŷ = argmax_{y ∈ cand(x)} ℓ(y_i, y) + w · h(y)    (4.5)

    ℓ(y_i, y) = 1 − F₁(y_i, y)    (4.6)

with w our weight vector, h(y) our sparse vector of feature values, and F₁(y_i, y) balanced
F-measure. The loss, ℓ(y_i, y), is a measure of how bad it would be to guess y instead of the
gold alignment y_i.
In selecting ŷ, we draw upon the loss-augmented inference literature (Tsochantaridis et al.,
2004; Taskar et al., 2005a). Alignment ŷ is the output candidate maximizing the sum of the
loss and the model score. This guess appears attractive to the model, yet has low F-measure, and so
is exactly the sort of output we would like to update away from.
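Equations 4.4-4.6 can be sketched directly over sets of alignment links. The feature map `h` and the two-candidate list below are illustrative; in the real system, cand(x) is the k-best list produced by beam search.

```python
def f1(gold, guess):
    """Balanced F-measure between two sets of alignment links."""
    tp = len(gold & guess)
    if tp == 0:
        return 0.0
    p, r = tp / len(guess), tp / len(gold)
    return 2 * p * r / (p + r)

def loss(gold, guess):                       # Eq. 4.6
    return 1.0 - f1(gold, guess)

def model_score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def perceptron_step(w, y_gold, candidates, h):
    """One loss-augmented update (Eqs. 4.4-4.5): pick the candidate that
    looks good to the model yet has high loss, then update away from it."""
    y_hat = max(candidates,
                key=lambda y: loss(y_gold, y) + model_score(w, h(y)))
    for f, v in h(y_gold).items():
        w[f] = w.get(f, 0.0) + v
    for f, v in h(y_hat).items():
        w[f] = w.get(f, 0.0) - v
    return y_hat

# Toy demonstration on a hypothetical two-word sentence pair:
gold = frozenset({(0, 0), (1, 1)})
h = lambda y: {("link", l): 1.0 for l in y}
w = {}
y_hat = perceptron_step(w, gold, [gold, frozenset({(0, 0)})], h)
```

With an all-zero model, the incomplete candidate wins the loss-augmented argmax (its loss is positive while the gold's is zero), and the update adds weight to the missing gold link.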
During training, we learn both the parameters and model structure. Figure 4.4(b) shows how
the size of the model grows over time. As described in Sections 4.2 and 4.3, we automatically
extract and fire features given an alignment configuration and our current position in the tree.
Model size grows steeply at first, then trails off as the number of new unique rules and
negative evidence we encounter diminishes.
Model Selection Among models from the first iteration up to convergence, we choose the
model parameters from the best performing model as measured by F-measure on a held-out
development set of alignments.
4.5 Iterative Approximate Viterbi Inference
Though up to now we have described features that fire during bottom-up search on the
target-language tree, we can also search bottom-up on the source-language tree. The syntactic
features we have described are generic enough that they will still be extractable and applicable.
Because our model and inference procedure are asymmetric, a search on the source-language tree
will generate alignments from a different space, and can provide a unique signal we would not
otherwise have. We can use the Viterbi alignments from each model to inform the other. In
the following we describe a method for simultaneously training both target-tree and source-tree
models, but with features to enforce agreement.
We begin by training two models, one that operates on the target tree and one that operates
on the source tree. Call the parameters learned from these models w^t_1 and w^s_1, respectively.
Then, performing inference under these models yields alignments a^t_1 and a^s_1.
In the next iteration we learn parameters w^t_2 and w^s_2, and introduce agreement features. In
this step, during training to find w^t_2, the target-tree model uses a^s_1 to fire indicator
features. These fire for any alignment link that was also present in the previous iteration's
source-tree alignment, a^s_1. Analogously, when searching for the best w^s_2, we use a^t_1 to
fire indicator features that fire for any alignment link also present in the previous iteration's
target-tree alignment, a^t_1.
                                            Arabic-English      Chinese-English
                                            F     P     R       F     P     R
GIZA++ M4 grow-diag-final                   72.5  74.5  70.5    71.7  71.4  72.0
Heavy Syntax (this work)                    86.8  89.1  84.6    84.4  89.4  80.0
with Iterative Inference (grow-diag-final)  87.6  89.7  85.6    87.0  90.0  84.1
with Iterative Inference (intersection)     83.4  93.1  75.6    83.1  95.4  73.6
Table 4.2: F-measure, Precision, and Recall for GIZA++ Model-4 and for alignments from this
work. GIZA++ was trained on 223M words for Arabic-English and 261M words for Chinese-English.
We observe very large gains in accuracy of 15 points for both language pairs. Iterative
inference has a large effect on Chinese-English recall, and yields a modest improvement for
Arabic-English.
This process of using the alignment from the previous iteration's opposing tree continues
until convergence, i.e. until we no longer see improvement in our 1-best source-tree and
target-tree alignments. When we use these alignments for downstream translation, we symmetrize
with the grow-diag-final heuristic, which continues to work remarkably well in practice. We
also experiment with the intersection of both final alignments.
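For reference, the two symmetrization options over directional link sets can be sketched as follows. This is a standard Koehn-style grow-diag-final, not the dissertation's exact implementation; the intersection is simply set intersection of the two directional alignments.

```python
def grow_diag_final(e2f, f2e):
    """Grow-diag-final symmetrization: start from the intersection of the
    two directional alignments, grow into the union along neighbouring
    cells, then add remaining union links touching a still-unaligned word."""
    union, links = e2f | f2e, set(e2f & f2e)
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    changed = True
    while changed:                                   # grow-diag
        changed = False
        for (i, j) in sorted(links):
            for (di, dj) in neighbours:
                cand = (i + di, j + dj)
                if cand in union and cand not in links:
                    e_free = all(cand[0] != a for (a, _) in links)
                    f_free = all(cand[1] != b for (_, b) in links)
                    if e_free or f_free:
                        links.add(cand)
                        changed = True
    for (i, j) in sorted(union):                     # final
        if all(i != a for (a, _) in links) or all(j != b for (_, b) in links):
            links.add((i, j))
    return links

# e.g. directional Viterbi alignments for a toy 3x2 sentence pair:
e2f = {(0, 0), (1, 1)}
f2e = {(0, 0), (2, 1)}
```

The intersection keeps only links both directions agree on (high precision, low recall), while grow-diag-final recovers the additional union links reachable from them.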
4.6 Evaluation
4.6.1 Alignment Quality
We use as training data 2,280 hand-aligned sentence pairs for Arabic-English and 1,102 for
Chinese-English. We measure training convergence using a held-out development set of 100
sentence pairs for each language pair, and evaluate with F-measure on a held-out test set of
184 sentence pairs for Chinese-English and 364 sentence pairs for Arabic-English. We use
instances of the Berkeley parser (Petrov & Klein, 2007) trained on the English Penn Treebank,
Chinese Treebank 6, and the Arabic Treebank parts 1–3; for each language, trees are fixed at
alignment time using the 1-best output from each parser.
We use Model-4 symmetrized with the grow-diag-final heuristic, trained with GIZA++, as
a baseline alignment model. We train two GIZA++ models on our largest available Chinese-English
and Arabic-English parallel corpora. These consist of 261M and 223M English words,⁵
respectively. The size of these corpora makes for quite a powerful unsupervised baseline.
In training our alignment model, we use the syntactic features discussed in Sections 4.2
and 4.3, plus word-based lexical features t(e|f) and t(f|e) used during initialization,
extracted offline directly from the translation table of GIZA++. Using these features alone
results in an F-measure of 59.1 for Arabic-English and 55.6 for Chinese-English. Our
automatically extracted syntactic features and iterative inference algorithm get us the rest
of the way, bringing performance up to 87.6 and 87.0, respectively.
Table 4.2 shows the results on our held-out test sets. In an intrinsic evaluation
on an alignment task, our F-measure scores are more than 15 points higher than the baseline for
both language pairs.
4.6.2 Translation Quality
In evaluating downstream translation quality, we build three translation systems each for
Arabic-English and Chinese-English: one with alignments from GIZA++, one with alignments from
our syntactically-informed discriminative model, and one with alignments from our model with
iterative inference (Section 4.5). For each of these systems we align our parallel training
corpora described in Section 4.6.1, and compute word-based lexical weighting features (Koehn
et al., 2003) based on these alignments.
⁵ These counts correspond to 240M words of Chinese and 194M words of Arabic.
Because of the number of experiments involved in this research, we needed to accelerate
our downstream experimental pipeline. While we align our full training corpus, we extract
translation rules from a subsampled core subset of our training data; the quality of the translation
rules extracted is still a function of the original alignment model.
We train a syntax-based string-to-tree translation model (Galley et al., 2004; Galley et al.,
2006) and extract translation rules⁶ using alignments produced by each system, from 4.25M+5.43M
words for Arabic-English and 31.8M+37.7M words for Chinese-English. For Arabic-English, we
tune our MT system on a held-out development corpus of 1,172 parallel sentences, and test on
a held-out set of 746 parallel sentences with four references each. For Chinese-English, we tune
our MT system on a held-out development corpus of 4,089 parallel sentences, and test on a
set of 4,060 sentences with four references each. We tune the translation models for these
systems with MIRA (Watanabe et al., 2007; Chiang et al., 2008b). Our tuning and test corpora are
drawn from the NIST 2004 and 2006 evaluation data, disjoint from our rule-extraction data. All
systems use two language models: one trained on the combined English sides of our Arabic-English
and Chinese-English data (480M words), and one trained on 4 billion words of English data.
MT results are shown in Table 4.3. We show a gain of 1.0 and 1.1 BLEU points over GIZA++
Model-4. Each is statistically significant over the baseline.
⁶ We use the so-called composed rules of Galley et al. (2006).
                                       ara-eng   chi-eng
Alignment model                          BLEU      BLEU
GIZA++ Model-4                           47.6      26.2
Heavy Syntax                             48.3      26.4+
+Iterative Inference (gdf)               48.3      27.0
+Iterative Inference (intersection)      48.6+     27.3
Table 4.3: IBM BLEU scores using a syntax-based MT system. We show statistically significant
gains in both language pairs over unsupervised GIZA++ Model-4 trained on very large corpora.
An asterisk (*) denotes a statistically significant improvement with p < 0.01 over the number
immediately above; a (+) denotes p < 0.05.
In the case of Chinese-English, we see a 0.9 BLEU gain when using iterative inference over
the standard model, which provides only target-tree alignments. As measured by a bootstrap
resampler, this improvement is statistically significant, with p < 0.01.
For Arabic-English, we see a BLEU gain of 0.7 with target-tree alignments alone, and a total
1.0 BLEU gain over the baseline with iterative inference and our joint-agreement features.
We expect the limited improvement of iterative inference for Arabic-English is due to at
least two factors:
1. the relative weakness of our Arabic parser, and
2. as shown in Table 4.2, our Arabic target-tree alignments are already quite accurate.
4.7 Discussion and Implications
We achieve our best downstream BLEU results when using iterative inference with source-tree
and target-tree alignments, keeping the intersection.⁷ These alignments have been shown to have
recall in a similar neighborhood as our unsupervised baseline, but extremely high precision.
As DeNero and Klein (2010) and others have observed, the relationship between word alignment
evaluation metrics and BLEU score remains tenuous at best. While we are able to induce
some of the most accurate alignments we have seen to date, it remains unclear, given our gold
hand-aligned data, whether we are ultimately optimizing the right function for the translation
task. Related metrics, like Rule F-measure (Fossum et al., 2008) and Translation Unit Error
Rate (Søgaard & Kuhn, 2009), are still functions of a given gold-standard alignment. If the
gold-standard alignment is not ideally annotated for the translation task, it matters little what
our alignment evaluation metric is.
Why do our grow-diag-final alignments not perform as well, even though they are the highest
in F-measure accuracy? We believe the answer lies in the fact that these alignments too closely
resemble gold alignments produced under word-alignment annotation standards⁸ that do not handle
function words ideally for the translation task. Indeed, Hermjakob (2009) reports improved
BLEU with a hand-modified gold standard.
⁷ Intersection symmetrization does not help GIZA++ because the resulting recall is so low as
to severely limit the usefulness of direct translation rule extraction with such alignments (49.7
Recall for Chi-Eng; 47.2 Recall for Ara-Eng).
⁸ We refer to the standards used for the data in this work, LDC2006E86 and LDC2006E93, as well
as the standards for later hand-aligned data developed for the GALE program.
4.7.1 Content Words vs. Function Words in Alignment
Interestingly, the places in which our source-tree and target-tree alignments most often disagree
are in the alignment of function words⁹ with no clear translation in the opposite language. For
example, English the has no translation in Chinese. Our intersection alignments generally leave
the unaligned, whereas in our gold alignments the is generally aligned to the same word as the
head of the NP in which it appears: for example, ⟨(the country , 国家)⟩, but
not ⟨(the, ∅); (country, 国家)⟩.
Continuing with the example of the, our translation model learns to insert words like the where
appropriate, and such insertion rules are validated by the language model. We learn, with good
coverage, accurate high-precision translation rules for content words, and general insertion rules
for words like the, instead of learning two unique lexicalized rules for a given content word, one
with and one without the. In this way, we learn a more general grammar that explains the
data.
We see our best translation performance with our intersection alignments because, we believe,
they largely leave untranslated words, and words without clear translations in the opposite
language, unaligned; this may be the right thing to do. Naively leaving all function words
unaligned is likely suboptimal, as many seem to have direct translations in some contexts;
cf. (of, من) and (of, 的).
(a) Alignment under the current standards.
(b) Alignment under an optimal strategy for translation.
Figure 4.5: Example depicting two different ways to align the English phrase “the big book”
to its Chinese translation “大书” (gloss: big book). We extract more translation rules under
the strategy shown in Figure 4.5(b). Links in the word alignment are shaded matrix cells, with
extractable bilingual phrases outlined.
4.7.2 A Concrete Example
Figure 4.5 shows two strategies for aligning the English phrase “the big book” to its Chinese
translation “大书”. The Chinese gloss is big book.
Under the existing standards for word alignment (Figure 4.5(a)), the English word the, with
no translation on the Chinese side, would be aligned to the same word as the English noun it
modifies. Thus, in this case, book is aligned to Chinese 书, so we have that the is also aligned
to 书. Under this alignment we can extract the following phrasal translations:
big 大
the big book 大书
⁹ Function words are a class of words that generally have little lexical meaning, but serve to
signal grammatical relationships among words in a sentence.
However, we cannot learn from this example useful translations for words we might expect
to appear down the line in novel sentences we are tasked to translate, such as:
book
the book
The problem is the alignment of the highlighted word the. Following our intuitions from
Section 4.7.1, we leave words like the with no direct translation in the target sentence unaligned
(Figure 4.5(b)). Now, we extract from this example the translation knowledge we were missing
under the old standards:
big 大
the big book 大书
book 书
the book 书
For ambiguous words or phrases like 书, we essentially let the language model decide when
to use either book or the book in context.
4.7.3 Desiderata
Following from the discussions in the above sections, we thus propose the following desiderata
for creation of gold-standard alignment data going forward:
• Do align content words.
• Do align function words with a direct translation in the other language.
• Don’t align words without a direct translation in the other language.
Future work includes devising metrics, e.g. a content-word F-measure, that may correlate
better with downstream BLEU score. Such metrics should be inexpensive to calculate and
optimizable under standardly used learning frameworks, e.g. perceptron or max-margin learning.
Such a metric should, for instance, award a higher score to Line 4 in Table 4.3 than to Line 3.
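One such metric could simply ignore links whose English side is a function word. A sketch, with an illustrative (not exhaustive) function-word list:

```python
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "'s"}  # illustrative

def content_word_f1(gold, hypothesis, e_words):
    """Balanced F-measure restricted to links whose English side is a
    content word; links are (english_index, foreign_index) pairs."""
    keep = lambda links: {(i, j) for (i, j) in links
                          if e_words[i] not in FUNCTION_WORDS}
    g, h = keep(gold), keep(hypothesis)
    tp = len(g & h)
    p = tp / len(h) if h else 0.0
    r = tp / len(g) if g else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Leaving `the` unaligned costs nothing under this metric:
e_words = ["the", "book"]
gold = {(0, 1), (1, 1)}        # annotation standard: `the` aligned with 书
hypothesis = {(1, 1)}          # intersection alignment leaves `the` unaligned
```

A plain F-measure would penalize the hypothesis for the missing the link; the content-word variant scores it perfectly, rewarding exactly the behavior argued for above.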
4.8 Conclusion
We are closing the gap between translation and alignment models in terms of syntactic
sophistication. We have (1) shown how to efficiently extract hundreds of thousands of
language-independent syntactic features useful for alignment, (2) given a detailed analysis of
the types of linguistic phenomena these varied features generalize over, and (3) reported
significant gains not only in alignment quality but also in downstream machine translation
quality (1.0+ BLEU) over very strong baselines across diverse language pairs.
We have also described roadblocks to improved discriminative alignment modeling for
translation, and desiderata for moving forward. We expect that an accurate discriminative word
alignment system, such as the one presented here, in conjunction with better annotation
standards for alignment, will take us even farther beyond the advancements in translation
quality shown here.
Chapter 5
Automatic Parallel Fragment Extraction from Noisy Data
Apart from translation, word alignments are also used for other language-related tasks, e.g.
cross-lingual information retrieval (Hiemstra & de Jong, 1999; Xu et al., 2001), inducing
monolingual resources from bilingual corpora (Yarowsky et al., 2001; Hwa et al., 2002), and
bilingual dictionary induction (Schafer & Yarowsky, 2002). In this chapter, we use the alignment
model developed in previous chapters for the task of distilling parallel fragments from within
noisy corpora. The benefits of succeeding here are at least two-fold: we can rid our parallel
training corpus of noisy data, and we can find new parallel data with which to augment it. Both
are difficult tasks in and of themselves, and both have implications for the translation task
with which this thesis is largely concerned.
In what follows, we present a novel method to detect parallel fragments within noisy parallel
corpora. Isolating these parallel fragments from the noisy data in which they are contained frees
us from noisy alignments and stray links that can severely constrain translation-rule extraction.
We do this with existing machinery, making use of an existing word alignment model for this
task. We evaluate the quality and utility of the extracted data on large-scale Chinese-English
and Arabic-English translation tasks and show significant improvements over a state-of-the-art
baseline.
Figure 5.1: Example of a word alignment resulting from noisy parallel data. The structure of
the resulting alignment makes it difficult to find parallel fragments simply by inspection. How
can we discover automatically those parallel fragments hidden within such data?
5.1 Introduction
A decade ago, Banko and Brill (2001) showed that scaling to very large corpora is game-
changing for a variety of tasks. Methods that work well in a small-data setting often lose their
luster when moving to large data. Conversely, other methods that seem to perform poorly in
that same small-data setting may perform markedly differently when trained on large data.
Perhaps most importantly, Banko and Brill showed that there was no significant variation in
performance among a variety of methods trained at-scale with large training data. The takeaway?
If you desire to scale to large datasets, use a simple solution for your task, and throw in as much
data as possible. The community at large has taken this message to heart, and in most cases it
has been an effective way to increase performance.
Today, for machine translation, more data than we already have is getting harder and
harder to come by; we require large parallel corpora to train state-of-the-art statistical,
data-driven models. Groups that depend on clearinghouses like the LDC for their data
increasingly find that there is less of a mandate to gather parallel corpora on the scale of
what was produced in the last 5-10 years. Others, who directly exploit the entire web to gather
such data, will necessarily run up against a wall after all that data has been collected.
We need to learn how to do more with the data we already have. Previous work has focused
on detecting parallel documents and sentences on the web, e.g. (Fung & Cheung, 2004; Wu &
Fung, 2005). Munteanu and Marcu (2006) extend the state of the art for this task to parallel
fragments. Smith et al. (2010) learn to find parallel sentence pairs over Wikipedia,
though they do not focus on isolating sub-sentential fragments.
In this chapter, we present a novel method for detecting parallel fragments in large, existing,
and potentially noisy parallel corpora using existing machinery, and show significant
improvements to two state-of-the-art MT systems. We also depart from previous work in that we
only consider parallel corpora that have previously been cleaned, sanitized, and thought to be
non-noisy, e.g. parallel corpora available from the LDC.
5.2 Detecting Noisy Data
In order to extract previously unextractable good parallel data, we must first detect the bad
data. In doing so, we will make use of existing machinery in a novel way. We directly use the
alignment model to detect weak or undesirable data for translation.
5.2.1 Alignment Model as Noisy Data Detector
The alignment model we use in our experiments is that described by Riesa et al. (2011), modified
to output full derivation trees and model scores along with alignments. Our reasons for using
this particular alignment method are twofold: it provides a natural way to hierarchically partition
subsentential segments, and is also empirically quite accurate in modeling word alignments, in
general. This latter quality is important, not solely for downstream translation quality, but also
for the basis of our claims with respect to detecting noisy or unsuitable data:
The alignment model we employ is discriminatively trained to know what good alignments
between parallel data look like. When this model is “confused,” it yields a low model score for
a generated alignment, given an input sentence pair. In this case, the alignment probably doesn’t
look like the examples it has been trained on.
1. It could be that the data is parallel, but the model is very confused.
2. It could be that the data is a mess, and the model is very confused.
The general accuracy of the alignment model we employ makes the former case unlikely. We
therefore make a key assumption that underlies this work – that a low model score accompanies
noisy data. We use this data as candidates from within which we will extract non-noisy parallel
segments.
5.2.2 A Brief Example
As an illustrative example, consider the following sentence pair in our training corpus taken
from LDC2005T10. This is the sentence pair shown in Figure 5.1:
fate brought us together on that wonderful summer day and one year later , shou – tao and i were married not
only in the united states but also in taiwan .
他 来自 于 台湾 , 我 则 是 土生土长 于 纽泽西 州 的 美国人 ; 而 就 在 那 奇妙 的 夏 日 里
, 我 俩 被 命运 兜 在 一起 .
In this sentence pair there are only two parallel phrases, corresponding to the underlined
and double-underlined strings. There are a few scattered word pairs which may have a natural
correspondence,¹ but no other larger phrases.²
In this work we are concerned with finding large phrases,³ since very small phrases tend to be
extractable even when the data is noisy. Bad alignments tend to cause conflicts when extracting
large phrases due to unexpected, stray links in the alignment matrix; smaller fragments have
less opportunity to come into conflict with incorrect, stray links due to noisy data or alignment
model error. We consider phrases large enough for our purposes to be those of size greater
than 3, and we ignore smaller fragments.
¹ For example, (I, 我) and (Taiwan, 台湾).
² The rest of the Chinese describes where the couple is from; the speaker, she says, is an
American raised in New Jersey.
³ We count the size of a phrase according to the number of English words it contains; one
could be even more conservative by constraining both sides.
5.2.3 Parallel Fragment Extraction
5.2.3.1 A Hierarchical Alignment Model and its Derivation Trees
We use the alignment model of Riesa et al. (2011), described and developed in Chapters 3
and 4. It is a discriminatively trained model which, at alignment time, walks up the English
parse tree and, at every node in the tree, generates alignments by recursively scoring and
combining the alignments generated at the current node's children, building up larger and
larger alignments. This process works similarly to a CKY parser, moving bottom-up and
generating larger and larger constituents until it has generated the full tree spanning the
entire sentence. However, instead of generating syntactic structures, we are generating
alignments.
In moving bottom-up along the tree, just as there is a derivation tree for a CKY parse, we
can also follow backpointers to extract the derivation tree of the 1-best alignment starting from
the root node. This derivation tree gives a hierarchical partitioning of the alignment and the
associated word-spans. We can also inspect model scores at each node in the derivation tree.
5.2.3.2 Using the Alignment Model to Detect Parallel Fragments
For each training example in our parallel corpus, we have an alignment derivation tree. Because
the derivation tree is essentially isomorphic to the English parse tree, it represents a
hierarchical partitioning of the training example into syntactic segments. We traverse
the tree top-down, inspecting the parallel fragments implied by the derivation at each point,
and their associated model scores.
The idea behind this top-down traversal is that although some nodes, and perhaps entire
derivations, may be low-scoring, there are often high-scoring fragments within the larger
derivation which are worthy of extraction. Figure 5.2 shows an example. We recursively
traverse the derivation, top-down, extracting the largest fragment possible at any derivation
node whose alignment model score is higher than some threshold τ, and whose associated English
and foreign spans meet a set of important constraints:
1. The parent node in the derivation has a score less than τ.
2. The length of the English span is greater than 3.
3. There are no foreign words inside the fragment that are unaligned within the fragment but
aligned to English words outside it.
Once a fragment has been extracted, we do not recurse any further down the subtree.
Constraint 1 is a candidate constraint, and forces us to focus on segments of parallel sentences
with low model scores; these are segments likely to consist of bad alignments due to noisy data
or aligner error.
Constraint 2 is a conservativity constraint – we are more confident in model scores over
larger fragments with more context than smaller ones. This constraint also parameterizes the
notion that larger fragments are the type more often precluded from extraction due to stray or
incorrect word-alignment links; we are already likely to be able to extract smaller fragments
using standard methods, and as such, they are less useful to us here.
Constraint 3 is a content constraint, limiting us from extracting fragments with blocks of
unaligned foreign words that don’t belong in this particular fragment because they are aligned
                                  Chi-Eng    Ara-Eng
  Total fragments extracted       868,870    996,538
  English words extracted         6.4M       9.7M
  Foreign words extracted         5.2M       7.8M
  Average English words/fragment  7.3        9.6
  Average Foreign words/fragment  6.0        7.9

Table 5.1: Descriptive statistics about fragments extracted from each parallel corpus.
elsewhere. If we threw out this constraint, then in translating from Chinese to English, we would
erroneously learn to delete blocks of Chinese words that otherwise should be translated. When
foreign words are unaligned everywhere within a parallel sentence, then they can be included
within the extracted fragment. Common examples in Chinese are function words such as 的,
个, and 了.
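The traversal and the three constraints above can be sketched as follows; `Node`, `english_len`, and `has_external_links` are assumed stand-ins for the properties the constraints reference, not our actual data structures.

```python
# Minimal sketch of top-down fragment extraction over a derivation tree.
class Node:
    def __init__(self, score, english_len, children=(), external_links=False):
        self.score = score                      # alignment model score
        self.english_len = english_len          # length of the English span
        self.children = children                # derivation sub-nodes
        self.has_external_links = external_links  # constraint-3 violation flag

def extract_fragments(node, tau, parent_score=float("-inf"), out=None):
    """Take the largest node scoring above tau whose parent scored at most tau
    (constraint 1), spanning more than 3 English words (constraint 2), with no
    foreign words inside aligned to English words outside (constraint 3)."""
    if out is None:
        out = []
    if (node.score > tau and parent_score <= tau
            and node.english_len > 3
            and not node.has_external_links):
        out.append(node)   # extract, and do not recurse below this node
        return out
    for child in node.children:
        extract_fragments(child, tau, node.score, out)
    return out

# Toy derivation: the root scores poorly, but one subtree is extractable.
good = Node(score=12.7, english_len=5)
bad = Node(score=3.0, english_len=4)
root = Node(score=5.0, english_len=9, children=(good, bad))
print(extract_fragments(root, tau=12.659) == [good])  # prints True
```

Because the recursion descends only below nodes that fail to qualify, an extracted fragment's parent is always low-scoring or otherwise disqualified, matching constraint 1.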
Computing τ. In computing our extraction threshold τ, we must decide what proportion of fragments we consider to be low-scoring and least likely to be useful for translation. We make the assumption that this is the bottom 10% of the data; though this figure is somewhat arbitrary, we base it on the observation that Riesa et al. (2011) report an F-measure accuracy of slightly under 90%. In inspecting model scores for each node spanning 3 or more English words in each alignment derivation, the worst-scoring 10% of fragments have score τ = 12.659 and below. Any fragment we extract should have an alignment model score higher than this number.
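As a small illustrative sketch of the threshold computation (the helper name and the index-based percentile convention are our own, for exposition only):

```python
# Pick tau as the score at the top of the bottom `fraction` of node scores.
def compute_tau(scores, fraction=0.10):
    """Return the score below or at which the worst `fraction` of scores fall."""
    ranked = sorted(scores)
    cutoff = max(0, int(len(ranked) * fraction) - 1)
    return ranked[cutoff]

# With 100 uniformly spread scores, the bottom 10% ends at the 10th smallest.
print(compute_tau([float(s) for s in range(1, 101)]))  # prints 10.0
```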
[Figure 5.2 here: an English parse-tree fragment ("a fantastic yet realistic adventure ... with multi-sensory experiences") annotated with alignment model scores at three nodes: 14.2034, PP 9.5130, and NP -0.5130.]

Figure 5.2: From LDC2004T08, when the NP fragment shown here is combined to make a larger span with a sister PP fragment, the alignment model objects due to non-parallel data under the PP, voicing a score of -0.5130. We extract and append to our training corpus the NP fragment depicted, from which we later learn 5 additional translation rules.
5.3 Evaluation
We evaluate our parallel fragment extraction in a large-scale Chinese-English and Arabic-English
MT setting. In our experiments we use a tree-to-string syntax-based MT system (Galley et al.,
2004), and evaluate on a standard test set, NIST08. We parse the English side of our parallel
corpus with the Berkeley parser (Petrov et al., 2006), and tune parameters of the MT system
with MIRA (Chiang et al., 2008b). We decode with an integrated language model trained on
about 4 billion words of English.
Chinese-English We align a parallel corpus of 8.4M parallel segments, with 210M words of
English and 193M words of Chinese. From this we extract 868,870 parallel fragments according
to the process described in Section 5.2, and append these fragments to the end of the parallel cor-
pus. In doing so, we have created a larger parallel corpus of 9.2M parallel segments, consisting
of 217M and 198M words of English and Chinese, respectively.
Arabic-English We align a parallel corpus of 9.0M parallel segments, with 223M words of
English and 194M words of Arabic. From this we extract 996,538 parallel fragments, and ap-
pend these fragments to the end of the parallel corpus. The resulting corpus has 10M parallel
segments, consisting of 233M and 202M words of English and Arabic, respectively.
Results are shown in Table 5.2. Using our parallel fragment extraction, we learn 68 million
additional unique Arabic-English rules that are not in the baseline system; likewise, we learn 38
million new unique Chinese-English rules not in the baseline system for that language pair.
Note that we are not simply duplicating portions of the parallel data. While each fragment of source and target words we extract can also be found elsewhere in the larger parallel corpus, in that original context these fragments largely do not make it into fruitful translation rules usable by the downstream MT system.
We see gains in BLEU score across both language pairs, showing empirically that we are
learning new and useful translation rules previously not present in our grammars. These results
are significant with p < 0.05 for Arabic-English and p < 0.01 for Chinese-English.
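One common way to assess such significance is paired bootstrap resampling over test-set segments. The following is only an illustrative sketch, using additive per-segment scores as a stand-in for corpus-level BLEU (which does not decompose additively); it is not necessarily the test used here.

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Estimate the p-value that system A is not actually better than B:
    the fraction of resampled test sets on which A fails to beat B."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample segments
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / samples

# If A beats B on every segment, every resample agrees and p is 0.
print(paired_bootstrap([1.0] * 20, [0.0] * 20))  # prints 0.0
```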
5.4 Future Work
We believe using the alignment model derivation trees as a way to find and extract parallel frag-
ments is widely applicable regardless of the provenance of the data. While we see no immediate
  Corpus                 Extracted Rules   BLEU
  Baseline (Ara-Eng)     750M              50.0
  +Extracted fragments   818M              50.4
  Baseline (Chi-Eng)     270M              31.5
  +Extracted fragments   308M              32.0

Table 5.2: End-to-end translation experiments with and without extracted fragments. BLEU score gains are significant with p < 0.05 for Arabic-English and p < 0.01 for Chinese-English.
reason why this work would be better suited to somewhat cleaner LDC data, as opposed to
noisier web data, we leave the evaluation of this empirical question to future work.
5.5 Discussion
The work presented above provides clear translation quality benefits in terms of BLEU across
two different language-pairs. However, in choosing what type of corpora to work with, there
is a trade-off. With larger training corpora, the baseline BLEU score may already be relatively
high and difficult to improve upon, but large parallel corpora provide many potential candidate
segments for extraction. Conversely, for smaller training corpora, while the bar may be lower,
there is a priori significantly less data from which we can extract. In our experiments, we saw
the largest gains starting from already-large parallel corpora.
We know of no alignment model that can inherently deal with and detect noisy data. All
alignment models we have experimented with will fall down in the presence of noisy data. Importantly, even if the alignment model were able to yield "perfect" alignments with no alignment
links among noisy sections of the parallel data precluding us from extracting reasonable rules
or phrase pairs, we would still have to deal with the downstream rule extraction heuristics.
Standardly-used rule extraction heuristics for phrase-based, hierarchical phrase-based and
syntax-based MT all cause a translation grammar to blow up when large swaths of a sentence
pair remain unaligned. We cannot simply turn off heuristics for dealing with unaligned words,
since many unaligned words are not part of noisy phrase segments, yet are rightfully unaligned;
function words are a good example. Absent a mechanism within the alignment model itself
to deal with this problem, we provide a simple way to recover from noisy data without the
introduction of new tools, e.g., those of Munteanu and Marcu (2006) or Xu et al. (2005).
Summing up, parallel data in the world is not unlimited. We cannot always continue to
double our data for increased performance. Parallel data creation is expensive, and automatic
discovery is resource-intensive (Uszkoreit et al., 2010). We have presented a technique that
helps to squeeze more out of an already large, state-of-the-art MT system, using existing pieces
of the pipeline to do so in a novel way.
Chapter 6
Conclusion and Future Work
6.1 Contributions
We have made a number of contributions to the problem of word alignment, and more broadly,
to machine translation.
1. We have presented a novel search algorithm for alignment, casting word alignment as a
parsing problem and introducing the idea of alignment forests. In doing so, we are able to
take advantage of advances in the well-studied areas of bottom-up parsing and decoding
and apply them to the word alignment task.
2. We have progressed in unifying the structure of the search space for word-alignment and
syntax-based translation by learning syntax-based translation rules during alignment train-
ing time and directly encoding those rules as features in our alignment model. In doing so,
we make use of millions of arbitrary, overlapping, local, and global data-driven syntactic
features. Our features are largely language-independent, but the model is flexible enough
for anyone to encode and use any linguistically inspired knowledge as a feature.
3. Our simple, parallelized online learning formulation in conjunction with our efficient
search method now allows for rapid experimentation. Under a very large data setting
of over 200 million words of parallel text, we have compressed the experimental pipeline
for word-alignment from over six days to less than 24 hours.
4. We have shown that our word alignment models are useful for not just translation, but
parallel data discovery, as well.
5. We have shown that not all training data can be considered equal, and made the case for
the existence of some universal truths with respect to what makes a useful gold-standard
alignment for learning a translation model. Resulting from this discussion, we have out-
lined desiderata for future creation of supervised training data for word alignment.
6.2 Future work
6.2.1 Regularization for Generalization and Scaling
Regularization during the learning process may be important for training scenarios with both
few and many training examples. We learn models of hundreds of thousands or even millions of
features with relatively few training examples. With a relatively small set of training examples,
i.e. less than 10,000, are we overfitting our training data? L1 regularization (Tibshirani, 1996; Kazama & Tsujii, 2003) is known to provide sparse solutions, and has been used in a variety of structured prediction problems (Martins et al., 2011); is it possible to use L1 regularization at training-time for feature selection, keeping important features and discarding the rest?
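As an illustrative sketch of why L1 regularization prunes features (not our training procedure): the proximal soft-thresholding update drives every weight whose magnitude is at most the regularization strength exactly to zero.

```python
# One L1 proximal (soft-thresholding) step over a weight vector.
def soft_threshold(weights, lam):
    """Shrink each weight toward zero by lam, zeroing any weight
    whose magnitude is at most lam."""
    return [0.0 if abs(w) <= lam else (w - lam if w > 0 else w + lam)
            for w in weights]

w = [0.9, -0.05, 0.02, -1.3, 0.0]
sparse = soft_threshold(w, 0.1)
print(sum(1 for x in sparse if x == 0.0))  # prints 3: three weights pruned
```

Applied repeatedly during training, such updates yield models with most feature weights exactly zero, which is the feature-selection behavior we would hope to exploit.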
Likewise, we have concerns about scale. The more training data we have, the more features
we will ultimately learn; we would like to be able to accommodate additional training data with-
out the expected massive expansion in the number of features. This will be especially important
for unsupervised learning (described in Section 6.2.3, below) in which we would use the
entire large (200M-word) parallel training corpus as input. We have done initial experiments in
this vein that show that we can reduce the number of parameters in our Chinese-English model
by 80% and, for Arabic-English, by over 87%, while still achieving the same or better accuracy as
do the significantly larger, unregularized models.
6.2.2 Better Standards for Word Alignment
In Section 4.7 we discussed modifications to the current standards for word alignment and out-
lined desiderata for creating new gold-standard alignment data going forward. These ideas are
based on human linguistic intuitions (Hermjakob, 2009) and some limited empirical evidence
and analysis from Chapter 4. However, some interesting questions remain: Can we decide auto-
matically what type of alignments are good for MT? Are there alignment strategies more suited
to syntax-based MT than phrase-based MT? Hierarchical phrase-based MT? How dependent is
an alignment strategy on a specific language pair?
6.2.3 Unsupervised Discriminative Learning
As long as reliable training data is available, we would not expect an unsupervised model to
outperform a supervised one. However, we don’t have gold-standard hand-aligned training data
for all language pairs. The work we have presented thus far employs supervised learning and
is not applicable when gold-standard data is unavailable. Thus, we would like to continue to
exploit the syntactic features we have discussed throughout this thesis, but train the parameters
of the model in an unsupervised fashion.
Future work in this area involves formulating an unsupervised discriminative learning prob-
lem using contrastive estimation (Smith & Eisner, 2005), a method for training log-linear models
with unlabeled data only.
Bibliography
Al-Onaizan, Yaser, Curin, Jan, Jahr, Michael, Knight, Kevin, Lafferty, John D., Melamed,
I. Dan, Purdy, David, Och, Franz J., Smith, Noah A., & Yarowsky, David (1999). Statis-
tical machine translation, final report, JHU workshop.
Banko, Michele, & Brill, Eric (2001). Scaling to very very large corpora for natural language
disambiguation. Proceedings of the 39th Annual Meeting of the Association for Computational
Linguistics (pp. 26–33). Toulouse, France: Association for Computational Linguistics.
Blunsom, Phil, & Cohn, Trevor (2006). Discriminative word alignment with conditional random
fields. Proceedings of COLING-ACL. Sydney, Australia.
Brown, Peter F., Della Pietra, Stephen A., Della Pietra, Vincent J., & Mercer, R. L. (1993).
The mathematics of statistical machine translation: parameter estimation. Computational
Linguistics, 19, 263–311.
Buckwalter, Timothy (2004). Buckwalter arabic morphological analyzer version 2.0. Linguistic
Data Consortium, catalog number LDC2004L02.
Burkett, David, Blitzer, John, & Klein, Dan (2010). Joint parsing and alignment with weakly
synchronized grammars. Proceedings of NAACL HLT 2010.
Cherry, Colin, & Lin, Dekang (2006). Soft syntactic constraints for word alignment through
discriminative training. Proceedings of COLING/ACL (pp. 105–112). Sydney, Australia:
Association for Computational Linguistics.
Chiang, David (2005). A hierarchical phrase-based model for statistical machine translation.
Proceedings of ACL (pp. 263–270). Ann Arbor, MI.
Chiang, David (2007). Hierarchical phrase-based translation. Computational Linguistics, 33,
201–228.
Chiang, David (2010). Learning to translate with source and target syntax. Proceedings of
the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1443–1452).
Uppsala, Sweden.
Chiang, David, DeNeefe, Steve, Chan, Tee Seng, & Ng, Hwee Tou (2008a). Decomposability of
translation metrics for improved evaluation and efficient algorithms. Proceedings of EMNLP
(pp. 610–619).
Chiang, David, Marton, Yuval, & Resnik, Philip (2008b). Online large-margin training of syn-
tactic and structural translation features. Proceedings of the Conference on Empirical Methods
in Natural Language Processing (EMNLP) (pp. 224–233). Honolulu, HI. USA.
Chiang, David, Wang, Wei, & Knight, Kevin (2009). 11,001 new features for statistical machine
translation. Proceedings of NAACL HLT (pp. 218–226).
Collins, Michael (2002). Discriminative training methods for hidden markov models: theory
and experiments with perceptron algorithms. Proceedings of EMNLP. Philadelphia, PA.
Collins, Michael (2003). Head-driven statistical models for natural language parsing. Compu-
tational Linguistics, 29, 589–637.
DeNero, John, & Klein, Dan (2007). Tailoring word alignments to syntactic machine translation.
Proceedings of ACL. Prague, Czech Republic.
DeNero, John, & Klein, Dan (2010). Discriminative modeling of extraction sets for machine
translation. Proceedings of NAACL HLT 2010.
Deng, Yonggang, Kumar, Shankar, & Byrne, William (2006). Segmentation and alignment of
parallel text for statistical machine translation. In Natural language engineering, 236–260.
Cambridge University Press.
Fossum, Victoria, Knight, Kevin, & Abney, Steven (2008). Using syntax to improve word
alignment for syntax-based statistical machine translation. Proceedings of ACL MT Workshop.
Honolulu, HI. USA.
Fraser, Alexander, & Marcu, Daniel (2007). Getting the structure right for word alignment:
LEAF. Proceedings of EMNLP-CoNLL (pp. 51–60). Prague, Czech Republic.
Fung, Pascale, & Cheung, Percy (2004). Mining very non-parallel corpora: Parallel sentence
and lexicon extraction via bootstrapping and EM. Proceedings of EMNLP (pp. 57–63).
Galley, Michel, Graehl, Jonathan, Knight, Kevin, Marcu, Daniel, DeNeefe, Steve, Wang, Wei, &
Thayer, Ignacio (2006). Scalable inference and training of context-rich syntactic translation
models. Proceedings of COLING-ACL (pp. 961–968). Sydney, Australia: Association for
Computational Linguistics.
Galley, Michel, Hopkins, Mark, Knight, Kevin, & Marcu, Daniel (2004). What’s in a translation
rule? Proceedings of HLT-NAACL (pp. 273–280).
Garey, Michael, & Johnson, David (1979). Computers and intractability: A guide to the theory
of np-completeness. New York: W.H. Freeman and Co.
Germann, Ulrich, Jahr, Michael, Knight, Kevin, Marcu, Daniel, & Yamada, Kenji (2001). Fast
decoding and optimal decoding for machine translation. Proceedings of ACL (pp. 228–235).
Toulouse, France.
Haghighi, Aria, Blitzer, John, DeNero, John, & Klein, Dan (2009). Better word alignments
with supervised itg models. Proceedings of the Joint Conference of the ACL and IJCNLP (pp.
923–931). Singapore: Association for Computational Linguistics.
Hermjakob, Ulf (2009). Improved word alignment with statistics and linguistic heuristics. Pro-
ceedings of EMNLP (pp. 229–237).
Hiemstra, Djoerd, & de Jong, Franciska (1999). Disambiguation strategies for cross-language
information retrieval. Lecture Notes in Computer Science, Volume 1696, Jan 1999.
Hopcroft, John, & Ullman, Jeffrey (1979). Introduction to automata theory, languages and
computation. Reading, MA: Addison-Wesley.
Huang, Liang (2008). Forest reranking: Discriminative parsing with non-local features. Pro-
ceedings of ACL. Columbus, OH. USA.
Huang, Liang, & Chiang, David (2005). Better k-best parsing. Proceedings of the 9th Interna-
tional Workshop on Parsing Technologies (IWPT 2005). Vancouver, BC. Canada.
Huang, Liang, & Chiang, David (2007). Forest rescoring: Faster decoding with integrated
language models. Proceedings of ACL.
Huang, Liang, & Mi, Haitao (2010). Efficient incremental decoding for tree-to-string translation.
Proceedings of EMNLP 2010. Boston, MA. USA.
Hwa, Rebecca, Resnik, Philip, Weinberg, Amy, & Kolak, Okan (2002). Evaluating translational
correspondence using annotation projection. ACL02. Philadelphia, PA.
Ittycheriah, Abraham, & Roukos, Salim (2005). A maximum entropy word aligner for Arabic-
English machine translation. Proceedings of HLT-EMNLP (pp. 89–96). Vancouver, Canada.
Kazama, Jun’ichi, & Tsujii, Jun’ichi (2003). Evaluation and extension of maximum entropy
models with inequality constraints. Proceedings of the Conference on Empirical Methods in
Natural Language Processing (EMNLP) (pp. 137–144).
Klein, Dan, & Manning, Chris (2001). Parsing and hypergraphs. Proceedings of the Interna-
tional Workshop on Parsing Technologies (IWPT).
Knight, Kevin (1999). Decoding complexity in word-replacement translation models. Compu-
tational Linguistics, 25, 607–615.
Koehn, Philipp (2004). Pharaoh: A beam search decoder for phrase-based statistical machine
translation models. AMTA (pp. 115–124).
Koehn, Philipp, Och, Franz J., & Marcu, Daniel (2003). Statistical phrase-based translation.
Proceedings of HLT-NAACL (pp. 127–133). Edmonton, Canada.
Lacoste-Julien, Simon, Klein, Dan, Taskar, Ben, & Jordan, Michael (2006). Word alignment
via quadratic assignment. Proceedings of HLT-NAACL (pp. 112–119). New York, NY. USA.
Li, Zhifei, Callison-Burch, Chris, Dyer, Chris, Ganitkevitch, Juri, Khudanpur, Sanjeev,
Schwartz, Lane, Thornton, Wren, Weese, Jonathan, & Zaidan, Omar (2009). Joshua: An
open source toolkit for parsing-based machine translation. Proceedings of the Workshop on
Statistical Machine Translation.
Liu, Yang, Liu, Qun, & Lin, Shouxun (2005). Log-linear models for word alignment. Proceed-
ings of ACL (pp. 459–466). Ann Arbor, MI.
Liu, Yang, Lü, Yajuan, & Liu, Qun (2009a). Improving tree-to-tree translation with packed
forests. Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP
(pp. 558–566). Singapore.
Liu, Yang, Xia, Tian, Xiao, Xinyan, & Liu, Qun (2009b). Weighted alignment matrices for
statistical machine translation. Proceedings of the 2009 Conference on Empirical Methods in
Natural Language Processing (pp. 1017–1026). Singapore.
Martins, Andre T., Smith, Noah A., Aguiar, Pedro M. Q., & Figueiredo, Mario A. T. (2011).
Structured sparsity in structured prediction. Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP) (pp. 1500–1511).
May, Jonathan, & Knight, Kevin (2007). Syntactic re-alignment models for machine translation.
Proceedings of EMNLP. Prague, Czech Republic.
McDonald, Ryan, Hall, Keith, & Mann, Gideon (2010). Distributed training strategies for the
structured perceptron. Proceedings of NAACL HLT (pp. 456–464). Los Angeles, CA. USA.
Moore, Robert C. (2005). A discriminative framework for bilingual word alignment. Proceed-
ings of HLT-EMNLP. Vancouver, Canada.
Moore, Robert C., Yih, Wen-Tau, & Bode, Andreas (2006). Improved discriminative bilingual
word alignment. Proceedings of COLING-ACL (pp. 513–520). Sydney, Australia.
Munteanu, Dragos Stefan, & Marcu, Daniel (2006). Extracting parallel sub-sentential fragments
from non-parallel corpora. Proceedings of the 21st International Conference on Computa-
tional Linguistics and 44th Annual Meeting of the Association for Computational Linguistics
(COLING/ACL). Sydney, Australia.
Och, Franz J., & Ney, Hermann (2003). A systematic comparison of various statistical alignment
models. Computational Linguistics, 29, 19–51.
Papineni, Kishore A., Roukos, Salim, Ward, Todd, & Zhu, Wei-Jing (2001). Bleu: a method
for automatic evaluation of machine translation (Technical Report RC22176 (W0109-022)).
IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY.
Pauls, Adam, & Klein, Dan (2010). Unsupervised syntactic alignment with inversion transduc-
tion grammars. Proceedings of Human Language Technologies: The 2010 Annual Conference
of the North American Chapter of the Association for Computational Linguistics. Los Ange-
les, CA. USA.
Petrov, Slav, Barrett, Leon, Thibaux, Romain, & Klein, Dan (2006). Learning accurate, compact,
and interpretable tree annotation. Proceedings of COLING-ACL. Sydney, Australia.
Petrov, Slav, & Klein, Dan (2007). Improved inference for unlexicalized parsing. Proceedings
of NAACL HLT (pp. 404–411).
Riesa, Jason, Irvine, Ann, & Marcu, Daniel (2011). Feature-rich language-independent syntax-
based alignment for statistical machine translation. Proceedings of the 2011 Conference on
Empirical Methods in Natural Language Processing (pp. 497–507). Edinburgh, Scotland,
UK: Association for Computational Linguistics.
Riesa, Jason, & Marcu, Daniel (2010). Hierarchical search for word alignment. Proceed-
ings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL) (pp.
157–166). Uppsala, Sweden.
Riesa, Jason, & Yarowsky, David (2006). Minimally supervised morphological segmentation
with applications to machine translation. Proceedings of the 7th Conference of the Association
for Machine Translation in the Americas (AMTA) (pp. 185–192). Cambridge, MA. USA.
Schafer, Charles, & Yarowsky, David (2002). Inducing translation lexicons via diverse similarity
measures and bridge languages. Proceedings of CoNLL (pp. 146–152).
Smith, Jason, Quirk, Chris, & Toutanova, Kristina (2010). Extracting parallel sentences from
comparable corpora using document level alignment. Proceedings of NAACL HLT (pp.
403–411).
Smith, Noah A., & Eisner, Jason (2005). Contrastive estimation: Training log-linear models on
unlabeled data. Proceedings of the ACL (pp. 354–362).
Søgaard, Anders, & Kuhn, Jonas (2009). Empirical lower bounds on alignment error rates in
syntax-based machine translation. SSST ’09: Proceedings of the Third Workshop on Syntax
and Structure in Statistical Translation (pp. 19–27). Morristown, NJ, USA: Association for
Computational Linguistics.
Talbot, David, & Brants, Thorsten (2008). Randomized language models via perfect hash func-
tions. Proceedings of ACL-08: HLT (pp. 505–513). Columbus, OH. USA.: Association for
Computational Linguistics.
Taskar, Ben, Chatalbashev, Vassil, Koller, Daphne, & Guestrin, Carlos (2005a). Learning struc-
tured prediction models: A large margin approach. Proceedings of ICML.
Taskar, Ben, Lacoste-Julien, Simon, & Klein, Dan (2005b). A discriminative matching approach
to word alignment. Proceedings of HLT-EMNLP. Vancouver, Canada.
Tibshirani, Robert (1996). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B. (pp. 267–288).
Tsochantaridis, Ioannis, Hofmann, Thomas, Joachims, Thorsten, & Altun, Yasemin (2004). Support
vector machine learning for interdependent and structured output spaces. Proceedings of
ICML.
Uszkoreit, Jakob, Ponte, Jay, Popat, Ashok, & Dubiner, Moshe (2010). Large scale parallel
document mining for machine translation. Proceedings of COLING (pp. 1101–1109).
Venugopal, Ashish, Zollmann, Andreas, Smith, Noah A., & Vogel, Stephan (2008). Wider
pipelines: N-best alignments and parses in MT training. Proceedings of AMTA. Honolulu,
HI. USA.
Watanabe, Taro, Suzuki, Jun, Tsukada, Hajime, & Isozaki, Hideki (2007). Online large-margin
training for statistical machine translation. Proceedings of EMNLP-CoNLL (pp. 764–773).
Weaver, Warren (1949). Translation. In Machine translation of languages: Fourteen essays.
MIT Press.
Wu, Dekai (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel
corpora. Computational Linguistics, 23, 377–403.
Wu, Dekai, & Fung, Pascale (2005). Inversion transduction grammar constraints for mining
parallel sentences from quasi-comparable corpora. Proceedings of the Second International
Joint Conference on Natural Language Processing (IJCNLP). Jeju, South Korea.
Xu, Jinxi, Weischedel, Ralph M., & Nguyen, Chanh (2001). Evaluating a probabilistic model for
cross-lingual information retrieval. Proceedings of the 24th Annual International Conference
on Research and Development in Information Retrieval (SIGIR).
Xu, Jia, Zens, Richard, & Ney, Hermann (2005). Sentence segmentation using ibm word align-
ment model 1. Proceedings of EACL (pp. 280–287).
Yarowsky, David, Ngai, Grace, & Wicentowski, Richard (2001). Inducing multilingual text
analysis tools via robust projection across aligned corpora. Human Language Technology
Conference (pp. 109–116). San Diego, CA.
Zhang, Min, Jiang, Hongfei, Aw, Aiti, Li, Haizhou, Tan, Chew Lim, & Li, Sheng (2008). A tree
sequence alignment-based tree-to-tree translation model. Proceedings of ACL-08: HLT (pp.
559–567). Columbus, OH. USA.
Zhao, Bing, Zechner, Klaus, Vogel, Stephan, & Waibel, Alex (2003). Efficient optimization
for bilingual sentence alignment based on linear regression. Proceedings of the HLT-NAACL
Workshop on Building and Using Parallel Texts. Edmonton, Canada.
Zollmann, Andreas, & Venugopal, Ashish (2006). Syntax augmented machine translation via
chart parsing. NAACL 2006 Workshop on Statistical Machine Translation. Rochester, NY. USA.
Abstract

Word alignment, the process of inferring the implicit links between words across two languages, serves as an integral piece of the puzzle of learning linguistic translation knowledge. It enables us to acquire automatically from data the rules that govern the transformation of words, phrases, and syntactic structures from one language to another. Word alignment is used in many tasks in Natural Language Processing, such as bilingual dictionary induction, cross-lingual information retrieval, and distilling parallel text from within noisy data. In this dissertation, we focus on word alignment for statistical machine translation.

We advance the state-of-the-art in search, modeling, and learning of alignments and show empirically that, when taken together, these contributions significantly improve the output quality of large-scale statistical machine translation, outperforming existing methods. We show results for Arabic-English and Chinese-English translation.

Ultimately, the work we describe herein may be used for any language-pair, supporting arbitrary and overlapping features from varied sources. Finally, our features are learned automatically without any human intervention, facilitating rapid deployment for new language-pairs.