NON-TRADITIONAL RESOURCES AND IMPROVED TOOLS FOR
LOW-RESOURCE MACHINE TRANSLATION
by
Nima Pourdamghani
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2019
Copyright 2019 Nima Pourdamghani
Dedication
The Earth keeps revolving, and I keep loving you more. To my dear wife,
Marjan, without whom this journey would lose its meaning.
Acknowledgement
You present the essence of the last five years of your work in a short talk, answer
questions from the committee members, and it is done. The first feeling that comes
to you is a sense of relief, and with that comes the joy of achieving something
great. But after a few days, when the feelings sink in, a doctor of philosophy is
just another title. What really matters is what you have gained in this journey:
a journey that would have been impossible without the guidance of an expert mentor,
and intolerable without caring company and friends.
I have been fortunate to work under the supervision of an excellent advisor,
Kevin Knight, who patiently guided me in every step of this path. I would like
to thank him for his great encouragement, support, and guidance, and for his
dedication to the progress of all his students, including me. I will always be
grateful for how he shaped my academic and non-academic personality.

I would also like to thank Jonathan May, who was not only a great mentor to
me, but also a caring friend, and the host of my first Thanksgiving dinner. I will
always be thankful for all his help, kindness, and support.

I was honored to have Jerry Hobbs, Jonathan May, Kallirroi Georgila, and
Shrikanth Narayanan as my committee members, and I would like to thank all of
them for their insightful comments.
A very special gratitude goes out to Ulf Hermjakob, Michael Pust, Daniel
Marcu, David Chiang, Heng Ji, Xing Shi, Mozhdeh Gheini, Ashish Vaswani,
Yonatan Bisk, Tomer Levinboim, Aliya Deri, David Kale, and Nanyun Peng. It
was fantastic to have the opportunity to work with you and have you as my friends.
I would also like to express my gratitude to Lizsl De Leon and Peter Zamar for
all their help in my academic life.

I thoroughly thank my parents, Mohammad Pourdamghani and Zahra Rajizadeh,
and my brother Arash. This journey wouldn't have been possible without your endless
love, prayers, support, and encouragement. I will always be in your debt for all
your kindness.

I was very fortunate to have friends who made my life far from home warm
and enjoyable. I want to thank my friends Payman, Soheil, Reza, Mohsen, Shima,
AmirSoheil, Sepideh, Sadaf, and Mahdi for making these years a memorable period
of my life.

And finally, I would like to thank the one for whom I should thank God every
second of my life. Marjan, thank you for staying by my side in all my good and bad
moments. I would not be where I am today without your love and support. Thank
you for believing in me when I did not, and for all the joy and sunshine you bring
to my life.
Contents

Dedication
Acknowledgement
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Problem Description
  1.2 Previous Works
  1.3 This Dissertation
    1.3.1 Symmetrizing word alignments
    1.3.2 Using semantic information in monolingual data to improve word alignments
    1.3.3 Using resources from related languages
    1.3.4 On-demand unsupervised neural machine translation
  1.4 Contributions

2 Machine Translation Overview
  2.1 Statistical Machine Translation
    2.1.1 Pipeline
  2.2 Phrase-based Machine Translation
    2.2.1 Training the Phrase Table
    2.2.2 Evaluation with BLEU
  2.3 Word Alignments

3 Symmetric word alignment
  3.1 Introduction
  3.2 Training Method
  3.3 Data
  3.4 Experiments
    3.4.1 Machine Translation
    3.4.2 Alignments
  3.5 Conclusion

4 Using monolingual data to improve word alignments
  4.1 Introduction
  4.2 Proposed Method
    4.2.1 Extracting Semantically Similar Tokens
    4.2.2 Extracting Transliterations
  4.3 Data
  4.4 Experiments
    4.4.1 Machine Translation
    4.4.2 Alignments
  4.5 Conclusion

5 Borrowing from related languages
  5.1 Introduction
  5.2 Previous Works
    5.2.1 Previous Work on Borrowing Resources from Related Languages to Improve Machine Translation
    5.2.2 Previous Works on Translating Between Related Languages
  5.3 The Model
    5.3.1 Cipher Model
    5.3.2 LL Character-Based Language Model
    5.3.3 LL Word-Based Language Model
  5.4 Training
    5.4.1 Training the Cipher Model
    5.4.2 Training the Language Models
  5.5 Decoding
  5.6 Data
  5.7 Experiments
    5.7.1 Evaluating the Conversion Accuracy
    5.7.2 Machine Translation Experiments
  5.8 Conclusions

6 On-demand unsupervised neural machine translation
  6.1 Introduction
  6.2 Method
    6.2.1 Building a Dictionary
    6.2.2 Source to Translationese
    6.2.3 Translationese to Target
  6.3 Data and Parameters
  6.4 Experiments
  6.5 Conclusion

7 Conclusion

Reference List
List of Tables

2.1 An example phrase table for the Spanish phrase dejó caer el

3.1 Data split (#English tokens + #foreign tokens) for different languages

3.2 Machine translation experiments (BLEU). We improve the baseline for almost all languages

3.3 Word alignment experiments (alignment precision/recall/f-score). The proposed method improves the baseline in all cases

4.1 Data split and size of monolingual data (tokens) for different languages. For parallel data, size refers to the number of English plus foreign language tokens

4.2 Machine translation experiments (BLEU). For languages with less than 10M monolingual tokens (first five) we only use L_e; otherwise we use both lexicons L_e + L_f. This way we improve the baseline for almost all languages

4.3 Word alignment experiments (alignment precision/recall/f-score). The proposed method (L_e + L_f) improves the baseline in all cases

5.1 Size of monolingual and parallel data (number of sentences) available for each language. LL and RL are presented in pairs, LL first

5.2 BLEU scores for RL-to-LL translation of UDHR text. Format is BLEU4/BLEU1. Polish/Belorussian and Bosnian/Serbian have different orthographies, hence copying is not applicable

5.3 BLEU score of alignment-only experiments. We perform 12 experiments for each language that vary by the number of sentences of original LL/English parallel data (first column) and the number of sentences of parallel data converted from RL/English (second column). In each section, corresponding to a different size of original LL data, the first row is the baseline. Empty cells are due to inadequacy of parallel data

5.4 BLEU score of experiments for appending the converted parallel data to the original LL data. We perform 12 experiments for each language that vary by the number of sentences of original LL/English parallel data (first column) and the number of sentences of parallel data converted from RL/English (second column). In each section, corresponding to a different size of original LL data, the first row is the baseline. Empty cells are due to inadequacy of parallel data. Numbers in parentheses are cases where the alignment-only experiments (Table 5.3) outperform the current results

5.5 BLEU score of experiments for combining the phrase tables with different features. We perform 12 experiments for each language that vary by the number of sentences of original LL/English parallel data (first column) and the number of sentences of parallel data converted from RL/English (second column). In each section, corresponding to a different size of original LL data, the first row is the baseline. Empty cells are due to inadequacy of parallel data. Numbers in parentheses are cases where the best of the two previous experiments is better than the current results

6.1 Comparing translation results on newstest2014 for French, and newstest2016 for Russian, German, and Romanian, with previous unsupervised NMT methods. Kim et al. (2018) is the method closest to our work

6.2 Translation results on ten new languages: Czech, Spanish, Finnish, Dutch, Bulgarian, Danish, Indonesian, Polish, Portuguese, and Catalan
List of Figures

2.1 An example of phrase-based machine translation. The source sentence is split into phrases. Each phrase is translated and reordered into a target sentence

2.2 An example of word alignments

2.3 IBM Model 1 assigns the same probability to these two different translations of the same Spanish sentence

4.1 Word vectors trained on monolingual data are used to extract a bilingual lexicon out of translation tables. This lexicon is added to the parallel data, resulting in improved alignments for machine translation

5.1 The process used for training the cipher model and decoding RL text to LL

5.2 Part of the cipher model corresponding to reading LL character s from the start state. The same pattern repeats for any LL character. After reading s, the model goes to WFST1, WFST2, or WFST3 with respective probabilities α(s), β(s), or γ(s). In WFST1, the model produces each RL character t with probability p1(t|s). In WFST2, the model produces each two RL characters t and t' with probabilities p21(t|s) and p22(t'|s). In WFST3, the model reads each LL character s' and produces each RL character t with probability p3(t|ss'). From the last state of WFST1, WFST2, and WFST3, the model returns to the start state without reading or writing. The model has a loop on the start state that reads and writes space

5.3 Part of a 3-gram character-based language model on a language with alphabet {a, b, c}

5.4 Part of the word-based language model corresponding to generation of out-of-vocabulary (OOV) tokens. The OOV state can generate any sequence of characters, and flow returns to the null state after generating a space

5.5 First sentence of the first article of UDHR in Malaysian (mal) and Indonesian (ind). These languages have a different vocabulary, but their cognates (shown in bold) are exact matches

5.6 First sentence of the first article of UDHR in Dutch (dut), Afrikaans (afr), and its conversion from Dutch to Afrikaans using PM+2-gram LM (d2a), along with their translations to English
Abstract
We provide new tools and techniques for improving machine translation for low-
resource languages. Thanks to massive training data and powerful machine trans-
lation techniques, machine translation quality has reached acceptable levels for a
handful of languages. However, for hundreds of other languages, translation qual-
ity decreases quickly as the size of the available training data becomes smaller. For
languages with a few million or fewer tokens of translation data (called low-resource
languages in this dissertation), traditional machine translation technologies fail to
produce understandable translations into English. In this dissertation, we explore
various non-traditional sources for improving low-resource machine translation.

We introduce three approaches for improving low-resource machine transla-
tion: 1) adapting machine translation algorithms to the low-resource setting; 2) using
resources other than the traditionally used source/English translation pairs (called
parallel data) for training the system; 3) building massively-multilingual tools that
can be used out-of-the-box for any language to help machine translation.

We address these approaches for improving low-resource machine translation in
this dissertation as follows: 1) We present two methods to improve state-of-the-art
algorithms for word alignment (an essential step in traditional machine translation
systems). These methods are designed to work best when the size of the parallel data
is small. 2) We present a method for translating texts between related languages,
trained on monolingual data only. We use this method to borrow training data
from a related language to compensate for the lack of source/target parallel data. 3)
We propose a two-step pipeline for building a rapid neural MT system for any
language. This pipeline includes glossing the input into a pseudo-translation, and
translating the pseudo-translation into the target language using a model built in
advance from large parallel data coming from a set of high-resource languages.
Chapter 1
Introduction
1.1 Problem Description
Consider the scenario where a disaster has struck a distant area and a quick
response is required to aid survivors. A primary requirement in this situation is
the ability to communicate with the people in need, which relies on the ability to
translate between English and the language spoken in the area of the incident. For
this purpose we need to build a machine translation capability between the incident
language and English. This requirement has been observed in multiple recent incidents,
including the 2010 Haiti earthquake (Munro, 2013), the 2011 Japan earthquake (Varga
et al., 2013), and more (Imran et al., 2015; Rudra et al., 2015).
Statistical machine translation techniques require large sentence-aligned bilin-
gual corpora, often referred to as parallel data, to train translation patterns. These
patterns are then used for translating new texts. Studies (Koehn et al., 2003) show
that millions of sentences of parallel data are required for fluent translation. For
a few language pairs, such large corpora can be obtained through sources like
United Nations (6 official languages) or European Parliament (21 European lan-
guages) documents. Tens of other medium-resource languages have enough data
for an understandable but not fluent translation into English from sources like
OpenSubtitles (Lison and Tiedemann, 2016) or CommonCrawl (Smith et al., 2013).
Unfortunately, for thousands of other languages the amount of existing parallel
data is extremely limited (Lopez and Post, 2013) and state-of-the-art machine
translation tools fail to produce reasonable translations.
Almost all notable-size parallel corpora are generated for other purposes over
years of effort by thousands of people, and are then adopted by the machine translation
community for training systems. Generating large and reliable parallel data
from scratch is too expensive to be considered an option for improving machine
translation for low-resource languages. For instance, Germann (2001) estimated
a cost of $0.36 per word for Tamil-English professional-grade translation. This
adds up to millions of dollars for generating a one-million-sentence parallel corpus.
Consequently, other techniques and resources should be considered for machine
translation for low-resource languages.
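To make the cost concrete, a back-of-the-envelope calculation; the per-word rate is from Germann (2001), while the average sentence length of 20 words is an assumption for illustration only:

```python
cost_per_word = 0.36           # dollars per word, Tamil-English (Germann, 2001)
sentences = 1_000_000          # a one-million-sentence parallel corpus
words_per_sentence = 20        # assumed average, for illustration only

total_cost = cost_per_word * sentences * words_per_sentence
print(f"${total_cost:,.0f}")   # roughly $7,200,000
```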
1.2 Previous Works
Automatic translation of texts with a shortage of parallel data is a challenging problem.
In this section we review previous works in this area.
A large portion of the previous work on translating low-resource languages is
language-specific. Long-term projects such as AVENUE (Carbonell et al., 2002;
Lavie et al., 2003; Probst et al., 2002) and METIS-II (Carl et al., 2008; Van-
deghinste et al., 2006) started in the early 2000s. However, their methods are either
rule-based (as in AVENUE) or depend heavily on language-specific resources (as
in METIS-II). Other works have used orthography (Hermjakob, 2009), morpho-
logical analysis (De Gispert et al., 2006; Lee, 2004), syntactic constraints (Fossum
et al., 2008; Cherry and Lin, 2006; Toutanova et al., 2002) or a mixture of such
clues (Tiedemann, 2003) to improve machine translation.
These methods can help machine translation if enough language-specific resources
or experts are available. However, this is not the case for most low-resource lan-
guages. As a result, research should switch to language-independent techniques.

A large branch of language-independent works for improving low-resource machine
translation looks to utilize other resources besides parallel data. Monolingual
data (Koehn and Knight, 2002; Ravi and Knight, 2011b; Nuhn et al., 2012; Con-
neau et al., 2017), comparable data (Rapp, 1995; Fung and Yee, 1998; Smith et al.,
2010; Irvine and Callison-Burch, 2013), or both (Dou and Knight, 2012; Irvine,
2013) are used to increase the bilingual lexicon or create more parallel data. Dic-
tionaries are another resource for improving the lexical coverage (Okuma et al., 2008)
and adapting the domain (Wu et al., 2008) of machine translation systems. Pivot-
ing through one or multiple languages with which both the source and target language
have a large corpus of parallel data is another technique to improve machine trans-
lation (Mann and Yarowsky, 2001; Utiyama and Isahara, 2007; Wu and Wang,
2007; Johnson et al., 2016).

Unfortunately, many of these works have constraints that limit their application
to low-resource settings. For instance, pivoting requires parallel data between
the low-resource language and a third language; using dictionaries needs the non-
trivial process of dictionary cleaning; and methods that use monolingual data to
generate a bilingual lexicon often need large monolingual data for the low-resource
language, or require the monolingual data of English and the low-resource language
to come from the same domain. Moreover, other useful resources are still to be
explored. For instance, not much work has been done on using resources from
related languages, and even those works are very limited (Chapter 5).
1.3 This Dissertation
The research goal of this dissertation is to address low-resource machine translation
in a language-independent fashion. In order to improve machine translation in
scarcity of parallel data, we present methods to make better use of existing (even
though small) parallel data, as well as to make use of non-traditional data sources
that are accessible considering the constraints of low-resource settings.

Throughout this thesis, we assume that the low-resource language has a small
amount of parallel data with English (a couple of million IL+English tokens), almost no
parallel data with any other language, and some monolingual data (millions of tokens).
In what follows, we will briefly describe the contributions of this dissertation.
1.3.1 Symmetrizing word alignments
Correspondences between words in sentence pairs (i.e., word alignments) are inher-
ently symmetric. State-of-the-art word alignment methods (Brown et al., 1993)
fail to model this symmetry, as they train their parameters by maximizing the
conditional likelihood of source data given target data or vice versa.

In this thesis, we present a method to improve word alignments for machine
translation by considering the inherent symmetry of word alignments in the training
process. We propose to use a symmetric objective function by maximizing the
sum of conditional likelihoods. This method improves the alignment accuracy and
consequently results in better machine translation quality. Experiments on Ara-
bic/English, Chinese/English, and Farsi/English word alignment show an average
improvement of 6.0 in alignment f-score. Translation experiments on 16 languages
show an average of 0.9 BLEU score improvement over the baseline.
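As a minimal sketch of the symmetrization idea (not the actual training algorithm, which is detailed in Chapter 3), the objective sums the two directional IBM Model 1-style conditional log-likelihoods instead of maximizing either direction alone. The toy lexical tables below are invented for illustration:

```python
import math

def model1_loglik(src, tgt, t_table, floor=1e-9):
    """IBM Model 1-style log-likelihood of src given tgt (constants dropped):
    sum over j of log( (1/|tgt|) * sum over i of t(src_j | tgt_i) )."""
    return sum(
        math.log(sum(t_table.get((s, t), floor) for t in tgt) / len(tgt))
        for s in src)

def symmetric_objective(f, e, t_f_given_e, t_e_given_f):
    # Maximize the SUM of the two conditional log-likelihoods, making the
    # training objective symmetric in the two languages.
    return model1_loglik(f, e, t_f_given_e) + model1_loglik(e, f, t_e_given_f)

# Invented toy lexical tables for each direction.
t_fe = {("gato", "cat"): 0.9, ("negro", "black"): 0.8}
t_ef = {("cat", "gato"): 0.9, ("black", "negro"): 0.8}

good = symmetric_objective(["gato", "negro"], ["cat", "black"], t_fe, t_ef)
bad = symmetric_objective(["gato", "negro"], ["dog", "red"], t_fe, t_ef)
print(good > bad)  # the correctly paired sentences score higher
```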
Related publication:
“Aligning English Strings with Abstract Meaning Representation Graphs”
(N. Pourdamghani, Y. Gao, U. Hermjakob, K. Knight), Proc. Empirical
Methods in Natural Language Processing (EMNLP), 2014.
1.3.2 Using semantic information in monolingual data to
improve word alignments
Finding a correspondence between words in sentence pairs of a training corpus is
the first step in training traditional machine translation systems. These correspon-
dences, known as word alignments, are essential for extracting translation patterns
from the data.

In this thesis, we introduce a method to use monolingual data to improve
word alignments for low-resource languages. This method is based on encouraging
common alignment links between semantically similar tokens. Similarity of tokens
is estimated using word vectors trained on monolingual data. The method improves
the alignment accuracy and consequently results in better machine translation
quality.

We show that this method works well even when the monolingual data for the
low-resource language is small or unreliable. Experiments on Arabic/English, Chi-
nese/English, and Farsi/English word alignment show an average improvement of 0.8
in alignment f-score. Translation experiments on 15 languages show an aver-
age of 0.4 BLEU score improvement over the baseline.
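The token-similarity estimate can be sketched with cosine similarity over word vectors. The 3-dimensional vectors and the threshold below are invented stand-ins for vectors trained on monolingual data:

```python
import math

# Invented toy vectors; real ones would be trained on monolingual data.
vectors = {
    "cat":    [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "treaty": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def semantically_similar(word, threshold=0.9):
    """Tokens whose vectors are within a cosine threshold of `word`'s vector."""
    return [w for w in vectors
            if w != word and cosine(vectors[word], vectors[w]) >= threshold]

print(semantically_similar("cat"))  # -> ['kitten']
```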
Related publication:

“Using word vectors to improve word alignments for low-resource machine
translation” (N. Pourdamghani, M. Ghazvininejad, K. Knight), Proc. North
American Chapter of the Association for Computational Linguistics (NAACL),
2018.
1.3.3 Using resources from related languages
A less-explored source of data for improving machine translation for low-resource
languages is borrowing parallel data from related languages. To borrow parallel
data from a Related Language (RL), we need to translate the RL side of RL/English
parallel data into the low-resource language. This is a hard task, because related
languages might have different orthographies, and translating texts from RL to the
low-resource language without any parallel data between them is not trivial. All
the previous works in this area assume the same orthography and the existence of
some bilingual data between RL and the low-resource language.
In this thesis, we present a method to translate texts between closely related
languages without needing any parallel data. This method is based on training
a character-based cipher model capable of 1-1, 1-2, and 2-1 mappings between
characters of related languages. This cipher model is then combined with a word-
based language model for decoding. Experiments on translating texts between 6
related languages show an average BLEU-1 score of 41.4, a big improvement
over letter substitution (32.8) and simple copying (19.8).
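The 1-1, 1-2, and 2-1 character mappings can be illustrated with a Viterbi-style dynamic program that scores the best segmentation of a character-sequence pair into these operations. The probability tables are invented toys; the real cipher model is a weighted finite-state cascade described in Chapter 5:

```python
import math

# Toy channel probabilities for mapping LL characters to RL characters.
# p1: 1-1, p2: 1-2 (one LL char emits two RL chars), p3: 2-1 (two LL chars
# emit one RL char). All numbers are invented for illustration.
p1 = {("s", "s"): 0.9, ("a", "a"): 0.9}
p2 = {("c", "ch"): 0.7}
p3 = {("aa", "a"): 0.6}

def best_mapping_score(ll, rl):
    """Log-probability of the best segmentation of (ll, rl) into 1-1, 1-2,
    and 2-1 mapping operations, or -inf if no segmentation exists."""
    NEG = float("-inf")
    # score[i][j] = best log-prob of mapping ll[:i] to rl[:j]
    score = [[NEG] * (len(rl) + 1) for _ in range(len(ll) + 1)]
    score[0][0] = 0.0
    for i in range(len(ll) + 1):
        for j in range(len(rl) + 1):
            if score[i][j] == NEG:
                continue
            if i < len(ll) and j < len(rl) and (ll[i], rl[j]) in p1:
                cand = score[i][j] + math.log(p1[(ll[i], rl[j])])
                score[i + 1][j + 1] = max(score[i + 1][j + 1], cand)
            if i < len(ll) and j + 1 < len(rl) and (ll[i], rl[j:j + 2]) in p2:
                cand = score[i][j] + math.log(p2[(ll[i], rl[j:j + 2])])
                score[i + 1][j + 2] = max(score[i + 1][j + 2], cand)
            if i + 1 < len(ll) and j < len(rl) and (ll[i:i + 2], rl[j]) in p3:
                cand = score[i][j] + math.log(p3[(ll[i:i + 2], rl[j])])
                score[i + 2][j + 1] = max(score[i + 2][j + 1], cand)
    return score[len(ll)][len(rl)]

print(best_mapping_score("sca", "scha"))  # uses the 1-1, 1-2, 1-1 operations
```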
We use this method for translating between related languages to convert parallel
data from higher-resource, related languages to low-resource languages. The con-
verted parallel data is appended to the existing data for the low-resource language
to train a better machine translation system. We show that the extra parallel data
helps machine translation by improving both word alignments and phrase table
construction.
We also propose a better way of combining the converted parallel data with
the original data (compared to appending), and show that this method results in
even more BLEU score improvement. Experiments on translating 6 low-resource
languages into English, using one related language for each, show consistent BLEU
score improvements for all languages.
Related publications:

“Deciphering Related Languages” (N. Pourdamghani, K. Knight), Proc. Empir-
ical Methods in Natural Language Processing (EMNLP), 2017.

“Neighbors Helping the Poor: Improving low-resource Machine Translation
Using Related Languages” (N. Pourdamghani, K. Knight), Machine Trans-
lation (MT) journal.
1.3.4 On-demand unsupervised neural machine translation
Given a rough, word-by-word gloss of a source language sentence, target language
natives can provide a fluent translation of the input. In this thesis we explore this
intuition by breaking translation into a two-step process: generating a rough gloss
by means of a dictionary, and then ‘translating’ the resulting pseudo-translation
into a fully fluent translation.

The pseudo-translation-to-target translator is built once from large parallel
data that has the target language in common. This parallel data comes from a
diverse set of high-resource languages. Having this pre-trained model at hand,
for any source language we can build dictionaries on demand using unsupervised
techniques, convert source sentences into pseudo-translations, and translate the
pseudo-translations into the target language. This way we produce rapidly generated
unsupervised neural MT systems for many source languages.
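A minimal sketch of the two-step idea, with a hypothetical two-entry dictionary and a trivial stand-in for the pre-trained translationese-to-target model (which in reality is a neural model trained on multilingual parallel data):

```python
# Hypothetical dictionary entries; the real dictionaries are induced on
# demand with unsupervised techniques.
gloss_dict = {"gato": "cat", "negro": "black"}

def gloss(source_sentence):
    """Step 1: word-by-word gloss into a pseudo-translation (OOV tokens copied)."""
    return [gloss_dict.get(tok, tok) for tok in source_sentence.split()]

def translationese_to_target(tokens):
    """Step 2: stand-in for the pre-trained fluency model; here it merely
    reverses the tokens to mimic Spanish-to-English adjective reordering."""
    return " ".join(reversed(tokens))

print(translationese_to_target(gloss("gato negro")))  # -> black cat
```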
We apply this process to 14 test languages, obtaining better or comparable
translation results on high-resource languages than previously published unsuper-
vised MT studies, and obtaining good quality results for low-resource languages
that have never been used in an unsupervised MT scenario.

Related publication:

“Translating Translationese: A Two-Step Approach to Unsupervised Machine
Translation” (N. Pourdamghani, N. Aldarrab, M. Ghazvininejad, K. Knight,
J. May), submitted to Proc. North American Chapter of the Association for
Computational Linguistics (NAACL), 2019.
1.4 Contributions
- We present a method to improve word alignments by considering the inherent
symmetry of word alignments in the training process. Translation experi-
ments on 16 languages show an average of 0.9 BLEU score improvement over
the baseline.

- We introduce a method to use semantic information in monolingual data
to improve word alignments for low-resource languages. Translation experi-
ments on 15 languages show an average of 0.4 BLEU score improvement over
the baseline.

- We present a method to translate texts between closely related languages
with no parallel data. Experiments on 6 pairs of related languages show an
average 8.6 BLEU-1 improvement over using letter substitution for transla-
tion.

- We convert parallel data from higher-resource, related languages to be used
as extra parallel data for low-resource machine translation. Experiments on
translating 6 low-resource languages into English, using one related language
for each, show consistent BLEU score improvements for all languages. The
average improvement is 8.4 (6.2) BLEU points when the baseline is trained on
50K (100K) sentences of parallel data.

- We propose a two-step pipeline for building a rapid unsupervised neural
machine translation system for any language. Translating a new language
using this pipeline only requires a comprehensive source-to-target dictionary.
We show how to easily obtain such a dictionary using off-the-shelf tools.
We apply this process to 14 test languages, obtaining better or compara-
ble translation results on languages in common with previously published
unsupervised MT studies.
Chapter 2
Machine Translation Overview
In this chapter, we provide a brief overview of Statistical Machine Translation
(SMT) technologies, with a focus on Phrase-Based Machine Translation (PBMT)
technology, as it is the underlying machine translation technology that we use for
most of this thesis. The goal of this chapter is to quickly familiarize the reader with
the components of the SMT pipeline, especially those explored in this dissertation. A
more complete presentation of SMT systems can be found in the textbook by Koehn
(2009).
2.1 Statistical Machine Translation
Statistical machine translation was first introduced in the late 1980s and early 1990s by
the works of IBM researchers (Brown et al., 1988, 1993) as a series of probabilistic
models trained on sentence-aligned bilingual parallel data consisting of pairs of
source and target sentence translations. They formulated machine translation as
the problem of finding the best sentence in the target language (named t) that is
a translation of a sentence in the source language (named s). This problem can be
stated as follows:

    t* = argmax_t p(t|s) = argmax_t p(s|t) p(t)

where the decomposition of the formula into two terms is based on Bayes’ rule.
The first term of the decomposition is a target-to-source translation model.
This model encodes the adequacy of the translation; in other words, it measures
the amount of content that is correctly translated. This model should assign higher
probabilities to correct and complete translations. For instance, p(gato|cat) >
p(gato|dog) and p(gato negro|black cat) > p(gato negro|cat).

The second term of the decomposition is a target language model (Chen and
Goodman, 1996), which captures the fluency of the translated text. This model
states whether t is a correct target sentence. It should assign higher probabilities to
more likely target sentences. For instance, p(a black cat) > p(a black cats) and
p(a black cat) > p(a fresh cat).

Splitting the translation problem into adequacy and fluency components has
two main benefits. It allows the fluency component (i.e., the language model) to
be trained confidently on relatively cheap target monolingual data (compared to
parallel data). More importantly, it adds room for error in the translation model,
as the language model can compensate for some errors. Consequently, the p(s|t)
model can be constructed more loosely.
All traditional statistical machine translation systems can be split into a translation model and a language model. Training the language model is similar regardless of the machine translation technology; the differences between technologies come from their translation models. In the simplest form, a monotone word-by-word machine translation system defines the translation model as:
p(s|t) ≈ ∏_{i=1}^{n} p(s_i | t_i)

where s_i and t_i are the i-th tokens of the source and target sentences.
2.1.1 Pipeline
Traditional statistical machine translation systems are constructed in three different steps:
Training: In this step, a parallel training corpus is used to train the parameters of the translation model.
Tuning: In this step, a decoding algorithm is trained to translate source sentences into the target language. The decoder works with features derived from the parameters of the translation model, features from the language model, and other features such as sentence length. Weights of these features are trained in the tuning step using a held-out parallel tuning corpus.
Test: In this step, the tuned decoder is used to translate query source sentences. In particular, the decoder is used to translate a separate test set, and the output of the system is evaluated against reference translations.
In this thesis, we only design algorithms to improve the training step. For
tuning the phrase-based machine translation system we always use the MERT
algorithm (Och, 2003), and evaluation is always done with BLEU score (Papineni
et al., 2002). We defer the details of the decoding algorithm, and different tuning
and evaluation methods to Koehn (2009). We discuss training of phrase-based
machine translation in the following section, and describe the BLEU evaluation
metric in Section 2.2.2.
2.2 Phrase-based Machine Translation
Consider Figure 2.1, which illustrates how phrase-based machine translation works. The source sentence (here Spanish) is first segmented into multi-word units called phrases. Then, each phrase is translated into an English phrase. Finally, translated phrases are reordered to form the English sentence.
[el gato negro | dejó caer el | jarrón de vidrio → the black cat | dropped the | glass vase]
Figure 2.1: An example of phrase-based machine translation. The source sentence is split into phrases. Each phrase is translated and reordered into a target sentence.
The translation model of phrase-based machine translation is composed of a phrase translation component and a reordering component. The phrase translation component is a probabilistic model that captures the potential translations of each target phrase along with their probabilities. We write this model as φ(s̄_i | t̄_j), where s̄ and t̄ are source and target phrases. It is presented as a phrase translation table, or phrase table for short. For instance, the phrase table for the Spanish phrase dejó caer el may look like the following:
Translation        Probability (t̄|s̄)
dropped            0.40
dropped the        0.35
he dropped         0.15
fell               0.05
he dropped the     0.05
Table 2.1: An example phrase table for the Spanish phrase dejó caer el.
The reordering component works on the translations of source phrases. In its simplest form, it captures the distance between the positions of the translations of phrase i and phrase i+1 in the translated sentence. The distance between the positions is defined as the distance between the last word of the translation of phrase i and the first word of the translation of phrase i+1. We write this model as d(start_{i+1} - end_i - 1).
Hence, in phrase-based machine translation the best target translation t* for a source sentence s is defined as:

t* = argmax_t p(t|s) = argmax_t p(s|t) p(t)
where p(t) is the language model score, and the translation model is defined as:

p(s|t) ≈ Σ_{N=1}^{n} Σ_{split(s,N)} ∏_{i=1}^{N} φ(s̄_i | t̄_i) · d(start_{i+1} - end_i - 1)

The sums are over N, the number of phrases the source sentence is split into, and over all the possible ways to split the source sentence into N phrases.
Reordering is handled by a predefined function. Movement of phrases over large distances should be more costly than short distances or no movement at all. A simple reordering model that satisfies this property is a decaying exponential function d(x) = α^|x| for some α < 1. Training of the phrase table is described in the next section.
2.2.1 Training the Phrase Table
A phrase table is trained from word-aligned parallel data. We will talk more about word alignments in Section 2.3.
Consider the example sentences in Figure 2.1. Figure 2.2 shows word alignments for this sentence pair.

[Alignment links between "el gato negro dejó caer el jarrón de vidrio" and "the black cat dropped the glass vase".]
Figure 2.2: An example of word alignments.
Intuitively, one might extract phrase pairs like (gato negro, black cat), (el, the), and (jarrón de vidrio, glass vase) from this sentence pair given the alignments. However, shorter and longer phrase pairs can also be helpful for machine translation. Shorter phrases occur more frequently, and they will more often be applicable to unseen sentences. Longer phrases capture more local context and can be used to translate longer chunks of text. Hence, phrase pairs like (gato, cat), (el gato, the cat), and (el gato negro dejó caer el, the black cat dropped the) should also be extracted from this sentence pair.
Following the example above, all the phrase pairs consistent with an alignment are extracted from each sentence pair. We call a phrase pair (s̄, t̄) consistent with an alignment A if and only if all tokens s_1, ..., s_n in phrase s̄ that have alignment points in A have their alignments with tokens t_1, ..., t_m in phrase t̄, and vice versa.
In order to train the phrase table, all the consistent phrase pairs are extracted from the training corpus. The probabilities of the phrase table are computed as:

φ(s̄|t̄) = count(s̄, t̄) / Σ_{s̄'} count(s̄', t̄)
2.2.2 Evaluation with BLEU
In this thesis, we evaluate all the machine translation systems using the standard machine translation evaluation metric BLEU (Papineni et al., 2002). BLEU compares a set of output translations against a set of reference translations and assigns a score between 0 and 100 to the system's output. The higher the score, the closer the predicted translations are to the reference sentences.
BLEU measures how many word unigrams, bigrams, trigrams, and four-grams in each output translation can be matched against the reference translation. These measurements estimate the precision of different n-grams of the output set. BLEU combines these precisions in a geometric average:

BLEU = BP · exp( Σ_{n=1}^{4} w_n log p_n )

where BP is a brevity penalty, p_n is the n-gram precision, and all weights w_n are typically set to 1/4.
Despite existing criticisms of the credibility of BLEU, this metric has been frequently reported to correlate well with human judgment (Denoual and Lepage, 2005; Callison-Burch et al., 2006). A system with a BLEU score above 40 can be considered a very good translation system. A BLEU score between 20 and 40 indicates a relatively good translation that still contains obvious translation errors. Scores under 20 correspond to translations that are not good, and a system with a BLEU score under 10 often does not produce understandable translations. BLEU scores can vary a lot across different test sets. Nevertheless, on the same test set, a few tenths of a point of improvement is significant.
2.3 Word Alignments
The goal of word alignment is finding a correspondence between the words of source/target sentence pairs in the training corpus. An example of word alignments for a sentence pair is presented in Figure 2.2 in the previous section. IBM models (Brown et al., 1993; Koehn, 2009) are the most commonly used methods for finding word alignments. They are based on a generative story that describes how a target sentence t = (t_1, ..., t_n) can be generated from a source sentence s = (s_1, ..., s_m) using a set of alignments A = ((i_1, j_1), (i_2, j_2), ...).
IBM model 1, the first model of this series, bases its generative story on a lexical translation probability distribution. Each target word t_i is generated from the source word s_j with probability t(t_i|s_j). Hence, the probability of generating a target sentence t from a source sentence s with respect to the alignment A can be defined as:

p(t, A|s) ≈ ε ∏_{j=1}^{n} t(t_j | s_{A(j)})

where ε is a normalization factor based on the lengths of s and t.
The generative model of IBM model 1 does not have an explicit model for
the order of generation. For instance, it assigns the same probability to the two
different translations of the same source sentence in Figure 2.3.
[Two alignments of "el gato negro": to "the black cat" and to "black cat the".]
Figure 2.3: IBM model 1 assigns the same probability to these two different translations of the same Spanish sentence.
To solve this problem, IBM model 2 includes the positions of the input and output tokens in the generative model by adding a distortion probability distribution a(i|j):

p(t, A|s) ≈ ε ∏_{j=1}^{n} t(t_j | s_{A(j)}) a(A(j) | j)
The HMM model (Vogel et al., 1996) and further IBM models gradually add more elements to the model to improve the generative story. Specifically, the HMM model replaces absolute positions with relative ones, and IBM model 3 introduces fertility, the number of target words a single source word can generate.
In this thesis, Chapters 3 and 4 directly work with the lexical translation probabilities (referred to as the t-table) and the distortion probability distribution (referred to as the a-table).
Parameters of the IBM models are trained using the Expectation Maximization (EM) algorithm (Dempster et al., 1977). In the E step of the EM algorithm, the parameters of the model (denoted θ), such as the t-table and a-table probabilities, are fixed and the probability distribution over alignments p(A|t,s;θ) is computed. In the M step, the alignment probabilities are fixed and the parameters are updated.
For practical reasons, IBM models are often trained incrementally, so that first
IBM model 1 is trained using a fixed number of EM iterations, then the trained
parameters of model 1 are used to initialize a higher model such as model 2, and
so on.
Once the model parameters are trained, alignment probabilities p(A|t,s;θ) are computed for each sentence pair, and the Viterbi algorithm (Viterbi, 1967) is used to find the best alignment for each sentence pair.
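The E/M loop for IBM model 1 is compact enough to sketch end to end. This is a minimal illustration, not the production implementation (the thesis uses GIZA++/MGIZA++ for real training):

```python
from collections import defaultdict

def ibm1_em(corpus, iterations=10):
    """EM training of IBM model 1 lexical probabilities t(t|s).
    corpus: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (target_word, source_word) -> t(t|s)."""
    src_vocab = {w for s, _ in corpus for w in s}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for s_sent, t_sent in corpus:
            for tw in t_sent:                       # E step: expected counts
                z = sum(t[(tw, sw)] for sw in s_sent)
                for sw in s_sent:
                    c = t[(tw, sw)] / z
                    count[(tw, sw)] += c
                    total[sw] += c
        for (tw, sw), c in count.items():           # M step: renormalize
            t[(tw, sw)] = c / total[sw]
    return t

corpus = [(["el", "gato"], ["the", "cat"]),
          (["el", "perro"], ["the", "dog"])]
t = ibm1_em(corpus)
```

On this two-sentence corpus, EM resolves the ambiguity of the first pair using the second: the mass for "the" concentrates on "el", and "cat" on "gato".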
Chapter 3
Symmetric word alignment
3.1 Introduction
In this chapter, we propose a method to improve word alignments by noticing that
alignment is a symmetric relation between words of source and target sentence
pairs. We modify the process of training the parameters of the alignment models
to take this symmetry into consideration.
As mentioned in Chapters 2 and 4, word alignment of the training corpus is
an essential step in machine translation. Machine translation technologies require
word alignments to identify translation patterns in the training data and extract
rules from them. In low-resource settings, where many of the translation patterns
are only observed a few times, improving word alignments can highly affect the
accuracy of translation rules and consequently machine translation quality.
In Chapter 4 we proposed a method to use semantic information extracted
from monolingual data to improve word alignments. In this chapter, we improve
word alignments without requiring any extra data, but by noticing that the word
alignments are inherently symmetric. In other words, aligning a source token s to
a target token t is the same as aligning t to s. We propose a method to model this
symmetry during the training process of the alignment model.
[1] The work is described in: "Aligning English Strings with Abstract Meaning Representation Graphs" (N. Pourdamghani, Y. Gao, U. Hermjakob, K. Knight), Proc. Empirical Methods in Natural Language Processing (EMNLP), 2014.
State-of-the-art word alignment methods (Brown et al., 1993) fail to model this symmetry, as they train their parameters by maximizing the conditional likelihood of the source data given the target data, or vice versa. In this chapter, we propose a symmetric objective function that maximizes the sum of the conditional likelihoods. Previously, Liang et al. (2006) also presented a symmetric method for training alignment parameters. However, their constraint is on agreement between predicted alignments, while we directly focus on agreement between the parameters themselves. Moreover, their method involves a modification of the E step of the EM algorithm, which is very hard to implement for IBM models 3 and above.
Our proposed method improves alignment accuracy and consequently results in better machine translation quality. Experiments on Arabic/English, Chinese/English, and Farsi/English word alignment show an average improvement of 6.0 points in alignment f-score. Translation experiments on 16 languages show an average BLEU score improvement of 0.9 over the baseline.
3.2 Training Method
Our training method is based on the IBM word alignment models (Brown et al., 1993). We modify the objective functions of the IBM models to encourage agreement between the learned parameters in the source-to-target and target-to-source directions of EM. The solution of this objective function can be approximated in an extremely simple way that requires almost no extra coding effort.
Assume that we have a set of source/target sentence pairs {(S, T)}. According to the IBM models, T is generated from S through a generative story based on a set of parameters. For instance, in IBM model 2, these parameters are the translation table t(t|s) and the distortion table a(i|j).
The IBM models estimate these parameters to maximize the conditional likelihood of the data: θ_{T|S} = argmax L_{T|S}(T|S) or θ_{S|T} = argmax L_{S|T}(S|T), where θ denotes the set of parameters. The conditional likelihood is intrinsic to the generative story of the IBM models. However, word alignment is a symmetric problem. Hence it is more reasonable to estimate the parameters in a more symmetric manner. Our objective function in the training phase is:

θ_{S|T}, θ_{T|S} = argmax L_{S|T}(S|T) + L_{T|S}(T|S)
subject to  θ_{S|T} θ_T = θ_{T|S} θ_S = θ_{S,T}

We approximate the solution of this objective function with almost no change to the existing implementation of the IBM models. We relax the constraint to θ_{S|T} ∝ θ_{T|S}, then apply the following iterative process:
1. Optimize the first part of the objective function, θ_{S|T} = argmax L_{S|T}(S|T), using EM.
2. Satisfy the constraint: set θ_{T|S} ∝ θ_{S|T}.
3. Optimize the second part of the objective function, θ_{T|S} = argmax L_{T|S}(T|S), using EM.
4. Satisfy the constraint: set θ_{S|T} ∝ θ_{T|S}.
5. Iterate.
Note that steps 1 and 3 are nothing more than running the IBM models, and steps 2 and 4 are just initialization of the EM parameters, using tables from the previous iteration. The initialization steps only make sense for the parameters that involve both sides of the alignment (i.e., the translation table and the distortion table). For the translation table, we set t_{T|S}(t|s) = t_{S|T}(s|t) for target and source tokens t and s and then normalize the translation table. The distortion table can also be initialized in a similar manner. We initialize the fertility table with its value from the previous iteration.
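The initialization in steps 2 and 4 amounts to transposing and renormalizing the t-table. A minimal sketch, assuming toy dict-based tables (in practice this initializes MGIZA++'s tables between EM runs):

```python
from collections import defaultdict

def symmetrize(t_fwd):
    """Build the reverse-direction t-table t(f|e) proportionally to the
    forward table t(e|f), then renormalize so each conditional
    distribution sums to one.  t_fwd maps (e, f) -> t(e|f)."""
    t_rev = {(f, e): p for (e, f), p in t_fwd.items()}  # swap conditioning
    totals = defaultdict(float)
    for (f, e), p in t_rev.items():
        totals[e] += p
    return {(f, e): p / totals[e] for (f, e), p in t_rev.items()}

# Toy forward table t(e|f): each foreign word's distribution sums to 1.
t_fwd = {("the", "el"): 0.9, ("cat", "el"): 0.1,
         ("cat", "gato"): 0.8, ("the", "gato"): 0.2}
t_rev = symmetrize(t_fwd)
```

After the swap, t_rev(el|the) ∝ t_fwd(the|el), and renormalization restores proper conditional distributions before the next EM run in the opposite direction.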
3.3 Data
We use data from 16 languages for our machine translation experiments.[2] These languages include Amharic, Arabic, Bengali, Mandarin, Farsi, Hausa, Somali, Spanish, Tamil, Thai, Tigrinya, Turkish, Uighur, Urdu, Uzbek, and Yoruba. Table 3.1 shows the size of the training, development, and test data for each language. In addition, we use hand-aligned data[3] for Arabic/English (77.3K+119.5K tokens), Chinese/English (240.2K+305.2K tokens), and Farsi/English (0.9K+0.8K tokens) for word alignment experiments. We lowercase and tokenize all data using Moses (Koehn et al., 2007) scripts.
3.4 Experiments
3.4.1 Machine Translation
We perform end-to-end machine translation experiments on 16 different languages described in Section 3.3. We use MGIZA++ (Gao and Vogel, 2008) as the implementation of the IBM models, and Moses (Koehn et al., 2007) to train and decode
phrase-based machine translation (PBMT) systems.

[2] LDC2015E13, LDC2015E14, LDC2015E83, LDC2015E84, LDC2016E57, and LDC2016E86 to LDC2016E105
[3] LDC2012E51, LDC2012E24, (Pilevar et al., 2011)

      train   dev.   test
amh   2.1M   39.8K  19.5K
ara   3.8M   39.1K  19.8K
ben   0.9M   41.9K  21.0K
cmn  10.6M   41.7K  20.5K
fas   4.3M   47.7K  24.2K
hau   2.1M   48.0K  24.1K
som   2.8M   46.8K  23.5K
spa  24.1M   49.4K  24.3K
tam   0.5M   39.0K  11.4K
tha   0.7M   39.1K  23.1K
tir   0.5M   42.0K  20.7K
tur   4.1M   40.2K  19.9K
uig   5.2M    8.6K   4.3K
urd   1.1M   46.7K  23.2K
uzb   4.2M   42.5K  21.7K
yor   2.1M   47.9K  24.5K
Table 3.1: Data split (#English tokens + #foreign tokens) for different languages.

The parallel data is stemmed to the first 4 characters for training the alignments but not for the PBMT system.
We run Model 1 and HMM for 5 iterations each, then run our training algorithm
on Model 4 for 4 iterations. For the baseline, we use MGIZA++ and run each of
IBM Models 1, HMM and Model 4 for 5 iterations. Table 3.2 shows the BLEU
scores of running the experiments.
Our proposed method improves the baseline for all the languages except Amharic and Spanish. On average, we improve the BLEU score by 0.9 points per language.
3.4.2 Alignments
In addition, we perform word alignment experiments on Arabic/English, Chinese/English, and Farsi/English, for which we have access to gold alignment data (Section 3.3). We append the test sentences to the existing parallel training data
      baseline   proposed method   improve
amh    11.5          11.2           -0.3
ara    18.2          19.7            1.5
ben     8.1           8.8            0.7
cmn    12.5          12.6            0.1
fas    19.2          19.2            0.0
hau    19.4          20.1            0.7
som    18.7          19.2            0.5
spa    40.0          39.9           -0.1
tam    19.2          20.0            0.8
tha    20.3          21.9            1.6
tir    11.1          12.1            1.0
tur    14.7          15.0            0.3
uig    12.8          17.7            4.9
urd    15.8          15.9            0.1
uzb    13.2          14.2            1.0
yor    14.2          15.1            0.9
Table 3.2: Machine translation experiments (BLEU). We improve the baseline for almost all languages.
for each language (Table 3.1) and use it to get the alignments. Baseline and proposed methods are defined as in the machine translation experiments above (Section 3.4.1). Table 3.3 presents the precision, recall, and f-score of the alignments
compared to the gold alignments.
      baseline          proposed method
ara   63.1/58.1/60.5    80.0/55.3/65.4
cmn   66.5/61.6/63.9    76.0/59.3/66.6
fas   52.7/66.7/58.9    72.1/66.8/69.3
Table 3.3: Word alignment experiments (alignment precision/recall/f-score). The proposed method improves the baseline f-score in all cases.
The proposed method greatly increases precision and slightly decreases recall in all the experiments. On average, the proposed method gets a 6.0-point improvement in f-score.
3.5 Conclusion
In this chapter, we presented a method for improving word alignments by considering the inherent symmetry of the alignments in the training process. The method is very simple to implement, yet very effective. In end-to-end machine translation experiments on translating 16 languages into English, the proposed method gets an average BLEU score improvement of 0.9.
Chapter 4
Using monolingual data to improve word alignments
4.1 Introduction
Word alignments are essential for phrase-based and syntax-based machine translation systems. The most widely used word alignment method (Brown et al., 1993) works by estimating the parameters of the IBM models from training data using the Expectation Maximization (EM) algorithm (see Chapter 2 for more details). However, EM works poorly for low-frequency words, as they do not appear often enough in the training data for confident parameter estimation. This problem is even worse in low-resource settings, where a large portion of word types appear infrequently in the parallel data. In this chapter, we improve word alignments, and consequently machine translation in low-resource settings, by improving the alignments of infrequent tokens.
Works that deal with the rare-word problem in word alignment include those that alter the probability distributions of the IBM models' parameters by adding prior distributions (Vaswani et al., 2012; Mermer and Saraçlar, 2011), smoothing the probabilities (Moore, 2004; Zhang and Chiang, 2014; Van Bui and Le, 2016), or introducing symmetrization (Liang et al., 2006; Pourdamghani et al., 2014). These works, although effective, rely merely on information extracted from the parallel data. Another branch adds linguistic knowledge such as word stems, orthography (Hermjakob, 2009), morphological analysis (De Gispert et al., 2006; Lee, 2004), syntactic constraints (Fossum et al., 2008; Cherry and Lin, 2006; Toutanova et al., 2002), or a mixture of such clues (Tiedemann, 2003). These methods need language-specific knowledge or tools, like morphological analyzers or syntax parsers, that are costly and time-consuming to obtain for any given language.

[1] The work is described in: "Using word vectors to improve word alignments for low-resource machine translation" (N. Pourdamghani, M. Ghazvininejad, K. Knight), Proc. North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
A less explored branch that can help align rare words is adding semantic information. The motivation behind this branch is simple: words with similar meanings should have similar translations. Previously, Ma et al. (2011) clustered words using monolingual data and substituted each word with its cluster representative to get alignments. They then duplicated their parallel data and used both regular alignments and alignments on word classes for training MT. Kočiský et al. (2014) simultaneously learn alignments and word representations from bilingual data. Their method does not benefit from monolingual data and requires large parallel data for training. Songyot and Chiang (2014) define a word-similarity model that can be trained from monolingual data using a feed-forward neural network, and alter the implementation of the IBM models in Giza++ (Och and Ney, 2003) to use the word similarity inside their EM. They require large monolingual data for both the source language and English. While English monolingual data is abundant, availability of large and reliable monolingual data for many low-resource languages is not guaranteed.
All these previous works define their own word similarity models. These models, similar to the more widely used distributed word representation methods (Mikolov et al., 2013; Pennington et al., 2014), assign high similarity to substitutable words in a given context; however, substitutability does not always imply synonymy. For instance, tea and coffee, or Pakistan and Afghanistan, will be similar in these models but do not share translations.
In this chapter, we propose a simple method to use off-the-shelf distributed representation methods to improve word alignments for low-resource machine translation (Section 4.2). Our model is based on encouraging common alignment links between semantically similar words. We do this by extracting a bilingual lexicon as a subset of the translation tables trained by the IBM models and adding it to the parallel data. For instance, the rare word obliterated and its semantically similar word destroyed have a common entry, destruida, in the English/Spanish translation table. We add a new (obliterated, destruida) pair to the parallel data to encourage aligning obliterated to destruida. By limiting the effect of a semantic neighbor to only the common alignment links, we greatly reduce the influence of those neighbors that are merely good substitutes but not synonyms.
The simplicity of our method makes it easy to use widely. Our work addresses a major problem of previous works, which is taking substitutability for synonymy without discrimination. Finally, the lexicon can be extracted either with or without the help of word vectors trained on foreign-language monolingual data. Large and reliable foreign monolingual data can help our alignments, but we still get good improvements over the baseline for languages with small monolingual data, where we only use English word vectors (Section 4.4).
We test our method on both alignment f-score and machine translation BLEU (Section 4.4). Alignment accuracy is tested on Arabic-English, Chinese-English, and Farsi-English gold alignments. Machine translation accuracy is tested on fifteen languages, where we show a consistent BLEU score improvement.
[Pipeline diagram: E||F parallel data → IBM models → t-tables p(f|e), p(e|f) → extract lexicon using E and F word vectors → E||F lexicon appended to the parallel data → IBM models → MT.]
Figure 4.1: Word vectors trained on monolingual data are used to extract a bilingual lexicon out of the translation tables. This lexicon is added to the parallel data, resulting in improved alignments for machine translation.
4.2 Proposed Method
We improve the alignment of rare words by encouraging them to align to what their semantic neighbors align to. However, we should be careful in this process. Distributed word representation methods (e.g., Mikolov et al. (2013); Pennington et al. (2014)) often define word similarity as the ability to substitute one word for another in a given context. This does not always imply that the words have the same translations. Multiple reasons contribute to this problem. First, word vectors are noisy, especially when monolingual data is small. Second, some words might have multiple meanings, and a semantically similar word might share only part of these meanings. Finally, some words do not have synonyms, especially proper names. Word vectors often group such entities together as they are substitutable, but this similarity should not be used for alignments.
We bring a simple three-fold solution to these problems. First, we split the use of English and foreign word vectors in the method, so that if foreign monolingual data is small or unreliable, we can fall back to only using English word vectors. Second, and more importantly, we limit the effect of a semantic neighbor on the alignments of a token to the common alignment links between them. This removes the effect of a semantic neighbor that is not a synonym (like the effect of tea on alignments of coffee) and of irrelevant meanings of a semantic neighbor (like the multiple meanings of bow on alignments of the token crossbow), as we only encourage an alignment link if it appears as a potential translation for both neighbors. Third, we note that using similarities based on distributed representations only hurts alignments for proper names. For these cases we encourage alignment to transliterations if applicable. For instance, if we see "Pakistan" and "Пакистан" (Pakistan) in an English/Russian sentence pair, we encourage alignment between these two tokens.
Figure 4.1 shows the outline of the proposed method. We provide the initial parallel data to the IBM models and train the translation tables p(f|e) and p(e|f). We then use the word vectors trained on English and foreign-language monolingual data to extract a bilingual lexicon from these tables. This lexicon is added to the original parallel data and used to re-train the alignments. The lexicon contains both common alignment links and transliteration links that are extracted from the translation tables. Next, we describe how each section of the lexicon is generated.
4.2.1 Extracting Semantically Similar Tokens
Consider an infrequent English token e (w.r.t. the parallel data) and its semantic neighbor e'. If e and e' have a common t-table entry, i.e., some f where p(f|e) > 0 and p(f|e') > 0.1, we encourage the translation of e to f by adding this pair to the parallel data for re-alignment. We limit the lexicon entries to non-common words: we only add entries where freq(e) ≤ 100 and freq(f) ≤ 100. The translation table is trained by 5 iterations of each of IBM models 1, HMM, and 4.
We add each (e, f) pair multiple times to the lexicon, proportional to p(f|e'), the cosine distance of e and e', and the frequency of e. More precisely, for each neighbor e', each (e, f) pair appears

⌈min( freq(e) · dist(e, e') · p(f|e'), freq(e)/4 )⌉

times in the lexicon.
To measure similarity, we use the cosine distance of word vectors trained on monolingual data using an implementation of the continuous bag-of-words (CBOW) algorithm.[2] English word vectors are trained on the one-billion-word language modeling benchmark (Chelba et al., 2013). Foreign-language word vectors are trained on the monolingual data described in Section 4.3. All vectors are trained with window size 6 and dimension 300. For each word, we consider its two nearest neighbors according to the cosine distance.
In a similar manner, we extract a lexicon from the p(e|f) translation table as well. For each foreign rare token f and its semantic neighbor f', we add the (e, f) pair to the lexicon if p(e|f) > 0 and p(e|f') > 0.1. However, as discussed in Section 4.4, it is better to use this lexicon only if the foreign-language word vectors are trained on more than 10 million tokens of monolingual data.
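The extraction rule above can be sketched as follows. The function name and data structures are hypothetical (real t-tables come from GIZA++ output, and neighbors from trained word vectors); this just mirrors the thresholds and the repetition-count formula:

```python
import math

def lexicon_entries(e, neighbors, t_table, freq, threshold=0.1):
    """For a rare English token e and each semantic neighbor e', emit
    (e, f) pairs that both share in the t-table, repeated
    ceil(min(freq(e) * sim(e,e') * p(f|e'), freq(e)/4)) times.
    neighbors maps e to [(e', cosine_similarity), ...]."""
    entries = []
    if freq[e] > 100:                 # only rare tokens get lexicon entries
        return entries
    for e2, sim in neighbors.get(e, []):
        for (src, f), p in t_table.items():
            if src != e2 or p <= threshold:
                continue              # need p(f|e') above the threshold
            if t_table.get((e, f), 0) > 0 and freq.get(f, 0) <= 100:
                n = math.ceil(min(freq[e] * sim * p, freq[e] / 4))
                entries.extend([(e, f)] * n)
    return entries

# Toy setup mirroring the obliterated/destroyed/destruida example.
freq = {"obliterated": 8, "destruida": 5}
t_table = {("destroyed", "destruida"): 0.5,
           ("obliterated", "destruida"): 0.05}
neighbors = {"obliterated": [("destroyed", 0.8)]}
entries = lexicon_entries("obliterated", neighbors, t_table, freq)
```

Here the pair (obliterated, destruida) is emitted twice: freq·sim·p = 3.2 is capped by freq/4 = 2, so the cap keeps any single neighbor from dominating the rare word's counts.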
4.2.2 Extracting Transliterations
For any infrequent English token e (w.r.t. the parallel data) and its translation table entry f, if f is a transliteration of e, we add the (e, f) pair to the lexicon. Similarly, we extract transliteration pairs from the p(e|f) translation table. Each transliteration pair is added once to the lexicon.
In order to decide whether two tokens are transliterations, we compute the normalized edit distance of their romanizations. We use uroman,[3] a universal romanizer that converts text in any script to its romanized version (Latin alphabet). We say two tokens are transliterations if dist(rom(e), rom(f)) ≤ 0.25, where dist is the normalized Levenshtein distance and rom(·) is the output of the romanizer.

[2] https://code.google.com/archive/p/word2vec/
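The transliteration test reduces to a normalized edit distance over already-romanized strings. A minimal sketch (the romanization step itself is external, via uroman; the function names here are illustrative):

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_transliteration(rom_e, rom_f, threshold=0.25):
    """dist(rom(e), rom(f)) <= 0.25, with dist normalized by the longer
    string; inputs are assumed already romanized."""
    dist = levenshtein(rom_e, rom_f) / max(len(rom_e), len(rom_f))
    return dist <= threshold
```

For example, "pakistan" vs. "pokistan" differ by one edit out of eight characters (0.125 ≤ 0.25), so they count as transliterations, while unrelated short words do not.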
4.3 Data
We use data from fifteen languages for our machine translation experiments.[4] These languages include Amharic, Arabic, Bengali, Mandarin, Farsi, Hausa, Somali, Spanish, Tamil, Thai, Turkish, Uighur, Urdu, Uzbek, and Yoruba. Table 4.1 shows the size of the training, development, test, and monolingual data for each language. In addition, we use hand-aligned data[5] for Arabic/English (77.3K+119.5K tokens), Chinese/English (240.2K+305.2K tokens), and Farsi/English (0.9K+0.8K tokens) for word alignment experiments. We lowercase and tokenize all data using Moses (Koehn et al., 2007) scripts.

[3] https://www.isi.edu/ulf/uroman.html
[4] LDC2015E13, LDC2015E14, LDC2015E83, LDC2015E84, LDC2016E57, and LDC2016E86 to LDC2016E105
[5] LDC2012E51, LDC2012E24, (Pilevar et al., 2011)
      train   dev.   test    mono.
amh   2.1M   39.8K  19.5K     4.3M
ara   3.8M   39.1K  19.8K   230.4M
ben   0.9M   41.9K  21.0K     2.5M
cmn  10.6M   41.7K  20.5K    33.2M
fas   4.3M   47.7K  24.2K   271.2M
hau   2.1M   48.0K  24.1K     3.9M
som   2.8M   46.8K  23.5K    13.5M
spa  24.1M   49.4K  24.3K    14.7M
tam   0.5M   39.0K  11.4K     1.0M
tha   0.7M   39.1K  23.1K    39.7M
tur   4.1M   40.2K  19.9K   483.0M
uig   5.2M    8.6K   4.3K    33.8M
urd   1.1M   46.7K  23.2K    14.4M
uzb   4.2M   42.5K  21.7K    60.3M
yor   2.1M   47.9K  24.5K     7.0M
Table 4.1: Data split and size of monolingual data (tokens) for different languages. For parallel data, size refers to the number of English plus foreign-language tokens.
4.4 Experiments
4.4.1 Machine Translation
We perform end-to-end machine translation experiments on 15 different languages described in Section 4.3. We use Giza++ (Och and Ney, 2003) to get the alignments and Moses (Koehn et al., 2007) to train and decode phrase-based machine translation (PBMT) systems. The parallel data is stemmed to the first 4 characters for training the alignments but not for the PBMT system. We use 5 iterations of each of IBM models 1, HMM, and 4 to train the alignments both before and after adding the lexicons. Our baseline is the system before adding the lexicons. We test both the effect of only adding the lexicon extracted from the p(f|e) translation table using the English word vectors (L_e), and of adding both lexicons (L_e + L_f). Table 4.2 shows the BLEU scores of the different experiments. The languages are sorted by the size of their monolingual data. The first five languages have less than 10M tokens of monolingual data.
baseline L
e
L
e
+L
f
improve
tam 19.2 19.3 19.2 0.1
ben 8.1 8.2 8.0 0.1
hau 19.4 19.6 19.9 0.2
amh 11.5 11.9 11.2 0.4
yor 14.2 14.6 14.3 0.4
som 18.7 19.1 18.9 0.2
urd 15.6 15.2 16.2 0.6
spa 40.0 40.0 40.0 0.0
cmn 12.5 12.7 12.7 0.2
uig 12.8 14.3 14.0 1.2
tha 20.3 20.1 20.5 0.2
uzb 13.2 13.5 13.9 0.7
ara 18.2 18.1 18.0 -0.2
fas 19.2 19.3 19.4 0.2
tur 14.7 15.4 15.4 0.7
Table 4.2: Machine translation experiments (BLEU). For languages with less than 10M monolingual tokens (first five) we only use L_e; otherwise we use both lexicons L_e + L_f. This way we improve the baseline for almost all languages.
We see that it is generally better to only use L_e for languages with small monolingual data and use both L_e and L_f for others. If we put the threshold at 10M tokens of monolingual data, we improve the BLEU score over the baseline for almost all languages, up to 1.2 points for Uighur. The exceptions are Arabic and Spanish. However, the Spanish experiment is hardly within the low-resource setting, as it has about 24M tokens of parallel data.
4.4.2 Alignments
In addition, we perform word alignment experiments on Arabic/English, Chi-
nese/English, and Farsi/English, for which we have access to gold alignment data
(Section 4.3). We append the test sentences to the existing parallel training data
for each language (Table 4.1) and use it to get the alignments. Baseline and
proposed methods are defined as in the machine translation experiments above
(Section 4.4.1). Note that word vectors for these three languages are trained on
more than 10M tokens, so we use both lexicons in the proposed method. Table 4.3
presents the precision, recall, and f-score of the alignments compared to the gold
alignments.
     baseline        L_e + L_f
ara  63.1/58.1/60.5  63.8/58.4/61.0
cmn  66.5/61.6/63.9  66.7/61.6/64.1
fas  52.7/66.7/58.9  54.3/68.5/60.6
Table 4.3: Word alignment experiments (alignment precision/recall/f-score). The proposed method (L_e + L_f) improves the baseline in all cases.
The proposed method gets better precision, recall, and f-score for all three
languages.
4.5 Conclusion
In this chapter, we present a method for improving word alignments using word similarities. The method is simple and yet efficient. We use off-the-shelf distributed word representation tools to encourage a subset of translation table entries that are common between semantically similar words. End-to-end experiments on translating 15 languages into English, as well as alignment-accuracy experiments for three languages, show consistent improvement over the baseline.
Chapter 5
Borrowing from related languages
1
5.1 Introduction
In this chapter, we address the problem of translating a Low-resource Language
(LL) into English. We assume that LL has little parallel data with English or any
other language. However, this low-resource language has a higher-resource closely
Related Language (RL). We introduce a method for borrowing parallel data from
RL and using it to improve machine translation for LL. We assume that LL and
RL are mostly cognates, having roughly the same word order. LL and RL can
have different orthographies. Examples of such LL/RL pairs are Afrikaans/Dutch
and Belorussian/Polish.
As an example, assume that we would like to translate Belorussian to English. The amount of available Belorussian/English parallel data is small. Hence, machine translation quality for this language is very poor. Fortunately, Polish is a language related to Belorussian, and it is one of the languages of the European Parliament. Consequently, lots of parallel data exists between Polish and English. In this chapter, we propose a method to convert Polish/English parallel data to Belorussian/English. This converted parallel data will be used to train a better Belorussian-to-English machine
1. The work is described in:
- “Deciphering Related Languages” (N. Pourdamghani, K. Knight), Proc. Empirical Methods in
Natural Language Processing (EMNLP), 2017.
- “Neighbors Helping the Poor: Improving low-resource Machine Translation Using Related Lan-
guages” (N. Pourdamghani, K. Knight), Machine Translation (MT) journal.
translation system. We will discuss different methods to combine the converted
parallel data with the original one for a better machine translation quality.
Consider the following Polish/English parallel sentence:
pol: Wszyscy ludzie rodzą się wolni i równi pod względem swej godności i swych praw.
eng: All human beings are born free and equal in dignity and rights.
We would like to convert it to the following Belorussian/English parallel sentence:
bel: Усе людзі нараджаюцца свабоднымі і роўнымі ў сваёй годнасці і правах.
eng: All human beings are born free and equal in dignity and rights.
To do so, we need to convert the Polish side of the Polish/English parallel data to
Belorussian. At first look, the Polish sentence is very different from the Belorus-
sian one. However, if we romanize the Belorussian sentence and compare it to the
Polish one, we see that many words are cognates and the word order is almost the
same.
bel: Усе людзі .. свабоднымі і роўнымі .. сваёй годнасці і правах.
rom(bel): Use ludzi .. svabodnymi i rounymi .. svaei godnastsi i pravakh.
pol: Wszyscy ludzie .. się wolni i równi .. swej godności i .. praw.
If we accept some noise, we can assume that Polish is Belorussian written in another script, and all we need to do is to “decipher” it back to Belorussian. However, this cipher is much more complex than a simple one-to-one letter substitution cipher, where a one-to-one character mapping is used to replace each letter in LL (here, Belorussian) with a letter in RL (here, Polish). In our problem, we observe higher-level mappings in many situations. For instance, notice the one-to-two mapping between swej and сваёй (svaei) and the two-to-one mapping between ludzie
and людзі (ludzi). Moreover, many RL words cannot be deciphered into any LL word and vice versa.
In this chapter we propose a universal method for translating texts between
such closely related languages. Our method is orthography-agnostic for alphabetic
systems, and crucially, it does not need any bilingual data. We use this method
to convert RL/English parallel data into LL/English by translating the RL side
of RL/English data to LL. We use this extra resource for training an improved
LL/English machine translation system.
To translate RL to LL, we train a character-based cipher model that can deci-
pher RL character sequences into LL, and connect it to a word-based LL language
model for decoding RL texts into LL. The cipher model is trained in a noisy
channel model where a character language model produces LL characters and the
model converts them to RL. Expectation Maximization is used to train the model
parameters to maximize the likelihood of a set of RL monolingual data. At decoding time, the cipher model reads the RL text character by character, in which words
are separated by a special character, and produces a weighted lattice of characters
representing all the possible translations for each of the input tokens. The word-
based language model takes this lattice and produces a sequence of output words
that maximize the language model score times the cipher model score. Figure 5.1
depicts this process.
Using a character-based cipher model provides the flexibility to generate unseen words. In other words, the vocabulary is limited by the decoding LM, not the cipher
model. Separation of training and decoding language models enables us to train
the decoding LM on as much data as is available without worrying about training
speed or memory issues. We can also transliterate out-of-vocabulary words by
Figure 5.1: The process used for training the cipher model and decoding RL text to LL.
spelling out the best path produced by the cipher model in case no good match is found for a token in the decoding LM.
We show that this conversion method can translate texts between related lan-
guages much more accurately than copying or using a letter substitution cipher
(Section 5.7.1). We show different techniques for combining the converted parallel
data with the original one, and we show that the extra parallel data improves
phrase-based machine translation by both improving word alignments and phrase
table construction (Section 5.7.2).
5.2 Previous Work
5.2.1 Previous Work on Borrowing Resources from Related
Languages to Improve Machine Translation
Borrowing resources from a related language has been shown to be helpful for
improving machine translation. Currey et al. (2016) use a rule-based translation
system to convert Italian and Portuguese into Spanish, to improve Spanish (here,
LL) language modeling. Mann and Yarowsky (2001) use Spanish/Portuguese cognates to convert an English/Spanish lexicon to English/Portuguese. Nakov and Ng
(2009) and Karakanta et al. (2017) convert parallel data from a higher-resource
related language to the low-resource language, where both languages share the
same orthography, to improve translation between English and the low-resource
language. These methods prove the usefulness of data from a related language
to improve machine translation for the low-resource language. However, none of
them addresses borrowing data from a related language with a different orthog-
raphy. Moreover, their conversion method is either rule-based, or requires some
form of bilingual data between the related languages for training. These draw-
backs limit the application of these methods. In order to be able to use data from
a related language regardless of the orthography or availability of bilingual data, a
more general method for translating between related languages is required.
5.2.2 Previous Work on Translating Between Related Languages
Previous works on translation between related languages can be categorized into
three groups:
Systems for specific language pairs such as Czech/Slovak (Hajič et al.,
2000), Turkish/Crimean Tatar (Cicekli, 2002), Irish/Scottish Gaelic (Scannell,
2006), and Indonesian/Malaysian (Larasati and Kubo, 2010). Another similar
trend is translation between dialects of the same language like Arabic dialects to
standard Arabic (Hitham et al., 2008; Sawaf, 2010; Salloum and Habash, 2010).
Also, work has been done on translating back the Romanized version of languages
like Greeklish to Greek (Chalamandaris et al., 2006) and Arabizi to Arabic (May
et al., 2014). These methods cannot be applied to our problem because the time and resources needed to build a translation system for a specific language pair are limited.
Machine learning systems that use parallel data: These methods cover
a broader range of languages but require parallel text between related languages.
They include character-level machine translation (Vilar et al., 2007; Tiedemann,
2009)orcombinationofword-levelandcharacter-levelmachinetranslation(Nakov
and Tiedemann, 2012) between related languages.
Use of non-parallel data: Cognates can be extracted from monolingual
data and used as a parallel lexicon (Hana et al., 2006; Mann and Yarowsky, 2001;
Kondrak et al., 2003). However, our task is whole-text transformation, not just
cognate extraction.
Unsupervised deciphering methods, which require no parallel data, have been
used for bilingual lexicon extraction and machine translation. Word-based deci-
phering systems ignore sub-word similarities between related languages (Koehn
and Knight, 2002; Ravi and Knight, 2011b; Nuhn et al., 2012; Dou and Knight,
2012; Ravi, 2013). Haghighi et al. (2008a) and Naim and Gildea (2015) propose
models that can use orthographic similarities. However, the model proposed by
Naim and Gildea (2015) is only capable of producing a parallel lexicon and not
translation. Furthermore, both systems require the languages to have the same
orthography and their vocabulary is limited to what they see during training.
Character-based decipherment is the model we use for solving this problem.
Character-based decipherment has been previously applied to problems like solv-
ing letter substitution ciphers (Knight et al., 2006; Ravi and Knight, 2011a) or
transliterating Japanese katakana into English (Ravi and Knight, 2009), but not
for translating full texts between related languages.
5.3 The Model
We learn a noisy-channel character-based cipher model that converts a sequence of LL characters s_1, ..., s_n generated by an LL character-based language model to a sequence of RL characters t_1, ..., t_m. At decoding, we feed this model in the reverse direction with an RL text and weight the potential outputs using a word-based LL language model to produce LL tokens.
5.3.1 Cipher Model
Our cipher model converts a sequence of LL characters s_1, ..., s_n to a sequence of RL characters t_1, ..., t_m. It models one-to-one, one-to-two and two-to-one character mappings. This allows us to handle cases like Cyrillic ‘ч’ and Latin ‘ch’, and also subtle differences in pronunciation between RL and LL like Portuguese ‘justiça’ and Spanish ‘justicia’.
We model the cipher by a Weighted Finite State Automaton (WFST) composed of three components (Figure 5.2):
WFST1 is a one-to-one letter substitution model. For each LL character s it writes one RL character t with probability p_1(t|s).
WFST2 is a one-to-two letter substitution model. For each LL character s, it writes two RL characters t and t' with respective probabilities p_21(t|s) and p_22(t'|s). We assume p_22(t'|s) is independent of t. As a result we can estimate p(tt'|s) = p(t|s) p(t'|t,s) ≈ p_21(t|s) p_22(t'|s) as modeled in WFST2. This simplification is required to make the model practicable. Otherwise, the size of the cipher model would become cubic in the number of RL and LL characters, and combining it with a language model would make the system unfeasibly large for training.
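The independence assumption behind WFST2 can be sketched numerically. The probability tables below are illustrative stand-ins, not trained values:

```python
def p_one_to_two(t, t2, s, p21, p22):
    # Factored estimate p(t t2 | s) ≈ p21(t|s) * p22(t2|s), where p22 is
    # assumed independent of t; this keeps the parameter count quadratic
    # rather than cubic in the alphabet sizes.
    return p21.get((t, s), 0.0) * p22.get((t2, s), 0.0)

# toy tables: LL character 'ч' mapping to the RL character pair 'c','h'
p21 = {("c", "ч"): 0.9}
p22 = {("h", "ч"): 0.8}
print(p_one_to_two("c", "h", "ч", p21, p22))  # ≈ 0.72
```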
Figure 5.2: Part of the cipher model corresponding to reading LL character s from the start state. The same pattern repeats for any LL character. After reading s, the model goes to WFST1, WFST2, or WFST3 with respective probability α(s), β(s), or γ(s). In WFST1, the model produces each RL character t with probability p_1(t|s). In WFST2, the model produces each two RL characters t and t' with probabilities p_21(t|s) and p_22(t'|s). In WFST3, the model reads each LL character s' and produces each RL character t with probability p_3(t|ss'). From the last state of WFST1, WFST2, and WFST3, the model returns to the start state without reading or writing. The model has a loop on the start state that reads and writes space.
WFST3 is a two-to-one letter substitution cipher. For each LL character s, it reads another LL character s' with probability 1, and then writes one RL character t with probability p_3(t|ss'). As we will discuss in Section 5.4.1, we train p_3 directly from p_21 and p_22, hence the cubic number of parameters does not cause a problem.
The start state reads each LL character s and goes to WFST1, WFST2, or WFST3 with respective probability α(s), β(s), or γ(s). The last state of each component returns to start without reading or writing anything. The start state also reads and writes space with probability one.
Weights of the cipher model are trained as described in Section 5.4.1.
5.3.2 LL Character-Based Language Model
We train a character-based language model on LL monolingual data and use it
for training the cipher model. The structure of this language model is depicted in
Figure 5.3.
Figure 5.3: Part of a 3-gram character-based language model on a language with alphabet {a, b, c}.
This figure shows part of a 3-gram language model on a language with alphabet {a, b, c}. At each step, the language model decides to either move forward (dashed edges) or back off (dotted edges) before generating the next symbol (solid edges). Possible symbols are characters from the alphabet, the start-of-sentence symbol, which can only be generated in the start state, and the end-of-sentence symbol, which results in moving to the finish state (not shown). The only action in the start state is generating the start-of-sentence symbol and moving to state S. The null state is the lowest back off state, representing the empty sequence.
Weights of edges that generate a symbol (solid edges) are equal to the corresponding n-gram probabilities; specifically, edges out of the null state have 1-gram weights. The sum of the weights of back off (dotted) and move forward (dashed) edges is equal to one for each node. These weights are trained as described in Section 5.4.2.
Backing off is used to generate n-gram sequences that are not observed in the training data, for which the language model therefore has no probability. For instance, if the 3-gram ‘acb’ is not observed in the training data, the language model cannot generate character ‘b’ from the ‘ac’ node (note the lack of this edge in the figure). In order to generate the sequence ‘acb’ it has to back off from node ‘ac’ to node ‘c’ and generate ‘b’ from there.
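A sketch of this back-off lookup follows; a single placeholder back-off weight stands in for the trained per-group weights described in Section 5.4.2, and the probability table is illustrative:

```python
def backoff_prob(history, ch, ngram_probs, backoff_weight=0.4):
    # Try the longest available history first; each back-off step drops
    # the oldest character and pays a back-off penalty, mirroring the
    # dotted edges of Figure 5.3.
    w = 1.0
    while history:
        if (history, ch) in ngram_probs:
            return w * ngram_probs[(history, ch)]
        history = history[1:]
        w *= backoff_weight
    return w * ngram_probs.get(("", ch), 0.0)  # 1-gram floor at the null state

probs = {("ac", "a"): 0.5, ("c", "b"): 0.3, ("", "b"): 0.1}
print(backoff_prob("ac", "b", probs))  # backs off from 'ac' to 'c': 0.4 * 0.3
```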
We do not need to worry about out-of-vocabulary symbols as we assume all
alphabet characters are observed in the training. In this work, we use a 5-gram
language model which is a simple extension to the model described above.
5.3.3 LL Word-Based Language Model
We train a 2-gram word-based language model on LL monolingual data and com-
bine it with the cipher model in the decoding phase. Structure of the word-based
language model is similar to the character-based language model with two excep-
tions.
First, although this is a word-based language model, it needs to weight the
outputs of the character-based cipher model. Hence, words need to be spelled
out. Each edge that generates a symbol in the language model (each solid edge in
Figure 5.3) needs to be replaced with a sequence of edges and states that spell out
the corresponding word, ending with a space character. In practice we do not spell
out all the words one by one, but create a trie at each node for all the outgoing
words to reduce the memory requirements for the decoding process.
Second, we need to handle out-of-vocabulary (OOV) words. For this purpose
the null state backs off to an OOV state which is capable of generating any unob-
served word. Flow moves back from OOV state to the null state after generating
space. Figure 5.4 shows this modification.
Figure 5.4: Part of the word-based language model corresponding to the generation of out-of-vocabulary (OOV) tokens. The OOV state can generate any sequence of characters, and flow returns to the null state after generating a space.
Section 5.4.2 describes how we train the probabilities of this model.
5.4 Training
5.4.1 Training the Cipher Model
The cipher model described in Section 5.3.1 is much more flexible than a one-
to-one letter substitution cipher. A few thousand sentences of RL monolingual
data is not enough to train the model as a whole, and more training data makes
the process too slow to be practical. Hence, we break the full model into three
components WFST1, WFST2, and WFST3 as described in Section 5.3.1 and train
the parameters of each component, i.e., p_1, p_21 and p_22, and p_3, in separate steps. A final step trains the probability of moving into each of the components, i.e., α, β, and γ.
Each step of the training uses the EM algorithm to maximize the likelihood of 500 sentences of RL text in a noisy channel model where the 5-gram character-based LL language model (trained on 5000 LL sentences) produces an LL text character by character and the cipher model converts the LL characters into RL (top section of Figure 5.1).
Step one: We set α(s) = 1 and β(s) = γ(s) = 0 and train p_1,LL→RL(t|s) for each LL character s and each RL character t. In parallel, we reverse RL and LL and train p_1,RL→LL(s|t) for each RL character t and each LL character s. We use p_1(t|s) = 1/2 (p_1,LL→RL(t|s) + p_1,RL→LL(s|t)) to set WFST1 parameters in the next steps.
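Step one's symmetrization can be sketched as follows; the toy dictionaries stand in for the two EM-trained directions, and the averaged table is a score table used to initialize WFST1 rather than a normalized distribution:

```python
def symmetrize_p1(p_fwd, p_rev):
    # p_fwd[(s, t)] = p_LL->RL(t|s); p_rev[(t, s)] = p_RL->LL(s|t).
    # Average the two directions: p1(t|s) = 1/2 (p_fwd + p_rev).
    keys = set(p_fwd) | {(s, t) for (t, s) in p_rev}
    return {(s, t): 0.5 * (p_fwd.get((s, t), 0.0) + p_rev.get((t, s), 0.0))
            for (s, t) in keys}

p_fwd = {("a", "b"): 0.6}           # p(b|a), trained LL -> RL
p_rev = {("b", "a"): 0.8}           # p(a|b), trained RL -> LL
print(symmetrize_p1(p_fwd, p_rev))  # a single averaged entry, ≈ 0.7
```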
Step two: We set α(s) = β(s) = 1/2 and γ(s) = 0, fix p_1, and train p_21,LL→RL(t|s) and p_22,LL→RL(t'|s) for each LL character s and each pair of RL characters t and t'. In parallel, we reverse RL and LL and train p_21,RL→LL(s|t) and p_22,RL→LL(s'|t) for each RL character t and each pair of LL characters s and s'.
Step three: Our cipher model has to decide after reading one LL character if it will perform a one-to-one, one-to-two or two-to-one mapping. In the first two scenarios the model has enough information to decide, but for the two-to-one mapping the model has to decide before reading the second LL character. For instance, consider converting Bosnian to Serbian. When the model reads the character “c” it has to decide between one-to-one, one-to-two and two-to-one mappings. A good decision would be a two-to-one mapping because “ch” maps to ч, hence the system learns a large γ for the character “c”; but the same γ applies to any other character that follows “c”, which is not desirable.
One way to overcome this problem is to change the model to make the decision after reading two LL characters, but this will over-complicate the model. We
use a simpler trick instead. We compute p_3,LL→RL(t|ss') from p_21,RL→LL(s|t) and p_22,RL→LL(s'|t) using Bayes' rule:
p_3(t|ss') = p(ss'|t) p(t) / p(ss') ≈ p_21(s|t) p_22(s'|t) p(t) / p(ss')    (5.1)
The estimate is based on our assumption from the previous step that p_22(s'|t) is independent of s. For each RL character t we compute the empirical probability p(t) from monolingual data, and p(ss') is the normalization factor.
We set the p_3 parameters using equation (5.1), but before normalizing we manually prune the probabilities. If for LL characters s and s' there exists no RL character t such that p_21(s|t) p_22(s'|t) p(t) > 0.01, we assume that ss' does not map to any RL character. Otherwise, we only keep RL characters for which p_21(s|t) p_22(s'|t) p(t) > 0.01 and then apply the normalization.
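Equation (5.1) with the 0.01 pruning threshold can be sketched as follows; all tables are toy stand-ins for the trained reverse-direction models:

```python
def estimate_p3(s, s2, rl_chars, p21_rev, p22_rev, p_t, threshold=0.01):
    # Unnormalized Bayes-rule score p21(s|t) * p22(s'|t) * p(t) per
    # candidate RL character t; prune below the threshold, then normalize.
    scores = {}
    for t in rl_chars:
        score = p21_rev.get((s, t), 0.0) * p22_rev.get((s2, t), 0.0) * p_t.get(t, 0.0)
        if score > threshold:
            scores[t] = score
    z = sum(scores.values())
    return {t: v / z for t, v in scores.items()} if z else {}  # {} = no mapping

# toy example: can the LL pair ('c', 'h') decipher to a single RL character?
p21_rev = {("c", "x"): 0.9, ("c", "y"): 0.1}
p22_rev = {("h", "x"): 0.8, ("h", "y"): 0.1}
p_t = {"x": 0.5, "y": 0.5}
print(estimate_p3("c", "h", ["x", "y"], p21_rev, p22_rev, p_t))  # {'x': 1.0}
```

Here only "x" survives the pruning (0.9 · 0.8 · 0.5 = 0.36), so after normalization it takes all the probability mass.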
Step four: In the final step we fix p_1, p_21, p_22, and p_3 to the trained values and train α(s), β(s), and γ(s) for each LL character s.
5.4.2 Training the Language Models
We train the n-gram probabilities of the language models empirically from the
monolingual data.
p(c_n | c_1 ... c_{n-1}) = count(c_1 ... c_n) / count(c_1 ... c_{n-1})

where c_i is a character in the character-based language model or a word in the word-based language model.
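The empirical estimate above amounts to a ratio of counts, which can be sketched as:

```python
from collections import Counter

def ngram_probs(symbols, n):
    # Maximum-likelihood estimate p(c_n | c_1..c_{n-1})
    # = count(c_1..c_n) / count(c_1..c_{n-1});
    # `symbols` is a list of characters or words.
    num = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    den = Counter(tuple(symbols[i:i + n - 1]) for i in range(len(symbols) - n + 2))
    return {g: c / den[g[:-1]] for g, c in num.items()}

probs = ngram_probs(list("abab"), 2)
print(probs[("a", "b")])  # 'a' is always followed by 'b' here: 1.0
print(probs[("b", "a")])  # 'b' is followed by 'a' once out of two: 0.5
```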
To train the back off probabilities and probabilities of generating characters in
the OOV state of the word-based language model, we fix the n-gram weights on
the language models and use EM to train the parameters with the objective of
maximizing the likelihood of generating the monolingual data. In order to avoid
training a separate back off probability for each n-gram sequence; for each n, we group all the possible sequences into 100 groups based on their frequency and train a separate back off probability for each group.
We train the character-based language models on 5000 sentences of LL data
to keep them small for a feasible training of the cipher model. The word-based
language models are trained on all the available monolingual data for each LL.
5.5 Decoding
In the decoding step we compose the cipher model with the LL word-based lan-
guage model and find the best path for the input sentence in the resulting WFST
(bottom section of Figure 5.1).
As the cipher model translates spaces only to spaces, it preserves the number
of words, though length of words may vary. The cipher model converts each word
to a lattice of possible words. The word-based language model then weights these
potential words and selects the best one according to both the language model score and the cipher model score. Note that if the output lattice of the cipher model does not find a good enough match in the language model according to the bi-gram probabilities, the language model has the option to accept the most probable path in the lattice, according to the cipher model score and the score of generating it in the OOV state, as an out-of-vocabulary word.
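For a single token, the interaction between cipher-model scores and the bigram language model (with a flat OOV fallback) can be sketched as follows; all scores and words here are hypothetical toy values:

```python
import math

def decode_token(candidates, prev_word, lm_bigram, oov_logprob=-10.0):
    # Each candidate is (word, cipher-model log-score). Add the bigram
    # LM log-prob given the previous output word; unseen words pay a
    # flat OOV penalty, standing in for the OOV state of Figure 5.4.
    def lm(w):
        return lm_bigram.get((prev_word, w), oov_logprob)
    return max(candidates, key=lambda wc: wc[1] + lm(wc[0]))[0]

cands = [("gelyk", math.log(0.5)), ("gelyke", math.log(0.4))]
lm_bigram = {("met", "gelyke"): math.log(0.3)}
print(decode_token(cands, "met", lm_bigram))  # the LM prefers "gelyke"
```

With an empty LM table, every candidate pays the same OOV penalty and the cipher model's own best path wins instead.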
5.6 Data
We collect data for six pairs of related languages: Afrikaans (afr) / Dutch (dut), Macedonian (mac) / Bulgarian (bul), Malaysian (mal) / Indonesian (ind), Swedish (swe) / Danish (dan), Belorussian (bel) / Polish (pol), and Serbian (srb) / Bosnian (bos). For each language, we download the monolingual data from the Leipzig corpora (Goldhahn et al., 2012), and the parallel data from OPUS (Tiedemann, 2012). The domain of the monolingual data is news, web, and Wikipedia. The domain of the parallel data varies, including technical texts, religious texts, phrase book sentences, etc. We consider the language with more parallel data as RL and the one with less parallel data as LL. Table 5.1 shows the size of available data for each language.
monolingual parallel
afr 0.9M 159K
dut 4M 7.9M
mac 0.9M 2.7M
bul 2.1M 2.8M
mal 0.4M 934K
ind 4M 4.5M
swe 4M 8.2M
dan 4M 10.9M
bel 1.9M 79K
pol 4M 5.1M
srb 1.6M 53K
bos 0.8M 180K
Table 5.1: Size of monolingual and parallel data (number of sentences) available
for each language. LL and RL are presented in pairs, LL first.
We also extract the list of alphabets for each language from Wikipedia, and collect the Universal Declaration of Human Rights (UDHR) for each LL and RL. We manually sentence-align these documents and get 104 sentences and about 1.5K tokens per language. We use these documents for tuning and testing the systems.
We tokenize and lowercase all the monolingual, parallel, and UDHR data with Moses scripts. We remove all non-alphabetic characters from each text according to the alphabet extracted from Wikipedia. This includes numbers, punctuation, and rare/old characters that are not considered official characters of the language. We keep all the accented variations of characters.
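The cleaning step can be sketched as follows; the alphabet set here is an illustrative subset, not the actual Wikipedia-derived character list:

```python
def clean_line(line, alphabet):
    # Lowercase, drop every character outside the official alphabet
    # (numbers, punctuation, rare/old letters), and collapse the
    # resulting runs of spaces; accented variants stay in the alphabet.
    kept = "".join(c for c in line.lower() if c in alphabet or c == " ")
    return " ".join(kept.split())

afr_alphabet = set("abcdefghijklmnopqrstuvwxyzáéèêëïîôûö")  # illustrative subset
print(clean_line("Alle menslike wesens, 123!", afr_alphabet))  # -> alle menslike wesens
```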
5.7 Experiments
In this section, we present internal evaluation of the conversion accuracy as well
as end-to-end machine translation experiments with different settings.
5.7.1 Evaluating the Conversion Accuracy
Before running the end-to-end machine translation experiments, we evaluate the
conversion accuracy of our proposed method. We translate RL UDHR to LL and
measure the BLEU score of the translation. We compare the following translation
methods:
Copy: Copying the text. This is not applicable for languages with different orthographies.
LS: One-to-one Letter Substitution cipher. This is equivalent to using WFST1
without a decoding language model.
LS+1g LM: One-to-one letter substitution cipher with a 1-gram word language
model at decoding.
PM+1g LM, PM+2g LM: The Proposed Method with respectively a 1-gram and a 2-gram word language model at decoding.
Results are reported in Table 5.2. For all the language pairs except Malaysian (mal) / Indonesian (ind), the proposed method is the best model by a large margin. Malaysian / Indonesian is a special case where, although the languages have a
dut!afr bul!mac ind!mal dan!swe pol!bel bos!srb
Copy 1.9 (25.0) 5.6 (33.5) 10.0 (41.6) 1.2 (17.6) – –
LS 2.13 (26.5) 5.94 (34.6) 10.0 (41.6) 1.3 (20.7) 0.0 (12.8) 33.2 (59.4)
LS+1g LM 3.07 (27.6) 6.9 (37.6) 10.1 (41.7) 4.5 (31.3) 0.7 (19.7) 32.8 (59.2)
PM+1g LM 3.9 (29.4) 9.4 (40.6) 10.1 (41.7) 6.4 (36.5) 1.3 (23.7) 39.9 (64.6)
PM+2g LM 5.2 (31.2) 10.2 (41.3) 10.0 (41.7) 6.9 (38.8) 1.9 (25.2) 39.2 (64.4)
Table 5.2: BLEU scores for RL-to-LL translation of UDHR text. Format is BLEU4 (BLEU1). Polish/Belorussian and Bosnian/Serbian have different orthographies, hence copying is not applicable.
different vocabulary and a slightly different grammar, they have a common alpha-
bet, and almost all of their cognates are exactly the same. See Figure 5.5 for an
example. As a result the proposed method cannot learn much more than copying.
mal: semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hakhak
ind: semua orang dilahirkan merdeka dan mempunyai martabat dan hakhak yang sama
Figure 5.5: First sentence of the first article of UDHR in Malaysian (mal) and
Indonesian (ind). These languages have a different vocabulary, but their cognates
(shown in bold) are exact matches.
The proposed method translates between Serbian (srb) and Bosnian (bos)
almost perfectly. For other pairs, we translate between a quarter and half of the
words correctly, but we get few of the higher n-grams. Figure 5.6 visualizes the
conversion of the first sentence of the first article of UDHR from Dutch (dut) to
Afrikaans (afr) using PM+2g LM (5.2 BLEU4, 31.2 BLEU1). Observe that 4 out
of 10 tokens are translated correctly, close to the 31.2 BLEU1 score, and there is no 3- or 4-gram match. For other tokens except “mensen” the translation is either correct but non-existent in the Afrikaans sentence (en = and, in = in) or has a meaning similar enough that it can be useful for the downstream machine translation (worden = become vs. word = are, gelyk = equal (noun) vs. gelyke = equal (adjective), richter = judge vs. regte = rights). The token “mensen” in d2a is an OOV. The model is not able to convert “mensen” (dut) to “menslike” (afr). The language
model does not accept other potential conversions and passes out “mensen” (d2a)
as the best output of the cipher model.
dut: alle mensen ------ worden vrij en gelijk in waardigheid en rechten
d2a: alle mensen ------ worden vry en gelyk in waardigheid en richter
afr: alle menslike wesens word vry met gelyke -- waardigheid en regte
dut2en: all people ------ are free and equal in dignity and rights
d2a2en: all people ------ become free and equal in dignity and judge
afr2en: all human beings are free with equal -- dignity and rights
Figure 5.6: First sentence of the first article of UDHR in Dutch (dut), Afrikaans
(afr), and its conversion from Dutch to Afrikaans using PM+2-gram LM (d2a),
along with their translations to English.
5.7.2 Machine Translation Experiments
We perform end-to-end machine translation experiments by converting the RL/English parallel data into LL/English and using it as a new resource to train an LL-to-English machine translation system. The system we use for the experiments is the Moses phrase-based machine translation system (Koehn et al., 2007). We use Giza++ (Och and Ney, 2003) for word-aligning the parallel data.
In our experiments we test adding different amounts of converted RL/English
parallel data to different amounts of original LL/English parallel data. We work
with 50k, 100k, and 200k sentences of original LL/English parallel data and for
each case we add zero, 100k, 500k and 1M sentences of converted parallel data,
which adds up to a total of 12 experiments for each language. As the size of the
development set is small (50 sentences), to reduce the variance in the results, in
each experiment we tune and test five times and report the average test result of
these five runs.
Our training data for the machine translation experiments is the “found” par-
allel data on OPUS (Tiedemann, 2012) and the tuning and test data is the first and second half of the UDHR data, respectively (Section 5.6). This choice is consistent
with the reality of the disaster scenario where we need to translate query texts
with mostly out-of-domain training data, but we can gather some in-domain tun-
ing data.
2
Alignment-only Experiments
In our first end-to-end experiment, we test if the converted parallel data can help
word alignments of LL/English parallel data, before using them in the next steps
of the machine translation system. In this alignment-only experiment we append
the converted parallel data to the original LL/English data, and find alignments on
the combined set. The extra parallel data helps to train better parameters for the
alignment model, and results in better alignments for the original parallel data.
Then we only use the original LL/English parallel data and the corresponding
alignments to train the machine translation system.
Table 5.3 shows the results of this experiment. In each section of the table, the
first row (corresponding to adding zero converted parallel data) represents the baseline. We see that for almost any size of the original LL/English parallel data, and
for all languages except Belorussian, the converted parallel data helps the machine
translation BLEU score by improving alignments of the original LL/English par-
allel data. Belorussian/English parallel data is extremely out-of-domain, and the extra data either does not help improve the alignments, or it helps but this improvement is not reflected in better translation of the UDHR text.
It is notable that adding too much converted data often hurts the alignments.
More interestingly, improvements in BLEU score over the baseline stay high even with
increasing the size of the original LL data.
² Tuning with a held-out subset of the training data results in lower BLEU scores in all the
experiments but does not change the conclusions of this chapter.
LL conv. RL afr(dut) mac(bul) mal(ind) swe(dan) bel(pol) srb(bos)
50k 0 26.5 23.4 10.3 28.1 5.1 6.2
50k 100k 27.7 26.3 11.6 28.8 5.0 6.7
50k 500k 27.9 25.7 12.9 28.7 1.2 7.2
50k 1M 27.4 25.5 12.0 28.6 3.4 —
100k 0 28.2 30.4 15.3 29.5 4.8 11.6
100k 100k 29.1 29.9 14.4 29.2 4.8 17.4
100k 500k 30.1 29.8 16.3 29.6 4.1 16.8
100k 1M 29.5 29.7 14.7 30.0 3.7 —
200k 0 32.5 31.9 16.5 32.6 — —
200k 100k 32.7 31.6 22.1 32.8 — —
200k 500k 33.7 32.2 21.3 32.7 — —
200k 1M 32.4 32.3 19.5 33.1 — —
Table 5.3: BLEU score of alignment-only experiments. We perform 12 experiments
for each language that vary by the number of sentences of original LL/English
parallel data (first column) and the number of sentences of parallel data converted
from RL/English (second column). In each section, corresponding to a different
size of original LL data, the first row is the baseline. Empty cells are due to
inadequacy of parallel data.
Appending the Parallel Data
In the next experiment we run the full machine translation system on the combined
data to see the effect of converted data on phrase table construction as well as
alignments.
Table 5.4 shows the results of this experiment. We see that in almost all cases
the results are better than the baseline. As expected, the improvements are higher when
the baseline is weak (i.e., the original data is extremely out-of-domain) and when LL
and RL have a high conversion accuracy (Table 5.2). Note especially the more
than 20 BLEU point improvement in translating Serbian to English. Also note
that unlike the alignment-only case, we get good improvements for Belorussian
now that the extra data helps the phrase table as well as the alignments. However,
LL conv. RL afr(dut) mac(bul) mal(ind) swe(dan) bel(pol) srb(bos)
50k 0 26.5 23.4 10.3 28.1 5.1 6.2
50k 100k 27.9 28.1 16.6 28.3 (28.8) 11.0 27.8
50k 500k 27.4 28.2 18.8 26.4 10.6 30.1
50k 1M 28.1 27.7 19.5 26.9 9.7 —
100k 0 28.2 30.4 15.3 29.5 4.8 11.6
100k 100k 29.4 30.3 18.5 28.9 10.8 28.8
100k 500k 30.3 29.6 20.9 28.6 10.9 32.3
100k 1M 28.7 31.1 20.7 31.0 10.7 —
200k 0 32.5 31.9 16.5 32.6 — —
200k 100k 31.9 31.7 21.5 (22.1) 31.7 — —
200k 500k 33.1 (33.7) 33.4 21.2 31.5 — —
200k 1M 31.9 33.4 21.3 31.7 (33.1) — —
Table 5.4: BLEU score of experiments for appending the converted parallel data to
the original LL data. We perform 12 experiments for each language that vary by
the number of sentences of original LL/English parallel data (first column) and the
number of sentences of parallel data converted from RL/English (second column).
In each section, corresponding to a different size of original LL data, the first row
is the baseline. Empty cells are due to inadequacy of parallel data. Numbers in
parentheses are cases where the alignment-only experiments (Table 5.3) outperform
the current results.
the improvements for Belorussian are not as good as those for Serbian, as Bosnian
is a much closer RL for Serbian compared to Polish for Belorussian.
Numbers in parentheses in the table show those results from the alignment-only
experiments that are better than the current experiment. The alignment-only
experiment is better than full machine translation on the combined data in four
cases. Three belong to the largest size of original LL data. When the size of the
original data is large, better alignments are the main source of improvement in
BLEU compared to better phrase table entries.
Finally, as expected, we see that for larger sizes of LL data, as the baseline gets
more powerful we get smaller improvements in BLEU over the baseline. Moreover, similar
to the alignment-only case, adding too much converted parallel data does not always
help the BLEU score.
Combining the Phrase Tables
In the previous experiments we observed that adding some converted parallel data
almost always helps machine translation quality, but adding too much converted
data often hurts it. This phenomenon is expected as the conversion of RL to
LL results in an informative but noisy LL text. In order to prevent the noise in
the converted data from overpowering the useful training material in the original
parallel data, we propose a method to let the system decide if it should use the
rules from the original data or those from the converted data for translating a text.
In the alignment-only experiments we observed that the converted parallel data
often helps the alignments of the original data. So, as a first step, similar to the
alignment-only experiments, we append the converted data to the original data and
train the alignments on the combined set. Then, unlike the appending experiments,
we train two separate phrase tables, one for the original data and one for the
converted data. We combine these phrase tables with an extra binary feature for
each entry that identifies which phrase table it comes from. Finally, we let the
MERT algorithm (Och, 2003) tune the weight of this feature.
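The combination step can be sketched as follows. The table format (a dict from phrase pairs to feature-score lists) and the clash handling are simplifying assumptions for illustration; a Moses-style system could instead keep both entries and let the provenance feature disambiguate them:

```python
def combine_phrase_tables(orig_table, conv_table):
    """Merge two phrase tables, appending a binary provenance feature per entry.

    Each table maps (src_phrase, tgt_phrase) -> list of feature scores.  The
    appended feature is 1.0 for entries from the original parallel data and
    0.0 for entries from the converted data; MERT later tunes its weight, so
    the decoder learns how much to trust each table.
    """
    combined = {entry: feats + [1.0] for entry, feats in orig_table.items()}
    for entry, feats in conv_table.items():
        # Simplifying assumption: on a clash, keep the original-data entry.
        combined.setdefault(entry, feats + [0.0])
    return combined

# Hypothetical miniature tables with a single translation-probability feature.
orig = {("haus", "house"): [0.9]}
conv = {("haus", "house"): [0.7], ("buch", "book"): [0.8]}
table = combine_phrase_tables(orig, conv)
```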
Table 5.5 presents the results of this experiment. Numbers in parentheses show
cases where the better of the two previous experiments outperforms the current
results. We see that in almost all experiments for Afrikaans, Macedonian,
Swedish, and Belorussian, combining the phrase tables with a feature results in
better BLEU scores than both of the previous experiments. Moreover, for exper-
iments with 50k and 100k sentences of original LL data, adding more converted
data hurts much less as ten out of twelve of the best results are in the last row of
LL conv. RL afr(dut) mac(bul) mal(ind) swe(dan) bel(pol) srb(bos)
50k 0 26.5 23.4 10.3 28.1 5.1 6.2
50k 100k 29.5 27.9 16.6 25.9 10.8 27.7
50k 500k 29.8 28.6 18.6 28.4 11.9 29.5 (30.1)
50k 1M 30.4 30.0 18.4 (19.5) 29.8 12.0 —
100k 0 28.2 30.4 15.3 29.5 4.8 11.6
100k 100k 30.0 30.1 17.7 28.8 11.1 28.8
100k 500k 32.0 30.8 19.1 (20.9) 29.4 11.7 32.2 (32.3)
100k 1M 30.4 31.0 (31.1) 19.4 31.4 12.3 —
200k 0 32.5 31.9 16.5 32.6 — —
200k 100k 30.8 33.2 18.4 (22.1) 30.5 — —
200k 500k 34.2 34.6 20.9 32.3 — —
200k 1M 31.1 33.7 20.1 31.1 (33.1) — —
Table 5.5: BLEU score of experiments for combining the phrase tables with dif-
ferent features. We perform 12 experiments for each language that vary by the
number of sentences of original LL/English parallel data (first column) and the
number of sentences of parallel data converted from RL/English (second column).
In each section, corresponding to a different size of original LL data, the first row
is the baseline. Empty cells are due to inadequacy of parallel data. Numbers in
parentheses are cases where the best of the two previous experiments is better than
the current results.
these two sections. The best results for Malaysian and Serbian are mostly from
the appending the parallel data experiments. This is because the baseline is very
weak for these languages and the RL to LL conversion is accurate. As a result
it is often a good bet to trust the phrase table from the converted data, which is
equivalent to appending the parallel data experiments.
5.8 Conclusions
In this chapter, we present a method for translating texts between closely related
languages with potentially different orthography, without needing any parallel
data. We use this method to convert parallel data from higher-resource related
languages to low-resource languages in order to improve their translation quality
into English. We show that the converted parallel data can help machine trans-
lation of low-resource languages by both improving the alignments and improving
the phrase table construction. We also propose a method to automatically distin-
guish between the original parallel data and the converted one when using them
for generating the translation. This way, we reduce the effect of noise in the
converted parallel data and further improve machine translation quality.
Chapter 6
On-demand unsupervised neural machine translation¹
6.1 Introduction
The quality of machine translation, especially neural MT, highly depends on the
amount of available parallel data. For a handful of languages, where parallel data
is abundant, MT quality has reached human-level performance Wu et al. (2016);
Hassan et al. (2018). However, the quality of translation rapidly deteriorates as
the size of parallel data decreases Koehn and Knowles (2017). Unfortunately, many
languages have close to zero parallel data. Translating texts from these languages
requires new techniques.
Hermjakob et al. (2018) presented a hybrid human/machine translation tool
that uses lexical translation tables to gloss a translation and relies on human
language and world models to propagate glosses into fluent translations. Inspired
by that work, this work investigates the following question: Can we replace the
human in the loop with more technology? We provide the following two-step
solution to unsupervised neural machine translation:
¹ The work is described in:
“Translating Translationese: A Two-Step Approach to Unsupervised Machine Translation” (N.
Pourdamghani, N. Aldarrab, M. Ghazvininejad, K. Knight, J. May), submitted to Proc. North
American Chapter of the Association for Computational Linguistics (NAACL), 2019.
1. Use a bilingual dictionary to gloss the input into a pseudo-translation or
‘Translationese’.
2. Translate the Translationese into the target language, using a model built in
advance from various parallel data, with the source side converted into
Translationese using Step 1.
The notion of separating adequacy from fluency components into a pipeline
of operations dates back to the early days of MT and NLP research, where the
inadequacy of word-by-word MT was first observed Yngve (1955). A subfield
of MT research that seeks to improve fluency given disfluent but adequate first-pass
translation is automatic post-editing (APE), pioneered by Knight and Chander
(1994). Much of the current APE work targets correction of black-box MT systems,
which are presumed to be supervised.
Early approaches to unsupervised machine translation include decipherment
methods Nuhn et al. (2013); Ravi and Knight (2011a); Pourdamghani and Knight
(2017), which suffer from a huge hypothesis space. Recent approaches to zero-shot
machine translation include pivot-based methods Chen et al. (2017); Zheng et al.
(2017); Cheng et al. (2016) and multi-lingual NMT methods Firat et al. (2016a,b);
Johnson et al. (2016); Ha et al. (2016, 2017). These systems are zero-shot for a
specific source/target language pair, but need parallel data from source to a pivot
or multiple other languages.
More recently, totally unsupervised NMT methods have been introduced that use
only monolingual data for training a machine translation system. Lample et al.
(2018a,b), Artetxe et al. (2018), and Yang et al. (2018) use iterative back-translation
to train MT models in both directions simultaneously. Their training takes place
on massive monolingual data and requires an extremely long time to train as well
as careful tuning of hyperparameters.
The closest unsupervised NMT work to ours is by Kim et al. (2018). Similar
to us, they break translation into glossing and correction steps. However, their
correction step is trained on artificially generated noisy data aimed at simulating
glossed source texts. Although this correction method helps, simulating noise
caused by natural language phenomena is a hard task and needs to be tuned for
every language.
Previous zero-shot NMT work compensates for a lack of source/target parallel
data by either using source/pivot parallel data, extremely large monolingual data,
or artificially generated data. These requirements and techniques limit the meth-
ods’ applicability to real-world low-resource languages. Instead, in this chapter we
propose using parallel data from high-resource languages to learn ‘how to trans-
late’ and apply the trained system to low resource settings. We use off-the-shelf
technologies to build word embeddings from monolingual data Bojanowski et al.
(2017) and learn a source-to-target bilingual dictionary using source and target
embeddings Conneau et al. (2017). Given a target language, we train source-to-target
dictionaries for a diverse set of high-resource source languages, and use them
to convert the source side of the parallel data to Translationese. We combine this
parallel data and train a Translationese-to-target translator on it. Later, we can
build source-to-target dictionaries on demand, generate Translationese from source
texts, and use the pre-trained system to rapidly produce machine translation for
many languages without requiring a single line of source-target parallel data.
We introduce the following contributions in this chapter:
• Following Hermjakob et al. (2018), we propose a two-step pipeline for building
a rapid neural MT system for any language. The pipeline does not require
parallel data or parameter fine-tuning when adapting to new source languages.
• The pipeline only requires a comprehensive source-to-target dictionary. We
show that this dictionary can be easily obtained using off-the-shelf tools
within a few hours.
• We use this system to translate test texts from 14 languages into English.
We obtain better or comparable quality translation results on high-resource
languages than previously published unsupervised MT studies, and obtain
good quality results for low-resource languages that have never been used in
an unsupervised MT scenario. To our knowledge, this is the first unsuper-
vised NMT work that shows good translation results on such a large number
of languages.
6.2 Method
We introduce a two-step pipeline for unsupervised machine translation. In the
first step a source text is glossed into a pseudo-translation, or Translationese, while
in the second step a pre-trained model translates the Translationese into the target
language. We introduce a fully unsupervised method for converting the source into
Translationese, and we show how to train a Translationese-to-target system in
advance and apply it to new source languages.
6.2.1 Building a Dictionary
The first step of our proposed pipeline includes a word-by-word translation of
the source texts. This requires a source/target dictionary. Manually constructed
dictionaries exist for many language pairs; however, cleaning these dictionaries to
get a word-to-word lexicon is not trivial, and these dictionaries often cover a
small portion of the source vocabulary, focusing on stems and specifically excluding
inflected variants. In order to have a comprehensive, word-to-word, inflected
bilingual dictionary, we look for automatically built ones.
Automatic lexical induction is an active field of research Fung (1995); Koehn
and Knight (2002); Haghighi et al. (2008b); Conneau et al. (2017). A popular
method for automatic extraction of bilingual dictionaries is through building
cross-lingual word embeddings. Finding a shared word representation space between two
languages enables us to calculate the distance between word embeddings of source
and target, which helps us to find translation candidates for each word.
We follow this approach for building the bilingual dictionaries. For a given
source and target language, we start by separately training source and target word
embeddings S and T, and use the method introduced in Conneau et al. (2017)
to find a linear mapping W that maps the source embedding space to the target:
SW = T.
Conneau et al. (2017) propose an adversarial method for estimating W, where
a discriminator is trained to distinguish between elements randomly sampled from
WS and T, and W is trained to prevent the discriminator from making accu-
rate classifications. Once the initial mapping matrix W is trained, a number of
refinement steps are performed to improve performance on less frequent words by
changing the metric of the space.
We use the trained matrix W to map the source embeddings into the space of
the target embeddings. Then we find the k-nearest neighbors among the target
words for each source word, according to the cosine distance metric. These nearest
neighbors represent our translation options for that source word.
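The mapping-and-lookup step can be sketched with NumPy. This is a brute-force search over a toy vocabulary with an identity mapping; the real system uses the 100K-word fastText vocabularies and the refined metric of Conneau et al. (2017):

```python
import numpy as np

def knn_dictionary(S, W, T, k=2):
    """Map source embeddings into the target space (SW ~ T), then return the
    indices of the k nearest target words by cosine similarity."""
    mapped = S @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    sims = mapped @ Tn.T                      # cosine similarity, rows = source words
    return np.argsort(-sims, axis=1)[:, :k]   # top-k target indices per source word

# Toy check: with W = I and identical source/target embeddings, each word's
# nearest neighbor is its own translation.
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
options = knn_dictionary(T, np.eye(2), T, k=1)
```

The rows of `options` are the translation candidates fed into the glossing step; in the full system k = 20 (Section 6.3).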
6.2.2 Source to Translationese
Once we have the translation options for tokens in the source vocabulary we can
perform a word by word translation of the source into Translationese. However,
a naive translation of each source token to its top translation option without con-
sidering the context is not the best way to go. Given different contexts, a word
should be translated differently.
We use a target language model to look at different translation options for a
source word and select one based on its context. This language model is trained
in advance on large target monolingual data.
In order to translate a source sentence into Translationese we apply a beam
search with a stack size of 100 and assign a score equal to α · P_LM + β · d(s, t) to each
translation option t for a source token s, where P_LM is the language model score
and d(s, t) is the cosine distance between the source and target words. We set
α = 0.01 and β = 0.5.
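A minimal sketch of this search is below. The `lm_logprob` callback is a hypothetical stand-in for the Gigaword language model, and folding the cosine distance in with a negative sign (so that a smaller distance raises the score) is our assumption about how the two terms combine:

```python
def gloss(sentence, options, lm_logprob, alpha=0.01, beta=0.5, beam=100):
    """Beam-search gloss of a source sentence into Translationese.

    options[s] is a list of (target_word, cosine_distance) pairs for source
    token s; tokens without options are copied through unchanged.
    """
    beams = [([], 0.0)]                               # (partial gloss, score)
    for s in sentence:
        expanded = []
        for hyp, score in beams:
            for t, dist in options.get(s, [(s, 0.0)]):
                step = alpha * lm_logprob(hyp + [t]) - beta * dist
                expanded.append((hyp + [t], score + step))
        beams = sorted(expanded, key=lambda b: -b[1])[:beam]
    return beams[0][0]

# Toy example: with a flat LM, the lowest-distance options win.
options = {"das": [("the", 0.1), ("this", 0.3)],
           "haus": [("house", 0.1), ("home", 0.2)]}
translationese = gloss(["das", "haus"], options, lambda words: 0.0)
```

With a real language model the context term breaks ties between near-synonymous options instead of the distance alone.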
6.2.3 Translationese to Target
We train a transformer model Vaswani et al. (2017) on parallel data from a diverse
set of high-resource languages to translate Translationese into a fluent target. For
each language we convert the source side of the parallel data to Translationese as
described in Section 6.2.2. Then we combine and shuffle all the Translationese/target
parallel data and train the model on the result. Once the model is trained, we
can apply it to the Translationese coming from any source language.
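The data preparation for this step is simply pooling and shuffling (a sketch; the corpus contents are illustrative, with the Translationese side already glossed into target-language words):

```python
import random

def build_training_corpus(corpora, seed=13):
    """Pool Translationese/English sentence pairs from several training
    languages and shuffle them, so that no single language dominates any
    contiguous part of the training stream."""
    pooled = [pair for corpus in corpora for pair in corpus]
    random.Random(seed).shuffle(pooled)
    return pooled

french = [("house big", "a big house")]        # Translationese / English pairs
german = [("the house old", "the old house")]
train = build_training_corpus([french, german])
```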
We use the tensor2tensor implementation² of the transformer model with the
transformer_base set of hyperparameters (6 layers, hidden layer size of 512) as
our translation model.
6.3 Data and Parameters
For all our training and test languages, we use the pre-trained word embeddings³
trained on Wikipedia data using fastText Bojanowski et al. (2017). These
embeddings are used to train bilingual dictionaries.
We select English as the target language. In order to avoid biasing the trained
system toward a language or a specific type of parallel data, we use diverse parallel
data from a diverse set of languages to train the Translationese-to-English system.
We use Arabic, Czech, Dutch, Finnish, French, German, Italian, Russian, and
Spanish as our set of training languages.
We use roughly 2 million sentence pairs per language and limit the length of
the sentences to 100 tokens. For Dutch, Finnish, and Italian we use Europarl as
the source of parallel data. For Arabic we use MultiUN. For French we use
CommonCrawl. For German we use a mix of CommonCrawl (1.7M) and NewsCommentary
(300K); the numbers in parentheses show the number of sentences for each
dataset. For Spanish we use CommonCrawl (1.8M) and Europarl (200K). For Russian
we use Yandex (1M), CommonCrawl (800K), and NewsCommentary (200K),
and finally for Czech we use a mix of ParaCrawl (1M), Europarl (640K),
NewsCommentary (200K), and CommonCrawl (160K).
² https://github.com/tensorflow/tensor2tensor
³ https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
We train one model on these nine languages and apply it to test languages not
in this set. Also, to test on each of the training languages we train a model where
the parallel data for that language is excluded from the training data. In each
experiment we use 3000 blind sentences randomly selected out of the combined
parallel data as the development set.
We use the default parameters in Conneau et al. (2017) to find the cross-lingual
embedding vectors. In order to create the dictionary we limit the size of the source
and target (English) vocabulary to 100K tokens. For each source token we find 20
nearest neighbors in the target language. We use a 5-gram language model trained
on 4 billion tokens of Gigaword to select between the translation options for each
token. In order to be comparable to Kim et al. (2018) we split German compound
words only for the newstest2016 test data. We use the CharSplit⁴ Python package
for this purpose. We use tensor2tensor’s transformer_base hyperparameters to
train the transformer model on a single GPU for each language.
6.4 Experiments
We report translation results on newstest2013 for Spanish, newstest2014 for French,
and newstest2016 for Czech, German, Finnish, Romanian, and Russian. We also
report results on the first 3000 sentences of GlobalVoices2015⁵ for Dutch, Bulgarian,
Danish, Indonesian, Polish, Portuguese, and Catalan.
We compare our results against all the existing fully unsupervised neural machine
translation methods in Table 6.1 and show better results on common test languages
compared to all of them except Lample et al. (2018b), where, compared to their
⁴ https://github.com/dtuggener/CharSplit
⁵ http://opus.nlpl.eu/GlobalVoices.php
fr-en de-en ru-en ro-en
Lample et al. (2018a) 14.3 13.3 - -
Artetxe et al. (2018) 15.6 10.2 - -
Yang et al. (2018) 15.6 14.6 - -
Lample et al. (2018b) (transformer) 24.2 21.0 9.1 19.4
Kim et al. (2018) 16.5 17.2 - -
Our Work 21.0 18.7 12.0 16.3
Table 6.1: Comparing translation results on newstest2014 for French and
newstest2016 for Russian, German, and Romanian with previous unsupervised NMT
methods. Kim et al. (2018) is the method closest to our work.
transformer model, we get worse translation results for closer language pairs but
a better result for Russian-English, which is a relatively distant language pair.
The first four methods that we compare against, including Lample et al. (2018b),
are based on back-translation. These methods require huge monolingual data and
a long training time to train a model per test language. The fifth method, which is
most similar to our approach, Kim et al. (2018), can be trained quickly, but is still
fine-tuned for each test language and performs worse than our method. Unlike the
previous works, our model can be trained once and applied to any test language on
demand. Besides this, these methods use language-specific tricks and development
data for training their models, while our system is trained totally independently of
the test language.
We also show acceptable BLEU scores for ten other languages for which no
previous unsupervised NMT scores exist, underscoring our ability to produce new
systems rapidly (Table 6.2).
cs-en es-en fi-en nl-en bg-en da-en id-en pl-en pt-en ca-en
13.7 22.2 7.2 22.0 16.8 18.5 13.7 14.8 23.1 19.8
Table 6.2: Translation results on ten new languages: Czech, Spanish, Finnish,
Dutch, Bulgarian, Danish, Indonesian, Polish, Portuguese, and Catalan
6.5 Conclusion
We propose a two-step pipeline for building a rapid unsupervised neural machine
translation system for any language. The pipeline does not require re-training the
neural translation model when adapting to new source languages, which makes its
application to new languages extremely fast and easy. The pipeline only requires
a comprehensive source-to-target dictionary. We show how to easily obtain such a
dictionary using off-the-shelf tools. We use this system to translate test texts from
14 languages into English. We obtain better or comparable quality translation
results on high-resource languages than previously published unsupervised MT
studies, and obtain good quality results for ten other languages that have never
been used in an unsupervised MT scenario.
Chapter 7
Conclusion
In this dissertation, we provide new tools and techniques for improving machine
translation for low-resource languages. We tackle the problem of machine transla-
tion in scarcity of parallel data from three directions:
• Improve algorithms, adapting them to work better with small parallel data
(chapters 3 and 4)
• Use resources other than source/English parallel data for training (chapters 4,
5, and 6)
• Build universal components, trained on resources from high-resource languages,
that can work out-of-the-box for any language, and use these models to
help machine translation for low-resource languages (chapter 6)
In chapter 3, we present a method to improve word alignments by considering
inherent symmetry of word alignments in the training process. To do so, we use
a symmetric objective function to train the parameters of the alignment model.
This method improves the alignment accuracy and consequently results in better
machine translation quality. End-to-end experiments on translating 16 languages
into English show consistent improvement (0.9 BLEU points on average) over the
baseline.
In chapter 4, we introduce a technique to use semantic information extracted
from monolingual data to improve word alignments for machine translation. We
use off-the-shelf distributed word representation tools to encourage a subset of
translation table entries that are common between semantically similar words.
End-to-end experiments on translating 15 languages into English show consistent
improvement (0.4 BLEU points on average) over the baseline.
In chapter 5, we propose a method for borrowing parallel data from a related
language, and propose different ways to combine this extra parallel data with the
original one. We show that the converted parallel data can help machine translation
by both improving the alignments and improving the phrase table construction.
Machine translation experiments for 6 languages, using one related language for
each, show an average of 8.4 (6.2) point BLEU score improvement over baseline
when the baseline is trained on 50K (100K) sentences of parallel data.
In chapter 6, we propose a two-step pipeline for building a rapid neural machine
translation system for any language. This pipeline includes glossing the input into
a pseudo-translation, and translating the pseudo-translation into the target using a
model built in advance from a diverse set of parallel data. We use this system to
translate test texts from 14 languages into English, and obtain better or compa-
rable quality translation results on high-resource languages than previously pub-
lished unsupervised MT studies, and obtain good quality results for low-resource
languages that have never been used in an unsupervised MT scenario.
Reference List
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised
neural machine translation. In Proc. ICLR, 2018.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching
word vectors with subword information. Transactions of the Association for
Computational Linguistics, 5:135–146, 2017.
Peter Brown, John Cocke, S. Della Pietra, V. Della Pietra, Frederick Jelinek, Robert
Mercer, and Paul Roossin. A statistical approach to language translation. In
Proc. COLING, 1988.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L.
Mercer. The mathematics of statistical machine translation: Parameter estima-
tion. Computational linguistics, 19(2), 1993.
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. Re-evaluating the role
of BLEU in machine translation research. In Proc. EACL, 2006.
Jaime Carbonell, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie,
Ralf Brown, and Lori Levin. Automatic rule learning for resource-limited MT.
In Proc. AMTA: machine translation: from research to real users, 2002.
Michael Carl, Maite Melero, Toni Badia, Vincent Vandeghinste, Peter Dirix, Ineke
Schuurman, Stella Markantonatou, Sokratis Sofianopoulos, Marina Vassiliou,
and Olga Yannoutsou. METIS-II: low resource machine translation. Machine
Translation, 22(1), 2008.
Aimilios Chalamandaris, Athanassios Protopapas, Pirros Tsiakoulis, and Spyros
Raptis. All Greek to me! An automatic Greeklish to Greek transliteration
system. In Proc. LREC, 2006.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp
Koehn, and Tony Robinson. One billion word benchmark for measuring progress
in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques
for language modeling. In Proc. ACL, 1996.
Yun Chen, Yang Liu, Yong Cheng, and Victor O. K. Li. A teacher-student framework
for zero-resource neural machine translation. arXiv preprint arXiv:1705.00753,
2017.
Yong Cheng, Yang Liu, Qian Yang, Maosong Sun, and Wei Xu. Neural machine
translation with pivot languages. arXiv preprint arXiv:1611.04928, 2016.
Colin Cherry and Dekang Lin. Soft syntactic constraints for word alignment
through discriminative training. In Proc. COLING, 2006.
Ilyas Cicekli. A machine translation system between a pair of closely related
languages. In Proc. ISCIS, 2002.
Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer,
and Hervé Jégou. Word translation without parallel data. arXiv preprint
arXiv:1710.04087, 2017.
Anna Currey, Alina Karakanta, and Jonathan Poitz. Using related languages to
enhance statistical language models. In Proc. NAACL, 2016.
Adrià De Gispert, Deepa Gupta, Maja Popović, Patrik Lambert, Jose B Marino,
Marcello Federico, Hermann Ney, and Rafael Banchs. Improving statistical word
alignments with morpho-syntactic transformations. In Advances in Natural Lan-
guage Processing. 2006.
Etienne Denoual and Yves Lepage. BLEU in characters: towards automatic MT
evaluation in languages without word delimiters. In Companion Volume to Proc.
IJCNLP, 2005.
Qing Dou and Kevin Knight. Large scale decipherment for out-of-domain machine
translation. In Proc. EMNLP, 2012.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient
projections onto the l1-ball for learning in high dimensions. In Proc. ICML,
2008.
Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neu-
ral machine translation with a shared attention mechanism. arXiv preprint
arXiv:1601.01073, 2016a.
Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T Yarman Vural, and
Kyunghyun Cho. Zero-resource translation with multi-lingual neural machine
translation. arXiv preprint arXiv:1606.04164, 2016b.
Victoria Fossum, Kevin Knight, and Steven Abney. Using syntax to improve word
alignment precision for syntax-based machine translation. In Proc. Workshop
on Statistical Machine Translation, 2008.
Pascale Fung. Compiling bilingual lexicon entries from a non-parallel English-
Chinese corpus. In Workshop on Very Large Corpora, 1995.
Pascale Fung and Lo Yuen Yee. An IR approach for translating new words from
nonparallel, comparable texts. In Proc. COLING, 1998.
Qin Gao and Stephan Vogel. Parallel implementations of word alignment tool. In
Proc. ACL workshop on software engineering, testing, and quality assurance for
natural language processing, 2008.
Ulrich Germann. Building a statistical machine translation system from scratch:
how much bang for the buck can we expect? In Proc. workshop on data-
driven methods in machine translation. Association for Computational Linguis-
tics, 2001.
Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Building large monolingual
dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In
Proc. LREC, 2012.
Thanh-Le Ha, Jan Niehues, and Alexander Waibel. Toward multilingual neu-
ral machine translation with universal encoder and decoder. arXiv preprint
arXiv:1611.04798, 2016.
Thanh-Le Ha, Jan Niehues, and Alexander Waibel. Effective strategies in zero-shot
neural machine translation. arXiv preprint arXiv:1711.07893, 2017.
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. Learning
bilingual lexicons from monolingual corpora. In Proc. ACL, 2008a.
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. Learning
bilingual lexicons from monolingual corpora. In Proc. ACL, 2008b.
Jan Hajič, Jan Hric, and Vladislav Kuboň. Machine translation of very close lan-
guages. In Proc. ANLP, 2000.
Jirka Hana, Anna Feldman, Chris Brew, and Luiz Amaral. Tagging Portuguese
with a Spanish tagger using cognates. In Proc. ACL workshop on Cross-Language
Knowledge Induction, 2006.
Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark,
Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William
Lewis, Mu Li, et al. Achieving human parity on automatic Chinese to English
news translation. arXiv preprint arXiv:1803.05567, 2018.
Ulf Hermjakob. Improved word alignment with statistics and linguistic heuristics.
In Proc. EMNLP, 2009.
Ulf Hermjakob, Jonathan May, Michael Pust, and Kevin Knight. Translating a
language you don’t know in the Chinese room. In Proc. ACL, System Demon-
strations, 2018.
Abo Bakr Hitham, Khaled Shaalan, and Ibrahim Ziedan. A hybrid approach for
converting written Egyptian colloquial dialect into diacritized Arabic. In Proc.
INFOS, 2008.
Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. Processing
social media messages in mass emergency: A survey. ACM Computing Surveys
(CSUR), 47(4), 2015.
Ann Irvine. Statistical machine translation in low resource settings. In Proc.
NAACL, 2013.
Ann Irvine and Chris Callison-Burch. Combining bilingual and comparable corpora
for low resource machine translation. In Proc. Workshop on Statistical Machine
Translation, 2013.
Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng
Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al.
Google’s multilingual neural machine translation system: enabling zero-shot
translation. arXiv preprint arXiv:1611.04558, 2016.
Alina Karakanta, Jon Dehdari, and Josef van Genabith. Neural machine transla-
tion for low-resource languages without parallel corpora. Machine Translation,
2017.
Yunsu Kim, Jiahui Geng, and Hermann Ney. Improving unsupervised word-by-
word translation with language model and denoising autoencoder. In Proc.
EMNLP, 2018.
Kevin Knight and Ishwar Chander. Automated postediting of documents. In Proc
AAAI, 1994.
Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. Unsupervised anal-
ysis for decipherment problems. In Proc. COLING, 2006.
Tomáš Kočiský, Karl Moritz Hermann, and Phil Blunsom. Learning bilingual word
representations by marginalizing alignments. arXiv preprint arXiv:1405.0947,
2014.
Philipp Koehn. Statistical machine translation. Cambridge University Press, 2009.
Philipp Koehn and Kevin Knight. Learning a translation lexicon from monolingual
corpora. In Proc. ACL workshop on Unsupervised lexical acquisition, 2002.
Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation.
arXiv preprint arXiv:1706.03872, 2017.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based trans-
lation. In Proc. NAACL, 2003.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed-
erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard
Zens, et al. Moses: Open source toolkit for statistical machine translation. In
Proc. ACL, interactive poster and demonstration sessions, 2007.
Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. Cognates can improve sta-
tistical translation models. In Proc. NAACL, 2003.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato.
Unsupervised machine translation using monolingual corpora only. In Proc.
ICLR, 2018a.
GuillaumeLample,MyleOtt,AlexisConneau,LudovicDenoyer,andMarc’Aurelio
Ranzato. Phrase-based & neural unsupervised machine translation. In Proc.
EMNLP, 2018b.
Septina Dian Larasati and Vladislav Kuboň. A study of Indonesian-to-Malaysian
MT system. In Proc. MALINDO workshop, 2010.
Alon Lavie, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ari-
adna Font Llitjos, Rachel Reynolds, Jaime Carbonell, and Richard Cohen.
Experiments with a Hindi-to-English transfer-based MT system under a miserly
data scenario. ACM transactions on Asian language information processing
(TALIP), 2(2), 2003.
Young-Suk Lee. Morphological analysis for statistical machine translation. In Proc.
NAACL, 2004.
Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In Proc.
NAACL, 2006.
Pierre Lison and Jörg Tiedemann. OpenSubtitles2016: Extracting large parallel
corpora from movie and tv subtitles. In Proc. LREC, 2016.
Adam Lopez and Matt Post. Beyond bitext: Five open problems in machine
translation. In Proc. EMNLP workshop on twenty years of bitext, 2013.
Jeff Ma, Spyros Matsoukas, and Richard Schwartz. Improving low-resource statis-
tical machine translation with a novel semantic word clustering algorithm. In
Proc. MT Summit XIII, 2011.
Gideon S Mann and David Yarowsky. Multipath translation lexicon induction via
bridge languages. In Proc. NAACL, 2001.
Jonathan May, Yassine Benjira, and Abdessamad Echihabi. An Arabizi-English
social media statistical machine translation system. In Proc. AMTA, 2014.
Coşkun Mermer and Murat Saraçlar. Bayesian word alignment for statistical
machine translation. In Proc. ACL, 2011.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
Robert C Moore. Improving IBM word-alignment model 1. In Proc. ACL, 2004.
Robert Munro. Crowdsourcing and the crisis-affected community. Information
retrieval, 16(2), 2013.
Iftekhar Naim and Daniel Gildea. Feature-based decipherment for large vocabulary
machine translation. arXiv preprint, 2015.
Preslav Nakov and Hwee Tou Ng. Improved statistical machine translation for
resource-poor languages using related resource-rich languages. In Proc. EMNLP,
2009.
Preslav Nakov and Jörg Tiedemann. Combining word-level and character-level
models for machine translation between closely-related languages. In Proc. ACL,
2012.
Malte Nuhn, Arne Mauser, and Hermann Ney. Deciphering foreign language by
combining language models and context vectors. In Proc. ACL, 2012.
Malte Nuhn, Julian Schamper, and Hermann Ney. Beam search for solving sub-
stitution ciphers. In Proc. ACL, 2013.
Franz Josef Och. Minimum error rate training in statistical machine translation.
In Proc. ACL, 2003.
Franz Josef Och and Hermann Ney. A systematic comparison of various statistical
alignment models. Computational linguistics, 29(1), 2003.
Hideo Okuma, Hirofumi Yamamoto, and Eiichiro Sumita. Introducing a transla-
tion dictionary into phrase-based SMT. IEICE transactions on information and
systems, 91(7), 2008.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method
for automatic evaluation of machine translation. In Proc. ACL, 2002.
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global
vectors for word representation. In Proc. EMNLP, 2014.
Mohammad Taher Pilevar, Heshaam Faili, and Abdol Hamid Pilevar. TEP: Tehran
English-Persian parallel corpus. In Proc. CICLing, 2011.
Nima Pourdamghani and Kevin Knight. Deciphering related languages. In Proc.
EMNLP, 2017.
Nima Pourdamghani, Yang Gao, Ulf Hermjakob, and Kevin Knight. Aligning
English strings with abstract meaning representation graphs. In Proc. EMNLP,
2014.
Katharina Probst, Lori Levin, Erik Peterson, Alon Lavie, and Jaime Carbonell.
MT for minority languages using elicitation-based learning of syntactic transfer
rules. Machine Translation, 17(4), 2002.
Reinhard Rapp. Identifying word translations in non-parallel texts. In Proc. ACL,
1995.
Sujith Ravi. Scalable decipherment for machine translation via hash sampling. In
Proc. ACL, 2013.
Sujith Ravi and Kevin Knight. Learning phoneme mappings for transliteration
without parallel data. In Proc. ACL, 2009.
Sujith Ravi and Kevin Knight. Bayesian inference for Zodiac and other homophonic
ciphers. In Proc. ACL, 2011a.
Sujith Ravi and Kevin Knight. Deciphering foreign language. In Proc. ACL, 2011b.
Koustav Rudra, Subham Ghosh, Niloy Ganguly, Pawan Goyal, and Saptarshi
Ghosh. Extracting situational information from microblogs during disaster
events: a classification-summarization approach. In Proc. CIKM, 2015.
Wael Salloum and Nizar Habash. Dialectal to standard Arabic paraphrasing to
improve Arabic-English statistical machine translation. In Proc. ACL workshop
on algorithms and resources for modeling of dialects and language varieties, 2010.
Hassan Sawaf. Arabic dialect handling in hybrid machine translation. In Proc.
AMTA, 2010.
Kevin P. Scannell. Machine translation for closely related language pairs. In Proc.
LREC Workshop on Strategies for developing machine translation for minority
languages, 2006.
Jason R Smith, Chris Quirk, and Kristina Toutanova. Extracting parallel sentences
from comparable corpora using document level alignment. In Proc. NAACL,
2010.
Jason R Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris
Callison-Burch, and Adam Lopez. Dirt cheap web-scale parallel text from the
common crawl. In Proc. ACL, 2013.
Theerawat Songyot and David Chiang. Improving word alignment using word
similarity. In Proc. EMNLP, 2014.
Jörg Tiedemann. Combining clues for word alignment. In Proc. EACL, 2003.
Jörg Tiedemann. Character-based PSMT for closely related languages. In Proc.
EAMT, 2009.
Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proc. LREC,
2012.
Kristina Toutanova, H Tolga Ilhan, and Christopher D Manning. Extensions to
HMM-based statistical word alignment models. In Proc. EMNLP, 2002.
Masao Utiyama and Hitoshi Isahara. A comparison of pivot methods for phrase-
based statistical machine translation. In Proc. NAACL, 2007.
Vuong Van Bui and Cuong Anh Le. Smoothing parameter estimation framework
for IBM word alignment models. arXiv preprint arXiv:1601.03650, 2016.
Vincent Vandeghinste, Ineke Schuurman, Michael Carl, Stella Markantonatou, and
Toni Badia. METIS-II: Machine translation for low resource languages. In Proc.
LREC, 2006.
István Varga, Motoki Sano, Kentaro Torisawa, Chikara Hashimoto, Kiyonori
Ohtake, Takao Kawai, Jong-Hoon Oh, and Stijn De Saeger. Aid is out there:
Looking for help from tweets during a large scale disaster. In Proc. ACL, 2013.
Ashish Vaswani, Liang Huang, and David Chiang. Smaller alignment models for
better translations: unsupervised word alignment with the l0-norm. In Proc.
ACL, 2012.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Proc. NIPS, 2017.
David Vilar, Jan-T. Peter, and Hermann Ney. Can we translate letters? In Proc.
ACL workshop on Statistical Machine Translation, 2007.
Andrew Viterbi. Error bounds for convolutional codes and an asymptotically opti-
mum decoding algorithm. IEEE Transactions on Information Theory, 13(2),
1967.
Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word align-
ment in statistical translation. In Proc. COLING, 1996.
Hua Wu and Haifeng Wang. Pivot language approach for phrase-based statistical
machine translation. Machine Translation, 21(3), 2007.
Hua Wu, Haifeng Wang, and Chengqing Zong. Domain adaptation for statistical
machine translation with domain dictionary and monolingual corpora. In Proc.
COLING, 2008.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi,
WolfgangMacherey, MaximKrikun, YuanCao, QinGao, KlausMacherey, etal.
Google’s neural machine translation system: Bridging the gap between human
and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. Unsupervised neural machine
translation with weight sharing. In Proc. ACL, 2018.
Victor H. Yngve. Sentence-for-sentence translation. Mechanical Translation, 2(2):
29–37, 1955.
Hui Zhang and David Chiang. Kneser-Ney smoothing on expected counts. In Proc.
ACL, 2014.
Hao Zheng, Yong Cheng, and Yang Liu. Maximum expected likelihood estimation
for zero-resource neural machine translation. In Proc. IJCAI, 2017.
Abstract
We provide new tools and techniques for improving machine translation for low-resource languages. Thanks to massive training data and powerful machine translation techniques, translation quality has reached acceptable levels for a handful of languages. For hundreds of other languages, however, quality degrades quickly as the available training data shrinks. For languages with a few million tokens of translation data or fewer (called low-resource languages in this dissertation), traditional machine translation technologies fail to produce understandable translations into English. In this dissertation, we explore various non-traditional resources for improving low-resource machine translation.

We introduce three approaches for improving low-resource machine translation: 1) adapting machine translation algorithms to the low-resource setting; 2) training the system on resources other than the traditionally used source/English translation pairs (called parallel data); 3) building massively multilingual tools that work out of the box for any language to help machine translation.

We address these approaches as follows: 1) We present two methods that improve state-of-the-art algorithms for word alignment, an essential step in traditional machine translation systems; these methods are designed to work best when the parallel data is small. 2) We present a method for translating text between related languages, trained on monolingual data only; we use it to borrow training data from a related language to compensate for the lack of source/target parallel data. 3) We propose a two-step pipeline for building a rapid neural MT system for any language: glossing the input into a pseudo-translation, then translating the pseudo-translation into the target language using a model built in advance from large parallel data drawn from a set of high-resource languages.
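The first step of the two-step pipeline can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: the dictionary and sentence are toy examples, and a real glosser would handle morphology and one-to-many translations.

```python
def gloss(tokens, dictionary):
    """Replace each source token with its dictionary gloss when one exists;
    copy the token through unchanged otherwise (e.g. names, numbers)."""
    return [dictionary.get(t, t) for t in tokens]

# Toy Spanish-English dictionary (illustrative only).
toy_dict = {"la": "the", "casa": "house", "es": "is", "grande": "big"}

pseudo = gloss(["la", "casa", "es", "grande"], toy_dict)
print(" ".join(pseudo))  # -> "the house is big"
```

In the pipeline described above, the resulting pseudo-translation would then be passed to a translation model built in advance from high-resource parallel data, which repairs word order and fluency.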
Asset Metadata
Creator: Pourdamghani, Nima (author)
Core Title: Non-traditional resources and improved tools for low-resource machine translation
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 02/14/2019
Defense Date: 01/11/2019
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: low-resource languages, machine translation, natural language processing, OAI-PMH Harvest
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Knight, Kevin (committee chair); May, Jonathan (committee member); Narayanan, Shrikanth (committee member)
Creator Email: damghani@gmail.com, pourdamg@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-120458
Unique Identifier: UC11675648
Identifier: etd-Pourdamgha-7064.pdf (filename); usctheses-c89-120458 (legacy record id)
Legacy Identifier: etd-Pourdamgha-7064.pdf
Dmrecord: 120458
Document Type: Dissertation
Rights: Pourdamghani, Nima
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA