BEYOND PARALLEL DATA - DECIPHERMENT FOR BETTER QUALITY MACHINE TRANSLATION

by

Qing Dou

A Ph.D. Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2015

Copyright 2015 Qing Dou

Dedication

To my dear mother, Minzhi Lv, for her endless love and support.

Acknowledgment

First of all, I would like to acknowledge my Ph.D. advisor, Kevin Knight, for his valuable guidance throughout the past 5 unforgettable years. Kevin has been a generous supporter, a great inspiration, and a role model for conducting cutting-edge research. It is my great fortune to have met him and become his student. He has provided me with the best resources and advice that any Ph.D. student could hope for from his mentoring. I am indebted to him for all the endless support, help, and opportunities that he has offered me throughout my research career.

Besides Kevin, I also owe my gratitude to my other committee members: David Chiang, Daniel Marcu, Shri Narayanan, and Kenji Sagae, for their insightful feedback.

My research wouldn't have been successful without discussions and collaborations with many people, including but not limited to Ashish Vaswani, Tomer Levinboim, Malte Nuhn, Dekang Lin, Kevin Small, Shu Cai, Hui Zhang, Bo Wu, Ulf Hermjakob, Jonathan May, Aliya Deri, Yinggong Zhao, Yang Feng, Jason Riesa, Victoria Fossum, Dirk Hovy, and Zornitsa Kozareva. Special thanks to Ashish for the pleasant collaboration experience, Malte for the great time we shared at every conference, as well as Dekang and Kevin for offering me valuable industrial research experiences.

I would also like to express my gratitude to other USC/ISI faculty and staff for their support: Yigal Arens, Yolanda Gil, Jerry Hobbs, Aram Galstyan, Jose-Luis Ambite, Peter Zamar, Kary Lau, and Alma Nava. Last but not least, I would like to thank all my lovely friends who always stand by me and give me support through difficult times.

Contents

Dedication
Acknowledgment
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Parallel Data and Statistical Machine Translation
    1.1.1 Word Alignment
    1.1.2 Rule Extraction
  1.2 Non-Parallel Data in Machine Translation
  1.3 Contributions
  1.4 Thesis Outline
2 Previous Work
  2.1 Word Vector Based Approach
  2.2 Decipherment Approach
    2.2.1 Letter Substitution Ciphers
    2.2.2 Probabilistic Decipherment
    2.2.3 Deciphering Foreign Languages
3 Solving Word Substitution Ciphers
  3.1 Word Substitution Ciphers
    3.1.1 Slice Sampling
    3.1.2 Deciphering with Bigrams
    3.1.3 Iterative Sampling
    3.1.4 Parallel Sampling
  3.2 Experiments
    3.2.1 Deciphering Gigaword Corpus
    3.2.2 Deciphering Military Check-Point Corpus
4 Improving French-Spanish Out-of-Domain Machine Translation
  4.1 Baseline Phrase-Based System
  4.2 Learning a New Translation Table with Decipherment
  4.3 Combining Translation Tables
  4.4 Data and Results
5 Dependency-Based Decipherment
  5.1 From Adjacent Bigrams to Dependency Bigrams
  5.2 Deciphering Spanish Gigaword
    5.2.1 Data
    5.2.2 Systems
    5.2.3 Iterative Sampling with Multiple Random Restarts
    5.2.4 Deciphering Accuracy
    5.2.5 Results
6 Resource-Limited Machine Translation with Decipherment
  6.1 Data
  6.2 Systems
    6.2.1 Baseline Machine Translation System
    6.2.2 Decipherment for Machine Translation
    6.2.3 Improving Translation of Observed Words with Decipherment
    6.2.4 Improving OOV Translation with Decipherment
    6.2.5 A Combined Approach
  6.3 Results
7 Deciphering Malagasy
  7.1 The Malagasy Language
  7.2 The Challenges
    7.2.1 Challenge 1: Lack of Parallel Data
    7.2.2 Challenge 2: Limited Monolingual Data
    7.2.3 Challenge 3: Poor Parsing Quality
  7.3 Deciphering Malagasy: Preliminary Results
  7.4 Improving Malagasy Dependency Parsing
  7.5 Joint Word Alignment and Decipherment
    7.5.1 A New Objective Function
    7.5.2 Word Alignment
    7.5.3 Decipherment
    7.5.4 Joint Optimization
  7.6 Word Alignment Experiments
    7.6.1 Experiment Setup
    7.6.2 Results
  7.7 Machine Translation Experiments
    7.7.1 Data
    7.7.2 Baseline Machine Translation System
    7.7.3 Joint Word Alignment and Decipherment for Machine Translation
    7.7.4 Results
8 Unifying Bayesian Inference and Vector Space Models for Improved Decipherment
  8.1 Decipherment Model: Revisit
  8.2 Base Distribution with Cross-Lingual Word Similarities
  8.3 Deciphering Spanish
    8.3.1 Data
    8.3.2 Systems
    8.3.3 Sampling Procedure
  8.4 Deciphering Malagasy
    8.4.1 Data
    8.4.2 Systems
    8.4.3 Sampling Procedure
  8.5 Results
9 Conclusion and Future Work
  9.1 Conclusions
  9.2 Future Work
A MonoGIZA
  A.1 Compiling and Installation
  A.2 Example Usage
    A.2.1 Letter Substitution Cipher Decipherment
    A.2.2 Japanese-English Phoneme Decipherment
    A.2.3 Spanish-English Decipherment
  A.3 Options

List of Tables

1.1 A mini parallel corpus with 12 sentences in English and Spanish
1.2 An example of a phrase table
2.1 Key table used to create the cipher in Figure 2.1
3.1 Conversion of full sentences to bigrams with counts
3.2 Size of English Gigaword training and testing data
3.3 Size of military check-point training and testing data
3.4 Deciphering accuracy on military check-point corpus
3.5 Sample decipherment of military check-point corpus
4.1 Size of Europarl training, tuning, and testing data
4.2 Size of EMEA decipherment training data
4.3 Using decipherment to improve SMT. Each row has a different set of training, tuning, and testing data. Baseline is trained on parallel data only. Tune LM and Test LM specify the language models used for tuning and testing respectively. Decipher-CP and Decipher-NP use a phrase table learnt from the comparable and non-parallel EMEA corpus respectively.
4.4 10 most frequent OOV words in the table learnt from the non-parallel EMEA corpus
5.1 Comparison of adjacent bigrams (left) and dependency bigrams (right) extracted from the same Spanish text
5.2 Size of data from the AFP (Agence France Presse) section of the Gigaword corpus
5.3 Groups of dependency relations used in decipherment
6.1 Number of tokens in training, tuning, and testing data for resource-limited MT experiments
6.2 Systems that use translation lexicons learned from decipherment show consistent improvement over the baseline system.
6.3 Decipherment finds correct translations for 7 out of the 10 most frequent OOV word types.
7.1 Head-Child POS patterns used in decipherment
7.2 Size of parallel and non-parallel data for word alignment experiments (measured in number of tokens)
7.3 Size of Malagasy and English data used in Malagasy-English machine translation experiments (measured in number of tokens)
7.4 Size of training, tuning, and testing data in number of tokens (GV: Global Voices)
7.5 Decipher-Pipeline does not show significant improvement over the baseline system. In contrast, Decipher-Joint, using the joint word alignment and decipherment approach, achieves a Bleu gain of 0.9 and 2.1 on the Global Voices test set and the web news test set, respectively. The results in brackets are obtained using a parser trained with only 120 sentences. (GV: Global Voices)
8.1 Size of data in tokens used in the Spanish/English decipherment experiment
8.2 Size of data in tokens used in the Malagasy/English decipherment experiment. Global Voices is a parallel corpus.
8.3 Spanish/English decipherment top-5 accuracy (%) of the 5k and 10k most frequent word types
8.4 Malagasy/English decipherment top-5 accuracy (%) of the 5k and 10k most frequent word types
A.1 MonoGIZA options

List of Figures

1.1 Standard machine translation pipeline
2.1 Encryption and decryption of a letter substitution cipher
2.2 Word alignment between a Chinese-English sentence pair shows the translation process is not only simple substitution.
3.1 Encryption and decryption of a word substitution cipher
3.2 Learning curve for a large word substitution cipher
5.1 Learning curves for Spanish-English decipherment
6.1 Improving machine translation with decipherment (grey boxes represent new data and processes).
7.1 Word alignment showing different word orders between English and Malagasy
7.2 Comparison of learning curves for Malagasy-English decipherment with a poor dependency parser
7.3 Comparison of learning curves for Malagasy-English decipherment with an improved dependency parser
7.4 Previous word alignment and decipherment pipeline
7.5 Joint word alignment and decipherment
7.6 Joint word alignment and decipherment with EM
7.7 Learning curve showing that the joint word alignment and decipherment approach improves word alignment quality over traditional EM without decipherment (Model 1: iterations 1 to 10, HMM: iterations 11 to 15)
8.1 Iterative sampling procedures
8.2 Learning curves of top-5 accuracy evaluated on the 5k most frequent word types for Spanish/English decipherment.
8.3 Learning curves of top-5 accuracy evaluated on the 5k most frequent word types for Malagasy/English decipherment.
8.4 Translation pairs are often close and sometimes overlap each other. Words in Spanish have been appended with "spanish".
8.5 Semantic groups of word-translations appear close to each other.

Abstract

Thanks to the use of parallel data and advanced machine learning techniques, we have seen tremendous improvement in the field of machine translation over the past 20 years. However, due to the lack of sufficient parallel data, the quality of machine translation is still far from satisfactory for many language pairs and domains. In general, it is easier to obtain non-parallel data, and much work has tried to discover word-level translations from non-parallel data. Nonetheless, improvements to machine translation have been limited. In this work, I follow a decipherment approach to learn translations from non-parallel data and achieve significant gains in the quality of machine translation.

First of all, I apply slice sampling to Bayesian decipherment to make it highly scalable and accurate, making it possible to decipher billions of tokens with hundreds of thousands of word types at high accuracy. Then, when it comes to deciphering foreign languages, I introduce dependency relations to address the problems of word reordering, insertion, and deletion. Experiments show that dependency relations help improve Spanish/English deciphering accuracy by over 5-fold. Last but not least, this accuracy is further doubled when word embeddings are used to incorporate more contextual information.

With faster and more accurate decipherment algorithms, I decipher large amounts of monolingual data to improve state-of-the-art machine translation systems in the scenarios of domain adaptation and low density languages. Through experiments, I show that decipherment finds high quality translations for out-of-vocabulary words in the task of domain adaptation, and helps improve word alignment when the amount of parallel data is limited. I observe Bleu gains of up to 3.8 points and 1.9 points in Spanish/French and Malagasy/English machine translation experiments, respectively. In the end, I release a decipherment package, MonoGIZA, which finds word-level translations from monolingual corpora. Its purpose is to facilitate future research in replicating and advancing the work described in this thesis.

Chapter 1
Introduction

The past decade has seen tremendous advances in the field of machine translation (MT) since Brown et al. (1993) published their work that takes a statistical approach to model the process of translation. In contrast to traditional approaches that require human experts to craft translation rules, the statistical approach automatically learns translation rules from parallel data, which consists of pairs of sentences that are translations of each other. Table 1.1 gives an example of a mini parallel corpus. The use of parallel data and machine learning techniques allows people to create better translation systems with less effort. However, the reliance on parallel data also seriously limits the scope of application, as such data is hard to obtain for many language pairs and domains. Therefore, the goal of the work in this thesis is to overcome the limitation of parallel data by finding translations from monolingual data. Let's begin with an overview of state-of-the-art MT systems, and then turn to some important questions to be investigated in this thesis.
1.1 Parallel Data and Statistical Machine Translation

As shown in Figure 1.1, parallel data plays a vital role in state-of-the-art statistical machine translation (SMT) systems: it is used to learn the translation model. Depending on the type of machine translation system, the translation model can contain word-to-word, phrase-to-phrase, phrase-to-syntactic-tree, or even syntactic-tree-to-syntactic-tree translation rules and their probabilities. Table 1.2 shows an example of the phrase table used in phrase-based SMT systems.

Table 1.1: A mini parallel corpus with 12 sentences in English and Spanish
  1a. Garcia and associates .                       1b. Garcia y asociados .
  2a. Carlos Garcia has three associates .          2b. Carlos Garcia tiene tres asociados .
  3a. his associates are not strong .               3b. sus asociados no son fuertes .
  4a. Garcia has a company also .                   4b. Garcia tambien tiene una empresa .
  5a. its clients are angry .                       5b. sus clientes estan enfadados .
  6a. the associates are also angry .               6b. los asociados tambien estan enfadados .
  7a. the clients and the associates are enemies .  7b. los clientes y los asociados son enemigos .
  8a. the company has three groups .                8b. la empresa tiene tres grupos .
  9a. its groups are in Europe .                    9b. sus grupos estan en Europa .
  10a. the modern groups sell strong pharmaceuticals . 10b. los grupos modernos venden medicinas fuertes .
  11a. the groups do not sell zenzanine .           11b. los grupos no venden zanzanina .
  12a. the small groups are not modern .            12b. los grupos pequenos no son modernos .

Table 1.2: An example of a phrase table
  German   English  P(English | German)
  Ich      I        0.8
  der      the      0.5
  der      of the   0.1
  das      the      0.3
  das ist  this is  0.8

Besides translation rules, the translation model sometimes also includes a reordering model. The reordering model is needed because the process of translation is not always monotone. For instance, in the 10th sentence pair in Table 1.1, the order of the words "modern groups" is swapped. Syntax-based systems usually don't have a separate reordering model, as their translation rules reorder words automatically.

[Figure 1.1: Standard machine translation pipeline]

In general, a translation model can be learned by the following steps. First, word alignment is performed on the parallel data. Then, heuristics are applied to extract translation rules from the word alignment.

1.1.1 Word Alignment

The goal of word alignment is to find word-level translation correspondences in parallel data. The simplest idea is to look at word co-occurrence: intuitively, if a pair of words appears together in a large number of sentence pairs, the two words are likely to be translations of each other. Brown et al. (1993) proposed several models to find word alignments automatically. Given a foreign sentence F = f_1, ..., f_j, ..., f_J and an English sentence E = e_1, ..., e_i, ..., e_I, Brown et al. (1993) described different generative processes that produce the foreign sentence from the English sentence through alignments a = a_1, ..., a_j, ..., a_J. In their work, IBM Models 1-2 (Brown et al., 1993) use two sets of parameters, distortion (reordering) probabilities and translation probabilities, to define the joint probability of a foreign sentence and an alignment given an English sentence:

    P(F, a | E) = \prod_{j=1}^{J} d(a_j | a_{j-1}, j) \, t(f_j | e_{a_j})    (1.1)
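To make Equation 1.1 concrete, below is a minimal sketch that scores one (sentence pair, alignment) triple under this Model 2-style factorization. The distortion and translation tables `d` and `t`, and the convention of using -1 as the "start" position for the first distortion term, are illustrative assumptions, not the original IBM implementation.

```python
def model2_prob(F, E, a, d, t):
    """Score P(F, a | E) as in Eq. 1.1: for each foreign position j,
    multiply the distortion probability d(a_j | a_{j-1}, j) by the
    translation probability t(f_j | e_{a_j})."""
    prob = 1.0
    prev = -1  # conventional start position for the first distortion term
    for j, f_j in enumerate(F):
        prob *= d.get((a[j], prev, j), 0.0) * t.get((f_j, E[a[j]]), 0.0)
        prev = a[j]
    return prob

# Toy run with invented parameter values for "la casa" / "the house"
t = {("la", "the"): 0.7, ("casa", "house"): 0.8}
d = {(0, -1, 0): 0.9, (1, 0, 1): 0.9}
print(model2_prob(["la", "casa"], ["the", "house"], a=[0, 1], d=d, t=t))  # ~0.4536
```

In EM training, such per-alignment scores are summed over all alignments to obtain expected counts, from which d and t are re-estimated.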
Following Brown et al. (1993), a large number of works have been proposed to improve the quality of word alignment. These include, but are not limited to, the HMM word alignment model (Vogel et al., 1996), hierarchical search for word alignment (Riesa and Marcu, 2010), and alignment with regularization (Vaswani et al., 2012a).

1.1.2 Rule Extraction

Following word alignment, heuristics are applied to extract translation rules from the word-aligned parallel data. The form of translation rules varies significantly between different types of translation systems. While a phrase-based system contains rules that translate phrases to phrases (a phrase is defined as a contiguous segment of words) (Koehn et al., 2003), a syntax-based system can have rules that map segments of source words (not necessarily contiguous) to syntax trees (Galley et al., 2004).

The probabilities of translation rules are usually obtained using maximum likelihood estimation, as sketched below. To build a good translation model, large amounts of parallel data, with tens or even hundreds of millions of tokens, are required to give robust probability estimates for the translation rules. Such data is also needed for different domains, as a shift in domain usually leads to a significant drop in performance (Callison-Burch et al., 2008). However, large amounts of parallel data are hard to obtain for many languages and domains, as the construction of parallel data requires expensive effort from human translators.
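As a small illustration of the maximum likelihood (relative-frequency) estimation mentioned above, the sketch below computes P(English phrase | German phrase) from a list of extracted phrase pairs, in the spirit of Table 1.2; the pair counts are invented for the example.

```python
from collections import Counter

def relative_frequency(phrase_pairs):
    """Maximum likelihood estimates P(e_phrase | f_phrase) =
    count(f, e) / count(f), as in the phrase table of Table 1.2."""
    joint = Counter(phrase_pairs)                 # counts of (f, e) pairs
    source = Counter(f for f, _ in phrase_pairs)  # counts of f alone
    return {(f, e): n / source[f] for (f, e), n in joint.items()}

# Toy extracted pairs echoing Table 1.2 (counts are invented)
pairs = [("der", "the")] * 5 + [("der", "of the")] * 1 + [("das", "the")] * 3
for (f, e), p in relative_frequency(pairs).items():
    print(f"{f} -> {e}: {p:.2f}")  # der -> the: 0.83, der -> of the: 0.17, das -> the: 1.00
```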
1.2 Non-Parallel Data in Machine Translation

As shown in Figure 1.1, besides parallel data, non-parallel target-language data is used in state-of-the-art MT systems to build a language model. During decoding, the language model helps the translation system generate grammatical and sensible outputs.

Besides target-side monolingual data, there are also large amounts of non-parallel source-language data available. In general, it is easier to collect non-parallel data, and the ability to use large amounts of non-parallel data to construct better translation models may alleviate the problems brought by insufficient parallel data when trying to build a high quality machine translation system. Could we use not only target-side, but also source-side monolingual data to improve the quality of translation?

Motivated by the above idea, researchers first investigated whether it is possible to find sufficient, good quality word-level translations from non-parallel data. Rapp (1995) found that words that are translations of each other appear in similar contexts across different languages. Based on this observation, Rapp (1995) showed that it is possible to find good translations from non-parallel texts. Since then, there has been growing interest in mining translations from non-parallel data, including Haghighi et al. (2008), who used canonical correlation analysis (Hardoon et al., 2004) to model latent contextual information, and Koehn and Knight (2002) and Irvine and Callison-Burch (2013a), who built translation lexicons from monolingual corpora by combining multiple sources of evidence. However, to improve the quality of machine translation, finding word-level translations is not enough.

Another idea is to expand the size of the parallel data by extracting parallel sentences from non-parallel data. Munteanu and Marcu (2005) trained a parallel sentence classifier using parallel data, and used the classifier to identify parallel sentences in large amounts of non-parallel data.

Compared with work that uses non-parallel data to improve the quality of MT systems trained on parallel data, a more ambitious goal is to build MT systems using only non-parallel data. Towards this goal, Klementiev et al. (2012) used non-parallel data to estimate the parameters of a large scale MT system. Other work has tried to learn full MT systems using only non-parallel data (Ravi and Knight, 2011a; Ravi, 2013). However, the performance of systems trained with only non-parallel data is poor compared with those trained on parallel data. In reality, there is often a small amount of parallel data. Therefore, it is more practical to ask whether we can use larger amounts of non-parallel data to improve machine translation systems trained on limited amounts of parallel data, and whether we can use in-domain non-parallel data to improve translation systems trained on out-of-domain parallel data.

1.3 Contributions

Motivated by the above questions, this work takes a decipherment approach to improve machine translation for domain adaptation and resource-limited languages. The contributions of this work are:

- This work improves state-of-the-art decipherment techniques by integrating slice sampling (Neal, 2000) into Bayesian decipherment. For the first time, it is possible to solve a large word substitution cipher with billions of tokens and hundreds of thousands of word types at an accuracy of 92.2% (Dou and Knight, 2012).
- This work advances the decipherment of foreign languages by exploring dependency relations and word embeddings. Experimental results show that dependency relations alone improve the accuracy of Spanish/English decipherment over 5-fold (Dou and Knight, 2013), and adding word embeddings further doubles the accuracy (Dou et al., 2015).
- This work shows that decipherment finds good translations for out-of-vocabulary words in the task of domain adaptation, and helps improve word alignment in the scenario of low density language machine translation. Experiments show that using large amounts of non-parallel data improves out-of-domain and low density language machine translation significantly (Dou and Knight, 2012, 2013; Dou et al., 2014).
- Last but not least, this work leads to the release of a decipherment package, MonoGIZA, which can be used in future research to replicate and advance this line of work.

1.4 Thesis Outline

The remainder of the thesis is organized as follows:

- Chapter 2 discusses previous related work on finding translations from non-parallel data: the word vector based approach and the decipherment approach.
- Chapter 3 discusses the limitations of the previous decipherment approach, and proposes methods to make decipherment much more scalable and accurate.
- Chapter 4 applies the improved decipherment algorithm to discover translations for out-of-vocabulary words in a Spanish-French MT domain adaptation task.
- Chapter 5 discusses a solution to the problem of word reordering when deciphering foreign languages with significant differences in word order.
- Chapter 6 demonstrates the effectiveness of the approach proposed in Chapter 5 by using decipherment to improve Spanish-English MT in a low resource scenario.
- Chapter 7 discusses difficulties in deciphering Malagasy, a real low density language, and proposes joint word alignment and decipherment to overcome those difficulties.
- Chapter 8 presents a new decipherment algorithm, which takes advantage of both the word vector based approach and the existing decipherment approach to achieve even higher deciphering accuracy.
- Chapter 9 concludes this thesis, summarizing contributions and proposing possible future work.
Chapter 2
Previous Work

Although the majority of research in MT is based on parallel data, there is also a large body of work that explores the use of non-parallel data to advance state-of-the-art MT systems. In this chapter, I first review the word vector based approach for finding translations from non-parallel data, then compare it with recent work in decipherment, and finally discuss the limitations of previous decipherment work.

2.1 Word Vector Based Approach

Motivated by the idea that a translation lexicon induced from non-parallel data can be used to improve the quality of machine translation, a number of previous works have tried to build a translation lexicon from non-parallel or comparable data (Rapp, 1995; Fung and Yee, 1998; Koehn and Knight, 2002; Haghighi et al., 2008; Garera et al., 2009; Bergsma and Van Durme, 2011; Daumé and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013a,b). These works are based on a key observation: if two words are translations of each other, they appear in similar contexts (Rapp, 1995). The intuition behind this is that words sharing similar contexts are also semantically similar.

The first step of the approach is to convert each word into a vector representation. In the simplest case, the size of the vector is equal to the size of the vocabulary, with each dimension representing a word. Given this representation, the similarity between two words is given by the cosine similarity of their vectors. Given two vectors A and B, the cosine similarity is

    cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}    (2.1)

The similarity measure is 1 if A and B are exactly the same, and 0 if A and B are orthogonal.

For the same language, given a word and its context vector, it is straightforward to find other words that have similar meanings. The challenge comes when we want to find translations, as the vector spaces of different languages are not the same, making it impossible to compute similarities between vectors directly. In order to compare the similarity of context vectors across languages, one must first map the vectors into the same space. This can be achieved by using a seed lexicon: one can limit the size of the vector to the size of the seed lexicon, with each dimension representing an entry in the seed lexicon. Alternatively, one can learn a mapping between the two vector spaces using a seed lexicon (Haghighi et al., 2008; Daumé and Jagarlamudi, 2011). Either way, a seed lexicon is needed to bootstrap the approach, and it is not always easy to build one that ensures the best performance for a specific task.

Although previous work is able to find word-to-word translations without parallel data, the improvements achieved in machine translation have been limited (Daumé and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013a,b). This is partially due to difficulties in estimating translation probabilities. To obtain translation probabilities, a similarity score threshold is chosen; scores that pass the threshold are then normalized among the top candidates. The selection of the threshold is always tricky and is usually based on empirical system performance.
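A minimal sketch of the context-vector similarity computation of Equation 2.1, assuming sparse vectors stored as dictionaries from context words to co-occurrence counts (an assumed representation chosen for brevity):

```python
import math

def cosine(a, b):
    """Cosine similarity (Eq. 2.1) between two sparse context vectors,
    represented as dicts mapping context word -> co-occurrence count."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical vectors score ~1.0; vectors with no shared context score 0.0
print(cosine({"drink": 3, "cold": 1}, {"drink": 3, "cold": 1}))  # ~1.0
print(cosine({"drink": 3}, {"engine": 2}))                       # 0.0
```

For the cross-lingual case described above, the two dictionaries would first have to be re-indexed into a shared space, e.g. by mapping context words through a seed lexicon.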
2.2 Decipherment Approach

To address the problems with the vector based approach, a probabilistic model that does not require any form of bootstrapping is desirable. Decipherment is the analysis of documents written in ancient languages that are unknown or lost; the goal is to interpret the content of those documents. When foreign languages are viewed as such ancient languages, we can borrow ideas from decipherment. In fact, the idea is not entirely new. Machine translation has its roots in decipherment:

  "When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." — Weaver (1955)

Lately, there has been growing interest in automatic decipherment problems in the natural language processing community (Knight and Yamada, 1999; Knight et al., 2006; Ravi and Knight, 2008, 2009; Snyder et al., 2010; Corlett and Penn, 2010; Knight et al., 2011; Reddy and Knight, 2011; Ravi and Knight, 2011c,a; Berg-Kirkpatrick and Klein, 2011; Reddy and Knight, 2012; Nuhn et al., 2012b; Kim and Snyder, 2013; Berg-Kirkpatrick and Klein, 2013; Corlett and Penn, 2013; Nuhn and Ney, 2013; Nuhn et al., 2013; Ravi, 2013).

Before jumping directly into deciphering languages, I will introduce how to decipher letter substitution ciphers effectively. Substitution ciphers are interesting because the process of translation also involves a large number of substitutions at the word level.

2.2.1 Letter Substitution Ciphers

In cryptography, substitution ciphers are a group of ciphers created by replacing units of plaintext with ciphertext. The units may be letters, groups of letters, words, and so forth. When the units of substitution are letters, the resulting cipher is called a letter substitution cipher. Figure 2.1 provides an example of a letter substitution cipher created by replacing each letter in the plaintext with its cipher code. Table 2.1 is the key table, which contains the mapping between each letter and its cipher code. In Table 2.1, the mapping is deterministic. When the mapping is not deterministic, so that a single plaintext letter can be replaced by any of several different cipher letters, the resulting cipher is called a homophonic letter substitution cipher. Finding the key table as well as the mapping probabilities is a key part of solving a substitution cipher. This is a relatively old problem and has been well studied in previous work (Ravi and Knight, 2008; Corlett and Penn, 2010; Ravi and Knight, 2011b; Corlett and Penn, 2013).

[Figure 2.1: Encryption and decryption of a letter substitution cipher]

Table 2.1: Key table used to create the cipher in Figure 2.1 (plaintext / ciphertext)
  a 17    h 14    o 09    v 24
  b 01    i 25    p 12    w 06
  c 19    j 18    q 23    x 26
  d 05    k 21    r 04    y 03
  e 13    l 08    s 22    z 27
  f 20    m 02    t 16    (space) 15
  g 10    n 11    u 07

2.2.2 Probabilistic Decipherment

Knight et al. (2006) followed a noisy-channel approach to solve letter substitution ciphers. In this work, a ciphertext sequence f = f_1 ... f_n is modeled with the following generative story:
- Generate an English plaintext sequence e = e_1 ... e_n with probability P(e).
- Replace each English plaintext letter e_i with a cipher code f_i with probability P(f_i | e_i).

Based on the above generative story, the probability of the observed cipher string f is:

    P(f) = \sum_{e} \prod_{i=1}^{n} P(f_i | e_i) P(e)    (2.2)

where P(e) is given by a letter n-gram language model learned from large amounts of English text. Knight et al. (2006) then applied the EM algorithm (Dempster et al., 1977) to search for the P(f_i | e_i) values. The EM algorithm has a complexity of O(N V^n), where N is the length of the ciphertext, V is the number of unique plaintext symbols, and n is the order of the language model. Ravi and Knight (2011c) applied Bayesian inference to the decipherment of the Zodiac cipher and observed better performance than with the EM algorithm.
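Equation 2.2 sums over every possible plaintext sequence, but with a bigram language model the sum can be computed exactly by a forward (lattice) dynamic program — the same quantity EM needs at each iteration. Below is a minimal sketch under assumed interfaces: `lm(prev, cur)` returns the bigram plaintext probability and `channel[(f, e)]` the substitution probability; both names are hypothetical.

```python
def cipher_prob(cipher, alphabet, lm, channel):
    """Exact P(f) of Eq. 2.2 under a bigram plaintext LM, via the forward
    algorithm: alpha[e] is the total probability of all plaintext prefixes
    ending in letter e. Runs in O(N * V^2), matching the N * V^n bound."""
    alpha = {e: lm("<s>", e) * channel.get((cipher[0], e), 0.0) for e in alphabet}
    for f in cipher[1:]:
        alpha = {e: channel.get((f, e), 0.0) *
                    sum(p * lm(prev, e) for prev, p in alpha.items())
                 for e in alphabet}
    return sum(alpha.values())
```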
2.2.3 Deciphering Foreign Languages

A cipher is called a word substitution cipher when the units of substitution are words. We can view foreign languages as word substitution ciphers of English and use decipherment to find word-level translations from non-parallel data. Compared with the vector based approach, the decipherment approach has the following advantages. First, the decipherment process does not need to be bootstrapped by any translation lexicon. Second, the output contains not only possible translations but also their probabilities.

However, foreign languages are not just simple substitution ciphers. As shown in Figure 2.2, the process of translation involves substitution, deletion, insertion, and reordering. This problem can be addressed by using a more complex generative story, like the one in IBM Model 3 (Brown et al., 1993). Following this idea, Ravi and Knight (2011a) were the first to learn a full MT system for translating movie subtitles from Spanish to English without any parallel data. Unfortunately, the performance is poor compared with systems learned from parallel data, and the approach fails to scale.

[Figure 2.2: Word alignment between a Chinese-English sentence pair shows that the translation process is not only simple substitution.]

Although decipherment addresses some problems of the word vector based approach, such as the reliance on seed lexicons and the difficulty of assigning probabilities, it faces its own issues. First, previous decipherment approaches fail to scale. Second, languages are far more complex than simple substitution ciphers. Last, decipherment has not been shown to improve the quality of machine translation. In this thesis, I address these problems.

Chapter 3
Solving Word Substitution Ciphers

As shown in the previous chapter, deciphering foreign languages is a rather challenging problem. Therefore, instead of attacking that problem directly, I first look at a similar but simpler one: solving word substitution ciphers. Unlike letter substitution ciphers, which can be solved using earlier decipherment approaches, word substitution ciphers require a more efficient algorithm. This chapter introduces two novel ideas that make decipherment much faster and more accurate.

3.1 Word Substitution Ciphers

As illustrated in Figure 3.1, word substitution ciphers are exactly the same as letter substitution ciphers except that, instead of letters, the units being replaced are words. Compared with breaking letter substitution ciphers, solving word substitution ciphers is a more interesting problem, as it is closer to the translation of languages, where a large number of word substitutions also take place.

[Figure 3.1: Encryption and decryption of a word substitution cipher]

Similarly, solving a word substitution cipher means recovering the original plaintext from the ciphertext without knowing the key. The only resource relied on is knowledge about the underlying language. While letter substitution ciphers can be solved easily, no work had been published on solving a very large word substitution cipher with high accuracy. Just recently, Nuhn and Ney (2013) proved that solving word substitution ciphers is an NP-hard problem.

Suppose we observe a large cipher string f and want to decipher it into English e. Following the idea from Weaver (1955), I assume that the cipher string f is generated in the following way:
- Generate an English plaintext sequence e = e_1, e_2 ... e_n with probability P(e).
- Replace each English plaintext token e_i with a cipher token f_i with probability P(f_i | e_i).

Based on the above generative story, I write the probability of the observed cipher string f as:

    P(f) = \sum_{e} \prod_{i=1}^{n} P(f_i | e_i) P(e)    (3.1)

I can use the above equation as an objective function for maximum likelihood training. In the equation, P(e) is given by an n-gram language model trained on large amounts of monolingual text. The rest of the task is to manipulate the channel probabilities P(f_i | e_i) so that the probability of the observed text, P(f), is maximized.

Theoretically, I can directly apply EM, as proposed by Knight et al. (2006), to find the P(f_i | e_i) that maximize the probability of the cipher string f. However, unlike letter substitution ciphers, word substitution ciphers pose much greater challenges to algorithm scalability. To solve a word substitution cipher, EM has a time complexity of O(N V^2 R) and a space complexity of O(N V^2), where N is the length of the ciphertext, V is the size of the plaintext vocabulary, and R is the number of iterations. In the world of word substitution ciphers, both V and N are very large, making EM an impractical solution. Nuhn et al. (2013) applied beam search to solve a very large word substitution cipher with high accuracy. Although that approach is scalable, it assumes that the mapping is 1:1 and deterministic; it is therefore hard to apply to deciphering foreign languages.

An alternative approach, with lower time and memory requirements, is Bayesian inference, as proposed by Ravi and Knight (2011b). Instead of searching for probabilities P(f|e) that maximize Equation 3.1, Bayesian inference draws plaintext samples from the posterior distribution according to Equation 3.2:

    P(e | f) \propto P(e) \prod_{i=1}^{n} P(f_i | e_i)    (3.2)

In the above equation, P(e) is given by an n-gram language model, while the channel probability is modeled by the Chinese Restaurant Process (CRP):

    P(f_i | e_i) = \frac{\alpha \cdot prior + count(f_i, e_i)}{\alpha + count(e_i)}    (3.3)

where prior is the base distribution (set to 1/C in all experiments, where C is the number of word types in the ciphertext), \alpha is the CRP concentration parameter, and count, also called the "cache," records events that occurred in the sampling history.
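A minimal sketch of the CRP channel model of Equation 3.3; the class name, the `update` method, and the default concentration value are assumptions made for illustration.

```python
from collections import Counter

class CRPChannel:
    """Channel model P(f | e) of Eq. 3.3: a Chinese Restaurant Process that
    interpolates a flat base distribution with cached counts of past events."""
    def __init__(self, num_cipher_types, alpha=1.0):
        self.prior = 1.0 / num_cipher_types  # base distribution, 1/C as in the text
        self.alpha = alpha                   # CRP concentration parameter
        self.pair_counts = Counter()         # the "cache": counts of (f, e) events
        self.e_counts = Counter()

    def prob(self, f, e):
        return (self.alpha * self.prior + self.pair_counts[(f, e)]) / \
               (self.alpha + self.e_counts[e])

    def update(self, f, e, delta=1):
        """Add (delta=+1) or remove (delta=-1) one observed (f, e) event."""
        self.pair_counts[(f, e)] += delta
        self.e_counts[e] += delta

channel = CRPChannel(num_cipher_types=4)
print(channel.prob("07", "man"))  # unseen pair: falls back to the prior, 0.25
channel.update("07", "man")
print(channel.prob("07", "man"))  # the cached event raises the probability
```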
Although the Bayesian approach using Gibbs sampling has a better time complexity, O(NVR), the problem still becomes intractable when it comes to solving ciphers with billions of word tokens and hundreds of thousands of word types. Ravi and Knight (2011a) propose several modifications to the Bayesian approach. However, the modified algorithm is only a poor approximation of the original algorithm, produces low deciphering accuracy, and is still unable to handle very large ciphers. To address the above problems, I propose the following two new improvements to previous decipherment methods:
- I apply slice sampling (Neal, 2000) to scale up to ciphers with a very large vocabulary.
- Instead of deciphering the original ciphertext, I collect bigrams with their counts for decipherment.

3.1.1 Slice Sampling

Ravi and Knight (2011a) use point-wise Gibbs sampling (Geman and Geman, 1987) to perform Bayesian inference. A sampling operator samples plaintext word choices for each cipher token one at a time. For each cipher token, the sampler has to consider a list of all possible plaintext choices (10k-1M English words). This becomes intractable when the vocabulary is large and the ciphertext is long. Slice sampling can solve this problem by automatically adjusting the number of samples to be considered for each sampling operation.

Suppose the derivation probability of the current sample is P(current_s). Slice sampling then draws a new sample in two steps:
- Select a threshold T uniformly from the range [0, P(current_s)].
- Draw a new sample new_s uniformly from the pool of candidates {new_s | P(new_s) >= T}.

From these two steps, we can see that, given a threshold T, the algorithm only needs to consider samples whose probability is at least the threshold. This leads to a significant reduction in the number of samples to be considered if the probabilities of most samples fall below T. As I perform point-wise sampling, each new sample is obtained by sampling a plaintext word choice for each cipher token. Therefore, unlike Gibbs sampling, which needs to look at all possible choices, slice sampling only needs to consider those that make the probability of the new sample greater than or equal to the threshold.

In practice, the first step is easy to implement, while it is difficult to make the second step efficient. An easy way to collect candidate samples is to go over all possible samples and record those with probabilities higher than T. However, doing this saves no time, as it still requires looking at all possible choices for any given cipher token. Fortunately, for Bayesian decipherment, I am able to complete the second step efficiently.

According to Equation 3.2, the probability of the current sample is given by a language model P(e) and a channel model P(f|e). Suppose the current sample current_s contains English tokens X, Y, and Z at positions i-1, i, and i+1, respectively. Let f_i be the cipher token at position i. To obtain a new sample, I just need to change token Y to Y'. Since the rest of the sample stays the same, I only need to calculate the bigram language model probability P(X Y' Z), the channel model probability P(f_i | Y'), and multiply them together, as shown in Equation 3.4:

    P(X Y' Z) \cdot P(f_i | Y')    (3.4)

Remember that, in slice sampling, each sampling operation has two steps. For the first step, a threshold T is chosen uniformly between 0 and P(X Y Z) \cdot P(f_i | Y). For the second step, there are two cases. First, I notice that two types of Y' are more likely to pass the threshold T: (1) those that have a high P(X Y' Z), and (2) those that have a high channel model probability:

    P(f_i | Y') = \frac{\alpha \cdot prior + count(f_i, Y')}{\alpha + count(Y')}    (3.5)

To find candidates of the former type, I build sorted lists ranked by P(X Y' Z), which can be precomputed offline. I only keep the top K English words for each (X, Z) pair in the language. The total number of sorted lists is V^2, where V is the vocabulary size. When the last item Y_K in the top-K list satisfies P(X Y_K Z) \cdot prior < T, a sample is drawn in the following way: let set A = {Y' | P(X Y' Z) \cdot prior >= T} and set B = {Y' | count(f_i, Y') > 0}; then I only need to sample Y' uniformly from A ∪ B until Equation 3.4 is greater than T. (It is easy to prove that all other candidates, which are not in the sorted list and have count(f_i, Y') = 0, have probability bounded above by P(X Y_K Z) \cdot prior; therefore, they can be ignored when P(X Y_K Z) \cdot prior < T.)

Second, what happens when the last item Y_K in the list does not satisfy P(X Y_K Z) \cdot prior < T? Then I choose a word Y' at random and accept it as a new sample if Equation 3.4 is greater than T.

The algorithm alternates between the two cases. The actual number of choices the algorithm looks at depends on K and the total number of possible choices. Across different experiments, I find that when K = 500, the algorithm only looks at 0.06% of all possible choices; when K = 2000, this further reduces to 0.007%.
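The following is a minimal sketch of one point-wise slice-sampling update as just described. It abstracts away the decipherment-specific machinery: `prob` is a hypothetical function returning the derivation probability of Equation 3.4 for a plaintext choice, and `pool` stands in for the restricted candidate set (the precomputed top-K list A plus the cached set B) rather than the full vocabulary.

```python
import random

def slice_update(current, pool, prob):
    """One point-wise slice-sampling operation for a single cipher token.
    Step 1: draw a threshold T uniformly below the current sample's score.
    Step 2: draw uniformly from the candidate pool until a draw clears T.
    The pool must contain `current`, which guarantees termination."""
    threshold = random.uniform(0.0, prob(current))
    while True:
        proposal = random.choice(pool)
        if prob(proposal) >= threshold:
            return proposal
```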
The actual number of choices the algorithm looks at depends on theK and the total number of possible choices. In dierent experiments, I nd that when K = 500, the algorithm only looks at 0.06% of all possible choices. When K = 2000, this further reduces to 0.007%. 3.1.2 Deciphering with Bigrams Since my decipherment algorithm described in the previous section uses a bigram language model, I posit that a frequency list of ciphertext bigrams may contain enough information for decipherment. Table 3.1 shows how full English sentences in the original data are broken into bigrams and their counts. Instead of doing sampling on full sentences, I now treat each bigram as a full \sentence". There are two advantages to use bigrams and their counts for decipherment. First of all, the bigrams and counts are a much more compact representation of the original ciphertext. For instance, after breaking a billion tokens from the English Gigaword corpus, I nd only 29m distinct bigrams and 58m total tokens, which is only 1/17 of the length of the original text. In practice, I can further discard all bigrams with only 1 count, which makes the ciphertext even shorter. Secondly, using bigrams signicantly reduces the number of sorted lists (from V 2 to 2V ) mentioned in the previous section. The reduction comes from the fact that words in a bigram only have one neighbor. Therefore, for any word W in a 36 bigram, there are only 2V lists (\words to the right of W" and \words to the left of W") instead of V 2 lists (\pairs of words that surround W"). In letter substitution experiments, I nd that deciphering using only bigrams and their counts doesn't hurt deciphering accuracy. 3.1.3 Iterative Sampling Although I can directly apply slice sampling to decipher a large number of bigrams, I nd that gradually including less frequent bigrams into a sampling process saves deciphering time. The motivation behind the process is that I nd the correct decipherment of frequent word types can be used to seed decipherment of less frequent ones and speed up the decipherment. I call this process iterative sampling: Break the ciphertext into bigrams and collect their counts Keep bigrams whose counts are greater than a threshold . Then initialize the rst sample randomly and use slice sampling to perform maximum like- lihood training. In the end, extract a translation table T according to the nal sample. Lower the threshold to include more bigrams into the sampling process. Initialize the rst sample using the translation table obtained from the pre- vious sampling run (for each cipher token f, choose a plaintext token e whose P (ejf) is the largest). Perform sampling again. Repeat until = 1. 37 3.1.4 Parallel Sampling Inspired by Newman et al. (2009), I also parallelized the sampling process as described below: Collect bigrams and their counts from ciphertext and split the bigrams into N parts. Run slice sampling on each part for 5 iterations independently. Combine counts from each part to form a new count table and run sampling again on each part using the new table. (Except for combining the counts to form a new count table, other parameters remain the same. For instance, each parti has its own prior set to 1 C i , whereC i is the number of word types in that part of ciphertext.) 3.2 Experiments In previous sections, I have proposed several techniques to address the scalability problem in solving word substitution ciphers. 
3.1.3 Iterative Sampling

Although I can directly apply slice sampling to decipher a large number of bigrams, I find that gradually including less frequent bigrams in the sampling process saves deciphering time. The motivation is that the correct decipherment of frequent word types can seed the decipherment of less frequent ones and speed up the process. I call this process iterative sampling:
- Break the ciphertext into bigrams and collect their counts.
- Keep bigrams whose counts are greater than a threshold θ. Then initialize the first sample randomly and use slice sampling to perform maximum likelihood training. In the end, extract a translation table T according to the final sample.
- Lower the threshold θ to include more bigrams in the sampling process. Initialize the first sample using the translation table obtained from the previous sampling run (for each cipher token f, choose the plaintext token e whose P(e|f) is largest). Perform sampling again.
- Repeat until θ = 1.

3.1.4 Parallel Sampling

Inspired by Newman et al. (2009), I also parallelize the sampling process as described below:
- Collect bigrams and their counts from the ciphertext and split the bigrams into N parts.
- Run slice sampling on each part for 5 iterations independently.
- Combine the counts from each part to form a new count table, and run sampling again on each part using the new table. (Except for combining the counts to form a new count table, the other parameters remain the same. For instance, each part i has its own prior, set to 1/C_i, where C_i is the number of word types in that part of the ciphertext.)

3.2 Experiments

In the previous sections, I proposed several techniques to address the scalability problem in solving word substitution ciphers. In this section, I show that while previous approaches are only able to solve a small word substitution cipher with one million tokens and several thousand word types (Ravi and Knight, 2011a), my new decipherment algorithm can solve a word substitution cipher with over one billion tokens and hundreds of thousands of word types. Moreover, I also demonstrate that the new algorithm achieves better deciphering accuracy.

3.2.1 Deciphering the Gigaword Corpus

To prove the scalability of the proposed ideas, I apply them to solve a very large word substitution cipher built from the English Gigaword corpus, which contains news articles from different news agencies. Table 3.2 gives an overview of the training and testing data. I split the corpus into two parts chronologically, each containing approximately 1.2 billion tokens. I use the first part to build a word substitution cipher, and the second part to build a bigram language model. (Before building the language model, I replace low-frequency word types with an "UNK" symbol, leaving 129k unique word types.)

Table 3.2: Size of English Gigaword training and testing data
  Training: 2.4 billion tokens, 129k word types
  Testing:  33k tokens, 10k word types

I first use a single machine and apply iterative sampling to solve a 68 million token cipher. Then I use the result from the first step to initialize the parallel sampling process, which uses as many as 100 machines. When training terminates, a translation table with probabilities P(f|e) is built from the counts collected from the final sample. Given the translation table and a plaintext language model, I use "decoding" to refer to the process of searching for the original plaintext (in full sentences). I use Moses (Koehn et al., 2007) to perform the decoding, setting the distortion limit to 0 (disabling word reordering) and cubing the translation probabilities ("stretching out the channel probabilities", Knight and Yamada (1999)). During decoding, a trigram language model is used. Essentially, Moses tries to find an English sequence e that maximizes P(e) \cdot P(f|e)^3.

I evaluate performance by measuring the percentage of cipher tokens that are correctly recovered. I only calculate accuracy over the first 1000 sentences (33k tokens) of the ciphertext, so that the evaluation is independent of the amount of ciphertext used for decipherment. After 2000 iterations of the parallel sampling process, the deciphering accuracy reaches 92.2%. Figure 3.2 shows the learning curve of the algorithm: both token and type accuracy increase as more and more data becomes available.

[Figure 3.2: Learning curve for a large word substitution cipher]

3.2.2 Deciphering the Military Check-Point Corpus

This section demonstrates that the new decipherment algorithm is not only scalable but also achieves better deciphering accuracy than previous work.

Table 3.3: Size of military check-point training and testing data
            Number of Tokens   Number of Types
  Training  3.7m               25k
  Testing   9.4k               1k

As shown in Table 3.3, I use the same corpus used in previous work (Ravi and Knight, 2011a) and split it in the same way.
One million tokens of the training data are used to create the ciphertext, and the rest is used for language model training. (In practice, I replaced singletons with an "UNK" symbol, leaving around 16,904 word types.) A bigram language model is used for decipherment, and a trigram language model is used for decoding.

Results

I use the same evaluation metric as in the previous section and only decode the first 1000 cipher sentences. Table 3.4 compares the deciphering accuracy with the state-of-the-art algorithm. Results show that the new algorithm improves deciphering accuracy from 80.0% to 88.1%.

Table 3.4: Deciphering accuracy on military check-point corpus
  Method                   Decipherment Language Model   Deciphering Accuracy
  Ravi and Knight (2011)   bigram                        80.0
  Ravi and Knight (2011)   trigram                       82.5
  Dou and Knight (2012)    bigram                        88.1

Table 3.5 shows the decipherment of the first 5 sentences and compares them with the gold plaintext. It can be seen that the proposed approach recovers the majority of the plaintext correctly.

Table 3.5: Sample decipherment of military check-point corpus
  Ciphertext:   18463 11234 23465 76453 43215 87645 11212 65324 33654 21435 12654
  Gold:         man i've come to file a complaint against some people .
  Decipherment: man i've come to hand a telephone lines some people .

  Ciphertext:   18463 54746 23221 89677 44325 12654
  Gold:         man they took our land .
  Decipherment: man they took our farm .

  Ciphertext:   54746 23221 89677 61321 44325 12654
  Gold:         they took our arable land .
  Decipherment: they took our slide door .

  Ciphertext:   20034 18463 12654
  Gold:         okay man .
  Decipherment: okay man .

  Ciphertext:   80973 31212 12654
  Gold:         eighty donums .
  Decipherment: mil ih donums .

Chapter 4
Improving French-Spanish Out-of-Domain Machine Translation

In the previous chapter, I showed that the new decipherment algorithm is not only scalable but also achieves higher deciphering accuracy than the state-of-the-art algorithm. In this chapter, I further demonstrate its value by using it to improve French-Spanish out-of-domain machine translation.

Out-of-domain machine translation is a challenging task for statistical machine translation systems trained on parallel corpora. It is common to see a significant drop in translation quality when translating texts from a domain different from the training data. Although it is hard to find in-domain parallel corpora, it is relatively easier to find in-domain monolingual corpora. In this chapter, I use decipherment to learn a domain-specific translation lexicon by deciphering in-domain (medical) monolingual corpora, and use it to improve a baseline system trained on large amounts of out-of-domain (political) parallel data. In experiments, I show that the domain-specific translation lexicon improves the baseline translation system by up to 3.8 Bleu points.

4.1 Baseline Phrase-Based System

First, I build a state-of-the-art phrase-based SMT system with Moses, using large amounts of out-of-domain parallel data from Europarl (Koehn, 2005a). The baseline system has 3 models: a translation model, a reordering model, and a language model. The language model is trained on monolingual data, and the rest are trained on parallel data. By default, Moses uses the following 8 scores to evaluate a candidate translation:
- direct and inverse translation probabilities
- direct and inverse lexical weighting
- phrase penalty
- word penalty
- score given by a language model
- score from a reordering model

Each of the 8 scores has its own weight, which can be tuned on a held-out set using minimum error rate training (Och, 2003). As shown in Table 4.3, the baseline achieves 38.2 Bleu when translating texts from the political domain.
However, performance drops significantly, to only 30.5 Bleu, when translating texts from the medical domain. The problem is caused by the lack of translations for a large number of out-of-vocabulary (OOV) words, as the translation model is trained on the political domain while the test texts come from the medical domain. In the following sections, I describe how to use decipherment to learn translations for those domain-specific OOV words and use them as new features to improve the baseline system.

4.2 Learning a New Translation Table with Decipherment

From a decipherment perspective, machine translation is a much more complex task than solving a word substitution cipher and poses three major challenges:
- mappings between languages are nondeterministic, as words can have multiple translations
- reordering of words
- insertion and deletion of words

Fortunately, the decipherment model does not assume a deterministic mapping and is able to discover multiple translations. To address the other two problems, I choose to treat French as a simple word substitution cipher of Spanish, as the two languages are very close to each other. Despite this simplifying assumption, I find that decipherment still learns a word-to-word lexicon useful for domain adaptation.

Problem formulation: By ignoring word reordering, insertion, and deletion, I formulate the MT decipherment problem as word substitution decipherment. I view the source language f as ciphertext and the target language e as plaintext. The goal is to decipher f into e and estimate translation probabilities based on the decipherment.

Probabilistic decipherment: As in solving a word substitution cipher, the goal here is to estimate the translation model parameters P(f|e) using large amounts of monolingual data in f and e respectively. According to Equation 4.1, the learning process involves drawing samples of English text from the posterior distribution P(e|f):

    P(e | f) \propto P(e) \prod_{i=1}^{n} P(f_i | e_i)    (4.1)

Building a translation table: Once the sampling process completes, I estimate the translation probabilities P(f|e) from the final sample using maximum likelihood estimation. I also decipher in the reverse direction to estimate P(e|f). Finally, I build a translation table by keeping only the translation pairs seen in both decipherments.

4.3 Combining Translation Tables

There are now two phrase tables: one learnt from the parallel corpus, which contains word-to-word and phrase-to-phrase translations, and one from the non-parallel monolingual corpus, which contains only word-to-word translations, with two scores for each translation pair. Moses has a function to decode with multiple phrase tables. During decoding, if a source word only appears in the phrase table learnt by decipherment, then that table's translation will be used exclusively. If a source word exists in both tables, Moses creates two separate decoding paths and chooses the better one after taking other features into account. If a word is not seen in either of the tables, it is copied directly to the output.
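A minimal sketch of the "building a translation table" step described in Section 4.2, under an assumed data layout: two hypothetical dictionaries hold the maximum likelihood estimates from the two decipherment directions, and only pairs attested in both survive, each keeping the two scores mentioned in Section 4.3.

```python
def build_translation_table(p_f_given_e, p_e_given_f):
    """Keep only word pairs attested in both decipherment directions,
    giving each surviving pair its two scores (Sections 4.2-4.3)."""
    table = {}
    for (f, e), score_fe in p_f_given_e.items():
        score_ef = p_e_given_f.get((e, f))
        if score_ef is not None:  # pair seen in both decipherments
            table[(f, e)] = (score_fe, score_ef)
    return table

# Toy entries echoing Table 4.4; only the pair seen in both directions survives
fwd = {("hépatique", "hepático"): 0.88, ("dl", "dl"): 1.00}
rev = {("hepático", "hépatique"): 0.08}
print(build_translation_table(fwd, rev))
```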
4.4 Data and Results

In the MT experiments, I use the following corpora to learn translation systems that translate French into Spanish:

Europarl Corpus (Koehn, 2005b): The Europarl parallel corpus is extracted from the proceedings of the European Parliament and includes versions in 11 European languages. The corpus contains articles from the political domain and is used to train our baseline system. I use the 6th version of the corpus. Details of the training, tuning, and testing data are listed in Table 4.1.

        French               Spanish
Train   28.5 million tokens  26.6 million tokens
Tune    28k tokens           26k tokens
Test    30k tokens           28k tokens

Table 4.1: Size of Europarl training, tuning, and testing data

EMEA Corpus (Tiedemann, 2009): EMEA is a parallel corpus made out of PDF documents from the European Medicines Agency. It contains articles from the medical domain, which is a good test bed for out-of-domain tasks. I use the first 2k pairs of sentences for tuning and testing (1k for each), and use the rest (1.1 million lines) for decipherment training. I split the training corpus in ways that ensure no parallel sentences are included in the training set. The splitting methods are listed in Table 4.2.

                    French                                    Spanish
Comparable EMEA     every odd line, 8.7 million tokens        every even line, 8.1 million tokens
Non-parallel EMEA   first 550k sentences, 9.1 million tokens  last 550k sentences, 7.7 million tokens

Table 4.2: Size of EMEA decipherment training data

For decipherment training, I use lexical translation tables learned from the Europarl corpus to initialize our sampling process. I compare the following 3 systems in experiments, with Bleu (Papineni et al., 2002) as a standard evaluation metric, and present the results in Table 4.3:

- Baseline: Trained on Europarl
- Decipher-CP: Trained on Europarl + Comparable EMEA
- Decipher-NP: Trained on Europarl + Non-Parallel EMEA

Train Data       Tune Data  Tune LM   Test Data  Test LM   Baseline  Decipher-CP  Decipher-NP
Europarl         Europarl   Europarl  Europarl   Europarl  38.2      -            -
Europarl         Europarl   Europarl  EMEA       Europarl  24.9      -            -
Europarl         Europarl   Europarl  EMEA       EMEA      30.5      33.2 (+2.7)  32.4 (+1.9)
Europarl         EMEA       EMEA      EMEA       EMEA      37.3      41.1 (+3.8)  39.7 (+2.4)
Europarl + EMEA  EMEA       EMEA      EMEA       EMEA      67.4      68.7 (+1.3)  68.7 (+1.3)

Table 4.3: Using decipherment to improve SMT. Each row has a different set of training, tuning, and testing data. Baseline is trained on parallel data only. Tune LM and Test LM specify the language models used for tuning and testing respectively. Decipher-CP and Decipher-NP use a phrase table learnt from the comparable and the non-parallel EMEA corpus respectively.

Our baseline system achieves a 38.2 Bleu score on the Europarl test set. In the second row of Table 4.3, the test set changes to EMEA, and the baseline Bleu score drops to 24.9. In the third row, the baseline score rises to 30.5 with a language model built from the EMEA corpus. Although this is much higher than the previous baseline, I further improve it by including a new phrase table learnt from domain-specific monolingual data. In a real out-of-domain task, we are unlikely to have any parallel data to tune weights for the new phrase table. Therefore, I can only set them manually. In experiments, each score in the new phrase table has a weight of 5, and the Bleu score rises to 33.2. In the fourth row of the table, I assume that there is a small amount of domain-specific parallel data for tuning. With better weights, our baseline Bleu score increases to 37.3, and our combined systems increase to 41.1 and 39.7 respectively.
In the last row of the table, I compare the combined systems with an even better baseline. This time, the baseline is given half of the EMEA tuning set for training and uses the other half for weight tuning. Results show that our combined systems still outperform the baseline.

The phrase table learnt from monolingual data consists of both observed and unknown words. Table 4.4 shows the top 10 most frequent OOV words in the table learnt from the non-parallel EMEA corpus. Among the 10 words, 9 have correct translations. It is interesting to see that our algorithm finds multiple correct translations for the word "hépatique". The mistake in the table is close to the correct answer, as the French word "pelliculés" is translated as "recubiertos con película" in Spanish.

French             Spanish          P(fr|es)  P(es|fr)
<                  <                0.32      1.00
hépatique          hepático         0.88      0.08
                   hepática         0.76      0.85
injectable         inyectable       0.91      0.92
dl                 dl               1.00      0.70
>                  >                0.32      1.00
ribavirine         ribavirina       0.40      1.00
olanzapine         olanzapina       0.57      1.00
clairance          aclaramiento     0.99      0.64
pelliculés         recubiertos      1.00      1.00
pharmacocinétique  farmacocinético  1.00      1.00

Table 4.4: 10 most frequent OOV words in the table learnt from the non-parallel EMEA corpus

Chapter 5
Dependency-Based Decipherment

In the previous chapter, I show that it is possible to decipher French into Spanish and use the result to improve the quality of machine translation. However, I have neglected some important issues in deciphering foreign languages: the handling of word reordering, deletion, and insertion. In this chapter, I will show that the previous assumption fails for deciphering Spanish into English, and propose a novel solution to the problem.

5.1 From Adjacent Bigrams to Dependency Bigrams

A limitation of my decipherment model is its monotonic generative story for deciphering adjacent bigrams. In experiments, I find that while the approach works well for deciphering similar languages (e.g., Spanish and French) without considering reordering, it works poorly for languages that differ more in grammar and word order (e.g., Spanish and English).

Adjacent bigrams   Dependency bigrams
misión de          misión naciones
de naciones        naciones unidas
naciones unidas    misión en
unidas en          en oriente
en oriente         oriente medio
oriente medio

Table 5.1: Comparison of adjacent bigrams (left) and dependency bigrams (right) extracted from the same Spanish text

Table 5.1 gives a concrete example. The left column of Table 5.1 contains adjacent bigrams extracted from the Spanish phrase "misión de naciones unidas en oriente medio" (united nations' mission in the middle east). The correct decipherment for the bigram "naciones unidas" should be "united nations". Since my deciphering model does not consider word reordering, it needs to decipher the bigram into "nations united" in order to get the right word translations "naciones"→"nations" and "unidas"→"united". However, the English language model used for decipherment is built from English adjacent bigrams, so it strongly disprefers "nations united" and is not likely to produce a sensible decipherment for "naciones unidas". The Spanish bigram "oriente medio" poses the same problem. Thus, without considering word reordering, the model described in my previous work (Dou and Knight, 2012) is not a good fit for deciphering Spanish into English. However, if I extract bigrams based on dependency relations for both languages, the model fits better.
To extract such bigrams, I first use dependency parsers to parse both languages, then extract bigrams by putting the head word first, followed by the modifier.[1] I call these dependency bigrams. The right column of Table 5.1 lists examples of Spanish dependency bigrams extracted from the same Spanish phrase. With a language model built from English dependency bigrams, the same model used for deciphering adjacent bigrams is able to decipher the Spanish dependency bigram "naciones(head) unidas(modifier)" into "nations(head) united(modifier)".

A different solution I could propose is to change the model so that it considers word reordering when deciphering adjacent bigrams (e.g., add an operation to swap tokens in a bigram). However, using dependency bigrams has the following advantages. First, using dependency bigrams avoids complicating the model, keeping deciphering efficient and scalable. Second, it addresses the problem of long-distance reordering, which cannot be modeled by swapping tokens in bigrams. Last but not least, I can choose what types of dependency bigrams to extract for decipherment, which also addresses the problem of word deletion and insertion.

[1] I choose to skip some function words like "del" and "de" in Spanish and "of" in English. To skip those words, I simply use their head words as new heads if any of them serves as a head.

5.2 Deciphering Spanish Gigaword

In this section, I view Spanish as a cipher for English, and decipher large amounts of Spanish into English using dependency bigrams. This is different from the previous experiments in Section 3.2, where a word substitution cipher is created from English plaintext and adjacent bigrams are used for decipherment. Moreover, I will show that using dependency bigrams achieves a much higher deciphering accuracy than using adjacent bigrams when deciphering Spanish into English.

5.2.1 Data

Source                          Number of Tokens
Spanish Gigaword, AFP section   440 million
English Gigaword, AFP section   350 million

Table 5.2: Size of data from the AFP (Agence France-Presse) section of the Gigaword corpus

I use the Gigaword corpus for the experiments. The corpus contains news articles from different news agencies and is available in Spanish and English. I use only part of the AFP (Agence France-Presse) section of the corpus in the decipherment experiments. I tokenize the corpus using tools that come with the Europarl corpus (Koehn, 2005a). To shorten the time required for running different systems on large amounts of data, I keep only the top 5000 most frequent word types in both languages and replace all other word types with UNK. I also throw away lines with more than 40 tokens, as the Spanish parser I use is slow when processing long sentences. The size of the data after preprocessing is shown in Table 5.2. To obtain dependency bigrams, I use the Bohnet parsers (Bohnet, 2010) to parse both the Spanish and the English versions of the corpus.
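Before turning to the systems, here is a minimal sketch of the dependency-bigram extraction from Section 5.1 as it applies to such parsed corpora. The function-word handling follows footnote 1, and the example parse is illustrative, not actual parser output:

    FUNCTION_WORDS = {"de", "del", "of"}  # skipped, as in footnote 1

    def dependency_bigrams(tokens, heads):
        """tokens[i] is the i-th word; heads[i] is the index of its head
        (-1 for the root). Emits (head, modifier) pairs, head word first,
        replacing a function-word head by that word's own head."""
        def resolve(i):
            # If a function word serves as a head, use its head instead.
            while i != -1 and tokens[i].lower() in FUNCTION_WORDS:
                i = heads[i]
            return i
        for i, h in enumerate(heads):
            if h == -1 or tokens[i].lower() in FUNCTION_WORDS:
                continue  # roots and skipped function words emit nothing
            h = resolve(h)
            if h != -1:
                yield (tokens[h], tokens[i])

    # Example: the phrase from Table 5.1, with a plausible parse.
    tokens = ["misión", "de", "naciones", "unidas", "en", "oriente", "medio"]
    heads  = [-1, 0, 1, 2, 0, 4, 5]
    print(list(dependency_bigrams(tokens, heads)))
    # -> the five dependency bigrams in the right column of Table 5.1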
5.2.2 Systems

Three systems are evaluated in the experiments. I implement a baseline system, Adjacent, based on my previous work in Chapters 3 and 4. The baseline system collects adjacent bigrams and their counts from the Spanish and English texts. It then builds an English bigram language model from the English adjacent bigrams and uses it to decipher the Spanish adjacent bigrams.

I then build the second system, Dependency, which uses dependency bigrams for decipherment. As the two parsers do not output the same set of dependency relations, I cannot extract all types of dependency bigrams. Instead, I select a subset of dependency bigrams whose dependency relations are shared by the two parser outputs. The selected dependency relations are: Verb/Subject, Verb/Noun-Object, Preposition/Object, Noun/Modifier. Decipherment runs the same way as in the baseline system.

The third system, DepType, is built using both dependency bigrams and their dependency types. I first extract dependency bigrams for both languages, then group them based on their dependency types. I choose to divide the dependency bigrams into 3 groups[2] and list them in Table 5.3. A separate language model is built for each group of English dependency bigrams and used to decipher the group of Spanish dependency bigrams with the same dependency type.

          Dependency Types
Group 1   Verb/Subject
Group 2   Preposition/Preposition-Object, Noun/Noun-Modifier
Group 3   Verb/Noun-Object

Table 5.3: Groups of dependency relations used in decipherment

For all the systems, language models are built using the SRILM toolkit (Stolcke, 2002). For the Adjacent system, I use Good-Turing smoothing. For the other systems, I use a mix of Witten-Bell and Good-Turing smoothing.

[2] Both parsers treat noun phrases containing "del", "de", and "of" as prepositional phrases.

5.2.3 Iterative Sampling with Multiple Random Restarts

In experiments, I find that the iterative sampling used in my previous work (Dou and Knight, 2012) achieves better deciphering accuracy and saves decipherment time. However, I also find that iterative sampling alone is not enough when it comes to deciphering foreign languages: a single decipherment run can lead to very different deciphering accuracy. An intuitive solution to this problem is to use multiple random restarts and combine their results. Thus, instead of using a single sampling process, I use 10 different sampling processes at each iteration. The details of the new sampling procedure are as follows:

- Extract dependency bigrams from the parsing outputs and collect their counts. Keep bigrams whose counts are greater than a threshold α. Then start 10 differently seeded and initialized sampling processes.
- Perform sampling.
- At the end of sampling, extract word translation pairs (f, e) from the final sample. Estimate translation probabilities P(e|f) for each pair. Then construct a translation table by keeping translation pairs (f, e) seen in more than one decipherment, and use the average P(e|f) as the new translation probability.
- Lower the threshold α to include more bigrams in the sampling process. Start 10 different sampling processes again and initialize the first sample using the translation pairs obtained from the previous step (for each Spanish token f, choose the English token e whose P(e|f) is the highest). Perform sampling again.
- Repeat until α = 1.

5.2.4 Deciphering Accuracy

I choose the first 1000 lines of the monolingual Spanish texts as our test data. The data contains 37,505 tokens and 6556 word types. Unlike the experiments in Section 3.2, I don't have gold answers for the decipherment of the test data. Therefore, instead of using token accuracy, I use type accuracy as the evaluation metric: Given a word type f in Spanish, find the translation pair (f, e) with the highest average P(e|f) in the translation table learned through decipherment. If the translation pair (f, e) can also be found in a gold translation lexicon T_gold, I treat the word type f as correctly deciphered. Let |C| be the number of word types correctly deciphered, and |V| the total number of word types evaluated. I define type accuracy as |C| / |V|.
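The metric can be computed directly; here is a minimal sketch, with data structures of my own choosing and the gold lexicon T_gold built as described next:

    def type_accuracy(decipher_table, gold_lexicon, eval_types):
        """decipher_table: dict f -> list of (e, average P(e|f)) pairs
        learned by decipherment. gold_lexicon: set of (f, e) pairs.
        eval_types: the Spanish word types to evaluate (|V| of them)."""
        correct = 0
        for f in eval_types:
            if f not in decipher_table:
                continue  # no translation learned: counts as wrong
            # Best translation = highest average P(e|f) over the runs.
            e_best = max(decipher_table[f], key=lambda pair: pair[1])[0]
            if (f, e_best) in gold_lexicon:
                correct += 1
        return correct / len(eval_types)   # |C| / |V|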
To create T_gold, I use GIZA (Och and Ney, 2003b) to align a small amount of Spanish-English parallel text (1 million tokens for each language), and use the lexicon derived from the alignment as our gold translation lexicon. T_gold contains a subset of 4408 types seen in the test data, among which 2878 are also among the top 5000 frequent word types.

5.2.5 Results

During decipherment, I gradually increase the size of the Spanish texts and compare the learning curves of the three deciphering systems in Figure 5.1. With 100k tokens of Spanish text, the performance of the three systems is similar. However, the learning curve of Adjacent plateaus quickly, while those of the dependency-based systems soar as more data becomes available, and still rise sharply when the size of the Spanish texts increases to 10 million tokens, where the DepType system improves the deciphering accuracy of the Adjacent system from 4.2% to 24.6%. In the end, with 100 million tokens, the accuracy of the DepType system rises to 27.0%, which is 4 times higher than the system using adjacent bigrams. The accuracy is even higher (41%) when evaluated against the top 5000 frequent word types only.

Figure 5.1: Learning curves for Spanish-English decipherment

Chapter 6
Resource-Limited Machine Translation with Decipherment

In Chapter 5, I improve deciphering accuracy for deciphering between languages with significantly different syntax. Is it now possible to improve machine translation for language pairs that are significantly more different than Spanish-French? Moreover, in the real world, we usually have small amounts of parallel data and large amounts of non-parallel data. Could I still use decipherment to improve machine translation in this more realistic scenario? In this chapter, I answer the above questions. The experimental settings have the following major differences compared with Chapter 4:

- Deciphering Spanish into English with dependency bigrams, instead of Spanish into French with adjacent bigrams
- The amount of parallel data is restricted, to mimic a more realistic situation in building systems for low-density languages
- The lexicon learned from decipherment is used to improve translations of both OOV and observed words
- Using naturally occurring monolingual data

Figure 6.1: Improving machine translation with decipherment (grey boxes represent new data and processes)

Figure 6.1 illustrates the approach proposed in this chapter. First, I learn a domain-specific translation lexicon by deciphering large amounts of in-domain (news) monolingual data. Then this lexicon is used to improve a phrase-based machine translation system trained with limited out-of-domain (politics) parallel data.

6.1 Data

I use approximately one million tokens of the Europarl corpus as our small out-of-domain parallel training data and Gigaword as our large in-domain monolingual training data. For tuning and testing, I use the development data from the NAACL 2012 workshop on statistical machine translation. The data contains test data in the news domain from the 2008, 2009, 2010, and 2011 workshops. I use the 2008 test data for tuning and the rest for testing. The sizes of the training, tuning, and testing sets are listed in Table 6.1.

Parallel      Spanish      English
Europarl      1.1 million  1.0 million
Tune-2008     52.6k        49.8k
Test-2009     68.1k        65.6k
Test-2010     65.5k        61.9k
Test-2011     79.4k        74.7k

Non-Parallel  Spanish      English
Gigaword      894 million  940 million

Table 6.1: Number of tokens in the training, tuning, and testing data for the resource-limited MT experiments
6.2 Systems

6.2.1 Baseline Machine Translation System

First, I build a state-of-the-art phrase-based MT system, PBMT, using Moses. PBMT has 3 models: a translation model, a distortion model, and a language model. I build a 5-gram language model from the English Gigaword corpus and train the other models on the Europarl corpus. By default, Moses uses the 8 features described in Chapter 4. Similarly, the weights of those features are learned using minimum error rate training (MERT) (Och, 2003).

PBMT has a phrase table T_phrase. During decoding, Moses copies out-of-vocabulary (OOV) words directly to the output. In the rest of this chapter, I describe how to use a translation lexicon learned from large amounts of non-parallel data to improve the translation of OOV words, as well as of words found in T_phrase.

6.2.2 Decipherment for Machine Translation

For decipherment training, I:

- Increase the size of the Spanish ciphertext to 894 million tokens.
- Keep the top 50k instead of the top 5k most frequent word types of the ciphertext.
- Instead of seeding the sampling process randomly, use a translation lexicon learned from a limited amount of parallel data as a seed: For each Spanish dependency bigram f_1, f_2 where both f_1 and f_2 are found in the seed lexicon, I find the English sequence e_1, e_2 that maximizes P(e_1, e_2) P(e_1|f_1) P(e_2|f_2). Otherwise, for any Spanish token f that can be found in the seed lexicon, I choose the English word e with the highest P(e|f) as the initial sample; for any f not seen in the seed lexicon, I do random initialization.

I perform 20 random restarts with 10k iterations each, and build a word-to-word translation lexicon T_decipher by collecting translation pairs seen in at least 3 final decipherments with either P(f|e) ≥ 0.2 or P(e|f) ≥ 0.2.

6.2.3 Improving Translation of Observed Words with Decipherment

To improve the translation of words observed in our parallel corpus, I simply use T_decipher as an additional parallel corpus. First, I filter T_decipher by keeping only translation pairs (f, e) where f is observed in the Spanish part and e is observed in the English part of the parallel corpus. Then I append all the Spanish and English words in the filtered T_decipher to the end of the Spanish part and the English part of the parallel corpus respectively. The training and tuning process is the same as for the baseline machine translation system PBMT. I call this system Decipher-OBSV.

6.2.4 Improving OOV Translation with Decipherment

As T_decipher is learned from large amounts of in-domain monolingual data, I expect that T_decipher contains a number of useful translations for words not seen in the limited amount of parallel data (OOV words). Instead of copying OOV words directly to the output, which is what Moses does by default, I try to find translations in T_decipher. During decoding, if a source word f is in T_phrase, its translation options are collected from T_phrase exclusively. If f is not in T_phrase but is in T_decipher, the decoder will find translations in T_decipher. If f is in neither translation table, the decoder just copies it directly to the output. I call this system Decipher-OOV.
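The decoding-time lookup just described amounts to a simple fallback chain; the sketch below is my own rendering of that logic (Moses realizes it through its multiple-phrase-table support rather than code like this):

    def translation_options(f, t_phrase, t_decipher):
        """Back-off order for a source word f: the parallel-data table
        wins outright; the deciphered table is consulted only for OOVs;
        anything else is copied through unchanged."""
        if f in t_phrase:
            return t_phrase[f]      # observed word: parallel table only
        if f in t_decipher:
            return t_decipher[f]    # OOV with deciphered translations
        return [(f, 0.0)]           # unknown everywhere: copy to output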
However, when an OOV's correct translation is the same as its surface form and all its possible translations in T_decipher are wrong, it is better to just copy the OOV word directly to the output. This scenario happens frequently, as Spanish and English share many common words. To avoid over-trusting T_decipher, I add a new translation pair (f, f) for each source word f in T_decipher if the pair (f, f) is not originally in T_decipher. For each newly added translation pair, both of its log translation probabilities are set to 0. To distinguish the added translation pairs from those learned through decipherment, I add a binary feature to each translation pair in T_decipher. The final version of T_decipher thus has three feature scores: P(e|f), P(f|e), and the binary feature. Finally, I tune the weights of the features in T_decipher using MERT on the tuning set.

6.2.5 A Combined Approach

In the end, I build a system, Decipher-COMB, which uses T_decipher to improve the translation of both observed and OOV words, with the methods described in Sections 6.2.3 and 6.2.4.

6.3 Results

I tune each system three times with MERT and choose the best weights based on Bleu scores on the tuning set.

Decipherment  System           Tune 2008  Test 2009  Test 2010  Test 2011
None          PBMT (Baseline)  19.1       19.6       21.3       22.1
Adjacent      Decipher-OBSV    19.5       20.1       22.2       22.6
              Decipher-OOV     19.4       19.9       21.7       22.5
              Decipher-COMB    19.5       20.2       22.3       22.5
Dependency    Decipher-OBSV    19.7       20.5       22.5       23.0
              Decipher-OOV     19.9       20.4       22.4       22.9
              Decipher-COMB    20.0       20.8       23.1       23.4

Table 6.2: Systems that use translation lexicons learned from decipherment show consistent improvement over the baseline system.

Table 6.2 shows that the translation lexicon learned from decipherment helps achieve higher Bleu scores across the tuning and testing sets. Decipher-OBSV improves Bleu scores by as much as 1.2 points. I analyze the results and find that the gain mainly comes from two parts. First, adding T_decipher to the small parallel corpus improves word-level translation probabilities, which leads to better lexical weighting; second, T_decipher contains new alternative translations for words observed in the parallel corpus.

Moreover, Decipher-OOV also achieves better Bleu scores than PBMT across all tuning and test sets. I also observe that systems using a T_decipher learned by deciphering dependency bigrams lead to larger gains in Bleu scores. When decipherment is used to improve the translation of both observed and OOV words, I see improvements in Bleu score as high as 1.8 points on the 2010 news test set. The consistent improvement on the tuning and the different testing sets suggests that decipherment is capable of learning good translations for a number of OOV words.

To further demonstrate that our decipherment approach finds useful translations for OOV words, I list the top 10 most frequent OOV words from both the tuning set and the testing set, as well as their translations (up to the three most likely), in Table 6.3. P(e|f) and P(f|e) are average scores over the different decipherment runs. From the table, it can be seen that decipherment finds correct translations (bolded) for 7 of the 10 most frequent OOV words. Moreover, many OOVs and their correct translations are homographs, which makes copying OOVs directly to the output a strong baseline to beat. Nonetheless, decipherment still finds enough correct translations to improve the baseline.
Spanish    English     P(e|f)  P(f|e)
obama      his         0.33    0.01
           bush        0.27    0.07
           clinton     0.23    0.11
bush       bush        0.47    0.45
           yeltsin     0.28    0.81
           he          0.24    0.05
festival   event       0.68    0.35
           festival    0.61    0.72
wikileaks  zeta        0.03    0.33
venus      venus       0.61    0.74
           serena      0.47    0.62
colchones  mattresses  0.55    0.73
           cars        0.31    0.01
helado     frigid      0.52    0.44
           chill       0.37    0.14
           sandwich    0.42    0.27
google     microsoft   0.67    0.18
           google      0.59    0.69
cantante   singer      0.44    0.92
           jackson     0.14    0.33
           artists     0.14    0.77
mccain     mccain      0.66    0.92
           it          0.22    0.00
           he          0.21    0.00

Table 6.3: Decipherment finds correct translations for 7 of the 10 most frequent OOV word types.

Chapter 7
Deciphering Malagasy

In previous chapters, I have applied slice sampling to Bayesian decipherment to make decipherment scalable, and introduced dependency relations into decipherment to address the issues of word reordering, insertion, and deletion. Through experiments, I show that decipherment is very helpful for both domain adaptation and low-resource languages. However, Spanish is not a low-resource language, and the success is built on the availability of a good dependency parser. In this chapter, I decipher Malagasy, and use the decipherment to improve Malagasy-English machine translation.

7.1 The Malagasy Language

Malagasy is the official language of Madagascar. It has around 18 million native speakers. Although Madagascar is an African country, Malagasy belongs to the Malayo-Polynesian branch of the Austronesian language family. It is related to the Malayo-Polynesian languages of Indonesia, Malaysia, and the Philippines, and more closely to the Southeast Barito languages spoken in Borneo.

Malagasy and English have very different word orders. First of all, in contrast to English, which has a subject-verb-object (SVO) word order, Malagasy has a verb-object-subject (VOS) word order. Besides that, Malagasy is a typical head-initial language: determiners precede the noun, while other modifiers and relative clauses follow the noun (e.g., ny "the" boky "book" mena "red"). To better illustrate this, Figure 7.1 shows an English sentence and its translation in Malagasy, with lines indicating word-level translations. The significant difference in word order between English and Malagasy poses great challenges for machine translation and decipherment.

Figure 7.1: Word alignment showing different word orders between English and Malagasy

7.2 The Challenges

Besides the great challenge posed by the significant word-order difference, there are other challenges when it comes to Malagasy-English machine translation.

7.2.1 Challenge 1: Lack of Parallel Data

First of all, the amount of parallel data available is limited. As shown in Table 7.3, the majority of the available parallel data is from Global Voices, a website that carries international news in multiple languages, with translations mostly done by volunteers. I am able to collect news in Malagasy and their English translations produced from 2007 to 2013. In total, the parallel corpus contains roughly 2.0 million tokens in Malagasy and 1.8 million tokens in English. Besides Global Voices, I also have a little Malagasy news collected from websites in Madagascar, together with translations provided by bilingual native speakers.

To find out how good a machine translation system built from this small amount of parallel data is, I build a baseline phrase-based machine translation system with Moses. The training data contains the first 87,000 sentences of the parallel data from Global Voices, and the remaining 2000 are used for development and testing.
The system achieves a Bleu score of 16.2 on development and 15.0 on test. The low Bleu scores suggest large room for improvement.

7.2.2 Challenge 2: Limited Monolingual Data

In general, it is easier to collect monolingual data for any pair of languages. However, compared with resource-rich languages, low-density languages have much less monolingual data as well. The size of the non-parallel data I have collected so far is listed in Table 7.3. For English, large amounts of non-parallel data are available: the fifth edition of the Gigaword corpus alone contains over 2 billion tokens. For Malagasy, however, there is no existing monolingual corpus like Gigaword. One option is to collect data from different websites. In a month during summer 2013, I was able to find a few major Madagascar local news websites[1] in Malagasy and collected around 15.3 million tokens of news in Malagasy. Although the amount of Malagasy monolingual data is relatively small compared with that of English, it is still almost 8 times the size of the available parallel data.

[1] aoraha (www.aoraha.com), gazetiko (www.gazetiko.mg), inovaovao (www.inovaovao.com), lakroa (www.lakroa.mg)

7.2.3 Challenge 3: Poor Parsing Quality

In Chapter 5, I show that dependency relations are particularly useful for addressing the problem of different word orders in decipherment. Since Malagasy and English have very different word orders, I decide to apply dependency-based decipherment to the two languages, as suggested in Chapter 5. To extract dependency relations, one needs to parse the monolingual data in both languages with different parsers. For English, there are already many good parsers, so the challenge is building a parser for Malagasy.

The quality of a dependency parser mainly depends on the amount of training data available. The state-of-the-art English parsers are built using the Penn Treebank (Marcus et al., 1993), which contains over 1 million tokens of annotated parse trees. The Spanish parser I used in Chapter 5 is trained on over 400k tokens of annotated data generated from the AnCora corpora (Taulé and Recasens, 2008). The accuracy of the English and the Spanish parser is 92% and 89% respectively. In contrast, the available data for training a Malagasy parser is rather limited. The only available data with annotated trees contains 168 sentences and 2.8k tokens; there is more data with only part-of-speech (POS) tags, containing 465 sentences and 10k tokens. The attachment accuracy of a parser trained on the above data is only 72%, which is much lower than the accuracy of the Spanish and English parsers.

7.3 Deciphering Malagasy: Preliminary Results

Nonetheless, to evaluate the effect of parsing accuracy on deciphering accuracy, I build a dependency-based decipherment system, Dependency, using a Malagasy dependency parser trained on the 168 annotated trees. Since the Malagasy parser doesn't predict dependency relation types, I use head-child part-of-speech (POS) tag patterns to select a subset of dependency bigrams for decipherment. I list the selected POS tag patterns in Table 7.1.

Head POS     Child POS
Verb         Noun
Verb         Proper Noun
Verb         Personal Pronoun
Preposition  Noun
Preposition  Proper Noun
Noun         Adjective
Noun         Determiner
Noun         Verb Particle
Noun         Verb
Noun         Cardinal Number
Noun         Noun

Table 7.1: Head-Child POS patterns used in decipherment

I perform decipherment following the sampling schedule in Chapter 5.
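A minimal sketch of this POS-pattern filter follows; the tag strings mirror the category names of Table 7.1, though an actual tagger's tag set may spell them differently:

    # (head POS, child POS) patterns from Table 7.1.
    ALLOWED_PATTERNS = {
        ("Verb", "Noun"), ("Verb", "Proper Noun"), ("Verb", "Personal Pronoun"),
        ("Preposition", "Noun"), ("Preposition", "Proper Noun"),
        ("Noun", "Adjective"), ("Noun", "Determiner"), ("Noun", "Verb Particle"),
        ("Noun", "Verb"), ("Noun", "Cardinal Number"), ("Noun", "Noun"),
    }

    def keep_bigram(head_pos, child_pos):
        """With no relation labels from the parser, select dependency
        bigrams purely by their head-child POS pattern."""
        return (head_pos, child_pos) in ALLOWED_PATTERNS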
The decipherment result is evaluated against a translation lexicon obtained from the Global Voices parallel data in Table 7.3. I use deciphering accuracy to measure the percentage of word types correctly deciphered among the top 5000 frequent word types. For comparison, I also build a baseline system, Adjacent, which performs decipherment using adjacent bigrams. Figure 7.2 shows that, instead of improving decipherment, a bad parser actually hurts deciphering accuracy.

Figure 7.2: Comparison of learning curves for Malagasy-English decipherment with a poor dependency parser

7.4 Improving Malagasy Dependency Parsing

One obvious way to improve tagging and parsing accuracy is to get more annotated data. I find more data with only part-of-speech tags, containing 465 sentences and 10k tokens, released by Garrette et al. (2013), and add it as extra training data for the POS tagger. I also download an online dictionary that contains POS tags for over 60k Malagasy word types from malagasyword.org. The dictionary is very helpful for tagging words never seen in the training data.

It is natural to think that the creation of annotated data for training a POS tagger and a parser requires large amounts of effort from annotators who understand the language well. However, I find that, with the help of parallel data and dictionaries, I am able to create more annotated data myself to improve tagging and parsing accuracy. This idea is inspired by previous work that learns a semi-supervised parser by projecting dependency relations from one language (with good dependency parsers) to another (Yarowsky and Ngai, 2001; Ganchev et al., 2009). However, I find those automatic approaches do not work well for Malagasy.

To further expand the Malagasy parser training data, I first use the POS tagger and parser with poor performance to parse 788 sentences (20k tokens) on the Malagasy side of the parallel corpus from Global Voices. Then, I correct both the dependency links and the POS tags based on information from dictionaries[2] and the English translation of each parsed sentence. I spent 3 months manually projecting English dependencies to Malagasy, eventually improving test-set parsing accuracy from 72.4% to 80.0%. I make this data available for future research use.

The impact of parsing quality on decipherment accuracy is significant. After parsing accuracy is improved, I compare the accuracy of dependency-based decipherment with that of the adjacent-bigram baseline. The results are presented in Figure 7.3. This time, the accuracy of dependency-based decipherment exceeds the accuracy of the baseline system with 10 million or more tokens of cipher data.

7.5 Joint Word Alignment and Decipherment

In Chapter 5, I show that a translation lexicon learned by decipherment improves machine translation. According to Figure 7.4, that approach has three independent steps: learn a seed lexicon from word alignment; obtain a lexicon through decipherment initialized with the seed lexicon; and finally, use the lexicon obtained from decipherment for word alignment and decoding. One major drawback of this pipeline approach is that mistakes made at each step can affect the result of the next step. Evidence from a number of previous works shows that a joint inference process leads to better performance in both tasks (Jiang et al., 2008; Zhang and Clark, 2008).
[2] An online dictionary from malagasyword.org, as well as a lexicon learned from the parallel data.

Figure 7.3: Comparison of learning curves for Malagasy-English decipherment with an improved dependency parser

Figure 7.4: Previous word alignment and decipherment pipeline

Therefore, I propose a new approach that performs word alignment and decipherment jointly, as shown in Figure 7.5. In the new approach, the decipherment and word alignment processes are performed jointly so that the two can benefit from each other.

Figure 7.5: Joint word alignment and decipherment

7.5.1 A New Objective Function

In the presence of parallel and monolingual data, we would like the alignment and decipherment models to benefit from each other. Since the decipherment and word alignment models contain word-to-word translation probabilities t(f|e), having them share these parameters during learning allows us to pool information from both data types. This leads to the development of a new objective function that takes both learning processes into account. Given parallel data (E_1, F_1), ..., (E_m, F_m), ..., (E_M, F_M) and monolingual data F_{mono}^1, ..., F_{mono}^n, ..., F_{mono}^N, we now seek to maximize the likelihood of both. Our new objective function is defined as:

F_{joint} = \sum_{m=1}^{M} \log P(F_m | E_m) + \alpha \sum_{n=1}^{N} \log P(F_{mono}^{n})    (7.1)

where \alpha is the weight on the non-parallel data. The goal of training is to learn the parameters \theta that maximize this objective, that is,

\hat{\theta} = \arg\max_{\theta} F_{joint}    (7.2)

In the next two sections, I describe the details of the word alignment and decipherment models, and present how they are combined to perform joint optimization.

7.5.2 Word Alignment

Given a source sentence F = f_1, ..., f_j, ..., f_J and a target sentence E = e_1, ..., e_i, ..., e_I, word alignment models describe the generative process employed to produce the French sentence from the English sentence through alignments a = a_1, ..., a_j, ..., a_J. The IBM Models 1-2 (Brown et al., 1993) and the HMM word alignment model (Vogel et al., 1996) use two sets of parameters, distortion probabilities and translation probabilities, to define the joint probability of a target sentence and alignment given a source sentence:

P(F, a | E) = \prod_{j=1}^{J} d(a_j | a_{j-1}, j) \, t(f_j | e_{a_j})    (7.3)

These alignment models share the same translation probabilities t(f_j | e_{a_j}), but differ in their treatment of the distortion probabilities d(a_j | a_{j-1}, j). Brown et al. (1993) introduce more advanced models for word alignment, such as Model 3 and Model 4, which use more parameters to describe the generative process. I do not go into the details of those models here; the reader is referred to the paper describing them.

Under the Model 1-2 and HMM alignment models, the probability of the target sentence given the source sentence is:

P(F | E) = \sum_{a} \prod_{j=1}^{J} d(a_j | a_{j-1}, j) \, t(f_j | e_{a_j})

Let \theta denote all the parameters of the word alignment model. Given a corpus of sentence pairs (E_1, F_1), ..., (E_m, F_m), ..., (E_M, F_M), the standard approach for training is to learn the maximum likelihood estimate of the parameters, that is,

\hat{\theta} = \arg\max_{\theta} \sum_{m=1}^{M} \log P(F_m | E_m) = \arg\max_{\theta} \sum_{m=1}^{M} \log \left( \sum_{a} P(F_m, a | E_m) \right)

The EM algorithm (Dempster et al., 1977) is typically used to carry out this optimization.
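To make Equation 7.3 concrete, the sketch below scores one (sentence pair, alignment) configuration under this parameterization. The dictionary-based tables and the smoothing floor are my own simplifications, not the actual implementation:

    def alignment_prob(F, E, a, d, t):
        """P(F, a | E) = prod_j d(a_j | a_{j-1}, j) * t(f_j | e_{a_j}),
        as in Equation 7.3. F, E: token lists; a[j]: index into E aligned
        to f_j; d: dict ((prev, curr, j) -> prob), with prev=None for the
        initial position; t: dict ((f, e) -> prob)."""
        prob, prev = 1.0, None
        for j, f in enumerate(F):
            prob *= d.get((prev, a[j], j), 1e-9) * t.get((f, E[a[j]]), 1e-9)
            prev = a[j]
        return prob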
7.5.3 Decipherment

Given a corpus of N foreign text sequences (ciphertext) F_{mono}^1, ..., F_{mono}^n, ..., F_{mono}^N, decipherment finds word-to-word translations that best describe the ciphertext. Knight et al. (2006) are the first to study several natural language decipherment problems with unsupervised learning.

In order to speed up decipherment, I suggested in previous work (Dou and Knight, 2012) that a frequency list of bigrams might contain enough information for decipherment. Based on this idea, a monolingual ciphertext bigram F_{mono} is generated through the following generative story:

- Generate a sequence of two plaintext tokens e_1 e_2 with probability P(e_1 e_2), given by a language model built from large numbers of plaintext bigrams.
- Substitute e_1 with f_1 and e_2 with f_2, with probability t(f_1|e_1) t(f_2|e_2).

The probability of any cipher bigram F_{mono} is:

P(F_{mono}) = \sum_{e_1 e_2} P(e_1 e_2) \, t(f_1 | e_1) \, t(f_2 | e_2)    (7.4)

and the probability of the corpus is:

P(corpus) = \prod_{n=1}^{N} P(F_{mono}^{n})    (7.5)

Given a plaintext bigram language model, the goal is to manipulate t(f|e) to maximize P(corpus). Theoretically, one can directly apply EM to solve the problem (Knight et al., 2006). However, EM has time complexity O(N V_e^2) and space complexity O(V_f V_e), where V_f and V_e are the sizes of the ciphertext and plaintext vocabularies respectively, and N is the number of cipher bigrams. In Section 3.1, I describe how to apply slice sampling to Bayesian decipherment to make decipherment faster. However, since the word alignment task uses the EM algorithm, it is easier to use the same algorithm for decipherment as well. In the following sections, I describe an approach that uses slice sampling to compute the expected counts needed for EM decipherment.

7.5.4 Joint Optimization

I now describe our EM approach to learning the parameters that maximize F_{joint} (Equation 7.2), where the distortion probabilities d(a_j | a_{j-1}, j) in the word alignment model are learned only from parallel data, and the translation probabilities t(f|e) are learned using both parallel and non-parallel data. The E step and M step are illustrated in Figure 7.6.

Figure 7.6: Joint word alignment and decipherment with EM

The algorithm starts with EM learning on parallel data only for a few iterations. When the joint inference starts, the algorithm first computes expected counts from the parallel data and the non-parallel data separately, using the parameter values from the last M step. Then, it adds the expected counts from the parallel and non-parallel data together, with different weights for the two. Finally, it renormalizes the translation table and the distortion table to update the parameters in the new M step.

The E step for the parallel part can be computed efficiently using the forward-backward algorithm (Vogel et al., 1996). However, as I pointed out in Chapter 3, the E step for the non-parallel part has a time complexity of O(V^2) with the forward-backward algorithm, where V is the size of the English vocabulary, which is usually very large. In previous work, I made Bayesian decipherment scalable (Dou and Knight, 2012). In this section, I describe how to make EM decipherment scalable by using slice sampling (Neal, 2000) to compute the expected counts from non-parallel data needed by the EM algorithm.

Draw Samples with Slice Sampling

To start the sampling process, I initialize the first sample by performing approximate Viterbi decoding using the results from the last EM iteration. For each foreign dependency bigram f_1, f_2, I find the top 50 candidates for f_1 and f_2 ranked by t(e|f), and find the English sequence e_1, e_2 that maximizes t(e_1|f_1) t(e_2|f_2) P(e_1, e_2).
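A sketch of this initialization step follows; the candidate lists and probability tables are assumed to come from the previous EM iteration, and the names are mine:

    def init_sample(f1, f2, top_candidates, t_e_given_f, lm_bigram):
        """Approximate Viterbi initialization for one foreign dependency
        bigram: restrict e1, e2 to the top-50 candidates of f1 and f2
        ranked by t(e|f), then pick the pair maximizing
        t(e1|f1) * t(e2|f2) * P(e1, e2)."""
        best, best_score = None, 0.0
        for e1 in top_candidates[f1]:            # 50 candidates each
            for e2 in top_candidates[f2]:        # only 2500 pairs to score
                score = (t_e_given_f[(e1, f1)] * t_e_given_f[(e2, f2)]
                         * lm_bigram.get((e1, e2), 0.0))
                if score > best_score:
                    best, best_score = (e1, e2), score
        return best   # may be None if no pair has nonzero score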
Suppose the derivation probability for the current sample e_current is P(e_current). I use slice sampling to draw a new sample in two steps:

- Select a threshold T uniformly between 0 and P(e_current).
- Draw a new sample e_new uniformly from the pool of candidates {e_new | P(e_new) > T}.

For the first step, a threshold T is chosen uniformly between 0 and P(e_{i-1} e_i e_{i+1}) · t(f_i | e_i). I divide the second step into two cases, based on the observation from previous work that two types of samples are more likely to have a probability higher than T (Dou and Knight, 2012): (1) those whose trigram probability is high, and (2) those whose channel model probability is high. In previous work, to find candidates with high trigram probability, I build top-k sorted lists ranked by P(e_{i-1} e' e_{i+1}), which can be precomputed offline. Then, I check whether the last item e_k in the list satisfies the following inequality:

P(e_{i-1} e_k e_{i+1}) \cdot c < T    (7.6)

where c is a small constant, set uniformly to 1/|ciphertext vocabulary| in previous work. In contrast, I now choose c empirically.

When the inequality in Equation 7.6 is satisfied, a sample is drawn in the following way. Let set A = {e' | P(e_{i-1} e' e_{i+1}) · c > T} and set B = {e' | t(f_i | e') > c}. Then I only need to sample e' uniformly from A ∪ B until P(e_{i-1} e' e_{i+1}) · t(f_i | e') is greater than T. It is easy to prove that all other candidates, those not in the sorted list and with t(f_i | e') ≤ c, have an upper bound on their probability: P(e_{i-1} e_k e_{i+1}) · c. Therefore, they do not need to be considered. Second, when the last item e_k in the list does not meet the condition in Equation 7.6, I keep drawing samples e' randomly until the probability is greater than the threshold T.

As mentioned before, the choice of the small constant c is empirical. A large c reduces the number of items in set B, but makes the condition P(e_{i-1} e_k e_{i+1}) · c < T less likely to be satisfied, which slows down the sampling. On the contrary, a small c increases the number of items in set B significantly, as EM does not encourage a sparse distribution, which also slows down the sampling. In our experiments, I set c to 0.001 based on the speed of decipherment. Furthermore, to reduce the size of set B, I rank all the candidate translations of f_i by t(e'|f_i), then add at most the first 1000 candidates whose t(f_i|e') ≥ c into set B. For the rest of the candidates, I set t(f_i|e') to a value smaller than c (0.00001 in experiments).

Compute Expected Counts from Samples

With the ability to draw samples efficiently for decipherment using EM, I now describe how to compute expected counts from those samples. Let f_1, f_2 be a specific ciphertext bigram, N the number of samples needed to compute expected counts, and e_1, e_2 one of the N samples. The expected counts for the pairs (f_1, e_1) and (f_2, e_2) are computed as:

\alpha \cdot \frac{count(f_1, f_2)}{N}

where count(f_1, f_2) is the count of the bigram, and \alpha is the weight for non-parallel data, as shown in Equation 7.1. The expected counts collected for f_1, f_2 are accumulated from each of its N samples. Finally, I collect expected counts in the same way from every foreign bigram.
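A sketch of this accumulation; the container types are mine, while alpha and N are as defined above:

    from collections import defaultdict

    def accumulate_expected_counts(cipher_bigrams, samples, alpha, N):
        """cipher_bigrams: dict (f1, f2) -> corpus count.
        samples: dict (f1, f2) -> list of N sampled plaintext pairs (e1, e2).
        Each sample contributes count(f1, f2) * alpha / N to its two
        (ciphertext word, plaintext word) pairs."""
        expected = defaultdict(float)
        for (f1, f2), count in cipher_bigrams.items():
            for (e1, e2) in samples[(f1, f2)]:
                w = count * alpha / N
                expected[(f1, e1)] += w
                expected[(f2, e2)] += w
        return expected   # merged with parallel-data counts before the M step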
7.6 Word Alignment Experiments

In this section, I show that joint word alignment and decipherment improves the quality of word alignment. I choose to evaluate word alignment performance on Spanish and English, as manual gold alignments are available. In experiments, the joint approach improves alignment F-score by as much as 8 points.

7.6.1 Experiment Setup

As shown in Table 7.2, I work with a small amount of parallel, manually aligned Spanish-English data (Lambert et al., 2005), and a much larger amount of monolingual data.

              Spanish     English
Parallel      10.3k       9.9k
Non-Parallel  80 million  400 million

Table 7.2: Size of parallel and non-parallel data for the word alignment experiments (measured in number of tokens)

The parallel data is extracted from Europarl, which consists of articles from European Parliament plenary sessions. The monolingual data comes from the English and Spanish versions of the Gigaword corpora, containing news articles from different news agencies. I view Spanish as a cipher of English, and follow the approach proposed by Dou and Knight (2013) to extract dependency bigrams from the parsed Spanish and English monolingual data for decipherment. I only keep bigrams where both tokens appear in the parallel data. Then, I perform Spanish-to-English (English generating Spanish) word alignment and Spanish-to-English decipherment simultaneously with the method discussed in Section 7.5.

7.6.2 Results

I align all 500 sentences in the parallel corpus, and tune the decipherment weight (α) for Model 1 and the HMM using the last 100 sentences. The best weights are 0.1 for Model 1 and 0.005 for the HMM. I start with Model 1 on parallel data only for 5 iterations, then switch to the joint process for another 5 iterations of Model 1 and 5 more iterations of the HMM. In the end, I use the first 100 sentence pairs of the corpus for evaluation.

Figure 7.7 compares the learning curves of alignment F-score between EM without decipherment (the baseline) and joint word alignment and decipherment.

Figure 7.7: Learning curve showing that our joint word alignment and decipherment approach improves word alignment quality over traditional EM without decipherment (Model 1: iterations 1 to 10; HMM: iterations 11 to 15)

From the learning curves, we can see that at the 6th iteration, 2 iterations after the joint process starts, alignment F-score is improved from 34 to 43, and this improvement is held through the rest of the Model 1 iterations. The alignment model switches to the HMM at the 11th iteration, and at the 12th iteration there is a sudden jump in F-score for both the baseline and the joint approach. Consistent improvement of the F-score is observed until the end of the HMM iterations.

7.7 Machine Translation Experiments

In the previous section, I show that the joint word alignment and decipherment process significantly improves the quality of word alignment for Spanish and English. In this section, I test the joint approach in a more challenging setting: improving the quality of machine translation in a real low-density-language setting. In this task, the goal is to build a system that translates Malagasy news into English. I have a small amount of parallel data, and larger amounts of monolingual data collected from online websites. I build a dependency parser for Malagasy to parse the monolingual data, in order to perform dependency-based decipherment (Dou and Knight, 2013). In the end, I perform joint word alignment and decipherment, and show that the joint learning process improves Bleu scores by up to 2.1 points over a phrase-based MT baseline.

7.7.1 Data

Table 7.3 shows the data available in the experiments. The majority of the parallel text comes from Global Voices[3] (GV). The website contains international news translated into different foreign languages.
Besides that, I also have a very small amount of parallel text containing local web news, with English translations provided by native speakers at the University of Texas, Austin. I also collect much larger amounts of non-parallel data for both languages. For Malagasy, I spent two months collecting 15.3 million tokens of news text from local news websites in Madagascar.[4] I have released this data for future research use. For English, I have 2.4 billion tokens from the Gigaword corpus. Since the Malagasy monolingual data is collected from local websites, it is reasonable to expect that it contains a significant amount of information related to Africa. Therefore, I also collect 396 million tokens of African news in English from allAfrica.com.

[3] globalvoicesonline.org
[4] aoraha.com, gazetiko.com, inovaovao.com, expressmada.com, lakroa.com

Source         Malagasy      English
Parallel
Global Voices  2.0 million   1.8 million
Web News       2.2k          2.1k
Non-Parallel
Gigaword       N/A           2.4 billion
allAfrica      N/A           396 million
Local News     15.3 million  N/A

Table 7.3: Size of Malagasy and English data used in the Malagasy-English machine translation experiments (measured in number of tokens)

7.7.2 Baseline Machine Translation System

I build a state-of-the-art phrase-based MT system, PBMT, using Moses (Koehn et al., 2007). PBMT has 3 models: a translation model, a distortion model, and a language model. I train the first two models using half of the Global Voices parallel data (the rest is reserved for development and testing), and build a 5-gram language model using 834 million tokens from the AFP section of English Gigaword, 396 million tokens from allAfrica, and the English part of the parallel training corpus. For alignment, I run 10 iterations of Model 1, 5 iterations of HMM, 3 iterations of Model 3, and 3 iterations of Model 4. I do word alignment in both directions and use the grow-diag-final-and heuristic to obtain the final alignment. During decoding, we use the 8 standard features in Moses to score a candidate translation: direct and inverse translation probabilities, direct and inverse lexical weighting, a language model score, a distortion score, phrase penalty, and word penalty. The weights for the features are learned on the tuning data using minimum error rate training (MERT) (Och, 2003).

To compare with the previous decipherment approach to improving machine translation, I build a second baseline system. I follow my previous work (Dou and Knight, 2013) to decipher Malagasy into English, and build a translation lexicon T_decipher from the decipherment. To improve machine translation, we simply use T_decipher as an additional parallel corpus. First, we filter T_decipher by keeping only translation pairs (f, e) where f is observed in the Malagasy part and e is observed in the English part of the parallel corpus. Then we append all the Malagasy and English words in the filtered T_decipher to the end of the Malagasy part and the English part of the parallel corpus respectively. The training and tuning process is the same as for the baseline machine translation system PBMT. We call this system Decipher-Pipeline.

7.7.3 Joint Word Alignment and Decipherment for Machine Translation

When deciphering Malagasy to English, I extract Malagasy dependency bigrams using all available Malagasy monolingual data plus the Malagasy part of the Global Voices parallel data, and extract English dependency bigrams using 834 million tokens from English Gigaword and 396 million tokens from allAfrica news to build an English dependency language model.
In the other direction, I extract English dependency bigrams from the English part of the entire parallel corpus plus 9.7 million tokens from allAfrica news,[5] and use 17.3 million tokens of Malagasy monolingual data (15.3 million from the web and 2.0 million from Global Voices) to build a Malagasy dependency language model. To perform joint word alignment and decipherment, I require that all dependency bigrams contain only words observed in the parallel data used to train the baseline MT system.

[5] We do not find further Bleu gains from using more English monolingual data.

Parallel     Malagasy      English
Train (GV)   0.9 million   0.8 million
Tune (GV)    22.2k         20.2k
Test (GV)    23k           21k
Test (Web)   2.2k          2.1k

Non-Parallel  Malagasy      English
Gigaword      N/A           834 million
Web           15.3 million  396 million

Table 7.4: Size of training, tuning, and testing data in number of tokens (GV: Global Voices)

During learning, I run Model 1 without decipherment for 5 iterations. Then I perform joint word alignment and decipherment for another 5 iterations with Model 1 and 5 iterations with the HMM. The decipherment weights (α) for Model 1 and the HMM are tuned using grid search against Bleu score on a development set. In the end, I only extract rules from one direction, P(English|Malagasy), where the decipherment weights for Model 1 and the HMM are 0.5 and 0.005 respectively. I chose this because I did not find further benefit from tuning the weights separately for each direction. I then use the grow-diag-final-and heuristic to form the final alignments. I call this system Decipher-Joint.

7.7.4 Results

I tune each system three times with MERT and choose the best weights based on Bleu scores on the tuning set.

Decipherment  System             Tune (GV)    Test (GV)    Test (Web)
None          PBMT (Baseline)    18.5         17.1         7.7
Separate      Decipher-Pipeline  18.5         17.4         7.7
Joint         Decipher-Joint     18.9 (18.7)  18.0 (17.7)  9.8 (8.5)

Table 7.5: Decipher-Pipeline does not show significant improvement over the baseline system. In contrast, Decipher-Joint, which uses the joint word alignment and decipherment approach, achieves Bleu gains of 0.9 and 2.1 on the Global Voices test set and the web news test set, respectively. The results in brackets are obtained using a parser trained with only 120 sentences. (GV: Global Voices)

Table 7.5 shows that while using a translation lexicon learnt from decipherment does not improve the quality of machine translation significantly, the joint approach improves Bleu scores by 0.9 and 2.1 on the Global Voices test set and the web news test set respectively. The results also show that parsing quality correlates with the gains in Bleu scores: scores in brackets in the last row of the table are achieved using a dependency parser with 72.4% attachment accuracy, while scores outside the brackets are obtained using a dependency parser with 80.0% attachment accuracy.

I analyze the results and find that the gain mainly comes from two parts. First, adding expected counts from non-parallel data makes the distribution of translation probabilities sparser in the word alignment models. The probabilities of translation pairs favored by both the parallel data and the decipherment become higher. This gain is consistent with previous observations, where a sparse prior applied to EM helps improve word alignment and machine translation (Vaswani et al., 2012b). Second, expected counts from decipherment also help discover new translation pairs in the parallel data for low-frequency words, where those words are either aligned to NULL or to wrong translations in the baseline.
Chapter 8
Unifying Bayesian Inference and Vector Space Models for Improved Decipherment

So far, I have shown how to make decipherment scalable, and how to apply it to improve machine translation in the scenarios of domain adaptation and low-resource languages. Although dependency-based decipherment successfully addresses the problem of significant syntactic differences between languages, its reliance on high-quality parsers also limits its use. On the other hand, an alternative approach based on similarities of word vectors (Rapp, 1995; Mikolov et al., 2013a) is less sensitive to syntactic differences. In this chapter, I take advantage of both approaches and combine them in a joint inference process. More specifically, I extend previous work in large-scale Bayesian decipherment by introducing a better base distribution derived from similarities of word context vectors. Experiments show that the new approach improves state-of-the-art decipherment accuracy by a factor of two for Spanish/English and Malagasy/English. This chapter describes joint work reported in Dou et al. (2015).

8.1 Decipherment Model Revisited

In this section, I revisit the previous decipherment framework in a way that can be extended to the new work. This framework follows Ravi and Knight (2011a), who built an MT system using only non-parallel data for translating movie subtitles; Dou and Knight (2012) and Nuhn et al. (2012a), who scaled decipherment to larger vocabularies; and Dou and Knight (2013), who improved decipherment accuracy with dependency relations between words.

Throughout this section, I use f to denote target-language or ciphertext tokens, and e to denote source-language or plaintext tokens. Given ciphertext f: f_1 ... f_n, the task of decipherment is to find a set of parameters P(f_i | e_i) that convert f to sensible plaintext. The ciphertext f can be either full sentences (Ravi and Knight, 2011a; Nuhn et al., 2012a) or simply bigrams (Dou and Knight, 2013). Since using bigrams and their counts speeds up decipherment, in this work I treat f as a collection of bigrams, where f = {f^n}_{n=1}^{N} = {f_1^n, f_2^n}_{n=1}^{N}.

As described in Chapter 3, a cipher bigram f^n is modeled with the following generative story:

- First, a language model P(e) generates a sequence of two plaintext tokens e_1^n, e_2^n with probability P(e_1^n, e_2^n).
- Then, e_1^n is substituted with f_1^n and e_2^n with f_2^n, with probability P(f_1^n | e_1^n) · P(f_2^n | e_2^n).

Based on the above generative story, the probability of any cipher bigram f^n is:

P(f^n) = \sum_{e_1 e_2} P(e_1 e_2) \prod_{i=1}^{2} P(f_i^n | e_i)

The probability of the ciphertext corpus is:

P(\{f^n\}_{n=1}^{N}) = \prod_{n=1}^{N} P(f^n)

There are two sets of parameters in the model: the channel probabilities {P(f|e)} and the bigram language model probabilities {P(e'|e)}, where f ranges over the ciphertext vocabulary and e, e' range over the plaintext vocabulary. Given a plaintext bigram language model, the training objective is to learn the P(f|e) that maximize P({f^n}_{n=1}^{N}). When formulated like this, one can directly apply EM to solve the problem (Knight et al., 2006). However, EM has time complexity O(N V_e^2) and space complexity O(V_f V_e), where V_f and V_e are the sizes of the ciphertext and plaintext vocabularies respectively, and N is the number of cipher bigrams. This makes the EM approach unable to handle long ciphertexts with large vocabulary sizes.
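The source of that cost is visible in a direct computation of the bigram probability above, which touches all V_e^2 plaintext pairs for every cipher bigram. A toy sketch, with the probabilities stored in dictionaries of my own choosing:

    def cipher_bigram_prob(f1, f2, plaintext_vocab, p_lm, p_channel):
        """P(f) = sum over e1, e2 of P(e1 e2) * P(f1|e1) * P(f2|e2):
        O(V_e^2) terms per cipher bigram, which is exactly what the
        E step of EM must evaluate."""
        total = 0.0
        for e1 in plaintext_vocab:
            for e2 in plaintext_vocab:
                total += (p_lm.get((e1, e2), 0.0)
                          * p_channel.get((f1, e1), 0.0)
                          * p_channel.get((f2, e2), 0.0))
        return total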
Let's assume that P(f|e) and P(e'|e) are drawn from Dirichlet distributions with hyper-parameters \alpha_{f,e} and \alpha_{e,e'}, that is:

P(f|e) \sim \text{Dirichlet}(\alpha_{f,e})
P(e|e') \sim \text{Dirichlet}(\alpha_{e,e'})

The remainder of the generative story is the same as the noisy-channel model for decipherment. Given \alpha_{f,e} and \alpha_{e,e'}, the joint likelihood of the complete data and the parameters is:

P(\{f^n, e^n\}_{n=1}^{N}, \{P(f|e)\}, \{P(e|e')\})
  = P(\{f^n\}_{n=1}^{N} \mid \{e^n\}_{n=1}^{N}, \{P(f|e)\}) \cdot P(\{e^n\}_{n=1}^{N}, \{P(e|e')\})
  = \prod_e \frac{\Gamma(\sum_f \alpha_{f,e})}{\prod_f \Gamma(\alpha_{f,e})} \prod_f P(f|e)^{\#(e,f)+\alpha_{f,e}-1}
    \cdot \prod_{e'} \frac{\Gamma(\sum_e \alpha_{e,e'})}{\prod_e \Gamma(\alpha_{e,e'})} \prod_e P(e|e')^{\#(e',e)+\alpha_{e,e'}-1},   (8.1)

where \#(e,f) and \#(e',e) are the counts of the translated word pairs and plaintext bigram pairs in the complete data, and \Gamma(\cdot) is the Gamma function.

Unlike EM, Bayesian decipherment no longer searches for parameters P(f|e) that maximize the likelihood of the observed ciphertext. Instead, samples are drawn from the posterior distribution over plaintext sequences given the ciphertext. Under the above Bayesian decipherment model, the probability of a particular cipher word f_j taking value k, given the current plaintext word e_j and the samples for all other ciphertext and plaintext words, f_{-j} and e_{-j}, is:

P(f_j = k \mid e_j, f_{-j}, e_{-j}) = \frac{\#(k, e_j)_{-j} + \alpha_{k,e_j}}{\#(e_j)_{-j} + \sum_f \alpha_{f,e_j}},

where \#(k, e_j)_{-j} and \#(e_j)_{-j} are the counts of the ciphertext/plaintext word pair and of the plaintext word in the samples, excluding f_j and e_j. Similarly, the probability of a plaintext word e_j taking value l, given the samples for all other plaintext words, is:

P(e_j = l \mid e_{-j}) = \frac{\#(l, e_{j-1})_{-j} + \alpha_{l,e_{j-1}}}{\#(e_{j-1})_{-j} + \sum_e \alpha_{e,e_{j-1}}}.   (8.2)

Given large amounts of plaintext data, I can train a high-quality dependency-bigram language model P_{LM}(e|e') and use it to guide the samples and learn a better posterior distribution. Therefore, I define \alpha_{e,e'} = \beta \cdot P_{LM}(e|e') and set \beta to be very high. The probability of a plaintext word (Equation 8.2) is then:

P(e_j = l \mid e_{-j}) \approx P_{LM}(l \mid e_{j-1}).   (8.3)

To sample from the posterior, a sampler iterates over the observed ciphertext bigram tokens and uses Equations 8.1 and 8.3 to sample a plaintext token with probability

P(e_j \mid e_{-j}, f) \propto P_{LM}(e_j \mid e_{j-1}) \cdot P_{LM}(e_{j+1} \mid e_j) \cdot P(f_j \mid e_j, f_{-j}, e_{-j}).   (8.4)

In previous work (Dou and Knight, 2012), I used symmetric priors over the channel probabilities, \alpha_{f,e} = \alpha \cdot \frac{1}{V_f}, with \alpha set to 1. Symmetric priors over word translation probabilities are a poor choice, as one would not a priori expect plaintext words and ciphertext words to co-occur with equal frequency. Bayesian inference is a powerful framework that allows one to integrate useful prior information into the sampling process. In the next section, I describe how to learn better priors using distributional properties of words; in subsequent sections, I show significant improvements over the baseline from learning these better priors.

8.2 Base Distribution with Cross-Lingual Word Similarities

As shown in the previous section, the base distribution in Bayesian decipherment is given independently of the inference process. A better base distribution can improve decipherment accuracy: ideally, word pairs that are similar should receive higher base distribution probabilities. One straightforward option is orthographic similarity. This works for closely related languages, e.g., the English word "new" is translated as "neu" in German and "nueva" in Spanish. However, it fails when the two languages are not closely related, e.g., Chinese/English.
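As a purely hypothetical illustration of the orthographic idea just mentioned (the thesis does not use this; it derives the prior from embeddings, described next), one could set the prior mass of a pair from a normalized string similarity:

    import difflib

    def ortho_alpha(f, e, scale=1.0):
        """alpha_{f,e} proportional to a normalized string similarity."""
        sim = difflib.SequenceMatcher(None, f, e).ratio()  # in [0, 1]
        return scale * sim

    print(ortho_alpha("nueva", "new"))  # related pair gets noticeable mass
    print(ortho_alpha("xin", "new"))    # unrelated surface forms get little

Such a prior collapses exactly in the Chinese/English case above, which motivates the embedding-based base distribution that follows.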
Previous work aims to discover translations from comparable data based on word context similarities, under the assumption that words appearing in similar contexts have similar meanings. The approach straightforwardly discovers monolingual synonyms; however, when it comes to finding translations, one challenge is to draw a mapping between the different context spaces of the two languages. In previous work, this mapping is usually learned from a seed lexicon.

There has been much recent work in learning distributional vectors (embeddings) for words. The most popular approaches are the skip-gram and continuous-bag-of-words models (Mikolov et al., 2013a). In the work by Mikolov et al. (2013b), the authors successfully learn word translations using linear transformations between the source and target word vector spaces. However, unlike our learning setting, their approach relies on large amounts of translation pairs learned from parallel data to train the linear transformations. Inspired by these approaches, I exploit high-quality monolingual word embeddings to help learn better posterior distributions in unsupervised decipherment, without any parallel data.

The idea is to model \alpha_{f,e} using pre-trained word embeddings. Mimno and McCallum (2012) developed topic models in which the base distribution over topics is a log-linear model of observed document features, which permits learning better priors over topic distributions for each document. Similarly, I introduce a latent cross-lingual linear mapping M and define:

\alpha_{f,e} = \exp\{v_e^\top M v_f\},   (8.5)

where v_e and v_f are the pre-trained plaintext and ciphertext word embeddings, and M is the similarity matrix between the two embedding spaces. \alpha_{f,e} can be thought of as the affinity of a plaintext word for being mapped to a ciphertext word. Rewriting the channel part of the joint likelihood in Equation 8.1:

P(\{f^n\}_{n=1}^{N} \mid \{e^n\}_{n=1}^{N}, \{P(f|e)\})
  = \prod_e \frac{\Gamma(\sum_f \exp\{v_e^\top M v_f\})}{\prod_f \Gamma(\exp\{v_e^\top M v_f\})} \prod_f P(f|e)^{\#(e,f)+\exp\{v_e^\top M v_f\}-1}

Integrating out the channel probabilities, the complete-data likelihood of the observed ciphertext bigrams and the sampled plaintext bigrams is:

P(\{f^n\} \mid \{e^n\})
  = \prod_e \frac{\Gamma(\sum_f \exp\{v_e^\top M v_f\})}{\prod_f \Gamma(\exp\{v_e^\top M v_f\})}
    \cdot \frac{\prod_f \Gamma(\exp\{v_e^\top M v_f\} + \#(e,f))}{\Gamma(\sum_f \exp\{v_e^\top M v_f\} + \#(e))}

I also add an L2 regularization penalty on the elements of M. The derivative of \log P(\{f^n\} \mid \{e^n\}) - \frac{\lambda}{2} \sum_{i,j} M_{i,j}^2, where \lambda is the regularization weight, with respect to M, is:

\frac{\partial}{\partial M}\Big[\log P(\{f^n\} \mid \{e^n\}) - \frac{\lambda}{2} \sum_{i,j} M_{i,j}^2\Big]
  = \sum_e \sum_f \Big[ \Psi\big(\textstyle\sum_{f'} \exp\{v_e^\top M v_{f'}\}\big)
    - \Psi\big(\textstyle\sum_{f'} \exp\{v_e^\top M v_{f'}\} + \#(e)\big)
    + \Psi\big(\exp\{v_e^\top M v_f\} + \#(e,f)\big)
    - \Psi\big(\exp\{v_e^\top M v_f\}\big) \Big] \exp\{v_e^\top M v_f\} \, v_e v_f^\top - \lambda M,

where I use \partial \exp\{v_e^\top M v_f\} / \partial M = \exp\{v_e^\top M v_f\} \, v_e v_f^\top, and \Psi(\cdot) is the Digamma function, the derivative of \log \Gamma(\cdot). Again, following Mimno and McCallum (2012), I train the similarity matrix M with stochastic EM. In the E-step, I draw sample plaintext words for the observed ciphertext using Equation 8.4, and in the M-step, I update M to maximize \log P(\{f^n\} \mid \{e^n\}) with stochastic gradient descent. The time complexity of computing the gradient is O(V_e \cdot V_f); however, significant speedups can be achieved by precomputing v_e v_f^\top and exploiting GPUs for matrix operations.
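The M-step gradient above can be written compactly with dense matrices. Below is an illustrative numpy version under assumed shapes, not the released GPU implementation: E is the |V_e| x d matrix of plaintext embeddings, F the |V_f| x d matrix of ciphertext embeddings, and C the |V_e| x |V_f| matrix of sampled counts #(e, f).

    import numpy as np
    from scipy.special import digamma

    def grad_M(M, E, F, C, lam):
        A = np.exp(E @ M @ F.T)              # alpha_{f,e} for all pairs
        S = A.sum(axis=1, keepdims=True)     # sum_f' exp(v_e^T M v_f')
        n_e = C.sum(axis=1, keepdims=True)   # #(e)
        G = (digamma(S) - digamma(S + n_e)
             + digamma(A + C) - digamma(A)) * A   # per-pair digamma terms
        # sum over (e, f) of G[e, f] * v_e v_f^T, minus the L2 penalty
        return E.T @ G @ F - lam * M

A stochastic gradient ascent step would then be M += learning_rate * grad_M(M, E, F, C, lam), matching the stochastic EM described above.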
After learning M, I set

\alpha_{f,e} = \Big( \sum_{f'} \exp\{v_e^\top M v_{f'}\} \Big) \cdot \frac{\exp\{v_e^\top M v_f\}}{\sum_{f''} \exp\{v_e^\top M v_{f''}\}} = \alpha_e \, m_{e,f},   (8.6)

where \alpha_e = \sum_{f'} \exp\{v_e^\top M v_{f'}\} is the concentration parameter and m_{e,f} = \exp\{v_e^\top M v_f\} / \sum_{f''} \exp\{v_e^\top M v_{f''}\} is an element of the base measure m_e for plaintext word e. In practice, I find that \alpha_e can be very large, overwhelming the counts from sampling when there are only a few ciphertext bigrams. Therefore, I keep m_e and set \alpha_e proportional to the data size.

              Spanish                   English
  Training    992 million (Gigaword)    940 million (Gigaword)
  Evaluation  1.1 million (Europarl)    1.0 million (Europarl)

Table 8.1: Size of data in tokens used in the Spanish/English decipherment experiment

8.3 Deciphering Spanish

In the previous section, I presented how to learn a better base distribution with word embeddings. In this section, I describe the data and experimental conditions for deciphering Spanish into English.

8.3.1 Data

In the Spanish/English decipherment experiments, I use half of the Gigaword corpus as monolingual data and a small amount of parallel data from Europarl for evaluation. I keep only the 10k most frequent word types for both languages and replace all other word types with "UNK". I also exclude sentences longer than 40 tokens, which significantly slow down the parser. After preprocessing, the sizes of the data for each language are shown in Table 8.1. While I use all the monolingual data shown in Table 8.1 to learn word embeddings, I only parse the AFP (Agence France-Presse) section of the Gigaword corpus to extract cipher dependency bigrams and build a plaintext language model. I also use GIZA (Och and Ney, 2003a) to align the Europarl parallel data to build a dictionary for evaluating decipherment results.

8.3.2 Systems

The baseline system carries out decipherment on dependency bigrams as described in Dou and Knight (2013). I use the Bohnet parser (Bohnet, 2010) to parse the AFP section of both the Spanish and English versions of the Gigaword corpus. Since not all dependency relations are shared across the two languages, I do not extract all dependency bigrams. Instead, I only use bigrams with dependency relations from the following list:

  Verb / Subject
  Verb / Object
  Preposition / Object
  Noun / Noun-Modifier

I denote the system that uses the new method DMRE (Dirichlet Multinomial Regression with Embeddings). The system is the same as the baseline except that it uses a base distribution derived from word embedding similarities. Word embeddings are learned using word2vec (Mikolov et al., 2013a). For all systems, language models are built using the SRILM toolkit (Stolcke, 2002), with modified Kneser-Ney smoothing (Kneser and Ney, 1995).

8.3.3 Sampling Procedure

Motivated by previous work, I use multiple random restarts and an iterative sampling process to improve decipherment (Dou and Knight, 2012). As shown in Figure 8.1 (iterative sampling procedures), I start several sampling processes, each with a different random sample; results from the different runs are then combined to initiate the next sampling iteration. The details of the sampling procedure are listed below:

1. Extract dependency bigrams from the parsing output and collect their counts.

2. Keep bigrams whose counts are greater than a threshold t. Then start N differently seeded and initialized sampling processes, and perform sampling.

3. At the end of sampling, extract word translation pairs (f, e) from the final sample and estimate translation probabilities P(e|f) for each pair.
Then construct a translation table by keeping translation pairs (f, e) seen in more than one decipherment, and use the average P(e|f) as the new translation probability (this merging step is sketched just before the results tables in Section 8.5).

4. Start N different sampling processes again. Initialize the first sample with the translation pairs obtained from the previous step (for each dependency bigram f_1, f_2, find the English sequence e_1, e_2 whose P(e_1|f_1) \cdot P(e_2|f_2) \cdot P(e_1, e_2) is highest). Initialize the similarity matrix M with the one learned by the previous sampling process whose posterior probability is highest. Go to the third step and repeat until convergence.

5. Lower the threshold t to include more bigrams in the sampling process. Go to the second step and repeat until t = 1.

The overall process alternates between sampling and learning the similarity matrix M: sampling creates training examples for learning M, and the new M is used to update the base distribution for sampling. In the Spanish/English decipherment experiments, I use 10 different random starts. As pointed out in Section 8.2, setting \alpha_e to its theoretical value (Equation 8.6) gives poor results, as it can be quite large. In the experiments, I set \alpha_e to a small value for the smaller data sets and increase it as more ciphertext becomes available. I find that using the learned base distribution always improves decipherment accuracy, though certain ranges work better for a given data size. I use \alpha_e values of 1, 2, and 5 for ciphertexts with 100k, 1 million, and 10 million tokens respectively. I leave automatic learning of \alpha_e for future work.

8.4 Deciphering Malagasy

In this section, I describe the experimental settings for Malagasy/English decipherment.

              Malagasy                    English
  Training    16 million (Web)            1.2 billion (Gigaword and Web)
  Evaluation  2.0 million (GlobalVoices)  1.8 million (GlobalVoices)

Table 8.2: Size of data in tokens used in the Malagasy/English decipherment experiment. GlobalVoices is a parallel corpus.

8.4.1 Data

Table 8.2 lists the sizes of the monolingual and parallel data used in this experiment, released by Dou et al. (2014). The monolingual data in Malagasy contains news text collected from Madagascar websites. The English monolingual data contains Gigaword and an additional 300 million tokens of African news. The parallel data (used for evaluation) is collected from GlobalVoices, a multilingual news website where volunteers translate news into different languages.

8.4.2 Systems

The baseline system is the same as the one used in the Spanish/English decipherment experiments. I use data collected in my previous work (Dou et al., 2014) to build a Malagasy dependency parser. For English, I use the Turbo parser, trained on the Penn Treebank (Martins et al., 2013).

8.4.3 Sampling Procedure

I use the same sampling protocol designed for Spanish/English decipherment, doubling the number of random starts to 20. Furthermore, compared with Spanish/English decipherment, I find the base distribution plays a more important role in achieving higher decipherment accuracy for Malagasy/English. Therefore, I set \alpha_e to 10, 50, and 200 when deciphering 100k, 1 million, and 20 million token ciphertexts, respectively.

8.5 Results

In this section, I first compare the decipherment accuracy of the baseline with that of the new approach, showing significant improvements. Then, I evaluate the quality of the base distribution through visualization.
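For concreteness, the table-merging in steps 3 and 4 of the sampling protocol above can be sketched as follows; the helper names are illustrative, not the released code:

    from collections import defaultdict

    def merge_ttables(ttables):
        """Merge translation tables from N restarts: keep pairs (f, e) seen
        in more than one run and average their P(e|f)."""
        seen = defaultdict(int)
        prob_sum = defaultdict(float)
        for table in ttables:              # one dict (f, e) -> P(e|f) per run
            for pair, p in table.items():
                seen[pair] += 1
                prob_sum[pair] += p
        return {pair: prob_sum[pair] / n
                for pair, n in seen.items() if n > 1}

Requiring agreement between at least two runs filters out idiosyncratic mistakes of any single random start before the next sampling iteration begins.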
                     5k most frequent       10k most frequent
  Ciphertext size    Baseline    DMRE       Baseline    DMRE
  100k               1.9         12.4       1.1         7.1
  1 million          7.3         37.7       4.2         23.6
  10 million         29.0        64.7       23.4        43.7
  100 million        45.8        67.4       39.4        58.1

Table 8.3: Spanish/English decipherment top-5 accuracy (%) on the 5k and 10k most frequent word types

                     5k most frequent       10k most frequent
  Ciphertext size    Baseline    DMRE       Baseline    DMRE
  100k               1.2         2.7        0.6         1.4
  1 million          2.5         5.8        1.3         3.2
  16 million         5.4         11.2       3.0         6.9

Table 8.4: Malagasy/English decipherment top-5 accuracy (%) on the 5k and 10k most frequent word types

I use top-5 type accuracy as the evaluation metric for decipherment. Given a word type f in Spanish, I find the top-5 translation pairs (f, e) ranked by P(e|f) in the learned decipherment translation table. If any pair (f, e) can also be found in a gold translation lexicon T_gold, I treat the word type f as correctly deciphered. Let |C| be the number of word types correctly deciphered, and |V| the total number of word types evaluated. I define type accuracy as |C| / |V|.

To create T_gold, I use GIZA to align a small amount of Spanish/English parallel text (1 million tokens for each language) and use the lexicon derived from the alignment as the gold translation lexicon. T_gold covers a subset of 4233 word types among the 5k most frequent word types, and 7479 word types among the top 10k. I decipher the 10k most frequent Spanish word types into the 10k most frequent English word types, and evaluate decipherment accuracy on both the 5k most frequent word types and the full 10k. Accuracies for the 5k and 10k most frequent word types for each language pair are presented in Tables 8.3 and 8.4.

[Figure 8.2: Learning curves of top-5 accuracy evaluated on the 5k most frequent word types for Spanish/English decipherment.]

I also present the learning curves of decipherment accuracy for the 5k most frequent word types. Figure 8.2 compares the baseline with DMRE in deciphering Spanish into English. Performance of the baseline is in line with previous work by Dou and Knight (2013); the accuracy reported here is higher because I evaluate top-5 accuracy for each word type. With 100k tokens of Spanish text, the baseline achieves 1.9% accuracy, while DMRE reaches 12.4%, improving on the baseline by over 6 times. Although the gains attenuate as the number of ciphertext tokens increases, they remain large: with 100 million cipher tokens, the baseline achieves 45.8% accuracy, while DMRE reaches 67.4%.

[Figure 8.3: Learning curves of top-5 accuracy evaluated on the 5k most frequent word types for Malagasy/English decipherment.]

Figure 8.3 compares the baseline with the new approach in deciphering Malagasy into English. With 100k tokens of data, the baseline achieves 1.2% accuracy, and DMRE improves it to 2.4%. I observe consistent improvements throughout the experiment. In the end, the baseline obtains 5.8% accuracy, and DMRE improves it to 11.2%.

The lower accuracy in Malagasy/English decipherment is attributable to the following factors. First, compared with the Spanish parser, the Malagasy parser has lower parsing accuracy. Second, word alignment between Malagasy and English is more challenging, producing fewer correct translation pairs. Last but not least, the domain of the English language model is much closer to the domain of the Spanish monolingual text than to that of the Malagasy text. Overall, the new approach achieves large, consistent gains across both language pairs.
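For reference, the top-5 type accuracy metric defined above amounts to the following check; this is a minimal sketch with illustrative names, not the thesis evaluation script:

    def top5_type_accuracy(ttable, gold_pairs, eval_types):
        """ttable: f -> list of (e, P(e|f)); gold_pairs: set of (f, e)."""
        correct = 0
        for f in eval_types:
            top5 = sorted(ttable.get(f, []), key=lambda x: -x[1])[:5]
            if any((f, e) in gold_pairs for e, _ in top5):
                correct += 1               # |C|: correctly deciphered types
        return correct / len(eval_types)   # |C| / |V|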
I hypothesize that the gain comes from a better base distribution that takes larger context information into account. This helps prevent the language model from driving decipherment in a wrong direction.

Since the learned transformation matrix M significantly improves decipherment accuracy, it is likely translation-preserving; that is, plaintext words are transformed from their native vector space to points in the ciphertext space such that translations land close to each other. To visualize this effect, I take the 5k most frequent plaintext words and transform them into new embeddings in the ciphertext embedding space, v_e' = v_e^\top M, where M is learned from the 10 million token Spanish bigram data. I then project the 5k most frequent ciphertext words and the projected plaintext words from the joint embedding space into a 2-dimensional space using t-SNE (Van der Maaten and Hinton, 2008). In Figure 8.4, I see an instance of a recurring phenomenon where translation pairs are very close and sometimes even overlap each other, for example (judge, jueces) and (secret, secretos). The word "magistrado" does not appear in our evaluation set; however, it is placed close to its possible translations. Thus, the approach is capable of learning word translations that cannot be discovered from limited parallel data.

I also see translation clusters, where translations of groups of words are close to each other. For example, in Figure 8.5, time expressions in Spanish are quite close to their translations in English. Although better-quality translation visualizations (Mikolov et al., 2013b) have been presented in previous work, they exploit large amounts of parallel data to learn the mapping between source and target words, while our transformation is learned on non-parallel data.

[Figure 8.4: Translation pairs are often close and sometimes overlap each other. Spanish words have been appended with the tag "spanish".]

These results show that the new approach can achieve high decipherment accuracy and discover novel word translations from non-parallel data.

[Figure 8.5: Semantic groups of word translations appear close to each other.]

Chapter 9

Conclusion and Future Work

9.1 Conclusions

In this work, I apply slice sampling to Bayesian decipherment and show significant improvement in deciphering accuracy compared with the state-of-the-art algorithm. The approach is not only accurate but also highly scalable: in experiments, I decipher at the scale of the English Gigaword corpus, containing billions of tokens and hundreds of thousands of word types.

Furthermore, to address the problems of word reordering, insertion, and deletion, I introduce syntax into decipherment by using dependency bigrams instead of adjacent bigrams. Experimental results show that using dependency bigrams improves decipherment accuracy 5-fold compared with the state-of-the-art approach when deciphering Spanish into English. Moreover, I use word embeddings to take larger contextual information into account, and show that this further doubles the accuracy.

Last but not least, I show the value of this decipherment work by demonstrating that decipherment improves out-of-domain machine translation by finding high-quality translations for OOV words, and improves word alignment quality when the amount of parallel data is limited. By deciphering large amounts of non-parallel data, I observe significant Bleu gains in Spanish/French, Spanish/English, and Malagasy/English machine translation experiments.
9.2 Future Work

Faster Decipherment: In the most recent work, in Chapter 8, I combine vector space models and Bayesian inference to improve deciphering accuracy. However, the improvement comes with some compromise in speed: the M-step has a time complexity of O(V^2). This is similar to the problem of deciphering with the EM algorithm; both approaches struggle to scale to very large vocabularies. To make the new approach applicable to machine translation in a real-world scenario, it is necessary to find new techniques to speed up the M-step. Just as EM was replaced with Bayesian inference, this problem could potentially be addressed by using sampling as a means of approximation.

Phrase Decipherment: In this thesis, decipherment is mostly used to find word-level translations from non-parallel data, which are then used to improve the quality of machine translation. This approach has its limitations, as translation does not always happen word by word; this is why phrase-based and syntax-based MT systems outperform word-based systems. Given that the more complex (phrase- or syntax-based) MT systems still rely on word alignment to build their translation models, it is reasonable to envision learning phrase- or even syntax-level translation rules from non-parallel data given a translation lexicon. Another approach is to perform decipherment at the phrase level directly; however, this requires much more efficient decipherment algorithms.

Joint Decipherment and Parsing: In Chapter 5, I introduce dependency relations to address the problem of word reordering when deciphering foreign languages. Although experimental results show that decipherment using dependency bigrams improves deciphering accuracy significantly, the reliance on good-quality dependency parsers also limits the application of decipherment. This is especially true for low-resource languages. This problem gives rise to an interesting question: whether it is possible to train a good parser without annotated data during decipherment. Previously, progress in unsupervised dependency parsing has been disappointing, and some researchers attribute this to the lack of a good objective function (mostly likelihood of the observed data). Now, given the connection between dependency structure and deciphering accuracy, it may be possible to improve the objective function by taking deciphering accuracy into account.

Reference List

Taylor Berg-Kirkpatrick and Dan Klein. Simple effective decipherment via combinatorial optimization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.

Taylor Berg-Kirkpatrick and Dan Klein. Decipherment with a million random restarts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2013.

Shane Bergsma and Benjamin Van Durme. Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three. AAAI Press, 2011.

Bernd Bohnet. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics. Coling, 2010.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation.
Computational Linguistics, 19:263-311, 1993.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2008.

Eric Corlett and Gerald Penn. An exact A* method for deciphering letter-substitution ciphers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.

Eric Corlett and Gerald Penn. Why letter substitution puzzles are not hard to solve: A case study in entropy and probabilistic search-complexity. In Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13). Association for Computational Linguistics, 2013.

Hal Daumé III and Jagadeesh Jagarlamudi. Domain adaptation for machine translation by mining unseen words. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011.

Arthur Dempster, Nan Laird, and Donald Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

Qing Dou and Kevin Knight. Large scale decipherment for out-of-domain machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.

Qing Dou and Kevin Knight. Dependency-based decipherment for resource-limited machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2013.

Qing Dou, Ashish Vaswani, and Kevin Knight. Beyond parallel data: Joint word alignment and decipherment improves machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014.

Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. Unifying Bayesian inference and vector space models for improved decipherment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2015.

Pascale Fung and Lo Yuen Yee. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, 1998.

Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. What's in a translation rule? In HLT-NAACL 2004: Main Proceedings. Association for Computational Linguistics, 2004.

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. Dependency grammar induction via bitext projection constraints. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, 2009.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2009.

Dan Garrette, Jason Mielens, and Jason Baldridge. Real-world semi-supervised learning of POS-taggers for low-resource languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2013.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In Readings in computer vision: issues, problems, principles, and paradigms. Morgan Kaufmann Publishers Inc., 1987.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL-08: HLT. Association for Computational Linguistics, 2008.

David R. Hardoon, Sandor R. Szedmak, and John R. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2004.

Ann Irvine and Chris Callison-Burch. Supervised bilingual lexicon induction with multiple monolingual signals. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2013a.

Ann Irvine and Chris Callison-Burch. Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation. Association for Computational Linguistics, August 2013b.

Wenbin Jiang, Liang Huang, Qun Liu, and Yajuan Lü. A cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of ACL-08: HLT. Association for Computational Linguistics, 2008.

Young-Bum Kim and Benjamin Snyder. Unsupervised consonant-vowel prediction over hundreds of languages. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2013.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012.

Reinhard Kneser and Hermann Ney. Improved backing-off for M-gram language modeling. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1995.

Kevin Knight and Kenji Yamada. A computational approach to deciphering unknown scripts. In Proceedings of the ACL Workshop on Unsupervised Learning in Natural Language Processing. Association for Computational Linguistics, 1999.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Linguistics, 2006.

Kevin Knight, Beáta Megyesi, and Christiane Schaefer. The Copiale cipher. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web. Association for Computational Linguistics, 2011.

Philipp Koehn. Europarl: a parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, Phuket, Thailand, 2005a. Asia-Pacific Association for Machine Translation.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, Phuket, Thailand, 2005b. Asia-Pacific Association for Machine Translation.

Philipp Koehn and Kevin Knight. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition. Association for Computational Linguistics, 2002.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Association for Computational Linguistics, 2003.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2007.

Patrik Lambert, Adrià De Gispert, Rafael Banchs, and José B. Mariño. Guidelines for word alignment evaluation and manual alignment. Language Resources and Evaluation, 39(4):267-285, 2005.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

Mariona Taulé, M. Antònia Martí, and Marta Recasens. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). European Language Resources Association (ELRA), 2008.

Andre Martins, Miguel Almeida, and Noah A. Smith. Turning on the Turbo: Fast third-order non-projective Turbo parsers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 2013.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013b.

David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. arXiv preprint arXiv:1206.3278, 2012.

Dragos Stefan Munteanu and Daniel Marcu. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 2005.

Radford Neal. Slice sampling. Annals of Statistics, 31, 2000.

David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, 10, 2009.

Malte Nuhn and Hermann Ney. Decipherment complexity in 1:1 substitution ciphers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2013.

Malte Nuhn, Arne Mauser, and Hermann Ney. Deciphering foreign language by combining language models and context vectors. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, 2012a.

Malte Nuhn, Arne Mauser, and Hermann Ney. Deciphering foreign language by combining language models and context vectors. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2012b.

Malte Nuhn, Julian Schamper, and Hermann Ney. Beam search for solving substitution ciphers. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2013.

Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2003.

Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 2003a.

Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 2003b.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002.

Reinhard Rapp. Identifying word translations in non-parallel texts. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1995.

Sujith Ravi. Scalable decipherment for machine translation via hash sampling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2013.

Sujith Ravi and Kevin Knight. Attacking decipherment problems optimally with low-order n-gram models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.

Sujith Ravi and Kevin Knight. Probabilistic methods for a Japanese syllable cipher. In Proceedings of the 22nd International Conference on Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy. Springer-Verlag, 2009.

Sujith Ravi and Kevin Knight. Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011a.

Sujith Ravi and Kevin Knight. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011b.

Sujith Ravi and Kevin Knight. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2011c.

Sravana Reddy and Kevin Knight. What we know about the Voynich Manuscript. In ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Association for Computational Linguistics, 2011.

Sravana Reddy and Kevin Knight. Decoding running key ciphers. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2012.

Jason Riesa and Daniel Marcu. Hierarchical search for word alignment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.

Benjamin Snyder, Regina Barzilay, and Kevin Knight. A statistical model for lost language decipherment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.

Andreas Stolcke.
SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, 2002.

Jörg Tiedemann. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing. John Benjamins, Amsterdam/Philadelphia, 2009.

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.

Ashish Vaswani, Liang Huang, and David Chiang. Smaller alignment models for better translations: Unsupervised word alignment with the l0-norm. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2012a.

Ashish Vaswani, Liang Huang, and David Chiang. Smaller alignment models for better translations: Unsupervised word alignment with the l0-norm. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, 2012b.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2. Association for Computational Linguistics, 1996.

Warren Weaver. Translation (1949). Reproduced in W.N. Locke, A.D. Booth (eds.), pages 15-23. MIT Press, 1955.

David Yarowsky and Grace Ngai. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. Association for Computational Linguistics, 2001.

Yue Zhang and Stephen Clark. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL-08: HLT, Columbus, Ohio, 2008. Association for Computational Linguistics.

Appendix A

MonoGIZA

Unlike GIZA (Och and Ney, 2003a), which finds translations from parallel data, this software takes two monolingual corpora as input and outputs a word-to-word translation table. It implements the algorithms described in Dou and Knight (2012) and Dou et al. (2015).

A.1 Compiling and Installation

Download the package from http://www.isi.edu/natural-language/software/ and unzip it. All necessary dependencies are included in the 3rdparty folder. To compile, just type ./compile.sh. This will install the SRILM toolkit and generate a program called slice_with_embeddings. Set the environment variable:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:3rdparty/tbb/lib/intel64/gcc4.4

A.2 Example Usage

To get an overview of the options and their explanations, run:

./demo.sh --help

The package includes 3 demos: letter substitution decipherment, Japanese phoneme to English phoneme decipherment, and Spanish to English decipherment.

A.2.1 Letter Substitution Cipher Decipherment

To decipher a very simple letter substitution cipher, simply type:

./demo.sh data/test.f data/test.e ttable.final [options]

Here ttable.final is the output ttable, with each line in the following format:

cipher ||| plain ||| p(cipher|plain) p(plain|cipher)

A.2.2 Japanese-English Phoneme Decipherment

To decipher Japanese phonemes into English, type:

./demo.sh data/mono-shuf.f data/mono.e ttable.final [options]

You might want to run the program with different random seeds in parallel and combine the output ttables to increase decipherment accuracy, e.g.:
cat *.final | python join_table.py > joined.final

A.2.3 Spanish-English Decipherment

You can also decipher languages. The package includes the data necessary for a trial run deciphering Spanish into English. For deciphering foreign languages, using dependency bigrams is better. I have converted words to their ids; the mappings can be found in data/vocab.giga.es.top10k and data/vocab.giga.en.top10k.

First, you can use the following prepared Spanish dependency bigrams as input cipher bigrams:

cipher.es.v10k.100k.3 (bigrams extracted from 100k tokens whose counts are greater than 3)
cipher.es.v10k.100k.1 (bigrams extracted from 100k tokens whose counts are greater than 1)

Second, build a dependency language model from English dependency bigrams:

3rdparty/srilm/bin/i686-m64/ngram-count -text data/bigram.id.en -text-has-weights -order 2 -lm train.id.lm -kndiscount

Third, build a pre-sorted list:

java -jar -Xmx10g Build_List.jar train.id.lm 2000 > train.id.data

Last, to decipher without word embeddings:

./slice_with_embeddings --output_ttable ttable.id --lm train.id.lm --sorted_list train.id.data --sorted_list_size 2000 --cipher_bigrams data/cipher.es.v10k.100k.3 --vocab_size 10001 --interval_iteration 10000

To decipher with word embeddings:

./slice_with_embeddings --output_ttable ttable.id --lm train.id.lm --sorted_list train.id.data --sorted_list_size 2000 --cipher_bigrams data/cipher.es.v10k.100k.3 --vocab_size 10001 --interval_iteration 10000 --total_iteration 50000 --plain_embeddings data/vectors.s50.10k.en --cipher_embeddings data/vectors.s50.10k.es --use_embeddings 1 --use_uniform_base 0 --fast_mode 1

A.3 Options

Table A.1 lists the options and their detailed explanations.

  --base_file           Base distribution file used to initialize the base distribution (prior) in Bayesian decipherment. Format: f e p(f|e)
  --output_mapping      Output file name for the mapping matrix M
  --output_ttable       Output file name for the translation table. Format: f ||| e ||| p(f|e) p(e|f)
  --cipher_embeddings   Ciphertext word embeddings file
  --plain_embeddings    Plaintext word embeddings file
  --mapping_seed        Seed file used to initialize the mapping matrix M; a previous output of M can be used here
  --seed_table          Seed table file used to initialize the first sample; same format as the output ttable
  --cipher_bigrams      Cipher bigrams file
  --sorted_list         Pre-sorted list file
  --lm                  Language model file
  --use_embeddings      Whether to use embeddings during decipherment to re-estimate the prior. 1 yes, 0 no
  --sorted_list_size    Number of top-k candidates for each context
  --use_uniform_base    Whether to use a uniform prior for decipherment. 1 yes, 0 no
  --random_seed         Random seed for the sampler
  --reg                 Regularization parameter for the M-step in stochastic EM
  --learning_rate       Learning rate for gradient ascent in stochastic EM
  --fast_mode           Whether to perform approximate slice sampling. 1 yes, 0 no
  --base_threshold      Controls the number of candidates to consider at each sampling operation; candidates whose prior is above this threshold are always considered. The default value is 1/|V_observe|. For details, refer to c described in Section 7.5.4 of Chapter 7
  --alpha               Weight of the base distribution
  --dimension           Size of the word embedding vectors
  --num_threads         Number of threads for the slice sampler
  --vocab_size          Maximum number of word types in plaintext and ciphertext. Default is 10,000
  --interval_iteration  Number of sampling iterations between each M-step in stochastic EM. Default is 100,000
  --m_iteration         Number of M-steps in stochastic EM. Default is 5
  --total_iteration     Number of total iterations. Default is 1

Table A.1: MonoGIZA options
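For reference, a plausible implementation of a joining script like join_table.py above is sketched here; it is hypothetical (the released script may differ), but follows the combination rule described in Chapter 8: keep pairs seen in more than one run and average their probabilities.

    import sys
    from collections import defaultdict

    # Read ttables from stdin in the format: f ||| e ||| p(f|e) p(e|f)
    seen = defaultdict(int)
    pfe = defaultdict(float)
    pef = defaultdict(float)
    for line in sys.stdin:
        if not line.strip():
            continue
        f, e, probs = [x.strip() for x in line.split("|||")]
        p1, p2 = map(float, probs.split())
        seen[(f, e)] += 1
        pfe[(f, e)] += p1
        pef[(f, e)] += p2
    for (f, e), n in seen.items():
        if n > 1:  # keep pairs found by more than one random start
            print(f"{f} ||| {e} ||| {pfe[(f, e)] / n} {pef[(f, e)] / n}")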
Abstract
Thanks to the use of parallel data and advanced machine learning techniques, we have seen tremendous improvement in the field of machine translation over the past 20 years. However, due to lack of sufficient parallel data, the quality of machine translation is still far from satisfying for many language pairs and domains. In general, it is easier to obtain non-parallel data, and much work has tried to discover word level translations from non-parallel data. Nonetheless, improvements to machine translation have been limited. In this work, I follow a decipherment approach to learn translations from non-parallel data and achieve significant gains in the quality of machine translation. ❧ First of all, I apply slice sampling to Bayesian decipherment to make it highly scalable and accurate, making it possible to decipher billions of tokens with hundreds of thousands of word types at high accuracy. Then, when it comes to deciphering foreign languages, I introduce dependency relations to address the problems of word reordering, insertion, and deletion. Experiments show that dependency relations help improve Spanish/English deciphering accuracy by over 5-fold. Last but not least, this accuracy is further doubled when word embeddings are used to incorporate more contextual information. ❧ With faster and more accurate decipherment algorithms, I decipher large amounts of monolingual data to improve state-of-the-art machine translation systems in the scenarios of domain adaptation and low density languages. Through experiments, I show that decipherment finds high quality translations for out-of-vocabulary words in the task of domain adaptation, and helps improve word alignment when the amount of parallel data is limited. I observe Bleu gains of up to 3.8 points and 1.9 points in Spanish/French and Malagasy/English machine translation experiments respectively. ❧ In the end, I release a decipherment package, MonoGIZA, which finds word level translations from monolingual corpora. It serves to facilitate future research in replicating and advancing the work described in this thesis.