Smaller, Faster and Accurate Models for Statistical Machine Translation

by

Ashish Teku Vaswani

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

June, 2014

Copyright 2014 Ashish Teku Vaswani

Acknowledgements

I owe my deepest gratitude to my advisor, David Chiang, for his inspiring guidance in every step of my PhD. I have been very fortunate to be co-advised by Liang Huang, who began advising me from 2011. I also would like to thank the members of my committee, Kevin Knight and Jinchi Lv, for their feedback.

I have had many fruitful discussions and collaborations with the faculty, students, and visiting scholars at ISI and ICT. I am particularly indebted to Haitao Mi, Tomer Levinboim, Victoria Fossum, Yinggong Zhao, Theerawat Songyot, Sujith Ravi, Hui Zhang, Jonathan May, Steve DeNeefe, Jason Riesa, Daniel Marcu, Bilyana Martinovsky, David Traum, and Oana Postolache. It has been a pleasure working as a PhD student in the natural language group at ISI.

Most importantly, I would like to thank my parents, my brother Rahul and his family, and Ann, for their support and encouragement during my PhD.

Abstract

The goal of machine translation is to translate from one natural language into another using computers. The current dominant approach to machine translation, statistical machine translation (SMT), uses large amounts of training data to automatically learn to translate. SMT systems typically contain three primary components: word alignment models, translation rules, and language models. These are some of the largest models in all of natural language processing, containing up to a billion parameters. Learning and employing these components pose difficult challenges of scale and generalization: using large models in statistical machine translation can slow down the translation process; learning models with so many parameters can cause them to fit the training data too well, degrading their performance at test time. In this thesis, we improve SMT by addressing these issues of scale and generalization for word alignment, learning translation grammars, and language modeling.

Word alignments, which are correspondences between pairs of source and target words, are used to derive translation grammars. Good word alignment can result in good translation rules, improving downstream translation quality. We will present an algorithm for training unsupervised word alignment models by using a prior that encourages learning smaller models, which improves both alignment and translation quality in large-scale SMT experiments.

SMT systems typically model the translation process as a sequence of translation steps, each of which uses a translation rule. Most statistical machine translation systems use composed rules (rules that can be formed out of smaller rules in the grammar) to capture more context, improving translation quality. However, composition creates many more rules and large grammars, making both training and decoding inefficient. We will describe an approach that uses Markov models to capture dependencies between a minimal set of translation rules, which leads to a slimmer model and a faster decoder, yet the same translation quality as composed rules.

Good language models are important for ensuring fluency of translated sentences. Because language models are trained on very large amounts of data, in standard n-gram language models the number of parameters can grow very quickly, making parameter learning difficult. Neural network language models (NNLMs) can capture distributions over sentences with many fewer parameters.
We will present recent work on efficiently learning large-scale, large-vocabulary NNLMs. Integrating these NNLMs into a hierarchical phrase-based MT decoder improves translation quality significantly.

Table of Contents

Acknowledgements  ii
Abstract  iii
List of Tables  4
List of Figures  6

Chapter 1  Goals and Contributions  8
  1.1 Word Alignments  9
  1.2 Translation Grammars  12
  1.3 Language Models  14

Chapter 2  Causes and Effects of Large Models  16
  2.1 Modeling  16
  2.2 Training  18
  2.3 Prediction  20

Chapter 3  Optimizing an MDL Inspired Objective Function Using MAP-EM  22
  3.1 Maximum Likelihood Estimation with Observed Data  22
  3.2 The EM Algorithm for Maximum Likelihood Estimation with Unobserved Data  24
  3.3 Learning with Priors: The Maximum A Posteriori EM Algorithm  29
  3.4 An MDL Inspired Objective Function  30
  3.5 Comparison with Other Priors  34
  3.6 Application to Unsupervised Part-of-Speech Tagging  40
    3.6.1 Objective Function  41
    3.6.2 Parameter Optimization  42
    3.6.3 Experiments  44
    3.6.4 Related Work  49
  3.7 Conclusion  49
  (This is joint work with Adam Pauls and David Chiang.)

Chapter 4  Word Alignment  51
  4.1 Introduction  51
  4.2 Method  52
    4.2.1 IBM Models and HMM  52
    4.2.2 MAP-EM with the L_0-norm  53
    4.2.3 Projected Gradient Descent  55
  4.3 Experiments  58
    4.3.1 Training  58
    4.3.2 Alignment  62
    4.3.3 Translation  63
  4.4 Related Work  65
  4.5 Conclusion  66

Chapter 5  Rule Markov Models  67
  5.1 Introduction  67
  5.2 Rule Markov Models  69
  5.3 Tree-to-String Decoding with Rule Markov Models  73
  5.4 Experiments and Results  77
    5.4.1 Setup  77
    5.4.2 Results  78
    5.4.3 Analysis  80
  5.5 Related Work  81
  5.6 Conclusion  82

Chapter 6  Neural Language Models  83
  6.1 Introduction  83
  6.2 Neural Language Models  85
    6.2.1 Model  85
    6.2.2 Training  86
    6.2.3 Gradients of Neural Network Parameters  92
    6.2.4 Forward Propagation  93
    6.2.5 Backward Propagation  95
  6.3 Implementation  99
    6.3.1 Training  99
    6.3.2 Translation  100
  6.4 Experiments  101
    6.4.1 NIST Chinese-English  102
    6.4.2 Europarl  102
    6.4.3 Speed Comparison  103
  6.5 Related Work  104
  6.6 Conclusion  105
  (This is joint work with Liang Huang and David Chiang.)
  (This is joint work with Haitao Mi, Liang Huang, and David Chiang.)
  (This is joint work with Yinggong Zhao, Victoria Fossum, and David Chiang.)

Chapter 7  Conclusion and Future Work  106
  7.1 Conclusion  106
  7.2 Future Research Directions  107
    7.2.1 MAP-EM with the L_0 Prior  107
    7.2.2 Rule Markov Models  110
    7.2.3 Neural Probabilistic Language Models  110

References  111

List of Tables

3.1  MAP-EM with an L_0 norm achieves higher tagging accuracy on English than [29] and much higher than standard EM.  45
3.2  Average accuracies over three held-out sets for English.  46
3.3  MAP-EM with a smoothed L_0 norm yields much smaller models than standard EM.  46
3.4  Accuracies on test set for Italian.  49
4.1  Adding the L_0-norm to the IBM models improves both alignment and translation accuracy across four different language pairs. The "word trans" column also shows that the number of distinct word translations (i.e., the size of the lexical weighting table) is reduced. The φ̃_sing. column shows the average fertility of once-seen source words. For Czech-English, the year refers to the WMT shared task; for all other language pairs, the year refers to the NIST Open MT Evaluation. Half of this test set was also used for tuning feature weights.  60
4.2  Almost all hyperparameter settings achieve higher F-scores than the baseline IBM Model 4 and HMM model for Arabic-English alignment (α = 0).  62
4.3  Adding word classes improves the F-score in both directions for Arabic-English alignment by a little, for the baseline system more so than ours.  63
4.4  Optimizing hyperparameters on alignment F1 score does not necessarily lead to optimal Bleu. The first two columns indicate whether we used the first- or second-best alignments in each direction (according to F1); the third column shows the F1 of the symmetrized alignments, whose corresponding Bleu scores are shown in the last two columns.  64
5.1  Main results. Our trigram rule Markov model strongly outperforms minimal rules, and performs at the same level as composed and vertically composed rules, but is smaller and faster. The number of parameters is shown for both the full model and the model filtered for the concatenation of the development and test sets (dev+test).  78
5.2  For rule bigrams, RM-B with D_2 = 0.4 gives the best results on the development set.  79
5.3  For rule bigrams, RM-A with D_2, D_3 = 0.5 gives the best results on the development set.  79
5.4  RM-A is robust to different settings of D_n on the development set.  80
5.5  Comparison of vertically composed rules using various settings (maximum rule height 7).  80
5.6  Adding rule Markov models to composed-rule grammars improves their translation performance.  80
6.1  Results for Chinese-English experiments, without neural LM (baseline) and with neural LM for reranking and integrated decoding. Reranking with the neural LM improves translation quality, while integrating it into the decoder improves even more.  102
6.2  Results for Europarl MT experiments, without neural LM (baseline) and with neural LM for reranking and integrated decoding. The neural LM gives improvements across three different language pairs. Superscript 2 indicates a score averaged between two runs; all other scores were averaged over three runs.  103

List of Figures

1.1  Translation  9
1.2  Standard SMT pipeline  10
1.3  Word alignment  11
1.4  A translation rule extracted from 1.1  13
3.1  Graphical depiction of the EM algorithm  28
3.2  The L_0-norm curve is not smooth  34
3.3  The L_0-norm (top curve) and smoothed approximations (below) for β = 0.05, 0.1, 0.2. Lower values of β approximate the L_0-norm better.  35
3.4  The prior prefers the corners of the 3-dimensional probability simplex, encouraging sparsity  35
3.5  L_p norms for p = 0.2, 0.1, 0.05 are shown. We can see that the norm is not defined at 0  37
3.6  ∥θ∥_2^2 (in red) has the same shape as the smoothed L_0 norm for large β  38
3.7  The L_0-norm (top curve) and smoothed Gaussian approximations (below) for σ = 0.05, 0.1, 0.2. Again, lower values of σ approximate the L_0-norm better.  39
3.8  Tagging accuracy vs. objective function for 1152 random restarts of MAP-EM with smoothed L_0 norm.  47
3.9  Tagging accuracy vs. likelihood for 1152 random restarts of standard EM.  48
4.1  Smoothed-L_0 alignments (red circles) correct many errors in the baseline GIZA++ alignments (black squares), as shown in four Chinese-English examples (the red circles are almost perfect for these examples, except for minor mistakes such as liu-shūqīng and meeting-zàizuò in (a) and "."-"," in (c)). In particular, the baseline system demonstrates typical "garbage-collection" phenomena in proper name "shuqing" in both languages in (a), number "4000" and word "láibīn" (lit. "guest") in (b), word "troublesome" and "lùlù" (lit. "land-route") in (c), and "blockhouses" and "diāobǎo" (lit. "bunker") in (d). We found this garbage-collection behavior to be especially common with proper names, numbers, and uncommon words in both languages. Most interestingly, in (c), our smoothed-L_0 system correctly aligns "extremely" to "hěn hěn hěn hěn" (lit. "very very very very"), which is rare in the bitext.  59
5.1  Example tree-to-string grammar.  70
5.2  Example tree-to-string derivation. Each row shows a rewriting step; at each step, the leftmost nonterminal symbol is rewritten using one of the rules in Figure 5.1.  70
5.3  Example input parse tree with tree addresses.  74
5.4  Simulation of incremental decoding with rule Markov model. The solid arrows indicate one path and the dashed arrows indicate an alternate path.  75
5.5  Vertical context  76
6.1  Neural probabilistic language model [6].  85
6.2  Activation function for a rectified linear unit.  86
6.3  NPLM indexes  94
6.4  Noise contrastive estimation (NCE) is much faster, and much less dependent on vocabulary size, than MLE as implemented by the CSLM toolkit [76].  104

Chapter 1
Goals and Contributions

The goal of machine translation is to translate an input source sentence into an output target sentence with the same meaning (see Figure 1.1). The current dominant approach to machine translation is statistical machine translation (SMT), which automatically learns how to translate from the source language to the target language using training data in both languages. Since the seminal work of Brown et al. [13] on the CANDIDE project at IBM, SMT has been a highly active area of research. During this time, the amount of data available for training SMT systems has grown rapidly, which has in turn led to rapid improvements in translation quality. Indeed, modern SMT systems are trained on some of the largest datasets in all of NLP, and they also use some of the largest models in all of NLP. This presents two difficult challenges. First, how do we engineer SMT systems to scale to such large datasets and large models? Second, with large models comes the danger of overfitting the data, failing to generalize to new sentences. In this chapter, we will briefly describe these problems and our contributions to solving them.

Figure 1.2 shows the architecture of a typical SMT pipeline. Modern SMT systems have three primary components: word alignments, translation grammars, and language models.
[Figure 1.1: Translation. A source-language parse tree over "Bushi yu Shalong juxing le huitan" and its target translation "Bush held talks with Sharon".]

1.1 Word Alignments

Word alignment seeks to discover pairs of source and target word translations from parallel data: pairs of sentences that are translations of each other. Reliable word alignments are paramount for good translation quality because they are used downstream in translation systems (Figure 1.2) to drive rule extraction. Since manually aligned training data is scarce, the dominant word alignment models are trained unsupervised, where word alignments are modeled as latent variables [13]. These models are typically learned using the Expectation Maximization (EM) algorithm, which finds the maximum likelihood estimates of the parameters.

[Figure 1.2: Standard SMT pipeline (parallel data, word alignment, parse trees, rule extraction, translation grammars, language model, decoder).]

A ubiquitous property of natural language is that the number of salient interactions between events is much lower than the total number of possible interactions. In word alignment, most words in the source language have a strong affinity to be linked to only one or a few other target words. For example, in a sample of 1102 Chinese-English parallel sentences, there are 9938 possible source-target word translations. However, only 1045 of these appear in the correct translations annotated by humans. Therefore, among the probabilistic interactions that are to be learned by an unsupervised algorithm, it is desirable that most of them be inactive. This phenomenon is called sparsity (not to be confused with data sparsity, which results from paucity of labeled data).

[Figure 1.3: Word alignment between "Bushi yu Shalong juxing le huitan" (source) and "Bush held talks with Sharon" (target).]

This motivates us to use learning algorithms that learn smaller models. Unfortunately, the EM algorithm is not equipped with an explicit reward for learning smaller models and can overfit to the data. That is, when the number of parameters is large, it uses excess model complexity to explain the data. As a result, we learn incorrect values for the parameters, which hurts performance. We propose a possible solution to this by balancing model complexity and good data prediction. Inspired by the Minimum Description Length (MDL) principle, we have developed a novel prior that encourages the probabilities to be inactive unless necessary, thereby encouraging compact models. We have also developed an efficient general search algorithm based on the MAP-EM framework to optimize the resulting objective function.

In recent years, significant research has been carried out to learn small models, with Integer Linear Programming (ILP) and Bayesian inference being the most noteworthy approaches. In most approaches using ILPs, model minimization and parameter learning are carried out independently, and there is no single objective function. Both techniques have shown success in smaller data settings, but have difficulty scaling to larger data settings. For example, corpora for word alignment for many language pairs usually contain hundreds of millions of words, which would be prohibitively large for these approaches. In contrast, our approach is both simple and scalable. Although the primary focus of this thesis is SMT, our approach for learning small models is quite general. We have tested it on unsupervised POS tagging and on large-scale word alignment. Our learning method improves performance significantly over standard EM and achieves state-of-the-art accuracy on unsupervised Italian POS tagging.
Using the word alignments produced by our learned models, we achieve large gains in both alignment and translation quality over a strong baseline. In Chapter 3, we establish the theoretical framework of our sparse prior. At the end of Chapter 3, we describe the application to POS tagging, and Chapter 4 describes improvements in unsupervised word alignment.

1.2 Translation Grammars

Given word alignments, we could proceed to translate a sentence word-by-word using word-translation models learned from word alignments. However, word order is not necessarily preserved across languages. For example, in Figure 1.1, the second word Shalong translates to the last word Sharon in English. Relying on word translations alone would require us to potentially reorder over long distances, making the search for good translations difficult. Also, words can translate differently in different contexts. For example, the Chinese word yu can translate to the preposition with or the conjunction and. A simple solution is to use alignments to learn larger translation rules. Exploiting larger translation rules can reduce reordering complexity and preserve translations in context. For example, the translation rule shown in Figure 1.4, extracted using the alignments in Figure 1.3, translates yu correctly.

[Figure 1.4: A translation rule extracted from Figure 1.1.]

The question now arises, "How large should the translation rules be?" Typically, translation rules are applied independently of each other, violating the conventional wisdom that translation should be done in context. To compensate for this lack of interaction, state-of-the-art translation systems use larger translation rules, called composed rules, which are built up of smaller minimal rules. Composed rule grammars are essential for good translation quality. Unfortunately, they are very large: a few million minimal rules can be combined to create up to a billion composed rules. These grammars also take up large amounts of space on disk and in main memory, restricting their usage to powerful servers. Moreover, they blow up the search space of possible translations, slowing the search for good translations considerably.

In Chapter 5, we present an approach that uses only minimal rules, achieving the same translation quality as composed rule grammars while resulting in much smaller grammars and faster run times. Instead of explicitly combining minimal rules to produce composed rules, we use a language model to capture their interaction, where the vocabulary of the language model is the set of minimal rules. We also show how to use these minimal rule language models with composed rule grammars to further improve translation quality.

1.3 Language Models

The decoder (Figure 1.2) searches for the best translation for an input source sentence from the space of possible translations defined by the translation grammar. Decoders use language models to ensure fluency of candidate translations in the search process. Depending on the vocabulary size and the n-gram order, language models can have up to a trillion parameters. Training such language models is challenging since most contexts are observed rarely. Therefore, sophisticated smoothing algorithms are employed to account for possible unseen n-grams.

Similar to standard language models, neural probabilistic language models (NPLMs) learn probability distributions over n-grams [6]. However, NPLMs suffer less from data sparsity because different n-gram contexts share parameters, which are learned. In addition, words are represented as real-valued vectors in a high-dimensional space.
These word vectors, referred to as word embeddings, can be different for input and output words, and are learned from training data. Thus, although at training and test time the input and output to the neural language models are one-hot representations of words, it is their embeddings and other shared parameters that are used to compute word probability distributions. NPLMs have begun to surpass traditional n-gram language models, but they have not been adopted in SMT, which requires training over large vocabularies. Standard maximum likelihood training has time complexity linear in the vocabulary size; computing the probability of an n-gram at test time is equally expensive, and the application of NPLMs to SMT has been restricted to small-vocabulary settings.

In Chapter 6, we will present the first successful application of large-vocabulary NPLMs in SMT. Using the principle of noise contrastive estimation [32, 54], we train an NPLM on hundreds of millions of words and show large improvements in translation quality on a Chinese to English translation task.

Chapter 2
Causes and Effects of Large Models

In the previous chapter, we briefly mentioned our contributions for solving problems caused by large models in three different stages of the statistical machine translation (SMT) pipeline: word alignment, learning translation grammars, and language modeling. These problems emerge from two root causes: Zipfian distributions, which cause large models, and maximum likelihood estimation (MLE), which can cause overfitting in the presence of large models. In this chapter, we will analyze the causes and effects of the problem of large models in the three stages of learning, modeling, training, and prediction, and summarize how our approaches address them.

2.1 Modeling

This thesis employs the framework of probabilistic models. Given the data and the information we desire, we decide on the observed variables of interest and their interactions via probabilities, which comprise the model. In the case of unsupervised learning, the model contains hidden variables as well. For example, if word alignment is our goal and our data is parallel English-French sentences, we could decide that each possible alignment between the English-French words is a hidden variable. We do not use hidden variables in our translation grammars and language models. Translation grammars are modeled by the set of translation rules and their probabilities, while language models comprise n-grams and their probabilities.

In the previous chapter, we emphasized that SMT contains some of the largest models in all of NLP. The Zipfian nature of language, in concert with the explosion of training data, has led to the scale of these models. In 1935, Zipf made the observation that the frequencies of words in any corpus of natural language have a heavy-tailed distribution, where only a few words occur frequently and most words are rare [47]. These distributions can be observed for other phenomena in language as well, such as word meanings. A few words have multiple meanings while most words have only one meaning. As our data grows, the abundance of rare events in Zipfian distributions will cause our models to grow. For example, even with a unigram language model, one would expect to see new words as we increase our training data, no matter how large the data currently is. Modeling choices can also exaggerate growth in model size. For example, in n-gram language models the number of parameters grows rapidly as we increase the context length n. For the same reasons, grammars with large translation rules (composed rules), which capture longer contexts, grow prohibitively large with more bilingual training data.
For learning small translation grammars (Chapter 5) and language modeling (Chapter 6), we propose improvements in the modeling stage of learning. For combating the scale of large translation grammars, we propose using rules derived from small contexts (minimal rules), keeping the model size small. To overcome the scale of language modeling, we use neural probabilistic language models (NPLMs), which have far fewer parameters than standard n-gram language models. In addition, the number of parameters grows at most linearly with the length of the context, in contrast to standard n-gram language models, where the model size can grow exponentially.

However, restricting model size is only a part of the solution to achieving fast and accurate translations. In our research, we have provided novel approaches to train these models as well.

2.2 Training

Once we have decided on the model, the task is to fit the model to the data, that is, to learn the parameters of the model. The most common way to do that is to use the maximum likelihood estimator, which finds the parameters that maximize the likelihood (Equation 2.1) or the log likelihood (Equation 2.2) of the observed data:

    \theta_{ML} = \arg\max_{\theta} P(X \mid \theta)    (2.1)
               = \arg\max_{\theta} \log P(X \mid \theta).    (2.2)

For unsupervised models such as word alignment, since the alignments are not observed, we model them as latent variables and typically use the EM algorithm to calculate the maximum likelihood estimate of the parameters.

However, in general, the log likelihood of the data improves by increasing the number of parameters, preferring complex models. This behavior is called overfitting. We can see instances of overfitting in unsupervised POS tagging with HMMs [68]. For a corpus of about 24,000 words, the number of tag bigram parameters in the HMM is 1839. However, only 760 of these tag bigrams appear in the correct tag sequence annotated by humans. Thus, any parameter learning algorithm needs to encourage most of the parameters to be negligible, that is, to encourage the learned model to be small. Unfortunately, if we use the EM algorithm to learn the HMM parameters, we predict 924 unique tag bigrams, many more than optimal.

The Minimum Description Length (MDL) principle, a mathematical instantiation of Occam's razor, can be used to fix this problem. MDL casts Occam's razor as an optimization problem and tries to find the model that explains the data well and is not too complex. Inspired by the MDL principle, in Chapter 3, we will propose a general solution to learn smaller models for a family of generative models.

This work is not the first to present approaches for learning smaller models. Approaches by Ravi et al. [68], Goldwater et al. [29], and Bodrumlu et al. [10] also try to learn smaller models for POS tagging and word alignment. However, their primary shortcoming is an inability to scale to large problems, which is paramount. Our proposed algorithm for word alignment can scale to hundreds of millions of words.

Our research improves training for all three components of the SMT pipeline. Since we restrict the size of our translation grammars to only minimal rules in the modeling phase, we drastically reduce model size and speed up decoding. However, translation quality drops since we are unable to translate large phrases with composed rules. To improve translation with minimal rules, we mimic composed rules by training language models of minimal rules. To keep the n-gram rule language model from blowing up, we train pruned language models that only preserve n-grams that meet some criteria. Section 5.2 describes this in more detail. To avoid assigning zero probability to unseen n-grams, we use absolute discounting (Ney et al. [60]), a smoothing approach, instead of standard maximum likelihood training.
Training large-vocabulary NPLMs with maximum likelihood is prohibitively expensive, as it requires repeated summations over every word in the vocabulary. In Chapter 6, we show how to successfully train large-vocabulary NPLMs on large data with noise contrastive estimation.

2.3 Prediction

Having trained our models, at test time we would like to query them. For unsupervised models, we are typically interested in the values of the hidden variables given the data. Once we have learned our parameters, say \theta_{ML}, we would like to query the most likely configuration of our hidden variables \hat{z} given the data, which is the mode of the posterior distribution P(Z \mid X):

    \hat{z} = \arg\max_{z} P(Z = z \mid X) \approx \arg\max_{z} P(Z = z \mid X, \theta_{ML}).    (2.3)

Finding the most likely configuration of the hidden variables given the data and the parameters is also called maximum a posteriori (MAP) inference or decoding. For the latent variable models used in this thesis, we can find the mode of the posterior with the Viterbi decoding algorithm.

In Chapter 5, we will present a new approach for decoding with rule language models in a statistical machine translation system. Decoding with minimal rule translation grammars and rule language models is faster than using composed rule grammars, with no loss in translation quality. We will also present our contributions to further improve translation over composed rule grammars using rule language models.

We would like to integrate our trained NPLMs into the machine translation decoder, which searches for the best translation of a source sentence. To ensure fast decoding, we should be able to query the neural language model for the probability of an n-gram quickly. Neural language models trained with MLE would require a summation over the entire vocabulary to compute the normalization constant required for n-gram probabilities, making them applicable to small-vocabulary settings only. In Chapter 6, we will present a fast approach for querying neural language models trained with noise contrastive estimation (about 40 microseconds per query) that enables fast decoding and improves machine translation quality.

In this chapter, we analyzed the causes of, and problems caused by, large models and summarized our solutions to these problems in the different stages of learning: modeling, training, and prediction. In the next chapter, we will start with our research on learning small models in unsupervised settings using an MDL inspired approach.

Chapter 3
Optimizing an MDL Inspired Objective Function Using MAP-EM

In this chapter, we will develop a new objective function and an optimization algorithm, based on the MAP-EM algorithm, that encourages learning small models in an unsupervised setting. We will also show how our approach improves unsupervised part-of-speech (POS) tagging accuracy. Sections 3.1, 3.2, and 3.3 will present the requisite background material on maximum likelihood estimation with the expectation maximization (EM) algorithm [18], and the MAP-EM algorithm. The objective function and its application to unsupervised POS tagging will be presented in subsequent sections.

3.1 Maximum Likelihood Estimation with Observed Data

Before we discuss maximum likelihood learning for unobserved data with EM, we will discuss the simpler case when all our data is observed, with an example. Imagine flipping a coin n times. Let X = X_1, \dots, X_n be the observations of the coin flips. Suppose the coin came up heads n_H times and tails n_T times. Let \theta_H be the probability of heads and 1 - \theta_H the probability of tails, which have to be learned. Maximum likelihood estimation tries to find the parameters that maximize the log likelihood of the data.
    \theta_H^{ML} = \arg\max_{\theta_H} \log L(X \mid \theta_H)
                  = \arg\max_{\theta_H} \log \prod_{i=1}^{n} P(X_i \mid \theta_H)
                  = \arg\max_{\theta_H} \log \theta_H^{n_H} (1 - \theta_H)^{n_T}.

To maximize, we set the derivative to 0, which gives us

    \frac{n_H}{\theta_H} - \frac{n_T}{1 - \theta_H} = 0
    n_H - n_H \theta_H - n_T \theta_H = 0
    \theta_H^{ML} = \frac{n_H}{n_H + n_T}.

It turns out that the maximum likelihood estimate of the parameters is simply the relative frequency of the events. The approach is the same for experiments with multinomial trials, like the roll of a die. For model classes that are a product of multinomials, such as HMM models for POS tagging and models for word alignment, if all the data is observed, carrying out maximum likelihood estimation is easy. Unfortunately, this is not the case for most unsupervised learning tasks in NLP, where the quantities of interest are missing or hidden. The EM algorithm, described in the next section, carries out maximum likelihood parameter estimation in this regime.

3.2 The EM Algorithm for Maximum Likelihood Estimation with Unobserved Data

The Expectation Maximization (EM) algorithm [18] has had an illustrious career in statistical NLP. Numerous unsupervised learning problems in NLP have been tackled using EM. Some examples are word alignment [13], part-of-speech tagging [48], and automatic decipherment [38]. The EM algorithm is not so much an algorithm as it is an iterative framework for maximum likelihood learning with incomplete data, or when the values of some of the variables are missing or hidden. While we use the MAP-EM algorithm (Section 3.3), it is a slight modification of the EM algorithm, and for completeness we start with the derivation of EM.

In what follows, boldface letters represent random variables and corresponding lowercase letters represent generic values that the random variables can take. Again, the goal of maximum likelihood learning is to find the parameters that maximize the probability of the observed data:

    \hat{\theta} = \arg\max_{\theta} \log P(X \mid \theta).    (3.1)

For now, we don't write the constraints that the parameters have to sum to one, and specify them in later chapters since they vary from model to model. Equation (3.1) can be optimized directly using numerical optimization or EM. We can write the objective function in Equation (3.1) as

    \log P(X \mid \theta) = \log \sum_{z} P(X, Z = z \mid \theta).    (3.2)

The difficulty in optimizing comes from summing over the hidden variables, Z. As we saw in Section 3.1, if all the variables were observed, then one would just need to count and divide to find the maximum likelihood estimate of the parameters. We shall soon see how the EM algorithm exploits this. Let \theta^{curr} be the current guess of the parameters, and let P(Z = z \mid X, \theta^{curr}) be the posterior probability of the hidden variable sequence z given the observed data and the current parameters. This could be the probability of a particular tag sequence in POS tagging. Most of the models employed in this thesis admit efficient algorithms for computing the posteriors. Also, \sum_{z} P(Z = z \mid X, \theta^{curr}) = 1. Equation (3.1) is equivalent to:

    \log P(X \mid \theta) = \log \sum_{z} P(Z = z \mid X, \theta^{curr}) \frac{P(X, Z = z \mid \theta)}{P(Z = z \mid X, \theta^{curr})}.    (3.3)

Since \log x is concave, we can apply Jensen's inequality, which says that:

    \log \sum_{z} P(z) f(z) \ge \sum_{z} P(z) \log f(z).

Using Jensen's inequality, we take the summation outside the log and get

    \log P(X \mid \theta) \ge \sum_{z} P(Z = z \mid X, \theta^{curr}) \log \frac{P(X, Z = z \mid \theta)}{P(Z = z \mid X, \theta^{curr})}
      = \sum_{z} P(Z = z \mid X, \theta^{curr}) \log P(X, Z = z \mid \theta) - \sum_{z} P(Z = z \mid X, \theta^{curr}) \log P(Z = z \mid X, \theta^{curr})
      = \sum_{z} P(Z = z \mid X, \theta^{curr}) \log P(X, Z = z \mid \theta) - \sum_{z} P(Z = z \mid X, \theta^{curr}) \log \frac{P(X, Z = z \mid \theta^{curr})}{P(X \mid \theta^{curr})}
      = E_{Z \mid X, \theta^{curr}}\big[ \log P(X, Z = z \mid \theta) \big] - E_{Z \mid X, \theta^{curr}}\big[ \log P(X, Z = z \mid \theta^{curr}) \big] + \log P(X \mid \theta^{curr})
      = Q(\theta, \theta^{curr}).

The auxiliary function Q(\theta, \theta^{curr}) is a concave lower bound on the observed data log likelihood. We will show that maximizing Q(\theta, \theta^{curr}) increases \log P(X \mid \theta) as well.
The last two terms do not contain \theta and can be dropped for maximization. E_{Z \mid X, \theta^{curr}}[\log P(X, Z = z \mid \theta)] is the expected complete data log likelihood, which uses the expected frequencies of the hidden variables under the current parameters, \theta^{curr}.

Though we haven't explicitly stated the EM algorithm, we can appreciate how it might work. Instead of maximizing \log P(X \mid \theta) directly, we can generate counts for the hidden variables from \theta^{curr} and then perform maximum likelihood estimation on the complete data, which is easy. Q(\theta, \theta^{curr}) has two nice properties: it is concave, and at \theta = \theta^{curr} it touches the log likelihood function:

    Q(\theta^{curr}, \theta^{curr}) = E_{Z \mid X, \theta^{curr}}[\log P(X, Z = z \mid \theta^{curr})] - E_{Z \mid X, \theta^{curr}}[\log P(X, Z = z \mid \theta^{curr})] + \log P(X \mid \theta^{curr})
                                    = \log P(X \mid \theta^{curr}).

These properties show that maximizing Q(\theta, \theta^{curr}) increases \log P(X \mid \theta) as well. Unfortunately, this does not give us a one-step optimization of the observed data log likelihood, but it suggests an iterative procedure.

Start with an initial guess of the parameters \theta^{init}. Set \theta^{curr} := \theta^{init} and repeat until convergence:
1. E-step: compute E_{Z \mid X, \theta^{curr}}\big[ \log P(X, Z = z \mid \theta) \big].
2. M-step: compute \theta^{next} := \arg\max_{\theta} E_{Z \mid X, \theta^{curr}}\big[ \log P(X, Z = z \mid \theta) \big].
3. \theta^{curr} := \theta^{next}.
Set \theta^{final} := \theta^{curr}.

Figure 3.1 gives a graphical description of the EM algorithm. The red curve (top curve) is the log likelihood of the data that we want to maximize. The E-steps correspond to estimating the blue concave curve (the solid concave curve in the figure), the auxiliary function Q(\theta, \theta^{curr}), and the green concave curve (the dotted concave curve in the figure), the auxiliary function Q(\theta, \theta^{next}). Maximizing Q(\theta, \theta^{curr}) gives us \theta^{next}, the estimate of the parameters for the next iteration. We can see that the auxiliary functions touch the log likelihood function at \theta^{curr} and \theta^{next}. The EM algorithm converges to a local maximum of the log likelihood. If the maximization of the auxiliary function in the M-step is hard, Dempster et al. [18] also suggest a variant of the M-step where it suffices to improve on the auxiliary function. The goal here is to find a \theta' such that Q(\theta', \theta^{curr}) \ge Q(\theta^{curr}, \theta^{curr}).

[Figure 3.1: Graphical depiction of the EM algorithm.]

This variant is called the Generalized EM (GEM) algorithm, which also converges to a local maximum of the log likelihood.

The maximum likelihood estimator, and consequently the EM algorithm, do not work well for parameter learning when we have insufficient data. The problem is exacerbated when the model has many parameters and the maximum likelihood estimator causes the parameters to overfit the data. For example, when used to learn the parameters of a hidden Markov model for unsupervised part-of-speech tagging, the EM algorithm gives probability mass to more parameters than necessary. Oftentimes we have prior knowledge about our parameters which we would like to include. The MAP-EM algorithm, described in the next section, allows us to perform parameter learning with priors.

3.3 Learning with Priors: The Maximum A Posteriori EM Algorithm

The primary contribution of this thesis is using a sparse prior on the parameters of the model to encourage smaller models. In Section 3.4, we will present a MAP objective function incorporating our prior. The general goal of maximum a posteriori estimation is to find the parameters \hat{\theta} such that

    \hat{\theta} = \arg\max_{\theta} \big( \log P(X \mid \theta) + \log p(\theta) \big).    (3.4)

Equation (3.4) introduces a new term, \log p(\theta), to the maximum likelihood objective in Equation (3.1). We carry out this optimization using the MAP-EM algorithm, which was also introduced in [18].
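To make the E-step/M-step loop above concrete, the following minimal sketch runs EM on a toy mixture of two biased coins, where the identity of the coin that produced each sequence of flips is the hidden variable. The model, the data, and all names here are illustrative only and are not taken from this thesis; the comment in the M-step marks where the \log p(\theta) term of Equation (3.4) would be added to obtain MAP-EM.

    import numpy as np

    def em_two_coins(sequences, n_iters=50):
        """EM for a mixture of two biased coins (a toy illustration).
        Each sequence of 0/1 flips was generated by one of two coins with
        unknown heads probabilities; the coin identity Z is hidden."""
        pi, p1, p2 = 0.5, 0.6, 0.4                      # initial guess theta_init
        seqs = [np.asarray(s) for s in sequences]
        heads = np.array([s.sum() for s in seqs], dtype=float)
        totals = np.array([len(s) for s in seqs], dtype=float)
        for _ in range(n_iters):
            # E-step: posterior P(Z = coin 1 | sequence, theta_curr) for every sequence
            l1 = pi * p1 ** heads * (1 - p1) ** (totals - heads)
            l2 = (1 - pi) * p2 ** heads * (1 - p2) ** (totals - heads)
            resp = l1 / (l1 + l2)
            # M-step: maximize the expected complete-data log likelihood
            # (count and divide); MAP-EM would add log p(theta) to this objective.
            pi = resp.mean()
            p1 = (resp * heads).sum() / (resp * totals).sum()
            p2 = ((1 - resp) * heads).sum() / ((1 - resp) * totals).sum()
        return pi, p1, p2

    # Hypothetical data: sequences drawn from two coins with different biases.
    data = [[1, 1, 1, 0, 1], [0, 0, 1, 0, 0], [1, 1, 0, 1, 1], [0, 1, 0, 0, 0]]
    print(em_two_coins(data))

Each iteration only needs posterior expectations over the hidden variable (E-step) followed by a closed-form count-and-divide update (M-step), which is exactly the structure exploited by the models in this thesis.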
The derivation of the auxiliary function that lower bounds the MAP objective follows the same steps as for the maximum likelihood objective, except with the addition of two terms:

    Q(\theta, \theta^{curr}) = E_{Z \mid X, \theta^{curr}}[\log P(X, Z = z \mid \theta)] - E_{Z \mid X, \theta^{curr}}[\log P(X, Z = z \mid \theta^{curr})] + \log P(X \mid \theta^{curr}) + \log p(\theta) - \log p(\theta^{curr}).

Although the auxiliary function has two new terms, the process of computing the expected frequencies of the hidden variables remains unchanged. This is favorable because, if we have existing EM implementations, the same algorithms can be adopted for computing expectations in the MAP-EM algorithm. The M-step involves maximizing the sum of the expected complete data log likelihood and the prior term. The MAP-EM algorithm is as follows:

Start with an initial guess of the parameters \theta^{init}. Set \theta^{curr} := \theta^{init} and repeat until convergence:
1. E-step: compute E_{Z \mid X, \theta^{curr}}\big[ \log P(X, Z = z \mid \theta) \big] + \log p(\theta).
2. M-step: compute \theta^{next} := \arg\max_{\theta} E_{Z \mid X, \theta^{curr}}\big[ \log P(X, Z = z \mid \theta) \big] + \log p(\theta).
3. \theta^{curr} := \theta^{next}.
Set \theta^{final} := \theta^{curr}.

This concludes the description of the pertinent optimization frameworks for my thesis work on unsupervised learning. In the preceding sections, we described the progression from maximum likelihood learning with complete data to learning with incomplete or unobserved data with the EM algorithm. We also introduced the MAP-EM algorithm, the framework for parameter optimization that we will employ for unsupervised learning. Subsequent sections will present a novel objective function for learning small models.

3.4 An MDL Inspired Objective Function

Chapter 2 presented the need for a new objective function that encourages parameter sparsity and learning smaller models. We are now ready to build on the foundation of previous sections to present a new objective function. Occam's Razor, or the principle of parsimony, is an old but sound principle for model selection which says that if there are competing hypotheses, choose the one that makes the fewest assumptions. It makes intuitive sense to favor less complexity in comparable hypotheses, as the less complex model has hopefully captured more general patterns in the data and would predict new data better. The Minimum Description Length (MDL) principle [3] casts Occam's Razor as an optimization problem. It tries to find the model that minimizes the description length (DL) of the data given the model and the description length of the model. The goal is to find the model \hat{\theta} such that:

    \hat{\theta} = \arg\min_{\theta} \; DL(D \mid \theta) + DL(\theta).    (3.5)

Glancing at this optimization, we can see that the MDL principle balances the explanation of the data with the complexity of the hypothesis. Notice that this objective function does not assume that the hypothesis class of models is fixed. For example, if the goal is unsupervised POS tagging, the competing models can be second order and third order HMMs. It's possible that a third order HMM might explain the data better, but the extra price to pay for the model complexity might make it less attractive for the MDL objective in (3.5). A convenient analogy is to imagine one were to compress a message of length n and send it in bits to a receiver. Let the message be a sequence of characters from {a, b, c, d}. In order for the receiver to decode the message, the sender has to send both the encoded message and the codebook with which the message was encoded. A sensible approach would be to encode frequent letters with shorter description lengths (or codelengths), allowing more compression. For the message to be coded correctly, the sender will also have to send the entries in the codebook for each letter. This might not necessarily give the smallest total description length according to the objective in Equation (3.5).
One could imagine a message that contains the pair ab occurring very often. Here, adding the bigram ab to the codebook and giving it a short codelength might give us a smaller description length for the data, justifying the penalty we pay for storing the extra bigram. The encoder has the freedom to choose any encoding for the data and the model as he or she wishes, which would affect the final model choice. For example, setting the penalty for a bigram entry to infinity will ensure that no bigram is employed in the code. The advantage is that different model scoring functions like the Akaike Information Criterion (AIC) [1] and the Bayesian Information Criterion (BIC) [72] can be used in the objective. This can also be seen as a drawback of MDL, as different encoding schemes can lead to different model selection results and these choices need to be made carefully.

The prior that we develop in this chapter applies to probabilistic models, and the model class will stay fixed during learning, for example the set of all possible HMMs with fixed-dimension emission and transition matrices. The goal is to use MDL as a guiding principle to learn the values of the parameters. If one takes the negative of the log likelihood of the data under the model, it turns out to be the number of nats (or bits) that it takes to write down the data, which is a reasonable measure of the description length of the data given the model [31]. We also motivated the need to add prior information that would encourage most of the parameters to be inactive. To achieve this, we take the description length of the model to be the number of non-zero probabilities in the model. This gives the following optimization problem. Find the model \hat{\theta} such that:

    \hat{\theta} = \arg\min_{\theta} \Big( -\log P(X \mid \theta) + \alpha \sum_{i} \delta(\theta_i > 0) \Big)
                 = \arg\min_{\theta} \big( -\log P(X \mid \theta) + \alpha \, \|\theta\|_0 \big),    (3.6)

where \delta(\theta_i > 0) outputs 1 if \theta_i > 0 and 0 otherwise. The description length of the model measures the model size in terms of the non-zero probabilities in the model. It is called the L_0 norm of \theta. It penalizes every parameter whose value is greater than 0 equally. The parameter \alpha controls the reward (or penalty) for sparsity. The objective function in (3.6) looks very much like AIC. As we mentioned, the advantage of MDL is that we can use different scoring functions for model selection.

Unfortunately, minimization of the L_0 norm is known to be NP-hard [36]. It is not smooth, making it unamenable to gradient-based optimization algorithms (Figure 3.2). Therefore, we use a smoothed approximation,

    \|\theta\|_0 \approx \sum_{i} \big( 1 - e^{-\theta_i / \beta} \big),    (3.7)

where 0 < \beta \le 1. For smaller values of \beta, this closely approximates the desired function (Figure 3.3). Inverting signs and ignoring constant terms, our objective function is now:

    \hat{\theta} = \arg\max_{\theta} \Big( \log P(X \mid \theta) + \alpha \sum_{i} e^{-\theta_i / \beta} \Big).    (3.8)

We can think of the approximate model size as a kind of prior:

    P(\theta) = \frac{\exp\big( \alpha \sum_{i} e^{-\theta_i / \beta} \big)}{Z}    (3.9)
    \log P(\theta) = \alpha \sum_{i} e^{-\theta_i / \beta} - \log Z,    (3.10)

where Z = \int d\theta \, \exp\big( \alpha \sum_{i} e^{-\theta_i / \beta} \big) is a normalization constant. The objective function now looks similar to the MAP objective in Section 3.3. There, our goal is to find the maximum a posteriori parameter estimate, which we find using MAP-EM [8]:

    \hat{\theta} = \arg\max_{\theta} \big( \log P(X \mid \theta) + \log P(\theta) \big).    (3.11)

Substituting (3.10) into (3.11) and ignoring the constant term \log Z, we get our objective function (3.8) again.

[Figure 3.2: The L_0-norm curve is not smooth.]

In the latter sections of this chapter (Section 3.6 onwards), we will give details on implementing this optimization for unsupervised POS tagging. Chapter 4 will present implementation details for unsupervised word alignment. We can see the behavior of the unnormalized prior in Figure 3.4.
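As a small numerical illustration of Equation (3.7) (a sketch, not code used in this thesis), the following Python snippet compares the exact L_0 norm of a probability vector with its smoothed approximation for a few values of \beta; the vector and the \beta values are arbitrary.

    import math

    def l0_norm(theta, tol=1e-12):
        # Exact L0 norm: the number of non-zero parameters.
        return sum(1 for x in theta if x > tol)

    def smoothed_l0(theta, beta):
        # Smoothed approximation of eq. (3.7): sum_i (1 - exp(-theta_i / beta)).
        return sum(1.0 - math.exp(-x / beta) for x in theta)

    theta = [0.7, 0.3, 0.0, 0.0, 0.0]   # an arbitrary sparse probability vector
    print("exact L0:", l0_norm(theta))
    for beta in (0.2, 0.1, 0.05, 0.01):
        print("beta =", beta, "smoothed L0 =", round(smoothed_l0(theta, beta), 4))

As \beta shrinks, the approximation approaches the exact count of non-zero parameters while remaining differentiable everywhere, which is what makes it usable inside MAP-EM.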
A reasonable question to ask is, "Why this particular choice of approximation function?" We will list some of its advantages compared to other possible sparse priors.

3.5 Comparison with Other Priors

[Figure 3.3: The L_0-norm (top curve) and smoothed approximations (below) for \beta = 0.05, 0.1, 0.2. Lower values of \beta approximate the L_0-norm better.]

[Figure 3.4: The prior prefers the corners of the 3-dimensional probability simplex, encouraging sparsity.]

L_1 norm. The L_1 norm is a common and effective prior to use for parameter sparsity [25]. The L_1 norm of a vector v is defined as follows:

    \|v\|_1 = \sum_{i} |v_i|.

For probability vectors, the L_1 norm of the parameters will always be a constant, since probabilities have to sum to one. Therefore, it cannot be applied directly.

L_p norm, where 0 < p < 1. The ideal function that we want for model size is the L_0 norm, but it is very hard to optimize. The preceding discussion shows why the L_1 norm will not work for us. A reasonable option is to consider the L_p norm where 0 < p < 1, since for p > 1 we do not get sparsity. The L_p norm of a vector v for p between 0 and 1 is defined as follows:

    \|v\|_p = \Big( \sum_{i} |v_i|^p \Big)^{1/p}.

Figure 3.5 shows the L_p norm for different values of p. The L_p norm has the disadvantage that it becomes discontinuous at probability 0 and the gradient goes to infinity. This makes it unsuitable for gradient-based optimization.

[Figure 3.5: L_p norms for p = 0.2, 0.1, 0.05 are shown. We can see that the norm is not defined at 0.]

Relationship between the squared L_2 norm and the smoothed approximation to the L_0. A little analysis reveals a very interesting connection between the smoothed L_0 approximation and the squared L_2 norm for large \beta. The squared L_2 norm of a vector v, \|v\|_2^2, is the sum of squares of its components. Taking the Taylor series expansion of the smoothed L_0 prior around (0, 0, \dots, 0), we get:

    \sum_{i} \big( 1 - e^{-\theta_i / \beta} \big) \approx \sum_{i} \Big( \frac{\theta_i}{\beta} - \frac{\theta_i^2}{2\beta^2} \Big)    (3.12)
      = \frac{1}{\beta} \sum_{i} \theta_i - \frac{1}{2\beta^2} \sum_{i} \theta_i^2
      = \frac{1}{\beta} \sum_{i} \theta_i - \frac{\|\theta\|_2^2}{2\beta^2}.

Dropping third order and higher terms, since these will be negligible for large \beta, and noting that \sum_i \theta_i is constant for probability vectors, we see that the exponential prior gives the squared L_2 norm. Figure 3.6 shows this relationship graphically (note that the curves have been scaled for visual comparison). In the past, the squared L_2 norm has been widely used to encourage parameter sparsity, for example in regression. The approach is to minimize the squared L_2 norm of the parameters of the model. It has more or less been replaced by the L_1 norm for the lack of parameter sparsity it induces.

[Figure 3.6: \|\theta\|_2^2 (in red) has the same shape as the smoothed L_0 norm for large \beta.]

If we replace the smoothed L_0 in optimization (3.8) with \|\theta\|_2^2, we get the following optimization. Find the model \hat{\theta} such that

    \hat{\theta} = \arg\max_{\theta} \big( \log P(X \mid \theta) + \alpha \, \|\theta\|_2^2 \big).    (3.13)

We maximize the squared L_2 norm instead of minimizing it. The reason is that for the family of probabilistic models in the scope of this thesis, the parameters have to lie on a constrained surface (the simplex), and therefore there is only a fixed amount of mass to move around between the parameters. In this case, maximizing the squared L_2 norm will give us sparsity.

This observation has a practical consequence.
Remember that we have two hyperparameters in our prior: \alpha, the reward for sparsity, and \beta, the smoothness of the prior. This connection suggests that we could drop \beta and just use the squared L_2 norm instead. We will leave this effort for future work (Chapter 7).

[Figure 3.7: The L_0-norm (top curve) and smoothed Gaussian approximations (below) for \sigma = 0.05, 0.1, 0.2. Again, lower values of \sigma approximate the L_0-norm better.]

Smoothed L_0 based on the Gaussian family. This is not the first approach to use a surrogate function for the true L_0 norm. In Mohimani et al. [55], the authors use an approximation inspired by the Gaussian family of functions:

    \|\theta\|_0 \approx \sum_{i} \Big( 1 - e^{-\frac{\theta_i^2}{2\sigma^2}} \Big),    (3.14)

where \sigma is the standard deviation of the Gaussian. Figure 3.7 shows the smoothed approximations in comparison to the L_0 norm. We will show that for POS tagging, using this form of approximation did not yield the highest accuracy.

3.6 Application to Unsupervised Part-of-Speech Tagging

We will now briefly show how to train better models for unsupervised POS tagging with the MAP-EM algorithm and the L_0 prior. In the next chapter (Chapter 4), we will present our improvements to word alignment.

It has been successfully shown that minimizing the model size in a Hidden Markov Model (HMM) for part-of-speech (POS) tagging leads to higher accuracies than simply running the Expectation-Maximization (EM) algorithm [18]. Goldwater and Griffiths [29] employ a Bayesian approach to POS tagging and use sparse Dirichlet priors to minimize model size. More recently, Ravi and Knight [68] alternately minimize the model using an integer linear program and maximize likelihood using EM to achieve the highest accuracies on the task so far. However, in the latter approach, because there is no single objective function to optimize, it is not entirely clear how to generalize this technique to other problems. In this chapter, we introduced an objective function to learn smaller models by using a simple prior that encourages sparsity. We will show how to implement the search for the maximum a posteriori (MAP) hypothesis and present a variant of EM to approximately search for the minimum-description-length model. Applying our approach to the POS tagging problem, we obtain higher accuracies than both EM and Bayesian inference as reported by Goldwater and Griffiths [29]. On an Italian POS tagging task, we obtain even larger improvements. We find that our objective function correlates well with accuracy, suggesting that this technique might be useful for other problems.

(This is joint work with Adam Pauls and David Chiang.)

3.6.1 Objective Function

In the unsupervised POS tagging task, we are given a word sequence w = w_1, \dots, w_N and want to find the best tagging t = t_1, \dots, t_N, where t_i \in T, the tag vocabulary. We adopt the problem formulation of Merialdo [48], in which we are given a dictionary of possible tags for each word type.
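The tag dictionary in the Merialdo setup simply restricts which tags each word type may receive. A minimal sketch of this constraint, with a hypothetical toy dictionary rather than the data used in the experiments, is:

    # Hypothetical tag dictionary: each word type maps to its set of admissible tags.
    tag_dictionary = {
        "the":  {"DT"},
        "can":  {"MD", "NN", "VB"},
        "rust": {"NN", "VB"},
    }

    def allowed_tags(word, full_tagset=("DT", "MD", "NN", "VB")):
        # Words missing from the dictionary could be allowed any tag.
        return tag_dictionary.get(word, set(full_tagset))

    sentence = ["the", "can", "can", "rust"]
    print([sorted(allowed_tags(w)) for w in sentence])

During learning and decoding, only tag sequences consistent with these sets are considered.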
We define a bigram HMM

    P(w, t \mid \theta) = \prod_{i=1}^{N} P(w_i \mid t_i) \, P(t_i \mid t_{i-1}).    (3.15)

The parameters of our model are the tag-bigram probabilities {P(t \mid t')} and the emission probabilities {P(w \mid t)}. In maximum likelihood estimation for unsupervised POS tagging, the goal is to find parameter estimates

    \hat{\theta} = \arg\max_{\theta} \log P(w \mid \theta)    (3.16)
                 = \arg\max_{\theta} \log \sum_{t} P(w, t \mid \theta).    (3.17)

The EM algorithm can be used to find a solution. However, we would like to maximize likelihood and minimize the size of the model simultaneously. Rewriting optimization (3.6) for POS tagging, the goal is to find the model \hat{\theta} such that:

    \hat{\theta} = \arg\min_{\theta} \big( -\log P(w \mid \theta) + \alpha \, \|\theta\|_0 \big)
                 = \arg\min_{\theta} \big( -\log P(w \mid \theta) + \alpha_c \, \|\{P(w \mid t)\}\|_0 + \alpha_t \, \|\{P(t \mid t')\}\|_0 \big).    (3.18)

In Equation (3.18), we use separate values of \alpha for the channel (\alpha_c) and tag-bigram (\alpha_t) parameters to exercise finer control. Using our approximation for the L_0 norm and writing the equivalent maximization problem (similar to Equation (3.8)), we get:

    \hat{\theta} = \arg\max_{\theta} \Big( \log P(w \mid \theta) + \alpha_c \sum_{w,t} e^{-P(w \mid t)/\beta} + \alpha_t \sum_{t,t'} e^{-P(t' \mid t)/\beta} \Big),    (3.19)

subject to the constraints

    \sum_{t'} P(t' \mid t) = 1 \quad \text{for all } t, \qquad \sum_{w} P(w \mid t) = 1 \quad \text{for all } t.

In our experiments, we set \alpha_c = 0, since previous work has shown that minimizing the number of tag n-gram parameters is more important [29, 68].

3.6.2 Parameter Optimization

To optimize (3.19), we use MAP-EM, which was introduced in Section 3.3. The computation involved in the E-step is the same as in standard EM, which is to calculate P(t \mid w, \theta^t), where the \theta^t are the parameters in the current iteration t. The M-step in iteration (t+1) looks like

    \theta^{t+1} = \arg\max_{\theta} \Big( E_{P(t \mid w, \theta^t)}\big[ \log P(w, t \mid \theta) \big] + \alpha_t \sum_{t,t'} e^{-P(t' \mid t)/\beta} \Big).    (3.20)

Let C(t, w; t, w) count the number of times the word w is tagged as t in t, and C(t, t'; t) the number of times the tag bigram (t, t') appears in t. We can rewrite the M-step as

    \theta^{t+1} = \arg\max_{\theta} \Big( \sum_{t} \sum_{w} E[C(t, w)] \log P(w \mid t) + \sum_{t} \sum_{t'} \big( E[C(t, t')] \log P(t' \mid t) + \alpha_t e^{-P(t' \mid t)/\beta} \big) \Big),    (3.21)

subject to the constraints \sum_{w} P(w \mid t) = 1 and \sum_{t'} P(t' \mid t) = 1. In Equation (3.21), the term corresponding to the channel probabilities can be optimized separately from the term corresponding to the tag bigram probabilities, since they don't share any parameters. Each of these subproblems can be further decomposed into smaller terms, one for each tag t (Equations (3.22) and (3.23)). For each t, the term

    \sum_{w} E[C(t, w)] \log P(w \mid t)    (3.22)

is easily optimized as in EM: just let P(w \mid t) \propto E[C(t, w)]. But the term

    \sum_{t'} \big( E[C(t, t')] \log P(t' \mid t) + \alpha_t e^{-P(t' \mid t)/\beta} \big)    (3.23)

is trickier. This is a non-convex optimization problem, for which we invoke a publicly available constrained optimization tool, ALGENCAN [2]. To carry out its optimization, ALGENCAN requires computation of the following in every iteration:

- Objective function, defined in Equation (3.23). This is calculated in polynomial time using dynamic programming.
- Constraints: g_t = \sum_{t'} P(t' \mid t) - 1 = 0 for each tag t \in T. Also, we constrain P(t' \mid t) to the interval [\epsilon, 1]. (We must have \epsilon > 0 because of the \log P(t' \mid t) term in Equation (3.23). It seems reasonable to set \epsilon \ll 1/N; in our experiments, we set \epsilon = 10^{-7}.)
- Gradient of the objective function:

    \frac{\partial F}{\partial P(t' \mid t)} = \frac{E[C(t, t')]}{P(t' \mid t)} - \frac{\alpha_t}{\beta} e^{-P(t' \mid t)/\beta}.    (3.24)

- Gradient of the equality constraints:

    \frac{\partial g_t}{\partial P(t'' \mid t')} = \begin{cases} 1 & \text{if } t = t' \\ 0 & \text{otherwise.} \end{cases}    (3.25)

- Hessian of the objective function, which is not required but greatly speeds up the optimization:

    \frac{\partial^2 F}{\partial P(t' \mid t) \, \partial P(t' \mid t)} = -\frac{E[C(t, t')]}{P(t' \mid t)^2} + \frac{\alpha_t}{\beta^2} e^{-P(t' \mid t)/\beta}.    (3.26)

  The other second-order partial derivatives are all zero, as are those of the equality constraints.

We perform this optimization for each instance of (3.23). These optimizations could easily be performed in parallel for greater scalability.
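The per-tag subproblem in Equation (3.23) can also be sketched with a general-purpose solver. The snippet below uses SciPy's SLSQP method (not ALGENCAN, which the experiments in this thesis use) together with the analytic gradient of Equation (3.24); the expected counts and hyperparameter values are made up for illustration.

    import numpy as np
    from scipy.optimize import minimize

    def optimize_tag_row(expected_counts, alpha_t=80.0, beta=0.05, eps=1e-7):
        """Sketch of one instance of the M-step subproblem (cf. eq. 3.23):
        maximize  sum_j E[C_j] log p_j + alpha_t * sum_j exp(-p_j / beta)
        subject to  sum_j p_j = 1  and  eps <= p_j <= 1."""
        c = np.asarray(expected_counts, dtype=float)
        n = len(c)

        def neg_objective(p):
            return -(np.dot(c, np.log(p)) + alpha_t * np.sum(np.exp(-p / beta)))

        def neg_gradient(p):  # negative of eq. (3.24)
            return -(c / p - (alpha_t / beta) * np.exp(-p / beta))

        x0 = np.full(n, 1.0 / n)                      # start from the uniform distribution
        result = minimize(
            neg_objective, x0, jac=neg_gradient, method="SLSQP",
            bounds=[(eps, 1.0)] * n,
            constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],
        )
        return result.x

    # Hypothetical expected tag-bigram counts E[C(t, t')] for a single tag t.
    expected_counts = [12.0, 3.5, 0.2, 0.0, 1.1]
    print(optimize_tag_row(expected_counts).round(4))

Because each tag's distribution is optimized independently, these calls are embarrassingly parallel, which is what makes the parallelization mentioned above straightforward.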
3.6.3 Experiments WecarriedoutPOStaggingexperimentsonEnglishandItalian. 44 system accuracy(%) StandardEM 82.4 +randomrestarts 84.5 Goldwateretal.[29] 85.2 ourapproach 87.4 +randomrestarts 87.1 Gaussiansmoothed L 0 85.6 Table 3.1: MAP-EM with a L0 norm achieves higher tagging accuracy on English than [29] and muchhigherthanstandardEM. 3.6.3.1 EnglishPOStagging To set the hyperparameters t and , we prepared three held-out sets H 1 ;H 2 , and H 3 from the PennTreebank.Each H i comprisedabout24;000wordsannotatedwithPOStags.WeranMAP- EM for 100 iterations, with uniform probability initialization, for a suite of hyperparameters and averaged their tagging accuracies over the three held-out sets. The results are presented in Table 3.2. We then picked the hyperparameter setting with the highest average accuracy. These were t =80;=0:05. We then ran MAP-EM again on the test data with these hyperparameters and achieved a tagging accuracy of 87:4% (see Table 4.1). This is higher than the 85:2% that Goldwater and Griths [29] obtain using Bayesian methods for inferring both POS tags and hyperparameters. It is much higher than the 82:4% that standard EM achieves on the test set whenrunfor100iterations.Wealsocomparedourprioragainstthesmoothed L 0 inspiredbythe Gaussian family of functions presented in Section 3.5. For this, we used the same optimization procedure with the appropriate first and second order derivatives. We tuned the hyperparameters foraccuracyonthetestsetandachieved85:6%,lowerthanoursmoothed L 0 prior. Using t = 80;= 0:05, we ran multiple random restarts on the test set (see Figure 3.8). We find that the objective function correlates well with accuracy, and picking the point with the 45 t 0.75 0.5 0.25 0.075 0.05 0.025 0.0075 0.005 0.0025 10 82.81 82.78 83.10 83.50 83.76 83.70 84.07 83.95 83.75 20 82.78 82.82 83.26 83.60 83.89 84.88 83.74 84.12 83.46 30 82.78 83.06 83.26 83.29 84.50 84.82 84.54 83.93 83.47 40 82.81 83.13 83.50 83.98 84.23 85.31 85.05 83.84 83.46 50 82.84 83.24 83.15 84.08 82.53 84.90 84.73 83.69 82.70 60 83.05 83.14 83.26 83.30 82.08 85.23 85.06 83.26 82.96 70 83.09 83.10 82.97 82.37 83.30 86.32 83.98 83.55 82.97 80 83.13 83.15 82.71 83.00 86.47 86.24 83.94 83.26 82.93 90 83.20 83.18 82.53 84.20 86.32 84.87 83.49 83.62 82.03 100 83.19 83.51 82.84 84.60 86.13 85.94 83.26 83.67 82.06 110 83.18 83.53 83.29 84.40 86.19 85.18 80.76 83.32 82.05 120 83.08 83.65 83.71 84.11 86.03 85.39 80.66 82.98 82.20 130 83.10 83.19 83.52 84.02 85.79 85.65 80.08 82.04 81.76 140 83.11 83.17 83.34 85.26 85.86 85.84 79.09 82.51 81.64 150 83.14 83.20 83.40 85.33 85.54 85.18 78.90 81.99 81.88 Table3.2:Averageaccuraciesoverthreeheld-outsetsforEnglish. system zeroparameters bigramtypes maximumpossible 1389 – EM,100iterations 444 924 MAP-EM,100iterations 695 648 Table3.3:MAP-EMwithasmoothedL0normyieldsmuchsmallermodelsthanstandardEM. 46 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 -53200 -53000 -52800 -52600 -52400 -52200 -52000 -51800 -51600 -51400 Tagging accuracy objective function value α t =80,β=0.05,Test Set 24115 Words Figure 3.8: Tagging accuracy vs. objective function for 1152 random restarts of MAP-EM with smoothedL0norm. highest objective function value achieves 87:1% accuracy. We also carried out the same experi- mentwithstandardEM(Figure3.9),wherepickingthepointwiththehighestcorpusprobability achieves84:5%accuracy. We also measured the minimization eect of the sparse prior against that of standard EM. 
Sinceourmethodlower-boundsalltheparametersbyϵ,weconsideraparameter i asazeroif i ϵ. Measuring the number of unique tag bigram types in the Viterbi tagging of the word sequence revealsthat(Table3.3)ourmethodproducesmuchsmallermodelsthanEM,andproducesViterbi taggingswithmanyfewertag-bigramtypes. 47 0.76 0.78 0.8 0.82 0.84 0.86 0.88 0.9 -147500 -147400 -147300 -147200 -147100 -147000 -146900 -146800 -146700 -146600 -146500 -146400 Tagging accuracy objective function value EM, Test Set 24115 Words Figure3.9:Taggingaccuracyvs.likelihoodfor1152randomrestartsofstandardEM. 3.6.3.2 ItalianPOStagging Wealso carried out POS tagging experiments on an Italian corpus from the Italian Turin Univer- sity Treebank [12]. This test set comprises 21;878 words annotated with POS tags and a dictio- naryforeachwordtype.Sincethisisalltheavailabledata,wecouldnottunethehyperparameters on a held-out data set. Using the hyperparameters tuned on English ( t = 80;= 0:05), we ob- tained 89:7% tagging accuracy (see Table 3.4), which was a large improvement over 81:2% that standard EM achieved. Ravi et al., [67] report 88% accuracy on this Italian POS tagging task using the state of the art technique for English. The accuracy that we get is the current state of 48 t 0.75 0.5 0.25 0.075 0.05 0.025 0.0075 0.005 0.0025 10 81.62 81.67 81.63 82.47 82.70 84.64 84.82 84.96 84.90 20 81.67 81.63 81.76 82.75 84.28 84.79 85.85 88.49 85.30 30 81.66 81.63 82.29 83.43 85.08 88.10 86.16 88.70 88.34 40 81.64 81.79 82.30 85.00 86.10 88.86 89.28 88.76 88.80 50 81.71 81.71 78.86 85.93 86.16 88.98 88.98 89.11 88.01 60 81.65 82.22 78.95 86.11 87.16 89.35 88.97 88.59 88.00 70 81.69 82.25 79.55 86.32 89.79 89.37 88.91 85.63 87.89 80 81.74 82.23 80.78 86.34 89.70 89.58 88.87 88.32 88.56 90 81.70 81.85 81.00 86.35 90.08 89.40 89.09 88.09 88.50 100 81.70 82.27 82.24 86.53 90.07 88.93 89.09 88.30 88.72 110 82.19 82.49 82.22 86.77 90.12 89.22 88.87 88.48 87.91 120 82.23 78.60 82.76 86.77 90.28 89.05 88.75 88.83 88.53 130 82.20 78.60 83.33 87.48 90.12 89.15 89.30 87.81 88.66 140 82.24 78.64 83.34 87.48 90.12 89.01 88.87 88.99 88.85 150 82.28 78.69 83.32 87.75 90.25 87.81 88.50 89.07 88.41 Table3.4:AccuraciesontestsetforItalian. the art for Italian. When we tuned the hyperparameters on the test set, the best setting t =120, =0:05gaveanaccuracyof90:28%. 3.6.4 RelatedWork AvarietyofothertechniquesintheliteraturehavebeenappliedtothisunsupervisedPOStagging task. Smith and Eisner [80] use conditional random fields with contrastive estimation to achieve 88:6% accuracy. Goldberg et al. [28] provide a linguistically-informed starting point for EM to achieve 91:4% accuracy. More recently, Chiang et al. [15] use Gibbs sampling for Bayesian inferencealongwithautomaticrunselectionandachieve90:7%. 3.7 Conclusion In this Chapter, we derived the the objective function (equation 3.8) from a novel approximation to the L 0 norm to fulfill the need for learning smaller models motivated in earlier chapters. We 49 described some of its properties and its advantage over comparable priors like the L 1 and the L p norms. We also derived a connection with the squared L 2 norm suggesting an alternative way to achieveparametersparsity.WeshowedhowEMcanbeextendedinagenericwaytouseanMDL- like objective function that simultaneously maximizes likelihood and minimizes model size. We have presented an ecient search procedure that optimizes this function for generative models and demonstrated that maximizing this function leads to improvement in tagging accuracy over standard EM. 
We infer the hyperparameters of our model using held-out data and achieve better accuracies than [29]. We have also shown that the objective function correlates well with tagging accuracy, supporting the MDL principle. This approach performs quite well on POS tagging for both English and Italian, achieving state-of-the-art accuracy on the latter. In the next chapter, we will show how to apply the same technique to improve word alignment for machine translation.

Chapter 4

Word Alignment

4.1 Introduction

Automatic word alignment is an important component of nearly all current statistical translation pipelines. Although state-of-the-art translation models use rules that operate on units bigger than words (like phrases or tree fragments), they nearly always use word alignments to drive extraction of those translation rules. The dominant approach to word alignment has been the IBM models [13] together with the HMM model [84]. These models are unsupervised, making them applicable to any language pair for which parallel text is available. Moreover, they are widely disseminated in the open-source GIZA++ toolkit [64]. These properties make them the default choice for most statistical MT systems. (This is joint work with Liang Huang and David Chiang.)

In the decades since their invention, many models have surpassed them in accuracy, but none has supplanted them in practice. Some of these models are partially supervised, combining unlabeled parallel text with manually-aligned parallel text [56,69,82]. Although manually-aligned data is very valuable, it is only available for a small number of language pairs. Other models are unsupervised like the IBM models [21,30,43], but have not been as widely adopted as GIZA++ has.

In this chapter, we propose a simple extension to the IBM/HMM models that is unsupervised like the IBM models, is as scalable as GIZA++ because it is implemented on top of GIZA++, and provides significant improvements in both alignment and translation quality. We extend the IBM/HMM models by incorporating the L_0 prior introduced in Chapter 3 to encourage sparsity in the word-to-word translation model (Section 4.2.2). This extension follows the approach of the previous chapter, on part-of-speech tagging, but scales to the large datasets typical in word alignment, using an efficient training method based on projected gradient descent (Section 4.2.3). Experiments on Czech-, Arabic-, Chinese-, and Urdu-English translation (Section 4.3) demonstrate consistent, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 Bleu). Our implementation has been released as a simple modification to the GIZA++ toolkit that can be used as a drop-in replacement for GIZA++ in any existing MT pipeline.

4.2 Method

We start with a brief review of the IBM and HMM word alignment models, then describe how to extend them with a smoothed L_0 prior and how to train them efficiently.

4.2.1 IBM Models and HMM

Given a French string f = f_1 ... f_j ... f_m and an English string e = e_1 ... e_i ... e_ℓ, these models describe the process by which the French string is generated by the English string via the alignment
The three models dier in their estimation of P d , but the dierences do notconcernushere.Allthreemodels,aswellasIBMModels3–5,sharethesame P t .Forfurther detailsofthesemodels,thereaderisreferredtotheoriginalpapersdescribingthem[13,84]. Let stand for all the parameters of the model. The standard training procedure is to find the parameter values that maximize the likelihood, or, equivalently, minimize the negative log- likelihoodoftheobserveddata: ˆ =argmin ( logP(fje;) ) (4.2) =argmin 0 B B B B B @ log ∑ a P(f;aje;) 1 C C C C C A (4.3) ThisisdoneusingtheExpectation-Maximization(EM)algorithm[18]. 4.2.2 MAP-EMwiththe L 0 -norm Maximumlikelihoodtrainingispronetooverfitting,especiallyinmodelswithmanyparameters. In word alignment, one well-known manifestation of overfitting is that rare words can act as 53 “garbage collectors” [57], aligning to many unrelated words. This hurts alignment precision and rule-extraction recall. Previous attempted remedies include early stopping, smoothing [57], and posteriorregularization[30]. In the previous chapter, we described another simple remedy to over-fitting in the context of unsupervised part-of-speech tagging, which was to minimize the size of the model using an MDL-inspiredsmoothed L 0 prior. Here, the goal is to apply a similar prior in a word alignment model to the word-to-word translationprobabilitiesP t (f je).Weleavethedistortionmodelsalone.Thedistortionmodelsare not very large, and GIZA++ already applies smoothing to them, so there is not much reason to believethatwecanprofitfromcompactingthem. Hopefully, the reader is familiar by now with the the prior introduced in Chapter 3. We will directlywritedownthe optimizationusingthe smoothed L 0 priorforthewordtranslation proba- bilities.Findthemodel, ˆ suchthat: ˆ =argmax 0 B B B B B B B @ logP(fje;)+ ∑ e;f exp P t (f je) 1 C C C C C C C A (4.4) subjecttotheconstraints ∑ f P t (f je)=1 foralle: (4.5) 54 We can carry out the optimization in equation 4.4 with the MAP-EM algorithm [8]. The dierencebetweenEMandMAP-EMtrainingliesintheM-step.ForvanillaEM,theM-stepis: ˆ =argmax 0 B B B B B B B @ ∑ e;f E[C(e; f)]logP t (f je) 1 C C C C C C C A (4.6) again subject to the constraints (4.5). The count C(e; f) is the number of times that f occurs alignedtoe.ForMAP-EM,itis: ˆ =argmax ( ∑ e;f E[C(e; f)]logP t (f je)+ ∑ e;f exp P t (f je) ) (4.7) This optimization problem is non-convex, and we do not know of a closed-form solution. In the previouschapter, we used ALGENCAN, a non-linear optimization toolkit, but this solution does notscalewelltothenumberofparametersinvolvedinwordalignmentmodels.Instead,weusea projectedgradientdescentmethodwhichisscalableandsimpleenoughtobeimplementedwithin GIZA++.Wedescribethismethodinthenextsection. 4.2.3 ProjectedGradientDescent Following Schoenemann [71], we use projected gradient descent (PGD) to solve the M-step (but with the L 0 -norm instead of the ℓ 1 -norm). Gradient projection methods are attractive solutions to constrained optimization problems, particularly when the constraints on the parameters are simple [7]. Let F() be the objective function in (4.7); we seek to minimize this function. As in previouswork[83],weoptimizeeachsetofparametersfP t (je)gseparatelyforeachEnglishword type e. The inputs to the PGD are the expected counts E[C(e; f)] and the current word-to-word 55 conditional probabilities . We run PGD for K iterations, producing a sequence of intermedi- ate parameter vectors 1 ;:::; k ;:::; K : Each iteration has two steps, a projection step and a line search. 
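As a concrete sketch of the M-step subproblem that projected gradient descent solves, the function below evaluates the negated objective of equation (4.7) restricted to a single English word type e, together with its gradient; the gradient mirrors the formula given in the projection-step description that follows. This is an illustrative Python rendering under the same α and β hyperparameters, not the GIZA++ implementation itself.

```python
import numpy as np

def neg_objective_and_grad(p, expected_counts, alpha, beta):
    """Negated M-step objective for one English word type e:
    F(p) = -( sum_f E[C(e,f)] log p_f + alpha * sum_f exp(-p_f / beta) ),
    where p_f = P_t(f | e) and p lies on the probability simplex."""
    E = np.asarray(expected_counts, dtype=float)
    p = np.asarray(p, dtype=float)
    penalty = np.exp(-p / beta)
    F = -(np.sum(E * np.log(p)) + alpha * np.sum(penalty))
    grad = -(E / p) + (alpha / beta) * penalty
    return F, grad

# Example: expected counts for the French words co-occurring with one English word.
F, g = neg_objective_and_grad(np.array([0.4, 0.3, 0.3]),
                              expected_counts=[20.0, 3.0, 0.1],
                              alpha=50.0, beta=0.05)
print(F, g)
```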
Projectionstep Inthisstep,wecompute: k = [ k srF( k ) ] ∆ (4.8) Thismoves inthedirectionofsteepestdescent(rF)withstepsize s,andthenthefunction[] ∆ projects the resulting point onto the simplex; that is, it finds the nearest point that satisfies the constraints(4.5). ThegradientrF( k )is @F @P t (f je) = E[C(f;e)] P t (f je) + exp P t (f je) (4.9) In contrast to [71], we use an O(nlogn) algorithm for the projection step due to Duchi et. al.[20],showninPseudocode1. Pseudocode1Projectinputvectoru2R n ontotheprobabilitysimplex. v=usortedinnon-increasingorder =0 for i=1ton do if v i 1 i ( ∑ i r=1 v r 1 ) >0then =i endif endfor = 1 ( ∑ r=1 v r 1 ) w r =maxfv r ;0gfor1rn return w 56 Pseudocode2Findapointbetween k and k thatsatisfiestheArmijocondition. F min =F( k ) min = k form=1to20do m = m ( k k ) if F( k + m )<F min then F min =F( k + m ) min = k + m endif if F( k + m )F( k )+ ( rF( k ) m ) then break endif endfor k+1 = min return k+1 Linesearch Next,wemovetoapointbetween k and k thatsatisfiestheArmijocondition, F( k + m )F( k )+ ( rF( k ) m ) (4.10) where m = m ( k k ) and and are both constants in (0;1). We try values m=1;2;::: until the Armijo condition (4.10) is satisfied or the limit m=20 is reached. (Note that we don’t allow m=0 because this can cause k + m to land on the boundary of the probability simplex, where the objective function is undefined.) Then we set k+1 to the point inf k g[f k + m j1m20g thatminimizes F.ThelinesearchalgorithmissummarizedinPseudocode2. In our implementation, we set = 0:5 and = 0:5. We keep s fixed for all PGD iterations; we experimented with s2f0:1;0:5g and did not observe significant changes in F-score. We run the projection step and line search alternately for at most K iterations, terminating early if there is no change in k from one iteration to the next. We set K = 35 for the large Arabic-English 57 experiment;forallotherconditions,wesetK=50.Thesechoicesweremadetobalanceeciency andaccuracy.Wefoundthatvaluesof K between30and75weregenerallyreasonable. 4.3 Experiments To demonstrate the eect of the L 0 -norm on the IBM models, we performed experiments on four translation tasks: Arabic-English, Chinese-English, and Urdu-English from the NIST Open MT Evaluation, and the Czech-English translation from the Workshop on Machine Translation (WMT) shared task. We measured the accuracy of word alignments generated by GIZA++ with and without the L 0 -norm, and also translation accuracy of systems trained using the word align- ments.Acrossalltests,wefoundstrongimprovementsfromaddingthe L 0 -norm. 4.3.1 Training We have implemented our algorithm as an open-source extension to GIZA++. 1 Usage of the extensionisidenticaltostandardGIZA++,exceptthattheusercanswitchthe L 0 prioronoro, andadjustthehyperparameters and . For vanilla EM, we ran five iterations of Model 1, five iterations of HMM, and ten iterations of Model 4. For our approach, we first ran one iteration of Model 1, followed by four iterations ofModel1withsmoothedL 0 ,followedbyfiveiterationsofHMMwithsmoothedL 0 .Finally,we ranteniterationsofModel4. 2 1 The code can be downloaded from the first author’s website at http://www.isi.edu/ ~ avaswani/ giza-pp-l0.html. 2 GIZA++ allows changing some heuristic parameters for ecient training. Currently, we set two of these to zero: mincountincrease and probcutoff. In the default setting, both are set to 10 7 . 
We set probcutoff to 0 because wewouldliketheoptimizationtolearntheparametervalues.Forafaircomparison,weappliedthesamesettingtoour vanillaEMtrainingaswell.Totest,weranGIZA++withthedefaultsettingonthesmallerofourtwoArabic-English datasetswiththesamenumberofiterationsandfoundnochangeinF-score. 58 president of the foreign affairs institute shuqin liu was also present at the meeting . u u w aiji ao u xu ehu u hu zh ang u li u u u sh uq ng u hu ji an sh u u u u z aizu o u . over 4000 guests from home and abroad attended the opening ceremony . u u zh ongw ai u l aib n u s qi an u u du o r en u ch ux u le u u k aim ush u . (a) (b) it 's extremely troublesome to get there via land . u r ugu o u y ao u u l ul u zhu an u u q u dehu a ne u , u h en u h en u h en u h en u m afan de u , after this was taken care of , four blockhouses were blown up . u zh ege u ch ul w an u y h ou ne u , h ai u zh a u le u s ge u di aob ao u . (c) (d) Figure 4.1: Smoothed-L 0 alignments (red circles) correct many errors in the baseline GIZA++ alignments(blacksquares),asshowninfourChinese-Englishexamples(theredcirclesarealmost perfect for these examples, except for minor mistakes such as liu-sh¯ uq¯ ıng and meeting-z` aizu` o in (a) and .-, in (c)). In particular, the baseline system demonstrates typical “garbage-collection” phenomenainpropername“shuqing”inbothlanguagesin(a),number“4000”andword“l´ aib¯ ın” (lit. “guest”) in (b), word “troublesome” and “l` ul` u” (lit. “land-route”) in (c), and “blockhouses” and “di¯ aobˇ ao” (lit. “bunker”) in (d). We found this garbage-collection behavior to be especially common with proper names, numbers, and uncommon words in both languages. Most interest- ingly,in(c),oursmoothed-L 0 systemcorrectlyaligns“extremely”to“hˇ enhˇ enhˇ enhˇ en”(lit.“very veryveryvery”)whichisrareinthebitext. 59 task data(M) system alignF1(%) wordtrans(M) ˜ ϕ sing: Bleu(%) 2008 2009 2010 Chi-Eng 9.6+12 baseline 73.2 3.5 6.2 28.7 L 0 -norm 76.5 2.0 3.3 29.5 dierence +3:3 43% 47% +0:8 Ara-Eng 5.4+4.3 baseline 65.0 3.1 4.5 39.8 42.5 L 0 -norm 70.8 1.8 1.8 41.1 43.7 dierence +5:9 39% 60% +1:3 +1:2 Ara-Eng 44+37 baseline 66.2 15 5.0 41.6 44.9 L 0 -norm 71.8 7.9 1.8 42.5 45.3 dierence +5:6 47% 64% +0:9 +0:4 Urd-Eng 1.7+1.5 baseline 1.7 4.5 25.3 29.8 L 0 -norm 1.2 2.2 25.9 31.2 dierence 29% 51% +0:6 +1:4 Cze-Eng 2.1+2.3 baseline 65.6 1.5 3.0 17.3 18.0 L 0 -norm 72.3 1.0 1.4 17.9 18.4 dierence +6:7 33% 53% +0:6 +0:4 Table 4.1: Adding the L 0 -norm to the IBM models improves both alignment and translation ac- curacy across four dierent language pairs. The word trans column also shows that the number of distinct word translations (i.e., the size of the lexical weighting table) is reduced. The ˜ ϕ sing: columnshowstheaveragefertilityofonce-seensourcewords.ForCzech-English,theyearrefers to the WMT shared task; for all other language pairs, the year refers to the NIST Open MT Evaluation. Halfofthistestsetwasalsousedfortuningfeatureweights. 60 Weusedthefollowingparalleldata: Chinese-English:selecteddatafromtheconstrainedtaskoftheNIST2009OpenMTEval- uation. 3 Arabic-English: all the available data for the constrained track of NIST 2009, excluding United Nations proceedings (LDC2004E13), ISI Automatically Extracted Parallel Text (LDC2007E08), and Ummah newswire text (LDC2004T18), for a total of 5.4+4.3 million words. We also experimented on a larger Arabic-English parallel text of 44+37 million wordsfromtheDARPAGALEprogram. Urdu-English:allavailabledatafortheconstrainedtrackofNIST2009. 
Czech-English: A corpus of 4 million words of Czech-English data from the News Com- mentarycorpus. 4 Wesetthehyperparametersandbytuningongold-standardwordalignments(tomaximize F1) when possible. For Arabic-English and Chinese-English, we used 346 and 184 hand-aligned sentencesfromLDC2006E86andLDC2006E93.Similarly,forCzech-English,515hand-aligned sentences were available [11]. But for Urdu-English, since we did not have any gold alignments, weused=10and=0:05.Wedidnotchoosealarge,asthedatasetwassmall,andwechose aconservativevaluefor . We ran word alignment in both directions and symmetrized using grow-diag-final [40]. For modelswiththesmoothed L 0 prior,wetuned and separatelyineachdirection. 3 LDCcatalognumbersLDC2003E07,LDC2003E14,LDC2005E83,LDC2005T06,LDC2006E24,LDC2006E34, LDC2006E85,LDC2006E86,LDC2006E92,andLDC2006E93. 4 Thisdataisavailableathttp://statmt.org/wmt10. 61 model 0 10 25 50 75 100 250 500 750 – HMM 47.5 M4 52.1 0.5 HMM 46.3 48.4 52.8 55.7 57.5 61.5 62.6 62.7 M4 51.7 53.7 56.4 58.6 59.8 63.3 64.4 64.8 0.1 HMM 55.6 60.4 61.6 62.1 61.9 61.8 60.2 60.1 M4 58.2 62.4 64.0 64.4 64.8 65.5 65.6 65.9 0.05 HMM 59.1 61.4 62.4 62.5 62.3 60.8 58.7 57.7 M4 61.0 63.5 64.6 65.3 65.3 65.4 65.7 65.7 0.01 HMM 59.7 61.6 60.0 59.5 58.7 56.9 55.7 54.7 M4 62.9 65.0 65.1 65.2 65.1 65.4 65.3 65.4 0.005 HMM 58.1 59.0 58.3 57.6 57.0 55.9 53.9 51.7 M4 62.0 64.1 64.5 64.5 64.5 65.0 64.8 64.6 0.001 HMM 51.7 52.1 51.4 49.3 50.4 46.8 45.4 44.0 M4 59.8 61.3 61.5 61.0 61.8 61.2 61.0 61.2 Table 4.2: Almost all hyperparameter settings achieve higher F-scores than the baseline IBM Model4andHMMmodelforArabic-Englishalignment(=0). 4.3.2 Alignment First, we evaluated alignment accuracy directly by comparing against gold-standard word align- ments. The results are shown in the alignment F1 column of Table 4.1. We used balanced F- measureratherthanalignmenterrorrateasourmetric[24]. Following[21],wealsomeasuredtheaveragefertility, ˜ ϕ sing: ,ofonce-seensourcewordsinthe symmetrized alignments. Our alignments show smaller fertility for once-seen words, suggesting thattheysuerfrom“garbagecollection”eectslessthanthebaselinealignmentsdo. Thefactthatwehadtousehand-aligneddatatotunethehyperparametersandmeansthat our method is no longer completely unsupervised. However, our observation is that alignment accuracy is actually fairly robust to the choice of these hyperparameters, as shown in Table 4.2. As we will see below, we still obtained strong improvements in translation quality when hand- aligneddatawasunavailable. 62 wordclasses? direction system no yes P(f je) baseline 49.0 52.1 L 0 -norm 63.9 65.9 dierence +14:9 +13:8 P(ej f) baseline 64.3 65.2 L 0 -norm 69.2 70.3 dierence +4:9 +5:1 Table4.3:AddingwordclassesimprovestheF-scoreinbothdirectionsforArabic-Englishalign- mentbyalittle,forthebaselinesystemmoresothanours. We also tried generating 50 word classes using the tool provided in GIZA++. We found that addingwordclassesimprovedalignmentqualityalittle,butmoresoforthebaselinesystem(see Table 4.3). We used the alignments generated by training with word classes for our translation experiments. Figure 4.1 shows four examples of Chinese-English alignment, comparing the baseline with our smoothed-L 0 method. In all four cases, the baseline produces incorrect extra alignments that prevent good translation rules from being extracted while the smoothed-L 0 results are correct. In particular,thebaselinesystemdemonstratestypical“garbagecollection”behavior[57]inallfour examples. 
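The two alignment-level diagnostics used above, balanced F-measure against gold alignments and the average fertility of once-seen source words, are straightforward to compute. The sketch below assumes alignments are given as sets of (source index, target index) links and that training-data counts for source word types are available; the function names and data layout are illustrative, not taken from our tools.

```python
def alignment_f1(predicted, gold):
    """Balanced F-measure between two sets of (src, tgt) alignment links."""
    tp = len(predicted & gold)
    if not predicted or not gold or tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def avg_fertility_once_seen(alignments, source_sentences, source_counts):
    """Average number of links per source token whose word type occurs exactly
    once in the training data (the once-seen words prone to garbage collection)."""
    fertilities = []
    for links, src in zip(alignments, source_sentences):
        for j, word in enumerate(src):
            if source_counts.get(word, 0) == 1:
                fertilities.append(sum(1 for (s, _) in links if s == j))
    return sum(fertilities) / len(fertilities) if fertilities else 0.0

# Example with one sentence pair: predicted vs. gold links.
pred = {(0, 0), (1, 1), (2, 3)}
gold = {(0, 0), (1, 1), (2, 2)}
print(alignment_f1(pred, gold))
```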
4.3.3 Translation Wethentestedtheeectofwordalignmentsontranslationqualityusingthehierarchicalphrase- based translation system Hiero [16]. We used a fairly standard set of features: seven inherited from Pharaoh [40], a second language model, and penalties for the glue rule, identity rules, unknown-word rules, and two kinds of number/name rules. The feature weights were discrim- inatively trained using MIRA [17]. We used two 5-gram language models, one on the combined 63 setting alignF1(%) Bleu(%) t(f je) t(ej f) 2008 2009 1st 1st 70.8 41.1 43.7 1st 2nd 70.7 41.1 43.8 2nd 1st 70.7 40.7 44.1 2nd 2nd 70.9 41.1 44.2 Table 4.4: Optimizing hyperparameters on alignment F1 score does not necessarily lead to opti- malBleu.Thefirsttwocolumnsindicatewhetherweusedthefirst-orsecond-bestalignmentsin each direction (according to F1); the third column shows the F1 of the symmetrized alignments, whosecorrespondingBleuscoresareshowninthelasttwocolumns. English sides of the NIST 2009 Arabic-English and Chinese-English constrained tracks (385M words),andanotheron2billionwordsofEnglish. For each language pair, we extracted grammar rules from the same data that were used for word alignment. The development data that were used for discriminative training were: for Chinese-English and Arabic-English, data from the NIST 2004 and NIST 2006 test sets, plus newsgroup data from the GALE program (LDC2006E92); for Urdu-English, half of the NIST 2008testset;forCzech-English,atrainingsetof2051sentencesprovidedbytheWMT10trans- lationworkshop. The results are shown in the Bleu column of Table 4.1. We used case-insensitive IBM Bleu (closest reference length) as our metric. Significance testing was carried out using bootstrap re- samplingwith1000samples[39,86]. Allofthetestsshowedsignificantimprovements(p<0:01),rangingfrom+0.4Bleuto+1.4 Bleu. For Urdu, even though we didn’t have manual alignments to tune hyperparameters, we got significant gains over a good baseline. This is promising for languages that do not have any manuallyaligneddata. 64 Ideally, one would want to tune and to maximize Bleu. However, this is prohibitively expensive, especially if we must tune them separately in each alignment direction before sym- metrization. We ran some contrastive experiments to investigate the impact of hyperparameter tuning on translation quality. For the smaller Arabic-English corpus, we symmetrized all combi- nations of the two top-scoring alignments (according to F1) in each direction, yielding four sets ofalignments.Table4.4showsBleuscoresfortranslationmodelslearnedfromthesealignments. Unfortunately, we find that optimizing F1 is not optimal for Bleu—using the second-best align- ments yields a further improvement of 0:5 Bleu on the NIST 2009 data, which is statistically significant(p<0:05). 4.4 RelatedWork Schoenemann[70],takinginspirationfromBodrumluetal.[10],usesintegerlinearprogramming to optimize IBM Model 1–2 and the HMM with the L 0 -norm. This method, however, does not outperform GIZA++. In later work, Schoenemann [71] used projected gradient descent for the L 1 -norm. Here, we have adopted his use of projected gradient descent, but using a smoothed L 0 -norm. Liangetal.,[43]showhowtotrainIBMmodelsinbothdirectionssimultaneouslybyadding a term to the log-likelihood that measures the agreement between the two directions. Grac ¸a et al. [30], explore modifications to the HMM model that encourage bijectivity and symmetry. 
The modifications take the form of constraints on the posterior distribution over alignments that is computed during the E-step. [49] explore a Bayesian version of IBM Model 1, applying sparse 65 Dirichlet priors to t. However, because this method requires the use of Monte Carlo methods, it isnotclearhowwellitcanscaletolargerdatasets. 4.5 Conclusion In this chapter, we extended the IBM models and HMM model by the addition of an L 0 prior to theword-to-wordtranslationmodel,whichcompactstheword-to-wordtranslationtable,reducing overfitting, and, in particular, the “garbage collection” eect. We have shown how to perform MAP-EM with this prior eciently, even for large datasets. The method is implemented as a modificationtotheopen-sourcetoolkitGIZA++andwehaveshownthatitsignificantlyimproves translationqualityacrossfourdierentlanguagepairs.Eventhoughweusedasmallsetof gold- standardalignmentstotuneourhyperparameters,wefoundthatperformancewasfairlyrobustto variationinthehyperparameters,andtranslationperformancewasgoodevenwhengold-standard alignments were unavailable. The software is available for download from my website. We hope that the method, due to its simplicity, generality, and eectiveness, will find wide application for trainingbetterstatisticaltranslationsystems. 66 Chapter5 RuleMarkovModels In the last few chapters, we presented an unsupervised approach for learning small models and successfully applied it for learning word-alignment models. Continuing on the thread of small models,wewillnowpresentourworkonachievinggoodtranslationqualitywithsmalltranslation grammars. 5.1 Introduction Statistical machine translation systems typically model the translation process as a sequence of translationsteps,eachofwhichusesatranslationrule,forexample,aphrasepairinphrase-based translation or a tree-to-string rule in tree-to-string translation. These rules are usually applied independently of each other, which violates the conventional wisdom that translation should be done in context. To alleviate this problem, most state-of-the-art systems rely on composed rules, which are larger rules that can be formed out of smaller rules (including larger phrase pairs that can be formed out of smaller phrase pairs), as opposed to minimal rules, which are rules that cannot be formed out of other rules. Although this approach does improve translation quality 0 ThisisjointworkwithHaitaoMi,LiangHuang,andDavidChiang 67 dramatically by weakening the independence assumptions in the translation model, they suer from two main problems. First, composition can cause a combinatorial explosion in the number of rules. To avoid this, ad-hoc limits are placed during composition, like upper bounds on the number of nodes in the composed rule, or the height of the rule. Under such limits, the gram- mar size is manageable, but still much larger than the minimal-rule grammar. Second, due to large grammars, the decoder has to consider many more hypothesis translations, which slows it down. Nevertheless, the advantages outweigh the disadvantages, and to our knowledge, all top- performingsystems,bothphrase-basedandsyntax-based,usecomposedrules.Forexample,[27] initially built a syntax-based system using only minimal rules, and subsequently reported [26] that composing rules improves Bleu by 3.6 points, while increasing grammar size 60-fold and decodingtime15-fold. The alternative we propose is to replace composed rules with a rule Markov model that gen- erates rules conditioned on their context. 
In this work, we restrict a rule’s context to the vertical chain of ancestors of the rule. This ancestral context would play the same role as the context formerly provided by rule composition. The dependency treelet model developed by Quirk et al. [66] takes such an approach within the framework of dependency translation. However, their study leaves unanswered whether a rule Markov model can take the place of composed rules. In this work, we investigate the use of rule Markov models in the context of tree-to-string transla- tion[34,45].Wemakethreenewcontributions. First, we carry out a detailed comparison of rule Markov models with composed rules. Our experiments show that, using trigram rule Markov models, we achieve an improvement of 2.2 Bleu over a baseline of minimal rules. When we compare against vertically composed rules, we find that our rule Markov model has the same accuracy, but our model is much smaller and 68 decoding with our model is 30% faster. When we compare against full composed rules, we find that our rule Markov model still often reaches the same level of accuracy, again with savings in spaceandtime. Second,weinvestigatemethodsforpruningruleMarkovmodels,findingthatevenverysim- ple pruning criteria actually improve the accuracy of the model, while of course decreasing its size. Third, we present a very fast decoder for tree-to-string grammars with rule Markov models. HuangandMi[35]haverecentlyintroducedanecientincrementaldecodingalgorithmfortree- to-string translation, which operates top-down and maintains a derivation history of translation rules encountered. This history is exactly the vertical chain of ancestors corresponding to the contextsinourruleMarkovmodel,whichmakesitanidealdecoderforourmodel. We start by describing our rule Markov model (Section 5.2) and then how to decode using theruleMarkovmodel(Section5.3). 5.2 RuleMarkovmodels Our model which conditions the generation of a rule on the vertical chain of its ancestors, which allowsittocaptureinteractionsbetweenrules. ConsidertheexampleChinese-Englishtree-to-stringgrammarinFigure5.1andtheexample derivationinFigure5.2.Eachrowisaderivationstep;thetreeontheleftisthederivationtree(in which each node is a rule and its children are the rules that substitute into it) and the tree pair on the right is the source and target derivedtree. Forany derivationnode r, let anc 1 (r)be the parent 69 ruleid translationrule r 1 IP(x 1 :NP x 2 :VP)! x 1 x 2 r 2 NP(B` ush´ ı)!Bush r 3 VP(x 1 :PP x 2 :VP)! x 2 x 1 r 4 PP(x 1 :P x 2 :NP)! x 1 x 2 r 5 VP(VV(jˇ ux´ ıng)AS(le)NPB(hu` ıt´ an))!heldtalks r 6 P(yˇ u)!with r ′ 6 P(yˇ u)!and r 7 NP(Sh¯ al´ ong)!Sharon Figure5.1:Exampletree-to-stringgrammar. derivationtree derivedtreepair ϵ IP ϵ : IP ϵ r 1 IP ϵ NP 1 VP 2 IP ϵ NP 1 VP 2 : NP 1 VP 2 r 1 r 2 r 3 IP ϵ NP 1 B` ush´ ı VP 2 PP 2:1 VP 2:2 IP ϵ NP 1 B` ush´ ı VP 2 PP 2:1 VP 2:2 : Bush VP 2:2 PP 2:1 r 1 r 2 r 3 r 4 r 5 IP ϵ NP 1 B` ush´ ı VP 2 PP 2:1 P 2:1:1 NP 2:1:2 VP 2:2 VV jˇ ux´ ıng AS le NP hu` ıt´ an IP ϵ NP 1 B` ush´ ı VP 2 PP 2:1 P 2:1:1 NP 2:1:2 VP 2:2 VV jˇ ux´ ıng AS le NP hu` ıt´ an : Bush heldtalks P 2:1:1 NP 2:1:2 r 1 r 2 r 3 r 4 r 6 r 7 r 5 IP ϵ NP 1 B` ush´ ı VP 2 PP 2:1 P 2:1:1 yˇ u NP 2:1:2 Sh¯ al´ ong VP 2:2 VV jˇ ux´ ıng AS le NP hu` ıt´ an IP ϵ NP 1 B` ush´ ı VP 2 PP 2:1 P 2:1:1 yˇ u NP 2:1:2 Sh¯ al´ ong VP 2:2 VV jˇ ux´ ıng AS le NP hu` ıt´ an : Bushheldtalks with Sharon Figure 5.2: Example tree-to-string derivation. 
Each row shows a rewriting step; at each step, the leftmostnonterminalsymbolisrewrittenusingoneoftherulesinFigure5.1. 70 of r (or ϵ if it has no parent), anc 2 (r) be the grandparent of node r (or ϵ if it has no grandparent), andsoon.Letanc n 1 (r)bethechainofancestorsanc 1 (r)anc n (r). The derivation tree is generated as follows. With probability P(r 1 jϵ), we generate the rule at therootnode,r 1 .Wethengenerateruler 2 withprobabilityP(r 2 jr 1 ),andsoon,alwaystakingthe leftmost open substitution site on the English derived tree, and generating a rule r i conditioned on its chain of ancestors with probability P(r i janc n 1 (r i )). We carry on until no more children can begenerated.ThustheprobabilityofaderivationtreeT is: P(T)= ∏ r2T P(rjanc n 1 (r)): (5.1) FortheminimalrulederivationtreeinFigure5.2,theprobabilityis: P(T)=P(r 1 jϵ)P(r 2 jr 1 )P(r 3 jr 1 )P(r 4 jr 1 ;r 3 )P(r 6 jr 1 ;r 3 ;r 4 ) P(r 7 jr 1 ;r 3 ;r 4 )P(r 5 jr 1 ;r 3 ): (5.2) Training We run the algorithm of Galley et al. [27] on word-aligned parallel text to obtain a single derivation of minimal rules for each sentence pair. (Unaligned words are handled by attaching them to the highest node possible in the parse tree.) The rule Markov model can then betrainedonthepathsetofthesederivationtrees. Smoothing Weuseinterpolationwithabsolutediscounting[60]: P abs (rjanc n 1 (r))= max { c(rjanc n 1 (r))D n ;0 } ∑ r ′c(r ′ janc n 1 (r ′ )) + (1 n )P abs (rjanc n1 1 (r)); (5.3) 71 where c(rjanc n 1 (r)) is the number of times we have seen rule r after the vertical context anc n 1 (r), D n isthediscountforacontextoflengthn,and(1 n )issettothevaluethatmakesthesmoothed probabilitydistributionsumtoone. WeexperimentwithbigramandtrigramruleMarkovmodels.Foreach,wetrydierentvalues of D 2 and D 3 , the discount for bigrams and trigrams, respectively. Ney et al. [60] suggest using thefollowingvalueforthediscount D n : D n = n 1 n 1 +n 2 (5.4) Here, n 1 and n 2 are the total number of n-grams with exactly one and two counts, respectively. Forourcorpus, D 2 =0:871and D 3 =0:902.Additionally,weexperimentwith0:4and0:5for D n tooptimizeforBleuonthedevelopmentset(Tables5.2and5.3). Pruning In addition to full n-gram Markov models, we experiment with three approaches to build smaller models to investigate if pruning helps. Our results will show that smaller models indeedgiveahigherBleuscorethanthefullbigramandtrigrammodels.Theapproachesweuse are: RM-A:Wekeeponlythosecontextsinwhichmorethan Puniqueruleswereobserved.By optimizingonthedevelopmentset,weset P=12. RM-B: We keep only those contexts that were observed more than Q times. Note that this isasupersetofRM-A.Again,byoptimizingonthedevelopmentset,weset Q=12. RM-C:Wetryamoreprincipledapproachforlearningvariable-lengthMarkovmodelsin- spired by that of Bejerano et al. [4], who learn a Prediction Sux Tree (PST). They grow 72 the PST in an iterative manner by starting from the root node (no context), and then add contexts to the tree. A context is added if the KL divergence between its predictive distri- bution and that of its parent is above a certain threshold and the probability of observing thecontextisaboveanotherthreshold. 5.3 Tree-to-StringDecodingwithRuleMarkovModels In this paper, we use our rule Markov model framework in the context of tree-to-string transla- tion.Tree-to-stringtranslationsystems[34,45]havegainedpopularityinrecentyearsduetotheir speed and simplicity. The input to the translation system is a source parse tree and the output is the target string. 
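Before describing the decoder, the smoothing scheme of Section 5.2 can be sketched compactly: interpolated absolute discounting over vertical ancestor contexts, following equation (5.3). The Python class below is illustrative only; counts are read off minimal-rule derivation trees, discounts are indexed by context length (cf. D_2 and D_3), and the recursion bottoms out in the unconditioned rule distribution.

```python
from collections import defaultdict

class RuleMarkovModel:
    """Sketch of P_abs(r | ancestors): absolute discounting with interpolation
    over vertical contexts (equation 5.3). Ancestor chains are ordered oldest
    first, e.g. (grandparent, parent). All names are illustrative."""

    def __init__(self, discounts):
        self.discounts = discounts                    # D_n keyed by context length
        self.counts = defaultdict(lambda: defaultdict(float))
        self.totals = defaultdict(float)
        self.vocab = set()

    def observe(self, rule, ancestors):
        self.vocab.add(rule)
        for n in range(len(ancestors) + 1):           # every suffix of the chain
            ctx = tuple(ancestors[len(ancestors) - n:])
            self.counts[ctx][rule] += 1.0
            self.totals[ctx] += 1.0

    def prob(self, rule, ancestors):
        ctx = tuple(ancestors)
        total = self.totals[ctx]
        if not ctx:                                   # base case: unigram estimate
            if total == 0.0:
                return 1.0 / max(len(self.vocab), 1)
            return self.counts[ctx].get(rule, 0.0) / total
        lower = self.prob(rule, ancestors[1:])        # back off: drop oldest ancestor
        if total == 0.0:
            return lower
        D = self.discounts[len(ctx)]
        discounted = max(self.counts[ctx].get(rule, 0.0) - D, 0.0) / total
        # interpolation weight (1 - lambda_n), chosen so the distribution sums to one
        backoff_mass = sum(min(c, D) for c in self.counts[ctx].values()) / total
        return discounted + backoff_mass * lower

# Example: a trigram rule Markov model with the discounts from equation (5.4).
rmm = RuleMarkovModel(discounts={1: 0.871, 2: 0.902})
rmm.observe("r6", ancestors=["r3", "r4"])
rmm.observe("r6'", ancestors=["r3", "r4"])
rmm.observe("r6", ancestors=["r3", "r4"])
print(rmm.prob("r6", ["r3", "r4"]))
```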
Huang and Mi [35] have recently introduced an ecient incremental decoding algorithm for tree-to-string translation. The decoder operates top-down and maintains a deriva- tionhistoryoftranslationrulesencountered.Thehistoryisexactlytheverticalchainofancestors correspondingtothecontextsinourruleMarkovmodel.Thismakesincrementaldecodinganat- ural fit with our generative story. In this section, we describe how to integrate our rule Markov model into this incremental decoding algorithm. Note that it is also possible to integrate our rule Markov model with other decoding algorithms, for example, the more common non-incremental top-down/bottom-up approach [34], but it would involve a non-trivial change to the decoding algorithmstokeeptrackoftheverticalderivationhistory,whichwouldresultinsignificantover- head. Algorithm Given the input parse tree in Figure 5.3, Figure 5.4 illustrates the search process of the incremental decoder with the grammar of Figure 5.1. We write X for a tree node with label X at tree address [79]. The root node has address ϵ, and the ith child of node has address :i. 73 IP ϵ NP 1 B` ush´ ı VP 2 PP 2:1 P 2:1:1 yˇ u NP 2:1:2 Sh¯ al´ ong VP 2:2 VV 2:2:1 jˇ ux´ ıng AS 2:2:2 le NP 2:2:3 hu` ıt´ an Figure5.3:Exampleinputparsetreewithtreeaddresses. At each step, the decoder maintains a stack of active rules, which are rules that have not been completedyet,andtherightmost(n1)Englishwordstranslatedthusfar(thehypothesis),where n is the order of the word language model (in Figure 5.4, n= 2). The stack together with the translatedEnglishwordscompriseastateofthedecoder.Thelastcolumninthefigureshowsthe ruleMarkovmodelprobabilitieswiththeconditioningcontext.Inthisexample,weuseatrigram ruleMarkovmodel. After initialization, the process starts at step 1, where we predict rule r 1 (the shaded rule) with probability P(r 1 jϵ) and push its English side onto the stack, with variables replaced by the corresponding tree nodes: x 1 becomes NP 1 and x 2 becomes VP 2 . This gives us the following stack: s=[NP 1 VP 2 ] The dot () indicates the next symbol to process in the English word order. We expand node NP 1 first with English word order. We then predict lexical rule r 2 with probability P(r 2 jr 1 ) and push ruler 2 ontothestack: [NP 1 VP 2 ][Bush] 74 stack hyp. MRprob. 0 [<s> IP ϵ </s>] <s> 1 [<s> IP ϵ </s>] [NP 1 VP 2 ] <s> P(r 1 jϵ) 2 [<s> IP ϵ </s>][NP 1 VP 2 ] [Bush] <s> P(r 2 jr 1 ) 3 [<s> IP ϵ </s>][NP 1 VP 2 ][Bush ] ... Bush 4 [<s> IP ϵ </s>][NP 1 VP 2 ] ... Bush 5 [<s> IP ϵ </s>][NP 1 VP 2 ] [VP 2:2 PP 2:1 ] ... Bush P(r 3 jr 1 ) 6 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ] [heldtalks] ... Bush P(r 5 jr 1 ;r 3 ) 7 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][held talks] ... held 8 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][heldtalks ] ... talks 9 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ] ... talks 10 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ] [P 2:1:1 NP 2:1:2 ] ... talks P(r 4 jr 1 ;r 3 ) 11 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ] [with] ... with P(r 6 jr 3 ;r 4 ) 12 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ][with ] ... with 13 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ] ... with 14 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ] [Sharon] ... with P(r 7 jr 3 ;r 4 ) 11 ′ [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ] [and] ... and P(r ′ 6 jr 3 ;r 4 ) 12 ′ [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ][and ] ... 
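The key point of the integration is that the context needed by the rule Markov model is already on this stack: the active rules, read from bottom to top, are exactly the vertical chain of ancestors of the rule about to be predicted. A minimal sketch is below, using a hypothetical representation of stack entries; the full step-by-step simulation follows in Figure 5.4. For example, just before r_5 is predicted (step 6 in Figure 5.4), the active rules are r_1 and r_3, so a trigram model scores P(r_5 | r_1, r_3).

```python
# Sketch: the conditioning context for the rule Markov model is read directly
# off the decoder's stack of active rules. Stack entries are shown here as
# (rule_id, dotted_english_side) pairs; this representation is illustrative.

def markov_context(stack, order):
    """Vertical ancestor chain (oldest first) for the next predicted rule,
    truncated to the order-1 most recent ancestors."""
    ancestors = [rule_id for rule_id, _ in stack if rule_id is not None]
    return tuple(ancestors[-(order - 1):])

# Decoder state just before predicting r5 (step 6 of Figure 5.4 below).
stack = [(None, "<s> . IP </s>"),          # virtual root
         ("r1", "NP_1 . VP_2"),
         ("r3", ". VP_2.2 PP_2.1")]
print(markov_context(stack, order=3))      # -> ('r1', 'r3')
```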
and 13 ′ [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ] ... and 14 ′ [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ] [Sharon] ... and P(r 7 jr 3 ;r 4 ) 15 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ][Sharon ] ... Sharon 16 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ][P 2:1:1 NP 2:1:2 ] ... Sharon 17 [<s> IP ϵ </s>][NP 1 VP 2 ][VP 2:2 PP 2:1 ] ... Sharon 18 [<s> IP ϵ </s>][NP 1 VP 2 ] ... Sharon 19 [<s>IP ϵ </s>] ... Sharon 20 [<s>IP ϵ </s> ] ... </s> Figure 5.4: Simulation of incremental decoding with rule Markov model. The solid arrows indi- cateonepathandthedashedarrowsindicateanalternatepath. 75 VP 2 VP 2:2 PP 2:1 P 2:1:1 yˇ u NP 2:1:2 VP 2 VP 2:2 PP 2:1 P 2:1:1 yˇ u NP 2:1:2 Figure5.5:Verticalcontext In step 3, we perform a scan operation, in which we append the English word just after the dot to the current hypothesis and move the dot after the word. Since the dot is at the end of the topruleinthestack,weperformacompleteoperationinstep4wherewepopthefinishedruleat thetopofthestack.Inthescanandcompletesteps,wedon’tneedtocomputeruleprobabilities. An interesting branch occurs after step 10 with two competing lexical rules, r 6 and r ′ 6 . The Chinesewordyˇ ucanbetranslatedaseitheraprepositionwith(leadingtostep11)oraconjunction and (leadingtostep11 ′ ).Thewordn-grammodeldoesnothaveenoughinformationtomakethe correctchoice,with.Asaresult,goodtranslationsmightbeprunedbecauseofthebeam.However, ourruleMarkovmodelhasthecorrectpreferencebecauseoftheconditioningancestralsequence (r 3 ;r 4 ), shown in Figure 5.5. Since VP 2:2 has a preference for yˇ u translating to with, our corpus statisticswillgiveahigherprobabilitytoP(r 6 jr 3 ;r 4 )thanP(r ′ 6 jr 3 ;r 4 ).Thishelpsthedecoderto scorethecorrecttranslationhigher. Complexity analysis With the incremental decoding algorithm, adding rule Markov models does not change the time complexity, which is O(lcjVj n1 ), where l is the sentence length, c is the maximum number of incoming hyperedges for each node in the translation forest, V is the target-language vocabulary, and n is the order of the n-gram language model [35]. However, if oneweretouseruleMarkovmodelswithaconventionalCKY-stylebottom-updecoder[45],the 76 complexity would increase to O(lC m1 jVj 4(n1) ), where C is the maximum number of outgoing hyperedgesforeachnodeinthetranslationforest,andmistheorderoftheruleMarkovmodel. 5.4 ExperimentsandResults 5.4.1 Setup The training corpus consists of 1.5M sentence pairs with 38M/32M words of Chinese/English, respectively. Our development set is the newswire portion of the 2006 NIST MT Evaluation test set(616sentences),andourtestsetisthenewswireportionofthe2008NISTMTEvaluationtest set(691sentences). We word-aligned the training data using GIZA++ followed by link deletion [23], and then parsed the Chinese sentences using the Berkeley parser [65]. To extract tree-to-string transla- tion rules, we applied the algorithm of Galley et al. [27]. We trained our rule Markov model on derivations of minimal rules as described above. Our trigram word language model was trained on the target side of the training corpus using the SRILM toolkit [81] with modified Kneser-Ney smoothing.ThebasefeaturesetforallsystemsissimilartothesetusedinMietal.[50].Thefea- turesarecombinedintoastandardlog-linearmodel,whichwetrainedusingminimumerror-rate training[63]tomaximizetheBleuscoreonthedevelopmentset. 
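The log-linear combination itself is simple: a derivation's score is the dot product of its feature values (translation model scores, word language model, the rule Markov model log-probability, penalties, and so on) with the tuned weights, and the decoder searches for the highest-scoring derivation. The sketch below uses illustrative feature names and weights, not the actual tuned values.

```python
def loglinear_score(features, weights):
    """Standard log-linear model: score = dot product of feature values and
    tuned weights. Feature names below are illustrative."""
    return sum(weights[name] * value for name, value in features.items())

features = {
    "log_p_rule_markov": -12.4,   # log probability under the rule Markov model
    "log_p_lm": -35.2,            # word n-gram language model
    "word_penalty": -8.0,
    "glue_rule_count": -1.0,
}
weights = {"log_p_rule_markov": 0.9, "log_p_lm": 1.0,
           "word_penalty": -0.3, "glue_rule_count": -0.5}
print(loglinear_score(features, weights))
```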
At decoding time, we again parse the input sentences using the Berkeley parser, and convert them into translation forests using rule pattern-matching [50]. We evaluate translation quality usingcase-insensitiveIBMBleu-4. 77 grammar ruleMarkov max parameters(10 6 ) Bleu time model ruleheight full dev+test test (sec/sent) minimal None 3 4.9 0.3 24.2 1.2 RM-Bbigram 3 4.9+4.7 0.3+0.5 25.7 1.8 RM-Atrigram 3 4.9+7.6 0.3+0.6 26.5 2.0 verticalcomposed None 7 176.8 1.3 26.5 2.9 composed None 3 17.5 1.6 26.4 2.2 None 7 448.7 3.3 27.5 6.8 RM-Atrigram 7 448.7+7.6 3.3+1.0 28.0 9.2 Table 5.1: Main results. Our trigram rule Markov model strongly outperforms minimal rules, and performs at the same level as composed and vertically composed rules, but is smaller and faster. The number of parameters is shown for both the full model and the model filtered for the concatenationofthedevelopmentandtestsets(dev+test). 5.4.2 Results Table 5.1 presents the main results of our work. We used grammars of minimal rules and com- posed rules of maximum height 3 as our baselines. For decoding, we used a beam size of 50. Using the best bigram rule Markov models and the minimal rule grammar gives us an improve- mentof1:5Bleuovertheminimalrulebaseline.UsingthebesttrigramruleMarkovmodelbrings our gain up to 2:3 Bleu. These gains are statistically significant with p < 0:01, using bootstrap resampling with 1000 samples [39]. We find that by just using bigram context, we are able to get at least 1 Bleu point higher than the minimal rule grammar. It is interesting to see that using justbigramruleinteractionscangiveusareasonableboost.Wegetourhighestgainsfromusing trigramcontextwhereourbestperformingruleMarkovmodelgivesus2:3Bleupointsovermin- imalrules.Thissuggeststhatusinglongercontextshelpsthedecodertofindbettertranslations. We also compared rule Markov models against composed rules. Since our models are cur- rently limited to conditioning on vertical context, the closest comparison is against vertically composedrules.Wefindthatourapproachperformsequallywellusingmuchlesstimeandspace. 78 ruleMarkov D 2 Bleu time model dev (sec/sent) RM-A 0.871 29.2 1.8 RM-B 0.4 29.9 1.8 RM-C 0.871 29.8 1.8 RM-Full 0.4 29.7 1.9 Table5.2:Forrulebigrams,RM-Bwith D 2 =0:4givesthebestresultsonthedevelopmentset. ruleMarkov D 2 D 3 Bleu time model dev (sec/sent) RM-A 0.5 0.5 30.3 2.0 RM-B 0.5 0.5 29.9 2.0 RM-C 0.5 0.5 30.1 2.0 RM-Full 0.4 0.5 30.1 2.2 Table 5.3: For rule bigrams, RM-A with D 2 ;D 3 =0:5 gives the best results on the development set. Comparing against full composed rules, we find that our system matches the score of the baseline composed rule grammar of maximum height 3, while using many fewer parameters. (It shouldbenotedthataparameterintheruleMarkovmodelisjustafloating-pointnumber,whereas a parameter in the composed-rule system is an entire rule; therefore the dierence in memory usage would be even greater.) Decoding with our model is 0:2 seconds faster per sentence than withcomposedrules. These experiments clearly show that rule Markov models with minimal rules increase trans- lationqualitysignificantlyandwithlowermemoryrequirementsthancomposedrules.Onemight wonderifthebestperformancecanbeobtainedbycombiningcomposedruleswitharuleMarkov model. 
This is straightforward to implement: the rule Markov model is still defined over deriva- tions of minimal rules, but in the decoder’s prediction step, the rule Markov model’s value on a composed rule is calculated by decomposing it into minimal rules and computing the product of their probabilities. We find that using our best trigram rule Markov model with composed 79 D 3 D 2 0.4 0.5 0.871 0.4 30.0 30.0 0.5 29.3 30.3 0.902 30.0 Table5.4:RM-Aisrobusttodierentsettingsof D n onthedevelopmentset. parameters(10 6 ) Bleu time dev+test dev test (sec/sent) 1.2 30.2 26.1 2.8 1.3 30.1 26.5 2.9 1.3 30.1 26.2 3.2 Table5.5:Comparisonofverticallycomposedrulesusingvarioussettings(maximumruleheight 7). rulesgivesusa0:5Bleugainontopofthecomposedrulegrammar,statisticallysignificantwith p<0:05,achievingourhighestscoreof28:0.Forthisexperiment,abeamsizeof100wasused. 5.4.3 Analysis Tables 5.2 and 5.3 show how the various types of rule Markov models compare, for bigrams and trigrams, respectively. It is interesting that the full bigram and trigram rule Markov models do not give our highest Bleu scores; pruning the models not only saves space but improves their performance.Wethinkthatthisisprobablyduetooverfitting. parameters(10 6 ) Bleudev/test time(sec/sent) dev/test withoutRMM withRMM without/withRMM 2.6 31.0/27.0 31.1/27.4 4.5/7.0 2.9 31.5/27.7 31.4/27.3 5.6/8.1 3.3 31.4/27.5 31.4/28.0 6.8/9.2 Table 5.6: Adding rule Markov models to composed-rule grammars improves their translation performance. 80 Table5.4showsthattheRM-Atrigrammodeldoesfairlywellunderallthesettingsof D n we tried. Table 5.5 shows the performance of vertically composed rules at various settings. Here we havechosenthesettingthatgivesthebestperformanceonthetestsetforinclusioninTable5.1. Table 5.6 shows the performance of fully composed rules and fully composed rules with a rule Markov Model at various settings. For these experiments, a beam size of 100 was used. In the second line (2.9 million rules), the drop in Bleu score resulting from adding the rule Markov modelisnotstatisticallysignificant. 5.5 RelatedWork BesidestheworkofQuirketal.[66]discussedinSection5.1,therearetwootherpreviouseorts both using a rule bigram model in machine translation, that is, the probability of the current rule only depends on the immediate previous rule in the vertical context, whereas our rule Markov model can condition on longer and sparser derivation histories. Among them, Ding et al. [19] also use a dependency treelet model similar to Quirk et al. [66], and Liu et al. [44] use a tree-to- stringmodelmorelikeours.Neithercomparedtothescenariowithcomposedrules. Outside of machine translation, the idea of weakening independence assumptions by mod- eling the derivation history is also found in parsing (Johnson [37]) where rule probabilities are conditioned on parent and grand-parent nonterminals. However, besides the dierence between parsingandtranslation,therearestilltwomajordierences.First,ourworkconditionsruleprob- abilities on parent and grandparent rules, not just nonterminals. Second, we compare against a 81 composed-rulesystem,whichisanalogoustotheDataOrientedParsing(DOP)approachinpars- ing [9]. To our knowledge, there has been no direct comparison between a history-based PCFG approachandDOPapproachintheparsingliterature. 5.6 Conclusion Usingasmallminimal translationgrammarwithruleMarkovmodels,wepresentedanapproach for achieving the same translation quality as a much larger composed rule grammar resulting in faster decoding times. 
In the next chapter, we will present our approach for learning neural probabilistic language models, which are much smaller than standard n-gram language models, and show improvements in SMT.

Chapter 6

Neural Language Models

In this chapter, we explore the application of neural language models to machine translation. We present a new model that combines the neural probabilistic language model of Bengio et al. [6], rectified linear units, and noise-contrastive estimation, and incorporate it into a machine translation system both by reranking k-best lists and by direct integration into the decoder. (This is joint work with Yinggong Zhao, Victoria Fossum, and David Chiang.)

6.1 Introduction

Machine translation (MT) systems rely upon language models (LMs) during decoding to ensure fluent output in the target language. Depending on the vocabulary size and length of context, standard n-gram language models can contain up to a trillion parameters. Typically, these models operate over discrete representations of words and are therefore susceptible to data sparsity: the probability of an n-gram observed only a few times is difficult to estimate reliably, because the models use no information about similarities between words.

To address this issue, Bengio et al. [6] propose distributed word representations, in which each word is represented as a real-valued vector in a high-dimensional feature space. They introduce a feed-forward neural probabilistic LM (NPLM) that operates over these distributed representations. During training, the NPLM learns both a distributed representation for each word in the vocabulary and an n-gram probability distribution over words in terms of these distributed representations.

Although neural LMs have begun to rival or even surpass traditional n-gram LMs [52,53], they have not yet been widely adopted in large-vocabulary applications such as MT, because standard maximum likelihood estimation requires repeated summations over all words in the vocabulary. A variety of strategies have been proposed to combat this issue, many of which require severe restrictions on the size of the network or the size of the data.

In this work, we extend the NPLM of Bengio et al. [6] in two ways. First, we use rectified linear units [59], whose activations are cheaper to compute than sigmoid or tanh units. There is also evidence that deep neural networks with rectified linear units can be trained successfully without pre-training [85]. Second, we train using noise-contrastive estimation (NCE) [32,54], which does not require repeated summations over the whole vocabulary. This enables us to build NPLMs efficiently on a larger scale than would otherwise be possible.

We then apply this LM to MT in two ways. First, we use it to rerank the k-best output of a hierarchical phrase-based decoder [16]. Second, we integrate it directly into the decoder, allowing the neural LM to more strongly influence the model. We achieve gains of up to 0.6 Bleu translating French, German, and Spanish to English, and up to 1.1 Bleu on Chinese-English translation.

Figure 6.1: Neural probabilistic language model [6]. (The network maps the input words u_1, ..., u_{n-1} through the shared input embeddings D and context matrices C_1, ..., C_{n-1} to hidden layers h_1 and h_2, connected by M, and then through the output embeddings D' to the output distribution P(w|u).)

6.2 Neural Language Models

Let V be the vocabulary, and n the order of the language model; let u range over contexts, i.e., strings of length (n-1), and w range over words. For simplicity, we assume that the training data is a single very long string, w_1 ... w_N, where w_N is a special stop symbol, </s>. We write u_i for w_{i-n+1} ... w_{i-1}, where, for i ≤ 0, w_i is a special start symbol, <s>.

6.2.1 Model

We use a feedforward neural network as shown in Figure 6.1, following [6].
The input to the network is a sequence of one-hot representations of the words in context u, which we write u x (1xn1).TheoutputistheprobabilityP(wju)foreachwordw,whichthenetworkcomputes asfollows. The hidden layers consist of rectified linear units [59], which use the activation function ϕ(y)=max(0;y)(Figure6.2). 85 . Figure6.2:Activationfunctionforarectifiedlinearunit. Theoutputofthefirsthiddenlayerh 1 ,F h 1 is F h 1 =ϕ 0 B B B B B B @ n1 ∑ x=1 C x Du x 1 C C C C C C A (6.1) where D is a matrix of input word embeddings which is shared across all positions, the C x are the context matrices for each word inu, and ϕ is applied elementwise. The output of the second layerh 2 ,F h 2 is F h 2 =ϕ(Mh 1 ); where M is the matrix of connection weights between h 1 and h 2 . Finally, the output layer is a softmaxlayer, P(wju)/exp ( wD ′ h 2 +w T b ) (6.2) where D ′ is the output word embedding matrix, b is a vector of biases for every word in the vocabulary,andwistheonehotrepresentationoftheoutputwordw. 6.2.2 Training ThetypicalwaytotrainneuralLMsistomaximizethelikelihoodofthetrainingdatabygradient ascent. But the softmax layer requires, at each iteration, a summation over all the units in the 86 output layer, that is, all words in the whole vocabulary. If the vocabulary is large, this can be prohibitivelyexpensive. Noise-contrastive estimation or NCE [32] is an alternative estimation principle that allows onetoavoidtheserepeatedsummations.Ithasbeenappliedpreviouslytolog-bilinearLMs[54], andweapplyitheretotheNPLMdescribedabove. WecanwritetheprobabilityofawordwgivenacontextuundertheNPLMas P(wju)= 1 Z(u) p(wju) p(wju)=exp ( wD ′ h 2 +w T b ) Z(u)= ∑ w ′ p(w ′ ju) (6.3) where p(wju)istheunnormalized outputoftheunitcorrespondingtow,andZ(u)isthenormal- izationfactor.Let standfortheparametersofthemodel. One possibility would be to treat Z(u), instead of being defined by (6.3), as an additional set of model parameters which are learned along with . But it is easy to see that we can make the likelihoodarbitrarilylargebymakingtheZ(u)arbitrarilysmall. NCE converts the problem of learning the probability distribution P(wju) to binary classi- fication. For each example u i w i , we add k noise samples ¯ w i1 ;:::; ¯ w ik into the data from a noise distribution q(w). We extend the model to account for noise samples by introducing a random variableC whichis1fortrainingexamplesand0fornoisesamples: 87 P(C=1;wju)= 1 1+k p(wju) Z(u) P(C=0;wju)= k 1+k q(w): Underthisjointmodel,theconditionalprobability P(C=1juw)= P(C=1;wju) P(C=1;wju)+P(C=0;wju) = 1 1+k p(wju) Z(u) 1 1+k p(wju) Z(u) + k 1+k q(w) = p(wju) Z(u) p(wju) Z(u) +kq(w) ; and, P(C=0juw)= kq(w) p(wju) Z(u) +kq(w) : Usingstochasticgradientascent,wethentrainthemodeltoclassifyexamplesastrainingdata ornoise,thatis,tomaximizetheconditionallikelihood, L= N ∑ i=1 ( logP(C=1ju i w i )+ k ∑ j=1 logP(C=0ju i ¯ w ij ) ) (6.4) For stochastic gradient ascent, we need the gradients of the objective (equation 6.4) with respect to and Z(u). In this section, we will derive the general form of the derivative, that is, thederivativewith respecttoa generalparameter .Thisexpositionisnovelas wehavenot seen 88 the detailed derivation of the general form in previous work. 
For stochastic gradient ascent, we need the gradients of the objective (equation 6.4) with respect to θ and Z(u). In this section, we will derive the general form of the derivative, that is, the derivative with respect to a general parameter θ. This exposition is novel, as we have not seen the detailed derivation of the general form in previous work. In Section 6.2.3, we will show the analytical form of the gradients for different parameters of our neural network. We now derive

\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \left( \frac{\partial}{\partial\theta} \log P(C=1|u_i,w_i) + \sum_{j=1}^{k} \frac{\partial}{\partial\theta} \log P(C=0|u_i,\bar{w}_{ij}) \right).

Dropping the indices for simplicity and taking the gradient of the training-data and noise terms separately,

\frac{\partial}{\partial\theta} \log P(C=1|u,w) = \frac{1}{P(C=1|u,w)} \frac{\partial}{\partial\theta} \frac{\frac{p(w|u)}{Z(u)}}{\frac{p(w|u)}{Z(u)} + kq(w)}.

Using the quotient rule of derivatives,

\frac{\partial}{\partial\theta} \log P(C=1|u,w) = \frac{1}{P(C=1|u,w)} \left( \frac{1}{\frac{p(w|u)}{Z(u)} + kq(w)} \frac{1}{Z(u)} \frac{\partial p(w|u)}{\partial\theta} - \frac{\frac{p(w|u)}{Z(u)}}{\left(\frac{p(w|u)}{Z(u)} + kq(w)\right)^2} \frac{1}{Z(u)} \frac{\partial p(w|u)}{\partial\theta} \right).

Taking common terms out of the parentheses,

= \frac{1}{P(C=1|u,w)} \left( 1 - \frac{\frac{p(w|u)}{Z(u)}}{\frac{p(w|u)}{Z(u)} + kq(w)} \right) \frac{1}{\frac{p(w|u)}{Z(u)} + kq(w)} \frac{1}{Z(u)} \frac{\partial p(w|u)}{\partial\theta}
= \frac{1}{P(C=1|u,w)} \bigl( 1 - P(C=1|u,w) \bigr) \frac{1}{\frac{p(w|u)}{Z(u)} + kq(w)} \frac{1}{Z(u)} \frac{\partial p(w|u)}{\partial\theta}
= \frac{1}{P(C=1|u,w)} P(C=0|u,w) \frac{1}{\frac{p(w|u)}{Z(u)} + kq(w)} \frac{1}{Z(u)} \frac{\partial p(w|u)}{\partial\theta}.

Using \frac{\partial p(w|u)}{\partial\theta} = p(w|u) \frac{\partial}{\partial\theta} \log p(w|u),

\frac{\partial}{\partial\theta} \log P(C=1|u,w) = \frac{1}{P(C=1|u,w)} P(C=0|u,w) \frac{\frac{p(w|u)}{Z(u)}}{\frac{p(w|u)}{Z(u)} + kq(w)} \frac{\partial}{\partial\theta} \log p(w|u)
= \frac{1}{P(C=1|u,w)} P(C=0|u,w) P(C=1|u,w) \frac{\partial}{\partial\theta} \log p(w|u)
= P(C=0|u,w) \frac{\partial}{\partial\theta} \log p(w|u).

For Z(u), we can show that

\frac{\partial}{\partial \log Z(u)} \log P(C=1|u,w) = -P(C=0|u,w).

Similarly, we can show that for the noise terms in equation 6.4,

\frac{\partial}{\partial\theta} \log P(C=0|u,\bar{w}) = -P(C=1|u,\bar{w}) \frac{\partial}{\partial\theta} \log p(\bar{w}|u),

and

\frac{\partial}{\partial \log Z(u)} \log P(C=0|u,\bar{w}) = P(C=1|u,\bar{w}).

Combining the gradients, for θ, we get

\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \left( P(C=0|u_i,w_i) \frac{\partial}{\partial\theta} \log p(w_i|u_i) - \sum_{j=1}^{k} P(C=1|u_i,\bar{w}_{ij}) \frac{\partial}{\partial\theta} \log p(\bar{w}_{ij}|u_i) \right),    (6.5)

and for log Z(u), the combined gradient is

\frac{\partial L}{\partial \log Z(u)} = \sum_{i=1}^{N} \left( -P(C=0|u_i,w_i) + \sum_{j=1}^{k} P(C=1|u_i,\bar{w}_{ij}) \right).

These gradients are computed by backpropagation, which is described in the next section. Unlike for maximum likelihood estimation, the Z(u) will converge to a value that normalizes the model, satisfying (6.3), and, under appropriate conditions, the parameters will converge to a value that maximizes the likelihood of the data.
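The general form in (6.5) can be checked numerically. The following sketch (a toy illustration of our own, using a trivially parameterized model p(w|u) = exp(θ_w) with Z(u) fixed to 1, not the NPLM itself) compares the analytic gradient from (6.5) against a finite-difference estimate of the NCE objective:

    import math

    def check_nce_gradient():
        """Finite-difference check of equation (6.5) on a toy model."""
        theta = {"a": 0.2, "b": -0.1, "c": 0.4}   # toy parameters
        q = {"a": 0.5, "b": 0.3, "c": 0.2}        # noise distribution
        k, obs, noise = 2, "a", ["b", "c"]

        def p(w):                 # unnormalized score, Z(u) fixed to 1
            return math.exp(theta[w])

        def p_c1(w):              # P(C=1 | u, w)
            return p(w) / (p(w) + k * q[w])

        def loss():               # one-example NCE objective
            return math.log(p_c1(obs)) + sum(math.log(1 - p_c1(w)) for w in noise)

        for v in theta:
            # Analytic gradient from (6.5); dlog p(w|u)/dtheta[v] is 1 iff v == w
            analytic = (1 - p_c1(obs)) * (v == obs) \
                       - sum(p_c1(w) * (v == w) for w in noise)
            eps, old = 1e-6, theta[v]
            theta[v] = old + eps; up = loss()
            theta[v] = old - eps; down = loss()
            theta[v] = old
            numeric = (up - down) / (2 * eps)
            print(v, round(analytic, 6), round(numeric, 6))

    check_nce_gradient()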
6.2.3 Gradients of Neural Network Parameters

Following the general form of the gradient derived in Section 6.2.2, we will present the derivation of the gradients for each of the neural network parameters. For ease of understanding and completeness, we will repeat the material from Section 6.2.1, although in more detail. Hopefully, the material will allow the reader to implement his or her own neural network language model trained with noise-contrastive estimation.

Computing the gradients involves two stages, forward propagation (Section 6.2.4) and backward propagation (Section 6.2.5). In forward propagation, we compute the node activations in each of the layers, from the input to the output, as described in Section 6.2.1. Then, during backward propagation, we compute gradients from the output layer to the input, that is, we compute

\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \left( P(C=0|u_i,w_i) \frac{\partial}{\partial\theta} \log p(w_i|u_i) - \sum_{j=1}^{k} P(C=1|u_i,\bar{w}_{ij}) \frac{\partial}{\partial\theta} \log p(\bar{w}_{ij}|u_i) \right)    (6.6)

for θ = D, C_1, C_2, M, D', where we also use the activations computed during forward propagation.

For simplicity, we shall assume that we have only one training input to the neural network, u w, and k noise samples \bar{w}_1, ..., \bar{w}_j, ..., \bar{w}_k. The words in the context u are w_1, ..., w_x, ..., w_{n-1}. Let us define indexes v', q, r, and s for the different layers of the neural network as depicted in Figure 6.3, with the letter in parentheses indicating the size of the particular layer. For example, the size of the input embedding layer is S and the size of hidden layer h_1 is R. For each hidden layer, in addition to F_{h_1} and F_{h_2}, which are the vectors of hidden layer activations computed during forward propagation, we define B_{h_1} and B_{h_2}, which are the vectors of partial derivatives of the objective function with respect to the hidden layer activations h_1 and h_2. These are computed during backward propagation.

[Figure 6.3: NPLM indexes. The network of Figure 6.1 annotated with index letters: s (size S) for the input embedding layer, r (size R) for hidden layer h_1 with activations F_{h_1} and backward derivatives B_{h_1}, q (size Q) for hidden layer h_2 with F_{h_2} and B_{h_2}, and v' for the output layer P(w|u).]

6.2.4 Forward Propagation

Forward propagation, also called fprop, entails computing node activations as follows. For the first hidden layer, h_1, the activation at a node r is

F_{h_1,r} = \varphi\left( \sum_{x=1}^{n-1} \sum_{s=1}^{S} D_{w_x s} C_{x,rs} \right),

where D_{w_x s} is the value at the sth dimension of the embedding of the input word w_x, which is essentially a lookup. Similarly, the activation at node q of h_2 is

F_{h_2,q} = \varphi\left( \sum_{r=1}^{R} F_{h_1,r} M_{qr} \right).

The unnormalized probability for an output word w is computed as

p(w|u) = \exp\left( \sum_{q=1}^{Q} F_{h_2,q} D'_{wq} + b_w \right).    (6.7)
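The following numpy sketch carries out forward propagation for a single context. It is an illustration of the equations above with our own variable names and shapes, not the released C++ implementation:

    import numpy as np

    def relu(y):
        return np.maximum(0.0, y)

    def fprop(context_ids, D, C, M, D_out, b):
        """Forward propagation for one (n-1)-word context.

        context_ids: list of n-1 word ids
        D:     |V| x S input embedding matrix
        C:     list of n-1 context matrices, each R x S
        M:     Q x R weight matrix between h_1 and h_2
        D_out: |V| x Q output embedding matrix (D' in the text)
        b:     length-|V| output bias vector
        Returns the hidden activations and the unnormalized scores p(.|u).
        """
        # Equation (6.1): sum of position-specific transforms of the embeddings
        h1_in = sum(C[x] @ D[wid] for x, wid in enumerate(context_ids))
        F_h1 = relu(h1_in)
        F_h2 = relu(M @ F_h1)
        scores = np.exp(D_out @ F_h2 + b)   # unnormalized p(w|u), equation (6.7)
        return F_h1, F_h2, scores

During NCE training, only the rows of D_out and b corresponding to the observed word and its noise samples need to be used, which is what makes the output layer cheap to compute.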
6.2.5 Backward Propagation

Backward propagation, or backprop, entails computing parameter gradients in reverse, that is, from the output layer to the input layer, using the activations computed during fprop. Substituting the expression for the unnormalized probability (Equation 6.7) into Equation 6.6, we get

\frac{\partial L}{\partial\theta} = P(C=0|u,w) \frac{\partial}{\partial\theta} \log \exp\left( \sum_{q=1}^{Q} F_{h_2,q} D'_{wq} + b_w \right) - \sum_{j=1}^{k} P(C=1|u,\bar{w}_j) \frac{\partial}{\partial\theta} \log \exp\left( \sum_{q=1}^{Q} F_{h_2,q} D'_{\bar{w}_j q} + b_{\bar{w}_j} \right)
= P(C=0|u,w) \frac{\partial}{\partial\theta} \left( \sum_{q=1}^{Q} F_{h_2,q} D'_{wq} + b_w \right) - \sum_{j=1}^{k} P(C=1|u,\bar{w}_j) \frac{\partial}{\partial\theta} \left( \sum_{q=1}^{Q} F_{h_2,q} D'_{\bar{w}_j q} + b_{\bar{w}_j} \right).

The gradient with respect to the bias of an output word w' is

\frac{\partial L}{\partial b_{w'}} = P(C=0|u,w) \frac{\partial}{\partial b_{w'}} \left( \sum_{q=1}^{Q} F_{h_2,q} D'_{wq} + b_w \right) - \sum_{j=1}^{k} P(C=1|u,\bar{w}_j) \frac{\partial}{\partial b_{w'}} \left( \sum_{q=1}^{Q} F_{h_2,q} D'_{\bar{w}_j q} + b_{\bar{w}_j} \right)
= \delta_{w'w} P(C=0|u,w) - \sum_{j=1}^{k} \delta_{w'\bar{w}_j} P(C=1|u,\bar{w}_j),

where \delta_{ab} is the Kronecker delta function that returns 1 if a equals b and 0 otherwise. Similarly, the gradient with respect to a particular output embedding dimension D'_{w'q'} is

\frac{\partial L}{\partial D'_{w'q'}} = \delta_{w'w} P(C=0|u,w) F_{h_2,q'} - \sum_{j=1}^{k} \delta_{w'\bar{w}_j} P(C=1|u,\bar{w}_j) F_{h_2,q'}.

We can now compute the partial derivative of L with respect to the activation of the qth hidden node in the second layer as

B_{h_2,q} = \frac{\partial L}{\partial F_{h_2,q}} = P(C=0|u,w) D'_{wq} - \sum_{j=1}^{k} P(C=1|u,\bar{w}_j) D'_{\bar{w}_j q}.

The reader may have noticed that the outputs of a layer are functions of the outputs of the previous layer, and so on. The gradients for the remaining parameters will use the chain rule, which is used to compute the derivative of composed functions. We state it now: if we have y = f(u) and u = g(x), the partial derivative of y with respect to x is

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{\partial u}{\partial x}.

We will also need the gradient of the rectifier activation function in subsequent derivations, which is

\frac{\partial \varphi(y)}{\partial y} = \frac{\partial}{\partial y} \max(0, y) = \max\bigl(0, \mathrm{sgn}(y)\bigr),

where sgn is the signum function that returns the sign of its argument. Using B_{h_2,q}, the chain rule of derivatives, and the gradient of the rectifier function, we compute the gradient of the parameter M_{q'r'} as

\frac{\partial L}{\partial M_{q'r'}} = \frac{\partial L}{\partial F_{h_2,q'}} \frac{\partial F_{h_2,q'}}{\partial M_{q'r'}} = B_{h_2,q'} \frac{\partial}{\partial M_{q'r'}} \varphi\left( \sum_{r=1}^{R} F_{h_1,r} M_{q'r} \right).

Applying the chain rule of derivatives again,

= B_{h_2,q'} \max\left( 0, \mathrm{sgn}\left( \sum_{r=1}^{R} F_{h_1,r} M_{q'r} \right) \right) \frac{\partial}{\partial M_{q'r'}} \left( \sum_{r=1}^{R} F_{h_1,r} M_{q'r} \right)
= B_{h_2,q'} \max\left( 0, \mathrm{sgn}\left( \sum_{r=1}^{R} F_{h_1,r} M_{q'r} \right) \right) F_{h_1,r'}.

The partial derivative of L with respect to the activation of the rth hidden node in the first layer is

B_{h_1,r'} = \frac{\partial L}{\partial F_{h_1,r'}} = \sum_{q=1}^{Q} \frac{\partial L}{\partial F_{h_2,q}} \frac{\partial F_{h_2,q}}{\partial F_{h_1,r'}}
= \sum_{q=1}^{Q} B_{h_2,q} \frac{\partial}{\partial F_{h_1,r'}} \varphi\left( \sum_{r=1}^{R} F_{h_1,r} M_{qr} \right)
= \sum_{q=1}^{Q} B_{h_2,q} \max\left( 0, \mathrm{sgn}\left( \sum_{r=1}^{R} F_{h_1,r} M_{qr} \right) \right) M_{qr'}.

The gradient of C_{x,r's'} has a very similar derivation to the gradient of M_{q'r'}:

\frac{\partial L}{\partial C_{x,r's'}} = \frac{\partial L}{\partial F_{h_1,r'}} \frac{\partial F_{h_1,r'}}{\partial C_{x,r's'}} = B_{h_1,r'} \max\left( 0, \mathrm{sgn}\left( \sum_{s=1}^{S} D_{w_x s} C_{x,r's} \right) \right) D_{w_x s'}.

Finally, the gradient of a parameter D_{w's'} in the input word embedding matrix D is

\frac{\partial L}{\partial D_{w's'}} = \sum_{x=1}^{n-1} \delta_{w'w_x} \frac{\partial L}{\partial D_{w_x s'}}
= \sum_{x=1}^{n-1} \delta_{w'w_x} \left( \sum_{r=1}^{R} \frac{\partial L}{\partial F_{h_1,r}} \frac{\partial F_{h_1,r}}{\partial D_{w_x s'}} \right)
= \sum_{x=1}^{n-1} \delta_{w'w_x} \left( \sum_{r=1}^{R} B_{h_1,r} \max\left( 0, \mathrm{sgn}\left( \sum_{s=1}^{S} D_{w_x s} C_{x,rs} \right) \right) C_{x,rs'} \right).

In our derivations, we assumed only one training example. In our implementation, we train on minibatches of examples at a time. Additionally, we do not take gradients of each parameter of M and C_x one at a time, but use matrix-matrix products to compute the matrix of gradients. We leave the conversion of the above computations into matrix operations to the reader.
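One possible way to carry out this conversion, for the single-example case, is sketched below. This is our own illustration in dense numpy (the released implementation uses sparse representations, as described in the next section); the rectifier derivative is taken at the full pre-activation of each hidden node, and a minibatch version would replace the outer products with matrix-matrix products.

    import numpy as np

    def relu_grad(pre_activation):
        # derivative of max(0, y): 1 where y > 0, else 0
        return (pre_activation > 0).astype(float)

    def bprop_example(context_ids, words, weights, D, C, M, D_out):
        """Single-example backward pass for the gradients derived above.

        words:   [observed word id] + noise word ids
        weights: [P(C=0|u,w), -P(C=1|u,wbar_1), ..., -P(C=1|u,wbar_k)],
                 i.e. the signed coefficients from equation (6.5)
        Returns a dictionary of gradients for b, D', M, C_x and D.
        """
        # Pre-activations (in practice cached from forward propagation)
        h1_in = sum(C[x] @ D[wid] for x, wid in enumerate(context_ids))
        F_h1 = np.maximum(0.0, h1_in)
        h2_in = M @ F_h1
        F_h2 = np.maximum(0.0, h2_in)

        grads = {"b": np.zeros(D_out.shape[0]), "D_out": np.zeros_like(D_out),
                 "M": np.zeros_like(M), "C": [np.zeros_like(c) for c in C],
                 "D": np.zeros_like(D)}

        B_h2 = np.zeros_like(F_h2)
        for wid, coeff in zip(words, weights):
            grads["b"][wid] += coeff                 # bias gradient
            grads["D_out"][wid] += coeff * F_h2      # output embedding gradient
            B_h2 += coeff * D_out[wid]               # dL/dF_h2

        delta2 = B_h2 * relu_grad(h2_in)
        grads["M"] = np.outer(delta2, F_h1)          # gradient of M
        B_h1 = M.T @ delta2                          # dL/dF_h1
        delta1 = B_h1 * relu_grad(h1_in)
        for x, wid in enumerate(context_ids):
            grads["C"][x] = np.outer(delta1, D[wid]) # gradient of C_x
            grads["D"][wid] += C[x].T @ delta1       # input embedding gradient
        return grads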
6.3 Implementation

Both training and scoring of neural LMs are computationally expensive at the scale needed for machine translation. In this section, we describe some of the techniques used to make them practical for translation.

6.3.1 Training

During training, we compute gradients on an entire minibatch at a time, allowing the use of matrix-matrix multiplications instead of matrix-vector multiplications [5]. We represent the inputs as a sparse matrix, allowing the computation of the input layer (6.1) to use sparse matrix-matrix multiplications. The output activations (6.2) are computed only for the word types that occur as the positive example or one of the noise samples, yielding a sparse matrix of outputs. Similarly, during backpropagation, sparse matrix multiplications are used at both the output and input layer.

In most of these operations, the examples in a minibatch can be processed in parallel. However, in the sparse-dense products used when updating the parameters D and D', we found it was best to divide the vocabulary into blocks (16 per thread) and to process the blocks in parallel.

6.3.2 Translation

To incorporate this neural LM into a MT system, we can use the LM to rerank k-best lists, as has been done in previous work. But since the NPLM scores n-grams, it can also be integrated into a phrase-based or hierarchical phrase-based decoder just as a conventional n-gram model can, unlike a RNN.

The most time-consuming step in computing n-gram probabilities is the computation of the normalization constants Z(u). Following Mnih et al. [54], we set all the normalization constants to one during training, so that the model learns to produce approximately normalized probabilities. Then, when applying the LM, we can simply ignore normalization. A similar strategy was taken by Niehues et al. [61]. We find that a single n-gram lookup takes about 40 μs.

The technique, described above, of grouping examples into minibatches works for scoring of k-best lists, but not while decoding. But caching n-gram probabilities helps to reduce the cost of the many lookups required during decoding. Notice that during decoding, we have to repeatedly multiply C_x and D for each of the n-1 context words to compute n-gram probabilities. We can avoid these expensive multiplications by pre-multiplying C_x and D, and achieve lookup times faster than 40 μs.

A final issue when decoding with a neural LM is that, in order to estimate future costs, we need to be able to estimate probabilities of n'-grams for n' < n. In conventional LMs, this information is readily available,¹ but not in NPLMs. Therefore, we defined a special word <null> whose embedding is the weighted average of the (input) embeddings of all the other words in the vocabulary. Then, to estimate the probability of an n'-gram u'w, we used the probability P(w | <null>^{n-n'} u').

¹ However, in Kneser-Ney smoothed LMs, this information is also incorrect [33].
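As an illustration of the pre-multiplication idea (a sketch with our own naming; the actual decoder integration is part of the released toolkit, not this fragment), the per-position products C_x D can be computed once up front, after which the first-layer pre-activation for any context reduces to summing n-1 precomputed columns:

    import numpy as np

    def premultiply(C, D):
        """Precompute E[x] = C_x @ D.T so that C_x @ D[w] == E[x][:, w].

        C: list of n-1 context matrices, each R x S
        D: |V| x S input embedding matrix
        Each E[x] is R x |V|, one column per vocabulary word.
        """
        return [Cx @ D.T for Cx in C]

    def h1_preactivation(context_ids, E):
        """First-layer pre-activation as a sum of precomputed columns."""
        return sum(E[x][:, wid] for x, wid in enumerate(context_ids))

Combined with caching of full n-gram probabilities, this removes the matrix-vector products from the inner loop of decoding.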
6.4 Experiments

We ran experiments on four language pairs, Chinese to English, and French, German, and Spanish to English, using a hierarchical phrase-based MT system [16] and GIZA++ [62] for word alignments.

For all experiments, we used four LMs. The baselines used conventional 5-gram LMs, estimated with modified Kneser-Ney smoothing [14] on the English side of the bitext and the 329M-word Xinhua portion of English Gigaword (LDC2011T07). Against these baselines, we tested systems that included the two conventional LMs as well as two 5-gram NPLMs trained on the same datasets. The Europarl bitext NPLMs had a vocabulary size of 50k, while the other NPLMs had a vocabulary size of 100k. We used 150 dimensions for word embeddings, 750 units in hidden layer h_1, and 150 units in hidden layer h_2. We initialized the network parameters uniformly from (-0.01, 0.01) and the output biases to -log |V|, and optimized them by 10 epochs of stochastic gradient ascent, using minibatches of size 1000 and a learning rate of 1. We drew 100 noise samples per training example from the unigram distribution, using the alias method for efficiency [41].

We trained the discriminative models with MERT [63] and the discriminative rerankers on 1000-best lists with MERT. Except where noted, we ran MERT three times and report the average score. We evaluated using case-insensitive NIST Bleu.

6.4.1 NIST Chinese-English

For the Chinese-English task (Table 6.1), the training data came from the NIST 2012 constrained track, excluding sentences longer than 60 words. Rules without nonterminals were extracted from all training data, while rules with nonterminals were extracted from the FBIS corpus (LDC2003E14). We ran MERT on the development data, which was the NIST 2003 test data, and tested on the NIST 2004–2006 test data.

  setting     dev    2004   2005   2006
  baseline    38.2   38.4   37.7   34.3
  reranking   38.5   38.6   37.8   34.7
  decoding    39.1   39.5   38.8   34.9

Table 6.1: Results for Chinese-English experiments, without neural LM (baseline) and with neural LM for reranking and integrated decoding. Reranking with the neural LM improves translation quality, while integrating it into the decoder improves it even more.

Reranking using the neural LM yielded improvements of 0.2–0.4 Bleu, while integrating the neural LM yielded larger improvements, between 0.6 and 1.1 Bleu.

6.4.2 Europarl

For French, German, and Spanish translation, we used a parallel text of about 50M words from Europarl v7. Rules without nonterminals were extracted from all the data, while rules with nonterminals were extracted from the first 200k words. We ran MERT on the development data, which was the WMT 2005 test data, and tested on the WMT 2006 news commentary test data (nc-test2006).

              Fr-En          De-En          Es-En
  setting     dev    test    dev    test    dev    test
  baseline    33.5   25.5    28.8   21.5    33.5   32.0
  reranking   33.9   26.0    29.1   21.5    34.1   32.2
  decoding    34.1²  26.1²   29.3   21.9    34.2²  32.1²

Table 6.2: Results for Europarl MT experiments, without neural LM (baseline) and with neural LM for reranking and integrated decoding. The neural LM gives improvements across three different language pairs. Superscript 2 indicates a score averaged between two runs; all other scores were averaged over three runs.

The improvements, shown in Table 6.2, were more modest than on Chinese-English. Reranking with the neural LM yielded improvements of up to 0.5 Bleu, and integrating the neural LM into the decoder yielded improvements of up to 0.6 Bleu. In one case (Spanish-English), integrated decoding scored higher than reranking on the development data but lower on the test data, perhaps due to the difference in domain between the two. On the other tasks, integrated decoding outperformed reranking.

6.4.3 Speed Comparison

We measured the speed of training a NPLM by NCE, compared with MLE as implemented by the CSLM toolkit [76]. We used the first 200k lines (5.2M words) of the Xinhua portion of Gigaword and timed one epoch of training, for various values of k and |V|, on a dual hex-core 2.67 GHz Xeon X5650 machine. For these experiments, we used minibatches of 128 examples. The timings are plotted in Figure 6.4. We see that NCE is considerably faster than MLE; moreover, as expected, the MLE training time is roughly linear in |V|, whereas the NCE training time is basically constant.

[Figure 6.4: Training time in seconds for one epoch as a function of vocabulary size (in thousands), for CSLM (MLE) and for NCE with k = 1000, 100, and 10. Noise-contrastive estimation (NCE) is much faster, and much less dependent on vocabulary size, than MLE as implemented by the CSLM toolkit [76].]

6.5 Related Work

The problem of training with large vocabularies in NPLMs has received much attention. One strategy has been to restructure the network to be more hierarchical [53, 58] or to group words into classes [42]. Other strategies include restricting the vocabulary of the NPLM to a shortlist and reverting to a traditional n-gram LM for other words [73], and limiting the number of training examples using resampling [77] or selecting a subset of the training data [78]. Our approach can be efficiently applied to large-scale tasks without limiting either the model or the data.

NPLMs have previously been applied to MT, most notably feed-forward NPLMs [74, 75] and RNN-LMs [51]. However, their use in MT has largely been limited to reranking k-best lists for MT tasks with restricted vocabularies. Niehues et al. [61] integrate a RBM-based language model directly into a decoder, but they only train the RBM LM on a small amount of data. To our knowledge, our approach is the first to integrate a large-vocabulary NPLM directly into a decoder for a large-scale MT task.
6.6 Conclusion

We introduced a new variant of NPLMs that combines the network architecture of Bengio et al. [6], rectified linear units [59], and noise-contrastive estimation [32]. This model is dramatically faster to train than previous neural LMs, and can be trained on a large corpus with a large vocabulary and directly integrated into the decoder of a MT system. Our experiments across four language pairs demonstrated improvements of up to 1.1 Bleu. Code for training and using our NPLMs is available for download.² Our NPLM has also been integrated into Moses,³ the largest open source statistical machine translation toolkit.

² http://nlg.isi.edu/software/nplm
³ http://www.statmt.org/moses/

Chapter 7

Conclusion and Future Work

7.1 Conclusion

Statistical machine translation (SMT) has improved rapidly in the last two decades. Much of this improvement can be attributed to the growth of available training data. This has resulted in some of the largest models being used in SMT systems, presenting challenges of system engineering and overfitting. In this thesis, we contributed to overcoming these challenges by learning and using small models. For word alignment, we presented an unsupervised algorithm (Chapters 3 and 4) for combating overfitting while learning word alignment models. We improved both alignment and translation quality over the dominant word alignment approach that uses the EM algorithm.

Continuing on the thread of small models, in Chapter 5, we presented an approach that achieves the same translation quality and faster decoding times than a much larger composed rule grammar, using a small minimal rule grammar with language models of minimal rules. We presented approaches for both learning minimal rule language models and using them in a top-down decoder for tree-to-string translation.

In Chapter 6, we presented an algorithm for learning fast neural probabilistic language models (NPLMs) that are much smaller than conventional language models and improve translation quality significantly. Our NPLMs use rectified linear units and noise-contrastive estimation for faster training times than standard maximum likelihood estimation. We also released an efficient open source implementation of our NPLM, which has been integrated into Moses, the largest open source SMT toolkit.

We will now present some future research directions for the models and algorithms presented in this thesis.

7.2 Future Research Directions

7.2.1 MAP-EM with the L_0 prior

There are a few unsupervised learning problems in NLP that would benefit from incorporating sparsity into the learning regime, as the EM algorithm gives inadequate results. We will describe the ones that we want to tackle with the approach introduced in this thesis.

7.2.1.1 Decipherment

In cryptography, given a coded message, called ciphertext, c = c_1, c_2, ..., c_n, the goal of decipherment is to output the sequence of original letters, called plaintext, p = p_1, p_2, ..., p_n. There are multiple ways to encode the plaintext based on the type of mapping from plaintext to ciphertext letters. There are two popular kinds of encodings, or ciphers: substitution ciphers and homophonic ciphers.

In 2006, Knight et al. [38] introduced the idea of using the EM algorithm to attack letter substitution decipherment problems where the mapping from plaintext to ciphertext alphabet is bijective. Therefore, both encoding and decoding are deterministic. They modeled the generation of the ciphertext as a noisy channel which takes in the plaintext and produces the ciphertext as output. The goal is to find the most likely plaintext sequence \hat{p} given the ciphertext sequence:

\hat{p} = \arg\max_{p} P(p|c) = \arg\max_{p} P(p) P(c|p).    (7.1)

We get to equation 7.1 by applying Bayes' rule. We can learn the language model, P(p), from a large monolingual corpus. The channel parameters, which correspond to the mapping between plaintext and ciphertext letters, need to be learned in an unsupervised fashion. Knight et al. [38] give the algorithm for learning the channel probabilities using EM. Once the probabilities are learned, the plaintext sequence is inferred using Viterbi decoding. For simple substitution ciphers, the ideal channel probabilities must be very sparse, with each row in the conditional probability table having only one parameter take all the probability mass and the rest be zero. However, this is not what EM learns, and the authors find that the probabilities are not sparse enough. Therefore, during decoding, they have to resort to cubing the probabilities to make them more sparse. This situation is ideal for our prior, as it is built to encourage one parameter to take most of the probability mass. We hope that our approach will benefit unsupervised decipherment.
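To make the noisy-channel objective of equation (7.1) concrete, the following toy sketch (our own illustration, not the decipherment system of Knight et al. [38]) scores a single candidate plaintext with a bigram language model and per-letter channel probabilities; a real system would search over plaintexts, for example with Viterbi decoding, rather than score one candidate:

    import math

    def plaintext_log_score(plaintext, ciphertext, lm_bigram, channel):
        """log P(p) + log P(c|p) for equation (7.1).

        lm_bigram[(prev, cur)] is a bigram probability P(cur|prev);
        channel[(p_letter, c_letter)] is P(c_letter|p_letter).
        """
        score = 0.0
        prev = "<s>"
        for p_letter, c_letter in zip(plaintext, ciphertext):
            score += math.log(lm_bigram[(prev, p_letter)])    # language model P(p)
            score += math.log(channel[(p_letter, c_letter)])  # channel P(c|p)
            prev = p_letter
        return score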
7.2.1.2 Monotone phrase-based alignment

The input to monotone phrase-based alignment is the same as that for word alignment: pairs of sentences in a source and target language that carry the same meaning. Whereas in word alignment (Chapter 4) the alignment links are between pairs of single words, in phrase-based alignment the alignment links would be between pairs of substrings (phrases). The goal is to predict the phrase alignments between the source and target sentences. In contrast with word alignment, we also make assumptions about monotonicity in the alignments, i.e., if there are two source phrases s_i and s_j aligned to target phrases t_k and t_l respectively, and if i < j, then k < l. Models of this type are potentially useful for various applications, including: translation between closely related languages where we know there is very little reordering; grapheme-to-phoneme conversion; automatic transliteration; and normalization of nonstandard orthography in low-density languages [22].

One of the main challenges in this task is the potential for such models to overfit badly, putting the entire string into a single phrase. We believe that the methods demonstrated here can provide a scalable way to overcome this challenge, and we can use the smoothed L_0 prior for learning the phrase translation probabilities.

7.2.1.3 Comparison of various sparse priors for unsupervised word alignment

This work has presented a particular family of sparse priors that approximates the L_0 norm, and we mentioned some others in Chapter 3. In 2009, Lv and Fan [46] also presented a family of concave penalty functions that encourage sparsity. We believe it would benefit NLP practitioners to be presented with a survey of the performance of these penalty functions on a large-scale and important task like word alignment (Chapter 4).

7.2.2 Rule Markov Models

In Chapter 5, we presented an approach for using language models of minimal rules to improve translation with smaller grammars. However, we treated each minimal rule as an entire unit, without considering the important syntactic information that might be contained within. Learning better rule Markov models that exploit the syntactic information inside minimal rules would be a fruitful future research direction.

7.2.3 Neural Probabilistic Language Models

In Chapter 6, we experimented with unigram noise distributions for learning neural probabilistic language models. Although using noise samples from a unigram distribution allows us to learn NPLMs that improve translation quality, we plan to investigate different noise distributions that might speed up training and further improve translation quality.

References

[1] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[2] Roberto Andreani, Ernesto G. Birgin, José M. Martínez, and María L. Schuverdt. On augmented Lagrangian methods with general lower-level constraints. SIAM Journal on Optimization, 18(4):1286–1309, 2007.
[3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
[4] Gill Bejerano and Golan Yona. Modeling protein families using probabilistic suffix trees. In Proceedings of RECOMB, pages 15–24. ACM Press, 1999.
[5] Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. CoRR, abs/1206.5533, 2012.
[6] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 2003.
[7] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[8] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[9] Rens Bod. An efficient implementation of a new DOP model. In Proceedings of EACL, pages 19–26, 2003.
[10] Tugba Bodrumlu, Kevin Knight, and Sujith Ravi. A new objective function for word alignment. In Proceedings of the NAACL HLT Workshop on Integer Linear Programming for Natural Language Processing, 2009.
[11] Ondřej Bojar and Magdalena Prokopová. Czech-English word alignment. In Proceedings of LREC, 2006.
[12] J. Bos, C. Bosco, and A. Mazzei. Converting a dependency treebank to a categorial grammar treebank for Italian. In Eighth International Workshop on Treebanks and Linguistic Theories (TLT8), 2009.
[13] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311, 1993.
[14] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University Center for Research in Computing Technology, 1998.
[15] D. Chiang, J. Graehl, K. Knight, A. Pauls, and S. Ravi. Bayesian inference for finite-state transducers. In Proceedings of NAACL HLT, 2010.
[16] David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228, 2007.
[17] David Chiang, Yuval Marton, and Philip Resnik. Online large-margin training of syntactic and structural translation features. In Proceedings of EMNLP, 2008.
[18] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
[19] Yuan Ding and Martha Palmer. Machine translation using probabilistic synchronous dependency insertion grammars. In Proceedings of ACL, pages 541–548, 2005.
[20] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the L_1-ball for learning in high dimensions. In Proceedings of ICML, 2008.
[21] Chris Dyer, Jonathan H. Clark, Alon Lavie, and Noah A. Smith. Unsupervised word alignment with arbitrary features. In Proceedings of ACL, 2011.
[22] Adel Foda and Steven Bird. Normalising audio transcriptions for unwritten languages. In Proceedings of IJCNLP. Asian Federation of Natural Language Processing, November 2011.
[23] Victoria Fossum, Kevin Knight, and Steve Abney. Using syntax to improve word alignment precision for syntax-based machine translation. In Proceedings of the Workshop on Statistical Machine Translation, 2008.
[24] Alexander Fraser and Daniel Marcu. Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3):293–303, 2007.
[25] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.
[26] Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntactic translation models. In Proceedings of COLING-ACL, pages 961–968, 2006.
[27] Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. What's in a translation rule? In Proceedings of HLT-NAACL, pages 273–280, 2004.
[28] Y. Goldberg, M. Adler, and M. Elhadad. EM can find pretty good HMM POS-taggers (when given a good start). In Proceedings of ACL, 2008.
[29] Sharon Goldwater and Thomas L. Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of ACL, 2007.
[30] João V. Graça, Kuzman Ganchev, and Ben Taskar. Learning tractable word alignment models with complex constraints. Computational Linguistics, 36(3):481–504, 2010.
[31] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.
[32] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of AISTATS, 2010.
[33] Kenneth Heafield, Philipp Koehn, and Alon Lavie. Language model rest costs and space-efficient storage. In Proceedings of EMNLP-CoNLL, pages 1169–1178, 2012.
[34] Liang Huang, Kevin Knight, and Aravind Joshi. Statistical syntax-directed translation with extended domain of locality. In Proceedings of AMTA, pages 66–73, 2006.
[35] Liang Huang and Haitao Mi. Efficient incremental decoding for tree-to-string translation. In Proceedings of EMNLP, pages 273–283, 2010.
[36] Mashud Hyder and Kaushik Mahata. An approximate L0 norm minimization algorithm for compressed sensing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2009.
[37] Mark Johnson. PCFG models of linguistic tree representations. Computational Linguistics, 24:613–632, 1998.
[38] K. Knight, A. Nair, N. Rathod, and K. Yamada. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 499–506. Association for Computational Linguistics, 2006.
[39] Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP, 2004.
[40] Philipp Koehn, Franz Joseph Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of NAACL, pages 127–133, 2003.
[41] Richard Kronmal and Arthur Peterson. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33(4):214–218, 1979.
[42] Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and François Yvon. Structured output layer neural network language model. In Proceedings of ICASSP, pages 5524–5527, 2011.
[43] Percy Liang, Ben Taskar, and Dan Klein. Alignment by agreement. In Proceedings of HLT-NAACL, 2006.
[44] Ding Liu and Daniel Gildea. Improved tree-to-string transducer for machine translation. In Proceedings of the Workshop on Statistical Machine Translation, 2008.
[45] Yang Liu, Qun Liu, and Shouxun Lin. Tree-to-string alignment template for statistical machine translation. In Proceedings of COLING-ACL, pages 609–616, 2006.
[46] J. Lv and Y. Fan. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics, 37(6A):3498–3528, 2009.
[47] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[48] Bernard Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171, 1994.
[49] Coşkun Mermer and Murat Saraçlar. Bayesian word alignment for statistical machine translation. In Proceedings of ACL HLT, 2011.
[50] Haitao Mi, Liang Huang, and Qun Liu. Forest-based translation. In Proceedings of ACL: HLT, pages 192–199, 2008.
[51] Tomáš Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[52] Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan "Honza" Černocký. Empirical evaluation and combination of advanced language modeling techniques. In Proceedings of INTERSPEECH, pages 605–608, 2011.
[53] Andriy Mnih and Geoffrey Hinton. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, 2009.
[54] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of ICML, 2012.
[55] G. Mohimani, M. Babaie-Zadeh, and C. Jutten. Fast sparse representation based on smoothed L_0 norm. Independent Component Analysis and Signal Separation, pages 389–396, 2007.
[56] Robert Moore. A discriminative framework for bilingual word alignment. In Proceedings of HLT-EMNLP, 2005.
[57] Robert C. Moore. Improving IBM word-alignment Model 1. In Proceedings of ACL, 2004.
[58] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Proceedings of AISTATS, pages 246–252, 2005.
[59] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML, pages 807–814, 2010.
[60] H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language, 8:1–38, 1994.
[61] Jan Niehues and Alex Waibel. Continuous space language models using Restricted Boltzmann Machines. In Proceedings of IWSLT, 2012.
[62] Franz Josef Och and Hermann Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
[63] Franz Joseph Och. Minimum error rate training in statistical machine translation. In Proceedings of ACL, pages 160–167, 2003.
[64] Franz Joseph Och and Hermann Ney. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449, 2004.
[65] Slav Petrov and Dan Klein. Improved inference for unlexicalized parsing. In Proceedings of HLT-NAACL, pages 404–411, 2007.
[66] Chris Quirk and Arul Menezes. Do we need phrases? Challenging the conventional wisdom in statistical machine translation. In Proceedings of NAACL HLT, 2006.
[67] S. Ravi, A. Vaswani, K. Knight, and D. Chiang. Fast, greedy model minimization for unsupervised tagging. In Proceedings of ACL, pages 940–948, 2010.
[68] Sujith Ravi and Kevin Knight. Minimized models for unsupervised part-of-speech tagging. In Proceedings of ACL-IJCNLP, 2009.
[69] Jason Riesa and Daniel Marcu. Hierarchical search for word alignment. In Proceedings of ACL, 2010.
[70] Thomas Schoenemann. Probabilistic word alignment under the L_0-norm. In Proceedings of CoNLL, 2011.
[71] Thomas Schoenemann. Regularizing mono- and bi-word models for word alignment. In Proceedings of IJCNLP, 2011.
[72] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[73] Holger Schwenk. Efficient training of large neural networks for language modeling. In Proceedings of IJCNN, pages 3059–3062, 2004.
[74] Holger Schwenk. Continuous space language models. Computer Speech and Language, 21:492–518, 2007.
[75] Holger Schwenk. Continuous-space language models for statistical machine translation. Prague Bulletin of Mathematical Linguistics, 93:137–146, 2010.
[76] Holger Schwenk. CSLM - a modular open-source continuous space language modeling toolkit. In Proceedings of Interspeech, 2013.
[77] Holger Schwenk and Jean-Luc Gauvain. Training neural network language models on very large corpora. In Proceedings of EMNLP, 2005.
[78] Holger Schwenk, Anthony Rousseau, and Mohammed Attik. Large, pruned or continuous space language models on a GPU for statistical machine translation. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 11–19, 2012.
[79] Stuart Shieber, Yves Schabes, and Fernando Pereira. Principles and implementation of deductive parsing. Journal of Logic Programming, 24:3–36, 1995.
[80] Noah Smith and Jason Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of ACL, 2005.
[81] Andreas Stolcke. SRILM – an extensible language modeling toolkit. In Proceedings of ICSLP, volume 30, pages 901–904, 2002.
[82] Ben Taskar, Simon Lacoste-Julien, and Dan Klein. A discriminative matching approach to word alignment. In Proceedings of HLT-EMNLP, 2005.
[83] Ashish Vaswani, Adam Pauls, and David Chiang. Efficient optimization of an MDL-inspired objective function for unsupervised part-of-speech tagging. In Proceedings of ACL, 2010.
[84] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in statistical translation. In Proceedings of COLING, 1996.
[85] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton. On rectified linear units for speech processing. In Proceedings of ICASSP, 2013.
[86] Ying Zhang, Stephan Vogel, and Alex Waibel. Interpreting BLEU/NIST scores: How much improvement do we need to have a better system? In Proceedings of LREC, 2004.