WEIGHTED TREE AUTOMATA AND TRANSDUCERS FOR SYNTACTIC NATURAL LANGUAGE PROCESSING

by

Jonathan David Louis May

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2010

Copyright 2010 Jonathan David Louis May

Dedication

For Lorelei, who made it all worthwhile.

Acknowledgments

As I write these words I am overwhelmed that so many people have provided such a continuous force of encouragement, advice, and unrelenting positivity. Truly, I have been blessed to have them in my life.

My advisor, Kevin Knight, was just about the perfect person to guide me along this path. He was ever tolerant of my irreverent, frequently headstrong style, and gently corrected me when I strayed too far from my course of study, allowing me to make mistakes (such as coding up impossible algorithms) but helping me to learn from them. The other members of my committee have provided valuable feedback and insight—I have especially benefited from working closely with David Chiang and Daniel Marcu, and have welcomed the input from Sven Koenig, Shri Narayanan, and Fernando Pereira. I'm indebted to the continued guidance of Mitch Marcus at Penn, who helped me find BBN way back in 2001, and to that of Scott Miller, Lance Ramshaw, and Ralph Weischedel at BBN, who helped me find ISI.

In 2006, Andreas Maletti, then a Ph.D. student at the Technische Universität Dresden, contacted Kevin and asked if he could come by ISI and give a presentation on his research in tree automata. Little was I to know at this time what a treasure trove had been opened, and what fruitful work would result from this meeting. This led to collaborations with him as well as with Johanna Högberg, Heiko Vogler, and Matthias Büchse. The influence of these folks from TCS can be felt throughout this work. In particular, I am indebted to Heiko for extensive help in the drafting of Chapters 2 and 4.

For six years I have been fortunate to have Steve DeNeefe as my office-mate and academic sibling. He has taught me quite a lot about a wide variety of subjects, from computer science and math to Christianity and fatherhood. The number of other truly amazing people at ISI I have had the good fortune of working with is astounding, and I hope I manage to acknowledge them all here: Erika Barragan-Nunez, Rahul Bhagat, Gully Burns, Hal Daumé III, Victoria Fossum, Alex Fraser, Jonathan Graehl, Ulf Hermjakob, Dirk Hovy, Ed Hovy, Liang Huang, Zornitsa Kozareva, Kary Lau, Rutu Mehta, Alma Nava, Oana Postolache, Michael Pust, David Pynadath, Sujith Ravi, Deepak Ravichandran, Jason Riesa, Tom Russ, Radu Soricut, Ashish Vaswani, Jens Vöckler, Wei Wang, and Kenji Yamada. In addition, I am grateful to ISI for its ability to attract visiting professors, graduate students, and other luminaries. This has enabled me to get to know John DeNero, Erica Greene, Carmen Heger, Adam Pauls, Gerald Penn, Bill Rounds, Magnus Steinby, Cătălin Tîrnăucă, and Joseph Turian.

In my times of greatest doubt and despair my friends and family have always been rock-solid, fully supportive of me and confident that I would succeed, even when I didn't believe in myself. I'm so lucky to have their unyielding and undying love. Marco Carbone and Lee Rowland, friends from way back, hover just barely too far away in Nevada. Glenn Østen Anderson, Nausheen Jamal, and Ben Plantan have brightened our door with not-frequent-enough visits and may finally move here someday. My aunt René Block Baird and cousins Kalon, April, and Serena Baird engage me with constant love and affection.
And I literally could not be here without the constant sacrifice, support, and occasional copy-editing of my parents, Howard May and Marlynn Block, and my step-parents, Irena May and Jerry Levine. Finally, though it has been practically effortless, I still consider the greatest accomplishment and benefit that this document represents to be my meeting, falling in love with, and marrying Lorelei Laird.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1  Introduction: Machine Translation
1.1 A cautionary tale
1.2 Transducers to the rescue
1.3 A transducer model of translation
1.4 Adding weights
1.5 Going further
1.6 Better modeling through tree transducers
1.7 Algorithms for tree transducers and grammars
1.8 Building a new toolkit

Chapter 2  Weighted Regular Tree Grammars and Weighted Tree Transducers
2.1 Preliminaries
2.1.1 Trees
2.1.2 Semirings
2.1.3 Tree series and weighted tree transformations
2.1.4 Substitution
2.2 Weighted regular tree grammars
2.2.1 Normal form
2.2.2 Chain production removal
2.2.3 Determinization
2.2.4 Intersection
2.2.5 K-best
2.3 Weighted top-down tree transducers
2.3.1 Projection
2.3.2 Embedding
2.3.3 Composition
2.4 Tree-to-string and string machines
2.5 Useful classes for NLP
2.6 Summary

Chapter 3  Determinization of Weighted Tree Automata
3.1 Motivation
3.2 Related work
3.3 Practical determinization
3.4 Determinization using factorizations
3.4.1 Factorization
3.4.2 Initial algebra semantics
3.4.3 Correctness
3.4.4 Termination
3.5 Empirical studies
3.5.1 Machine translation
3.5.2 Data-Oriented Parsing
3.5.3 Conclusion

Chapter 4  Inference through Cascades
4.1 Motivation
4.2 String case: application via composition
4.3 Extension to cascade of wsts
4.4 Application of tree transducers
4.5 Application of tree transducer cascades
4.6 Decoding experiments
4.7 Backward application of wxLNTs to strings
4.8 Building a derivation wsa
4.9 Building a derivation wrtg
4.10 Summary

Chapter 5  Syntactic Re-Alignment Models for Machine Translation
5.1 Methods of statistical MT
5.2 Multi-level syntactic rules for syntax MT
5.3 Introducing syntax into the alignment model
5.3.1 The traditional IBM alignment model
5.3.2 A syntax re-alignment model
5.4 The appeal of a syntax alignment model
5.5 Experiments
5.5.1 The re-alignment setup
5.5.2 The MT system setup
5.5.3 Initial results
5.5.4 Making EM fair
5.5.5 Results
5.5.6 Discussion
5.6 Conclusion

Chapter 6  Tiburon: A Tree Transducer Toolkit
6.1 Introduction
6.2 Getting started
6.3 Grammars
6.3.1 File formats
6.3.2 Commands using grammar files
6.4 Transducers
6.4.1 File formats
6.4.2 Commands using transducer files
6.5 Performance comparison
6.5.1 Transliteration cascades
6.5.1.1 Reading a transducer
6.5.1.2 Inference through a cascade
6.5.1.3 K-best
6.5.2 Unsupervised part-of-speech tagging
6.5.2.1 Training
6.5.2.2 Determinization
6.5.3 Discussion
6.6 External libraries
6.7 Conclusion

Chapter 7  Concluding Thoughts and Future Work
7.1 Concluding thoughts
7.2 Future work
7.2.1 Algorithms
7.2.2 Models
7.2.3 Engineering
7.3 Final words

References

List of Tables

1.1 Availability of algorithms and implementations for various classes of automata. Yes = an algorithm is known and an implementation is publicly available. Alg = an algorithm is known but no implementation is known to be available. PoC = a proof of concept of the viability of an algorithm is known but there is no explicit algorithm. No = no methods have been described.

1.2 Availability of algorithms and implementations for various classes of transducers, using the key described in Table 1.1.

3.1 BLEU results from string-to-tree machine translation of 116 short Chinese sentences with no language model. The use of best derivation (undeterminized), estimate of best tree (top-500), and true best tree (determinized) for selection of translation is shown.

3.2 Recall, precision, and F-measure results on DOP-style parsing of section 23 of the Penn Treebank. The use of best derivation (undeterminized), estimate of best tree (top-500), and true best tree (determinized) for selection of parse output is shown.

3.3 Median trees per sentence forest in machine translation and parsing experiments before and after determinization is applied to the forests, removing duplicate trees.

4.1 Preservation of forward and backward recognizability for various classes of top-down tree transducers. Here and elsewhere, the following abbreviations apply: w = weighted, x = extended left side, L = linear, N = nondeleting, OQ = open question.

4.2 For those classes that preserve recognizability in the forward and backward directions, are they appropriately closed under composition with (w)LNT? If the answer is "yes", then an embedding, composition, projection strategy can be used to do application.
4.3 Closure under composition for various classes of top-down tree transducer.

4.4 Transducer types and available methods of forward application of a cascade. oc = offline composition, ecp = embed-compose-project, bb = custom bucket brigade algorithm, otf = on the fly.

4.5 Transducer types and available methods of backward application of a cascade. oc = offline composition, ecp = embed-compose-project, otf = on the fly.

4.6 Deduction schema for the one-best algorithm of Pauls and Klein [108], generalized for a normal-form wrtg, and with on-the-fly discovery of productions. We are presumed to have a wrtg G = (N, Σ, P, n_0) that is a stand-in for some M(G), a priority queue that can hold items of type I and O, prioritized by their cost, c, two N-indexed tables of (initially 0-valued) weights, in and out, and one N-indexed table of (initially null-valued) productions, deriv. In each row of this schema, the specified actions are taken (inserting items into the queue, inserting values into the tables, discovering new productions, or returning deriv) if the specified item is at the head of the queue and the specified conditions of in, out, and deriv exist. The one-best hyperpath in G can be found when deriv is returned by joining together productions in the obvious way, beginning with deriv[n_0].

4.7 Timing results to obtain 1-best from application through a weighted tree transducer cascade, using on-the-fly vs. bucket brigade backward application techniques. pcfg = model recognizes any tree licensed by a pcfg built from observed data, exact = model recognizes each of 2,000+ trees with equal weight, 1-sent = model recognizes exactly one tree.

5.1 Tuning and testing data sets for the MT system described in Section 5.5.2.

5.2 A comparison of Chinese BLEU performance between the GIZA baseline (no re-alignment), re-alignment as proposed in Section 5.3.2, and re-alignment as modified in Section 5.5.4.

5.3 Machine Translation experimental results evaluated with case-insensitive BLEU4.

5.4 Re-alignment performance with semi-supervised EMD bootstrap alignments.

6.1 Generative order, description, and statistics for the cascade of English-to-katakana transducers used in performance tests in Section 6.5.

6.2 Timing results for experiments using various operations across several transducer toolkits, demonstrating the relatively poor performance of Tiburon as compared with extant string transducer toolkits. The reading experiment is discussed in Section 6.5.1.1, inference in Section 6.5.1.2, the three k-best experiments in Section 6.5.1.3, the three training experiments in Section 6.5.2.1, and determinization in Section 6.5.2.2. For FSM and OpenFst, timing statistics are broken into time to convert between binary and text formats ("conv.") and time to perform the specified operation ("op."). N/A = this toolkit does not support this operation. OOM = the test computer ran out of memory before completing this experiment.

List of Figures

1.1 The general noisy channel model. The model is proposed in the "story" direction but used in the "interpretation" direction, where a noisy input is transformed into the target domain and then validated against the recognizer.

1.2 An (English, Spanish) tree pair whose transformation we can capture via tree transducers.

2.1 Trees from Example 2.1.1.

2.2 Non-zero entries of weighted tree transformations and tree series of Example 2.1.3. For each row, the element(s) on the left maps to the value on the right, and all other elements map to 0.

2.3 Production set P_1 from example wrtg G_1 used in Examples 2.2.2, 2.2.3, 2.2.4, and 2.2.7.
2.4 Normal-form productions inserted in P_1 to replace productions 8, 9, and 10 of Figure 2.3 in normalization of G_1, as described in Example 2.2.3.

2.5 Productions inserted in P_1 to compensate for the removal of chain production 12 of Figure 2.3 in chain production removal of G_1, as described in Example 2.2.4.

2.6 Illustration of Algorithms 3 and 4, as described in Example 2.2.5. Algorithm 4 builds the table in Figure 2.6b from the productions in Figure 2.6a and then Algorithm 3 uses this table to generate the productions in Figure 2.6c.

2.7 rtg productions before and after determinization, as described in Example 2.2.6. Note that (a) is not deterministic because productions 4 and 5 have the same right side, while no productions in (b) have the same right side.

2.8 Productions for a wrtg and intersection result, as described in Example 2.2.7.

2.9 Rule sets for three wtts.

2.10 Rule sets for wxtts presented in Example 2.3.4.

2.11 Production sets formed from domain projection using Algorithms 7 and 8, as described in Example 2.3.5.

2.12 Transformations of M_4 from Example 2.3.4, depicted in Figure 2.10b, for use in domain and range projection Examples 2.3.6 and 2.3.7.

2.13 Rule set R_6, formed from embedding of wrtg G_7 from Example 2.2.7, as described in Example 2.3.8.

2.14 Graphical representation of COVER, Algorithm 14. At line 13, position v of tree u is chosen. As depicted in Figure 2.14a, in this case, u(v) is δ and has two children. One member (z, θ, w) of Π_last is depicted in Figure 2.14b. The tree z has a leaf position v′ with label χ and there is an entry for (v, v′) in θ, so as indicated on lines 16 and 17, we look for a rule with state θ(v, v′) = q_B and left symbol δ. One such rule is depicted in Figure 2.14c. Given the tree u, the triple (z, θ, w), and the matching rule, we can build the new member of Π_v depicted in Figure 2.14d as follows: The new tree is built by first transforming the (state, variable) leaves of h; if the ith child of v is a (state, variable) symbol, say, (q, x), then leaves in h of the form (q′′, x_i) are transformed to (q, q′′, x) symbols, otherwise they become χ. The former case, which is indicated on line 24, accounts for the transformation from q′′_B1.x_1 to (q_A, q′′_B1).x_4. The latter case, which is indicated on line 26, accounts for the transformation from q′′_B2.x_2 to χ. The result of that transformation is attached to the original z at position v′; this is indicated on line 27. The new θ′ is extended from the old θ, as indicated on line 18. For each immediate child vi of v that has a corresponding leaf symbol in h marked with x_i at position v′′, the position in the newly built tree will be v′v′′. The pair (vi, v′v′′) is mapped to the state originally at v′′, as indicated on line 22. Finally, the new weight is obtained by multiplying the original weight, w, with the weight of the rule, w′.

2.15 Rule sets for transducers described in Example 2.3.9.

2.16 Π formed in Example 2.3.9 as a result of applying Algorithm 14 to ν(q_1.x_1, ν(λ, q_2.x_1)), M_8, and q_3.
2.17 Rules and productions for an example wxtst and wcfg, respectively, as described in Example 2.4.3.

2.18 Example of the tree transformation power an expressive tree transducer should have, according to Knight [73]. The re-ordering expressed by this transformation is widely observed in practice, over many sentences in many languages.

2.19 Tree transducers and their properties, inspired by a similar diagram by Knight [73]. Solid lines indicate a generalization relationship. Shaded regions indicate a transducer class has one or more of the useful properties described in Section 2.5. All transducers in this figure have an EM training algorithm.

3.1 Example of weighted determinization. We have represented a nondeterministic wrtg as a weighted tree automaton, then used the algorithm presented in this chapter to return a deterministic equivalent.

3.2 Ranked list of machine translation results with repeated trees. Scores shown are negative logs of calculated weights, thus a lower score indicates a higher weight. The bulleted sentences indicate identical trees.

3.3 Portion of the example wrtg from Figure 3.1 before and after determinization. Weights of similar productions are summed and nonterminal residuals indicate the proportion of weight due to each original nonterminal.

3.4 Sketch of a wsa that does not have the twins property. The dashed arcs are meant to signify a path between states, not necessarily a single arc. q and r are siblings because they can both be reached from p with a path reading "xyz", but are not twins, because the cycles from q and r reading "abc" have different weights. s and r are not siblings because they cannot be both reached from p with a path reading the same string. q and s are siblings because they can both be reached from p with a path reading "def" and they are twins because the cycle from both states reading "abc" has the same weight (and they share no other cycles reading the same string). Since q and r are siblings but not twins, this wsa does not have the twins property.

3.5 A wsa that is not cycle-unambiguous. The state q has two cycles reading the string "ab" with two different weights.

3.6 Demonstration of the twins test for wrtgs. If there are non-zero derivations of a tree t for nonterminals n and n′, and if the weight of the sum of derivations from n of u substituted at v with n is equal to the weight of the sum of derivations from n′ of u substituted at v with n′ for all u where these weights are nonzero for both cases, then n and n′ are twins.

3.7 Ranked list of machine translation results with no repeated trees.

4.1 Application of a wst to a string.

4.2 Three different approaches to application through cascades of wsts.

4.3 Composition-based approach to application of a wLNT to a tree.

4.4 Inputs for forward application through a cascade of tree transducers.

4.5 Results of forward application through a cascade of tree transducers.

4.6 Schematics of application, illustrating the extra work needed to use embed-compose-project in wtt application vs. wst application.

4.7 Forward application through a cascade of tree transducers using an on-the-fly method.
4.8 Example rules from transducers used in decoding experiment. j1 and j2 are Japanese words.

4.9 Input wxLNTs and final backward application wrtg formed from parsing λλλλ, as described in Example 4.7.1.

4.10 Partial parse chart formed by Earley's algorithm applied to the rules in Figure 4.9a, as described in Example 4.7.1. A state is labeled by its rule id, covered position of the rule right side, and covered span. Bold face states have their right sides fully covered, and are thus the states from which application wrtg productions are ultimately extracted. Dashed edges indicate hyperedges leading to sections of the chart that are not shown.

4.11 Construction of a derivation wsa.

4.12 Input transducers for cascade training.

4.13 Progress of building derivation wrtg.

4.14 Derivation wrtg after final combination and conversion.

5.1 General approach to idealistic and realistic statistical MT systems.

5.2 A (English tree, Chinese string) pair and three different sets of multilevel tree-to-string rules that can explain it; the first set is obtained from bootstrap alignments, the second from this paper's re-alignment procedure, and the third is a viable, if poor quality, alternative that is not learned.

5.3 The impact of a bad alignment on rule extraction. Including the alignment link indicated by the dotted line in the example leads to the rule set in the second row. The re-alignment procedure described in Section 5.3.2 learns to prefer the rule set at bottom, which omits the bad link.

6.1 Example wrtg and wcfg files used to demonstrate Tiburon's capabilities.

6.2 Example wxtt and wxtst files used to demonstrate Tiburon's capabilities.

6.3 Comparison of representations in Carmel, FSM/OpenFst, and Tiburon.

6.4 K-best output for various toolkits on the transliteration task described in Section 6.5.1.3. Carmel produces strings, and Tiburon produces monadic trees with a special leaf terminal. FSM and OpenFst produce wfsts or wfsas representing the k-best list and a post-process must agglomerate symbols and weights.

6.5 An example of the HMM trained in the Merialdo [98] unsupervised part-of-speech tagging task described in Section 6.5.2, instantiated as a wfst. Arcs either represent parameters from the bigram tag language model (e.g., the arc from A′ to B, representing the probability of generating tag B after tag A) or from the tag-word channel model (e.g., the topmost arc from A to A′, representing the probability of generating word a given tag A). The labels on the arcs make this wfst suitable for training on a corpus of (ε, word sequence) pairs to set language model and channel model probabilities such that the probability of the training corpus is maximized.

List of Algorithms

1 NORMAL-FORM
2 NORMALIZE
3 CHAIN-PRODUCTION-REMOVAL
4 COMPUTE-CLOSURE
5 DETERMINIZE
6 INTERSECT
7 BOOL-DOM-PROJ
8 LIN-DOM-PROJ
9 PRE-DOM-PROJ
10 PRE-DOM-PROJ-PROCESS
11 RANGE-PROJ
12 EMBED
13 COMPOSE
14 COVER
15 WEIGHTED-DETERMINIZE
16 DIJKSTRA
17 FORWARD-APPLICATION
18 FORWARD-COVER
19 FORWARD-PRODUCE
20 MAKE-EXPLICIT
21 BACKWARD-COVER
22 BACKWARD-PRODUCE

Abstract

Weighted finite-state string transducer cascades are a powerful formalism for models of solutions to many natural language processing problems such as speech recognition, transliteration, and translation. Researchers often directly employ these formalisms to build their systems by using toolkits that provide fundamental algorithms for transducer cascade manipulation, combination, and inference. However, extant transducer toolkits are poorly suited to current research in NLP that makes use of syntax-rich models. More advanced toolkits, particularly those that allow the manipulation, combination, and inference of weighted extended top-down tree transducers, do not exist. In large part, this is because the analogous algorithms needed to perform these operations have not been defined. This thesis solves both these problems, by describing and developing algorithms, by producing an implementation of a functional weighted tree transducer toolkit that uses these algorithms, and by demonstrating the performance and utility of these algorithms in multiple empirical experiments on machine translation data.

Chapter 1
Introduction: Machine Translation

1.1 A cautionary tale

One of the earliest syntactic parsers was built in 1958 and 1959 to run on the Univac 1 computer at the University of Pennsylvania, and attempts were made to recreate the parser some forty years later on modern machines [60]. Given that computer science was in its infancy at the time of the parser's creation and much had changed in the interim, it is not surprising that the resurrectors relied on the hundreds of pages of flowcharts and program specifications describing the system as guidance, rather than the original assembly code itself. Unfortunately, due to ambiguities in the documentation and damage to the archives, some guesses had to be made in reimplementation, and thus the reincarnation of the parser is only an approximation of the original system at best. As the resurrectors were faithful to the design of the original parser, however, they built the modern incarnation of the parser as a single piece of code designed to do a single task. Of course, this time they wrote the program in C rather than assembly.
If, forty years from now, a new generation of computer science archaeologists wishes to re-recreate the parser, one hopes the C language is still understandable, and that the source code survives. If this source is for some reason unavailable, the team will once again have to re-write the parser from documentation alone. As before, if the documentation is incomplete or imprecise, only an approximation of the original program will be built, and all the work of the resurrectors will be for naught.

1.2 Transducers to the rescue

As it happens, this parser, now called Uniparse, was designed as a cascade of finite-state transducers, abstract machines that read an input tape and write an output tape based on a set of state transition rules. Transducers have been widely studied and have numerous attractive properties; among them is the property of closure under composition—the transformations accomplished by a sequence of two transducers can be captured by a single transducer. These properties, along with effective algorithms that take advantage of them, such as an algorithm to quickly construct the composition of two transducers, allow any program written in the form of a transducer cascade, as Uniparse was designed, to be easily and effectively handled by a generic program that processes and manipulates transducers.

Rather than writing Uniparse from the original design schematics in custom assembly or C, the resurrectors could have encoded the schematics themselves, which are already written as transducers, into a uniform format that is suitable for reading by a generic finite-state transducer toolkit, and used transducer operations such as composition and projection (the obtaining of either the input or output language of a transducer) to perform the parsing operations, rather than writing new code. By expressing Uniparse as transducer rules, the resurrectors would have been able to combine the transducer design with its code implementation. Any alterations to the original design would have been encoded in the set of transducer rules, preventing any "hacks" from hiding in the code base. A file containing solely transducer rules requires no additional documentation to explain any hidden implementation, as the entire implementation is represented in the rules. Most importantly, aside from formatting issues, a file of transducer rules is immune from the havoc time wreaks on compiled code. Future generations could use their transducer toolkits on these files with a trivial number of changes.

Of course, someone has to build the transducer toolkit itself. And this naturally raises the question: Is implementing a program for finite-state transducers rather than for a syntactic parser simply trading one specific code base for another? Thankfully, transducer cascades are useful for more than light deterministic parsing. In the fields of phonological and morphological analysis Kaplan and Kay [63] realized the analysis rules linguists developed could be encoded as finite-state transducer rules, and this led first to a transducer-based morphological analysis system [80] and eventually to the development of XFST [70], an entire finite-state toolkit complete with the requisite algorithms needed to manipulate transducers and actually get work done. The set of natural language transformation tasks capturable by regular expressions, such as date recognition, word segmentation, some simple part-of-speech tagging, and spelling correction, and light parsing such as that done by Uniparse, can be expressed as cascades of finite-state transducers [69].
The availability of this toolkit allowed researchers to simply write their models down as transducer rules and allow the toolkit to do the processing work.

1.3 A transducer model of translation

A concrete example is helpful for demonstrating the usefulness of transducer cascades. Imagine we want to build a small program that translates between Spanish and English. Here is a simple model of how an English sentence becomes a Spanish sentence:

• An English sentence is imagined by someone fluent in English.
• The words of the sentence are rearranged—each word can either remain in its original position or swap places with the subsequent word.
• Each word is translated into a single Spanish word.

Such a model obviously does not capture all translations between Spanish and English, but it does handle some of them, and thus provides a good motivating example. Moreover, it is fairly easy to design software to perform each step of the model, even if it might be hard to design a single piece of software that encodes the entire model, all at once. In this way we are espousing the "conceptual factoring" envisioned by Woods [133].

Having envisioned this model of machine translation, an enterprising student could set about writing a program from scratch that implements the model. However, rather than enduring lengthy coding and debugging sessions the student could instead represent the model by the following cascade of finite-state transducers, which may be composed into a single transducer using a toolkit such as XFST:

• A finite-state automaton A, which is an English language model, i.e., it recognizes the sentences of English. This is a special case of a finite-state transducer—one where each rule has a single symbol rather than separate reading and writing symbols. A simple automaton for English maintains a state associated with the last word it saw, and only has transitions for valid subsequent words. Let us semantically associate the state q_x with cases where the last recognized word was x. Let q_START be the state representing the beginning of a sentence. We write rules like:

    q_START --the--> q_the       q_green --ball--> q_ball
    q_the --green--> q_green     q_green --horse--> q_horse
    q_the --ball--> q_ball       ...

  and so on. This automaton allows phrases such as "the ball" and "the green horse" but not "green ball the" or "horse green".

• A reordering transducer B with rule patterns of the following form for all words a and b:

    r --a:ε--> r_a       r --a:a--> r       r_a --b:"b a"--> r

  Here, ε on the right side means no symbol is written when the rule is invoked, though the symbol on the left is consumed. The rules are instantiated for each possible pair of words, so given the English vocabulary {the, ball, green} we would have:

    r --the:ε--> r_the                  r --ball:ε--> r_ball                  r --green:ε--> r_green
    r --the:the--> r                    r --ball:ball--> r                    r --green:green--> r
    r_the --ball:"ball the"--> r        r_ball --ball:"ball ball"--> r        r_green --ball:"ball green"--> r
    r_the --green:"green the"--> r      r_ball --green:"green ball"--> r      r_green --green:"green green"--> r
    r_the --the:"the the"--> r          r_ball --the:"the ball"--> r          r_green --the:"the green"--> r

• A one-state transducer C that translates between English and Spanish, e.g.:

    s --ball:pelota--> s      s --horse:caballo--> s      s --the:el--> s
    s --green:verde--> s      s --the:la--> s             ...

These three transducers can be composed together using classic algorithms to form a single translation machine, D, that reorders valid English sentences and translates them into (possibly invalid) Spanish in a single step.
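To make the behavior of this cascade concrete, here is a minimal brute-force sketch in Python: it enumerates the English sentences accepted by a toy version of A, applies the reordering of B and the word translations of C, and keeps the English sentences that can produce a given Spanish phrase, which is what running the composed machine D in reverse accomplishes. The rule tables, names, and enumeration strategy here are invented for illustration; a real toolkit such as XFST composes the transducers directly rather than enumerating candidates.

```python
from itertools import product

# Toy version of A: allowed (previous word -> next word) transitions.
ALLOWED_NEXT = {
    "<s>": {"the"},
    "the": {"green", "ball"},
    "green": {"ball", "horse"},
    "ball": {"</s>"},
    "horse": {"</s>"},
}

# Toy version of C: each English word and its Spanish dictionary entries.
TRANSLATE = {"the": ["el", "la"], "green": ["verde"], "ball": ["pelota"], "horse": ["caballo"]}


def english_sentences(max_len=4):
    """Enumerate the word sequences accepted by the toy language model A."""
    frontier = [("<s>", [])]
    while frontier:
        state, words = frontier.pop()
        for nxt in ALLOWED_NEXT.get(state, ()):
            if nxt == "</s>":
                yield words
            elif len(words) < max_len:
                frontier.append((nxt, words + [nxt]))


def reorderings(words):
    """B: each word stays put or swaps with the following word (non-overlapping swaps)."""
    if len(words) < 2:
        yield list(words)
        return
    for rest in reorderings(words[1:]):      # first word stays
        yield [words[0]] + rest
    for rest in reorderings(words[2:]):      # first two words swap
        yield [words[1], words[0]] + rest


def translations(words):
    """C: replace every English word by one of its Spanish dictionary entries."""
    for choice in product(*(TRANSLATE[w] for w in words)):
        yield list(choice)


def backward_apply(spanish):
    """Running the composed machine D 'in reverse': which accepted English
    sentences can be reordered and translated into the given Spanish phrase?"""
    return {" ".join(eng)
            for eng in english_sentences()
            for reordered in reorderings(eng)
            for span in translations(reordered)
            if span == spanish}


print(backward_apply(["la", "pelota", "verde"]))   # {'the green ball'}
```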
Run in reverse, D translates arbitrary sequences of Spanish words and, if possible, reorders them to form valid English sentences. Next, a candidate Spanish phrase can be encoded as a simple automaton E, with q_0 as the initial state and the following rules:

    q_0 --la--> q_1       q_1 --pelota--> q_2       q_2 --verde--> q_3

This automaton represents exactly the phrase "la pelota verde". It can be composed to the right of D, forming a transducer F that reorders valid English sentences and, if possible, translates them into exactly the phrase "la pelota verde". The domain projection of F, then, is an automaton that represents all valid English translations of the Spanish phrase, which in this case is the single phrase "the green ball." Notice that this translation machine was built from simple transducers: A, which only recognizes valid English but is completely unaware of Spanish, B, which reorders English without any regard for proper grammar, C, which translates between English and Spanish but is not constrained to the proper grammar of either language, and E, the candidate Spanish sentence. By chaining together several simple transducers that each perform a limited task we can quickly build complicated and powerful systems.

There are many problems with this translation model. One of the most obvious is that we cannot handle cases where the number of English and Spanish words is not the same, so the translation between "I do not have" and "No tengo" is impossible. However, successive refinements and introductions of additional transducers, some with ε-rules, can help make the system better. Implementing the same refinements and model changes in a custom code implementation can require many more tedious coding and debugging cycles.

There is a more fundamental problem with this model that cannot easily be resolved by reconfiguring the transducers. If there are multiple valid answers, how do we know which to choose? In this framework a transformation is either correct or incorrect; there is no room for preference. We are faced with the choice of either overproducing and not knowing which of several answers is correct, or underproducing and excluding many valid transformations. Neither of these choices is acceptable.

1.4 Adding weights

While the development of a finite-state transducer toolkit was very helpful for attacking the problems of its age, advances in computation allowed modeling theory to expand beyond a level conceivable to the previous generation. Specifically, the availability of large corpora and the ability to process these corpora in reasonable time, coupled with a move away from prescriptivist approaches to computational linguistics, motivated a desire to represent uncertainty in processing and to empirically determine the likelihood of a processing decision based on these large corpora. Transducers that make a simple yes-or-no decision were no longer sufficient to represent models that included confidence scores. Researchers at AT&T designed FSM, a toolkit that harnesses weighted finite-state transducers—a superset of the formalism supported by XFST [103]. The association of weights with transducer rules affected the previous algorithms for composition and introduced new challenges. Along with the physical toolkit code, new algorithms were developed to cope with these challenges [109, 104, 100]. Carmel [53], a toolkit with similar properties to FSM, but with an algorithm for EM training of weighted finite-state transducers [38], was also useful in this regard.
These toolkits and others like them were quite helpful for the community, as the state of NLP models had greatly expanded to include probabilistic models, and without a decent weighted toolkit around, the only option to test these models was nose-to-the-grindstone coding of individual systems. Subsequent to their invention and release a number of published results featured the use of these toolkits and transducer formalisms in model design [110, 123, 74, 24, 137, 83, 79, 94].

Returning to our translation example, we can see that some word sequences are more likely than others. Additionally, some translational correspondences are more likely than others. And perhaps we want to encourage the word reordering model to preserve English word order whenever possible. This information can be encoded using weights on the various rules. Consider the language model A: both "green ball" and "green horse" are valid noun phrases, but the former is more likely than the latter. In the absence of any other evidence, we would expect to see "ball" after "green" more often than we would see "horse". However, we would expect to see "green horse" more often than "green the". By looking at a large amount of real English text, we can collect statistics on how often various words follow "green", and then calculate a probability for each possible word. Such weights are added to the relevant rules as follows:

    q_green --ball/0.8--> q_ball      q_green --horse/0.19--> q_horse      q_green --the/0.01--> q_the

The product of the weights of each rule used for a given sentence is the weight of the whole sentence. Notice that we did not exclude the very unlikely sequence "green the". However, we gave that rule very low weight, so any sentence with that sequence would be quite unlikely. This represents an increase in power over the unweighted model. Previously, we would not want to allow such a phrase, as it almost certainly would be wrong. Now we can still acknowledge the extreme unlikeliness of the phrase, but allow for the rare situation where it is the most likely choice available after other possibilities have been eliminated.

We demonstrate our preferences in the other transducers in the chain through weights similarly. The reordering transducer, for instance, should favor some reorderings and disfavor others. An English noun phrase that contains an adjective (such as "green ball") would typically be translated with the noun first in Spanish. However, no reordering would be done for a noun phrase without adjectives (such as "the ball"). The following weighted transducer rules reflect these preferences:

    q --the:ε/0.1--> q_the                 q --green:ε/0.7--> q_green
    q --the:the/0.9--> q                   q --green:green/0.3--> q
    q_the --ball:"ball the"/1--> q         q_green --ball:"ball green"/1--> q
    q_the --green:"green the"/1--> q       q_green --green:"green green"/1--> q
    q_the --the:"the the"/1--> q           q_green --the:"the green"/1--> q

Figure 1.1: The general noisy channel model. The model is proposed in the "story" direction but used in the "interpretation" direction, where a noisy input is transformed into the target domain and then validated against the recognizer.

The translation model we have built is getting more and more powerful. It has begun to take on the shape of a very useful and often-applied general model of transformation—the noisy channel model [118]. The key principle behind this model, depicted in Figure 1.1, is the separation of a sensible transformation task into the cascade of a transformation task (without regard to sensibility of the output) followed by a recognition task that only permits sensible output.
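As a small illustration of how these weights combine, the sketch below scores a sentence as the product of the weights of the rules used to accept it, and then combines that language-model weight with a channel weight in the noisy-channel fashion just described. Only the three weights on rules leaving q_green are taken from the text above; every other number, name, and table entry is invented for illustration.

```python
# Invented bigram rule weights for the toy language model A,
# written as (previous word, next word) -> weight.
LM = {("<s>", "the"): 1.0, ("the", "green"): 0.4, ("the", "ball"): 0.5,
      ("green", "ball"): 0.8, ("green", "horse"): 0.19, ("green", "the"): 0.01}


def lm_weight(words):
    """Weight of a sentence under A: the product of the weights of the rules used."""
    score = 1.0
    for prev, nxt in zip(["<s>"] + words, words):
        score *= LM.get((prev, nxt), 0.0)   # a missing rule rejects the sentence outright
    return score


print(lm_weight(["the", "green", "ball"]))    # 1.0 * 0.4 * 0.8  = 0.32
print(lm_weight(["the", "green", "horse"]))   # 1.0 * 0.4 * 0.19 = 0.076
print(lm_weight(["the", "green", "the"]))     # 1.0 * 0.4 * 0.01 = 0.004: allowed, but very unlikely


def noisy_channel_score(english, channel_weight):
    """Noisy channel combination: language-model weight times the weight of the
    reordering and translation rules used (passed here as one number per derivation)."""
    return lm_weight(english) * channel_weight


# e.g. delay "green" (0.7), keep "the" in place (0.9), translate the->la with an
# invented weight of 0.4 and the other words with weight 1.0:
print(noisy_channel_score(["the", "green", "ball"], 0.7 * 0.9 * 0.4 * 1.0 * 1.0))
```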
By simply substituting the appropriate transducer or transducers into our chain we can perform diverse tasks without altering the underlying machinery. If we remove the permutation and translation transducers from our model, and instead add a word-to-phoneme transducer followed by a phoneme-to-speech signal transducer, we can perform speech recognition on some given speech signal input. If we replace the word language model with a part-of-speech language model and the transducer cascade with a tag-to-word transducer we can perform part-of-speech tagging on some word sequence input. In all of these cases the only work required to transform our translation machine into a speech signal-recognition machine or a grammar markup machine is that of rule set construction, which is in its essence pure model design. The underlying mechanics of operation remain constant. This illustrates the wide power of weighted finite-state toolkits and explains why they have been so useful in many research projects.

1.5 Going further

We must acknowledge that our model is still a simplification of "real" human translation, and, for the foreseeable future, this will continue to be the case, as we are limited by practical elements, such as available computational power and data. This has long been a concern of model builders, and thus in every generation, compromises are made. In the 1950s, severe memory limitations precluded anything so general as a finite-state-machine toolkit. In the 1980s, few useful corpora existed and normally available computational power was still too limited to support the millions of words necessary to adequately train a weighted transducer cascade. In the present age we have more computational power and large available corpora. We should consider whether there are yet more limitations imposed by the technology of the previous era that can now be relaxed.

One particular deficiency that arises, particularly in our translation model, is the requirement of linear structure, that is, a sequential left-to-right processing of input. Human language translation often involves massive movement depending on syntactic structure. In the previous examples we were translating between English and Spanish, two predominately Subject-Verb-Object (SVO) languages, but what if we were translating between English and Japanese? The former is SVO but the latter is SOV. The movement of an arbitrarily long object phrase after an arbitrarily long verb phrase (or vice-versa) is simply not feasible with a formalism that at its fundamental core must process its input in linear order with a finite number of states.

The disconnect between linear processes and the more hierarchical nature of natural language is an issue that has long been raised by linguists [22]. However, practical considerations and a realization that a limited model with sufficient data was good enough for the time being led empirical research away from syntactic models for nearly half a century.

A survey of recent work in NLP shows there is evidence that syntax-rich empirical models may now be practical. Recent work with successful practical results in language modeling [20], syntactic parsing [25], summarization [77], question answering [37], and machine translation [136, 47, 31], to name a few, clearly indicates there are gains to be made in syntactic NLP research. One common thread these papers have, however, is that the results behind the presented models were obtained via custom construction of one-off systems. Weighing the benefits of the syntax translation models of Yamada and Knight [136] and Galley et al. [47], for example, requires access to the projects' respective code bases or re-engineering efforts.
Consequently, few if any such comparative studies exist, and the cycle of model improvement is generally only possible for the original model engineers. Such a limitation is harmful to the community.

1.6 Better modeling through tree transducers

Figure 1.2: An (English, Spanish) tree pair whose transformation we can capture via tree transducers:
    S(NP(NNP(John)) VP(VB(threw) NP(DT(the) JJ(large) JJ(green) NN(ball))))
    S(NP(NNP(John)) VP(VB(tiró) NP(DT(la) JJ(gran) NN(pelota) JJ(verde))))

The good news is many of these models can be expressed as a cascade of finite-state tree transducers. Tree transducers were invented independently by Rounds [116] and Thatcher [128] to support Chomskyan linguistic theories [23]. The theory community then conducted extensive study into their properties [48, 49, 27] without much regard for the original linguistic motivation. Weighted tree automata, the direct analogue of weighted string automata, have been studied as a formal discipline [8, 41], as have weighted regular tree grammars [1]. This latter formalism generates the same class of tree languages as weighted tree automata, and closely resembles weighted context-free grammars, so it is the preferred formalism used in this thesis.

Let us consider how syntax can improve our previous toy translation model. Imagine we are now trying to translate between English and Spanish sentences annotated with syntactic trees, such as those in Figure 1.2. We can accomplish this with weighted regular tree grammars and top-down tree transducers, which are introduced formally in Chapter 2, but which we describe informally now, by way of comparison to the previously described finite-state (string) automata and transducers.

String automaton rules are of the form q --a/w--> r, indicating that, with weight w, a machine in state q writes symbol a and changes to state r, where it continues, writing further results to the right of a in the output string. Tree grammar rules, on the other hand, are of the form q --w--> τ. This rule indicates that, with weight w, a machine in state q writes tree τ. Some of the leaves of τ may be states such as r. If they are, further rules are used to write trees at these points in τ.

String transducer rules are of the form q --a:b/w--> r, indicating that, with weight w, a machine in state q reading symbol a writes symbol b and changes to state r, where it continues by processing the symbol to the right of a in the input string and writing the result to the right of b in the output string. Tree transducer rules, on the other hand, are of the form q.γ(x_1 ... x_n) --w--> τ. Such a rule indicates that, with weight w, a machine in state q reading a tree that has a root label γ and n immediate children writes tree τ. Some of the leaves of τ are of the form r.x_k, indicating that the kth immediate child of the tree should be processed in state r and the result written at that location in the output tree.

Since we are now considering syntax, we are no longer simply transforming between surface strings in English and Spanish, but between syntactic trees. Consider the power our transducer chain has now that we have introduced this formalism:

• Rather than recognizing valid English phrases, our new tree grammar A′, the descendant of the string automaton A, must now recognize valid English trees. Let q be the initial state of A′.
Here’swhatsomerulesfromA 0 couldlooklike: 14 S q np-sing q vp-sing q 1 − → NP q dt q nn q np-sing 0.7 −−→ NP q dt q jj q nn q np-sing 0.2 −−→ NP q dt q jj q jj q nn q np-sing 0.1 −−→ DT the q dt 0.8 −−→ DT a q dt 0.2 −−→ JJ green q jj 0.8 −−→ JJ large q jj 0.2 −−→ JJ blue q jj 0.1 −−→ NN ball q nn 0.7 −−→ Already we can see some ways in which this grammar is more powerful than its string automaton ancestor. Words are conditioned on their parts of speech, so a more appropriate distribution for particular word classes can be defined. The top-downapproachenableslong-distancedependenciesandglobalrequirements. For example, the rule q− → S(q np-sing q vp-sing ) indicates the sentence will have both a nounphraseandverbphrase,andthatbothwillbesingular,eventhoughitisnot yetknownwherethesingularnounandverbwillappearwithinthosephrases. • Although we can now get quite creative with our permutation model, we can demonstratetheincreasedpoweroftreetransducersbydesigningaB 0 thathasthe sameideaexpressedinB,i.e.,allowre-orderingonelevelatatime. Theinitialstate israndthesearesomerules: 15 NP x 0 x 1 x 2 x 3 NP r dt .x 0 r nn .x 3 r jj .x 1 r jj .x 2 0.4 −−→ r np S x 0 x 1 S r np .x 0 r vp .x 1 0.7 −−→ r NP x 0 x 1 x 2 x 3 NP r dt .x 0 r jj .x 1 r nn .x 3 r jj .x 2 0.3 −−→ r np S x 0 x 1 S r vp .x 1 r np .x 0 0.3 −−→ r NP x 0 x 1 x 2 x 3 NP r dt .x 0 r jj .x 1 r nn .x 3 r jj .x 2 0.1 −−→ r np The reordering at the top of the tree, swapping the positions of arbitrarily large noun phrases and verb phrases, does sentence-wide permutation in a single step. Doing the equivalent with string-based transducers would require a unique state for every possible noun phrase, making such an operation generally impossible. Thelower-levelinversioncouldfeasiblybeaccomplishedwithastringtransducer, but it would still require states that encode the sequence of adjectives seen until thenounattheendofthephraseisreached. • We can use the power of syntax to our advantage in the design of the translation transducer,C 0 . RecallthatChasasetofword-to-wordrulesthatdonottakecontext intoaccount. WecaneasilytakecontextintoaccountinC 0 byourselectionofstates. ThefollowingselectionofrulesfromC 0 indicatethatallwordsinthenounphrase willbesingularandfeminine,ultimatelyconstraining“the”totranslateas“la”so astomatchthetranslationof“ball”as“pelota.” 16 O S T determinization Yes Yes Yes No K-best N/A Yes N/A Alg intersection Yes Yes Yes PoC EMtraining Yes Alg Table1.1: Availabilityofalgorithmsandimplementationsforvariousclassesofautomata. Yes = an algorithm is known and an implementation is publicly available. Alg = an algorithm is known but no implementation is known to be available. PoC= a proof of concept of the viability of an algorithm is known but there is no explicit algorithm. No =nomethodshavebeendescribed. O S T composition Yes Yes PoC PoC domainandrangeprojection Yes Yes PoC PoC application Yes Yes PoC PoC EMtraining Yes Alg Table 1.2: Availability of algorithms and implementations for various classes of trans- ducers,usingthekeydescribedinTable1.1. NP x 0 x 1 NP s fem .x 0 s fem .x 1 1 − → s fem DT x 0 DT s fem .x 0 1 − → s fem NN x 0 NN s fem .x 0 1 − → s fem the la 1 − → s fem ball pelota .15 −−→ s fem 1.7 Algorithmsfortreetransducersandgrammars Before deciding to build a new toolkit it is useful to take stock of the availability of needed algorithms. Tables 1.1 and 1.2 identify a set of useful operations for automata and transducers. 
The operations under consideration all have efficient algorithms that have been implemented for weighted string automata and transducers, primarily in FSM, OpenFst, and Carmel. For tree automata, the situation is more dire—algorithms for determinization and intersection of unweighted tree automata exist and have been implemented in Timbuk [50], but as there were no previous weighted tree automata toolkits, it is understandable that there would be no implementations of relevant algorithms. No tree transducer software had implementations of composition, projection, or application, and perhaps it is understandable, as these operations generally have not been described in algorithmic manners by practitioners of formal language theory, who do not need such details for their proofs. A good example which illustrates this point of view is the classic construction for unweighted top-down tree transducer composition by Baker [6]. Her construction essentially directs that a transducer rule in a composition of two transducers, M_1 and M_2, be formed by combining a rule q.σ --> u from the first, where u is some tree, with a state p from the second, to form (q, p).σ --> t, for every t that is a transformation of u starting in p by M_2.¹ Such a construction is certainly correct, and it is known that under certain restrictions on M_1 and M_2 a finite set of t can be found. Furthermore, in the weighted extension of this construction by Maletti [89] the additional constraint that the weight of the transformation from u to t by M_2 be calculable is known to be determinable for the closure cases. But while this is sufficient for proving theorems it does not suffice for building software. The means of finding every t for u is not described; the construction demonstrates it can be built, but does not describe a practical algorithm for how it should be built. The status of such operations is indicated in Tables 1.1 and 1.2 as "proofs of concept", in that no concrete, implementable algorithms have been shown for their operation. I have thus delved into these declarative constructs and exposed algorithms useful to the software-writing community that implement these operations. And, in some cases, I have designed new algorithms, or extended existing algorithms or declarative constructs to the weighted tree automaton and weighted extended top-down tree transducer case.

¹ This is a significant paraphrase from page 195 of [6], as we don't wish to introduce detailed terminology just yet.

1.8 Building a new toolkit

Weighted tree machines are a powerful formalism for natural language processing models, and as the example above indicates, constructing useful models is not difficult. However, prior to this thesis there was one main roadblock preventing progress in weighted tree machine modeling of the scale seen for weighted string machines: no appropriate toolkit existed. There were some existing tree automata and transducer toolkits [15, 50, 58, 34] but these are unweighted, a crucial omission in the age of data-driven modeling. Additionally, they are chiefly aimed at the logic and automata theory community and are not suited for the needs of the NLP community.

To guide me in this work I followed the lead of previous toolkits. I have chosen simple design semantics and a small set of desired operations. As noted above, some algorithms for these operations already existed and could more or less be directly implemented, some had to be inferred from declarative constructs, and some required novel algorithm development. The resulting toolkit, Tiburon, can read, write, and manipulate large weighted tree-to-tree and tree-to-string transducers, regular tree grammars, context-free grammars, and training corpora.
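To make the "missing step" of Section 1.7 concrete, here is a hedged sketch of the operation the declarative constructions leave unspecified: given a state and an input tree, enumerate every weighted output tree a top-down tree transducer can produce, using rules of the form described in Section 1.6. The tree encoding, the tiny rule set (loosely echoing B′), and the brute-force recursion are all invented for illustration; they assume finitely many outputs per input and are not the algorithms developed in Chapters 2 and 4.

```python
from itertools import product

# A tree is a tuple ("label", subtree, subtree, ...).  In a rule's right side,
# ("?", state, i) means "process the i-th child of the matched input node in `state`".
RULES = {
    ("r", "S"): [
        (0.7, ("S", ("?", "r_np", 0), ("?", "r_vp", 1))),   # keep NP VP order
        (0.3, ("S", ("?", "r_vp", 1), ("?", "r_np", 0))),   # swap NP and VP
    ],
    ("r_np", "NP"): [(1.0, ("NP",))],   # toy rules: copy these leaves unchanged
    ("r_vp", "VP"): [(1.0, ("VP",))],
}


def apply_state(state, tree, rules):
    """Yield every (weight, output tree) the transducer can produce from `tree` in `state`."""
    label, children = tree[0], tree[1:]
    for rule_weight, template in rules.get((state, label), []):
        for w, out in instantiate(template, children, rules):
            yield rule_weight * w, out


def instantiate(template, children, rules):
    if template[0] == "?":                  # a (state, child) slot: recurse into the input
        _, state, i = template
        yield from apply_state(state, children[i], rules)
    else:                                   # an output symbol: expand each slot beneath it
        label, slots = template[0], template[1:]
        expansions = [list(instantiate(s, children, rules)) for s in slots]
        for combo in product(*expansions):
            weight = 1.0
            for w, _ in combo:
                weight *= w
            yield weight, (label,) + tuple(t for _, t in combo)


for weight, output in apply_state("r", ("S", ("NP",), ("VP",)), RULES):
    print(weight, output)   # 0.7 ('S', ('NP',), ('VP',))  and  0.3 ('S', ('VP',), ('NP',))
```

Even this toy version shows why the declarative statement "for every t that is a transformation of u" hides real work: the enumeration must be organized, shared, and bounded before it can serve as the composition or application step of a practical toolkit.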
To summarize, these are the contributions provided in my thesis:

• I present algorithms for intersection of weighted regular tree grammars, composition and projection of weighted tree transducers, and application of weighted tree transducers to grammars, operations that were previously only declarative proofs of concept. I also demonstrate the connection between classic parsing algorithms and one particular kind of application: that of tree-to-string transducers to strings. I provide these algorithms for weighted extended tree transducers, a formalism that is more useful to the NLP community than classical weighted tree transducers, though somewhat neglected by formal language theory.

• I present a novel algorithm for practical determinization of acyclic weighted regular tree grammars and show the algorithm's empirical benefits on syntax machine translation and parsing tasks. In joint work with colleagues from formal language theory we prove correctness and examine the applicability of the algorithm to some classes of cyclic grammars.

• I present novel algorithms for application of tree transducer cascades to grammar input, as well as more efficient on-the-fly versions of these algorithms that take advantage of lazy evaluation to avoid unnecessary exploration. I demonstrate the performance advantages of these on-the-fly algorithms.

• I demonstrate the use of weighted tree transducers as formal models in the improvement of the state of the art in syntax machine translation, by using an EM training algorithm for weighted tree transducers to improve automatically induced word alignments in bilingual training corpora, leading to significant BLEU score increases in Arabic-English and Chinese-English MT evaluations.

• I provide Tiburon, a tree automaton and transducer toolkit that provides these fundamental operations so that complicated tree-based models may easily be represented. The toolkit has already been used in several research projects and theses [56, 113, 126, 12, 125].

Here is an outline of the remainder of this thesis:

• Chapter 2 provides a formal basis for the remainder of the work. It defines key structures such as trees, grammars, and transducers, and presents algorithms for basic tree transducer and grammar operations.

• Chapter 3 presents the first practical determinization algorithm for weighted regular tree grammars. I outline the development of this algorithm, show, in joint work, a proof of its correctness, and demonstrate its effectiveness on two real-world experiments.

• Chapter 4 presents novel methods of efficient inference through tree transducers and transducer cascades. It also contains a detailed description of on-the-fly inference algorithms, which can be faster and use less memory than traditional inference algorithms, as demonstrated in empirical experiments on a machine translation cascade.

• Chapter 5 presents a method of using tree transducer training algorithms to accomplish significant improvements in state-of-the-art syntax-based machine translation.

• Chapter 6 presents Tiburon, a toolkit for manipulating tree transducers and grammars. Tiburon contains implementations of many of the algorithms presented in previous sections.

• Chapter 7 concludes this work with a high-level view of what has been presented in the previous chapters and outlines useful future directions.

Chapter 2

Weighted regular tree grammars and weighted tree transducers

In this chapter we introduce basic terminology used throughout this thesis. In particular we define the formal tree grammars and tree transducers that are the fundamental structures this thesis is concerned with. We also make note of basic algorithms for combining, transforming, and manipulating these machines.
Althoughessentiallyallofthe algorithms in this chapter were known previous to this work, much of it has not been presented inthis way, designed forthose seeking to providean implementation. Addi- tionally,somealgorithmsknowntothecommunityas“folklore”andvariousextensions tomoregeneralformalismsarepresented. Becauseagooddealofformalnotionsarepresentedinthischapter,particularlyearly on,itmaybehoovethereadertoreadthischapterlightly,andreturntorelevantsections whentheprecisedefinitionofatermorpieceofsymbologyisneeded. 23 2.1 Preliminaries Much of the notation we use is adapted from Section 2 of F¨ ul¨ op and Vogler [45], some quiteextensively. AdditionalnotationisadaptedfromSection3.1.1ofMohri[101]. 2.1.1 Trees A ranked alphabet is a tuple (Σ,rk) consisting of a finite setΣ and a mapping rk :Σ→N which assigns a rank to every member ofΣ. We refer to a ranked alphabet by its carrier set,Σ. Frequently used ranked alphabets areΣ,Δ, andΓ. For every k∈N,Σ (k) ⊆Σ is thesetofthoseσ∈Σsuchthatrk(σ)= k. Whenitisclear,wewriteσ (k) asshorthandfor σ∈Σ (k) . We let X={x 1 ,x 2 ,...} be a set of variables; X k ={x 1 ,...,x k },k∈N. We assume that X is disjoint from any ranked alphabet used in this work. A tree t∈ T Σ is denoted σ(t 1 ,...,t k ) where k∈N, σ∈ Σ (k) , and t 1 ,...,t k ∈ T Σ . 1 For σ∈ Σ (0) we write σ∈ T Σ as shorthand forσ(). For every set A disjoint fromΣ, let T Σ (A) = T Σ∪A , where for all a∈ A, rk(a)= 0. We define the positions of a tree t=σ(t 1 ,...,t k ), for k≥ 0,σ∈Σ (k) , and t 1 ,...,t k ∈ T Σ , as a set pos(t)⊂N ∗ such that pos(t)={ε}∪{iv| 1≤ i≤ k,v∈ pos(t i )}. The setofleafpositionsleaves(t)⊆ pos(t)arethosepositionsv∈ pos(t)suchthatfornoi∈N, vi∈ pos(t). We presume standard lexicographic orderings< and≤ on pos. Let t,s∈ T Σ andv∈ pos(t). Thelabeloftatpositionv,denotedbyt(v),thesubtreeoftatv,denotedby t| v ,andthereplacementatvbys,denotedbyt[s] v ,aredefinedasfollows: 1 Note that these are ranked trees, but in natural language examples we may see the same symbol with more than one rank. For example, both the phrases “boys” and “the boys” are noun phrases, and one frequently sees their tree representations as, respectively, NP(boys) and NP(the boys). From a formal perspective we should distinguish the parent nonterminals, writing, e.g., NP 1 (boys) and NP 2 (the boys), where NP 1 ∈ Σ (1) and NP 2 ∈ Σ (2) , and indeed these two parent symbols are interpreted differently, but to simplifynotationandremainconsistentwithcommonpractices,theywillbothappearonthepageasNP. 24 σ γ α β z η z α α (a)t β z (b)t| 1.2 β β α (c)s σ γ α β z η β β α α α (d)t[2.1] s Figure2.1: TreesfromExample2.1.1. 1. Foreveryσ∈Σ (0) ,σ(ε)=σ,σ| ε =σ,andσ[s] ε = s. 2. For every t = σ(t 1 ,...,t k ) such that k = rk(σ) and k≥ 1, t(ε) = σ, t| ε = t, and t[s] ε = s. For every 1 ≤ i ≤ k and v ∈ pos(t i ), t(iv) = t i (v), t| iv = t i | v , and t[s] iv =σ(t 1 ,...,t i−1 ,t i [s] v ,t i+1 ,...,t k ). Thesizeofatreet,size(t),is|pos(t)|,thecardinalityofitspositionset. Theheightofatree t,height(t),is1ifitssizeis1;else,height(t)= 1+max(height(t| i )| 1≤ i≤ rk(t(ε))). Theyield setofatreet,ydset(t),isthesetoflabelsofitsleaves,thusydset(t)={t(v)| v∈ leaves(t)}. Example2.1.1 LetΣ={α (0) ,β (1) ,γ (2) ,η (2) ,σ (3) }. LetA={z}. Lett=σ(γ(α,β(z)),η(z,α),α). Then,t∈ T Σ (A),pos(t)={ε,1,2,3,1.1,1.2,2.1,2.2,1.2.1},leaves(t)={1.1,1.2.1,2.1,2.2,3}. t(2) = η, and t| 1.2 = β(z). Let s = β(β(α)). Then, t[2.1] s = σ(γ(α,β(z)),η(β(β(α)),α),α), size(t) = 9, height(t) = 4, size(t[2.1] s ) = 11, height(t[2.1] s ) = 5, and ydset(t) ={α,z}. 
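These definitions are mechanical enough to transcribe directly into code. The following Python sketch is my own illustrative encoding (positions are written as tuples of 1-based child indices rather than the dotted strings used above); it reproduces the quantities computed in Example 2.1.1.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Tree:
    label: str
    children: Tuple["Tree", ...] = ()

def positions(t):
    """pos(t): the empty position plus i.v for every position v of the i-th child."""
    yield ()
    for i, c in enumerate(t.children, start=1):
        for v in positions(c):
            yield (i,) + v

def label_at(t, v):          # t(v)
    return label_at(t.children[v[0] - 1], v[1:]) if v else t.label

def subtree(t, v):           # t|_v
    return subtree(t.children[v[0] - 1], v[1:]) if v else t

def replace(t, v, s):        # t[s]_v
    if not v:
        return s
    kids = list(t.children)
    kids[v[0] - 1] = replace(kids[v[0] - 1], v[1:], s)
    return Tree(t.label, tuple(kids))

def size(t):
    return 1 + sum(size(c) for c in t.children)

def height(t):
    return 1 if not t.children else 1 + max(height(c) for c in t.children)

def ydset(t):                # set of labels of leaves
    return {t.label} if not t.children else set().union(*(ydset(c) for c in t.children))

# Example 2.1.1: t = sigma(gamma(alpha, beta(z)), eta(z, alpha), alpha), s = beta(beta(alpha))
a, z = Tree("alpha"), Tree("z")
t = Tree("sigma", (Tree("gamma", (a, Tree("beta", (z,)))), Tree("eta", (z, a)), a))
s = Tree("beta", (Tree("beta", (a,)),))
assert len(list(positions(t))) == size(t) == 9
assert label_at(t, (2,)) == "eta"
assert subtree(t, (1, 2)) == Tree("beta", (z,))
assert height(t) == 4
assert size(replace(t, (2, 1), s)) == 11 and height(replace(t, (2, 1), s)) == 5
assert ydset(t) == {"alpha", "z"}
```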
For greaterclarity,t,s,andt[2.1] s arereproducedinamore”treelike”fashioninFigure2.1. 25 2.1.2 Semirings Asemiring(W,+,·,0,1)isanalgebraconsistingofacommutativemonoid(W,+,0)anda monoid(W,·,1)where·distributesover+,0, 1,and0absorbs·,thatis,w·0= 0·w= 0 for any w∈W. A semiring is commutative if· is commutative. A semiring is complete if there is an additional operator L that extends the addition operator+ such that for anycountableindexsetI andfamily(w i ) i∈I ofelementsofW, L calculatesthepossibly infinitesummationof(w i ) i∈I . Wewrite M i∈I w i ratherthan L (w i ) i∈I . L hasthefollowing properties: L extends+: M i∈I w i = 0if|I|= 0and M i∈I w i = w i if|I|= 1. L isassociativeandcommutative: M i∈I w i = M j∈J M i∈I j w i foranydisjointpartitionI= [ j∈J I j . ·distributesover L frombothsides: w· M i∈I w i = M i∈I (w·w i ) and M i∈I w i ·w= M i∈I (w i ·w) foranyw∈W. A complete semiring can be augmented with the unary closure operator∗, defined by w ∗ = ∞ M i=0 w i for any w∈W, where w 0 = 1 and w n+1 = w n · w. Unless otherwise noted, henceforth semirings are presumed to be commutative and complete. We refer toasemiringbyitscarriersetW. Somecommoncommutativeandcompletesemirings, arethe: • Booleansemiring: ({0,1},∨,∧,0,1) 26 • probabilitysemiring: (R + ∪{+∞},+,·,0,1) • tropicalsemiring: (R∪{−∞,+∞},min,+,+∞,0) IntheBooleansemiring, 0 ∗ = 1 ∗ = 1. Inthetropicalsemiring, w ∗ = 0forw∈R + and w ∗ =−∞ otherwise. In the probability semiring, w ∗ = 1 1−w for 0≤ w < 1 and w ∗ =+∞ otherwise. Furthermaterialonsemiringsmaybefoundin[35,57,52]. Example2.1.2 In the probability semiring, 0.3+ 0.5 = 0.8 and 0.3· 0.5 = 0.15. In the tropical semiring, 0.3+0.5= 0.3 and 0.3·0.5= 0.8. In the Boolean semiring, 0+1= 1 and0·1= 0. 2.1.3 Treeseriesandweightedtreetransformations A tree series overΣ andW is a mapping L : T Σ →W. For t∈ T Σ , the element L(t)∈W is called the coefficient of t. The support of a tree series L is the set supp(L) ⊆ T Σ where t ∈ supp(L) iff L(t) is nonzero. A weighted tree transformation over Σ, Δ, and W is a mapping τ : T Σ × T Δ → W. The inverse of a weighted tree transformation τ : T Σ × T Δ →W is the weighted tree transformationτ −1 : T Δ × T Σ →W where, for every t∈ T Σ and s∈ T Δ , τ −1 (s,t) = τ(t,s). The domain of a weighted tree transforma- tion τ : T Σ × T Δ →W is the tree series dom(τ) : T Σ →W where, for every t∈ T Σ , dom(τ)(t)= L s∈T Δ τ(t,s). Therangeofτ isthetreeseriesrange(τ) : T Δ →Wwhere,for every s∈ T Δ , range(τ)(s)= L t∈T Σ τ(t,s). The identity of a tree series L : T Σ →W is the weightedtreetransformationı L : T Σ ×T Σ →Wwhere,foreverys,t∈ T Σ ,ı L (s,t)= L(s)if s= tand0otherwise. Thecompositionofaweightedtreetransformationτ : T Σ ×T Δ →W 27 γ α α , ξ λ .4 γ α α , ξ ξ λ .6 σ α β α α , ν ξ λ λ 1 (a)Originalτ ξ λ , γ α α .4 ξ ξ λ , γ α α .6 ν ξ λ λ , σ α β α α 1 (b)τ −1 γ α α 1 σ α β α α 1 (c)dom(τ) ξ λ .4 ξ ξ λ .6 ν ξ λ λ 1 (d)range(τ) Figure2.2: Non-zeroentriesofweightedtreetransformationsandtreeseriesofExample 2.1.3. Foreachrow,theelement(s)ontheleftmapstothevalueontheright,andallother elementsmapto0. andaweightedtreetransformationμ : T Δ ×T Γ →Wistheweightedtreetransformation τ;μ : T Σ ×T Γ →Wwhereforeveryt∈ T Σ andu∈ T Γ ,τ;μ(t,u)= L s∈T Δ τ(t,s)·μ(s,u). Example2.1.3 Let Σ be the ranked alphabet defined in Example 2.1.1. LetW be the probability semiring. Let L : T Σ → W be a tree series such that L(γ(α,α)) = .3, L(σ(α,β(α),α)) = .5, L(α) = .2 and L(t) = 0 for all other t ∈ T Σ . 
Then, supp(L) = {γ(α,α), σ(α,β(α),α), α} andı L : T Σ × T Σ →W is a weighted tree transformation such thatı L (γ(α,α),γ(α,α))=.3,ı L (σ(α,β(α),α),σ(α,β(α),α))=.5,ı L (α,α)=.2, andı L (t,s)= 0 forallother(t,s)∈ T Σ ×T Σ . LetΔ={λ (0) ,ξ (1) ,ν (2) }bearankedalphabet. Letτ : T Σ ×T Δ →Wbeaweightedtree transformationsuchthatτ(γ(α,α),ξ(λ))=.4,τ(γ(α,α),ξ(ξ(λ)))=.6,τ(σ(α,β(α),α),ν(ξ(λ), λ))= 1, andτ(t, s)= 0 for all other (t, s)∈ T Σ × T Δ . Then, the non-zero members of the weighted tree transformationτ −1 : T Δ ×T Σ →W and tree series dom(τ) : T Σ →W and range(τ) : T Δ →WarethosepresentedinFigure2.2. 28 2.1.4 Substitution Let A and B be sets. Let ϕ : A→ T Σ (B) be a mapping. ϕ may be extended to the mapping ϕ : T Σ (A)→ T Σ (B) such that for a∈ A, ϕ(a) = ϕ(a) and for k≥ 0, σ∈ Σ (k) , andt 1 ,...,t k ∈ T Σ (A),ϕ(σ(t 1 ,...,t k ))=σ(ϕ(t 1 ),...,ϕ(t k )). Weindicatesuchextensionsby describingϕasasubstitutionmappingandthenabusenotation, conflating, e.g.,ϕandϕ bothtoϕwherethereisnoconfusion. Example2.1.4 LetΣbetherankedalphabetdefinedinExample2.1.1. LetA={z,y}and B={v,w}. Letϕ : A→ T Σ (B)beasubstitutionmappingsuchthatϕ(z)= vandϕ(y)= w. Lett=η(γ(α,z),β(y)). Then,ϕ(t)=η(γ(α,v),β(w)). 2.2 Weightedregulartreegrammars Inmuchofthetextthatfollowswerefertopreviouswork,butnotethatourconstructions aresomewhatdifferent. Wewillcitethatwork,notedifferences,andnotetheimplications of these differences as we come to them. The algorithms we present are intended to be closetopseudocodeanddirectlyimplementable;exceptionsarenoted. Definition2.2.1(cf. AlexandrakisandBozapalidis[1]) A weighted regular tree gram- mar(wrtg)oversemiringWisa4-tuple G= (N,Σ,P,n 0 )where: 1. N isafinitesetofnonterminals,withn 0 ∈ N thestartnonterminal 2. Σistheinputrankedalphabet. 3. P is a tuple (P 0 ,π), where P 0 is a finite set of productions, each production p of the form n− → u, n∈ N, u∈ T Σ (N), andπ : P 0 →W is a weight function of the 29 productions. Withintheseconstraintswemay(andusuallydo)refertoPasafinite set of weighted productions, each production p of the form n π(p) −−−→ u. We denote subsets of P as follows: P n ={p∈ P| p is of the form n− → u}. We extend all definitions of operations on trees from Section 2.1.1 to productions such that, e.g., size(p)= size(u). WeassociatePwithG,suchthat,e.g.,p∈ Gisinterpretedtomean p∈ P. Unlike the definition by Alexandrakis and Bozapalidis [1] we in general allow chain productionsinawrtg,thatis,productionsoftheformn i w − → n j ,wheren i ,n j ∈ N. For wrtg G = (N,Σ,P,n 0 ), s, t, u∈ T Σ (N), n∈ N, and p∈ P of the form n w − → u∈ P, we obtain a derivation step from s to t by replacing some leaf nonterminal in s labeled n with u. Formally, s⇒ p G t if there exists some v∈ pos(s) such that s(v)= n and s[u] v = t. We say this derivation step is leftmost if, for all v 0 ∈ leaves(s) where v 0 < v, s(v 0 )∈ Σ. Exceptwherenotedandneeded,wehenceforthassumeallderivationstepsareleftmost and drop the subscript G. If, for some m∈N, p i ∈ P, and t i ∈ T Σ (N) for all 1≤ i≤ m, n 0 ⇒ p 1 t 1 ...⇒ p m t m , we say the sequence d = (p 1 ,...,p m ) is a derivation of t m in G and thatn 0 ⇒ ∗ t m . Theweightofdiswt(d)=π(p 1 )·...·π(p m ),theproductoftheweightsofall occurrencesofitsproductions. Wemayloosenthedefinitionofaderivationandspeakof, forexample,aderivationfromnusingP 0 ,whereP 0 ⊆ P,orassertthatthisderivationexists by saying that n⇒ ∗ t m using P 0 . In such cases one may imagine this to be equivalent to aderivationinsomewrtgG 0 = (N,Σ,P 0 ,n). 30 The tree series represented by G is denoted L G . 
For t∈ T Σ and n∈ N, the tree series L G (t) n isdefinedasfollows: L G (t) n = M derivation dof tfrom nin G wt(d) . Then L G (t) = L G (t) n 0 . Note that this tree series is well defined, even though this summationmaybeinfinite(sincechainproductionsareallowed),becauseWispresumed complete. We call a tree series L recognizable if there is a wrtg G such that L G = L; in such cases we then call G a wrtg representing L G . Two wrtgs G 1 and G 2 are equivalent if L G 1 = L G 2 . Example2.2.2 Figure 2.3 depicts P 1 for a wrtg G 1 = (N 1 ,Σ,P 1 ,n S ) over the probability semiring with production id numbers. N 1 andΣ may be inferred from P 1 . Note that production12isachainproduction. AderivationofthetreeS(NP(somestudents)VP(eat NP(redmeat)))is(1,2,7,5,8,11,13,15)andtheweightofthisderivationis.00084. 2.2.1 Normalform AwrtgGisinnormalformifeachproductionp∈ Pisinnormalform. Aproductionpof theformn w − → uisinnormalformifuhasoneofthefollowingforms: 1. u∈Σ (0) 2. u∈ N 3. u=σ(n 1 ,...,n k )wherek≥ 1,σ∈Σ (k) ,andn 1 ,...,n k ∈ N. 31 1. S n NP-SUBJ n VP n S 1 − → 2. NP n DT n NNS-SUBJ n NP-SUBJ .4 − → 3. NP n NNS-SUBJ n NP-SUBJ .6 − → 4. dogs n NNS-SUBJ .7 − → 5. students n NNS-SUBJ .3 − → 6. the n DT .8 − → 7. some n DT .2 − → 8. VP eat n NP-OBJ n VP .7 − → 9. VP chase n NP-OBJ n VP .25 −−→ 10. VP lie n VP .05 −−→ 11. NP n JJ n NP-OBJ n NP-OBJ .25 −−→ 12. n NNS-OBJ n NP-OBJ .75 −−→ 13. red n JJ .4 − → 14. smelly n JJ .6 − → 15. meat n NNS-OBJ .5 − → 16. cars n NNS-OBJ .5 − → Figure 2.3: Production set P 1 from example wrtg G 1 used in Examples 2.2.2, 2.2.3, 2.2.4, and2.2.7. 17. VP n 1 n NP-OBJ n VP .7 − → 18. VP n 2 n NP-OBJ n VP .25 −−→ 19. VP n 3 n VP .05 −−→ 20. eat n 1 1 − → 21. chase n 2 1 − → 22. lie n 3 1 − → Figure 2.4: Normal-form productions inserted in P 1 to replace productions 8, 9, and 10 ofFigure2.3innormalizationofG 1 ,asdescribedinExample2.2.3. 32 23. meat n NP-OBJ .375 −−−→ 24. cars n NP-OBJ .375 −−−→ Figure2.5: ProductionsinsertedinP 1 tocompensatefortheremovalofchainproduction 12ofFigure2.3inchainproductionremovalofG 1 ,asdescribedinExample2.2.4. For every wrtg G we can form the wrtg G 0 such that G and G 0 are equivalent and G 0 is in normal form. This is achieved by Algorithm 1, which follows the first half of the construction in Prop. 1.2 of Alexandrakis and Bozapalidis [1] and preserves chain productions. Example2.2.3 The wrtg G from Example 2.2.2 is not in normal form. Algorithm 1 produces a normal form equivalent by replacing productions 8, 9, and 10 from Figure 2.3withtheproductionsinFigure2.4. Algorithm1NORMAL-FORM 1: inputs 2: wrtgG in = (N in ,Σ,P in ,n 0 )overW 3: outputs 4: wrtgG out = (N out ,Σ,P out ,n 0 )overWinnormalformsuchthatL G in = L G out 5: complexity 6: O(size(˜ p)|P in |),where ˜ pistheproductionoflargestsizeinP in 7: N out ← N in 8: P out ←∅ 9: forallp∈ P in do 10: forallp 0 ∈ NORMALIZE(p,N in )do 11: Letp 0 beoftheformn w − → u. 12: P out ← P out ∪{p 0 } 13: N out ← N out ∪{n} 14: return G out 33 Algorithm2NORMALIZE 1: inputs 2: productionp in oftheformn w − → u 3: nonterminalsetN 4: outputs 5: P ={p 1 ,...,p n }, the set of productions in normal form such that n⇒ p 1 ...⇒ p n u andwt(p 1 ,...,p n )= w 6: complexity 7: O(size(p in )) 8: P←∅;Ψ←{p in } 9: whileΨ,∅do 10: p←anyelementofΨ 11: Ψ←Ψ\{p} 12: Letpbeoftheformn w − → u. 13: ifu∈Σ (0) oru∈ Nthen{alreadyinnormalform} 14: P← P∪{p} 15: else 16: Letubeoftheformσ(u 1 ,...,u k ). 
17: (n 1 ,...,n k )← (u 1 ,...,u k ) 18: fori= 1tokdo 19: ifu i < Nthen 20: n i ←newnonterminaln x 21: Ψ←Ψ∪{n x 1 − → u i } 22: P← P∪{n w − →σ(n 1 ,...,n k )} 23: return P 34 2.2.2 Chainproductionremoval Although we have defined algorithms that take chain productions into account, it is neverthelesssometimesuseful,asinthestringcase[101],toremovechainproductions 2 fromawrtg. Althoughchainproductionsareveryhelpfulinthedesignofgrammarsand maybeproducedbyNLPsystems,theydelaycomputationandtheirpresenceinawrtg can make certain algorithms cumbersome. Fortunately, removing chain productions in a wrtg is conceptually equivalent to removing them in a weighted finite-state (string) automaton. Chain production removal for weighted string automata was described in Theorem 3.2 of ´ Esik and Kuich [41], Theorem 3.2 of Kuich [81], and by Mohri [101]. Algorithm 3 reproduces the chain production removal algorithm described by Mohri [101]butdoessointheterminologyofwrtgs. Example2.2.4 The wrtg G 1 from Example 2.2.2 has chain productions, specifically pro- duction 12. Algorithm 3 produces an equivalent wrtg without chain productions by removing production 12 (making productions 15 and 16 no longer useful) and adding theproductionsinFigure2.5. Because Example 2.2.4 does not make significant use of Algorithm 4, we provide a morecomplicatedexampleofchainproductionremoval. Example2.2.5 Consider the wrtg G 2 = ({q,r,s},Σ,P 2 ,q) whereΣ is defined in Example 2.1.1 and P 2 is depicted in Figure 2.6a. Algorithm 4 operates on the chain productions to form the map represented in Table 2.6b. This map is then used by Algorithm 3 to producewrtgG 3 = ({q,r,s},Σ,P 3 ,q)whereP 3 isdepictedinFigure2.6c. 2 Forfinite-state(string)automatachainproductionsareoftencalled epsilontransitions. 35 1. γ q s q .1 − → 2. α q .4 − → 3. r q .2 − → 4. s q .3 − → 5. α r .8 − → 6. q r .05 −−→ 7. s r .15 −−→ 8. α s .9 − → 9. s s .1 − → (a) Production set P 2 from example wrtg G 2 used in Example 2.2.5 to demonstratechainproductioncyclere- moval. q r s q 1.01 .20 .370 r .05 1.01 .185 s 0 0 1.1 (b) Closure table for chain pro- duction removal of G 2 . Rows and columns list source and destination nonterminals, re- spectively. 1. γ q s q .10 −−→ 2. α q .89 −−→ 3. α s 1 − → (c)ProductionsetP 3 ,obtainedbychainproduc- tionremovalofG 2 . Figure2.6: IllustrationofAlgorithms3and4,asdescribedinExample2.2.5. Algorithm 4buildsthetableinFigure2.6bfromtheproductionsinFigure2.6aandthenAlgorithm 3usesthistabletogeneratetheproductionsinFigure2.6c. 2.2.3 Determinization A normal-form wrtg G= (N,Σ,P,n 0 ) overW is deterministic if, for each k∈N,σ∈Σ (k) , and n 1 ,...,n k ∈ N k there is at most one production of the form n w − → σ(n 1 ,...,n k ) in P, wheren 1 ,...,n k ∈ N. 3 Non-deterministicanddeterministicwrtgovertheBooleansemir- ing(whichwesometimesrefertoasrtg,astheyareequivalenttounweightedregulartree grammars)representthesametreeseries([33],Thm. 1.10);thuswemaydefineanalgo- rithm which takes an arbitrary rtg in normal form and produces a language-equivalent deterministic one. The naive algorithm generalizes the classical determinization algo- rithm for fsas [112]; for each word ~ ρ = ρ 1 ρ 2 ...ρ k in (P(N)) k and σ∈ Σ (k) , find all m 3 Thus we are describing a wrtg equivalent to a bottom-up deterministic weighted tree automaton. We do not consider top-down deterministic properties, as top-down deterministic tree automata are strictly weakerthantheirbottom-upcounterparts([48],Ex. II.2.11). 
36 Algorithm3CHAIN-PRODUCTION-REMOVAL 1: inputs 2: wrtgG in = (N,Σ,P in ,n 0 )overW 3: outputs 4: wrtgG out = (N,Σ,P out ,n 0 )overWsuchthatL G in = L G out . Additionally,nop∈ P out is oftheformn src w − → n dst ,wheren src andn dst ∈ N. 5: complexity 6: O(|N| 3 +|N||P in |) 7: P chain ←{n w − → u∈ P in | u∈ N} 8: Formamappingφ : N×N→W. 9: φ← COMPUTE-CLOSURE( N,P chain ) 10: Formamappingθ : N×T Σ (N)→W. 11: P chain ← P in \P chain 12: foralln dst ∈ Ndo 13: foralln src ∈ Ndo 14: foralln dst w − → u∈ P chain do 15: θ(n src ,u)←θ(n src ,u)+(w·φ(n src ,n dst )) 16: P out ←{n θ(n,u) −−−−→ u|θ(n,u), 0} 17: return G out Algorithm4COMPUTE-CLOSURE 1: inputs 2: nonterminalsN 3: productionsetP,whereeachp∈ Pisoftheformn src w − → n dst ,n src andn dst ∈ N 4: outputs 5: mappingφ : N×N→Wsuchthatφ(n src ,n dst )isthesumofweightsofallderivations fromn src ton dst usingP. 6: complexity 7: O(|N| 3 ) 8: φ(n src ,n dst )← 0foreachn src ,n dst ∈ N 9: foralln src w − → n dst ∈ Pdo 10: φ(n src ,n dst )←φ(n src ,n dst )+w 11: foralln mid ∈ Ndo 12: foralln src ∈ N,n src , n mid do 13: foralln dst ∈ N,n dst , n mid do 14: φ(n src ,n dst )←φ(n src ,n dst )+(φ(n src ,n mid )·φ(n mid ,n mid ) ∗ ·φ(n mid ,n dst )) 15: foralln src ∈ N,n src , n mid do 16: φ(n mid ,n src )←φ(n mid ,n mid ) ∗ ·φ(n mid ,n src ) 17: φ(n src ,n mid )←φ(n src ,n mid )·φ(n mid ,n mid ) ∗ 18: φ(n mid ,n mid )←φ(n mid ,n mid ) ∗ 19: return φ 37 Algorithm5DETERMINIZE 1: inputs 2: wrtg G in = (N,Σ,P in ,n 0 ) over Boolean semiring in normal form with no chain productions 3: outputs 4: deterministicwrtgG out = (P(N)∪{n 0 },Σ,P out ,n 0 )overBooleansemiringinnormal formsuchthatL G in = L G out . 5: complexity 6: O(|P in |2 |N| max σ∈Σ rk(σ) ) 7: P out ←∅ 8: Ξ←∅{Seennonterminals} 9: Ψ←∅{Newnonterminals} 10: forallα∈Σ (0) do 11: ρ dst ←{n| n− →α∈ P in } 12: Ψ←Ψ∪{ρ dst } 13: P out ← P out ∪{ρ dst − →α} 14: ifn 0 ∈ρ dst then 15: P out ← P out ∪{n 0 − →ρ dst } 16: whileΨ,∅do 17: ρ new ←anyelementofΨ 18: Ξ←Ξ∪{ρ new } 19: Ψ←Ψ\{ρ new } 20: forallσ (k) ∈Σ\Σ (0) do 21: forall~ ρ=ρ 1 ...ρ k |ρ 1 ...ρ k ∈Ξ k ,ρ i =ρ new forsome1≤ i≤ kdo 22: ρ dst ←{n| n− →σ(n 1 ,...,n k )∈ P in ,n 1 ∈ρ 1 ,...,n k ∈ρ k } 23: ifρ dst ,∅then 24: ifρ dst <Ξthen 25: Ψ←Ψ∪{ρ dst } 26: P out ← P out ∪{ρ dst − →σ(~ ρ)} 27: ifn 0 ∈ρ dst then 28: P out ← P out ∪{n 0 − →ρ dst } 29: return G out 38 1. D q r t− → 2. D q s t− → 3. A q− → 4. B r− → 5. B s− → 6. C s− → (a) P 4 , productions of non- deterministicrtgG 4 . 1. {t} t− → 2. D {q} {r,s} {t}− → 3. D {q} {s} {t}− → 4. A {q}− → 5. B {r,s}− → 6. C {s}− → (b) P 5 , productions of deterministic rtg G 5 ob- tainedbyapplyingAlgorithm5toG 4 . Figure 2.7: rtg productions before and after determinization, as described in Example 2.2.6. Notethat(a)isnotdeterministicbecauseproductions4and5havethesameright side,whilenoproductionsin(b)havethesamerightside. productions p 1 ,...,p m where, for 1 ≤ j ≤ m, p j = n j − → σ(n j 1 ,...,n j k ) such that for 1≤ i≤ k, n j i ∈ ρ i . Then,{n j | 1≤ j≤ m}− → σ(ρ 1 ,...,ρ k ) is in the determinized rtg. Additionally, if n j = n 0 for some 1≤ j≤ m, n 0 − →{n j | 1≤ j≤ m} is in the determinized rtg. This is frequently overexhaustive, though. Algorithm 5 is more appropriate for actualimplementation,asitonlybotherstobuildproductionsfornonterminalsthatcan bereached. Thisalgorithmfirstcreatesthenonterminalsusedtoproduceleavesinlines 10–15,thenusesthosenonterminalstoproducemorenonterminalsinlines20–26. until no more can be produced. 
To ensure a single start nonterminal, chain productions are addedfromanewuniquestartnonterminalinlines15and28. Example2.2.6 Consider the rtg G 4 = (N 4 ,Σ, P 4 , t) where N 4 ={t, q, r, s},Σ={A (0) , B (0) , C (0) ,D (2) },andP 4 isdepictedinFigure2.7a. TheresultofAlgorithm5oninputG 4 isG 5 =(P(N 4 ),Σ,P 5 ,t),whereP 5 isdepictedinFigure2.7b. We discuss our contribution to algorithms for determinization of a wider class of wrtginChapter3. 39 2.2.4 Intersection Algorithm6INTERSECT 1: inputs 2: wrtgG A = (N A ,Σ,P A ,n A 0 )overWinnormalformwithnochainproductions 3: wrtgG B = (N B ,Σ,P B ,n B 0 )overWinnormalformwithnochainproductions 4: outputs 5: wrtg G C = ((N A × N B ),Σ,P C ,(n A 0 ,n B 0 )) over W such that for every t ∈ T Σ , L G C (t)= L G A (t)·L G B (t) 6: complexity 7: O(|P A ||P B |) 8: P C ←∅ 9: forall(n A ,n B )∈ N A ×N B do 10: forallσ (k) ∈Σdo 11: forallp A oftheformn A w A −−→σ(n A 1 ,...,n A k )∈ P A do 12: forallp B oftheformn B w B −−→σ(n B 1 ,...,n B k )∈ P B do 13: P C ← P C ∪(n A ,n B ) w A ·w B −−−−−→σ((n A 1 ,n B 1 ),...,(n A k ,n B k )) 14: return G C It is frequently useful to find the weighted intersection between two tree series L A and L B . This intersection is of course also a tree series L C where for every t ∈ T Σ , L C (t) = L A (t)· L B (t). For the case of recognizable tree series, if we consider two chain production-free, normal-form wrtgs G A and G B representing these tree series, then we would like to find a third wrtg G C such that L G C (t) = L G A (t)· L G B (t). Algorithm 6 is a very simple algorithm that finds this intersection wrtg. 4 Note that in practice, rather thaniteratingoverallmembersofN A ×N B asthealgorithmspecifiesatline9,anactual implementationshouldbeginbyconsidering(n A 0 ,n B 0 ),andthenproceedbyconsidering nonterminals that appear in the right sides of newly generated productions as they are discovered,atline13. Wehaveomittedthisdetailfromthepresentationofthealgorithm 4 Thisalgorithmisderivedfromacompositionofidentitytransducers;seeSection2.3.3. 40 1. S NP the dogs n VP n S 1 − → 2. VP eat NP smelly meat n VP .6 − → 3. VP chase NP red cars n VP .4 − → (a)P 6 ,productionsforthewrtgG 6 usedinExample2.2.7. 1. S n 1 n 2 n 0 1 − → 2. NP n 3 n 4 n 1 .4 − → 3. the n 3 .8 − → 4. dogs n 4 .7 − → 5. VP n 5 n 8 n 2 .42 −−→ 6. VP n 7 n 6 n 2 .1 − → 7. eat n 5 1 − → 8. chase n 7 1 − → 9. NP n 9 n 10 n 6 .25 −−→ 10. NP n 11 n 12 n 8 .25 −−→ 11. red n 9 .4 − → 12. smelly n 11 .6 − → 13. meat n 12 .5 − → 14. cars n 10 .5 − → (b)P 7 ,productionsfortheresultofintersectingnormal-form,chain-productionfreetransformationsof Figures2.3and2.8a. Figure2.8: Productionsforawrtgandintersectionresult,asdescribedinExample2.2.7. for clarity’s sake and will continue to do so throughout the remainder of this work, but willindicatewhensuchanoptimizationisappropriateinrunningtext. Example2.2.7 Consider the wrtg G 1 = (N 1 ,Σ,P 1 ,n S ) described in Example 2.2.2 with P 1 depicted in Figure 2.3 and the wrtg G 6 = ({n S ,n VP },Σ,P 6 ,n S ), with P 6 depicted in Figure 2.8a. Normal form and chain production removal of G 1 is described in Ex- amples 2.2.3 and 2.2.4; similar transformations of G 6 are left as an exercise. The re- sult of intersecting these transformed wrtgs (after nonterminals have been renamed) is G 7 = ({n i | 0≤ i≤ 12},Σ,P 7 ,n 0 ),whereP 7 isdepictedinFigure2.8b. 41 2.2.5 K-best Algorithms for determining the k highest weighted paths in a hypergraph have been describedbyHuangandChiang[59]andPaulsandKlein[108]andareeasilyadaptable tothewrtgdomain. 
Wereferthereadertothoseworks,whichcontainclearlypresented algorithms. 2.3 Weightedtop-downtreetransducers Definition2.3.1(cf. Sec. 5.3ofF¨ ul¨ opandVogler[45]) Aweightedtop-downtreetrans- ducer(wtt)isa5-tuple M=(Q,Σ,Δ,R,q 0 )where: 1. Qisafinitesetofstates,withq 0 ∈ Qthestartstate, 2. ΣandΔaretheinputandoutputrankedalphabets, 3. R is a tuple (R 0 ,π) where R 0 is a finite set of rules, each rule r of the form q.σ− → u for q∈ Q,σ∈ Σ (k) , and u∈ T Δ (Q×X k ), andπ : R 0 →W is a weight function of the rules. We frequently refer to R as a finite set of weighted rules, each rule r of the form q.σ π(r) −−−→ u. We denote subsets of R as follows: R q,σ ={r∈ R| r is of the formq.σ− → u}. WeextendalldefinitionsofoperationsontreesfromSection2.1.1to rules,suchthat,e.g.,size(r)= size(u). WeassociateRwithM,suchthat,e.g.,r∈ M isinterpretedtomeanr∈ R. The multiplicity of a variable x i in a rule r of the form q.σ w − → u, denoted mult(r,i), is thenumberoftimesx i appearsinu. Awttislinearifforeachruleroftheformq.σ w − → u whereσ∈ Σ (k) and k≥ 1, max k i=1 mult(r,i) = 1. If, for each rule r, min k i=1 mult(r,i) = 1, 42 the wtt is nondeleting. We denote the class of all wtt as wT and add the letters L and N to signify intersections of the classes of linear and nondeleting wtt, respectively. We alsoremovetheletter“w”tosignifythosewttovertheBooleansemiring. Forexample, wNT is the class of nondeleting wtt over arbitrary semiring, and LNT is the class of nondeleting and linear wtt over the Boolean semiring. We use class names as a generic descriptor of individual wtts. For example, the phrase “a wLT M” means “a wtt M of classwLT.”Wealsodefinethefollowingpropertiesofwttusedinspecialcircumstances (and not part of the “core”): a wtt is deterministic if, for each q∈ Q andσ∈Σ there is at mostoneruleoftheformq.σ w − → u;itistotalifthereisatleastonesuchrule. Itisheight-1 ifeachruleisoftheformq.σ w − →δ(d 1 ,...,d k )whereδ∈Δ (k) andd i ∈ Q×Xfor1≤ i≤ k. For wtt M= (Q,Σ,Δ,R,q 0 ), s,t∈ T Δ (Q×T Σ ), q∈ Q, and r∈ R of the form q.σ w − → u, we obtain a derivation step from s to t by replacing some leaf of s labeled with q and a tree beginning withσ by a transformation of the right side of r, where each instance of a variable has been replaced by a corresponding child of theσ-headed tree. Formally, s⇒ r M t if there exists some v∈ pos(s) such that s(v) = (q,σ(s 1 ,...,s k )) and s[ϕ(u)] v = t, whereϕisasubstitutionmappingQ×X→ T Δ (Q×T Σ ),suchthatϕ((q 0 ,x i ))= (q 0 ,s i )for all q 0 ∈ Q,1≤ i≤ k. We say this derivation step is leftmost if, for all v 0 ∈ leaves(s) where v 0 < v, s(v 0 )∈Δ. Except where noted and needed, we henceforth assume all derivation stepsareleftmostanddropthesubscriptM. If,forsomeq∈ Q,s∈ T Σ ,m∈N,r i ∈ R,and t i ∈ T Δ (Q×T Σ ) for all 1≤ i≤ m, (q,s)⇒ r 1 t 1 ...⇒ r m t m , we say the sequence (r 1 ,...,r m ) isaderivationof(s,t m )inMfromq. Theweightofaderivationd,wt(d),istheproductof theweightsofalloccurrencesofitsrules. 43 1. α λ .6 − → q 2. α ξ λ .4 − → q 3. β ν q.x 1 ξ q.x 1 .7 − → q 4. γ ν q.x 2 q.x 1 .3 − → q 5. η ν q.x 1 q.x 2 .9 − → q 6. σ ν q.x 2 ν q.x 1 q.x 3 .2 − → q (a)R 1 : totalwNT 1. σ ν λ ν q.x 1 r.x 1 − → q 2. α ξ λ − → q 3. γ λ − → r 4. α λ − → r (b)R 2 : deterministicT 1. γ ν q.x 1 q.x 2 2 − → q 2. γ ν q.x 2 q.x 1 4 − → q 3. α λ 3 − → q (c)R 3 : height-1wLNT Figure2.9: Rulesetsforthreewtts. 44 The weighted tree transformation represented by M is the mappingτ M : T Σ × T Δ × Q→Wdefined,foralls∈ T Σ ,t∈ T Δ ,andq∈ Q,asfollows: τ M (s,t) q = M derivation dof (s,t)in Mfrom q wt(d). 
Ifq= q 0 ,wemayleaveoff“fromq 0 ”inthedefinitionofaderivationanduseτ M (s,t) asshorthandforτ M (s,t) q 0 . Example2.3.2 LetΣandΔbetherankedalphabetsdefinedinExamples2.1.1and2.1.3. For reference,Σ={α (0) ,β (1) ,γ (2) ,η (2) ,σ (3) } andΔ={λ (0) ,ξ (1) ,ν (2) }. Let M 1 = ({q},Σ,Δ, R 1 , q),M 2 = ({q,r},Σ,Δ,R 2 ,q),andM 3 = ({q},Σ,Δ,R 3 ,q)bewtts,withR 1 ,R 2 ,andR 3 depicted in Figures 2.9a, 2.9b, and 2.9c, respectively. M 1 is a total wNT, M 2 is a deterministic T, andM 3 isaheight-1wLNT.Thesequence(3,1,2)isaderivationof(β(α),ν(λ,ξ(ξ(λ))))in M 1 andifM 1 istakentobeovertheprobabilitysemiring,theweightofthederivationis .168. Thevalueofτ M 3 (γ(α,α),ν(λ,λ))is8ifM 3 istakentobeoverthetropicalsemiring, butis54iftakenovertheprobabilitysemiring. We now define weighted extended top-down tree transducers [56], a generalization of wtt where the left-hand side may contain an arbitrary pattern. This formalism is frequentlymoreusefulthan”traditional”wttinNLPapplicationsasitcapturesatleast the same set of weighted tree transformations as the commonly used synchronous tree substitution grammars [90, 119]. 5 We can immediately see problems with wtt before evenconsideringreal-worldapplications. Considerthatthereisnowttthatcapturesthe (finite!) transformationdepictedinFigure2.2a. 5 In fact, xLNT (which are defined further down the page) have precisely the same power as STSG , if STSGaregivenstates[90]. 45 Definition2.3.3(cf. Def. 1ofMaletti[90]) A weighted extended top-down tree trans- ducer(wxtt)isa5-tuple M= (Q,Σ,Δ,R,q 0 )where: 1. Q,Σ,andΔaredefinedasforwtt. 2. R is a tuple (R 0 ,π). R 0 is a finite set of rules, each rule r of the form q.y w − → u for q∈ Q, y∈ T Σ (X),andu∈ T Δ (Q×X). Wefurtherrequirethat yislinearinX,i.e.,no variablex∈ X appearsmorethanoncein y, andthateachvariableappearinginu is also in y. π : R 0 →W is a weight function of the rules. As for wrtgs and wtts, werefertoRasafinitesetofweightedrules,eachruleroftheformq.y π(r) −−−→ u. For wxtt M= (Q,Σ,Δ,R,q 0 ), s,t∈ T Δ (Q×T Σ ), q∈ Q, and r∈ R of the form q.y w − → u, we obtain a derivation step from s to t by replacing some leaf of s labeled with q and a tree matching y, by a transformation of u, where each instance of a variable has been replacedbyacorrespondingsubtreeofthey-matchingtree. Formally, s⇒ r M tifthereisa positionv∈ pos(s),asubstitutionmappingϕ : X→ T Σ ,andaruleq.y w − → u∈ Rsuchthat s(v)= (q,ϕ(y))andt= s[ϕ 0 (u)] v ,whereϕ 0 isasubstitutionmappingQ×X→ T Δ (Q×T Σ ) definedsuchthatϕ 0 (q 0 ,x)= (q 0 ,ϕ(x))forallq 0 ∈ Qandx∈ X. Wedefineleftmost, deriva- tion, and wt for wxtt as we do for wtt. We also define the weighted tree transformation τ M (s,t) q for all s∈ T Σ , t∈ T Δ , and q∈ Q as we do for wtt, but additionally note that the assumptionthatWiscompleteensuresweightedtreetransformationiswelldefinedfor wxtt, even though the summation of derivations may be infinite due to “chain” rules suchasq.x w − → q.x. Weextendthepropertieslinearandnondeletingtowxtt. Weusetheletter“x”todenote classes of wxtt and thus incorporate them into our class naming convention. Thus, xLT 46 is the class of linear wxtt over the Boolean semiring, and wxLNT is the class of linear and nondeleting wxtt over an arbitrary semiring. A wxtt is -free if there is no rule q.x w − → u∈ Rwherex∈ X. Example2.3.4 M 1 , M 2 , and M 3 from Example 2.3.2 are wtts, so they are also wxtts. 
However, the form of their rules when presented as wxtts is slightly different; Figure 2.10a demonstrates this for M 1 and similar “transformations” for M 2 and M 3 should be fairly obvious. M 4 = ({q 1 ,q 2 ,q 3 },Σ,Δ,R 4 , q 1 ), where R 4 is depicted in Figure 2.10b, is a wxLNTovertheprobabilitysemiringthatrecognizesτ fromFigure2.2a. Whencertainpropertiesapplytoeveryweightedtreetransformationrepresentedby someclassofwtt,weelevatethosepropertiestotheclassitself. Forexample,wecansay that T has recognizable domain, because all weighted tree transformations represented by a wtt of class T have recognizable domain ([48], cor. IV.3.17). We now discuss some algorithmsonwttsandwxtts. 2.3.1 Projection TheprojectionofatransducerMistheconstructionofasyntacticstructurethatrepresents either dom(τ M ) (called the domain projection) or range(τ M ) (the range projection). Since the only syntactic structures under discussion here capture recognizable tree series, projection is only possible once recognizability is ensured. As we just mentioned, T has recognizable domain ([48], cor. IV.3.17). Algorithm 7 is a “folklore” algorithm that obtains the domain projection from a wtt of class T. As in the case of Algorithm 6, an 47 1. α λ .6 − → q 2. α ξ λ .4 − → q 3. β x 1 ν q.x 1 ξ q.x 1 .7 − → q 4. γ x 1 x 2 ν q.x 2 q.x 1 .3 − → q 5. η x 1 x 2 ν q.x 1 q.x 2 .9 − → q 6. σ x 1 x 2 x 3 ν q.x 2 ν q.x 1 q.x 3 .2 − → q (a)RecastingofR 1 fromFigure2.9aaswxttrules. 1. γ α x 1 ξ q 2 .x 1 .4 − → q 1 2. γ α x 1 ξ ξ q 2 .x 1 .6 − → q 1 3. σ x 1 x 2 α ν q 3 .x 2 q 2 .x 1 1 − → q 1 4. α λ 1 − → q 2 5. β x 1 ξ q 2 .x 1 1 − → q 3 (b)R 4 ,forrepresentingτ fromFigure2.2a. Figure2.10: RulesetsforwxttspresentedinExample2.3.4. 48 implementationofthisalgorithmshouldconsidernonterminalsastheyareencountered, andnotiterateoverallpossiblenonterminals. wNT does not have recognizable domain [91], but wLT does [43]. A much simpler algorithm, Algorithm 8, obtains domain projection from a wtt of class wLT. The im- plementationoptimization previouslydiscussedfor Algorithms6 and7applies hereas well. Thekeydifferencebetweenthealgorithms,asidefromthepreservationofweights, is that the linearity constraint for Algorithm 8 does not require the “merging” of rules doneinlines17–22ofAlgorithm7. Algorithm7BOOL-DOM-PROJ 1: inputs 2: wttM= (Q,Σ,Δ,R,q 0 )overBooleansemiring 3: outputs 4: wrtgG= (N,Σ,P,n 0 )overBooleansemiringsuchthatL G = dom(τ M ) 5: complexity 6: O(( 2|R| |Q| ) |Q| ) 7: N←P(Q) 8: n 0 ←{q 0 } 9: P←∅ 10: foralln∈ Ndo 11: ifn=∅then 12: forallσ∈Σwithrankkdo 13: P← P∪{∅− →σ( k z }| { ∅,...,∅)} 14: else 15: n∈P(Q)isoftheform{q 1 ,...,q m }forsomem≤|Q|. 16: forallσ (k) ∈Σdo 17: forall(r 1 ,...,r m )∈ R q 1 ,σ ×...×R q m ,σ do 18: LetφbeamappingX k → N whereφ(x i )=∅forall1≤ i≤ k. 19: fori= 1tomdo 20: r i hastheformq i .σ− → t. 21: forall(q 0 ,x)∈ ydset(t)∩(Q×X k )do 22: φ(x)←φ(x)∪{q 0 } 23: P← P∪{n− →σ(φ(x 1 ),...,φ(x k ))} 24: return G 49 Algorithm8LIN-DOM-PROJ 1: inputs 2: wLTM= (Q,Σ,Δ,R,q 0 )overW 3: outputs 4: wrtgG= (Q∪{⊥},Σ,P,q 0 )overWsuchthatL G = dom(τ M ) 5: complexity 6: O(|R|) 7: P= (P 0 ,π)← (∅,∅) 8: forallσ (k) ∈Σdo 9: P 0 ← P 0 ∪{⊥− →σ( k z }| { ⊥,...,⊥)} 10: π(⊥− →σ( k z }| { ⊥,...,⊥))← 1 11: forallq∈ Qdo 12: forallσ∈Σwithrankkdo 13: forallr∈ R q,σ do 14: rhastheformq.σ w − → u. 15: LetφbeamappingX k → Q∪{⊥}whereφ(x i )=⊥forall1≤ i≤ k. 16: forall(q 0 ,x)∈ ydset(u)∩(Q×X k )do 17: φ(x)← q 0 18: p new ← q− →σ(φ(x 1 ),...,φ(x k )) 19: P 0 ← P 0 ∪{p new } 20: π(p new )←π(p new )+w 21: return G 50 1. α {q}− → 2. 
σ {q,r} ∅ ∅ {q}− → 3. α {q,r}− → 4. σ ∅ ∅ ∅ ∅− → 5. γ ∅ ∅ ∅− → 6. η ∅ ∅ ∅− → 7. β ∅ ∅− → 8. α ∅− → (a)P 7 formedfromdomainprojectionofM 2 fromExample2.3.2withrulesinFigure2.9b. 1. γ q q q 2 − → 2. α q 3 − → (b)P 8 formedfromdomainprojectionofM 3 fromExample2.3.2withrulesinFigure2.9c. Figure 2.11: Production sets formed from domain projection using Algorithms 7 and 8, asdescribedinExample2.3.5. Example2.3.5 We can use Algorithms 7 and 8 to obtain the domain projections of, respectively, M 2 and M 3 , from Example 2.3.2, where M 2 is taken to be over the Boolean semiringandM 3 istakentobeoverthetropicalsemiring. LetG 7 andG 8 bewrtgssuchthat L G 7 = dom(M 2 ) and L G 8 = dom(M 3 ). G 7 = (P({q,r}),Σ,P 7 ,{q}) and G 8 = ({q,⊥},Σ,P 8 ,q), whereP 7 isinFigure2.11aandP 8 isinFigure2.11b. TransducerclassesxTandwxLTalsohaverecognizabledomain,astheproofsfortheir non-extendedbrethrenarenotdependentonparticularsoftheleftside,butAlgorithms 7 and 8 are not appropriate for these classes. We can, however, transform a wxT into a wT with equivalent domain using Algorithm 9. This algorithm calls Algorithm 10, whichseparatesmulti-heightleftsidesofrulesandpreservesstatetransitioninformation from the original rules, but discards syntactic rule right side information. If we then augmentΣ (1) with an additional symbol with the understanding that R q, signifies 51 rulesbeginningwithq,andthatq. w − → uisequivalenttoq.x 1 w − → u,wemayuseAlgorithms 7and8onwxttstransformedfromAlgorithm9toobtaindomainprojections. Algorithm9PRE-DOM-PROJ 1: inputs 2: wxtt M in = (Q in , Σ, Δ, R in , q 0 ) over W, where l = max σ∈Σ rk(σ) and m = max r∈R in max l i=1 mult(r,i). 3: outputs 4: wtt M out = (Q out ⊇ Q in ,Σ,Γ,R out ,q 0 ) overW, whereΓ={υ (0) ,ω (lm) }, such that if M in islinearorWisBoolean,dom(τ M in )= dom(τ M out ) 5: complexity 6: O(|R in |) 7: Q out ← Q in 8: R out ←∅ 9: forallr∈ R in do 10: risoftheformq.y w − → u 11: forallr 0 ∈ PRE-DOM-PROJ-PROCESS( Σ,Δ,Q out ,lm,υ,ω,q,y,u,w)do 12: r 0 isoftheformq 0 .σ w 0 −→ u 0 . 13: R out ← R out ∪{r 0 } 14: Q out ← Q out ∪{q 0 } 15: return M out Example2.3.6 Recall M 4 from Example 2.3.4. The result of Algorithm 9 on M 4 is M 5 = (Q 5 ={q i |1≤ i≤ 6},Σ,Γ,R 5 ,q 1 ),whereΓ={υ (0) ,ω (3) },andR 5 isdepictedinFigure2.12a. The result of Algorithm 8 on M 5 is G 9 = (Q 5 ,Σ, P 9 , q 1 ), where P 9 is depicted in Figure 2.12b. NotethatL G 9 isdepictedinFigure2.2c. TransducerclassesxLT(from[48],Thm. IV.6.5)andwxLNT(from[43])haveregular range. Algorithm 11 describes how to obtain range projections from wxtts of these classes. Note that the algorithm is defined for wxLT but is only applicable to xLT and wxLNT; should the input be, for example, in wxLT over some non-Boolean semiring 52 Algorithm10PRE-DOM-PROJ-PROCESS 1: inputs 2: rankedalphabetsΣ,Δ 3: statesetQ 4: maximumnewrightsiderankr∈N 5: rank-0symbol υ 6: rank- rsymbolω 7: stateq∈ Q 8: tree y∈ T Σ (X) 9: treeu∈ T Δ∪{χ} (Q×X)whereχisarank-0“placeholder”symbolnotin ΣorΔ 10: weightw 11: outputs 12: set of rules R 0 and states Q 0 ⊇ Q such that for any wxtt M = (Q,Σ,Δ∪{υ,ω},R∪{q.y w − → u},q 0 ), dom(τ M ) = dom(τ M 0), where M 0 = (Q∪Q 0 ,Σ,Δ∪{υ,ω},R∪R 0 ,q 0 ) 13: complexity 14: O(size(y)) 15: R 0 ←∅ 16: Q 0 ←∅ 17: Ψ←∅ 18: Letb 1 =...= b r =υ 19: m← 1 20: Let ybeoftheformσ (k) (y 1 ,...,y k ). 21: Formsubstitutionmapϕ : Q×X→ T Δ∪{χ} (Q×X). 22: fori= 1tokdo 23: if y i ∈ Xthen 24: forall(q i ,y i )∈ ydset(u)∩(Q×{y i })do 25: ϕ(q i ,y i )←χ 26: b m ← (q i ,x m ) 27: m← m+1 28: else 29: Letq x beanewstatesuchthatQ∩{q x }=∅. 
30: Q 0 ← Q 0 ∪{q x } 31: Ψ←Ψ∪{(q x ,y i )} 32: b m ← (q x ,x m ) 33: m← m+1 34: forall(q x ,y x )∈Ψdo 35: R 0 ,Q 0 ← R 0 ,Q 0 ∪PRE-DOM-PROJ-PROCESS( Σ,Δ,Q,r,υ,ω,q x ,y x ,ϕ(u),1) 36: R 0 ← R 0 ∪{q.σ w − →ω(b 1 ,...,b r )} 37: return R 0 ,Q 0 53 1. γ ω q 4 .x 1 q 2 .x 2 υ .4 − → q 1 2. γ ω q 5 .x 1 q 2 .x 2 υ .6 − → q 1 3. σ ω q 2 .x 1 q 3 .x 2 q 6 .x 3 1 − → q 1 4. α ω υ υ υ 1 − → q 2 5. β ω q 2 .x 1 υ υ 1 − → q 3 6. α ω υ υ υ 1 − → q 4 7. α ω υ υ υ 1 − → q 5 8. α ω υ υ υ 1 − → q 6 (a) Rule set R 5 formed as the result of pre-domain conversion of M 4 , Algorithm 9, as described in Example2.3.6. 1. γ q 4 q 2 q 1 .4 − → 2. γ q 5 q 2 q 1 .6 − → 3. σ q 2 q 3 q 6 q 1 1 − → 4. α q 2 1 − → 5. β q 2 q 3 1 − → 6. α q 4 1 − → 7. α q 5 1 − → 8. α q 6 1 − → (b) Production set P 9 , formed as the result of domain projection, Algorithm 8, on M 5 , which has rules depictedinFigure2.12a,asdescribedinExample2.3.6. 1. ξ q 2 q 1 .4 − → 2. ξ ξ q 2 q 1 .6 − → 3. ν q 3 q 2 q 1 1 − → 4. λ q 2 1 − → 5. ξ q 2 q 3 1 − → (c) Production set P 10 , formed as the result of range projection, Algorithm 11, on M 4 , as described in Example2.3.7. Figure2.12: TransformationsofM 4 fromExample2.3.4,depictedinFigure2.10b,foruse indomainandrangeprojectionExamples2.3.6and2.3.7. 54 the result of the algorithm is not guaranteed to be meaningful. More specifically, the coefficientsoftherepresentedtreeseriesmaybewrong. Algorithm11RANGE-PROJ 1: inputs 2: wxLTM= (Q,Σ,Δ,R,q 0 )oversemiringW. 3: outputs 4: wrtg G = (Q,Δ,P,q 0 ) over semiringW such that, if M is nondeleting orW is Boolean,L G = range(τ M ) 5: complexity 6: O(|R|max r∈R size(r)) 7: P= (P 0 ,π)← (∅,∅) 8: Letϕ be a substitution mapping Q×X→ T Δ (Q) such that for all q∈ Q and x∈ X, ϕ((q,x))= q. 9: forallroftheformq.y w − → zinRdo 10: p new ← q− →ϕ(z) 11: P 0 ← P 0 ∪{p new } 12: π(p new )←π(p new )+w 13: return G Example2.3.7 RecallM 4 = (Q 4 ,Σ,Δ,R 4 ,q 1 )fromExample2.3.4. TheresultofAlgorithm 11onM 4 isG 10 = (Q 4 ,Δ,P 10 ,q 1 ), whereP 10 isdepictedinFigure2.12c. NotethatL G 10 is depictedinFigure2.2d. 2.3.2 Embedding It is sometimes useful to embed a wrtg in a wtt, that is, given a wrtg G, to form a wtt M suchthatτ M =ı L G . Algorithm12isaverysimplealgorithmforformingthisembedding from a normal form and chain-production-free wrtg; this can bedone with an arbitrary wrtginananalogousmannertothatofthealgorithm,buttheresultingembeddingwill beawxtt. 55 Algorithm12EMBED 1: inputs 2: wrtgG= (N,Σ,P,n 0 )overWinnormalformwithnochainproductions 3: outputs 4: wttM= (N,Σ,Σ,R,n 0 )overWsuchthatτ M =ı L G 5: complexity 6: O(|P|) 7: forallpoftheformn w − →σ(n 1 ,...,n k )∈ Pdo 8: R← R∪{n.σ w − →σ(n 1 .x 1 ,...,n k .x k )} 9: return M 1. S S n 1 .x 1 n 2 .x 2 1 − → n 0 2. NP NP n 3 .x 1 n 4 .x 2 .4 − → n 1 3. the the .8 − → n 3 4. dogs dogs .7 − → n 4 5. VP VP n 5 .x 1 n 6 .x 2 .42 −−→ n 2 6. VP VP n 7 .x 1 n 8 .x 2 .1 − → n 2 7. eat eat 1 − → n 5 8. chase chase 1 − → n 7 9. NP NP n 9 .x 1 n 10 .x 2 .25 −−→ n 6 10. NP NP n 11 .x 1 n 12 .x 2 .25 −−→ n 8 11. red red .4 − → n 9 12. smelly smelly .6 − → n 11 13. meat meat .5 − → n 12 14. cars cars .5 − → n 10 Figure 2.13: Rule set R 6 , formed from embedding of wrtg G 7 from Example 2.2.7, as describedinExample2.3.8. 56 Example2.3.8 Recall the wrtg G 7 = ({n i | 0≤ i≤ 12},Σ,P 7 ,n 0 ), where P 7 is depicted in Figure2.8b. ThenM 6 = ({n i | 0≤ i≤ 12},Σ,Σ,R 6 ,n 0 ),whereR 6 isdepictedinFigure2.13, istheresultofAlgorithm12,anembeddingofG 7 . 
2.3.3 Composition In Section 2.1.3 we described the composition of two weighted tree transformationsτ andμ. Here we consider composition of transducers (sometimes referred to as syntactic composition). In other words, given two wtts M A and M B , we want to construct a wtt M A ◦M B suchthatτ M A ◦M B =τ M A ;τ M B . A construction was given in declarative terms for syntactic composition of two un- weighted top-down tree transducers M A and M B by Baker [6]. This construction was shown to be correct for the cases where (M B is linear or M A is deterministic) and (M B is nondeleting or M A is total). The construction was extended to the weighted case by Maletti [88]. An inspection of Baker’s construction is enough to satisfy that it may be generalizedtoallowM A tobeawxttwithoutalteringmatters. Note,though,thatthede- terministicandtotalpropertiesarenotdefinedforwxtt,soanycompositionconstruction that involves a wxT as M A will require a wLNT as M B . We re-state Baker’s construc- tion,withtheadditionalgeneralizationandmodificationtohandleweightsprovidedby Maletti,asfollows: Let M A = (Q A ,Σ,Δ,R A ,q 0 A ) and M B = (Q B ,Δ,Γ,R B ,q 0 B ) be wtts. Then define M C = ((Q A ×Q B ),Σ,Γ,R C ,(q 0 A ,q 0 B )),wheremembersofR C arefoundbythefollowingprocess: 1. AugmentM B suchthatitrepresentstransformationsfromT Δ∪(Q A ×X) toT Γ∪(Q A ×Q B ×X) byadding,forallq A ∈ Q A ,x∈ X,andq B ∈ Q B ,theruleq B .(q A ,x) 1 − → ((q A ,q B ),x)toR B . 57 2. UsingtheaugmentedM B , forallrulesq A .y w 1 −−→ uinR A , allstatesq B inQ B , andallz suchthatτ M B (u,z) q B isnon-zero,addtherule( q A ,q B ).y w 1 ·τ M B (u,z) q B −−−−−−−−−−−→ ztoR C . 3. IfR C containsrulesthatonlydifferbytheirweight,replacetheseruleswithasingle rulethathasasaweightthesumoftheweightsofthereplacedrules. This process, essentially a reformulation of the construction of Baker [6], succinctly andintuitivelydescribeshowthecompositionisconstructed. However,asdiscussedin Section1.7,thistextdoesnotprovideanactual,implementablealgorithmforobtaining thecompositiontransducer. Forone,thefirststeprequiresaddinganinfinitenumberof rules. Thisproblemiseasilysolvedbyonlyaddingsuchrulesasmaybenecessitatedby thesecondstep. Ofmoreconcernisthemethodbywhichthesecondstepisperformed— thedescriptionabovegivesnohintastohowonemayactuallyobtainthespecifiedz. Algorithm13,COMPOSE,seekstocorrectpreciselythisomission. Itisanalgorithmic description of the aforementioned composition construction of weighted tree transduc- ers. The algorithm takes as input a wxT M A and a wLNT M B and produces a wxtt M C such that M C = M A ◦ M B . As in the declarative presentation, the main algorithm and generalideaisrathersimilartocompositionalgorithmsforwst. Likeinthewstcase,for each state of the composition transducer (i.e., for each pair of states, one from M A and one from M B ), rules from M A and M B are combined. 6 The key difference is that while for strings, one rule from M A is paired with one rule from M B , here multiple rules from M B may be necessary to match a single rule from M A . Specifically, every tiling of rules fromM B withleftsidesthat“cover”apotentiallylargerightsideofarulefromM A must 6 AspreviouslydiscussedforAlgorithms6,7,and8,animplementationofthisalgorithmshouldconsider statesastheyareencountered,andnotiterateoverallpossiblestates. 58 be chosen. The restriction that M B not be extended ensures that there are only a finite numberofsuchtilings(seeArnoldandDauchet[5]formoredetails). 
Theactofformingalltilingsofatreebyatransducerishandledbythesub-algorithm COVER,presentedasalgorithm14. Theinputtothealgorithmisatree,u,atransducer, M B , and a state q B . The desired output is the set of all trees that are formed as a consequenceoftilinguwithlefthandsidesofrulesfromM B andjoiningtherighthand sides of those rules in the natural way. Additionally, the product of the weights of the rules used to form these trees is desired. COVER proceeds in a top-down manner, findingallrulesthatmatchtherootofu,andbuildinganincompleteresulttreeforeach matching rule. Then, for each incomplete result tree, another step down the input tree istaken,formingmorepartialresults,andsoonuntilallthepossiblefullcoveringsand complete result trees are formed. Figure 2.14 graphically explains how COVER works. Additionally,thefollowingexampleprovidesawalkthroughofCOMPOSEandCOVER. Algorithm13COMPOSE 1: inputs 2: wxTM A = (Q A ,Σ,Δ,R A ,q A 0 )overW 3: wLNTM B = (Q B ,Δ,Γ,R B ,q B 0 )overW 4: outputs 5: wxTM C = ((Q A ×Q B ),Σ,Γ,R C ,(q A 0 ,q B 0 ))overWsuchthatM C = M A ◦M B . 6: complexity 7: O(|R A |max(|R B | size(˜ u) ,|Q B |))where ˜ uisthelargestrightsidetreeinanyruleinR A 8: LetR C beoftheform(R 0 C ,π) 9: R C ← (∅,∅) 10: forall(q A ,q B )∈ Q A ×Q B do 11: forallroftheformq A .y w 1 −−→ uinR A do 12: forall(z,w 2 )| (z,θ,w 2 )∈ COVER(u,M B ,q B )do 13: r new ← (q A ,q B ).y− → z 14: R 0 C ← R 0 C ∪{r new } 15: π(r new )←π(r new )+(w 1 ·w 2 ) 16: return M C 59 δ ζ q A .x 4 u v (a)Inputtreeu (ε, ε) = q (1, 2.2) = r ... (v, v') = q B ! w z θ v' (b)MemberofΠ last whenprocessingv q B δ w' q'' B 2 .x 2 q'' B 1 .x 1 h v'' 1 v'' 2 (c)Matchingrule ! (q A q'' B 1 ).x 4 (ε, ε) = q (1, 2.2) = r ... (v, v') = q B (v.1, v'.v'' 1 ) = q'' B 1 (v.2, v'.v'' 2 ) = q'' B 2 w x w' θ' (d)NewentrytoΠ v Figure2.14: GraphicalrepresentationofCOVER,Algorithm14. Atline13,positionvof treeuischosen. AsdepictedinFigure2.14a,inthiscase,u(v)isδandhastwochildren. One member (z,θ,w) ofΠ last is depicted in Figure 2.14b. The tree z has a leaf position v 0 with labelχ and there is an entry for (v,v 0 ) inθ, so as indicated on lines 16 and 17, we look for a rule with stateθ(v,v 0 ) = q B and left symbolδ. One such rule is depicted inFigure2.14c. Giventhetreeu,thetriple(z,θ,w),andthematchingrule,wecanbuild the new member of Π v depicted in Figure 2.14d as follows: The new tree is built by firsttransformingthe(state,variable)leavesofh;iftheithchildofvisa(state,variable) symbol,say,(q,x),thenleavesinhoftheform(q 00 ,x i )aretransformedto(q,q 00 ,x)symbols, otherwise they becomeχ. The former case, which is indicated on line 24, accounts for the transformation from q 00 B 1 .x 1 to (q A ,q 00 B 1 ).x 4 . The latter case, which is indicated on line 26, accounts for the transformation from q 00 B 2 .x 2 toχ. The result of that transformation is attached to the original z at position v 0 ; this is indicated on line 27. The new θ 0 is extended from the oldθ, as indicated on line 18. For each immediate child vi of v that has a corresponding leaf symbol in h marked with x i at position v 00 , the position in the newly built tree will be v 0 v 00 . The pair (vi,v 0 v 00 ) is mapped to the state originally at v 00 , as indicated on line 22. Finally, the new weight is obtained by multiplying the original weight,wwiththeweightoftherule,w 0 . 
60 Algorithm14COVER 1: inputs 2: u∈ T Δ (Q A ×X) 3: wTM B = (Q B ,Δ,Γ,R B ,q 2 0 )overW 4: stateq B ∈ Q B 5: outputs 6: setΠoftriples{(z,θ,w) : z∈ T Γ ((Q A ×Q B )×X),θapartialmappingpos(u)×pos(z) → Q B , and w∈W}, each triple indicating a successful run on u by rules in R B , startingfromq B ,formingz,andw,theweightoftherun. 7: complexity 8: O(|R B | size(u) ) 9: ifu(ε)isoftheform(q A ,x)∈ Q A ×Xthen 10: Π last ←{(((q A ,q B ),x),{((ε,ε),q B )},1)} 11: else 12: Π last ←{(χ,{((ε,ε),q B )},1)} 13: forallv∈ pos(u)suchthatu(v)∈Δ (k) forsomek≥ 0inprefixorderdo 14: Π v ←∅ 15: forall(z,θ,w)∈Π last do 16: forallv 0 ∈ leaves(z)suchthatz(v 0 )=χdo 17: forallθ(v,v 0 ).u(v) w 0 −→ h∈ R B do 18: θ 0 ←θ 19: Formsubstitutionmappingϕ : (Q B ×X)→ T Γ ((Q A ×Q B ×X)∪{χ}). 20: fori= 1tokdo 21: forallv 00 ∈ pos(h)suchthath(v 00 )= (q 00 B ,x i )forsomeq 00 B ∈ Q B do 22: θ 0 (vi,v 0 v 00 )← q 00 B 23: ifu(vi)isoftheform(q A ,x)∈ Q A ×Xthen 24: ϕ(q 00 B ,x i )← ((q A ,q 00 B ),x) 25: else 26: ϕ(q 00 B ,x i )←χ 27: Π v ←Π v ∪{(z[ϕ(h)] v 0,θ 0 ,w·w 0 )} Π last ←Π v 28: return Π last Example2.3.9 Let M 7 = ({q 1 ,q 2 },Σ,Δ,R 7 ,q 1 ) and M 8 = ({q 3 ,q 4 ,q 5 },Δ,Γ,R 8 ,q 3 ) be wtts whereΣ andΔ are those ranked alphabets defined in Examples 2.3.2, 2.1.1, and 2.1.3, Γ={υ (0) ,ψ (1) ,ω (2) }, and R 7 and R 8 are depicted in Figures 2.15a and 2.15b, respectively. ThenM 9 = ({q 1 q 3 ,q 2 q 3 ,q 2 q 4 ,q 2 q 5 },Σ,Γ,R 9 ,q 1 q 3 ),whereR 9 isdepictedinFigure2.15c,is formedbyAlgorithm13appliedtoM 7 andM 8 . 61 1. σ x 1 α x 2 ν q 1 .x 1 ν λ q 2 .x 1 .1 − → q 1 2. α λ .2 − → q 1 3. β x 1 λ .3 − → q 2 (a)R 7 ,rulesetofM 7 . 4. ν x 1 x 2 ω ω q 3 .x 1 υ q 3 .x 2 .4 − → q 3 5. ν x 1 x 2 ω q 3 .x 1 q 4 .x 2 .5 − → q 3 6. ν x 1 x 2 ω q 3 .x 1 q 5 .x 2 .6 − → q 4 7. λ υ .7 − → q 3 8. λ υ .8 − → q 4 9. λ υ .9 − → q 5 (b)R 8 ,rulesetM 8 . 10. γ x 1 α x 2 ω ω q 1 q 3 .x 1 υ ω ω υ υ q 2 q 3 .x 1 .0112 −−−→ q 1 q 3 11. γ x 1 α x 2 ω ω q 1 q 3 .x 1 υ ω υ q 2 q 4 .x 1 .014 −−−→ q 1 q 3 12. γ x 1 α x 2 ω q 1 q 3 .x 1 ω υ q 2 q 5 .x 1 .021 −−−→ q 1 q 3 13. α υ .14 −−→ q 1 q 3 14. α υ .16 −−→ q 2 q 4 15. β x 1 υ .21 −−→ q 2 q 3 16. β x 1 υ .27 −−→ q 2 q 5 (c)R 9 ,rulesetofthecomposedtransducerM 9 . Figure2.15: RulesetsfortransducersdescribedinExample2.3.9. 62 θ z w (,)→ q 3 (1,1.1)→ q 3 (2,2)→ q 3 (2.1,2.1.1)→ q 3 (2.2,2.2)→ q 3 ω ω q 1 q 3 .x 1 υ ω ω υ υ q 2 q 3 .x 1 .112 (,)→ q 3 (1,1.1)→ q 3 (2,2)→ q 3 (2.1,2.1)→ q 3 (2.2,2.2)→ q 4 ω ω q 1 q 3 .x 1 υ ω υ q 2 q 4 .x 1 .14 (,)→ q 3 (1,1)→ q 3 (2,2)→ q 4 (2.1,2.1)→ q 3 (2.2,2.2)→ q 5 ω q 1 q 4 .x 1 ω υ q 2 q 5 .x 1 .21 Figure 2.16: Π formed in Example 2.3.9 as a result of applying Algorithm 14 toν(q 1 .x 1 , ν(λ,q 2 .x 1 )),M 8 ,andq 3 . 63 Let us describe in more detail how the rules of Figure 2.15c are formed, particularly rules10,11,and12. Thefirststatepairchoseninline10is(q 1 ,q 3 ). Ofparticularinterest to us is what happens when rule 1 is chosen at line 11. We now turn to Algorithm 14, which finds all coverings ofν(q 1 .x 1 ,ν(λ,q 2 .x 1 )) with rules from M 8 , starting from q 3 . A covering is denoted by the output tree z that is formed from assembling right sides of rulesfromM 8 (andthatwillbecomethenewlyproducedrule’srightside),themapping θ, which indicates how z was built, and the derivation weight w corresponding to the product of the weights in the rules used to form the covering. Figure 2.16 shows the three entries for z,θ, and w corresponding to the three coverings ofν(q 1 .x 1 ,ν(λ,q 2 .x 1 )) with rulesfrom M 8 , starting from q 3 . 
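Because Example 2.3.9 supplies concrete numbers, the tiling performed by COVER can be checked with a small amount of executable code. The sketch below is a deliberately simplified rendering under several assumptions of mine: M_B is linear, nondeleting, and not extended; states and variables are encoded as plain strings and tuples; and the position mapping θ maintained by Algorithm 14 is not tracked—only the assembled output tree and weight are returned. Run on the right side ν(q1.x1, ν(λ, q2.x1)) of M7's first rule together with the rules of M8 (Figure 2.15b, below), it produces the three coverings discussed below and, after multiplying by that rule's weight .1, the weights .0112, .014, and .021 of rules 10–12 in Figure 2.15c.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple

@dataclass(frozen=True)
class Tree:
    label: str
    children: Tuple = ()   # children are Trees or (state, variable-index) tuples

def cover(u, qB, rulesB):
    """All tilings of u (a rule right side of M_A, with (stateA, i) leaves) by rules of M_B,
    starting in state qB. Yields (z, w): z is the assembled output tree, whose leaves may be
    composed-state variables ((stateA, stateB), i); w is the product of M_B rule weights used."""
    if isinstance(u, tuple):                       # u = (qA, i): M_B waits here, so emit
        qA, i = u                                  # the composed state (qA, qB) over x_i
        yield ((qA, qB), i), 1.0
        return
    for h, w in rulesB.get((qB, u.label), []):     # every M_B rule qB.label(x1..xk) -> h
        for z, w2 in expand(h, u.children, rulesB):
            yield z, w * w2

def expand(h, args, rulesB):
    """Instantiate an M_B right side h: a leaf (q', i) means 'keep covering the i-th child
    of u from state q''; every other node of h is copied to the output."""
    if isinstance(h, tuple):
        q2, i = h
        yield from cover(args[i - 1], q2, rulesB)
        return
    options = [list(expand(c, args, rulesB)) for c in h.children]
    for combo in product(*options):
        w = 1.0
        for _, wc in combo:
            w *= wc
        yield Tree(h.label, tuple(z for z, _ in combo)), w

# M_B = M_8 of Example 2.3.9 (rules 4-9), keyed by (state, input symbol).
R8 = {
    ("q3", "nu"):  [(Tree("omega", (Tree("omega", (("q3", 1), Tree("upsilon"))), ("q3", 2))), 0.4),
                    (Tree("omega", (("q3", 1), ("q4", 2))), 0.5)],
    ("q4", "nu"):  [(Tree("omega", (("q3", 1), ("q5", 2))), 0.6)],
    ("q3", "lam"): [(Tree("upsilon"), 0.7)],
    ("q4", "lam"): [(Tree("upsilon"), 0.8)],
    ("q5", "lam"): [(Tree("upsilon"), 0.9)],
}

# u = nu(q1.x1, nu(lambda, q2.x1)), the right side of M_7's rule 1, which has weight .1.
u = Tree("nu", (("q1", 1), Tree("nu", (Tree("lam"), ("q2", 1)))))
for z, w in cover(u, "q3", R8):
    print(round(0.1 * w, 4))   # 0.0112, 0.014, 0.021 -- the weights of rules 10-12 in Fig. 2.15c
```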
The entriescorrespond to the derivations (4, 4, 7), (4,5,7),and(5,6,7),respectively. VariationsonCOVERarepresentedinChapter4. Thesevariationsaredifferentfrom the algorithm presented here in that the “tiling” device is a wrtg, not a transducer, a result tree is not explicitly built, and the algorithms allow on-the-fly discovery of the tilingwrtg’sproductions. 2.4 Tree-to-stringandstringmachines We have thus far focused our discussion on tree-generating automata (wrtgs) and tree- transforming transducers (wxtts and wtts). However, it is frequently the case that we wish to transform between tree and string objects. Parsing, where a tree structure is placed on top of a given string, and recent work in syntactic machine translation are goodexamplesofthis. Thismotivatesthedescriptionoftwomoreformalmachines,the 64 weighted context-free grammar , a well studied string analogue to wrtgs, and the weighted, extendedtree-to-stringtransducer ,aminorvariationonwxtts. Noteinthedefinitionsbelowthatisaspecialsymbolnotcontainedinanyterminal, nonterminal,orstateset. Itdenotestheemptyword. Definition2.4.1(cf. SalomaaandSoittola[117]) Aweightedcontext-freegrammar(wcfg) oversemiringWisa4-tuple G=(N,Δ,P,n 0 )where: 1. N isafinitesetofnonterminals,withn 0 ∈ N thestartnonterminal 2. Δistheterminalalphabet. 3. P is a tuple (P 0 ,π), where P 0 is a finite set of productions, each production p of the form n− → g, n∈ N, g∈ (Δ∪ N) ∗ , and π : P 0 →W is a weight function of the productions. Withintheseconstraintswemay(andusuallydo)refertoPasafinite setofweightedproductions,eachproductionpoftheformn π(p) −−−→ g. Weassociate PwithG,suchthat,e.g.,p∈ Gisinterpretedtomeanp∈ P. For wcfg G= (N,Δ,P,n 0 ), e, f, g∈ (Δ∪N) ∗ , n∈ N, and p∈ P of the form n w − → g∈ P, weobtainaderivationstepfrometo f byreplacinganinstanceofninewith g. Formally, e⇒ p G f if there exist e 0 ,e 00 ∈ (Δ∪ N) ∗ such that e = e 0 ne 00 and f = e 0 ge 00 . We say this derivation step is leftmost if e 0 ∈ Δ ∗ . Except where noted and needed, we henceforth assume all derivation steps are leftmost and drop the subscript G. If, for some m∈N, p i ∈ P, and e i ∈ (Δ∪ N) ∗ for all 1≤ i≤ m, n 0 ⇒ p 1 e 1 ...⇒ p m e m , we say the sequence (p 1 ,...,p m ) is a derivation of e m in G and that n 0 ⇒ ∗ e m . The weight of a derivation d, wt(d),istheproductoftheweightsofalloccurrencesofitsproductions. 65 Definition2.4.2 A weighted extended top-down tree-to-string transducer (wxtst) is a 5-tuple M=(Q,Σ,Δ,R,q 0 )where: 1. QandΣaredefinedasforwtts,andΔisdefinedasforwcfgs. 2. R is a tuple (R 0 ,π). R 0 is a finite set of rules, each rule r of the form q.y w − → g for q∈ Q, y∈ T Σ (X),and g∈ (Δ∪(Q×X)) ∗ . Wefurtherrequirethat yislinearinX,i.e., no variable x∈ X appears more than once in y, and that each variable appearing in g is also in y. π : R 0 →W is a weight function of the rules. As for wrtgs, wtts, and wcfgs, we refer to R as a finite set of weighted rules, each rule r of the form q.y π(r) −−−→ g. For wxtst M = (Q,Σ,Δ,R,q 0 ), e, f ∈ (Δ∪ (Q× T Σ )) ∗ , q∈ Q, and r∈ R of the form q.y w − → g where g = g 1 g 2 ...g k for some k∈N, and g i ∈ Δ∪ (Q× X), 1≤ i≤ k, we obtain a derivation step from e to f by replacing some substring of e of the form (q,t), whereq∈ Q,t∈ T Σ ,andtmatches y,byatransformationof g,whereeachinstanceofa variablehasbeenreplacedbyacorrespondingsubtreeofthe y-matchingtree. 
Formally, e⇒ r M f ifthereexiste 0 ,e 00 ∈ (Δ∪(Q×T Σ )) ∗ suchthate= e 0 (q,t)e 00 ,asubstitutionmapping ϕ : X→ T Σ , and arule q.y w − → g∈ R such that t=ϕ(y) and f = e 0 θ(g 1 )...θ(g k )e 00 , where θ is a mappingΔ∪ (Q× X)→ Δ∪ (Q× T Σ ) defined such thatθ(λ) = λ for allλ∈ Δ andθ(q,x) = (q,ϕ(x)) for all q∈ Q and x∈ X. We define leftmost, derivation, and wt for wxtst as we do for wcfg. We also define a weighted tree-string transformation τ M (s, f) for alls∈ T Σ and f∈Δ ∗ inananalogouswaytothatforweightedtreetransformations. Weextendthepropertieslinearandnondeletingtowxtst. Wedifferentiatewxtstfrom wxttbyappendingtheletter“s”andomittheletter“x”todenotewxtstwithoutextended 66 1. σ γ α x 1 x 2 x 3 .1 − → q 2 .x 1 q 2 .x 2 q 2 .x 3 q 1 2. γ x 1 x 2 .2 − → q 2 .x 1 q 2 .x 2 q 2 3. β α .3 − → λ q 2 4. α .4 − → λ q 2 (a)rulesetR 10 fromexamplexLNTsM 10 . 1. q 1 .1 − → q 2 q 2 q 2 2. q 2 .2 − → q 2 q 2 3. q 2 .7 − →λ (b)productionsetP 11 fromfromexamplewcfgG 11 . Figure 2.17: Rules and productions for an example wxtst and wcfg, respectively, as describedinExample2.4.3. left sides (i.e., wtsts). Thus, xNTs is the class of nondeleting wxtsts over the Boolean semiring, and wLNTs is the class of linear and nondeleting wtsts over an arbitrary semiring. Notethatasforwcfgs,therightsideofawxtstrulecanbe,butweretainthe originalnomenclatureforan-freewxtst. Wedonotprovideanextensiverecapofalgorithmsforthesestring-basedstructures as we did for wrtgs and wxtts, as most of the algorithms previously presented are intuitivelyextendable,orinapplicabletothestringortree-stringcase. InChapters4and 5wewilldiscussalgorithmsthatareinterestinglydefinedforwxtsts. Example2.4.3 LetM 10 = ({q 1 ,q 2 },Σ,Δ,R 10 ,q 1 )beawxtstwhereΣistherankedalphabet defined in Examples 2.3.2, 2.1.1, 2.1.3, and 2.3.9,Δ ={λ}, and R 10 is depicted in Figure 2.17a. Additionally,letG 11 = ({q 1 ,q 2 },Δ,P 1 ,q 1 )beawcfgwhereP 11 isdepictedinFigure 2.17b. NotethatG 11 istherangeprojectionofM 10 . 67 2.5 UsefulclassesforNLP Knight[73]laidoutfourpropertiestreetransducersshouldhaveiftheyaretobeofmuch useinpracticalNLPsystems: • They should be expressive enough to capture complicated transformations seen in observednaturallanguagedata. • Theyshouldbeinclusiveenoughtogeneralizeotherwell-knowntransducerclasses. • They should be teachable so that a model of real-world transformations informed by observed data can be built such that the model’s parameters coincide with the transducers’transformations. • Theyshouldbemodular,sothatcomplicatedtransformationscanbebrokendown intoseveraldiscrete,easy-to-buildtransducersandthenusedtogether. Knightinstantiatedthesegeneralpropertiesinfourconcreteways: • An expressive transducer can capture the local rotation demonstrated in Figure 2.18. • AteachabletransducercanuseEMandatrainingcorpustoassignweightstothe transducer’srules. • Aninclusivetransducergeneralizeswfst. • Amodulartransducerisclosedundercomposition. Knightanalyzedseveralclassesoftreetransducer,includingmanyofthosediscussed in this chapter, as well as classes beyond the scope of this thesis such as bottom-up tree 68 S VP A B C S A B C Figure 2.18: Example of the tree transformation power an expressive tree transducer shouldhave,accordingtoKnight[73]. There-orderingexpressedbythistransformation iswidelyobservedinpractice,overmanysentencesinmanylanguages. transducers, and determined that none of them had all four of the desired properties. 
One of the classes analyzed in [73], wxLNT, is of particular interest because it is the basis of a state-of-the-art machine translation system [47, 46]. From the perspective of thefourdesiredpropertiesitisalsoquitepromising,aswxLNTisexpressive,teachable, and inclusive, though not modular according to Knight’s definitions, because it is not closed under composition. However, one may reasonably consider a transducer as modularifitanditsinversepreserverecognizability. Atransducer(oritsinverse)preserves recognizability if the transformation of all members of a recognizable language by the transducerisitselfarecognizablelanguage. 7 Transducerclasseswiththispropertysatisfy the idea of modularity since they can then be used in a pipelined cascade in a forward orbackwarddirection,witheachtransducerprocessingtheoutputofitsneighbor. Figure2.19, whichisanalogoustoasimilarfigurein[73], depictsthegeneralization relationship between the classes of transducer discussed in this thesis as well as the desired properties they satisfy, substituting preservation of recognizability for closure under composition. As can be seen from the figure, wxLNT possesses all the desired 7 WewilldiscusspreservationofrecognizabilityformallyandingreaterdetailinChapter4. 69 wxT wT wxLT wLT wxLNT wLNT wfst wxNT wNT generalizes wfst preserves recognizability expressive for local rotation Figure 2.19: Tree transducers and their properties, inspired by a similar diagram by Knight [73]. Solid lines indicate a generalization relationship. Shaded regions indicate a transducer class has one or more of the useful properties described in Section 2.5. All transducersinthisfigurehaveanEMtrainingalgorithm. properties and is thus a good choice for further research. With the exception of compo- sition, all thetreetransducer-relatedalgorithms discussedinthis thesisareappropriate forwxLNT,andeventhecompositionalgorithmcanbeusedtocomposeawxLNTwith awLNT.Thetree-to-stringvariantofwxLNT,wxLNTs,alsosatisfiestheseproperties,if weconsidermodularityforawxLNTstomeanitproducesarecognizablestringlanguage givenarecognizabletreelanguageasinput. 2.6 Summary Wehavediscussedtheprincipalformalstructuresofinterestinthisthesisandpresented useful algorithms for them, some of which are presented in implementable algorithmic form for the first time. We defer discussion of some algorithms, such as weighted 70 determinizationofwrtgs,trainingofwrtgsandwxtts,andapplicationofawxtttoawrtg tosubsequentchapters. 71 Chapter3 DWTA Inthischapterwepresentthefirstpracticaldeterminizationalgorithmforchainproduction- free wrtgs in normal form. 1 This work, which was first presented in [95], elevates a determinization algorithm for wsas to the tree case and demonstrates its effectiveness on two real-world experiments using acyclic wrtgs. We additionally present joint work withMatthiasB¨ uchseandHeikoVoglerthatprovesthecorrectnessofourdeterminiza- tion algorithm for acyclic wrtgs and lays out the conditions for which this algorithm is appropriateforcyclicwrtgs. Thatworkwasfirstpresentedin[17]. 3.1 Motivation Ausefultoolinnaturallanguageprocessingtaskssuchastranslation,speechrecognition, parsing, etc., is the ranked list of results. Modern systems typically produce competing partialresultsinternallyandreturnonlythetop-scoringcompleteresulttotheuser. They 1 Achainproduction-freewrtginnormalformisequivalenttoaweightedtreeautomaton(wta). Inthis chapterwewillprimarilyspeakintermsofwrtgsbutourvisualizationswillbeofwtas,whichprovidesome visual intuition. 
Additionally it should be noted that the original work this chapter is based on discussed wtasexclusively. 72 are, however, also capable of producing lists of runners-up, and such lists have many practicaluses: • The lists may be inspected to determine the quality of runners-up and motivate modelchanges. • Thelistsmaybere-rankedwithextraknowledgesourcesthataredifficulttoapply duringthemainsearch. • Thelistsmaybeusedwithannotationandatuningprocess,suchasinCollinsand Roark[26],toiterativelyalterfeatureweightsandimproveresults. Figure 3.2 shows the best 10 English translation parse trees obtained from a syntax- based translation system based on that of Galley et al. [47]. Notice that the same tree occurs multiple times in this list. This repetition is quite characteristic of the output of ranked lists. It occurs because many systems, such as the ones proposed by Bod [10], Galley et al. [47], and Langkilde and Knight [84] represent their result space in terms of weighted partial results of various sizes that may be assembled in multiple ways. Thereisingeneralmorethanonewaytoassemblethepartialresultstoderivethesame completeresult. Thus,thek-bestlistofresultsisreallya k-bestlistof derivations. When list-based tasks, such as the ones mentioned above, take as input the top k results for some constant k, the effect of repetition on these tasks is deleterious. A list withmanyrepetitionssuffersfromalackofusefulinformation,hamperingdiagnostics. Repeatedresultspreventalternativesthatwouldbehighlyrankedinasecondaryrerank- ing system from even being considered. And a list of fewer unique trees than expected can cause overfitting when this list is used to tune. Furthermore, the actual weight of 73 1 1 2 2 (a)Nondeterministicwrtg 1 1 2 2 (b)Afterdeterminization Figure 3.1: Example of weighted determinization. We have represented a nondeter- ministic wrtg as a weighted tree automaton, then used the algorithm presented in this chaptertoreturnadeterministicequivalent. obtaining any particular tree is split among its repetitions, distorting the actual relative weightsbetweentrees. Asamoreconcreteexample,considerFigure3.1a,whichdepicts a nondeterministic wrtg (as a bottom-up wta) over the probability semiring. This wrtg has three paths, one of which recognizes the tree d(a,c) with weight .054, and two of which recognize the tree d(a,b), once with weight .024 and once with weight .036. The highest weighted path in this wrtg does not recognize the highest weighted tree in the associated tree series. We would prefer a deterministic wrtg, such as in Figure 3.1b, whichcombinesthetwopathsford(a,b)intoone. Mohri[99]encounteredthisprobleminspeechrecognition,andpresentedasolution totheproblemofrepetitionink-bestlistsofstringsthatarederivedfromwfsas. Thatwork described a way to use a powerset construction along with an innovative bookkeeping systemtodeterminizeawfsa,resultinginawfsathatpreservesthelanguagebutprovides asingle,properlyweightedderivationforeachstringinit. Putanotherway,iftheinput wfsa has the ability to generate the same string with different weights, the output wfsa generatesthatstringwithweightequaltothesumofallofthepathsgeneratingthatstring in the input wfsa. 
Mohri and Riley [104] combined this technique with a procedure for 74 34.73: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(caused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) •34.74: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 34.83: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(caused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) •34.83: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 34.84: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(caused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 34.85: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(caused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) •34.85: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) •34.85: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 34.87: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VB(arouse) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) •34.92: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) Figure3.2: Rankedlistofmachinetranslationresultswithrepeatedtrees. Scoresshown arenegativelogsofcalculatedweights,thusalowerscoreindicatesahigherweight. The bulletedsentencesindicateidenticaltrees. efficientlyobtainingk-bestlists,yieldingalistofstringresultswithnorepetition. Mohri’s algorithm was shown to be correct if it terminates, and shown to terminate for acyclic wsasovertheprobabilityandtropicalsemirings[101]aswellasforcyclicwsasoverthe tropicalsemiringthatsatisfyastructuralpropertycalledthetwinsproperty. In this chapter we extend that work to deal with wrtgs. We will present an algo- rithmfordeterminizingwrtgsthat,likeMohri’salgorithm,providesacorrectresultifit terminates and that terminates both for acyclic wrtgs over the probability and tropical semirings and for cyclic wrtgs satisfying an analogously defined twins property. We applythisalgorithmtowrtgsrepresentingvastforestsofpotentialoutputsgeneratedby machinetranslationandparsingsystems. Wethenuseavariantofthek-bestalgorithm ofHuangandChiang[59]toobtainlistsoftreesfromtheforests,bothbeforeandafterde- terminization, and demonstrate that applying the determinization algorithm improves ourresults. 75 3.2 Relatedwork Comon et al. [27] show the determinization of unweighted finite-state tree automata, and prove its correctness. Borchardt and Vogler [14] present determinization of wtas with a different method than the one we present here. Like our method, their method returnsacorrectresultifitterminates. However,theirmethodrequirestheimplementor to specify an equivalence relation of trees; if such a relation has a finite number of equivalence classes, then their method terminates ([13], Cor. 5.1.8). We consider our methodmorepracticalthantheirsbecausewerequirenosuchexplicitspecification,and because when their method is applied to an acyclic wta, the resulting wta has a size on theorderofthenumberofderivationsrepresentedinthewta. Ourmethodhasthesame liabilityintheworstcase,butinpracticerarelyexhibitsthisbehavior. 3.3 Practicaldeterminization WerecallbasicdefinitionsfromChapter2, particularlythatofsemirings(Section2.1.2), tree series (Section 2.1.3), and wrtgs (Section 2.2). We introduce some more semiring notions(cf. [17],Section2). 
LetWbeasemiring,andletw 1 andw 2 bevaluesinW. We sayW is zero-divisor free if w 1 ·w 2 = 0 implies that w 1 = 0 or w 2 = 0,W is zero-sum free if w 1 +w 2 = 0 implies that w 1 = w 2 = 0, andW is a semifield if it admits multiplicative inverses, i.e., for every w∈W\{0} there is a uniquely determined w −1 ∈W such that w·w −1 = 1. Note that the probability semiring is a zero-divisor free and zero-sum free semifield. The tropical semiring is also a zero-divisor free and zero-sum free semifield, 76 ifwedisallowthe∗operatorandconsequentlyremovethevalue−∞fromitscarrierset. Fortheprobabilitysemiring,w −1 = 1/wandforthetropicalsemiring,w −1 =−w. ThedeterminizationalgorithmispresentedasAlgorithm15. Ittakesasinputachain production-free and normal form wrtg G in over a zero-sum free and zero-divisor free semifieldW,andreturnsasoutputadeterministicwrtgG out suchthatL G in = L G out . Like the algorithmof Mohri [99], thisalgorithm is correctif it terminates, andwill terminate for acyclic wrtgs. It may terminate on some cyclic wrtgs, as described in the joint work portionofSection3.4,butwedonototherwiseconsiderthesecasesinthischapter. Determinizing a wrtg can be thought of as a two-stage process. First, the structure ofthe wrtgmustbe determinedsuch thatasingle derivationexistsfor eachrecognized input tree. This is achieved by a classic powerset construction, i.e., a nonterminal must be constructed in the output wrtg that represents all the possible reachable destination nonterminalsgivenaninputandalabel. NotethatthestructureofAlgorithm15isvery similartoclassical(unweighted)determinization,aspresentedinAlgorithm5. Inthesecondstagewedealwithweights. ForthiswewilluseMohri[99]’sconceptof theresidualweight. 2 Werepresentintheconstructionofnonterminalsintheoutputwrtg notonlyasubsetofnonterminalsoftheinputwrtg,butalsoavalueassociatedwitheach of these nonterminals, called the residual. Since we want the weight of the derivation of each tree in G out to be equal to the sum of the weights of all derivations of that tree in G in , we replace a set of productions in G in that have the same right side with a single productioninG out bearingthelabelandthesumoftheweightsoftheproductions. The leftnonterminaloftheproductioninG out representstheleftnonterminalsineachofthe 2 For ease of intuitive understanding we describe weights in this section over the probability semiring. However,anysemiringmatchingtheconditionsspecifiedinAlgorithm15willsuffice. 77 (a)Beforedeterminization (b)Afterdeterminization Figure3.3: PortionoftheexamplewrtgfromFigure3.1beforeandafterdeterminization. Weights of similar productions are summed and nonterminal residuals indicate the proportionofweightduetoeachoriginalnonterminal. combinedproductionsand,foreachnonterminal,theproportionoftheweightfromthe relevantoriginalproduction. Figure 3.3 shows the determinization of a portion of the example wrtg. Note that the production leading to nonterminal R in the input wrtg contributes 0.2, which is 1 3 of the weight on the output wrtg production. The production leading to nonterminal S in the input wrtg contributes the remaining 2 3 . This is reflected in the nonterminal construction in the output wrtg. The complete determinization of the example wrtg is showninFigure3.1b. LetusconsiderinmoredetailhowproductionsinG out ,theoutputwrtg,areformed. Itisillustrativetofirstdescribetheconstructionofterminalproductions,i.e.,thosewith right side labels inΣ (0) . This construction is done in lines 10–16. 
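The weight bookkeeping in those lines, and the analogous bookkeeping for nonterminal productions in lines 21–29, amounts to a single sum-and-normalize step over the semifield: the weights of the combined input productions are summed to give the new production weight, and each input weight is then multiplied by the inverse of that sum to give its residual. The Python sketch below shows this step over the probability semiring on the weights of Figure 3.3, where the production for R carries weight .2 and the production for S carries the implied weight .4; the function name and the pair encoding are ours and are not taken from the Tiburon implementation of Chapter 6.

def combine(prods):
    # prods: list of (left nonterminal, weight) pairs for input productions
    # that share a right side.  Returns the weight of the single combined
    # output production and the residual recorded for each nonterminal, as in
    # lines 11-12 and 23-26 of Algorithm 15, specialized to the probability
    # semiring, where the multiplicative inverse of w is 1/w.
    total = sum(w for _, w in prods)
    residuals = {n: w / total for n, w in prods}
    return total, residuals

# Figure 3.3: weights .2 (from R) and .4 (from S) combine into a single
# production of weight .6 whose left nonterminal records the residuals
# {R: 1/3, S: 2/3}.
total, residuals = combine([("R", 0.2), ("S", 0.4)])

Section 3.4.1 recasts exactly this normalization as a maximal factorization (f, g), with g(u) the sum of the entries of u and f(u) = g(u)^-1 · u.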
To be deterministic, there can only be one terminal production for each unique terminal symbol α. And as determined in line 11, the weight of this production is the sum of the weight of all α-labeled productions in G in . It remains to determine the left nonterminal of this production. In the analogous construction for unweighted determinization this would simplybeanonterminalrepresentingtheunionofallleftnonterminalsoftheα-labeled productions in G in , as can be seen in line 11 of Algorithm 5. For the weighted case we 78 need to add a residual weight; to do this we simply calculate the fraction of the total weight contributed by each production and assign that value to that production’s left nonterminal, in line 12. The (possibly) newly constructed nonterminal is then added to thesetofnonterminalstobeusedinformingnonterminalproductions. The formation of nonterminal productions follows the same principle as that of terminalproductions,butasequenceofdescendant(i.e.,rightside)nonterminalsandthe contributionofthatsequencetotheproductionweightandtheleftnonterminalmustbe considered. Theformationinlines21–29isajointgeneralizationoftheformationinlines 20–26forrtgsandthatoflines10–12inFigure10of[99]forwsas. Anonterminalsymbol σandasequenceofnonterminals~ ρischoseninlines21–22. Tobedeterministic,therecan onlybeonenonterminalproductioninG out withrightsideσ(~ ρ). Tocalculatetheweight ofthisproductionweconsidereachproductioninG in that“fits”theselectedsymboland sequence,asindicatedintheelementsunderthesummationofline23. Aproductionfits ifithasthecorrectsymbolσandifeachofthenonterminalsinitsdescendantsequence has a non-zero residual weight in the corresponding position of ~ ρ. As calculated to the right of the summation symbol in line 23, every production that fits contributes the product of its weight and the residuals associated with its descendant nonterminals to the new production’s weight. The contribution of each production is summed together to form the total new production weight, in line 24. As in the unweighted case, the left sidenonterminalofthenewproductioniscalculatedbytakingtheunionoftheleftside nonterminalsofallfittingproductions. Theresidualforeachnonterminalinthisunionis calculatedbysummingtheindividualcontributionsoffittingproductionswithleftsides 79 equal to that nonterminal. 3 That fraction of the total production weight is the residual forthatnonterminal,asseeninline26. Finally, to ensure G out has a single initial nonterminal, an unused symbol is added to the set of nonterminals in G out . 4 A chain production is created linking the new initial nonterminaltoeachnonterminalthathasanon-zeroresidualassociatedwiththeinitial nonterminalofG in . Theweightofthisproductionisequaltotheresidualassociatedwith theoriginalinitialnonterminal. 5 Thisconstructionisdoneinlines15–16and30–31. 3.4 Determinizationusingfactorizations ThissectionpresentsaproofofcorrectnessofAlgorithm15. ItisjointworkwithMatthias B¨ uchse and Heiko Vogler, first published in [17]. The results are primarily due to the other two authors, but our presentation of the material differs significantly, in order to matchourformalisms. Inthissectionwepresenttheideaoffactorizationsforweighted automata,introducedbyKirstenandM¨ aurer[72],andextendittowrtgs. Wethenshow that the construction of Algorithm 15 is indeed done via factorization. We also present theinitialalgebrasemantics,amethodofcalculatingthetreeseriesrepresentedbyawrtg that is different from that in Section 2.2, but equivalent in conclusion. 
This semantics is crucial in demonstrating language equivalence. Next we prove that if determinization viafactorizationterminates,itgeneratesawrtgwiththesamelanguageastheinputwrtg, thusprovingtheconstructionofAlgorithm15iscorrect. Finallywedescribeconditions 3 Unlikeintheterminalcase,therecanbemorethanonesuchproductionforagivensymbol. 4 Infact,itisthesameinitialnonterminalfromG in ,butclearlythatsymbolhasnotbeenusedthusfar. 5 ThiscorrespondstothecalculationoffinalweightsinAlgorithm10of[99]. 80 Algorithm15WEIGHTED-DETERMINIZE 1: inputs 2: wrtg G in = (N in ,Σ,P in ,n 0 ) over zero-sum free and zero-divisor free semifield W in normalformwithnochainproductions 3: outputs 4: deterministic wrtg G out = (N out ⊆ (P(N×W)∪ n 0 ),Σ,P out ,n 0 ) overW in normal formsuchthatL G in = L G out . 5: complexity 6: O(|Σ|k ˜ z|supp(L G in )| ),wherekisthehighestrankinΣand ˜ zisthesizeofthelargesttree insupp(L G in ),ifL G in isfinite. 7: P out ←∅ 8: Ξ←∅{Seennonterminals} 9: Ψ←∅{Newnonterminals} 10: forallα∈Σ (0) do 11: w total = M n w − →α∈P in w 12: ρ dst ←{(n,w·w −1 total )| n w − →α∈ P in } 13: Ψ←Ψ∪{ρ dst } 14: P out ← P out ∪{ρ dst w total −−−→α} 15: if(n 0 ,w)∈ρ dst forsomew∈Wthen 16: P out ← P out ∪{n 0 w − →ρ dst } 17: whileΨ,∅do 18: ρ new ←anyelementofΨ 19: Ξ←Ξ∪{ρ new } 20: Ψ←Ψ\{ρ new } 21: forallσ (k) ∈Σ\Σ (0) do 22: forall~ ρ=ρ 1 ...ρ k |ρ 1 ...ρ k ∈Ξ k ,ρ i =ρ new forsome1≤ i≤ kdo 23: φ←{(n, M n w − →σ(n 1 ,...,n k )∈P (n) in ,(n 1 ,w 1 )∈ρ 1 ,...,(n k ,w k )∈ρ k w· k Y i=1 w i )| P (n) in ,∅} 24: w total = M (n,w)∈φ w 25: ifw total , 0then 26: ρ dst ←{(n,w·w −1 total )| (n,w)∈φ} 27: ifρ dst <Ξthen 28: Ψ←Ψ∪{ρ dst } 29: P out ← P out ∪{ρ dst w total −−−→σ(~ ρ)} 30: if(n 0 ,w)∈ρ dst forsomew∈Wthen 31: P out ← P out ∪{n 0 w − →ρ dst } 32: N out ←Ξ∪{n 0 } 33: return G out 81 forwhichthealgorithmterminates,andnotethatoneoftheseappliestotheexperiments presentedinSection3.5. 3.4.1 Factorization We begin with a definition of factorization, an abstract algebraic structure, and then describe how it relates to the determinization construction. We recall the typical notion ofvectors,andforsomevectorswewrites i toindicatethememberofsassociatedwith i∈ I, where I is an index set for s. We write a finite vector s with members a,b,c as ha,b,ci. When the indices are not clear (typically when I ,N) and are, e.g., i 1 ,i 2 ,i 3 we denote their association with values ashi 1 = a,i 2 = b,i 3 = ci. We extend the definition ofthesemiringmultiplication·suchthatw·s,wherewisavalueofWandsisavector of values ofW, is equal to the magnification of s by w. For example, if s=ha,b,ci, then w·s=hw·a,w·b,w·ci. Definition3.4.1(cf. [17]) LetAbeanonemptyfiniteset. Apair(f,g)isafactorization(of dimensionAoversemiringW)if f :W A \{0 A }→W A ,g :W A \{0 A }→W,andu= g(u)·f(u) foreveryu∈W A \{0 A },whereW A denotesavectorofvaluesofWindexedonA,and0 A denotesthevectorindexedonAwhereallvaluesare0. Afactorizationismaximaliffor everyu∈W A andw∈W,w·u, 0 A implies f(u)= f(w·u). Thereisauniquelydefined trivialfactorization(f,g)where f(u)= uand g(u)= 1foreveryu∈W A \{0 A }. Wenowintroduceaparticularchoicefor(f,g). Wewillfirstshowthatthisparticular factorizationisamaximalfactorization,andthenwewillshowthatthisfactorizationis thefactorizationusedinAlgorithm15. 82 Lemma3.4.2(cf. [17],Lemma4.2) Let A be a nonempty finite set, letW be a zero-sum freesemifield,anddefine f(u)andg(u)foreveryu∈W A \{0 A },suchthatg(u)= L a∈A u a and f(u)= g(u) −1 ·u. (f,g)isamaximalfactorization. Proof We show that (f,g) is a factorization. Let u∈W A \{0 A }. 
SinceW is zero-sum free, g(u), 0andhence, g(u)· f(u)= g(u)·g(u) −1 ·u= u. Weshowthat(f,g)ismaximal. Letw∈Wsuchthatw·u, 0 A . Moreover,leta∈ A. Then f(w·u) a = h g(w·u) −1 ·w·u i a = ( L a 0 ∈A w·u a 0) −1 ·w·u a = (w· L a 0 ∈A u a 0) −1 ·w·u a = ( L a 0 ∈A u a 0) −1 ·w −1 ·w·u a = g(u) −1 ·u a = f(u) a . N out , thenonterminalsetofG out , theresultofAlgorithm15, consistsofelementsthat aresubsetsofN in ,theoriginalnonterminalset,pairedwithnonzerovaluesfromW. 6 We canthusinterpretsomen∈ N out asamemberofW N in ,wherenonterminalsnotappearing intheoriginalformulationofntakeonthevalue0inthisvectorformulation. Observation3.4.3 Considerthemappingφconstructedinline23ofAlgorithm15. The weight of newly constructed productions in P out , in line 29 is g(φ) and the left side 6 It also consists of the special start nonterminal n 0 , but we do not consider that item in the current discussion. 83 nonterminalρ dst ,formedinline26,is f(φ). Analogousconstructionsaredoneinlines11 and12forterminalproductions,thoughφisnotexplicitlyrepresented. 3.4.2 Initialalgebrasemantics In Section 2.2 we presented a semantics for wrtgs based on derivations that allows us, givenawrtgG= (N,Σ,P,n 0 ), tocalculatethetreeseriesL G . Accordingtothesemantics presented in Section 2.2, the weight of a tree t∈ T Σ is the sum of the weights of all derivationsfromn 0 totusingproductionsinP. Here we present another semantics commonly used in the study of tree automata, called the initial algebra semantics [45]. This semantics calculates the weight of a tree re- cursively,asafunctionoftheweightsofitssubtrees. Inordertodefinethissemanticsfor wrtgswithoutsignificantmodificationfromthedefinitionforwtas,welimittheapplica- bility to normal form, mostly chain production-free wrtgs. The only chain productions allowedarethoseusedtosimulatethe“finalweights”usedintypicaldefinitionsofwta (e.g., that of F¨ ul¨ op and Vogler [45]). We consider them to be in a set P chain distinct from Pandtoallhavetheformn 0 w − → n,wheren, n 0 . Furthermore,ifP chain ,∅thennop∈ P has n 0 as its left nonterminal. If P chain =∅, this is equivalent to a final weight of 1 for n 0 and0 forallothers. If P chain ,∅, this isequivalenttoa finalweightof wforeach nsuch thatn 0 w − → n∈ P chain and0forallothers. We have previously considered the productions P of a wrtg as a set or pair (see Section2.2). IntheinitialalgebrasemanticsweconsiderPafamily(P k | k∈N)ofmap- pings, P k : Σ (k) → W N k ×N . Thus, if P contains two productions n 1 w 1 −−→ σ(n 2 ,n 3 ,n 4 ) and n 5 w 2 −−→ σ(n 2 ,n 3 ,n 4 ) (assuming σ ∈ Σ (3) ), we can write P 3 (σ) n 2 n 3 n 4 ,n 1 = w 1 and 84 P 3 (σ) n 2 n 3 n 4 =hn 1 = w 1 ,n 2 = 0,n 3 = 0,n 4 = 0,n 5 = w 2 i. Since the particular mapping is obvious given the state sequence subscript, we omit it, e.g., we write P(σ) n 2 n 3 n 4 ,n 1 in- steadofP 3 (σ) n 2 n 3 n 4 ,n 1 . Wenowdenotetheweightofderivingatreetfromnonterminaln usingP 7 byh P (t) n . Thenh P (t)∈W N . LetG= (N,Σ,P∪P chain ,n 0 )beinnormalform. Thenμ P isafamily(μ P (σ)|σ∈Σ)of mappings where for every k∈N andσ∈ Σ (k) we haveμ P (σ) :W N ×...×W N | {z } k →W N andforeverys 1 ,...,s k ∈W N μ P (σ)(s 1 ,...,s k ) n = M (n 1 ,...,n k )∈N k (s 1 ) n 1 ·...·(s k ) n k ·P k (σ) n 1 ...n k ,n . Foratreet=σ(t 1 ,...,t k ),h P (t)=μ P (σ)(h P (t 1 ),...,h P (t k )). Additionally,ifP chain ,∅,μ P chain is a mapping N\{n 0 }→W. If P chain =∅, then h P (t) n 0 is the value of t in L G . Otherwise, the value is L n∈N\{n 0 } h P (t) n ·μ P chain (n). 
We finally note that the semantics in Section 2.2, when applied to the class of wrtgs discussed here, is equivalent to the traditional run semantics,andthattherunandinitialalgebrasemanticsareprovenequivalentinSection 3.2ofF¨ ul¨ opandVogler[45]. Observation3.4.4 Considerφ∈W N in constructedinline23ofAlgorithm15forσ∈Σ (k) and~ ρ∈ (W N in ) k . Clearly,φ=μ P (σ)(~ ρ). 3.4.3 Correctness BeforeweprovethecorrectnessofAlgorithm15weneedtonoteaparticularpropertyof deterministic wrtgs. The following lemma shows that there is at most one nonterminal 7 whichisequivalenttothesumofallderivationsfromntot 85 thatcanbeginachainproduction-freederivationofatreeinadeterministicnormal-form wrtg. Lemma3.4.5 Let G= (N,Σ,P∪P chain ,n 0 ) be a deterministic normal-form wrtg over W and let t∈ T Σ . There is at most one n∈ N such that n⇒ ∗ t using P and at most one derivationdfromntotusingP. Proof Weprovebyinductiononsize(t). Letsize(t)= 1. Thus,thastheformα,where α∈Σ (0) .Assumetherearen,n 0 ∈ N suchthatn⇒ ∗ αandn 0 ⇒ ∗ αusingP. Sincethesize is 1, n⇒ p α and n 0 ⇒ p 0 α for some p,p 0 ∈ P. By definition, p= n w − →α and p 0 = n 0 w 0 −→α forsomenonzerow,w 0 ∈W. ButthenGisnotdeterministic,soourassumptionmustbe false. Clearly, if there is one nonterminal n such that n⇒ p α, then the single derivation dis(p). Nowassumethelemmaistruefortreesofsizeiorsmaller. Lettbeatreeofsizei+1. Then,thastheformσ(t 1 ,...,t k )whereσ∈Σ (k) andsize(t j )≤ i,1≤ j≤ k. Bytheinduction hypothesiseitherthereexistuniquen 1 ,...,n k ∈ Nsuchthatn j ⇒ ∗ t j usingPfor1≤ j≤ k, or at least one such n j does not exist. In the latter case, clearly there is no n∈ N such that n⇒ ∗ t using P. For the former case, this means thatσ(n 1 ,...,n k )⇒ ∗ t using P. Let d 1 ,...,d k bethesinglederivationsof,respectively,t 1 ,...,t k . Then(d 1 ...d k )isclearlythe single derivation from σ(n 1 ,...,n k ) to t, that notation implying a concatenation of the productionsinthederivationofeachsubtree. Then,assumetherearen,n 0 ∈ Nsuchthat n⇒ ∗ tandn 0 ⇒ ∗ tusingP. BecauseGisinnormalformandbecauseoftheuniqueness of n 1 ,...,n k , n⇒ p σ(n 1 ,...,n k )⇒ ∗ t and n 0 ⇒ p 0 σ(n 1 ,...,n k )⇒ ∗ t for some p,p 0 ∈ P. By definition,p= n w − →σ(n 1 ,...,n k )andp 0 = n 0 w 0 −→σ(n 1 ,...,n k )forsomenonzerow,w 0 ∈W. 86 ButthenGisnotdeterministic,soourassumptionmustbefalse. Then,ifthereisasingle nonterminalnsuchthatn⇒ p σ(n 1 ,...,n k )⇒ ∗ t,thenthesinglederivationdis(pd 1 ...d k ). We break the proof of correctness of Algorithm 15 into an initial lemma that does mostofthework,andthetheorem,whichfinishestheproof. Thesetwocomponentsare veryrelatedtoTheorem5.3of[17],thoughthestructureissomewhatdifferent. Lemma3.4.6(cf. [17],Thm. 5.3) Let G in = (N in ,Σ,P in ,n 0 ) be the input to Algorithm 15, letthealgorithmterminateonG,andletG out = (N out ,Σ,P out ∪P chain ,n 0 )betheoutput. For everyt∈ T Σ andn∈ N in ,h P in (t)= L n 0 ∈N out h P out (t) n 0·n 0 . Proof We immediately note that by Lemma 3.4.5 we can rewrite the conclusions of thislemmaas“h P in (t)= h P out (t) n 0·n 0 ifthereissomen 0 ∈ N out suchthath P out (t) n 0 , 0,or0 otherwise.” We will prove the lemma by induction on the size of t. Let t be of size 1. Then t has the formα, whereα∈Σ (0) . According to line 14 of Algorithm 15, if there is at least one production with right sideα in P in , there is a single production p 0 = n 0 w total −−−→ α in P out , where w total is nonzero. If p 0 does not exist, then h P in (t) n = 0 for any selection of n, so the statement is true. 
If p 0 does exist, then h P out (t) n 0 = w total . Now, for any n∈ N in , h P in (t) n = P in (α) ,n . Ifthereissomep∈ P in oftheformn w − →αthisvalueisw,orelseitis0. Asindicatedonline12, intheformercase, theweightofn 0 n isw·w −1 total , andinthelatter caseitis0. Thus,ifh P in (t) n = w,h P out (t) n 0 = w total ·w·w −1 total = wandotherwisebothsides are0. 87 Nowassumethelemmaistruefortreesofsizeiorsmaller. Lettbeatreeofsizei+1. Then,thastheformσ(t 1 ,...,t k )whereσ∈Σ (k) andsize(t j )≤ i,1≤ j≤ k. ByLemma3.4.5 thereareuniquelydefinedn 0 1 ,...,n 0 k ∈ N out suchthatfor1≤ i≤ k,h P out (t i ) n 0 i , 0. h P in (t)=μ P in (σ)(h P in (t 1 ),...,h P in (t k )) (definitionofsemantics) =μ P in (σ)(h P out (t 1 ) n 0 1 ·n 0 1 ,...,h P out (t k ) n 0 k ·n 0 k ) (inductionhypothesis) = h P out (t 1 ) n 0 1 ·...·h P out (t k ) n 0 k ·μ P in (σ)(n 0 1 ,...,n 0 k ) (commutativity) Either h P in (t)= 0 N in or it does not. In the former case, since Q k j=1 h P out (t j ) n 0 j is not 0, 8 it must be thatμ P in (σ)(n 0 1 ,...,n 0 k )= 0. If the sequence n 0 1 ,...,n 0 k is chosen as~ ρ on line 22 of the algorithm then the value for w total set on line 24 is 0, and thus P out (σ) n 0 1 ...n 0 k = 0 N out . Thus,h P out (t)= 0 N out . Inthelattercase, considerline26, inwhichρ dst iscalculatedas f(μ P in (σ)(n 0 1 ,...,n 0 k )). Note further that on line 29, g(μ P in (σ)(n 0 1 ,...,n 0 k )) is chosen as the weight of the newly produced production, thus P out (σ) s 0 ,ρ dst = g(μ P in (σ)(~ ρ)). We continue the derivation from above: h P in (t)= h P out (t 1 ) n 0 1 ·...·h P out (t k ) n 0 k ·μ P in (σ)(n 0 1 ,...,n 0 k ) (fromabove) = h P out (t 1 ) n 0 1 ·...·h P out (t k ) n 0 k · g(μ P in (σ)(n 0 1 ,...,n 0 k ))·~ ρ (definitionoffandg) = h P out (t) ~ ρ ·~ ρ (line29anddefinition) 8 Noterminthatproductis0andWiszerodivisor-free 88 We are finally ready to show language equivalence, which is now a simple matter. WenotethatL G in = L G out iffforanyt∈ T Σ ,h P in (t) n 0 = h P out ∪P chain (t) n 0 . Theorem3.4.7 Let G in = (N in ,Σ,P in ,n 0 ) be the input to Algorithm 15, let the algorithm terminate on G in , and let G out = (N out ,Σ,P out ∪P chain ,n 0 ) be the output. For every t∈ T Σ , h P in (t) n 0 = h P out ∪P chain (t) n 0 . Proof Ifthereisnouniquen 0 ∈ N out suchthath P out (t) n 0 isnonzerothenbyLemma3.4.6 and by definition, h P in (t) n 0 = h P out ∪P chain (t) n 0 = 0. If P chain =∅, then for all n 0 ∈ N out \{n 0 }, n 0 n 0 = 0, since otherwise some chain production would be added in lines 16 or 31 of the algorithm. Then,again,byLemma3.4.6andbydefinition,h P in (t) n 0 = h P out ∪P chain (t) n 0 = 0. Theremainingcaseassumesthatthereisauniquen 0 ∈ N out suchthath P out (t) n 0 isnonzero andthatP chain isnonempty. h P in (t) n 0 = h P out (t) n 0·n 0 n 0 (Lemma3.4.6) = h P out (t) n 0·P chain (n 0 ) = h P out ∪P chain (t) n 0 3.4.4 Termination Wehaveshown,inTheorem3.4.7,thatAlgorithm15iscorrectifitterminates. Since,even if it terminates, the algorithm’s runtime is in the worst case exponential, to an engineer the conditions for termination are not as useful as the conditions for terminating with 89 a specified time. In fact, we have followed that tactic in our implementation of the algorithm by simply providing users with a maximum time to wait before considering determinization not possible for a given input; see Chapter 6 for details. However, it is theoretically interesting and can be generally useful to know certain conditions for termination. 
In this section we prove that Algorithm 15 terminates on acyclic normal form chain production-free wrtgs. This was also proven in [17] but we use a different approach. WebrieflynotethatAlgorithm15terminateswhenWhasafinitecarrierset, echoing a similar statement in [17]. And, we describe the main result of [17], sufficient conditions for determinization of cyclic wrtgs to terminate. The proof of this result is beyondthescopeofthisthesis,soweonlyoutlinetheconditionsandreferthereaderto [17]forthedetails. AwrtgG= (N,Σ,P,n 0 )iscyclicif,forsomen∈ Nandt∈ T Σ (N)suchthatn∈ ydset(t), n⇒ ∗ t. Itshouldbeclearthatforanacyclicnormalformchainproduction-freewrtg G, L G (t)= 0forallt∈ T Σ suchthatheight(t)>|N|. Sinceforanyk∈N,thesetoftreesinT Σ ofheightkisfinite,supp(L G )isfinite. Theorem3.4.8 Algorithm15terminatesonacyclicinput. Proof As can be seen from the structure of Algorithm 15, the only way it can fail to terminateisiftheloopatline17isalwayssatisfied,i.e.,ifthenonterminalsetisnotfinite. As noted above, an acyclic wrtg implies finite support. Then, consider the means by whichnonterminalsofG out areformed. Atmost|Σ (0) |areaddedatline14. Theonlyother additionofnonterminalscomesatline28. AsnotedinObservation3.4.3,anonterminal isformedasafunctionoftheφcreatedinline23. Thereisthusatmostonenonterminal 90 Figure 3.4: Sketch of a wsa that does not have the twins property. The dashed arcs are meant to signify a path between states, not necessarily a single arc. q and r are siblings because they can both be reached from p with a path reading “xyz”, but are not twins, because the cycles from q and r reading “abc” have different weights. s and r are not siblingsbecausetheycannotbebothreachedfrompwithapathreadingthesamestring. q and s are siblings because they can both be reached from p with a path reading “def” andtheyaretwinsbecausethecyclefrombothstatesreading“abc”hasthesameweight (and they share no other cycles reading the same string). Since q and r are siblings but nottwins,thiswsadoesnothavethetwinsproperty. foreveryuniqueφthatcanbeformedinthisalgorithm. AsnotedinObservation3.4.4, φ=μ P (σ)(~ ρ). If, for a givenσ∈Σ (k) , there is a finite choice of~ ρ that producesφ, 0 N out , then N out is finite. By Lemma 3.4.5, for any tree t∈ T Σ there is at most one n 0 ∈ N out such that h P out (t) , 0 N out . Thus the size of N out is at most the number of unique subtrees in supp(L G ). If we let ˜ z be the size of the largest tree in supp(L G ), then there are at most k ˜ z|supp(L G )| choicesfor~ ρthatproduceφ, 0 N out . WenotethatifWhasafinitecarrierset,i.e.,finitelymanypossiblevalue,thenclearly Algorithm 15 terminates, since|N out | =|W||N in |+1, where|W| is the cardinality of the carrierset. 91 Figure3.5: Awsathatisnotcycle-unambiguous. Thestateqhastwocyclesreadingthe string“ab”withtwodifferentweights. Finally, we present sufficient conditions for termination of cyclic wrtgs, the proof of which is the main result of [17]. As noted previously, the proof of these termination conditionsisbeyondthescopeofthiswork. AcyclicwrtgGoverWisdeterminizablebyAlgorithm15ifWisextremalandifG hasthetwinsproperty([17],Theorem5.2). Asemiringisextremalifforeveryw,w 0 ∈W, w+w 0 ∈{w,w 0 }. Forexample,thetropicalsemiringisextremal. Tointroducethetwinsproperty,letusfirstconsidertheanalogousconceptforwsas. Two states in a wsa that can be reached from the start state reading string e are siblings, andtwosiblingsqandq 0 aretwinsifforeverystring f suchthatthereisacycleatqand at q 0 reading f, the weights of these cycles are the same. 
A wsa has the twins property if all siblings in the wsa are twins. Figure 3.4 is a sketch of a wsa demonstrating sibling and twin states. It was shown by Mohri [99] that cyclic wsa are determinizable if they have the twins property. Furthermore, Allauzen and Mohri [2] showed that the twins test is decidable if a wsa is cycle-unambiguous , that is, if there is at most one cycle at a statereadingthesamestring. Figure3.5sketchesawsathatisnotcycle-unambiguous. Theconceptsoftwinsandcycle-unambiguityareelevatedtothetreecaseasfollows: A wrtg G = (N,Σ,P,n 0 ) has the twins property if, for every n,n 0 ∈ N, t ∈ T Σ , and u∈ T Σ ({z}) where z < (N∪Σ) and u(v) = z for exactly one v∈ pos(u), if L G (t) n , 0, 92 L G (t) n 0 , 0,L G (u[n] v ) n , 0,andL G (u[n 0 ] v ) n 0 , 0,thenL G (u[n] v ) n = L G (u[n 0 ] v ) n 0. Thetwins property for wrtgs is depicted in Figure 3.6. G is cycle-unambiguous if for any n∈ N andu∈ T Σ ({n})whereu(v)= zforexactlyonev∈ pos(u),thereisatmostonederivation from n to u. B¨ uchse et al. [18] showed, in Theorem 5.17, that the twins property for cycle-unambiguous normal-form and chain-production free wrtgs over a commutative zero-sum-freeandzero-divisor-freeisdecidable. Figure 3.6: Demonstration of the twins test for wrtgs. If there are non-zero derivations of a tree t for nonterminals n and n’, and if the weight of the sum of derivations from n ofusubstitutedatvwithnisequaltotheweightofthesumofderivationsfromn’ofu substituted at v with n’ for all u where these weights are nonzero for both cases, then n andn’aretwins. 3.5 Empiricalstudies Wenowturntosomeempiricalstudies. Weexaminethepracticalimpactofthepresented workbyshowing: 93 • Thatthemultiplederivationproblemispervasiveinpracticeanddeterminization iseffectiveatremovingduplicatetrees. • Thatduplicationcausesmisleadingweightingofindividualtreesandthesumming achievedfromweighteddeterminizationcorrectsthiserror,leadingtore-ordering ofthek-bestlist. • Thatweighteddeterminizationpositivelyaffectsend-to-endsystemperformance. We also compare our results to a commonly used technique for estimation of k-best lists,i.e.,summingoverthetop j> kderivationstogetweightestimatesofthetopm≤ j uniqueelements. M BLEU undeterminized 21.87 top-500“crunching” 23.33 determinized 24.17 Table 3.1: BLEU results from string-to-tree machine translation of 116 short Chinese sentenceswithnolanguagemodel. Theuseofbestderivation(undeterminized),estimate of best tree (top-500), and true best tree (determinized) for selection of translation is shown. 3.5.1 Machinetranslation We obtain packed-forest English outputs from 116 short Chinese sentences computed by a string-to-tree machine translation system based on that of Galley et al. [47]. The systemistrainedonallChinese-EnglishparalleldataavailablefromtheLinguisticData Consortium. The decoder for this system is a CKY algorithm that negotiates the space describedbyDeNeefeetal. [30]. Nolanguagemodelwasusedinthisexperiment. 94 The forests contain a median of 1.4× 10 12 English parse trees each. We remove cycles from each forest, 9 apply our determinization algorithm, and extract the k-best treesusingavariantofthealgorithmofHuangandChiang[59]. Theeffectsofweighted determinization on a k-best list are obvious to casual inspection. Figure 3.7 shows the improvement in quality of the top 10 trees from our example translation after the applicationofthedeterminizationalgorithm. Theimprovementobservedcircumstantiallyholdsuptoquantitativeanalysisaswell. 
Theforestsobtainedbythedeterminizedgrammarshavebetween1.39%and50%ofthe number of trees of their undeterminized counterparts. On average, the determinized forestscontain13.7%oftheoriginalnumberoftrees. Sinceadeterminizedforestcontains no repeated trees but contains exactly the same unique trees as its undeterminized counterpart, this indicates that an average of 86.3% of the trees in an undeterminized MToutputforestareduplicates. Weighteddeterminizationalsocausesasurprisinglylargeamountofk-bestreorder- ing. In77.6%ofthetranslations,thetreeregardedas“best”isdifferentafterdeterminiza- tion. This means that in a large majority of cases, the tree with the highest weight is not recognized as such in the undeterminized list because its weight is divided among its multiple derivations. Determinization allows these instances and their associated weightstocombineandputsthehighestweightedtree,notthehighestweightedderiva- tion,atthetopofthelist. 9 As in Mohri [99], determinization may be applicable to some wrtgs that recognize infinite languages. Inpractice,cyclesinforestsofMTresultsarealmostneverdesired,sincetheserepresentrecursiveinsertion ofwords. 95 Wecancompareourmethodwiththemorecommonlyusedmethodsof“crunching” j-best lists, where j> k. The duplicate sentences in the k trees are combined, hopefully resultinginatleastkuniquememberswithanestimationofthetruetreeweightforeach unique tree. Our results indicate this is a rather crude estimation. When the top 500 derivationsofthetranslationsofourtestcorpusaresummed, only50.6%ofthemyield anestimatedhighest-weightedtreethatisthesameasthetruehighest-weightedtree. Asameasureoftheeffectweighteddeterminizationanditsconsequentialre-ordering hasonanactualend-to-endevaluation,weobtainBLEUscoresforour1-besttranslations fromdeterminization,andcomparethemwiththe1-besttranslationsfromtheundeter- minized forest and the 1-best translations from the top-500 “crunching” method. The resultsareinTable3.1. Notethatin26.7%ofcasesdeterminizationdidnotterminatein a reasonable amount of time. For these sentences we used the best parse from top-500 estimationinstead. Itisnotsurprisingthatdeterminizationmayoccasionallytakealong time;evenforalanguageofmonadictrees(i.e.,strings)thedeterminizationalgorithmis NP-complete,asimpliedbyCasacubertaanddelaHiguera[19]and,e.g.,Dijkstra[32]. 3.5.2 Data-OrientedParsing Determinization of wrtgs is also useful for parsing. Data-Oriented Parsing (DOP)’s methodology is to calculate weighted derivations, but as noted by Bod [11], it is the highest ranking parse, not derivation, that is desired. Since Sima’an [121] showed that finding the highest ranking parse is an NP-complete problem, it has been common to estimatethehighestrankingparsebythepreviouslydescribed“crunching”method. 
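Since the "crunching" estimate recurs throughout this section, we state it concretely here. The following Python sketch is our own rendering of the procedure and not the code used to produce the numbers in this chapter: it takes the top j derivations, written as (tree, weight) pairs with j > k, merges duplicate trees by summing their weights, and returns the k best unique trees. Any weight mass carried by derivations outside the top j is simply lost, which is why the estimated ranking is only an approximation of the ranking obtained from true determinization.

def crunch(jbest, k):
    # jbest: the top-j derivations as (tree, weight) pairs, with j > k.
    # Returns estimates of the k highest-weighted unique trees.
    totals = {}
    for tree, w in jbest:
        totals[tree] = totals.get(tree, 0.0) + w
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]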
96 31.87: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 32.11: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(caused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 32.15: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VB(arouse) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 32.55: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VB(cause) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 32.60: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(attracted) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 33.16: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VB(provoke) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 33.27: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBG(causing) NP-C(NPB(DT(the) JJ(american) NNS(protests)))) .(.)) 33.29: S(NP-C(NPB(DT(this) NN(case))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) 33.31: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(aroused) NP-C(NPB(DT(the) NN(protest)) PP(IN(of) NP-C(NPB(DT(the) NNS(united states))))))) .(.)) 33.33: S(NP-C(NPB(DT(this) NNS(cases))) VP(VBD(had) VP-C(VBN(incurred) NP-C(NPB(DT(the) JJ(american) NNS(protests))))) .(.)) Figure3.7: Rankedlistofmachinetranslationresultswithnorepeatedtrees. We create a DOP-like parsing model 10 by extracting and weighting a subset of sub- trees from sections 2-21 of the Penn Treebank and use a DOP-style parser to generate packed forest representations of parses of the 2416 sentences of section 23. The forests contain a median of 2.5×10 15 parse trees. We then remove cycles and apply weighted determinizationtotheforests. Thenumberoftreesineachdeterminizedparseforestis reduced by a factor of between 2.1 and 1.7×10 14 . On average, the number of trees is reducedbyafactorof900,000,demonstratingamuchlargernumberofduplicateparses prior to determinization than in the machine translation experiment. The top-scoring parseafterdeterminizationisdifferentfromthetop-scoringparsebeforedeterminization for49.1%oftheforests. Whenthedeterminizationmethodis“approximated”bycrunch- ingthetop-500parsesfromtheundeterminizedlist,only55.9%ofthetop-scoringparses arethesame. Thisindicatesthecrunchingmethodisnotaverygoodapproximationof determinization. WeusethestandardF-measurecombinationofrecallandprecisionto 10 This parser acquires a small subset of subtrees, in contrast with DOP, and the beam search for this problemhasnotbeenoptimized. 97 M R P F- undeterminized 80.23 80.18 80.20 top-500“crunching” 80.48 80.29 80.39 determinized 81.09 79.72 80.40 Table3.2: Recall,precision,andF-measureresultsonDOP-styleparsingofsection23of the Penn Treebank. The use of best derivation (undeterminized), estimate of best tree (top-500),andtruebesttree(determinized)forselectionofparseoutputisshown. E U D machinetranslation 1.4×10 12 2.0×10 11 parsing 2.5×10 15 2.3×10 10 Table 3.3: Median trees per sentence forest in machine translation and parsing exper- iments before and after determinization is applied to the forests, removing duplicate trees. score the top-scoring parse in each method against reference parses, and report results in Table 3.2. Note that in 16.9% of cases determinization did not terminate. For those sentencesweusedthebestparsefromtop-500estimationinstead. 3.5.3 Conclusion We presented a novel algorithm for practical determinization of wrtgs and, together with colleagues, proved that it is correct and that it terminates on acyclic wrtgs. 
We have shown that weighted determinization is useful for recovering k-best unique trees from a weighted forest. As summarized in Table 3.3, the number of repeated trees prior to determinization was typically very large, and thus determinization is critical to recovering true tree weight. We have improved evaluation scores by incorporating the presented algorithm into our MT work and we believe that other NLP researchers workingwithtreescansimilarlybenefitfromthisalgorithm. Animplementationofthis algorithmisavailableintheTiburontreeautomatatoolkit(SeeChapter6). 98 Chapter4 IC Inthischapterwepresentalgorithmsforforwardandbackwardapplicationofweighted tree-to-tree and tree-to-string transducers to wrtgs, trees, and strings. We discuss ap- plication of cascades of transducers and methods of efficient inference that perform an integratedsearchthroughacascade. Wealsopresentalgorithmsforconstructingderiva- tion wrtgs for EM training of cascades of tree transducers, based on the application algorithms. Although application is a well-established concept for both string and tree trans- ducers, we do not believe explicit algorithms for conducting application have been previously presented, even for unweighted tree transducers. Current theoretical work isbeingdoneonapplicationofweightedtree-to-treetransducers[43],butagain,weare the first to describe these explicit algorithms. Our algorithm for backward application of tree-to-string transducers, while making use of classic parsing algorithms, is novel. So, too, is our extension of application to xwtt and our use of application to construct derivationwrtgsofcascades. 99 4.1 Motivation We are interested in the result of transformation of some input by a transducer, called application. We’dliketodosomeinferenceonthisresult,suchasdeterminingthek-best transformationsoftheinputorknowingallthepathsthattransformtheinputintosome given output. When we are given the input and want to know how the transducer transforms it, we call this forward application. When we are given the output and want to know the possible inputs that cause the transducer to produce this output, we call this backward application. In this chapter we discuss forward and backward application algorithms. Wealsoconsiderageneralizationofthisproblem. Wewanttodivideupourproblems into manageable chunks, each represented by a transducer. It is easier for designers to writeseveralsmalltransducerswhereeachperformsasimpletransformation,ratherthan asinglecomplicatedtransducer. We’dliketoknow,then,theresultoftransformationof inputbyacascadeoftransducers,oneoperatingaftertheother. Aswewillsee,thereare various ways of approaching this problem. We will consider offline composition, bucket brigadeapplication,andon-the-flyapplication . 4.2 Stringcase: applicationviacomposition Before we discuss application of tree transducers, it is helpful to consider the solutions already known for application of string transducers. Imagine we have a string and a wstthatcantransformthatstringinapossiblyinfinitenumberofways,eachwithsome weight. We’dliketoknowthekhighest-weightedoftheseoutputs. Ifwecouldrepresent 100 (a)Inputstring“aa”embeddedinan identitywst. (b)wstforapplication. (c)Resultofcomposition. (d)Resultofrangeprojection. Figure4.1: Applicationofawsttoastring. theseoutputsasawsathiswouldbeaneasytask,astherearewell-knownalgorithmsto efficientlyfindthek-bestpathsinawsa[104]. 
Fortunately,weknowitisinfactpossible to find this wsa—wsts preserve recognizability, meaning given some weighted regular language 1 and some wst there exists some weighted regular language representing all theoutputs. Thislanguagecanberepresentedasawsa,whichwecalltheapplicationwsa. The properties of wsts are such that, given several already-known algorithms, we can build application wsas without defining a new algorithm. Specifically, we will achieve application through a series of embedding, composition, and projection operations. As anaidinexplanation,weprovidearunningexampleinFigure4.1. Embeddingisatrivial operation on strings and wsas: a string is embedded in a wsa by creating a single state andoutgoingedgepersymbolinthestring. Inturn,awsaisembeddedinanidentitywst by, foreveryedgeinawsawithlabela, forminganedgeintheembeddedwstwiththe sameincomingstate,outgoingstate,andweight,butwithbothinputandoutputlabels a. InFigure4.1a,thestring“aa”hasbeenembeddedinawst. Compositionofwstiswell 1 Thesetofregularlanguagesincludesthoselanguagescontainingasinglestringwithweight1. 101 A B a : a / 1 a : a / 1 C (a)Inputstring“aa”embed- dedinanidentitywst. E a : b / . 1 a : a / . 9 b : a / . 5 D a : b / . 4 a : a / . 6 b : a / . 5 b : b / . 5 b : b / . 5 (b)Firstwstincascade. a : c / . 6 b : c / . 7 F a : d / . 4 b : d / . 3 (c)Secondwstincascade. E F a : c / . 0 7 a : c / . 5 4 b : c / . 6 5 b : d / . 3 5 D F a : c / . 2 8 a : c / . 3 6 b : c / . 6 5 b : d / . 3 5 a : d / . 3 6 a : d / . 0 3 a : d / . 2 4 a : d / . 1 2 (d)Offlinecompositionapproach: Composethetransducers. A D B D C D a : b / . 1 B E a : a / . 9 C E (e)Bucketbrigadeapproach: Applywst(b)towst(a). A D F B D F C D F d / . 0 3 c / . 0 7 B E F c / . 5 4 C E F c / . 5 4 c / . 3 6 c / . 2 8 c / . 0 7 d / . 3 6 d / . 0 3 d / . 3 6 d / . 1 2 d / . 2 4 (f)Resultofofflineorbucketappli- cationafterprojection. A D F B D F C D F d / . 0 3 B E F c / . 5 4 C E F c / . 3 6 c / . 2 8 c / . 0 7 d / . 3 6 d / . 1 2 d / . 2 4 (g)Initialon-the-fly stand-infor(f). A D F B D F C D F d / . 0 3 B E F c / . 5 4 C E F c / . 3 6 c / . 2 8 c / . 0 7 d / . 3 6 d / . 1 2 d / . 2 4 (h)On-the-flystand-inafterexploring outgoingedgesofstateADF. A D F B D F C D F d / . 0 3 B E F c / . 5 4 C E F c / . 3 6 c / . 2 8 c / . 0 7 d / . 3 6 d / . 1 2 d / . 2 4 (i) On-the-fly stand-in after best path has beenfound. Figure4.2: Threedifferentapproachestoapplicationthroughcascadesofwsts. covered by, e.g., Mohri [101]. The result of the composition of the wsts of Figures 4.1a and4.1bisshowninFigure4.1c. Projectingawsttoawsaisalsotrivial: toobtainarange projection, we ignore all the input labels in our wst, and to obtain a domain projection, we ignore all the output labels. The range projection of the transducer of Figure 4.1c is shown in Figure 4.1d. Figure 4.1d, then, depicts the result of the application of the transducer of Figure 4.1b to the string “a a”. Note that we can also use this method to solve the reverse problem, where we are given an output string (or set of outputs) and we want the k-best of the set of inputs. To do so, we follow an analogous procedure, embedding the given output string or wsa in an identity wst, composing, and this time projectingthedomain,andrunningk-best. 102 4.3 Extensiontocascadeofwsts Now consider the case of an input wsa and a sequence of transducers. Our running exampleinFigure4.2againusesthestring“aa”asinputandacascadecomprisingthe transducerfromFigure4.1(reproducedinFigure4.2b)andasecondtransducer,depicted inFigure4.2c. 
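The embedding, composition, and projection steps of Section 4.2 are simple enough to sketch in code for the epsilon-free case, and it may help to see them concretely before we compare strategies for the cascade. The Python below uses a dictionary encoding of wsts that is ours alone, and the one-state transducer at the bottom is a stand-in rather than any of the machines of Figures 4.1 and 4.2; it is meant only to make the recipe concrete, not to serve as a definitive implementation.

def embed_string(symbols):
    # Embed a string in an identity wst: one state per prefix position and one
    # weight-1 edge per symbol, with equal input and output labels.
    edges = [(i, a, a, 1.0, i + 1) for i, a in enumerate(symbols)]
    return {"edges": edges, "start": 0, "finals": {len(symbols)}}

def compose(A, B):
    # Compose two epsilon-free wsts: states are paired, A's output label must
    # match B's input label, and weights multiply (probability semiring).
    edges = [((pa, pb), x, z, wa * wb, (qa, qb))
             for (pa, x, y, wa, qa) in A["edges"]
             for (pb, y2, z, wb, qb) in B["edges"] if y == y2]
    finals = {(fa, fb) for fa in A["finals"] for fb in B["finals"]}
    return {"edges": edges, "start": (A["start"], B["start"]), "finals": finals}

def range_projection(T):
    # Drop the input labels of a wst, leaving a wsa over the output alphabet.
    edges = [(p, z, w, q) for (p, _x, z, w, q) in T["edges"]]
    return {"edges": edges, "start": T["start"], "finals": T["finals"]}

# Forward application of a toy wst to the string "a a": embed, compose, and
# take the range projection.  The k-best paths through app_wsa are then the
# k-best weighted outputs.
toy_wst = {"edges": [(0, "a", "b", 0.4, 0), (0, "a", "c", 0.6, 0)],
           "start": 0, "finals": {0}}
app_wsa = range_projection(compose(embed_string(["a", "a"]), toy_wst))

Backward application follows the same recipe with the roles reversed: embed the given output, compose on the other side, and take the domain projection instead of the range projection.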
Thereareatleastthreewaystoperformapplicationthroughthiscascade. Firstly, we can compose the sequence of transducers before even considering the given wsa,usingtheaforementionedcompositionalgorithm. Wewillcallthisapproachoffline composition. TheresultofthiscompositionisdepictedinFigure4.2d. Theproblem,then, isreducedtoapplicationthroughthissingle-transducercase. Wecomposetheembedded transducer of Figure 4.2a with the result from Figure 4.2d and project, forming the wsa inFigure4.2f. Secondly,wecanbeginbyonlyinitiallyconsideringthefirsttransducerinthechain, and compose the embedded wsa with it, as if we were doing a single-transducer appli- cation, obtaining the result in Figure 4.2e. If there are “dead” states and edges in this transducer, i.e., those that cannot be in any path, they may be removed at this time. Then, instead of projecting, we continue by composing that transducer with the next transducerinthechain,i.e.,thatofFigure4.2c,andthenfinallytaketherangeprojection, againobtainingthewsainFigure4.2f. Thisapproachiscalledthebucketbrigade. A third approach builds the application wsa incrementally, as dictated by some algorithmthatrequestsinformationaboutit. Suchanapproach,whichwecallon-the-fly , wasdescribedin,e.g.,[109,101,103]. Theseworksgenerallydescribedtheeffectofthis approachratherthanconcretealgorithms,andforthemomentwewilldolikewise,and 103 presumewecanefficientlycalculatetheoutgoingedgesofastateofsomewsaorwston demand, without calculating all edges in the entire machine. The initial representation of the desired application wsa is depicted in Figure 4.2g and consists of only the start state. Now, consider Algorithm 16, an instantiation of Dijkstra’s algorithm that seeks to find the cost of the highest-cost path in a wsa. 2 Notice that this algorithm only ever needs to know the identity of the start state, whether a state is the final state, and for a givenstate,thesetofoutgoingedges. Wecanusethewsaformedbyofflinecomposition or bucket brigade as input, but we can also use our initial wsa of Figure 4.2g as input. At line 15, the algorithm needs to know the outgoing edges from state ADF, the only stateitknowsabout. Ouron-the-flyalgorithmsallowustodiscoversuchedges,andthe application wsa is now that depicted in Figure 4.2h. We are able to determine the cost of a path to a final state once our application wsa looks like Figure 4.2i, at which point we can return the value “.1944”. No other edges need be added to the application wsa. Noticeanadvantageovertheothertwomethodsisthatnotallthepossibleedgesofthe application wsa are built, and thus on-the-fly application can be faster than traditional methods. However,onedisadvantageoftheon-the-flymethodisthatworkmaybedone tobuildstatesandedgesthatareinnovalidpath,astheincrementally-builtwsacannot betrimmedwhilethebucketbrigade-builtwsacanhavedeadstatestrimmedaftereach composition. 2 Algorithm 16 requires a unique final state. We can convert any wsa into a wsa of this form by adding transitionswithlabelsandweightsof1fromanyfinalstatestoanewuniquefinalstate. 104 Algorithm16DIJKSTRA 1: inputs 2: wsaA= (Q,Σ,E,q 0 ,{q f })overW 3: outputs 4: weightw∈W,theweightofthehighest-costpathfrom q 0 toq f inA 5: complexity 6: O(|E|+|Q|log|Q|),ifaFibonacciheapisusedforqueuing. 
7: cost(q 0 )← 1 8:Q← q 0 9: Ξ←{q 0 }{seennonterminals} 10: whileQisnotemptydo 11: q←statefromQwithhighestcost 12: RemoveqfromQ 13: ifq= q f then 14: return cost(q) 15: foralleoftheformq σ/w −−−→ pinEdo 16: ifp<Ξthen 17: cost(p)← cost(q)·w 18: Q←Q∪{p} 19: Ξ←Ξ∪{p} 20: else 21: cost(p)← max(cost(p),cost(q)·w) 22: return 0 105 T P. ? S wxT No SeeNT xT No SeeNT wT No SeeNT T No SeeNT wxLT OQ [91] xLT Yes [90],Thm. 4,cl. 2 wLT OQ [91] LT Yes [48]Cor. IV.6.6 wxNT No SeeNT xNT No SeeNT wNT No SeeNT NT No [48]Thm. 6.11 wxLNT Yes [43] xLNT Yes [90],Thm. 4,cl. 1 wLNT Yes [82],Cor. 14 LNT Yes seewLNT (a)Preservationofforwardrecognizability T P. ? S wxT No SeewNT xT Yes Thm. 4.4.5 wT No SeewNT T Yes [48]Cor. IV.3.17 wxLT Yes [43] xLT Yes SeewxLT wLT Yes SeewxLT LT Yes SeewxLT wxNT No SeewNT xNT Yes SeexT wNT No [91] NT Yes SeexT wxLNT Yes SeewxLT xLNT Yes SeewxLT wLNT Yes SeewxLT LNT Yes SeewxLT (b)Preservationofbackwardrecognizability Table 4.1: Preservation of forward and backward recognizability for various classes of top-down tree transducers. Here and elsewhere, the following abbreviations apply: w =weighted,x=extendedleftside,L=linear,N=nondeleting,OQ=openquestion. 4.4 Applicationoftreetransducers Now let us revisit these stories with trees and tree transducers. Imagine we have a tree and a wtt that can transform that tree with some weight. We’d like to know the k-best trees the wtt can produce as output, along with their weights. We already know of at least one method for acquiring k-best trees from a wrtg [59], so we then must ask if, analogously to the string case, wtts preserve recognizability and we can form an applicationwrtg. We consider preservation of recognizability first. Known results for top-down tree transducerclassesconsideredinthisthesisareshowninTable4.1. Unlikethestringcase, preservation of recognizability is not universal or symmetric. If a transducer preserves forward recognizability, then a regularly limited domain implies a regular range, and if 106 it preserves backward recognizability, then a regularly limited range implies a regular domain. 3 Succinctly put, wxLNT and its subclasses preserve forward recognizability, as does xLT and its subclass LT. The two cases marked as open questions and the other classes, which are superclasses of NT, do not or are presumed not to. All classes except wNT and its subclasses preserve backward recognizability. We do not consider cases where recognizability is not preserved in the remainder of this chapter. If a transducer MofaclassthatpreservesforwardrecognizabilityisappliedtoawrtgG,wecancallthe forward application wrtg M(G) . and if M preserves backward recognizability, we can callthebackwardapplicationwrtgM(G) / . Now that we have defined the application problem and determined the classes for which application is possible, let us consider how to build forward and backward ap- plication wrtgs. An initial approach is to mimic the results found for wsts, by using an embed-compose-projectstrategy. However,wemustfirstconsiderwhether,asinstring world, there is a connection between recognizability and composition. We introduce a theoremtopresentthatconnectionandoutlinethecaseswherethisstrategyisvalid. We recall the various definitions from Section 2.1. To them we add the following: Theapplicationofaweightedtreetransformationτ : T Σ ×T Δ toatreeseriesL : T Σ →W isatreeseriesτhLi : T Δ →Wwhereforeverys∈ T Δ ,τhLi(s)= L t∈T Σ L(t)·τ(t,s). Next, we need three lemmas. 
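Restated in clean display form, the application operation just defined is the tree series

\tau\langle L\rangle(s) \;=\; \bigoplus_{t \in T_\Sigma} L(t) \cdot \tau(t, s) \qquad \text{for every } s \in T_\Delta,

where the sum is taken in the semiring W.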
The first two demonstrate the semantic equivalence betweenapplicationandcomposition. 3 Formallyspeaking,itisnotthetransducersthatpreserverecognizabilitybuttheirtransformationsandone must speak of a transformation preserving recognizability (analogous to a transducer preserving forward recognizability)oritsinversedoingso(analogoustopreservingbackwardrecognizability). 107 Lemma4.4.1(forwardapplicationcompositionconnection) LetLbeatreeseriesT Σ → W. Letτ beaweightedtreetransformationT Σ ×T Δ →W. ThenτhLi=range(ı L ;τ). Proof Foranyt∈ T Δ : range(ı L ;τ)(t)= M s∈T Σ ı L ;τ(s,t) Definitionofrange = M s∈T Σ M u∈T Σ ı L (s,u)·τ(u,t) Definitionofcomposition = M s∈T Σ ı L (s,s)·τ(s,t) Removeelementsequalto0 = M s∈T Σ L(s)·τ(s,t) Definitionofidentity =τhLi(t) Definitionofapplication Lemma4.4.2(backwardapplicationcompositionconnection) Let L be a tree series T Δ →W. Letτ beaweightedtreetransformationT Σ ×T Δ →W. Thenτ −1 hLi= dom(τ;ı L ). Proof Foranyt∈ T Σ : dom(τ;ı L )(t)= M s∈T Δ τ;ı L (t,s) Definitionofdomain = M s∈T Δ M u∈T Δ τ(t,u)·ı L (u,s) Definitionofcomposition = M s∈T Δ τ(t,s)·ı L (s,s) Removeelementsequalto0 = M s∈T Δ τ(t,s)·L(s) Definitionofidentity =τ −1 hLi(t) Definitionofapplication Forthethirdlemmaweneedtointroducetheuniversaltreeseries,whichrecognizesall possibletreesinalanguagewithweight1. Wewillthenshowthatforanytransformation, 108 application of the universal tree series is equivalent to the range of the transformation, andforthetransformation’sinverse,applicationisequivalenttothedomain. Definition4.4.3 TheuniversaltreeseriesofT Σ oversemiringWisatreeseriesU T Σ :T Σ →Wwhere,foreveryt∈ T Σ ,U T Σ (t)= 1. NotethatU T Σ isrecognizable. Thiscanbeeasily verifiedbydefiningthewrtgG= ({n},Σ,P,n)where,foreveryσ∈Σ (k) ,n 1 − →σ(n,...,n)∈ P. Clearly,L G = U T Σ . Lemma4.4.4(applicationofuniversalequaltorange) Letτ be a weighted tree trans- formationT Σ ×T Δ →W. ThenτhU T Σ i= range(τ)andτ −1 hU T Δ i= dom(τ). Proof Foranyt∈ T Δ : τhU T Σ i= M s∈T Σ U T Σ (s)·τ(s,t) Definitionofapplication = M s∈T Σ τ(s,t) Definitionofuniversaltreeseries = range(τ)(t) Definitionofrange The second statement is easily obtained by substituting τ −1 and U T Δ for τ and U T Σ , respectively. . Wemaynowprovethetheorem. Theorem4.4.5(duetoMaletti[91]) For a weighted top-down tree transducer class A oversemiringW,ifclassAisclosedunderleft-composition 4 (respectively,right-composition) withLNT(W)then“Ahasrecognizablerange(respectively,domain)oversemiringW” ⇔“A(respectively,A −1 )preservesrecognizability”. 4 Tree transducer class A is said to be closed under left- (respectively, right-) composition with class B if B◦A⊆ A(respectively,A◦B⊆ A). 109 Proof We begin with the forward case. Let A be a class of wtt such that A is closed under left-composition with LNT. Let M be a wtt of class A over semiringW and let L be a recognizable tree series overW. First we assume that M has recognizable range. Due to the properties of class A,ı L ◦M is in class A. Then, from Lemma 4.4.1, τ M hLi= range(τ ı L ◦M ). Based on our assumption, the latter has recognizable range, thus the former does too, and thus τ M preserves recognizability. Next we assume that τ M preserves recognizability. Since U T Σ is recognizable, τ M hU T Σ i is recognizable. From Lemma4.4.4,τ M hU T Σ i= range(τ),thusrange(τ)isrecognizable. Nowforthebackwardcase, letAbeaclassofwttsuchthatAisclosedunderright- composition with LNT. Again, let M be a wtt of class A over semiringW and let L be a recognizabletreeseriesoverW. First,weassumethatMhasrecognizabledomain. 
Then, M◦ı L has recognizable domain and from Lemma 4.4.2,τ −1 M hLi = dom(τ M◦ı L ) does too. Thus τ −1 M preserves recognizability. Now assume τ −1 M preserves recognizability. Since U T Δ is recognizable,τ −1 M hU T Δ i is recognizable. From Lemma 4.4.4,τ −1 M hU T Δ i = dom(τ), thusdom(τ)isrecognizable. This theorem implies that, for those classes that preserve recognizability and are closedundertheappropriatecomposition, 5 acomposition-basedapproachtoapplication canbetaken. Table4.2showsrelevantcompositionresultsforpotentiallyviableclasses. As can be seen from Table 4.2, the embed-compose-project approach will work for backward application cases but will not work for most forward application cases. The readershouldnote,however,thatthecompositionresultsinTable4.2areforcomposition 5 and have the appropriate recognizable projection, but this is always true by definition of classes pre- servingrecognizability;seetheproof. 110 T L- S ()LNT? xLT No [92],Lem. 4.3 LT No [39],p. 207 wxLNT No SeexLNT xLNT No [92],Thm. 5.2 wLNT Yes [89],Thm. 26 LNT Yes seewLNT (a)Forforwardapplication T R- S ()LNT? xT Yes SeeT T Yes [6],Cor. 2 wxLT Yes [89],Thm. 26 xLT Yes SeexT wLT Yes SeewxLT LT Yes SeeT xNT Yes SeexT NT Yes SeeT wxLNT Yes SeewxLT xLNT Yes SeexT wLNT Yes SeewxLT LNT Yes SeeT (b)Forbackwardapplication Table 4.2: For those classes that preserve recognizability in the forward and backward directions, are they appropriately closed under composition with (w)LNT? If the an- swer is “yes”, then an embedding, composition, projection strategy can be used to do application. 0 a x 1 x 2 a 1.x 1 2.x 2 1 − → 0 b b 1 − → 1 c c 1 − → 1 (a) Input tree a(b c) embedded in an identity wLNT. q a x 1 x 2 a d q.x 1 q.x 2 2 − → q a x 1 x 2 a q.x 2 d q.x 1 3 − → q b b 1 − → q c c 2 − → q c e 3 − → q (b)wLNTforforwardapplication. 0q a x 1 x 2 a d 1q.x 1 2q.x 2 2 − → 0q a x 1 x 2 a 2q.x 2 d 1q.x 1 3 − → 0q b b 1 − → 1q c c 2 − → 2q c e 3 − → 2q (c)Resultofcomposition. 0q a d 1q 2q 0q 2 − → a 2q d 1q 0q 3 − → b 1q 1 − → c 2q 2 − → e 2q 3 − → (d)Resultofprojection. Figure4.3: Composition-basedapproachtoapplicationofawLNTtoatree. 111 g 0 σ g 0 g 1 g 0 w 1 −−→ γ g 0 g 0 w 2 −−→ α g 0 w 3 −−→ α g 1 w 4 −−→ (a)InputwrtgG. m 1 σ x 0 x 1 σ m 2 .x 0 m 3 .x 1 w 5 −−→ m 1 σ x 0 x 1 ψ m 4 .x 0 m 3 .x 1 w 6 −−→ m 1 γ x 0 m 2 .x 0 w 7 −−→ m 2 σ x 0 x 1 σ m 3 .x 0 m 3 .x 1 w 8 −−→ m 2 α α w 9 −−→ m 3 α ρ w 10 −−→ m 4 (b)Firsttransducer(wLNT)M 1 inthe cascade. n 1 σ σ x 0 x 1 x 2 δ n 2 .x 0 n 2 .x 1 n 2 .x 2 w 11 −−→ n 1 α α w 12 −−→ n 2 (c) Second transducer (wxLNT) M 2 inthecascade. Figure4.4: Inputsforforwardapplicationthroughacascadeoftreetransducers. withtransducersoftheclass(w)LNT,whileanembeddedRTGisinanarrowerclassthan this—itisarelabeling,deterministic(w)LNT.Butthen,inordertoaccomplishapplication via an embed-compose-project approach for the classes not appropriately closed with (w)LNT but closed with this narrow class, we must have a composition algorithm for this very specific class of transducers. Instead, it seems better to focus our energies on designingapplicationalgorithms. Beforewediscussapplicationalgorithms,itisimportanttonotethere-emergenceof COVER algorithms in this chapter, similar to Algorithm 14 from Chapter 2. It should not be much of a surprise that the same kind of algorithm comes up in a discussion of application, which is very similar in principle to composition. We will in fact describe two separate COVER variants here, Algorithms 18 and 21. 
All three algorithms match 112 g 0 m 1 σ g 0 m 2 g 1 m 3 g 0 m 1 w 1 w 5 −−−−→ ψ g 0 m 4 g 1 m 3 g 0 m 1 w 1 w 6 −−−−→ g 0 m 2 g 0 m 2 w 2 w 7 −−−−→ σ g 0 m 3 g 1 m 3 g 0 m 2 w 1 w 8 −−−−→ α g 1 m 3 w 4 w 9 −−−−→ α g 0 m 3 w 3 w 9 −−−−→ ρ g 0 m 4 w 3 w 10 −−−−→ (a)ForwardapplicationofM 1 toG: G 1 . g 0 m 1 σ g 0 m 2 g 1 m 3 g 0 m 1 w 1 w 5 −−−−→ ψ g 0 m 4 g 1 m 3 g 0 m 1 w 1 w 6 −−−−→ σ g 0 m 3 g 1 m 3 g 0 m 2 w 1 w 8 (w 2 w 7 ) ∗ −−−−−−−−−→ α g 1 m 3 w 4 w 9 −−−−→ α g 0 m 3 w 3 w 9 −−−−→ ρ g 0 m 4 w 3 w 10 −−−−→ (b)AfterchainproductionremovalofG 1 . g 0 m 1 δ g 0 m 3 n 2 g 1 m 3 n 2 g 1 m 3 n 2 g 0 m 1 n 1 w 1 w 5 w 1 w 8 ((w 2 w 7 ) ∗ )w 11 −−−−−−−−−−−−−−−−→ α g 0 m 3 n 2 w 3 w 9 w 12 −−−−−−→ α g 1 m 3 n 2 w 4 w 9 w 12 −−−−−−→ (c)Finalresultthroughthecascade. Figure4.5: Resultsofforwardapplicationthroughacascadeoftreetransducers. 113 treestopatternsofrulesinatop-downmanner,asdescribedinSection2.3.3. However, Algorithm 14 is concerned with joining together the right hand side of a rule from one tree transducer with a pattern of rules from a second, to form rules in a composition transducer. Algorithms 18 and 21, on the other hand, match a tree to a wrtg, and are concerned with the weight of matching derivations and the states reached in the wrtg, but do not need to construct new trees. These algorithms also differ from Algorithm 14 in that they invoke recursive calls to discover available productions in the input wrtgs, as will be discussed in more detail in Section 4.5. Both of the COVER algorithms in thischapterreturnmappingsfromtreepositionstononterminals;thisallowsthecalling algorithmstoconverttreesthatarepartsoftransducerrulesintotreesthatformtheright sidesofnewwrtgproductions. Algorithm18primarilydiffersfromAlgorithm21inthat theformermatcheswxttruleleftsidestotreesandthelattermatchesrightsides. Algorithm17FORWARD-APPLICATION 1: inputs 2: wrtgG= (N,Σ,P,n 0 )overWinnormalformwithnochainproductions 3: linearwxttM= (Q,Σ,Δ,R,q 0 )overW 4: outputs 5: wrtgG 0 = (N 0 ,Δ,P 0 ,n 0 0 )suchthatifMisnondeletingorWisBoolean,L G 0 =τ M (L G ) 6: complexity 7: O(|R||P| ˜ l ),where ˜ listhesizeofthelargestleftsidetreeinanyruleinR 8: N 0 ← (N×Q) 9: n 0 0 ← (n 0 ,q 0 ) 10: P 0 ←∅ 11: forall(n,q)∈ N 0 do 12: forallroftheformq.t w 1 −−→ sinRdo 13: forall(φ,w 2 )∈ FORWARD-COVER( t,G,n)do 14: Formsubstitutionmappingϕ : Q×X→ T Δ (N×Q)suchthatϕ(q 0 ,x)= (n 0 ,q 0 ) ifφ(v)= n 0 andt(v)= xforallq 0 ∈ Q,n 0 ∈ N,x∈ X,andv∈ pos(t). 15: P 0 ← P 0 ∪{(n,q) w 1 ·w 2 −−−−−→ϕ(s)} 16: return G 0 114 Algorithm18FORWARD-COVER 1: inputs 2: t∈ T Σ (A),whereA∩Σ=∅ 3: wrtgG= (N,Σ,P,n 0 )innormalform,withnochainproductions 4: n∈ N 5: outputs 6: setΠofpairs{(φ,w) :φamappingpos(t)→ N andw∈W},eachpairindicatinga successfulrunontbyproductionsinG,startingfromn,andtheweightoftherun. 7: complexity 8: O(|P| size(t) ) 9: Π last ←{((ε,n),1)} 10: forallv∈ pos(t)suchthatt(v)< Ainpre-order do 11: Π v ←∅ 12: forall(φ,w)∈Π last do 13: ifG M(G) . forsomewrtgGandw(x)ttMthen 14: G← FORWARD-PRODUCE( G,M,G,φ(v)) 15: forallφ(v) w 0 −→ t(v)(n 1 ,...,n k )∈ Pdo 16: Π v ←Π v ∪{(φ∪{(vi,n i ),1≤ i≤ k},w·w 0 )} 17: Π last ←Π v 18: return Π last Algorithm17producestheapplicationwrtgforthoseclassesoftreetransducerpre- servingrecognizabilitylistedinTable4.1a. Theimplementationoptimizationpreviously described in Chapter 2 for Algorithms 6, 7, 8, and 13 applies to this algorithm as well. 
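The two algorithms above can be compressed into a small functional sketch. The following is an illustration only, not the thesis's implementation: it assumes a normal-form wrtg given as a map from each nonterminal to its (weight, symbol, child-nonterminal tuple) productions, and rules given as (state, left-side tree, weight, right-side tree) tuples in which variables are integer leaves and the q.x items of a right side are (state, variable) pairs. It also explores only the (nonterminal, state) pairs reachable from the start pair rather than enumerating all of N x Q, and it does not normalize the resulting productions.

def forward_cover(lhs, prods, nt):
    # Algorithm 18 without the on-the-fly produce call: match the left-side tree lhs
    # against productions of the wrtg starting from nonterminal nt, returning a list of
    # (mapping, weight) pairs, where mapping sends each variable to the nonterminal
    # reached at its position.  Variables are int leaves; other nodes are (symbol, children).
    if isinstance(lhs, int):
        return [({lhs: nt}, 1.0)]
    symbol, children = lhs
    covers = []
    for weight, sym, child_nts in prods.get(nt, ()):
        if sym != symbol or len(child_nts) != len(children):
            continue
        partial = [({}, weight)]
        for child, child_nt in zip(children, child_nts):
            # For linear rules each variable occurs once, so the merged mappings never conflict.
            partial = [({**m1, **m2}, w1 * w2)
                       for m1, w1 in partial
                       for m2, w2 in forward_cover(child, prods, child_nt)]
        covers.extend(partial)
    return covers

def substitute(rhs, mapping, new_nts):
    # Convert the rule's right side into a right side over the new (nonterminal, state)
    # pairs: a (state, variable) leaf becomes the pair (covering nonterminal, state).
    head, rest = rhs
    if isinstance(rest, list):
        return (head, [substitute(child, mapping, new_nts) for child in rest])
    pair = (mapping[rest], head)
    new_nts.append(pair)
    return pair

def forward_application(prods, rules, start_nt, start_state):
    # Algorithm 17 in miniature: pair wrtg nonterminals with transducer states, cover
    # each matching rule's left side, and emit a production over the paired nonterminals.
    out = {}
    agenda, seen = [(start_nt, start_state)], {(start_nt, start_state)}
    while agenda:
        n, q = agenda.pop()
        for state, lhs, w1, rhs in rules:
            if state != q:
                continue
            for mapping, w2 in forward_cover(lhs, prods, n):
                new_nts = []
                out.setdefault((n, q), []).append((w1 * w2, substitute(rhs, mapping, new_nts)))
                for pair in new_nts:
                    if pair not in seen:
                        seen.add(pair)
                        agenda.append(pair)
    return out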
Note that it covers several cases not allowed by the embed-compose-project strategy, specificallywxLNT.Itdoesrequirethattheinputwrtgbeinnormalformwithnochain productions;algorithmsforensuringthesepropertiesaredescribedinprevioussections. The algorithm pairs nonterminals n from a wrtg G = (N,Σ,P,n 0 ) with states q from a transducer M = (Q,Σ,Δ,R,q 0 ), and finds a derivation starting with n that matches the left hand side of a rule starting with q; this is done by a call to FORWARD-COVER at line 13. The substitution mapping that converts the right hand side of the rule from a 115 T C? S wxT No SeeT xT No SeeT wT No SeeT T No [48]Cor. IV.3.14 wxLT No SeeLT xLT No SeeLT wLT No SeeLT LT No [48]Thm. IV.3.6 wxNT No SeeNT xNT No SeeNT wNT No SeeNT NT No [6]Thm. 1 wxLNT No SeexLNT xLNT No [92]Thm. 5.4 wLNT Yes [45]Lem. 5.11 LNT Yes [48]Thm. IV.3.6 Table4.3: Closureundercompositionforvariousclassesoftop-downtreetransducer. treeinT Δ (Q×X)toatreeinT Δ (N×Q)appropriatefortheoutputwrtg,isformedatline 14,enablingthenewwrtgproductiontobebuiltatline15. Anexamplethatusesthisalgorithmisprovidedbelow,inthediscussionofapplica- tionthroughcascades;theexampleisrelevanttothesingletransducercaseaswell. 4.5 Applicationoftreetransducercascades Whataboutthecaseofaninputwrtgandasequenceoftreetransducers? Wewillrevisit thethreewaysofaccomplishingapplicationdiscussedaboveforthestringcase. Inorderforofflinecompositiontobeaviablestrategy,thetransducersinthecascade must be closed under composition. As Table 4.3 shows, in general a cascade of tree transducersofagivenclasscannotbecomposedofflinetoformasingletransducer;the loneexceptionsbeingforacascadeofwLNTandLNTtransducers. 116 (a) Schematic of application on a wst cascade. The given wsa (or string) is embedded in an identitywst,thencomposedwiththefirsttransducerinthecascade. Inturn,eachtransducer inthecascadeiscomposedintotheresult. Then,aprojectionistakentoobtaintheapplication wsa. (b)Schematicofapplicationonawttcascade,illustratingtheadditionalcomplexitycompared withthestringcase. Thegivenwrtg(ortree)isembeddedinanidentitywtt,thencomposed with the first transducer in the cascade. Then a projection is taken to form a wrtg, and this wrtgmustimmediatelybeembeddedinanidentitywtttoensurecomposabilitywiththenext transducer. Figure 4.6: Schematics of application, illustrating the extra work needed to use embed- compose-projectinwttapplicationvs. wstapplication. 117 When considering the bucket brigade, we have two options for specific methods; either the embed-compose-project approach or the custom algorithm approach can be used. The embed-compose-project process is somewhat more burdensome than in the string case. Recall that, for strings, application is obtained by an embedding, a series of compositions, and a projection (see Figure 4.6a). As discussed above, in general the “seriesofcompositions”isimpossiblefortrees. However,onecanobtainapplicationby aseriesofembed-compose-projectoperations,asdepictedinFigure4.6b. The custom algorithm case, which applies in instances the embed-compose-project does not cover (see Table 4.2), is easily usable in a bucket brigade scenario. One must, however,ensuretheoutputofthealgorithmisinnormalform(usingAlgorithm1)and chainproduction-free(usingAlgorithm3). Let us now work through an example (depicted in Figures 4.4 and 4.5) of bucket brigade through a cascade using custom algorithms in order to better understand the mechanism of these algorithms. 
In particular we will use a cascade similar to one de- scribed in [92] that demonstrated lack of closure under composition for certain classes. As this example makes clear, even though the transducers in the cascade are not closed undercomposition,thepropertyofpreservationofrecognizabilityissufficientforappli- cation. Inthisexampletheweightsarekeptasvariablessothatthesemiringoperations beingperformedremainobvious. The input wrtg, which we will call G, is in Figure 4.4a and the transducers M 1 and M 2 are in Figures 4.4b and 4.4c, respectively. In the example that follows we will first formM 1 hGiandthenformM 2 hM 1 hGii. 118 Note that G is in normal form and is chain-production free, and M 1 is linear and nondeleting,sotheinputstoAlgorithm17arevalid. Asindicatedonline8,thenonter- minals of the result will be pairs of (nonterminal, state) from the inputs and the initial nonterminal is g 0 m 1 , corresponding to g 0 from G and m 1 from M 1 . In the main while loop we consider productions from a particular new nonterminal, and we begin with the new initial nonterminal. At line 12 we choose a rule from M 1 beginning with m 1 , namelym 1 .σ(x 0 ,x 1 ) w 5 −−→σ(m 2 .x 0 ,m 3 .x 1 ). WethenmustinvokeAlgorithm18,FORWARD- COVER,toformacoveringoftheleftsideofthisrule. Atline15ofFORWARD-COVER, theproduction g 0 w 1 −−→σ(g 0 ,g 1 )ischosentocoverσ(x 0 ,x 1 ),andatline16 g 0 ismappedto x 0 and g 1 tox 1 andtheweightofthismappingissettow 1 ,theweightoftheproduction used. 6 Back in the main algorithm, this mapping is used to form the substitution map- ping 7 ϕthatultimatelyconvertstherightsideofthetransducerrule,σ(m 2 .x 0 ,m 3 .x 1 ),into σ(g 0 m 2 ,g 1 m 3 ). The weight of the rule is multiplied by the weight of the mapping, and the new production g 0 m 1 w 1 w 5 −−−−→σ(g 0 m 2 ,g 1 m 3 ) is formed. In a similar manner, the entire applicationwrtgdepictedinFigure4.5aisformed. The next task in the cascade is to apply M 2 to the just-formed application wrtg. However, the wrtg we just formed has chain productions in it. Thus, before continu- ing a chain production removal algorithm (described elsewhere) is used to convert the application wrtg to that depicted in Figure 4.5b. We then continue with application of M 2 to the wrtg of Figure 4.5b. The application process is the same as that just de- scribed. Note,however,thatthecomputationinFORWARD-COVERissomewhatmore 6 Themeaningbehindlines13and14ofFORWARD-COVERwillbeshortlyexplained,butsincethetest doesnotapplytheycanbesafelyskippedfornow. 7 RecallthisdefinitionfromSection2.1.4. 119 complicated for n 1 .σ(σ(x 0 ,x 1 ),x 2 ) w 11 −−→ δ(n 2 .x 0 ,n 2 .x 1 ,n 2 .x 2 ) due to its extended left side. In this case, the treeσ(σ(x 0 ,x 1 ),x 2 ) is covered by g 0 m 1 w 1 w 5 −−−−→ σ(g 0 m 2 ,g 1 m 3 ) followed by g 0 m 2 w 1 w 8 (w 2 w 7 ) ∗ −−−−−−−−−→ σ(g 0 m 3 ,g 1 m 3 ). If we had more transducers in the cascade, we would continueapplyingthemtotheresultofthepreviousapplicationstep,butinourexample wearedoneaftertwoapplications. ThefinalresultisinFigure4.5c. Algorithm19FORWARD-PRODUCE 1: inputs 2: wrtgG= (N,Σ,P,n 0 )overWinnormalformwithnochainproductions 3: linearwxttM= (Q,Σ,Δ,R,q 0 )overW 4: wrtgG 0 in = (N 0 in ,Δ,P 0 in ,n 0 0 )overW 5: n in ∈ N 0 in 6: outputs 7: wrtg G 0 out = (N 0 out ,Δ,P 0 out ,n 0 0 ) overW where, if M is nondeleting orW is Boolean, G 0 in M(G) . G 0 out ,andforallw∈W,t∈ T Δ (N 0 ),n in w − → t∈ P 0 out ⇔ n in w − → t∈ M(G) . 
8: complexity 9: O(|R||P| ˜ l ),where ˜ listhesizeofthelargestleftsidetreeinanyruleinR 10: ifP 0 in containsproductionsoftheformn in w − → uthen 11: return G 0 in 12: N 0 out ← N 0 in 13: P 0 out ← P 0 in 14: Letn in beoftheform(n,q),wheren∈ N andq∈ Q. 15: forallroftheformq.t w 1 −−→ sinRdo 16: forall(φ,w 2 )∈ FORWARD-COVER( t,G,n)do 17: Formsubstitutionmappingϕ : Q×X→ T Δ (N×Q)suchthat,forallv∈ ydset(t) and q 0 ∈ Q, if there exist n 0 ∈ n and x∈ X such that φ(v) = n 0 and t(v) = x, ϕ(q 0 ,x)= (n 0 ,q 0 ). 18: forallp 00 ∈ NORMALIZE((n,q) w 1 ·w 2 −−−−−→ϕ(s),N 0 out )do 19: Letp 00 beoftheformn 00 w 00 −−→δ(n 00 1 ,...,n 00 k )forδ∈Δ (k) . 20: N out ← N out ∪{n 00 ,n 00 1 ,...,n 00 k } 21: P 0 out ← P 0 out ∪{p 00 } 22: return G 0 out We next consider on-the-fly algorithms for application. As in the string case, an on-the-fly approach is driven by a calling algorithm that periodically needs to know 120 the productions in a wrtg with a common left side nonterminal. Both the embed- compose-projectapproachandthepreviouslydefinedcustomalgorithmsproduceentire applicationwrtgs. Inordertoadmitanon-the-flyapproachwedescribealgorithmsthat onlygeneratethoseproductionsinawrtgthathaveagivenleftnonterminal. The following set of algorithms have a common flavor in that they take as input a wrtg and a desired nonterminal and return another wrtg, different from the input wrtg in that it has more productions, specifically those beginning with that specified nonterminal. Thewrtgsprovidedasinputtoandreturnedasoutputfromtheseproduce algorithmscanbethoughtofasstand-ins forsomewrtgbuiltinanon-on-the-flymanner. Algorithmsusingthesestand-insshouldcallanappropriateproducealgorithmtoensure thestand-intheyareusinghastheproductionsbeginningwiththedesirednonterminal. Algorithm 19, FORWARD-PRODUCE, obtains the effect of forward application in an on-the-fly manner. It takes as input a wrtg and appropriate transducer, as well as a stand-inanddesirednonterminal. Asanexample,considertheinvocationFORWARD- PRODUCE(G 1 , M 1 , G init , g 0 m 0 ), where G 1 is in Figure 4.7a, M 1 is in 4.7b, and G init has anemptyproductionsetandanonterminalsetconsistingofonlythestartnonterminal, g 0 m 0 . The stand-in wrtg that is output contains three productions: g 0 m 0 w 1 w 4 −−−−→σ(g 0 m 0 ,g 1 m 1 ), g 0 m 0 w 1 w 5 −−−−→ψ(g 0 m 2 ,g 1 m 1 ),and g 0 m 0 w 2 w 6 −−−−→α. Todemonstratetheuseofon-the-flyapplicationinacascade,wenextshowthee ffect of FORWARD-PRODUCE when used with the cascade of G 1 M 1 , and M 2 , where M 2 is in Figure 4.7c. Our driving algorithm in this case is Algorithm 20, MAKE-EXPLICIT, which simply generates the full application wrtg using calls to FORWARD-PRODUCE. TheinputtoMAKE-EXPLICITisthefirststand-infor M 2 (M 1 (G 1 ) . ) . ,theresultofforward 121 g 0 σ g 0 g 1 g 0 w 1 −−→ α g 0 w 2 −−→ α g 1 w 3 −−→ (a)InputwrtgG 1 . m 0 σ x 0 x 1 σ m 0 .x 0 m 1 .x 1 w 4 −−→ m 0 σ x 0 x 1 ψ m 2 .x 0 m 1 .x 1 w 5 −−→ m 0 α α w 6 −−→ m 0 α α w 7 −−→ m 1 α ρ w 8 −−→ m 2 (b)FirsttransducerM 1 inthecascade. n 0 σ x 0 x 1 σ n 0 .x 0 n 0 .x 1 w 9 −−→ n 0 α α w 10 −−→ n 0 (c) Second transducer M 2 in the cascade. σ g 0 m 0 g 1 m 1 g 0 m 0 w 1 ·w 4 −−−−→ ψ g 0 m 2 g 1 m 1 g 0 m 0 w 1 ·w 5 −−−−→ α g 0 m 0 w 2 ·w 6 −−−−→ α g 1 m 1 w 3 ·w 7 −−−−→ (d) Productions of M 1 (G 1 ) . built as a consequenceofbuildingthecomplete M 2 (M 1 (G 1 ) . ) . . g 0 m 0 n 0 σ g 0 m 0 n 0 g 1 m 1 n 0 g 0 m 0 n 0 w 1 ·w 4 ·w 9 −−−−−−→ α g 0 m 0 n 0 w 2 ·w 6 ·w 10 −−−−−−−→ α g 1 m 1 n 0 w 3 ·w 7 ·w 10 −−−−−−−→ (e)CompleteM 2 (M 1 (G 1 ) . ) . . 
Figure4.7: Forwardapplicationthroughacascadeoftreetransducersusinganon-the-fly method. 122 Algorithm20MAKE-EXPLICIT 1: inputs 2: wrtgG= (N,Σ,P,n 0 )innormalform 3: outputs 4: wrtg G 0 = (N 0 ,Σ,P 0 ,n 0 ), in normal form, such that L G = L G 0 and if G M(G) . for somewrtgGandw(x)ttM,G 0 = M(G) . . 5: complexity 6: O(|P 0 |) 7: G 0 ← G 8: Ξ←{n 0 }{seennonterminals} 9: Ψ←{n 0 }{pendingnonterminals} 10: whileΨ,∅do 11: n←anyelementofΨ 12: Ψ←Ψ\{n} 13: ifG 0 M(G) . forsomewrtgGandw(x)ttMthen 14: G 0 ← FORWARD-PRODUCE( G,M,G 0 ,n) 15: foralln w − →σ(n 1 ,...,n k )∈ P 0 do 16: fori= 1tokdo 17: ifn i <Ξthen 18: Ξ←Ξ∪{n i } 19: Ψ←Ψ∪{n i } 20: return G 0 123 application of M 2 to M 1 (G 1 ) . , which is itself the result of forward application of M 1 to G 1 . This initial input is a wrtg with an empty production set and a single nontermi- nal, g 0 m 0 n 0 , obtained by combining n 0 , the initial state from M 2 , with g 0 m 0 , the initial nonterminal from the first stand-in for M 1 (G 1 ) . . MAKE-EXPLICIT calls FORWARD- PRODUCE(M 1 (G 1 ) . , M 2 , M 2 (M 1 (G 1 ) . ) . , g 0 m 0 n 0 ). FORWARD-PRODUCE then seeks to cover n 0 .σ(x 0 ,x 1 ) w 9 −−→ σ(n 0 .x 0 ,n 0 .x 1 ) with productions from M 1 (G 1 ) . , thus it needs to improvethestand-inforthiswrtg. InFORWARD-COVER,thereisacalltoFORWARD- PRODUCE that accomplishes this. The productions of M 1 (G 1 ) . that must be built to form the complete M 2 (M 1 (G 1 ) . ) . are shown in Figure 4.7d. The complete M 2 (M 1 (G 1 ) . ) . is shown in Figure 4.7e. Note that because we used this on-the-fly approach, we were able to avoid building all the productions in M 1 (G 1 ) . ; in particular we did not build g 0 m 2 w 2 w 8 −−−−→ρ,whileabucketbrigadeapproachwouldhavebuiltthisproduction. Algorithm 22 is an analogous on-the-fly PRODUCE algorithm for backward appli- cation. It is only appropriate for application on linear transducers. We do not provide PRODUCEfornon-lineartreetransducersbecauseweightednon-lineartreetransducers donotpreservebackwardrecognizability,andon-the-flymethodsarenotterriblyuseful for obtaining (unweighted) application rtgs. Since there is no early stopping condition theentireapplicationmayaswellbecarriedout. We have now defined several on-the-fly and bucket brigade algorithms, and also discussed the possibility of embed-compose-project and o ffline composition strategies toapplicationofcascadesoftreetransducers. Tables4.4and4.5summarizetheavailable methodsofforwardandbackwardapplicationofcascadesforrecognizability-preserving treetransducerclasses. 124 Algorithm21BACKWARD-COVER 1: inputs 2: t∈ T Σ (A),whereA∩Σ=∅ 3: wrtgG= (N,Σ,P,n 0 )innormalform,withnochainproductions 4: n∈ N 5: outputs 6: setΠofpairs{(φ,w) :φamappingpos(t)→ N andw∈W},eachpairindicatinga successfulrunontbyproductionsinG,startingfromn,andtheweightoftherun. 7: complexity 8: O(|P| size(t) ) 9: Π last ←{((ε,n),1)} 10: forallv∈ pos(t)suchthatt(v)< Ainpre-order do 11: Π v ←∅ 12: forall(φ,w)∈Π last do 13: ifG M(G) / forsomewttMandwrtgGthen 14: G← BACKWARD-PRODUCE( M,G,G,φ(v)) 15: forallφ(v) w 0 −→ t(v)(n 1 ,...,n k )∈ Pdo 16: Π v ←Π v ∪{(φ∪{(vi,n i ),1≤ i≤ k},w·w 0 )} 17: Π last ←Π v 18: return Π last 4.6 Decodingexperiments The main purpose of this chapter has been to present novel algorithms for performing application. However, it is beneficial to demonstrate these algorithms on realistic data. We thus demonstrate bucket brigade and on-the-fly backward application on a typical NLP task cast as a cascade of wLNT. 
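Before turning to the experimental models, the on-the-fly driver just summarized can be captured in a few lines. This is a sketch under assumed interfaces (a produce callback that returns a nonterminal's productions as (weight, symbol, child-nonterminal tuple) entries, with memoization inside produce playing the role of the growing stand-in), not the thesis's code.

def make_explicit(start_nt, produce):
    # Algorithm 20 in miniature: starting from the initial nonterminal, repeatedly ask
    # the on-the-fly produce callback for the productions of each newly seen nonterminal
    # until no new nonterminals appear, yielding the full application wrtg.
    productions, pending, seen = {}, [start_nt], {start_nt}
    while pending:
        nt = pending.pop()
        productions[nt] = produce(nt)
        for _weight, _symbol, child_nts in productions[nt]:
            for child in child_nts:
                if child not in seen:
                    seen.add(child)
                    pending.append(child)
    return productions

A lazy consumer, such as the best-path search used in the experiments below, calls produce only for the nonterminals it actually visits, which is the source of the savings over the bucket brigade.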
We adapted the Japanese-to-English translation model of Yamada and Knight [136] by transforming it from an English tree-to-Japanese stringmodeltoanEnglishtree-to-Japanese treemodel. TheJapanesetreesareunlabeled, meaning they have syntactic structure but no node labels. We then cast this modified modelasacascadeofLNTtreetransducers. Wenowdescribetheindividualtransducers inmoredetail. 125 Algorithm22BACKWARD-PRODUCE 1: inputs 2: linearwttM= (Q,Σ,Δ,R,q 0 ), 3: wrtgG= (N,Σ,P,n 0 )innormalformwithnochainproductions 4: wrtgG 0 in = (N 0 in ,Δ,P 0 in ,n 0 0 ) 5: n 0 ∈ (Q×N) 6: outputs 7: wrtg G 0 out = (N 0 out ,Δ,P 0 out ,n 0 0 ) where G 0 in M(G) / G 0 out , and n 0 w − → t∈ P 0 out ⇔ n 0 w − → t∈ M(G) / 8: complexity 9: O(|R||P| ˜ r ) 10: ifn 0 ∈ N 0 in then 11: return G 0 in 12: N 0 out ← N 0 in ∪{n 0 } 13: P 0 out ← P 0 in 14: ifn 0 =⊥then 15: forallσ∈Σwithrankkdo 16: P 0 out ← P 0 out ∪{⊥ 1 − →σ(⊥,...,⊥)} 17: return G 0 out 18: Letn 0 beoftheform(q,n),whereq∈ Qandn∈ N. 19: forallroftheformq.σ w 1 −−→ tinRwhereσ∈Σ (k) do 20: forall(φ,w 2 )∈ BACKWARD-COVER( t,G,n)do 21: d 1 ← d 2 ←...← d k ←⊥ 22: forallv∈ leaves(t)suchthatt(v)isoftheform(q 0 ,x)∈ Q×X k do 23: ifφ(v),∅then 24: d i ← (q 0 ,φ(v)) 25: P 0 out ← P 0 out ∪{(q,n) w 1 ·w 2 −−−−−→σ(d 1 ,...,d k )} 26: return G 0 out Rotation: The rotation transducer captures the reordering of subtrees such that the leaves of the tree are transformed from English to Japanese word order. Individual rules denote the likelihood of a particular sequence of sibling subtrees reordering in a particularway. ThestructureoftheEnglishtreesensuresthatthemaximumnumberof siblingsisfour. PreterminalsandEnglishwordsaretransformedasanidentity,withno extraweightincurred. Thereare6,453rulesintherotationtransducerinourexperimental model. SomeexamplerulesareinFigure4.8a. 126 M []LT []LNT []LNT oc √ × × √ ecp √ × × √ bb √ √ √ √ otf √ √ √ √ Table 4.4: Transducer types and available methods of forward application of a cascade. oc = offline composition, ecp = embed-compose-project, bb = custom bucket brigade algorithm,otf=onthefly. M []T [][]LT []NT []LNT []LNT oc √ × × × × √ ecp √ √ √ √ √ √ otf √ × √ × √ √ Table4.5: Transducertypesandavailablemethodsofbackwardapplicationofacascade. oc=offlinecomposition,ecp=embed-compose-project,otf =onthefly. Insertion: The insertion transducer captures the likelihood of inserting Japanese function words into the reordered English sentence. Individual rules denote the likeli- hood of inserting a function word to the left or right of a tree’s immediate subtrees but donotspecifywhatthatwordis(thisisleftforthetranslationtransducer). Preterminals and English words are transformed as an identity with no extra weight incurred, with someEnglishwordsreceivinganannotationindicatingtheyarenottobetranslatedinto anullsymbolinJapanese,basedonstructuralconstraintsofthemodel. Thereare8,122 rulesin theinsertion transducerin ourexperimental model. Some examplerules arein Figure4.8b. Translation: ThetranslationtransducerrelabelsallinternalnodesofanEnglishtree withthesymbol“X”andcapturesthelikelihoodoftranslatingeachEnglishandinserted word into a Japanese word or a null symbol, indicating there is no direct translation of the English word. 8 Individual rules that relabel with the symbol “X” have no extra 8 InsertedwordsandspeciallyannotatedEnglishwordscannottranslateintothenullsymbol. 
127 Q G A — — O(n o ,1,1) O(n,c) out[n]= 0 out[n]← c G← BACKWARD-PRODUCE( G,M,G,n) ∀p : n w − →σ (k) (n 1 ,...,n k )∈ G,k> 0 ∀i∈ 1...k O(n i ,w·c) ∀p : n w − →α (0) ∈ G I(n,p,w,w·c) I(n,p,w,c) in[n]= 0,deriv[n]=∅ in[n]← w,deriv[n]← p returnderivifn= n 0 — in[n 1 ]= w 1 ,...,in[n k ]= w k I(n 0 ,p,w· k Y i=1 w i ,w· k Y i=0 w i ) out[n]= w 0 p : n w − →σ (k) (n 1 ,...,n k )∈ G Table 4.6: Deduction schema for the one-best algorithm of Pauls and Klein [108], gen- eralized for a normal-form wrtg, and with on-the-fly discovery of productions. We are presumed to have a wrtg G = (N,Σ,P,n 0 ) that is a stand-in for some M(G) / , a priority queue that can hold items of type I and O, prioritized by their cost, c, two N-indexed tables of (initially 0-valued) weights, in and out, and one N-indexed table of (initially null-valued) productions, deriv. In each row of this schema, the specified actions are taken (inserting items into the queue, inserting values into the tables, discovering new productions, or returning deriv) if the specified item is at the head of the queue and the specifiedconditionsofin,out,andderivexist. Theone-besthyperpathin Gcanbefound when deriv is returned by joining together productions in the obvious way, beginning withderiv[n 0 ]. JJ x 1 x 2 x 3 JJ r DT .x 1 r JJ .x 2 r VB .x 3 − → r JJ VB x 1 x 2 x 3 VB r NNPS .x 1 r NN .x 3 r VB .x 2 − → r VB “gentle” “gentle” − → t (a)Rotationrules NN x 1 x 2 NN INS i NN .x 1 i NN .x 2 − → i VB NN x 1 x 2 NN i NN .x 1 i NN .x 2 − → i VB NN x 1 x 2 NN i NN .x 1 i NN .x 2 INS − → i VB (b)Insertionrules VB x 1 x 2 x 3 X t.x 1 t.x 2 t.x 3 − → t “gentleman” j1 − → t “gentleman” EPS − → t INS j1 − → t INS j2 − → t (c)Translationrules Figure 4.8: Example rules from transducers used in decoding experiment. j1 and j2 are Japanesewords. 128 weightincurred. Thereare37,311rulesinthetranslationtransducerinourexperimental model. SomeexamplerulesareinFigure4.8c. We added an English syntax language model to the cascade of transducers just describedtobettersimulateanactualmachinetranslationdecodingtask. Thelanguage modelwascastasanidentitywttandthusfitnaturallyintotheexperimentalframework. In our experiments we tried several different language models to demonstrate varying performanceoftheapplicationalgorithms. Themostrealisticlanguagemodelwasbuilt fromaPCFG,whereeachrulecapturedtheprobabilityofaparticularsequenceofchild labelsgivenaparentlabel. Thismodelhad7,765rules. Todemonstratemoreextremecasesoftheusefulnessoftheon-the-flyapproach,we builtalanguagemodelthatrecognizedexactlythe2,087treesinthetrainingcorpus,each withequalweight. Ithad39,455rules. Finally,tobeultra-specific,weincludedaformof the “specific” language model just described, but only allowed the English counterpart oftheparticularJapanesesentencebeingdecodedinthelanguage. The goal in our experiments is to apply a single tree backward through the cascade andfindthe1-bestpathintheapplicationwrtg. Weevaluatebasedonthespeedofeach approach: bucketbrigadeandon-the-fly. Thealgorithmweusetoobtainthis1-bestpath is a modification of the k-best algorithm of Pauls and Klein [108]. Our algorithm finds the 1-best path in a wrtg and admits an on-the-fly approach. We present a schema for thisalgorithm,analogoustotheschematashowninPaulsandKlein[108],inTable4.6. 9 The results of the experiments are shown in Table 4.7. 
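Before turning to those results, the non-lazy core of this search may help as a point of reference. The following is a minimal sketch of a Knuth-style best-derivation search over an already-built wrtg in the max-times ("Viterbi") semiring with weights in (0, 1]; the thesis's algorithm (Table 4.6) additionally interleaves outside costs and on-the-fly production discovery, which this sketch omits. The tuple representation of productions is an assumption for illustration.

import heapq
from itertools import count

def best_derivation_weight(productions, start_nt):
    # productions: list of (lhs nonterminal, weight, child-nonterminal tuple); terminal
    # symbols are omitted since only the nonterminal children affect the weight.
    # Returns the weight of the best derivation from start_nt, or 0.0 if there is none.
    uses, missing = {}, []
    for idx, (_lhs, _w, kids) in enumerate(productions):
        missing.append(len(kids))
        for k in kids:
            uses.setdefault(k, []).append(idx)     # one entry per occurrence of k
    best, heap, tie = {}, [], count()
    for lhs, w, kids in productions:
        if not kids:                               # leaf productions seed the search
            heapq.heappush(heap, (-w, next(tie), lhs))
    while heap:
        neg_w, _, n = heapq.heappop(heap)
        if n in best:
            continue                               # already finalized with an equal or better weight
        best[n] = -neg_w
        if n == start_nt:
            return best[n]
        for idx in uses.get(n, []):
            missing[idx] -= 1
            if missing[idx] == 0:                  # all children finalized: propose a weight for the lhs
                lhs, w, kids = productions[idx]
                cand = w
                for k in kids:
                    cand *= best[k]
                heapq.heappush(heap, (-cand, next(tie), lhs))
    return 0.0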
As can be seen, on-the-fly application was generally faster than the bucket brigade, about double the speed per 9 We only show the 1-best variant here but a k-best variant is easily obtained, in the manner shown by PaulsandKlein[108]. 129 LM M T/S pcfg bucket 28s pcfg otf 17s exact bucket >1m exact otf 24s 1-sent bucket 2.5s 1-sent otf .06s Table4.7: Timingresultstoobtain1-bestfromapplicationthroughaweightedtreetrans- ducer cascade, using on-the-fly vs. bucket brigade backward application techniques. pcfg = model recognizes any tree licensed by a pcfg built from observed data, exact = model recognizes each of 2,000+ trees with equal weight, 1-sent = model recognizes exactlyonetree. sentence in the traditional experiment that used an English PCFG language model. The results for the other two language models demonstrate more keenly the potential advantagethatcanbehadusinganon-the-flyapproach—thesimultaneousincorporation ofinformationfromallmodelsallowsapplicationtobedonemoreeffectivelythanifeach information source is considered in sequence. In the “exact” case, where a very large language model that simply recognizes each of the 2,087 trees in the training corpus is used,thefinalapplicationissolargethatitoverwhelmstheresourcesofa4gbMacBook Pro. In this case, the on-the-fly approach is necessary to avoid running out of memory. The “1-sent” case is presented to demonstrate the ripple effect caused by using on-the fly. Intheothertwocases,averylargelanguagemodelgenerallyoverwhelmsthetiming statistics, regardless of the method being used. But a language model that represents exactly one sentence is very small, and thus the effects of simultaneous inference are readily apparent—the time to retrieve the 1-best sentence is reduced by two orders of magnitudeinthisexperiment. 130 4.7 BackwardapplicationofwxLNTstostrings Tree-to-string transducers are quite important in modern syntax NLP systems. We frequentlyaregivenstringdataandwanttotransformitintoaforestoftreesbymeans of some grammar or transducer. This is generally known as parsing, but from our perspective, it is the inverse image of a tree-to-string transducer applied to a string. Despitethechangeinnomenclature,though,wecantakeadvantageoftherichparsing literatureindefininganalgorithmforthisproblem. Agoodchoiceforaparsingstrategy is Earley’s algorithm [36], and we will look to Stolcke’s extension to the weighted case [124] for guidance, though since we are required to build an entire parse chart, and ultimatelypreserveweightsfromourinputtransducer,strictlyspeakingweightsarenot neededintheapplicationalgorithm. Stolcke’s presentation [124] builds a parse forest from a wcfg. 10 Given a wcfg and a stringofkwords,ahypergraph,commonlycalledachartisbuilt. Thestatesinthechart are represented as (p,v,i, j) tuples, where p is a wcfg production, v is a position, i.e., an indexintotherightsideofp,andi, jisapairofintegers,0≤ i≤ j≤ k,signifyingarange of the input string. The hyperedges of the chart are unary or binary and are unlabeled. Foraunaryedge, ifthedestinationstateofanedgeisoftheform(p,v,i, j)wherev> 0, then its source state is of the form (p,v−1,i, j−1). For a binary edge, the right source statewillbeoftheform(p 0 ,v 0 ,h, j), wherep 0 hasasitsleftnonterminalthenonterminal at the vth position of the right side of p, v 0 is equal to the length of the right side of p 0 , and i≤ h≤ j. The left source state will be of the form (p,v−1,i,h). A state of the form (p,0,i,i)hasnoincominghyperedges. 
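A minimal sketch of the chart items just described, under the assumption that a wcfg production is a (lhs, weight, rhs) tuple with rhs a sequence of symbols; the item and function names are illustrative, not Stolcke's or the thesis's.

from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    prod: int   # index of the production p
    dot: int    # position v: how much of p's right side has been covered
    i: int      # left edge of the covered span
    j: int      # right edge of the covered span

def combine(left, right, productions):
    # The binary hyperedge: a destination (p, v, i, j) has left source (p, v-1, i, h)
    # and right source (p', |rhs(p')|, h, j), where the v-th symbol of p's right side
    # is the left nonterminal of p'.  Returns the destination item, or None if the two
    # items do not fit together.
    _lhs_l, _w_l, rhs_l = productions[left.prod]
    lhs_r, _w_r, rhs_r = productions[right.prod]
    if (right.dot == len(rhs_r)
            and left.dot < len(rhs_l)
            and rhs_l[left.dot] == lhs_r
            and left.j == right.i):
        return Item(left.prod, left.dot + 1, left.i, right.j)
    return None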
10 WealterthepresentationofindicessomewhatfromStolcke’sapproachbutstilluseitforguidance. 131 We can build a chart with a xLNTs and string in the same way as we build one for a wcfg, treating a rule of the form q.y w − → g as if it were a production of the form q w − → g 0 ,where g 0 ismodifiedfrom gbyreplacingall(q,x)itemswithq. Theleftsidesand variablesareignored,however,notomitted. Afterthechartisformed,itistraversedtop down, and at each state q 0 of the form (r,v,i, j) where r is of the form q.y w − → g and v is equaltothelengthof g,asetofstatesequencesisformed,byappending,foreachbinary edgearrivingatq 0 ,therightsourcestatewitheachofthesequencesformedfromtheleft source state. Terminal states form no sequences, and unary edges form the sequences in their source state. Each of these sequences is assigned to the variable attached to the original g, and this is used, with y, to form a production. Additionally, to reduce the statespace,allstatesformedfromruleswiththesameleftside,withpositionsattheright extreme,andwiththesamecoveringspanaremerged. Inthiswaythedomainprojection is formed. Note that the result of this operation can then be the input to the backward application of a wxLT, and by this we may accomplish bucket brigade application of a cascadeofwxLT,followedbyawxLNTsandastring. Example4.7.1 ConsiderthewxLNTsM 1 fromExample2.4.3,whoserulesarereproduced in Figure 4.9a. To apply this transducer to the output string λ λ λ λ, first Earley’s algorithmisusedtoformaparsechart,aportionofwhichisshowninFigure4.10. Then, thechartisexploredfromthetopdownanddescendantstatesequencesaregatheredat eachstatethatrepresentsafully-coveredrule(suchstatesarehighlightedinFigure4.10). States representing rules with common left sides that cover the same span are merged andthewrtgprojectedisdepictedinFigure4.9b. 132 1. σ γ α x 1 x 2 x 3 .1 − → q 2 .x 1 q 2 .x 2 q 2 .x 3 q 1 2. γ x 1 x 2 .2 − → q 2 .x 1 q 2 .x 2 q 2 3. β α .3 − → λ q 2 4. α .4 − → λ q 2 (a)RulesfromexamplexLNTsusedforparsing,andoriginallyseeninFigure2.17. 1. σ γ α [q 2 ,0,2] [q 2 ,2,3] [q 2 ,3,4] [q 1 ,0,4] .1 − → 2. σ γ α [q 2 ,0,1] [q 2 ,1,3] [q 2 ,3,4] [q 1 ,0,4] .1 − → 3. γ γ α [q 2 ,0,1] [q 2 ,1,2] [q 2 ,2,4] [q 1 ,0,4] .1 − → 4. γ [q 2 ,0,1] [q 2 ,1,2] [q 2 ,0,2] .2 − → 5. γ [q 2 ,1,2] [q 2 ,2,3] [q 2 ,1,3] .2 − → 6. γ [q 2 ,2,3] [q 2 ,3,4] [q 2 ,2,4] .2 − → 7. β α [q 2 ,0,1] .3 − → 8. β α [q 2 ,1,2] .3 − → 9. β α [q 2 ,2,3] .3 − → 10. β α [q 2 ,3,4] .3 − → 11. α [q 2 ,0,1] .4 − → 12. α [q 2 ,1,2] .4 − → 13. α [q 2 ,2,3] .4 − → 14. α [q 2 ,3,4] .4 − → (b)BackwardapplicationrulesextractedfromthechartinFigure4.10afterstatemerging. Figure 4.9: Input wxLNTs and final backward application wrtg formed from parsingλ λλλ,asdescribedinExample4.7.1. 133 Figure4.10: PartialparsechartformedbyEarley’salgorithmappliedtotherulesinFigure 4.9a, as described in Example 4.7.1. A state is labeled by its rule id, covered position of the rule right side, and covered span. Bold face states have their right sides fully covered,andarethusthestatesfromwhichapplicationwrtgproductionsareultimately extracted. Dashededgesindicatehyperedgesleadingtosectionsofthechartthatarenot shown. (a)wstfromFigure4.2bwithedgeids. a:c/.6 R9 b:c/.7 R11 F a:d/.4 R10 b:d/.3 R12 (b)wstfromFigure4.2cwith edgeids. (c) Output string “c c” embedded in anidentitywst. (d)Derivationwsaconstruc- tion after input embed (Fig- ure 4.2a) is composed with firstwst. (e) Derivation wsa construction after compositionwithsecondwst. 
A D F G B D F H B E F H (f) Final derivation wsa construction after composition with output embed andprojection. Figure4.11: Constructionofaderivationwsa. 134 4.8 Buildingaderivationwsa AsmotivatedinSection4.1,anotherusefulstructurewemaywanttoinferisaderivation automaton (or grammar). Let’s return to the string world for a moment. Given a string pair (i,o), a wst M and unique identifiers for each of the edges in M, a derivation wsa is a wsa representing the (possibly infinite) sequences of edges in M that derive o from i. 11 It can be used by forward-backward training algorithms to learn weights on a wst to maximize the likelihood of a string corpus [38]. Forming a derivation wsa is in fact similar to forming an application wsa, with an additional relabeling to keep track of edgeids. First,weembediinanidentitywst,I. WethencomposeI withM,butreplace the input label of every edge formed via the use of some wst edge with its id instead. Thenwecomposethisresultwiththeembeddingofoinanidentityoutputwst,O. The domainprojectionoftheresultisawsarepresentingthesequencesofwstidsneededto deriveofromiusingthewst. One may also want to form a derivation wsa for a cascade of wsts. In this case, the edgesofthederivationwsawillcontainidsfromeachofthetransducersinthecascade. Again, the procedure is analogous to that used for forming an application wsa from a cascade, and the three strategies (offline composition, bucket brigade, and on-the-fly) eachapply. 12 Figure4.11showstheconstructionofaderivationwsafromthewstcascade ofFigure4.2,nowaugmentedwithedgeidsforeaseofunderstanding. 11 Derivationwsasaretypicallyformedfromunweightedwsts,butthegeneralizationholds. 12 Offlinecompositionrequiresabitoffancybookkeepingtomaintainedgeids. SeeEisner[38]. 135 4.9 Buildingaderivationwrtg Turning to the tree case, the reader may be pleasantly surprised to learn that the re- strictions for application wrtg construction do not apply for basic derivation wrtg construction—a derivation wrtg can be formed from a tree pair (i,o) and a wtt M of the most general class allowed—wxT. An algorithm to do this was first proposed by Graehl and Knight [55]. The theoretical justification for this is as follows: As in the string case, if we can compose I, M, and O, where I and O are embeddings of I and O, respectively, andperformthesamekindofrelabelingdoneforstrings, wecanformthe derivationwrtg. First let us consider I = (Q,Σ,Σ,R,q 0 ) and M= (Q 0 ,Σ,R 0 ,q 0 0 ). The standard embed- dingof iformsa deterministic relabelingwtt. Wecanform I 0 = (Q,Σ,Σ∪Υ,R∪R Υ ,q 0 ) whereΥ∩Σ=∅ andΥ={υ i },υ i ∈Υ (i) for 0≤ i≤ max(rk(σ)|σ∈Σ). The rules of R Υ are definedasfollows: Foreachσ∈Σ (k) andq∈ Q, ifthereisnoruleoftheformq.σ w − → tin R, add q.σ 1 − → υ k (q.x 0 ,...,q.x k ) to R Υ . We then take the input alphabet of M to beΣ∪Υ. It is clear that I 0 is a deterministic and total relabeling wtt and thatτ I 0;τ M =τ I ;τ M since theonlyeffectofaugmentingI toI 0 istoproduceextraoutputsthatarenotacceptedby M. Following the principles of Baker [6], Theorem 1, I 0 ◦M⊆ wxT, as I 0 is total and de- terministic. Thecompositioncanbemodifiedtouseruleidsontheinputlabelswithout anyproblems,sincethenatureofthecompositionensuresthatexactlyonerulefromMis usedtoformaruleinI 0 ◦M. Augmentingthecompositionwiththisreplacementresults inatransduceroftypewT, sinceanextendedleftsideisreplacedwithasinglesymbol, 136 denotingaruleidofM,withappropriaterank. SinceOisinwLNTwecanimmediately produceI 0 ◦M◦O,andbytakingthedomainprojectionwehaveaderivationwrtg. 
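For reference, the string-case construction that the paragraph above parallels can be realized directly without materializing the intermediate compositions: intersect the wst with the input string on its input side and with the output string on its output side, labeling each resulting edge with the wst edge id. The following is a minimal sketch assuming an epsilon-free wst given as (edge id, from state, input symbol, output symbol, weight, to state) tuples; it is an illustration, not the thesis's construction.

def derivation_wsa(src, tgt, wst_edges, wst_start, wst_finals):
    # States of the derivation wsa are (i, q, j) triples: q is a wst state reached after
    # consuming src[:i] and producing tgt[:j].  Each derivation-wsa edge is labeled with
    # the id of the wst edge used, so paths from the start state to a final state spell
    # out exactly the edge sequences that derive tgt from src.
    edges = []
    for eid, q, a, b, w, r in wst_edges:
        for i, s in enumerate(src):
            if s != a:
                continue
            for j, t in enumerate(tgt):
                if t == b:
                    edges.append(((i, q, j), eid, w, (i + 1, r, j + 1)))
    start = (0, wst_start, 0)
    finals = {(len(src), f, len(tgt)) for f in wst_finals}
    return start, finals, edges

For a cascade, the same idea carries one state coordinate per transducer and edge labels drawn from each machine's ids, in line with the discussion in Section 4.8.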
A sufficient condition for forming a derivation wrtg for a cascade of wtts and a training pair is that the members of the cascade preserve recognizability. Then we can useamodificationoftheembed-compose-projectapproachtobuildthederivationwrtg. Essentially, we modify the composition algorithm, Algorithm 13 from Section 2.3.3, suchthatthetraditionalcreationofacompositionrule,atline13,isalteredbycreatinga newsymboltoreplace ythatcontainsthesetofrulesusedtoconstructz(whichmaybe inferredfromθ,themappingoftreepositionstostates)andhasanappropriaterank. This transducerwillbedeterministicand, withtheadditionofsufficientrulesthatmatchno subsequenttransducer,total,soweareensured,bytheconditionssetforthbyBaker[6], that this wtt can be composed with the next wtt in the cascade. Note that no projection or embedding is needed in this case. After the process has been repeated (unioning the obtained rule sequences with the contents of the special symbols formed previously), a projection may be taken, forming the derivation wrtg. The following example uses the approach just described, though modifications of the on-the-fly algorithms in this chaptercanalsobeusedforthisconstruction,bysimplyadjoiningthecreatedruleswith placeholdersforrulesequenceinformation. Example4.9.1 ConsiderthewttsM 1 = ({q},Σ,Δ, R 1 , q)andM 2 = ({r},Δ,Γ, R 2 , r)where Σ ={C, D}, Δ ={H, J, E, F}, Γ ={U, V, S,K}, and R 1 and R 2 are presented in Figures 4.12aand4.12b,respectively. Wewouldliketobuildaderivationwrtgforthepair(D(C, C),S(U,V)).Figure4.13showsproductionsintheintermediateapplicationwrtgsusing 137 1. D x 1 x 2 E q.x 1 q.x 2 .1 − → q 2. D x 1 x 2 F q.x 1 q.x 2 .2 − → q 3. D x 1 x 2 F q.x 2 q.x 1 .3 − → q 4. C H .4 − → q 5. C J .5 − → q (a)R 1 6. E x 1 x 2 K r.x 1 U r.x 2 .6 − → r 7. E x 1 x 2 S r.x 1 r.x 2 .7 − → r 8. F x 1 x 2 S r.x 1 r.x 2 .8 − → r 9. H U .9 − → r 10. H V .01 −−→ r 11. J U .02 −−→ r 12. J V .03 −−→ r (b)R 2 Figure4.12: Inputtransducersforcascadetraining. 138 • {} 2 x 1 x 2 D n 1 .x 1 n 2 .x 2 1 − → n 0 • {} 0 C 1 − → n 1 • {} 0 C 1 − → n 2 (a)Convertedandembeddedwrtgofinputtree. • {1} x 1 x 2 E n 1 q.x 1 n 2 q.x 2 .1 − → n 0 q • {2} x 1 x 2 F n 1 q.x 1 n 2 q.x 2 .2 − → n 0 q • {3} x 1 x 2 F n 2 q.x 2 n 1 q.x 1 .3 − → n 0 q • {4} H .4 − → n 1 q • {5} J .5 − → n 1 q • {4} H .4 − → n 2 q • {5} J .5 − → n 2 q (b)ApplicationontoM 1 . • {1,6} x 1 x 2 K n 1 qr.x 1 U n 2 qr.x 2 .06 −−→ n 0 qr • {2,8,1,7} x 1 x 2 S n 1 qr.x 1 n 2 qr.x 2 .23 −−→ n 0 qr • {3,8} x 1 x 2 S n 2 qr.x 2 n 1 qr.x 1 .24 −−→ n 0 qr • {4,5,9,11} U .37 −−→ n 1 qr • {4,5,10,12} V .019 −−−→ n 1 qr • {4,5,9,11} U .37 −−→ n 2 qr • {4,5,10,12} V .019 −−−→ n 2 qr (c)ApplicationontoM 2 . Figure4.13: Progressofbuildingderivationwrtg. 139 • {2,8,1,7} n 1 qr n 2 qr n 0 qr .23 −−→ • {3,8} n 2 qr n 1 qr n 0 qr .24 −−→ • {4,5,9,11} n 1 qr .37 −−→ • {4,5,10,12} n 1 qr .019 −−−→ • {4,5,9,11} n 2 qr .37 −−→ • {4,5,10,12} n 2 qr .019 −−−→ Figure4.14: Derivationwrtgafterfinalcombinationandconversion. embed-compose-project,butasdiscussedabove,thederivationwrtg,whichisdepicted inFigure4.14,canbebuiltusinganyvalidapplicationmethod. 4.10 Summary Wehavepresented,forthefirsttime,algorithmsforbackwardandforwardapplication of cascades of weighted extended top-down tree-to-tree and tree-to-string transducers to tree grammar and string input. We have presented novel on-the-fly algorithms for application of tree transducer cascades and we have demonstrated the performance of these algorithms. 
We have also described how to use these algorithms to construct a derivation grammar for training a cascade of tree transducers that uses the application algorithms. 140 Chapter5 SR-A MMT In this chapter we use a wxtst framework and the tree transducer training algorithm described in [55, 56] for significant improvements in state-of-the-art syntax-based ma- chinetranslation. Specifically,wepresentamethodforimprovingwordalignmentthat employs a syntactically informed alignment model closer to the translation model than commonly-used word alignment models. This leads to extraction of more useful lin- guistic patterns and improved BLEU scores on translation experiments in Chinese and Arabic. Thisworkwasfirstpresentedin[97]andwaspresentedagainasonecomponent in a presentation of EM-based data-manipulation techniques to improve syntax MT in [132]. 5.1 MethodsofstatisticalMT Roughlyspeaking,therearetwopathscommonlytakeninstatisticalmachinetranslation (Figure 5.1). The idealistic path uses an unsupervised learning algorithm such as EM [28] to learn parameters for some proposed translation model from a bitext training corpus, and then directly translates using the weighted model. Some examples of the 141 u n s u p e r v i s e d l e a r n i n g t a r g e t s e n t e n c e s s o u r c e s e n t e n c e s u n w e i g h t e d m o d e l w e i g h t e d m o d e l p a t t e r n s ( u n w e i g h t e d m o d e l ) c o u n t i n g a n d s m o o t h i n g w e i g h t e d m o d e l d e c o d e r s o u r c e s e n t e n c e s t a r g e t s e n t e n c e s p a t t e r n e x t r a c t i o n t a r g e t s e n t e n c e s s o u r c e s e n t e n c e s V i t e r b i a l i g n m e n t s d e c o d e r s o u r c e s e n t e n c e s t a r g e t s e n t e n c e s Figure5.1: GeneralapproachtoidealisticandrealisticstatisticalMTsystems. idealistic approach are the direct IBM word model [7, 51], the phrase-based approach of Marcu and Wong [93], and the syntax approaches of Wu [134] and Yamada and Knight [136]. Idealistic approaches are conceptually simple and thus easy to relate to observed phenomena. However, as more parameters are added to the model the idealisticapproachhasnotscaledwell,foritisincreasinglydifficulttoincorporatelarge amountsoftrainingdataefficientlyoveranincreasinglylargesearchspace. Additionally, the EM procedure has a tendency to overfit its training data when the input units have varyingexplanatorypowers,suchasvariable-sizephrasesorvariable-heighttrees. Therealisticpathalsolearnsamodeloftranslation,butusesthatmodelonlytoobtain Viterbiword-for-wordalignmentsforthetrainingcorpus. 
The bitext and corresponding alignments are then used as input to a pattern extraction algorithm, which yields a set of patterns.

[Figure: an English parse tree for "taiwan 's surplus in trade between the two shores", a word-aligned translation (glossed TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS), and tree-to-string transducer rules extracted from the aligned pair. Tree diagrams not reproduced.]
q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB 
NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD 
two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP 
taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN 
NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN 
in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS 
shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN 
qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE 
qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x1 x2 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS 
POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x1 x2 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x1 x2 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x1 x2 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x1 x2 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x1 x2 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x1 x2 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 
RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 q POS.x1 q POS POS ’s TAIWAN q PP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 q IN.x2 q IN IN between MIDDLE Figure2 ThesecondfigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNNP.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS 
’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 
qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx q NPB NPB NNP taiwan x1 qPOS.x1 qPOS POS ’s TAIWAN qPP PP IN in x1 q NP-C.x1 q NP-C NP-C x1 PP x2 x3 q NP-C.x3 q NPB.x1 qIN.x2 qIN IN between MIDDLE q NPB NPB x1 CD two x2 qDT.x1 q NNS.x2 q NNS NNS shores TWO-SHORES qDT DT the IN Figure2 ThesecondfigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 RunningAuthor RunningTitle NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 MIDDLE qNPB NPB NNP taiwan POS ’s TAIWAN qPP PP x1 x2 qIN.x1 qNP-C.x2 qIN IN in IN qNP-C NP-C x1 x2 qPP.x2 qNPB.x1 qPP PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES qNPB NPB x1 qNN.x1 qNN NN trade TRADE qNN NN surplus SURPLUS qNP-C NP-C NPB x1 x2 x3 qNPB.x1 qPP.x3 qNN.x2 qNPB NPB x1 POS ’s qNPB.x1 qNNP NNP taiwan TAIWAN qPP PP IN in x1 qNP-C.x1 IN MIDDLE qPP PP IN between x1 qNP-C.x1 qNP-C NP-C x1 qNPB.x1 qNPB NPB DT the CD two NNS shores TWO-SHORES Figure1 ThefigureIwantinthethesis 1 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 qIN.x1 q NP-C.x2 IN in IN NP-C x1 x2 qPP.x2 q NPB.x1 PP IN between NP-C NPB DT the CD two NNS shores TWO-SHORES NPB x1 q NN.x1 NN trade TRADE NN surplus SURPLUS Figure2 ThefigureIwantinthethesis 2 ComputationalLinguistics Volumexx, Number xx NP-C NPB NPB NNP taiwan POS ’s NN surplus PP IN in NP-C NPB NN trade PP IN between NP-C NPB DT the CD two NNS shores TAIWAN IN TWO-SHORES TRADE MIDDLE SURPLUS NP-C NPB x1 x2 x3 q NPB.x1 qPP.x3 q NN.x2 MIDDLE NPB NNP taiwan POS ’s TAIWAN PP x1 x2 
[Figure 5.2 (figure not shown): A (English tree, Chinese string) pair and three different sets of multilevel tree-to-string rules, labelled R1-R24, that can explain it; the first set is obtained from bootstrap alignments, the second from this paper's re-alignment procedure, and the third is a viable, if poor-quality, alternative that is not learned.]

patterns or rules for a second translation model (which often has a wider parameter space than that used to obtain the word-for-word alignments). Weights for the second model are then set, typically by counting and smoothing, and this weighted model is used for translation. Realistic approaches scale to large data sets and have yielded better BLEU performance than their idealistic counterparts, but there is a disconnect between the first model (hereafter, the alignment model) and the second (the translation model). Examples of realistic systems are the phrase-based ATS system of Och and Ney [107], the phrasal-syntax hybrid system Hiero [21], and the GHKM syntax system [47, 46]. For an alignment model, most of these use the Aachen HMM approach [131], the implementation of IBM Model 4 in GIZA++ [106] or, more recently, the semi-supervised EMD algorithm [42].

The two-model approach of the realistic path has undeniable empirical advantages and scales to large data sets, but new research tends to focus on development of higher-order translation models that are informed only by low-order alignments. We would like to add the analytic power gained from modern translation models to the underlying alignment model without sacrificing the efficiency and empirical gains of the two-model approach. By adding the syntactic information used in the translation model to our alignment model we may improve alignment quality such that rule quality and, in turn, system quality are improved. In the remainder of this work we show how a touch of idealism can improve an existing realistic syntax-based translation system.

5.2 Multi-level syntactic rules for syntax MT

Galley et al. [47] and Galley et al. [46] describe a syntactic translation model that relates English trees to foreign strings via a tree-to-string transducer of class wxLNTs. The model describes joint production of a (tree, string) pair via a non-deterministic selection of weighted rules. Each rule has an English tree fragment with variables and a corresponding foreign string fragment with the same variables. A series of rules forms an explanation (or derivation) of the complete pair.

As an example, consider the parsed English and corresponding Chinese at the top of Figure 5.2.
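To make the rule representation concrete, the following is a minimal Python sketch, not drawn from the system described in this chapter, of how a multilevel tree-to-string rule might be stored. The class and field names (Rule, TreeNode, Var) are illustrative assumptions, and the example rule is a simplified, hypothetical rotation rule loosely in the spirit of those in Figure 5.2.

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Var:
    index: int    # x1, x2, ... shared between the tree side and the string side
    state: str    # the state that must rewrite this variable, e.g. "qNPB"

@dataclass
class TreeNode:
    label: str                                        # e.g. "NP-C", "NNP", "taiwan"
    children: List[Union["TreeNode", Var]] = field(default_factory=list)

@dataclass
class Rule:
    state: str                           # left-hand-side state, e.g. "qNP-C"
    tree_frag: TreeNode                  # English tree fragment with Var leaves
    string_frag: List[Union[str, Var]]   # foreign string fragment with the same Vars
    weight: float = 1.0

# A hypothetical rule that reorders its variables and inserts the token MIDDLE:
# qNP-C( NP-C( NPB(x1, x2), x3 ) )  ->  x1 MIDDLE x3 x2
x1, x2, x3 = Var(1, "qNPB"), Var(2, "qNN"), Var(3, "qPP")
r = Rule("qNP-C",
         TreeNode("NP-C", [TreeNode("NPB", [x1, x2]), x3]),
         [x1, "MIDDLE", x3, x2])

The point of the representation is that the same variable objects appear in both the tree fragment and the string fragment, which is what lets a single rule translate and reorder material at once.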
The three columns underneath the example are different rule sequences that can explain this pair; there are many other possibilities. Note how rules specify rotation (e.g., R10, R4), direct translation (R12, R8), insertion and deletion (R11, R1), and tree traversal (R6, R15). Note too that the rules explain variable-size fragments (e.g., R7 vs. R14) and thus the possible derivation trees of rules that explain a sentence pair have varying sizes. The smallest such derivation tree has a single large rule (which does not appear in Figure 5.2; we leave the description of such a rule as an exercise for the reader). A string-to-tree decoder constructs a derivation forest of derivation trees where the right sides of the rules in a tree, taken together, explain a candidate source sentence. It then outputs the English tree corresponding to the highest-scoring derivation in the forest.

5.3 Introducing syntax into the alignment model

We now lay the groundwork for a syntactically motivated alignment model. We begin by reviewing an alignment model commonly seen in realistic MT systems and compare it to a syntactically-aware alignment model.

5.3.1 The traditional IBM alignment model

IBM Model 4 [16] learns a set of 4 probability tables to compute p(f | e) given a foreign sentence f and its target translation e via the following (greatly simplified) generative story:

1. A fertility y for each word e_i in e is chosen with probability p_fert(y | e_i).
2. A null word is inserted next to each fertility-expanded word with probability p_null.
3. Each token e_i in the fertility-expanded word and null string is translated into some foreign word f_i in f with probability p_trans(f_i | e_i).
4. The position of each foreign word f_i that was translated from e_i is changed by Δ (which may be positive, negative, or zero) with probability p_distortion(Δ | A(e_i), B(f_i)), where A and B are functions over the source and target vocabularies, respectively.

Brown et al. [16] describe an EM algorithm for estimating values for the four tables in the generative story. However, searching the space of all possible alignments is intractable for EM, so in practice the procedure is bootstrapped by models with narrower search space such as IBM Model 1 [16] or Aachen HMM [131].

5.3.2 A syntax re-alignment model

Now let us contrast this commonly used model for obtaining alignments with a syntactically motivated alternative. We recall the rules described in Section 5.2. Our model learns a single probability table to compute p(etree, f) given a foreign sentence f and a parsed target translation etree. In the following generative story we assume a starting variable with syntactic type v:

1. Choose a rule r to replace v, with probability p_rule(r | v).
2. For each variable with syntactic type v_i in the partially completed (tree, string) pair, continue to choose rules r_i with probability p_rule(r_i | v_i) to replace these variables until there are no variables remaining.

In Section 5.5.1 we discuss an EM learning procedure for estimating these rule probabilities. As in the IBM approach, we must mitigate intractability by limiting the parameter space searched, which is potentially much wider than in the word-to-word case. We would like to supply to EM all possible rules that explain the training data, but this implies a rule relating each possible tree fragment to each possible string fragment, which is infeasible. We follow the approach of bootstrapping from a model with a narrower parameter space as is done by, e.g., Och and Ney [106] and Fraser and Marcu [42]. To reduce the model space we employ the rule acquisition technique of Galley et al. [47], which obtains rules given a (tree, string) pair as well as an initial alignment between them.
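As an aside, the generative story above can be made concrete with a small sketch. The following Python fragment is a toy illustration rather than the actual implementation: it computes the probability of a single derivation as the product of its rule probabilities p_rule(r_i | v_i). The table contents and rule names are hypothetical.

import math

def derivation_log_prob(derivation, p_rule):
    """derivation: nested tuples (rule_id, lhs_type, [child derivations]).
    Returns log of the product of p_rule(r_i | v_i) over every rule used."""
    rule_id, lhs_type, children = derivation
    logp = math.log(p_rule[(rule_id, lhs_type)])
    for child in children:
        logp += derivation_log_prob(child, p_rule)
    return logp

# Toy example: a two-rule derivation rooted at syntactic type "NP-C".
p_rule = {("R10", "NP-C"): 0.6, ("R2", "NPB"): 0.3}
d = ("R10", "NP-C", [("R2", "NPB", [])])
print(math.exp(derivation_log_prob(d, p_rule)))   # approximately 0.18

The EM procedure of Section 5.5.1 sets the p_rule table so as to maximize the total probability of the training pairs, where each pair's probability sums over all of its derivations.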
We are agnostic about the source of this bootstrap alignment and in Section 5.5 present results based on several different bootstrap alignment qualities. We require an initial set of alignments, which we obtain from a word-for-word alignment procedure such as GIZA++ or EMD. Thus, we are not aligning input data, but rather re-aligning it with a syntax model.

5.4 The appeal of a syntax alignment model

Consider the example of Figure 5.2 again. The leftmost derivation is obtained from the bootstrap alignment set. This derivation is reasonable, but there are some poorly motivated rules, from a linguistic standpoint. The third word in the Chinese sentence roughly means "the two shores" in this context, but the rule R7 learned from the alignment incorrectly includes "between". However, other sentences in the training corpus have the correct alignment, which yields rule R16. Meanwhile, rules R13 and R14, learned from yet other sentences in the training corpus, handle the second and fifth Chinese words (which, as a unit, translate to "in between"), thus allowing the middle derivation.

EM distributes rule probabilities in such a way as to maximize the probability of the training corpus. It thus prefers to use one rule many times instead of several different rules for the same situation over several sentences, if possible. R7 is a possible rule in 46 of the 329,031 sentence pairs in the training corpus, while R16 is a possible rule in 100 sentence pairs. Well-formed rules are more usable than ill-formed rules, and the partial alignments behind these rules, generally also well-formed, become favored as well. The top row of Figure 5.3 contains an example of an alignment learned by the bootstrap alignment model that includes an incorrect link. Rule R24, which is extracted from this alignment, is a poor rule. A set of commonly seen rules learned from other training sentences provides a more likely explanation of the data, and the consequent alignment omits the spurious link.

Table 5.1: Tuning and testing data sets for the MT system described in Section 5.5.2.
        data set           Chinese (C)   Arabic (A)
  Tune  NIST 2002 (short)  925           696
  Test  NIST 2003          919           663

Table 5.2: A comparison of Chinese BLEU performance between the GIZA baseline (no re-alignment), re-alignment as proposed in Section 5.3.2 ("initial"), and re-alignment as modified in Section 5.5.4 ("adjusted"). The GIZA bootstrap corpus has 9,864,294 English (E) and 7,520,779 Chinese (C) words.
            rules         BLEU (tune)   BLEU (test)
  baseline  19,138,252    39.08         37.77
  initial   18,698,549    39.49         38.39
  adjusted  26,053,341    39.76         38.69

5.5 Experiments

In this section, we describe the implementation of our semi-idealistic model and our means of evaluating the resulting re-alignments in an MT task.

5.5.1 The re-alignment setup

We begin with a training corpus of Chinese-English and Arabic-English bitexts, the English side parsed by a reimplementation of the standard Collins model [9]. In order to acquire a syntactic rule set, we also need a bootstrap alignment of each training sentence. We use an implementation of the GHKM algorithm [47] to obtain a rule set for each bootstrap alignment.

Now we need an EM algorithm for learning the parameters of the rule set that maximize ∏_corpus p(tree, string). Such an algorithm is presented by Graehl et al. [56]. The algorithm consists of two components: D, which is a procedure for constructing a packed forest of derivation trees of rules that explain a (tree, string) bitext corpus given that corpus and a rule set, and T, which is an iterative parameter-setting procedure.

We initially attempted to use the top-down D algorithm of Graehl et al. [56], but as the constraints of the derivation forests are largely lexical, too much time was spent on exploring dead ends. Instead we build derivation forests using the following sequence of operations:
1. Binarize rules using the synchronous binarization algorithm for tree-to-string transducers described by Zhang et al. [138].
2. Construct a parse chart with a CKY parser simultaneously constrained on the foreign string and English tree, similar to the bilingual parsing of Wu [135]. (In the cases where a rule is not synchronous-binarizable, standard left-right binarization is performed, and the proper permutation of the disjoint English tree spans must be verified when building the part of the chart that uses this rule.)
3. Recover all reachable edges by traversing the chart, starting from the topmost entry.

Since the chart is constructed bottom-up, leaf lexical constraints are encountered immediately, resulting in a narrower search space and faster running time than the top-down D algorithm for this application. Derivation forest construction takes around 400 hours of cumulative machine time (4-processor machines) for Chinese. The actual running of EM iterations (which directly implements the T algorithm of Graehl et al. [56]) takes about 10 minutes, after which the Viterbi derivation trees are directly recoverable. The Viterbi derivation tree tells us which English words produce which Chinese words, so we can extract a word-to-word alignment from it.

GIZA bootstrap corpus (English / Chinese words)   System         Rules        Tune BLEU   Test BLEU
9,864,294 / 7,520,779                             baseline       19,138,252   39.08       37.77
                                                  re-alignment   26,053,341   39.76       38.69
221,835,870 / 203,181,379                         baseline       23,386,535   39.51       38.93
                                                  re-alignment   33,374,646   40.17       39.96

(a) Chinese re-alignment corpus has 9,864,294 English and 7,520,779 Chinese words.

GIZA bootstrap corpus (English / Arabic words)    System         Rules        Tune BLEU   Test BLEU
4,067,454 / 3,147,420                             baseline        2,333,839   47.92       47.33
                                                  re-alignment    2,474,737   47.87       47.89
168,255,347 / 147,165,003                         baseline        3,245,499   49.72       49.60
                                                  re-alignment    3,600,915   49.73       49.99

(b) Arabic re-alignment corpus has 4,067,454 English and 3,147,420 Arabic words.

Table 5.3: Machine translation experimental results evaluated with case-insensitive BLEU4.

We summarize the approach described in this chapter as:

1. Obtain bootstrap alignments for a training corpus using GIZA++.
2. Extract rules from the corpus and alignments using GHKM, noting the partial alignment that is used to extract each rule.
3. Construct derivation forests for each (tree, string) pair, ignoring the alignments, and run EM to obtain Viterbi derivation trees, then use the annotated partial alignments to obtain Viterbi alignments.
4. Use the new alignments as input to the MT system described below.
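Step 3 of this summary is worth making concrete: because each extracted rule is annotated with the partial alignment that licensed it, the Viterbi alignment of a sentence pair is simply the union of the partial alignments of the rules in its Viterbi derivation. Below is a minimal Python sketch of that final step; the rule records and indices are invented placeholders, not Tiburon's internal structures.

# Sketch: turning a Viterbi derivation back into a word alignment.
# Each rule carries the set of (english_index, foreign_index) links recorded
# when the rule was extracted; the sentence-level alignment is their union.

def alignment_from_derivation(derivation_rules):
    links = set()
    for rule in derivation_rules:
        links |= rule["partial_alignment"]
    return sorted(links)

viterbi_rules = [
    {"id": "r_np", "partial_alignment": {(0, 0)}},
    {"id": "r_vp", "partial_alignment": {(1, 2), (2, 1)}},
]
print(alignment_from_derivation(viterbi_rules))  # [(0, 0), (1, 2), (2, 1)]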
5.5.2 The MT system setup

A truly idealistic MT system would directly apply the rule weight parameters learned via EM to a machine translation task. As mentioned in Section 5.1, we maintain the two-model, or realistic, approach. Below we briefly describe the translation model, focusing on comparison with the previously described alignment model. Galley et al. [46] provide a more complete description of the translation model, and DeNeefe et al. [31] provide a more complete description of the end-to-end translation pipeline.

Although in principle the re-alignment model and translation model learn parameter weights over the same rule space, in practice we limit the rules used for re-alignment to the set of smallest rules that explain the training corpus and are consistent with the bootstrap alignments. This is a compromise made to reduce the search space for EM. The translation model learns multiple derivations of rules consistent with the re-alignments for each sentence, and learns weights for these by counting and smoothing. A dozen other features are also added to the rules. We obtain weights for the combinations of the features by performing minimum error rate training [105] on held-out data. We then use a CKY decoder to translate unseen test data using the rules and tuned weights. Table 5.1 summarizes the data used in tuning and testing.

5.5.3 Initial results

An initial re-alignment experiment shows a reasonable rise in BLEU scores from the baseline (Table 5.2), but closer inspection of the rules favored by EM implies we can do even better. EM has a tendency to favor few large rules over many small rules, even when the small rules are more useful.

                     Rules        Tune BLEU   Test BLEU
C-E  baseline        55,781,061   41.51       40.55
     EMD re-align    69,318,930   41.23       40.55
A-E  baseline         8,487,656   51.90       51.69
     EMD re-align    11,498,150   51.88       52.11

Table 5.4: Re-alignment performance with semi-supervised EMD bootstrap alignments.

Referring to the rules in Figure 5.2, note that possible derivations for translating between "taiwan's" and the first word in the Chinese sentence are R2, R11-R12, and R17-R18. Clearly the third derivation is not desirable, and we do not discuss it further. Between the first two derivations, R11-R12 is preferred over R2, as the conditioning for possessive insertion is not related to the specific Chinese word being inserted. Of the 1,902 sentences in the training corpus where this pair is seen, the bootstrap alignments yield the R2 derivation 1,649 times and the R11-R12 derivation 0 times. Re-alignment does not change the result much; the new alignments yield the R2 derivation 1,613 times and again never choose R11-R12. The rules in the second derivation are not themselves rarely seen: R11 is in 13,311 forests other than those where R2 is seen, and R12 is in 2,500 additional forests. EM gives R11 a probability of e^-7.72 (better than 98.7% of rules) and R12 a probability of e^-2.96. But R2 receives a probability of e^-6.32 and is preferred over the R11-R12 derivation, which has a combined probability of e^-10.68.

5.5.4 Making EM fair

The preference for shorter derivations containing large rules over longer derivations containing small rules is due to a general tendency for EM to prefer derivations with few atoms. Marcu and Wong [93] note this preference but consider the phenomenon a feature, rather than a bug. Zollmann and Sima'an [139] combat the overfitting aspect for parsing by using a held-out corpus and a straight maximum likelihood estimate, rather than EM. We take a modeling approach to the phenomenon.

As the probability of a derivation is determined by the product of its atom probabilities, longer derivations with more probabilities to multiply have an inherent disadvantage against shorter derivations, all else being equal. EM is an iterative procedure, and thus such a bias can lead the procedure to converge with artificially raised probabilities for short derivations and the large rules that comprise them. The relatively rare applicability of large rules (and thus lower observed partial counts) does not overcome the inherent advantage of large coverage. To combat this, we introduce size terms into our generative story, ensuring that all competing derivations for the same sentence contain the same number of atoms:

1. Choose a rule size s with cost c_size(s)^(s-1).
2. Choose a rule r (of size s) to replace the start symbol with probability p_rule(r|s, v).
3. For each variable in the partially completed (tree, string) pair, continue to choose sizes followed by rules, recursively, to replace these variables until there are no variables remaining.

This generative story changes the derivation comparison from R2 vs. R11-R12 to S2-R2 vs. R11-R12, where S2 is the atom that represents the choice of size 2 (the size of a rule in this context is the number of non-leaf and non-root nodes in its tree fragment). Note that the variable number of inclusions implied by the exponent in the generative story above ensures that all derivations have the same size. For example, a derivation with one size-3 rule, a derivation with one size-2 and one size-1 rule, and a derivation with three size-1 rules would each have three atoms.
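The arithmetic behind the fairer comparison is easy to reproduce. The short Python sketch below reuses the rule log-probabilities quoted above for R2, R11, and R12 purely for illustration (under the revised model those parameters would be re-estimated and conditioned on size), and the cost assigned to the size-2 atom S2 is a made-up placeholder, since its trained value is not given here.

# Sketch: comparing derivations by summed log-probabilities of their atoms.
logp = {"R2": -6.32, "R11": -7.72, "R12": -2.96}

def derivation_logprob(atoms, logp):
    return sum(logp[a] for a in atoms)

# Without size atoms, the one-rule derivation wins on the raw product:
print(derivation_logprob(["R2"], logp))           # -6.32
print(derivation_logprob(["R11", "R12"], logp))   # -10.68

# With size atoms, both derivations spend the same number of atoms.
# The value for S2 below is an illustrative placeholder, not a trained weight.
logp_with_sizes = dict(logp, S2=-4.5)
print(derivation_logprob(["S2", "R2"], logp_with_sizes))    # -10.82
print(derivation_logprob(["R11", "R12"], logp_with_sizes))  # -10.68

With two atoms on each side, the comparison no longer rewards the large rule merely for being a single multiplication.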
With this revised model that allows for fair comparison of derivations, the R11-R12 derivation is chosen 1,636 times, and S2-R2 is not chosen. R2 does, however, appear in the translation model, as the expanded rule extraction described in Section 5.5.2 creates R2 by joining R11 and R12.

The probability of size atoms, like that of rule atoms, is decided by EM. The revised generative story tends to encourage smaller sizes by virtue of the exponent. This does not, however, simply ensure that the largest number of rules per derivation is used in all cases. Ill-fitting and poorly motivated rules such as R22, R23, and R24 in Figure 5.2 are not preferred over R16, even though they are smaller. However, R14 and R16 are preferred over R7, as the former are useful rules. Although the modified model does not sum to 1, it leads to an improvement in BLEU score, as can be seen in the last row of Table 5.2.

5.5.5 Results

We performed primary experiments on two different bootstrap setups in two languages: the initial experiment uses the same data set for the GIZA++ initial alignment as is used in the re-alignment, while an experiment on better quality bootstrap alignments uses a much larger data set. For each bootstrapping in each language we compared the baseline of using these alignments directly in an MT system with the experiment of using the alignments obtained from the re-alignment procedure described in Section 5.5.4. For each experiment we report: the number of rules extracted by the expanded GHKM algorithm of Galley et al. [46] for the translation model, converged BLEU scores on the tuning set, and finally BLEU performance on the held-out test set. Data set specifics for the GIZA++ bootstrapping and BLEU results are summarized in Table 5.3.

5.5.6 Discussion

The results presented demonstrate that we are able to improve on unsupervised GIZA++ alignments by about 1 BLEU point for Chinese and around 0.4 BLEU point for Arabic using an additional unsupervised algorithm that requires no human-aligned data. If human-aligned data is available, the EMD algorithm provides higher baseline alignments than GIZA++ that have led to better MT performance [42]. As a further experiment we repeated the experimental conditions from Table 5.3, this time bootstrapped with the semi-supervised EMD method, which uses the larger bootstrap GIZA corpora described in Table 5.3 and an additional 64,469/48,650 words of hand-aligned English-Chinese and 43,782/31,457 words of hand-aligned English-Arabic. The results of this advanced experiment are in Table 5.4. We show a 0.42 gain in BLEU for Arabic, but no movement for Chinese. We believe increasing the size of the re-alignment corpora will increase BLEU gains in this experimental condition, but leave those results for future work.

We can see from the results presented that the syntax-aware re-alignment procedure of Section 5.3.2, coupled with the addition of size parameters to the generative story from Section 5.5.4, serves to remove links from the bootstrap alignments that cause less useful rules to be extracted, and thus increases the overall quality of the rules, and hence the system performance. We thus see the benefit of including syntax in an alignment model, bringing the two models of the realistic machine translation path somewhat closer together.

5.6 Conclusion

We have described a method for improving state-of-the-art syntax machine translation performance by casting a complicated MT system as a wxLNTs and employing transducer training algorithms to improve word alignment.
This chapter demonstrates the real, practical gains suggested in Chapter 1.

[Figure 5.3 spans two pages of tree-transducer rule diagrams (rules R11, R15, and R24-R29 over the "guangxi's opening up to the outside world" and "taiwan's surplus in trade between the two shores" examples); the graphics are not recoverable from this transcript.]

Figure 5.3: The impact of a bad alignment on rule extraction. Including the alignment link indicated by the dotted line in the example leads to the rule set in the second row. The re-alignment procedure described in Section 5.3.2 learns to prefer the rule set at bottom, which omits the bad link.

Chapter 6
Tiburon: A Tree Transducer Toolkit

In this chapter we describe Tiburon, a toolkit for manipulating weighted tree transducers and grammars. Tiburon contains implementations of many of the algorithms presented in previous chapters and is designed to be fairly intuitive and easy to use. We also place Tiburon in the context of other transducer and automata toolkits and software.

6.1 Introduction

The development of well-founded models of natural language processing applications has been greatly accelerated by the availability of toolkits for finite-state automata. The influential observation of Kaplan and Kay, that cascades of phonological rewrite rules could be expressed as regular relations (equivalent to finite-state transducers) [64], was exploited by Koskenniemi in his development of the two-level morphology and an accompanying system for its representation [80]. This system, which was a general program for analysis and generation of languages, pioneered the field of finite-state toolkits [68]. Successive versions of the two-level compiler, such as that written by Karttunen and others at Xerox [67], were used for large-scale analysis applications in many languages [68].
Continued advances, such as work by Karttunen in intersecting composition [71] and replacement [65, 66], eventually led to the development of the Xerox finite-state toolkit, which superseded the functionality and use of the two-level tools [68].

Meanwhile, interest in adding uncertainty to finite-state models grew alongside increased availability of large data sets and increased computational power. Ad-hoc methods and individual implementations were developed for integrating uncertainty into finite-state representations [115, 86], but the need for a general-purpose weighted finite-state toolkit was clear [103]. Researchers at AT&T led the way with their FSM Library [102], which represented weighted finite-state automata by incorporating the theory of semirings over rational power series cleanly into the existing automata theory. Other toolkits, such as van Noord's FSA utilities [130], the RWTH toolkit [62], and the USC/ISI Carmel toolkit [53], provided additional interfaces and utilities for working with weighted finite-state automata. As in the unweighted case, the availability of this software led to many research projects that took advantage of pre-existing implementations [61, 129, 78], and the development of the software led to the invention of new algorithms and theory [109, 99].

As has been described in Chapter 1, however, none of these toolkits are extendable to recognize syntactic structures. GRM, an extension of the AT&T toolkit that uses approximation theory to represent higher-complexity structure such as context-free grammars in the weighted finite-state string automata framework, was useful for handling certain representations [3], but a tree automata framework is required to truly capture tree models. Additionally, the incorporation of weights is crucial for modern natural language processing needs. Thus, contributions such as Timbuk [50], a toolkit for unweighted finite-state tree automata that has been used for cryptographic analysis, and MONA [58], an unweighted tree automata tool aimed at the logic community, were insufficient for our needs.

Knight and Graehl [76] put forward the case for the top-down tree automata theory of Rounds [116] and Thatcher [127] as a logical sequel to weighted string automata for NLP. Additionally, as Knight and Graehl mention [76], most of the desired general operations in a general weighted finite-state toolkit are applicable to top-down tree automata. Probabilistic tree automata were first proposed by Magidor and Moran [87]. Weighted tree transducers were first described by Fülöp and Vogler [44] as an operational representation of tree series transducers, first introduced by Kuich [82].

We present Tiburon, a toolkit designed in the spirit of its predecessors but with the tree, not the string, as its basic data structure and with weights inherent in its operation. Tiburon is designed to make it easy to construct automata and work with them; after reading this chapter a linguist with no computer science background or a computer scientist with only the vaguest notions of tree automata should be able to write basic acceptors and transducers. To achieve these goals we have maintained simplicity in data format design, such that acceptors and transducers are very close to the way they appear in the tree automata literature. We also provide a small set of generic but powerful operations that allow robust manipulation of data structures with simple commands. In subsequent sections we present an introduction to the formats and operations in the Tiburon toolkit and demonstrate the powerful applications that can be easily built. Tiburon was first introduced in [96].
6.2 Getting started

Tiburon is written in Java and is distributed as a Java archive file (jar) with a simple bash wrapping script. After downloading the software, the command

% ./tiburon

produces output that looks like

This is Tiburon, version 1.0
Error: Parameter 'infiles' is required.
Usage: tiburon [-h|--help] (-e|--encoding) <encoding> (-m|--semiring) <srtype>
  [--leftapply] [--rightapply] [-b|--batch] [(-a|--align) <align>]
  [-l|--left] [-r|--right] [-n|--normalizeweight] [--no-normalize]
  [--removeloops] [--normform] [(-p|--prune) <prune>]
  [(-d|--determinize) <determ>] [(-t|--train) <train>] [(-x|--xform) <xform>]
  [--training-deriv-location <trainderivloc>] [--conditional] [--no-deriv]
  [--randomize] [--timedebug <time>] [-y|--print-yields] [(-k|--kbest) <kbest>]
  [(-g|--generate) <krandom>] [-c|--check] [(-o|--outputfile) <outfile>]
  infiles1 infiles2 ... infilesN

The salient features of this output are that Tiburon was invoked, the program expected some input files, and the usage statement was displayed. Since input files are crucial to Tiburon's operation, we discuss them next. There are several file types Tiburon uses: wrtg, wcfg, wxtt, wxtst, and batch. With the exception of the batch file, each of these files corresponds to a formal structure described in Chapter 2. We describe these files next.

6.3 Grammars

In this section we describe file formats and operations on grammars, the wrtgs that recognize tree languages and the wcfgs that recognize context-free string languages.

6.3.1 File formats

Both kinds of grammar discussed in this thesis have a similar formal structure, (N, Σ, P, n_0), though Σ and P have somewhat different meanings. (We previously used Δ for the terminal alphabet of a wcfg, but this is an issue of terminology only.) Consequently, the grammar files, wrtg and wcfg, both have a similar overall format (we use a BNF-like syntax, but hope the reader will rely on the examples that follow for greater clarity):

<n0>
<prd>+

where <n0> is the start nonterminal and each <prd> is a member of P. A nonterminal can be most any alphanumeric sequence. To be safe, it should not contain the following reserved characters (the file parser for Tiburon is brittle, and if you try to break it, you will probably succeed):

. ( ) # @ % >

The format of a production, <prd>, is:

<nterm> "->" <rhs> ["#" <wgt>] ["@" <tie>]

where <nterm> is a nonterminal, <wgt> is a real number, and <tie> is an integer. Ties are rarely used and will be discussed later. If the file is a wrtg, then <rhs> has the following format:

<nterm> | <sym> | <sym>"("<rhs>+")"

where <sym> is a member of Σ and is subject to the same definitional constraints as the nonterminals. If the file is a wcfg, then <rhs> has the following format:

"*e*" | (<nterm>|<sym>)+

Note that the nonterminal and terminal sets are thus defined implicitly; all nonterminals must appear as left sides of productions at least once, and the nonterminal and terminal alphabets must not coincide.

The symbol % denotes a comment, and all characters after this symbol to the end of the line are ignored, with one exception. Since a file can be ambiguously a wcfg or a wrtg, if the initial line of the file is one of:

% TYPE CFG
% TYPE RTG

then that type is presumed. It is best to further explain these formats with some examples, which are in Figure 6.1.
(a) three.rtg is a wrtg that recognizes trees with size divisible by three:

q3
q3 -> A(q1 q1) # .25
q3 -> A(q3 q2) # .25
q3 -> A(q2 q3) # .25
q3 -> B(q2) # .25
q2 -> A(q2 q2) # .25
q2 -> A(q1 q3) # .25
q2 -> A(q3 q1) # .25
q2 -> B(q1) # .25
q1 -> A(q3 q3) # .025
q1 -> A(q1 q2) # .025
q1 -> A(q2 q1) # .025
q1 -> B(q3) # .025
q1 -> C # .9

(b) ran.cfg is a wcfg that recognizes an infinite language:

q
q -> np vp # 1
pp -> prep np # 1
vp -> vb do # 1
nn -> boy # .4
nn -> monkey # .1
nn -> clown # .5
np -> dt nn # .5
np -> dt nn pp # .5
dt -> the # 1
vb -> ran # 1
do -> *e* # .9
do -> home # .1
prep -> with # .6
prep -> by # .4

(c) even.rtg is a wrtg that recognizes trees with size divisible by two:

qe
qe -> A(qe qo) # .1
qe -> A(qo qe) # .8
qe -> B(qo) # .1
qo -> A(qo qo) # .6
qo -> A(qe qe) # .2
qo -> B(qe) # .1
qo -> C # .1

(d) candy.rtg is an rtg that recognizes a finite language. It is not in normal form or deterministic:

q
q -> S(subj vb obj)
q -> S(subj likes obj)
obj -> candy
vb -> likes
vb -> hates
subj -> John
subj -> Stacy

(e) Portion of train.cfg, a cfg used for training:

TOP
S -> S-C , NP-C VP
S -> NP-C VP
S -> ADVP NP-C VP
S -> NP-C ADVP VP
...
NP-C -> NPB
NP-C -> NPB , VP
NP-C -> NPB NP
...

(f) Portion of train.rtg, a wrtg used for training:

% TYPE RTG
qS_TOP+
qS_TOP+ -> qS_TOP # .2
qS_TOP+ -> qS # 0.8
qS -> S(qNP_S+ : qVP_S+ EOL) # 0.0009
qS -> S(qNP_S+ qVP_S+ : EOL) # 0.0010
qS -> S(qNP_S+ DT qVP_S+ EOL) # 0.0005
...
qS_TOP -> S(qNP_S+ , qVP_S+ EOL) # 0.0258
...

Figure 6.1: Example wrtg and wcfg files used to demonstrate Tiburon's capabilities.

We will later discuss training, so we now describe batch files appropriate for training wrtgs and wcfgs. The format for batch files is

(<count> <item>)+ | (<item>)+

A batch file for training wrtgs is a file of trees. The format for <item> in this case is:

<sym> | <sym>"("<item>+")"

A batch file for training wcfgs is a file of strings. The format for <item> in this case is:

<sym>+

Here is a portion of a batch file of strings:

DT VBN NNS RB MD VB NNS TO VB NNS IN NNS RBR CC RBR RB EOL
IN PRP$ NNS TO VB RP IN DT NNS , DT NN RB MD VB DT NN EOL
NNS WP VBP TO VB DT JJ NN MD VB PRP$ NNS IN NNP , PRP VBD EOL

Here is a portion of a batch file of trees:

TOP(S(NP-C(NPB(NNP NNP)) VP(VBZ NP-C(NPB(NNP))) EOL))
TOP(S(NP-C(NPB(DT JJ NN)) VP(MD VP-C(VB ADJP(JJ NP(NPB(NNP CD))))) EOL))
TOP(S(NP-C(NPB(DT NN NN)) VP(VBZ RB VP-C(VBN VP-C(VBN))) EOL))

6.3.2 Commands using grammar files

If you write any of the example grammars in Figure 6.1 as a plain text file and then call Tiburon with that file as an argument, the contents of that file, minus any comments, and with weights of 1 in place if no weight is specified, will be returned. If there is a syntax error, this will be reported instead. Here is an example of Tiburon reading candy.rtg from Figure 6.1d:

tiburon candy.rtg
This is Tiburon, version 1.0
q
q -> S(subj vb obj) # 1.000000
q -> S(subj likes obj) # 1.000000
obj -> candy # 1.000000
vb -> likes # 1.000000
vb -> hates # 1.000000
subj -> John # 1.000000
subj -> Stacy # 1.000000

Tiburon automatically places the weight 1 on unweighted productions.
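The production syntax described above is regular enough to be split apart with a few lines of code. The following Python sketch parses a single wrtg/wcfg production line into its parts following the BNF given earlier; it is an illustration of the documented format, not Tiburon's actual file parser, and the helper names are ours.

import re

# Sketch: split one production line of the documented form
#   <nterm> -> <rhs> [# <wgt>] [@ <tie>]
# into its components. Comments introduced by % are stripped first.
PROD = re.compile(
    r"^\s*(?P<lhs>\S+)\s*->\s*(?P<rhs>.*?)"
    r"(?:\s*#\s*(?P<wgt>[-+0-9.eE]+))?"
    r"(?:\s*@\s*(?P<tie>\d+))?\s*$"
)

def parse_production(line):
    line = line.split("%", 1)[0]     # drop comment text
    if not line.strip():
        return None
    m = PROD.match(line)
    if m is None:
        raise ValueError("bad production: " + line)
    return {
        "lhs": m.group("lhs"),
        "rhs": m.group("rhs"),
        "weight": float(m.group("wgt")) if m.group("wgt") else 1.0,  # default 1
        "tie": int(m.group("tie")) if m.group("tie") else None,
    }

print(parse_production("q -> S(subj vb obj) # .5 @ 3"))
print(parse_production("do -> *e* # .9"))

The default weight of 1.0 mirrors the behavior just described for unweighted productions.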
The --randomize flag makes Tiburon randomly choose some weights, which can be useful in debugging scenarios:

tiburon --randomize candy.rtg
This is Tiburon, version 1.0
q
q -> S(subj vb obj) # 0.357130
q -> S(subj likes obj) # 0.362367
obj -> candy # 0.291670
vb -> likes # 0.280298
vb -> hates # 0.686882
subj -> John # 0.671051
subj -> Stacy # 0.377440

Since we often want to work with probabilistic grammars, where the sum of the weights of productions with a common left nonterminal is 1, we could have also enforced the weights to be normalized by adding the -n flag:

tiburon --randomize -n candy.rtg
This is Tiburon, version 1.0
q
q -> S(subj vb obj) # 0.496361
q -> S(subj likes obj) # 0.503639
obj -> candy # 1.000000
vb -> likes # 0.289810
vb -> hates # 0.710190
subj -> John # 0.640016
subj -> Stacy # 0.359984

The -k <kbest> option, where <kbest> is an integer, returns the trees recognized by the <kbest> highest-weighted paths, or all of the paths with placeholder lines if there are not sufficient paths. This command, too, can be strung together with other options. Note in the following example that a new random weighting is chosen before the trees are generated, and that there are only six paths in the grammar.

tiburon --randomize -n -k 10 candy.rtg
This is Tiburon, version 1.0
Warning: returning fewer trees than requested
S(Stacy likes candy) # 0.235885
S(Stacy likes candy) # 0.228522
S(Stacy hates candy) # 0.151262
S(John likes candy) # 0.147251
S(John likes candy) # 0.142655
S(John hates candy) # 0.094425
0
0
0
0

The -g <grand> option, where <grand> is an integer, randomly follows <grand> paths and returns the strings or trees recognized by them. We show an example of this on the wcfg ran.cfg.

./tiburon -g 5 ran.cfg
This is Tiburon, version 1.0
the clown with the boy ran home # 0.001435
the boy by the monkey ran home # 0.000191
the clown by the clown ran home # 0.002500
the clown ran home # 0.011957
the monkey with the boy by the clown with the clown ran # 0.000043

The -y flag prints the yield of trees as a string and is combined with the -k or -g options. It only has any effect with wrtgs.

tiburon -yk 5 three.rtg
This is Tiburon, version 1.0
C C # 0.202500
C # 0.056250
C C C # 0.011391
C C C # 0.011391
C C C # 0.011391

We can convert wrtgs to wcfgs by simply replacing production right-side trees with their yield strings. We can also convert wcfgs to wrtgs (provided they do not have *e* productions) by introducing symbols, typically repurposing nonterminal names for that purpose. The -x flag provides this functionality. In the first example that follows, the do -> *e* # .9 production from ran.cfg was replaced with the production do -> away # .9 to demonstrate the feature, forming the file ran.noeps.cfg:

tiburon -x RTG ran.noeps.cfg
This is Tiburon, version 1.0
q_q
q_q -> q(q_np q_vp) # 1.000000
q_dt -> dt(the) # 1.000000
q_pp -> pp(q_prep q_np) # 1.000000
q_vb -> vb(ran) # 1.000000
q_vp -> vp(q_vb q_do) # 1.000000
q_np -> np(q_dt q_nn) # 0.500000
q_np -> np(q_dt q_nn q_pp) # 0.500000
q_nn -> nn(boy) # 0.400000
q_nn -> nn(monkey) # 0.100000
q_nn -> nn(clown) # 0.500000
q_prep -> prep(with) # 0.600000
q_prep -> prep(by) # 0.400000
q_do -> do(away) # 0.900000
q_do -> do(home) # 0.100000

tiburon -x CFG even.rtg
This is Tiburon, version 1.0
qe
qe -> qe qo # 0.100000
qe -> qo qe # 0.800000
qe -> qo # 0.100000
qo -> qo qo # 0.600000
qo -> qe qe # 0.200000
qo -> qe # 0.100000
qo -> C # 0.100000

We can also convert grammars to identity transducers, but will cover that in the next section.
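The -g behavior shown above amounts to sampling derivations from the weighted grammar, treating the (normalized) production weights for each nonterminal as a distribution over its expansions. Here is a minimal Python sketch of that idea using the productions of ran.cfg; the data structures are ours rather than Tiburon's, and the derivation probability printed is simply the product of the production weights used, which may differ from how Tiburon computes the scores it reports.

import random

# Sketch: sample one string (and its derivation probability) from a
# probabilistic CFG. "*e*" stands for the empty string, as in ran.cfg.
grammar = {
    "q":    [(["np", "vp"], 1.0)],
    "vp":   [(["vb", "do"], 1.0)],
    "pp":   [(["prep", "np"], 1.0)],
    "np":   [(["dt", "nn"], 0.5), (["dt", "nn", "pp"], 0.5)],
    "nn":   [(["boy"], 0.4), (["monkey"], 0.1), (["clown"], 0.5)],
    "dt":   [(["the"], 1.0)],
    "vb":   [(["ran"], 1.0)],
    "do":   [(["*e*"], 0.9), (["home"], 0.1)],
    "prep": [(["with"], 0.6), (["by"], 0.4)],
}

def sample(symbol):
    if symbol not in grammar:                  # terminal or empty-string marker
        return ([] if symbol == "*e*" else [symbol]), 1.0
    options = grammar[symbol]
    rhs, prob = random.choices(options, weights=[w for _, w in options])[0]
    words = []
    for child in rhs:
        child_words, child_prob = sample(child)
        words.extend(child_words)
        prob *= child_prob
    return words, prob

words, prob = sample("q")
print(" ".join(words), "#", round(prob, 6))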
If a sequence of wrtgs is input as arguments to Tiburon, they will be intersected, and any other flags will be invoked on the intersected grammar. (As it is undecidable whether the intersection of two context-free languages is empty, we do not attempt to intersect wcfgs.) For example, we can combine the intersection of three.rtg and even.rtg from Figures 6.1a and 6.1c, respectively, with the -c flag, which displays information about the input:

tiburon -c three.rtg even.rtg
This is Tiburon, version 1.0
RTG info for input rtg three.rtg:
3 states
13 rules
1 unique terminal symbols
infinite derivations
RTG info for input rtg even.rtg:
2 states
7 rules
1 unique terminal symbols
infinite derivations
RTG info for intersected RTG:
6 states
43 rules
1 unique terminal symbols
infinite derivations

As described in Chapter 3, grammars produced by automated systems such as those used to perform machine translation [47] or parsing [10] frequently contain multiple derivations for the same item with different weights. This is due to the systems' representation of their result space in terms of weighted partial results of various sizes that may be assembled in multiple ways. This property is undesirable if we wish to know the total probability of a particular item in a language. It is also frequently undesirable to have repeated results in a k-best list. The -d operation invokes May and Knight's weighted determinization algorithm for wrtgs [95], which is applicable to wcfgs too. Note that candy.rtg is nondeterministic; this can be seen, since it has multiple paths that recognize the same tree. Let's generate some weights for this rtg again, and normalize them, but save off the file to use again, using the standard Unix tee command:

tiburon --randomize -n candy.rtg | tee candy.wrtg
This is Tiburon, version 1.0
q
q -> S(subj vb obj) # 0.447814
q -> S(subj likes obj) # 0.552186
obj -> candy # 1.000000
vb -> likes # 0.779565
vb -> hates # 0.220435
subj -> John # 0.663922
subj -> Stacy # 0.336078

If we enumerate the paths, we will see the same tree is recognized more than once:

tiburon -k 6 candy.wrtg
This is Tiburon, version 1.0
S(John likes candy) # 0.366608
S(John likes candy) # 0.231775
S(Stacy likes candy) # 0.185578
S(Stacy likes candy) # 0.117325
S(John hates candy) # 0.065538
S(Stacy hates candy) # 0.033176

The -d operation (which is invoked with a timeout flag, since determinization can potentially be an exponential algorithm, even on wrtgs with a finite language) has the effect of combining duplicate paths:

tiburon -d 1 -k 6 candy.wrtg
This is Tiburon, version 1.0
Warning: returning fewer trees than requested
S(John likes candy) # 0.598384
S(Stacy likes candy) # 0.302902
S(John hates candy) # 0.065538
S(Stacy hates candy) # 0.033176
0
0

As a side note, in order to obtain the correct result, Tiburon may have to produce a wrtg that is not probabilistic! That is in fact what happens here:

tiburon -d 1 candy.wrtg
This is Tiburon, version 1.0
q5
q5 -> S(q2 q1 q4) # 0.506464
q5 -> S(q2 q3 q4) # 0.447814
q3 -> hates # 0.220435
q1 -> likes # 1.779565
q4 -> candy # 1.000000
q2 -> Stacy # 0.336078
q2 -> John # 0.663922
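For a finite language like this one, the effect of -d can be checked by hand: summing the weights of duplicate trees in the nondeterministic k-best list reproduces the determinized weights up to rounding. The small Python check below copies its list literals from the output shown above; it is a verification sketch, not part of Tiburon.

from collections import defaultdict

# Sketch: what weighted determinization accomplishes for this finite language.
kbest = [
    ("S(John likes candy)", 0.366608),
    ("S(John likes candy)", 0.231775),
    ("S(Stacy likes candy)", 0.185578),
    ("S(Stacy likes candy)", 0.117325),
    ("S(John hates candy)", 0.065538),
    ("S(Stacy hates candy)", 0.033176),
]

totals = defaultdict(float)
for tree, weight in kbest:
    totals[tree] += weight

for tree, weight in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(tree, "#", round(weight, 6))
# S(John likes candy) # 0.598383   (Tiburon reports 0.598384)
# S(Stacy likes candy) # 0.302903  (Tiburon reports 0.302902)
# S(John hates candy) # 0.065538
# S(Stacy hates candy) # 0.033176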
In real systems using large grammars to represent complex tree languages, memory and CPU time are very real issues. Even as computers increase in power, the added complexity of tree automata forces practitioners to combat computationally intensive processes. One way of avoiding long running times is to prune weighted automata before operating on them. One technique for pruning finite-state (string) automata is to use the forward-backward algorithm to calculate the highest-scoring path each arc in the automaton is involved in, and then prune the arcs that are only in relatively low-scoring paths [122].

We apply this technique to tree automata by using an adaptation [54] of the inside-outside algorithm [85]. The -p option with argument x removes productions from a tree grammar that are involved only in paths x times or more worse than the best path. The -c option provides an overview of a grammar, and we can use this to demonstrate the effects of pruning. The file c1s4.determ.rtg (not shown here) represents a language of possible translations of a particular Chinese sentence. We inspect the grammar as follows:

./tiburon -m tropical -c c1s4.determ.rtg
Check info:
113 states
168 rules
28 unique terminal symbols
2340 derivations

Note that the -m tropical flag is used because this grammar is weighted in the tropical semiring. We prune the grammar and then inspect it as follows:

java -jar tiburon.jar -m tropical -p 8 -c c1s4.determ.rtg
Check info:
111 states
158 rules
28 unique terminal symbols
780 derivations

Since we are in the tropical semiring, this command means "Prune all productions that are involved in derivations scoring worse than the best derivation plus 8". This roughly corresponds to derivations with probability 2980 times worse than the best derivation. Note that the pruned grammar has fewer than half the derivations of the unpruned grammar. A quick check of the top derivations after the pruning (using -k) shows that the pruned and unpruned grammars do not differ in their sorted derivation lists until the 455th-highest derivation.

Tiburon contains an implementation of EM training as described in [56] that is applicable for training wrtgs and wcfgs. In Figure 6.1f a portion of train.rtg was shown, but the entire wrtg is quite large:

tiburon -c weighted_trainable.rtg
This is Tiburon, version 1.0
File is large (>10,000 rules) so time to read in will be reported below
Read 10000 rules: 884 ms
Done reading large file
RTG info for input rtg weighted_trainable.rtg:
671 states
12136 rules
45 unique terminal symbols
infinite derivations

We invoke training using the -t flag and the number of desired iterations. Note that the "ties" described above could be used here to ensure that two productions in a wrtg with the same tie id are treated as the same parameter for purposes of counting, and thus their weights are always the same. (In the M-step, the weight of a single parameter with multiple rules is the sum of each rule's count divided by the sum of each rule's normalization group count. To ensure a probabilistic grammar, this weight is then "removed" from the available weight for the remaining members of each normalization group.) This particular case does not, however, have ties.
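The tie-aware M-step arithmetic described in the parenthetical above can be made concrete. The Python sketch below is our reading of that description, with invented rule names, counts, and group memberships; it is not Tiburon code, and the real implementation may handle edge cases differently.

from collections import defaultdict

# Sketch: M-step with ties. Rules sharing a tie id are pooled into one
# parameter: (sum of tied counts) / (sum of their normalization-group totals).
# Untied rules then split whatever probability mass their group has left.
rules = {
    # rule_id: (left_nonterminal, expected_count, tie_id or None)
    "r1": ("q", 6.0, 1),
    "r2": ("q", 2.0, None),
    "r3": ("p", 1.0, 1),
    "r4": ("p", 3.0, None),
}

group_total = defaultdict(float)
for lhs, count, _tie in rules.values():
    group_total[lhs] += count

tie_count, tie_norm = defaultdict(float), defaultdict(float)
for lhs, count, tie in rules.values():
    if tie is not None:
        tie_count[tie] += count
        tie_norm[tie] += group_total[lhs]

weights = {}
for rid, (lhs, count, tie) in rules.items():
    if tie is not None:
        weights[rid] = tie_count[tie] / tie_norm[tie]

for lhs in group_total:
    used = sum(weights[rid] for rid, (l, _c, t) in rules.items()
               if l == lhs and t is not None)
    untied = [(rid, c) for rid, (l, c, t) in rules.items()
              if l == lhs and t is None]
    remaining = sum(c for _rid, c in untied)
    for rid, c in untied:
        weights[rid] = (1.0 - used) * c / remaining

print({rid: round(w, 3) for rid, w in sorted(weights.items())})
# r1 and r3 share tie 1: (6+1)/(8+4) = 0.583; r2 and r4 each get 1 - 0.583 = 0.417.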
We provide a corpus of 100 trees, corp.100, a portion of which was shown above, and train for 10 iterations:

tiburon -t 10 corp.100 train.rtg > train.t10.rtg
This is Tiburon, version 1.0
File is large (>10,000 rules) so time to read in will be reported below
Read 10000 rules: 775 ms
Done reading large file
Cross entropy with normalized initial weights is 1.254; corpus prob is e^-3672.832
Cross entropy after 1 iterations is 0.979; corpus prob is e^-2868.953
Cross entropy after 2 iterations is 0.951; corpus prob is e^-2786.804
Cross entropy after 3 iterations is 0.939; corpus prob is e^-2750.741
Cross entropy after 4 iterations is 0.933; corpus prob is e^-2732.815
Cross entropy after 5 iterations is 0.929; corpus prob is e^-2723.522
Cross entropy after 6 iterations is 0.928; corpus prob is e^-2718.316
Cross entropy after 7 iterations is 0.927; corpus prob is e^-2715.071
Cross entropy after 8 iterations is 0.926; corpus prob is e^-2712.862
Cross entropy after 9 iterations is 0.925; corpus prob is e^-2711.239

Since the input rtg has weights attached, these are used as initial parameter values. This particular wrtg generates a node in a tree based on the context of either the parent node or the parent and grandparent nodes. Chain productions are used to choose between the amounts of context desired. Two such productions from train.rtg are:

qSBAR_NP+ -> qSBAR_NP # .2
qSBAR_NP+ -> qSBAR # .8

So initially there is a bias toward forgetting grandparent information when the context is (SBAR, NP). The result of training, from train.t10.rtg, is:

qSBAR_NP+ -> qSBAR_NP # 0.936235
qSBAR_NP+ -> qSBAR # 0.063765

The data causes EM to reverse this initial bias.

6.4 Transducers

In this section we describe file formats and operations on transducers, the wxtts that transform trees to trees, and the wxtsts that transform trees to strings.

6.4.1 File formats

Just as for the grammar case, both kinds of transducer have a similar formal structure, (Q, Σ, Δ, R, n_0), though Δ is ranked for wxtts and a simple terminal alphabet for wxtsts. Consequently, the transducer files, wxtt and wxtst, both have a similar overall format, which is remarkably similar to the grammar format:

<q0>
<rle>+

where <q0> is the start state and each <rle> is a member of R. Nonterminals can in general look like states, with the exception that : should not be used in their names. The format of a rule, <rle>, is:

<state>"."<lhs> "->" <rhs> ["#" <wgt>] ["@" <tie>]

where <state> is a state, and <wgt> and <tie> are as for grammars. <lhs> has the following format:

<vbl>":" | <sig-sym> | <sig-sym>"("<lhs>+")"

where <sig-sym> is a member of Σ and <vbl> is a variable. Variables have the same form as alphabet symbols, nonterminals, and states, though by convention we generally name them like x4. A <lhs> is invalid if the same <vbl> appears more than once. If the file is a wxtt, then <rhs> has the following format:

<state>"."<vbl> | <del-sym> | <del-sym>"("<rhs>+")"

If the file is a wxtst, then <rhs> has the following format:

"*e*" | (<state>"."<vbl>|<del-sym>)+

A <rhs> is invalid if it contains a variable not present in <lhs>. As for grammars, the various alphabets are defined implicitly. Aside from the regular use of comments, type can be enforced as follows:

% TYPE XR
% TYPE XRS

(a) wxtt.trans is a wxT transducer:

q
q.A(Z(x0:) x1:) -> B(C(q.x0 r.x0) q.x1) # 0.3
q.E(x0:) -> F # 0.5
r.E -> G # 0.7

(b) wlnt.trans is a wLNT transducer:

s
s.B(x0: x1:) -> D(t.x1 s.x0)
s.C(x0: x1:) -> H(s.x0 s.x1) # 0.6
s.C(x0: x1:) -> H(s.x1 s.x0) # 0.4
t.B(x0: x1:) -> D(s.x0 s.x1)
s.F -> L
t.F -> I
s.G -> J # 0.7
s.G -> K # 0.3
(c) Rotation transducer fragment:

rJJ
rJJ.JJ(x0: x1:) -> JJ(rJJ.x0 rTO.x1) # 0.250000
rJJ.JJ(x0: x1:) -> JJ(rTO.x1 rJJ.x0) # 0.750000
rJJ.JJ(x0:) -> JJ(t.x0) # 1.000000
t."abhorrent" -> "abhorrent" # 1.000000
rTO.TO(x0: x1:) -> TO(rPRP.x1 rTO.x0) # 0.333333
rTO.TO(x0: x1:) -> TO(rTO.x0 rPRP.x1) # 0.666667
rTO.TO(x0:) -> TO(t.x0) # 1.000000
rPRP.PRP(x0:) -> PRP(t.x0) # 1.000000
t."them" -> "them" # 1.000000
t."to" -> "to" # 1.000000

(d) Insertion transducer fragment:

iJJ
iJJ.JJ(x0: x1:) -> JJ(iJJ.x0 iJJ.x1) # 0.928571 @ 108
iJJ.JJ(x0: x1:) -> JJ(iJJ.x0 iJJ.x1 INS) # 0.071429 @ 107
iJJ.JJ(x0:) -> JJ(t.x0) # 0.928571 @ 108
iJJ.TO(x0: x1:) -> TO(iTO.x0 iTO.x1) # 1.000000 @ 159
iTO.TO(x0:) -> TO(t.x0) # 0.800000 @ 165
iTO.TO(x0:) -> TO(s.x0 INS) # 0.111842 @ 164
iTO.TO(x0:) -> TO(INS s.x0) # 0.088158 @ 163
iTO.PRP(x0:) -> PRP(t.x0) # 1.000000 @ 150
t."abhorrent" -> "abhorrent" # 1.000000
t."to" -> "to" # 1.000000
t."them" -> "them" # 1.000000
s."to" -> nn-to # 1.000000

Figure 6.2: Example wxtt and wxtst files used to demonstrate Tiburon's capabilities.

We now present several wxtt and wxtst examples. As with grammars, Tiburon tries to automatically detect the type of file input, but the following lines, included as the first line of the transducer file, remove ambiguity:

% TYPE XR
% TYPE XRS

The format for batch files for training transducers is

(<count> <in-item> <out-item>)+ | (<in-item> <out-item>)+

The format for <in-item> is:

<sig-sym> | <sig-sym>"("<in-item>+")"

A batch file for training wxtts is a file of tree-tree pairs. The format for <out-item> in this case is:

<del-sym> | <del-sym>"("<out-item>+")"

A batch file for training wxtsts is a file of tree-string pairs. The format for <out-item> in this case is:

<del-sym>+

Here is a sample corpus of English tree-Japanese string pairs:

6.4.2 Commands using transducer files

As with grammars, you can simply provide a transducer file to Tiburon as input and it will return its contents to you.

tiburon comp2.rln
This is Tiburon, version 1.0
s
s.B(x0: x1:) -> D(t.x1 s.x0) # 1.000000
s.C(x0: x1:) -> H(s.x0 s.x1) # 0.600000
s.C(x0: x1:) -> H(s.x1 s.x0) # 0.400000
s.F -> L # 0.800000
s.G -> J # 0.700000
s.G -> K # 0.300000
t.B(x0: x1:) -> D(s.x0 s.x1) # 1.000000
t.F -> I # 0.200000

The -c flag works for transducers, too.

tiburon -c comp2.rln
This is Tiburon, version 1.0
Transducer info for input tree transducer comp2.rln:
2 states
8 rules

Analogous to the intersection of wrtgs, providing multiple transducers to Tiburon causes it to compose them. However, unlike the wrtg case, there are strict constraints on the classes of transducer that can be composed.

tiburon wxtt.trans wlnt.trans
This is Tiburon, version 1.0
q0
q0.E(x0:) -> L # 0.500000
q0.A(Z(x0:) x1:) -> D(q13.x1 H(q0.x0 q14.x0)) # 0.180000
q0.A(Z(x0:) x1:) -> D(q13.x1 H(q14.x0 q0.x0)) # 0.120000
q13.E(x0:) -> I # 0.500000
q13.A(Z(x0:) x1:) -> D(H(q0.x0 q14.x0) q0.x1) # 0.180000
q13.A(Z(x0:) x1:) -> D(H(q14.x0 q0.x0) q0.x1) # 0.120000
q14.E -> J # 0.490000
q14.E -> K # 0.210000

Providing a wrtg and a transducer (or sequence of transducers) as arguments causes Tiburon to do application (the currently released version only provides bucket brigade application). As before, subsequent operations (such as -k) are invoked on the resulting application wrtg or wcfg. In the following example we pass in a tree directly from the command line; the - represents where standard input is placed in the argument sequence.
7 Thecurrentlyreleasedversiononlyprovidesbucketbrigadeapplication 184 echo ’JJ(JJ("abhorrent") TO(TO("to") PRP("them")))’ | ./tiburon -k 5 - rot ins This is Tiburon, version 1.0 JJ(TO(TO("to") PRP("them")) JJ("abhorrent")) # 0.344898 JJ(TO(PRP("them") TO("to")) JJ("abhorrent")) # 0.172449 JJ(JJ("abhorrent") TO(TO("to") PRP("them"))) # 0.114966 JJ(JJ("abhorrent") TO(PRP("them") TO("to"))) # 0.057483 JJ(TO(TO(nn-to INS) PRP("them")) JJ("abhorrent")) # 0.048218 Asmentionedbefore,wecanconvertgrammarstotransducers. tiburon -x XRS even.rtg | tee even.xrs This is Tiburon, version 1.0 qe qe.A(x0: x1:) -> qe.x0 qo.x1 # 0.100000 qe.A(x0: x1:) -> qo.x0 qe.x1 # 0.800000 qe.B(x0:) -> qo.x0 # 0.100000 qo.A(x0: x1:) -> qo.x0 qo.x1 # 0.600000 qo.A(x0: x1:) -> qe.x0 qe.x1 # 0.200000 qo.B(x0:) -> qe.x0 # 0.100000 qo.C -> C # 0.100000 Thisisagoodwaytobuildaparser! Wecannowpassastringintotherightofthis transducerandformawrtgthatrecognizesall(infinitelymany)parsesofthestringwith anevennumberoftotalnodes! 185 echo "C C C" | ./tiburon -k 5 a - This is Tiburon, version 1.0 A(C A(C B(C))) # 0.000064 A(A(C C) B(C)) # 0.000048 A(C B(A(C C))) # 0.000048 B(A(A(C C) C)) # 0.000036 B(A(C A(C C))) # 0.000036 Finally,wecanalsotraintransducers. ThistransducerisanEnglish-to-JapanesexNTs and thus we also need to set the character class with -e euc-jp. Here is a part of the transducertobetrained: Traininglookslikethis: 186 tiburon -e euc-jp -t 5 corpus transducer > a This is Tiburon, version 1.0 Cross entropy with normalized initial weights is 2.196; corpus prob is eˆ-474.308 Cross entropy after 1 iterations is 1.849; corpus prob is eˆ-399.584 Cross entropy after 2 iterations is 1.712; corpus prob is eˆ-369.819 Cross entropy after 3 iterations is 1.544; corpus prob is eˆ-333.421 Cross entropy after 4 iterations is 1.416; corpus prob is eˆ-305.830 6.5 Performancecomparison Tiburon is primarily intended for manipulation, combination, and inference of tree au- tomata and transducers, but as these formalisms generalize string transducers and au- tomata (see Figure 2.19), we can use it as a wfst toolkit, too. We may thus compare Tiburon’sperformancewithotherwfsttoolkits. Inthissectionweconductperformance andscalabilitytasksonwfstsandwfsasonthreewidelyavailablewfsttoolkits: Carmel version 6.2 (released May 4, 2010) [53], OpenFst version 1.1 (released June 17, 2009) [4], and FSM version 4.0 (released 2003) [102]. We repeat the same tests on wtt and wrtg equivalents in Tiburon version 1.0 (released concurrently with this thesis in August, 2010). The wfst toolkits were written in C or C++ and have been extensively used and testedoveranumberofyears. Tiburon,bycontrast,waswritteninJavaandhasreceived less testing and development time. The following sections demonstrate that Tiburon is 187 t (q (r A B .2)) (s (t *e* *e* .4)) (a)fstrulesandfinalstatefor Carmel. q r A B 1.609 s t *e* *e* 0.916 t (b)fstrulesandfinalstatefor FSMandOpenFst. q.A(x0:) -> B(r.x0) # .2 s.x0: -> t.x0 # .4 t.TIBEND -> TIBEND # 1 (c) Tree transducer rules and simula- tionoffinalstateforTiburon. A B (d) String representation in Carmel(-iswitchconvertsto identityfst). 0 1 A A 0 1 2 B B 0 2 (e) String representation as identityfstinFSMandOpen- Fst. A(B(TIBEND)) (f) String representation as monadic treeinTiburon. Figure6.3: ComparisonofrepresentationsinCarmel,FSM/OpenFst,andTiburon. generallyslowerandlessscalablethanitscompetitorsontasksdesignedforwfsttoolk- its. We ran these on a custom-built machine with four AMD Opteron 850 processors and32gmemoryrunningFedoraCore9. 
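As Figure 6.3 illustrates, Tiburon has no separate string type in these comparisons: a string is encoded as a monadic tree with a designated leaf marker. The following is a minimal sketch of that encoding, given only for illustration; the marker name TIBEND follows Figure 6.3(f), and the function names are made up rather than part of any toolkit.

END = "TIBEND"

def string_to_monadic_tree(tokens):
    # Encode a token sequence as a monadic tree, e.g. ["A", "B"] -> A(B(TIBEND)).
    tree = END
    for token in reversed(tokens):
        tree = "%s(%s)" % (token, tree)
    return tree

def monadic_tree_to_string(tree):
    # Invert the encoding, dropping the end marker (tokens may not contain parentheses).
    tokens = tree.replace(")", "").split("(")
    return [t for t in tokens if t != END]

print(string_to_monadic_tree(["A", "B"]))      # A(B(TIBEND))
print(monadic_tree_to_string("A(B(TIBEND))"))  # ['A', 'B']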
Times were calculated by summing the "user" and "sys" information reported by the time command and averaging over three repeated runs. All timing results are shown in Table 6.2. For Carmel and Tiburon we simply report the averaged time to perform the described operation. FSM and OpenFst operate by first converting text representations of transducers into a machine-readable binary format, then performing the necessary operations, then finally converting back to text format. We thus separate out conversion from transducer manipulation operations for these toolkits in Table 6.2.

6.5.1 Transliteration cascades

We tested the toolkits' performance on transliterating Japanese katakana of English names through a cascade of transducers, similar to that described by Knight and Graehl [75]. (We do not include a transducer modeling misspellings caused by OCR, as Knight and Graehl [75] do, and we include two additional transducers for technical reasons—lines 4 and 5 in Table 6.1.)

Order  Description                                      States   Rules
1      generates English words                          1        50,001
2      pronounces English sounds                        150,553  302,349
3      converts English sounds to Japanese sounds       99       283
4      slightly disprefers certain Japanese phonemes    5        94
5      combines doubled Japanese vowels                 6        53
6      converts Japanese sounds to katakana             46       294
Table 6.1: Generative order, description, and statistics for the cascade of English-to-katakana transducers used in performance tests in Section 6.5.

We built equivalent versions of these transducers as well as a representation of katakana string glosses in formats suitable for the four toolkits. Figure 6.3 shows the difference between the various formats. Note that for Tiburon, a monadic tree replaces the katakana string. Also note that we represent weights in FSM and OpenFst in negative log space. We used these transducers to conduct performance experiments in simple reading, inference through a cascade, and k-best path search of an automaton.

6.5.1.1 Reading a transducer

The most basic task a toolkit can do is read in a file representing a transducer or automaton. We compared the systems' ability to read in and generate basic information about the large pronunciation transducer listed as line 2 in Table 6.1. OpenFst is about twice as slow as FSM and Carmel at reading in a transducer, while Tiburon is about two orders of magnitude worse than FSM and Carmel.

6.5.1.2 Inference through a cascade

As described by Knight and Graehl [75], we can use the cascade of transducers that produces katakana from English words to decode katakana by passing a candidate string backward through the cascade, thereby obtaining a representation of all possible inputs as a (string or tree) automaton. We compared the systems' ability to perform this operation with the katakana gloss a n ji ra ho re su te ru na i to. Carmel is by far the fastest at this task—OpenFst is about 4.5 times slower, FSM is ten times slower than OpenFst, and Tiburon is another six times slower than FSM.

6.5.1.3 K-best

To complete the inference task it is important to obtain a k-best list from the result automaton produced in Section 6.5.1.2. Carmel can produce such a list, as does Tiburon (though in this case it is a list of monadic trees). FSM and OpenFst do not directly produce such lists but rather produce wfsas or identity wfsts that contain only the k best paths; a post-process is needed to agglomerate labels and weights. Figure 6.4 shows the 2-best lists or their equivalent representations produced by each of the toolkits.

We compared the systems' abilities to generate a 20-, 20,000-, and 2,000,000-best list from the result automata they previously produced. For FSM and OpenFst we only considered the generation of the representative automata.
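The agglomeration post-process for the FSM and OpenFst output is simple but worth making concrete. The sketch below is our illustration, not part of any of the toolkits; it assumes the AT&T-style text acceptor format shown in Figure 6.4(c) below (lines of the form "src dst label [negative-log weight]", bare lines for final states, the first arc's source taken as the start state, and an acyclic k-best automaton), and it prints each path's string and weight, best first.

import sys
from collections import defaultdict

EPS = "*e*"

def read_acceptor(lines):
    # Parse AT&T-style text acceptor lines into (start state, arcs, final states).
    arcs = defaultdict(list)   # src -> [(dst, label, weight)]
    finals = set()
    start = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:               # a bare state number marks a final state
            finals.add(fields[0])
        else:
            src, dst, label = fields[0], fields[1], fields[2]
            weight = float(fields[3]) if len(fields) > 3 else 0.0
            if start is None:
                start = src                # convention: the first arc's source is the start state
            arcs[src].append((dst, label, weight))
    return start, arcs, finals

def paths(start, arcs, finals):
    # Enumerate (string, summed negative-log weight) over all paths; assumes acyclicity.
    stack = [(start, [], 0.0)]
    while stack:
        state, labels, weight = stack.pop()
        if state in finals:
            yield " ".join(labels), weight
        for dst, label, w in arcs[state]:
            new_labels = labels if label == EPS else labels + [label.strip('"')]
            stack.append((dst, new_labels, weight + w))

if __name__ == "__main__":
    start, arcs, finals = read_acceptor(sys.stdin)
    for string, weight in sorted(paths(start, arcs, finals), key=lambda p: p[1]):
        print("%s\t%g" % (string, weight))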
ANGELA FORRESTAL KNIGHT 2.60279825597665e-20
ANGELA FORRESTER KNIGHT 6.00711401244296e-21
(a) Carmel output.

ANGELA(FORRESTAL(KNIGHT(END))) # 2.602803E-20
ANGELA(FORRESTER(KNIGHT(END))) # 6.007097E-21
(b) Tiburon output.

0 1 *e*
0 28 *e*
1 2 *e*
2 3 *e* 0.0794004127
...
7 8 *e* 0.637409806
8 9 "ANGELA" 10.6754665
9 10 *e*
...
25 26 "KNIGHT" 8.59761715
26 27 *e*
27
(c) Partial FSM (OpenFst) output. The complete output has 53 arcs and two final states.

Figure 6.4: K-best output for various toolkits on the transliteration task described in Section 6.5.1.3. Carmel produces strings, and Tiburon produces monadic trees with a special leaf terminal. FSM and OpenFst produce wfsts or wfsas representing the k-best list and a post-process must agglomerate symbols and weights.

FSM and Carmel took about the same amount of time for the 20-best list; OpenFst was an order of magnitude slower, and Tiburon, which as previously noted is very inefficient at reading in structures, was considerably worse. Tiburon was only twice as slow at obtaining the 20,000-best list as it was the 20-best list; this is because the overhead incurred for reading in the large automaton dominates the 20-best operation. OpenFst, which also is comparatively slow at reading, was one order of magnitude slower than its 20-best operation at generating the 20,000-best list. Unlike in the 20-best case, FSM was much slower than Carmel at 20,000-best. This may be due to FSM's 32-bit architecture—the other three systems are all 64-bit. The 32-bit limitation was certainly the reason for FSM's inability to generate a 2,000,000-best list. It is unable to access more than 4g memory, while the other systems were able to take advantage of higher memory limits. Tiburon required explicit expansion of the Java heap in order to complete its task. It should be noted, though, that Carmel was able to obtain a 2,000,000-best list even when its 32-bit variant was run. This may be due to a choice of algorithm—Carmel uses the k-best algorithm of Eppstein [40], while OpenFst and FSM use the algorithm of Mohri and Riley [104].

6.5.2 Unsupervised part-of-speech tagging

As described by Merialdo [98], the EM algorithm can be used to train a part-of-speech tagging system given only a corpus of word sequences and a dictionary of possible tags for each word in a given vocabulary. The task is unsupervised, as no explicit tag sequences or observed tag-word pairs are given. An HMM is formed, where the states represent an n-gram tag language model, and an emission for each word from each state represents a parameter in a word-given-tag channel model. We follow Ravi and Knight [114] and use a bigram model rather than Merialdo's trigram model, as the former gives better performance and is easier to construct. As is standard practice, the language model is initially fully connected (all n-gram transitions are possible) and the channel model is bootstrapped by the tag dictionary. We instantiate the HMM as a single unweighted wfst and its tree transducer counterpart. A schematic of a sample HMM wfst for a language of two tags and two words is shown in Figure 6.5.

6.5.2.1 Training

We compared Tiburon with Carmel on EM training of the HMM described in Section 6.5.2. (FSM and OpenFst do not provide training functionality.) The instantiated wfst/wtt has 92 states and 59,503 arcs. We trained both systems for 10 iterations of EM using a corpus of 300 sentences of 20 words or fewer as training. This is a reduction from the full Merialdo corpus of 1005 sentences of various lengths ranging from 1 to 108 words, which overwhelms Tiburon's memory, even when allowed the full 32g available on the test machine. Tiburon was slightly more than two orders of magnitude slower than Carmel.
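For concreteness, here is a minimal sketch of how the HMM wfst of Section 6.5.2 can be assembled from a tag set and a tag dictionary, following the topology of Figure 6.5 below. This is our illustration only, not Carmel's or Tiburon's code, and the function and argument names are made up; EM later fills in the p(tag|tag) and p(word|tag) weights.

def build_hmm_arcs(tags, tag_dict, START="S", END="E"):
    # Return (src, dst, input symbol, output symbol) arcs for the bigram HMM wfst
    # of Figure 6.5. tag_dict maps each word to the set of tags licensed for it.
    EPS = "*e*"
    arcs = []
    # language-model (transition) arcs: S -> t, t' -> t2, t' -> E, all eps:eps
    for t in tags:
        arcs.append((START, t, EPS, EPS))         # p(t|S)
        arcs.append((t + "'", END, EPS, EPS))     # p(E|t)
        for t2 in tags:
            arcs.append((t + "'", t2, EPS, EPS))  # p(t2|t)
    # channel-model (emission) arcs: t -> t', reading eps and writing each licensed word
    for word, allowed in tag_dict.items():
        for t in allowed:
            arcs.append((t, t + "'", EPS, word))  # p(word|t)
    return arcs

# toy instance matching Figure 6.5: two tags, two fully ambiguous words
arcs = build_hmm_arcs(["A", "B"], {"a": {"A", "B"}, "b": {"A", "B"}})
print(len(arcs))   # 12 arcs: 8 transition arcs plus 4 emission arcs

Bootstrapped with the actual tag set and tag dictionary, the same construction gives a machine on the scale of the 92-state, 59,503-arc model used in the experiment above.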
[Figure 6.5 schematic: states S, A, A', B, B', and E, with emission arcs ε:a / p(a|A), ε:b / p(b|A), ε:a / p(a|B), ε:b / p(b|B) and transition arcs ε:ε weighted p(A|S), p(B|S), p(A|A), p(A|B), p(B|A), p(B|B), p(E|A), and p(E|B).]
Figure 6.5: An example of the HMM trained in the Merialdo [98] unsupervised part-of-speech tagging task described in Section 6.5.2, instantiated as a wfst. Arcs either represent parameters from the bigram tag language model (e.g., the arc from A' to B, representing the probability of generating tag B after tag A) or from the tag-word channel model (e.g., the topmost arc from A to A', representing the probability of generating word a given tag A). The labels on the arcs make this wfst suitable for training on a corpus of (ε, word sequence) pairs to set language model and channel model probabilities such that the probability of the training corpus is maximized.

To demonstrate Carmel's ability to scale, we ran it on the whole training data using the aforementioned model, and also created a more complicated model by splitting the tag space, in the spirit of Petrov and Klein [111], effectively doubling the state space and adding many more hidden variables for EM to model. This model as instantiated has 182 states and 123,057 arcs. Tiburon was unable to train this model, even on the 300-sentence reduced corpus, but Carmel was able to train this model on the entire Merialdo corpus easily.

6.5.2.2 Determinization

In order to use the trained transducers for tagging, the arcs representing the language model are modified such that their input symbol is changed from ε to the tag corresponding with the destination state. In the example of Figure 6.5, for instance, the arc label from A' to B would be changed from ε:ε to B:ε. Backward application of the candidate string can then be performed to form a result graph, and the most likely tag sequence calculated. In the original bigram formulation the result graph is deterministic and there is exactly one path for every distinct tag sequence. However, in the state-split model discussed above, this is not the case, and it is important to determinize the result graph to ensure that the most likely path and the most likely sequence coincide. We used Carmel to build the state-split model above and to obtain a nondeterministic result graph for a 303-word sequence. The graph is highly nondeterministic—it has 2,624 states and 6,921 arcs and represents 1.8 × 10^81 distinct taggings in 4.1 × 10^157 paths. Carmel does not have a determinization function, but FSM, OpenFst, and Tiburon do, so we compared the ability of the three systems to determinize this result graph. FSM and OpenFst were quite fast at determinizing, even though the operation is potentially exponential in the input size, while Tiburon was again two orders of magnitude slower.
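The path count quoted above is a one-pass dynamic program over the acyclic result graph; the sketch below is our illustration, using a made-up arc-list representation. Counting distinct taggings rather than paths is exactly the problem determinization solves, since in the state-split model many paths can carry the same tag sequence.

from collections import defaultdict
from functools import lru_cache

def count_paths(arcs, start, finals):
    # Count start-to-final paths in an acyclic graph.
    # arcs: iterable of (src, dst) pairs; finals: set of final states.
    out = defaultdict(list)
    for src, dst in arcs:
        out[src].append(dst)

    @lru_cache(maxsize=None)
    def paths_from(state):
        # Python's arbitrary-precision integers keep counts like 4.1 x 10^157 exact.
        total = 1 if state in finals else 0
        for dst in out[state]:
            total += paths_from(dst)
        return total

    return paths_from(start)

# tiny example: a diamond-shaped graph with two accepting paths
print(count_paths([("q0", "q1"), ("q0", "q2"), ("q1", "q3"), ("q2", "q3")],
                  "q0", {"q3"}))   # 2

(An iterative version over a topological order avoids deep recursion on large graphs.)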
6.5.3 Discussion

Experiment        FSM conv.  FSM op.  OpenFst conv.  OpenFst op.  Carmel  Tiburon
reading           1.2s       .2s      2.6s           .4s          1.2s    87s
inference         2.0s       51.8s    5.4s           .2s          1.2s    326s
20-best           .09s       .01s     .4s            .01s         .02s    5.5s
20,000-best       1.6s       1.7s     1s             1.5s         .2s     10s
2,000,000-best    OOM                 49s            112s         22s     1369s
train (reduced)   N/A                 N/A                         1.2s    159s
train (full)      N/A                 N/A                         2.9s    OOM
train (split)     N/A                 N/A                         9.0s    OOM
determinize       .05s       .03s     .08s           .06s         N/A     232s
Table 6.2: Timing results for experiments using various operations across several transducer toolkits, demonstrating the relatively poor performance of Tiburon as compared with extant string transducer toolkits. The reading experiment is discussed in Section 6.5.1.1, inference in Section 6.5.1.2, the three k-best experiments in Section 6.5.1.3, the three training experiments in Section 6.5.2.1, and determinization in Section 6.5.2.2. For FSM and OpenFst, timing statistics are broken into time to convert between binary and text formats ("conv.") and time to perform the specified operation ("op."). N/A = this toolkit does not support this operation. OOM = the test computer ran out of memory before completing this experiment.

We have seen that Tiburon runs consistently around two orders of magnitude slower than competing wfst toolkits on common tasks. Additionally, Tiburon does not scale with transducer and data size as well as the other systems do. Partial reasons for this may be unoptimized implementations, bugs, and the inherent advantage of compiled code. However, Tiburon has a key liability in its more general nature. As an example, compare the representation of a wfst arc in any of the wfst toolkits with that of a wxtt rule in Tiburon. A wfst arc may be represented by four integers and a float, denoting the source and destination states, the input and output symbols, and the arc weight, respectively. More complicated capabilities such as parameter tying and locking that are available in Carmel require an additional integer and boolean, but the memory profile of an arc is quite slim. A wxtt rule, on the other hand, is represented by an integer for the input state, a float for the weight, two trees for the input and output patterns, and a map linking the variables of the two sides together. The trees and the map are instantiated as objects, and not fixed-width fields, giving a single transducer rule a considerable minimum memory footprint. Additionally, overhead for reading, storing, and manipulating these more general structures is increased due to their variable size. A useful improvement to Tiburon would be a reimplementation of the fundamental data structures such that their size is fixed. This would allow more assumptions to be made about the objects and consequently more efficient processes, particularly in time-consuming I/O.

6.6 External libraries

We are grateful to the following external sources for noncommercial use of their Java libraries in Tiburon: Martian Software, for the JSAP command line parsing libraries; Stanford University's NLP group, for its implementation of heaps; and the makers of the Gnu Trove, which provided several object container classes used in earlier versions of the software.

6.7 Conclusion

We have described Tiburon, a general weighted tree automata toolkit, and demonstrated some of its functions and their use in constructing natural language applications. Tiburon can be downloaded at http://www.isi.edu/licensed-sw/tiburon/.

Chapter 7

CTFW

7.1 Concluding thoughts

To recap, this thesis provided the following contributions:

• Algorithms for key operations of a weighted tree transducer toolkit, some of which previously only existed as proofs of concept, and some of which were novel.

• Empirical experiments, putting weighted tree transducers and automata to work to obtain machine translation and parsing improvements.

• Tiburon, a tree transducer toolkit that allows others to use these key operations and formalisms in their own systems.

The overall purpose of this thesis, some might say its "thesis", has been that weighted tree transducers and automata are useful formalisms for NLP models, and that the development of practical algorithms and tangible software implementing those algorithms makes these formal structures useful tools for actual NLP systems. As is usual, though, it is the journey to this thesis' end that is perhaps more important than the destination. While I have provided a toolkit to enable development of syntax-infused systems as cascades of formal models, and used it to build some systems, the algorithm development component of this work seems to me to be the most interesting contribution.
Heretofore, tree transducers generally existed as abstract entities. Many papers reasoned about them but little actual work was done using them, and this is chiefly because there was no real concrete way to actually use them. With Tiburon, we really can use these structures, and it is empirical work down the road that will actually test their long-term utility. But this is not a software engineering thesis, and Tiburon is not someexemplarofsoftwaredesign. No,themaincontributionofthisworkisthatbefore thisthesisbeganTiburoncouldnotbeconstructed,astherewerenoclear-cutalgorithms for nearly any desired operation. But now we not only have Tiburon, but the keys to extendit,refineit,evenrebuilditifneeded. 7.2 Futurework Thereareseveraldifferentdirectionsofusefulfutureworkthatwouldenhancetheresults ofthisthesis. Theybroadlyfallintothecategoriesofalgorithms,models,andengineering. 7.2.1 Algorithms For most real-world systems, exact inference is an acceptable trade-o ff for speed. The key operations of Tiburon could be replaced with principled approximate versions that guaranteeperformanceratesforaknownamountoferror. Approximatealgorithmsthat admitawiderclassofautomatonortransducerattheexpenseof“limitedincorrectness” 198 wouldalsobeuseful. Thefollowingcomprisesawishlistofapproximatevariantsofthe algorithmspresentedinthisthesis: • A k-best algorithm for wrtgs and wcfgs that runs linear in k but may skip some paths. • A polynomial-time determinization algorithm for wrtgs and wcfgs that returns an output automaton that does not recognize some low-weighted trees that were recognizedbytheinput. • A domain projection algorithm for wxNT that produces a wrtg recognizing a tree series with equivalent support to the true domain projection but only enforces that the weights of the trees be in the same relative order as they are in the true projection. • AcompositionalgorithmforwxLNTthatproducesawxLNTthatover-orunder- generatestheresultingweightedtreetransformation. • An on-the-fly application algorithm for a cascade, ending in a wxLNTs and a string, that has memory and speed guarantees at the expense of a known loss of theapplicationwrtg’streeseries. 7.2.2 Models Animportantnextstepinthislineofworkisathoroughexaminationofthelimitationsof thetreetransducerformalismatrepresentingsyntaxmodelswecareabout. Forexample, Collins’ parsing model [25] has a complicated back-off weighting scheme that does not 199 seemamenabletorepresentationinthetreetransducerdomain. Additionally,asshown byDeNeefeandKnight[29],real-worldsystemsbuiltwithtreetransducermodelshave to use very large rule sets to model slight variations in the trees that can be produced (suchassentenceswithorwithoutindependentclauses). Moreflexibleformalisms,such assynchronoustreeadjoininggrammars[120]mayendupbeingkey,andthusrelevant extensionsofthealgorithmspresentedinthisthesismayhavetobedesigned. 7.2.3 Engineering Tiburon has also not been extensively battle tested in ways that really bring it up to the level of a production-level toolkit. This would indeed require a software engineering thesis. Such an approach could also focus on improving the runtime of the algorithms presented here—many are quite impractical for large-scale efforts. Additionally, the benefits and limitations of general models such as tree transducers are exposed by applicationtoawidedomain. Ihavelimiteddiscussioninthisthesistonaturallanguage experimentsandscenarios,buttree-structureddataexistsinbiologicaldomainsandmay bequiteusefulinthestudyoffinancialsystems. 
Treatmentsofwidergenresofdataare sure to provide insight into what challenges toward constructing a widely-used toolkit remain,inboththeformalandengineeringdomains. 200 7.3 Finalwords IfnothingelseitismyhopethatthisthesishashelpedtoreuniteNLPpractitionersand formallanguagetheorists,suchthatthetwofieldscanattempttotalktoeachotherina commonlanguageandrecognizehowtheymaybeofmutualbenefit. 201 References [1] Athanasios Alexandrakis and Symeon Bozapalidis. Weighted grammars and Kleene’stheorem. InformationProcessingLetters,24(1):1–4,January1987. [2] Cyril Allauzen and Mehryar Mohri. Efficient algorithms for testing the twins property. JournalofAutomata,LanguagesandCombinatorics,8(2):117–144,2003. [3] Cyril Allauzen, Mehryar Mohri, and Brian Roark. A general weighted grammar library. In Implementation and Application of Automata: Ninth International Confer- ence (CIAA 2004), volume 3317 of Lecture Notes in Computer Science, pages 23–34, Kingston,Ontario,Canada,July2004.Springer-Verlag. [4] Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skui, and Mehryar Mohri. OpenFst: A general and efficient weighted finite-state transducer library. InProceedingsoftheNinthInternationalConferenceonImplementationandApplication of Automata, (CIAA 2007), volume 4783 of Lecture Notes in Computer Science, pages 11–23,Prague,2007. [5] Andr´ eArnoldandMaxDauchet. Morhpismesetbimorphismesd’arbres. Theoret- icalComputerScience,20:33–93,1982. [6] Brenda S. Baker. Composition of top-down and bottom-up tree transductions. InformationandControl,41(2):186–213,May1979. [7] AdamBerger,PeterBrown,StephenDellaPietra,VincentDellaPietra,JohnGillett, John Lafferty, Robert Mercer, Harry Printz, and Luboˇ s Ureˇ s. The Candide system formachinetranslation. InHumanLanguageTechnology,pages157–162,Plainsboro, NewJersey,March1994. [8] Jean Berstel and Christophe Reutenauer. Recognizable formal power series on trees. TheoreticalComputerScience,18(2):115–148,1982. [9] Daniel Bikel. Intricacies of Collins’ parsing model. Computational Linguistics, 30(4):479–511,2004. [10] Rens Bod. A computational model of language performance: data oriented pars- ing. InProceedingsofthefifteenthInternationalConferenceonComputationalLinguistics (COLING-92) ,volume3,pages855–859,Nantes,France,1992. [11] RensBod. AnefficientimplementationofanewDOPmodel. InProceedingsofthe 10thConferenceoftheEuropeanChapteroftheAssociationforComputationalLinguistics, pages19–26,Budapest,2003. 202 [12] Tugba Bodrumlu, Kevin Knight, and Sujith Ravi. A new objective function for word alignment. In Proceedings of the NAACL HLT Workshop on Integer Linear ProgrammingforNaturalLanguageProcessing,pages28–35,Boulder,Colorado,June 2009. [13] Bj¨ orn Borchardt. The Theory of Recognizable Tree Series. PhD thesis, Dresden Uni- versityofTechnology,2005. [14] Bj¨ orn Borchardt and Heiko Vogler. Determinization of finite state weighted tree automata. JournalofAutomata,LanguagesandCombinatorics,8(3):417–463,2003. [15] PeterBorovansky, ClaudeKirchner, H´ el` eneKirchner, andPierre-EtienneMoreau. ELANfromarewritinglogicpointofview. TheoreticalComputerScience,285(2):155– 185,August2002. [16] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. Themathematicsofstatisticalmachinetranslation: parameterestimation. ComputationalLinguistics,19(2):263–311,1993. [17] Matthias B¨ uchse, Jonathan May, and Heiko Vogler. Determinization of weighted treeautomatausingfactorizations. 
InPre-proceedingsoftheEightInternationalWork- shoponFinite-StateMethodsandNaturalLanguageProcessing ,July2009. [18] Matthias B¨ uchse, Jonathan May, and Heiko Vogler. Determinization of weighted tree automata using factorizations. Journal of Automata, Languages and Combina- torics,2010. submitted. [19] Francisco Casacuberta and Colin de la Higuera. Computational complexity of problemsonprobabilisticgrammarsandtransducers. InArlindoL.Oliveira,edi- tor,GrammaticalInference: AlgorithmsandApplications,5thInternationalColloquium, ICGI2000,pages15–24,Lisbon,Portugal,September2000. [20] Eugene Charniak. Immediate-head parsing for language models. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 116–123,Toulouse,France,2001. [21] David Chiang. A hierarchical phrase-based model for statistical machine trans- lation. In Proceedings of the 43rd Annual Meeting of the ACL, pages 263–270, Ann Arbor,Michigan,June2005. [22] Noam Chomsky. Three models for the description of language. IRE Transactions onInformationTheory,2(3):113–124,1956. [23] NoamChomsky. SyntacticStructures. Mouton,1957. [24] Alexander Clark. Memory-based learning of morphology with stochastic trans- ducers. InProceedingsofthe40thAnnualMeetingoftheAssociationforComputational Linguistics,pages513–520,Philadelphia,PA,July2002. 203 [25] Michael Collins. Three generative, lexicalised models for statistical parsing. In Proceedingsofthe35thAnnualMeetingoftheAssociationforComputationalLinguistics, pages16–23,Madrid,Spain,July1997. [26] Michael Collins and Brian Roark. Incremental parsing with the perceptron al- gorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics(ACL’04),MainVolume,pages111–118,Barcelona,Spain,July2004. [27] Hubert Comon, Max Dauchet, Remi Gilleron, Christof L¨ oding, Florent Jacque- mard,DenisLugiez,SophieTison,andMarcTommasi. Treeautomatatechniques andapplications,2007. releasedOctober12,2007. [28] Arthur Dempster, Nan Laird, and Donald Rubin. Maximum likelihood from incompletedataviatheEMalgorithm. JournaloftheRoyalStatisticalSociety, Series B,39(1):1–38,1977. [29] Steve DeNeefe and Kevin Knight. Synchronous tree adjoining machine transla- tion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing,pages727–736,Singapore,August2009. [30] Steve DeNeefe, Kevin Knight, and Hayward Chan. Interactively exploring a ma- chinetranslationmodel. InProceedingsoftheACLInteractivePosterandDemonstra- tionSessions,pages97–100,AnnArbor,Michigan,2005. [31] Steve DeNeefe, Kevin Knight, Wei Wang, and Daniel Marcu. What can syntax- basedMTlearnfromphrase-basedMT? In Proceedingsofthe2007JointConferenceon EmpiricalMethodsinNaturalLanguageProcessingandComputationalNaturalLanguage Learning(EMNLP-CoNLL) ,pages755–763,Prague,CzechRepublic,June2007. [32] EdsgerW.Dijkstra. Anoteontwoproblemsinconnexionwithgraphs. Numerische Mathematik,1:269–271,1959. [33] JohnDoner. Treeacceptorsandsomeoftheirapplications. JournalofComputerand SystemSciences,4(5):406–451,October1970. [34] Frank Drewes and Renate Klempien-Hinrichs. Treebag. In Sheng Yu and Andrei Paun,editors,Proc.5thIntl.ConferenceonImplementationandApplicationofAutomata (CIAA 2000), volume 2088 of Lecture Notes in Computer Science, pages 329–330, London,Ontario,Canada,2001. [35] ManfredDrosteandWernerKuich. Semiringsandformalpowerseries. InHand- bookofWeightedAutomata,chapter1,pages3–28.Springer-Verlag,2009. [36] JayEarley. Anefficientcontext-freeparsingalgorithm. 
CommunicationsoftheACM, 6(8):451–455,1970. [37] Abdessamad Echihabi and Daniel Marcu. A noisy-channel approach to question answering. In Proceedings ofthe 41st AnnualMeeting ofthe Association forComputa- tionalLinguistics,pages16–23,Sapporo,Japan,July2003. 204 [38] Jason Eisner. Parameter estimation for probabilistic finite-state transducers. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages1–8,Philadelphia,Pennsylvania,USA,July2002. [39] Joost Engelfriet. Bottom-up and top-down tree transformations – a comparison. MathematicalSystemsTheory,9(2):198–231,1975. [40] David Eppstein. Finding the k shortest paths. SIAM Journal on Computing, 28(2):652–673,1998. [41] Zolt´ an ´ Esik and Werner Kuich. Formal tree series. Journal of Automata, Languages andCombinatorics,8(2):219–285,2003. [42] AlexanderFraserandDanielMarcu. Semi-supervisedtrainingforstatisticalword alignment. In Proceedings of the 21st International Conference on Computational Lin- guisticsand44thAnnualMeetingoftheAssociationforComputationalLinguistics,pages 769–776,Sydney,Australia,July2006. [43] Zolt´ an F¨ ul¨ op, Andreas Maletti, and Heiko Vogler. Backward and forward appli- cationofweightedextendedtreetransducers. Unpublishedmanuscript,2010. [44] Zolt´ an F¨ ul¨ op and H. Vogler. Weighted tree transducers. J. Autom. Lang. Comb., 9(1),2004. [45] Zolt´ anF¨ ul¨ opandHeiko Vogler. Weightedtree automataandtreetransducers. In HandbookofWeightedAutomata,chapter9,pages313–404.Springer-Verlag,2009. [46] Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. Scalable inference and training of context-rich syntacticmodels. InProceedingsofthe21stInternationalConferenceonComputational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages961–968,Sydney,Australia,July2006. [47] Michel Galley, Mark Hopkins, Kevin Knight, and Daniel Marcu. What’s in a translation rule? In Daniel Marcu Susan Dumais and Salim Roukos, editors, HLT-NAACL2004: MainProceedings ,pages273–280,Boston,May2004. [48] Ferenc G´ ecseg and Magnus Steinby. Tree Automata. Akad´ emiai Kiad´ o, Budapest, 1984. [49] Ferenc G´ ecseg and Magnus Steinby. Tree languages. In Grzegorz Rozenberg and Arto Salomaa, editors, Beyond Words, volume 3 of Handbook of Formal Languages, chapter1,pages1–68.Springer-Verlag,Berlin,1997. [50] ThomasGenet,Val´ erieViet,andTriemTong. Reachabilityanalysisoftermrewrit- ing systems with timbuk. In Robert Nieuwenhuis and Andrei Voronkov, editors, LPARProceedings,volume2250ofLectureNotesinComputerScience,pages695–706, Havana,Cuba,December2001.Springer-Verlag. 205 [51] Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. Fastdecodingandoptimaldecodingformachinetranslation. InProceedingsofthe 39th Annual Meeting of the Association for Computational Linguistics, pages 228–235, Toulouse,France,July2001. [52] JonathanS.Golan. SemiringsandTheirApplications. KluwerAcademic,Dordrecht, TheNetherlands,1999. [53] Jonathan Graehl. Carmel finite-state toolkit. http://www.isi.edu/licensed- sw/carmel,1997. [54] JonathanGraehl. Context-freealgorithms. Unpublishedhandout,July2005. [55] Jonathan Graehl and Kevin Knight. Training tree transducers. In HLT-NAACL 2004: MainProceedings,pages105–112,Boston,Massachusetts,USA,May2004. [56] Jonathan Graehl, Kevin Knight, and Jonathan May. Training tree transducers. ComputationalLinguistics,34(3):391–427,September2008. [57] UdoHebischandHansJoachimWeinert. 
Semirings—AlgebraicTheoryandApplica- tionsinComputerScience. WorldScientific,Singapore,1998. [58] JesperG.Henriksen,OleJ.L.Jensen,MichaelE.Jørgensen,NilsKlarlund,Robert Paige, Theis Rauhe, and Anders B. Sandholm. MONA: Monadic second-order logic in practice. In Uffe H. Engberg, Kim G. Larsen, and Arne Skou, editors, ProceedingsoftheWorkshoponToolsandAlgorithmsforTheConstructionandAnalysis ofSystems,TACAS(Aarhus,Denmark,19–20May,1995),numberNS-95-2inNotes Series,pages58–73,DepartmentofComputerScience,UniversityofAarhus,May 1995.BRICS. [59] Liang Huang and David Chiang. Better k-best parsing. In Proceedings of the NinthInternationalWorkshoponParsingTechnology,pages53–64,Vancouver,British Columbia,Canada,October2005. [60] Aravind K. Joshi and Phil Hopely. A parser from antiquity. Natural Language Engineering,2(4):291–294,1996. [61] Ed Kaiser and Johan Schalkwyk. Building a robust, skipping parser within the AT&T FSM toolkit. Technical report, Center for Human Computer Communica- tion,OregonGraduateInstituteofScienceandTechnology,2001. [62] Stephan Kanthak and Hermaan Ney. FSA: An efficient and flexible C++ toolkit forfinitestateautomatausingon-demandcomputation. In Proceedingsofthe42nd MeetingoftheAssociationforComputationalLinguistics(ACL’04),MainVolume,pages 510–517,Barcelona,July2004. [63] Ronald Kaplan and Martin Kay. Regular models of phonological rule systems. ComputationalLinguistics,20(3):331–378,1994. 206 [64] RonaldM.KaplanandMartinKay. Phonologicalrulesandfinite-statetransducers. In Linguistic Society of America Meeting Handbook, Fifty-Sixth Annual Meeting , 1981. Abstract. [65] Lauri Karttunen. The replace operator. In Proceedings of the 33rd Annual Meet- ing of the Association for Computational Linguistics, pages 16–23, Cambridge, Mas- sachusetts,USA,June1995. [66] LauriKarttunen. Directedreplacement. InProceedingsofthe34thAnnualMeetingof theAssociationforComputationalLinguistics,pages108–115,SantaCruz,California, USA,June1996. [67] Lauri Karttunen and Kenneth R. Beesley. Two-level rule compiler. Technical ReportISTL-92-2,XeroxPaloAltoResearchCenter,PaloAlto,CA,1992. [68] LauriKarttunenandKennethR.Beesley. Ashorthistoryoftwo-levelmorphology. Presented at the ESSLLI-2001 Special Event titled ”Twenty Years of Finite-State Morphology”,August2001. Helsinki,Finland. [69] Lauri Karttunen, Jean-Pierre Chanod, Gregory Grefenstette, and Anne Schiller. Regular expressions for language engineering. Natural Language Engineering, 2(4):305–328,1997. [70] LauriKarttunen,Tam´ asGa´ al,andAndr´ eKempe. Xeroxfinite-statetool. Technical report,XeroxResearchCentreEurope,1997. [71] Lauri Karttunen, Ronald M. Kaplan, and Annie Zaenen. Two-level morphology with composition. In Proceedings of the fifteenth International Conference on Compu- tationalLinguistics(COLING-92) ,volume3,pages141–148,Nantes,France,1992. [72] Daniel Kirsten and Ina M¨ aurer. On the determinization of weighted automata. JournalofAutomata,LanguagesandCombinatorics,10(2/3):287–312,2005. [73] Kevin Knight. Capturing practical natural language transformations. Machine Translation,21(2):121–133,June2007. [74] Kevin Knight and Yaser Al-Onaizan. Translation with finite-state devices. 
In David Farwell, Laurie Gerber, and Eduard Hovy, editors, Machine Translation and theInformationSoup: ThirdConferenceoftheAssociationforMachineTranslationinthe Americas AMTA ’98 Langhorne, PA, USA, October 28-31, 1998 Proceedings , volume 1529 of Lecture Notes in Artificial Intelligence, pages 421–437, Langhorne, Pennsyl- vania,USA,October1998.Springer-Verlag. [75] Kevin Knight and Jonathan Graehl. Machine transliteration. Computational Lin- guistics,24(4):599–612,1998. [76] KevinKnightandJonathanGraehl. Anoverviewofprobabilistictreetransducers for natural language processing. In Computational Linguistics and Intelligent Text Processing: 6thInternationalConference,CICLing2005,volume3406ofLectureNotes inComputerScience,pages1–24,MexicoCity,2005.SpringerVerlag. 207 [77] Kevin Knight and Daniel Marcu. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91– 107,2002. [78] Philipp Koehn and Kevin Knight. Feature-rich statistical translation of noun phrases. In Proceedings of the 41st Annual Meeting of the Association for Computa- tionalLinguistics,pages311–318,Sapporo,Japan,July2003. [79] Okan Kolak, William Byrne, and Philip Resnik. A generative probabilistic OCR modelforNLPapplications. InProceedingsofthe2003HumanLanguageTechnology ConferenceoftheNorthAmericanChapteroftheAssociationforComputationalLinguis- tics,pages55–62,Edmonton,Canada,May-June2003. [80] Kimmo Koskenniemi. Two-level morphology: A general computational model forword-formrecognitionandproduction. Publication11,UniversityofHelsinki, DepartmentofGeneralLinguistics,Helsinki,1983. [81] WernerKuich. Formalpowerseriesovertrees. InSymeonBozapalidis,editor,Pro- ceedingsofthe3rdInternationalConferenceonDevelopmentsinLanguageTheory(DLT), pages61–101,Thessaloniki,Greece,1998.AristotleUniversityofThessaloniki. [82] WernerKuich. Treetransducersandformaltreeseries. ActaCybernet.,14:135–149, 1999. [83] Shankar Kumar and William Byrne. A weighted finite state transducer imple- mentation of the alignment template model for statistical machine translation. In Proceedings of the 2003 Human Language Technology Conference of the North Ameri- canChapteroftheAssociationforComputationalLinguistics,pages63–70,Edmonton, Canada,May-June2003. [84] Irene Langkilde and Kevin Knight. The practical value of n-grams in generation. In Proceedings of the Ninth International Workshop on Natural Language Generation, pages248–255,Niagara-on-the-Lake,Ontario,Canada,August1998. [85] Karim Lari and Steve Young. The estimation of stochastic context-free grammars usingtheinside-outsidealgorithm. ComputerSpeechandLanguage,4:35–56,1990. [86] Andrej Ljolje and Michael D. Riley. Optimal speech recognition using phone recognitionandlexicalaccess. InProceedingsofICSLP-92 ,pages313–316,1992. [87] M.MagidorandG.Moran. Probabilistictreeautomata. IsraelJournalofMathemat- ics,8:340–348,1969. [88] AndreasMaletti. ThepoweroftreeseriestransducersoftypeIandII. InCleliaDe FeliceandAntonioRestivo,editors,Proceedingsofthe9thInternationalConferenceon DevelopmentsinLanguageTheory(DLT),Palermo,Italy,volume3572ofLectureNotes inComputerScience,pages338–349,Berlin,2005. [89] AndreasMaletti. Compositionsoftreeseriestransformations. TheoreticalComputer Science,366:248–271,2006. 208 [90] Andreas Maletti. Compositions of extended top-down tree transducers. Informa- tionandComputation,206(9–10):1187–1196,2008. [91] AndreasMaletti,2009. PersonalCommunication. 
[92] Andreas Maletti, Jonathan Graehl, Mark Hopkins, and Kevin Knight. The power ofextendedtop-downtreetransducers. SIAMJournalonComputing,39(2):410–430, 2009. [93] Daniel Marcu and William Wong. A phrase-based, joint probability model for statistical machine translation. In Proceedings of the 2002 Conference on Empirical MethodsinNaturalLanguageProcessing,pages133–139,Philadelphia,July2002. [94] Lambert Mathias and William Byrne. Statistical phrase-based speech transla- tion. In IEEE Conference on Acoustics, Speech and Signal Processing, pages 561–564, Toulouse,France,2006. [95] Jonathan May and Kevin Knight. A better n-best list: Practical determinization of weighted finite tree automata. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 351–358, New York City, June 2006. [96] Jonathan May and Kevin Knight. Tiburon: A weighted tree automata toolkit. In Oscar H. Ibarra and Hsu-Chun Yen, editors, Proceedings of the 11th International Conference of Implementation and Application of Automata, CIAA 2006, volume 4094 of Lecture Notes in Computer Science, pages 102–113, Taipei, Taiwan, August 2006. Springer. [97] Jonathan May and Kevin Knight. Syntactic re-alignment models for machine translation. In Jason Eisner and Taku Kudo, editors, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 360–368, Prague, Czech Republic, June 2007. AssociationforComputationalLinguistics. [98] BernardMerialdo. Taggingenglishtextwithaprobabilisticmodel. Computational Linguistics,20(2):155–161,1994. [99] MehryarMohri. Finite-statetransducersinlanguageandspeechprocessing. Com- putationalLinguistics,23(2):269–312,June1997. [100] Mehryar Mohri. Generic –removal and input –normalization algorithms for weighted transducers. International Journal of Foundations of Computer Science, 13(1):129–143,2002. [101] MehryarMohri. Weightedautomataalgorithms. InHandbookofWeightedAutomata, chapter6,pages213–254.Springer-Verlag,2009. [102] MehryarMohri, Fernando C.N.Pereira, andMichael Riley. Arationaldesignfor a weighted finite-state transducer library. In Proceedings of the 7th Annual AT&T SoftwareSymposium,September1997. 209 [103] MehryarMohri,FernandoC.N.Pereira,andMichaelRiley. Thedesignprinciples ofaweightedfinite-statetransducerlibrary. TheoreticalComputerScience,231:17–32, January2000. [104] Mehryar Mohri and Michael Riley. An efficient algorithm for the n-best strings problem. In John H. L. Hansen and Bryan Pellom, editors, 7th International Con- ference on Spoken Language Processing (ICSLP2002 - INTERSPEECH 2002) , pages 1313–1316,Denver,Colorado,USA,September2002. [105] FranzOch. Minimumerrorratetraininginstatisticalmachinetranslation. InPro- ceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages160–167,Sapporo,Japan,July2003.AssociationforComputationalLinguis- tics. [106] FranzOchandHermannNey. Improvedstatisticalalignmentmodels. InProceed- ingsofthe38thAnnualMeetingoftheAssociationforComputationalLinguistics,pages 440–447,HongKong,October2000. [107] Franz Och and Hermann Ney. The alignment template approach to statistical machinetranslation. ComputationalLinguistics,30(4):417–449,2004. [108] AdamPaulsandDanKlein. K-bestA*parsing. In ProceedingsoftheJointConference ofthe47thAnnualMeetingoftheACLandthe4thInternationalJointConferenceonNat- ural Language Processing of the AFNLP, pages 958–966, Suntec, Singapore, August 2009. 
[109] Fernando Pereira and Michael Riley. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite- State Language Processing, chapter 15, pages 431–453. MIT Press, Cambridge, MA, 1997. [110] FernandoPereira,MichaelRiley,andRichardSproat. Weightedrationaltransduc- tions and their application to human language processing. In Human Language Technology, pages 262–267, Plainsboro, NJ, March 1994. Morgan Kaufmann Pub- lishers,Inc. [111] SlavPetrovandDanKlein. LearningandinferenceforhierarchicallysplitPCFGs. InAAAI2007(NectarTrack),2007. [112] MichaelRabinandDanaScott. Finiteautomataandtheirdecisionproperties. IBM JournalofResearchandDevelopment,3(2):114–125,April1959. [113] AlexanderRadzievsky. Correctivemodelingforparsingwithsemanticrolelabels. Master’sthesis,Universit´ edeGen` eve,February2008. [114] SujithRaviandKevinKnight. Minimizedmodelsforunsupervisedpart-of-speech tagging. InProceedingsoftheJointConferenceofthe47thAnnualMeetingoftheACLand the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 504–512, Suntec, Singapore, August 2009. Association for Computational Linguistics. 210 [115] GiuseppeRiccardi, RobertoPieraccini, andEnricoBocchieri. Stochasticautomata forlanguagemodeling. ComputerSpeech&Language,10(4):265–293,1996. [116] William C. Rounds. Mappings and grammars on trees. Mathematical Systems Theory,4(3):257–287,1970. [117] ArtoSalomaaandMattiSoittola. Automata-TheoreticAspectsofFormalPowerSeries . Springer-Verlag,1978. [118] ClaudeShannon. Amathematicaltheoryofcommunication. BellSystemTechnical Journal,27:379–423,623–656,1948. [119] Stuart M. Shieber. Synchronous grammars as tree transducers. In Proceedings of theSeventhInternationalWorkshoponTreeAdjoiningGrammarandRelatedFormalisms (TAG+7),pages88–95,Vancouver,BritishColumbia,Canada,May2004. [120] Stuart M. Shieber and Yves Schabes. Synchronous tree-adjoining grammars. In Hans Karlgren, editor, Papers presented to the 13th International Conference on Com- putationalLinguistics(COLING),volume3,pages253–258,Helsinki,1990. [121] Khalil Sima’an. Computational complexity of probabilistic disambiguation by meansoftree-grammars. In COLING1996Volume2: The16thInternationalConfer- enceonComputationalLinguistics,pages1175–1180,1996. [122] Achim Siztus and Stefan Ortmanns. High quality word graphs using forward- backward pruning. In Proceedings of the IEEE Conference on Acoustic, Speech and SignalProcessing,pages593–596,Phoenix,Arizona,1999. [123] Richard Sproat, William Gales, Chilin Shih, and Nancy Chang. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377–404,September1996. [124] Andreas Stolcke. An efficient probabilistic context-free parsing algorithm that computesprefixprobabilities. ComputationalLinguistics,21(2):165–201,June1995. [125] JosephTepperman. Hierarchicalmethodsinautomaticpronunciationevaluation. PhD thesis,UniversityofSouthernCalifornia,August2009. [126] Joseph Tepperman and Shrikanth Narayanan. Tree grammars as models of prosodicstructure. InProceedingsofInterSpeechICSLP,pages2286–2289,Brisbane, Australia,September2008. [127] JamesW.Thatcher. Generalized 2 sequentialmachines. JournalofComputerSystem Science,pages339–367,1970. [128] James W. Thatcher. Tree automata: An informal survey. In A. V. Aho, editor, CurrentsintheTheoryofComputing,pages143–172.Prentice-Hall,EnglewoodCliffs, NJ,1973. [129] GertjanvanNoord. Treatmentofepsilonmovesinsubsetconstruction. 
Computa- tionalLinguistics,26(1):61–76,2000. 211 [130] GertjanvanNoordandDaleGerdemann. Anextendibleregularexpressioncom- piler for finite-state approaches in natural language processing. In Automata Im- plementation,4thInternationalWorkshoponImplementingAutomata,WIA’99,volume 2214ofLectureNotesinComputerScience,pages122–139,Potsdam,Germany,1999. Springer-Verlag. [131] StephanVogel,HermannNey,andChristophTillmann. HMM-basedwordalign- ment in statistical translation. In COLING96: Proceedings of the 16th International ConferenceonComputationalLinguistics,pages836–841,Copenhagen,August1996. [132] Wei Wang, Jonathan May, Kevin Knight, and Daniel Marcu. Re-structuring, re- labeling,andre-aligningforsyntax-basedmachinetranslation. ComputationalLin- guistics,36(2),June2010. Toappear. [133] William A. Woods. Cascaded ATN grammars. American Journal of Computational Linguistics,6(1):1–12,January-March1980. [134] Dekai Wu. A polynomial-time algorithm for statistical machine translation. In Proceedingsofthe34thAnnualMeetingoftheAssociationforComputationalLinguistics, pages152–158,SantaCruz,California,USA,June1996.AssociationforComputa- tionalLinguistics. [135] Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallelcorpora. ComputationalLinguistics,23(3):377–404,1997. [136] Kenji Yamada and Kevin Knight. A syntax-based statistical translation model. In Proceedings of 39th Annual Meeting of the Association for Computational Linguistics, pages523–530,Toulouse,France,July2001. [137] David Zajic, Bonnie Dorr, and Richard Schwartz. Automatic headline generation fornewspaperstories. InProceedingsoftheACL-02WorkshoponTextSummarization (DUC 2002), pages 78–85, Philadelphia, PA, July 2002. Association for Computa- tionalLinguistics. [138] Hao Zhang, Liang Huang, Daniel Gildea, and Kevin Knight. Synchronous bina- rization for machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 256–263, New York City, June 2006. [139] Andreas Zollmann and Khalil Sima’an. A consistent and efficient estimator for data-oriented parsing. Journal of Automata, Languages and Combinatorics, 10(2/3):367–388,2005. 212
Abstract
Weighted finite-state string transducer cascades are a powerful formalism for models of solutions to many natural language processing problems such as speech recognition, transliteration, and translation. Researchers often directly employ these formalisms to build their systems by using toolkits that provide fundamental algorithms for transducer cascade manipulation, combination, and inference. However, extant transducer toolkits are poorly suited to current research in NLP that makes use of syntax-rich models. More advanced toolkits, particularly those that allow the manipulation, combination, and inference of weighted extended top-down tree transducers, do not exist. In large part, this is because the analogous algorithms needed to perform these operations have not been defined. This thesis solves both these problems, by describing and developing algorithms, by producing an implementation of a functional weighted tree transducer toolkit that uses these algorithms, and by demonstrating the performance and utility of these algorithms in multiple empirical experiments on machine translation data.