LEARNING PARAPHRASES FROM TEXT

by

Rahul Bhagat

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2009

Copyright 2009 Rahul Bhagat
Dedication

To My Parents...
Acknowledgements

My dissertation work has benefitted greatly from the help, support, and advice of my colleagues, friends, and family. I am indebted to my advisor, Ed Hovy, for his valuable guidance and support throughout my time at the University of Southern California. Ed convinced me to get a PhD, back when I had no intentions of getting one. Throughout my research, he helped me maintain my focus and provided honest advice and criticism.

Many thanks go also to the other members of my committee, especially Patrick Pantel. Patrick taught me the basics of writing good papers and has consistently guided me in my research. The other members of my committee—Jerry Hobbs, Kevin Knight, Dennis McLeod, and Daniel O'Leary—have also provided valuable feedback. I am also grateful to Deepak Ravichandran. Deepak taught me the value of out-of-the-box thinking and has provided constant guidance and help in my research.

I was fortunate to have a wonderful officemate in Donghui Feng, who was always willing to discuss research ideas with me. I have also benefitted a lot from discussions with Dekang Lin, Marius Pasca, and Ellen Riloff. All these discussions have helped me improve various aspects of this dissertation, for which I am thankful.

I want to thank my colleagues William Chang, Dirk Hovy, Jon May, Sujith Ravi, and Jason Riesa for helping me in doing important (and boring) annotations. I have also enjoyed interacting with my other colleagues at USC and elsewhere, including but not limited to Jafar Adibi, Jose Luis Ambite, Erika Barragan-Nunez, Gully Burns, Congxing Cai, David Chiang, Yao-Yi Chiang, Tim Chklovski, Bonaventura Coppola, Hal Daume III, Steve Deneefe, Teresa Dey, Mike Fleischman, Victoria Fossum, Alex Fraser, Sudeep Gandhe, Paul Groth, Tommy Ingulfsen, Ulf Hermjakob, Gunjan Kakani, Soo-min Kim, Zori Kozareva, Kathy Kurinsky, Namhee Kwon, Brent Lance, Jina Lee, Sean Lee, Kristina Lerman, Chin-Yew Lin, Shou-de Lin, Stacy Marsella, Chirag Merchant, Dragos Munteanu, Tom Murray, Anish Nair, Oana Nicolov, Doug Oard, Feng Pan, Siddharth Patwardhan, Marco Pennacchiotti, Fernando Pereira, Andrew Philpot, David Pynadath, Delip Rao, Nishit Rathod, Marta Recasens-Potau, Joe Resinger, Tom Russ, Mei Sei, Mark Shirley, Partha Talukdar, David Traum, Benjamin Van Durme, Ashish Vaswani, Jens-Soenke Voeckler, and Vishnu Vyas. I also want to thank the "Venice beach happy hour crowd" for the fun times.

I am blessed to have been surrounded by some great friends from my early days as a student in the US. I want to thank Amar Athavale, Arushi Bhargava, Vineet Bhargava, Siddharth Bhavsar, Nirav Desai, Nitin Dhavale, Deepa Jain, Amit Joshi, Neha Kansal, Kiran Meduri, Mingle Mehta, Prasanth Nittala, Aditya Pandharpurkar, Jigish Patel, Parikshit Pol, Sagar Shah, Rahul Srivastava, and Jigesh Vora.

I am eternally indebted to my parents and my late grandparents for their unconditional affection, understanding, and support in all my endeavors. Their backing has been a source of great strength for me. I want to thank my brother, Sarang, whose affection, friendship, and advice I have always valued greatly. I want to thank Neha for her love, understanding, and for always believing in me.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Goal and Approach
  1.3 Major Contributions
  1.4 Thesis Outline

Chapter 2: Related Work
  2.1 Introduction
  2.2 Linguistic Theories of Paraphrases
  2.3 Learning Paraphrases
    2.3.1 Learning Paraphrases using Multiple Translations
    2.3.2 Learning Paraphrases using Parallel Bilingual Corpora
    2.3.3 Learning Paraphrases using Comparable Corpora
    2.3.4 Learning Paraphrases using Monolingual Corpora
  2.4 Learning Selectional Preferences and Directionality
    2.4.1 Learning Selectional Preferences
    2.4.2 Learning Directionality
  2.5 Applications
    2.5.1 Paraphrases for Learning Extraction Patterns
    2.5.2 Paraphrases for Learning Patterns for Open-Domain Relation Extraction
    2.5.3 Paraphrases for Learning Patterns for Domain-Specific Information Extraction

Chapter 3: Paraphrases
  3.1 Introduction
  3.2 Paraphrasing Phenomenon Explained
    3.2.1 Lexical Perspective
    3.2.2 Structural Perspective
  3.3 Analysis of Paraphrases
    3.3.1 Distribution of Lexical Changes
    3.3.2 Human Judgement of Lexical Changes
    3.3.3 Sentence-level Distribution of Structural Changes
    3.3.4 Results
  3.4 Conclusion

Chapter 4: Inferential Selectional Preferences
  4.1 Introduction
  4.2 Selectional Preference Models
    4.2.1 Relational Selectional Preferences
      4.2.1.1 Joint Relational Model (JRM)
      4.2.1.2 Independent Relational Model (IRM)
    4.2.2 Inferential Selectional Preferences
      4.2.2.1 Joint Inferential Model (JIM)
      4.2.2.2 Independent Inferential Model (IIM)
    4.2.3 Filtering Inferences
  4.3 Experimental Methodology
    4.3.1 Quasi-paraphrase Rules
    4.3.2 Semantic Classes
    4.3.3 Evaluation Criteria
  4.4 Experimental Results
    4.4.1 Experimental Setup
      4.4.1.1 Model Implementation
      4.4.1.2 Gold Standard Construction
      4.4.1.3 Baselines
    4.4.2 Filtering Quality
      4.4.2.1 Performance and Error Analysis
  4.5 Conclusion

Chapter 5: Learning Directionality
  5.1 Introduction
  5.2 Learning Directionality of Quasi-paraphrase Rules
    5.2.1 Underlying Assumption
    5.2.2 Selectional Preferences
      5.2.2.1 Joint Relational Model (JRM)
      5.2.2.2 Independent Relational Model (IRM)
    5.2.3 Plausibility and Directionality Model
  5.3 Experimental Setup
    5.3.1 Quasi-paraphrase Rules
    5.3.2 Semantic Classes
    5.3.3 Implementation
    5.3.4 Gold Standard Construction
    5.3.5 Baselines
  5.4 Experimental Results
    5.4.1 Evaluation Criterion
    5.4.2 Result Summary
    5.4.3 Performance and Error Analysis
  5.5 Conclusion

Chapter 6: Learning Semantic Classes
  6.1 Introduction
  6.2 Incorporating Constraints
  6.3 Word Similarity and Algorithm
    6.3.1 Word Similarity
    6.3.2 Algorithm
  6.4 Experiments
    6.4.1 Evaluation Criterion
    6.4.2 Data and Methodology
  6.5 Results and Discussion
    6.5.1 Results
    6.5.2 Discussion
  6.6 Conclusion

Chapter 7: Paraphrases for Learning Surface Patterns
  7.1 Introduction
  7.2 Acquiring Paraphrases
    7.2.1 Distributional Similarity
    7.2.2 Paraphrase Generation Model
    7.2.3 Locality Sensitive Hashing
  7.3 Learning Surface Patterns
    7.3.1 Surface Patterns Model
  7.4 Experimental Methodology
    7.4.1 Paraphrases
    7.4.2 Surface Patterns
    7.4.3 Relation Extraction
  7.5 Experimental Results
    7.5.1 Baselines
    7.5.2 Evaluation Criteria
      7.5.2.1 Paraphrases
      7.5.2.2 Surface Patterns
      7.5.2.3 Relation Extraction
    7.5.3 Gold Standard
      7.5.3.1 Paraphrases
      7.5.3.2 Surface Patterns
      7.5.3.3 Relation Extraction
    7.5.4 Result Summary
    7.5.5 Discussion and Error Analysis
  7.6 Conclusion

Chapter 8: Paraphrases for Domain-Specific Information Extraction
  8.1 Introduction
  8.2 Learning Broad-Coverage Paraphrase Patterns
    8.2.1 Learning Surface-Level Paraphrase Patterns
    8.2.2 Learning Lexico-Syntactic Paraphrase Patterns by Conversion
    8.2.3 Learning Lexico-Syntactic Paraphrase Patterns Directly
  8.3 Experimental Methodology
    8.3.1 Paraphrase Patterns
    8.3.2 Domains
    8.3.3 Evaluation
  8.4 Result 1 and Discussion
    8.4.1 Comparison of Broad-Coverage Paraphrase Patterns
    8.4.2 Discussion and Error Analysis
  8.5 Result 2 and Discussion
    8.5.1 Comparison of Broad-Coverage and Domain-Specific Patterns
    8.5.2 Discussion and Error Analysis
  8.6 Conclusion

Chapter 9: Conclusion and Future Work
  9.1 Contributions
  9.2 Future Work
    9.2.1 Inferential Selectional Preferences using Tailored Classes
    9.2.2 Knowledge Acquisition
    9.2.3 Paraphrases for Machine Translation
  9.3 Conclusion

Bibliography

Appendix: Example Paraphrases
  1 Introduction
  2 List of Paraphrases
List of Tables

3.1 Distribution of lexical changes in MTC paraphrase set
3.2 Distribution of lexical changes in MSR paraphrase set
3.3 Human judgement of lexical changes
3.4 Distribution of structural changes in MTC paraphrase set
3.5 Distribution of structural changes in MSR paraphrase set
4.1 Confusion matrix
4.2 Filtering quality of best performing systems according to the evaluation criteria defined in Section 4.3.3 on the TEST set
4.3 Confusion matrix for ISP.IIM.∨—best accuracy
4.4 Confusion matrix for ISP.JIM—best 90%-Specificity
5.1 Summary of results on the test set
6.1 Classes from S2500 set
6.2 Example clusters with the corresponding largest intersecting VerbNet classes
7.1 Quality of paraphrases
7.2 Example paraphrases
7.3 Quality of extraction patterns
7.4 Example extraction patterns
7.5 Quality of instances
7.6 Example instances
8.1 List of pattern templates

List of Figures

4.1 ROC curves for our systems on TEST
4.2 ISP.IIM.∨ (best system's) performance variation over different values of the τ threshold
5.1 Confusion matrix for the best performing system, IRM using CBC with α = 0.15 and β = 3
5.2 Accuracy variation for IRM with different values of α and β
5.3 Accuracy variation in predicting correct versus incorrect quasi-paraphrase rules for different values of α
5.4 Accuracy variation in predicting directionality of correct quasi-paraphrase rules for different values of β
6.1 Example classes
6.2 Example constraints
6.3 HMRF model
6.4 EM algorithm
6.5 Learning curve for S2500 set
6.6 Learning curve for S250 set
8.1 Paraphrase patterns in terrorism domain
8.2 Paraphrase patterns in disease-outbreaks domain
8.3 Paraphrase patterns in corporate-acquisitions domain
8.4 Paraphrase-based vs traditional IE systems in terrorism domain
8.5 Paraphrase-based vs traditional IE systems in disease-outbreaks domain
8.6 Paraphrase-based vs traditional IE systems in corporate-acquisitions domain
Abstract

Paraphrases are textual expressions that convey the same meaning using different surface forms. Capturing the variability of language, they play an important role in many natural language applications including question answering, machine translation, and multi-document summarization. In linguistics, paraphrases are characterized by approximate conceptual equivalence. Since no automated semantic interpretation systems available today can identify conceptual equivalence, paraphrases are difficult to acquire without human effort. The aim of this thesis is to develop methods for automatically acquiring and filtering phrase-level paraphrases using a monolingual corpus.

Noting that the real world uses far more quasi-paraphrases than logically equivalent ones, we first present a general typology of quasi-paraphrases together with their relative frequencies, to our knowledge the first such typology. We then present a method for automatically learning the contexts in which quasi-paraphrases obtained from a corpus are mutually replaceable. For this purpose, we use Relational Selectional Preferences (RSPs), which specify the selectional preferences of the syntactic arguments of phrases (usually verbs or verb phrases). From the RSPs of individual phrases, we learn Inferential Selectional Preferences (ISPs), which specify the selectional preferences of a pair of quasi-paraphrases. We then apply the learned ISPs to the task of filtering incorrect inferences. We achieve an accuracy of 59% for this task, a statistically significant improvement over several baselines.

Knowing that quasi-paraphrases are often inexact because they contain semantic implications that can be directional, we present an algorithm called LEDIR that learns the directionality of quasi-paraphrases using the (syntactic-argument-based) RSPs of phrases. Learning directionality allows us to differentiate the strong (bidirectional) from the weak (unidirectional) paraphrases. We show that the directionality of quasi-paraphrases can be learned with 48% accuracy, again a significant improvement over several baselines.

In learning the context and directionality of quasi-paraphrases, we have encountered the need for semantic concepts: both RSPs and ISPs are defined in terms of semantic concepts. For learning these semantic concepts from text, we use a semi-supervised clustering algorithm, HMRF-KMeans. We show that this algorithm performs much better than the commonly used unsupervised clustering approach. Applying the semi-supervised clustering algorithm to the task of discovering verb classes, we obtain precision scores of 54% and 37% and corresponding recall scores of 53% and 38% for our two test sets. These are large improvements over the baseline.

We next investigate the task of learning surface paraphrases, i.e., paraphrases that do not require the use of a syntactic interpretation. Since one would need a very large corpus to find enough surface variations, we start with a very large but unprocessed corpus of 150GB (25 billion words) obtained from Google News. We rely only on distributional similarity to learn paraphrases from this corpus. To scale paraphrase acquisition to this large corpus, we apply only simple POS tagging and randomized algorithms. We build a paraphrase resource containing more than 2.5 million phrases, in which 71% of the quasi-paraphrases are correct.

Having learned the surface paraphrases, we investigate their utility for the task of relation extraction. We show that these paraphrases can be used to learn surface patterns for relation extraction. The extraction patterns obtained by using the paraphrases are not only more precise (more than 80% precision for both our test relations), but also have higher relative recall compared to a state-of-the-art baseline. This method also delivers more extraction patterns than the baseline. Applying the learned extraction patterns to the task of extracting relation instances from a test corpus, our system takes a hit in relative recall as compared to the baseline, but results in a much higher precision (more than 85% precision for both our test relations).

Finally, we use paraphrases to learn patterns for domain-specific information extraction (IE). Since the paraphrases are learned from a large broad-coverage corpus, our patterns are domain-independent, making the task of moving to new domains very easy. We empirically show that patterns learned using (broad-coverage corpus based) paraphrases are comparable in performance to several state-of-the-art domain-specific IE engines.

Thus, in this thesis we define quasi-paraphrases, present methods to learn them from a corpus, and show that quasi-paraphrases are useful for information extraction.
Chapter 1

Introduction

1.1 Motivation
Variability is a common phenomenon in language: the meaning conveyed by a sentence or phrase can be expressed in several different ways. For example, the pair of sentences (1) and (2):

The shuttle Discovery landed in Florida on Saturday. (1)

The shuttle Discovery touched down in Florida on Saturday. (2)

or the phrases (3) and (4):

X landed in Y (3)

X touched down in Y (4)

respectively express the same meaning. Such semantically equivalent sentences and phrases are called paraphrases.
Formally, the Webster dictionary defines a "paraphrase" as "a restatement of a text, passage, or work giving the meaning in another form". WordNet (Fellbaum, 1998) defines a "paraphrase" as "rewording for the purpose of clarification". In linguistics, De Beaugrande and Dressler (1981) define paraphrases as "approximate conceptual equivalence among outwardly different material". In general, paraphrases are simply expressions that communicate the same meaning using different words. The ability to paraphrase gives language the flexibility of articulation. It is hard to imagine a language without this facility: it would become monotonous.

Since paraphrasing is such a common phenomenon, every Natural Language Processing (NLP) application has to deal with paraphrases. But automatically capturing the semantic phenomenon of "approximate conceptual equivalence" that defines paraphrases is hard. Hence, most state-of-the-art systems choose not to deal with paraphrases explicitly. However, several people have attempted to harness the power of paraphrases to improve various NLP applications and have shown promising results. For example, in question answering, paraphrases have been used to find multiple patterns that pinpoint the same answer (Ravichandran & Hovy, 2002); in statistical machine translation, they have been used to find translations for unseen source language phrases (Callison-Burch et al., 2006); in multi-document summarization, they have been used to identify phrases from different sentences that express the same information (Barzilay et al., 1999); in information retrieval, they have been used for query expansion (Anick & Tipirneni, 1999).

With such a wide range of applications, the knowledge of what constitutes paraphrases and how they can be learned automatically is important. This is the motivation for the work we undertake in this thesis. The thesis presents a set of methods to automatically learn paraphrases from text.
1.2 Goal and Approach

This thesis aims to answer the following main question:

How can we learn paraphrases from a monolingual corpus?

Finding a paraphrase requires one to ensure preservation of core meaning (semantics). Since there are no adequate semantic interpretation systems available today, paraphrase acquisition techniques use some other mechanism as a kind of pivot to indirectly help ensure semantic equivalence. Each pivot mechanism selects phrases with like meaning in a different characteristic way. One method uses a bilingual dictionary or translation table as the pivot mechanism: all source language words or phrases that translate to a given foreign word/phrase are deemed to be paraphrases of one another (Bannard & Callison-Burch, 2005). Another popular method uses the instantiations of syntactic arguments of paths in syntax trees (context) as pivots for learning paraphrases: syntactic paths that have overlapping contexts are considered paraphrases of each other (Lin & Pantel, 2001). Of the two methods, the second is currently easier to use with available data, since it needs no resource other than a large monolingual corpus. Hence, in this thesis, we use the basic principle behind this method, the so-called distributional hypothesis (Harris, 1954), to learn phrase-level paraphrases. The distributional hypothesis states that "two words that appear in similar contexts have similar meanings". We extend this idea to address several issues involved in learning paraphrases.
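As a concrete illustration of the argument-context pivot, the sketch below compares two phrase patterns by the cosine similarity of their argument-pair count vectors: patterns whose slots are filled by the same (X, Y) pairs are candidate paraphrases. The observation triples are invented for illustration and are not drawn from the thesis's corpus:

```python
from collections import Counter
from math import sqrt

# Toy (X, phrase, Y) observations; a real system would harvest these
# from a POS-tagged or parsed corpus.
OBSERVATIONS = [
    ("discovery", "landed in", "florida"),
    ("discovery", "touched down in", "florida"),
    ("the plane", "landed in", "chicago"),
    ("the plane", "touched down in", "chicago"),
    ("the jet", "landed in", "denver"),
]

def context_vector(phrase):
    """Vector of (X, Y) argument-pair counts: the phrase's context."""
    return Counter((x, y) for x, p, y in OBSERVATIONS if p == phrase)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    if not u or not v:
        return 0.0
    dot = sum(c * v[k] for k, c in u.items())
    def norm(w):
        return sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

sim = cosine(context_vector("landed in"), context_vector("touched down in"))
print(round(sim, 2))  # -> 0.82: high argument overlap, candidate paraphrase pair
```

With the toy data, two of the three argument pairs seen with "landed in" also occur with "touched down in", so the patterns come out highly similar; an unrelated pattern would share few or no fillers and score near zero.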
This thesis takes the view that perfect synonymy is hard to achieve. With the exception of paraphrases that are obtained by direct syntactic transformations, like active voice to passive voice, phrase or sentence variations are likely to have slightly different shades of meaning. In fact, a large number of paraphrases used in the real world are quasi-paraphrases. For example, consider the sentence pair (5) and (6):

US journalist Daniel Pearl was killed by Al-Qaeda. (5)

US journalist Daniel Pearl was beheaded by Al-Qaeda. (6)

Sentences (5) and (6) are not equivalent in the logical sense: "killed" and "beheaded" are not synonymous. However, for all practical purposes, they can be considered paraphrases or quasi-paraphrases.

Another observation is that paraphrases or quasi-paraphrases are not mutually replaceable in all contexts, i.e., a pair of expressions may be considered paraphrases in certain contexts but not in others. For example, consider the phrases:

X was killed by Y (7)

X was beheaded by Y (8)

These phrases are not paraphrases in all contexts, but can be considered quasi-paraphrases when X is a "person" and Y is a "terrorist" or "terrorist organization", as in (7) and (8). However, given the sentence:

The bill was killed by the Senate Health Care Strategies Committee.

it can't be plausibly paraphrased as:

The bill was beheaded by the Senate Health Care Strategies Committee.
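The contextual restriction just illustrated can be sketched as a lookup against learned selectional preferences: a paraphrase rule fires only when the semantic classes of its arguments match the classes the rule was learned for. The class table and licensed class pairs below are hypothetical stand-ins for what a system would actually learn:

```python
# Hypothetical semantic-class assignments; a real system would learn
# these classes from a corpus or take them from a resource like WordNet.
SEMANTIC_CLASS = {
    "daniel pearl": "person",
    "al-qaeda": "terrorist organization",
    "the bill": "legislation",
    "the senate committee": "organization",
}

# (X-class, Y-class) pairs for which the rule
# "X was killed by Y" => "X was beheaded by Y" is licensed.
RULE_PREFERENCES = {
    ("person", "terrorist organization"),
    ("person", "person"),
}

def rule_applies(x, y):
    """Allow the paraphrase only when the argument classes of X and Y
    match the rule's selectional preferences."""
    return (SEMANTIC_CLASS.get(x), SEMANTIC_CLASS.get(y)) in RULE_PREFERENCES

print(rule_applies("daniel pearl", "al-qaeda"))       # True
print(rule_applies("the bill", "the senate committee"))  # False
```

The "bill/Senate committee" pair is rejected because "legislation" is not a licensed X-class, which is exactly the filtering behavior the Inferential Selectional Preferences of Chapter 4 aim to learn automatically.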
This thesis deals with quasi-paraphrases, of which perfectly synonymous phrases are a small subset. In the context of this thesis, the term "paraphrases" (even without the prefix "quasi") means "quasi-paraphrases". We define quasi-paraphrases in detail in Chapter 3.
An important consideration when learning paraphrases is their granularity. As is clear from their definition, paraphrases occur at different levels in text, i.e., at the discourse level, sentence level, and phrase level. In this thesis, we focus only on phrase-level paraphrases, i.e., paraphrases of the form (3) and (4) or (7) and (8). (The word "phrase" here only means a sequence of words, i.e., n-grams.)

To undertake the work in this thesis given the above background, we first define quasi-paraphrases. We then learn the contexts in which a pair of quasi-paraphrases are mutually replaceable. To do this, we present a set of methods called Inferential Selectional Preferences (ISPs) (Pantel et al., 2007). We then develop an algorithm called LEDIR (Bhagat et al., 2007) that uses the contextual information of phrases to learn the directionality of quasi-paraphrases. We use this information to differentiate strong from weak paraphrases. In accomplishing these two tasks, we find that semantic classes play an important role: the nature of the semantic classes affects the effectiveness of both ISP and LEDIR. Hence, we suggest using a state-of-the-art semi-supervised algorithm to obtain semantic classes and show that it finds better semantic classes to suit the user preference than those obtained by a commonly used unsupervised algorithm. Next, we work on obtaining surface paraphrases. We show that we can obtain high-precision surface paraphrases using distributional similarity and show that we can use them to learn high-precision patterns for relation extraction (Bhagat & Ravichandran, 2008). Finally, we show that paraphrases can be used to learn patterns for domain-specific information extraction without relying on a domain-specific corpus (Bhagat et al., 2009).
1.3 Major Contributions

The major contributions of this thesis are:

• We define a typology of quasi-paraphrases along with their relative frequencies.

• We develop a method for learning the contexts in which a pair of quasi-paraphrases are mutually replaceable.

• We present a method for learning the directionality of quasi-paraphrases to distinguish the strong from the weak paraphrases.

• We show that high-quality surface-level paraphrases can be learned from a large monolingual corpus using distributional similarity. These paraphrases are then shown to be useful for large-scale relation extraction.

• We show that broad-coverage paraphrases can be used to learn patterns for domain-specific information extraction without the help of a domain-specific corpus.
1.4 Thesis Outline

The remainder of the thesis is organized as follows:

• Chapter 2 discusses previous work on paraphrase learning and related topics.

• Chapter 3 defines quasi-paraphrases in detail.

• Chapter 4 discusses the Inferential Selectional Preferences that define the contexts in which quasi-paraphrases are mutually replaceable and shows that we can use this information to filter out incorrect inferences.

• Chapter 5 presents our algorithm for learning the directionality of quasi-paraphrases.

• Chapter 6 shows the effectiveness of using an off-the-shelf state-of-the-art semi-supervised algorithm for learning semantic classes and suggests it as a method for tailoring the semantic classes to suit the user's preferences.

• Chapter 7 shows that we can use distributional similarity to learn surface-level paraphrases from a large corpus and further shows that we can use these surface paraphrases to learn high-precision surface patterns for relation extraction.

• Chapter 8 presents our work on using paraphrases learned from a large broad-coverage corpus for the task of domain-specific information extraction.

• Chapter 9 presents the concluding remarks and outlines several directions for future work.
Chapter 2

Related Work

2.1 Introduction

In this chapter we present and discuss the previous work done on paraphrases and on learning them from text. We broadly classify the previous work into four main categories:

1. Linguistic Theories of Paraphrases
2. Learning Paraphrases
3. Learning Selectional Preferences and Directionality
4. Applications

We discuss all these categories in detail below, and point out the differences between the previous approaches and ours.
2.2 LinguisticTheoriesofParaphrases
Over the years, the phenomenon of paraphrasing has generated significant interest
among linguists. One influential theory that defines paraphrases is found in Trans-
formationalGrammar(Chomsky,1957;Harris,1981). TransformationalGrammarde-
composes complex sentences into simple sentences and defines operations on these
sentences (transformations). These transformations are meaning preserving and thus
generateparaphrases. Forexample, (1)belowisatransformationalrulethatconvertsa
sentencefromactivetopassiveformasin (2).
N
1
tVN
2
←→N
2
tbeVenbyN
1
(1)
Hesawtheman. =Themanwasseenbyhim. (2)
Harris (1981) lists a set of 20 transformational rules for English. He states that the set of these transformational rules is not exhaustive, but is sufficient for practical purposes.
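As a toy illustration only (not part of Harris's formalism), rule (1) can be applied mechanically to very simple sentences. The regex pattern and function name below are mine, and the sketch assumes a regular past-tense verb whose participle equals its past tense; a real transformation needs a lexicon and morphological analysis:

```python
import re

def passivize(sentence):
    """Toy application of the rule N1 t V N2 <-> N2 t be V-en by N1.

    Handles only 'N1 V-ed N2.' with a single-word subject and a regular
    past-tense verb (participle == past tense, e.g. 'greeted').
    """
    m = re.match(r"^(\w+) (\w+ed) (.+)\.$", sentence)
    if m is None:
        return None  # sentence does not fit the toy pattern
    n1, verb, n2 = m.groups()
    # Swap the nouns around 'was V-ed by', capitalizing the new subject.
    return f"{n2[0].upper()}{n2[1:]} was {verb} by {n1}."
```

For example, `passivize("Mary greeted John.")` yields `"John was greeted by Mary."`, while a sentence outside the toy pattern returns `None`.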
The main drawback of transformational-rule-based paraphrasing is that it treats paraphrasing as a purely syntactic phenomenon and ignores the lexical nature of paraphrases. However, a large number of paraphrases are lexical in nature (for example, those generated by replacing a word in a sentence with its synonyms), and many transformational rules need to be constrained based on the lexicon. For example, certain sentences, such as those containing verbs like lack, generate odd sentences when converted to passive: “The team lacks talent.” vs. “Talent is lacked by the team.”
A different perspective on paraphrases is provided by the Meaning Text Theory (MTT) (Mel'cuk, 1996). MTT outlines a seven-strata (level) model for natural language structure, the strata ranging from the surface-phonetic to the semantic representation level. Central to MTT are lexical functions (LFs), which express the well-established (institutionalized) relations between lexical units in a language. For example, the LF Magn(X), meaning “to a high degree” or “intense”, is the “intensifier” for X, as in (3) and (4).
Magn(to laugh) = heartily (3)
Magn(patience) = infinite (4)
MTT defines 64 LFs that operate at its deep-syntactic level. Using these, Mel'cuk (to appear) defined a set of 67 lexical-paraphrasing rules (again at the deep-syntactic level). To explain the structural changes taking place while using the lexical-paraphrasing rules, MTT defines 38 restructuring-paraphrasing rules (at the deep-syntactic level).
While MTT gives a detailed account of paraphrasing, its utility is marred by several factors. Firstly, many LFs are vague and underspecified, making them hard to model. Secondly, the complexity of the model, due to its seven levels and its definition of paraphrases at the theory-specific deep-syntactic level, has made it difficult for the vast majority of NLP researchers, who work outside the context of MTT, to use this definition in any meaningful way. Thirdly, the Explanatory Combinatorial Dictionary, the resource that would contain detailed descriptions of LFs and the lists of LF values for a large set of words, is unavailable for most languages.
In contrast to the above-mentioned theories, Honeck (1971) takes a very high-level view of paraphrases. He divides paraphrases into: transformational, where the surface structure of the base phrase or sentence is changed but the content words are unchanged; lexical, where the surface structure of the base phrase or sentence remains the same but synonyms are substituted for lexical items; and formalexic, where the surface structure as well as the content words of the base phrase or sentence are changed. For example, given the base sentence (5), sentences (6), (7), and (8) are its transformational, lexical, and formalexic paraphrases respectively.
The fight evoked the emotions that perplexed the boy that wept. (5)
The boy that wept was perplexed by the emotions that the fight evoked. (6)
The struggle elicited the feelings that puzzled the lad that cried. (7)
The lad that cried was puzzled by the feelings the struggle elicited. (8)
While the Honeck (1971) theory explains a vast majority of paraphrases, it is too general to be modeled, or even to be used as a general guideline for distinguishing paraphrases from non-paraphrases. Also, it limits paraphrasing to the notion of synonymy, which goes against the broad notion of paraphrases (paraphrases are not just synonyms) as put forth by many linguists (De Beaugrande & Dressler, 1981; Clark, 1992; Mel'cuk, to appear).
Barzilay (2003) also takes a generic, high-level view of paraphrases and classifies them as: atomic, paraphrases between small non-decomposable lexical units, i.e., words and small phrases; and compositional, paraphrases between constructs that can be decomposed into smaller units, i.e., sentences and complex phrases. The atomic paraphrases are further classified based on the lengths of the two phrases participating in the paraphrase relation, i.e., [1:1], [1:2], [2:2], and others. The compositional paraphrases are further classified based on the basic high-level changes in the sentence, i.e., deletions, permutations, noun phrase transformations, active-passive transformations, and lexical changes.
While the Barzilay (2003) theory is generic enough to explain almost all possible paraphrases, the operations it defines are too general for practical applications. Without further elaboration, these operations can neither be modeled nor be used as guidelines to distinguish paraphrases from non-paraphrases.
Thus overall, the various theories of paraphrases to date either define paraphrasing as an operation in a deep linguistic framework or take a very abstract view of paraphrases. The former are hard to understand and can be used only if one is working with the specified deep linguistic representations, while the latter are easy to understand but hard to use for any practical purposes.
In this thesis, we take the middle ground: we explain paraphrasing using a set of commonly known phenomena. These phenomena are general enough that a majority of (quasi-)paraphrases can be explained using them, and specific enough that they can be used to distinguish quasi-paraphrases from non-paraphrases. This makes the definition of quasi-paraphrases (sufficiently) objective and should make it easy for NLP researchers to understand and use the definition to study paraphrasing empirically.
2.3 Learning Paraphrases
Most recent work in paraphrase acquisition is based on automatic learning. In this section, we discuss the major approaches to learning paraphrases from text corpora.
2.3.1 Learning Paraphrases using Multiple Translations
The idea that multiple translations of the same foreign-language texts can be used to learn paraphrases was first presented by Barzilay and McKeown (2001). The intuition behind this approach is that different translators are likely to use different words to translate the same foreign-language sentence. These translations are equivalent, and thus their corresponding sub-parts are also equivalent, i.e., paraphrases. Barzilay and McKeown (2001) obtained multiple translations of five classic novels. They then aligned the equivalent sentences from the different translations using the Gale and Church (1991) method. From the aligned sentences, they learned phrase-level paraphrases using a co-training algorithm that uses the contexts of phrases as features. For example, given the aligned sentences (9) and (10), they learned two pairs of paraphrases: (“burst into tears”, “cried”) and (“comfort”, “console”).
Emma burst into tears and he tried to comfort her, saying things to make her smile. (9)
Emma cried, and he tried to console her, adorning his words with puns. (10)
Pang et al. (2003) also used multiple translations for learning paraphrases. They used 11 translations each of around 900 Chinese sentences as their training data. They parsed the multiple translations of the same sentence using a syntactic parser and then merged the matching parts of the different parse trees into a single forest. The forest was then linearized in the form of a word lattice. The alternate paths in the word lattice were treated as paraphrases.
While the multiple-translations-based method often produces good paraphrases, it is limited by the availability of data. The corpus used by Barzilay and McKeown (2001) has only around 0.5 million words on each side, while the corpus used by Pang et al. (2003) has a little over 3 million words on each side. These corpora are very small by current standards in NLP.
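The core intuition (equivalent sentences, differing sub-parts) can be sketched with a much simpler procedure than Barzilay and McKeown's co-training algorithm: strip the longest common prefix and suffix of two aligned sentences and treat the remaining middle spans as a candidate paraphrase pair. The function name is mine and this is a deliberate simplification:

```python
def candidate_pair(sent1, sent2):
    """Return the differing spans of two aligned (equivalent) sentences,
    flanked by identical context words, as a candidate paraphrase pair."""
    w1, w2 = sent1.split(), sent2.split()
    i = 0
    while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
        i += 1  # longest common prefix
    j = 0
    while j < min(len(w1), len(w2)) - i and w1[len(w1) - 1 - j] == w2[len(w2) - 1 - j]:
        j += 1  # longest common suffix (not overlapping the prefix)
    return " ".join(w1[i:len(w1) - j]), " ".join(w2[i:len(w2) - j])
```

For instance, `candidate_pair("he tried to comfort her", "he tried to console her")` returns `("comfort", "console")`; the real algorithm additionally uses the contexts of such spans as features to iteratively grow the paraphrase lexicon.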
2.3.2 Learning Paraphrases using Parallel Bilingual Corpora
Bannard and Callison-Burch (2005) were the first to use bilingual parallel corpora for learning paraphrases. The intuition behind this approach is that foreign-language words or phrases can be used as pivots to learn paraphrases for the source language: all source-language words or phrases that align with the same foreign-language word or phrase are paraphrases. Bannard and Callison-Burch (2005) used an English-German parallel corpus to learn paraphrases for English. They also experimented with using multiple parallel corpora to improve the quality of the paraphrases. Zhou et al. (2006) employed a similar approach to learn paraphrases using an English-Chinese bilingual corpus. Callison-Burch (2008) built on this work and developed a method to restrict the type of paraphrases learned by limiting the paraphrases to be of the same syntactic category.
These approaches are also limited by the availability of bilingual parallel corpora, which, though larger than multiple-translations data, are still only around 315 million words on the English side (Callison-Burch, 2007).
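The pivot computation itself is compact. The sketch below (the function name and toy counts are mine) estimates the paraphrase probability p(e2 | e1) = Σf p(f | e1) · p(e2 | f) from raw phrase-alignment counts, following the formula given by Bannard and Callison-Burch (2005):

```python
from collections import Counter, defaultdict

def pivot_paraphrase_probs(aligned_pairs):
    """Estimate p(e2 | e1) = sum_f p(f | e1) * p(e2 | f) from a list of
    (english_phrase, foreign_phrase) aligned pairs (with repetition)."""
    pair_counts = Counter(aligned_pairs)
    e_counts, f_counts = Counter(), Counter()
    for (e, f), c in pair_counts.items():
        e_counts[e] += c
        f_counts[f] += c
    probs = defaultdict(float)
    for (e1, f), c1 in pair_counts.items():
        p_f_given_e1 = c1 / e_counts[e1]
        # Sum over all English phrases aligned to the same pivot f.
        for (e2, f2), c2 in pair_counts.items():
            if f2 == f and e2 != e1:
                probs[(e1, e2)] += p_f_given_e1 * (c2 / f_counts[f])
    return dict(probs)
```

With the toy alignments `[("bought", "kaufte"), ("acquired", "kaufte"), ("acquired", "erwarb")]`, “bought” paraphrases “acquired” with probability 0.5, while the reverse direction gets 0.25, because “acquired” also aligns to a pivot that “bought” never shares.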
2.3.3 Learning Paraphrases using Comparable Corpora
Shinyama et al. (2002) were the first to use comparable corpora to learn paraphrases. The intuition behind this approach is that the same event is reported by various newspapers, and each of these newspapers uses a different set of words to express this event. These comparable articles can be used to mine paraphrases. Shinyama et al. (2002) presented an algorithm that automatically finds news articles that report the same event and then aligns the similar sentences in these news articles using the named entities as anchors. The named entities are then further used as anchors to learn phrase-level paraphrases. Barzilay and Lee (2003) used comparable news articles to obtain sentence-level paraphrases. Dolan et al. (2004) worked with a much larger scale of data from the web and extracted sentence-level paraphrases using edit distance between sentence pairs and some heuristics. Quirk et al. (2004) used these sentence-level paraphrases to learn phrase-level paraphrases by aligning the corresponding sentence parts using machine translation techniques.
These approaches rely on corpora of comparable sentences, the largest of which has about 60 million words (Callison-Burch, 2007). This is quite small by current standards in NLP.
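The edit-distance step used over comparable news data can be sketched as a word-level Levenshtein computation with a cutoff. This is a minimal stand-in, not Dolan et al.'s exact heuristics; the function names and threshold are mine:

```python
def word_edit_distance(s1, s2):
    """Word-level Levenshtein distance via the standard DP recurrence."""
    a, b = s1.split(), s2.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (wa != wb)))    # substitution
        prev = cur
    return prev[-1]

def likely_paraphrase_pair(s1, s2, max_dist=4):
    """Crude filter: keep sentence pairs from comparable articles whose
    word-level edit distance falls under a small threshold."""
    return word_edit_distance(s1, s2) <= max_dist
```

For example, “the bus seats 40 students” and “the bus accommodates 40 students” are one substitution apart and pass the filter, while two unrelated sentences do not.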
2.3.4 Learning Paraphrases using Monolingual Corpora
Lin and Pantel (2001) were the first to use a single monolingual resource for obtaining quasi-paraphrase rules. The intuition behind their approach is that, like words, paths in a syntax tree that appear in similar contexts should have similar meanings. Based on this intuition, they parsed a medium-sized corpus using a dependency parser and collected paths in the syntax trees that satisfy some pre-defined heuristics. They then collected contexts for each of the paths in that corpus and found similarities between the paths based on their contexts. Two paths whose similarity is over some pre-defined threshold form a quasi-paraphrase rule.
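A stripped-down version of this path-similarity idea can be sketched as follows. DIRT proper uses a mutual-information-weighted similarity over slot fillers; plain cosine over raw filler counts stands in for it here, and all names and toy data are mine:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * c2.get(k, 0) for k, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def similar_paths(path_contexts, threshold=0.5):
    """Pair up paths whose slot-filler distributions are similar.

    path_contexts maps a path, e.g. 'X solves Y', to a Counter of
    (slot, word) fillers observed with that path in a corpus.
    """
    paths = list(path_contexts)
    rules = []
    for i, p in enumerate(paths):
        for q in paths[i + 1:]:
            if cosine(path_contexts[p], path_contexts[q]) >= threshold:
                rules.append((p, q))
    return rules
```

On toy data where “X solves Y” and “X finds a solution to Y” share fillers like (X, government) and (Y, problem), the two paths are paired, while a path with disjoint fillers such as “X eats Y” is not.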
Szpektor et al. (2004) presented another approach for learning quasi-paraphrase rules. They have a pre-defined set of verbs, which they call pivots. For each of these pivots, they obtain sentences from the web that contain the pivot, parse the sentences using a dependency parser, and find anchors, i.e., words that are in specific syntactic relations with the pivot. They then rank the anchors using some heuristics to choose reliable anchors. The reliable anchors for each pivot are then used to find other phrases (these phrases are syntactic paths in a dependency tree) from the web that might entail the pivot.
In this thesis, we use the quasi-paraphrase rules obtained from the DIRT algorithm (Lin & Pantel, 2001) to demonstrate the effectiveness of ISPs in pointing out the contexts in which they are mutually replaceable, and to demonstrate the effectiveness of LEDIR in learning their directionalities. However, both the Lin and Pantel (2001) and Szpektor et al. (2004) approaches use syntactic parsers for learning syntactic quasi-paraphrases. This limits their applicability to clean and relatively small data sets [1]. Also, for applications like information extraction, researchers have found it hard to make use of full-parse-tree-based quasi-paraphrases for learning useful patterns [2]. To address these issues, in this thesis we present methods to learn paraphrases using simple and fast part-of-speech tagging and shallow parsing. Our algorithms use a single monolingual corpus and easily scale to very large corpora. We were able to scale our part-of-speech-based algorithm to a corpus of 25 billion words, which is at least one order of magnitude larger than the corpora that have been used for learning paraphrases in the past. Also, we show the effectiveness of our quasi-paraphrases in learning patterns for information extraction.
2.4 Learning Selectional Preferences and Directionality
In this section, we present a brief overview of the work relating to the two paraphrasing issues that we address in this thesis: learning selectional preferences and learning directionality.
[1] With the availability of large clusters, scaling parsers to large data sets has become easier, especially in big companies like Google, Yahoo, and Microsoft. However, scaling is still a problem for a majority of the research groups that work on Natural Language Processing.
[2] Personal communication.
2.4.1 Learning Selectional Preferences
Selectional Preference (SP) as a foundation for computational semantics is one of the earliest topics in AI and NLP, and has its roots in Katz and Fodor (1963) and Wilks (1975). Overviews of NLP research on this theme are Wilks and Fass (1992), which includes the influential theory of Preference Semantics by Wilks (1975), and, more recently, Light and Greiff (2002). Light and Greiff (2002) define SPs as the preference of a predicate for the semantic class membership of its argument and vice versa.
Much previous work has focused on learning SPs for simple structures. Resnik (1996), the seminal paper on this topic, introduced a statistical model for learning SPs for predicates using an unsupervised method. The focus of our work in this thesis, however, is to learn SPs for quasi-paraphrases.
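For concreteness, Resnik's model can be sketched from its published formulas: the selectional preference strength of a predicate is the KL divergence S(p) = Σc P(c|p) log(P(c|p)/P(c)), and the selectional association of a class is its contribution to that sum divided by S(p). The function below is my own minimal rendering, with smoothing and WordNet-based class estimation omitted:

```python
import math

def selectional_association(pred_class_counts, class_counts):
    """Resnik (1996)-style selectional preference strength and
    association, computed from raw counts.

    pred_class_counts: counts of argument semantic classes observed
    with one predicate; class_counts: overall class counts.
    Returns (strength S, {class: association A}).
    """
    n_pred = sum(pred_class_counts.values())
    n_all = sum(class_counts.values())
    strength = 0.0
    contrib = {}
    for c, k in pred_class_counts.items():
        p_c_pred = k / n_pred                  # P(c | predicate)
        p_c = class_counts[c] / n_all          # prior P(c)
        term = p_c_pred * math.log(p_c_pred / p_c)
        contrib[c] = term
        strength += term
    assoc = {c: t / strength for c, t in contrib.items()} if strength else {}
    return strength, assoc
```

On toy counts where a predicate like eat strongly prefers the class food over artifact relative to the priors, food receives a positive association and artifact a negative one.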
Learning SPs often relies on an underlying set of semantic classes, as in both Resnik (1996) and our approach. Semantic classes can be specified manually or derived automatically. Manually constructed collections of semantic classes include hierarchies like WordNet (Fellbaum, 1998), Levin verb classes (Levin, 1993), and FrameNet (Baker et al., 1998). Automatic derivation of semantic classes can take a variety of approaches, but often uses corpus methods and the Distributional Hypothesis (Harris, 1954) to automatically cluster similar entities into classes, e.g., CBC (Pantel & Lin, 2002). In our work, we experiment with two sets of semantic classes, one from WordNet and one from CBC.
Zanzotto et al. (2006) recently explored a different interplay between SPs and inferences. Rather than examine the role of SPs in inferences, they use SPs of a particular type to derive inferences. For instance, the preference of win for the subject player, a nominalization of play, is used to derive that “win ⇒ play”. Our work can be viewed as complementary to the work on extracting quasi-paraphrase rules, since we seek to refine when a given quasi-paraphrase rule applies, filtering out incorrect inferences.
2.4.2 Learning Directionality
There have been a few approaches to learning the directionality of restricted sets of semantic relations, mostly between verbs. Chklovski and Pantel (2004) used lexico-syntactic patterns over the Web to detect certain types of symmetric and asymmetric relations between verbs. They manually examined and obtained lexico-syntactic patterns that help identify the types of relations they considered, and used these lexico-syntactic patterns over the Web to detect these relations among a set of candidate verb pairs. Zanzotto et al. (2006) explored a selectional preference-based approach to learn asymmetric inference rules between verbs. They used the selectional preferences of a single verb, i.e., the semantic types of a verb's arguments, to infer an asymmetric inference between the verb and the verb form of its argument type. Torisawa (2006) presented a method to acquire inference rules with temporal constraints between verbs. They used co-occurrences between verbs in Japanese coordinated sentences and co-occurrences between verbs and nouns to learn the verb-verb inference rules. All these approaches, however, are limited only to verbs, and to specific types of verb relations.
In principle, the work that is the most similar to ours is by Geffet and Dagan (2005). Geffet and Dagan proposed an extension to the distributional hypothesis to discover entailment relations between words. They model the context of a word using its syntactic features, and compare the contexts of two words for strict inclusion to infer lexical entailment. Their method, however, is limited to lexical entailment, and they show its effectiveness only for nouns.
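Their inclusion test reduces to a subset check over context features. The sketch below is my own rendering; it omits the feature weighting Geffet and Dagan use to pick out “characteristic” features:

```python
def entails(features_u, features_v):
    """Distributional-inclusion test in the spirit of Geffet and Dagan
    (2005): u is taken to entail v if u's characteristic syntactic
    features are a subset of v's."""
    return set(features_u) <= set(features_v)

def entailment_direction(u, feats_u, v, feats_v):
    """Return 'u => v', 'v => u', or None if neither strict inclusion
    holds for the word pair."""
    if entails(feats_u, feats_v) and not entails(feats_v, feats_u):
        return f"{u} => {v}"
    if entails(feats_v, feats_u) and not entails(feats_u, feats_v):
        return f"{v} => {u}"
    return None
```

For example, if every syntactic feature seen with dog also occurs with animal but not vice versa, the sketch yields the asymmetric judgment “dog ⇒ animal”.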
The method that we present in this thesis for learning directionality deals with quasi-paraphrase rules between binary relations and includes quasi-paraphrase rules between verbal relations, non-verbal relations, and multi-word relations. Our definition of context, and our methodology for obtaining context similarity and overlap, are also much different from those used in any of the previous approaches.
2.5 Applications
In this section, we present a brief overview of the past work in pattern-based information extraction and the use of paraphrases for learning extraction patterns.
2.5.1 Paraphrases for Learning Extraction Patterns
Paraphrases have emerged as a useful tool for learning Information Extraction (IE) patterns. Sekine (2006) and Romano et al. (2006) used syntactic paraphrases to learn patterns for relation extraction. The Sekine (2006) approach clusters patterns based on the entities they extract and on the keywords inside the patterns. Patterns in one cluster are assumed to have similar meanings (paraphrases). Romano et al. (2006), on the other hand, use the entailment templates from Szpektor et al. (2004) to learn patterns using some seeds. While procedurally different, both methods depend heavily on the performance of the syntactic parser and require complex syntax-tree matching to extract the relation instances.
For general information extraction, i.e., when only a single entity or role needs to be extracted from text, Szpektor and Dagan (2008) used distributional similarity between paths in dependency trees to learn entailing quasi-paraphrases. This approach also learns syntactic quasi-paraphrases, which involves parsing the text with a dependency parser.
On the other hand, our method for learning patterns for relation extraction (Bhagat & Ravichandran, 2008) uses surface paraphrases. Also, for general information extraction, our first method learns surface paraphrases, making it easily scalable to large corpora and giving us the flexibility to apply different levels of post-processing. Our second method learns lexico-syntactic paraphrases using shallow parsing, which is also several times faster than full parsing. The fact that our methods rely only on simple and fast language processing techniques makes them both robust and scalable.
2.5.2 Paraphrases for Learning Patterns for Open-Domain Relation Extraction
One task related to the work we do in this thesis is relation extraction. Its aim is to extract instances of a given relation. For example, given a relation like “acquisition”, relation extraction aims to extract the ⟨ACQUIRER⟩ and the ⟨ACQUIREE⟩. Hearst (1992), the pioneering paper in the field, used a small number of hand-selected patterns to extract instances of the hyponymy relation. Berland and Charniak (1999) used a similar method for extracting instances of the meronymy relation. Ravichandran and Hovy (2002) used seed instances of a relation to automatically obtain surface patterns by querying the web. Romano et al. (2006) and Sekine (2006) used syntactic paraphrases to obtain patterns for extracting relations.
In this thesis, we use surface paraphrases as a method for learning surface patterns for relation extraction. Our method starts with a few seed patterns for a given relation and obtains other surface patterns automatically by finding their paraphrases. To find these paraphrases, we use our surface-paraphrase learning algorithm, which uses distributional similarity over a large corpus. Using this approach helps our method avoid the problem of obtaining overly general patterns, as against the Ravichandran and Hovy (2002) approach. Also, the use of surface patterns in our method avoids the dependence on a parser and on syntactic matching, the two factors that play an important role in the performance of the Romano et al. (2006) and Sekine (2006) approaches. This also makes the extraction process scalable.
2.5.3 Paraphrases for Learning Patterns for Domain-Specific Information Extraction
While pattern-based approaches to domain-specific information extraction have been popular since the early 1990s, the initial work focused on manual creation of patterns (Hobbs et al., 1993; Jacobs et al., 1993). Focus, however, quickly shifted to learning patterns automatically from domain-specific corpora. One line of work focused on using annotated training corpora to learn these patterns (Riloff, 1993; Kim & Moldovan, 1993; Freitag, 1998a; Califf & Mooney, 2003). However, given the need for large amounts of tedious manual annotation for these methods, weakly-supervised approaches, which need very little annotation, are now becoming popular (Riloff, 1996; Patwardhan & Riloff, 2007). These and other similar approaches to domain-specific IE use domain-specific corpora for learning patterns. Their dependence on domain-specific corpora is a hindrance to the easy portability of these methods to new domains. The pattern-learning method we present in this thesis, however, differs from the previous approaches in that it does not need a domain-specific corpus for learning patterns: it learns patterns from a general broad-coverage corpus. Also, our method collects patterns from a broad-coverage corpus only once. Patterns can then be generated for any (new) domain by using just a few seed patterns.
Chapter 3
Paraphrases
3.1 Introduction
Sentences or phrases that convey the same meaning using different surface words are called paraphrases. For example, sentences (1) and (2) below are paraphrases.
The school said that their buses seat 40 students each. (1)
The school said that their buses accommodate 40 students each. (2)
While the general interpretation of the term paraphrases is quite narrow (along the lines mentioned above), in the linguistic literature paraphrases are most often characterized by an approximate equivalence of meaning across sentences or phrases. De Beaugrande and Dressler (1981) define paraphrases as “Approximate conceptual equivalence among outwardly different material”. Hirst (2003) defines paraphrases as “Talk(ing) about the same situation in a different way”. He argues that paraphrases aren't synonymous: there are pragmatic differences in paraphrases, i.e., differences of evaluation, connotation, viewpoint, etc. According to Mel'cuk (to appear), “For two sentences to be considered paraphrases, they need not be fully synonymous: it is sufficient for them to be quasi-synonymous, that is, to be mutually replaceable salva significatione at least in some contexts”. He further adds that approximate paraphrases include implications (not in the logical, but the everyday sense). Taking an extreme view, Clark (1992) rejects the idea of absolute synonymy, saying “Every two forms (in language) contrast in meaning”. Overall, there is a large body of work in the linguistic literature which argues that paraphrases are not restricted to synonymy.
In this thesis, we take a broad view of paraphrases along the lines mentioned above. To avoid the conflict between the notion of strict paraphrases as understood in logic and the broad notion in linguistics, we use the term quasi-paraphrases to refer to the paraphrases that we deal with in this thesis. In the context of this thesis, the term “paraphrases” (even without the prefix “quasi”) means “quasi-paraphrases”. We define quasi-paraphrases as “Sentences or phrases that convey approximately the same meaning using different surface words”. We ignore the fine-grained distinctions of meaning between sentences and phrases introduced due to the speaker's evaluation of the situation, the connotation of the terms used, change of modality, etc. For example, consider sentences (3) and (4) below.
The school said that their buses seat 40 students each. (3)
The school said that their buses cram 40 students each. (4)
Here, the words “seat” and “cram” are not synonymous: they carry different evaluations by the speakers of the same situation. We, however, consider sentences (3) and (4) to be (quasi-)paraphrases. Similarly, consider sentences (5) and (6) below.
The school said that their buses seat 40 students each. (5)
The school is saying that their buses might accommodate 40 students each. (6)
Here, “said” and “is saying” have different tenses. Also, “might accommodate” and “seat” are not synonymous: “might accommodate” contains the modal verb “might”. We, however, consider sentences (5) and (6) to be quasi-paraphrases. While approximate equivalence is hard to characterize, except as the intuition of a native speaker, we will do our best in this thesis to make it as objective as possible.
3.2 Paraphrasing Phenomenon Explained
In this section, we define and describe the phenomenon of quasi-paraphrases. The phenomenon is analyzed from two perspectives: lexical and structural. The lexical perspective deals with the kinds of lexical changes that can take place in a sentence or a phrase resulting in the generation of its paraphrases. These lexical changes are accompanied by changes in the structure of the original sentence or phrase, which are characterized from the structural perspective. We discuss both perspectives in detail below.
3.2.1 Lexical Perspective
The lexical perspective presents the various lexical changes that can take place in a sentence or a phrase while retaining its approximate meaning (semantics).
1. Synonym substitution: Replacing a word or a phrase by a synonymous word or phrase, in the appropriate context, results in a paraphrase of the original sentence or phrase. This category covers near-synonymy, that is, it allows for changes in evaluation, connotation, etc., of words or phrases between paraphrases. This category also covers the special case of genitives, where the clitic “'s” is replaced by other genitive indicators like “of”, “of the”, etc.
Accompanying structural changes: Substitution.
Example:
Google bought YouTube. ⇔ Google acquired YouTube.
Mary is slim. ⇔ Mary is skinny.
2. Actor/Action substitution: Replacing the name of an action by a word or phrase denoting the person doing the action (actor) and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
I dislike rash drivers. ⇔ I dislike rash driving.
3. Manipulator/Device substitution: Replacing the name of a device by a word or phrase denoting the person using the device (manipulator) and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
The pilot took off despite the stormy weather. ⇔ The plane took off despite the stormy weather.
4. General/Specific substitution: Replacing a word or a phrase by a more general or more specific word or phrase, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words. Hypernym substitution is a part of this category. This often generates a quasi-paraphrase.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
I dislike rash drivers. ⇔ I dislike rash motorists.
John is flying in this weekend. ⇔ John is flying in this Saturday.
5. Metaphor substitution: Replacing a noun by its standard metaphorical use and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
I had to drive through fog to get here. ⇔ I had to drive through a wall of fog to get here.
Immigrants have used this network to send cash. ⇔ Immigrants have used this network to send stashes of cash.
6. Part/Whole substitution: Replacing a part by its corresponding whole and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
American airplanes pounded the Taliban defences. ⇔ American airforce pounded the Taliban defences.
7. Verb/“Semantic-role noun” substitution: Replacing a verb by a noun corresponding to the agent of the action, the patient of the action, the instrument used for the action, or the medium used for the action, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
John teaches Mary. ⇔ John is Mary's teacher.
John teaches Mary. ⇔ Mary has John as her teacher.
John teaches Mary. ⇔ Mary is John's student.
John teaches Mary. ⇔ John has Mary as his student.
John was batting. ⇔ John was wielding the bat.
John tiled his bathroom floor. ⇔ John installed tiles on his bathroom floor.
8. Antonym substitution: Replacing a word or phrase by its antonym accompanied by a negation, or by negating some other word, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
John ate. ⇔ John did not starve.
I lost interest in the endeavor. ⇔ I developed disinterest in the endeavor.
9. Pronoun/“Referenced noun” substitution: Replacing a pronoun by the noun it refers to results in a paraphrase of the original sentence or phrase.
Accompanying structural changes: Substitution.
Example:
John likes Mary, because she is pretty. ⇔ John likes Mary, because Mary is pretty.
10. Verb/Noun conversion: Replacing a verb by its corresponding nominalized noun form and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
The police interrogated the suspects. ⇔ The police subjected the suspects to an interrogation.
The virus spread over a period of two weeks. ⇔ Two weeks saw a spreading of the virus.
11. Verb/Adjective conversion: Replacing a verb by the corresponding adjective form and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
John loves Mary. ⇔ Mary is loveable to John.
12. Verb/Adverb conversion: Replacing a verb by its corresponding adverb form and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
John boasted about his work. ⇔ John spoke boastfully about his work.
13. Noun/Adjective conversion: Replacing a noun by its corresponding adjective form and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
I'll fly by the end of June. ⇔ I'll fly late June.
14. Converse substitution: Replacing a word or a phrase with its converse and inverting the relationship between the constituents of a sentence or phrase, in the appropriate context, results in a paraphrase of the original sentence or phrase, presenting the situation from the converse perspective. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
Google bought YouTube. ⇔ YouTube was sold to Google.
15. “Verb and preposition denoting location”/“Noun denoting location” substitution: Replacing a verb and a preposition denoting location by a noun denoting the location and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
The finalists are playing in the Giants stadium. ⇔ The Giants stadium is the playground for the finalists.
16. Change of voice: Changing a verb from its active to passive form and vice
versa results in a paraphrase of the original sentence or phrase. This change
may be accompanied by the addition/deletion of appropriate function words and
sentence restructuring. This often generates one of the most strictly
meaning-preserving paraphrases.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
John loves Mary. ⇔ Mary is loved by John.
This building stores the excess items. ⇔ Excess items are stored in this building.
17. Change of tense: Changing the tense of a verb, in the appropriate context,
results in a paraphrase of the original sentence or phrase. This change may be
accompanied by the addition/deletion of appropriate function words. This often
generates a quasi-paraphrase.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
John loved Mary. ⇔ John loves Mary.
18. Change of aspect: Changing the aspect of a verb, in the appropriate
context, results in a paraphrase of the original sentence or phrase. This
change may be accompanied by the addition/deletion of appropriate function
words.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
John is flying in today. ⇔ John flies in today.
19. Change of modality: Addition/deletion of a modal or substitution of one
modal by another, in the appropriate context, results in a paraphrase of the
original sentence or phrase. This change may be accompanied by the
addition/deletion of appropriate function words. This often generates a
quasi-paraphrase.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
Google must buy YouTube. ⇔ Google bought YouTube.
The government wants to boost the economy. ⇔ The government hopes to boost
the economy.
20. Change of person: Changing the grammatical person of a referenced object
results in a paraphrase of the original sentence or phrase. This change may be
accompanied by the addition/deletion of appropriate function words. This often
generates one of the most strictly meaning-preserving paraphrases.
Accompanying structural changes: Substitution, Addition/Deletion.
Example:
John said “I like football”. ⇔ John said that he liked football.
21. Repetition/Ellipsis: Ellipsis or elliptical construction results in a
paraphrase of the original sentence or phrase.
Accompanying structural changes: Addition/Deletion.
Example:
John can run fast and Mary can run fast, too. ⇔ John can run fast and Mary
can, too.
John can eat three apples and Mary can eat two apples. ⇔ John can eat three
apples and Mary can eat two.
22. Semantic implication: Replacing a word or a phrase denoting an action,
event, etc. by a word or phrase denoting its possible future effect, in the
appropriate context, results in a paraphrase of the original sentence or
phrase. This may be accompanied by the addition/deletion of appropriate
function words and sentence restructuring. This often generates a
quasi-paraphrase.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
Google is in talks to buy YouTube. ⇔ Google bought YouTube.
The Marines are fighting the terrorists. ⇔ The Marines are eliminating the
terrorists.
23. Approximate numerical equivalences: Replacing a numerical expression (a
word or a phrase denoting a number) by an approximately equivalent numerical
expression, in the appropriate context, results in a paraphrase of the
original sentence or phrase. This often generates a quasi-paraphrase.
Accompanying structural changes: Substitution.
Example:
At least 23 US soldiers were killed in Iraq last month. ⇔ At least 26 US
soldiers were killed in Iraq last month.
Disneyland is over 30 miles from my place. ⇔ Disneyland is around 32 miles
from my place.
24. Function word variations: Changing the function words in a sentence or a
phrase without affecting its semantics, in the appropriate context, results in
a paraphrase of the original sentence or phrase. This can involve replacing a
light verb by another light verb, replacing a light verb by a copula,
replacing a preposition by another preposition, replacing a determiner by
another determiner, replacing a determiner by a preposition and vice versa,
and addition/removal of a preposition and/or a determiner.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
Results of the competition have been declared. ⇔ Results for the competition
have been declared.
John showed a nice demo. ⇔ John's demo was nice.
25. External knowledge: Replacing a word or a phrase by another word or phrase
based on extra-linguistic (world) knowledge, in the appropriate context,
results in a paraphrase of the original sentence or phrase. This may be
accompanied by the addition/deletion of appropriate function words and
sentence restructuring. This often generates a quasi-paraphrase, although in
some cases preserving meaning exactly.
Accompanying structural changes: Substitution, Addition/Deletion, Permutation.
Example:
We must work hard to win this election. ⇔ The Democrats must work hard to
win this election.
3.2.2 Structural Perspective
The structural perspective presents the various structural changes that take
place in a sentence or phrase, at the surface level, as a result of the
above-mentioned lexical changes. These structural changes are crucial for the
preservation of meaning across sentences or phrases.
1. Substitution: Replacing a word or phrase in the original sentence or
phrase by another word or phrase in its paraphrase is called substitution.
Example:
Excess items are stored in this building. ⇔ This building stores the excess items.
2. Addition/Deletion: Addition of an extra word in the paraphrase, such that
it does not have a corresponding word or phrase in the original sentence or
phrase, is called addition. Removal of such an extra word or phrase is called
deletion. In paraphrasing, addition of a word to a sentence or phrase is
equivalent to deletion of that word from its paraphrase.
Example:
Excess items are stored in this building. ⇔ This building stores the excess items.
3. Permutation: Changing the order of words or phrases in the paraphrase,
such that the corresponding words or phrases in the original sentence or
phrase have a different relative order, is called permutation.
Example:
Excess items are stored in this building. ⇔ This building stores the excess items.
3.3 Analysis of Paraphrases
In Section 3.2, we presented the lexical and structural perspectives which
together explain quasi-paraphrases. In this section, we seek to validate the
scope and accuracy of the changes from each of the perspectives. To validate
the list of suggested changes from the lexical perspective, we show analysis
using two criteria:
1. Distribution of lexical changes: What is the distribution of each of these
lexical changes in a random paraphrase corpus?
2. Human judgement of lexical changes: If one uses each of the lexical
changes, naively, on applicable sentences, how often does each of these
changes generate acceptable quasi-paraphrases?
To estimate the prevalence of the structural changes, we show their
sentence-level distributions for paraphrases.
3.3.1 Distribution of Lexical Changes
In this section, we explain the procedure we used to measure the distribution
of the changes that define paraphrases from the lexical perspective:
1. We downloaded paraphrases from portions of two publicly available data
sets containing sentence-level paraphrases: the Multiple-Translations Corpus
(MTC) (Huang et al., 2002) and the Microsoft Research (MSR) paraphrase corpus
(Dolan et al., 2004). These data sets contain pairs of sentences that are
deemed to be paraphrases of each other by annotators. The paraphrase pairs
come with their equivalent parts manually aligned (Cohn et al., 2008), thus
also giving phrase-level paraphrases.
2. We selected 30 sentence-level paraphrase pairs from each of these corpora
at random and extracted the corresponding aligned and unaligned phrases. We
assume that any unaligned phrase is paired with a null phrase. This resulted
in 210 phrase pairs for the MTC corpus and 145 phrase pairs for the MSR
corpus.
3. We labeled each of the phrase pairs with the appropriate lexical changes
from Section 3.2.1. If any phrase pair could not be labeled by a lexical
change from Section 3.2.1, we labeled it as unknown.
4. We finally calculated the distribution of each label (lexical change),
over all the labels, for each corpus.
3.3.2 Human Judgement of Lexical Changes
In this section, we explain the procedure we used to obtain the human
judgement of the changes that define paraphrases from the lexical
perspective:
1. We randomly selected two words or phrases, from publicly available
resources (depending on the lexical change), for each of the lexical changes
from Section 3.2.1 (except “external knowledge”).
2. For each selected word or phrase, we obtained five random sentences from
the Gigaword corpus. These sentences were manually checked to make sure that
they contain the intended sense of the word or phrase. This gave us a total
of 10 sentences for each phenomenon. For the phenomenon of external
knowledge, we randomly sampled a total of 10 sentence pairs from the MTC and
MSR corpora, such that the pairs were paraphrases based on external
knowledge.
3. For each sentence (except the sentences for the “external knowledge”
category), we applied the corresponding lexical changes to the word or phrase
selected in step 1. The lexical changes were applied in a naive way: the word
or phrase from step 1 was replaced by the corresponding word or phrase
depending on the applicable lexical change; the words in the new sentence
were allowed to be reordered (permuted) if needed; only function words (and
no content words) were allowed to be added to the new sentence if needed.
4. We gave the sentence pairs to two annotators and asked them to annotate
them as paraphrases and non-paraphrases.
5. We calculated the precision percentage for each lexical change as the
average of the precision scores obtained from the two annotations. We also
calculated the kappa statistic (Siegal & Castellan Jr., 1988) to measure the
inter-annotator agreement.
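The agreement statistic in step 5 can be sketched as follows. This is the standard two-annotator (Cohen-style) kappa computation, not the actual code used for the dissertation, and the labels below are invented:

```python
from collections import Counter

def kappa(ann1, ann2):
    """Two-annotator kappa: (observed - chance agreement) / (1 - chance)."""
    n = len(ann1)
    p_obs = sum(a == b for a, b in zip(ann1, ann2)) / float(n)
    # Chance agreement from each annotator's marginal label frequencies.
    c1, c2 = Counter(ann1), Counter(ann2)
    p_chance = sum((c1[l] / float(n)) * (c2[l] / float(n))
                   for l in set(ann1) | set(ann2))
    return (p_obs - p_chance) / (1 - p_chance)

a1 = ['para', 'para', 'non-para', 'para', 'non-para']
a2 = ['para', 'non-para', 'non-para', 'para', 'non-para']
k = kappa(a1, a2)  # observed 0.8, chance 0.48, so k is roughly 0.615
```

A score of 1 means perfect agreement and 0 means agreement no better than chance, so the κ = 0.66 reported in Section 3.3.4 indicates substantial agreement.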
3.3.3 Sentence-level Distribution of Structural Changes
In this section, we explain the procedure we used to measure the
sentence-level distributions of the changes from the structural perspective
that define paraphrases:
1. We obtained the 60 sentence-level paraphrase pairs (30 from each corpus)
used in Section 3.3.1.
2. We labeled each of the selected paraphrase pairs with the appropriate
structural change from Section 3.2.2¹ (a paraphrase pair can have multiple
structural changes).
3. We finally calculated the distribution of each structural change, over the
selected sentence pairs, for each corpus.
3.3.4 Results
This section shows the results of the analysis of the lexical and structural
changes of paraphrases carried out using the methodologies described in
Sections 3.3.1, 3.3.2, and 3.3.3. The corpus for calculating precision from
Section 3.3.2 was annotated by two annotators. A kappa score of κ = 0.66 was
obtained on the annotation task. The recall, precision, and percentage
distribution were calculated as follows:

Distribution of a lexical change =
(# of correct phrase-level quasi-paraphrase pairs containing the lexical change × 100) /
(# of correct phrase-level quasi-paraphrase pairs in the corpus)
¹ Since we know that these are the only possible string operations, we do not
use the label “unknown” here.
Human judgement of a lexical change =
(# of correct sentence-level quasi-paraphrase pairs containing the lexical change × 100) /
(# of sentence pairs containing the lexical change in the corpus)

Distribution of a structural change =
(# of correct sentence-level quasi-paraphrase pairs containing the structural change × 100) /
(# of correct sentence-level quasi-paraphrase pairs in the corpus)
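Concretely, each of these measures reduces to a simple ratio of counts over labeled pairs. The following sketch uses invented labels and verdicts, not the dissertation's data:

```python
def pct(numerator, denominator):
    """A count expressed as a percentage of a total."""
    return 100.0 * numerator / denominator

# Distribution of a lexical change over a labeled phrase-pair corpus:
labels = ['synonym', 'function-word', 'synonym', 'tense', 'synonym']
dist_synonym = pct(labels.count('synonym'), len(labels))  # 60.0

# Human judgement of a lexical change: correct pairs over all pairs
# containing that change (one annotator's verdicts shown).
judged = [True, True, False, True]
precision = pct(sum(judged), len(judged))  # 75.0
```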
Tables 3.1 and 3.2 show the percentage recall of lexical changes in the MTC
and MSR corpora respectively; Table 3.3 shows the percentage precision of
lexical changes in the sentence-level paraphrases test corpus we created;
Tables 3.4 and 3.5 show the percentage distribution of structural changes in
the MTC and MSR corpora respectively.
3.4 Conclusion
A definition of what phenomena constitute paraphrases and what do not has
been a problem in the past. While some people have used a very narrow
interpretation of paraphrases (paraphrases are exactly logically synonymous),
others have taken broader perspectives which consider even semantic
implications as paraphrases. To the best of
 #   Category                                                    % Distribution
 1.  Synonym substitution                                        36.67%
 2.  Actor/Action substitution                                    0.00%
 3.  Manipulator/Device substitution                              0.00%
 4.  General/Specific substitution                                4.29%
 5.  Metaphor substitution                                        0.00%
 6.  Singulative/Collective substitution                          0.00%
 7.  Verb/“Semantic-role noun” substitution                       0.48%
 8.  Antonym substitution                                         0.00%
 9.  Pronoun/“Referenced noun” substitution                       0.95%
10.  Verb/Noun conversion                                         1.90%
11.  Verb/Adjective conversion                                    0.48%
12.  Verb/Adverb conversion                                       0.00%
13.  Noun/Adjective conversion                                    0.00%
14.  Converse substitution                                        0.48%
15.  “Verb and preposition denoting location”/“Noun denoting
     location” substitution                                       0.00%
16.  Change of voice                                              1.43%
17.  Change of tense                                              3.81%
18.  Change of aspect                                             0.95%
19.  Change of modality                                           0.95%
20.  Change of person                                             0.00%
21.  Repetition/Ellipsis                                          3.81%
22.  Semantic implication                                         0.95%
23.  Approximate numerical equivalences                           0.00%
24.  Function word variations                                    36.67%
25.  External knowledge                                           6.19%
26.  Unknown                                                      0.00%
     Total                                                      100.00%
Table 3.1: Distribution of lexical changes in MTC paraphrase set
 #   Category                                                    % Distribution
 1.  Synonym substitution                                        18.62%
 2.  Actor/Action substitution                                    0.00%
 3.  Manipulator/Device substitution                              0.00%
 4.  General/Specific substitution                                2.76%
 5.  Metaphor substitution                                        0.69%
 6.  Singulative/Collective substitution                          0.00%
 7.  Verb/“Semantic-role noun” substitution                       0.00%
 8.  Antonym substitution                                         0.00%
 9.  Pronoun/“Referenced noun” substitution                       0.69%
10.  Verb/Noun conversion                                         2.76%
11.  Verb/Adjective conversion                                    0.00%
12.  Verb/Adverb conversion                                       0.00%
13.  Noun/Adjective conversion                                    0.00%
14.  Converse substitution                                        0.00%
15.  “Verb and preposition denoting location”/“Noun denoting
     location” substitution                                       0.00%
16.  Change of voice                                              0.69%
17.  Change of tense                                              1.38%
18.  Change of aspect                                             0.00%
19.  Change of modality                                           0.00%
20.  Change of person                                             0.69%
21.  Repetition/Ellipsis                                          4.14%
22.  Semantic implication                                         4.14%
23.  Approximate numerical equivalences                           2.07%
24.  Function word variations                                    29.66%
25.  External knowledge                                          31.72%
26.  Unknown                                                      0.00%
     Total                                                      100.00%
Table 3.2: Distribution of lexical changes in MSR paraphrase set
 #   Category                                                    % Accuracy
 1.  Synonym substitution                                        95.00%
 2.  Actor/Action substitution                                   75.00%
 3.  Manipulator/Device substitution                             30.00%
 4.  General/Specific substitution                               80.00%
 5.  Metaphor substitution                                       60.00%
 6.  Singulative/Collective substitution                         65.00%
 7.  Verb/“Semantic-role noun” substitution                      60.00%
 8.  Antonym substitution                                        65.00%
 9.  Pronoun/“Referenced noun” substitution                      70.00%
10.  Verb/Noun conversion                                       100.00%
11.  Verb/Adjective conversion                                   55.00%
12.  Verb/Adverb conversion                                      65.00%
13.  Noun/Adjective conversion                                   80.00%
14.  Converse substitution                                       75.00%
15.  “Verb and preposition denoting location”/“Noun denoting
     location” substitution                                      65.00%
16.  Change of voice                                             85.00%
17.  Change of tense                                             70.00%
18.  Change of aspect                                            95.00%
19.  Change of modality                                          80.00%
20.  Change of person                                            80.00%
21.  Repetition/Ellipsis                                        100.00%
22.  Semantic implication                                        70.00%
23.  Approximate numerical equivalences                          95.00%
24.  Function word variations                                    85.00%
25.  External knowledge                                          95.00%
Table 3.3: Human judgement of lexical changes
 #   Category             % Distribution
 1.  Substitution         93.33%
 2.  Addition/Deletion    90.00%
 3.  Permutation          60.00%
Table 3.4: Distribution of structural changes in MTC paraphrase set
 #   Category             % Distribution
 1.  Substitution         96.67%
 2.  Addition/Deletion    96.67%
 3.  Permutation          40.00%
Table 3.5: Distribution of structural changes in MSR paraphrase set
our knowledge, outside of specific language interpretation frameworks (like
Meaning Text Theory (Mel'cuk, 1996)), no one has created a general,
exhaustive list of changes that define paraphrases. In this chapter we have
created such a list. We have also tried to empirically quantify the
distribution and accuracy of the list. It is notable that certain types of
changes dominate while some other changes are very rare. However, it is also
observed that the dominating changes vary based on the type of paraphrase
corpus used, thus indicating the variety exhibited by the paraphrases. Based
on the large variety of possible changes that can generate paraphrases, it
seems likely that the kinds of paraphrases that are deemed useful would
depend on the application at hand. This might motivate the creation of
application-specific lists of the kinds of allowable paraphrases and the
development of automatic methods to distinguish the different kinds of
paraphrases.
Chapter 4
Inferential Selectional Preferences
4.1 Introduction
Semantic inference is a key component for advanced natural language
understanding. Several important applications are already relying heavily on
inference, including question answering (Moldovan et al., 2003; Harabagiu &
Hickl, 2006), information extraction (Romano et al., 2006), and textual
entailment (Szpektor et al., 2004).
In response, several researchers have created resources for enabling semantic
inference. Among manual resources used for this task are WordNet (Fellbaum,
1998) and Cyc (Lenat, 1995). Although important and useful, these resources
primarily contain prescriptive paraphrase rules such as “X acquired Y” ⇔ “X
bought Y”. In practical NLP applications, however, quasi-paraphrase rules
such as “X is charged by Y” ⇔ “Y announced the arrest of X” are very useful.
This, along with the difficulty and labor-intensiveness of generating
exhaustive lists of rules, has led researchers to focus on automatic methods
for building inference resources such as quasi-paraphrase rule collections
(Lin & Pantel, 2001; Szpektor et al., 2004) and quasi-paraphrase collections
(Barzilay & McKeown, 2001).
Using these resources in applications has been hindered by the large amount
of incorrect inferences they generate, either because of altogether incorrect
rules or because of blind application of plausible rules without considering
the context of the relations or the senses of the words. For example,
consider the following sentence:
Terry Nichols was charged by federal prosecutors for murder and conspiracy
in the Oklahoma City bombing.
and a quasi-paraphrase rule such as:
X is charged by Y ⇔ Y announced the arrest of X    (1)
Using this rule, we can infer that “federal prosecutors announced the arrest
of Terry Nichols”. However, given the sentence:
Fraud was suspected when accounts were charged by CCM telemarketers
without obtaining consumer authorization.
the plausible quasi-paraphrase rule (1) would incorrectly infer that “CCM
telemarketers announced the arrest of accounts”.
This example depicts a major obstacle to the effective use of automatically
learned quasi-paraphrase rules. What is missing is knowledge about the
admissible argument values for which the phrases are synonymous, i.e., for
which a quasi-paraphrase rule holds, which we call Inferential Selectional
Preferences. For example, quasi-paraphrase rule (1) should only be applied if
X is a Person and Y is a Law Enforcement Agent or a Law Enforcement Agency.
This knowledge does not guarantee that the quasi-paraphrase rule will hold,
but, as we show here, goes a long way toward filtering out erroneous
applications of rules.
In this chapter, we introduce ISP, a collection of methods for learning
inferential selectional preferences and filtering out incorrect inferences.
The presented algorithms apply to any collection of quasi-paraphrase rules
between binary semantic relations, such as example (1). ISP derives
inferential selectional preferences by aggregating statistics of
quasi-paraphrase rule instantiations over a large corpus of text. Within ISP,
we explore different probabilistic models of selectional preference to accept
or reject specific inferences. We present empirical evidence to support the
following main contribution:
Claim: Inferential selectional preferences can be automatically learned and
used for effectively filtering out incorrect inferences.
4.2 Selectional Preference Models
The aim of this chapter is to learn inferential selectional preferences for
filtering quasi-paraphrase rules.
Let p_i ⇔ p_j be a quasi-paraphrase rule where p is a binary semantic
relation between two entities x and y. Let ⟨x, p, y⟩ be an instance of
relation p.
Formal task definition: Given a quasi-paraphrase rule p_i ⇔ p_j and the
instance ⟨x, p_i, y⟩, our task is to determine if ⟨x, p_j, y⟩ is valid.
Consider the example in Section 4.1 where we have the quasi-paraphrase rule
“X is charged by Y” ⇔ “Y announced the arrest of X”. Our task is to
automatically determine that federal prosecutors announced the arrest of
Terry Nichols (i.e., ⟨Terry Nichols, p_j, federal prosecutors⟩) is valid but
that CCM telemarketers announced the arrest of accounts is invalid.
Because the semantic relations p are binary, the selectional preferences on
their two arguments may be either considered jointly or independently. For
example, the relation p = “X is charged by Y” could have joint SPs:
⟨Person, Law Enforcement Agent⟩
⟨Person, Law Enforcement Agency⟩    (2)
⟨Bank Account, Organization⟩
or independent SPs:
⟨Person, *⟩
⟨*, Organization⟩    (3)
⟨*, Law Enforcement Agent⟩
This distinction between joint and independent selectional preferences
constitutes the difference between the two models we present in this section.
The remainder of this section describes the ISP approach. In Section 4.2.1,
we describe methods for automatically determining the semantic contexts of
each single relation's selectional preferences. Section 4.2.2 uses these for
developing our inferential selectional preference models. Finally, we present
inference filtering algorithms in Section 4.2.3.
4.2.1 Relational Selectional Preferences
Resnik (1996) defined the selectional preferences of a predicate as the
semantic classes of the words that appear as its arguments. Similarly, we
define the relational selectional preferences of a binary semantic relation
p_i as the semantic classes C(x) of the words that can be instantiated for x
and as the semantic classes C(y) of the words that can be instantiated for y.
The semantic classes C(x) and C(y) can be obtained from a conceptual
taxonomy as proposed in Resnik (1996), such as WordNet, or from the classes
extracted from a word clustering algorithm such as CBC (Pantel & Lin, 2002).
For example, given the relation “X is charged by Y”, its relational
selectional preferences from WordNet could be {social group, organism,
state} for X and {authority, state, section} for Y.
Below we propose joint and independent models, based on a corpus analysis,
for automatically determining relational selectional preferences.
4.2.1.1 Joint Relational Model (JRM)
Our joint model uses a corpus analysis to learn SPs for binary semantic
relations by considering their arguments jointly, as in example (2). Given a
large corpus of English text, we first find the occurrences of each semantic
relation p. For each instance ⟨x, p, y⟩, we retrieve the sets C(x) and C(y)
of the semantic classes that x and y belong to and accumulate the
frequencies of the triples ⟨c(x), p, c(y)⟩, where c(x) ∈ C(x) and
c(y) ∈ C(y).
Each triple ⟨c(x), p, c(y)⟩ is a candidate selectional preference for p.
Candidates can be incorrect when: a) they were generated from the incorrect
sense of a polysemous word; or b) p does not hold for the other words in the
semantic class.
Intuitively, we have more confidence in a particular candidate if its
semantic classes are closely associated given the relation p. Pointwise
mutual information (Cover &
Thomas, 1991) is a commonly used metric for measuring this association
strength between two events e_1 and e_2:

pmi(e_1; e_2) = log [ P(e_1, e_2) / (P(e_1) P(e_2)) ]    (4.1)
We define our ranking function as the strength of association between two
semantic classes, c(x) and c(y), given the relation p:

pmi(c(x)|p; c(y)|p) = log [ P(c(x), c(y)|p) / (P(c(x)|p) P(c(y)|p)) ]    (4.2)
Let |c(x), p, c(y)| denote the frequency of observing the instance
⟨c(x), p, c(y)⟩. We estimate the probabilities of Equation 4.2 using maximum
likelihood estimates over our corpus:

P(c(x)|p) = |c(x), p, ∗| / |∗, p, ∗|
P(c(y)|p) = |∗, p, c(y)| / |∗, p, ∗|
P(c(x), c(y)|p) = |c(x), p, c(y)| / |∗, p, ∗|    (4.3)
Similarly to Resnik (1996), we estimate the above frequencies using:

|c(x), p, ∗| = Σ_{w ∈ c(x)} |w, p, ∗| / |C(w)|
|∗, p, c(y)| = Σ_{w ∈ c(y)} |∗, p, w| / |C(w)|
|c(x), p, c(y)| = Σ_{w1 ∈ c(x), w2 ∈ c(y)} |w1, p, w2| / (|C(w1)| × |C(w2)|)    (4.4)
where |x, p, y| denotes the frequency of observing the instance ⟨x, p, y⟩
and |C(w)| denotes the number of classes to which word w belongs. |C(w)|
distributes w's mass equally to all of its senses c(w).
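A minimal sketch of the JRM scoring above (Equations 4.2-4.4), with a toy word-to-class lookup standing in for WordNet or CBC; all words and class names here are illustrative, not learned values:

```python
from collections import Counter
from itertools import product
from math import log

# Toy stand-in for the WordNet/CBC class lookup.
classes = {'nichols': ['Person'], 'smith': ['Person'],
           'prosecutors': ['LawEnfAgent'], 'police': ['LawEnfAgent'],
           'acct': ['BankAccount'], 'ccm': ['Organization']}

def jrm_scores(instances):
    """Score candidate SPs for one relation p: pmi(c(x)|p; c(y)|p) of
    Equation 4.2, with frequencies estimated as in Equations 4.3-4.4."""
    joint, left, right = Counter(), Counter(), Counter()
    total = float(len(instances))        # |*, p, *|
    for x, y in instances:
        cx, cy = classes[x], classes[y]
        for c in cx:                     # |c(x), p, *|
            left[c] += 1.0 / len(cx)     # split each word's mass over its senses
        for c in cy:                     # |*, p, c(y)|
            right[c] += 1.0 / len(cy)
        for c1, c2 in product(cx, cy):   # |c(x), p, c(y)|
            joint[c1, c2] += 1.0 / (len(cx) * len(cy))
    return {(c1, c2): log((n / total) / ((left[c1] / total) * (right[c2] / total)))
            for (c1, c2), n in joint.items()}

scores = jrm_scores([('nichols', 'prosecutors'), ('smith', 'police'), ('acct', 'ccm')])
# scores[('Person', 'LawEnfAgent')] is log((2/3) / ((2/3) * (2/3))), i.e. log 1.5
```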
4.2.1.2 Independent Relational Model (IRM)
Because of sparse data, our joint model can miss some correct selectional
preference pairs. For example, given the relation
Y announced the arrest of X
we may find occurrences from our corpus of the particular class “Money
Handler” for X and “Lawyer” for Y; however, we may never see both of these
classes co-occurring even though they would form a valid relational
selectional preference.
To alleviate this problem, we propose a second model that is less strict by
considering the arguments of the binary semantic relations independently, as
in example (3).
Similarly to JRM, we extract each instance ⟨x, p, y⟩ of each semantic
relation p and retrieve the set of semantic classes C(x) and C(y) that x and
y belong to, accumulating the frequencies of the triples ⟨c(x), p, ∗⟩ and
⟨∗, p, c(y)⟩, where c(x) ∈ C(x) and c(y) ∈ C(y).
All tuples ⟨c(x), p, ∗⟩ and ⟨∗, p, c(y)⟩ are candidate selectional
preferences for p. We rank candidates by the probability of the semantic
class given the relation p, according to Equation 4.3.
4.2.2 Inferential Selectional Preferences
Whereas in Section 4.2.1 we learned selectional preferences for the
arguments of a relation p, in this section we learn selectional preferences
for the arguments of a quasi-paraphrase rule p_i ⇔ p_j.
4.2.2.1 Joint Inferential Model (JIM)
Given a quasi-paraphrase rule p_i ⇔ p_j, our joint model defines the set of
inferential SPs as the intersection of the relational SPs for p_i and p_j,
as defined in the Joint Relational Model (JRM). For example, suppose
relation p_i = “X is charged by Y” gives the following SP scores under the
JRM:
⟨Person, p_i, Law Enforcement Agent⟩ = 1.45
⟨Person, p_i, Law Enforcement Agency⟩ = 1.21
⟨Bank Account, p_i, Organization⟩ = 0.97
and that p_j = “Y announced the arrest of X” gives the following SP scores
under the JRM:
⟨Law Enforcement Agent, p_j, Person⟩ = 2.01
⟨Reporter, p_j, Person⟩ = 1.98
⟨Law Enforcement Agency, p_j, Person⟩ = 1.61
The intersection of the two sets of SPs forms the candidate inferential SPs
for the rule p_i ⇔ p_j:
⟨Law Enforcement Agent, Person⟩
⟨Law Enforcement Agency, Person⟩
We rank the candidate inferential SPs according to three ways to combine
their relational SP scores, using the minimum, maximum, and average of the
SPs. For example, for ⟨Law Enforcement Agent, Person⟩, the respective scores
would be 1.45, 2.01, and 1.73. These different ranking strategies produced
nearly identical results in our experiments, as discussed in Section 4.4.
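The JIM intersection and score combination can be sketched directly on the worked example above; one assumption in this sketch is that the SP tuples have been normalized to the rule's (X, Y) argument order, so p_j's ⟨Law Enforcement Agent, p_j, Person⟩ is stored under the key (Person, LawEnforcementAgent):

```python
sp_pi = {('Person', 'LawEnforcementAgent'): 1.45,
         ('Person', 'LawEnforcementAgency'): 1.21,
         ('BankAccount', 'Organization'): 0.97}
sp_pj = {('Person', 'LawEnforcementAgent'): 2.01,
         ('Person', 'Reporter'): 1.98,
         ('Person', 'LawEnforcementAgency'): 1.61}

def jim(sp_i, sp_j, combine=min):
    """Intersect the two relational SP sets and combine their scores."""
    return {c: combine(sp_i[c], sp_j[c]) for c in sp_i.keys() & sp_j.keys()}

low = jim(sp_pi, sp_pj, combine=min)
avg = jim(sp_pi, sp_pj, combine=lambda a, b: (a + b) / 2.0)
# ('Person', 'LawEnforcementAgent'): min = 1.45, avg is roughly 1.73
```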
4.2.2.2 Independent Inferential Model (IIM)
Our independent model is the same as the joint model above except that it
computes candidate inferential SPs using the Independent Relational Model
(IRM) instead of the JRM. Consider the same example relations p_i and p_j
from the joint model and suppose that the IRM gives the following relational
SP scores for p_i:
⟨Law Enforcement Agent, p_i, ∗⟩ = 0.63
⟨∗, p_i, Person⟩ = 0.27
⟨∗, p_i, Organization⟩ = 0.21
and the following relational SP scores for p_j:
⟨∗, p_j, Person⟩ = 0.57
⟨Law Enforcement Agent, p_j, ∗⟩ = 0.32
⟨Reporter, p_j, ∗⟩ = 0.26
The intersection of the two sets of SPs forms the candidate inferential SPs
for the inference p_i ⇔ p_j:
⟨Law Enforcement Agent, ∗⟩
⟨∗, Person⟩
We use the same minimum, maximum, and average ranking strategies as in JIM.
4.2.3 Filtering Inferences
Given a quasi-paraphrase rule p_i ⇔ p_j and the instance ⟨x, p_i, y⟩, the
system's task is to determine whether ⟨x, p_j, y⟩ is valid. Let C(w) be the
set of semantic classes c(w) to which word w belongs. Below we present three
filtering algorithms which range from the least to the most permissive:
• ISP.JIM accepts the inference ⟨x, p_j, y⟩ if the inferential SP
⟨c(x), p_j, c(y)⟩ was admitted by the Joint Inferential Model for some
c(x) ∈ C(x) and c(y) ∈ C(y).
• ISP.IIM.∧ accepts the inference ⟨x, p_j, y⟩ if the inferential SPs
⟨c(x), p_j, ∗⟩ AND ⟨∗, p_j, c(y)⟩ were admitted by the Independent
Inferential Model for some c(x) ∈ C(x) and c(y) ∈ C(y).
• ISP.IIM.∨ accepts the inference ⟨x, p_j, y⟩ if the inferential SP
⟨c(x), p_j, ∗⟩ OR ⟨∗, p_j, c(y)⟩ was admitted by the Independent Inferential
Model for some c(x) ∈ C(x) and c(y) ∈ C(y).
Since both JIM and IIM use a ranking score in their inferential SPs, each
filtering algorithm can be tuned to be more or less strict by setting an
acceptance threshold on the ranking scores or by selecting only the top τ
percent highest-ranking SPs. In our experiments, reported in Section 4.4, we
tested each model using various values of τ.
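The three filters can be sketched as set operations over admitted SPs. The class lookup and admitted-SP sets below are toy stand-ins following the running example, not learned values:

```python
from itertools import product

# Toy word-to-class lookup and admitted inferential SPs for
# p_j = "Y announced the arrest of X" (all names illustrative).
classes = {'Terry Nichols': {'Person'},
           'federal prosecutors': {'LawEnforcementAgent'},
           'accounts': {'BankAccount'},
           'CCM telemarketers': {'Organization'}}
jim_sps = {('Person', 'LawEnforcementAgent')}  # admitted <c(x), p_j, c(y)>
iim_x = {'Person'}                             # admitted <c(x), p_j, *>
iim_y = {'LawEnforcementAgent'}                # admitted <*, p_j, c(y)>

def isp_jim(x, y):
    """Accept if some joint class pair of (x, y) was admitted by JIM."""
    return any(pair in jim_sps for pair in product(classes[x], classes[y]))

def isp_iim_and(x, y):
    """Accept only if both argument slots were admitted independently."""
    return bool(classes[x] & iim_x) and bool(classes[y] & iim_y)

def isp_iim_or(x, y):
    """Most permissive: accept if either argument slot was admitted."""
    return bool(classes[x] & iim_x) or bool(classes[y] & iim_y)

accept = isp_jim('Terry Nichols', 'federal prosecutors')  # True
reject = isp_jim('accounts', 'CCM telemarketers')         # False
```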
4.3 Experimental Methodology
This section describes the methodology for testing our claim that
inferential selectional preferences can be learned to filter incorrect
inferences.
Given a collection of quasi-paraphrase rules of the form p_i ⇔ p_j, our task
is to determine whether a particular instance ⟨x, p_j, y⟩ holds given that
⟨x, p_i, y⟩ holds. In the next sections, we describe our collection of
quasi-paraphrase rules, the semantic classes used for forming selectional
preferences, and evaluation criteria for measuring the filtering quality.
4.3.1 Quasi-paraphrase Rules
Our models for learning inferential selectional preferences can be applied
to any collection of quasi-paraphrase rules between binary semantic
relations. In our work, we focus on the quasi-paraphrase rules contained in
the DIRT resource (Lin & Pantel, 2001). DIRT consists of over 12 million
rules which were extracted from a 1GB newspaper corpus (San Jose Mercury,
Wall Street Journal and AP Newswire from the TREC-9 collection). For
example, here are DIRT's top 3 quasi-paraphrase rules for “X solves Y”:
“Y is solved by X”, “X resolves Y”, “X finds a solution to Y”
4.3.2 Semantic Classes
The choice of semantic classes is of great importance for selectional
preference. One important aspect is the granularity of the classes. Too
general a class will provide no discriminatory power, while too fine-grained
a class will offer little generalization and apply in only extremely few
cases.
The absence of an attested high-quality set of semantic classes for this
task makes discovering preferences difficult. Since many of the criteria for
developing such a set are not even known, we decided to experiment with two
very different sets of semantic classes, in the hope that in addition to
learning semantic preferences, we might also uncover some clues for the
eventual decisions about what makes good semantic classes in general.
Our first set of semantic classes was directly extracted from the output of
the CBC clustering algorithm (Pantel & Lin, 2002). We applied CBC to the
TREC-9 and TREC-2002 (Aquaint) newswire collections consisting of over 600
million words. CBC generated 1628 noun concepts and these were used as our
semantic classes for SPs.
Secondly, we extracted semantic classes from WordNet 2.1 (Fellbaum, 1998).
In the absence of any externally motivated distinguishing features (for
example, the Basic Level categories from Prototype Theory, developed by
Eleanor Rosch (1978)), we used the simple but effective method of manually
truncating the noun synset hierarchy and considering all synsets below each
cut point as part of the semantic class at that node. To select the cut
points, we inspected several different hierarchy levels and found the
synsets at a depth of 4 to form the most natural semantic classes. Since the
noun hierarchy in WordNet has an average depth of 12, our truncation created
a set of concepts considerably coarser-grained than WordNet itself. The cut
produced 1287 semantic classes, a number similar to the classes in CBC. To
properly test WordNet as a source of semantic classes for our selectional
preferences, we would need to experiment with different extraction
algorithms.
4.3.3 EvaluationCriteria
The goal of the filtering task is to minimize false positives (incorrectly accepted inferences) and false negatives (incorrectly rejected inferences). A standard methodology for evaluating such tasks is to compare system filtering results with a gold standard using a confusion matrix. A confusion matrix, such as Table 4.1, captures the filtering performance on both correct and incorrect inferences, where A represents the number of correct instances correctly identified by the system, D represents the number of incorrect instances correctly identified by the system, B represents the number of false positives, and C represents the number of false negatives.

                GOLD STANDARD
                  1       0
SYSTEM    1       A       B
          0       C       D

Table 4.1: Confusion matrix

To compare systems, three key measures are used to summarize confusion matrices:
• Sensitivity, defined as A / (A + C), captures a filter's probability of accepting correct inferences;

• Specificity, defined as D / (B + D), captures a filter's probability of rejecting incorrect inferences;

• Accuracy, defined as (A + D) / (A + B + C + D), captures the probability of a filter being correct.
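The three measures follow directly from the four confusion-matrix counts. A minimal sketch (the function and variable names are ours, not from the thesis):

```python
def confusion_metrics(a, b, c, d):
    """Summarize a filtering confusion matrix (Table 4.1).

    a: correct inferences accepted      b: false positives
    c: false negatives                  d: incorrect inferences rejected
    """
    sensitivity = a / (a + c)             # P(accept | inference is correct)
    specificity = d / (b + d)             # P(reject | inference is incorrect)
    accuracy = (a + d) / (a + b + c + d)  # P(filter decision is correct)
    return sensitivity, specificity, accuracy
```

For example, a filter that accepts all 50 correct inferences (a=50, c=0) but also wrongly accepts 10 of 50 incorrect ones (b=10, d=40) has sensitivity 1.0, specificity 0.8, and accuracy 0.9.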
4.4 Experimental Results
In this section, we provide empirical evidence to support the main claim of this paper. Given a collection of DIRT quasi-paraphrase rules of the form p_i ⇔ p_j, our experiments, using the methodology of Section 4.3, evaluate the capability of our ISP models for determining if ⟨x, p_j, y⟩ holds given that ⟨x, p_i, y⟩ holds.
4.4.1 Experimental Setup

4.4.1.1 Model Implementation
For each filtering algorithm in Section 4.2.3, ISP.JIM, ISP.IIM.∧, and ISP.IIM.∨, we trained their probabilistic models using corpus statistics extracted from the 1999 AP newswire collection (part of the TREC-2002 Aquaint collection) consisting of approximately 31 million words. We used the Minipar parser (Lin, 1994) to match DIRT patterns in the text. This permits exact matches since DIRT quasi-paraphrase rules are built from Minipar parse trees.

For each system, we experimented with the different ways of combining relational SP scores: minimum, maximum, and average (see Section 4.2.2). Also, we experimented with various values for the τ parameter described in Section 4.2.3.
4.4.1.2 Gold Standard Construction
In order to compute the confusion matrices described in Section 4.3.3, we must first construct a representative set of inferences and manually annotate them as correct or incorrect.

We randomly selected 100 quasi-paraphrase rules of the form p_i ⇔ p_j from DIRT. For each pattern p_i, we then extracted its instances from the Aquaint 1999 AP newswire collection (approximately 22 million words), and randomly selected 10 distinct instances, resulting in a total of 1000 instances. For each instance of p_i, applying DIRT's quasi-paraphrase rule would assert the instance ⟨x, p_j, y⟩. Our evaluation tests how well our models can filter these so that only correct inferences are made.
To form the gold standard, two human judges were asked to tag each instance ⟨x, p_j, y⟩ as correct or incorrect. For example, given a randomly selected quasi-paraphrase rule "X is charged by Y" ⇔ "Y announced the arrest of X" and the instance "Terry Nichols was charged by federal prosecutors", the judges must determine if the instance ⟨federal prosecutors, Y announced the arrest of X, Terry Nichols⟩ is correct. The judges were asked to consider the following two criteria for their decision:

• ⟨x, p_j, y⟩ is a semantically meaningful instance;

• The inference or the paraphrase relation p_i ⇔ p_j holds for this instance.
Judges found that annotation decisions can range from trivial to difficult. The differences were often in the instances for which one of the judges failed to see the right context under which the inference could hold. To minimize disagreements, the judges went through an extensive round of training.

To that end, the 1000 instances ⟨x, p_j, y⟩ were split into DEV and TEST sets, 500 in each. The two judges trained themselves by annotating DEV together. The TEST set was then annotated separately to verify the inter-annotator agreement and to verify whether the task is well-defined. The kappa statistic (Siegal & Castellan Jr., 1988) was κ = 0.72. For the 70 disagreements between the judges, a third judge acted as an adjudicator.
4.4.1.3 Baselines

We compare our ISP algorithms to the following baselines:

• B0: Rejects all inferences;

• B1: Accepts all inferences;

• Rand: Randomly accepts or rejects inferences.

One alternative to our approach is to admit instances based on literal Web search queries. We investigated this technique but discarded it due to subtle yet critical issues with pattern canonicalization that resulted in rejecting nearly all inferences. However, we are investigating other ways of using Web corpora for this task.
4.4.2 Filtering Quality

System       Ranking Strategy   τ (%)   Sensitivity    Specificity    Accuracy
             (selected from DEV set)    (95% Conf)     (95% Conf)     (95% Conf)
B0           —                  —       0.00 ± 0.00    1.00 ± 0.00    0.50 ± 0.04
B1           —                  —       1.00 ± 0.00    0.00 ± 0.00    0.49 ± 0.04
Rand         —                  —       0.50 ± 0.06    0.47 ± 0.07    0.50 ± 0.04
CBC
ISP.JIM      maximum            100     0.17 ± 0.04    0.88 ± 0.04    0.53 ± 0.04
ISP.IIM.∧    maximum            100     0.24 ± 0.05    0.84 ± 0.04    0.54 ± 0.04
ISP.IIM.∨    maximum            90      0.73 ± 0.05    0.45 ± 0.06    0.59 ± 0.04
WordNet
ISP.JIM      minimum            40      0.20 ± 0.06    0.75 ± 0.06    0.47 ± 0.04
ISP.IIM.∧    minimum            10      0.33 ± 0.07    0.77 ± 0.06    0.55 ± 0.04
ISP.IIM.∨    minimum            20      0.87 ± 0.04    0.17 ± 0.05    0.51 ± 0.05

Table 4.2: Filtering quality of best performing systems according to the evaluation criteria defined in Section 4.3.3 on the TEST set.
For each ISP algorithm and parameter combination, we constructed a confusion matrix on the development set and computed the system sensitivity, specificity, and accuracy as described in Section 4.3.3. This resulted in 180 experiments on the development set. For each ISP algorithm and semantic class source, we selected the best parameter combinations according to the following criteria:

• Accuracy: This system has the best overall ability to correctly accept and reject inferences.

• 90%-Specificity: Several formal semantics and textual entailment researchers have commented that quasi-paraphrase rule collections like DIRT are difficult to use due to low precision. Many have asked for filtered versions that remove incorrect inferences even at the cost of removing correct inferences. In response, we show results for the system achieving the best sensitivity while maintaining at least 90% specificity on the DEV set.
We evaluated the selected systems on the TEST set. Table 4.2 summarizes the quality of the systems selected according to the Accuracy criterion. The best performing system, ISP.IIM.∨, performed statistically significantly better than all three baselines. The best system according to the 90%-Specificity criterion was ISP.JIM, which coincidentally has the highest accuracy for that model as shown in Table 4.2. This result is very promising for researchers who require highly accurate quasi-paraphrase rules, since they can use ISP.JIM and expect to recall 17% of the correct inferences while accepting false positives only 12% of the time.
4.4.2.1 Performance and Error Analysis

Tables 4.3 and 4.4 present the full confusion matrices for the most accurate and highly specific systems, with both systems selected on the DEV set.

                GOLD STANDARD
                  1       0
SYSTEM    1      184     139
          0       63     114

Table 4.3: Confusion matrix for ISP.IIM.∨ (best accuracy)

                GOLD STANDARD
                  1       0
SYSTEM    1       42      28
          0      205     225

Table 4.4: Confusion matrix for ISP.JIM (best 90%-Specificity)

The most accurate system was ISP.IIM.∨, which is the most permissive of the algorithms. This suggests that a larger corpus for learning SPs may be needed to support stronger performance on the more restrictive methods. The system in Table 4.4, selected for maximizing sensitivity while maintaining high specificity, was 70% correct in predicting correct inferences.
Figure 4.1 illustrates the ROC curves for all our systems and parameter combinations on the TEST set. ROC curves plot the true positive rate against the false positive rate. The near-diagonal line plots the three baseline systems.

Figure 4.1: ROC curves for our systems on TEST

Several trends can be observed from this figure. First, systems using the semantic classes from WordNet tend to perform less well than systems using CBC classes. As discussed in Section 4.3.2, we used a very simplistic extraction of semantic classes from WordNet. The results in Figure 4.1 serve as a lower bound on what could be achieved with a better extraction from WordNet. Upon inspection of instances that WordNet got incorrect but CBC got correct, it seemed that CBC had a much higher lexical coverage than WordNet. For example, several of the instances contained proper names as either the X or Y argument (WordNet has poor proper name coverage). When an argument is not covered by any class, the inference is rejected.
Figure 4.1 also illustrates how our three different ISP algorithms behave. The strictest filters, ISP.JIM and ISP.IIM.∧, have the poorest overall performance but, as expected, have a generally very low rate of false positives. ISP.IIM.∨, which is a much more permissive filter because it does not require both arguments of a relation to match, has generally many more false positives but has an overall better performance.

We did not include in Figure 4.1 an analysis of the minimum, maximum, and average ranking strategies presented in Section 4.2.2 since they generally produced nearly identical results.

Figure 4.2: Performance variation of ISP.IIM.∨ (the best system) over different values of the τ threshold
For the most accurate system, ISP.IIM.∨, we explored the impact of the cutoff threshold τ on the sensitivity, specificity, and accuracy, as shown in Figure 4.2. Rather than step the values by 10% as we did on the DEV set, here we stepped the threshold value by 2% on the TEST set. The more permissive values of τ increase sensitivity at the expense of specificity. Interestingly, the overall accuracy remained fairly constant across the entire range of τ, staying within 0.05 of the maximum of 0.62 achieved at τ = 30%.
Finally, we manually inspected several incorrect inferences that were missed by our filters. A common source of errors was the many incorrect "antonymy" quasi-paraphrase rules generated by DIRT, such as "X is rejected in Y" ⇔ "X is accepted in Y". This recognized problem in DIRT occurs because of the distributional hypothesis assumption used to form the quasi-paraphrase rules. Our ISP algorithms suffer from a similar quandary since, typically, antonymous relations take the same sets of arguments for X (and Y). For these cases, ISP algorithms learn many selectional preferences that accept the same types of entities as those that made DIRT learn the quasi-paraphrase rule in the first place; hence ISP will not filter out many incorrect inferences.
4.5 Conclusion

We presented algorithms for learning what we call inferential selectional preferences, and presented evidence that learning selectional preferences can be useful in filtering out incorrect inferences. This work constitutes a step towards a better understanding of the interaction of selectional preferences and inferences, bridging these two aspects of semantics.
Chapter 5

Learning Directionality

5.1 Introduction
As discussed in the previous chapter, manually built resources like WordNet (Fellbaum, 1998) and Cyc (Lenat, 1995) have been around for years; but for coverage and domain adaptability reasons many recent approaches have focused on automatic acquisition of quasi-paraphrases (Barzilay & McKeown, 2001) and quasi-paraphrase rules (Lin & Pantel, 2001; Szpektor et al., 2004). The downside of these approaches is that they often result in incorrect quasi-paraphrase rules or in quasi-paraphrase rules that are underspecified in directionality (i.e., asymmetric but wrongly considered symmetric). For example, consider a quasi-paraphrase rule from DIRT (Lin & Pantel, 2001):

X eats Y ⇔ X likes Y    (1)

All rules in DIRT are considered symmetric. Though here, one is most likely to infer that "X eats Y" ⇒ "X likes Y", because if someone eats something, he most probably likes it, but if he likes something he might not necessarily be able to eat it. So for example, given the sentence "I eat spicy food", one is most likely to infer that "I like spicy food". On the other hand, given the sentence "I like rollerblading", one cannot infer that "I eat rollerblading".
In this chapter, we propose an algorithm called LEDIR (pronounced "leader") for LEarning Directionality of Inference Rules. Our algorithm filters incorrect quasi-paraphrase rules and identifies the directionality of the correct ones. Our algorithm works with any resource that produces quasi-paraphrase rules of the form shown in example (1). We use both the distributional hypothesis and selectional preferences as the basis for our algorithm. We provide empirical evidence to validate the following main contribution:

Claim: Relational selectional preferences can be used to automatically determine the plausibility and directionality of a quasi-paraphrase rule.
5.2 Learning Directionality of Quasi-paraphrase Rules

The aim of this chapter is to filter out incorrect quasi-paraphrase rules and to identify the directionality of the correct ones.

Let p_i ⇔ p_j be a quasi-paraphrase rule where each p is a binary semantic relation between two entities x and y. Let ⟨x, p, y⟩ be an instance of relation p.

Formal problem definition: Given the quasi-paraphrase rule p_i ⇔ p_j, we want to conclude which one of the following is more appropriate:

1. p_i ⇔ p_j
2. p_i ⇒ p_j
3. p_i ⇐ p_j
4. No plausible inference

Consider the example (1) from Section 5.1. There, it is most plausible to conclude "X eats Y" ⇒ "X likes Y". Our algorithm LEDIR uses selectional preferences along the lines of Chapter 4 to determine the plausibility and directionality of quasi-paraphrase rules.
5.2.1 Underlying Assumption

Many approaches to modeling lexical semantics have relied on the distributional hypothesis (Harris, 1954), which states that words that appear in the same contexts tend to have similar meanings. The idea is that context is a good indicator of a word's meaning. Lin and Pantel (2001) proposed an extension to the distributional hypothesis and applied it to paths in dependency trees: if two paths tend to occur in similar contexts, it is hypothesized that the meanings of the paths tend to be similar.

In this paper, we assume and propose a further extension to the distributional hypothesis and call it the Directionality Hypothesis.

Directionality Hypothesis: If two binary semantic relations tend to occur in similar contexts and the first one occurs in significantly more contexts than the second, then the second most likely implies the first and not vice versa.

The intuition here is that of generality. The more general a relation, the more the types (and number) of contexts in which it is likely to appear. Consider the example (1) from Section 5.1. The fact is that there are many more things that someone might like than those that someone might eat. Hence, by applying the directionality hypothesis, one can infer that "X eats Y" ⇒ "X likes Y".

The key to applying the distributional hypothesis to the problem at hand is to model the contexts appropriately and to introduce a measure for calculating context similarity. Concepts in semantic space, due to their abstractive power, are much richer for reasoning about inferences than simple surface words. Hence, we model the context of a relation p of the form ⟨x, p, y⟩ by using the semantic classes C(x) and C(y) of words that can be instantiated for x and y respectively. To measure the context similarity of two relations, we calculate the overlap coefficient (Manning & Schütze, 1999) between their contexts.
5.2.2 Selectional Preferences

The selectional preferences of a predicate are the set of semantic classes that its arguments can belong to (Wilks, 1975). Resnik (1996) gave an information-theoretic formulation of the idea. In Chapter 4, we extended this idea to non-verbal relations by defining the relational selectional preferences (RSPs) of a binary relation p as the sets of semantic classes C(x) and C(y) of words that can occur in positions x and y respectively.

The sets of semantic classes C(x) and C(y) can be obtained either from a manually created taxonomy like WordNet, as proposed in the above previous approaches, or by using automatically generated classes from the output of a word clustering algorithm, as proposed in Chapter 4.

In this chapter, we deployed both the Joint Relational Model (JRM) and the Independent Relational Model (IRM) proposed in Chapter 4 to obtain the selectional preferences for a relation p.
5.2.2.1 Joint Relational Model (JRM)

The JRM uses a large corpus to learn the selectional preferences of a binary semantic relation by considering its arguments jointly.

Given a relation p and a large corpus of English text, we first find all occurrences of relation p in the corpus. For every instance ⟨x, p, y⟩ in the corpus, we obtain the sets C(x) and C(y) of the semantic classes that x and y belong to. We then accumulate the frequencies of the triples ⟨c(x), p, c(y)⟩ by assuming that every c(x) ∈ C(x) can co-occur with every c(y) ∈ C(y) and vice versa. Every triple ⟨c(x), p, c(y)⟩ obtained in this manner is a candidate selectional preference for p. As in Chapter 4, we rank these candidates using pointwise mutual information (Cover & Thomas, 1991). The ranking function is defined as the strength of association between two semantic classes, c(x) and c(y), given the relation p:

pmi(c(x)|p; c(y)|p) = log [ P(c(x), c(y)|p) / (P(c(x)|p) P(c(y)|p)) ]    (5.1)
Let |c(x), p, c(y)| denote the frequency of observing the instance ⟨c(x), p, c(y)⟩. We estimate the probabilities of Equation 5.1 using maximum likelihood estimates over our corpus:

P(c(x)|p) = |c(x), p, ∗| / |∗, p, ∗|
P(c(y)|p) = |∗, p, c(y)| / |∗, p, ∗|
P(c(x), c(y)|p) = |c(x), p, c(y)| / |∗, p, ∗|    (5.2)

We estimate the above frequencies using:

|c(x), p, ∗| = Σ_{w ∈ c(x)} |w, p, ∗| / |C(w)|
|∗, p, c(y)| = Σ_{w ∈ c(y)} |∗, p, w| / |C(w)|
|c(x), p, c(y)| = Σ_{w1 ∈ c(x), w2 ∈ c(y)} |w1, p, w2| / (|C(w1)| × |C(w2)|)    (5.3)

where |x, p, y| denotes the frequency of observing the instance ⟨x, p, y⟩ and |C(w)| denotes the number of classes to which word w belongs. Dividing by |C(w)| distributes w's mass equally among all of its senses C(w).
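The JRM counting and ranking steps above (Equations 5.1 through 5.3) can be sketched as follows. This is a minimal illustration, not the thesis implementation: `jrm_pmi`, `instances`, and `classes_of` are names we introduce, and a real run would read parsed corpus triples rather than an in-memory list.

```python
import math
from collections import defaultdict

def jrm_pmi(instances, classes_of):
    """Rank candidate selectional preferences <c(x), p, c(y)> by PMI (Eqs. 5.1-5.3).

    instances  : iterable of observed word-level triples (x, p, y)
    classes_of : dict mapping a word to its set of semantic classes
                 (e.g. CBC clusters or truncated WordNet synsets)
    Returns a dict mapping (c_x, p, c_y) to its pmi score.
    """
    joint = defaultdict(float)   # |c(x), p, c(y)|
    left = defaultdict(float)    # |c(x), p, *|
    right = defaultdict(float)   # |*, p, c(y)|
    total = defaultdict(float)   # |*, p, *|
    for x, p, y in instances:
        cxs, cys = classes_of.get(x, set()), classes_of.get(y, set())
        if not cxs or not cys:
            continue  # argument not covered by any class
        for cx in cxs:           # split each word's mass over its |C(w)| senses
            left[(cx, p)] += 1.0 / len(cxs)
        for cy in cys:
            right[(p, cy)] += 1.0 / len(cys)
        for cx in cxs:
            for cy in cys:
                joint[(cx, p, cy)] += 1.0 / (len(cxs) * len(cys))
        total[p] += 1.0
    pmi = {}
    for (cx, p, cy), f in joint.items():
        p_joint = f / total[p]
        p_cx = left[(cx, p)] / total[p]
        p_cy = right[(p, cy)] / total[p]
        pmi[(cx, p, cy)] = math.log(p_joint / (p_cx * p_cy))
    return pmi
```

The triples with the highest pmi scores become the JRM selectional preferences for p.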
5.2.2.2 Independent Relational Model (IRM)

Due to sparse data, the JRM is likely to miss some pair(s) of valid relational selectional preferences. Hence we use the IRM, which models the arguments of a binary semantic relation independently.

Similar to the JRM, we find all instances of the form ⟨x, p, y⟩ for a relation p. We then find the sets C(x) and C(y) of the semantic classes that x and y belong to and accumulate the frequencies of the triples ⟨c(x), p, ∗⟩ and ⟨∗, p, c(y)⟩ where c(x) ∈ C(x) and c(y) ∈ C(y).

All the tuples ⟨c(x), p, ∗⟩ and ⟨∗, p, c(y)⟩ are the independent candidate RSPs for a relation p and we rank them according to Equation 5.3.

Once we have the independently learnt RSPs, we need to convert them into a joint representation for use by the inference plausibility and directionality model. To do this, we obtain the Cartesian product between the sets ⟨C(x), p, ∗⟩ and ⟨∗, p, C(y)⟩ for a relation p. The Cartesian product between two sets A and B is given by:

A × B = {(a, b) : ∀a ∈ A ∧ ∀b ∈ B}    (5.4)

Similarly we obtain:

⟨C(x), p, ∗⟩ × ⟨∗, p, C(y)⟩ = {⟨c(x), p, c(y)⟩ : ∀⟨c(x), p, ∗⟩ ∈ ⟨C(x), p, ∗⟩ ∧ ∀⟨∗, p, c(y)⟩ ∈ ⟨∗, p, C(y)⟩}    (5.5)

The Cartesian product in Equation 5.5 gives the joint representation of the RSPs of the relation p learned using the IRM. In the joint representation, the IRM RSPs have the form ⟨c(x), p, c(y)⟩, which is the same form as the JRM RSPs.
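The conversion of Equation 5.5 is a plain Cartesian product of the two independently ranked class sets. A small sketch (function and argument names are ours):

```python
from itertools import product

def irm_joint_rsps(left_classes, right_classes, p):
    """Convert independently learned RSPs into the joint form of Eq. 5.5.

    left_classes  : classes c(x) whose tuple <c(x), p, *> survived ranking
    right_classes : classes c(y) whose tuple <*, p, c(y)> survived ranking
    Returns the set of joint candidate RSPs <c(x), p, c(y)>.
    """
    return {(cx, p, cy) for cx, cy in product(left_classes, right_classes)}
```

For example, independent classes {person} for x and {food, drink} for y of the relation "consume" yield the two joint RSPs ⟨person, consume, food⟩ and ⟨person, consume, drink⟩.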
5.2.3 Plausibility and Directionality Model

Our model for determining the plausibility and directionality of quasi-paraphrase rules is based on the intuition that for an inference to hold between two semantic relations there must exist sufficient overlap between their contexts, and that the directionality of the quasi-paraphrase rule depends on a quantitative comparison of their contexts.

Here we model the context of a relation by the selectional preferences of that relation. We determine the plausibility of a quasi-paraphrase rule based on the overlap coefficient (Manning & Schütze, 1999) between the selectional preferences of the two relations. We determine the directionality based on the difference in the number of selectional preferences of the relations when the inference between them seems plausible.
Given a candidate quasi-paraphrase rule p_i ⇔ p_j, we first obtain the RSPs ⟨C(x), p_i, C(y)⟩ for p_i and ⟨C(x), p_j, C(y)⟩ for p_j. We then calculate the overlap coefficient between their respective RSPs. The overlap coefficient is one of the many distributional similarity measures used to calculate the similarity between two vectors A and B:

sim(A, B) = |A ∩ B| / min(|A|, |B|)    (5.6)

The overlap coefficient between the selectional preferences of p_i and p_j is calculated as:

sim(p_i, p_j) = |⟨C(x), p_i, C(y)⟩ ∩ ⟨C(x), p_j, C(y)⟩| / min(|C(x), p_i, C(y)|, |C(x), p_j, C(y)|)    (5.7)

If sim(p_i, p_j) is above a certain empirically determined threshold α (≤ 1), we conclude that the quasi-paraphrase rule is plausible, i.e.:

If sim(p_i, p_j) ≥ α, we conclude the quasi-paraphrase rule is plausible;
else, we conclude the quasi-paraphrase rule is not plausible.
For a plausible quasi-paraphrase rule, we then compute the ratio between the number of selectional preferences |C(x), p_i, C(y)| for p_i and |C(x), p_j, C(y)| for p_j, and compare it against an empirically determined threshold β (≥ 1) to determine the directionality of the quasi-paraphrase rule. So the algorithm is:

If |C(x), p_i, C(y)| / |C(x), p_j, C(y)| ≥ β, we conclude p_i ⇐ p_j;
else if |C(x), p_i, C(y)| / |C(x), p_j, C(y)| ≤ 1/β, we conclude p_i ⇒ p_j;
else, we conclude p_i ⇔ p_j.
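Putting the plausibility test (Equations 5.6 and 5.7) and the directionality rule together, the whole decision procedure fits in a few lines. A sketch under our own naming (the thesis does not prescribe this interface); the RSPs are passed in as sets of joint triples ⟨c(x), p, c(y)⟩:

```python
def ledir_decide(rsps_i, rsps_j, alpha, beta):
    """Classify a quasi-paraphrase rule p_i <=> p_j.

    rsps_i, rsps_j : sets of joint RSP triples for the two relations
    alpha (<= 1)   : plausibility threshold on the overlap coefficient
    beta (>= 1)    : directionality threshold on the RSP-count ratio
    """
    # Eq. 5.6/5.7: overlap coefficient between the two RSP sets
    sim = len(rsps_i & rsps_j) / min(len(rsps_i), len(rsps_j))
    if sim < alpha:
        return "no plausible inference"
    ratio = len(rsps_i) / len(rsps_j)
    if ratio >= beta:
        return "p_i <= p_j"   # p_i is the more general relation
    if ratio <= 1.0 / beta:
        return "p_i => p_j"   # p_j is the more general relation
    return "p_i <=> p_j"
```

By the Directionality Hypothesis, the relation with many more selectional preferences is the more general one, so it sits on the implied side of the arrow.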
5.3 Experimental Setup

In this section, we describe our experimental setup to validate our claim that LEDIR can be used to determine the plausibility and directionality of a quasi-paraphrase rule.

Given a quasi-paraphrase rule of the form p_i ⇔ p_j, we want to use automatically learned relational selectional preferences to determine whether the quasi-paraphrase rule is valid and, if it is valid, then determine its directionality.
5.3.1 Quasi-paraphrase Rules

LEDIR can work with any set of binary semantic quasi-paraphrase rules. For the purpose of this paper, we chose the quasi-paraphrase rules from the DIRT resource (Lin & Pantel, 2001). DIRT consists of 12 million rules extracted from 1 GB of newspaper text (AP Newswire, San Jose Mercury, and Wall Street Journal). For example, "X eats Y" ⇔ "X likes Y" is a quasi-paraphrase rule from DIRT.
5.3.2 Semantic Classes

An appropriate choice of semantic classes is crucial for learning relational selectional preferences. The ideal set should have semantic classes that strike the right balance between abstraction and discrimination, two important characteristics that are often conflicting. A very general class has limited discriminative power, while a very specific class has limited abstractive power. Finding the right balance here is a separate research problem of its own.

Since an ideal set of universally acceptable semantic classes is unavailable, we decided to use the approach from Chapter 4 of using two sets of semantic classes. This approach gave us the advantage of being able to experiment with sets of classes that vary a lot in the way they are generated, while maintaining comparable granularity by obtaining approximately the same number of classes.

The first set of semantic classes was obtained by running the CBC clustering algorithm (Pantel & Lin, 2002) on the TREC-9 and TREC-2002 newswire collections consisting of over 600 million words. This resulted in 1628 clusters, each representing a semantic class.

The second set of semantic classes was obtained by using WordNet 2.1 (Fellbaum, 1998). We obtained a cut in the WordNet noun hierarchy by manual inspection and used all the synsets below a cut point as the semantic class at that node. Our inspection showed that the synsets at depth four formed the most natural semantic classes. A cut at depth four resulted in a set of 1287 semantic classes, a set that is much coarser grained than WordNet, which has an average depth of 12. This seems to be a depth that gives a reasonable abstraction while maintaining good discriminative power. It would however be interesting to experiment with more sophisticated algorithms for extracting semantic classes from WordNet and see their effect on the relational selectional preferences, something we do not address in this paper.
5.3.3 Implementation

We implemented LEDIR with both the JRM and IRM models, using quasi-paraphrase rules from DIRT and semantic classes from both CBC and WordNet. We parsed the 1999 AP newswire collection consisting of 31 million words with Minipar (Lin, 1994) and used this to obtain the probability statistics for the models (as described in Section 5.2.2).

We performed both system-wide evaluations and intrinsic evaluations with different values of the α and β parameters. Section 5.4 presents these results and our error analysis.
5.3.4 Gold Standard Construction

In order to evaluate the performance of the different systems, we compare their outputs against a manually annotated gold standard. To create this gold standard, we randomly sampled 160 quasi-paraphrase rules of the form p_i ⇔ p_j from DIRT. We discarded three rules since they contained nominalizations.

For every quasi-paraphrase rule of the form p_i ⇔ p_j, the annotation guideline asked annotators (in this work we used two annotators) to choose the most appropriate of the four options:

1. p_i ⇔ p_j
2. p_i ⇒ p_j
3. p_i ⇐ p_j
4. No plausible inference

To help the annotators with their decisions, the annotators were provided with 10 randomly chosen instances for each quasi-paraphrase rule. These instances, extracted from DIRT, provided the annotators with context where the inference could hold. So for example, for the quasi-paraphrase rule "X eats Y" ⇔ "X likes Y", an example instance would be "I eat spicy food" ⇔ "I like spicy food". The annotation guideline however gave the annotators the freedom to think of examples other than the ones provided to make their decisions.
The annotators found that while some decisions were quite easy to make, the more complex ones often involved the choice between bi-directionality and one of the directions. To minimize disagreements and to get a better understanding of the task, the annotators trained themselves by annotating several samples together.

We divided the set of 157 quasi-paraphrase rules into a development set of 57 quasi-paraphrase rules and a blind test set of 100 quasi-paraphrase rules. Our two annotators annotated the development set together to train themselves. The blind test set was then annotated individually to test whether the task is well defined. We used the kappa statistic (Siegal & Castellan Jr., 1988) to calculate the inter-annotator agreement, resulting in κ = 0.63. The annotators then looked at the disagreements together to build the final gold standard.

All this resulted in a final gold standard of 100 annotated DIRT rules.
5.3.5 Baselines

To get an objective assessment of the quality of the results obtained by using our models, we compared the output of our systems against three baselines:

• B-random: Randomly assigns one of the four possible tags to each candidate quasi-paraphrase rule.

• B-frequent: Assigns the most frequently occurring tag in the gold standard to each candidate quasi-paraphrase rule.

• B-DIRT: Assumes each quasi-paraphrase rule is bi-directional and assigns the bi-directional tag to each candidate quasi-paraphrase rule.
5.4 Experimental Results

In this section, we provide empirical evidence to validate our claim that the plausibility and directionality of a quasi-paraphrase rule can be determined using LEDIR.

5.4.1 Evaluation Criterion

We want to measure the effectiveness of LEDIR for the task of determining the validity and directionality of a set of quasi-paraphrase rules. We follow the standard approach of reporting system accuracy by comparing system outputs on a test set with a manually created gold standard. Using the gold standard described in Section 5.3.4, we measure the accuracy of our systems using the following formula:

Accuracy = (|correctly tagged quasi-paraphrase rules| × 100) / |input quasi-paraphrase rules|    (5.8)
5.4.2 Result Summary

We ran all our algorithms with different parameter combinations on the development set (the 57 DIRT rules described in Section 5.3.4). This resulted in a total of 420 experiments on the development set. Based on these experiments, we used the accuracy statistic to obtain the best parameter combination for each of our four systems. We then used these parameter values to obtain the corresponding percentage accuracies on the test set for each of the four systems. Table 5.1 summarizes the results obtained on the test set for the three baselines and for each of the four systems using the best parameter combinations obtained as described above.

Model         α      β    Accuracy (%)
B-random      —      —    25
B-frequent    —      —    34
B-DIRT        —      —    25
JRM (CBC)     0.15   2    38
JRM (WN)      0.55   2    38
IRM (CBC)     0.15   3    48
IRM (WN)      0.45   2    43

Table 5.1: Summary of results on the test set

The overall best performing system uses the IRM algorithm with RSPs from CBC. Its performance is found to be significantly better than all three baselines using Student's paired t-test (Manning & Schütze, 1999) at p < 0.05. However, this system is not statistically significantly better than the other LEDIR implementations (JRM and IRM with WordNet).
5.4.3 Performance and Error Analysis

The best performing system selected using the development set is the IRM system using CBC with the parameters α = 0.15 and β = 3. In general, the results obtained on the test set show that the IRM tends to perform better than the JRM. This observation points at the sparseness of data available for learning RSPs for the more restrictive JRM, the reason why we introduced the IRM in the first place. A much larger corpus would be needed to obtain good enough coverage for the JRM.

                GOLD STANDARD
                 ⇔    ⇒    ⇐    NO
SYSTEM    ⇔     16     1     3     7
          ⇒      0     3     1     3
          ⇐      7     4    22    15
          NO     2     3     4     9

Figure 5.1: Confusion matrix for the best performing system, IRM using CBC with α = 0.15 and β = 3

Figure 5.1 shows the confusion matrix for the overall best performing system as selected using the development set (results are taken from the test set). The confusion matrix indicates that the system does a very good job of identifying the directionality of the correct quasi-paraphrase rules, but takes a big performance hit from its inability to identify the incorrect quasi-paraphrase rules accurately. We will analyze this observation in more detail below.
Figure 5.2 plots the variation in accuracy of the IRM with different RSPs and different values of α and β. The figure shows a very interesting trend. It is clear that for all values of β, systems for the IRM using CBC tend to reach their peak in the range 0.15 ≤ α ≤ 0.25, whereas the systems for the IRM using WordNet (WN) tend to reach their peak in the range 0.4 ≤ α ≤ 0.6. This variation indicates the kind of impact the selection of semantic classes could have on the overall performance of the system. This is not hard evidence, but it does suggest that finding the right set of semantic classes could be one big step towards improving system accuracy.

Figure 5.2: Accuracy variation for the IRM with different values of α and β
Two other factors that have a big impact on the performance of our systems are the values of the system parameters α and β, which decide the plausibility and directionality of a quasi-paraphrase rule, respectively. To better study their effect on system performance, we studied the two parameters independently.

Figure 5.3 shows the variation in the accuracy for the task of predicting the correct and incorrect quasi-paraphrase rules for the different systems when varying the value of α. To obtain this graph, we classified the quasi-paraphrase rules in the test set only as correct and incorrect, without further classification based on directionality.

Figure 5.3: Accuracy variation in predicting correct versus incorrect quasi-paraphrase rules for different values of α
All of our four systems obtained accuracy scores in the range of 68-70%, showing good performance on the task of determining plausibility. This however is only a small improvement over the baseline score of 66% obtained by assuming every inference to be plausible (as will be shown below, our system has most impact not on determining plausibility but on determining directionality). Manual inspection of some system errors showed that the most common errors were due to the well-known "problem of antonymy" when applying the distributional hypothesis. In DIRT, one can learn rules like "X loves Y" ⇔ "X hates Y". Since the plausibility of quasi-paraphrase rules is determined by applying the distributional hypothesis, and the antonym paths tend to take the same set of classes for X and Y, our models find it difficult to filter out the incorrect quasi-paraphrase rules which DIRT ends up learning for this very same reason. To improve our system, one avenue of research is to focus specifically on filtering incorrect quasi-paraphrase rules involving antonyms (perhaps using methods similar to Lin et al. (2003)).

Figure 5.4: Accuracy variation in predicting directionality of correct quasi-paraphrase rules for different values of β
Figure 5.4 shows the variation in the accuracy for the task of predicting the directionality of the correct quasi-paraphrase rules for the different systems when varying the value of β. To obtain this graph, we separated the correct quasi-paraphrase rules from the incorrect ones and ran all the systems on only the correct ones, predicting only the directionality of each rule for different values of β. Too low a value of β means that the algorithms tend to predict most things as unidirectional, and too high a value means that the algorithms tend to predict everything as bidirectional. It is clear from the figure that all the systems reach their peak performance in the range 2 ≤ β ≤ 4, which agrees with our intuition of obtaining the best system accuracy in a medium range. It is also seen that the best accuracy for each of the models goes up as compared to the corresponding values obtained in the general framework. The best performing system, IRM using CBC RSPs, reaches a peak accuracy of 63.64%, a much higher score than its accuracy of 48% under the general framework and also a significant improvement over the baseline score of 48.48% for this task. A paired t-test shows that the difference is statistically significant at p < 0.05. The baseline score for this task is obtained by assigning the most frequently occurring direction to all the correct quasi-paraphrase rules. This paints a very encouraging picture about the ability of the algorithm to identify the directionality much more accurately if it can be provided with a cleaner set of quasi-paraphrase rules.
5.5 Conclusion

Semantic inferences are fundamental to understanding natural language and are an integral part of many natural language applications such as question answering, summarization, and textual entailment. Given the availability of large amounts of text and the increase in computation power, learning them automatically from large text corpora has become increasingly feasible and popular. We introduced the Directionality Hypothesis, which states that the contexts of relations can be used to determine the directionality of the quasi-paraphrase rules containing them. Our experiments show empirical evidence that the Directionality Hypothesis with RSPs can indeed be used to filter incorrect quasi-paraphrase rules and find the directionality of correct ones. We believe that this result is one step in the direction of solving the basic problem of semantic inference.

Ultimately, our goal is to improve the performance of NLP applications with better inferencing capabilities. Several recent data points, such as (Harabagiu & Hickl, 2006), and others discussed in Chapter 4, give promise that quasi-paraphrase rules refined for directionality may indeed improve question answering, textual entailment, and multi-document summarization accuracies. It is our hope that methods such as the one proposed here may one day be used to harness the richness of automatically created quasi-paraphrase rule resources within large-scale NLP applications.
Chapter 6

Learning Semantic Classes

6.1 Introduction

With the recent shift in Natural Language Processing (NLP) towards data-driven techniques, the demand for annotated data to help different areas in NLP has increased greatly. One such area, lexical semantics, depends heavily on manually compiled semantic resources like WordNet (Fellbaum, 1998) and VerbNet (Kipper, 2005). People turn to these resources when they need information like the semantic classes of words, relations between words, etc. Creating a semantic resource, however, is both time consuming and expensive. Thus, inadequate recall is often a problem with the existing semantic resources, especially when dealing with specialized domains or when they are used for specific tasks. For example, when we used the classes from WordNet to learn the Relational Selectional Preferences (RSPs) for relations in Chapters 4 and 5, we found that there were many instantiations of our relations for which we could not find classes in WordNet. Such recall problems often result in a drop in system performance.

To get around this problem, the fall-back often used is some automatic method like unsupervised word clustering. Unsupervised clustering algorithms group words based on some similarity measure and attempt to automatically induce semantic classes. For example, given the words impress, mesmerize, cover, and mask, the algorithms attempt to discover the grouping shown in Figure 6.1. Lin and Pantel (2002) used an unsupervised clustering algorithm, Clustering By Committee (CBC), to cluster nouns; Schulte im Walde and Brew (2002) used the KMeans algorithm (McQueen, 1967) to cluster German verbs. While these methods produce good clusters (corresponding to semantic classes), the approaches often produce classes that are different from the kind of semantic classes that a user wants. For example, in Chapters 4 and 5, our results indicated that the nature of the semantic classes produced by CBC was quite different from those obtained from WordNet. There may be times, however, when a user wants the learned semantic classes to resemble some pre-existing classes. For example, in verb clustering, we might want the verb clusters to be along the lines of the classes in VerbNet. Unsupervised algorithms do not have ways of incorporating such preferences or human knowledge.
{impress, mesmerize}    {cover, mask}

Figure 6.1: Example classes
A potential middle ground lies in semi-supervised clustering. On one hand, semi-supervised clustering algorithms give us the ability to automatically cluster new elements, thus overcoming the recall problem; on the other, they allow us to incorporate external knowledge into the learning process so that we can tailor the clusters. Given these advantages, of late there has been a surge in the machine learning community in developing semi-supervised clustering algorithms. Basu et al. (2002) presented a semi-supervised clustering algorithm that accepts supervision in the form of labeled points; Basu et al. (2004), Wagstaff et al. (2001), and Xing et al. (2003) presented algorithms that accept supervision in the form of constraints.

In this chapter, we use semi-supervised clustering as the framework for learning semantic classes. To do this, we incorporate external knowledge by formulating it as constraints and then employ the recently developed HMRF-KMeans algorithm (Basu et al., 2004), which provides a way to add these constraints to the KMeans algorithm. We show that this framework allows us to incorporate the knowledge available in semantic resources into the clustering process very easily, thus reproducing the human-generated classes more faithfully. We provide empirical evidence to validate the following main claim:

Claim: Semi-supervised clustering can be used to learn good semantic classes by using readily available semantic knowledge.
6.2 Incorporating Constraints

As mentioned before, in this work we use the HMRF-KMeans algorithm proposed by Basu et al. (2004) to incorporate minimal supervision in word clustering in the form of constraints. HMRF-KMeans allows the incorporation of two types of constraints: must-link and cannot-link. A must-link constraint specifies that a pair of elements must be in the same class, and a cannot-link constraint specifies that the pair may not be in the same class. For example, from Figure 6.1, we can come up with the must-link and cannot-link constraints in Figure 6.2.

Must-link:
impress ⇔ mesmerize
cover ⇔ mask

Cannot-link:
impress × cover
impress × mask
mesmerize × cover
mesmerize × mask

Figure 6.2: Example constraints
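Both constraint types reduce to pairs of elements and can be derived mechanically from any set of labeled classes. A minimal sketch in Python; the class names "affect" and "conceal" are hypothetical labels for the Figure 6.1 groups, not part of the original:

```python
from itertools import combinations

def constraints_from_classes(classes):
    """Derive must-link and cannot-link pairs from labeled classes.

    classes: dict mapping class name -> list of words.
    Returns (must_link, cannot_link) as lists of word pairs.
    """
    must, cannot = [], []
    labels = list(classes)
    for label in labels:
        # every pair inside one class must link
        must.extend(combinations(classes[label], 2))
    for a, b in combinations(labels, 2):
        # every cross-class pair cannot link
        cannot.extend((x, y) for x in classes[a] for y in classes[b])
    return must, cannot

must, cannot = constraints_from_classes({
    "affect": ["impress", "mesmerize"],   # hypothetical class label
    "conceal": ["cover", "mask"],         # hypothetical class label
})
print(must)         # [('impress', 'mesmerize'), ('cover', 'mask')]
print(len(cannot))  # 4
```

This reproduces exactly the two must-links and four cannot-links of Figure 6.2.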
To incorporate these constraints into clustering, Basu et al. (2004) used Hidden Markov Random Fields (HMRFs). An HMRF model consists of the following components (Figure 6.3):

1. Y = {y_1, y_2, ..., y_N} is a set of hidden random variables, which in the clustering framework corresponds to the set of cluster labels for the N data points to be clustered. Each cluster label y_i ∈ Y takes values from {1...K}, where K is the total number of clusters.

2. X = {x_1, x_2, ..., x_N} is a set of observed random variables, which in the clustering framework corresponds to the set of the N data points to be clustered. Each random variable x_i ∈ X is assumed to be generated from a conditional probability distribution P(x_i|y_i) determined by the corresponding hidden variable y_i ∈ Y. The random variables X are conditionally independent given the hidden variables Y, i.e.,

P(X|Y) = ∏_{x_i ∈ X} P(x_i|y_i)    (6.1)

Using this model, the task of finding the best possible clustering boils down to finding the maximum a posteriori (MAP) configuration of the HMRF, i.e., maximizing the posterior probability P(Y|X).

Figure 6.3: HMRF model, with an observed data layer (X: words) and a hidden field (Y: cluster labels) connected by must-link and cannot-link constraints
The overall posterior probability of a label sequence Y can be calculated using Bayes' rule as:

P(Y|X) = P(Y)P(X|Y) / P(X)    (6.2)

Assuming P(X) to be constant, we get:

P(Y|X) ∝ P(Y)P(X|Y)    (6.3)

Thus, from Equation 6.3, maximizing P(Y|X) is equivalent to maximizing the product of P(Y) and P(X|Y). Basu et al. (2004) show that in the current semi-supervised framework, where the must-link and cannot-link constraints are known:

• The prior P(Y) in Equation 6.3 can be calculated from the given must-link and cannot-link constraints, and the distance between the points to be clustered.

• The likelihood P(X|Y) in Equation 6.3 can be calculated by using the cluster centroids, and the distance between the points to be clustered.

In the following section, we present the method for measuring the distance between points and the algorithm for clustering.
6.3 Word Similarity and Algorithm

To use the model in Section 6.2 for clustering, we need a way to measure similarity (distance) between words and an algorithm to find the best posterior configuration of cluster labels. This section briefly describes the similarity (distance) measure we use and the algorithm for finding the MAP configuration of cluster labels.

6.3.1 Word Similarity

Following Lin (1998), we represent each word x by a feature vector. Each feature in the vector corresponds to a context in which the word occurs. For example, consider the word "impress" in the phrase "impressed by her beauty". Here, "by her beauty" would be a feature of "impress". Each feature f has an associated score, which measures the strength of the association of the feature f with the word x. We use the commonly used measure pointwise mutual information (PMI) (Cover & Thomas, 1991) to measure this strength of association:

pmi(x; f) = log [ P(x,f) / ( P(x) P(f) ) ]    (6.4)

The probabilities in Equation 6.4 are calculated by using the maximum likelihood estimate over our corpus. Let |x,f| be the number of times feature f occurs with x; |x,∗| = Σ_f |x,f| be the total frequency of all the features of x; |∗,f| = Σ_x |x,f| be the total number of times f occurs with any word in the entire corpus; and |∗,∗| = Σ_x Σ_f |x,f| be the total count of all features of all the words that occur in the corpus. Then:

P(x,f) = |x,f| / |∗,∗|    P(x) = |x,∗| / |∗,∗|    P(f) = |∗,f| / |∗,∗|    (6.5)
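Equations 6.4 and 6.5 amount to a few lines of code over a table of (word, feature) counts; a minimal sketch (the counts below are invented purely for illustration):

```python
import math
from collections import defaultdict

def pmi_scores(counts):
    """Compute pmi(x; f) = log [ P(x,f) / (P(x) P(f)) ] from raw counts.

    counts: dict mapping (word, feature) -> frequency |x, f|.
    """
    total = sum(counts.values())                      # |*, *|
    word_tot, feat_tot = defaultdict(int), defaultdict(int)
    for (x, f), c in counts.items():
        word_tot[x] += c                              # |x, *|
        feat_tot[f] += c                              # |*, f|
    return {
        (x, f): math.log((c / total) / ((word_tot[x] / total) * (feat_tot[f] / total)))
        for (x, f), c in counts.items()
    }

scores = pmi_scores({
    ("impress", "by her beauty"): 4,   # invented counts
    ("impress", "the judges"): 1,
    ("cover", "the judges"): 5,
})
```

With these toy counts, pmi("impress"; "by her beauty") = log(0.4 / (0.5 × 0.4)) = log 2.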
To compute the similarity between two words x_i and x_j in our corpus, we use cosine similarity, which is a commonly used distortion measure in Natural Language Processing. Following Basu et al. (2004), we used the parameterized form of cosine similarity:

D_cosA(x_i, x_j) = 1 − (x_i^T · A · x_j) / ( ||x_i||_A ||x_j||_A )    (6.6)

Here A is a diagonal matrix and ||x||_A = sqrt(x^T · A · x) is the weighted L2 norm.
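Since A is diagonal, Equation 6.6 can be computed with A stored simply as the vector of its diagonal entries. A minimal sketch, not the actual implementation used in the experiments:

```python
import math

def weighted_cosine_distance(x_i, x_j, a):
    """D_cosA(x_i, x_j) = 1 - (x_i^T A x_j) / (||x_i||_A ||x_j||_A),
    where A is diagonal (passed as the vector `a` of its diagonal
    entries) and ||x||_A = sqrt(x^T A x) is the weighted L2 norm."""
    dot = sum(w * u * v for w, u, v in zip(a, x_i, x_j))
    norm_i = math.sqrt(sum(w * u * u for w, u in zip(a, x_i)))
    norm_j = math.sqrt(sum(w * v * v for w, v in zip(a, x_j)))
    return 1.0 - dot / (norm_i * norm_j)

x = [1.0, 0.0, 2.0]
a = [1.0, 1.0, 1.0]   # A = identity recovers plain cosine distance
print(weighted_cosine_distance(x, x, a))   # ~0.0 for identical vectors
```

When HMRF-KMeans re-estimates the metric (M-Step (B) below), it is this weight vector `a` that gets updated.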
6.3.2 Algorithm

1. Initialize K cluster centroids U = {u_1, u_2, ..., u_K}
2. Repeat until convergence:
   a. E-Step: Given centroids U = {u_1, u_2, ..., u_K}, re-assign cluster labels Y = {y_1, y_2, ..., y_N} to points X = {x_1, x_2, ..., x_N} to minimize the objective function.
   b. M-Step (A): Given cluster labels Y = {y_1, y_2, ..., y_N}, re-calculate cluster centroids U = {u_1, u_2, ..., u_K} to optimize P(Y|X).
   c. M-Step (B): Re-estimate the distance measure D to optimize P(Y|X).

Figure 6.4: EM algorithm

To optimize the posterior probability P(Y|X), the EM algorithm is employed. The resulting algorithm, called HMRF-KMeans, is very similar to the KMeans algorithm. In the E-step, HMRF-KMeans assigns cluster labels to the data points so as to optimize P(Y|X), and in the M-step it re-estimates the cluster centroids and the distance measure, again to optimize P(Y|X). The outline of the algorithm is presented in Figure 6.4.

HMRF-KMeans also has some nice heuristics to come up with a good estimation of the initial centroids based on the must-link and cannot-link constraints.
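The E/M loop of Figure 6.4 can be sketched on toy 2-D points. This is only an illustrative approximation of HMRF-KMeans: it penalizes constraint violations by a fixed weight w during the E-step and omits M-Step (B), the metric re-estimation; it is not the WekaUT implementation used in the experiments:

```python
import math
import random

def penalized_kmeans(points, k, must_link, cannot_link, w=1.0, iters=20):
    """Toy sketch of the Figure 6.4 loop: the E-step assigns each point the
    label minimizing distance plus constraint-violation penalties; the
    M-step recomputes centroids as cluster means."""
    random.seed(0)
    centroids = random.sample(points, k)
    labels = {}
    for _ in range(iters):
        # E-step: greedy label assignment, point by point
        for i, p in enumerate(points):
            def cost(c):
                d = math.dist(p, centroids[c])
                # must-link partner assigned elsewhere -> penalty
                d += w * sum(1 for a, b in must_link
                             if i in (a, b) and labels.get(b if a == i else a) not in (None, c))
                # cannot-link partner assigned here -> penalty
                d += w * sum(1 for a, b in cannot_link
                             if i in (a, b) and labels.get(b if a == i else a) == c)
                return d
            labels[i] = min(range(k), key=cost)
        # M-step (A): recompute centroids as cluster means
        for c in range(k):
            members = [points[i] for i in labels if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return [labels[i] for i in range(len(points))]

labels = penalized_kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2,
                          must_link=[(0, 1), (2, 3)], cannot_link=[(0, 2)])
```

On this toy data the constraints are satisfied: points 0 and 1 share a label, points 2 and 3 share a label, and the two groups differ.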
6.4 Experiments

In this section, we describe our evaluation criterion, the experimental methodology, and the data that we used.

6.4.1 Evaluation Criterion

To measure the quality of clusters, we compare the clustering output against a gold standard answer class key. We use normalized mutual information (NMI), a commonly used extrinsic evaluation measure (Dom, 2001; Basu et al., 2004), to judge the clustering quality. It is a good measure for clustering with a fixed number of clusters and measures how effectively the clustering regenerates the original classes (Dom, 2001). Let C be the random variable denoting the cluster assignments and A be the random variable denoting the true class labels. Let H(C) be the Shannon entropy (Shannon, 1948) of variable C, H(A) be the Shannon entropy of variable A, and H(C|A) be the conditional entropy of C given A. Then the mutual information I(C;A) between the random variables C and A is given by:

I(C;A) = H(C) − H(C|A)    (6.7)

The NMI is defined as:

NMI = 2 · I(C;A) / ( H(C) + H(A) )    (6.8)
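Equations 6.7 and 6.8 can be computed directly from the two label sequences; a minimal sketch:

```python
import math
from collections import Counter

def nmi(cluster_labels, class_labels):
    """NMI = 2 * I(C; A) / (H(C) + H(A)), Equations 6.7-6.8."""
    n = len(cluster_labels)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    h_c, h_a = entropy(cluster_labels), entropy(class_labels)
    # conditional entropy H(C|A) from the joint distribution
    joint = Counter(zip(cluster_labels, class_labels))
    a_counts = Counter(class_labels)
    h_c_given_a = -sum((c / n) * math.log(c / a_counts[a])
                       for (_, a), c in joint.items())
    i_ca = h_c - h_c_given_a          # Equation 6.7
    return 2 * i_ca / (h_c + h_a)     # Equation 6.8

print(nmi([0, 0, 1, 1], ["x", "x", "y", "y"]))  # 1.0: clusters match classes
```

A clustering independent of the classes scores 0, and one that reproduces them exactly scores 1.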
6.4.2 Data and Methodology

To verify the quality of the semantic classes learned, we used the verb classes from the publicly available resource VerbNet as our training and test set. VerbNet 2.1 has a total of 237 verb classes and a little over 3600 unique verbs.

Following Lin and Pantel (2002), we represent each verb by its syntactic dependency based features. We parsed the 3GB Aquaint corpus (over 350 million tokens) using the dependency parser Minipar (Lin, 1994) and extracted the dependency features for each verb occurring in the corpus. We then obtained the frequency counts for the features and calculated the PMI score for each feature as described in Section 6.3. For our experiments, we then retained only those verbs from the corpus that are present in VerbNet, whose total mutual information equals or exceeds a threshold α, and whose VerbNet class size in our set equals or exceeds a threshold β. We constructed two such data sets:

1. S_2500, consisting of 406 verbs belonging to 11 VerbNet classes (α = 2500, β = 20)

2. S_250, consisting of 1691 verbs belonging to 39 VerbNet classes (α = 250, β = 20)

Table 6.1 shows the classes in the S_2500 set.
Class | Frequency | Elements
admire | 30 | admire, appreciate, bear, cherish, favor, miss, prefer, ...
amalgamate | 20 | couple, incorporate, integrate, match, consolidate, pair, team, ...
amuse | 90 | affect, afflict, aggravate, alarm, alienate, amaze, amuse, ...
appear | 24 | appear, arise, awaken, break, burst, come, derive, ...
build | 23 | arrange, assemble, bake, blow, cast, compile, cook, ...
characterize | 35 | depict, detail, envision, interpret, picture, paint, specify, ...
fill | 27 | adorn, bind, bombard, choke, clog, coat, contaminate, ...
force | 22 | coax, compel, dare, draw, force, incite, induce, ...
judgement | 35 | acclaim, applaud, assail, assault, attack, blame, bless, ...
other_cos | 77 | accelerate, activate, age, air, alter, animate, balance, ...
run | 23 | bowl, hike, hobble, hop, inch, leap, march, ...

Table 6.1: Classes from the S_2500 set
For the purposes of this work, we have to assume that each verb belongs to only one semantic class. This is because both our algorithm and the baseline are hard clustering algorithms, which assign each element to only one cluster. However, the verbs in VerbNet can belong to more than one class. So whenever we came across such a verb in either of our data sets (S_2500 or S_250), we randomly assigned it to one of its classes. Note that this simplifying assumption has the effect of under-reporting our system performance, since sometimes a verb might get marked as wrong even though it is assigned to a correct class. But given that the baseline has the same disadvantage, we consider this a fair comparison. We keep in mind, however, that the results we report here serve as a lower bound on the expected system performance.

Having built these sets, we randomly divided each of them into two equal-sized subsets[1] for training and testing, and performed two-fold cross-validation. The training subset was used to generate constraints by randomly picking pairs of elements and generating a must-link or cannot-link constraint between them based on whether they belonged to the same or different classes. Unit weights w = 1 and w̄ = 1 were assigned to all the must-link and cannot-link constraints. Each data set (containing both training and test subsets) was then clustered using HMRF-KMeans[2]: S_2500 with K = 11 and S_250 with K = 39. The clustering performance, however, was measured only on the corresponding test subsets. The results were averaged over 10 runs of two-fold cross-validation.

[1] For the S_250 set, one of the subsets had one element more than the other.
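The constraint-generation step just described (random pairs from the training subset; must-link if the pair shares a class, cannot-link otherwise) can be sketched as follows. The verb-to-class mapping below is a toy stand-in for VerbNet:

```python
import random

def sample_constraints(train_labels, n_pairs, seed=0):
    """Randomly pick pairs from the training subset; emit a must-link if
    the two verbs share a class, otherwise a cannot-link.

    train_labels: dict mapping verb -> class label.
    """
    rng = random.Random(seed)
    verbs = sorted(train_labels)
    must, cannot = [], []
    for _ in range(n_pairs):
        a, b = rng.sample(verbs, 2)
        (must if train_labels[a] == train_labels[b] else cannot).append((a, b))
    return must, cannot

must, cannot = sample_constraints(
    {"admire": "admire", "cherish": "admire",   # toy VerbNet stand-in
     "bake": "build", "cook": "build"},
    n_pairs=6,
)
```

Every sampled pair lands in exactly one of the two lists, mirroring the unit-weight constraints fed to HMRF-KMeans.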
6.5 Results and Discussion

This section presents the experimental results and a discussion of their possible implications.

6.5.1 Results

Figures 6.5 and 6.6 show the effect of varying the number of constraints on NMI in the HMRF-KMeans algorithm. The case with zero constraints corresponds to standard KMeans.

[2] We used the implementation from the WekaUT toolkit: http://www.cs.utexas.edu/users/ml/risc/code

Figure 6.5: Learning curve for the S_2500 set (NMI vs. number of constraints)

Figure 6.6: Learning curve for the S_250 set (NMI vs. number of constraints)
From the figures, it is clear that adding constraints to HMRF-KMeans results in an improvement in the NMI. For S_2500, even with just 200 constraints (much less than 0.5% of the possible constraints), the algorithm performs better than the baseline. A similar result is found for the S_250 set with only 1000 constraints. This is a good sign, indicating that even a small number of constraints can help guide the algorithm in the right direction. As we add more constraints, we expect the algorithm to perform better, and this general trend becomes clear as the number of constraints increases. The question then is how long this trend will continue. We found that for the S_2500 set the average performance reached a plateau at 0.40 NMI with around 10,000 constraints. After that, even when we increased the number of constraints, we found only minimal improvement in NMI scores. On manual inspection, the clusters with 10,000 constraints looked fairly consistent, with the majority of elements of each class always in the same cluster. For the S_250 set, we found a similar plateau at around 50,000 constraints with an NMI score of around 0.43, and very slow improvement after that. Again, the clusters looked quite consistent on manual inspection. Table 6.2 shows randomly selected elements from random clusters from one of the runs on the sets S_2500 and S_250. The VerbNet classes shown are those with which the corresponding clusters had the largest intersection.
Set | # Constraints | Cluster Elements | VerbNet Class
S_2500 | 10,000 | pound, stir, touch, adorn, lash, bind, cook, taint, bombard, soak, ... | fill
S_2500 | 10,000 | envisage, diagnose, depict, report, laud, describe, suffer, select, remember, judge, ... | characterize
S_250 | 50,000 | esteem, treasure, hate, dislike, venerate, chaperone, doubt, propose, trust, like, ... | admire
S_250 | 50,000 | deflate, galvanize, mute, stagger, bereave, frighten, slim, exhaust, peeve, trick, ... | amuse

Table 6.2: Example clusters with the corresponding largest intersecting VerbNet classes
6.5.2 Discussion

In Section 6.5.1, we showed that adding constraints to the clustering algorithm improves its performance on both data sets S_2500 and S_250, as reflected by the improvement in the NMI scores. But what does this mean for the verb semantic classes? What do they look like? To answer these questions, we inspected the clusters and compared them to the semantic classes in VerbNet. For our analysis, we considered the HMRF-KMeans runs with 10,000 constraints for the S_2500 set and with 50,000 constraints for the S_250 set. The baseline is standard unsupervised KMeans.

For the clustering, and hence the learned semantic classes, to be good, two properties are desirable:

1. Most of the elements in one semantic class should be present in only one cluster
2. Most of the elements in one cluster should belong to only one semantic class

These correspond to the traditional recall and precision criteria for the clusters.

To get a sense of recall, we first consider each VerbNet (gold standard) class for the S_2500 set and find the corresponding cluster with the highest number of intersecting elements. We find that, on average, around 53% of the elements of every class are present in its corresponding cluster. For the S_250 set, this score is 38%. While these numbers aren't spectacular, they are a huge improvement over the baseline scores of 25% for the S_2500 set and 17% for the S_250 set.
We then considered the algorithm's clusters for precision. For the S_2500 set, we find that, on average, around 54% of the elements in every cluster belong to only one VerbNet semantic class. For the S_250 set, the corresponding score is 37%. This is again an improvement over the corresponding baseline scores of 28% for the S_2500 set and 21% for the S_250 set.

6.6 Conclusion

We have shown that the HMRF-KMeans algorithm performs well on the task of learning semantic classes. When using an unsupervised clustering algorithm for this task, we have little or no control over the kind of classes we learn. The presented algorithm provides a way of controlling this with some supervision. This kind of control is often desirable and goes a long way in tailoring the learned semantic classes to meet our requirements.
Chapter 7

Paraphrases for Learning Surface Patterns

7.1 Introduction

In Chapters 4 and 5, we presented methods for learning the contexts in which quasi-paraphrase rules are mutually replaceable, and for learning the directionality of these rules (using the phrase contexts) to filter "strict paraphrases". In that work, we presented our results on syntactic quasi-paraphrase rules, i.e., rules that contain phrases in the form of paths in a syntax tree. These can be obtained by parsing a corpus using a syntactic parser and using distributional similarity to find similar paths in the parse trees. An example paraphrase pair, as in example (1), is learned and represented (in the syntactic form) in DIRT (Lin & Pantel, 2001) as in example (2).

"X acquired Y" ⇔ "X completed the acquisition of Y"    (1)

"N:subj:V⟨acquired⟩V:obj:N" ⇔ "N:subj:V⟨complete⟩V:obj:N⟨acquisition⟩N:of:N"    (2)

But what if we did not want to parse the text, perhaps because it is too large or noisy? In such a case, can we learn just surface-level paraphrases, i.e., paraphrases that contain phrases only in the form of surface n-grams, as in example (1)?

We here present a method to acquire surface paraphrases from a single monolingual corpus. We use a large corpus (about 150GB, i.e., 25 billion words) to overcome the data sparseness problem. To overcome the scalability problem, we pre-process the text with a simple part-of-speech (POS) tagger and then apply locality sensitive hashing (LSH) to speed up the remaining computation for paraphrase acquisition. Our experiments show results that verify the following main claim:

Claim 1: Highly precise surface paraphrases can be obtained from a very large monolingual corpus.

With this result, we further show that these paraphrases can be used to obtain high precision surface patterns that enable the discovery of relations in a minimally supervised way. Surface patterns are templates for extracting information from text. For example, if one wanted to extract a list of company acquisitions, "⟨ACQUIRER⟩ acquired ⟨ACQUIREE⟩" would be one surface pattern, with "⟨ACQUIRER⟩" and "⟨ACQUIREE⟩" as the slots to be extracted. Thus we can claim:

Claim 2: These paraphrases can then be used for generating high precision surface patterns for relation extraction.
7.2 Acquiring Paraphrases

This section describes our model for acquiring paraphrases from text.

7.2.1 Distributional Similarity

Harris's distributional hypothesis (Harris, 1954) has played a central role in many approaches to lexical semantics, such as unsupervised word clustering. It states that words that appear in similar contexts tend to have similar meanings. In this chapter, we apply the distributional hypothesis to phrases, i.e., word n-grams.

For example, consider the phrase "acquired", of the form "X acquired Y". Considering the context of this phrase, we might find {Google, eBay, Yahoo, ...} in position X and {YouTube, Skype, Overture, ...} in position Y. Now consider another phrase "completed the acquisition of", again of the form "X completed the acquisition of Y". For this phrase, we might find {Google, eBay, Hilton Hotel Corp., ...} in position X and {YouTube, Skype, Bally Entertainment Corp., ...} in position Y. Since the contexts of the two phrases are similar, our extension of the distributional hypothesis would assume that "acquired" and "completed the acquisition of" have similar meanings.
7.2.2 Paraphrase Generation Model

Let p_i be a phrase in text of the form X p_i Y, where X and Y are placeholders for entities occurring on either side of p_i. Our first task is to find the set of phrases that are similar in meaning to p_i. Let P = {p_1, p_2, p_3, ..., p_l} be the set of all phrases of the form X p_i Y, where p_i ∈ P. Let S_{i,X} be the set of entities that occur in position X of p_i and S_{i,Y} be the set of entities that occur in position Y of p_i. Let V_i be the vector representing p_i such that V_i = S_{i,X} ∪ S_{i,Y}. Each entity f ∈ V_i has an associated score that measures the strength of the association of the entity f with the phrase p_i; as do many others, we employ pointwise mutual information (Cover & Thomas, 1991) to measure this strength of association:

pmi(p_i; f) = log [ P(p_i, f) / ( P(p_i) P(f) ) ]    (7.1)

The probabilities in Equation 7.1 are calculated by using the maximum likelihood estimate over our corpus.

Once we have the vectors for each phrase p_i ∈ P, we can find the paraphrases for each p_i by finding its nearest neighbors. We use cosine similarity, which is a commonly used measure of similarity between two vectors.
If we have two phrases p_i ∈ P and p_j ∈ P with the corresponding vectors V_i and V_j constructed as described above, the similarity between the two phrases is calculated as:

sim(p_i; p_j) = (V_i · V_j) / ( |V_i| ∗ |V_j| )    (7.2)

Each entity in V_i (and V_j) has with it an associated flag which indicates whether the entity came from S_{i,X} or S_{i,Y}. This ensures that the X and Y entities of p_i are considered separate and do not get merged in V_i. Also, for each phrase p_i of the form X p_i Y, we have a corresponding phrase −p_i that has the form Y p_i X. For example, consider the sentences:

Google acquired YouTube.    (3)

YouTube was bought by Google.    (4)

From sentence (3), we obtain two phrases:

1. p_i = acquired, which has the form "X acquired Y" where "X = Google" and "Y = YouTube"

2. −p_i = −acquired, which has the form "Y acquired X" where "X = YouTube" and "Y = Google"

Similarly, from sentence (4) we obtain two phrases:

1. p_j = was bought by, which has the form "X was bought by Y" where "X = YouTube" and "Y = Google"

2. −p_j = −was bought by, which has the form "Y was bought by X" where "X = Google" and "Y = YouTube"

The switching of the X and Y positions in (3) and (4) ensures that "acquired" and "−was bought by" are found to be paraphrases by the algorithm.
7.2.3 Locality Sensitive Hashing

As described in Section 7.2.2, we find paraphrases of a phrase p_i by finding its nearest neighbors based on the cosine similarity between the feature vector of p_i and those of other phrases. To do this for all the phrases in the corpus, we would have to compute the similarity between all vector pairs. If n is the number of vectors and d is the dimensionality of the vector space, finding the cosine similarity between each pair of vectors has time complexity O(n²d). This computation is infeasible for our corpus, since both n and d are large.

To solve this problem, we make use of Locality Sensitive Hashing (LSH). The basic idea behind LSH is that an LSH function creates a fingerprint for each vector such that if two vectors are similar, they are likely to have similar fingerprints. The LSH function we use here was proposed by Charikar (2002). It has the property of preserving the cosine similarity between vectors, which is exactly what we want. The LSH represents a d-dimensional vector U by a stream of b bits. The bit stream is obtained by using b hash functions, each defined as follows:

h_R(U) = 1 if R · U > 0; 0 if R · U < 0    (7.3)

where R is a d-dimensional random vector drawn from a d-dimensional Gaussian distribution. Then, for two vectors U and V, Charikar (2002) showed that:

cos(θ(U, V)) = cos((1 − Pr[h_R(U) = h_R(V)]) ∗ π)    (7.4)

Ravichandran et al. (2005) have shown that by using LSH, the nearest-neighbor calculation can be done very rapidly.[1]
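A toy version of the hashing scheme of Equations 7.3 and 7.4 in pure Python. This sketch only illustrates the signature construction and the cosine estimate recovered from bit agreement; the real system uses b = 3000 bits over very high-dimensional sparse vectors and a fast search over signatures:

```python
import math
import random

def lsh_signature(vec, hyperplanes):
    """b-bit signature: bit i is 1 iff R_i . U > 0 (Equation 7.3)."""
    return tuple(1 if sum(r * u for r, u in zip(plane, vec)) > 0 else 0
                 for plane in hyperplanes)

def estimated_cosine(sig_u, sig_v):
    """Invert Equation 7.4: cos(theta) ~ cos((1 - agreement) * pi)."""
    agreement = sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)
    return math.cos((1 - agreement) * math.pi)

rng = random.Random(0)
dim, bits = 50, 256                    # toy sizes; the real system uses 3000 bits
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
u = [rng.gauss(0, 1) for _ in range(dim)]
sig = lsh_signature(u, planes)
print(estimated_cosine(sig, sig))      # 1.0 for identical vectors
```

Identical vectors agree on every bit and recover cosine 1.0; a vector and its negation disagree on (almost surely) every bit and recover cosine −1.0, matching Equation 7.4 at its endpoints.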
7.3 Learning Surface Patterns

In this section, we describe the model for learning surface patterns for a given target relation.

[1] The details of the algorithm are omitted, but interested readers are encouraged to read Charikar (2002) and Ravichandran et al. (2005).
7.3.1 Surface Patterns Model

Let r be a target relation. Our task is to find a set of surface patterns S = {s_1, s_2, ..., s_n} that express the target relation. For example, consider the relation r = "acquisition". We want to find the set of patterns S that express this relation:

S = {⟨ACQUIRER⟩ acquired ⟨ACQUIREE⟩, ⟨ACQUIRER⟩ bought ⟨ACQUIREE⟩, ⟨ACQUIREE⟩ was bought by ⟨ACQUIRER⟩, ...}.

Let SEED = {seed_1, seed_2, ..., seed_n} be the set of seed patterns that express the target relation. For each seed_i ∈ SEED, we obtain the corresponding set of new patterns PAT_i in two steps:

1. We find the surface phrase, p_i, using a seed and find the corresponding set of paraphrases, P_i = {p_{i,1}, p_{i,2}, ..., p_{i,m}}. Each paraphrase, p_{i,j} ∈ P_i, has an associated score, which is the similarity between p_i and p_{i,j}.

2. In the seed pattern, seed_i, we replace the surface phrase, p_i, with its paraphrases and obtain the set of new patterns PAT_i = {pat_{i,1}, pat_{i,2}, ..., pat_{i,m}}. Each pattern has an associated score, which is the same as the score of the paraphrase from which it was obtained. The patterns are ranked in decreasing order of their scores.

After we obtain PAT_i for each seed_i ∈ SEED, we obtain the complete set of patterns, PAT, for the target relation r as the union of all the individual pattern sets, i.e., PAT = PAT_1 ∪ PAT_2 ∪ ... ∪ PAT_n.
7.4 Experimental Methodology

In this section, we describe experiments to validate the main claims of the chapter. We first describe paraphrase acquisition, then summarize our method for learning surface patterns, and finally describe the use of the patterns for extracting relation instances.

7.4.1 Paraphrases

Finding surface variations in text requires a large corpus. The corpus needs to be at least one order of magnitude larger than that required for learning syntactic variations, since surface phrases are sparser than syntactic phrases.

For our experiments, we used a corpus of about 150GB (25 billion words) obtained from Google News, consisting of a few years' worth of news data. We POS tagged the corpus using the TNT tagger (Brants, 2000) and collected all phrases (n-grams) in the corpus that contained at least one verb and had a noun or a noun-noun compound on either side. We restricted the phrase length to at most five words.

We built a vector for each phrase as described in Section 7.2. To mitigate the problems of sparseness and co-reference to a certain extent, whenever we have a noun-noun compound in the X or Y position, we treat it as a bag of words. For example, in the sentence "Google Inc. acquired YouTube", "Google" and "Inc." will be treated as separate features in the vector.

Once we have constructed all the vectors, we find the paraphrases for every phrase by finding its nearest neighbors, as described in Section 7.2. For our experiments, we set the number of random bits in the LSH function to 3000 and the similarity cut-off between vectors to 0.15. We eventually end up with a resource containing over 2.5 million phrases, such that each phrase is connected to its paraphrases.
7.4.2 Surface Patterns

One claim of this chapter is that we can find good surface patterns for a target relation by starting with a seed pattern. To verify this, we study two target relations that have been used as examples by other researchers (Bunescu & Mooney, 2007; Banko & Etzioni, 2008):

1. Acquisition: We define this as the relation between two companies such that one company acquired the other.

2. Birthplace: We define this as the relation between a person and his/her birthplace.

For the "acquisition" relation, we start with the surface patterns containing only the words buy and acquire:

1. "⟨ACQUIRER⟩ bought ⟨ACQUIREE⟩" (and its variants, i.e., buy, buys, and buying)

2. "⟨ACQUIRER⟩ acquired ⟨ACQUIREE⟩" (and its variants, i.e., acquire, acquires, and acquiring)

This results in a total of eight seed patterns.

For the "birthplace" relation, we start with two seed patterns:

1. "⟨PERSON⟩ was born in ⟨LOCATION⟩"

2. "⟨PERSON⟩ was born at ⟨LOCATION⟩"

We find other surface patterns for each of these relations by replacing the surface words in the seed patterns with their paraphrases, as described in Section 7.3.

7.4.3 Relation Extraction

The purpose of learning surface patterns for a relation is to extract instances of that relation. We use the surface patterns obtained for the relations "acquisition" and "birthplace" to extract instances of these relations from the LDC North American News Corpus (Graff, 1995). This helps us extrinsically evaluate the quality of the surface patterns.
7.5 ExperimentalResults
Inthissection,wepresenttheresultsoftheexperimentsandanalyzethem.
7.5.1 Baselines
Itishardtoconstructabaselineforcomparingthequalityofparaphrases,asthereisn’t
much work in extracting surface level paraphrases using a monolingual corpus (see
discussion in Chapter 2). To overcome this, we compare the results informally to the
othermethodsthatproducesyntacticparaphrases.
To compare the quality of the extraction patterns and relation instances, we use the method presented by Ravichandran and Hovy (2002) as the baseline. For each of the given relations, “acquisition” and “birthplace”, we use 10 instances and obtain the top 1000 results from the Google search engine for each query. We use these results to obtain the set of baseline patterns for each relation. We then apply these patterns to the test corpus and extract the corresponding baseline instances.
7.5.2 Evaluation Criteria

Here we present the evaluation criteria we used to evaluate the performance on the different tasks.
7.5.2.1 Paraphrases

We estimate the quality of paraphrases by annotating a random sample as correct/incorrect and calculating the accuracy. However, estimating the recall is difficult, given that we do not have a complete set of paraphrases for the input phrases. Following Szpektor et al. (2004), instead of measuring recall, we calculate the average number of correct paraphrases per input phrase.
7.5.2.2 Surface Patterns

We can calculate the precision (P) of learned patterns for each relation by annotating the extracted patterns as correct/incorrect. However, calculating the recall is a problem for the same reason as above. But we can calculate the relative recall (RR) of the system against the baseline and vice versa. The relative recall RR_S|B of system S with respect to system B can be calculated as:

    RR_S|B = |C_S ∩ C_B| / |C_B|

where C_S is the set of correct patterns found by our system and C_B is the set of correct patterns found by the baseline. RR_B|S can be found in a similar way.
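The relative-recall computation above amounts to a set intersection; a small sketch, with pattern strings invented for the example:

```python
def relative_recall(correct_system, correct_baseline):
    # RR_{S|B}: the fraction of the baseline's correct patterns that the
    # system also found, treating each side's correct output as a set.
    return len(correct_system & correct_baseline) / len(correct_baseline)

c_s = {"X bought Y", "X acquired Y", "X purchased Y"}
c_b = {"X acquired Y", "X purchased Y", "X and Y", "X unit of Y"}

rr_s_given_b = relative_recall(c_s, c_b)  # 2 of B's 4 correct patterns: 0.5
rr_b_given_s = relative_recall(c_b, c_s)  # 2 of S's 3 correct patterns: 2/3
```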
7.5.2.3 Relation Extraction

We estimate the precision (P) of the extracted instances by annotating a random sample of instances as correct/incorrect. While calculating the true recall here is not possible, even calculating the true relative recall of the system against the baseline is not possible, as we can annotate only a small sample. However, following Pantel et al. (2004), we assume that the recall of the baseline is 1 and estimate the relative recall RR_S|B of the system S with respect to the baseline B using their respective precision scores P_S and P_B and the number of instances extracted by them, |S| and |B|, as:

    RR_S|B = (P_S × |S|) / (P_B × |B|)
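This estimate needs only the annotated precision of each system and the size of its output. A small sketch, plugging in the acquisition-relation numbers reported later in Table 7.5 (annotator 1):

```python
def estimated_relative_recall(p_s, n_s, p_b, n_b):
    # Estimated RR of system S w.r.t. baseline B, assuming the baseline's
    # recall is 1 (Pantel et al., 2004): (P_S * |S|) / (P_B * |B|).
    return (p_s * n_s) / (p_b * n_b)

# Acquisition relation, annotator 1: our system extracted 3875 instances
# at 88% precision; the baseline extracted 1,261,986 instances at 6%.
rr = estimated_relative_recall(0.88, 3875, 0.06, 1261986)  # about 0.045, i.e., 4.5%
```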
7.5.3 Gold Standard

In this section, we describe the creation of the gold standard for the different tasks.

7.5.3.1 Paraphrases

We created the gold-standard paraphrase test set by randomly selecting 50 phrases and their corresponding paraphrases from our collection. For each test phrase, we asked two annotators to annotate its paraphrases as correct or incorrect. The annotators were instructed to look for strict paraphrases, i.e., phrases that express the same meaning using different words.
To obtain the inter-annotator agreement, the two annotators annotated the test set separately. The kappa statistic (Siegel & Castellan Jr., 1988) was κ = 0.63. It was interesting that the annotators obtained this respectable kappa score without any prior training, which is hard to achieve in the annotation of a similar task like textual entailment. This indicates that the notion of (quasi-)paraphrases was well defined.
7.5.3.2 Surface Patterns

For the target relations, we asked two annotators to annotate the patterns for each relation as either “precise” or “vague”. The annotators annotated outputs from the system as well as the baseline. We consider the “precise” patterns as correct and the “vague” patterns as incorrect. The intuition is that applying the vague patterns for extracting target relation instances might find some good instances, but will also find many bad ones. For example, consider the following two patterns for the “acquisition” relation:

    ⟨ACQUIRER⟩ acquired ⟨ACQUIREE⟩    (5)
    ⟨ACQUIRER⟩ and ⟨ACQUIREE⟩    (6)

Example (5) is a precise pattern, as it clearly identifies the acquisition relation, while example (6) is a vague pattern, because it is too general and says nothing about the “acquisition” relation. The kappa statistic between the two annotators for this task was κ = 0.72.
7.5.3.3 Relation Extraction

We randomly sampled 50 instances of the “acquisition” and “birthplace” relations from the system as well as the baseline outputs. We asked two annotators to annotate the instances as correct or incorrect. The annotators marked an instance as correct only if both the entities and the relation between them were correct.

To make their task easier, the annotators were provided the context for each instance, and were free to use any resources at their disposal (including a web search engine) to verify the correctness of the instances. The annotators found that the annotation for this task was much easier than the previous two; the few disagreements they had were due to the ambiguity of some of the instances. The kappa statistic for this task was κ = 0.91.
7.5.4 Result Summary

Table 7.1 shows the results of annotating the paraphrase test set. We do not have a baseline to compare against, but we can analyze the results in light of numbers reported previously for syntactic paraphrases. DIRT (Lin & Pantel, 2001) and TEASE (Szpektor et al., 2004) report accuracies of 50.1% and 44.3% respectively, compared to our average accuracy across two annotators of 70.79%. The average number of paraphrases per phrase is, however, 10.1 and 5.5 for DIRT and TEASE respectively, compared to our 4.2.
Table 7.2 shows some paraphrases generated by our system for the phrases “are being distributed to” and “approved a revision to the”.

    Annotator      Accuracy   Average # correct paraphrases
    Annotator 1    67.31%     4.2
    Annotator 2    74.27%     4.28

    Table 7.1: Quality of paraphrases

    are being distributed to     approved a revision to the
    have been distributed to     unanimously approved a new
    are being handed out to      approved an annual
    were distributed to          will consider adopting a
    − are handing out            approved a revised
    will be distributed to all   approved a new

    Table 7.2: Example paraphrases
Table 7.3 shows the results on the quality of surface patterns for the two relations. It can be observed that our method outperforms the baseline by a wide margin in both precision and relative recall. Table 7.4 shows some example patterns learned by our system.
    Relation      Method              # Patterns   Annotator 1        Annotator 2
                                                   P        RR        P        RR
    Acquisition   Baseline            160          55%      13.02%    60%      11.16%
                  Paraphrase Method   231          83.11%   28.40%    93.07%   25%
    Birthplace    Baseline            16           31.35%   15.38%    31.25%   15.38%
                  Paraphrase Method   16           81.25%   40%       81.25%   40%

    Table 7.3: Quality of extraction patterns
    acquisition                        birthplace
    X agreed to buy Y                  X, who was born in Y
    X, which acquired Y                X, was born in Y
    X completed its acquisition of Y   X was raised in Y
    X has acquired Y                   X was born in NNNN in Y
    X purchased Y                      X, born in Y

    Table 7.4: Example extraction patterns
Table 7.5 shows the results of the quality of extracted instances. Our system obtains very high precision scores but suffers in relative recall, given that the baseline with its very general patterns is likely to find a huge number of instances (though a very small portion of them are correct). Table 7.6 shows some example instances we extracted.
    Relation      Method              # Instances   Annotator 1      Annotator 2
                                                    P      RR        P      RR
    Acquisition   Baseline            1,261,986     6%     100%      2%     100%
                  Paraphrase Method   3875          88%    4.5%      82%    12.59%
    Birthplace    Baseline            979,607       4%     100%      2%     100%
                  Paraphrase Method   1811          98%    4.53%     98%    9.06%

    Table 7.5: Quality of instances
    acquisition:
    1. Huntington Bancshares Inc. agreed to acquire Reliance Bank
    2. Sony bought Columbia Pictures
    3. Hanson Industries buys Kidde Inc.
    4. Casino America Inc. agreed to buy Grand Palais
    5. Tidewater Inc. acquired Hornbeck Offshore Services Inc.

    birthplace:
    1. Cyril Andrew Ponnamperuma was born in Galle
    2. Cook was born in NNNN in Devonshire
    3. Tansey was born in Cincinnati
    4. Tsoi was born in NNNN in Uzbekistan
    5. Mrs. Totenberg was born in San Francisco

    Table 7.6: Example instances

7.5.5 Discussion and Error Analysis

We studied the effect of the decrease in size of the available raw corpus on the quality of the acquired paraphrases. We used about 10% of our original corpus to learn the surface paraphrases and evaluated them. The precision and the average number of correct paraphrases are calculated on the same test set, as described in Section 7.5.2. The performance drop on using 10% of the original corpus is significant (11.41% precision and on average 1 correct paraphrase per phrase), which shows that we indeed need a large amount of data to learn good-quality surface paraphrases. One reason for this drop is also that when we use only 10% of the original data, for some of the phrases from the test set, we do not find any paraphrases (thus resulting in 0% accuracy for them). This is not unexpected, as the larger resource would have a much larger recall, which again points at the advantage of using a large data set. Another reason for this performance drop could be the parameter settings: We found that the quality of learned paraphrases depended greatly on the various cut-offs used. While we adjusted our model parameters for working with smaller-sized data, it is conceivable that we did not find the ideal setting for them. So we consider these numbers to be a lower bound. But even then, these numbers clearly indicate the advantage of using more data.
Moving to the task of relation extraction, we see from Table 7.5 that our system has a much lower relative recall compared to the baseline. This was expected, as the baseline method learns some very general patterns, which are likely to extract some good instances, even though they result in a huge hit to its precision. However, our system was able to obtain this performance using very few seeds. So an increase in the number of input seeds is likely to increase the relative recall of the resource. The question, however, remains as to what good seeds might be. It is clear that it is much harder to come up with good seed patterns (that our system needs) than seed instances (that the baseline needs). But there are some obvious ways to overcome this problem. One way is to bootstrap: We can look at the paraphrases of the seed patterns and use them to obtain more patterns. Our initial experiments with this method using hand-picked seeds showed good promise. However, we need to investigate automating this approach. Another method is to pick good patterns from the baseline system by manual inspection and use them as seeds for our system. We plan to investigate this approach as well. One reason why we have seen good preliminary results using these approaches (for improving recall), we believe, is that the precision of the paraphrases is good. So either a seed doesn't produce any new patterns or it produces good patterns, thus keeping the precision of the system high while increasing relative recall.
7.6 Conclusion

Paraphrases are an important technique for handling variation in language. Given their utility in many NLP tasks, it is desirable that we come up with methods that produce good-quality paraphrases. We believe that the paraphrase acquisition method presented here is a step towards this goal. We have shown that high-precision surface paraphrases can be obtained by using distributional similarity on a large corpus. We made use of some recent advances in theoretical computer science to make this task scalable. We have also shown that these paraphrases can be used to obtain high-precision extraction patterns for information extraction. While we believe that more work needs to be done to improve the system recall (some of which we are investigating), this seems to be a good first step towards developing a minimally supervised and scalable relation extraction system.
Chapter 8

Paraphrases for Domain-Specific Information Extraction

8.1 Introduction
While Information Extraction (IE) has evolved into a highly sophisticated application, one of its core methods remains based on patterns that match the sequences of words or syntactic units characteristic of each desired entity to be extracted. Generally, given the wide range of expressive possibility of language, especially English, better results are obtained when the system is focused on a particular application domain. The general approach to learning patterns for domain-specific IE is:

1. Obtain a domain-specific corpus.
2. For each type of entity or event-role to be extracted, build or learn the patterns from this corpus (possibly with the help of an out-of-domain corpus).

This procedure generally has to be repeated every time a new domain is addressed. Aside from the fact that this repetition is tedious, obtaining a domain-specific corpus for a new domain is hard, especially when the definition and scope of the domain are not clear. Thus, adapting an IE system to a new domain requires a lot of effort (at least on the order of days).
However, experience with IE in a variety of domains has led to the observation that a broad-coverage corpus, if it is large enough, contains most of the domain-specific patterns. If it were possible to learn these domain-specific patterns from such a broad-coverage corpus, it would save the effort of collecting domain-specific corpora and re-learning IE patterns. IE in different domains would be easy. Adapting an IE system to a new domain would require very little effort (on the order of hours).
In this chapter, we describe a very general method of learning IE patterns from a large broad-coverage corpus, using paraphrase learning techniques, at both surface and deeper levels. We then apply the patterns learned by these methods to various domain-specific test corpora and show results to demonstrate the following claim:

Claim 1: Lexico-syntactic paraphrase-based patterns, learned using a shallow parser, outperform surface-level paraphrase-based patterns for domain-specific IE.
Once the best paraphrase-based pattern learning technique is determined, we compare its results to several domain-specific IE engines. We show results to verify the following claim:

Claim 2: Paraphrase-based patterns, learned from a large broad-coverage corpus, perform at a level comparable to the patterns learned from a domain-specific corpus for domain-specific IE.
8.2 Learning Broad-Coverage Paraphrase Patterns

In this section, we present a set of methods to learn domain-specific patterns from a broad-coverage corpus. Our focus here is on event-oriented IE, where the task is to identify facts related to specific events.

Formally, given a domain d, a broad-coverage corpus b, and an event-role e in d, the aim is to learn (from b) a set of extraction patterns PAT = {pat_1, pat_2, ..., pat_n} that can extract instances of e.
8.2.1 Learning Surface-Level Paraphrase Patterns

In this section, we present our method to learn surface-level paraphrase patterns.

Formally, given an event-role e in a domain d, the aim is to learn a set of surface-level extraction patterns

    SURF = {surf_1, surf_2, ..., surf_n}

that can extract instances of e. For example, given the event role e = “weapon” in the domain d = “terrorism”, the aim is to learn the following patterns:

    SURF = {⟨SLOT⟩ went off, ⟨SLOT⟩ exploded, ⟨SLOT⟩ that exploded, ...}
Let p be a pattern of the form “⟨SLOT⟩ n-gram” or “n-gram ⟨SLOT⟩”, where “⟨SLOT⟩” contains a word or phrase that we expect p to extract, and “n-gram” is any n-gram in a corpus. Call the word or phrase that p extracts its slot-filler. Let P = {p_1, p_2, p_3, ..., p_l} be the set of all patterns of the form “⟨SLOT⟩ n-gram” or “n-gram ⟨SLOT⟩”. Define the context C_i of a pattern p_i ∈ P to be its slot-filler plus a two-word (token) window on its other side. For example, assume that we have the following sentence in our corpus:

    The bomb exploded prematurely.    (1)

Also assume that we have a pattern “⟨SLOT⟩ exploded” in our set of patterns P. Given sentence (1) and p_i = “⟨SLOT⟩ exploded”, its context is:

    C_i = {⟨SLOT⟩:bomb, +1:prematurely, +2:.}
Each item that occurs in the context C_i of p_i is called a feature of p_i. For each feature f ∈ C_i, we calculate its strength of association with the pattern p_i using pointwise mutual information (PMI) (Cover & Thomas, 1991). We then construct the feature vector V_i associated with p_i, such that it contains each feature with its associated PMI value. For example, given the pattern p_i = “⟨SLOT⟩ exploded”, its feature vector could be:

    V_i = {⟨SLOT⟩:bomb 3.53, +1:prematurely 1.74, +2:. 2.76}
Once we have vectors for all the patterns in a corpus, we can find paraphrases for any pattern by nearest-neighbors computation using cosine similarity.¹
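The vector construction and nearest-neighbor comparison can be sketched as follows. The co-occurrence counts are invented toy numbers, and the PMI estimate is the plain log-ratio form; the real system computes these statistics over the full corpus:

```python
import math
from collections import Counter

# Toy co-occurrence counts: pattern -> context-feature counts.
counts = {
    "<SLOT> exploded": Counter({"SLOT:bomb": 40, "SLOT:device": 12,
                                "+1:prematurely": 5}),
    "<SLOT> went off": Counter({"SLOT:bomb": 35, "SLOT:device": 10,
                                "+1:suddenly": 4}),
    "<SLOT> resigned": Counter({"SLOT:minister": 30, "+1:yesterday": 8}),
}

total = sum(sum(c.values()) for c in counts.values())
feature_totals = Counter()
for c in counts.values():
    feature_totals.update(c)

def pmi_vector(pattern):
    # PMI(p, f) = log( P(p, f) / (P(p) * P(f)) )
    c = counts[pattern]
    p_total = sum(c.values())
    return {f: math.log((n * total) / (p_total * feature_totals[f]))
            for f, n in c.items()}

def cosine(v1, v2):
    dot = sum(w * v2.get(f, 0.0) for f, w in v1.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm(v1) * norm(v2))

# "<SLOT> went off" shares its slot-filler features with "<SLOT> exploded",
# so it is the nearer neighbor; "<SLOT> resigned" shares none.
sim_paraphrase = cosine(pmi_vector("<SLOT> exploded"),
                        pmi_vector("<SLOT> went off"))
sim_unrelated = cosine(pmi_vector("<SLOT> exploded"),
                       pmi_vector("<SLOT> resigned"))
```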
Assume that we have a set of surface-level seed patterns

    SEED = {seed_1, seed_2, ..., seed_m}

for the event role e. Given SEED, our model finds the paraphrase set PARA_i for each seed_i ∈ SEED. The set of surface-level extraction patterns SURF for e then is the union of the individual paraphrase sets, i.e., SURF = PARA_1 ∪ PARA_2 ∪ ... ∪ PARA_m. Each pattern in SURF comes with an associated score, which is the similarity between the learned pattern surf_i ∈ SURF and the seed pattern seed_j ∈ SEED that generated it.²

For example, given the event role e = “weapon” in the domain d = “terrorism” and the seeds

    SEED = {⟨SLOT⟩ went off, ⟨SLOT⟩ exploded, ...}

¹ Cosine similarity between two vectors is the cosine of the angle between them.
² If a pattern is generated by two or more seed patterns, its score is the average of all the scores it obtains from the different seeds. We also tried using maximum and sum, but the results were very similar.
we might find a set of surface patterns:

    SURF = {⟨SLOT⟩ that exploded, ⟨SLOT⟩ blew up, was hit by ⟨SLOT⟩, ...}

This provides the surface-level paraphrase extraction patterns (SurfPara).
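Combining the per-seed paraphrase sets with their scores can be sketched as below; the pattern strings and similarity values are made up, and averaging over seeds follows footnote 2:

```python
def combine_paraphrase_sets(per_seed_paraphrases):
    # Union of the per-seed paraphrase sets. A pattern produced by two or
    # more seeds receives the average of its seed similarities.
    scores = {}
    for paraphrases in per_seed_paraphrases:
        for pattern, sim in paraphrases.items():
            scores.setdefault(pattern, []).append(sim)
    return {p: sum(sims) / len(sims) for p, sims in scores.items()}

surf = combine_paraphrase_sets([
    {"<SLOT> that exploded": 0.42, "<SLOT> blew up": 0.38},  # from one seed
    {"<SLOT> blew up": 0.30, "was hit by <SLOT>": 0.25},     # from another
])
# "<SLOT> blew up" was generated by both seeds, so its score is (0.38 + 0.30) / 2
```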
8.2.2 Learning Lexico-Syntactic Paraphrase Patterns by Conversion

The method described in Section 8.2.1 generates surface-level patterns from a set of seed extraction patterns. These patterns are lexical in nature. They extract information by matching the exact sequence of words (tokens) in the pattern with that in the given text and extracting the noun phrase that acts as a slot-filler for that pattern (e.g., ⟨SLOT⟩ exploded). The lexical nature of these patterns makes them very specific. While this specificity often results in high-precision extractions, it can be a serious disadvantage, especially when recall is important. For example, we need several variations of surface-level patterns to match all the active-voice verb phrases containing the verb “exploded”: “⟨SLOT⟩ exploded”, “⟨SLOT⟩ recently exploded”, “⟨SLOT⟩ suddenly exploded”, etc. However, a single lexico-syntactic pattern “Subject(⟨SLOT⟩) ActiveVP(exploded)” matches them all. Since extracting specific entities in a small domain-specific corpus is likely to require high recall, we automatically generalize the surface-level extraction patterns as described below.
Formally, let

    SURF = {surf_1, surf_2, ..., surf_n}

be a set of surface-level extraction patterns. The aim is to convert them into a corresponding set of lexico-syntactic extraction patterns:

    LEXSYN = {lexsyn_1, lexsyn_2, ..., lexsyn_m}
To convert the surface-level patterns into lexico-syntactic patterns, we use a shallow parser. We parse all the surface patterns surf_i ∈ SURF using this shallow parser and generalize them based on certain lexical and syntactic dimensions:

• Voice: The different lexical variations of the active and the passive voice forms of a verb all map to the corresponding general active and passive representations. For example, “⟨SLOT⟩ exploded”, “⟨SLOT⟩ recently exploded”, and “⟨SLOT⟩ suddenly exploded” all contain active-voice forms of the verb “exploded”, and hence map to the lexico-syntactic pattern “Subject(⟨SLOT⟩) ActiveVP(exploded)”.

• Headword: All noun phrases, verb phrases, prepositional phrases, and adjectival phrases in the surface pattern are represented only by their respective heads. For example, the pattern “the recent explosion of ⟨SLOT⟩” is generalized to the form “NP(explosion) of ⟨SLOT⟩”, which is considered equivalent to “the loud explosion of ⟨SLOT⟩” and “the explosion of ⟨SLOT⟩”.
• Syntactic templates: We use a pre-defined set of 17 hand-built syntactic templates to add only those patterns that match these templates into our set of lexico-syntactic patterns LEXSYN. For example, “⟨Subject⟩ ActiveVP” is a syntactic template. Using this template, the surface-level pattern “⟨SLOT⟩ recently exploded” results in the pattern “Subject(⟨SLOT⟩) ActiveVP(exploded)”,³ which would be added to LEXSYN.

After performing these generalizations, we obtain the first set of lexico-syntactic paraphrase extraction patterns (LexSynPara (Conv)).
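To make the conversion concrete, here is a hypothetical miniature converter that handles just one of the 17 templates, “⟨Subject⟩ ActiveVP”, with a regular expression standing in for a shallow parse; the adverb list and all pattern strings are invented for the illustration:

```python
import re

ADVERBS = {"recently", "suddenly", "prematurely"}

def to_lexsyn(surface_pattern):
    # Match "<SLOT> [adverb] past-tense-verb" and generalize it to the
    # lexico-syntactic form "Subject(<SLOT>) ActiveVP(verb)".
    m = re.match(r"<SLOT>\s+(?:(\w+)\s+)?(\w+ed)$", surface_pattern)
    if not m:
        return None  # matches no template, so the pattern is dropped
    adverb, verb = m.groups()
    if adverb is not None and adverb not in ADVERBS:
        return None
    return f"Subject(<SLOT>) ActiveVP({verb})"

surface = ["<SLOT> exploded", "<SLOT> recently exploded",
           "<SLOT> suddenly exploded"]
lexsyn = {to_lexsyn(p) for p in surface}
# all three surface variants collapse to a single generalized pattern
```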
8.2.3 Learning Lexico-Syntactic Paraphrase Patterns Directly

In this section, we present our method to learn lexico-syntactic paraphrase patterns from a corpus.

Formally, given an event-role e in a domain d, the aim is to learn a set of lexico-syntactic patterns

    LEXSYN = {lexsyn_1, lexsyn_2, ..., lexsyn_n}

that can extract instances of e. For example, given the event role e = “weapon” in the domain d = “terrorism”, the aim is to learn the following patterns:

    LEXSYN = {Subject(⟨SLOT⟩) ActiveVP(exploded), Subject(⟨SLOT⟩) PassiveVP(defused), ActiveVP(detonated) DirObject(⟨SLOT⟩), ...}

³ If multiple templates match a given surface-level pattern, then a lexico-syntactic pattern is generated for each matched construct.
Let lp be a pattern of the form “⟨SLOT⟩ syn-rel” or “syn-rel ⟨SLOT⟩”, where “⟨SLOT⟩” contains a word or phrase that we expect lp to extract (slot-filler), and “syn-rel” is a syntactic relation, discovered by a shallow parser, in a corpus. Let LP = {lp_1, lp_2, lp_3, ..., lp_l} be the set of all patterns of the form “⟨SLOT⟩ syn-rel” or “syn-rel ⟨SLOT⟩”. Define the context C_i of a pattern lp_i ∈ LP to be its slot-filler. For example, given the sentence (1) in Section 8.2.1 and lp_i = “Subject(⟨SLOT⟩) ActiveVP(exploded)”, its context is:

    C_i = {⟨SLOT⟩:bomb}
Using this definition of patterns and their contexts, feature vectors are constructed for all of the lexico-syntactic patterns in a corpus, similar to Section 8.2.1. The set of lexico-syntactic extraction patterns LEXSYN, for an event-role e in a domain d, is then constructed using a set of lexico-syntactic seed patterns LSEED, similar to Section 8.2.1.

For example, given the event role e = “weapon” in the domain d = “terrorism” and the set of seeds

    LSEED = {Subject(⟨SLOT⟩) ActiveVP(exploded), Subject(⟨SLOT⟩) PassiveVP(defused), ...}

we might find a set of lexico-syntactic patterns:

    LEXSYN = {ActiveVP(detonated) DirObject(⟨SLOT⟩), ActiveInfVP(wants defuse) DirObject(⟨SLOT⟩), ...}

This provides the second set of lexico-syntactic paraphrase patterns (LexSynPara (Direct)).
8.3 Experimental Methodology

In this section, we summarize the experiments and results of learning extraction patterns using broad-coverage paraphrases.

8.3.1 Paraphrase Patterns
For learning the broad-coverage paraphrases, we used a 2.2-billion-word corpus consisting mainly of newswire data. It contains data from the English Gigaword corpus (Graff, 2003), which has about 1.75 billion tokens collected from various international news sources; the HARD 2004 text corpus (Kong et al., 2005), consisting of about 225 million words of newswire and web text; and the CSR-III text corpus (Graff et al., 1995), consisting of about 225 million words of newswire text.
For learning surface-level paraphrase patterns, we part-of-speech (POS) tagged the corpus using the Brill POS tagger (Brill, 1994) and applied the method described in Section 8.2.1 to the corpus. We assume that every noun (or sequence of nouns) that occurs in our corpus is a potential slot-filler. We enumerate every n-gram around that noun (or sequence of nouns), up to a maximum size of three, as a candidate pattern for paraphrase learning. We discard all patterns that occur fewer than 100 times in the corpus (for scalability). We eventually build a resource containing over two million patterns. We then create 10 seed patterns for each event-role we want to extract, and obtain paraphrases for each of these seed patterns⁴ to obtain the surface-level patterns.
Once we have the surface-level paraphrase patterns, we convert them into lexico-syntactic patterns as described in Section 8.2.2. To perform this conversion, we use the pattern generation component of the AutoSlog system (Riloff, 1993). We use the Sundance shallow parser (Riloff & Phillips, 2004) and all of the default 17 pattern templates that are a part of this system package. Table 8.1 shows these templates.
    ⟨Subject⟩ PassiveVP            ActiveVP ⟨DirObject⟩        NP Prep ⟨NP⟩
    ⟨Subject⟩ ActiveVP             InfinitiveVP ⟨DirObject⟩    ActiveVP Prep ⟨NP⟩
    ⟨Subject⟩ ActiveVP DirObject   ActiveInfVP ⟨DirObject⟩     PassiveVP Prep ⟨NP⟩
    ⟨Subject⟩ ActiveInfVP          PassiveInfVP ⟨DirObject⟩    InfVP Prep ⟨NP⟩
    ⟨Subject⟩ PassiveInfVP         Subject AuxVP ⟨DirObject⟩   ⟨Possessive⟩ NP
    ⟨Subject⟩ AuxVP DirObject
    ⟨Subject⟩ AuxVP Adj

    Table 8.1: List of pattern templates
⁴ A similarity threshold of 0.1 is set for the paraphrases, based on data inspection and prior experience.
Finally, for learning the second set of lexico-syntactic paraphrase patterns, we applied the method described in Section 8.2.3 to the corpus. We used the Sundance shallow parser (Riloff & Phillips, 2004), all of the default 17 pattern templates that are a part of this system package (Table 8.1), and the pattern generation component of the AutoSlog system (Riloff, 1993) to generate every pattern occurring in the corpus. We assume that every noun phrase that matches a pattern in our corpus is a potential slot-filler. We discard all patterns that occur fewer than 10 times in the corpus (for scalability). We eventually build a resource containing over two million patterns. We then convert the 10 surface-level seed patterns (above) for each event-role into lexico-syntactic forms.⁵ We then obtain paraphrases for each of these lexico-syntactic seed patterns and learn the lexico-syntactic paraphrase patterns.
8.3.2 Domains

To test our IE systems, we used data from three domains: terrorism, disease-outbreaks, and corporate-acquisitions.
For the terrorism domain, we used the MUC-4 terrorism corpus (Sundheim, 1992), which consists of Latin American terrorist events. It has a total of 400 gold-standard annotated documents in its test portion, divided into four sets of 100 each (TST1, TST2, TST3, and TST4). Of these, the TST1 and TST2 documents were used for tuning; the TST3 and TST4 documents were used for test. We focused on extracting five event roles in this domain: perpetrator individuals, perpetrator organizations, physical targets, victims, and weapons.

⁵ The conversion of 10 surface seeds results in 8–10 lexico-syntactic seeds for each event role, because some surface seeds map to the same lexico-syntactic seed.
For the disease-outbreaks domain, we used a ProMed-mail⁶ IE data set, which consists of reports about outbreaks of infectious diseases. This collection consists of 245 gold-standard annotated articles. Of these, 125 were used for tuning and 120 for test. We extracted two event roles in this domain: diseases and victims.

For the corporate-acquisitions domain, we used the corporate-acquisitions corpus (Freitag, 1998b). It consists of 600 newspaper articles about acquisitions and mergers of companies. These articles have gold-standard annotations. Of these, we randomly set aside 300 documents for tuning and used the remaining 300 for test. We extracted two event roles in this domain: acquired and purchaser.
8.3.3 Evaluation

The complete event-oriented IE task involves the generation of event templates. Template generation is complicated: It requires discourse analysis to identify the different events in one article and coreference resolution to find coreferring entities. Our aim here, however, is to evaluate the quality of extraction patterns. Hence, following Patwardhan and Riloff (2007), we evaluate our methods on the quality of their extractions rather than on template generation.⁷ Also, following Patwardhan and Riloff (2007), we merge duplicate extractions and employ a head-noun-matching evaluation scheme: An extraction is considered correct only if its head noun matches the head noun of the gold-standard answer.

⁶ http://www.promedmail.org
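The head-noun matching scheme can be sketched as follows; the crude head finder (last token of the noun phrase) and the example NPs are our own simplifications, not the evaluation code used here:

```python
def head_noun(np):
    # Crude head finder: the last token of the noun phrase. A real
    # evaluator would handle trailing modifiers and use a parser.
    return np.lower().split()[-1]

def is_correct(extraction, gold_answers):
    # An extraction counts as correct only if its head noun matches the
    # head noun of some gold-standard answer.
    return head_noun(extraction) in {head_noun(g) for g in gold_answers}

gold = ["the car bomb", "two grenades"]
hit = is_correct("a powerful bomb", gold)  # heads match on "bomb"
miss = is_correct("the vehicle", gold)     # "vehicle" matches no gold head
```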
8.4 Result 1 and Discussion

8.4.1 Comparison of Broad-Coverage Paraphrase Patterns
In this section, we summarize the experiments and results of using the different broad-coverage paraphrase-based systems. We compare the performance of these systems to two baselines:

• SurfSeeds: Uses only the surface-level seed patterns for extraction.

• LexSynSeeds: Uses only the lexico-syntactic seed patterns for extraction.

For the SurfPara, LexSynPara (Conv), and LexSynPara (Direct) systems, their performances were first measured using the top 25, 50, 75, ..., m patterns on the corresponding tuning sets. The configurations that performed the best on the tuning set were then applied to the test set. Figures 8.1, 8.2, and 8.3 summarize the results of the baselines and the paraphrase-based systems for the terrorism, disease-outbreaks, and corporate-acquisitions domains respectively, using macro-averaged precision, recall, and f-scores.

⁷ Template generation should logically follow the extraction step.
[Figure 8.1: Paraphrase patterns in terrorism domain. Bar chart of precision, recall, and f-score (0–0.7) for the SurfSeeds, LexSynSeeds, SurfPara, LexSynPara (Conv), and LexSynPara (Direct) systems.]
[Figure 8.2: Paraphrase patterns in disease-outbreaks domain. Bar chart of precision, recall, and f-score for the same five systems.]
[Figure 8.3: Paraphrase patterns in corporate-acquisitions domain. Bar chart of precision, recall, and f-score for the same five systems.]
8.4.2 Discussion and Error Analysis

Looking at the figures, it is clear that the SurfPara system does not improve much over the LexSynSeeds baseline (in fact, it sometimes performs much worse). The LexSynPara (Conv) and LexSynPara (Direct) systems, however, consistently perform at par with or improve over the baselines and the SurfPara system. This demonstrates the power of the generalization obtained by using the lexico-syntactic patterns. This generalization seems especially important in domain-specific IE: Limited redundancy in a small domain-specific corpus means that the system must have a large number of patterns to achieve good recall.⁸ Our analysis of the paraphrase patterns also confirms that generalization is crucial here.⁹ Also, the fact that LexSynPara (Direct) performs the best indicates that data sparseness is a problem in learning surface-level paraphrases. Analysis of the learned patterns also confirms this. We discuss the data sparseness problem in Section 8.5.2.

⁸ Even for open-domain relation extraction, Bhagat and Ravichandran (2008) point out that using surface-level patterns for extraction results in low recall.
⁹ It should, however, be noted that automatic generalization sometimes produces very general patterns, due to parser error. Hence at times it harms precision.
8.5 Result 2 and Discussion

8.5.1 Comparison of Broad-Coverage and Domain-Specific Patterns

In this section, we present experiments and results to compare the performance of our broad-coverage-pattern-based IE system to some state-of-the-art domain-specific-pattern-based IE systems. For the terrorism and disease-outbreaks domains, we show results for the ASlog-TS system (Riloff, 1996) and the Semantic Affinity (SemAff) system (Patwardhan & Riloff, 2007). Both of these are weakly supervised systems that require a domain-specific corpus to learn extraction patterns, as explained below:

• ASlog-TS: ASlog-TS (Riloff, 1996) is a weakly supervised learner that relies on domain-specific training data to learn extraction patterns for the given IE task. The training data consists of a set of relevant documents and a set of irrelevant documents. The pattern learner extracts all the patterns that match some pre-specified templates from both the relevant and irrelevant documents. It then uses the distribution of patterns between the relevant and irrelevant documents to generate a ranking for the patterns. This ranking enables a human expert to then map the top-ranked patterns to their corresponding event-roles and to discard patterns that do not map to any of the event-roles (or are obviously bad patterns).
• SemAff: The SemAff system uses the semantic affinity metric introduced by Patwardhan and Riloff (2007). Semantic affinity is a measure of the tendency of a pattern to extract noun phrases corresponding to a specific event-role. For example, if a large proportion of the extractions of a pattern are weapon words, the pattern is more likely to be a weapon event-role pattern. The learner uses the semantic class information of its extractions and a manually created mapping between semantic classes and event-roles to estimate the semantic affinity score for each pattern. The top-ranked patterns are then used to extract information from text.
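The intuition behind semantic affinity can be sketched as follows. This is a hypothetical, simplified scoring function, not Patwardhan and Riloff's exact formulation; the semantic-class lexicon, role mapping, and extraction lists are all invented for the illustration:

```python
import math
from collections import Counter

SEMANTIC_CLASS = {"bomb": "weapon", "grenade": "weapon", "rifle": "weapon",
                  "priest": "person", "mayor": "person"}
ROLE_CLASSES = {"weapon": {"weapon"}, "victim": {"person"}}

def semantic_affinity(extractions, role):
    # Score a pattern for an event-role by how strongly its extractions
    # fall in the semantic classes mapped to that role: the proportion of
    # in-class extractions, weighted by a log of their raw count.
    class_counts = Counter(SEMANTIC_CLASS.get(w) for w in extractions)
    hits = sum(n for c, n in class_counts.items() if c in ROLE_CLASSES[role])
    return (hits / len(extractions)) * math.log2(1 + hits) if hits else 0.0

extractions = ["bomb", "grenade", "bomb", "mayor"]  # slot-fillers of one pattern
weapon_score = semantic_affinity(extractions, "weapon")  # (3/4) * log2(4) = 1.5
victim_score = semantic_affinity(extractions, "victim")  # (1/4) * log2(2) = 0.25
```

A pattern whose extractions are mostly weapon words thus scores highest for the weapon event-role, which is the ranking behavior the metric is designed to produce.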
We also show the results of applying these systems in the relevant regions (Patwardhan & Riloff, 2007). The basic idea behind the relevant regions is that certain sentences are more likely to contain the relevant event-roles than others. To automatically identify such relevant sentences, Patwardhan and Riloff (2007) developed a self-trained classifier that uses a set of documents relevant to the domain, a set of irrelevant documents, and a set of seed extraction patterns as training data. They then apply extraction patterns from the different systems only in relevant regions, i.e., relevant sentences, to get better performance. We show results for two such systems as well:

• ASlog-TS (Rel): This system is obtained by applying the patterns from the above-mentioned ASlog-TS system in relevant regions.

• SemAff (Rel): This system is obtained by applying the patterns from the above-mentioned SemAff system in relevant regions.
ThesystemscoresforASlog-TS,SemAff,ASlog-TS(Rel),andSemAff(Rel)aretaken
fromPatwardhanandRiloff(2007). Thesenumbersaredirectlycomparabletooursas
weusethesametestsetsandevaluationmethodologyasthem.
For the corporate-acquisitions domain, we show the results of the SRV and SRVlng systems (Freitag, 1998b), both of which are supervised learning systems trained on domain-specific annotated data, as described below:

• SRV: SRV is a supervised classification algorithm that uses training data containing positive and negative examples to identify the relevant and non-relevant event roles. It uses a set of basic features, such as token capitalization, the numeric/non-numeric nature of tokens, and the context of the token, among others, to learn the classifier.

• SRVlng: SRVlng is an incremental improvement over the SRV approach. In SRVlng, the feature space of the original SRV algorithm is enriched by adding syntactic features using a parser and semantic features using WordNet.
The system scores for SRV and SRVlng are taken from Freitag (1998b). These numbers are not directly comparable to ours, since we use a different evaluation methodology from theirs. They are shown here only to give readers a rough idea of performance in this domain. Figures 8.4, 8.5, and 8.6 summarize the results for the terrorism, disease-outbreaks, and corporate-acquisitions domains respectively, using macro-averaged precision, recall, and f-scores.
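The macro-averaged scores used in these figures can be computed as in the sketch below. One common convention, assumed here, derives the f-score from the macro-averaged precision and recall; the slot names and counts are toy examples.

```python
def macro_prf(per_slot_counts):
    """Macro-averaged precision/recall/f-score over slots.
    per_slot_counts maps slot -> (true_pos, false_pos, false_neg)."""
    ps, rs = [], []
    for tp, fp, fn in per_slot_counts.values():
        ps.append(tp / (tp + fp) if tp + fp else 0.0)
        rs.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(ps) / len(ps)          # average per-slot precision
    r = sum(rs) / len(rs)          # average per-slot recall
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = macro_prf({"victim": (8, 2, 8), "weapon": (6, 6, 2)})
```

Macro-averaging gives each slot equal weight regardless of how many fillers it has, so a system cannot hide poor performance on a rare slot behind a frequent one.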
[Bar chart showing precision, recall, and f-score (y-axis, 0 to 0.7) for the systems LexSynPara (Direct), SemAff (Rel), SemAff, ASlog-TS (Rel), and ASlog-TS.]

Figure 8.4: Paraphrase-based vs. traditional IE systems in the terrorism domain
[Bar chart showing precision, recall, and f-score (y-axis, 0 to 0.7) for the systems LexSynPara (Direct), SemAff (Rel), SemAff, ASlog-TS (Rel), and ASlog-TS.]

Figure 8.5: Paraphrase-based vs. traditional IE systems in the disease-outbreaks domain
[Bar chart showing precision, recall, and f-score (y-axis, 0 to 0.7) for the systems LexSynPara (Direct), SRVlng, and SRV.]

Figure 8.6: Paraphrase-based vs. traditional IE systems in the corporate-acquisitions domain
8.5.2 Discussion and Error Analysis
From Figures 8.4, 8.5, and 8.6, it can be seen that overall, the performance of the best paraphrase-pattern-based system, LexSynPara (Direct), is comparable to the domain-specific IE systems.

While we are satisfied by the overall performance of the LexSynPara (Direct) system, we found that our approach performed badly on two slots: perpetrator individuals in the terrorism domain and victims in the disease-outbreaks domain. We investigated the cause for its poor performance on these two slots. Intuition tells us that it would be much easier to learn patterns for slots that extract entities whose event roles are less ambiguous: any context the entities occur in is likely to be a good disambiguator for them and hence a good extraction pattern. Another intuition tells us that it would be much easier to learn patterns for slots that extract entities belonging to smaller (in size) semantic classes, especially when using a broad-coverage corpus: the vectors for at least the high-frequency patterns corresponding to these slots would be much more dense compared to those of the patterns for slots that extract entities belonging to larger semantic classes.
The above two intuitions are confirmed by our analysis of the learned patterns and also explain our results. The first slot, perpetrator individuals in the terrorism domain, extracts entities belonging to the general semantic class people, a class whose members can play many event roles both within and across domains (e.g., members of the class people can also be victims, onlookers, security personnel, etc.) and which is large in size. Looking at the learned patterns for this slot, we find that they also contain patterns for the slot victims. The second slot, victims in the disease-outbreaks domain, extracts entities that can belong to the general semantic classes people, animals, birds, etc., again classes whose members play different event roles and which are large in size. Our analysis of the patterns for this slot shows that some of the patterns are very general and extract animals, birds, etc. A third slot, victims in the terrorism domain, also suffers from the slot-filler ambiguity and sparse-vector problems initially (when using the SurfPara and LexSynPara (Conv) methods), but the problem becomes much less pronounced in the LexSynPara (Direct) method, where the vectors are less sparse. We hypothesize that if we were to use a corpus one order of magnitude larger than the one we have used, and had the computational power to handle it, the paraphrase-based method would perform much better. Unfortunately, we do not currently have access to such a corpus or the computational power to handle it. However, working with such a dataset is feasible (Bhagat & Ravichandran, 2008) and it is an avenue we want to explore. Finally, while it takes days to adapt any of the domain-specific methods to other domains, our method could be adapted to each of the new domains in a couple of hours.
8.6 Conclusion
Clearly, learning extraction patterns once from a broad-coverage corpus and being able to use them in any specific domain and application is preferable to the repetitive task of building a domain-specific corpus and applying some learning algorithm (and possibly some human annotation) to it every time a new domain is addressed. The paraphrase-based method described in this chapter performs well, makes moving to new domains easy, and obviates the need to worry about the scope of a domain and/or the type of training data required. This observation leads to the conclusion that learning broad-coverage paraphrases is an effective general methodology for domain-specific information extraction.
Chapter 9

Conclusion and Future Work

9.1 Contributions
In this thesis we have:

• Developed a typology of quasi-paraphrases which categorizes them based on lexical and structural changes in the sentences and phrases, and provided their relative frequencies.

• Presented a method for learning Inferential Selectional Preferences (ISPs), the contexts in which a pair of quasi-paraphrases are mutually replaceable. Using ISPs, we have shown that incorrect inferences can be filtered significantly better than several baselines.

• Developed an algorithm, LEDIR, that learns the directionality of quasi-paraphrases. Learning directionality allows one to separate the strong from the weak paraphrases. We have shown that LEDIR performs significantly better than several baselines.

• Presented a method to learn high-quality surface-level paraphrases from a large monolingual corpus. We then used these paraphrases to learn patterns for relation extraction and have shown that these patterns are better than those learned using a state-of-the-art baseline.

• Shown that broad-coverage paraphrases, learned using a large monolingual corpus, can be used to learn patterns for domain-specific information extraction. We have shown that the patterns learned using the broad-coverage paraphrases perform roughly on par with several state-of-the-art domain-specific information extraction systems.

Overall, through this work, we have shown that paraphrases can be learned from a monolingual corpus and can be used for information extraction, an important application area of NLP.
There are, however, several open questions that need to be addressed and several areas that need to be explored. In the following section, we list a set of possible directions for future work based on the work done in this thesis.
9.2 Future Work

In this section, we present several issues that could be addressed in future work.
9.2.1 Inferential Selectional Preferences using Tailored Classes

In Chapter 4, we experimented with using two types of classes for learning Inferential Selectional Preferences (ISPs): classes from WordNet and classes learned by CBC (Pantel & Lin, 2002). The results there show that the ISPs learned from the CBC classes are more effective than the ISPs learned from WordNet classes for filtering inferences. Even though the WordNet classes are hand-created and accurate, they suffer from the problem of low recall.

Recently, there has been work on adding new words to WordNet to improve its recall (Snow et al., 2006). The semi-supervised algorithm we used in Chapter 6 can also be used to increase the recall of WordNet-based classes. However, the drawback of the algorithm from Chapter 6 is that it is a hard clustering algorithm and thus assigns each word to only a single class. It can, however, be extended using a simple strategy to allow words to be assigned to multiple classes, as follows:
• Run HMRF-KMeans using constraints from WordNet to obtain a hard clustering of new words.

• Calculate centroids for the clusters and, for each cluster, find the top-n elements that are most similar to its centroid. Let's call these small clusters formed by the top-n elements of each cluster the "representative clusters".

• Calculate centroids for the representative clusters. Let's call these the "representative centroids".

• For each of the remaining words, find the top-m representative centroids whose similarity exceeds a certain threshold and assign the word to the corresponding clusters. Alternatively, words can be assigned to various clusters using stage III of the soft-clustering version of the CBC algorithm.

We believe this method will learn better WordNet-based ISPs.
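The four steps above can be sketched as follows. The function and parameter names (soft_assign, n, m, threshold) are illustrative assumptions, and the cosine helper stands in for whatever similarity measure the clustering actually uses.

```python
import math

def _cos(a, b):
    """Cosine similarity between sparse vectors (dicts)."""
    num = sum(w * b.get(d, 0.0) for d, w in a.items())
    den = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    return num / den if den else 0.0

def soft_assign(clusters, vectors, similarity, n=3, m=2, threshold=0.2):
    """Multi-class extension of a hard clustering, following the four
    steps above.  clusters maps cluster id -> word list from the hard
    HMRF-KMeans clustering; vectors maps word -> feature vector."""
    def centroid(words):
        dims = {d for w in words for d in vectors[w]}
        return {d: sum(vectors[w].get(d, 0.0) for w in words) / len(words)
                for d in dims}

    rep_centroids = {}
    for cid, words in clusters.items():
        c = centroid(words)
        # representative cluster: top-n members closest to the centroid
        top = sorted(words, key=lambda w: similarity(vectors[w], c),
                     reverse=True)[:n]
        rep_centroids[cid] = centroid(top)

    # assign every word to the top-m sufficiently similar clusters
    assignment = {}
    for w in vectors:
        scored = sorted(((similarity(vectors[w], rc), cid)
                         for cid, rc in rep_centroids.items()),
                        reverse=True)[:m]
        assignment[w] = [cid for sim, cid in scored if sim >= threshold]
    return assignment

demo = soft_assign({0: ["cat", "dog"], 1: ["car"]},
                   {"cat": {"fur": 1.0}, "dog": {"fur": 1.0, "bark": 0.2},
                    "car": {"wheel": 1.0}},
                   _cos, n=2, m=1, threshold=0.5)
```

With m greater than 1, a word close to several representative centroids ends up in several classes, which is the desired soft assignment.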
9.2.2 Knowledge Acquisition

Recently, there has been a growth in interest among researchers in building large knowledge bases containing relations between entities (Paşca et al., 2006; Shinyama & Sekine, 2006; Banko & Etzioni, 2008). In Chapter 7, we have shown that paraphrases can be used to learn high-precision patterns for extracting relations from text. For example, given the relation "acquisition" between two companies, i.e., the relation such that one company acquired the other, we can with high precision extract instances of the companies which have this relation. An obvious use of paraphrases then is to build a large knowledge base containing pairs of entities that exhibit useful relations. It can be done as follows:

• Obtain a set of relations. These can be specified either manually or can be discovered automatically as in Banko and Etzioni (2008).

• For each relation in step 1, specify a few seed patterns.

• For each seed pattern, find its paraphrases using the method described in Chapter 7 and create a list of surface patterns for each relation. If needed, generalize the surface patterns using the method described in Chapter 8.

• Extract instances for each relation using the patterns from step 3 and create a database of relation instances.
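The pipeline above can be sketched as follows, with toy regex patterns standing in for the learned surface patterns; the function name and the pattern format are illustrative assumptions.

```python
import re

def build_kb(sentences, relation_patterns):
    """Sketch of the pipeline above: apply each relation's surface
    patterns (seeds plus their paraphrases) to the corpus and store
    the extracted entity pairs.  Patterns are written here as regexes
    with named x/y slots; real patterns would come from Chapters 7-8."""
    kb = set()
    for sentence in sentences:
        for relation, patterns in relation_patterns.items():
            for pat in patterns:
                for m in re.finditer(pat, sentence):
                    kb.add((relation, m.group("x"), m.group("y")))
    return kb

# the second pattern plays the role of a learned paraphrase of the seed
patterns = {"acquisition": [r"(?P<x>[A-Z]\w+) acquired (?P<y>[A-Z]\w+)",
                            r"(?P<x>[A-Z]\w+) bought (?P<y>[A-Z]\w+)"]}
kb = build_kb(["Google acquired YouTube in 2006.", "Oracle bought Sun."],
              patterns)
```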
9.2.3 Paraphrases for Machine Translation

Recently, there has been work in Machine Translation to address the problem of sparseness of training data using paraphrases (Callison-Burch et al., 2006). Callison-Burch et al. (2006) used bilingual data to learn paraphrases for source-language phrases. They show that automatically paraphrasing source-language phrases whose translations are not known into ones whose translations are known improves translation quality. The algorithm for learning surface paraphrases that we present in Chapter 7 can also be adapted for this purpose as follows:

• Obtain a large monolingual corpus.

• Define the context of a word or a phrase as a two-word window to its right and a two-word window to its left.

• Collect all n-grams in this corpus up to a size of 5, along with their contexts.

• Build a context vector for each n-gram using its context, as described in Chapter 7.

• Find paraphrases for unknown phrases using the method described in Chapter 7.

• If any paraphrase of an unknown phrase has a known translation, assign that as the translation for the unknown phrase.
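The fallback step at the end of the list can be sketched as follows; the table and paraphrase entries are toy examples, and the function name is an illustrative assumption.

```python
def translate(phrase, phrase_table, paraphrases):
    """Fallback step sketched above: if a source phrase has no entry
    in the phrase table, try its paraphrases in decreasing order of
    cosine similarity and borrow the first known translation.
    paraphrases maps phrase -> list of (paraphrase, cosine) pairs."""
    if phrase in phrase_table:
        return phrase_table[phrase]
    for para, _sim in sorted(paraphrases.get(phrase, []),
                             key=lambda x: x[1], reverse=True):
        if para in phrase_table:
            return phrase_table[para]
    return None  # still untranslatable

table = {"took on added": "t1"}
paras = {"took on new": [("take on new", 0.26), ("took on added", 0.31)]}
result = translate("took on new", table, paras)
```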
The method described above is language-independent. With the availability of large publicly available corpora for many languages (English, Spanish, French, Arabic, Chinese), paraphrases for these languages can be learned. There are, however, several details that will need to be worked out depending on the language. For example, for Chinese, tokenization might be a factor that determines the quality of paraphrases. Also, for languages with heavy morphology, e.g., Finnish, demorphing might be an important factor. While these and other language-specific issues will need to be addressed, the method presented in this thesis can be used as a general framework for learning paraphrases.
9.3 Conclusion

This thesis presents methods for automatically learning quasi-paraphrases from text. While we have used these paraphrases for information extraction, there are other areas in Natural Language Processing (NLP) that can benefit from using them. We have listed some of these potential applications above. There are, however, many other applications, like text summarization, information retrieval, and automatic evaluation for machine translation and summarization, among others, that can benefit from using paraphrases. We are only scratching the surface here. Much more work still needs to be done in the area of paraphrase learning to make effective use of paraphrases for NLP.
Bibliography

Anick, P. G., & Tipirneni, S. (1999). The paraphrase search assistant: terminological feedback for iterative information seeking. ACM SIGIR (pp. 153–159). Berkeley, California, United States.

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the International Conference on Computational Linguistics (pp. 86–90). Montreal, Quebec, Canada.

Banko, M., & Etzioni, O. (2008). The tradeoffs between traditional and open relation extraction. Association for Computational Linguistics (pp. 28–36). Columbus, Ohio.

Bannard, C., & Callison-Burch, C. (2005). Paraphrasing with bilingual parallel corpora. Association for Computational Linguistics (pp. 597–604). Ann Arbor, Michigan.

Barzilay, Regina (2003). Information fusion for multidocument summarization: Paraphrasing and generation. Doctoral dissertation, Columbia University.

Barzilay, R., & Lee, L. (2003). Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 16–23). Edmonton, Canada.

Barzilay, R., & McKeown, K. R. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the Association for Computational Linguistics (pp. 50–57). Toulouse, France.

Barzilay, R., McKeown, K. R., & Elhadad, M. (1999). Information fusion in the context of multi-document summarization. Association for Computational Linguistics (pp. 550–557). College Park, Maryland.

Basu, S., Banerjee, A., & Mooney, R. J. (2002). Semi-supervised clustering by seeding. 19th International Conference on Machine Learning (pp. 19–26).
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 59–68). Seattle, WA, USA.

Berland, M., & Charniak, E. (1999). Finding parts in very large corpora. In Proceedings of the Association for Computational Linguistics (pp. 57–64). College Park, Maryland.

Bhagat, R., Hovy, E. H., & Patwardhan, S. (2009). Acquiring paraphrases from text corpora. International Conference on Knowledge Capture (KCap). Redondo Beach, California, USA.

Bhagat, R., Pantel, P., & Hovy, E. H. (2007). LEDIR: An unsupervised algorithm for learning directionality of inference rules. Empirical Methods in Natural Language Processing (EMNLP). Prague, Czech Republic.

Bhagat, R., & Ravichandran, D. (2008). Large scale acquisition of paraphrases for learning surface patterns. Association for Computational Linguistics (ACL). Columbus, OH, USA.

Brants, T. (2000). TnT: a statistical part-of-speech tagger. In Proceedings of the Applied NLP Conference (ANLP). Seattle, WA.

Brill, Eric (1994). Some advances in rule-based part of speech tagging. Proceedings of the Twelfth National Conference on Artificial Intelligence (pp. 722–727). Seattle, WA.

Bunescu, Razvan, & Mooney, Raymond (2007). Learning to extract relations from the web using minimal supervision. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 576–583). Prague, Czech Republic: Association for Computational Linguistics.

Califf, M., & Mooney, R. (2003). Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction. Journal of Machine Learning Research, 4, 177–210.

Callison-Burch, Chris (2007). Paraphrasing and translation. Doctoral dissertation, University of Edinburgh.

Callison-Burch, Chris (2008). Syntactic constraints on paraphrases extracted from parallel corpora. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 196–205). Honolulu, Hawaii: Association for Computational Linguistics.
Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 17–24). New York, New York.

Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing (pp. 380–388). Montreal, Quebec, Canada.

Chklovski, T., & Pantel, P. (2004). VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP) (pp. 33–40). Barcelona, Spain.

Chomsky, Noam (1957). Syntactic structures. The Hague: Mouton Publishers, Paris.

Clark, Eve V. (1992). Conventionality and contrasts: pragmatic principles with lexical consequences. In A. Lehrer and E. F. Kittay (Eds.), Frames, fields, and contrasts: New essays in semantic and lexical organization. Lawrence Erlbaum Associates.

Cohn, Trevor, Callison-Burch, Chris, & Lapata, Mirella (2008). Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34, 597–614.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley & Sons.

De Beaugrande, R., & Dressler, W. V. (1981). Introduction to text linguistics. New York, NY: Longman.

Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the Conference on Computational Linguistics (COLING) (pp. 350–357). Geneva, Switzerland.

Dom, B. E. (2001). An information-theoretic external cluster-validity measure.

Fellbaum, C. (1998). WordNet: An electronic lexical database. MIT Press.

Freitag, Dayne (1998a). Information extraction from HTML: Application of a general learning approach. Proceedings of the Fifteenth National Conference on Artificial Intelligence (pp. 517–523). Madison, WI.
Freitag, D. (1998b). Toward General-Purpose Learning for Information Extraction. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (pp. 404–408). Montreal, Quebec.

Gale, William A., & Church, Kenneth W. (1991). A program for aligning sentences in bilingual corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (pp. 177–184). Berkeley, California, USA: Association for Computational Linguistics.

Geffet, M., & Dagan, I. (2005). The distributional inclusion hypotheses and lexical entailment. In Proceedings of the Association for Computational Linguistics (pp. 107–114). Ann Arbor, Michigan.

Graff, D. (1995). North American News Text Corpus. Linguistic Data Consortium, Philadelphia, PA.

Graff, D. (2003). English Gigaword. Linguistic Data Consortium, Philadelphia, PA.

Graff, D., Rosenfeld, R., & Pau, D. (1995). CSR-III Text. Linguistic Data Consortium, Philadelphia, PA.

Harabagiu, S., & Hickl, A. (2006). Methods for using textual entailment in open-domain question answering. In Proceedings of the International Conference on Computational Linguistics and ACL (pp. 905–912). Sydney, Australia.

Harris, Z. (1954). Distributional structure. Word, 10(23): 146–162.

Harris, Z. (1981). Co-occurrence and transformation in linguistic structure. In H. Hiz (Ed.), Papers on syntax. D. Reidel Publishing Company. First published in 1957.

Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. Proceedings of the Conference on Computational Linguistics (pp. 539–545). Nantes, France.

Hirst, Graeme (2003). Paraphrasing paraphrased. Invited talk at the ACL International Workshop on Paraphrasing.

Hobbs, J. R., Appelt, D., Bear, J., Israel, D., Kameyama, M., & Tyson, M. (1993). FASTUS: A system for extracting information from text. Proceedings of the Human Language Technology Conference. Plainsboro, New Jersey.

Honeck, Richard P. (1971). A study of paraphrases. Journal of Verbal Learning and Verbal Behavior, 10, 367–381.
Huang, Shudong, Graff, David, & Doddington, George (2002). Multiple-Translation Chinese Corpus. Linguistic Data Consortium, Philadelphia, PA.

Jacobs, P. S., Krupka, G., Rau, L., Mauldin, M. L., Mitamura, T., Kitani, T., Sider, I., Childs, L., & Marietta, M. (1993). GE-CMU: Description of the Shogun system used for MUC-5. In Proceedings of the Fifth Message Understanding Conference (pp. 109–120).

Katz, J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39, 170–210.

Kim, J., & Moldovan, D. (1993). Acquisition of semantic patterns for information extraction from corpora. Proceedings of the IEEE Conference on Artificial Intelligence for Applications (pp. 171–176). Orlando, FL, USA.

Kipper, K. (2005). VerbNet: A broad-coverage comprehensive verb lexicon. Doctoral dissertation, Computer and Information Science Dept., University of Pennsylvania.

Kong, J., Graff, D., Maeda, K., & Strassel, S. (2005). HARD 2004 Text. Linguistic Data Consortium, Philadelphia, PA.

Lenat, D. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 33–38.

Levin, B. (1993). English verb classes and alternations: a preliminary investigation. Chicago and London: University of Chicago Press.

Light, M., & Greiff, W. R. (2002). Statistical models for the induction and use of selectional preferences. Cognitive Science, 26, 269–281.

Lin, D. (1994). PRINCIPAR: an efficient, broad-coverage, principle-based parser. Computational Linguistics (COLING) (pp. 42–48). Kyoto, Japan.

Lin, D. (1998). Automatic retrieval and clustering of similar words. COLING/ACL (pp. 768–774). Montreal, Canada.

Lin, D., & Pantel, P. (2001). DIRT: Discovery of inference rules from text. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 323–328). San Francisco, California.

Lin, D., & Pantel, P. (2002). Concept discovery from text. Computational Linguistics (COLING) (pp. 577–583). Taipei, Taiwan.

Lin, D., Zhao, S., Qin, L., & Zhou, M. (2003). Identifying synonyms among distributionally similar words. In Proceedings of IJCAI (pp. 1492–1493). Acapulco, Mexico.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press.

McQueen, J. (1967). Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematics, Statistics and Probability (pp. 281–298).

Mel'čuk, Igor (1996). Lexical functions: A tool for description of lexical relations in a lexicon. In L. Wanner (Ed.), Lexical functions in lexicography and natural language processing. John Benjamins Publishing Company.

Mel'čuk, Igor (to appear). Semantics: From meaning to text, chapter Deep-Syntactic Paraphrasing.

Moldovan, D., Clark, C., Harabagiu, S., & Maiorano, S. (2003). COGEX: a logic prover for question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 87–93). Edmonton, Canada.

Paşca, M., Lin, D., Bigham, J., Lifchits, A., & Jain, A. (2006). Organizing and searching the world wide web of facts - step one: The one-million fact extraction challenge. Proceedings of the National Conference on Artificial Intelligence. Boston, Massachusetts, USA.

Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of HLT/NAACL.

Pantel, P., Bhagat, R., Coppola, B., Chklovski, T., & Hovy, E. H. (2007). ISP: Learning inferential selectional preferences. Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Rochester, NY, USA.

Pantel, Patrick, & Lin, Dekang (2002). Discovering word senses from text. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 613–619). Edmonton, Canada.

Pantel, P., Ravichandran, D., & Hovy, E. H. (2004). Towards terascale knowledge acquisition. Proceedings of the Conference on Computational Linguistics (COLING) (pp. 771–778). Geneva, Switzerland.

Patwardhan, S., & Riloff, E. (2007). Effective Information Extraction with Semantic Affinity Patterns and Relevant Regions. Proceedings of EMNLP (pp. 717–727). Prague, Czech Republic.
Quirk, Chris, Brockett, Chris, & Dolan, William (2004). Monolingual machine translation for paraphrase generation. Proceedings of EMNLP 2004 (pp. 142–149). Barcelona, Spain: Association for Computational Linguistics.

Ravichandran, D., & Hovy, E. H. (2002). Learning surface text patterns for a question answering system. Association for Computational Linguistics (ACL). Philadelphia, PA.

Ravichandran, D., Pantel, P., & Hovy, E. H. (2005). Randomized algorithms and NLP: using locality sensitive hash functions for high speed noun clustering. In Proceedings of the Association for Computational Linguistics (pp. 622–629). Ann Arbor, Michigan.

Resnik, Philip (1996). Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61, 127–159.

Riloff, E. (1993). Automatically Constructing a Dictionary for Information Extraction Tasks. Proceedings of the Eleventh National Conference on Artificial Intelligence (pp. 811–816). Washington, DC.

Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. Proceedings of the Thirteenth National Conference on Artificial Intelligence (pp. 1044–1049). Portland, OR.

Riloff, E., & Phillips, W. (2004). An Introduction to the Sundance and AutoSlog Systems (Technical Report UUCS-04-015). School of Computing, University of Utah.

Romano, L., Kouylekov, M., Szpektor, I., Dagan, I., & Lavelli, A. (2006). Investigating a generic paraphrase-based approach for relation extraction. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL).

Rosch, E. (1978). Human categorization. Cognition and Categorization.

Schulte im Walde, S., & Brew, C. (2002). Inducing German verb semantic classes from purely syntactic subcategorisation information. Association for Computational Linguistics (ACL). Philadelphia, PA.

Sekine, S. (2006). On-demand information extraction. In Proceedings of COLING/ACL (pp. 731–738). Sydney, Australia.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423 and 623–656.

Shinyama, Y., & Sekine, S. (2006). Preemptive Information Extraction using Unrestricted Relation Discovery. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 304–311). New York, NY.
Shinyama, Y., Sekine, S., & Sudo, K. (2002). Automatic paraphrase acquisition from news articles. Proceedings of the Human Language Technology Conference (pp. 40–46).

Siegal, S., & Castellan Jr., N. J. (1988). Nonparametric statistics for the behavioral sciences. McGraw-Hill.

Snow, Rion, Jurafsky, Daniel, & Ng, Andrew Y. (2006). Semantic taxonomy induction from heterogenous evidence. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 801–808). Morristown, NJ, USA: Association for Computational Linguistics.

Sundheim, B. (1992). Overview of the Fourth Message Understanding Evaluation and Conference. Proceedings of the Fourth Message Understanding Conference (MUC-4) (pp. 3–21). McLean, VA.

Szpektor, I., & Dagan, I. (2008). Learning entailment rules for unary templates. Proceedings of the International Conference on Computational Linguistics (COLING) (pp. 849–856). Manchester, UK.

Szpektor, I., Tanev, H., Dagan, I., & Coppola, B. (2004). Scaling web-based acquisition of entailment relations. In Proceedings of Empirical Methods in Natural Language Processing (pp. 41–48). Barcelona, Spain.

Torisawa, K. (2006). Acquiring inference rules with temporal constraints by using Japanese coordinated sentences and noun-verb co-occurrences. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 57–64). New York, New York.

Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. 18th International Conference on Machine Learning (pp. 577–584).

Wilks, Y. (1975). Preference semantics. Cambridge, Massachusetts: Cambridge University Press.

Wilks, Y., & Fass, D. (1992). Preference semantics: a family history. Computing and Mathematics with Applications, 23.

Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning, with application to clustering with side-information (pp. 505–512).
Zanzotto, F. M., Pennacchiotti, M., & Pazienza, M. T. (2006). Discovering asymmetric entailment relations between verbs using selectional preferences. In Proceedings of the International Conference on Computational Linguistics and ACL (pp. 849–856). Sydney, Australia.

Zhou, L., Lin, C. Y., Munteanu, D., & Hovy, E. H. (2006). ParaEval: using paraphrases to evaluate summaries automatically. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 447–454).
Appendix

Example Paraphrases

1 Introduction

In this appendix, we present a random sample of 100 phrases with their paraphrases from the paraphrase resource of 2.5 million phrases created in Chapter 7. As described in Chapter 7, we used the 150GB (25 billion words) Google News corpus to create this resource. A cosine similarity cutoff of 0.15 was set for the paraphrases. In the examples below, the cosine similarity of each paraphrase with the original phrase is provided, and at most 10 paraphrases are listed for each phrase. See Chapter 7 for details on how this resource was learned.
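The cosine cutoff mentioned above can be illustrated with a short sketch; the function names are illustrative assumptions, and the vectors are toy examples rather than real context vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse context vectors."""
    num = sum(w * v.get(f, 0.0) for f, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def paraphrases_above_cutoff(phrase_vec, candidates, cutoff=0.15):
    """Keep candidates whose context vectors clear the cosine cutoff,
    highest similarity first, as done for the lists below."""
    scored = [(c, cosine(phrase_vec, vec)) for c, vec in candidates.items()]
    return sorted([cs for cs in scored if cs[1] >= cutoff],
                  key=lambda cs: cs[1], reverse=True)

kept = paraphrases_above_cutoff({"a": 1.0, "b": 1.0},
                                {"p": {"a": 1.0, "b": 1.0},
                                 "q": {"c": 1.0}})
```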
2 List of Paraphrases

X took on new Y    X took on added Y    0.31
    X have taken on new Y    0.28
    X has taken on renewed Y    0.26
    X take on new Y    0.26
    X took on greater Y    0.25
    X has taken on new Y    0.25
    X has been given added Y    0.25
    X takes on added Y    0.23
    X took on even more Y    0.23
    X take on added Y    0.23
X, will cost Y    X - would cost Y    0.22
    X it costs Y    0.21
    X, which will cost Y    0.20
    X, would cost Y    0.20
    X expected to cost Y    0.19
    X will probably cost Y    0.19
    X and will cost Y    0.19
    X say it would cost Y    0.18
    X, which would cost Y    0.18
    X are expected to cost Y    0.18

X working with a Y    X of working with a Y    0.20
    X, hiring a Y    0.18
    X by working as a Y    0.17
    X working for a Y    0.17
    X to work with a Y    0.16
    X fired up a Y    0.15
X and build more Y    X, building more Y    0.34
    X, build more Y    0.34
    X and building more Y    0.32
    X, build new Y    0.30
    X and building new Y    0.30
    X is to build more Y    0.29
    X by building more Y    0.29
    X to build more Y    0.29
    X of building more Y    0.28
    X for building new Y    0.28
X expiring Y    X expiring on Y    0.62
    X that will expire on Y    0.51
    X that expires on Y    0.50
    X of fiscal nnnn, ended Y    0.48
    X that expires Y    0.44
    X that expired Y    0.43
    Y nn were up n.n X    0.43
    X which expired on Y    0.42
    X, in the quarter ended Y    0.42
    Y nn increased nn.n X    0.42

X -established Y    X -establish their Y    0.20
    X of strengthening its Y    0.16
    X -establishing its Y    0.15
X priced at Y    X, priced at Y    0.30
    X is priced at Y    0.26
    X priced from Y    0.25
    X priced between Y    0.25
    X will be priced at Y    0.25
    X at prices ranging from Y    0.24
    X was priced at Y    0.24
    X costing Y    0.23
    X with prices ranging from Y    0.22
    X priced at rs n.nn Y    0.21

X involved with this Y    X involved in this Y    0.21
    X involved in that Y    0.18
    Y involves a number of X    0.15
X, driving a Y    Y driven by nn-X    0.36
    X driving a Y    0.36
    Y driven by a nn-X    0.35
    Y parked in the nnnn X    0.34
    Y parked in the nnn X    0.34
    X say the driver of a Y    0.33
    Y parked on the nnn X    0.33
    X fled the scene in a Y    0.30
    Y, driven by nn-X    0.30
    X were driving a Y    0.30
X of letting Y    X of allowing Y    0.25
    X of spending more Y    0.23
    X of restricting Y    0.22
    X of pursuing the Y    0.21
    X of letting the Y    0.21
    X of raising the minimum Y    0.21
    X of giving a Y    0.21
    X of letting a Y    0.21
    X of holding a Y    0.20
    X of spending public Y    0.20
X to protect their Y    X to protect our Y    0.29
    X to protect their own Y    0.29
    X to protect your Y    0.28
    X to shield their Y    0.27
    X to protect its Y    0.27
    X protect their Y    0.27
    X protecting their Y    0.26
    X to safeguard their Y    0.25
    X in protecting their Y    0.23
    X to protect the Y    0.23

X popped Y    X was popping Y    0.16
    X is popping Y    0.16
    X by popping Y    0.16
    Y popped and the X    0.15
X will close on Y    X will close on monday, Y    0.27
    X will close on friday, Y    0.27
    X closes on Y    0.27
    X would close on Y    0.26
    X is expected to close on Y    0.26
    X, which closes on Y    0.26
    X to close on or about Y    0.25
    X closed through Y    0.25
    X would open on Y    0.24
    X will close on saturday, Y    0.24

X and helped his Y    X he joined his Y    0.18
    X he helped his Y    0.16
    X to open the fifth Y    0.15
    X accused of helping his Y    0.15
    X to lead the men's Y    0.15
X of negotiating with Y    X of not negotiating with Y    0.24
    X of negotiating with the Y    0.22
    X of spending more Y    0.22
    X against negotiating with Y    0.22
    X not to negotiate with Y    0.22
    X to negotiate with Y    0.21
    X's refusal to negotiate with Y    0.20
    X of isolating Y    0.20
    X of considering the Y    0.19
    X of disarming Y    0.19
X pitting Y    X pitting the Y 0.22
X that pitted Y 0.18
X broke out between Y 0.18
X that has pitted Y 0.17
Y-takes-all X 0.17
Y to prevail in this X 0.16
X, pitting Y 0.16
Y-take-all X 0.16
X broke out between the Y 0.16
X erupted between the Y 0.16
X adjusted its Y    X cut its nnnn and nnnn Y 0.23
X on thursday lowered its Y 0.22
X on tuesday raised its Y 0.22
X has downgraded its Y 0.22
X on friday raised its Y 0.21
X lowering its Y 0.21
X also lowered its Y 0.21
X also downgraded its Y 0.21
X on wednesday raised its Y 0.21
X to adjust its Y 0.21
X who send their Y    X from sending their Y 0.31
X who want to send their Y 0.28
X who sent their Y 0.22
X teach their Y 0.21
X are worried their Y 0.20
X to enrol their Y 0.19
X do not want their Y 0.19
X who bring their Y 0.18
X still want their Y 0.18
X who pull their Y 0.18
X gives me Y    X that gave me Y 0.27
X, it gave me Y 0.26
X of gives me Y 0.26
X give me Y 0.26
X of gives you Y 0.25
X that gives me Y 0.25
X it gave me Y 0.24
X will give you Y 0.23
X and it gave me Y 0.22
X, it gives you Y 0.22
X who get into Y    X are not getting into Y 0.25
X who have gotten in Y 0.25
X get into Y 0.25
X were getting into Y 0.25
X or got into Y 0.23
X get into financial Y 0.21
X involved in physical Y 0.21
X after getting into Y 0.20
X have gotten into Y 0.20
X who try Y 0.20
X and increased Y    X will result in higher Y 0.19
X, and increased Y 0.18
X will be increased Y 0.16
X, maximising Y 0.16
Y coupled with higher X 0.16
X while improving Y 0.16
X while maximizing the Y 0.16
X due to lower Y 0.15
X and reducing Y 0.15
X, increased Y 0.15
X is bound by Y    X to abide by Y 0.18
X has not violated Y 0.16
X may have violated Y 0.16
X bound by Y 0.15
X has promulgated Y 0.15
X is opposing the Y    X strongly opposed the Y 0.18
X were opposing the Y 0.17
X will oppose the Y 0.15
X authorized a Y 0.15
X, which is supporting the Y 0.15
X were affected by Y    X affected by Y 0.30
X have been affected by Y 0.29
X hardest hit by Y 0.27
X hard hit by Y 0.27
X that were affected by Y 0.24
X hit hardest by Y 0.22
X had been affected by Y 0.21
X devastated by Y 0.21
X who were affected by Y 0.21
X already hit by Y 0.21
X begin at Y    X beginning at Y 0.29
X, which begin at Y 0.28
X that begin at Y 0.28
X start at Y 0.27
X start at n Y 0.26
X begin at nn Y 0.24
X begin at n am and Y 0.24
X start at nn Y 0.23
X begin at nn:nn Y 0.23
X start at nn:nn Y 0.22
X to stop the Y    X to halt the Y 0.36
X to prevent the Y 0.31
X to try to stop the Y 0.27
X and stop the Y 0.25
X in order to stop the Y 0.25
X to stop this Y 0.23
X to stem the Y 0.23
X to help stop the Y 0.22
X of stopping the Y 0.22
X to stop further Y 0.22
X, updating Y    X, update Y 0.25
X and updated Y 0.23
X and updating Y 0.20
X and update Y 0.20
X, upgrading Y 0.19
X to update their Y 0.18
X with updated Y 0.17
X or update Y 0.17
X, update their Y 0.17
X to update Y 0.17
X announced by Y    X, announced by Y 0.20
X unveiled by Y 0.20
Y announced a X 0.19
X, which was announced by Y 0.19
X recently announced by Y 0.18
X was announced by Y 0.17
X announced last week by Y 0.17
X announced yesterday by Y 0.16
Y has ruled out any X 0.16
Y has defended the government's X 0.16
X took over at Y    X was named coach at Y 0.20
X has done at Y 0.20
X took charge at Y 0.19
X took the job at Y 0.19
X to take over at Y 0.18
X taking over at Y 0.17
X was in charge at Y 0.17
X that will keep him at Y 0.17
Y, then managed by X 0.17
X, whose side entertain Y 0.16
X reported fourth Y    X reported first Y 0.38
X reported fourth quarter nnnn Y 0.37
X reported second Y 0.36
X reported fourth quarter net Y 0.36
X reported third Y 0.32
X reported fourth quarter nnnn net Y 0.29
X reported better than expected first Y 0.26
X posted second-Y 0.25
X said fourth-Y 0.24
X reported record third Y 0.23
X offers low Y    X by offering low Y 0.17
X offer low Y 0.17
X to offer lower Y 0.17
X will offer low Y 0.16
X and provide low Y 0.16
X enjoy low Y 0.15
X is offering special introductory Y 0.15
X offer lower Y 0.15
X can charge higher Y 0.15
X clinched Y    X clinched victory for the Y 0.18
Y came back through X 0.16
X clawed their Y 0.15
X changed on Y    X changed dramatically on Y 0.32
X and will last until Y 0.28
X, which was signed Y 0.28
X as scheduled on Y 0.26
Y nn, when he lost X 0.25
X he received on Y 0.24
X did not end on Y 0.23
X, which was submitted on Y 0.23
X and lasts until Y 0.23
X ratified on Y 0.23
X is carried by Y    X, which is carried by Y 0.33
X, which is transmitted by Y 0.32
X is transmitted to humans by Y 0.32
X is spread by Y 0.31
X, which is spread by Y 0.31
X is transmitted by infected Y 0.30
X is a mosquito-borne Y 0.30
Y, which can spread the X 0.29
Y carrying the deadly X 0.29
Y, which transmit the X 0.28
X can close the Y    X can close the gap on Y 0.21
Y clear of second-placed X 0.20
X cannot close the Y 0.20
X could close the Y 0.19
Y's champions league win over X 0.17
X are second on nn Y 0.17
Y as united beat X 0.17
X not to lose this Y 0.17
X have closed the Y 0.17
Y of clubs including X 0.17
X starts in Y    X, which starts in Y 0.32
X which begins in Y 0.29
X which starts in Y 0.28
X begins in Y 0.27
X, which begins in Y 0.26
X gets underway in Y 0.22
X, which concludes in Y 0.19
X, will be played in Y 0.19
X starting in Y 0.18
X against australia starting in Y 0.18
X come from the Y    X are from the Y 0.18
Y, add and X 0.16
X-producing areas of the Y 0.16
X that come from the Y 0.16
X that come out of the Y 0.15
X are coming from the Y 0.15
X varied by Y    X vary significantly by Y 0.24
X did not differ by Y 0.23
X differed by Y 0.21
X were broken down by Y 0.20
X differ by Y 0.18
X may vary by Y 0.18
X differs by Y 0.18
X, broken down by Y 0.16
X include age, Y 0.16
X are broken down by Y 0.16
X, are good Y    X and are good Y 0.22
X that were good Y 0.19
X who were good Y 0.18
X that are good Y 0.18
X touched the wall in a Y 0.16
X are much better Y 0.16
X, and are great Y 0.15
X were good Y 0.15
X, they're good Y 0.15
X who are good Y 0.15
X making nn Y    X by making nn Y 0.21
X making nine Y 0.20
X, making three Y 0.17
X, making seven Y 0.16
X, making five Y 0.16
X had made nnn Y 0.15
X and made nn Y 0.15
X, making four Y 0.15
X that made nn Y 0.15
X and made just nn Y 0.15
X have urged Y    X are urging Y 0.23
X to urge Y 0.22
X have called on Y 0.21
X are also urging Y 0.21
X have warned Y 0.18
X are calling on Y 0.18
X have also urged Y 0.18
X have appealed to Y 0.17
X say other Y 0.17
X have blasted the Y 0.17
X, a decorated Y    X, himself a decorated Y 0.34
X, nn, a decorated Y 0.32
X is a decorated Y 0.28
X - a decorated Y 0.27
X, a highly decorated Y 0.25
X as a decorated Y 0.24
X was a decorated Y 0.24
X, was a decorated Y 0.23
X-old decorated Y 0.22
X, a decorated veteran of Y 0.21
X can detect Y    X to detect Y 0.38
X that can detect Y 0.35
X used to detect Y 0.33
X, which can detect Y 0.33
X for detecting Y 0.32
X detect Y 0.29
X detects Y 0.29
X that detects Y 0.28
X that detect Y 0.27
X to detect the Y 0.27
X dedicated to providing Y    X focused on providing Y 0.24
X that provides a range of Y 0.24
X devoted to providing Y 0.22
X committed to providing Y 0.20
X aimed at providing Y 0.19
X that offers Y 0.19
X that provides human Y 0.19
X that provides technology and Y 0.19
X that promotes sustainable Y 0.19
X that provides free Y 0.18
X for handling the Y    X on handling the Y 0.23
X for dealing with the Y 0.22
X for responding to Y 0.18
X for managing the Y 0.18
X to try to alleviate the Y 0.18
X for coping with the Y 0.17
X on handling Y 0.16
X for tackling the Y 0.16
X are prepared to handle the Y 0.16
X for securing Y 0.16
X, evading Y    X and felony evading Y 0.42
X and evading Y 0.37
X and fleeing and eluding Y 0.37
X and driving under Y 0.34
X, and resisting Y 0.34
X and eluding Y 0.34
X, attempting to elude Y 0.33
X and possession of a controlled Y 0.33
X, tampering with Y 0.33
X, evading arrest and Y 0.33
X of suspending Y    X of resuming Y 0.26
X would suspend Y 0.24
X of halting Y 0.22
X of restarting Y 0.20
X of establishing a separate Y 0.20
X nn to suspend Y 0.20
X of restricting Y 0.19
X including suspension of Y 0.19
X for halting Y 0.19
X of using eminent Y 0.19
X has offered its Y    X wishing to offer their Y 0.16
X and offered my Y 0.15
X attached to the Y    X attached to Y 0.24
X affixed to the Y 0.22
X mounted on the Y 0.21
X are attached to the Y 0.21
Y attached to a X 0.20
X is attached to the Y 0.19
X was attached to the Y 0.19
X attached to a Y 0.19
X attached to its Y 0.19
X attached to his Y 0.19
X, matching Y    X and matching Y 0.27
X with matching Y 0.21
X nn, matching Y 0.17
X and a ruffled Y 0.17
X that topped Y 0.15
X stained with Y 0.15
Y and matching X 0.15
X can get Y    X also can get Y 0.25
X can also get Y 0.23
X who are getting Y 0.20
X who get Y 0.20
X get Y 0.20
Y are provided to X 0.20
X can still get Y 0.20
X who obtain Y 0.19
X can obtain Y 0.19
X who have gotten Y 0.19
X have blamed the Y    X blame the Y 0.26
X have attributed the Y 0.24
X are warning that the Y 0.23
X have warned that the Y 0.22
X also blame the Y 0.20
X fear an Y 0.19
X warn the Y 0.19
X blame on the Y 0.19
X have blamed for the Y 0.19
X fear the Y 0.18
X-ringed Y    X-rimmed Y 0.19
X-fringed Y 0.18
Y ringed with X 0.18
X-circled Y 0.17
X-colored plastic Y 0.17
X and interconnect Y    X, interconnect Y 0.24
Y such as embedded X 0.22
X, interconnect and Y 0.20
X, interconnects, Y 0.19
Y enabling seamless X 0.18
Y interconnects, X 0.18
X interconnects, Y 0.17
X, printed circuit boards and Y 0.17
Y, such as embedded X 0.17
X and compute Y 0.17
X and require Y    X and requiring Y 0.26
X, and require Y 0.23
X by requiring Y 0.22
X, requiring Y 0.22
X, require Y 0.22
X to require Y 0.21
X, would require Y 0.20
X required Y 0.19
X would require Y 0.19
X that would require Y 0.19
X, as acting Y    X, will serve as acting Y 0.21
X, was named acting Y 0.20
X to serve as acting Y 0.20
X, to serve as interim Y 0.19
X, will become interim Y 0.19
X, was named interim Y 0.19
X was named acting Y 0.18
X, who became acting Y 0.18
X became acting Y 0.18
X would serve as acting Y 0.18
X are due Y    X are due by Y 0.51
X are due no later than Y 0.47
X were due by Y 0.42
X must be in by Y 0.42
X are due by friday, Y 0.42
X were due Y 0.41
X are due by n Y 0.41
X, which are due by Y 0.41
X are due by monday, Y 0.40
X, which are due Y 0.40
X said it sees Y    X said it expected Y 0.25
X said it saw Y 0.22
X still expects Y 0.22
X is expected to report Y 0.21
X reduced its nnnn Y 0.21
X said yesterday that fourth-Y 0.21
X raised its fourth-Y 0.21
X predicts nnnn Y 0.21
X also lowered its Y 0.21
X said it still expects Y 0.20
X are always the Y    X were definitely the Y 0.21
X were clearly the Y 0.18
X have always been the Y 0.17
X, are now the Y 0.17
X who are really the Y 0.17
X, but we were the Y 0.16
X must have been the Y 0.16
X have really been the Y 0.16
X they have been the Y 0.16
X we've been the Y 0.15
X sold nnn Y    X sold n,nnn Y 0.32
X each sold nnn Y 0.24
X, which bought n,nnn Y 0.22
X each bought nnn Y 0.19
X, which sold n,nnn Y 0.18
X, sold nnn Y 0.17
X buying nnn Y 0.17
X reported sales of n,nnn Y 0.16
X is to sell nn,nnn Y 0.16
X expects to sell nn,nnn Y 0.16
X, and operates Y    X and operates Y 0.22
X) and operates Y 0.21
X, and operates nn Y 0.18
Y located throughout X 0.17
X they have three Y 0.16
X and adjusting Y    X by adjusting Y 0.24
X, adjust Y 0.23
X to adjust Y 0.22
X and adjust Y 0.20
X, adjusting Y 0.20
X, and adjust Y 0.19
X or adjusting Y 0.19
X and adjusting the Y 0.18
X, adjust their Y 0.18
X are adjusting their Y 0.17
X begins with the Y    X, begins with the Y 0.18
X begins in nnnn with the Y 0.17
X starts with the Y 0.16
X begins today with the Y 0.16
X culminates with the Y 0.16
X to speak about Y    X to talk about Y 0.18
X to talk to students about Y 0.16
X to talk openly about Y 0.16
Y is a taboo X 0.16
X find other Y    X seek other Y 0.29
X to find other Y 0.28
X in finding new Y 0.27
X to find alternative Y 0.27
X have found other Y 0.27
X will find other Y 0.26
X to seek alternative Y 0.26
X had to find alternative Y 0.23
X will have to find other Y 0.23
X can find other Y 0.22
X betting Y    X-betting Y 0.42
X betting at Y 0.26
Y betting X 0.25
X-based betting Y 0.25
X's betting Y 0.23
X, betting Y 0.23
X's sports betting Y 0.23
X betting and Y 0.23
X and betting Y 0.22
X betting and gambling Y 0.22
X to realign Y    X for realigning Y 0.22
X for realigning the Y 0.20
X to re-align Y 0.18
X to reorganize the Y 0.16
Y would lose n,nnn X 0.16
X of realigning Y 0.16
X going westbound on Y 0.15
X of cleaning up Y    X for cleaning up Y 0.34
X to clean up Y 0.33
X on cleaning up Y 0.29
X of cleaning up the Y 0.27
X in cleaning up Y 0.27
X to cleaning up Y 0.26
X and cleaning up Y 0.26
X of dollars to clean up Y 0.26
X is to clean up Y 0.26
X cleaning up Y 0.26
X recovered at the Y    X recovered for the Y 0.23
X (nn tackles, n.n Y 0.21
X fumble at the Y 0.20
X (nn tackles, n Y 0.20
X lost nn-nn at Y 0.20
X recovered the ball at the Y 0.19
X recovered on the Y 0.19
X nn-nn win over the Y 0.18
X (n-n) play the Y 0.18
X recovered the fumble at the Y 0.18
X due to rising Y    X because of rising Y 0.44
X caused by rising Y 0.40
X as a result of rising Y 0.37
X caused by higher Y 0.35
X on rising Y 0.35
X as rising Y 0.33
X amid rising Y 0.33
X from rising Y 0.33
X resulting from rising Y 0.32
X, as rising Y 0.32
X gained national Y    X, who gained national Y 0.21
X gained national attention last Y 0.19
X gained international Y 0.19
X has gained national Y 0.18
X gained national attention in Y 0.18
X, which gained national Y 0.18
X and gained international Y 0.18
X gaining national Y 0.18
X gained global Y 0.17
X and has gained international Y 0.17
X opened and Y    X open a little Y 0.18
Y of opened the X 0.18
X did not open in Y 0.18
Y it will open the X 0.18
X opened when the Y 0.18
X opened and the Y 0.18
X opened, and Y 0.18
Y and it opened the X 0.17
X would open for the Y 0.16
X opened at n Y 0.16
X reckon Y    X reckon the Y 0.21
X wonder if the Y 0.18
X are predicting Y 0.17
X have been predicting for Y 0.16
X were suggesting that the Y 0.16
X predict that Y 0.16
X are predicting that Y 0.16
X believe the latest Y 0.15
X believe that the bank of Y 0.15
X have predicted that this Y 0.15
X are finding Y    X are still finding Y 0.26
X are catching Y 0.23
X are finding a few Y 0.21
X are doing well for Y 0.21
Y are being caught by X 0.20
X have been catching Y 0.20
X report good Y 0.19
X look for Y 0.19
X are finding plenty of Y 0.19
X invest in their Y 0.19
X and synchronized Y    X, synchronized Y 0.35
X, and synchronized Y 0.31
Y, tumbling, X 0.26
Y and synchronized X 0.25
Y, synchronized swimming and X 0.24
Y, synchronized swimming, X 0.24
X, synchronized swimming and Y 0.22
X, tumbling, Y 0.21
X and the synchronized Y 0.21
X synchronized Y 0.20
X, called an Y    X known as an Y 0.30
X called an Y 0.30
X, known as an Y 0.27
X is called an Y 0.19
X - called an Y 0.17
X called an Y 0.17
X, called an Y 0.16
X now allow Y    X already allow Y 0.18
X that are available to Y 0.18
X make it easier for Y 0.18
X restrict the use of Y 0.17
X do not allow Y 0.16
X also allow Y 0.16
Y have comprehensive X 0.16
Y show up on their X 0.15
X that prohibit a Y 0.15
X that's got Y    Y are suspected, X 0.16
X on speaking in Y 0.15
X else to do Y 0.15
X registered on Y    X were registered on Y 0.25
X (ended Y 0.19
X in the nn months ending Y 0.19
X rose by n,nnn in Y 0.18
X have registered since Y 0.18
X registered between Y 0.18
X was lodged on Y 0.18
Y nnnn was nn.n X 0.18
X sent to him on Y 0.17
X ending in mid-Y 0.17
X is to save Y    X was to save Y 0.34
X is to get as much Y 0.29
X is to serve the Y 0.23
X is to attract Y 0.23
X and to save Y 0.22
X was to pursue Y 0.22
X is to safeguard the Y 0.21
X is to draw more Y 0.21
X is to care for the Y 0.20
X is to reassure the Y 0.20
X charged by the Y    Y charging higher X 0.25
X charged by Y 0.24
Y will charge higher X 0.22
X charged by most Y 0.21
Y charges high X 0.20
X charged by their Y 0.20
Y paying high X 0.20
X to be charged by the Y 0.20
X are going up nn Y 0.20
Y charge high X 0.20
X and told Y    X, telling Y 0.16
X when he told Y 0.15
X-old told Y 0.15
X, she told Y 0.15
X unloading Y    X loading and unloading Y 0.32
X unload Y 0.27
X and unload Y 0.27
X to unload Y 0.26
X and unloading Y 0.26
Y are loaded onto X 0.26
X, unloading Y 0.25
X to load and unload Y 0.25
X hauling Y 0.24
X carrying n,nnn tons of Y 0.24
X before making Y    X before you make Y 0.29
X before making any Y 0.26
X before they make Y 0.26
Y were made last X 0.25
X when they make Y 0.25
X prior to making Y 0.24
X and then making Y 0.24
X to make those Y 0.23
X before we make Y 0.23
X after making Y 0.23
X has a diverse Y    X has a diverse portfolio of Y 0.22
Y had become X 0.16
X has a diverse range of Y 0.15
X starting with the Y    X beginning with the Y 0.17
X scheduled this Y 0.16
X but has never won the Y 0.15
X, graduated to Y 0.15
X made some Y    X, made some Y 0.19
X and is making Y 0.18
Y were made, the X 0.18
X made a number of Y 0.18
X did make some Y 0.17
X, made a few Y 0.17
X or make Y 0.17
X and made a lot of Y 0.17
X had made some Y 0.16
X have made some Y 0.16
X assisted with Y    X assisting with Y 0.20
X and assist with Y 0.17
X helped with Y 0.15
X that get Y    X that receive Y 0.25
X that do not receive Y 0.24
X that currently receive Y 0.21
X that do not get Y 0.20
X that applied for Y 0.20
X that receive the Y 0.20
X that are receiving Y 0.19
X that received Y 0.19
X, which receive Y 0.18
X that need the Y 0.18
X from reading Y    X at reading Y 0.20
X of reading Y 0.19
X with reading Y 0.19
X who do not read Y 0.18
X for reading Y 0.16
X, they read Y 0.15
X to simplify its Y    X to optimize its Y 0.22
X simplifies our Y 0.22
X by expanding our Y 0.21
X to optimise its Y 0.20
Y, inc. has selected the X 0.19
X is a much simpler Y 0.18
X to restructure its Y 0.18
X to align the Y 0.18
X to overhaul its Y 0.17
X to modernise its Y 0.17
X had a combined Y    X posted a combined Y 0.21
X, which had a combined Y 0.19
X had an aggregate Y 0.19
X will have a combined Y 0.18
X have a combined Y 0.17
X generated a combined Y 0.16
X, who finished nnth Y 0.15
X, which currently has a Y 0.15
X, had a combined Y 0.15
X to identify those Y    X to better identify Y 0.17
X for identifying new Y 0.17
X to better understand the Y 0.16
X for identifying Y 0.16
X to work with those Y 0.15
X fought off a Y    X fought off a pair of Y 0.20
X fended off a Y 0.18
X had to save two Y 0.16
Y to defeat american X 0.15
X to fight off a Y 0.15
X fought off a series of Y 0.15
X, was given Y    X, has been granted Y 0.18
X and was given Y 0.18
X after being given Y 0.16
X is given Y 0.15
X booked their Y    X booked their place in the Y 0.53
Y that started against X 0.28
Y when they take on X 0.27
X were the better Y 0.27
X ran out n-n Y 0.26
X have completed the Y 0.25
X have been linked with a Y 0.25
Y-final win over X 0.25
X took the lead on nn Y 0.24
X are a very good Y 0.24
X-handled Y    X of long-handled Y 0.21
X had armed himself with a Y 0.20
X-handle Y 0.20
X wielding a Y 0.18
X stabbed with a Y 0.17
Y and stabbed the X 0.17
X swinging a Y 0.16
X found a bloody Y 0.16
X while brandishing a Y 0.16
X suspected of using a Y 0.16
X of slowing the Y    X of slowing down the Y 0.44
X of stimulating the Y 0.32
X of stopping the Y 0.31
X was to slow the Y 0.30
X of limiting the Y 0.29
X of controlling the Y 0.29
X in hopes of slowing the Y 0.28
X of significantly reducing the Y 0.27
X of accelerating the Y 0.27
X of raising the Y 0.25
X and shaved Y    X, shaved Y 0.49
X with shaved Y 0.39
X, grated Y 0.31
Y tossed with X 0.30
Y and served with X 0.30
X and grated Y 0.30
X tossed with Y 0.29
X, and shaved Y 0.29
X and chopped Y 0.29
Y-dried tomatoes, X 0.28
X funded with Y    X financed with Y 0.34
X that are funded with Y 0.33
X will be funded with Y 0.32
X were funded with Y 0.30
X funded through Y 0.30
X is funded with Y 0.29
Y used to fund X 0.28
X, funded with Y 0.27
Y are used to fund X 0.27
X are funded with Y 0.27
Abstract
Paraphrases are textual expressions that convey the same meaning using different surface forms. Capturing the variability of language, they play an important role in many natural language applications including question answering, machine translation, and multi-document summarization. In linguistics, paraphrases are characterized by approximate conceptual equivalence. Since no automated semantic interpretation systems available today can identify conceptual equivalence, paraphrases are difficult to acquire without human effort. The aim of this thesis is to develop methods for automatically acquiring and filtering phrase-level paraphrases using a monolingual corpus.
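The appendix tables above list learned patterns as (pattern, paraphrase, score) triples, with X and Y marking the argument slots. The sketch below is one minimal, hypothetical way such patterns could be applied: it is not the thesis's own system, and the function names, sample triples, and score threshold are illustrative assumptions.

```python
import re

# A few (pattern, paraphrase, score) triples in the style of the appendix
# tables; "X" and "Y" mark the argument slots (sample values, not exhaustive).
PARAPHRASES = [
    ("X to stop the Y", "X to halt the Y", 0.36),
    ("X begin at Y", "X start at Y", 0.27),
    ("X priced at Y", "X is priced at Y", 0.26),
]

def pattern_to_regex(pattern):
    """Turn a slot pattern into a regex that captures the X and Y fillers."""
    body = re.escape(pattern).replace("X", "(?P<x>.+?)").replace("Y", "(?P<y>.+)")
    return re.compile(body)

def paraphrase(sentence, min_score=0.2):
    """Return paraphrased variants of `sentence` using patterns above min_score."""
    variants = []
    for src, tgt, score in PARAPHRASES:
        if score < min_score:
            continue
        m = pattern_to_regex(src).search(sentence)
        if m:
            variants.append(tgt.replace("X", m.group("x")).replace("Y", m.group("y")))
    return variants

print(paraphrase("efforts to stop the violence"))  # ['efforts to halt the violence']
```

In a real system the score threshold would be tuned per application, since (as the tables show) low-scoring candidates are often related but not meaning-preserving.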
Asset Metadata
Creator: Bhagat, Rahul (author)
Core Title: Learning paraphrases from text
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 08/05/2009
Defense Date: 04/30/2009
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: information extraction, Learning and Instruction, OAI-PMH Harvest, paraphrases, Patterns, selectional preferences
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Hovy, Eduard (committee chair); Hobbs, Jerry R. (committee member); Knight, Kevin (committee member); McLeod, Dennis (committee member); O'Leary, Daniel E. (committee member); Pantel, Patrick (committee member)
Creator Email: rahul_b10@yahoo.com, rbhagat@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m2482
Unique Identifier: UC1126518
Identifier: etd-Bhagat-3037 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-189848 (legacy record id), usctheses-m2482 (legacy record id)
Legacy Identifier: etd-Bhagat-3037.pdf
Dmrecord: 189848
Document Type: Dissertation
Rights: Bhagat, Rahul
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu