LEARNING SEMANTIC TYPES AND RELATIONS FROM TEXT
by
Dirk Hovy
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2013
Copyright 2013 Dirk Hovy
Dedication
To my parents and siblings, for being there. I love you.
Acknowledgements
After much deep and profound brain things inside my head, I have decided to thank you.
King Julien XIII – Madagascar
Even though it often feels like you are alone, a PhD is really the result of your environment and
the people around you. I am indebted to all of them, but I would like to personally mention a few.
First of all, I would like to thank my advisor, Jerry Hobbs, who has already had every idea I
could possibly think of 25 years ago, but was always kind enough to let me try my own approach.
I have also greatly benefitted from the guidance and feedback of my committee members, David
Chiang, Elsi Kaiser, Kevin Knight, and Dennis McLeod, who have always been willing to discuss
my ideas and supported me with advice concerning my research.
I would also like to thank my mentor, Ed Hovy, who introduced me to NLP. If you are just
reading this to find out whether we are related, the answer is yes: second cousins once removed.
You can stop reading now.
ISI is a great place to work and study, because it provides you with such a productive
environment. My special thanks go to the admin staff, Kary Lau, Alma Nava, Melissa Snearl-
Smith, Lisa Winston, and Peter Zamar, who promptly and competently helped me with any
non-research problems I had. I would like to thank all the ISI researchers, especially Zornitsa
Kozareva, for many discussions, lunch, and honest feedback, and her unbounded enthusiasm, as
well as former ISIer Paul Groth, for keeping things in perspective; and Stephan Gouws, who is
an ISIer at heart, and an inspiration in many ways, not just for efficient writing.
An integral part of ISI’s stimulating research environment is its large student population. I am
indebted to all the current and former graduates, who happily shared their ideas and experiences
with me: Rahul Bhagat, Jon May, Steve DeNeefe, Jason Riesa, Sujith Ravi, and many more.
Some of them have been directly involved in this work:
Stephen Tratz has been my friend, office mate, and collaborator for many years, and taught
me a lot about programming, efficiency, and LEGO.
Ashish Vaswani started out as my first TA and became a friend and frequent collaborator,
who always happily shared his bottomless knowledge and enthusiasm. I am glad that we had the
chance to work together on several occasions.
Victoria Fossum has been instrumental to this work like nobody else. Not only was she ready
to discuss any obscure detail, proofread all my drafts, and listen to all my talks, but she also
motivated me when I needed encouragement, and gave her honest advice.
Adam Lammert has not been at ISI, but I met him on my first day at USC and shared several
classes and many interesting discussions with him. Despite working in different fields, we finally
managed to get a paper together, which makes me very happy.
Many of my papers have been collaborations, and I would like to thank my co-authors
not mentioned otherwise. I have learned a lot from each and every one of them, and feel lucky to
have worked with so many talented people: Taylor Berg-Kirkpatrick, James Fan, Alfio Gliozzo,
Kartik Goyal, Sujay Kumar Jauhar, Huying Li, Donald Metzler, Siddharth Patwardhan, Anselmo
Peñas, Mrinmaya Sachan, Whitney Sanders, Shashank Shrivastava, Chris Welty, and Chunliang
Zheng.
I would like to thank the guys from IBM Research in Hawthorne for the internship I spent
with them, and our work on two DARPA evaluations for deep reading, as well as everyone in
CMU’s Language Technology Institute who made my stay there a fantastic experience.
Lastly, I would like to thank all my friends who had nothing to do with academia, and who helped
to keep me grounded and happy.
For almost four years, Yan Lee and Ethan Hamilton have been my roommates in Playa, and
I am grateful they lent me their ears when I needed them and dragged me away from my studies
when they felt I needed it.
Stefan Fabry and Mitra Martin have been great Tango teachers and friends, whose classes
kept me busy after hours.
Taj Faatagi and Oscar Guzman have been my workout partners and close friends, and their
dedication has not only kept me in shape, but also probably somewhat sane.
I am incredibly lucky to have a great set of friends back home, who were undeterred by
the distance and my periodical silence, and not only stayed in touch, but were there whenever I
needed them. Thanks to Carsten Bug, Jan Froitzheim, Andreas Kolloch, Lars Meyer, Phil Meuser,
Christoph Purschke, and the Saarbrücken Huschels, Tassilo Barth, Antonia Scheidel, and Marc
Schulder.
Last but not least, many thanks to The Curious Palate in Mar Vista and Pannikin Cafe in La
Jolla for being the kind of place that lets you sit there and write while serving you a strong cup
of coffee.
Most of all, thanks to you, for reading this! You are a great person, I like you.
Abstract
NLP applications such as Question Answering (QA), Information Extraction (IE), or Machine
Translation (MT) are incorporating increasing amounts of semantic information. A fundamental
building block of semantic information is the relation between a predicate and its arguments,
e.g., eat(John, burger). In order to reason at higher levels of abstraction, it is useful to group relation instances according to the types of their predicates and the types of their arguments. For example, while eat(Mary, burger) and devour(John, tofu) are two distinct relation instances, they share the underlying predicate and argument types INGEST(PERSON, FOOD). A central
question is: where do the types and relations come from?
The subfield of NLP concerned with this is relation extraction, which comprises two main
tasks:
1. identifying and extracting relation instances from text
2. determining the types of their predicates and arguments
The first task is difficult for several reasons. Relations can express their predicate explicitly
or implicitly. Furthermore, their elements can be far apart, with unrelated words intervening. In
this thesis, we restrict ourselves to relations that are explicitly expressed between syntactically
related words. We harvest the relation instances from dependency parses.
The second task is the central focus of this thesis. Specifically, we will address these three
problems: 1) determining argument types, 2) determining predicate types, and 3) jointly determining argument and predicate types. For each task, we model predicate and argument types as latent variables in a hidden Markov model. Depending on the type system available for each of these tasks, our
approaches range from unsupervised to semi-supervised to fully supervised training methods.
The central contributions of this thesis are as follows:
1. Learning argument types (unsupervised):
We present a novel approach that learns the type system along with the relation candidates
when neither is given. In contrast to previous work on unsupervised relation extraction,
it produces human-interpretable types rather than clusters (published as [53]). We also
investigate its applicability to downstream tasks such as knowledge base population and
construction of ontological structures. An auxiliary contribution, born from the necessity
to evaluate the quality of human subjects, is MACE (Multi-Annotator Competence Estima-
tion), a tool that helps estimate both annotator competence and the most likely answer.
2. Learning predicate types (unsupervised and supervised):
Relations are ubiquitous in language, and many problems can be modeled as relation prob-
lems. We demonstrate this on a common NLP task, word sense disambiguation (WSD)
for prepositions (PSD). We use selectional constraints between the preposition and its ar-
gument in order to determine the sense of the preposition (published as [51, 52, 115]). In
contrast, previous approaches to PSD used n-gram context windows that do not capture the
relation structure. We improve the supervised state of the art for two type systems.
3. Argument types and predicate types (semi-supervised):
Previously, there was no work on jointly learning argument and predicate types because (as
with many joint learning tasks) there is no jointly annotated data available. Instead, we have
two partially annotated data sets, using two disjoint type systems: one with type annotations
for the predicates, and one with type annotations for the arguments. We present a semi-
supervised approach to jointly learn argument types and predicate types, and demonstrate
it for jointly solving PSD and supersense-tagging of their arguments. To the best of our
knowledge, we are the first to address this joint learning task.
Our work opens up interesting avenues for both the typing of existing large collections of
triple stores, using all available information, and for WSD of various word classes. More gener-
ally, it provides insights into semantic problems such as dynamic ontology construction, semantic
representation design, inference, and semantic analysis (part of the work on PSD has been used
in a semantic parser [114]).
Table of Contents
Dedication
Acknowledgements
Abstract
List of Tables
List of Figures

I Introduction
Chapter 1 OVERVIEW
1.1 Motivation
1.2 Contributions
1.3 Outline

II Background
Chapter 2 RELATED WORK
Chapter 3 PRELIMINARIES
3.1 Unsupervised Learning
3.2 Learning Algorithms
Chapter 4 FRAMEWORK
4.1 Sparsity and Constraints
4.2 Evaluation

III Experiments
Chapter 5 ARGUMENT TYPES
5.1 Introduction
5.2 Inferring Argument Types
5.3 Learning Type Hierarchies
5.4 Constructing Knowledge Bases
Chapter 6 PREDICATE TYPES
6.1 Prepositional Relations
Chapter 7 JOINTLY LABELING PREDICATE AND ARGUMENT TYPES
7.1 Data
7.2 Model and Learning
7.3 Evaluation
7.4 Results
7.5 Related Work
7.6 Conclusion

IV Conclusion and Future Work
Chapter 8 CONCLUSIONS
Chapter 9 FUTURE WORK
References
Appendix A: Nature of Partial Annotations
Appendix B: MACE
B.1 Model
B.2 Experiments
B.3 Identifying Reliable Annotators
B.4 Recovering the Correct Answer
B.5 Implementation
B.6 Related Research
B.7 Conclusion
List of Tables
5.1 Percentage of relations derived from labeling the full data set that were judged sensible
5.2 Percentage of relations derived from labeling instances with unknown entities that were judged sensible
5.3 Effect of input word order (grammatical vs. relation order) and Markov horizon (bigram vs. trigram HMM) on prediction accuracy (football domain)
5.4 Accuracy of relation-ordered trigram models on various domains
5.5 MRR for blanked-out position of relation-ordered trigram models on various domains
5.6 Percentage of local type hierarchies judged sensible
6.1 Accuracy over all prepositions w. different models and training. Best accuracy: MAP-EM + smoothed L0 norm on our model. Italics denote significant improvement over baseline at p < .001. Numbers in brackets include against (used to tune MAP-EM and Bayesian Inference hyper-parameters)
6.2 Accuracies (%) for Different Context Types and Sizes
6.3 Accuracies for Word-Extraction Using a Parser or Heuristics
6.4 Accuracies (%) for Leave-One-Out (LOO) and Only-One Word-Extraction-Rule Evaluation. none includes all words and serves for comparison. Important words reduce accuracy for LOO, but rank high when used as only rule
6.5 Features used in our system
6.7 Accuracies (%) for Different Classifications. Comparison with [80]* and [115]**
6.6 Accuracies (%) for Coarse and Fine-Grained PSD, Using MALT and Heuristics. Sorted by preposition
6.8 Precision, Recall and F1 Results (%) for Coarse-Grained Classification. Comparison to [80]. Classes ordered by frequency
7.1 The two type systems used. The 26 noun and 15 verb supersenses are the WordNet lexicographer file names; the 7 preposition supersenses are derived from the Penn Treebank
7.2 Accuracy for bigram model, split up for each element
7.3 Comparison of accuracy of various models on preposition senses only
7.4 Precision, recall and F1 for individual preposition senses. Comparison between previous work and the bigram model
7.5 Performance of various models on argument senses
B.1 Correlation with annotator proficiency: Pearson r of different methods for various data sets. MACE-VB’s trustworthiness parameter (trained with Variational Bayes with α = β = 0.5) correlates best with true annotator proficiency
B.2 Accuracy of different methods on data sets from [102]. MACE-VB uses Variational Bayes training. Results @n use the n% items the model is most confident in (Section B.4.1). Results below the double line trade coverage for accuracy and are thus not comparable to the upper half
List of Figures
3.1 Elements of Hidden Markov Models
4.1 Graphical model of the HMM derived from the data with either grammatical word order (a) or relation word order (b)
4.2 Difference between true distribution and distribution found by EM
4.3 Effect of dictionary and token constraints on lattice. The types of “walk” and “wood” are constrained by a dictionary, disallowing noun types for “walk” and verb types for “wood”. “through” is constrained by a token constraint (shaded), so we can disregard all other types (grey with dotted outlines)
5.1 The ten most frequent relations discovered by our system for the domains. Types written in CAPITALS, common nouns lower case
5.2 Examples of dictionary entries with counts. Types in brackets are not considered
5.3 Inference lattice for “Montana throw to receiver” with a) grammatical or b) relation word order. Verb marked grey, token constrained nodes dotted
5.4 Example hierarchy for a subset of types in the football domain
5.5 Example for naive axiom derived from HMM
6.1 Two examples of sentences with prepositions and their dependency parse trees. The preposition triples are marked
6.2 Distribution of Class Labels in the WSJ Section of the Penn TreeBank
6.3 Graphical Models. a) standard 1st order HMM. b) variant used in experiments (one model per preposition, thus no conditioning on p). c) incorporates further constraints on variables
6.4 Accuracy for PSD systems improves linearly with amount of partial annotations. The dotted line indicates the unsupervised results from [52]. Accuracies to the right of the vertical line improve significantly (at p < 0.001) over this
7.1 Venn diagram of the data distribution. The subset we are interested in solving is marked grey
A.1 Labeling one example of each sense yields better results than all examples of any one sense. Senses ordered by frequency
B.1 Graphical model: Annotator j produces label A_ij on instance i. Label choice depends on instance’s true label T_i, and whether j is spamming on i, modeled by binary variable S_ij. N = |instances|, M = |annotators|
B.2 Generative process: For each instance i, the true label T_i is sampled from the uniform prior. Then, S_ij is drawn for each annotator j from a Bernoulli distribution with parameter 1 − θ_j. If S_ij = 0, A_ij copies the true label. Otherwise, the annotation A_ij is sampled from a multinomial with parameter vector ξ_j
B.3 Annotations: 10 annotators on three items, labels {1, 0}, 5 annotations/item. Missing annotations marked ‘–’
B.4 Tradeoff between coverage and accuracy for RTE (left) and temporal (right). Lower thresholds lead to less coverage, but result in higher accuracy
B.5 Influence of adverse annotator strategy on label accuracy (y-axis). Number of possible labels varied between 2 (top row) and 4 (bottom row). Adverse annotators either choose at random (a) or always select the first label (b). MACE needs fewer good annotators to recover the correct answer
B.6 Varying number of annotators: effect on prediction accuracy. Each point averaged over 10 runs. Note different scale for WSD
B.7 Varying the amount of supervision: effect on prediction accuracy. Each point averaged over 10 runs. MACE uses supervision more efficiently
B.8 Example CSV input to MACE for three instances and 15 annotators
Part I
Introduction
Chapter 1
OVERVIEW
1.1 Motivation
NLP applications such as Question Answering (QA), Information Extraction (IE), or Machine
Translation (MT) are incorporating increasing amounts of semantic information. Prominent ex-
amples include Watson [35], the computer that won the Jeopardy challenge; NELL [16], the
never-ending language learner; and the DARPA-funded Learning by Reading (LbR) project [6].
Incorporating semantics generally requires knowledge about entities, concepts, and ac-
tions. These are often expressed as relations between a predicate and its arguments, e.g.,
eats(John, pizza). Relations connect two or more words of certain types with one another.
These words are called the arguments of the relation. The word which expresses the relation
is called the predicate. In the sentence “John eats pizza”, the verb eats acts as a predicate for
the arguments “John” and “pizza”. Other words besides verbs can act as predicates as well. In
“Food for thought”, the preposition for acts as a predicate. We will address both cases in this
thesis. For the latter, we are the first to consistently model prepositions as relations. Types
are abstractions of the entities we observe in our data. For the above example, this could be
INGEST(PERSON, FOOD). They allow us to discover similarities between distinct lexical en-
tities. Typing, or correctly identifying the type of a lexical entity, is thus an important part of any
semantic approach, as it allows us to generalize beyond the word level. We can then discover re-
curring patterns in the data, derive features, and use them as constraints for excluding candidates
that have the wrong type. E.g., in QA systems, if we know we are looking for an answer with
type CITY, we can discard answers like “giraffe” or “John Wayne” that have incorrect types.
Both predicates and arguments have types, i.e., they can both be generalized. Predicate types
allow us to group relation candidates together. Argument types constrain the words that can
legally occupy the argument positions to the ones that have the correct type. For verbs, this
is known as “selectional restrictions”. Each word can have multiple types, either independent
of one another (such as etymological origin vs. part-of-speech), or organized in an ontological
structure, with hierarchical orderings between them (such as location → country → city). In “Man bites dog”, possible types for the predicate “bites” include VERB and ATTACK; possible types for the arguments “man” and “dog” are NOUN, PERSON, ANIMAL, or ENTITY. NOUN is independent of the other types, while PERSON and ANIMAL are hierarchically related to ENTITY. Note that “man” and “dog” share most of those types; they differ only in PERSON vs. ANIMAL.
A relation is thus an implicit or explicit function that has certain constraints on the arguments and
returns information about them.
A central question is: where do the types and relations come from?
The subfield of NLP concerned with this question is relation extraction, which comprises
two main tasks:
1. identifying and extracting relation instances from text
2. determining the types of their predicates and arguments
The first task is difficult for several reasons. Relations can express their predicate explicitly
or implicitly, e.g., the BORN IN(PERSON, YEAR) relation can be expressed explicitly as “Mozart was born in 1756”, or implicitly, without a predicate, as “Wolfgang Amadeus Mozart (1756—1791)” [93]. Furthermore, relation elements can be far apart, with unrelated words
intervening. If the predicates and arguments appear together within a window of fixed size,
we can use pattern matching over surface strings to extract relations [43, 61, 83, 93]. Often,
however, intervening words and phrases between predicates and arguments cause fixed-window
pattern matching approaches to fail. To capture the relation between “man” and “dog” in “The
man who regularly gets a stomachache on Monday mornings bit the rabid-looking three-legged
dog”, a pattern-based approach would need to allow for a lot of wildcards, making it too general,
or specify a pattern that matches this particular sentence, making it too specific. The OpenIE
approach [5, 127] uses syntactic dependencies as relation candidates. They are better suited than
surface sequences, because they are invariant to interfering words. Dependency parsers allow
us to easily extract the syntactically related words in subject, verb, or object position. In the
previous example, the dependency structure would still recognize the underlying syntactic SVO
relationship bites(man,dog). In this thesis, we restrict ourselves to relations that are explicitly
expressed between syntactically related words. See, however, our work on implicit relations and
arguments that are not syntactically related in [48, 50].
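To make this concrete, the following minimal sketch extracts a verb-headed triple from a hand-coded toy dependency parse of the example above; the parse, the label names (nsubj, dobj, root), and the helper function are illustrative placeholders rather than the extraction machinery used in this work.

# Minimal sketch: extracting a verb-headed relation triple from a dependency
# parse, independent of how many words intervene between the arguments.
# The parse below is hand-coded for illustration; a real system would consume
# parser output instead. Label names follow a common Stanford-style convention.

from typing import List, Optional, Tuple

# Each token: (index, word, head_index, dependency_label); head_index 0 = root.
TOY_PARSE = [
    (1, "man",   5, "nsubj"),        # subject of "bit"
    (2, "who",   3, "nsubj"),
    (3, "gets",  1, "rcmod"),        # relative clause modifying "man"
    (4, "stomachache", 3, "dobj"),
    (5, "bit",   0, "root"),         # main verb
    (6, "dog",   5, "dobj"),         # object of "bit"
]

def extract_svo(parse: List[Tuple[int, str, int, str]]) -> Optional[Tuple[str, str, str]]:
    """Return (verb, subject, object) for the root verb, if both arguments exist."""
    by_index = {idx: word for idx, word, _, _ in parse}
    root = next((idx for idx, _, head, label in parse if label == "root"), None)
    if root is None:
        return None
    subj = next((idx for idx, _, head, label in parse
                 if head == root and label == "nsubj"), None)
    obj = next((idx for idx, _, head, label in parse
                if head == root and label == "dobj"), None)
    if subj is None or obj is None:
        return None
    return (by_index[root], by_index[subj], by_index[obj])

print(extract_svo(TOY_PARSE))   # -> ('bit', 'man', 'dog')

Because the extraction follows head/dependent links rather than surface adjacency, the intervening relative clause has no effect on the extracted triple.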
The second task, determining the types of their predicates and arguments, is the central
focus of this thesis. Specifically, we will address these three problems:
1. determining argument types
2. determining predicate types
3. jointly determining argument and predicate types.
For each task, we model predicate and argument types as latent variables in a hidden Markov
model (HMM). Depending on the type system available for each of these tasks, our approaches
range from unsupervised to semi-supervised to fully supervised training methods.
The set of types and the required level of abstraction are highly task- and domain-specific [35].
One solution is the use of very general high-level types that can be applied to any domain.
However, this has the disadvantage of being too coarse-grained and potentially missing domain-
specific information. In a biomedical domain, for example, it will not be sufficient to classify
“aspirin”, “benzopyrene” and “opiates” as ARTIFACT (where they would be grouped together with “syringe” and “bed”), but we require a more precise type, such as DRUG. We might even want to split that type into the sub-types PAINKILLER and ANTIBIOTIC. Not having those
types can result in absurd generalizations and diminish performance. However, in a different
domain, such a fine-grained level of type distinctions among biomedical artifacts would have
little benefit. We present an approach that derives domain-specific types directly from the data.
The central contributions of this thesis are as follows:
1. Learning argument types (unsupervised):
Existing approaches to relation extraction rely on pre-defined type systems, which are
either domain-specific, or domain-general. Domain-specific types perform best [35], but
must be manually constructed and adapted for each domain. Unfortunately, this set is not
necessarily known a priori. Furthermore, constructing type systems by hand is laborious
and expensive, so we would like to learn type systems from data to allow for fine-grained
control and rapid domain adaptation. General types such as named entities or WordNet
supersenses [34] can be applied to any domain, but may fail to capture critical domain-
specific information. Most previous approaches either relied on annotated data [5, 75]
or based their methods on existing (semi-)structured sources of information (Wikipedia,
Factz, DBpedia, etc.) [29, 109, 123]. Fully unsupervised clustering approaches [96] can be
applied ad-hoc, but the resulting types (cluster numbers) are not human-interpretable. In
Chapter 5, we present a novel approach that learns the type system along with the relation
candidates when neither is given. In contrast to previous unsupervised work, it produces
human-interpretable types rather than clusters (published as [53]). We also investigate
its applicability to downstream tasks such as the construction of ontological structures (Section 5.3) and knowledge base population (Section 5.4). An auxiliary contribution, born from
the necessity to evaluate the quality of human subjects, is MACE (Multi-Annotator Com-
petence Estimation), a tool that helps estimate both annotator competence and the most
likely answer.
2. Learning predicate types (unsupervised and supervised):
Relations are ubiquitous in language, and many problems can be modeled as relation prob-
lems. Rather than the common subject-verb-object triples, we extend the application to
prepositional relations, and demonstrate the applicability to a common NLP task, namely
word sense disambiguation (WSD) for prepositions (see Chapter 6). Our relation-based
approach works with various amounts of supervision, and improves over window-based
approaches for two different type systems. Since prepositions are a ubiquitous closed word
class, we assume that the type system is the same across all domains. Here, we are given
a set of types and want to find the best way to apply it to the data. We use selectional
constraints between the preposition and its arguments in order to determine the sense of
the preposition (published as [51, 52, 115]). In contrast, previous approaches to PSD used
n-gram context windows that do not capture the relation structure. We improve the supervised state of the art for two type systems.
3. Argument types and predicate types (semi-supervised):
Lastly, we show how to combine different type systems for predicates and arguments to
jointly label them (see Chapter 7). Previously, there was no work on jointly learning argu-
ment and predicate types because (as with many joint learning tasks) there is no jointly an-
notated data available. Instead, we have two partially annotated data sets, using two disjoint
type systems: one with type annotations for the predicates, and one with type annotations
for the arguments. We present a semi-supervised approach to jointly learn argument types
and predicate types, and demonstrate it for jointly solving PSD and supersense-tagging of
their arguments. Our approach works even in the absence of jointly annotated data. We
exploit the mutual constraints of arguments and predicates and reach accuracies that rival
supervised methods. To the best of our knowledge, we are the first to address this joint
learning task.
Our work opens up interesting avenues for WSD of various word classes, and for rapid do-
main adaptation and typing of existing large collections of triple stores [32, 67, 86, 113], using
all available information. It provides general insights into semantic problems such as dynamic
ontology construction, representation design, abductive inference, and semantic analysis.
1.2 Contributions
In this thesis, we solve several problems in relation extraction: 1) learning argument types; 2)
learning predicate types, and 3) jointly learning argument types and predicate types. Depending
on the type system available for each of these tasks, our approaches range from unsupervised to
semi-supervised to fully supervised methods.
1. Learning argument types (unsupervised):
Existing approaches to relation extraction rely on pre-defined type systems, which are
either domain-specific, or domain-general. We present a novel unsupervised approach that
learns the appropriate type system directly from data while extracting relation instances at
the same time (published as [53]). We evaluate the resulting type system and the relation
instances extracted under that type system via human sensibility judgments, and via label accuracy in a type recovery task. As an auxiliary contribution, we released MACE (Multi-
Annotator Competence Estimation), a tool that helps estimate both annotator competence
and the most likely answer (published as [47]). We also explore inducing ontological hier-
archies among the types, and incorporating the learned types into a downstream system for
abductive reasoning.
2. Learning predicate types (unsupervised and supervised):
Word sense disambiguation for verbs typically uses selectional constraints over argument
types to determine the verb sense itself. Similarly, we cast WSD for prepositions (PSD)
as a relation extraction task, using selectional constraints between the preposition type and
its argument types to determine the sense of the preposition (published as [51, 52, 115]).
Previous approaches to PSD have not used selectional constraints over argument types,
and relied on n-gram context windows. Our approach applies to different models and
significantly improves supervised state-of-the-art accuracy for two different preposition
type systems.
3. Argument types and predicate types (semi-supervised):
Finally, we combine predicate and argument typing. We demonstrate this by jointly solving
PSD and supersense tagging. Our semi-supervised approach uses two partially annotated
data sets to jointly learn argument types and predicate types, rivaling supervised state-of-
the-art results for both subtasks. To the best of our knowledge, we are the first to address
the joint learning of argument types and preposition types.
1.3 Outline
The remainder of this thesis is organized as follows:
Part II provides some of the background necessary for a better understanding of this work.
– Chapter 2 reviews related work to situate this thesis in the larger context of (unsuper-
vised) relation extraction and typing.
– Chapter 3 explains some of the underlying concepts of unsupervised learning.
– Chapter 4 presents the modeling and learning framework adopted throughout this
thesis.
Part III presents a number of experiments that focus on different aspects of relation types,
and the application of this framework to external tasks.
– In Chapter 5, we show how we can learn domain-specific argument types and their ap-
plication to relations directly from text. We are given a large amount of text, without
additional information like type systems, and need to perform two tasks:
1. extracting relation candidates
2. discovering an appropriate type system
This work was published in part as [53].
– Section 5.3 extends the previous work to dynamically learn hierarchies among the
types.
– The last section in chapter 5 describes how to apply the learned models to a
downstream application, namely populating abductive knowledge bases (Section 5.4).
While the previous chapter focused on argument types, chapter 6 is centered around the
predicate types. We make two changes to the setup we have encountered in the previous
chapter:
1. the relations are prepositional constructions rather than verbal ones, i.e., the predicates
are now prepositions.
2. we are given a set of preposition types and have to find the best way to apply them
We show how a relation-based approach allows for better disambiguation than previous,
window-based approaches. We present the first unsupervised model, published in [52],
and show that the relation-based approach carries over to fully supervised discriminative
models, outperforming the previous state-of-the-art by a significant margin. This work is
based on the publications in [49, 51, 115].
In the last chapter of experiments (chapter 7), we jointly type the predicate and arguments
of a relation. We exemplify this again via prepositional constructions. The difference to the
previous setting is that we are given two different type systems: one for the prepositions,
and a separate set for the arguments. While there exists labeled training data for each of
these sets, there is no jointly annotated data. The approach outlined before can be used
to jointly learn the types from two disparate sets in a common task, and achieves results
comparable to supervised systems on both subtasks (preposition sense disambiguation and
supersense tagging). This is the first approach to jointly typing predicates and arguments.
Part II
Background
Chapter 2
RELATED WORK
Relation Extraction
The task of relation extraction is well-established in NLP. Several paradigms have been applied,
which differ in the amount of supervision: supervised, semi-supervised, and unsupervised. The
task of relation extraction is subdivided into two parts: in relation identification, we want to find
out whether a sentence expresses a relation, and if so, which. In argument detection, we try to
identify the arguments, given that a sentence expresses a relation.
Supervised approaches tend to perform well in terms of accuracy. They use either se-
quential models with features, such as Conditional Random Fields (CRFs) [5, 64], or linear
classifiers (SVMs) with string or tree kernels [48, 75]. They do, however, require large amounts
of annotated data to achieve this high accuracy, i.e., the relations of interest and the type systems
have to be defined at training time.
Semi-supervised (sometimes also called weakly or distantly supervised) methods reduce this
requirement by looking for patterns based on either seed examples [43, 61, 83, 93], or instances
stored in existing data bases [74]. Patterns are word sequences, potentially including wild cards.
Patterns require less supervision, but need a basis of agreed-upon seeds, and potentially additional
annotations such as POS tags, WordNet senses, Named Entity types, etc. Patterns can be either
very general (“X of Y”), or very relation-specific (“X was born in Y”). General patterns have
high recall, but pick up a lot of noise. Very specific patterns have the opposite problem: they
might not match enough instances and thus have very low recall. In general, the main challenge
of pattern-based approaches is to balance these two extremes. Patterns can be very brittle and
have trouble with noisy data. They do, however, perform well in extracting large amounts of
candidates. Pattern-based approaches are thus used on large corpora like the Web [63, 127].
A powerful semi-supervised approach is the use of label propagation in graphs [111, 112],
which allows us to associate a set of type candidates with entities and thus learn the appropriate
types. However, this technique is transductive: while it is capable of taking both annotated and
unannotated data and labeling them, it does so by constructing a large graph and thus does not easily
generalize to new, unseen instances.
While the amount of supervision correlates with higher performance measures, supervised
and semi-supervised methods have several drawbacks. In relation extraction, we encounter sev-
eral problems where supervised and semi-supervised methods cannot help us.
The first is: suppose we can define the relations and arguments we want at training time,
but we do not have any annotated data. Supervised training requires substantial amounts of
annotated data. Producing it takes time, and the outcome largely depends upon the quality of
the annotations, which is synonymous with consistency and agreement among annotators. We
address some aspects of the problems with annotations via MACE (see Appendix B). Consistent
annotation, however, is especially difficult for relations. There are too many potential relations,
and agreement on what exactly constitutes the arguments in a given case is difficult. We adopt an
open IE approach and focus on syntactically related words, headed by a verb or preposition.
In the second case, we do not even know what relations and arguments we want to find
at training time. In these cases, not only do we not have any annotated data, but it is often a
completely new domain. We usually do not know what relations and arguments are relevant to
that domain. We address both of these problems jointly by learning the possible relations and the
argument types (see Section 5.2).
Another problem for the second case is that we do not know what granularity we should
be aiming for. Should we choose the most specific or the most general of the relations we learn
automatically? We explore this to some extent with our work on hierarchical ordering of argument
types (Section 5.3).
In all three cases, unsupervised techniques can help. They take plain text as input and do not
require additional annotations. Surprisingly, none of the work on unsupervised relation extraction
to date has made use of the sequential nature of relations. This ignores an important fact, namely
that the argument order matters. The good results from supervised sequential models emphasize
this potential. Instead, previous unsupervised relation extraction systems have focused on clus-
tering. Relation candidates are clustered based on context similarity [12, 29]. By using an HMM,
we are able to leverage the sequential nature of relations.
Some approaches compare possible relations to an existing data base such as Wikipedia,
Factz, or DBpedia [29,109,123]. While comparing against an existing repository is a good sanity
check, it has the problem that it requires an existing knowledge base that contains the items in
question. This essentially sidesteps the question of how to populate these knowledge bases in
the first place, and limits the quality of our extraction to the expressiveness of the underlying
resource. In this work, we make no use of external resources, but start from first principles.
Several NLP tasks can be modeled as relation extraction tasks. WSD, for example, is
closely related to the other subtask of relation extraction previously mentioned, namely argument
detection. Here, we try to find the arguments for a given relation. Many systems have thus
suggested to use WordNet or Named Entity types to disambiguate the arguments. If the predicate
is a verb, this is similar to inferring its selectional preferences, the task of assigning semantic
types to the arguments of a verb. It captures the notion that the object of “to drink” is typically a
BEVERAGE. Performance is improved by narrowing potential argument candidates down to a semantic type. If we know we are looking for a word of type CITY, we can discard answers like
“giraffe” or “John Wayne”. In this work, we model the task of preposition sense disambiguation
(PSD) as a relation extraction task: given a pair of arguments and an ambiguous predicate,
we have to find the type of the predicate. Casting the problem as a relation identification task
outperforms previous window-based approaches in the supervised realm [80, 129].
Extending this to an unsupervised sequential model is a logical step. To the best of our knowl-
edge, this was the first attempt at unsupervised PSD. There have been attempts at unsupervised
word sense disambiguation (WSD): [2] present an unsupervised system based on acquired sense ranking, and [130] present a noisy-channel model for general WSD. Neither of them includes a
sequential unsupervised model. If sequential models have been used, it was done in a supervised
setting [22].
Most unsupervised approaches for selectional preference modeling tried to induce a system
using some form of clustering on the WordNet sense hierarchies. [95] presented an unsupervised
approach based on computing information theoretic values for verb object types. [96] use an
unsupervised model to derive selectional predicates from unlabeled text. They do not assign argument types at all, but group similar predicates and arguments into unlabeled clusters using
Latent Dirichlet Allocation (LDA). The resulting clusters are not human-interpretable. [124,125]
use clustering plus distant supervision (FreeBase) on generative models to do relation extraction
and WSD.
[1] use HMMs on dependency snippets to induce the selectional preferences. However,
their model does not include transition probabilities. Instead, they train a separate model for
each verb, thus implicitly conditioning on each predicate. [23] used a Bayesian network over the
WordNet senses. In [85], the input to the HMM is not based on dependency relations, but surface
forms. However, the arguments in the training data were replaced with their respective named
entity type. Their model is thus not trained on plain text, and does not learn the correspondence
from words to types. In contrast, our models explicitly learn to label words with their appropriate
type. They are thus capable of turning plain text into a typed relation. All of the above works, including [32], use a predefined type system such as the WordNet hierarchy or NE types. This allows
for domain-independent systems, but limits the approach to a fixed set of oftentimes rather
inappropriate types. In contrast, we derive the level of granularity directly from the data [53].
Relations in Knowledge Bases
The main use of relations in NLP is as atomic units of information in knowledge bases (KBs).
The techniques presented in this thesis could aid in populating KBs more effectively. These
KBs are similar to the ones used in knowledge representation and reasoning. [76], based on
the work by [44], used the syntactic constituency structure to represent sentences as predicates
and infer semantic relations using predicate logic. [127] is another example of a QA system
that extracts large amounts of relational information from the Web. To populate KBs with
relations, people have long used surface patterns [13, 43, 63, 84, 93, 103]. While this is a very
natural approach, it has problems with lexical surface variation. For this reason, we follow
the OpenIE approach [5, 127] of using syntactic relations in this work. They are independent
of the actual word sequences and thus less susceptible to noise (such as intervening words,
sub-clauses, etc.). Dependency parses are ideal, because they are unconstrained by adjacency, and
can express long-range relations much better than constituency parses or surface patterns. [114]
has shown that dependency graphs can be augmented with more semantic information in the
form of semantic relations, such as preposition senses, noun-noun compound types, and others.
Typically, OpenIE relations are triples of a verb and its arguments, but we extend this approach
to also include prepositions (see Chapter 6.1).
[67] is a large scale knowledge base of relational facts extracted from dependency structures.
Recently, [86] and [32] further explored dependency relations as basic units for knowledge bases
in the context of the Learning by Reading project [6], and [113] for NELL [16]. All of them
extract dependency snippets (also called “cuts” or “frames”) following the subject-verb-object
structure from large corpora. [32] additionally aggregate Named Entity type information over
the argument slots. They do not use a model, but rely on maximum likelihood estimation. While
this provides useful statistics, it is very general and hard to adapt to new domains, and runs into
problems when encountering unknown words. A learned sequential model as presented here can
operate on these dependency data collections and make use of the context to type the arguments
and fill in knowledge gaps.
To sum up: previous approaches to relation extraction either relied on annotated data and
supervision of some form, which makes them hard to adapt to new domains, or they used
clustering, which results in uninterpretable types. In this thesis we will show how to derive new relations from unannotated text without recourse to existing knowledge bases, while at the
same time learning a type system; and how to apply this to relations other than subject-verb-
object triples. Our approach can be used for typing on existing large background knowledge
bases. It can also help with dynamic ontology construction, semantic representation design
(choosing the right type and level of granularity), inference, and semantic analysis, especially for
prepositional constructions.
Chapter 3
PRELIMINARIES
Since some of the work in this thesis relies on unsupervised learning techniques, I present an
outline in this section. The content is in part based on my tutorial on the EM algorithm [46].
This chapter can only present a brief overview of methods relevant to the thesis. Please see the
excellent works of [59, 60, 72, 91] for more in-depth treatment of unsupervised methods.
Readers with sufficient background in these areas can continue to Chapter 4.
3.1 Unsupervised Learning
Supervised training requires pairs of inputs (words) and outputs (types). It derives features from
these pairs and learns a model. For tasks with a small, well-defined and agreed-upon set of labels,
such as part-of-speech tagging, we can easily annotate or use pre-annotated data to learn a model.
If the correct type set is not available, however, supervised methods cannot be applied.
These situations include:
1. tasks where there is a low level of agreement on the labels, and thus no accepted labeled
data,
2. the need for rapid domain specific adaptation, which leaves no time to annotate sufficient
amounts of data,
3. languages other than English for which there is enough data and agreement on the labels,
but no annotated data
In this thesis, we will encounter situations 1 and 2. Unsupervised methods can help in two ways
here: a) they do not rely on annotation, and thus can potentially be trained on unlimited amounts
of plain text, and b) they allow us to find a set of labels that best explains the data. In more
complex tasks, we are often faced with less well-defined sets of labels or even tasks. Here, the
task has more of an explorative nature. Unsupervised methods can help us find the most likely
structure if the only thing we know is that there is a structure.
3.1.1 Graphical Models
Graphical models (GMs) are a standard way to express the conditional dependencies that hold
between the random variables in a probabilistic model.
GMs consist of two elements, nodes and arcs. The nodes are the random variables, i.e., events
that have some probability. The probabilities associated with each of the values of a random
variable sum to 1.0.
Arcs are the directed links between the random variables, and can be thought of as causal
relations (there are other kinds, but it is easiest this way). They denote what influence the parent
node has on the child node. This influence is expressed by a conditional probability. If there are
no links between two variables, then they are conditionally independent of one another given their
common ancestors in the graph. The probabilities are referred to as parameters of the model.
The performance of the models depends to some extent on the structure of the graphical model
that is used. Different topologies will produce different results. See the framework in Chapter 4
and Section 6.1.4.1 for more on this.
There are several types of popular graphical models (Bayes Nets, Conditional Random
Fields, HMMs), but the ones we are interested in are Hidden Markov Models (HMMs). They
are directed sequential models.
3.1.2 Hidden Markov Models
Figure 3.1: Elements of Hidden Markov Models: hidden variables y_1, y_2, y_3 and observed variables x_1, x_2, x_3, connected by transition and emission probabilities
HMMs model the joint probability of input and output variables. They consist of two sets
of variables: the observed variables (in our case the predicates and arguments), and the hidden
variables (in our case the types). See Figure 3.1 for a depiction of an HMM. Observed variables
are typically represented by shaded nodes.
The two types of variables are connected by an arc originating in the hidden variables. The
observed word is thus conditioned upon the hidden type. A common analogy is that the word was
“generated” by the hidden variable. The set of conditional probabilities between the hidden and
observed variables is called the emission probabilities.
Words in a sentence are not random collections—some sequences are more likely than others.
Later words thus depend on the ones before them. We include another kind of conditional prob-
abilities in our model to capture this intuition. Since we made the assumption that words were
generated by the hidden states, we condition the hidden variables on one another, rather than the
words. These arcs are called the transition probabilities. They express how a hidden random
variable depends on the previous ones. The state at any time step depends on the previous ones.
Usually, we assume that each state depends only on some of the previous states, not all previous
states. If we condition only on one previous state, the model is called a first-order or bigram
HMM; if we condition on two previous states, the model is called a second-order or trigram
HMM. Higher order HMMs are possible, but the number of transition probabilities becomes so
great that they often become infeasible to train in practice.
The collection of parameters defines the joint probability distribution. All parameters together are also referred to as the model. Multiplying the relevant parameters along a hidden state sequence, and summing over all possible hidden sequences, gives the probability that the observed data was “generated” by the hidden states. This measure is called the likelihood of the data.
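As a concrete illustration of the likelihood, the following sketch computes the probability of a short observed sequence under a toy first-order HMM by summing over all hidden label sequences with the standard forward recursion; the two types and all probability values are invented for this example.

# Illustrative sketch: likelihood of an observed sequence under a toy
# first-order (bigram) HMM, computed with the forward algorithm.
# States, words, and all probability values are invented for this example.

start = {"PERSON": 0.6, "FOOD": 0.4}                       # P(y_1)
trans = {"PERSON": {"PERSON": 0.3, "FOOD": 0.7},           # P(y_i | y_{i-1})
         "FOOD":   {"PERSON": 0.5, "FOOD": 0.5}}
emit  = {"PERSON": {"John": 0.8, "burger": 0.2},           # P(x_i | y_i)
         "FOOD":   {"John": 0.1, "burger": 0.9}}

def likelihood(words):
    """P(words) = sum over all hidden sequences of P(words, hidden)."""
    # alpha[y] = probability of the prefix seen so far, ending in state y
    alpha = {y: start[y] * emit[y].get(words[0], 0.0) for y in start}
    for word in words[1:]:
        alpha = {y: sum(alpha[prev] * trans[prev][y] for prev in alpha)
                    * emit[y].get(word, 0.0)
                 for y in start}
    return sum(alpha.values())

print(likelihood(["John", "burger"]))   # marginal probability of the sequence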
Once an HMM is constructed to represent the underlying dependencies, we need to find the
values for the conditional probabilities that hold for each arc. In order to do that, we can apply an
unsupervised learning algorithm.
3.2 Learning Algorithms
The most popular training algorithms for HMMs are EM [26] and Bayesian methods using sam-
pling. The most popular Bayesian method is the Chinese Restaurant Process with Gibbs sam-
pling. Both learning algorithms alternate between estimating the hidden variables given the current parameters (the E-step in EM), and updating the parameters to better explain the observed data given those estimates (the M-step). For EM, this process is guaranteed never to decrease the likelihood of the observed data. Training is stopped after a fixed number of iterations, or once the likelihood of the data improves only marginally between iterations.
We can use the learned model parameters to infer the most likely hidden sequence that pro-
duced the words we observe via decoding. This way, we can produce the types or labels of the
observed words.
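Decoding itself can be sketched with the usual Viterbi recursion, which maximizes over hidden sequences instead of summing and keeps track of the best path so the label sequence can be read off at the end; the parameters below are again toy values rather than learned ones.

# Illustrative sketch: Viterbi decoding for a toy first-order HMM.
# Given trained parameters, it returns the most probable hidden (type)
# sequence for an observed word sequence. All values are invented.

start = {"PERSON": 0.6, "FOOD": 0.4}
trans = {"PERSON": {"PERSON": 0.3, "FOOD": 0.7},
         "FOOD":   {"PERSON": 0.5, "FOOD": 0.5}}
emit  = {"PERSON": {"John": 0.8, "burger": 0.2},
         "FOOD":   {"John": 0.1, "burger": 0.9}}

def viterbi(words):
    """Most likely hidden state sequence for the observed words."""
    best = {y: (start[y] * emit[y].get(words[0], 0.0), [y]) for y in start}
    for word in words[1:]:
        new_best = {}
        for y in start:
            score, path = max(
                (best[prev][0] * trans[prev][y] * emit[y].get(word, 0.0),
                 best[prev][1] + [y])
                for prev in best)
            new_best[y] = (score, path)
        best = new_best
    return max(best.values())   # (probability, state sequence)

print(viterbi(["John", "burger"]))   # -> (0.3024, ['PERSON', 'FOOD'])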
There are many possible combinations of emission and transition parameters, and thus many
models. Better models assign a higher probability to the observed data. Ideally, we want to find
the model that best explains the data. There are different types of prior knowledge we can bring
to bear. The simplest form are clever initializations for the parameters from which to start the
training. A common practice is to run the algorithm several times, starting out with different
parameter initializations. We then keep the model that assigns the highest likelihood to the data.
In practice, this often leads to satisfying results. Section 4.1 introduces some more ways to induce
prior knowledge, namely priors and data constraints.
The biggest shortcoming of unsupervised models is their evaluation performance. They usu-
ally achieve lower accuracy than supervised models. Since they model problems only as de-
pendencies between random variables, it is difficult to include features. However, features can
capture general properties of the data, and act as constraints on the model. In the absence thereof,
the learned probability distributions in graphical models tend to be close to uniform, and thus often fail to match the underlying true distribution. Two common ways to remedy this are the use
of priors to induce sparsity, and the addition of constraints on the applicable types (see section
4.1).
Chapter 4
FRAMEWORK
In this chapter, we describe the modeling and learning framework adopted throughout this the-
sis. If an experiment differs from this approach in some way, it will be explicitly noted in the
respective section on modeling.
In this work, we are concerned with relations, which we define as an ordered triple (l, p, r), where l and r are the left and right argument, respectively, and p is the predicate of the relation. We treat the triple as an observed sequence, and associate a hidden state with each of its elements
(the types we are trying to find). Since all instances are triples, we can fully specify the joint
probability of the model as
P(l, p, r, y) = P(y_1) · P(l | y_1) · P(y_2 | y_1) · P(p | y_2) · P(y_3 | y_2) · P(r | y_3)    (4.1)
where y is a label sequence with three elements. Training the model with EM gives us estimates of the emission parameters P(x_i | y_i) of the unknown entities, and the transition parameters P(y_i | y_{i-1}), as well as the start parameters P(y_1), denoted by π.
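Read as code, Equation 4.1 is simply a product of one start, two transition, and three emission parameters. The following sketch spells this out for a single triple in grammatical order, using the example from Figure 4.1; the parameter values are placeholders, not estimates produced by training.

# Sketch of Equation 4.1: joint probability of a relation triple (l, p, r)
# and a label sequence y = (y1, y2, y3) under the bigram HMM.
# All parameter tables hold made-up values for illustration only.

pi    = {"QUARTERBACK": 0.2, "THROW_TO": 0.1, "RECEIVER": 0.2}            # P(y_1)
trans = {("QUARTERBACK", "THROW_TO"): 0.4, ("THROW_TO", "RECEIVER"): 0.5}  # P(y_i | y_{i-1})
emit  = {("QUARTERBACK", "Montana"): 0.05,                                 # P(x_i | y_i)
         ("THROW_TO", "throw_to"): 0.9,
         ("RECEIVER", "receiver"): 0.3}

def joint(triple, labels):
    """P(l, p, r, y) = P(y1) P(l|y1) P(y2|y1) P(p|y2) P(y3|y2) P(r|y3)."""
    (l, p, r), (y1, y2, y3) = triple, labels
    return (pi[y1] * emit[(y1, l)]
            * trans[(y1, y2)] * emit[(y2, p)]
            * trans[(y2, y3)] * emit[(y3, r)])

# Grammatical order: (l, p, r) = ("Montana", "throw_to", "receiver")
print(joint(("Montana", "throw_to", "receiver"),
            ("QUARTERBACK", "THROW_TO", "RECEIVER")))

For relation order, the same function applies with the triple and the label sequence rearranged so that the predicate comes first.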
We can arrange our input triples in two different ways:
1. grammatical order, x = (l, v, r), i.e., x_1 and x_3 are the arguments, x_2 the verb

2. relation order, x = (v, l, r), i.e., x_1 is the verb and x_2 and x_3 are the arguments
The former is more intuitive, but the latter is more advantageous for relation extraction: it puts the (unambiguous) verb first, so that the type of the first argument, or even of both arguments (depending on the Markov order), is conditioned on it (see Figure 4.1).
[Figure: two HMM lattices in which hidden types (y_1, y_2, y_3) generate the observed triple (x_1, x_2, x_3); example instance "Montana throw_to receiver", typed as "quarterback throw_to receiver"]
Figure 4.1: Graphical model of the HMM derived from the data with either grammatical word
order (a) or relation word order (b)
Most experiments presented here follow this model, but sometimes include modifications. In
those cases, the model section of the experiment will point out the changes.
4.1 Sparsity and Constraints
In language, most probability distributions follow a Zipf curve: the probability of an event is inversely proportional to its frequency rank, so most of the probability mass is concentrated in a few high-frequency cases. Many NLP tasks can thus achieve good accuracy by modeling only the most frequent cases. These distributions are often called sparse, since many of the possible outcomes receive little or no probability mass.
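As a quick, illustrative calculation (not from the thesis data) of how much mass a Zipfian distribution concentrates in its top ranks:

    import numpy as np

    # Zipf with exponent 1 over 10,000 events: the top ranks dominate.
    ranks = np.arange(1, 10001)
    p = 1.0 / ranks
    p /= p.sum()
    print(p[:10].sum())    # ~0.30: the 10 most frequent events out of 10,000
    print(p[:100].sum())   # ~0.53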
[Figure: bar chart of the true sense distribution vs. the distribution found by EM; x-axis: sense #, y-axis: number of instances]
Figure 4.2: Difference between true distribution and distribution found by EM
Unfortunately, fully unsupervised training of graphical models tends to find much “flatter”, more uniform distributions. The difference between the learned and the true distribution causes a bad fit and results in low model performance (see also [57] for more on this problem). The example graph in Figure 4.2 shows the Zipf distribution of the senses for the word “in” in black and the distribution found by EM in grey. The first few senses (shown on the x-axis) account for most of the cases (frequency is shown on the y-axis). The EM distribution is much flatter, so it assigns too little weight to the high-frequency cases, and consequently too much weight to the low-frequency cases.
One of the central challenges for learning good models is thus to minimize the divergence between the learned and the true distribution. We can achieve this either algorithmically, through parameters that adjust the learning method, or via the data, by restricting the applicable types for certain words.
Various approaches have tried to induce sparsity in unsupervised learning: [119] zeroes out as many parameters as possible in order to concentrate more probability mass on fewer labels. Bayesian methods use priors when sampling from the data: the more often a case has been sampled, the more likely it becomes to be sampled again, which leads to a more Zipf-shaped distribution. In this work, I use several methods to induce sparsity (see Section 6.1.4.2).
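The effect of a sparsity-inducing prior can be illustrated with a symmetric Dirichlet: a concentration parameter below 1 pushes most of the probability mass onto a few outcomes, while a parameter of 1 spreads it more evenly. The values below are illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)
    sparse = rng.dirichlet([0.1] * 20)   # alpha < 1: most entries near zero
    flat = rng.dirichlet([1.0] * 20)     # alpha = 1: mass spread more evenly
    # Typically far fewer entries carry noticeable mass in the sparse draw.
    print((sparse > 0.05).sum(), (flat > 0.05).sum())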
The simplest way to enforce sparser distributions is by restricting the space of types that
can generate a particular word. Fully unsupervised models assume that any observation can have
any hidden label, i.e., that any word can have any type. However, this creates a huge search
space and leads to suboptimal solutions, since the model gets stuck in local maxima. In practice,
we usually know that this assumption is not true: a noun could never have a verb type, nor
vice versa. If we encode this knowledge, we can reduce the search space considerably and thus
improve performance.
There are two kinds of constraints commonly applied in graphical models: dictionary constraints (also called type constraints in the literature; to avoid confusion with the semantic types we are concerned with here, we use the term dictionary constraints) and token constraints. The distinction is whether a constraint always applies to a word in any context, or only to a word in a specific context.
Dictionary constraints are usually encoded as mappings from a word to all its applicable
types (irrespective of the context the word appears in). See, among others, [39, 72, 92, 110] for
the use of dictionaries in HMMs for POS tagging, NE recognition, etc.
Token constraints apply to a concrete word in a specific context, e.g., we know the types of
this specific token, say the third word in the fourth sentence. Note that token constraints override
dictionary constraints, i.e., if we know that a word has a certain type in a specific context, we do
not even need to consider all the legal types listed in the dictionary.
[Figure: decoding lattice for the instance “walk through wood”, with columns of candidate types such as V.MOTION, P.DIR, and N.LOCATION per position]
Figure 4.3: Effect of dictionary and token constraints on lattice. The types of “walk” and “wood”
are constrained by a dictionary, disallowing noun types for “walk” and verb types for “wood”.
“through” is constrained by a token constraint (shaded), so we can disregard all other types (grey
with dotted outlines).
Both serve to constrain the space of emission parameters during training and thus guide the
process. By disallowing certain types, we essentially zero out their probability: the distribution
becomes sparser. See Figure 4.3 for an example of both types of constraints.
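The following minimal sketch shows how the two kinds of constraints restrict the label space position by position; the words, type names, and data structures are illustrative, not the thesis implementation:

    # Dictionary constraints apply to a word in any context; token constraints
    # apply to one position in a specific sentence and override the dictionary.
    ALL_TYPES = ["V.MOTION", "V.STATIVE", "P.DIR", "P.LOC", "N.LOCATION", "N.PERSON"]

    dictionary = {
        "walk": ["V.MOTION", "V.STATIVE"],
        "wood": ["N.LOCATION", "N.PERSON"],
    }
    token_constraints = {1: ["P.DIR"]}   # position 1 of this sentence is fixed

    def allowed_types(sentence):
        lattice = []
        for i, word in enumerate(sentence):
            if i in token_constraints:               # token constraint wins
                lattice.append(token_constraints[i])
            else:
                lattice.append(dictionary.get(word, ALL_TYPES))
        return lattice

    print(allowed_types(["walk", "through", "wood"]))
    # [['V.MOTION', 'V.STATIVE'], ['P.DIR'], ['N.LOCATION', 'N.PERSON']]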
If either constraint restricts the number of applicable types for a word to one, we essentially have an annotation on that word. Supervised data is thus nothing but observations with constraints for every token. If we only have token constraints for some words, we refer to the data as partially annotated. Partial annotations can result from
- the combination of data from different sources, only some of which are annotated
- unambiguous words (i.e., words whose dictionary entry contains only one type, regardless of context; here, dictionary and token constraints coincide)
- simple heuristics that allow us to disambiguate words in a certain context, or at least narrow down their possible labels
I have explored a way to include token constraints via partial annotations in [49], and investigated the effect that the kind and amount of annotation have on accuracy for two tasks, POS tagging and preposition sense disambiguation (see also Appendix A). Other works that have explored token constraints are [28, 37, 42, 118] and [110]. In this work, we will exploit both types of constraints to guide the learning of our models.
4.2 Evaluation
While unsupervised models are useful when we have little or no annotated data, such settings make it challenging to evaluate any progress. As laid out by [100], we typically have to resort to a held-out test set (which could have been better used as annotated training data), a mapping to an existing resource (which defeats the exploratory nature of unsupervised methods), extrinsic evaluation in a larger system (which is more often than not unavailable), or human judgement.
The latter is often treated as the ultimate test, since the goal of most NLP applications is to produce human-acceptable outputs (see the objections in [100], though). Non-expert annotation services like Amazon's Mechanical Turk (AMT) are cheap and fast ways to evaluate systems and provide categorical annotations for training data. Unfortunately, some annotators (called spammers) choose labels haphazardly in order to minimize their effort, thereby introducing noise. Manual identification of these unreliable annotators is time consuming.
There are two common ways to address this:
1. include control items where the answer is known: annotators that fail these tests will be
excluded.
2. assume that good annotators exhibit similar behavior to one another, while spammers ex-
hibit a deviant answer pattern.
The first method works well, but it can discard otherwise reliable annotators if they just happen to be wrong on one item. Preventing this requires additional manual effort.
The second method requires some form of agreement measure. Several inter-annotator agreement measures exist (such as Cohen's κ, Fleiss' κ, the G-index, etc.), but they often have considerable bias, such as overestimating chance agreement. Worse, all inter-annotator agreement measures suffer from a fundamental problem: removing or ignoring annotators with low agreement will always improve the overall agreement score, irrespective of the quality of their annotations. Unfortunately, there is no natural stopping point: deleting the most egregious outlier always improves agreement, until only one annotator with perfect agreement is left [54].
Inspired by this experience, we developed MACE (Multi-Annotator Competence Estimation), a tool that allows us to find both trustworthy annotators and the most likely answer (see Appendix B). It is an implementation of a class of graphical models known as item-response models. In contrast to inter-annotator agreement measures, MACE does not discard any annotators, but simply downweights their contributions. We thus do not lose information.
MACE learns in an unsupervised fashion to a) identify which annotators are trustworthy
and b) predict the correct underlying labels. The results were published in [47], and the system
is available for download at http://www.isi.edu/publications/licensed-sw/mace/. We match the performance of more complex state-of-the-art systems and perform well even under adversarial conditions. We show considerable improvements over standard baselines, both for predicted label accuracy and for trustworthiness estimates. The latter can be further improved by introducing a prior on the model parameters and using Variational Bayes inference. Additionally, we can achieve even higher accuracy by focusing on the instances our model is most confident in (trading off some recall), and by incorporating annotated control instances. Ours is the first such
system to be made available for download and can be used by experimenters in any field.
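To give a flavor of the underlying idea, the following is a deliberately simplified sketch of item-response-style aggregation; it is not the released MACE model (which adds priors, a per-annotator spamming distribution, and Variational Bayes), and the function names and toy votes are illustrative. It alternates between estimating a posterior over the true label of each item from competence-weighted votes, and re-estimating each annotator's competence as agreement with those posteriors:

    import numpy as np

    def aggregate(votes, n_labels, n_iter=50):
        # votes: {item: {annotator: label}}
        annotators = {a for v in votes.values() for a in v}
        competence = {a: 0.5 for a in annotators}
        for _ in range(n_iter):
            posteriors = {}
            for item, v in votes.items():                   # label posteriors from weighted votes
                scores = np.full(n_labels, 1e-6)
                for a, label in v.items():
                    scores[label] += competence[a]
                posteriors[item] = scores / scores.sum()
            for a in annotators:                            # competence = agreement with posteriors
                agree, total = 0.0, 0
                for item, v in votes.items():
                    if a in v:
                        agree += posteriors[item][v[a]]
                        total += 1
                competence[a] = agree / max(total, 1)
        labels = {item: int(np.argmax(p)) for item, p in posteriors.items()}
        return labels, competence

    votes = {
        "q1": {"ann1": 0, "ann2": 0, "spammer": 1},
        "q2": {"ann1": 1, "ann2": 1, "spammer": 1},
        "q3": {"ann1": 0, "ann2": 0, "spammer": 1},
    }
    print(aggregate(votes, n_labels=2))   # spammer ends up with the lowest competence

Unlike spammer-filtering heuristics, nothing is discarded here: unreliable annotators simply contribute less to the label posteriors.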
Part III
Experiments
Chapter 5
ARGUMENT TYPES
In this chapter, we will see that it is possible to learn a sensible, accurate, and human-interpretable type system directly from data, to induce hierarchies among its types, and to apply the learned type system to a downstream application, namely the construction of abductive knowledge bases.
5.1 Introduction
Semantic NLP systems perform best when their knowledge about types and relations is adapted to
the domain at hand [35]. General ontologies, knowledge bases, and type systems, such as named
entities or WordNet supersenses [34], need to be manually adjusted, a painstaking and expensive
process.
Instead, we assume that the information is contained in the data, and can be inferred given
enough texts from the same domain. Relations can be extracted as predicate-argument combina-
tions from dependency parses. In order for them to generalize well, though, we need to also infer
their domain-specific argument types.
Since types are typically nouns, we assume that there exists a set of common nouns in the
domain that can be interpreted as types. These typically follow a domain-specific distribution,
which is related to the notion of term specificity [58]. We can thus use them as types for the
relation arguments. Our task is to extract the relations and infer the argument types. It requires
no pre-defined type system, but rather learns the type system along with the relation candidates.
In contrast to previous work on unsupervised relation extraction, it produces human-interpretable
types rather than clusters. Pre-defined argument types would limit the applicability of the extraction method and make the results less specific. For example, while an NE-tagged corpus
could produce a general relation like “PERSON throws to PERSON”, our method enables us to
distinguish the arguments and learn “quarterback throws to receiver” for American football and
“outfielder throws to third base” for baseball.
In the first experiment (Section 5.2), we show how we can learn the domain-specific argu-
ment types and their distribution over predicates directly from text. Our system automatically
extracts domain-specific knowledge (relations and their types) from large amounts of unlabeled
data for three domains. This information is used to train an HMM to determine and apply the appropriate type for the relation arguments. Human subjects judged up to 0.95 of the resulting relations to be sensible. This work was previously published as [53]. We also evaluate the model on a class-labeling task, recovering an anonymized argument, and achieve an accuracy of up to 0.81.
After establishing the general framework, I present two sections that explore the applica-
bility of the framework to downstream tasks. These sections serve to show the versatility of the
approach.
In Section 5.3, we show how to extend the model to infer hierarchies among the argument
types. This allows us to learn small ontological structures.
In the last part of this chapter (Section 5.4), we investigate the applicability of the learned
model to constructing abductive knowledge bases. The goal here is to investigate how much we
can learn from the existing model.
5.2 Inferring Argument Types
In this section, I show how we can gather information from text to infer the type of the relation
arguments. We construct HMMs over dependency arcs to learn argument types directly from the
data and apply them to learn domain-specific relations. Some approaches use existing databases such as Wikipedia, Factz, Freebase, or DBpedia [29, 109, 123], but this does not address the question of how to construct those databases in the first place. Fully unsupervised approaches [96]
avoid this issue by learning clusters of similar words, but the resulting types (cluster numbers)
are not interpretable.
Instead, we would like to learn interpretable domain-specific entity types directly from text.
This allows for both fine-grained inference and easy domain-adaptation.
We observe that types are usually common nouns, often the ones frequent in the domain at
hand. In order to learn which subset of nouns to apply, and when, we use an HMM. This allows us to capture latent patterns and make use of data that is not fully annotated. Using common nouns has several advantages: they have a domain-specific distribution and are interpretable.
However, using the set of all common nouns as the hidden-state space is intractable and would not allow for efficient learning. To restrict the search space and improve learning, we make use
of some constraints:
1. we record the co-occurrence of common nouns with entities (dictionary constraints),
2. we restrict ourselves to verbal relations, and
3. we allow common nouns as type candidates and observations. Assuming they have only
their identity as type, they function as annotated words (token constraints).
Football                              Law                        Finance
TEAM beat TEAM                        COMPANY raise concern      COMPANY have PROGRAM
TEAM play TEAM                        COMPANY say MEMBER         GROUP have MEMBER
QUARTERBACK throw pass                COMPANY say in general     GROUP include EXPERT
TEAM win game                         COMPANY sell UNIT          COMPANY join PROGRAM
TEAM defeat TEAM                      COMPANY set SUBSIDIARY     COMPANY offer PROGRAM
RECEIVER catch pass                   MINISTER show concern      ORGANIZATION provide MEMBER
QUARTERBACK complete pass             COMPANY support MEMBER     CANDIDATE receive vote
QUARTERBACK throw pass to RECEIVER    GROUP take BANK            OFFICIAL support PROGRAM
TEAM play game                        COMPANY tell analyst       OFFICIAL tell MEMBER
TEAM lose game                        OFFICIAL tell MEMBER       COMPANY write PROGRAM
Figure 5.1: The ten most frequent relations discovered by our system for the domains. Types
written in CAPITALS, common nouns lower case
Our approach differs from verb-argument identification or Named Entity (NE) tagging in
several respects. While previous work on verb-argument selection [32, 85] uses fixed sets of
types, we assume any common noun can function as a type. We thus cannot know a priori how
many and which types we will encounter. We therefore provide a way to derive the appropriate
types automatically and include a probability distribution for each of them. Our approach is thus
less restricted and can learn context-dependent, fine-grained, domain-specific relations. Also,
while in NE tagging each word has only one correct tag in a given context, we have potentially
hierarchical types: an entity can be correctly labeled as a player or a quarterback (and possibly
many more types), depending on the context (we explore this in Section 5.3). By taking context
into account, we are also able to label each sentence individually and account for unseen entities
without using external resources.
Our contributions are:
- we use unsupervised learning to train a model that learns both relations and their types from data
- we evaluate the sensibility and applicability of the learned types and relations using both human judges and a selectional restriction task
5.2.1 Data
We derive all our data from the New York Times corpus [98]. It contains several years' worth of
articles from the NYT, manually annotated for meta-data such as author, content, etc. We use the
latter to mimic data from different domains. We select and group articles whose “content” meta-
data contains certain labels into separate sets. We use the labels Football, Law and Legislation,
and Finances. This is similar to the approach in [125].
We remove any meta-data or lists, tokenize, and parse all articles with the FANSE parser
[117]. For our experiments, we extract subject-verb-object triples from the parses, provided
the verb is a full verb. The same approach is used by [12]. We also concatenate multi-word
names (identified by sequences of NNPs) with an underscore to form a single token (“Steve/NNP Young/NNP” → “Steve_Young”).
Similarly to [85], we focus on the top 100 verbs for efficiency reasons. Nothing in our
approach prevents us from extending this to more verbs, however. For each domain, we extract
about 100k instances which have an entity (proper noun) or potential type (common noun) in at
least one argument position. This serves as training data. Observing potential types in argument
positions supplies us with token constraints, while the entities are constrained by a dictionary
(see next section).
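A hedged sketch of the triple extraction step follows; the token representation, the part-of-speech tags, and the relation labels ("nsubj", "dobj", "nn") are assumptions for illustration, since the thesis uses the FANSE parser's own output format and labels:

    from dataclasses import dataclass

    @dataclass
    class Token:
        idx: int      # 1-based position in the sentence
        form: str
        pos: str
        head: int     # index of the governing token, 0 = root
        deprel: str

    def merge_names(tokens):
        """Concatenate runs of NNPs into a single token ("Steve Young" -> "Steve_Young"),
        keeping the run member that attaches outside the run."""
        merged, run = [], []
        def flush():
            if not run:
                return
            ids = {t.idx for t in run}
            head_tok = next((t for t in run if t.head not in ids), run[-1])
            head_tok.form = "_".join(t.form for t in run)
            merged.append(head_tok)
            run.clear()
        for tok in tokens:
            if tok.pos == "NNP":
                run.append(tok)
            else:
                flush()
                merged.append(tok)
        flush()
        return merged

    def extract_triples(tokens):
        """Return (subject, verb, object) triples headed by full verbs."""
        triples = []
        for tok in tokens:
            if tok.pos.startswith("VB"):
                subj = [t.form for t in tokens if t.head == tok.idx and t.deprel == "nsubj"]
                obj = [t.form for t in tokens if t.head == tok.idx and t.deprel == "dobj"]
                if subj and obj:
                    triples.append((subj[0], tok.form, obj[0]))
        return triples

    # "Steve Young throws passes", with hypothetical heads and labels.
    sent = [Token(1, "Steve", "NNP", 2, "nn"), Token(2, "Young", "NNP", 3, "nsubj"),
            Token(3, "throws", "VBZ", 0, "root"), Token(4, "passes", "NNS", 3, "dobj")]
    print(extract_triples(merge_names(sent)))   # [('Steve_Young', 'throws', 'passes')]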
5.2.2 Dictionary Construction
To derive the types used for the arguments, we do not restrict ourselves to a fixed set, but derive a
domain-specific set directly from the data. This step is performed simultaneously with the corpus
generation described above.
We generally assume that all common nouns in our domain are potential types. Since this
would make the number of parameters infeasibly large, we apply certain constraints. A common
way to restrict them in HMMs is to provide a dictionary that lists the set of legal types for every
entity [72, 92, 110].
Ben Roethlisberger: quarterback:65, player:38, star:16
Kleiman: professor:5, expert:3, (specialist:1)
Tilton: executive:37, economist:10, chairman:4, (president:2)
Figure 5.2: Examples of dictionary entries with counts. Types in brackets are not considered
To construct this dictionary, we follow [53] and collect, for each entity observed in our data, all common nouns that modify it, either as nominal modifier (“judge Scalosi ...”) or as apposition (“Tilton, a professor at ...”). Both of these modifications can easily be collected from the parse trees as well. See Figure 5.2 for examples. Nominal modifiers are common nouns (labeled NN)
that precede proper nouns (labeled NNP), as in “quarterback/NN Steve/NNP Young/NNP”, where
“quarterback” is the nominal modifier of “Steve Young”. Similar information can be gained
from appositions (e.g., “Steve Young, the quarterback of his team, said...”). (The published work this section is based on, [53], only operated over the football domain and also included copula verbs, as in “Steve Young is the quarterback of the 49ers”; we later excluded this for simplicity.) We extract those
co-occurrences and store the proper nouns as entities and the common nouns as their possible
types. This is similar in nature to Hearst’s lexico-syntactic patterns [43] and other approaches that
derive IS-A relations from text. Alternatively, one could use external resources such as Wikipedia,
Yago [108], or WordNet++ [88] for dictionary construction. However, here, we focus on the
possibilities of a self-contained system without recourse to outside resources.
For each pair of type and entity, we collect counts over the corpus to derive probability distri-
butions. We store the type candidates for each entity and their associated counts in the dictionary
and remove any types observed less than 10 times. The total number of types ranges from 9k to 16k per domain.
Even after dictionary construction, the average number of types per entity in the football domain is still 6.87. The total number of distinct types for entities is 63,942. This is a huge number to model in our state space (NE taggers usually use a set of only a few dozen types at most). Instead of manually choosing a subset of the types we extracted, we defer the task of finding the best set to the model. We note, however, that the
distribution of types for each entity is highly skewed. Due to the unsupervised nature of the
extraction process, many of the extracted types are hapaxes and/or random noise. Most entities
have only a small number of applicable types (a football player usually has one main position,
and a few additional roles, such as star, teammate, etc.). We reflect this by limiting the number
of types considered to 3 per entity. This constraint reduces the total number of distinct types to 26,165, and the average number of types per entity to 2.53. The reduction makes for a more tractable model size without losing too much information. The type alphabet is still several orders of magnitude larger than that for NE or POS tagging. Since the distribution of type candidates is
typically highly peaked, we maintain most of the information.
Dictionary entities that do not have any type candidate after removing singletons, as well as
any entity that was not modified, are treated as unknown tokens. For unknown entities, we use
the 50 most common type candidates in the domain.
For verbs and common nouns, we assume only the identity as type (i.e., “pass” remains
“pass”).
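The dictionary construction described above can be sketched as follows; the (entity, modifier) pairs are made up (mirroring Figure 5.2), and the cut-off of 10 observations and the limit of three types per entity follow the description in this section, but the code itself is only an illustration:

    from collections import Counter, defaultdict

    # Illustrative modifier/apposition observations collected from the parses.
    pairs = ([("Ben_Roethlisberger", "quarterback")] * 65
             + [("Ben_Roethlisberger", "player")] * 38
             + [("Ben_Roethlisberger", "star")] * 16
             + [("Ben_Roethlisberger", "teammate")] * 2)

    counts = defaultdict(Counter)
    for entity, noun in pairs:
        counts[entity][noun] += 1

    dictionary = {}
    for entity, c in counts.items():
        frequent = [(t, n) for t, n in c.most_common() if n >= 10]  # drop rarely seen types
        dictionary[entity] = dict(frequent[:3])                      # keep at most 3 per entity

    print(dictionary)
    # {'Ben_Roethlisberger': {'quarterback': 65, 'player': 38, 'star': 16}}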
5.2.3 Model
Given a sentence, we want to find the most likely type for each argument of the relation. Similar
to [85], we assume the observed data was produced by a process that generates the relation and
then transforms the types into a sentence, possibly adding additional words. We model this as a
Hidden Markov Model (HMM) and use the EM algorithm [26] to train it on the observed data,
with smoothing to prevent overfitting.
We construct an HMM and use the type candidates as hidden variables and the verbal relations
as observed sequences. Similar models have been used in [1, 53, 85]. Common nouns are both
types and observations in our model, so they act as unambiguous items, i.e., their legal types are
restricted to the identity. Entities are constrained by the dictionary, similar to the task in [72].
We jointly normalize our dictionary counts to obtain the model's emission parameters, and keep them fixed. (Joint normalization preserves the observed entity-specific distributions; under conditional normalization, the type candidates from frequent entities tend to dominate those of infrequent entities, i.e., the model favors an unlikely candidate for entity a if it is frequent for entity b.) We only free the emission parameters for unknown entities, P(word = “UNK” | type), and distinguish between unknown entities in the first and second argument position. Since we are not interested in learning how to model the distribution of verbs, we can also fix the start parameter P(y_1) during training. We initialize the transition parameters uniformly (restricted to potentially observable type sequences).
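The difference between joint and conditional normalization of the dictionary counts (cf. the parenthetical note above) can be illustrated on a tiny, made-up count matrix:

    import numpy as np

    # Rows: types (quarterback, player); columns: entities (a frequent and a rare one).
    counts = np.array([[65.0, 5.0],
                       [38.0, 1.0]])

    joint = counts / counts.sum()                              # divide by the grand total
    conditional = counts / counts.sum(axis=1, keepdims=True)   # divide per type (row)

    # The rare entity's own 5:1 type preference is preserved under joint normalization,
    # but distorted under conditional normalization by the frequent entity's counts.
    print(joint[:, 1] / joint[:, 1].sum())              # ~[0.83 0.17]
    print(conditional[:, 1] / conditional[:, 1].sum())  # ~[0.74 0.26]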
5.2.4 Evaluation
We want to evaluate how well our model predicts the data, and how sensible the resulting relations
are. We define a good model as one that generalizes well and produces semantically useful
relations.
We encounter two problems. First, since we derive the types in a data-driven way, we have no
gold standard data available for comparison. Second, there is no accepted evaluation measure for
this kind of task. Ultimately, we would like to evaluate our model externally, such as measuring
its impact on performance of a QA or IE system.
In the absence thereof, we perform two tasks to assess the model’s sensibility and accuracy:
1. we use Amazon’s Mechanical Turk annotators to judge the sensibility of the relations pro-
duced by each system (Section 5.2.4.1)
2. we measure the model’s ability to reconstruct the correct type (Section 5.2.4.2)
We reason that if our system learned to infer the correct types, then the resulting relations
should constitute true, general statements about that domain, and thus be judged as sensible. (Unfortunately, if a relation is judged insensible, we cannot infer whether our model used the wrong type despite better options, or whether we simply have not learned the correct label.) We exemplify this on the football domain; this part of the evaluation was included in the original paper, [53].
The second part (type recovery) does not depend on human evaluations and is thus easier to
evaluate, so we are able to expand it to all domains. For each of them, we select sentences from
our corpus that contain one of the potential types in subject or object position, such as coach
in “coach bench receiver”. We replace those items with the unknown word (i.e., get “X bench
receiver” and “coach bench X”) and use our model to reconstruct the most probable type. Since
the original input contained a type, we can then compare the models’ predictions to the “correct”
type in the original sentence and compute accuracy over the test set. We produce test sets with
200 instances for each argument, and separately evaluate performance on them. As baseline, we
predict the common noun/type most frequently observed in this position. We omit cases with two
unknown arguments, since this becomes almost impossible to predict without further context,
even for humans (compare “UNK have UNK”). Note that this part of the evaluation was carried
out at a later point, and thus used a slightly different data set for the football domain than the
sensibility judgement task.
We first evaluate different model formulations, i.e., the influence of word ordering (gram-
matical vs. relation) and Markov order (i.e., bigram vs. trigram HMM) on performance (Section
5.2.4.2). For this, we again only make use of the football domain. Using the best model from
these experiments, we then compare its performance on the different domains (Section 5.2.4.2).
5.2.4.1 Sensibility and Label Accuracy
system 100 most freq. random combined
baseline 0.92 0.71 0.90
model 0.97 0.70 0.95
Table 5.1: Percentage of relations derived from labeling the full data set that were judged sensible
system 100 most freq. random combined
baseline 0.52 0.28 0.50
model 0.70 0.42 0.68
Table 5.2: Percentage of relations derived from labeling instance with unknown entities that were
judged sensible
We create a baseline system from the same corpus, which uses the most frequent type (MFT)
for each entity.
With the trained model, we use Viterbi decoding to extract the best type sequence for each
example in the data. This translates the original corpus sentences into relations.
For each system, we sample the 100 most frequent relations and 100 random relations found
for both the full data set and the unknown entities and have 10 annotators rate each relation as
sensible or insensible.
We evaluate label accuracy by presenting subjects with the relations we obtained from the
Viterbi decoding of the corpus, and ask them to rate their sensibility. We compare the different
systems by computing sensibility as the percentage of relations judged sensible for each system.
Since the underlying probability distributions are quite different, we weight the sensibility judge-
ment for each relation by the likelihood of that relation in the system. We report results where
each relation is scored according to the majority of annotators’ decisions. (We also experimented with aggregating results by summing up the values for each individual answer, but found this to be less informative.)
Ultimately, we are interested in labeling unseen data from the same domain with the correct
type, so we evaluate separately on the full corpus and the subset of sentences that contain
unknown entities (i.e., entities for which no type information was available in the corpus, cf.
Section 5.2.2). For the latter case, we select all examples containing at least one unknown
entity (labeled UNK), resulting in a subset of 41,897 sentences, and repeat the evaluation
steps described above. Here, we have to consider a much larger set of possible types per
entity (the 20 overall most frequent types). The MFT baseline for these cases is the most fre-
quent of the 20 types for UNK tokens, while the random baseline chooses randomly from that set.
The 200 relations sampled from each of the systems (model and baseline, on both the full and the unknown data set) contain 696 distinct relations. We break these up into 70 batches (Amazon Turk annotation
HIT pages) of ten relations each. For each relation, we request 10 annotators. Overall, 148
different annotators participated in our annotation. The annotators are asked to state whether a
given relation represents a sensible statement about American Football or not. A relation like
“Quarterbacks can throw passes to receivers” should make sense, while “Coaches can intercept
teams” does not. To ensure that annotators judge sensibility and not grammaticality, we format
each relation the same way, namely pluralizing the nouns and adding “can” before the verb.
In addition, annotators can state whether a relation sounds odd, seems ungrammatical, is a
valid sentence, but against the rules (e.g., “Coaches can hit players”) or whether they do not
understand it.
The model and baseline relations for the full data set are both judged highly sensible, achiev-
ing accuracies of 0.95 and 0.92 (cf. Table 5.1). While our model did slightly better, the differences
are not statistically significant when using a two-tailed test. The relations produced by the model from unknown entities are less sensible (0.68), albeit still significantly above both chance level and the baseline relations for the same data set (p < 0.01). Only 0.5 of the baseline relations were judged sensible (cf. Table 5.2).
Aside: Annotator Quality
While most annotators on Mechanical Turk provide genuine answers, a small subset tries
to simply maximize their profit without paying attention to the actual requirements. This
introduces noise into the data. We have to identify these spammers before the evaluation.
We use a repetition (first and last question are the same), and a truism (annotators answering
no either do not know enough about football or just answered randomly). Alternatively, we
compare each annotator’s agreement to the others and exclude those whose agreement falls
more than one standard deviation below the average overall agreement.
We find that both methods produce similar results. The first method requires more careful
planning, and the resulting set of annotators still has to be checked for outliers. The second
method has the advantage that it requires no additional questions. It includes the risk, though,
that one selects a set of bad annotators solely because they agree with one another.
There are several statistics that quantify annotator agreement. Raw agreement simply records how often annotators choose the same value for an item. Other measures adjust this number for random agreement. One frequently used measure, Cohen's κ, has the disadvantage that if there is a prevalence of one answer, κ will be low (or even negative), despite high agreement [33]. This phenomenon, known as the κ paradox, is a result of the formula's adjustment for chance agreement. As shown by [41], the true level of actual chance agreement is realistically not as high as computed, resulting in the counterintuitive results. Another statistic, the G-index [45], avoids the paradox. It assumes that expected agreement is a function of the number of choices rather than chance. It uses the same general formula as κ, but adjusts for the number of available categories, instead of expected chance agreement. The raw agreement for both samples combined is 0.82, G = 0.58, and κ = 0.48. The numbers show that there is reasonably high agreement on the label accuracy.
Our experience with this task later led us to develop MACE [47], which speeds up
annotator evaluation and provides a principled way to catch spammers and compute the most
likely answer.
5.2.4.2 Type Recovery
We separately evaluate the performance for the first and the second argument being blanked out,
and select 200 instances for each case. Note that we compute per-word accuracy: while verbs
and common nouns are supposed to have only the identity as type, they are not unambiguous,
due to homography between some nouns and verbs (e.g., “match”). Many instances have several
sensible reconstructions, such as “player leaves team” vs “quarterback leaves team”. In the case
of two unknown arguments (such as “UNK have UNK”), this becomes almost impossible to
predict, even for humans. We thus omit the case of two unknown arguments.
As baseline, we predict the type most commonly observed in this position. This is akin to the
most-frequent sense baseline. The baseline is insensitive to order, and thus used as a comparison
for both orderings.
We also evaluate the influence of word ordering (grammatical vs. relation) and Markov
horizon (bigram vs. trigram) on model performance. Figure 5.3 shows the inference lattice of
an instance under both orders.
[Figure: inference lattices under both orderings for the instance “Montana throw_to receiver”, where the candidate types for “Montana” are quarterback, player, and star]
Figure 5.3: Inference lattice for “Montana throw to receiver ” with a) grammatical or b) relation
word order. Verb marked grey, token constrained nodes dotted
Model Structure
The results for the different systems are listed in Table 5.3. The baseline is already relatively
strong, which indicates that there is a high degree of regularity observable in the data. However,
all models improve over the baseline.
Relation order improves accuracy for both arguments over grammatical order. There are
several possible explanations: 1) conditioning the subject type on the (unambiguous) verb is more
constrained than starting out with the subject. 2) The object is now directly conditioned upon the
subject, which captures the “who does what to whom” aspect of the construction. In general,
knowing the verb restricts the possible arguments much more than knowing the first argument
(i.e., a quarterback can do all kinds of things, but once we know we are looking at a throwing action, the number of possible argument types is limited). In information-theoretic terms, the entropy of the relation order is much lower than the entropy of the lexical order.
system arg1 arg2 avg
most freq. type 0.70 0.67 0.69
bigram HMM, grammatical 0.75 0.74 0.74
bigram HMM, relation 0.79 0.80 0.79
trigram HMM, relation 0.80 0.80 0.80
Table 5.3: Effect of input word order (grammatical vs. relation order) and Markov horizon (bi-
gram vs. trigram HMM) on prediction accuracy (football domain).
Increasing the Markov horizon to trigram HMMs, we now condition the object on both the
verb and the subject. Introducing additional arcs into an HMM can cause the model to abuse those
new parameters, essentially overfitting the data (a common problem with higher-order HMMs).
Since our data is highly constrained through the dictionary and the unambiguous verb, though, the
increase in the number of parameters is moderate, and the resulting conditional probabilities are
fairly sparse. This increases accuracy substantially. The effect is more pronounced for predicting
the object.
We also experimented with a trigram model with lexical order (not listed), but saw no im-
provements. This means that changing the Markov horizon we look at does not change the fun-
damental differences in entropy between the orders.
Domains
Table 5.4 shows that our approach generalizes to different domains, but also that there are
distinct differences. There are two possible explanations: i) the number of legal actions (i.e., verbs)
is more restricted in some domains, and ii) the overall number of type candidates is smaller. Note
domain system arg1 arg2 avg
Football most freq. type 0.70 0.67 0.69
model 0.80 0.80 0.80
Law most freq. type 0.70 0.73 0.71
model 0.83 0.78 0.81
Finance most freq. type 0.72 0.72 0.72
model 0.79 0.79 0.79
Table 5.4: Accuracy of relation-ordered trigram models on various domains.
that this includes the verb and un-anonymized argument. While prediction for those items is
usually unambiguous, it is by no means perfect: the model can fail to find a legal label sequence,
and thus incur errors on the verb and un-anonymized argument.
To assess how well our model performs on the anonymized position, we compute the mean
reciprocal rank (MRR) for each domain (see Table 5.5). MRR denotes the inverse rank in the model's k-best output at which the correct answer occurs, i.e., 1/k. We measure performance at the blanked-out position, not over the whole instance. The results give us an intuition of “how far off” the model predictions are. Again, we observe domain-specific differences.
domain arg1 arg2
football 0.58 0.56
law 0.64 0.47
finance 0.51 0.52
Table 5.5: MRR for blanked out position of relation-ordered trigram models on various domains.
In general, the numbers indicate that the correct answer is on average in the second position
of the model’s predictions.
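For reference, MRR is the average reciprocal rank of the correct answer in the k-best list; a minimal illustration with made-up ranks:

    # Ranks of the correct type in the model's k-best list (illustrative values).
    ranks = [1, 2, 1, 3, 2]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    print(round(mrr, 2))   # 0.67 -> the correct answer sits near rank 2 on average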
5.2.5 Related Research
Most previous approaches only consider relations between entities, i.e., allow only proper nouns
as arguments. In contrast, we also consider common nouns, which broadens our approach and,
since we assume they act as types, supplies us with token-constrained data.
[85] use NE-tagged data to learn the parameters for an HMM and apply them to similar
verbal predicates. They do not attempt to learn the types from data.
[125] and [124] use clustering plus distant supervision from FreeBase on generative mod-
els, so their types are supplied by an existing resource. [112] and [111] present graph-based
approaches to the similar problem of class-instance learning.
The approach we describe is similar in nature to unsupervised verb argument selec-
tion/selectional preferences and semantic role labeling, yet goes beyond it in several ways. For
semantic role labeling [36, 38], types have been derived from FrameNet [3]. For verb argument
detection, types are either semi-manually derived from a repository like WordNet, or from NE
taggers [32, 85]. This allows for domain-independent systems, but limits the approach to a fixed
set of oftentimes rather inappropriate types. In contrast, we derive the level of granularity directly
from the data.
Pre-tagging the data with NE types before training comes at a cost. It lumps entities together
which can have very different types (i.e., all people become labeled as PERSON), effectively al-
lowing only one type per entity. [31] resolve the problem with a web-based approach that learns
hierarchies of the NE types in an unsupervised manner. We do not enforce a taxonomy, but
include statistical knowledge about the distribution of possible types over each entity by incorporating a prior distribution P(type, entity). This enables us to generalize from the lexical form
without restricting ourselves to one type per entity, which helps to better fit the data. In addition,
we can distinguish several types for each entity, depending on the context (e.g., winner vs. quar-
terback). [96] also use an unsupervised model to derive selectional predicates from unlabeled
text. They do not assign types at all, but group similar predicates and arguments into unla-
beled clusters using LDA. [12] uses a very similar methodology to establish relations between
clauses and sentences, by clustering simplified relations.
[86] employ syntactic patterns to derive types from unlabeled data in the context of LbR.
They consider a wider range of syntactic structures, but do not include a probabilistic model to
label new data.
5.2.6 Conclusion
We use an HMM to learn both domain-specific relations and types from text. Unlike previous
unsupervised approaches, we automatically extract type candidates from the common nouns in
the data and learn a probability distribution over entities to allow for context-sensitive selection
of appropriate types.
Using newswire text, we applied the method to three different domains and evaluated both the model's qualities and the sensibility of the resulting relations. Several measures show that the model has good explanatory power and generalizes well, significantly outperforming two baseline approaches, especially where the possible types of an entity can only be inferred from the context. Human subjects on Amazon's Mechanical Turk judged up to 0.95 of the relations for the full data set, and 0.68 of the relations for data containing unseen entities, to be sensible. Inter-annotator agreement was reasonably high (raw agreement = 0.82, G = 0.58, κ = 0.48).
A type recovery task showed that the model learns to recover the correct argument type with
high accuracy (0.84), and that the correct answer is usually among the top two of the model’s
choices.
5.3 Learning Type Hierarchies
The previous section showed how relation candidates can be extracted and simultaneously typed
in an unsupervised fashion. However, it treats all types (e.g., QUARTERBACK, PLAYER, RECEIVER) as disjoint, mutually exclusive labels. In reality, many of these are organized in a hierarchical structure, e.g., QUARTERBACK and RECEIVER are instances of PLAYER. Many learned relations just represent different ways of typing the same underlying relation at various levels of specificity (compare “PLAYER throw pass” and “QUARTERBACK throw pass”).
With a type hierarchy, we could order the types from most to least abstract, similar to the way
ontologies are structured (see Figure 5.4 for an example).
[Figure: PERSON above PLAYER, which in turn is above QUARTERBACK and RECEIVER]
Figure 5.4: Example hierarchy for a subset of types in the football domain
In this section, we explore a simple extension to the approach in the previous section to learn the hierarchical structure of the types. We re-use the transition parameters learned in Section 5.2 and try to estimate good emission parameters for each type. Note that this results in local
hierarchies, i.e., an ordering specific to the input instance, rather than a global hierarchy as it
would be provided by an ontology.
This ties in with existing work on ontology construction and hierarchy induction [62, 63, 83,
90]. The goal of this section is parallel to those existing approaches, but focuses on exploring
how much we can get out of the models introduced in the previous section.
5.3.1 Selecting Supertypes
We constrain the potential supertypes for each type to its co-occurring types, i.e., the other types we have seen modifying the same entity. E.g., if our type dictionary contains an entry for “Ben Roethlisberger” linking to the types QUARTERBACK and PLAYER, we add PLAYER to the set of potential supertypes for QUARTERBACK, and vice versa. We allow every type to be its own supertype, i.e., the case where the type cannot be abstracted any further. We exclude co-types that we observed only once.
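A minimal sketch of this supertype candidate selection, with an illustrative two-entry dictionary rather than the thesis data:

    from collections import defaultdict

    entity_types = {
        "Ben_Roethlisberger": ["quarterback", "player", "star"],
        "Steve_Young": ["quarterback", "player"],
    }

    # Types that modify the same entity become supertype candidates for one another.
    co_counts = defaultdict(lambda: defaultdict(int))
    for types in entity_types.values():
        for t in types:
            for other in types:
                co_counts[t][other] += 1

    # Keep co-types seen more than once; every type is also its own supertype.
    supertypes = {t: {s for s, n in cands.items() if n > 1 or s == t}
                  for t, cands in co_counts.items()}
    print(supertypes["quarterback"])   # {'quarterback', 'player'} (set order may vary)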
5.3.2 Model
From this dictionary of observed co-types for each type, we construct emission parameters. Even when excluding all singleton co-types, the number of potential supertypes for any given type is still so huge that training becomes intractable, so we simply set all emission parameters to a uniform distribution.
We know from our evaluations that the transition parameters found by the model in the previous task represent reasonable type sequences, so we keep them as-is.
Using these parameters, we construct an HMM as described in Chapter 4.
5.3.3 Evaluation
In order to evaluate, we select 100 sentences that contain types in both argument positions. These
instances now serve as observed input. For each instance, we run Viterbi inference over the model
with the old transition parameters and new, uniform emission parameters derived from the co-type
dictionary. This produces a new type sequence. We then asked human judges to assess whether
the argument types in the new sequence were indeed supertypes of those in the input sequence,
given the context.
5.3.4 Results
Football Law Finance
include identity 0.23 0.77 0.49
ignore identity 0.15 0.42 0.37
Table 5.6: Percentage of local type hierarchies judged sensible
The results in Table 5.6 show that sensibility varies greatly among domains. Examining the
data, we see that while all domains have roughly the same number of distinct types in the input
data, they vary in the number of supertypes used. In the football domain, we see 57 distinct
types used, i.e., 1.63 times as many as in the input. This means that the same type was tagged
with various supertypes in different contexts. In contrast, in both other domains, the number of
distinct supertypes in the output is much smaller than in the input. Intuitively, this is what we
hope to see in a hierarchical ordering.
We observe that with uniformly initialized emissions, the output is mainly dominated by the
transition parameters, resulting in supertype sequences that are sensible in themselves, but not
with respect to the input types.
The model is free to assign each type the identity as supertype (i.e., GROUP stays GROUP).
While this would not be wrong, it would not allow for a lot of generalization. We thus also eval-
uate performance on the subset of instances where the supertype differs from the input type. For
all domains, we see a marked drop in accuracy if we ignore the identity labels. This indicates that
the identity is a valid and important supertype. If we disallow it, we essentially force the model to
assume endless hierarchical structures. With a limited set of supertypes, this consequently would
lead to cycles.
5.3.5 Conclusion
We presented an extension to the previous work to detect hierarchical orderings among the types,
or supertypes. The orderings are “localized” hierarchies, i.e., the supertype for a given type
highly depends upon the context. E.g., depending on the context the type FIRM occurs in, the
model might select GROUP or COMPANY as the appropriate supertype. While this does allow
us to generalize to some extent, it does not permit us to construct a general, context-independent
ontology. We see varying levels of sensibility for different domains, suggesting that this ap-
proach requires further investigation. We note that the data is the result of a noisy extraction
process, which likely affects performance. The dominance of the transition parameters suggests that sparsity-inducing methods could help improve performance. The experiment does show, however, that our approach can be extended to generalize to various levels of abstraction.
5.4 Constructing Knowledge Bases
In the last section of this chapter, we turn to a potential application of the model described above,
namely populating knowledge bases with weighted logical axioms to allow for abductive infer-
ence. Again, the goal is to explore the versatility of the models derived earlier.
Artificial Intelligence has used various approaches to reasoning, some based on logic, others
on probability. Logical reasoning allows us to easily encode conditions about the world, but it
requires large knowledge bases (KBs). These often lack a data-driven foundation. Statistical
approaches are more rigorous in their construction and provide powerful models, but it is hard
to add complex constraints and conditions. Recently, there have been attempts to marry the
two [10, 27, 89].
We explore another approach to merging probabilistic models and logical reasoning. We
present a method to translate the sequential probabilistic models learned before into a knowledge
base of weighted axioms. We can then perform inference over the KB via weighted abduction.
Weighted abduction [44] is a powerful inference mechanism that allows us to interpret text on
a semantic level. It works by translating sentences into predicates and matching them against a
knowledge base of axioms. Each axiom has the form of a conditional:
q(x) : w → p(x)    (5.1)
If the right-hand-side (p(x)) matches part of the input sentence predicates, we can assume the
left-hand-side (q(x)) at a specified cost w. If there are multiple axioms we can assume for p(x),
we choose the one with the lowest cost. If various axioms allow us to assume a predicate, its cost
is lowered. For example, from a KB with
quarterback(x) : 0.76 ∧ etc(z) → Ben_Roethlisberger(x)    (5.2)
player(x) : 0.85 ∧ etc(z) → Ben_Roethlisberger(x)    (5.3)
and the input Ben_Roethlisberger(x), we would assume the first axiom, since it is the cheapest
explanation of the input. Abductive reasoning can be applied to resolving coreference, word-
sense disambiguation, and question answering.
However, three main questions need to be addressed:
1. Where do the axioms in the knowledge base come from? The quality of abductive
inference depends largely on the quality of the underlying KB, i.e., on the number and
nature of the axioms and their associated weights. Solving most NLP-related problems
requires KBs that are too large to encode by hand. Works such as [81] have addressed the
first problem and shown how large KBs can be derived by axiomatizing resources such as
WordNet [34] or FrameNet [3].
2. How do we set the weights of the axioms? Most approaches either use heuristics or
experimentation, both of which are brittle or quickly become infeasible for large KBs.
Unfortunately, due to language’s inherent ambiguity, large KBs will contain several axioms
that match the input. Without properly set weights, there is no way to discriminate between
them.
3. How do we match the input to the knowledge base? Typically, the knowledge base
is represented as logical axioms, so the input needs to be put into logical form. Existing
approaches ([76]) make use of heuristics to extract axioms from constituency parses. While
this normalizes the input space somewhat, the large variety of syntactic constructions and
vocabulary size makes this brittle and often results in a vast input space.
We explore whether axioms can be both derived and weighted using existing graphical mod-
els. [19] have shown that the axiom weights can be interpreted probabilistically, so it is natural
to consider probabilistic models to set the weights. Furthermore, there are parallels between in-
ference in probabilistic generative models and abduction. Both allow us to derive the most likely
underlying interpretation from an observed sentence. Since we operate over relations derived
from syntactic dependencies, axiomatizing the input is relatively straightforward. This addresses
all of the above questions.
We evaluate our approach on a type labeling task: given a binary verbal predicate such as
throw(x, ball), we try to find the most likely type for x. We use the model structures introduced
above to construct logical axioms and set their weights using the model parameters. We show
that it is possible to construct simple axioms that match the performance of the HMM, and we
discuss possibilities and limitations of the approaches.
5.4.1 Intuition
Both abductive reasoning and probabilistic inference try to find the underlying explanation of
some observed sentence. In the case of graphical models, the sentence is represented as a se-
quence of observed variables x, and Bayes’ theorem is applied to find the most likely sequence
of hidden states y that generated the observation:
argmax_y P(y | x) = argmax_y [ P(y) · P(x | y) / P(x) ]    (5.4)
                  = argmax_y P(y) · P(x | y)
Similarly, in the case of abductive reasoning, the sentence is represented as a conjunction of
predicates (Logical Form), and backchaining is applied over weighted axioms. An axiom has
the form of a conditional. If the predicates in the sentence match a right-hand-side (RHS) of an
axiom, we can assume that axiom’s left-hand-side (LHS) at the specified cost, and the assumed
LHS becomes part of the “observed” data. The goal is to find the antecedents which provide the
lowest-cost explanation of the observed sentence. This is akin to a depth-first search over a priority queue ordered by the antecedents' costs. The crucial points we are concerned with in this work are how to formulate the axioms, but also how to determine the costs of their LHSs.
5.4.2 Naive Transformation
COACH(x) ∧ creates(x, y) ∧ campaign(y) → schottenheimer'(x) ∧ creates'(x, y) ∧ campaign'(y)
Figure 5.5: Example for naive axiom derived from HMM
As described earlier, axioms have the form of a conditional, i.e., p(x) → q(x). The right-hand-
side (RHS) denotes the observation, the left-hand-side (LHS) states what can be assumed led to
the observation, and at what cost. The cost is the percentage the LHS contributes to explaining
the RHS.
A naive transformation is to run inference over the HMM, which provides us with the most
likely type sequence of the instance, and to set the LHS equal to the type sequence and the right-hand side to the sequence of observed words (see Figure 5.5). Inference over the HMM
also provides us with a marginal probability of the instance, i.e., the probability that the observed
sequence was generated by the type sequence. In abduction, however, we seek to minimize cost,
not maximize likelihood, so we have to turn the probability into weights. Since the costs we
compute are percentages, we define:
weight = 1.0 − P(LHS | RHS)    (5.5)
and distribute the weight evenly over all predicates of the LHS. In abduction, each input word is
represented as a predicate. Verbs become two-place predicates, all other words unary predicates.
For example, from an input sentence such as “create Schottenheimer campaign”, we might get
the sequence “create COACH campaign” via Viterbi decoding over the HMM, with some pos-
terior likelihood. We then transform both the input and the output into predicates and introduce
variables. To distinguish the derived types from input words, we mark words with an apostrophe.
We make the type sequence the LHS and the observed sequence the RHS of a conditional and
add this as an axiom to the KB. See Figure 5.5 for the resulting axiom.
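The naive transformation can be sketched in a few lines; the posterior value, the predicate formatting, and the function name below are illustrative, but the construction follows the description above: the Viterbi types form the LHS, the primed observed words the RHS, verbs become two-place predicates, and the weight 1.0 − P(LHS | RHS) is spread evenly over the LHS predicates.

    def to_axiom(triple, types, posterior):
        # triple = (arg1, verb, arg2); verbs become two-place predicates, arguments unary ones.
        a1, v, a2 = triple
        t1, tv, t2 = types
        weight = 1.0 - posterior          # abduction minimizes cost, not likelihood
        w = weight / 3                    # spread evenly over the LHS predicates
        lhs = f"{t1}(x):{w:.2f} & {tv}(x,y):{w:.2f} & {t2}(y):{w:.2f}"
        rhs = f"{a1}'(x) & {v}'(x,y) & {a2}'(y)"
        return f"{lhs} -> {rhs}"

    print(to_axiom(("schottenheimer", "creates", "campaign"),
                   ("COACH", "creates", "campaign"), posterior=0.8))
    # COACH(x):0.07 & creates(x,y):0.07 & campaign(y):0.07 -> schottenheimer'(x) & creates'(x,y) & campaign'(y)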
5.4.3 Evaluation
The task we try to solve here is again labeling relation arguments with their types. We follow the
same evaluation scheme as in Section 5.2.4.2.
We transform the trained models from that section into a KB of weighted axioms and evaluate
on the type recovery task outlined there. For the abduction, we use an existing implementation
of an abductive reasoner, Mini-Tacitus [44, 76], which we modify to accept floating-point numbers as weights.
It is easy to see that the resulting KB contains exactly the predictions made by the HMM
it was derived from. A comparison confirms this for both n-gram model orders and input
orderings. The size of the KB is equal to the number of input pairs decoded.
On the one hand, this means that we can create large KBs automatically, which produce
highly accurate results. Due to the way they were constructed, both sides of the axiom are highly
specified. Inference thus becomes a virtually unambiguous lookup, which is fast and efficient.
However, the resulting KB is incapable of generalizing to new inputs. We would like to maintain
some of the compositionality of the HMM’s parameters.
5.4.4 Related Work
Weighted abductive inference was introduced by [44]. The use of weights allows the differentiation of multiple explanations. While the approach has been applied to several NLP problems, inference was usually performed over knowledge bases of limited size, with manually set weights.
An implementation of an abductive inference engine by [76] exists, which uses syntactic
parses to convert the input into logical form. However, the fact that backchained propositions are
added to the knowledge base can quickly lead to a combinatorial explosion. The implementation
thus limits back-chaining to a certain number of levels. We make use of a modified version
of this implementation. While the way we construct axioms makes the limitation unnecessary, it
does prevent the discovery of more complex causal chains.
[81] presents a data-driven approach to knowledge base population using existing NLP re-
sources. While this provides large numbers of axioms, it does not address the problem of weight-
ing them.
Recently, several attempts have been made to use Markov Logic to marry probability the-
ory and logical reasoning [27,89]. There are also applications to abductive reasoning [10]. How-
ever, due to the complexity of the resulting graph, this does not scale very well beyond a certain
point. In contrast to our approach, training already involves the logic part, while we keep the
two parts separate until we perform abductive inference. The main focus of our work was less to
device a new methodology, but rather to explore how far existing models could be used to inform
a KB.
5.4.5 Conclusion
We have presented a principled approach to translating graphical models over relations into
weighted axioms and evaluated the resulting models on a narrative cloze task. The abductive
reasoning results are equivalent to the probabilistic inference we introduced earlier.
There are several questions that could be addressed in future work, building on our findings.
The axioms we derive only allow for first-level inference, since there are no axioms that use the
output of other axioms as input. Since KB-based approaches are human-interpretable, they can
easily be modified and extended. We might formalize the fact that there was a passing game when
a quarterback passed to a receiver, and that both players are on the same team, and encode both
as second-level axioms. This would allow further reasoning that goes far beyond what has been
expressed in the text.
We could also add very specific knowledge that cannot easily be deduced from text, such as
“if x catches a ball, and x is a receiver, a passing play took place”; or coreference information,
e.g., that Ben Roethlisberger can also be referred to as “Big Ben”.
Chapter 6
PREDICATE TYPES
People hang on his every word. Even the prepositions.
Dos Equis Commercial
In the previous chapter, we have seen how we can learn a type system for relation arguments
along with the relations from simple dependency structures. We assumed no types were given.
In this chapter, we examine the predicate of relations. We will demonstrate that many com-
mon NLP problems can be beneficially cast as relation extraction problems, and exemplify it
with word sense disambiguation for prepositions. Since prepositions are a ubiquitous closed
word class, we assume that their types hold for all domains. We work with two pre-defined type
sets. We model the preposition as predicate of a relation and show that
1. we can disambiguate prepositions using only the selectional restrictions of the arguments
2. it generalizes to two different type systems (coarse and fine) and fully supervised models
3. this approach significantly improves performance for WSD for prepositions
6.1 Prepositional Relations
Prepositions are ubiquitous and highly ambiguous. Disambiguating prepositions is thus a chal-
lenging and interesting task in itself (as exemplified by the SemEval 2007 task, [69]).
Previous state-of-the-art systems extracted features from a window of n words around the
preposition. In this work, we cast preposition sense disambiguation (PSD) as a relation problem
and show how to use the dependency structure to solve it. Similarly to the verb in the previous
chapter, now the preposition takes the position of the relation’s predicate. Its arguments are the
governing word and the prepositional object.
In Section 6.1.4, we show how the task can be formulated as an unsupervised learning prob-
lem and present various models and training options. This was the first work to explore PSD in
an unsupervised setting, using a fine-grained type system. We show how by gradually increasing
the amount of annotation, we can improve accuracy, all the while keeping the context to a bare
minimum. This work has been published in part in [52] and [49].
In Section 6.1.5, I show that the approach carries over to fully supervised settings. It outper-
forms the previous (window-based) state-of-the-art result by a significant margin for two different
type systems, the fine-grained system used before, and a coarser one. This part of the work has
been published as [51, 115].
6.1.1 Introduction
Language is inherently ambiguous, since many words have several meanings. We need to dis-
tinguish among these meanings in order to extract the correct information. A substantial amount
of work has been devoted to disambiguating prepositional attachment, verbs, and names. Prepo-
sitions, like most other word types, are also ambiguous. For example, the word in can assume both
temporal (“in May”) and spatial (“in the US”) meanings, as well as others, less easily classifiable
(“in that vein”). The Preposition Project [68] lists an average of 9.76 senses for each of the 34
most frequent English prepositions, while nouns usually have around 2 (WordNet nouns average
about 1.2 senses, 2.7 if monosemous nouns are excluded [34]). Prepositions account for more
than 10% of the 1.16m words in the Brown corpus. Understanding the constraints that hold for
prepositional constructions could improve PP attachment in parsing, one of the most frequent
sources of parse errors. In a recent study, [55] identified preposition-related features, among them
the coarse-grained PP labels used here, as the most informative features in identifying caused-
motion constructions.
The preposition acts as the explicit predicate of a relation between its governing word and the
prepositional object. Prepositions are thus a type of (syntactic) relation. Since each preposition
can express several semantic relation types (temporal, spatial, etc.), we need to disambiguate it.
Given a sentence such as “In the morning, he shopped in Rome”, the preposition in has two
distinct meanings, namely a temporal and a locative one. Preposition sense disambiguation (PSD)
has many potential uses. For example, due to the relational nature of prepositions, disambiguating
their senses can help with all-word sense disambiguation, by constraining the argument types.
Different senses of the same English preposition often correspond to several words in a foreign
language; correct disambiguation could thus help improve translation quality (see [18] for the
relevance of word sense disambiguation and [21] for the role of prepositions in MT). The results
from this work can be incorporated into semantic tagging, which tries to assign not only syntactic,
but also semantic categories to unlabeled text. Knowledge about the semantic constraints of
prepositional constructions not only provides better label accuracy, but also aids in resolving
prepositional attachment problems. In fact, [114] directly uses the results of our work in a
semantic parser.
Previous approaches to PSD use a fixed window size around the preposition to derive features.
The words in that window may or may not be syntactically related to the preposition. In the worst
case, this results in an arbitrary bag of words. In our approach, we rely solely on the arguments.
We first present our framework and show results for a completely unsupervised PSD ap-
proach. We then extend that approach to supervised models and show significant improvements
over state-of-the-art systems for two different type systems, a coarse and a fine-grained one.
Our contributions are:
- we show how PSD can be cast as a relation extraction problem
- we present the first unsupervised PSD system and compare the effectiveness of various models and unsupervised training methods
- the approach extends to supervised settings and achieves accuracies of 0.94 and 0.85 for coarse and fine-grained PSD, respectively.
6.1.2 Theoretical Background
There are three elements in the syntactic structure of a prepositional phrase: the head word
(sometimes also referred to as the governor), the preposition itself, and the object of the
preposition. See Figure 6.1 for two examples. A preposition p thus acts as a link between two
words, h and o. The head word h (usually a noun or verb, sometimes an adjective) governs
the preposition. In our example, the head words are book and sat, respectively.
Figure 6.1: Two examples of sentences with prepositions and their dependency parse trees: “he
read a book on Kennedy” and “he sat on a bench”. The preposition triples are marked.
The object of the prepositional phrase (usually a noun) is denoted o; in our example, Kennedy
and bench. We will refer to h and o collectively as the prepositional arguments. The triple
⟨h, p, o⟩ forms a syntactically and semantically constrained structure. This structure is reflected
in dependency parses as a common construction. In our example sentences above, the respective
triples would be ⟨book, on, Kennedy⟩ and ⟨sat, on, bench⟩.
The types or senses of the elements h, p, and o will be denoted by barred letters, i.e., p̄ denotes
the preposition sense, h̄ the sense of the head word, and ō the sense of the object.
To extract the words discussed above, one can either employ a fixed window size (which has to
be large enough to capture the words), or select them based on heuristics or parsing information.
There should be an optimal context, i.e., the smallest set of words that achieves the best accuracy.
It has to be large enough to capture all relevant information, but small enough to avoid noise
words (it is not obvious, for example, how much information a sister-PP or the subject of the
superordinate clause can provide). Syntactically related words improve over fixed-window-size
approaches [115], so we surmise that earlier approaches were not utilizing that optimal context,
but rather included a lot of noise.
6.1.3 Data
Depending on the task, different type sets at two levels of sense granularity are available. We use
two different data sets from existing resources for coarse and fine-grained PSD to make our results
as comparable to previous work as possible. Fewer senses increase the likelihood of correct
classification, but may incorrectly conflate distinct senses. A finer granularity can help distinguish
nuances and better fit the different contexts. However, it might suffer from data sparsity.
For the coarse-grained type set, we use data from the POS-tagged version of the Wall Street
Journal (WSJ) section of the Penn TreeBank. A subset of the prepositional phrases in this corpus
is labeled with a set of seven classes: beneficial (BNF), direction (DIR), extent (EXT), location
(LOC), manner (MNR), purpose (PRP), and temporal (TMP). We extract only those prepositions
that head a PP labeled with such a class (N = 35,917). The distribution of classes is highly
skewed (cf. Figure 6.2). We compare the results of this task to the findings of [80].
Figure 6.2: Distribution of class labels in the WSJ section of the Penn TreeBank: LOC 16,995;
TMP 10,332; DIR 5,414; MNR 1,781; PRP 1,071; EXT 280; BNF 44.
For the fine-grained type set, we use data from the SemEval 2007 workshop [69]. It consists
of a training (16k) and a test set (8k) of sentences with sense-annotated prepositions following the
sense inventory of The Preposition Project, TPP [68]. It defines senses for each of the 34 most
frequent prepositions. There are on average 9.76 senses per preposition (the range for individual
prepositions is from 2 to 25). We compare our results directly to the findings from [115]. As in
the original workshop task, we train and test on separate sets.
We used the FANSE parser [117] to extract the prepositional constructions (e.g., “shop in
Rome”) from this data set. Pronouns and numbers are collapsed into PRO and NUM, respectively.
In order to constrain the argument senses, we construct a dictionary that lists for each word all
the possible lexicographer senses according to WordNet. The set of lexicographer senses (45) is a
higher-level abstraction which is sufficiently coarse to allow for good generalization. Unknown
words are assumed to have all possible senses applicable to their respective word class (i.e., all
noun senses for words labeled as nouns, etc.).
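Such a dictionary can be built directly from WordNet. The sketch below uses NLTK's WordNet interface purely as an illustration (the thesis does not specify the toolkit used for this step); the fallback for unknown words returns every lexicographer sense compatible with the word's POS.

```python
from nltk.corpus import wordnet as wn

# The set of lexicographer file names (e.g. noun.location, verb.motion), computed once.
ALL_LEXNAMES = {s.lexname() for s in wn.all_synsets()}

def lexicographer_senses(word, pos_tag):
    """Possible lexicographer senses for a word, given its Penn Treebank POS tag."""
    wn_pos = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}.get(pos_tag[:1])
    senses = {s.lexname() for s in wn.synsets(word, pos=wn_pos)}
    if senses:
        return senses
    # Unknown word: assume all senses applicable to its word class.
    prefix = {"N": "noun.", "V": "verb.", "J": "adj.", "R": "adv."}.get(pos_tag[:1], "")
    return {name for name in ALL_LEXNAMES if name.startswith(prefix)}

print(lexicographer_senses("Rome", "NNP"))  # e.g. {'noun.location'}
```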
6.1.4 Unsupervised Preposition Sense Disambiguation
The model structures we explore in this chapter differ from the ones in previous chapters. We
represent the constraints between h, p, and o in a graphical model. We compare different model
structures and training techniques (EM, MAP-EM with the L0 norm, Bayesian inference using
Gibbs sampling). To our knowledge, this is the first attempt at unsupervised preposition sense
disambiguation. Our best accuracy reaches 56%, a significant improvement (at p < .001) of 16% over
the most-frequent-sense baseline. While this does not rival the supervised approach, it serves to
illustrate a key insight: we can use the syntactic relations extracted from dependency parses as
input to HMMs and infer the semantics of the relation. In cases where there is no annotated data,
this is a valuable insight.
6.1.4.1 Graphical Models
Figure 6.3: Graphical models. a) standard first-order HMM; b) variant used in the experiments
(one model per preposition, thus no conditioning on p); c) model incorporating further constraints
on the variables.
As a starting point, we choose the standard first-order Hidden Markov Model as depicted in
Figure 6.3a. Since we train a separate model for each preposition, we can omit all arcs to p. This
results in model 6.3b. The joint distribution over this network can be written as
$P_p(h, o, \bar{h}, \bar{p}, \bar{o}) = P(\bar{h}) \cdot P(h \mid \bar{h}) \cdot P(\bar{p} \mid \bar{h}) \cdot P(\bar{o} \mid \bar{p}) \cdot P(o \mid \bar{o})$   (6.1)
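The factorization in Equation 6.1 can be evaluated directly from the five conditional probability tables. The sketch below only illustrates the factorization; the nested-dictionary parameter layout is an assumption for the example (the actual experiments used Carmel).

```python
def joint_prob(h, o, h_bar, p_bar, o_bar, params):
    """Equation 6.1 for one preposition's model: P_p(h, o, h_bar, p_bar, o_bar)."""
    return (params["prior"][h_bar]               # P(h_bar)
            * params["emit_h"][h_bar][h]         # P(h | h_bar)
            * params["prep"][h_bar][p_bar]       # P(p_bar | h_bar)
            * params["obj_sense"][p_bar][o_bar]  # P(o_bar | p_bar)
            * params["emit_o"][o_bar][o])        # P(o | o_bar)
```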
We define a good model as one that reasonably constrains the choices, but is still tractable in
terms of the number of parameters being estimated. We want to incorporate as much information
as possible into the model to constrain the choices. In Figure 6.3c, we condition p̄ on both h̄ and ō,
to reflect the fact that prepositions act as links and determine their sense mainly through context.
In order to constrain the object sense ō, we condition it on h̄, similar to a second-order HMM. The
actual object o is conditioned on both p̄ and ō. The joint distribution over this last model is equal
to
$P_p(h, o, \bar{h}, \bar{p}, \bar{o}) = P(\bar{h}) \cdot P(h \mid \bar{h}) \cdot P(\bar{o} \mid \bar{h}) \cdot P(\bar{p} \mid \bar{h}, \bar{o}) \cdot P(o \mid \bar{o}, \bar{p})$   (6.2)
Ideally, we would also like to condition the preposition sense p̄ on the head word h (i.e., an arc
between them in Figure 6.3c) in order to capture idioms and fixed phrases. However, this would
increase the number of parameters prohibitively. A solution for this would be intermediate clusters
between h and h̄. The best number of clusters is not known a priori and has to be determined
experimentally. However, we know that it has to be smaller than the number of actual head words
(in order to have the desired effect), but larger than the number of head-word senses. We leave
this for future experiments.
6.1.4.2 Training
With the different model structures in place, we can now turn our attention to the training options.
The training method largely determines how well the resulting model explains the data. Ideally,
the sense distribution found by the model matches the real one. Since most linguistic distributions
are Zipfian, we want a training method that encourages sparsity in the model.
We briefly introduce different unsupervised training methods and discuss their respective
advantages and disadvantages. Unless specified otherwise, we initialized all models uniformly,
and trained until the perplexity stopped improving or a predefined number of iterations was
reached. Note that MAP-EM and Bayesian Inference require tuning of some hyper-parameters
on held-out data, and are thus not fully unsupervised.
Additionally, we evaluate the effect of adding supervised data points to the training set on
accuracy (this was part of the work in [49]). This moves from fully unsupervised into semi-
supervised learning.
EM  We use the EM algorithm [26] as a baseline. It is relatively easy to implement with ex-
isting toolkits like Carmel [40]. However, EM has a tendency to assume equal importance for
each parameter. It thus prefers “general” solutions, assigning part of the probability mass to un-
likely states [57]. We ran EM on each model for 100 iterations, or until the decrease in perplexity
fell below a threshold of $10^{-6}$.
EM with Smoothing and Restarts In addition to the baseline, we ran 100 restarts with random
initialization and smoothed the fractional counts by adding 0.1 before normalizing [30]. Smooth-
ing helps to prevent overfitting. Repeated random restarts help escape unfavorable initializations
that lead to local maxima. Carmel provides options for both smoothing and restarts.
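The smoothing step can be written out as a small sketch. Carmel implements this via command-line options; the functions below only illustrate the add-0.1 M-step and a random restart, with a count-table layout we assume for the example.

```python
import random

def normalize_with_smoothing(fractional_counts, smooth=0.1):
    """M-step: add `smooth` to every expected count, then renormalize per condition."""
    probs = {}
    for condition, counts in fractional_counts.items():
        total = sum(c + smooth for c in counts.values())
        probs[condition] = {event: (c + smooth) / total for event, c in counts.items()}
    return probs

def random_initialization(conditions, events):
    """One random restart: draw random weights and normalize them (no smoothing needed)."""
    raw = {c: {e: random.random() for e in events} for c in conditions}
    return normalize_with_smoothing(raw, smooth=0.0)
```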
MAP-EM with L0 Norm  Since we want to encourage sparsity in our models, we use the MDL-
inspired technique introduced by [119]. Here, the goal is to increase the data likelihood while
keeping the number of parameters small. The authors use a smoothed L0 prior, which encourages
probabilities to go down to 0. The prior involves hyper-parameters α, which rewards sparsity,
and β, which controls how close the approximation is to the true L0 norm (for more details, the
reader is referred to [119]). We perform a grid search to tune the hyper-parameters of the smoothed
L0 prior for accuracy on the preposition against, since it has a medium number of senses and
instances. For the HMM, we set α_trans = 100.0, β_trans = 0.005, α_emit = 1.0, β_emit = 0.75.
The subscripts trans and emit denote the transition and emission parameters. For our model, we
set α_trans = 70.0, β_trans = 0.05, α_emit = 110.0, β_emit = 0.0025. The latter resulted in the
best accuracy we achieved.
Bayesian Inference  Instead of EM, we can use Bayesian inference with Gibbs sampling and
Dirichlet priors (also known as the Chinese Restaurant Process, CRP). We follow the approach
of [20], running Gibbs sampling for 10,000 iterations, with a burn-in period of 5,000, and carry
out automatic run selection over 10 random restarts (due to time and space constraints, we did
not run the 1,000 restarts used in [20]). Again, we tuned the hyper-parameters of our Dirichlet
priors for accuracy via a grid search over the model for the preposition against. For both models,
we set the concentration parameter α_trans to 0.001 and α_emit to 0.1. This encourages sparsity
in the model and allows for a more nuanced explanation of the data by shifting probability mass
to the few prominent classes.
Partial Annotations In order to test the effect of partial annotations on accuracy, we also varied
the amount of partial annotations from 0 to 65% in increments of 5%. The original corpus we use
contains 67% partial annotations, so we were unable to go beyond this number. We created the
different corpora by randomly removing the existing annotations from our corpus. Since this is
done stochastically, we ran 5 trials for each batch and averaged the results. The model used here
differs slightly from the ones introduced before, in that it is a simple HMM.
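Partial annotations enter EM as token constraints: an annotated token is restricted to its gold sense in the E-step, while unannotated tokens fall back to the dictionary. The sketch below illustrates the idea; the data structures and the per-token scoring function are hypothetical.

```python
def allowed_senses(token, dictionary, annotation=None):
    """Candidate senses for one token in the E-step."""
    if annotation is not None:
        return {annotation}              # token constraint: only the annotated sense
    return dictionary.get(token, set())  # otherwise, the dictionary constraint

def constrained_scores(tokens, annotations, dictionary, all_senses, score):
    """Zero out disallowed senses before the forward-backward pass.

    `score(token, sense)` is a hypothetical stand-in for the model's emission score.
    """
    table = []
    for tok, ann in zip(tokens, annotations):
        allowed = allowed_senses(tok, dictionary, ann) or set(all_senses)
        table.append({s: (score(tok, s) if s in allowed else 0.0) for s in all_senses})
    return table
```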
6.1.4.3 Results
Given a sequence ⟨h, p, o⟩, we want to find the sequence of senses ⟨h̄, p̄, ō⟩ that maximizes the joint
probability. Since unsupervised methods use the provided labels indiscriminately, we have to
map the resulting predictions to the gold labels.
baseline: 0.40 (0.40)
model       vanilla EM    EM, smoothed, 100 restarts    MAP-EM + smoothed L0 norm    CRP, 10 restarts
HMM         0.42 (0.42)   0.55 (0.55)                   0.45 (0.45)                  0.53 (0.53)
our model   0.41 (0.41)   0.49 (0.49)                   0.55 (0.56)                  0.48 (0.49)
Table 6.1: Accuracy over all prepositions with different models and training. Best accuracy:
MAP-EM + smoothed L0 norm on our model. All methods except vanilla EM improve significantly
over the baseline at p < .001. Numbers in brackets include against (used to tune the MAP-EM and
Bayesian Inference hyper-parameters).
The predicted label sequence ⟨ĥ, p̂, ô⟩ generated by the model via Viterbi decoding can then be
compared to the true key. We use many-to-1 mapping
as described by [57] and used in other unsupervised tasks [7], where each predicted sense is
mapped to the gold label it most frequently occurs with in the test data. Success is measured by
the percentage of accurate predictions. Here, we only evaluate p̂ (in this case, accuracy is the same
as precision; since everything receives a label, recall, and thus F1, can be omitted).
The results presented in Table 6.1 were obtained on the SemEval test set. We report results
both with and without against, since we tuned the hyper-parameters of two training methods on
this preposition. To test for significance, we use a two-tailed t-test, comparing the number of
correctly labeled prepositions. As a baseline, we simply label all word types with the same sense,
i.e., each preposition token is labeled with its respective name. When using many-to-1 accuracy,
this technique is equivalent to a most-frequent-sense baseline.
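The many-to-1 evaluation is simple to implement; the following sketch assumes parallel lists of predicted and gold labels and is not tied to any particular toolkit.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(predicted, gold):
    """Map each predicted label to the gold label it co-occurs with most often,
    then score the mapped predictions against the gold labels."""
    cooccurrence = defaultdict(Counter)
    for p, g in zip(predicted, gold):
        cooccurrence[p][g] += 1
    mapping = {p: counts.most_common(1)[0][0] for p, counts in cooccurrence.items()}
    correct = sum(mapping[p] == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# A degenerate model that gives every token of a preposition the same label is
# scored exactly like a most-frequent-sense baseline under this mapping.
print(many_to_one_accuracy(["in", "in", "in"], ["TMP", "TMP", "LOC"]))  # 0.666...
```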
Vanilla EM does not improve significantly over the baseline with either model, all other meth-
ods do. Adding smoothing and random restarts increases the gain considerably, illustrating how
important these techniques are for unsupervised training. We note that EM performs better with
the less complex HMM.
CRP is, somewhat surprisingly, roughly equivalent to EM with smoothing and random restarts,
but takes considerably longer to train. Accuracy might improve with more restarts. Due to the
long training time for even one run, we do not explore this option.
MAP-EM with L0 normalization produces the best result (56%), significantly outperforming
the baseline at p < .001. With more parameters (9.7k vs. 3.7k), which allow for a better modeling
of the data, L0 normalization helps by zeroing out infrequent ones. However, the difference
between our complex model and the best HMM (EM with smoothing and random restarts, 55%)
is not significant.
Figure 6.4: Accuracy for PSD improves linearly with the amount of partial annotations (x-axis:
amount of annotated prepositions, 0–65%; y-axis: accuracy, 50–80%). The dotted line indicates
the unsupervised results from [52]. Accuracies to the right of the vertical line improve significantly
(at p < 0.001) over this.
Figure 6.4 shows the effect of more partial annotations on PSD accuracy. Using no anno-
tations at all, just the dictionary, we achieve slightly worse results than those reported in [52]. This is
potentially a function of the model. Each increment of partial annotations increases accuracy. At
around 27% annotated training examples, the difference starts to be significant. This shows that
unsupervised training methods can benefit from partial annotations.
However, the trend also shows that we are unlikely to rival a supervised model (the best sys-
tem in the SemEval task [129] reached 69% accuracy). All previous work on PSD was conducted
in a supervised setting, so to make our approach comparable, we apply our relation-based method
in a supervised setting in the next section. We show that this leads to significant improvements
over existing, window-based supervised approaches.
6.1.5 Supervised Preposition Sense Disambiguation
Increasing the amount of annotation generally improves accuracy. Having completely annotated
data allows us to model the problem more exactly. Can we use our relation-based approach in
this setting as well, and improve accuracy while still using only the minimal context?
We explore different options for feature selection, and different levels of sense granularity.
Using the resulting parameters in a Maximum Entropy classifier, we are able to improve sig-
nificantly over existing results. We compare to the state-of-the-art systems for both types of
granularity [80, 115], and use a most-frequent-sense baseline.
We use the MALLET implementation [71] of a Maximum Entropy classifier [8] to construct
our models. This classifier was also used by two state-of-the-art systems [115, 129]. For fine-
grained PSD, we train a separate model for each preposition due to the high number of possible
classes for each individual preposition. For coarse-grained PSD, we use a single model for all
prepositions, because they all share the same classes.
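The setup can be sketched as follows. We use scikit-learn's logistic regression as a stand-in for the MALLET Maximum Entropy classifier actually used (the same model family, but not the same toolkit), and assume instances arrive as (preposition, feature dictionary, sense) triples.

```python
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_fine_grained(instances):
    """Fine-grained PSD: one classifier per preposition.

    instances: iterable of (preposition, feature_dict, sense_label) triples."""
    by_prep = defaultdict(list)
    for prep, feats, sense in instances:
        by_prep[prep].append((feats, sense))
    models = {}
    for prep, data in by_prep.items():
        X = [feats for feats, _ in data]
        y = [sense for _, sense in data]
        model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(X, y)
        models[prep] = model
    return models

def train_coarse_grained(instances):
    """Coarse-grained PSD: a single classifier, since all prepositions share the classes."""
    X = [feats for _, feats, _ in instances]
    y = [sense for _, _, sense in instances]
    return make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000)).fit(X, y)
```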
The general outline we present can potentially be extended to other word classes and improve
WSD in general.
6.1.6 Results
In this section we show experimental results for the influence of context and feature selection
on accuracy. Each section compares the results for both coarse and fine-grained granularity.
Accuracy for the coarse-grained type system in all experiments is higher than for the fine-grained
one.
6.1.6.1 Context
We now compare the effects of using a fixed window size to using syntactically related words
as context to derive features. Table 6.2 shows the results for the different types and sizes of
contexts.
7
context type coarse-grained fine-grained
2-word window 0.92 0.80
3-word window 0.92 0.81
4-word window 0.92 0.79
5-word window 0.91 0.79
head + preposition 0.81 0.79
preposition + object 0.94 0.57
head, preposition, object 0.94 0.85
Table 6.2: Accuracies for Different Context Types and Sizes.
The results show that the approach using both syntactically related words (head and object)
is the most accurate one. Of the fixed-window-size approaches, three words to either side
works best. This does not necessarily reflect a general property of that window size, but can be
explained by the fact that most heads and objects occur within this window size (based on similar
statistics, [79] actually set their window size to 5). This distance
can vary from corpus to corpus, so window size would have to be determined individually for
each task. The difference between using head and preposition versus preposition and object
between coarse and fine-grained classification might reflect the annotation process: while [69]
selected examples based on a search for heads (personal communication), most annotators in the
PTB may have based their
decision of the PP label on the object that occurs in it. We conclude that syntactically related
words present a much better context for classification than fixed window sizes.
While we know that we want to extract the words that are syntactically related to the
preposition, there are different ways of doing this: a) extracting elements from a parse or b)
using POS-based heuristics.
Both [80] and [115] use constituency parsers to preprocess the data. However, parsing accu-
racy varies, and the problem of PP attachment ambiguity increases the likelihood of wrong ex-
tractions. This is especially troublesome in the present case, where we focus on prepositions. [97]
actually motivate their work on prepositions as a means to achieve better PP attachment resolu-
tion. When using a constituency parser, we need to consider all possible paths between the related
words. This can be done with regular expressions over trees (TregEx, [66]), but requires us to
specify the patterns. We experimented with this originally [115], but found that it missed many
cases. Instead, we use a dependency parser [78] to extract the head and object.
The alternative is a POS-based heuristics approach. The only preprocessing step needed is
POS tagging of the data, for which we used the system of [99]. We then use simple heuristics
to locate the prepositions and their related words. In order to determine the head in the absence
of constituent phrases, we consider the possible governing noun, verb, and adjective. The object
of the preposition is extracted as the first noun-phrase head to the right. This approach is faster
than parsing, but has problems with long-range dependencies and fronting of the PP (e.g., the PP
appearing earlier in the sentence than its head).
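The heuristics can be sketched roughly as below, over a POS-tagged sentence. The exact rules used in the thesis may differ; treat this as an illustration of the idea, not the actual extraction code.

```python
def extract_triple(tagged, prep_index):
    """Heuristic (head, preposition, object) extraction from a POS-tagged sentence.

    tagged: list of (word, pos) pairs; prep_index: position of the preposition.
    """
    prep = tagged[prep_index][0]
    # Head: nearest governing noun, verb, or adjective to the left.
    head = None
    for word, pos in reversed(tagged[:prep_index]):
        if pos.startswith(("NN", "VB", "JJ")):
            head = word
            break
    # Object: head of the first noun phrase to the right (last noun of the first noun run).
    obj = None
    for word, pos in tagged[prep_index + 1:]:
        if pos.startswith("NN"):
            obj = word        # keep updating within the noun run
        elif obj is not None:
            break             # the noun run ended
    return head, prep, obj

print(extract_triple([("he", "PRP"), ("sat", "VBD"), ("on", "IN"),
                      ("a", "DT"), ("bench", "NN")], 2))  # ('sat', 'on', 'bench')
```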
extraction method coarse-grained fine-grained
parser 0.94 0.84
heuristics 0.91 0.85
combined 0.92 0.85
Table 6.3: Accuracies for Word-Extraction Using a Parser or Heuristics.
Generally, using a parser performs slightly better than using heuristics to capture syntactically
related words (see Table 6.3), although differences are small. Both emphasize the importance of
the syntactic relation. The high score achieved when using the MALT parse for coarse-grained
PSD might be partially due to the fact that the parser was originally trained on that data set.
Under both extraction schemes, we generate features from each of the three words. Since
extraction is noisy, we also include the first potential head word to the left of the preposition.
Note that this most often will coincide with the head. The full set of potential candidates is
defined by:
- head from the MALT parse
- object from the MALT parse
- object from heuristics
- first verb to left of preposition
- first {verb, noun, adjective} to left of preposition
- union of (first verb to left, first {verb, noun, adjective} to left)
- first word to the left
candidate word coarse-grained fine-grained
parse head 0.80 0.79
parse object 0.94 0.56
verb to left 0.78 0.62
NN/ADJ/VB to left 0.79 0.79
union of previous two 0.78 0.81
first word to left 0.79 0.77
heuristics object 0.93 0.57
Table 6.4: Accuracies for Leave-One-Out (LOO) and Only-One Word-Extraction-Rule Evalu-
ation. The none setting includes all words and serves for comparison. Important words reduce
accuracy for LOO, but rank high when used as the only rule.
We evaluate the impact of each of these candidate words on the outcome by classifying using
each one in turn. The results can be found in Table 6.4. Note that a word that does not perform
well as the only attribute may still be important in conjunction with others.
6.1.6.2 Features
Having established the context we want to use, we now turn to the details of extracting the feature
words from that context. (As one reviewer pointed out, these two dimensions are highly interrelated
and influence each other; to examine the effects, we keep one dimension constant while varying
the other.) Using higher-level features instead of lexical ones helps account for sparse training
data (given an infinite amount of data, we would not need to take any higher-level features into
account, since every case would be covered). Compare [80].
The feature-generating functions, many of which utilize WordNet [34], are listed below in
Table 6.5. To conserve space, curly braces are used to represent multiple functions in a single
line. The name of each feature is the combination of the word-selection rule and the output from
the feature-generating function.
WordNet-based features: {hypernyms, synonyms} of 1st sense; {hypernyms, synonyms} of all
senses; all terms in definitions of word; word's lexicographer file names; link types associated
with word; word ∈ {NN, VB, JJ, RB}?; all sentence frames for word; all {part, member,
substance}-of holonyms.
Other features: word-finding rule found word?; Roget's Thesaurus divisions for word;
capitalization?; word's {lemma, surface form}; word's POS tag; general POS type (NN, VB, etc.);
{first, last} {two, three} letters of word; suffix types (e.g., de-adjectival/nominal, etc.); affixes
(e.g., ultra-, poly-, post-).
Table 6.5: Features used in our system.
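A few of the WordNet-based feature functions from Table 6.5 can be sketched with NLTK as follows; the feature-name strings (rule:type:value) are our own convention, and only a subset of the listed functions is shown.

```python
from nltk.corpus import wordnet as wn

def wordnet_features(word, rule_name):
    """Generate a subset of the Table 6.5 features for one extracted word."""
    feats = set()
    synsets = wn.synsets(word)
    if synsets:
        first = synsets[0]
        # hypernyms and synonyms of the first sense
        feats.update(f"{rule_name}:hyper1:{h.name()}" for h in first.hypernyms())
        feats.update(f"{rule_name}:syn1:{lemma}" for lemma in first.lemma_names())
        # lexicographer file names over all senses
        feats.update(f"{rule_name}:lexname:{s.lexname()}" for s in synsets)
        # all terms in the definitions
        for s in synsets:
            feats.update(f"{rule_name}:def:{term}" for term in s.definition().split())
    # surface-form features
    feats.add(f"{rule_name}:last3:{word[-3:]}")
    feats.add(f"{rule_name}:cap:{word[0].isupper()}")
    return feats

print(sorted(wordnet_features("bench", "parse_object"))[:5])
```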
6.1.6.3 Comparison with Related Work
To situate our experimental results within the body of work on PSD, we compare them to both a
most-frequent-sense baseline and existing work for both granularities (see Table 6.7).
                               coarse-grained   fine-grained
most-frequent-sense baseline   0.76             0.40
related work                   0.89*            0.78**
this work                      0.94             0.85
Table 6.7: Accuracies for Different Classifications. Comparison with [80]* and [115]**.
Our system easily exceeds the baseline for both coarse and fine-grained PSD (see Table 6.7).
Comparison with related work shows that we achieve an improvement of 6.5% over [115], and
of 4.5% over [80], both of which are significant at p < .0001.
A detailed per-preposition overview of frequencies and accuracies for both coarse- and fine-
grained PSD can be found in Table 6.6.
In addition to overall accuracy, [80] also measure precision, recall, and F-measure for the
different classes. They omitted BNF because it is so infrequent. Due to different training data and
models, the two systems are not strictly comparable, yet they provide a sense of the general task
difficulty. See Table 6.8. We note that both systems perform better than the most-frequent-sense
baseline.
Table 6.6: Accuracies (%) for Coarse and Fine-Grained PSD, Using MALT and Heuristics, Sorted
by Preposition. (Per-preposition frequencies and accuracies omitted here; overall: fine-grained
84.8% over 8,096 instances, coarse-grained 91.8% over 35,917 instances.)
most frequent sense results from [80] this work
label prec rec F1 prec rec F1 prec rec F1
LOC 0.72 0.97 0.83 0.91 0.93 0.92 0.95 0.96 0.96
TMP 0.78 0.39 0.52 0.85 0.85 0.85 0.95 0.95 0.95
DIR 0.92 0.94 0.93 0.96 0.97 0.96 0.95 0.95 0.95
MNR 0.70 0.43 0.53 0.83 0.56 0.66 0.83 0.75 0.79
PRP 0.78 0.49 0.60 0.79 0.70 0.74 0.91 0.84 0.87
EXT 0.00 0.00 0.00 0.82 0.85 0.83 0.88 0.82 0.85
BNF 0.00 0.00 0.00 — — — 0.75 0.34 0.47
Table 6.8: Precision, Recall, and F1 Results for Coarse-Grained Classification. Comparison
to [80]. Classes ordered by frequency.
DIR is reliably classified using the baseline, while EXT and BNF are never selected
for any preposition. Our method adds considerably to the scores for most classes. The low score
for BNF is mainly due to the low number of instances in the data, which is why it was excluded
by [80].
6.1.7 Related Work
The semantics of prepositions was the topic of a special issue of Computational Linguistics [4].
Preposition sense disambiguation was one of the SemEval 2007 tasks [69], and was subsequently
explored in a number of papers using supervised approaches:
The constraints of prepositional constructions have been explored by [97] and [79] to annotate
the semantic role of complete PPs with FrameNet and Penn Treebank categories. [128] explore
the constraints of prepositional phrases for semantic role labeling.
[97] use syntactic and lexical features from the head and the preposition itself in coarse-
grained PP classification with decision heuristics. They reach an average F-measure of 89% for
four classes. This shows that using a very small context can be effective. However, they did not
include the object of the preposition and used only lexical features for classification. Their results
vary widely for the different classes.
[79] made use of a window size of five words and features from the Penn Treebank (PTB)
[70] and FrameNet [3] to classify prepositions. They show that using high-level features, such
as semantic roles, significantly aids disambiguation. They caution that using collocations and
neighboring words indiscriminately may yield high accuracy, but has the risk of overfitting. [80]
show comparisons of various semantic repositories as labels for PSD approaches. They also
provide some results for PTB-based coarse-grained senses, using a five-word window for lexical
and hypernym features in a decision tree classifier.
SemEval 2007 [69] included a task for fine-grained PSD (more than 290 senses). The best
participating system, that of [129], extracted part-of-speech and WordNet [34] features using a
word window of seven words in a Maximum Entropy classifier.
[115] present a higher-performing system using a set of 20 positions that are syntactically
related to the preposition instead of a fixed window size.
Though using a variety of different extraction methods, contexts, and feature words, none of
these approaches explores the optimal configurations for PSD.
6.1.8 Conclusion
Prepositions act as predicates of relations between words and can thus be modeled as such. This
allows us to cast the problem of disambiguating the preposition as a relation typing task.
We showed how the dependency structure can be used as input to unsupervised models and
evaluate the influence of two different model structures (to represent constraints) and three unsu-
pervised training methods (to achieve sparse sense distributions). Using MAP-EM with the L0
norm on our model achieves an accuracy of 56%. This is a significant improvement (at p < .001) over
the baseline and vanilla EM.
While it is obvious that providing some annotation to an unsupervised algorithm will improve
accuracy and learning speed, we can already get significant improvements by partially annotating
around 27% of the instances as token constraints.
The approach carries over to a supervised setting. The results show that a relation-based
approach is significantly better than the previous window-based state-of-the-art. We measure
success in accuracy, precision, recall, and F-measure, and compare our results to a most-frequent-
sense baseline and existing work for two type systems at different levels of granularity (coarse
and fine). The relation-based approach achieves accuracies of 0.94 and 0.85 for coarse and fine-
grained PSD, respectively. This corroborates the linguistic intuition that close mutual constraints
hold between the elements of the PP. Each word syntactically and semantically restricts the choice
of the other elements.
Another advantage of the relation-based approach is that the sequential models used can infer
the types of both the prepositional arguments and the preposition. In fact, the next chapter
shows how the approach presented here can help us achieve precisely that.
Chapter 7
JOINTLY LABELING PREDICATE AND ARGUMENT TYPES
The sequential models we have seen in previous chapters allow us to find a structured output,
i.e., a sequence of types. However, the previous chapters have explored this sequential property
mainly to impose constraints on the label for one position, either the predicate or the arguments.
In this chapter, we finally attempt to type all elements in the relation.
We exemplify this again for prepositional relations. Given a sentence such as the following (note
how it contains two prepositional phrases, reflecting their high frequency in language):
In the morning, he shopped in Rome
we ultimately want to be able to annotate it with high-level semantic types such as
in/TEMPORAL the morning/TIME he/PERSON shopped/SOCIAL in/LOCATIVE
Rome/LOCATION
We have seen that the selectional constraints between predicate and arguments can be exploited
for PSD (and we implicitly modeled them in the unsupervised setting), but they have not been used
to explicitly type the arguments (although [128] use prepositional phrases for semantic role
labeling of the arguments). Concurrently, previous approaches to supersense tagging have ignored
prepositions and focused on nouns and verbs. Supersense tagging tries to predict high-level
semantic classes [22, 77, 82], typically using coarse-grained WordNet senses. For example, in a
phrase like “house on the beach”, “house” has the supersense noun.artifact and “beach” noun.location.
The resulting models could be used to type existing triple stores [6,32,67,86,113], which are
common in open IE [5] and question answering.
To our knowledge, though, nobody has attempted the joint task ([106] come close by modeling
the arguments, but ultimately only use them to predict a label for the whole construction).
Consequently, no jointly
annotated data exists.
We propose to jointly solve both tasks, supersense tagging and PSD. Given a triple of head
word, preposition, and object, we try to predict semantic labels for all three elements. We intro-
duce a test set and a semi-supervised sequence model. We leverage two disparate data and label
sets, one for the arguments, the other one for the prepositions, treating the two disjoint sources
as partial annotations. We reach state-of-the-art results for the preposition sense disambiguation,
and significant improvements over a baseline for the arguments.
This draws on two key components: the model class used throughout this work, and the token
constraints introduced in the last chapter. The type constraints (partial annotations) we saw in
Chapter 6.1 will be exploited further to leverage information from two disparate sources: one
providing labels for the predicate, and one providing labels for the arguments. This demonstrates
that our approach works even in the absence of jointly annotated data.
The approach presented here extends to any problem where we have one of the following
conditions:
- two disjoint data sets that can be used to solve a joint task
- two data sets with some overlap for a joint task
- one task, one data set, some annotations
7.1 Data
Figure 7.1: Venn diagram of the data distribution: the extracted triples, the subset annotated for
prepositions, and the subset annotated for arguments. The subset we are interested in solving is
marked grey.
We are working with two disparate data and label systems: one for the prepositions, the other
one for the arguments. Each is annotated in a separate data set. We use the FANSE dependency
parser [117] to extract the triples.
The labels we use for prepositions are based on the Penn Treebank (PTB) labels for
prepositional phrases. There are seven different labels: TMP (temporal), LOC (location), BNF
(beneficiary), EXT (extent), DIR (direction), MNR (manner) and PRP (purpose). We extract all
PPs in the PTB that contain one of the seven preposition labels. This results in 44.5k instances.
These examples serve as partially annotated instances for the prepositions.
For the argument senses, we use the supersense tagging labels corresponding to the WordNet
lexicographer file names. These are 45 high-level semantic senses for nouns, verbs, adjectives,
and adverbs. For our purposes, we ignore adjectives and adverbs, since these rarely head
prepositional phrases, and never occur as objects. We use FANSE to extract preposition triples
from the SemCor corpus [73]. It contains over 37k sentences annotated for their WordNet
senses. We extract all triples which have an annotation for both arguments, resulting in 31k
instances, and map them to the lexicographer senses. This is our partially annotated data set for
the arguments.
Noun types: Tops, act, animal, artifact, attribute, body, cognition, communication, event, feeling,
food, group, location, motive, object, person, phenomenon, plant, possession, process, quantity,
relation, shape, state, substance, time.
Verb types: body, change, cognition, communication, competition, consumption, contact, creation,
emotion, motion, perception, possession, social, stative, weather.
Preposition types: location, temporal, direction, beneficiary, manner, extent, purpose.
Table 7.1: The two type systems used. The 26 noun and 15 verb supersenses are the WordNet
lexicographer file names; the 7 preposition supersenses are derived from the Penn Treebank.
A small subset of the SemCor instances (2,726 instances) coincides with the PTB triples and thus
contains annotations for both prepositions and arguments. Note that this is too small a set to
support learning a supervised sequential model. We use this fully annotated data as a test set to
evaluate our performance. In the experiments, we take care that none of the fully labeled test
instances occurs in the training data by removing their annotations, i.e., we remove them from the
labeled SemCor and PTB sets; they are included as unlabeled data, though.
7.2 Model and Learning
While there is sufficient annotated data for either sub-part to train supervised models, there is very
little overlap. Sequential supervised models typically need far larger data sets, and are incapable
of incorporating partially labeled and unlabeled data. Generative models, on the other hand, can
deal with any amount of supervision.
We again use a simple HMM as described in Section 4 and train it using the EM algorithm.
For inference, we use posterior decoding over the model. In contrast to Viterbi decoding,
which finds the most likely path connecting all latent types, posterior decoding amounts to picking
the highest ranking type for each position. In theory, this can result in sequences not permitted
by the transition parameters. In practice, however, it tends to perform slightly better than Viterbi
decoding in terms of accuracy (see [57]).
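The contrast between the two decoding strategies can be made explicit with a short sketch over the HMM's forward and backward tables. The (T × K) matrix layout is an assumption for the example, and the forward–backward computation itself is not shown.

```python
import numpy as np

def posterior_decode(forward, backward):
    """Pick the highest-scoring type independently at each position.

    forward, backward: (T, K) arrays of forward/backward probabilities;
    their (row-normalized) product is the marginal posterior over types."""
    posteriors = forward * backward
    posteriors = posteriors / posteriors.sum(axis=1, keepdims=True)
    return posteriors.argmax(axis=1)  # may pick a sequence the transitions would forbid

def viterbi_decode(start, trans, emit_scores):
    """Most likely single type sequence (shown for contrast); all inputs are probabilities."""
    T, K = emit_scores.shape
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = start * emit_scores[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * trans * emit_scores[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```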
7.3 Evaluation
Our final goal is to label a triple ⟨arg1, pred, arg2⟩ with the respective senses. We report per-
token accuracy, split up into results for each of the input positions. We first compare the overall
performance of the model to an informed baseline (for each word, choose the type that has been
most frequently observed with it in the training set, or, if the word is unknown, in that position).
We then compare the model’s performance on PSD and supersense tagging to existing
supervised models that solve those particular subtasks.
Where appropriate, we evaluate the statistical significance of differences in the results with a
two-tailed t-test at p < 0.01.
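The informed baseline can be implemented in a few lines; the sketch below assumes training triples paired with gold type triples and backs off to the per-position distribution for unknown words.

```python
from collections import Counter, defaultdict

def train_baseline(training_triples):
    """training_triples: list of ((w1, w2, w3), (t1, t2, t3)) pairs."""
    word_counts = defaultdict(Counter)      # (position, word) -> type counts
    position_counts = defaultdict(Counter)  # position -> type counts (back-off)
    for words, types in training_triples:
        for i, (w, t) in enumerate(zip(words, types)):
            word_counts[(i, w)][t] += 1
            position_counts[i][t] += 1
    return word_counts, position_counts

def predict_baseline(words, word_counts, position_counts):
    """Most frequent type per word, backing off to the position if the word is unknown."""
    labels = []
    for i, w in enumerate(words):
        counts = word_counts.get((i, w)) or position_counts[i]
        labels.append(counts.most_common(1)[0][0])
    return labels
```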
7.4 Results
arg1 prep arg2 all
MostFreq BL 0.56 0.81 0.70 0.69
model 0.66 0.93 0.80 0.80
Table 7.2: Accuracy for bigram model, split up for each element
Table 7.2 presents the results for the model in comparison to the baseline. The difference in
accuracy for all positions, as well as the total accuracy, is statistically significant. The gain is
most pronounced for the prepositions, indicating again that they benefit most from the selectional
constraints of the arguments.
Accuracy is weakest for the first argument. This is mainly due to the fact that there is no prior
context to condition it on. Worse, the head can be one of a number of different parts of speech, so
there are not only fewer constraints, but also more lexical variation for this position. We experi-
mented with changing the element order by putting the preposition first. This improved perfor-
mance on the arguments somewhat, but hurt preposition accuracy considerably. The effect on overall accuracy
was negative. This indicates that preposition senses are more constrained by their arguments than
vice versa.
We also experimented with L0 regularization [119], but found no effect at all. This indicates
that the partial annotations constrain the data enough to produce rather sparse parameter vectors.
7.4.1 Subtask Comparison
We compare both of the subtasks (PSD and supersense tagging of the arguments) to existing
(supervised) models.
7.4.1.1 Prepositions
system preposition
MostFreq BL 0.81
model 0.93
Hovy & al. (2010) 0.94
Table 7.3: Comparison of accuracy of various models on preposition senses only
We first compare the performance of our model on prepositions. Note that the results we com-
pare against were obtained by 10-fold cross validation over the whole PTB set, while our results
are just based on the held-out test set. Due to this difference, we evaluate the statistical significance
of the difference between results with a two-tailed t-test at p < 0.01.
Table 7.3 shows the accuracy of the baseline, our model, and the state-of-the-art PSD system.
Our system conclusively beats the baseline, and reaches similar accuracy as the highly engineered
supervised system we compare against. We note that the difference between our model and the
baseline is statistically significant, while the difference between the model and the supervised
system is not.
Accuracy can be high even for a system that only returns one answer, provided that it is the
majority class. In addition to the accuracy, we also evaluate precision, recall, and F1 for each of
the preposition classes, and compare it to the supervised model.
Hovy & al. (2010) this work
label P R F1 P R F1
LOC 0.95 0.96 0.96 0.95 0.95 0.95
TMP 0.95 0.96 0.95 0.95 0.94 0.94
DIR 0.95 0.95 0.95 0.94 0.94 0.94
MNR 0.83 0.75 0.79 0.24 0.24 0.24
PRP 0.91 0.84 0.87 0.76 0.68 0.72
EXT 0.88 0.82 0.85 0.00 0.00 0.00
BNF 0.75 0.34 0.47 0.00 0.00 0.00
Table 7.4: Precision, recall and F1 for individual preposition senses. Comparison between previ-
ous work and the bigram model
We notice again that the results
are close, albeit not statistically significant. EXT occurred two times in our test set, and BNF
once.
7.4.1.2 Arguments
system arg1 arg2 avg
MostFreq BL 0.56 0.70 0.63
model 0.66 0.80 0.73
Supersense Tagger 0.82 0.88 0.85
Table 7.5: Performance of various models on argument senses
For the argument, we compare the performance of our model against an implementation of
the supersense tagger (SST) by [22]. Since this is a sequential model, we label the whole SemCor
corpus, but only evaluate on the subset used in our test data. While the sets are thus comparable,
the SST has two decisive advantages over our system: it is run on whole sentences, so it has
access to more contextual information. More importantly, it was originally trained on the SemCor
corpus.
Despite that, the results indicate that our model has a lot of headroom in labeling the first
argument. We assume that this is due to the high number of possible senses (in this position, both
noun and verb senses can occur), and the lack of prior context.
The difference between our model and the SST is statistically significant for both arguments.
We do note, however, that our model performs better than the informed baseline. This tends to be
a very hard baseline to beat for most WSD-related tasks.
7.5 Related Work
Preposition sense disambiguation was one of the SemEval 2007 tasks [69], and was subsequently
explored in a number of papers. [51] make explicit use of the arguments for supervised preposi-
tion sense disambiguation at various levels of granularity. [97] and [79] use FrameNet and Penn
Treebank categories to annotate the semantic role of complete PPs. [80] do supervised preposition
supersense disambiguation using the TPP senses. [52] present generative models to do unsuper-
vised PSD. Similarly to us, they rely solely on the arguments as context. They use lexicographer
senses as latent variable values for the arguments, but do not explicitly output them. [106] present
a supervised approach to labeling the entire prepositional relation, using an inventory introduced
in [105]. Again, they model the argument senses (using hypernyms and clusters), but do not out-
put the labels. While all these papers on PSD agree on the importance of the arguments, none of
them attempt to label them, but focus on the prepositions. In contrast, we explicitly aim to solve
both tasks.
[22] present a supervised approach to sequential supersense tagging, using a perceptron. [82]
achieve even better results using similar features and conditional random fields. Our model is
not discriminative, so we rely on contextual similarity. [77] presents an unsupervised approach
for disambiguation of open-class word pairs. She uses prepositional dependency triples to dis-
ambiguate the arguments, yet not the prepositions themselves. Our work provides the means to
close that gap.
7.6 Conclusion
We proposed the task of labeling prepositions and their syntactic arguments with semantic labels.
We provide a test set of over 2500 instances, and present a semi-supervised learning approach
to solve the task. We leverage two disjoint, partially annotated data sets, one for each of the
subtasks, to train the model. It performs significantly better than an informed baseline, both on
each of the subtasks and in terms of overall accuracy.
Comparing to state-of-the-art supervised systems for the subtasks (preposition sense disam-
biguation and argument supersense tagging), we approach state-of-the-art for PSD (0.93), but
find a significant difference in terms of supersense tagging. The supervised model, however, was
trained on the data set we evaluate on and had access to more context; we hope to close this gap
in future work.
The results indicate that preposition senses are more constrained by their arguments than vice
versa.
Part IV
Conclusion and Future Work
Chapter 8
CONCLUSIONS
In this thesis, we solved several problems in relation extraction: learning argument types, learn-
ing predicate types, and jointly learning argument types and predicate types. Depending on the
type system available for each of these tasks, our approaches range from unsupervised to semi-
supervised to fully supervised methods. We demonstrated how common NLP problems can be
framed as relation extraction problems to improve performance. Specifically:
1. Learning argument types (unsupervised):
We presented an unsupervised approach that learns the appropriate interpretable type sys-
tem directly from data while extracting relation instances at the same time (published as
[53]), rather than using predefined types or clusters. We evaluated the resulting type system
and the relation instances extracted under that type system in two ways: using human sen-
sibility judgments, reaching sensibility scores of 0.95; and via label accuracy in a narrative
cloze task, reaching accuracies of up to 0.81. An auxiliary contribution, born from the ne-
cessity to evaluate the quality of human subjects, is MACE (Multi-Annotator Competence
Estimation), a tool that helps estimate both annotator competence and the most likely an-
swer. It can thus be used in both annotation evaluation and data generation. MACE is avail-
able for download at http://www.isi.edu/publications/licensed-sw/mace/. We
also explored the utility of our model in inducing ontological hierarchies among the types,
and incorporating the learned type system into a downstream system for abductive reason-
ing.
2. Learning predicate types (unsupervised and supervised):
We cast WSD for prepositions (PSD) as a relation extraction task, using selectional con-
straints between the preposition type and its argument types to determine the sense of the
preposition (published as [51, 52, 115]). Previous approaches to PSD have not used selec-
tional constraints over argument types, but used n-gram context windows. Our supervised
relation-based approach significantly improved over those systems for two different prepo-
sition type systems, reaching accuracies of 0.85 and 0.94, respectively.
3. Argument types and predicates types (semi-supervised):
Finally, we demonstrated how to type both predicates and arguments by jointly solving
PSD and supersense tagging. Using two disjoint data sets, one with type annotations for the
predicates and one with type annotations for the arguments, we presented a semi-supervised
approach that uses the two partially annotated data sets to jointly learn argument types and
predicate types to rival supervised state-of-the-art results. To the best of our knowledge,
we were the first to address the joint learning of argument types and preposition types.
The general approach outlined in this thesis can be applied to type existing dependency triple
collections and thus provide better inference for question answering and information extraction.
More generally, our work opens up interesting avenues of research for various semantic appli-
cations, such as dynamic ontology design (by describing an approach to rapid local ontology
construction), representation design (by finding the right level of granularity for a task and do-
main), semantic inference (exploiting the similarities between graphical models and abduction to
aid knowledge base construction), and semantic analysis (the results of our work on prepositions
were used in a semantic parser, [114]).
Chapter 9
FUTURE WORK
Some of the experiments presented in this thesis can be extended, namely hierarchy induction and
knowledge base construction for probabilistic abductive inference. The former would probably
benefit from a sparsity-inducing method. Since we have no observations for the hierarchies, we
cannot use dictionary or token constraints, but rather methods that use a prior. Alternatively, a
split-merge procedure similar to [87] could be applied to compute the effect on data likelihood
when certain types are merged. The results of this procedure could be interpreted as directed
graph, which allows methods such as minimum-spanning trees (MST) to be used.
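As a rough illustration of this idea (a minimal sketch under the assumption that each type is summarized by its emission counts over words; the type names and counts below are invented, and the real procedure would operate on the learned model):

    import math
    from itertools import combinations

    def multinomial_ll(counts):
        # Log-likelihood of the counts under their own maximum-likelihood multinomial.
        total = sum(counts.values())
        return sum(c * math.log(c / total) for c in counts.values() if c > 0)

    def merge_score(counts_a, counts_b):
        # Change in data log-likelihood if the two types were merged into one.
        merged = dict(counts_a)
        for w, c in counts_b.items():
            merged[w] = merged.get(w, 0) + c
        return multinomial_ll(merged) - multinomial_ll(counts_a) - multinomial_ll(counts_b)

    # Hypothetical emission counts for three learned argument types.
    types = {
        "PERSON": {"doctor": 40, "nurse": 25, "hospital": 5},
        "STAFF":  {"doctor": 30, "nurse": 30, "surgeon": 10},
        "PLACE":  {"hospital": 50, "clinic": 20, "ward": 10},
    }

    # Merges that lose little likelihood suggest closely related types; the resulting
    # weighted graph over types could then feed an MST-style hierarchy.
    for a, b in combinations(types, 2):
        print(a, b, round(merge_score(types[a], types[b]), 2))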
None of the approaches presented here has modeled similarity between types other than
through distributional patterns in the latent space (the transition probabilities). Deep learning ap-
proaches offer the possibility of learning distributed continuous representations from unsupervised
input data to capture contextual similarities [24]. This enables us to learn a representation of the left
and right arguments for a given relation [11]. Verbal predicates are hard to group into higher-level
classes, so existing work on verbal type distinction usually relies on sense-based clustering [125].
However, the work on prepositions has shown that the predicate type plays an important role in
relations. Similarly to the deep learning approach for arguments in [11], we could extend our
work to deep learning of predicate similarity, e.g., learning that "studied at" and "attended" often mean
the same thing. Similarity among predicates would thus not be reduced to a fixed number of
clusters, but would instead be based on the fact that they usually occur with similar arguments. However,
this would again sacrifice the interpretability of the types and replace them with vector representations.
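As a simple point of comparison (this is not the deep learning approach of [11], just a distributional sketch of the same intuition; all triples below are invented), predicate similarity can already be approximated by comparing the argument pairs two predicates occur with:

    import math
    from collections import Counter

    def argument_profile(triples, predicate):
        # Distribution over (left, right) argument pairs observed with a predicate.
        pairs = Counter((l, r) for l, p, r in triples if p == predicate)
        total = sum(pairs.values())
        return {pair: c / total for pair, c in pairs.items()}

    def cosine(p, q):
        # Cosine similarity of two sparse distributions.
        dot = sum(p[k] * q.get(k, 0.0) for k in p)
        norm = lambda d: math.sqrt(sum(v * v for v in d.values()))
        return dot / (norm(p) * norm(q)) if p and q else 0.0

    triples = [
        ("Alice", "studied at", "MIT"),
        ("Bob", "studied at", "Stanford"),
        ("Alice", "attended", "MIT"),
        ("Carol", "attended", "Stanford"),
        ("Alice", "was born in", "Ohio"),
    ]

    sim = cosine(argument_profile(triples, "studied at"),
                 argument_profile(triples, "attended"))
    print("similarity(studied at, attended) =", round(sim, 2))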
References
[1] Steven Abney and Marc Light. Hiding a semantic hierarchy in a Markov model. In Pro-
ceedings of the ACL Workshop on Unsupervised Learning in Natural Language Process-
ing, volume 67, 1999.
[2] Eneko Agirre and David Martinez. Unsupervised WSD based on automatically retrieved
examples: The importance of bias. In Proceedings of EMNLP, pages 25–32, 2004.
[3] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet Project.
In Proceedings of the 17th international conference on Computational linguistics-Volume
1, pages 86–90. Association for Computational Linguistics Morristown, NJ, USA, 1998.
[4] Tim Baldwin, Valia Kordoni, and Aline Villavicencio. Prepositions in applications: A
survey and introduction to the special issue. Computational Linguistics, 35(2):119–149,
2009.
[5] Michele Banko and Oren Etzioni. The tradeoffs between open and traditional relation
extraction. Proceedings of ACL-08: HLT, pages 28–36, 2008.
[6] Ken Barker, Bhalchandra Agashe, Shaw-Yi Chaw, James Fan, Noah Friedland, Michael
Glass, Jerry Hobbs, Eduard Hovy, David Israel, Doo Soon Kim, Rutu Mulkar-Mehta,
Sourabh Patwardhan, Bruce Porter, Dan Tecuci, and Peter Yeh. Learning by reading: A
prototype system, performance baseline and lessons learned. In Proceedings of the 22nd
National Conference for Artificial Intelligence, Vancouver, Canada, July 2007.
[7] Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. Pain-
less Unsupervised Learning with Features. In North American Chapter of the Association
for Computational Linguistics, 2010.
[8] Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. A maximum entropy
approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[9] Jiang Bian, Yandong Liu, Ding Zhou, Eugene Agichtein, and Hongyuan Zha. Learning to
recognize reliable users and content in social media with coupled mutual reinforcement. In
Proceedings of the 18th international conference on World wide web, pages 51–60. ACM,
2009.
[10] Jim Blythe, Jerry R. Hobbs, Pedro Domingos, Rohit J. Kate, and Raymond J. Mooney.
Implementing weighted abduction in Markov logic. In Proceedings of the Ninth Interna-
tional Conference on Computational Semantics, pages 55–64. Association for Computa-
tional Linguistics, 2011.
[11] Antoine Bordes, Jason Weston, Ronan Collobert, and Yoshua Bengio. Learning structured
embeddings of knowledge bases. In Proceedings of the 25th Conference on Artificial
Intelligence (AAAI-11), San Francisco, USA, 2011.
[12] Samuel Brody. Clustering Clauses for High-Level Relation Detection: An Information-
theoretic Approach. In Annual Meeting-Association for Computational Linguistics, vol-
ume 45, page 448, 2007.
[13] Mary E. Califf and Raymond J. Mooney. Relational learning of pattern-match rules for in-
formation extraction. In Proceedings of the national conference on Artificial intelligence,
pages 328–334. John Wiley & Sons, 1999.
[14] Chris Callison-Burch and Mark Dredze. Creating Speech and Language Data With Ama-
zon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating
Speech and Language Data with Amazon’s Mechanical Turk, pages 1–12, Los Angeles,
June 2010. Association for Computational Linguistics.
[15] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and
Omar Zaidan. Findings of the 2010 joint workshop on statistical machine translation and
metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statisti-
cal Machine Translation and MetricsMATR, pages 17–53, Uppsala, Sweden, July 2010.
Association for Computational Linguistics.
[16] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and
Tom M Mitchell. Toward an architecture for never-ending language learning. In Proceed-
ings of the Twenty-Fourth Conference on Artificial Intelligence (AAAI 2010), volume 2,
pages 3–3, 2010.
[17] Bob Carpenter. Multilevel Bayesian models of categorical data annotation. Unpublished
manuscript, 2008.
[18] Yee Seng Chan, Hwee Tou Ng, and David Chiang. Word sense disambiguation improves
statistical machine translation. In Annual Meeting – Association For Computational Lin-
guistics, volume 45, pages 33–40, 2007.
[19] Eugene Charniak and Solomon E. Shimony. Probabilistic semantics for cost based abduc-
tion. Brown University, Department of Computer Science, 1990.
[20] David Chiang, Jonathan Graehl, Kevin Knight, Adam Pauls, and Sujith Ravi. Bayesian in-
ference for Finite-State transducers. In Human Language Technologies: The 2010 Annual
Conference of the North American Chapter of the Association for Computational Linguis-
tics, pages 447–455. Association for Computational Linguistics, 2010.
[21] David Chiang, Kevin Knight, and Wei Wang. 11,001 new features for statistical machine
translation. In Proceedings of Human Language Technologies: The 2009 Annual Con-
ference of the North American Chapter of the Association for Computational Linguistics,
pages 218–226, Boulder, Colorado, June 2009. Association for Computational Linguistics.
[22] Massimiliano Ciaramita and Yasemin Altun. Broad-coverage sense disambiguation and
information extraction with a supersense sequence tagger. In Proceedings of the 2006
Conference on Empirical Methods in Natural Language Processing, pages 594–602. As-
sociation for Computational Linguistics, 2006.
[23] Massimiliano Ciaramita and Mark Johnson. Explaining away ambiguity: Learning verb
selectional preference with bayesian networks. In Proceedings of the 18th Conference
on Computational Linguistics, volume 1, pages 187–193. Association for Computational
Linguistics, 2000.
[24] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and
Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine
Learning Research, 12:2493–2537, 2011.
[25] A. Philip Dawid and Allan M. Skene. Maximum likelihood estimation of observer error-
rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.
[26] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B
(Methodological), 39(1):1–38, 1977.
[27] Pedro Domingos, Stanley Kok, Hoifung Poon, Matthew Richardson, and Parag Singla.
Unifying logical and statistical ai. In Proceedings of the Twenty-First National Conference
on Artificial Intelligence, pages 2–7, 2006.
[28] Gregory Druck, Gideon Mann, and Andrew McCallum. Learning from labeled features
using generalized expectation criteria. In Proceedings of the 31st annual international
ACM SIGIR conference on Research and development in information retrieval, pages 595–
602. ACM, 2008.
[29] Kathrin Eichler, Holmer Hemsen, and Günter Neumann. Unsupervised relation extraction
from web documents. LREC, http://www.lrec-conf.org/proceedings/lrec2008, 2008.
[30] Jason Eisner. An interactive spreadsheet for teaching the forward-backward algorithm. In
Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching
natural language processing and computational linguistics-Volume 1, pages 10–18. Asso-
ciation for Computational Linguistics, 2002.
[31] Oren Etzioni, Michael Cafarella, Doug. Downey, Ana-Maria Popescu, Tal Shaked, Stephen
Soderland, Daniel S. Weld, and Alexander Yates. Unsupervised named-entity extraction
from the web: An experimental study. Artificial Intelligence, 165(1):91–134, 2005.
[32] James Fan, David Ferrucci, David Gondek, and Aditya Kalyanpur. Prismatic: Inducing
knowledge from a large scale lexicalized relation resource. In Proceedings of the NAACL
HLT 2010 First International Workshop on Formalisms and Methodology for Learning by
Reading, pages 122–127, Los Angeles, California, June 2010. Association for Computa-
tional Linguistics.
[33] Alvan R. Feinstein and Domenic V. Cicchetti. High agreement but low kappa: I. The
problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549, 1990.
[34] Christiane Fellbaum. WordNet: an electronic lexical database. MIT Press USA, 1998.
[35] David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A
Kalyanpur, Adam Lally, J William Murdock, Eric Nyberg, John Prager, et al. Building
Watson: An overview of the DeepQA project. AI magazine, 31(3):59–79, 2010.
[36] Michael Fleischman, Namhee Kwon, and Eduard Hovy. Maximum entropy models for
FrameNet classification. In Proceedings of EMNLP, volume 3, 2003.
[37] Qin Gao, Nguyen Bach, and Stephan Vogel. A semi-supervised word alignment algorithm
with partial manual alignments. In Proceedings of the Joint Fifth Workshop on Statisti-
cal Machine Translation and MetricsMATR, pages 1–10. Association for Computational
Linguistics, 2010.
[38] Daniel Gildea and Dan Jurafsky. Automatic labeling of semantic roles. Computational
Linguistics, 28(3):245–288, 2002.
[39] Yoav Goldberg, Meni Adler, and Michael Elhadad. EM can find pretty good HMM POS-
taggers (when given a good start). In Proceedings of ACL, 2008.
[40] Jonathan Graehl. Carmel Finite-state Toolkit. ISI/USC, 1997.
[41] Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high
agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29–48,
2008.
[42] Aria Haghighi and Dan Klein. Prototype-driven learning for sequence models. In Pro-
ceedings of the main conference on Human Language Technology Conference of the North
American Chapter of the Association of Computational Linguistics, pages 320–327. Asso-
ciation for Computational Linguistics, 2006.
[43] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Pro-
ceedings of the 14th conference on Computational linguistics-Volume 2, pages 539–545.
Association for Computational Linguistics, 1992.
[44] Jerry R. Hobbs, Mark E. Stickel, Douglas E. Appelt, and Paul Martin. Interpretation as
abduction. Artificial Intelligence, 63(1):69–142, 1993.
[45] Jasper Wilson Holley and Joy Paul Guilford. A Note on the G-Index of Agreement. Edu-
cational and Psychological Measurement, 24(4):749, 1964.
[46] Dirk Hovy. An Evening with EM. Tutorial, 2010.
[47] Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. Learning Whom
to trust with MACE. In Proceedings of NAACL HLT, 2013.
[48] Dirk Hovy, James Fan, Alfio Gliozzo, Siddharth Patwardhan, and Chris Welty. When did
that Happen? – Linking Events and Relations to Timestamps. In Proceedings of the 13th
Conference of the European Chapter of the Association for Computational Linguistics,
pages 185–193. Association for Computational Linguistics, 2012.
[49] Dirk Hovy and Eduard Hovy. Exploiting Partial Annotations with EM Training. In The
NAACL-HLT Workshop on the Induction of Linguistic Structure, pages 31–38, 2012.
[50] Dirk Hovy, Shashank Shrivastava, Sujay Kumar Jauhar, Mrinmaya Sachan, Kartik Goyal,
Huiying Li, Whitney Sanders, and Eduard Hovy. Identifying Metaphorical Word Use with
Tree Kernels. In Proceedings of NAACL HLT, Meta4NLP Workshop, 2013.
[51] Dirk Hovy, Stephen Tratz, and Eduard Hovy. What’s in a Preposition? Dimensions of
Sense Disambiguation for an Interesting Word Class. In Coling 2010: Posters, pages
454–462, Beijing, China, August 2010. Coling 2010 Organizing Committee.
[52] Dirk Hovy, Ashish Vaswani, Stephen Tratz, David Chiang, and Eduard Hovy. Models
and training for unsupervised preposition sense disambiguation. In Proceedings of the
49th Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies, pages 323–328, Portland, Oregon, USA, June 2011. Association for Com-
putational Linguistics.
[53] Dirk Hovy, Chunliang Zhang, Eduard Hovy, and Anselmo Peñas. Unsupervised discovery
of domain-specific knowledge from text. In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies, pages 1466–
1475, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[54] Eduard Hovy. Annotation. A Tutorial. In 48th Annual Meeting of the Association for
Computational Linguistics, 2010.
[55] Jena D. Hwang, Rodney D. Nielsen, and Martha Palmer. Towards a domain independent
semantics: Enhancing semantic representation with construction grammar. In Proceedings
of the NAACL HLT Workshop on Extracting and Using Constructions in Computational
Linguistics, pages 1–8, Los Angeles, California, June 2010. Association for Computational
Linguistics.
[56] Mukund Jha, Jacob Andreas, Kapil Thadani, Sara Rosenthal, and Kathleen McKeown.
Corpus creation for new genres: A crowdsourced approach to pp attachment. In Proceed-
ings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Ama-
zon’s Mechanical Turk, pages 13–20. Association for Computational Linguistics, 2010.
[57] Mark Johnson. Why doesn’t EM find good HMM POS-taggers. In Proceedings of the
2007 Joint Conference on Empirical Methods in Natural Language Processing and Com-
putational Natural Language Learning (EMNLP-CoNLL), pages 296–305, 2007.
[58] Karen Spärck Jones. A statistical interpretation of term specificity and its application in
retrieval. Journal of documentation, 28(1):11–21, 1972.
[59] Kevin Knight. Bayesian inference with tears. http://www.isi.edu/natural-
language/people/bayes-with-tears.pdf, 2009.
[60] Kevin Knight. Training Finite-State Transducer Cascades with Carmel, 2009.
[61] Zornitsa Kozareva and Eduard Hovy. Learning arguments and supertypes of semantic
relations using recursive patterns. In Proceedings of the 48th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 1482–1491. Association for Computational
Linguistics, 2010.
[62] Zornitsa Kozareva and Eduard Hovy. Not all seeds are equal: Measuring the quality of
text mining seeds. In Human Language Technologies: The 2010 Annual Conference of
the North American Chapter of the Association for Computational Linguistics, pages 618–
626. Association for Computational Linguistics, 2010.
[63] Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. Semantic class learning from the web
with hyponym pattern linkage graphs. Proceedings of ACL-08: HLT, pages 1048–1056,
2008.
[64] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random
fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of
the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289,
San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[65] Michael D. Lee, Mark Steyvers, Mindy de Young, and Brent J. Miller. A model-based
approach to measuring expertise in ranking tasks. In L. Carlson, C. H¨ olscher, and T.F.
Shipley, editors, Proceedings of the 33rd Annual Conference of the Cognitive Science So-
ciety, Austin, TX, 2011. Cognitive Science Society.
[66] Roger Levy and Galen Andrew. Tregex and Tsurgeon: tools for querying and manipulating
tree data structures. In LREC 2006, 2006.
[67] Dekang Lin and Patrick Pantel. Dirt: Discovery of inference rules from text. In Proceed-
ings of the seventh ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 323–328. ACM, 2001.
[68] Ken Litkowski. The preposition project. http://www.clres.com/prepositions.html, 2005.
[69] Ken Litkowski and Orin Hargraves. SemEval-2007 Task 06: Word-Sense Disambiguation
of Prepositions. In Proceedings of the 4th International Workshop on Semantic Evalua-
tions (SemEval-2007), Prague, Czech Republic, 2007.
[70] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large
annotated corpus of English: the Penn TreeBank. Computational Linguistics, 19(2):313–
330, 1993.
[71] Andrew K. McCallum. MALLET: A Machine Learning for Language Toolkit. 2002.
http://mallet. cs. umass. edu, 2002.
[72] Bernard Merialdo. Tagging English text with a probabilistic model. Computational lin-
guistics, 20(2):155–171, 1994.
[73] George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G.
Thomas. Using a semantic concordance for sense identification. In Proceedings of the
workshop on Human Language Technology, pages 240–243. Association for Computa-
tional Linguistics, 1994.
[74] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for rela-
tion extraction without labeled data. In Proceedings of the Joint Conference of the 47th
Annual Meeting of the ACL and the 4th International Joint Conference on Natural Lan-
guage Processing of the AFNLP: Volume 2-Volume 2, pages 1003–1011. Association for
Computational Linguistics, 2009.
[75] Seyed A. Mirroshandel, Mahdy Khayyamian, and Gholamreza Ghassem-Sani. Syntac-
tic tree kernels for event-time temporal relation learning. Human Language Technology.
Challenges for Computer Science and Linguistics, pages 213–223, 2011.
[76] Rutu Mulkar, Jerry R. Hobbs, and Eduard Hovy. Learning from reading syntactically
complex biology texts. In Proceedings of the 8th International Symposium on Logical
Formalizations of Commonsense Reasoning, part of the AAAI Spring Symposium Series,
2007.
[77] Vivi Nastase. Unsupervised all-words word sense disambiguation with grammatical de-
pendencies. In Proceedings of the 3rd International Joint Conference on Natural Lan-
guage Processing, pages 7–12, 2008.
[78] Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülsen Eryigit, Sandra Kübler,
Svetoslav Marinov, and Erwin Marsi. MaltParser: A language-independent system for
data-driven dependency parsing. Natural Language Engineering, 13(02):95–135, 2007.
[79] Tom O’Hara and Janyce Wiebe. Preposition semantic classification via Penn Treebank
and FrameNet. In Proceedings of CoNLL, pages 79–86, 2003.
[80] Tom O’Hara and Janyce Wiebe. Exploiting semantic role resources for preposition disam-
biguation. Computational Linguistics, 35(2):151–184, 2009.
[81] Ekaterina Ovchinnikova, Laure Vieu, Alessandro Oltramari, Stefano Borgo, and Theodore
Alexandrov. Data-driven and ontological analysis of framenet for natural language rea-
soning. In Proceedings of LREC’10. LREC, 2010.
[82] Gerhard Paaß and Frank Reichartz. Exploiting semantic constraints for estimating super-
senses with CRFs. In Proceedings of SDM, 2009.
[83] Patrick Pantel and Marco Pennacchiotti. Espresso: Leveraging generic patterns for au-
tomatically harvesting semantic relations. In Proceedings of the 21st International Con-
ference on Computational Linguistics and the 44th annual meeting of the Association for
Computational Linguistics, pages 113–120. Association for Computational Linguistics,
2006.
[84] Patrick Pantel and Marco Pennacchiotti. Espresso: Leveraging generic patterns for au-
tomatically harvesting semantic relations. In Proceedings of the 21st International Con-
ference on Computational Linguistics and the 44th annual meeting of the Association for
Computational Linguistics, pages 113–120. Association for Computational Linguistics,
2006.
[85] Thiago Pardo, Daniel Marcu, and Maria Nunes. Unsupervised Learning of Verb Argument
Structures. Computational Linguistics and Intelligent Text Processing, pages 59–70, 2006.
[86] Anselmo Peñas and Eduard Hovy. Semantic enrichment of text with background knowl-
edge. In Proceedings of the NAACL HLT 2010 First International Workshop on For-
malisms and Methodology for Learning by Reading, pages 15–23, Los Angeles, California,
June 2010. Association for Computational Linguistics.
[87] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact,
and interpretable tree annotation. In Proceedings of the 21st International Conference on
Computational Linguistics and the 44th annual meeting of the Association for Computa-
tional Linguistics, pages 433–440. Association for Computational Linguistics, 2006.
[88] Simone Paolo Ponzetto and Roberto Navigli. Knowledge-rich Word Sense Disambigua-
tion rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 1522–1531. Association for Computational
Linguistics, 2010.
[89] Hoifung Poon and Pedro Domingos. Joint unsupervised coreference resolution with
markov logic. In Proceedings of the Conference on Empirical Methods in Natural Lan-
guage Processing, pages 650–659. Association for Computational Linguistics, 2008.
[90] Hoifung Poon and Pedro Domingos. Unsupervised ontology induction from text. In Pro-
ceedings of the 48th Annual Meeting of the Association for Computational Linguistics,
pages 296–305. Association for Computational Linguistics, 2010.
[91] Lawrence R. Rabiner and Biing-Hwang Juang. An introduction to hidden Markov models.
IEEE ASSp Magazine, 3(1 Part 1):4–16, 1986.
[92] Sujith Ravi and Kevin Knight. Minimized Models for Unsupervised Part-of-Speech Tag-
ging. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and
the 4th International Joint Conference on Natural Language Processing of the AFNLP,
pages 504–512. Association for Computational Linguistics, 2009.
[93] Deepak Ravichandran and Eduard H. Hovy. Learning surface text patterns for a Question
Answering system. Proceedings of the 40th Annual Meeting on Association for Computa-
tional Linguistics, pages 41–47, 2001.
[94] Vikas C. Raykar and Shipeng Yu. Eliminating Spammers and Ranking Annotators for
Crowdsourced Labeling Tasks. Journal of Machine Learning Research, 13:491–518,
2012.
[95] Philip Resnik. Selectional preference and sense disambiguation. In Proceedings of the
ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How,
pages 52–57. Washington, DC, 1997.
[96] Alan Ritter, Mausam, and Oren Etzioni. A latent dirichlet allocation method for selectional
preferences. In Proceedings of the 48th Annual Meeting of the Association for Computa-
tional Linguistics, pages 424–434, Uppsala, Sweden, July 2010. Association for Compu-
tational Linguistics.
[97] Frank Rudzicz and Serguei A. Mokhov. Towards a heuristic categorization of
prepositional phrases in english with wordnet. Technical report, Cornell University,
arxiv1.library.cornell.edu/abs/1002.1095?context=cs, 2003.
[98] Evan Sandhaus, editor. The New York Times Annotated Corpus. Number LDC2008T19.
Linguistic Data Consortium, Philadelphia, 2008.
[99] Libin Shen, Giorgio Satta, and Aravind Joshi. Guided learning for bidirectional sequence
classification. In Proceedings of the 45th Annual Meeting of the Association of Computa-
tional Linguistics, volume 45, pages 760–767, 2007.
[100] Noah A. Smith. Adversarial evaluation for models of natural language. arXiv preprint
arXiv:1207.0245, 2012.
[101] Padhraic Smyth, Usama Fayyad, Mike Burl, Pietro Perona, and Pierre Baldi. Inferring
ground truth from subjective labelling of Venus images. Advances in neural information
processing systems, pages 1085–1092, 1995.
[102] Rion Snow, Brendan O'Connor, Dan Jurafsky, and Andrew Y. Ng. Cheap and fast—but is
it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of
the Conference on Empirical Methods in Natural Language Processing, pages 254–263.
Association for Computational Linguistics, 2008.
[103] Stephen Soderland. Learning information extraction rules for semi-structured and free
text. Machine learning, 34(1):233–272, 1999.
[104] Alexander Sorokin and David Forsyth. Utility data annotation with Amazon Mechanical
Turk. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops, CVPRW ’08, pages 1–8. IEEE, 2008.
[105] Vivek Srikumar and Dan Roth. An inventory of preposition relations. Technical report,
University of Illinois at Urbana-Champaign, 2013.
[106] Vivek Srikumar and Dan Roth. Modeling Semantic Relations Expressed by Prepositions.
Transactions of the ACL, 2013.
[107] Mark Steyvers, Michael D. Lee, Brent Miller, and Pernille Hemmer. The wisdom of
crowds in the recollection of order information. Advances in neural information process-
ing systems, 23, 2009.
[108] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic
knowledge. In Proceedings of the 16th international conference on World Wide Web, pages
697–706. ACM, 2007.
[109] Zareen Syed and Evelyne Viegas. A hybrid approach to unsupervised relation discovery
based on linguistic analysis and semantic typing. In Proceedings of the NAACL HLT 2010
First International Workshop on Formalisms and Methodology for Learning by Reading,
pages 105–113. Association for Computational Linguistics, 2010.
[110] Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. Token
and type constraints for cross-lingual part-of-speech tagging. Transactions of the ACL,
2013.
[111] Partha P. Talukdar and Fernando Pereira. Experiments in graph-based semi-supervised
learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meet-
ing of the Association for Computational Linguistics, pages 1473–1481. Association for
Computational Linguistics, 2010.
[112] Partha P. Talukdar, Joseph Reisinger, Marius Paşca, Deepak Ravichandran, Rahul Bha-
gat, and Fernando Pereira. Weakly-supervised acquisition of labeled class instances using
graph random walks. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pages 582–590. Association for Computational Linguistics, 2008.
[113] Partha P. Talukdar, Derry Wijaya, and Tom Mitchell. Coupled temporal scoping of rela-
tional facts. In Proceedings of the fifth ACM international conference on Web search and
data mining, pages 73–82. ACM, 2012.
[114] Stephen Tratz. Semantically-Enriched Parsing for Natural Language Understanding. PhD
thesis, University of Southern California, 2011.
[115] Stephen Tratz and Dirk Hovy. Disambiguation of Preposition Sense Using Linguistically
Motivated Features. In Proceedings of Human Language Technologies: The 2009 Annual
Conference of the North American Chapter of the Association for Computational Linguis-
tics, Companion Volume: Student Research Workshop and Doctoral Consortium, pages
96–100, Boulder, Colorado, June 2009. Association for Computational Linguistics.
[116] Stephen Tratz and Eduard Hovy. A taxonomy, dataset, and classifier for automatic noun
compound interpretation. In Proceedings of the 48th Annual Meeting of the Association
for Computational Linguistics, pages 678–687. Association for Computational Linguistics,
2010.
[117] Stephen Tratz and Eduard Hovy. A fast, accurate, non-projective, semantically-enriched
parser. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing, pages 1257–1268. Association for Computational Linguistics, 2011.
[118] Yuta Tsuboi, Hisashi Kashima, Hiroki Oda, Shinsuke Mori, and Yuji Matsumoto. Training
conditional random fields using incomplete annotations. In Proceedings of the 22nd Inter-
national Conference on Computational Linguistics, volume 1, pages 897–904. Association
for Computational Linguistics, 2008.
[119] Ashish Vaswani, Adam Pauls, and David Chiang. Efficient optimization of an MDL-
inspired objective function for unsupervised part-of-speech tagging. In Proceedings of
the ACL 2010 Conference Short Papers, pages 209–214. Association for Computational
Linguistics, 2010.
[120] Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona. The multidimensional
wisdom of crowds. In Neural Information Processing Systems Conference (NIPS), vol-
ume 6, 2010.
[121] Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan. Whose
vote should count more: Optimal integration of labels from labelers of unknown expertise.
Advances in Neural Information Processing Systems, 22:2035–2043, 2009.
[122] Yan Yan, Rómer Rosales, Glenn Fung, Mark Schmidt, Gerardo Hermosillo, Luca Bogoni,
Linda Moy, and Jennifer Dy. Modeling annotator expertise: Learning when everybody
knows a bit of something. In International Conference on Artificial Intelligence and Statis-
tics, 2010.
[123] Yulan Yan, Naoaki Okazaki, Yutaka Matsuo, Zhenglu Yang, and Mitsuru Ishizuka. Unsu-
pervised relation extraction by mining wikipedia texts using information from the web. In
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th
International Joint Conference on Natural Language Processing of the AFNLP: Volume
2-Volume 2, pages 1021–1029. Association for Computational Linguistics, 2009.
[124] Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. Structured relation
discovery using generative models. In Proceedings of the Conference on Empirical Meth-
ods in Natural Language Processing, pages 1456–1466. Association for Computational
Linguistics, 2011.
[125] Limin Yao, Sebastian Riedel, and Andrew McCallum. Unsupervised relation discovery
with sense disambiguation. In Proceedings of the 50th Annual Meeting of the Association
for Computational Linguistics: Long Papers-Volume 1, pages 712–720. Association for
Computational Linguistics, 2012.
[126] David Yarowsky and Radu Florian. Evaluating sense disambiguation across diverse pa-
rameter spaces. Natural Language Engineering, 8(4):293–310, 2002.
[127] Alexander Yates, Michael Cafarella, Michele Banko, Oren Etzioni, Matthew Broadhead,
and Stephen Soderland. Textrunner: open information extraction on the web. In Proceed-
ings of Human Language Technologies: The Annual Conference of the North American
Chapter of the Association for Computational Linguistics: Demonstrations, pages 25–26.
Association for Computational Linguistics, 2007.
[128] Patrick Ye and Tim Baldwin. Semantic role labeling of prepositional phrases. ACM Trans-
actions on Asian Language Information Processing (TALIP), 5(3):228–244, 2006.
[129] Patrick Ye and Timothy Baldwin. MELB-YB: Preposition Sense Disambiguation Using
Rich Semantic Features. In Proceedings of the 4th International Workshop on Semantic
Evaluations (SemEval-2007), Prague, Czech Republic, 2007.
[130] Deniz Yuret and Mehmet A. Yatbaz. The noisy channel model for unsupervised word
sense disambiguation. Computational Linguistics, 36(1):111–127, 2010.
[131] Omar F. Zaidan and Chris Callison-Burch. Crowdsourcing translation: Professional qual-
ity from non-professionals. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies, pages 1220–1229, Port-
land, Oregon, USA, June 2011. Association for Computational Linguistics.
Appendices
Appendix A
Nature of Partial Annotations
Figure A.1: Label accuracy (%) for different choices of which senses to annotate (all, 1st, 2nd, 3rd, 4th, 5th, one of each; senses ordered by frequency). Labeling one example of each sense yields better results than all examples of any one sense.
[49] explores which kind of partial annotation is most beneficial, i.e., when we can choose which
data to annotate, which items we should select. Figure A.1 shows the results for semi-supervised
PSD. When we annotate only the training items that have the most frequent sense, accuracy drops
to 49.69%. While this is better than the corresponding baseline strategy (assigning the most
frequent sense to every training instance, accuracy: 40%), it is actually worse than the unsupervised
approach (accuracy: 55%). We get much better results by annotating one example of each sense (53.55%).
Appendix B
MACE
Amazon's Mechanical Turk (AMT) is frequently used to evaluate experiments and annotate data
in NLP [14, 15, 56, 131]. However, some turkers try to maximize their pay by supplying quick
answers that have nothing to do with the correct label. We refer to this type of annotator as
a spammer. In order to mitigate the effect of spammers, researchers typically collect multiple
annotations of the same instance so that they can, later, use de-noising methods to infer the best
label. The simplest approach is majority voting, which weights all answers equally.
Unfortunately, it is easy for majority voting to go wrong. A common and simple spammer
strategy for categorical labeling tasks is to always choose the same (often the first) label. When
multiple spammers follow this strategy, the majority can be incorrect. While this specific scenario
might seem simple to correct for (remove annotators that always produce the same label), the
situation grows trickier when spammers do not annotate consistently, but instead choose
labels at random. A more sophisticated approach than simple majority voting is required.
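A tiny illustration of the failure mode (the labels are invented): three honest annotators are outvoted by four constant-answer spammers, so plain majority voting returns the wrong label.

    from collections import Counter

    def majority_vote(labels):
        # Most frequent label; ties are broken arbitrarily by Counter ordering.
        return Counter(labels).most_common(1)[0][0]

    annotations = ["B", "B", "B", "A", "A", "A", "A"]   # 3 honest votes for B, 4 spam votes for A
    print(majority_vote(annotations))                    # prints "A": the spammers win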
If we knew whom to trust, and when, we could reconstruct the correct labels. Yet, the only
way to be sure we know whom to trust is if we knew the correct labels ahead of time. To address
this circular problem, we build a generative model of the annotation process that treats the correct
labels as latent variables. We then use unsupervised learning to estimate parameters directly from
redundant annotations. This is a common approach in the class of unsupervised models called
item-response models [17,25,94,121]. While such models have been implemented in other fields
(e.g., vision), we are not aware of their availability for NLP tasks (see also Section B.6).
Our model includes a binary latent variable that explicitly encodes if and when each annota-
tor is spamming, as well as parameters that model the annotator’s specific spamming “strategy”.
Importantly, the model assumes that labels produced by an annotator when spamming are inde-
pendent of the true label (though, a spammer can still produce the correct label by chance).
In experiments, our model effectively differentiates dutiful annotators from spammers (Sec-
tion B.3), and is able to reconstruct the correct label with high accuracy (Section B.4), even under
extremely adversarial conditions (Section B.4.2). It does not require any annotated instances, but
is capable of including varying levels of supervision via token constraints (Section B.4.2). We
consistently outperform majority voting, and achieve performance equal to that of more complex
state-of-the-art models. Additionally, we find that thresholding based on the posterior label en-
tropy can be used to trade off coverage for accuracy in label reconstruction, giving considerable
gains (Section B.4.1). In tasks where correct answers are more important than answering every
instance, e.g., when constructing a new annotated corpus, this feature is extremely valuable. Our
contributions are:
- We demonstrate the effectiveness of our model on real-world AMT datasets, matching the accuracy of more complex state-of-the-art systems.
- We show how posterior entropy can be used to trade some coverage for considerable gains in accuracy.
- We study how various factors affect performance, including the number of annotators, annotator strategy, and available supervision.
- We provide MACE (Multi-Annotator Competence Estimation), a Java-based implementation of a simple and scalable unsupervised model that identifies malicious annotators and predicts labels with high accuracy.
This work was previously published in part as [47].
B.1 Model
Figure B.1: Graphical model: Annotator j produces label A_ij on instance i. The label choice depends on the instance's true label T_i, and on whether j is spamming on i, modeled by the binary variable S_ij. N = |instances|, M = |annotators|.
We keep our model as simple as possible so that it can be effectively trained from data where
annotator quality is unknown. If the model has too many parameters, unsupervised learning can
easily pick up on and exploit coincidental correlations in the data. Thus, we make a modeling
assumption that keeps our parameterization simple. We assume that an annotator always produces
the correct label when he tries to. While this assumption does not reflect the reality of AMT, it
allows us to focus the model’s power where it’s important: explaining away labels that are not
correlated with the correct label.
    for i = 1...N:
        T_i ~ Uniform
        for j = 1...M:
            S_ij ~ Bernoulli(1 - θ_j)
            if S_ij = 0:
                A_ij = T_i
            else:
                A_ij ~ Multinomial(ξ_j)

Figure B.2: Generative process: For each instance i, the true label T_i is sampled from the uniform prior. Then, S_ij is drawn for each annotator j from a Bernoulli distribution with parameter 1 - θ_j. If S_ij = 0, A_ij copies the true label. Otherwise, the annotation A_ij is sampled from a multinomial with parameter vector ξ_j.
Our model generates the observed annotations as follows: First, for each instance i, we sample
the true label T_i from a uniform prior. Then, for each annotator j we draw a binary variable S_ij
from a Bernoulli distribution with parameter 1 - θ_j. S_ij represents whether or not annotator j is
spamming on instance i. We assume that when an annotator is not spamming on an instance, i.e.,
S_ij = 0, he just copies the true label to produce annotation A_ij. If S_ij = 1, we say that the annotator
is spamming on the current instance, and A_ij is sampled from a multinomial with parameter vector
ξ_j. Note that in this case the annotation A_ij does not depend on the true label T_i. The annotations
A_ij are observed, while the true labels T_i and the spamming indicators S_ij are unobserved. The
graphical model is shown in Figure B.1 and the generative process is described in Figure B.2.
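The generative story is easy to simulate. The following is a minimal Python sketch of the process in Figure B.2 (a toy illustration, not part of MACE; the parameter values are invented):

    import random

    def sample_annotations(n_items, labels, theta, xi, seed=0):
        # theta[j]: probability that annotator j is NOT spamming on an instance.
        # xi[j]:    annotator j's label distribution when spamming.
        rng = random.Random(seed)
        true_labels, annotations = [], []
        for _ in range(n_items):
            t = rng.choice(labels)                         # T_i ~ Uniform
            row = []
            for j in range(len(theta)):
                spamming = rng.random() < (1 - theta[j])   # S_ij ~ Bernoulli(1 - theta_j)
                if not spamming:
                    row.append(t)                          # A_ij = T_i
                else:
                    row.append(rng.choices(labels, weights=xi[j])[0])  # A_ij ~ Multinomial(xi_j)
            true_labels.append(t)
            annotations.append(row)
        return true_labels, annotations

    # Two reliable annotators and one label-"0" spammer.
    T, A = sample_annotations(5, ["0", "1"], theta=[0.9, 0.9, 0.1],
                              xi=[[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]])
    print(T)
    print(A)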
The model parameters are θ_j and ξ_j. θ_j specifies the probability of trustworthiness for annotator j
(i.e., the probability that he is not spamming on any given instance). The learned value of θ_j will
prove useful later when we try to identify reliable annotators (see Section B.3). The vector ξ_j
determines how annotator j behaves when he is spamming. An annotator can produce the correct
answer even while spamming, but this can happen only by chance, since the annotator must use the
same multinomial parameters ξ_j across all instances. This means that we only learn annotator
biases that are not correlated with the correct label, e.g., the strategy of the spammer who always
chooses a certain label. This contrasts with previous work, where additional parameters are used to
model the biases that even dutiful annotators exhibit. Note that an annotator can also choose not to
answer, which we can naturally accommodate because the model is generative. We enhance our
generative model by adding Beta and Dirichlet priors on θ_j and ξ_j, respectively, which allows us to
incorporate prior beliefs about our annotators (Section B.1.1).
B.1.1 Learning
We would like to set our model parameters to maximize the probability of the observed data, i.e.,
the marginal data likelihood:
P(A; \theta, \xi) = \sum_{T,S} \Big[ \prod_{i=1}^{N} P(T_i) \prod_{j=1}^{M} P(S_{ij}; \theta_j)\, P(A_{ij} \mid S_{ij}, T_i; \xi_j) \Big]
where A is the matrix of annotations, S is the matrix of competence indicators, and T is the vector
of true labels.
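Marginalizing out the spamming indicators S_ij gives a compact per-instance form of this quantity (a short derivation we add for clarity; it follows directly from the model, with the product running over the annotators who labeled instance i):

    P(A_{ij} \mid T_i = t) = \theta_j\, \delta(A_{ij}, t) + (1 - \theta_j)\, \xi_{j, A_{ij}}

    P(T_i = t \mid A) \propto P(T_i = t) \prod_{j} \big[ \theta_j\, \delta(A_{ij}, t) + (1 - \theta_j)\, \xi_{j, A_{ij}} \big]

Here δ(·,·) is 1 if its arguments are equal and 0 otherwise; this is the posterior computed in the E-step.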
We maximize the marginal data likelihood using Expectation Maximization (EM) [26], which
has successfully been applied to similar problems [25]. We initialize EM randomly and run for
50 iterations. We perform 100 random restarts, and keep the model with the best marginal data
likelihood. We smooth the M-step by adding a fixed value δ to the fractional counts before
normalizing [30]. We find that smoothing improves accuracy but, overall, learning is robust to
varying δ; we set δ = 0.1 / (number of labels).
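To make the training procedure concrete, here is a minimal, self-contained Python sketch of EM for this model (a toy illustration rather than the released Java implementation: it uses a single random start instead of 100 restarts, and the example data are invented). annotations[i][j] is annotator j's label on item i, or None if missing.

    import random

    def em_mace(annotations, labels, iters=50, delta=None, seed=0):
        rng = random.Random(seed)
        n_lab, n_ann = len(labels), len(annotations[0])
        if delta is None:
            delta = 0.1 / n_lab                      # smoothing value from the text
        theta = [rng.uniform(0.4, 0.6) for _ in range(n_ann)]   # trustworthiness
        xi = [[1.0 / n_lab] * n_lab for _ in range(n_ann)]      # spamming strategies
        for _ in range(iters):
            # E-step: posterior over the true label of each item.
            post = []
            for row in annotations:
                scores = []
                for t in labels:
                    p = 1.0 / n_lab                  # uniform prior on T_i
                    for j, a in enumerate(row):
                        if a is None:
                            continue
                        idx = labels.index(a)
                        p *= theta[j] * (1.0 if a == t else 0.0) + (1 - theta[j]) * xi[j][idx]
                    scores.append(p)
                z = sum(scores)
                post.append([s / z for s in scores])
            # M-step: re-estimate theta and xi from expected counts, smoothed by delta.
            for j in range(n_ann):
                trust_c, spam_c = delta, delta
                strat = [delta] * n_lab
                for i, row in enumerate(annotations):
                    a = row[j]
                    if a is None:
                        continue
                    idx = labels.index(a)
                    # Posterior probability that annotator j copied the true label on item i.
                    p_copy = post[i][idx] * theta[j] / (theta[j] + (1 - theta[j]) * xi[j][idx])
                    trust_c += p_copy
                    spam_c += 1.0 - p_copy
                    strat[idx] += 1.0 - p_copy
                theta[j] = trust_c / (trust_c + spam_c)
                s = sum(strat)
                xi[j] = [c / s for c in strat]
        preds = [labels[max(range(n_lab), key=lambda k: p[k])] for p in post]
        return preds, theta

    # Two reliable annotators and one constant-label spammer (invented data).
    data = [["1", "1", "0"], ["0", "0", "0"], ["1", "1", "0"], [None, "1", "0"]]
    print(em_mace(data, ["0", "1"]))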
We observe, however, that the average annotator proficiency is usually high, i.e., most annotators
answer correctly, whereas the distribution learned by EM is fairly linear. To improve the correlation
between model estimates and true annotator proficiency, we would like to add priors about
annotator behavior to the model. A straightforward approach is to employ Bayesian inference with
Beta priors on the proficiency parameters, θ_j. We thus also implement Variational Bayes (VB)
training with symmetric Beta priors on θ_j and symmetric Dirichlet priors on the strategy parameters,
ξ_j. Setting the shape parameters of the Beta distribution to 0.5 favors the extremes of the
distribution, i.e., either an annotator tried to get the right answer, or simply did not care, but (almost)
nobody tried "a little". With VB training, we observe improved correlations over all test sets with
no loss in accuracy. The hyper-parameters of the Dirichlet distribution on ξ_j were clamped to 10.0
for all our experiments with VB training. Our implementation is similar to [57], which the reader
can refer to for details.
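Concretely (this formula is our gloss of the mean-field approach in [57] and is not spelled out above), the VB variant replaces the M-step's normalized expected counts by exponentiated digamma terms; for the strategy parameters with a symmetric Dirichlet hyper-parameter β over K labels:

    \hat{\xi}_{j,k} = \frac{\exp\big(\psi(\beta + \mathbb{E}[c_{j,k}])\big)}{\exp\big(\psi(K\beta + \sum_{k'} \mathbb{E}[c_{j,k'}])\big)}

where E[c_{j,k}] is the expected number of times annotator j produced label k while spamming and ψ is the digamma function; the Beta prior on θ_j is handled analogously.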
B.2 Experiments
We evaluate our method on existing annotated datasets from various AMT tasks. However, we
also want to ensure that our model can handle adversarial conditions. Since we have no control
over the factors in existing datasets, we create synthetic data for this purpose.
B.2.1 Natural Data
In order to evaluate our model, we use the datasets from [102] that use discrete label values
(some tasks used continuous values, which we currently do not model). Since they compared
AMT annotations to experts, gold annotations exist for these sets. We can thus evaluate the
accuracy of the model as well as the proficiency of each annotator. We show results for word
sense disambiguation (WSD: 177 items, 34 annotators), recognizing textual entailment (RTE: 800
items, 164 annotators), and recognizing temporal relations (Temporal: 462 items, 76 annotators).
B.2.2 Synthetic Data
In addition to the datasets above, we generate synthetic data in order to control for different
factors. This also allows us to create a gold standard to which we can compare. We generate data
sets with 100 items, using two or four possible labels.
For each item, we generate answers from 20 different annotators. The “annotators” are func-
tions that return one of the available labels according to some strategy. Better annotators have a
smaller chance of guessing at random.
For various reasons, usually not all annotators see or answer all items. We thus remove a
randomly selected subset of answers such that each item is only answered by 10 of the annotators.
See Figure B.3 for an example annotation of three items.
               annotators
    items      – 0 0 1 – 0 – – 0 –
               1 – – 0 – 1 0 – – 0
               – – 0 – 0 1 – 0 – 0

Figure B.3: Annotations: 10 annotators on three items, labels {1, 0}, 5 annotations per item. Missing annotations marked '–'.
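A sketch of this setup in Python (the pool sizes and the missing-answer scheme follow the text above, the 95% proficiency figure is taken from Section B.4.2, and everything else is an assumption for illustration):

    import random

    def make_pool(n_good, n_total=20, adverse="random"):
        # Good annotators answer correctly 95% of the time; adverse ones either
        # answer at random or always pick the first label.
        def good(true_label, labels, rng):
            return true_label if rng.random() < 0.95 else rng.choice(labels)
        def bad(true_label, labels, rng):
            return rng.choice(labels) if adverse == "random" else labels[0]
        return [good] * n_good + [bad] * (n_total - n_good)

    def annotate(n_items, labels, pool, per_item=10, seed=0):
        rng = random.Random(seed)
        gold, matrix = [], []
        for _ in range(n_items):
            t = rng.choice(labels)
            answers = [f(t, labels, rng) for f in pool]
            keep = set(rng.sample(range(len(pool)), per_item))   # 10 of 20 answer each item
            matrix.append([a if j in keep else None for j, a in enumerate(answers)])
            gold.append(t)
        return gold, matrix

    gold, matrix = annotate(100, ["0", "1"], make_pool(n_good=5))
    print(gold[0], matrix[0])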
B.2.3 Evaluations
First, we want to know which annotators to trust. We evaluate whether our model’s learned
trustworthiness parameters θ_j
can be used to identify these individuals (Section B.3).
We then compare the label predicted by our model and by majority voting to the correct
label. The results are reported as accuracy (Section B.4). Since our model computes posterior
entropies for each instance, we can use these as an approximation of the model's confidence in the
prediction. If we focus on predictions with high confidence (i.e., low entropy), we hope to see
better accuracy, even at the price of leaving some items unanswered. We evaluate this trade-off
in Section B.4.1. In addition, we investigate the influence of the number of spammers and their
strategy on the accuracy of our model (Section B.4.2).
B.3 Identifying Reliable Annotators
                          RTE     Temporal    WSD
    raw agreement         0.78    0.73        0.81
    Cohen's κ             0.70    0.80        0.13
    G-index               0.76    0.73        0.81
    MACE-EM               0.87    0.88        0.44
    MACE-VB (0.5, 0.5)    0.91    0.90        0.90

Table B.1: Correlation with annotator proficiency: Pearson r of the different methods on the various data
sets. MACE-VB's trustworthiness parameter (trained with Variational Bayes with α = β = 0.5)
correlates best with true annotator proficiency.
One of the distinguishing features of the model is that it uses a parameter for each annotator
to estimate whether or not they are spamming. Can we use this parameter to identify trustworthy
individuals, to invite them for future tasks, and block untrustworthy ones?
It is natural to apply some form of weighting. One approach is to assume that reliable anno-
tators agree more with others than random annotators do. Inter-annotator agreement is thus a good
candidate for weighting the answers. There are various measures of inter-annotator agreement.
[116] compute the average agreement of each annotator and use it as a weight to identify
reliable ones. Raw agreement can be directly computed from the data. It is related to majority
voting, since it will produce high scores for all members of the majority class. Raw agreement is
thus a very simple measure.
In contrast, Cohen's κ corrects the agreement between two annotators for chance agreement.
It is widely used for inter-annotator agreement in annotation tasks. We also compute the κ values
for each pair of annotators, and average them for each annotator (similar to the approach in [116]).
However, whenever one label is more prevalent (a common case in NLP tasks), κ overestimates
the effect of chance agreement [33] and penalizes disproportionately. The G-index [41] corrects
for the number of labels rather than chance agreement.
We compare these measures to our learned trustworthiness parameters θ_j in terms of their
ability to select reliable annotators. A better measure should assign higher scores to annotators who
answer correctly more often than others. We thus compare the ratings of each measure to the true
proficiency of each annotator, i.e., the percentage of annotated items the annotator answered
correctly. Methods that can identify reliable annotators should correlate highly with the annotator's
proficiency. Since the methods use different scales, we compute Pearson's r as the correlation
coefficient, which is scale-invariant. The correlation results are shown in Table B.1.
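For concreteness, here is a minimal sketch of the raw-agreement baseline and the Pearson correlation used in Table B.1 (Cohen's κ and the G-index would be computed analogously; the tiny annotation matrix is invented):

    import math

    def raw_agreement(matrix, j):
        # Average agreement of annotator j with all co-annotators on shared items.
        agree, total = 0, 0
        for row in matrix:
            if row[j] is None:
                continue
            for k, a in enumerate(row):
                if k != j and a is not None:
                    agree += (a == row[j])
                    total += 1
        return agree / total if total else 0.0

    def proficiency(matrix, gold, j):
        # Fraction of annotator j's answers that match the gold label.
        pairs = [(row[j], g) for row, g in zip(matrix, gold) if row[j] is not None]
        return sum(a == g for a, g in pairs) / len(pairs)

    def pearson_r(xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    gold = ["1", "0", "1"]
    matrix = [["1", "1", "0"], ["0", "0", "0"], ["1", "1", "0"]]
    scores = [raw_agreement(matrix, j) for j in range(3)]
    profs = [proficiency(matrix, gold, j) for j in range(3)]
    print(pearson_r(scores, profs))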
The model's θ_j correlates much more strongly with annotator proficiency than either κ or raw
agreement. The variant trained with VB performs consistently better than standard EM training,
and yields the best results. This shows that our model detects reliable annotators much better than
any of the other measures, which are only loosely correlated with annotator proficiency.
The numbers for WSD also illustrate the low κ score resulting when all annotators (correctly)
agree on a small number of labels. However, all inter-annotator agreement measures suffer from
an even more fundamental problem: removing/ignoring annotators with low agreement will al-
ways improve the overall score, irrespective of the quality of their annotations. Worse, there is
no natural stopping point: deleting the most egregious outlier always improves agreement, until
we have only one annotator with perfect agreement left [54]. In contrast, MACE does not discard
any annotators, but weighs their contributions differently. We are thus not losing information.
This works well even under adversarial conditions (see Section B.4.2).
B.4 Recovering the Correct Answer
                        RTE     Temporal    WSD
    majority            0.90    0.93        0.99
    Raykar/Yu 2012      0.93    0.94        —
    Carpenter 2008      0.93    —           —
    MACE-EM/VB          0.93    0.94        0.99
    ------------------------------------------------
    MACE-EM@90          0.95    0.97        0.99
    MACE-EM@75          0.95    0.97        1.0
    MACE-VB@90          0.96    0.97        1.0
    MACE-VB@75          0.98    0.98        1.0

Table B.2: Accuracy of the different methods on the data sets from [102]. MACE-VB uses Variational
Bayes training. Results @n use the n% of items the model is most confident in (Section B.4.1).
Results below the double line trade coverage for accuracy and are thus not comparable to the upper half.
The previous sections showed that our model reliably identifies trustworthy annotators. How-
ever, we also want to find the most likely correct answer. Majority voting often fails to find
the correct label, and the problem worsens when there are more than two labels: we need to take
relative majorities into account or break ties when two or more labels receive the same number
of votes. This is deeply unsatisfying.
Table B.2 shows the accuracy of our model on various data sets from [102]. The model
outperforms majority voting on both RTE and Temporal recognition sets. It performs as well
as majority voting for the WSD task. This last set is somewhat of an exception, though, since
almost all annotators are correct all the time, so majority voting is trivially correct. Still, we need
to ensure that the model does not perform worse under such conditions. The results for RTE and
Temporal data also rival those reported in [94] and [17], yet were achieved with a much simpler
model.
[17] models instance difficulty as a parameter. While it seems intuitively useful to model
which items are harder than others, it increases the parameter space more than our trustworthiness
variable. We achieve comparable performance without modeling difficulty, which greatly simpli-
fies inference. The model of [94] is more similar to our approach, in that it does not model item
difficulty. However, it adds an extra step that learns priors from the estimated parameters. In our
model, this is part of the training process. For more details on both models, see Section B.6.
Figure B.4: Tradeoff between coverage and accuracy for RTE (left) and Temporal (right), comparing MACE-EM, MACE-VB, and majority voting. Lower thresholds lead to less coverage, but result in higher accuracy.
B.4.1 Trading Coverage for Accuracy
Sometimes, we want to produce an answer for every item (e.g., when evaluating a data set),
and sometimes, we value good answers more than answering all items (e.g., when developing
an annotated corpus). [56] have demonstrated how to achieve better coverage (i.e., answer more
items) by relaxing the majority voting constraints. Similarly, we can improve accuracy if we only
select high quality annotations, even if this incurs lower coverage.
We provide a parameter in MACE that allows users to set a threshold for this trade-off: the
model only returns a label for an instance if it is sufficiently confident in its answer. We approxi-
mate the model’s confidence by the posterior entropy of each instance. However, entropy depends
strongly on the specific makeup of the dataset (number of annotators and labels, etc.), so it is hard
for the user to set a specific threshold.
Instead of requiring an exact entropy value, we provide a simple threshold between 0.0 and
1.0 (setting the threshold to 1.0 will include all items). After training, MACE orders the posterior
entropies for all instances and selects the value that covers the selected fraction of the instances.
The threshold thus roughly corresponds to coverage. It then only returns answers for instances
whose entropy is below the threshold. This procedure is similar to precision/recall curves.
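A sketch of the thresholding step (the exact selection rule inside MACE may differ; the posterior values below are invented):

    import math

    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)

    def confident_predictions(posteriors, labels, threshold=0.9):
        # Keep predictions for roughly the `threshold` fraction of items with lowest entropy.
        ents = sorted(entropy(p) for p in posteriors)
        cutoff = ents[max(0, math.ceil(threshold * len(ents)) - 1)]
        preds = []
        for p in posteriors:
            if entropy(p) <= cutoff:
                preds.append(labels[max(range(len(p)), key=lambda k: p[k])])
            else:
                preds.append(None)          # item left unanswered
        return preds

    posteriors = [[0.99, 0.01], [0.55, 0.45], [0.10, 0.90], [0.51, 0.49]]
    print(confident_predictions(posteriors, ["0", "1"], threshold=0.75))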
[56] showed the effect of varying the relative majority required, i.e., requiring that at least n
out of 10 annotators have to agree to count an item. We use that method as a baseline comparison,
evaluating the effect on coverage and accuracy when we vary n from 5 to 10.
Figure B.4 shows the tradeoff between coverage and accuracy for two data sets. Lower thresh-
olds produce more accurate answers, but result in lower coverage, as some items are left blank.
If we produce answers for all items, we achieve accuracies of 0.93 for RTE and 0.94 for Tem-
poral, but by excluding just the 10% of items in which the model is least confident, we achieve
accuracies as high as 0.95 for RTE and 0.97 for Temporal. We omit the results for WSD here,
since there is little headroom and they are thus not very informative. Variational Bayes inference
consistently achieves higher accuracy for the same coverage than the standard implementation.
Increasing the required majority also improves accuracy, although not as much, and the
loss in coverage is larger and cannot be controlled. In contrast, our method allows us to achieve
better accuracy at a smaller, controlled loss in coverage.
B.4.2 Influence of Strategy, Number of Annotators, and Supervision
Figure B.5: Influence of adverse annotator strategy on label accuracy (y-axis): (a) random annotators, (b) minimal annotators. The number of possible labels varies between 2 (top row) and 4 (bottom row). Adverse annotators either choose at random (a) or always select the first label (b). MACE needs fewer good annotators to recover the correct answer.
Adverse Strategy We showed that our model recovers the correct answer with high accuracy.
However, to test whether this is just a function of the annotator pool, we experiment with varying
the trustworthiness of the pool. If most annotators answer correctly, majority voting is trivially
Figure B.6: Varying the number of annotators: effect on prediction accuracy for MACE-EM, MACE-VB, and majority voting. Each point is averaged over 10 runs. Note the different scale for WSD.
correct, as is our model. What happens, however, if more and more annotators are unreliable?
While some agreement can arise from randomness, majority voting is bound to become worse—
can our model overcome this problem? We set up a second set of experiments to test this, using
synthetic data. We choose 20 annotators and vary the number of good annotators among them
from 0 to 10 (after which the trivial case sets in). We define a good annotator as one who answers
correctly 95% of the time (the best annotators on the Snow data sets actually found the correct answer 100% of the time).
Adverse annotators select their answers randomly or always choose
a certain value (minimal annotators). These are two frequent strategies of spammers.
For different numbers of labels and varying percentages of spammers, we measure the accu-
racy of our model and majority voting on 100 items, averaged over 10 runs for each condition.
Figure B.5 shows the effect of annotator proficiency on both majority voting and our method for
both kinds of spammers. Annotator pool strategy affects majority voting more than our model.
Even with few good annotators, our model learns to dismiss the spammers as noise. There is
a noticeable point on each graph where MACE diverges from the majority voting line. It thus
reaches good accuracy much faster than majority voting, i.e., with fewer good annotators. This
divergence point happens earlier with more label values when adverse annotators label randomly.
In general, random annotators are easier to deal with than the ones always choosing the first la-
bel. Note that in cases where we have a majority of adversarial annotators, VB performs worse
than EM, since this condition violates the implicit assumptions we encoded with the priors in
VB. Under these conditions, setting different priors to reflect the annotator pool should improve
performance.
Obviously, both of these pools are extremes: it is unlikely to have so few good or so many
malicious annotators. Most pools will be somewhere in between. It does show, however, that our
model can pick up on reliable annotators even under very unfavorable conditions. The result has
a practical upshot: AMT allows us to require a minimum rating for annotators to work on a task.
Higher ratings improve annotation quality, but delay completion, since there are fewer annotators
with high ratings. The results in this section suggest that we can find the correct answer even in
annotator pools with low overall proficiency. We can thus waive the rating requirement and allow
more annotators to work on the task. This considerably speeds up completion.
Number of Annotators Figure B.6 shows the effect different numbers of annotators have on
accuracy. As we increase the number of annotators, MACE and majority voting achieve better
accuracy results. We note that majority voting results level off or even drop when going from an odd
to an even number. In these cases, the new annotator does not improve accuracy if it goes with
the previous majority (i.e., going from 3:2 to 4:2), but can force an error when going against the
previous majority (i.e., from 3:2 to 3:3), by creating a tie. MACE-EM and MACE-VB dominate
majority voting for RTE and Temporal. For WSD, the picture is less clear, where majority voting
dominates when there are fewer annotators. Note that the differences are minute, though (within
1 percentage point). For very small pool sizes (< 3), MACE-VB outperforms both other methods.
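The tie effect is easy to see in a minimal majority-voting baseline with random tie-breaking. The sketch below is for illustration only and is not the baseline code used in the experiments.

import random
from collections import Counter

def majority_vote(labels, rng=random):
    # Return the most frequent label; ties are split at random.
    counts = Counter(labels)
    top = max(counts.values())
    return rng.choice([lab for lab, c in counts.items() if c == top])

# Going from a 3:2 split to a 3:3 split turns a guaranteed win
# into a coin flip:
majority_vote(["A", "A", "A", "B", "B"])       # always "A"
majority_vote(["A", "A", "A", "B", "B", "B"])  # "A" or "B", 50/50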
Figure B.7: Varying the amount of supervision: effect on prediction accuracy. Each point aver-
aged over 10 runs. MACE uses supervision more efficiently.
Amount of Supervision So far, we have treated the task as completely unsupervised. MACE
does not require any expert annotations in order to achieve high accuracy. However, we often
have annotations for some of the items. These annotated data points are usually used as control
items (by removing annotators that answer them incorrectly). If such annotated data is available,
we would like to make use of it. We include an option that lets users supply annotations for
some of the items, and use this information as token constraints in the E-step of training. In
those cases, the model does not need to estimate the correct value, but only has to adjust the trust
parameter. This leads to improved performance.²
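The idea behind these token constraints can be sketched as follows; this is a simplified, hypothetical illustration rather than MACE's actual E-step code. When a gold label is supplied for an item, the posterior over its true label is clamped to that label, so the item only contributes to updating the annotator parameters.

def constrained_posterior(label_scores, gold_label=None):
    # E-step for one item: return a distribution over labels.
    # label_scores: dict mapping each label to its unnormalized score
    # under the current model parameters.
    # gold_label: if supplied, the posterior is clamped to that label.
    if gold_label is not None:
        return {lab: 1.0 if lab == gold_label else 0.0
                for lab in label_scores}
    total = sum(label_scores.values())
    return {lab: s / total for lab, s in label_scores.items()}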
For RTE and Temporal, we explore how performance changes when we vary the amount of supervision in increments of 5%.³
We average over 10 runs for each value of n, each time
supplying annotations for a random set of n items. The baseline uses the annotated label
whenever supplied, otherwise the majority vote, with ties split at random.
² If we had annotations for all items, accuracy would be perfect and require no training.
³ Given the high accuracy for the WSD data set even in the fully unsupervised case, we omit the results here.
Figure B.7 shows that, unsurprisingly, all methods improve with additional supervision,
ultimately reaching perfect accuracy. However, MACE uses the information more effectively,
resulting in higher accuracy for a given amount of supervision. This gain is more pronounced
when little supervision is available.
B.5 Implementation
We provide MACE (Multi-Annotator Competence Estimation), a Java-based software package
that allows researchers to evaluate annotations without supervised data. It takes a simple CSV
file as input (see Figure B.8), where each line represents one annotation item, and each column
an annotator. Empty positions indicate missing answers. The output consists of two files: the model's
predictions of the best label, and the trustworthiness estimate for each annotator.
,0,0,1,,0,,,0,,0,0,1,0,0
1,,1,0,1,1,0,,,0,,0,,0,1
,,0,,0,1,,0,,0,,0,1,,1
Figure B.8: Example CSV input to MACE for three instances and 15 annotators.
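Reading this format amounts to turning each non-empty column into an (annotator, label) pair. A minimal reader might look like the following sketch; it is illustrative only and not the MACE source code.

import csv

def read_annotations(path):
    # Parse a MACE-style CSV file into a sparse structure: a list with
    # one dict per item, mapping annotator index -> label.
    # Empty fields (missing answers) are simply skipped.
    items = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            items.append({j: lab for j, lab in enumerate(row) if lab != ""})
    return items

# For the example in Figure B.8, the first item becomes
# {1: '0', 2: '0', 3: '1', 5: '0', 8: '0', 10: '0', 11: '0',
#  12: '1', 13: '0', 14: '0'}

The resulting per-item dictionaries are also an example of the sparse representation discussed further below: annotators who did not see an item simply do not appear in its entry.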
Optionally, users can supply a gold standard set of answers, to evaluate the accuracy of the
model, and set the entropy threshold as described in Section B.4.1.
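As a rough illustration of how such an entropy threshold can be applied (a sketch of the idea, not MACE's interface; the function names are ours), one can compute the entropy of each item's label posterior and only keep predictions for items below the cutoff.

import math

def entropy(posterior):
    # Shannon entropy (in bits) of a dict mapping label -> probability.
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)

def confident_predictions(posteriors, threshold):
    # Keep only items whose label posterior is peaked enough.
    # Returns {item_index: predicted_label}.
    return {i: max(post, key=post.get)
            for i, post in enumerate(posteriors)
            if entropy(post) <= threshold}

Lowering the threshold trades coverage for accuracy: fewer items receive a prediction, but those that do are the ones the model is most certain about.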
While we find the model to be relatively robust to changes in the number of iterations, restarts,
and smoothing, we provide switches to control these values, too. In addition, users can set the
shape parameters of a Beta distribution for θ_j and perform Variational Bayes EM training.
Redundant annotations tend to be very sparse (not every annotator answers or even sees every
item). We exploit this by implementing a sparse vector, which allows us to process even large
datasets.
B.6 Related Research
[102] and [104] showed that Amazon's Mechanical Turk can be used to provide non-expert annotations
for NLP tasks. Various models have been proposed for predicting correct annotations from noisy
non-expert annotations and for estimating annotator trustworthiness. These models divide natu-
rally into two categories: those that use expert annotations for supervised learning [9, 102], and
completely unsupervised ones. Our method falls into the latter category because it learns from
the redundant non-expert annotations themselves, and makes no use of expertly annotated data.
Most previous work on unsupervised models belongs to a class called “Item-response mod-
els”, used in psychometrics. The approaches differ with respect to which aspect of the annotation
process they choose to focus on, and the type of annotation task they model. For example, many
methods explicitly model annotator bias in addition to annotator competence [25,101]. Our work
models annotator bias, but only when the annotator is suspected to be spamming.
Other methods focus modeling power on instance difficulty to learn not only which annotators
are good, but also which instances are hard [17, 121]. In machine vision, several models have taken
this further by parameterizing difficulty in terms of complex features defined on each pairing of
annotator and annotation instance [120, 122]. While such features prove very useful in vision,
they are more difficult to define for the categorical problems common to NLP. In addition, several
136
methods are specifically tailored to annotation tasks that involve ranking [65, 107], which limits
their applicability in NLP.
The method of [94] is most similar to ours. Their goal is to identify and filter out annotators
whose annotations are not correlated with the gold label. They define a function of the learned
parameters that is useful for identifying these spammers, and then use this function to build
a prior. In contrast, we use simple priors, but incorporate a model parameter that explicitly
represents the probability that an annotator is spamming. Our simple model achieves the same
accuracy on gold label predictions as theirs.
B.7 Conclusion
We provide a Java-based implementation, MACE, that recovers correct labels with high accuracy,
and reliably identifies trustworthy annotators. In addition, it provides a threshold to control the ac-
curacy/coverage trade-off and can be trained with standard EM or Variational Bayes EM. MACE
works fully unsupervised, but can incorporate token constraints via annotated control items. We
show that even small amounts help improve accuracy.
Our model focuses most of its modeling power on learning trustworthiness parameters, which
are highly correlated with true annotator reliability (Pearson's r of 0.9). We show on real-world and
synthetic data sets that our method is more accurate than majority voting, even under adver-
sarial conditions, and as accurate as more complex state-of-the-art systems. Focusing on high-
confidence instances improves accuracy considerably.
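On synthetic data, where each annotator's true reliability is known, this correlation can be checked directly; a small helper along the following lines suffices. It is illustrative and assumes the learned competence estimates and the true reliabilities are available as two parallel lists.

import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. pearson_r(estimated_competence, true_reliability)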
MACE is freely available for download at http://www.isi.edu/publications/licensed-sw/mace/.
Abstract
NLP applications such as Question Answering (QA), Information Extraction (IE), or Machine Translation (MT) are incorporating increasing amounts of semantic information. A fundamental building block of semantic information is the relation between a predicate and its arguments, e.g. eat(John,burger). In order to reason at higher levels of abstraction, it is useful to group relation instances according to the types of their predicates and the types of their arguments. For example, while eat(Mary,burger) and devour(John,tofu) are two distinct relation instances, they share the underlying predicate and argument types INGEST(PERSON,FOOD). A central question is: where do the types and relations come from?

The subfield of NLP concerned with this is relation extraction, which comprises two main tasks:

1. identifying and extracting relation instances from text
2. determining the types of their predicates and arguments

The first task is difficult for several reasons. Relations can express their predicate explicitly or implicitly. Furthermore, their elements can be far apart, with unrelated words intervening. In this thesis, we restrict ourselves to relations that are explicitly expressed between syntactically related words. We harvest the relation instances from dependency parses.

The second task is the central focus of this thesis. Specifically, we will address these three problems:

1) determining argument types
2) determining predicate types
3) determining argument and predicate types.

For each task, we model predicate and argument types as latent variables in a hidden Markov model. Depending on the type system available for each of these tasks, our approaches range from unsupervised to semi-supervised to fully supervised training methods. The central contributions of this thesis are as follows:

1. Learning argument types (unsupervised): We present a novel approach that learns the type system along with the relation candidates when neither is given. In contrast to previous work on unsupervised relation extraction, it produces human-interpretable types rather than clusters (published as Hovy et al., 2011b). We also investigate its applicability to downstream tasks such as knowledge base population and construction of ontological structures. An auxiliary contribution, born from the necessity to evaluate the quality of human subjects, is MACE (Multi-Annotator Competence Estimation), a tool that helps estimate both annotator competence and the most likely answer.

2. Learning predicate types (unsupervised and supervised): Relations are ubiquitous in language, and many problems can be modeled as relation problems. We demonstrate this on a common NLP task, word sense disambiguation (WSD) for prepositions (PSD). We use selectional constraints between the preposition and its argument in order to determine the sense of the preposition (published as Hovy et al., 2010
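As a toy illustration of this grouping, the snippet below maps the two example instances to their shared typed relation. The type inventory and lookup tables are invented for the example; in the thesis, the types are learned as latent variables rather than listed by hand.

# Relation instances harvested from text: predicate(arg1, arg2)
instances = [
    ("eat",    "Mary", "burger"),
    ("devour", "John", "tofu"),
]

# Hand-written toy type assignments, for illustration only.
predicate_types = {"eat": "INGEST", "devour": "INGEST"}
argument_types  = {"Mary": "PERSON", "John": "PERSON",
                   "burger": "FOOD", "tofu": "FOOD"}

for pred, a1, a2 in instances:
    print(f"{pred}({a1},{a2}) -> "
          f"{predicate_types[pred]}({argument_types[a1]},{argument_types[a2]})")
# Both instances map to the same typed relation: INGEST(PERSON,FOOD)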
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Building a knowledgebase for deep lexical semantics
Learning paraphrases from text
Text understanding via semantic structure analysis
Identification, classification, and analysis of opinions on the Web
Hashcode representations of natural language for relation extraction
An efficient approach to clustering datasets with mixed type attributes in data mining
Syntactic alignment models for large-scale statistical machine translation
Learning distributed representations from network data and human navigation
Modeling, learning, and leveraging similarity
Beyond parallel data: decipherment for better quality machine translation
Robust and generalizable knowledge acquisition from text
From matching to querying: A unified framework for ontology integration
Modeling, searching, and explaining abnormal instances in multi-relational networks
A reference-set approach to information extraction from unstructured, ungrammatical data sources
Interactive learning: a general framework and various applications
Learning the geometric structure of high dimensional data using the Tensor Voting Graph
Advances in linguistic data-oriented uncertainty modeling, reasoning, and intelligent decision making
Generating psycholinguistic norms and applications
Deep learning models for temporal data in health care
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Asset Metadata
Creator: Hovy, Dirk (author)
Core Title: Learning semantic types and relations from text
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 07/26/2013
Defense Date: 05/06/2013
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: computational linguistics, information extraction, NLP, OAI-PMH Harvest, relation extraction, unsupervised learning
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Hobbs, Jerry (committee chair), Chiang, David (committee member), Kaiser, Elsi (committee member), Knight, Kevin C. (committee member), McLeod, Dennis (committee member)
Creator Email: dirkh@isi.edu, mail@dirkhovy.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-300459
Unique identifier: UC11294871
Identifier: etd-HovyDirk-1851.pdf (filename), usctheses-c3-300459 (legacy record id)
Legacy Identifier: etd-HovyDirk-1851.pdf
Dmrecord: 300459
Document Type: Dissertation
Format: application/pdf (imt)
Rights: Hovy, Dirk
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA