On the Learning of Rules for an Information Extraction System

by

Stephen Vincent Kowalski

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

December 1994

Copyright 1995 Stephen Vincent Kowalski

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Stephen Vincent Kowalski under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies

Dissertation Committee

Dedication

To Maria

Acknowledgments

This work would not have been possible without the support I have received from friends, family, and colleagues over the years. I would like to first thank Dan Moldovan for serving as my advisor. Dan taught me how to do research and communicate the results. His weekly group meetings provided a welcome forum for the exchange of new ideas. Thanks also go to the other members of my dissertation committee: Jean-Luc Gaudiot, Shankar Rajamoney, and Paul Rosenbloom. They all provided valuable feedback and helped to strengthen this work. I also received constructive comments from Keith Price, who served on my guidance committee, and Kevin Knight.

My interest in artificial intelligence was kindled in the classrooms of Les Gasser and Michael Arbib. Shankar Rajamoney introduced me to the subject of machine learning, where I became fascinated with the area of induction.
His weekly seminars with Paul Rosenbloom helped me to understand the issues in this field and to become aware of current research directions.

I am grateful to my colleagues at the USC Parallel Knowledge Processing Laboratory: Chin-Yew Lin, Eric Lin, Min-Hwa Chung, Juntae Kim, Seungho Cha, Sang-Hwa Chung, Tony Gallippi, Ken Hendrickson, Sanda Harabagiu, Adrian Moga, Traian Mitrache, Wing Lee, and Ron Demara. Chin-Yew built the preprocessor that I used in my experiments. Chin-Yew and Ken helped to populate the preprocessor dictionary. Eric was always available to provide system support. Thanks to Dawn Ernst for administrative support and helping me to navigate the paperwork maze.

I am very thankful to the Northrop Grumman Corporation for its generous fellowship which supported me while I was a PhD student. Bob Young provided lots of encouragement and got me started in the program. Other people at Northrop Grumman who provided help were: Ian MacAllister, Terry Hall, Ram Ramkumar, Jim Eves, Adi Arieli, Joe Bestor, Al Sandfort, Bob Wheelock, Alice Lebel, Janice Makhlouf, Margaret Paulin, and Guenter Daub. I would like to acknowledge AAAI for financial assistance for conference travel during the completion of this work.

Finally, I would like to thank those dearest to me. My wife Maria has proofread every paper I have written and has provided many technical and non-technical comments. I am grateful for her love, devotion, and patience during my career as a graduate student. I could not have done this without her. To my parents, Vincent and Lorraine, I am grateful for their support and encouragement over the years. And to Sharkus J. Peepster III, thanks for enriching our lives and helping me realize that humans do not have a monopoly on intelligence.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 INTRODUCTION
1.1 Motivation
1.2 Overview of Machine Learning
1.2.1 Learning From Examples
1.2.2 Issues in Supervised Learning
1.2.3 Limitations of Supervised Learning Systems
1.3 Past Information Extraction Successes
1.4 Learning Information Extraction Rules
1.5 Organization of this Dissertation
2 RELATED WORK
2.1 ID3
2.2 AQ
2.3 FOIL
2.4 FOCL
3 Rule-Based Information Extraction
3.1 Introduction
3.2 Rule Structure
3.2.1 Rule Patterns
3.2.2 Object Predicates
3.2.3 Object Modifiers
3.2.4 The Joint Venture Domain
3.3 Rule Matching
3.3.1 Mappings and Valid Mappings
3.3.2 Subsumption
3.3.2.1 Selector Subsumption
3.3.2.2 Interpretation Subsumption
3.3.2.3 Cover Subsumption
3.3.2.4 Pattern Subsumption
3.4 RUBIES
3.4.1 Preprocessor
3.4.2 Rule Matcher
3.4.3 Rule Evaluator
3.4.4 MUC-5 Rule Classes
3.5 Limitations
3.6 Conclusion
4 TRAINING SET CONSTRUCTION
4.1 Introduction
4.2 MUC-5 EJV Training Sets
4.3 Conversion of Text and Templates into Rules
4.4 Rule Decomposition Algorithm
4.5 Rule Decomposition Experiments
4.6 Conclusion
5 LEARNING INFORMATION EXTRACTION RULES
5.1 Introduction
5.2 Covering-Type Induction Algorithms
5.3 Selecting a Positive Rule for Generalization
5.4 Finding a Set Cover
5.5 Rule Generalization
5.5.1 Search Operators
5.5.2 Finding a Set of Consistent Rules
5.5.3 Finding the Most General Rule
5.6 Attribute Dependency Graphs
5.7 Adding Cover Items
5.8 Adding Selectors
5.9 Comparison with Other Learners
5.9.1 Training Example Representation
5.9.2 Hypothesis Representation
5.9.3 Search Strategy
5.10 Conclusion
6 EXPERIMENTAL RESULTS
6.1 Introduction
6.2 Description of Experiments
6.3 Performance Measures
6.4 Comparing Merlin/RUBIES with the USC MUC-5 System
6.5 Comparison with Other MUC-5 Systems
6.6 Varying the Number of Positive Examples
6.7 Varying the Number of Negative Examples
6.8 Conclusion
7 CONCLUSIONS AND EXTENSIONS
7.1 Summary of Contributions
7.2 Limitations
7.3 Possible Extensions
Reference List
Appendix A: Examples of Rules Learned by Merlin
A.1 Tieup/Entity Construction Rules
A.2 Tieup Joint Venture Rules
A.3 Tieup Status Rules
Appendix B: Detailed Experimental Results
B.1 Comparing Merlin/RUBIES with the USC MUC-5 System
B.2 Comparing Merlin/RUBIES with the MUC-5 Systems
B.3 Varying the Number of Positive Training Examples
B.4 Varying the Number of Negative Training Examples

List Of Tables

1.1 Example GLS [70]
1.2 Example Concept Node [62]
2.1 The AQR Algorithm
2.2 The FOIL Algorithm
2.3 The FOCL Algorithm
3.1 Example Text Message
3.2 Example Information Extraction Rule
3.3 Database Contents after Rule has Fired
3.4 Objects for Joint Venture Domain
3.5 Truth Table for Subsumption Definitions
3.6 Example Message and Filled Templates
3.7 RUBIES Algorithm
3.8 Example Text Input Sentence
3.9 Preprocessor Output for the Word "Killing"
3.10 Example Text Message
3.11 Current State of Working Memory
3.12 Example Information Extraction Rule
3.13 Rule Evaluation Algorithm
3.14 Tieup/Entity Construction Rule Classes
3.15 Tieup jv-company Rule Classes
3.16 Tieup No jv-company Rule Classes
3.17 Tieup Status Rule Classes for status = existing
4.1 Negative Examples for Tieup/Entity Construction Rules
4.2 Negative Examples for Tieup jv-company Rules
4.3 Negative Examples for Tieup Status Rules for status = existing
4.4 Example Message and Filled Templates
4.5 Preprocessor Output for the Text in Table 4.4
4.6 Covers for Words in Table 1
4.7 Rule Pattern for the Text Message in Table 1
4.8 Object Modifiers for Templates in Table 1
4.9 MUC-5 EJV Rule Decomposition Algorithm
4.10 Example Complex Rule
4.11 Simple Rules Decomposed from Complex Rule of Table 4.10
4.12 Tieup Construction Rules
4.13 JV Company Rules
4.14 Status Rules
5.1 The Basic Covering Method for Induction
5.2 The Set-Covering Problem
5.3 The Covering Algorithm for Stargazer
5.4 The Covering Algorithm for Merlin
5.5 A Greedy Set-Covering Algorithm
5.6 Average Error Rate (%), 100 Trials [32]
5.7 Search Operators
5.8 Merlin Generalization Algorithm
5.9 Example Text Message
5.10 Rule Pattern for the Text Message in Table 5.9
5.11 Result After First Generalization Step
5.12 Most General Pattern for Pattern in Table 5.10
5.13 Edge Definitions for Figure 5.1
5.14 Algorithm for Pivot Node Weight Assignment
5.15 The GenSeedCover Procedure
5.16 The AddCovers Procedure
5.17 Example Rule Pattern and Seed Item
5.18 Specializing Rule Patterns by Adding Selectors to Interpretations
5.19 The AddSelectors Procedure
5.20 Examples Using Attribute-Value Representation
5.21 Examples Using Predicate Argument Representation
5.22 Example Hypotheses Represented with Extended DNF
5.23 Example GLS Cospec [70]
5.24 Example Concept Node [62]
6.1 Template Objects and Slots Used In Experiments

List Of Figures

2.1 Example Decision Tree
3.1 RUBIES Information Extraction System
4.1 Conversion of Training Example to RUBIES Rule
5.1 Attribute Dependency Graph for MUC-5 EJV
5.2 Specialization of Rule Pattern by Replacing Kleene Stars with Cover Items
6.1 Flowchart of Experimental Process
6.2 Error Measures for 5 Slots, 191 Pos, 1389 Neg Examples
6.3 Recall/Precision for 5 Slots, 191 Pos, 1389 Neg Examples
6.4 Raw Scores for 5 Slots, 191 Pos, 1389 Neg Examples
6.5 MUC-5 System Performance Range for 5 Slots
6.6 Error Measures for All Slots, 1389 Negative Examples
6.7 Recall/Precision for All Slots, 1389 Negative Examples
6.8 Sum of Slot Scores, 1389 Negative Examples
6.9 Error Measures for All Slots, 191 Positive Examples
6.10 Recall/Precision for All Slots, 191 Positive Examples
6.11 Sum of Slot Scores, 191 Positive Examples
B.1 Error Measures for TieUp Entity, 191 Pos, 1389 Neg Examples
B.2 R/P for TieUp Entity, 191 Pos, 1389 Neg Examples
B.3 Raw Scores for TieUp Entity, 191 Pos, 1389 Neg Examples
B.4 Error Measures for TieUp Status, 191 Pos, 1389 Neg Examples
B.5 R/P for TieUp Status, 191 Pos, 1389 Neg Examples
B.6 Raw Scores for TieUp Status, 191 Pos, 1389 Neg Examples
B.7 Error Measures for TieUp Joint Venture, 191 Pos, 1389 Neg Examples
B.8 R/P for TieUp Joint Venture, 191 Pos, 1389 Neg Examples
B.9 Raw Scores for TieUp Joint Venture, 191 Pos, 1389 Neg Examples
B.10 Error Measures for Entity Name, 191 Pos, 1389 Neg Examples
B.11 R/P for Entity Name, 191 Pos, 1389 Neg Examples
B.12 Raw Scores for Entity Name, 191 Pos, 1389 Neg Examples
B.13 Error Measures for Entity Type, 191 Pos, 1389 Neg Examples
B.14 R/P for Entity Type, 191 Pos, 1389 Neg Examples
B.15 Raw Scores for Entity Type, 191 Pos, 1389 Neg Examples
B.16 Error Measures for TieUp Objects, 191 Pos, 1389 Neg Examples
B.17 R/P for TieUp Objects, 191 Pos, 1389 Neg Examples
B.18 Raw Scores for TieUp Objects, 191 Pos, 1389 Neg Examples
B.19 Error Measures for Entity Objects, 191 Pos, 1389 Neg Examples
B.20 R/P for Entity Objects, 191 Pos, 1389 Neg Examples
B.21 Raw Scores for Entity Objects, 191 Pos, 1389 Neg Examples
B.22 MUC-5 System Performance Range for TieUp Status
B.23 MUC-5 System Performance Range for TieUp Entity
B.24 MUC-5 System Performance Range for TieUp Joint Venture
B.25 MUC-5 System Performance Range for Entity Name
B.26 MUC-5 System Performance Range for Entity Type
B.27 MUC-5 System Performance Range for TieUp Objects
B.28 MUC-5 System Performance Range for Entity Objects
B.29 Scores for TieUp Status Slot, 1389 Neg Examples
B.30 Scores for TieUp Entity Slot, 1389 Neg Examples
B.31 Scores for TieUp Joint Venture Slot, 1389 Neg Examples
B.32 Scores for Entity Name Slot, 1389 Neg Examples
B.33 Scores for Entity Name Slot, 1389 Neg Examples
B.34 Error Measures for TieUp Status Slot, 1389 Neg Examples
B.35 Error Measures for TieUp Entity Slot, 1389 Neg Examples
B.36 Error Measures for TieUp Joint Venture Slot, 1389 Neg Examples
B.37 Error Measures for Entity Name Slot, 1389 Neg Examples
B.38 Error Measures for Entity Type Slot, 1389 Neg Examples
B.39 R/P for TieUp Status Slot, 1389 Neg Examples
B.40 R/P for TieUp Entity Slot, 1389 Neg Examples
B.41 R/P for TieUp Joint Venture Slot, 1389 Neg Examples
B.42 R/P for Entity Name Slot, 1389 Neg Examples
B.43 R/P for Entity Type Slot, 1389 Neg Examples
B.44 Scores for TieUp Objects, 1389 Neg Examples
B.45 Scores for Entity Objects, 1389 Neg Examples
B.46 Error Measures for TieUp Objects, 1389 Neg Examples
B.47 Error Measures for Entity Objects, 1389 Neg Examples
B.48 R/P for TieUp Objects, 1389 Neg Examples
B.49 R/P for Entity Objects, 1389 Neg Examples
B.50 Scores for TieUp Status Slot, 191 Pos Examples
B.51 Scores for TieUp Entity Slot, 191 Pos Examples
B.52 Scores for TieUp Joint Venture Slot, 191 Pos Examples
B.53 Scores for Entity Name Slot, 191 Pos Examples
B.54 Scores for Entity Type Slot, 191 Pos Examples
B.55 Error Measures for TieUp Status Slot, 191 Pos Examples
B.56 Error Measures for TieUp Entity Slot, 191 Pos Examples
B.57 Error Measures for TieUp Joint Venture Slot, 191 Pos Examples
B.58 Error Measures for Entity Name Slot, 191 Pos Examples
B.59 Error Measures for Entity Type Slot, 191 Pos Examples
B.60 R/P for TieUp Status Slot, 191 Pos Examples
B.61 R/P for TieUp Entity Slot, 191 Pos Examples
B.62 R/P for TieUp Joint Venture Slot, 191 Pos Examples
B.63 R/P for Entity Name Slot, 191 Pos Examples
B.64 R/P for Entity Type Slot, 191 Pos Examples
B.65 Scores for TieUp Objects, 191 Pos Examples
B.66 Scores for Entity Objects, 191 Pos Examples
B.67 Error Measures for TieUp Objects, 191 Pos Examples
B.68 Error Measures for Entity Objects, 191 Pos Examples
B.69 R/P for TieUp Objects, 191 Pos Examples
B.70 R/P for Entity Objects, 191 Pos Examples

Abstract

Knowledge-based systems that extract information from natural language text have been successfully applied to a number of domains. A major problem with these systems is the large amount of effort required to construct the domain-specific portion of the knowledge base. This needs to be done every time the system is ported to a new domain and is a tedious and error-prone process. Although there have been several successful attempts to partially automate the knowledge-base construction process, fully-automatic machine learning techniques have been elusive. In this thesis, we describe a machine learning system that automatically constructs the rules for an information extraction system called RUBIES.
RUBIES is a pattern-based system and mainly operates on a syntactic representation of the text. Rules for RUBIES are automatically generated by the Merlin learning system from examples of text messages and the information that is to be extracted from them.

The Merlin learning system extends the capabilities of previous inductive learners in several ways. The training examples that are used and the concepts that are learned are represented in a much more complex language - the structure of an information extraction rule. This complex representation necessitated the use of new search operators and strategies. The performance of a rule set learned by Merlin was found to be comparable with that of a manually-generated rule set when tested on the final text messages of the English Joint Venture domain of the Fifth Message Understanding Conference (MUC-5).

Chapter 1
INTRODUCTION

1.1 Motivation

Throughout history, man has successfully constructed machines that perform useful tasks. These machines have helped him work more efficiently, allowed him to spend time on more desirable pursuits, and in many cases, enabled him to do things he could not do before. These machines are artificial things - they are "man-made." The complexity of these devices has increased remarkably over the centuries. Recently, we have begun to build machines that exhibit intelligent behavior - behavior that was previously thought to be possible only for creatures such as ourselves.

Tasks that are relatively complicated and require a somewhat high level of intelligence are studied in a branch of computer science called artificial intelligence. Computer programs that perform intelligent tasks are often referred to as knowledge-based systems. When a program is built to perform a task previously done by a human, it is sometimes called an expert system because it replaces the human expert. Despite the fact that knowledge-based systems have been shown to be effective in a number of domains, their use has been limited because of the large labor cost associated with their construction.

Many knowledge-based systems employ a general-purpose inference engine, or shell. The inference engine operates on a set of rules that reside in the knowledge base of the system. These systems are sometimes called rule-based and have their origins in production systems [53]. The task of building the knowledge base is performed by skilled knowledge engineers and is typically very costly.

One problem that knowledge-based systems have been applied to is information extraction from natural language text. Text messages are searched for relevant information which is then extracted, formatted, and stored in a database. One such system was constructed at the University of Southern California for the Fourth Message Understanding Conference (MUC-4) [47]. The rule set for this system took about five person-months of labor to construct and provided only partial coverage. This rule set was domain specific, so when the domain changed from terrorism to business joint ventures in MUC-5, a similar level of effort was required to construct a new rule set [54]. If knowledge-base construction could be automated, these types of systems could be more quickly and less expensively ported to new domains.

Despite many attempts to automate or partially automate the knowledge-base construction task, it is still largely a manual process.
Most of the success in this area has been achieved by knowledge acquisition systems that automate some of the knowledge engineer's tasks but still require a "human-in-the-loop" [49]. One commonly used method is to have a machine learning system automatically create some rules and then have the knowledge engineer edit these rules and write new ones where necessary. These learning systems are limited in the complexity of the rules that they can learn, and much editing of the rules they produce is usually required. In the next section, we give a brief overview of machine learning with special emphasis on supervised learning systems.

1.2 Overview of Machine Learning

Machine learning can be defined as the study of computer systems that autonomously improve their own performance. The field of machine learning is about forty years old and is currently a large and active subfield of artificial intelligence with many different research directions being pursued. Some of the major paradigms are described as follows:

Supervised Learning: Given a set of preclassified objects, the system formulates classification rules that can be used to classify novel objects. ID3 [59] and AQ [42] are two supervised learning systems that have been extensively studied and improved upon.

Conceptual Clustering: Given a set of object descriptions, the system finds regularities in the data and groups the objects into classes. Some examples of these systems are UNIMEM [36] and COBWEB [21].

Discovery: The discovery paradigm is not well defined, but can be viewed as the ability to find new things (mathematical formulas, etc.) without supervision. Some examples of discovery systems are AM [39], BACON [35], and ABACUS [18].

Speedup Learning: This paradigm is most often studied in the context of problem solvers, where efficiency is an important consideration. The goal of these types of learning systems is to decrease the time it takes to solve problems. Explanation-based learning (EBL) systems fall within this category. Some examples of systems that use EBL methods are STRIPS [20], SOAR [34], and PRODIGY [44].

Analogical Learning: The goal of these types of learners is to create an analogy that can be used by an analogical or case-based reasoning system to solve a problem. The key idea is to be able to transfer knowledge from one task to another [2] [30] [24].

1.2.1 Learning From Examples

The ability to learn new concepts is fundamental to human development. Throughout our lifetimes, we acquire concepts such as "objects that burn," "good things to eat," "movies that Maria likes," and "legal chess moves" by observing and experimenting with the environment. Researchers in artificial intelligence have constructed systems that demonstrate concept learning in various domains using many different approaches. The practical benefit of these systems is the real possibility of automating some, or all, of the development of knowledge-based systems [50] [14] [29].

Though several formal concept learning paradigms have been defined, we will limit our discussion to the one commonly known as supervised learning, or learning from preclassified training examples. In this problem, we are given a set of training examples which have been conveniently classified into one or more mutually-exclusive classes by some teacher. The goal of the learning system is to generate a description for each of these classes that can be used to accurately classify new, unseen objects.

The supervised learning problem is defined below.

Given:
- A set of training examples, E_c, for each class c, expressed in some language L_E.
- A set of background knowledge, B, expressed in some language L_B.

Learn:
- A description for each class that can be used to classify novel objects.

The concept description could be expressed in any language and commonly appears in the form of a decision tree [58] [1] [59] [3] [19], logical expression [40] [42] [43] [6], Horn clause [60] [56] [52] [51], or artificial neural network [63] [15] [67]. It can be used, within some degree of accuracy, to classify any new object as long as the object is described in the language L_E. The background knowledge, or domain knowledge, is optional, and only recently have systems been constructed that can make use of it. Usually, these systems learn the concept descriptions as a set of Horn clauses. Some examples are FOIL [60], which uses a set of extensionally-defined predicates as background knowledge, and FOCL [56], which extends FOIL to allow intensionally-defined predicates and constraints on predicate arguments. FOCL also accepts as background knowledge a partial, possibly incorrect, rule that is an approximation to the predicate to be learned.

1.2.2 Issues in Supervised Learning

Perhaps the most important issue in supervised learning is concept accuracy, the accuracy of the learned concept descriptions. Indeed, learning systems are empirically evaluated by testing these descriptions on a set of novel objects (called the test set) and measuring the classification accuracy.

The accuracy that can be achieved by a learning system depends on many factors, including the number and quality of the training examples. In some domains, the examples may be inconsistent (misclassified), incomplete (some missing data), or noisy (the description of an example may have errors). There has been some theoretical work done to determine the number of training examples (also known as the sample complexity) required to learn a concept that is, within a certain probability, approximately correct (PAC). This model is called PAC learning [69] and there has been some success in extending it to more complex representations [9] [8].

The applicability of a learning system to a learning problem depends to a large degree on the representation used for the training set (L_E) and the representation used for the concept descriptions. Most domains used for supervised learning represent training examples with an attribute-value representation. Attributes are single-valued and take either nominal or continuous values. Relational learning systems such as FOIL allow examples to be expressed using extensionally-defined predicates. The representation used for the concept description determines what concepts can and cannot be learned. For example, a system that represents concepts as a conjunction of attribute-value tests cannot learn the concept (color = red) ∨ (size = small). There is a wide variety of concept representations in use, including decision trees, neural networks, logical expressions (e.g. DNF (disjunctive normal form), VL (Variable Logic) [42]), and Horn clauses.

Finally, we are interested in the more traditional issues of time and space complexity. Since the problem of finding the "best" concept description, for some definition of "best," is almost always computationally intractable, learning algorithms attempt to approximate the optimal solution.

1.2.3 Limitations of Supervised Learning Systems

Consider a learning system that can automatically construct a rule set for a knowledge-based system from examples of its input and output. Most existing learners can do this as long as the knowledge-based system, or performance element, is extremely simple. One example is a performance element that classifies objects based on their attribute values.
Some examples of pattern-based information extraction systems are FASTUS [26], Diderot [16], and CIRCUS [37]. For the most p u t, patterns are constructed by tedious manual analysis of the text and templates. Two notable exceptions arc the Diderot and CIRCUS systems built for MUC-4 and MUC-5. Diderot and CIRCUS both use patterns that arc anchored to a single word. Diderot’s patterns arc called generative lexicon structures (GLS). An example GLS for the MUC-5 joint venture domain is shown in Table 1.1. This GLS is anchored to the word establish. Patterns looking very much like regular expressions can be easily identified in the cospcc fields. When the pattern is matched, the tcmplatcscmantics field maps it to the information template by constructing a tie-up object with argu ments as shown. The steps taken to build this knowledge base of GLS entries arc described as follows: 1. Generate a histogram for tokenB (i.e. words) in the initial corpora. 2. Select some of the most frequently seen tokens based on subject code entries in the Longman Dictionary of Contempory English (LDOCE) [57]. GLS entries will be generated for these key tokens. 3. Automatically create the GLS entry for each key token using LEXBASE [25], LEXBASE is a machine tractable version of LDOCE. 4. Manually tune the GLS entries. Almost all of the entries needed to be tuned for MUC-5. The methodology was to run the system on a set of texts, analyze the results, and make modifications to the GLS entries. 7 gls(catablish, syn(...), argB([argl(Al, syn([typc(np)]), qualia( [formal( [codc.2,organization])])), arg2(A2, syn([typc(np)]), qualia( [formal( [code.2,organization])])), arg3(A3, ®yn([type(np)]), qualia([formal([codcJ2,organization])]))]), qualia([forma](ticaipJcp)]), cospcc(( [A 1 ,*,self,*,A2,*,with, A3], [Al,and,A3,*,self,*,A2], [A 1,together,with, A3, *, self, *,A2], [A 1,is,to,be,self,*,with, A3], [Al,*,signed,♦.agreement,*,self,A2], [A 1,♦.self,* joint, venture, A2, with, A3], [self, include, A2], [A2,was,self,with,A3]]), types(tiejup.verb), templatejjemanticsfpt-tie.up, tiejup([Al, A3], A2, ^existing,-))). Table 1.1; Example GLS [70] 8 5. Convert the GLS entries into a Prolog definite clause grammcr. This grammcr is partial, because it only outlines a rough sentence structure and may have gaps. 6. Compile these rules for Diderot. The patterns for CIRCUS are called concept nodes and arc generated using a tool called AutoSIog [62]. An example concept node for the terrorism domain of MUC-4 is shown in Table 1.2. This pattern is anchored to the word bombed. The constant slot m aps this concept node to a bombing-cvcnt tem plate. The variable slot maps the subject of the matched clause (*S*) to the target slot of the tem plate. Name: IVigger: Variable Slots: Constraints: Constant Slots: Enabling Conditions: target-subjcct-passive-verb-bombed bombed (target (*S* 1)) (class phys-target *S*) (type bombing) ((passive)) Table 1.2: Example Concept Node [62] AutoSIog autom atically generates a set of these concept nodes from the training texts and tem plates. However, quite a bit of these nodes arc bad and arc deleted by a hum an operator. Fbr MUC-5, AutoSIog proposed 3167 concept nodes of which 944 (30%) were retained [38]. Although the patterns used by Diderot and CIRCUS have a relatively high de gree of complexity, their acquisition was not fully autom atic. However, partial au tom ation of the knowledge acquisition process greatly reduced the amount of time necessary to construct the knowledge bases for new domains. 
1.4 Learning Inform ation E xtraction R ules T he focus of this dissertation is to develop a method to autom atically learn rules for an information extraction system such that performance is comparable with 9 that of manually-constructed systems. We emphasize that we arc interested in fully autom atic methods that do not require any human intervention. We do, however, allow a certain amount of effort to be spent on setting up the learning task for a specific domain. Before we were able to address this task of rule learning, we needed a rule- based information extraction system that could act as the performance element. Because the rule set for this system had to be easily modifiable, we could not use the University of Southern California MUC system which had hard-coded rules. Instead, we built a new system called RUBIES. RUBIES, the performance element, was designed concurrently with Merlin, the learning element. Both of these elements depend a great deal on the structure of the information extraction rules. There were some rule representations that worked well for performance, but made learning difficult. Conversely, there were some represen tations that were easier to learn, but made the design of the performance clement extremely complex. The final rule structure allowed for a simple performance cle ment while at the same time facilitating learning. RUBIES uses complex rules that are triggered by patterns similar to regular expressions. These patterns operate only on syntactic information which is pro vided by a preprocessor, which performs a simple dictionary lookup for each word in the input text message. At the highest level, these RUBIES patterns look similar to the cosptc field of the GLS used by Diderot. A major difference is that they arc not anchored to any specific word, thus eliminating the necessity of identifying these anchor tokens. Another difference is that an extended disjunctive normal form representation is used for pattern items which correspond to the arguments of the cospecs, A set of rule classes is defined for each domain. Each rule class is associated with the construction or modification of an information object in the working memory of RUBIES. These rule classes are developed manually by analyzing the tem plate structure for the domain - there is no need to analyze the training corpus. A specific strategy is employed to determine the order in which each class of rules is applied. This strategy is also domain specific, but follows directly from the structure of the information templates. 10 We have built a system called Merlin that automatically learns all the rules for RUBIES from examples of system input and output. RUBIES input is a natural language text message and the output is a set of filled information templates in an cntity-relationship format. The first step is to decompose the system input/output purs into sets of train* ing examples for each rule class. Merlin operates on each of these training sets independently and generates a rule set for each class. A key to being able to learn these complex rules was to use the same language for rules as was used for the training examples. This was done by converting the training examples into very specific RUBIES rules. Merlin generalizes these examples into a set of rules that have much more general coverage. Experiments were conducted using the MUC-5 English Joint Venture (EJV) do main. A set of training examples was selected from the corpus and used by Merlin to learn a rule set. 
This rule set was provided to RUBIES and run on the MUC-5 final test messages. The quality of the information extracted by Mcrlin/RUBIES was found to be comparable with that of the manually-constructed USC system. 1.5 Organization of this D issertation We begin by describing the RUBIES information extraction system in Chapter 3. The structure of the rules is described and the subsumption operators used for pat tern matching arc defined. Chapter 4 describes how training sets are constructed from the MUC-5 corpus using a rule decomposition algorithm. In Chapter 5, we present the Merlin learning algorithm and show how it can be used to learn infor mation extraction rules. Experiments were conducted using the final test set of the MUC-5 EJV domain and the results are presented in Chapter 6. In Chapter 7 we present the conclusions of this dissertation and propose future extensions. 11 C hapter 2 R ELATED W O R K In this chapter, we review some previously developed inductive teaming algorithms and discuss their limitations. These learners can be viewed as belonging to one of two classes: learners that learn object classifiers and learners th at learn relations. Historically, most of the experimental work in induction has been on learning object classifiers. More recently, however, there has been considerable interest in relational learning. We will discuss two classical algorithms th at learn object classifiers: ID3 and AQ. Many existing learners can trace their roots to one or both of these algorithms. We will then discuss two relational learners: FOIL and FOCL. FOIL borrows ideas from both ID3 and AQ. FOCL extends FOIL in several ways. 2.1 ID 3 In the late 1970s, J.It. Quinlan developed a supervised learning algorithm called ID3 (Interactive Dichotomizer 3) [59]. ID3 employs a m ethod of concept formation whereby a set of objects is partitioned into subsets such that objects in each subset have some attribute in common. The splitting process is iteratively performed on the subsets until objects of only one class remain. This model of concept formation has its roots in a psychological model develped in the 1960s [28]. ID3 formulates concepts for objectB th at are described by a set of attributes. Each object has one value for each of these attributes. There is also one special attribute called the class attribute that specifies the class of the object, 12 The concept formulated by 1D3 is represented by a decision tree. Each non-leaf node of the tree represents a test on one of the attributes. There is an edge from each attribute test node for each possible outcome of the test. These nodes correspond to the splitting of a set of objects into subsets. The set of objects that exist at each node can be viewed as being broken into subsets that arc passed through the appropriate edge to the next node in the tree. The tree terminates at leaf nodes such that all objects at the leaf nodes have the same class attribute value. An example decision tree is shown in Figure 2.1. Attribute test nodes arc depicted by ellipses and leaf nodes as rectangles. has_wings ? yes no no yes carries_passengcrs yes no car (rain truck plane Figure 2.1: Example Decision Tree An important aspect of 1D3 is how it selects attributes for the nodes in the tree. Given a set of objects associated with a node, ID3 chooses the attribute that 13 maximizes information gain to split these objects into subsets. 
One problem with this attribute selection method is that nodes at the bottom of the tree have very few training examples associated with them. Because there are so few objects, there is little evidence from which to select an attribute on which to split. It has been shown that rather than split at these small nodes, they can just be declared leaf nodes. If this is done properly, it can improve the predictive accuracy of the tree. This process is referred to as tree pruning and there arc many decision tree algorithms that utilize it. One notable example is the CART algorithm [1] which actually emphasizes tree pruning over tree generation. A survey of tree pruning methods is given in [22]. Decision trees can be used to classify objects described by a set of attribute* values, as long as the object has only one value for each attribute. If any of the attributes arc ambiguous (e.g. (color — red V green)), the decision tree will not work. Consider how an object is classified with a decision tree. Wc start at the root node, perform the attribute test, and move along the edge that represents the result of the test. We do this until we come to a leaf node, which contains the class assignment for the object. The problem is that the object will take multiple paths down the tree to leaf nodes which may have different class assignments. A similar problem exists for negated attribute values (e.g. ->(color = red)). In this case, every valid path through the tree would be followed. If there was a node that split on the color attribute, every edge except red would be followed. In addition to not being able to classify objects with negated attribute values and disjunctions of attribute values, ID3 also cannot learn from objects described with this representation. This is because the information-gain heuristic assumes that each object has a single value for each attribute. Another problem is when objects do not have all the same attributes. When classifying objects like this, a test may exist for an attribute the object does not have. The object would have to take every edge from this node, which may lead to multiple leaf nodes with different classes. One way to deal with this problem is to add an edge to each node for objects that do not have this attribute. In effect, this adds a value of not applicable to each attribute an object does not have. 14 2.2 AQ Induction algorithms in the AQ-Family [42] learn from the same type-of objects as ID3 and other decision tree algorithms. Rather than decision trees, however, AQ-Family algorithms learn concepts in the form of logical expressions in one of the VL-type (Variable Logic) languages [41]. This language is similar to disjunctive normal form (DNF), except that the propositions are attribute value tests th at may include internal disjunctions (e.g. (color = red V green)). Each class is described by a disjunction of complexes. Each complex consists of a conjunction of selectors, A selector is an attribute value test (e.g. (color = green), {color = red V green), {tem p > 98)). One member of the AQ-Family, AQR [6], is shown in Table 2.1. Although we single out AQR, most of the following discussion pertains to all AQ- Family algorithms. AQR is a covering-type algorithm where m ultiple beam searches arc performed to find consistent complexes until all training examples arc covered. This beam search is performed by the STAR procedure. Complexes are consistent if they cover some positive examples and no negative examples. 
2.2 AQ

Induction algorithms in the AQ-Family [42] learn from the same type of objects as ID3 and other decision tree algorithms. Rather than decision trees, however, AQ-Family algorithms learn concepts in the form of logical expressions in one of the VL-type (Variable Logic) languages [41]. This language is similar to disjunctive normal form (DNF), except that the propositions are attribute-value tests that may include internal disjunctions (e.g. (color = red ∨ green)). Each class is described by a disjunction of complexes. Each complex consists of a conjunction of selectors. A selector is an attribute-value test (e.g. (color = green), (color = red ∨ green), (temp > 98)).

One member of the AQ-Family, AQR [6], is shown in Table 2.1. Although we single out AQR, most of the following discussion pertains to all AQ-Family algorithms. AQR is a covering-type algorithm where multiple beam searches are performed to find consistent complexes until all training examples are covered. This beam search is performed by the STAR procedure. Complexes are consistent if they cover some positive examples and no negative examples.

Whether or not a complex covers an example object is determined by evaluating the selectors for each attribute. If all selectors evaluate true, the complex covers the example. The selector evaluation method assumes that each object has one value for each attribute. Objects with negated attributes present a problem. For example, if we have a negated attribute ¬(color = red) and a selector (color = green), it would be incorrect to say that this selector evaluates true for this object. Likewise, objects that have disjunctions of attribute values present a problem. The selector (color = red) would evaluate neither true nor false for the disjunctive object attribute (color = red ∨ green). Like ID3, AQ also has a problem with objects that do not have all the same attributes. For example, a selector on an attribute would be undefined if the object being classified did not have that attribute.

Let POS be a set of positive examples.
Let NEG be a set of negative examples.

Procedure AQR(POS, NEG)
1   Let COVER be the empty cover.
2   While COVER does not cover all of POS,
3     Select a SEED from POS that is not covered by COVER.
4     Let NEWSTAR be STAR(SEED, NEG).
5     Let BEST be the best complex in NEWSTAR.
6     Append BEST as another disjunct of COVER.
7   Return COVER.

Procedure STAR(SEED, NEG)
1   Let STAR be an empty set of complexes.
2   While any complex in STAR covers some examples in NEG,
3     Select an example EXNEG from NEG covered by a
4       complex in STAR.
5     Specialize complexes in STAR to exclude EXNEG by:
6       Let EXTENSION be all selectors that cover SEED
7         but not EXNEG.
8       Let STAR be the set {x ∧ y | x ∈ STAR,
9         y ∈ EXTENSION}.
10      Remove all complexes in STAR subsumed by other
11        complexes.
12    Repeat until |STAR| ≤ maxstar
13      (a user-defined constant).
14      Remove the worst complex in STAR.
15  Return STAR.

Table 2.1: The AQR Algorithm
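A compressed rendering of the covering idea in Table 2.1 is given below. It keeps only the outer loop (pick an uncovered seed, grow a consistent complex, add it as a disjunct) and replaces the STAR beam search with a much cruder specialization step, so it is a sketch of the control structure rather than a faithful AQR implementation; the data and helper names are illustrative only.

    # Sketch of a covering-type learner in the spirit of AQR (Table 2.1).
    # A complex is a dict of attribute-value tests; a cover is a list (disjunction) of complexes.

    def covers(complex_, obj):
        return all(obj.get(attr) == value for attr, value in complex_.items())

    def grow_complex(seed, negatives):
        """Crude stand-in for STAR: add the seed's own selectors one at a time
        until no negative example is covered."""
        complex_ = {}
        for attr, value in seed.items():
            if any(covers(complex_, n) for n in negatives):
                complex_[attr] = value
        return complex_

    def learn_cover(positives, negatives):
        cover = []
        uncovered = list(positives)
        while uncovered:
            seed = uncovered[0]
            cover.append(grow_complex(seed, negatives))
            uncovered = [p for p in uncovered
                         if not any(covers(c, p) for c in cover)]
        return cover

    pos = [{"color": "red", "size": "small"}, {"color": "red", "size": "large"}]
    neg = [{"color": "green", "size": "small"}]
    print(learn_cover(pos, neg))   # e.g. [{'color': 'red'}]

As in AQR, each learned complex is guaranteed to cover its seed and exclude all negatives, and the disjunction of complexes grows until every positive example is covered.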
The outer loop of FOIL is basically the same as the outer loop of AQR. At each iteration, a clause is found and the positive examples that it covers are removed from the training set. The clauses are constructed by successively adding predicates to the body. FOIL uses an information-based heuristic like ID3 to select the next literal to append to the body of the clause that is being constructed. Literals may be predicates, negated predicates, variable equalities (e.g. X = Y), or inequalities (e.g. X ≠ Y). Predicate literals must have previously defined extensional definitions.

Let PRED be the predicate to be learned.
Let POS be a set of positive examples.
Let NEG be a set of negative examples.

Procedure FOIL(PRED, POS, NEG)
1  While POS is not empty,
2    Let LOCALNEG be a copy of NEG.
3    Let BODY be an empty clause body.
4    LEARN_BODY(PRED, POS, LOCALNEG, BODY).
5    Add PRED ← BODY to the relation that is being learned.
6    Remove all examples from POS that satisfy BODY.

Procedure LEARN_BODY(PRED, POS, NEG)
1  While NEG is not empty,
2    Choose a literal L.
3    Conjoin L to BODY.
4    Remove all examples from NEG that do not satisfy L.

Table 2.2: The FOIL Algorithm

2.4 FOCL

FOCL is an extension to FOIL and also learns relations. Its major contribution is the incorporation of an explanation-based learning component so that more extensive background knowledge can be used. Recall that the background knowledge for FOIL needed to be defined extensionally as a set of tuples of argument values. FOCL allows for background knowledge to be specified as intensional clauses as well. For example, the following clause from a chess domain could be used by FOCL:

SameLoc(R1, F1, R2, F2) ← EqualRank(R1, R2), EqualFile(F1, F2)

The meaning of this clause is that chess positions with the same rank and file are at the same board location. There is no need to specify sets of argument value tuples to define SameLoc(R1, F1, R2, F2). However, the predicates EqualRank(R1, R2) and EqualFile(F1, F2) need to be defined this way.

A high-level description of the FOCL algorithm is given in Table 2.3. In addition to allowing intensionally-defined background knowledge, FOCL also permits use of an initial rule. FOCL checks clauses in the body of this initial rule to see if they would be good candidates to use for constructing the relation. Intensionally-defined predicates are operationalized in the LEARN_BODY procedure on Lines 2 and 11. A clause body is considered to be operational if it consists of extensionally-defined predicates.

Let PRED be the predicate to be learned.
Let IR be the initial rule.
Let POS be a set of positive examples.
Let NEG be a set of negative examples.

Procedure FOCL(PRED, POS, NEG, IR)
1  While POS is not empty,
2    Let LOCALNEG be a copy of NEG.
3    Let BODY be an empty clause body.
4    LEARN_BODY(POS, LOCALNEG, BODY, IR).
5    Add PRED ← BODY to the relation that is being learned.
6    Remove all examples from POS that satisfy BODY.

Procedure LEARN_BODY(POS, NEG, BODY, IR)
1  If a CLAUSEBODY of IR has positive gain,
2    Operationalize the best CLAUSEBODY and delete
3      superfluous literals from it.
4    Conjoin result to BODY.
5    Remove all examples from NEG that do not satisfy any
6      of the new literals.
7    Remove all examples from POS that satisfy BODY.
8    EXTEND_BODY(POS, NEG, BODY).
9  Else,
10   Choose a literal L.
11   Operationalize L and delete superfluous literals.
12   Conjoin result to BODY.
13   Remove all examples from NEG that do not satisfy any
14     of the new literals in BODY.
15   LEARN_BODY(PRED, POS, NEG, BODY, IR).

Procedure EXTEND_BODY(POS, NEG, BODY)
1  While NEG is not empty,
2    Choose a literal L.
3    Operationalize L and delete superfluous literals.
4    Conjoin result to BODY.
5    Remove all examples from NEG that do not satisfy any
6      of the new literals in BODY.

Table 2.3: The FOCL Algorithm

Chapter 3
Rule-Based Information Extraction

3.1 Introduction

Information extraction involves locating, in natural language text, information about prespecified types of events or relationships, and storing this information in a database. In a rule-based approach to information extraction, a knowledge base of rules is created for the domain of interest. The rules are designed such that a chain of them can be executed (fired) so as to extract the desired information as accurately as possible. An example text is shown in Table 3.1.

Diamond Star Motors is a joint venture between Chrysler and Mitsubishi.

Table 3.1: Example Text Message

If the information to be extracted from this text is the identities of companies who are involved in joint ventures, a rule like the one shown in Table 3.2 could be used in the knowledge base of the information extraction system.

If {*, lex(A)=venture, *, concept(B)=company, *, concept(C)=company, *}
Then T1 = JointVenture((E1,E2))
and E1 = Entity(lex(B))
and E2 = Entity(lex(C))

Table 3.2: Example Information Extraction Rule

The conditional part of this rule is a pattern that is matched against the text. It contains a list of items that are either Kleene stars (*) or logical expressions that are evaluated over the attributes of a word. Kleene stars can match zero or more words, analogous to a Unix regular expression [31]. The second item in the pattern is a test on the lexical attribute of a word to see if it is equal to venture. But where do the attributes for words come from? Each word will have zero or more entries in an electronic dictionary. Each of these entries is a list of word attributes and their values. Words that have no dictionary entries have only one attribute, the lexical attribute. Prior to matching rules to text, the text is run through a preprocessor that looks up the dictionary entries for each word. These dictionary entries form an exclusive disjunction of word interpretations. From a natural language processing viewpoint, it would be ideal if there were just one dictionary entry for each word. Having multiple entries causes problems for both learning and applying information extraction rules. The problem of multiple dictionary entries is referred to as the ambiguity problem.

The action part of the example rule contains a list of object constructors. T1, E1, and E2 are called object identifiers because they refer to specific objects. T1 refers to a joint venture object and E1 and E2 refer to entity objects. The arguments to the object constructors are either object attribute values or relationships with other objects. The joint venture object constructor for T1 has one argument that defines the relationships between T1 and the two entity objects, E1 and E2. The reason for having two sets of parentheses around E1 and E2 is to show that this is a single argument that is a list of object identifiers. The single argument for the entity object constructor becomes the value of the object's name attribute. After applying this rule to the example text, the database contents would be as is shown in Table 3.3.
A database of this sort could be used to support queries about what types of joint ventures a particular company is involved in. An example query might be: "List all the joint ventures that Chrysler is involved in."

ObjectID = T1
Type = Joint Venture
Number of Entities = 2
EntityID[1] = E1
EntityID[2] = E2

ObjectID = E1
Type = Entity
Name = Chrysler

ObjectID = E2
Type = Entity
Name = Mitsubishi

Table 3.3: Database Contents after Rule has Fired

In the following sections, a Rule-Based Information Extraction System called RUBIES is presented. RUBIES uses rules much like the one described in Table 3.2.

3.2 Rule Structure

In this section, the structure of the information extraction rules used by RUBIES is described. As was described in the last section, a rule consists of two parts: a condition and an action. The example rule that was discussed was simple, in that the condition was a pattern that was matched only against the input text message. These types of simple rules work well in isolation, but do not have the flexibility to interact with other rules in the knowledge base. There is no way for a chain of these types of simple rules to fire in such a way as to create and modify the objects in the database. For example, if the rule in Table 3.2 had been fired to create a joint venture between Chrysler and Mitsubishi, how could another rule fire that would modify this joint venture object so that its name was Diamond-Star Motors? One way to do this is to add an object predicate to the condition part of the rule that associates objects in the action part with existing objects. The high-level structure of such a rule can be described as follows:

• The condition part of a rule consists of one rule pattern and zero or more object predicates. The rule pattern is matched to the input text and the object predicates are matched to existing objects.

• The action part of a rule consists of one or more object modifiers. If the rule is fired, its object modifiers will create new data objects or modify existing data objects.

The rule behavior is described in more detail in the following sections.
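This two-part structure can be pictured as a small data structure; the field names below are illustrative rather than RUBIES' own, and the example simply restates the rule of Table 3.2 in this form.

    # A RUBIES-style rule: the condition holds one pattern plus zero or
    # more object predicates; the action holds one or more object modifiers.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ObjectTest:
        # shared shape for predicates and modifiers: an identifier, a class,
        # and an argument list (e.g. lists of object IDs, nominal values)
        object_id: str
        object_class: str
        args: List[object] = field(default_factory=list)

    @dataclass
    class Rule:
        pattern: List[object]                                        # covers and Kleene stars
        predicates: List[ObjectTest] = field(default_factory=list)   # match existing objects
        modifiers: List[ObjectTest] = field(default_factory=list)    # create/modify objects

    # The example rule of Table 3.2, restated (a sketch only):
    rule = Rule(
        pattern=["*", {"lex": {"venture"}}, "*",
                 {"concept": {"company"}}, "*", {"concept": {"company"}}, "*"],
        modifiers=[ObjectTest("T1", "JointVenture", [["E1", "E2"]]),
                   ObjectTest("E1", "Entity", ["lex(B)"]),
                   ObjectTest("E2", "Entity", ["lex(C)"])])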
A cover is an exclusive disjunction of one or more word interpretations. An example cover, C1, is shown below, where ⊕ is the exclusive-or operator.

C1 = i1 ⊕ i2 ⊕ i3

An interpretation is a conjunction of attribute-value equality tests called selectors. An example interpretation is shown below, where cat and num are attributes and ∧ is logical AND.

i1 = [(cat = pronoun) ∧ (num = singular)]

This interpretation would match any singular pronoun. The rule patterns that were described in this section are similar to the cospecification field of the Generative Lexical Structure (GLS) of the Diderot information extraction system [16]. Covers and selectors are nearly identical to covers and selectors in the VL1 representational formalism [42]. VL1 has a construct called a complex, which is the same as an interpretation.

3.2.2 Object Predicates

An object predicate is a test on an object's attributes. The predicates look the same as the object constructors that were previously described, but their effect is much different. A set of three object predicates is shown below.

T2 = JointVenture((E1,E2))
and E1 = Entity()
and E2 = Entity()

These predicates would match any existing joint venture object that had a relationship with two entities. They would not match a joint venture between 3 or more entities. Because there are no arguments for the entity predicates, E1 and E2 can be any existing entity objects. The notation used here is non-standard, in that the object ID (e.g. T2) is moved outside of the predicate argument list and equated to the name of the predicate. Traditional notation for the T2 predicate would be as follows:

JointVenture(T2,(E1,E2))

Moving the object ID outside the argument list distinguishes the object ID from the other arguments. The predicates can then be viewed as functions that return the ID of an object whose arguments match those in the predicate argument list.

The general form of an object predicate is shown below, where ObjectID is an object identifier and ObjectClass is an object class (type).

ObjectID = ObjectClass(PredArgList)

This predicate may be matched with an existing object of the same ObjectClass which has a matching PredArgList. The PredArgList is described below.

PredArgList: PredArg PredArgList

The PredArg is a predicate argument and is described below, where ObjectIDList is a list of object IDs.

PredArg: empty (matches only if no argument instantiated)
         ObjectIDList

Note that only one type of argument is used for an object predicate: a list of object IDs. This could easily be extended to other types of arguments, but we have only observed the need for lists of object IDs. Thus, the object predicate could be viewed as a pattern that matches an existing object according to its relationships with other objects.

3.2.3 Object Modifiers

Object modifiers look much like object predicates, but have a very different effect because of their position in the action part of the rule. Object modifiers can create new objects and/or modify existing objects. In contrast, object predicates are patterns that can only match existing objects. An example object modifier is shown below. This modifier can have no effect on object attributes associated with the first argument, since the first argument is missing.

T2 = JointVenture(,existing)

The effect of an object modifier depends on the object predicates in the rule. If this modifier were found in a rule with no predicates, it would create a joint venture whose second argument would be existing. This argument gives a value to
This argument gives a value to 26 the status attribute of the joint venture* Of course, there must be some previously specified definition of how the object arguments affect the object attributes for a given domain. Consider now the effect when this modifier is placed in a rule with the object predicates shown below. T2 = JointVenture( E l, E2) and El — Entity() and E2 — Entity() This object predicate has the same object ID as the modifier. The modifier would then have the effect of modifying an existing object that was matched by the predicate. The general form of an object modifier is shown below. ObjectID = ObjectClass(ModArgList) The ObjectID and ObjectClass are the same as was described in the previous section on object predicates. The ModArgList, however, is quite different from the PredArgList and is described below. ModArgList: ModArg ModArgList ModArg: {} empty ObjcctlDList PalllemAttVal NomArgVal An ObjectlDList is a list of one or more Object IDs. A PatltemAtt Val is a reference to an attribute value of one of the pattern cover items in the condition part of the rule. The cover item must have only one value for that attribute in all of its interpretations. Typically, only the lexical attribute satisfies this criteria. A NomArgVal is one of the possible nominal values for that specific argument. It can now be seen how chains of rules can fire together to create and modify data objects. The rule structure that has been defined is general, and can be applied to any domain whose objects can be modeled by their attributes and relationships with other objects. Specifics, such as the object types, object arguments and attributes, need to be defined for each domain. In the next section, objects that have been defined for an example real world domain are described. 27 3.2.4 T he Joint Venture Dom ain In this section, the English Joint Venture (EJV) domain of the 5th Message Un* dcrstanding Conference (MUC-5) is described and object types for this domain arc defined using the structure th at was described in the previous sections. The MUC conferences are sponsored by the Advanced Research Projects Agency (ARPA) and arc competitions between information extraction systems on well-defined domains. The conference organizers provide participants with a large Bet of example texts and the associated output templates. These templates contain the objects, object relationships, and object attributes that have been described previously. This set of examples is referred to as the training set. The participant’s systems are tested on unseen texts and their information extraction accuracy is compared using several performance measures. The focus in this section will be on a subset of one of the MUC-5 domains which is concerned with business joint ventures. In this domain there arc three argument types as follows: O b je c t P o in te r L ist (O P L ) A list of pointers to objects. N o m in al (N ) A single value from a finite set of values. P a tte r n Ite m A ttrib u te (P IA ) A single value that is determined by the value of an attribute of a pattern item in the condition part of the rule. Each object has one or more arguments, where each argument is of one of the above types. The two object classes that will be used are shown with their arguments in Table 3.4. These arguments are a subset of the arguments in the formal MUC-5 joint venture domain. 
Object # Ars Arguments tieup entity 3 1 entities (OPL), status (N), jv-company (OPL) name (PIA) Table 3.4: Objects for Joint Venture Domain The status argument has 5 possible values: existing, pre-format, dissolved, former, and unspecified. The entities and jv-company arguments of the tieup object are lists of pointers to entity objects. 28 Upper and lower bounds are placed on the size of the object pointer lists based on what has been observed in the training set. The entities argument will have between 2 and 4 pointers and the jv-company argument will have 0 or 1 pointers. 3*3 Rule M atching In this section, the evaluation of information extraction rule patterns is discussed. Recall that there are two components to the rule condition: the pattern and the object predicate. Each of these is matched separately and both must match in order for the rule to fire. The object predicate is matched against existing objects that have previously been created. Each argument of the predicate * ib matched against the associated object attributes. Rule patterns are matched against the prcproccssed natural language text. The preprocessing operation replaces each word with an exclusive disjunction of its dictionary entries. Each dictionary entry is a list of word attributes and their values. Evaluation of the rule pattern is more complicated than the object predicate and the remainder of this section is dedicated to its description. 3.3.1 Mappings and Valid Mappings A mapping of a rule pattern to the input text is an assignment of each pattern item to zero or more adjacent words. There may be several mappings of a pattern to a text message. Consider the pattern PI and the example sentence SI shown below. PI s { C l, *, C2, *} SI = I like school Pattern PI consists of a cover C l, a Kleene star, a cover C2, and another Kleene star. There are two mappings of PI to SI. These mappings, M l and M2, are shown below. •Ml = { C l/I, */, C2/like, ^/school) M2 = { C l/I, */like, C2/school, */} In mapping M l, the word I is assigned to cover item C l, no words are assigned to the first Kleene star, the word lik e is assigned to cover item C2, and the word school 29 is assigned to the second Kleene star. Notice that the knowledge representation for words provided by the preprocessing operation is identical to the knowledge representation th at was previously described for cover items. Each dictionary entry becomes an interpretation. The reason the same representation was chosen was to make it easier to evaluate whether a cover matches a word. A cover matches, or subsumes, a word if the cover expression is the same or more general than the expression th at was formed from the dictionary entries for the word. Using this definition of subsumption, a valid mapping is a mapping where all cover items subsume their assigned words. Thbre may be several valid mappings of a rule pattern to the input text. Continuing with the above example, let us assume the following for C l and C2: C l = (category = pronoun) C2 = (category — verb) In this case, mapping M l would be a valid mapping and mapping M2 would not. In the next section, the method for testing whether or not a cover subsumes a word iB discussed. 3.3.2 Subsum ption Subsumption testing was previously utilized for comparing a cover expression with an exclusive disjunction of dictionary entries. It was noted that the knowledge representation for cover items is the same as it is for words. 
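As a preview of that test, the following minimal sketch enumerates the possible mappings of a rule pattern onto a word sequence; for pattern P1 and sentence S1 above it produces exactly the two mappings M1 and M2. The representation of pattern items is simplified, and the validity check is deferred to the subsumption machinery developed next.

    def mappings(pattern, n_words, start=0):
        # yield one assignment per pattern item as a (begin, end) span of
        # word positions; a cover takes exactly one word, a Kleene star
        # takes zero or more adjacent words
        if not pattern:
            if start == n_words:
                yield []
            return
        head, rest = pattern[0], pattern[1:]
        if head == "*":
            for end in range(start, n_words + 1):
                for tail in mappings(rest, n_words, end):
                    yield [(start, end)] + tail
        else:                                   # a cover item
            if start < n_words:
                for tail in mappings(rest, n_words, start + 1):
                    yield [(start, start + 1)] + tail

    # Pattern P1 = {C1, *, C2, *} over the three-word sentence "I like school":
    for m in mappings(["C1", "*", "C2", "*"], 3):
        print(m)
    # [(0, 1), (1, 1), (1, 2), (2, 3)]   <- M1: star empty, C2 -> "like"
    # [(0, 1), (1, 2), (2, 3), (3, 3)]   <- M2: star -> "like", C2 -> "school"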
In this section, we describe how subsumption can be tested for any two logical expressions that are described using the language defined for cover items and rule patterns.

A subsumption operator (≥) is a binary Boolean operator that takes as arguments two Boolean functions of the same type, where the type is a selector, interpretation, or cover. The subsumption operator returns true if the Boolean function on the left is more general than, or has the same generality as, the function on the right. If A ≥ B and A and B are Boolean functions of the same type, A subsumes B and B is subsumed by A. More formally, A ≥ B if there does not exist an object ψ such that A(ψ) = 0 and B(ψ) = 1.

Lemma 1 The Boolean function (A ≥ B) is equivalent to (¬B ≥ ¬A), is equivalent to (A·B = B), is equivalent to (A + B = A), and is equivalent to (¬A·B = 0).

Proof of Lemma 1: By the truth table in Table 3.5.

A  B | A ≥ B | ¬B ≥ ¬A | A·B = B | A+B = A | ¬A·B = 0
0  0 |   1   |    1    |    1    |    1    |    1
0  1 |   0   |    0    |    0    |    0    |    0
1  0 |   1   |    1    |    1    |    1    |    1
1  1 |   1   |    1    |    1    |    1    |    1

Table 3.5: Truth Table for Subsumption Definitions

In the following sections, these alternate definitions of subsumption will be used to prove different propositions about subsumption of selectors, interpretations, covers, and patterns.

3.3.2.1 Selector Subsumption

In this section, a method for determining how to test whether or not one selector term subsumes another will be described. A selector is a Boolean function evaluated over one attribute of an object. A selector set is the set of attribute values in a selector. For example, the selector set for (category = noun V verb) is {noun, verb}. If two selectors are tests on the same attribute, then they are said to be aligned. Selectors on different attributes cannot subsume one another.

Lemma 2 For any two selectors, Sx and Sy, if Sx is an empty selector, then Sx ≥ Sy. In other words, an empty selector subsumes every other selector.

Proof of Lemma 2: Sx·Sy = 1·Sy = Sy. If Sx·Sy = Sy then Sx ≥ Sy (Lemma 1).

Lemma 3 For any two selectors, Sx and Sy, if Sy is an empty selector, then Sx ≥ ¬Sy. In other words, a negated empty selector is subsumed by every other selector.

Proof of Lemma 3: Sx·¬Sy = Sx·0 = 0. If Sx·¬Sy = 0 then Sx ≥ ¬Sy (Lemma 1).

Lemma 4 For any two aligned selectors, Sx and Sy, Sx ≥ Sy if and only if the selector set of Sy is a subset of the selector set of Sx.

Proof of Lemma 4: If the selector set of Sy is a subset of the selector set of Sx, then Sx + Sy = Sx and thus Sx ≥ Sy (Lemma 1).

Example for Lemma 4: Consider the following selectors, S1 and S2:

S1 = (category = noun V verb)
S2 = (category = noun)

Since the selector set of S2 is a subset of the selector set of S1, but not vice versa, S1 subsumes S2 and S2 does not subsume S1.

Lemma 5 For any two aligned selectors, Sx and Sy, Sx ≥ ¬Sy if and only if Sx or Sy is empty.

Proof of Lemma 5: If Sx ≥ ¬Sy, then Sx·¬Sy = ¬Sy (Lemma 1). Sx·¬Sy cannot evaluate to ¬Sy unless Sx or Sy is empty.

Lemma 6 For any two aligned selectors, Sx and Sy, ¬Sx ≥ Sy if and only if the selector sets of Sx and Sy are disjoint.

Proof of Lemma 6: If the selector sets of Sx and Sy are disjoint, then ¬Sx·Sy = Sy, because the values that satisfy ¬Sx are a superset of the values that satisfy Sy. If ¬Sx·Sy = Sy then ¬Sx ≥ Sy (Lemma 1).

3.3.2.2 Interpretation Subsumption

In this section, a method for determining how to test whether or not one interpretation subsumes another will be developed.

Lemma 7 For any two interpretations, Ix and Iy, Ix ≥ Iy if and only if all of the selectors of Ix subsume the aligned selectors of Iy.
Proof of Lemma 7: If the selectors Txi of Ix subsume their aligned selectors Tyi of Iy, then Txi·Tyi = Tyi for all attributes i, and thus Ix·Iy = Iy. If Ix·Iy = Iy, then Ix ≥ Iy (Lemma 1).

Lemma 8 For any two interpretations, Ix and Iy, ¬Ix ≥ Iy if and only if, for some pair of aligned selector terms Txi and Tyi, Txi·Tyi = 0.

Proof of Lemma 8: Ix·Iy = ∏(Txi·Tyi), taken over the n attributes in the domain. If any pair of aligned selectors has a product of zero, the entire expression will evaluate to zero. If Ix·Iy = 0, then ¬Ix ≥ Iy (Lemma 1).

Lemma 9 For any two interpretations, Ix and Iy, Ix ≥ ¬Iy if and only if Ix is empty or Iy is empty.

Proof of Lemma 9: An empty interpretation always subsumes every other interpretation, and a negated empty interpretation is subsumed by every other interpretation. If neither of the interpretations is empty, none of the expressions for subsumption can be proved.

3.3.2.3 Cover Subsumption

In this section, a method for determining how to test whether or not one cover subsumes another will be developed.

Lemma 10 For any two covers, Rx and Ry, Rx ≥ Ry if and only if each of the interpretations in Ry is subsumed by at least one interpretation in Rx.

Proof of Lemma 10: Rx + Ry = Rx if each of the interpretations in Ry is subsumed by at least one interpretation in Rx. This is because if an interpretation Ti subsumes another interpretation Tj, then Ti + Tj = Ti. If Rx + Ry = Rx, then Rx ≥ Ry (Lemma 1).

Lemma 11 For any two covers, Rx and Ry, Rx ≥ ¬Ry if and only if Tx ≥ ¬Ty for any Tx in Rx and any Ty in Ry.

Proof of Lemma 11: ¬Rx·¬Ry = 0 if Tx ≥ ¬Ty for any Tx in Rx and any Ty in Ry. If ¬Rx·¬Ry = 0, then Rx ≥ ¬Ry (Lemma 1).

Lemma 12 For any two covers, Rx and Ry, ¬Rx ≥ Ry if and only if, for all interpretations Txi and Tyj, ¬Txi ≥ Tyj.

Proof of Lemma 12: Rx·Ry = Σi Σj (Txi·Tyj) = 0 when, for all Txi and all Tyj, ¬Txi ≥ Tyj. This is because if ¬Txi ≥ Tyj, then Txi·Tyj = 0. If Rx·Ry = 0, then ¬Rx ≥ Ry (Lemma 1).

3.3.2.4 Pattern Subsumption

In this section, the notion of subsumption is extended to rule patterns and the concept of misalignment is introduced. The concept of subsumption is the same for rule patterns as it is for selectors, interpretations, and covers. One pattern subsumes another if it is as general or more general than the other pattern. Previously, it was said that a mapping of a rule pattern to the input text is an assignment of each pattern item to zero or more adjacent words. Also, recall that the input text is represented using the same representation as the rule pattern, except it does not have any Kleene stars as pattern items, only covers. Therefore, the definition of a mapping can be generalized as follows:

Definition 1 A mapping from pattern Px to pattern Py is an assignment of each pattern item in Px to zero or more pattern items in pattern Py. A Kleene star in Px can be assigned to zero or more pattern items in Py of any type (Kleene star or cover). A cover in Px can be assigned only to a cover in Py.

Likewise, the definition of a valid mapping can be generalized as follows:

Definition 2 A valid mapping from pattern Px to pattern Py is a mapping such that any cover in Px assigned to a cover in Py also subsumes the cover in Py.

Subsumption between two patterns can thus be tested as follows:

Lemma 13 A pattern Px subsumes a pattern Py if and only if there is at least one valid mapping from Px to Py.
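Lemma 13 suggests a direct procedure: recursively try every assignment of pattern items and accept as soon as one valid mapping is found. The sketch below is an illustrative implementation of that idea using a simplified set-valued representation of selectors (attribute → set of allowed values per interpretation); it is not the matcher actually used by RUBIES.

    def interp_subsumes(ix, iy):
        # every selector of ix must subsume the aligned selector of iy (Lemma 7)
        return all(a in iy and iy[a] <= vals for a, vals in ix.items())

    def cover_subsumes(rx, ry):
        # each interpretation of ry must be subsumed by some interpretation of rx (Lemma 10)
        return all(any(interp_subsumes(ix, iy) for ix in rx) for iy in ry)

    def pattern_subsumes(px, py):
        # Lemma 13: px subsumes py iff at least one valid mapping exists
        def match(i, j):
            if i == len(px):
                return j == len(py)
            item = px[i]
            if item == "*":
                # a star may absorb zero or more items of py
                return any(match(i + 1, k) for k in range(j, len(py) + 1))
            # a cover consumes exactly one cover of py, and must subsume it
            return (j < len(py) and py[j] != "*" and cover_subsumes(item, py[j])
                    and match(i + 1, j + 1))
        return match(0, 0)

    # Hypothetical example: pattern {C1, *, C2, *} against "I like school".
    C1 = [{"category": {"pronoun"}}]
    C2 = [{"category": {"verb"}}]
    sentence = [[{"category": {"pronoun"}, "lex": {"I"}}],
                [{"category": {"verb"}, "lex": {"like"}}],
                [{"category": {"noun"}, "lex": {"school"}}]]
    print(pattern_subsumes([C1, "*", C2, "*"], sentence))   # True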
3.4 R U BIES In this section, the Rule-Based Information Extraction System (RUBIES) and its components are described. Several components of RUBIES, namely the preprocessor and dictionary, have been taken from the USC Text Understanding System [47]. A schematic of RUBIES is shown in Figure 3.1. Input to the system is an English language text message, which is converted by a preprocessor into the rule pattern representation that has been previously described. 34 Template Formatter Rule Evaluator Rule Matcher W Dictionary Match Set Template Fill Rules Sentence Working Memory Filled Information Templates Figure 3.1: RUBIES Information Extraction System 35 The output is a set of objects with values assigned for attributes. Because there is a formal specification, or template, for these data objects, it is said that the output of the system is a set of filled information templates. All of the examples that will be discussed assume the template specification is that defined by the Fifth Mcs* sage Understanding Conference (MUC-5) [54]. An example message and simplified templates are shown in Table 3.6. <MESSAGE-100> := Diamond S ta r Motors i s a jo in t venture between C hrysler and M itsu b ish i. <TIEUP*100-1> := ENTITY: <ENTITY-100-1> <ENTITY-100*2> JOINT VENTURE CO: <ENTITY-100-3> STATUS: existing <ENTITY-100-1> := NAME: Chrysler TYPE: Company <ENTITY-100*2> := NAME: Mitsubishi TYPE: Company <ENTITY-100-3> := NAME: Diamond Star Motors TYPE: Company Table 3.6: Example Message and Filled Templates The RUBIES algorithm is shown in Table 3.7. The inner loop operates as a classic production system. In the following sections, the components of RUBIES that implement this algorithm are described. 36 For each message, Clear the template working memory. For each sentence in the message, Clear the sentence working memory. Generate the preprocessor representation for the message. For each rule class (order determined by Hatching Strategy), Find the rule match set. Fire some rules in the match set (determined by Rule Evaluator). Remove subsumer objects in sentence working memory. Add sentence working memory to template working memory. Remove subsumer objects in template working memory. Output the template working memory. Tabic 3.7: RUBIES Algorithm 3*4.1 Preprocessor Input to the preprocessor is a portion of text, such as a news article. An example text sentence is shown in Tabic 3.8. SALVADORAN PRESIDENT-ELECT ALFREDO CRISTIANI CONDEMNED THE TERRORIST KILLING OF ATTORNEY GENERAL ROBERTO GARCIA ALVARADO AND ACCUSED THE FARABUNDD MARTI NATIONAL LIBERATION FRONT (FMLN) OF THE CRIME. Table 3.8: Example Text Input Sentence The preprocessor takes the input text and replaces each word with all of its dictionary entries. An example of the preprocessor output for the word “killing’ * is shown in Table 3.9. Each word category may have a different set of attributes. The first two fields in the preprocessor output are always the category of the word and its root. After the root, the preprocessor lists a value for each of the remaining attributes, depending on the category. For example, common nouns can be singular 37 or plural. This is the attribute called number. Note that the preprocessor docs not disambiguate between word meanings and only provides ambiguous syntactic information to the parser. 
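Before the Rule Matcher is described, it may help to see the ambiguity problem in miniature. The sketch below represents the two dictionary entries for "killing" (Table 3.9, reproduced next) as one cover of two interpretations; the attribute names follow the conventions above, and the helper function is purely illustrative.

    # The word "killing" has two dictionary entries, so its cover is an
    # exclusive disjunction of two interpretations.
    killing = [
        {"cat": {"common-noun"}, "root": {"KILLING"}, "num": {"sg"}},
        {"cat": {"verb"}, "root": {"KILL"}, "tense": {"present-participle"}},
    ]

    def interpretations_matching(cover, attribute, value):
        # which of the word's interpretations satisfy a single selector?
        return [i for i in cover if value in i.get(attribute, set())]

    print(len(interpretations_matching(killing, "cat", "verb")))
    # 1 of 2 entries is a verb; the preprocessor itself does not choose between them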
comnon-noun KILLING sg KILLING verb KILL present-participle Table 3.9: Preprocessor Output for the Word “Killing” 3.4.2 R ule M atcher The Rule Matcher determines which of the template fill rules are able to fire based on the preprocessor representation of the message and the current state of the Working Memory (WM). Rules are categorized into classes depending on the type of action they can perform. For example, if there is a template object tieup that has an attribute status, there might be one class of rules for creating the tieup, and one class of rules for determining a value for the status attribute of a tieup. In any rule matching step, only one class of rules is considered. Rule classes are domain specific. The Rule Matcher has a strategy that determines the order for applying rule classes. If it is assumed that the object classes and possible relationships between object classes in the information template form a tree, a simple rule matching strat egy would be to traverse the tree from root to leavds in a breadth-first manner, applying the appropriate rules at each node. As rules fire, they add to or modify objects in the template working memory. The Matcher determines that a rule is able to fire if its pattern subsumes the current text message, and the object predicates evaluate true given the current state of the WM. A rule may be able to fire in different ways, depending on the number of valid mappings between the rule pattern and the text and the number of unifications possible between the object predicate and the objects in memory. Thus, a single rule may be allowed to fire multiple times on a text message. The Matcher generates a list of unifications for each rule that is able to fire. A unification consists of a valid mapping (i.e. the assignment of cover items in the rule 38 pattern to words in the text message) and an assignment of the object identifier in each object predicate to an object in WM. These unifications arc passed to the Rule Evaluator along with the match set. A single rule matching operation is illustrated below. Assume that the input text message is that of Table 3.1 (reprinted below in Table 3.10) and the current state of the WM as is shown in Table 3.11. FVirther, assume that the current rule that is being matched is shown in Table 3.12. This rule modifies the status attribute of an existing tieup object. Diamond Star Motors is a joint venture between Chrysler and Mitsubishi. Table 3.10: Example Text Message ObjectID = WM1 Type = Joint Venture Number of Entities = 2 EntityID[l] = WM2 EntityID[2] = WM3 ObjectID = WM2 Type = Entity Name = Chrysler ObjectID = WM3 Type = s Entity Name = Mitsubishi Table 3.11: Current State of Working Memory The pattern in this rule has one valid mapping with the example text message where the single cover item maps to the present tense verb is. Further, there is only one unification possible between the predicates for T l, E l, and E2 and the objects in working memory. T l would match to WM1, E l would match to WM2, and E2 would match to WM3. The Rule Matcher would attach this unification with the rule and put it in the match set that would be sent to the Rule Evaluator. 39 If (*,category(A)«verb A tense(A)s=present,*} a n d T l = Joint Venture((El,E2)) a n d E l — Entity(lex(B)) a n d E2 = Entity(lex(C)) T h e n T l = JointVenturc(,existing) Table 3.12: Example Information Extraction Rule 3.4.3 R ule Evaluator The Rule Evaluator determines which rules in the m atch set will alTccl the Working Memory (WM). 
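The unification just described can be represented explicitly. The following sketch (hypothetical field names, and a deliberately crude predicate test that only compares object classes and relationship counts) enumerates the bindings of predicate identifiers to working-memory objects that the Rule Matcher would pass, together with the valid mapping, to the Rule Evaluator.

    def predicate_matches(pred, wm_object):
        # simplified test: classes must agree and, if the predicate lists
        # related object IDs, the object must relate to the same number of objects
        if pred["class"] != wm_object["class"]:
            return False
        ids = pred.get("related_ids")
        return ids is None or len(ids) == len(wm_object.get("related", []))

    def unifications(rule_predicates, working_memory):
        # enumerate assignments of predicate identifiers to distinct WM objects
        if not rule_predicates:
            yield {}
            return
        first, rest = rule_predicates[0], rule_predicates[1:]
        for obj_id, obj in working_memory.items():
            if predicate_matches(first, obj):
                for tail in unifications(rest, working_memory):
                    if obj_id not in tail.values():
                        yield {first["id"]: obj_id, **tail}

    # Hypothetical working memory mirroring Table 3.11.
    wm = {"WM1": {"class": "tieup", "related": ["WM2", "WM3"]},
          "WM2": {"class": "entity"}, "WM3": {"class": "entity"}}
    preds = [{"id": "T1", "class": "tieup", "related_ids": ["E1", "E2"]},
             {"id": "E1", "class": "entity"}, {"id": "E2", "class": "entity"}]
    print(list(unifications(preds, wm)))
    # two candidate bindings under this simplified test; the real matcher
    # further constrains them through the rule pattern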
The rule matching strategy th at will be employed ensures th at all rules in the rule m atch set will be of the same class. Each rule will have one or more object modifiers in its right hand side. Each modifier cither constructs a new object, or modifies an existing object in WM. Before describing the rule evaluation algorithm, object subsumption will be discussed. In the course of rule evaluation, it may be found th at there arc one or more iden tical objects in WM. It is said th at two objects are identical if all of their arguments are identical. But what if two nearly identical objects have been constructed that differ only in that one of them has an additional ObjectID in its Object ID Listl For example, a Tieup object may have been constructed between Mitsubishi, Honda, and Ford and also a Tieup object may have been constructed between Honda and Ford. One ObjectlDList is subsumed by another if all of its Object ID'b are contained in the other. For any two objects of the same class (A and B), if all of their arguments are the same except th at one or more ObjectList's in A subsume the associated ObjectLisVs in B, then object A subsumes object B. There are cases where the correct output of a text message will contain one object th at is subsumed by another, but this is quite rare and it has not been observed in any of the available examples. Since the existence of subsuming objects would increase spurious d ata output, it was decided to bias RUBIES to disallow subsuming objects in working memory. 40 For every rule in the natch set, For every unification of this rule with objects in WM, For every object modifier in the rule, If the object modifier constructs a new object, Let it construct the object. Remove any objects in WM that subsume other objects. If the object modifier modifies an existing object and the rule weight is greater than the object argument weight, Let the object modifier modify the object argument. Update the modified argument weight with the rule weight. Table 3.13: Rule Evaluation Algorithm Any objects that subsume other objects will be deleted. In other words, only the more specific object descriptions arc retained. Another issue that must be addressed is the policy for conflicting object modifiers. An object modifier conflicts with another if they both modify the same argument of an object of the same class in different ways. For example, there may be one rule to modify the status of a Tieup object to existing and another to modify it to former. The problem occurs when these rules both appear in the match set. Our solution is to assign a weight to each rule and let them both fire. The weight of the last rule that modified each of an object's arguments is stored. If a rule has a higher weight than the last rule that modified the object argument, it is permitted to fire and the argument's weight is updated with the new rule weight. Rule weights are a measure of how much evidence there is to support a rule. The rule evaluation algorithm is shown in Table 3.13. 3.4.4 M UC-5 Rule Classes In this section, the rule classes used to extract some of the data objects of the MUC-5 joint-venture domain are described. Twenty-one classes of rules will be used to create and modify tieup and entity objects. Because the entity objects cannot be created 41 unless they are used by a tieup object, tieup objects and entity objects are created within the same rule. The rule classes for tieup and entity object construction arc shown in Table 3.14. 
Class 1
  Predicate: NULL
  Modifier:  Ti = tieup((Ea,Eb),,)
             Ea = entity(Ia)
             Eb = entity(Ib)

Class 2
  Predicate: NULL
  Modifier:  Ti = tieup((Ea,Eb,Ec),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)

Class 3
  Predicate: NULL
  Modifier:  Ti = tieup((Ea,Eb,Ec,Ed),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)
             Ed = entity(Id)

Table 3.14: Tieup/Entity Construction Rule Classes

There are three rule classes where the jv-company argument of the tieup object is assigned a value, as shown in Table 3.15. If the jv-company argument is not assigned a value, the three classes shown in Table 3.16 are used. These rules have no action and would be useless to the RUBIES system. However, they are useful for learning and will be discussed in later chapters. There are fifteen rule classes for the status argument of the tieup object. The three classes for status with value existing are shown in Table 3.17. The classes for other status values have the same structure.

Class 4
  Predicate: Ti = tieup((Ea,Eb),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
  Modifier:  Ti = tieup(,,(Eg))
             Eg = entity(Ig)

Class 5
  Predicate: Ti = tieup((Ea,Eb,Ec),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)
  Modifier:  Ti = tieup(,,(Eg))
             Eg = entity(Ig)

Class 6
  Predicate: Ti = tieup((Ea,Eb,Ec,Ed),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)
             Ed = entity(Id)
  Modifier:  Ti = tieup(,,(Eg))
             Eg = entity(Ig)

Table 3.15: Tieup jv-company Rule Classes

Class 7
  Predicate: Ti = tieup((Ea,Eb),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
  Modifier:  NULL

Class 8
  Predicate: Ti = tieup((Ea,Eb,Ec),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)
  Modifier:  NULL

Class 9
  Predicate: Ti = tieup((Ea,Eb,Ec,Ed),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)
             Ed = entity(Id)
  Modifier:  NULL

Table 3.16: Tieup No jv-company Rule Classes

Class 10
  Predicate: Ti = tieup((Ea,Eb),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
  Modifier:  Ti = tieup(,existing,)

Class 11
  Predicate: Ti = tieup((Ea,Eb,Ec),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)
  Modifier:  Ti = tieup(,existing,)

Class 12
  Predicate: Ti = tieup((Ea,Eb,Ec,Ed),,)
             Ea = entity(Ia)
             Eb = entity(Ib)
             Ec = entity(Ic)
             Ed = entity(Id)
  Modifier:  Ti = tieup(,existing,)

Table 3.17: Tieup Status Rule Classes for status = existing

3.5 Limitations

RUBIES has several limitations that affect its performance as an information extraction tool. In this section, these limitations are described and extensions to mitigate them are proposed.

The accuracy of information extraction is greatly dependent on the quality of the preprocessor and the dictionary. Errors in this early stage of the processing have a cascading effect on later stages. This makes it difficult to isolate the performance of RUBIES from that of the preprocessor. However, it is safe to say that improvements in the dictionary and preprocessor will directly improve the information extraction accuracy of the entire system. The quality of the dictionary that is currently used could be greatly improved. There are obvious errors in the entries for many words. Unfortunately, the large effort required to repair the dictionary prohibits any immediate solution.

RUBIES uses purely syntactic information about the text in the form of dictionary entries for each word. Further processing (e.g. phrasal parsing, anaphora resolution, etc.) could eliminate much of the syntactic ambiguity and introduce additional attributes and relationships. For example, phrasal boundaries could be added. There is nothing in the RUBIES architecture that prevents adding additional natural language processing stages, and this is a possible area of expansion.

RUBIES rule patterns do not cross sentence boundaries.
This makes it difficult for them to take advantage of discourse information, if it becomes available. For example, one sentence might mention a joint venture and the following sentence might list the companies involved. One solution might be to store the discourse information in the working memory, but this would still require some modification to the rule structure to take advantage of it. 44 3.6 C onclusion This chapter described the RUBIES system and how it can be used for extracting information from natural language text. RUBIES rules were shown to have the flexibility necessary for data objects (i.e. objects representing information extracted from text) to be created and modified. The rule matching algorithm relied heavily on a formal definition of the subsumption operator for rule patterns, cover items, interpretations, and selectors. RUBIES is a general information extraction system, in that it can be applied to any information extraction domain where the information can be represented as a set of objects, relationships between objects, and object attributes. This template information is domain specific and its definition m ust be provided to RUBIES, along with a rule set and a rule matching strategy which are also domain specific. In the following chapters, a machine learning method for autom atically creating the rule set for RUBIES will be described. 45 Chapter 4 T R A IN IN G SET C O NSTRUCTIO N 4.1 Introduction Our approach to knowledge-base construction is to use a supervised learning system to automatically generate the rules for RUBIES. In this machine learning paradigm, the learning system is presented with a set of training examples which have been pre-classificd into one of several classes. Because each example can belong to one and only one of these classes, we say that they arc mutually exclusive. The goal of the learning system is to generate a description for each class that can be used to accurately classify new, unseen objects. The training examples that arc available for information extraction arc text mes sages and their associated information templates. These examples cannot be directly used by a supervised learning system because they have not been classified. One possible classification method would be to group the examples by the types of tem plate objects they contain. There could be a class for examples with existing tieup objects, a class with former tieup objects, a class with tieup objects between two entity objects, etc. However, these classes would not be mutually exclusive because the filled templates for an example consist of multiple objects and their attribute values. A solution to this problem is to decompose these complex training examples into simpler examples that can be assembled into mutually exclusive classes that can be processed by an inductive learner. 46 4.2 M UC-5 EJV Training Sets In this section, the problem of learning information extraction rules from text mes- sages and templates is decomposed into multiple two-class (binary) learning prob lems. Each of these learning problems has a set of positive and negative example rules. Each example rule belongs to one of the rule classes described in the last chapter. The training set for a rule class consists of two sets of rules. One set is labeled as positive and contains examples of rules from that class. The second set is labeled as negative and consists of rules from some of the other classes. 
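Concretely, assembling a training set for one rule class amounts to pairing that class's example rules with the rules of its designated negative classes; the pairings used for the MUC-5 EJV domain are listed in Tables 4.1 through 4.3 below, and the class numbers in this sketch are placeholders.

    def make_training_set(target_class, negative_classes, rules_by_class):
        # rules_by_class: mapping from a rule-class number to its example rules
        positives = list(rules_by_class.get(target_class, []))
        negatives = [r for c in negative_classes
                       for r in rules_by_class.get(c, [])]
        return positives, negatives

    # e.g. Class 1 (two-entity tieup construction) uses only Class 0
    # (no information) as its negative class, per Table 4.1:
    # pos, neg = make_training_set(1, [0], rules_by_class)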
This is a bit different from the way traditional supervised learning is done, where examples from all other classes would be used for the negative training set. The negative examples for Ticup/Entity construction rules are shown in Table 4.1. A new rule class, Class 0, is introduced. This class contains rules with patterns, but no object predicates and no object modifiers. These rules are constructed from text that has no information content. Class to Learn Negative Example Classes Class 1 (2 entities) Class 0 (no information) Class 2 (3 entities) Class 0 (no information) Class 1 (2 entities) Class 3 (4 entities) Class 0 (no information) Class 1 (2 entities) Class 2 (3 entities) Table 4.1: Negative Examples for Tieup/Entity Construction Rules Class 1 is a Tieup/Entity construction rule for tieups between two entities. Class 2 is for tieups between three entities and Class 3 is for tieups between four entities. Class 1 can be viewed as being more general than Class 2. In other words, a tieup between three entities is also a tieup between two entities. Likewise, Class 1 is more general than Class 3. Therefore, Class 2 and 3 rules cannot be used as negative examples for Class 1. With this training set for Class 1, it is expected that Class 1 rules will match many of the same text messages that Class 2 rules will. RUBIES takes care of this situation by removing objects in Working Memory that subsume 47 other objects. Thus, the tieup objects created by a Class 1 rule arc removed if a Class 2 rule also fires and creates the same tivo entity objects that the Class 1 rule created. A similar situation exists for the negative examples used for learning Class 2 rules which are more general than Class 3 rules, but less general then Class 1 rules. To avoid having a Class 2 rule fire when a Class 1 rule should only fire, Class 1 rules are used as negative examples. T he negative examples for Tieup jv-company rules are shown in Table 4.2. Neg ative examples for each of these classes are the corresponding Tieup no-jv-company rules. The no-jv-company rules come from situations where a tieup object exists without a value for the jv-company attribute. Class to Learn Negative Example Class Class 4 (2 entities) Class 7 (2 entities, no-jv-company) Class 5 (3 entities) Class 8 (3 entities, no-jv-company) Class 6 (4 entities) Class 9 (4 entities, no-jv-company) Table 4.2: Negative Examples for Tieup jv-company Rules The negative examples for Tieup cxisling-status rules are shown in Table 4.3. Negative examples for cadi of these classes arc the status rules with the same number of entity objects that have a different value for the status attribute. In the following sections, a method for converting text messages and tem plates into complex rules !b described and an algorithm for decomposing these complex rules into simpler rules th at can be used in the training sets is presented. 4.3 Conversion o f Text and Tem plates into R ules In this section, a m ethod for converting a training example into a RUBIES rule is presented. The training example consists of a text message and its filled information tem plates. T he text and tem plate portions of the example are converted separately as is shown in Figure 4.1. Each word in the text is replaced by an exclusive disjunction of its dictionary entries by the preprocessor. 
Each of these dictionary entries is reform atted from a 48 Class to Learn Negalive Example Classes Class 10 (2 entities, existing) Class 13 (2 entities, pre-format) Class 16 (2 entities, dissolved) Class 19 (2 entities, former) Class 22 (2 entities, unspecified) Class 11 (3 entities, existing) Class 14 (3 entities, prc-formal) Class 17 (3 entities, dissolved) Class 20 (3 entities, former) Class 23 (3 entities, unspecified) Class 12 (4 entities, existing) Class 15 (4 entities, prc-formal) Class 18 (4 entities, dissolved) Class 21 (4 entities, former) Class 24 (4 entities, unspecified) Tabic 4.3: Negative Examples for Tieup Status Rules for status = * existing Training Example t RUBIES Rule If Tfext Then Template Objects L ! If Pattern Then I _ _ _ _ j I Object Modifiers L -----------------------— ------------------- J Figure 4.1: Conversion of TYaining Example to RUBIES Rule 49 list of attribute values into a conjunction of selectors. A rule pattern can then be formed by creating a list of cover items, with one item for each word. Thus, the text portion of a training example forms a rule pattern that does not have any Kleene stars. These patterns are highly specific and will only match the text of the training examples they were derived from. An example text message and its associated templates are shown in Table 4.4. Table 4.5 contains the preprocessor output for each of the words in the message. <MESSAGE-200> := ALCOA AQREED TO FORM A JOINT VENTURE, AK METALS, WITH KOBE STEEL <TIEUP*200*1> := ENTITY: < ENTITY-200-1> <ENTITY-200-2> JOINT VENTURE CO: <ENTITY-200-3> STATUS: existing <ENTITY-200-l> := NAME: Alcoa TYPE: Company <ENTITY-200-2> := NAME: Kobe Steel TYPE: Company < ENTITY-200-3> := NAME: AK Metals TYPE: Company Table 4.4: Example Message and Filled Templates The preprocessor output for each word is converted to the cover representation shown in Table 4.6 where C\ is the cover for the first word, C % is the cover for the second word, etc. These covers are then assembled into the pattern shown in Table 4.7. The information templates in Table 4.4 can easily be converted to the object modifiers shown in Table 4.8. The arguments to the entity object modifiers return 50 Word Preprocessor Output ALCOA Np Alcoa 8g company AGREED V agree ppl TO P to FORM Nc fora eg V fora r n3eg A T a eg JOINT VENTURE Nc joint-venture eg $ M . AK METALS Np AK-Metals eg company M , WITH P with KOBE STEEL Np Kobe-Steel eg coapany Tabic 4.5: Preprocessor O utput for the Text in Table 4.4 the value of the lexical attribute of the referenced item. The final rule for the training example in Table 4.4 thus has pattern Pi in the condition part, and contains the object modifiers of Table 4.8. Note that this rule has no object predicates. Also note that this rule could only be used for creating the object templates in Table 4.4. The condition part of this rule could be generalized in different ways. For exam ple, some cover items could be replaced with Klecnc Stars which would enable the rule to fire for additional text messages. However, the rule could not produce any different tem plate objects, with the exception of the name attribute of the entity objects, which is taken from the lexical attribute of the word that is matched by that cover item. In the next section, a method for decomposing these complex rules into simple rules is presented. Each simple rule belongs to one of the rule classes for the MUC-5 EJV domain that were defined in the last chapter. 
C1 = ((lex = ALCOA) ∧ (cat = Np) ∧ (root = Alcoa) ∧ (num = sg) ∧ (concept = company))
C2 = ((lex = AGREED) ∧ (cat = V) ∧ (root = agree) ∧ (tense = ppl))
C3 = ((lex = TO) ∧ (cat = P) ∧ (root = to))
C4 = i41 ⊕ i42
  i41 = ((lex = FORM) ∧ (cat = Nc) ∧ (root = form) ∧ (num = sg))
  i42 = ((lex = FORM) ∧ (cat = V) ∧ (root = form) ∧ (tense = r) ∧ (agree = n3sg))
C5 = ((lex = A) ∧ (cat = T) ∧ (root = a))
C6 = ((lex = JOINT VENTURE) ∧ (cat = Nc) ∧ (root = joint-venture) ∧ (num = sg))
C7 = ((lex = ,) ∧ (cat = M) ∧ (root = ,))
C8 = ((lex = AK METALS) ∧ (cat = Np) ∧ (root = AK-Metals) ∧ (num = sg) ∧ (concept = company))
C9 = ((lex = ,) ∧ (cat = M) ∧ (root = ,))
C10 = ((lex = WITH) ∧ (cat = P) ∧ (root = with))
C11 = ((lex = KOBE STEEL) ∧ (cat = Np) ∧ (root = Kobe-Steel) ∧ (num = sg) ∧ (concept = company))

Table 4.6: Covers for Words in Table 4.4

P1 = {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11}

Table 4.7: Rule Pattern for the Text Message in Table 4.4

TIEUP-200-1 = JointVenture((ENTITY-200-1,ENTITY-200-2),existing,ENTITY-200-3)
ENTITY-200-1 = Entity(lex(Item-1))
ENTITY-200-2 = Entity(lex(Item-11))
ENTITY-200-3 = Entity(lex(Item-8))

Table 4.8: Object Modifiers for Templates in Table 4.4

4.4 Rule Decomposition Algorithm

Complex rules that were derived directly from text messages and templates will have one or more object modifiers in the action part of the rule. These modifiers can be decomposed so that they match one of the MUC-5 EJV rule classes. Complex rules that have no object modifiers do not require decomposition. The decomposition algorithm presented in this section is domain specific and is defined only for the MUC-5 EJV domain.

The decomposition algorithm is shown in Table 4.9. This algorithm will decompose each complex rule into three simple rules. It is assumed that each complex rule contains one and only one tieup object modifier in the action part of the rule. Although this limits decomposition to text messages that contain information about one tieup, it could easily be extended to messages with multiple tieups.

For each complex rule R,
  Let N be the number of tieup entities in R.
  Let J be the number of tieup joint-venture companies in R.
  Let S be the tieup status of R.
  Create a tieup/entity construction rule with an N-entity tieup object modifier.
  Create a jv-company rule with an N-entity tieup object predicate and a
    J-joint-venture-company tieup modifier.
  Create a status rule with an N-entity tieup object predicate and an
    S-status tieup object modifier.

Table 4.9: MUC-5 EJV Rule Decomposition Algorithm

The rule pattern of each simple rule is copied from the rule pattern of its parent complex rule. The Tieup/Entity construction rules are of class 1, 2, or 3 depending on the number of entity objects. If there is a jv-company for the complex rule, the jv-company rule is of class 4, 5, or 6. If there is no jv-company, the rule is of class 7, 8, or 9. The status rule is of class 10 through 24, depending on the status value and the number of entity objects.

The example complex rule from the last section is shown in Table 4.10, where P1 is the rule pattern of Table 4.7. The object identifiers have been simplified, but the meaning of the rule has not changed.

P1 →
  T1 = Tieup((E1,E2),existing,E3),
  E1 = Entity(lex(Item-1)),
  E2 = Entity(lex(Item-11)),
  E3 = Entity(lex(Item-8))

Table 4.10: Example Complex Rule

The algorithm in Table 4.9 would decompose this complex rule into the three simple rules shown in Table 4.11.
Rule # Pattern Object Predicates Object Modifers 1 PI T l = Tieup((El,E2)„) E l ai Entity(lex(Itcm -l)) E2 = Entity(lex(Item -ll)) 2 PI T l = Tieup((El,E2)„) E l = Entity(lex(Item-i)) E2 = Entity(lex(Item -ll)) T l Tieup(,existing,) 3 PI T l = Tieup((El,E2)„) E l = Entity(lex(Item-l)) E2 = Entity(lex(Item -ll)) T l = Tieup(„E3) E3 = Entity(lex(Item-8)) Table 4.11: Simple Rules Decomposed from Complex Rule of Table 4.10 Rule 1 will construct a new tieup object and two new entity objects. Rule 2 will modify the status attribute of an existing tieup object. Rule 3 wilt modify the joint venture company attribute of an existing tieup object and create a new entity object. 54 4.5 R ule D ecom position Experim ents In this section, we present the results of applying the rule decomposition algorithm to a set of 191 complex rules culled from the M U 0 5 EJV training set. These complex rules were decomposed into 573 simple rules. Tables 4.12 through 4.14 show the class distribution of these simple rules. # Entities # Rules 2 166 3 18 4 7 Table 4.12: Tieup Construction Rules # Entities # J V Co. # Rules 2 0 133 1 33 3 0 14 1 4 4 0 5 1 2 Table 4.13: JV Company Rules 4.6 Conclusion Text messages and templates can be converted into simple rules that can be gener* alized by an inductive learner using the algorithms described in this chapter. The first step is to convert the text and templates into complex information extraction rules. Next, these complex rules are decomposed into the simple rules that belong to the MUC-5 EJV rule classes. These simple rules could be chained together to achieve the same effect as firing only the complex rule. In effect, the complex rule was de-compiled. The reverse of this decomposition process, construction of macro 55 # Entities Status # Rules 2 pre*formal 13 existing 145 dissolved 7 former 1 unspecified 0 3 pre*formal 2 existing 15 dissolved 1 former 0 unspecified 0 4 prc-formal 0 existing 6 dissolved 1 former 0 unspecified 0 Tabic 4.14: Status Rules rules from sets of other rules, has been addressed by researchers in Explanation* Based Learning (EBL) [46] [12]. Thus, rule decomposition could be viewed as a sort of reverse EBL. Once the simple rules have been derived, training sets of positive and negative examples can be constructed from different combinations of these rule classes. These training sets can be used to train an inductive learning system. A limitation of the rule decomposition algorithm of Table 4.9 is that it is domain specific and it will only work on MUC-5 EJV rules. One area for future work iB to generalize the rule decomposition algorithm so that it is domain independent. 56 Chapter 5 L E A R N IN G INFO RM ATIO N E X T R A C T IO N RULES 5.1 Introduction In this chapter, we describe how information extraction rules can be learned from examples. In Chapter 4, a method for constructing training sets for each rule class was described. Each of these training sets contains positive and negative example rules and is independent of the other training sets. Thus, we arc faced with multiple two-class (i.e. binary) induction problems. It is conceivable that we could learn a rule set for both the positive and negative class, but since RUBIES has no use for negative rules, a rule set will only be learned for the positive class. We would like the rules that are learned to accurately extract information from text messages other than those that were used to create the training examples. 
Un fortunately, we do not know anything about what new text messages the system may be required to process. We must assume that training messages and “new” messages are selected from the same population according to a fixed probability distribution. This population could be described as consisting of all possible text messages for the MUC-5 EJV domain. Messages that have a relatively large probability of occurrence in this population are more likely to show up in the training set and “unseen” set than other, less probable messages. Based on this assumption, we will attem pt to learn rules that are judged to be “good” relative to the training set. But this goal is not enough to give us the inductive leap necessary to be able to perform well on new text messages. 57 If wc only tried to learn a rule set that performed well on the training examples, we could just use the training rule sets, which perform perfectly on the training messages. However, we could not expect this overly specific rule set to be able to extract information from any new messages. We need to apply an additional bias to our learning system. In fact, it has been shown that an inductive learner without bias is only capable of rote learning [45]. The bias we choose is to try and learn rules that are as general as possible. A learner with this bias would be expected to learn a rule set that would perform fairly well at extracting all relevant information, but may extract some unnecessary, or spurious information. In this chapter, wc describe the Merlin learning algorithm. Merlin learns infor mation rules that can be directly provided to the RUBIES information extraction system that was described in Chapter 3. Merlin takes as input the training rule sets that were constructed using the method described in Chapter 4 and creates a generalized rule set using the previously described bias. The learning of information extraction rules can be viewed as a state space search. The space searched by Merlin is defined by the representation language that is being used for the rules. This representation was defined in Section 3.2. Merlin performs multiple beam searches in this space to accumulate a set of rules. After enough rules arc found, a subset (i.e. set cover) is chosen for the final rule set. In the description of Merlin that follows, the initial and final search states and the search operators that are applied will be defined. 5.2 Covering-Type Induction Algorithm s Several induction algorithms represent object classifiers as logical disjunctions of concept descriptions. Each of these disjuncts covers some of the positive examples. If a concept description includes an example, then it is said to cover that example. Together, these disjuncts cover all of the positive examples. We refer to these types of algorithms as covering~type induction algorithms. A common covering algorithm is shown in Table 5.1. Inductive learners that use this algorithm include the AQ-Family [42] [43], FOIL [60], and FOCL [56]. 58 For each class C that is to be learned, While the classifier for C does not cover all of the examples of C , Find a concept description that covers some examples of C, but no examples of other classes. Append this concept description as one disjunct of the classifier for C . Remove all examples that cover this concept description. Table 5.1: The Basic Covering Method for Induction This algorithm can be viewed as a solution to the closely related sct-covcring problem [10]. 
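As a concrete rendering of Table 5.1, the covering loop can be sketched as follows. The helper find_description stands in for whatever search a particular learner (AQ, FOIL, FOCL) performs, and covers is an assumed membership test; both are placeholders for this sketch, not part of any of those systems' actual interfaces.

```python
def covering_induction(positives, negatives, find_description, covers):
    """Minimal sketch of the basic covering method (Table 5.1) for one class.

    find_description(pos, neg) must return a concept description that covers
    some examples in pos and none in neg; covers(d, e) tests whether d covers e.
    """
    uncovered = list(positives)
    classifier = []   # the learned class is the disjunction of these descriptions
    while uncovered:
        d = find_description(uncovered, negatives)
        classifier.append(d)
        uncovered = [e for e in uncovered if not covers(d, e)]
    return classifier
```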
This problem is typically described using set theoretic terminology as shown in Table 5.2 and has been shown to be NP-hard. Given: A finite set X and a family, of subsets of X y such that every clement of X belongs to at least one subset in /*. Find: A minimum cost family of subsets C Q F that contains all the elements of X . Table 5.2: The Set-Covering Problem In the context of induction, the finite set X would be the set of positive examples for a class and the family of subsets C would be the disjunction of concept descrip tions that form the classifier. In the covering algorithm of Table 5.1 there is nothing analogous to the family of subsets T , Instead, a search is performed to find the next subset to append to C, One inductive learner, Stargazer [32], uses a more direct set-covering approach. The family of candidate disjuncts, T , is explicitly created and then a set-covering algorithm is used to select C from T . The basic Stargazer algorithm is shown in Table 5.3, Stargazer uses the same method for finding concept descriptions as does 59 AQR [6], a member of the AQ-Family. The major difference between the two is in the covering method. Their performance was compared in several domains and Stargazer was found to learn concepts that were more accurate than AQR. The conclusion drawn is that explicit set-covering methods (i.e. Stargazer) can gener ate more accurate classifiers than implicit set-covering methods. The disadvantage of explicit methods is that more concept descriptions need to be generated. This translates into increased learning time. However, it is easier to implement paral lel versions of explicit set-covering algorithms. A parallel version of Stargazer has been implemented on the iPSC/2 hypercube computer and shown to run faster than sequential AQR, while at the same time learning more accurate classifiers [33]. For each class C that is to be learned, Let F be an empty set of concept descriptions. For each example E of C, Find a concept description that satisfies E and some other examples of C, but no examples of other classes. Add this concept description to F . Select a covering-set of concept descriptions from T and disjoin them together to form the classifier for C . Table 5.3: The Covering Algorithm for Stargazer Merlin uses a hybrid implicit/explicit approach for learning information extrac tion rules. The reason that we did not use the purely explicit set-covering method of Stargazer is that the information extraction rules are extremely complex and the learning time would be too great to be practical. A parallel implementation was not considered because of convenience and portability issues. In the information extraction paradigm, examples and classifiers are both repre sented as rules in the format described in Chapter 3. Instead of creating a classifier from a disjunction of concept descriptions, we form a set of rules for each rule class. There is an implicit disjunction between the rules in the rule set. The covering algorithm for the Merlin inductive learner is shown in Table 5.4. Instead of finding only one rule for each iteration of the inner loop, a set of rules is created and merged 60 with the set of candidate rules. The explicit sct-covering step is similar to that used in Stargazer. The rules in set G are consistent in that they do not subsume any of the rules in N and do not have any misaligned subsumptions with the positive examples of C. 
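A sketch of the loop in Table 5.4 is given below, with generalize (the beam search of Section 5.5), subsumes, and select_cover (the explicit set-covering step of Section 5.4) treated as black boxes; the random choice of seed anticipates the selection method adopted in Section 5.3. This is an illustration under those assumptions, not Merlin's actual code.

```python
import random

def merlin_covering(positives, negatives, generalize, subsumes, select_cover):
    """Hybrid implicit/explicit covering loop for one rule class (Table 5.4)."""
    candidates = []                                    # the accumulated rule set R

    def uncovered():
        return [e for e in positives
                if not any(subsumes(r, e) for r in candidates)]

    while uncovered():
        seed = random.choice(uncovered())              # pick an uncovered example
        # Beam search from the seed yields a set of consistent generalizations;
        # stopping condition SC5 guarantees it returns at least the seed itself,
        # so every pass makes progress.
        for rule in generalize(seed, positives, negatives):
            if rule not in candidates:                 # remove duplicate rules
                candidates.append(rule)

    # Explicit set-covering step: keep only a subset of the candidates that
    # still covers every positive example; this is the final rule set.
    return select_cover(candidates, positives)
```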
We note that this definition of consistency is different than that used previously by members of the AQ*Family of algorithms. In Section 5.3, we describe the algorithm for selecting the next positive example (Table 5.4, Line 6). In Section 5.4, the set*covering algorithm used in Line 12 is presented. The search for consistent rules (Line 8) is described in Section 5.5. 1 For each class, C , of rules that is to be learned, 2 Let N be the set of negative examples for C . 3 Let A be an empty rule set. 4 While there are examples of C not subsumed by 5 rules in R , 6 Select one of the examples of C t hat is not subsumed 7 by any of the rules in R . 8 Create a set of consistent rules, G , b y generalizing 9 this example in different vays. 10 Append G to R . 11 Remove duplicate rules from R . 12 Select some of the rules from R and disjoin them 13 to form the final rule set for C . Table 5.4: The Covering Algorithm for Merlin 5.3 Selecting a Positive Rule for Generalization Which of the uncovered (i.e. not yet subsumed) positive example rules should be selected for generalization? Of course, we would like to select the example that would quickly lead to the best generalizations. But how can this be predicted? How can we tell if one rule is a better starting point than another? This problem of selecting an example to use as an initial state for a search has been addressed by 61 other researchers and is commonly known as the seed selection problem. The term seed refers to the rule that is selected. Merlin prefers seeds that subsume a large number of positive examples, have no misalignments with positive examples, and do not subsume any negative examples. In other words, Merlin considers the best rules to be those that arc consistent and most general (i.e. subsume the most positive examples). A rule is consistent if it has no misaligned subsumptions with positive example rules and does not subsume any negative example rules. One heuristic that could be used for seed selection is to perform a one-step lookahead for each example. The example that led to the best generalizations after this first search step would then be selected as the seed. The basic assumption of this heuristic is that the success of the first step is indicative of the remainder of the search. This method has been used in the Rigcl system [23]. Of course, the lookahead could be expanded to two, three, or more steps, but this becomes extremely computationally expensive. Another heuristic would be to select an example rule that is most representative of all the other positive examples. Some statistical measure of attribute values could be used as an indicator. Wc refer to this as the archetypical method. Yet another method would be to select an example at random. This is by far the most computationally inexpensive method. The performance of a seed selection method depends greatly on the type of search that follows. For example, a search method that explores a large amount of search space such that different search paths overlap to a large degree would not be as dependent on seed selection than would a more narrow search method. In previous work [32], we performed experiments using each of these methods to select seeds for the AQR induction algorithm [6]. AQR performs a beam search for concepts from general to specific and the beam width is variable. W hat we found was that at a narrow beam width of 5, the lookahead method was the best, but not by much. 
As the beam width was increased to 10, all three methods had about the same performance. There were similar results at a beam width of 15. Merlin performs a beam search starting at the seed example in the extremely complex search space of rule patterns. Because each search step in this space is 62 extremely computationally expensive and beam widths are not very narrow, it was decided that Merlin would use the random method of seed selection. 5.4 Finding a Set Cover Merlin uses an explicit set-covering algorithm to select some of the candidate rules for the final rule set. A greedy set*covering algorithm based on Chvatal’s algorithm [5] is shown in Table 5.5. This is similar to the covering method for induction that was shown in Table 5.1. The difference is that instead of searching for a rule, one is selected from a set of candidate rules (5). Let P be the set of positive example rules. Let S be the set of generalized rules input to the set-covering algorithm. Let C be an empty final rule set. While P is not empty, Select a rule R from 5. Remove R from S . Remove all example rules in P that are subsumed by R , Remove all rules in S that do not subsume any examples in P . Append r to C. Table 5.5: A Greedy Set-Covering Algorithm The first step inside the loop in Table 5.5 is to select one of the rules from R. In previous work [32], five heuristics for rule selection were investigated. These heuristics are labeled H i through H5. Four of these heuristics use a measure of rule complexity that is not directly applicable to information extraction rules, since it is representation dependent. Let ABSSUB (for absolute subsumption) be the number of positive example rules in the original set P subsumed by a rule in R. Let RELSUB (for new subsumption) be the number of rules in the current set P subsumed by a rule in R. Experiments were conducted using these heuristics for the set-covering algorithm of Stargazer. 63 Each rule in these experiments consisted of a single complex. Within this specialized representation for a rule pattern, C O M PLEXITY is defined to be the number of selectors. The heuristics for rule selection are described as follows: H I Select the rule that maximizes ABSSUB. In the case of a tie, select the rule that minimizes COM PLEXITY. If there is still a tie, select one at random. H 2 Select the rule that maximizes the ratio RELSU B/CO M PLEXITY. If there is a tie, select one at random. H 3 Select the rule that maximizes RELSUB. If there is a tie, select the rule of maximum RELSUB that minimizes COM PLEXITY. If there is still a tie, select one at random. H 4 Select the rule that minimizes COM PLEXITY. In the case of a tic, select the rule of minimum C O M PLEXITY that maximizes RELSUB. If there is still a tic, select one at random. H 5 Select the rule that maximizes RELSUB. If there is a tic, select the rule that maximizes ABSSUB. If there is still a tic, select one at random. The error rates of rules learned using each of these heuristics were compared for three natural domains. Table 5.6 shows the error rates averaged over 1 0 0 trials, where in each trial, 70% of the examples were randomly chosen for training and the remaining 30% were used for testing. The beam width of the search is shown in the column labeled Beam. These domains arc briefly described as follows:1 so y b ean This database consists of 307 examples from 19 classes. Each example is described by 35 attributes, some of which arc nominal and some ordered. 
Each class corresponds to a particular soybean disease diagnosis, and attributes correspond to symptoms. There are some missing values. glass This study of glass classification was motivated by criminological investiga tion. Glass that is found at a crime scene can be used as evidence if it can be 1 These domains were obtained from the UCI Repository of Machine Learning Databases, De partment of Information and Computer Science, University of California, Irvine, CA. Special thanks to Patrick Murphy and David Aha for their assistance. 64 accurately identified. The domain contains 214 examples of 7 types of glass. Each example has 10 continuously*valued attributes that mostly describe the chemical composition of the glass. There are no missing values. audiology This data was provided by Professor Jergen at Baylor College of Med icine. It consists of 200 examples from 24 classes. Each example is described by 70 nominal-valued attributes. There are some missing values. In these domains, ///, 7/5, and H5 have the lowest overall error rates, without much difference between them. / // seems to mitigate the problem of small disjuncts [61] since it prefers rules that subsume a large number of examples. HS and H5 have a bias towards minimizing the complexity of the rules since they tend to select the least number of rules to form the cover. HS also attem pts this bias, but it seems that the number of selectors in the disjuncts is less important than the number of disjuncts in the cover. Heuristic Beam soybean glass audiology III 5 26.2 9.6 34.3 H2 5 26.2 9.8 35.4 113 5 25.8 9.8 34.7 114 5 28.0 9.5 36.9 115 5 25.7 1 0 .2 34.3 HI 10 25.8 8 .8 34.0 112 10 25.8 9.2 33.9 113 10 25.2 9.2 34.0 114 10 27.9 9.1 34.6 H5 10 25.0 9.2 33.9 Table 5.6: Average Error Rate (%), 100 Trials [32] H5 was chosen as the rule selection heuristic for Merlin because it seems to perform better as the beam width is increased and Merlin uses a relatively large beam width. 65 5.5 R ule G eneralization At the core of the Merlin algorithm is the m ethod for generalizing a set of consistent rules from a single example. To do this, Merlin performs a beam search starting at the seed example. In this section, we describe the search operators th at are used and the strategy for applying them. 5.5.1 Search O perators Some operators that can be used to search the rule space are shown in Table 5.7. Merlin uses a subset of these operators (O P /, OPS, OPS, and OPT) to learn rules. Each of the operators in Table 5.7 can be classified as a generalization or special* ization operator. Every generalization operator has a specialization operator that is its inverse. Generalization operators make a rule more general and spccialtza* tion operators make a rule more specific. A rule is more general than another if it subsumes all of the rules that the other subsumes plus some additional rules (not necessarily in the training set) that the other docs not subsume. Each operator in Table 5.7 affects the generality of a rule to a different degree. The generalization or specialization power of an operator is a qualitative measure of the magnitude of this effect. 
ID Description Type Power O PI OP2 OP3 OP4 Change cover item to Kleenc star Append interpretation to cover item Remove selector from interpretation Append value to selector Generalization Generalization Generalization Generalization High Medium Low Very Low OP5 O P 6 OP7 O P 8 Change Klcene star to cover item Remove interpretation from cover item Append selector to interpretation Remove value from selector Specialization Specialization Specialization Specialization High Medium Low Very Low Table 5.7: Search Operators O P i is a high-power generalization operator. Any text message that is matched by a rule before applying this operator will also be matched after the operator is applied. There will also be many additional text messages (not necessarily in the 66 training set) that will be matched. OPS is a high-power specialization operator that replaces a Kleene star in a rule pattern with a cover item. It is the inverse of OPi. OPS is a medium-power generalization operator that appends an interpretation as one of the disjuncts of a cover item. This will allow the cover item to subsume more words in a text message. OP6 is a mediunvpower specialization operator that removes one of the interpretations from a cover item. It is the inverse of OPS. OP;? is a low-power generalization operator that removes one of the selectors from an interpretation. Since an interpretation is a conjunction of selectors, removing a selector allows the interpretation to subsume more words. O P I is a low-power specialization operator that adds a selector to an interpretation. It is the inverse of OPS. OP4 is a vcry-low-power generalization operator that appends a value to a selec tor. Since a selector contains a disjunction of values for an attribute, this operator has the effect of allowing the selector to evaluate true for more attribute values. OP8 is a vcry-low-power specialization operator that removes one of the values from a selector. It is the inverse of OPJ. 5.5.2 Finding a Set o f Consistent Rules The search for a set of consistent rules is conducted using a strategic application of some of the operators that were described in the last section. The starling point for this search is an example rule that is selected using the criteria described in Section 5.3. The search terminates if one of the following conditions becomes true: 1. If the most general rule for the seed is consistent with the training examples, the search terminates and the most general rule is returned. 2 . If the number of node expansions after the first consistent rule has been found is more than M AXITBR, then the search is terminated and the consistent rules are returned. 3. If the current node expansion does not yield any viable search paths and some consistent rules have previously been found, then the search terminates and these consistent rules are returned. 67 4. If there are no more nodes to expand and some consistent rules have been found, then the search terminates and these consistent rules are returned. 5. If there are no more nodes to expand and no consistent rules have been found, return the seed. Merlin conducts the search for consistent rules in a general to specific manner. For each iteration of the search, a node (i.e. rule) is expanded (i.e. specialized) in different ways using some of the previously described search operators. Rules are consistent if they do not subsume any negative examples and have no misaligned subsumptions with positive examples. 
The first stopping condition occurs when the most general rule for the seed example is consistent with the training examples. This condition almost never occurs in practice. If the most general rule is consistent, there is no need to search for more specialized rules. The second stopping condition limits the number of node expansions after the first consistent rule is found. This allows Merlin to continue the search after finding the first consistent rule, white placing a bound on the execution time after this point. The third stopping condition is intended to accelerate the second condition when the search hits a dead-end. A dead-end occurs when a node expansion docs not yield any viable search paths. In other words, the search operators cannot be applied in such a way as to find a new rule that is better than the current rule. If this occurs after some consistent rules have been found, the search is terminated. The fourth and fifth stopping conditions occur when there arc no unexpanded nodes left to expand. If consistent rules have been found, they are returned, oth erwise the seed rule is returned. If the fifth stopping condition is reached, the generalization algorithm has failed. The complete rule generalization algorithm for Merlin is shown in Table 5.8. Stopping conditions are abbreviated SC and shown where they occur. In Sec tion 5.5.3, the method for finding the most general rule for a seed example (Table 5.8, Line 3) is described. At each iteration of the generalization algorithm, one of the unexpanded rules is selected for expansion (Line 9). The method for rule selection directly affects the bias of the learner. It was decided to bias Merlin toward finding consistent rules that 68 are as general as possible. This is implemented by having Merlin select the rule that subsumes the most number of positive examples. This keeps the set of uncxpanded rules relatively general. Node expansion takes place on Line 1 0 using two types of specialization operators {OPS and OPT). The procedure AddCovers is described in Section 5.7 and the procedure AddSeleetors is described in Section 5.8. In Line 11, the consistency of rules resulting from the current node expansion is compared with the consistency of the rule that was expanded. A rule's consistency is defined to be the sum of the number of negative examples it does not subsume and the number of positive examples it does not have a misaligned subsumption with. As was mentioned previously in the context of the third stopping condition, certain rules cannot be expanded into more consistent rules. Since the search in this rule space can overlap to a great degree, these rules arc marked as bad and stored in the B A D SE T rule set. If any of these rules arc reached again during the search, they arc discarded so that dead-end search paths are not repeated. The size of the set of unexpanded rules is limited by the beam width of the search in Line 23. Rules that subsume the least number of positive examples arc discarded until the set size equals the beam width. 5.5.3 Finding th e M ost General R ule The seed example is first generalized by repeated applications of operator OPI to all cover items that are not referenced by object predicates or object modifiers. Any two adjacent Klecne stars in a rule pattern can be replaced by a single Klcene star without changing the meaning of the pattern. As an example, consider the text in Table 5.9. 
If this text were given as a positive example of a joint venture, an example rule R1 could be created using the rule pattern in Table 5.10. There is one cover item in P1 for each word in the message. The definitions of these cover items were given in Table 4.6. Let us assume that P1 contains two reference cover items: C1 and C11. Pattern P2 (shown in Table 5.11) results from applying operator OP1 to all non-reference cover items in P1. The rule R2 has P2 as its rule pattern and the same object predicates and modifiers as R1.

Procedure Generalize(SEED, BEAMWIDTH, MAXITR)
 1. Let CONS be a set of consistent rules.
 2. Let ICONS be a set of inconsistent rules.
 3. Let MAXGEN be the most general rule for SEED.
 4. if MAXGEN is consistent then return {MAXGEN}. (SC1)
 5. CONS = {}.
 6. ICONS = {MAXGEN}.
 7. NUMITER = 0.
 8. Repeat until done
 9.   Let BEST be the best rule in ICONS.
10.   OPSET = AddCovers(BEST, SEED) U AddSelectors(BEST, SEED).
11.   Remove rules in OPSET that are less consistent than
12.     BEST.
13.   OPSET = OPSET - BADSET.
14.   if OPSET is empty then
15.     BADSET = BADSET U {BEST}.
16.     if CONS is not empty then return CONS. (SC3)
17.   else
18.     Let OPCONS be the set of consistent rules in OPSET.
19.     CONS = CONS U OPCONS.
20.     OPSET = OPSET - OPCONS.
21.     if CONS is not empty then NUMITER = NUMITER + 1.
22.     if NUMITER > MAXITR then return CONS. (SC2)
23.     ICONS = ICONS U OPSET.
24.     Limit size of ICONS to BEAMWIDTH.
25.   if ICONS is empty then
26.     if CONS is not empty then return CONS. (SC4)
27.     else return {SEED}. (SC5)
Table 5.8: Merlin Generalization Algorithm

ALCOA AGREED TO FORM A JOINT VENTURE, AK METALS, WITH KOBE STEEL
Table 5.9: Example Text Message

P1 = {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11}
Table 5.10: Rule Pattern for the Text Message in Table 5.9

P2 = {C1, *, C11}
Table 5.11: Result After First Generalization Step

The cover items C1 and C11 are then generalized by repeated application of operator OP3, which removes a selector from an interpretation. All selectors are removed except for one. The one that remains is domain specific. For the MUC-5 EJV domain, the concept attribute selector remains if its value is MUC5-COMPANY; otherwise, the category attribute selector is retained. Using this method to remove selectors from pattern P2 results in pattern P3, shown in Table 5.12. The rule R3 has P3 as its rule pattern and the same object predicates and modifiers as R2.

P3 = {(concept = company), *, (concept = company)}
Table 5.12: Most General Pattern for the Pattern in Table 5.10

Rule R3 subsumes rule R1 and has no misaligned subsumptions with R1. Also, we say that R3 is the most general rule for R1 and that R1 is the seed for R3. In general, the most general rule may be misaligned with its seed. This misalignment will eventually be taken care of, since the most general rule will be specialized in subsequent operations.

5.6 Attribute Dependency Graphs

At each iteration of the search for consistent rules, the current rule is specialized by replacing Kleene stars with cover items and adding selectors to existing cover items. The determination of which attribute to add a selector to is guided by the Attribute Dependency Graph (ADG) for the domain. Use of the ADG greatly reduces the amount of search that is performed. The existence of certain object attributes depends on the values of other object attributes. For example, consider the category of a word. If the word is a verb, then there will be an attribute for tense, but no attribute for number.
If the category is noun, then the opposite will be true. An attribute whose value determines the existence of other attributes is called a pivot attribute. Pivot attributes m ust be nominal-valued (i.e. they can be assigned one of a finite set of values). 72 An ADG is a directed graph with one node for each object attribute. Each node is labeled with the name of the attribute it represents. There is an edge from one node to another for every value of the source node’s attribute that can coexist with the attribute of the destination node in an interpretation. An example ADG for the MUC-5 EJV domain is shown in Figure 5.1. Since there are many values for some of the attributes, only one edge is drawn to represent all of the edges between two nodes. The values (edges) for each of these composite edges are shown in Table 5.13. The value All means that there is an edge for all values of the source node’s attribute. number concept cntegoiy lexical root tense O pivot O non-pivot agreement Figure 5.1: Attribute Dependency Graph for MUC-5 EJV 73 Edge Values C l E, J, Bl, B, Bd, Bs, Bo, Br, C, Nc, Np C 2 B, Bd, Bs, Bo, Br, Q, T, D, Fc, Nc, Np, Month, Date C 3 All c< All Cs Vm, Vs, Vh, Vb C« All C 7 All e8 All c» Present, pp2, pp3 Table 5.13: Edge Definitions for Figure 5.1 Merlin uses the ADG to determine the search order for attributes. Selectors on pivot attributes are searched before selectors on non-pivot attributes. Pivot nodes in the ADG are assigned weights according to the algorithm in Table 5.14. Selectors on attributes that have nodes in the ADG with higher weights are searched before selectors on attributes with lower weights. Let N be the number of pivot nodes. Let W[AT] be an array of pivot node weights. For * ■ 1 to N , fV[t] = 0. For S * 1 to N , For D ■ 1 to N , if S ^ D , Let P be the percentage of values of the attribute of node S that have edges terminating at node D . W[D) = W[D] + P . Table 5.14: Algorithm for Pivot Node Weight Assignment There are two pivot nodes in the ADG of Figure 5,1: category and tense. The weight of the category node would be 1 .0 because all of the values of the tense node 74 have edges to category. The weight of the tense node would be 0.13 because only 4 of the 31 values of the category node have edges to tense. Therefore, category attribute selectors would be searched before tense attribute selectors. 5.7 A dding Cover Item s At each iteration of the search for consistent rules, the current rule is expanded in different ways using operator OPS. OPS is a high-powcr specialization operator that replaces a Kleene star in the rule pattern with a cover item. This specialization is guided by the seed rule from which the rule was constructed. Consider the pattern of rule R \ shown in Figure 5.2. The single cover item , Ci, in this rule was derived from and subsumes its seed cover item Cf,. Cover item Cat subsumes C„, Cei subsumes C 7 c, and Cdi subsumes Cd. Replacement of Kleene stars with cover items is done in such a way as to ensure that the specialized rule will be subsumed by the seed rule. The pattern of rule R \ cannot be specialized to { *, Ci, C d } because there is no mapping for Cd. The first Kleene star in the pattern of Ii\ would map to C„. C\ would map to Cs and Ce* would m ap to Ce, leaving no cover items in R i for Cd. The cover items that arc added to R i are generalizations of their associated seed cover items. For example, C0j is a generalization of Ca. 
Analogous to the way cover items arc associated with seed cover items, each interpretation is associated with (i.e. derived from) a seed interpretation. Each of the new cover items in the specialized rule has the same number of interpretations as their respective seed items. Each interpretation has one of the selectors from its associated seed interpretation. In this way, the new cover item is as general as possible. The selector that is chosen for each interpretation is determined by the ADG for the domain. If the seed interpretation has one or more pivot attribute selectors, the one with the highest weight is selected. The case where there are no pivot attribute selectors is a bit more complicated, in that all the attribute selectors from all interpretations are combined in every way possible to create a set of specialized rules. To simplify this discussion, it will be assumed th at all interpretations for all seed cover items have at least one pivot attribute selector. This is true for the MUC-5 EJV domain where every word has an attribute selector for the category 75 Pattern of seed rule of R t: { Ca, Cb, Q., Cd) is the seed item for C Pattern of rule Rp* { ** C If * } I Cal, •, C „ * } { * Cal, C „ * } Assumption: All complexes of cover items in seed rule have pivot attribute selectors. Figure 5,2: Specialization of Rule Pattern by Replacing Kleene Stars with Cover Items attribute. Table 5.15 shows a procedure for generalizing a seed cover item based on this assumption. Procedure QenSeedCover(SeedCover) Let QenCover be an empty cover item. For every interpretation X in SeedCover, Create a new selector S by copying the highest veight pivot attribute selector in X . Create a new interpretation in QenCover that contains only S . Return QenCover. Tabic 5.15: The GcnSecdCovcr Procedure The algorithm for specializing a rule by replacing a Kleene star with a cover item is shown in Table 5.16. Each rule was derived from an original seed, which is one of the training examples. Each cover item in a rule was derived from and subsumes an associated cover item in the rule’ s seed. Kleene stars in the rule pattern match cover items in the seed that arc not assigned to cover items in the rule pattern. For example, in Figure 5.2, the first Kleene star of the current rule matches C* and the second Kleene star matches Cc and Cd- 5.8 Adding Selectors At each iteration of the search for consistent rules, the current rule is expanded in different ways using operator OPI. O PI is a low-power specialization operator that adds a selector to an interpretation. This specialization is guided by the seed from which the rule was constructed so that any specialized rules will still subsume the seed. Consider the simple rule pattern shown in Table 5.17. Interpretation l\ in the rule pattern was generalized from interpretation I3 in the seed. Likewise, interpretation I-x in the rule pattern was generalized from interpretation I4 in the seed. Table 5.18 illustrates how a rule with this pattern can be specialized. 77 Procedure AddCovers (CurrRute) Let RuleSet be an empty set of rules. For each Kleene star in the pattern of CurrRute, For each cover item, Seedltem, in the seed pattern that is matched by this Kleene star, GenCover * GenSeedCover(5eerf/fcm). Create a new rule by replacing the Kleene star with { *, GenCover, * } If this rule is subsumed by CurrRute, append it to RuleSet. Create a new rule by replacing the Kleene star with { *, GenCover } If this rule is subsumed by CurrRute, append it to RuleSet. 
Create a new rule by replacing the Kleene star with { GenCover, * } If this rule is subsumed by the CurrRute, append it to RuleSet. Create a new rule by replacing the Kleene star with { GenCover } If this rule is subsumed by CurrRute, append it to RuteSet. Return RuteSet. Table 5.16: The AddCovers Procedure 78 Rule Pattern — { *j Ci, * } C\ ss I t 0 12 I\ ss (category = V) h =2 (category = Nc) S E E D \ = Seed item for C\ — h © h Is st (category = V) A (tense = r) A (agreement = n3sg) A (root = light) A (lexical = light) I4 ss (category = Nc) A (number = sg) A (concept = thing) A (root = light) A (lexical = light) Table 5.17: Example Rule Pattern and Seed Item C n = In © I2 h i = (category C 12 = I 12 © I 2 I 12 = (category C13 = / l3 © I2 I 13 = (category C u — Ii © h i h i = (category C n = /» © /aa h i = (category C m — I \ © I23 I 2 3 = (category Cir — I\® h i h i - (category V) A (tense = r) V) A (lexical = light) V) A (root = light) Nc) A (number = sg) Nc) A (concept = thing) Nc) A (lexical = light) Nc) A (root ss light) Table 5.18: Specializing Rule Patterns by Adding Selectors to Interpretations 79 Each of these specialized cover items (Cu through C\r) is subsumed by C\. They differ from C\ in that they have one more selector in one of their interpretations. The selector that is added to each interpretation is determined by following the edges of the ADG for the domain. The procedure for doing this is shown in Table 5.19. 5.9 Comparison with Other Learners Inductive learning systems can be characterized by the representation that they use for training examples and hypotheses (i.e. concepts that are learned). Merlin uses the same representation for examples and hypotheses - the rule structure described in Section 3.2. No other known learners use a representation with this complexity for cither examples or hypotheses. In this section wc relate Merlin to other learning algorithms by comparing the representation used for examples and hypotheses and the search strategy that is employed. 5.9.1 Training Example Representation Most inductive learners use an attributc-valuc representation for training examples. Learners such as ID3 [59], GID3 [19], ASSISTANT [3], CART [1], 1R [27], AQVAL/ 1 [41], AQ15 [43], AQ17-HCI [72], CN2 [ 6], Rigcl [23], GABIL [13], GREEDY3 [55], and GROVE [55] all learn from examples such as those shown in Table 5.20. These examples arc taken from the first of the MONK’s problems [17] which arc in an artificial robot domain. Each example takes on a single value for each attribute (e.g. Head, Body) and no negated values are permitted. Some learners permit missing values for some of the attributes. The attributes of Table 5.20 are nominally-valued (i.e. they can take on one of a finite set of values). Some domains have continuous-valued attributes and others may have a combination of nominal-valued and continuous-valued attributes. Other learners represent training examples as predicate or relation arguments [60] [56] [51] [52] [7]. These arguments can sometimes take the form of simple attribute-values as in the previous examples, or they can be more complex as shown in Table 5.21. These examples are taken from the list domain which has been studied 80 Procedure Adds elect or aiCurrRute) Let RuleSet be an empty set of rules. 
For each non-reference cover item in the pattern of CurrRute, For each interpretation in the cover item, For each pivot attribute selector in the interpretation, For each value of this pivot attribute selector, For each attribute node in the ADG that there exists an edge from the pivot attribute with this value. If the interpretation does not have a selector for this attribute, create a new rule by adding the selector from the seed item to CurrRule and add this new rule to RuleSet. If the rule could not be specialized using the ADG then. For every attribute that the seed item has a selector for but CurrRule does not, Create a new rule by adding this selector to CurrRule and add this new rule to RuleSet, Return RuleSet. Thble 5.19: The AddSelectors Procedure 61 Class Head Body Smile Hold Color Tie POS POS NEG NEG round octagon round square round square square round no yes yes no sword sword sword flag yellow red blue yellow no no yes no Table 5.20: Examples Using Attribute*Value Representation by several researchers [65] [6 6] [64] [51] [60]. In these examples of the list class, there is but one argument. Class Argument list 0 list (a ) list (a,b) list (a,(b,c),d) Table 5.21: Examples Using Predicate Argument Representation As an extension to representing examples as sets of predicate arguments, some learners incorporate additional background knowledge, which is usually represented as Horne clauses with type restrictions on the arguments [56] [7]. This background knowledge, together with the training examples, is sometimes referred to as a domain theory [56]. Predicates that have arguments defined by other predicates are said to be non-operational Predicates not defined by other predicates arc said to be operational An example of an operational predicate would be Human(Socrates). An example of a non-operational predicate would be M ortal(x) < — H um an(x). Training examples for Merlin are represented using the RUDIES rule structure which is more complex than the attribute-value or predicate argument represen tations. At first glance, it may seem possible to transform these examples into attribute-value notation. There are several problems with this approach. The first problem is that cover items consist of exclusive disjunctions of interpretations. Con verting a cover item to disjunctive normal form (DNF) would introduce negated 82 attribute values2 which existing learners (e.g. 1D3, AQ) arc ill-cquippcd to handle. In some cases, the negation of an attribute-value can be eliminated. For example, a negated attribute-value ->(color — red) can be replaced by (color — blue V green) if the only possible values for color are red, blue, and green. Likewise, negated con- tinous attribute-values can be replaced by range tests. For example, -<(temp = 87) can be replaced by (temp > 87) V (temp < 87). The problem is with attributes that can take an infinite number of possible values and whose values cannot be ordered. An example of this type of attribute-value would be ->(roof = country). Assume that the only other values seen for this attribute in the training examples were person and car. It would not be correct to replace the negated attribute-value with (root = person V car). This is because there could be another example (possibly in the test set) that has a value airplane for this attribute. This example would be consistent with the negated attribute-value, but not the converted one. 
A second problem is that even if learners were made to handle negated attribute values, there would be a combinatorial explosion in the number of training examples. This problem was illustrated by Quinlan [60] when he argued the need to move away from attribute-values to a more powerful representation. To our knowledge there arc no learning systems that use a representation for training examples with a complexity comparable to the RUBIES rule structure. 5.9.2 Hypothesis Representation Inductive learners use a wide variety of representations to represent the hypotheses that are learned. One of the more common representations is the decision tree [59] [19] [3] [1], Eadi node in the tree is a test on an attribute's value except for leaf nodes, which are class labels. Each outgoing branch from a node corresponds to one of the possible outcomes of the test at that node. An object is classified by following a path through the tree from the root to a leaf node, letting the test at each node determine which branch to follow. 7a © b = (a A ->6) V (-vi A t), 83 Other learners represent hypotheses using a disjunctive normal form (DNF) that allows internal disjuctions [23] [ 6] [41] [43] [72] [13]. Some example hypotheses us* ing this representation are shown in Table 5.22. Merlin uses this extended DNF representation for cover items in rule patterns. plane * — (has.wings ss yes) car « — (carriespassengers = yes) V (doors = 4) iruck < — (carries passengers — no)V ((size — large V medium) A (weight = heavy)) Table 5.22: Example Hypotheses Represented with Extended DNF Neural network learners store the hypothesis within the network, so we could say that the representation is the network configuration (i.e. number of hidden units, how units arc connected). Quite often, the network is crafted for a particular problem using trial and error to determine the best number of hidden units to use and how to connect them. Typically, there is an input unit for each of the attributes of the training examples and an output unit for each class [48] [71]. Merlin represents hypotheses as RUBIES rules. There are no known machine learning systems that use a representation with this complexity. There arc, however, some knowledge acquisition tools that generalize patterns similar to those in the RUBIES rules. We make a distinction between the fields of machine learning and knowledge acquisition. Each field has a different focuB and most research is still done in isolation from each other. The differences between these fields have been described by Tecuci as follows: The focus of knowledge acquisition has been to improve and par* tially automate the acquisition of knowledge from human experts. In contrast, machine learning focuses on mostly autonomous algo rithms for acquiring or improving the organization of knowledge, often in simple prototype domains. Also, in knowledge acquisition, the acquired knowledge is directly validated by the expert that ex presses it, while in machine learning, the acquired knowledge needs an experimental validation on data sets independent of those on which learning took place. As machine learning moves to more 84 complex, “real” domains, and knowledge acquisition attem pts to au- tom ate more of the acquisition process, the two fields increasingly find themselves investigating common issues with complementary methods. [6 8 ] In Chapter 1, we described two information extraction systems th at used patterns in a m anner similar to RUBIES. 
The patterns used by Diderot [16] and CIRCUS [37] are anchored to the text by a single word and the pattern is built around this anchor. Thus, these systems arc faced with the problem of selecting anchor words. The rules learned by Merlin and used by RUBIES arc not anchored to any single word. They arc instead anchored by the syntactic attributes of words and the semantic concept that the word maps to. In the MUC-5 EJV domain, words that m apped to company concepts were used as pattern anchors. Thus, the problem of selecting numerous anchor words has been replaced with the problem of creating anchor descriptions. In general, there should be many fewer anchor descriptions than anchor words required. The GLS entries of Diderot arc sim ilar to the rule patterns of RUBIES in th at both allow pattern items to be separated by Klccnc stars, thus extending the gen erality of the pattern. One of the cospec fields of the GLS of Table 1.1 is shown in Table 5.23. The self argument is matched by a form of the verb establish and the arguments A l, A2, and A3 arc noun phrases that arc organizations. In general, this cospec would m atch a sentence of the form “A l established A2 with A3” and the extracted information would be that there is an existing tie-up between A l and A3 called A2. [Al, *, self, *, A2, *, with, A3] Table 5.23: Example GLS Cospec [70] Since RUBIES does not do any phrasal parsing, this cospec could not be directly represented using a RUBIES rule pattern because the items cannot be m atched to noun phrases. Instead, the arguments A l, A2, and A3 would have pattern items that m atch words that were company names (i.e. organizations). The word with in the 85 cospec would be represented by & RUBIES pattern item that had a test on the lexical attribute of a word. In general, RUBIES allows a more expressive representation for its pattern items than Diderot allows for its cospec arguments. The concept nodes learned by AutoSlog [62] for the CIRCUS system use a rep resentation much different from that used by RUBIES or Diderot. Like Diderot, there is an anchor word that triggers the pattern. Instead of a list of pattern items, some of which may be Kleene stars, the concept node uses a set of predicates as the pattern condition. An example concept node for the trigger-word murder is shown in Table 6.24. Name: TVigger: Variable SlotB: Constraints: Constant Slots: Enabting Conditions: node*l murder (perpetrator (*S* 1)) (class perpetrator *S*) (type perpetrator) ((active) (trigger-preceded-by? ’to ’threatened)) Table 5.24: Example Concept Node [62] The predicates arc listed in the enabling conditions field. In this case, there arc two predicates. The first tests if the word “murder” is active. The second predicate tests if the two words preceding “murder” arc “threatened to.” The constant slot determines the type of information object to map to - in this case, a perpetrator object. The variable slot determines how to fill the perpetrator slot of the perpetrator object. In this case, the subject (*S*) of the clause is the perpetrator. For example, > f the input sentence were “Terrorists threatened to murder the mayor," this concept node would be activated and “Terrorists” would be the perpetrator. The variable slots are analogous to the object modifiers of RUBIES rules and the combination of enabling conditions and trigger are analogous to the RUBIES rule pattern. AutoSlog uses separate enabling conditions for each linguistic pattern. 
In this case, the linguistic pattern for the concept node was subject-verb-infinitive. In general, there are tests for active, passive, and infinitive verb forms, gerunds, and 86 auxiliary verbs. This is very close to the category attribute used by RUBIES. The only other enabling conditions permitted are tests on preceding words. Thus, the rule patterns of RUBIES are generally more expressive than the concept nodes of CIRCUS. However, CIRCUS does perform phrasal parsing and RUBIES does not - RUBIES cannot identify the subject of a clause. 5.9.3 Search Strategy Inductive learners can also be characterized by the search strategy they employ. Merlin searches for consistent rules by performing a beam search from general to specific. The AQ algorithms also perform a beam search from general to specific. One major difference is in the operators that are used to move from one state to another. Because the hypothesis representation is different, the operators are dif ferent. The AQ algorithms use the VL1 representation which is almost identical to that used by Merlin for pattern items. Interpretations in Merlin rule patterns arc the same as complexes in VL1. The difference is that pattern items in Merlin arc exclusive disjunctions of interpretations and covers of AQ arc inclusive disjunctions of complexes. Because of the representation differences, AQ algorithms do not use the higher power search operators that Merlin uses (e.g. replacing a Kleene star with a pattern item). Decision tree algorithms usually use an algorithmic search strategy to search from small trees (i.e. small number of nodes) to large trees (i.e. large number of nodes). For example, ID3 [59] uses an information-based heuristic to select the next node to add to a tree. Another characteristic of learning algorithms is the way in which search states are evaluated. AQ algorithms judge the goodness of a search state (i.e. concept de scription) by the number of positive examples it covers and the number of negative examples it does not cover. If the examples are expressed in attribute-value repre sentation, this is a straightforward process of plugging in the values into the concept description and seeing if it evaluates true. This becomes much more complex when the training examples are expressed in a complex representation such as RUBIES rules. Instead of plugging values into a formula, Merlin performs a subsumption test 87 to tell if one concept is more general than another. This subsumption test replaces the insertion of attribute*values in a formula. 5.10 Conclusion In this chapter, we presented the Merlin learning algorithm and gave examples of its application to the MUC-5 EJV domain. The most novel aspect of Merlin is the complexity of the hypotheses and training examples. No other learning systems use such a complex representation. In other words, Merlin can learn more complex concepts from more complex examples than existing learners can. Another notable characteristic of Merlin is the way in which search states arc evaluated. Each search state is a RUBIES rule. The quality of a state is evaluated by testing its subsumption with the training examples, which are also rules. This is a great departure from methods previous learners used for state evaluation, which only require that attribute-values be plugged into an expression. Merlin was shown to reduce the amount of search that is performed by making use of the A ttribute Dependency Graph (ADG). 
This graph is used to start searching attributes whose values determine the existence of other attributes. 88 Chapter 6 EXPERIMENTAL RESULTS 6.1 Introduction In previous chapters, we have argued that a rule set for an information extrac tion system can be automatically constructed from a set of example messages and answer-key templates and achieve performance comparable with that of a manually- constructed rule set. In this chapter, we support this argument with experimental rcsultB from the MUC-5 English Joint Venture domain. Performance of information extraction systems is somewhat difficult to evaluate. When a system's output is compared with the answer-key templates, a variety of conditions can occur. There is missing information, spurious information, correct, incorrect, and partially correct information. A variety of performance measures have been developed for MUC that combine these conditions in different ways. In Section 6.3, we describe six of these measures that will be used to evaluate the performance of Merlin/RUBIES. The MUC-5 system that is closest in design to RUBIES is the University of Southern California (USC) system because they share the same preprocessor and both are pattern-based. It will be shown that on four of the six performance measures (error rate, undergeneration, substitution, and recall), Merlin/RUBIES performs better than the USC MUC-5 system. For two of these measures (overgeneration and precision), the USC system performs better. These results are presented in Section 6.4. In Section 6.5, we show how Merlin/RUBIES performs compared to the other MUC-5 systems. 89 Further experiments have been conducted to determine the effect that varying the num ber of positive and negative examples has on system performance. These results are presented in Sections 6 .6 and 6.7 respectively. Before presenting these results, we describe how the experiments were performed. 6.2 D escription o f E xperim ents All of the training examples used in the experiments were culled from the MUC-5 EJV training corpus. Each training message consists of a single sentence. Positive examples have a set of correctly filled tem plates associated with them. Positive ex amples were selected by looking through the corpus and identifying messages which had sentences that generated and/or modified templates. The negative examples were constructed from the remaining sentences in these messages th at did not gen erate and/or modify templates. A subset of the MUC-5 EJV tem plate objects and slots was used in the experiments and is shown in Table 6.1. Object Slot* tic-up-rclation status, entity, joint-vcnturc entity name, type Table 6.1: Template Objects and Slots Used In Experiments Training sets for each of the experiments were constructed using the method described in Chapter 4. Each of the training sets was provided to Merlin, from which it learned a rule set. These rule sets were used by RUBIES to extract information from a set of unseen text messages. The information was output in the form of tilled tem plates. A scoring program generated score reports for these response tem plates relative to a set of answer keys. A flow chart for this experimental process is shown in Figure 6.1. In all of the experiments, the beam width (BEAMWIDTH) of the search was set at 200. There is a direct relationship between beam width and learning tim e - the greater the beam width, the longer the learning time. 
We found that increasing the beam width of the search improved the quality of the rules that were learned, but with diminishing returns. We found that increasing the beam width past 200 did not improve the rule quality enough to justify the much longer learning time. In all of the experiments, the maximum number of iterations (MAXITR) was set at 10. This parameter determines how much searching is done after the first consistent rule is found. In most cases, continuing past 10 iterations did not seem to help much, so 10 was selected as the limit. Part 1 of the MUC-5 EJV final test was used for testing so that the performance of Merlin and RUBIES could be compared with the USC MUC-5 system [54]. A total of 11 experiments were conducted using different quantities of positive and negative examples.

[Figure 6.1: Flowchart of the experimental process. Training messages and templates feed the training set constructor; Merlin learns rules from the resulting training sets; RUBIES applies the rules to the test messages to produce filled templates; and the scoring program compares the filled templates against the template keys to produce score reports.]

6.3 Performance Measures

Nearly all of the previous experimental work on inductive learning has measured system performance as the accuracy of a learned classifier on a set of unseen objects (i.e. the test set). This assumes that the performance element is a simple classification system, which is not the case with information extraction (IE) systems. Measuring the performance of these systems is much more complex, and several measures have been utilized. Before discussing these measures in more detail, we will consider the possible outcomes for raw scoring of template object slots.

Consider a template slot filled by RUBIES. There are four possible scores for this fill. If the slot fill matches the answer key, it is correct. If it partially matches the key, it is partial. If it does not match the key, it is incorrect. There is a set of scoring rules for the MUC-5 EJV domain that gives guidelines for discriminating between these scores. An additional situation occurs when there is no answer key for this response. In this case, the slot fill is scored as spurious (i.e. the IE system should not have generated a response). Conversely, if there is an answer-key slot that should have been filled for which there was no response, it is scored as missing. Two additional parameters can be derived from these raw scores. The numbers of possible and actual responses are defined as follows:

possible = correct + partial + incorrect + missing
actual = correct + partial + incorrect + spurious

Messages are scored by counting the number of correct, partial, incorrect, spurious, and missing slots for each template object. Results for all test messages are combined by adding together all of the scores for each slot. In addition to slot scoring, template objects are also scored. The scoring program has a matching algorithm that determines which response template object is to be scored against which key template object [4]. Response objects that can be mapped to key objects are scored as correct. Response objects that cannot be mapped to key objects are scored as incorrect if there are leftover key objects that have not been mapped to response objects. Response objects that are not scored as correct or incorrect are scored as spurious. Key objects that do not have a mapped response object and do not have a response object scored as incorrect for them are scored as missing.
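To make the bookkeeping concrete, a small sketch of how raw slot scores might be tallied is shown below. It is hypothetical: the real MUC-5 scoring program also performs the object-matching step described above, and the input format (a list of score labels, one per scored slot) is an assumption of this sketch.

```python
RAW_SCORES = ("correct", "partial", "incorrect", "spurious", "missing")

def tally_slot_scores(slot_scores):
    """Tally raw slot scores and derive the possible/actual counts.

    slot_scores is assumed to be an iterable of strings drawn from RAW_SCORES,
    accumulated over all slots of all test messages.
    """
    counts = {name: 0 for name in RAW_SCORES}
    for score in slot_scores:
        counts[score] += 1
    counts["possible"] = (counts["correct"] + counts["partial"]
                          + counts["incorrect"] + counts["missing"])
    counts["actual"] = (counts["correct"] + counts["partial"]
                        + counts["incorrect"] + counts["spurious"])
    return counts
```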
For the Fourth Message Understanding Conference (MUC-4), recall and precision were used as performance measures. Each is expressed as a percentage from 0 to 100, with higher percentages being more desirable. They are defined as follows:

recall = (correct + (partial/2)) / possible
precision = (correct + (partial/2)) / actual

Roughly speaking, recall is a measure of how much of the available information the system can extract. Precision measures the accuracy of the responses. The recall score is penalized by missing information and the precision score is penalized by spurious responses.

A different set of performance measures was used for the Fifth Message Understanding Conference (MUC-5). The primary measure was system error, which is defined as follows:

error = (incorrect + (partial/2) + missing + spurious) / (correct + partial + incorrect + missing + spurious)

The error measure is the ratio of the number of wrong responses to the total responses. Three secondary measures, undergeneration, overgeneration, and substitution, were also used and are defined as follows:

undergeneration = missing / possible
overgeneration = spurious / actual
substitution = (incorrect + (partial/2)) / (correct + partial + incorrect)

Undergeneration is the percentage of possible responses that were missing, and overgeneration is the percentage of actual responses that were spurious. Substitution ignores spurious and missing responses.

The variety of performance measures makes it difficult to conclude whether or not one system is "better" than another. For all of the experiments that were performed, we analyzed each of these measures.

6.4 Comparing Merlin/RUBIES with the USC MUC-5 System

Ideally, we would like to compare the quality of the rules learned by Merlin to the quality of the rules that were manually generated. To do this, we would need to provide each rule set to the same performance element. Unfortunately, the rules for the USC MUC-5 system were "hard-coded" and cannot be directly replaced. It was for this reason that RUBIES was constructed. Thus, we can only compare the combined performance of Merlin and RUBIES with that of the USC MUC-5 system.

Nevertheless, there are many similarities between the two systems. Both use the same front end - the preprocessor that performs dictionary lookups. Both are sentential systems (i.e. they process one sentence at a time) and perform a pattern matching operation on the input text. A major difference between the two is that the USC system also performs phrasal parsing prior to matching its rules and RUBIES does not. This should give the USC system an advantage over RUBIES.

The rules for the USC MUC-5 system were manually generated and took two students approximately three months to complete. This was a tedious process - texts and templates from the training corpus were read and analyzed, patterns were proposed and evaluated, and rules were constructed and coded. The final system was tested on a set of novel messages along with the other MUC-5 participants.

Merlin was trained from a set of 191 positive example messages/templates and 1389 negative example messages. Some of the 280 rules it learned are shown in Appendix A. RUBIES used these rules to extract information from the same novel text messages that were used for testing the USC MUC-5 system. The resulting output templates were scored using the same scoring program that was used for the USC system.

The error measures for the five slots for which Merlin learned rules are shown in Figure 6.2.
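Before turning to those results, note that the measures themselves are straightforward to compute once the raw scores have been tallied. The sketch below is again illustrative (the function name and the count representation are assumptions, not the dissertation's actual scoring code); it implements the MUC-4 recall/precision and the four MUC-5 measures defined in Section 6.3, returning each as a percentage.

```python
def muc_measures(c):
    """Compute MUC performance measures from raw counts.

    c is a mapping with keys correct, partial, incorrect, spurious, missing.
    All results are returned as percentages (0-100).
    """
    possible = c["correct"] + c["partial"] + c["incorrect"] + c["missing"]
    actual = c["correct"] + c["partial"] + c["incorrect"] + c["spurious"]
    total = possible + c["spurious"]          # correct+partial+incorrect+missing+spurious
    matched = c["correct"] + c["partial"] + c["incorrect"]

    def pct(num, den):
        # Guard against empty score sets.
        return 100.0 * num / den if den else 0.0

    return {
        "recall": pct(c["correct"] + c["partial"] / 2.0, possible),
        "precision": pct(c["correct"] + c["partial"] / 2.0, actual),
        "error": pct(c["incorrect"] + c["partial"] / 2.0 + c["missing"] + c["spurious"], total),
        "undergeneration": pct(c["missing"], possible),
        "overgeneration": pct(c["spurious"], actual),
        "substitution": pct(c["incorrect"] + c["partial"] / 2.0, matched),
    }

if __name__ == "__main__":
    # Arbitrary example counts, for illustration only.
    counts = {"correct": 124, "partial": 12, "incorrect": 30, "spurious": 58, "missing": 40}
    for name, value in muc_measures(counts).items():
        print(f"{name}: {value:.1f}")
```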
Merlin/RUBIES had a slightly lower error rate, lower undergeneration and substitution, and higher overgeneration than the USC system. The recall and precision measures for these slots are shown in Figure 6.3. Merlin/RUBIES had a higher recall and lower precision than the USC system. The raw scores from which these performance measures were derived are shown in Figure 6.4. Appendix B contains error measures, recall/precision measures, and raw scores for the individual slots.

In addition to object slots, we can also evaluate the scores for the matching of template and entity objects to objects in the answer key. Error measures, recall/precision measures, and raw scores for these object matches are shown in Appendix B.

[Figure 6.2: Error Measures for 5 Slots, 191 Pos, 1389 Neg Examples (USC/MUC5 vs. Merlin/RUBIES; error, und, ovg, sub)]

[Figure 6.3: Recall/Precision for 5 Slots, 191 Pos, 1389 Neg Examples (USC/MUC5 vs. Merlin/RUBIES)]

6.5 Comparison with Other MUC-5 Systems

In Figure 6.5, the performance of Merlin/RUBIES is plotted against the range of performance of all of the MUC-5 systems. All six performance measures are shown for all five slots combined. 191 positive and 1389 negative examples were used. The maximum and minimum performance for each measure is marked with a square. Merlin/RUBIES falls about in the middle of the range for error, substitution, and recall. Merlin/RUBIES undergeneration falls in the low end of the range (recall that lower undergeneration rates are better) and precision falls at the very bottom of the range (higher precision rates are better). The only measure where Merlin/RUBIES falls outside of the MUC-5 range is overgeneration. Merlin/RUBIES overgenerates more than any of the MUC-5 systems. Appendix B contains other range graphs for individual slots and object matches.

6.6 Varying the Number of Positive Examples

Several experiments were performed to investigate the effect that the number of positive training examples has on the system performance. For all of these experiments, the number of negative examples was held fixed at 1389.

[Figure 6.4: Raw Scores for 5 Slots, 191 Pos, 1389 Neg Examples (USC/MUC5 vs. Merlin/RUBIES; number of slot fills)]

[Figure 6.5: MUC-5 System Performance Range for 5 Slots (err, und, ovg, sub, rec, pre), with Merlin/RUBIES marked]

Error measures for all five slots are shown in Figure 6.6. We can also see some general trends when greater than 96 positive examples are used.
As the number of positive examples increases, overgeneration increases, undergeneration decreases, and substitution decreases. There is only a slight decrease in error rate. The recall and precision measures for all five slots are shown in Figure 6.7. As the number of positive examples increases, the recall tends to increase without much change in precision. Raw scores for all five slots together are shown in Figure 6.8. Again, some general trends in the data become apparent when greater than 96 positive examples are used. As the number of positive examples is increased, the number of correct responses tends to increase without much change in the number of incorrect and partial responses. The number of spurious responses tends to increase and the number of missing responses decreases. Appendix B contains the raw scores and performance measures for individual slots and objects.

[Figure 6.6: Error Measures for All Slots, 1389 Negative Examples (error, undergeneration, overgeneration, substitution vs. number of positive examples)]

[Figure 6.7: Recall/Precision for All Slots, 1389 Negative Examples (vs. number of positive examples)]

6.7 Varying the Number of Negative Examples

Several experiments were performed to investigate the effect that the number of negative training examples has on the system performance. For all of these experiments, the number of positive examples was held fixed at 191.

Error measures for all five slots are shown in Figure 6.9. The number of negative examples does not appear to have much of an effect on undergeneration and substitution. Error rate and overgeneration do seem to decrease with large numbers (i.e. > 1000) of negative examples. The recall and precision measures for all five slots are shown in Figure 6.10. After an initial decrease, recall seems to level off as the number of negative examples increases. Precision does not change much until more than 1000 negative examples are used, at which point it tends to increase. Raw scores for these slots are shown in Figure 6.11. The most obvious effect of increasing the number of negative examples is a decrease in the number of spurious responses. Appendix B contains the raw scores and performance measures for individual slots and objects.

[Figure 6.8: Sum of Slot Scores, 1389 Negative Examples (possible, actual, correct, partial, incorrect, spurious, missing vs. number of positive examples)]

[Figure 6.9: Error Measures for All Slots, 191 Positive Examples (vs. number of negative examples)]

[Figure 6.10: Recall/Precision for All Slots, 191 Positive Examples (vs. number of negative examples)]

[Figure 6.11: Sum of Slot Scores, 191 Positive Examples (vs. number of negative examples)]

6.8 Conclusion

In this chapter, we presented the results of several experiments in the MUC-5 English Joint Venture domain.
Merlin was shown to automatically learn rules that were used by RUBIES to achieve performance comparable to the USC MUC-5 system, even though the USC system has the advantage of being able to perform phrasal parsing. On four of six performance measures, Merlin/RUBIES exceeded the performance of the USC MUC-5 system.

The rule set learned by Merlin tended to generate many more spurious responses than the manual rule set used by the USC system. We might say that the Merlin rule set was over-general. We speculate that this is a direct consequence of the bias designed into the learning algorithm. Recall that Merlin searches for the most general rules it can find that are consistent with the training examples.

It seemed that increasing the number of negative examples should bring down the number of spurious responses. We conducted experiments in which the number of negative examples was varied and the number of positive examples held constant. It was confirmed that as the number of negative examples increases, the number of spurious responses decreases without a noticeable effect on the other scores.

Increasing the number of positive examples, while holding the number of negative examples constant, seems to have the opposite effect on spurious responses - they tend to increase. We also tend to get more correct scores and fewer missing scores.

Chapter 7

CONCLUSIONS AND EXTENSIONS

7.1 Summary of Contributions

In this dissertation, we have described a novel application of machine learning to the problem of information extraction from natural language text. We have shown how current machine learning systems are limited in the complexity of their training examples and the complexity of the concepts that they can formulate. A new learning system called Merlin was described that overcomes these limitations. A rule set learned by Merlin was provided to the RUBIES information extraction system and shown to perform comparably to the manually-constructed USC MUC-5 system.

This dissertation contributes to both the fields of information extraction and machine learning. A major problem in the field of information extraction is the amount of effort required to build a knowledge base for a new domain. Although there have been several successful attempts to partially automate the knowledge base construction process, fully-automatic machine learning techniques have been elusive. Merlin is the first known system to completely automate this learning task.

In solving this problem, we have extended the capabilities of machine learning systems in several ways. Previously, learners were mainly restricted to representing training examples in an attribute-value language. We have extended the application of machine learning to more complex performance elements which require a richer language for the training examples. Merlin was shown to learn from examples that are expressed in the same language as the concepts that are learned - the rule structure of RUBIES.

Previous work in machine learning has mostly ignored the performance element, which is the reason for learning in the first place. Since the performance task studied was almost always simple classification, the "action" part of the learned concept was just an assignment of the input object to one of the predefined output classes. For more complex performance elements, such as information extraction systems, this concept action is much more complex. The concepts (i.e.
rules) learned by Merlin can create or modify information objects which have attributes and relationships with other objects. This action is much more powerful than simply making a class assignment. The conditional part of the information extraction rules is also more complex than what previous learners have dealt with. The pattern portion of the RUBIES rules was shown to have a structure similar to regular expressions. We feel that this increase in concept complexity is necessary in order to apply inductive learning systems to non-trivial real-world performance elements.

7.2 Limitations

RUBIES can be considered a minimal information extraction system. It uses a very basic syntactic representation of the input text. The only parsing operation is to replace words with an exclusive disjunction of their dictionary entries. The performance of RUBIES, therefore, is greatly dependent on the accuracy and completeness of the dictionary.

Because RUBIES performs a very basic syntactic parsing, its representation of the text is extremely ambiguous. Other natural language processing systems perform phrasal parsing, anaphora resolution, semantic analysis, and discourse processing to achieve a deeper and more accurate representation.

Because Merlin depends on RUBIES to parse the training examples, these limitations of RUBIES greatly affect the performance of Merlin. In fact, it is very difficult to isolate the performance of Merlin from that of RUBIES. Merlin uses the preprocessing stage of RUBIES for its training examples. The more accurate this parsing operation is, the better the training examples will be. We would expect to be able to learn more accurate information extraction rules from higher quality training examples.

Merlin relies on the rule decomposition algorithm to construct its training examples. This algorithm is domain specific - it would need to be modified for any domain other than MUC-5 English Joint Venture. The object modifier and object predicate portions of the rules are also domain specific. They would need to be modified to reflect the template structure of any new domain.

Merlin is limited to the RUBIES rule representation for both the training examples and the concepts it learns. Merlin's search operators are only useful for searching the space of RUBIES rule patterns. Thus, Merlin is limited to learning in domains that can be expressed with this representation.

7.3 Possible Extensions

Merlin and RUBIES both use a preprocessor to replace words with an exclusive disjunction of their dictionary entries. Any improvements in the preprocessor should directly improve the performance of both of these systems. The current preprocessor dictionary contains many errors and omissions. If this dictionary could be improved, the performance of Merlin and RUBIES should also improve.

RUBIES does not perform any phrasal parsing of the input text. Phrasal parsing can be used to reduce some of the ambiguity in the syntax of the text and to provide additional information about phrasal groupings. For example, a phrasal parser would be able to identify a set of words as belonging to a noun clause. This information could be incorporated into the rule patterns in several ways, the easiest of which is to add an additional attribute to each word for the type of clause it appears in. The addition of phrasal parsing should also improve system performance.

A more ambitious extension of this work would be to apply Merlin to other applications outside of information extraction.
Many of the expert systems that are currently built are rule-based. With some modifications, Merlin may be able to automatically build their knowledge bases.

Reference List

[1] L. Breiman, J.H. Friedman, and R.A. Olshen. Classification and Regression Trees. Wadsworth, 1984.
[2] J.G. Carbonell. Derivational analogy: A theory of reconstructive problem solving and expertise acquisition. In R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, volume 2. Morgan Kaufmann, 1986.
[3] B. Cestnik and I. Kononenko. ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. In I. Bratko and N. Lavrac, editors, Progress in Machine Learning. Sigma Press, 1987.
[4] N. Chinchor and B. Sundheim. MUC-5 evaluation metrics. In Proceedings of the Fifth DARPA Message Understanding Conference, Baltimore, MD, August 1993.
[5] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3), August 1979.
[6] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-283, 1989.
[7] W.W. Cohen. Compiling prior knowledge into an explicit bias. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Workshop, pages 102-110, 1992.
[8] W.W. Cohen. Pac-learning nondeterminate clauses. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), pages 676-681. AAAI Press/MIT Press, 1994.
[9] W.W. Cohen and H. Hirsh. Learnability of description logics. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, July 1992.
[10] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms, pages 974-978. MIT Press, Cambridge, MA, 1990.
[11] W.B. Croft. Knowledge-based and statistical approaches to text retrieval. IEEE Expert, 8(2), April 1993.
[12] G.F. DeJong and R.J. Mooney. Explanation-based learning: An alternative view. Machine Learning, 1:145-176, 1986.
[13] K.A. DeJong, W.M. Spears, and D.F. Gordon. Using genetic algorithms for concept learning. Machine Learning, 13(2-3), November 1993.
[14] L. Dent, J. Boticario, J. McDermott, T.M. Mitchell, and D. Zabowski. A personal learning apprentice. In Proceedings of the Tenth National Conference on Artificial Intelligence, 1990.
[15] T. Dietterich, H. Hild, and G. Bakiri. A comparative study of ID3 and backpropagation for English text-to-speech mapping. In Proceedings of the Seventh International Conference on Machine Learning, 1990.
[16] J. Cowie et al. The Diderot information extraction system. In Proceedings of the 1993 PACLING Conference, 1993.
[17] S.B. Thrun et al. The MONK's problems - a performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, School of Computer Science, Carnegie Mellon University, 1991.
[18] B.C. Falkenhainer and R.S. Michalski. Integrating quantitative and qualitative discovery: The ABACUS system. Machine Learning, 1:367-401, 1986.
[19] U.M. Fayyad. On the Induction of Decision Trees for Multiple Concept Learning. PhD thesis, Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI, 1991.
[20] R.E. Fikes, P.E. Hart, and N.J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3:251-288, 1972.
[21] D.H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
[22] S. Gelfand, C. Ravishankar, and E. Delp. An iterative growing and pruning algorithm for classification tree design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(2):163-174, 1991.
[23] R. Gemello, F. Mana, and L. Saitta. The Rigel inductive system. Machine Learning, 6:7-35, 1991.
[24] D. Gentner. Mechanisms of analogical learning. In S. Vosniadou and A. Ortony, editors, Similarity and Analogical Reasoning. Cambridge University Press, London, 1989.
[25] L. Guthrie, R. Bruce, G. Stein, and F. Weng. Development of an application independent lexicon: Lexbase. Technical Report MCCS-92-247, Computing Research Laboratory, New Mexico State University, 1992.
[26] J.R. Hobbs, D. Appelt, M. Tyson, J. Bear, and D. Israel. SRI International: Description of the FASTUS system used for MUC-4. In Proceedings of the Fourth DARPA Message Understanding Conference. Morgan Kaufmann, June 1992.
[27] R.C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63-91, April 1993.
[28] E.B. Hunt, J. Marin, and P. Stone. Experiments in Induction. Academic Press, New York, 1966.
[29] K.B. Irani, J. Cheng, U.M. Fayyad, and Z. Qian. Applying machine learning to semiconductor manufacturing. IEEE Expert, 8(1), February 1993.
[30] S. Kedar-Cabelli. Toward a computational model of purpose-directed analogy. In A. Prieditis, editor, Analogica. Kluwer, Boston, 1988.
[31] B.W. Kernighan and R. Pike. The Unix Programming Environment. Prentice-Hall, Englewood Cliffs, NJ, 1984.
[32] S.V. Kowalski and D.I. Moldovan. Explicit versus implicit set-covering for supervised learning. In Proceedings of the 6th IEEE International Conference on Tools with Artificial Intelligence, New Orleans, LA, November 1994.
[33] S.V. Kowalski and D.I. Moldovan. Parallel induction on hypercube. In Proceedings of the Sixth International Conference on Parallel and Distributed Computers and Systems, pages 218-221, Washington, D.C., October 1994. IASTED/ISMM - ACTA Press.
[34] J.E. Laird, P.S. Rosenbloom, and A. Newell. Chunking in SOAR: The anatomy of a general learning mechanism. Machine Learning, 1:11-46, 1986.
[35] P. Langley, H.A. Simon, G.L. Bradshaw, and J.M. Zytkow. Scientific Discovery: Computational Explorations of the Creative Process. MIT Press, Cambridge, MA, 1987.
[36] M. Lebowitz. Experiments with incremental concept formation: Unimem. Machine Learning, 2:103-138, 1987.
[37] W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, E. Riloff, and S. Soderland. University of Massachusetts: MUC-4 test results and analysis. In Proceedings of the Fourth DARPA Message Understanding Conference. Morgan Kaufmann, June 1992.
[38] W. Lehnert, J. McCarthy, S. Soderland, E. Riloff, C. Cardie, J. Peterson, and F. Feng. UMass/Hughes: Description of the CIRCUS system used for MUC-5. In Proceedings of the Fifth DARPA Message Understanding Conference. Morgan Kaufmann, August 1993.
[39] D.B. Lenat. The ubiquity of discovery. Artificial Intelligence, 9:257-285, 1987.
[40] R.S. Michalski. On the quasi-minimal solution of the general covering problem. In Proceedings of the Fifth International Symposium on Information Processing, pages 125-128, Bled, Yugoslavia, 1969.
[41] R.S. Michalski. AQVAL/1 - computer implementation of a variable-valued logic system VL1 and its application to pattern recognition. In Proceedings of the First International Joint Conference on Pattern Recognition, pages 3-17, 1973.
[42] R.S. Michalski. A theory and methodology of inductive learning. In R.S. Michalski, J. Carbonell, and T. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, pages 83-134. Morgan Kaufmann, Mountain View, CA, 1983.
[43] R.S. Michalski, I. Mozetic, J. Hong, and N. Lavrac. The multipurpose learning system AQ15 and its testing application to three medical domains. In Proceedings of the Fifth National Conference on Artificial Intelligence, 1986.
[44] S.N. Minton. Quantitative results concerning the utility of explanation-based learning. Artificial Intelligence, 42:363-392, 1990.
[45] T.M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Computer Science Department, Rutgers University, New Brunswick, NJ, 1980.
[46] T.M. Mitchell, R.M. Keller, and S.T. Kedar-Cabelli. Explanation-based generalization: A unifying view. Machine Learning, 1:47-80, 1986.
[47] D. Moldovan, S. Cha, M. Chung, K. Hendrickson, J. Kim, and S. Kowalski. USC: Description of the SNAP system used for MUC-4. In Proceedings of the Fourth DARPA Message Understanding Conference. Morgan Kaufmann, June 1992.
[48] R. Mooney, J. Shavlik, G. Towell, and A. Gove. An experimental comparison of symbolic and connectionist learning algorithms. In Proceedings of the Eleventh Joint Conference on Artificial Intelligence, 1989.
[49] H. Motoda, R. Mizoguchi, J. Boose, and B. Gaines. Knowledge acquisition for knowledge-based systems. IEEE Expert, pages 53-64, August 1991.
[50] P. Mowforth. Some applications with inductive expert system shells. Technical Report TIOP 86-002, Turing Institute, Glasgow, Scotland, 1986.
[51] S.H. Muggleton and W. Buntine. Machine invention of first-order predicates by inverting resolution. In S.H. Muggleton, editor, Inductive Logic Programming. Academic Press, San Diego, CA, 1992.
[52] S.H. Muggleton and C. Feng. Efficient induction of logic programs. In Proceedings of the First Conference on Algorithmic Learning Theory, Ohmsha, Tokyo, 1990.
[53] A. Newell and H. Simon. Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ, 1972.
[54] B. Onyshkevych. Template design for information extraction. In Proceedings of the Fifth Message Understanding Conference, August 1993.
[55] G. Pagallo and D. Haussler. Two algorithms that learn DNF by discovering relevant features. In Proceedings of the Sixth International Machine Learning Workshop, pages 119-123. Morgan Kaufmann, 1989.
[56] M. Pazzani and D. Kibler. The utility of knowledge in inductive learning. Machine Learning, 9, 1992.
[57] P. Proctor, editor. Longman Dictionary of Contemporary English. Longman, Harlow, 1978.
[58] J.R. Quinlan. Learning efficient classification procedures and their application to chess end games. In R.S. Michalski, J.G. Carbonell, and T.M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann, Los Altos, CA, 1983.
[59] J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
[60] J.R. Quinlan. Learning logical definitions from relations. Machine Learning, 5, 1990.
[61] J.R. Quinlan. Improved estimates for the accuracy of small disjuncts. Machine Learning, 6:93-98, 1991.
[62] E. Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 811-816. AAAI Press/MIT Press, 1993.
[63] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. In D.E. Rumelhart and J.L.
McClelland, editors, Parallel and Distributed Processing, volume 1, MIT Press, Cambridge, MA, 1986. 110 [64] C.A. Sammut and R.B. Banerji. Learning concepts by asking questions. In R.S. Michalski, J.G . Carbon ell, and T.M. Mitchell, editors, Machine Learning: A n Artificial Intelligence Approach (Vot. 2). Morgan Kaufmann, 1986. [65] E.Y. Shapiro. An algorithm that infers theories from facts. In Proceedings o f the Seventh International Joint Conference on Artificial Intelligence, pages 446-451. Morgan Kaufmann, 1981. [6 6] E.Y. Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge, MA, 1983. [67] J. Shavlik, R. Mooney, and G. Towcll. Symbolic and neural learning algorithms: An experimental comparision. Machine Learning, 6 , 1991. [6 8] G. Tecuci. Call for papers, IJCAI-93 workshop on machine learning and knowl edge acquisition. 1993. [69] L.G. Valient. A theory of the learnablc. Communications o f the ACM , 27:1134- 1142, 1984. [70] S. A. W aterman. Structural methods for lexical/semantic patterns. In B. Bogu- raev and J. Pustcjovsky, editors, Proceedings o f the Workshop on Acquisition o f Lexical Knowledge from Text, Columbus, Ohio, June 1993. [71] S.M. Weiss and I. Kapoulcas. An empirical comparison of pattern recognition, neural nets, and machine learning classification methods, In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 1989. [72] J. Wnek and R.S. Michalski. Hypothesis driven constructive induction in AQ17- I1CI: A method and experiments. Machine Learning, 14(2), February 1994. I l l Appendix A Exam ples of Rules Learned by M erlin This appendix contains some examples of the rules that were learned by Merlin for the MUC-5 English Joint Venture domain. A .l T ieup/E ntity Construction Rules Rule 2: Tieup/Entity Constructor with 2 Entities Pi ~ f t , ♦, f t , ♦, Co, ♦, f t , *, f t , *, f t , * } C & = fts© /lB /lB “ $20 5 jc “ (category — Bd ) /io = Sir S i 7 — (category = Bd ) E i — /jo © fti 7j0 “ $J8 5 jb = (concept = MUC5-COMPANY ) f t i — $ » 5 jo = (category = N p ) Ca = /jj © /j3 h i = $30 A f t t $30 = (concept = ENGLISH-ALPHABET ) $31 = (category = Nc ) ft 3 = $33 $32 = (category = T ) Cj = /j4 h i = $33 A $34 $33 = (lexical ~ joint venture ) $34 = (category = Nc ) 112 — he ® I 2& he — 5 3 s S3s ss (category — P ) he — •S > 36 S 3Q = (category = P ) £ 4 = h i h i ~ $31 S3 7 = (concept = MUC5-C0MPANY ) Rule 7: Tieup/Entity Constructor with 2 Entities P i = { *» ^i3» *» £w » £ 30* *» Caj, *, E h , *, C 32, * } B \3 = h i ® /m ® Am ® h i h i = <5iai 5i2t = (con cep t - M U C 5-C 0M P A N Y ) h 3 — S i n S i n — (category = N p ) Am — $123 5 123 = (category = N p ) h i = S |2 4 5 124 = (category = Np ) C 29 = /m ® A >r Am = 5 n s a S \ 20 = (con cep t a E N G L ISH -A L P H A B E T ) S u e — (category = N c ) h r — S i n A S m S i n — (root = a ) S u e = (category = T ) C30 = Aw Aw = Sim A Sieo A 5 is i S im = (lexical = join t v e n tu r e ) S im = (category = N c ) S in = (num ber = sg ) C31 = Aw ® A 00 Aw = 5|33 S m - (category = P ) Aoo = •S'iss 5 js3 “ (category = P ) E h — h o i h o i ~ S m S m = (con cept = M U C 5-C 0M P A N Y ) 113 C 32 — A o 2 © A 03 © A 04 © A 05 A 02 — ^135 5j35 — (category = Bo ) A 03 = S ia e S im — (category — Bo ) Ao< — <$137 5 n 7 — (category = Bs ) A os = S im S im - (category = Bs ) Rule 8: Tieup/Entity Constructor with 2 Entities P» — { *» ^33* ** £ is , C m * ♦, C 35, *, E je , *, C m * ** C 37, * } C33 — A 06 Aoo = Sim S im = (category = Nc ) E i$ = A 07 Ao7 = S 140 Suo o (concept = 
MUC5-C0MPANY ) C 34 = A 08 © Aoo A 08 = #141 5 i4j = (category = C ) Aoo = S i 42 S i 4 2 - (category = C ) C m — A 10 AlO — <?H3 5 m 3 = (category = M ) E ta — A u A l l — -5j44 S U 4 = (concept = MUC5-C0MPANY ) Cm — A12 © A is A ia — &45 A 5)48 5 hs s* (category = Nc ) Si48 = (number = p i ) A 13 — 5|47 A 5i48 S Ui = (category = V ) 5 h8 = (tense = r ) C37 = A h A h — <S|40 A 5i8o A 5i6i 5 ho = (concept » THING ) 5jso — (category = Nc ) S m = (number = sg ) Rule 9: Tieup/Entity Constructor with 2 Entities P o = { *» E\7, C m » ^18* * t CAoj *1 C 41, * , C 42, * } El7 ~ A lS I n 5 — 5 i®j 5i82 a (concept = MUC5-C0MPANY ) C m = A t e © A i 7 A i a = S is a Si 53 = (category = C ) A l 7 — S |$ 4 S im = (category = C ) ^ 1 8 = A l 8 /lI8 = Si58 S « 5 = (concept = MUC5-C0MPANY ) Cafl = / l l 9 / | I 9 = S i 5fl Si so = (category = V ) C 40 — A 20 © A21 A 2 0 = S157 S 157 = (category = Nc ) /j j! = S i 88 Siss = (category = T ) C41 = A 22 © A 2 3 © A 24 I\7 2 ~ S 159 S |8 9 = (category = Np ) A 23 = S i 00 Si60 = (category = P ) /1 2 4 = Siei S101 = (category = P ) C42 = /125 © A 26 © A 2 7 © A 38 A 2 5 = Si62 A Si63 5 162 = (category a Nc ) 5163 = (number = sg ) A 26 = S i 64 5164 = (category = Np ) A a 7 — Si 65 Si os = (category = Np ) A 28 = S i 66 Siee = (category = Np ) 115 Rule 14: Tieup/Entity Constructor with 2 Entities PlA — { *» Cb2i *♦ Gb3) Cmj *) ^27» * ♦ ^28t Cs5» * } C e 2 — A s i $ A s2 © A 83 A s i — 5^237 S 237 = (category = N p ) A s2 = S zi$ S im — (category = N p ) / l 83 — 5230 5239 = (category = N p ) C m = A 84 A 84 = 5240 A 5241 A 5242 5240 = (con cep t = T H IN G ) 5241 — (lexical = collaboration ) 5242 = (category — N c ) Cm = A as © A sa A 85 — 5243 5243 = (category = P ) A 80 — 5244 5244 = (category = P ) El7 = A 8 7 © A ss As7 = 5248 5 248 = (con cep t = M U C 5 -C 0 M P A N Y ) Ass = 5240 5240 - (category = N p ) £2 8 — A so © Aoo © A o i © A 02 A 89 = 5247 5247 — (category = N c ) Aoo = 5248 5 34S = (con cep t = M U C 5-C O M P A N Y ) A oi — 5 34o 5 34o — (category = N p ) A 02 — 5 jS0 5 jso = (category = N p ) C ss — A 03 A o 3 = 5281 5 j 5i = (category = M ) Rule 64: Tieup/Entity Constructor with 2 Entities Pfti — { * 1 £ l 27i Csoo* * 1 Csqt, * 1 C aoa, £)28t ** Csoo } 116 E \17 — h 63 Is63 — 5 ,3 6 3 5.363 = (concept « M U C 5-C O M PA N Y ) C 3O 8 — I g M © ^865 © 1 * 0 6 /b64 = 5,364 5.364 = (category = N p ) ^865 — 5,365 5.365 = (category = V ) /w e = 5,366 5.366 = (category = V ) Csor — Awr © l o o * 1 * 0 7 — 5,367 A 5,366 A 5,369 5.367 = (con cept = T H IN G ) 5.368 = (category = N c ) 5.369 = (num ber = sg ) /s68 — 5,370 A 5,37, A 5,373 5.370 = (lexical = form ) 5,37, - (category = V ) 5.372 - (ten se = r ) Caoa = /geg © hro I6 6 9 = 5,373 5.373 = (category = P ) h n = 5,374 5.374 - (category = P ) £,28 = I671 I6 7 1 — 5,375 5.375 = (concept = M U C 5-C 0M P A N Y ) Cam = I6 7 2 I6 7 7 — 5,376 5.376 — (category = M ) Rule 73: Tieup/Entity Constructor with 3 Entities £73 = { *1 C 348, £,46t ^349* *1 CasOi *1 £,471 *» C3M1 *, £ ,481 ** C352, *, C353 } C 348 = /»74 A,74 = 5 ,4 9 7 5.497 = (category — M ) £ |4 6 — / 97s I9 75 = 5 ,4 9 8 5.498 — (concept = M U C 5-C 0M P A N Y ) C 349 = / 97a 117 7fl76 — S nog 51499 = (category = M ) C 3S 0 — h rt h rt — S is o o A S i 501 S i 6oo = (co n cep t = M U C 5 -L 0 C A T I0 N ) Sisoi = (category = N p ) 7?M7 — /»78 /b78 — S 16O J Sisoa = (co n cep t = M U C 5*C 0M P A N Y ) C351 = 7 ® 79 © I & 80 I979 — S i 503 S 1603 = (category = C ) / 98O = S i 504 S i 504 = (category = C ) 7?J4S = /®8I 7®st = S i 505 S 1505 
= (co n cep t = M U C 5 -C 0 M P A N Y ) C352 = /o sj © /o83 7 ® sa = S i 506 A S i 507 S i 500 = (category = N c ) S i 507 = (nu m b er = sg ) 7 ® 8 3 — Si5oa S i 508 = (category = V ) C353 — /® 84 7o84 “ S i 509 A S 1510 S i 509 = (ro o t = . ) S jsio = (category = M ) Rule 74: Tieup/Entity Constructor with 3 Entities 7 ^ 4 — { ** ^ 149* *» £ s 54t *1 ^ 355* *» C 350, *» C 357, * , C358, C 359, / ? 150, * , ^ 300> 7? i s |, *t Csax } E 149 = 7q8S © 7o88 © 7q87 © 7 ® g 8 7o85 = Sisii S|6II = (category = Nc ) 7 ® 8 6 = Sj5i3 S |5ia = (concept = MUC5-C0MPANY ) 7q 87 = Si5i3 Si5»3 - (concept - MUC5-C0MPANY ) 7 ® 8 8 — Si5t4 Sisi4 = (category - Np ) 118 C *3fc4 — /»89 © /» 9 0 © /wi © ^992 /989 — SlSlS Sisis = (category = Bp ) / w o — S is t a S m s — (category = Bp ) /wi — 5ui7 S 1517 = (category * Np ) Iggi = S 1BI 8 •S'isis = (category = Np ) C35S — Ign I9 9 3 — S ism , A S'ujo 5 1519 = (root = , ) Si 520 = (category = M ) C w o = / km / m4 — Si 5 2 1 Si521 = (category = Np ) Css7 — /ws / ms = Sj522 Si 522 = (category = V ) C35B = / 99a / M O — Si 523 Si 5 23 = (category = Nc ) C359 = / 99T © / 99s / M 7 “ Si 524 Si 524 = (category = P ) /mb = S 1535 Si 525 = (category = P ) £ ,5 0 — Aw / w 9 — S , 52fl S ,526 = (concept = MUC5-COMPANY ) C3BO = Aooo© A001 /1000 = S i 527 Si527 = (category = C ) /jooi = Si 528 Si 528 = (category = C ) £ 1 5 1 = /1002 / 1002 — Si 5 29 51520 — (concept = MUC5*C0MPANY ) C36I — A 003 /lO O S — Si 530 A Si 531 Si 53Q = (root = . ) S}53i — (category = M ) 119 Rule 81: Tieup/Entity Constructor with 4 Entities Psi = { CjflS, *, E\70, ♦, C396, *, E 171, *, C 397, * » C398, *, £ |7 Ji *1 C ?3 9 9 i *, C 40O 1 *1 i? I73> *1 ^ 401* * } C395 = /|0 8 8 Aosa — S'lej? S ,62? = (category » Nc ) ^170 “ A 089 ® A 090 A089 — S i 628 5.628 — (concept — MUC5-C0MPANY ) A 090 = Si 6 29 5.629 = (concept = MUC5-COMPANY ) C39G ~ A 091 A 091 — S,630 Si cm = (category = M ) ^171 = A 092 A092 = S,031 5.631 = (concept = MUC5-COMPANY ) C397 = A 093 A 093 — 5,632 A 5,633 5.632 = (lexical = , ) 5.633 = (category = M ) C39S — A094 © A 098 A094 = 5,634 5.634 - (category = Np ) A 095 — Si 635 S i635 = (category = Np ) E\72 = A 096 A096 = 5,636 5.636 = (concept = M UC5-C0MPANY ) C399 = A097 © A 098 A097 = 5,637 5.637 = (category = C ) A 098 = 5,638 5.638 — (category =s C ) C400 “ A 099 A 099 — 5,639 5.639 = (category = M ) E,73 = A , 00 A 100 — 5,640 5.640 = (concept = MUC5*C0MPANY ) Cm — A 101 120 A lO l — 5(641 S im as (category ss Nc ) Rule 83: Tieup/E ntity Constructor with 4 Entities P ss “ { *1 C 4O 6* *1 c ,I 0 7 t *1 E|78* * , C 4 0 8 » E ( 79» * , C 409) £(80» C 4I O 1 * , £ (8 1 ♦ * » C 411, * } ^ 4 0 8 = / l 114 A l 14 — 5 (6 5 0 5 i0M = (category = M ) C407 — /in # /in s “ 5 ( 6 5 7 A 5 (6 5 8 5 ( 6 5 7 ~ (lexical = jo in t venture ) 5 ( 6 5 6 — (category ss Nc ) ^ 1 7 8 = / l l ( 6 AllO — 5(659 5 ( 6 5 9 « (concept = M UC 5 -C 0 M PANY ) C4O8 = A l l 7 A ll 7 = 5 ( 6 6 0 5 ( 6 6 0 = (category = M ) ^179 = All8 A ll8 = 5(661 5 (6 0 1 = (concept = MUC5 -COM PANY ) C409 = /(1 1 9 A lio — 5 ( 6 6 2 A 5 ( 6 6 3 5 ( 6 6 2 = (lexical = , ) 5 ( 0 6 3 = (category = M ) £(80 = A120 A 120 ~ 5 (6 6 4 5 ( 6 6 4 - (concept =: M UC 5 -COM PANY ) C410 = A121 © A122 A 121 = 5 ( 6 6 6 5 ( 6 6 5 = (category = C ) A(22 = 5(666 5 ( 6 6 6 = (category = C ) E (81 » A 123 A 123 — 5 ( 6 6 7 5 ( 6 6 7 = (concept = M UC 5 -C 0 M PANY ) Caw — A 1 2 4 I 1124 = 5 ( 6 6 8 5 ( 6 6 8 = (category = Nc ) 121 A .2 Tieup Joint Venture Rules Rule 07: Tieup Joint Venture with 2 Entities P9 7 — { * 1 A t *» *» CiBit 
$2i4t *» C^ea, $aist *» C A e3» A — A 209 A 209 = $1831 $t83i « (co n cep t « M U C 5*C 0M P A N Y ) Cmoo “ Aaro © A 271 A 270 — $1832 A 51833 $1832 — (lex ica l = a ) $is 33 = (category = N c ) Aan — $1834 5*1834 = (category = T ) C 481 — A 272 A272 = $1835 A $1830 (Si 835 = (lex ica l = betw een ) 5i83o = (category = P ) $214 = A 273 A 273 = $1837 $1837 = (co n cep t = M U C 5 -C 0 M P A N Y ) C402 — A274 © A 275 A 274 — $1838 $1838 = (category = C ) A 275 = = $1839 $1839 = (category = C ) $218 = Aa7fl A 278 = $1840 $1840 = (co n cep t = M U C 5 -C 0 M P A N Y ) C 48S — A 277 © A 278 © A 279 A277 = $1841 $1841 = (category = N p ) A 278 = $1842 $1842 = (category = P ) A279 = $1843 $1843 = (category = P ) Rule 104: Tieup Joint Venture with 3 Entities P tO 4 — { *♦ & 2 2 S i *1 *, «A* *, C 489, *, Z?2291 *» 7?230t 7^228 — A m I © A 352 © ^1353 © 7l3S4 A w l “ 5,922 5,922 — (category - N c ) A 352 — 5,923 5 ibm a (con cept = M U C fl-C O M PA N Y ) A 353 — 5,924 5 ,924 = (con cept = M U C 5 C 0 M P A N Y ) A 354 — 5,925 5 , 92s = (category a N p ) (?4M “ A 355 A35# — 5,926 5.926 = (category a V ) «A — A s m A sm — 5,927 5 .927 = (concept a M U C 5-C 0M P A N Y ) C<89 a /is s r A s s r — S \ 92» 5 jo28 — (category = N c ) P229 = A sm A 358 = 5,029 5 ,920 * (concept a M U C 5-C 0M P A N Y ) E 23O = A 359 A 359 = 5,930 5,m o = (concept = M U C 5-C 0M P A N Y ) Rule 100: Tieup Joint Venture with 4 Entities P jo o a { * , J j , * , C492, * , £r2S4* *t 7A s 5» +» i?2M t * t T?237i A = A sor 7,307 = 5 ,e s s 5.938 = (concept a M U C 5-C O M PA N Y ) C492 a 7,303 7,368 = 5,939 5.939 = (category a P ) 7^234 a 7,389 7,389 — 5,940 5.940 - (concept = M U C 5-C O M PA N Y ) ^235 » 7,370 ^1370 — •S'lM l *^1941 — (concept =s MUC5*C0MPANY ) & 236 — / l 37I / t 37l — S1042 S t 9 4 2 - (concept b MUC5-C0MPANY ) B 237 — 11372 / | 37S = * S ’ l 943 5i M 3 b (concept = M U C5C0M PANY ) A .3 Tieup Status Rules Rule 107: Tieup Status FORMER with 2 Entities P 107 — { *» 7?238» *» El39% O 493, * , C |94 ♦ * } #238 — /l373 7,373 “ 5,944 5,944 = (concept S MUC5-C0MPANY ) £■ 239 — 71374 A 374 — 5,948 5,948 = (concept = MUC5-C0MPANY ) C493 = /,3 7 5 © £,378 7.378 — 5,946 5.946 = (category = Nc ) 7.378 = 5,947 5.947 = (category = Vh ) O494 — 7,377 7,377 = 5,948 5.948 = (category = Vb ) Rule 110: Tieup Status DISSOLVED with 2 Entities 7*1,0 = { *1 ^803» * t C m 4 i O 505, C m 61 *% C s07t *1 7*2441 *% C so si * \ #248» C W } C 503 — 7,402 7.402 — 5,982 A 5,983 5.982 = (lexical = rejoin ) 5.983 = (category - V ) C k> 4 = 7,403 © 7,404 7.403 = 5,984 5.984 = (category = C ) 7.404 — 5,088 5.988 = (category = C ) C * 8 0 8 = 7,405 7,408 = 5,988 A 5,987 5.988 = (concept = MUC5-C0MPANY ) 5.987 = (category = Np ) C508 = 7,408 7,406 = 5,988 5.988 = (category = Nc ) 125 C&07 — A 407 11407 = Sivgg 5 ,9*9 - (category = P ) £■ 2 4 4 — /|408 A 408 — 5,990 51990 a (concept = M U C5C0M PANY ) Cfi08 — A 409 © A4I0 A 409 = 5,99, 5 .991 = (category a C ) A 4 I0 — 5,992 5 .9 9 2 - (category = C ) ^248 — A41t A411 = 5,993 5 .9 9 3 = (concept = MUC5-C0MPANY ) C 509 = A 412 A412 — 5,994 A 5l995 5.994 = (root = . 
) 5 ,9 9 3 = (category = M ) Rule 115: Tieup Status EXISTING with 2 Entities A lB — { *t (?642| ♦ , E 254, * t E 253, * , C s43( * , C 544, * } Cs42 = A47S A4TB = 52074 52074 = (category = Fc ) ^234 = A 4 7 9 © A 48O A 470 = 52078 52078 = (concept = MUC5-COMPANY ) A 48O = 52070 52076 = (concept = MUC5-C0MPANY ) JS255 = A 4 8, A 4 8I = 52077 52077 = (concept « MUC5 COMPANY ) ^843 = A482 © A 483 A 482 = 52078 A 52079 A 52080 52078 = (concept a MUC5-PR0D-SERV ) 52079 — (category = Nc ) 52080 — (number = eg ) A 483 — 5208, A 52082 5208, = (category = V ) 52082 = (tense = r ) 126 C B 44 — li484 © A 488 © A 488 A 484 — ^2083 • 5 * 2 0 8 3 - (category = Np ) A 485 — 5*2084 52084 - (category = = P ) A 488 — 52088 52088 = (category = P ) Rule 175: Tieup Status PREFORMAL with 2 Entities P \ 7b “ { *♦ C 7 9 8* *» C 799) C«00» ^ 374» * t ^ J 75» *» <?801i CsOJt ** ^ 8 0 3 } C ?79S = A 205 © A 388 © A 287 A 205 “ 53008 53008 = (category = Np ) A288 — 5 aoo7 53007 = (category = P ) A2 07 — 53008 53008 = (category = P ) C799 = A288 © A260 © A270 © A271 A 208 = 53000 53009 = (category = Nc ) A 269 = 53010 53010 = (category = Nc ) A270 = 53011 A 53012 5 mu = (concept = MUC5-L0CATI0N ) 53013 = (category = Np ) Aon = 5ooi3 53013 = (category = V ) Cboo = A272 A a 72 — 53014 53014 = (category = P ) £ ■ 3 7 4 — A273 /jj73 ~ 5 * 3 0 1 5 53018 — (concept’s MUC5-C0MPANY ) ^376 — A274 © A 275 © A276 A2T4 — 53018 53018 = (concept = MUC5-COMPANY ) A275 — 53017 53017 = (category = Np ) A278 = 53018 53018 = (category = Np ) CsOJ — I2277 I2277 — $ 30J9 53019 = (category - E ) Cs02 — /227s /227a — 53020 53020 = (category = J ) Csos — J2279 I7279 — 53021 a 53022 53021 = (lexical = . ) 53022 - (category = M ) Rule 172: Tieup Status PREFORMAL with 2 Entities P 1 7 2 — { ** ^ 368» +» C775, * , C r r o i ♦ , C 777, £309, * , C r 78, * , C 779, C780, *» C V ai, C7S2 } E368 — /2195 © ^2196 © /2I97 © /2I9S /2195 = 52924 52924 = (category = Nc ) /3190 = 52925 52925 — (category = Nc ) /2 1 9 7 — 52926 52926 = (concept = MUC5-C0MPANY ) /2198 — 52927 52 927 = (category = Np ) C775 = him © /2200 © /2201 ^2199 — 52928 A 52929 52928 = (root = IS ) 52929 = (category = Np ) /2 2 0 0 — 52930 52930 = (category = Vb ) J 72O I ~ 52931 52931 = (category = Vb ) C776 = /2202 © ^2203 © ^2204 hvn — 52932 A 52933 52932 = (root = s TO ) 52933 = (category = Np ) /2203 = 52934 52934 = (category » P ) ^2204 = 52935 52935 = (category = P ) O 777 — /2 2 0 5 © /2 2 0 6 /2205 5 5 52936 128 52936 = (category = Np ) I 2206 — ^2937 ^2937 — (category = Vb ) $309 — /2207 © / 220s © /2209 ^2207 = $2936 SWw = (category = Nc ) f 2208 — 52939 52939 « ■ (concept = M U C5C0M PANY ) ^2209 — $2940 52940 = (category = Np ) Crrs ~ / 2210© /2211 © /2212 /2210 = 52941 A 52942 52941 = (concept = THING ) 52942 — (category as Nc ) ^2211 — 52943 5 2943 = (category = Np ) / 32IJ ss 52944 52944 = (category as P ) C m — /3213 © /2214 © ^2215 © / 22I6 /2213 = 52945 52945 = (category = E ) /2214 as 52940 A 52947 52940 — (root as south ) 52947 = (category = J ) /2215 = 52948 A 52940 52948 = (category = Nc ) 52949 = (number = sg ) /2210 — 52950 A 52951 52950 = (lexical ss south ) 52951 = (category as Np ) C 78O ~ ^2217 © /2218 © /2219 / 221T — 52952 A 52953 52952 = (concept = M UC5-L0CATI0N ) 52953 = (category as Np ) /2218 = 52954 52954 = (category as P ) /2219 — 52955 52955 = (category = P ) Appendix B DETAILED EXPERIM ENTAL RESULTS This appendix presents the experimental results of Chapter 6 in more detail. 
Section B.1 contains charts comparing the performance of Merlin/RUBIES with the USC MUC-5 system for individual slots and object matches. A summary for all slot scores was given in Section 6.4. Section B.2 contains charts comparing the performance of Merlin/RUBIES with all of the MUC-5 systems for individual slots and object matches. A summary for all slot scores was given in Section 6.5. Section B.3 contains graphs showing how varying the number of positive training examples affects the performance on individual slots and object matches. A summary for all slot scores was given in Section 6.6. Section B.4 contains graphs showing how varying the number of negative training examples affects the performance on individual slots and object matches. A summary for all slot scores was given in Section 6.7.

B.1 Comparing Merlin/RUBIES with the USC MUC-5 System

[Figure B.1: Error Measures for TieUp Entity, 191 Pos, 1389 Neg Examples]
[Figure B.2: R/P for TieUp Entity, 191 Pos, 1389 Neg Examples]
[Figure B.3: Raw Scores for TieUp Entity, 191 Pos, 1389 Neg Examples]
[Figure B.4: Error Measures for TieUp Status, 191 Pos, 1389 Neg Examples]
[Figure B.5: R/P for TieUp Status, 191 Pos, 1389 Neg Examples]
[Figure B.6: Raw Scores for TieUp Status, 191 Pos, 1389 Neg Examples]
[Figure B.7: Error Measures for TieUp Joint Venture, 191 Pos, 1389 Neg Examples]
[Figure B.8: R/P for TieUp Joint Venture, 191 Pos, 1389 Neg Examples]
[Figure B.9: Raw Scores for TieUp Joint Venture, 191 Pos, 1389 Neg Examples]
[Figure B.10: Error Measures for Entity Name, 191 Pos, 1389 Neg Examples]
[Figure B.11: R/P for Entity Name, 191 Pos, 1389 Neg Examples]
[Figure B.12: Raw Scores for Entity Name, 191 Pos, 1389 Neg Examples]
[Figure B.13: Error Measures for Entity Type, 191 Pos, 1389 Neg Examples]
[Figure B.14: R/P for Entity Type, 191 Pos, 1389 Neg Examples]
[Figure B.15: Raw Scores for Entity Type, 191 Pos, 1389 Neg Examples]
[Figure B.16: Error Measures for TieUp Objects, 191 Pos, 1389 Neg Examples]
[Figure B.17: R/P for TieUp Objects, 191 Pos, 1389 Neg Examples]
[Figure B.18: Raw Scores for TieUp Objects, 191 Pos, 1389 Neg Examples]
[Figure B.19: Error Measures for Entity Objects, 191 Pos, 1389 Neg Examples]
[Figure B.20: R/P for Entity Objects, 191 Pos, 1389 Neg Examples]
[Figure B.21: Raw Scores for Entity Objects, 191 Pos, 1389 Neg Examples]

B.2 Comparing Merlin/RUBIES with the MUC-5 Systems

In all figures of this section, 191 positive and 1389 negative examples were used.

[Figure B.22: MUC-5 System Performance Range for TieUp Status]
[Figure B.23: MUC-5 System Performance Range for TieUp Entity]
[Figure B.24: MUC-5 System Performance Range for TieUp Joint Venture]
[Figure B.25: MUC-5 System Performance Range for Entity Name]
[Figure B.26: MUC-5 System Performance Range for Entity Type]
[Figure B.27: MUC-5 System Performance Range for TieUp Objects]
[Figure B.28: MUC-5 System Performance Range for Entity Objects]
B.3 Varying the Number of Positive Training Examples

[Figure B.29: Scores for TieUp Status Slot, 1389 Neg Examples]
[Figure B.30: Scores for TieUp Entity Slot, 1389 Neg Examples]
[Figure B.31: Scores for TieUp Joint Venture Slot, 1389 Neg Examples]
[Figure B.32: Scores for Entity Name Slot, 1389 Neg Examples]
[Figure B.33: Scores for Entity Name Slot, 1389 Neg Examples]
[Figure B.34: Error Measures for TieUp Status Slot, 1389 Neg Examples]
[Figure B.35: Error Measures for TieUp Entity Slot, 1389 Neg Examples]
[Figure B.36: Error Measures for TieUp Joint Venture Slot, 1389 Neg Examples]
[Figure B.37: Error Measures for Entity Name Slot, 1389 Neg Examples]
[Figure B.38: Error Measures for Entity Type Slot, 1389 Neg Examples]
[Figure B.39: R/P for TieUp Status Slot, 1389 Neg Examples]
[Figure B.40: R/P for TieUp Entity Slot, 1389 Neg Examples]
[Figure B.41: R/P for TieUp Joint Venture Slot, 1389 Neg Examples]
[Figure B.42: R/P for Entity Name Slot, 1389 Neg Examples]
[Figure B.43: R/P for Entity Type Slot, 1389 Neg Examples]
[Figure B.44: Scores for TieUp Objects, 1389 Neg Examples]
[Figure B.45: Scores for Entity Objects, 1389 Neg Examples]
[Figure B.46: Error Measures for TieUp Objects, 1389 Neg Examples]
[Figure B.47: Error Measures for Entity Objects, 1389 Neg Examples]
10 1 10 IfiO If10 20 recall precision # Positive Examples Figure B.48: R /P for TieUp Objects, 1389 Ncg Examples 100 90 80 70 60 50 40 30 20 10 0 40 60 80 1( III X ) 1 2 to U10 If> 0 If1 0 20 precision # Positive Examples Figure B.49: R /P for Entity Objects, 1389 Neg Examples 156 # S lo e Fills B .4 Varying the Number of N egative Training Examples 500 450 400 350 300 250 200 150 100 # Negative Examples — possible actual correct incorrect — B— spurious -im missing Figure B.50: Scores for TieUp StatuB Slot, 191 Pos Examples 157 1200 1000 800 ja £ | 600 * 400 200 200 400 600 800 1000 1200 1400 possible actual correct incorrect -im spurious # Negative Examples Figure B.51: Scores for TieUp Entity Slot, 191 Pos Examples * 30 200 400 600 800 1000 1200 1400 possible actual correct incorrect - e - spurious - 9 " missing # Negative Examples Figure B.52: Scores for TieUp Joint Venture Slot, 191 Pos Examples 158 # Slot F ills # Slot Fills 500 450 400 350 300 250 200 150 100 50 0 — i— ■ — ■ —— |— |— i— — v — i— f ■ | i 1 --1 —— ■ — ■ --1 -- 1 --1 --1 — 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.53: Scores for Entity Name Slot, 191 Pos Examples possible actual correct partial incorrect spurious missing 500 450 400 350 300 250 200 ISO 100 50 0 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.54: Scores for Entity T ype Slot, 191 Pos Exam ples possible actual correct incorrect spurious missing 159 100 90- § # Negative Examples error undcrgcncration ovcrgcncration substitution Figure B.55: Error Measures for TicUp Status Slot, 191 Pos Examples 100 90 80 70 £ 60 8 50 1 40 30 20 10 01 — i— i— i— — 1— t— t— ,— i — * — i — I i i i I i i — i — — i — i — r — 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.56: Error Measures for TieUp Entity Slot, 191 Pos Examples " " I A — error undcigeneration ovcrgcncration — substitution 160 100 # Negative Example* — error undctgencration ovcrgcncration substitution Figure B.57: Error Measures for TieUp Joint Venture Slot, 191 Pos Examples 100 90 80 70 £ v * * 1 ) 60 t 50 I 40 30 20 10 0 I I 1 T "1 I 200 400 600 800 1000 1200 1400 # Negative Examples ♦ error undetgeneration ovcrgcncration — substitution Figure B.58: Error Measures for Entity Name Slot, 191 Pos Examples 161 100 90 80 70 60 50 40 30 20 10 0 200 400 600 800 1000 1200 1400 ff Negative Examples error undctgencration ovcrgcncration — substitution Figure B.59: Error Measures for Entity Type Slot, 191 Pos Examples 100 90 80 70 40 30 20 10 0 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.60; R /P for TieUp Status Slot, 191 Pos Examples precision 162 Precentage (% ) Percentage (%) 100 90 80 70 60 50 40 30 20 10 200 400 600 800 1000 1200 1400 # Negative Examptes Figure B.61: R /P for TicUp Entity Slot, 191 Pos Examples precision 100 90 80 70 60 50 40 30 20 10 0 200 400 600 800 1000 1200 1400 # Negative Examples — recall — • — precision Figure B.62: R /P for TieUp Joint Venture Slot, 191 Pos Examples 100 90 80 10 i i i I I I precision TT 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.63: R /P for Entity Name Slot, 191 Pos Examples precision I I I 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.64: R /P for Entity Type Slot, 191 Pos Examples 164 # S lo t F ills * S lo t Fins 500 450 400 350 300 250 200 150 100 possible actual correct incorrect - e - spurious missing # Negative Examples Figure B.65: Scores for TieUp Objects, 191 Pos Examples 500 450 400 350 300 250 200 150 too 200 400 600 800 1000 1200 1400 # Negative Examples possible actual 
correct incorrect - e - spurious missing Figure B.66: Scores for Entity Objects, 191 Pos Examples 165 100 90 80 70 & eo » JO I 40 30 20 10 0 ... 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.67: Error Measures for TieUp Objects, 191 Pos Examples error — - undcrgeneration ovcrgcncration substitution '» I1 f TTT error undergeneration ovcrgcncration — substitution 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.68: Error Measures for Entity Objects, 191 Pos Examples 1G 6 Percentage (%) 100 90 80 70 60 50 40 30 20 10 0 precision 200 400 600 800 1000 1200 1400 # Negative Examples Figure B.69: R /P for TieUp Objects, 191 Pos Examples 100 90 80 70 60 50 40 30 20 10 0 precision 200 400 600 800 1000 1200 # Negative Examples 1400 Figure B.70: R /P for Entity Objects, 191 Pos Examples 167
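The quantities plotted in these figures follow MUC-style scoring conventions: the raw slot-fill counts (possible, actual, correct, partial, incorrect, spurious, missing) and the measures derived from them (recall, precision, error, undergeneration, overgeneration, substitution). As a rough illustration of how the derived measures relate to the raw counts, the Python sketch below computes them from a tally of slot fills. The class and field names are hypothetical and simply mirror the figure legends, and the formulas assume the standard MUC-5 definitions with partial fills credited at half weight; the scorer actually used for these experiments may treat partial fills slightly differently.

# Minimal sketch of MUC-style scoring from raw slot-fill counts.
# Field names mirror the legends of Figures B.29-B.70; the formulas are the
# standard MUC-5 definitions (an assumption, not necessarily the exact scorer
# used for this appendix).

from dataclasses import dataclass

@dataclass
class SlotTally:
    correct: int = 0    # system fill matches the answer key exactly
    partial: int = 0    # system fill partially matches the key
    incorrect: int = 0  # system fill does not match the key
    spurious: int = 0   # system produced a fill where the key has none
    missing: int = 0    # key has a fill the system did not produce

    @property
    def possible(self) -> int:
        # fills present in the answer key
        return self.correct + self.partial + self.incorrect + self.missing

    @property
    def actual(self) -> int:
        # fills produced by the system
        return self.correct + self.partial + self.incorrect + self.spurious

    def recall(self) -> float:
        return (self.correct + 0.5 * self.partial) / self.possible if self.possible else 0.0

    def precision(self) -> float:
        return (self.correct + 0.5 * self.partial) / self.actual if self.actual else 0.0

    def error_per_response_fill(self) -> float:
        denom = self.correct + self.partial + self.incorrect + self.spurious + self.missing
        wrong = self.incorrect + 0.5 * self.partial + self.spurious + self.missing
        return wrong / denom if denom else 0.0

    def undergeneration(self) -> float:
        return self.missing / self.possible if self.possible else 0.0

    def overgeneration(self) -> float:
        return self.spurious / self.actual if self.actual else 0.0

    def substitution(self) -> float:
        denom = self.correct + self.partial + self.incorrect
        return (self.incorrect + 0.5 * self.partial) / denom if denom else 0.0

# Example with a hypothetical tally for one slot:
tally = SlotTally(correct=120, partial=10, incorrect=15, spurious=20, missing=25)
print(f"recall={tally.recall():.2f} precision={tally.precision():.2f} "
      f"error={tally.error_per_response_fill():.2f}")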