THE IMPORTANCE OF USING DOMAIN KNOWLEDGE IN SOLVING INFORMATION DISTILLATION PROBLEMS

by

Liang Zhou

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2006

Copyright 2006 Liang Zhou

Table of Contents

List Of Tables
List Of Figures
Abstract

1 Introduction
  1.1 Information Distillation
  1.2 Domain-specific Distillation Tasks
    1.2.1 Headline Generation
    1.2.2 Biography Creation
    1.2.3 Online Discussion Summarization
    1.2.4 Distillation Evaluation
  1.3 Research Contributions
  1.4 Thesis Outline

2 Previous Work
  2.1 Text Segmentation
  2.2 Single-Document Structure
  2.3 Cross-Document Structure
  2.4 Chapter Outlook

3 Headline Generation
  3.1 Previous Work
  3.2 A First Look at Headline Templates
    3.2.1 Template Creation
    3.2.2 Sequential Recognition of Templates
    3.2.3 Filling Templates with Key Words
  3.3 Key Phrase Selection
    3.3.1 Model Selection
    3.3.2 Phrase Candidates to Fill Templates
  3.4 Filling Templates with Phrases
    3.4.1 Method
  3.5 Evaluation
  3.6 Conclusion

4 Biography Creation
  4.1 Recent Development
  4.2 Corpus Description
  4.3 Sentence Classification
    4.3.1 Task Definitions
    4.3.2 Machine Learning Methods
    4.3.3 Classification Results
  4.4 Biography Extraction
    4.4.1 Name-filter
    4.4.2 Sentence Ranking
    4.4.3 Redundancy Elimination
  4.5 Evaluation
    4.5.1 Overview
    4.5.2 Coverage Evaluation
    4.5.3 Discussion
  4.6 Conclusion and Future Work

5 Summarizing Online Discussions
  5.1 Introduction
  5.2 Previous and Related Work
  5.3 Technical Internet Relay Chats
    5.3.1 Kernel Traffic
    5.3.2 Corpus Download
    5.3.3 Observation on Chat Logs
    5.3.4 Observation on Summary Digests
  5.4 Fine-grained Clustering
    5.4.1 Message Segmentation
    5.4.2 Clustering
  5.5 Summary Extraction
    5.5.1 Adjacent Response Pairs
    5.5.2 AP Corpus and Baseline
    5.5.3 Features
    5.5.4 Maximum Entropy
    5.5.5 Support Vector Machine
    5.5.6 Results
    5.5.7 Summary Generation
  5.6 Summary Evaluation
    5.6.1 Reference Summaries
    5.6.2 Summarization Results
    5.6.3 Discussion
  5.7 Applicability to Other Domains
    5.7.1 Introduction
    5.7.2 Deployment
    5.7.3 Extrinsic Evaluation

6 Automatic Summary Evaluation
  6.1 Related Work
    6.1.1 Manual Evaluation
    6.1.2 Automatic Evaluation
  6.2 Basic Elements
    6.2.1 Definition
  6.3 The BE Method
    6.3.1 Creating BEs
    6.3.2 Weighing BEs
    6.3.3 Matching BEs
    6.3.4 Combining BE Scores
  6.4 Testing and Validating BEs
    6.4.1 Multi-document Test Set
    6.4.2 Single-document Test Set
    6.4.3 Comparing BE with ROUGE
  6.5 BE-based Summarization: Beyond Evaluation
    6.5.1 Using BEs to Extract
    6.5.2 Using BEs to Compress
      6.5.2.1 Content Labeling
      6.5.2.2 Parse Tree Reduction
      6.5.2.3 Validation

7 Using Paraphrases in Distillation Evaluation
  7.1 Motivation
    7.1.1 Paraphrase Matching
      7.1.1.1 Synonymy Relations
  7.2 Paraphrase Acquisition
    7.2.1 Previous and Related Work
  7.3 Re-evaluating Machine Translation Results with Paraphrase Support
    7.3.1 N-gram Co-occurrence Statistics
      7.3.1.1 The BLEU-esque Matching Philosophy
      7.3.1.2 Lack of Paraphrasing Support
    7.3.2 ParaEval for MT Evaluation
      7.3.2.1 Overview
      7.3.2.2 The ParaEval Evaluation Procedure
    7.3.3 Evaluating ParaEval
      7.3.3.1 Validating ParaEval
      7.3.3.2 Implications to Word-alignment
      7.3.3.3 Implications to Evaluating Paraphrase Quality
    7.3.4 ParaEval's Support for Recall Computation
      7.3.4.1 Using Single References for Recall
      7.3.4.2 Recall and Adequacy Correlations
      7.3.4.3 Not All Single References are Created Equal
    7.3.5 Observation of Change in Number of References
  7.4 ParaEval for Summarization Evaluation
    7.4.0.1 Previous Work in Summarization Evaluation
    7.4.1 Summary Comparison in ParaEval
      7.4.1.1 Description
      7.4.1.2 Multi-Word Paraphrase Matching
      7.4.1.3 Synonym Matching
      7.4.1.4 Lexical Matching
    7.4.2 Evaluation
      7.4.2.1 Document Understanding Conference
      7.4.2.2 Validation and Discussion
  7.5 Future Work

8 Conclusion and Future Work
  8.1 Domain-specific Distillation Tasks
    8.1.1 Headline Generation
    8.1.2 Biography Creation
    8.1.3 Online Discussion Summarization
  8.2 Distillation Evaluation
  8.3 Future Work

Reference List

List Of Tables

3.1 Study on sequential template matching of headline against its text, on training data
3.2 Results on bag-of-words model combinations
3.3 System-generated headlines. A headline can be concatenated from several phrases, separated by '/'s
3.4 Results evaluated for unigram overlap
3.5 Performance on ROUGE
4.1 Performance of 10-Class sentence classification, using Naïve Bayes Classifier
4.2 Classification results on 2-Class using Naïve Bayes, SVM, and C4.5
5.1 Lexical features
5.2 Accuracy on identifying APs
5.3 Summary of results
6.1 Correlation between BE and DUC2003 multi-doc
6.2 Correlation between BE and DUC2003 single-doc
6.3 Correlation between ROUGE and DUC2003 multi-doc
7.1 Ranking correlations with human assessments
7.2 ParaEval's recall ranking correlation
7.3 ParaEval's correlation (precision) while using only single references
7.4 Differences among reference translations (raw ParaEval precision scores)
7.5 BLEU's correlating behavior with multi- and single-reference
7.6 System-ranking correlation when using modified unigram precision (MUP) scores
7.7 System-ranking correlation when using geometric mean (GM) of MUPs
7.8 System-ranking correlation when multiplying the brevity penalty with GM
7.9 Correlation with DUC 2003 MDS results

List Of Figures

2.1 Skorochod'ko's text structure types, reproduced from [32]
2.2 Monitoring subtopic change with lexical cohesion, reproduced from [32]
2.3 Dendrogram for documents, reproduced from [99]
2.4 A rhetorical structure tree, reproduced from [63]
2.5 An empty rhetorical document profile (RDP), reproduced from [94]
2.6 A multi-document cube, reproduced from [83]
2.7 A multi-document graph, reproduced from [83]
3.1 Surrounding bigrams for top-scoring words
4.1 Overall design of the biography creation algorithm
4.2 Official ROUGE performance results from DUC2004. Peer systems are labeled with numeric IDs. Humans are labeled A-H. 86 is our system with 2-class biography classification. Baseline is 5
4.3 Unofficial ROUGE results. Humans are labeled A-H. Peer systems are labeled with numeric IDs. 86 is our system with 10-class biography classification. Baseline is 5
5.1 An example of chat subtopic structure and relation between correspondences
5.2 A system-produced summary
5.3 An original Kernel Traffic digest
5.4 A reference summary reproduced from a summary digest
5.5 A short example from Baseline 2
5.6 User feedback
6.1 A pyramid of 4 levels. Reproduced from [72]
6.2 A set of BEs extracted from a Collins' parse tree
6.3 An example for sentence compression
7.1 Paraphrases created by the Pyramid method
7.2 Two translations produced by humans, from NIST Chinese MT evaluation [74]
7.3 An example of the paraphrase extraction process
7.4 Modified n-gram precision from BLEU
7.5 Two reference translations. Grey areas are matched by using BLEU
7.6 ParaEval's matching process
7.7 Overall system and human ranking
7.8 Comparison of summaries
7.9 Local vs. global paraphrase matching
7.10 Description for Figure 7.9: Local vs. global paraphrase matching
7.11 Solution for the example in Figure 7.9
7.12 A detailed look at the scores assigned by lexical and paraphrase/synonym comparisons
7.13 Paraphrases matched by ParaEval

Abstract

This thesis is an inquiry into the importance of incorporating domain knowledge into emerging information distillation tasks that are in principle similar to text summarization, but in practice require techniques not adequately addressed in previous work. The tasks analyzed are headline generation, biography creation, online discussion summarization, and automatic evaluation for summaries.

This thesis shows empirically that while traditional text summarization techniques are designed for generic summarization tasks, they cannot be readily applied to the above four tasks. Each task requires prior knowledge of the operating domain, data type, task structure, and output structure. Techniques and algorithms designed with this knowledge perform significantly better than those designed without it.

This thesis explores solutions to headline generation, the generation of summaries of very short length. By identifying features that are specific to headlines, a keyword selection model was designed to select words that are headline-worthy. Context information surrounding these headline words is extracted to produce phrase-based headlines.

Typical question-answering systems target definition questions and produce factoid answers.
However, when questions require complex answers, such as "who is x" questions, a biography creation engine is needed to address the problem. By categorizing a person's life into multiple classes of information, the engine becomes a classification engine, coupled with extraction and re-ranking algorithms, and produces biographies covering every aspect of a person's life.

The emergence of multi-party conversations recorded in text, such as online discussions, has prompted development and analysis of summarization for this kind of input. Recognizing the speech aspect of this type of information, including modeling subtopic structures and the exchanges between multiple speakers, yields significantly better summaries, whose construction is also in accordance with what human summary writers do.

Text summarization evaluation had previously been limited to manual annotation or comparison on lexical identity. What separates manual and automatic matching is the ability to paraphrase, which makes automatic metrics extremely vulnerable. This thesis provides a solution to bridge the gap by using a large paraphrase collection that is acquired by applying statistical phrase-based machine translation (MT) algorithms to parallel data. This procedure produces a significantly higher correlation with human judgments and can become an objective function as part of a summarization system.

Chapter 1
Introduction

1.1 Information Distillation

Distillation is the extraction of components of a mixture by condensation and collection. When this process is applied to a liquid, it is the evaporation and subsequent collection of the liquid by condensation as a means of purification. When this process is applied to information processing, it is the mixture and combination of operations from various Natural Language Processing (NLP) systems. Most prominently, text summarization is part of the distillation effort. Human effort is preferred if the summarization task is easily conducted and managed, and is not repeatedly performed. However, when resources (time, monetary compensation, and people) are limited, automatic summarization becomes more desirable.

The emergence of new data types as major information sources has prompted increased interest in summarization within the NLP community. These data types include newspaper texts, online discussions, chats, etc. One might assume a smooth transition from generic text-based summarization to summarizing these newly emerged data types. The new summarization tasks include producing summaries of very short length, creating a biography-like summary focused on a person, and summarizing long and complex online discussions posted by online newsgroups. However, each of the tasks requires domain- and genre-specific knowledge in order to properly address the problem associated with it. This property makes summarization in these areas different from, and sometimes even more difficult than, traditional generic text summarization.

1.2 Domain-specific Distillation Tasks

This thesis is an inquiry into the importance of incorporating domain knowledge into emerging information distillation tasks that are in principle similar to text summarization, but in practice require techniques not adequately addressed in previous work. The tasks analyzed are headline generation, biography creation, online discussion summarization, and automatic evaluation for summaries.
This thesis shows empirically that while traditional text summarization techniques are designed for generic summarization tasks, they cannot be readily applied to the above four tasks. Each task requires prior knowledge of the operating domain, data type, task structure, and output structure. Techniques and algorithms designed with this knowledge perform significantly better than those designed without it.

1.2.1 Headline Generation

Most work on summarization focuses on generating summaries by extraction. The size of extractive summaries is still quite large when the sole purpose is to decide whether a document is of interest. In such scenarios, the central idea(s) presented in a list of phrases (10 words) should suffice. I introduce a headline summarization system that selects headline words from the text, and then composes them by finding phrase clusters locally in the beginning of the text. After going through a post-processing phase, the phrase clusters become the resulting headline.

The solution model for this task is a simple one and relies on prior knowledge of the operating domain, which is newswire data. Newspaper articles tend to be written in a fashion such that the beginning of an article reflects the more important content of the entire text. The solution model is designed to capitalize on this property. I studied several bag-of-words models for headline keyword extraction. Based on the performance on testing data for these various model combinations, the best model combination for headline word selection is chosen. To produce phrase-based headlines, a headline language model was applied to extract the contexts surrounding the headline keywords. This process provides a level of readability for the headline words, resulting in more fluent and readable headlines.

1.2.2 Biography Creation

Typical question-answering systems target definition questions and produce factoid answers. However, when questions require complex answers, such as "who is x" questions, a biography creation engine is needed to address the problem. By categorizing a person's life into multiple classes of information, the engine becomes a classification engine, coupled with extraction and re-ranking algorithms, and produces biographies covering every aspect of a person's life. This solution model relies on the knowledge that, across a spectrum of personalities, the categories for personal information remain unchanged.

1.2.3 Online Discussion Summarization

Discussions fall in the genre of correspondence, which requires dialogue and conversation analysis. While the majority of summarization research has focused either on text summarization or on speech summarization, my solution model addresses the challenges of automatically summarizing email/chat discussions. For system development and evaluation, I use the GNUe IRC archive, which is uniquely suited for our experimental purpose because each IRC chat log has a companion summary digest written by project participants as part of their contribution to the community. This manual summary constitutes gold-standard data for evaluation.

Topic drift occurs in chats more radically than in written genres, and interpersonal and pragmatic content appears more frequently. Questions about the content and overall organization of the summary must be addressed in a more thorough way for discussions and other dialogue summarization systems. This property, which I call subtopic structure, is observed in both discussions and reference summaries.
To model the interactions among conversation participants, one could define a complicated network that interlinks various speech acts. However, for online discussions, participants' primary goal is to seek information that others provide and to provide information that others seek. This dominant behavior is modeled as (initiating, responding) problem pairs. Having incorporated subtopic structures and participant interactions into the process of creating summaries, my solution model adequately supplements generic summarization techniques in addressing online discussion summarization.

1.2.4 Distillation Evaluation

Machine translation and automatic speech recognition have gained greatly from automated evaluation measures, which enable rapid system growth and improvement. The text summarization community has also been searching for automated procedures that produce reliable evaluation scores and correlate well with human judgments. When measuring the content of a summary, current automated methods compare fragments of a system-produced summary against one or more reference summaries (typically produced by humans). The larger the overlap (i.e., the more desirable fragments the summary contains), the better the system summary is.

During the past few years we have seen the development of several automated and semi-automated summary evaluation methods. All of them require determining the degree of match between (fragments of) the summary to be evaluated and a set of reference summaries, but they differ widely in the length and nature of the fragments, the type of matching allowed, and the composition of match scores. We describe an evaluation framework that can easily be instantiated, an implementation of the framework using very small units of content (Basic Elements), and address some of the shortcomings of previous approaches. Choosing appropriate fragment lengths, comparing fragments, and interpreting the matches are problems that have not yet been resolved satisfactorily. From previous work, we see that sentence-length fragments are too long because they contain many distinct pieces of information, some of which may or may not be important for a summary. On the other hand, single-word fragments suffer from the fact that they may mislead the matching process, since words in combination may carry different meanings from individual words. The Basic Element (BE) framework, developed in a collaborative effort with our summarization team, addresses this variability in fragment size and content.

The fragment-matching problem is even more pressing. There is no a priori reason for the system to employ exactly the same words or expressions as a reference summary does, as long as it conveys the same meaning. The ability to recognize and utilize paraphrases automatically would make a significant and practical contribution to many NLP tasks. In machine translation (MT), a system that can distinguish and choose paraphrases—having the same meaning but differing in lexical nuances—produces better translation results than others that cannot. Similarly, in text summarization, where the task is to emphasize differences across documents and synthesize common information, being able to recognize paraphrases would lead to better content selection, compression, and generation. My solution model, named ParaEval, utilizes statistical MT methods to extract phrases that are translated into the same phrase in a different language. Phrases sharing a translation are grouped together as paraphrases.
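As an illustration of this pivot-style grouping, the following minimal sketch groups source-language phrases by a shared foreign translation taken from a phrase table. The phrase-pair data and the helper name `group_paraphrases` are hypothetical, used only to show the idea; they are not the thesis implementation.

```python
from collections import defaultdict

def group_paraphrases(phrase_pairs):
    """Group English phrases that share a foreign translation (the pivot).

    phrase_pairs: iterable of (english_phrase, foreign_phrase) tuples,
    e.g. extracted from a statistical MT phrase table.
    Returns a dict mapping each foreign pivot phrase to the set of English
    phrases aligned to it; sets with 2+ members are paraphrase candidates.
    """
    by_pivot = defaultdict(set)
    for english, foreign in phrase_pairs:
        by_pivot[foreign].add(english)
    # Keep only pivots that actually yield paraphrase sets.
    return {pivot: phrases for pivot, phrases in by_pivot.items()
            if len(phrases) > 1}

# Hypothetical phrase-table fragment, for illustration only.
pairs = [("passed away", "est décédé"), ("died", "est décédé"),
         ("died", "est mort"), ("the red car", "la voiture rouge")]
print(group_paraphrases(pairs))   # {'est décédé': {'passed away', 'died'}}
```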
I separate paraphrase matching from lexical matching in MT and summarization evaluation procedures. Paraphrase matching includes the matching of multi-word phrases and/or single-word synonyms. A lexical unigram matching process is then performed on the text fragments not consumed by the paraphrase matching step.

1.3 Research Contributions

This thesis makes the following contributions:

1. It is among the first to experiment with creating very short summaries for newspaper articles.
2. It is the first to recognize and incorporate different categories of personal information into biographical summaries. With query-based summarization currently in focus, this approach is very important in answering "who is x" questions.
3. It is the first to combine text-based summarization techniques with multi-party dialogue analyses to create summaries for online discussions.
4. With the development of paraphrase matching for text comparison, this thesis is the first to pursue the measurement of semantic closeness.

1.4 Thesis Outline

• Chapter 2 describes various text structural representations and their application to summarization.
• Chapter 3 describes my solution model for headline generation.
• Chapter 4 shows how predetermined biographical categories are used in creating summaries based on "who is x" queries.
• Chapter 5 introduces summarization for online discussions.
• Chapter 6 describes the Basic Element framework and its application in evaluation and multi-document summarization.
• Chapter 7 shows that incorporating paraphrases into the process of text comparison significantly improves the evaluation methodologies for current distillation tasks.
• Chapter 8 concludes the thesis and outlines future work.

Chapter 2
Previous Work

Text summarization is a difficult problem and warrants a great deal of research. The focus of this thesis is on creating and performing summarization tasks that utilize both generic summarization techniques and domain- and genre-specific knowledge. The domain- and genre-specific knowledge is derived from analyses of the data collections that are to be summarized.

In their 1991 report, Salton et al. [89] state that corpus-based methods are an alternative to deep language analysis, which was described as: "conventional wisdom is that sophisticated conceptual text representations are needed for information retrieval purposes, including the use of thesauruses tailored to particular subject areas, and of preconstructed knowledge structures that classify the main entities of interest in a given subject area, and specify the relationships that may hold between the entities in particular areas of application."

Yaari [99] believes that it is the reader of a text, or of a collection of texts, who decides the kind and depth of language analysis. To implement text understanding, deep semantic analysis is needed. For browsing and reading expository texts, delineating the structure of the text and providing easy access to the discovered substructure should suffice.

Many researchers agree ([32], [42], [67], [82], [89], [96], [8], and [99]) and have presented work on discourse structure identification. They share the understanding that discourse structures can be extremely useful to natural language processing (NLP) applications such as automatic text summarization or information retrieval (IR) [42].
From investigating work in text segmentation and discourse modeling, we try to answer two basic questions that are fundamental to text summarization: 1) what are the basic elementary discussion units, and 2) what kind of structure can be built from these elementary units?

Text segmentation is performed on single documents to produce coarse-grained textual structures. It does not show whether segments from multiple documents are associated with one another, nor the kind of relations between them. It merely demonstrates local coherence, as defined by Agar and Hobbs in 1982 [2]. Local coherence, or the recognition of the coherence relations—the relations conveyed by the mere adjacency of segments of text—gives us a discourse structure. Many use lexical cohesion relations in place of coherence relations to approximate such discourse structures.

2.1 Text Segmentation

Linear segmentation of input texts is based on the assumption that textual structure can be characterized as a sequence of subtopical discussions that occur in the context of a few main topic discussions [32]. The most noticeable work that automated this process is TextTiling [32].

Hearst relates her work to the Piecewise Monolithic Structure, one of Skorochod'ko's text structure types [92], shown in Figure 2.1. This topology represents sequences of densely interrelated discussions linked together, one after another.

Figure 2.1: Skorochod'ko's text structure types, reproduced from [32].

Following Chafe's notion of The Flow Model of discourse [14], the TextTiling algorithms are designed to recognize segment boundaries by determining the maximal change in subtopics using lexical cohesion (Figure 2.2), such as term repetition. To discover subtopic structure, two methods are introduced: 1) monitoring the changes in semantic similarity between two blocks of text, and 2) monitoring the start and end of lexical chains, defined by Morris and Hirst in 1991 [71].

Figure 2.2: Monitoring subtopic change with lexical cohesion, reproduced from [32].

Kan et al. [42] show that, in addition to segmenting text, the discovered discourse segments can be labeled by their function and relevance to the whole text. Their system identifies segments that contribute towards the main topic of the input text, segments that summarize the key points, and segments that have less informational value. The segment classification module is used in conjunction with a summarization system.

Yaari [99] introduced text segmentation by hierarchical agglomerative clustering (HAC) to produce a hierarchical discourse structure for documents. In general HAC applications, at each stage all available segments are measured for their proximity to the newly merged cluster. In Yaari's work, however, only the proximity between a segment and its two neighbors is computed. This enforces the linearity of segments in the hierarchical structure produced. Figure 2.3 shows a hierarchical structure—formally, a dendrogram—produced for a document.

Figure 2.3: Dendrogram for documents, reproduced from [99].
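To make the block-comparison idea behind this kind of linear segmentation concrete, here is a minimal sketch that scores the lexical similarity of adjacent blocks of tokenized sentences and proposes boundaries at similarity valleys. It illustrates the general technique only; it is not the published TextTiling algorithm, and the function names, fixed block size, and flat threshold are my own assumptions.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def segment(sentences, block_size=3, similarity_threshold=0.1):
    """Propose segment boundaries between sentences where the lexical
    similarity of the preceding and following blocks drops."""
    tokenized = [Counter(s.lower().split()) for s in sentences]
    boundaries = []
    for i in range(block_size, len(tokenized) - block_size + 1):
        left = sum(tokenized[i - block_size:i], Counter())
        right = sum(tokenized[i:i + block_size], Counter())
        if cosine(left, right) < similarity_threshold:
            boundaries.append(i)   # a subtopic shift before sentence i
    return boundaries
```

A fuller implementation would smooth the similarity curve and use relative depth scores rather than a fixed threshold, but the cohesion signal being monitored is the same.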
2.2 Single-Document Structure

Well-formed discourse structures are built recursively from elementary discourse constituent units, together with a typology of higher-level constituents defined by rules specifying their syntax and semantics [79]. Rhetorical or coherence relation theory characterizes the relationships holding between propositional elements in adjacent sentences of coherent written discourse ([79], [26], [62], [36], [70], and [35]). These approaches define discourse structures elegantly. However, they are mostly domain dependent, restricted to a limited number of texts, or reliant on sophisticated and large knowledge bases [89].

Marcu in 1997 introduced a rhetorical discourse parser that is capable of deriving discourse structures for unrestricted texts [64]. Following the principles of Rhetorical Structure Theory (RST) by Mann and Thompson [62], Marcu described an algorithm that automatically determines the textual units for a discourse tree and the rhetorical relations that hold between them, illustrated in Figure 2.4. The resulting parsed structure is called the rhetorical structure (RS) tree. In his follow-up work on summarization utilizing rhetorical structures, Marcu [63] takes the RS-tree produced by the rhetorical parser and selects the textual units that are most salient in that text. If the goal is to produce summaries of very short length, internal nodes closer to the root are selected; for longer summaries, nodes farther from the root are selected.

Figure 2.4: A rhetorical structure tree, reproduced from [63].

Teufel [94] argues that RST models micro-structure, i.e., relations holding between clauses, and does not model the global relations between given statements and the rhetorical act of the whole article, namely the macro-level structure. She believes that it is macro-structure, not micro-structure, which is useful for summarization and document representation. There are fewer constraints on relations between paragraphs and large segments when they are marked by humans, and it is doubtful that the structure of these large segments of a text is hierarchical in the way micro-level structure is. Teufel designs a new document surrogate, called the Rhetorical Document Profile (RDP), which consists of rhetorical slots and profiles different kinds of information about documents. An empty RDP is shown in Figure 2.5. When used in the context of summarization, the relevant material identified to fill the empty RDP slots already produces better textual extracts than those provided by sentence extraction methods.

Figure 2.5: An empty rhetorical document profile (RDP), reproduced from [94].

2.3 Cross-Document Structure

The discussion so far has focused on single-document structural representation. When it comes to language analyses involving multiple documents, the approaches described in the previous sections are insufficient.

The Textnet system built by Trigg and Weiser [96] structures texts with two types of nodes and one type of labeled link. Intra-textual data objects are modeled by chunk nodes. Toc nodes are the meta-level or conceptual-level counterparts of chunk nodes and are interconnected by links within a hierarchical directed acyclic graph (DAG). This data structure can be applied both locally, to a collection of scientific documents, and globally, through a network of computers.

Radev [83] introduced two data structures, the multi-document cube and the multi-document graph, to represent cross-document structure in support of cross-document structure theory (CST). A multi-document cube is a structure of three dimensions: time, source, and position within the document. Summarizing a collection of texts is then to transform the cube that represents those texts into a summary, which consists of data points in the cube, as illustrated in Figure 2.6.

Figure 2.6: A multi-document cube, reproduced from [83].
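Purely as an illustration of the cube's three dimensions (the thesis gives no implementation), a minimal sketch might index text units by (time, source, position) and treat a summary as a selected subset of those points; the class and field names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, List

@dataclass
class MultiDocumentCube:
    # Maps (time, source, position-in-document) to a text unit, e.g. a sentence.
    cells: Dict[Tuple[str, str, int], str] = field(default_factory=dict)

    def add(self, time: str, source: str, position: int, text: str) -> None:
        self.cells[(time, source, position)] = text

    def summarize(self, selected: List[Tuple[str, str, int]]) -> List[str]:
        """A summary is a selection of data points from the cube."""
        return [self.cells[p] for p in selected if p in self.cells]

cube = MultiDocumentCube()
cube.add("1998-08-07", "AP", 0, "Bombs exploded outside two U.S. embassies in East Africa.")
cube.add("1998-08-08", "Reuters", 0, "The death toll from the embassy bombings rose sharply.")
print(cube.summarize([("1998-08-07", "AP", 0)]))
```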
Concerned with modeling different levels of text granularity, the multi-document graph consists of smaller subgraphs for individual documents. The relationships between texts and among textual units are best described as hypotactic and paratactic [64]. Figure 2.7 shows an example.

Figure 2.7: A multi-document graph, reproduced from [83].

2.4 Chapter Outlook

In this chapter, we have described work that identifies discourse structure and its application to text-based summarization, and shown two conceptual structures observed from our study. The question remains: can we automatically build these structures, and are they helpful in creating domain/genre-specific summaries? In the following chapters, we will show how to summarize following those structures and how to measure the quality of the summaries created.

Chapter 3
Headline Generation

Producing headline-length summaries is a challenging summarization problem. Due to the length restriction, every word in a headline is important. But the need for grammaticality—or at least intelligibility—sometimes requires the inclusion of non-content words. Forgoing grammaticality, one might compose a headline summary by simply listing the most important noun phrases one after another. At the other extreme, one might pick just one fairly indicative sentence of appropriate length, ignoring all other material. Ideally, we want to find a balance between including raw information and supporting intelligibility. I experimented with methods that integrate content-based and form-based criteria. The process consists of two phases. The keyword-clustering component finds headline phrases in the beginning of the text using a list of globally selected keywords. The template filter then uses a collection of pre-specified headline templates and populates them with headline phrases to produce the resulting headline. In this chapter, I describe previous work and a study on the use of headline templates, and discuss the process of selecting and expanding key headline phrases.

3.1 Previous Work

Headlines are often written in a different way than normal spoken or written language. English front-page headlines follow a grammar dedicated to "headlinese", defined by Mardh [66]. Certain constituents are dropped with no loss of meaning, most notably certain instances of have and be [101].

Several systems have been developed to address the need for headline-style summaries. A lossy summarizer that translates news stories into target summaries using the IBM-style statistical machine translation (MT) model was shown by Banko et al. [3]. Conditional probabilities for a limited vocabulary and bigram transition probabilities as a headline syntax approximation were incorporated into the translation model. It was shown to work surprisingly well in a stand-alone quantitative evaluation of content coverage. The use of a noisy-channel model and a Viterbi search was shown in another MT-inspired headline summarization system, from Zajic et al. [100]. The method was automatically evaluated with the Bi-Lingual Evaluation Understudy (BLEU) [78] and scored 0.1886 with its limited length model. A non-statistical system, coupled with linguistically motivated heuristics and using a parse-and-trim approach based on parse trees, was reported in the development of HedgeTrimmer [19]. It achieved 0.1341 on BLEU with an average of 8.5 words.
Even though human evaluations have been conducted in the past, we still do not have sufficient material to perform a comprehensive comparative evaluation on a large enough scale to claim that one method is clearly better than the others.

Very recently (after our headline system was published), an extension of HedgeTrimmer [19], named Topiary, was reported at DUC04 [101]. This new algorithm combines sentence compression with Unsupervised Topic Discovery (UTD). The sentence compression module removes constituents from a parse tree of the lead sentence according to a set of linguistically motivated heuristics until a length threshold is reached. UTD is a statistical method for deriving a set of topic models from a document corpus, assigning meaningful names to the topic models, and associating sets of topics with specific documents. This combined approach proves to be effective in content coverage and performs almost equivalently to humans in content selection measurements.

3.2 A First Look at Headline Templates

We want to discover how headlines are related to derived templates, using a training set of 60,933 (headline, text) pairs.

3.2.1 Template Creation

We view each headline in our training corpus as a potential template for future headline creation. For any new text(s), if we can select an appropriate template from the set and populate it with content-bearing words, then we will have a well-structured headline. An abstract representation of the templates suitable for matching against new material is required. In our current work, we build templates at the part-of-speech (POS) level.

3.2.2 Sequential Recognition of Templates

We tested how well headline templates overlap with the opening sentences of texts by matching POS tags sequentially. The second column of Table 3.1 shows the percentage of files whose POS-level headline words appeared sequentially within the context described in the first column.

Text size               Files from corpus (%)
First sentence          20.01
First two sentences     32.41
First three sentences   41.90
All sentences           75.55

Table 3.1: Study on sequential template matching of headline against its text, on training data.

3.2.3 Filling Templates with Key Words

Filling POS templates sequentially using tagging information alone is not the most appropriate way to demonstrate the concept of headline summarization using template abstraction, since it completely ignores the semantic information carried by the words themselves. Therefore, using the same set of POS headline templates, we modified the filling procedure.

Given a new text, each non-stop word is categorized by its POS tag and ranked within each POS category according to its tf.idf weight. The word with the highest tf.idf weight from the matching POS category is chosen to fill each placeholder in a template. If the same tag appears more than once in the template, a subsequent placeholder is filled with the word whose weight is the next highest from the same tag category. The score for each filled template is calculated as follows:

score_t(i) = \frac{\sum_{j=1}^{N} W_j}{|desired\_len - template\_len| + 1}

where score_t(i) denotes the final score assigned to template i of up to N placeholders, and W_j is the tf.idf weight of the word assigned to a placeholder in the template. This scoring mechanism prefers templates with the most desirable length. The highest-scoring template-filled headline is chosen as the result.
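The following minimal sketch illustrates this filling-and-scoring step under stated assumptions: templates are given as lists of POS tags, candidate words arrive with precomputed tf.idf weights, and the helper names are hypothetical rather than the thesis code.

```python
from collections import defaultdict

def fill_template(template_tags, ranked_words, desired_len=10):
    """Fill a POS-tag template with the highest-weighted unused word of each
    tag and score it as sum(tf.idf) / (|desired_len - template_len| + 1).

    template_tags: e.g. ["NNP", "VBD", "NN"]
    ranked_words:  dict mapping a POS tag to a list of (word, tfidf_weight)
                   pairs sorted by descending weight.
    """
    used = defaultdict(int)           # how many words of each tag were consumed
    filled, total_weight = [], 0.0
    for tag in template_tags:
        candidates = ranked_words.get(tag, [])
        if used[tag] >= len(candidates):
            return None, 0.0          # template cannot be filled
        word, weight = candidates[used[tag]]
        used[tag] += 1
        filled.append(word)
        total_weight += weight
    score = total_weight / (abs(desired_len - len(template_tags)) + 1)
    return " ".join(filled), score

def best_headline(templates, ranked_words, desired_len=10):
    scored = (fill_template(t, ranked_words, desired_len) for t in templates)
    return max(scored, key=lambda pair: pair[1])
```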
3.3 Key Phrase Selection

Headlines generated using our templates are grammatical (by virtue of the templates) and reflect some content (by virtue of the tf.idf scores), but there is no guarantee of semantic accuracy. This led us to search for key phrases as the candidates for filling headline templates. Headline phrases should be expanded from single seed words that are important and uniquely reflect the contents of the text itself. To select the best seed words for key phrase expansion, we studied several keyword selection models [103], described below.

3.3.1 Model Selection

Bag-of-Words Models

1. Sentence Position Model: Sentence position information has long been proven useful in identifying topics of texts [22]. We believe this idea also applies to the selection of headline words. Given a sentence and its position in the text, what is the likelihood that it contains the first appearance of a headline word:

P(Pos_i) = \frac{Count_{Pos_i}}{\sum_{q=1}^{Q} Count_{Pos_q}}, \qquad Count_{Pos_i} = \sum_{k=1}^{M} \sum_{j=1}^{N} P(H_k \mid W_j)

Over all M texts in the collection and over all words from the corresponding M headlines (each with up to N words), Count_{Pos_i} records the number of times that sentence position i contains the first appearance of any headline word W_j. P(H_k | W_j) is a binary feature. This is computed for all sentence positions from 1 to Q. The resulting P(Pos_i) is a table of the tendency of each sentence position to contain one or more headline words (without indicating exact words).

2. Headline Word Position Model: For each headline word W_h, the sentence position Pos_i at which it most likely first appears:

P(Pos_i \mid W_h) = \frac{Count(Pos_i, W_h)}{\sum_{q=1}^{Q} Count(Pos_q, W_h)}

The difference between models 1 and 2 is that for the sentence position model, statistics were collected for each sentence position i, whereas for the headline word position model, information was collected for each headline word W_h.

3. Text Model: This model captures the correlation between words in the text and words in the headline [58]:

P(H_w \mid T_w) = \frac{\sum_{j=1}^{M} doc\_tf(w, j) \cdot title\_tf(w, j)}{\sum_{j=1}^{M} doc\_tf(w, j)}

doc_tf(w, j) denotes the term frequency of word w in the j-th document of all M documents in the collection, and title_tf(w, j) is the term frequency of word w in the j-th title. H_w and T_w are words that appear in both the headline and the text body; for each instance of an (H_w, T_w) pair, H_w = T_w.

4. Unigram Headline Model: Unigram probabilities of the headline words from the training set.

5. Bigram Headline Model: Bigram probabilities of the headline words from the training set.

Choice of Model Combinations

Having these five models, we needed to determine which model or model combination is best suited for headline word selection. The test data was the DUC2001 test set of 108 texts. The reference headlines are the original headlines, with a total of 808 words (not including stop words). The evaluation was based on the cumulative unigram overlap between the n top-scoring words and the reference headlines. The models are numbered as in the previous section. Table 3.2 shows the effectiveness of each model and model combination on the top 10, 20, 30, 40, and 50 scoring words.

Table 3.2: Results on bag-of-words model combinations.

Clearly, for all lengths greater than 10, sentence position (model 1) plays the most important role in selecting headline words. Selecting the top 50 words solely based on position information means that sentences in the beginning of a text are the most informative. However, when we are working with a more restricted length requirement, the text model (model 3) adds an advantage to the position model (highlighted, 7th from the bottom of Table 3.2). As a result, the following combination of the sentence position and text models was used:

P(H \mid W_i) = P(H \mid Pos_i) \cdot P(H_{w_i} \mid T_{w_i})
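A minimal sketch of how such a combined score might be applied to rank candidate headline words is given below; the probability tables are assumed to have been estimated from training data as described above, and the function name is illustrative only.

```python
def score_headline_words(doc_words, first_position, position_prob, text_prob):
    """Rank candidate headline words by P(H|Pos_i) * P(H_w|T_w).

    doc_words:      candidate (non-stop) words from the document
    first_position: dict word -> sentence index of its first appearance
    position_prob:  dict sentence index -> P(H|Pos_i), from the position model
    text_prob:      dict word -> P(H_w|T_w), from the text model
    """
    scores = {}
    for w in set(doc_words):
        pos = first_position.get(w)
        if pos is None:
            continue
        scores[w] = position_prob.get(pos, 0.0) * text_prob.get(w, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage: take the 10 top-scoring words as headline-worthy seed words.
# top_words = [w for w, s in score_headline_words(words, first_pos, p_pos, p_text)[:10]]
```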
3.3.2 Phrase Candidates to Fill Templates

We now need to expand keywords into phrases as candidates for filling templates. As illustrated in Table 3.2 and stated in [100], headlines of newspaper texts mostly use words from the beginning of the text. Therefore, we search for n-gram phrases comprising keywords in the first part of the story. Using the selected model combination, the 10 top-scoring words over the whole story are selected and highlighted in the first 50 words of the text. The system should be able to pull out the largest window of top-scoring words to form the headline. To help achieve grammaticality, we produced bigrams surrounding each headline-worthy word (underlined), as shown in Figure 3.1. By connecting overlapping bigrams in sequence, one sees interpretable clusters of words forming. Multiple headline phrases are considered as candidates for template filling. Using a set of hand-written rules, dangling words were removed from the beginning and end of each headline phrase.

Figure 3.1: Surrounding bigrams for top-scoring words.

3.4 Filling Templates with Phrases

3.4.1 Method

Key phrase clustering preserves text content but lacks a complete and correct representation for structuring phrases. The phrases need to go through a grammar filtering/reconstruction stage to gain grammaticality. A set of headline-worthy phrases with their corresponding POS tags is presented to the template filter. All templates in the collection are matched against each candidate headline phrase. Strict tag matching produces a small number of matching templates. To circumvent this problem, a more general tag-matching criterion was used, where tags belonging to the same part-of-speech category can be matched interchangeably. Headline phrases tend to be longer than most of the templates in the collection, which results in only partial matches between the phrases and the templates. A score of fullness of the phrase-template match is computed for each candidate template:

f_{t_i} = \frac{length(t_i) + matched\_length(h_i)}{length(t_i) + length(h_i)}

where t_i is a candidate template and h_i is a headline phrase. The top-scoring template is used to filter each headline phrase in composing the final multi-phrase headline. Table 3.3 shows a random selection of the results produced by the system.

Generated headlines:
  First Palestinian airlines flight depart Gazas airport
  Jerusalem / suicide bombers targeted market Friday setting blasts
  U.S. Senate outcome apparently rests small undecided voters
  Chileans wish forget years politics repression

Table 3.3: System-generated headlines. A headline can be concatenated from several phrases, separated by '/'s.
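As a minimal sketch of the fullness score under simple assumptions (templates and phrases are given as POS-tag sequences, and "matched length" is computed here with a longest-common-subsequence over coarse tag categories; the thesis does not specify that detail, so treat it as illustrative):

```python
def coarse(tag):
    # Collapse fine-grained Penn Treebank tags into coarse categories,
    # so that e.g. NN, NNS, NNP can be matched interchangeably.
    return tag[:2]

def matched_length(template_tags, phrase_tags):
    # Longest common subsequence over coarse tag categories.
    m, n = len(template_tags), len(phrase_tags)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if coarse(template_tags[i]) == coarse(phrase_tags[j]):
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def fullness(template_tags, phrase_tags):
    """f = (len(template) + matched_length) / (len(template) + len(phrase))."""
    return (len(template_tags) + matched_length(template_tags, phrase_tags)) / \
           (len(template_tags) + len(phrase_tags))

def best_template(templates, phrase_tags):
    return max(templates, key=lambda t: fullness(t, phrase_tags))
```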
3.5 Evaluation

Ideally, the evaluation should show the system's performance on both content selection and grammaticality. However, it is hard to measure the level of grammaticality achieved by a system computationally. Similar to [3], we restricted the evaluation to a quantitative analysis of content only. Our system was evaluated on the previously unseen DUC2003 test data of 615 files. For each file, headlines generated at various lengths were compared against i) the original headline, and ii) headlines written by four DUC2003 human assessors. The performance metric was to count term overlaps between the generated headlines and the test standards.

Table 3.4 shows the human agreement and the performance of the system against the two test standards. P and R are the precision and recall scores.

Test standard   Assessors (P / R)    Generated: length (words)   P        R
Original        0.3429 / 0.2336      9                           0.1167   0.1566
                                     12                          0.1073   0.2092
                                     13                          0.1075   0.2298
Assessors'      0.2186 / 0.2186      9                           0.1482   0.1351
                                     12                          0.1365   0.1811
                                     13                          0.1368   0.1992

Table 3.4: Results evaluated for unigram overlap.

The system-generated headlines were also evaluated using the automatic summarization evaluation tool ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [51]. The ROUGE score is a measure of n-gram recall between candidate headlines and a set of reference headlines. Its simplicity and reliability are gaining an audience, and it is becoming a standard for automatic comparative summarization evaluation. Table 3.5 shows the ROUGE performance results for generated headlines of length 12 against headlines written by human assessors.

            Human    Generated
Unigrams    0.292    0.169
Bigrams     0.084    0.042
Trigrams    0.030    0.010
4-grams     0.012    0.002

Table 3.5: Performance on ROUGE.

3.6 Conclusion

Structural abstraction at the POS level is shown to be helpful in our current experiment. However, part-of-speech tags do not generalize well and fail to model issues like subcategorization and other lexical semantic effects. This problem was seen in the fact that there are half as many templates as original headlines. A more refined pattern language, for example one taking into account named entity types and verb clusters, would further improve performance. We intend to incorporate additional natural language processing tools to create a more sophisticated and richer hierarchical structure for headline summarization.

Chapter 4
Biography Creation

Automatic text summarization is one form of information management. It is described as selecting a subset of sentences from a document that is in size a small percentage of the original and yet is just as informative. Summaries can serve as surrogates of the full texts in the context of Information Retrieval (IR). Summaries are created from two types of text sources: a single document or a set of documents. Multi-document summarization (MDS) is a natural and more elaborate extension of single-document summarization, and poses additional difficulties for algorithm design. The various kinds of summaries fall into two broad categories: generic summaries are direct derivatives of the source texts; special-interest summaries are generated in response to queries or topic-oriented questions.

One important application of special-interest MDS systems is creating biographies to answer questions like "who is Kofi Annan?". This task would be tedious for humans to perform in situations where the information related to the person is deeply and sparsely buried in a large quantity of news texts that are not obviously related. This chapter describes an MDS biography system that responds to "who is" questions by identifying information about the person in question using IR and classification techniques, and creates multi-document biographical summaries. The overall system design is shown in Figure 4.1.

Figure 4.1: Overall design of the biography creation algorithm.
To determine what sentences are selected, and how they are ranked, both a simple IR method and experimental classification methods contributed. The set of top-scoring sentences, after redundancy removal, is the resulting biography. As yet, the system contains no inter-sentence smoothing stage.

4.1 Recent Development

Two trends have dominated automatic summarization research, according to Mani [60]. One is the work focusing on generating summaries by extraction, i.e., finding a subset of the document that is indicative of its contents (Kupiec et al. [44]) using shallow linguistic analysis and statistics. The other influence is the exploration of deeper knowledge-based methods for condensing information. Knight and Marcu [43] equate summarization with compression at the sentence level to achieve grammaticality and information capture, and push a step beyond sentence extraction. Many systems use machine-learning methods to learn from readily aligned corpora of scientific articles and their corresponding abstracts. Zhou and Hovy [104] show a summarization system trained from automatically obtained text-summary alignments following the chronological occurrences of news events.

MDS poses more challenges in assessing similarities and differences among a set of documents. The simple idea of extract-and-concatenate does not respond to the problems arising from coherence and cohesion. Barzilay et al. [7] introduce a combination of extracted similar phrases and a reformulation through sentence generation. Lin and Hovy [55] apply a collection of known single-document summarization techniques, incorporating positional and topical information, clustering, etc., and extend them to perform MDS.

While many have suggested that conventional MDS systems can be applied to biography generation directly, Mani (2001) illustrates that the added functionality of biographical MDS comes at the expense of a substantial increase in system complexity and is somewhat beyond the capabilities of present-day MDS systems. The discussion was based in part on the only known MDS biography system, by Schiffman et al. [91], which uses corpus statistics along with linguistic knowledge to select and merge descriptions of people in news. The focus of that work was on synthesizing succinct descriptions of people by merging appositives identified through semantic processing using WordNet [24].

4.2 Corpus Description

In order to extract information related to a person from a large set of news texts written not exclusively about this person, we need to identify attributes shared among biographies. Biographies share certain standard components. We annotated a corpus of 130 biographies of 12 people (activists, artists, leaders, politicians, scientists, terrorists, etc.). We found 9 common elements: bio (information on birth and death), fame factor, personality, personal, social, education, nationality, scandal, and work. Biographies from this collection are marked at the clause level with one of the nine tags in XML format, for example:

Martin Luther King <nationality>was born in Atlanta, Georgia</nationality>. He <bio>was assassinated on April 4, 1968</bio>. King <education>entered the Boston University as a doctoral student</education>.

In all, 3579 biography-related phrases were identified and recorded for the collection, among them 321 bio, 423 fame, 114 personality, 465 personal, 293 social, 246 education, 95 nationality, 292 scandal, and 1330 work. We then used 100 biographies for training and 30 for testing the classification module.
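To make the use of this markup concrete, here is a minimal sketch that pulls (clause, label) training pairs out of text annotated in the style shown above; the regular expression and function name are illustrative assumptions, not the annotation tooling used for the thesis corpus.

```python
import re

TAGS = {"bio", "fame", "personality", "personal", "social",
        "education", "nationality", "scandal", "work"}

def extract_labeled_clauses(annotated_text):
    """Return (clause, label) pairs from clause-level XML-style markup,
    e.g. 'He <bio>was assassinated on April 4, 1968</bio>.'"""
    pattern = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
    return [(clause.strip(), label)
            for label, clause in pattern.findall(annotated_text)
            if label in TAGS]

text = ("Martin Luther King <nationality>was born in Atlanta, Georgia"
        "</nationality>. He <bio>was assassinated on April 4, 1968</bio>.")
print(extract_labeled_clauses(text))
# [('was born in Atlanta, Georgia', 'nationality'),
#  ('was assassinated on April 4, 1968', 'bio')]
```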
4.3 Sentence Classification

Relating to human practice in summarizing, three main points are relevant to aiding the automation process (Sparck Jones [40]). The first is a strong emphasis on particular purposes, e.g., abstracting or extracting articles of particular genres. The second is the drafting, writing, and revision cycle in constructing a summary. The third, essentially a consequence of the first two, is that the summarizing process can be guided by the use of checklists. The idea of a checklist is especially useful for the purpose of generating biographical summaries because a complete biography should contain various aspects of a person's life. From a careful analysis conducted while constructing the biography corpus, we believe that the checklist is shared and common among all persons in question, and consists of the 9 biographical elements introduced in Section 3.

The task of fulfilling the biography checklist becomes a classification problem. Classification is defined as the task of assigning examples to one of a discrete set of possible categories (Mitchell [69]). Text categorization techniques have been used extensively to improve the efficiency of information retrieval and organization. Here, sentences from a set of documents need to be categorized into different biography-related classes.

4.3.1 Task Definitions

We designed two classification tasks:

1) 10-Class: Given one or more texts about a person, the module must categorize each sentence into one of ten classes. The classes are the 9 biographical elements plus a class called none that corresponds to sentences bearing zero biographical information. This fine-grained classification task is unique in generating comprehensive biographies on people of interest. The ten classes are:
• bio
• fame
• personality
• social
• education
• nationality
• scandal
• personal
• work
• none

2) 2-Class: The module must make a binary decision of whether a given sentence is to be included in a biography summary. The classes are:
• bio
• none

The label bio appears in both task definitions but bears different meanings. Under 10-Class, class bio contains information on a person's birth or death; under 2-Class it sums up all 9 biographical elements from the 10-Class task.

4.3.2 Machine Learning Methods

We experimented with three machine learning methods for classifying sentences.

Naïve Bayes

The Naïve Bayes classifier is among the most effective algorithms known for learning to classify text documents (Mitchell [69]), calculating explicit probabilities for hypotheses. Using k features F_j, j = 1, ..., k, we assign to a given sentence S the class C:

C = \arg\max_C P(C \mid F_1, F_2, \ldots, F_k)

This can be expressed using Bayes' rule, as stated by Kupiec et al. [44]:

P(S \in C \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid S \in C) \cdot P(S \in C)}{P(F_1, F_2, \ldots, F_k)}

Assuming statistical independence of the features:

P(S \in C \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid S \in C) \cdot P(S \in C)}{\prod_{j=1}^{k} P(F_j)}

Since P(F_j) plays no role in selecting C:

P(S \in C \mid F_1, F_2, \ldots, F_k) \propto \prod_{j=1}^{k} P(F_j \mid S \in C) \cdot P(S \in C)

We trained on the relative frequencies of P(F_j | S ∈ C) and P(S ∈ C), with add-one smoothing. This method was used in classifying both the 10-Class and the 2-Class tasks.
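To make the add-one-smoothed model above concrete, here is a minimal sketch of the Naïve Bayes sentence classifier over unigram features. It is illustrative only: the class names follow the 10-Class task, but the whitespace tokenizer, the unigram-only features, and the training-data format are simplifying assumptions rather than the system's actual implementation.

    import math
    from collections import Counter, defaultdict

    CLASSES = ["bio", "fame", "personality", "social", "education",
               "nationality", "scandal", "personal", "work", "none"]

    class NaiveBayesSentenceClassifier:
        """Unigram Naive Bayes with add-one (Laplace) smoothing."""

        def train(self, labeled_sentences):
            # labeled_sentences: list of (sentence_text, class_label) pairs
            self.class_counts = Counter()
            self.word_counts = defaultdict(Counter)
            self.vocab = set()
            for text, label in labeled_sentences:
                self.class_counts[label] += 1
                for w in text.lower().split():
                    self.word_counts[label][w] += 1
                    self.vocab.add(w)
            self.total = sum(self.class_counts.values())

        def classify(self, sentence):
            words = sentence.lower().split()
            best_class, best_score = None, float("-inf")
            for c in CLASSES:
                # log P(S in C), smoothed over the ten classes
                score = math.log((self.class_counts[c] + 1) / (self.total + len(CLASSES)))
                denom = sum(self.word_counts[c].values()) + len(self.vocab)
                for w in words:
                    # log P(F_j | S in C) with add-one smoothing
                    score += math.log((self.word_counts[c][w] + 1) / denom)
                if score > best_score:
                    best_class, best_score = c, score
            return best_class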
Support Vector Machine

Support Vector Machines (SVMs [39]) have been shown to be effective classifiers in text categorization. We extend the idea of classifying documents into predefined categories to classifying sentences into one of the two biography categories defined by the 2-Class task. Sentences are categorized based on their biographical saliency (the percentage of clearly identified biography words) and their non-biographical saliency (the percentage of clearly identified non-biography words). We used LIBSVM [15] for training and testing.

Decision Tree (C4.5)

In addition to the SVM, we also used a decision-tree algorithm, C4.5 (Quinlan [80]), with the same training and testing data as the SVM.

4.3.3 Classification Results

The lower performance bound is set by a baseline system that randomly assigns a biographical class to each sentence, for both 10-Class and 2-Class. The 2599 testing sentences come from 30 unseen documents.

10-Class Classification

The Naïve Bayes classifier was used to perform the 10-Class task. Table 4.1 shows its performance with various features. Part-of-speech (POS) information from Brill [10] and word stems from Lovins [59] were used in some feature sets.

We bootstrapped 10395 more biography-indicating words by recording the immediate hypernyms, using WordNet [24], of the words collected from the controlled biography corpus described in Section 3. These words are called Expanded Unigrams, and their frequency scores are reduced to a fraction of the original word's frequency score.

Features                                              Precision/Recall (%)
Baseline                                              10.04
Unigram                                               69.41
Unigram + POS                                         70.64
Stem + POS                                            68.10
Expanded unigram (bio words + WordNet hypernyms)      60.98
Bigram                                                69.45
Relaxed unigram                                       70.41
Relaxed unigram + POS                                 71.53
Manually identified "work" words added                68.37

Table 4.1: Performance of 10-Class sentence classification, using the Naïve Bayes classifier.

Some sentences in the testing set were labeled with multiple biography classes because the original corpus was annotated at the clause level. Since the classification was done at the sentence level, we relaxed the matching/evaluating program to allow a hit when any one of the several classes was matched. This is shown in Table 4.1 as the Relaxed cases. A closer look at the instances where false negatives occur indicates that the classifier mislabeled instances of class work as instances of class none. To correct this error, we created a list of 5516 work-specific words, hoping that this would set a clearer boundary between the two classes. However, performance did not improve.

2-Class Classification

All three machine learning methods were evaluated on the 2-Class task. The results are shown in Table 4.2. The testing data is slightly skewed, with 68% of the sentences being none. In addition to using marked biographical phrases as training data, we also expanded the marking/tagging perimeter to sentence boundaries. As shown in the table, this creates noise.

Features                                       Precision/Recall (%)
Baseline                                       49.37
Naïve Bayes: Unigram (from bio phrases)        82.42
Naïve Bayes: Bigram                            47.98
Naïve Bayes: Unigram (from bio sentences)      68.23
Naïve Bayes: Unigram + POS (bio sentences)     65.06
SVM                                            74.47
C4.5                                           75.76

Table 4.2: Classification results on 2-Class using Naïve Bayes, SVM, and C4.5.
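The Expanded Unigram feature in Table 4.1 was built by collecting the immediate WordNet hypernyms of the annotated biography words. The following is a minimal sketch of that expansion step, using NLTK's WordNet interface as a stand-in for the original WordNet lookup; the down-weighting factor of 0.5 is an illustrative assumption, not the value used in the reported experiments.

    from collections import Counter
    from nltk.corpus import wordnet as wn  # requires nltk and the 'wordnet' data package

    def expand_unigrams(bio_word_freqs, weight=0.5):
        """Add immediate hypernyms of biography-indicating words.

        bio_word_freqs: Counter mapping annotated biography words to frequencies.
        Returns a new Counter containing the original words plus their hypernym
        lemmas, with hypernym scores reduced to a fraction of the source score.
        """
        expanded = Counter(bio_word_freqs)
        for word, freq in bio_word_freqs.items():
            for synset in wn.synsets(word):
                for hyper in synset.hypernyms():          # immediate hypernyms only
                    for lemma in hyper.lemma_names():
                        lemma = lemma.replace("_", " ").lower()
                        expanded[lemma] += weight * freq  # reduced score for bootstrapped words
        return expanded

    # Example: expand a tiny biography-word list
    print(expand_unigrams(Counter({"assassinated": 3, "doctoral": 2})).most_common(10))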
4.4 Biography Extraction

The biographical sentence classification module is only one of two components that supply the overall system with usable biographical content; it is followed by further stages of processing (see the system design in Figure 4.1). I discuss the other modules next.

4.4.1 Name-filter

A filter scans through all documents in the set, eliminating sentences that are direct quotes, dialogues, or of insufficient length (under 5 words). Person-oriented sentences containing any variation (first name only, last name only, or the full name) of the person's name are kept for subsequent steps. Sentences classified as biography-worthy are merged with the name-filtered sentences, with duplicates eliminated.

4.4.2 Sentence Ranking

An essential capability of a multi-document summarizer is to combine text passages in a manner useful to the reader (Goldstein et al. [30]). This includes a sentence ordering parameter (Mani [60]). Each of the sentences selected by the name-filter and the biography classifier is either related to the person-in-question via some news event, or referred to as part of this person's biographical profile, or both. We need a mechanism that will select sentences of informative significance within the source document set.

Using inverse term frequency (ITF), i.e., an estimation of information value, words with high information value (low ITF) are distinguished from those with low value (high ITF). A sorted list of words along with their ITF scores from a document set (topic ITFs) displays the important events, persons, etc., from this particular set of texts. This allows us to identify passages that are unusual with respect to the texts about the person. However, we also need to identify passages that are unusual in general: we have to quantify how these important words compare to the rest of the world. The world is represented by 413,307,562 words from the TREC-9 corpus (http://trec.nist.gov/data.html), with corresponding ITFs. The overall informativeness of each word w is:

C_w = \frac{d_{itf_w}}{W_{itf_w}}

where d_{itf_w} is the document-set ITF of word w and W_{itf_w} is the world ITF of w. A word that occurs frequently bears a lower C_w score compared to a rarely used word (bearing high information value), which receives a higher C_w score. Top-scoring sentences are then extracted according to:

C_s = \frac{\sum_{i=1}^{n} C_{w_i}}{len(s)}

The following is a set of sentences extracted according to the method described so far. The person-in-question is the famed cyclist Lance Armstrong.

1. Cycling helped him win his battle with cancer, and cancer helped him win the Tour de France.
2. Armstrong underwent four rounds of intense chemotherapy.
3. The surgeries and chemotherapy eliminated the cancer, and Armstrong began his cycling comeback.
4. The foundation supports cancer patients and survivors through education, awareness and research.
5. He underwent months of chemotherapy.

4.4.3 Redundancy Elimination

Summaries that emphasize the differences across documents while synthesizing common information would be the desirable final result. Removing similar information is part of all MDS systems. Redundancy is apparent in the Armstrong example from Section 5.2. To eliminate repetition while retaining interesting singletons, we modified the algorithm of Marcu [65] so that an extract can be automatically generated by starting with a full text and systematically removing one sentence at a time, as long as a stable semantic similarity with the original text is maintained. The original extraction algorithm was used to automatically create a large volume of (extract, abstract, text) tuples for training extraction-based summarization systems from (abstract, text) input pairs. Top-scoring sentences selected by the ranking mechanism described in Section 5.2 were the input to this component. The removal process was repeated until the desired summary length was achieved.
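A minimal sketch of the two steps just described: scoring sentences with the C_w and C_s formulas, then greedily dropping sentences while the remaining text stays close to the original. The cosine-similarity stability test is a simplification standing in for the modified Marcu algorithm, and the precomputed ITF tables, the 0.9 stability threshold, and the word-count length target are illustrative assumptions.

    import math
    from collections import Counter

    def sentence_score(sentence, topic_itf, world_itf):
        """C_s: normalized sum of C_w over the sentence's words.

        topic_itf, world_itf: precomputed ITF tables (word -> ITF score) for the
        document set and for the background 'world' corpus, respectively.
        """
        words = sentence.lower().split()
        cw = [topic_itf[w] / world_itf[w]            # C_w = d_itf_w / W_itf_w
              for w in words if w in topic_itf and w in world_itf]
        return sum(cw) / max(len(words), 1)

    def cosine(a, b):
        num = sum(a[w] * b.get(w, 0) for w in a)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def remove_redundancy(sentences, topic_itf, world_itf, target_words, stability=0.9):
        """Rank sentences by C_s, then drop the weakest ones while the reduced
        text keeps a stable word-vector similarity with the full ranked extract."""
        ranked = sorted(sentences,
                        key=lambda s: sentence_score(s, topic_itf, world_itf),
                        reverse=True)
        full_vec = Counter(" ".join(ranked).lower().split())
        kept = list(ranked)
        while len(" ".join(kept).split()) > target_words and len(kept) > 1:
            candidate = kept[:-1]                    # tentatively drop the lowest-scoring sentence
            cand_vec = Counter(" ".join(candidate).lower().split())
            if cosine(cand_vec, full_vec) < stability:
                break                                # further removal would drift from the original
            kept = candidate
        return kept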
Applying this method to the Armstrong example, the result leaves only one sentence that contains the topics chemotherapy and cancer. It chooses sentence 3, which is not bad, though sentence 1 might be preferable. 4.5 Evaluation 4.5.1 Overview Extrinsic and intrinsic evaluations are the two classes of text summarization evaluation methods (Sparck Jones and Galliers [93]). Measuring content coverage or summary informativeness is an approach commonly used for intrinsic evaluation. It measures how much source content was preserved in the summary. A complete evaluation should include evaluations of the accuracy of components involved in the summarization process (Schiffman et al. [91]). Performance of the sentence classifier was shown in Section 4. Here we will show the performance of the resulting summaries. 43 4.5.2 Coverage Evaluation An intrinsic evaluation of biography summarization was recently conducted under the guidance of Document Understanding Conference (DUC2004) using the automatic summarization evalu- ation tool ROUGE (Recall-Oriented Understudy for Gisting Evaluation) by Lin and Hovy [51]. 50 TREC English document clusters, each containing on average 10 news articles, were the input to the system. Summary length was restricted to 665 bytes. Brute force truncation was applied on longer summaries. The ROUGE-L metric is based on Longest Common Subsequence (LCS) overlap by Sag- gion et al. [88]. Figure 4.2 shows that our system (86) performs at an equivalent level with the best systems 9 and 10, that is, they both lie within our systems 95% upper confidence interval. The 2-class classification module was used in generating the answers. The figure also shows the performance data evaluated with lower and higher confidences set at 95%. The performance data are from official DUC results. Figure 4.3 shows the performance results of our system 86, using 10-class sentence classi- fication, comparing to other systems from DUC by replicating the official evaluating process. Only system 9 performs slightly better with its score being higher than our systems 95% upper confidence interval. A baseline system (5) that takes the first 665 bytes of the most recent text from the set as the resulting biography was also evaluated amongst the peer systems. Clearly, humans still perform at a level much superior to any system. 44 Figure 4.2: Official ROUGE performance results from DUC2004. Peer systems are labeled with numeric IDs. Humans are labeled A-H. 86 is our system with 2-class biography classification. Baseline is 5. 45 Figure 4.3: Unofficial ROUGE results. Humans are labeled AH. Peer systems are labeled with numeric IDs. 86 is our system with 10-class biography classification. Baseline is 5. 46 Measuring fluency and coherence is also important in reflecting the true quality of machine- generated summaries. There is no automated tool for this purpose currently. We plan to incor- porate one for the future development of this work. 4.5.3 Discussion N-gram recall scores are computed by ROUGE, in addition to ROUGE-L shown here. While cosine similarity and unigram-bigram overlap demonstrate a sufficient measure on content cov- erage, they are not sensitive on how information is sequenced in the text (Saggion et al. [88]). In evaluating and analyzing MDS results, metrics, such as ROUGE-L, that consider linguistic sequence are essential. Radev and McKeown [84] point out when summarizing interesting news events from multi- ple sources, one can expect reports with contradictory and redundant information. 
An intelligent summarizer should attain as much information as possible, combine it, and present it in the most concise form to the user. When we look at the different attributes in a persons life reported in news articles, a person is described by the job positions that he/she has held, by education in- stitutions that he/she has attended, and etc. Those data are confirmed biographical information and do not bear the necessary contradiction associated with evolving news stories. However, we do feel the need to address and resolve discrepancies if we were to create comprehensive and detailed biographies on people-in-news since miscellaneous personal facts are often overlooked 47 and told in conflicting reports. Misrepresented biographical information may well be contro- versies and may never be clarified. The scandal element from our corpus study (Section 3) is sufficient to identify information of the disputed kind. Extraction-based MDS summarizers, such as this one, present the inherent problem of lack- ing the discourse-level fluency. While sentence ordering for single document summarization can be determined from the ordering of sentences in the input article, sentences extracted by a MDS system may be from different articles and thus need a strategy on ordering to produce a fluent surface summary (Barzilay et al. [5]). Previous summarization systems have used tem- poral sequence as the guideline on ordering. This is especially true in generating biographies where a person is represented by a sequence of events that occurred in his/her life. Barzilay et al. also introduced a combinational method with an alternative strategy that approximates the information relatedness across the input texts. We plan to use a fixed-form structure for the majority of answer construction, fitted for biographies only. This will be a top-down ordering strategy, contrary to the bottom-up algorithm shown by Barzilay et al. 4.6 Conclusion and Future Work In this chapter, we described a system that uses IR and text categorization techniques to pro- vide summary-length answers to biographical questions. The core problem lies in extracting biography-related information from large volumes of news texts and composing them into flu- ent, concise, multi-document summaries. The summaries generated by the system address the question about the person, though not listing the chronological events occurring in this person’s 48 life due to the lack of background information in the news articles themselves. In order to obtain a “normal” biography, one should consult other means of information repositories. Question: Who is Sir John Gielgud? Answer: Sir John Gielgud, one of the great actors of the English stage who enthralled audiences for more than 70 years with his eloquent voice and consummate artistry, died Sun- day at his home Gielguds last major film role was as a surreal Prospero in Peter Greenaways controversial Shakespearean rhapsody. Above summary does not directly explain who the person-in-question is, but indirectly does so in explanatory sentences. We plan to investigate combining fixed-form and free-form struc- tures in answer construction. The summary would include an introductory sentence of the form x is <type/fame-category> , possibly through querying outside online resources. A main body would follow the introduction with an assembly of checklist items generated from the 10-Class classifier. A conclusion would contain open-ended items of special interest. 
Furthermore, we would like to investigate compression strategies in creating summaries, specifically for biographies. Our biography corpus was tailored for this purpose and will be the starting point for further investigation. 49 Chapter 5 Summarizing Online Discussions The summarization framework described in this chapter can be applied to any online discussion community, in particular the more sophisticated discussion forums. For system development and evaluation, we use the GNUe Internet Relay Chat (IRC) archive. I use the term chat loosely here. Input IRCs for our system is a mixture of chats and emails that are indistinguishable in format observed from the downloaded corpus. 5.1 Introduction The availability of many chat forums reflects the formation of globally dispersed virtual com- munities. From them we select the very active and growing movement of Open Source Software (OSS) development. Working together in a virtual community in non-collocated environments, OSS developers communicate and collaborate using a wide range of web-based tools includ- ing Internet Relay Chat (IRC), electronic mailing lists, and more [23]. In contrast to conven- tional instant message chats, IRCs convey engaging and focused discussions on collaborative 50 software development. Even though all OSS participants are technically savvy individually, summaries of IRC content are necessary within a virtual organization both as a resource and an or-ganizational memory of activities [1]. They are regularly produced manually by volunteers. These summaries can be used for analyzing the impact of virtual social interactions and virtual organizational culture on software/product development. 5.2 Previous and Related Work There are at least two ways of organizing dialogue summaries: by dialogue structure and by topic. Newman and Blitzer [73] describe methods for summarizing archived newsgroup conver- sations by clustering messages into subtopic groups and extracting top-ranked sentences per subtopic group based on the intrinsic scores of position in the cluster and lexical centrality. Due to the technical nature of our working corpus, we had to handle intra-message topic shifts, in which the author of a message raises or responds to multiple issues in the same message. This requires that our clustering component be not message-based but sub-message-based. Lam and Rohall [45] employ an existing summarizer for single documents using prepro- cessed email messages and context information from previous emails in the thread. Rambow et al. [85] show that sentence extraction techniques are applicable to summarizing email threads, but only with added email-specific features. Wan and McKeown [98] introduce a system that creates overview summaries for ongoing decision-making email exchanges by first detecting the issue being discussed and then extracting the response to the issue. Both systems 51 use a corpus that, on average, contains 190 words and 3.25 messages per thread, much shorter than the ones in our collection. Galley et al. [28] describe a system that identifies agreement and disagreement occurring in human-to-human multi-party conversations. They utilize an important concept from conversa- tional analysis, adjacent pairs (AP), which consists of initiating and responding utterances from different speakers. Identifying APs is also required by our research to find correspondences from different chat par-ticipants. 
In automatic summarization of spoken dialogues, Zechner [102] presents an approach that obtains extractive summaries for multi-party dialogues in unrestricted domains by addressing intrinsic issues specific to speech transcripts. Automatic question detection is also deemed important in this work: a decision-tree classifier was trained on question-triggering words to detect questions among speech acts (sentences), and a heuristic search procedure then finds the corresponding answers. Ries [86] shows how to use keyword repetition, speaker initiative, and speaking style to achieve topical segmentation of spontaneous dialogues.

5.3 Technical Internet Relay Chats

GNUe, a meta-project of the GNU project (http://www.gnu.org), one of the most famous free/open source software projects, is the case study used in [23] in support of the claim that, even in virtual organizations, there is still a need for successful conflict management in order to maintain order and stability. The GNUe IRC archive is uniquely suited for our experimental purpose because each IRC chat log has a companion summary digest written by project participants as part of their contribution to the community. This manual summary constitutes gold-standard data for evaluation.

5.3.1 Kernel Traffic

Kernel Traffic (http://kt.hoser.ca/kernel-traffic/index.html) is a collection of summary digests of discussions on GNUe development. Each digest summarizes IRC logs and/or email messages (later referred to as chat logs) for a period of up to two weeks. A nice feature is that direct quotes and hyperlinks are part of the summary. Each digest is an extractive overview of facts, plus the author's dramatic and humorous interpretations.

5.3.2 Corpus Download

The complete Linux Kernel Archive (LKA) consists of two separate downloads. The Kernel Traffic summary digests are in XML format and were downloaded by crawling the Kernel Traffic site. The Linux Kernel Archives (individual IRC chat logs) were downloaded from the archive site. We matched the summaries with their respective chat logs based on subject line and publication dates.

[Figure 5.1: An example of chat subtopic structure and the relations between correspondences. Messages chat_0 through chat_3 are shown with their segment labels (question, answer, comment, alternative explanation, comment with question), and arrows link initiating segments to their responses.]

5.3.3 Observation on Chat Logs

Upon initial examination of the chat logs, we found that many conventional assumptions about chats in general do not apply. For example, in most instant-message chats, each exchange usually consists of a small number of words in several sentences. Due to the technical nature of GNUe, half of the chat logs contain in-depth discussions with lengthy messages. One message might ask and answer several questions, discuss many topics in detail, and make further comments. This property, which we call subtopic structure, is an important difference from informal chat/interpersonal banter. Figure 5.1 shows the subtopic structure and relations of the first 4 messages from a chat log, produced manually. Each message is represented horizontally; the vertical arrows show where participants responded to each other. Visual inspection reveals that in this example there are three distinct clusters (a more complex cluster and two smaller satellite clusters) of discussion between participants at the sub-message level.

5.3.4 Observation on Summary Digests

To measure the goodness of system-produced summaries, gold standards are used as references.
Human-written summaries usually make up the gold standards. The Kernel Traffic summary digests are written by Linux experts who actively contribute to the production and discussion of the open source projects. However, participant-produced digests cannot be used as reference summaries verbatim. Due to the complex structure of the dialogue, the summary itself exhibits some discourse structure, necessitating reader-guidance phrases such as "for the question," "on the subject," "regarding ...," "later in the same thread," etc., to direct and refocus the reader's attention. Therefore, further manual editing and partitioning is needed to transform a multi-topic digest into several smaller subtopic-based gold-standard reference summaries.

5.4 Fine-grained Clustering

To model the subtopic structure of each chat message, we apply clustering at the sub-message level.

5.4.1 Message Segmentation

First, we look at each message and assume that each participant responds to an ongoing discussion by stating his/her opinion on several topics or issues that have been discussed in the current chat log, but not necessarily in the order they were discussed. Thus, topic shifts can occur sequentially within a message. Messages are partitioned into multi-paragraph segments using TextTiling, which reportedly has an overall precision of 83% and recall of 78% [32].

5.4.2 Clustering

After distinguishing a set of message segments, we cluster them. When choosing an appropriate clustering method, because the number of subtopics under discussion is unknown, we cannot make an assumption about the total number of resulting clusters. Thus, nonhierarchical partitioning methods cannot be used, and we must use a hierarchical method. These methods can be either agglomerative, which begin with an unclustered data set and perform N − 1 pairwise joins, or divisive, which add all objects to a single cluster and then perform N − 1 divisions to create a hierarchy of smaller clusters, where N is the total number of items to be clustered [27].

Ward's Method

Hierarchical agglomerative clustering methods are commonly used, and we employ Ward's method [41], in which the text segment pair merged at each stage is the one that minimizes the increase in total within-cluster variance. Each cluster is represented by an L-dimensional vector (x_{i1}, x_{i2}, \ldots, x_{iL}), where each x_{ik} is a word's tf.idf score. If m_i is the number of objects in the cluster, the squared Euclidean distance between two segments i and j is:

d_{ij}^2 = \sum_{k=1}^{L} (x_{ik} - x_{jk})^2

When two segments are joined, the increase in variance I_{ij} is expressed as:

I_{ij} = \frac{m_i m_j}{m_i + m_j} d_{ij}^2

Number of Clusters

The process of joining clusters continues until the combination of any two clusters would destabilize the entire array of currently existing clusters produced in previous stages. At each stage, the two clusters i and j are chosen whose combination would cause the minimum increase in variance I_{ij}, expressed as a percentage of the variance change from the last round. If this percentage reaches a preset threshold, it means that the nearest two clusters are much further from each other than in the previous round; therefore, joining the two would be a destabilizing change, and should not take place. Sub-message segments from the resulting clusters are arranged according to the sequence in which the original messages were posted, and the resulting subtopic structures are similar to the one shown in Figure 5.1.
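The following is a minimal sketch of this agglomerative step: segments are tf.idf vectors, Ward's variance increase I_{ij} (generalized to cluster centroids) drives each merge, and merging stops when the minimum increase jumps to more than a fixed multiple of the previous round's value. The dense-vector representation and the 1.5 jump threshold are illustrative assumptions, not the system's actual parameters.

    def ward_increase(ci, cj):
        # I_ij = (m_i * m_j) / (m_i + m_j) * squared Euclidean distance between centroids
        mi, mj = len(ci["members"]), len(cj["members"])
        d2 = sum((a - b) ** 2 for a, b in zip(ci["centroid"], cj["centroid"]))
        return (mi * mj) / (mi + mj) * d2

    def merge(ci, cj):
        members = ci["members"] + cj["members"]
        centroid = [sum(vals) / len(members) for vals in zip(*members)]
        return {"members": members, "centroid": centroid}

    def cluster_segments(vectors, jump_threshold=1.5):
        """Agglomerative clustering with Ward's criterion and a stability stop.

        vectors: list of equal-length tf.idf vectors, one per message segment.
        Stops when the minimum variance increase grows to more than
        jump_threshold times its value in the previous round.
        """
        clusters = [{"members": [v], "centroid": list(v)} for v in vectors]
        prev_min = None
        while len(clusters) > 1:
            # find the pair whose merge minimizes the increase in variance
            inc, i, j = min((ward_increase(clusters[a], clusters[b]), a, b)
                            for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
            if prev_min is not None and prev_min > 0 and inc / prev_min > jump_threshold:
                break                      # nearest clusters are now much farther apart: stop
            merged = merge(clusters[i], clusters[j])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
            prev_min = inc
        return clusters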
5.5 Summary Extraction Having obtained clusters of message segments focused on subtopics, we adopt the typical sum- marization paradigm to extract informative sentences and segments from each cluster to produce subtopic-based summaries. If a chat log has n clusters, then the corresponding summary will contain n mini-summaries. All message segments in a cluster are related to the central topic, but to various degrees. Some are answers to questions asked previously, plus further elaborative explanations; some make suggestions and give advice where they are requested, etc. From careful analysis of the LKA data, we can safely assume that for this type of conversational interaction, the goal of the 57 participants is to seek help or advice and advance their current knowledge on various technical subjects. This kind of interaction can be modeled as one problem-initiating segment and one or more corresponding problem-solving segments. We envisage that identifying corresponding message segment pairs will produce adequate summaries. This analysis follows the structural organization of summaries from Kernel Traffic. Other types of discussions, at least in part, require different discourse/summary organization. These corresponding pairs are formally introduced below, and the methods we experimented with for identifying them are described. 5.5.1 Adjacent Response Pairs An important conversational analysis concept, adjacent pairs (AP), is applied in our system to identify initiating and responding correspondences from different participants in one chat log. Adjacent pairs are considered fundamental units of conversational organization [90]. An adjacent pair is said to consist of two parts that are ordered, adjacent, and produced by different speakers [28]. In our email/chat (LKA) corpus a physically adjacent message, following the timeline, may not directly respond to its immediate predecessor. Discussion participants read the current live thread and decide what he/she would like to correspond to, not necessarily in a serial fashion. With the added complication of subtopic structure (see Figure 5.1) the definition of adjacency is further violated. Due to its problematic nature, a relaxation on the adjacency requirement is used in extensive research in conversational analysis [47]. This relaxed requirement is adopted in our research. 58 Information produced by adjacent correspondences can be used to produce the subtopic- based summary of the chat log. As described in previous sections, each chat log is partitioned, at sub-message level, into several subtopic clusters. We take the message segment that appears first chronologically in the cluster as the topic-initiating segment in an adjacent pair. Given the initiating segment, we need to identify one or more segments from the same cluster that are the most direct and relevant responses. This process can be viewed equivalently as the informative sentence extraction process in conventional text-based summarization. 5.5.2 AP Corpus and Baseline We manually tagged 100 chat logs for adjacent pairs. There are, on average, 11 messages per chat log and 3 segments per message (This is considerably larger than threads used in previous research). Each chat log has been clustered into one or more bags of message segments. The message segment that appears earliest in time in a cluster was marked as the initiating segment. 
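As a concrete illustration of how each cluster yields the candidate pairs that the annotators, and later the classifiers, judge against this initiating segment, here is a minimal sketch. The segment record fields (author, timestamp, text) are assumed for illustration; the real corpus format differs.

    def candidate_pairs(cluster):
        """Return (initiating_segment, candidate_response) pairs for one subtopic cluster.

        cluster: list of dicts with 'author', 'timestamp', and 'text' keys.
        The earliest segment in the cluster is taken as the initiating segment;
        every later segment by a different author is a candidate response.
        """
        ordered = sorted(cluster, key=lambda seg: seg["timestamp"])
        initiating = ordered[0]
        return [(initiating, seg) for seg in ordered[1:]
                if seg["author"] != initiating["author"]]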
The annotators were provided with this segment and one other segment at a time, and were asked to decide whether the current message segment is a direct answer to the question asked, the suggestion that was requested, etc., in the initiating segment. There are 1521 adjacent response pairs; 1000 were used for training and 521 for testing. Our baseline system selects the message segment (from a different author) immediately following the initiating segment. It is quite effective, with an accuracy of 64.67%. This is reasonable because not all adjacent responses are interrupted by messages responding to different, earlier initiating messages. In the following sections, we describe two machine learning methods that were used to identify the second element in an adjacent response pair, and the features used for training. We view the problem as a binary classification problem, distinguishing less relevant responses from direct responses. Our approach is to assign a candidate message segment c an appropriate response class r.

5.5.3 Features

Structural and durational features have been demonstrated to improve performance significantly in conversational text analysis tasks. Using them, Galley et al. [28] report an 8% increase in speaker identification. Zechner [102] reports excellent results (F > .94) for inter-turn sentence boundary detection when recording the length of pauses between utterances. In our corpus, durational information is nonexistent because chats and emails were mixed and no exact time recordings besides dates were reported. So we rely solely on structural and lexical features. For structural features, we count the number of messages between the initiating message segment and the responding message segment. Lexical features are listed in Table 5.1. The tech words are words that are uncommon in conventional literature and unique to Linux discussions.

number of overlapping words
number of overlapping content words
ratio of overlapping words
ratio of overlapping content words
number of overlapping tech words

Table 5.1: Lexical features

5.5.4 Maximum Entropy

Maximum entropy has been proven to be an effective method in various natural language processing applications [9]. For training and testing, we used YASMET (http://www.fjoch.com/YASMET.html). To estimate P(r|c) in the exponential form, we have:

P_\lambda(r \mid c) = \frac{1}{Z_\lambda(c)} \exp\left( \sum_i \lambda_{i,r} f_{i,r}(c, r) \right)

where Z_\lambda(c) is a normalizing constant and the feature function for feature f_i and response class r is defined as:

f_{i,r}(c, r') = 1 if f_i > 0 and r' = r; 0 otherwise.

λ_{i,r} is the feature-weight parameter for feature f_i and response class r. Then, to determine the best class r for the candidate message segment c, we have:

r^* = \arg\max_r P(r \mid c)

5.5.5 Support Vector Machine

Support vector machines (SVMs) have been shown to outperform other existing methods (Naïve Bayes, k-NN, and decision trees) in text categorization [39]. Their advantages are robustness and the elimination of the need for feature selection and parameter tuning. SVMs find the hyperplane that separates the positive and negative training examples with maximum margin. Finding this hyperplane can be translated into an optimization problem of finding a set of coefficients α_i^* of the weight vector \vec{w} for document d_i of class y_i ∈ {+1, −1}:

\vec{w} = \sum_i \alpha_i^* y_i \vec{d}_i, \quad \alpha_i^* > 0

Testing data are classified depending on which side of the hyperplane they fall on. We used the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) for training and testing.
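Before either classifier is trained, each (initiating, candidate) pair has to be turned into the structural and lexical features of Table 5.1. The sketch below shows one way to compute them; the tiny stopword list, the placeholder tech-word lexicon, the whitespace tokenizer, the msg_index and text fields on each segment, and the way the ratios are normalized are all illustrative assumptions rather than the exact resources used.

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "for"}  # tiny illustrative list
    TECH_WORDS = {"kernel", "sysctl", "ioctl", "mutex", "patch"}                        # placeholder Linux lexicon

    def ap_features(initiating, candidate):
        """Feature vector for one candidate adjacent-response pair."""
        w1 = set(initiating["text"].lower().split())
        w2 = set(candidate["text"].lower().split())
        overlap = w1 & w2
        content_overlap = overlap - STOPWORDS
        return {
            # structural: how many messages separate the two segments
            "msg_distance": abs(candidate["msg_index"] - initiating["msg_index"]),
            # lexical features from Table 5.1
            "overlap_words": len(overlap),
            "overlap_content_words": len(content_overlap),
            "overlap_ratio": len(overlap) / max(len(w1 | w2), 1),
            "overlap_content_ratio": len(content_overlap) / max(len((w1 | w2) - STOPWORDS), 1),
            "overlap_tech_words": len(overlap & TECH_WORDS),
        }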
5.5.6 Results

Entries in Table 5.2 show the accuracies achieved using the machine learning models and feature sets.

Feature sets            Baseline    MaxEnt     SVM
                        64.67%
Structural                          61.22%     71.79%
Lexical                             62.24%     72.22%
Structural + Lexical                72.61%     72.79%

Table 5.2: Accuracy on identifying APs

5.5.7 Summary Generation

After responding message segments are identified, we couple them with their respective initiating segments to form a mini-summary based on their subtopic. Each initiating segment has zero or more responding segments. We also observed zero responses in human-written summaries, where participants initiated some question or concern but others failed to follow up on the discussion. The AP process is repeated for each cluster created previously. One or more subtopic-based mini-summaries make up the final summary for each chat log. Figure 5.2 shows an example. For longer chat logs, the length of the final summary is arbitrarily averaged at 35% of the original.

[Figure 5.2: A system-produced summary]

5.6 Summary Evaluation

To evaluate the goodness of the system-produced summaries, a set of reference summaries is used for comparison. In this section, we describe the manual procedure used to produce the reference summaries, and the performance of our system and two baseline systems.

5.6.1 Reference Summaries

Kernel Traffic digests are participant-written summaries of the chat logs. Each digest mixes the summary writer's own narrative comments with direct quotes (citing the authors) from the chat log. As observed in Section 3.4, subtopics are intermingled in each digest. Authors use key phrases to link the contents of each subtopic throughout the text. In Figure 5.3, we show an example of such a digest; discussion participants' names are in italics and subtopics are in bold. In this example, the conversation was started by Benjamin Reed with two questions: 1) asking for conventions for writing /proc drivers, and 2) asking about the status of sysctl. The summary writer indicated that Linus Torvalds replied to both questions and used the phrase "for the question, he added" to highlight the answer to the second question. As the digest goes on, Marcin Dalecki only responded to the first question, with his excited commentary.

[Figure 5.3: An original Kernel Traffic digest]

Since our system-produced summaries are subtopic-based and partitioned accordingly, if we used unprocessed Kernel Traffic digests as references, the comparison would be rather complicated and would increase the level of inconsistency in future assessments. We manually reorganized each summary digest into one or more mini-summaries by subtopic (see Figure 5.4). Examples (usually kernel stats) and programs are reduced to [example] and [program code]. Quotes (originally in separate messages but merged by the summary writer) that contain multiple topics are segmented, and the participant's name is inserted for each segment. We follow clues like "to answer the question" to pair up the main topics and their responses.

[Figure 5.4: A reference summary reproduced from a summary digest]

5.6.2 Summarization Results

We evaluated 10 chat logs. On average, each contains approximately 50 multi-paragraph tiles (partitioned by TextTiling) and 5 subtopics (clustered by the method from Section 4).
A simple baseline system takes the first sentence from each email in the sequence in which they were posted, based on the assumption that people tend to put important information at the beginning of texts (the Position Hypothesis).

A second baseline system was built by constructing and analyzing the dialogue structure of each chat log. Participants often quote portions of previously posted messages in their responses. These quotes link most of the messages in a chat log. The message segment that immediately follows a quote is automatically paired with the quote itself, added to the summary, and sorted according to the timeline. Segments that are not quoted in later messages are labeled as less relevant and discarded. A resulting baseline summary is an inter-connected structure of segments that quoted and responded to one another. Figure 5.5 is a shortened summary produced by this baseline for the ongoing example.

[Figure 5.5: A short example from Baseline 2]

The summary digests from Kernel Traffic mostly consist of direct snippets from the original messages, thus making the reference summaries extractive even after rewriting. This makes it possible to conduct an automatic evaluation. A computerized procedure calculates the overlap between reference and system-produced summary units. Since each system-produced summary is a set of mini-summaries based on subtopics, we also compared the subtopics against those appearing in the reference summaries (precision = 77.00%, recall = 74.33%, F = 0.7566).

                    Recall      Precision   F-measure
Baseline1           30.79%      16.81%      0.2175
Baseline2           63.14%      36.54%      0.4629
System Summary      52.57%      52.14%      0.5235
Topic-summ          52.57%      63.66%      0.5758

Table 5.3: Summary of results

Table 5.3 shows the recall, precision, and F-measure from the evaluation. From manual analysis of the results, we notice that the original digest writers often leave large portions of the discussion out and focus on a few topics. We think this is because, among the participants, some are Linux veterans and others are novice programmers. Digest writers recognize this difference and reflect it in their writing, whereas our system does not. The entry "Topic-summ" in the table shows system-produced summaries being compared only against the topics discussed in the reference summaries.

5.6.3 Discussion

A recall of 30.79% from the simple baseline reassures us that the Position Hypothesis still applies in conversational discussions. The second baseline performs extremely well on recall, 63.14%. It shows that quoted message segments, and the dialogue structure derived from them, are quite indicative of where the important information resides. Systems built on these properties are good summarization systems and hard-to-beat baselines. The system described in this chapter (System Summary) shows an F-measure of 0.5235, an improvement over the 0.4629 of the smart baseline. It gains from high precision because less relevant message segments are identified and excluded from the adjacent response pairs, leaving mostly topic-oriented segments in the summaries. There is a slight improvement when assessing against only those subtopics that appear in the reference summaries (Topic-summ). This shows that we identified clusters only by their information content, not by their respective writers' experience and reliability of knowledge.

5.7 Applicability to Other Domains

Algorithms introduced in the previous sections are intended to be applicable to a wide variety of domains.
The primary reason that the GNUe IRC archive was used is that in order to validate the algorithms for research purposes, gold-standdard summaries are needed for comparison with system-generated ones, and the GNUe archive provided us with such a collection. In this section, I describe the application and deployment of our discussion summarization system for the University of Southern Californias Distance Education Network’s online discus- sion boards. 5.7.1 Introduction The distance education community is a very active virtual community. Regardless whether courses are held entirely online or mostly on-campus, online asynchronous discussion boards play an increasingly important role, enabling classroom-like communication and collaboration amongst students, tutors and instructors. The University of Southern California (USC), like many other universities, employs a commercial online course management system (CMS). In an effort to bridge research and practice in education, researchers at USCs Information Sciences 68 Institute (ISI) replaced the na¨ ıve CMS discussion board with an open source board that is cur- rently used by selected classes. The board provides a platform for evaluating new teaching and learning technologies. Within the discussion board teachers and students post messages about course-related topics. The discussions are organized chronologically within topics and higher- level forums. These ‘live’ discussions are now enabling a new opportunity, the opportunity to apply and evaluate advanced natural language processing (NLP) technology. However, it is difficult to evaluate the results produced by automatic summarization sys- tems. Traditionally, there are two distinctive sets of evaluations, namely intrinsic and extrinsic evaluations. Intrinsic evaluations are commonly used in measuring the content overlap between a system-produced summary and one or more human-written gold-standard summaries. Ex- trinsic evaluations include measuring the improvement of performing a specific task using the results of an automatic summarization system. This type of evaluations performs an indirect measurement on the quality of the system-produced summaries, without the comparison with a set of corresponding gold-standard summaries. 5.7.2 Deployment Summaries are created periodically and sent to students and teachers via their preferred medium (emails, text messages on mobiles, web, etc). This relieves users of the burden to read through a large volume of messages before participating in a particular discussion. It also enables users to keep track of all ongoing discussions without much effort. At the same time, the discussion summarization system can be measured beyond the typical NLP evaluation methodologies, i.e. 69 measures on content coverage. Teachers and students willingness and continuing interest in using the software will be a concrete acknowledgement and vindication of such research-based NLP tools. 5.7.3 Extrinsic Evaluation Intrinsic evaluations are possible when there is a set of gold-standard summaries to make the comparisons. When we integrated our system into the Distance Education Network, it was not clear whether an intrinsic measure is the most appropriate evaluation method. We, researchers in NLP, consider that it is absolutely necessary to have the gold-standard summaries and content overlap is an adequate reflection of summary quality. However, in real-world situations, users from different perspectives may not desire the same set of truth—gold-standard summaries. 
Extrinsic evaluations, which measure the usefulness of generated summaries, are designed to adapt to user variability and are appropriate for large-scale deployed applications. We asked those, including both instructors and students, who participated in receiving email summaries whether they found the summaries useful and whether they would prescribe to future deployment. Over a period of seven months, the feedbacks that we have received are positive, illustrated in Figure 5.6. Instructors, especially, find our system helpful because several discus- sions may be active at any particular point of time, it is time-consuming to check and keep one up-to-date on all topics being discussed. Since we system only sends summaries on live topics, users are exempt from manually checking for updates. 70 Figure 5.6: User feedback. 71 Chapter 6 Automatic Summary Evaluation In the previous chapters, I have introduced various ways to create summaries automatically and showed the quality of the summaries created. In the next two chapters, I will focus on distillation evaluation, more specifically evaluations for machine translation and summarization results. Many manual evaluation techniques have been introduced. Naturally, people trust manual evaluation methodologies since humans can infer, paraphrase, and use world knowledge to relate text units that are worded differently. However, there are two major drawbacks to an human evaluation: 1) determining the size of text units being compared, and 2) how much of the human-ness to allow. Without a structured definition on the size of the summary units, humans cannot reliably perform this task consistently over multiple texts over a period of time. The second drawback refers to the level of inference that we should allow human assessors to incorporate in evaluation. If there is no constraint at all, then the judgment made on a summary may reflect primarily the assessor’s knowledge on the subject being summarized. If we set the scope of inference (providing a list of paraphrases, synonyms, etc.), then the human assessors 72 are practically behaving like a machine making decisions according to rules. The point is if we leave all the decisions to human, it’s too hard and the results will be debatable. The goal is to define an automated mechanism that performs both of the above tasks consistently and correctly, and yet correlate well with human judgments, which are made under tightly defined rules. This challenge is not uniquely associated with discussion summarization, but with summarization in general. The summarization community has been actively seeking automatic evaluation methodolo- gies that can be readily applied to most summarization tasks. In this chapter, we will introduce Basic Elements, a new way of automating the evaluation of text summaries. This method corre- lates better with human judgments than any other automated procedures to date, and overcomes the subjectivity/variability problems of manual evaluation methods. This work is from the col- laborative efforts of several members of our research group, Prof. Eduard Hovy, Dr. Chin-Yew Lin, and I. 6.1 Related Work There are two directions in designing evaluations for summarization: manual evaluation or automated evaluation. 
6.1.1 Manual Evaluation Three most noticeable efforts in manual evaluation are SEE of Lin and Hovy ([50] and [53]), Factoid of Van Halteren and Teufel [31], and Pyramid of Nenkova and Passonneau [72]: 73 • SEE provides an user-friendly environment in which human assessors evaluate the quality of system-produced peer summary comparing to an ideal reference summary. Summaries are represented by a list of summary units (sentences, clauses, etc.). Assessors can assign full or partial content coverage score to peer summary units in comparison to the corre- sponding reference summary units. Grammaticality can also be graded unit-wise. This early work stands correct to this day whereas newer manual methods lack systematically identifying summary units and allowing partial scores for partial matches. • The goal of the Factoid work is to compare the information content of different sum- maries of the same text and determine the minimum number of summaries needed to achieve stable consensus among human-written summaries. Van Halteren and Teufel studied 50 summaries that were created based on one single text. For each sentence, its meaning is represented by a list of atomic semantic units, called factoids. Their defini- tion of atomicity means that the amount of information associated with a factoid can vary from a single word to an entire sentence. This is similar to the definition of Pyramid’s summary content units. A problem with this flexible definition of a summary unit is that there is no consistency attached, therefore it’s highly arguable in the nature of atomicity. The Factoid method shows 15 summaries are need to achieve stable consensus among reference summaries. • The Pyramid method is an extension of Factoid carried out on a larger scale. Nenkova and Passonneau show that only 6 summaries are required to form stable consensus from ref- erence summaries. Summarization Content Units (SCUs) are originally defined as units 74 that are not bigger than a clause, but later redefined as larger than a word but smaller than a sentence. SCUs are exactly the same in definition as factoids. The process of finding similar SCUs starts by finding similar sentences and proceeds to identifying finer grained inspection of more tightly related subparts. After all the SCUs have been identified and compared, they can be partitioned into a pyramid structure, as illustrated in Figure 6.1. A peer summary would receive a score which is a ratio of the sum of the weights of its SCUs to the sum of the weights of an optimal summary with the same number of SCUs. The scoring of peer summaries is precision-based which is problematic in evaluating con- tent coverage for summarization. Suppose for a peer summary, only one SCU matched the pyramid, and the rest of the summary has no recall value. But we still need to mark rest of the text into SCUs. Let’s assume 10 SCUs were marked in total for the peer sum- mary and the pyramid SCU that I matched had a weight of 1, then the pyramid score of this summary is 1 / 10 = 0.1, assuming the pyramid is flat where all SCUs are at the lowest level, with a weight 1. But if it’s been decided to mark each word a SCU (which could happen, in theory) and the one SCU that matched had 10 words, that leaves 240 words/SCUs and total of 241 SCUs. This gives 1 / 241 = 0.004. We are using the same pyramid for comparison and received two very different scores for the same peer summary. 
Another major drawback of the pyramid method is that SCUs similarity comparison relies completely on human understanding and inferencing. Not to mention the wide range of different background and experience, making inference without any limitation completely 75 Figure 6.1: A pyramid of 4 levels. Reproduced from [72] distorts the evaluation process. In addition, humans learn more about the summary topic progressively when asked to annotate and compare several peer summaries against a pre- determined pyramid. At the beginning of the annotation process, a human annotator may know nothing of the topic being summarized. By the middle or near the end, he/she knows all there is to know about the topic and what the pyramid is relevant to. His/her inferencing ability has increased tremendously which makes the judgments unreliable. In 2004, Filatova and Hatzivassiloglou [25] defined atomic events as major constituents of actions described in a text linked by verbs or action nouns. They believe that major constituents of events are marked as named entities in text and an atomic event is a triplet of two named entities connected by a verb or an action-denoting noun all extracted from the same sentence. Atomic events are used in creating event-based summaries and are not modeled into any evalu- ation method. 76 6.1.2 Automatic Evaluation To overcome the problems associated with manual evaluations, ROUGE [51] was introduced in 2003. It is an automatic evaluation package that measures the n-gram co-occurrences between peer and reference summary pairs. ROUGE was inspired by a similar idea of Bleu [78] adopted by the machine translation (MT) community for automatic MT evaluation. A problem with ROUGE is that the summary units that are used in automatic comparison are of fixed length, unigram, bigram, or n-gram. Noncontent words, such as stopwords, could be treated equally as content words. A more desired design is to have summary units of variable size where several units may convey similar meaning but with various lengths. 6.2 Basic Elements In the following sections, we describe an automated method that produces useful information chunks of varying size, and show that this method equals or outperforms other automated meth- ods of scoring summaries. Our method can be seen as a generalization and automation of both Factoid and Pyramid methods. The task is to provide a numeric score that reflects the quality of a given peer summary when it is compared to one or more reference summaries. 77 6.2.1 Definition To define summary content units, we automatically produce a series of increasingly larger units from reference summaries, starting at single-word level. The focus of our work is on minimal summary units where the unit size is small and paraphrase alternatives are limited. We call these minimal summary units Basic Elements (BEs). They are defined as follows: • the head of major syntactic constituent (noun, verb, adjective, or adverbial phrases), ex- pressed as a single item, or • a relation between a head-BE and a single dependent, expressed as a triple (head| modifier | relation). Using BEs as a method to evaluate summary content, we address the following core ques- tions: 1. What or how large is a Basic Element? The answer to this is strongly conditioned by: how can BEs be created automatically? 2. How important is each BE? What basic score should each BE have? 3. When do two BEs match? What kinds of matches should be implemented, and how? 4. 
How should an overall summary score be derived from the individual matched BEs' scores?

The input to our method is a collection of reference summaries and a peer summary. The output is a single numeric score for this peer summary. To generate the peer summary score, we go through the following four automatic modules:

1. BE breakers: create individual BE units, given a text.
2. BE scorers: assign scores to each BE unit individually.
3. BE matchers: rate the similarity of any two BE units.
4. BE score integrators: produce a total score given a list of rated BE units.

6.3 The BE Method

6.3.1 Creating BEs

Basic Elements are minimal summary units that are used in creating a summary. From the evaluation point of view, we need to partition a summary into smaller units in order to make comparisons. The concept of BE can be applied and used to produce many different constructions of summary units. We experimented with several different syntactic and dependency parsers to produce BEs. The parsers incorporated as BE breakers are: Charniak's parser [16], Collins' parser [18], Minipar [56], and Microsoft Logical Forms [33]. The BE breaker module takes in a parse tree and applies a set of heuristics to extract from the tree a set of smaller constituents, a.k.a. BEs. Figure 6.2 shows a Collins' parse tree and its corresponding BEs.

[Figure 6.2: A set of BEs extracted from a Collins' parse tree.]

Collins' and Charniak's parse trees are syntactic and do not include the semantic relations between head and modifier words. Minipar, however, is a dependency parser that does automatically produce head | modifier | relation triples. For simplicity and accessibility, the default version of the BE package creates BEs from Minipar's dependency trees:

libyans | two | nn
indicted | libyans | obj
bombing | lockerbie | nn
indicted | bombing | for
bombing | 1991 | in

6.3.2 Weighing BEs

When there is a match between a peer BE and a reference BE, the summary the peer BE comes from gets exactly 1 point. There could potentially be many ways of assigning weights to BEs. Currently, we adopt this simple weighing scheme, as described in the Pyramid method [72].

6.3.3 Matching BEs

The Pyramid method [72] describes the SCU matching stage as a two-step process: first, find sentences that are similar to one another; second, among similar sentences, find similar subparts, namely SCUs. The Pyramid matching process relies on human inference and paraphrasing skills. BEs are smaller than SCUs, which tend to be phrases; matching at the BE level allows less variation and is thus less difficult. We have experimented with a range of increasingly sophisticated matching strategies (arranged from easiest/strictest to most sophisticated/flexible):

• lexical identity: the compared BEs must match exactly
• lemma identity: the root forms of the compared BEs must match
• semantic generalization match: BE words are replaced by semantic generalizations ("President Clinton" replaced by "human") and then matched, at a variety of levels of abstraction.

Semantic generalization matching requires sentences that are similar to be grouped together first in order to show its effectiveness. It can reproduce Pyramid's SCUs with 90% accuracy. However, monolingual sentence alignment is itself a difficult research problem. Without a good sentence alignment module, the semantic generalization matching process is incomplete.
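A minimal sketch of the first two matching strategies, applied to BE triples of the head | modifier | relation form shown above. NLTK's WordNetLemmatizer stands in for whatever stemmer or lemmatizer a real BE matcher would use, and ignoring the relation label in the comparison is a simplifying assumption.

    from nltk.stem import WordNetLemmatizer  # requires the nltk 'wordnet' data package

    lemmatizer = WordNetLemmatizer()

    def lexical_match(be_a, be_b):
        """Strict match: head and modifier strings must be identical."""
        return be_a[0].lower() == be_b[0].lower() and be_a[1].lower() == be_b[1].lower()

    def root(word):
        """Reduce a word to a root form: try the verb lemma, then the noun lemma."""
        w = word.lower()
        v = lemmatizer.lemmatize(w, "v")
        return v if v != w else lemmatizer.lemmatize(w, "n")

    def lemma_match(be_a, be_b):
        """Looser match: heads and modifiers must share a root form."""
        return root(be_a[0]) == root(be_b[0]) and root(be_a[1]) == root(be_b[1])

    # Example: "indicted | libyans | obj" against "indict | libyan | obj"
    peer_be = ("indicted", "libyans", "obj")
    ref_be = ("indict", "libyan", "obj")
    print(lexical_match(peer_be, ref_be))   # False
    print(lemma_match(peer_be, ref_be))     # True: both reduce to indict | libyan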
6.3.4 Combining BE Scores

The combined BE score for a peer summary is the recall score computed over all matched peer and reference BEs.

6.4 Testing and Validating BEs

To test and validate the effectiveness of an automatic evaluation metric, one needs to show that the automatic evaluations correlate with human assessments highly, positively, and consistently [51]. We ran BE on DUC2003 peer and reference summaries. Correlation figures are calculated by comparing the rankings and average coverage scores of systems (peers and baselines) documented by DUC with those produced by BE. The Spearman rank-order correlation coefficient and Pearson's correlation coefficient are computed for the validation tests.

Spearman Rank-Order Correlation Coefficient

From the DUC manual evaluations, we obtain a ranking of all peer and baseline systems. Using BE, an automatically computed ranking is also produced. We use Spearman's coefficient to measure the correlation between the two rankings, since it is designed to handle ordinal data. The formula for the Spearman correlation coefficient is:

\rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}

where D is the difference between a subject's ranks on the two variables and N is the number of subjects.

Pearson's Correlation Coefficient

To see whether BE and the DUC human assessors assign scores to peer and baseline summaries in a similar fashion, we computed Pearson's correlation coefficient as follows:

r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{N}\right)}}

where X and Y are the scores from BE and DUC, respectively, for a summary from a particular peer or baseline system, and N is the number of comparisons being made.

6.4.1 Multi-document Test Set

In DUC2003 task 2 (short summaries focused by events), participants were asked to create a short summary (100 words) for each of the 30 TDT clusters. For more information on the task, refer to the official DUC website (http://duc.nist.gov/). To evaluate each system-generated summary, one assessor-written summary was used as the reference. The assessor-written summaries and their corresponding document clusters, all system-produced summaries, and the overall ranking of the systems are documented by DUC and are publicly available to participants.

Using the breaking, scoring, and matching modules from BE, a ranking of the systems, along with an average coverage score for each system, was automatically created. To validate the BE method, the Spearman rank correlation coefficient and Pearson's correlation are computed between the DUC2003 official results and the BE results. Table 6.1 shows BE performance using BEs produced by BE-L (Charniak + CYL rules) and BE-F (Minipar + Fukumoto rules). The scoring and matching procedures are the same for both breakers. The table also shows the performance of each breaker when matching with and without the relation.

It should be noted that BEs produced by BE-L include unigrams (head-word | NIL | NIL), while those produced by BE-F do not. Inclusion of unigrams results in a larger number of BEs and matches, and contributes to the lower performance. When unigrams are excluded from matching, the performance of BE-L is comparable to that of BE-F.

Table 6.1: Correlation between BE and DUC2003 multi-doc.

6.4.2 Single-document Test Set

In DUC2003 task 1 (very short summaries), participants were asked to create a very short summary (10 words). Table 6.2 shows the correlation between BE and DUC results.
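As a concrete illustration of the validation computation used in these test sets, the short sketch below computes both coefficients with SciPy for a handful of made-up system scores; the numbers are invented and merely stand in for DUC average coverage scores and BE scores.

from scipy.stats import spearmanr, pearsonr  # standard implementations of both coefficients

# Hypothetical average coverage scores for six systems: DUC (human) vs. BE (automatic).
duc_scores = [0.31, 0.28, 0.26, 0.22, 0.19, 0.12]
be_scores  = [0.058, 0.055, 0.049, 0.051, 0.040, 0.031]

rho, _ = spearmanr(duc_scores, be_scores)   # rank-order agreement between the two rankings
r, _   = pearsonr(duc_scores, be_scores)    # linear agreement between the raw scores

print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")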
6.4.3 Comparing BE with ROUGE

ROUGE has been widely used in summarization evaluation in recent years. Any new automatic evaluation metric should prove to be better than, or at least equivalent to, ROUGE. Table 6.3 shows ROUGE's correlation with DUC. Compared to ROUGE, BE evaluation correlates well with DUC, involves less guesswork, and is more consistent across different test conditions.

Table 6.2: Correlation between BE and DUC2003 single-doc.

Table 6.3: Correlation between ROUGE and DUC2003 multi-doc.

6.5 BE-based Summarization: Beyond Evaluation

This chapter has so far focused on using BEs to automatically evaluate system-generated summaries. BEs can also be used as counting units for frequency-based topic identification and content-based sentence compression. Assume that we are to implement a two-step summarization system: first, extract informative sentences from the document set; second, perform sentence compression on the resulting extract. During the extraction phase, BEs can be incorporated as selection criteria. In the subsequent compression phase, we can use BEs to identify what information needs to be retained and what information can be removed. The resulting BE-based summarization system was a top performer in DUC2005.

6.5.1 Using BEs to Extract

To extract informative sentences from a document set, BEs are first used as information counting units. Each sentence from the text is parsed and represented by its BEs, making a sentence a list of BEs. The likelihood ratio (LR) of each BE is then computed as an information-theoretic measure of its relative importance compared to all BEs from the document set ([21] and [49]). We thus obtain a list of BEs ranked by their LR scores.

To assign a numeric score to each sentence from the document set, in order to choose the most desirable ones for the summary, a sentence's BE list is compared against the LR-ranked BE list. The score for a sentence is the ratio of the sum of its matching BEs' LR scores to the number of BEs forming the sentence. This yields a ranked list of sentences; the sentence with the highest normalized score carries the most important content. We then iteratively choose from the ranked list the next sentence with a lower score but the least content overlap (measured in BEs) with the already selected sentences. This approach to selecting sentences is maximum marginal relevancy (MMR) [29]. Given the documented importance of sentence position in the news genre [54], we only allow sentences from the beginning of a document to be considered in this extraction phase. In order to provide enough sentences for compression, i.e., more than the required length, we usually produce extracts of double the required length; a sketch of this selection loop follows.
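The sketch below mirrors that selection loop. The BE strings, LR weights, and redundancy threshold are all illustrative placeholders; in the actual system the BEs come from the BE breakers and the weights from a likelihood-ratio computation over the document set.

def score_sentence(sent_bes, be_weights):
    """Normalized importance: sum of matched BE LR weights divided by the number of BEs."""
    if not sent_bes:
        return 0.0
    return sum(be_weights.get(be, 0.0) for be in sent_bes) / len(sent_bes)

def select_sentences(sentences, be_weights, max_sentences):
    """Greedy MMR-style selection: highest score first, skipping sentences with heavy BE overlap."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s["bes"], be_weights), reverse=True)
    selected, covered = [], set()
    for sent in ranked:
        if len(selected) >= max_sentences:
            break
        overlap = len(set(sent["bes"]) & covered)
        if not selected or overlap <= len(sent["bes"]) // 2:  # redundancy threshold is illustrative
            selected.append(sent)
            covered |= set(sent["bes"])
    return selected

be_weights = {"indicted|libyans|obj": 4.2, "bombing|lockerbie|nn": 3.0, "indicted|bombing|for": 2.9}
sentences = [
    {"text": "Two Libyans were indicted for the Lockerbie bombing.",
     "bes": ["indicted|libyans|obj", "bombing|lockerbie|nn", "indicted|bombing|for"]},
    {"text": "The indictment concerned the Lockerbie bombing.",
     "bes": ["bombing|lockerbie|nn"]},
]
print([s["text"] for s in select_sentences(sentences, be_weights, max_sentences=2)])
# ['Two Libyans were indicted for the Lockerbie bombing.']  (second sentence is redundant)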
These two models mainly focus on learning over syntactic trees and performing syntax-based compression, rather than on information preservation.

The extraction mechanism of our summarization system selects sentences that share a certain degree of information overlap. The overlap acts as a bridging medium for extracting sentences that, when selected together, produce a higher volume of important textual content. Valuable information is identified by top-ranked BEs, which indicate a high rate of repetition. Through experimentation, we discovered that BEs appearing in the same context (in this case, in the same sentence) also carry some degree of textual importance. This is shown by the increase in recall scores when those secondary BEs are included in the summary results. Ideally, therefore, we would like to remove redundant information while producing the most coherent sentence set. While this goal is shared by all compression research for summarization, it has yet to be realized in actual compression operations: either syntactic structure or information content is taken into account primarily, but not both at the same time. We envisage a compression technique in which reduction operations are performed on parse-tree syntactic constituents marked for "removal". In addition, the compression module decides the most appropriate level of the tree from which the marked constituents should be cut.

6.5.2.1 Content Labeling

The compression procedure is invoked incrementally. Sentences selected by the extraction module are ranked according to the importance of their information content, i.e., the total weight of top-ranked BEs normalized by sentence length. A list of top-ranked BEs is maintained for each document set. Each sentence is represented by its BE equivalent. For example, the sentence "A man was killed by police." becomes:

killed | man | obj
killed | by | by-subj
killed | police | by

The first sentence from the extract contains the most salient information from the document set; no compression is applied to it. Any sentence following it should only complement its content with additional information. Top-ranked BEs from the first sentence are recorded in a "have-seen" table. Before a second sentence is added, all of its BEs are checked against the "have-seen" table. If any of its BEs appear in the table, they are labeled "remove". Top-ranked BEs from this sentence are then also recorded in the same "have-seen" table. This procedure is performed on every sentence from the extract. The "have-seen" table is maintained globally, and the "remove" lists are maintained on a per-sentence basis, as illustrated in the sketch below.
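A minimal sketch of the "have-seen" bookkeeping just described, assuming each extract sentence is already represented as a list of BE strings; the function and variable names are illustrative, not the ones used in the actual system.

def label_for_removal(extract, top_ranked_bes):
    """For each sentence (a list of BE strings), mark BEs already seen in earlier sentences."""
    have_seen = set()            # global across the extract
    remove_lists = []            # one "remove" list per sentence
    for i, sentence_bes in enumerate(extract):
        if i == 0:
            remove = []          # the first, most salient sentence is never compressed
        else:
            remove = [be for be in sentence_bes if be in have_seen]
        remove_lists.append(remove)
        # record this sentence's top-ranked BEs for later sentences to check against
        have_seen.update(be for be in sentence_bes if be in top_ranked_bes)
    return remove_lists

extract = [
    ["killed|man|obj", "killed|police|by"],
    ["killed|police|by", "arrested|suspect|obj"],
]
print(label_for_removal(extract, top_ranked_bes={"killed|police|by", "arrested|suspect|obj"}))
# [[], ['killed|police|by']]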
6.5.2.2 Parse Tree Reduction

Knight and Marcu use sentence pairs of the form (long sentence, abstract sentence) from the Ziff-Davis corpus (newspaper articles with abstracts) to collect expansion-template probabilities. Expansion templates are created by identifying corresponding syntactic nodes [18] in those sentence pairs. Assigning probabilities to trees rather than strings introduces an information loss; for summarization tasks with a strict length limit, we should keep this kind of loss to a minimum. The challenge is to perform tree reduction with information retention as a priority. BEs are minimal summary units. If we can compress sentences by identifying the smallest necessary removable units and removing them correctly according to grammar rules, then minimal information loss and maximal grammaticality can be achieved.

As stated in the previous section, the BEs of each sentence have been labeled "remove" or "keep." A parse tree is also produced for each sentence using Collins' parser. In Figure 6.3, we show the part of a parse tree where the compression takes effect. The smallest constituent (furthest down the tree) that covers a "remove" BE is first identified. Its ancestors (parent, grandparent, etc.) are traversed, and at each ancestor node we assume the larger tree can be cut (Figure 6.3, edges labeled "1" and "2"). For each resulting smaller tree t, P_tree(t) is computed over the Penn Treebank PCFG grammar rules that yield t. Among the smaller trees, the one with the highest P_tree(t), normalized by the number of grammar rules used, is considered the best candidate tree. However, if one of its children (not containing the "remove" BE) contains unseen top-ranked BEs, the tree-cutting operation that produced this tree is not activated. In other words, if the cutting operation at an ancestor level is deemed undesirable because one of that ancestor's children contains important non-redundant information, the compression module backtracks to the next level down in the tree. This process is performed at every node, traversing from the lowest tree level that covers the "remove" BE upward, until a decision to backtrack is made at one of the upper-level ancestors.

From the sample shown in Figure 6.3, let us assume that the BE "diplomatic immunity" has appeared in a previous sentence and needs to be removed. From computing P_tree(t) for t_1 and t_2 (edges labeled "1" and "2" in the tree), let us assume t_1 is preferred. But if the BE "be entitled (is entitled)" is one of the top-ranking BEs and needs to be kept in the sentence, then t_2 is the preferred tree.

Figure 6.3: An example for sentence compression.

6.5.2.3 Validation

The compression mechanism is designed for summarization; therefore, summarization evaluation methodologies should be used to evaluate the compressed summaries. The newly introduced and publicly available Basic Element evaluation toolkit is used in our experiment. At 100 words, the best multi-document uncompressed extracts generated from DUC2003 data achieve a BE-F recall of 0.0532. 200-word extracts, before compression, reach 0.0786 in BE-F recall. When compression is applied, resulting in 100-word summaries, we see a significant improvement in BE-F recall, at 0.0578. These preliminary results are encouraging. The compression would be more effective if a training corpus of compression examples were available for this setting; the decision process at each tree node could then be probabilistic, rather than ad hoc.
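Before moving on, and to make the candidate-tree scoring of Section 6.5.2.2 concrete, here is a minimal sketch that scores alternative cut trees under a toy PCFG and picks the one with the highest normalized log probability. The tree encoding, rule probabilities, and example sentence are all invented for illustration; the real system reads Collins parses and Penn Treebank rule statistics, and it additionally vetoes cuts that would drop unseen top-ranked BEs.

import math

# Toy PCFG probabilities P(rule); a real system estimates these from Penn Treebank counts.
RULE_PROBS = {
    ("S", ("NP", "VP")): 0.6,
    ("VP", ("VBZ", "ADJP")): 0.2,
    ("VP", ("VBZ", "ADJP", "PP")): 0.05,
    ("ADJP", ("JJ",)): 0.5,
    ("PP", ("IN", "NP")): 0.7,
    ("NP", ("NNP",)): 0.3,
}

def tree_rules(tree):
    """Collect (parent, children) rules from a tree written as (label, [subtrees or words])."""
    label, children = tree
    if all(isinstance(c, str) for c in children):   # preterminal: stop at POS level
        return []
    rules = [(label, tuple(c[0] for c in children))]
    for child in children:
        rules.extend(tree_rules(child))
    return rules

def normalized_log_prob(tree):
    """log P_tree(t) over PCFG rules, normalized by the number of rules used."""
    rules = tree_rules(tree)
    logp = sum(math.log(RULE_PROBS.get(r, 1e-6)) for r in rules)
    return logp / max(len(rules), 1)

# Two candidate trees after cutting at different ancestor levels (t1 drops the PP, t2 keeps it).
t1 = ("S", [("NP", [("NNP", ["He"])]), ("VP", [("VBZ", ["is"]), ("ADJP", [("JJ", ["entitled"])])])])
t2 = ("S", [("NP", [("NNP", ["He"])]), ("VP", [("VBZ", ["is"]), ("ADJP", [("JJ", ["entitled"])]),
                                               ("PP", [("IN", ["to"]), ("NP", [("NNP", ["immunity"])])])])])
best = max((t1, t2), key=normalized_log_prob)
print("preferred cut:", "t1" if best is t1 else "t2")  # t1 under these toy probabilities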
Chapter 7

Using Paraphrases in Distillation Evaluation

The previous chapter on the Basic Element framework addresses the text fragment problem. However, the comparison between peer and reference texts is still limited to the matching of lexical identities. In order to move toward measuring semantic closeness, I explore the use of paraphrases for distillation evaluations. This new evaluation methodology is called ParaEval.

Paraphrases are alternative verbalizations that convey the same information. The ability to recognize and utilize paraphrases automatically would make a significant practical contribution to many Natural Language Processing (NLP) tasks. In machine translation (MT), a system that can distinguish and choose among paraphrases (phrases having the same meaning but differing in lexical nuance) produces better translation results than one that cannot. Similarly, in text summarization, where the task is to emphasize differences across documents and synthesize common information, being able to recognize paraphrases leads to better content selection, compression, and generation.

The process of acquiring a large enough collection of paraphrases is not an easy task, however. Manual corpus analyses produce domain-specific collections that are used for text generation and are application-specific. Operating in multiple domains and for multiple tasks translates into multiple manual collection efforts, which can be very time-consuming and costly. In order to facilitate smooth paraphrase utilization across a variety of NLP applications, we need an unsupervised paraphrase collection mechanism that can be conducted easily and produces paraphrases that are of adequate quality and can be readily used with a minimal amount of adaptation effort.

This chapter introduces the use of paraphrases in the evaluation of machine translation and text summarization systems. Previous efforts have primarily focused on collecting paraphrases and applying them in text generation tasks where word choice can be affected by syntactic and pragmatic constraints (Barzilay and McKeown [6]). In MT and summarization evaluations, a system-produced text segment is compared with multiple corresponding reference text segments, usually created by humans. At present, the methodologies used in both evaluation tasks are limited to lexical n-gram matching. Even though they are validated by high correlations with human judgments, the lack of support for word or phrase matching that stretches beyond strict lexical matches has limited the expressiveness and utility of these methods. In this chapter, in an effort to approximate semantic closeness, we present evaluation mechanisms that supplement literal matching with paraphrase and synonym matching and return a more detailed level of text comparison.

We separate paraphrase matching from lexical matching in our MT and summarization evaluation procedures. Paraphrase matching includes the matching of multi-word phrases and/or single-word synonyms. Lexical unigram matching is then performed on text fragments not consumed by paraphrase matching. This tiered design guarantees at least the level of comparison provided by ROUGE-1 [51] and BLEU-1 [78]. A paraphrase table is used for paraphrase matching. Since manually created multi-word paraphrases are not available in sufficient quantities, we automatically build a paraphrase table using automatic word alignment and phrase extraction methods from MT. The assumption is that if two English phrases are translated into the same foreign phrase with high probability (shown in the alignment results from a statistically trained alignment algorithm), then the two English phrases are paraphrases of each other.

Utilizing paraphrases in MT and summarization evaluations is also a realistic way to measure the quality of the paraphrases acquired. If a comparison strategy, coupled with paraphrase matching, distinguishes good and bad MT and summarization systems in close accordance with what human judges do, then this strategy and the paraphrases used are of sufficient quality.
Since our underlying comparison strategy is that of BLEU-1 for MT evaluation and ROUGE-1 for summarization evaluation, and these have been proven to be good metrics for their respective evaluation tasks, the performance of the overall comparison is directly and mainly affected by the paraphrase collection. More subtly, if we could isolate the false-positive cases where two phrases are not supposed to be paraphrases of each other, but are indicated as such by the paraphrase collection (i.e., assigned a high probability by the word/phrase alignment algorithm), we could effectively pinpoint the mistakes that the MT alignment algorithm systematically makes, and improve its alignment choices.

7.1 Motivation

7.1.1 Paraphrase Matching

A major difference that separates humans and systems in text understanding and comparison is the ability to recognize semantically equivalent text units. An essential part of semantic matching involves paraphrase matching. This paraphrase matching process is observed in the Pyramid annotation procedure for summary evaluation described by Nenkova and Passonneau [72]. In the example shown in Figure 7.1 (reproduced from the original paper), each of the 10 phrases (labeled 1 to 10) carries the same semantic content as the overall unit labeled SCU1. In our work, we aim to identify these 10 phrases automatically.

Figure 7.1: Paraphrases created by the Pyramid method.

In MT evaluation, a system's translation is compared with a set of reference translations, usually created by human translators. Since the reference translations are created from the same source text (written in the foreign language) into the target language, they are supposed to be semantically equivalent, i.e., to overlap completely. However, as shown in Figure 7.2, when literal comparison is used (indicated by links), as employed in BLEU-1 [78] and ROUGE-1 [51], only half (6 from the left and 5 from the right) of the 12 words from these two sentences are matched, and "to" is a mismatch. In applying paraphrase matching for MT evaluation, we aim to match the shaded words from both sentences.

Figure 7.2: Two translations produced by humans, from the NIST Chinese MT evaluation [74].

7.1.1.1 Synonymy Relations

Synonym and paraphrase matching are often mentioned in the same context when discussing the comparison of equivalent semantic units. Having evaluated automatically extracted paraphrases using WordNet [68], Barzilay and McKeown [6] quantitatively validated that synonymy is not the only source of paraphrasing. We expect that this claim also holds for many NLP tasks.

From an in-depth analysis of the manually created paraphrase sets from the Pyramid method [72], we find that 54.48% of the 1746 cases where a non-stop word from one phrase did not match its supposedly human-aligned paraphrases are in need of some level of paraphrase matching support. For example, in the first two phrases (labeled 1 and 2) in Figure 7.1, "for the Lockerbie bombing" and "for blowing up over Lockerbie, Scotland", no non-stop word other than the word Lockerbie occurs in both phrases. Yet these two phrases were judged to carry the same semantic meaning, because human annotators consider the word "bombing" and the phrase "blowing up" to refer to the same action, namely the one associated with "explosion." However, "bombing" and "blowing up" cannot be matched through synonymy relations using WordNet [68], since one is a noun and the other is a verb phrase (if tagged within context).
Even when the search is extended to finding synonyms and hypernyms of their categorical variants and/or using other parts of speech (the verb for "bombing" and the noun phrase for "blowing up"), a match still cannot be found. To include paraphrase recognition and matching in various NLP applications, a collection of less strict paraphrases must be created and matching strategies need to be investigated.

7.2 Paraphrase Acquisition

7.2.1 Previous and Related Work

In addition to domain-specific manual efforts, there are two major directions in paraphrase acquisition research: utilization of existing lexical resources, and corpus-based derivation.

In generation, systems need to choose words and phrases that are suitable for a given context. Inkpen and Hirst [38] show a system capable of creating texts that convey the subtleties of near-synonyms, using a dictionary. In MT evaluation, Russo-Lassner et al. [87] learn feature vectors from correlations between human judgments and various features, including WordNet [68] synsets. There is also a mixed definition of what exactly constitutes a paraphrase relation. Methods that utilize pre-existing lexical resources mostly rely on synonyms. Lin [57] shows the construction of a thesaurus by bootstrapping similar concepts from distributional patterns. While interesting concepts are created by this work and its subsequent derivatives, the application of such concepts is not directly obvious.

Another trend in paraphrase acquisition is derivation from corpus analyses. Hermjakob et al. [34] perform paraphrase recognition using a set of hand-written expansion and transformation templates derived from data observation. Pang et al. [77] use MT reference translations (multiple versions of the same translated text) and construct word lattices as paraphrase representations. MT references are written by human translators and are of good quality, but not of sufficient quantity, which limits the portability of Pang's method. In Barzilay and McKeown [6], paraphrases are extracted from multiple translations of classic novels, but the extraction process is specific to the corpus used and not typical of MT phrase extraction algorithms. Quirk et al. [81] treat paraphrasing as a monolingual translation task operating on the multiplicity of news published on the web.

Our method [105], also illustrated by Bannard and Callison-Burch [4], for automatically constructing a large domain-independent paraphrase collection is based on the assumption that two different phrases with the same meaning may have the same translation in a foreign language. Phrase-based Statistical Machine Translation (SMT) systems analyze large quantities of bilingual parallel text in order to learn translational alignments between pairs of words and phrases in two languages [76]. The sentence-based translation model makes word/phrase alignment decisions probabilistically by computing the optimal model parameters using statistical estimation theory. This alignment process results in a corpus of word/phrase-aligned parallel sentences from which we can extract phrase pairs that are translations of each other. We ran the alignment algorithm of Och and Ney [75] on a Chinese-English parallel corpus of 218 million English words, available from the Linguistic Data Consortium (LDC [46]).
Phrase pairs are extracted following the method described in Och and Ney [76], where all contiguous phrase pairs having consistent alignments are extraction candidates. Using these pairs, we build paraphrase sets by joining together all English phrases that have the same Chinese translation. Figure 7.3 shows an example word/phrase alignment for two parallel sentence pairs from our corpus in which the phrases "blowing up" and "bombing" have the same Chinese translation. On the right side of the figure we show the paraphrase set that contains these two phrases, which is typical of our collection of extracted paraphrases.

Figure 7.3: An example of the paraphrase extraction process.

Although our paraphrase extraction method is similar to that of Bannard and Callison-Burch [4], the paraphrases we extract serve completely different applications and have a broader definition of what constitutes a paraphrase. In [4], a language model is used to ensure that the extracted paraphrases are direct substitutes, from the same syntactic categories, and so on. Using the example in Figure 7.3, their paraphrase table would contain only "bombing" and "bombing attack". Paraphrases that are direct substitutes of one another are useful when translating unknown phrases. For instance, if an MT system does not have the Chinese translation for the word "bombing", but has seen it in another set of parallel data (not involving Chinese) and has determined it to be a direct substitute for the phrase "bombing attack", then the Chinese translation of "bombing attack" can be used in place of the translation for "bombing". This substitution technique has shown some improvement in translation quality [12].
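As a rough illustration of the pivot assumption above, the sketch below groups English phrases that share a foreign translation into paraphrase sets. The aligned phrase pairs are toy data (with placeholder IDs for the foreign phrases) standing in for the output of a real word-alignment and phrase-extraction pipeline.

from collections import defaultdict

# Toy (english_phrase, foreign_phrase) pairs; "F1", "F2" are placeholder IDs for foreign phrases.
phrase_pairs = [
    ("bombing", "F1"),
    ("blowing up", "F1"),
    ("bombing attack", "F1"),
    ("voted sanctions against", "F2"),
    ("imposed sanctions on", "F2"),
]

def build_paraphrase_sets(pairs):
    """Join all English phrases that share the same foreign translation into one paraphrase set."""
    by_foreign = defaultdict(set)
    for english, foreign in pairs:
        by_foreign[foreign].add(english)
    # only groups with at least two distinct English phrases constitute paraphrase sets
    return [phrases for phrases in by_foreign.values() if len(phrases) > 1]

for para_set in build_paraphrase_sets(phrase_pairs):
    print(sorted(para_set))
# ['blowing up', 'bombing', 'bombing attack']
# ['imposed sanctions on', 'voted sanctions against']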
7.3 Re-evaluating Machine Translation Results with Paraphrase Support

The introduction of automated evaluation procedures, such as BLEU [78] for machine translation (MT) and ROUGE [51] for summarization, has prompted much progress and development in both of these areas of research in Natural Language Processing (NLP). Both evaluation tasks employ a comparison strategy for comparing textual units from machine-generated and gold-standard texts. Ideally, this comparison process would be performed manually, because of humans' ability to infer, paraphrase, and use world knowledge to relate differently worded pieces of equivalent information. However, manual evaluations are time-consuming and expensive, making them a bottleneck in system development cycles.

BLEU measures how close machine-generated translations are to professional human translations, and ROUGE does the same with respect to summaries. Both methods compare a system-produced text to one or more corresponding reference texts. The closeness between texts is measured by computing a numeric score based on n-gram co-occurrence statistics. Although both methods have gained mainstream acceptance and have shown good correlations with human judgments, their deficiencies have become more evident and serious as research in MT and summarization progresses [13].

Text comparisons in MT and summarization evaluations are performed at different levels of text granularity. Since most phrase-based, syntax-based, and rule-based MT systems translate one sentence at a time, the text comparison in the evaluation process is also performed at the single-sentence level. In summarization evaluations, there is no sentence-to-sentence correspondence between summary pairs; the comparison is essentially multi-sentence to multi-sentence, which makes it more difficult and requires a completely different implementation of matching strategies. In this chapter, we focus on the intricacies involved in evaluating MT results and address two prominent problems associated with BLEU-esque metrics, namely their lack of support for paraphrase matching and the absence of recall scoring. Our solution, ParaEval, utilizes a large collection of paraphrases acquired through an unsupervised process (identifying phrase sets that have the same translation in another language) using state-of-the-art statistical MT word alignment and phrase extraction methods. This collection facilitates paraphrase matching, coupled with lexical identity matching for comparing text/sentence fragments that are not consumed by paraphrase matching. We adopt a unigram counting strategy for content matched between sentences from peer and reference translations. This unweighted scoring scheme, for both precision and recall computations, allows us to directly examine both the power and the limitations of ParaEval. We show that ParaEval is a more stable and reliable comparison mechanism than BLEU, in both fluency and adequacy rankings.

7.3.1 N-gram Co-occurrence Statistics

Being an $8 billion industry [11], MT calls for rapid development and the ability to differentiate good systems from less adequate ones. The evaluation process consists of comparing system-generated peer translations to human-written reference translations and assigning a numeric score to each system. While human assessments are still the most reliable evaluation measurements, it is not practical to solicit manual evaluations repeatedly while making incremental system design changes that would only result in marginal performance gains. To overcome the monetary and time constraints associated with manual evaluations, automated procedures have been successful in delivering benchmarks for performance hill-climbing with little or no cost.

While a variety of automatic evaluation methods have been introduced, the underlying comparison strategy is similar: matching based on lexical identity. The most prominent implementation of this type of matching is demonstrated in BLEU [78]. The remainder of this section is devoted to an overview of BLEU, or the BLEU-esque philosophy.

7.3.1.1 The BLEU-esque Matching Philosophy

The primary task that a BLEU-esque procedure performs is to compare n-grams from the peer translation with the n-grams from one or more reference translations and count the number of matches. The more matches a peer translation gets, the better it is. BLEU is a precision-based metric: the ratio of the number of n-grams from the peer translation that occur in the reference translations to the total number of n-grams in the peer translation.

The notion of modified n-gram precision was introduced to detect and avoid rewarding false positives generated by translation systems. To gain high precision, systems could potentially over-generate good n-grams that occur multiple times in multiple references. The solution was to adopt the policy that an n-gram, from both reference and peer translations, is considered exhausted after participating in a match.
As a result, the maximum number of matches an n-gram from a peer translation can receive, when compared to a set of reference translations, is the maximum number of times this n-gram occurs in any single reference translation. Papineni et al. [78] call this capping technique clipping. Figure 7.4, taken from the original BLEU paper, demonstrates the computation of modified unigram precision for a peer translation sentence.

Figure 7.4: Modified n-gram precision from BLEU.

To compute the modified n-gram precision, P_n, for a whole test set, including all translation segments (usually sentences), the formula is:

P_n = \frac{\sum_{C \in \{peers\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C \in \{peers\}} \sum_{n\text{-gram} \in C} Count(n\text{-gram})}

7.3.1.2 Lack of Paraphrasing Support

Humans are very good at finding creative ways to convey the same information. There is no one definitive reference translation in one language for a text written in another. Having acknowledged this phenomenon, however natural it is, human evaluations of system-generated translations remain the most preferred and trusted. However, what humans can do with ease puts machines at a loss. BLEU-esque procedures recognize equivalence only when two n-grams exhibit the same surface-level representation, i.e., the same lexical identity.

The BLEU implementation addresses its deficiency in measuring semantic closeness by comparing against multiple reference translations. The rationale is that multiple references give a higher chance that the n-grams appearing in the peer translation, assuming correct translations, will be rewarded by one of the references' n-grams. The more reference translations used, the better the matching and overall evaluation quality. Ideally (and to an extreme), we would need to collect a large set of human-written translations to capture all possible verbalization variations before the translation comparison procedure reaches its optimal matching ability.

One can argue that an infinite number of references is not needed in practice, because any matching procedure will stabilize at a certain number of references. This is true if precision is the only metric computed. However, using precision scores alone unfairly rewards systems that under-generate, producing unreasonably short translations. Recall measurements would provide more balanced evaluations. When using multiple reference translations, if an n-gram match is made for the peer, this n-gram could appear in any of the references, and the computation of recall becomes difficult, if not impossible. This problem can be remedied if there is cross-checking for phrases occurring across references, i.e., paraphrase recognition.

BLEU uses a brevity penalty to compensate for the lack of a recall computation. The brevity penalty is computed as follows:

BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}

Then, the BLEU score for a peer translation is computed as:

BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
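The sketch below computes clipped (modified) unigram precision, the brevity penalty, and a unigram-only BLEU-style score for a toy peer/reference pair. It is a simplified reading of the formulas above (one segment, unigrams only, shortest reference length for the penalty), not the official BLEU implementation.

import math
from collections import Counter

def modified_unigram_precision(peer_tokens, reference_token_lists):
    """Clip each peer unigram count by its maximum count in any single reference."""
    peer_counts = Counter(peer_tokens)
    max_ref_counts = Counter()
    for ref in reference_token_lists:
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token]) for token, count in peer_counts.items())
    return clipped / sum(peer_counts.values())

def brevity_penalty(peer_len, ref_len):
    """BP = 1 if c > r, else exp(1 - r/c)."""
    return 1.0 if peer_len > ref_len else math.exp(1 - ref_len / peer_len)

peer = "the the the cat".split()
refs = [["the", "cat", "sat", "on", "the", "mat"], ["there", "is", "a", "cat", "on", "the", "mat"]]
p1 = modified_unigram_precision(peer, refs)
# shortest reference length used for simplicity; BLEU itself uses an effective reference length
bleu1 = brevity_penalty(len(peer), min(len(r) for r in refs)) * p1
print(round(p1, 3), round(bleu1, 3))  # 0.75 and the penalized score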
BLEU's adoption of the brevity penalty to offset the effect of not having a recall computation has drawn criticism for its crudeness in measuring translation quality. Callison-Burch et al. [13] point out three prominent factors:

• "Synonyms and paraphrases are only handled if they are in the set of multiple reference translations [available]."
• "The scores for words are equally weighted so missing out on content-bearing material brings no additional penalty."
• "The brevity penalty is a stop-gap measure to compensate for the fairly serious problem of not being able to calculate recall."

With the introduction of ParaEval, we address two of these three issues, namely the paraphrasing problem and the provision of a recall measure.

7.3.2 ParaEval for MT Evaluation

7.3.2.1 Overview

Reference translations are created from the same source text (written in the foreign language) into the target language. Ideally, they are supposed to be semantically equivalent, i.e., to overlap completely. However, as shown in Figure 7.5, when matching based on lexical identity is used (indicated by links), only half (6 from the left and 5 from the right) of the 12 words from these two sentences are matched, and "to" is a mismatch. In applying ParaEval's paraphrase matching to MT evaluation, we aim to match all shaded words from both sentences.

Figure 7.5: Two reference translations. Grey areas are matched by using BLEU.

7.3.2.2 The ParaEval Evaluation Procedure

We adopt a two-tier matching strategy for MT evaluation in ParaEval. At the top tier, paraphrase matching is performed on system-translated sentences and corresponding reference sentences. Then, unigram matching is performed on the words not matched by paraphrases. Precision is measured as the ratio of the total number of words matched to the total number of words in the peer translation. Running our system on the example in Figure 7.5, the paraphrase-matching phase consumes the words marked in grey and aligns "have been" with "to be", "completed" with "fully", "to date" with "up till now", and "sequence" with "sequenced". The subsequent unigram matching aligns words based on lexical identity.

Figure 7.6: ParaEval's matching process.

We maintain the computation of modified unigram precision, as defined by the BLEU-esque philosophy, in principle. In addition to clipping individual candidate words with their corresponding maximum reference counts (only for words not matched by paraphrases), we clip candidate paraphrases by their maximum reference paraphrase counts. So two completely different phrases in a reference sentence can be counted as two occurrences of one phrase. For example, in Figure 7.6, the candidate phrases "blown up" and "bombing" match three phrases from the references, namely "bombing" and two instances of "explosion". Treating these two candidate phrases as one (a paraphrase match), we can see that its clip is 2 (from Ref 1, where "bombing" and "explosion" are counted as two occurrences of a single phrase). The only word matched by lexical identity is "was". The modified unigram precision calculated by our method is 4/5, whereas BLEU gives 2/5.
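A minimal sketch of the two-tier counting just described: paraphrase matches are consumed first, then plain unigram matches on the leftovers. The tiny paraphrase table and sentences are invented, and greedy left-to-right matching stands in for ParaEval's full clipping logic.

def two_tier_precision(peer_tokens, ref_tokens, paraphrase_table):
    """Return matched-word count / peer length, counting paraphrase matches before unigrams."""
    peer_left, ref_left = peer_tokens[:], ref_tokens[:]
    matched = 0
    # Tier 1: multi-word paraphrase matches (greedy, longest peer phrases first).
    for peer_phrase, ref_phrase in sorted(paraphrase_table, key=lambda p: -len(p[0])):
        p_words, r_words = peer_phrase.split(), ref_phrase.split()
        if contains(peer_left, p_words) and contains(ref_left, r_words):
            remove(peer_left, p_words); remove(ref_left, r_words)
            matched += len(p_words)
    # Tier 2: unigram (lexical identity) matches on whatever is left.
    for word in list(peer_left):
        if word in ref_left:
            ref_left.remove(word); peer_left.remove(word)
            matched += 1
    return matched / len(peer_tokens)

def contains(tokens, phrase):
    return any(tokens[i:i + len(phrase)] == phrase for i in range(len(tokens) - len(phrase) + 1))

def remove(tokens, phrase):
    for i in range(len(tokens) - len(phrase) + 1):
        if tokens[i:i + len(phrase)] == phrase:
            del tokens[i:i + len(phrase)]
            return

table = [("blown up", "bombing"), ("blown up", "explosion")]
peer = "the plane was blown up".split()
ref = "the plane was destroyed in the bombing".split()
print(two_tier_precision(peer, ref, table))  # 1.0: "blown up" matches "bombing", rest by unigram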
7.3.3 Evaluating ParaEval

To be effective in MT evaluation, an automated procedure should be capable of distinguishing good translation systems from bad ones, human translations from systems', and human translations of differing quality. For a particular evaluation exercise, an evaluation system produces a ranking for system and human translations, and this ranking is compared with one created by human judges [97]. The closer a system's ranking is to the humans', the better the evaluation system is.

7.3.3.1 Validating ParaEval

To test ParaEval's ability, the NIST 2003 Chinese MT evaluation results were used [74]. This collection consists of 100 source documents in Chinese, translations from eight individual translation systems, reference translations from four humans, and human assessments (of fluency and adequacy). The Spearman rank-order coefficient is computed as an indicator of how close a system ranking is to the gold-standard human ranking. It should be noted that the 2003 MT data is separate from the corpus from which we extracted paraphrases. For comparison purposes, BLEU (v.11, as distributed by NIST) was also run.

Table 7.1 shows the correlation figures for the two automatic systems with the NIST rankings on fluency and adequacy. The lower and upper 95% confidence intervals are labeled L-CI and H-CI. To estimate the significance of the rank-order correlation figures, we applied bootstrap resampling to calculate the confidence intervals. In each of 1000 runs, systems were ranked based on their translations of 100 randomly selected documents. Each ranking was compared with the NIST ranking, producing a correlation score for each run. A t-test was then performed on the 1000 correlation scores. In both fluency and adequacy measurements, ParaEval correlates significantly better than BLEU. The ParaEval scores used were precision scores.

Table 7.1: Ranking correlations with human assessments.

In addition to distinguishing the quality of MT systems, a reliable evaluation procedure must be able to distinguish system translations from humans' [52]. Figure 7.7 shows the overall system and human ranking. In the upper left corner, human translators are grouped together, significantly separated from the automatic MT systems clustered in the lower right corner.

Figure 7.7: Overall system and human ranking.

7.3.3.2 Implications for Word Alignment

We experimented with restricting the paraphrases being matched to various lengths. When allowing only paraphrases of three or more words to match, the correlation figures stabilize and ParaEval achieves an even higher correlation with the fluency measurement, 0.7619 on the Spearman rank coefficient. This indicates that the bigram and unigram paraphrases extracted using SMT word-alignment and phrase extraction programs are not reliable enough to be applied to evaluation tasks. We speculate that word pairs extracted as in Liang et al. [48], where a bidirectional discriminative training method is used to achieve consensus for word alignment (mostly lower n-grams), would help to elevate the level of correlation achieved by ParaEval.
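The confidence intervals reported in Section 7.3.3.1 come from bootstrap resampling over documents. Here is that procedure in miniature, using percentile confidence intervals for simplicity (the experiment above applies a t-test to the resampled correlations); the per-document scores are random stand-ins for real system scores.

import random
from scipy.stats import spearmanr

def bootstrap_rank_correlation(per_doc_scores, human_ranking, runs=1000, sample_size=100):
    """Resample documents, re-rank systems each run, and collect Spearman correlations."""
    correlations = []
    doc_ids = list(range(len(next(iter(per_doc_scores.values())))))
    systems = list(per_doc_scores)
    for _ in range(runs):
        sample = [random.choice(doc_ids) for _ in range(sample_size)]
        # average each system's score over the resampled documents
        averages = {s: sum(per_doc_scores[s][d] for d in sample) / sample_size for s in systems}
        auto_ranking = sorted(systems, key=lambda s: averages[s], reverse=True)
        rho, _ = spearmanr([human_ranking.index(s) for s in systems],
                           [auto_ranking.index(s) for s in systems])
        correlations.append(rho)
    correlations.sort()
    return correlations[int(0.025 * runs)], correlations[int(0.975 * runs)]  # 95% CI

# Toy data: 4 systems, 100 documents, random scores; the human ranking is fixed for illustration.
random.seed(0)
scores = {f"sys{i}": [random.random() + i * 0.05 for _ in range(100)] for i in range(4)}
print(bootstrap_rank_correlation(scores, human_ranking=["sys3", "sys2", "sys1", "sys0"]))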
7.3.3.3 Implications for Evaluating Paraphrase Quality

Utilizing paraphrases in MT evaluation is also a realistic way to measure the quality of paraphrases acquired through unsupervised channels. If a comparison strategy, coupled with paraphrase matching, distinguishes good and bad MT and summarization systems in close accordance with what human judges do, then this strategy and the paraphrases used are of sufficient quality. Since our underlying comparison strategy is that of BLEU-1 for MT evaluation, and BLEU has been proven to be a good metric for this evaluation task, the performance of the overall comparison is directly and mainly affected by the paraphrase collection.

7.3.4 ParaEval's Support for Recall Computation

Because BLEU uses multiple references and allows an n-gram from the peer translation to be matched with its corresponding n-gram in any of the reference translations, it cannot be used to compute recall scores, which are conventionally paired with precision to detect length-related problems in the systems under evaluation.

7.3.4.1 Using Single References for Recall

The primary goal in using multiple references is to overcome the limitation of matching on lexical identity. More translation choices give more variation in verbalization, which can lead to more matches between peer and reference translations. Since MT results are generated and evaluated at a sentence-to-sentence level (or a segment level, where each segment may contain a small number of sentences) and no text condensation is employed, the number of different and correct ways to state the same sentence is small. This is in contrast to writing generic multi-document summaries, each of which contains multiple sentences and requires a significant amount of "rewriting." When a large collection of paraphrases is used during evaluation, we are provided with the alternative verbalizations needed. This property allows us to use single references to evaluate MT results and to compute recall measurements.

7.3.4.2 Recall and Adequacy Correlations

When validating the computed recall scores for MT systems, we correlate with human assessments of adequacy only. The reason is that, according to the definition of recall, it is the content coverage of the references, and not the fluency reflected in the peers, that is being measured. Table 7.2 shows ParaEval's recall correlation with the NIST 2003 Chinese MT evaluation results on system ranking. We see that ParaEval's correlation with adequacy improves significantly when recall scores, rather than precision scores, are used for ranking.

Table 7.2: ParaEval's recall ranking correlation.

7.3.4.3 Not All Single References are Created Equal

Human-written translations differ not only in word choice, but also in other idiosyncrasies that cannot be captured with paraphrase recognition. So it would be presumptuous to declare that using paraphrases from ParaEval is enough to allow using just one reference translation for evaluation. Using multiple references allows more paraphrase sets to be explored in matching. In Table 7.3, we show ParaEval's correlation figures when using single reference translations. E01 to E04 indicate the sets of human translations used.

Table 7.3: ParaEval's correlation (precision) while using only single references.

Notice that the correlation figures vary a great deal depending on the set of single references used. How do we differentiate human translations and know which set of references to use? It is difficult to quantify the quality that a human-written translation reflects. We can only define "good" human translations as translations that are written not very differently from what other humans would write, and bad translations as ones written in an unconventional fashion. Table 7.4 shows the differences among the four sets of reference translations when comparing one set of references to the other three. The scores here are raw ParaEval precision scores. E01 and E03 are better, which explains the higher correlations ParaEval achieves using these two sets of references individually, shown in Table 7.3.

Table 7.4: Differences among reference translations (raw ParaEval precision scores).

7.3.5 Observation of Change in Number of References

When matching on lexical identity, the general consensus is that using more reference translations increases the reliability of the MT evaluation [97].
It is expected that we would see an improvement in ranking correlations when moving from one reference translation to more. However, when running BLEU on the NIST 2003 Chinese MT evaluation, this trend is inverted: using a single reference translation gave a higher correlation than using all four references, as illustrated in Table 7.5.

Table 7.5: BLEU's correlating behavior with multi- and single-reference.

Turian et al. [97] report the same peculiar behavior from BLEU on Arabic MT evaluations in Figure 5b of their paper. When using three reference translations, as the number of segments (usually sentences) increases, BLEU correlates worse than when using single references. Since the matching and underlying counting mechanisms of ParaEval are built upon the fundamentals of BLEU, we were keen to find out the differences, other than paraphrase matching, between the two methods when the number of reference translations changes. Following the description in the original BLEU paper, three incremental steps were set up to duplicate its implementation, namely modified unigram precision (MUP), the geometric mean of MUPs (GM), and multiplying the brevity penalty with GM to get the final score (BP-BLEU). At each step, correlations were computed using both single and multiple references, as shown in Tables 7.6, 7.7, and 7.8.

Table 7.6: System-ranking correlation when using modified unigram precision (MUP) scores.

Table 7.7: System-ranking correlation when using geometric mean (GM) of MUPs.

Table 7.8: System-ranking correlation when multiplying the brevity penalty with GM.

Given that many small changes have been made to the original BLEU design, our replication does not produce the same scores as the current version of BLEU. Nevertheless, the inverted behavior was observed in fluency correlations at the BP-BLEU step, and not at MUP or GM. This indicates that the multiplication of the brevity penalty to balance precision scores is problematic. According to Turian et al. [97], correlation scores computed using fewer references are inflated because the comparisons exclude the longer n-gram matches that make automatic evaluation procedures diverge from human judgments. Using a large collection of paraphrases in comparisons allows those longer n-gram matches to happen even when single references are used. This collection also allows ParaEval to compute recall scores directly, avoiding a problematic approximation of recall.

7.4 ParaEval for Summarization Evaluation

Content coverage is commonly measured in summary comparison to assess how much information from the reference summary is included in a peer summary. Both manual and automatic methodologies have been used. Naturally, there is a great amount of confidence in manual evaluation, since humans can infer, paraphrase, and use world knowledge to relate text units with similar meanings that are worded differently. Human efforts are preferred if the evaluation task is easily conducted and managed, and does not need to be performed repeatedly.
Even though validated by high correlations with human judgments gathered from previous Document Understanding Conference (DUC) [20] experiments, current automatic procedures ([51], [37]) only employ lexical n-gram matching. The lack of support for word or phrase matching that stretches beyond strict lexical matches has limited the expressiveness and utility of these meth- ods. We need a mechanism that supplements literal matchingi.e. paraphrase and synonymand approximates semantic closeness. In this section we present ParaEval for summarization evaluation, an automatic summariza- tion evaluation method, which facilitates paraphrase matching in an overall three-level compar- ison strategy. At the top level, favoring higher coverage in reference, we perform an optimal search via dynamic programming to find multi-word to multi-word paraphrase matches be- tween phrases in the reference summary (usually human-written) and those in the peer summary (system-generated). The non-matching fragments from the previous level are then searched by a greedy algorithm to find single-word paraphrase/synonym matches. At the third and the lowest level, we perform literal lexical unigram matching on the remaining texts. This tiered design 119 for summary comparison guarantees at least a ROUGE-1 level of summary content matching if no paraphrases are found. The first two levels employ the same paraphrase table as used in the MT evaluations. Since manually created multi-word paraphrasesphrases determined by humans to be paraphrases of one anotherare not available in sufficient quantities, we automatically build a paraphrase table using methods from the Machine Translation (MT) field. The assumption made in creating this table is that if two English phrases are translated into the same foreign phrase with high probability (shown in the alignment results from a statistically trained alignment algorithm), then the two English phrases are paraphrases of each other. 7.4.0.1 Previous Work in Summarization Evaluation There has been considerable work in both manual and automatic summarization evaluations. Three most noticeable efforts in manual evaluation are SEE [53], Factoid [31], and the Pyramid method [72]. SEE provides a user-friendly environment in which human assessors evaluate the quality of system-produced peer summary by comparing it to a reference summary. Summaries are represented by a list of summary units (sentences, clauses, etc.). Assessors can assign full or partial content coverage score to peer summary units in comparison to the corresponding reference summary units. Grammaticality can also be graded unit-wise. 120 The Pyramid method uses identified consensus—a pyramid of phrases created by annotators— from multiple reference summaries as the gold-standard reference summary. Summary compar- isons are performed on Summarization Content Units (SCUs) that are approximately of clause length. To facilitate fast summarization system design-evaluation cycles, ROUGE was created [51]. It is an automatic evaluation package that measures a number of n-gram co-occurrence statistics between peer and reference summary pairs. ROUGE was inspired by BLEU [78] which was adopted by the machine translation (MT) community for automatic MT evaluation. A problem with ROUGE is that the summary units used in automatic comparison are of fixed length. A more desirable design is to have summary units of variable size. 
This idea was implemented in the Basic Elements (BE) framework [37] which has not been completed due to its lack of support for paraphrase matching. Both ROUGE and BE have been shown to correlate well with past DUC human summary judgments, despite incorporating only lexical matching on summary units ([51]; [37]). 7.4.1 Summary Comparison in ParaEval This section describes the process of comparing a peer summary against a reference summary and the summary grading mechanism. 7.4.1.1 Description We adopt a three-tier matching strategy for summary comparison. The score received by a peer summary is the ratio of the number of reference words matched to the total number of words 121 Figure 7.8: Comparison of summaries. in the reference summary. The total number of matched reference words is the sum of matched words in reference throughout all three tiers. At the top level, favoring high recall coverage, we perform an optimal search to find multi-word paraphrase matches between phrases in the reference summary and those in the peer. Then a greedy search is performed to find single- word paraphrase/synonym matches among the remaining text. Operations conducted in these two top levels are marked as linked rounded rectangles in Figure 7.8. At the bottom level, we find lexical identity matches, as marked in rectangles in the example. If no paraphrases are found, this last level provides a guarantee of lexical comparison that is equivalent to what other automated systems give. In our system, the bottom level currently performs unigram matching. Thus, we are ensured with at least a ROUGE-1 type of summary comparison. Alternatively, equivalence of other ROUGE configurations can replace the ROUGE-1 implementation. There is no theoretical reason why the first two levels should not merge. But due to high computational cost in modeling an optimal search, the separation is needed. We explain this in detail below. 122 7.4.1.2 Multi-Word Paraphrase Matching In this section we describe the algorithm that performs the multi-word paraphrase matching between phrases from reference and peer summaries. Using the example in Figure 3, this algo- rithm creates the phrases shown in the rounded rectangles and establishes the appropriate links indicating corresponding paraphrase matches. Problem Description Measuring content coverage of a peer summary using a single reference summary requires computing the recall score of how much information from the reference summary is included in the peer. A summary unit, either from reference or peer, cannot be matched for more than once. For example, the phrase “imposed sanctions on Libya (r 1 ) in Figure 7.8’s reference summary was matched with the peer summarys voted sanctions against Libya (p 1 ). If later in the peer summary there is another phrase p 2 that is also a paraphrase of r 1 , the match of r 1 cannot be counted twice. Conversely, double counting is not permissible for phrase/words in the peer summary, either. We conceptualize the comparison of peer against reference as a task that is to complete over several time intervals. If the reference summary contains n sentences, there will be n time intervals, where at time t i , phrases from a particular sentence i of the reference summary are being considered with all possible phrases from the peer summary for paraphrase matches. 
A decision needs to be made at each time interval:

• Do we employ a local greedy matching algorithm that is recall-generous (preferring more matched reference words) toward only the reference sentence currently being analyzed,
• Or do we need to explore globally, inspecting all reference sentences to find the best overall matching combinations?

Consider the scenario in Figure 7.9 and its explanation in Figure 7.10:

1) At t_0: L(p_1 = r_2) > L(p_2 = r_1) and r_2 contains r_1. A local search algorithm leads to match(p_1, r_2). L() indicates the number of reference words matched by the peer phrase through paraphrase matching, and match() indicates that a paraphrase match has occurred (more in the figure).

2) At t_1: L(p_1 = r_3) > L(p_1 = r_2). A global algorithm reverses the decision match(p_1, r_2) made at t_0 and concludes match(p_1, r_3) and match(p_2, r_1). A local search algorithm would have returned no match.

Clearly, the global search algorithm achieves higher overall recall (in words). The matching of paraphrases between a reference and its peer thus becomes a global optimization problem: maximizing the content coverage of the peer with respect to the reference.

Figure 7.9: Local vs. global paraphrase matching.

Figure 7.10: Description for Figure 7.9: Local vs. global paraphrase matching.

Solution Model

We use dynamic programming to find the best paraphrase-matching combinations. The optimization problem is as follows: sentences from a reference summary and a peer summary can be broken into phrases of various lengths, and a paraphrase lookup table is used to determine whether a reference phrase and a peer phrase are paraphrases of each other; what is the optimal combination of paraphrase matches between reference and peer phrases that gives the highest recall score (in number of matched reference words) for this peer? The solution should be recall-oriented, favoring a peer phrase that matches more reference words over one that matches fewer. Following [95], the solution can be characterized as:

1) The problem can be divided into n stages corresponding to the n sentences of the reference summary. At each stage, a decision is required to determine the best combination of matched paraphrases between the reference sentence and the entire peer summary that results in no double counting of phrases on the peer side. There is no double counting of reference phrases across stages, since we process one reference sentence at a time and find the best paraphrase matches using the entire peer summary. As long as there is no double counting in the peer, we are guaranteed to have none in the reference, either.

2) At each stage, we define a number of possible states as follows. If, out of all possible phrases of any length extracted from the reference sentence, m phrases are found to have matching paraphrases in the peer summary, then a state is any subset of the m phrases.

3) Since no double counting of matched phrases/words is allowed in either the reference summary or the peer summary, the decision of which phrases (leftover text segments in reference and in peer) are allowed to match at the next stage is made in the current stage.

4) Principle of optimality: at a given state, it is not necessary to know what matches occurred at previous stages, only the accumulated recall score (matched reference words) from previous stages and which text segments (phrases) in the peer have not been taken/matched in previous stages.
5) There exists a recursive relationship that identifies the optimal decision for stage s (out of n total stages), given that stage s+1 has already been solved.

6) The final stage, n (the last sentence in the reference), is solved by choosing the state that has the highest accumulated recall score while producing no double counting of any phrase/word in the peer summary.

Figure 7.11 demonstrates the optimal solution (12 reference words matched) for the example shown in Figure 7.9.

Figure 7.11: Solution for the example in Figure 7.9.

We can express the calculations in the following formulas:

f_1(x_b) = \max_{x_b : c(x_b)} \{ r(x_b) \}

and

f_y(x_b) = \max_{x_b : c(x_b)} \{ r(x_b) + f_{y-1}(x_b - c(x_b)) \}

where f_y(x_b) denotes the optimal recall coverage (the number of words in the reference summary matched by phrases from the peer summary) at state x_b in stage y, r(x_b) is the recall coverage given state x_b, and c(x_b) records the phrases matched in the peer, with no double counting, given state x_b.
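To make the recurrence concrete, the stage/state search can be sketched along the following lines. The data representation is hypothetical and simplified: each candidate match for a reference sentence is a pair of the number of reference words it covers and the set of peer token positions it would consume. This is an illustration of the idea, not the actual ParaEval implementation.

```python
from itertools import combinations

def max_paraphrase_recall(stages):
    """Stage/state dynamic program over reference sentences (illustrative sketch).

    `stages` has one entry per reference sentence; each entry is a list of
    candidate matches (ref_words_matched, peer_positions), where peer_positions
    is a frozenset of peer token indices the match would consume.  Returns the
    maximum number of reference words matched with no peer token counted twice.
    """
    best = {frozenset(): 0}            # peer positions used so far -> best recall
    for candidates in stages:          # one stage per reference sentence
        new_best = dict(best)          # option: match nothing at this stage
        # A state at this stage is any subset of the candidate matches.
        for k in range(1, len(candidates) + 1):
            for state in combinations(candidates, k):
                used = frozenset().union(*(pos for _, pos in state))
                if sum(len(pos) for _, pos in state) != len(used):
                    continue           # candidates within the state overlap in the peer
                gain = sum(words for words, _ in state)
                for taken, score in best.items():
                    if taken & used:
                        continue       # would double-count peer text across stages
                    key = taken | used
                    if score + gain > new_best.get(key, 0):
                        new_best[key] = score + gain
        best = new_best
    return max(best.values())
```

The number of states enumerated in this way grows quickly, which is one reason the simpler n-to-1 cases are handled by the greedy synonym phase described next.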
7.4.1.3 Synonym Matching

All paraphrases whose pairings do not involve multi-word to multi-word matching are called synonyms in our experiment. Since these phrases have either an n-to-1 or a 1-to-n matching ratio (such as the phrases "blowing up" and "bombing"), a greedy algorithm favoring higher recall coverage reduces the state-creation and stage-comparison costs associated with the optimal procedure (O(m^6): O(m^3) for state creation, and for 2 stages at any time). Synonym matching is performed only on the parts of the reference and peer summaries that were not matched in the multi-word paraphrase-matching phase.

7.4.1.4 Lexical Matching

This matching phase performs straightforward lexical matching, as exemplified by the text fragments marked in rectangles in Figure 7.8. Unigrams are used as the units for counting matches, in accordance with the previous two matching phases. During all three matching phases, we employ a ROUGE-1 style of counting. Other alternatives, such as ROUGE-2, ROUGE-SU4, etc., can easily be adapted to each phase.

7.4.2 Evaluation

To evaluate and validate the effectiveness of an automatic evaluation metric, it is necessary to show that automatic evaluations correlate with human assessments highly, positively, and consistently [51]. In other words, an automatic evaluation procedure should be able to distinguish good and bad summarization systems by assigning scores that closely resemble human assessments.

7.4.2.1 Document Understanding Conference

The Document Understanding Conference has provided large-scale evaluations of both human-created and system-generated summaries annually. Research teams are invited to participate in solving summarization problems with their systems. System-generated summaries are then assessed by humans and/or automatic evaluation procedures. The collection of human judgments on systems and their summaries has provided a test-bed for developing and validating automated summary grading methods ([51], [37]).

The correlations reported for ROUGE and BE show that the agreement between these two systems and DUC human evaluations is much higher on single-document summarization tasks. One possible explanation is that when summarizing from only one source (text), both human- and system-generated summaries are mostly extractive. The reason for humans to take phrases (or maybe even sentences) verbatim is that there is less motivation to abstract when the input is not highly redundant, in contrast to the input for multi-document summarization tasks, which we speculate encourages more abstraction. ROUGE and BE both perform lexical n-gram matching and hence achieve very high correlations. Since our baseline matching strategy is lexically based when paraphrase matching is not activated, validation on single-document summarization results is not repeated in our experiment.

7.4.2.2 Validation and Discussion

We use summary judgments from DUC2003's multi-document summarization (MDS) task to evaluate ParaEval. During DUC2003, participating systems created short summaries (approximately 100 words) for 30 document sets. For each set, one assessor-written summary was used as the reference against which peer summaries created by 18 automatic systems (including baselines) and 3 other human-written summaries were compared. A system ranking was produced by averaging each system's performance over all the summaries it created. This evaluation process is replicated in our validation setup for ParaEval. In all, 630 summary pairs were compared. Pearson's correlation coefficient is computed for the validation tests, using the DUC2003 assessors' results as the gold standard.

Table 7.9 illustrates the correlation figures from the DUC2003 test set. ParaEval-para-only shows the correlation result when using only paraphrase and synonym matching, without the baseline unigram matching. ParaEval-2 uses multi-word paraphrase matching and unigram matching, omitting the greedy synonym-matching phase. ParaEval-3 incorporates matching at all three granularity levels.

Table 7.9: Correlation with DUC 2003 MDS results.

We see that the current implementation of ParaEval closely resembles the way ROUGE-1 differentiates system-generated summaries. We believe this is due to the identical calculation of recall scores. The score that a peer summary receives from ParaEval depends on the number of words matched in the reference summary through its paraphrase, synonym, and unigram matches. The counting of individual reference words indicates a ROUGE-1 design in grading. However, a detailed examination of individual reference-peer comparisons shows that paraphrase and synonym comparisons and matches, in addition to lexical n-gram matching, do measure a higher level of content coverage. This is demonstrated in Figures 7.12a and 7.12b. Strict unigram matching reflects the content retained by a peer summary mostly in the 0.2-0.4 range of recall, shown as dark-colored dots in the graphs. Allowing paraphrase and synonym matching increases the detected peer coverage to the 0.3-0.5 range, shown as light-colored dots.

Figure 7.12: A detailed look at the scores assigned by lexical and paraphrase/synonym comparisons.
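The ranking validation described above reduces to averaging each system's scores over the document sets and computing Pearson's correlation between the metric-based and human-based system scores. A minimal, self-contained sketch of that computation follows; the per-system scores below are made up for illustration and are not the DUC data.

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

# Hypothetical per-system average scores over all document sets.
human_scores  = {"sysA": 0.41, "sysB": 0.35, "sysC": 0.28}
metric_scores = {"sysA": 0.44, "sysB": 0.33, "sysC": 0.30}

systems = sorted(human_scores)
r = pearson([human_scores[s] for s in systems],
            [metric_scores[s] for s in systems])
print(f"Pearson's r between metric and human system scores: {r:.3f}")
```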
We conducted a manual evaluation to further examine the paraphrases being matched. Using 10 summaries from the Pyramid data, we asked three human subjects to judge the validity of 128 (randomly selected) paraphrase pairs extracted and identified by ParaEval. Each pair of paraphrases was coupled with its respective sentences as context. All paraphrases judged were multi-word. ParaEval received an average precision of 68.0%. The complete agreement between judges is 0.582 according to the Kappa coefficient [17]. In Figure 7.13, we show two examples that the human judges considered to be good paraphrases produced and matched by ParaEval.

Figure 7.13: Paraphrases matched by ParaEval.

Judges voiced difficulties in determining semantic equivalence. There were cases where paraphrases would be generally interchangeable but could not be matched because their surrounding contexts were not semantically equivalent. And there were paraphrases that were determined to be matches but, if taken out of context, would not be direct replacements of each other. These two situations are where the judges mostly disagreed.

7.5 Future Work

Paraphrases are vital in many areas of NLP. Finding them in large quantities, however, is not an easy task. In this chapter, we have shown that incorporating paraphrases, extracted via an unsupervised approach, significantly increases the level of text comparison performed in MT and summarization evaluations while maintaining the current state-of-the-art level of correlation with human judgments. While we have used a parallel corpus of 200 million English tokens, it is not clear how the size of the parallel data affects the quality of the extracted paraphrases. A study to answer this question is being planned.

The immediate impact and continuation of the described work would be to incorporate paraphrase matching into question answering, phrase-level sentiment analysis, summary creation, etc. With appropriate adaptation, these tasks can also be evaluated using methods similar to ours.

Chapter 8
Conclusion and Future Work

This thesis addresses the summarization challenges faced with the emergence of new data types and with domain- and genre-specific tasks. The new tasks include producing summaries of very short length, creating a biography-like summary focused on a person, and summarizing long and complex discussions posted to online newsgroups. Each task requires domain- and genre-specific knowledge in order to properly address its associated problems. In this chapter, I review the contributions made by this thesis and discuss future work.

8.1 Domain-specific Distillation Tasks

Using generic summarization algorithms and techniques, one fails to adequately capture and model the domain- and genre-specific knowledge associated with summarizing newly emerged data types or tasks. This thesis highlights the importance of recognizing and utilizing this knowledge in three summarization tasks, namely headline generation, biography creation, and online discussion summarization.

8.1.1 Headline Generation

To create summaries of very short length for newspaper articles, the most prominent knowledge we have a priori is that the beginning of an article contains the most salient portion of the entire article. The incorporation of this knowledge into a solution model is straightforward. Both the headline keyword extraction model and the headline phrase extraction model utilize this knowledge, by predicting keywords that are headline-worthy and by selecting phrases from the beginning of the text.

8.1.2 Biography Creation

Generic summarization has proven to be extremely difficult. The problem is two-fold. First, what information is summary-worthy and important? This varies to a great degree depending on the prior knowledge and interests that a particular user possesses. Secondly, how do we derive an agreed-upon "truth" to create gold-standard reference summaries for evaluations? Again, humans do not agree. These difficulties have prompted the community to move toward more finely defined distillation tasks, most recently query-based summarization.
Biographical queries, like "who is x?" questions, are frequently presented to a summarization system. Different aspects of a person's life can be categorized into predetermined classes, and properties for each class can be recorded. To select relevant information about a person, the extraction process becomes a process of classifying biographical information. Combining this classification step with generic multi-document summarization techniques, the resulting summaries are informative in answering questions about people.

8.1.3 Online Discussion Summarization

To the best of my knowledge, this thesis is the first to address summarization of online discussions by recognizing the complex topic structures associated with discussions and modeling multi-party interactions. The Internet has become a social network as much as an information repository. To properly represent complex discussions, I introduced the notion of subtopic structure, in which each message is linearly segmented by topic and segments are clustered by topic across messages. This two-level representation models the information-exchange behavior in which people respond to topics at the sub-message level. To automatically model interactions and speech acts, I focused on the dominant pattern of problem (initiating, responding) pairs. Each summary generated using this solution model consists of a number of mini-summaries focused on subtopics. Each mini-summary contains extracts from the subtopic cluster in the form of problem (initiating, responding) pairs.

8.2 Distillation Evaluation

Distillation evaluation is as important as solving distillation problems. I address the evaluation problem with two approaches. The Basic Element framework, designed and implemented with my summarization group members, addresses the fact that summary content can be expressed in varying lengths and with different numbers of words. The ParaEval framework addresses the need to push beyond lexical comparison for distillation evaluations. The use of paraphrases collected through statistical machine translation methods showed significant improvement in ranking correlations with human assessments. This demonstrates that modeling semantic closeness with a paraphrase approximation is a step in the right direction.

8.3 Future Work

Having explored solution models for domain- and genre-specific distillation tasks, I would like to expand my research to more generic tasks. This thesis approaches problems at the bottom level; what knowledge could also be useful in addressing broader NLP problems? The Basic Element and ParaEval frameworks have been shown to be useful in evaluations. With other automatic evaluation methods becoming part of the development process for tasks like machine translation, it is straightforward to adapt these two methods into objective functions for distillation systems.

Reference List

[1] M. S. Ackerman and C. Halverson. Reexamining organizational memory. Communications of the ACM, 43, 2000.
[2] M. Agar and J. Hobbs. Interpreting discourse: coherence and the analysis of ethnographic interviews. Discourse Processes, 5, 1982.
[3] M. Banko, V. Mittal, and M. Witbrock. Headline generation based on statistical translation. In ACL 2000, 2000.
[4] C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In ACL 2005, 2005.
[5] R. Barzilay, N. Elhadad, and K. McKeown. Inferring strategies for sentence ordering in multidocument summarization. Journal of Artificial Intelligence Research, 17, 2002.
[6] R. Barzilay and K. McKeown. Extracting paraphrases from a parallel corpus. In ACL/EACL 2001, 2001.
[7] R. Barzilay, K. McKeown, and M. Elhadad. Information fusion in the context of multi-document summarization. In ACL 1999, 1999.
[8] D. Beeferman, A. Berger, and J. Lafferty. Text segmentation using exponential models. In Second Conference on Empirical Methods in Natural Language Processing, 1997.
[9] A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22, 1996.
[10] E. Brill. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 1995.
[11] J. Browner. The translator's blues, 2006. http://www.slate.com/id/2133922/.
[12] C. Callison-Burch, P. Koehn, and M. Osborne. Improved statistical machine translation using paraphrases. In HLT/NAACL 2006, 2006.
[13] C. Callison-Burch, M. Osborne, and P. Koehn. Re-evaluating the role of BLEU in machine translation research. In EACL 2006, 2006.
[14] W. L. Chafe. The flow of thought and the flow of language. Syntax and Semantics: Discourse and Syntax, 12, 1979.
[15] C. Chang and C. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[16] E. Charniak. A maximum-entropy-inspired parser. In NAACL 2000, 2000.
[17] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 43, 1960.
[18] M. Collins. Three generative lexicalized models for statistical parsing. In ACL 1997, 1997.
[19] B. Dorr, D. Zajic, and R. Schwartz. Hedge trimmer: a parse-and-trim approach to headline generation. In Workshop on Automatic Summarization 2003, 2003.
[20] Document Understanding Conference, 2001-present. http://duc.nist.gov/.
[21] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 1993.
[22] H. P. Edmundson. New methods in automatic extracting. Journal of the ACM, 16, 1969.
[23] M. Elliott and W. Scacchi. Free software development: cooperation and conflict in a virtual organization culture. In S. Koch (editor), Free/Open Source Software Development. IDEA Publishing, 2004.
[24] C. Fellbaum. WordNet: An electronic lexical database. MIT Press, 1998.
[25] E. Filatova and V. Hatzivassiloglou. Event-based extractive summarization. In ACL Workshop on Summarization, 2004.
[26] B. Fox. Discourse structure and anaphora: written and conversational English. Cambridge: Cambridge University Press, 1987.
[27] W. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures & Algorithms, chapter 14. Prentice Hall PTR, first edition, 1992.
[28] M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. Identifying agreement and disagreement in conversational speech: use of Bayesian networks to model pragmatic dependencies. In ACL 2004, 2004.
[29] J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell. Summarizing text documents: sentence selection and evaluation metrics. In SIGIR 1999, 1999.
[30] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In ANLP 2000 Workshop on Automatic Summarization, 2000.
[31] H. Van Halteren and S. Teufel. Examining the consensus between human summaries: Initial experiments with factoid analysis. In HLT-NAACL Workshop, 2003.
[32] M. A. Hearst. Multi-paragraph segmentation of expository text. In ACL 1994, 1994.
[33] G. Heidorn. Intelligent writing assistance. In R. Dale, H. Moisl and H. Somers (editors), A handbook of natural language processing: Techniques and applications for the processing of language as text, 2000.
[34] U. Hermjakob, A. Echihabi, and D. Marcu. Natural language based reformulation resource and web exploitation for question answering. In TREC 2002, 2002.
[35] J. Hobbs. Coherence and coreference. Cognitive Science, 3(1), 1979.
[36] E. Hovy. Approaches to the planning of coherent text. In Natural language generation in artificial intelligence and computational linguistics, 1991.
[37] E. Hovy, C. Lin, and L. Zhou. Evaluating DUC 2005 using Basic Elements. In DUC 2005, 2005.
[38] D. Z. Inkpen and G. Hirst. Near-synonym choice in natural language generation. In RANLP 2003, 2003.
[39] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In EACL 1998, 1998.
[40] K. Sparck Jones. Discourse modeling for automatic summarising. University of Cambridge Technical Report, 29D, 1993.
[41] J. H. Ward Jr. and M. E. Hook. Application of an hierarchical grouping procedure to a problem of grouping profiles. Educational and Psychological Measurement, 23, 1963.
[42] M. Kan, J. Klavans, and K. McKeown. Linear segmentation and segment relevance. In 6th International Workshop of Very Large Corpora (WVLC-6), 1998.
[43] K. Knight and D. Marcu. Statistics-based summarization - step one: Sentence compression. In AAAI 2000, 2000.
[44] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In SIGIR 1995, 1995.
[45] D. Lam and S. L. Rohall. Exploiting e-mail structure to improve summarization. Technical Paper at IBM Watson Research Center, 20-02, 2002.
[46] LDC. Linguistic Data Consortium. http://www.nist.gov/speech/test/mt.
[47] S. Levinson. Pragmatics. Cambridge University Press, 1983.
[48] P. Liang, B. Taskar, and D. Klein. Consensus of simple unsupervised models for word alignment. In HLT/NAACL 2006, 2006.
[49] C. Lin and E. Hovy. The automated acquisition of topic signatures for text summarization. In COLING 2000, 2000.
[50] C. Lin and E. Hovy. Manual and automatic evaluation of summaries. In DUC-02, 2002.
[51] C. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In HLT-NAACL 2003, 2003.
[52] C. Lin and F. J. Och. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL 2004, 2004.
[53] C. Y. Lin. Summary evaluation environment, 2001. http://www.isi.edu/~cyl/SEE.
[54] C. Y. Lin and E. Hovy. Identifying topics by position. In ANLP 1997, 1997.
[55] C. Y. Lin and E. Hovy. Automated multi-document summarization in NeATS. In HLT 2002, 2002.
[56] D. Lin. A dependency-based method for evaluating broad-coverage parsers. In IJCAI-95, 1995.
[57] D. Lin. Automatic retrieval and clustering of similar words. In COLING-ACL 1998, 1998.
[58] R. Lin and A. Hauptmann. Headline generation using a training corpus. In CICLING 2000, 2001.
[59] J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 1968.
[60] I. Mani. Automatic Summarization (Natural Language Processing, 3).
[61] I. Mani, B. Gates, and E. Bloedorn. Improving summaries by revising them. In ACL 1999, 1999.
[62] W. Mann and S. Thompson. Relational propositions in discourse. Discourse Processes, 9(1), 1988.
[63] D. Marcu. From discourse structures to text summaries. In ACL/EACL '97 Workshop on Intelligent Scalable Text Summarization, 1997.
[64] D. Marcu. The rhetorical parsing of natural language texts. In ACL/EACL '97, 1997.
[65] D. Marcu. The automatic construction of large-scale corpora for summarization research. In SIGIR '99, 1999.
[66] I. Mardh. Headlinese: on the grammar of English front page headlines. Malmo, 1980.
[67] K. McKeown, J. Klavans, V. Hatzivassiloglou, R. Barzilay, and E. Eskin. Towards multidocument summarization by reformulation: progress and prospects. In AAAI 1999, 1999.
[68] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3, 1990.
[69] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[70] J. Moore and C. Paris. Planning texts for advisory dialogues: Capturing intentions and rhetorical information. Computational Linguistics, 9(4), 1993.
[71] J. Morris and G. Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17, 1991.
[72] A. Nenkova and R. Passonneau. Evaluating content selection in summarization: The pyramid method. In HLT-NAACL 2004, 2004.
[73] P. Newman and J. Blitzer. Summarizing archived discussions: a beginning. In Intelligent User Interfaces, 2002.
[74] NIST. NIST machine translation evaluation, 2003. http://www.nist.gov/speech/tests/mt.
[75] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29, 2003.
[76] F. J. Och and H. Ney. The alignment template approach to statistical machine translation. Computational Linguistics, 30, 2004.
[77] B. Pang, K. Knight, and D. Marcu. Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences. In HLT/NAACL 2003, 2003.
[78] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Division Technical Report RC22176, 2001.
[79] L. Polanyi. The linguistic structure of discourse. In Handbook of Discourse Analysis, 2001.
[80] J. R. Quinlan. C4.5: Programs for Machine Learning, 1993.
[81] C. Quirk, C. Brockett, and W. B. Dolan. Monolingual machine translation for paraphrase generation. In EMNLP 2004, 2004.
[82] D. Radev. Topic shift detection - finding new information in threaded news. Technical Report CUCS-026-99, Columbia University Department of Computer Science, 1999.
[83] D. Radev. A common theory of information fusion from multiple text sources, step one: Cross-document structure. In 1st ACL SIGDIAL Workshop on Discourse and Dialogue, 2000.
[84] D. Radev and K. McKeown. Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24, 1998.
[85] O. Rambow, L. Shrestha, J. Chen, and C. Laurdisen. Summarizing email threads. In HLT-NAACL 2004: Short Papers, 2004.
[86] K. Ries. Segmenting conversations by topic, initiative, and style. In SIGIR Workshop: Information Retrieval Techniques for Speech Applications, 2001.
[87] G. Russo-Lassner, J. Lin, and P. Resnik. A paraphrase-based approach to machine translation evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR-2005-57, University of Maryland, College Park, 2005.
[88] H. Saggion, D. Radev, S. Teufel, and W. Lam. Meta-evaluation of summaries in a cross-lingual environment using content-based metrics. In COLING 2002, 2002.
[89] G. Salton, C. Buckley, and J. Allan. Automatic structuring of text files. Technical Report TR 91-1241, Computer Science Department, Cornell University, Ithaca, NY, 1991.
[90] E. A. Schegloff and H. Sacks. Opening up closings. Semiotica, 7, 1973.
[91] B. Schiffman, I. Mani, and K. Concepcion. Producing biographical summaries: combining linguistic knowledge with corpus statistics. In ACL 2001, 2001.
[92] E. F. Skorochod'ko. Adaptive method of automatic abstracting and indexing. In Information Processing 71, 1972.
[93] K. Sparck-Jones and J. Galliers. Evaluating natural language processing systems: An analysis and review. Berlin: Springer.
[94] S. Teufel. Argumentative zoning: Information extraction from scientific text. PhD thesis, University of Edinburgh, 1999.
[95] M. A. Trick. A tutorial on dynamic programming, 1997. http://mat.gsia.cmu.edu/classes/dynamic/dynamic.html.
[96] R. Trigg and M. Weiser. Textnet: A network-based approach to text handling. In ACM Transactions on Office Information Systems, 1987.
[97] J. P. Turian, L. Shen, and I. D. Melamed. Evaluation of machine translation and its evaluation. In MT Summit IX, 2003.
[98] S. Wan and K. McKeown. Generating overview summaries of ongoing email thread discussions. In COLING 2004, 2004.
[99] Y. Yaari. Segmentation of expository text by hierarchical agglomerative clustering. In Recent Advances in NLP, 1997.
[100] D. Zajic, B. Dorr, and R. Schwartz. Automatic headline generation for newspaper stories. In ACL 2002 Workshop on Text Summarization, 2002.
[101] D. Zajic, B. J. Dorr, and R. Schwartz. BBN/UMD at DUC-2004: Topiary. In The North American Chapter of the Association for Computational Linguistics Workshop on Document Understanding, 2004.
[102] K. Zechner. Automatic generation of concise summaries of spoken dialogues in unrestricted domains. In SIGIR 2001, 2001.
[103] L. Zhou and E. Hovy. Headline summarization at ISI. In Document Understanding Conference 2003, 2003.
[104] L. Zhou and E. Hovy. A web-trained extraction summarization system. In HLT-NAACL 2003, 2003.
[105] L. Zhou, C. Y. Lin, D. Munteanu, and E. Hovy. ParaEval: Using paraphrases to evaluate summaries automatically. In HLT/NAACL 2006, 2006.