Event-centric Reasoning with Neuro-Symbolic Networks and Knowledge Incorporation by Rujun Han A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2022 Copyright 2022 Rujun Han Dedication To the struggling, yet cheerful journey. ii Acknowledgements The Ph.D. studies have been my life’s most challenging and fruitful journey. I would not be able to cross the finishing line without the support of my advisors, mentors, lab-mates, and friends. First, I would like to thank my primary advisor, Nanyun (Violet) Peng for her unwavering commitment and guidance in my Ph.D. success. I am grateful that she convinced me to stay in the program when I was hesitant about continuing, trusted me to explore research topics that I was interested in, and worked tirelessly to establish a collaborative lab environment. I will miss those weekly updates with her where I obtained so many inspiring ideas. I would like to thank my advisor Aram Galstyan for recruiting me into the program at USC/ISI, of- fering generous support during the initial years, and providing insightful advice every step in my Ph.D. studies. I would like to thank Prof. Shri Narayanan and Prof. Aiichiro Nakano for serving on my thesis committee and providing thoughtful feedback on my works. I also thank Prof. Xiang Ren for serving on my qualification and proposal committee and advising me on various class and research projects. Research is a collaborative effort, and I am fortunate to work with many brilliant collaborators. The most effective one was with Dr. Qiang Ning, who mentored me on event temporal reasoning and taught me how to construct reliable datasets. Together we published three top-tier NLP conference papers that comprised a significant portion of my thesis. I also collaborated successfully with Dr. Yichao Zhou and I-Hung Hsu on various event reasoning projects that resulted in conference paper publications. I thank iii Hong Chen, Jiao Sun, Yufei Tian, Julia Baylon, Mingyu (Derek) Ma, and Mengyue Liang for participating and contributing to my research projects. It is also encouraging to observe how our research lab (PlusLab) grew into a strong existence in the NLP/AI field over the past four years. During the process, I was lucky to interact with and learn from many intelligent and diligent students: Te-lin Wu, Sidi Lu, Emily Sheng, Sarik Ghazarian, Tuhin Chakrakarty, Kung-Hsiang (Steeve) Huang, Seraphina Goldfarb-Tarrant, Alex Spangher, Johnny Wei, Zi-Yi Dou, Nuan Wen, and Yu (Hope) Hou. I also appreciate their feedback on my projects and papers. Over my Ph.D. studies, I had one productive research internship at the Amazon Alexa Web-search team, where I was fortunate to be mentored by Alessandro Moschitti and Luca Soldaini on open-domain question answering. The project led to an EACL publication and inspired my interest to work on open- domain search and QA post-graduation. Last but not least, I would like to thank my mentors before starting the Ph.D. program. When I studied part-time at the NYU Center of Data Science, I had the privilege to work with Prof. Kyunghyun Cho who generously agreed to advise me despite my limited NLP experience, and later wrote a good recommenda- tion letter for my Ph.D. applications. Without his help, my transition into AI would be much more difficult. I also thank Dr. Michael Gill and Prof. 
Arthur Spirling for mentoring me on my first NLP project, which led to my first NLP conference paper. Finally, I thank my supervisors at the Federal Reserve Bank of New York, Dr. Domenico Giannone and Dr. Jan Groen for coaching me on Bayesian statistics, which helped me develop a solid foundation for machine learning. iv TableofContents Dedication ii Acknowledgements iii ListofTables ix ListofFigures xiii Abstract xvi Chapter1: Introduction 1 1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivations of Event-centric Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Applications of Event-centric Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Chapter2: Backgrounds 6 2.1 Definitions of Event and Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Pretrained Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Neuro-symbolic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 General Set-up of Symbolic Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2 Integer Linear Programming (ILP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.3 Semantic Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.1 Event Relation Extraction (ERE) Datasets . . . . . . . . . . . . . . . . . . . . . . . 10 2.4.2 Question Answering Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4.3 Story Generation Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Chapter3: EventTemporalReasoningwithNeuro-symbolicNetworksandPretraining 14 3.1 Event Temporal Relation Extraction with Structured Constraints . . . . . . . . . . . . . . . 14 3.1.1 Task Definition and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.3 Joint Event-Relation Extraction Framework . . . . . . . . . . . . . . . . . . . . . . 18 3.1.3.1 Neural SSVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.1.3.2 Multi-Tasking Neural Scoring Function . . . . . . . . . . . . . . . . . . . 19 3.1.3.3 MAP Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.3.4 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.4 Model Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 v 3.1.5 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2 Event Temporal Relation Extraction with Distributional Constraints . . . . . . . . . . . . . 30 3.2.1 Motivations and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2.2 Inference with Distributional Constraints . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.2.1 End-to-end Event Relation Extraction . . . . . . . . . . . . . . . . . . . . 32 3.2.2.2 Constrained Inference for Knowledge Incorporation . . . . . . . . . . . . 33 3.2.3 Implementation Details . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . 37 3.2.3.1 Distributional Constraint Selection . . . . . . . . . . . . . . . . . . . . . 37 3.2.3.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2.4 Model Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2.5 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3 ECONET: Adapting Pretrained Language Models for Event Temporal Reasoning Tasks . . 47 3.3.1 Motivations and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3.2 Method: Targeted Masking and Pretraining . . . . . . . . . . . . . . . . . . . . . . 49 3.3.2.1 Targeted Masks Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3.2.2 Generator for Mask Predictions . . . . . . . . . . . . . . . . . . . . . . . 51 3.3.2.3 Discriminator for Contrastive Learning . . . . . . . . . . . . . . . . . . . 52 3.3.2.4 Joint Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.2.5 Fine-tuning on Target Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.4 Model Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.3.5 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.5.1 Comparisons with Existing Systems . . . . . . . . . . . . . . . . . . . . . 57 3.3.5.2 The Impact of ECONET . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.3.5.3 Additional Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.3.5.4 Fine-tuning under Low-resource Settings . . . . . . . . . . . . . . . . . . 61 3.3.5.5 Attention Scores on Temporal Indicators . . . . . . . . . . . . . . . . . . 62 3.3.5.6 Temporal Knowledge Injection . . . . . . . . . . . . . . . . . . . . . . . . 63 3.3.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Chapter4: EventSemanticReasoningviaQuestionAnswering 67 4.1 ESTER: Question Answering for Reasoning Event Semantic Relations . . . . . . . . . . . . 67 4.1.1 Motivations and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.1.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.1.4 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.1.4.1 Main Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.1.4.2 Quality Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.1.5 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.1.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.1.7 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.1.7.1 Generative QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.1.7.2 Extractive QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.1.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
86 vi Chapter5: StoryGenerationBasedonEventCommonsenseKnowledge 87 5.1 Generating Flashbacks with Event Temporal Prompts . . . . . . . . . . . . . . . . . . . . . 87 5.1.1 Motivations and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.1.2 Task Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.1.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.1.3.1 Plan-and-Write Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.1.3.2 Structured Storyline Construction . . . . . . . . . . . . . . . . . . . . . . 93 5.1.3.3 Temporal Prompt Encoding . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.1.3.4 Storyline Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.1.3.5 RL-based End-to-end Model . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.1.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.1.4.1 Datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.1.4.2 Temporal Prompts Constructions . . . . . . . . . . . . . . . . . . . . . . 96 5.1.4.3 Automatic Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 96 5.1.4.4 Human Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.1.4.5 Compared Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.1.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.1.5.1 Textual Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.1.5.2 Effectiveness on Flashback Generation . . . . . . . . . . . . . . . . . . . 101 5.1.5.3 Temporal Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.1.5.4 Contributions to the Interest Level . . . . . . . . . . . . . . . . . . . . . 103 5.1.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.1.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.2 Generating Consistent Stories via Event Knowledge Incorporation . . . . . . . . . . . . . . 107 5.2.1 Motivations and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.2.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Chapter6: Conclusions 117 Bibliography 118 AppendixA 133 Appendix for Sec. 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 A.1 Hyper-parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 A.2 Event-Relation Consistency Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 A.3 Evaluation Metrics Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 AppendixB 137 Appendix for Sec. 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 B.1 Hyper-parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
137 AppendixC 138 Appendix for Sec. 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 C.1 Hyper-parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 vii AppendixD 140 Appendix for Sec. 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 D.1 QA validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 D.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 D.3 Model Significance Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 D.4 Sub-sample Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 D.5 Worker Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 D.6 Number of Tokens. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 D.7 Generative v.s. Extractive QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 D.8 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 AppendixE 148 Appendix for Sec. 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 E.1 Data and Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 E.2 Hyper-parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 E.3 More Details for Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 E.4 Benchmark Event Temporal Relation Annotations . . . . . . . . . . . . . . . . . . . . . . . 149 E.5 More Details for Baseline Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 E.6 Additional Generated Stories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 AppendixF 154 Appendix for Sec. 5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 F.1 Event Commonsense Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 F.2 Annotating Event Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 F.3 Relation Prediction Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 F.4 Generation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 F.5 MTurker Statistics and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 viii ListofTables 2.1 ERE Data Overview. Not all datasets are used in each ERE paper. . . . . . . . . . . . . . . . 11 2.2 QA Data Overview. Numbers in the tables refer to question-answer pairs. . . . . . . . . . . 12 3.1 Event and Relation Extraction Results on TimeBank-Dense andMatres . . . . . . . . . . . 25 3.2 Further ablation studies on event and relation extractions. Relation (G) denotes train and evaluate using gold events to compose relation candidates, whereas Relation (E) means end-to-end relation extraction.† is the event extraction and pipeline relation extraction F1 scores for CAEVO (Nathanael Chambers et al. 2014). 57.0‡ is the best previously reported micro-average score for temporal relation extraction based on gold events by Y. Meng et al. (2018). AllMatres baseline results are provided by Ning et al. (2018d). . . . . . . . . 25 3.3 Model performance breakdown for TimeBank-Dense. 
“-” indicates no predictions were made for that particular label, probably due to the small size of the training sample. . . . . 27 3.4 Model performance breakdown forMatres. . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.5 Ablation Study on Global Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.6 Error Types Based onMatres Test Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.7 TimeBank-Dense: triplet prediction count and percentage in the development set (sample size = 629). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.8 Overall experiment results: per MacNemar’s test, the improvements against the end- to-end baseline models by adding inference with distributional constraints are both statistically significant for TimeBank-Dense (p-value < 0.005) and I2b2-Temporal (p-value< 0.0005). For I2b2-Temporal, our end-to-end system is optimized for the F 1 score of the gold pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.9 TimeBank-Dense ablation study: gap shrinkage of predicted probability and F 1 contribu- tion per constraint. ∗ is selected per Sec. 3.2.3, but the probability gap is smaller than the tolerance in the test set, hence no impact to the F 1 score. . . . . . . . . . . . . . . . . . . . 42 ix 3.10 Model performance breakdown for TimeBank-Dense. “-” indicates no predictions were made for that particular label, probably due to the small size of the training sample. . . . . 42 3.11 Model performance breakdown forI2b2-Temporal. . . . . . . . . . . . . . . . . . . . . . . 43 3.12 I2b2-Temporal ablation study: gap shrinkage of predicted probability and F 1 contribution per constraint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.13 Error examples and breakdown by constraints. . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.14 The full list of the temporal lexicon. Categories are created based on authors’ domain knowledge and best judgment. ∗∗ ‘once’ can be also placed into[past] category due to its second meaning of ‘previously’, which we exclude to keep words unique. . . . . . . . . . . 50 3.15 Overall experimental results. Refer to Section 3.3.2 for naming conventions. The SOTA performances for TORQUE ∗ are provided by Ning et al. (2020b) and the numbers are average over 3 random seeds. The SOTA performances for McTaco † are provided by Pereira et al. (2020); TimeBank-Dense †† andMatres ‡ by S. Zhang et al. (2021) andRED ‡‡ by Han et al. (2019b). † , †† , ‡ and ‡‡ only report the best single model results, and to make fair comparisons with these baselines, we report both average and best single model performances. TacoLM baseline uses the provided and recommended checkpoint for extrinsic evaluations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.16 McNemar’s tests for improvement significance between best single models of RoBERTa LARGE and ECONET on the test data. Tests with p-values < 0.05 ( ∗∗ ) indi- cate strong statistical significance; tests with p-values < 0.1 ( ∗ ) indicate weak statistical significance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.17 Fine-tuning performance comparisons using event detection models trained onTORQUE v.s. TimeBank-Dense event annotations. All numbers are average over 3 random seeds. Std. Dev.≥ 1% is underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 60 3.18 Fine-tuning performances with different pre-training methods. All numbers are average over 3 random seeds. Std. Dev.≥ 1% is underlined. . . . . . . . . . . . . . . . . . . . . . . 60 3.19 RoBERTa LARGE + ECONET’s improvements over RoBERTa LARGE using full train data v.s. 10% of train data. ∆ indicates absolute points improvements while ∆% indicates relative gains perF 1 scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.20 Discriminator’s alignment with generator’s mask predictions in ECONET. Second column shows that discriminator strongly disagree with the “errors” made by the temporal generator. 64 3.21 Categories and examples of highly related “incorrect” temporal indicator predictions by the generator, but labeled as “correct” by the discriminator. . . . . . . . . . . . . . . . . . . 65 4.1 Passages and questions (overall + type breakdown) statistics for different data splits. . . . . 76 4.2 Average number of answers by semantic types. . . . . . . . . . . . . . . . . . . . . . . . . 79 x 4.3 Experimental results for answer generation. All numbers are 3-seed average with standard deviation reported, except for human baseline and zero-shot performances. All models refer to the generative QA task except for RoBERTa-large, which we use for the extractive QA task. Statistical tests are shown in Appenidx D.3. . . . . . . . . . . . . . . . . . . . . . 82 4.4 Performances on test data. Workers’ original annotations v.s. completed by another worker. 84 5.1 Evaluation results for ROCStories. All values in the bottom block are averaged over three runs. MEGATRON’sRef. PPL is missing as it is not reported in the paper and the model is not publicly available. MEGATRON andVanilla-Gen do not use anytemporal prompts, soaccuracy is not applied. MEGATRON’stemporaldiversity* is high due to large amount of VAGUE annotations in its generated stories. . . . . . . . . . . . . . . . . . 99 5.2 Evaluation results for WritingPrompts. Pearson correlation approximates the effectiveness of prompts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.3 Generated stories by our compared models. baseline: MEGATRON, Vanilla-Gen and RL-based. end-to-end. Input is the first sentence. Highlighted sentences contain an event that started earlier than the preceding event. . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4 OLS regression results on temporal coherence, diversity and number of AFTER indicators. The coefficients for the intercept are omitted. ∗ means statistically significant with 99% confidence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.5 Automatic and human evaluation results for stories. All values in the middle columns are averaged over three runs except for the baseline P. Xu et al. (2020) (provided by Han et al. (2022)). ∗ and ∗∗ indicate statistical significance with p-values < 0.05 and< 0.01, respectively per McNemar’s test for the improvements against theBART-base model. . . . 114 5.6 Automatic and human evaluation results for the plan model. Values in the middle columns (automatic evaluation) are averaged over three runs. . . . . . . . . . . . . . . . . . . . . . . 115 A.1 Best hyper-parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 A.2 Event and Relation Global Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 B.1 Hyper-parameters chosen using development data. 
For TimeBank-Dense, end-to-end baseline model is provided by the Han et al. (2019c), so we do not train it from scratch. . . 137 C.1 Hyper-parameters of our best performing LM with RoBERTa LARGE + ECONET as well as best hyper-parameters for fine-tuning on downstream tasks. . . . . . . . . . . . . . . . . . 138 C.2 Hyper-parameters of our best performing LM with BERT LARGE + ECONET as well as best hyper-parameters for fine-tuning on downstream tasks. . . . . . . . . . . . . . . . . . 139 D.1 Model and fine-tuning details. Learning rate: lr; batch size: b. . . . . . . . . . . . . . . . . . 141 xi D.2 McNemar’s test perHIT@1 andEM metrics. Models on the right-hand side of “→” are better. All numbers are p-values with≤ 0.05 indicating statistically significant (underlined). 142 D.3 Average number of tokens in questions and answers. . . . . . . . . . . . . . . . . . . . . . 144 D.4 Examples of answers predicted by generative v.s. extractive QA models. Some passages are shortened for demonstration purpose. Incomplete predictions from extractive QA are highlighted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 E.1 Label mapping from CaTeRS to MATRES. AFTER is ignored in CaTeRS by flipping event physical order. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 E.2 Additional generated stories for ROCStories. . . . . . . . . . . . . . . . . . . . . . . . . . . 152 E.3 Additional examples for WritingPrompt. Possible AFTER relations are highlighted. . . . . . 153 F.1 F 1 scores for relation predictions on the ATOMIC 20 20 test set. . . . . . . . . . . . . . . . . . 156 F.2 F 1 scores for relation predictions in the plan generation. Samples are extracted from ROCStories test set. Self-annotated labels are used for comparisons. . . . . . . . . . . . . . 156 F.3 Generation examples. Baseline: fine-tuning BART-base directly on the target data directly; Final: our model with Plan-and-Write +CS-label and +S-align. We show Final’s generated event plan for illustrating purpose. Sentences/events that are not logically consistent in thestory are in red highlights. . . . . . . . . . . . . . . . . . . . . . 159 xii ListofFigures 1.1 An illustrating example. Upper panel: paragraph with events underlined. Some events are omitted for clarify purpose. Lower panel: extracted sequence of events sorted by their chronological order.−→ indicates Event A before Event B. . . . . . . . . . . . . . . . . . . 3 1.2 An illustrating example. Upper panel: paragraph with events underlined. Lower panel: extracted graph of events sorted by semantic relations. Edges indicates different types of relations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.1 An illustration of event and relation models in our proposed joint framework. (a) is a (partial) graph of the output of the relation extraction model. “Hutu” is not an event and hence all relations including it should be annotated as NONE. (b) and (c) are comparisons between a pipeline model and our joint model. . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Deep neural network architecture for joint structured learning. Note that on the structured learning layer, grey bars denote tokens being predicted as events. 
Edge types between events follow the same notations as in 3.1a.y e l =0 (non-event), so all edges connecting to y e l are NONE.y e i = 1,y e j = 1,y e k = 1 (events) and hence edges between them are forced to be the same (y r ij = y r jk = y r ik = BEFORE in this example) by transitivity. These global assignments are input to compute the SSVM loss. . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Another example of the event temporal ordering task. Solid lines / arrows between two highlighted events show their gold temporal relations, e.g. sayBEFOREstop andbuildup INCLUDES say, and the dash line shows a wrong prediction, i.e., the VAGUE relation between buildup and say. In the table, Column Overall shows the relation distribution over the entire training corpus; Column Type Pair (P) shows the predicted relation distribution condition on the event pairs having typesoccurrence andreporting (such asbuildup andsay); Column Type Pair (G) shows thegold relation distribution condition on event pairs having the same types. Biased predictions of VAGUE relation between buildup andsay can be partially corrected by using the gold event type-relation statistics in Column Type Pair (G). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 xiii 3.4 An overview of the proposed framework. The left column shows the end-to-end event temporal relation extraction workflow. The right column (in the dashed box) illustrates how we propose to enhance the end-to-end extraction system. The final MAP inference contains two components: scores from the relation module and distributional constraints constructed using domain knowledge and corpus statistics. The text input is a real example taken from the I2b2-Temporal dataset. The MAP inference is able to push the predicted probability of the event type-relation triplet closer to the ground-truth (corpus statistics). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.5 Dev v.s. Test sets performance (F 1 score) after relaxing the threshold of triplet count for selecting constraints. All numbers are percentages. . . . . . . . . . . . . . . . . . . . . . . 45 3.6 Top: an example illustrating the difference between ERE and QA / MRC samples of event temporal reasoning. Bottom: our targeted masking strategy for ECONET v.s. random masking in PTLMs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.7 The proposed generator-discriminator (ECONET) architecture for event temporal reasoning. The upper block is the mask prediction task for temporal indicators and the bottom block is the mask prediction task for events. Both generators and the discriminator share the same representations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.8 Target ERE and QA task illustrations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.9 Performances (F 1 scores) comparison between fine-tuning RoBERTa LARGE vs. RoBERTa LARGE + ECONET over different ratios of the training data. . . . . . . . . . . . . 61 3.10 Cumulative attention score comparisons between RoBERTa LARGE and ECONET on TimeBank-Dense test data. All numbers are multiplied by 100 and averaged over 3 random seeds for illustration clarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.11 Performance (F 1 score) differences by temporal indicator categories and label classes in TimeBank-Dense. Fine-tuning on 10% TimeBank-Dense training data. . . . . . . . . . . . . 
63 4.1 A graph illustration of event semantic relations in narratives. We use trigger words to represent events in this graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.2 Examples of event annotations and 5 types of QAs in our dataset. Not all events are annotated for clarity purpose. Different colors are used for better visualization. . . . . . . 69 4.3 An illustration of our quality control, worker validation and training process. . . . . . . . 75 4.4 Performances (F 1 scores) comparison between fine-tuning RoBERTa LARGE vs. RoBERTa LARGE + ECONET over different ratios of the training data. . . . . . . . . . . . . 77 4.5 Type confusion matrix between workers’ original annotations and the majority votes after co-authors’ validation. Rows are annotators’ types whereas columns are the majority votes. 78 4.6 Most frequent n-grams in questions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 xiv 4.7 Fine-tuning UnifiedQA-large results by using 500, 1K, 2K, 3K, 4K and full train data. Dashed lines on the top are corresponding human performances. . . . . . . . . . . . . . . 83 4.8 Test performances for each semantic type. . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.9 F T 1 v.s. EM scores on the dev set by increasing training weights on positive answer tokens. 85 5.1 (a) flashback (b) temporal v.s. narrative order. . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.2 An illustration of our overall model. Here we use the first sentence of the story (and its associated event representation) as inputx. The upper block shows the vanilla implementation of the Plan-and-Write workflow. The bottom block is our core novel design by leveragingtemporalprompts in structured storylines to generate flashbacks . For illustration purposes, we re-order the triggers and arguments, and storylines are ground-truths (i.e. not predicted by models). Our final model uses reinforcement learning to implement end-to-end training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.3 Relation annotation distribution by MTurkers for compared models trained on ROCStories. 103 5.4 Stories generated by directly finetuning PTLMs (top) v.s. our model (bottom). Red highlights indicate lack of event commonsense. Dotted gray blocks show abstract events + relations in CSKG format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.5 An illustration of our overall model. We use the first sentence ([s1]) as input. The upper block shows story generation based on finetuning PTLM directly. The bottom block shows our improved Plan-and-Write design by embeddingcommonsenselabels in the event plan. The plan-to-story alignment encourages stronger coverage of the plan in the generated story (bottom-right); the alignment would be low if the story on the top-right were generated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 A.1 Confusion matrix (table) with each cell representing count of predictions over each gold label. BEFORE (b); AFTER (a); SIMULTANEOUS (e); VAGUE (v); NONE (n). . . . . . . . . . . 136 D.1 Sub-sample fine-tuning performances for hierarchical relations: Sub-event + Co- reference. All numbers are average over 3 random seeds. . . . . . . . . . . . . . . . . . . 142 D.2 Questions distributions by workers in train v.s. evaluation sets. Equality baseline indicates each participant provides equal number of questions. . . . . . . . . . . . . . . . . . . . . . 
143 D.3 An Illustration of Event Selection Interface 146 D.4 An Illustration of QA Interface 147 F.1 All commonsense relations provided in ATOMIC 2020 (Hwang et al. 2021). Relations with highlights are selected in this work. 155

Abstract

The field of natural language processing (NLP) has made enormous progress over the past several years, largely due to the success of neural networks (e.g., LSTM (Hochreiter et al. 1997) and GRU (Cho et al. 2014)). More recently, transformer-based pretrained language models (PTLMs) such as BERT (Devlin et al. 2019), RoBERTa (Y. Liu et al. 2019), XLNET (Yang et al. 2019), GPT2 (Radford et al. 2019), and BART (Lewis et al. 2020) allow machines to achieve human or near-human performance over a wide range of NLP applications, including text paraphrasing, sentiment analysis, information extraction (IE), question answering (QA), natural language inference (NLI), natural language generation (NLG), etc. Despite these successes, recent NLP research has pointed out that machines still lack reasoning skills, which prevents models from generalizing efficiently to unseen scenarios. For example, Jia et al. (2017) find that in machine reading comprehension, models appear to conduct pattern matching rather than explicit reasoning, resulting in failure to recognize adversarial examples. Holtzman et al. (2018) also show that lack of reasoning may cause NLG models to produce generic, repetitive, and even self-contradictory texts. Kahneman (2003) categorizes reasoning as a more advanced system of human intelligence compared with relatively simple perception (e.g., object and speech recognition). Therefore, it is crucial to advance machines' reasoning capability in order to elevate NLP models to more intelligent agents (Sap et al. 2020).

In this thesis, we attempt to improve a sub-area of machine reasoning, i.e., event-centric reasoning in text data. Real-world texts such as news articles and stories are often composed of a sequence of events. Thus, for natural language understanding, it is inherently critical to identify events and to understand how they are related to each other. On the other hand, many real-world NLP tasks are based on events, and stronger event reasoning can improve these NLP applications.

Specifically, we divide our contributions into three categories: 1) tasks: we tackled three event-centric tasks in this thesis, event relation extraction, question answering, and story generation, and pushed their performance to a new state of the art; 2) methods: we proposed novel models, namely neuro-symbolic networks and targeted pretraining, for effective event knowledge incorporation; 3) data: we created new datasets that challenge machines' reasoning capability.

Overall, our work improves machine learning models' understanding of events and their ability to reason about event relations. We demonstrate the effectiveness of the proposed methods through improvements on event-centric applications in relation extraction, question answering, and story generation.

Chapter 1
Introduction

1.1 Thesis Statement

Breakthroughs in AI algorithms have enabled machines to accomplish human-level performance in many NLP applications such as information extraction, question answering, and language generation.
However, current AI systems still show little sign of executing explicit reasoning, which is a critical component of intelligence. Among various reasoning skills, reasoning based on events is a crucial one, as our world and society consist of events. We rely on our skills of collecting events, analyzing their patterns, and predicting feasible subsequent events to navigate our complicated and ever-changing world. Human languages (news articles, novels, journals, etc.), as reflections of our world and society, are naturally composed of events too, yet event-centric reasoning over text remains relatively under-explored in NLP research. To improve machines' event-centric reasoning, we propose to leverage neuro-symbolic networks and self-supervised learning methods to inject world knowledge into machine learning models. We also advocate enhancing event reasoning with question answering, a flexible way that humans often reason over events. As generating logical and creative content is a key manifestation of intelligence, we further incorporate event reasoning tools into story generation systems to help neural models produce interesting and coherent narratives with a reasonable sequence of events.

1.2 Motivations of Event-centric Reasoning

With the advances of AI and natural language processing over the past decade, machines are capable of not only parsing the superficial syntactic information of human languages but also understanding the subtle underlying semantic meaning (Chen et al. 2021), which often requires complicated reasoning skills. However, as many researchers have recently pointed out, machines' reasoning capabilities over text inputs are not reliable under various circumstances. For example, Jia et al. (2017) found that neural networks often fail to recognize adversarial examples in machine reading comprehension tasks; Holtzman et al. (2018) showed that neural language generation systems can produce repetitive and self-contradictory narratives; J. Li et al. (2016) provided examples where machine dialogue agents are not able to infer actual intents from humans and therefore frequently respond with generic answers (e.g., "I don't know"). These examples highlight the necessity of robust reasoning in machines in order to achieve the next level of AI.

Among several options to equip machines with human-level reasoning (e.g., physical reasoning (Bisk et al. 2020), numerical reasoning (B. Y. Lin et al. 2020), etc.), the event-centric reasoning approach emerges as an appealing direction. The choice of using events as semantic "atoms" for understanding and reasoning over text data can be attributed to several factors: 1) many real-world texts such as news articles, stories, and medical records are comprised of a sequence of events; 2) events, compared with simple concepts or entities, often have more complicated structures containing actions, participants (subjects or objects), time, location, etc.; 3) their prevalence and complexity enable events to capture the essence of a narrative effectively, so that we may reduce the effort of reasoning over an entire text input to reasoning over a collection of events. In Figure 1.1, we show a news article paragraph with (selected) events underlined. If machines can sort these events in their chronological order (temporal reasoning), the result depicts the skeleton of the paragraph and efficiently shows how the narrative unfolds.

Figure 1.1: An illustrative example. Upper panel: paragraph with events underlined. Some events are omitted for clarity.
Lower panel: extracted sequence of events sorted by their chronological order. → indicates Event A before Event B.

Event-centric reasoning is an effective tool for NLP, but it is also a challenging task. The first challenge is event definition. For example, the first event purchased in the paragraph has the subject and object of "Paramount" and "DreamWorks," respectively. The occurrence time of the event is probably Sunday or slightly earlier, and readers must infer from the context that the location of the event is somewhere in the United States. Event definitions can vary from a complete structure of all possible components of events to simply using the action word, i.e., purchased. While being more precise, the former event definition requires more effort to explicitly identify event components; the latter only requires the action word and allows the model to implicitly infer missing event components, but this definition may lead to ambiguity.

Whether implicitly inferred or explicitly identified, the complexity of events makes reasoning about and understanding event relations difficult. For example, if we were to change the object of the event purchased to a company in the paragraph other than "DreamWorks," then it may not be correct to place purchased after sought after in the chronological chain in the bottom panel of Figure 1.1. The two events' temporal order is established only when they share the same target company ("DreamWorks" was first sought after by "NBC" and then purchased by "Paramount"). We will formally define events and event relations in Section 2.1, and discuss their potential benefits and drawbacks.

1.3 Applications of Event-centric Reasoning

To further motivate event-centric reasoning, we discuss three event-based applications we work on in this thesis.

Relation Extraction. Event relation extraction (ERE) takes raw text as input and constructs a knowledge graph (KG) whose nodes indicate events and whose edges represent pairwise event relations. Figure 1.1 can be considered a simple path in a graph where directed edges denote the BEFORE relation between neighboring events (nodes). Similarly, Figure 1.2 shows another event graph representation for the same paragraph, with more diverse edge types indicating various event relations: COREFERENCE, SUB-EVENT, and CONDITIONAL. ERE is one of the most widely explored tasks in event-centric reasoning and has many real-world applications: a clinical event KG can help doctors treat similar patients, and a financial event KG can help investors better understand markets. It is also a major focus of this thesis, which we will discuss in detail in Chapter 3.

Figure 1.2: An illustrative example. Upper panel: paragraph with events underlined. Lower panel: extracted graph of events sorted by semantic relations. Edges indicate different types of relations.

Question Answering. In contrast to ERE, event-centric QA leverages natural language queries rather than structured graphs to reason about event relations. Instead of considering an event as a node and a relation as an edge (type), natural language queries are provided to machines to solicit answers. In Figure 1.2, to understand the COREFERENCE relation between deal and purchased, a valid query would be "What event is considered the same as the deal between Paramount and DreamWorks?" The answer would be "Paramount purchased DreamWorks." The QA formulation benefits from its precision and flexibility, which we discuss further in Sec. 4.1.
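As a concrete illustration of the knowledge-graph view above, the following is a minimal sketch (not code from the thesis) of an event graph whose nodes are trigger words and whose typed, directed edges are pairwise relations; the EventGraph class and the trigger strings are illustrative assumptions based on the Paramount/DreamWorks example.

```python
# A minimal, illustrative event knowledge graph: nodes are event triggers,
# typed directed edges are pairwise relations (temporal or semantic).
from collections import defaultdict

class EventGraph:
    def __init__(self):
        # adjacency list: head trigger -> list of (tail trigger, relation label)
        self.edges = defaultdict(list)

    def add_relation(self, head, tail, relation):
        self.edges[head].append((tail, relation))

    def related(self, trigger, relation=None):
        """Events connected to `trigger`, optionally filtered by relation type."""
        return [t for t, r in self.edges[trigger] if relation is None or r == relation]

# Toy edges for the example in Figures 1.1 and 1.2.
graph = EventGraph()
graph.add_relation("sought after", "purchased", "BEFORE")   # temporal edge
graph.add_relation("deal", "purchased", "COREFERENCE")      # semantic edge

print(graph.related("deal", relation="COREFERENCE"))        # ['purchased']
```

A QA query such as the one above can then be answered by looking up the COREFERENCE neighbors of the queried event node.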
Story Generation. Another promising direction for event-centric reasoning is to apply it in story generation. Stories are often composed of a sequence of events in a correct logical or chronological order. This is exactly the knowledge our ERE tools can provide to facilitate coherent and interesting stories. We will separately discuss how to utilize temporal and logical event reasoning in Sec. 5.1 and Sec. 5.2 to help machines generate better stories.

1.4 Outline of the Thesis

We will first formally define events and event relations, review recent language model pretraining methods, discuss mathematical details, and describe the datasets we use in this thesis in Sec. 2. In Sec. 3, we will present three publications that propose novel neuro-symbolic networks (with structured and distributional constraints) and pretraining methods to enhance models' performance on event temporal reasoning tasks (ERE and QA). In Sec. 4, we will discuss our data collection effort for ESTER, a QA dataset for event semantic reasoning. In Sec. 5, we will show two works that incorporate event reasoning to facilitate more creative and coherent neural story generation. Finally, we conclude in Sec. 6.

Chapter 2
Backgrounds

2.1 Definitions of Event and Relations

In this section, we formally define events and their relations.

Events. The general guidelines of ACE (2005) provide a template definition of events. Specifically, an event consists of a trigger word and its arguments. An event trigger is a word that most clearly describes the event's occurrence, and it is often a verb or noun that evokes the action or the status of the target event (Pustejovsky et al. 2003b). Event arguments include an event's participants (subject or object), occurrence time, and location. Our dataset contribution, ESTER (Sec. 4.1), largely follows this template definition to allow the most precise event definitions. While event triggers must exist in the context, some event arguments need to be inferred from the context. Recall the example shown in Figure 1.1: the occurrence time of the event trigger purchased as well as its location need to be inferred to be "Sunday or slightly earlier" and "United States," respectively. Due to the difficulty of filling the entire event template, which may be impossible when contexts are extremely limited, some event-centric reasoning works resort to using simply the trigger word to define events (trigger definition), e.g., TE3 (UzZaman et al. 2013), HiEve (Glavaš et al. 2014), RED (O'Gorman et al. 2016), and TORQUE (Ning et al. 2020b). We adopt this trigger definition for our event temporal reasoning works in Sec. 3.

Relations. Following the pairwise definition of relations in previous work (ACE 2005; Glavaš et al. 2014; Ning et al. 2020b; UzZaman et al. 2013), we define an event relation as how a pair of events are related. There are two large categories of relations we study in this thesis: 1) temporal relations: BEFORE, AFTER, INCLUDES, SIMULTANEOUS, VAGUE, etc., with more details in Sec. 2.4; 2) semantic relations: CAUSAL, CONDITIONAL, COREFERENCE, SUB-EVENT, and COUNTERFACTUAL, whose definitions we explain in detail in Sec. 4.1. Overall, we denote the set of all events and relations in a dataset as E and R, respectively. The set of all pairs of events is thus EE. Letting i, j index two individual events, we denote a pair of events as (e_i, e_j) ∈ EE and their relation as r_{i,j} ∈ R.
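To make this notation concrete, the following is a minimal, illustrative sketch (not from the thesis) of how E, R, EE, and r_{i,j} might map onto simple data structures; the triggers, token positions, and label set shown are assumptions for illustration only.

```python
# Sec. 2.1 notation as data structures: E = events, R = relation labels,
# EE = ordered event pairs, r[(e_i, e_j)] = relation label of a pair.
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Event:
    trigger: str       # trigger word as it appears in the text
    token_index: int   # position of the trigger, used to keep the physical order

R = {"BEFORE", "AFTER", "INCLUDES", "SIMULTANEOUS", "VAGUE", "NONE"}

E = [Event("purchased", 5), Event("sought", 23), Event("deal", 40)]

# EE: all ordered pairs (e_i, e_j) with e_i appearing before e_j in the text.
ordered = sorted(E, key=lambda e: e.token_index)
EE = list(combinations(ordered, 2))

# r maps each pair to a label; filled with a placeholder here.
r = {(e_i, e_j): "VAGUE" for e_i, e_j in EE}
print(len(EE), r[(ordered[0], ordered[1])])
```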
2.2 Pretrained Language Models

Current state-of-the-art (SOTA) pretrained language models (PTLMs) leverage the Transformer (Vaswani et al. 2017) as the base model. Two types of objectives are most widely adopted in pretraining: auto-encoding (AE) and auto-regressive (AR).

BERT (Devlin et al. 2019), RoBERTa (Y. Liu et al. 2019), and ALBERT (Lan et al. 2020) fall into the auto-encoding category, where random tokens are masked and models are trained to recover their identities. More formally, let x = [x_1, ..., x_n] be a sequence of input tokens and x^m_t ∈ x^m denote a randomly masked token. The per-sample pretraining objective is to predict the identity x_t of x^m_t with a cross-entropy loss,

\mathcal{L}_{AE} = -\sum_{x^m_t \in x^m} \mathbb{1}[x^m_t = x_t] \log\big(p(x^m_t \mid x)\big) \qquad (2.1)

In contrast, auto-regressive language models such as ELMo (Peters et al. 2018) and GPT-2 (Radford et al. 2019) predict the next token's identity given the previous tokens. The per-sample pretraining objective is

\mathcal{L}_{AR} = -\sum_{t=1}^{n} \log\big(p(x_t \mid x_{1:t-1})\big) \qquad (2.2)

Research on PTLMs has also designed sequence-to-sequence pretraining objectives that work better for conditional language generation tasks. T5 (Raffel et al. 2020) builds on the encoder-decoder Transformer architecture (thus an AR model), but is pretrained with a large amount of text-to-text data. BART (Lewis et al. 2020) also pretrains on text-to-text data, but combines both AR and AE training objectives.
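The sketch below shows, in a minimal PyTorch form, how the auto-encoding loss of Eq. (2.1) reduces to a cross-entropy computed only on masked positions; the tensors are random stand-ins rather than the actual models or data used in this thesis.

```python
# Masked-language-model (auto-encoding) loss of Eq. (2.1), minimal sketch.
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels):
    """logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) holding the
    true token ids on masked positions and -100 elsewhere, so unmasked tokens are
    ignored -- mirroring the sum over x^m in Eq. (2.1)."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=-100)

vocab_size, seq_len = 100, 8
logits = torch.randn(1, seq_len, vocab_size)                  # pretend PTLM output
labels = torch.full((1, seq_len), -100, dtype=torch.long)     # nothing masked yet
labels[0, 3] = 42                                             # position 3 masked; true id 42
print(masked_lm_loss(logits, labels))

# The auto-regressive loss of Eq. (2.2) is the same cross-entropy, but computed
# at every position t against the next token x_t given the prefix x_{1:t-1}.
```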
S(y e k ,x),∀e∈{0,1} andS(y r i,j ,x),∀r∈R are the scoring functions provided by underlying event and relation neural models. The output of ILPˆ y is a collection of optimal label assignments for all events and relation candidates inx. C E is a hyper-parameter controlling weight between relation and event. The constraint that follows immediately from the objective function is that ILP should only assign one label for all tokens and relations. The following two inequalities ensure the constraintα mentioned above can be satisfied, ∀(i,j)∈EE,e P i ≥ r P i,j ,e P j ≥ r P i,j 9 ande N i +e N j ≥ r N i,j We will prove these inequalities, show how to incorporate more constraints, and leverage ILP to construct an SSVM loss to train neural networks in Sec. 3.1. 2.3.3 SemanticLoss Constructing semantic loss (Jingyi Xu et al. 2018) to train neural networks is another way to incorporate symbolic reasoning. The general training objective function can be written as, L s (α, p)=− log X v|=α Y i:v|=V i p i Y i:v|=¬V i (1− p i ) (2.4) Intuitively, for all statesv that satisfies the constraints, we multiply the probability of each element in v being either 1 or 0 (depending on α ). This formulation represents the self-information of obtaining the assignments that satisfy the constraints. We augment the original training loss (e.g. cross-entropy loss) with semantic loss to obtain the optimal balance between the original label distribution and the one obeying constraints. 2.4 Data As mentioned in Sec. 1.3, we investigated three NLP applications based on event-centric reasoning. We provide a brief description below. 2.4.1 EventRelationExtraction(ERE)Datasets ERE experiments are conducted on TimeBank-Dense, Matres, RED, and I2b2-Temporal datasets and an overview of data statistics are shown in Table 2.1. We focus on event relation, thus, all numbers refer to 10 TimeBank-Dense Matres RED I2b2-Temporal NumberofRelations 6 4 11 3 NumberofDocuments Train 22 183 76 190 Dev 5 - 9 - Test 9 20 10 120 NumberofPairs Train 4032 6332 3336 11253 Dev 629 - 400 - Test 1427 827 473 8794 Table 2.1: ERE Data Overview. Not all datasets are used in each ERE paper. EE pairs 1 . Note that in all datasets, we ensure event pairs are ordered by their physical order in text, i.e. given a labeled event pair (i,j), eventi always appears prior to eventj in the text. RED (O’Gorman et al. 2016) consists of expert annotations of news articles and the most fine-grained and complicated relations among all ERE datasets. 11 relations are defined: BEFORE,BEFORE/CAUSES,BE- FORE/PRECONDITION,OVERLAP,OVERLAP/CAUSES,OVERLAP/PRECONDITION CONTAINS,CONTAINS- SUBEVENT, BEGINS-ON ENDS-ON and SIMULTANEOUS. However, due to the complexity of definition, a large number of annotations are missing from the corpus (Han et al. 2019b). TimeBank-Dense (Cassidy et al. 2014) is based on TimeBank Corpus (Pustejovsky et al. 2003a) but addresses the sparse-annotation issue in the original data by introducing the VAGUE label and requiring annotators to label all pairs of events/times in a given window. Temporal relations consist of BEFORE, AFTER,INCLUDES,INCLUDED,SIMULTANEOUS, andVAGUE. Moreover, each event has several properties, e.g.,type,tense, andpolarity. Event types includeoccurrence,action,reporting,state, etc. Event pairs that are more than 2 sentences away are not annotated. 1 ForTCR, we also include causal pairs in the table. 11 Matres (Ning et al. 2018c) is based on TimeBank-Dense data, but filters out non-verbal events. 
The au- thors project events on multiple axes and only keep those on the main-axis. These two factors explain the large decrease of event pairs to document ratio in Table 2.1. Matres only considers the starting point of an event rather than the full event interval as in TimeBank-Dense, which makes temporal order less com- plicated (noINCLUDES andINCLUDED relations) and hence contributes to the performance improvement of machine learning models built on this dataset. I2b2-Temporal In the clinical domain, one of the earliest event temporal datasets was provided in the 2012 Informatics for Integrating Biology and the Bedside (i2b2) Challenge on NLP for Clinical Records (Sun et al. 2013). Clinical events are categorized into 6 types: treatment, problem, test, clinical-dept, occurrence, and evidential. The final data used in the challenge contains three temporal relations: BEFORE,AFTER, andOVERLAP. The 2012 i2b2 challenge also had an end-to-end track, which we use as our feature-based system baseline. To mimic the input structure of TimeBank-Dense, we only consider event pairs that are within 3 consecutive sentences. Overall, 13% of the long-distance relations are excluded. 1 2.4.2 QuestionAnsweringDatasets For question answering, we investigated three datasets. Table 2.2 summarizes the data statistics. Data #Train #Dev #Test TORQUE 24,523 1,483 4,668 McTaco - 3,783 9,442 ESTER 4,547 301 1,170 Table 2.2: QA Data Overview. Numbers in the tables refer to question-answer pairs. TORQUE (Ning et al. 2020b) is an MRC/QA dataset where annotators first identify event triggers in given passages and then ask questions regarding event temporal relations (ordering). Correct answers 1 Over 80% of these long-distance pairs are event co-reference, i.e., simply predicting them as OVERLAP will achieve high performance. 12 are event trigger words in passages. TORQUE can be considered as reformulating temporal ERE tasks as an MRC/QA task. Therefore, both ERE datasets and TORQUE are pertinent to reasoning event temporal relations. McTaco (B. Zhou et al. 2019) is another MRC/QA dataset, but it differs from TORQUE in 1) events are not explicitly identified; 2) answers are statements with true or false labels; 3) questions contain broader temporal commonsense regarding not only temporal ordering but also event frequency, during and typical time that may not be directly helpful for reasoning temporal relations. For example, knowing how often a pair of events happen doesn’t help us figure out which event happens earlier. ESTER (Han et al. 2021a) is our main data contribution which extends the QA for event temporal rea- soning proposed byTORQUE to broader event semantic relations:semanticrelations: CAUSAL,CONDI- TIONAL, COREFERENCE, SUB-EVENT, COUNTERFACTUAL. We will discuss this work with more details in Sec. 4.1. 2.4.3 StoryGenerationDatasets ROCStories (Mostafazadeh et al. 2016a) is of the most widely used open-domain story dataset. Stories are structured with 5 sentences. Following (P. Xu et al. 2020), we split data into 88,344/4,908/4,909 for train/validation/test sets. WritingPrompt (Fan et al. 2018) contains 30,335 pairs of prompts and stories. With an average of more than 700 words per story, Writing Prompts are much longer than ROCStories. These stories are also much less structured as some dialogues and short phrases may be included. To speed up our experiments, we select stories with a maximum of 500 words, resulting in a total number of 96,488 training and 5,784 validation prompt-story pairs, respectively. 
For the test set, we use the 1,000 prompt-story pairs provided by the baseline paper Goldfarb-Tarrant et al. (2020) for reporting automatic evaluation results. 13 Chapter3 EventTemporalReasoningwithNeuro-symbolicNetworksand Pretraining 3.1 EventTemporalRelationExtractionwithStructuredConstraints Identifying events in texts and extracting their pairwise relations is one of the most fundamental tasks for event-centric reasoning. In this section, we discuss our first effort in Han et al. (2019c) that proposes a novel neuro-symbolic network to incorporate structured knowledge for the end-to-end event temporal relation extraction task. 3.1.1 TaskDefinitionandContributions The end-to-end event temporal relation extraction task can be modeled as building a graph for a given text, whose nodes represent events and edges are labeled with temporal relations correspondingly. Figure 3.1a illustrates such a graph for the text shown therein. The nodes assassination, slaughtered, rampage, war, andHutu are the candidate events, and different types of edges specify different temporal relations between them: assassination is BEFORE rampage, rampage INCLUDES slaughtered, and the relation between slaughtered and war is VAGUE. Since “Hutu” is actually not an event, a system is expected to annotate the relations between “Hutu” and all other nodes in the graph as NONE (i.e., no relation). 14 (a) Temporal Relation Graph (b) Pipeline Model (c) Structured Joint Model Figure 3.1: An illustration of event and relation models in our proposed joint framework. (a) is a (partial) graph of the output of the relation extraction model. “Hutu” is not an event and hence all relations including it should be annotated asNONE. (b) and (c) are comparisons between a pipeline model and our joint model. Prior works treat this task as a pipeline of two separate subtasks, i.e., event extraction and temporal relation classification, and they also assume that gold events are given when training the relation classifier (Verhagen et al. (2007), Verhagen et al. (2010), Nathanael Chambers et al. (2014), Ning et al. (2017), Y. Meng et al. (2018)). Specifically, they built end-to-end systems that extract events first and then predict temporal relations between them (Fig. 3.1b). In these pipeline models, event extraction errors will propagate to the relation classification step and cannot be corrected afterwards. Ourfirstcontribution is the proposal of a joint model that extracts both events and temporal relations simultaneously (see Fig. 3.1c). The motivation is that if we train the relation classifier with NONE relations between non-events, then it will potentially have the capability of correcting event extraction mistakes. For instance in Fig. 3.1a, if the relation classifier 15 predictsNONE for (Hutu,war) with a high confidence, then this is a strong signal that can be used by the event classifier to infer that at least one of them is not an event. The second contribution is that we improve event representations by sharing the same contextu- alized embeddings and neural representation learner between the event extraction and temporal relation extraction modules for the first time. On top of the shared embeddings and neural representation learner, the proposed model produces a graph-structured output representing all the events and relations in the given sentences. A valid graph prediction in this context should satisfy two structural constraints. 
First, the temporal relation should always be NONE between two non-events or between one event and one non-event. Sec- ond, for those temporal relations among events, no loops should exist due to the transitive property of time (e.g., if A is beforeB andB is beforeC, thenA must be beforeC). The validity of a graph is guaranteed by solving an integer linear programming (ILP) optimization problem with those structural constraints, and our joint model is trained by structural support vector machines (SSVM) in an end-to-end fashion. Results show that, according to the end-to-endF 1 score for temporal relation extraction, the proposed method improves CAEVO (Nathanael Chambers et al. 2014) by 10% on TimeBank-Dense, and improves CogCompTime (Ning et al. 2018d) by 6.8% on Matres. We further show ablation studies to confirm that the proposed joint model with shared representations and structured learning is very effective for this task. 3.1.2 RelatedWork In this section we briefly summarize the existing work on event extraction and temporal relation extraction. To the best of our knowledge, there is no prior work on jointevent and relation extraction, so we will review joint entity and relation extraction works instead. 16 Existing event extraction methods in the temporal relation domain, as in the TempEval3 workshop (UzZaman et al. 2013), all use conventional machine learning models (logistic regression, SVM, or Max- entropy) with hand-engineered features (e.g., ClearTK (Bethard 2013) and NavyTime (Nate Chambers 2013)). While other domains have shown progress on event extraction using neural methods (Nguyen et al. (2015) and Nguyen et al. (2016)), recent progress in the temporal relation domain is focused more on the setting where gold events are provided. Therefore, we first show the performance of a neural event extractor on this task, although it is not our main contribution. Early attempts ontemporalrelationextraction use local pair-wise classification with hand-engineered features (Mani et al. (2006), Verhagen et al. (2007), Nathanael Chambers et al. (2007), Verhagen et al. (2008)). Later efforts, such as ClearTK (Bethard 2013), UTTime (Laokulrat et al. 2013), NavyTime (Nate Chambers 2013), and CAEVO (Nathanael Chambers et al. 2014) improve earlier work with better linguistic and syn- tactic rules. Yoshikawa et al. (2009), Ning et al. (2017), and Leeuwenberg et al. (2017) explore structured learning for this task, and more recently, neural methods have also been shown effective (Tourille et al. (2017), Cheng et al. (2017), Y. Meng et al. (2017), Y. Meng et al. (2018)). In practice, we need to extract both events and those temporal relations among them from raw text. All the works above treat this as two subtasks that are solved in a pipeline. To the best of our knowledge, there has been no existing work on joint event-temporal relation extraction. However, the idea of “joint” has been studied for entity-relation extraction in many works. Miwa et al. (2014) frame their joint model as table filling tasks, map tabular representation into sequen- tial predictions with heuristic rules, and construct global loss to compute the best joint predictions. Q. Li et al. (2014) define a global structure for joint entity and relation extraction, encode local and global features based on domain and linguistic knowledge. and leverage beam-search to find global optimal assignments for entities and relations. Miwa et al. 
(2016) leverage LSTM architectures to jointly predict both entities and relations, but fall short of ensuring prediction consistency. M. Zhang et al. (2017) combine the benefits of both neural networks and global optimization with beam search. Motivated by these works, we propose an end-to-end trainable neural structured support vector machine (neural SSVM) model to simultaneously extract events and their relations from text and ensure the global structure via ILP constraints. Next, we describe our proposed method in detail.

3.1.3 Joint Event-Relation Extraction Framework

In this section we first provide an overview of our neural SSVM model, and then describe each component of our framework in detail (i.e., the multi-tasking neural scoring module, and how inference and learning are performed). We denote the set of all possible relation labels (including NONE) as R, all event candidates (both events and non-events) as E, and all relation candidates as EE. Note that we utilize the trigger definition of events as mentioned in Sec. 2.1.

3.1.3.1 Neural SSVM

Our neural SSVM adapts the SSVM loss as:

L = \sum_{n=1}^{l} \frac{C}{M_n} \max_{\hat{y}^n \in Y} \left( 0, \; \Delta(y^n, \hat{y}^n) + \bar{S}^n_R + C_E \bar{S}^n_E \right) + ||\Phi||^2    (3.1)

where \bar{S}^n_E = S(\hat{y}^n_E; x^n) - S(y^n_E; x^n) and \bar{S}^n_R = S(\hat{y}^n_R; x^n) - S(y^n_R; x^n); Φ denotes the model parameters; n indexes instances; and M_n = |E|_n + |EE|_n denotes the total number of event candidates |E|_n and relation candidates |EE|_n in instance n. y^n, ŷ^n denote the gold and predicted global assignments of events and relations for instance n, each of which consists of either one-hot vectors representing true and predicted relation labels y^n_R, ŷ^n_R ∈ {0,1}^{|EE|}, or entity labels y^n_E, ŷ^n_E ∈ {0,1}^{|E|}. A maximum a posteriori (MAP) inference is needed to find ŷ^n, which we formulate as an integer linear programming (ILP) problem and describe in more detail in Section 3.1.3.3. ∆(y^n, ŷ^n) is a distance measure between the gold and the predicted assignments; we simply use the Hamming distance. C and C_E are hyper-parameters that balance the losses between events, relations, and the regularizer, and S(y^n_E; x^n), S(y^n_R; x^n) are scoring functions, for which we design a multi-tasking neural architecture.

The intuition behind the SSVM loss is that it requires the score of the gold output structure y^n to be greater than the score of the best output structure under the current model ŷ^n by a margin ∆(y^n, ŷ^n),¹ or else there will be some loss. The training objective is to minimize the loss. The major difference between our neural SSVM and the traditional SSVM model is the scoring function. Traditional SSVM uses a linear function over hand-crafted features to compute the scores, whereas we propose to use a recurrent neural network to estimate the scoring function and train the entire architecture end-to-end.

3.1.3.2 Multi-Tasking Neural Scoring Function

The recurrent neural network (RNN) architecture has been widely adopted by prior temporal extraction work to encode context information (Tourille et al. (2017), Cheng et al. (2017), and Y. Meng et al. (2017)). Motivated by these works, we adopt an RNN-based scoring function for both event and relation prediction in order to learn features in a data-driven way and capture long-term contexts in the input. In Fig. 3.2, we skip the input layer for simplicity.² The bottom layer corresponds to contextualized word representations denoted as v_k.
We use (i,j) ∈ EE to denote a candidate relation and i ∈ E to indicate a candidate event in the input sentences of length N. We fix the word embeddings computed by a pre-trained BERT-base model (Devlin et al. 2019). They are then fed into a BiLSTM layer to further encode task-specific contextual information. Both the event and relation tasks share this layer.

¹ Note that if the best prediction is the same as the gold structure, the margin is zero; there will be no loss.
² Following the convention of the event relation prediction literature (Nathanael Chambers et al. (2014), Ning et al. (2018a), and Ning et al. (2018c)), we only consider event pairs that occur in the same or neighboring sentences, but the architecture can be easily adapted to the case where inputs are longer than two sentences.

Figure 3.2: Deep neural network architecture for joint structured learning. Note that in the structured learning layer, grey bars denote tokens predicted as events. Edge types between events follow the same notation as in Fig. 3.1a. y^e_l = 0 (non-event), so all edges connecting to y^e_l are NONE. y^e_i = 1, y^e_j = 1, y^e_k = 1 (events), and hence the edges between them are forced to be consistent (y^r_{ij} = y^r_{jk} = y^r_{ik} = BEFORE in this example) by transitivity. These global assignments are the input used to compute the SSVM loss.

The event scorer is illustrated by the two left branches following the BiLSTM layer. We simply concatenate the forward and backward hidden vectors to encode the context of each token. As for the relation scorer, shown in the right branches, for each pair (i,j) we take the forward and backward hidden vectors corresponding to them, f_i, b_i, f_j, b_j, and concatenate them with linguistic features as in previous event relation prediction research. We denote the linguistic features as L_{i,j} and only use simple features provided in the original datasets: token distance, tense, and polarity of events. Finally, all hidden vectors and linguistic features are concatenated to form the input used to compute the probability of being an event or a softmax distribution over all possible relation labels, which we refer to as the RNN-based scoring function in the following sections.

3.1.3.3 MAP Inference

A MAP inference is needed both during training, to obtain ŷ^n in the loss function (Equation 3.1), and at test time, to get globally coherent assignments. We formulate the inference problem as an ILP problem, as briefly introduced in Sec. 2.3. The inference framework is established by constructing a global objective function using scores from the local scorers and imposing several global constraints: 1) one-label assignment, 2) event-relation consistency, and 3) symmetry and transitivity, as in Bramsen et al. (2006), Nathanael Chambers et al. (2008), Denis et al. (2011), Q. X. Do et al. (2012), and Ning et al. (2017).

Objective Function. The objective of the global inference is to find the global assignment that has the highest probability under the current model, as specified in Equation 3.2:

\hat{y} = \arg\max \sum_{(i,j) \in EE} \sum_{r \in R} y^r_{i,j} S(y^r_{i,j}, x) + C_E \sum_{k \in E} \sum_{e \in \{0,1\}} y^e_k S(y^e_k, x)    (3.2)

s.t.  y^r_{i,j}, y^e_k ∈ {0,1},  \sum_{r \in R} y^r_{i,j} = 1,  \sum_{e \in \{0,1\}} y^e_k = 1,

where y^e_k is a binary indicator of whether the k-th candidate is an event or not, and y^r_{i,j} is a binary indicator specifying whether the global prediction of the relation between (i,j) is r ∈ R. S(y^e_k, x), ∀e ∈ {0,1} and S(y^r_{i,j}, x), ∀r ∈ R are the scoring functions obtained from the event and relation scorers, respectively.
The output of the global inference, ŷ, is a collection of optimal label assignments for all event and relation candidates in a fixed context. C_E is a hyper-parameter controlling the weights between relations and events. The constraint that follows immediately from the objective function is that the global inference should assign exactly one label to each entity and each relation.

Constraints. We introduce several additional constraints to ensure that the resulting optimal output graph forms a valid and plausible event graph.

Event-Relation Consistency. Event and relation prediction consistency is defined by the following property: a pair of input tokens have a positive temporal relation if and only if both tokens are events. The following global constraints satisfy this property:

∀(i,j) ∈ EE,  e^P_i ≥ r^P_{i,j},  e^P_j ≥ r^P_{i,j}  and  e^N_i + e^N_j ≥ r^N_{i,j}

where e^P_i denotes an event and e^N_i denotes a non-event token; r^P_{i,j} indicates a positive relation (BEFORE, AFTER, SIMULTANEOUS, INCLUDES, IS_INCLUDED, VAGUE) and r^N_{i,j} indicates the negative relation, i.e., NONE. A formal proof of this property can be found in Appendix A.2.

Symmetry and Transitivity Constraint. We also explore the symmetry and transitivity constraints of relations. They are specified as follows:

∀(i,j), (j,k) ∈ EE,  y^r_{i,j} = y^{\bar{r}}_{j,i}    (symmetry)

y^{r_1}_{i,j} + y^{r_2}_{j,k} - \sum_{r_3 \in \mathrm{Trans}(r_1, r_2)} y^{r_3}_{i,k} \le 1    (transitivity)

Intuitively, the symmetry constraint forces two pairs of events with flipped order to have reversed relations. For example, if r_{i,j} = BEFORE, then r_{j,i} = AFTER. The transitivity constraint requires that if the (i,j), (j,k), and (i,k) pairs exist in the graph, the label (relation) prediction of the (i,k) pair has to fall into the transitivity set specified by the (i,j) and (j,k) pairs. The full transitivity table can be found in Ning et al. (2018a).

3.1.3.4 Learning

We begin by experimenting with optimizing the SSVM loss directly, but model performance degrades.³ Therefore, we develop a two-stage learning approach which first trains a pipeline version of the joint model without feedback from the global constraints. In other words, the local neural scoring functions are optimized with cross-entropy loss using gold events and relation candidates that are constructed directly from the outputs of the event model. During the second stage, we switch to the global SSVM loss function in Equation 3.1 and re-optimize the network to adjust for global properties. We provide more details in Section 3.1.4.

3.1.4 Model Comparisons

Single-Task Model. The most basic way to build an end-to-end system is to train separate event detection and relation prediction models with gold labels, as mentioned in our introduction. In other words, the BiLSTM layer is not shared as in Fig. 3.2. During evaluation and test time, we use the outputs of the event detection model to construct relation candidates and apply the relation prediction model to make the final prediction.

Multi-Task Model. This is the same as the single-task model except that the BiLSTM layer is now shared between the event and relation tasks. Note that neither the single-task nor the multi-task model is trained to handle the NONE relation directly. They both rely on the predictions of the event model to annotate relations as either positive pairs or NONE.

Pipeline Joint Model. This shares the same architecture as the multi-task model, except that during training we use the predictions of the event model to construct relation candidates to train the relation model.
This strategy will generate NONE pairs during training if one argument of the relation candidate is not an event. These NONE pairs will help the relation model distinguish negative relations from positive ones, and thus become more robust to event prediction errors. We train this model with gold events and relation candidates during the first several epochs in order to obtain a relatively accurate event model, and switch to the pipeline version afterwards, inspired by Miwa et al. (2016).

³ We leave further investigation for future work.

Structured Joint Model. This is described in detail in Section 3.1.3. However, we experience difficulties in training the model with the SSVM loss from scratch. This is due to the large number of non-event tokens, which the model is not capable of distinguishing at the beginning of training. We thus adopt a two-stage learning procedure where we take the best pipeline joint model and re-optimize it with the SSVM loss. To restrict the search space for events in the ILP inference of the SSVM loss, we use the predicted probabilities from the event detection model to filter out non-events, since the event model has strong performance, as shown in Section 3.1.5. Note that this is very different from the pipeline model, where events are first predicted and relations are constructed from the predicted events. Here, we only leverage an additional hyper-parameter T_evt to filter out highly unlikely event candidates. Both event and relation labels are assigned simultaneously during the global inference with ILP, as specified in Section 3.1.3.3; a minimal solver-based sketch of this inference step is shown at the end of this section. We also filter out tokens with POS tags that do not appear in the training set, as most events are either nouns or verbs in TimeBank-Dense, and all events are verbs in Matres.

Evaluation Metrics. To be consistent with previous research, we adopt two different evaluation metrics. The first is the standard micro-average score. For densely annotated data, the micro-average metric should yield identical precision, recall, and F1 scores. However, since our joint model includes NONE pairs, we follow the convention of IE tasks and exclude them from evaluation. The second is similar, except that we exclude both NONE and VAGUE pairs, following Ning et al. (2018d). Please refer to Appendix A.3 for a visualization of the two metrics.

                                                     Event                Relation
Corpus           Models                           P     R     F1       P     R     F1
TimeBank-Dense   Structured Joint Model (Ours)   89.2  92.6  90.9     52.6  46.5  49.4
                 Nathanael Chambers et al. (2014) 97.2 79.4  87.4     43.8  35.7  39.4
Matres           Structured Joint Model (Ours)   87.1  88.5  87.8     59.0  60.2  59.6
                 Ning et al. (2018d)             83.5  87.0  85.2     48.4  58.0  52.8

Table 3.1: Event and Relation Extraction Results on TimeBank-Dense and Matres.

Micro-average          TB-Dense                            MATRES
F1 (%)            Event  Relation(G)  Relation(E)    Event  Relation(G)  Relation(E)
Baselines         87.4†     57.0‡        39.4†       85.2      65.9         52.8
Single-task       88.6      61.9         44.3        86.9      75.3         57.2
Multi-task        89.2      64.5         48.4        86.4      75.5         58.7
Pipeline Joint    90.3       -           48.5        87.2       -           58.5
Structured Joint  90.9       -           49.4        87.8       -           59.6

Table 3.2: Further ablation studies on event and relation extraction. Relation (G) denotes training and evaluating with gold events to compose relation candidates, whereas Relation (E) denotes end-to-end relation extraction. † marks the event extraction and pipeline relation extraction F1 scores of CAEVO (Nathanael Chambers et al. 2014). 57.0‡ is the best previously reported micro-average score for temporal relation extraction based on gold events, by Y. Meng et al. (2018). All Matres baseline results are provided by Ning et al. (2018d).
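To make the global inference of Sec. 3.1.3.3 more concrete, the sketch below encodes the one-label and event-relation consistency constraints with an off-the-shelf ILP solver. Gurobi is the solver mentioned in Sec. 3.2.3.2, but the gurobipy formulation, the dictionary-based inputs, and the omission of the symmetry and transitivity constraints here are simplifications for illustration rather than the exact implementation.

```python
import gurobipy as gp
from gurobipy import GRB

def map_inference(event_scores, rel_scores, relations, c_e=1.0):
    """Joint MAP inference of Eq. (3.2) with one-label and consistency constraints.

    event_scores: {k: (score_nonevent, score_event)} for each token candidate k.
    rel_scores:   {(i, j): {r: score}} for each candidate pair; `relations`
                  must include the label "NONE".
    """
    m = gp.Model("joint-map")
    m.Params.OutputFlag = 0
    # binary indicators y^e_k and y^r_{i,j}
    ye = {(k, e): m.addVar(vtype=GRB.BINARY) for k in event_scores for e in (0, 1)}
    yr = {(i, j, r): m.addVar(vtype=GRB.BINARY)
          for (i, j) in rel_scores for r in relations}
    # objective: relation scores plus C_E-weighted event scores
    m.setObjective(
        gp.quicksum(rel_scores[i, j][r] * yr[i, j, r] for (i, j, r) in yr)
        + c_e * gp.quicksum(event_scores[k][e] * ye[k, e] for (k, e) in ye),
        GRB.MAXIMIZE)
    # one-label assignment for every token and every pair
    for k in event_scores:
        m.addConstr(ye[k, 0] + ye[k, 1] == 1)
    for (i, j) in rel_scores:
        m.addConstr(gp.quicksum(yr[i, j, r] for r in relations) == 1)
        # event-relation consistency: a positive relation needs two events,
        # and a NONE relation needs at least one non-event
        r_pos = gp.quicksum(yr[i, j, r] for r in relations if r != "NONE")
        m.addConstr(ye[i, 1] >= r_pos)
        m.addConstr(ye[j, 1] >= r_pos)
        m.addConstr(ye[i, 0] + ye[j, 0] >= yr[i, j, "NONE"])
    m.optimize()
    events = {k: int(ye[k, 1].X > 0.5) for k in event_scores}
    rels = {(i, j): max(relations, key=lambda r: yr[i, j, r].X)
            for (i, j) in rel_scores}
    return events, rels
```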
3.1.5 ResultsandAnalysis The main results of this paper can be found in Table 3.1. All best-recall and F1 scores are achieved by our structured joint model, and the results outperform the baseline systems by 10.0% and 6.8% on end-to-end relation extraction per F1 scores and 3.5% and 2.6% on event extraction per F1 scores. The best precision score for the TimeBank-Dense dataset is achieved by CAEVO, which indicates that the linguistic rule-based system can make highly precise predictions by being conservative. Table 3.2 shows a more detailed analysis, in which we can see that our single-task models with BERT embeddings and a BiLSTM encoder already outperform the baseline systems on end-to-end relation extrac- tion tasks by 4.9% and 4.4% respectively. In the following sections we discuss step-by-step improvement by adopting multi-task, pipeline joint, and structured joint models on end-to-end relation extraction, event extraction, and relation extraction on gold event pairs. 25 End-to-EndRelationExtraction. For TimeBank-Dense, the improvements over the single-task model per F1 score are 4.1% and 4.2% for the multi-task and pipeline joint model respectively. This indicates that the pipeline joint model is helpful only marginally. Table 3.3 shows that the structured joint model improves both precision and recall scores forBEFORE andAFTER and achieves the best end-to-end relation extraction performance at 49.4%—which outperforms the baseline system by 10.0% and the single-task model by 5.1%. ForMatres, compared to the single-task model, the multi-task model improves F1 scores by 1.5%, while the pipeline joint model improves F1 scores by 1.3%—which means that pipeline joint training does not bring any gains for Matres. The structured joint model reaches the best end-to-end F1 score at 59.6%, which outperforms the baseline system by 6.8% and the single-task model by 2.4%. We speculate that the gains come from the joint model’s ability to help deal with NONE pairs, since recall scores for BEFORE and AFTER increase by 1.5% and 1.1% respectively (Table 3.4). EventExtraction. For TimeBank-Dense, our structured joint model out-performs the CAEVO baseline by 3.5% and the single-task model by 1.3%. Improvements on event extraction can be difficult because our single-task model already works quite well with a close-to 89% F1 score, while the inter-annotator agree- ment for events in TimeBank documents is merely 87% (UzZaman et al. 2013). ForMatres, the structured model outperforms the the baseline model and the single-task model by 2.6% and 0.9% respectively. How- ever, we observe that the multi-task model has a slight drop in event extraction performance over the single-task model (86.4% vs. 86.9%). This indicates that incorporating relation signals are not particularly helpful for event extraction on Matres. We speculate that one of the reasons could be that events are all verbs in Matres. It is possible that a more concentrated single-task model works better when events are homogeneous, whereas a multi-task model is more powerful when we have a mixture of event types, e.g., both verbs and nouns as in TimeBank-Dense. 
26 CAEVO PipelineJoint StructureJoint P R F1 P R F1 P R F1 BEFORE 41.4 19.5 26.5 59.0 46.9 52.3 59.8 46.9 52.6 AFTER 42.1 17.5 24.7 69.3 45.3 54.8 71.9 46.7 56.6 INCLUDES 50.0 3.6 6.7 - - - - - - IS_INCLUDED 38.5 9.4 15.2 - - - - - - SIMULTANEOUS 14.3 4.5 6.9 - - - - - - VAGUE 44.9 59.4 51.1 45.1 55.0 49.5 45.9 55.8 50.4 Overall 43.8 35.7 39.4 51.5 45.9 48.5 52.6 46.5 49.4 Table 3.3: Model performance breakdown for TimeBank-Dense. “-” indicates no predictions were made for that particular label, probably due to the small size of the training sample. CogCompTime Single-task PipelineJoint StructuredJoint P R F1 P R F1 P R F1 P R F1 BEFORE 50.4 65.6 57.0 51.5 68.4 58.8 60.1 62.8 61.4 60.0 64.3 62.0 AFTER 45.1 52.8 48.6 54.3 61.7 57.8 55.0 59.8 57.3 57.5 60.9 59.1 SIMULTANEOUS - - - - - - - - - - - - Overall 48.4 58.0 52.8 52.5 62.9 57.2 58.1 59.0 58.5 59.0 60.2 59.6 Table 3.4: Model performance breakdown forMatres. Relation Extraction with Gold Events. For TimeBank-Dense, there is much prior work on relation extraction based on gold events in TimeBank-Dense. Y. Meng et al. (2018) proposed a neural model with global information that achieved the best results as far as we know. The improvement of our single- task model over that baseline is mostly attributable to the adoption of BERT embedding. We show that sharing the LSTM layer for both events and relations can help further improve performance of the relation classification task by 2.6%. For the joint models, since we do not train them on gold events, the evaluation would be meaningless. We simply skip this evaluation. ForMatres, both single-task and multi-task models outperform the baseline by nearly 10%, while the improvement of multi-task over single task is marginal. In Matres, a relation pair is equivalent to a verb pair, and thus, the event prediction task probably does not provide much more information for relation extraction. 3.1.6 Discussion Global Constraints. In Table 3.5, we conduct an ablation study to understand the contributions from the event-relation prediction consistency constraint and the temporal relation transitivity constraint for 27 Micro-average TB-Dense MATRES No Structure 48.5 58.5 +ER Consistency 49.4 59.5 + Transitivity 49.4 59.6 Table 3.5: Ablation Study on Global Constraints the structured joint model. As we can see, the event-relation consistency helps improve the F1 scores by 0.9% and 1% for TimeBank-Dense and Matres, respectively, but the gain by using transitivity is either non-existing or marginal. We hypothesize two potential reasons: 1) We leveraged BERT contextualized embedding as word representation, which could tackle transitivity in the input context; 2) NONE pairs could make transitivity rule less useful, as positive pairs can be predicted as NONE and transitivity rule does not apply to NONE pairs. ErrorAnalysis. By comparing gold and predicted labels for events and temporal relations and examin- ing predicted probabilities for events, we identified three major sources of mistakes made by our structured model, as illustrated in Table 3.6 with examples. 1. Both events in Ex 1 are assigned low scores by the event module (<<0.01). Although the structured joint model is designed to predict events and relations jointly, we leverage the event module to filter out tokens with scores lower than a threshold. Consequently, some true events can be mistakenly predicted as non-events, and the relation pairs including them are automatically assigned NONE. 2. 
In Ex 2 the event module assigns high scores to tokens happened (0.97) and according (0.89), but according is not an event. When the structured model makes inference jointly, the decision will weigh heavily towards assigning1 (event) to both tokens. With the event-relation consistency constraint, this pair is highly likely to be predicted as having a positive temporal relation. Nearly all mistakes made in this category follow the same pattern illustrated by this example. 28 Type1: Event predicted as non-event 189 pairs Ex1. What Microsoft gets are developers around the worldworking on ideas that could potentiallyopen up Kinect for Windows ... Type2: NONE predicted as true relation 135 pairs Ex2. Mr. Netanyahu told Mr. Erdogan that what happened on board the Mavi Marmara was “unintentional” ... ,according to the statement. Type3: VAGUE relation 87 pairs Ex3. Microsoftsaid it has identified 3 companies for the China program to run through June. The company gives each participating startup $ 20,000 to create ... Table 3.6: Error Types Based onMatres Test Data 3. The existence ofVAGUE makes temporal relation prediction challenging as it can be easily confused with other temporal relations, as shown in Ex 3. This challenge is compounded with NONE in our end-to-end extraction task. Type 1 and Type 2 errors suggest that building a stronger event detection module will be helpful for both event and temporal relation extraction tasks. To improve the performances on VAGUE as well as the minority relations such as INCLUDES and SIMULTANEOUS, we could better leverage external knowledge in our general neuro-symbolic reasoning framework to improve the relation models. As we will discuss in the next section, we propose to incorporate distributional constraints to achieve this goal. 29 3.2 EventTemporalRelationExtractionwithDistributionalConstraints In this section, we present our work Han et al. (2020) that further advances the end-to-end event temporal relation extraction task introduced in Sec. 3.1. We address some shortcomings in Han et al. (2019c) using a novel neuro-symbolic network with distributional constraints and apply our model in both news and clinical domains. 3.2.1 MotivationsandContributions Figure 3.3: Another example of the event temporal ordering task. Solid lines / arrows between two high- lighted events show their gold temporal relations, e.g. say BEFORE stop and buildup INCLUDES say, and the dash line shows a wrong prediction, i.e., the VAGUE relation between buildup and say. In the table, Column Overall shows the relation distribution over the entire training corpus; Column Type Pair (P) shows the predicted relation distribution condition on the event pairs having types occurrence and reporting (such asbuildup andsay); Column Type Pair (G) shows the gold relation distribution condi- tion on event pairs having the same types. Biased predictions of VAGUE relation between buildup and say can be partially corrected by using the gold event type-relation statistics in Column Type Pair (G). Predicting event temporal relations is inherently challenging as it requires the system to understand each event’s beginning and end times. However, these time anchors are often hard to specify within a complicated context, even for humans. As a result, there is usually a large amount of VAGUE pairs (nearly 50% in the table of Figure 3.3) in an expert-annotated dataset, resulting in heavily class-imbalanced datasets. 
Moreover, expert annotations are often time-consuming to gather, so the sizes of existing datasets 30 are relatively small. To cope with the class-imbalance problem and the small dataset issues, recent research efforts adopt hard constraint-enhanced deep learning methods and leverage pre-trained language models (Han et al. 2019c; Ning et al. 2018d) and are able to establish reasonable baselines for the task. Thehard-constraints such as those described in Sec. 3.1.3 can only be constructed when they are nearly 100% correct and hence make the knowledge adoption restrictive. Temporal relation transitivity, as ahard constraint requires if A BEFORE B and B BEFORE C, it must be that A BEFORE C. However, constraints are usually not deterministic in real-world applications. For example, a clinical treatment and test are morelikely to happenAFTER a medicalproblem, but notalways. Such probabilistic constraints cannot be encoded with the hard-constraints as in the previous systems. Furthermore, deep neural models have biased predictions on dominant classes, which is particularly concerning given the small and biased datasets in event temporal extraction. For example, in Figure 3.3, an event pairheaded andsay (with relationINCLUDES) is incorrectly predicted asVAGUE (Column Type Pair (P)) by our baseline neural model, partially due to dominant percentage ofVAGUE label (Column Overall), and partially due to the complexity of the context. Using the domain knowledge thatheaded andsay have event types ofoccurrence andreporting, respectively, we can find a new label probability distribution (Type Pair (G)) for this pair. The probability mass allocated toVAGUE would decrease by 10% and increase by 7.2% for INCLUDES, which significantly increases the chance for a correct label prediction. We propose a novel neuro-symbolic network by incorporating domain knowledge such as corpus statistics in the model inference, and by solving the constrained inference problem using Lagrangian Re- laxation. This framework allows us to benefit from the strong contextual understanding of pre-trained language models while optimizing model outputs based on probabilistic structured knowledge that previ- ous deep models fail to consider. Experimental results demonstrate the effectiveness of this framework. 31 Figure 3.4: An overview of the proposed framework. The left column shows the end-to-end event temporal relation extraction workflow. The right column (in the dashed box) illustrates how we propose to enhance the end-to-end extraction system. The final MAP inference contains two components: scores from the relation module and distributional constraints constructed using domain knowledge and corpus statistics. The text input is a real example taken from theI2b2-Temporal dataset. The MAP inference is able to push the predicted probability of the event type-relation triplet closer to the ground-truth (corpus statistics). 3.2.2 InferencewithDistributionalConstraints In this section, we first briefly review the end-to-end event temporal relation extraction system, then show how to formulate domain-knowledge between event types and relations as distributional constraints in Integer Linear Programming (ILP), and finally apply Lagrangian Relaxation to solve the constrained inference problem. Our base model is trained end-to-end with cross-entropy loss and multitask learning to obtain relation scores. We need to perform an additional inference step in order to incorporatedomain- knowledge as distributional constraints. 
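The corpus statistics behind such distributional constraints (e.g., the type-pair conditional relation distributions shown in Figure 3.3) can be obtained with a few lines of counting, as in the sketch below; the input format of the gold annotations is an illustrative assumption.

```python
from collections import Counter, defaultdict

def type_pair_relation_stats(gold_pairs):
    """Relation distribution conditioned on ordered event-type pairs.

    gold_pairs: iterable of (type_i, type_j, relation) triples from training
                annotations, e.g. ("occurrence", "reporting", "VAGUE").
    Returns {(type_i, type_j): {relation: probability}}.
    """
    counts = defaultdict(Counter)
    for type_i, type_j, relation in gold_pairs:
        counts[(type_i, type_j)][relation] += 1
    stats = {}
    for pair, rel_counts in counts.items():
        total = sum(rel_counts.values())
        stats[pair] = {r: c / total for r, c in rel_counts.items()}
    return stats

# The prior probability p*_t of a triplet t = (P_m, P_n, r) is then simply
# stats[(P_m, P_n)].get(r, 0.0), e.g. stats[("occurrence", "reporting")].get("VAGUE", 0.0).
```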
3.2.2.1 End-to-end Event Relation Extraction

As illustrated in the left column of Figure 3.4, our end-to-end model shares a similar workflow with the pipeline model described in Sec. 3.1.4, where multi-task learning with a shared feature extractor is used to train the pipeline model. Recall that E, EE, and R denote the events, candidate event pairs, and feasible relations, respectively, in an input instance x_n, where n is the instance index. The combined training loss is L = c_E L_E + L_R, where L_E and L_R are the losses for the event extractor and the relation module, respectively, and c_E is a hyper-parameter balancing the two losses.

Feature Encoder. Input instances are first sent to pre-trained language models such as BERT (Devlin et al. 2019) and RoBERTa (Y. Liu et al. 2019), then to a Bi-LSTM layer as in Sec. 3.1.3. The encoded features are used as inputs to the event extractor and the relation module below.

Event Extractor. The event extractor first predicts scores over event classes for each input token and then detects event spans based on these scores. If an event spans more than one token, its beginning and ending vectors are concatenated as the final event representation. The event score is defined as the predicted probability distribution over event classes. Pairs predicted to include non-events are automatically labeled as NONE, whereas valid candidate event pairs are fed into the relation module to obtain their relation scores.

Relation Module. The relation module's input is a pair of events, which share the same encoded features as the event extractor. We simply concatenate them before feeding them into the relation module to produce relation scores S(y^r_{i,j}, x_n), which are computed using the Softmax function, where y^r_{i,j} is a binary indicator of whether an event pair (i,j) ∈ EE has relation r ∈ R.

3.2.2.2 Constrained Inference for Knowledge Incorporation

As shown in Figure 3.4, once the relation scores are computed by the relation module, a MAP inference is performed to incorporate distributional constraints so that the structured knowledge can be used to adjust the neural baseline model's scores and optimize the final model outputs. We formulate our MAP inference with distributional constraints as an LR problem and solve it with an iterative algorithm. Next, we explain the details of each component of our MAP inference.

Distributional constraints. Much of the domain knowledge required for real-world problems is probabilistic in nature. In the task of event relation extraction, domain knowledge can be the prior probability of a specific event pair's occurrence acquired from large corpora or knowledge bases (Ning et al. 2018b); domain knowledge can also be the distribution of event properties and relations obtained from corpus statistics, as we study in this work. Previous work mostly leverages hard constraints for inference (Han et al. 2019a,c; Leeuwenberg et al. 2017; Ning et al. 2017, 2018a; Yoshikawa et al. 2009), where constraints such as transitivity and event-relation consistency are assumed to be absolutely correct. As we discussed in Section 3.2.1, hard constraints are rigid and thus cannot be used to model probabilistic domain knowledge.

The right column of Figure 3.4 illustrates how our work leverages corpus statistics to construct distributional constraints. Let P be a set of event properties such as clinical types (e.g., treatment or problem).
For the pair (P_m, P_n) and the triplet (P_m, P_n, r), where P_m, P_n ∈ P and r ∈ R, we can retrieve their counts in the training corpus as

C(P_m, P_n, r) = \sum_{(i,j) \in EE} c(P_i = P_m; P_j = P_n; r_{i,j} = r)

and

C(P_m, P_n) = \sum_{(i,j) \in EE} c(P_i = P_m; P_j = P_n).

Let t = (P_m, P_n, r). The prior triplet probability can thus be defined as

p^*_t = C(P_m, P_n, r) / C(P_m, P_n).

Let \hat{p}_t denote the predicted triplet probability. Distributional constraints require that

p^*_t - θ ≤ \hat{p}_t ≤ p^*_t + θ    (3.3)

where θ is the tolerance margin between the prior and predicted probabilities.

ILP with Distributional Constraints. We formulate our MAP inference as an ILP problem. Let T be the set of triplets whose predicted probabilities need to satisfy Equation 3.3. We can define our full ILP as

L = \sum_{(i,j) \in EE} \sum_{r \in R} y^r_{i,j} S(y^r_{i,j}, x)    (3.4)

s.t.  p^*_t - θ ≤ \hat{p}_t ≤ p^*_t + θ, ∀t ∈ T,  and  y^r_{i,j} ∈ {0,1},  \sum_{r \in R} y^r_{i,j} = 1,

where S(y^r_{i,j}, x), ∀r ∈ R is the scoring function obtained from the relation module. For t = (P_m, P_n, r), we have

\hat{p}_t = \frac{\sum_{(i: P_m, j: P_n) \in EE} y^r_{i,j}}{\sum_{(i: P_m, j: P_n) \in EE} \sum_{r' \in R} y^{r'}_{i,j}}.

The output of the MAP inference, ŷ, is a collection of optimal label assignments for all relation candidates in an input instance x_n. \sum_{r \in R} y^r_{i,j} = 1 ensures that each event pair gets exactly one label assignment, and this is the only hard constraint we use.

To improve computational efficiency, we apply the heuristic of optimizing only the equality constraints p^*_t = \hat{p}_t, ∀t ∈ T. Our optimization algorithm terminates when |p^*_t - \hat{p}_t| ≤ θ. This heuristic has been shown to work efficiently without hurting inference performance (T. Meng et al. 2019). For each triplet t, its equality constraint can be rewritten as

F(t) = (1 - p^*_t) \sum_{(i: P_m, j: P_n) \in EE} y^r_{i,j} - p^*_t \sum_{(i: P_m, j: P_n) \in EE} \sum_{r' \in R, r' \neq r} y^{r'}_{i,j} = 0.    (3.5)

The goal is to maximize the objective function defined by Eq. (3.4) while satisfying the equality constraints.

Lagrangian Relaxation. Solving Eq. (3.4) is NP-hard. Thus, we reformulate it as a Lagrangian Relaxation problem by introducing a Lagrangian multiplier λ_t for each distributional constraint. Lagrangian Relaxation has been applied to a variety of NLP tasks, as described by Alexander M. Rush et al. (2011), Alexander M. Rush et al. (2012), and Zhao et al. (2017). The Lagrangian Relaxation problem can be written as

L(y, λ) = \sum_{(i,j) \in EE} \sum_{r \in R} y^r_{i,j} S(y^r_{i,j}, x) + \sum_{t \in T} λ_t F(t).    (3.6)

We initialize λ_t = 0. Eq. (3.6) can be solved with the following iterative procedure (Algorithm 1):

Algorithm 1: Gradient Ascent for LR
 1: procedure
 2:   for t ∈ T do
 3:     λ^0_t = 0
 4:   end for
 5:   k = 0
 6:   while k < K do                   ▷ K: max iterations
 7:     ŷ^{k+1} ← argmax L(λ^k)
 8:     for t ∈ T do
 9:       Δ_t = p^*_t - \hat{p}_t
10:       if |Δ_t| > θ then
11:         λ^{k+1}_t = λ^k_t + α Δ_t
12:       end if
13:     end for
14:     if |Δ_t| ≤ θ, ∀t then
15:       break
16:     end if
17:     k = k + 1
18:     α = γ α                        ▷ γ: decay rate
19:   end while
20: end procedure

1. At each iteration k, obtain the best relation assignments per MAP inference, ŷ^k = argmax L(y, λ).

2. Update the Lagrangian multipliers in order to bring the predicted probabilities closer to the priors. Specifically, for each t ∈ T,
   • if |p^*_t - \hat{p}_t| ≤ θ, then λ^{k+1}_t = λ^k_t;
   • otherwise, λ^{k+1}_t = λ^k_t + α (p^*_t - \hat{p}_t).

α is the step size. We are solving a min-max problem: the first step chooses the maximum-likelihood assignments with λ fixed; the second step searches for λ values that minimize the objective function.

3.2.3 Implementation Details

This section explains how we construct our distributional constraints and the implementation details for inference with LR.
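Before turning to constraint construction, Algorithm 1 can be summarized in code as below. This is a simplified sketch: map_solve stands in for the λ-augmented ILP of Eq. (3.6) and is assumed to return both the label assignment and the predicted triplet probabilities \hat{p}_t.

```python
def lagrangian_relaxation(map_solve, p_star, theta=0.05, alpha=0.5,
                          gamma=0.9, max_iter=20):
    """Gradient ascent on the Lagrangian multipliers (Algorithm 1).

    map_solve: callable(lambdas) -> (assignment, p_hat), returning the MAP
               assignment under the lambda-augmented objective of Eq. (3.6)
               and the resulting triplet probabilities p_hat.
    p_star:    {t: prior probability} from corpus statistics.
    theta: tolerance of Eq. (3.3); alpha: step size; gamma: decay rate.
    """
    lambdas = {t: 0.0 for t in p_star}
    assignment = None
    for _ in range(max_iter):
        # Step 1: best assignment with the multipliers fixed (MAP inference)
        assignment, p_hat = map_solve(lambdas)
        # Step 2: move multipliers toward satisfying |p*_t - p_hat_t| <= theta
        gaps = {t: p_star[t] - p_hat[t] for t in p_star}
        if all(abs(g) <= theta for g in gaps.values()):
            break
        for t, gap in gaps.items():
            if abs(gap) > theta:
                lambdas[t] += alpha * gap
        alpha *= gamma
    return assignment, lambdas
```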
3.2.3.1 Distributional Constraint Selection

The selection of distributional constraints is crucial for our algorithm. If the probability of an event-type and relation triplet is unstable across different splits of the data, we may over-correct the predicted probability. We use the following search procedure with heuristic rules to ensure constraint stability.

TimeBank-Dense. We first sort candidate constraints by their corresponding values of C(P_m, P_n) = \sum_{\hat{r} \in R} C(P_m, P_n, \hat{r}). We list the C(P_m, P_n) with the largest prediction counts and their percentages in the development set in Table 3.7. Next, we set 3% as our threshold for including constraints in our main experimental results. We found this number to work relatively well for both TimeBank-Dense and I2b2-Temporal. We will show the impact of relaxing this threshold in the discussion section. In Table 3.7, the constraints in the bottom block are filtered out.

Moreover, Eq. 3.5 implies that a constraint defined on one triplet (P_m, P_n, r) has an impact on all (P_m, P_n, r') for r' ∈ R \ r. In other words, decreasing \hat{p}_{(P_m, P_n, r)} is equivalent to increasing \hat{p}_{(P_m, P_n, r')} and vice versa. Thus, we heuristically pick (P_m, P_n, VAGUE) as our default constraint triplets. Finally, we adopt a greedy search rule to select the final set of constraints. We start with the top constraint triplet in Table 3.7 and then keep adding the next one as long as it does not hurt the grid-search⁴ F1 score on the development set. Eventually, four constraint triplets are selected; they can be found in Table 3.9.

⁴ Recall that our LR algorithm in Section 3.2.2 has three hyper-parameters: the initial step size α, the decay rate γ, and the tolerance θ. We perform a grid search on the development set and use the best hyper-parameters on the test set.

Constraint Triplets            Count      %
occurrence, occurrence, *       124     19.7
occurrence, reporting, *         50      7.9
occurrence, action, *            44      7.0
reporting, occurrence, *         41      6.5
action, occurrence, *            40      6.4
action, action, *                20      3.2
-----------------------------------------------
reporting, reporting, *          18      2.9
action, reporting, *             18      2.9
reporting, action, *             17      2.7

Table 3.7: TimeBank-Dense: triplet prediction counts and percentages in the development set (sample size = 629).

I2b2-Temporal. Similar to TimeBank-Dense, we use the 3% threshold to select candidate constraints. However, it is computationally expensive to apply the greedy search rule above via grid search, because the number of constraints that pass this threshold is large (15 of them), the development set is more than 3 times the size of TimeBank-Dense's, and a large transformer is used for modeling. Therefore, we incorporate two additional heuristic rules to select constraints directly:

1. We randomly split the training data into five subsets of equal size {s_1, s_2, s_3, s_4, s_5}. For triplet t to be selected, we must have (1/5) \sum_{k=1}^{5} |p_{t,s_k} - p^*_t| < 0.001.

2. |\hat{p}_t - p^*_t| > 0.1, where \hat{p}_t is the predicted probability of t on the development set.

The first rule ensures that a constraint triplet is stable over a random split of the data; the second ensures that the gap between the predicted and gold probabilities is large enough that we will not over-correct it. Eventually, four constraints satisfy these rules; they can be found in Table 3.12, and we run only one final grid search for these constraints.

3.2.3.2 Inference

The ILP component in Sec. 3.2.2 is implemented using an off-the-shelf solver provided by the Gurobi optimizer. Hyper-parameter choices can be found in Table B.1 in Appendix B.1.
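The two I2b2-Temporal selection rules in Sec. 3.2.3.1 can be applied mechanically once per-split triplet probabilities are available; the sketch below assumes those probabilities have already been computed (e.g., with counting code like the earlier sketch) and simply checks the stability and probability-gap thresholds.

```python
def select_constraints(p_splits, p_star, p_hat_dev,
                       stability_tol=0.001, gap_min=0.1):
    """Apply the two I2b2-Temporal heuristics to candidate triplets.

    p_splits:  {t: [p_{t,s_1}, ..., p_{t,s_5}]}, the probability of triplet t
               on each of the five random training splits.
    p_star:    {t: prior probability on the full training set}.
    p_hat_dev: {t: probability predicted by the baseline model on dev}.
    """
    selected = []
    for t, split_probs in p_splits.items():
        # Rule 1: the triplet probability must be stable across random splits
        stability = sum(abs(p - p_star[t]) for p in split_probs) / len(split_probs)
        # Rule 2: the predicted-vs-prior gap must be large enough to matter
        gap = abs(p_hat_dev[t] - p_star[t])
        if stability < stability_tol and gap > gap_min:
            selected.append(t)
    return selected
```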
38 3.2.4 ModelComparisons In this section, we first discuss our evaluation metrics and then the compared baselines. Evaluation Metrics. To be consistent with previous work, we adopt two different evaluation metrics. For TimeBank-Dense, we use standard micro-average scores that are also used in the baseline system (Han et al. 2019c). Since the end-to-end system can predict the gold pair asNONE, we follow the convention of IE tasks and exclude them from the evaluation. ForI2b2-Temporal, we adopt theTempEval evaluation metrics used in the 2012 i2b2 challenge. These evaluation metrics differ from the standard F 1 in a way that it computes the graph closure for both gold and predictions labels. SinceI2b2-Temporal contains roughly six times more missing annotations than the gold pairs, we only evaluate the performance of the gold pairs. Both datasets contain three types of entities: events, time expressions, and document time. In this work, we focus onevent-event relations and exclude all other relations from the evaluation. Feature-basedSystems. We use CAEVO 5 (Nathanael Chambers et al. 2014), a hybrid system of rules and linguistic feature-based MaxEnt classifier, as our feature-based benchmark for TimeBank-Dense. Model implementation and performance are both provided by Han et al. (2019c). As for I2b2-Temporal, we retrieve the predictions from the top end-to-end system provided by Yan et al. (2013) and report the per- formance according to the evaluation metrics specified above. NeuralModelBaselines. We use the end-to-end systems described in Sec. 3.1.4 as our neural network model benchmarks (Row 2 of Table 3.8). For TimeBank-Dense, the best global structured model’s per- formance is reported 3.1.5. For I2b2-Temporal, we re-implement the pipeline joint model. 6 Note that 5 https://www.usna.edu/Users/cs/nchamber/caevo/ 6 https://github.com/PlusLabNLP/JointEventTempRel 39 this end-to-end model only predicts whether each token is an event as well as each pair of token’s rela- tion. Event spans are not predicted, so head-tokens are used to represent events; event types are also not predicted. Therefore, we do not report Span F 1 and Type Accuracy in this benchmark. End-to-endBaseline. For the TimeBank-Dense dataset, we use the pipeline joint (local) model with no global constraints as shown in Sec. 3.1.4. In contrast to the aforementioned neural baseline provided in the same paper, this end-to-end model does not use any inference techniques. Hence, it serves as a fair baseline for our method (with inference). For TimeBank-Dense, we build our framework based on this model 7 . For the I2b2-Temporal dataset to be more comparable with the 2012 i2b2 challenge, we augment the event extractor illustrated in Figure 3.4 by allowing event type predictions; that is, for each input token, we not only predict whether it is an event or not, but also predict its event type. We follow the convention in the IE field by adding a “BIO” label to each token in the data. For example, the two tokens in “physical therapy” in Figure 3.4 will be labeled as B-treatment and I-treatment, respectively. To be consistent with the partial match method used in the 2012 i2b2 challenge, the event span detector looks for token predictions that start with either “B-” or “I-” and ensures that all tokens predicted within the same event span have only one event type. RoBERTa-large is used as the base model, and cross-entropy loss is used to train the model. 
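A minimal sketch of the span detection step just described: it scans token-level BIO predictions, opens a span at a B- tag or at an I- tag with no open span (matching the partial-match convention), and assigns a single event type per span. The majority-vote tie-breaking is an illustrative assumption rather than the exact rule used in our implementation.

```python
def decode_event_spans(bio_tags):
    """Group token-level BIO predictions into typed event spans.

    bio_tags: list like ["O", "B-treatment", "I-treatment", "O", "I-problem"].
    Returns a list of (start_index, end_index_inclusive, event_type) tuples.
    """
    def close(span):
        start, end, types = span
        # one event type per span: keep the most frequent predicted type
        return (start, end, max(set(types), key=types.count))

    spans, current = [], None          # current = [start, end, [types]]
    for idx, tag in enumerate(bio_tags):
        if tag.startswith(("B-", "I-")):
            etype = tag[2:]
            if current is None:        # an I- tag can also open a span (partial match)
                current = [idx, idx, [etype]]
            elif tag.startswith("B-"): # a B- tag closes the previous span
                spans.append(close(current))
                current = [idx, idx, [etype]]
            else:                      # extend the open span on an I- tag
                current[1] = idx
                current[2].append(etype)
        elif current is not None:      # an O tag closes any open span
            spans.append(close(current))
            current = None
    if current is not None:
        spans.append(close(current))
    return spans

# Example: decode_event_spans(["O", "B-treatment", "I-treatment", "O", "I-problem"])
# -> [(1, 2, "treatment"), (4, 4, "problem")]
```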
We fine-tune the base model and conduct a grid search on a random hold-out set to pick the best hyper-parameters, such as c_E in the multitask learning loss and the weight w_Epos for the positive event types (i.e., B- and I-). The best hyper-parameter choices can be found in Table B.1 in the Appendix.

3.2.5 Results and Analysis

Table 3.8 contains our main results. We discuss model performance on TimeBank-Dense and I2b2-Temporal in this section.

⁷ Code and data for TimeBank-Dense are published here: https://github.com/rujunhan/EMNLP-2020

                              TimeBank-Dense                2012 i2b2 Challenge (I2b2-Temporal)
                          Event      Relation            Event                  Relation (TempEval Metrics)
                           F1      R     P     F1    Span F1  Type Accuracy       R      P      F1
Feature-based Benchmark   87.4   43.8  35.7  39.4     90.1       86.0           37.8   51.8   43.0
Han et al. (2019c)        90.9   52.6  46.5  49.4      -          -             73.4   76.3   74.8
End-to-end Baseline       90.3   51.5  45.9  48.5     87.8       87.8           73.3   79.9   76.5
End-to-end + Inference    90.3   53.4  47.9  50.5     87.8       87.8           74.0   80.8   77.3

Table 3.8: Overall experiment results: per McNemar's test, the improvements over the end-to-end baseline models from adding inference with distributional constraints are statistically significant for both TimeBank-Dense (p-value < 0.005) and I2b2-Temporal (p-value < 0.0005). For I2b2-Temporal, our end-to-end system is optimized for the F1 score of the gold pairs.

TimeBank-Dense. All neural models outperform the feature-based system by more than 10% in relation F1 score. Our structured model outperforms the previous SOTA system with hard constraints and joint event and relation training by 1.1%. Compared with the end-to-end baseline model with no constraints, our system achieves a 2% absolute improvement, which is statistically significant with a p-value < 0.005 per McNemar's test. This is strong evidence that leveraging Lagrangian Relaxation to incorporate domain knowledge can be extremely beneficial even for strong neural network models.

The ablation study in Table 3.9 shows how the distributional constraints work and their individual contributions. The predicted probability gaps shrink by 0.15, 0.24, and 0.13, respectively, for the three constraints chosen, providing 0.91%, 0.65%, and 0.44% improvements to the final F1 score for relation extraction. We also show the breakdown of the performance for each relation class in Table 3.10. The overall F1 improvement is mainly driven by the recall scores of the positive relation classes (BEFORE, AFTER, and INCLUDES), which have much smaller sample sizes than VAGUE. These results are consistent with the ablation study in Table 3.9, where the end-to-end baseline model over-predicts VAGUE, and the LR algorithm corrects this by reassigning less confident VAGUE predictions to positive and minority classes according to their relation scores.

Constraint Triplets             Prob. Gap     F1
occur., occur., VAGUE             -0.15     +0.91%
occur., reporting, VAGUE          -0.24     +0.65%
action, occur., VAGUE             -0.13     +0.44%
reporting, occur., VAGUE*          0.0       0%
Combined F1 Improvement                      2.0%

Table 3.9: TimeBank-Dense ablation study: gap shrinkage of predicted probability and F1 contribution per constraint. * is selected per Sec. 3.2.3, but its probability gap is smaller than the tolerance on the test set, hence it has no impact on the F1 score.
                End-to-end Baseline     End-to-end Inference
                 P     R     F1          P     R     F1
BEFORE          59.0  46.9  52.3        58.6  55.7  57.1
AFTER           69.3  45.3  54.8        67.8  51.5  58.5
INCLUDES         -     -     -           8.3   1.8   2.9
IS_INCLUDED      -     -     -            -     -     -
SIMULTANEOUS     -     -     -            -     -     -
VAGUE           45.1  55.0  49.5        47.6  51.4  49.4
Average         51.5  45.9  48.5        53.4  47.9  50.5

Table 3.10: Model performance breakdown for TimeBank-Dense. "-" indicates no predictions were made for that particular label, probably due to the small size of the training sample.

I2b2-Temporal. All neural models outperform the feature-based system by more than 30% in relation F1 score. Our structured model with distributional constraints outperforms the neural pipeline joint model of Han et al. (2019c) by 2.5% on an absolute scale. Compared with our end-to-end baseline model, our system achieves a 0.77% absolute improvement in F1, which is statistically significant with a p-value < 0.0005 per McNemar's test. This result also shows that adding inference with distributional constraints can be helpful for strong neural baseline models.

Table 3.12 shows how the distributional constraints work and their individual contributions. Predicted probability gaps shrink by 0.17, 0.16, 0.11, and 0.14, respectively, for the four constraints chosen, providing 0.19%, 0.25%, 0.22%, and 0.12% improvements to the final F1 score for relation extraction. We also show the breakdown of performance for each relation class in Table 3.11. The performance gain comes mostly from the increase in recall for BEFORE and AFTER. This is consistent with the results in Table 3.12, where the model over-predicts the OVERLAP class, possibly because of label imbalance. Inference is able to partially correct this mistake by leveraging distributional constraints constructed with event type and relation corpus statistics.

                End-to-end Baseline     End-to-end Inference
                 P     R     F1          P     R     F1
BEFORE          82.1  60.6  69.7        80.9  65.3  72.2
AFTER           69.9  59.9  64.5        67.8  62.8  65.2
OVERLAP         81.3  81.5  81.4        83.6  80.2  81.9
TempEval        73.3  79.9  76.5        74.0  80.8  77.3

Table 3.11: Model performance breakdown for I2b2-Temporal.

Constraint Triplets              Prob. Gap     F1
occur., problem, OVERLAP           -0.17     +0.19%
occur., treatment, OVERLAP         -0.16     +0.24%
treatment, occur., OVERLAP         -0.11     +0.22%
treatment, problem, OVERLAP        -0.14     +0.12%
Combined F1 Improvement                       0.77%

Table 3.12: I2b2-Temporal ablation study: gap shrinkage of predicted probability and F1 contribution per constraint.

Qualitative Error Analysis. We can use the errors made by our structured neural model on TimeBank-Dense to guide potential directions for future research. There are 26 errors made by the structured model that are correctly predicted by the baseline model. In Table 3.13, we show the error breakdown by constraint. Our method works by leveraging corpus statistics to correct borderline errors made by the baseline model; however, when the baseline model makes borderline correct predictions, the inference could mistakenly change them to the wrong labels. This situation can happen when the context is complicated or when the event time interval is confusing.

For the constraint (occur., occur., VAGUE), nearly all errors are cross-sentence event pairs with long contexts. In ex. 1, the gold relation between responded and use is VAGUE because of the negation of use, but one could also argue that if use were to happen, responded is BEFORE use. This inherent annotation confusion can cause the baseline model to predict VAGUE marginally over BEFORE.
When informed by the constraint statistics that vague is over-predicted, the inference algorithm revises 43 occurrence,occurrence, VAGUE (57.7%) ex.1 In a bit of television diplomacy, Iraq’s deputy foreign minister responded from Baghdad in less than one hour, saying Washington would break international law by attacking without UN approval. The United States is not authorized to use force before going to the council. occurrence,reporting, VAGUE (26.9%) ex.2 A new Essex County task force began delving Thursday into the slayings of 14 black women over the last five years in the Newark area, as law-enforcement officials acknowledged that they needed to work harder... action,occurrence, VAGUE (15.4%) ex.3 The Russian leadership has staunchly opposed the western alliance’s expansion into Eastern Europe. Table 3.13: Error examples and breakdown by constraints. the baseline prediction to BEFORE. Similarly, inex.2 andex.3, one could make strong cases that both the relations between delving and acknowledged, and opposed and expansion are BEFORE rather than VAGUE from the context. This annotation ambiguity can contribute to the errors made by the proposed method. Our analysis shows that besides the necessity to create high-quality data for event temporal relation extraction, it could be useful to incorporate additional information such as discourse relation (particularly for (occur.,occur., VAGUE)) and other prior knowledge on event properties to resolve the ambiguity in event temporal reasoning. ConstraintSelection. In Sec. 3.2.3, we use a 3% threshold when selecting candidate constraints. In this section, we show the impact of relaxing this threshold on TimeBank-Dense. Table 3.7 shows three con- straints that miss the 3% bar by 0.1-0.3%. In Figure 3.5, we show F 1 scores on the development and test sets by including these constraints. Recall that only constraints that do not hurt development F 1 score are used. Therefore, Top5 and Top6 on the chart both correspond to the results in Table 3.8. Top7 in- cludes (reporting,reporting,VAGUE),Top8 includes (action,reporting,VAGUE), andTop9 includes (reporting,action, VAGUE). 44 Figure 3.5: Dev v.s. Test sets performance (F 1 score) after relaxing the threshold of triplet count for select- ing constraints. All numbers are percentages. We observe that F 1 score continues to improve over the development set, but on the test set, F 1 score eventually falls. This appears to support our hypothesis that when the triplet count is small, the ratio calculated based on that count is not so reliable as the ratio could vary drastically between development and test sets. Optimizing over the development set can be an over-correction for the test set, and hence results in a performance drop. 3.2.6 RelatedWork We’ve thoroughly reviewed event temporal relation extraction work in the news domain in Sec.3.1.2. In this section, we review temporal relation extraction work in the clinical domain. The 2012 i2b2 Challenge (Sun et al. 2013) is one of the earliest efforts to advance event temporal relation extraction of clinical data. The challenge hosted three tasks on event (and event property) classification, temporal relation extraction, and the end-to-end track. Following this early effort, a series of clinical event temporal relation challenges were created in the following years (Bethard et al. 2015, 2016, 2017). However, data in these challenges are relatively hard to acquire, and therefore they are not used in this paper. 
As in the news data, traditional machine learning approaches (Chikka 2016; Lee et al. 2016; Savova et al. 2010; Tang et al. 2013; Y. Xu et al. 2013) that tackle the end-to-end event and temporal relation extraction problem require time-consuming feature engineering such as collecting lexical and syntax features. Some recent 45 work (Dligach et al. 2017; Galvan et al. 2018; Leeuwenberg et al. 2017) apply neural network-based methods to model the temporal relations, but are not capable of incorporating prior knowledge about clinical events and temporal relations as proposed by our framework. 46 3.3 ECONET:AdaptingPretrainedLanguageModelsforEventTemporal ReasoningTasks In the previous two sections (Sec. 3.1 and 3.2), we propose two neuro-symbolic networks that can incor- porate explicit world knowledge as either hard or distributional constraints. In this section, we present ECONET (Han et al. 2021b) a novel pretraining (self-supervised learning) framework that implicitly infuses knowledge for event temporal reasoning in language models. 3.3.1 MotivationsandContributions Figure 3.6: Top: an example illustrating the difference between ERE and QA / MRC samples of event temporal reasoning. Bottom: our targeted masking strategy forECONET v.s. random masking in PTLMs. Recent approaches leveraging large pre-trained language models (PTLMs) achieved state-of-the-art re- sults on a range of event temporal reasoning tasks (Han et al. 2019b; Ning et al. 2020b; Pereira et al. 2020; Wang et al. 2020; Y. Zhou et al. 2021). Despite the progress, vanilla PTLMs do not focus on capturingevent temporalknowledge that can be used to infer event relations. For example, in Figure 3.6, an annotator of the QA sample can easily infer from the temporal indicator “following” that “transfer” happensBEFORE 47 “preparing the paperwork”, but a fine-tuned RoBERTa model predicts that “ transfer” has no such relation with the event “preparing the paperwork.” Plenty of such cases exist in our error analysis on PTLMs for event temporal relation-related tasks. We hypothesize that such deficiency is caused by original PTLMs’ random masks in the pre-training where temporal indicators and event triggers are under-weighted and hence not attended well enough for our downstream tasks. TacoLM (B. Zhou et al. 2020) explored the idea of targeted masking and predicting textual cues of event frequency, duration and typical time, which showed improvements over vanilla PTLMs on related tasks. However, event frequency, duration and time do not directly help machines understand pairwise event temporal relations. Moreover, the mask predic- tion loss of TacoLM leverages a soft cross-entropy objective, which is manually calibrated with external knowledge and could inadvertently introduce noise in the continual pre-training. We propose ECONET, a Effective CONtinual pre-training framework for Event Temporal reasoning that combines mask prediction and contrastive loss using our masked samples. Our targeted masking strategy focuses only on event triggers and temporal indicators as shown in Figure 3.6. This design as- sists models to concentrate on events and temporal cues, and potentially strengthen models’ ability to understand event temporal relations better in the downstream tasks. 
We further pre-train PTLMs with the following objectives jointly: the mask prediction objective trains a generator that recovers the masked temporal indicators or events, and the contrastive loss trains a discriminator that shares the representa- tions with the generator and determines whether a predicted masked token is corrupted or original (Clark et al. 2020). Our experiments demonstrate that ECONET is effective at improving the original PTLMs’ performances on event temporal reasoning. We briefly summarize our contributions. 1) We propose ECONET, a novel continual pre-training frame- work that integrates targeted masking and contrastive loss for event temporal reasoning. 2) Our training objectives effectively learn from the targeted masked samples and inject richer event temporal knowl- edge in PTLMs, which leads to stronger fine-tuning performances over five widely used event temporal 48 Figure 3.7: The proposed generator-discriminator (ECONET) architecture for event temporal reasoning. The upper block is the mask prediction task for temporal indicators and the bottom block is the mask prediction task for events. Both generators and the discriminator share the same representations. commonsense tasks. In most target tasks, ECONET achieves SOTA results in comparison with existing methods. 3) Compared with full-scale pre-training, ECONET requires a much smaller amount of train- ing data and can cope with various PTLMs such as BERT and RoBERTa. 4) In-depth analysis shows that ECONET successfully transfers knowledge in terms of textual cues of event triggers and relations into the target tasks, particularly under low-resource settings. 3.3.2 Method: TargetedMaskingandPretraining Our proposed method aims at addressing the issue in vanilla PTLMs that event triggers and temporal indicators are not adequately attended for our downstream event reasoning tasks. To achieve this goal, we propose to replace the random masking in PTLMs with a targeted masking strategy designed specifically for event triggers and temporal indicators. We also propose a continual pre-training method with mask prediction and contrastive loss that allows models to effectively learn from the targeted masked samples. The benefits of our method are manifested by stronger fine-tuning performances over downstream ERE and MRC tasks. 49 Category Words before, until, previous to, prior to, [before] preceding, followed by [after] after, following, since, now that soon after, once ∗∗ during, while, when, at the time, [during] at the same time, meanwhile [past] earlier, previously, formerly, yesterday, in the past, last time consequently, subsequently, in turn, [future] henceforth, later, then [beginning] initially, originally, at the beginning to begin, starting with, to start with [ending] finally, in the end, at last, lastly Table 3.14: The full list of the temporal lexicon. Categories are created based on authors’ domain knowl- edge and best judgment. ∗∗ ‘once’ can be also placed into [past] category due to its second meaning of ‘previously’, which we exclude to keep words unique. Our overall approach ECONET consists of three components. 1) Creating targeted self-supervised training data by masking out temporal indicators and event triggers in the input texts; 2) leveraging mask prediction and contrastive loss to continually train PTLMs, which produces an event temporal knowl- edge aware language model; 3) fine-tuning the enhanced language model on downstream ERE and MRC datasets. 
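To make component 1 concrete, the snippet below sketches how temporal indicators from the lexicon in Table 3.14 could be matched in raw text and replaced with mask tokens; the lexicon subset, the mask_temporal_indicators helper, and the example passage are illustrative placeholders rather than the exact preprocessing code behind ECONET. Event masks are produced analogously, with trigger positions supplied by an event detection model instead of lexicon matches.

```python
import re

# A small subset of the temporal lexicon in Table 3.14 (illustrative, not the full 40-word list).
TEMPORAL_LEXICON = {
    "[before]": ["before", "until", "previous to", "prior to", "preceding", "followed by"],
    "[after]": ["after", "following", "since", "now that", "soon after", "once"],
    "[during]": ["during", "while", "when", "at the time", "at the same time", "meanwhile"],
}

# Longest-match-first so multi-word indicators ("prior to") win over their single-word prefixes.
ALL_INDICATORS = sorted(
    (w for words in TEMPORAL_LEXICON.values() for w in words), key=len, reverse=True
)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in ALL_INDICATORS) + r")\b", re.IGNORECASE
)

def mask_temporal_indicators(passage: str, mask_token: str = "<mask>"):
    """Replace matched temporal indicators with the mask token and return the gold labels."""
    labels = [m.group(0).lower() for m in PATTERN.finditer(passage)]
    masked = PATTERN.sub(mask_token, passage)
    return masked, labels

masked, labels = mask_temporal_indicators(
    "The company resumed production following the transfer of ownership."
)
# masked -> "The company resumed production <mask> the transfer of ownership."
# labels -> ["following"]
```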
We will discuss each of these components in the following subsections. 3.3.2.1 TargetedMasksCreation TemporalMasks. We first compile a lexicon of 40 common temporal indicators listed in the Table 3.14 based on previous error analysis and expert knowledge in the target tasks. Those indicators in the [be- fore], [after] and [during] categories can be used to represent the most common temporal relations between events. The associated words in each of these categories are synonyms of each other. Tempo- ral indicators in the [past], [future], [beginning] and [ending] categories probably do not represent pairwise event relations directly, but predicting these masked tokens may still be helpful for models to understand time anchors and hence facilitates temporal reasoning. 50 With the temporal lexicon, we conduct string matches over the 20-year’s New York Times news articles 8 and obtain over 10 million 1-2 sentence passages that contain at least 1 temporal indicators. Finally, we replace each of the matched temporal indicators with a⟨mask⟩ token. The upper block in Figure 3.7 shows two examples where “following” and “after” are masked from the original texts. EventMasks. We build highly accurate event detection models (Han et al. 2019c; S. Zhang et al. 2021) to automatically label event trigger words in the 10 million passages mentioned above. Similarly, we replace these events with ⟨mask⟩ tokens. The bottom block in Figure 3.7 shows two examples where events “transfer” and “resumed” are masked from the original texts. 3.3.2.2 GeneratorforMaskPredictions Recall from Sec. 2.2 that PTLMs such as BERT and RoBERTa are trained to recover randomly masked tokens. Let x = [x 1 ,...,x n ] be a sequence of input tokens and x m t ∈ x m represents random masked tokens. The per-sample pre-training objective is to predict identity (x t ) ofx m t with a cross-entropy loss, L MLM =− X x m t ∈x m I[x m t =x t ]log(p(x m t |x)) (3.7) To learn effectively from the targeted samples, we modify Eq. 3.7 by training two generators with shared representations to recover temporal and event masks. Temporal Generator. The per-sample temporal mask prediction objective is computed using cross- entropy loss, L T =− X x T t ∈x T I[x T t =x t ]log(p(x T t |x)) (3.8) 8 NYT news articles are public from 1987-2007. 51 wherep(x T t |x)=Softmax(f T (h G (x) t )) andx T t ∈x T is a masked temporal indicator.h G (x) isx’s encoded representation using a transformer andf T is a linear layer module that maps the masked token representation into label spaceT consisting of the 40 temporal indicators. EventGenerator. The per-sample event mask prediction objective is also computed using cross-entropy loss, L E =− X x E t ∈x E I[x E t =x t ]log(p(x E t |x)) (3.9) where p(x E t |x) = Softmax(f E (h G (x) t )) and x E t ∈ x E are masked events. h G (x) is the shared trans- former encoder as in the temporal generator andf E is a linear layer module that maps the masked token representation into label spaceE which is a set of all event triggers in the data. 3.3.2.3 DiscriminatorforContrastiveLearning We incorporate a discriminator that provides additional feedback on mask predictions, which helps correct errors made by the generators. Contrastive Loss. For a masked tokenx t , we design a discriminator to predict whether the recovered token by the mask prediction is original or corrupted. 
As shown in Figure 3.7, “following” and “resumed” are predicted correctly, so they are labeled asoriginal whereas “during” and “run” are incorrectly predicted and labeled ascorrupted. We train the discriminator with a contrastive loss, L D =− X xt∈M ylog(D(x t |x))+(1− y)log(1− D(x t |x)) where M = x E ∪x T and D(x t |x) = Sigmoid(f D (h D (x) t )) and y is a binary indicator of whether a mask prediction is correct or not. h D shares the same transformer encoder withh G . 52 PerturbedSamples. Our mask predictions focus on temporal and event tokens, which are easier tasks than the original mask predictions in PTLMs. This could make the contrastive loss not so powerful as training a good discriminator requires relatively balanced original and corrupted samples. To deal with this issue, forr% of the generator’s output, instead of using the recovered tokens, we replace them with a token randomly sampled from either the temporal lexicon or the event vocabulary. We fix r =50 to make original and corrupted samples nearly balanced. 3.3.2.4 JointTraining To optimize the combining impact of all components in our model, the final training loss calculates the weighted sum of each individual loss, L = L T + α L E + β L D , where α and β are hyper-parameters that balance different training objectives. The temporal and event masked samples are assigned a unique identifier (1 for temporal, 0 for event) so that the model knows which linear layers to feed the output of transformer into. Our overall generator-discriminator architecture resembles ELECTRA (Clark et al. 2020). However, our proposed method differs from this work in 1) we use targeted masking strategy as opposed to random masks; 2) both temporal and event generators and the discriminator, i.e. h G andh D share the hidden representations, but we allow task-specific final linear layers f T , f E and f D ; 3) we do not train from scratch and instead continuing to train transformer parameters provided by PTLMs. 3.3.2.5 Fine-tuningonTargetTasks After training with ECONET, we fine-tune the updated MLM on the downstream tasks. ERE samples can be denoted as[P,e i ,e j ,r i,j ], where P is the passage and(e i ,e j ) is a pair of event trigger tokens in P. As Figure 3.8a shows, we feed(P,e i ,e j ) into an MLM (trained withECONET). Following the setup of Han et al. (2019a) and S. Zhang et al. (2021), we concatenate the final event representations v i ,v j associated with 53 (a) ERE (b) QA:TORQUE (c) QA:McTaco Figure 3.8: Target ERE and QA task illustrations. (e i ,e j ) to predict temporal relationr i,j . The relation classifier is implemented by a multi-layer perceptron (MLP). MRC/QA samples can be denoted as[P,Q,A], where Q represents a question and A denotes answers. Figure 3.8b illustrates an extractive QA task where we feed the concatenated [P,Q] into an MLM. Each tokenx i ∈ P has a label with 1 indicatingx i ∈ A and 0 otherwise. The token classifier implemented by MLP predicts labels for allx i . Figure 3.8c illustrates another QA task where A is a candidate answer for the question. We feed the concatenated[P,Q,A] into an MLM and the binary classifier predicts a 0/1 label of whether A is a true statement for a given question. 3.3.3 ImplementationDetails In this section, we describe details of implementingECONET, datasets and evaluation metrics, and discuss compared methods reported in Section 3.3.5. 54 Event Detection Model. As mentioned briefly in Section 3.3.2, we train a highly accurate event pre- diction model to mask event (triggers). 
We experimented with two models using event annotations in TORQUE (Ning et al. 2020b) and TimeBank-Dense (Nathanael Chambers et al. 2014). These two event annotations both follow previous event-centric reasoning research by using a trigger word (often a verb or an noun that most clearly describes the event’s occurrence) to represent an event (Glavaš et al. 2014; O’Gorman et al. 2016; UzZaman et al. 2013). In both cases, we fine-tune RoBERTa LARGE on thetrainset and select models based on the performance on thedevset. The primary results shown in Table 3.15 uses TORQUE’s annotations, but we conduct additional analysis in Section 3.3.5 to show both models produce comparable results. Continual Pretraining. We randomly selected only 200K out of 10 million samples to speed up our experiments and found the results can be as good as using a lot more data. We used half of these 200K samples for temporal masked samples and the other half for the event masked samples. We ensure none of these sample passages overlap with the target test data. To keep the mask tokens balanced in the two training samples, we masked only 1 temporal indicator or 1 event (closest to the temporal indicator). We continued to train BERT and RoBERTa up to 250K steps with a batch size of 8. The training process takes 25 hours on a single GeForce RTX 2080 GPU with 11G memory. Note that our method requires much fewer samples and is more computation efficient than the full-scale pre-training of language models, which typically requires multiple days of training on multiple large GPUs / TPUs. For the generator only models reported in Table 3.15, we excluded the contrastive loss, trained models with a batch size of 16 to fully utilize GPU memories. We leveraged the dev set of TORQUE to find the best hyper-parameters. Fine-tuning. Dev set performances were used for early-stop and averagedevperformances over three randoms seeds were used to pick the best hyper-parameters. Note that test set for the target tasks were 55 never observed in any of the training process and their performances are reported in Table 3.15. All hyper- parameter search ranges can be found in Appendix C. 3.3.4 ModelComparisons EvaluationMetrics. Three metrics are used to evaluate the fine-tuning performances. • F 1 : forTORQUE andMcTaco, we follow the data papers (Ning et al. 2020b) and (B. Zhou et al. 2019) to report macro average of each question’s F 1 score. For TimeBank-Dense, Matres and RED, we report standard micro-averageF 1 scores to be consistent with the baselines. • Exact-match(EM): for both MRC datasets, EM = 1 if answer predictions match perfectly with gold annotations; otherwise, EM = 0. • EM-consistency (C): in TORQUE, some questions can be clustered into the same group due to the data collection process. This metric reports the average EM score for a group as opposed to a question in the original EM metrics. ComparedModels. We compare several pre-training methods with ECONET. • RoBERTa LARGE is the original PTLM and we fine-tune it directly on target tasks. • RoBERTa LARGE +ECONET is our proposed continual pre-training method. • RoBERTa LARGE +Generator only uses the generator component in continual pre-training. • RoBERTa LARGE +randommask keeps the original PTLMs’ objectives and replaces the targeted masks in ECONET with randomly masked tokens. Models’ names for continual pre-training BERT LARGE can be derived by replacing RoBERTa LARGE with BERT LARGE . 56 We also fine-tune pre-trained TacoLM on target datasets. 
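To ground the evaluation metrics listed above, the sketch below shows one way the per-question F1, exact-match, and EM-consistency scores could be computed for a QA-style dataset such as TORQUE. The 'pred'/'gold'/'group' field names are assumptions made for illustration, and the consistency rule (a question group counts only if every question in it is exactly matched) is one plausible reading of the definition above, not the official evaluation script.

```python
from collections import defaultdict

def question_f1(pred: set, gold: set) -> float:
    """F1 between the predicted and gold answer sets of a single question."""
    if not pred and not gold:
        return 1.0
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def evaluate(examples):
    """examples: list of dicts with 'pred'/'gold' answer sets and a 'group' id (assumed field names)."""
    f1s, ems, groups = [], [], defaultdict(list)
    for ex in examples:
        em = float(ex["pred"] == ex["gold"])
        f1s.append(question_f1(ex["pred"], ex["gold"]))
        ems.append(em)
        groups[ex["group"]].append(em)
    macro_f1 = sum(f1s) / len(f1s)
    exact_match = sum(ems) / len(ems)
    # EM-consistency: a question group is counted only when every question in it is an exact match.
    consistency = sum(all(g) for g in groups.values()) / len(groups)
    return macro_f1, exact_match, consistency
```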
The current SOTA systems we compare with are provided by Ning et al. (2020b), Pereira et al. (2020), S. Zhang et al. (2021) and Han et al. (2019b). More details are presented in Sec. 3.3.5.1. 3.3.5 ResultsandAnalysis TORQUE McTaco TimeBank-Dense Matres RED Methods F 1 EM C F 1 EM F 1 F 1 F 1 TacoLM 65.4(± 0.8) 37.1(± 1.0) 21.0(± 0.8) 69.3(± 0.6) 40.5(± 0.5) 64.8(± 0.7) 70.9(± 0.3) 40.3(± 1.7) BERT LARGE 70.6(± 1.2) 43.7(± 1.6) 27.5(± 1.2) 70.3(± 0.9) 43.2(± 0.6) 62.8(± 1.4) 70.5(± 0.9) 39.4(± 0.6) +ECONET 71.4(± 0.7) 44.8(± 0.4) 28.5(± 0.5) 69.2(± 0.9) 42.3(± 0.5) 63.0(± 0.6) 70.4(± 0.9) 40.2(± 0.8) RoBERTa LARGE 75.1(± 0.4) 49.6(± 0.5) 35.3(± 0.8) 75.5(± 1.0) 50.4(± 0.9) 62.8(± 0.3) 78.3(± 0.5) 39.4(± 0.4) +Generator 75.8(± 0.4) 51.2(± 1.1) 35.8(± 0.9) 75.1(± 1.4) 50.2(± 1.2) 65.2(± 0.6) 77.0(± 0.9) 41.0(± 0.6) +ECONET 76.1(± 0.2) 51.6(± 0.4) 36.8(± 0.2) 76.3(± 0.3) 52.8(± 1.9) 64.8(± 1.4) 78.8(± 0.6) 42.8(± 0.7) ECONET (best) 76.3 52.0 37.0 76.8 54.7 66.8 79.3 43.8 Current SOTA 75.2 ∗ 51.1 34.5 79.5 † 56.5 66.7 †† 80.3 ‡ 34.0 ‡‡ Table 3.15: Overall experimental results. Refer to Section 3.3.2 for naming conventions. The SOTA per- formances for TORQUE ∗ are provided by Ning et al. (2020b) and the numbers are average over 3 random seeds. The SOTA performances forMcTaco † are provided by Pereira et al. (2020); TimeBank-Dense †† and Matres ‡ by S. Zhang et al. (2021) andRED ‡‡ by Han et al. (2019b). † , †† , ‡ and ‡‡ only report the best single model results, and to make fair comparisons with these baselines, we report both average and best single model performances. TacoLM baseline uses the provided and recommended checkpoint for extrinsic eval- uations. 3.3.5.1 ComparisonswithExistingSystems TORQUE. The current SOTA system reported in Ning et al. (2020b) fine-tunes RoBERTa LARGE and our own fine-tuned RoBERTa LARGE achieves on-parF 1 , EM and C scores. The gains of RoBERTa LARGE + ECONET against the current SOTA performances are 0.9%, 0.5% and 2.3% perF 1 , EM andC metrics. McTaco. The current SOTA system ALICE (Pereira et al. 2020) also uses RoBERTa LARGE as the text encoder, but leverages adversarial attacks on input samples. ALICE achieves 79.5% and 56.5% perF 1 and EM metrics on the test set for the best single model, and the best performances for RoBERTa LARGE + ECONET are 76.8% and 54.7% per F 1 and EM scores, which do not outperform ALICE. This gap can be 57 caused by the fact that the majority of samples in McTaco reason about event frequency, duration and time, which are not directly related to event temporal relations. TimeBank-Dense + Matres. The most recent SOTA system reported in S. Zhang et al. (2021) uses both BERT LARGE and RoBERTa LARGE as text encoders, but leverages syntactic parsers to build large graphical attention networks on top of PTLMs. RoBERTa LARGE + ECONET’s fine-tuning performances are essentially on-par with this work without additional parameters. For TimeBank-Dense, our best model outperforms S. Zhang et al. (ibid.) by 0.1% while for Matres, our best model underperforms by 1.0% per F 1 scores. RED. The current SOTA system reported incontextualTemporal-han uses BERT BASE as word repre- sentations (no finetuning) and BiLSTM as feature extractor. The single best model achieves 34.0% F 1 score and RoBERTa LARGE + ECONET is 9.8% higher than the baseline. 3.3.5.2 TheImpactofECONET OverallImpact. 
ECONET in general works better than the original RoBERTa LARGE across 5 different datasets, and the improvements are more salient in TORQUE with 1.0%, 2.0% and 1.5% gains perF 1 , EM and C scores, in McTaco with 2.4% lift over the EM score, and in TimeBank-Dense and RED with 2.0% and 3.4% improvements respectively overF 1 scores. We observe that the improvements of ECONET over BERT LARGE is smaller and sometimes hurts the fine-tuning performances. We speculate this could be related to the property that BERT is less capable of handling temporal reasoning tasks, but we leave more rigorous investigations to future research. ImpactofContrastiveLoss. Comparing the average performances of continual pre-training with gen- erator only and with ECONET (generator + discriminator), we observe that generator alone can improve performances of RoBERTa LARGE in 3 out of 5 datasets. However, except for TimeBank-Dense,ECONET is 58 able to improve fine-tuning performances further, which shows the effectiveness of using the contrastive loss. SignificanceTests. As current SOTA models are either not publicly available or under-perform our im- plementation of RoBERTa LARGE , we resort to testing the statistical significance of the best single model between ECONET and RoBERTa LARGE . McNemar’s tests compute statistics by aggregating all samples’ prediction correctness. For ERE tasks, this value is simply classification correctness; for QA tasks ( TORQUE and McTaco), we useEM per question-answer pairs. Table 3.16 lists all improvements’ p-values per Mc- Nemar’s test (McNemar 1947). Matres appears to be the only one that is not statistically significant. Datasets p-values TORQUE 0.002 ∗∗ McTaco 0.007 ∗∗ TimeBank-Dense 0.004 ∗∗ Matres 0.292 RED 0.059 ∗ Table 3.16: McNemar’s tests for improvement significance between best single models of RoBERTa LARGE and ECONET on the test data. Tests with p-values < 0.05 ( ∗∗ ) indicate strong statistical significance; tests with p-values < 0.1 ( ∗ ) indicate weak statistical significance. 3.3.5.3 AdditionalAnalysis To better understand our proposed model, we experiment with a few variations of our models and compare their overall fine-tuning performances. Impact of Event Models. Event trigger definitions have been consistent in previous event temporal datasets (Nathanael Chambers et al. 2014; Ning et al. 2020b; O’Gorman et al. 2016). Trigger detection models built onTORQUE and TimeBank-Dense both achieve>92%F 1 scores and>95% precision scores. For the 100K pre-training data selected for event masks, we found an 84.5% overlap of triggers identified by both models. We further apply ECONET trained on both event mask data to the target tasks and achieve 59 comparable performances shown in Table 3.17. These results suggest that the impact of different event annotations is minimal and triggers detected in either model can generalize to different tasks. TORQUE TB-D RED EventAnnotations F 1 EM C F 1 F 1 TORQUE 76.1 51.6 36.8 64.8 42.8 TimeBank-Dense 76.1 51.3 36.4 65.1 42.6 Table 3.17: Fine-tuning performance comparisons using event detection models trained on TORQUE v.s. TimeBank-Dense event annotations. All numbers are average over 3 random seeds. Std. Dev. ≥ 1% is underlined. RandomMasks. As most target datasets we use are in the news domain, to study the impact of potential domain-adaption, we continue to train PTLMs with the original objective on the same data using random masks. To compare fairly with the generator and ECONET, we only mask 1 token per training sample. 
The search range of hyper-parameters is the same as in Section 3.3.3. As Table 3.18 shows, continual pre- training with random masks, in general, does not improve and sometimes hurt fine-tuning performances compared with fine-tuning with original PTLMs. We hypothesize that this is caused by masking a smaller fraction (1 out of≈ 50 average) tokens than the original 15%. RoBERTa LARGE +ECONET achieves the best fine-tuning results across the board. TORQUE TB-D RED Methods F 1 EM C F 1 F 1 RoBERTa LARGE 75.1 49.6 35.3 62.8 39.4 +randommask 74.9 49.5 35.1 58.7 38.3 +ECONET 76.1 51.6 36.8 64.8 42.8 BERT LARGE 70.6 43.7 27.5 62.8 39.4 +randommask 70.6 44.1 27.2 63.4 35.3 +ECONET 71.4 44.8 28.5 63.0 40.2 Table 3.18: Fine-tuning performances with different pre-training methods. All numbers are average over 3 random seeds. Std. Dev.≥ 1% is underlined. 60 3.3.5.4 Fine-tuningunderLow-resourceSettings In Table 3.19, we compare the improvements of fine-tuning RoBERTa LARGE +ECONET over RoBERTa LARGE using full and 10% of the training data. Measured by both absolute and relative percentage gains, the ma- jority of the improvements are much more significant under low-resource settings. This suggests that the transfer of event temporal knowledge is more salient when data is scarce. We further show fine-tuning performance comparisons using different ratios of the training data in Figure 3.9a-3.9b. The results demon- strate that ECONET can outperform RoBERTa LARGE consistently when fine-tuning TORQUE andRED. Full Train Data 10% Train Data RoBERTa ∆ ∆% RoBERTa ∆ ∆% TORQUE 75.1 +1.0 +1.3% 59.7 +7.2 +12.1% McTaco 75.5 +0.8 +1.1% 44.0 +5.6 +12.7% TimeBank-Dense 62.8 +2.0 +3.2% 48.8 +2.8 +5.7% Matres 78.3 +0.5 +1.3% 71.0 +2.4 +3.4% RED 39.4 +2.4 +6.0% 27.2 +1.8 +6.6% Table 3.19: RoBERTa LARGE +ECONET’s improvements over RoBERTa LARGE using full train data v.s. 10% of train data. ∆ indicates absolute points improvements while∆% indicates relative gains perF 1 scores. 0.1 0.3 0.5 0.7 1.0 60.0 62.5 65.0 67.5 70.0 72.5 75.0 77.5 59.7 72.8 73.6 74.1 75.1 66.9 73.4 74.2 74.9 76.1 RoBERTa + ECONET (a)TORQUE 0.1 0.3 0.5 0.7 1.0 25.0 27.5 30.0 32.5 35.0 37.5 40.0 42.5 45.0 27.2 30.9 37.2 38.7 39.4 29.0 35.2 37.6 39.4 42.8 RoBERTa + ECONET (b)RED Figure 3.9: Performances (F 1 scores) comparison between fine-tuning RoBERTa LARGE vs. RoBERTa LARGE + ECONET over different ratios of the training data. 61 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Layers 0 5 10 15 20 25 30 Average Attention Scores ECONET [before] RoBERTa [before] ECONET [after] RoBERTa [after] ECONET [during] RoBERTa [during] Figure 3.10: Cumulative attention score comparisons between RoBERTa LARGE and ECONET on TimeBank-Dense test data. All numbers are multiplied by 100 and averaged over 3 random seeds for illustration clarity. 3.3.5.5 AttentionScoresonTemporalIndicators In this section, we attempt to show explicitly howECONET enhances MLMs’ attentions on temporal indi- cators for downstream tasks. As mentioned in Sec. 3.3.2.5, for a particular ERE task (e.g. TimeBank-Dense), we need to predict the temporal relations between a pair of event triggers e i ,e j ∈ P i,j with associated vector representationsv l,h i ,v l,h j ,l∈L,h∈H in an MLM.L andH are the number of layers and attention heads respectively. We further useT m ∈ T to denote a temporal indicator category listed in Table 3.14, and t m,n ∈ T m denote a particular temporal indicator. 
If we let attn(v l,h i ,v l,h x ) represents the attention score between an event vector and any other hidden vectors, we can aggregate the per-layer attention score betweene i andt m,n as, a l i,tm,n = 1 H P H h attn(v l,h i ,v l,h tm,n ). Similarly, we can computea l j,tm,n . The final per-layer attention score for (e i ,e j ) isa l tm,n = 1 2 a l i,tm,n +a l j,tm,n . To compute the attention score for theT m category, we take the average of{a l tm,n | ∀t m,n ∈ T m and∀t m,n ∈ P i,j }. Note we assume a temporal indicator is a single token to simplify notations above; for multiple-token indicators, we take the average ofattn(v l,h i ,v l,h x∈tm,n ). Figure 3.10 shows the cumulative attention scores for temporal indicator categories,[before],[after] and[during] in ascending order of model layers. We observe that the attention scores for RoBERTa LARGE and ECONET align well on the bottom layers, but ECONET outweighs RoBERTa LARGE in middle to top 62 BEFORE AFTER INCLUDES IS_INCLUDED VAGUE [before] [after] [during] [future] -23.09 -27.11 -2.36 -6.06 1.99 -22.37 -27.11 -2.90 -4.76 2.26 -21.67 -25.27 -2.67 -3.81 1.36 -23.25 -21.02 -2.65 -6.35 2.15 10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0 (a) Random Mask - RoBERTa LARGE BEFORE AFTER INCLUDES IS_INCLUDED VAGUE [before] [after] [during] [future] 10.53 3.90 -3.03 2.27 2.06 8.47 8.54 -2.90 -3.10 1.92 8.51 6.44 -2.81 1.65 0.70 10.44 2.99 -3.18 2.66 1.27 10.0 7.5 5.0 2.5 0.0 2.5 5.0 7.5 10.0 (b) ECONET - RoBERTa LARGE Figure 3.11: Performance (F 1 score) differences by temporal indicator categories and label classes in TimeBank-Dense. Fine-tuning on 10% TimeBank-Dense training data. layers. Previous research report that upper layers of pre-trained language models focus more on complex semantics as opposed to shallow surface forms or syntax on the lower layers (Jawahar et al. 2019; Tenney et al. 2019). Thus, our findings here show another piece of evidence that targeted masking is effective at capturing temporal indicators, which could facilitate semantics tasks including temporal reasoning. 3.3.5.6 TemporalKnowledgeInjection We hypothesize in the introduction that vanilla PTLMs lack special attention to temporal indicators and events, and our proposed method addresses this issue by a particular design of mask prediction strategy and a discriminator that is able to distinguish reasonable events and temporal indicators from noises. In this section, we show more details of how such a mechanism works. The heat maps in Figure 3.11 calculate the fine-tuning performance differences between 1) RoBERTa LARGE and continual pre-training with random masks (Figure 3.11a); and 2) between RoBERTa LARGE andECONET (Figure 3.11b). Each cell shows the difference for each label class in TimeBank-Dense conditional on sam- ples’ input passage containing a temporal indicator in the categories specified in Table 3.14. Categories with less than 50 sample matches are excluded from the analysis. In Figure 3.11a, the only gains come fromVAGUE, which is an undetermined class in TimeBank-Dense to handle unclear pairwise event relations. This shows that continual pre-training with random masks 63 works no better than original PTLMs to leverage existing temporal indicators in the input passage to distinguish positive temporal relations from unclear ones. On the other hand, in Figure 3.11b, having temporal indicators in general benefits much more for BEFORE, AFTER, IS_INCLUDED labels. The only exception is INCLUDES, but it is a small class with only 4% of the data. 
More interestingly, notice the diagonal cells, i.e. ([before], BEFORE), ([after], AFTER) and ([during], INCLUDES) have the largest values in the respective columns. These results are intuitive as temporal indi- cators should be most beneficial for temporal relations associated with their categories. Combining these two sets of results, we provide additional evidence that ECONET helps PTLMs better capture temporal indicators and thus results in stronger fine-tuning performances. Our final analysis attempts to show why discriminator helps. We feed 1K unused masked samples into the generator of the bestECONET in Table 3.15 to predict either the masked temporal indicators or masked events. We then examine the accuracy of the discriminator for correctly and incorrectly predicted masked tokens. As shown in Table 3.20, the discriminator aligns well with the event generator’s predictions. For the temporal generator, the discriminator disagrees substantially (82.2%) with the “incorrect” predictions, i.e. the generator predicts a supposedly wrong indicator, but the discriminator thinks it looks original. Temporal Generator Event Generator Corr. Incorr. Corr. Incorr. Total # 837 163 26 974 Discr. Corr. # 816 29 25 964 Accuracy 97.5% 17.8% 96.2% 99.0% Table 3.20: Discriminator’s alignment with generator’s mask predictions in ECONET. Second column shows that discriminator strongly disagree with the “errors” made by the temporal generator. To understand why, we randomly selected 50 disagreed samples and found that 12 of these “incorrect” predictions fall into the same temporal indicator group of the original ones and 8 of them belong to the related groups in Table 3.14 as shown in Table 3.21. This suggests that despite being nearly perfect re- placements of the original masked indicators, these 40% samples are penalized as wrong predictions when 64 training the generator. The discriminator, by disagreeing with the generator, provides opposing feedback that trains the overall model to better capture indicators with similar temporal signals. TypeI.SameGroup: 12/50(24%) ⟩⟩Ex1. original: when;predicted: while Text: A letter also went home a week ago in Pelham, in Westchester County, New York,⟨mask⟩ a threat made by a student in a neighboring town circulated in several communities within hours... ⟩⟩Ex2. original: prior to;predicted: before Text: ... An investigation revealed that rock gauges were picking up swifter rates of salt movement in the ceiling of the room, but at Wipp no one had read the computer printouts for at least one month⟨mask⟩ the collapse. TypeII.RelatedGroup: 8/50(16%) ⟩⟩Ex3. original: in the past;predicted: before text: Mr. Douglen confessed that Lautenberg, which had won⟨mask⟩, was “a seasoned roach and was ready for this race... ⟩⟩Ex4. original: previously;predicted: once text: Under the new legislation enacted by Parliament, divers who⟨mask⟩ had access to only 620 miles of the 10,000 miles of Greek coast line will be able to explore ships and “archaeological parks” freely... Table 3.21: Categories and examples of highly related “incorrect” temporal indicator predictions by the generator, but labeled as “correct” by the discriminator. 3.3.6 ConclusionandFutureWork In this chapter, we investigate how to leverage external knowledge to help event temporal reasoning. In Sec. 3.1 and Sec. 
3.2, we proposed two neuro-symbolic networks that can incorporate either structured or distributional knowledge as constraints to enhance the original neural networks' performance on event temporal relation extraction. In Sec. 3.3, we propose a self-supervised learning framework with targeted mask prediction and contrastive loss that enables PTLMs to capture event temporal knowledge, and we show its effectiveness on five ERE and QA tasks for temporal reasoning.

Both directions seem promising for future research. One possibility is to continue exploring advanced neuro-symbolic networks that incorporate new kinds of constraints for different applications. Another direction is to design novel self-supervised training objectives that inject knowledge to help event temporal reasoning.

Chapter 4
Event Semantic Reasoning via Question Answering

4.1 ESTER: Question Answering for Reasoning Event Semantic Relations

In this section, we switch topics from event temporal reasoning to event semantic reasoning and discuss our data collection work, ESTER (Han et al. 2021a), which uses a question-answering formulation to reason about event semantic relations.

4.1.1 Motivations and Contributions

Narratives such as stories and news articles are composed of a series of events (Carey et al. 1999; Harmon 2012). Understanding how events are logically connected is essential for reading comprehension (Caselli et al. 2017; Mostafazadeh et al. 2016b). Recalling the example used in Sec. 1.3, Figure 4.1 illustrates several pairwise relations for events in the given passage: "the deal" can be considered the same event as "Paramount purchased DreamWorks," forming a coreference relation; it is also a complex event that contains "assumed debt," "gives access" and "takes over projects" as its sub-events. The event "sought after" is facilitated by a previous event, "created features." By capturing these event semantic relations, people can often grasp the gist of a story. Therefore, for machines to achieve human-level narrative understanding, we need to test and ensure models' capability to reason over these event relations.

Figure 4.1 (passage shown in the figure): "Paramount film studios has purchased DreamWorks for 1.6 billion dollars in cash, the company said in a statement Sunday. DreamWorks, which created successful computer animated features such as Shrek, was also sought after by NBC Universal. In the deal, Paramount assumed DreamWorks debt. The final deal gives Paramount access to 59 films in DreamWorks' library. Paramount takes over all of DreamWorks' current projects."
Figure 4.1: A graph illustration of event semantic relations in narratives. We use trigger words to represent events in this graph.

In this work, we study five types of event semantic relations: Causal, Sub-event, Co-reference, Conditional and Counterfactual, and propose to use natural language questions to reason about them. Figure 4.2 shows example question-answer pairs for each relation type. Although previous works studied subsets of these relations, such as Sub-event (Glavaš et al. 2014; W. Yao et al. 2020), Causal and Conditional (Mirza et al. 2014a,b; O'Gorman et al. 2016), most of them adopted a pairwise relation extraction (RE) formulation, constructing (event, event, relation) triplets and predicting the relation for each pair of events.
Event relations of RE formulation are rigidly defined as class labels based on expert knowledge, which could suffer from relatively low inter-annotator agreements (Glavaš et al. 2014; O’Gorman et al. 2016) and may not be the most natural way to exploit the semantic connections between relations and events in their context. We instead propose to reason about event semantic relations as a reading comprehension / question an- swering task. Natural language queries ease the annotation efforts in the RE formulation by supplementing expert-defined relations with textual prompts. When querying Causal relations, we can ask “what causes / leads to Event A?” or “why does A happen?”; when reasoning Sub-event relation, we can ask “what are included in Event B?” or “What does B entail?” etc. “lead to,” “included” and “entail,” as textual cues, can help models better understand which relation is being queried. 68 Figure 4.2: Examples of event annotations and 5 types of QAs in our dataset. Not all events are annotated for clarity purpose. Different colors are used for better visualization. Our question-answering task also poses unique challenges for reasoningeventsemanticrelations. First, the correct answers can be completely different with slight changes of queries. In Figure 4.2, if we modify the third question to be “What would happen if Europe supported Albania?” then “oust President Sali” becomes an invalid answer. This challenge allows us to test whether models possess robust reasoning skills or simply conduct pattern matching. Second, answers must be in the form of complete and meaningful text spans. For theCounterfactual example in Figure 4.2, a random text span“PresidentSaliBerisha” is not a meaningful answer while a shortened answer“oust” is not complete. To get correct answers, models need to detect both event triggers and their event arguments. Finally, there could be multiple valid events in a passage that can answer a question, and a good system should be able to identify different valid answers simultaneously as in theSub-event QA of Figure 4.2. These challenges make our task more difficult than the classification tasks in RE. A few noticeable event-centric MRC datasets have been proposed recently. TORQUE (Ning et al. 2020b) and MCTACO (B. Zhou et al. 2019) are two recent MRC datasets that study event temporal relations. However, knowing only the temporal aspect of events could not solve many important event semantic relations. For example, in Figure 4.1, to understand that “assumed debt,” “gives access” and “takes over projects” are sub-events of “the deal,” a model not only needs to know that all these four events have 69 overlapped time intervals but also share the same associated participants for “the deal” to contain the other three. We summarize our contributions below. 1. We introduce ESTER, the first comprehensive MRC / QA dataset for the five proposed event se- mantic relations by adopting natural language questions and requiring complete event spans in the passage as answers. 2. By proposing a generative QA task that models all five relations jointly and comparing it with tra- ditional extractive QA task, we provide insights on how these event semantic relations interplay for MRC. 3. Our experimental results reveal SOTA models’ deficiencies in our target tasks, which demonstrates that ESTER is a challenging dataset that can facilitate future research in MRC for event semantic relations. 
4.1.2 Definitions

Composing event-centric questions and answers requires identifying both events and their relations. For events in ESTER, we define an event as a trigger word with its arguments (subject, object, time and location), i.e., the template definition described in Sec. 2.1. Below, we describe our definitions of the five event semantic relations studied in this work, most of which are consistent with previous studies. For example, Causal and Conditional have been studied in Wolff (2007), Q. Do et al. (2011), Mirza et al. (2014b), and Mirza et al. (2014a). Sub-event and Co-reference were studied in Glavaš et al. (2014) and O'Gorman et al. (2016). Cosmos QA (Huang et al. 2019) contains a small number of Counterfactual questions, but it is not an event-centric dataset. The examples we use below are all presented in Figure 4.2.

Causal: A pair of events (e_i, e_j) exhibits a Causal relation if, according to the given passage, e_j will definitely happen whenever e_i happens. For example, the passage explicitly says that the "meeting" happens "in return" for "Europe planned for getting stricken Albanian back." Therefore, the Causal relation in this example can be established: if "Europe planned for getting stricken Albanian back" happens, the "meeting" will definitely happen in this context.

Conditional: A pair of events (e_i, e_j) exhibits a Conditional relation if e_i facilitates, but does not necessarily lead to, e_j according to the given passage. For example, the expectation of "the dispatch of a multinational force" is to "pull Albania back from the brink"; in other words, the former event can help but does not guarantee the occurrence of the latter. Therefore, the relation between this pair of events is Conditional.

Counterfactual: e_j may happen if e_i does not happen; in other words, if the negation of e_i facilitates e_j, then (e_i, e_j) has a Counterfactual relation. In our example, if "Europe didn't support Albania," which is a negation of what happens in the passage, then "oust President Sali" by the "armed rebels" would likely happen.

Sub-event: There is a semantic hierarchy in which a complex event e_k consists of a set of sub-events {e_{k,1}, ..., e_{k,j}, ..., e_{k,n}}. In Sub-event relations, we require not only that e_{k,j}'s trigger word be semantically contained in e_k's trigger, but also that the arguments of e_{k,j} be either identical to or contained in the associated arguments of e_k. For example, consider the complex event "efforts to pull Albania back" and its sub-event "aid is brought into the chaotic Balkan state": the trigger "brought" is a part of the "efforts," both subjects are "Europe," both objects / locations are "Albania" or "Balkan state," and their times can be inferred to be (nearly) identical in the passage. Note that this definition is similar to the event hierarchical structure definition in RED, but stricter than the "Spatial-temporal containment" definition in HiEve.

Co-reference: e_i co-refers to e_j when the two events are mutually replaceable. This requires that 1) their event triggers are semantically the same and 2) their event arguments are identical. In our example, the event trigger in the question, "pull" (back from the brink), and the one in the answer, "getting" (back on to its feet), are semantically the same. They also share the same subject, Europe, and the same object, Albania. Their time and location can be inferred from the passage to be the same. Therefore, these two events form a Co-reference relation.
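Putting these definitions together, each ESTER annotation pairs a short news passage with a typed natural-language question and one or more answer spans copied from the passage. The dataclass below is a hypothetical representation of such an instance for illustration only; the field names and the example passage are ours, not the released data format.

```python
from dataclasses import dataclass, field
from typing import List

RELATION_TYPES = {"Causal", "Conditional", "Counterfactual", "Sub-event", "Co-reference"}

@dataclass
class EsterInstance:
    """One question-answer annotation over a news passage (hypothetical schema)."""
    passage: str                 # a 3-4 sentence news snippet
    question: str                # contains an event trigger from the passage
    relation_type: str           # one of the five semantic relations defined above
    answers: List[str] = field(default_factory=list)  # complete event spans from the passage

    def __post_init__(self):
        assert self.relation_type in RELATION_TYPES
        # Every answer must be a contiguous span of the passage.
        assert all(a in self.passage for a in self.answers)

# A toy example loosely modeled on Figure 4.2 (the passage text here is invented).
example = EsterInstance(
    passage="Europe planned for getting stricken Albania back on its feet. "
            "In return, ministers agreed to meet in Rome next week.",
    question="Why did the ministers agree to meet?",
    relation_type="Causal",
    answers=["Europe planned for getting stricken Albania back on its feet"],
)
```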
4.1.3 RelatedWork We briefly survey related work in this section in order to provide broader background over the two key components of ESTER: 1) event semantic relations and 2) event-centric reading comprehension. Eventsemanticrelations have been studied before and most of them leverage relation extraction for- mulation for annotations. Causality is one of the widely studied event semantic relations. Mirza et al. (2014a) and Mirza et al. (2014b) follow the CAUSE, ENABLE and PREVENT schema proposed by Wolff (2007) where the first two relations align with our definitions in ESTER. Q. Do et al. (2011) adopted a mini- mally supervised method and measure event causality based on pointwise mutual information of predicates and arguments, which resulted in denser annotations than previous works. HiEve (Glavaš et al. 2014) defines pairwise Sub-event relation as spatiotemporal containment, which is less rigorous than our definitions where we require containment for all event arguments (subject, object, time and location). Our definition of Co-reference is nearly identical as HiEve where two co-referred events denote the same real-world events. W. Yao et al. (2020) utilized a weakly-supervised method to extract large scaleSub-event pairs, but the extracting rules can result in noisy relations. RED (O’Gorman et al. 2016) proposed to annotate event temporal and semantic relations (Causal,Sub- event) jointly. However, due to the complexity of the annotation schema, the data available for semantic relations are relatively sparse. Mostafazadeh et al. (2016b) and Caselli et al. (2017) annotate both event 72 temporal and semantic relations in ROCStories (Mostafazadeh et al. 2016a) and Event StoryLine Corpus (Caselli et al. 2017) respectively. ESTER differs from these works by disentangling temporal from other semantic relations and focusing on MRC to capture five proposed event semantic relations. Event-centric MRC. Datasets leveraging natural language queries for event-centric machine reading comprehension have been proposed recently (Ning et al. 2020b; B. Zhou et al. 2019). However, they focus on event temporal commonsense, whereas ESTER studies other event semantic relations. Du et al. (2020) and J. Liu et al. (2020) reformulate event extraction data as QA tasks to detect event triggers and arguments in a short passage. However, they did not propose new data, and knowing event triggers and arguments are merely a sub-task in ESTER, which require both event detection and relation understanding. 4.1.4 DataCollection In this section, we show our data collection procedure and describe the details of our approach to control annotation quality, including qualification exams and steps to validate and train workers. Passage Preparation. Passages are selected from news articles in TempEval3 (TE3) workshop (UzZa- man et al. 2013) with initial event triggers provided. We extracted 3-4 continuous sentences that contain at least 7 event triggers. Our choice of the number of sentences is based on previous studies that hierar- chical relations such asSub-event andCo-reference are likely to span over multiple sentences, but the majority of them are contained within 3-4 sentences (Glavaš et al. 2014; O’Gorman et al. 2016). 4.1.4.1 MainProcedure We use Figure 4.2 to illustrate our main data collection procedure, which consists of two components: event selection and QA annotations. The actual interface can be found in Appendix D.8. 73 1. Event Selections. 
Annotators are presented with a passage and initial event trigger annotations. They are allowed to modify event trigger selections per our definition in Section 4.1.2 by highlighting words. These correspond to the highlighted words in the passage of Figure 4.2. Our focus is not event extraction, and thus we do not require workers to identify all triggers as some of them are not used in their QAs. Rather, the event selection serves as a warm-up step for the following QA annotations by 1) helping workers locate where desirable events are and 2) ensuring that all the annotated question-answer pairs include events in the passage so that their QAs reason about event relations. 2. QAAnnotations. As the five questions in Figure 4.2 show, users must ask natural language ques- tions that contain a highlighted event trigger. In order to make questions natural, we allow workers to use different textual forms of an event trigger in the questions, such as “teach” v.s. “taught” and “meeting” v.s. “meet.” After writing a question, users need to pick the event semantic type (the blue boxes in Figure 4.2) that they reason about, and then select the corresponding answer spans from the passage. If there are multiple answers, we instruct users to select all of them. All answers must include an exact highlighted event trigger, and we prohibit answers with more than 12 words to en- sure conciseness. We pay $7.5 for an assignment where annotators need to ask at least five questions using two passages. 4.1.4.2 QualityControl Qualification. The initial worker qualification was conducted via an examination in the format of multiple- choice questions hosted by CROWDAQ platform (Ning et al. 2020a). We created a set of questions where a passage and a pair of QA are provided, and workers need to judge the correct type of this QA from six choices, including the five defined event semantic relations, plus an invalid option 1 . This examination 1 A full list of QA validity can be found in Appendix D.1 74 Figure 4.3: An illustration of our quality control, worker validation and training process. intends to test workers’ skills to 1) distinguish valid QAs from invalid ones based on our definitions; 2) judge the differences for the five proposed eventsemanticrelations. We recruit workers via Amazon Mechanical Turk with basic qualifications including: 1) at least 1K HITs 2 approved; 2) at least 97% approval rate. A single qualification exam consists of 10 multiple-choice questions. Participants are given 3 attempts to pass with a >= 0.6 score. We found this qualification examination effectively reduces the rate of spammers to nearly 0%. WorkerValidationandTraining. Since the real task is much more challenging than the qualification exams, we adopted a meticulous five-stage worker validation and training process to ensure data quality. As Figure 4.3 shows, for workers who passed the qualification exams, we repeat the validation and training steps four times until workers reach the final large tasks. In each validation and training step, two of our co-authors independently judge workers’ annotations to determine 1) whether a provided QA pair is valid per our definitions and 2) whether the answers provided are complete. Typically, we disqualify workers whose QA validity rate falls below 90%. Exceptions are given upon careful examination and reviewer discussion. For workers who pass a manual validation, 2 HIT is an assignment unit on Amazon Mechanical Turk. 
75 we write a training message correcting all errors they made and invite them to the next task. We also add missing answers as a part of the validation process and reserved the validated annotations as our evaluation data. There are 1, 2, 3, 10, and 25 HITs in Task 1-4 and Large Task respectively. For Task 1-3, we validate all QAs, and for Task 4, we randomly select 20% questions per worker to validate. In order to work on the final large task, a worker needs to maintain an average QA validity rate higher than 90%. We further request one co-author to validate all questions with passages overlapped with the validated data above. This ensures that there are no passage overlaps between the training and evaluation data. All author validated data comprise our final evaluation data in the experiments. 4.1.5 DataAnalysis Our passage preparation (Sec. 4.1.4) produces 4.3K passages in total with 1887 of them randomly selected and annotated. We collect 6018 questions from 70 workers using Amazon Mechanical Turk and 1471 of them fully validated by co-authors as the evaluation set. We further split our evaluation data into dev and test sets based on passages. The remaining data are used as the training set. A summary of data statistics is shown in Table 4.1. Train Dev Test # of Passages 1492 108 287 # of Questions - Overall 4547 301 1170 -Causal 2047 118 431 -Conditional 928 58 289 -Counterfactual 294 28 106 -Sub-event 678 59 204 -Co-reference 600 38 140 Table 4.1: Passages and questions (overall + type breakdown) statistics for different data splits. TypeDistribution. As we can observe in Table 4.1 and Figure 4.4, ESTER consists of 64.2%Causal and Conditional questions. In Figure 4.5, we further show the type disagreements using data validated by 76 two co-authors. The rows indicate workers’ original types and the columns are the majority votes between the annotators and co-authors. CAUSAL 45.0% COREFERENCE 13.2% COUNTERFACTUAL 6.5% CONDITIONAL 20.4% SUB-EVENT 14.9% (a) Type Distribution: train data CAUSAL 37.3% COREFERENCE 12.1% COUNTERFACTUAL 9.1% CONDITIONAL 23.6% SUB-EVENT 17.9% (b) Type Distribution: evaluation data (dev + test) Figure 4.4: Performances (F 1 scores) comparison between fine-tuning RoBERTa LARGE vs. RoBERTa LARGE + ECONET over different ratios of the training data. As we can observe, the matrix is dominated by diagonal entries. Some noticeable disagreements are 1) betweenCausal andConditional where people have different opinions on the degree of causality be- tween events; 2) betweenCounterfactual andConditional as someCounterfactual questions, with double negations 3 , are merelyConditional; 3) betweenCo-reference andSub-event where annotated co-referred events do not have identical event arguments according to co-authors’ judgements. These 3 Double negated questions have the form of “what will not happen if Event A does not happen” 77 CAUSAL CONDITIONAL COUNTERFACTUAL SUB-EVENT COREFERENCE CAUSAL CONDITIONAL COUNTERFACTUAL SUB-EVENT COREFERENCE 0.93 0.058 0 0.0069 0 0 0.98 0 0.0075 0.0075 0 0.13 0.87 0 0 0.017 0 0 0.95 0.033 0.038 0.09 0 0.1 0.77 Figure 4.5: Type confusion matrix between workers’ original annotations and the majority votes after co- authors’ validation. Rows are annotators’ types whereas columns are the majority votes. results align with previous studies that some event semantic relations are inherently hard to distinguish (Glavaš et al. 2014; O’Gorman et al. 2016). Type Agreements. 
The inter-annotator-agreement (IAA) score is 85.71% when calculated using pair- wise micro F1 scores, and is 0.794 per Fleiss’sκ 4 . The IAA scores are calculated using the same data reported in Figure 4.5. The high IAA scores demonstrate strong alignments between annotators and co-authors in judging event semantic relations. We show n-grams in questions and the number of answers below. More analysis on tokens and worker distributions can be found in Appendix D.5 and D.6. FrequentN-gramsinQuestions. Figure 4.6 illustrates the most frequent unigram, bigram and trigrams in each type of questions after removing non-informative stop-words. These n-grams can be considered as semantic cues in the questions to reason about particular semantic relations. For example, ‘why’ and ‘what caused’ imply strong causality; ‘included’ indicates containment of events; ‘not’ inCounterfactual indicates negation of events. Number of Answers. Table 4.2 shows the average number of answers for each semantic type. Sub- event contains the most answers, which aligns with our intuition that a complex event in the passage 4 0.794 implies substantial agreement(Landis et al. 1977). The detailed calculation can be found in appendix C. 78 (a) Most frequent n-grams forCausal (b) Most frequent n-grams forConditional (c) Most frequent n-grams forCounterfactual (d) Most frequent n-grams forSub-event (e) Most frequent n-grams forCo-reference Figure 4.6: Most frequent n-grams in questions. often contains multiple sub-events. The evaluation sets contain about 0.5 answers more than the training set as co-authors added the missing answers in the validation process. Considering each unique question and answer as an event,ESTER captures 10.1K event pairs, which are larger than previous RE datasets such as RED and HiEve. Semantic Types Train Dev Test Causal 1.3 1.5 1.9 Conditional 1.3 1.9 2.0 Counterfactual 1.2 1.3 1.7 Sub-event 3.0 3.6 3.1 Co-reference 1.2 1.2 1.6 Table 4.2: Average number of answers by semantic types. 4.1.6 Experiments We design experiments to provide benchmark performances and understand learning challenges to facili- tate future research onESTER. We formulate our QA task as a conditional answer generation problem. This 79 choice is inspired by recent works such as UnifiedQA (Khashabi et al. 2020) that achieve impressive out- comes by integrating various QA tasks (extractive, abstractive and multiple-choice) as a single generative QA pre-training task. S. Li et al. (2021) and Paolini et al. (2021) also show that by reformulating original extractive tasks as generation tasks, it enables models to better exploit semantic relations between context and labels as well as the dependencies between different outputs. To better demonstrate the benefits of the proposed generative QA task, we compare it with a traditional extractive QA task. We introduce our experimental design and evaluation metrics subsequently. Generative QA. Given a question q i and a passage P i = {x 1 ,x 2 ,...x j ,...x n } where x j represents a token in the passage, the answer generation task requires the model to generate natural language answers A ′ i = {a ′ i,1 ...a ′ i,k }. For the gold answers A i = {a i,1 ...a i,k }, each answer span a i,k ∈ P i . We follow the input format of UnifiedQA (Khashabi et al. 2020) by concatenating q i andP i with a “\n” token. For training labels, we concatenate multiple answers with a “;” token. ExtractiveQA. Givenq i andP i , this task requires a model to predict whether each tokenx j ∈P i is an answer or not. 
Following the "B-I-O" labeling conventions in the IE field, we create a vector of labels with '2' if x_j is the beginning token of an answer span, '1' if x_j is an internal token of an answer span, and '0' if x_j ∉ A_i. The input is the same as for generative QA, except that we concatenate q_i and P_i with two "</s>" tokens to be consistent with the pair-sentence input format of the base model, RoBERTa_LARGE. To compare fairly with the generative QA task, we construct candidate answer spans by examining the predicted labels for all tokens. Both "BI*" and "I*" cases are considered valid answers. Finally, we map positive answer tokens' ids back to natural language phrases. More formally, we denote the final candidate answers of the task as A''_i = {a''_{i,1} ... a''_{i,k}}, where a''_{i,k} ∈ P_i.

Evaluation Metrics. It is important to assess how well models can find all valid answers. We evaluate this using token-based F_1 and exact-match measures. On the other hand, when interacting with machines, we would like the top answer returned to be correct. We measure this with HIT@1 scores.
• Let U_i, U'_i denote all uni-grams in A_i, A'_i. We have F^T_1 = 2·P·R / (P + R), where P = |U_i ∩ U'_i| / |U'_i| and R = |U_i ∩ U'_i| / |U_i|.
• HIT@1 equals 1 if the top predicted answer, i.e., a'_{i,1} or a''_{i,1}, contains a correct event trigger; otherwise it is 0. This metric is well defined as all questions in our data contain at least one answer and all (well-trained) models return at least one answer. For both generative and extractive QA, we use the leftmost answer as the top answer.
• EM, or exact-match, equals 1 if ∀a'_i ∈ A'_i, a'_i ∈ A_i and ∀a_i ∈ A_i, a_i ∈ A'_i; otherwise, EM = 0.
A code sketch of these three metrics is provided at the end of this subsection.

Baselines. We have model and human baselines.
• Model Baselines. For our primary generative QA task, we fine-tune several sequence-to-sequence pre-trained language models on ESTER: BART (Lewis et al. 2020), T5 (Raffel et al. 2020) and UnifiedQA. As mentioned, UnifiedQA (based on BART and T5) is pre-trained on various QA tasks. It also demonstrates powerful zero-shot learning capabilities on unseen QA tasks, which we test on ESTER too. Due to computation constraints, the largest model we are able to fine-tune is UnifiedQA (T5-large). We leave further investigation to future modeling studies. Since extractive QA can be considered a token prediction task, we build our model based on RoBERTa-large with token mask prediction pre-training objectives. Models and fine-tuning details can be found in Appendix D.2.
• Human Baselines. To show the human performance on the task, we randomly select 20 questions for each semantic type from the test set. Two co-authors provide answers for these questions, and we compare their mutually agreed answers with the original answers. We ensure the co-authors never saw these questions previously. F^T_1, HIT@1 and EM scores are calculated as the human performances.
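To make these metrics concrete, the following is a minimal Python sketch of how token-based F^T_1, HIT@1 and EM can be computed for a single question. It is an illustrative sketch rather than our exact evaluation script: the whitespace tokenizer, the function names, and the assumption that gold event triggers are available as a separate list for the HIT@1 check are simplifications of our own.

```python
def token_f1(gold_answers, pred_answers):
    """Token-based F1 over unigram sets U_i (gold) and U'_i (predicted)."""
    gold_unigrams = set(" ".join(gold_answers).split())
    pred_unigrams = set(" ".join(pred_answers).split())
    overlap = len(gold_unigrams & pred_unigrams)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_unigrams)
    recall = overlap / len(gold_unigrams)
    return 2 * precision * recall / (precision + recall)

def hit_at_1(gold_triggers, pred_answers):
    """HIT@1: 1 if the leftmost predicted answer contains a correct event trigger."""
    top_answer_tokens = pred_answers[0].split()
    return 1.0 if any(t in top_answer_tokens for t in gold_triggers) else 0.0

def exact_match(gold_answers, pred_answers):
    """EM: 1 only when the predicted and gold answer sets coincide exactly."""
    return 1.0 if set(gold_answers) == set(pred_answers) else 0.0

# Illustrative usage on one hypothetical question:
gold = ["the earthquake struck the city", "rescue efforts began"]
pred = ["the earthquake struck the city"]
print(token_f1(gold, pred), hit_at_1(["struck", "began"], pred), exact_match(gold, pred))
```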
81 4.1.7 ResultsandAnalysis Dev Test F T 1 HIT@1 EM F T 1 HIT@1 EM Generative Zero-shot: T5-base 18.0 55.8 0.0 21.1 61.0 0.0 Generative Zero-shot: UnifiedQA-base 49.0 61.5 10.6 46.5 61.5 7.1 Generative Zero-shot: UnifiedQA-large 51.1 69.4 14.3 48.7 66.5 9.7 Generative Fine-tune: BART-base 53.1(± 0.4) 66.9(± 1.7) 14.1(± 1.0) 53.3(± 0.8) 68.1(± 1.2) 15.1(± 0.7) Generative Fine-tune: BART-large 57.2(± 1.0) 72.1(± 1.4) 15.1(± 2.1) 56.1(± 1.0) 71.5(± 2.2) 15.2(± 0.9) Generative Fine-tune: T5-base 63.2(± 1.1) 80.8(± 1.7) 22.1(± 0.9) 58.5(± 0.7) 76.2(± 1.0) 20.5(± 0.9) Generative Fine-tune: UnifiedQA-base 64.6(± 0.4) 82.0(± 0.4) 23.8(± 1.0) 59.3(± 0.2) 78.1(± 0.4) 20.6(± 0.5) Generative Fine-tune: UnifiedQA-large 66.8(± 0.2) 87.2(± 0.3) 24.4(± 0.3) 63.3(± 0.8) 83.5(± 0.7) 22.1(± 0.4) Extractive Fine-tune: RoBERTa-large 68.8(± 0.7) 66.7(± 1.1) 16.7(± 0.2) 66.1(± 0.2) 63.8(± 1.6) 15.9(± 0.5) Human Baseline - - - 79.6 100 36.0 Table 4.3: Experimental results for answer generation. All numbers are 3-seed average with standard devi- ation reported, except for human baseline and zero-shot performances. All models refer to the generative QA task except for RoBERTa-large, which we use for the extractive QA task. Statistical tests are shown in Appenidx D.3. In this section, we present and analyze results for the experiments described in Section 4.1.6. 4.1.7.1 GenerativeQA As Table 4.3 shows, UnifiedQA-large achieves the best average performances among all generative QA baselines, with 63.3%, 83.5% and 22.5% for F T 1 , HIT@1 and EM scores on the test set, which are 16.3%, 16.5% and 13.1% below the human performances. We also observe that UnifiedQA-base with 220M pa- rameters outperforms other comparable or larger models such as T5-base and BART-large with 2-3x more parameters, showing the effectiveness of pre-training with generative QA tasks. Zero-shot and few-shot Learning. UnifiedQA also demonstrates powerful zero-shot and few-shot learning capabilities in a variety of QA tasks. We observe similar patterns where zero-shot learning from UnifiedQA can significantly outperform its T5 counterpart in Table 4.3. For few-shot learning, we show in Figure 4.7 that fine-tuning with only 500-1K examples, the model can achieve quite comparable results with full-training. The model performances level off as the second half of the training data provide ≤ 1.2% 82 0 500 1K 2K 3K 4K Full 65 70 75 80 85 90 95 100 HIT@1 (%) HIT@1 45 50 55 60 65 70 75 80 85 F1 (%) F1 (Token) Figure 4.7: Fine-tuning UnifiedQA-large results by using 500, 1K, 2K, 3K, 4K and full train data. Dashed lines on the top are corresponding human performances. improvements across all metrics. This suggests that the benefits of getting more data diminish drastically and data size may not be the bottleneck of learning for ESTER. Exact-Match (EM) F1 (Token) HIT@1 10 20 30 40 50 60 70 80 90 Scores 23.4 65.4 86.0 25.7 63.5 84.7 26.4 58.5 75.5 12.1 60.3 83.5 22.1 63.8 79.5 CAUSAL CONDITIONAL COUNTERFACTUAL SUB-EVENT COREFERENCE Figure 4.8: Test performances for each semantic type. Breakdown performances. In Figure 4.8, we show performances for each semantic type on the test data. Not surprisingly, Causal and Conditional achieve best performances as they are the more dom- inant semantic types in ESTER. Model training may favor these two types. 
Interestingly, though Coun- terfactual relation has the smallest number of training questions and requires more complex reasoning than Conditional due to its negation, our models can learn this relation relatively well per EM andF 1 measures. This could be contributed by 1) the similarity between Counterfactual and Conditional relations, and 2) the negations are well detected through textual cues in the model training. On the other hand, the significantly lower HIT@1 score forCounterfactual suggests that it is challenging for models to pin-point the most confident answer. 83 Hierarchical relations, Sub-event and Co-reference in general have lower scores than Causal and Conditional, which could be attributed to two factors: 1) these two categories have smaller percentages (28.1% combined) in training data; 2) understanding these two relations requires complicated reasoning skills to capture not only the hierarchical relations for event triggers but also for their associated argu- ments. Figure D.1 in Appendix D.4 shows the similar plateauing effect of adding more training samples for these two relations, which implies that data size may not be the only factor for weaker performances, and these two semantic relations could be inherently challenging to comprehend. Answercompleteness. In Table 4.2, we show that the validated data contain about 0.5 more answers per question. Besides some rare obvious misses,proximity andsaliency are the two reasons we observe that contribute most to this discrepancy. Our input data include long passages with an average of 128 tokens. Even well-trained workers can overlook relations for event pairs that are physically distant from each other. Moreover, long-distance relations are often less salient. For non-salient relations, expert or external knowledge may be needed to disambiguate. We found workers tend to be conservative by avoiding these non-salient answers. #Ans. F T 1 HIT@1 EM original 1.41 58.7(± 0.2) 78.8(± 0.3) 18.8(± 0.4) completed 1.84 59.3(± 0.1) 78.5(± 0.3) 16.9(± 0.4) Table 4.4: Performances on test data. Workers’ original annotations v.s. completed by another worker. To precisely gauge the impact of answer completeness, we randomly sample 500 questions with type distribution similar to the training data and request qualified workers to find more complete answers. We then retrain UnifiedQA-large with both the original and the more completed answer annotations. Table 4.4 shows that that the “completed” set has an average number of answers similar to those in our validated data, but we observe no significant improvements. We hypothesize that 1) through our rigorous validation and training, workers are able to identify important answers; 2) the request to find more complete answers could inadvertently introduce some noise, which cancels out the benefits of increasing answer numbers. 84 4.1.7.2 ExtractiveQA In this section, we discuss results for the extractive QA task. In Table 4.3, we observe that extractive QA by finetuning RoBERTa-large achieves the best token F 1 scores, yet under-performs generative QA perHIT@1 andEM metrics. We further compareF T 1 withEM scores by increasing training weights on positive tokens, i.e. ‘B’ or ‘I’. Figure 4.9 shows that as we train models to focus more on the positive answer tokens,F T 1 keeps increasing up to weight = 10, but answerEM starts to fall after weight = 2. 
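Before turning to the implications, the snippet below sketches the positive-token weighting scheme just described, assuming a RoBERTa-style token classifier that emits one logit vector per token over the three labels 'O' (0), 'I' (1) and 'B' (2). The function name and the default weight are illustrative assumptions; the actual training code may differ.

```python
import torch
import torch.nn as nn

def weighted_bio_loss(logits, labels, positive_weight=2.0):
    """Cross-entropy over B-I-O token labels with up-weighted positive tokens.

    logits: (batch, seq_len, 3) scores from the token classification head.
    labels: (batch, seq_len) with 0='O', 1='I', 2='B'; -100 marks padding.
    positive_weight: weight applied to 'B' and 'I' tokens (the x-axis of
    Figure 4.9); 1.0 recovers the unweighted baseline.
    """
    class_weights = torch.tensor([1.0, positive_weight, positive_weight],
                                 device=logits.device)
    loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)
    return loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))
```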
These results imply that extractive QA excels at finding tokens or phrases that resemble or partially overlap with true answers (goodF T 1 scores), but falls short on producing complete and meaningful texts that truly represent event spans. 1 2 5 10 20 Positive Token Weights 58 60 62 64 66 68 70 F1 (%) F1 (Token) 8 10 12 14 16 18 20 EM (%) EM Figure 4.9: F T 1 v.s. EM scores on the dev set by increasing training weights on positive answer tokens. To verify our hypothesis above, we examine real predictions where both the best generative and ex- tractive QA models do not predict exact answers (i.e. per-sample EM = 0). We list several of them in Table D.4 of Appendix D.7. In general, extractive QA predicts many single or disconnected tokens that are not meaningful, whereas generative QA, despite making wrong predictions, produces answer spans that are complete and coherent. To summarize, the comparative studies between generative and extractive QAs emphasize the impor- tance of using multiple metrics to evaluate models and highlight the contribution of leveraging answer 85 generation to solve ESTER where complete and meaningful event spans rather than partial tokens are crucial to answer questions. 4.1.8 Conclusion We propose ESTER, a question answering dataset for comprehensive event semantic reasoning. We adopt meticulous data quality control to ensure annotation accuracy. ESTER enables a generative question an- swering task, which can be more challenging than the traditional event relation extraction work. The difficulty of the proposed data and task is also manifested by the significant gap between machine and hu- man performances. We thus believe that ESTER would be a novel and challenging dataset that empowers future event-centric research. Future research. ESTER facilitates a promising research direction of few-shot learning for event se- mantic relations as a generative QA task, yet remains challenging since large SOTA systems significantly under-perform human baselines. Future research can explore building question generation systems to au- tomatically annotate a larger scale of data or study the possibilities of transfer learning between this data and other event-centric reasoning tasks. 86 Chapter5 StoryGenerationBasedonEventCommonsenseKnowledge In this final chapter, we discuss how to improve event reasoning in creative language generation. Specif- ically, in Section 5.1, we describe our work of leveraging event temporal prompts to generate flashbacks in stories. In Section 5.2, we discuss how to inject event commonsense knowledge and leverage a plan-to- story alignment method to help machines generate more logically consistent stories. 5.1 GeneratingFlashbackswithEventTemporalPrompts Stories or narratives are comprised of a sequence of events. To compose interesting stories, professional writers often leverage a creative writing technique called flashback that inserts past events into current storylines as we commonly observe in novels and plays. However, it is challenging for machines to gener- ateflashbacks as it requires solid understanding of eventtemporalorder (e.g. feelinghungry BEFORE eat, not vice versa), and the creativity to arrange storylines so that earlier events do not always appear first in narrativeorder. Two major issues in existing systems exacerbate the challenges: 1) temporal bias in pre- training and story datasets that leads to monotonic event temporal orders; 2) lack of explicit guidance that helps machines decide where to insertflashbacks . 
We propose to address these issues using structured sto- rylines to encode events and their pair-wise temporal relations (BEFORE,AFTER andVAGUE) astemporal 87 prompts that guide how stories should unfold temporally. We leverage a Plan-and-Write framework en- hanced by reinforcement learning to generate storylines and stories end-to-end. Evaluation results show that the proposed method can generate more interesting stories withflashbacks while maintaining textual diversity, fluency and temporal coherence. 5.1.1 MotivationsandContributions Flashback is a popular creative writing technique that brings the readers from the present moment to the past via inserting earlier events in order to provide background or context of the current narrative (parenciteflashback-pavis, flashback-kenny, flashback-gebeyehu. For example, in Figure 5.1a, the “GHOST” in Shakespeare’s playHamlet instruments aflashback by interrupting the main narrative and describing a historical event to the audience that Hamlet’s father was killed by the current king rather than a snake. Flashback, by manipulating the event temporal orders in narrative structure, can arouse readers’ emo- tions such as surprise, suspense, and curiosity (Brewer et al. 1981, 1982). These emotions stimulate readers’ interests and eventually contribute to the satisfaction of reading (Alwitt 2002; E. S. Tan 1996), which im- proves the interest level of a story. The example in Figure 5.1a injects historical events in the middle of the narrative. This arrangement of events can surprise readers and therefore, makes the story more interesting than a straightforward storyline where the past events are shown in the beginning. Similarly, consider the pair of two-sentence stories in Figure 5.1b. Both stories are composed of the same events with thetemporalorder “lost consciousness” BEFORE “woke up in the hospital.” In Story (1), seeing[e1], readers can make a relatively easy educated guess of [e2], but it is more subtle in Story (2) as there are many different ways to end up in a hospital. By showing the ending event first, the flashback in Story (2) creates suspense that makes the following sentences less predictable, and thus arouses readers’ curiosity and makes the reading more interesting. 88 (a) A flashback example from William Shakespeare’s famous play Hamlet (in plain English). Red text indicates past events. (b) Two-sentence stories with the same eventtemporalorder but different narrativeorder. The second one with a flashback is intuitively more interesting than the first one. Figure 5.1: (a) flashback (b) temporal v.s. narrative order. While human writers are capable of maneuvering event temporal orders to compose coherent and in- teresting stories, it remains challenging for machines. The challenge is partially attributed to data bias. Ning et al. (2018b) shows that the pattern in Story (1) is dominant in human-written texts, where neigh- boring events with BEFORE temporal relations (i.e., narrative order indicates temporal order) occur 60− 70% of the time. This is also manifested in our experiments with vanilla language models amplifying this ratio and producing more than 80%BEFORE relations for neighboring events in the generated stories. Furthermore, current state-of-the-art story generation systems that incorporate event representations usu- ally assume eventtemporalorder followsnarrativeorder (Goldfarb-Tarrant et al. 2020; S.-T. Lin et al. 2021). 
There are no explicit prompts in these systems that help determine whenflashback should be used, leaving models to produce dull stories consisting of event sequences with monotonic BEFORE relations. To facilitate more effective flashback , we propose to incorporate temporal prompts in an end-to- end story generation framework inspired by the Plan-and-Write paradigm (Goldfarb-Tarrant et al. 2020; P. Xu et al. 2020; L. Yao et al. 2019), where machines first learn to plot a storyline, and then generate the story based on the storyline. Specifically, we encode predefined event temporal prompts in structured storylines. As the bottom block of Figure 5.2 shows, a structured storyline contains two components: 89 1) event representations where an event trigger (“grabbed”) and two arguments (“she” and “the dog”) are extracted from the original story sentences; and 2) temporal prompts: the temporal order between neighboring events, e.g. event 1: (“she”, “grabbed”, “the dog”) isAFTER event 2: (“white snow”, “blanketed”, “the ground”). By training our storyline generation model with these predefined pair-wise temporal rela- tions, models capture how neighboring events are temporally related to each other; while during storyline decoding, supplying predefined temporal prompts can guide models to generate reasonable narratives with desirable event temporal orders. Prior works (Fan et al. 2019; Goldfarb-Tarrant et al. 2020) build the storyline and story models sepa- rately, which creates a discrepancy where gold storylines are used during training, but predicted story- lines are used during inference. To mitigate this training-inference discrepancy, we leverage reinforcement learning (RL) to train our systems end-to-end. It enables the story model to train on generated storylines and updates the storyline model with the feedback from the story model. Our experimental results show that the RL-based models can leverage temporal prompts more effectively, resulting in more effective flashback generation and more interesting stories. We summarize the contributions of this paper as follows: 1) To facilitate effective flashback , we propose to leverage structured storylines with temporal prompts to arrange events in story generation. 2) We integrate reinforcement learning in our story generation pipeline, which can help models better leverage temporalprompts. 3) We test our framework on two open-domain story datasets and show more effective flashbacks and increased interest level while maintaining fluency and temporal commonsense in the generated stories. To our best knowledge, this is a pioneering study onflashbacks inneuralstorygeneration. 5.1.2 TaskDefinitions In this section, we describe the key components: events andtemporalprompts in our proposed struc- tured storylines, and then define the Plan-and-Write generation task. 90 Figure 5.2: An illustration of our overall model. Here we use the first sentence of the story (and its associ- ated event representation) as inputx. The upper block shows the vanilla implementation of the Plan-and- Write workflow. The bottom block is our core novel design by leveraging temporalprompts in structured storylines to generate flashbacks . For illustration purposes, we re-order the triggers and arguments, and storylines are ground-truths (i.e. not predicted by models). Our final model uses reinforcement learning to implement end-to-end training. Event Representation. Following the definitions of ACE (2005), we define an event as a trigger word and its arguments. 
In this work, we simplify the representation by leveraging semantic role labeling (SRL) tools (Shi et al. 2019) to parse two arguments as shown in Figure 5.2. We only consider one event per story sentence and denote thek-th event in storyi ase i,k . We leave more complicated representations for future study. Temporal Prompts. Letr i = {r i,k } denotes the set of temporal relations between the k-th and the (k +1)-th event in story i. If k indexes the last event, r i,k is not defined. Following the event relation definition of ning-etal-2018-multi, we use events’ start time to evaluate temporal order. StructuredStoryline. Figure 5.2 provides a storyline consisting of five event representations extracted from our data. More formally, letS i ={e i,1 ,e i,2 ,...e i,k ,...e i,n } indicates a storyline withn events. Encod- ingtemporalprompts,S i becomesS r i ={e i,1 ,r i,1 ,e i,2 ,r i,2 ...e i,k ,r i,k ,...e i,n }. Note that in this work,r i isprovidedaspredefinedprompts rather than predicted ase i,k . 91 Story. Our ultimate goal is to generate flashbacks in stories. We denote the story associated with the storylineS i asY i . Plan-and-Write is a two-stage framework that first generates storyline ˆ S i given some inputx (e.g. title, leading sentence), and then generate ˆ Y i based on ˆ S i . Again,r i,k are given as predefined prompts whereas e i,k are to be predicted as part of the storyline generation shown in Figure 5.2. 5.1.3 Method In this section, we first provide an overview of the Plan-and-Write story generation system and introduce a vanilla version of the end-to-end training method. Then we describe the details of our key contribution of leveraging eventtemporalprompts to generateflashbacks . After that, we discuss pretraining structured storylines with self-labeled data and incorporating reinforcement learning to jointly train our end-to-end models. 5.1.3.1 Plan-and-WriteModels In order to provide better explainability and controllability over the machine generated stories, recent research efforts (Goldfarb-Tarrant et al. 2020; P. Xu et al. 2020; L. Yao et al. 2019) explore dividing story generation into two steps: 1) from input or prefix, x, we first produce a storyline, S i ; 2) based on the storyline, we generate a story,Y i . We describe the details below. Storyline Model. Letα denote the parameters of the storyline model, per sample training loss can be computed asL α =− logp(S i |x i ,α ). StoryModel. Letβ denote the parameters of the story model, per sample training loss can be computed asL β =− logp(Y i |x i ,S i ,β ). 92 Inference. Note thatS i above is the gold storyline extracted fromY i . At the inference time, we do not haveS i , and have to replace it with ˆ S i , the predicted storyline. This results in a discrepancy between the training and inference time. End-to-end Training. Instead of using gold storylineS i to train a story model, we can take ˆ S i as its input. Now the per sample training loss for the story model becomesL θ =− logp Y i |x i , ˆ S i ,θ , where θ indicates the end-to-end story model parameters. End-to-end training can alleviate the gap between the training and inference time, and potentially lead to more consistent stories. 5.1.3.2 StructuredStorylineConstruction As Figure 5.2 shows, for a story sentence, we first use the SRL tool to parse its trigger t i,k and two arguments a 1 i,k and a 2 i,k . 
We then convert this representation into a textual form: "t_{i,k} ; a^1_{i,k} ; a^2_{i,k} ⟨eoe⟩", where ";" separates two event components, and ⟨eoe⟩ indicates the event ending. For example, the parsed t_{i,k}, a^1_{i,k} and a^2_{i,k} in the story sentence "she grabbed the dog and ran outside" are "grabbed," "she" and "the dog" respectively. They are concatenated into a final textual representation as "grabbed ; she ; the dog ⟨eoe⟩." Depending on the experimental setup, we may use no event or only the leading event as the input, x. Inspired by the mask prediction design in Devlin et al. (2019), Lewis et al. (2020), and Y. Liu et al. (2019), we represent the remaining missing events in the inputs as "⟨mask⟩ ; ⟨mask⟩ ; ⟨mask⟩ ; ⟨eoe⟩," where ⟨mask⟩ indicates either an event trigger word or an argument to be predicted by the storyline model.

5.1.3.3 Temporal Prompt Encoding
Temporal prompts are used to generate flashbacks. As we mentioned in Section 5.1.2, we encode a sequence of predefined event temporal prompts r_i = {r_{i,k}} in the storyline for k ∈ {1, n−1} to help models determine whether the next event mention (in narrative order) should start earlier or later than its preceding event mention. We use temporal relation extraction tools to annotate all r_{i,k} in our experimental data.

Algorithm 2: RL-based End-to-end Training
 1: Randomly initialize α and θ
 2: Pretrain α                             ▷ storyline pretraining
 3: for i ∈ M do                           ▷ loop through all data
 4:     Generate Ŝ^r_i from the storyline model (α)
 5:     L_θ = −log p(Y_i | x_i, Ŝ^r_i, θ)
 6:     ∇J_α = R_i · ∇log p(S_i | x_i, r_i, α)
 7:     α = α − ∇J_α
 8:     θ = θ − ∇L_θ
 9: end for

Specifically, we use ECONET (Han et al. 2021b) finetuned on the MATRES dataset (Ning et al. 2018c) to predict the temporal relation between neighboring events.¹ The context and the locations of a pair of event trigger words are fed into ECONET to predict their temporal order. The temporal prompt set consists of BEFORE, AFTER and VAGUE (capturing undetermined temporal order), and the prompts are fixed in S^r_i. Note that VAGUE indicates undetermined temporal order due to the ambiguity of the context (Cassidy et al. 2014; Ning et al. 2018c) and does not suggest the context is poor or the relations are wrong. As shown in Figure 5.2, we replace the end-of-event token ⟨eoe⟩ with temporal prompts in storylines, except for the last event, which does not have a next event. With the prompt-augmented storylines, S^r_i, we can re-write the storyline loss as L_α = −log p(S^r_i | x_i, r_i, α), and the story loss as L_θ = −log p(Y_i | x_i, Ŝ^r_i, θ). A code sketch of this storyline construction is given after Sec. 5.1.3.4.

5.1.3.4 Storyline Pretraining
Using intermediate pretraining to adapt original pretrained language models has been shown to be effective for a variety of downstream tasks such as information extraction (Joshi et al. 2020), question answering (Garg et al. 2020; Khashabi et al. 2020) and commonsense reasoning (W. Zhou et al. 2021). To capture more diverse event sequences and facilitate better story generation, we explore pretraining the storyline model with SRL-extracted storylines from the BookCorpus dataset (Zhu et al. 2015), and use the learned α to initialize our storyline models.

1 The ECONET tool is available here: https://github.com/PlusLabNLP/ECONET.
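As a concrete illustration of Secs. 5.1.3.2 and 5.1.3.3 (referenced above), the sketch below assembles a prompt-augmented structured storyline from per-sentence SRL parses and predefined pairwise temporal relations. It is a minimal sketch under our own simplifying assumptions: events are plain (trigger, arg1, arg2) triples, the relations are assumed to have been predicted already by an external classifier such as ECONET, and the helper names and special-token spellings are illustrative rather than taken from any released codebase.

```python
from typing import List, Tuple

# One event per story sentence: (trigger, arg1, arg2), e.g. ("grabbed", "she", "the dog").
Event = Tuple[str, str, str]

def event_to_text(event: Event) -> str:
    """Render an SRL-parsed event in the 'trigger ; arg1 ; arg2' textual form."""
    trigger, arg1, arg2 = event
    return f"{trigger} ; {arg1} ; {arg2}"

def build_storyline(events: List[Event], prompts: List[str]) -> str:
    """Concatenate events, replacing the end-of-event token with temporal prompts.

    `prompts` holds the predefined r_{i,k} (BEFORE / AFTER / VAGUE) between
    event k and event k+1, so len(prompts) == len(events) - 1. The last event
    keeps the plain <eoe> marker because it has no following event.
    """
    assert len(prompts) == len(events) - 1
    pieces = []
    for k, event in enumerate(events):
        pieces.append(event_to_text(event))
        pieces.append(f"<{prompts[k].lower()}>" if k < len(prompts) else "<eoe>")
    return " ".join(pieces)

# Example loosely following Figure 5.2 (the AFTER prompt asks the model to place
# an earlier-starting event later in the narrative, i.e., a flashback):
events = [("grabbed", "she", "the dog"), ("blanketed", "white snow", "the ground")]
print(build_storyline(events, ["AFTER"]))
# -> grabbed ; she ; the dog <after> blanketed ; white snow ; the ground <eoe>
```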
The chal- lenges of training storyline and story models jointly originate from decoding storylines as inputs for the story model, which involves non-differentiable token selections. Thus, the final loss L θ cannot be directly back-propagated into the storyline model. To overcome this barrier, we adopt reinforcement learning (RL), specifically, the REINFORCE algorithm (Williams et al. 1991) in our end-to-end training. LetR i = R(x i ,r i ). The expected reward with respect to the storyline model can be written asE α [R i ] = E[R i · log(p(S r i |x i ,r i ,α ))]. The gradient to update the storyline model is∇J α =E[R i ·∇ log(p(S r i |x i ,r i ,α ))], which can be approximated with sampling techniques. Motivated by Jingjing Xu et al. (2018), we use negative loss of the story model to construct rewards, that is, R = −L θ . 3 In other words, smaller loss from the story model is associated with larger reward. Algorithm 2 summarizes the overall method. 5.1.4 ExperimentalSetup In this section, we start by describing our research objectives, then we describe our data, evaluation metrics, experimental designs and implementation details aiming to achieve these objectives. The overall research objective is to measure the impact of using temporal prompts in structured storylines. Specifically, can AFTER successfully induceflashbacks ? If so, does that contribute to the interest level of the generated stories while maintaining the overall quality of the texts? 3 We do not use baseline reward as we found this simple reward design works effectively in our experiments. 95 5.1.4.1 Datasets. ROCStories (Mostafazadeh et al. 2016a) and WritingPrompts (Fan et al. 2018) are our experimental datasets. We ensured all reported results using the same test data as the baseline systems (Goldfarb-Tarrant et al. (2020) and P. Xu et al. (2020)). For pretraining data, we use BookCorpus (Zhu et al. 2015). Appendix E.1 shows all details of data splits and pre-processing process. 5.1.4.2 TemporalPromptsConstructions ECONET was finetuned three times with different random seeds, so we take the consensus vote from three models. If there is any disagreement, we label the temporal order as VAGUE. We benchmark ECONET’s annotation performances in Appendix E.4, which shows it provides highly accurate temporal relations. For human evaluations specifically, we consider two prompt settings in order to gauge different impacts of AFTER. 1) for ROCStories, all structured storylines consist of exactly four predefined temporalprompts created following Sec 5.1.3.3. We randomly sample stories with one AFTER prompt from the test data. We will show later in the analysis that vanilla language models would generate more than 80% event pairs with BEFORE relations for ROCStories; AFTER prompt should bring this ratio down if it is effective. 2) for WritingPrompts, since the number of events is not fixed, we randomly sample test stories generated with AFTER prompts for evaluation. 5.1.4.3 AutomaticEvaluationMetrics We use automatic metrics to evaluate the textual quality of stories. We reportRef. PPL: reference stories’ perplexity in our models and Gen. PPL: generated stories’ perplexity scored by GPT-2 (Radford et al. 2019). For diversity, we reportDistinctRatio (%): overall vocabulary:token number ratio. We also report standardBLEU-3 andROUGE L . 96 5.1.4.4 HumanEvaluationMetrics We rely on human annotators to analyze the effectiveness of flashback generations. We request 18 MTurk- ers who succeeded in our previous annotation tasks (Han et al. 
2021a) to evaluate stories produced by our compared models. We host a small qualification round followed by a large annotation task. Only 10 workers are qualified, and we only consider their annotations. Eventually, we collect 106 and 77 sets of valid annotations for ROCStories and WritingPrompts. Temporal diversity. The dominance of BEFORE relation in our data can make models biased toward generating stories with moreBEFORE relations. Therefore, we are interested to see how inserting anAFTER prompt can helpincrease the percentage of non-BEFORE event relations in the generated stories. Let ˆ R r indicate the percentage of a particular relation annotated by MTurkers. We calculate the entropy of the set{ ˆ R r },∀r∈{BEFORE, AFTER, VAGUE} to measure temporal diversity. Accuracy measures the percentage of AFTER being correctly incorporated in the generated stories la- beled by human annotators. We used a relaxed version by counting annotated VAGUE as correct too, as VAGUE can potentially be AFTER. Bothaccuracy andtemporaldiversity can show the effectiveness of generating flashbacks using AFTER. Temporal coherence indicates if the event sequence in a generated story aligns with an annotator’s temporalcommonsense. 1 and 0 correspond to yes and no, respectively. Interest level. Precisely defining interest level is difficult as it is a broad concept. So we focus on the unexpectedness component of cognitive interest. As pointed out by Behrooz (2019), unexpectedness can be further explained as how predictive an event is, which is closely related toflashback generation. Therefore, we define an interesting event as 1) being unexpected or surprising and 2) being logical according to the 97 context and general commonsense. The second definition is not identical to the temporal coherence above. Events contradicting general commonsense can still be temporally coherent. For the compared models, we ask annotators to provide ranks between 1 to K for the generated sto- ries, with K indicating the most interesting story and 1 indicating the least interesting one. We encourage workers to provide different scores for all compared stories, but equal scores are allowed. The max score K depends on the number of compared models, 5 for ROCStories and 4 for WritingPrompts. We provide de- tailed instructions in the interface shown in the appendix. Crucially,interestlevel is separately annotated from other metrics and we ensure annotators do not see the same set of stories in both tasks. 5.1.4.5 ComparedModels Baselines. P. Xu et al. (2020), denoted asMEGATRON, is chosen as the baseline as it outperforms previ- ous systems such as Guan et al. (2020) on ROCStories. We also compare withTemporalBART (S.-T. Lin et al. 2021) as it is pretrained with temporal ordering and event infilling tasks. For WritingPrompts, we com- pare withContentPlanning (Goldfarb-Tarrant et al. 2020) as it also adopts the Plan-and-Write workflow as well as structured event representations. Appendix E.5 describes more details of baseline systems. We describe our own model variants below, 1. Vanilla-Gen uses the parameters of a pretrained language model (LM), specifically BART-base (Lewis et al. 2020), to initialize both the storyline and story models. Its workflow is illustrated in the upper block of Figure 5.2. Since no information other than the prefix (first sentence, prompt, etc.) is used to generate the story, we denote this model as vanilla LM generation orVanilla-Gen. 2. 
Structured-Prompt enhances Vanilla-Gen by using a structured storyline of events to encode temporalprompts, which is associated with the workflow of the bottom block of Figure 5.2. 3. Pretrained. For ROCStories data only, we initialize the storyline model of Structured-Prompt with the pretrained parameters. 98 4. RL uses the same inputs as Structured-Prompt. The difference is that reinforcement learning is used to train storyline and story models jointly. As Algorithm 2 shows,RL-based model is trained following the same forward workflow as Structured-Prompt, but during backpropagation, the storyline models’ parameters are updated. 5.1.5 Results The main results for ROCStories and WritingPrompts are shown in Table 5.1 and Table 5.2 respectively. Examples of generated stories can be found in Table 5.3 and Table E.2 for ROCStories and Table E.3 in the appendix for WritingPrompts. We organize our discussions and analysis in the following sections by answering the four research questions. Q1) Can our proposed models (withtemporalprompts) produce stories with goodtextualquality? Q2) Are our proposed models effective at generating flashbacks ? Q3) Can our proposed models maintain event temporal coherence in stories? Q4) How do our proposed models contribute to stories’interestlevels? AutomaticEvaluation HumanEvaluation Ref. Gen. Distinct BLEU ROUGE L Temporal Accuracy Temporal Interest Models PPL(↓) PPL(↓) Ratio(↑) (↑) (↑) Diversity(↑) (↑) Coherence (↑) Level (↑) Temporal-BART 24.65 19.47 4.10 5.01 19.12 - - - - MEGATRON - 34.14 4.57 2.57 15.23 1.21* - 0.78 2.69 Vanilla-Gen 27.30 19.29 3.99 5.13 19.29 0.88 - 0.88 2.95 +Structured-Prompt 22.85 19.94 4.09 5.07 19.39 1.09 55.75 0.82 3.03 +Pretrained 21.16 19.25 4.01 5.06 19.44 1.07 52.21 0.84 2.96 +RL 15.45 19.42 4.17 5.20 19.49 1.14 56.64 0.86 3.06 Table 5.1: Evaluation results for ROCStories. All values in the bottom block are averaged over three runs. MEGATRON’sRef.PPL is missing as it is not reported in the paper and the model is not publicly available. MEGATRON and Vanilla-Gen do not use any temporal prompts, so accuracy is not applied. MEGA- TRON’stemporaldiversity* is high due to large amount of VAGUE annotations in its generated stories. 5.1.5.1 TextualQuality We measure the textual quality of stories using a wide range of automatic evaluation metrics. 99 Ref. Gen. Distinct BLEU ROUGE L Tok. Pearson Temporal Interest Models PPL (↓) PPL (↓) Ratio (↑) (↑) (↑) Len. (↑) Corr. (↑) Coherence (↑) Level (↑) ContentPlanning - 25.52 1.80 3.46 14.40 252.3 0.04 0.57 2.20 Vanilla-Gen 31.04 11.17 3.50 0.67 9.43 160.2 0.09 0.52 2.49 +Structured-Prompt 30.77 9.30 2.86 1.44 10.95 208.6 0.56 0.49 2.49 +RL 30.98 9.50 2.83 1.39 10.78 203.8 0.57 0.55 2.62 Table 5.2: Evaluation results for WritingPrompts. Pearson correlation approximates the effectiveness of prompts. Perplexity. For ROCStories, all three model variants can improve Ref. PPL against Vanilla-Gen and Temporal-BART while maintaining good Gen. PPL. The weak Gen. PPL of MEGATRON may be at- tributed to its sentence-by-sentence generation pipeline, whereas our models generate an entire story in an integrated step. For WritingPrompts, both model variants improve Gen. PPL over Vanilla-Gen and ContentPlanning while maintaining goodRef. PPL. Token Diversity. For ROCStories, RL-based model improves the Vanilla-Gen by 0.18 per Distinct Ratio. MEGATRON achieves the highest token diversity as it incorporates external knowledge-bases that make the generated stories contain novel tokens. 
For WritingPrompts, we observe longer stories are associated with poorer scores. However, the large increases in Distinct Ratio suggest that the token usages in our proposed models are diverse. BLEU and ROUGE L . For ROCStories, the proposed models perform on-par with Vanilla-Gen and Temporal-BART while outperformingMEGATRON, which generates the shortest stories among all com- pared models. For WritingPrompts, ContentPlanning performs the best partially due to its usage of BART-large models. The overall performances across these three types of automatic metrics suggest that usingtemporal prompts in the Plan-and-Write framework can produce stories with high textual quality. 100 Input [s1] my favorite class was english literature. prompt BEFOREAFTER BEFORE BEFORE [s2] i was in english class for the first time. [s3] i had to take a class with my local author this semester. baseline [s4] i decided to ask him to speak to me at a local book festival. [s5] he said that he was really good and i liked him. [s2] one day, i decided to go to the library. Vanilla [s3] i found a book that i really liked. [s4] i read the whole book in one day. [s5] i was so happy that i went back to school. [s2] i would go to the library to read all the books. [s3] i had never been to a library before. RL [s4] i was so excited to go. [s5] i ended up spending a lot of time in the library. Input [s1] tom was excited to go on a date. prompt AFTER BEFORE BEFORE BEFORE [s2] he went to his favorite restaurant to pick up his date. [s3] the date went well and then tom got drunk. baseline [s4] he told his date they were going to a comedy show. [s5] the date went on and they had a great time at the movie. [s2] he was going to take his girlfriend to the movies. Vanilla [s3] they went to the movie theater. [s4] they had a great time. [s5] tom couldn’t wait to go back again. [s2] he had been looking forward to it for months. [s3] he went to the mall and bought a new dress. RL [s4] he wore it to his date’s house. [s5] his date loved it and they had a great time. Table 5.3: Generated stories by our compared models. baseline: MEGATRON,Vanilla-Gen andRL-based. end-to-end. Input is the first sentence. Highlighted sentences contain an event that started earlier than the preceding event. 5.1.5.2 EffectivenessonFlashbackGeneration The second research question probes the effectiveness of using temporal prompts on generating flash- backs. For ROCStories, all models can generate stories with the same number of events/sentences as the gold stories. This allows annotators to judge pairwise event relations in the generated stories and help us check whether the generated events have relations truthfully reflecting the temporal prompts used. Accuracy is the perfect metric for this. As Table 5.1 shows, the final RL-based model achieves the highest score, which indicates the strongest effectiveness of generating flashbacks . 101 However, temporal prompts are not used in the baselines and Vanilla-Gen. So we compute an approximate measures of effectiveness, temporaldiversity, which indicates how many non-BEFORE re- lations AFTER prompt can induce. Table 5.1 shows thatStructured-Prompt,Pretrained andRL-based models can help improve Vanilla-Gen with more than 80% generated BEFORE relations. MEGATRON achieves the highest score due to the largest amount (29%) ofVAGUE relations (complex or undetermined) annotated by MTurkers shown in Figure 5.3, which is associated with its lowest temporal coherence score. 
For WritingPrompts, stories are long and can contain dialogues or short phrases without any events at all. These properties make the sentence or event alignments between the gold and generated stories worse than for ROCStories, i.e., e_{i,k}, r_{i,k} may not correspond to the k-th sentence in Y_i. Therefore, accuracy cannot be computed. To obtain an approximate metric, we use the tool described in Sec. 5.1.3.3 to annotate neighboring event temporal relations in the generated test stories for all the compared models. Slightly different from temporal diversity, we calculate the total number of machine-annotated AFTER relations, denoted as N̂_{i,A}, in each Ŝ'_i. Let {N_{i,A}} denote the number of AFTER temporal prompts extracted in the gold stories. We compute the Pearson correlation (Benesty et al. 2009) between the sets {N_{i,A}} and {N̂_{i,A}} as the measure.

As Table 5.2 shows, for ContentPlanning and Vanilla-Gen without temporal prompts, the correlations are weak; whereas when temporal prompts are used in both Structured-Prompt and RL-based models, the correlations are strong. Although using models' temporal annotations for the generated stories is not as precise as human annotations, the large differences in correlation provide another piece of evidence that our proposed methods are effective at generating flashbacks.

[Stacked bar chart over MEGATRON, Vanilla-Gen, Structured-Prompt, Pretrained and RL; y-axis: relation distribution (0.0-1.0); legend: <before>, <after>, <vague>.]
Figure 5.3: Relation annotation distribution by MTurkers for compared models trained on ROCStories.

5.1.5.3 Temporal Coherence
Generating flashbacks requires a system to disrupt the monotonic BEFORE sequence, which is the dominant temporal pattern generated by Vanilla-Gen (see Figure 5.3). In other words, flashbacks with at least one AFTER are minority patterns that can be hard to learn from our data, which may result in event sequences violating our temporal commonsense. Thus, we need to check that stories generated with flashbacks maintain good temporal coherence. As shown in Tables 5.1 and 5.2, our proposed models with temporal prompts achieve on-par or slightly lower scores, suggesting little trade-off in temporal coherence when generating flashbacks.

5.1.5.4 Contributions to the Interest Level
As we can observe in Table 5.1, the impact of temporal diversity and coherence on the interest level appears to be complex. To better understand the dynamics among these metrics, we run ordinary least squares (OLS) regressions (Kenney et al. 1965) by setting interest level as the target variable and temporal diversity, coherence and the number of AFTERs as predictors. Since all of these metrics apply to each of the compared stories, the total instances are 530 and 308 for ROCStories and WritingPrompts, respectively.

                        ROCStories           WritingPrompts
                      Coef.    p-value      Coef.    p-value
Temporal Coherence    0.609     0.000*      0.006      0.963
Temporal Diversity   -0.532     0.004*     -0.279      0.410
# AFTER prompt        0.387     0.000*      0.034      0.238
Table 5.4: OLS regression results on temporal coherence, diversity and number of AFTER indicators. The coefficients for the intercept are omitted. * means statistically significant with 99% confidence.

As Table 5.4 shows, for ROCStories, holding the other metrics constant, adding 1 unit to temporal coherence and to temporal diversity leads to a 0.609 increase and a 0.532 decrease in the interest level, respectively. The former result implies that a story lacking event temporal coherence tends to be less interesting.
The latter result sug- gests that increasingtemporaldiversity may lead to less interesting stories, which we hypothesize could be attributed to two factors: 1) BEFORE is dominant in ROCStories, and by using AFTER as prompt, we force models to generate relations less seen in data. 2) Figure 5.3 showstemporaldiversity can increase with more VAGUE relations. Since VAGUE is an undetermined temporal relation even for our annotators, it could make the storyline confusing and thus lead to less interesting stories. The coefficient for the num- ber of AFTER indicators is positive with strong statistical significance. It suggests that holding the other two metrics constant, adding the number of AFTER indicators by 1 contributes to 0.387 increases of the interestlevel. For WritingPrompts, although we are not able to conclude that the estimates are statistically significant, the coefficients have the same signs as ROCStories. Also, we observe that the p-value of the number of AFTER indicators is much lower than the other two variables, which implies a relatively stronger (positive) impact. Sincetemporalprompts in human evaluations all contain at least oneAFTER, these results show that when AFTER prompt successfully produces event pairs with AFTER relation in the final stories, it makes 104 stories more interesting. Now, we can answer the final research question: improving temporaldiversity can helpinterestlevel when AFTER prompts are effective at generating AFTER relations in stories; that is, when flashbacks truly work, stories become more interesting. 5.1.6 RelatedWork Generatingflashbacks has been studied in a few prior works. Bae et al. (2008) is one of the early efforts proposing a planning-based approach to generate flashbacks to evoke surprise in the readers. Follow-up works proposed a cognitive-based model that finds the best location in the original stories to insert a past event (Wu et al. 2016). Our work differs from this line of research by using temporal prompts with pretrained language models to generate integratedflashback in stories. Hoek et al. (2014) studiesflashback in game narrative generation, which is remotely related to our work. Plan-and-Write framework has been shown to be an effective method to enhance the explainability and controllability of story generation. L. Yao et al. (2019) enables machines to produce a sequence of keywords prior to generating stories. Follow-up works leverage commonsense or external knowledge to enhance the quality of stories (Guan et al. 2020; B. Tan et al. 2021; P. Xu et al. 2020). Goldfarb-Tarrant et al. (2020) is one of our compared works that incorporates SRL extracted event representations in storylines and train models with several event-related decoding objectives. Our work differs from it by explicitly encodingtemporalprompts in event plots that facilitates flashback . Structured representation such as discourse structure (Guan et al. 2021), story keywords (Goldfarb- Tarrant et al. 2019; Peng et al. 2018) and event/plot graph (Ammanabrolu et al. 2021, 2019) have been widely used in story generation to enable models to output diverse stories, but they are remotely related to our flashback generation task. 105 Reinforcementlearning has also been explored in two-stage story generation such as Jingjing Xu et al. (2018) and Tambwekar et al. (2019). Our motivation of using RL-based generation is to enhance the effectiveness of temporalprompts. Event temporal reasoning helps the construction of the temporal prompts. 
It has been studied in story understanding (Han et al. 2019b), information extraction (Han et al. 2019a,c, 2020; Ma et al. 2021; Ning et al. 2017), QA (Ning et al. 2020b; B. Zhou et al. 2019), and event generation (M. Li et al. 2021; S.-T. Lin et al. 2021). Our proposed method is the pioneer work to introduce eventtemporalprompts inflashback generation. 5.1.7 Conclusions We propose to generate flashbacks in stories by encoding temporal prompts in structured storylines. Ex- perimental and evaluation results show our proposed systems can produce fluent, temporally coherent, and more interesting stories. Future work can focus on improving temporal prompts so that they han- dle minority cases (i.e. prompts with one or more AFTER indicators) more effectively. Designing more powerful rewards in reinforcement learning is another promising research direction. 106 5.2 GeneratingConsistentStoriesviaEventKnowledgeIncorporation Besides temporal consistency, another critical objective to generate logical stories is to ensure stories’ events and their relations follow human commonsense. While recent research efforts that inject com- monsense knowledge in pre-trained language models (PTLMs) have made great progress, two challenges remain for them to be reliable for story generation: 1) the discrepancy of event representations between external commonsense knowledge graphs (CSKGs) and the target story data makes knowledge consoli- dation difficult; 2) it is hard to probe whether knowledge is truly incorporated with the current implicit knowledge incorporation methods such as pretraining or multitask learning. We address these short- comings with the Plan-and-Write framework where we encode knowledge in plans with abstract event representations similar to CSKGs, which alleviates the data discrepancy issue. We then leverage a plan- to-story alignment method to facilitate explicit knowledge incorporation. Experimental results show that our proposed method can generate stories with better logical consistency than strong baselines. 5.2.1 MotivationsandContributions A story develops based on how its events unfold. Thus, creating a sequence of events that align with human commonsense is crucial for composing a logical and consistent story (Chen et al. 2021). Event commonsense is often demonstrated via reasonable relations between events. For example,feelinghun- gry⟨Before⟩ eat is a proper temporal relation, and eat a large meal⟨Causes⟩ feeling full is a logical causal relation. Simply flipping the event (physical) order in both cases would make these relations not logical. Recent developments in PTLMs appear to learn the textual or superficial form of event knowledge by training with a gigantic amount of text corpora (Rogers et al. 2021; X. Zhou et al. 2020). However, PTLMs cannot guarantee event commonsense in story generation due to their insufficiency in event reasoning 107 Figure 5.4: Stories generated by directly finetuning PTLMs (top) v.s. our model (bottom). Red highlights indicate lack of event commonsense. Dotted gray blocks showabstract events + relations in CSKG format. (Guan et al. 2020; Mao et al. 2019). The top block in Figure 5.4 shows a problematic story generated by directly finetuning PTLMs on our target data: the event transitions from [s2] to [s3] and from [s4] to [s5] contradict our commonsense, which we hypothesize, could be attributed to lack of inter-sentence reasoning guidance. Previous works studied injecting commonsense knowledge into PTLMs for story generation. Guan et al. 
(2020) conducted intermediate fine-tuning of PTLMs with transformed samples from commonsense knowledge graph (CSKG) such as ConceptNet (X. Li et al. 2016) and ATOMIC (Sap et al. 2019) and further fine-tuned models on target data with adversarial story samples. Mao et al. (2019) leveraged a multi-tasking learning approach by training PTLMs with discriminative tasks on transformed samples from common- sense data such as SWAG (Zellers et al. 2018) . The drawback of these approaches is that external data sourcesdifferdrasticallyfromthetargetstorydata : As shown in Figure 5.4 bottom panel,CSKGData con- tain abstract events whereas actual stories consist of long sentences; similarly, SWAG consists of multiple- choice QAs, which relates remotely to story generation. Consequently, it is difficult to gauge whether or how commonsense knowledge is injected. We propose a moreexplicit commonsense knowledge incorporation framework by enhancing the Plan- and-Write approach (L. Yao et al. 2019) that breaks story generation into two stages: (1) plan generation 108 Figure 5.5: An illustration of our overall model. We use the first sentence ([s1]) as input. The upper block shows story generation based on finetuning PTLM directly. The bottom block shows our improved Plan- and-Write design by embedding commonsense labels in the event plan. The plan-to-story alignment encourages stronger coverage of the plan in the generated story (bottom-right); the alignment would be low if the story on the top-right were generated. and (2) plan-to-story generation. As shown in Figure 5.5, our plan consists of abstract events similar to the CSKG format. We train a commonsense classifier using ATOMIC 20 20 (Hwang et al. 2021), a recent CSKG that contains broad categories of pairwise event relations. This classifier annotates training labels in plans that capture the relations between neighboring events (e.g. ⟨xWant⟩ in Figure 5.4 and 5.5). In Stage (1), predicting relation labels as an integrated component in the language model objective guides models to generate reasonable event plans. In Stage (2), we adopt a plan-to-story semantic alignment method, i.e., semantic loss (Jingyi Xu et al. 2018) as an auxiliary training objective to encourage the improved event plans to be incorporated into stories. We briefly summarize our contributions: 1) we propose to encode commonsense knowledge as event relations in abstract event plans that align better with CSKGs. 2) By training the story generation model with a semantic alignment objective, we encourage higher coverage of the commonsense empowered event plans in the final stories. 3) Both event relation labels and semantic alignments are shown to be effective at improving stories’ logical consistency based on our evaluation results. 109 5.2.2 Methods We first describe the Plan-and-Write story generation framework with reinforcement learning (RL). Then we will discuss our proposed method that adapts the framework with event commonsense labels and plan- to-story semantic alignment. RL-based Plan-and-Write Framework Given a leading sentence x i , the target task is to generate the remaining story sentencesY i , which can be represented by a conditional language generation model p(Y i |x i ) as illustrated in the top block of Figure 5.5. Adapting the Plan-and-Write system (L. Yao et al. 
2019), we can divide the target task into two sub-modules (the bottom block of Figure 5.5): 1) the plan model with training objective L_α = −log p(S_i | x_i, α), where S_i is the event plan associated with the story Y_i; 2) the story model with training objective L_β = −log p(Y_i | x_i, Ŝ_i, β). α and β denote the model parameters. As Han et al. (2022), Tambwekar et al. (2019), and Jingjing Xu et al. (2018) point out, to train this system end-to-end, the plan Ŝ_i needs to be decoded during training. This discrete token selection process prevents the loss of the story model from being back-propagated into the plan model. To encourage the plan model to adapt jointly with the story model, researchers incorporated the REINFORCE algorithm (Williams et al. 1991). Let R_i denote the reward. The gradient to update the plan model is ∇J_α = E[R_i · ∇log p(S_i | x_i, r_i, α)]. Following Han et al. (2022), we use the negative loss of the story model as the reward, so R_i = −L_β.

Event Plans with Commonsense Labels. Let S_i = {e_{i,1}, r_{i,1}, e_{i,2}, r_{i,2}, ..., e_{i,k}, r_{i,k}, ..., e_{i,n}}, where e_{i,k} and r_{i,k} represent the k-th event and the k-th commonsense label respectively. The bottom block of Figure 5.5 shows an example plan: each e_{i,k} consists of a verb v_{i,k} and two arguments (a^1_{i,k}, a^2_{i,k}). For example, the second event she takes break can be denoted as e_{i,2} = (a^1_{i,2}, v_{i,2}, a^2_{i,2}). Commonsense labels r_{i,k} (e.g., ⟨xWant⟩, ⟨xEffect⟩ and ⟨xReact⟩ defined by ATOMIC 2020) are event relations inserted between a pair of neighboring events. In our models, r_{i,k} are special tokens predicted together with the other event tokens, except for r_{i,1}, which is given as input.

Plan-to-Story Semantic Alignment. After injecting event commonsense knowledge into the plan via relation labels, we want to encourage the improved plan to be incorporated into the final stories. In an end-to-end Plan-and-Write system, there are no training signals forcing the generated stories Ŷ_i to always contain Ŝ_i, as the plans are now predictions rather than ground-truths. This may lead to low coverage of the plan and thus ineffective knowledge incorporation in the final stories. We address this issue by training the story model with an auxiliary semantic loss (Jingyi Xu et al. 2018). Let y_{i,k} ∈ Ŷ_i denote the k-th predicted story sentence. Dropping the sample index i, the general semantic loss can be written as

L_s(C, p) = −log Σ_{y_k ⊨ C} [ Π_{y^j_k ∈ P_k} p_j · Π_{y^j_k ∈ N_k} (1 − p_j) ],

where C is the plan-to-story constraint that requires each component in the k-th event, i.e., v_k, a^1_k and a^2_k, to appear exactly once in the k-th story sentence. P_k indicates the "positive" text span in y_k that corresponds to the event components, whereas N_k indicates the "negative" text span that does not contain event components. p_j denotes the predicted probability for the j-th token in y_k. Intuitively, this objective implies that when a predicted sentence satisfies the plan-to-story constraint, i.e., y_k ⊨ C, we reinforce such predictions by pushing up the likelihood of tokens corresponding to v_k, a^1_k or a^2_k while reducing the likelihood of irrelevant tokens. In practice, we use a token matching function to check whether v_k, a^1_k or a^2_k appears in y_k. We construct separate constraints C_v, C_{a^1}, C_{a^2} for v, a^1, a^2, as stories cover individual components more frequently than entire event spans, particularly in the initial phase of training. Combining all components, the final semantic loss is L_s(C, p) = L_s(C_v, p) + L_s(C_{a^1}, p) + L_s(C_{a^2}, p).
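The snippet below sketches how this alignment objective can be computed for a single generated sentence, before the illustration that follows. It is a minimal sketch under our own simplifying assumptions: the constraint check is a plain token-matching heuristic, p_j is taken to be the probability the story model assigned to each decoded token, and the function names are illustrative rather than the exact training code.

```python
import torch

def semantic_loss_component(token_probs, component_tokens, sentence_tokens):
    """Semantic loss L_s for one plan component (v_k, a1_k or a2_k) and one sentence y_k.

    token_probs: 1-D tensor of probabilities p_j assigned to the decoded tokens of y_k.
    component_tokens: tokens of the plan component that should appear in y_k.
    sentence_tokens: the decoded tokens of y_k (same length as token_probs).
    """
    positive_mask = torch.tensor(
        [tok in component_tokens for tok in sentence_tokens], dtype=torch.bool
    )
    # Token-matching approximation of y_k |= C: every component token must occur.
    satisfied = all(tok in sentence_tokens for tok in component_tokens)
    if not satisfied:
        return token_probs.new_zeros(())  # no reinforcement when the constraint fails
    eps = 1e-8
    positive_term = torch.log(token_probs[positive_mask] + eps).sum()         # push plan tokens up
    negative_term = torch.log(1.0 - token_probs[~positive_mask] + eps).sum()  # push the rest down
    return -(positive_term + negative_term)

def semantic_loss(token_probs, verb, arg1, arg2, sentence_tokens):
    """L_s(C_v, p) + L_s(C_a1, p) + L_s(C_a2, p) for one generated sentence."""
    return sum(
        semantic_loss_component(token_probs, comp, sentence_tokens)
        for comp in (verb, arg1, arg2)
    )

# Illustrative usage with made-up values: only the verb and arg1 constraints are
# active because "break" does not appear in the generated sentence.
probs = torch.tensor([0.9, 0.8, 0.6, 0.3, 0.2])
tokens = ["she", "takes", "the", "plate", "quickly"]
print(semantic_loss(probs, ["takes"], ["she"], ["break"], tokens))
```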
For illustration, consider the second event "she takes break" in Figure 5.5. If the corresponding generated sentence were "she takes the plate to the customer," then only $C_v$ and $C_{a^1}$ are active, with nonzero $\mathcal{L}_s(C_v, p)$ and $\mathcal{L}_s(C_{a^1}, p)$; in other words, we only increase the probabilities of "she" and "takes", which align with the plan. Note that we use greedy decoding for the stories to speed up training and found it to work well when switching to beam search at inference time; event plans are always decoded with beam search. The overall training objective is $\mathcal{L} = \mathcal{L}_\alpha + w_\beta \mathcal{L}_\beta + w_s \mathcal{L}_s$, where $w_\beta$ and $w_s$ are hyper-parameters that balance the different losses.

5.2.3 Experimental Setup

In this section, we describe the data, evaluation metrics, model implementations, and compared baselines.

Data. We use ROCStories (Mostafazadeh et al. 2016a), with 5-sentence stories, as our experimental data. Following Han et al. (2022) and P. Xu et al. (2020), we split the data into 88,344/4,908/4,909 stories for the train/validation/test sets. The first sentence is used as input, whereas the remaining four sentences are the outputs. To build the commonsense relation classifier, we use ATOMIC 2020 (Hwang et al. 2021). We follow its standard train/dev/test splits but select only 10 relations (details in Appendix F.1). We also replace the "personX" and "personY" tokens with randomly sampled names. Pairs of heads and tails are concatenated as inputs to train the relation classifier, which achieves >87% F1 scores (Appendix F.2).

Event Plan Construction. As Figure 5.5 shows, for each story sentence we first use the semantic role labeling (SRL) tool provided by Shi et al. (2019) to parse its trigger $v_{i,k}$ and two arguments $a^1_{i,k}$ and $a^2_{i,k}$. We then convert this representation into the textual form "$a^1_{i,k}$ $v_{i,k}$ $a^2_{i,k}$," which aligns better with the abstract events in ATOMIC 2020. To obtain relation labels, we apply the classifier described above to neighboring events to predict their relations.

Automatic Evaluation Metrics. 1) Ref. PPL: the perplexity of the reference stories under our models; 2) Distinct Ratio (%): the overall vocabulary-to-token ratio, measuring diversity; 3) standard BLEU-3 and ROUGE-L; 4) for models based on Plan-and-Write, the coverage ratio of event-plan tokens in the stories, denoted Cover-Token (a minimal sketch of this metric is given at the end of this subsection).

Human Evaluation Metrics. We rely on human annotators to evaluate the event logical consistency of the generated stories. We randomly sampled 200 stories from the test data and recruited 5 MTurkers who had succeeded in our previous event annotation or story evaluation tasks. We define three metrics: 1) SC: whether a generated story is logically consistent; 2) PC: whether a generated plan is logically consistent; 3) Cover: whether a story follows its plan. All three metrics are originally on a Likert scale ("definitely yes," "likely yes," "likely no," "definitely no"); we also merge them into binary scores based on the yes/no answers. PC and Cover only apply to models based on Plan-and-Write. The annotation interfaces can be found at the end of the appendix.

Compared Models. 1) BART-base (Lewis et al. 2020), fine-tuned directly on the ROCStories data; 2) Vanilla: our implementation of the RL-based Plan-and-Write model provided by Han et al. (2022), which also uses BART-base; 3) CS-label: Vanilla augmented with commonsense labels; 4) S-Align: CS-label augmented with the semantic loss; 5) we also compare with P. Xu et al. (2020), as it outperforms previous commonsense-injection work such as Guan et al. (2020) on story consistency metrics (the focus of this work), and we do not have enough public information to re-train Guan et al. (ibid.) with ATOMIC 2020.
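As referenced in the automatic metrics above, the Cover-Token ratio can be computed with simple token matching between each plan and its generated story. The sketch below illustrates that idea; the tokenization, lower-casing, and treatment of relation special tokens are assumptions, not the exact evaluation script.

```python
# Hypothetical sketch of the Cover-Token metric: the fraction of event-plan
# tokens (excluding relation special tokens) that appear in the generated story.

from typing import List


def cover_token(plan: str, story: str, special_prefix: str = "<") -> float:
    plan_tokens = [t.lower() for t in plan.split()
                   if not t.startswith(special_prefix)]   # drop <xWant>-style labels
    story_tokens = {t.lower() for t in story.split()}
    if not plan_tokens:
        return 0.0
    covered = sum(1 for t in plan_tokens if t in story_tokens)
    return covered / len(plan_tokens)


def corpus_cover_token(plans: List[str], stories: List[str]) -> float:
    """Average coverage over a corpus of (plan, story) pairs."""
    scores = [cover_token(p, s) for p, s in zip(plans, stories)]
    return sum(scores) / len(scores)
```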
5.2.4 Results

In this section, we answer the following research questions. Q1: Do our proposed models produce stories that maintain good textual/syntactic quality? Q2: Do our proposed models produce stories with more logical consistency than the baselines? Q3: If so, how do our plan models contribute? In Appendix F.4, we provide several examples of generated stories.

Models       PPL (↓)   Distinct Ratio   BLEU   ROUGE   SC Binary   SC Likert
Megatron     -         4.57             2.57   19.12   0.33        1.05
BART-base    26.89     4.05             5.13   19.20   0.59        1.86
Vanilla      16.09     4.04             5.10   19.21   0.66        1.86
+CS-label    15.66     4.00             5.27   19.49   0.69*       1.89
+S-Align     15.80     4.13             5.03   19.46   0.73**      2.04

Table 5.5: Automatic and human evaluation results for stories. All values in the middle columns are averaged over three runs except for the baseline P. Xu et al. (2020) (provided by Han et al. (2022)). * and ** indicate statistical significance with p-values < 0.05 and < 0.01, respectively, per McNemar's test for the improvements over the BART-base model.

Q1. As Table 5.5 shows, both proposed models, +CS-label and +S-Align, perform on par with or better than the baselines on the automatic evaluation metrics. This implies that, regardless of semantic quality, our models generate stories with equally good textual quality (in terms of fluency, token diversity, etc.).

Q2. Table 5.5 also shows that, based on human evaluations, our proposed models generate more logically consistent stories. The improvement from BART-base to Vanilla shows the benefit of using Plan-and-Write systems, and the improvements of our proposed methods over Vanilla highlight the contribution of injecting event commonsense knowledge with our proposed methods. The gains in binary SC are statistically significant against the BART-base baseline (further annotator analysis can be found in Appendix F.5).

Q3. We further investigate how our plan models contribute to the improved logical consistency of the final stories. First, as Table 5.6 shows, our proposed models produce higher-quality plans than Vanilla. This is manifested by both better automatic metrics and stronger human preference (the binary and Likert PC scores). Moreover, we observe that token and semantic coverage both increase slightly for +CS-label and improve drastically for +S-Align. Combining these results, we show that we can effectively improve event plan quality with commonsense labels, and that by encouraging these commonsense-empowered event plans to show up in the final stories, we enable our overall system to produce more logically consistent stories.

Models       PPL (↓)   BLEU   ROUGE   Cover-Token   PC Binary   PC Likert   Cover
Vanilla      71.58     4.59   18.02   0.55          0.75        2.02        0.48
+CS-label    67.69     5.14   20.99   0.58          0.77        2.07        0.49
+S-Align     69.15     5.10   20.89   0.77          0.78        2.06        0.79

Table 5.6: Automatic and human evaluation results for the plan model. Values in the middle columns (automatic evaluation) are averaged over three runs.

5.2.5 Related Work

The Plan-and-Write framework (L. Yao et al. 2019) is an effective method to enhance the explainability and controllability of story generation. Follow-up works in this direction leverage external knowledge to enhance story quality (Guan et al. 2020; Han et al. 2022; B. Tan et al. 2021; P. Xu et al. 2020). Incorporating CSKGs for story generation has been explored in a few previous works.
Guan et al. (2020) conducted a second-step pre-training of PTLMs with transformed samples from ATOMIC before fine-tuning on story data. Ammanabrolu et al. (2021) leveraged COMET (Bosselut et al. 2019), trained on ATOMIC, to generate upcoming events and joined these events directly as the final stories. L. Lin et al. (2022) fine-tuned PTLMs on CSKGs (ATOMIC + ConceptNet) and augmented the event generation model with prompt training. Our model differs from these prior works by proposing an end-to-end Plan-and-Write framework enhanced with plan-to-story alignment, which is capable of producing complete and consistent stories while enabling more explicit event knowledge injection.

5.2.6 Conclusions

We reinforce logical consistency in story generation by adapting the Plan-and-Write framework. With commonsense labels specifying the logical connections between neighboring events and a plan-to-story alignment encouraging the incorporation of the event plans, we improve the consistency of stories while maintaining good textual quality.

Chapter 6
Conclusions

We advocate leveraging event-centric reasoning to build more intelligent NLP models. Specifically, we propose to incorporate event knowledge for reasoning via neuro-symbolic networks and targeted pre-trained language models. We apply them to event relation extraction and question-answering tasks and show their effectiveness. We also describe our data collection effort, which facilitates event semantic understanding by adopting natural language queries. Finally, we apply our event reasoning tools in story generation systems to help machines produce more interesting and logical stories.

Future research in event-centric reasoning can focus on incorporating more scalable event knowledge or designing more effective algorithms for knowledge injection. For event-reasoning QA, more research effort is needed to close the performance gap between current SOTA models and human/expert baselines. For story generation, we believe the Plan-and-Write framework is a promising direction, so continuing to enhance event commonsense in the plan model and ensuring alignment between the plan and story models would help improve the quality of machine-generated stories.

Bibliography

ACE (2005). “The ACE 2005 (ACE 05) Evaluation Plan: Evaluation of the Detection and Recognition of ACE Entities, Values, Temporal Expressions, Relations, and Events”. In.

Alwitt, L. (2002). “Maintaining Attention to a Narrative Event”. In: Shohov, S.P. (ed.) Advances in Psychology Research 18, pp. 99–114.

Ammanabrolu, Prithviraj, Wesley Cheung, William Broniec, and Mark O Riedl (2021). “Automated Storytelling via Causal, Commonsense Plot Ordering”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 7, pp. 5859–5867.

Ammanabrolu, Prithviraj, Ethan Tien, Wesley Cheung, Zhaochen Luo, William Ma, Lara J. Martin, and Mark O. Riedl (2019). “Story Realization: Expanding Plot Events into Sentences”. In: Proceedings of AAAI 2020. url: https://arxiv.org/abs/1909.03480.

Bae, Byung-Chull and R. Michael Young (2008). “A Use of Flashback and Foreshadowing for Surprise Arousal in Narrative Using a Plan-Based Approach”. In: Interactive Storytelling. Ed. by Ulrike Spierling and Nicolas Szilas. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 156–167.

Behrooz, Morteza (2019). “Curating Interest in Open Story Generation”. English. Copyright - Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works; Last updated - 2021-05-25.
PhD thesis, p. 164.isbn: 9781392460962.url: http://libproxy.usc.edu/login?url=https://www.proquest.com/dissertations-theses/curating- interest-open-story-generation/docview/2355979367/se-2. Bethard, Steven (2013). “ClearTK-TimeML: A minimalist approach to TempEval 2013”. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh InternationalWorkshoponSemanticEvaluation(SemEval2013). Atlanta, Georgia, USA: Association for Computational Linguistics, pp. 10–14.url: http://aclweb.org/anthology/S13-2002. Bethard, Steven, Leon Derczynski, Guergana Savova, James Pustejovsky, and Marc Verhagen (June 2015). “SemEval-2015 Task 6: Clinical TempEval”. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Denver, Colorado: Association for Computational Linguistics, pp. 806–814.doi: 10.18653/v1/S15-2136. 118 Bethard, Steven, Guergana Savova, Wei-Te Chen, Leon Derczynski, James Pustejovsky, and Marc Verhagen (June 2016). “SemEval-2016 Task 12: Clinical TempEval”. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). San Diego, California: Association for Computational Linguistics, pp. 1052–1062.doi: 10.18653/v1/S16-1165. Bethard, Steven, Guergana Savova, Martha Palmer, and James Pustejovsky (Aug. 2017). “SemEval-2017 Task 12: Clinical TempEval”. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, pp. 565–572.doi: 10.18653/v1/S17-2093. Bisk, Yonatan, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi (Apr. 2020). “PIQA: Reasoning about Physical Commonsense in Natural Language”. In: Proceedings of the AAAI Conference on Artificial Intelligence 34.05, pp. 7432–7439.doi: 10.1609/aaai.v34i05.6239. Bosselut, Antoine, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi (July 2019). “COMET: Commonsense Transformers for Automatic Knowledge Graph Construction”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 4762–4779.doi: 10.18653/v1/P19-1470. Bramsen, Philip, Pawan Deshpande, Yoong Keok Lee, and Regina Barzilay (July 2006). “Inducing Temporal Graphs”. In: EMNLP. Sydney, Australia.url: https://dl.acm.org/citation.cfm?id=1610105. Brewer, W.F. and E. H. Lichtenstein (1981). “Event Schemas, Story Schemas, and Story Grammars”. In: Long, J. and Baddeley, A. (eds.) Attention and Performance 9, pp. 263–279. — (1982). “Stories Are to Entertain: A Structural-Affect Theory of Stories”. In: Journal of Pragmatics 6, pp. 473–483. Carey, Gary and Mary Ellen Snodgrass (1999). A multicultural dictionary of literary terms. Jefferson, N.C. Caselli, Tommaso and Piek Vossen (Aug. 2017). “The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction”. In: Proceedings of the Events and Stories in the News Workshop. Vancouver, Canada: Association for Computational Linguistics, pp. 77–86.doi: 10.18653/v1/W17-2711. Cassidy, Taylor, Bill McDowell, Nathanael Chambers, and Steven Bethard (2014). “An Annotation Framework for Dense Event Ordering”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 501–506.doi: 10.3115/v1/P14-2082. Chambers, Nate (June 2013). “NavyTime: Event and Time Ordering from Raw Text”. 
In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh InternationalWorkshoponSemanticEvaluation(SemEval2013). Atlanta, Georgia, USA: Association for Computational Linguistics, pp. 73–77.url: http://www.aclweb.org/anthology/S13-2012. Chambers, Nathanael, Taylor Cassidy, Bill McDowell, and Steven Bethard (Oct. 2014). “Dense Event Ordering with a Multi-Pass Architecture”. In: ACL.url: http://www.aclweb.org/anthology/Q14-1022. 119 Chambers, Nathanael and Dan Jurafsky (Oct. 2008). “Jointly Combining Implicit Constraints Improves Temporal Ordering”. In: EMNLP. Honolulu, United States.url: https://dl.acm.org/citation.cfm?id=1613803. Chambers, Nathanael, Shan Wang, and Dan Jurafsky (2007). “Classifying Temporal Relations Between Events”. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. ACL ’07. Prague, Czech Republic: Association for Computational Linguistics, pp. 173–176. url: http://dl.acm.org/citation.cfm?id=1557769.1557820. Chen, Muhao, Hongming Zhang, Qiang Ning, Manling Li, Heng Ji, Kathleen McKeown, and Dan Roth (Aug. 2021). “Event-Centric Natural Language Processing”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Tutorial Abstracts. Online: Association for Computational Linguistics, pp. 6–14. doi: 10.18653/v1/2021.acl-tutorials.2. Cheng, Fei and Yusuke Miyao (2017). “Classifying temporal relations by bidirectional LSTM over dependency paths”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2, pp. 1–6. Chikka, Veera Raghavendra (June 2016). “CDE-IIITH at SemEval-2016 Task 12: Extraction of Temporal Information from Clinical documents using Machine Learning techniques”. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). San Diego, California: Association for Computational Linguistics, pp. 1237–1240.doi: 10.18653/v1/S16-1192. Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio (Oct. 2014). “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, pp. 1724–1734.doi: 10.3115/v1/D14-1179. Clark, Kevin, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning (2020). “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators”. In: International Conference on Learning Representations.url: https://openreview.net/forum?id=r1xMH1BtvB. Cohen, Jacob (1960). “A Coefficient of Agreement for Nominal Scales”. In: Educational and Psychological Measurement 20.1, pp. 37–46.doi: 10.1177/001316446002000104. eprint: https://doi.org/10.1177/001316446002000104. Denis, Pascal and Philippe Muller (Aug. 2011). “Predicting globally-coherent temporal structures from texts via endpoint inference and graph decomposition.” In: IJCAI. Barcelone, Spain.url: https://hal.inria.fr/inria-00614765/document. Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (June 2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. 
In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 4171–4186.doi: 10.18653/v1/N19-1423. 120 Dligach, Dmitriy, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova (Apr. 2017). “Neural Temporal Relation Extraction”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Valencia, Spain: Association for Computational Linguistics, pp. 746–751.url: https://www.aclweb.org/anthology/E17-2118. Do, Quang, Yee Seng Chan, and Dan Roth (July 2011). “Minimally Supervised Event Causality Identification”. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK.: Association for Computational Linguistics, pp. 294–303.url: https://aclanthology.org/D11-1027. Do, Quang Xuan, Wei Lu, and Dan Roth (Sept. 2012). “Joint Inference for Event Timeline Construction”. In: EMNLP. Jeju, Korea.url: http://cogcomp.org/papers/DoLuRo12.pdf. Du, Xinya and Claire Cardie (Nov. 2020). “Event Extraction by Answering (Almost) Natural Questions”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 671–683.doi: 10.18653/v1/2020.emnlp-main.49. Fan, Angela, Mike Lewis, and Yann Dauphin (July 2018). “Hierarchical Neural Story Generation”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 889–898.doi: 10.18653/v1/P18-1082. — (July 2019). “Strategies for Structuring Story Generation”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 2650–2660.doi: 10.18653/v1/P19-1254. Galvan, Diana, Naoaki Okazaki, Koji Matsuda, and Kentaro Inui (Oct. 2018). “Investigating the Challenges of Temporal Relation Extraction from Clinical Text”. In: Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis. Brussels, Belgium: Association for Computational Linguistics, pp. 55–64.doi: 10.18653/v1/W18-5607. Garg, Siddhant, Thuy Vu, and Alessandro Moschitti (Apr. 2020). “TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection”. In: Proceedings of the AAAI Conference on Artificial Intelligence 34.05, pp. 7780–7788.issn: 2159-5399.doi: 10.1609/aaai.v34i05.6282. Glavaš, Goran, Jan Šnajder, Marie-Francine Moens, and Parisa Kordjamshidi (May 2014). “HiEve: A Corpus for Extracting Event Hierarchies from News Stories”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 3678–3683.url: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1023_Paper.pdf. Goldfarb-Tarrant, Seraphina, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng (Nov. 2020). “Content Planning for Neural Story Generation with Aristotelian Rescoring”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 4319–4338.doi: 10.18653/v1/2020.emnlp-main.351. 121 Goldfarb-Tarrant, Seraphina, Haining Feng, and Nanyun Peng (June 2019). 
“Plan, Write, and Revise: an Interactive System for Open-Domain Story Generation”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 89–97.doi: 10.18653/v1/N19-4016. Guan, Jian, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang (2020). “A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation”. In: Transactions of the Association for Computational Linguistics 8, pp. 93–108.doi: 10.1162/tacl_a_00302. Guan, Jian, Xiaoxi Mao, Changjie Fan, Zitao Liu, Wenbiao Ding, and Minlie Huang (Aug. 2021). “Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 6379–6393.doi: 10.18653/v1/2021.acl-long.499. Han, Rujun, Hong Chen, Yufei Tian, and Nanyun Peng (2022). “Go Back in Time: Generating Flashbacks in Stories with Event Temporal Prompts”. In: 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). Han, Rujun, I-Hung Hsu, Jiao Sun, Julia Baylon, Qiang Ning, Dan Roth, and Nanyun Peng (Nov. 2021a). “ESTER: A Machine Reading Comprehension Dataset for Reasoning about Event Semantic Relations”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 7543–7559.doi: 10.18653/v1/2021.emnlp-main.597. Han, Rujun, I-Hung Hsu, Mu Yang, Aram Galstyan, Ralph Weischedel, and Nanyun Peng (Nov. 2019a). “Deep Structured Neural Network for Event Temporal Relation Extraction”. In: Proceedingsof the 23rd Conference on Computational Natural Language Learning (CoNLL). Hong Kong, China: Association for Computational Linguistics, pp. 666–106.doi: 10.18653/v1/K19-1062. Han, Rujun, Mengyue Liang, Bashar Alhafni, and Nanyun Peng (2019b). “Contextualized Word Embeddings Enhanced Event Temporal Relation Extraction for Story Understanding”. In: arXiv preprint arXiv:1904.11942. Han, Rujun, Qiang Ning, and Nanyun Peng (Nov. 2019c). “Joint Event and Temporal Relation Extraction with Shared Representations and Structured Prediction”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 434–444.doi: 10.18653/v1/D19-1041. Han, Rujun, Xiang Ren, and Nanyun Peng (Nov. 2021b). “ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 5367–5380.url: https://aclanthology.org/2021.emnlp-main.436. 122 Han, Rujun, Yichao Zhou, and Nanyun Peng (Nov. 2020). “Domain Knowledge Empowered Structured Neural Net for End-to-End Event Temporal Relation Extraction”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 5717–5729.doi: 10.18653/v1/2020.emnlp-main.461. Harmon, William (2012). Handbook to Literature. Pearson. 
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural computation 9.8, pp. 1735–1780. Hoek, Marissa, Mariët Theune, and Jeroen Linssen (2014). “Generating Game Narratives with Focalization and Flashbacks”. In: vol. 10. 4.url: https://ojs.aaai.org/index.php/AIIDE/article/view/12758. Holtzman, Ari, Jan Buys, Maxwell Forbes, Antoine Bosselut, David Golub, and Yejin Choi (July 2018). “Learning to Write with Cooperative Discriminators”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 1638–1649.doi: 10.18653/v1/P18-1152. Huang, Lifu, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi (Nov. 2019). “Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 2391–2401.doi: 10.18653/v1/D19-1243. Hwang, Jena D., Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi (2021). “COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs”. In: AAAI. Jawahar, Ganesh, Benoıt Sagot, and Djamé Seddah (July 2019). “What Does BERT Learn about the Structure of Language?” In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 3651–3657. doi: 10.18653/v1/P19-1356. Jia, Robin and Percy Liang (Sept. 2017). “Adversarial Examples for Evaluating Reading Comprehension Systems”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, pp. 2021–2031.doi: 10.18653/v1/D17-1215. Joshi, Mandar, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy (2020). “SpanBERT: Improving Pre-training by Representing and Predicting Spans”. In: Transactions of the Association for Computational Linguistics 8, pp. 64–77.doi: 10.1162/tacl_a_00300. Kahneman, Daniel (2003). “Maps of Bounded Rationality: Psychology for Behavioral Economics”. In: The American Economic Review 93.5, pp. 1449–1475.issn: 00028282.url: http://www.jstor.org/stable/3132137. Kenney, J. F. and E. S. Keeping (1965). “OLS Regression via Normal Equations”. In: Mathematics of Statistics. 3rd ed. Princeton, NJ. Chap. 15, pp. 252–285. 123 Khashabi, Daniel, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi (Nov. 2020). “UNIFIEDQA: Crossing Format Boundaries with a Single QA System”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, pp. 1896–1907.doi: 10.18653/v1/2020.findings-emnlp.171. Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut (2020). “ALBERT: A Lite BERT for Self-supervised Learning of Language Representations”. In:url: https://iclr.cc/virtual_2020/poster_H1eA7AEtvS.html. Landis, J. Richard and Gary G. Koch (1977). “The Measurement of Observer Agreement for Categorical Data”. In: Biometrics 33.1, pp. 159–174.issn: 0006341X, 15410420.url: http://www.jstor.org/stable/2529310. Laokulrat, Natsuda, Makoto Miwa, Yoshimasa Tsuruoka, and Takashi Chikayama (June 2013). 
“UTTime: Temporal Relation Classification using Deep Syntactic Features”. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Atlanta, Georgia, USA: Association for Computational Linguistics, pp. 88–92.url: http://www.aclweb.org/anthology/S13-2015. Lee, Hee-Jin, Hua Xu, Jingqi Wang, Yaoyun Zhang, Sungrim Moon, Jun Xu, and Yonghui Wu (June 2016). “UTHealth at SemEval-2016 Task 12: an End-to-End System for Temporal Information Extraction from Clinical Notes”. In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). San Diego, California: Association for Computational Linguistics, pp. 1292–1297.doi: 10.18653/v1/S16-1201. Leeuwenberg, Artuur and Marie-Francine Moens (2017). “Structured learning for temporal relation extraction from clinical records”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Vol. 1, pp. 1150–1158. Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer (July 2020). “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 7871–7880.doi: 10.18653/v1/2020.acl-main.703. Li, Jiwei, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao (Nov. 2016). “Deep Reinforcement Learning for Dialogue Generation”. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, pp. 1192–1202.doi: 10.18653/v1/D16-1127. Li, Manling, Sha Li, Zhenhailong Wang, Lifu Huang, Kyunghyun Cho, Heng Ji, Jiawei Han, and Clare R. Voss (2021). “Future is not One-dimensional: Graph Modeling based Complex Event Schema Induction for Event Prediction”. In: arXiv preprint arXiv:2104.06344.url: https://arxiv.org/abs/2104.06344. 124 Li, Qi and Heng Ji (June 2014). “Incremental Joint Extraction of Entity Mentions and Relations”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 402–412.doi: 10.3115/v1/P14-1038. Li, Sha, Heng Ji, and Jiawei Han (2021). “Document-Level Event Argument Extraction by Conditional Generation”. In: arXiv preprint arXiv:2104.05919. Li, Xiang, Aynaz Taheri, Lifu Tu, and Kevin Gimpel (Aug. 2016). “Commonsense Knowledge Base Completion”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, pp. 1445–1455.doi: 10.18653/v1/P16-1137. Lin, Bill Yuchen, Seyeon Lee, Rahul Khanna, and Xiang Ren (Nov. 2020). “Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-Trained Language Models”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 6862–6868.doi: 10.18653/v1/2020.emnlp-main.557. Lin, Li, Yixin Cao, Lifu Huang, Shuang Li, Xuming Hu, Lijie Wen, and Jianmin Wang (2022). “What Makes The Story Forward? Inferring Commonsense Explanations as Prompts for Future Event Generation”. 
In: arXiv preprint arXiv:2205.01898. Lin, Shih-Ting, Nathanael Chambers, and Greg Durrett (Aug. 2021). “Conditional Generation of Temporally-ordered Event Sequences”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 7142–7157. doi: 10.18653/v1/2021.acl-long.555. Liu, Jian, Yubo Chen, Kang Liu, Wei Bi, and Xiaojiang Liu (Nov. 2020). “Event Extraction as Machine Reading Comprehension”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 1641–1651.doi: 10.18653/v1/2020.emnlp-main.128. Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov (2019). “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. In: ArXiv abs/1907.11692. Lorenz, M. O. (1905). “Methods of measuring the concentration of wealth”. In: American Statistical Association 9.70.9(70), pp. 209–219. Ma, Mingyu Derek, Jiao Sun, Mu Yang, Kung-Hsiang Huang, Nuan Wen, Shikhar Singh, Rujun Han, and Nanyun Peng (2021). “EventPlus: A Temporal Event Understanding Pipeline”. In: 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Demonstrations Track. 125 Mani, Inderjeet, Marc Verhagen, Ben Wellner, Chong Min Lee, and James Pustejovsky (2006). “Machine Learning of Temporal Relations”. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. ACL-44. Sydney, Australia: Association for Computational Linguistics, pp. 753–760.doi: 10.3115/1220175.1220270. Mao, Huanru Henry, Bodhisattwa Prasad Majumder, Julian McAuley, and Garrison Cottrell (Nov. 2019). “Improving Neural Story Generation by Targeted Common Sense Grounding”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 5988–5993.doi: 10.18653/v1/D19-1615. McNemar, Quinn (1947). “Note on the sampling error of the difference between correlated proportions or percentages”. In: Psychometrika 12(2), pp. 153–157. Meng, Tao, Nanyun Peng, and Kai-Wei Chang (Nov. 2019). “Target Language-Aware Constrained Inference for Cross-lingual Dependency Parsing”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural LanguageProcessing(EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 1117–1128.doi: 10.18653/v1/D19-1103. Meng, Yuanliang and Anna Rumshisky (2018). “Context-Aware Neural Model for Temporal Information Extraction”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Meng, Yuanliang, Anna Rumshisky, and Alexey Romanov (2017). “Temporal Information Extraction for Question Answering Using Syntactic Dependencies in an LSTM-based Architecture”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 887–896. Mirza, Paramita, Rachele Sprugnoli, Sara Tonelli, and Manuela Speranza (Apr. 2014a). “Annotating Causality in the TempEval-3 Corpus”. 
In: Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL). Gothenburg, Sweden: Association for Computational Linguistics, pp. 10–19.doi: 10.3115/v1/W14-0702. Mirza, Paramita and Sara Tonelli (Aug. 2014b). “An Analysis of Causality between Events and its Relation to Temporal Information”. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. Dublin, Ireland: Dublin City University and Association for Computational Linguistics, pp. 2097–2106.url: https://aclanthology.org/C14-1198. Miwa, Makoto and Mohit Bansal (Aug. 2016). “End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, pp. 1105–1116.doi: 10.18653/v1/P16-1105. Miwa, Makoto and Yutaka Sasaki (Oct. 2014). “Modeling Joint Entity and Relation Extraction with Table Representation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, pp. 1858–1869.doi: 10.3115/v1/D14-1200. 126 Mostafazadeh, Nasrin, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen (June 2016a). “A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, pp. 839–849.doi: 10.18653/v1/N16-1098. Mostafazadeh, Nasrin, Alyson Grealish, Nathanael Chambers, James Allen, and Lucy Vanderwende (June 2016b). “CaTeRS: Causal and Temporal Relation Scheme for Semantic Annotation of Event Structures”. In: Proceedings of the Fourth Workshop on Events. San Diego, California: Association for Computational Linguistics, pp. 51–61.doi: 10.18653/v1/W16-1007. Nguyen, Thien Huu, Kyunghyun Cho, and Ralph Grishman (June 2016). “Joint Event Extraction via Recurrent Neural Networks”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, pp. 300–309.doi: 10.18653/v1/N16-1034. Nguyen, Thien Huu and Ralph Grishman (2015). “Event Detection and Domain Adaptation with Convolutional Neural Networks”. In: ACL. Ning, Qiang, Zhili Feng, and Dan Roth (Sept. 2017). “A Structured Learning Approach to Temporal Relation Extraction”. In: EMNLP. Copenhagen, Denmark.url: http://cogcomp.org/papers/NingFeRo17.pdf. Ning, Qiang, Zhili Feng, Hao Wu, and Dan Roth (2018a). “Joint Reasoning for Temporal and Causal Relations”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 2278–2288.url: http://aclweb.org/anthology/P18-1212. Ning, Qiang, Hao Wu, Pradeep Dasigi, Dheeru Dua, Matt Gardner, Robert L. Logan IV, Ana Marasović, and Zhen Nie (Oct. 2020a). “Easy, Reproducible and Quality-Controlled Data Collection with CROWDAQ”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, pp. 127–134. doi: 10.18653/v1/2020.emnlp-demos.17. 
Ning, Qiang, Hao Wu, Rujun Han, Nanyun Peng, Matt Gardner, and Dan Roth (Nov. 2020b). “TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 1158–1172.doi: 10.18653/v1/2020.emnlp-main.88. Ning, Qiang, Hao Wu, Haoruo Peng, and Dan Roth (June 2018b). “Improving Temporal Relation Extraction with a Globally Acquired Statistical Resource”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 841–851.doi: 10.18653/v1/N18-1077. Ning, Qiang, Hao Wu, and Dan Roth (July 2018c). “A Multi-Axis Annotation Scheme for Event Temporal Relations”. In: ACL. Association for Computational Linguistics.url: http://cogcomp.org/papers/NingWuRo18.pdf. 127 Ning, Qiang, Ben Zhou, Zhili Feng, Haoruo Peng, and Dan Roth (2018d). “CogCompTime: A Tool for Understanding Time in Natural Language”. In: EMNLP. O’Gorman, Tim, Kristin Wright-Bettner, and Martha Palmer (Nov. 2016). “Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation”. In: Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016). Austin, Texas: Association for Computational Linguistics, pp. 47–56.doi: 10.18653/v1/W16-5706. Paolini, Giovanni, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto (2021). “Structured Prediction as Translation between Augmented Natural Languages”. In: arXiv preprint arXiv:2101.05779. Peng, Nanyun, Marjan Ghazvininejad, Jonathan May, and Kevin Knight (2018). “Towards controllable story generation”. In: NAACL Workshop. Pereira, Lis, Xiaodong Liu, Fei Cheng, Masayuki Asahara, and Ichiro Kobayashi (July 2020). “Adversarial Training for Commonsense Inference”. In: Proceedings of the 5th Workshop on Representation Learning for NLP. Online: Association for Computational Linguistics, pp. 55–60.doi: 10.18653/v1/2020.repl4nlp-1.8. Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (June 2018). “Deep Contextualized Word Representations”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 2227–2237.doi: 10.18653/v1/N18-1202. Pustejovsky, James, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, and Lisa Ferro (2003a). “The TIMEBANK Corpus”. In: Corpus linguistics, pp. 647–656. Pustejovsky, James, Patrick Hanks, Roser Saurí, Andrew See, Rob Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo (Jan. 2003b). “The TimeBank corpus”. In: Proceedings of Corpus Linguistics. Radford, Alec, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever (2019). “Language Models are Unsupervised Multitask Learners”. In. Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu (2020). “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: Journal of Machine Learning Research 21.140, pp. 
1–67.url: http://jmlr.org/papers/v21/20-074.html. Rogers, Anna, Olga Kovaleva, and Anna Rumshisky (Jan. 2021). “A Primer in BERTology: What We Know About How BERT Works”. In: Transactions of the Association for Computational Linguistics 8, pp. 842–866.issn: 2307-387X.doi: 10.1162/tacl_a_00349. eprint: https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00349/1923281/tacl\_a\_00349.pdf. 128 Rush, Alexander M and MJ Collins (2012). “A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing”. In: Journal of Artificial Intelligence Research , pp. 305–362.issn: 45. Rush, Alexander M. and Michael Collins (June 2011). “Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation”. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, pp. 72–82.url: https://www.aclweb.org/anthology/P11-1008. Sap, Maarten, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi (July 2019). “ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning”. In: Proceedings of the AAAI Conference on Artificial Intelligence 33.01, pp. 3027–3035.doi: 10.1609/aaai.v33i01.33013027. Sap, Maarten, Vered Shwartz, Antoine Bosselut, Yejin Choi, and Dan Roth (July 2020). “Commonsense Reasoning for Natural Language Processing”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts. Online: Association for Computational Linguistics, pp. 27–33.doi: 10.18653/v1/2020.acl-tutorials.7. Savova, Guergana K, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute (2010). “Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications”. In: Journal of the American Medical Informatics Association 17.5, pp. 507–513. Shi, Peng and Jimmy J. Lin (2019). “Simple BERT Models for Relation Extraction and Semantic Role Labeling”. In: ArXiv abs/1904.05255. Sun, Weiyi, Anna Rumshisky, and Ozlem Uzuner (2013).Evaluatingtemporalrelationsinclinicaltext: 2012 i2b2 Challenge.doi: 10.1136/amiajnl-2013-001628. Tambwekar, Pradyumna, Murtaza Dhuliawala, Lara J. Martin, Animesh Mehta, Brent Harrison, and Mark O. Riedl (2019). “Controllable Neural Story Plot Generation via Reward Shaping”. In: Proceedings of the 2019 International Joint Conference on Artificial Intelligence .url: https://arxiv.org/abs/1809.10736. Tan, Bowen, Zichao Yang, Maruan Al-Shedivat, Eric P Xing, and Zhiting Hu (2021). “Progressive Generation of Long Text with Pretrained Language Models”. In. Tan, Ed S. (1996). “Emotion and the Structure of Narrative Film: Film as an Emotion Machine.” In: Mahwah, NJ: Erlbaum Associates. Tang, Buzhou, Yonghui Wu, Min Jiang, Yukun Chen, Joshua C Denny, and Hua Xu (2013). “A hybrid system for temporal information extraction from clinical text”. In: Journal of the American Medical Informatics Association 20.5, pp. 828–835. Tenney, Ian, Dipanjan Das, and Ellie Pavlick (July 2019). “BERT Rediscovers the Classical NLP Pipeline”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 4593–4601.doi: 10.18653/v1/P19-1452. 129 Tourille, Julien, Olivier Ferret, Aurelie Neveol, and Xavier Tannier (2017). 
“Neural architecture for temporal relation extraction: a Bi-LSTM approach for detecting narrative containers”. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2, pp. 224–230. UzZaman, Naushad, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky (June 2013). “SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations”. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Atlanta, Georgia, USA: Association for Computational Linguistics, pp. 1–9.url: https://aclanthology.org/S13-2001. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin (2017). “Attention is All you Need”. In: Advances in Neural Information Processing Systems. Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc.url: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Verhagen, Marc, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky (2007). “SemEval-2007 Task 15: TempEval Temporal Relation Identification”. In: Proceedings of the 4th InternationalWorkshoponSemanticEvaluations. SemEval ’07. Prague, Czech Republic: Association for Computational Linguistics, pp. 75–80.url: http://dl.acm.org/citation.cfm?id=1621474.1621488. Verhagen, Marc and James Pustejovsky (2008). “Temporal Processing with the TARSQI Toolkit”. In: 22Nd International Conference on on Computational Linguistics: Demonstration Papers. COLING ’08. Manchester, United Kingdom: Association for Computational Linguistics, pp. 189–192.url: http://dl.acm.org/citation.cfm?id=1599288.1599300. Verhagen, Marc, Roser Saurı, Tommaso Caselli, and James Pustejovsky (2010). “SemEval-2010 Task 13: TempEval-2”. In: Proceedings of the 5th International Workshop on Semantic Evaluation. SemEval ’10. Los Angeles, California: Association for Computational Linguistics, pp. 57–62.url: http://dl.acm.org/citation.cfm?id=1859664.1859674. Wang, Haoyu, Muhao Chen, Hongming Zhang, and Dan Roth (Nov. 2020). “Joint Constrained Learning for Event-Event Relation Extraction”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 696–706.doi: 10.18653/v1/2020.emnlp-main.51. Williams, Ronald J. and Jing Peng (1991). “Function Optimization Using Connectionist Reinforcement Learning Algorithms”. In: Connection Science 3, pp. 241–268. Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush (Oct. 2020). “Transformers: State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, pp. 38–45.doi: 10.18653/v1/2020.emnlp-demos.6. 130 Wolff, Phillip (2007). “Representing causation”. In: Journal of Experimental Psychology: General 136(1), pp. 82–111.url: https://doi.org/10.1037/0096-3445.136.1.82. 
Wu, Hui-Yin, Michael Young, and Marc Christie (Oct. 2016). “A Cognitive-Based Model of Flashbacks for Computational Narratives”. In: AIIDE 2016 - 12th Annual AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. San Francisco, United States, pp. 239–245.url: https://hal.inria.fr/hal-01413401. Xu, Jingjing, Xuancheng Ren, Yi Zhang, Qi Zeng, Xiaoyan Cai, and Xu Sun (Oct. 2018). “A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, pp. 4306–4315.doi: 10.18653/v1/D18-1462. Xu, Jingyi, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck (Oct. 2018). “A Semantic Loss Function for Deep Learning with Symbolic Knowledge”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, pp. 5502–5511.url: https://proceedings.mlr.press/v80/xu18h.html. Xu, Peng, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro (Nov. 2020). “MEGATRON-CNTRL: Controllable Story Generation with External Knowledge Using Large-Scale Language Models”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 2831–2845.doi: 10.18653/v1/2020.emnlp-main.226. Xu, Yan, Yining Wang, Tianren Liu, Junichi Tsujii, and Eric I-Chao Chang (2013). “An end-to-end system to identify temporal relation in discharge summaries: 2012 i2b2 challenge”. In: Journal of the American Medical Informatics Association 20.5, pp. 849–858. Yan, Xu, Wang Yining, Liu Tianren, Tsujii Junichi, and Chang EI (2013). An End-to-end System to Identify Temporal Relation in Discharge Summaries: 2012 I2b2 Challenge.doi: 10.1136/amiajnl-2012-001607. Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le (2019). “XLNet: Generalized Autoregressive Pretraining for Language Understanding”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc.url: https://proceedings.neurips.cc/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf. Yao, Lili, Nanyun Peng, Weischedel Ralph, Kevin Knight, Dongyan Zhao, and Rui Yan (2019). “Plan-And-Write: Towards Better Automatic Storytelling”. In: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) . Yao, Wenlin, Zeyu Dai, Maitreyi Ramaswamy, Bonan Min, and Ruihong Huang (Nov. 2020). “Weakly Supervised Subevent Knowledge Acquisition”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 5345–5356.doi: 10.18653/v1/2020.emnlp-main.430. 131 Yoshikawa, Katsumasa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto (2009). “Jointly identifying temporal relations with Markov logic”. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1. Association for Computational Linguistics, pp. 405–413. Zellers, Rowan, Yonatan Bisk, Roy Schwartz, and Yejin Choi (Oct. 2018). “SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference”. 
In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, pp. 93–104. doi: 10.18653/v1/D18-1009.

Zhang, Meishan, Yue Zhang, and Guohong Fu (Sept. 2017). “End-to-End Neural Relation Extraction with Global Optimization”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, pp. 1730–1740. doi: 10.18653/v1/D17-1182.

Zhang, Shuaicheng, Lifu Huang, and Qiang Ning (2021). “Extracting Temporal Event Relation with Syntactic-Guided Temporal Graph Transformer”. In: arXiv preprint arXiv:2104.09570.

Zhao, Jieyu, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang (Sept. 2017). “Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, pp. 2979–2989. doi: 10.18653/v1/D17-1323.

Zhou, Ben, Daniel Khashabi, Qiang Ning, and Dan Roth (Nov. 2019). ““Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 3363–3369. doi: 10.18653/v1/D19-1332.

Zhou, Ben, Qiang Ning, Daniel Khashabi, and Dan Roth (2020). “Temporal Common Sense Acquisition with Minimal Supervision”. In: Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL). url: https://cogcomp.seas.upenn.edu/papers/ZNKR20.pdf.

Zhou, Wangchunshu, Dong-Ho Lee, Ravi Kiran Selvam, Seyeon Lee, Bill Yuchen Lin, and Xiang Ren (2021). “Pre-training text-to-text transformers for concept-centric common sense”. In.

Zhou, Xuhui, Yue Zhang, Leyang Cui, and Dandan Huang (2020). “Evaluating Commonsense in Pre-trained Language Models”. In: AAAI.

Zhou, Yichao, Yu Yan, Rujun Han, J. Harry Caufield, Kai-Wei Chang, Yizhou Sun, Peipei Ping, and Wei Wang (2021). “Clinical Temporal Relation Extraction with Probabilistic Soft Logic Regularization and Global Inference”. In: AAAI.

Zhu, Yukun, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler (Dec. 2015). “Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books”. In: The IEEE International Conference on Computer Vision (ICCV).

Appendix A
Appendix for Sec. 3.1

Here we list all supplementary materials for the paper Han et al. (2019c).

A.1 Hyper-parameter Selection

All single-task, multi-task, and pipeline joint models are trained by minimizing cross-entropy loss. We observe that model performance varies significantly with the dropout ratio, the hidden-layer dimensions of the BiLSTM model, and the entity weight in the loss function (with the relation weight fixed at 1.0). We leverage a pre-trained BERT model to compute word embeddings,[1] and all MLP scoring functions have one hidden layer.[2] In the SSVM loss function, we fix the value of C = 1 but fine-tune $C_E$ in the objective function in Equation 3.2. Hyper-parameters are chosen using the standard development set for TimeBank-Dense and a random holdout set based on an 80/20 split of the training data for Matres.
To solve the ILP in the inference process, we leverage an off-the-shelf solver provided by the Gurobi optimizer; i.e., the best solutions from the Gurobi optimizer are inputs to the global training. The best combination of hyper-parameters can be found in Table A.1.

[1] We use a pre-trained BERT-Base model with 768 hidden size, 12 layers, and 12 heads, implemented by https://github.com/huggingface/pytorch-pretrained-BERT.
[2] Let H, K denote the dimension of the (concatenated) vector from the BiLSTM and the number of output classes; an MLP layer then consists of |H|*|K| + |K|*|K| parameters.

                         TB-Dense (Ent / Rel)   MATRES (Ent / Rel)
Single-task Model
  hidden size            100 / 100              60 / 60
  dropout                0.4 / 0.5              0.5 / 0.5
Multi-task Model
  hidden size            90                     90
  dropout                0.6                    0.3
  entity weight          6.0                    16.0
Pipeline Joint Model
  hidden size            90                     90
  dropout                0.6                    0.4
  entity weight          6.0                    15.0
Structured Joint Model
  lr                     0.0005                 0.001
  decay                  0.1                    0.1
  momentum               0.2                    0.1
  $C_E$                  0.1                    5.0
  ent_thresh             0.49                   0.4

Table A.1: Best hyper-parameters.

A.2 Event-Relation Consistency Constraint

A pair of input tokens has a positive temporal relation if and only if both tokens are events. This property is encoded in Table A.2, together with the additional constraints that only one label can be assigned to each token and to each relation. The following rules satisfy this property:

1. $\sum_{P \in \mathcal{R}} r^P_{i,j} + r^N_{i,j} = 1$
2. $e^P_i + e^N_i = 1$
3. $e^P_i \ge r^P_{i,j}$ and $e^P_j \ge r^P_{i,j}$
4. $e^N_i + e^N_j \ge r^N_{i,j}$

Proof. (→) If either $e^P_i = 0$ or $e^P_j = 0$, then by Rule 3 $r^P_{i,j}$ can only be 0; if $e^P_i = 1$ and $e^P_j = 1$, Rule 3 alone allows $r^P_{i,j}$ to be either 0 or 1, so the column $r^P_{i,j}$ is satisfied. For the column $r^N_{i,j}$: if either or both of $e^N_i$ and $e^N_j$ equal 1, Rule 4 alone allows $r^N_{i,j}$ to be 0 or 1; however, by Rule 2, one of the top three rows of the columns $e^P_i$ and $e^P_j$ must then hold, which implies $r^P_{i,j} = 0$ and thus, by Rule 1, $r^N_{i,j} = 1$. If $e^N_i = 0$ and $e^N_j = 0$, Rule 4 forces $r^N_{i,j} = 0$ (and hence, by Rule 1, $r^P_{i,j} = 1$).
(←) If $r^P_{i,j} = 0$, Rule 3 places no restriction on $e^P_i$ and $e^P_j$. If $r^P_{i,j} = 1$, then by Rule 3 $e^P_i = 1$ and $e^P_j = 1$. If $r^N_{i,j} = 1$, then by Rule 4 at least one of $e^N_i$ and $e^N_j$ equals 1. If $r^N_{i,j} = 0$, then by Rule 1 $r^P_{i,j} = 1$, hence $e^P_i = e^P_j = 1$ and therefore, by Rule 2, $e^N_i = e^N_j = 0$.

$e^P_i$   $e^P_j$   $e^N_i$   $e^N_j$   $r^P_{i,j}$   $r^N_{i,j}$
0         0         1         1         0             1
0         1         1         0         0             1
1         0         0         1         0             1
1         1         0         0         1             0

Table A.2: Event and relation global constraint.
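To make the rules above concrete, the snippet below is a minimal sketch of how Rules 1-4 could be encoded for a single token pair (i, j) using the gurobipy interface to the Gurobi optimizer mentioned earlier. The relation set, the variable names, and the omitted score-based objective are illustrative assumptions rather than the inference code actually used in the thesis.

```python
# Hypothetical gurobipy sketch of the event-relation consistency constraints
# (Rules 1-4 in Section A.2) for a single token pair (i, j).

import gurobipy as gp
from gurobipy import GRB

# Example stand-in for the set R of positive relation types.
POSITIVE_RELATIONS = ["BEFORE", "AFTER", "SIMULTANEOUS", "VAGUE"]

model = gp.Model("event_relation_consistency")

# Binary indicators: token i / j labeled as event (P) or non-event (N).
e_p = {t: model.addVar(vtype=GRB.BINARY, name=f"e_P_{t}") for t in ("i", "j")}
e_n = {t: model.addVar(vtype=GRB.BINARY, name=f"e_N_{t}") for t in ("i", "j")}

# Binary indicators: one per positive relation type, plus the NONE relation.
r_p = {rel: model.addVar(vtype=GRB.BINARY, name=f"r_{rel}_ij")
       for rel in POSITIVE_RELATIONS}
r_n = model.addVar(vtype=GRB.BINARY, name="r_NONE_ij")

# Rule 1: exactly one relation label per token pair.
model.addConstr(gp.quicksum(r_p.values()) + r_n == 1)
# Rule 2: exactly one label (event / non-event) per token.
for t in ("i", "j"):
    model.addConstr(e_p[t] + e_n[t] == 1)
# Rule 3: a positive relation requires both tokens to be events.
for rel in POSITIVE_RELATIONS:
    model.addConstr(e_p["i"] >= r_p[rel])
    model.addConstr(e_p["j"] >= r_p[rel])
# Rule 4: the NONE relation requires at least one non-event token.
model.addConstr(e_n["i"] + e_n["j"] >= r_n)

# The real inference objective maximizes the model's scores for the selected
# labels over all pairs; it is omitted here, so this only checks feasibility.
model.optimize()
```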
Appendix C
Appendix for Sec. 3.3

Here we list all supplementary materials for the paper Han et al. (2021b).

C.1 Hyper-parameter Selection

Due to computation constraints, we had to limit the search range of hyper-parameters for ECONET. For learning rates, we tried (1e-6, 2e-6); for the weight on the contrastive loss ($\beta$), we tried (1.0, 2.0).

Best Hyper-parameters. In Table C.1 and Table C.2, we provide hyper-parameters for our best performing language models using RoBERTa-LARGE + ECONET and BERT-LARGE + ECONET, as well as the best hyper-parameters for fine-tuning them on downstream tasks. For fine-tuning on the target datasets, we conducted a grid search over learning rates in the range of (5e-6, 1e-5) and batch sizes in the range of (2, 4, 6, 12). We fine-tuned all models for 10 epochs with three random seeds (5, 7, 23).

Method            learning rate   batch size   β
ECONET            1e-6            8            1.0
TORQUE            1e-5            12           -
McTaco            5e-6            4            -
TimeBank-Dense    5e-6            4            -
MATRES            5e-6            2            -
RED               5e-6            2            -
Table C.1: Hyper-parameters of our best performing LM with RoBERTa-LARGE + ECONET, as well as the best hyper-parameters for fine-tuning on downstream tasks.

Method            learning rate   batch size   β
ECONET            2e-6            8            1.0
TORQUE            1e-5            12           -
McTaco            1e-5            2            -
TimeBank-Dense    1e-5            2            -
MATRES            5e-6            4            -
RED               1e-5            6            -
Table C.2: Hyper-parameters of our best performing LM with BERT-LARGE + ECONET, as well as the best hyper-parameters for fine-tuning on downstream tasks.

Appendix D
Appendix for Sec. 4.1

Here we list all supplementary materials for the paper Han et al. (2021a).

D.1 QA Validity

A pair of QA is valid if and only if it fulfills the following criteria:
1. Both questions and answers MUST contain correct events. Events in questions can have a different textual form.
2. Both questions and answers MUST be natural and meaningful. Workers spotted spamming are immediately disqualified.
3. The semantic relation formed by a QA pair MUST fall into one of the five relation categories we define.
Note that QA validity is different from QA completeness, for which we instruct workers to find all possible answers in the passage.

D.2 Implementation Details

We finetune BART-base, BART-large, T5-base, UnifiedQA-base, and UnifiedQA-large on ESTER. UnifiedQA models are all based on T5. The hyper-parameter search ranges are: learning rate (1e-5, 5e-5, 1e-4) and batch size (2, 4). Best hyper-parameters can be found in Table D.1. We also use 3 random seeds (5, 7, 23) and report the average performance for each model. For RoBERTa-large, there is an additional hyper-parameter, the positive token training weight mentioned in Section 4.1.7.2, and its search range is (1, 2, 5, 10, 20). For BART-base, BART-large, T5-base, and UnifiedQA-base, we were able to finetune on a single Nvidia GTX2080 GPU with 11G memory. For Pegasus and UnifiedQA-large, we had to use a much larger Nvidia A100 GPU with 40G memory. We tried to finetune UnifiedQA based on T5-3B, but we were not able to fit batch size = 1 into a single Nvidia A100 GPU, so we stopped at UnifiedQA-large. All reproduction details can be found in the separately submitted code.

Models            # Params.   Best Hyper.       GPU
RoBERTa-large     355M        lr=1e-5; b=2      GTX2080
BART-base         139M        lr=5e-5; b=4      GTX2080
BART-large        406M        lr=1e-5; b=2      GTX2080
T5-base           220M        lr=1e-4; b=4      GTX2080
UnifiedQA-base    220M        lr=5e-5; b=2      GTX2080
UnifiedQA-large   770M        lr=5e-5; b=4      A100
Table D.1: Model and fine-tuning details. Learning rate: lr; batch size: b.
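As an illustration of the fine-tuning setup above, here is a minimal Hugging Face sketch of seq2seq fine-tuning for generative QA on question-passage inputs. The checkpoint name, dataset field names, input format, and helper functions are assumptions for illustration, not the exact ESTER training code.

```python
# Minimal illustrative sketch of seq2seq fine-tuning (not the exact ESTER code).
# Assumed input format: question + passage -> concatenated answers.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "allenai/unifiedqa-t5-base"          # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def collate(batch):
    # batch: list of dicts with "question", "passage", "answers" (hypothetical fields).
    inputs = [ex["question"] + " \\n " + ex["passage"] for ex in batch]
    targets = ["; ".join(ex["answers"]) for ex in batch]
    enc = tokenizer(inputs, padding=True, truncation=True, max_length=512,
                    return_tensors="pt")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, padding=True, truncation=True, max_length=64,
                           return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding in the loss
    enc["labels"] = labels
    return enc

def finetune(train_data, lr=5e-5, batch_size=2, epochs=10, seed=5):
    torch.manual_seed(seed)
    loader = DataLoader(train_data, batch_size=batch_size, shuffle=True,
                        collate_fn=collate)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
```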
D.3 Model Significance Test

To conduct statistical tests over model improvements, we pick the model with the highest token-level F1 ($F_1^T$) score among the three random seeds for the "best" hyper-parameters chosen in Table D.1. We then perform McNemar's tests for HIT@1 and EM. Specifically, if HIT@1 = 1.0 for a sample, we treat it as a correct prediction; otherwise, it is incorrect. The same logic applies to EM. We only conduct statistical tests over pairs of models that are comparable in Table 4.3, and test results are shown in Table D.2 below (an illustrative sketch of this test is included after Table D.3 below).

Model Comparisons                           HIT@1       EM
Zero-shot
  T5-base → UnifiedQA-base                  0.790       ≪ 0.001
  UnifiedQA-base → UnifiedQA-large          ≪ 0.001     0.001
Finetune
  BART-base → BART-large                    0.865       0.231
  T5-base → UnifiedQA-base                  0.018       0.574
  UnifiedQA-base → UnifiedQA-large          ≪ 0.001     0.022
  RoBERTa-large → UnifiedQA-base            ≪ 0.000     0.001
  RoBERTa-large → UnifiedQA-large           ≪ 0.000     ≪ 0.000
Table D.2: McNemar's test per HIT@1 and EM metrics. Models on the right-hand side of "→" are better. All numbers are p-values, with ≤ 0.05 indicating statistical significance (underlined in the original).

D.4 Sub-sample Performances

In Figure D.1 we show fine-tuning results for UnifiedQA-large using different numbers of training samples for Sub-event and Co-reference. We observe the same leveling-off after 2K training samples as in Figure 4.7 for all semantic types.

[Figure D.1: Sub-sample fine-tuning performances for hierarchical relations (Sub-event + Co-reference), plotting HIT@1 (%), F1 (Token), and EM (%) against the number of training samples (0, 500, 1K, 2K, 3K, 4K, Full). All numbers are averages over 3 random seeds.]

D.5 Worker Distribution

We had 70 workers in total who passed our qualification exam and completed at least 1 assignment in our project. Due to our rigorous validation process, only 27 were able to make it into Task 4 and the Large Task, which consist of a large number of assignments. Figure D.2, a Lorenz curve (Lorenz 1905), illustrates the distribution of the number of questions completed by workers. The equality baseline indicates that the questions are perfectly evenly distributed among all workers, i.e., everyone completes the same number of questions. The further a curve deviates from the equality baseline, the more unevenly distributed a dataset becomes. Compared with the train data, we observe that the evaluation set is slightly better distributed, which reflects our validation process: workers who failed our validation tasks and were disqualified could still provide some good quality QAs, which we keep in the evaluation data. This increases the diversity of the evaluation set.

[Figure D.2: Question distributions by worker in the train vs. evaluation sets, plotted against an equality baseline in which each participant provides an equal number of questions.]

D.6 Number of Tokens

Table D.3 shows the average number of tokens in questions and answers. The Counterfactual questions contain the most tokens, as additional words are often needed to specify the negation reasoning. The average numbers of tokens are all around 6.5 across the 5 types of answers. This is exactly the midpoint of our answer-length limits, where we set the minimum and maximum numbers of words to be 1 and 12, respectively. The average number of tokens in the passages is 128.1, with the longest passage containing 196 tokens.

                  # Tokens
Semantic Types    Question   Answer
Causal            10.3       6.6
Conditional       12.1       6.4
Counterfactual    13.7       6.0
Sub-event         9.3        6.5
Co-reference      8.6        6.5
Table D.3: Average number of tokens in questions and answers.
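As referenced in Appendix D.3, below is a minimal sketch of how McNemar's test could be run on per-sample correctness indicators (HIT@1 or EM). The variable names are hypothetical, and the statsmodels call is one reasonable choice of tool, not necessarily the exact script used.

```python
# Illustrative McNemar's test over paired per-sample correctness (Appendix D.3).
# `correct_a` and `correct_b` are hypothetical 0/1 lists, e.g. HIT@1 == 1.0 per sample.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_a, correct_b):
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    # 2x2 contingency table of agreements/disagreements between the two models.
    table = [[np.sum(a & b),  np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=True).pvalue

# Example usage (hypothetical inputs):
# p = mcnemar_pvalue(hit1_model_a, hit1_model_b)
```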
D.7 Generative vs. Extractive QA

In Table D.4, we show 3 examples comparing predicted answers between generative and extractive QA. In general, scattered answers occur frequently in extractive QA but barely occur in generative QA. In other words, generative QA is able to consistently produce complete and meaningful answers.

D.8 Interface

Please refer to Figure D.3 for the user interface for event selection and Figure D.4 for the QA annotation interface.

Ex. 1
Passage: the serbs only lifted their threat of a boycott friday after heavy international pressure and the intervention of serbian president slobodan milosevic, a longtime supporter of the rebels. in a last-minute attempt to get people to vote, the independent democratic serb party (sdss), led by vojislav stanimirovic, launched into what seemed more like a mobilisation rather than a real political campaign.
Question: what could happen if there was no intervention by the serbian president?
Generative Answers: 1. a boycott
Extractive Answers: 1. the; 2. bs; 3. lifted their threat of a boycott

Ex. 2
Passage: french defence minister michele alliot-marie on sunday stressed paris's support for the government of lebanese prime minister fuad siniora during a visit to the crisis-wracked nation. "i have come to reaffirm france's support for the legitimate government of lebanon," she told reporters after meeting her lebanese counterpart elias murr. alliot-marie also stressed paris's backing for the beirut government to "exercise its sovereignty completely", and that the lebanese army play "a role across all its territory". lebanon is undergoing a political crisis with opposition led by shiite movement hezbollah seeking to bring down siniora's government and install a government of national unity. the french minister, who arrived in beirut on saturday for a 48-hour visit, was also to meet siniora before heading to south lebanon for new year's eve with the french contingent of the united nations interim force in lebanon (unifil).
Question: what caused alliot-marie to visit lebanon?
Generative Answers: 1. lebanon is undergoing a political crisis
Extractive Answers: 1. banon is undergoing a political crisis; 2. to; 3. bring down siniora's government

Ex. 3
Passage: vieira seems very enthusiastic about bringing in chinese capital and technology into the west african country. he said priorities for bilateral cooperation could expand to ports, roads, bridges and mineral resources. inspired by vieira's enthusiasm, cmec vice president zhou li promised that a special team would fly to guinea-bissau to discuss the details. vieira reminded her that apart from guinea-bissau, other west african countries such as senegal and guinea also need power-generation facilities badly. regarding china as a strategic friend who offers aids without political strings, many african countries impressed with the country's two-digit economic growth are seizing time to explore cooperative opportunities during their stay in beijing to boost domestic economy.
Question: what does the bilateral cooperation include?
Generative Answers: 1. bringing in chinese capital and technology; 2. a special team would fly to guinea-bissau; 3. talk about the details; 4. explore cooperative opportunities during their stay in beijing
Extractive Answers: 1. bringing in chinese capital and technology; 2. expand to; 3. 
ports, roads, bridges and mineral resources; 4. a; 5. special team would fly to guinea-bissau; 6. discuss the details; 7. explore
Table D.4: Examples of answers predicted by generative vs. extractive QA models. Some passages are shortened for demonstration purposes. Incomplete predictions from extractive QA are highlighted in the original.

[Figure D.3: An illustration of the event selection interface: (a) event selections in progress; (b) event selections completed.]

[Figure D.4: An illustration of the QA interface: (a) QA annotations in progress; (b) QA annotations completed.]

Appendix E
Appendix for Sec. 5.1

Here we list all supplementary materials for the paper Han et al. (2022).

E.1 Data and Pre-processing

ROCStories (Mostafazadeh et al. 2016a) contains 5-sentence stories. Following P. Xu et al. (2020), we split the data into 88,344/4,908/4,909 stories for the train/validation/test sets.

WritingPrompts (Fan et al. 2018) contains 30,335 pairs of prompts and stories. With an average of more than 700 words per story, WritingPrompts stories are much longer than ROCStories. These stories are also much less structured, as dialogues and short phrases may be included. To speed up our experiments, we select stories with a maximum of 500 words, resulting in 96,488 training and 5,784 validation prompt-story pairs, respectively. For the test set, we use the 1,000 prompt-story pairs provided by the baseline paper (Goldfarb-Tarrant et al. 2020) for reporting automatic evaluation results.

Pretraining Data. As we mention in Section 5.1.3, we pretrain storyline models for ROCStories. To be consistent with ROCStories inputs, we divide BookCorpus data (Zhu et al. 2015) into spans of 5 consecutive sentences and filter out those with noisy tokens. We randomly select 1 million such 5-sentence text spans and extract their storylines.

E.2 Hyper-parameter Selection

We finetune BART-base. For ROCStories, the hyper-parameters are learning rate: 5e-5 and batch size: 10. We use 3 random seeds (5, 9998, 20016) and report the average performance for all end-to-end models. For WritingPrompts, the hyper-parameters are learning rate: 1e-4; batch size: 64; gradient accumulation: 8. For ROCStories, we were able to finetune on a single Nvidia GTX2080 GPU with 11G memory, and training time is 3-4 hours per epoch. For WritingPrompts, we had to use a much larger Nvidia A100 GPU with 40G memory, and the training time is 20 hours per epoch. We train all models for 10 epochs and save the model with the best evaluation perplexity. All reproduction details can be found in the separately submitted code.

E.3 More Details for Evaluation Metrics

Automatic evaluation metrics are used to measure the textual quality of stories. We report 1) Ref. PPL: reference stories' perplexity under a model; 2) Gen. PPL: generated stories' perplexity scored by GPT-2 (Radford et al. 2019), i.e., we feed the generated stories into GPT-2 to compute perplexity scores. For diversity scores, we found that our models implemented with Huggingface (Wolf et al. 2020) can achieve nearly 0 Repeat-3 and 100% Distinct-3 scores, so we follow Goldfarb-Tarrant et al. (2020) and compute the overall vocabulary:token ratio, which we denote as 3) Distinct Ratio (%). We also report standard 4) BLEU-3 and 5) ROUGE-L scores.
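To make the automatic metrics above concrete, here is a minimal sketch of computing Gen. PPL with GPT-2 and the Distinct Ratio (overall vocabulary:token ratio). The tokenization choices and function names are illustrative assumptions, not the exact evaluation script.

```python
# Illustrative sketch of Gen. PPL (GPT-2 perplexity) and Distinct Ratio.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def gen_ppl(stories):
    # Average per-story perplexity of generated stories under GPT-2.
    ppls = []
    for s in stories:
        ids = tok(s, return_tensors="pt").input_ids
        loss = lm(ids, labels=ids).loss        # mean token-level negative log-likelihood
        ppls.append(math.exp(loss.item()))
    return sum(ppls) / len(ppls)

def distinct_ratio(stories):
    # Overall vocabulary-to-token ratio over whitespace tokens (illustrative).
    tokens = [t for s in stories for t in s.split()]
    return 100.0 * len(set(tokens)) / max(len(tokens), 1)
```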
E.4 Benchmark Event Temporal Relation Annotations

The experimental results in the main text demonstrate the effectiveness of using temporal prompts. Here, we further show that the tool used to produce temporal prompts, i.e., ECONET, provides reliable event temporal relation annotations. We benchmark ECONET's performance using CaTeRS (Mostafazadeh et al. 2016b), which annotates 4 types of temporal relations for event pairs in a small subset of ROCStories. However, CaTeRS's annotations are based on event time intervals rather than event start times as used in MATRES, which ECONET is finetuned on. In Table E.1, we provide a mapping from CaTeRS's temporal relations to MATRES labels. The only non-unique mapping is OVERLAPS; in other words, when ECONET predicts BEFORE for a CaTeRS sample labeled OVERLAPS, we have to manually examine whether it is correct or not. We found that when ECONET predicts BEFORE for CaTeRS data, the precision rate is 65.53%, due to a large number of OVERLAPS event pairs being predicted as BEFORE. But we emphasize here that this low number is caused by label mismatch, as shown in Table E.1, and does not truthfully reflect ECONET's accuracy.

To have a better understanding, we randomly selected 20 such pairs, manually examined their temporal relations in context, and found that 90% of such pairs are indeed correctly predicted by ECONET. Adjusting for this factor, the precision rate for the annotated BEFORE relation would be 92.07%, indicating highly accurate predictions. We do not claim the final accuracy is 92.07%, but simply argue that the annotations provided by ECONET are helpful, as our main experimental results demonstrate.

CaTeRS            MATRES
A BEFORE B        A BEFORE B
A IDENTITY B      A VAGUE B
A CONTAINS B      A BEFORE B
A OVERLAPS B      A BEFORE, AFTER, VAGUE B
Table E.1: Label mapping from CaTeRS to MATRES. AFTER is ignored in CaTeRS by flipping the events' physical order.

E.5 More Details for Baseline Models

MEGATRON-CNTRL (P. Xu et al. 2020), denoted as MEGATRON for brevity, is chosen as the baseline as it outperforms previous systems such as Guan et al. (2020) on ROCStories. We do not perform delexicalization that replaces names and entities with [MALE], [FEMALE], and [NEUTRAL] tokens, as we found our models work well by recognizing names and entities. When conducting evaluations, we try our best to map these special tokens back to their original texts by using the given first sentence. For rare undetermined cases, we manually examine the generated stories and swap in names or entities that make the most sense in the context. To be fair, we compare with the 124M-parameter version.

Content Planning (Goldfarb-Tarrant et al. 2020) is chosen as the baseline for WritingPrompts, as it also adopts the Plan-and-Write workflow as well as structured event representations. However, their models are based on BART-large and are not trained in an end-to-end framework. They use 65% of the original training data and also filter out samples with non-[WP] prompts. Our final training data is about 2/3 of theirs.

TemporalBART (S.-T. Lin et al. 2021) is designed for two event temporal relation related tasks: temporal ordering and event infilling. Although TemporalBART does not tackle story generation directly, it encodes event temporal information via pretraining tasks. So we consider TemporalBART as another baseline model by initializing the storyline model with its parameters and training the Structured-Prompt workflow on ROCStories.

E.6 Additional Generated Stories

Please see Table E.2 and Table E.3 for more examples on ROCStories and WritingPrompts, respectively.

Input [s1] nina needed blood tests done. prompt BEFORE AFTER BEFORE BEFORE [s2] she was very nervous. [s3] she went to the local clinic. baseline [s4] they told her about her bad breath. 
[s5] she was relieved her nerves were n’t too much to handle. [s2] she was terrified of needles. Vanilla [s3] she went to the doctor. [s4] the doctor gave her a needle. [s5] nina was relieved. [s2] she was afraid to go to the hospital. [s3] she had never been to a hospital before. RL [s4] but when she got there, she was relieved. [s5] she felt so relieved! Input [s1] mickey was always a fan of webcomics. prompt AFTER BEFORE BEFORE BEFORE [s2] he loved writing his own comics one night. [s3] then he had a great idea. baseline [s4] he created a story called the superman story. [s5] the story was real as a kids cartoon. [s2] one day, he decided to make a webcomic himself. Vanilla [s3] he created a website and sent it to a publisher. [s4] the publisher loved it and contacted mickey. [s5] mickey was happy to receive a publishing deal. [s2] he had been writing for years. [s3] one day, mickey saw an ad for a webcomic. RL [s4] he signed up for it. [s5] mickey was able to make a lot of money from his work. Input [s1] anna had just gotten her hair permed. prompt BEFORE BEFOREAFTER BEFORE [s2] she was nervous about it. [s3] she was too nervous to do it. baseline [s4] she tried to cut hair to look better. [s5] when she was done it looked great ! [s2] she was excited to see how it looked. Vanilla [s3] but then she saw that her hair had grown out. [s4] anna was horrified! [s5] she had to go to the salon to get it fixed. [s2] she was excited to see how it looked. [s3] but when she went to apply it, it was terrible. RL [s4] her hair had grown way too long! [s5] anna had to go to the salon to get her hair fixed. Table E.2: Additional generated stories for ROCStories. 152 Input artic outpost 9 prompt AFTER VAGUE BEFORE AFTER BEFORE VAGUE VAGUE baseline she stood alone in the dimly lit corridor , staring at the light that flooded into the room . her eyes scanned the room , finally settling on the small , round table . “ hello , everyone . ” the woman spoke in a voice that echoed the room in a series of hushed tones , as though it was a voice only audible to those who heard it . “ do you know why we are here ? do you know where we come from ? ” our leader , the man with the black suit and black hat , stood at the end of the hall . a man in a black coat and black trousers , a long coat , and a black hat stood behind him , a black cloak and a dark hat , a pair of gloves . the black suit had a black mask on , and she was wearing her best , black dress shirt and a pairof pants , and black shoes . on her right wrist , the small of one shoulder , was an amulet , and on the other , a small device hidden in the pocket of the man ’s coat , a single small device . there was nothing else . just this man . . . and our leader Vanilla i’ve never seen anything like it before. it’s a strange place. i don’t know why i chose it, but i’ve always wanted to see it. it was the first time i saw it. it was a small building with a single window. i couldn’t believe it. i had never seen it before, but it was here. i’m not sure what it is, but there’s something about it that makes me want to run. i can’t remember the last time i ’d seen it. but i do know that it was there. the first thing i noticed was the door opening. i wasn’t sure if it was my own, or my own. i didn’t want to be here. i just wanted to get out. i wanted to go back to my home. as i approached the door, i saw a man standing there. he was wearing a trench coat. he had a gun pointed at his head. he pointed it at my head. “ what are you doing here? ” i asked. 
he looked at me. he looked like he was about to say something. “ i need you to come with me. ” he said, and i followed him. we were in the middle of nowhere when the man shot me in the head. i looked at him, and he looked back at me, then at the gun. i felt a sharp pain in my chest. i started to cry. i tried to speak, but my throat was dry. i heard the man say something, and then he shot me. i fell to the ground, bleeding. i woke up in a hospital bed. there was a note on my bedside table. i opened it and read it aloud. *i’m sorry, but this isn’t what i was expecting. RL what’s that? we’ve been waiting for you for a long time, i said. i’m not sure how long you’ll be waiting for me. you don’t have much time to think about it, do you? i mean, i can see it, but i can feel it in your eyes, and i can hear it in the way you can hear me. i can smell it, and it can feel your heart beating faster than you can think. i think you know what i am talking about. i know what you are talking about, but you can not understand what i mean. i want you to know that i love you. i need you to understand that you are my love, and that i will always be there for you. that you will always love me. that i am the love of my life, and you are the love that i have always wanted to be. and i will never let you down. i will not let you go. i promise you that.with that, i walked out of the building, into the street, and into the night. Input 100000 year high prompt BEFORE AFTER BEFORE BEFORE baseline for a long time it seemed the world ’s people all were living in the same way . the same . they were all the same . the same , they had the same families and their children and they all had the same friends . but these families were all different . they may have been the last of them , and the world was a better place without it , but those people knew about it , they just did n’t know the truth . some of them lived for years , some of them a century . and a new age started , one that would make this world a better places . a world without war or war . it seemed the first people could see the world and see the future and people living it . they were the first to do it . ” “ the first people on earth who could live a normal life ! the first to have a normal family and a normal home ! the first ones who we could live on ! who could have imagined that our world was as good as ours was and that the next life would be better than what i ’ ve been , that there ’s a whole new life ahead of us , that this world is a living one ! ” and that ’s when it happened . that was the day . on march 20th , 2036 . march 20, 2036 , the first time that day Vanilla it was a beautiful day. the sun was shining and the sky was blue. the clouds were bright and the trees were green and the birds were singing. it was a wonderful day. i was walking down the street when i saw it. the sky above me was blue and it was beautiful. i had never seen anything like it before. i looked down at the ground and saw nothing. i didn’t know what to do. i took a step back and looked up at the sky. i couldn’t see anything at all. i walked towards the sky and looked at the stars. they were beautiful. they seemed like they were floating. i reached out and touched them. i felt a warmth in my chest. i touched them again and again. it was amazing. i wasn’t sure if i was dreaming or not, but i knew i was. RL it’s been a long, long time since i’ve seen the light of the sun. i’m not sure how long it has been, but i think it’s a long time. 
i ’d like to say that the last time i saw the light, it was a thousand years ago, but it wasn’t. it was only a few years ago that i first saw the sun, when it was just a few seconds away from reaching the horizon. i don’t know how long i have been seeing it, but the light is blinding, and i can’t help but wonder what it could have been like. maybe i ’ll see it again, but for now, i just want to see the light again. Input paint a picture of loneliness , without using the word , or any of its forms . prompt BEFORE VAGUE VAGUE VAGUE BEFORE BEFORE BEFORE AFTER AFTER BEFORE baseline the glass is cold , but the sun shines brightly through . on the glass , in a small room , a woman stands . it stands before her , its hands clutching the glass in front of it like a shield . it 2019 s not the sunlight , but it hits the sky like a storm . the light makes the clouds roll in from all directions . a soft breeze , the wind blowing through the trees , the water rustling in the water . there are people in the room , but most do n’t look at their feet , they ’re too busy staring out the window . some are silent . most look at the woman . one of the people looks back at the woman who seems to be alone in this room , who is alone with a glass of champagne . the wind and water make the trees sway . you hear the birds chirping from outside . a few laugh , some are sad , some angry , but most are happy , and some are happy . there ’s a small child sitting on her lap . in the corner . on the ground , on a bench . and it ’s just this woman and the glass . Vanilla i’ve always been lonely. i’m not sure why. maybe it’s the fact that i don’t have any friends. maybe i’ve been alone for a long time. i haven’t seen anyone. i can’t even remember where i am. i have no friends. there’s no one to talk to. no one who can help me. i just want to be alone. sometimes i think that i ’ll just be alone forever. but then i think back to the first time i saw someone else. it was a girl. she’s beautiful, but she looks like she doesn’t belong here. she has no friends, no family, no one. then i thought of her. i think of the way she looked at me. the way i looked at her, the way her eyes looked at the people around her. that’s when i realized that i was alone. RL i don’t know why i’m doing this, but i’ve been doing it for a long time now. it’s been a long, long time since i ’d seen a single person. i remember the first time i saw her. she was beautiful, and i loved her. but i couldn’t tell you how much i missed her. i didn’t want to know how much she meant to me. and i can’t even remember the last time i talked to her, because i was too busy trying to fig ure out what to say. she was always there for me, and she always made me feel so alone. sometimes i wish i could go back, but sometimes i think i ’ll just stay here forever. maybe someday, when she’s gone, she ’ll come back. Table E.3: Additional examples for WritingPrompt. Possible AFTER relations are highlighted. 153 AppendixF AppendixforSec.5.2 F.1 EventCommonsenseRelations In Figure F.1, we show the summary table of relations in ATOMIC 20 20 (Hwang et al. 2021). We only consider relations in the event-centered and social-interaction blocks because physical-entity refers to relations of concepts and entities, which are NOT the focus of this work. We carefully selected 10 relations (highlighted) from ATOMIC 20 20 based on a large amount of qualitative studies. For example,⟨HinderedBy⟩ and⟨isFilledBy⟩ are dropped because they have small coverage in our target dataset ROCStories. 
⟨Causes⟩ and ⟨xReason⟩ are not selected as they only have about 300 pairs for training. Based on our initial analysis of ROCStories, ⟨xIntent⟩ is hard to distinguish from ⟨xWant⟩, and ⟨xAttr⟩ is hard to distinguish from ⟨xNeed⟩ and ⟨xReact⟩. These relations and their associated head and tail events are the samples used to train the relation classifier for annotating event relations in our event plans (Section 5.2.2 and Appendix F.2).

[Figure F.1: All commonsense relations provided in ATOMIC-2020 (Hwang et al. 2021). Relations with highlights are selected in this work.]

F.2 Annotating Event Relations

Step 1. We train a relation classifier using ATOMIC-2020. We concatenate the head and tail events as inputs and feed them into a BART-Large model to predict their relations. As mentioned in Section 5.2.2, we also replaced X and Y with random names to make the data more similar to the event plans of ROCStories. Using an example in Figure F.1,
• input: "tom runs out of steam <sep> drinks some water"
• label: ⟨xEffect⟩.
We follow the train/dev/test splits in the ATOMIC-2020 paper and report the best model performance on the test set in Table F.1.

Relations        Size     F1 (%)
⟨HasSubEvent⟩    1,315    99.86
⟨isAfter⟩        3,518    82.69
⟨isBefore⟩       3,518    72.64
⟨oEffect⟩        3,005    90.31
⟨oReact⟩         1,396    96.63
⟨oWant⟩          4,661    95.95
⟨xEffect⟩        8,826    85.32
⟨xNeed⟩          9,494    85.33
⟨xReact⟩         3,099    96.47
⟨xWant⟩          11,362   86.75
Macro-average             87.23
Table F.1: F1 scores for relation predictions on the ATOMIC-2020 test set.

Step 2. We apply the trained classifier on the extracted event plans described in Sec. 5.2.3 to label neighboring events' relations. As Figure 5.5 shows, these abstract events consist of a [1 verb + 2 arguments] structure parsed from a complete story sentence using an external SRL tool. The overall event plan creation can be considered a self-annotation pipeline, where we leverage trained models rather than human effort to provide both abstract events and their commonsense relations.

F.3 Relation Prediction Performances

As described in Sec. 5.2.2, relation labels are treated as special tokens and are generated together with the rest of the event plan by a language decoder. After an event plan is generated, we can retrieve the predicted relations and compare them with the self-annotated relation labels described in Appendix F.2.

Relations        Size     Precision (%)   Recall (%)   F1 (%)
⟨HasSubEvent⟩    127      57.50           54.33        55.87
⟨isAfter⟩        372      27.07           32.53        29.55
⟨isBefore⟩       660      72.69           58.48        64.82
⟨oEffect⟩        1,893    60.01           69.99        64.62
⟨oReact⟩         8        0               0            0
⟨oWant⟩          31       48.57           54.84        51.52
⟨xEffect⟩        11,274   84.25           88.96        86.54
⟨xNeed⟩          2,491    72.15           44.52        55.06
⟨xReact⟩         407      62.67           34.64        44.62
⟨xWant⟩          2,149    81.57           91.25        86.14
Macro-average             77.83           78.09        77.32
Table F.2: F1 scores for relation predictions in the plan generation. Samples are extracted from the ROCStories test set. Self-annotated labels are used for comparison.

As Table F.2 shows, ⟨xEffect⟩ is dominant in the ROCStories dataset, which aligns with previous research findings that causal or conditional relations are the most commonly observed event relations in text narratives (Han et al. 2021a). The high precision scores, particularly for the largest 5 relations, suggest that the majority of the predicted relations are likely correct and provide useful logical guidance for plan generation. An alternative method is to predict relation labels as a separate task and train the plan generation model with an auxiliary relation classification loss (multi-task learning). 
We did not pursue this approach further as we found its prediction accuracy fell far behind our proposed +CS-label method. We leave a detailed investigation for future research.

F.4 Generation Examples

In Table F.3, we show 4 examples comparing our final model with the BART-base baseline directly finetuned on ROCStories. All baseline stories contain at least one sentence that is not logically consistent with the context (red highlights in the original), whereas our final models with event plans and semantic alignments produce mostly consistent stories. We observe that event plans are occasionally not followed, but this does not hurt the story quality in these examples.

F.5 MTurker Statistics and Analysis

As mentioned in Section 5.2.3, all MTurkers we recruited for this work excelled at our previous large-scale annotation and story evaluation tasks. Moreover, they surpassed Master-level qualifications on MTurk (more than 20,000 HIT submissions with > 99% acceptance rate). We paid each worker 1.25 U.S. dollars per HIT on average across the three evaluation tasks. The total cost for human evaluation is 3 * 1.25 * 200 * 1.2 = 900 (U.S. dollars).

To further assess their annotation results, the co-authors manually evaluated the main task, event logical consistency, in all 200 randomly selected stories for the Vanilla and +Semantic models. The Cohen's Kappa (Cohen 1960) scores between the co-authors and the MTurkers are 0.41 and 0.43 for Vanilla and +Semantic respectively, both indicating moderate agreement (a short illustrative sketch of this agreement computation follows Table F.3).

[Ex1]
Baseline story: billy went to the zoo / he saw the monkeys / he wanted to see the monkey exhibit / he asked the monkey if he could go inside / the monkey said no
Final event plan: billy went to the zoo ⟨xWant⟩ he see the giraffes ⟨xEffect⟩ he saw a giraffe ⟨xWant⟩ he pet it ⟨xEffect⟩ the giraffe bit him
Final story: billy went to the zoo / he wanted to see the giraffes / he saw a giraffe / he tried to pet it / the giraffe bit him

[Ex2]
Baseline story: the grass was starting to get too tall / i decided to cut it down / i went to the store and bought the cutest grass i could find / i cut it up and put it in the yard / it looked great.
Final event plan: the grass get too tall ⟨xEffect⟩ it needed to be mowed ⟨xEffect⟩ i bought a mower ⟨isBefore⟩ i mowed the grass ⟨xEffect⟩ it looked great
Final story: the grass was starting to get too tall / it needed to be mowed / i bought a lawn mower / i mowed the grass for a long time / it looked great.

[Ex3]
Baseline story: i bought the clothes clandestinely / i wore them to school the next day / when i got home, i realized i left my wallet at home / i went to the store to return the wallet / i returned the wallet to the owner
Final event plan: i bought the clothes ⟨xEffect⟩ they were from a secret admirer ⟨xEffect⟩ i took them ⟨xEffect⟩ they looked great ⟨xNeed⟩ i wear them
Final story: i bought the clothes clandestinely / i thought they were from a friend / i took them to the laundromat / they looked great / i wore them to school

[Ex4]
Baseline story: jeff and eve had a big farm with a stable / one day, they decided to sell the stable / the stable was very expensive / they sold the stable for a lot of money / jeff and eve were very happy with their purchase.
Final event plan: jeff and eve had a big farm with a stable ⟨oWant⟩ eve ride on the horse ⟨oEffect⟩ jeff and eve rode the horse ⟨xEffect⟩ the horse got loose ⟨xEffect⟩ eve was sad
Final story: jeff and eve had a big farm with a stable / eve wanted to ride the horse / jeff and eve rode the horse every day / one day, the horse got loose from the stable and ran away / eve was sad and never rode on the horse again.
Table F.3: Generation examples. Baseline: fine-tuning BART-base directly on the target data; Final: our model with Plan-and-Write, +CS-label, and +S-align. We show Final's generated event plan for illustration purposes. Sentences/events that are not logically consistent in the story are highlighted in red in the original.
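As referenced in Appendix F.5, here is a minimal sketch of computing inter-annotator agreement with Cohen's Kappa. The variable names are hypothetical, and scikit-learn is one reasonable choice of tool, not necessarily the one used for the reported numbers.

```python
# Illustrative Cohen's Kappa between co-author and MTurker consistency judgments.
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0/1 labels for the evaluated stories (1 = logically consistent).
coauthor_labels = [1, 0, 1, 1, 0]      # ... 200 entries in practice
mturker_labels  = [1, 0, 0, 1, 0]

kappa = cohen_kappa_score(coauthor_labels, mturker_labels)
print(f"Cohen's Kappa: {kappa:.2f}")   # ~0.4 indicates moderate agreement
```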
Abstract
The field of natural language processing (NLP) has made enormous progress over the past several years, largely due to the success of neural networks (e.g., LSTMs and GRUs). More recently, transformer-based pretrained language models (PTLMs) such as BERT, RoBERTa, XLNet, GPT-2, and BART allow machines to achieve human or near-human performance over a wide range of NLP applications, including text paraphrasing, sentiment analysis, information extraction (IE), question answering (QA), natural language inference (NLI), natural language generation (NLG), etc.
Despite these successes, recent NLP research has pointed out that machines still lack reasoning skills, which prevents models from generalizing efficiently to unseen scenarios. For example, Jia et al. (2017) find that in machine reading comprehension, models appear to conduct pattern matching rather than explicit reasoning, resulting in failure to handle adversarial examples. Holtzman et al. (2018) also show that a lack of reasoning may cause NLG models to produce generic, repetitive, and even self-contradictory texts. Kahneman (2003) categorizes reasoning as a more advanced system of human intelligence compared with relatively simpler perception (e.g., object and speech recognition). Therefore, it is crucial to advance machines' reasoning capability in order to elevate NLP models into more intelligent agents.
In this thesis, we attempt to improve a sub-area of machine reasoning, i.e., event-centric reasoning in text data. Real-world texts such as news articles and stories are often composed of sequences of events. Thus, for natural language understanding, it is inherently critical to identify events and to understand how they are related to each other. On the other hand, many real-world NLP tasks are based on events, and stronger event reasoning can improve these NLP applications.
Specifically, we divide our contributions into three categories: 1) tasks: we tackled three event-centric tasks in this thesis, event relation extraction, question answering, and story generation, and pushed their performance to a new state of the art; 2) methods: we proposed novel methods, neuro-symbolic networks and targeted pre-training, for effective event knowledge incorporation; 3) data: we created new datasets that challenge machines' reasoning capability. Overall, our work improves machine learning models' understanding of events and their ability to reason about event relations. We demonstrate the effectiveness of the proposed methods through improvements on event-centric applications in relation extraction, question answering, and story generation.