Towards Generalized Event Understanding in Text via Generative Models by I-Hung Hsu A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) May 2024 Copyright 2024 I-Hung Hsu To the invaluable support, unwavering persistence, and the boundless joy experienced throughout this journey. ii Acknowledgments Completing this thesis owes much to numerous individuals’ invaluable guidance and support. Foremost among them, I extend my deepest gratitude to my advisors, Prof. Prem Natarajan and Prof. Nanyun (Violet) Peng, whose unwavering mentorship has been instrumental throughout this extensive journey. Violet’s comprehensive mentorship has enriched my understanding of research and offered profound insights into career development and life itself. I am particularly grateful for her steadfast support during the initial challenges of my Ph.D. Her patience and enduring belief in my abilities have been pivotal in seeing me through this journey. Similarly, working alongside Prem has imparted invaluable lessons on setting and pursuing goals with unwavering determination, ensuring each stride taken is purposeful toward attaining the ultimate objective. I extend my sincere gratitude to Prof. Emilio Ferrara and Prof. Dan O’Leary for graciously agreeing to serve on my thesis committee and for offering invaluable insights and feedback on my work. Additionally, I am deeply appreciative of Prof. Cyrus Shahabi for his role in serving as a member of my qualification committee. I would also like to express my gratitude to Prof. Muhao Chen for his contributions to both my qualification and proposal committees, as well as for his guidance throughout various collaborative endeavors. Particularly, I have gleaned significant insights into mentoring skills under Muhao’s tutelage. Furthermore, I wish to acknowledge Prof. Kai-Wei Chang for his supervision and collaborative support throughout our shared endeavors. Collaborative efforts are essential in research endeavors, and I am privileged to be part of a vibrant and cooperative environment. I extend my heartfelt gratitude to Dr. KuanHao Huang, whose close collaboration has led to the fruition of numerous engaging research projects. Our shared discussions on innovative research avenues, exchange of ideas, and mutual support not only enriched our scientific pursuits but also provided valuable insights iii into career and life aspirations. Furthermore, I am grateful for the fruitful collaborations with Tanmay Parekh, Dr. Rujun Han, Zhiyu Xie, and Keming Lu. Their contributions have significantly enhanced the depth and breadth of our work. Additionally, I would like to express my gratitude to the esteemed members of Pluslab, including Te-lin Wu, Yufei Tian, Mingyu (Derek) Ma, Po-Nien Kung, Dr. Jiao Sun, KungHsiang (Steeve) Huang, Sidi Lu, Dr. Sarik Ghazarian, Alex Spangher, Zi-Yi Dou, Julia Baylon, Haoyi Qiu, Lucas Bandarkar, Mohsen Fayyaz, Xueqing Wu, Yu (Bryan) Zhou, and many others. Their enthusiasm and dedication have significantly contributed to a dynamic and stimulating research environment. I am deeply inspired by all of these talented individuals. In particular, I owe a special thanks to Te-lin Yufei Tian, Derek, Po-Nien, and Jiao for their invaluable support in both my academic and personal life. During my Ph.D. 
studies, I have been privileged to spend some of my summers and falls interning at various companies, each experience proving invaluable to my growth. I enjoyed collaborating closely with esteemed mentors at Amazon, including Dr. Avik Ray, Dr. Jing Huang, Shubham Garg, and Dr. Chris Hench. At Microsoft Research, I had the opportunity to learn extensively from Dr. Silviu Cucerzan, Nirupama Chandrasekaran, and Dr. Sujay Kumar Jauhar. Similarly, my time at Google was enriching, where I benefited greatly from the expertise of Dr. Chen-Yu Lee, Dr. Zifeng Wang, Dr. Long Le, and Dr. Lesly Miculicich. These diverse projects broadened my research horizon and honed my skills from various perspectives. I am grateful for the invaluable support and camaraderie extended to me during my time at USC-ISI. Dr. Ekraam Sabir, Dr. Ayush Jaiswal, Dr. Emily Sheng, Jiaxin Cheng, and Soumyaroop Nandi, whom I had the privilege of meeting through the VISTA Lab, significantly contributed to my academic journey. Their enriching discussions fostered an intellectually stimulating environment that I fondly remember. Additionally, I extend my gratitude to Elizabeth (Liz) Boschee for her invaluable assistance with the BETTER project. Lastly, I express my profound appreciation to Karen Rawlins for her unwavering support in iv logistics and emotional encouragement throughout my tenure at USC. Karen’s assistance has been indispensable to my academic pursuits. Graduate school is undoubtedly challenging, and I am profoundly grateful to the individuals with whom I’ve had the privilege of sharing this journey with me day in and day out. To Crystal, Andy, Peter, Kevin, Wei-Hsiu, Eric, Armi, Katherine, David, Hank, Amy, Kuan-Ying, Chia-Jo, Shao-Hua, Andrew, Jason, Michael Chen, Ashley, Chien-Sheng, PeiLun, Rachel, Tim, Samuel, and many others, your companionship has made this experience akin to having another family here in Southern California. Additionally, I wish to express my sincere gratitude to friends from my life before pursuing my Ph.D.: Manman, Welly, Gary, Zhan-Lun, Maintain Lee, Pin-Jung, Da-Fang, and numerous others. It is your unwavering support that fuels my determination to pursue this Ph.D. Finally, I extend heartfelt thanks to my parents, M.Y. Hsu and S.J. Wu, as well as to my siblings, S.Y. Hsu and S.M. Hsu. You are the reason I am here and you have always been my greatest support. v Table of Contents Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Other Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Chapter 2: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Definitions of Event and Event Understanding Tasks . . . . . . . . . . . . . 6 2.1.1 Task 1 – Event Detection . . . . . . . . . . . . . . . . . . . . . . . . 
6 2.1.2 Task 2 – Event Argument Extraction . . . . . . . . . . . . . . . . . . 7 2.1.3 Task 3 – Event Linking . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Generalization for Event Understanding Models . . . . . . . . . . . . . . . . 8 2.3 Pre-trained Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . 8 I Model Perspective Solutions: A new generative framework to address event extraction 10 Chapter 3: Instantiation of the New Generative Framework for Event Extraction . . 12 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2.1 Degree Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 vi 3.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.5 Degree in Pipeline Framework . . . . . . . . . . . . . . . . . . . . . 18 3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.3 Few-Shot and Zero-Shot Result . . . . . . . . . . . . . . . . . . . . . 22 3.3.4 High-Resource Event Extraction . . . . . . . . . . . . . . . . . . . . . 25 3.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4.1 Impacts of components in prompts. . . . . . . . . . . . . . . . . . . . 28 3.4.2 Effects of different template designs. . . . . . . . . . . . . . . . . . . 29 3.4.3 Sensitivity to template design. . . . . . . . . . . . . . . . . . . . . . . 30 3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Chapter 4: Incorporating Cross-Domain Robust Knowledge to the Generative Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.2 Ampere Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2.1 Generation-Based EAE Model . . . . . . . . . . . . . . . . . . . . . . 36 4.2.2 AMR Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2.3 AMR-Aware Prefix Generation . . . . . . . . . . . . . . . . . . . . . 37 4.2.4 Adjusted Copy Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4.1 Different Ways for AMR Incorporation . . . . . . . . . . . . . . . . . 44 4.4.2 Studies of Adjusted Copy Mechanism . . . . . . . . . . . . . . . . . . 45 4.4.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
50 Chapter 5: Enhancing the Generative Framework for Zero-Shot Cross-Lingual Transfer 51 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.3 Task Definition: Zero-Shot Cross-Lingual Event Argument Extraction . . . . 53 5.4 Proposed Method: X-Gear . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.4.1 Language-Agnostic Template . . . . . . . . . . . . . . . . . . . . . . 55 5.4.2 Target Output String . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.4.3 Input Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.4.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 vii 5.5.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.5.3 Compared Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.6.1 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.6.2 Constrained Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 II Data Perspective Solutions: Methods to generate pseudo training data for event understanding 68 Chapter 6: Generate Pseudo Multilingual Event Understanding Training Data via Contextual Label Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.2.1 Structure Prediction Tasks . . . . . . . . . . . . . . . . . . . . . . . . 73 6.2.2 Zero-shot Cross-Lingual Transfer . . . . . . . . . . . . . . . . . . . . 74 6.2.3 Translate-Train . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.2.4 Label Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3.1 Baseline Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.3.2 Proposed Method: CLaP . . . . . . . . . . . . . . . . . . . . . . . . . 77 6.3.3 Implementing CLaP . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 6.4.1 Task and Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.4.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.4.4 Intrinsic Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.4.5 Extrinsic Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.5.1 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6.5.2 CLAP with Larger LLMs . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.5.3 Generalization to other translation models . . . . . . . . . . . . . . . 
85 6.5.4 Ablation Study for CLaP . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.5.5 CLaP for Translate-Test . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.6 CLaP for Low-Resource Languages . . . . . . . . . . . . . . . . . . . . . . . 87 6.7 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Chapter 7: Synthesize Pseudo Out-of-KB Examples for Event Linking . . . . . . . . . 90 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 viii 7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 7.3.1 Event Argument Tagging . . . . . . . . . . . . . . . . . . . . . . . . . 92 7.3.2 Negative Data Generation . . . . . . . . . . . . . . . . . . . . . . . . 93 7.3.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.4.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 96 7.4.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.4.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 III Conclusion 101 Chapter 8: Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 102 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 Appendix A: Appendix for Degree . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.1 EAE Template Constructing for Degree . . . . . . . . . . . . . . . . . . . . 123 A.2 Training Details of Degree . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Appendix B: Appendix for Ampere . . . . . . . . . . . . . . . . . . . . . . . . . 131 B.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 B.2 Detailed Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Appendix C: Appendix for X-Gear . . . . . . . . . . . . . . . . . . . . . . . . . 134 C.1 Dataset Statistics and Data Preprocessing . . . . . . . . . . . . . . . . . . . 134 C.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 C.3 Constrained Decoding Detailed Results . . . . . . . . . . . . . . . . . . . . . 136 Appendix D: Appendix for CLaP . . . . . . . . . . . . . . . . . . . . . . . . . . 137 D.1 Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 D.2 Complete Results for Intrinsic Evaluation . . . . . . . . . . . . . . . . . . . . 137 D.2.1 Accuracy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 D.2.2 Faithfulness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 139 D.3 Additional Implementation Details . . . . . . . . . . . . . . . . . . . . . . . 141 D.3.1 XLM-R for NER base model . . . . . . . . . . . . . . . . . . . . . . . 141 D.3.2 CLaP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
141 D.3.3 EasyProject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 D.4 Large Language Model Direct Inference Analysis . . . . . . . . . . . . . . . . 143 Appendix E: Appendix for Event Linking . . . . . . . . . . . . . . . . . . . . . 147 ix E.1 Event Extraction Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 E.2 Data Generation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 E.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 E.4 Full Bi-encoder Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . 153 x List of Tables 3.1 Three examples of EAE templates for the ACE 2005 corpus. . . . . . . . . . . 16 3.2 Trigger classification F1-scores and argument classification F1-scores for lowresource event extraction. The highest scores are in bold, and the secondbest scores are underlined. “Cls” and “Gen” represent classification- and generation-based models, respectively. If the model is pipelined, then its argument predictions are based on its predicted triggers. Degree achieves a much better performance than other baselines. The performance gap becomes more significant for the extremely low-resource situation. . . . . . . . . . . . 20 3.3 Common event types in ACE05-E. . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Full results of zero/few-shot event extraction on ACE05-E. . . . . . . . . . . 25 3.5 Full results of zero/few-shot event argument extraction on ACE05-E. . . . . 25 3.6 Results for high-resource event extraction. Highest scores are in bold and the second best scores are underlined. *We report the numbers from the original paper. Degree has a competitive performance to the SOTA model (OneIE) and outperform other baselines. . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.7 Results for high-resource event argument extraction. Models predict arguments based on the given gold triggers. Best scores are in bold. *We report the numbers from the original paper. Degree(EAE) achieves a new stateof-the-art performance on event argument extraction. . . . . . . . . . . . . . 26 3.8 Ablation study for the components in the prompt on event detection using ACE05-E dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.9 Ablation study for the components in the prompt on event argument extraction with ACE05-E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.10 Performances of Degree(EAE) on ACE05-E with different types of templates. 29 xi 3.11 Study on the effect of different template constructing rules. Experiments is conducted on ACE05-E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.1 Argument classification F1-scores (%) under different data proportion settings for ACE05-E and ERE-EN datasets. The highest scores are in bold and the second-best scores are underlined. Generation-based models and Classification-based models are indicated by ”Gen” and ”Cls” respectively. The table with argument identification F1-scores is listed in Section B.2. . . 42 4.2 Ablation study of different ways for AMR incorporation. Report numbers in F1-scores (%). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 The study of using different generation mechanisms. Report numbers in F1- scores (%). The best performance among methods using the same model architecture is highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . 
46 5.1 Average results in argument classification F1(%) of ACE-2005 with three different seeds. The best is in bold and the second best is underlined. “en ⇒ zh” denotes models transferring from en to zh. Compared with models using similar numbers of parameters, X-Gear (mT5-base) outperforms baselines. To test the influence of using larger pre-trained generative models, we add X-Gear (mT5-large), which achieves even better results. . . . . . . . . . . . 60 5.2 Average results in argument classification F1(%) of ERE with three different seeds. The best is in bold and the second best is underlined. “en ⇒ es” denotes that models transfer from English to Spanish. . . . . . . . . . . . . . 61 5.3 Ablation study on copy mechanism for ACE-2005. “en ⇒ xx” indicates the average of “en ⇒ en”, “en ⇒ zh”, and “en ⇒ ar”. . . . . . . . . . . . . . . . 63 5.4 Ablation study on including event type information in prompts for ACE-2005. “en ⇒ xx” indicates the average of “en ⇒ en”, “en ⇒ zh”, and “en ⇒ ar”. . 64 5.5 Ablation study on different orders of roles in templates for ACE-2005. “en ⇒ xx” indicates the average of “en ⇒ en”, “en ⇒ zh”, and “en ⇒ ar”. . . . . . 65 5.6 Comparison of using English tokens and special tokens for roles in templates. “en ⇒ xx” indicates the average of “en ⇒ en”, “en ⇒ zh”, and “en ⇒ ar”. . 65 5.7 Results of applying constrained decoding. Breakdown numbers can be found in Appendix C.3. Based on whether the training languages are the same between training and testing, we classify the results into monolingual and cross-lingual, and we report the corresponding average for each category. . . 66 xii 6.1 High-level data statistics for ACE and WikiANN datasets for EAE and NER tasks respectively. # = ‘number of’ and Avg. = average. . . . . . . . . . . . 79 6.2 Extrinsic evaluation of the different label projection techniques regarding downstream model performance using translate-train and the LLM-Infer baseline for EAE. Avg = Average. ∗ indicates the reproduced results of X-Gear (Check Chapter 5 for X-Gear details). . . . . . . . . . . . . . . . . . . . . . . 82 6.3 Extrinsic evaluation of the different label projection techniques in terms of downstream model performance using translate-train and the LLM-Infer baseline for NER. Avg = Average. . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.4 We present qualitative examples illustrating the error cases of the baseline models, accompanied by explanations, using the examples in Hindi (hi) and Chinese (zh). We also show how CLaP performs better and fixes the errors. Blue text is English translation. . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.5 Extrinsic evaluation of CLaP using Llama-2-13B and GPT-3.5-Turbo for five languages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.6 Extrinsic evaluation of the different label projection techniques using translatetrain for EAE using the mBART-50 many-to-many translation model. . . . . 85 6.7 Ablation study comparing different contextual translation techniques for label projection. Performance is measured by downstream EAE performance. . . . 86 6.8 Extrinsic evaluation of the different label projection techniques using translatetest using GMT for EAE and NER. Avg = Average . . . . . . . . . . . . . . 87 6.9 Extrinsic evaluation of the different label projection techniques using translatetrain using GMT for NER for 10 low-resource languages. . . . . . . . . . . . 87 7.1 Statistics of the two datasets. . . . . . . . . . . . . . . . . . . . . . . . 
. . . 96 7.2 Accuracy (%) on both Wikipedia (in-domain, in-KB) and NYT (out-of-domain, out-of-KB) test sets for event linking. The best performance is highlighted in bold. BLINK [163]; EveLink [180]. . . . . . . . . . . . . . . . . . . . . . . . 97 7.3 Bi-encoder recall (%) on the Wikipedia test set. “R@1” stands for recall at 1, and so on. See Section E.4 for more recall values. . . . . . . . . . . . . . . 98 7.4 Analysis of alternative negative data generation methods. The best and the second-best are bolded and underlined, respectively . . . . . . . . . . . . . . 99 xiii 7.5 Comparison with the LLM-reranked baseline. Due to budget constraints, the experiment is conducted on a subset of the whole dataset. . . . . . . . . . . 100 A.1 All EAE templates for ACE05-E and ACE05-E+ for Degree. . . . . . . . . . 126 A.2 All EAE templates for ACE05-E and ACE05-E+ for Degree (Cont.). . . . . 127 A.3 All EAE templates for ERE-EN for Degree. . . . . . . . . . . . . . . . . . . 128 A.4 All EAE templates for ERE-EN for Degree (cont.). . . . . . . . . . . . . . . 129 A.5 Dataset statistics. Our experiments are conducted in sentences, which were split from documents. In the table, “#Docs” means the number of documents; “#Sents” means the number of sentences, “#Events” means the number of events; “#Event Types” means the number of event types in total; “#Args” means the number of argument in total; “#Arg Types” means the number of argument role types in total. . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 B.1 Argument Identification and classification F1-scores (%) under different data proportion settings for ACE05-E and ERE-EN datasets. The highest scores are in bold and the second-best scores are underlined. The reported numbers are the average of the results from three random runs. The standard deviation (%) of three runs are also reported in the table. . . . . . . . . . . . . . . . . 133 C.1 Dataset statistics of multilingual ACE-2005 and ERE. . . . . . . . . . . . . . 134 C.2 The detailed breakdown results for applying constrained decoding on XGear. The avg(mono.) column represents the results that average over values in en ⇒ en, zh ⇒ zh, and ar ⇒ ar. The avg(cross.) column represents the results that average over values in en ⇒ zh, en ⇒ ar, zh ⇒ en, zh ⇒ ar, ar ⇒ en, and ar ⇒ zh. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 D.1 Data Statistics in terms of events and arguments of the ACE dataset for the downstream task of EAE. # indicates ‘number of’. . . . . . . . . . . . . . . 137 D.2 Data Statistics in terms of sentences and entities of the WikiANN dataset for the downstream task of NER. # indicates ‘number of.’ . . . . . . . . . . . . 138 xiv D.3 A/B comparison of the various label projection techniques for accuracy evaluation for the Google Translation model. Accuracy is measured as the label translation quality by native human speakers. Here, S1 = System 1 is better, S2 = System 2 is better, and Tie = similar quality. The better systems are highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 D.4 Faithfulness evaluation of the various label projection techniques for EAE as a percentage of the times the translated labels were present in the translated input sentence. Numbers are in percentage (%). Higher faithfulness is better and the best techniques are highlighted in bold. . . . . . . . . . . . . . . . . 
139 D.5 Faithfulness evaluation of the various label projection techniques for NER as a percentage of the times the translated labels were present in the translated input sentence. Numbers are in percentage (%). Higher faithfulness is better and the best techniques are highlighted in bold. . . . . . . . . . . . . . . . . 140 D.6 Hyperparameter details for the NER downstream XLM-R model. . . . . . . 141 D.7 Evaluation of LLM-based inference and their comparison with our label projected translate-train model CLaP. This study is done on Event Argument Extraction (EAE) for two languages - Arabic (ar) and Chinese (zh) - and on Named Entity Recognition (NER) for three languages: Hindi (hi), Malay (ms), and Yoruba (yo). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 E.1 Different recall values on the in-domain, in-KB evaluation for the bi-encoder on the Wikipedia test set. The best performance is highlighted in bold. “R@1” stands for recall at 1, and so on. . . . . . . . . . . . . . . . . . . . . . . . . 154 xv List of Figures 1.1 Example illustrating how human languages convey information, including descriptions of events, involved parties, causality, and event forecasts. . . . . . 2 2.1 Two examples of events (Justice:Execute and Justice:Appeal) described in the given passage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1 An illustration of the Degree for predicting a Contact:Attack event involves several components. The input to the Degree includes the provided passage and a design prompt containing information such as event type description, event keywords, and a E2E template. The Degree is specifically trained to produce an output that populates the placeholders (underlined words) in the E2E template with appropriate triggers and arguments. Subsequently, the final event prediction is decoded from the generated output. . . . . . . . . . . . . 13 3.2 Trigger classification F1-scores and argument classification F1-scores for lowresource event extraction. Degree achieves a much better performance than other baselines. The performance gap becomes more significant for the extremely low-resource situation. . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 The zero/few-shot experimental results of Degree. Left: The result for the models on the event detection task with the scores reported in trigger classification F1. Middle: The models are tested under the scenario of a given gold trigger and evaluated with an argument classification criterion. Right: The results for the models to perform event extraction task, which aims to predict triggers and their corresponding arguments (we report the argument classification F1). . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 xvi 4.1 An overview of Ampere using an example from the ACE 2005 dataset. Given a passage and an event trigger, we first use an AMR parser to obtain the AMR graph of the input passage. The linearized AMR graph sequence will be encoded into a l-length prefix by an AMR encoder and an attention layer. Our generation-based EAE model equipped with the AMR-aware prefix then summarizes the event mentioned in the passage into a natural sentence that follows a pre-defined template in the prompt. The final arguments and the corresponding roles can be extracted from the generated sentence. . . . . . . 34 4.2 Two examples of how AMR information helps the generation of event argument predictions. 
Note that due to space constraints, the shown passage, output text, and AMR graph omit some irrelevant information. . . . . . . . 48 5.1 The overview of X-Gear. Given an input passage and a carefully designed prompt containing an event trigger and a language-agnostic template, XGear fills in the language-agnostic template with event arguments. . . . . . 55 6.1 Illustration of the task of label projection from English to Chinese. Label projection converts sentences from a source to a target language while translating the associated labels jointly. Failures in this process occur when labels are either inaccurately translated or missing in the translated sentence in the target language. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 6.2 Illustration of the various techniques to conduct label projection: (a) Markerbased methods use markers to transform the sentence and translate the transformed sentence with label markers jointly, (b) Word Alignment methods use external word alignment tools to locate the translated labels in the translated sentence, and (c) CLaP (ours) performs contextual translation on labels using M (Here, we demonstrate the use of an instruction-tuned language model as M to identify translated labels within a translated sentence.). . . . 73 6.3 Reporting the faithfulness and accuracy (in %) of the various label projection models on EAE and NER, with the top-right indicating superior performance. 81 7.1 An example of two distinct events with the same event mention but different event arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 xvii 7.2 Given a text with an event to ground, our method extracts the event’s attributes through event detection and argument extraction modules. The text, enriched with event argument tags, is then input into a Bi-Encoder to identify the top-k potential nodes. These candidates are further ranked by a CrossEncoder, which also considers an additional “NIL” candidate in case of outof-KB instances. To equip the Cross-Encoder to robustly predict “NIL” and real KB entries, we train it with additional synthetic data generated through our negative data creation process. . . . . . . . . . . . . . . . . . . . . . . . 93 7.3 Illustration for our negative data generation processing for event linking using LLM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 D.1 Illustration of the text-completion prompt used for contextual machine translation for our CLaP model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 D.2 Illustration of the chat version prompt used for contextual machine translation for our CLaP model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 D.3 Illustration of the prompt used for the LLM-infer baseline to directly utilize LLMs for downstream structured prediction tasks. . . . . . . . . . . . . . . . 143 D.4 Illustration of the in-context examples used for the three different prompting strategies for LLM-Infer baseline. . . . . . . . . . . . . . . . . . . . . . . . . 144 E.1 Prompt for our argument-aware data generation. . . . . . . . . . . . . . . . . 149 E.2 Prompt for non-argument-aware data generation baseline. . . . . . . . . . . . 150 E.3 Prompt for the LLM baseline on the Wikipedia dataset (in-KB). . . . . . . . 151 E.4 Prompt for the LLM baseline on the NYT dataset (out-of-KB). . . . . . . . 152 xviii Abstract Human languages in the world, such as news or narratives, are structured around events. 
Focusing on these events allows Natural Language Processing (NLP) systems to better understand plots, infer motivations, consequences, and the dynamics of situations. Despite the rapidly evolving landscape of NLP technology, comprehending complex events, particularly those rarely encountered in training, such as in niche domains or low-resource languages, remains a formidable challenge. This thesis explores methods to enhance NLP model generalizability for better adaptability to unfamiliar events and languages unseen during training. My approach includes two main strategies: (1) Model Perspective: I propose a novel generation-based event extraction framework, largely different from typical solutions that make predictions by learning to classify input tokens. This new framework utilizes indirect supervision from natural language generation, leveraging large-scale unsupervised data without requiring additional training modules dependent on limited event-specific data. Hence, it facilitates the models' ability to understand general event concepts. I further explore advanced methods to extend this framework for cross-lingual adaptation and to utilize cross-domain robust resources effectively. (2) Data Perspective: I develop techniques to generate pseudo-training data that broaden the training scope for event understanding models. This includes translating structured event labels into other languages with higher accuracy and fidelity, and synthesizing novel events for the existing knowledge base. Overall, my work introduces a novel learning platform to the NLP community, emphasizing an innovative modeling paradigm and comprehensive data preparation to foster more generalized event understanding models.

Chapter 1 Introduction

1.1 Motivation

In an era where data is ubiquitous, the effective acquisition and management of comprehensive information have become critical in various sectors, including business [107], healthcare [52], and scientific discovery [37]. This necessity drives the development of sophisticated technologies capable of sifting through vast and diverse unstructured information to identify and gather relevant, structured information [40]. For decades, developing automated natural language processing (NLP) methods to extract and organize structured information from unstructured data has been a focal point of research. While substantial progress has been made, most prior methods have concentrated primarily on identifying and categorizing entities [26], [133], [139], [171]. This focus, although helpful, often overlooks the broader context of the data: the events in which these entities are involved. Events drive the narrative of the data and are pivotal in forming a holistic understanding of the information presented.

A core aspect of human communication and language is the description of events, which inherently involves narrating interactions, occurrences, and activities. For example, the piece of news illustrated in Figure 1.1 describes the withdrawal event that happened to Novak Djokovic, caused by the declaration event from the U.S. government. Understanding these events is crucial, as they offer a more comprehensive depiction of scenarios and their impacts.

Figure 1.1: Example illustrating how human languages convey information, including descriptions of events, involved parties, causality, and event forecasts. [The figure shows a news excerpt about Novak Djokovic withdrawing from the BNP Paribas Open after losing his bid to enter the United States unvaccinated, alongside the U.S. ending its COVID-19 emergency declaration on May 11, annotated with the questions: What happened? Who is involved in what role? What is the event causality? What would have happened, i.e., the forecast and the plan for the future?]

However, events are inherently complex and ambiguous, characterized by high variability due to their multiple components, such as actions, participants, and various circumstances, including time, location, and manner. Each variation subtly alters the event's meaning, presenting considerable challenges in formulating simple algorithms or collecting comprehensive training data for model development. These challenges necessitate research into building a more robust system that can generalize effectively to handle new and previously unseen event types that were not encountered during training [70]. Moreover, the linguistic diversity in global data, where a significant portion of content exists in languages other than English, introduces additional demands on the system and adds another layer of complexity. Most existing NLP systems are predominantly trained on English data, which restricts their effectiveness and applicability in multilingual contexts. This limitation underscores the need for systems that operate across linguistic boundaries [60], [132]. This level of generalizability is vital for NLP systems to effectively apply event understanding globally.

Motivated by these practical challenges, my thesis focuses on enhancing the generalizability of NLP systems in understanding events. My first objective is to develop methodologies that can perform data-efficient learning for event understanding and enable fast adaptation to new scenarios. To push the boundaries of how automated systems can support information processes in a globally connected world, I study cross-lingual transfer learning, where a model trained in source languages can also support testing in many other languages. This pursuit addresses a significant gap in the current capabilities of NLP systems and aligns with the broader objective of making NLP more inclusive and universally applicable.

1.2 Overview of the Thesis

This thesis explores using generative models to enhance the generalizability of NLP models for improved event understanding. To address this challenge, I propose two perspectives on harnessing the potential of generative models to enable more robust event analysis in textual data.

In the initial part of the thesis (Part I), I approach the challenge by proposing a new generative formulation to address event understanding in text. Traditional approaches predominantly utilize discriminative methods, which classify each token in the text based on its participation in an event and identify the roles these tokens play within that event [50], [109], [110]. These methods primarily focus on categorizing inputs into predefined classes and are thus insufficient when confronted with new, previously unseen event types. Motivated by this limitation, I advocate for transitioning from a discriminative to a generative approach in model design.
This revised approach predicts events by generating summarized sentences about the event, thereby improving the model's expressiveness regarding event labels by jointly modeling both the input and the event annotations as text rather than as heterogeneous elements. In Chapter 3, I explain the implementation of this new generative framework, based on the publication [54]. Subsequently, in Chapter 4, I propose techniques that enable this generation-based model to utilize robust resources across domains, based on my work in [58]. Chapter 5 explores the extension of this model design to support cross-lingual adaptation, aligning with one of the generalization objectives of this thesis; this chapter is based on my work in [63].

In the latter part of the thesis (Part II), I delve into alternative approaches aimed at enhancing NLP models' proficiency in event understanding by leveraging pre-trained generative models to produce pseudo-training data. Given the prohibitive expense associated with annotating events within text, it is more commonplace to annotate only a significantly smaller subset of data for training, a scenario that diverges from the real-world cases encountered during testing, thereby constraining the robustness and generalization capabilities of the trained model [184]. Consequently, I propose augmenting the sufficiency and diversity of training examples by sampling new instances from pre-trained generative models. However, a central challenge in this augmentation process lies in steering the generation of data samples while maintaining control over their corresponding event annotations [27], [41]. In Chapter 6, I introduce methods for generating pseudo data across a diverse array of languages through the translation of English source data. To ensure the accurate and faithful preservation of event information in translated data, I propose a contextualized translation approach to guide the label translation step. This methodology builds upon my previous work [120]. Subsequently, in Chapter 7, I extend the concept of generating pseudo-training data to the event linking task. Here, I develop a sophisticated control paradigm to synthesize novel events for existing knowledge bases, enabling model training to address unforeseen event cases encountered during testing. This endeavor builds upon my previous work [59].

1.3 Other Publications

Beyond the main body of work on extending the generalizability of NLP models for event understanding through generative models, I have contributed to various other efforts aimed at enhancing AI models' comprehension of events. Three significant contributions in terms of resources are outlined below: (1) In [64], we establish an open-source and comprehensive benchmarking platform for event extraction. This paper also addresses evaluation issues and conducts a thorough re-evaluation of existing popular models in the field, thus facilitating future event extraction research. (2) In [121], we create the first large-scale event argument extraction dataset supporting over 100 event types and over 200 different event role types. (3) In [48], we introduce ESTER, a comprehensive machine reading comprehension dataset for event semantic relation reasoning. We explore five types of event semantic relations: causal, sub-event, co-reference, conditional, and counterfactual, proposing the use of natural language questions for reasoning about event semantic relations.
In addition to these resource contributions, I have also introduced two additional modeling techniques aimed at enhancing the event understanding capabilities of models. In [55], we demonstrate the efficacy of priming techniques in enhancing classification-based models for relational structure extraction tasks, including event extraction. Furthermore, in [49], we propose a neural symbolic approach to incorporate global consistency into neural networks for event temporal relation extraction.

During my Ph.D. studies, I have also contributed to various research endeavors in the realms of relation extraction, text synthesis, and speech recognition. In the field of relation extraction, we devised evidence grounding techniques to tackle cross-document relation extraction challenges [96]. Additionally, we explored the use of summarization models as supervision signals for relation extraction [97] and developed a graph pooling framework to enhance long-context relation extraction [53]. In text synthesis, our work pioneered code-switching text synthesis for multiple language combinations using limited resources [57]. Moreover, in [65], we curated a new paraphrase dataset. Finally, in the domain of speech recognition, I investigated methods for building a more robust system against noise [56].

Chapter 2 Background

2.1 Definitions of Event and Event Understanding Tasks

In this section, we provide a formal definition of events, adhering to the general guideline set forth by the ACE dataset [30]. Specifically, an event comprises a trigger word and its corresponding arguments. The event trigger typically represents the clearest descriptor of the event's occurrence, often manifesting as a verb or noun that conveys the action or state of the target event [115], [127]. Event arguments encompass various facets such as the event's participants (subject or object), time of occurrence, location, and other pertinent details regarding the event [121]. For example, in Figure 2.1, a Justice:Execute event is triggered by the word "execution" and this event contains three argument roles, including an Agent (Indonesia) who carries out the execution, a Person who is executed (convicts), and a Place where the event occurs (not mentioned in the passage). To evaluate the NLP model's proficiency in event understanding, we delineate three tasks in this thesis, formally articulating them in the subsequent sections.

Figure 2.1: Two examples of events (Justice:Execute and Justice:Appeal) described in the given passage. [The figure annotates the passage "Indonesia will delay the execution of six convicts including an Indian on death row after five of them appealed to the Supreme Court for a second review." with two events: Justice:Execute, triggered by "execution", with Agent: Indonesia, Person: convicts, Place: None; and Justice:Appeal, triggered by "appealed", with Plaintiff: five, Prosecutor: None, Adjudicator: Supreme Court, Place: Indonesia.]

2.1.1 Task 1 – Event Detection

Event detection takes raw text x as input and detects whether specific types of events have occurred in the input text. These event types of interest are usually specified by an "event ontology," denoted as O = {o1, o2, ..., om}, which describes the definitions of each category of events. For instance, in Figure 2.1, "Justice:Execute" and "Justice:Appeal" represent two categories within the event ontology O. The objective of event detection is to identify event triggers and determine their corresponding types oi for an event e in the input.
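To make the event detection input and output concrete, the short sketch below encodes the Figure 2.1 example as plain Python data structures. The EventTrigger fields and the detect_events signature are illustrative assumptions made for exposition, not an interface defined in this thesis.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EventTrigger:
    text: str        # trigger span as it appears in the passage
    event_type: str  # one type o_i from the event ontology O
    start: int       # character offsets of the span in the passage
    end: int

# Event ontology O: each entry names one category of events of interest.
ONTOLOGY: List[str] = ["Justice:Execute", "Justice:Appeal", "Conflict:Attack"]

passage = ("Indonesia will delay the execution of six convicts including an Indian on death row "
           "after five of them appealed to the Supreme Court for a second review.")

# Gold annotations for the Figure 2.1 example: event detection should recover the
# trigger spans and assign each one a type from the ontology.
gold_triggers = [
    EventTrigger("execution", "Justice:Execute",
                 passage.index("execution"), passage.index("execution") + len("execution")),
    EventTrigger("appealed", "Justice:Appeal",
                 passage.index("appealed"), passage.index("appealed") + len("appealed")),
]

def detect_events(text: str, ontology: List[str]) -> List[EventTrigger]:
    """Hypothetical interface of an event detection model (not defined in this thesis)."""
    raise NotImplementedError
```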
2.1.2 Task 2 – Event Argument Extraction

Event argument extraction identifies arguments and their corresponding roles for a given event in the text. More specifically, as illustrated by the example in Figure 2.1, given an input passage x and an event trigger t (execution) belonging to an event type oi (Justice:Execute) in the ontology O, an event argument extraction model is required to predict a list of arguments a = [a1, a2, ..., al] (Indonesia, convicts) and their corresponding roles r = [r1, r2, ..., rl] (Agent, Person). Some prior works merge event detection and event argument extraction into a pipeline, forming an end-to-end task: event extraction.

2.1.3 Task 3 – Event Linking

Event linking aims to establish a connection between identified events in text x and a structured knowledge base KB. At its core, event linking involves associating an event e, often represented by a trigger word t, with relevant entries in a knowledge base. This knowledge base, denoted as KB = {c1, c2, ..., cm}, provides comprehensive information about various events, with each ci capturing specific details about a distinct event, such as World War I, Republic of the Congo Civil War, 2020 Summer Olympics, etc. The primary objective of event linking models is to accurately determine which entry in the knowledge base corresponds to the given event trigger. This process is essential for disambiguating events with similar mentions, where the same trigger word may refer to multiple distinct events. For instance, the event trigger word "war" could denote various conflicts or historical events, requiring the model to identify the specific instance being referenced.

2.2 Generalization for Event Understanding Models

Generalization, a cornerstone of machine learning, pertains to a model's capacity to perform effectively on data it has not encountered before. Essentially, it involves finding the optimal equilibrium between grasping the inherent patterns within the data and steering clear of overfitting. Within the realm of event understanding, we specifically delve into two facets of generalization. Firstly, we explore scenarios where the events, denoted as e, significantly diverge from what is covered in the event ontology O or the event knowledge base KB. Secondly, we address instances where the description of the event e is articulated in a language that was not included during the model's training phase.

2.3 Pre-trained Language Models

Recent advancements in generalization for NLP have been largely attributed to the widespread adoption of pre-training techniques. Pre-training involves training a model, denoted as Pϕ, on a vast corpus of text data without task-specific annotations. Unlike task-specific training, which focuses on optimizing for a particular objective, pre-training Pϕ typically involves designing objectives based on the input text itself. This approach, known as self-supervised learning in current research, has paved the way for significant breakthroughs in NLP [44], [124]. Because of this flexibility, the trained model Pϕ can consume textual knowledge in a broader range, facilitating models' generalizability if Pϕ is properly reused. These methods can also be extended to handle text written in multiple different languages. One prevalent pre-training objective falls under the auto-encoding category, exemplified by models like BERT [29], RoBERTa [94], etc.
These models use masked token strategies, wherein random tokens within the input sequence are masked, and the model is trained to predict the original tokens. By learning to recover masked tokens, auto-encoding models effectively capture contextual information and generate more transferable representations of inputs. Another category of pre-training is the auto-regressive approach, exemplified by models such as BART [79], and PALM [23], as well as more recent large language models like LLAMA [150] and GPT-series [17]. In this paradigm, models are trained to predict the next token in a sequence given the preceding tokens. This methodology equips the pre-trained models with the ability to generate coherent and contextually relevant text similar to the distribution of text in the large-scale pre-training corpus. In this thesis, our methods are built upon exploring better utilization of a pre-trained auto-regressive model that not only absorbs knowledge about the world but also possesses the ability to generate data similar to real-world testing cases. We provide both mathematical explanations and empirical evidence to show why our usage enables better generalizability to event understanding. 9 Part I Model Perspective Solutions: A new generative framework to address event extraction 10 In this part of my thesis, I focus on event extraction tasks (more details about the task definition can be found in Chapter 2.1). Previous studies on event extraction have primarily centered on classification-based methods. These approaches entail categorizing each token within the text and assigning it a label indicating whether it signifies an event trigger or an event argument, along with its corresponding role [33], [50], [68], [109], [110]. Such methodologies can be conceptualized as training a task-specific model, represented as Pθ(y|x), where y denotes the type or role of each given text, and x represents the input text. For instance, in event detection, y encompasses all event types (o1, o2, ..., on) in the ontology O, along with a distinct type representing ”not an event trigger.” The model’s parameters θ are required to learn the mapping between tokens and types using task-specific training data. However, during testing, if a new event type on+1 emerges, which is of current interest, the trained θ lacks the capability to make predictions concerning it, as the fixed mapping from input to all categories has already been established. Rather than solely concentrating on modeling Pθ(y|x), my approach involves modeling the joint probability of input text x and a specific event assignment y, denoted as Pθ(x, y). By treating the event assignment y as text, thus amalgamating (x, y) into a unified resource, we can represent Pθ(x, y) as Pθ(z), where z is also textual. Subsequently, through the application of the auto-regressive pre-trained language model (as elaborated in Chapter 2.3), we can directly transfer Pϕ and apply it as Pθ, enhancing the model’s capacity for generalized understanding. Thus, for any new event type on+1, provided on+1 can be expressed in pretrained text, Pθ can be utilized to model the new event seamlessly. In Chapter 3, I explain implementing this new generative framework. Subsequently, in Chapter 4, I propose techniques that enable this generation-based model to utilize robust resources across domains. Chapter 5 explores the extension of this model design to support cross-lingual adaptation. 
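As a minimal illustration of the reformulation above, the sketch below verbalizes an input passage x and an event assignment y into a single text sequence z that an auto-regressive model Pθ(z) can score or generate. The verbalization format shown here is an assumption for this sketch; the actual templates used in this thesis are introduced in Chapter 3.

```python
# Instead of learning a classifier P_theta(y | x) over a fixed label set, the event
# assignment y is expressed as text and combined with x, so an auto-regressive
# language model can model P_theta(z) for z = verbalize(x, y).

def verbalize(passage: str, event_type: str, trigger: str) -> str:
    # Both the input and the event assignment become one text sequence z.
    # (Illustrative format; not the exact template defined in Chapter 3.)
    return (f"Passage: {passage}\n"
            f"Event type: {event_type}\n"
            f"Event trigger is {trigger}.")

z = verbalize(
    "Indonesia will delay the execution of six convicts ...",
    "Justice:Execute",
    "execution",
)

# Because a new event type o_{n+1} is just more text (its name, definition, or
# keywords), the same pre-trained auto-regressive model can score or generate z
# for event types never seen during task-specific training.
print(z)
```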
Chapter 3 Instantiation of the New Generative Framework for Event Extraction

3.1 Introduction

Most prior works on event extraction (EE) rely on a large amount of annotated data for training [33], [50], [66], [68], [109], [110], [119]. However, high-quality event annotations are expensive to obtain. For example, the ACE 2005 corpus [30], one of the most widely used EE datasets, requires two rounds of annotations by linguistics experts. The high annotation costs make these models hard to extend to new domains and new event types. Additionally, as we mentioned in Chapter 2, new event types of interest challenge these existing methods.

In this chapter, we apply the new framework specified in Part I and design a generative model that can support both generalized event extraction and low-resource event extraction, where only a small number of training examples is available. We propose Degree (Data-Efficient GeneRation-Based Event Extraction), a generation-based model that takes a passage and a manually designed prompt as the input, and learns to summarize the passage into a natural sentence following a predefined template, as illustrated in Figure 3.1. The event triggers and arguments can then be extracted from the generated sentence by using a deterministic algorithm.
Lastly, Degree is tailored for end-to-end event extraction, concurrently addressing event detection and argument extraction. Leveraging shared knowledge and dependencies between these tasks enhances our model’s data 13 efficiency. Compared to existing works on EE, our approach encompasses all these advantages, which are often lacking individually in prior methods. For instance, conventional classificationbased models [86], [109], [155], [157], [174] struggle to encode label semantics and other weak supervision signals effectively. Recently proposed generation-based models for event extraction, albeit innovative, address the problem in a sequential pipeline manner, thus unable to exploit shared knowledge between subtasks [84], [119]. Moreover, their generated outputs often lack natural sentence structure, limiting the utilization of label semantics [98], [119]. Consequently, our model Degree demonstrates superior performance over prior approaches in low-resource event extraction, as elaborated in Section 3.3. Our contributions can be summarized as follows: • We propose Degree, a generation-based event extraction model that learns well with less data by better incorporating label semantics and shared knowledge between event detection and event argument tasks (Section 3.2). • Experiments on ACE 2005 [30] and ERE-EN [143] demonstrate the strong performance of Degree in the low-resource setting (Section 3.3). • We present comprehensive ablation studies in both the low-resource and the high-resource setting to better understand the strengths and weaknesses of our model (Section 3.4). Our code and models can be found at https://github.com/PlusLabNLP/DEGREE. 3.2 Method We introduce Degree, a generation-based model for low-resource event extraction. Unlike previous works [84], [175], which separate event extraction into two pipelined tasks (event detection and event argument extraction), Degree is designed for the end-to-end event extraction and predict event triggers and arguments at the same time. 14 3.2.1 Degree Details We formulate event extraction as a conditional generation problem. As illustrated in Figure 3.1, given a passage and our designed prompt, Degree generates an output following a particular format. The final predictions of event triggers and argument roles can then be parsed from the generated output with a deterministic algorithm. Compared to previous classification-based models [86], [155], [157], [174], the generation framework provides a flexible way to include additional information and guidance. By designing appropriate prompts, we encourage Degree to capture the dependencies between entities better and, therefore, to reduce the number of training examples needed. The desired prompt not only provides information but also defines the output format. As shown in Figure 3.1, it contains the following components: • Event type definition describes the definition for the given event type.1 For example, “The event is related to conflict and some violent physical act.” describes a Conflict:Attack event. • Event keywords presents some words semantically related to the given event type. For example, war, attack, and terrorism are three event keywords for the Conflict:Attack event. In practice, we collect three words that appear as the triggers in the example sentences from the annotation guidelines. • E2E template defines the expected output format and can be separated into two parts. 
The first part is called ED template, which is designed as “Event trigger is <Trigger>”, where “<Trigger>” is a special token serving as a placeholder. The second part is the EAE template, which differs based on the given event type. For example, in Figure 3.1, the EAE template for a Conflict:Attack event is “some attacker attacked some facility, someone, or some organization by some way in somewhere”. Each underlined string starting with “some-” serves as a placeholder corresponding to an argument role for a 1The definition can be derived from the annotation guidelines, which are provided along with the datasets. 15 Event Type EAE Template Life:Divorce somebody divorced in somewhere. Transaction:Transfer-Ownership someone got something from some seller in somewhere. Justice:Sue somebody was sued by some other in somewhere. The adjudication was judged by some adjudicator. Table 3.1: Three examples of EAE templates for the ACE 2005 corpus. Conflict:Attack event. For instance, “some way” corresponds to the role Instrument and “somewhere” corresponds to the role Place. Notice that every event type has its own EAE template. We list three EAE templates in Table 3.1. The full list of EAE templates and the construction details can be found in Section A.1. 3.2.2 Training The training objective of Degree is to generate an output that replaces the placeholders in E2E template with the gold labels. Take Figure 3.1 as an example, Degree is expected to replace “<Trigger>” with the gold trigger (detonated), replace “some attacker” with the gold argument for role Attacker (Palestinian), and replace “some way” with the gold argument for role Instrument (bomb). If there are multiple arguments for the same role, they are concatenated with “and”; if there is no predicted argument for one role, the model should keep the corresponding placeholder (i.e, “some-” in the E2E template). For the case that there are multiple triggers for the given event type in the input passage, Degree is trained to generate the output text that contains multiple E2E template such that each E2E template corresponds to one trigger and its argument roles. The hyperparameter settings are detailed in Section A.2. 3.2.3 Inference We enumerate all event types and generate an output for each event type. After we obtain the generated sentences, we compare the outputs with E2E template to determine the predicted triggers and arguments in string format. Finally, we apply string matching to convert the 16 predicted string to span offsets in the passage. If the predicted string appears in the passage multiple times, we choose all span offsets that match for trigger predictions and choose the one closest to the given trigger span for argument predictions. 3.2.4 Discussion Notice that the E2E template plays an important role for Degree. First, it is the control signal and defines the expected output format. Second, it provides label semantics to help Degree make accurate predictions. Those placeholders (words starting with “some-”) in the E2E template give Degree some hints about the entity types of arguments. For instance, when seeing “somewhere”, Degree tends to generate a location rather than a person. In addition, the words other than “some-” describe the relationships between roles. For example, Degree knows the relationship between the role Attacker and the role Target (who is attacking and who is attacked) due to E2E template. This guidance helps Degree learn the dependencies between entities. 
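The deterministic decoding step described above can be sketched as follows. The regular-expression alignment and helper names are a simplified illustration rather than the exact algorithm in the released code, and the subsequent mapping from argument strings to span offsets (via string matching) is omitted; the trigger part of the output ("Event trigger is ...") is parsed analogously.

import re

def decode_output(output: str, template: str, placeholder_to_role: dict) -> dict:
    # Build a regex from the EAE template: literal text stays fixed,
    # every placeholder becomes a capture group.
    pattern = re.escape(template)
    for ph in placeholder_to_role:
        pattern = pattern.replace(re.escape(ph), "(.+?)")
    match = re.fullmatch(pattern, output)
    if match is None:
        return {}
    preds = {}
    for (ph, role), span in zip(placeholder_to_role.items(), match.groups()):
        span = span.strip()
        if span != ph:  # an untouched placeholder means no argument was predicted
            # Multiple arguments for one role are concatenated with "and" during
            # training, so we split on it here (genuine "and" inside a single
            # argument would be ambiguous; the released code handles such cases).
            preds[role] = [a.strip() for a in span.split(" and ")]
    return preds

preds = decode_output(
    output="Palestinian attacked jeep and soldiers by bomb in Gaza Strip",
    template="some attacker attacked some facility, someone, or some organization "
             "by some way in somewhere",
    placeholder_to_role={
        "some attacker": "Attacker",
        "some facility, someone, or some organization": "Target",
        "some way": "Instrument",
        "somewhere": "Place",
    },
)
# preds == {"Attacker": ["Palestinian"], "Target": ["jeep", "soldiers"],
#           "Instrument": ["bomb"], "Place": ["Gaza Strip"]}
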
In contrast to previous generation-based methods [67], [82], [119], we purposefully craft E2E templates using natural language sentences. This approach effectively harnesses label semantics and facilitates the model’s utilization of knowledge from the pre-trained decoder. In Section 3.4, we will present experimental results to illustrate the benefits of employing natural sentences.
3.2.4.0.1 Cost of template construction. Degree does require human effort to design the templates; however, writing those templates is much easier and less effortful than collecting complicated event annotations. As shown in Table 3.1, we keep the EAE templates as simple and short as possible. Therefore, it takes only about one minute for people who are not linguistic experts to compose a template. Several prior works [33], [82], [88] also use constructed templates as weakly-supervised signals to improve models. In Section 3.4, we will study how different templates affect the performance.
3.2.4.0.2 Efficiency Considerations. Degree requires enumerating all event types during inference, which could raise efficiency concerns when extending to applications that contain many event types. This issue is minor for our experiments on the two datasets (ACE 2005 and ERE-EN), which are relatively small in terms of the number of event types. Due to the high cost of annotations, there are hardly any public datasets for end-to-end event extraction on a large scale,2 so we cannot provide a more thorough study when the experiments scale up. We leave benchmarking and improving the efficiency of Degree in scenarios with more diverse and comprehensive event types as future work.
3.2.5 Degree in Pipeline Framework Degree is flexible and can be easily modified into Degree(Pipe), which first performs event detection (ED) and then solves event argument extraction (EAE). Degree(Pipe) consists of two models: (1) Degree(ED), which aims to extract event triggers for the given event type, and (2) Degree(EAE), which identifies argument roles for the given event type and the corresponding trigger. Degree(ED) and Degree(EAE) are similar to Degree but with different prompts and output formats. We describe the differences as follows.
3.2.5.0.1 Degree(ED). The prompt of Degree(ED) contains the following components: • Event type definition is the same as the one for Degree. • Event keywords is the same as the one for Degree. • ED template is designed as “Event trigger is <Trigger>”, which is actually the first part of the E2E template. 2To the best of our knowledge, MAVEN [158] is the only publicly available large-scale event dataset. However, the dataset only focuses on event detection without considering event arguments. Similar to Degree, the objective of Degree(ED) is to generate an output that replaces “<Trigger>” in the ED template with event triggers.
3.2.5.0.2 Degree(EAE). The prompt of Degree(EAE) contains the following components: • Event type definition is the same as the one for Degree. • Query trigger is a string that indicates the trigger word for the given event type. For example, “The event trigger word is detonated” points out that “detonated” is the given trigger. • EAE template is an event-type-specific template mentioned previously. It is actually the second part of the E2E template. Similar to Degree, the goal of Degree(EAE) is to generate an output that replaces the placeholders in the EAE template with event arguments.
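Putting Degree(ED) and Degree(EAE) together, the pipeline variant can be sketched as follows; the model wrappers and their predict interfaces are illustrative assumptions, not the released API.

def degree_pipeline(passage, ontology, ed_model, eae_model):
    """A sketch of Degree(Pipe): event detection first, then argument extraction
    for each predicted trigger. `ed_model` and `eae_model` are assumed to wrap
    Degree(ED) and Degree(EAE) and to return already-decoded predictions."""
    events = []
    for event_type in ontology:  # enumerate every event type in the ontology
        triggers = ed_model.predict(passage, event_type)
        for trigger in triggers:
            roles = eae_model.predict(passage, event_type, trigger)
            events.append({"type": event_type, "trigger": trigger, "arguments": roles})
    return events
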
In Section 3.3, we will compare Degree with Degree(Pipe) to study the benefit of dealing with event extraction in an end-to-end manner under the low-resource setting. 3.3 Experiments We conduct experiments for low-resource event extraction to study how Degree performs. 3.3.1 Experimental Settings 3.3.1.0.1 Datasets. We consider ACE 2005 [30] and follow the pre-processing in [155] and [86], resulting in two variants: ACE05-E and ACE05-E+. Both contain 33 event types and 22 argument roles. In addition, we consider ERE-EN [143] and adopt the pre-processing in [86], which keeps 38 event types and 21 argument roles. 19 Trigger Classification F1-Score (%) Model Type ACE05-E ACE05-E+ ERE-EN 1% 3% 5% 10% 20% 30% 1% 3% 5% 10% 20% 30% 1% 3% 5% 10% 20% 30% BERT QA Cls 20.5 40.2 42.5 50.1 61.5 61.3 - - - - - - - - - - - - OneIE Cls 38.5 52.4 59.3 61.5 67.6 67.4 39.0 52.5 60.6 58.1 66.5 66.4 11.0 36.9 46.7 48.8 51.8 53.5 Text2Event Gen 14.2 35.2 46.4 47.0 55.6 60.7 15.7 38.4 43.9 46.3 56.5 62.0 6.3 25.6 33.5 42.4 46.7 50.1 TANL Gen 34.1 48.1 53.4 54.8 61.8 61.6 30.3 50.9 53.1 55.7 60.8 61.7 5.7 30.8 43.4 45.9 49.0 49.3 Degree(Pipe) Gen 55.1 62.8 63.8 66.1 64.4 64.4 56.4 62.5 61.1 62.3 62.5 67.1 32.7 44.5 41.6 50.6 51.1 53.5 Degree Gen 55.4 62.1 65.8 65.8 68.3 68.2 49.5 63.5 62.3 68.5 67.6 66.9 27.9 45.5 47.0 53.0 51.7 53.5 Argument Classification F1-Score (%) Model Type ACE05-E ACE05-E+ ERE-EN 1% 3% 5% 10% 20% 30% 1% 3% 5% 10% 20% 30% 1% 3% 5% 10% 20% 30% BERT QA Cls 4.7 14.5 26.9 27.6 36.7 38.8 - - - - - - - - - - - - OneIE Cls 9.4 22.0 26.8 26.8 42.7 47.8 10.4 20.6 29.7 35.5 46.7 48.0 2.6 20.3 29.7 35.1 40.7 43.0 Text2Event Gen 3.9 12.2 19.1 24.9 32.3 39.2 5.7 16.5 21.3 26.4 35.2 42.1 2.3 15.2 23.6 28.7 35.7 38.7 TANL Gen 8.5 17.2 24.7 29.0 34.0 39.2 8.6 22.3 30.4 29.2 34.6 39.0 1.4 20.2 29.5 30.1 35.6 36.9 Degree(Pipe) Gen 13.1 26.1 27.6 42.1 40.7 44.0 16.0 26.4 29.9 39.5 41.3 48.5 12.2 29.7 31.4 39.4 41.9 42.2 Degree Gen 21.7 30.1 35.5 41.6 46.2 48.7 18.7 34.0 35.7 43.6 48.9 51.2 14.5 28.9 33.4 41.7 42.9 45.5 Table 3.2: Trigger classification F1-scores and argument classification F1-scores for lowresource event extraction. The highest scores are in bold, and the second-best scores are underlined. “Cls” and “Gen” represent classification- and generation-based models, respectively. If the model is pipelined, then its argument predictions are based on its predicted triggers. Degree achieves a much better performance than other baselines. The performance gap becomes more significant for the extremely low-resource situation. 0 20 40 60 Trigger-C F1 (%) ACE05-E ACE05-E+ ERE-EN 1 2 3 5 10 20 30 50 Percentage of Data (%) 0 20 40 60 Argument-C F1 (%) 1 2 3 5 10 20 30 50 Percentage of Data (%) 1 2 3 5 10 20 30 50 Percentage of Data (%) OneIE(100% of Data) DEGREE(100% of Data) BERT_QA Text2Event TANL OneIE DEGREE(PIPE) DEGREE Figure 3.2: Trigger classification F1-scores and argument classification F1-scores for lowresource event extraction. Degree achieves a much better performance than other baselines. The performance gap becomes more significant for the extremely low-resource situation. 20 3.3.1.0.2 Data split for low-resource setting. We generate different proportions (1%, 2%, 3%, 5%, 10%, 20%, 30%, and 50%) of training data to study the influence of the size of the training set and use the original development set and test set for evaluation. Section A.3 lists more details about the split generation process and the data statistics. 3.3.1.0.3 Evaluation metrics. 
We consider the same criteria in prior works [86], [155]. (1) Trigger F1-score: a trigger is correctly identified (Tri-I) if its offset matches the gold one; it is correctly classified (Tri-C) if its event type also matches the gold one. (2) Argument F1-score: an argument is correctly identified (Arg-I) if its offset and event type match the gold ones; it is correctly classified (Arg-C) if its role matches as well. 3.3.1.0.4 Compared baselines. We consider the following classification-based models: (1) OneIE [86], the current state-of-the-art (SOTA) EE model trained with designed global features. (2) BERT QA [33], which views EE tasks as a sequence of extractive question answering problems. Since it learns a classifier to indicate the position of the predicted span, we view it as a classification model. We also consider the following generation-based models: (3) TANL [119], which treats EE tasks as translation tasks between augmented natural languages. (4) Text2Event [98], a sequence-to-structure model that converts the input passage to a tree-like event structure. Note that the outputs of both generation-based baselines are not natural sentences. Therefore, it is more difficult for them to utilize the label semantics. All the implementation details can be found in Section A.4. We train OneIE with named entity annotations, as the original papers suggest, while the other models are trained without entity annotations. 3.3.2 Main Results Table 3.2 shows the trigger classification F1-scores and the argument classification F1-scores in three data sets with different proportions of training data. The results are visualized 21 in Figure 3.2. Since our task is end-to-end event extraction, the argument classification F1-score is the more important metric we considered when comparing models. From the figure and the table, we can observe that both Degree and Degree(Pipe) outperform all other baselines when using less than 10% of the training data. The performance gap becomes much more significant under the deficient data situation. For example, when only 1% of the training data is available, Degree and Degree(Pipe) achieve more than 15 points of improvement in trigger classification F1 scores and more than 5 points in argument classification F1 scores. This demonstrates the effectiveness of our design. The generation-based model with carefully designed prompts can utilize the label semantics and the additional weakly supervised signals, thus helping learning under the low-resource regime. Another interesting finding is that Degree and Degree(Pipe) seem more beneficial for predicting arguments than triggers. For example, OneIE, the most robust baseline, requires 20% of training data to achieve competitive performance on trigger prediction to Degree and Degree(Pipe); however, it requires about 50% of training data to achieve competitive performance in predicting arguments. The reason is that the ability to capture dependencies becomes more important for argument prediction than trigger prediction since arguments are usually strongly dependent on each other compared to triggers. Therefore, the improvements of our models for argument prediction are more significant. Furthermore, we observe that Degree is slightly better than Degree(Pipe) under the low-resource setting. This provides empirical evidence for jointly predicting triggers and arguments in a low-resource setting. 
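The evaluation criteria above can be summarized in a small scoring function. The sketch below assumes predictions and gold annotations are given as tuples, e.g., (trigger offsets, event type) for Tri-C and (argument offsets, event type, role) for Arg-C; the exact matching code in prior work may differ slightly.

def f1(predictions, golds):
    """Micro F1 over sets of prediction tuples versus gold tuples."""
    pred, gold = set(predictions), set(golds)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# e.g., Tri-C tuples: f1({((5, 6), "Conflict:Attack")}, {((5, 6), "Conflict:Attack")}) -> 1.0
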
3.3.3 Few-Shot and Zero-Shot Result In order to further test our models’ generaliability, we additionally conduct zero-shot and few-shot experiments on the ACE05-E dataset with Degree(ED) and Degree(EAE). 22 3.3.3.0.1 Settings. We first select the top n common event types as “seen” types and use the rest as “unseen/rare” types, where the top common types are listed in Table 3.3. We remove all events with “unseen/rare” types from the training data to simulate a zero-shot scenario. We keep only k event examples for each “unseen/rare” type (denoted as k-shot) to simulate a few-shot scenario. We calculate micro F1-scores during the evaluation only for these “unseen/rare” types. n Seen Event Types for Training/Development 5 Conflict:Attack, Movement:Transport, Life:Die, Contact:Meet, Personnel:Elect 10 Conflict:Attack, Movement:Transport, Life:Die, Contact:Meet, Personnel:Elect, Life:Injure, Personnel:End-Position, Justice:Trial-Hearing, Contact:Phone-Write, Transaction:Transfer-Money Table 3.3: Common event types in ACE05-E. 3.3.3.0.2 Compared baselines. We consider the following baselines: (1) BERT QA [33] (2) OneIE [86] (3) Matching baseline, a proposed baseline that makes trigger predictions by performing string matching between the input passage and the event keywords. (4) Lemmatization baseline, another proposed baseline that performs string matching on lemmatized input passage and the event keywords. (Note: (3) and (4) are baselines only for event detection tasks.) 3.3.3.1 Experimental results. Figure 3.3, Table 3.4, and Table 3.5 show the results of n = 5 and n = 10. From the two subfigures in the left column, we see that Degree(ED) achieves promising results in the zero-shot setting. In fact, it performs better than BERT QA trained in the 10-shot setting and OneIE trained in the 5-shot setting. This demonstrates the great potential of Degree(ED) to discover new event types. Interestingly, we observe that our two proposed baselines perform surprisingly well, suggesting that the trigger annotations in ACE05-E are actually not diverse. Despite their impressive performance, Degree(ED) still outperforms 23 0 1 5 10 Number of Shot 0 10 20 30 40 50 60 70 Trigger-C F1 Score (%) Event Detection 0 1 5 10 Number of Shot 35 40 45 50 55 60 65 70 Argument-C F1 Score (%) EAE (Given Gold Triggers) 0 1 5 10 Number of Shot 0 10 20 30 40 50 Argument-C F1 Score (%) EAE (Using Predicted Triggers) Lemmatization Baseline Matching Baseline DEGREE (Full) OneIE (Full) BERT_QA (Full) DEGREE (Few-Shot) OneIE (Few-Shot) BERT_QA (Few-Shot) (a) Results for top common 5 event types. 0 1 5 10 Number of Shot 0 10 20 30 40 50 60 70 Trigger-C F1 Score (%) Event Detection 0 1 5 10 Number of Shot 40 45 50 55 60 65 70 Argument-C F1 Score (%) EAE (Given Gold Triggers) 0 1 5 10 Number of Shot 0 10 20 30 40 50 Argument-C F1 Score (%) EAE (Using Predicted Triggers) Lemmatization Baseline Matching Baseline DEGREE (Full) OneIE (Full) BERT_QA (Full) DEGREE (Few-Shot) OneIE (Few-Shot) BERT_QA (Few-Shot) (b) Results for top common 10 event types. Figure 3.3: The zero/few-shot experimental results of Degree. Left: The result for the models on the event detection task with the scores reported in trigger classification F1. Middle: The models are tested under the scenario of a given gold trigger and evaluated with an argument classification criterion. Right: The results for the models to perform event extraction task, which aims to predict triggers and their corresponding arguments (we report the argument classification F1). 
the matching baseline by over 4.7% absolute trigger classification F1 in both n = 5 and n = 10 cases in zero-shot scenario. Additionally, with only one training instance for each unseen type, Degree(ED) can outperform both proposed baselines. Next, we compare the results for the event argument extraction task. From the two middle subfigures, we observe that when given gold triggers, our model performs much better than all baselines with a large margin. Lastly, we train models for both trigger and argument extraction and report the final argument classification scores in the two right subfigures. We justify that our model has strong generalizability to unseen event types and it can outperform BERT QA and OneIE even when they are both trained in 5-shot settings. 24 Event Extraction Trigger Argument Common 5 Common 10 Tri-I Tri-C Arg-I Arg-C Tri-I Tri-C Arg-I Arg-C Matching Baseline 42.7 42.1 - - 46.3 46.3 - - Lemmatization Baseline 51.5 50.2 - - 56.6 56.0 - - BERT QA 1-shot 10.0 1.4 1.3 1.3 8.2 1.6 1.1 1.1 BERT QA 5-shot 14.0 12.6 11.1 10.8 20.8 15.4 14.6 13.9 BERT QA 10-shot 37.8 33.5 22.9 22.1 32.0 27.8 19.5 18.6 OneIE 1-shot 4.2 4.2 1.5 1.5 4.1 2.7 2.0 2.0 OneIE 5-shot 39.3 38.5 24.8 22.8 41.9 41.9 29.7 27.2 OneIE 10-shot 54.8 53.3 36.0 34.9 61.5 57.8 41.4 39.2 Degree(ED) 0-shot Degree(EAE) 0-shot 53.3 46.8 29.6 25.1 60.9 54.5 42.0 31.4 Degree(ED) 1-shot Degree(EAE) 1-shot 60.1 53.3 38.8 31.6 61.2 60.9 41.1 34.7 Degree(ED) 5-shot Degree(EAE) 5-shot 57.8 55.5 40.6 36.1 65.8 64.8 45.3 42.7 Degree(ED) 10-shot Degree(EAE) 10-shot 63.8 61.2 46.0 42.0 72.1 68.8 52.5 48.4 OneIE (Full) 72.7 70.5 52.3 49.9 74.5 73.0 51.2 48.9 Degree(ED) (Full) Degree(EAE) (Full) 68.4 66.0 51.9 48.7 72.0 69.8 52.5 49.2 Table 3.4: Full results of zero/few-shot event extraction on ACE05-E. Event Argument Extraction Trigger Argument Common 5 Common 10 Tri-I Tri-C Arg-I Arg-C Tri-I Tri-C Arg-I Arg-C Gold Triggers BERT QA 0-shot 100.0 100.0 55.8 37.9 100.0 100.0 57.2 46.7 Gold Triggers BERT QA 1-shot 100.0 100.0 55.8 44.3 100.0 100.0 57.8 47.2 Gold Triggers BERT QA 5-shot 100.0 100.0 56.6 49.6 100.0 100.0 59.1 50.6 Gold Triggers BERT QA 10-shot 100.0 100.0 58.8 52.9 100.0 100.0 60.5 52.8 Gold Triggers OneIE 1-shot 100.0 100.0 40.9 36.5 100.0 100.0 48.3 44.2 Gold Triggers OneIE 5-shot 100.0 100.0 55.6 51.4 100.0 100.0 58.6 55.0 Gold Triggers OneIE 10-shot 100.0 100.0 59.4 56.7 100.0 100.0 62.0 59.5 Gold Triggers Degree(EAE) 0-shot 100.0 100.0 56.1 48.0 100.0 100.0 66.5 53.3 Gold Triggers Degree(EAE) 1-shot 100.0 100.0 65.2 55.2 100.0 100.0 65.4 54.7 Gold Triggers Degree(EAE) 5-shot 100.0 100.0 70.9 62.2 100.0 100.0 68.0 61.7 Gold Triggers Degree(EAE) 10-shot 100.0 100.0 71.1 64.2 100.0 100.0 71.6 64.3 Gold Triggers BERT QA (Full) 100.0 100.0 63.1 57.9 100.0 100.0 62.1 56.5 Gold Triggers OneIE (Full) 100.0 100.0 70.8 66.4 100.0 100.0 67.9 64.1 Gold Triggers Degree(EAE) (Full) 100.0 100.0 74.5 70.6 100.0 100.0 73.6 68.9 Table 3.5: Full results of zero/few-shot event argument extraction on ACE05-E. 3.3.4 High-Resource Event Extraction Although we focus on data-efficient learning for low-resource event extraction, to better understand the advantages and disadvantages of our model, we additionally study Degree in the high-resource setting for controlled comparisons. 
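Looking back at the zero-shot baselines of Section 3.3.3, the lemmatization baseline can be sketched as below (the matching baseline is the same without the lemmatization step). The use of NLTK's WordNetLemmatizer is an assumption for illustration; the thesis does not specify a particular lemmatizer.

from nltk.stem import WordNetLemmatizer

def lemma_matching_baseline(passage_tokens, event_keywords):
    """Predict a token as a trigger of an event type whenever its lemma matches
    one of that event type's keywords."""
    lemmatizer = WordNetLemmatizer()
    keyword_lemmas = {lemmatizer.lemmatize(k.lower(), pos="v") for k in event_keywords}
    predictions = []
    for i, token in enumerate(passage_tokens):
        if lemmatizer.lemmatize(token.lower(), pos="v") in keyword_lemmas:
            predictions.append((i, token))  # predicted trigger position and word
    return predictions

# lemma_matching_baseline("Protesters attacked the embassy".split(),
#                         ["war", "attack", "terrorism"])  -> [(1, "attacked")]
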
25 Model Type ACE05-E ACE05-E+ ERE-EN Tri-C Arg-C Tri-C Arg-C Tri-C Arg-C dbRNN* Cls 69.6 50.1 - - - - DyGIE++ Cls 70.0 50.0 - - - - Joint3EE* Cls 69.8 52.1 - - - - BERT QA* Cls 72.4 53.3 - - - - MQAEE* Cls 71.7 53.4 - - - - OneIE* Cls 74.7 56.8 72.8 54.8 57.0 46.5 TANL Gen 68.4 47.6 68.6 46.0 54.7 43.2 Text2Event* Gen 71.9 53.8 71.8 54.4 59.4 48.3 BART-Gen* Gen 71.1 53.7 - - - - Degree(Pipe) Gen 72.2 55.8 71.7 56.8 57.8 50.4 Degree Gen 73.3 55.8 70.9 56.3 57.1 49.6 Table 3.6: Results for high-resource event extraction. Highest scores are in bold and the second best scores are underlined. *We report the numbers from the original paper. Degree has a competitive performance to the SOTA model (OneIE) and outperform other baselines. Model Type ACE05-E ACE05-E+ ERE-EN Arg-I Arg-C Arg-I Arg-C Arg-I Arg-C DyGIE++ Cls 66.2 60.7 - - - - BERT QA* Cls 68.2 65.4 - - - - OneIE Cls 73.2 69.3 73.3 70.6 75.3 70.0 TANL Gen 65.9 61.0 66.3 62.3 75.6 69.6 BART-Gen* Gen 69.9 66.7 - - - - Degree(EAE) Gen 76.0 73.5 75.2 73.0 80.2 76.3 Table 3.7: Results for high-resource event argument extraction. Models predict arguments based on the given gold triggers. Best scores are in bold. *We report the numbers from the original paper. Degree(EAE) achieves a new state-of-the-art performance on event argument extraction. 3.3.4.0.1 Compared baselines. In addition to the EE models mentioned above: OneIE [86], BERT QA [33], TANL [119], and Text2Event [98], we also consider the following baselines focusing on the high-resource setting. dbRNN [138] is classification-based model that adds dependency bridges for event extraction. DyGIE++ [155] is a classification-based model with span graph propagation technique. Joint3EE [111] is a classification-based model jointly trained with annotations of entity, trigger, and argument. MQAEE [82] converts EE to a series of question answering problems for argument extraction . BARTGen [84] is a generation-based model focusing on only event argument extraction.3 Section A.4 shows the implementation details for the baselines. 3We follow the original paper and use TapKey as their event detection model. 26 Model 10% Data 100% Data Tri-I Tri-C Tri-I Tri-C Full Degree(ED) 69.3 66.1 75.4 72.2 - w/o Event type definition 67.9 64.4 73.5 70.1 - w/o ED template 68.8 65.8 74.0 70.5 - w/o Event keywords 68.2 64.0 73.5 69.1 - only Event type definition 66.3 63.5 72.6 68.9 - only Event keywords 69.2 63.8 70.8 66.2 Table 3.8: Ablation study for the components in the prompt on event detection using ACE05- E dataset. 3.3.4.1 Results for event extraction. Table 3.6 shows the results of high-resource event extraction. In terms of trigger predictions (Tri-C), Degree and Degree(Pipe) outperforms all the baselines except for OneIE, the current state-of-the-art model. For argument predictions (Arg-C), our models perform slightly better than OneIE in two of the three datasets. When enough training examples are available, models can learn more sophisticated features from data, which do not necessarily follow the learned dependencies. Therefore, the advantage of Degree over Degree(Pipe) becomes less noticeable. This result justifies our hypothesis that Degree performs better for the low-resource setting because of its ability to capture dependencies better. 3.3.4.2 Results for event argument extraction. In Table 3.7, we additionally study the performance for event argument extraction task, where the model makes argument predictions with the gold trigger provided. 
Interestingly, Degree(EAE) achieves pretty strong performance and outperforms other baselines with a large margin. Combining the results in Table 3.6, we hypothesize that event argument extraction is a more challenging task than event trigger detection and it requires more training examples to learn well. Hence, our proposed model, which uses label semantics to capture dependencies better, achieves a new state-of-the-art for event argument extraction. 27 Model 10% Data 100% Data Arg-I Arg-C Arg-I Arg-C Full Degree(EAE) 63.3 57.3 76.0 73.5 - w/o Event type definition 60.3 54.4 74.5 71.1 - w/o EAE template 57.0 51.9 73.8 70.4 - w/o Query trigger 55.2 49.9 71.4 69.0 - only Query trigger 51.9 48.1 71.2 69.4 - only EAE template 51.2 46.9 71.4 68.6 - only Event type definition 46.7 42.3 71.4 68.2 Table 3.9: Ablation study for the components in the prompt on event argument extraction with ACE05-E. 3.4 Ablation Studies In this section, we present comprehensive ablation studies to justify our design. To better understand the contribution of each component in the designed prompt and their effects on the different tasks, we ablate Degree(EAE) and Degree(ED) for both low-resource and high-resource situations. 3.4.1 Impacts of components in prompts. Table 3.8 lists the performance changes when removing the components in the prompts for event detection on ACE05-E. The performance decreases whenever removing any one of event type definition, event keywords, and ED template. The results suggest that three components are all necessary. Table 3.9 demonstrates how different components in prompts affect the performance of event argument extraction on ACE05-E. Removing any one of event type definition, query trigger, and EAE template leads to performance drops, which validates their necessity. We observe that query trigger plays the most important role among the three and when less training data is given, the superiority of leveraging any of these weakly-supervised signal becomes more obvious. 28 Model 10% Data 100% Data Arg-I Arg-C Arg-I Arg-C OneIE 48.3 45.4 73.2 69.3 BART-Gen - - 69.9 66.7 Natural sentence 63.3 57.3 76.0 73.5 Natural sentence w/ special tokens 59.8 55.5 74.7 72.3 HTML-like sentence w/ special tokens 60.8 51.9 74.6 71.4 Table 3.10: Performances of Degree(EAE) on ACE05-E with different types of templates. 3.4.2 Effects of different template designs. To verify the importance of using natural sentences as outputs, we study three variants of EAE templates: • Natural sentence. Our proposed templates described in Section 3.2, e.g., “somebody was born in somewhere.”, where “somebody” and “somewhere” are placeholders that can be replaced by the corresponding arguments. • Natural sentence with special tokens. It is similar to the natural sentence one except for using role-specific special tokens instead of “some-” words. For example, “<Person> was born in <Place>.” We consider this to study the label semantics of roles. • HTML-like sentence with special tokens. To study the importance of using natural sentence, we also consider HTML-like sentence, e.g., “<Person> </Person> <Place> </Place>”. The model aims to put argument predictions between the corresponding HTML tags. The results of all variants of EAE templates on ACE05-E are shown in Table 3.10. We notice that writing templates in a natural language style get better performance, especially when only a few data is available (10% of data). This shows our design’s capability to leverage pre-trained knowledge in the generation process. 
Additionally, there is a drop of over 1 F1 point when replacing natural language placeholders with special tokens. This confirms that leveraging label semantics for different roles is beneficial.
Model | 10% Data Arg-I | 10% Data Arg-C | 100% Data Arg-I | 100% Data Arg-C
OneIE | 48.3 | 45.4 | 73.2 | 69.3
BART-Gen | - | - | 69.9 | 66.7
Degree(EAE) | 63.3 | 57.3 | 76.0 | 73.5
Degree(EAE) + variant template 1 | 61.6 | 55.5 | 73.4 | 70.4
Degree(EAE) + variant template 2 | 63.9 | 56.9 | 75.5 | 72.5
Table 3.11: Study on the effect of different template construction rules. Experiments are conducted on ACE05-E.
3.4.3 Sensitivity to template design. Finally, we study how sensitive our model is to the template. In addition to the original design of templates for event argument extraction, we compose two other sets of templates with different construction rules (e.g., different word choices and different orders of roles). Table 3.11 shows the results of using different sets of templates. We observe a performance fluctuation when using different templates, which indicates that the quality of templates does affect the performance to a certain degree. Therefore, we need to be cautious when designing templates. However, even though our model could be sensitive to the template design, it still outperforms OneIE and BART-Gen, the best classification-based model and the best generation-based baseline, respectively.
3.5 Related Work
3.5.0.0.1 Fully supervised event extraction. Event extraction has been studied for over a decade [5], [72] and most traditional event extraction works follow the fully supervised setting [82], [86], [88], [109], [111], [138], [174]. Many of them use classification-based models and pipeline-style frameworks to extract events [109], [155], [174]. To better leverage shared knowledge in event triggers and arguments, some works propose incorporating global features to jointly decide triggers and arguments [83], [86], [173]. Recently, a few generation-based event extraction models have been proposed [63], [67], [84], [119]. TANL [119] treats event extraction as translation tasks between augmented natural languages. Its prediction target, the augmented language, embeds labels into the input passage using brackets and vertical bar symbols. TempGen [67] is a template-based role-filler entity extraction model, which generates outputs that fill role entities into non-natural templated sequences. The output sequence designs of TANL and TempGen hinder the models from fully leveraging label semantics, unlike Degree, which generates natural sentences. BART-Gen [84] is also a generation-based model focusing on document-level event argument extraction. It solves event extraction with a pipeline, which prevents knowledge sharing across subtasks. All these fully supervised methods can achieve substantial performance with a large amount of annotated data. However, their designs are not tailored to low-resource scenarios; hence, these models cannot enjoy all the benefits that Degree obtains for low-resource event extraction at the same time, as we mentioned in Section 3.1.
3.5.0.0.2 Low-resource event extraction. There has been growing interest in event extraction with less data. [88] uses a machine reading comprehension formulation to conduct event extraction in a low-resource regime. Text2Event [98], a sequence-to-structure generation paradigm, first presents events in a linearized format, and then trains a generative model to generate the linearized event sequence.
Text2Event’s unnatural output format hinders the model from fully leveraging pre-trained knowledge. Hence, their model falls short on the cases with only extremely low data being available (as shown in Section 3.3). Another thread of works is using meta-learning to deal with the less label challenge [24], [28], [140]. However, their methods can only be applied to event detection, which differs from our primary focus on studying end-to-end event extraction. 3.6 Summary In this Chapter, we present Degree, a data-efficient generation-based event extraction model. Degree requires less training data because it better utilizes label semantics and weakly-supervised information and captures better dependencies by jointly predicting trig31 gers and arguments. Our experimental results and ablation studies show the superiority of Degree for low-resource event extraction. Degree assumes that some weakly-supervised information (the description of events, similar keywords, and human-written templates) is accessible or not expensive for the users to craft. This assumption may hold for most situations. We leave the automation of template construction for future work, which can further ease the needed efforts when deploying Degree in a large-scale corpus. 32 Chapter 4 Incorporating Cross-Domain Robust Knowledge to the Generative Framework 4.1 Introduction Building upon the groundwork established in Degree, generation-based event extraction models [84], [98], [119], [121] have been introduced and explored more widely, demonstrating significant generalizability and competitive performance in contrast to traditional classification-based approaches [22], [42], [55], [100]. However, existing generation-based models mostly focus on problem reformulation and prompt design without incorporating auxiliary syntactic and semantic information that is shown to be effective in classificationbased methods [2], [69], [70], [153], [169]. In this section, we delve deeper into the generative framework, exploring how we can seamlessly integrate additional knowledge into it. We focus on abstract meaning representation (AMR) [12], which extracts rich semantic information from the input sentence. As the Figure 4.1’s example shows, AMR graph summarizes the semantic structure of the input passage, and many of its nodes and edges share strong similarities with the event structures. For example, the trigger word appeal can be mapped to the node “appeal-01”, and the subject who appeals can be found using edge “ARG0”. Hence, the AMR graph could provide important clues for models to figure out 33 Encoder Decoder Passage [SEP] Prompt Output Text Plaintiff districts Adjudicator u.s. supreme court Place washington Transformer Block Q K V Prefix districts in washington appealed the adjudication from u.s. supreme court. Attention Layer purpose grant-01 give-01 ARG0 temporal-quantity ARG1 district school mod …. …. ARG1 ARG3 appeal-01 ARG0 Governmentorganization ARG1 name name U.S. Supreme Court location here location city name name Washington AMR Graph …. AMR Encoder Graph Linearization (z1 / grant-01 :ARG1 (z2 / stay) :purpose (z3 / give-01 :ARG0 z2 :ARG1 (z4 / temporal-quantity :quant 90 :unit (z5 / day)) :ARG2 (z6 / :……. Learnable Vector Q K V Length Length Passage: the stay was granted to give the school districts involved in the initial cases 90 days to appeal to the u.s. supreme court here in washington. 
Justice:Appeal Event ♢ Event Type Description : The event is related … ★ Query Trigger : Event trigger is appeal ☾ EAE Template : somebody in somewhere appealed the adjudication from some adjudicator. Encoding AMR into Prefix Prefix Generation Generation-Based EAE Model Linear Project Figure 4.1: An overview of Ampere using an example from the ACE 2005 dataset. Given a passage and an event trigger, we first use an AMR parser to obtain the AMR graph of the input passage. The linearized AMR graph sequence will be encoded into a l-length prefix by an AMR encoder and an attention layer. Our generation-based EAE model equipped with the AMR-aware prefix then summarizes the event mentioned in the passage into a natural sentence that follows a pre-defined template in the prompt. The final arguments and the corresponding roles can be extracted from the generated sentence. event arguments, resulting in performance improvements [183] and better generalizability [70] for classification-based methods. However, it is unclear how to best integrate AMR into generation-based methods. The heterogeneous nature between the AMR graph and the natural language prompts1 in the generation-based EAE models causes the difficulty of the model design. To overcome the challenge, we propose Ampere (AMr-aware Prefix for generation-based Event aRgument Extraction), which encodes AMR graph into prefix [85] to regulate the generation-based EAE models. Specifically, an additional AMR encoder is used to encode the input AMR graph into dense vectors. Then, these vectors will be disassembled and distributed to every Transformer layer in generation-based EAE models as the prefix. These 1For example, event type description and target generation templates. 34 generated prefixes are transformed into additional key and value matrices to influence the attention calculation, hence, guiding the generation. We also introduce an adjusted copy mechanism for Ampere to overcome potential noises brought by the AMR graph. Specifically, as we can observe in Figure 4.1, AMR parsers will include additional normalization (turning washington into Washington) and word disambiguation (using appeal-01 rather than appeal) to create AMR graphs. Such normalization could impact the generation to produce some words that are not in the original input, especially when the training data is limited. Hence, we apply a copy mechanism [137] and add an additional regularization loss term to encourage copying from the input passage. We conduct experiments on ACE 2005 [30] and ERE [143] datasets using different ratios of training data. Our results show that Ampere outperforms several prior EAE works in both datasets. Under low-resource settings that use only 5% or 10% of training data, we can get 4% −10% absolute F1-scores of improvement, and our method is in general powerful across different training sizes and different datasets. We also present a comprehensive study of different ways to incorporate AMR information into a generation-based EAE model. We will show that Ampere is the best way among the various methods we explored. Our code can be found at https://github.com/PlusLabNLP/AMPERE. 4.2 Ampere Details Ampere uses Degree as the base generation-based EAE model 2 (Section 4.2.1), and augments it with AMR-aware prefixes, as shown in Figure 4.1. To generate the AMRaware prefixes, we first use a pre-trained AMR parser to obtain the AMR graph of the input sentence (Section 4.2.2). 
Then, the graph is transformed into dense vectors through graph linearization and an AMR encoder. Then, these dense vectors will be disassembled and distributed to each layer of our base generation-based EAE model so the generation is guided by the AMR information (Section 4.2.3). Finally, we introduce the training loss 2We use the EAE version of Degree. 35 for Ampere and our adjusted copy mechanism that can help Ampere overcome additional noise brought from AMR graphs (Section 4.2.4). 4.2.1 Generation-Based EAE Model Despite our AMR-aware prefix being agnostic to the used generation-based EAE model, we select Degree as our base model because of its great generalizability and performance. Here, we provide a brief overview of the model. Given a passage and an event trigger, Degree first prepares the prompt, which includes an event type description (a sentence describing the trigger word), and an event-type-specific template, as shown in Figure 4.1. Then, given the passage and the prompt, Degree summarizes the event in the passage following the format of the EAE template, so that final predictions can be decoded easily by comparing the template and the output text. Take the case in Figure 4.1 as an example, by comparing “districts in washington appealed the adjudication from u.s. supreme court.” with the template “somebody in somewhere appealed the adjudication from some adjudicator.”, we can know that the “districts” is the argument of role “Plaintiff ”. This is because the corresponding placeholder “somebody” of the role “Plaintiff ” has been replaced by “districts” in the model’s prediction. 4.2.2 AMR Parsing The first step of our method is to prepare the AMR graph of the input passage. We consider SPRING [14], a BART-based AMR parser trained on AMR 3.0 annotation, 3 to be our AMR parser. As illustrated by Figure 4.1, the AMR parser encodes the input sentence into an AMR graph, which is a directed graph where each node represents a semantic concept (e.g., “give-01”, “appeal-01”) and each edge describe the categorical semantic relationship between two concepts (e.g., ARG0, location) [12]. 3https://catalog.ldc.upenn.edu/LDC2020T02 36 4.2.3 AMR-Aware Prefix Generation Our next step is to embed the information into prefixes [85] for our generation-based EAE model. To encode the AMR graph, we follow Konstas, Iyer, Yatskar, et al. [76] to adopt a depth-first-search algorithm to linearize the AMR graph into a sequence, as shown in the example in Figure 4.1. Then, an AMR encoder is adapted to encode the sequence representation. One of the advantages of our method is the flexibility to use models with different characteristics to our generation-based EAE model to encode AMR. Here, we consider two AMR encoders to form different versions of Ampere: • Ampere (AMRBART): We consider using the encoder part of the current state-of-the-art AMR-to-text model — AMRBART [11] that pre-trained on AMR 3.0 data. 4 The model is based on BART-large and its vocabulary is enlarged by adding all relations and semantic concepts in AMR as additional tokens. Employing the model as our AMR encoder enables Ampere to leverage knowledge from other tasks. • Ampere (RoBERTa): RoBERTa-large [94] is also considered as our AMR encoder as pretrained masked language models are typical choices to perform encoding tasks. In order to make RoBERTa better interpret the AMR sequence, we follow [11] to add all relations in AMR (e.g. ARG0, ARG1 ) as special tokens. 
However, since the model is not pre-trained on abundant AMR-to-text data, we do not include semantic concepts (e.g. concepts end with -01 ) as extra tokens. 5 After getting the representation of the linearized sequence, we then prepare l learnable vectors as queries and an attention layer, where l is a hyper-parameter that controls the length of the used prefixes. These queries will compute attention with the representations of the linearized AMR sequence, then, we will obtain a set of compressed dense vector P. This P will be transformed into the prefixes [85] that we will inject into our generation-based 4https://github.com/goodbai-nlp/AMRBART 5 If adding semantic concepts as extra tokens, RoBERTa will lose the ability to grasp its partial semantic meaning from its surface form, such as understanding that “appeal-01” is related to “appeal”. 37 EAE model. To be more specific, we first disassemble P into L pieces, where L is the number of layers in the base generation-based EAE model, i.e., P = {P 1 , P2 , ...P L}. Then, in the n-th layer of the EAE model, the prefix is separated into two matrices, standing for the addition key and value matrices: P n = {Kn , V n}, where Kn & V n are the addition key and value matrices, and they can be further written as Kn = {k n 1 , ..., k n l } and V n = {v n 1 , ..., v n l }. k∗ and v∗ are vectors with the same hidden dimension in the Transformer layer. These additional key and value matrices will be concatenated with the original key and value matrices in the attention block. Therefore, when calculating dot-product attention, these AMR-aware prefixes will influence the query at each position. The reason of generating layer-wise queries and keys is to exert stronger control. We generate layer-wise key-value pairs as each layer may embed different information. These keys influence the model’s weighting of representations towards corresponding generated values. Empirical studies on layer-wise versus single-layer control can be found in [92]. It is worth noting that [85]’s prefix tuning technique uses a fixed set of prefixes disregarding the change of input sentence, Ampere will generate a different set of prefixes when the input passage varies. The variation reflects the different presentation of the AMR graph. We can inject prefixes into the encoder self-attention blocks, decoder cross-attention blocks, or decoder self-attention blocks in our generation-based EAE model. Our preliminary experiments show that using prefixes in encoder self-attention blocks and decoder crossattention blocks works best in Ampere. 4.2.4 Adjusted Copy Mechanism We follow Degree’s setting to use BART-large [79] as the pre-trained generative model, and the training objective of our generation-based EAE model is to maximize the conditional probability of generating a ground-truth token given the previously generated ones and the 38 input context in the encoder x1, x2, ..xm: Loss = − log(X i P(yi |y Prompt Five Iraqi civilians, including a woman, were killed Monday when their houses were hit by a missile fired by the US - led coalition warplanes, witnesses said. 
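A minimal single-head sketch of how the AMR-aware prefix of Section 4.2.3 enters a Transformer attention block is given below; the tensor shapes and the omission of multiple heads, masking, and output projections are simplifications of the attention actually used in the pre-trained model.

import torch

def prefix_attention(query, key, value, prefix_key, prefix_value):
    """query/key/value: (batch, seq_len, dim) projections of the layer input;
    prefix_key/prefix_value: (batch, l, dim) vectors derived from the linearized
    AMR graph for this layer."""
    k = torch.cat([prefix_key, key], dim=1)      # prepend the l prefix keys
    v = torch.cat([prefix_value, value], dim=1)  # and their corresponding values
    scores = query @ k.transpose(1, 2) / (query.size(-1) ** 0.5)
    weights = torch.softmax(scores, dim=-1)      # every query position now also
    return weights @ v                           # attends to the AMR-derived prefix
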
<Trigger> killed <Template> <Agent> [None] </Agent> <Victim> [None] </Victim> <Instrument> [None] </Instrument> <Place> [None] </Place> <Agent> coalition </Agent> <Victim> civilians [and] woman </Victim> <Instrument> missile </Instrument> <Place> houses</Place> Agent coalition Victim civilians, woman Instrument missile Place houses Template for Life:Die Event Training Zero-Shot Cross-Lingual Transfer Given Trigger Generate Output String Decode Multilingual Generative Model Input Passage <SEP> Prompt 巴勒斯坦人持续以石块攻击以色列的部队,以军则是还以催泪弹、 橡皮子弹甚至是实弹,结果又造成两名巴勒斯坦青年丧生,10多人 受伤。 <Trigger> 丧生 <Template> <Agent> [None] </Agent> <Victim> [None] </Victim> <Instrument> [None] </Instrument> <Place> [None] </Place> <Agent> 以军 </Agent> <Victim> 青年 </Victim> <Instrument> 催泪弹 [and] 子弹 [and] 实弹 </Instrument> <Place> [None] </Place> Agent 以军 Victim 青年 Instrument 催泪弹, 子弹, 实弹 Place None Template for Life:Die Event Testing Given Trigger Generate Output String Decode Figure 5.1: The overview of X-Gear. Given an input passage and a carefully designed prompt containing an event trigger and a language-agnostic template, X-Gear fills in the language-agnostic template with event arguments. seamless cross-lingual transfer. X-Gear fine-tunes multilingual pre-trained generative models, such as mBART-50 [147] or mT5 [170], and augments them with a copy mechanism to better adapt to input language changes. We present its details as follows, including the language-agnostic templates, the target output string, the input format, and the training details. 5.4.1 Language-Agnostic Template We develop a language-agnostic template, denoted as Te, for each event type e, wherein we enumerate all potentially associated roles2 and construct a distinctive HTML-tag-style template tailored to the event type e. As illustrated in Figure 5.1, the Life:Die event encompasses four roles: Agent, Victim, Instrument, and Place. Consequently, the template for Life:Die events is configured as follows: 2These associated roles may be derived from a cursory examination of training data or directly from the provided annotation guidelines. 55 <Agent>[None]</Agent><Victim>[None]</Victim> <Instrument>[None]</Instrument><Place>[None]</Place>. For ease of understanding, we use English words to present the template. However, these tokens ([None], <Agent>, </Agent>, <Victim>, etc.) are encoded as special tokens3 that the pre-trained models have never seen and thus their representations need to be learned from scratch. Since these special tokens are not associated with any language and are not pre-trained, they are considered as language-agnostic. 5.4.2 Target Output String X-Gear learns to generate target output strings following the structure of language-agnostic templates. To construct the target output string during training, given an instance (x, t, e, a, r), we initially identify the language-agnostic template Te corresponding to the event type e. Then, we substitute all occurrences of [None]” in Te with the relevant arguments from a based on their assigned roles r. If there are multiple arguments for a single role, we combine them using a special token [and]”. For example, in the training instance depicted in Figure 5.1, two arguments (civilians and woman) are associated with the Victim role, resulting in the corresponding segment of the output string. <Victim> civilians [and] woman </Victim>. If there are no corresponding arguments for one role, we keep “[None]” in Te. 
By applying this rule, the full output string for the training example in Figure 5.1 becomes <Agent> coalition </Agent><Victim> civilians[and] woman </Victim><Instrument> missile </Instrument> <Place> houses </Place>. Given that the output string adheres to HTML-tag style, extracting the argument and role predictions from it becomes straightforward using a simple rule-based algorithm. 3 In fact, the special tokens can be replaced by any other format, such as <–token1–> or . Here, we use <Agent> and </Agent> to highlight that arguments between these two special tokens correspond to the Agent role. 56 5.4.3 Input Format As previously mentioned, the key of the generative formulation for zero-shot cross-lingual EAE lies in guiding the model to generate output strings in the desired format. To achieve this, we provide both the input passage x and a prompt to X-Gear, as depicted in Figure 5.1. The prompt encompasses all pertinent information necessary for the model to make predictions, comprising a trigger t and a language-agnostic template Te. It’s important to note that we don’t explicitly include the event type e in the prompt, as the template Te implicitly encapsulates this information. In Section 5.6.1, we will delve into experiments that involve explicitly adding the event type e to the prompt and discuss its impact on cross-lingual transfer. 5.4.4 Training To enable X-Gear to generate sentences in different languages, we employ a multilingual pre-trained generative model as our foundational framework. This model captures the conditional probability of generating a new token, considering both the previously generated tokens and the input context fed to the encoder c, i.e, P(x|c) = Y i Pgen(xi |x to represent the Attack event type. Table 5.4 shows the results. In most cases, including event type information in the prompt decreases the performance. One reason is that one word in a language can be mapped to several words in another language. For example, the Life event type is related to Marry, Divorce, Born, and Die four sub-event types. In English, we can use just one word Life to cover all four sub-event types. However, In Chinese, when talking about Marry and Divorce, Life should be translated to “生活”; when talking about Born and Die, Life should be translated to “生命”. This mismatch may cause the performance drop when considering event types in prompts. We leave out how to efficiently use event-type information in the cross-lingual setting as future work. 5.6.1.3 Influence of Role Order in Templates. The order of roles in the designed language-agnostic templates can potentially influence performance. When designing the templates, we intentionally make the order of roles close to the order in natural sentences.6 To study the effect of different orders, we train XGear with templates with different random orders and report the results in Table 5.5. 6For example, types related to subject and object are listed first and types related to methods and places are listed last. 64 Model en ⇓ xx ar ⇓ xx zh ⇓ xx xx ⇓ en xx ⇓ ar xx ⇓ zh avg X-Gear (mT5-base) 54.3 41.4 51.4 49.4 46.7 51.0 49.1 w/ random order 1 54.4 38.9 50.8 48.7 45.1 50.1 48.0 w/ random order 2 52.1 40.4 51.4 48.3 45.9 49.7 48.0 w/ random order 3 53.7 40.8 50.7 50.8 45.8 48.6 48.4 Table 5.5: Ablation study on different orders of roles in templates for ACE-2005. “en ⇒ xx” indicates the average of “en ⇒ en”, “en ⇒ zh”, and “en ⇒ ar”. 
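As a concrete companion to the target output format defined in Section 5.4.2, the sketch below builds the training string for one event type. Helper names are illustrative, and the special tokens are written as plain strings for readability even though X-Gear adds them to the vocabulary as dedicated language-agnostic tokens.

def build_target_string(template_roles, role_to_args):
    """Fill X-Gear's language-agnostic template for one event type.
    template_roles lists the roles in template order; role_to_args maps a role
    to its (possibly empty) list of gold argument strings."""
    parts = []
    for role in template_roles:
        args = role_to_args.get(role, [])
        filler = " [and] ".join(args) if args else "[None]"
        parts.append(f"<{role}> {filler} </{role}>")
    return "".join(parts)

target = build_target_string(
    ["Agent", "Victim", "Instrument", "Place"],
    {"Agent": ["coalition"], "Victim": ["civilians", "woman"],
     "Instrument": ["missile"], "Place": ["houses"]},
)
# "<Agent> coalition </Agent><Victim> civilians [and] woman </Victim>
#  <Instrument> missile </Instrument><Place> houses </Place>"
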
Model en ⇓ xx ar ⇓ xx zh ⇓ xx xx ⇓ en xx ⇓ ar xx ⇓ zh avg X-Gear (mT5-base) 54.3 41.4 51.4 49.4 46.7 51.0 49.1 w/ English Tokens 51.4 39.3 49.7 46.6 44.7 49.0 46.8 Table 5.6: Comparison of using English tokens and special tokens for roles in templates. “en ⇒ xx” indicates the average of “en ⇒ en”, “en ⇒ zh”, and “en ⇒ ar”. X-Gear with random orders still achieve good performance but slightly worse than the original order. It suggests that X-Gear is not very sensitive to different templates while providing appropriate order of roles can lead to a small improvement. 5.6.1.4 Using English Tokens Instead of Special Tokens for Roles in Templates. In Section 5.4, we mentioned that we use language-agnostic templates to facilitate the crosslingual transfer. To further validate the effectiveness of the language-agnostic template. We conduct experiments using English tokens as the templates. Specifically, we set the format Agent: [None] <SEP> Victim: [None] <SEP> Instrument: [None] <SEP> Place: [None] to be the template for Life:Die events. Hence, for non-English instances, the targeted output string is a code-switching sequence. Table 5.6 lists the results. We can observe that applying language-agnostic templates brings X-Gear 2.3 F1 scores improvements on average. 65 Model monolingual cross-lingual average all X-Gear (mBART-50-large) 63.9 37.4 46.2 w/ constrained decoding 62.4 37.6 45.9 X-Gear (mT5-base) 67.8 39.7 49.1 w/ constrained decoding 67.0 39.9 48.9 X-Gear (mT5-large) 69.7 42.2 51.3 w/ constrained decoding 68.8 43.1 51.6 Table 5.7: Results of applying constrained decoding. Breakdown numbers can be found in Appendix C.3. Based on whether the training languages are the same between training and testing, we classify the results into monolingual and cross-lingual, and we report the corresponding average for each category. 5.6.2 Constrained Decoding In this study, we study the effectiveness of a stronger version of the copy mechanism – constrained decoding [20] to force all the generated tokens to appear input. Table 5.7 presents the result of X-Gear with constrained decoding. We observe that adapting such constraints indeed helps the cross-lingual transferability, yet it also hurts the performance in some monolingual cases. We conduct a qualitative inspection of the predictions. The observation is that constrained decoding algorithm although guarantees all generated tokens appearing in the input, the coercive method breaks the overall sequence distribution that learned. Hence, in many monolingual examples, once one of the tokens is corrected by constrained decoding, its following generated sequence changes a lot, while the original predicted suffixed sequence using beam decoding are actually correct. This leads to a performance decrease.7 5.7 Summary We present the first generation-based models for zero-shot cross-lingual event argument extraction. To overcome the discrepancy between languages, we design language-agnostic 7 Indeed, a similar situation happens to cross-lingual cases; however, since the original performance for cross-lingual transfer is not high enough, the benefits of correcting tokens are more significant than this drawback. 66 templates and propose X-Gear, which well capture output dependencies and can be used without additional named entity extraction modules. Our experimental results show that XGear outperforms the current state-of-the-art, which demonstrates the potential of using a language generation framework to solve zero-shot cross-lingual structured prediction tasks. 
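For reference, one simple way to approximate the constrained decoding examined above is to whitelist, at every decoding step, only tokens that appear in the input passage plus the template's special tokens. The sketch below assumes the prefix_allowed_tokens_fn hook of the Hugging Face generate API and is far simpler than the cited constrained decoding algorithm [20].

def make_constraint_fn(tokenizer, passage, special_tokens):
    """Build a function that restricts generation to tokens from the passage,
    the template special tokens, and the end-of-sequence token."""
    allowed = set(tokenizer(passage, add_special_tokens=False)["input_ids"])
    allowed.update(tokenizer.convert_tokens_to_ids(special_tokens))
    allowed.add(tokenizer.eos_token_id)
    allowed = sorted(allowed)

    def allowed_tokens(batch_id, generated_ids):
        return allowed  # the same whitelist applies at every decoding step
    return allowed_tokens

# Usage with a Hugging Face seq2seq model (interface assumed for illustration):
# outputs = model.generate(**inputs, prefix_allowed_tokens_fn=make_constraint_fn(
#     tokenizer, passage, ["<Agent>", "</Agent>", "[None]", "[and]"]))
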
67 Part II Data Perspective Solutions: Methods to generate pseudo training data for event understanding 68 In Part I, I introduced a model perspective solution that establishes a more flexible modeling framework for knowledge transfer, enabling the utilization of knowledge within the parameters of the pre-trained language model Pϕ. Now, in Part II, my aim is to develop more advanced techniques to enhance the model’s generalizability when accessing an understanding of the target testing scenario. Within this scope, I specifically focus on methods to generate training data for event understanding using the pre-trained language model Pϕ. With the augmented training data, we can construct models that are acquainted with a broader range of different cases encountered during testing, compared to models trained with limited task-specific training data. More specifically, the significant expense associated with event annotations often results in a scarcity of training data for event understanding models. In other words, the input x in the event understanding training dataset typically represents only a subset of the potential inputs encountered during inference. For instance, in a zero-shot cross-lingual setting, all training data may be in English, yet the model must contend with testing data in various languages such as Chinese, Arabic, and more. This presents a significant challenge that impedes the generalizability of the trained event understanding model. To address this challenge, I suggest augmenting the dataset with additional samples x ′ drawn from the pre-trained language model Pϕ. By incorporating these additional samples during training, models can potentially enhance their robustness across various test scenarios. However, a significant hurdle lies in acquiring corresponding event annotations y ′ for these generated x ′ . To tackle this obstacle, as outlined in Chapter 6, I propose a contextual machine translation approach aimed at accurately preserving the associated event annotations y ′ when generating x ′ through translation from x. Additionally, in Chapter 7, I develop a meticulous strategy in handling x ′ containing novel events not present in the existing event knowledge base, necessitating the annotation of their corresponding y ′ as “non-existent.” 69 Chapter 6 Generate Pseudo Multilingual Event Understanding Training Data via Contextual Label Projection 6.1 Introduction In this chapter, our focus lies on the generation of pseudo-training data to enhance crosslingual generalization. This method, commonly referred to as translate-train in existing literature, utilizes machine translation techniques to produce pseudo-training data in target languages by translating training data from the source language [132], [170], [179]. However, applying translate-train to structured prediction tasks, such as event extraction, requires a crucial step known as label projection. This step involves the simultaneous translation of input sentences and labels [21]. Label projection requires not only accurate translation of the labels but also maintaining the association between the translated texts and labels. As illustrated in Figure 6.1, while “suits” can have multiple valid translations, only “诉讼” is present in the translated sentence and a proper translation at the same time. Previous studies have tackled label projection through two primary frameworks. 
(Figure 6.1 shows the source English sentence "In South Florida, the average number of suits against a neurosurgeon is five." with labels (suits, trigger) and (neurosurgeon, Defendant), together with candidate Chinese outputs. If we translate the label first and inject the label translation into the sentence, then "西装" (a set of men's formal clothes) is not a suitable translation of "suits" in this sentence. If we translate the labels and the sentence separately, then "讼案" is not present in the translated sentence "在南佛罗里达州,针对神经外科医生的诉讼平均为五起". Only "诉讼" is both a proper translation and present in the translated sentence.)
Figure 6.1: Illustration of the task of label projection from English to Chinese. Label projection converts sentences from a source to a target language while translating the associated labels jointly. Failures in this process occur when labels are either inaccurately translated or missing in the translated sentence in the target language.

The first framework, depicted in Figure 6.2(a), involves performing machine translation on modified source sentences that integrate label annotations using special markers [21], [51]. Translated labels can be extracted if these special markers are retained in the translations. However, the quality of the translation is inherently compromised due to the inclusion of these special markers [21]. The second framework utilizes word similarity to establish word alignments between the source and translated sentences. Label translations are then generated by combining mapped tokens in the translated sentence [7], [8], [144], as depicted in Figure 6.2(b). However, ensuring accurate label translation solely through word alignments proves challenging for this framework, as we will elaborate on in Section 6.4.4.

In this chapter, we introduce CLaP (Contextual Label Projection), a method for obtaining projected label annotations through contextual machine translation of the labels. Initially, we acquire the translation of the input sentence using any plug-and-play machine translator. Building upon the concept of contextual machine translation [154], [162], we utilize the translated input text as context to conduct label translation, as illustrated in Figure 6.2(c). This approach significantly improves the accuracy of the translated labels while maintaining their association with the translated sentence. Moreover, preserving the integrity of the input sentence during translation optimally harnesses machine translators and ensures the quality of the translated sentence.

To enable contextual machine translation, we employ a specialized instruction-tuned language model with multilingual capabilities, namely Llama-2-13B [151].1 By encoding the translated input sentence and the constraint for label presence as instruction prompts, we task the language model with performing the label translation process. Extensive experiments conducted on two representative tasks, event argument extraction (EAE) and named entity recognition (NER), reveal the following insights:
• Compared to existing label projection methods, CLaP performs the best on intrinsic evaluation by achieving the best label translation accuracy (Section 6.4.4).
• Through extrinsic evaluation on downstream tasks, CLaP yields an average improvement of 2.4 and 1.4 F1 scores over the best baseline across 39 languages for EAE on ACE and NER on WikiANN datasets respectively (Section 6.4.5).
• In comparison to directly prompting LLMs for the downstream task, we show that CLaP's LLM usage for contextual machine translation provides significantly larger gains (Section 6.4.5).
• Focusing on low-resource languages, CLaP demonstrates strong applicability by generalizing to ten extremely low-resourced African and American languages (Section 6.6). Using larger LLMs for CLaP yields further improvements for low-resource languages, underlining CLaP's future potential to improve continually (Section 6.5.2).
Our code can be found at https://github.com/PlusLabNLP/CLaP.

(Figure 6.2 contrasts the three pipelines on the example sentence "In South Florida, the average number of suits against a neurosurgeon is five.": (a) marker-based methods add markers such as [0]South Florida[/0], translate the marked sentence, and retrieve the labels from the markers; (b) word alignment methods translate the sentence, align the source and translated sentences, and retrieve the labels from the alignments; and (c) CLaP translates the sentence and then performs contextual translation of the labels using in-context examples and a contextual translation instruction of the form "Chinese Sentence: ... For the previous sentence, the words corresponding to 'y' is: ...".)
Figure 6.2: Illustration of the various techniques to conduct label projection: (a) Marker-based methods use markers to transform the sentence and translate the transformed sentence with label markers jointly, (b) Word Alignment methods use external word alignment tools to locate the translated labels in the translated sentence, and (c) CLaP (ours) performs contextual translation on labels using M (here, we demonstrate the use of an instruction-tuned language model as M to identify translated labels within a translated sentence).

6.2 Background

6.2.1 Structure Prediction Tasks
Given an input sentence x, structure prediction models aim to predict a structured output y = [x[i_1:j_1], x[i_2:j_2], ..., x[i_n:j_n]] (where x[i_1:j_1] is an input sentence span from token i_1 to j_1) corresponding to a set of roles r = [r_1, r_2, ..., r_n] (where r_i ∈ R, a pre-defined set of roles). This vastly differs from standard classification-based tasks, wherein the output prediction y is a singular value from a fixed set of classes independent of the input x.

1We also explore the use of GPT-3.5-Turbo in Section 6.5.2.

6.2.2 Zero-shot Cross-Lingual Transfer
Zero-shot cross-lingual transfer [3], [60], [62] aims to train a downstream model for the target language l_tgt using supervised data D_src from a source language l_src without using any data in the target language (i.e., D_tgt = ∅). The paradigm effectively advances language technologies for under-resourced languages.
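To make the notation above concrete, the following minimal sketch shows one way to represent a single structure prediction instance for EAE as spans with roles; the class and field names are illustrative assumptions rather than the exact data structures used in our code, and the example mirrors the English side of Figure 6.1.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span:
    start: int   # token index i_m (inclusive)
    end: int     # token index j_m (exclusive)
    role: str    # r_m, drawn from the pre-defined role set R
    text: str    # the surface string x[i_m:j_m]

@dataclass
class Instance:
    tokens: List[str]   # the input sentence x
    spans: List[Span]   # the structured output y with its roles r

# Hypothetical example based on Figure 6.1 (English source side).
tokens = "In South Florida , the average number of suits against a neurosurgeon is five .".split()
example = Instance(
    tokens=tokens,
    spans=[
        Span(start=8, end=9, role="trigger", text="suits"),
        Span(start=11, end=12, role="Defendant", text="neurosurgeon"),
    ],
)
```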
6.2.3 Translate-Train
Translate-train [60], [132] is a popular and powerful zero-shot cross-lingual transfer technique that leverages machine translators T to boost downstream model performance. Specifically, in translate-train, D_src is translated into the target language as pseudo training data D^tgt_src, and the downstream model is trained using a combination of {D_src, D^tgt_src}. Utilizing translate-train for structured prediction tasks requires Label Projection, which includes two sets of translations: (1) Sentence translation (x^src → x^tgt), where we use → to denote that x^tgt is the transformation of x^src; and (2) Label translation (y^src → y^tgt), such that the translated label y^tgt is appropriately associated with x^tgt. This demand makes translate-train for structure prediction tasks more complex than that for classification tasks, as the latter only requires sentence translation (since y is independent of x).2

6.2.3.0.1 Translate-Test Besides translate-train, translate-test is another commonly used technique in zero-shot cross-lingual transfer. During inference, models trained on D_src are used to predict on translated test sentences (x^tgt → x^src), and the predictions on x^src are later mapped back to x^tgt. We mainly focus on translate-train in this work but discuss CLaP's effectiveness for translate-test in Section 6.5.5.

2For certain structure prediction tasks like relation classification [2], [53] (determining the relationship between two entities in x), even if the output y is scalar, translate-train necessitates the label projection step due to the required projection of the two given entities into the translated sentence.

6.2.4 Label Projection
We hereby technically define the problem of label projection [7], [21]:

    x^src → x^tgt  and  y^src_m → y^tgt_m  ∀ y^src_m ∈ y^src,
    s.t.  y^tgt_m ∈ x^tgt  ∀ y^tgt_m ∈ y^tgt.

This problem requires optimizing two properties of accuracy and faithfulness in the translations:
• Accuracy ensures that [x^tgt, y^tgt_1, ..., y^tgt_n] are accurate translations of [x^src, y^src_1, ..., y^src_n].
• Faithfulness ensures that each y^tgt_m is associated with x^tgt (the constraint of y^tgt_m ∈ x^tgt).
Translating jointly while adhering to the faithfulness constraint is not a straightforward task, as standard translation models T struggle to incorporate this additional requirement, as evidenced by the failure cases depicted in Figure 6.1(b). This underscores the complexity inherent in label projection.

6.3 Methods
In this section, we first formally define the previous attempts at label projection and later introduce CLaP, which provides a new perspective of using contextual machine translation for label projection.

6.3.1 Baseline Methods
The primary frameworks used in prior works include Marker-based and Word-alignment methods.

6.3.1.0.1 Marker-based methods [21], [60], [80] solve label projection by first marking the labels in the input sentence x^src, forming x̃^src, and then using the translation model to obtain the potential translation of the input sentence and labels jointly. For example, in Figure 6.2(a), "South Florida" is delineated by markers [0] and [/0]. Assuming the preservation of markers after translation of x̃^src, a post-processing step, P_mark, is performed to retain the translated labels y^tgt and translated sentence x^tgt.
Putting every step together, we have

    x̃^src = f(x^src, y^src),    x̃^tgt = T(x̃^src),
    [x^tgt, y^tgt] = P_mark(x̃^tgt, y^src),

where f denotes the marker addition step and x̃^tgt is the translation of x̃^src using the machine translator T. Although these methods are straightforward, they tend to exhibit low translation quality and diminished adaptability across various translation models, due to the transformations applied to input sentences and the reliance on strong assumptions regarding the preservation of markers in x̃^tgt.

6.3.1.0.2 Word Alignment approaches [7], [177] first translate the input sentence and acquire word alignments [32], [38] between the translation pairs. Each translated label y^tgt_m is then procured by merging the aligned words of y^src_m in the translated sentence using the word mappings w. For example, in Figure 6.2(b), the translated label for "South Florida" is obtained by merging two aligned words, which is done by a heuristic post-processing algorithm P_align. Formally, we have

    x^tgt = T(x^src),    w = W(x^src, x^tgt),
    y^tgt_m = P_align(y^src_m, w, x^src, x^tgt)  ∀ y^src_m ∈ y^src.

While these methods excel in producing high-quality sentence translations, they often compromise the accuracy of the translated labels. This discrepancy arises from the fact that the translated labels are reconstructed from word-level translations, without taking into account the context of the entire span [7], [21].

6.3.2 Proposed Method: CLaP
We approach the task of label projection from a fresh angle by directly translating labels instead of extracting them from the translated text x^tgt. This approach enhances the precision of the translated labels y^tgt. To achieve this, we employ the concept of contextual machine translation, utilizing x^tgt as context for label translation. Contextual machine translation, which targets phrase-level translations contingent upon the context of the translated sentence, has been tangentially investigated for applications such as anaphora resolution [154] and pronoun translation [162]. The primary objective of this endeavor is to uphold the coherence of phrasal translations within the provided context. In this chapter, we develop a novel model CLaP to extend the idea of contextual translation to the application of label projection.

As illustrated in Figure 6.2(c), CLaP first utilizes the machine translation model T to translate the input sentence x^src to x^tgt. Treating x^tgt as the context, the contextual translation model M translates the labels y^src to y^tgt. Contextual translation implicitly imposes the faithfulness constraint, which requires y^tgt_m ∈ x^tgt, ∀ y^tgt_m ∈ y^tgt, hence loosely satisfying the requirement of label projection. These two steps can be formally described as:

    x^tgt = T(x^src),    y^tgt_m = M(y^src_m | x^tgt)  ∀ y^src_m ∈ y^src,

where y^tgt_m is generated from M(y^src_m | x^tgt), which distinguishes our formulation from previous works. Compared to word alignment approaches using simple word-similarity aligners W, we use a model with translation capabilities, M, to improve the accuracy of translated labels. Furthermore, the independence of T and M for translating x^src and y^src, respectively, assures that CLaP has better translation quality for x^tgt and is more robust than the marker-based baselines. We empirically back these intuitions in Section 6.4.4.

6.3.3 Implementing CLaP
To actualize our concept, we initially configure T to function as a modular component, capable of being substituted by any third-party translation model.
As for M, we employ an instruction-tuned language model (LM) with multilingual capabilities, as referenced in [134], [159]. These instruction-tuned LMs possess the ability to incorporate conditional information within their natural language prompts. Specifically, we encode both the translated target sentence x tgt and the faithfulness constraint y tgt m ∈ x tgt implicitly within the form of natural language instructions, highlighted as “Contextual Translation Instruction” in Figure 6.2(c). Following [17], we also provide n randomly chosen in-context examples (highlighted as “Incontext examples” in Figure 6.2(c)) to improve the instruction-understanding capability of the model. 3 Instruction-tuned LMs sacrifice some translation ability compared to supervised machine translation models [187], however, they provide better control of contextual constraints. After acquiring label translations, we utilize straightforward string-matching algorithms to determine the precise span index of y tgt m within x tgt. While this approach may not be the optimal solution in cases where duplicate strings are present within x tgt, it has proven effective in practice, as highlighted in previous word-alignment methods [32]. 6.4 Experiments and Results This section outlines our experimental settings, which include the datasets, baselines, and implementation details. Subsequently, we provide an in-depth analysis of CLaP through both intrinsic and extrinsic evaluations. 3The in-context examples are generated using Google translation and initial prediction from instructiontuned LMs. The label predictions are further verified by back-translation. 78 ACE WikiANN # Train Instances 4,202 20,000 # Dev Instances 450 10,000 # Avg. Test Instances 194 6,469 # Test Languages 2 39 Table 6.1: High-level data statistics for ACE and WikiANN datasets for EAE and NER tasks respectively. # = ‘number of’ and Avg. = average. 6.4.1 Task and Dataset In addition to event argument extraction (EAE) [55], [146], the key focus of the thesis, we also consider named entity recognition (NER) [148], [149] for evaluating our label projection method. EAE requires the extraction of text segments serving as arguments corresponding to an event and mapping them to their corresponding argument roles. NER aims to identify and categorize named entities from the input sentence. For EAE, we use multilingual ACE dataset [30] and follow the pre-processing used in Chapter 5. For NER, we consider the WikiANN [118], [131] with pre-processing by [60]. We list the basic statistics for these datasets in Table 6.1 and more details in Section D.1. For experiment, we consider the zero-shot cross-lingual transfer using English (en) as the source language. 6.4.2 Baselines We select two label projection models as baselines, each representing the two baseline frameworks we covered in Section 6.3.1, respectively: (1) EasyProject [21], a recent marker-based label-projection method, utilizes numbered square braces (e.g. [0] and [/0]) to mark the labels in the input sentence. (2) Awesome-Align [32], a neural bilingual word alignment model, uses multilingual language models to find word similarities to derive word alignments, which are later used for label projection. 79 6.4.3 Implementation Details We conduct experiments using the Google Machine Translation (GMT) model for translation (T ) [164]. For CLaP, we employ the text-completion variant of Llama-2 [151], comprising 13B parameters, denoted as M. In our CLaP setup, we utilize n = 2 in-context examples for prompts. 
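To give a concrete picture of this setup, here is a minimal sketch of how the contextual translation prompt (mirroring the "Contextual Translation Instruction" in Figure 6.2(c)) and the subsequent string matching could look; the exact prompt wording, the construction of the in-context examples, and the generate_fn wrapper around Llama-2-13B are illustrative assumptions rather than our precise implementation.

```python
def build_prompt(x_tgt, y_src, in_context_examples, target_language="Chinese"):
    """Encode the translated sentence and the label-presence constraint as an instruction,
    preceded by n = 2 in-context examples."""
    prompt = ""
    for ex in in_context_examples:
        prompt += (
            f"{target_language} Sentence: {ex['sentence']}\n"
            f"For the previous sentence, the words corresponding to '{ex['source_label']}' is: "
            f"'{ex['translated_label']}'\n\n"
        )
    prompt += (
        f"{target_language} Sentence: {x_tgt}\n"
        f"For the previous sentence, the words corresponding to '{y_src}' is: '"
    )
    return prompt

def locate_span(y_tgt, x_tgt):
    """Simple string matching to determine the span index of the translated label."""
    start = x_tgt.find(y_tgt)
    if start == -1:
        return None  # faithfulness constraint violated; such instances are filtered out
    return (start, start + len(y_tgt))

# generate_fn is assumed to wrap the Llama-2-13B text-completion model and return the
# completion up to the closing quote, e.g.:
#   y_tgt = generate_fn(build_prompt(x_tgt, "suits", examples))
#   span = locate_span(y_tgt, x_tgt)
```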
For Awesome-align, we opt for their unsupervised model utilizing multilingual BERT [29], which has demonstrated superior performance [21]. We use the non-fine-tuned version of EasyProject [21], as our experiments involve GMT. Although the original work explores fine-tuning the machine translation model, this necessitates open-source access, which we do not have. Additional implementation details can be found in Appendix D.3.

6.4.4 Intrinsic Evaluation
We first evaluate CLaP by directly assessing the label projection quality, focusing on the accuracy and faithfulness of the translated labels, as defined in Section 6.2.4. We employ native speakers to evaluate the accuracy of label translations using a ranking framework that accommodates ties. The final accuracy score reflects the average percentage by which the model surpasses all other competitors. This evaluation is performed on 50 data samples each for Chinese, Arabic, Hindi, and Spanish. Faithfulness measures the fulfillment of the label projection constraint. It is measured as the percentage of projected data points for which all the translated labels are present in the translated input sentence, i.e., y^tgt_m ∈ x^tgt, ∀ y^tgt_m ∈ y^tgt. The statistics use the complete test sets of ACE and WikiANN.

6.4.4.0.1 Results: The accuracy and faithfulness of the models are plotted together in Figure 6.3. An ideal model should optimize both these metrics; thus, the closer the models are to the top-right, the better they are deemed. Overall, this figure illustrates how CLaP performs the best intrinsically, as it is the closest to the top-right for both tasks.

(Figure 6.3 plots faithfulness against accuracy, both in %, for CLaP, Awesome-align, and EasyProject on EAE and NER.)
Figure 6.3: Reporting the faithfulness and accuracy (in %) of the various label projection models on EAE and NER, with the top-right indicating superior performance.

Regarding EAE, CLaP outperforms all other models in both metrics, while for NER, it slightly compromises faithfulness for stronger accuracy. Awesome-align and EasyProject excel at achieving higher projection rates but result in less accurate label translations. In summary, the intrinsic evaluation demonstrates that CLaP offers the optimal balance between accuracy and faithfulness on a qualitative basis.

6.4.5 Extrinsic Evaluation
Extrinsic evaluation implicitly assesses the effectiveness of various label projection methods in generating pseudo-training data for downstream tasks. The projected data is filtered based on the faithfulness constraint as D^tgt_src and used along with the original English data D_src for downstream training. For EAE, we use X-Gear (see Chapter 5) as the downstream model. For NER, we use XLM-RoBERTa-large [25] as our downstream model and follow the XTREME [60] setup for implementations. All results are the average over five runs.

6.4.5.0.1 Results: We present the EAE results for argument classification F1 scores in Table 6.2, including the zero-shot baseline trained solely on D_src for reference.

                 ar     zh     Avg
LLM-Infer       16.9   24.0   20.5
Zero-shot∗      40.3   51.9   46.1
Awesome-align   48.6   54.5   51.6
EasyProject     38.5   56.3   47.4
CLaP (ours)     49.3   58.6   54.0
Table 6.2: Extrinsic evaluation of the different label projection techniques regarding downstream model performance using translate-train and the LLM-Infer baseline for EAE. Avg = Average. ∗ indicates the reproduced results of X-Gear (Check Chapter 5 for X-Gear details).

CLaP emerges
Lang af ar bg bn de el es et eu fa fi fr he hi hu id it ja jv ka LLM-Infer 50.9 24.8 66.9 12.0 44.2 42.2 59.5 41.6 36.7 19.5 46.7 53.5 15.6 18.9 20.6 30.3 56.0 35.7 28.7 21.7 Zero-shot 77.4 48.1 82.8 77.0 78.8 80.6 74.5 78.7 61.4 69.2 79.3 79.4 57.3 70.6 80.8 53.1 79.4 19.1 58.5 72.3 Awesome-align 77.9 46.0 81.0 81.2 78.8 71.7 65.3 78.0 66.8 46.4 77.4 78.2 55.3 73.9 77.4 52.8 79.3 20.3 56.3 70.4 EasyProject 76.1 34.4 81.0 78.6 78.8 69.3 70.5 73.9 54.8 49.1 77.8 78.8 61.1 73.0 75.6 51.0 79.0 41.3 62.4 66.4 CLaP 74.4 48.7 81.0 78.1 78.4 75.9 74.7 77.4 68.8 59.0 75.9 79.4 58.4 73.1 72.4 56.1 80.1 45.3 64.8 70.5 kk ko ml mr ms my nl pt ru sw ta te th tl tr ur vi yo zh Avg LLM-Infer 20.9 18.5 11.1 16.5 46.5 10.1 64.3 46.4 22.7 33.4 12.8 9.2 19.8 46.1 31.0 11.6 37.3 28.6 41.0 32.1 Zero-shot 51.9 57.5 66.4 65.3 53.4 65.8 83.0 80.0 74.2 68.4 60.3 62.1 0.4 74.5 65.6 62.2 75.0 34.1 24.6 64.2 Awesome-align 47.7 57.7 63.4 62.4 70.7 54.1 83.0 75.8 64.8 70.1 62.4 55.4 2.4 80.9 62.8 53.7 66.4 61.5 45.4 63.5 EasyProject 31.7 48.2 56.5 59.8 71.7 60.3 81.9 79.6 66.3 71.5 53.2 54.2 11.4 78.2 66.8 63.8 65.6 68.8 42.0 63.2 CLaP 42.8 60.1 60.3 61.4 73.5 61.5 82.2 78.2 68.3 70.6 59.6 53.1 13.2 74.6 62.9 32.9 75.8 59.6 49.7 64.9 Table 6.3: Extrinsic evaluation of the different label projection techniques in terms of downstream model performance using translate-train and the LLM-Infer baseline for NER. Avg = Average. as the top performer, exhibiting an average gain of 2.4 F1 points over the closest baseline, Awesome-align, and a remarkable net gain of 7.9 F1 points over the zero-shot baseline. This outcome aligns with our intrinsic evaluation, where CLaP consistently outshone competitors for EAE. Regarding entity classification F1 scores, the primary findings are depicted in Table 6.3. Overall, CLaP surpasses all benchmarks, showcasing an absolute enhancement of 0.7 F1 points compared to the zero-shot baseline and outpacing prior studies by 1.4-1.7 F1 points. The superior performance of downstream models leveraging CLaP underscores its effectiveness in enhancing downstream tasks. 82 6.4.5.0.2 LLM usage comparison - Direct Inference v/s Contextual Translation: We compare the fine-tuned models with LLM-Infer, a large language model (LLM) baseline directly inferring on the downstream task in the target language. We utilize the chat version of Llama2-13B model [151] for the baseline. 4 We explore various cross-lingual prompting strategies, following [6] (complete experiments in Appendix D.4), and report the performance for the best prompt here. From results in Table 6.2 & 6.3, we can assert how LLM-infer performs significantly poorer than any fine-tuned model, indicating how LLMs can’t infer well on cross-lingual structured prediction. On the other hand, we demonstrate that LLMs can be better utilized to do contextual translation, as used in CLaP, which leads to the best performance for both the downstream tasks. Additional experiments with ChatGPT [17] are also provided in Appendix D.4. 6.5 Analysis 6.5.1 Qualitative Analysis Delving deeper, we conduct a qualitative examination of common error scenarios encountered in the translated labels across four languages, employing various label projection techniques. Among the 200 instances scrutinized, we observed that in 18% of cases, EasyProject fails to predict any output due to dropped markers in the translated sentence, while in 19% of cases, it simply replicates the English label without translating it into the target language. 
Conversely, for Awesome-align, the primary errors stem from either additional words or incomplete translations of labels, aligning with findings discussed in [21]. This phenomenon may be attributed to the complexity of determining alignments between sub-words, leading to either over-alignment or under-alignment. To illustrate, we present two representative examples from our analysis of Hindi (hi) and Chinese (zh) in Table 6.4, demonstrating instances where Awesome-align predicts surplus or incomplete words due to misalignments, 4Compared to the text version, the chat version of Llama2 provided better results. 83 Source Source Target Technique Translated Explanation Sentence Label Lang Label Born in Castelvetrano , Trapani and raised in Catania , he moved to Madrid to keep up his busy career . Castelvetrano hi Awesome-align k{-VlvV~Ano V~ApAnF Extra word (Castelvetrano Trapani) EasyProject Castelvetrano No translation CLaP k{-VlvV~Ano Perfect (Castelvetrano) Unilaterally leading a coalition featuring tyrannies, effect such change remains a bad idea, Iraq’s elections notwithstanding. Iraq zh Awesome-align 伊拉 Incomplete (Ira-) EasyProject 尽管伊拉克 Extra word (although Iraq) CLaP 伊拉克 Perfect (Iraq) Table 6.4: We present qualitative examples illustrating the error cases of the baseline models, accompanied by explanations, using the examples in Hindi (hi) and Chinese (zh). We also show how CLaP performs better and fixes the errors. Blue text is English translation. and where EasyProject fails to translate a word in Hindi while generating excess tokens in Chinese. In both scenarios, we showcase how CLaP exhibits precise predictions, showcasing greater robustness in maintaining accurate label translations. 6.5.2 CLAP with Larger LLMs We utilize a relatively small LLM Llama-2 [151] with 13B parameters as M for our experiments with CLaP. Here, we analyze the impact of utilizing a larger LLM for CLaP. More specifically, we compare Llama-2-13B based CLaP with a larger GPT-3.5-Turbo [17] based CLaP for five languages for EAE and NER in Table 6.5. 5 We notice that using GPT-3.5- Turbo in CLaP is at par with the Llama-2 variant for medium to high-resource languages like Arabic (ar) and Chinese (zh). On the other side, for lower-resourced languages like Yoruba (yo), Urdu (ur), and Kazakh (kk), GPT-3.5-Turbo introduces significantly larger improvements of 3 to 30 F1 points. Thus, we hypothesize that larger multilingual LLMs can further improve CLaP, especially for low-resource languages, also evidenced in [13]. 5GPT-3.5-Turbo costs $20-$30 per language. Thus, owing to budget constraints, we restrict ourselves to 5 languages. 84 Model EAE NER Size ar zh yo ur kk CLaP (w/ Llama-2-13B) 13B 49.3 58.6 59.6 32.9 42.8 CLaP (w/ GPT-3.5-Turbo) 175B 49.1 58.4 62.3 60.1 46.6 Table 6.5: Extrinsic evaluation of CLaP using Llama-2-13B and GPT-3.5-Turbo for five languages. ar zh Avg Zero-shot 40.3 51.9 43.9 Awesome-align 47.1 53.8 48.4 EasyProject 36.5 55.6 45.4 CLaP (ours) 48.2 56.9 50.4 Table 6.6: Extrinsic evaluation of the different label projection techniques using translatetrain for EAE using the mBART-50 many-to-many translation model. 6.5.3 Generalization to other translation models To assess the applicability of our approach across various translation models, we conduct an extrinsic evaluation of the label projection techniques on the EAE task, employing the mBART-50 many-to-many (MMT) model [147]. The evaluation results are presented in Table 6.6. 
Notably, CLaP outperforms other methods, exhibiting an average improvement of 2 F1 points over the next best baseline, Awesome-align, and a substantial 6.5 F1 points over the zero-shot baseline. These findings underscore the versatility of our CLaP as a label projection technique, independent of the underlying translation model. 6.5.4 Ablation Study for CLaP To study the impact of using instruction-tuned models for contextual translation, we conduct an ablation study comparing CLaP with the following baselines which put extra focus on accuracy or faithfulness for contextual machine translation: (1) Independent translation uses the translation model T to independently (without any context of the input sentence) translate the source text labels to the target language (i.e. y tgt = T (y src)), (2) Constrained 85 ar zh Avg Zero-shot 40.3 51.9 43.9 Independent 44.8 54.3 47.6 Constrained 45.6 55.6 48.8 CLaP (ours) 48.2 56.9 50.4 Supervised 63.2 69.7 65.0 Table 6.7: Ablation study comparing different contextual translation techniques for label projection. Performance is measured by downstream EAE performance. translation which uses a decoding constraint to carry out the faithfulness requirements. More specifically, during translation, it limits the generation vocabulary to the tokens in the translated sentence x tgt. We follow [20], [97] for implementing these constraints. We extrinsically evaluate the model performances of the techniques on the task of EAE using the MMT translation model 6 and show the results in Table 6.7. While the independent model compromises faithfulness and the constrained model sacrifices accuracy, both outperform the zero-shot baseline. Our CLaP demonstrates high accuracy and faithfulness, achieving the best performance with improvements ranging from 1.6 to 2.8 F1 over the ablation baselines. 6.5.5 CLaP for Translate-Test Lastly, we study another popular technique for cross-lingual transfer — translate-test [60], [132] which was discussed in Section 6.2.3. As part of this analysis, we study the applicability of CLaP for translate-test using extrinsic evaluation on Arabic (ar) and Chinese (zh) for EAE and Italian (it), Spanish (es), and Indonesian (id) for NER. We show the results in Table 6.8. Overall, we see how CLaP outperforms both the other methods significantly achieving the best scores for 4 out of the 5 languages. EasyProject performs the worst as it uses the translation model twice causing higher error propagation. We also note how translate-test doesn’t yield improvements over the zero-shot baseline, especially for EAE as it requires 6Since decoding-time constraints for the Constrained model can’t be applied to GMT 86 EAE NER Avg ar zh it es id Zero-shot 36.3 47.3 79.4 74.5 53.1 58.1 Awesome-align 32.8 30.1 77.5 69.6 51.4 52.3 EasyProject 17.0 11.5 65.9 62.6 51.8 41.8 CLaP (ours) 34.3 39.5 73.4 75.0 57.4 55.9 Table 6.8: Extrinsic evaluation of the different label projection techniques using translatetest using GMT for EAE and NER. Avg = Average Lang ha ig ny rw sn Zero-shot 72.9 46.4 49.0 45.0 50.2 Awesome-align 72.2 64.1 64.9 55.9 55.4 EasyProject 72.0 54.6 50.5 54.5 42.5 CLaP (ours) 69.9 60.5 58.7 53.6 59.7 sw xh yo zu qu Zero-shot 88.6 61.0 33.6 67.1 37.9 Awesome-align 82.9 52.4 30.8 57.9 46.1 EasyProject 81.3 50.6 25.2 44.3 44.1 CLaP (ours) 80.7 61.3 30.6 54.4 48.7 Table 6.9: Extrinsic evaluation of the different label projection techniques using translatetrain using GMT for NER for 10 low-resource languages. 
using label projection twice (once for trigger and once for arguments), thus leading to error propagation. 6.6 CLaP for Low-Resource Languages To cater our model to a wide range of languages, we study the applicability of CLaP for lowresource languages. Specifically, we consider the task of NER for 10 low-resource languages from Africa and South America. For the test datasets, we utilize MasakhaNER [1] for 9 African languages: Hausa (ha), Igbo (ig), Chichewa (ny), Kinyarwanda (rw), chShona (sn), Kiswahili (sw), isiXhosa (xh), Yor`ub´a (yo), isiZulu (zu), and refer to [181] for the South American language Quechua (qu). We conduct extrinsic evaluation of translate-train models 87 transferring from the English CoNLL training data7 using the GMT model and present the results in Table 6.9. We observe that this is a particularly challenging setting as all the label projection techniques fail to improve over the zero-shot model for 4 languages. Our model CLaP improves for 6 languages and performs the best for 3 languages. This result is particularly encouraging as our model uses a small and English-centric 13B Llama-2 model and utilizing larger multilingual LLMs will amplify these improvements further (as shown in Section 6.5.2). 8 6.7 Related Works Several works have attempted to solve label projection for various structure extraction tasks such as semantic role labeling [9], [39], slot filling [168], semantic parsing [10], [104], NER [112], [144], and question-answering [15], [78], [80]. The earliest works [7], [178] utilized statistical word-alignment techniques like GIZA++ [116] or fast-align [38] for locating the labels in the translated sentence. Recent works [21] have also explored the usage of neural word aligners like QA-align [106] and Awesome-align [32]. Another set of works has explored the paradigm of mark-then-translate using special markers like quote characters (“”) [80], XML tags () [60], and square braces ([0]) [21] to locate the translated labels. Overall, both these techniques can be error-prone and have poorer translation quality [7], as shown in Section 6.4.4 and 6.5.1. A recent concurrent work CODEC [77] improves the translation quality of text with markers by constrained decoding and data augmentation. 6.8 Summary In this Chapter, we introduce a novel approach CLaP for label projection, which utilizes contextual machine translation using instruction-tuned language models. Experiments on two 7For qu, we only use 3,000 CoNLL training data points due to budget constraints. 8Owing to budget constraints, we left the exploration as future work. 88 structure prediction tasks of EAE and NER across 39 languages demonstrate the effectiveness of CLaP compared to other label projection techniques. Intrinsic evaluation provides deeper insights that justify our model improvements. Additional experiments using larger LLMs, various translation models, translate-test paradigm, and 10 extremely low-resource languages demonstrate the generalizability and future potential of CLaP for general crosslingual structured prediction. 89 Chapter 7 Synthesize Pseudo Out-of-KB Examples for Event Linking 7.1 Introduction In this final chapter, we study event-linking, which involves associating mentions of events in text with corresponding nodes in a knowledge base (KB) [114], [117]. One crucial challenge faced by event linking models is accurately predicting a “NIL” label when there’s no corresponding entry in the KB for the mentioned event [31], [180], [185]. 
Given the limited entries KBs usually possess, it is always a practical need for linking models to deal with out-of-KB queries [31], and this challenge is more acute in event linking (compared to entity linking) due to the vast number of newly occurring events, of which only a fraction are recorded in KBs. However, prior research on event linking has largely overlooked this generalization issue, with the predominant focus being on training data contained within the KB. Hence, there’s a pressing need to devise methods for generating pseudo-out-of-KB training instances to enhance the generalizability of event-linking models. In contrast to entities that generally maintain consistent attributes over time, events can vary based on nuances in event arguments, such as time, location, and their participants, leading to increased complexity and ambiguity. For instance, Figure 7.1 illustrates that the 90 In 1775, the conflict between the British East India Company and the Maratha Empire escalated into war. British troops under the command of Colonel Keating, left Surat on 15 March 1775, for Pune. Label: First Anglo-Maratha War In 1803, the tensions between the British East India Company and a coalition of Maratha factions erupted into war. With the logistic assembly of his army complete, Wellesley gave the order to attack the nearest Maratha fort. Label: Second Anglo-Maratha War Figure 7.1: An example of two distinct events with the same event mention but different event arguments. two event mentions of “war” should be differentiated and linked to distinct Wikipedia entries by recognizing their unique occurrence times and involved leaders, despite the similarity in event names and combatants. Inspired by the observation, we propose to enhance event-linking systems by capitalizing on the role of event arguments in distinguishing events and creating pseudo-out-of-KB training data by altering the event arguments. We use established event extraction models [55], [61] to capture the participants, time, and locations of the query event. Then, we create synthetic out-of-KB query data, which is created by manipulating the event arguments of existing queries. For example, our system will replace British East India Company and the Maratha Empire in Figure 7.1 with alternative fictional combatant pairs to form the training data for “out-of-KB” prediction. We apply our design to a model architecture akin to [126], [180] and conduct experiments on the two event-linking datasets introduced by [180]. Our approach yields a 22% accuracy improvement over previous baselines for out-of-KB testing and an over 1% increase for inKB testing. Additionally, by comparing various methods for generating synthetic out-of-KB examples, we demonstrate that our data synthesis approach successfully balances in-KB and out-of-KB usage for event-linking. 91 7.2 Related Work 7.2.0.0.1 Entity Linking , which associates entity mentions with KB entries, has been studied for years [16], [18], [46], [103]. Common approaches include using neural networks to represent queries and KB entries for discriminative prediction [43], [163], [182] or using generation-based methods [19], [105], [165]. While these techniques can be adapted for event linking, they are not tailored to incorporate the structured information within events, which, as we will demonstrate in Section 7.4, is vital for disambiguating events for grounding. 7.2.0.0.2 Event Linking are first introduced by [114]. 
Recently, [180] and [126] have made efforts to introduce English and multilingual datasets to the task. However, their approaches overlook the influence of event arguments on the task and neglect the discussion of handling out-of-KB cases.

7.3 Method
We hereby introduce our event-linking model with two key innovations, as illustrated in Figure 7.2. First, to help the model distinguish event attribute details, we propose to first tag the event argument information of the input query (Section 7.3.1). Second, to improve the model's capability to deal with out-of-KB cases, we introduce a negative data generation method that synthesizes potential out-of-KB examples to train our model (Section 7.3.2). Finally, we train the model with these data changes (Section 7.3.3).

(Figure 7.2 depicts the pipeline: event detection and event argument extraction modules tag the query with event argument information; a Bi-Encoder performs dense retrieval of the top-k candidates from the knowledge base; a Cross-Encoder then scores k + 1 inputs, namely the query paired with each of the top-k candidates plus a learnable NIL candidate; during training only, negative data generation replaces the original query with a negative query paired with the original query's top-k candidates.)
Figure 7.2: Given a text with an event to ground, our method extracts the event's attributes through event detection and argument extraction modules. The text, enriched with event argument tags, is then input into a Bi-Encoder to identify the top-k potential nodes. These candidates are further ranked by a Cross-Encoder, which also considers an additional "NIL" candidate in case of out-of-KB instances. To equip the Cross-Encoder to robustly predict "NIL" and real KB entries, we train it with additional synthetic data generated through our negative data creation process.

7.3.1 Event Argument Tagging
To make event linking models better capture event argument information, we use the UniST model [61] trained on the MAVEN dataset [158] to first identify the event types of query events. Given the predicted event types, we extract event arguments using the TagPrime model [55], [64] trained on the GENEVA dataset [121]. Text with event argument tags is used by our model. Taking the passage in Figure 7.3 as an example, "Germany" will be extracted as the "Assailant" of the invasion, and "the Soviet Union" will be highlighted as the "Victim". Overall, our system can support 115 event types with 220 different argument roles to cover the potentially diverse types of events in the given queries. More relevant details about event trigger and argument extraction can be found in Section E.1.

7.3.2 Negative Data Generation
Prior research on event linking largely overlooked out-of-KB issues, mainly due to the limited availability of diverse training data for such scenarios. To address this gap, we design a pipeline to generate synthetic training data, enhancing the ability of event linking systems to make robust predictions for both in-KB and out-of-KB queries. Creating out-of-KB event queries is non-trivial because randomly altering the query text does not guarantee that the event falls outside the KB or at least stops referencing the original event. Directly altering the event mention word may lead to text that is incoherent or that still references the original event.
(Figure 7.3 illustrates the process on an example. Original passage: "Zagrebelnyi was a Ukrainian novelist. He graduated from secondary school in 1941. That same year, when <Assailant> Germany </Assailant> <mention> invaded </mention> <Victim> the Soviet Union </Victim>, he enlisted the Red Army, participated in the Battle of Kiev, and was …". Step 1 edits the tagged event arguments, e.g., "… when <Assailant> Korea </Assailant> <mention> invaded </mention> <Victim> France </Victim>, he enlisted the Red Army, participated in the Battle of Kiev, and was …". Step 2 adjusts the rest of the passage for fluency and consistency, e.g., "Antoinette was a French novelist. He graduated from secondary school in 1941. That same year, when <Assailant> Korea </Assailant> <mention> invaded </mention> <Victim> France </Victim>, he enlisted the French Army, participated in the Battle of Paris, and was …". The conceptual prompt instructs the model: "You are a storyteller ... I'll give you a passage … key details marked by special tags … You are to rewrite … but containing factually incorrect information … To complete the task, follow these two steps: Step 1: Edit the tagged key details of the event, ensuring that any changes remain consistent with the original role type … Step 2: Ensure the final passage's coherence and creativity by adjusting Step 1's outcome for fluency and consistency …".)
Figure 7.3: Illustration of our negative data generation process for event linking using an LLM.

We address the challenge by leveraging our observation that events differ when argument configurations change. To generate a data point, we first sample an in-KB query from the training set, along with its tagged event mention and arguments. We then instruct a large language model (LLM) to adjust this example through a two-step process: first modifying the tagged event arguments and then making edits to ensure coherence and fluency, as demonstrated in Figure 7.3.1 To enhance the chance of generating more realistic out-of-KB query cases, we instruct the LLM to create context violating its own knowledge. However, there remains a possibility that the generated context may reference other events within the KB. To minimize this impact, in actual data usage, we treat our generated event query as a "negative" training data point when paired with the top KB entries for the original sampled in-KB mention. Further details are provided in Section 7.3.3.

1This two-step generation is carried out through a single prompt. We use GPT-3.5-Turbo (https://platform.openai.com/docs/models/gpt-3-5-turbo) with 2-shot examples [17] to instruct the model. More details about the prompt are listed in Section E.2.

7.3.3 Model
We apply our proposed techniques to the same retrieve-and-rerank model architecture [163] used in prior works [180]. The retrieve stage involves a bi-encoder model. A candidate encoder first encodes each entry in the KB into a dense space. A text query q with event information tags is then fed into the other encoder (the query encoder) and projected into the same dense space. Top-k candidates are extracted by measuring the dot-product similarities between q and every KB entry. After obtaining the top-k KB candidates c_1, c_2, ..., c_k, a cross-encoder is employed to encode every pair (q, c_i) into a score S(q, c_i). The best candidate is selected by ranking the scores: c = arg max_{c_i} S(q, c_i).
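As a concrete illustration of the retrieve-and-rerank pipeline just described, here is a minimal sketch. It assumes the query and candidate encoders have already produced dense vectors and that score_fn wraps the cross-encoder, so the function names and tensor shapes are illustrative rather than our exact implementation.

```python
import torch

def retrieve_top_k(query_vec, kb_matrix, k=10):
    """Bi-encoder stage: dot-product similarity between the tagged query vector
    (from the query encoder) and the pre-encoded KB entries (from the candidate
    encoder); returns the indices of the top-k candidates c_1, ..., c_k."""
    scores = kb_matrix @ query_vec            # one score per KB entry
    return torch.topk(scores, k=k).indices

def rerank(query_text, candidate_texts, score_fn):
    """Cross-encoder stage: score_fn(q, c_i) encodes the concatenated pair and
    returns S(q, c_i) as a float; the selected entry is the argmax over the scores."""
    scores = torch.tensor([score_fn(query_text, c) for c in candidate_texts])
    return candidate_texts[int(scores.argmax())]

# Example usage (hypothetical names):
#   top_ids = retrieve_top_k(query_vec, kb_matrix, k=10)
#   prediction = rerank(query_text, [kb_texts[i] for i in top_ids], cross_encoder_score)
```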
To handle out-of-KB scenarios, prior work [180], lacking out-of-KB training examples, generates the final output c_final by setting an arbitrary threshold θ:

    c_final = NIL, if S(q, c) < θ,
    c_final = c,   otherwise.

Unlike this approach, our method introduces a learned "NIL" class trained with our synthetic negative data. During the re-ranking phase, we expand the candidate pool to include k + 1 options, the extra one being a randomly initialized embedding that represents "NIL":

    c_final = arg max_{c_i} S(q, c_i),  where i ∈ {0, 1, ..., k} and c_0 = NIL.

Our cross-encoder is trained to predict "NIL" when the input query q is replaced with the synthetic negative query we generated, illustrated as the "Negative Query" in Figure 7.2.

7.4 Experiments

7.4.1 Experimental Settings

7.4.1.0.1 Datasets We use the datasets constructed by the prior event linking work [180]. The event KB used is the collection of Wikipedia pages with English titles. The datasets include the Wikipedia dataset, which contains training and in-domain testing data, and the New York Times (NYT) dataset, which contains out-of-domain and out-of-KB testing data. We introduce the details of the datasets below and list their statistics in Table 7.1.
• The Wikipedia dataset contains the training, validation, and test splits. The Wikipedia dataset is collected automatically from hyperlinks in Wikipedia. A hyperlink text is considered an event mention if the linked Wikipedia title's mapped FIGER type [87], a fine-grained set of entity tags, is "Event." By construction, the Wikipedia dataset contains in-KB event mentions only.
• The NYT dataset is a smaller, manually annotated test set. 2,500 lead paragraphs are sampled from The New York Times Annotated Corpus2 and then annotated through Amazon Mechanical Turk. The dataset comes from real-life news articles and contains out-of-KB event mentions that are not covered by Wikipedia.

Dataset      Train    Valid.   Test (In-KB)   Test (Out-of-KB)
Wikipedia    66217    16650       19213              -
NYT             -        -          769             993
Table 7.1: Statistics of the two datasets.

7.4.1.0.2 Evaluation Metrics We follow [180] to evaluate models using accuracy.

2https://catalog.ldc.upenn.edu/LDC2008T19

Model     Wikipedia Test            NYT Test
           All    Verb    Noun      All    Verb    Noun
BM25       9.72   13.08    6.36     3.69    3.18    5.19
BLINK     78.74   78.12   79.36    27.13   29.24   20.74
EveLink   79.00   78.07   79.93    32.03   34.34   25.13
Ours      80.05   79.47   80.62    55.40   59.90   41.99
Table 7.2: Accuracy (%) on both Wikipedia (in-domain, in-KB) and NYT (out-of-domain, out-of-KB) test sets for event linking. The best performance is highlighted in bold. BLINK [163]; EveLink [180].

7.4.1.0.3 Baselines3: (1) BM25, a term-based ranking algorithm for information retrieval. (2) BLINK [163], the retrieve-and-rerank model architecture introduced for entity linking. We adopt its code but train it on our training set. The model takes pure query text as input. (3) EveLink [180], the current SOTA event linking model. It adopts BLINK but enhances the text query by adding local named entity information.

7.4.2 Implementation Details
During training, we first train the bi-encoder and retrieve the top 10 candidates using the trained bi-encoder. We then use the queries and their top 10 candidates to train the cross-encoder. During inference, we also first retrieve the top 10 candidates using the trained bi-encoder. Then, we re-rank the retrieved candidates using the trained cross-encoder and select the top 1 candidate to compare with the ground truth. To ensure a fair comparison, all baselines and our method follow this recipe.
The results presented are the average of three random runs. Section E.3 covers detailed implementation details. 7.4.3 Main Results Table 7.2 presents our main results, categorized by the type of event mention (All/Verb/- Noun). In the in-domain Wikipedia evaluation, our approach surpasses all baseline methods 3We also explore LLM’s performance in Section 7.4.4.3. 97 across all categories. For the out-of-domain, out-of-KB evaluation using the NYT dataset, our method demonstrates its robustness with an over 20% absolute improvement. 7.4.4 Analysis In this section, we present studies to verify our two innovations and explore another possibility of using LLMs on the task. 7.4.4.1 Bi-Encoder Results Bi-encoder results are shown in Table 7.3. Directly analyzing the bi-encoder performance allows us to assess the impact of integrating event argument data into the text. Our approach surpasses all baseline methods, showing greater enhancements in the harder cases as the number of candidates decreases. Model Wikipedia (in-KB) Test Set Recall R@1 R@2 R@3 R@5 R@10 R@20 BM25 9.72 16.64 20.58 25.48 31.77 38.10 BLINK 54.85 68.14 74.27 80.36 86.22 90.55 EveLink 55.72 67.22 74.74 80.62 86.51 90.91 Ours 57.28 70.14 76.10 81.69 87.40 91.34 Table 7.3: Bi-encoder recall (%) on the Wikipedia test set. “R@1” stands for recall at 1, and so on. See Section E.4 for more recall values. 7.4.4.2 Effectiveness of Negative Data Generation We benchmark our approach against two alternative methods for generating negative data to train the cross-encoder: (1) Non-argument-aware Data Generation, which also employs GPT-3.5-Turbo but does not incorporate event information into the prompts, as detailed in Section E.2; (2) KB Pruning, a strategy introduced by [31] in the entity linking field. This method creates negative samples by randomly eliminating 10% of KB entries and marking the associated training data as negative examples external to the KB. Table 7.4 shows the 98 comparison. While KB Pruning ensures high-quality negative examples outside the KB, it negatively affects performance on in-KB tests. In contrast, our method, designed with an emphasis on event information, effectively balances the use of in-KB and out-of-KB cases. Model Wiki. NYT Avg. BLINK (no negative data usage) 78.74 27.13 52.94 w/ Non-argument-aware method 79.09 54.08 66.59 w/ KB Pruning 76.72 55.85 66.29 w/ Argument-aware method (Ours) 79.22 55.18 67.20 Table 7.4: Analysis of alternative negative data generation methods. The best and the second-best are bolded and underlined, respectively 7.4.4.3 LLM Baseline Comparison To the best of our knowledge, no existing event-linking systems do not use LLMs yet. Thus, inspired by existing methods in information retrieval [125], we explore the LLMs’ effectiveness by replacing the cross-encoder in the current pipeline with an instructed LLM to perform candidate re-ranking. We implement the baseline using GPT-3.5-Turbo and compare our proposed method with this LLM baseline. 4 We sample a subset of 1000 test examples and use the same bi-encoder system to ensure fair comparison. From the comparison shown in Table 7.5, we observe that the method using LLM to rerank is significantly inferior to our method. This underscores the advantage of our approach in utilizing LLM as an engine for generating negative data. 7.5 Summary In this chapter, we introduce an argument-aware method designed to improve event-linking models. This approach aids in disambiguating events and generating out-of-KB training examples. 
Experimental results demonstrate that our method enhances the accuracy for 4 Implementation details can be found in Section E.3 99 Model Wikipedia Test NYT Test All Verb Noun All Verb Noun LLM-reranked 44.64 45.61 43.53 28.56 29.51 25.70 Ours 79.88 77.76 82.33 57.01 61.28 44.18 Table 7.5: Comparison with the LLM-reranked baseline. Due to budget constraints, the experiment is conducted on a subset of the whole dataset. both in-KB and out-of-KB queries. Our findings reveal that the system, trained on flattened data, struggles to process structured textual information effectively. Therefore, implementing our guidance about event arguments can improve its understanding of structured events. 100 Part III Conclusion 101 Chapter 8 Conclusion and Future Directions In this dissertation, I investigate approaches from two distinct perspectives aimed at enhancing the generalizability of NLP models in the domain of event understanding. I intend to summarize the contributions of this dissertation as follows. From the models’ perspective, I introduce a pioneering learning platform to the event extraction community, transitioning from discriminative models to generative model learning, thereby enhancing generalizability. In Chapter 3, I present the foundational framework of this platform, demonstrating its capacity for data-efficient learning, adaptability to unseen events, and competitive performance comparable to state-of-the-art results achieved by classification-based methods when training with ample training data. Expanding upon this framework in Chapter 4, I investigate techniques for incorporating external knowledge into the generation-based event understanding models. This is achieved by learning a prefix and integrating it into the generation-based model. Additionally, in Chapter 5, I extend the generative framework’s capabilities to excel in zero-shot cross-lingual scenarios by introducing language-agnostic templates. From a data perspective, I suggest enhancing the generalizability of event understanding models by augmenting the training data for events. This augmentation process leverages the capabilities of a pre-trained large language model. It is only through meticulous control of this model that we can generate high-quality pseudo-training data for event understanding. In Chapter 6, I demonstrate the integration of machine translation systems and large lan102 guage models to conduct label projection with heightened accuracy and fidelity. By training on this meticulously translated data, downstream event understanding models exhibit significantly improved performance compared to pure zero-shot transfer approaches. Additionally, in Chapter 7, I highlight the pivotal role played by event arguments in events and propose synthesizing out-of-event-KB data by manipulating these arguments. Through training on this high-quality negative data, downstream event linking models demonstrate a superior ability to discern events within the event knowledge base or beyond. Future research in this area could prioritize events necessitating annotation by domain experts. Our work establishes a foundation for generalization to novel events without surpassing common conceptual understanding. However, in numerous fields, including biomedicine and scientific research, events and information are intricately intertwined and challenging to capture. The development of more efficacious algorithms in this regard could significantly enhance the utility of these applications. 
Moreover, a promising avenue for future exploration involves information validation. Our research, encompassing extraction and linking systems, presents methodologies for organizing information and events sourced from extensive reference data. This framework provides a foundation for information validation through comparative analysis between input data and the gathered information. Nonetheless, constructing such a pipeline remains challenging due to the intricacy and potential fragility of each system component, especially when cross-document alignment is required. Further research could delve into this realm, aiming to improve how effectively models utilize information.

Bibliography

[1] D. Adelani, G. Neubig, S. Ruder, et al., "MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. [2] W. U. Ahmad, N. Peng, and K. Chang, "GATE: graph attention transformer encoder for cross-lingual relation and event extraction," in Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2021. [3] W. U. Ahmad, Z. Zhang, X. Ma, K.-W. Chang, and N. Peng, "Cross-lingual dependency parsing with unlabeled auxiliary languages," in The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019. [4] W. U. Ahmad, Z. Zhang, X. Ma, E. H. Hovy, K. Chang, and N. Peng, "On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019. [5] D. Ahn, "The stages of event extraction," in Proceedings of the Workshop on Annotating and Reasoning about Time and Events, 2006. [6] K. Ahuja, H. Diddee, R. Hada, et al., "MEGA: Multilingual evaluation of generative AI," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. [7] A. Akbik, L. Chiticariu, M. Danilevsky, Y. Li, S. Vaithyanathan, and H. Zhu, "Generating high quality proposition Banks for multilingual semantic role labeling," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015. [8] M. Aminian, M. S. Rasooli, and M. Diab, "Cross-lingual transfer of semantic roles: From raw text to semantic roles," in Proceedings of the 13th International Conference on Computational Semantics - Long Papers, 2019. [9] M. Aminian, M. S. Rasooli, and M. Diab, "Transferring semantic roles using translation and syntactic information," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017. [10] A. Awasthi, N. Gupta, B. Samanta, S. Dave, S. Sarawagi, and P. Talukdar, "Bootstrapping multilingual semantic parsers using large language models," in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023. [11] X. Bai, Y. Chen, and Y. Zhang, "Graph pre-training for AMR parsing and generation," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022. [12] L. Banarescu, C. Bonial, S.
Cai, et al., “Abstract meaning representation for sembanking,” in Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, LAW-ID@ACL 2013, August 8-9, 2013, Sofia, Bulgaria, 2013. [13] L. Bandarkar, D. Liang, B. Muller, et al., “The belebele benchmark: A parallel reading comprehension dataset in 122 language variants,” arXiv preprint arXiv:2308.16884, 2023. [14] M. Bevilacqua, R. Blloshmi, and R. Navigli, “One SPRING to rule them both: Symmetric AMR semantic parsing and generation without a complex pipeline,” in ThirtyFifth AAAI Conference on Artificial Intelligence (AAAI), 2021. [15] M. A. Bornea, L. Pan, S. Rosenthal, R. Florian, and A. Sil, “Multilingual transfer learning for QA using translation as data augmentation,” in Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, 2021. [16] J. A. Botha, Z. Shan, and D. Gillick, “Entity linking in 100 languages,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, 2020. [17] T. B. Brown, B. Mann, N. Ryder, et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020 (NeurIPS), 2020. [18] R. C. Bunescu and M. Pasca, “Using encyclopedic knowledge for named entity disambiguation,” in EACL 2006, 11st Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, April 3-7, 2006, Trento, Italy, 2006. [19] N. D. Cao, G. Izacard, S. Riedel, and F. Petroni, “Autoregressive entity retrieval,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021. 105 [20] N. D. Cao, L. Wu, K. Popat, et al., “Multilingual autoregressive entity linking,” arXiv preprint arXiv:2103.12528, 2021. [21] Y. Chen, C. Jiang, A. Ritter, and W. Xu, “Frustratingly easy label projection for cross-lingual transfer,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023. [22] Y. Chen, L. Xu, K. Liu, D. Zeng, and J. Zhao, “Event extraction via dynamic multipooling convolutional neural networks,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 2015. [23] A. Chowdhery, S. Narang, J. Devlin, et al., “Palm: Scaling language modeling with pathways,” 2023. [24] X. Cong, S. Cui, B. Yu, T. Liu, Y. Wang, and B. Wang, “Few-shot event detection with prototypical amortized conditional random field,” in Findings of the Association for Computational Linguistics: ACL/IJCNLP, 2021. [25] A. Conneau, K. Khandelwal, N. Goyal, et al., “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. [26] S. Cucerzan and D. Yarowsky, “Language independent named entity recognition combining morphological and contextual evidence,” in Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP 1999, College Park, MD, USA, June 21-22, 1999, 1999. [27] M. 
Delmas, M. Wysocka, and A. Freitas, “Relation extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach,” arXiv preprint arXiv:2311.06364, 2023. [28] S. Deng, N. Zhang, J. Kang, Y. Zhang, W. Zhang, and H. Chen, “Meta-learning with dynamic-memory-based prototypical network for few-shot event detection,” in The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM), 2020. [29] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2- 7, 2019, Volume 1 (Long and Short Papers), 2019. 106 [30] G. R. Doddington, A. Mitchell, M. A. Przybocki, L. A. Ramshaw, S. M. Strassel, and R. M. Weischedel, “The automatic content extraction (ACE) program - tasks, data, and evaluation,” in Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), 2004. [31] H. Dong, J. Chen, Y. He, Y. Liu, and I. Horrocks, “Reveal the unknown: Out-ofknowledge-base mention discovery with entity linking,” in Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, 2023. [32] Z.-Y. Dou and G. Neubig, “Word alignment by fine-tuning embeddings on parallel corpora,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021. [33] X. Du and C. Cardie, “Event extraction by answering (almost) natural questions,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [34] X. Du, S. Li, and H. Ji, “Dynamic global memory for document-level argument extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022. [35] X. Du, A. M. Rush, and C. Cardie, “GRIT: generative role-filler transformers for document-level event entity extraction,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021. [36] X. Du, A. M. Rush, and C. Cardie, “Template filling with generative transformers,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021. [37] A. Dunn, J. Dagdelen, N. Walker, et al., “Structured information extraction from complex scientific text with fine-tuned large language models,” Nat Commun, 2024. [38] C. Dyer, V. Chahuneau, and N. A. Smith, “A simple, fast, and effective reparameterization of IBM model 2,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013. [39] H. Fei, M. Zhang, and D. Ji, “Cross-lingual semantic role labeling with high-quality translated training corpus,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. 107 [40] R. Feldman and J. Sanger, The Text Mining Handbook - Advanced Approaches in Analyzing Unstructured Data. 2007. [41] S. Y. Feng, V. Gangal, J. 
Wei, et al., “A survey of data augmentation approaches for NLP,” in Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, 2021. [42] S. Fincke, S. Agarwal, S. Miller, and E. Boschee, “Language model priming for crosslingual event extraction,” in Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, 2022. [43] M. Francis-Landau, G. Durrett, and D. Klein, “Capturing semantic similarity for entity linking with convolutional neural networks,” in NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, 2016. [44] J. Gui, T. Chen, J. Zhang, et al., “A survey on self-supervised learning: Algorithms, applications, and future trends,” arXiv preprint arXiv:2301.05712, 2023. [45] B. Guo, X. Zhang, Z. Wang, et al., “How close is chatgpt to human experts? comparison corpus, evaluation, and detection,” arXiv preprint arXiv:2301.07597, 2023. [46] N. Gupta, S. Singh, and D. Roth, “Entity linking via joint encoding of types, descriptions, and context,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, 2017. [47] R. Han, T. Peng, C. Yang, B. Wang, L. Liu, and X. Wan, “Is information extraction solved by chatgpt? an analysis of performance, evaluation criteria, robustness and errors,” arXiv preprint arXiv:2305.14450, 2023. [48] R. Han, I. Hsu, J. Sun, et al., “ESTER: A machine reading comprehension dataset for reasoning about event semantic relations,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, 2021. [49] R. Han, I. Hsu, M. Yang, A. Galstyan, R. M. Weischedel, and N. Peng, “Deep structured neural network for event temporal relation extraction,” in Proceedings of the 23rd Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, 2019. 108 [50] R. Han, Q. Ning, and N. Peng, “Joint event and temporal relation extraction with shared representations and structured prediction,” in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019. [51] L. Hennig, P. Thomas, and S. M¨oller, “MultiTACRED: A multilingual version of the TAC relation extraction dataset,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 3785–3801. doi: 10.18653/v1/2023.acl- long.210. [Online]. Available: https://aclanthology. org/2023.acl-long.210. [52] J. R. Hobbs, “Information extraction from biomedical text,” J. Biomed. Informatics, 2002. [53] I.-H. Hsu, X. Guo, P. Natarajan, and N. Peng, “Discourse-level relation extraction via graph pooling,” arXiv preprint arXiv:2101.00124, 2021. [54] I.-H. Hsu, K.-H. Huang, E. Boschee, et al., “DEGREE: A data-efficient generationbased event extraction model,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022. [55] I.-H. Hsu, K.-H. Huang, S. 
Zhang, et al., “TAGPRIME: A unified framework for relational structure extraction,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. [56] I. Hsu, A. Jaiswal, and P. Natarajan, “NIESR: nuisance invariant end-to-end speech recognition,” in Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, 2019. [57] I.-H. Hsu, A. Ray, S. Garg, N. Peng, and J. Huang, “Code-switched text synthesis in unseen language pairs,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 5137–5151. doi: 10.18653/v1/2023.findings- acl.318. [Online]. Available: https://aclanthology.org/2023.findings-acl.318. [58] I. Hsu, Z. Xie, K. Huang, P. Natarajan, and N. Peng, “AMPERE: amr-aware prefix for generation-based event argument extraction model,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 2023. [59] I. Hsu, Z. Xue, N. Pochh, et al., “Argument-aware approach to event linking,” arXiv preprint arXiv:2403.15097, 2024. 109 [60] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson, “XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation,” in Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. [61] J. Y. Huang, B. Li, J. Xu, and M. Chen, “Unified semantic typing with meaningful label inference,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022. [62] K. Huang, W. U. Ahmad, N. Peng, and K. Chang, “Improving zero-shot cross-lingual transfer learning via robust training,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. [63] K.-H. Huang, I.-H. Hsu, P. Natarajan, K.-W. Chang, and N. Peng, “Multilingual generative language models for zero-shot cross-lingual event argument extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022. [64] K. Huang, I. Hsu, T. Parekh, et al., “TextEE: Benchmark, reevaluation, reflections, and future challenges in event extraction,” arXiv preprint arXiv:2311.09562, 2023. [65] K. Huang, V. Iyer, I. Hsu, A. Kumar, K. Chang, and A. Galstyan, “Paraamr: A large-scale syntactically diverse paraphrase dataset by AMR back-translation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 2023. [66] K.-H. Huang and N. Peng, “Document-level event extraction with efficient end-to-end learning of cross-event dependencies,” in The 3rd Workshop on Narrative Understanding (NAACL 2021), 2021. [67] K.-H. Huang, S. Tang, and N. Peng, “Document-level entity-based extraction as template generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. [68] K.-H. Huang, M. Yang, and N. Peng, “Biomedical event extraction with hierarchical knowledge graphs,” in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, short, 2020. [69] L. Huang, T. Cassidy, X. 
Feng, et al., “Liberal event extraction and event schema induction,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016. 110 [70] L. Huang, H. Ji, K. Cho, I. Dagan, S. Riedel, and C. R. Voss, “Zero-shot transfer learning for event extraction,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018. [71] X. Huang, J. May, and N. Peng, “What matters for neural cross-lingual named entity recognition: An empirical analysis,” in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), short, 2019. [72] H. Ji and R. Grishman, “Refining event extraction through cross-document inference,” in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL), 2008. [73] Z. Ji, N. Lee, R. Frieske, et al., “Survey of hallucination in natural language generation,” arXiv preprint arXiv:2202.03629, 2022. [74] W. Jiao, W. Wang, J. Huang, X. Wang, and Z. Tu, “Is chatgpt A good translator? A preliminary study,” arXiv preprint arXiv:2301.08745, 2023. [75] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations (ICLR), 2017. [76] I. Konstas, S. Iyer, M. Yatskar, Y. Choi, and L. Zettlemoyer, “Neural AMR: sequenceto-sequence models for parsing and generation,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 2017. [77] D. M. Le, Y. Chen, A. Ritter, and W. Xu, “Constrained decoding for cross-lingual label projection,” arXiv preprint arXiv:2402.03131, 2024. [78] K. Lee, K. Yoon, S. Park, and S.-w. Hwang, “Semi-supervised training data generation for multilingual question answering,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018. [79] M. Lewis, Y. Liu, N. Goyal, et al., “BART: denoising sequence-to-sequence pretraining for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020. [80] P. Lewis, B. Oguz, R. Rinott, S. Riedel, and H. Schwenk, “MLQA: Evaluating crosslingual extractive question answering,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. 111 [81] B. Li, G. Fang, Y. Yang, et al., “Evaluating chatgpt’s information extraction capabilities: An assessment of performance, explainability, calibration, and faithfulness,” arXiv preprint arXiv:2304.11633, 2023. [82] F. Li, W. Peng, Y. Chen, et al., “Event extraction as multi-turn question answering,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings (EMNLP), 2020. [83] Q. Li, H. Ji, and L. Huang, “Joint event extraction via structured prediction with global features,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), 2013. [84] S. Li, H. Ji, and J. Han, “Document-level event argument extraction by conditional generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021. [85] X. L. Li and P. 
Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, 2021. [86] Y. Lin, H. Ji, F. Huang, and L. Wu, “A joint neural model for information extraction with global features,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. [87] X. Ling and D. S. Weld, “Fine-grained entity recognition,” in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada, 2012. [88] J. Liu, Y. Chen, K. Liu, W. Bi, and X. Liu, “Event extraction as machine reading comprehension,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [89] J. Liu, Y. Chen, K. Liu, and J. Zhao, “Neural cross-lingual event detection with minimal parallel resources,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. [90] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” arXiv preprint arXiv:2107.13586, 2021. [91] X. Liu, H. Huang, G. Shi, and B. Wang, “Dynamic prefix-tuning for generative template-based event extraction,” in Proceedings of the 60th Annual Meeting of the 112 Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022. [92] X. Liu, K. Ji, Y. Fu, et al., “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022. [93] Y. Liu, J. Gu, N. Goyal, et al., “Multilingual denoising pre-training for neural machine translation,” Trans. Assoc. Comput. Linguistics, vol. 8, pp. 726–742, 2020. [94] Y. Liu, M. Ott, N. Goyal, et al., “Roberta: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019. [95] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019. [96] K. Lu, I. Hsu, W. Zhou, M. D. Ma, and M. Chen, “Multi-hop evidence retrieval for cross-document relation extraction,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023. [97] K. Lu, I.-H. Hsu, W. Zhou, M. D. Ma, and M. Chen, “Summarization as indirect supervision for relation extraction,” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 6575–6594. doi: 10 . 18653 / v1 / 2022 . findings - emnlp . 490. [Online]. Available: https : / / aclanthology . org / 2022 . findings-emnlp.490. [98] Y. Lu, H. Lin, J. Xu, et al., “Text2event: Controllable sequence-to-structure generation for end-to-end event extraction,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP), 2021. [99] Y. Lu, Q. Liu, D. 
Dai, et al., “Unified structure generation for universal information extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022. [100] J. Ma, S. Wang, R. Anubhai, M. Ballesteros, and Y. Al-Onaizan, “Resource-enhanced neural model for event argument extraction,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, 2020. [101] Y. Ma, Z. Wang, Y. Cao, et al., “Prompt for extraction? PAIE: prompting argument interaction for event argument extraction,” in Proceedings of the 60th Annual Meet113 ing of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022. [102] T. Meng, N. Peng, and K.-W. Chang, “Target language-aware constrained inference for cross-lingual dependency parsing,” in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019. [103] R. Mihalcea and A. Csomai, “Wikify!: Linking documents to encyclopedic knowledge,” in Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, Lisbon, Portugal, November 6-10, 2007, 2007. [104] M. Moradshahi, G. Campagna, S. Semnani, S. Xu, and M. Lam, “Localizing openontology QA semantic parsers in a day using machine translation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [105] K. Mrini, S. Nie, J. Gu, S. Wang, M. Sanjabi, and H. Firooz, “Detection, disambiguation, re-ranking: Autoregressive entity linking as a multi-task problem,” in Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022. [106] M. Nagata, K. Chousa, and M. Nishino, “A supervised word alignment method based on cross-language span prediction using multilingual BERT,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [107] M. Nguyen, N. H. Son, and L. T. Linh, “Gain more with less: Extracting information from business documents with small data,” Expert Syst. Appl., 2023. [108] M. V. Nguyen and T. H. Nguyen, “Improving cross-lingual transfer for event argument extraction with language-universal sentence structures,” in Proceedings of the Sixth Arabic Natural Language Processing Workshop, 2021. [109] T. H. Nguyen, K. Cho, and R. Grishman, “Joint event extraction via recurrent neural networks,” in The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2016. [110] T. H. Nguyen and R. Grishman, “Event detection and domain adaptation with convolutional neural networks,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015. [111] T. M. Nguyen and T. H. Nguyen, “One for all: Neural joint modeling of entities and events,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, 2019. 114 [112] J. Ni, G. Dinu, and R. Florian, “Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017. [113] J. Ni and R. 
Florian, “Neural cross-lingual relation extraction based on bilingual word embedding mapping,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. [114] J. Nothman, M. Honnibal, B. Hachey, and J. R. Curran, “Event linking: Grounding event reference in a news archive,” in The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 2: Short Papers, 2012. [115] M. Ocal, A. Radas, J. Hummer, K. Megerdoomian, and M. A. Finlayson, “A comprehensive evaluation and correction of the timebank corpus,” in Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, 2022. [116] F. J. Och and H. Ney, “A systematic comparison of various statistical alignment models,” Computational Linguistics, 2003. [117] J. Ou, A. Pratapa, R. Gupta, and T. Mitamura, “Hierarchical event grounding,” in Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, ThirtyFifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, 2023. [118] X. Pan, B. Zhang, J. May, J. Nothman, K. Knight, and H. Ji, “Cross-lingual name tagging and linking for 282 languages,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017. [119] G. Paolini, B. Athiwaratkun, J. Krone, et al., “Structured prediction as translation between augmented natural languages,” in 9th International Conference on Learning Representations (ICLR), 2021. [120] T. Parekh, I.-H. Hsu, K.-H. Huang, K.-W. Chang, and N. Peng, “Contextual label projection for cross-lingual structured prediction,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024. [121] T. Parekh, I.-H. Hsu, K.-H. Huang, K.-W. Chang, and N. Peng, “GENEVA: Benchmarking generalizability for event argument extraction with hundreds of event types 115 and argument roles,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. [122] H. Peng, A. P. Parikh, M. Faruqui, B. Dhingra, and D. Das, “Text generation with exemplar-based adaptive decoding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019. [123] H. Peng, X. Wang, F. Yao, et al., “Omnievent: A comprehensive, fair, and easy-to-use toolkit for event understanding,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, 2023. [124] M. E. Peters, M. Neumann, M. Iyyer, et al., “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACLHLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), 2018. [125] R. Pradeep, S. Sharifymoghaddam, and J. Lin, “Rankvicuna: Zero-shot listwise document reranking with open-source large language models,” arXiv preprint arXiv:2309.15088, 2023. [126] A. Pratapa, R. Gupta, and T. 
Mitamura, “Multilingual event linking to Wikidata,” in Proceedings of the Workshop on Multilingual Information Access (MIA), 2022. [127] J. Pustejovsky, P. Hanks, R. Sauri, et al., “The timebank corpus,” in Corpus linguistics, 2003. [128] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning, “Stanza: A python natural language processing toolkit for many human languages,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL), A. Celikyilmaz and T. Wen, Eds., 2020. [129] G. Qin and J. Eisner, “Learning how to ask: Querying lms with mixtures of soft prompts,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACLHLT), 2021. [130] C. Raffel, N. Shazeer, A. Roberts, et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., 2020. [131] A. Rahimi, Y. Li, and T. Cohn, “Massively multilingual transfer for NER,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. 116 [132] S. Ruder, N. Constant, J. Botha, et al., “XTREME-R: towards more challenging and nuanced multilingual evaluation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021. [133] E. F. T. K. Sang, “Introduction to the conll-2002 shared task: Language-independent named entity recognition,” in Proceedings of the 6th Conference on Natural Language Learning, CoNLL 2002, Held in cooperation with COLING 2002, Taipei, Taiwan, 2002, 2002. [134] T. L. Scao, A. Fan, C. Akiki, et al., “BLOOM: A 176b-parameter open-access multilingual language model,” arXiv preprint arXiv:2211.05100, [135] T. L. Scao and A. M. Rush, “How many data points is a prompt worth?” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021. [136] T. Schick and H. Sch¨utze, “Exploiting cloze-questions for few-shot text classification and natural language inference,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021. [137] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointergenerator networks,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan, Eds., 2017. [138] L. Sha, F. Qian, B. Chang, and Z. Sui, “Jointly extracting event triggers and arguments by dependency-bridge RNN and tensor-based argument interaction,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018. [139] K. Shaalan, “A survey of arabic named entity recognition and classification,” Comput. Linguistics, 2014. [140] S. Shen, T. Wu, G. Qi, Y. Li, G. Haffari, and S. Bi, “Adaptive knowledge-enhanced bayesian meta-learning for few-shot event detection,” in Findings of the Association for Computational Linguistics: ACL/IJCNLP, 2021. [141] E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng, “Towards controllable biases in language generation,” in the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)-Findings, long, 2020. [142] T. Shin, Y. Razeghi, R. L. L. IV, E. Wallace, and S. 
Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. 117 [143] Z. Song, A. Bies, S. M. Strassel, et al., “From light to rich ERE: annotation of entities, relations, and events,” in Proceedings of the The 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation, (EVENTS@HLP-NAACL), 2015. [144] E. Stengel-Eskin, T.-r. Su, M. Post, and B. Van Durme, “A discriminative neural model for cross-lingual word alignment,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. [145] A. Subburathinam, D. Lu, H. Ji, et al., “Cross-lingual structure transfer for relation and event extraction,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. [146] B. M. Sundheim, “Overview of the fourth Message Understanding Evaluation and Conference,” in Fourth Message Understanding Conference (MUC-4): Proceedings of a Conference Held in McLean, Virginia, June 16-18, 1992, 1992. [147] Y. Tang, C. Tran, X. Li, et al., “Multilingual translation with extensible multilingual pretraining and finetuning,” arXiv preprint arXiv:2008.00401, 2020. [148] E. F. Tjong Kim Sang, “Introduction to the CoNLL-2002 shared task: Languageindependent named entity recognition,” in COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002), 2002. [149] E. F. Tjong Kim Sang and F. De Meulder, “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003. [150] H. Touvron, T. Lavril, G. Izacard, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. [151] H. Touvron, L. Martin, K. Stone, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023. [152] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017 (NeurIPS), 2017. [153] A. P. B. Veyseh, T. N. Nguyen, and T. H. Nguyen, “Graph transformer networks with syntactic and semantic structures for event argument extraction,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, 2020. 118 [154] E. Voita, P. Serdyukov, R. Sennrich, and I. Titov, “Context-aware neural machine translation learns anaphora resolution,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018. [155] D. Wadden, U. Wennberg, Y. Luan, and H. Hajishirzi, “Entity, relation, and event extraction with contextualized span representations,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. [156] S. Wang, M. Yu, S. Chang, L. Sun, and L. Huang, “Query and extract: Refining event extraction as type-oriented binary decoding,” in Findings of the Association for Computational Linguistics: ACL 2022, 2022. [157] X. Wang, Z. Wang, X. 
Han, et al., “HMEAE: hierarchical modular event argument extraction,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. [158] X. Wang, Z. Wang, X. Han, et al., “MAVEN: A Massive General Domain Event Detection Dataset,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [159] J. Wei, M. Bosma, V. Y. Zhao, et al., “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021. [160] G. I. Winata, A. Madotto, Z. Lin, R. Liu, J. Yosinski, and P. Fung, “Language models are few-shot multilingual learners,” arXiv preprint arXiv:2109.07684, 2021. [161] T. Wolf, L. Debut, V. Sanh, et al., “Huggingface’s transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020. [162] K. Wong, S. Maruf, and G. Haffari, “Contextual neural machine translation improves translation of cataphoric pronouns,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. [163] L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer, “Scalable zero-shot entity linking with dense entity retrieval,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [164] Y. Wu, M. Schuster, Z. Chen, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016. 119 [165] Z. Xiao, M. Gong, J. Wu, X. Zhang, L. Shou, and D. Jiang, “Instructed language models with retrievers are powerful entity linkers,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, 2023. [166] H. Xu, S. Ebner, M. Yarmohammadi, A. S. White, B. V. Durme, and K. W. Murray, “Gradual fine-tuning for low-resource domain adaptation,” arXiv preprint arXiv:2103.02205, 2021. [167] R. Xu, P. Wang, T. Liu, S. Zeng, B. Chang, and Z. Sui, “A two-stream amr-enhanced model for document-level event argument extraction,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, 2022. [168] W. Xu, B. Haider, and S. Mansour, “End-to-end slot alignment and recognition for cross-lingual NLU,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. [169] Z. Xu and L. Huang, “Improve event extraction via self-training with gradient guidance,” arXiv preprint arXiv:2205.12490, 2022. [170] L. Xue, N. Constant, A. Roberts, et al., “Mt5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021. [171] V. Yadav and S. Bethard, “A survey on recent advances in named entity recognition from deep learning models,” in Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20- 26, 2018, 2018. [172] H. Yan, T. Gui, J. Dai, Q. Guo, Z. Zhang, and X. 
Qiu, “A unified generative framework for various NER subtasks,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP), 2021. [173] B. Yang and T. M. Mitchell, “Joint extraction of events and entities within a document context,” in The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2016. [174] S. Yang, D. Feng, L. Qiao, Z. Kan, and D. Li, “Exploring pre-trained language models for event extraction and generation,” in Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), 2019. 120 [175] S. Yang, D. Feng, L. Qiao, Z. Kan, and D. Li, “Exploring pre-trained language models for event extraction and generation,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, 2019. [176] Y. Yao, S. Mao, X. Chen, N. Zhang, S. Deng, and H. Chen, “Schema-aware reference as prompt improves data-efficient relational triple and event extraction,” arXiv preprint arXiv:2210.10709, 2022. [177] M. Yarmohammadi, S. Wu, M. Marone, et al., “Everything is all it takes: A multipronged strategy for zero-shot cross-lingual information extraction,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. [178] D. Yarowsky, G. Ngai, and R. Wicentowski, “Inducing multilingual text analysis tools via robust projection across aligned corpora,” in Proceedings of the First International Conference on Human Language Technology Research, 2001. [179] P. Yu, J. May, and H. Ji, “Bridging the gap between native text and translated text through adversarial learning: A case study on cross-lingual event extraction,” in Findings of the Association for Computational Linguistics: EACL 2023, Dubrovnik, Croatia: Association for Computational Linguistics, May 2023, pp. 754–769. [Online]. Available: https://aclanthology.org/2023.findings-eacl.57. [180] X. Yu, W. Yin, N. Gupta, and D. Roth, “Event linking: Grounding event mentions to wikipedia,” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2- 6, 2023, 2023. [181] R. Zevallos, J. Ortega, W. Chen, et al., “Introducing QuBERT: A large monolingual corpus and BERT model for Southern Quechua,” in Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, 2022. [182] W. Zhang, W. Hua, and K. Stratos, “Entqa: Entity linking as question answering,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022. [183] Z. Zhang and H. Ji, “Abstract meaning representation guided graph encoding and decoding for joint information extraction,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, 2021. [184] Y. Zhou, C. Guo, X. Wang, Y. Chang, and Y. Wu, “A survey on data augmentation in large model era,” arXiv preprint arXiv:2401.15422, 2024. 121 [185] F. Zhu, J. Yu, H. Jin, L. Hou, J. Li, and Z. Sui, “Learn to not link: Exploring NIL prediction in entity linking,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023. [186] T. Zhu, X. 
Qu, W. Chen, et al., "Efficient document-level event extraction via pseudo-trigger-aware pruned complete graph," in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, 2022. [187] W. Zhu, H. Liu, Q. Dong, et al., "Multilingual machine translation with large language models: Empirical results and analysis," arXiv preprint arXiv:2304.04675, 2023. [188] B. Zou, Z. Xu, Y. Hong, and G. Zhou, "Adversarial feature adaptation for cross-lingual relation classification," in Proceedings of the 27th International Conference on Computational Linguistics (COLING), 2018.

Appendix A
Appendix for Degree

A.1 EAE Template Construction for Degree

Our strategy for creating an EAE template first identifies all valid argument roles for the event type, such as the Attacker, Target, Instrument, and Place roles. (The valid roles for each event type are predefined in the event ontology of each dataset or can be decided by the user of interest.) Then, for each argument role, we select natural and fluent words that reflect the semantics of the role type to form its placeholder (e.g., "some way" for Instrument). This design provides a simple way to help the model learn both the roles' label semantics and the event structure. Finally, we create a natural-language sentence that connects all these placeholders. Note that we keep each template as simple and short as possible. Table A.1 lists all designed EAE templates for ACE05-E and ACE05-E+; the EAE templates for ERE-EN can be found in Table A.3.

A.2 Training Details of Degree

Given a passage, its annotated event types are considered positive event types. During training, we additionally sample m event types that are not related to the passage as negative examples, where m is a hyper-parameter; in our experiments, m is usually set to 13 or 15. For all of Degree, Degree(ED), and Degree(EAE), we fine-tune the pre-trained BART-large [79] with the Huggingface package [161]. The number of parameters is around 406 million. We train Degree on our machine, which is equipped with 128 AMD EPYC 7452 32-Core Processor cores, 4 NVIDIA A100 GPUs, and 792G RAM. We use the AdamW optimizer [95] with the learning rate set to 1e-5 and the weight decay set to 1e-5. We set the batch size to 6 for Degree(EAE) and 32 for Degree(ED) and Degree. The number of training epochs is 45. It takes around 2 hours, 18 hours, and 22 hours to train Degree(EAE), Degree(ED), and Degree, respectively. We perform a hyper-parameter search on m, the number of negative examples, over {3, 5, 7, 10, 13, 15, 18, 21}; our preliminary trials show that values of m below 10 are usually less useful. For the learning rate and the weight decay, we tune them based on our preliminary experiments for event argument extraction over {1e-5, 1e-4}, and both are fixed to 1e-5 for all experiments.
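As a concrete illustration of the negative event-type sampling described above, the following minimal sketch pairs a passage with m event types that are absent from its annotations. The function and variable names are hypothetical, and the toy ontology here is only a stand-in for the actual ACE/ERE ontologies.

import random

def sample_negative_event_types(positive_types, ontology_types, m, seed=0):
    # Keep only event types that are NOT annotated in the passage, then
    # randomly pick m of them to serve as negative training examples.
    rng = random.Random(seed)
    candidates = [t for t in ontology_types if t not in set(positive_types)]
    return rng.sample(candidates, min(m, len(candidates)))

# Toy ontology; the real ontologies contain 33 (ACE05-E/E+) or 38 (ERE-EN) event types.
ontology = ["Conflict:Attack", "Life:Die", "Movement:Transport",
            "Contact:Meet", "Justice:Sue", "Personnel:Elect"]
print(sample_negative_event_types(["Conflict:Attack"], ontology, m=3))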
A.3 Datasets

We consider ACE 2005 [30] (https://catalog.ldc.upenn.edu/LDC2006T06) and ERE [143] (https://catalog.ldc.upenn.edu/LDC2020T19). Both are distributed under the LDC User Agreement for Non-Members (https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf). Both datasets were created for entity, relation, and event extraction, while our focus in this paper is only event extraction. The original ACE 2005 dataset contains data for English, Chinese, and Arabic, and we only take the English data for our experiments. The original ERE dataset contains data for English and Chinese, and we likewise only take the English data. Because both datasets contain events like Justice:Execute and Life:Die, some offensive words (e.g., "killed") may appear in the passages. Some real names may appear as well (e.g., the Palestinian president, Mahmoud Abbas). Accurately identifying this kind of information is part of the goal of the task, so we do not modify the datasets for protection or anonymization. We split the training data based on documents, which is a more realistic setup than splitting by instance. Table A.5 lists the statistics of ACE05-E, ACE05-E+, and ERE-EN. Specifically, we try to make each proportion of the data contain as many event types as possible.

A.4 Implementation Details

This section describes the implementation details for all the baselines we use. We run the experiments with three different random seeds and report the best value.

• DyGIE++: we use their released pre-trained model (https://github.com/dwadden/dygiepp) for evaluation.
• OneIE: we use their provided code (http://blender.cs.illinois.edu/software/oneie/) to train the model with default parameters.
• BERT QA: we use their provided code (https://github.com/xinyadu/eeqa) to train the model with default parameters.
• TANL: we use their provided code (https://github.com/amazon-research/tanl) to train the model. We conduct the experiments with two variations: (1) using their default parameters, and (2) using their default parameters but with more training epochs. We observe that the second variant works better; as a result, we report the numbers obtained from the second setting.
• Text2Event: we use their official code (https://github.com/luyaojie/Text2Event) to train the model with the provided parameter setting.
• dbRNN: we directly report the experimental results from their paper.
• Joint3EE: we directly report the experimental results from their paper.
• MQAEE: we directly report the experimental results from their paper.
• BART-Gen: we report the experimental results from their released appendix (https://github.com/raspberryice/gen-arg/blob/main/NAACL_2021_Appendix.pdf).

Event Type                     EAE Template
Life:Be-Born                   somebody was born in somewhere.
Life:Marry                     somebody got married in somewhere.
Life:Divorce                   somebody divorced in somewhere.
Life:Injure                    somebody or some organization led to some victim injured by some way in somewhere.
Life:Die                       somebody or some organization led to some victim died by some way in somewhere.
Movement:Transport             something was sent to somewhere from some place by some vehicle. somebody or some organization was responsible for the transport.
Transaction:Transfer-Ownership someone got something from some seller in somewhere.
Transaction:Transfer-Money     someone paid some other in somewhere.
Business:Start-Org             somebody or some organization launched some organization in somewhere.
Business:Merge-Org             some organization was merged.
Business:Declare-Bankruptcy    some organization declared bankruptcy.
Business:End-Org               some organization dissolved.
Conflict:Attack                some attacker attacked some facility, someone, or some organization by some way in somewhere.
Conflict:Demonstrate           some people or some organization protest at somewhere.
Contact:Meet                   some people or some organization met at somewhere.
Contact:Phone-Write            some people or some organization called or texted messages at somewhere.
Personnel:Start-Position       somebody got new job and was hired by some people or some organization in somewhere.
Personnel:End-Position         somebody stopped working for some people or some organization at somewhere.
Personnel:Nominate             somebody was nominated by somebody or some organization to do a job.
Personnel:Elect                somebody was elected a position, and the election was voted by some people or some organization in somewhere.

Table A.1: All EAE templates for ACE05-E and ACE05-E+ for Degree.

Event Type                     EAE Template
Justice:Arrest-Jail            somebody was sent to jailed or arrested by somebody or some organization in somewhere.
Justice:Release-Parole         somebody was released by some people or some organization from somewhere.
Justice:Trial-Hearing          somebody, prosecuted by some other, faced a trial in somewhere. The hearing was judged by some adjudicator.
Justice:Charge-Indict          somebody was charged by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Sue                    somebody was sued by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Convict                somebody was convicted of a crime in somewhere. The adjudication was judged by some adjudicator.
Justice:Sentence               somebody was sentenced to punishment in somewhere. The adjudication was judged by some adjudicator.
Justice:Fine                   some people or some organization in somewhere was ordered by some adjudicator to pay a fine.
Justice:Execute                somebody was executed by somebody or some organization at somewhere.
Justice:Extradite              somebody was extradited to somewhere from some place. somebody or some organization was responsible for the extradition.
Justice:Acquit                 somebody was acquitted of the charges by some adjudicator.
Justice:Pardon                 somebody received a pardon from some adjudicator.
Justice:Appeal                 some other in somewhere appealed the adjudication from some adjudicator.

Table A.2: All EAE templates for ACE05-E and ACE05-E+ for Degree (Cont.).

Event Type                     EAE Template
Life:Be-Born                   somebody was born in somewhere.
Life:Marry                     somebody got married in somewhere.
Life:Divorce                   somebody divorced in somewhere.
Life:Injure                    somebody or some organization led to some victim injured by some way in somewhere.
Life:Die                       somebody or some organization led to some victim died by some way in somewhere.
Movement:Transport-Person      somebody was moved to somewhere from some place by some way. somebody or some organization was responsible for the movement.
Movement:Transport-Artifact    something was sent to somewhere from some place. somebody or some organization was responsible for the transport.
Business:Start-Org             somebody or some organization launched some organization in somewhere.
Business:Merge-Org             some organization was merged.
Business:Declare-Bankruptcy    some organization declared bankruptcy.
Business:End-Org               some organization dissolved.
Conflict:Attack                some attacker attacked some facility, someone, or some organization by some way in somewhere.
Conflict:Demonstrate           some people or some organization protest at somewhere.
Contact:Meet                   some people or some organization met at somewhere.
Contact:Correspondence         some people or some organization contacted each other at somewhere.
Contact:Broadcast              some people or some organization made announcement to some publicity at somewhere.
Contact:Contact                some people or some organization talked to each other at somewhere.
Manufacture:Artifact           something was built by somebody or some organization in somewhere.
Personnel:Start-Position       somebody got new job and was hired by some people or some organization in somewhere.
Personnel:End-Position         somebody stopped working for some people or some organization at somewhere.
Personnel:Nominate             somebody was nominated by somebody or some organization to do a job.
Personnel:Elect                somebody was elected a position, and the election was voted by somebody or some organization in somewhere.
Transaction:Transfer-Ownership The ownership of something from someone was transferred to some other at somewhere.
Transaction:Transfer-Money     someone paid some other in somewhere.
Transaction:Transaction        someone give some things to some other in somewhere.

Table A.3: All EAE templates for ERE-EN for Degree.

Event Type                     EAE Template
Justice:Arrest-Jail            somebody was sent to jailed or arrested by somebody or some organization in somewhere.
Justice:Release-Parole         somebody was released by somebody or some organization from somewhere.
Justice:Trial-Hearing          somebody, prosecuted by some other, faced a trial in somewhere. The hearing was judged by some adjudicator.
Justice:Charge-Indict          somebody was charged by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Sue                    somebody was sued by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Convict                somebody was convicted of a crime in somewhere. The adjudication was judged by some adjudicator.
Justice:Sentence               somebody was sentenced to punishment in somewhere. The adjudication was judged by some adjudicator.
Justice:Fine                   some people or some organization in somewhere was ordered by some adjudicator to pay a fine.
Justice:Execute                somebody was executed by somebody or some organization at somewhere.
Justice:Extradite              somebody was extradited to somewhere from some place. somebody or some organization was responsible for the extradition.
Justice:Acquit                 somebody was acquitted of the charges by some adjudicator.
Justice:Pardon                 somebody received a pardon from some adjudicator.
Justice:Appeal                 somebody in somewhere appealed the adjudication from some adjudicator.

Table A.4: All EAE templates for ERE-EN for Degree (cont.).

Dataset     Split          #Docs  #Sents  #Events  #Event Types  #Args  #Arg Types
ACE05-E     Train (full)   529    17172   4202     33            4859   22
            Train (1%)     5      103     47       14            65     16
            Train (2%)     10     250     77       17            104    16
            Train (3%)     15     451     119      23            153    17
            Train (5%)     25     649     212      27            228    21
            Train (10%)    50     1688    412      28            461    21
            Train (20%)    110    3467    823      33            936    22
            Train (30%)    160    5429    1368     33            1621   22
            Train (50%)    260    8985    2114     33            2426   22
            Dev            28     923     450      21            605    22
            Test           40     832     403      31            576    20
ACE05-E+    Train (full)   529    19216   4419     33            6607   22
            Train (1%)     5      92      49       15            75     16
            Train (2%)     10     243     82       19            129    16
            Train (3%)     15     434     124      24            203    19
            Train (5%)     25     628     219      27            297    21
            Train (10%)    50     1915    428      29            629    21
            Train (20%)    110    3834    878      33            1284   22
            Train (30%)    160    6159    1445     33            2212   22
            Train (50%)    260    10104   2231     33            3293   22
            Dev            28     901     468      22            759    22
            Test           40     676     424      31            689    21
ERE-EN      Train (full)   396    14736   6208     38            8924   21
            Train (1%)     4      109     61       14            78     16
            Train (2%)     8      228     128      21            183    19
            Train (3%)     12     419     179      26            272    19
            Train (5%)     20     701     437      31            640    21
            Train (10%)    40     1536    618      37            908    21
            Train (20%)    80     2848    1231     38            1656   21
            Train (30%)    120    4382    1843     38            2632   21
            Train (50%)    200    7690    3138     38            4441   21
            Dev            31     1209    525      34            730    21
            Test           31     1163    551      33            822    21

Table A.5: Dataset statistics. Our experiments are conducted on sentences, which were split from documents.
In the table, "#Docs" denotes the number of documents; "#Sents" the number of sentences; "#Events" the number of events; "#Event Types" the total number of event types; "#Args" the total number of arguments; and "#Arg Types" the total number of argument role types.

Appendix B
Appendix for Ampere

B.1 Implementation Details

This section introduces the implementation details for all the baseline models we use in this paper. Our experiments are run on a machine equipped with 8 NVIDIA A6000 GPUs.

• DyGIE++: we use their official code (https://github.com/dwadden/dygiepp) to reimplement the model. Their original model is built using BERT [29]. As mentioned in Section 4.3.1, we replace the pre-trained language model with RoBERTa-large and tune with learning rates = {1e-5, 2e-5, 3e-5}.
• OneIE: we use their official code (http://blender.cs.illinois.edu/software/oneie/) to train the model. Their original model is built using BERT [29]. As mentioned in Section 4.3.1, we replace the pre-trained language model with RoBERTa-large and tune with learning rates = {1e-5, 2e-5, 3e-5}.
• Query and Extract: we use their official code (https://github.com/VT-NLP/Event_Query_Extract/) to train the argument detection model with learning rate = 1e-5, batch size = 16, and 10 training epochs. Different from the official code, we use RoBERTa-large as the pre-trained language model to ensure a fair comparison.
• AMR-IE: the original AMR-IE is an end-to-end event extraction model, so we adapt their official code (https://github.com/zhangzx-uiuc/AMR-IE) to the event argument extraction task by giving gold triggers during model evaluation. We fix the pre-trained language model learning rate to 1e-5, then search over graph learning rates = {1e-3, 4e-3} and batch sizes = {8, 16}.
• PAIE: we use their official code (https://github.com/mayubo2333/PAIE/) to train the model with the default parameters for BART-large.

B.2 Detailed Result

Table B.1 shows the detailed results of our main experiments. We repeat every experiment setting with three random seeds, and report the average Arg-I and Arg-C F1-scores along with the corresponding standard deviations.
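For concreteness, the per-setting aggregation is a plain mean and standard deviation over the three runs; a minimal sketch (the scores below are made-up placeholders, not values from the table):

```python
import numpy as np

# Hypothetical Arg-C F1 scores from the three random seeds of one setting.
scores = [73.8, 74.5, 73.1]

mean = np.mean(scores)
std = np.std(scores)  # whether a population (ddof=0) or sample (ddof=1) std is used is an assumption
print(f"{mean:.1f}±{std:.2f}")
```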
ACE05-E Development Set (each row: Arg-I / Arg-C F1 at 5%, 10%, 20%, 30%, 50%, and 100% of the training data):
DyGIE++ [155]: 44.6±2.28 34.6±1.83 | 57.3±0.91 48.5±0.35 | 58.9±1.53 52.5±0.85 | 63.0±2.05 57.5±1.34 | 65.4±0.49 57.9±0.59 | 67.2±1.78 60.0±0.35
OneIE [86]: 48.0±2.27 38.6±1.11 | 62.3±0.61 56.0±1.01 | 68.2±0.84 63.2±1.16 | 73.0±1.20 67.6±0.42 | 74.6±0.60 70.4±0.46 | 76.0±1.95 71.8±1.54
Query and Extract [156]: 41.6±2.50 10.5±0.82 | 43.0±2.10 27.7±1.00 | 49.0±3.00 37.6±0.66 | 58.8±1.83 50.0±1.52 | 61.7±3.33 54.6±3.58 | 67.9±1.86 61.7±2.67
AMR-IE [183]: 49.7±1.12 40.0±1.29 | 62.0±0.34 56.4±0.83 | 66.8±0.90 61.3±1.23 | 72.4±1.28 67.4±0.66 | 74.7±1.04 70.6±1.30 | 77.7±0.93 73.1±0.68
PAIE [101]: 55.2±1.16 46.6±0.98 | 64.1±0.88 57.6±1.43 | 70.4±0.49 64.6±1.13 | 74.5±0.38 69.3±0.38 | 75.1±1.89 70.3±1.02 | 78.5±0.65 74.1±0.80
Degree: 47.6±0.64 41.4±0.50 | 65.1±0.75 56.8±0.50 | 69.7±0.50 62.5±0.55 | 75.6±0.43 68.9±0.54 | 75.9±0.57 70.5±0.28 | 78.4±0.38 73.8±0.58
Ampere (AMRBART): 59.9±1.99 52.3±1.54 | 68.5±0.83 61.5±0.82 | 72.0±0.80 67.2±0.55 | 76.5±1.01 71.2±0.56 | 76.5±0.50 72.7±0.83 | 80.0±1.06 75.6±1.10
Ampere (RoBERTa): 62.1±1.73 53.2±2.26 | 68.2±0.39 61.5±1.24 | 71.5±1.00 66.6±0.12 | 76.8±0.37 71.8±0.53 | 76.4±1.01 72.5±0.79 | 80.9±0.60 76.6±0.78

ACE05-E Test Set:
DyGIE++ [155]: 39.2±4.20 29.3±2.63 | 50.5±1.44 42.2±0.85 | 57.7±1.11 49.5±0.75 | 59.9±0.97 53.2±1.38 | 61.0±2.62 54.4±1.10 | 63.6±1.74 57.4±1.87
OneIE [86]: 41.3±1.97 34.6±1.88 | 55.4±2.29 50.0±1.51 | 64.6±2.54 59.6±1.12 | 67.8±1.50 63.0±1.43 | 72.0±0.43 68.3±0.92 | 73.7±0.87 70.7±0.38
Query and Extract [156]: 36.8±3.44 11.0±0.50 | 33.1±5.45 20.9±2.83 | 45.6±0.93 34.3±1.30 | 51.1±3.78 44.3±4.01 | 56.1±4.87 49.6±5.15 | 62.4±2.10 59.1±1.88
AMR-IE [183]: 43.2±1.54 36.8±0.07 | 53.3±1.49 48.5±0.99 | 63.2±0.60 58.3±0.93 | 67.2±1.00 62.6±1.16 | 69.5±1.27 66.1±0.92 | 73.6±0.40 70.3±0.13
PAIE [101]: 52.2±0.83 46.3±0.75 | 62.0±0.96 56.3±0.46 | 67.8±0.33 62.8±0.69 | 71.3±0.54 65.8±0.98 | 72.8±2.34 69.1±2.20 | 75.0±0.51 72.1±0.69
Degree: 47.7±0.09 41.7±0.83 | 63.0±1.45 57.7±1.72 | 64.2±0.57 58.9±1.00 | 70.3±1.16 65.8±1.50 | 71.4±0.26 68.2±0.25 | 75.6±0.79 73.0±0.53
Ampere (AMRBART): 59.8±2.00 52.4±1.53 | 66.0±1.82 61.0±1.58 | 70.2±0.84 66.4±1.04 | 73.3±0.45 69.7±0.41 | 74.4±1.21 71.1±1.17 | 76.0±0.85 73.4±0.58
Ampere (RoBERTa): 61.0±0.98 53.4±0.21 | 67.8±1.13 61.7±0.79 | 70.5±1.28 66.4±0.86 | 73.1±0.43 69.5±0.67 | 74.6±1.03 71.9±0.89 | 76.7±0.75 74.2±0.28

ERE-EN Development Set:
DyGIE++ [155]: 51.8±2.16 42.2±0.68 | 52.9±3.19 45.4±2.65 | 56.8±1.84 49.0±0.58 | 57.3±0.67 50.1±0.96 | 58.8±0.56 51.5±1.47 | 63.8±2.20 56.8±1.93
OneIE [86]: 56.8±3.14 51.4±2.58 | 65.6±0.42 59.5±0.71 | 68.6±0.55 62.0±0.56 | 70.8±0.75 65.5±0.51 | 73.6±0.56 68.5±0.67 | 75.5±0.26 71.2±0.13
Query and Extract [156]: 34.8±6.37 22.0±5.30 | 45.9±1.59 37.3±2.03 | 49.0±4.31 41.2±3.38 | 56.1±1.32 49.4±1.19 | 63.5±1.81 57.0±1.34 | 70.4±2.17 65.0±2.16
AMR-IE [183]: 48.4±1.48 44.8±0.86 | 61.2±0.97 55.2±1.06 | 63.0±1.37 56.9±1.12 | 70.5±0.03 65.2±0.63 | 73.0±0.79 67.6±0.39 | 75.3±1.30 70.1±1.45
Degree: 64.2±0.33 57.2±0.21 | 69.7±0.36 62.5±0.89 | 69.2±0.42 63.9±0.55 | 73.4±0.35 67.1±0.11 | 75.4±0.52 70.2±0.48 | 77.4±0.32 73.3±0.52
Ampere (AMRBART): 69.2±1.64 62.4±1.54 | 72.8±1.12 66.8±1.03 | 71.5±0.51 66.0±0.95 | 74.9±0.65 68.8±0.17 | 76.7±0.33 70.8±0.55 | 78.1±0.69 73.6±1.10
Ampere (RoBERTa): 69.9±0.97 63.1±1.24 | 72.7±0.81 66.7±0.56 | 71.7±0.33 66.6±0.98 | 74.6±0.52 69.7±0.68 | 75.7±0.74 70.6±0.67 | 77.9±0.28 73.8±0.34

ERE-EN Test Set:
DyGIE++ [155]: 53.3±1.95 40.0±1.93 | 52.9±2.59 44.6±2.70 | 55.9±1.74 49.5±1.16 | 59.1±0.64 52.0±1.35 | 60.5±0.92 53.7±0.38 | 63.4±0.80 56.0±0.78
OneIE [86]: 55.5±3.47 49.5±2.24 | 62.1±1.53 56.1±1.62 | 67.9±1.83 62.3±1.62 | 71.9±0.36 66.1±0.73 | 72.3±0.49 67.7±0.43 | 75.2±1.14 70.1±1.96
Query and Extract [156]: 35.1±7.25 19.7±5.12 | 46.7±2.66 34.0±4.06 | 52.1±4.69 42.4±5.07 | 57.7±0.09 50.1±0.86 | 64.5±2.78 57.7±2.80 | 70.4±1.78 64.3±2.26
AMR-IE [183]: 47.8±0.65 44.1±0.46 | 59.1±0.96 53.7±0.58 | 65.8±1.68 60.4±1.22 | 71.4±1.31 65.7±1.45 | 73.9±0.44 68.8±0.29 | 76.5±1.20 71.5±1.34
Degree: 66.4±0.14 57.5±0.36 | 71.2±1.26 63.9±1.38 | 72.3±0.69 67.4±0.56 | 74.1±1.16 69.1±1.44 | 77.4±0.61 73.3±0.74 | 78.2±0.69 74.9±1.10
Ampere (AMRBART): 71.3±0.40 62.9±0.53 | 73.7±0.73 66.7±0.45 | 73.1±0.57 68.5±0.44 | 75.7±0.83 71.3±0.70 | 77.1±0.30 72.5±1.07 | 78.8±0.62 75.4±0.59
Ampere (RoBERTa): 71.4±1.13 63.2±0.57 | 73.8±0.57 67.7±0.66 | 73.6±0.64 68.4±0.40 | 75.4±0.36 70.5±0.17 | 77.0±0.73 72.5±0.68 | 78.4±0.80 75.0±0.77

Table B.1: Argument identification and classification F1-scores (%) under different data proportion settings for the ACE05-E and ERE-EN datasets. The highest scores are in bold and the second-best scores are underlined. The reported numbers are the average of the results from three random runs. The standard deviation (%) of the three runs is also reported in the table.

Appendix C
Appendix for X-Gear

C.1 Dataset Statistics and Data Preprocessing

Table C.1 presents the detailed statistics for the ACE-2005 and ERE datasets. For the English and Chinese splits of ACE-2005, we use the settings provided by [155] and [86], respectively. As for the Arabic part, we adopt the setup proposed by [166]. Observing that some of the sentence breaks produced by [166] are extremely long for pre-trained models to encode, we perform additional preprocessing and postprocessing for the Arabic data. Specifically, we split long Arabic sentences into several portions, each shorter than 80 tokens. Then, during postprocessing, we map the models' predictions on the split portions back to the original sentence (a minimal sketch of this procedure is given after Table C.1).

Dataset | Lang. | Train #Sent. / #Event / #Arg. | Dev #Sent. / #Event / #Arg. | Test #Sent. / #Event / #Arg.
ACE-2005 | en | 17172 / 4202 / 4859 | 923 / 450 / 605 | 832 / 403 / 576
ACE-2005 | ar | 2722 / 1743 / 2506 | 289 / 117 / 174 | 272 / 198 / 287
ACE-2005 | zh | 6305 / 2926 / 5581 | 486 / 217 / 404 | 482 / 190 / 336
ERE | en | 14734 / 6208 / 8924 | 1209 / 525 / 730 | 1161 / 551 / 882
ERE | es | 4582 / 3131 / 4415 | 311 / 204 / 279 | 323 / 255 / 354
Table C.1: Dataset statistics of multilingual ACE-2005 and ERE.
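A minimal sketch of the Arabic splitting-and-remapping procedure described in C.1; the 80-token limit comes from the text above, while the chunking scheme and the span-offset bookkeeping are illustrative assumptions rather than the exact implementation:

```python
def split_into_portions(tokens, max_len=80):
    """Split a long sentence (as a token list) into consecutive portions, each shorter than max_len."""
    size = max_len - 1  # portions are strictly shorter than 80 tokens
    return [(start, tokens[start:start + size]) for start in range(0, len(tokens), size)]

def map_back(portion_predictions):
    """Map span predictions made on portions back to offsets in the original sentence.

    portion_predictions: list of (offset, spans), where spans are (start, end, label)
    tuples relative to the portion they were predicted on.
    """
    merged = []
    for offset, spans in portion_predictions:
        merged.extend((start + offset, end + offset, label) for start, end, label in spans)
    return merged
```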
C.2 Implementation Details

We describe the implementation details for all the models as follows:

• OneIE [86]. We use their provided code (http://blender.cs.illinois.edu/software/oneie/) to train the model with the provided default settings. It is worth mentioning that for the Arabic split of the ACE-2005 dataset, OneIE is trained with only entity extraction, event extraction, and event argument extraction, since there are no relation labels in [166]'s preprocessing script. All other parameters are set to the default values.
• CL-GCN [145]. We refer to the code released by [2] (https://github.com/wasiahmad/GATE) to re-implement the CL-GCN method. Specifically, we adapt the baseline framework described and implemented in OneIE's code [86], but we remove its relation extraction module and add two layers of GCN on top of XLM-RoBERTa-large. The pos-tag and dependency parsing annotations are obtained by applying Stanza [128]. All other parameters are set to be the same as in the training of OneIE.
• GATE [2]. We refer to the officially released code from [2] to re-implement GATE. Similar to CL-GCN, we adapt the baseline framework described and implemented in OneIE's code, but we remove its relation extraction module and add two layers of GATE on top of XLM-RoBERTa-large, mT5, or mBART-50-large. The pos-tag and dependency parsing annotations are also obtained by applying Stanza [128]. The hyper-parameter δ in GATE is set to [2, 2, 4, 4, ∞, ∞, ∞, ∞]. All other parameters are set to be the same as in the training of OneIE.
• TANL [119]. To adapt TANL to zero-shot cross-lingual EAE, we adapt the public code (https://github.com/amazon-research/tanl) and replace its pre-trained base model T5 [130] with mT5-base [170]. All other parameters are set to their default values.
• X-Gear is our proposed model. We consider three different pre-trained generative language models: mBART-50-large [147], mT5-base, and mT5-large [170]. When fine-tuning the pre-trained models, we set the learning rate to 1e-4 for mT5 and 1e-5 for mBART-50-large. The batch size is set to 8. The number of training epochs is 60.

Model | en⇒en | en⇒zh | en⇒ar | ar⇒ar | ar⇒en | ar⇒zh | zh⇒zh | zh⇒en | zh⇒ar | avg (mono.) | avg (cross.) | avg (all)
X-Gear (mBART-50-large) | 68.3 | 48.9 | 37.7 | 59.8 | 30.5 | 29.2 | 63.6 | 45.9 | 32.3 | 63.9 | 37.4 | 46.2
w/ constrained decoding | 68.0 | 49.1 | 37.8 | 59.5 | 30.6 | 29.2 | 59.7 | 47.7 | 31.3 | 62.4 | 37.6 | 45.9
X-Gear (mT5-base) | 67.9 | 53.1 | 42.0 | 66.2 | 27.6 | 30.5 | 69.4 | 52.8 | 32.0 | 67.8 | 39.7 | 49.1
w/ constrained decoding | 67.9 | 53.1 | 42.0 | 66.2 | 27.8 | 30.4 | 66.7 | 53.1 | 33.1 | 67.0 | 39.9 | 48.9
X-Gear (mT5-large) | 71.2 | 54.0 | 44.8 | 68.9 | 32.1 | 33.3 | 68.9 | 55.8 | 33.1 | 69.7 | 42.2 | 51.3
w/ constrained decoding | 71.2 | 54.8 | 45.6 | 68.9 | 32.0 | 33.3 | 66.2 | 57.7 | 35.0 | 68.8 | 43.1 | 51.6
Table C.2: The detailed breakdown results for applying constrained decoding on X-Gear. The avg (mono.) column averages over the values for en ⇒ en, zh ⇒ zh, and ar ⇒ ar. The avg (cross.) column averages over the values for en ⇒ zh, en ⇒ ar, zh ⇒ en, zh ⇒ ar, ar ⇒ en, and ar ⇒ zh.

C.3 Constrained Decoding Detailed Results

Table C.2 shows the detailed results for X-Gear when using a constrained decoding algorithm at testing time. We directly apply the constrained decoding algorithm to the trained models reported in Table 5.1.

Appendix D
Appendix for CLaP

D.1 Data Statistics

We present the extensive data statistics for the ACE and WikiANN datasets used for downstream model evaluation on EAE and NER, respectively. For ACE, we follow the preprocessing used in Chapter 5 to retain 33 event types and 22 argument roles. For WikiANN, we follow the pre-processing steps described in [60], [131]. For ACE, Table D.1 provides statistics about the number of events and arguments for each language. For WikiANN, we present the statistics in Table D.2.

Train (English): 4,202 events, 4,859 arguments
Dev (English): 450 events, 605 arguments
Test (Arabic): 198 events, 287 arguments
Test (Chinese): 190 events, 336 arguments
Table D.1: Data statistics in terms of events and arguments of the ACE dataset for the downstream task of EAE.

Split | Language | #Sentences | #Entities
Train | English (en) | 20,000 | 27,931
Dev | English (en) | 10,000 | 14,146
Test | Afrikaans (af) | 1,000 | 1,487
Test | Arabic (ar) | 10,000 | 11,259
Test | Bulgarian (bg) | 10,000 | 14,060
Test | Bengali (bn) | 1,000 | 1,089
Test | German (de) | 10,000 | 13,868
Test | Greek (el) | 10,000 | 12,163
Test | Spanish (es) | 10,000 | 12,260
Test | Estonian (et) | 10,000 | 13,892
Test | Basque (eu) | 10,000 | 13,459
Test | Farsi (fa) | 10,000 | 10,742
Test | Finnish (fi) | 10,000 | 14,554
Test | French (fr) | 10,000 | 13,369
Test | Hebrew (he) | 10,000 | 13,698
Test | Hindi (hi) | 1,000 | 1,228
Test | Hungarian (hu) | 10,000 | 14,163
Test | Indonesian (id) | 10,000 | 11,447
Test | Italian (it) | 10,000 | 13,749
Test | Japanese (ja) | 10,000 | 13,446
Test | Javanese (jv) | 100 | 117
Test | Georgian (ka) | 10,000 | 13,057
Test | Kazakh (kk) | 1,000 | 1,115
Test | Korean (ko) | 10,000 | 14,423
Test | Malayalam (ml) | 1,000 | 1,204
Test | Marathi (mr) | 1,000 | 1,264
Test | Malay (ms) | 1,000 | 1,115
Test | Burmese (my) | 100 | 119
Test | Dutch (nl) | 10,000 | 13,725
Test | Portuguese (pt) | 10,000 | 12,823
Test | Russian (ru) | 10,000 | 12,177
Test | Swahili (sw) | 1,000 | 1,194
Test | Tamil (ta) | 1,000 | 1,241
Test | Telugu (te) | 1,000 | 1,171
Test | Thai (th) | 10,000 | 16,970
Test | Tagalog (tl) | 1,000 | 1,034
Test | Turkish (tr) | 10,000 | 13,587
Test | Urdu (ur) | 1,000 | 1,020
Test | Vietnamese (vi) | 10,000 | 11,305
Test | Yoruba (yo) | 100 | 111
Test | Chinese (zh) | 10,000 | 12,049
Table D.2: Data statistics in terms of sentences and entities of the WikiANN dataset for the downstream task of NER. # indicates 'number of'.

D.2 Complete Results for Intrinsic Evaluation

D.2.1 Accuracy Evaluation

Accuracy evaluation is done by 5 native bilingual speakers for Chinese, Arabic, Hindi, and Spanish, who rank the translation quality of the translated labels. The native speakers were undergraduate and graduate students who were well-versed in their native languages. We present the complete results as an A/B comparison of the different techniques in terms of their win rates (i.e., the percentage of times A is better than B) in Table D.3.
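A minimal sketch of how such pairwise win/tie rates can be computed from the annotators' rankings; the data layout (one rank dictionary per annotated item) is an illustrative assumption:

```python
from collections import Counter
from itertools import combinations

def pairwise_win_rates(rankings, systems):
    """rankings: one dict per annotated item, mapping system name -> rank (1 = best)."""
    results = {}
    for a, b in combinations(systems, 2):
        counts = Counter()
        for item in rankings:
            if item[a] < item[b]:
                counts["S1"] += 1      # system a ranked better
            elif item[a] > item[b]:
                counts["S2"] += 1      # system b ranked better
            else:
                counts["Tie"] += 1
        results[(a, b)] = {k: 100.0 * v / len(rankings) for k, v in counts.items()}
    return results
```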
We note that CLaP is more accurate than the previous baselines Awesome-align and EasyProject, while being on par with the Independent baseline.

System 1 vs. System 2 | Arabic (S1 / Tie / S2) | Chinese (S1 / Tie / S2) | Hindi (S1 / Tie / S2) | Spanish (S1 / Tie / S2)
CLaP vs. Awesome-align | 36% / 58% / 6% | 45% / 50% / 5% | 20% / 74% / 6% | 12% / 84% / 4%
CLaP vs. EasyProject | 52% / 32% / 16% | 56% / 39% / 5% | 42% / 48% / 10% | 30% / 66% / 4%
CLaP vs. Independent | 18% / 60% / 22% | 12% / 71% / 17% | 18% / 64% / 18% | 24% / 68% / 8%
Independent vs. Awesome-align | 44% / 42% / 14% | 39% / 57% / 4% | 28% / 60% / 12% | 20% / 64% / 16%
Independent vs. EasyProject | 50% / 44% / 6% | 50% / 46% / 4% | 52% / 36% / 12% | 32% / 52% / 16%
Awesome-align vs. EasyProject | 42% / 26% / 32% | 34% / 50% / 16% | 42% / 42% / 16% | 26% / 64% / 10%
Table D.3: A/B comparison of the various label projection techniques for accuracy evaluation with the Google Translation model. Accuracy is measured as the label translation quality judged by native human speakers. Here, S1 = System 1 is better, S2 = System 2 is better, and Tie = similar quality. The better systems are highlighted in bold.

D.2.2 Faithfulness Evaluation

We present the complete results for the faithfulness evaluation per language in Tables D.4 and D.5 for the EAE and NER tasks, respectively. For EAE, CLaP has the best faithfulness, followed by Awesome-align. For NER, Awesome-align and EasyProject are the most faithful.

Techniques | ar | zh | Avg.
Independent | 33 | 38 | 35
Awesome-align | 66 | 83 | 74
EasyProject | 31 | 66 | 48
CLaP | 74 | 85 | 79
Table D.4: Faithfulness evaluation of the various label projection techniques for EAE, as the percentage of times the translated labels were present in the translated input sentence. Numbers are in percentage (%). Higher faithfulness is better and the best techniques are highlighted in bold.
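The faithfulness numbers above are containment checks; a minimal sketch, assuming each projected example pairs the translated sentence with its translated labels (the per-label aggregation granularity is an assumption):

```python
def faithfulness(projected_examples):
    """projected_examples: iterable of (translated_sentence, translated_labels) pairs.

    Returns the percentage of translated labels that appear verbatim in the
    corresponding translated sentence."""
    total = present = 0
    for sentence, labels in projected_examples:
        for label in labels:
            total += 1
            present += label in sentence
    return 100.0 * present / total if total else 0.0
```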
Techniques | af | ar | bg | bn | de | el | es
Independent | 78 | 66 | 67 | 74 | 79 | 57 | 70
Awesome-align | 99 | 95 | 98 | 92 | 99 | 98 | 99
EasyProject | 100 | 98 | 83 | 98 | 97 | 89 | 99
CLaP | 94 | 75 | 63 | 93 | 79 | 46 | 84
Techniques | et | eu | fa | fi | fr | he | hi
Independent | 70 | 64 | 61 | 71 | 71 | 71 | 65
Awesome-align | 98 | 97 | 96 | 99 | 98 | 95 | 93
EasyProject | 97 | 94 | 99 | 98 | 99 | 94 | 36
CLaP | 92 | 91 | 72 | 92 | 74 | 80 | 90
Techniques | hu | id | it | ja | jv | ka | kk
Independent | 68 | 77 | 74 | 68 | 66 | 64 | 56
Awesome-align | 98 | 99 | 99 | 58 | 98 | 95 | 94
EasyProject | 97 | 99 | 98 | 95 | 94 | 99 | 77
CLaP | 93 | 84 | 78 | 67 | 53 | 70 | 85
Techniques | ko | ml | mr | ms | my | nl | pt
Independent | 63 | 57 | 73 | 80 | 53 | 76 | 76
Awesome-align | 96 | 88 | 92 | 99 | 90 | 99 | 97
EasyProject | 93 | 87 | 73 | 98 | 62 | 100 | 99
CLaP | 64 | 88 | 95 | 82 | 55 | 85 | 89
Techniques | ru | sw | ta | te | th | tl | tr
Independent | 59 | 79 | 72 | 76 | 66 | 81 | 76
Awesome-align | 97 | 96 | 91 | 91 | 51 | 99 | 98
EasyProject | 99 | 97 | 91 | 87 | 99 | 99 | 98
CLaP | 66 | 94 | 96 | 90 | 57 | 58 | 94
Techniques | vi | ur | yo | zh | Avg.
Independent | 74 | 74 | 45 | 66 | 69
Awesome-align | 83 | 97 | 92 | 92 | 93
EasyProject | 98 | 94 | 77 | 92 | 92
CLaP | 89 | 91 | 88 | 60 | 79
Table D.5: Faithfulness evaluation of the various label projection techniques for NER, as the percentage of times the translated labels were present in the translated input sentence. Numbers are in percentage (%). Higher faithfulness is better and the best techniques are highlighted in bold.

D.3 Additional Implementation Details

D.3.1 XLM-R for NER base model

XLM-R [25] is used as the downstream model for NER for the extrinsic evaluation of the label projection techniques. We mainly follow the XTREME [60] framework for setting up the task and model. We present the hyperparameter settings for this model in Table D.6. We run experiments for CLaP on an NVIDIA GeForce RTX 2080 Ti machine with support for 8 GPUs.

Base Model: XLM-RoBERTa-large
# Training Epochs: 5
Training Batch Size: 32
Evaluation Batch Size: 32
Learning Rate: 2e-5
Weight Decay: 0
Max Sequence Length: 128
# Accumulation Steps: 1
# Saving Steps: 1000
Table D.6: Hyperparameter details for the NER downstream XLM-R model.

D.3.2 CLaP

We provide a couple of prompt designs used for our model in Figure D.1, along with an illustration for Chinese. We additionally provide a similar template for the chat version of the model (which is used for the experiments with GPT-3.5-turbo reported in Section 6.5.2) in Figure D.2. A minimal sketch of how the text-completion prompt is assembled is given after Figure D.1.

[Figure D.1 shows the text-completion prompt design. The prompt input is "{Target Language} Sentence: {Translated Input Sentence} / For the previous sentence, the words corresponding to '{Source Label}' is:", and the expected prompt output is '{Target Label}'. The figure also illustrates the prompt with Chinese in-context examples.]
Figure D.1: Illustration of the text-completion prompt used for contextual machine translation for our CLaP model.
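Following the template in Figure D.1, a minimal sketch of how the text-completion prompt string can be assembled; the handling of in-context examples here is an illustrative assumption:

```python
def build_completion_prompt(target_language, translated_sentence, source_label, examples=()):
    """Assemble the text-completion prompt from Figure D.1.

    examples: optional (sentence, source_label, target_label) triples used as
    in-context demonstrations in the same format.
    """
    parts = []
    for sent, src, tgt in examples:
        parts.append(
            f"{target_language} Sentence: {sent}\n"
            f"For the previous sentence, the words corresponding to '{src}' is: '{tgt}'"
        )
    parts.append(
        f"{target_language} Sentence: {translated_sentence}\n"
        f"For the previous sentence, the words corresponding to '{source_label}' is:"
    )
    return "\n\n".join(parts)
```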
[Figure D.2 shows the chat-version prompt design. The system prompt is "I have an English sentence and its corresponding translation in {Target Language}. There's an English word been tagged using <<TAG>> <</TAG>> in the given English sentence. I want to know its corresponding {Target Language} translation in the given {Target Language} sentence. For example: {In-context Examples}". The prompt input is "Now do your work: Original English Sentence: {English Input Sentence with label within <<TAG>> <</TAG>>} Translated {Target Language} Sentence: {Translated Input Sentence} The corresponding word of '{Source Label}' in {Target Language} is:", and the expected prompt output is {Target Label}.]
Figure D.2: Illustration of the chat version prompt used for contextual machine translation for our CLaP model.

D.3.3 EasyProject

Compared to the original EasyProject work, we made certain changes in our re-implementation to provide a fair comparison. First, we use square-indexed markers (e.g., [0] and [/0]) instead of the XML markers (e.g., <LOC> and </LOC>) used by EasyProject. This is mainly because we obtained much higher retention rates with square-indexed markers (88.2%) than with XML markers (6.2%) in our initial studies. Secondly, the original EasyProject model uses a finetuned NLLB-200-3.3B translation model. Since we do not finetune CLaP or Awesome-align, we use the non-finetuned Google Machine Translation (GMT) model as the translation model.

D.4 Large Language Model Direct Inference Analysis

Large language models (LLMs) have demonstrated impressive zero-shot and few-shot capabilities across various tasks such as sentiment analysis, machine translation, and question answering [45], [74]. However, the utilization of LLMs specifically tailored for information extraction and structured prediction tasks in cross-lingual scenarios remains an area that has received limited attention. Recent studies, including those by [47] and [81], suggest that LLM performance on these tasks, even in English, falls short of the performance achieved by the best fine-tuned models. In our work, we address this gap by evaluating LLMs for direct inference on non-English structured prediction tasks through our proposed baseline model, LLM-Infer.

[Figure D.3 shows the LLM-Infer prompt design. The system prompt is "You are trying to check if arguments specific to certain event roles are present in the sentence. The event of interest is {Event-name}. The event is related to {Event-definition}. Note that your answer should only contain the output string and nothing else. Examples: {In-context examples}". The prompt input is "Sentence: {Input sentence} The event trigger word is '{Input trigger word}'. Does the input sentence mention the '{Argument Role}' role for the '{Event-name}' event? If yes, what is the corresponding argument? Output:", and the expected prompt output is "Yes/No. The argument is '{Argument-name}'.".]
Figure D.3: Illustration of the prompt used for the LLM-Infer baseline to directly utilize LLMs for downstream structured prediction tasks.
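Given the output format in Figure D.3, the model's answer has to be parsed back into an argument prediction; a minimal sketch of such parsing (the regular expression is an illustrative assumption, not the exact implementation):

```python
import re

def parse_llm_infer_output(output: str):
    """Parse answers of the form "Yes/No. The argument is '...'." into (mentioned, argument)."""
    mentioned = output.strip().lower().startswith("yes")
    match = re.search(r"[Tt]he argument is '([^']*)'", output)
    argument = match.group(1) if (mentioned and match) else None
    return mentioned, argument

# Example: parse_llm_infer_output("Yes. The argument is 'waksal'.") -> (True, 'waksal')
```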
[Figure D.4 shows the in-context examples used for the three prompting strategies. Panel (a), Zero-shot Cross-Lingual Prompt (ZSCLP), shows an English example: "waksal could be fined $3.5 million, facing a maximum of 75 years in prison, expected, though, to receive about six to seven years." with trigger word 'receive', asking about the 'Defendant' role for the 'Justice:Sentence' event, and output "Yes. The argument is 'waksal'.". Panels (b), Translate-shot Prompt (TSP), and (c), Monolingual Prompt (MP), show analogous Chinese examples.]
Figure D.4: Illustration of the in-context examples used for the three different prompting strategies for the LLM-Infer baseline.

We utilize two LLMs of varying sizes for LLM-Infer: Llama-2-chat (13B version) [151] and GPT-3.5-Turbo [17]. We illustrate the prompts used for this baseline in Figure D.3. Our LLM prompts involve 2-shot and 4-shot in-context examples, and we explore three distinct prompting strategies, specifically for the cross-lingual setting, following [6] (also illustrated in Figure D.4). These strategies are listed as follows (a minimal sketch of how the in-context examples are assembled under each strategy is given after this list):
1. Zero-shot Cross-Lingual Prompt (ZSCLP): This strategy uses k-shot examples from a pivot language (English in our study), which differs from the language of the test example, as shown in Figure D.4(a).
2. Translate-shot Prompt (TSP): In this strategy, we first obtain k-shot examples from the pivot language and subsequently perform label projection (using CLaP) to the target language on these examples. These label-projected examples are used as in-context examples in the final prompt (Figure D.4(b)).
3. Monolingual Prompt (MP): This method uses k-shot human-labeled examples directly from the target language (Figure D.4(c)).
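A minimal sketch of how the in-context examples could be assembled under each strategy; the example pools and the take-the-first-k selection are illustrative assumptions:

```python
def select_in_context_examples(strategy, k, english_pool, projected_pool, target_pool):
    """Pick k in-context examples according to the prompting strategy.

    english_pool:   labeled English examples (pivot language)
    projected_pool: English examples label-projected into the target language (e.g., with CLaP)
    target_pool:    human-labeled examples in the target language
    """
    if strategy == "ZSCLP":      # zero-shot cross-lingual: English demonstrations
        return english_pool[:k]
    if strategy == "TSP":        # translate-shot: label-projected demonstrations
        return projected_pool[:k]
    if strategy == "MP":         # monolingual: human-labeled target-language demonstrations
        return target_pool[:k]
    raise ValueError(f"unknown strategy: {strategy}")
```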
(2) Comparing different prompting strategies, we observe little variation in the Llama2-13B model performance, while a larger variation for GPT-3.5-turbo. Majorly, we observe that the label-projected in-context examples are better than the English examples, while human-labeled examples provide further gains of 3-4 F1 points. (3) We observe that on average, the LLM-Infer models perform poorer than the zero-shot fine-tuned model. These differences are massive for EAE, while LLM-Infer performs better for low-resource languages (ms and yo) using our label-projected examples for NER. (4) Finally, we observe that CLaP performs the best across all tasks and all languages, even in cases where few-shot examples in target languages are used (MP prompting strategy). All these insights validate CLaP’s manner of leveraging LLMs to solve zero-shot cross-lingual structured prediction tasks, i.e., CLaP is better than direct LLM prompting. 146 Appendix E Appendix for Event Linking E.1 Event Extraction Details To accurately extract event arguments from the designated event in the input text, we first undertake event typing to categorize the input event into the most suitable event type. Subsequently, based on the assigned category, we identify and capture the event arguments along with their corresponding roles that are characteristic of that specific type. For this purpose, we first employ UniST [61] to conduct event typing. To accommodate the wide range of potential input event, we opt for MAVEN, a highly comprehensive event detection dataset in the general domain, comprising 168 event types. Subsequently, we train a UniST model on MAVEN to conduct event typing. Upon identifying the event trigger and its corresponding type, we proceed with an event argument extraction model to obtain the related event arguments. We use TagPrime [55], the current state-of-the-art in event argument extraction. To align with the MAVEN ontology, we adopt GENEVA, an extensive event argument extraction dataset, primarily based on the MAVEN ontology with minor modifications. These modifications result in a reduction of the final event types coverage to 115. Although other event extraction models exist, such as DEGREE [54], UniIE [99], AMPERE [58], and PAIE [101], trained on various datasets, our approach represents the most comprehensive event extraction pipeline to date, encompassing the majority of event types. For additional information on event extraction models, we direct readers to the works of [64], [123] for a more thorough overview of available alternative 147 methodologies. E.2 Data Generation Details To generate negative data, we first get samples from the tagged training and validation data. We filter out the examples whose labeled event mentions are proper nouns or numeric to ensure the quality of our generated data. We also filter out the examples with less than two tagged event arguments to better apply our observation. We then generate our negative data in a two-shot fashion with the prompt demonstrated in Figure E.1. Due to the cost of GPT-3.5-Turbo, we only generate 6600 examples from the training set and 1600 examples from the validation set in the end. We employ the same method for the non-argument-aware data generation baseline but without tagging event arguments. The prompt is adjusted to indicate the absence of event argument information. Figure E.2 shows the actual prompt we use. Note that the text content of the few shot examples is identical in the two cases. 
E.3 Implementation Details

We train all the baselines and our methods using NVIDIA A100 40G GPUs. Bi-encoder training takes around 6 hours, and cross-encoder training takes around 30 hours. For the bi-encoder, we use bert-base-uncased as our base encoder, with a maximum query length of 300 and a maximum candidate length of 300. We train with a learning rate of 1e-5 and a batch size of 48 for 15 epochs to best fit our GPU budget. For the cross-encoder, we also use bert-base-uncased as our base encoder, with a maximum query length of 256 and a maximum candidate length of 256. We train with a learning rate of 2e-5 and a batch size of 6 for 20 epochs.
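For reference, a minimal sketch of the bi-encoder scoring set-up described above (bert-base-uncased encoders with dot-product similarity); the [CLS] pooling and the function names are illustrative assumptions rather than the exact implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")      # encodes the event mention in context
candidate_encoder = AutoModel.from_pretrained("bert-base-uncased")  # encodes a candidate KB entry

def encode(encoder, texts, max_length=300):
    batch = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        return encoder(**batch).last_hidden_state[:, 0]  # [CLS] vectors

def score(queries, candidates):
    # Dot-product similarity between query and candidate embeddings; higher means a better match.
    return encode(query_encoder, queries) @ encode(candidate_encoder, candidates).T
```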
For the threshold θ used for the baseline NIL prediction (see Section 7.3.3), we follow the description in [180]'s paper and set it to 0.5 after normalization. For the BM25 baseline, we use a query (context) window size of 16, which follows the practice of [126]. For the analysis of negative data generation methods, details on data generation are covered in Section E.2. Additionally, the KB pruning baseline is implemented by randomly pruning 10% of the unique labels present in the training set; we then label the corresponding samples (5,984 samples) as out-of-KB.

"You are a storyteller, and you can assist me in crafting a narrative based on a given passage. I'll give you a passage marking an event and its key details using specific tags. The event is marked with "<mention>", while the related details and the corresponding roles are identified with tags like "<role>", such as <Victim> Mark, indicating that Mark is the event's "Victim". You are to rewrite the passage with similar length and structure but containing false information by changing the key details. Remember that I want a passage that is factually incorrect. To complete the task, follow these two steps:
Step 1: Edit the tagged key details of the event, ensuring that any changes remain consistent with the original role type.
Step 2: Ensure the final generated passage's coherence and creativity by adjusting Step 1's outcome for fluency and consistency. This may include modifying unaltered parts to enhance logic and flow.
Before each step, state your plans a little bit. Additionally, don't truncate the original passage or alter any escaped characters. Also, don't remove any argument role tags in the form of "<role>" or event mention tags in the form of "<mention>". Present your results as: 'Plan 1: {{outline your changes for Step 1}}\nFollowing Plan 1, we can generate this passage after Step 1: {{passage after Step 1}}\nPlan 2: {{outline your changes for Step 2}}\nFollowing Plan 2, we can generate this passage after Step 2: {{passage after Step 2}}'
You can refer to the first two examples we provided and complete the third one on your own.
Example 1: {Example 1}
Example 2: {Example 2}
Example 3: Passage: {} Additional information we have for the Passage: This "{event mention text span}" event is of the type "{event type}"."
Figure E.1: Prompt for our argument-aware data generation.

"You are a storyteller, and you can assist me in crafting a narrative based on a given passage. I'll give you a passage containing a reference to an event. An event is an occurrence of something that happens in a certain time/place involving some participants. In the given passage, the textual expression that refers to the event is called the "mention" of the event. The event mention is marked with surrounding "<mention>" tags. You are to rewrite the passage with similar length and structure but containing factually incorrect information. Remember that I want a passage that is factually incorrect. Do not truncate the original passage or alter any escaped characters. Also, do not remove the event mention tags in the form of "<mention>" from the passage. Present your output as: 'New passage: {{new passage}}'
You can refer to the first two examples we provided and complete the third one on your own.
Example 1: {Example 1}
Example 2: {Example 2}
Example 3: Passage: New passage:"
Figure E.2: Prompt for non-argument-aware data generation baseline.
For the LLM baseline, since existing event linking systems do not use LLMs, we take inspiration from existing methods [125] and apply an instructed LLM (GPT-3.5-Turbo) to re-rank the candidates retrieved by the bi-encoder in a zero-shot fashion. Due to the cost of GPT-3.5-Turbo, we sample 1,000 test examples and use the same retrieved candidates for both our method and the LLM baseline to re-rank. For the Wikipedia test set, the model is asked to simply re-rank the top-k documents given the query. For the NYT test set, the model is given an additional option to label the query as "NIL" if none of the documents describes the event in the query. The two prompts for the Wikipedia and NYT test sets are shown in Figure E.3 and Figure E.4, respectively.

"I would like you to help me with a document re-ranking task. I will give you a short passage containing an event. I will also give you a list of 10 documents, each with a title and a short description. You task is to rank the given 10 documents in decreasing order of relevance to the event that the short passage mentions. Do not remove any documents. Do not include any documents that are not provided. In your answer, only provide the document titles in the original format.
Input format:
Document 1: <title of document 1> <short description of document 1>
Document 2: <title of document 2> <short description of document 2>
...
Document 10: <title of document 10> <short description of document 10>
Short passage containing an event: <short passage containing an event>
Answer format:
Document d1: <title of document d1 (most relevant document)>
Document d2: <title of document d2 (second most relevant document)>
...
Document d10: <title of document d10 (least relevant document)>
Now, here is the actual input. {actual input}"
Figure E.3: Prompt for the LLM baseline on the Wikipedia dataset (in-KB).

"I would like you to help me with a document re-ranking task. I will give you a short passage containing an event. I will also give you a list of 10 documents, each with a title and a short description. You task is to rank the given 10 documents in decreasing order of relevance to the event that the short passage mentions. However, it is possible that none of the 10 given documents describes the event in the passage. If you think that the event in the passage is not described by any of the 10 given documents, you should label the passage with a special "NIL" label. Do not remove any documents. Do not include any documents that are not provided. In your answer, only provide the document titles in the original format.
Input format:
Document 1: <title of document 1> <short description of document 1>
Document 2: <title of document 2> <short description of document 2>
...
Document 10: <title of document 10> <short description of document 10>
Short passage containing an event: <short passage containing an event>
Answer format (if 1 or more documents describe the event in the passage):
Document d1: <title of document d1 (most relevant document)>
Document d2: <title of document d2 (second most relevant document)>
...
Document d10: <title of document d10 (least relevant document)>
Answer format (if none of the documents describes the event in the passage, just output this sentence below):
The passage should be labeled as NIL.
Now, here is the actual input. {actual input}"
Figure E.4: Prompt for the LLM baseline on the NYT dataset (out-of-KB).

E.4 Full Bi-encoder Evaluation Results

We present the full bi-encoder evaluation results, which include more recall values, in Table E.1.

Wikipedia (in-KB) Test Set Recall
Model | R@1 | R@2 | R@3 | R@4 | R@5 | R@8 | R@10 | R@15 | R@20
BM25 | 9.72 | 16.64 | 20.58 | 23.16 | 25.48 | 29.78 | 31.77 | 35.58 | 38.10
BLINK [163]∗ | 54.85 | 68.14 | 74.27 | 77.95 | 80.36 | 84.49 | 86.22 | 89.00 | 90.55
EveLink [180]∗ | 55.72 | 67.22 | 74.74 | 78.27 | 80.62 | 84.83 | 86.51 | 89.21 | 90.91
Ours | 57.28 | 70.14 | 76.10 | 79.49 | 81.69 | 85.82 | 87.40 | 89.80 | 91.34
Table E.1: Different recall values on the in-domain, in-KB evaluation for the bi-encoder on the Wikipedia test set. The best performance is highlighted in bold. "R@1" stands for recall at 1, and so on.
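The recall@k values in Table E.1 count how often the gold entry appears among the top-k retrieved candidates; a minimal sketch, assuming ranked candidate lists and gold labels are available as plain Python lists:

```python
def recall_at_k(ranked_candidates, gold_labels, ks=(1, 2, 3, 4, 5, 8, 10, 15, 20)):
    """ranked_candidates: list of candidate-ID lists (best first); gold_labels: list of gold IDs."""
    return {
        k: 100.0 * sum(gold in cands[:k] for cands, gold in zip(ranked_candidates, gold_labels))
             / len(gold_labels)
        for k in ks
    }
```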
Abstract
Human languages in the world, such as news or narratives, are structured around events. Focusing on these events allows Natural Language Processing (NLP) systems to better understand plots, infer motivations, consequences, and the dynamics of situations. Despite the rapidly evolving landscape of NLP technology, comprehending complex events, particularly those rarely encountered in training such as in niche domains or low-resource languages, remains a formidable challenge. This thesis explores methods to enhance NLP model generalizability for better adaptability to unfamiliar events and languages unseen during training.
My approach includes two main strategies: (1) Model Perspective: I propose a novel generation-based event extraction framework, largely different from typical solutions that make predictions by learning to classify input tokens. This new framework utilizes indirect supervision from natural language generation, leveraging large-scale unsupervised data without requiring additional training modules dependent on limited event-specific data. Hence, it facilitates the models' ability to understand general event concepts. I further explore advanced methods to extend this framework for cross-lingual adaptation and to utilize cross-domain robust resources effectively. (2) Data Perspective: I develop techniques to generate pseudo-training data to broaden the training scope for event understanding models. This includes translating structured event labels into other languages with higher accuracy and fidelity, and synthesizing novel events for the existing knowledge base.
Overall, my work introduces a novel learning platform to the NLP community, emphasizing an innovative modeling paradigm and comprehensive data preparation to foster more generalized event understanding models.