GENERATING AND UTILIZING MACHINE EXPLANATIONS FOR TRUSTWORTHY NLP

by Aaron Zesheng Chan

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2023

Copyright 2023 Aaron Zesheng Chan

Dedication

To my family.

Acknowledgements

This dissertation officially concludes my five-year PhD journey. When I arrived at USC in Fall 2017, I understood very little about AI research beyond the simple fact that I wanted to do it. Since then, I have undergone profound intellectual and emotional growth, often in unexpected ways. Of course, my success could not have been possible without the support of many wonderful people in my life. As I prepare to graduate from USC and begin my career in industry, I would like to take this opportunity to reflect on my PhD experience and honor those who played a role in shaping me into the person I am today.

PhD Adviser

I want to express my utmost gratitude to Xiang Ren for being the consummate PhD adviser and dissertation chair. Xiang is exceptionally methodical, savvy, and proactive in everything he does, so it's easy to see why his INK Lab consistently produces so many high-quality research papers. From the day I joined INK Lab, Xiang has worked tirelessly to keep putting me in the best position to succeed and pushing me to be the best version of myself. This includes but is not limited to: providing insightful feedback on my projects, showing me how to design effective experiments, teaching me how to manage uncertainty, integrating me into various research collaborations, involving me in grant proposal writing, giving me opportunities to mentor others, referring me to my internship at Meta, helping me build my network in the AI research world, and encouraging me to be (realistically) ambitious. Through Xiang's training, I've learned how to quickly evaluate the novelty, impact, and feasibility of a research direction. I've learned to think systematically, speak calmly, and act decisively in stressful situations. I've learned the art of ruthless prioritization, which empowers me to efficiently juggle more than ten research tasks at a time. Every time I faced a challenge, Xiang was there to support me. Even when I made dumb mistakes, Xiang never made me feel dumb. Instead, he would patiently work with me until the problem was resolved. The more I work with Xiang, the less it feels like work. Everything I've achieved during my PhD is a testament to Xiang's fearless research vision and the amazing environment that he has created at INK Lab. I truly owe my career to Xiang, and I aspire to follow his example as both an outstanding scientist and leader.

Committee Members

I would like to thank Robin Jia, Jesse Thomason, Bistra Dilkina, and Morteza Dehghani for serving on my dissertation defense and thesis proposal committees. In addition, I want to thank Bistra Dilkina, Yan Liu, Jay Pujara, and Shri Narayanan for serving on my qualifying exam committee. I greatly appreciate my committee members' thoughtful questions and comments about my work, which were instrumental in improving my PhD research and helping me successfully finish my dissertation. Furthermore, I am grateful for their responsiveness to various committee-related administrative tasks, so that I could complete all of my PhD milestones on schedule.
Research Collaborators

During my PhD, I've been fortunate to work with and learn from many excellent researchers at USC, Meta, and other institutions. Together, we were more productive than I thought was possible for someone like me, co-authoring 10 publications in just two years. Behind each paper is the unique story of a team that faced all kinds of adversity but didn't give up. I have so many precious memories of such stories: the apprehension of diving into an unfamiliar research area, the thrill of figuring out how to implement our ideas, the resilience we showed in debugging unexpected problems, the excitement of seeing our first positive results, the pressure of fighting to submit our paper before the 11:59PM AoE deadline, and the joy of finally receiving the paper acceptance email. Together, we broke through what we thought were our limits and became stronger. I am indebted to every one of my collaborators for accepting me in spite of my flaws and insecurities, helping me find my niche on the team, having my back when we were down in the trenches, and teaching me important life lessons that I will never forget.

When I joined USC's INK Lab in Fall 2020 as a fourth-year PhD student, I was thrust into a de facto mentor role, despite the fact that I probably had less (successful) research experience at that point than many of the students I was mentoring. Nonetheless, we came together as a team, helped each other grow, and got the job done. First, I would like to thank my USC PhD student co-authors — Peifeng Wang, Jun Yan, Soumya Sanyal, Brihi Joshi, Dong-Ho Lee, and Sahana Ramnath — for their diverse technical expertise, open-mindedness to new ideas, diligence in delivering results, clarity in communication, positivity in stressful situations, and humility in giving/receiving feedback. By working with them, I've become a better scientist and leader, discovering strengths and weaknesses I didn't know I had. Second, I would like to thank the master's, undergraduate, and high school students that I worked with at USC: Mrigank Raman, Siddhant Agarwal, Hansen Wang, Tianyu Zhang, Jiashu Xu, Boyuan Long, Tanishq Gupta, Siba Smarak Panigrahi, Wyatt Lake, Ziyi Liu, Akshen Kadakia, Kiran Narahari, Zhiyuan Zeng, and Zhewei Tong. Each student came to INK Lab with impressive talents and worked hard to make significant contributions to my PhD research. It has been a tremendous privilege to mentor these students and watch them grow as AI researchers. Third, I would like to thank Muhao Chen, Filip Ilievski, and Jay Pujara from the USC Information Sciences Institute (ISI). I appreciate the knowledge and wisdom they brought to our research collaborations, and I learned so much from our interesting discussions.

In Fall 2021, during my fifth year, I had the incredible opportunity to work as an AI research intern at Meta. Although I had co-authored a few papers before then, I consider my Meta internship as the critical turning point in my career when I really started to feel confident in my ability to conduct research independently. I attribute this accelerated period of growth to the productive, flexible, and welcoming environment facilitated by all of my Meta collaborators, for which I am immensely grateful. First, I want to thank my intern managers, Maziar Sanjabi and Hamed Firooz, for taking a chance on me and showing me how to do great AI research at Meta. Maziar is super knowledgeable about machine learning and has high standards in both scientific experimentation and exposition.
I appreciate Maziar's penchant for asking difficult questions that I overlooked, challenging me to think more deeply about research problems, and reminding me to "max my max" (i.e., maximize my focus on things of maximum value). Moreover, Maziar was extremely supportive in helping me debug technical issues as well as in sharing astute comments on my paper and rebuttal writing. Hamed is very experienced in both conducting AI research and applying AI research to real-world products. It was mind-blowing to learn that, not long after my internship ended, my intern project was already being deployed into production and creating a positive impact on Meta's users. Furthermore, I appreciate Hamed's abundant feedback on my research, tips for improving my internship performance, and transparency in answering my questions about Meta. Second, I would like to thank my other Meta co-authors — Lambert Mathias, Liang Tan, Shaoliang Nie, Xiaochang Peng, and Qifan Wang — for their intellectual contributions to my research projects and for warmly integrating me into the team. During our weekly project meetings and 1:1s, I benefited a lot from their questions, comments, and suggestions. I also enjoyed presenting and discussing my thoughts on the model explainability literature during the team's reading group meetings. Third, I want to acknowledge several additional Meta colleagues who spent time mentoring me, providing me engineering support, or teaching me how things work at Meta: Xiaochen Liu, Narine Kokhlikyan, Bilal Alsallakh, Kanika Narang, Jingzhou Liu, Madian Khabsa, Hao Ma, and Ai-dan Tran. Their efforts definitely helped make my internship a great experience.

During my PhD, I have also collaborated in varying capacities with researchers from Adobe, Sony, the Allen Institute for AI (AI2), the University of Virginia (UVA), and the University of California, Los Angeles (UCLA). From Adobe, I would like to thank Sungchul Kim, Ryan Rossi, Handong Zhao, and Nedim Lipka. From Sony, I would like to thank Takashi Shibuya, Ryosuke Mitani, and Toshiyuki Sekiya. From UVA, I would like to thank Hanjie Chen. From AI2, I would like to thank Yejin Choi, Jack Hessel, and Ximing Lu. From UCLA, I would like to thank Liunian Harold Li. I am delighted to have had the experience of working with and learning from such a superb group of researchers.

Pre-PhD Mentors

I owe much of my success to the mentors I had before starting my PhD, from my time at the University of Maryland (UMD), the University of Pennsylvania (Penn), and Google. From UMD, I would like to thank Rama Chellappa, David Jacobs, and Soumyadip Sengupta. From Penn, I would like to thank Kostas Daniilidis, Jianbo Shi, Gedas Bertasius, Georgios Pavlakos, Alex Zhu, and Menglong Zhu. From Google, I would like to thank Ying Chen Lou, Bob Hung, Samuel Ha, Weber Tang, and Huan Zeng. I deeply appreciate the valuable opportunities they provided me, regardless of my inexperience and naivete. While working with them, I learned many important skills and life lessons, which helped prepare me to be a PhD student. Without their guidance and support, I would not be where I am now.

Administrative Staff

I would like to thank Lizsl De Leon, Jennifer Gerson, Andy Chen, Tracy Charles, and Asiroh Cham for their huge efforts in providing administrative support to USC's computer science PhD students. I appreciate their prompt replies to every question or concern I emailed them, which ensured that I could focus on my research without stressing about anything administration-related.
I am especially grateful for Lizsl's helpful advice when I was transferring to Xiang's INK Lab during my fourth year. Additionally, I want to thank Rita Wiraatmadja for her hard work as the INK Lab manager, especially in managing the INK Lab budget. In particular, Rita was responsible for coordinating my general PhD funding (i.e., for tuition and stipends) each semester as well as reimbursing my registration and travel expenses whenever I presented my research at conferences or workshops. I am very appreciative of Rita's ability to quickly respond to and complete all of my funding-related requests.

Labmates

My labmates at INK Lab have collectively created an enormously positive impact on my PhD experience. First, I want to thank Bill Yuchen Lin and Xisen Jin for introducing me to INK Lab and transparently answering the many questions I had before I decided to join. At the time, it was hard to believe that INK Lab could be as awesome as they made it seem, but now I know that everything they said was true. Second, I would like to thank Xisen Jin, Jun Yan, Soumya Sanyal, and Albert Xu for their hard work in serving as INK Lab's server administrators. I really appreciate all the time and effort they spent answering my Slurm questions and debugging various server issues, on top of their own heavy research and course workloads. Third, I want to thank Brihi Joshi and Jacob Bremerman for organizing lab socials and Qinyuan Ye for organizing the weekly lab meetings. These events were great for helping INK Lab members build friendships, develop collaborations, and foster a welcoming atmosphere. Fourth, because I worked fully remotely during my time at INK Lab, I actually seldom saw my labmates in person. Consequently, most of my non-work social interactions with them came from ad-hoc Zoom calls and Slack messages/huddles. In particular, I would like to thank Brihi Joshi, Soumya Sanyal, Woojeong Jin, Bill Yuchen Lin, Jun Yan, Ziyi Liu, Jiashu Xu, and Zhiyuan Zeng for all the conversations in which we vented to each other about work struggles but also chatted about our personal lives. These conversations helped motivate me to keep pushing forward in my research, provide levity in stressful times, and reveal more about who my labmates are outside of work. Fifth, I want to thank each of my PhD student labmates — Hirona Arai, Shushan Arakelyan, Jacob Bremerman, Woojeong Jin, Xisen Jin, Brihi Joshi, Huihan Li, Bill Yuchen Lin, Sahana Ramnath, Soumya Sanyal, Albert Xu, Jun Yan, Qinyuan Ye, and Pei Zhou — for their unique contributions to the lively, supportive, and productive research environment at INK Lab. INK Lab is truly a special place.

Friends

I am grateful to have become friends with many other PhD students and postdocs at USC, and my PhD certainly would not have been the same without them. After work, it was great to spend time with people who could relate to a lot of the things I was going through, but also provide different perspectives from their own labs. Moreover, these friendships helped remind me that there is so much more to life than AI research and that we are more than just the content of our CVs. I want to thank Soravit (Beer) Changpinyo for being my unofficial mentor at USC and my role model as a successful yet humble AI researcher in industry. I really admire Beer's growth mindset, introspectiveness, and systematic pursuit of self-improvement in all areas of his life.
Most of all, I will never forget Beer's guidance during the lowest point of my PhD and his empathetic joy when I eventually made it to a better situation. I want to thank Shariq Iqbal for being my first friend and roommate at USC as well as for being a paragon of consistency and level-headedness. I love how Shariq is always down to do something fun, whether it's playing basketball, exploring Los Angeles, or engaging in profound discussions about life. I want to thank Michiel de Jong for challenging me intellectually and for indulging me with his razor-sharp wit. Michiel's ultra-rational personality has pushed me to confront many of my hidden cognitive biases and perceive the world more clearly. I envy how Michiel seems to have an articulate answer for just about any question. Furthermore, I would like to thank Yury Zemlyanskiy, Wei-Lun (Harry) Chao, Hexiang (Frank) Hu, Chin-Cheng (Jeremy) Hsu, Bowen Zhang, Zhiyun Lu, Liyu Chen, Wang (Bill) Zhu, Yiming Yan, Séb Arnold, Ke Zhang, Melissa Ailem, and Chao-Kai Chiang for their friendship throughout my PhD years. Each one of them played a role in making my PhD experience more enjoyable and helping me become a better person.

I am also thankful for the treasured friendships I have outside of USC and the AI research world. I want to thank all of these friends for being in my life despite my typically hermit-like PhD lifestyle. After a long day of work, it always feels great to see messages from friends checking in on how I'm doing or proposing social plans for the weekend. I love how spending time with my friends makes it so easy for me to relax, recharge, and forget about research for a while. When my friends do happen to ask me about my research, I appreciate how they ask me intriguing questions, help me fine-tune my elevator pitch, and remind me that my work is worthwhile. In addition, I thank my friends who live outside of Los Angeles for giving me many reasons to travel the world during my PhD, instead of staying cooped up in my home office. PhD life can sometimes feel lonely, so it means so much to have friends who care about my well-being, support me through thick and thin, and fill my life with fun and laughter.

Family

My family has been my rock throughout my PhD. Words cannot fully express how grateful I am for my family, to whom this dissertation is wholeheartedly dedicated. However, I will try my best.

First, I would like to thank my parents for bringing me into the world and showering me with unconditional love. Throughout my entire life, my parents have worked extremely hard to support me in everything I do, in any way they can. My parents are my greatest role models, continually teaching me priceless lessons about doing things the right way: integrity, humility, responsibility, tenacity, and many more. I am deeply indebted to my parents for my intellectually and culturally rich upbringing. They taught me reading, writing, and arithmetic when I was a toddler. They quizzed me on history and geography facts during long car rides. They made me bilingual by taking me to Chinese school every Sunday and pushing me to speak Chinese at home. They engaged me in lively debates about philosophy and politics. They cultivated my interest in computer science by showing me how to code and signing me up for engineering camps. They are the reason why I never get tired of asking big questions or learning new things. During my PhD, I struggled a lot and even considered quitting, but my parents never stopped believing in me.
Despite their busy schedules, they always made time to talk with me on the phone, provide comforting words of encouragement, and share sage advice that I sometimes couldn't understand until much later. In those moments, I realized how small my PhD struggles were compared to my parents' love for me. And in those moments, I realized that I was going to make it. When I look in the mirror, I see the profound imprint that my parents have left on me, and I am proud of it. I hope this dissertation has made them proud.

Second, I would like to thank my siblings and extended family. My siblings have always been a huge part of my life. Our frequent conversations provided much-needed emotional boosts throughout my PhD, and spending quality time with them was one of my favorite aspects of going home for winter break. Whenever I felt unproductive in my PhD research, hearing good news about my siblings' accomplishments would replenish me with positive energy and inspire me to keep going. I am super grateful to my siblings for all of the hilarious memories we make together, the character-building opportunities they provide me, and the unique ways in which they show their love and support. My extended family has also made a major impact on my PhD experience. I am lucky to be part of a family that recognizes the vital importance of education, with multiple esteemed PhD graduates on both sides. Growing up, my family collectively placed a strong emphasis on improving academic performance in school, broadening horizons via extracurricular activities, and developing successful careers that create value for society. Thus, from a young age, I learned to have high expectations for myself and to stand tall on the shoulders of the previous generation. During my PhD, many different family members — cousins, uncles, aunts, grandparents, and in-laws — would check in on my progress and let me know that they were proud of my hard work. Indeed, after five years of hard work, my family now has a new PhD graduate to celebrate.

Finally, I would like to thank my beloved wife for being my soulmate, best friend, and number-one supporter. A recent study identified marital status as one of the most important predictors of PhD success, as the authors found that being married increases a PhD student's graduation rate by an average of 25.3% (N = 1322; χ²(1) = 20.34; p < 0.001; Cramer's V = 0.12) [256]. Now that I am graduating, I can better appreciate this finding. My wife is an extraordinary woman of numerous talents, with which she has achieved significant success in various areas of her life. She never fails to amaze me with her unwavering drive, confidence, and resourcefulness in the face of any obstacle she encounters. By simply being around her and witnessing how she lives, I am constantly motivated to improve myself. But my wife doesn't just walk the walk — she also talks the talk. In fact, she is the one person I can talk to about anything; her rationality, creativity, empathy, and humor make her an ideal thought partner for problem-solving and decision-making. While thriving in her own ambitious career, my wife selflessly does her best to help me thrive in mine. I like to think of my wife as my second (and arguably tougher) PhD adviser, who enthusiastically pushes me to work smarter and manage my time better. On more than one occasion, I was stuck for days on what appeared to be a hopelessly complex issue, only for her to jump in with minimal context and quickly propose a simple yet effective solution.
If she ever sensed that I was overworking myself, she would wisely remind me to take regular breaks and protect my health. Every time I had a paper accepted, she would find a fun way for us to celebrate. Outside of work, my wife enriches my life in countless ways. In contrast to my homebody tendencies, she is always on the lookout for new and exciting places to visit. As a result, we were able to make the most of our time in Los Angeles by experiencing the best restaurants, parks, and shows that the city has to offer. Plus, my wife herself is a magnificent chef, who seems to effortlessly master any cuisine she tries her hand at. Astonishingly, just about every dish from her kitchen looks and tastes as if it were prepared in a Michelin-starred restaurant. Without my wife's love and support, I know my PhD would not have been nearly as happy or successful. What I appreciate most about my PhD experience is how it brought my wife and me closer together and showed us all of the amazing things we can accomplish as a team. Thank you for being my everything. I love you so much.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

I Background

Chapter 1: Introduction
1.1 The Case for Trustworthy NLP
1.1.1 Three Pitfalls of Today's NLP Systems
1.2 Challenges in NLP Explainability
1.3 Thesis
1.3.1 Overview of Contributions
1.3.2 Thesis Organization
1.3.3 Relationship to Published Work

Chapter 2: Literature Review
2.1 Generating Machine Explanations
2.1.1 Extractive Rationales
2.1.1.1 Evaluating Extractive Rationales
2.1.2 Free-Text Rationales
2.2 Utilizing Machine Explanations
2.2.1 Improving LM Decision-Making
2.2.2 Improving Human Decision-Making

II Generating Machine Explanations

Chapter 3: Learning to Generate Machine Explanations
3.1 Introduction
3.2 Problem Formulation
3.2.1 Rationale Extraction
3.2.2 Three Desiderata of Rationale Extraction
3.3 UNIREX
3.3.1 Framework Overview
3.3.2 Rationale Extractor
3.3.2.1 Heuristic Rationale Extractors
3.3.2.2 Learned Rationale Extractors
3.3.3 Explainability Objectives
3.3.3.1 Faithfulness
3.3.3.2 Plausibility
3.3.4 Training and Inference
3.4 Experiments
3.4.1 Evaluation Protocol
3.4.1.1 Datasets
3.4.1.2 Metrics
3.4.1.3 Results Reporting
3.4.2 Baselines
3.4.3 Implementation Details
3.4.4 Main Results
3.4.5 Ablation Studies
3.4.6 Gold Rationale Efficiency
3.4.7 Zero-Shot Faithfulness Transfer
3.4.8 Plausibility User Study
3.5 Related Work
3.6 Appendix
3.6.1 Text Classification
3.6.2 Gold Rationale Supervision
3.6.3 Explainability Objectives
3.6.3.1 Faithfulness
3.6.3.2 Plausibility
3.6.4 Datasets
3.6.5 Gold Rationale Efficiency
3.6.6 Computational Efficiency
3.6.7 Qualitative Analysis
3.6.8 Main Results (Extended)

III Utilizing Machine Explanations

Chapter 4: Learning from Strongly-Supervised Machine Explanations
4.1 Introduction
4.2 Explanation Regularization (ER)
4.3 ER-Test
4.3.1 Unseen Dataset Tests
4.3.2 Contrast Set Tests
4.3.3 Functional Tests
4.4 ER Design Choices
4.4.1 Machine Rationale Extractors
4.4.2 Rationale Alignment Criteria
4.4.3 Human Rationale Types
4.4.4 Instance Selection Strategies
4.5 Experiments
4.5.1 Tasks and Datasets
4.5.2 Evaluation Metrics
4.5.3 Implementation Details
4.5.4 RQ1: Which rationale alignment criteria are most effective for ER?
4.5.4.1 Setup
4.5.4.2 Unseen Dataset Tests
4.5.4.3 Contrast Set Tests
4.5.4.4 Functional Tests
4.5.5 RQ2: How effective are task-level human rationales for ER?
4.5.5.1 Setup
4.5.5.2 Unseen Dataset Tests
4.5.5.3 Contrast Set Tests
4.5.5.4 Functional Tests
4.5.6 RQ3: How is ER affected by the number and choice of training instances with human rationales?
4.5.6.1 Setup
4.5.6.2 Unseen Dataset Tests
4.5.6.3 Contrast Set Tests
4.5.6.4 Functional Tests
4.5.7 RQ4: How is ER affected by the time taken to annotate human rationales?
4.5.7.1 Setup
4.5.7.2 Unseen Dataset Tests
4.5.7.3 Contrast Set Tests
4.5.7.4 Functional Tests
4.6 Related Work
4.7 Appendix
4.7.1 ER Training Details
4.7.2 Development Set Results
4.7.3 RQ1: Which rationale alignment criteria are most effective for ER?
4.7.3.1 Named Entity Recognition
4.7.3.2 Hate Speech Detection
4.7.4 RQ2: How effective are task-level human rationales for ER?
4.7.4.1 Creating Task-Level Rationales
4.7.5 RQ3: How is ER affected by the number and choice of training instances with human rationales?
4.7.5.1 Setup
4.7.6 RQ4: How is ER affected by the time taken to annotate human rationales?
4.7.6.1 Setup
4.7.7 Functional Tests

Chapter 5: Learning from Weakly-Supervised Machine Explanations
5.1 Introduction
5.2 Preliminaries
5.3 Creating KG Saliency Explanations
5.3.1 Coarse Saliency Explanations
5.3.2 Fine Saliency Explanations
5.4 Oracle: Using KG Saliency Explanations as Inputs
5.4.1 Oracle Models
5.4.2 Evaluation Protocol
5.4.3 Analysis
5.5 SalKG: Using KG Saliency Explanations as Supervision
5.6 Experiments
5.6.1 Evaluation Protocol
5.6.2 Main Results
5.6.3 Ablation Studies
5.6.4 Case Studies
5.7 Related Work
5.8 Appendix
5.8.1 Construction of the Contextualized KG
5.8.2 Alternative Formulation of Coarse Saliency Explanations
5.8.3 Implementation Details for Grad-Based Fine Saliency Explanations
5.8.4 Evaluation Protocol
5.8.5 Dataset Details
5.8.6 Threshold Tuning for Creating Explanations
5.8.7 Additional Details about Oracle Models
5.8.8 Additional SalKG Results on CODAH
5.8.9 Additional SalKG Results for Grad vs. Occl
5.8.10 Comparison to Published OBQA Baselines
5.8.11 Low-Resource Learning
5.8.12 Analyzing the Impact of Coarse Explanations
5.8.13 Comparing Salient and Non-Salient KG Units
5.8.14 Robustness to KG Perturbation
5.8.15 Statistical Significance of Main Results
5.8.16 Case Studies: Qualitative Analysis of KG Saliency Explanations
5.8.17 User Studies: Quantitative Analysis of KG Saliency Explanations
5.8.17.1 User Study 1: Coarse Saliency Explanations
5.8.17.2 User Study 2: Fine Saliency Explanations
5.8.17.3 Inter-Annotator Agreement
5.8.17.4 Analysis
5.8.18 Training Hyperparameters
5.8.19 Computational Costs and Resources
5.8.20 Related Work (Extended)
5.8.21 Societal Impact

IV Conclusion

Chapter 6: Conclusion
6.1 Future Work
6.1.0.1 Evaluating Free-Text Rationales
6.1.0.2 Generating Free-Text Rationales
6.1.0.3 Utilizing Free-Text Rationales

V References

Bibliography

List of Tables

3.1 UNIREX Ablation Studies on SST
3.2 Zero-Shot Faithfulness Transfer from SST. We investigate whether the faithfulness of UNIREX rationale extractors (AA-F, DLM-FP) trained on SST can generalize to unseen datasets/tasks, even when the task model's task performance cannot. Also, we include AA (IG) as a heuristic extractor baseline (i.e., only the task model is trained). Here, the seen task is sentiment analysis (SA), while the unseen tasks are hate speech detection (HSD), offensive speech detection (OSD), and irony detection (ID). For SA, the unseen datasets are Yelp and Amazon. For HSD, OSD, and ID, the unseen datasets are Stormfront, OffenseEval, and SemEval2018, respectively. Overall, we find that faithfulness is not strongly correlated with task performance, as unseen tasks' comp/suff scores are similar to seen tasks'. In particular, though all methods achieve poor task performance on unseen tasks, UNIREX (DLM-FP)'s comp/suff scores are consistently good across all tasks, demonstrating its faithfulness generalization ability.
3.3 Plausibility User Study on SST. In our user study, we find that humans judge UNIREX (DLM-FP)'s rationales as being more plausible than those created by other methods, in terms of both forward simulation (accuracy) and subjective rating (alignment).
3.4 Computational Efficiency Results on SST.
Besides achieving the best balance of faithfulness, plausibility, and task performance, UNIREX (DLM-FP) and UNIREX (SLM-FP) are also the most computationally efficient, achieving the lowest inference time per instance. Also, despite having a higher convergence delta and lower inference time than most other AA (IG) variants due to using 3-step IG, UNIREX (AA-F) outperforms all AA (IG) variants on faithfulness, while achieving comparable plausibility and task performance.
3.5 Qualitative Analysis on SST. Building upon the quantitative results of our plausibility user study, our qualitative analysis further supports the notion that UNIREX (DLM-FP)'s rationales are more plausible than those created by other rationale extraction methods. In this table, we visualize each rationale by highlighting the important tokens (selected by the given method) in blue.
3.6 Main Results on SST
3.7 Main Results on Movies
3.8 Main Results on CoS-E
3.9 Main Results on MultiRC
3.10 Main Results on e-SNLI
4.1 RQ1 - Unseen Dataset Tests (§4.5.4.2). We compare various ER rationale alignment criteria (as well as the No-ER baseline), with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc) for sentiment analysis and macro F1 (F1) for NLI. Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05).
4.2 RQ1 - Contrast Set Tests (§4.5.4.3). We compare various ER rationale alignment criteria (as well as the No-ER baseline), with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Contrast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05).
4.3 RQ2 - Unseen Dataset Tests (§4.5.5.2). We compare ER model performance using instance-level rationales versus using task-level rationales, with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05).
4.4 RQ2 - Contrast Set Tests (§4.5.5.3). We compare ER model performance using instance-level rationales versus using task-level rationales, with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Contrast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05).
4.5 RQ3 - Unseen Dataset Tests (§4.5.6.2).
We compare ER model performance for five instance selection strategies across different instance annotation budgets (k% of training data), with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05).
4.6 RQ3 - Contrast Set Tests (§4.5.6.3). We compare ER model performance for five instance selection strategies across different instance annotation budgets (k% of training data), with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Contrast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05).
4.7 RQ4 - Contrast Set Tests (§4.5.7.3). We compare ER model performance for three instance annotation types across different time budgets, with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Contrast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05).
4.8 RQ1 - Development Set Performance (§4.5.4.2). For each machine rationale extractor and task/dataset, the best-performing rationale alignment criterion is indicated in bold.
4.9 RQ2 - Development Set Performance (§4.5.5.2). For each machine rationale extractor and rationale alignment criterion, the best-performing human rationale type is indicated in bold.
4.10 RQ3 - Development Set Performance (§4.5.6.2). For each instance annotation budget, the best-performing instance selection strategy is indicated in bold.
4.11 RQ4 - Development Set Performance (§4.5.7.2). For each time budget, the best-performing instance annotation type is indicated in bold.
4.12 RQ1 - Unseen Dataset Tests for NER (§4.7.3.1). For NER, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc). For each dataset and metric, the best-performing criterion is indicated in bold.
4.13 RQ1 - Unseen Dataset Tests for Hate Speech Detection (§4.7.3.2). For hate speech detection, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc) and false positive difference rate (FPRD). For each metric, the best-performing criterion is indicated in bold. Note that we only consider task-level rationales for hate speech detection, since instance-level rationales are unavailable.
4.14 RQ1 - Functional Subtests for Sentiment Analysis (§4.5.4.4).
For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.15 RQ1 - Functional Subtests for NLI (§4.5.4.4). For NLI, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.16 RQ2 - Functional Subtests for Sentiment Analysis (§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.17 RQ3 - Functional Subtests for Sentiment Analysis - k=5% (§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.18 RQ3 - Functional Subtests for Sentiment Analysis - k=15% (§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.19 RQ3 - Functional Subtests for Sentiment Analysis - k=50% (§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.20 RQ4 - Functional Subtests for Sentiment Analysis - Label Only (§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.21 RQ4 - Functional Subtests for Sentiment Analysis - Expl Only (§4.5.4.4).
For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
4.22 RQ4 - Functional Subtests for Sentiment Analysis - Label+Expl (§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate.
5.1 KG unit types used for different explanation modes (§5.3) and graph encoders (§5.4.2).
5.2 Comparison of Oracle Models. For each Oracle Model, we show its output and saliency weights. Note that the explanations are given (not predicted), so there is no L_sal. While F*_c and F*_h are both ensembles of F_KG and F_No-KG, F*_f has the same architecture as F_KG (denoted by ∼) besides the attention masking.
5.3 Oracle Performance on CSQA and OBQA
5.4 Comparison of SalKG Models. For each SalKG Model, we show its output, saliency weights, and L_sal. While F_c and F_h are both ensembles, F_f has the same architecture as F_KG (denoted by ∼). "CE" denotes cross-entropy loss, while "KL" denotes KL divergence loss.
5.5 SalKG Performance on CSQA
5.6 SalKG Performance on OBQA and CODAH
5.7 Comparison of SalKG to Published CSQA Baselines. SalKG models that outperform all baselines are shown in bold.
5.8 Ablation Studies. Best model in bold.
5.9 SalKG-Fine Performance for Different top-k% Thresholds. We report performance for RoBERTa+MHGRN and RoBERTa+PathGen on CSQA and OBQA. Best model is shown in bold.
5.10 SalKG Performance on CODAH for Additional Settings. Building upon the CODAH results in Table 5.6 (RoBERTa+MHGRN and RoBERTa+PathGen), we additionally report results for RoBERTa+RN, BERT+MHGRN, BERT+PathGen, and BERT+RN, all using threshold top-10%. We also report both Grad and Occl results for SalKG-Fine and SalKG-Hybrid. Best model is shown in bold.
5.11 CSQA Performance Comparison for SalKG Grad vs. Occl Models. Best model between Grad and Occl is shown in bold.
5.12 OBQA Performance Comparison for SalKG Grad vs. Occl Models. Best model between Grad and Occl is shown in bold.
5.13 Comparison of SalKG to Published OBQA Results. Best model is shown in bold.
5.14 Impact of Coarse Explanations. Using BERT+PathGen on CSQA, we present a performance breakdown for various question sets, in order to analyze why SalKG-Coarse is able to beat No-KG and KG.
5.15 Salient vs. Non-Salient KG Units. Using BERT+MHGRN on CSQA, we compare salient and non-salient KG units. In (a), we compare salient and non-salient KGs, as determined by coarse explanations. In (b), we compare salient and non-salient nodes, as determined by fine explanations.
5.16 SalKG Performance Comparison on CSQA with Perturbed KGs. Best performance in bold.
5.17 SalKG T-Test Results on CSQA. For each setting in Table 5.5, we perform the T-test between the best SalKG model and the best non-SalKG model.
5.18 SalKG T-Test Results on OBQA and CODAH. For each setting in Table 5.6, we perform the T-test between the best SalKG model and the best non-SalKG model.
5.19 Human Evaluation of Coarse Saliency Explanations. Human-annotated usefulness scores for high- (positive) and low- (negative) saliency graphs.
5.20 Human Evaluation of Fine Saliency Explanations. Human-annotated usefulness scores for high-, median-, and low-saliency paths. We display the usefulness scores for paths from all predictions, correct predictions, and incorrect predictions.
5.21 Inter-Annotator Agreement for Explanation User Studies. Using Fleiss' kappa, we measure the inter-annotator agreement for the human evaluation of coarse and fine saliency explanations. In both settings, the inter-annotator agreement is relatively low.

List of Figures

1.1 LMs Struggle on OOD Data. In-the-wild test data are often out-of-distribution (OOD) with respect to the LM's training data, which can cause the model to fail in various ways in real-world settings [66, 41, 216]. Image credit: [143].
1.2 Unintended LM Behavior. Intuitively, swapping (e.g., "beautiful" → "beuatiful") or dropping (e.g., "downbeat" → "dwnbeat") one or two letters in one word of the passage should rarely affect the sentiment of a passage. However, recent studies have shown that LMs are not robust to such trivial perturbations [191, 100]. This has serious implications for the reliability of real-world NLP applications [56, 12]. Image Credit: [191].
1.3 Unfair LM Behavior. This figure shows two passages obtained from the Internet [109]. The first passage comes from the New York Times and gives relatively benign commentary about how many Africans fear hatred between their own people. Meanwhile, the second passage comes from Gab.com [108, 70], a white supremacist forum, and explicitly claims that white people are superior to black people — this constitutes actual hate speech. Nonetheless, recent works have found that state-of-the-art LMs classify both passages as hate speech, due to associating certain group identifier words (e.g., "Africans", "black", and "white") with hate speech [109, 211, 44]. This can lead to the unfair marginalization of such groups in online discourse [260, 211, 109]. Image Credit: [109].
1.4 Unaccountable LM Behavior. LMs' complex reasoning processes are notoriously opaque, making such models essentially function as black box systems [72, 144, 205].
Without transparency in the LM's decision process, it becomes very difficult to debug the NLP system when things go wrong [148, 72, 145]. Image credit: Black boxes and their intrusion.
1.5 Thesis Statement. LMs' unintended, unfair, and unaccountable behaviors serve as critical obstacles to trustworthy NLP systems. To build human trust in NLP systems, we must be able to: (A) Generate machine explanations for LM behavior faithfully and plausibly. (B) Utilize machine explanations to improve LM generalization and decision-making.
3.1 Desiderata of Rationale Extraction. Ideally, rationale extraction should be faithful and plausible, without compromising the task model's task performance. Unlike prior works, UNIREX enables optimizing the rationale extractor for all three desiderata.
3.2 UNIREX Framework. UNIREX enables us to jointly optimize the task model (F_task) and rationale extractor (F_ext), with respect to faithfulness (L_faith), plausibility (L_plaus), and task performance (L_task). In this example, we consider the sentiment analysis task. For task performance, F_task is trained via gold label y*_i to predict the sentiment – either positive (pos) or negative (neg) – of sentence x_i. Here, F_task's predicted label for x_i is y_i = pos. For plausibility, F_ext is trained via gold rationale r*_i to output human-aligned token importance scores s_i for x_i (§3.3.3.2). For faithfulness, s_i is binarized as rationale r_i via top-k% selection, then used to construct the comp (x_i \ r_i^(k)) and suff (r_i^(k)) inputs for F_task. With F_task's predicted probabilities for y_i, given x_i, x_i \ r_i^(k), and r_i^(k), respectively, the comp and suff losses are computed. The comp and suff losses align F_task's output with r_i, such that r_i becomes a faithful explanation of F_task's behavior (§3.3.3.1). Note that some parts of UNIREX are non-differentiable. Still, by having F_task and F_ext share a text encoder, we can approximate end-to-end training of both models, jointly with respect to all three desiderata (§3.3.4).
3.3 Rationale Extractor Types. In general, rationale extractor F_ext can be either heuristic or learned. A heuristic F_ext is a handcrafted attribution algorithm, which cannot be trained (§3.3.2.1). By default, UNIREX uses a learned F_ext, which can be optimized for faithfulness, plausibility, and task performance. For learned F_ext, we focus on two architectures (w.r.t. task model F_task): Dual LM designs F_task and F_ext as two fully separate LMs, while Shared LM designs F_task and F_ext to share the same text encoder (§3.3.2.2). Although some operations within UNIREX are non-differentiable, Shared LM's shared encoder allows us to approximate end-to-end training of both models w.r.t. all three desiderata (§3.3.4, Fig. 3.2).
3.4 Composite NRG Comparison (without Plausibility Optimization). The composite NRG (CNRG) is the mean of the three desiderata NRG scores. For each dataset, we use CNRG to compare rationale extraction methods that do not optimize for plausibility. Overall, UNIREX (AA-F) achieves the best CNRG on Dataset Mean (and on all datasets except Movies), showing the effectiveness of UNIREX's faithfulness optimization. On Dataset Mean, UNIREX (AA-F) beats the strongest baseline (i.e., SGT) by 9.2%.
3.5 Composite NRG Comparison (with Plausibility Optimization). The composite NRG (CNRG) is the mean of the three desiderata NRG scores. For each dataset, we use CNRG to compare rationale extraction methods that do optimize for plausibility. Overall, UNIREX (SLM-FP) and UNIREX (DLM-FP) achieve the best CNRG – both beating the strongest baseline (i.e., A2R+P) by over 30% on Dataset Mean – demonstrating UNIREX's ability to jointly optimize F_task and F_ext for all three desiderata.
3.6 Desiderata NRG Comparison. For each rationale extraction method, we show the desiderata NRG for faithfulness (FNRG), plausibility (PNRG), and task performance (TNRG), averaged over all datasets. Left: This plot compares methods without plausibility optimization. UNIREX (AA-F)'s FNRG is highest, while its TNRG is close to highest. Meanwhile, baselines with high FNRG (i.e., FRESH, A2R) have low TNRG, while baselines with high TNRG (i.e., AA (IG), L2E, SGT) have low FNRG. Right: This plot compares methods with plausibility optimization. UNIREX (DLM-FP) and UNIREX (SLM-FP) have moderate FNRG, but the highest (or near-highest) PNRG and TNRG. Meanwhile, baselines with high FNRG (i.e., FRESH+P, A2R+P) have low TNRG, while baselines with high TNRG (i.e., SGT+P) have low PNRG.
3.7 Gold Rationale Efficiency on SST. As shown in this plot, UNIREX (DLM-FP) and UNIREX (SLM-FP) are able to achieve high plausibility performance, even with a very small percentage of training instances with gold rationale annotations.
3.8 Gold Rationale Efficiency on CoS-E. As shown in this plot, UNIREX (DLM-FP) and UNIREX (SLM-FP) are able to achieve high plausibility performance, even with a very small percentage of training instances with gold rationale annotations.
4.1 Explanation Regularization (ER). Sometimes, task labels alone provide insufficient supervision for language model (LM) generalization. ER aims to improve generalization by training the LM so that its machine rationales (Which input tokens did the LM focus on?) align with human rationales (Which input tokens would humans focus on?) (§4.2).
4.2 ER-Test. While existing works focus on ER models' in-distribution (ID) generalization, the ER-Test framework is designed to evaluate ER models' out-of-distribution (OOD) generalization with respect to: (A) unseen dataset tests, (B) contrast set tests, and (C) functional tests (§4.3).
4.3 ER-Test Research Questions. To demonstrate ER-Test's utility, we use ER-Test to study four important yet underexplored research questions (RQs). Each RQ considers a different category of ER design choices: rationale alignment criteria (RQ1), human rationale type (RQ2), number/choice of rationale-annotated instances (RQ3), and rationale annotation time (RQ4). With ER-Test, we have a system for identifying ER design choices that are effective for improving OOD generalization (§4.5).
4.4 RQ1 - Functional Tests (§4.5.4.4). We compare various ER rationale alignment criteria (as well as the No-ER baseline), with respect to performance on a range of functional tests (OOD).
Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well as the mean normalized failure rate across all functional tests (Mean).
4.5 RQ2 - Functional Tests (§4.5.5.4). We compare ER model performance using instance-level rationales versus using task-level rationales, with respect to performance on a range of functional tests (OOD). Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well as the mean normalized failure rate across all functional tests (Mean).
4.6 RQ3 - Unseen Dataset Tests (§4.5.6.2). For the strongest instance selection strategy (LIS) and key baselines (No-ER, Random), we plot task performance (Accuracy) as a function of instance annotation budget (k). Table 4.5 presents more comprehensive results comparing No-ER, Random, and LIS to other instance selection strategies (LC, HC, HIS).
4.7 RQ3 - Functional Tests (§4.5.6.4). We compare ER model performance for five instance selection strategies across different instance annotation budgets (k% of training data), with respect to performance on a range of functional tests (OOD). Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well as the mean normalized failure rate across all functional tests (Mean).
4.8 RQ4 - Unseen Dataset Tests (§4.5.7.2). We compare ER model performance for three instance annotation types across different time budgets, with respect to performance on seen (ID) and unseen (OOD) datasets. For each instance annotation type (Label Only, Expl Only, Label+Expl), we plot task performance (Accuracy) as a function of additional time budget for annotation. Note that the model trained with 0 min additional time budget corresponds to the No-ER baseline.
4.9 RQ4 - Functional Tests (§4.5.7.4). We compare ER model performance for three instance annotation types across different time budgets, with respect to performance on a range of functional tests (OOD). Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well as the mean normalized failure rate across all functional tests (Mean).
4.10 RQ4 - Label Only UI. UI used for Label Only MTurk annotations.
4.11 RQ4 - Expl Only UI. UI used for Expl Only MTurk annotations.
4.12 RQ4 - Label+Expl UI. UI used for Label+Expl MTurk annotations.
5.1 KG Saliency Explanations for Commonsense QA. Across different questions, the KG's usefulness can vary considerably. Coarse explanations indicate if the KG is useful overall, while fine explanations highlight useful nodes or paths. Here, the fine explanations state that the market, produce, and merchant nodes are useful, while the other nodes are not.
5.2 KG-Augmented Models fuse knowledge from text and KG inputs to solve CSR tasks.
5.3 Schematics for Oracle and SalKG Models.
Red arrows indicate the Oracle pipeline, where the target explanation is provided as input. Purple arrows indicate the SalKG pipeline, where the target explanation is used as supervision for the predicted explanation. In SalKG-Coarse and SalKG-Hybrid, the saliency predictor has the same architecture asF KG . Meanwhile,Oracle-Fine andSalKG-Fine (shown as white module, with text encoder and task predictor omitted) both have the same architecture asF KG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.4 Low-ResourceLearning. CSQA test accuracy for No-KG, KG, and SalKG-Coarse, when using varying amounts of training data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 5.5 Examples of coarse/fine saliency explanations. Illustration of examples presented in §5.6.4. Blue denotes given answer choice, while red denotes target answer. . . . . . . . . . 144 5.6 Moreexamplesofcoarse/finesaliencyexplanations. Illustration of examples presented in §5.8.16. Blue denotes given answer choice, while red denotes target answer. . . . . . . . . 144 6.1 A Symbiotic Explainability Framework. In this explainability framework, NLP systems and humans continually work together to generate, refine, and learn from explanations. By following this framework, humans can develop trust in NLP systems as partners in high-stakes decision-making. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 xxvi Abstract Neural language models (LMs) have yielded remarkable success on a wide range of natural language pro- cessing (NLP) tasks. However, LMs sometimes exhibit undesirable behavior, which can be difficult to resolve due to LMs’ opaque reasoning processes. This lack of transparency poses serious concerns about LMs’ trustworthiness in high-stakes decision-making, thus motivating the use of machine explanations to automatically interpret how LMs make their predictions. In my thesis, I argue that building human trust in NLP systems requires being able to: (A) generate machine explanations for LM behavior faithfully and plausibly and (B)utilize machine explanations to improve LM generalization and decision-making. First, to address (A), I propose UNIREX, a unified learning framework for jointly optimizing machine explanations with respect to both faithfulness and plausibility, without compromising the LM’s task performance. Sec- ond, for (B), I introduce ER-Test, a framework for evaluating the out-of-distribution generalization ability of LMs that are regularized via strongly-supervised machine explanations. Third, to further support (B), I present SalKG, an algorithm for improving LM generalization by regularizing LMs via weakly-supervised machine explanations. Finally, I discuss several future directions for achieving (A) and (B). xxvii PartI Background 1 Chapter1 Introduction 1.1 TheCaseforTrustworthyNLP A neural language model (LM) is a neural network that outputs probability distributions over sequences of language tokens (e.g., words) [48, 171]. In recent years, LMs have yielded remarkable success in natural language processing (NLP) and become a core component of modern NLP systems [193, 21, 195, 152, 48]. This success is largely due to the general recipe of first pre-training the LM on massive unlabeled corpora, then fine-tuning [195, 152, 48] or prompting [21, 194, 250] the LM on downstream NLP tasks. 
Following this recipe, LMs have achieved state-of-the-art performance on a wide range of NLP tasks, such as sentiment analysis [179, 223], natural language inference [253, 19], and question answering [232, 248]. On various NLP benchmarks, LMs have even surpassed human performance [242, 243]. As a result, contemporary LMs have been promoted by mainstream media as exhibiting capabilities indicative of artificial general intelligence (AGI) [169, 1, 188]. Based on these impressive developments, it could be reasonable for one to conclude that today’s NLP systems are already trustworthy enough to be used for automated, high-stakes decision-making in the real world. Nonetheless, LMs’ success on benchmark leaderboards does not always translate to real-world settings. The test data used in leaderboard datasets tend to be independent and identically distributed (i.i.d.) with respect to the LM’s training data, thus evaluating only a narrow notion of generalization ability [66]. 2 Figure 1.1: LMsStruggleonOODData. In-the-wild test data are often out-of-distribution (OOD) with respect to the LM’s training data, which can cause the model to fail in various ways in real-world settings [66, 41, 216]. Image credit: [143]. On the other hand, in-the-wild test data are often not i.i.d. with the LM’s training data, which can cause the model to fail in a variety of ways [66, 41, 216]. 1.1.1 ThreePitfallsofToday’sNLPSystems Unintended LM Behavior When given even slightly out-of-distribution (OOD) inputs, it is common for LMs to produceunintended outputs [191, 262, 100, 73, 206, 175, 52]. For example, the sentiment analysis task involves predicting whether a given passage is expressing positive or negative sentiment [277, 223, 166]. Intuitively, swapping or dropping one or two letters in one word of the passage should rarely affect the overall meaning of the passage. However, a number of studies have shown that LMs are often not robust to such trivial perturbations [191, 100] (Fig. 1.2). This has serious implications for the reliability of real-world NLP applications that depend on sentiment analysis, like customer service management [56, 268, 173] and brand monitoring [12, 186]. 3 Figure 1.2: Unintended LM Behavior. Intuitively, swapping (e.g., “beautiful”→ “beuatiful”) or dropping (e.g., “downbeat”→ “dwnbeat”) one or two letters in one word of the passage should rarely affect the sentiment of a passage. However, recent studies have shown that LMs are not robust to such trivial perturbations [191, 100]. This has serious implications for the reliability of real-world NLP applications [56, 12]. Image Credit: [191]. Unfair LM Behavior LMs have been shown to often behave in a manner that propagates undesirable biases and produces unfair societal outcomes [168, 28, 11, 154, 228, 114]. As a result of LM decisions, people can be harmfully discriminated against based on factors like race [58, 44, 211, 16, 109, 176, 43] and gender [17, 154, 6, 279, 280, 281, 184, 282]. For example, the hate speech detection task involves predicting whether a given passage constitutes hate speech [160, 108, 70]. In particular, studies have shown that LMs strongly associate certain group identifier words ( e.g., “Africans”, “black”, and “white”) with hate speech, regardless of how these words are actually used [109, 211, 44] (Fig. 1.2). Even so, it is common for social media platforms to use LMs to detect hate speech on their platforms [177, 117, 176, 273]. 
This can lead to the unfair removal of innocuous posts written by or about certain people groups, hence marginalizing the voices of such groups in online discourse [260, 211, 109]. UnaccountableLMBehavior When NLP systems make dangerous mistakes, NLP users and practition- ers need to be able to understand why [123, 37, 94, 53]. That is, LMs should have accountability for their decision-making. Here, accountability refers to the ability to identify and measure the impact of specific actions within the machine learning pipeline that contributed to undesirable LM behavior [112, 46, 198]. 4 Figure 1.3: UnfairLMBehavior. This figure shows two passages obtained from the Internet [109]. The first pas- sage comes from the New York Times and gives relatively benign commentary about how many Africans fear hatred between their own people. Meanwhile, the second passage comes from Gab.com [108, 70], a white supremacist forum, and explicitly claims that white people are superior to black people — this constitutes actual hate speech. Nonetheless, recent works have found that state-of-the-art LMs classify both passages as hate speech, due to associ- ating certain group identifier words ( e.g., “Africans”, “black”, and “white”) with hate speech [109, 211, 44]. This can lead to the unfair marginalization of such groups in online discourse [260, 211, 109]. Image Credit: [109]. Such actions could occur during data collection, data processing, model design, model training, model eval- uation, or model deployment [112, 92, 46, 198, 50]. However, despite their unintended and unfair behavior, LMs remain largely unaccountable [72, 42, 11, 148]. LMs’ complex reasoning processes are notoriously opaque, making such models essentially function as black box systems [72, 144, 205] (Fig. 1.4). Without transparency in the LM’s decision process, it becomes very difficult to debug the NLP system when things go wrong (e.g., exhibiting unintended or unfair behavior) [148, 72, 145]. Even policymakers are taking notice of this issue. As of April 2018, the European Union requires algorithms that make user-level deci- sions of significant impact to provide an explanation for their decisions [75]. Yet, the explanations we can currently obtain are still far from satisfactory [94, 122, 146, 53, 269]. Furthermore, it is often unclear how such explanations should best be used to support LM decision-making [81, 79, 36, 203, 263, 147] or human decision-making [123, 37, 121, 202, 222, 217, 94]. 1.2 ChallengesinNLPExplainability Explainability can play a number of important roles in supporting collaboration between humans and LMs [131, 81, 79, 123]. As described earlier, a key role of explainability is to help humans debug problematic LM 5 Figure 1.4: Unaccountable LM Behavior. LMs’ complex reasoning processes are notoriously opaque, making such models essentially function as black box systems [72, 144, 205]. Without transparency in the LM’s decision process, it becomes very difficult to debug the NLP system when things go wrong [148, 72, 145]. Image credit: Black boxes and their intrusion. behavior [123, 37, 94, 53]. But what if the LM already generalizes perfectly? Another role is to promote scientific understanding [42, 241, 240, 187, 138]. If an LM is able to make important decisions effectively, it is natural that we would want to verify and analyze the LM’s decision-making ability in various settings. Plus, even if the LM has perfect generalization, it may be flawed in other ways. 
Understanding how the LM works can give us valuable insight for improving other aspects of the LM, like compute efficiency [234, 115, 271, 10] and personalization [60, 136, 244, 258]. An additional role is to resolve human-LM misalignment [83, 39, 61, 113, 62, 181]. In particular, this may occur when the LM makes a “correct” decision that a human disagrees with. Suppose the human’s reasoning is objectively flawed. In this case, we should be able to educate the human by explaining why their reasoning is incorrect and why the model’s reasoning is correct [21, 235, 87, 276, 77]. On the other hand, suppose the human is not necessarily incorrect, but the human and LM have different value systems. In this case, we should be able to explain to the human what the LM’s value system is and how it was trained to uphold these values. Then, the human can decide whether they want to continue using this LM as is, modify the LM, or look for a different LM to use. 6 In light of this, the lack of LM explainability continues to be a major obstacle to human trust in NLP systems as high-stakes decision-makers [148, 11, 132, 139, 23, 84]. While humans may have certain ex- pectations of how LMs should behave in particular situations, LMs’ black box nature prevents humans from (manually) interpreting LMs’ behavior. Thus, we need some kind of explanation algorithm that can (automatically) analyze the LM’s decision process and then explain it to humans [14, 22, 42, 72]. With this in mind, we refer to such algorithmic explanations asmachineexplanations [226, 33, 3]. One common type of machine explanation is theextractiverationale, which highlights the input tokens that contributed most to the LM’s prediction [157, 229, 98, 130]. In recent years, many algorithms have been proposed to generate machine explanations for LMs [42, 157, 229, 98, 130, 178, 25, 196, 156, 135, 250]. However, existing explanation algorithms still suffer from a number of issues. First, existing algorithms may pro- duce explanations that are not faithful, which means the explanations do not accurately reflect the LM’s true reasoning process [96, 122, 82, 49, 86, 95, 192, 119, 251, 8]. Second, existing algorithms may produce explanations that are not plausible, which means that the explanations are not convincing to humans as being reflective of the LM’s reasoning process [96, 80, 82, 226, 49, 54, 165, 257, 174]. Note that plausibility does not require that humans agree with the LM’s reasoning. Third, after obtaining these explanations, what should users do with them? It remains an open question how explanations should beutilized to most effectively improve LM generalization and decision-making [81, 79, 36, 203, 263, 147]. 1.3 Thesis 1.3.1 OverviewofContributions Present LMs’ unintended, unfair, and unaccountable decisions stand as major obstacles to the trustworthi- ness of NLP systems. To address this situation, my thesis investigates how model explainability can bridge 7 Figure 1.5: Thesis Statement. LMs’ unintended, unfair, and unaccountable behaviors serve as critical obstacles to trustworthy NLP systems. To build human trust in NLP systems, we must be able to: (A) Generate machine explanations for LM behavior faithfully and plausibly. (B)Utilize machine explanations to improve LM generalization and decision-making. this gap between NLP systems and human trust. But how do we progress towards building this explain- ability bridge (Fig. 1.5)? The first pillar of the bridge is generatingexplanations to understand LM behavior. 
We want to be able to generate machine explanations that faithfully capture the LM’s reasoning process while also plausibly making sense to humans. The second pillar of the bridge is utilizing explanations to improve LM decision-making. We want to be able to use machine explanations as supervision signal for regularizing the LM’s behavior while also allowing humans to refine these explanations to further improve the LM. To summarize, my thesis statement is as follows. To build human trust in NLP systems, we must be able to: (A) Generate machine explanations for LM behavior faithfully and plausibly. (B) Utilize machine explanations to improve LM generalization and decision-making. My thesis revolves around three research questions. First, how can we generate machine explanations thatarebothfaithfulandplausible, withouthurtingtheLM’staskperformance? This is critical for enabling humans to effectively understand how NLP systems make decisions. There are many existing works for generating machine explanations (e.g., extractive rationales), but none satisfy all three desiderata of faith- fulness, plausibility, and task performance. To address this question, I propose UNIREX [30], a unified 8 learning framework for jointly optimizing machine explanations with respect to both faithfulness and plau- sibility without compromising the LM’s task performance. Second, how can we utilize strongly-supervised machine explanations to improve the LM’s task performance? If machine explanations reflect the LM’s be- havior, then we should be able to improve the LM’s behavior by aligning the machine explanations with human explanations (i.e., strong supervision) for the same task. This alignment process is known as ex- planation regularization (ER). However, the impact of ER on OOD generalization is unclear, since most prior ER works have focused on evaluating in-distribution (ID) generalization. To address this, I introduce ER-Test [104], a framework for evaluating the out-of-distribution generalization ability of LMs that are regularized via strongly-supervised machine explanations. Third, how can we utilize weakly-supervised machine explanations to improve the LM’s task performance? Although human explanations are generally reliable, they are also very expensive to obtain. Meanwhile, machine explanations from another trained LM are noisy, but essentially free. In particular, the trained LM’s machine explanations can be config- ured to explain not only which features most influenced the LM to output the label it predicted, but also which features would most influence the LM to output the correct label. This motivates us to investigate how LMs can be regularized by aligning their machine explanations with another LM’s machine expla- nations, especially if the machine explanations are based on structured modalities like knowledge graphs (KGs). To address this question, I presentSalKG [31], an algorithm for improving LM task performance by regularizing LMs via weakly-supervised, KG-based machine explanations. 1.3.2 ThesisOrganization PartI Part I establishes the motivation for the thesis and provides a review of relevant literature. Chap- ter1 describes the importance of trustworthy NLP (§1.1), the challenges in NLP explainability (§1.2), and how my thesis addresses these challenges (§1.3). Chapter2 gives an overview of related work on gener- ating (§2.1) and utilizing (§2.2) machine explanations in NLP. 9 PartII Part II describes my work on generating machine explanations faithfully and plausibly. 
Chapter 3 addresses Research Question 1 (RQ1): Howcanwegeneratemachineexplanationsthatarebothfaithfuland plausible, withouthurtingtheLM’staskperformance? In this chapter, I present UNIREX, a unified learning framework for jointly optimizing machine explanations with respect to both faithfulness and plausibility without compromising the LM’s task performance. PartIII Part III describes my work on utilizing machine explanations to improve LM generalization and decision-making. Chapter4 addresses Research Question 2 (RQ2): How can we utilize strongly-supervised machine explanations to improve the LM’s task performance? In this chapter, I present ER-Test, a frame- work for evaluating the out-of-distribution generalization ability of LMs that are regularized via strongly- supervised machine explanations. Chapter 5 addresses Research Question 3 (RQ3): How can we utilize weakly-supervised machine explanations to improve the LM’s task performance? In this chapter, I present SalKG, an algorithm for improving LM task performance by regularizing LMs via weakly-supervised, KG- based machine explanations. PartIV Part IV is the conclusion of the dissertation. Chapter 6 gives a summary of the works covered in the dissertation and discusses several promising directions for future work (§6.1). 1.3.3 RelationshiptoPublishedWork * Equal contribution. Chapter 3 Aaron Chan, Maziar Sanjabi, Lambert Mathias, Liang Tan, Shaoliang Nie, Xiaochang Peng, Xiang Ren, and Hamed Firooz. “UNIREX: A Unified Learning Framework for Language Model Rationale Extraction”. In: Proceedings of the 39th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato. Vol. 162. Proceedings of Machine Learning Research. PMLR, 17–23 Jul 2022, pp. 2867–2889. [30] 10 • SRML Workshop at ICLR 2022 • BigScience Workshop at ACL 2022 Chapter4 Brihi Joshi*, Aaron Chan*, Ziyi Liu*, Shaoliang Nie, Maziar Sanjabi, Hamed Firooz, and Xiang Ren. “ER-Test: Evaluating Explanation Regularization Methods for Language Models”. In: Findings of the Association for Computational Linguistics: EMNLP 2022. [104] • TrustNLP Workshop at NAACL 2022 Chapter5 Aaron Chan, Jiashu Xu, Boyuan Long, Soumya Sanyal, Tanishq Gupta, and Xiang Ren. “SalKG: Learning From Knowledge Graph Explanations for Commonsense Reasoning”. In: Advances in Neural In- formation Processing Systems 34 (2021). [31] • XAI Workshop at ICML 2021 Other Published Work The following papers are beyond the scope of my dissertation but were also published during my PhD. • Mrigank Raman, Aaron Chan*, Siddhant Agarwal*, Peifeng Wang, Hansen Wang, Sungchul Kim, Ryan Rossi, Handong Zhao, Nedim Lipka, and Xiang Ren. “Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation”. In: InternationalConferenceonLearningRepresenta- tions. 2021. [197] – KR2ML Workshop at NeurIPS 2020 (Best Paper Award Finalist) • Jun Yan, Mrigank Raman, Aaron Chan, Tianyu Zhang, Ryan Rossi, Handong Zhao, Sungchul Kim, Nedim Lipka, and Xiang Ren. “Learning Contextualized Knowledge Structures for Commonsense Reasoning”. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021, pp. 4038–4051. [261] 11 – KR2ML Workshop at NeurIPS 2020 • Aaron Chan, Shaoliang Nie, Liang Tan, Xiaochang Peng, Hamed Firooz, Maziar Sanjabi, and Xiang Ren. “FRAME: Evaluating Rationale-Label Consistency Metrics for Free-Text Rationales”. In: arXiv preprint arXiv:2207.00779 (2022). 
[29] – BlackboxNLP Workshop at EMNLP 2022 • Dong-Ho Lee*, Akshen Kadakia*, Brihi Joshi, Aaron Chan, Ziyi Liu, Kiran Narahari, Takashi Shibuya, Ryosuke Mitani, Toshiyuki Sekiya, Jay Pujara, and Xiang Ren. “XMD: An End-to-End Framework for Interactive Explanation-Based Debugging of NLP Models”. In: arXiv preprint arXiv:2210.16978 (2022). [126] • Peifeng Wang, Aaron Chan, Filip Ilievski, Muhao Chen, and Xiang Ren. “PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales”. In: International Conference on Learning Represen- tations. 2022. [245] – TL4NLP Workshop at NeurIPS 2022 – TSRML Workshop at NeurIPS 2022 • Aaron Chan*, Zhiyuan Zeng*, Wyatt Lake, Brihi Joshi, Hanjie Chen, and Xiang Ren. “KNIFE: Dis- tilling Meta-Reasoning Knowledge with Free-Text Rationales”. In: arXiv preprint arXiv:2212.09721 (2022). [32] – TrustML-(un)Limited Workshop at ICLR 2023 • Brihi Joshi*, Ziyi Liu*, Sahana Ramnath, Aaron Chan, Zhewei Tong, Qifan Wang, Yejin Choi, and Xiang Ren. “Are Machine Rationales (Not) Useful to Humans? Measuring and Improving the Human Utility of Free-Text Rationales”. In: OpenReview preprint h-aJ39Z3Tc (2022). [105] – TRAIT Workshop at CHI 2023 12 Chapter2 LiteratureReview 2.1 GeneratingMachineExplanations Broadly, LM explainability works can be categorized as either generating extractive rationales (§2.1.1) or free-text rationales (§2.1.2). Traditionally, the LM explainability literature has mostly considered extractive rationales, due to their relative simplicity [157, 23]. However, recent works have focused more on free- text rationales [178, 250], due to the increasing prevalence of encoder-decoder [195, 178] and decoder-only [21, 194] Transformer LMs for free-text generation. This dissertation focuses on the former category of extractive rationales. 2.1.1 ExtractiveRationales An extractive rationale explains an LM’s predicted output on a given task instance by scoring input fea- tures’ (e.g., tokens) influence on the LM’s output [47, 229, 135, 102, 156]. This feature scoring ( i.e., rationale extraction) can be done using various methods. First, many rationale extraction works rely in some way on attribution algorithms (AAs), which extract rationales via handcrafted functions [229, 93, 221]. Typi- cally, this involves designing gradient-based [229, 47, 156, 134] or perturbation-based [135, 189, 106] AAs. Second, many rationale extraction works use a specialized select-predict pipeline (SPP), where a predictor module is trained to solve the task using only tokens chosen by a selector module [98, 270, 182]. Third, 13 a number of works simply use attention weights for rationale extraction [192, 225, 252, 174, 236, 69, 127], although some have argued that they are not effective [8]. Fourth, some works train a rationale extraction model, based on certain explainability objectives [98, 221]. By binarizing these feature attribution scores, we can highlight the tokens that were most important to the LM’s prediction. For example, this binariza- tion can be performed using the top-k% strategy, where the features with the top-k% highest scores are mapped to one (important) while the remaining features are mapped to zero (unimportant) [49, 98, 192]. Besides text-based extractive rationales, some works have also considered graph-based extractive ra- tionales to explain the predictions of models that take graph inputs. For example, KG-augmented models consist of an LM and a graph encoder, using both text and graph inputs to solve reasoning tasks [140, 57, 151, 246, 261]. 
The KG-augmented model’s graph encoder usually computes graph embeddings via attention pooling of nodes/paths, and the attention weights can be used to explain which nodes/paths in the input KG are salient [140, 57, 151, 246, 261]. These KG explanations can be interpreted as identifying knowledge in the KG that is complementary to the knowledge encoded in the LM. There are also meth- ods proposing extractive rationales for graph encoders, especially graph neural networks (GNNs). Such rationales are designed to point out components in the graph input that contribute most to the model’s prediction. Some GNNs use attention for pooling, which naturally highlights nodes with higher attention weights [129, 128]. More sophisticated approaches use post hoc optimization to identify salient nodes [89, 267] or subgraphs [267]. 2.1.1.1 EvaluatingExtractiveRationales How rationale extraction methods should be evaluated (and thus selected) remains an open question and the subject of much recent debate [8, 252, 215, 97, 2, 49, 192, 118]. Two common desiderata of extractive rationales are faithfulness and plausibility, although it is also an open question how these desiderata should best be measured. We discuss these desiderata below. 14 Faithfulness Faithfulness means how accurately a rationale reflects the LM’s true reasoning process for its prediction [96]. Hence, faithfulness metrics aim to measure the extent to which the highlighted tokens in the extractive rationale influence the LM’s prediction ( e.g., confidence probability for LM’s predicted la- bel) [49, 219, 86, 192]. Recently, comprehensiveness and sufficiency have emerged as popular faithfulness metrics in the explainability literature [49]. Comprehensiveness measures the change in the predicted la- bel’s confidence probability when the rationale tokens are removed from the input. That is, if the rationale tokens are truly influential, then removing them from the input should decrease the LM’s confidence prob- ability for its predicted label. Thus, higher comprehensiveness indicates higher faithfulness. Sufficiency measures the change in the predicted label’s confidence probability when only the rationale tokens are kept in the input. That is, if the rationale tokens are truly influential, only keeping them in the input should not decrease the LM’s confidence probability for its predicted label. Thus, lower sufficiency indicates higher faithfulness. Furthermore, [192] proposed evaluating rationale faithfulness (to a teacher’s predictions) as how well the rationale distills knowledge about a teacher’s predictions to a student. Specifically, [192] trains a student model to mimic a teacher model’s predictions by regularizing the student model’s attention via rationales created from the teacher model, then measures the student model’s accuracy in predicting the teacher models’ predictions. Many prior works have tried to improve the faithfulness of extractive rationales through the use of AAs [8, 229, 47, 156, 134, 135, 189, 106]. AAs may have built-in faithfulness-related properties but cannot be directly trained and tend to be compute-intensive [8]. However, attribution algorithms cannot be optimized and tend to be compute-intensive (often requiring multiple LM forward/backward passes). Recently, [93] addressed the optimization issue by regularizing the task model to yield faithful rationales via the AA, while other works [221, 214] addressed the compute cost issue by training an LM (requiring only one forward pass) to mimic an AA’s behavior. 
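To make the comprehensiveness and sufficiency metrics described earlier in this subsection concrete, the following is a minimal Python sketch of how they can be computed for a single instance. The predict_proba helper, the token list, and the rationale mask are illustrative placeholders of our own rather than part of any library or system cited above; predict_proba stands in for any classifier routine that returns the model's probability for a given label on a given token sequence.

from typing import Callable, List, Tuple

def comp_and_suff(
    predict_proba: Callable[[List[str], int], float],  # assumed helper: P(label | tokens)
    tokens: List[str],
    rationale_mask: List[int],  # 1 = token is in the rationale, 0 = otherwise
    label: int,
) -> Tuple[float, float]:
    """Comprehensiveness and sufficiency for one predicted label.

    comp = P(label | full input) - P(label | input with rationale tokens removed)
    suff = P(label | full input) - P(label | rationale tokens only)
    Higher comp and lower suff indicate a more faithful rationale.
    """
    p_full = predict_proba(tokens, label)
    # Removing the rationale tokens should lower the probability if they truly matter (comp).
    without_rationale = [t for t, m in zip(tokens, rationale_mask) if m == 0]
    comp = p_full - predict_proba(without_rationale, label)
    # Keeping only the rationale tokens should preserve the probability (suff).
    rationale_only = [t for t, m in zip(tokens, rationale_mask) if m == 1]
    suff = p_full - predict_proba(rationale_only, label)
    return comp, suff

# Toy usage with a dummy sentiment "model" that simply counts positive words.
def dummy_predict_proba(tokens: List[str], label: int) -> float:
    p_pos = min(0.5 + 0.2 * sum(t in {"love", "great"} for t in tokens), 0.99)
    return p_pos if label == 1 else 1.0 - p_pos

print(comp_and_suff(dummy_predict_proba,
                    ["i", "love", "this", "great", "movie"],
                    [0, 1, 0, 1, 0], label=1))

Under this toy model, the rationale covering "love" and "great" yields high comp and near-zero suff, which is the pattern a faithful rationale should produce.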
Another line of work aims to produce faithful rationales by construction, via SPPs [98, 270, 182, 7, 269, 130]. The motivation for SPPs is that the extracted tokens can 15 be considered as a faithful rationale for the predictor’s output, since the predictor’s output depends only on the extracted tokens. Still, SPPs’ faithfulness can only guarantee sufficiency – not comprehensiveness [49]. Also, SPPs generally perform worse than vanilla LMs because they hide much of the original text input from the predictor and are hard to train end-to-end [98, 7, 130]. Plausibility Plausibility is defined as how convincingly a rationale explains a given model’s prediction, as judged by humans [96]. This can be measured either by automatically computing the similarity between the LM’s extracted rationales and human-annotated gold rationales [49], or by asking human annotators to rate whether the LM’s extracted rationales make sense for predicting the LM’s output [226, 53]. Typically, a gold rationale is a binary vector, where ones and zeros indicate important and unimportant tokens, respectively [130, 49]. Although AA-based rationale extraction methods are popular, AAs can be a bottleneck for plausibility, as producing human-like rationales is a complex objective requiring high capacity rationale extractors [178, 49]. Existing approaches for improving extractive rationale plausibility typically involve supervising LM-based extractors [13] or SPPs [98, 182, 49] with gold rationales. However, existing LM-based extractors have not been trained for faithfulness, while SPPs’ faithfulness by construction comes at the great cost of task performance. 2.1.2 Free-TextRationales Unlike extractive rationales which are limited to input token scoring, free-text rationales (FTRs) are de- signed to explain LMs’ decisions in a human-like manner via natural language and can describe things beyond the task input [178, 119, 196, 25, 183, 250, 163, 274]. Plus, FTRs support high flexibility in content, style, and length, since their only constraint is that they are expressed in natural language. This is im- portant for explaining LMs’ decisions on implicit reasoning tasks, since the knowledge required for such tasks is often not even part of the input [231, 67, 170, 111]. 16 Prior works on FTR generation can be grouped into three paradigms. In thefinetunedself-rationalization paradigm, a single LM is finetuned to jointly generate the task output and FTR [178, 274, 20, 137, 149, 162, 163]. Since the LM parameters are shared across two relatively dissimilar objectives, they often perform worse than non-rationalizing LMs [251, 178]. Notably, this paradigm requires expensive FTR annotations for all training instances. In the prompted self-rationalization paradigm, a single LM is instead frozen and prompted to jointly generate the task output and FTR, with the prompt consisting of a few input-output- FTR demonstrations [250, 88, 124, 163]. This paradigm performs well and only needs a few FTR annota- tions for the prompt, but it is computationally prohibitive since it generally requires very large-scale LMs to work effectively [250, 124]. In the pipeline rationalization paradigm, a finetuned rationalizing LM first generates the FTR, which is then used as input for a separate finetuned reasoning LM to generate the output [119, 196]. Here, the generated FTR forms a discrete (i.e., non-differentiable) bottleneck between the two modules, which complicates end-to-end training and can hurt task performance [251, 82]. 
Additionally, the dedicated rationalizing LM requires extra rationale annotation/computation costs. Although we focus on extractive rationales in this dissertation, our findings may also be applicable to free-text rationales and inspire future work in free-text rationale generation. 2.2 UtilizingMachineExplanations In NLP, machine explanations can be utilized to either improve LMs’ decision-making or humans’ decision- making. The former, often referred to as explanation-based learning, tends to involve automatically using the machine explanations as some sort of learning signal for LM training [81, 79]. The latter typically implies humans manually inspecting the machine explanations and adjusting their behavior based on new knowledge from the explanations [146, 53, 37]. This dissertation focuses on the former category. 17 2.2.1 ImprovingLMDecision-Making To improve LM behavior, many methods have been proposed for explanation-based learning [81, 79], especially using human-annotated explanations [233]. Explanations can be used to improve the model’s behavior in a diverse range of ways, such as extra supervision or regularization [192, 82, 178, 4], pruned inputs [98, 7, 130], additional inputs [81, 199], and intermediate variables [251, 286, 196]. ExtractiveRationales For extractive rationales, one common explanation-based learning paradigm is explanation regularization (ER), which regularizes the LM so that its extractive machine rationales (reflect- ing LM’s reasoning process) are aligned with extractive human rationales (reflecting humans’ reasoning process) [203, 90, 68, 272, 109, 201, 147]. In ER, the human rationale can be obtained by annotating each instance individually [272, 141, 25, 196, 49] or by applying domain-level lexicons across all instances [201, 203, 68, 109, 147]. The former approach is more expensive, while the latter approach has more limited applicability since it requires domain knowledge. Beyond ER, there are other ways to learn from expla- nations. [153] used human-in-the-loop feedback on machine rationales for data augmentation. [266] used machine rationales to calibrate black box models and improve their performance on low-resource domains. Free-TextRationales Since extractive rationales assign an importance score to each input feature, there is a one-to-one correspondence between the features and scores [229, 135, 30, 104]. However, unlike extractive rationales, FTRs do not have such a feature-score correspondence to leverage, which makes learning from FTRs less straightforward. In light of this, there are four main FTR learning paradigms, which largely overlap with the aforementioned FTR generation paradigms. In the input augmentation paradigm, the LM is finetuned to generate the task output given both the task input and an FTR [227, 245, 251, 82]. Still, this either assumes access to gold (or large-LM-generated) FTRs during inference, or introduces an input distribution shift between training and inference when FTRs are unavailable during inference. In the finetuned self-rationalization paradigm, the LM is finetuned to generate both the task output and an FTR 18 [178, 274, 20, 137, 149, 162, 163]. However, this may create conflict between the task and FTR objectives, especially if the FTR does not support the LM’s predicted task label. In the prompted self-rationalization paradigm, the LM is prompted to generate both the task output and an FTR [250, 88, 124, 163]. Yet, this generally requires very large LMs to work well [250, 124]. 
In the pipeline rationalization paradigm, a finetuned rationalizing LM first generates the FTR, which is then used as input for a finetuned reasoning LM to predict the task output [119, 196, 251, 82]. Here, the generated FTR forms a non-differentiable path between the two LMs, which complicates end-to-end training and can hurt task performance. Furthermore, in both self-rationalization and pipeline rationalization, LMs may be prone to hallucinating irrelevant FTRs, thus hurting task performance [88, 161, 137]. Although we focus on extractive rationales in this dissertation, our findings may also be applicable to free-text rationales and inspire future work in free-text rationale utilization. 2.2.2 ImprovingHumanDecision-Making In the explainability literature, machine explanations tend to be described in a way that implies their use in directly aiding humans’ manual decision-making [123, 37, 94, 72, 146, 53]. However, in this context, there is still much progress to be made in determining which desiderata are most important in machine explanation generation or how to generate machine explanations that best fulfill these desiderata [123, 37, 121, 202, 222, 217, 122, 118]. In this dissertation, we focus more on improving the decision-making of the LMs that assist humans, which can indirectly improve human decision-making. Thus, we leave the investigation of this category to future work. 19 PartII GeneratingMachineExplanations 20 Chapter3 LearningtoGenerateMachineExplanations In this chapter, we investigate RQ1: How can we generate machine explanations that are both faithful and plausible, without hurting the LM’s task performance? 3.1 Introduction In recent years, neural language models (LMs) have yielded state-of-the-art performance on a wide range of natural language processing (NLP) tasks [48, 152]. However, LMs’ complex processes are notoriously opaque [205], posing concerns about the societal implications of using LMs for high-stakes decision- making [11]. Thus, explaining LMs’ behavior is crucial for promoting trust, ethics, and safety in NLP systems [53, 146]. Given a LM’s (i.e., task model’s) predicted label on a text classification instance, an extractive rationale is a type of explanation that highlights the tokens that most influenced the model to predict that label [157]. To provide meaningful explanations, rationale extraction should be faithful (re- flective of LM’s actual behavior) [93, 98] and plausible (convincing to humans) [49], without compromising the LM’s task performance [49, 96] (Fig. 3.1). Configuring the rationale extractor and its training process can greatly impact these desiderata, yet prior works have commonly adopted at least one of the following suboptimal heuristic design choices. First, many works rely in some way on attribution algorithms (AAs), which extract rationales via handcrafted 21 Figure 3.1: Desiderata of Rationale Extraction. Ideally, rationale extraction should be faithful and plausible, without compromising the task model’s task performance. Unlike prior works, UNIREX enables optimizing the ra- tionale extractor for all three desiderata. functions [229, 93, 221]. AAs may have built-in faithfulness-related properties but cannot be directly trained and tend to be compute-intensive [8]. The most similar work to ours is SGT [93], which regularizes a task model to produce faithful AA-based rationales. 
Still, AAs can be a bottleneck for plausibility, as producing human-like rationales is a complex objective requiring high capacity rationale extractors [178, 49]. Second, many works use a specializedselect-predictpipeline (SPP), where a predictor module is trained to solve the task using only tokens chosen by a selector module [98, 270, 182]. Instead of faithfulness optimization, SPPs heuristically aim for “faithfulness by construction" by treating the selected tokens as a rationale for the predictor’s output (which depends only on those tokens). Still, SPPs typically have worse task performance than vanilla LMs since SPPs hide the full input from the predictor and are hard to train end-to-end [98, 7, 130]. Both AAs and SPPs utilize heuristics that fundamentally limit the rationale extractor from achieving all three desiderata. 22 To address this challenge, we propose the UNIfied Learning Framework for Rationale EXtraction (UNIREX), which generalizes rationale extractor optimization as follows: (1) specify architecture for a learned rationale extractor; (2) select explainability objectives (i.e., faithfulness and plausibility criteria); and (3) jointly train the task model and rationale extractor on the task using selected objectives (§3.3). UNIREX enables replacing prior works’ heuristic design choices in (1) with a generic learned rationale extractor and optimizing it for all three desiderata in (2)-(3). UNIREX provides great flexibility in performing (1)-(3). For (1), any model architecture is applicable, but we study Transformer LM based rationale extractors in this work [271, 49]. We focus on two architec- tures: (A) Dual LM, where task model and rationale extractor are separate; and (B) Shared LM, where task model and rationale extractor share parameters. For (2), any faithfulness and plausibility criteria can be used. Following [49], we focus on comprehensiveness and sufficiency as faithfulness criteria, while using similarity to gold rationales as plausibility criteria. For (3), trade-offs between the three desiderata can be easily managed during rationale extractor optimization by setting arbitrary loss weights for the faith- fulness and plausibility objectives. Furthermore, although computing the faithfulness criteria involves discrete (non-differentiable) token selection, using the Shared LM architecture can approximate end-to- end training and enable both task model and rationale extractor to be optimized with respect to all three desiderata (§3.3.4). To evaluate all three desiderata in aggregate, we introduce the Normalized Relative Gain (NRG) metric. On five English text classification datasets – SST, Movies, CoS-E, MultiRC, and e-SNLI [27, 49] – our best UNIREX configuration outperforms the strongest baselines by an average of 32.9% NRG (§3.4.4), showing that UNIREX can optimize rationale extractors for all three desiderata. In addition, we verify our UNIREX design choices via extensive ablation studies (§3.4.5). Moreover, UNIREX-trained extractors have con- siderable generalization power, yielding high plausiblity with minimal gold rationale supervision (§3.4.6) 23 and high faithfulness on unseen datasets/tasks (§3.4.7). Finally, our user study shows that humans judge UNIREX rationales as more plausible than rationales extracted via other methods (§3.4.8). 3.2 ProblemFormulation We formalize rationale extraction and discuss how extractive rationales are evaluated, in the context of text classification. 
3.2.1 RationaleExtraction Here, we considerF task = f task (f enc (·)) as a task model for M-class text classification (§3.6.1), where f enc is the text encoder while f task is the task output head. In modern NLP systems,F task usually has a BERT-style architecture [48], in whichf enc is a Transformer network [238] whilef task is a linear layer with softmax classifier. Let x i =[x t i ] n t=1 be then-token input sequence (e.g., a sentence) for task instancei, and F task (x i )∈R M be the logit vector for the output of the task model. We usey i = argmax j F task (x i ) j to denote the class predicted byF task . GivenF task , x i , and y i , the goal of rationale extraction is to output vector s i = [s t i ] n t=1 ∈R n , such that eachs t i ∈R is an importance score indicating how strongly tokenx t i influenced F task to predict classy i . LetF ext denote a rationale extractor, such thats i =F ext (F task ,x i ,y i ).F ext can be a learned model or heuristic function. In practice, the final rationale is typically obtained by binarizing s i asr i ∈{0,1} n , via the top-k% strategy:r t i =1 ifs t i is one of the top-k% scores ins i ; otherwise,r t i =0 [49, 98, 192, 31]. While other binarization strategies can be used (e.g., score threshold, highest-scoring contiguousk-token span), we focus on top-k% in this study, since this strategy is most prevalent in the explainability literature. For top-k%, letr (k) i denote the “important" (i.e., ones) tokens inr i , when using0≤ k≤ 100. 24 3.2.2 ThreeDesiderataofRationaleExtraction To provide meaningful explanations, rationale extraction viaF ext should befaithful andplausible, without significantly hurting F task ’s task performance [49]. Faithfulness Faithfulness means how accurately a rationale reflects F task ’s true reasoning process for predictingy i [96]. Hence, faithfulness metrics aim to measure the extent to which the r (k) i tokens influ- encep y i (x i ), which denotesF task ’s confidence probability for y i when usingx i as input [49, 219, 86, 192]. Recently, comprehensiveness and sufficiency have emerged as popular faithfulness metrics in the explain- ability literature [49]. Comprehensiveness (comp) measures the change inp y i whenr (k) i isremoved from the input: comp = p y i (x i )− p y i (x i \r (k) i ). That is, if the r (k) i tokens are truly influential, then remov- ing them from the input should decreaseF task ’s predicted probability fory i . Thus, higher comp indicates higher faithfulness. Sufficiency (suff) measures the change in p y i when only r (k) i is kept in the input: suff = p y i (x i )− p y i (r (k) i ). That is, if ther (k) i tokens are truly influential, only keeping them in the input should not decreaseF task ’s predicted probability fory i . Thus, lower suff indicates higher faithfulness. Plausibility Plausibility is defined as how convincingly a rationale explains a given model’s prediction, as judged by humans [96]. This can be measured either by automatically computing the similarity between F ext ’s rationales (eithers i orr i ) and human-annotated gold rationales [49], or by asking human annotators to rate whetherF ext ’s rationales make sense for predicting y i [226, 53]. Typically, a gold rationale is a binary vectorr ∗ i ∈{0,1} n , where ones and zeros indicate important and unimportant tokens, respectively [130, 49]. 
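To make the top-k% binarization strategy from §3.2.1 concrete, the following is a minimal Python sketch (our own illustration, not the implementation used in our experiments) that converts a vector of token importance scores s_i into a binary rationale r_i; the resulting mask is exactly what the comp and suff computations above operate on.

from typing import List
import math

def topk_binarize(scores: List[float], k_percent: float) -> List[int]:
    """Map the tokens with the top-k% highest importance scores to 1 (important)
    and all remaining tokens to 0 (unimportant)."""
    n = len(scores)
    num_keep = max(1, math.ceil(n * k_percent / 100.0))
    top_indices = sorted(range(n), key=lambda t: scores[t], reverse=True)[:num_keep]
    rationale = [0] * n
    for t in top_indices:
        rationale[t] = 1
    return rationale

# Example: a 5-token input with k = 40, so the 2 highest-scoring tokens are kept.
scores = [0.01, 0.40, 0.05, 0.30, 0.95]
print(topk_binarize(scores, k_percent=40))  # -> [0, 1, 0, 0, 1]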
TaskPerformance Task performance, in the context of rationale extraction, concerns how muchF task ’s task performance (on the test set) drops whenF task is trained with explainability objectives (i.e., faithful- ness, plausibility) forF ext . As long asF task is trained with non-task losses,F task ’s task performance can 25 Figure 3.2: UNIREX Framework. UNIREX enables us to jointly optimize the task model (F task ) and rationale extractor (F ext ), with respect to faithfulness (L faith ), plausibility (L plaus ), and task performance (L task ). In this example, we consider the sentiment analysis task. For task performance,F task is trained via gold label y ∗ i to predict the sentiment – either positive (pos) or negative (neg) – of sentence x i . Here, F task ’s predicted label for x i is y i = pos. For plausibility, F ext is trained via gold rationale r ∗ i to output human-aligned token importance scores s i for x i (§3.3.3.2). For faithfulness, s i is binarized as rationale r i via top-k% selection, then used to construct the comp (x i \r (k) i ) and suff ( r (k) i ) inputs forL task . WithL task ’s predicted probabilities fory i , givenx i ,x i \r (k) i , andr (k) i , respectively, the comp and suff losses are computed. The comp and suff losses align L task ’s output withr i , such thatr i becomes a faithful explanation ofL task ’s behavior. (§3.3.3.1). Note that some parts of UNIREX are non-differentiable. Still, by havingL task andL ext share a text encoder, we can approximate end-to-end training of both models, jointly with respect to all three desiderata (§3.3.4). be affected. Note that this means post hoc ( i.e., introduced afterF task training is over) rationale extraction will not affect F task ’s task performance. In general, the main goal ofF task is high task performance, so we should ideally improveF ext with respect to the other desiderata without hurtingF task ’s task performance. To measure task performance, we use standard dataset-specific performance metrics ( e.g., accuracy, F1). 3.3 UNIREX We present the UNIREX learning framework, which enables us to jointly optimize the task model and rationale extractor with respect to faithfulness, plausibility, and task performance. 26 3.3.1 FrameworkOverview Given task modelF task , UNIREX generalizes rationale extractor optimization as follows: (1) choose archi- tecture for a learned rationale extractorF ext ; (2) select explainability objectives (i.e., faithfulness lossL faith and plausibility lossL plaus ); and (3) jointly trainF task andF ext usingL task (task loss),L faith , andL plaus . As shown in Fig. 3.2, UNIREX training consists of two backpropagation paths. The first path is used to update F task with respect toL task andL faith . WhereasL task is computed with respect to the task targety ∗ i ,L faith is computed only using the task inputx i and the top-k% important tokensr (k) i (obtained viaF ext ), based on some combination of comp and suff (§3.2.2). The second path is used to update F ext with respect toL plaus , which encourages importance scoress i to approximate gold rationaler ∗ i . Thus, UNIREX frames rationale extraction as the following optimization problem: min F task ,F ext L task (x i ,y ∗ i ;F task ) + α f L faith (x i ,r (k) i ;F task ) + α p L plaus (x i ,r ∗ i ;F ext ), (3.1) whereα f andα p are loss weights. IfF task andF ext share parameters, then the shared parameters will be optimized with respect to all losses. 
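As a rough illustration of how the objective in Eq. 3.1 is assembled, the following Python sketch combines the individual loss terms into a single scalar. It is a simplification we provide for exposition only: the loss weights shown are arbitrary placeholders, and the faithfulness term is written directly as its comp and suff components, as detailed later in §3.3.3.1.

def unirex_total_loss(task_loss: float, comp_loss: float, suff_loss: float,
                      plaus_loss: float, alpha_c: float = 0.5,
                      alpha_s: float = 0.5, alpha_p: float = 0.5) -> float:
    """Combined UNIREX objective in the spirit of Eq. 3.1.

    task_loss  : L_task, cross-entropy of F_task w.r.t. the gold label y_i^*
    comp_loss  : comprehensiveness component of L_faith
    suff_loss  : sufficiency component of L_faith
    plaus_loss : L_plaus, gold-rationale supervision of F_ext
    The alpha_* weights are illustrative, not tuned values.
    """
    faith_loss = alpha_c * comp_loss + alpha_s * suff_loss  # plays the role of alpha_f * L_faith
    return task_loss + faith_loss + alpha_p * plaus_loss

# Example: with a Shared LM, a single backward pass on this scalar updates the
# shared encoder with respect to all three desiderata.
print(unirex_total_loss(task_loss=0.70, comp_loss=0.20, suff_loss=0.35, plaus_loss=0.40))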
During inference, for task input x i , we first use F task to predict y ∗ i , then useF ext to output a rationaler i forF task ’s predictiony i . Below, we discuss different options for UNIREX’s rationale extractor and explainability objectives. 3.3.2 RationaleExtractor In UNIREX,F ext is a learned function by default. Here, we first introduce heuristic F ext (i.e., AA), then discuss why a learnedF ext should typically be preferred (§3.2.1). For eachF ext type, we present several possible design choices and the pros/cons of the given type. 27 Figure 3.3: Rationale Extractor Types. In general, rationale extractorL ext can be either heuristic or learned. A heuristicL ext is a handcrafted attribution algorithm, which cannot be trained (§3.3.2.1). By default, UNIREX uses a learnedL ext , which can be optimized for faithfulness, plausibility, and task performance. For learnedL ext , we focus on two architectures (w.r.t. task modelL task ): DualLM designsL task andL ext as two fully separate LMs, whileSharedLM designsL task andL ext to share the same text encoder (§3.3.2.2). Although some operations within UNIREX are non-differentiable, Shared LM’s shared encoder allows us to approximate end-to-end training of both models w.r.t. all three desiderata (§3.3.4, Fig. 3.2). 3.3.2.1 HeuristicRationaleExtractors HeuristicF ext refers to AAs, which can be any handcrafted function that calculates an importance scores t i for each input tokenx t i [8]. AAs are typically gradient-based [229, 47, 156, 134] or perturbation-based [135, 189, 106] methods. Recall thatp y i (x i ) denotesF task ’s predicted probability for classy i (§3.2.2). Gradient- based methods compute s t i via the gradient of p y i (x i ) with respect to x t i . These methods require one or moreF task backward passes. Perturbation-based methods measures t i asp y i (x i )’s change when perturbing (e.g., removing)x t i . These methods require multipleF task forward passes – typically, one forward pass per token inx i . AAs can be used out of the box without training and are designed to satisfy certain faithfulness-related axiomatic properties [229, 156]. However, AAs’ lack of learnable parameters means they cannot be opti- mized for faithfulness/plausibility. Thus, ifF task is trained for explainability using AA-based rationales, 28 then onlyF task is optimized. Also, faithful AAs tend to be compute-intensive, requiring manyF task back- ward/forward passes per instance [229, 156, 135]. 3.3.2.2 LearnedRationaleExtractors LearnedF ext can be any learned model that transformsx t i intos t i . Given their success in NLP explainability [49], we focus on pre-trained Transformer LMs and highlight two key architectures: Dual LM (DLM) and Shared LM (SLM) (Fig. 3.3). For DLM,F task andF ext are two separate Transformer LMs with the same encoder architecture. Formally, we define the DLM extractor as F ext = f ext (f ′ enc (·)), wheref ′ enc andf ext areF ext ’s encoder and output head, respectively. DLM provides more capacity forF ext , which can help F ext output plausible rationales. For SLM,F task andF ext are two Transformer LMs sharing encoderf enc , whileF ext has its own output head f ext . Formally, the SLM extractor is defined as F ext = f ext (f enc (·)). SLM leverages multitask learning betweenF task andF ext , which can improve faithfulness sinceF ext has greater access to information aboutF task ’s reasoning process. By default,F ext takesx i as input and uses a linear layer forf ext , although these settings can be changed if desired. 
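To ground the Shared LM variant just described, the following is a minimal PyTorch sketch of the head arrangement: one shared encoder f_enc feeds a sequence-level task head f_task and a token-level extraction head f_ext. It is a schematic of the architecture only; the class name, the toy stand-in encoder, and the hyperparameters are ours for illustration and do not reflect the model configuration used in our experiments.

import torch
import torch.nn as nn

class SharedLMExtractor(nn.Module):
    """Shared LM (SLM) sketch: one encoder feeds both the task head (F_task)
    and the rationale extraction head (F_ext). Any module mapping token ids to
    hidden states of shape (batch, seq_len, hidden) can serve as the encoder."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_classes: int):
        super().__init__()
        self.encoder = encoder                                 # shared f_enc
        self.task_head = nn.Linear(hidden_size, num_classes)   # f_task: sequence-level logits
        self.ext_head = nn.Linear(hidden_size, 1)               # f_ext: one score per token

    def forward(self, input_ids: torch.Tensor):
        hidden = self.encoder(input_ids)                        # (B, T, H)
        logits = self.task_head(hidden[:, 0, :])                # classify from first token, BERT-style
        token_scores = self.ext_head(hidden).squeeze(-1)        # (B, T) importance scores s_i
        return logits, token_scores

# Toy stand-in encoder (an embedding layer) just to show the shapes involved.
vocab_size, hidden_size = 100, 16
toy_encoder = nn.Embedding(vocab_size, hidden_size)
model = SharedLMExtractor(toy_encoder, hidden_size, num_classes=2)
logits, scores = model(torch.randint(0, vocab_size, (4, 12)))
print(logits.shape, scores.shape)  # torch.Size([4, 2]) torch.Size([4, 12])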
Unlike heuristic F_ext, learned F_ext can be optimized for faithfulness/plausibility and only requires one forward pass during inference (in contrast, perturbation-based AAs require n F_task forward passes per n-token instance). However, learned F_ext cannot be used out of the box without explainability training and does not have built-in axiomatic properties – e.g., sensitivity and implementation invariance in the IG [229] AA – which are designed to promote faithfulness. Overall, learned F_ext is preferred if: (A) the goal is to optimize for both faithfulness and plausibility, and (B) gold rationales – even a small amount – are available for plausibility optimization. (B) is true because gold-rationale-annotated instances can be provided in every batch via oversampling (§3.6.2), which works surprisingly well in low-resource settings (§3.4.6). Otherwise, UNIREX allows for the learned F_ext to be replaced with a heuristic F_ext.

3.3.3 Explainability Objectives

After selecting F_ext, we specify the explainability objectives, which can be any combination of faithfulness and plausibility criteria. In prior approaches (e.g., AAs, SPPs), the rationale extractor is not optimized for both faithfulness and plausibility, but UNIREX makes this possible. For any choice of learned F_ext, UNIREX lets us easily "plug and play" different criteria and loss weights, based on our needs and domain knowledge, to find those that best balance the rationale extraction desiderata.

3.3.3.1 Faithfulness

Faithfulness refers to how accurately a rationale (output by F_ext) reflects F_task's decision process for a given instance. Evaluating rationale faithfulness is still an open problem with numerous applicable metrics, and UNIREX is not tailored for any specific metric. However, given the prevalence of comp and suff (§3.2.1) in the explainability literature [49, 93], we focus on comp- and suff-related objectives.

Recall that comp measures the importance of tokens in r_i^(k) as how p_{y_i}(x_i) changes when those tokens are removed from x_i. Intuitively, we want p_{y_i}(x_i) to be higher than p_{y_i}(x_i \ r_i^(k)), so higher comp is better. Since comp is defined for a single class' probability rather than the label distribution, we can define the comp loss L_comp via cross-entropy loss L_CE (which is computed w.r.t. the target class), as in the following difference criterion instantiation of L_comp:

L_comp-diff = L_CE(F_task(x_i), y*_i) − L_CE(F_task(x_i \ r_i^(k)), y*_i)    (3.2)

L_CE(F_task(x_i), y*_i) = − y*_i log(F_task(x_i))    (3.3)

For training stability, we compute the comp loss for target class y*_i here instead of F_task's predicted class y_i, since y_i is a moving target during training. Using L_comp-diff, it is possible for L_CE(F_task(x_i \ r_i^(k)), y*_i) to become much larger than L_CE(F_task(x_i), y*_i), leading to arbitrarily negative losses. To prevent this, we can use margin m_c to impose a lower bound on L_comp-diff, yielding the following margin criterion:

L_comp-margin = max(−m_c, L_comp-diff) + m_c    (3.4)

Recall that suff measures the importance of tokens in r_i^(k) as how p_{y_i}(x_i) changes when they are the only tokens kept in x_i. Based on suff's definition, we want p_{y_i}(r_i^(k)) to be higher than p_{y_i}(x_i), so lower suff is better.
For the suff loss L_suff, we define the difference and margin criteria analogously to L_comp, using margin m_s but the opposite sign for L_suff-diff (since lower suff is better):

L_suff-diff = L_CE(F_task(r_i^(k)), y*_i) − L_CE(F_task(x_i), y*_i)    (3.5)

L_suff-margin = max(−m_s, L_suff-diff) + m_s    (3.6)

In our experiments, we find that the margin-based comp and suff criteria are effective (§3.4.5), though others (e.g., KL divergence, MAE) can be used too (§3.6.3.1). Note that r_i^(k) is computed via top-k% thresholding (§3.2.1), so we also need to specify a set K of threshold values. We separately compute the comp and suff losses for each k ∈ K, then obtain the final comp and suff losses by averaging over all k values via the area over the perturbation curve (AOPC) [49]. To reflect this, we denote the comp and suff losses as L_comp,K and L_suff,K, respectively. Let α_f L_faith = α_c L_comp,K + α_s L_suff,K, where α_c and α_s are loss weights. In this case, we can abstractly consider α_f as an aggregate loss weight for the faithfulness objectives.

3.3.3.2 Plausibility

Plausibility is defined as how convincing a rationale (output by F_ext) is to humans as an explanation for F_task's prediction on a given instance [96]. Since F_task's predictions may change throughout its training, optimizing for plausibility should ideally involve continual human-in-the-loop feedback. However, obtaining such human-in-the-loop feedback is prohibitive, so many works consider human-annotated gold rationales as a cheaper form of plausibility supervision [49, 178, 98]. Even so, gold rationales r*_i are generally only annotated with respect to the gold task label y*_i (as opposed to F_task's predicted label y_i, which cannot be known a priori). Consequently, if y_i ≠ y*_i, then gold rationale supervision may be noisy.

Still, this is not a significant issue if F_task is jointly trained with F_ext. In UNIREX, F_task and F_ext are jointly trained to predict y*_i (via L_task) and r*_i (via L_plaus), respectively, while F_task is also regularized (via L_faith) such that its output y_i aligns with F_ext's output r_i. In other words, y*_i may be an acceptable approximation of y_i when training F_ext to predict r*_i (which is based on y*_i) because: (A) F_task is jointly trained such that its output y_i approximates y*_i, and (B) F_ext is also trained such that its output r_i aligns with y_i. As a result, if gold rationale supervision is available, then we can optimize for plausibility via UNIREX.

Specifically, given gold rationale r*_i for input x_i, plausibility optimization entails training F_ext to predict binary importance label r_i^{*,t} for each token x_i^t. This is essentially binary token classification, so one natural choice for L_plaus is the token-level binary cross-entropy (BCE) criterion:

L_plaus-BCE = − Σ_t r_i^{*,t} log(F_ext(x_i^t))    (3.7)

Besides BCE loss, we can also consider other criteria like sequence-level KL divergence and linear loss. See §3.6.3.2 for discussion of these and other plausibility criteria.

3.3.4 Training and Inference

After setting F_ext, L_faith, and L_plaus, we can move on to training F_task and F_ext. Since top-k% rationale binarization (§3.3.3) is not differentiable, by default, we cannot backpropagate L_faith through all of F_ext's parameters. Thus, F_task is trained via L_task and L_faith, while F_ext is only trained via L_plaus. This means F_ext's rationales r_i are indirectly optimized for faithfulness by regularizing F_task such that its behavior aligns with r_i.
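As a concrete reference for the criteria above, the following is a minimal PyTorch-style sketch of the margin-based comp/suff losses (Eqs. 3.2-3.6) and the BCE plausibility criterion (Eq. 3.7). It is a simplified, single-k version: in practice the comp/suff losses are averaged over all k ∈ K, and BCE-with-logits is used here (assuming token_scores are raw extractor logits) purely for numerical stability. task_model, x_full, x_comp, and x_suff are assumed callables/inputs built as in Fig. 3.2.

import torch
import torch.nn.functional as F

def comp_suff_margin_losses(task_model, x_full, x_comp, x_suff, y_gold, m_c=1.0, m_s=1.0):
    # task_model(...) is assumed to return class logits for its input.
    ce_full = F.cross_entropy(task_model(x_full), y_gold)   # L_CE(F_task(x_i), y*_i)
    ce_comp = F.cross_entropy(task_model(x_comp), y_gold)   # L_CE(F_task(x_i \ r_i^(k)), y*_i)
    ce_suff = F.cross_entropy(task_model(x_suff), y_gold)   # L_CE(F_task(r_i^(k)), y*_i)

    loss_comp_diff = ce_full - ce_comp                       # Eq. 3.2
    loss_suff_diff = ce_suff - ce_full                       # Eq. 3.5
    loss_comp = torch.clamp(loss_comp_diff, min=-m_c) + m_c  # Eq. 3.4 (margin criterion)
    loss_suff = torch.clamp(loss_suff_diff, min=-m_s) + m_s  # Eq. 3.6 (margin criterion)
    return loss_comp, loss_suff

def plausibility_bce(token_scores, gold_rationale):
    # Token-level BCE criterion (Eq. 3.7), with token_scores taken as F_ext's logits.
    return F.binary_cross_entropy_with_logits(token_scores, gold_rationale.float())

The weighted combination α_c L_comp,K + α_s L_suff,K + α_p L_plaus then gives the non-task part of the full objective summarized next.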
The exception is if we are using the SLM variant, where encoder f_enc is shared by F_task and F_ext. In this case, f_enc is optimized with respect to all losses, task head f_task is optimized with respect to L_task and L_faith, and extractor head f_ext is optimized with respect to L_plaus. SLM is a simple way to approximate end-to-end training of F_task and F_ext. In contrast, past SPPs have used more complex methods like reinforcement learning [130] and the reparameterization trick [7], whose training instability has been shown to hurt task performance [98].

Now, we summarize the full learning objective. Given that cross-entropy loss L_task = L_CE(F_task(x_i), y*_i) is used to train F_task to predict y*_i, the full learning objective is:

L = L_task + α_f L_faith + α_p L_plaus = L_task + α_c L_comp,K + α_s L_suff,K + α_p L_plaus.    (3.8)

During inference, we use F_task to predict y_i, then use F_ext to output r_i for F_task's predicted label y_i.

3.4 Experiments

We present empirical results showing UNIREX's effectiveness in trading off faithfulness, plausibility, and task performance during rationale extractor optimization. First, our main experiments compare rationale extraction methods w.r.t. all three desiderata (§3.4.4). Second, we perform various ablation studies to verify our design choices for UNIREX (§3.4.5). Third, we present experiments showing UNIREX's strong data efficiency, w.r.t. limited gold rationale supervision (§3.4.6) and zero-shot faithfulness transfer (§3.4.7). Fourth, to account for the limitations of gold-rationale-based plausibility evaluation, we conduct a user study to further demonstrate the improved plausibility of UNIREX-extracted rationales (§3.4.8).

3.4.1 Evaluation Protocol

3.4.1.1 Datasets

We primarily experiment with the SST (sentiment analysis) [223, 27], Movies (sentiment analysis) [272], CoS-E (commonsense question answering) [196], MultiRC (reading comprehension) [110], and e-SNLI (natural language inference) [25] datasets, all of which have gold rationale annotations. The rationale-annotated version of SST was obtained from [27], while the latter four datasets were obtained from the ERASER benchmark [49]. For the zero-shot faithfulness transfer experiments (§3.4.7), we consider five additional datasets, which are described further in §3.4.7. For more details, please refer to §3.6.4.

3.4.1.2 Metrics

To measure faithfulness, plausibility, and task performance, we use the metrics from the ERASER benchmark [49]. For faithfulness, we use comp and suff, for k = [1, 5, 10, 20, 50] [49]. For plausibility, we use area under the precision-recall curve (AUPRC) and token F1 (TF1) to measure similarity to gold rationales [49, 178]. For task performance, we follow the dataset-specific metrics used in ERASER: accuracy for SST and CoS-E; macro F1 for Movies, MultiRC, and e-SNLI [49]. That is, we only use one task performance metric per dataset.

Normalized Relative Gain (NRG) After computing these raw metrics for faithfulness, plausibility, and task performance, we would like to compare different rationale extraction methods w.r.t. all three desiderata. However, aggregating the raw metrics across the three desiderata may not be straightforward. In light of this, we introduce the Normalized Relative Gain (NRG) metric, which is based on the Average Relative Gain (ARG) metric [265] and min-max scaling. For each raw metric, NRG transforms all raw scores to normalized scores in [0, 1] (higher is better).
After all raw metrics are in the same [0, 1] space, we can simply aggregate them via averaging. Concretely, for each raw metric (e.g., comp, suff, AUPRC, accuracy), we are given a set of raw scores Z = {z_1, z_2, ...}. Each raw score z_i ∈ Z corresponds to a different rationale extraction method i. NRG(z_i) captures z_i's relative gain over the worst score in Z, normalized w.r.t. the score range max(Z) − min(Z). The definition of "worst score" depends on whether higher or lower raw scores are better for the given metric. If higher raw scores are better (e.g., comp, AUPRC, accuracy), then the worst score would be min(Z), which yields: NRG(z_i) = (z_i − min(Z)) / (max(Z) − min(Z)). If lower values are better (e.g., sufficiency), then the worst score would be max(Z), which yields: NRG(z_i) = (max(Z) − z_i) / (max(Z) − min(Z)).

After computing the individual NRG for each raw metric, we obtain the desiderata NRG scores by averaging the individual NRG scores within each desideratum. Let FNRG, PNRG, and TNRG be the desiderata NRG scores for faithfulness, plausibility, and task performance, respectively. FNRG is the average of the individual NRG scores for comp and suff; PNRG is the average of the individual NRG scores for AUPRC and TF1; and TNRG is just the individual NRG for the task performance metric (since there is only one task performance metric per dataset). Finally, to summarize all of the raw metrics as a single score, we compute the composite NRG (CNRG) by averaging the three desiderata NRG scores: CNRG = (FNRG + PNRG + TNRG) / 3. By default, we compute CNRG as an unweighted average of the three desiderata NRG scores, under the assumption that all three desiderata are equally important. On the other hand, for situations where certain desiderata are more important than others, we can also compute CNRG as a weighted average.

Generally, the computation of NRG should involve globally aggregating the raw metrics across all available methods, which is done in the main results (§3.4.4). However, for a number of more focused experiments (§3.4.5 and §3.4.7), only a subset of the available methods are considered. Thus, for these experiments, we report the raw metrics instead of NRG (Tables 3.1 and 3.2).
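The NRG computation amounts to per-metric min-max scaling followed by averaging. Below is a minimal sketch; the example scores for two hypothetical methods are made up for illustration, and the tie-handling behavior (returning 1.0 when all methods score identically) is our own assumption, since the text does not specify it.

def nrg(scores, higher_is_better=True):
    # Normalized Relative Gain for one raw metric, across methods (§3.4.1.2).
    lo, hi = min(scores), max(scores)
    if hi == lo:                       # all methods tie on this metric (assumed handling)
        return [1.0 for _ in scores]
    if higher_is_better:
        return [(z - lo) / (hi - lo) for z in scores]
    return [(hi - z) / (hi - lo) for z in scores]

# Hypothetical comp/suff scores for two methods A and B:
comp_nrg = nrg([0.12, 0.29], higher_is_better=True)
suff_nrg = nrg([0.26, 0.17], higher_is_better=False)
fnrg = [(c + s) / 2 for c, s in zip(comp_nrg, suff_nrg)]   # faithfulness NRG per method
# PNRG and TNRG are computed analogously; CNRG = (FNRG + PNRG + TNRG) / 3.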
3.4.1.3 Results Reporting

For all results, we report the average over three seeds and five k faithfulness thresholds (i.e., k = [1, 5, 10, 20, 50]), for a total of 15 settings. We denote each UNIREX configuration with a parenthetical "([rationale extractor]-[explainability objectives])". For the rationale extractor, AA, DLM, and SLM denote attribution algorithm, Dual LM, and Shared LM, respectively. For explainability objectives, F, P, and FP denote faithfulness, plausibility, and faithfulness+plausibility, respectively. For example, DLM-FP means Dual LM with faithfulness+plausibility objectives.

3.4.2 Baselines

We consider a wide range of representative rationale extraction baselines, spanning three key categories. Note that some methods do not assume access to gold rationales, which prevents such methods from optimizing for plausibility. This means that not all of the methods are directly comparable. Therefore, when comparing methods, we generally group them by whether they optimize for plausibility and only compare methods within the same group.

The first category is vanilla attribution algorithm (AA), which does not involve training F_ext and is applied post hoc (i.e., it does not impact F_task's training). Included baselines from this category are: Gradient (Grad) [220], Input*Gradient (Input*Grad) [47], DeepLIFT [156], and Integrated Gradients (IG) [229]. These four baselines are among the most popular AAs in the explainability literature [157, 192]. In the results, we denote these baselines as "AA ([AA name])", e.g., AA (IG).

The second category is AA-based training, which uses AAs in some way to train a learned F_ext. One baseline in this category is L2E [221], which distills knowledge from an AA to an LM-based F_ext. Specifically, after training F_task, then using an AA to extract rationales for F_task, L2E entails training F_ext to output rationales that are similar to the AA's rationales. Another baseline is SGT [93], which uses a suff-based criterion to regularize F_task, such that the AA yields faithful rationales for F_task. We also consider a variant called SGT+P, which augments SGT with plausibility optimization via gold rationales. For all baselines in this category, we use IG as the AA.

The third category is select-predict pipeline (SPP), where F_task (predictor) only takes input tokens chosen via F_ext's (selector) rationale output. One baseline in this category is FRESH [98], which trains F_task and F_ext separately. For FRESH, we use a stronger variant (compared to those in the FRESH paper) where IG rationales are directly provided to the predictor, rather than output by a trained F_ext. Another baseline is A2R [270], a recently proposed SPP which aims to improve F_task's task performance by regularizing F_task with an attention-based predictor that uses the full input. Also, we introduce FRESH+P and A2R+P, which respectively augment FRESH and A2R with plausibility optimization.

3.4.3 Implementation Details

For the LM architecture of F_task and F_ext, we use BigBird-Base [271] in all of our experiments, in order to handle input sequences of up to 4096 tokens. For all AA-based methods besides vanilla AA, we use the IG [229] AA, which has been commonly adopted in the explainability literature [192, 210, 93]. By default, IG involves 50 steps (i.e., backward passes) per instance [116]. However, 50-step IG is prohibitive when regularizing the LM via IG, so we instead use 3-step IG in both training and evaluation. We empirically justify our usage of 3-step IG in §3.6.6.

For all experiments, we use a learning rate of 2e−5 and an effective batch size of 32. We train for a maximum of 10 epochs, with early stopping patience of 5 epochs. For the gold rationale efficiency experiments (§3.4.6, §3.6.5), we use a batching factor β (§3.6.2) of 2. We only tune the faithfulness and plausibility loss weights, sweeping α_c = [0.5, 0.7, 1.0], α_s = [0.5, 0.7, 1.0], and α_p = [0.5, 0.7, 1.0]. We find that α_c = 0.5 and α_s = 0.5 typically yield the best performance. For each method variant, we tuned hyperparameters w.r.t. dev CNRG, computed across all hyperparameter configurations for the variant. All of our experiments are implemented using PyTorch [185], Lightning [55], and Hugging Face Transformers [255].

3.4.4 Main Results

Figs. 3.4-3.6 display the main results, in terms of NRG. For conciseness, we omit AA (Grad), AA (Input*Grad), and AA (DeepLIFT) results from these NRG figures, since AA (IG) is representative of vanilla AA methods. Please refer to §3.6.8 for all raw and NRG empirical results.

Figure 3.4: Composite NRG Comparison (without Plausibility Optimization). The composite NRG (CNRG) is the mean of the three desiderata NRG scores.
For each dataset, we use CNRG to compare rationale extraction methods that do not optimize for plausibility. Overall, UNIREX (AA-F) achieves the best CNRG on Dataset Mean (and on all datasets except Movies), showing the effectiveness of UNIREX’s faithfulness optimization. On Dataset Mean, UNIREX (AA-F) beats the strongest baseline (i.e., SGT) by 9.2%. In Figs. 3.4-3.5, we use CNRG to compare rationale extraction methods for each dataset. Here, the Dataset Mean group reports the mean CNRG across all datasets. First, Fig. 3.4 compares methods that do not optimize for plausibility (since they do not have access to gold rationales). Overall, we find that UNIREX (AA-F) achieves the best CNRG on Dataset Mean (and on all datasets except Movies), showing the effectiveness of UNIREX’s faithfulness optimization. On Dataset Mean, UNIREX (AA-F) beats the strongest 38 Figure 3.5: CompositeNRGComparison(withPlausibilityOptimization). The composite NRG (CNRG) is the mean of the three desiderata NRG scores. For each dataset, we use CNRG to compare rationale extraction methods that do optimize for plausibility. Overall, UNIREX (SLM-FP) and UNIREX (DLM-FP) achieve the best CNRG – both beating the strongest baseline (i.e., A2R+P) by over 30% on Dataset Mean – demonstrating UNIREX’s ability to jointly optimizeF task andF ext for all three desiderata. baseline (i.e., SGT) by 9.2%. Second, Fig. 3.5 compares methods that do optimize for plausibility. Overall, we find that UNIREX (SLM-FP) and UNIREX (DLM-FP) achieve the best CNRG – both beating the strongest baseline (i.e., A2R+P) by over 30% on Dataset Mean – demonstrating UNIREX’s ability to jointly optimize F task andF ext for all three desiderata. Meanwhile, UNIREX (DLM-P) performs slightly worse but still significantly better than all baselines, showing the effectiveness of UNIREX’s plausibility optimization. Fig. 3.6 compares rationale extraction methods w.r.t. the desiderata NRG, averaged over all datasets. First, Fig. 3.6 (left) compares rationale extraction methods without plausibility optimization, so PNRG is low for all methods here. Here, we see that UNIREX (AA-F)’s FNRG is highest, while its TNRG is close to highest. As shown in Fig. 3.4, UNIREX (AA-F) achieves the best composite NRG (CNRG) because UNIREX training enables effective balancing of faithfulness and task performance. On the other hand, baselines with high FNRG (i.e., FRESH, A2R) have low TNRG, while baselines with high TNRG (i.e., AA (IG), L2E, SGT) have low FNRG. Second, Fig. 3.6 (right) compares rationale extraction methods with plausibility optimization. Here, we see that UNIREX (DLM-FP) and UNIREX (SLM-FP) have moderate FNRG, but the 39 Figure 3.6: DesiderataNRGComparison. For each rationale extraction method, we show the desiderata NRG for faithfulness (FNRG), plausibility (PNRG), and task performance (TNRG), averaged over all datasets. Left: This plot compares methods without plausibility optimization. UNIREX (AA-F)’s FNRG is highest, while its TNRG is close to highest. Meanwhile, baselines with high FNRG (i.e., FRESH, A2R) have low TNRG, while baselines with high TNRG (i.e., AA (IG), L2E, SGT) have low FNRG.Right: This plot compares methods with plausibility optimization. UNIREX (DLM-FP) and UNIREX (SLM-FP) have moderate FNRG, but the highest (or near-highest) PNRG and TNRG. Meanwhile, baselines with high FNRG (i.e., FRESH+P, A2R+P) have low TNRG, while baselines with high TNRG (i.e., SGT+P) have low PNRG. highest (or near-highest) PNRG and TNRG. 
Meanwhile, UNIREX (DLM-P) achieves the highest PNRG and TNRG, but the worst FNRG, since UNIREX (DLM-P) does not optimize for faithfulness. As shown in Fig. 3.5, UNIREX (DLM-FP) and UNIREX (SLM-FP) achieve the best CNRG because UNIREX training enables effective balancing of faithfulness and task performance. Meanwhile, baselines with high FNRG (i.e., FRESH+P, A2R+P) have low TNRG, while baselines with high TNRG (i.e., SGT+P) have low PNRG. 3.4.5 AblationStudies We present five ablation studies to validate the effectiveness of our UNIREX design choices. The results of these ablation studies are displayed in Table 3.1, where each of the five sections contains results for a dif- ferent ablation. Thus, all numbers within the same section (ablation) and column (metric) are comparable. Extractor Type (F) In the Ext Type (F) section, we compare four heuristic rationale extractors, using AA-F. In this case, besides task performance, we can only optimize (the task model) for faithfulness. Rand 40 Ablation UNIREXConfig Faithfulness Plausibility Performance Comp (↑) Suff ( ↓) AUPRC (↑) Acc (↑) Ext Type (F) AA-F (Rand) 0.171 (± 0.040) 0.327 (± 0.050) 44.92 (± 0.00) 94.05 (± 0.35) AA-F (Gold) 0.232 (± 0.088) 0.249 (± 0.021) 100.00 (± 0.00) 93.81 (± 0.54) AA-F (Inv) 0.242 (± 0.010) 0.357 (± 0.019) 20.49 (± 0.00) 93.47 (± 1.81) AA-F (IG) 0.292 (± 0.051) 0.171 (± 0.038) 48.13 (± 1.14) 92.97 (± 0.44) Ext Type (FP) AA-FP (Sum) 0.296 (± 0.067) 0.185 (± 0.048) 47.60 (± 2.44) 93.25 (± 0.45) AA-FP (MLP) 0.285 (± 0.051) 0.197 (± 0.100) 54.82 (± 1.97) 93.23 (± 0.92) DLM-FP 0.319 (± 0.090) 0.167 (± 0.036) 85.80 (± 0.74) 93.81 (± 0.18) SLM-FP 0.302 (± 0.039) 0.113 (± 0.013) 82.55 (± 0.84) 93.68 (± 0.67) Comp/Suff Loss SLM-FP (Comp) 0.350 (± 0.048) 0.310 (± 0.049) 82.79 (± 0.62) 93.59 (± 0.11) SLM-FP (Suff) 0.166 ( ± 0.003) 0.152 (± 0.012) 83.74 (± 0.84) 94.16 (± 0.39) SLM-FP (Comp+Suff) 0.302 ( ± 0.039) 0.113 (± 0.013) 82.55 (± 0.84) 93.68 (± 0.67) Suff Criterion SLM-FP (KL Div) 0.306 (± 0.098) 0.131 (± 0.005) 82.62 (± 0.88) 93.06 (± 0.25) SLM-FP (MAE) 0.278 (± 0.058) 0.143 (± 0.008) 82.66 (± 0.61) 93.78 (± 0.13) SLM-FP (Margin) 0.302 (± 0.039) 0.113 (± 0.013) 82.55 (± 0.84) 93.68 (± 0.67) SLM Ext Head SLM-FP (Linear) 0.302 (± 0.039) 0.113 (± 0.013) 82.55 (± 0.84) 93.68 (± 0.67) SLM-FP (MLP-2048-2) 0.323 (± 0.071) 0.144 (± 0.012) 83.82 (± 0.77) 93.67 (± 0.18) SLM-FP (MLP-4096-3) 0.295 (± 0.057) 0.154 (± 0.027) 84.53 (± 0.61) 93.19 (± 0.79) Table 3.1: UNIREXAblationStudiesonSST uses random importance scores, Gold directly uses the gold rationales, Inv uses the inverse of the gold rationales, and IG uses IG rationales. All heuristics yield similar task performance, but IG dominates on all faithfulness metrics. This makes sense because IG is computed usingF task ’s inputs/parameters/outputs, while the others do not have this information. For plausibility, Gold is the best, Inv is the worst, and Rand and IG are about the same, as none of the heuristics are optimized for plausibility. ExtractorType(P) In the Ext Type (FP) section, we compare four learned rationale extractors. In this case, besides task performance, we can optimize for both faithfulness and plausibility. By default, attribu- tion algorithms’ dimension scores are pooled into token scores via sum pooling. AA-FP (Sum) uses IG with sum pooling, while AA-FP (MLP) replaces the sum pooler with a MLP-based pooler to increase capacity for plausibility optimization. 
Task performance for all four methods is similar, AA-FP (Sum) dominates on faithfulness, and DLM-FP and SLM-FP dominate on plausibility. AA-FP (MLP) does not perform as well on faithfulness but slightly improves on plausibility compared to AA-FP (Sum). 41 Comp/Suff Losses The Comp/Suff Loss section compares different combinations of Comp and Suff losses, using SLM-FP. Note that SLM-FP (Comp+Suff) is equivalent to SLM-FP shown in other tables/sections. As expected, SLM-FP (Comp) does best on Comp, but SLM-FP (Comp+Suff) actually does best on Suff. Meanwhile, SLM-FP, (Suff) does second-best on Suff but is much worse on Comp. This shows that Comp and Suff are complementary for optimization. SuffCriterion The Suff Criterion section compares different Suff criteria, using SLM-FP. SLM-FP (KL- Div) uses the KL divergence criterion, SLM-FP (MAE) uses the MAE criterion, and SLM-FP (Margin) uses the margin criterion. SLM-FP (Margin) is equivalent to SLM-FP in other tables/sections. All criteria yield similar performance and plausibility, while Margin is slightly better on faithfulness. SLM Extractor Head The SLM Ext Head section compares different extractor heads, using SLM-FP. Linear is the default choice and uses a linear layer. MLP-2048-2 uses a MLP with two 2048-dim hidden layers. MLP-4096-3 uses a MLP with three 4096-dim hidden layers. All three output head types yield similar performance, but decreasing head capacity yields better faithfulness, while increasing head capacity heads yields better plausibility. This trades off faithfulness and plausibility, although larger heads will be more compute-intensive. 3.4.6 GoldRationaleEfficiency UNIREX supports arbitrary amounts of gold rationale supervision, allowing plausibility optimization even in low-resource settings. In Fig. 3.7, we compare plausibility (w.r.t. AUPRC) forγ =[0.5,1,5,10,20,100] (i.e., % of train instances with gold rationales). We compare AA (IG) and four UNIREX variants (AA-F, AA-FP, DLM-FP, SLM-FP), with standard deviation shown by the error bands. First, AA (IG) and UNIREX (AA-F) do not optimize for plausibility via gold rationales, so their low AUPRC scores are constant for all γ . Second, UNIREX (AA-FP)’s AUPRC varies directly withγ at a modest rate, but is still always lower than 42 Figure 3.7: GoldRationaleEfficiencyonSST. As shown in this plot, UNIREX (DLM-FP) and UNIREX (SLM-FP) are able to achieve high plausibility performance, even with a very small percentage of training instances with gold rationale annotations. AA (IG)’s and UNIREX (AA-F)’s AUPRC. UNIREX (AA-FP)’s plausibility optimization is not effective, since plausibility optimization (i.e., learning to generate human-like rationales) typically requires high learning capacity, yet AAs do not have any learnable parameters. Third, UNIREX (DLM-FP) and UNIREX (SLM- FP) dominate across all γ values, with AUPRC slowly decreasing as γ decreases. Even at γ = 0.5, they can still achieve high AUPRC scores of around 0.75. This suggests that UNIREX’s gold rationale batching procedure (§3.6.2) is helpful for learning from minimal gold rationale supervision, thus enabling effective plausibility optimization. In addition to these results on SST, see Fig. 3.8 for similar results on CoS-E. 3.4.7 Zero-ShotFaithfulnessTransfer In Table 3.2, we investigate ifF ext ’s faithfulness, obtained via UNIREX training on some source (seen) dataset, can generalize to target (unseen) datasets/tasks in a zero-shot setting (i.e., no fine-tuning on target datasets/tasks). 
In this experiment, we consider SST and sentiment analysis as the source dataset and task, respectively. We compare six methods: AA (IG), AA-F (Rand), UNIREX (AA-F), UNIREX (DLM-P), 43 Task Dataset Method Faithfulness TaskPerformance CSD (↑) Comp (↑) Suff ( ↓) Perf (↑) SA SST AA (IG) -0.138 (± 0.040) 0.119 (± 0.009) 0.258 (± 0.031) 93.81 (± 0.55) AA-F (Rand) -0.156 (± 0.018) 0.171 (± 0.040) 0.327 (± 0.050) 94.05 (± 0.35) UNIREX (AA-F) 0.120 (± 0.055) 0.292 (± 0.051) 0.171 (± 0.038) 92.97 (± 0.44) UNIREX (DLM-P) -0.113 (± 0.040) 0.142 (± 0.008) 0.255 (± 0.007) 94.86 (± 0.41) UNIREX (DLM-FP) 0.151 (± 0.056) 0.319 (± 0.090) 0.167 (± 0.036) 93.81 (± 0.54) UNIREX (SLM-FP) 0.189 (± 0.030) 0.302 (± 0.039) 0.113 (± 0.013) 93.68 (± 0.67) Yelp AA (IG) -0.149 (± 0.028) 0.069 (± 0.004) 0.219 (± 0.028) 92.50 (± 2.07) AA-F (Rand) -0.177 (± 0.056) 0.127 (± 0.022) 0.305 (± 0.060) 86.27 (± 7.88) UNIREX (AA-F) 0.013 (± 0.036) 0.138 (± 0.078) 0.126 (± 0.059) 83.93 (± 13.20) UNIREX (DLM-P) -0.004 (± 0.028) 0.138 (± 0.009) 0.143 (± 0.023) 93.33 (± 0.90) UNIREX (DLM-FP) 0.169 (± 0.060) 0.265 (± 0.094) 0.097 (± 0.033) 92.37 (± 0.46) UNIREX (SLM-FP) 0.114 (± 0.056) 0.175 (± 0.055) 0.060 (± 0.001) 86.60 (± 1.57) Amazon AA (IG) -0.148 (± 0.038) 0.076 (± 0.010) 0.224 (± 0.037) 91.13 (± 0.28) AA-F (Rand) -0.166 (± 0.035) 0.120 (± 0.021) 0.286 (± 0.035) 81.52 (± 6.95) UNIREX (AA-F) 0.057 (± 0.106) 0.130 (± 0.077) 0.073 (± 0.039) 77.90 (± 13.12) UNIREX (DLM-P) -0.017 (± 0.027) 0.142 (± 0.001) 0.158 (± 0.026) 90.92 (± 0.93) UNIREX (DLM-FP) 0.133 (± 0.039) 0.232 (± 0.072) 0.098 (± 0.033) 89.35 (± 2.22) UNIREX (SLM-FP) 0.097 (± 0.027) 0.147 (± 0.012) 0.050 (± 0.017) 81.82 (± 7.62) HSD Stormfront AA (IG) -0.109 (± 0.053) 0.135 (± 0.010) 0.245 (± 0.059) 10.48 (± 1.66) AA-F (Rand) -0.147 (± 0.021) 0.150 (± 0.020) 0.297 (± 0.005) 10.66 (± 2.86) UNIREX (AA-F) 0.127 (± 0.015) 0.219 (± 0.009) 0.092 (± 0.025) 10.36 (± 1.94) UNIREX (DLM-P) -0.125 (± 0.092) 0.122 (± 0.008) 0.246 (± 0.099) 10.10 (± 1.73) UNIREX (DLM-FP) 0.052 (± 0.027) 0.167 (± 0.084) 0.115 (± 0.059) 10.37 (± 2.66) UNIREX (SLM-FP) 0.049 (± 0.041) 0.110 (± 0.039) 0.062 (± 0.043) 4.51 (± 1.87) OSD OffenseEval AA (IG) -0.146 (± 0.044) 0.097 (± 0.009) 0.244 (± 0.052) 33.51 (± 0.99) AA (Rand) -0.148 (± 0.046) 0.101 (± 0.020) 0.249 (± 0.065) 34.08 (± 2.34) UNIREX (AA-F) -0.029 (± 0.040) 0.074 (± 0.040) 0.102 (± 0.024) 32.62 (± 4.85) UNIREX (DLM-P) -0.102 (± 0.073) 0.112 (± 0.010) 0.214 (± 0.081) 33.67 (± 1.01) UNIREX (DLM-FP) 0.053 (± 0.012) 0.140 (± 0.049) 0.087 (± 0.045) 35.52 (± 1.26) UNIREX (SLM-FP) 0.039 (± 0.031) 0.087 (± 0.016) 0.048 (± 0.024) 38.17 (± 0.96) ID SemEval2018 AA (IG) -0.120 (± 0.061) 0.128 (± 0.014) 0.248 (± 0.064) 29.63 (± 4.72) AA-F (Rand) -0.133 (± 0.043) 0.124 (± 0.013) 0.258 (± 0.053) 32.39 (± 9.73) UNIREX (AA-F) -0.028 (± 0.051) 0.069 (± 0.041) 0.096 (± 0.011) 49.95 (± 8.31) UNIREX (DLM-P) -0.112 (± 0.095) 0.140 (± 0.017) 0.252 (± 0.112) 27.78 (± 5.08) UNIREX (DLM-FP) 0.047 (± 0.017) 0.149 (± 0.052) 0.102 (± 0.053) 31.97 (± 2.80) UNIREX (SLM-FP) 0.027 (± 0.047) 0.091 (± 0.027) 0.064 (± 0.033) 17.42 (± 4.04) Table 3.2: Zero-ShotFaithfulnessTransferfromSST. We investigate whether the faithfulness of UNIREX ratio- nale extractors (AA-F, DLM-FP) trained on SST can generalize to unseen datasets/tasks, even when the task model’s task performance cannot. Also, we include AA (IG) as a heuristic extractor baseline (i.e., only the task model is trained). 
Here, the seen task is sentiment analysis (SA), while the unseen tasks are hate speech detection (HSD), offensive speech detection (OSD), and irony detection (ID). For SA, the unseen datasets are Yelp and Amazon. For HSD, OSD, and ID, the unseen datasets are Stormfront, OffenseEval, and SemEval2018, respectively. Overall, we find that faithfulness is not strongly correlated with task performance, as unseen tasks’ comp/suff scores are similar to seen tasks’. In particular, though all methods achieve poor task performance on unseen tasks, UNIREX (DLM-FP)’s comp/suff scores are consistently good across all tasks, demonstrating its faithfulness generalization ability. UNIREX (DLM-FP), and UNIREX (SLM-FP). For AA (IG), only F task is trained on SST, since its F ext is 44 heuristic. Meanwhile, for the UNIREX variants, bothF task andF ext are trained on SST. First, as an in- domain reference point, we report faithfulness and task performance on SST. Second, we evaluate on unseen target datasets for a seen task (i.e., sentiment analysis): Yelp [278] and Amazon [166]. Third, we evaluate on unseen target datasets for unseen target tasks: Stormfront (hate speech detection, binary F1) [71], OffenseEval (offensive speech detection, macro F1) [273], and SemEval2018 (irony detection, binary F1) [237]. For more details about these datasets, please refer to §3.6.4. We want to show that, even ifF task yields poor task performance on unseen datasets,F ext ’s rationales can still be faithful. To make the faithfulness results easier to digest, we introduce a metric called Comp- Suff Difference (CSD), which locally aggregates comp and suff as: CSD = comp− suff. Therefore, since higher/lower comp/suff signals higher faithfulness, then higher CSD signals higher faithfulness. As expected, all methods achieve much lower task performance in the third setting than in the first two settings. However, faithfulness does not appear to be strongly correlated with task performance, as unseen tasks’ comp/suff scores are similar to seen tasks’. Across all datasets, DLM-FP has the best faithfulness and is the only method whose comp is always higher than suff. The other UNIREX variants are not as consistently strong as DLM-FP, but almost always beat non-UNIREX methods on comp and suff. Meanwhile, AA (IG) has the worst comp and suff overall. Ultimately, these results suggest that UNIREX- trained models’ faithfulness (i.e., alignment betweenF task ’s andF ext ’s outputs) is a dataset/task agnostic property (i.e., can generalize across datasets/tasks), further establishing UNIREX’s utility in low-resource settings. 3.4.8 PlausibilityUserStudy Gold rationale based plausibility evaluation is noisy because gold rationales are for the target label, not a F task ’s predicted label. Thus, we conduct two five-annotator user studies (Table 3.3) to get a better plau- sibility measurement. Given 50 random test instances from SST, we get the rationales for SGT+P, A2R+P, 45 UNIREX (AA-FP), and UNIREX (DLM-FP), plus the gold rationales. For each instance, we threshold all rationales to have the same number of positive tokens as the gold rationale. The first user study is forward simulation [80, 98]. Here, the annotator is given an input and a rationale for some model’s prediction, then asked what (binary) sentiment label the model most likely predicted. For forward simulation, we also con- sider a No Rationale baseline, where no tokens are highlighted. 
For No Rationale and Gold (which we call "oracle methods"), the target label is the correct choice. Annotators are also asked to rate their confidence (4-point Likert scale) in their answer to this question. The second user study involves giving a subjective rating of how plausible the rationale is [80]. Here, the annotator is given the input, rationale, and model's predicted label, then asked to rate (5-point Likert scale) how aligned the rationale is with the prediction.

Method | Forward Simulation: Accuracy (%) | Forward Simulation: Confidence (1-4) | Subjective Rating: Alignment (1-5)
No Rationale | 92.00 (± 3.35) | 3.02 (± 0.39) | -
SGT+P | 80.80 (± 9.73) | 2.34 (± 0.31) | 3.64 (± 0.28)
A2R+P | 41.20 (± 4.71) | 2.83 (± 0.28) | 2.97 (± 0.12)
UNIREX (AA-FP) | 72.00 (± 7.78) | 2.00 (± 0.31) | 3.26 (± 0.31)
UNIREX (DLM-FP) | 83.60 (± 5.41) | 2.77 (± 0.28) | 3.96 (± 0.22)
Gold | 81.20 (± 3.03) | 2.88 (± 0.30) | 4.00 (± 0.20)

Table 3.3: Plausibility User Study on SST. In our user study, we find that humans judge UNIREX (DLM-FP)'s rationales as being more plausible than those created by other methods, in terms of both forward simulation (accuracy) and subjective rating (alignment).

In both accuracy and subjective rating, we find that DLM-FP performs best among all non-oracle methods and even slightly beats Gold on accuracy, further supporting our claim that DLM-FP rationales are plausible. As expected, the fact that Gold does not achieve near-100% accuracy shows the discrepancy between evaluating plausibility based on the target label (i.e., gold rationale similarity) and F_task's predicted label (forward simulation). Meanwhile, SGT+P and AA-FP, which had lower AUPRC and TF1 in our automatic evaluation, also do worse in accuracy and alignment. Also, users found SGT+P and AA-FP rationales harder to understand, as shown by their lower confidence scores. Meanwhile, A2R+P had high AUPRC and TF1, but gets very low accuracy and alignment because A2R+P's predicted label was often not the target label, leading to misalignment with its gold-like rationale. Nonetheless, users were still most confident in their predictions using A2R+P's rationales. A2R+P is a great example of how automatic plausibility evaluation can be misleading. For the accuracy, confidence, and alignment questions, we achieved Fleiss' Kappa [59] inter-annotator agreement scores of 0.2456 (fair), 0.1282 (slight), and 0.1561 (slight), respectively. This lack of agreement demonstrates the difficulty of measuring rationale plausibility.

3.5 Related Work

Extractive Rationale Faithfulness Many prior works have tried to improve the faithfulness of extractive rationales through the use of AAs [8]. Typically, this involves handcrafting gradient-based [229, 47, 156, 134] or perturbation-based [135, 189, 106] AAs. However, attribution algorithms cannot be optimized and tend to be compute-intensive (often requiring multiple LM forward/backward passes). Recently, [93] addressed the optimization issue by regularizing the task model to yield faithful rationales via the AA, while other works [221, 214] addressed the compute cost issue by training an LM (requiring only one forward pass) to mimic an AA's behavior. Another line of work aims to produce faithful rationales by construction, via SPPs [98, 270, 182, 7, 269, 130]. Still, SPPs' faithfulness can only guarantee sufficiency – not comprehensiveness [49]. Also, SPPs generally perform worse than vanilla LMs because they hide much of the original text input from the predictor and are hard to train end-to-end.
Extractive Rationale Plausibility Existing approaches for improving extractive rationale plausi- bility typically involve supervising LM-based extractors [13] or SPPs [98, 182, 49] with gold rationales. However, existing LM-based extractors have not been trained for faithfulness, while SPPs’ faithfulness by construction comes at the great cost of task performance. Meanwhile, more existing works focus on improving the plausibility of free-text rationales [178, 120, 25], often with task-specific pipelines [196, 119]. 47 Connection to UNIREX Unlike prior works, UNIREX enables both the task model and rationale extractor to be jointly optimized for faithfulness, plausibility, and task performance. As a result, UNIREX- trained rationale extractors achieve a better balance of faithfulness and plausibility, without compromising the task model’s performance. Also, by using a learned rationale extractor, which generally only requires one model forward pass, UNIREX does not have the computational expenses that limit many AAs. 3.6 Appendix 3.6.1 TextClassification Here, we formalize the text classification problem in more detail. Let D ={X,Y} N i=1 be a dataset, where X ={x i } N i=1 are the text inputs,Y ={y ∗ i } N i=1 are the labels, andN is the number of instances (x i ,y ∗ i ) in D. We also assume D can be partitioned into train set D train , dev set D dev , and test set D test . Let F task = f task (f enc (·)) be a task LM, where f enc is the text encoder, and f task is the task output head. Typically,F task has a BERT-style architecture [48], in which f enc is a Transformer [238] while f task is a linear layer. Below, we define the sequence classification (SST, Movies, MultiRC, e-SNLI) and multi-choice QA (CoS-E) tasks, which are different types of text classification. SequenceClassification In sequence classification, x i is a token sequence (e.g., a single sentence, a pair of sentences), whiley ∗ i is the target class forx i . Here, we assume a fixed label space Y ={1,...,M} of size M, wherey ∗ i ∈Y for alli. Thus,f task outputs a vector of sizeM, such thatF task (x i ) =f task (f enc (x i )) = ˆ y i ∈ R M is the logit vector used to classify x i . Given ˆ y i = [ˆ y i,j ] M j=1 , let y i = argmax j ˆ y i,j be the class predicted byF task . The goal of sequence classification is to learn F task such thaty ∗ i = y i , for all (x i ,y ∗ i ) [172]. 48 Multi-Choice QA Instead of a fixed label space, multi-choice QA has a different (but fixed-size) set of answer choices per instance. For instancei, letq i be the question (e.g.,“Afriendisgreetingme,whatwould they say?”) andA i ={a i,j } M j=1 be the corresponding answer choices (e.g., {“say hello”, “greet”, “associate”, “socialize”, “smile”}), where M is now the number of answer choices. Define x i,j = q i ⊕ a i,j , where⊕ denotes concatenation. In multi-choice QA, we havex i ={x i,j } M j=1 , whiley ∗ i ∈ A i is the correct answer forx i . Thus,f task outputs a scalar, such thatF task (x i,j ) =f task (f enc (x i,j )) = ˆ y i,j ∈R is the logit forx i,j . Given ˆ y i = [ˆ y i,j ] M j=1 , letj ′ = argmax j ˆ y i,j , wherey i =a i,j ′ is the answer predicted byF task . The goal of multi-choice QA is to learnF task such thaty ∗ i =y i , for all(x i ,y ∗ i ) [232]. 3.6.2 GoldRationaleSupervision If a learned rationale extractor is chosen, UNIREX enables users to specify how much gold rationale su- pervision to use. Ideally, each train instance would be annotated with a gold rationale. 
In this case, we could directly minimize the plausibility loss for each train instance. However, since gold rationales can be expensive to annotate, UNIREX provides a special batching procedure for training with limited gold rationale supervision. GivenN train =|D train | train instances, let0<γ < 100 be the percentage of train instances with gold rationales,N gold =⌈ γ 100 N train ⌉≥ 1 be the number of train instances with gold rationales,b be the desired train batch size, andβ > 1 be a scaling factor. Define D gold ⊆D train as the set of train instances with gold rationales, where|D gold |=N gold . Note that, if all train instances have gold rationales, thenD gold =D train andγ =100. Each batch is constructed as follows: (1) randomly sample b gold = max(1, b β ) instances fromD gold without replacement, then (2) randomly sampleb− b gold instances fromD train \D gold without replacement. This results in a batch with b total train instances, b gold with gold rationales and the rest without. Since N gold is generally small, we only sample fromD gold without replacement for a given batch, but not a given 49 epoch. Thus, instances fromD gold may appear more than once in the same epoch. However, we do sample fromD train \D gold without replacement for each batch and epoch, so every instance inD train \D gold appears exactly once per epoch. After constructing the batch, we compute the plausibility loss for the batch as follows: P b i=1 1 (x i ,y ∗ i )∈D gold L plaus (F ext (x i ),r ∗ i ), whereL plaus is the plausibility loss for train instance(x i ,y ∗ i ). This function zeroes out the plausibility loss for instances without gold rationales, so that plausibility is only being optimized with respect to instances with gold rationales. However, in §3.4.6, we show that it is possible to achieve high plausibility via rationale extractors trained on minimal gold rationale supervision. 3.6.3 ExplainabilityObjectives 3.6.3.1 Faithfulness Sufficiency In addition, to the criteria presented in §3.3.3, we consider two other sufficiency loss func- tions. The first is the KLdivergencecriterion used in [93], which considers the entire label distribution and is defined as L suff-KL = KL(F task (r (k) i ))||F task (x i )). The second is themeanabsoluteerror(MAE)criterion, which is defined as L suff-MAE =|L CE (F task (r (k) i )),y ∗ i )−L CE (F task (x i ),y ∗ i )|. Unlike the difference criterion L suff-diff and margin criterionL suff-margin (§3.3.3), the MAE criterion assumes that usingr (k) i as input should not yield better task performance than using x i as input. In our experiments, we find that L suff-margin is effective, though others ( e.g., KL divergence, MAE) can be used too. 3.6.3.2 Plausibility Similar to faithfulness, UNIREX places no restrictions on the choice of plausibility objective. As described in §3.3.3, given gold rationaler ∗ i for inputx i , plausibility optimization entails trainingF ext to predict binary importance labelr ∗ ,t i for each tokenx t i . This is essentially binary token classification, so one natural choice forL plaus is the token-level binary cross-entropy (BCE) criterion: L plaus-BCE = − P t r ∗ ,t i log(F ext (x t i )) 50 (§3.3.3). Another option is the sequence-level KL divergence criterion, which is defined as: L plaus-KL = KL(F ext (x i )||r ∗ i ). Additionally, we can directly penalizeF ext (x i ) in the logit space via alinearloss, defined as: L plaus-linear = Φ( r ∗ i )F ext (x i ), whereΦ( u)=− 2u+1 maps positive and negative tokens to− 1 and+1, respectively. 
The linear loss directly pushes the logits corresponding to positive/negative tokens to be higher/lower and in- crease the margin between them. To prevent linear loss values from becoming arbitrarily negative, we can also lower bound the loss with a marginm p , yielding:L plaus-linear-margin =max(− m p ,L plaus-linear )+m p . 3.6.4 Datasets As described in §3.4.1.1, we primarily experiment with the SST (sentiment analysis) [223, 27], Movies (sen- timent analysis) [272], CoS-E (commonsense question answering) [196], MultiRC (reading comprehension) [110], and e-SNLI (natural language inference) [25] datasets, all of which have gold rationale annotations. The rationale-annotated version of SST was obtained from [27], while the latter four datasets were ob- tained from the ERASER benchmark [49]. The numbers of train/dev/test instances in these datasets are as follows: SST (6920/872/1821), Movies (1599/200/200), CoS-E (8752/1086/1079), MultiRC (24029/3214/4848), and e-SNLI (549309/9823/9807). As described in §3.4.7, for our zero-shot experiments, we consider the Yelp (sentiment analysis) [278], Amazon (sentiment analysis) [166], Stormfront (hate speech detection) [71], OffenseEval (offensive speech) [273], and SemEval2018 (irony detection) [237] datasets. The numbers of train/dev/test instances in these datasets are as follows: Yelp (10000/2000/2000), Amazon (10000/2000/2000), Stormfront (7896/978/1998), OffenseEval (11916/1324/860), and SemEval2018 (2862/955/784). For Yelp and Amazon, the original datasets are very large, so we use stratified sampling with respect to labels to obtain the final train/dev/test split split of 10000/2000/2000. 51 3.6.5 GoldRationaleEfficiency Fig. 3.8 shows the gold rationale data efficiency results for CoS-E, using the same setup as §3.4.6. Over- all, we see that the CoS-E results are quite similar to the SST results. Again, UNIREX (DLM-FP) and UNIREX (SLM-FP) dominate across allγ values, with AUPRC slowly decreasing asγ decreases. Interest- ingly, UNIREX (AA-FP) yields a noticeable dip in AUPRC for lower γ values. Since AA-FP has limited capacity (via the task model) for plausibility optimization, it is possible that this fluctuation is due to ran- dom noise. We leave further analysis of this for future work. Figure 3.8: GoldRationaleEfficiencyonCoS-E. As shown in this plot, UNIREX (DLM-FP) and UNIREX (SLM-FP) are able to achieve high plausibility performance, even with a very small percentage of training instances with gold rationale annotations. 3.6.6 ComputationalEfficiency Besides faithfulness, plausibility, and task performance, computational efficiency is also an important desideratum of rationale extraction. With that in mind, the number of IG steps is a critical design choice. A higher number of IG steps means a more accurate IG approximation, which should may yield more 52 faithful rationales. On the other hand, a higher number of IG steps means greater computational costs. As stated in §3.4.3, we use 3-step IG in all of our experiments, in order to make all methods computa- tionally comparable. 3-step IG requires three backward passes, while all other compared methods re- quire either one forward pass or one backward pass. We would like to empirically characterize this trade-off between faithfulness and computational efficiency. Thus, we compare IG performance for η = [3,5,10,30,50,70,100,500,1000] steps, denoted as AA (IG-η ). 
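For reference, the gold rationale batching procedure of §3.6.2, which underlies the gold rationale efficiency results reported next (§3.6.5), can be sketched as follows. This is a simplified version under stated assumptions: it samples D_gold without replacement only within a batch (so gold instances may repeat across batches in the same epoch), it omits the epoch-level bookkeeping that makes each instance of D_train \ D_gold appear exactly once per epoch, and it assumes D_train \ D_gold is large enough to fill the batch. Function and variable names are illustrative.

import random

def build_batch(d_gold, d_rest, batch_size=32, beta=2):
    # d_gold: train instances with gold rationales (D_gold)
    # d_rest: train instances without gold rationales (D_train \ D_gold)
    # beta:   scaling factor controlling how many gold instances per batch
    b_gold = max(1, batch_size // beta)
    # Without replacement within this batch; D_gold is typically small, so
    # its instances are effectively oversampled across batches.
    gold_part = random.sample(d_gold, min(b_gold, len(d_gold)))
    rest_part = random.sample(d_rest, batch_size - len(gold_part))
    return gold_part + rest_part

During training, only the gold_part instances contribute to the plausibility loss; the loss for instances without gold rationales is zeroed out, exactly as described in §3.6.2.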
In Table 3.4, we report faithfulness, plau- sibility, task performance, convergence delta (i.e., IG approximation error), and inference time for each IG-based method. Additionally, we compare these IG settings to UNIREX (AA-F) (which uses 3-step IG), UNIREX (DLM-FP), and UNIREX (SLM-FP). Since UNIREX (DLM-FP) and UNIREX (SLM-FP) are not IG- based, we do not report convergence delta for them. Method Faithfulness Plausibility TaskPerf. IG Comp. Efficiency CSD (↑) Comp (↑) Suff ( ↓) AUPRC (↑) TF1 (↑) Acc (↑) Conv. Delta (↓) Inference Time (↓) AA (IG-3) -0.138 (± 0.040) 0.119 (± 0.009) 0.258 (± 0.031) 49.94 (± 1.77) 50.75 (± 0.54) 93.81 (± 0.55) 8.07 (± 1.47) 7.94E-03 (± 5.30E-05) AA (IG-5) -0.141 (± 0.031) 0.134 (± 0.015) 0.275 (± 0.021) 49.38 (± 1.00) 50.85 (± 0.70) 93.81 (± 0.55) 6.83 (± 0.92) 1.15E-02 (± 6.03E-05) AA (IG-10) 0.011 (± 0.043) 0.222 (± 0.015) 0.210 (± 0.031) 55.87 (± 0.51) 52.06 (± 0.33) 93.81 (± 0.55) 6.55 (± 0.48) 2.08E-02 (± 1.14E-04) AA (IG-30) 0.056 (± 0.058) 0.258 (± 0.020) 0.202 (± 0.038) 57.23 (± 1.16) 52.74 (± 0.43) 93.81 (± 0.55) 4.91 (± 1.82) 5.80E-02 (± 3.03E-04) AA (IG-50) 0.066 (± 0.057) 0.265 (± 0.017) 0.199 (± 0.040) 57.70 (± 1.02) 52.66 (± 0.37) 93.81 (± 0.55) 2.89 (± 0.34) 9.54E-02 (± 5.64E-04) AA (IG-70) 0.072 (± 0.055) 0.269 (± 0.016) 0.197 (± 0.039) 58.11 (± 1.21) 52.96 (± 0.32) 93.81 (± 0.55) 2.50 (± 0.25) 1.33E-01 (± 8.10E-04) AA (IG-100) 0.074 (± 0.055) 0.271 (± 0.016) 0.197 (± 0.039) 58.25 (± 1.27) 53.00 (± 0.42) 93.81 (± 0.55) 2.01 (± 0.13) 1.89E-01 (± 1.62E-03) AA (IG-500) 0.082 (± 0.055) 0.276 (± 0.016) 0.195 (± 0.039) 58.61 (± 1.10) 53.25 (± 0.29) 93.81 (± 0.55) 0.99 (± 0.23) 9.38E-01 (± 5.44E-03) AA (IG-1000) 0.083 (± 0.057) 0.278 (± 0.017) 0.195 (± 0.040) 58.64 (± 1.15) 53.17 (± 0.39) 93.81 (± 0.55) 0.74 (± 0.16) 1.88E+00 (± 1.67E-03) UNIREX (AA-F) 0.120 (± 0.055) 0.292 (± 0.051) 0.171 (± 0.038) 48.13 (± 1.14) 50.96 (± 0.93) 92.97 (± 0.44) 8.07 (± 1.47) 7.94E-03 (± 5.30E-05) UNIREX (DLM-FP) 0.151 (± 0.056) 0.319 (± 0.090) 0.167 (± 0.036) 85.80 (± 0.74) 72.76 (± 0.19) 93.81 (± 0.54) - 8.51E-04 (± 7.82E-07) UNIREX (SLM-FP) 0.189 (± 0.030) 0.302 (± 0.039) 0.113 (± 0.013) 82.55 (± 0.84) 70.65 (± 0.44) 93.68 (± 0.67) - 8.81E-04 (± 5.67E-06) Table 3.4: ComputationalEfficiencyResultsonSST. Besides achieving the best balance of faithfulness, plausi- bility, and task performance, UNIREX (DLM-FP) and UNIREX (SLM-FP) are also the most computationally efficient, achieving the lowest inference time per instance. Also, despite having a higher convergence delta and lower infer- ence time than most other AA (IG) variants due to using 3-step IG, UNIREX (AA-F) outperforms all AA (IG) variants on faithfulness, while achieving comparable plausibility and task performance. As expected, we find that convergence delta decreases as η increases. Furthermore, AA (IG) faithfulness scores generally improve asη increases, although the improvement begins to saturate at aroundη = 50. Similarly, we find that AA (IG) plausibility scores for IG also tend to improve as η increases, with the improvement also saturating aroundη =50. This makes sense because plausibility optimization involves regularizing the task model to yield rationales that are similar to gold rationales, but this regularization is 53 less effective if the yielded rationales are less faithful to the task model. However, despite these faithfulness and plausibility improvements, we see that inference time increases roughly linearly with respect toη . 
In particular, AA (IG-1000) is over 200 times slower than AA (IG-3). Meanwhile, despite only using 3-step IG, UNIREX (AA-F) beats all AA (IG) variants on faithfulness, while achieving comparable plausibility and task performance. On the other hand, UNIREX (DLM-FP) and UNIREX (SLM-FP) achieve the best faithfulness, while also achieving the best plausibility and inference time by far, without sacrificing task performance. Both UNIREX (DLM-FP) and UNIREX (SLM-FP) are over 2000 times faster than AA (IG-1000) and over nine times faster than AA (IG-3). These results show that UNIREX (esp. using learned rationale extractor) is an effective framework for jointly optimizing rationale extraction for faithfulness, plausibility, and task performance, while also achieving high computational efficiency. 3.6.7 QualitativeAnalysis In §3.4.8, we presented a user study of fifty SST test instances to further evaluate the plausibility of ra- tionales produced by various methods. To get additional insights about rationale plausibility, we conduct qualitative analysis of selected instances from the user study. In Table 3.5, for each selected instance and a given method, we show the method’s rationale,F task ’s corresponding prediction, and the user-annotated alignment score for the rationale with respect to the prediction. We consider two groups of selected in- stances. First, we select the top-3 instances with respect to mean alignment score (across five annotators), averaged over all methods (i.e., Instances 30, 41, and 35). Not surprisingly, we find that F task ’s prediction is very consistent across all methods. For these three instances, all of the predicted labels happen to be both positive and correct (i.e., same as gold label). Similarly, the rationales are also rather consistent across methods. In particular, A2R+P, UNIREX (DLM-FP), and Gold consistently yield the highest alignment scores. In Instance 30, these three methods plausibly highlight “good” and “actress”. In Instance 41, these 54 three methods plausibly highlight “understands”, “medium”, “amazingly”, and “well”. In Instance 35, these three methods plausibly highlight “admirable” and “achievement”. Of course, this is expected for Gold, since gold rationales are human-annotated with respect to the gold label. The high alignment scores for A2R+P and UNIREX (DLM-FP) also make sense since A2R+P and UNIREX (DLM-FP) yielded high PNRG (Fig. 3.6). InstanceRanking InstanceID Method Rationale F task ’sPrediction Alignment(1-5) Top Alignment Mean 30 SGT+P goodactress . positive 5.00 (± 0.00) A2R+P goodactress . positive 5.00 (± 0.00) UNIREX (AA-FP) good actress. positive 4.80 (± 0.45) UNIREX (DLM-FP) goodactress . positive 5.00 (± 0.00) Gold goodactress . positive 5.00 (± 0.00) 41 SGT+P a cop storythatunderstandsthemediumamazinglywell . positive 5.00 (± 0.00) A2R+P acop story thatunderstands themediumamazinglywell . positive 4.80 (± 0.45) UNIREX (AA-FP) acopstory thatunderstands the mediumamazingly well. positive 4.20 (± 0.45) UNIREX (DLM-FP) acop story thatunderstandsthemediumamazinglywell . positive 4.80 (± 0.45) Gold a cop story thatunderstandsthemediumamazinglywell. positive 5.00 (± 0.00) 35 SGT+P chicago is , in manyways , anadmirable achievement . positive 4.60 (± 0.55) A2R+P chicago is , in many ways , anadmirableachievement . positive 5.00 (± 0.00) UNIREX (AA-FP) chicago is , in manyways , an admirable achievement . positive 3.20 (± 0.45) UNIREX (DLM-FP) chicago is , in many ways , anadmirableachievement . 
positive 5.00 (± 0.00) Gold chicago is , in many ways , anadmirableachievement . positive 5.00 (± 0.00) Top Alignment Std 9 SGT+P bad beyond belief and ridiculous beyond description. negative 4.60 (± 0.55) A2R+P bad beyond belief andridiculous beyond description . positive 1.00 (± 0.00) UNIREX (AA-FP) bad beyond beliefand ridiculous beyond description. negative 2.60 (± 0.89) UNIREX (DLM-FP) bad beyond belief andridiculous beyond description . negative 4.80 (± 0.45) Gold bad beyond belief andridiculous beyond description . negative 4.80 (± 0.45) 12 SGT+P these are names to remember , in order to avoid them in the future . negative 2.40 (± 0.89) A2R+P these are names to remember , in order toavoid them in the future . positive 1.20 (± 0.45) UNIREX (AA-FP) these are names to remember , in order to avoid them in the future. negative 2.60 (± 0.89) UNIREX (DLM-FP) these are names to remember , in order toavoid them in the future . negative 4.80 (± 0.45) Gold these are names to remember , in order toavoid them in the future . negative 4.80 (± 0.45) 19 SGT+P the title’s lameness should clue youin on howbadthemovie is . negative 4.00 (± 0.00) A2R+P the title ’slameness shouldclueyou in on howbad themovie is . positive 1.20 (± 0.45) UNIREX (AA-FP) the title’slamenessshould clue youinon how badthe movie is. negative 2.60 (± 0.89) UNIREX (DLM-FP) the title ’slameness shouldclueyou in on howbadthemovie is . negative 4.80 (± 0.45) Gold thetitle’slameness should clue you in on howbad the movie is . negative 4.80 (± 0.45) Table 3.5: QualitativeAnalysisonSST. Building upon the quantitative results of our plausibility user study, our qualitative analysis further supports the notion that UNIREX (DLM-FP)’s rationales are more plausible than those created by other rationale extraction methods. In this table, we visualize each rationale by highlighting the important tokens (selected by the given method) inblue. Second, we select the top-3 instances with respect to standard deviation (std) alignment score (across five annotators), averaged over all methods ( i.e., Instances 9, 12, 19). This time, there is greater variance in the per-instance alignment scores across methods. For these three instances, the negative label is pre- dicted by all methods except A2R+P. Like before, UNIREX (DLM-FP) and Gold consistently yield the highest alignment scores. In Instance 9, these two methods plausibly highlight “bad” and “ridiculous”. In Instance 55 12, these two methods plausibly highlight “avoid”. In Instance 19, these two methods plausibly highlight “lameness” and “bad”. However, this time, A2R+P consistently yields the lowest alignment scores. Even though A2R+P produces rationales that are similar to those of UNIREX (DLM-FP) and Gold (i.e., aligning with the gold label), A2R+P’s rationales do not support A2R+P’s predicted label. This illustrates the limi- tation of automatically evaluating plausibility via gold rationale similarity, as A2R+P achieved high PNRG. Meanwhile, we see that UNIREX (DLM-FP) consistently yields the highest plausibility across various types of evaluation (i.e., gold rationale similarity, forward simulation, subjective rating). 3.6.8 MainResults(Extended) In §3.4, Figs. 3.4-3.6 only reported the main results averaged over all datasets. In this section, Tables 3.6-3.10 provide more detailed main results by reporting all raw and NRG metrics for each individual dataset. 
56 Method Composite Faithfulness Plausibility Performance NRG (↑) NRG (↑) Comp (↑) Suff ( ↓) NRG (↑) AUPRC (↑) TF1 (↑) NRG (↑) Acc (↑) AA (Grad) 0.488 0.337 0.142 (± 0.010) 0.256 (± 0.006) 0.192 58.86 (± 3.65) 27.40 (± 0.00) 0.935 93.81 (± 0.55) AA (Input*Grad) 0.420 0.107 0.078 (± 0.013) 0.342 (± 0.014) 0.218 44.16 (± 1.43) 45.02 (± 0.39) 0.935 93.81 (± 0.55) AA (DeepLIFT) 0.453 0.122 0.085 (± 0.006) 0.340 (± 0.018) 0.302 46.50 (± 1.32) 50.18 (± 0.32) 0.935 93.81 (± 0.55) AA (IG) 0.526 0.297 0.119 (± 0.009) 0.258 (± 0.031) 0.347 49.94 (± 1.77) 50.75 (± 0.54) 0.935 93.81 (± 0.55) L2E 0.557 0.487 0.012 (± 0.004) 0.009 (± 0.024) 0.250 44.84 (± 0.32) 47.24 (± 0.87) 0.935 93.81 (± 0.55) SGT 0.632 0.555 0.147 (± 0.024) 0.113 (± 0.031) 0.371 51.38 (± 2.47) 51.35 (± 1.64) 0.971 94.40 (± 0.57) FRESH 0.330 0.837 0.219 (± 0.057) 0.000 (± 0.000) 0.152 42.06 (± 8.84) 41.19 (± 4.01) 0.000 78.78 (± 6.48) A2R 0.479 0.941 0.283 (± 0.104) 0.000 (± 0.000) 0.457 63.36 (± 6.01) 46.74 (± 6.65) 0.038 79.39 (± 11.67) UNIREX (AA-F) 0.639 0.706 0.292 (± 0.051) 0.171 (± 0.038) 0.329 48.13 (± 1.14) 50.96 (± 0.93) 0.882 92.97 (± 0.44) SGT+P 0.596 0.507 0.139 (± 0.032) 0.137 (± 0.026) 0.355 50.38 (± 1.45) 50.98 (± 0.46) 0.928 93.70 (± 0.88) FRESH+P 0.582 0.765 0.175 (± 0.043) 0.000 (± 0.000) 0.970 84.35 (± 0.87) 71.54 (± 0.53) 0.011 78.95 (± 5.18) A2R+P 0.695 0.953 0.290 (± 0.016) 0.000 (± 0.000) 0.978 85.56 (± 1.01) 70.97 (± 1.03) 0.154 81.26 (± 0.52) UNIREX (DLM-P) 0.770 0.339 0.142 (± 0.008) 0.255 (± 0.007) 0.970 84.35 (± 0.87) 71.54 (± 0.53) 1.000 94.86 (± 0.41) UNIREX (AA-FP) 0.636 0.339 0.296 (± 0.067) 0.185 (± 0.048) 0.315 47.60 (± 2.44) 50.23 (± 2.26) 0.900 93.25 (± 0.45) UNIREX (DLM-FP) 0.897 0.756 0.319 (± 0.090) 0.167 (± 0.036) 1.000 85.80 (± 0.74) 72.76 (± 0.19) 0.935 93.81 (± 0.54) UNIREX (SLM-FP) 0.891 0.807 0.302 (± 0.039) 0.113 (± 0.013) 0.940 82.55 (± 0.84) 70.65 (± 0.44) 0.927 93.68 (± 0.67) Table 3.6: MainResultsonSST Method Composite Faithfulness Plausibility Performance NRG (↑) NRG (↑) Comp (↑) Suff ( ↓) NRG (↑) AUPRC (↑) TF1 (↑) NRG (↑) F1 (↑) AA (Grad) 0.481 0.457 0.184 (± 0.023) 0.107 (± 0.017) 0.028 13.31 (± 0.91) 5.02 (± 0.00) 0.957 95.33 (± 0.65) AA (Input*Grad) 0.503 0.359 0.148 (± 0.031) 0.137 (± 0.019) 0.194 8.68 (± 0.37) 37.58 (± 0.55) 0.957 95.33 (± 0.65) AA (DeepLIFT) 0.468 0.259 0.122 (± 0.029) 0.172 (± 0.022) 0.187 9.00 (± 0.16) 36.15 (± 1.45) 0.957 95.33 (± 0.65) AA (IG) 0.439 0.173 0.134 (± 0.016) 0.219 (± 0.044) 0.188 8.88 (± 0.21) 36.39 (± 1.29) 0.957 95.33 (± 0.65) L2E 0.550 0.445 0.000 (± 0.007) 0.026 (± 0.015) 0.248 16.68 (± 10.20) 38.92 (± 4.07) 0.957 95.33 (± 0.65) SGT 0.553 0.474 0.124 (± 0.053) 0.071 (± 0.064) 0.184 10.05 (± 1.23) 34.64 (± 1.67) 1.000 96.33 (± 0.76) FRESH 0.645 0.732 0.234 (± 0.034) 0.000 (± 0.000) 0.305 17.02 (± 6.22) 48.26 (± 5.87) 0.899 94.00 (± 1.44) A2R 0.431 0.764 0.267 (± 0.050) 0.000 (± 0.000) 0.244 35.44 (± 21.69) 19.78 (± 25.56) 0.284 79.78 (± 7.14) UNIREX (AA-F) 0.601 0.744 0.505 (± 0.134) 0.122 (± 0.100) 0.189 9.14 (± 2.51) 36.28 (± 1.84) 0.870 93.33 (± 1.61) SGT+P 0.586 0.604 0.152 (± 0.013) 0.022 (± 0.004) 0.183 9.16 (± 1.59) 35.33 (± 0.41) 0.971 95.66 (± 1.16) FRESH+P 0.587 0.691 0.193 (± 0.062) 0.000 (± 0.000) 1.000 94.32 (± 0.12) 89.53 (± 1.63) 0.070 74.84 (± 12.22) A2R+P 0.585 0.764 0.267 (± 0.076) 0.000 (± 0.000) 0.991 93.53 (± 0.93) 88.77 (± 1.22) 0.000 73.22 (± 0.75) UNIREX (DLM-P) 0.667 0.024 0.024 (± 0.003) 0.238 (± 0.004) 1.000 94.32 (± 0.12) 89.53 (± 1.63) 0.978 95.83 (± 0.29) UNIREX (AA-FP) 0.543 0.514 0.428 (± 0.174) 0.195 (± 
0.105) 0.193 8.53 (± 0.46) 37.71 (± 3.12) 0.921 94.50 (± 1.00) UNIREX (DLM-FP) 0.744 0.326 0.283 (± 0.217) 0.216 (± 0.005) 0.991 93.65 (± 0.36) 88.68 (± 2.29) 0.913 94.33 (± 1.61) UNIREX (SLM-FP) 0.754 0.362 0.313 (± 0.059) 0.213 (± 0.014) 0.965 91.70 (± 1.84) 86.17 (± 1.20) 0.935 94.83 (± 0.76) Table 3.7: MainResultsonMovies 57 Method Composite Faithfulness Plausibility Performance NRG (↑) NRG (↑) Comp (↑) Suff ( ↓) NRG (↑) AUPRC (↑) TF1 (↑) NRG (↑) Acc (↑) AA (Grad) 0.537 0.504 0.331 (± 0.012) 0.352 (± 0.007) 0.130 37.33 (± 0.62) 22.65 (± 0.00) 0.977 63.56 (± 1.27) AA (Input*Grad) 0.573 0.361 0.249 (± 0.018) 0.385 (± 0.008) 0.383 39.56 (± 0.54) 44.43 (± 0.40) 0.977 63.56 (± 1.27) AA (DeepLIFT) 0.605 0.346 0.254 (± 0.035) 0.403 (± 0.042) 0.491 42.82 (± 1.83) 51.72 (± 1.26) 0.977 63.56 (± 1.27) AA (IG) 0.578 0.327 0.216 (± 0.007) 0.378 (± 0.010) 0.429 40.07 (± 5.47) 48.34 (± 3.16) 0.977 63.56 (± 1.27) L2E 0.544 0.493 0.005 (± 0.003) 0.010 (± 0.008) 0.161 23.56 (± 1.09) 37.80 (± 1.10) 0.977 63.56 (± 1.27) SGT 0.618 0.367 0.197 (± 0.040) 0.324 (± 0.015) 0.491 43.68 (± 4.68) 51.00 (± 3.05) 0.995 64.35 (± 0.46) FRESH 0.302 0.546 0.037 (± 0.036) 0.000 (± 0.000) 0.261 32.35 (± 7.66) 39.37 (± 0.70) 0.101 24.81 (± 3.46) A2R 0.277 0.516 0.014 (± 0.021) 0.000 (± 0.000) 0.282 41.61 (± 3.85) 33.12 (± 9.06) 0.032 21.77 (± 1.31) UNIREX (AA-F) 0.690 0.538 0.297 (± 0.141) 0.286 (± 0.084) 0.554 46.97 (± 3.41) 53.99 (± 1.66) 0.978 63.58 (± 0.61) SGT+P 0.601 0.367 0.201 (± 0.032) 0.328 (± 0.022) 0.436 41.30 (± 6.70) 47.95 (± 1.65) 1.000 64.57 (± 0.33) FRESH+P 0.504 0.515 0.013 (± 0.021) 0.013 (± 0.021) 0.997 76.07 (± 1.63) 69.76 (± 0.27) 0.000 20.36 (± 0.66) A2R+P 0.488 0.500 0.001 (± 0.001) 0.000 (± 0.000) 0.951 73.59 (± 0.81) 67.63 (± 1.54) 0.012 20.91 (± 0.48) UNIREX (DLM-P) 0.751 0.267 0.180 (± 0.016) 0.390 (± 0.035) 0.997 76.07 (± 1.63) 69.76 (± 0.27) 0.990 64.13 (± 0.46) UNIREX (AA-FP) 0.685 0.551 0.395 (± 0.109) 0.381 (± 0.101) 0.537 45.21 (± 4.46) 53.91 (± 3.23) 0.968 63.14 (± 0.33) UNIREX (DLM-FP) 0.814 0.492 0.293 (± 0.043) 0.321 (± 0.070) 0.997 76.38 (± 0.57) 69.52 (± 0.24) 0.953 62.50 (± 1.34) UNIREX (SLM-FP) 0.807 0.494 0.390 (± 0.087) 0.424 (± 0.110) 0.983 75.12 (± 0.41) 69.25 (± 0.41) 0.944 62.09 (± 2.12) Table 3.8: MainResultsonCoS-E Method Composite Faithfulness Plausibility Performance NRG (↑) NRG (↑) Comp (↑) Suff ( ↓) NRG (↑) AUPRC (↑) TF1 (↑) NRG (↑) F1 (↑) AA (Grad) 0.498 0.462 0.222 (± 0.028) 0.120 (± 0.018) 0.035 22.27 (± 0.17) 13.81 (± 0.00) 0.997 69.80 (± 0.60) AA (Input*Grad) 0.506 0.289 0.225 (± 0.048) 0.260 (± 0.059) 0.231 18.51 (± 0.23) 43.45 (± 0.05) 0.997 69.80 (± 0.60) AA (DeepLIFT) 0.493 0.249 0.225 (± 0.012) 0.292 (± 0.014) 0.234 18.80 (± 0.19) 43.51 (± 0.04) 0.997 69.80 (± 0.60) AA (IG) 0.499 0.280 0.162 (± 0.086) 0.222 (± 0.086) 0.220 18.71 (± 0.40) 41.79 (± 1.33) 0.997 69.80 (± 0.60) L2E 0.522 0.366 0.007 (± 0.006) 0.042 (± 0.024) 0.205 24.48 (± 2.71) 32.63 (± 6.12) 0.997 69.80 (± 0.60) SGT 0.594 0.564 0.214 (± 0.105) 0.033 (± 0.077) 0.224 18.60 (± 0.42) 42.42 (± 0.51) 0.995 69.73 (± 0.13) FRESH 0.675 0.571 0.176 (± 0.029) 0.000 (± 0.000) 0.617 24.68 (± 7.98) 48.02 (± 3.04) 0.838 64.47 (± 3.41) A2R 0.217 0.404 -0.010 (± 0.029) 0.000 (± 0.000) 0.249 18.72 (± 0.67) 45.45 (± 0.02) 0.000 36.39 (± 0.00) UNIREX (AA-F) 0.711 0.956 0.505 (± 0.050) -0.071 (± 0.020) 0.236 18.82 (± 0.40) 43.68 (± 0.38) 0.939 66.17 (± 4.58) SGT+P 0.630 0.665 0.280 (± 0.029) 0.283 (± 0.039) 0.226 18.63 (± 0.52) 42.71 (± 0.39) 1.000 69.91 (± 0.81) FRESH+P 0.491 0.413 0.000 (± 0.013) 0.000 (± 0.000) 
0.999 71.80 (± 0.27) 77.94 (± 0.57) 0.060 38.41 (± 5.34) A2R+P 0.516 0.422 0.011 (± 0.024) 0.000 (± 0.000) 0.977 70.86 (± 1.30) 76.21 (± 1.68) 0.150 41.42 (± 8.73) UNIREX (DLM-P) 0.708 0.123 0.127 (± 0.010) 0.322 (± 0.017) 0.999 71.80 (± 0.27) 77.94 (± 0.57) 1.000 69.91 (± 0.76) UNIREX (AA-FP) 0.706 1.000 0.545 (± 0.045) -0.077 (± 0.099) 0.231 19.13 (± 0.71) 42.66 (± 1.18) 0.888 66.17 (± 4.58) UNIREX (DLM-FP) 0.751 0.327 0.135 (± 0.072) 0.165 (± 0.029) 0.998 71.89 (± 0.41) 77.63 (± 0.62) 0.929 67.53 (± 1.06) UNIREX (SLM-FP) 0.784 0.377 0.198 (± 0.038) 0.171 (± 0.027) 0.997 71.69 (± 0.21) 77.79 (± 0.09) 0.979 69.20 (± 1.58) Table 3.9: MainResultsonMultiRC Method Composite Faithfulness Plausibility Performance NRG (↑) NRG (↑) Comp (↑) Suff ( ↓) NRG (↑) AUPRC (↑) TF1 (↑) NRG (↑) F1 (↑) AA (Grad) 0.587 0.518 0.313 (± 0.009) 0.380 (± 0.025) 0.244 59.80 (± 1.32) 15.27 (± 0.00) 0.999 90.78 (± 0.27) AA (Input*Grad) 0.503 0.287 0.205 (± 0.005) 0.446 (± 0.020) 0.223 32.98 (± 1.37) 43.13 (± 0.86) 0.999 90.78 (± 0.27) AA (DeepLIFT) 0.508 0.270 0.195 (± 0.012) 0.448 (± 0.014) 0.254 33.47 (± 1.31) 46.44 (± 0.04) 0.999 90.78 (± 0.27) AA (IG) 0.596 0.473 0.308 (± 0.011) 0.414 (± 0.020) 0.317 47.83 (± 1.04) 37.87 (± 1.39) 0.999 90.78 (± 0.27) L2E 0.606 0.460 0.009 (± 0.015) 0.036 (± 0.022) 0.358 58.11 (± 0.97) 31.35 (± 0.27) 0.999 90.78 (± 0.27) SGT 0.595 0.503 0.288 (± 0.025) 0.361 (± 0.038) 0.298 42.46 (± 3.03) 41.70 (± 1.78) 0.985 90.23 (± 0.16) FRESH 0.518 0.661 0.120 (± 0.075) 0.000 (± 0.000) 0.361 38.77 (± 6.82) 53.71 (± 3.30) 0.530 72.92 (± 8.71) A2R 0.273 0.564 0.053 (± 0.048) 0.000 (± 0.000) 0.256 48.48 (± 11.14) 29.54 (± 24.72) 0.000 52.72 (± 14.08) UNIREX (AA-F) 0.622 0.539 0.330 (± 0.018) 0.383 (± 0.055) 0.340 45.29 (± 3.02) 43.69 (± 1.98) 0.987 90.31 (± 0.19) SGT+P 0.608 0.524 0.286 (± 0.034) 0.339 (± 0.032) 0.311 43.03 (± 1.69) 42.59 (± 1.63) 0.988 90.36 (± 0.08) FRESH+P 0.746 0.695 0.143 (± 0.072) 0.000 (± 0.000) 1.000 87.85 (± 0.13) 77.63 (± 0.35) 0.544 73.44 (± 12.88) A2R+P 0.800 0.751 0.182 (± 0.097) 0.000 (± 0.000) 0.992 87.30 (± 0.44) 77.31 (± 0.72) 0.656 77.31 (± 0.72) UNIREX (DLM-P) 0.842 0.525 0.311 (± 0.011) 0.371 (± 0.032) 1.000 87.85 (± 0.13) 77.63 (± 0.35) 1.000 90.80 (± 0.33) UNIREX (AA-FP) 0.626 0.529 0.341 (± 0.008) 0.406 (± 0.046) 0.363 44.79 (± 0.81) 47.18 (± 0.83) 0.985 90.21 (± 0.08) UNIREX (DLM-FP) 0.857 0.588 0.335 (± 0.018) 0.346 (± 0.023) 0.991 86.99 (± 0.40) 77.53 (± 0.15) 0.992 90.51 (± 0.12) UNIREX (SLM-FP) 0.864 0.603 0.353 (± 0.017) 0.356 (± 0.015) 0.994 87.58 (± 0.14) 77.22 (± 0.28) 0.994 90.59 (± 0.09) Table 3.10: MainResultsone-SNLI 58 PartIII UtilizingMachineExplanations 59 Chapter4 LearningfromStrongly-SupervisedMachineExplanations In this chapter, we investigate RQ2: Howcanweutilizestrongly-supervisedmachineexplanationstoimprove the LM’s task performance? 4.1 Introduction Neural language models (LMs) have achieved state-of-the-art performance on a broad array of natural language processing (NLP) tasks [48, 152]. Even so, LMs’ reasoning processes are notoriously opaque [205, 53, 146], which has spurred significant interest in designing algorithms to automatically explain LM behavior [47, 229, 25, 196, 157]. Most of this work has focused on rationale extraction, which explains an LM’s output on a given task instance by highlighting the input tokens that most influenced the output [47, 229, 135, 102, 156, 30]. 
Recent studies have investigated how machine rationales outputted by rationale extraction algorithms can be utilized to improve LM decision-making [81, 79]. Among these works, one prevalent paradigm is explanation regularization (ER), which aims to improve LM behavior by regularizing the LM to yield machine rationales that align with human rationales (Fig. 4.1) [203, 90, 68, 272, 109, 201, 147]. Human rationales can be created by annotating each training instance individually [141, 25, 196] or by applying task-level human priors across all training instances [201, 203, 147].

[Figure 4.1 appears here: for the input "Nolan is a great director.", the LM highlights a machine rationale and outputs a predicted label (trained via the task loss), while a human highlights a human rationale for the target label (supervised via the ER loss).]

Figure 4.1: Explanation Regularization (ER). Sometimes, task labels alone provide insufficient supervision for language model (LM) generalization. ER aims to improve generalization by training the LM so that its machine rationales (Which input tokens did the LM focus on?) align with human rationales (Which input tokens would humans focus on?) (§4.2).

Though prior works primarily evaluate ER models' in-distribution (ID) generalization, the results are mixed, and it is unclear when ER is actually helpful. Furthermore, out-of-distribution (OOD) generalization is often more crucial in real-world settings [40, 204], yet ER's impact on OOD generalization has been underexplored [203, 109]. In particular, due to prior works' lack of unified comparison, little is understood about how OOD performance is affected by major ER design choices, such as the rationale alignment criterion, human rationale type (instance-level vs. task-level), number and choice of rationale-annotated instances, and time budget for rationale annotation.

In this paper, we propose ER-Test (Fig. 4.2), a framework for evaluating ER methods' OOD generalization via: (A) unseen dataset tests, (B) contrast set tests, and (C) functional tests. For (A), ER-Test tests ER models' performance on datasets beyond their training distribution [203, 109]. For (B), ER-Test tests ER models' performance on real-world data instances that are semantically perturbed [64]. For (C), ER-Test tests ER models' performance on synthetic data instances created to capture specific linguistic capabilities [200].

Using ER-Test, we study four questions: (1) Which rationale alignment criteria are most effective? (2) Is ER effective with task-level human rationales? (3) How is ER affected by the number and choice of rationale-annotated instances? (4) How does ER performance vary with the rationale annotation time budget? For two text classification tasks and six datasets, ER-Test shows that ER has little impact on ID performance but yields large gains on OOD performance, with the best ER criteria being task-dependent (§4.5.4). Furthermore, ER can improve OOD performance even with distantly-supervised (§4.5.5) or few (§4.5.6) human rationales. Finally, we find that rationale annotation is more time-efficient than label annotation, in terms of impact on OOD performance (§4.5.7). These results from ER-Test help demonstrate ER's utility and establish best practices for using ER effectively.

4.2 Explanation Regularization (ER)

Given an LM for an NLP task, the goal of ER is to improve LM generalization on the task by pushing the LM's (extractive) machine rationales (Which input tokens did the LM focus on?) to align with human rationales (Which input tokens would humans focus on?).
The hope is that this inductive bias encourages the LM to solve the task in a manner that follows humans' reasoning process.

Given a set of classes $C$, let $F$ be an LM for $M$-class text classification, where $|C| = M$. We assume $F$ has a BERT-style architecture [48, 152], consisting of a Transformer encoder [238] followed by a linear layer with softmax classifier. $F$ can be used for either sequence or token classification. Let $x_i = [x_i^t]_{t=1}^{n}$ be the $n$-token input sequence (e.g., a sentence) for task instance $i$. For sequence classification, $F$ predicts a single class for $x_i$, such that $F(x_i) \in \mathbb{R}^{M}$ are the logits for $x_i$. In this case, let $y_i = \operatorname{argmax}_{c \in C} F(x_i)_c$ denote $F$'s predicted class for $x_i$. For token classification, $F$ predicts a class for each token $x_i^t$, such that $F(x_i) \in \mathbb{R}^{n \times M}$ are the logits for the $n$ tokens in $x_i$. In this case, let $y_{i,t} = \operatorname{argmax}_{c \in C} F(x_i)_{t,c}$ denote $F$'s predicted class for $x_i^t$, while $y_i = [y_{i,t}]_{t=1}^{n}$ collectively denotes all of $F$'s predicted token classes for $x_i$.

Given $F$, $x_i$, and $y_i$, the goal of rationale extraction is to output a feature attribution vector $r_i = [r_i^t]_{t=1}^{n}$, where each $0 \leq r_i^t \leq 1$ is an importance score indicating how strongly token $x_i^t$ influenced $F$ to predict class $y_i$ [157]. In practice, the final machine rationale is obtained by binarizing $r_i$ via strategies like top-$k$% thresholding [49, 98, 192]. However, for convenience, we refer to $r_i$ as the machine rationale in this work, since the binarized $r_i$ is not explicitly used in ER. Let $G$ denote a rationale extractor, such that $r_i = G(F, x_i, y_i)$. $G$ can also be used to compute machine rationales with respect to other classes besides $y_i$ (e.g., target class $\dot{y}_i$). Let $\hat{r}_i$ denote the (non-binarized) machine rationale for $x_i$ with respect to $\dot{y}_i$.

Given $\hat{r}_i$ obtained via $G$ and $F$, many works have explored ER, in which $F$ is regularized such that $\hat{r}_i$ aligns with human rationale $\dot{r}_i$ [272, 201, 203]. $\dot{r}_i$ can either be human-annotated for individual instances, or generated via human-annotated lexicons for a given task. Typically, $\dot{r}_i$ is a binary vector, where ones and zeros indicate important and unimportant tokens, respectively. We formalize the ER loss as $\mathcal{L}_{\text{ER}} = \Phi(\hat{r}_i, \dot{r}_i)$, where $\Phi$ is an ER criterion measuring alignment between $\hat{r}_i$ and $\dot{r}_i$. Thus, the full learning objective is $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{ER}} \mathcal{L}_{\text{ER}}$, where $\mathcal{L}_{\text{task}}$ is the task loss (e.g., cross-entropy loss) and $\lambda_{\text{ER}} \in \mathbb{R}$ is the ER strength (i.e., loss weight) for $\mathcal{L}_{\text{ER}}$. Also, as a baseline, let $F_{\text{No-ER}}$ denote an LM that is trained without ER, such that $\mathcal{L} = \mathcal{L}_{\text{task}}$.
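To make this objective concrete, here is a minimal PyTorch-style sketch of a single ER training step with MSE as the criterion $\Phi$. It assumes a Hugging Face-style classifier interface and a differentiable rationale extractor passed in as extract_rationale; both are illustrative placeholders rather than our actual implementation.

```python
import torch.nn.functional as nnF

def er_training_step(model, extract_rationale, batch, lambda_er=1.0):
    """One ER step: L = L_task + lambda_ER * L_ER, with Phi = token-level MSE."""
    outputs = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
    task_loss = nnF.cross_entropy(outputs.logits, batch["labels"])        # L_task

    # Machine rationale r_hat: (batch, seq_len) importance scores in [0, 1]
    r_hat = extract_rationale(model, batch)
    r_human = batch["human_rationales"].float()                           # binary human rationale

    # ER criterion Phi (MSE), with padding tokens masked out
    mask = batch["attention_mask"].float()
    er_loss = (((r_hat - r_human) ** 2) * mask).sum() / mask.sum()        # L_ER

    return task_loss + lambda_er * er_loss                                # L = L_task + lambda_ER * L_ER
```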
4.3 ER-Test

Existing works primarily evaluate ER models via ID generalization [272, 141, 201, 147, 203, 90, 68, 109], though a small number of works have done auxiliary evaluations of OOD generalization [203, 109, 201, 225]. However, these OOD evaluations have been relatively small-scale, only covering a narrow range of OOD generalization aspects. As a result, little is understood about ER's impact on OOD generalization. To address this gap, we propose ER-Test (Fig. 4.2), a framework for designing and evaluating ER models' OOD generalization along three dimensions: (1) unseen dataset tests; (2) contrast set tests; and (3) functional tests.

[Figure 4.2 appears here: example inputs illustrating (A) unseen dataset tests (seen vs. unseen datasets), (B) contrast set tests (original vs. contrast instances), and (C) functional tests (Vocabulary, Logic, Robustness, and Entity tests).]

Figure 4.2: ER-Test. While existing works focus on ER models' in-distribution (ID) generalization, the ER-Test framework is designed to evaluate ER models' out-of-distribution (OOD) generalization with respect to: (A) unseen dataset tests, (B) contrast set tests, and (C) functional tests (§4.3).

Let $D$ be an $M$-class text classification dataset, which we call the ID dataset. Assume $D$ can be partitioned into training set $D_{\text{train}}$, development set $D_{\text{dev}}$, and test set $D_{\text{test}}$, where $D_{\text{test}}$ is the ID test set for $D$. After training $F$ on $D_{\text{train}}$ with ER, we measure $F$'s ID generalization via task performance on $D_{\text{test}}$ and $F$'s OOD generalization via (1)-(3).

4.3.1 Unseen Dataset Tests

First, we evaluate OOD generalization with respect to unseen dataset tests (Fig. 4.2A). Besides $D$, suppose we have datasets $\{\tilde{D}^{(1)}, \tilde{D}^{(2)}, \ldots\}$ for the same task as $D$. Each $\tilde{D}^{(i)}$ has its own training/development/test sets and distribution shift from $D$. After training $F$ with ER on $D_{\text{train}}$ and hyperparameter tuning on $D_{\text{dev}}$, we measure $F$'s performance on each OOD test set $\tilde{D}^{(i)}_{\text{test}}$. This tests ER's ability to help $F$ learn general (i.e., task-level) knowledge representations that can (zero-shot) transfer across datasets.

4.3.2 Contrast Set Tests

Second, we evaluate OOD generalization with respect to contrast set tests (Fig. 4.2B). Dataset annotation artifacts [78] can cause LMs to learn spurious decision rules that work on the test set but do not capture linguistic abilities that the dataset was designed to assess. Thus, we test $F$ on contrast sets [64], which are constructed by manually perturbing the test instances of real-world datasets to express counterfactual meanings. Contrast set tests help probe the decision boundaries intended by the original dataset's design and whether $F$ has learned undesirable dataset-specific shortcuts.

Given $\tilde{D}^{(i)}_{\text{test}}$, we can convert $\tilde{D}^{(i)}_{\text{test}}$ to contrast set $\tilde{C}^{(i)}_{\text{test}}$ using various types of semantic perturbation, such as inversion (e.g., "big dog" → "small dog"), numerical modification (e.g., "one dog" → "three dogs"), and entity replacement (e.g., "good dog" → "good cat"). Also, each original instance in $\tilde{D}^{(i)}_{\text{test}}$ can have multiple corresponding contrast instances in $\tilde{C}^{(i)}_{\text{test}}$. Note that it may not be possible to create contrast sets for every instance, in which case these instances are omitted from the contrast set test.

With $\tilde{D}^{(i)}_{\text{test}}$ and $\tilde{C}^{(i)}_{\text{test}}$, we evaluate $F$ using the contrast consistency metric. This is defined as the percentage of instances for which both the original instance and all of its contrast instances are predicted correctly, so higher contrast consistency is better [64]. However, since contrast sets are built from real-world datasets, they provide less flexibility in testing linguistic abilities, as a given perturbation type may not apply to all instances in the dataset. Note that, unlike adversarial examples [63], contrast sets are not conditioned on $F$ specifically to attack $F$.
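To make the contrast consistency computation concrete, here is a minimal sketch. The data layout (a mapping from each original test instance to the predictions and gold labels of its contrast instances) is an assumed format for illustration, not the evaluation code used in our experiments.

```python
from typing import Dict, List, Tuple

def contrast_consistency(
    original_preds: Dict[str, Tuple[int, int]],        # instance id -> (predicted label, gold label)
    contrast_preds: Dict[str, List[Tuple[int, int]]],  # instance id -> [(predicted, gold), ...] for its contrasts
) -> float:
    """Percentage of originals whose prediction AND all contrast predictions are correct.

    Originals without any contrast instances are skipped, mirroring the omission of
    instances for which no contrast could be written.
    """
    consistent, total = 0, 0
    for ex_id, (pred, gold) in original_preds.items():
        contrasts = contrast_preds.get(ex_id, [])
        if not contrasts:
            continue
        total += 1
        if pred == gold and all(p == g for p, g in contrasts):
            consistent += 1
    return 100.0 * consistent / max(total, 1)
```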
4.3.3 Functional Tests

Third, we evaluate OOD generalization with respect to functional tests (Fig. 4.2C). Whereas contrast sets are created by perturbing real-world datasets, functional tests evaluate $F$'s prediction performance on synthetic datasets, which are manually created via templates to assess specific linguistic abilities [200, 133]. While contrast set tests focus on semantic abilities, functional tests consider both semantic (e.g., perception of word/phrase sentiment, sensitivity to negation) and syntactic (e.g., robustness to typos or punctuation addition/removal) abilities. Therefore, functional tests trade off data realism for evaluation flexibility. If ER improves $F$'s functional test performance for a given ability, then ER may be a useful inductive bias for OOD generalization with respect to that ability. Across all tasks, ER-Test contains four general categories of functional tests: Vocabulary, Robustness, Logic, and Entity [200]. See §4.7.7 for more details.

4.4 ER Design Choices

ER consists of four main components: machine rationale extractor, rationale alignment criterion, human rationale type, and instance selection strategy. With ER-Test, we have a standard tool for evaluating design choices for each component.

4.4.1 Machine Rationale Extractors

We consider three types of machine rationale extractors: gradient-based, attention-based, and learned. While other rationale extractor types, like perturbation-based [135], can also be used, we focus on the first three types since they are relatively less compute-intensive. We describe these three rationale extractor types below.

Gradient-Based. Gradient-based rationale extractors compute rationales via the gradient of logits $F(x_i)$ with respect to $x_i$ [229, 209, 218]. In our experiments, we use Input*Gradient (IxG) [47] as a representative gradient-based rationale extractor. Compared to more expensive gradient-based methods that require multiple backward passes per instance [229, 156], IxG only requires one backward pass per instance.

Attention-Based. Attention-based rationale extractors compute rationales via the attention weights used by $F$ to predict $y_i$ [192, 225, 51, 252]. Following existing Transformer-based works, we consider a variant that uses the attention weights in the final layer of $F$ [192, 225]. In this paper, we simply refer to this variant as Attention.

Learned. Learned rationale extractors train a model to compute rationales given task input $x_i$. The learned rationale extractor can be trained with respect to faithfulness, plausibility, task performance, and/or knowledge distillation objectives [30, 13, 221]. Given its generality, we consider UNIREX [30] as a representative learned rationale extractor in our experiments.
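As an illustration of the gradient-based extractor described above, the following is a minimal sketch of Input*Gradient (IxG) for a Transformer classifier. It assumes a Hugging Face-style model that accepts inputs_embeds, and the min-max normalization at the end is one simple choice rather than the exact procedure used in our experiments.

```python
import torch

def ixg_rationale(model, embeddings, attention_mask, target_class):
    """Input*Gradient: one forward and one backward pass per instance.

    embeddings: (batch, seq_len, hidden) input embeddings with requires_grad=True.
    target_class: (batch,) class indices to attribute (e.g., the predicted class y_i).
    Returns per-token importance scores in [0, 1], shape (batch, seq_len).
    """
    logits = model(inputs_embeds=embeddings, attention_mask=attention_mask).logits
    target_logit = logits.gather(1, target_class.unsqueeze(1)).sum()
    # create_graph=True keeps the attribution differentiable so the ER loss can backpropagate through it
    grads = torch.autograd.grad(target_logit, embeddings, create_graph=True)[0]

    scores = (grads * embeddings).sum(dim=-1).abs()      # input * gradient, summed over the hidden dim
    scores = scores * attention_mask.float()             # zero out padding tokens
    mins = scores.min(dim=1, keepdim=True).values
    maxs = scores.max(dim=1, keepdim=True).values
    return (scores - mins) / (maxs - mins + 1e-8)        # min-max normalize to [0, 1]
```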
4.4.2 Rationale Alignment Criteria

We consider six representative rationale alignment criteria (i.e., choices of $\Phi$), described below.

Mean Squared Error (MSE). MSE is used in [147], [109], and [203]: $\Phi_{\text{MSE}}(\hat{r}_i, \dot{r}_i) = \frac{1}{n} \lVert \hat{r}_i - \dot{r}_i \rVert_2^2$.

Mean Absolute Error (MAE). MAE is used in [201]: $\Phi_{\text{MAE}}(\hat{r}_i, \dot{r}_i) = \frac{1}{n} \lVert \hat{r}_i - \dot{r}_i \rVert_1$.

Huber Loss. Huber loss [91] is a hybrid of MSE and MAE, but is still unexplored for ER. Following the PyTorch library's default settings, our experiments use $\delta = 1$ [185].
$$
\Phi_{\text{Huber}}(\hat{r}_i, \dot{r}_i) =
\begin{cases}
\frac{1}{2}\,\Phi_{\text{MSE}}(\hat{r}_i, \dot{r}_i), & \Phi_{\text{MAE}}(\hat{r}_i, \dot{r}_i) < \delta \\
\delta \left( \Phi_{\text{MAE}}(\hat{r}_i, \dot{r}_i) - \frac{1}{2}\delta \right), & \text{otherwise}
\end{cases}
\tag{4.1}
$$

Binary Cross Entropy (BCE). BCE loss is used in [30] and [31]: $\Phi_{\text{BCE}}(\hat{r}_i, \dot{r}_i) = -\frac{1}{n} \sum_{t=1}^{n} \dot{r}_i^t \log(\hat{r}_i^t)$.

KL Divergence (KLDiv). KLDiv is used by [192], [30], and [31]: $\Phi_{\text{KLDiv}}(\hat{r}_i, \dot{r}_i) = \frac{1}{n} \sum_{t=1}^{n} \dot{r}_i^t \log(\dot{r}_i^t / \hat{r}_i^t)$.

Order Loss. Recall that the human rationale $\dot{r}_i$ labels each token as important (one) or unimportant (zero). Whereas other criteria generally push important/unimportant tokens' importance scores to be as high/low as possible, order loss [90] relaxes MSE to merely enforce that all important tokens' importance scores are higher than all unimportant tokens' importance scores. This ranking-based criterion is especially useful if $\dot{r}_i$ is somewhat noisy (e.g., if some tokens labeled as important are not actually important).
$$
\Phi_{\text{Order}}(\hat{r}_i, \dot{r}_i) = \sum_{t:\, \dot{r}_i^t = 1} \min\!\left( \frac{\hat{r}_i^t}{\max_{j:\, \dot{r}_i^j = 0} \hat{r}_i^j} - 1,\; 0 \right)^{2}
\tag{4.2}
$$

4.4.3 Human Rationale Types

To construct human rationale $\dot{r}_i$, we consider both instance-level and task-level human rationales.

Instance-Level Rationales. Human rationales are often created by annotating each training instance individually [141, 25, 196]. For each instance, humans are asked to mark tokens that support the gold label as important, with the remaining tokens counted as unimportant. Here, each human rationale is specifically conditioned on the input and gold label for the given instance. However, instance-level rationales are expensive to obtain, given the high manual effort required per instance.

Task-Level Rationales. Some works construct distantly-supervised human rationales by applying task-level human priors across all training instances [109, 201, 203, 147]. Given a task-level token lexicon, each instance's rationale is created by marking input tokens present in the lexicon as important and the rest as unimportant, or vice versa. Here, rationales are not as fine-grained or tailored for the given dataset, but may provide a more general learning signal for solving the task.

4.4.4 Instance Selection Strategies

In real-world applications, it is often infeasible to annotate instance-level human rationales $\dot{r}_i$ for all training instances [38, 107]. Besides task-level rationales, another approach for addressing this issue could be to annotate only a subset $S_{\text{train}} \subset D_{\text{train}}$ of training instances. Given a budget of $|S_{\text{train}}| = \frac{k}{100}\,|D_{\text{train}}|$ instances, where $0 < k < 100$, our goal is to select $S_{\text{train}}$ such that ER with $S_{\text{train}}$ maximizes $F$'s task performance. There exist various ways that the annotation budget can be allocated. However, for simplicity, we assume that all $|S_{\text{train}}|$ instances are selected and annotated in a single round, so that ER model training only occurs once.

While instance selection via active learning is well-studied for general classification [213], this problem has not been explored in ER. Given non-ER LM $F_{\text{No-ER}}$, we use ER-Test to compare five active-learning-inspired instance selection strategies. Note that these are just basic strategies, used to demonstrate a proof of concept for ER-Test's utility. In practice, one could consider more sophisticated strategies that account for other factors like data diversity.

Random Sampling (Rand) constructs $S_{\text{train}}$ by uniformly sampling $|S_{\text{train}}|$ instances from $D$.

Lowest Confidence (LC) selects the $|S_{\text{train}}|$ instances for which $F_{\text{No-ER}}$ yields the lowest target class confidence probability $F_{\text{No-ER}}(\dot{y}_i \mid x_i)$ [284].

Highest Confidence (HC) selects the $|S_{\text{train}}|$ instances for which $F_{\text{No-ER}}$ yields the highest target class confidence probability $F_{\text{No-ER}}(\dot{y}_i \mid x_i)$. This is the opposite of LC.

Lowest Importance Scores (LIS). Given machine rationale $\hat{r}_i$ for $F_{\text{No-ER}}$ and $0 < k' < 100$, let $\hat{r}_i^{(k')}$ denote the vector of the top-$k'$% highest importance scores in $\hat{r}_i$. With $r_S = (1 / |\hat{r}_i^{(k')}|) \sum \hat{r}_i^{(k')}$ as the mean score in $\hat{r}_i^{(k')}$, LIS selects the $|S_{\text{train}}|$ instances for which $r_S$ is lowest.
This is similar to selecting instances with the highest $\hat{r}_i$ entropy.

Highest Importance Scores (HIS). Given $r_S$, HIS selects the $|S_{\text{train}}|$ instances for which $r_S$ is highest. This is the opposite of LIS.

4.5 Experiments

Using ER-Test's unified evaluation protocol, we study the effectiveness of ER and various ER design choices with respect to OOD generalization. In particular, we focus on the following four important research questions (Fig. 4.3), which have been underexplored in prior works.

[Figure 4.3 appears here: an overview of the four RQs, showing the downstream tasks (Sentiment Analysis, NLI, NER, Hate Speech Detection), the ER criteria (MSE, MAE, Huber, BCE, KLDiv, Order), an example lexicon and labeled text, the instance selection options (random, top-k% least/most confident), and the three annotation types (label only, rationale only, label + rationale).]

Figure 4.3: ER-Test Research Questions. To demonstrate ER-Test's utility, we use ER-Test to study four important yet underexplored research questions (RQs). Each RQ considers a different category of ER design choices: rationale alignment criteria (RQ1), human rationale type (RQ2), number/choice of rationale-annotated instances (RQ3), and rationale annotation time (RQ4). With ER-Test, we have a system for identifying ER design choices that are effective for improving OOD generalization (§4.5).

(RQ1) First, which rationale alignment criteria are most effective for ER? Despite rationale alignment criteria being central to ER, little is understood about their influence on ER models' generalization ability (§4.5.4). (RQ2) Second, compared to instance-level human rationales, how effective are task-level human rationales for ER? Currently, the generalization trade-offs between these two human rationale types are unclear, since existing ER works do not explicitly compare instance-level and task-level human rationales' impact on ER (§4.5.5). (RQ3) Third, how is ER affected by the number and choice of training instances with human rationales? Sometimes, it is only feasible to annotate rationales for a small number of training instances, but determining how many and which instances to annotate has been underexplored in ER (§4.5.6). (RQ4) Fourth, how is ER affected by the time taken to annotate human rationales? Instead of doing ER (i.e., with rationale-annotated instances), it is also possible to improve LM generalization by simply providing more label-annotated instances. To verify the practical utility of ER, we compare the time-efficiency of label and rationale annotation, in terms of their respective impact on model generalization (§4.5.7).

4.5.1 Tasks and Datasets

To demonstrate ER-Test's usage, we consider a diverse set of text classification tasks. In this paper, we focus on sentiment analysis and natural language inference (NLI) in the main text (§4.5.4-4.5.7), but also present experiments on named entity recognition (NER) (§4.7.3.1) and hate speech detection (§4.7.3.2) in the appendix. For unseen dataset tests, we focus on the most popular datasets for the given task. First, for sentiment analysis, we use SST (short movie reviews) [223, 27] as the ID dataset.
As OOD datasets, we use Yelp (restaurant reviews) [278], Amazon (product reviews) [166], and Movies (long movie reviews) [272, 49]. Second, for NLI, we use e-SNLI [25, 49] as the ID dataset and MNLI [254] as the OOD dataset. Since contrast set tests and functional tests are unavailable for the above datasets yet very expensive to construct, we instead use existing contrast set tests and functional tests released by prior works for sentiment analysis and NLI. Although these existing tests are created from different datasets than the ones mentioned earlier, they can still provide valuable signal for evaluating LMs’ OOD generalization. For sentiment analysis, we use contrast set tests created from the IMDb dataset [64] and functional tests created from the Flights dataset [200]. For NLI, we use contrast set tests created from the Linguistically-Informed Transformations (LIT) dataset [133] and functional tests created from the AllenNLP textual entailment (ANLP-NLI) dataset [65]. 71 4.5.2 EvaluationMetrics Unseen Dataset Tests For unseen dataset tests, we evaluate an LM’s task performance on unseen datasets using their respective standard metrics. For sentiment analysis datasets, we measure accuracy [223, 278, 166, 272]. For NLI datasets, we measure macro F1 [25, 254]. ContrastSetTests For contrast set tests, we primarily evaluate each LM using the contrast consistency metric. As described in §4.3.2, contrast consistency is defined as the percentage of instances for which both the original instance and all of its contrast instances are predicted correctly, so higher contrast consistency is better [64]. In addition, we report the LM’s task performance on the original test set and the contrast set. For sentiment analysis, as described before, task performance metric is measured using accuracy. For NLI, task performance metric is measured using accuracy (instead of macro F1), since accuracy is the standard metric for the special LIT dataset used for contrast set tests [133]. FunctionalTests For all functional tests, we evaluate the LM using the failure rate metric [200], defined as the percentage of instances predicted incorrectly by the LM. Since different functional tests’ failure rates may have different scales, we use min-max scaling to compute the normalized failure rate for each functional test [30]. Thus, an LM’s aggregate functional test performance can be computed as the mean normalized failure rate across all functional tests. 4.5.3 ImplementationDetails In all experiments, we use BigBird-Base [271] as the LM architecture, in order to handle input sequences of up to 4096 tokens. Unless otherwise specified, we use a learning rate of 2e− 5 and an effective batch size of 32. For all results, we report the mean over three seeds, as well as the standard deviation. To measure the statistical significance of ER’s improvements, we use the unpaired Welch’s t-test between each ER model and the baseline No-ER model, withp<0.05. For our t-test, the alternative hypothesis is that the given ER 72 model’s mean performance is greater than the No-ER model’s mean performance. Further implementation details can be found in §4.7.1. 4.5.4 RQ1: WhichrationalealignmentcriteriaaremosteffectiveforER? In RQ1, we use ER-Test to analyze how different rationale alignment criteria impact ER models’ ID and OOD generalization ability. 4.5.4.1 Setup In ER, the rationale alignment criterionΦ is used to push the model’s machine rationales to be more similar to human rationales (§4.2). 
For RQ1, we consider the three machine rationale extractors (IxG, Attention, UNIREX) and six rationale alignment criteria (MSE, MAE, Huber, BCE, KLDiv, Order) described in §4.4.1- 4.4.2. For all RQ1 settings, we assume instance-level rationales are available for all training instances. Due to the high computational costs of UNIREX’s faithfulness optimization, we only optimize UNIREX rationale extractors for plausibility (equivalent to ER objective) and task performance. See §4.7.1 for more details. Below, we discuss our findings from using ER-Test to explore RQ1. The RQ1 results obtained via unseen dataset tests (§4.5.4.2), contrast set tests (§4.5.4.3), and functional tests (§4.5.4.4) are shown in Table 4.1, Table 4.2, and Fig. 4.4, respectively. 4.5.4.2 UnseenDatasetTests Table 4.1 shows the results for unseen dataset tests on RQ1. First, on both sentiment analysis and NLI, no ER models achieve significantly higher ID (seen dataset) performance than No-ER. However, when considering OOD (unseen dataset) performance, a considerable number of ER models (e.g., IxG+MAE) yield significant improvements over No-ER. Second, we find that Attention-based and IxG-based ER models generally achieve the best OOD performance on sentiment analysis and NLI, respectively. On the other 73 Machine Rationale Extractor Rationale Alignment Criterion SentimentAnalysis NLI Seen Acc (↑) Unseen Acc (↑) Seen F1 (↑) Unseen F1 (↑) SST Amazon Yelp Movies e-SNLI MNLI - No-ER 94.22 (± 0.77) 90.72 (± 1.36) 92.07 (± 2.66) 89.83 (± 6.79) 76.18 (± 1.28) 46.15 (± 4.38) IxG MSE 94.29 (± 0.05) 90.58 (± 0.77) 92.17 (± 0.64) 90.00 (± 5.63) 78.98 (± 1.00) 54.23 (± 2.67) MAE 94.11 (± 0.38) 92.02 (± 0.25) 94.55 (± 0.30) 95.50 (± 1.32) 78.77 (± 1.01) 52.41 (± 4.50) Huber 94.19 (± 0.19) 90.43 (± 1.45) 92.38 (± 2.11) 91.83 (± 3.75) 78.99 (± 0.81) 53.97 (± 3.11) BCE 94.15 (± 0.53) 90.70 (± 1.19) 91.82 (± 2.30) 92.00 (± 6.98) 79.07 (± 0.83) 53.68 (± 4.15) KLDiv 94.62 (± 0.61) 91.63 (± 0.51) 93.55 (± 1.69) 93.00 (± 2.18) 73.68 (± 4.77) 46.57 (± 1.35) Order 94.37 (± 0.11) 89.47 (± 2.71) 87.95 (± 6.36) 84.50 (± 10.15) 79.11 (± 0.87) 55.26 (± 3.56) Attention MSE 94.71 (± 0.75) 91.88 (± 0.53) 94.70 (± 0.18) 95.83 (± 1.15) 76.04 (± 0.43) 48.60 (± 2.55) MAE 93.89 (± 0.89) 92.18 (± 0.59) 94.75 (± 0.22) 96.17 (± 2.02) 76.94 (± 0.89) 50.26 (± 3.14) Huber 93.92 (± 0.94) 91.93 (± 0.75) 94.55 (± 0.78) 96.00 (± 0.00) 76.54 (± 1.04) 50.32 (± 2.12) BCE 94.89 (± 0.71) 91.55 (± 1.56) 93.92 (± 1.14) 94.83 (± 0.58) 76.17 (± 0.65) 51.28 (± 2.06) KLDiv 94.29 (± 0.65) 91.43 (± 0.71) 94.58 (± 0.51) 96.67 (± 0.76) 77.35 (± 0.59) 49.66 (± 2.47) Order 94.92 (± 0.35) 91.45 (± 0.00) 93.90 (± 0.92) 95.50 (± 0.71) 76.76 (± 0.12) 50.66 (± 2.55) UNIREX MSE 93.81 (± 1.20) 85.08 (± 7.22) 82.32 (± 10.43) 82.50 (± 8.23) 74.20 (± 1.86) 46.98 (± 2.11) MAE 94.65 (± 0.46) 90.15 (± 1.19) 92.07 (± 1.59) 94.50 (± 0.50) 73.28 (± 1.23) 44.23 (± 0.89) Huber 94.03 (± 0.93) 87.93 (± 4.45) 89.90 (± 4.88) 88.67 (± 6.81) 73.75 (± 2.47) 46.73 (± 2.15) BCE 94.05 (± 0.55) 86.37 (± 3.59) 83.55 (± 6.38) 74.67 (± 14.49) 69.05 (± 1.64) 40.94 (± 1.15) KLDiv 94.23 (± 0.47) 90.12 (± 1.19) 93.35 (± 1.18) 91.00 (± 6.08) 74.69 (± 1.31) 47.03 (± 3.09) Order 93.96 (± 0.75) 86.95 (± 5.20) 90.18 (± 5.77) 91.83 (± 5.92) 72.86 (± 0.75) 44.41 (± 2.13) Table 4.1: RQ1-UnseenDatasetTests(§4.5.4.2). We compare various ER rationale alignment criteria (as well as the No-ER baseline), with respect to performance on seen (ID) and unseen (OOD) datasets. 
Performance is measured using accuracy (Acc) for sentiment analysis and macro F1 (F1) for NLI. Numbers highlighted in blue indicate sta- tistically significant improvement over the No-ER baseline ( p<0.05). hand, although UNIREX-based ER models perform competitively on some datasets, they are noticeably worse on others. This may be due to our UNIREX extractors not being optimized for faithfulness. Third, all ER models achieve about the same ID performance, making it hard to distinguish between different ER criteria via ID performance. Meanwhile, there is much greater variance in OOD performance among ER models, making it easier to identify which criteria are more effective. For sentiment analysis, ER models using MAE consistently achieve the best overall OOD performance across all rationale extractors. For NLI, no particular criterion consistently outperforms the others. Why does MAE perform better than other criteria? Although ER considers binary human rationales by default, in sentiment analysis, it is unlikely that every token with the same rationale annotation is equally important. This means the model should have the flexibility to assign different levels of importance to each token. Nonetheless, MAE, Huber, BCE, and KLDiv heavily penalize any deviation from human rationales, which may push the model to be overconfident in its token importance predictions on OOD 74 data. Meanwhile, Order’s soft ranking objective may be too relaxed, since it is perfectly satisfied even if the important tokens’ scores are barely higher than unimportant tokens’ scores. This may cause the model to have weak learning signal and also fail on OOD data. This suggests that MAE achieves the best balance of supervision strictness on sentiment analysis. Machine Rationale Extractor Rationale Alignment Criterion SentimentAnalysis NLI IMDb LIT Original Acc (↑) Contrast Acc (↑) Consistency (↑) Original Acc (↑) Contrast Acc (↑) Consistency (↑) - No-ER 88.39 (± 2.05) 85.11 (± 2.72) 73.90 (± 4.64) 46.15 (± 4.38) 43.73 (± 2.81) 16.84 (± 3.18) IxG MSE 88.11 (± 2.33) 86.07 (± 2.48) 78.07 (± 5.79) 54.23 (± 2.67) 51.95 (± 1.21) 16.37 (± 1.30) MAE 91.12 (± 0.59) 89.82 (± 1.20) 81.48 (± 1.86) 52.41 (± 4.50) 52.02 (± 1.49) 17.48 (± 0.40) Huber 89.20 (± 1.67) 86.13 (± 1.74) 75.82 (± 3.37) 53.97 (± 3.11) 52.32 (± 1.04) 16.34 (± 0.96) BCE 89.55 (± 1.42) 87.30 (± 4.03) 77.25 (± 5.30) 53.68 (± 4.15) 52.37 (± 1.42) 16.90 (± 0.63) KLDiv 89.82 (± 1.71) 87.91 (± 2.14) 78.28 (± 3.50) 52.39 (± 5.59) 45.07 (± 7.32) 14.91 (± 1.92) Order 86.00 (± 5.27) 83.40 (± 6.16) 69.74 (± 11.50) 55.26 (± 3.56) 52.78 (± 0.74) 16.20 (± 0.54) Attention MSE 91.46 (± 0.97) 89.14 (± 1.95) 81.01 (± 2.19) 50.56 (± 2.24) 56.42 (± 3.17) 18.87 (± 0.95) MAE 91.46 (± 0.63) 89.41 (± 0.12) 81.28 (± 0.60) 52.56 (± 1.93) 52.48 (± 1.77) 17.97 (± 0.48) Huber 91.33 (± 0.24) 88.66 (± 1.17) 80.40 (± 1.40) 50.46 (± 1.92) 55.46 (± 3.75) 19.29 (± 1.51) BCE 91.33 (± 1.36) 89.89 (± 1.39) 81.76 (± 2.67) 53.43 (± 5.12) 50.62 (± 6.28) 16.72 (± 1.77) KLDiv 91.39 (± 0.82) 86.81 (± 0.83) 78.48 (± 1.43) 51.78 (± 3.87) 51.91 (± 2.41) 17.83 (± 1.80) Order 91.70 (± 0.14) 89.34 (± 3.19) 81.56 (± 3.47) 54.50 (± 3.54) 50.78 (± 1.83) 17.75 (± 0.85) UNIREX MSE 77.12 (± 17.69) 70.77 (± 14.95) 48.30 (± 32.40) 46.21 (± 2.90) 55.25 (± 2.74) 17.45 (± 2.62) MAE 89.96 (± 1.25) 85.86 (± 2.08) 76.30 (± 2.40) 45.18 (± 2.88) 54.53 (± 2.79) 16.50 (± 1.47) Huber 87.84 (± 5.15) 81.01 (± 3.96) 69.13 (± 9.15) 46.37 (± 2.43) 52.12 (± 6.06) 15.67 (± 3.57) BCE 67.83 (± 18.57) 61.34 (± 14.75) 29.30 (± 33.07) 39.43 
(± 2.64) 47.12 (± 5.21) 11.99 (± 1.80) KLDiv 88.46 (± 0.52) 82.99 (± 1.79) 71.93 (± 2.30) 48.32 (± 3.01) 49.46 (± 5.07) 14.88 (± 2.23) Order 85.93 (± 8.76) 80.94 (± 11.05) 67.35 (± 19.83) 43.94 (± 2.15) 54.10 (± 1.95) 16.94 (± 0.64) Table 4.2: RQ1-ContrastSetTests(§4.5.4.3). We compare various ER rationale alignment criteria (as well as the No-ER baseline), with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Contrast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No- ER baseline (p<0.05). 4.5.4.3 ContrastSetTests Table 4.2 shows the results for contrast set tests on RQ1. First, for both sentiment analysis and NLI, we observe that IxG-based and Attention-based ER models generally outperform No-ER, in terms of original accuracy, contrast accuracy, and contrast consistency. In particular, for almost all criteria, the Attention ex- tractor consistently yields high performance, especially on sentiment analysis. This further demonstrates the benefits of ER on OOD generalization. Meanwhile, UNIREX-based ER models tend to perform worse 75 than No-ER on both sentiment analysis and NLI. Again, this may be due to our UNIREX rationale extrac- tors not being optimized for faithfulness. Second, when comparing criteria, the MAE criterion yields the highest overall performance on sentiment analysis, while no particular criterion dominates on NLI. This corroborates our findings from the unseen dataset tests. 4.5.4.4 FunctionalTests Fig. 4.4 shows the results for functional tests on RQ1. First, we observe that models that perform well on one functional test may not perform well on other functional tests. This suggests that each functional test evaluates a distinctly different linguistic capability. Thus, when comparing different models’ generalization ability, it is important to consider the mean performance across all four functional tests. Second, across all functional tests for both tasks, we find that IxG-based ER models consistently achieve lower failure rates than No-ER. For IxG, all rationale alignment criteria except Huber yield significantly lower failure rates than No-ER, with IxG+Order performing best overall. However, the results are more mixed for Attention, with Attention+BCE performing best overall. Meanwhile, UNIREX-based ER models consistently achieve higher failure rates than No-ER. To some extent, this mirrors our general conclusions from the unseen dataset tests and contrast set tests. This shows that ER functional test performance can be very sensitive to the choices of both rationale extractor and rationale alignment criteria. 76 Figure 4.4: RQ1-FunctionalTests(§4.5.4.4). We compare various ER rationale alignment criteria (as well as the No-ER baseline), with respect to performance on a range of functional tests (OOD). Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well the mean normalized failure rate across all functional tests (Mean). 77 4.5.5 RQ2: Howeffectivearetask-levelhumanrationalesforER? In RQ2, we use ER-Test to compare the effectiveness of instance-level and task-level human rationales when used for ER on the same task. 4.5.5.1 Setup Due to computational constraints, we only consider a subset of the settings used in RQ1. First, we focus on the sentiment analysis task. 
Second, although we consider all three extractors (IxG, Attention, UNIREX), we only consider the MAE and Huber criteria, since they yielded the best task performance on the ID development set. Third, to generate task-level rationales, we use the AFINN [180] and SenticNet [24] lexicons. See §4.7.4 for more details. Below, we discuss our findings from using ER-Test to explore RQ2. The RQ2 results obtained via unseen dataset tests (§4.5.5.2), contrast set tests (§4.5.5.3), and functional tests (§4.5.5.4) are shown in Table 4.3, Table 4.4, and Fig. 4.5, respectively. 4.5.5.2 UnseenDatasetTests Table 4.3 shows the results for unseen dataset tests on RQ2. Note that the results for instance-level ra- tionales are copied from RQ1’s unseen dataset test results in Table 4.1. For both instance-level and task- level rationales, all ER models perform similarly to No-ER on the seen dataset (SST). Meanwhile, for both instance-level and task-level rationales, most ER models significantly outperform No-ER on the unseen datasets (Amazon, Yelp, Movies). In particular, task-level rationales yield notable gains on the Yelp and Movie datasets, sometimes even beating their instance-level counterparts on the same extractors. We be- lieve task-level rationales’ advantage in some settings is due to their lexicon being task-specific ( i.e., for sentiment analysis) and dataset-agnostic. In other words, task-level rationales may contain more general 78 knowledge (i.e., sentiment-related terms) that is also applicable to unseen datasets, whereas the instance- based rationales contain more SST-specific knowledge. Machine Rationale Extractor Rationale Alignment Criterion Human Rationale Type SentimentAnalysis Seen Acc (↑) Unseen Acc (↑) SST Amazon Yelp Movies - - No-ER 94.22 (± 0.77) 90.72 (± 1.36) 92.07 (± 2.66) 89.83 (± 6.79) IxG MAE Instance-Level 94.11 (± 0.38) 92.02 (± 0.25) 94.55 (± 0.30) 95.50 (± 1.32) Task-Level 94.53 (± 0.60) 92.02 (± 0.45) 94.10 (± 0.91) 95.83 (± 1.02) Huber Instance-Level 94.19 (± 0.19) 90.43 (± 1.45) 92.38 (± 2.11) 91.83 (± 3.75) Task-Level 93.81 (± 0.47) 91.05 (± 1.45) 93.88 (± 0.41) 94.00 (± 0.40) Attention MAE Instance-Level 93.89 (± 0.89) 92.18 (± 0.59) 94.75 (± 0.22) 96.17 (± 2.02) Task-Level 94.42 (± 0.92) 91.73 (± 0.51) 94.65 (± 1.00) 94.33 (± 0.62) Huber Instance-Level 93.92 (± 0.94) 91.93 (± 0.75) 94.55 (± 0.78) 96.00 (± 0.00) Task-Level 94.88 (± 0.07) 91.90 (± 0.10) 94.08 (± 0.65) 95.83 (± 0.84) UNIREX MAE Instance-Level 94.69 (± 0.93) 91.28 (± 0.74) 93.28 (± 2.16) 94.83 (± 2.08) Task-Level 94.65 (± 0.46) 90.15 (± 1.19) 92.07 (± 1.59) 87.33 (± 8.36) Huber Instance-Level 94.03 (± 0.93) 87.93 (± 4.45) 89.90 (± 4.88) 88.67 (± 6.81) Task-Level 94.48 (± 1.18) 90.50 (± 0.48) 92.05 (± 1.00) 94.33 (± 1.24) Table 4.3: RQ2-UnseenDatasetTests(§4.5.5.2). (§4.3). We compare ER model performance using instance-level rationales versus using task-level rationales, with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc). Numbers highlighted in blue indicate statistically significant im- provement over the No-ER baseline (p<0.05). 4.5.5.3 ContrastSetTests Table 4.4 shows the results for contrast set tests on RQ2. Note that the results for instance-level rationales are copied from RQ1’s contrast set test results in Table 4.2. First, for both instance-level and task-level rationales, we observe that IxG-based and Attention-based ER models generally outperform No-ER on all three metrics. 
Second, for some settings, we find that task-level rationales can even yield higher contrast accuracy and contrast consistency than instance-level rationales, while achieving similar original accuracy. For both IxG-based and Attention-based ER models, task-level rationales often outperform instance-level rationales in contrast accuracy and contrast consistency. These results suggest task-level rationales can serve as an effective yet annotation-efficient substitute for instance-level rationales. 79 Machine Rationale Extractor Rationale Alignment Criterion Human Rationale Type SentimentAnalysis IMDb Original Acc (↑) Contrast Acc (↑) Consistency (↑) IxG MAE Instance-Level 91.12 (± 0.59) 89.82 (± 1.20) 81.48 (± 1.86) Task-Level 91.46 (± 0.72) 89.82 (± 2.46) 83.47 (± 0.47) Huber Instance-Level 89.20 (± 1.67) 86.13 (± 1.74) 75.82 (± 3.37) Task-Level 90.64 (± 1.25) 89.82 (± 2.46) 79.58 (± 1.72) Attention MAE Instance-Level 91.46 (± 0.63) 89.41 (± 0.12) 81.28 (± 0.60) Task-Level 91.19 (± 0.20) 91.94 (± 0.43) 83.47 (± 0.47) Huber Instance-Level 91.33 (± 0.24) 88.66 (± 1.17) 80.40 (± 1.40) Task-Level 91.12 (± 0.12) 91.94 (± 0.43) 79.58 (± 2.07) UNIREX MAE Instance-Level 89.96 (± 1.25) 85.86 (± 2.08) 76.30 (± 2.40) Task-Level 84.77 (± 9.03) 77.94 (± 13.45) 62.98 (± 22.43) Huber Instance-Level 87.84 (± 5.15) 81.01 (± 3.96) 69.13 (± 9.15) Task-Level 89.89 (± 0.52) 84.02 (± 1.55) 74.25 (± 2.17) Table 4.4: RQ2-ContrastSetTests(§4.5.5.3). We compare ER model performance using instance-level rationales versus usingtask-levelrationales, with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Contrast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05). 4.5.5.4 FunctionalTests Fig. 4.5 shows the functional test results for RQ2. First, for instance-level rationales, we see that multiple ER models (i.e., IxG+MAE, Attention+MAE, Attention+Huber) yield lower failure rates than No-ER, although ER models perform especially poorly on entity tests. In particular, IxG+MAE achieves the lowest mean failure rate for instance-level rationales, with Attention+MAE coming at a close second. This supports our earlier finding that MAE is a generally effective rationale alignment criterion. Second, for task-level rationales, we find that only Attention+MAE achieves the lowest failure rate but is the only ER model that achieves a lower mean failure rate than No-ER. Again, ER models perform especially poorly on entity tests. Although task-level rationales can be useful in certain situations (e.g., unseen dataset tests and contrast tests), these results show the limitations of using task-level rationales instead of instance-level rationales. We hypothesize that relying on such task-level lexicons may sometimes hinder ER models from generalizing to harder instances that require dataset-specific knowledge, which can only be obtained via instance-level rationale annotations. 80 Figure 4.5: RQ2-FunctionalTests(§4.5.5.4). We compare ER model performance using instance-level rationales versus using task-level rationales, with respect to performance on a range of functional tests (OOD). Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well the mean normalized failure rate across all functional tests (Mean). 
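To make the task-level rationales studied throughout RQ2 concrete, here is a minimal sketch of the lexicon-based construction described in §4.4.3, assuming a plain word-level lexicon and whitespace tokenization. The toy lexicon is illustrative; the actual AFINN and SenticNet lexicons are far larger and require tokenizer-specific alignment, which is omitted here.

```python
from typing import List, Set

def task_level_rationale(tokens: List[str], lexicon: Set[str]) -> List[int]:
    """Mark tokens that appear in a task-level lexicon as important (1), others as unimportant (0)."""
    return [1 if tok.lower().strip(".,!?") in lexicon else 0 for tok in tokens]

# Toy sentiment lexicon (illustrative only; AFINN/SenticNet entries are scored and much larger)
SENTIMENT_LEXICON = {"great", "awful", "terrible", "awesome", "good", "bad"}

tokens = "Nolan is a great director .".split()
print(task_level_rationale(tokens, SENTIMENT_LEXICON))   # [0, 0, 0, 1, 0, 0]
```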
4.5.6 RQ3: How is ER affected by the number and choice of training instances with humanrationales? In RQ1 and RQ2, we considered the ER setting where human rationales (whether instance-level or task- level) are available for all training instances. However, if rationale annotation resources are limited and task-level rationales are not feasible, we need a way to determine which training instances should be pri- oritized for instance-level rationale annotation. Using ER-Test, RQ3 explores the impact of different an- notation budgets (i.e.,number of training instances) and instance selection methods (i.e.,choice of training instances) on ER model generalization. 4.5.6.1 Setup Following §4.4.4, we consider budgets of k = {5,15,50}, where k% of training instances are annotated with human rationales. For instance selection, we consider the five strategies (Random, LC, HC, LIS, HIS) described in §4.4.4. As reference points, we also include the No-ER model (k =0) and the fully-supervised 81 Figure 4.6: RQ3-UnseenDatasetTests(§4.5.6.2). For the strongestinstanceselectionstrategy (LIS) and key base- lines (No-ER, Random), we plot task performance (Accuracy) as a function of instance annotation budget (k). Table 4.5 presents more comprehensive results comparing No-ER, Random, and LIS to other instance selection strategies (LC, HC, HIS). ER model (k = 100). Due to computational constraints, we limit our RQ3 experiments to sentiment analysis and IxG+MAE, which yielded the highest development ID performance for RQ1 (§4.7.1). See §4.7.5 for more details. Below, we discuss our findings from using ER-Test to explore RQ3. The RQ3 results obtained via unseen dataset tests (§4.5.6.2), contrast set tests (§4.5.6.3), and functional tests (§4.5.6.4) are shown in Table 4.5 (as well as Fig. 4.6), Table 4.6, and Fig. 4.7, respectively. 4.5.6.2 UnseenDatasetTests Table 4.5 and 4.6 present the results for unseen dataset tests on RQ3. First, like we saw in RQ1 and RQ2, almost all compared models have similar ID (seen dataset) performance. Thus, we continue to focus on comparing models with respect to OOD (unseen dataset) performance. Second, among the different instance selection strategies, we see that LIS achieves the best overall OOD performance, across all datasets and instance annotation budgets. This demonstrates that feature impor- tance scores can provide useful signal for ranking instances to annotate. Interestingly, although Random generally does not perform well, we find that Random does not always yield the worst performance, with LC typically performing worse. In particular, fork = 50%, LC achieves much lower accuracy (with high variance) than other strategies do. This indicates that label confidence scores do not provide reliable signal for ranking instances. 
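As an illustration of the LIS strategy analyzed here (§4.4.4), the sketch below scores each training instance by the mean of its top-k'% importance scores under the No-ER model's machine rationale and keeps the lowest-scoring instances. The function name and the tie-breaking behavior of argsort are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
from typing import List

def lis_select(rationales: List[np.ndarray], budget: int, k_prime: float = 10.0) -> List[int]:
    """Lowest Importance Scores (LIS): pick the `budget` instances whose mean
    top-k'% importance score (from the No-ER model's rationale) is lowest."""
    scores = []
    for r in rationales:                              # r: per-token importance scores for one instance
        n_top = max(1, int(round(len(r) * k_prime / 100)))
        top_scores = np.sort(r)[-n_top:]              # top-k'% highest scores in r_hat_i
        scores.append(top_scores.mean())              # r_S for this instance
    order = np.argsort(scores)                        # ascending: lowest r_S first
    return order[:budget].tolist()

# HIS (Highest Importance Scores) is the same ranking reversed: order[::-1][:budget]
```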
82 Instance Annotation Budget (k) Instance Selection Strategy SentimentAnalysis Seen Acc (↑) Unseen Acc (↑) SST Amazon Yelp Movies 0% - 94.22 (± 0.77) 90.72 (± 1.36) 92.07 (± 2.66) 89.83 (± 6.79) 100% - 94.11 (± 0.38) 92.02 (± 0.25) 94.55 (± 0.30) 95.50 (± 1.32) 5% Random 94.36 (± 0.05) 91.57 (± 0.10) 93.36 (± 0.15) 92.39 (± 2.50) LC 93.14 (± 1.97) 90.72 (± 0.43) 93.50 (± 0.53) 93.17 (± 1.26) HC 94.32 (± 0.42) 91.57 (± 0.19) 93.03 (± 0.81) 91.33 (± 3.09) LIS 93.92 (± 1.07) 92.42 (± 0.48) 94.28 (± 0.31) 96.50 (± 1.50) HIS 93.94 (± 0.83) 90.58 (± 0.95) 91.47 (± 2.37) 92.00 (± 4.58) 15% Random 94.46 (± 0.21) 90.06 (± 1.17) 90.81 (± 2.63) 86.22 (± 2.94) LC 93.48 (± 0.80) 90.12 (± 2.66) 90.90 (± 5.30) 83.67 (± 14.02) HC 94.39 (± 0.27) 90.38 (± 1.12) 93.48 (± 0.64) 91.33 (± 5.11) LIS 94.25 (± 0.37) 91.15 (± 0.22) 94.00 (± 0.56) 95.33 (± 1.26) HIS 94.47 (± 0.22) 91.13 (± 0.60) 92.67 (± 0.98) 93.50 (± 3.12) 50% Random 93.47 (± 0.02) 90.28 (± 1.42) 91.85 (± 2.11) 89.78 (± 5.68) LC 87.07 (± 5.15) 78.82 (± 20.68) 77.73 (± 26.53) 76.67 (± 19.08) HC 92.93 (± 0.17) 92.15 (± 0.36) 94.48 (± 0.94) 91.00 (± 6.50) LIS 93.17 (± 0.55) 90.60 (± 0.25) 92.72 (± 0.53) 93.50 (± 0.87) HIS 94.23 (± 0.65) 88.85 (± 2.67) 91.47 (± 1.47) 93.67 (± 1.89) Table 4.5: RQ3-UnseenDatasetTests(§4.5.6.2). We compare ER model performance for five instance selection strategies across different instance annotation budgets ( k% of training data), with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline ( p<0.05). Third, as expected, the ER model trained on all rationale annotations (k = 100%) generally outper- forms both No-ER (k = 0%) and ER models with other budgets (k = {5%,15%,50%}). However, we counterintuitively find that k = 5% tends to outperform both k = 15% and k = 50%. In some cases (i.e., LIS on Movies),k = 5% even slightly outperformsk = 100%. This suggests that, despite its success in some settings, feature importance scores alone cannot provide sufficient signal for ranking instances to annotate. We leave further investigation of other feature importance algorithms and other instance ranking strategies to future work. 4.5.6.3 ContrastSetTests Table 4.6 presents the results for contrast set tests on RQ3. First, among the different instance selection strategies, we see that LIS performs best overall on the three metrics, across all instance annotation bud- gets. As expected, Random yields the worst overall performance, since it does not rank instances in any intelligent way. 
Furthermore, after Random, HIS yields the second-worst overall performance, since it 83 Instance Annotation Budget (k) Instance Selection Strategy SentimentAnalysis IMDb Original Acc (↑) Contrast Acc (↑) Consistency (↑) 0% - 88.39 (± 2.05) 85.11 (± 2.72) 73.90 (± 4.64) 100% - 91.12 (± 0.59) 89.82 (± 1.20) 81.48 (± 1.86) 5% Random 90.03 (± 1.71) 85.63 (± 1.76) 76.05 (± 3.37) LC 90.71 (± 0.24) 89.07 (± 0.31) 80.26 (± 0.30) HC 90.98 (± 1.02) 88.66 (± 0.12) 80.12 (± 1.08) LIS 91.73 (± 0.12) 89.34 (± 0.89) 81.56 (± 1.25) HIS 88.32 (± 4.14) 84.77 (± 5.93) 73.57 (± 10.11) 15% Random 89.41 (± 2.51) 86.32 (± 2.84) 76.18 (± 5.40) LC 87.43 (± 6.08) 87.02 (± 8.64) 74.81 (± 14.77) HC 90.44 (± 1.33) 86.75 (± 1.89) 77.60 (± 3.13) LIS 91.46 (± 0.66) 88.93 (± 0.20) 80.87 (± 0.43) HIS 90.51 (± 1.17) 87.23 (± 2.56) 78.35 (± 3.80) 50% Random 87.86 (± 3.15) 84.29 (± 4.35) 72.56 (± 7.11) LC 88.18 (± 2.96) 87.70 (± 1.98) 76.30 (± 5.10) HC 87.70 (± 3.63) 84.22 (± 3.50) 72.47 (± 7.11) LIS 89.69 (± 1.36) 86.89 (± 1.95) 77.05 (± 3.35) HIS 88.87 (± 1.03) 84.02 (± 3.19) 73.36 (± 3.36) Table 4.6: RQ3 - Contrast Set Tests (§4.5.6.3). We compare ER model performance for five instance selection strategies across different instance annotation budgets ( k% of training data), with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Contrast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline ( p<0.05). provides the opposite ranking as LIS. This demonstrates that feature importance scores can provide useful signal for ranking instances to annotate. On the other hand, although LC and HC also perform well for k = 5%, they perform noticeably worse for k = 15% and/or k = 50%. In particular, LC beats HC for k = 50%, while HC beats LC for k = 15%. This inconsistent signal shows that label confidence scores are not effective for ranking instances to annotate. Second, as expected, the ER model trained on all ra- tionale annotations (k = 100%) generally outperforms both No-ER (k = 0%) and ER models with other budgets (k = {5%,15%,50%}). However, we counterintuitively find that k = 5% tends to outperform both k = 15% and k = 50%. In some cases (i.e., LIS), k = 5% even slightly outperforms k = 100%. This suggests that, despite its success in some settings, feature importance scores alone cannot provide sufficient signal for ranking instances to annotate. Overall, our conclusions from the contrast set tests are in line with those from the unseen dataset tests. We leave further investigation of other feature importance algorithms and other instance ranking strategies to future work. 84 Figure 4.7: RQ3 - Functional Tests (§4.5.6.4). We compare ER model performance for five instance selection strategies across different instance annotation budgets ( k% of training data), with respect to performance on a range of functional tests (OOD). Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well the mean normalized failure rate across all functional tests (Mean). 4.5.6.4 FunctionalTests Fig. 4.7 presents the results for functional tests on RQ3. First, we find that the k = 100% ER model consistently outperforms No-ER and other ER models. Yet, for all rationale annotation budgets except k = 100%, the ER model fails to outperform No-ER. 
This shows that having sufficiently abundant ratio- nale annotations is important for training an ER model that captures the linguistic capabilities evaluated in these functional tests. Second, for budgetsk ={5%,15%,50%}, no instance ranking strategy consis- tently beats other strategies, even when considering the Random strategy. This shows that LIS may not necessarily provide useful signal for ranking instances to annotate. Also, this further supports the notion 85 that, for functional tests, no instance ranking strategy is strong enough to overcome insufficient rationale annotations. 4.5.7 RQ4: HowisERaffectedbythetimetakentoannotatehumanrationales? Previously, we considered ER models trained on instance-level (RQ1, RQ3) and task-level (RQ2) human rationale annotations. However, instead of doing ER, it is also possible to improve LM generalization by simply providing more label-annotated instances. For ER to be practical, rationale annotation needs to be more cost-efficient than label annotation. In light of this, RQ4 compares the time cost of label and rationale annotation, in terms of their respective impact on model generalization (i.e., time budget vs. ID/OOD performance). 4.5.7.1 Setup We conduct an Amazon Mechanical Turk ∗ (MTurk) human study to compare the effectiveness of three instance annotation types across various time budgets. ForLabelOnly, the Turkers (i.e., MTurk annota- tors) are asked to annotate a given instance’s task label. ForExplOnly, the Turkers are provided a given instance’s ground truth task label, then asked to annotate an extractive rationale by highlighting input tokens that support this label. ForLabel+Expl, the Turkers are asked to annotate both the task label and the rationale. In this study, we consider an initial training setD init of 1000 instances and different time budgets b. For Label Only, the Turkers use b to add new instancesD (b) L with label annotations, yielding combined training set{D init ,D (b) L }. For Expl Only, the Turkers useb to annotate rationales for a subset of instancesD (b) E ⊆D init . For Label+Expl, the Turkers useb to add new instancesD (b) L+E with both label and rationale annotations, yielding combined training set{D init ,D (b) L+E }. Since it is difficult to track Turkers’ annotation progress over long time periods ( e.g., 48 hours), we first estimate the annotation time per instance based on a sample of timed instance annotations, then use these ∗ https://www.mturk.com/ 86 estimates to create proportional training sets (w.r.t. number of instances) for each time budget level. For each annotation type, we obtain the time estimates by asking Turkers to collectively annotate the same 200 SST training instances, which are randomly selected via stratified sampling with respect to sentiment label. We considered a 200-instance sample for time estimation because we felt that this was a reasonable trade-off between estimation accuracy and annotation cost. Across the three annotation types, we employ 178 total Turkers, with three Turkers per instance. On these 200 instances, the Turkers yielded mean± std annotation times of140.56± 8.45 seconds per instance (Label Only),110.31± 3.21 seconds per instance (Expl Only), and263.10± 7.31 seconds per instance (Label+Expl). Based on these estimates, for each annotation type and time budget, we construct training sets by uniformly sampling the following numbers of additional instances from the training set (excluding the initial 1000-instance training setD init ). 
For Label Only, we sample training sets of 4 (10 min), 13 (30 min), 128 (5 hr), 615 (24 hr), and 1229 (48 hr) instances. For Expl Only, we sample training sets of 5 (10 min), 16 (30 min), 163 (5 hr), 783 (24 hr), and 1556 (48 hr) instances. For Label+Expl, we sample training sets of 2 (10 min), 7 (30 min), 68 (5 hr), 328 (24 hr), and 657 (48 hr) instances. Due to computational constraints, we limit our RQ4 experiments to sentiment analysis and IxG+MAE, which yielded the highest development ID performance for RQ1 (§4.7.1). See §4.7.6 for more details.

Below, we discuss our findings from using ER-Test to explore RQ4. The RQ4 results obtained via unseen dataset tests (§4.5.7.2), contrast set tests (§4.5.7.3), and functional tests (§4.5.7.4) are shown in Fig. 4.8, Table 4.7, and Fig. 4.9, respectively.

4.5.7.2 Unseen Dataset Tests

Figure 4.8: RQ4 - Unseen Dataset Tests (§4.5.7.2). We compare ER model performance for three instance annotation types across different time budgets, with respect to performance on seen (ID) and unseen (OOD) datasets. For each instance annotation type (Label Only, Expl Only, Label+Expl), we plot task performance (Accuracy) as a function of additional time budget for annotation. Note that the model trained with 0 min additional time budget corresponds to the No-ER baseline.

Fig. 4.8 shows the results for unseen dataset tests on RQ4. When provided a nonzero additional time budget (i.e., 10 min and above), all three instance annotation types can yield models that perform better than models with zero additional time budget (i.e., 0 min). However, the performance trajectory with respect to additional time budget varies significantly across different instance annotation types.

For Label Only, on both seen (SST) and unseen datasets (Amazon, Yelp, Movies), model performance tends to decrease slightly for lower nonzero budgets, before steadily increasing for higher nonzero budgets. Generally, Label Only annotations require at least 24 hours of additional annotation time before the model's performance begins to noticeably outperform the baseline 0-minute ER model. Given a 48-hour budget, Label Only yields even greater improvements. Also, we see that Label+Expl tends to yield similar trends as Label Only, except with smaller performance decreases for lower nonzero budgets and larger performance increases for higher nonzero budgets.

On the other hand, Expl Only immediately yields improvements at the 10-minute mark, but does not always steadily increase with higher budgets. For the seen dataset (SST), the Expl Only model's performance begins to drop after the time budget exceeds 30 minutes, although the performance stays within a relatively small range across all time budgets. For unseen datasets (Amazon, Yelp, Movies), the Expl Only model's performance generally increases with increasing time budgets, but not as drastically as the Label Only or Label+Expl models'. On average, we find that 30 minutes of Expl Only annotation yields similar performance to 24 hours of Label Only or Label+Expl annotation.
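Before interpreting these trends further, note that the training-set sizes used in this comparison follow from dividing each additional time budget by the per-instance annotation time estimates in §4.5.7.1. The short sketch below reproduces the Label Only counts exactly; small deviations for the other annotation types (e.g., the 48-hour Expl Only count) may reflect additional rounding or sampling choices not detailed in the text.

# Convert additional annotation time budgets into numbers of annotatable instances,
# using the per-instance time estimates (in seconds) from Sec. 4.5.7.1.
seconds_per_instance = {"Label Only": 140.56, "Expl Only": 110.31, "Label+Expl": 263.10}
budget_minutes = {"10 min": 10, "30 min": 30, "5 hr": 300, "24 hr": 1440, "48 hr": 2880}

for annotation_type, sec in seconds_per_instance.items():
    counts = {name: round(minutes * 60 / sec) for name, minutes in budget_minutes.items()}
    print(annotation_type, counts)
# Label Only -> {'10 min': 4, '30 min': 13, '5 hr': 128, '24 hr': 615, '48 hr': 1229}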
Meanwhile, the opposite appears to be true for Expl Only. This is likely due to the fact that Label Only and Label+Only both involve adding new instances, whereas Expl Only only involves adding ratio- nale annotations to existing instances. Ultimately, we conclude that Expl Only is more effective for smaller budgets, while Label Only and Label+Expl are more suitable for larger budgets. Since annotation budgets tend to be low in real-world scenarios, Expl Only may often be a more practical annotation strategy than Label Only and Label+Expl. 4.5.7.3 ContrastSetTests Table 4.7 shows the results for contrast set tests on RQ4. For Expl Only, we see that increasing the time budget from 0 min (None) to any other budget under 48 hours yields decreased performance. However, increasing the Expl Only budget to 48 hours yields significant performance improvements. Also, for La- bel+Expl, we see a more extreme version of this trend, with a lower initial performance decrease followed by a higher eventual performance increase. Thus, Label+Expl with a 48-hour budget yields the highest overall performance. On the other hand, across all nonzero time budgets under 48 hours, we see that Expl Only generally yields higher performance than None, Label Only, and Label+Expl, yet fails to achieve sig- nificant performance improvements as the time budget increases to 48 hours. This mirrors the findings from our unseen dataset tests on RQ4, hence supporting our conclusion that Expl Only works better for smaller budgets whereas Label Only and Label+Expl work better for larger budgets. 89 AdditionalTime Budget Instance Annotation Type SentimentAnalysis IMDb Original Acc (↑) Contrast Acc (↑) Consistency (↑) 0 min None 88.02 (± 2.34) 83.36 (± 4.49) 71.84 (± 6.82) 10 min Label Only 85.15 (± 4.98) 83.08 (± 5.38) 68.17 (± 10.23) Expl Only 88.11 (± 0.24) 85.05 (± 1.98) 73.82 (± 2.38) Label+Expl 85.22 (± 1.19) 79.39 (± 2.25) 65.00 (± 3.42) 30 min Label Only 86.20 (± 4.18) 82.17 (± 5.15) 68.74 (± 9.22) Expl Only 87.80 (± 2.18) 83.61 (± 1.85) 72.15 (± 4.38) Label+Expl 83.89 (± 4.84) 80.97 (± 4.18) 65.22 (± 16.89) 5 hr Label Only 87.48 (± 1.27) 83.63 (± 1.69) 71.58 (± 3.02) Expl Only 87.57 (± 1.70) 84.40 (± 3.03) 72.40 (± 4.56) Label+Expl 88.39 (± 2.02) 85.68 (± 2.87) 74.48 (± 4.96) 24 hr Label Only 87.64 (± 3.25) 83.49 (± 5.46) 71.56 (± 8.78) Expl Only 88.19 (± 2.56) 84.81 (± 4.55) 73.39 (± 5.42) Label+Expl 86.86 (± 4.88) 83.45 (± 7.02) 70.67 (± 11.90) 48 hr Label Only 89.73 (± 0.52) 86.86 (± 6.11) 77.12 (± 1.54) Expl Only 88.19 (± 2.56) 84.81 (± 4.55) 73.39 (± 5.42) Label+Expl 89.78 (± 0.18) 87.97 (± 4.52) 78.74 (± 1.08) Table 4.7: RQ4-ContrastSetTests(§4.5.7.3). We compare ER model performance for three instance annotation types across different time budgets, with respect to performance on both original test sets (ID) and contrast sets (OOD). Performance is reported in terms of original test set accuracy (Original Acc), contrast set accuracy (Con- trast Acc), and contrast consistency (Consistency). Numbers highlighted in blue indicate statistically significant improvement over the No-ER baseline (p<0.05). 4.5.7.4 FunctionalTests Fig. 4.9 shows the results for functional tests on RQ4. For all instance annotation types, we see that models with nonzero additional time budgets (i.e., 10 min and higher) achieve lower mean failure rates than the model with zero additional time budget (i.e., 0 min). 
For Label Only, using a 30-minute budget yields the lowest mean failure rate by far, while other budgets yield failure rates that are not significantly lower than the 0-minute failure rate. This Label Only trend is quite different from the trends observed in the unseen dataset tests and contrast set tests, where the best performance was consistently attained using the 48-hour budget. For Expl Only, using 10-minute and 30-minute budgets yields much lower mean failure rates than the other budgets do, with the failure rate generally increasing with the time budget. This Expl Only trend is also different from previous trends, in which the performance either increased or stayed about the same as the budget increased. 90 Figure 4.9: RQ4 - Functional Tests (§4.5.7.4). We compare ER model performance for three instance annotation types across different time budgets, with respect to performance on a range of functional tests (OOD). Performance is reported in terms of normalized failure rate (↓) for four functional test types (Vocabulary, Robustness, Logic, Entity) as well the mean normalized failure rate across all functional tests (Mean). For Label+Expl, using a 5-hour budget yields the lowest mean failure rate. Although the failure rate generally decreases as the budget increases, this Label+Expl is still different from previous trends, where the best performance was consistently attained using the 48-hour budget. In summary, we find that nonzero time budgets lead to lower failure rates than zero time budgets, although there are mixed trends in how the failure rate changes as the nonzero time budget increases. Like in previous tests, Expl Only shines for lower nonzero budgets but yields diminishing returns for higher 91 nonzero budgets. However, the opposite tends to be true for Label+Expl. For Label Only, the results are less conclusive, with the failure rate oscillating dramatically as the budget increases. 4.6 RelatedWork RationaleExtraction Much of the language model (LM) explainability literature has focused on ratio- nale extraction, which is the process of producing extractive rationales. An extractive rationale explains an LM’s output on a given task instance by scoring input tokens’ influence on the LM’s output [47, 229, 135, 102, 156, 30]. This token scoring can be done via input gradients [229, 156, 47, 134], input perturbation [135, 189, 106], attention weights [192, 225, 252], or learned rationale extraction models [30, 98, 221]. In this work, we study how extractive rationales can be used to regularize LMs. Explanation-BasedLearning To improve LM behavior, many methods have been proposed for explanation- based learning [81, 79], especially using human-annotated explanations [233]. For extractive rationales, one common paradigm is explanation regularization (ER), which regularizes the LM so that its extractive machine rationales (reflecting LM’s reasoning process) are aligned with extractive human rationales (re- flecting humans’ reasoning process) [203, 90, 68, 272, 109, 201, 147]. In ER, the human rationale can be obtained by annotating each instance individually [272, 141, 25, 196, 49] or by applying domain-level lexi- cons across all instances [201, 203, 68, 109, 147]. Beyond ER, there are other ways to learn from extractive rationales. [153] used human-in-the-loop feedback on machine rationales for data augmentation. [266] used machine rationales to calibrate black box models and improve their performance on low-resource domains. 
Evaluating ER Models  Existing works have primarily evaluated ER models via ID generalization [272, 141, 90], which only captures one aspect of ER's impact. However, a few works have considered auxiliary evaluations, such as machine-human rationale alignment [90, 68], task performance on unseen datasets [203, 109], and social group fairness [201, 147]. [26] showed that maximizing machine-human rationale alignment does not always improve task performance, while human rationales vary in their ability to provide useful information for task prediction. Meanwhile, ER-Test jointly evaluates ER models' OOD generalization along three dimensions: unseen dataset tests, contrast set tests [64], and functional tests [200, 133].

4.7 Appendix

4.7.1 ER Training Details

Though ER involves many design choices (i.e., hyperparameters), it is infeasible to comprehensively tune all of them. Hence, for hyperparameters that yielded little sensitivity in our initial experiments, were not ER-specific, and/or were not the focus of our RQs, we used fixed values across all of our subsequent experiments. We discuss these global hyperparameter values below. First, we use a learning rate of 2e-5. Second, we use a batch size of 32. Third, we set the ER strength to $\lambda_{ER} = 1$. Fourth, we train all models for a maximum of 25 epochs, with early stopping. We perform early stopping based on total development set loss (i.e., task loss + ER loss) and a patience of 10 epochs. Fifth, before calculating the ER loss between machine rationales (i.e., attribution scores) and human rationales, we use the sigmoid function to normalize the attribution scores as probabilities. However, for a given instance, if all tokens' attribution scores are low, then all token probabilities will be close to 0.5 and provide little signal for identifying important tokens. Thus, we scale the raw attribution scores by 100, so that the normalized attribution scores become closer to 0 or 1. (A minimal code sketch of this loss computation follows Table 4.8 below.)

4.7.2 Development Set Results

In §4.5, we only reported ER model performance on test sets, even though we tuned our ER hyperparameters based on the development sets (for the seen dataset). Furthermore, recall that we used RQ1's development set results to obtain the respective subsets of ER settings used for RQ2-RQ4. Thus, for reference, we also report the development set results for RQ1, RQ2, RQ3, and RQ4 in Tables 4.8, 4.9, 4.10, and 4.11, respectively.

Machine Rationale Extractor | Rationale Alignment Criterion | Sentiment Analysis (SST) Acc (↑) | NLI (e-SNLI) F1 (↑)
IxG | MSE | 93.39 (± 0.93) | 78.66 (± 0.42)
IxG | MAE | 93.93 (± 0.24) | 78.46 (± 0.79)
IxG | Huber | 93.89 (± 0.61) | 78.69 (± 0.29)
IxG | BCE | 93.35 (± 0.70) | 78.57 (± 0.81)
IxG | KLDiv | 92.89 (± 0.30) | 73.41 (± 5.09)
IxG | Order | 90.83 (± 1.91) | 78.93 (± 0.74)
Attention | MSE | 93.27 (± 0.78) | 75.41 (± 0.42)
Attention | MAE | 93.65 (± 0.63) | 76.27 (± 0.92)
Attention | Huber | 93.08 (± 0.07) | 75.98 (± 0.68)
Attention | BCE | 93.46 (± 0.90) | 75.55 (± 0.40)
Attention | KLDiv | 93.50 (± 0.54) | 76.69 (± 0.49)
Attention | Order | 93.98 (± 0.08) | 75.99 (± 0.24)
UNIREX | MSE | 92.43 (± 0.20) | 73.77 (± 1.81)
UNIREX | MAE | 93.31 (± 0.57) | 72.92 (± 1.45)
UNIREX | Huber | 92.47 (± 0.76) | 73.33 (± 2.12)
UNIREX | BCE | 92.89 (± 0.11) | 68.81 (± 1.54)
UNIREX | KLDiv | 92.43 (± 0.23) | 74.05 (± 1.20)
UNIREX | Order | 93.00 (± 0.23) | 72.26 (± 0.82)

Table 4.8: RQ1 - Development Set Performance (§4.5.4.2). For each machine rationale extractor and task/dataset, the best-performing rationale alignment criterion is indicated in bold.
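As noted in §4.7.1, the sketch below illustrates how the raw attribution scores are scaled, squashed into probabilities, and compared against binary human rationales, and how the ER loss is combined with the task loss. It assumes the MAE alignment criterion and omits padding/special-token masking; the tensor names are illustrative.

# Minimal sketch of the ER objective from Sec. 4.7.1 (assuming the MAE criterion).
# raw_attributions and human_rationale have shape (batch, seq_len); human rationale
# entries are binary token-importance labels.
import torch
import torch.nn.functional as F

def er_loss(raw_attributions, human_rationale):
    # Scale raw attribution scores by 100, then normalize with a sigmoid so the
    # machine rationale probabilities are pushed toward 0 or 1 before alignment.
    machine_rationale = torch.sigmoid(100.0 * raw_attributions)
    return F.l1_loss(machine_rationale, human_rationale.float())

def total_loss(task_logits, labels, raw_attributions, human_rationale, lambda_er=1.0):
    # Combined objective: task loss + lambda_ER * ER loss.
    return F.cross_entropy(task_logits, labels) + lambda_er * er_loss(raw_attributions, human_rationale)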
Human Rationale Type Machine Rationale Extractor Rationale Alignment Criterion SentimentAnalysis SST Acc (↑) IxG MAE Instance-level 92.93 (± 0.24) Task-level 93.46 (± 0.11) Huber Instance-level 92.89 (± 0.61) Task-level 93.27 (± 0.48) Attention MAE Instance-level 93.65 (± 0.63) Task-level 94.31 (± 0.71) Huber Instance-level 93.08 (± 0.07) Task-level 93.98 (± 0.33) UNIREX MAE Instance-level 93.31 (± 0.57) Task-level 92.34 (± 1.11) Huber Instance-level 92.47 (± 0.76) Task-level 93.45 (± 0.86) Table 4.9: RQ2 - Development Set Performance (§4.5.5.2). For each machine rationale extractor and rationale alignment criterion, the best-performing human rationale type is indicated inbold. 94 Instance Annotation Budget (k) Instance Selection Strategy SentimentAnalysis SST Acc(↑) 5% Random 93.58 (± 0.23) LC 92.55 (± 0.72) HC 93.08 (± 0.35) LIS 92.62 (± 0.40) HIS 92.97 (± 0.52) 15% Random 93.08 (± 0.28) LC 92.24 (± 0.78) HC 92.58 (± 0.46) LIS 93.08 (± 0.66) HIS 93.31 (± 0.18) 50% Random 92.29 (± 0.06) LC 87.58 (± 5.11) HC 92.62 (± 0.81) LIS 92.24 (± 0.54) HIS 92.51 (± 0.37) Table 4.10: RQ3 - Development Set Performance (§4.5.6.2). For each instance annotation budget, the best- performing instance selection strategy is indicated inbold. Instance Annotation Type Additional Time Budget SentimentAnalysis SST Acc (↑) None 0 min 91.00 (± 0.18) 10 min Label Only 91.09 (± 0.18) Expl Only 91.50 (± 0.31) Label+Expl 90.66 (± 0.27) 30 min Label Only 90.88 (± 0.66) Expl Only 91.23 (± 0.21) Label+Expl 91.08 (± 0.18) 5 hr Label Only 91.62 (± 0.32) Expl Only 90.86 (± 0.77) Label+Expl 91.30 (± 0.70) 24 hr Label Only 92.42 (± 0.67) Expl Only 91.24 (± 0.06) Label+Expl 91.97 (± 0.53) 48 hr Label Only 92.70 (± 0.47) Expl Only 91.24 (± 0.06) Label+Expl 92.37 (± 0.53) Table 4.11: RQ4-DevelopmentSetPerformance(§4.5.7.2). For each time budget, the best-performing instance annotation type is indicated inbold. 4.7.3 RQ1: WhichrationalealignmentcriteriaaremosteffectiveforER? In this section, we present additional RQ1 experiments beyond those presented in §4.5.4. 95 4.7.3.1 NamedEntityRecognition Besides sentiment analysis and NLI, we also consider the named entity recognition (NER) task in unseen dataset tests for RQ1. For NER, we use CoNLL-2003 [207, 141] as the seen dataset (since it has rationale annotations) and OntoNotes v5.0 [190] as the unseen dataset. CoNLL-2003 contains only text from Reuters news stories, while OntoNotes v5.0 contains text from newswires, magazines, telephone conversations, websites, and other sources. Table 4.12 displays the NER unseen dataset test results for RQ1. For the seen dataset, we see more variance (versus what we saw in Table 4.1) in task performance among different ER criteria, although the variance is still quite small among the best criteria (MSE, MAE, Huber). Here, MAE yields the highest performance, while BCE yields the lowest by far. For the unseen dataset, MAE still performs best, while MSE and Huber are competitive. Meanwhile, BCE again performs the worst. Machine Rationale Extractor Rationale Alignment Criterion NER Seen Acc (↑) Unseen Acc (↑) CoNLL-2003 OntoNotes v5.0 - No-ER 77.24 (± 0.20) 20.78 (± 0.41) IxG MSE 78.02 (± 0.69) 21.60 (± 0.46) MAE 78.34 (± 0.81) 21.73 (± 0.31) Huber 77.83 (± 1.09) 21.38 (± 0.16) BCE 64.53 (± 13.22) 17.32 (± 3.59) Order 72.62 (± 5.01) 19.14 (± 1.75) Table 4.12: RQ1-UnseenDatasetTestsforNER(§4.7.3.1). 
For NER, we compare various ERrationalealignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc). For each dataset and metric, the best-performing criterion is indicated inbold. 4.7.3.2 HateSpeechDetection Besides sentiment analysis, NLI, and NER, we also consider the hate speech detection task in unseen dataset tests for RQ1. Unlike the other tasks we consider, hate speech detection datasets typically provide task- level rationales by default, while instance-level rationales are unavailable. 96 Machine Rationale Extractor Rationale Alignment Criterion HateSpeechDetection Stf HatEval GHC Seen Acc (↑) Seen FPRD (↓) Unseen Acc (↑) Unseen FPRD (↓) Unseen Acc (↑) Unseen FPRD (↓) - No-ER 89.50 (± 0.20) 1.11 (± 0.58) 63.68 (± 0.78) 1.64 (± 0.66) 89.43 (± 0.98) 1.09 (± 0.12) IxG MSE 89.46 (± 0.21) 2.18 (± 0.47) 64.30 (± 1.52) 1.99 (± 0.26) 88.19 (± 0.62) 1.50 (± 0.10) MAE 89.59 (± 0.06) 1.39 (± 0.62) 63.30 (± 0.49) 1.80 (± 0.59) 88.07 (± 1.66) 1.43 (± 0.24) Huber 89.50 (± 0.51) 1.90 (± 0.35) 64.85 (± 1.50) 2.11 (± 0.27) 87.77 (± 1.21) 1.84 (± 0.34) BCE 89.42 (± 0.71) 1.87 (± 0.45) 63.54 (± 0.57) 1.87 (± 0.45) 88.99 (± 0.83) 1.36 (± 0.58) Order 89.21 (± 1.18) 0.56 (± 0.09) 64.46 (± 1.18) 0.92 (± 0.92) 92.84 (± 0.46) 0.59 (± 0.25) Table 4.13: RQ1 - Unseen Dataset Tests for Hate Speech Detection (§4.7.3.2). For hate speech detection, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on seen (ID) and unseen (OOD) datasets. Performance is measured using accuracy (Acc) and false positive difference rate (FPRD). For each metric, the best-performing criterion is indicated in bold. Note that we only consider task-level rationales for hate speech detection, since instance-level rationales are unavailable. Task-Level Rationales Many existing hate speech detection models are largely oversensitive to cer- tain group identifier words ( e.g., “black”, “Muslim”, “gay”), almost always predicting hate speech for text containing these words [109]. To address this, prior works first manually annotated a lexicon of group identifiers that should be ignored for hate speech detection. Then, for all training instances, they automat- ically marked only tokens belonging to the lexicon as unimportant (and the rest as important). By using these human rationales for ER, they trained the LM to be less biased with respect to these group identifiers [109, 101]. Datasets For hate speech detection, we use Stormfromt (Stf) dataset [70] as the seen dataset, for which we use the lexicons from [101] to generate distantly-supervised rationales. Each instance in the Stf dataset is matched to one or more lexicons by simple character-level matching, and the rationales are generated as described above. Then, we train modelF on the Stf dataset. Meanwhile, we use HatEval [5] and Gab Hate Corpus (GHC) [108] as the unseen datasets. All of these datasets contain binary labels for hateful and non-hateful content. The Stf dataset is collected from a white-supremacist forum, whereas HatEval instances are tweets and GHC instances are taken from the Gab forum. 97 FairnessEvaluation In addition to task performance, we evaluate models with respect to fairness (i.e., bias against group identifiers in the lexicons). We measure fairness using the false positive rate difference (FPRD) metric [101]. 
FPRD is computed as $\sum_z |\text{FPR}_z - \text{FPR}_{\text{overall}}|$, where $\text{FPR}_z$ is the model's false positive rate across all test instances mentioning group identifier $z$, and $\text{FPR}_{\text{overall}}$ is the model's false positive rate across all test instances. In other words, FPRD evaluates the extent to which F is biased against group identifier $z$. Lower FPRD indicates that F has lower bias against the group identifiers in the lexicons.

Results  Table 4.13 presents the hate speech detection results for unseen dataset tests with respect to RQ1. Like in §4.5.4, compared to No-ER, ER does not yield significant increases or decreases in task performance (accuracy) on the seen dataset (Stf). However, ER yields slightly different results for the unseen datasets. For GHC, the ER model trained with Order criterion is the only ER model to achieve significant accuracy improvement over No-ER. Meanwhile, for HatEval, although Huber performs best, none of the ER models achieve significant accuracy improvement over No-ER.

For fairness evaluation, we find that ER models tend to yield higher FPRD than No-ER. However, we observe that the Order criterion consistently yields the lowest FPRD among all models on both seen and unseen datasets. In particular, Order's FPRD is significantly lower than No-ER's. This suggests that using a more relaxed rationale alignment objective is helpful for improving group fairness. Our findings are in line with those in [90], which proposed the Order criterion.

4.7.4 RQ2: How effective are task-level human rationales for ER?

In this section, we provide additional details about our RQ2 experiments (§4.5.5).

4.7.4.1 Creating Task-Level Rationales

As stated in §4.4.3 and §4.5.5.1, task-level rationales are created using task-level lexicons. Below, we provide details about how the lexicons are used to create rationales.

For a given task $T$, let $L_T$ be a human-annotated lexicon (i.e., set) of words/phrases that are known to be either important or unimportant to $T$. For example, sentiment analysis lexicons generally contain words/phrases that are strongly indicative of sentiment (i.e., important) [180, 24], whereas hate speech detection lexicons generally contain words/phrases that should be ignored when detecting hate speech (i.e., unimportant) [101]. Given a token $x$, let $\mathbb{1}_{L_T}(\cdot)$ be an indicator function, such that $\mathbb{1}_{L_T}(x) = 1$ if $x \in L_T$ (i.e., $x$ is part of at least one word/phrase in $L_T$) and $\mathbb{1}_{L_T}(x) = 0$ otherwise. Recall that $x_i = [x_i^t]_{t=1}^{n}$ denotes the $n$-token sequence for a task instance $i$ (§4.2). By applying $\mathbb{1}_{L_T}(\cdot)$ to each token $x_i^t$ of each instance $i$ in a dataset for $T$, we can obtain a distantly-supervised rationale $\dot{r}_i = [\mathbb{1}_{L_T}(x_i^t)]_{t=1}^{n}$ for every instance in the dataset.

Sentiment Analysis  For sentiment analysis, we used AFINN [180] and SenticNet [24] as source lexicons to create task-level rationales. By combining AFINN and SenticNet, we obtain a unified lexicon of 170K total words/phrases. Since some words/phrases appear in both source lexicons, 136 of these words/phrases have conflicting sentiments (i.e., have positive sentiment in one lexicon and negative sentiment in the other), so we discard these 136 words/phrases from the unified lexicon. We find that 93% of the SST training instances (i.e., 6441 instances) have at least one token in the lexicon, which means that the remaining 7% do not have task-level rationales. Nonetheless, we use all training instances for ER training. If a given instance has a task-level rationale, then we use both ER loss and task loss for that instance. Otherwise, we only use task loss for that instance. To facilitate a fair comparison with instance-based rationales, we assume that instance-level rationales are also unavailable for the same 7% of training instances for which task-level rationales are unavailable.
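A minimal sketch of this indicator-based construction is given below. It assumes single-token lexicon entries and lowercase string matching; the actual lexicons also contain multi-word phrases, and the example tokens and lexicon are invented for illustration.

# Minimal sketch of Sec. 4.7.4.1: a task-level lexicon L_T induces a distantly
# supervised rationale by labeling each token with the indicator of lexicon membership.
def task_level_rationale(tokens, lexicon):
    lexicon = {word.lower() for word in lexicon}
    return [1 if token.lower() in lexicon else 0 for token in tokens]

tokens = ["the", "movie", "was", "absolutely", "wonderful"]
print(task_level_rationale(tokens, {"wonderful", "terrible", "boring"}))
# -> [0, 0, 0, 0, 1]

For hate speech detection (§4.7.3.2), the lexicon instead lists group identifiers that should be ignored, so the resulting indicator marks exactly the tokens the model is regularized to treat as unimportant.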
4.7.5 RQ3: How is ER affected by the number and choice of training instances with human rationales?

In this section, we provide additional details about our RQ3 experiments (§4.5.6).

4.7.5.1 Setup

Since we average all results over three seeds in §4.5, we describe how this applies to instance selection in RQ3. For random instance selection, we uniformly sample three subsets from the training set, with each subset containing k% of the training instances. For each non-random instance selection strategy (LC, HC, LIS, HIS), we obtain an aggregate selection score for each training instance by first using the given strategy to compute the selection score for each seed, then taking the mean of these three seed-level scores. After that, we rank all instances in descending order of selection score and select the top-k% instances based on this ranking.

Given that most training instances do not have rationale annotations in these low-resource settings, it is possible that a given training batch does not contain any rationale-annotated instances, which makes ER impossible for this batch. To address this, our RQ3 experiments use a modified dataloader that ensures at least one-third of the instances in each training batch have rationale annotations. As a result, the ER loss can be computed for a substantial number of instances per batch, so that the ER loss' impact is not dwarfed by the task loss'. Note that the ER loss is only used for rationale-annotated instances in the batch, whereas the task loss is used for all instances in the batch. (A minimal sketch of this batch construction appears after Figure 4.12 below.)

4.7.6 RQ4: How is ER affected by the time taken to annotate human rationales?

In this section, we provide additional details about our RQ4 experiments (§4.5.7).

4.7.6.1 Setup

Figures 4.10, 4.11, and 4.12 display the user interfaces (UIs) provided to MTurk annotators in our time estimation experiments for Label Only, Expl Only, and Label+Expl, respectively. All Turkers were paid at least $16 per hour. Based on rough a priori estimates of the annotation time per instance, we paid $0.50 per instance for the qualifier task (i.e., the preliminary task used to select the initial pool of 250 qualified Turkers), $0.25 per instance for Label Only, $0.30 per instance for Expl Only, and $0.35 per instance for Label+Expl.

To ensure that the annotation time estimates were based on high-quality annotations, we manually filtered out Turkers who submitted low-effort (e.g., empty) responses. As a result, our time estimation experiments all yielded high inter-annotator agreement. For Label Only and Label+Expl, we achieved Fleiss' kappa scores of 0.74 and 0.70, respectively. For Expl Only and Label+Expl, we achieved rationale overlap rates [272] of 0.78 and 0.66, respectively. To further verify the quality of our MTurk results, we replicated these experiments in a small-scale study with nine computer science students and observed similar trends. In this small-scale study, we considered the same 200 instances annotated by the Turkers, with each instance annotated by three students.

Figure 4.10: RQ4 - Label Only UI. UI used for Label Only MTurk annotations.

Figure 4.11: RQ4 - Expl Only UI. UI used for Expl Only MTurk annotations.

Figure 4.12: RQ4 - Label+Expl UI. UI used for Label+Expl MTurk annotations.
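Returning to the batching scheme described in §4.7.5.1, the sketch below builds batches in which at least one-third of the instances carry rationale annotations. How the annotated pool is reused once exhausted is not specified in the text, so this sketch simply re-samples it for every batch; the batch size of 32 mirrors §4.7.1, and the index lists are illustrative.

# Minimal sketch of a rationale-aware batching scheme: every batch reserves a quota
# of rationale-annotated instances so the ER loss can always be computed.
import random

def rationale_aware_batches(annotated_idx, unannotated_idx, batch_size=32, min_frac=1/3):
    n_ann = max(1, round(batch_size * min_frac))      # annotated slots per batch
    n_unann = batch_size - n_ann
    unann = list(unannotated_idx)
    random.shuffle(unann)
    batches = []
    for start in range(0, len(unann), n_unann):
        chunk = unann[start:start + n_unann]
        # Re-sample annotated instances for each batch (an assumption; see lead-in).
        quota = random.sample(list(annotated_idx), min(n_ann, len(annotated_idx)))
        batch = quota + chunk
        random.shuffle(batch)
        batches.append(batch)
    return batches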
4.7.7 Functional Tests

There are four categories of functional tests: Vocabulary, Robustness, Logic, and Entity. Below, we describe each category in more detail. Also, each functional test category consists of one or more functional subtests. However, for each category, §4.5 only reported the aggregate performance across all subtests in the category. Thus, Tables 4.14-4.15 also report the performance for all individual subtests with respect to RQ1, and Tables 4.16-4.22 do the same for RQ2-RQ4.

Vocabulary Tests  Vocabulary tests are used to evaluate LMs' sensitivity to vocabulary changes, which may or may not change the text's meaning. For sentiment analysis, we consider the following vocabulary tests: Add Sentiment Words, Paraphrase Neutral Words, Add Intensifiers, Add Reducers, Add Positive Phrases, and Add Negative Phrases [200]. For NLI, we consider the following vocabulary tests: Antonym in Hypothesis, Synonym in Hypothesis, and Supertype in Hypothesis [200].

Robustness Tests  Robustness tests evaluate LMs' (in)sensitivity to character-level edits that should not change the text's meaning. For sentiment analysis, we consider the following robustness tests: Add Random URLs/Handles, Add Punctuation, Add One Typo, Add Two Typos, and Add Contractions [103, 247]. For NLI, we consider the following robustness tests: Add Punctuation, Add One Typo, Add Two Typos, and Add Contractions [103, 247].

Logic Tests  Logic tests evaluate LMs' sensitivity to perturbations that alter the logical semantics expressed by the text. For sentiment analysis, we consider the following logic tests: Positive → Negative, Negative → Positive, and Positive → Negative (w/ Distractors) [230, 167]. For NLI, we consider the following logic tests: Negate Hypothesis, Negate Premise, and Hypothesis is Premise [230, 167].

Entity Tests  Entity tests evaluate LMs' (in)sensitivity to entity changes that should not affect the text's meaning. For sentiment analysis, we consider the following entity tests: Replace Names, Replace Locations, and Replace Numbers [200]. For NLI, we consider the following entity tests: Replace Entity in Hypothesis [200].

Results  For each functional test category, we report the results for each functional subtest. First, for sentiment analysis, we present the functional test/subtest results for RQ1 (Table 4.14), RQ2 (Table 4.16), RQ3 (Tables 4.17-4.19), and RQ4 (Tables 4.20-4.22). Second, for NLI, we present the functional test/subtest results for RQ1 (Table 4.15).
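Before the full subtest tables, the sketch below shows how a failure rate can be computed for one invariance-style subtest (Add One Typo): the model's prediction on each perturbed input is compared against its prediction on the original input. The model callable, the swap-based typo, and the inputs are assumptions made for illustration; directional subtests (e.g., Positive → Negative) instead check that the prediction changes in the expected direction.

# Minimal sketch of scoring an invariance-style functional subtest ("Add One Typo").
# model is any callable mapping a list of texts to a list of predicted labels.
import random

def add_one_typo(text):
    # Swap two adjacent characters at a random position to simulate a typo.
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def invariance_failure_rate(model, texts):
    original_preds = model(texts)
    perturbed_preds = model([add_one_typo(t) for t in texts])
    failures = sum(o != p for o, p in zip(original_preds, perturbed_preds))
    return 100.0 * failures / len(texts)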
103 FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) No-ER IxG+MSE IxG+MAE IxG+Huber IxG+BCE IxG+Order IxG+KLDiv Vocabulary Add Sentiment Words 1.20 (± 0.74) 0.60 (± 0.16) 1.27 (± 0.84) 1.13 (± 0.50) 1.00 (± 0.86) 0.80 (± 0.28) 0.27 (± 0.25) Paraphrase Neutral Words 5.59 (± 0.16) 5.13 (± 0.90) 5.40 (± 0.28) 5.67 (± 0.74) 5.67 (± 0.68) 5.60 (± 1.63) 5.87 (± 0.66) Add Intensifiers 2.13 ( ± 1.63) 1.80 (± 0.16) 1.40 (± 0.16) 2.67 (± 0.96) 2.67 (± 0.77) 1.60 (± 0.65) 1.27 (± 0.19) Add Reducers 23.85 (± 7.18) 35.00 (± 46.01) 27.38 (± 5.95) 17.46 (± 13.65) 25.00 (± 25.00) 0.77 (± 0.43) 5.56 (± 7.86) Add Positive Phrases 1.40 (± 0.28) 2.33 (± 1.84) 0.67 (± 0.50) 2.33 (± 1.76) 1.27 (± 1.00) 2.07 (± 1.52) 1.07 (± 0.57) Add Negative Phrases 22.86 (± 7.43) 14.80 (± 1.40) 20.67 (± 4.07) 20.67 (± 3.35) 17.40 (± 3.64) 16.93 (± 1.91) 16.67 (± 2.29) Robustness Add Random URLs/Handles 9.80 (± 0.48) 7.27 (± 2.23) 9.07 (± 1.80) 10.27 (± 0.9) 7.87 (± 2.76) 9.60 (± 2.47) 9.60 (± 2.14) Add Punctuation 3.93 (± 0.89) 1.93 (± 0.41) 3.00 (± 1.02) 3.80 (± 0.28) 2.87 (± 0.19) 2.67 (± 0.34) 2.67 (± 0.50) Add One Typo 2.60 (± 0.90) 2.53 (± 0.82) 2.60 (± 0.57) 2.60 (± 0.75) 3.13 (± 0.90) 2.00 (± 0.86) 2.60 (± 0.43) Add Two Typos 3.93 (± 0.65) 3.87 (± 1.24) 4.27 (± 0.5) 4.60 (± 0.43) 4.13 (± 1.2) 3.33 (± 0.25) 4.73 (± 0.25) Add Contractions 1.00 (± 0.00) 0.80 (± 0.33) 0.87 (± 0.25) 0.47 (± 0.09) 0.80 (± 0.43) 0.53 (± 0.50) 1.00 (± 0.16) Logic Positive→ Negative 5.20 (± 2.75) 4.27 (± 1.65) 4.47 (± 3.07) 3.93 (± 1.57) 4.47 (± 1.75) 5.67 (± 1.68) 4.93 (± 2.29) Negative→ Positive 59.73 (± 9.48) 59.00 (± 15.81) 37.47 (± 10.41) 59.07 (± 14.97) 63.27 (± 17.61) 45.87 (± 24.13) 46.67 (± 18.75) Positive→ Negative (w/ Distractors) 32.20 (± 14.65) 35.13 (± 1.91) 35.00 (± 16.52) 40.93 (± 4.31) 19.00 (± 8.66) 29.13 (± 10.60) 38.00 (± 8.47) Entity Replace Names 0.70 (± 0.14 1.91 (± 0.71) 1.11 (± 0.51) 1.61 (± 0.62) 0.81 (± 0.14) 1.91 (± 1.51) 1.01 (± 0.75) Replace Locations 3.33 (± 0.74) 2.73 (± 1.15) 3.40 (± 0.86) 3.00 (± 0.33) 3.07 (± 1.79) 3.20 (± 1.57) 3.53 (± 1.95) Replace Numbers 0.80 (± 0.00) 0.53 (± 0.34) 0.47 (± 0.41) 0.60 (± 0.43) 0.60 (± 0.33) 0.67 (± 0.81) 0.87 (± 0.66) Table 4.14: RQ1-FunctionalSubtestsforSentimentAnalysis(§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 
FunctionalTest FunctionalSubtest NLI ANLP-NLI Failure Rate (↓) No-ER IxG+MSE IxG+MAE IxG+BCE IxG+Huber IxG+Order Vocabulary Antonym in Hypothesis 71.66 (± 20.98) 64.77 (± 21.97) 84.55 (± 11.53) 65.88 (± 21.40) 74.77 (± 20.41) 62.55 (± 13.16) Synonym in Hypothesis 32.61 (± 7.41) 24.11 (± 7.62) 30.11 (± 6.42) 25.88 (± 6.86) 30.77 (± 7.07) 29.27 (± 6.95) Supertype in Hypothesis 24.44 (± 15.95) 11.00 (± 3.62) 13.77 (± 6.71) 9.31 (± 5.90) 8.77 (± 8.06) 13.55 (± 7.10) Robustness Add Punctuation 14.55 (± 4.13) 9.44 (± 2.79) 11.33 (± 1.63) 8.11 (± 1.19) 10.00 (± 2.58) 9.88 (± 2.51) Add One Typo 15.88 (± 3.44) 10.22 (± 3.04) 12.33 (± 1.63) 9.66 (± 2.10) 10.88 (± 2.68) 10.77 (± 2.52) Add Two Typos 15.33 (± 3.68) 9.77 (± 1.81) 12.00 (± 1.76) 9.44 (± 2.31) 11.11 (± 2.99) 10.00 (± 2.66) Add Contractions 24.69 (± 6.98) 24.69 (± 8.72) 25.92 (± 9.07) 22.22 (± 9.07) 25.92 (± 7.40) 14.81 (± 5.23) Logic Negate Hypothesis 50.88 (± 32.25) 27.77 (± 37.24) 9.77 (± 15.66) 41.33 (± 41.54) 15.22 (± 28.77) 18.44 (± 23.21) Negate Premise 99.88 (± 0.31) 98.54 (± 3.78) 91.69 (± 20.37) 98.65 (± 2.56) 98.42 (± 4.44) 99.88 (± 0.31) Hypothesis is Premise 14.22 (± 8.63) 14.33 (± 10.14) 19.44 (± 12.12) 18.16 (± 12.69) 14.38 (± 9.23) 17.38 (± 10.16) Entity Replace Entity in Hypothesis 77.21 (± 39.57) 88.88 (± 24.11) 79.91 (± 22.20) 85.18 (± 30.04) 83.83 (± 24.25) 96.40 (± 4.85) Table 4.15: RQ1 - Functional Subtests for NLI (§4.5.4.4). For NLI, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 104 FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) IxG+MAE IxG+Huber Instance-Level Task-Level Instance-Level Task-Level Vocabulary Add Sentiment Words 0.80 (± 0.16) 2.00 (± 0.82) 1.13 (± 0.50) 1.27 (± 0.98) Paraphrase Neutral Words 5.87 (± 0.34) 5.67 (± 0.41) 5.67 (± 0.74) 6.00 (± 0.59) Add Intensifiers 1.67 ( ± 0.41) 1.60 (± 0.49) 2.67 (± 0.96) 2.27 (± 0.66) Add Reducers 55.03 (± 32.90) 30.25 (± 21.54) 17.46 (± 13.65) 35.89 (± 1.61) Add Positive Phrases 0.60 (± 0.33) 1.47 (± 1.52) 2.33 (± 1.76) 0.67 (± 0.57) Add Negative Phrases 20.47 (± 5.33) 19.67 (± 3.79) 20.67 (± 3.35) 21.00 (± 5.94) Robustness Add Random URLs/Handles 9.80 (± 0.48) 7.27 (± 2.23) 10.27 (± 0.90) 10.40 (± 2.05) Add Punctuation 2.07 (± 0.25) 4.00 (± 2.41) 3.80 (± 0.28) 3.40 (± 1.34) Add One Typo 2.40 (± 0.86) 2.47 (± 0.41) 2.60 (± 0.75) 2.87 (± 0.25) Add Two Typos 3.87 (± 0.94) 4.47 (± 0.93) 4.60 (± 0.43) 4.33 (± 0.52) Add Contractions 1.00 (± 0.43) 1.20 (± 0.43) 0.47 (± 0.09) 0.87 (± 0.25) Logic Positive→ Negative 4.60 (± 1.88) 6.13 (± 1.82) 3.93 (± 1.57) 3.80 (± 2.12) Negative→ Positive 41.60 (± 21.11) 40.87 (± 20.03) 59.07 (± 14.97) 31.53 (± 9.24) Positive→ Negative (w/ Distractors) 44.40 (± 6.78) 49.80 (± 10.22) 40.93 (± 4.31) 32.13 (± 13.47) Entity Replace Names 0.91 (± 0.49 1.11 (± 0.38) 1.61 (± 0.62) 1.91 (± 0.14) Replace Locations 3.80 (± 0.86) 5.00 (± 1.84) 3.00 (± 0.33) 4.07 (± 0.90) Replace Numbers 0.87 (± 0.68) 0.73 (± 0.25) 0.60 (± 0.43) 0.53 (± 0.19) Table 4.16: RQ2-FunctionalSubtestsforSentimentAnalysis(§4.5.4.4). 
For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) 0% 100% Random LC HC LIS HIS Vocabulary Add Sentiment Words 1.20 (± 0.74) 1.27 (± 0.84) 1.09 (± 0.43) 2.00 (± 1.13) 0.47 (± 0.19) 1.67 (± 0.74) 0.47 (± 0.25) Paraphrase Neutral Words 5.59 (± 0.16) 5.40 (± 0.28) 5.89 (± 1.25) 6.73 (± 2.04) 7.07 (± 1.33) 5.67 (± 0.68) 5.67 (± 0.84) Add Intensifiers 2.13 ( ± 1.63) 1.40 (± 0.16) 2.24 (± 0.53) 3.27 (± 1.52) 1.93 (± 0.09) 2.20 (± 0.43) 2.53 (± 0.90) Add Reducers 23.85 (± 7.18) 27.38 (± 5.95) 45.42 (± 29.46) 20.96 (± 11.10) 66.27 (± 24.67) 42.59 (± 14.58) 43.73 (± 39.84) Add Positive Phrases 1.40 (± 0.28) 0.67 (± 0.50) 0.80 (± 0.43) 1.20 (± 0.28) 0.80 (± 0.49) 0.07 (± 0.09) 1.73 (± 1.39) Add Negative Phrases 22.86 (± 7.43) 20.67 (± 4.07) 20.47 (± 0.50) 21.60 (± 6.65) 21.13 (± 4.72) 24.80 (± 2.55) 23.33 (± 6.38) Robustness Add Random URLs/Handles 9.80 (± 0.48) 9.07 (± 1.80) 9.18 (± 1.98) 9.53 (± 1.00) 9.13 (± 0.98) 10.00 (± 1.13) 9.00 (± 3.85) Add Punctuation 3.93 (± 0.89) 1.93 (± 0.41) 2.80 (± 1.07) 3.33 (± 0.82) 2.60 (± 0.43) 2.40 (± 0.49) 3.80 (± 1.73) Add One Typo 2.60 (± 0.90) 2.60 (± 0.57) 2.49 (± 0.57) 2.47 (± 0.34) 2.80 (± 0.33) 1.93 (± 0.77) 2.33 (± 0.34) Add Two Typos 3.93 (± 0.65) 4.27 (± 0.50) 4.07 (± 0.54) 4.20 (± 1.77) 5.13 (± 0.34) 3.93 (± 0.52) 4.27 (± 0.09) Add Contractions 1.00 (± 0.00) 0.87 (± 0.25) 0.87 (± 0.28) 1.00 (± 0.43) 1.47 (± 0.96) 0.80 (± 0.16) 1.07 (± 0.25) Logic Positive→ Negative 5.20 (± 2.75) 4.47 (± 3.07) 6.60 (± 2.85) 7.13 (± 1.09) 6.93 (± 2.1) 7.27 (± 1.81) 3.73 (± 1.64) Negative→ Positive 59.73 (± 9.48) 37.47 (± 10.41) 45.36 (± 23.87) 45.40 (± 17.25) 32.87 (± 7.25) 38.87 (± 21.89) 64.93 (± 30.43) Positive→ Negative (w/ Distractors) 32.20 (± 14.65) 35.00 (± 16.52) 47.89 (± 9.33) 52.27 (± 7.17) 46.27 (± 17.01) 48.07 (± 16.15) 40.47 (± 5.17) Entity Replace Names 0.70 (± 0.14) 1.11 (± 0.51) 1.07 (± 0.61) 1.51 (± 0.25) 0.50 (± 0.14) 0.81 (± 0.28) 1.31 (± 0.38) Replace Locations 3.33 (± 0.74) 3.40 (± 0.86) 3.82 (± 1.57) 4.20 (± .91) 4.53 (± 1.39) 4.27 (± 1.51) 3.73 (± 1.76) Replace Numbers 0.80 (± 0.00) 0.47 (± 0.41) 0.93 (± 0.55) 0.93 (± 0.19) 1.27 (± 0.38) 1.00 (± 0.33) 0.60 (± 0.16) Table 4.17: RQ3 - Functional Subtests for Sentiment Analysis - k=5% (§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 
105 FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) 0% 100% Random LC HC LIS HIS Vocabulary Add Sentiment Words 1.20 (± 0.74) 1.27 (± 0.84) 0.98 (± 0.60) 1.20 (± 0.59) 0.53 (± 0.25) 0.93 (± 0.25) 0.87 (± 0.09) Paraphrase Neutral Words 5.59 (± 0.16) 5.40 (± 0.28) 6.60 (± 1.35) 6.27 (± 1.36) 5.93 (± 0.98) 5.47 (± 0.94) 6.07 (± 0.74) Add Intensifiers 2.13 ( ± 1.63) 1.40 (± 0.16) 2.07 (± 1.12) 2.67 (± 0.34) 1.93 (± 0.25) 1.53 (± 0.41) 2.40 (± 0.71) Add Reducers 23.85 (± 7.18) 27.38 (± 5.95) 42.96 (± 29.27) 34.97 (± 27.96) 16.67 (± 16.67) 42.86 (± 20.2) 35.35 (± 24.78) Add Positive Phrases 1.40 (± 0.28) 0.67 (± 0.50) 0.49 (± 0.59) 1.07 (± 0.66) 0.53 (± 0.50) 0.80 (± 0.71) 1.73 (± 1.61) Add Negative Phrases 22.86 (± 7.43) 20.67 (± 4.07) 23.51 (± 3.31) 18.47 (± 0.90) 17.20 (± 4.96) 20.00 (± 4.26) 12.33 (± 2.02) Robustness Add Random URLs/Handles 9.80 (± 0.48) 9.07 (± 1.80) 8.93 (± 1.31) 8.20 (± 2.29) 8.87 (± 1.37) 8.53 (± 2.92) 9.20 (± 2.75) Add Punctuation 3.93 (± 0.89) 1.93 (± 0.41) 2.89 (± 0.89) 3.27 (± 2.39) 3.53 (± 0.84) 2.87 (± 0.34) 3.20 (± 0.71) Add One Typo 2.60 (± 0.90) 2.60 (± 0.57) 2.56 (± 0.55) 2.40 (± 0.98) 3.27 (± 0.25) 2.07 (± 0.19) 2.73 (± 0.34) Add Two Typos 3.93 (± 0.65) 4.27 (± 0.50) 4.60 (± 0.49) 3.93 (± 0.50) 3.67 (± 0.66) 4.27 (± 1.36) 3.87 (± 0.25) Add Contractions 1.00 (± 0.00) 0.87 (± 0.25) 1.29 (± 0.19) 0.80 (± 0.33) 1.33 (± 0.57) 0.93 (± 0.25) 1.07 (± 0.34) Logic Positive→ Negative 5.20 (± 2.75) 4.47 (± 3.07) 7.78 (± 2.17) 7.20 (± 1.85) 7.80 (± 0.99) 7.93 (± 1.65) 3.27 (± 1.91) Negative→ Positives 59.73 (± 9.48) 37.47 (± 10.41) 53.62 (± 24.45) 43.67 (± 34.97) 59.93 (± 27.40) 49.67 (± 12.45) 66.13 (± 22.82) Positive→ Negative (w/ Distractors) 32.20 (± 14.65) 35.00 (± 16.52) 49.29 (± 13.80) 33.53 (± 13.96) 53.20 (± 11.44) 57.20 (± 7.69) 38.60 (± 14.01) Entity Replace Names 0.70 (± 0.14) 1.11 (± 0.51) 1.01 (± 0.74) 0.70 (± 1.00) 1.21 (± 0.25) 1.11 (± 0.38) 1.61 (± 0.57) Replace Locations 3.33 (± 0.74) 3.40 (± 0.86) 4.42 (± 1.13) 4.87 (± 1.89) 3.73 (± 1.73) 3.20 (± 0.75) 3.40 (± 0.33) Replace Numbers 0.80 (± 0.00) 0.47 (± 0.41) 1.09 (± 0.35) 1.20 (± 0.65) 1.00 (± 0.16) 0.73 (± 0.34) 0.67 (± 0.25) Table 4.18: RQ3-FunctionalSubtestsforSentimentAnalysis-k=15%(§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 
FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) 0% 100% Random LC HC LIS HIS Vocabulary Add Sentiment Words 1.20 (± 0.74) 1.27 (± 0.84) 1.02 (± 0.55) 3.93 (± 2.12) 2.27 (± 1.43) 1.33 (± 0.09) 1.60 (± 1.13) Paraphrase Neutral Words 5.59 (± 0.16) 5.40 (± 0.28) 5.98 (± 0.91) 9.73 (± 0.41) 5.93 (± 0.98) 4.87 (± 0.50) 4.53 (± 0.19) Add Intensifiers 2.13 ( ± 1.63) 1.40 (± 0.16) 2.07 (± 1.50) 3.27 (± 1.67) 1.93 (± 0.25) 2.60 (± 0.57) 3.53 (± 1.68) Add Reducers 23.85 (± 7.18) 27.38 (± 5.95) 20.32 (± 19.45) 41.86 (± 7.54) 15.81 (± 2.82) 27.86 (± 9.17) 15.20 (± 11.38) Add Positive Phrases 1.40 (± 0.28) 0.67 (± 0.50) 1.87 (± 2.16) 0.67 (± 0.41) 1.00 (± 0.33) 4.67 (± 1.81) 6.73 (± 6.74) Add Negative Phrases 22.86 (± 7.43) 20.67 (± 4.07) 25.38 (± 8.40) 19.93 (± 5.66) 21.47 (± 9.29) 23.73 (± 3.14) 24.47 (± 15.92) Robustness Add Random URLs/Handles 9.80 (± 0.48) 9.07 (± 1.80) 10.00 (± 3.02) 11.27 (± 3.36) 12.27 (± 4.34) 9.67 (± 1.39) 9.40 (± 7.64) Add Punctuation 3.93 (± 0.89) 1.93 (± 0.41) 4.11 (± 1.75) 4.67 (± 1.67) 3.27 (± 0.84) 4.33 (± 0.66) 3.80 (± 2.83) Add One Typo 2.60 (± 0.90) 2.60 (± 0.57) 2.89 (± 0.39) 4.07 (± 1.16) 2.40 (± 0.28) 2.13 (± 0.57) 3.00 (± 0.59) Add Two Typos 3.93 (± 0.65) 4.27 (± 0.50) 4.56 (± 1.12) 6.33 (± 1.20) 5.47 (± 1.39) 3.47 (± 0.84) 4.40 (± 0.59) Add Contractions 1.00 (± 0.00) 0.87 (± 0.25) 1.00 (± 0.28) 1.13 (± 0.41) 1.00 (± 0.28) 0.93 (± 0.09) 0.40 (± 0.16) Logic Positive→ Negative 5.20 (± 2.75) 4.47 (± 3.07) 6.09 (± 1.47) 10.13 (± 1.64) 8.00 (± 0.85) 3.80 (± 0.65) 2.00 (± 0.75) Negative→ Positives 59.73 (± 9.48) 37.47 (± 10.41) 68.09 (± 22.22) 41.73 (± 25.73) 59.93 (± 27.40) 69.80 (± 11.61) 85.27 (± 10.52) Positive→ Negative (w/ Distractors) 32.20 (± 14.65) 35.00 (± 16.52) 51.11 (± 12.86) 53.53 (± 7.48) 56.07 (± 9.26) 46.80 (± 2.21) 22.87 (± 7.56) Entity Replace Names 0.70 (± 0.14) 1.11 (± 0.51) 1.48 (± 0.79) 1.61 (± 0.87) 1.01 (± 0.14) 0.70 (± 0.38) 1.81 (± 1.08) Replace Locations 3.33 (± 0.74) 3.40 (± 0.86) 5.00 (± 1.04) 6.27 (± 1.73) 5.07 (± 1.37) 3.67 (± 1.09) 3.00 (± 2.14) Replace Numbers 0.80 (± 0.00) 0.47 (± 0.41) 1.27 (± 0.45) 1.87 (± 0.57) 1.40 (± 0.00) 0.80 (± 0.43) 0.93 (± 0.41) Table 4.19: RQ3-FunctionalSubtestsforSentimentAnalysis-k=50%(§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 
106 FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) 0 min 10 min 30 min 5 hr 24 hr 48 hr Vocabulary Add Sentiment Words 2.22 (± 1.58) 3.00 (± 4.79) 1.27 (± 0.84) 1.29 (± 0.80) 0.85 (± 0.85) 1.11 (± 0.70) Paraphrase Neutral Words 4.91 (± 1.66) 6.15 (± 1.52) 5.40 (± 0.28) 4.27 (± 0.93) 5.52 (± 1.02) 6.46 (± 0.89) Add Intensifiers 4.40 ( ± 2.05) 4.18 (± 3.70) 1.40 (± 0.16) 3.76 (± 1.34) 2.75 (± 1.26) 2.34 (± 0.42) Add Reducers 21.68 (± 16.20) 13.11 (± 11.58) 27.38 (± 5.95) 18.61 (± 18.37) 25.78 (± 32.43) 28.25 (± 14.76) Add Positive Phrases 2.80 (± 2.56) 4.80 (± 4.04) 0.67 (± 0.50) 3.27 (± 2.08) 2.05 (± 1.94) 1.06 (± 1.01) Add Negative Phrases 20.31 (± 11.60) 23.78 (± 12.68) 20.67 (± 4.07) 16.44 (± 7.70) 14.65 (± 3.36) 16.17 (± 4.42) Robustness Add Random URLs/Handles 8.82 (± 4.95) 9.85 (± 4.57) 9.07 (± 1.80) 6.87 (± 3.45) 7.65 (± 2.79) 7.66 (± 1.83) Add Punctuation 3.73 (± 2.68) 4.45 (± 3.51) 3.00 (± 1.02) 2.31 (± 1.22) 2.55 (± 1.40) 2.77 (± 0.92) Add One Typo 2.09 (± 0.60) 2.60 (± 0.77) 2.60 (± 0.57) 2.18 (± 0.46) 2.68 (± 0.45) 2.37 (± 0.61) Add Two Typos 4.56 (± 1.79) 5.00 (± 0.78) 4.27 (± 0.5) 3.91 (± 0.99) 4.40 (± 1.03) 4.20 (± 0.88) Add Contractions 0.80 (± 0.52) 0.72 (± 0.22) 0.87 (± 0.25) 0.56 (± 0.26) 0.65 (± 0.37) 0.77 (± 0.43) Logic Positive→ Negative 5.04 (± 4.09) 4.62 (± 2.39) 4.47 (± 3.07) 2.93 (± 1.38) 5.10 (± 2.64) 5.00 (± 2.27) Negative→ Positive 84.56 (± 14.41) 64.03 (± 25.19) 37.47 (± 10.41) 81.29 (± 13.66) 74.53 (± 17.39) 49.94 (± 28.02) Positive→ Negative (w/ Distractors) 30.40 (± 21.11) 34.20 (± 17.76) 35.00 (± 16.52) 25.53 (± 18.38) 32.88 (± 15.44) 38.20 (± 14.73) Entity Replace Names 1.28 (± 0.91) 1.96 (± 1.33) 1.11 (± 0.51) 1.11 (± 0.38) 1.02 (± 0.45) 0.91 (± 0.65) Replace Locations 3.47 (± 1.54) 3.20 (± 1.76) 3.40 (± 0.86) 2.42 (± 1.31) 3.05 (± 1.12) 2.89 (± 1.26) Replace Numbers 0.58 (± 0.60) 0.75 (± 0.21) 0.47 (± 0.41) 0.38 (± 0.37) 0.42 (± 0.43) 0.69 (± 0.57) Table 4.20: RQ4-FunctionalSubtestsforSentimentAnalysis-LabelOnly(§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 
FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) 0 min 10 min 30 min 5 hr 24 hr 48 hr Vocabulary Add Sentiment Words 2.22 (± 1.58) 0.91 (± 0.58) 1.16 (± 0.62) 2.07 (± 1.25) 0.92 (± 0.52) 0.92 (± 0.52) Paraphrase Neutral Words 4.91 (± 1.66) 5.27 (± 1.47) 4.82 (± 1.05) 5.47 (± 2.25) 5.82 (± 2.47) 5.82 (± 2.47) Add Intensifiers 4.40 ( ± 2.05) 2.80 (± 1.19) 2.76 (± 0.90) 3.04 (± 1.09) 2.00 (± 1.11) 2.00 (± 1.11) Add Reducers 21.68 (± 16.20) 11.06 (± 9.21) 8.64 (± 11.13) 18.38 (± 10.12) 29.43 (± 19.41) 29.43 (± 19.41) Add Positive Phrases 2.80 (± 2.56) 2.49 (± 1.66) 2.76 (± 1.80) 1.76 (± 1.18) 1.52 (± 0.62) 1.52 (± 0.62) Add Negative Phrases 20.31 (± 11.60) 18.51 (± 9.87) 21.49 (± 8.73) 16.76 (± 5.49) 19.45 (± 8.40) 19.45 (± 8.40) Robustness Add Random URLs/Handles 8.82 (± 4.95) 7.80 (± 3.71) 7.69 (± 3.75) 8.13 (± 2.99) 7.15 (± 4.06) 7.15 (± 4.06) Add Punctuation 3.73 (± 2.68) 2.84 (± 1.37) 3.42 (± 1.85) 3.20 (± 1.48) 3.65 (± 2.75) 3.65 (± 2.75) Add One Typo 2.09 (± 0.60) 2.56 (± 0.67) 2.51 (± 0.82) 2.62 (± 1.01) 2.78 (± 0.64) 2.78 (± 0.64) Add Two Typos 4.56 (± 1.79) 4.76 (± 1.11) 4.09 (± 0.96) 4.49 (± 1.18) 4.65 (± 1.02) 4.65 (± 1.02) Add Contractions 0.80 (± 0.52) 0.73 (± 0.46) 0.56 (± 0.53) 0.73 (± 0.65) 0.68 (± 0.33) 0.68 (± 0.33) Logic Positive→ Negative 5.04 (± 4.09) 3.36 (± 1.30) 3.44 (± 1.50) 5.91 (± 3.16) 5.98 (± 2.96) 5.98 (± 2.96) Negative→ Positive 84.56 (± 14.41) 75.89 (± 26.78) 85.89 (± 12.74) 56.98 (± 34.69) 58.12 (± 27.91) 58.12 (± 27.91) Positive→ Negative (w/ Distractors) 30.40 (± 21.11) 28.38 (± 8.88) 24.91 (± 11.84) 40.73 (± 23.29) 46.75 (± 20.49) 46.75 (± 20.49) Entity Replace Names 1.28 (± 0.91) 1.17 (± 0.97) 1.28 (± 1.03) 1.01 (± 0.55) 1.21 (± 0.52) 1.21 (± 0.52) Replace Locations 3.47 (± 1.54) 2.69 (± 1.49) 2.87 (± 1.36) 3.56 (± 2.05) 3.18 (± 2.31) 3.18 (± 2.31) Replace Numbers 0.58 (± 0.60) 0.44 (± 0.39) 0.31 (± 0.21) 0.69 (± 0.58) 1.02 (± 0.87) 1.02 (± 0.87) Table 4.21: RQ4-FunctionalSubtestsforSentimentAnalysis-ExplOnly(§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 
107 FunctionalTest FunctionalSubtest SentimentAnalysis Flights Failure Rate (↓) 0 min 10 min 30 min 5 hr 24 hr 48 hr Vocabulary Add Sentiment Words 2.22 (± 1.58) 2.07 (± 2.44) 2.84 (± 4.05) 1.38 (± 0.73) 1.53 (± 1.15) 1.07 (± 0.83) Paraphrase Neutral Words 4.91 (± 1.66) 6.18 (± 2.36) 4.80 (± 1.98) 5.27 (± 1.56) 5.36 (± 1.20) 4.78 (± 1.14) Add Intensifiers 4.40 ( ± 2.05) 3.00 (± 2.20) 3.51 (± 3.21) 4.04 (± 4.98) 2.89 (± 1.08) 2.07 (± 0.85) Add Reducers 21.68 (± 16.20) 22.08 (± 29.84) 16.55 (± 15.31) 22.27 (± 16.00) 22.68 (± 11.99) 46.09 (± 19.50) Add Positive Phrases 2.80 (± 2.56) 2.78 (± 2.22) 1.44 (± 1.04) 3.11 (± 2.83) 1.91 (± 1.60) 1.40 (± 1.10) Add Negative Phrases 20.31 (± 11.60) 19.07 (± 9.48) 16.60 (± 6.60) 18.84 (± 17.92) 25.87 (± 14.51) 17.98 (± 3.68) Robustness Add Random URLs/Handles 8.82 (± 4.95) 8.78 (± 4.78) 7.27 (± 2.48) 9.02 (± 7.59) 10.02 (± 5.74) 8.44 (± 2.37) Add Punctuation 3.73 (± 2.68) 3.22 (± 1.58) 3.27 (± 2.24) 3.91 (± 5.02) 3.93 (± 2.24) 2.38 (± 0.71) Add One Typo 2.09 (± 0.60) 2.62 (± 0.73) 2.42 (± 0.86) 3.02 (± 1.22) 2.44 (± 1.06) 2.22 (± 0.64) Add Two Typos 4.56 (± 1.79) 4.62 (± 0.89) 3.84 (± 1.49) 4.58 (± 1.85) 4.36 (± 1.14) 3.98 (± 0.60) Add Contractions 0.80 (± 0.52) 0.82 (± 0.38) 0.62 (± 0.57) 0.82 (± 0.63) 0.80 (± 0.23) 0.73 (± 0.25) Logic Positive→ Negative 5.04 (± 4.09) 8.31 (± 10.47) 4.47 (± 3.07) 4.13 (± 2.99) 4.84 (± 3.00) 5.82 (± 2.81) Negative→ Positive 84.56 (± 14.41) 65.76 (± 21.49) 68.91 (± 28.29) 81.29 (± 13.66) 70.98 (± 18.14) 63.73 (± 22.93) Positive→ Negative (w/ Distractors) 30.40 (± 21.11) 35.02 (± 25.60) 36.38 (± 22.33) 34.24 (± 15.02) 28.24 (± 16.92) 41.13 (± 11.90) Entity Replace Names 1.28 (± 0.91) 1.21 (± 0.65) 1.11 (± 0.43) 1.51 (± 0.97) 1.61 (± 1.18) 0.94 (± 0.59) Replace Locations 3.47 (± 1.54) 3.53 (± 1.96) 2.44 (± 0.78) 3.38 (± 1.92) 3.07 (± 2.12) 2.89 (± 1.02) Replace Numbers 0.58 (± 0.60) 0.73 (± 0.70) 0.71 (± 0.52) 0.47 (± 0.28) 0.56 (± 0.41) 0.69 (± 0.39) Table 4.22: RQ4-FunctionalSubtestsforSentimentAnalysis-Label+Expl(§4.5.4.4). For sentiment analysis, we compare various ER rationale alignment criteria using the IxG machine rationale extractor (as well as the No-ER baseline), with respect to performance on a range of functional tests/subtests (OOD). For each functional test, we report model performance on each of its individual functional subtests. Performance is reported in terms of failure rate. 108 Chapter5 LearningfromWeakly-SupervisedMachineExplanations In this chapter, we investigate RQ3: Howcanweutilizeweakly-supervisedmachineexplanationstoimprove the LM’s task performance? 5.1 Introduction Natural language processing (NLP) systems generally need common sense to function well in the real world [76]. However, NLP tasks do not always provide the requisite commonsense knowledge as input. Moreover, commonsense knowledge is seldom stated in natural language, making it hard for pre-trained language models (PLMs) [48, 152] — i.e., text encoders — to learn common sense from corpora alone [45, 164]. In contrast to corpora, a knowledge graph (KG) is a rich, structured source of commonsense knowl- edge, containing numerous facts of the form(concept1, relation, concept2). As a result, many meth- ods follow the KG-augmented model paradigm, which augments a text encoder with a graph encoder that reasons over the KG (Fig. 5.2). KG-augmented models have outperformed text encoders on various commonsense reasoning (CSR) tasks, like question answering (QA) (Fig. 
5.1) [140, 18, 158, 264], natural language inference (NLI) [35, 248], and text generation [150, 285]. Since KGs do not have perfect knowledge coverage, they may not contain useful knowledge for all task instances (e.g., if the KG in Fig. 5.1 only consisted of the gray nodes). Also, even if the KG is useful overall for a given task instance, only some parts of the KG may be useful (e.g., the green nodes in Fig. 5.1). Ideally, a KG-augmented model would know both if the KG is useful and which parts of the KG are useful.
Figure 5.1: KG Saliency Explanations for Commonsense QA. Across different questions, the KG's usefulness can vary considerably. Coarse explanations indicate if the KG is useful overall, while fine explanations highlight useful nodes or paths. Here, the fine explanations state that the market, produce, and merchant nodes are useful, while the other nodes are not.
Existing KG-augmented models always assume the KG should be used, but often use attention [238] to focus on specific KG components (e.g., nodes [57, 212, 261], paths [246, 208, 18]) when predicting. Still, the attention mechanism is supervised (end-to-end) only by the task loss, so the model is never explicitly taught which KG components should be used. Without component-level supervision, the attention mechanism is more likely to overfit to spurious patterns. How can we better teach the model whether each KG feature (e.g., graph, node, path) is useful for solving the given task instance? Using the task's ground truth labels, saliency methods [8] can score each KG feature's influence on the model making the correct prediction. Whereas attention weights show which KG features the model already used, saliency scores indicate which KG features the model should use. By binarizing these scores, we are able to produce saliency explanations, which can serve as simple targets for training the model's attention mechanism. For example, Fig. 5.1 shows saliency explanations [market=1, produce=1, trading=0, merchant=1, store=0, shop=0], stating that market, produce, and merchant are useful nodes for answering the question.
Figure 5.2: KG-Augmented Models fuse knowledge from text and KG inputs to solve CSR tasks.
In this paper, we investigate how saliency explanations can be used to improve KG-augmented models' performance. First, we propose to create coarse (graph-level) and fine (node-/path-level) saliency explanations. Since KGs have features at different granularities, saliency explanations can supply a rich array of signals for learning to focus on useful KG features. To create coarse explanations, we introduce an ensemble-based saliency method which measures the performance difference between a KG-augmented model and its corresponding non-KG-augmented model. To create fine explanations, we can adapt any off-the-shelf saliency method, e.g., gradient-based [47] or occlusion-based [135]. Second, to demonstrate the potential of saliency-based supervision, we analyze the performance of oracle KG-augmented models, whose attention weights are directly masked with coarse and/or fine saliency explanations. Third, as motivated by our oracle model analysis, we propose the Learning from Saliency Explanations of KG-Augmented Models (SalKG) framework. Given coarse and/or fine explanations created from the task's training set, SalKG jointly trains the model to predict the explanations, then solve the task by attending to KG features highlighted in the predicted explanations.
Using saliency explanations to regularize the attention mechanism can help the model generalize better to unseen instances, especially when coarse and fine explanations are used together as complementary learning signals. Indeed, on three standard commonsense QA benchmarks (CSQA, OBQA, CODAH) and a range of KG-augmented models, we show that SalKG can achieve considerable performance gains.
5.2 Preliminaries
Since KGs abundantly provide structured commonsense knowledge, KG-augmented models are often helpful for solving CSR tasks. CSR tasks are generally formulated as multi-choice QA (discriminative) tasks [231, 170, 111], but sometimes framed as open-ended response (generative) tasks [150, 142]. Given that multi-choice QA has been more extensively studied, we consider CSR in terms of multi-choice QA. Here, we present the multi-choice QA problem setting (Fig. 5.1) and the structure of KG-augmented models (Fig. 5.2).
Problem Definition: Given a question q and a set of answer choices A = {a_i}, a multi-choice QA model aims to predict a plausibility score ρ(q, a_i) for each (q, a_i) pair, so that the predicted answer â = argmax_{a_i ∈ A} ρ(q, a_i) matches the target answer a*. Let q ⊕ a_i be the text statement formed from (q, a_i), where ⊕ denotes concatenation. For example, in Fig. 5.1, the text statement for q ⊕ a* would be: "What kind of store does a merchant have if they sell produce? market." We abbreviate q ⊕ a_i as x_i and its plausibility score as ρ(x_i).
KG-Augmented Models: KG-augmented models use additional supervision from knowledge graphs to solve the multi-choice QA task. They encode the text and KG inputs individually as embeddings, then fuse the two embeddings together to use for prediction. A KG is denoted as G̃ = (Ṽ, R̃, Ẽ), where Ṽ, R̃, and Ẽ are the KG's nodes (concepts), relations, and edges (facts), respectively. An edge is a directed triple of the form e = (c_1, r, c_2) ∈ Ẽ, in which c_1, c_2 ∈ Ṽ are nodes, and r ∈ R̃ is the relation between c_1 and c_2. A path is a connected sequence of edges in the KG. When answering a question, the model does not use the entire KG, since most information in G̃ is irrelevant to x_i. Instead, the model uses a smaller, contextualized KG G_i = (V_i, R_i, E_i), which is built from G̃ using x_i. G_i can be constructed heuristically by extracting edges from G̃ [140, 159], by generating edges with a PLM [18], or both [246, 261]. In this paper, we consider KG-augmented models where G_i is built heuristically by extracting edges from G̃ (see §5.8.1 for more details), since most KG-augmented models follow this paradigm. If x_i and G_i are not discussed in the context of other answer choices, then we further simplify x_i's and G_i's notation as x and G, respectively. Since the model never uses the full KG at once, we use "KG" to refer to G in the rest of the paper.
Table 5.1: KG unit types used for different explanation modes (§5.3) and graph encoders (§5.4.2).
Explanation Setting | Unit
Coarse | KG
Fine (MHGRN) | Node
Fine (PathGen) | Path
Fine (RN) | Path
As in prior works [140, 18], a KG-augmented model F_KG has three main components: text encoder f_text, graph encoder f_graph, and task predictor f_task (Fig. 5.2). Meanwhile, its corresponding non-KG-augmented model F_No-KG has no graph encoder but has a slightly different task predictor f̄_task which only takes x as input. In both F_KG and F_No-KG, the task predictor outputs ρ(x). Let x and g be the embeddings of x and G, respectively.
Then, the workflows of F_KG and F_No-KG are defined below:
$$\mathbf{x} = f_{\text{text}}(x), \qquad \mathbf{g} = f_{\text{graph}}(\mathcal{G}, \mathbf{x}), \qquad \mathcal{F}_{\text{KG}}(x, \mathcal{G}) = f_{\text{task}}(\mathbf{x} \oplus \mathbf{g}), \qquad \mathcal{F}_{\text{No-KG}}(x) = \bar{f}_{\text{task}}(\mathbf{x}).$$
Typically, f_text is a PLM [48, 152], f_graph is a graph neural network (GNN) [57, 212] or an edge/path aggregation model [140, 18, 208], and f_task and f̄_task are multilayer perceptrons (MLPs). In general, f_graph reasons over G by encoding either nodes or paths, then using soft attention to pool the encoded nodes/paths into g. Let L_task be the task loss for training F_KG and F_No-KG. For multi-choice QA, L_task is the cross-entropy loss with respect to the distribution over A. For brevity, when comparing different models, we may also refer to F_KG and F_No-KG as KG and No-KG, respectively.
5.3 Creating KG Saliency Explanations
Now, we show how to create coarse and fine saliency explanations, which tell us if the KG or certain parts of the KG are useful. These explanations can be used as extra inputs to mask oracle models' attention (§5.4) or as extra supervision to regularize SalKG models' attention (§5.5). We first abstractly define a unit as either G itself or a component of G. A unit can be a graph, node, path, etc., and we categorize units as coarse (the entire graph G) or fine (a node or path within G) (Table 5.1). Given a model and task instance (x, G), we define an explanation as a binary indicator of whether a unit u of G is useful for the model's prediction on (x, G). If u is useful, then u should strongly influence the model to solve the instance correctly. By making explanations binary, we can easily use explanations as masks or learning targets (since binary labels are easier to predict than real-valued scores) for attention weights.
5.3.1 Coarse Saliency Explanations
Since G may not always be useful, a KG-augmented model should ideally know when to use G. Here, the unit u is the graph G. Given instance (x, G), a coarse saliency explanation y_c(x, G) ∈ {0, 1} indicates if G helps the model solve the instance. By default, F_KG assumes G is used, so we propose an ensemble-based saliency formulation for y_c(x, G). That is, we define y_c(x, G) as stating if F_KG (i.e., uses G) or F_No-KG (i.e., does not use G) should be used to solve (x, G). Under this formulation, each (x, G) has coarse units G and None, where None means "G is not used".
To get y_c(x, G), we begin by computing a coarse saliency score s_c(x, G) ∈ R, which we define as the performance difference between F_KG and F_No-KG. For QA input x_i = q ⊕ a_i and its KG G_i, let p_KG(x_i, G_i) and p_No-KG(x_i) be the confidence probabilities for x_i predicted by F_KG and F_No-KG, respectively. Ideally, a QA model should predict higher probabilities for answer choices a_i that are correct, and vice versa. To capture this notion, we define s_c(x_i, G_i) as
$$s_c(x_i, \mathcal{G}_i) = \begin{cases} p_{\text{KG}}(x_i, \mathcal{G}_i) - p_{\text{No-KG}}(x_i), & a_i = a^* \\ p_{\text{No-KG}}(x_i) - p_{\text{KG}}(x_i, \mathcal{G}_i), & a_i \neq a^* \end{cases}$$
where a* denotes the correct answer. Note that s_c(x_i, G_i) is positive if p_KG(x_i, G_i) is higher than p_No-KG(x_i) for correct choices and lower for incorrect choices. We obtain y_c(x_i, G_i) by binarizing s_c(x_i, G_i) to 0 or 1, based on whether it is less than or greater than a threshold T, respectively. If y_c(x_i, G_i) = 1, then the KG is useful, and vice versa. See the appendix for more details about why we use ensemble-based saliency for coarse explanations (§5.8.2) and how we tune T (§5.8.6).
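The coarse labeling step just described reduces to a few lines of array arithmetic. Below is a minimal sketch, assuming p_kg and p_no_kg hold the two frozen models' predicted probabilities for each (question, answer choice) statement and is_correct marks the choices equal to a*; the function and variable names are illustrative, not taken from the released SalKG code.

```python
import numpy as np

def coarse_explanations(p_kg, p_no_kg, is_correct, T=0.01):
    """Sketch of ensemble-based coarse explanation creation.

    p_kg, p_no_kg : arrays of confidence probabilities for each statement x_i,
                    from the KG-augmented and No-KG models (frozen).
    is_correct    : boolean array, True where a_i is the target answer a*.
    T             : binarization threshold (tuned on dev accuracy, see §5.8.6).
    """
    # s_c > 0 when the KG model is more confident than the No-KG model on
    # correct choices and less confident on incorrect ones.
    s_c = np.where(is_correct, p_kg - p_no_kg, p_no_kg - p_kg)
    # y_c = 1 means "the KG is useful for this instance".
    y_c = (s_c > T).astype(int)
    return y_c
```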
5.3.2 Fine Saliency Explanations
Even if G is useful, not every part of G may be useful. Hence, fine saliency explanations can identify which parts of a KG are actually useful. For a given instance (x, G), we denote the fine saliency explanation for a fine unit u in G as y_f(u; x, G) ∈ {0, 1}. Fine units can be nodes, paths, etc. in the KG. If a graph encoder f_graph encodes a certain type of unit, it is natural to define y_f(u; x, G) with respect to such units. For example, MHGRN [57] encodes G's nodes, so we define MHGRN's fine saliency explanations with respect to nodes.
Similar to coarse saliency explanations, to obtain y_f(u; x, G), we first compute a fine saliency score s_f(u; x, G) ∈ R and then binarize it. For a QA input x_i = q ⊕ a_i and its KG G_i, let u_ij be the j-th fine unit in G_i and p_KG(x_i, G_i) denote F_KG's predicted probability for x_i. There are many existing saliency methods (a.k.a. attribution methods) [47, 229, 135] for calculating the importance score of an input, with respect to a model and a given label. While s_f(u_ij; x_i, G_i) can be computed via any saliency method, we use gradient-based and occlusion-based methods, since they are the most common types of saliency methods [8]. Let φ(u_ij; x_i, G_i) denote the raw saliency score given by some saliency method. Gradient-based methods measure an input's saliency via the gradient of the model's output with respect to the input. We use the gradient × input (Grad) method [47], where φ(u_ij; x_i, G_i) is the dot product of u_ij's embedding and the gradient of p_KG(x_i, G_i) with respect to u_ij. Occlusion-based methods measure an input's saliency as how the model's output is affected by erasing that input. We use the leave-one-out (Occl) method [135], where φ(u_ij; x_i, G_i) is the decrease in p_KG(x_i, G_i) if u_ij is removed from G_i, i.e., φ(u_ij; x_i, G_i) = p_KG(x_i, G_i) - p_KG(x_i, G_i \ u_ij).
Intuitively, a unit is more useful if it increases the probability of the correct answer choice a*, and vice versa. Thus, we define the saliency score s_f(u_ij; x_i, G_i) for unit u_ij as
$$s_f(u_{ij}; x_i, \mathcal{G}_i) = \begin{cases} \phi(u_{ij}; x_i, \mathcal{G}_i), & a_i = a^* \\ -\phi(u_{ij}; x_i, \mathcal{G}_i), & a_i \neq a^* \end{cases}$$
Next, we binarize the saliency scores to get y_f(u_ij; x_i, G_i), by selecting the top-k%-scoring units in G_i and setting y_f(u_ij; x_i, G_i) = 1 (i.e., u_ij is useful) for these units. For all other units in G, we set y_f(u_ij; x_i, G_i) = 0 (i.e., u_ij is not useful). See the appendix for more details about the fine saliency methods (§5.8.3) and tuning threshold k (§5.8.6).
5.4 Oracle: Using KG Saliency Explanations as Inputs
In this section, we analyze KG saliency explanations' potential to improve KG-augmented models' performance. Recall that creating saliency explanations requires the task's ground truth labels (§5.3), so directly using test set explanations is infeasible. Still, before exploring ways to leverage training set explanations (§5.5), we first establish upper bounds on how much models can benefit from saliency explanations. Here, we study three key questions: (1) Does the model improve when provided oracle access to coarse/fine explanations? (2) Are coarse and fine explanations complementary? (3) How do gradient-based explanations compare to occlusion-based explanations?
5.4.1 Oracle Models
Oracle models are KG-augmented models with oracle access to saliency explanations. An Oracle model uses ground truth labels to create explanations (even at inference time), and then uses the explanations as extra inputs to perform hard attention over the units, as sketched below.
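To make the fine-explanation creation and this hard-attention masking concrete, here is a minimal sketch for a single (x_i, G_i). It assumes phi already holds per-unit raw scores from Grad or Occl; all names are illustrative rather than taken from the released code.

```python
import numpy as np

def fine_explanations(phi, is_correct, k=10):
    """Binarize raw per-unit saliency scores into a fine explanation y_f.

    phi        : array of raw scores, one per fine unit u_ij in G_i
                 (e.g., Occl: p_KG(x_i, G_i) - p_KG(x_i, G_i \\ u_ij)).
    is_correct : True if a_i is the target answer a*.
    k          : the top-k% of units are marked useful (y_f = 1).
    """
    s_f = phi if is_correct else -phi                 # sign flip for wrong choices
    n_pos = max(1, int(round(len(s_f) * k / 100)))    # number of "useful" units
    y_f = np.zeros(len(s_f), dtype=int)
    y_f[np.argsort(-s_f)[:n_pos]] = 1                 # top-k%-scoring units
    return y_f

def oracle_fine_mask(attn, y_f):
    """Oracle-Fine-style hard attention: only units with y_f = 1 keep weight."""
    return attn * y_f                                 # element-wise mask
```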
We define the model attention weights that are modified based on saliency explanations as saliency weights. Below, we introduce the Oracle-Coarse, Oracle-Fine, and Oracle-Hybrid models, shown in Fig. 5.3a-c.
Oracle-Coarse: Oracle-Coarse (F*_c) uses coarse explanations to do hard attention over F_KG's and F_No-KG's predictions. First, F_KG and F_No-KG are trained separately, then frozen. Next, for each instance (x, G), they are used to create a coarse explanation y_c(x, G) ∈ {0, 1}. Then, F*_c is defined as an ensemble model that performs hard attention over coarse units (G and None) by weighting F_KG's prediction with y_c(x, G) and F_No-KG's prediction with 1 - y_c(x, G) (Table 5.2; Fig. 5.3a). In other words, y_c(x, G) and 1 - y_c(x, G) are the saliency weights for F*_c.
Oracle-Fine: Oracle-Fine (F*_f) has the same architecture as F_KG and uses fine explanations to do hard attention over fine units (i.e., nodes or paths in G). First, F_KG is trained, then frozen. As usual, F_KG uses soft attention over fine units in G to compute graph embedding g (§5.2). Then, for each fine unit u in G, F_KG is used to create fine explanation y_f(u; x, G) ∈ {0, 1}. Let ŷ_f(u; x, G) ∈ [0, 1] denote F*_f's soft attention weight for u. We train F*_f the same way as F_KG, except each ŷ_f(u; x, G) is (hard attention) masked with y_f(u; x, G), i.e., ŷ_f(u; x, G) ← ŷ_f(u; x, G) ⊙ y_f(u; x, G), where ⊙ denotes element-wise multiplication (Table 5.2; Fig. 5.3b). This means only units with y_f(u; x, G) = 1 will have ŷ_f(u; x, G) > 0 and thus be able to influence F*_f's prediction. Let y_f(x, G) and ŷ_f(x, G) denote the explanations and soft attention weights, respectively, for all units in the graph. Then, ŷ_f(x, G) ⊙ y_f(x, G) are the saliency weights for F*_f.
Oracle-Hybrid: Oracle-Hybrid (F*_h) unifies Oracle-Coarse and Oracle-Fine as a single model, thus leveraging the coarse-fine hierarchy inherent in KG saliency explanations. First, F*_f (which uses fine explanations) and F_No-KG are separately trained, then frozen. Then, for each (x, G), F*_f and F_No-KG are used to create y_h(x, G) ∈ {0, 1}, which we define as the coarse explanation for F*_f and F_No-KG. y_h(x, G) is computed the same way as y_c(x, G), besides replacing F_KG with F*_f. Finally, similar to F*_c, F*_h is an ensemble that performs hard attention over coarse units by weighting F*_f's prediction with y_h(x, G) and F_No-KG's prediction with 1 - y_h(x, G) (Table 5.2; Fig. 5.3c). That is, y_h(x, G) and 1 - y_h(x, G) are the saliency weights for F*_h.
Table 5.2: Comparison of Oracle Models. For each Oracle model, we show its output and saliency weights. Note that the explanations are given (not predicted), so there is no L_sal. While F*_c and F*_h are both ensembles of F_KG and F_No-KG, F*_f has the same architecture as F_KG (denoted by ~).
Model | Output | Saliency Weights
Oracle-Coarse | F*_c(x, G) = y_c(x, G) F_KG(x, G) + (1 - y_c(x, G)) F_No-KG(x) | [y_c(x, G), 1 - y_c(x, G)]
Oracle-Fine | F*_f(x, G) ~ F_KG(x, G) | ŷ_f(x, G) ⊙ y_f(x, G)
Oracle-Hybrid | F*_h(x, G) = y_h(x, G) F*_f(x, G) + (1 - y_h(x, G)) F_No-KG(x) | [y_h(x, G), 1 - y_h(x, G)]
5.4.2 Evaluation Protocol
We use the CSQA [231] and OBQA [170] multi-choice QA datasets. For CSQA, we use the accepted in-house data split from [140], as the official test labels are not public. As in prior works, we use the ConceptNet [224] KG for both datasets. We report accuracy, the standard metric for multi-choice QA.
ForF No-KG and F KG , we pick the best model over three seeds, then use them to create explanations for Oracle models. We use thresholdsT = 0.01 andk = 10 for coarse and fine explanations, respectively. For text encoders, we use BERT(-Base) [48] and RoBERTa(-Large) [152]. For graph encoders, we use MHGRN [57], PathGen [246], and Relation Network (RN) [208, 140]. MHGRN has node units, while PathGen and RN have path units. Asbaselinemodels, we useF No-KG ,F KG , andF No-KG +F KG , whereF No-KG +F KG is an ensemble whose prediction is the mean ofF No-KG ’s andF KG ’s predictions. Oracle and baseline models are trained only with task lossL task . 5.4.3 Analysis In Table 5.3, we show CSQA and OBQA performance for the baseline and Oracle models. We analyze these results via the three questions below. 118 CSQATestAccuracy(%) OBQATestAccuracy(%) MHGRN PathGen RN MHGRN PathGen RN Model BERT RoBERTa BERT RoBERTa BERT RoBERTa BERT RoBERTa BERT RoBERTa BERT RoBERTa No-KG 55.44 70.59 55.44 70.59 55.44 70.59 53.60 68.40 53.60 68.40 53.60 68.40 KG 56.57 73.33 56.65 72.04 55.60 71.07 53.20 69.80 55.00 67.80 58.60 70.20 No-KG + KG 56.57 71.39 57.45 73.00 56.73 68.49 55.60 70.60 54.40 70.6 53.40 69.60 Oracle-Coarse 66.16 81.39 68.57 80.10 67.28 79.69 70.60 79.40 65.00 76.60 69.00 79.00 Oracle-Fine (Grad) 74.86 76.15 79.61 87.35 81.39 83.24 67.60 72.60 73.80 73.40 68.00 62.80 Oracle-Fine (Occl) 91.06 87.99 79.61 75.34 73.73 68.41 77.00 71.20 83.60 62.60 55.60 61.40 Oracle-Hybrid (Grad) 85.50 84.21 90.49 92.83 92.26 93.56 80.80 84.80 85.60 92.80 85.40 86.80 Oracle-Hybrid (Occl) 95.89 98.63 88.96 96.78 85.25 95.25 87.00 89.60 92.80 90.60 67.40 80.60 Table 5.3: OraclePerformanceonCSQAandOBQA Does the model improve when provided oracle access to coarse/fine explanations? Yes. Or- acle-Coarse beats all baselines, while Oracle-Fine beats all baselines except on OBQA RN+RoBERTa. These results motivate us to develop a framework for models to improve performance by learning from coarse/fine explanations. Also, on average, Oracle-Fine outperformsOracle-Coarse, which suggests that fine explanations may often provide richer signal than their coarse counterparts. Indeed, fine explanations indicate the saliency of every unit in the KG, while coarse explanations only indicate the saliency of the KG as a whole. Are coarse and fine explanations complementary? Yes. Across all settings, Oracle-Hybrid per- forms significantly better than Oracle-Coarse andOracle-Fine. This suggests that coarse and fine expla- nations are complementary and that it is effective to leverage both hierarchically. How do gradient-based explanations compare to occlusion-based explanations? Overall, Grad and Occl perform similarly. Grad performs better on some settings (e.g., MHGRN), while Occl performs better on others (e.g., RN). See Table 5.8 and §5.8.9 for more Grad vs. Occl experiments. In our Oracle pilot study, KG-augmented models achieve large performance gains when given ex- planations as input. This suggests that, if oracle explanations can somehow be predicted accurately dur- ing inference without using ground truth labels, then KG-augmented models can still achieve improve- ments without directly using explanations as input. This motivates us to train KG-augmented models with explanation-based supervision viaSalKG, which we describe in §5.5. 119 Figure 5.3: Schematics for Oracle and SalKG Models. Red arrows indicate the Oracle pipeline, where the target explanation is provided as input. 
Purple arrows indicate the SalKG pipeline, where the target explanation is used as supervision for the predicted explanation. In SalKG-Coarse and SalKG-Hybrid, the saliency predictor has the same architecture as F_KG. Meanwhile, Oracle-Fine and SalKG-Fine (shown as the white module, with text encoder and task predictor omitted) both have the same architecture as F_KG.
5.5 SalKG: Using KG Saliency Explanations as Supervision
Based on the analysis from §5.4.3, we propose the SalKG framework for KG-augmented models to learn from coarse/fine saliency explanations. Whereas Oracle models (§5.4.1) use explanations directly as extra inputs, SalKG models only use them as extra supervision during the training phase. With explanations created from the training set via F_KG and F_No-KG, SalKG models are jointly trained to predict the explanations (via saliency loss L_sal) and use the predicted explanations to solve the task (via task loss L_task). Thus, SalKG models have the following objective: L_S = L_task + λ L_sal, where λ ≥ 0 is a loss weighting parameter. This multitask objective not only encourages SalKG models to focus on useful KG units for solving the task, but also to learn more general graph/node/path representations. Below, we present the SalKG-Coarse, SalKG-Fine, and SalKG-Hybrid models.
Table 5.4: Comparison of SalKG Models. For each SalKG model, we show its output, saliency weights, and L_sal. While F_c and F_h are both ensembles, F_f has the same architecture as F_KG (denoted by ~). "CE" denotes cross-entropy loss, while "KL" denotes KL divergence loss.
Model | Output | Saliency Weights | Saliency Loss (L_sal)
SalKG-Coarse | F_c(x, G) = ŷ_c(x, G) F_KG(x, G) + (1 - ŷ_c(x, G)) F_No-KG(x) | [ŷ_c(x, G), 1 - ŷ_c(x, G)] | CE(ŷ_c(x, G), y_c(x, G))
SalKG-Fine | F_f(x, G) ~ F_KG(x, G) | ŷ_f(x, G) | KL(ŷ_f(x, G), y_f(x, G))
SalKG-Hybrid | F_h(x, G) = ŷ_h(x, G) F_f(x, G) + (1 - ŷ_h(x, G)) F_No-KG(x) | [ŷ_h(x, G), 1 - ŷ_h(x, G)] | CE(ŷ_h(x, G), y_h(x, G))
SalKG-Coarse: Unlike Oracle-Coarse, SalKG-Coarse (F_c) is not given the oracle coarse explanation y_c(x, G) as input. Instead, a saliency predictor S_c (with the same architecture as F_KG) is trained to predict the oracle coarse explanation. S_c predicts the coarse explanation as a probability ŷ_c(x, G) ∈ [0, 1]. F_c's output is an ensemble that does soft attention over coarse units by weighting F_KG's and F_No-KG's predictions with saliency weights ŷ_c(x, G) and 1 - ŷ_c(x, G), respectively (Table 5.4; Fig. 5.3a). Here, L_sal(ŷ_c(x, G), y_c(x, G)) is the cross-entropy loss.
SalKG-Fine: Similarly, SalKG-Fine (F_f) is not given the oracle fine explanation y_f(u; x, G) as input, although, like Oracle-Fine, it has the same architecture as F_KG. Instead, for each fine unit u, F_f's attention mechanism is trained to predict y_f(u; x, G) as soft attention weight ŷ_f(u; x, G) ∈ [0, 1] (Table 5.4; Fig. 5.3b). As before, ŷ_f(x, G) = [ŷ_f(u; x, G)]_{u ∈ G} are the soft attention weights for (x, G), while y_f(x, G) = [y_f(u; x, G)]_{u ∈ G} are the fine explanations for (x, G). Then, ŷ_f(x, G) are the saliency weights for F_f, trained with KL divergence loss L_sal(ŷ_f(x, G), y_f(x, G)).
SalKG-Hybrid: Similar to the other SalKG variants, SalKG-Hybrid (F_h) does not use any oracle explanations. Like in SalKG-Coarse, a saliency predictor S_h is trained to predict the oracle coarse explanation y_h(x, G) (§5.4.1). The predicted coarse explanation probabilities ŷ_h(x, G) ∈ [0, 1] are then used as soft attention over coarse units by weighting F_f's and F_No-KG's predictions with weights ŷ_h(x, G) and 1 - ŷ_h(x, G), respectively (Table 5.4; Fig. 5.3c). Here, L_sal(ŷ_h(x, G), y_h(x, G)) is cross-entropy loss.
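As a rough illustration of the joint objective L_S = L_task + λ L_sal, the sketch below combines the task loss with a coarse (cross-entropy) and/or fine (KL divergence) saliency term. It is a minimal sketch, not the released implementation: the function name and arguments are illustrative, and normalizing the binary fine explanation into a distribution before the KL term is an assumption on our part.

```python
import torch
import torch.nn.functional as F

def salkg_loss(task_logits, answer_idx,
               coarse_prob=None, coarse_target=None,
               fine_attn=None, fine_target=None, lam=1.0):
    """Sketch of L_S = L_task + lam * L_sal for the SalKG variants.

    task_logits : (batch, num_choices) plausibility scores rho(x_i).
    answer_idx  : (batch,) index of the correct choice a*.
    coarse_prob / coarse_target : predicted vs. oracle coarse explanation
        (SalKG-Coarse / -Hybrid); supervised with binary cross-entropy.
    fine_attn / fine_target : per-unit attention distribution vs. binary
        oracle fine explanation (SalKG-Fine); supervised with KL divergence.
    """
    loss = F.cross_entropy(task_logits, answer_idx)                 # L_task
    if coarse_prob is not None:                                      # coarse L_sal (CE)
        loss = loss + lam * F.binary_cross_entropy(coarse_prob,
                                                   coarse_target.float())
    if fine_attn is not None:                                        # fine L_sal (KL)
        # Assumption: turn the binary explanation into a target distribution.
        target = fine_target.float()
        target = target / target.sum(dim=-1, keepdim=True)
        loss = loss + lam * F.kl_div(torch.log(fine_attn + 1e-12),
                                     target, reduction='batchmean')
    return loss
```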
5.6 Experiments
5.6.1 Evaluation Protocol
We evaluate SalKG models on the CSQA [231], OBQA [170], and CODAH [34] multi-choice QA datasets (§5.8.5). In addition to the baselines in §5.4.2, we consider two new baselines, Random and Heuristic, which help show that coarse/fine saliency explanations provide strong learning signal for KG-augmented models to focus on useful KG features. We follow the same evaluation protocol as in §5.4.2, except we now also report mean and standard deviation performance over multiple seeds. See §5.8.4 for a more detailed description of the evaluation protocol.
Random: Random is a variant of SalKG where each unit's explanation is random. Random-Coarse is like SalKG-Coarse, but with each y_c(x, G) uniformly sampled from {0, 1}. Random-Fine is like SalKG-Fine, but randomly picking k% of units in G to set y_f(u; x, G) = 1. Random-Hybrid is like SalKG-Hybrid, but with each y_h(x, G) uniformly sampled from {0, 1} as well as using Random-Fine instead of SalKG-Fine.
Heuristic: Each G has three node types: question nodes (i.e., nodes in q), answer nodes (i.e., nodes in a_i), and intermediate nodes (i.e., other nodes) [140]. Let QA nodes be nodes in q or a_i. Heuristic is a variant of SalKG where each unit's explanation is based on the presence of QA nodes in G. Let N̄ be the mean number of QA nodes per KG (in the train set), and let N(G) be the number of QA nodes in G. Heuristic-Coarse is like SalKG-Coarse, except y_c(x, G) = 1 if and only if N(G) > N̄. Heuristic-Fine is like SalKG-Fine, but how y_f(u; x, G) is set depends on whether the fine units are nodes or paths. For node units, y_f(u; x, G) = 1 if and only if u is a QA node. For path units, y_f(u; x, G) = 1 if and only if u consists only of QA nodes. Heuristic-Hybrid is like SalKG-Hybrid, but with y_h(x, G) = 1 if and only if N(G) > N̄, while Heuristic-Fine is used instead of SalKG-Fine.
5.6.2 Main Results
Table 5.5 shows performance on CSQA, while Table 5.6 shows performance on OBQA and CODAH. Best performance is highlighted in green, second-best performance is highlighted in blue, and best non-SalKG performance is highlighted in red (if it is not already green or blue). For SalKG (unlike Oracle), we find that Occl usually outperforms Grad, so we only report Occl performance in Tables 5.5-5.6. For a comparison of Grad and Occl on SalKG, see Table 5.8 and §5.8.9. Being an ensemble, No-KG + KG tends
Being an ensemble, No-KG + KG tends 122 CSQATestAccuracy(%) MHGRN PathGen RN Model BERT RoBERTa BERT RoBERTa BERT RoBERTa No-KG 53.13 (± 2.34) 69.65 (± 1.06) 53.13 (± 2.34) 69.65 (± 1.06) 53.13 (± 2.34) 69.65 (± 1.06) KG 57.48 (± 0.89) 73.14 (± 0.78) 56.54 (± 0.73) 72.58 (± 0.57) 56.46 (± 1.22) 71.37 (± 1.20) No-KG + KG 56.14 (± 2.28) 72.15 (± 0.67) 57.29 (± 1.30) 72.44 (± 0.72) 55.98 (± 1.98) 71.15 (± 0.81) Random-Coarse 55.04 (± 1.44) 71.06 (± 1.09) 55.09 (± 1.08) 71.15 (± 1.06) 55.15 (± 1.23) 69.06 (± 2.96) Random-Fine 54.69 (± 2.54) 73.09 (± 1.06) 54.66 (± 0.97) 71.26 (± 3.19) 49.88 (± 1.75) 69.08 (± 1.95) Random-Hybrid 52.43 (± 2.60) 71.93 (± 0.77) 55.24 (± 0.58) 71.35 (± 0.34) 54.36 (± 0.35) 70.12 (± 0.35) Heuristic-Coarse 55.55 (± 2.29) 72.15 (± 0.84) 56.92 (± 0.18) 72.57 (± 0.49) 56.42 (± 1.11) 71.18 (± 0.77) Heuristic-Fine 52.54 (± 1.67) 71.50 (± 1.01) 54.00 (± 1.89) 71.11 (± 0.93) 52.04 (± 2.13) 65.08 (± 3.67) Heuristic-Hybrid 56.35 (± 0.81) 72.58 (± 0.32) 56.83 (± 0.48) 71.33 (± 0.87) 54.38 (± 3.30) 65.07 (± 2.02) SalKG-Coarse 57.98 (± 0.90) 73.64 (± 1.05) 57.75 (± 0.77) 73.07 (± 0.25) 57.50 (± 1.25) 73.11 (± 1.13) SalKG-Fine 54.36 (± 2.34) 70.00 (± 0.81) 54.39 (± 2.03) 72.12 (± 0.91) 54.30 (± 1.41) 71.64 (± 1.51) SalKG-Hybrid 58.70 (± 0.65) 73.37 (± 0.12) 59.87 (± 0.42) 72.67 (± 0.65) 58.78 (± 0.14) 74.13 (± 0.71) Table 5.5: SalKGPerformanceonCSQA to beat both No-KG and KG if both have similar performance. Otherwise, No-KG + KG’s performance is in between No-KG’s and KG’s. Across all datasets, we find that SalKG-Hybrid andSalKG-Coarse are consistently the two best mod- els. On CSQA,SalKG-Hybrid has the highest performance on BERT+MHGRN, BERT+PathGen, BERT+RN, and RoBERTa+RN, while SalKG-Coarse is the best on RoBERTa+MHGRN and RoBERTa+PathGen. In particular, on RoBERTa+RN, BERT+RN, and BERT+PathGen, SalKG-Hybrid beats max(No-KG, KG, No- KG + KG) by large margins of 2.76%, 2.58%, and 2.32%, respectively. Meanwhile, OBQA and CODAH, SalKG is not as dominant but still yields improvements overall. On OBQA, SalKG-Coarse is the best on RoBERTa+RN (beating max(No-KG, KG, No-KG + KG) by 1.89%) and RoBERTa+PathGen, while SalKG- Hybrid performs best on RoBERTa+MHGRN. On CODAH, SalKG-Coarse gets the best performance on both RoBERTa+MHGRN (beatingmax(No-KG, KG, No-KG + KG) by 1.71%) and RoBERTa+PathGen. SalKG- Coarse outperformingSalKG-Hybrid on OBQA and CODAH indicates that local KG supervision from fine explanations may not be as useful for these two datasets. 
On the other hand, SalKG-Fine is consistently weaker than SalKG-Hybrid and SalKG-Coarse, but still shows slight improvement for RoBERTa+RN on 123 OBQATestAccuracy(%) CODAHTestAccuracy(%) Model(RoBERTa) MHGRN PathGen RN MHGRN PathGen No-KG 68.73 (± 0.31) 68.73 (± 0.31) 68.73 (± 0.31) 83.96 (± 0.79) 83.96 (± 0.79) KG 68.87 (± 2.16) 68.40 (± 1.59) 66.80 (± 4.73) 84.02 (± 1.27) 84.02 (± 1.62) No-KG + KG 68.53 (± 0.95) 69.67 (± 1.45) 69.40 (± 0.35) 84.08 (± 1.46) 84.69 (± 1.48) Random-Coarse 68.11 (± 1.12) 67.18 (± 4.13) 65.02 (± 2.57) 83.48 (± 0.91) 84.68 (± 1.65) Random-Fine 57.60 (± 5.33) 55.13 (± 7.00) 48.53 (± 4.82) 74.77 (± 6.90) 80.48 (± 1.23) Random-Hybrid 68.33 (± 0.40) 69.53 (± 0.31) 69.27 (± 0.12) 83.86 (± 0.69) 83.75 (± 0.60) Heuristic-Coarse 69.24 (± 2.47) 65.58 (± 6.08) 64.29 (± 3.06) 82.64 (± 0.10) 82.52 (± 0.18) Heuristic-Fine 57.27 (± 3.76) 51.80 (± 2.95) 50.53 (± 3.51) 82.25 (± 1.43) 82.55 (± 2.03) Heuristic-Hybrid 68.47 (± 0.23) 68.40 (± 0.00) 68.60 (± 0.20) 82.16 (± 2.11) 82.73 (± 1.51) SalKG-Coarse 69.93 (± 0.56) 70.02 (± 0.55) 71.29 (± 0.57) 85.79 (± 1.83) 85.43 (± 1.88) SalKG-Fine 64.82 (± 0.97) 51.51 (± 0.87) 62.29 (± 0.85) 84.08 (± 1.14) 83.36 (± 0.81) SalKG-Hybrid 70.20 (± 0.69) 69.80 (± 0.49) 70.47 (± 0.91) 85.17 (± 0.54) 84.42 (± 0.64) Table 5.6: SalKGPerformanceonOBQAandCODAH CSQA. These results show that learning from KG saliency explanations is generally effective for improv- ing KG-augmented models’ performance, especially in CSQA when both coarse and fine explanations are used to provide complementary learning signals for SalKG-Hybrid. Furthermore, across all datasets, we find that SalKG outperformsRandom andHeuristic on every setting. This is evidence that explanations created from saliency methods can provide better learning signal than those created randomly or from simple heuristics. ComparisontoPublishedCSQABaselines To further demonstrate thatSalKG models perform com- petitively, we also compareSalKG (using MHGRN and PathGen) to the many KG-augmented model base- line results published in [57, 246, 261], for the CSQA in-house split. The baselines we consider are RN [208], RN + Link Prediction [57], RGCN [212], GAT [239], GN [9], GconAttn [248], MHGRN [57], and PathGen [246]. For the non-SalKG versions of MHGRN, PathGen, and RN, we quote the published results. Since these published results average over four seeds (instead of three), we reportSalKG results over four seeds in Table 5.7. We find that most of the listed SalKG variants can outperform all of the baselines. For MHGRN, SalKG-Coarse (MHGRN) performs the best overall, SalKG-Hybrid (MHGRN) beats vanilla 124 Model(RoBERTa) CSQATestAccuracy(%) RN [208] 70.08 (± 0.21) RN + Link Prediction [246] 69.33 (± 0.98) RGCN [212] 68.41 (± 0.66) GAT [239] 71.20 (± 0.72) GN [9] 71.12 (± 0.45) GconAttn [248] 69.88 (± 0.47) MHGRN [57] 71.11 (± 0.81) PathGen [246] 72.68 (± 0.42) SalKG-Coarse (MHGRN) 74.01 (± 0.14) SalKG-Fine (MHGRN) 72.68 (± 1.46) SalKG-Hybrid (MHGRN) 73.87 (± 0.48) SalKG-Coarse (PathGen) 72.76 (± 0.12) SalKG-Fine (PathGen) 71.21 (± 1.31) SalKG-Hybrid (PathGen) 73.03 (± 0.84) Table 5.7: Comparisonof SalKGtoPublishedCSQABaselines. SalKG models that outperform all baselines are shown inbold. MHGRN, and SalKG-Fine (MHGRN) is on par with vanilla MHGRN. For PathGen, SalKG-Hybrid (Path- Gen) andSalKG-Coarse (PathGen) both slightly outperform vanilla PathGen, whileSalKG-Fine (PathGen) performs worse. 
CSQALeaderboardSubmission In addition to our experiments on the CSQA in-house split, we evalu- atedSalKG on the CSQA official split by submitting SalKG to the CSQA leaderboard. Since the best models on the CSQA leaderboard use the ALBERT [125] text encoder, and PathGen was the highest graph encoder on the leaderboard out of the three we experimented with, we trainedSalKG-Hybrid (ALBERT+PathGen), which achieved a test accuracy of 75.9%. For reference, a previously submitted ALBERT+PathGen achieved a test accuracy of 75.6% on the CSQA leaderboard. This result suggests that the proposedSalKG training procedure can yield some improvements over baselines that do not use explanation-based regularization. Why does SalKG-Fine perform poorly? In general, SalKG-Fine does not perform as well as SalKG- Coarse andSalKG-Hybrid. Often,SalKG-Fine is noticeably worse than KG and No-KG. Recall that the KG model andSalKG-Fine model both assume that the KG should always be used to solve the given instance. 125 Still, the success of SalKG-Coarse shows that the KG sometimes may not be useful. But why doesSalKG- Fine almost always perform worse than the KG model? We believe it is because SalKG-Fine is more committed to the flawed assumption of universal KG usefulness. Whereas the KG model is trained to solve the task always using the KG as context, SalKG-Fine is trained to both solve the task always using the KG as context (i.e., global KG supervision) and attend to specific parts of the KG ( i.e., local KG supervision). SinceSalKG-Fine is trained with both global and local KG supervision, it is much more likely to overfit, as the KG is not actually useful for all instances. That is, for training instances where the KG should not be used,SalKG-Fine is pushed to not only use the KG, but also to attend to specific parts of the KG. This leads to a SalKG-Fine model that does not generalize well to test instances where the KG is not useful. To address this issue, we proposed theSalKG-Hybrid model, which is designed to take the best of both SalKG-Coarse and SalKG-Fine. For a given instance, SalKG-Hybrid uses its SalKG-Coarse component to predict whether the KG is useful, then uses its SalKG-Fine component to attend to the useful parts of the KG only if the KG is predicted to be useful. Indeed, we find that SalKG-Hybrid performs much better thanSalKG-Fine and is the best model overall on CSQA. These results support our hypothesis about why SalKG-Fine performs relatively poorly. 5.6.3 AblationStudies In Table 5.8, we validate our SalKG design choices with ablation studies. We report dev accuracy for BERT+MHGRN and BERT+PathGen on CSQA. Are ensemble-based coarse explanations effective? By default, SalKG-Coarse uses our proposed ensemble-based coarse explanations (§5.3.1). Alternatively, we consider using Grad and Occl to create coarse explanations. For Grad, we compute ϕ the same way as in §5.3.2, except using graph embedding g instead of node/path embeddings. 
Since a zero vector would have zero gradient, this is equivalent to 126 CSQADevAccuracy(%) Model(BERT) MHGRN PathGen SalKG-Coarse 59.49 (± 0.05) 60.72 (± 0.58) - w/ Grad 56.84 (± 2.27) 56.18 (± 2.31) - w/ Occl 57.60 (± 0.74) 56.32 (± 1.66) SalKG-Fine (Occl) 57.28 (± 0.95) 59.13 (± 2.35) - w/ Grad 56.05 (± 1.03) 58.80 (± 1.08) SalKG-Hybrid (Occl) 59.92 (± 0.31) 60.88 (± 0.05) - w/ Grad 60.17 (± 0.21) 59.71 (± 0.08) SalKG-Fine (Occl) 57.28 (± 0.95) 59.13 (± 2.35) - w/ Random Prune 50.61 (± 0.68) 54.10 (± 2.13) - w/ Heuristic Prune 50.72 (± 0.46) 50.53 (± 0.74) SalKG-Fine (Occl) 57.28 (± 0.95) 59.13 (± 2.35) - w/ BCE Sal. Loss 50.83 (± 1.75) 55.15 (± 2.58) Table 5.8: AblationStudies. Best model inbold. comparingg to a zero vector baseline. For Occl, we computeϕ as the decrease inp KG ifg is replaced with a zero vector. For both Grad and Occl, we sets c = ϕ . In Table 5.8, we see that our default SalKG-Coarse significantly outperforms SalKG-Coarse with both Grad and Occl. In §5.8.2, we further discuss why Grad and Occl are ill-suited for creating coarse explanations. ForSalKG, is Occl better than Grad? In Tables 5.5-5.6, we report SalKG-Fine and SalKG-Hybrid performance with Occl fine explanations. In Table 5.8, we compare Occl and Grad on SalKG-Fine and SalKG-Hybrid. Overall, Occl slightly outperforms Grad, although Grad beats Occl on MHGRN forSalKG- Hybrid. Their relative performance could also depend on the choice of top-k%, which we plan to explore later. In §5.8.9, we further compare Occl and Grad on other settings. How does SalKG-Fine’s soft KG pruning compare to hard KG pruning? SalKG-Fine does soft pruning of unhelpful fine units via soft attention. We compare SalKG-Fine to two baselines where the KG is filtered via hard pruning, which cannot be easily incorporated into end-to-end training. For Random Prune andHeuristic Prune, we respectively createRandom andHeuristic explanations, then hard prune all negative units from the KG. The KG-augmented model then uses the pruned KG as its KG input. In Table 127 5.8, we see that SalKG-Fine significantly outperforms the two baselines, showing the benefits of jointly training the model on saliency and QA prediction. Isiteffectivetotrain SalKG-FinewithKLdivergence? We trainSalKG-Fine’s explanation predic- tor (i.e., attention mechanism) using KL divergence as the saliency loss. Thus, within a KG, the distribution over attention weights constitutes a single prediction. Alternatively, we could treat each attention weight as a separate prediction and train the attention mechanism using binary cross entropy (BCE) loss. In Table 5.8, we find that using KL divergence yields much higher performance than using BCE loss. This suggests that the attention weights should not be trained separately, as each attention weight is highly dependent on other attention weights in the same KG. 5.6.4 CaseStudies We visualize coarse/fine explanations created from BERT+PathGen on CSQA, with 1-hop or 2-hop paths as fine units. For coarse explanations, we show examples of positive ( i.e., useful) and negative KGs. Since KGs are too large to show here, we uniformly sample three paths per KG. For the positive KG example, the question is James loved to play violin. He did it in his spare time because he found it what?, the answer choice is relaxing, and the target answer is relaxing. Its paths are: (1) play –[is related to]–> x <–[is used for]– relaxing , (2) violin –[is used for]–> x –[is used for]–> relaxing , and (3) time <–[has subevent]– x –[has subevent]–> relax . 
For the negative KG example, the question is Where do soldiers not deployed eat their food?, the answer choice is neighbor’s house, and the target answer is military base. Its paths are: (1) soldier <–[is related to]– x <–[is related to]– house , (2) eat –[is related to]–> x –[is at location of]–> house , and (3) food <–[is related to]– x –[is at location of]–> house . For fine explanations, we show examples of positive and negative paths from the same KG. Here, the question is Where can you find a bar before traveling a long distance?, the answer choice is airport, and the target answer is airport. The positive path is: bar –[is at location]–>airport . The negative path is: travel <–[is used for]–x –[is at location]–airport 128 . We can roughly see that the positive KGs/paths are useful for predicting the correct answer, and vice versa. However, as shown in [197], the model’s judgment of KG/path usefulness may not always align with human judgment. See §5.8.16 for more illustrative examples of coarse/fine explanations. 5.7 RelatedWork CreatingModelExplanations Many methods aim to explain PLMs’ predictions by highlighting impor- tant tokens in the model’s text input. Such methods are usually gradient-based [229, 134, 47], attention- based [174, 236, 69, 127], or occlusion-based [49, 189, 106, 135]. Similarly, for graph encoders, a number of works use post-hoc optimization to identify important nodes [89, 267] or subgraphs [267] in the graph input. Meanwhile, KG-augmented models’ attention weights can be used to explain which parts of the KG are important [140, 57, 151, 246, 261]. These KG explanations can be interpreted as identifying knowledge in the KG that is complementary to the knowledge encoded in the PLM. LearningFromModelExplanations Besides manual inspection, explanations can be used in var- ious ways, like extra supervision or regularization [192, 82, 178, 4], pruned inputs [98, 7, 130], additional inputs [81, 199], and intermediate variables [251, 286, 196]. The most similar work to ours is [192], which proposed training a student model to mimic a teacher model’s predictions by regularizing the student model’s attention via text explanations created from the teacher model. However, [192] aims to evaluate explanations, while our goal is to improve performance via explanations. To the best of our knowledge, SalKG is the first to supervise KG-augmented models with KG explanations. See §5.8.20 for a more comprehensive overview of the related literature. 129 5.8 Appendix 5.8.1 ConstructionoftheContextualizedKG In §5.2, we defined the full KG as ˜ G = ( ˜ V, ˜ R, ˜ E), where ˜ V, ˜ R, and ˜ E are all of the KG’s nodes (concepts), relations, and edges (facts), respectively. For each instance, we assume access to ˜ G but do not use the entire KG in practice. Given a question q and an answer choice a i for some instance, we construct the contextualized KG, ˜ G i = (V i ,R i ,E i ) by heuristically extracting edges from ˜ G, following the approach taken by most prior KG-augmented model works [57, 246, 140]. ˜ G i = (V i ,R i ,E i ) is built differently for node-based models and path-based models, and we describe both types of contextualized KG construction procedures below. Note that these procedures are not de- signed by us, but simply follow what was proposed and shown to work well in the KG-augmented models’ original papers [57, 246]. Thus, we do not experiment with different contextualized KG construction pro- cedures, since it is out of the scope of our work. 
Let us define the KG nodes mentioned in q and a_i as QA nodes. For example, for the question "What would you put in a tea kettle?" and answer choice "water", the QA nodes would be put, tea kettle, and water. We ground raw mentions of QA nodes to the KG via spaCy-based lemmatization and stop-word filtering [85]. For node-based models (MHGRN [57]), we select V_i ⊆ Ṽ as the QA nodes and all nodes in the QA nodes' 1-hop KG neighborhood. Next, we choose R_i ⊆ R̃ as all of the relations between concepts in V_i. Finally, we take E_i ⊆ Ẽ as all of the edges involving V_i and R_i. For path-based models (PathGen [246], RN [57, 9]), we select G̃_i as all 2-hop paths between all question-answer node pairs. Thus, V_i ⊆ Ṽ consists of the QA nodes as well as all intermediate nodes in the 2-hop paths. Meanwhile, R_i ⊆ R̃ and E_i ⊆ Ẽ consist of all relations and edges within the 2-hop paths. When reasoning over the 2-hop paths, the model does not actually use the intermediate nodes, perhaps in order to keep the path more general [57, 246].
5.8.2 Alternative Formulation of Coarse Saliency Explanations
SalKG-Coarse uses coarse explanations, which state whether G or None (i.e., no G) should be used for the given task instance. By default, SalKG-Coarse uses our proposed ensemble-based coarse explanations (§5.3.1). In this case, the coarse explanations decide between G and None at the prediction level. That is, the coarse explanations correspond to saliency weights which perform attention over F_KG's and F_No-KG's predictions.
Graph Embedding Based Explanations: In §5.6.3, we also considered applying coarse explanations at the graph embedding level. In this case, using G corresponds to using graph embedding g, while using None corresponds to using some baseline embedding b that does not contain any information from G. b could be a zero vector, random vector, etc. Our experiments in §5.6.3 (with b as a zero vector and Grad/Occl as saliency methods) show that this approach does not yield good empirical results. We believe the issue is that b does not contain any None-specific information. Recall that the ensemble-based SalKG's prediction is a weighted sum of F_KG's and F_No-KG's predictions, which means we interpolate between F_KG's and F_No-KG's predictions. Here, F_No-KG's prediction actually contains meaningful information about F_No-KG. On the other hand, it does not make sense to interpolate between g and b, since b does not have any meaningful information. We also considered learning b when training the KG model, but this would require a complicated multitask learning setup where the KG and No-KG models are jointly trained using g and b, respectively.
5.8.3 Implementation Details for Grad-Based Fine Saliency Explanations
In §5.3.2, we discussed the gradient × input (Grad) [47] method for computing raw fine saliency scores φ. For multi-choice QA, assume we are given text statement x_i = q ⊕ a_i (formed from question q and answer choice a_i), KG G_i, unit u_ij, and u_ij's embedding u_ij ∈ R^d in G_i. Also, let u_ij^(ℓ) be the ℓ-th element of u_ij. Then, φ is computed as follows:
$$\phi(u_{ij}; x_i, \mathcal{G}_i) = \begin{cases} \sum_{\ell=1}^{d} u_{ij}^{(\ell)} \, \dfrac{\partial p_{\text{KG}}(x_i, \mathcal{G}_i)}{\partial u_{ij}^{(\ell)}}, & a_i = a^* \\[2ex] -\sum_{\ell=1}^{d} u_{ij}^{(\ell)} \, \dfrac{\partial p_{\text{KG}}(x_i, \mathcal{G}_i)}{\partial u_{ij}^{(\ell)}}, & a_i \neq a^* \end{cases} \tag{5.1}$$
Depending on the type of graph encoder used, a unit may or may not be given to the model as a single embedding. While node-based graph encoders take node embeddings as input, path-based graph encoders do not take path embeddings as input.
Instead path-based graph encoders take node and relation embeddings as input, then form path embeddings from these node and relation embeddings. As a result, for Grad, the computation of ϕ is slightly different between node-based and path-based graph encoders. For node-based encoders, unit embedding u ij is just a node embedding. Thus, a node’s ϕ score is computed directly using Eq. 5.1. For path-based encoders, given a path, we first use Eq. 5.1 to compute a separate ϕ score for each node embedding and relation embedding in the path. Then, we compute the path’sϕ score as the sum of theϕ scores of its constituent nodes and relations. 5.8.4 EvaluationProtocol We present a more detailed description of the evaluation protocol used to obtain the results in §5.6. First, define non-explanation models (No-KG, KG, and No-KG + KG) as models that are not regularized with any kind of explanation, and define explanation models ( Random, Heuristic, SalKG) as models that are regularized with some kind of explanation. Second, each non-explanation model’s performance is reported 132 as the average over three seeds, which we denote as the non-explanation seeds. Also, recall that each explanation model is built from No-KG and/or KG models. Third, for each of the three non-explanation seeds, we train the explanation model on three more seeds, which we call the explanation seeds. After that, we compute the explanation model performance by averaging over [three non-explanation seeds]× [three explanation seeds] = [nine total seeds]. We summarize the evaluation protocol below: • Non-explanation seeds: 1, 2, 3 • Explanation seeds: A, B, C • Non-explanation performance: average(1, 2, 3) • Explanation performance: average(1A, 1B, 1C, 2A, 2B, 2C, 3A, 3B, 3C) 5.8.5 DatasetDetails Below are more detailed descriptions of the three datasets used for the experiments in §5.6. All datasets and resources used in this paper are publicly available and free for any researcher to use. CommonsenseQA(CSQA) [231] is a multi-choice QA dataset whose questions require commonsense reasoning to solve. Questions and answer choices in CSQA are derived from ConceptNet [224]. The official (OF) data split has 9741/1221/1140 questions for OFtrain/OFdev/OFtest. Since the labels for OFtest are not publicly available, we use the in-house (IH) data split introduced in [140] and used in many subsequent works [57, 246, 261]. The in-house data split has 8500/1221/1241 questions for IHtrain/IHdev/IHtest, where the IHtrain and IHtest are obtained by partitioning OFtrain. OpenbookQA(OBQA) [170] is a multi-choice QA dataset which aims to simulate open-book science exams. OBQA has 4957/500/500 elementary-school-level science questions for train/dev/test, but also pro- vides a supplementary “open book” resource containing 1326 core science facts. To solve questions from 133 OBQA, the model needs to reason over both information from the open book and commonsense knowledge from the KG (i.e., ConceptNet). CODAH [34] is a multi-choice QA dataset which augments the SWAG [275] sentence completion dataset with more difficult, adversarially-created questions. Similar to SWAG, CODAH’s questions are designed to require commonsense reasoning to solve. CODAH contains 2801 questions, and its official split specifies five folds, which balance the distribution of question categories per fold. Thus, by default, performance is evaluated by averaging over the five folds. 
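As a concrete illustration of the two sweeps described above, the sketch below simply evaluates each candidate threshold on the dev set and keeps the best one. It is a minimal sketch under stated assumptions: oracle_coarse_dev_acc and salkg_dev_acc_for_k are hypothetical caller-supplied callables (the T sweep only scores the frozen models, while each k candidate requires training a model with top-k% fine explanations).

```python
def tune_thresholds(oracle_coarse_dev_acc, salkg_dev_acc_for_k,
                    T_grid=(0.01, 0.02, 0.03, 0.04, 0.05),
                    k_grid=(5, 10, 30, 50)):
    """Pick the coarse threshold T and fine top-k% threshold k by dev accuracy.

    oracle_coarse_dev_acc(T) -> float : dev accuracy of Oracle-Coarse under T.
    salkg_dev_acc_for_k(k)   -> float : dev accuracy of the model trained with
                                        top-k% fine explanations.
    Both callables are placeholders for project-specific evaluation code.
    """
    best_T = max(T_grid, key=oracle_coarse_dev_acc)   # cheap: no training needed
    best_k = max(k_grid, key=salkg_dev_acc_for_k)     # one training run per k
    return best_T, best_k
```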
However, due to computational constraints, we only evaluate on the first fold and compare to the baselines presented in §5.4.2 and §5.6, rather than to previously published methods. 5.8.6 ThresholdTuningforCreatingExplanations TuningT ThresholdforCoarseExplanations Recall that coarse explanations are binarized via thresh- oldT (§5.3.1). To setT , we manually tuneT to maximizeOracle-Coarse’s dev accuracy. This can be done efficiently, since Oracle-Coarse does not require any training. We use a sweep ofT =[0.01,0.02,0.03,0.04,0.05] and find that T =0.01 yields best performance overall. Tuningtop-k%ThresholdforFineExplanations Recall that fine explanations are binarized via thresh- oldk, used to set the top-k% of units as positive (§5.3.2). To setk, we manually tunek to maximizeSalKG- Coarse’s dev accuracy. Table 5.9 shows the performance of RoBERTa+MHGRN and RoBERTa+PathGen on CSQA and OBQA, across different values of k. Due to computational constraints, we report the average performance across [best non-explanation seed]× [three explanation seeds] = [three total seeds], as op- posed to the default [three non-explanation seed]× [three explanation seeds] = [nine total seeds] (§5.8.4). We use a sweep ofk = [5,10,30,50] and find that k = 5 yields best performance overall, although there is not a clear trend that smallerk is better. In this paper, we usedk =10 for all experiments, so it may be promising to further explore tuningk in the future. 134 CSQATestAccuracy(%) OBQATestAccuracy(%) Top-k% MHGRN PathGen MHGRN PathGen 2 72.66 (± 1.52) 69.86 (± 1.11) 66.47 (± 1.27) 61.33 (± 2.69) 5 72.58 (± 0.74) 71.64 (± 3.17) 69.13 (± 0.81) 64.80 (± 1.40) 10 73.65 (± 0.21) 71.39 (± 1.54) 65.07 (± 1.70) 51.60 (± 1.13) 30 71.98 (± 0.47) 69.76 (± 0.44) 63.47 (± 1.14) 61.87 (± 4.61) 50 72.93 (± 0.84) 71.04 (± 0.05) 63.27 (± 3.00) 63.60 (± 1.71) 70 72.04 (± 1.05) 70.13 (± 0.66) 65.80 (± 1.91) 64.40 (± 0.40) Table 5.9: SalKG-Fine Performance for Different top-k% Thresholds. We report performance for RoBERTa+MHGRN and RoBERTa+PathGen on CSQA and OBQA. Best model is shown inbold. 5.8.7 AdditionalDetailsaboutOracleModels We provide more details aboutOracle-Coarse andOracle-Fine. Given the coarse saliency explanations, Oracle-Coarse simply involves choosing the “correct” prediction — betweenF KG ’s andF No-KG ’s predic- tions — for each answer choice. Given thatF KG ’s andF No-KG ’s predictions are simply loaded from disk, this process runs very quickly, since it does not require additional training. On the other hand, Oracle- Fine involves training the KG-augmented model while applying the fine saliency explanations as a binary mask to the graph encoder’s attention weights. 5.8.8 AdditionalSalKGResultsonCODAH In this section, we present additionalSalKG results on CODAH. These additional results consist of RoBERTa+RN, BERT+MHGRN, BERT+PathGen, and BERT+RN, all using threshold top-10%. Also, across all settings, we report both Grad and Occl results for SalKG-Fine and SalKG-Hybrid. Due to computational constraints, we report the average performance across [best non-explanation seed] × [three explanation seeds] = [three total seeds], as opposed to the default [three non-explanation seed]× [three explanation seeds] = [nine total seeds] (§5.8.4). These results are shown in Table 5.10, along with the RoBERTa+MHGRN and RoBERTa+PathGen results from Table 5.6. First, we see thatSalKG-Hybrid (either Grad or Occl) performs the best on all settings except RoBERTa+PathGen. 
For RoBERTa+PathGen, Random-Coarse and Random-Hybrid perform the best, although some SalKG 135 CODAHTestAccuracy(%) MHGRN PathGen RN Model BERT RoBERTa BERT RoBERTa BERT RoBERTa No-KG 60.96 (± 1.27) 83.96 (± 0.79) 60.96 (± 1.27) 83.96 (± 0.79) 60.96 (± 1.27) 83.96 (± 0.79) KG 58.68 (± 1.63) 84.02 (± 1.27) 58.80 (± 2.01) 84.02 (± 1.62) 55.92 (± 1.04) 82.64 (± 0.85) No-KG + KG 60.60 (± 1.30) 84.08 (± 1.46) 60.42 (± 1.14) 84.69 (± 1.48) 58.62 (± 1.53) 84.08 (± 0.55) Random-Coarse 60.78 (± 0.38) 84.62 (± 0.55) 61.74 (± 0.28) 86.07 (± 0.89) 57.84 (± 0.83) 84.14 (± 0.65) Random-Fine 58.50 (± 0.91) 84.02 (± 0.89) 54.47 (± 1.55) 75.74 (± 4.71) 54.53 (± 1.40) 76.10 (± 4.16) Random-Hybrid 62.16 (± 0.00) 84.80 (± 0.10) 61.74 (± 0.55) 84.68 (± 0.18) 62.40 (± 0.10) 84.14 (± 0.65) Heuristic-Coarse 58.38 (± 0.00) 85.11 (± 0.10) 61.08 (± 0.00) 85.59 (± 0.00) 59.70 (± 0.10) 83.60 (± 0.00) Heuristic-Fine 60.18 (± 1.36) 83.72 (± 0.92) 55.98 (± 0.28) 82.64 (± 2.61) 54.71 (± 3.07) 81.80 (± 2.77) Heuristic-Hybrid 62.16 (± 0.00) 84.80 (± 0.10) 61.98 (± 0.31) 85.23 (± 0.00) 62.28 (± 0.10) 85.35 (± 0.10) SalKG-Coarse 61.02 (± 0.10) 85.41 (± 0.18) 61.20 (± 0.28) 85.95 (± 0.18) 61.74 (± 0.21) 84.98 (± 0.42) SalKG-Fine (Occl Top-10%) 60.00 (± 1.26) 84.08 (± 1.14) 57.72 (± 1.09) 83.36 (± 0.81) 59.16 (± 2.15) 83.78 (± 1.41) SalKG-Fine (Grad Top-10%) 59.16 (± 0.38) 84.20 (± 1.17) 57.36 (± 0.75) 83.00 (± 1.51) 55.86 (± 0.79) 83.66 (± 0.89) SalKG-Hybrid (Occl Top-10%) 62.28 (± 0.10) 85.71 (± 0.10) 62.04 (± 0.45) 84.44 (± 0.63) 62.58 (± 0.10) 85.11 (± 0.28) SalKG-Hybrid (Grad Top-10%) 60.48 (± 0.21) 88.17 (± 0.10) 61.02 (± 0.10) 85.17 (± 0.28) 61.38 (± 0.68) 85.11 (± 0.55) Table 5.10: SalKG Performance on CODAH for Additional Settings. Building upon the CODAH re- sults in Table 5.6 (RoBERTa+MHGRN and RoBERTa+PathGen), we additionally report results for RoBERTa+RN, BERT+MHGRN, BERT+PathGen, and BERT+RN, all using threshold top-10%. We also report both Grad and Occl results forSalKG-Fine andSalKG-Hybrid. Best model is shown inbold. models perform almost as well. Random’s strong performance is likely due to us reporting performance for the best non-explanation seed, rather than averaging over three non-explanation seeds. Second, forSalKG- Fine, Occl beats Grad on all settings except RoBERTa+PathGen. Third, for SalKG-Hybrid, Occl beats Grad on BERT+MHGRN, BERT+PathGen, and BERT+RN, while Grad beats Occl on RoBERTa+MHGRN and RoBERTa+PathGen. 5.8.9 AdditionalSalKGResultsforGradvs. Occl In Tables 5.11-5.12, we compare Grad vs. Occl on CSQA and OBQA, respectively. Due to computational constraints, we report the average test accuracy across [best non-explanation seed]× [three explanation seeds] = [three total seeds], as opposed to the default [three non-explanation seed] × [three explana- tion seeds] = [nine total seeds] (§5.8.4). For SalKG-Fine and SalKG-Hybrid on CSQA, we find that Occl beats Grad on all settings, exceptSalKG-Fine on RoBERTa+RN. However, forSalKG-Fine on OBQA, Grad beats Occl on RoBERTa+PathGen, BERT+RN, and RoBERTa+RN, while Occl beats Grad on BERT+MHGRN, 136 RoBERTa+MHGRN, and BERT+PathGen. Meanwhile, forSalKG-Hybrid on OBQA, Occl beats Grad on all settings except BERT+PathGen. Thus, we see that Occl generally outperforms Grad, although Grad can beat Occl on certain settings. 
CSQATestAccuracy(%) MHGRN PathGen RN Model BERT RoBERTa BERT RoBERTa BERT RoBERTa SalKG-Fine (Grad) 55.44 (± 1.22) 72.95 (± 1.44) 57.10 (± 0.81) 70.10 (± 0.28) 56.14 (± 1.97) 72.12 (± 0.14) SalKG-Fine (Occl) 56.78 (± 2.14) 73.65 (± 0.21) 57.64 (± 2.12) 71.39 (± 1.54) 56.86 (± 0.41) 71.58 (± 1.10) SalKG-Hybrid (Grad) 59.07 (± 0.56) 72.79 (± 0.20) 57.53 (± 0.43) 71.39 (± 0.14) 57.29 (± 0.29) 71.98 (± 0.28) SalKG-Hybrid (Occl) 59.12 (± 0.28) 73.41 (± 0.16) 60.35 (± 0.32) 73.11 (± 1.00) 58.80 (± 0.19) 74.64 (± 0.09) Table 5.11: CSQAPerformanceComparisonforSalKGGradvs. OcclModels. Best model between Grad and Occl is shown inbold. OBQATestAccuracy(%) MHGRN PathGen RN Model BERT RoBERTa BERT RoBERTa BERT RoBERTa SalKG-Fine (Grad) 53.40 (± 0.69) 58.80 (± 8.66) 55.33 (± 0.31) 67.87 (± 1.81) 56.53 (± 0.31) 68.87 (± 1.67) SalKG-Fine (Occl) 53.93 (± 1.01) 65.07 (± 1.70) 55.40 (± 0.53) 51.60 (± 1.13) 55.67 (± 0.90) 62.33 (± 0.90) SalKG-Hybrid (Grad) 53.80 (± 0.20) 69.47 (± 0.31) 55.67 (± 0.64) 69.93 (± 0.61) 53.20 (± 0.72) 69.40 (± 0.20) SalKG-Hybrid (Occl) 56.20 (± 0.20) 70.73 (± 0.12) 55.33 (± 0.23) 70.07 (± 0.12) 53.93 (± 0.42) 70.80 (± 0.00) Table 5.12: OBQAPerformanceComparisonforSalKGGradvs. OcclModels. Best model between Grad and Occl is shown inbold. 5.8.10 ComparisontoPublishedOBQABaselines To further demonstrate that SalKG models perform competitively, we also compare SalKG to the many KG-augmented model baseline results published in [57, 246, 261], for OBQA. The baselines we consider are RN, RN + Link Prediction, RGCN, GconAttn, MHGRN, and PathGen. For the non-SalKG versions of MHGRN, PathGen, and RN, we quote the published results. Since these published results average over four seeds (instead of three), we report SalKG results over four seeds in Table 5.13. For OBQA, we find that vanilla PathGen (quoted from published results) performs the best, while SalKG-Hybrid (MHGRN) and SalKG-Hybrid (PathGen) are almost as good. These OBQA results indicate that our reproduction of 137 vanilla PathGen may not have been optimally tuned, thus limiting the performance of theSalKG models built upon PathGen. We plan to investigate this issue in future work. Model(RoBERTa) OBQATestAccuracy(%) RN [208] 65.20 (± 1.18) RN + Link Prediction [246] 66.30 (± 0.48) RGCN [212] 62.45 (± 1.57) GconAttn [248] 64.75 (± 1.48) MHGRN [57] 66.85 (± 1.19) PathGen [246] 71.20 (± 0.96) SalKG-Coarse (MHGRN) 69.85 (± 0.30) SalKG-Fine (MHGRN) 64.65 (± 1.62) SalKG-Hybrid (MHGRN) 70.75 (± 0.10) SalKG-Coarse (PathGen) 69.70 (± 0.93) SalKG-Fine (PathGen) 54.30 (± 5.84) SalKG-Hybrid (PathGen) 70.00 (± 0.16) Table 5.13: Comparisonof SalKGtoPublishedOBQAResults. Best model is shown inbold. 5.8.11 Low-ResourceLearning Figure 5.4: Low-ResourceLearning. CSQA test accuracy for No-KG, KG, andSalKG-Coarse, when using varying amounts of training data. In Fig. 5.4, we show CSQA performance for different models in low-resource settings. Specifically, we experiment with low-resource learning by training the model on 10%, 30%, 50%, or 70% of the training data. For reference, we also include CSQA performance when using 100% of the training data. Here, we 138 consider No-KG (RoBERTa), KG (MHGRN), and SalKG-Coarse (RoBERTa+MHGRN). Across all settings, we find that SalKG-Coarse outperforms both No-KG and KG, suggesting that regularizing the model with coarse explanations can provide a helpful inductive bias for generalizing from limited training data. 
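The low-resource setup above amounts to training on a random fraction of the training set; a minimal sketch is below (the function name and seed handling are illustrative, not taken from the SalKG code).

```python
import random

def subsample_train(train_examples, fraction, seed=0):
    """Keep a random `fraction` (e.g., 0.1, 0.3, 0.5, 0.7, 1.0) of the
    training instances for the low-resource experiments."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(train_examples) * fraction))
    return rng.sample(train_examples, n_keep)
```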
5.8.12 Analyzing the Impact of Coarse Explanations

SalKG-Coarse is based on the insight that KG information may help the model on some instances but hurt it on others. Thus, even if KG outperforms No-KG on average, No-KG may still correctly predict some instances that KG got wrong. SalKG-Coarse takes advantage of such complementary predictions between No-KG and KG in order to achieve performance higher than max(No-KG, KG). As shown by RoBERTa+PathGen and RoBERTa+RN on OBQA (Table 5.6), SalKG-Coarse can still beat max(No-KG, KG, No-KG + KG) even when No-KG outperforms KG.

In Table 5.14, we analyze the performance of BERT (i.e., No-KG), PathGen (i.e., KG), SalKG-Coarse (BERT+PathGen), and Oracle-Coarse (BERT+PathGen) on various sets of questions in CSQA. Due to computational constraints, each model's performance here is reported for one seed (instead of using the protocol described in §5.8.4), so these results are not directly comparable to those in Table 5.5. Through this performance breakdown, we can isolate the potential improvement contributed by each base model to SalKG-Coarse. We begin by looking at the questions for which SalKG-Coarse has no influence. These are the 46.01% of questions correctly answered by both models and the 33.92% of questions incorrectly answered by both models. Since SalKG-Coarse is trained to choose between the two models' predictions, SalKG-Coarse's output is fixed if both models make the same prediction. This leaves 20.07% of questions that were correctly answered by exactly one of the two models: 9.43% were from No-KG, while the other 10.64% were from KG. This 20.07% constitutes the complementary predictions leveraged by SalKG-Coarse.

Question Set | Question Percentage (%)
No-KG Correct | 55.44
KG Correct | 56.65
Only No-KG Correct | 9.43
Only KG Correct | 10.64
Both Correct | 46.01
Both Incorrect | 33.92
At Least One Correct | 66.08
SalKG-Coarse Correct | 56.65
Oracle-Coarse Correct | 68.57
Table 5.14: Impact of Coarse Explanations. Using BERT+PathGen on CSQA, we present a performance breakdown for various question sets, in order to analyze why SalKG-Coarse is able to beat No-KG and KG.

Based on this question-level analysis, we would estimate the Oracle-Coarse accuracy to be 66.08%, the percentage of questions that at least one model answered correctly. However, as stated in §5.3.1, coarse saliency targets are created at the answer choice level (not the question level), which offers us more flexibility to choose between No-KG and KG. As a result, Oracle-Coarse's accuracy is actually 68.57%. This leaves SalKG-Coarse (56.65%) significant room for improvement, perhaps through better model architecture and training.

5.8.13 Comparing Salient and Non-Salient KG Units

This paper explores learning from explanations of KG units' saliency (i.e., usefulness). Overall, our focus is on how using salient KG units can improve model performance. In this subsection, we also analyze whether salient and non-salient KG units, as determined by our coarse/fine explanation methods, differ in other ways that are not directly related to performance (Table 5.15). For both coarse and fine explanations, we use the BERT+MHGRN model on CSQA, where MHGRN is a node-based graph encoder (§5.4.2). Recall that Q nodes and A nodes are nodes (i.e., concepts) mentioned in the given question and answer choice, respectively (§5.6.1). For coarse explanations, we use the ensemble-based explanations introduced in §5.3.1.
We compare salient and non-salient KGs with respect to the number of nodes in the KG (# nodes), the percentage of Q nodes in the KG (% Q nodes), the percentage of A nodes in the KG (% A nodes), the clustering coefficient (cluster coeff.), and the average node degree (degree). These results are shown in Table 5.15a. We see that these metrics are not very discriminative, as salient and non-salient KGs score similarly on all of them.

(a) Salient vs. Non-Salient KGs
Metric | Salient | Non-Salient
# nodes | 125.88 | 120.57
% Q nodes | 9.09 | 9.17
% A nodes | 2.94 | 3.12
cluster coeff. | 4.26E-1 | 4.25E-1
degree | 9.89 | 9.78

(b) Salient vs. Non-Salient Nodes
Metric | Salient | Non-Salient
% Q nodes | 16.84 | 10.79
% A nodes | 10.00 | 6.06
degree | 15.41 | 13.11

Table 5.15: Salient vs. Non-Salient KG Units. Using BERT+MHGRN on CSQA, we compare salient and non-salient KG units. In (a), we compare salient and non-salient KGs, as determined by coarse explanations. In (b), we compare salient and non-salient nodes, as determined by fine explanations.

For fine explanations, we use the Grad-based explanations described in §5.3.2 and §5.8.3. We compare salient and non-salient nodes with respect to the percentage of Q nodes among salient/non-salient nodes in the KG (% Q nodes), the percentage of A nodes among salient/non-salient nodes in the KG (% A nodes), and node degree (degree). These results are shown in Table 5.15b. Here, we see that % Q nodes and % A nodes are actually quite discriminative metrics between salient and non-salient nodes. On average, the percentage of Q nodes among salient nodes (16.84%) is 56.07% greater than the percentage of Q nodes among non-salient nodes (10.79%). Similarly, on average, the percentage of A nodes among salient nodes (10.00%) is 65.02% greater than the percentage of A nodes among non-salient nodes (6.06%). However, compared to % Q nodes and % A nodes, degree is not as discriminative. This indicates that the difference between salient and non-salient nodes may be more semantic than structural.

5.8.14 Robustness to KG Perturbation

Table 5.16 shows the CSQA performance of KG and SalKG models subjected to different forms of KG perturbation. Relation perturbation (Relation) permutes the relation labels of all edges in the KG, while node perturbation (Node) permutes the node labels of all nodes in the KG (both perturbations are sketched below).
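As a rough illustration, assuming the KG is represented as a list of (head, relation, tail) triples, the two perturbations can be implemented as follows; this is an illustrative sketch rather than our exact code.

```python
import random

def perturb_relations(edges, seed=0):
    """Relation perturbation: randomly permute the relation labels across all edges."""
    rng = random.Random(seed)
    relations = [r for _, r, _ in edges]
    rng.shuffle(relations)
    return [(h, r, t) for (h, _, t), r in zip(edges, relations)]

def perturb_nodes(edges, seed=0):
    """Node perturbation: randomly permute node labels via a bijective remapping."""
    rng = random.Random(seed)
    nodes = sorted({n for h, _, t in edges for n in (h, t)})
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(nodes, shuffled))
    return [(mapping[h], r, mapping[t]) for h, r, t in edges]
```

Both operations preserve the KG's overall size and connectivity statistics while scrambling which facts it asserts, which is what makes them useful probes of semantic robustness.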
These perturbation methods are designed to alter the semantics of the KG.

CSQA Test Accuracy (%)
Model | MHGRN (BERT) | MHGRN (RoBERTa) | PathGen (BERT) | PathGen (RoBERTa) | RN (BERT) | RN (RoBERTa)
KG (Relation) | 52.89 (± 0.73) | 67.41 (± 0.84) | 52.35 (± 0.60) | 70.08 (± 0.38) | 54.15 (± 0.40) | 68.95 (± 1.58)
SalKG-Coarse (Relation) | 55.86 (± 0.48) | 72.53 (± 0.50) | 56.07 (± 0.44) | 71.55 (± 0.85) | 56.93 (± 0.51) | 72.43 (± 0.96)
SalKG-Fine (Relation) | 52.58 (± 0.70) | 68.84 (± 0.67) | 53.32 (± 0.61) | 71.23 (± 1.21) | 53.94 (± 0.63) | 69.80 (± 0.64)
SalKG-Hybrid (Relation) | 51.28 (± 0.70) | 69.84 (± 0.57) | 53.33 (± 0.55) | 70.34 (± 1.03) | 52.41 (± 1.11) | 68.77 (± 0.80)
KG (Node) | 53.63 (± 0.70) | 67.35 (± 0.41) | 55.60 (± 0.16) | 70.51 (± 1.69) | 54.15 (± 2.27) | 70.48 (± 1.71)
SalKG-Coarse (Node) | 55.75 (± 0.60) | 71.83 (± 0.60) | 55.43 (± 0.55) | 71.36 (± 0.81) | 56.14 (± 0.73) | 71.20 (± 0.72)
SalKG-Fine (Node) | 53.60 (± 0.83) | 66.81 (± 1.09) | 53.13 (± 0.99) | 70.80 (± 1.55) | 54.02 (± 0.84) | 71.08 (± 1.02)
SalKG-Hybrid (Node) | 51.14 (± 1.03) | 69.58 (± 0.77) | 50.80 (± 0.83) | 69.85 (± 0.72) | 53.24 (± 0.72) | 69.57 (± 1.14)
KG | 57.48 (± 0.89) | 73.14 (± 0.78) | 56.54 (± 0.73) | 72.58 (± 0.57) | 56.46 (± 1.22) | 71.37 (± 1.20)
SalKG-Coarse | 57.98 (± 0.90) | 73.64 (± 1.05) | 57.75 (± 0.77) | 73.07 (± 0.25) | 57.50 (± 1.25) | 73.11 (± 1.13)
SalKG-Fine | 54.36 (± 2.34) | 70.00 (± 0.81) | 54.39 (± 2.03) | 72.12 (± 0.91) | 54.30 (± 1.41) | 71.64 (± 1.51)
SalKG-Hybrid | 58.70 (± 0.65) | 73.37 (± 0.12) | 59.87 (± 0.42) | 72.67 (± 0.65) | 58.78 (± 0.14) | 74.13 (± 0.71)
Table 5.16: SalKG Performance Comparison on CSQA with Perturbed KGs. Best performance in bold.

Under both relation perturbation and node perturbation, SalKG-Coarse performs best in almost all settings; the only exception is node perturbation on BERT+PathGen, where KG (Node) barely beats SalKG-Coarse (Node). However, under KG perturbation, SalKG-Hybrid does not perform as well, sometimes scoring even worse than KG and SalKG-Fine. This may be because SalKG-Hybrid relies most heavily on fine explanations, making it especially sensitive to KG perturbation.

We also compare these KG-perturbed models to models without any KG perturbation. As expected, across all settings, the non-perturbed models outperform the KG-perturbed models. Interestingly, we find that SalKG-Coarse is most robust to KG perturbation. For BERT+RN and RoBERTa+RN, SalKG-Coarse (Relation) is less than 1% worse than SalKG-Coarse. This makes sense, since SalKG-Coarse relies least on the KG. For a given instance, SalKG-Coarse has the option to completely ignore KG information when making its prediction. When the KG is perturbed, it would be advantageous for SalKG-Coarse to focus only on the text input.

5.8.15 Statistical Significance of Main Results

In this section, we verify the statistical significance of our results in §5.6.2. For each setting in Tables 5.5-5.6 (except RoBERTa+PathGen on CODAH), we perform a two-sided unpaired T-test with unequal variance between the best SalKG model and the best non-SalKG model. The p-values are shown in Tables 5.17-5.18.

CSQA p-values
Model | MHGRN (BERT) | MHGRN (RoBERTa) | PathGen (BERT) | PathGen (RoBERTa) | RN (BERT) | RN (RoBERTa)
Best SalKG Model vs. Best Non-SalKG Model | 0.1235 | 0.4238 | 0.0701 | 0.2690 | 0.1336 | 0.0441
Table 5.17: SalKG T-Test Results on CSQA. For each setting in Table 5.5, we perform the T-test between the best SalKG model and the best non-SalKG model.

OBQA and CODAH p-values
Model (RoBERTa) | OBQA: MHGRN | OBQA: PathGen | OBQA: RN | CODAH: MHGRN | CODAH: PathGen
Best SalKG Model vs. Best Non-SalKG Model | 0.2909 | 0.8890 | 0.0005 | 0.1223 | 0.2823
Table 5.18: SalKG T-Test Results on OBQA and CODAH. For each setting in Table 5.6, we perform the T-test between the best SalKG model and the best non-SalKG model.
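For reference, each entry can be computed from the per-seed test accuracies of the two models being compared using Welch's (unequal-variance) t-test in SciPy. The accuracy lists below are placeholders, not our actual per-seed numbers.

```python
from scipy import stats

# Two-sided unpaired t-test with unequal variance (Welch's t-test) between the
# per-seed test accuracies of the best SalKG model and the best non-SalKG model.
salkg_accs = [73.4, 73.6, 73.2]      # hypothetical per-seed accuracies
baseline_accs = [72.9, 73.1, 72.7]   # hypothetical per-seed accuracies
t_stat, p_value = stats.ttest_ind(salkg_accs, baseline_accs, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```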
If we use threshold α = 0.1 (i.e., p < 0.1), then we find that SalKG yields statistically significant improvements on CSQA BERT+PathGen, CSQA RoBERTa+RN, and OBQA RoBERTa+RN. If we use threshold α = 0.05 (i.e., p < 0.05), then we find that SalKG yields statistically significant improvements on CSQA RoBERTa+RN and OBQA RoBERTa+RN. In particular, the improvement on OBQA RoBERTa+RN is highly statistically significant, with p = 0.0005. Our T-test results show that SalKG can produce significant performance gains on a number of model-dataset settings, while yielding competitive performance in other settings.

5.8.16 Case Studies: Qualitative Analysis of KG Saliency Explanations

In this section, we build upon §5.6.4 and illustrate more examples of coarse/fine explanations created from BERT+PathGen on CSQA, with 1-hop or 2-hop paths as fine units. Notice that 2-hop paths consist of two concept nodes and two relations, with the intermediate node replaced by a placeholder node x, following [57]. By constructing 2-hop paths this way, the model is able to learn from more general 2-hop paths.

Figure 5.5: Examples of coarse/fine saliency explanations. Illustration of examples presented in §5.6.4. Blue denotes the given answer choice, while red denotes the target answer.

Figure 5.6: More examples of coarse/fine saliency explanations. Illustration of examples presented in §5.8.16. Blue denotes the given answer choice, while red denotes the target answer.

First, for coarse explanations, we provide more examples of positive (i.e., useful) and negative KGs.
• For the positive KG example, the question is What would you put in a teakettle?, the answer choice is water, and the target answer is water. Its paths are: (1) teakettle –[is a kind of]–> x <–[is at location]– water, (2) put –[is related to]–> x –[is used for]–> water, and (3) teakettle –[is a kind of]–> x –[is used for]–> water.
• For the negative KG example, the question is A poet may need to go where to achieve learning as an adult?, the answer choice is book store, and the target answer is university. Its paths are: (1) adult <–[is related to]– x –[is related to]–> store, (2) learning <–[causes]– x <–[is related to]– book, and (3) learning –[is related to]–> x –[is at location of]–> book.
Second, we provide more examples of fine explanations.
Here, the question is What do you feel for a someone when you comfort friend?, the answer choice is feeling bad, and the target answer is care. The positive path is: comfort <–[is the antonym of]– x –[is the antonym of]–> feel. The negative path is: comfort –[is at location of]–> x –[is related to]–> feeling. The examples from §5.6.4 are shown in Fig. 5.5. The examples introduced in this subsection (§5.8.16) are shown in Fig. 5.6. Again, in the coarse/fine explanations, we can roughly see that the positive KGs/paths tend to be useful for predicting the correct answer, and vice versa. However, note that the model's judgment of KG/path usefulness may not necessarily align with human judgment [197].

5.8.17 User Studies: Quantitative Analysis of KG Saliency Explanations

To better understand the role and limitations of KG saliency explanations, we quantitatively analyze KG saliency explanations in the context of two user studies. In both user studies, the goal is to measure KG saliency explanations' plausibility, i.e., how closely the explanations align with human judgment. Note that explanation plausibility is orthogonal to our paper's main claims, since we argue that KG saliency explanations can be used as additional supervision for improving performance, not that the explanations are plausible. Nonetheless, these user studies may still provide some useful insights about KG saliency explanations.

Graph Type | Usefulness Score
High-Saliency Graph | 0.929 ± 0.734
Low-Saliency Graph | 0.935 ± 0.764
Table 5.19: Human Evaluation of Coarse Saliency Explanations. Human-annotated usefulness scores for high- (positive) and low- (negative) saliency graphs.

5.8.17.1 User Study 1: Coarse Saliency Explanations

The first user study measures how well the coarse (graph-level) explanations align with human judgment of usefulness. Given a RoBERTa+PathGen model, we begin by uniformly sampling 25 high-saliency (positive) KGs and 25 low-saliency (negative) KGs from the CSQA training set. Recall that whether a KG is high-saliency or low-saliency was determined by coarse explanations (§5.3.1) generated with respect to the given model. Note that each KG corresponds to one answer choice of a question, so each question in CSQA has up to five corresponding KGs. To ensure that none of the KGs in our sample come from the same question, we ended up pruning two high-saliency and two low-saliency KGs, yielding a final sample of 23 high-saliency and 23 low-saliency KGs. Since a KG can contain hundreds of paths, it is not feasible to ask humans to evaluate the entire KG's usefulness. Thus, as a very rough representation of the KG, we uniformly sampled three paths from the KG. Then, for each KG, we asked ten human annotators to score each of the three paths' usefulness for predicting the same answer choice predicted by the RoBERTa+PathGen model. To score the paths, all annotators were also given the question, correct answer, and model's predicted answer. The paths were scored on the following 0-2 scale:
• 0 = definitely not useful (i.e., this path is either irrelevant or would cause someone to NOT select the model's predicted answer)
• 1 = possibly useful (i.e., this path provides some support for selecting the model's predicted answer)
• 2 = definitely useful (i.e., this path provides strong support for selecting the model's predicted answer)
Finally, each KG's score is computed as the mean of its three constituent path scores. Below, we show the mean and standard deviation scores for high-saliency and low-saliency graphs.
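As a rough sketch of this aggregation, with randomly generated placeholder ratings standing in for the real annotations (the exact treatment of the reported standard deviations may differ from ours):

```python
import numpy as np

# Placeholder ratings: for each KG, a (3 paths x 10 annotators) matrix of scores in {0, 1, 2}.
rng = np.random.default_rng(0)
ratings = [rng.integers(0, 3, size=(3, 10)) for _ in range(23)]

# Each KG's score is the mean of its three (annotator-averaged) path scores.
kg_scores = np.array([r.mean(axis=1).mean() for r in ratings])
print(f"usefulness = {kg_scores.mean():.3f} +/- {kg_scores.std():.3f}")
```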
We find that the two graph types have similar mean usefulness scores, while also having relatively large standard deviations. This suggests that coarse saliency explanations do not align strongly with human judgment. One key limitation of this study is that the three sampled paths may not be representative of the entire KG. In the future, we plan to redesign the user study to provide annotators with a more comprehensive representation of the KG to evaluate.

5.8.17.2 User Study 2: Fine Saliency Explanations

The second user study measures how well the fine (path-level) explanations align with human judgment of usefulness. Given a RoBERTa+PathGen model trained on CSQA, we begin by uniformly sampling 25 correctly answered questions and 25 incorrectly answered questions from the CSQA training set. For each question, we take the model's predicted answer choice and the KG corresponding to the predicted answer choice, then select: (1) the path with the highest fine saliency score, (2) the path with the median fine saliency score, and (3) the path with the lowest fine saliency score. To get a finer-grained saliency signal in this study, we consider the raw fine saliency scores, instead of the binarized fine explanations actually used to regularize the model. Recall that a path's fine saliency score (§5.3.2) is calculated with respect to the given model. Next, we asked ten human annotators to score each path's usefulness for predicting the same answer choice predicted by the RoBERTa+PathGen model. Like before, to score the paths, all annotators were also given the question, correct answer, and model's predicted answer. Again, the paths were scored on the following 0-2 scale:
• 0 = definitely not useful (i.e., this path is either irrelevant or would cause someone to NOT select the model's predicted answer)
• 1 = possibly useful (i.e., this path provides some support for selecting the model's predicted answer)
• 2 = definitely useful (i.e., this path provides strong support for selecting the model's predicted answer)

Path Type | Usefulness Score (All Preds) | Usefulness Score (Correct Preds) | Usefulness Score (Incorrect Preds)
High-Saliency Path | 1.091 ± 0.805 | 1.298 ± 0.782 | 0.884 ± 0.776
Med-Saliency Path | 1.222 ± 0.769 | 1.320 ± 0.729 | 1.124 ± 0.798
Low-Saliency Path | 1.060 ± 0.733 | 1.182 ± 0.730 | 0.938 ± 0.717
Table 5.20: Human Evaluation of Fine Saliency Explanations. Human-annotated usefulness scores for high-, median-, and low-saliency paths. We display the usefulness scores for paths from all predictions, correct predictions, and incorrect predictions.

Below, we show the mean scores for high-saliency, median-saliency, and low-saliency paths. We display these scores for paths from all predictions, correct predictions, and incorrect predictions. Overall, we find that the three path types have similar mean usefulness scores, although the mean score for median-saliency paths is somewhat higher than the other two path types'. Still, the standard deviations for all scores are relatively large, so this trend may not be meaningful. These results suggest that fine saliency explanations do not strongly align with human judgment. Additionally, we find that the path usefulness scores for correct predictions tend to be higher than those for incorrect predictions. This makes sense, since, intuitively, a model is more likely to predict the correct answer if it is using more useful knowledge as context.

5.8.17.3 Inter-Annotator Agreement

Here, we measure inter-annotator agreement for both user studies, using Fleiss' kappa.
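For reference, the kappa can be computed from an items-by-annotators matrix of usefulness labels, e.g., using statsmodels; the ratings matrix below is a random placeholder, not our actual annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings: (n_items, n_raters) matrix of usefulness labels in {0, 1, 2}.
ratings = np.random.randint(0, 3, size=(138, 10))   # e.g., 46 KGs x 3 paths, 10 annotators
table, _ = aggregate_raters(ratings)                 # per-item counts for each category
print(f"Fleiss' kappa = {fleiss_kappa(table):.4f}")
```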
For the user study of coarse explanations, the kappa score is 0.2089, which is on the borderline between slight and fair agreement. For the user study of fine explanations, the kappa score is 0.1296, which indicates slight agreement.

User Study | Fleiss' Kappa
Coarse Explanations | 0.2089
Fine Explanations | 0.1296
Table 5.21: Inter-Annotator Agreement for Explanation User Studies. Using Fleiss' kappa, we measure the inter-annotator agreement for the human evaluation of coarse and fine saliency explanations.

In both settings, the inter-annotator agreement is relatively low. These low kappa scores show that even humans can hardly agree on whether the coarse/fine explanations are useful. Therefore, it may not always be beneficial to measure explanation quality in terms of alignment with human judgment. Moreover, this shows that weak alignment with human judgment does not necessarily imply poor explanation quality.

5.8.17.4 Analysis

In our user studies, we did not find strong evidence that coarse/fine saliency explanations align well with human judgment. However, we also found that human annotators had very low agreement about the usefulness of the explanations, which suggests that alignment with human judgment may not be the best measure of explanation quality. In light of this, we emphasize that the user study results do not contradict our paper's conclusions, as our work does not claim that the generated saliency explanations are plausible. Rather, we merely claim that using KG-based saliency explanations as additional supervision to regularize KG-augmented models can yield higher performance. Our work appeals to the view that an explanation's quality should be measured by how well it distills knowledge for improving performance on some task [192]. Furthermore, the results of our user studies are actually in line with the conclusions from [197], which found that KG-augmented models can effectively leverage KG information to improve performance, but in a manner that may not make sense to humans.

5.8.18 Training Hyperparameters

Since we consider a very large number of models and settings in our experiments, we only describe the core hyperparameters here. Let bsz denote the batch size, lr_text the text encoder learning rate, lr_graph the graph encoder learning rate, and lr_task the task predictor learning rate. Across all models (both baselines and SalKG), we generally used the following hyperparameter sweeps: bsz = [8, 16, 32, 64], lr_text = [1e-5, 2e-5, 3e-5, 5e-5], lr_graph = [1e-4, 2e-4, 3e-4, 5e-4], and lr_task = [1e-4, 2e-4, 3e-4, 5e-4]. For CSQA and OBQA, we set the maximum number of epochs to 100. For CODAH, we set the maximum number of epochs to 30. For all three datasets, we used early stopping with a patience of 5 epochs. For more details about hyperparameters, please refer to our code repository.

5.8.19 Computational Costs and Resources

Since the SalKG pipeline (as well as Oracle, Random, and Heuristic) involves training models across multiple stages, its computational costs are considerably greater than those of just training a No-KG or KG model individually. Specifically, the pipeline involves: (1) training the No-KG and KG models; (2) creating coarse/fine explanations from the No-KG and KG models; (3) training the SalKG-Coarse model; (4) training the SalKG-Fine model; and (5) training the SalKG-Hybrid model.
In particular, using the Occl method to create fine explanations can be especially costly, since it requires n+1 KG model forward passes per KG, where n is the number of units in the given KG. Also, if we tune the T or k thresholds comprehensively, then the total training time further increases. For reference, each of our experiments was run on one NVIDIA Quadro RTX 8000 GPU.

Nonetheless, since we are the first to propose regularizing KG-augmented models with saliency explanations, it is expected that not all components of our method will already be fully optimized. That is, the goal of our work is simply to introduce a new paradigm for training KG-augmented models and demonstrate its potential by showing that it can yield improved performance. Certainly, there are various parts of the SalKG pipeline whose efficiency can be improved. For example, we could explore faster explanation generation via some KG-specific heuristic/approximation, training SalKG-Hybrid with coarse/fine explanations in a single step (instead of Steps 3-5 above), or generating explanations that can cover multiple instances at a time. Such potential improvements could be interesting directions for future work.

5.8.20 Related Work (Extended)

Text-Based Explanations. Many works have been proposed for explaining the predictions of language models, especially PLMs. Although some of these works focus on abstractive (free-text) explanations [196, 226, 283], most aim to provide extractive explanations which highlight salient tokens in the model's text input. Such extractive explanations typically rely on gradient-based [229, 134, 47], attention-based [174, 236, 69, 127], or occlusion-based [49, 189, 106, 135] feature attribution methods. How feature attribution methods should be chosen remains an open question and the subject of much recent debate [8, 252, 215, 97]. While SalKG also uses feature attribution methods (e.g., G×I) to create extractive explanations, our study is limited to explanations regarding KG-augmented models' graph inputs.

Graph-Based Explanations. There are also methods proposing extractive explanations for graph encoders, especially GNNs. Such explanations are designed to point out components in the graph input that contribute most to the model's prediction. Some GNNs use attention for pooling, which naturally highlights nodes with higher attention weights [129, 128]. More sophisticated approaches use post-hoc optimization to identify salient nodes [89, 267] or subgraphs [267]. Unlike individual PLMs and graph encoders, KG-augmented models take both text and graph inputs. The KG-augmented model's graph encoder usually computes graph embeddings via attention pooling of nodes/paths, and the attention weights can be used to explain which nodes/paths in the input KG are salient [140, 57, 151, 246, 261]. These KG explanations can be interpreted as identifying knowledge in the KG that is complementary to the knowledge encoded in the PLM. However, there is little work on how such KG explanations should be used. SalKG considers graph-based extractive explanations of KG-augmented models, but focuses more on how explanations are used rather than how they are created.

Learning From Model Explanations. To improve the model's learning, explanations can be used in a diverse range of ways, including as extra supervision or regularization [192, 82, 178, 4], pruned inputs [98, 7, 130], additional inputs [81, 199], and intermediate variables [251, 286, 196].
The most similar work to ours is [192], which proposed training a student model to mimic a teacher model's predictions by regularizing the student model's attention via text explanations created from the teacher model. However, [192] aims to evaluate explanations, while our goal is to improve performance via explanations. Still, methods for learning from explanations have largely focused on domains like text and images, as opposed to graphs. To the best of our knowledge, SalKG is the first work to train KG-augmented models using KG explanations as supervision.

5.8.21 Societal Impact

Our proposed SalKG approach for learning from KG explanations can be applied to any KG-augmented model and can be adapted from any off-the-shelf saliency method. This enables KG-augmented models to improve their generalization ability and learn more efficiently from data, thus yielding better performance while requiring less labeled data. However, in the present version of SalKG, this generalization ability and data efficiency come with increased computational costs, as described in §5.8.19. In the future, we plan to explore methods for improving generalization and data efficiency while minimizing computational costs.

Part IV Conclusion

Chapter 6 Conclusion

Currently, there exists a trust gap between NLP systems and humans. The reasoning processes of today's state-of-the-art NLP systems are not very transparent, which makes it hard for humans to entrust them with high-stakes decisions. In Ch. 1, we motivated this dissertation by discussing why generating and utilizing machine explanations are important for understanding and improving LM decision-making, respectively. Without such explanations, we have no way of knowing how to fix the model when things go wrong. In Ch. 2, we positioned this dissertation within the existing literature by reviewing relevant prior works on generating and utilizing LM explanations. In Ch. 3, we explored RQ1: How can we generate machine explanations that are both faithful and plausible, without hurting the LM's task performance? In doing so, we proposed UNIREX, a learning framework for generating machine explanations that are both faithful and plausible, without hurting task performance. This is critical because machine explanations are not very useful if they cannot satisfy all three of these desiderata. In Ch. 4, we explored RQ2: How can we utilize strongly-supervised machine explanations to improve the LM's task performance? In doing so, we proposed ER-Test, a framework for evaluating the OOD generalization ability of models that are regularized via strongly-supervised machine explanations. Many explainability works implicitly assume that machine explanations will eventually be passed on to humans for ad-hoc manual inspection. Meanwhile, ER-Test helps shift the paradigm from passive observation to active utilization of machine explanations. In Ch. 5, we explored RQ3: How can we utilize weakly-supervised machine explanations to improve the LM's task performance? In doing so, we proposed SalKG, a learning algorithm that uses a teacher model's machine explanations as weak supervision for a student model. Although having strongly-supervised machine explanations is ideal, human annotation is expensive. SalKG shows that this weak supervision can improve the student model's generalization, especially if the machine explanations are based on structured modalities like knowledge graphs.
In conclusion, my PhD dissertation presents a vision for a symbiotic explainability framework, in which NLP systems and humans continually cooperate to generate, refine, and learn from explanations (Fig. 6.1). I believe that it is only through implementing such a framework that humans can come to accept NLP systems as fully trusted partners in high-stakes decision-making.

Figure 6.1: A Symbiotic Explainability Framework. In this explainability framework, NLP systems and humans continually work together to generate, refine, and learn from explanations. By following this framework, humans can develop trust in NLP systems as partners in high-stakes decision-making.

6.1 Future Work

In this dissertation, we focused on extractive rationales, which highlight the input features that most influenced the LM's prediction. On the other hand, there is growing interest in free-text rationales (FTRs), also known as natural language rationales [178, 119, 196, 25, 183, 250, 163, 274]. FTRs are interesting for several reasons. First, they tend to be more intuitive and understandable to humans, since natural language is how humans communicate. Second, instead of being limited to input token scoring, they can explain by referencing things beyond the task input. Third, they support high flexibility in content, style, and length, since their only constraint is that they are expressed in natural language. Therefore, as future work, I am interested in investigating how to better evaluate, generate, and utilize FTRs.

6.1.0.1 Evaluating Free-Text Rationales

FTRs' relatively unconstrained nature makes them prone to hallucinating misleading information [99, 119, 259], so it is important to have reliable metrics for measuring FTR quality. Like extractive rationale metrics, FTR metrics can be categorized as evaluating faithfulness or plausibility. However, for FTRs, measuring faithfulness and plausibility is not as straightforward, and there has been limited work so far in this area [82, 251].

Faithfulness. Recall that faithfulness metrics measure the extent to which the FTR reflects the LM's reasoning process. Some initial works have explored rationale-label consistency (RLC) as one way to measure faithfulness for FTRs [82, 251]. However, there currently exists no way to measure the reliability of these metrics themselves. To address this, our ongoing project FRAME [29] proposes a framework for evaluating RLC metrics for FTRs. FRAME is based on three axioms: (1) good metrics should yield the highest scores for reference rationales, which maximize RLC by construction; (2) good metrics should be appropriately sensitive to semantic perturbation of FTRs; and (3) good metrics should be robust to variation in the task LM's task performance. For (1), we simply define the reference rationale as the task LM's predicted label. This provides a powerful invariance for analyzing the quality of FTRs, whose unconstrained nature typically makes such analysis difficult. For (2), we test whether the metric responds appropriately to equivalent (should not change meaning) and contrastive (should change meaning) perturbations of reference rationales. For (3), we compute RLC as a function of the task LM's task performance, by varying factors like the task LM's number of train instances, number of noisy train instances, and capacity. We also separately compute RLC on correctly and incorrectly predicted test instances, to check whether it is stable across these two subpopulations.
On three text classification datasets (e-SNLI, CoS-E v1.0, CoS-E v1.11), we show that existing RLC metrics cannot satisfy all three FRAME axioms. Since they involve pretraining simulators on external corpora, existing RLC metrics struggle to isolate the FTR's contribution to simulator accuracy from the contribution of the pretraining knowledge, hence muddling the metric's signal. Thus, we introduce a non-pretraining RLC metric, NP-GH-Pred, to address this issue. For LM simulators, NP-GH-Pred improves performance on (1) and (3) by an average of 41.7% and 42.9%, respectively, while performing competitively on (2). For human simulators, we conduct a user study of (1), with NP-GH-Pred improving performance on (1) by 47.7%. However, we note that this is not a long-term solution, since an FTR can yield high RLC without actually telling us anything about the LM's reasoning process. This motivates future work on designing better, non-RLC-based approaches for evaluating FTR faithfulness.

Plausibility. For plausibility, we want to measure the extent to which an FTR is convincing to humans as a reflection of the LM's reasoning process. There are also works that use RLC to measure plausibility [82, 251], but these metrics likewise tell us little about humans' perception of FTR quality. How can we measure plausibility more effectively? In an ongoing project, we consider FTR quality from the perspective of human utility for human-AI collaboration, i.e., the ability to assist humans in solving NLP tasks [105]. First, we concretize the notion of plausibility by defining a set of properties that humans should desire in FTRs: grammaticality, validity, coherence, conciseness, leakage, novelty, association, and contrast. Next, we evaluate FTRs generated by state-of-the-art LMs by using forward simulation [53] to measure the FTRs' human utility, as well as asking humans to judge them based on these properties. After that, we measure the correlation between human utility and different subsets of these properties. Finally, we show that high-utility FTRs for a given task instance can provide transferable knowledge that helps humans generalize to solving new instances. By using human utility and these defined properties to evaluate FTRs, we can better understand the limitations of FTR generation with respect to plausibility. Moreover, by shedding light on the nature of FTRs' human utility in practical settings, our findings can help guide future work on designing LMs and FTR generation strategies for stronger human-AI collaboration.

6.1.0.2 Generating Free-Text Rationales

Among existing FTR works, most have focused on generating FTRs [178, 119, 196, 25, 183, 250, 163, 274]. As described in §2.1.2, there are various existing paradigms for FTR generation. However, when considering generalization performance, reliability, and deployment costs, these existing paradigms all have key limitations. Fine-tuned self-rationalizing LMs often perform worse than non-rationalizing LMs, since their parameters are learned using two relatively dissimilar objectives, while also requiring expensive rationale annotations [251, 178]. Prompted self-rationalizing LMs yield strong task performance and only need a few rationale demonstrations for the prompt, but are computationally prohibitive since they generally require very large-scale (i.e., over 100B parameters) LMs to work effectively [249, 250].
Besides requiring expensive rationale annotations, pipeline-rationalizing LMs' generated rationale forms a non-differentiable bottleneck between the two modules, which complicates end-to-end training and can hurt task performance [251, 82]. Moreover, none of these paradigms has a mechanism for regularizing the rationale generation to faithfully or plausibly reflect the reasoning process of the LM, without hurting task performance.

Recently, chain-of-thought prompting [250], which falls under prompted self-rationalizing LMs, has become a popular method for FTR generation. In chain-of-thought prompting, a large (decoder-only) LM is prompted to solve a given task instance by following demonstrations of how the task should be solved. For example, in question answering, the LM's input consists of a question-FTR-answer demonstration of how a given question should be answered, followed by the actual question. In the demonstration, the FTR presents the reasoning process for obtaining the answer given the corresponding question. By having the demonstration precede the actual question in the LM input, the LM is prompted to also generate an FTR to explain its reasoning before generating its answer to the question. Although chain-of-thought prompting has been shown to improve task performance, there has been limited investigation of the faithfulness and plausibility of FTRs generated via chain-of-thought prompting. If we can define what it means for FTRs to be faithful or plausible, then we should also generate FTRs that satisfy these desiderata.

Faithfulness. Although chain-of-thought prompting has been quite successful, it still has several issues. First, it requires a very large LM to work well (i.e., over 100B parameters), but few people have the computational resources to use such an LM. Second, it does not have a mechanism for ensuring that the FTRs are faithful to the LM's reasoning process. To address these issues, our ongoing project proposes PINTO [245], an LM pipeline that rationalizes via prompt-based learning, then reasons over the task input and FTR via counterfactual regularization. First, PINTO's rationalizing module is a medium-scale (i.e., 20B parameters) LM that contains vast latent knowledge obtained via pretraining [15]. Though prohibitive to finetune, it is affordable for prompt-based learning. Given the task input and a minimal input-output-FTR demonstration prompt, the rationalizing module uses its internal knowledge to map out a suitable reasoning process for the task input by generating an FTR. The rationalizing module is frozen during finetuning, which drastically reduces training costs and prevents it from exploiting spurious shortcuts in the downstream training data. Second, PINTO's reasoning module is a small-scale (i.e., under 1B parameters) LM to which knowledge is transferred from the rationalizing module. The reasoning module is finetuned to solve the downstream reasoning task by using the generated FTR as context for the task input. Crucially, to help ensure that the reasoning module's behavior is dictated by the FTR (instead of by spurious shortcuts), the reasoning module is regularized to output less confident predictions when the FTR is noisily perturbed. To simulate shortcut reasoning, we consider two FTR perturbation strategies: token masking (i.e., the FTR is ignored) and token replacement (i.e., the FTR is misused).
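As a minimal sketch of these two perturbation strategies, assuming the FTR is available as a list of tokens; the masking rate, mask token, and sampling scheme below are illustrative placeholders rather than PINTO's exact settings.

```python
import random

def mask_rationale(tokens, mask_token="<mask>", p=0.5, seed=0):
    """Token masking: replace a fraction of FTR tokens with a mask token (FTR is ignored)."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < p else t for t in tokens]

def replace_rationale(tokens, vocab, p=0.5, seed=0):
    """Token replacement: swap a fraction of FTR tokens with random vocabulary tokens (FTR is misused)."""
    rng = random.Random(seed)
    return [rng.choice(vocab) if rng.random() < p else t for t in tokens]
```

Regularizing the reasoning module to be less confident on such perturbed FTRs discourages it from answering correctly while ignoring or misusing the rationale.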
Across four question answering datasets (CSQA, StrategyQA, OBQA, QASC), we show that PINTO significantly improves the reasoning LM's generalization, yielding higher performance on both ID and OOD test sets. Also, we find that FTRs are utilized more faithfully by PINTO than by other methods, leading to better performance in low-resource settings. Furthermore, we show that PINTO's counterfactual regularization improves the reasoning module's robustness to noise in the rationalizing module's generated FTRs. Still, one limitation of PINTO is that it only considers faithfulness in terms of RLC. As better faithfulness metrics are developed, we can design better ways to generate faithful FTRs.

Plausibility. In an ongoing project, we are exploring how to optimize the generated rationale with respect to plausibility properties like those we introduced earlier [105]. However, this is challenging for two reasons. First, it is difficult to design objectives based on these properties or to obtain supervision for such objectives [74]. Second, many of these objectives are non-differentiable, which means more complicated training techniques like reinforcement learning are needed [155].

6.1.0.3 Utilizing Free-Text Rationales

After generating FTRs, it is important to consider how they can be utilized for explanation-based learning, in order to improve LM behavior or human behavior. Most prior works in explanation-based learning have focused on extractive rationales [81, 79]. Explanation-based learning, using paradigms like ER [203, 90, 68, 272, 109, 201, 147], is more straightforward for extractive rationales since there is a one-to-one correspondence between input tokens and their attribution scores. However, for FTRs, there is no such structure to be leveraged for model design or regularization.

Improving LM Behavior. Recent works have studied how LM generalization can be improved by using FTRs to teach LMs the correct reasoning processes behind correct task outputs. These works aim to learn from FTRs via approaches like input augmentation (e.g., PINTO), self-rationalization, or pipeline rationalization. However, such approaches may hurt task performance or require prohibitively large LMs to work well. In an ongoing project, we propose KNIFE, which uses knowledge distillation to inject FTR knowledge into LMs and improve the LMs' generalization [32]. KNIFE distills FTR knowledge from an FTR-augmented teacher LM (given the task input and FTR) to a student LM (given only the task input) that is used for inference. KNIFE transfers FTR knowledge to the teacher LM's task input/output hidden states by masking out its FTR hidden states, while training the student LM's task input/output states to align with the teacher LM's. Unlike prior works, KNIFE does not rely on inference-time FTR access, FTR generation objectives, large LMs, or multiple inference LMs. Across two question answering datasets (OBQA, StrategyQA), we show that KNIFE significantly outperforms existing FTR learning methods in both fully-supervised and low-resource settings.

Improving Human Behavior. Besides improving LM behavior, recall that one of the main motivations of LM explainability is to assist humans in high-stakes decision-making. Since FTRs speak the same language as humans do, FTRs are a promising medium for helping humans understand LM behavior [123, 37, 121, 202, 222, 217, 94]. For example, a recent work called TalkToModel proposes an interactive dialogue system for explaining LM behavior through conversations [222].
Impressively, in real-world human eval- uations on the disease prediction task, 73% of healthcare workers said that they would use TalkToModel over baseline point-and-click systems for explaining LM behavior. In the future, I hope to build upon this line of work and further improve collaboration between humans and NLP systems. 161 PartV References 162 Bibliography [1] A robot wrote this entire article. are you scared yet, human? | GPT-3. Sept. 2020.url: https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3. [2] Chirag Agarwal, Eshika Saxena, Satyapriya Krishna, Martin Pawelczyk, Nari Johnson, Isha Puri, Marinka Zitnik, and Himabindu Lakkaraju. “OpenXAI: Towards a Transparent Evaluation of Model Explanations”. In: arXiv preprint arXiv:2206.11104 (2022). [3] David Alvarez-Melis, Hal Daumé III, Jennifer Wortman Vaughan, and Hanna Wallach. “Weight of evidence as a basis for human-oriented explanations”. In: arXiv preprint arXiv:1910.13503 (2019). [4] Jacob Andreas, Dan Klein, and Sergey Levine. “Learning with latent language”. In: arXiv preprint arXiv:1711.00482 (2017). [5] Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke, and Leonardo Neves. “TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 1644–1650.doi: 10.18653/v1/2020.findings-emnlp.148. [6] Marion Bartl, Malvina Nissim, and Albert Gatt. “Unmasking contextual stereotypes: Measuring and mitigating BERT’s gender bias”. In: arXiv preprint arXiv:2010.14534 (2020). [7] Jasmijn Bastings, Wilker Aziz, and Ivan Titov. “Interpretable neural predictions with differentiable binary variables”. In: arXiv preprint arXiv:1905.08160 (2019). [8] Jasmijn Bastings and Katja Filippova. “The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?” In: arXiv preprint arXiv:2010.05607 (2020). [9] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. “Relational inductive biases, deep learning, and graph networks”. In: arXiv preprint arXiv:1806.01261 (2018). [10] Iz Beltagy, Matthew E Peters, and Arman Cohan. “Longformer: The long-document transformer”. In: arXiv preprint arXiv:2004.05150 (2020). 163 [11] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021, pp. 610–623. [12] Francesco Benedetto and Antonio Tedeschi. “Big data sentiment analysis for brand monitoring in social media streams by cloud computing”. In: Sentiment Analysis and Ontology Engineering. Springer, 2016, pp. 341–377. [13] Meghana Moorthy Bhat, Alessandro Sordoni, and Subhabrata Mukherjee. “Self-training with Few-shot Rationalization: Teacher Explanations Aid Student in Few-shot NLU”. In: arXiv preprint arXiv:2109.08259 (2021). [14] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José MF Moura, and Peter Eckersley. “Explainable machine learning in deployment”. In: Proceedings of the 2020 conference on fairness, accountability, and transparency. 2020, pp. 648–657. 
[15] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. “GPT-NeoX-20B: An Open-Source Autoregressive Language Model”. In: Proceedings of the ACL Workshop on Challenges & Perspectives in Creating Large Language Models. 2022.url: https://arxiv.org/abs/2204.06745. [16] Su Lin Blodgett and Brendan O’Connor. “Racial disparity in natural language processing: A case study of social media african-american english”. In: arXiv preprint arXiv:1707.00061 (2017). [17] Shikha Bordia and Samuel R Bowman. “Identifying and reducing gender bias in word-level language models”. In: arXiv preprint arXiv:1904.03035 (2019). [18] Antoine Bosselut and Yejin Choi. “Dynamic knowledge graph construction for zero-shot commonsense question answering”. In: arXiv preprint arXiv:1911.03876 (2019). [19] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. “A large annotated corpus for learning natural language inference”. In: arXiv preprint arXiv:1508.05326 (2015). [20] Faeze Brahman, Vered Shwartz, Rachel Rudinger, and Yejin Choi. “Learning to rationalize for nonmonotonic reasoning with distant supervision”. In: Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 35. 14. 2021, pp. 12592–12601. [21] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901. [22] Nadia Burkart and Marco F Huber. “A survey on the explainability of supervised machine learning”. In: Journal of Artificial Intelligence Research 70 (2021), pp. 245–317. 164 [23] Niklas Bussmann, Paolo Giudici, Dimitri Marinelli, and Jochen Papenbrock. “Explainable machine learning in credit risk management”. In: Computational Economics 57.1 (2021), pp. 203–216. [24] Erik Cambria, Yang Li, Frank Z. Xing, Soujanya Poria, and Kenneth Kwok. “SenticNet 6: Ensemble Application of Symbolic and Subsymbolic AI for Sentiment Analysis”. In: Proceedings of the 29th ACM International Conference on Information amp; Knowledge Management. CIKM ’20. Virtual Event, Ireland: Association for Computing Machinery, 2020, pp. 105–114.isbn: 9781450368599. doi: 10.1145/3340531.3412003. [25] Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. “e-snli: Natural language inference with natural language explanations”. In: arXiv preprint arXiv:1812.01193 (2018). [26] Samuel Carton, Surya Kanoria, and Chenhao Tan. “What to Learn, and How: Toward Effective Learning from Rationales”. In: Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 1075–1088.doi: 10.18653/v1/2022.findings-acl.86. [27] Samuel Carton, Anirudh Rathore, and Chenhao Tan. “Evaluating and characterizing human rationales”. In: arXiv preprint arXiv:2010.04736 (2020). [28] Simon Caton and Christian Haas. “Fairness in machine learning: A survey”. In: arXiv preprint arXiv:2010.04053 (2020). [29] Aaron Chan, Shaoliang Nie, Liang Tan, Xiaochang Peng, Hamed Firooz, Maziar Sanjabi, and Xiang Ren. “Frame: Evaluating simulatability metrics for free-text rationales”. In: arXiv preprint arXiv:2207.00779 (2022). 
Abstract
Neural language models (LMs) have yielded remarkable success on a wide range of natural language processing (NLP) tasks. However, LMs sometimes exhibit undesirable behavior, which can be difficult to resolve due to LMs’ opaque reasoning processes. This lack of transparency poses serious concerns about LMs’ trustworthiness in high-stakes decision-making, thus motivating the use of machine explanations to automatically interpret how LMs make their predictions. In my thesis, I argue that building human trust in NLP systems requires being able to: (A) generate machine explanations for LM behavior faithfully and plausibly and (B) utilize machine explanations to improve LM generalization and decision-making. First, to address (A), I propose UNIREX, a unified learning framework for jointly optimizing machine explanations with respect to both faithfulness and plausibility, without compromising the LM’s task performance. Second, for (B), I introduce ER-Test, a framework for evaluating the out-of-distribution generalization ability of LMs that are regularized via strongly-supervised machine explanations. Third, to further support (B), I present SalKG, an algorithm for improving LM generalization by regularizing LMs via weakly-supervised machine explanations. Finally, I discuss several future directions for achieving (A) and (B).
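To make the abstract's second theme concrete, below is a minimal, self-contained PyTorch sketch of explanation-regularized training: a standard task loss is combined with a term that aligns the model's token attributions with reference rationales. This is only an illustration of the general idea behind frameworks such as ER-Test and SalKG, not their actual implementations; the toy model, the gradient-x-input attribution choice, and names such as token_attributions, alpha, and gold_rationale are assumptions introduced purely for this example.

# Hypothetical sketch of explanation-regularized training (not the exact
# UNIREX / SalKG / ER-Test objectives): the task loss is combined with a
# term that pushes the model's token attributions toward gold rationales.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):
    """Toy bag-of-embeddings classifier standing in for a language model."""
    def __init__(self, vocab_size=100, dim=32, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)      # (batch, seq, dim)
        logits = self.head(emb.mean(dim=1))
        return logits, emb

def token_attributions(logits, emb, labels):
    """Gradient-x-input saliency per token (one common attribution choice)."""
    target = logits.gather(1, labels.unsqueeze(1)).sum()
    grads = torch.autograd.grad(target, emb, create_graph=True)[0]
    return (grads * emb).sum(dim=-1)     # (batch, seq)

model = TinyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
alpha = 0.1  # weight on the explanation term (assumed value)

# Dummy batch: token ids, gold labels, and binary reference rationales
# marking which tokens are deemed important (by humans, heuristics, or a KG).
token_ids = torch.randint(0, 100, (4, 10))
labels = torch.randint(0, 2, (4,))
gold_rationale = torch.zeros(4, 10)
gold_rationale[:, :3] = 1.0              # pretend the first three tokens matter

for step in range(5):
    logits, emb = model(token_ids)
    task_loss = F.cross_entropy(logits, labels)

    # Explanation regularization: align normalized saliency with the rationale.
    sal_log = F.log_softmax(token_attributions(logits, emb, labels), dim=-1)
    rat = gold_rationale / gold_rationale.sum(dim=-1, keepdim=True)
    expl_loss = F.kl_div(sal_log, rat, reduction="batchmean")

    loss = task_loss + alpha * expl_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: task={task_loss.item():.3f} expl={expl_loss.item():.3f}")

Roughly speaking, UNIREX additionally scores extracted rationales for faithfulness alongside this plausibility-style alignment, and SalKG derives the reference saliency signal from knowledge-graph explanations rather than human annotations; the weighting between the task and explanation terms (alpha here) is a tunable hyperparameter.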
Linked assets
University of Southern California Dissertations and Theses

Conceptually similar
Integrating annotator biases into modeling subjective language classification tasks
Externalized reasoning in language models for scalable and trustworthy AI
Artificial Decision Intelligence: integrating deep learning and combinatorial optimization
Computational narrative models of character representations to estimate audience perception
Generating psycholinguistic norms and applications
Improving decision-making in search algorithms for combinatorial optimization with machine learning
Hashcode representations of natural language for relation extraction
Balancing prediction and explanation in the study of language usage and speaker attributes
Learning at the local level
Towards trustworthy and data-driven social interventions
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
Machine learning in interacting multi-agent systems
Building generalizable language models for code processing
Controlling information in neural networks for fairness and privacy
Adapting pre-trained representation towards downstream tasks
Fair Machine Learning for Human Behavior Understanding
Quickly solving new tasks, with meta-learning and without
Identifying and mitigating safety risks in language models
Modeling dynamic behaviors in the wild
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Asset Metadata
Creator: Chan, Aaron Zesheng (author)
Core Title: Generating and utilizing machine explanations for trustworthy NLP
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2023-05
Publication Date: 04/11/2023
Defense Date: 11/15/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: artificial intelligence, deep learning, explanation-based learning, extractive rationales, faithfulness, free-text rationales, generalization, language models, machine explanations, machine learning, model explainability, natural language processing, neural networks, OAI-PMH Harvest, plausibility, regularization, supervised learning, text classification, transparency, trustworthiness
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Ren, Xiang (committee chair), Dehghani, Morteza (committee member), Dilkina, Bistra (committee member), Jia, Robin (committee member), Thomason, Jesse (committee member)
Creator Email: aarzchan@gmail.com, chanaaro@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113010396
Unique identifier: UC113010396
Identifier: etd-ChanAaronZ-11603.pdf (filename)
Legacy Identifier: etd-ChanAaronZ-11603
Document Type: Dissertation
Rights: Chan, Aaron Zesheng
Internet Media Type: application/pdf
Type: texts
Source: 20230412-usctheses-batch-1020 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu