Evaluating and Improving the Commonsense Reasoning Ability of Language Models

by Yuchen Lin

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2023

Copyright 2023 Yuchen Lin

Acknowledgements

I am immensely grateful to my advisor, Xiang Ren, for his outstanding mentorship throughout my Ph.D. studies. From the very beginning, Xiang has been an invaluable source of guidance and inspiration, and I feel extremely fortunate to have had the opportunity to work with him and benefit from his wisdom and expertise. Xiang's unwavering support, remarkable productivity, and insightful feedback have been instrumental in shaping my research and personal growth. His razor-sharp intellect and exceptional dedication to his work have been a constant source of inspiration to me. In addition to his outstanding contributions to my research, Xiang has also provided me with invaluable advice and guidance for my future career in academia. I am deeply grateful for his mentorship and will always cherish the lessons I learned from him.

I am deeply grateful to William Cohen, my mentor during my Google Research internship, for giving me the opportunity to work on challenging problems in commonsense reasoning. I also extend my thanks to Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Ying Sheng and Sandeep Tata for their invaluable support during my internship at Google Research.

I am also grateful to Scott Yih, who hosted me as a research intern at Facebook AI Research (FAIR). Scott provided me with the freedom to explore my research ideas and offered me invaluable advice on my career path. His patience, guidance, and support have been immensely valuable to me. In addition, I would like to express my gratitude to Sida Wang, Xi Victoria Lin, Robin Jia, and Lin Xiao for their helpful suggestions and support during my internship at FAIR.

I would also like to thank Ram Nevatia, Yan Liu, and Toby Mintz for serving on my thesis committee. Their insights and suggestions have been integral to my thesis.

I would like to express my heartfelt gratitude to my many fantastic collaborators who worked closely with me during my Ph.D. research, including Dongho Lee, Pei Zhou, Jun Yan, Qinyuan Ye, Xisen Jin, Woojeong Jin, Rahul Khanna, Seyeon Lee, Xiaoyang Qiao, Ziyi Wu, Yichi Yang, Wangchunshu Zhou, Wenxuan Zhou, Yanlin Feng, Xinyue Chen, Dongfu Jiang, Kangmin Tan, Chengsong Huang, Wenyang Gao, Ryan Moreno, Ming Shen, and so many others. Your insights, feedback, and contributions were invaluable to the success of my research, and I am deeply grateful for the opportunity to work with each and every one of you. In addition, I want to express my gratitude to my friends Yizhong Wang, Frank Xu, Zhengbao Jiang, Aaron Chan, and Wenhu Chen for their invaluable feedback on my papers and job talks. Your support and encouragement have been instrumental in my research and career, and I am fortunate to have had such wonderful peers on my Ph.D. journey.

Lastly, I would like to express my heartfelt gratitude to my parents and my wife for their unconditional love and unwavering support throughout my Ph.D. journey. Their encouragement, understanding, and sacrifices have been instrumental in my success. I am deeply indebted to them for all they have done for me, and I am honored to have them in my life.
Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
    1.1 Language Models: The Era of Pre-Training & Fine-Tuning
    1.2 Commonsense Reasoning for NLP: Promise & Challenges
        1.2.1 Problem Definition of Commonsense Reasoning
        1.2.2 Prior Works CSR for NLP
    1.3 Outline
Chapter 2: Open-Ended and Generative Commonsense Reasoning
    2.1 Open-Ended Commonsense Question Answering
        2.1.1 Introduction
        2.1.2 Open-Ended Commonsense Reasoning
    2.2 Constrained Text Generation for Generative CSR
        2.2.1 Introduction
        2.2.2 Task Formulation and Key Challenges
        2.2.3 Dataset Construction and Analysis
        2.2.4 Methods
        2.2.5 Evaluation
        2.2.6 Related Work
        2.2.7 Conclusion
Chapter 3: Evaluating the Generalization of Commonsense Reasoning
    3.1 Multilingual Generalization of the CSR Ability
        3.1.1 Introduction
        3.1.2 Background and Related Work
        3.1.3 The Mickey Probe
        3.1.4 The Mickey Corpus and Evaluation
        3.1.5 Multilingual Contrastive Pre-Training
        3.1.6 Evaluation for Cross-lingual CSR
        3.1.7 Conclusion
    3.2 Generalization to Non-Monotonic Reasoning & Creativity
        3.2.1 Introduction
        3.2.2 Construction of RIDDLESENSE
        3.2.3 Data Analysis of RIDDLESENSE
        3.2.4 Experiments
        3.2.5 Related Work
        3.2.6 Conclusion
Chapter 4: Testing the Robustness of Commonsense Reasoning
    4.1 Robustness in Probing Numerical Common Sense
        4.1.1 Introduction
        4.1.2 The NUMERSENSE Probing Task
        4.1.3 Empirical Analysis
        4.1.4 Case Studies
        4.1.5 Open-Domain 'How-Many' Questions
        4.1.6 Related Work
        4.1.7 Conclusion
Chapter 5: Incorporating Structured Knowledge into LMs for CSR
    5.1 Knowledge-Aware Graph Networks for CSR
        5.1.1 Introduction
        5.1.2 Overview
        5.1.3 Schema Graph Grounding
        5.1.4 Knowledge-Aware Graph Network
        5.1.5 Experiments
        5.1.6 Related Work
        5.1.7 Conclusion
Chapter 6: Modeling Unstructured Knowledge Corpora with LMs for CSR
    6.1 DrFact: An Efficient Approach for Differentiable Reasoning over Facts
    6.2 Experiments
    6.3 Conclusion
Chapter 7: Unsupervised Generalization for CSR with Implicit Knowledge
    7.1 Introduction
    7.2 Problem Formulation
    7.3 ReCross: Retrieval Augmentation for Cross-Task Generalization
        7.3.1 Overview
        7.3.2 Dense Retrieval
        7.3.3 Reranking Module
        7.3.4 Mining Distant Supervision for Reranking
        7.3.5 Re-learning via Fine-Tuning with Retrieved Data
    7.4 Evaluation
        7.4.1 Evaluating Unsupervised Cross-Task Generalization
        7.4.2 BART0: Upstream Learning with a Smaller LM
        7.4.3 Setup and Configurations
        7.4.4 Experimental Results
        7.4.5 Analysis & More Findings
    7.5 More Discussion
        7.5.1 Practicality of unsupervised setting
        7.5.2 Empirical studies
    7.6 Related Work
    7.7 Conclusion & Future Directions
Chapter 8: Conclusion & Future Directions
    8.1 Summary
    8.2 Future Directions
Bibliography

List of Tables

2.1 The basic statistics of the COMMONGEN data. We highlight the ratios of concept compositions that are unseen in training data, which assures the challenge in compositional generalization ability.
2.2 The distributions of the relation categories on one/two-hop connections.
2.3 Experimental results of different baseline methods on the COMMONGEN test set (v1.1). The first group of models are non-pretrained models, while the second group is large pretrained models that we have fine-tuned. The best models are bold and second best ones are underlined within each metric. We highlight the metrics that we used in our official leaderboard.
2.4 Manual Evaluation via Pair-wise Comparisons for Ranking. Numbers are hit rates (%) at top 1/3/5.
2.5 Experimental results of models with DBA decoding method on the test set.
3.1 The hit@1 accuracy (%) of the five ML-LMs for the MICKEYPROBE task.
3.2 Benchmark results for different ML-LMs and MCP-enhanced models for X-CSQA and X-CODAH in a zero-shot cross-lingual setting. ∆ is the improvement of MCP. {pl,ar,ja,pt,sw,ur} are unseen in MCP.
3.3 Statistics of the two X-CSR datasets.
3.4 Key statistics of the RIDDLESENSE dataset (v1.1) vs the CommonsenseQA (CSQA) dataset.
3.5 The top-5 most frequent types of reasoning chains in CSQA and RS datasets, grouped by their length k={1,2,3,4}. The implicit-ratio ρ is defined as the ratio of the implicit reasoning types (i.e., Related×k) over the most frequent types with at least one explicit relation (e.g., AtLoc) of the same length k.
3.6 Benchmark performance over the dev and test set of RIDDLESENSE.
4.1 NUMERSENSE examples of each category.
4.2 Results (%) of PTLMs on NUMERSENSE. 'Ft.' stands for 'Fine-tuned.' The human performance is shown by closed testing (α='no external information') / open testing (β='Wikipedia is allowed').
4.3 The average Softmax of top 3 predictions in templates where '[x]' is filled with 1k random words.
5.1 Comparisons with large pre-trained language model fine-tuning with different amount of training data.
5.2 Comparison with official benchmark baseline methods using the official split on the leaderboard.
5.3 Comparisons with knowledge-aware baseline methods using the in-house split (both easy and hard mode) on top of BLSTM as the sentence encoder.
5.4 Ablation study on the KagNet framework.
6.1 Statistics of datasets for OpenCSR (v1.0).
6.2 Results of the Hit@K and Rec@K (K=50/100) on OpenCSR (v1.0). We present two groups of methods with different inference speed levels. The upper group is retrieval-only methods that are efficient (< 0.5 sec/q), while the bottom group are augmented with a computationally expensive answer reranker (≥ 14 sec/q).
6.3 Comparisons of the four retrieval methods.
6.4 The major competitions of each method and their online (batch-size=1) inference speed in sec/q.
6.5 Ablation study of DRFACT (H@50 test acc).
7.1 The main experimental results (%) for unsupervised cross-task generalization in SoftEM. Each result in the upper section is the average (and the std) performance of using 5 different query sets for a task. The lower section of this table reports the mean, max, min, and median of the overall performance (i.e., the average performance on all tasks) of these five rounds.
7.2 Results on a subset of BigBench tasks.
7.3 The ablation study of ReCross.

List of Figures

2.1 We study the task of open-ended commonsense reasoning (OpenCSR), where answer candidates are not provided (as in a multiple-choice setting). Given a question, a reasoner uses multi-hop reasoning over a knowledge corpus of facts, and outputs a ranked list of concepts from the corpus.
2.2 A motivating example of how DrFact works for OpenCSR. We model the knowledge corpus as a hypergraph consisting of concepts in V as nodes and facts in F as hyperedges. Then, we develop a differentiable reasoning method, DrFact, to perform multi-hop reasoning via fact-following operations (e.g., f_1 → f_2).
2.3 An example of the dataset of COMMONGEN. GPT-2, UniLM, BART and T5 are large pre-trained text generation models, fine-tuned on the proposed task.
2.4 Two key challenges of COMMONGEN: relational reasoning with underlying commonsense knowledge about given concepts (left), and compositional generalization for unseen combinations of concepts (right).
2.5 Dataset construction workflow overview.
2.6 Two numerical solutions.
2.7 Connectivity analysis in 5-size concept-sets in the test set, each of which consists of 10 concept pairs. For example, 12.0 in blue means: there are 12% concept-sets that have 3 concept pairs with one-hop connections on ConceptNet.
2.8 A case study with a concept-set {hand, sink, wash, soap} for qualitative analysis of machine generations. Human references are collected from AMT.
2.9 Learning curve for the transferring study. We use several trained COMMONGEN (GG) models to generate choice-specific context for the CSQA task.
3.1 Commonsense reasoning is well-studied with benchmarks and LMs in English. Can we advance commonsense reasoning beyond English?
3.2 A Mickey Probe example M has a set of probes in different languages (e.g., M_en/zh), and each of them is a set of 5 assertions. We rank assertions in the same language by their PLLs to probe common sense in ML-LMs across different languages.
3.3 The MICKEYPROBE results in hit@1-acc.
3.4 Categorized accuracy for MCP (XLM-R_L) on X-CODAH. Each box is for 15 languages.
3.5 Dev acc v.s. learning steps on X-CSQA.
3.6 The top example is a trivial commonsense question from the CommonsenseQA [161] dataset. The two bottom examples are from our proposed RIDDLESENSE challenge. The right-bottom question is a descriptive riddle that implies multiple commonsense facts about candle, and it needs understanding of figurative language such as metaphor; the left-bottom one additionally needs counterfactual reasoning ability to address the 'but-no' cues. These riddle-style commonsense questions require NLU systems to have higher-order reasoning skills with the understanding of creative language use.
3.7 The Q-A paths serve as an estimation of underlying reasoning chains. Fig. (a) illustrates how to compute the mean/min/max of the Q-A paths: {q_1, q_2, q_3} are three concepts mentioned in the question, and a is the answer concept. L_k is the length of the shortest path between q_k and a over ConceptNet; min/max/mean are computed over {L_1, L_2, L_3} as three aspects to measure the overall difficulty. Fig. (b), (c), and (d) show that generally RiddleSense has a longer question-answer path than CommonsenseQA, thus being harder to reason.
3.8 Three types of baseline methods: 1) fine-tuning pre-trained LMs, 2) incorporating graph-based reasoner, 3) fine-tuning a unified text-to-text LM.
3.9 The curve of dev accuracy using different percentage of the RS-training data, respectively for RoBERTa-Large and ALBERT-XXL.
3.10 Case studies of the error by UnifiedQA-3B model on the test set of RIDDLESENSE.
4.1 Top: PTLMs often cannot solve masked language modeling tasks needing numerical commonsense knowledge, hence our title. Bottom: Even when PTLMs seemingly succeed, they fail to stay consistent under small perturbations.
4.2 Truth number distribution of the test set.
4.3 Performance of RoBERTa-Large vs. human performance (closed-book tests) on different categories of numerical commonsense knowledge.
4.4 The attention distribution of the sentence "A bird usually has two legs." on RoBERTa-base. We plot the attention weights (y) between each word and the number word 'two' at different position (x), e.g., x=13 means (Layer 2, Head 1).
5.1 An example of using external commonsense knowledge (symbolic space) for inference in natural language commonsense questions (semantic space).
5.2 The overall workflow of the proposed framework with knowledge-aware graph network module.
5.3 Illustration of the GCN-LSTM-HPA architecture for the proposed KagNet module.
5.4 An example of interpreting model behaviors by hierarchical attention scores.
6.1 The overall workflow of DRFACT. We encode the hypergraph (Fig. 2.2) with a concept-to-fact sparse matrix E and a fact-to-fact sparse matrix S. The dense fact index D is pre-computed with a pre-trained bi-encoder. A weighted set of facts is represented as a sparse vector F. The workflow (left) of DRFACT starts by mapping a question to a set of initial facts that have common concepts with it. Then, it recursively performs Fact-Follow operations (right) for computing F_t and A_t. Finally, it uses learnable hop-weights α_t to aggregate the answers.
6.2 The curve of Hit@K accuracy in overall.
6.3 A case study to compare DPR and DRFACT.
7.1 The unsupervised cross-task generalization problem. In the upstream training stage, we train a multi-task NLP model, M, with a diverse collection of upstream tasks. In the generalization stage, given an unseen task U_i with a few unlabeled examples Q_i, we want to update the upstream model (via retrieval augmentation) such that it can generalize to the target task.
7.2 ReCross is a retrieval-augmentation method for unsupervised cross-task generalization. We reuse the encoder layers of the upstream model (green) to build a dense index, which consists of vectors of the upstream examples D. We also propose an algorithm to generate distant supervision for training a reranker, which takes a pair of examples as input and outputs a score. During the evaluation, we encode query examples Q_i for querying the index to get initial ranking results R′, and then pair them with the queries again for reranking. Finally, we take the top-K results (i.e., R) for generalizing the upstream model M to the unseen task U_i.
7.3 The mapping between unseen tasks (as rows) and upstream tasks (as columns). The darker upstream tasks take more percentage in retrieved data. For example, for the task WIC, ReCross retrieves a plurality of examples from QQP (about 30% of the retrieved examples).

Abstract

Large pre-trained language models (LMs) have become the foundation for natural language processing (NLP) and many other areas of artificial intelligence (AI). Based on Transformer-based neural network architectures and large text corpora, these large LMs gain a great amount of linguistic knowledge. These research advances with LMs have led to significant improvements in many AI tasks such as question answering, information extraction, summarization, machine translation, and dialogue generation. Some recent large LMs even surpass human performance on many standard and popular benchmarks for natural language understanding and generation. However, these LMs still often make mistakes when commonsense knowledge is needed for reasoning about everyday situations. This lack of commonsense reasoning (CSR) ability exposes troubling gaps in current models' world knowledge and reasoning capabilities, thus being a bottleneck for building human-level AI systems that can naturally think, talk, and act in real life as humans do.
In this thesis, I argue that evaluating and improving the commonsense reasoning ability of LMs is necessary for building human-level AI systems with general intelligence. In the first half of this thesis, I will focus on how to better evaluate the common sense of LMs. Prior works on benchmarking CSR in NLP have primarily focused on two types of evaluation: knowledge probing and multiple-choice question answering (MCQA). Although they are simple and straightforward to use, there are still many missing aspects in the current evaluation protocols. I create datasets dedicated to open-ended, generalizable, and robust CSR. The key contributions are to evaluate open-ended CSR by introducing two benchmarks: OPENCSR for open-ended QA, and COMMONGEN for language generation with generative commonsense. In order to encourage CSR models to be more generalizable in terms of multiple languages, non-monotonic reasoning, and style transfer, I create the X-CSR and RIDDLESENSE benchmarks. Finally, robustness is also a key aspect in evaluating CSR, so I focus on logically equivalent perturbations (RICA) and adversarial attacks in probing numerical commonsense (NUMERSENSE).

In the second half of the thesis, I will present methods of incorporating knowledge for improving the commonsense reasoning ability of LMs. Useful knowledge for commonsense reasoning can be roughly categorized into three types: 1) structured knowledge, 2) unstructured knowledge, and 3) instance-based implicit knowledge. I will start with the KagNet model, which first retrieves subgraphs of commonsense knowledge graphs and then fuses them into LMs for CSR. For incorporating unstructured commonsense knowledge in the form of text corpora, I will introduce DrFact, an effective multi-hop reasoning method that can model more complex commonsense knowledge via retrieval. Beyond the above declarative commonsense knowledge, I will show that modeling annotated instances of NLP tasks as implicit knowledge bases can help improve CSR via retrieval augmentation, and this is especially helpful in unsupervised cross-task generalization settings.

Chapter 1: Introduction

1.1 Language Models: The Era of Pre-Training & Fine-Tuning

Since the release of BERT [30], pre-trained language models (LMs) have revolutionized the area of natural language processing (NLP) and many other areas in artificial intelligence (AI). They have shown significant improvements on most NLP tasks, such as text classification, sentiment analysis, information extraction, question answering, summarization, and dialogue systems. There are mainly three types of language models: 1) bidirectional language encoders (e.g., BERT, RoBERTa), 2) autoregressive language decoders (e.g., the GPT family), and 3) encoder-decoder architectures (e.g., BART, T5). These models cover most formulations of NLP tasks. Therefore, little additional task-specific design is needed to achieve a decent result on a dataset: one only needs to fine-tune an LM with a suitable amount of labeled data to learn a new task.

Specifically, there are three key reasons why pre-trained LMs can be this powerful with fine-tuning: a powerful base architecture, a well-designed pre-training strategy, and a large high-quality corpus. First, they are all based on Transformer networks [171], which consist of self-attention modules for encoding and decoding sequential data and thus naturally fit language data. Second, most pre-training strategies are based on text-recovery objectives (e.g., predicting masked words, infilling missing spans, etc.) that do not need any human annotation, thus forming a self-supervised learning paradigm. Third, the pre-training corpora are usually web-scale and well-preprocessed, resulting in high-quality data for learning linguistic patterns.
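To make the fine-tuning recipe above concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries; the SST-2 dataset, roberta-base checkpoint, and hyperparameters are illustrative placeholders rather than settings used in this thesis.

```python
# Minimal fine-tuning sketch (illustrative only): adapt a pre-trained encoder
# to a new classification task with a modest amount of labeled data.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("glue", "sst2")  # placeholder downstream task
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()  # no task-specific architecture changes are required
```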
Nowadays, the most powerful pre-trained LM is GPT-3, which achieves great performance on many NLP tasks with in-context few-shot learning that does not need to tune the model parameters at all. With such great task generalization ability, GPT-3 can easily learn a new language task. Considering that many recent pre-trained LMs already achieve superhuman performance on several benchmarks such as GLUE, many media posts claim that recent large LMs like GPT-3 have already reached human-level performance on AI tasks. Have pre-trained LMs really achieved human-level intelligence?

If we test GPT-3 with some simple commonsense questions, we will find that even the best LM is still far from being commonsensical. For example, GPT-3 can still make commonsense mistakes (e.g., "a tiger usually has two wings") and generate sentences describing scenarios that are nearly impossible in real-world situations. These findings suggest that LMs are still weak at performing commonsense reasoning (CSR). How should we formally and comprehensively evaluate the CSR ability of LMs? How can we improve the CSR ability of LMs for building better AI systems? These two research questions motivate this thesis, and we introduce the background of CSR in the next section.

1.2 Commonsense Reasoning for NLP: Promise & Challenges

Commonsense reasoning aims to empower machines with the human ability to make presumptions about ordinary situations in our daily life. Human beings are rational, and a major component of rationality is the ability to reason. Reasoning is the process of combining facts and beliefs to make new decisions [66], as well as the ability to manipulate knowledge to draw inferences. Commonsense reasoning utilizes the basic knowledge that reflects our natural understanding of the world and human behaviors, which is common to all humans. Empowering machines with the ability to perform CSR has been seen as a bottleneck of artificial general intelligence [28].

1.2.1 Problem Definition of Commonsense Reasoning

In general, commonsense reasoning is the ability to make decisions in real-world situations with the knowledge shared by most people. Commonsense reasoning is hard to define, particularly due to the ambiguous boundary of commonsense knowledge. In the following, we show some definitions and characterizations of common sense from different authors, and finally give a summary that can be seen as my working definition.

• "Commonsense knowledge includes the basic facts about events (including actions) and their effects, facts about knowledge and how it is obtained, facts about beliefs and desires. It also includes the basic facts about material objects and their properties." (McCarthy, John; 1989)
• "Commonsense knowledge differs from encyclopedic knowledge in that it deals with general knowledge rather than the details of specific entities." (Tandon et al. 2018)
• "Commonsense knowledge is real-world knowledge that can provide a basis for additional knowledge to be gathered and interpreted automatically." (Matuszek, Cynthia, et al. 2005)
• "The commonsense world consists of "time, space, physical interactions, people, and so on." (Ernest Davis; Gary Marcus 2015)
• "Common sense is all the knowledge about the world that we take for granted but rarely state out loud." (Clive Thompson 2018)
• "Common sense is the basic level of practical knowledge and reasoning concerning situations and events that are commonly shared among most people." (Sap et al., 2020)
A working definition of CSR: Commonsense reasoning is a human-like ability to make presumptions about ordinary objects, events, and situations that humans encounter every day. It can be seen as a process of using commonsense knowledge to solve real-world problems, which we want to teach machines to do at a basic level. Thus, it is one of the fundamental areas of Artificial Intelligence (AI) research. Commonsense knowledge consists of facts that most people acquire in their lives by observing and interacting with the physical world and other humans. Thus, there are two major topics in commonsense knowledge: physical and social common sense. Within each topic, the knowledge can be further characterized by its basic units (e.g., objects vs. events) and focused dimensions (e.g., taxonomic, utility, temporal, etc.).

We also discuss a few other aspects of describing commonsense knowledge, such as concreteness (e.g., "birds have legs" is less concrete than "birds have two legs"), plausibility/typicality (e.g., "(most) birds have legs" is more plausible than "(some) apples are green"), saliency (e.g., "birds have wings" is more salient than "birds have legs"), and culture-sensitiveness (e.g., high schools are four years in the USA, while three years in China).

1.2.2 Prior Works on CSR for NLP

Evaluating the common sense of LMs. Evaluation protocols and benchmark datasets are always essential for a field to develop. How can we evaluate the common sense of LMs? In prior works, there are two typical types: LM probing and multiple-choice question answering (MCQA). Probing datasets such as the LAMA probes [125] aim to test the commonsense knowledge of a pre-trained LM by predicting the missing words in template-based commonsense assertions (e.g., "birds can [mask]" → "fly"). This is straightforward, but it is limited by the task format and the supported languages. In contrast, MCQA is a more flexible formulation for testing all kinds of LMs via fine-tuning. For example, CommonsenseQA [160] has been used to evaluate all three types of LMs (i.e., encoder-only, decoder-only, and encoder-decoder LMs). The MCQA format can cover many complex questions for CSR. However, the existing MCQA datasets and the formulation itself both have limitations that prevent us from obtaining a more comprehensive evaluation of common sense.

Methods for improving the CSR ability of LMs. As shown before, fine-tuning LMs can already lead to decent performance on NLP tasks. Therefore, many prior works do not focus on designing CSR-specific modules on top of the LMs. They instead focus on how to improve performance via data augmentation and better fine-tuning algorithms. There are also a few information-retrieval methods that use corpora like Wikipedia to obtain more information as additional input contexts for fine-tuning, which shares a similar motivation with knowledge-incorporation methods.
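As a concrete illustration of the LM-probing style of evaluation described above, the following sketch queries an off-the-shelf masked LM with template-based assertions; the model choice and probe templates here are my own examples, not the official LAMA setup.

```python
# LAMA-style probing sketch: ask a masked LM to complete commonsense assertions
# and inspect its top predictions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

probes = [
    "Birds can [MASK].",                # expected: fly
    "A bird usually has [MASK] legs.",  # expected: two
]
for template in probes:
    predictions = fill_mask(template, top_k=3)
    print(template, "->", [p["token_str"] for p in predictions])
```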
1.3 Outline

In the first part of the thesis, I will use Chapters 2, 3, and 4 to show how to better evaluate the common sense of LMs. They focus on open-ended, generalizable, and robust CSR, respectively. Chapter 2 illustrates open-ended CSR by introducing two benchmarks: OPENCSR for open-ended QA, and COMMONGEN for language generation with generative commonsense. In order to encourage CSR models to be more generalizable in terms of languages, non-monotonic reasoning, and language creativity, I create the X-CSR and RIDDLESENSE benchmarks in Chapter 3. Finally, in Chapter 4, I focus on logically equivalent perturbations (RICA) and adversarial attacks in probing numerical commonsense (NUMERSENSE).

Then, I will use Chapters 5, 6, and 7 to illustrate methods for improving the CSR ability of LMs via knowledge incorporation. Specifically, Chapter 5 introduces the KagNet model and its variant MHGRN, which first retrieve subgraphs of commonsense knowledge graphs and then fuse them into LMs for CSR. For incorporating unstructured commonsense knowledge in the form of text corpora, Chapter 6 introduces DrFact, an effective multi-hop reasoning method that can model more complex commonsense knowledge via retrieval. In Chapter 7, I show that modeling annotated instances of NLP tasks as implicit knowledge bases can help improve CSR via retrieval augmentation, and this is especially helpful in unsupervised cross-task generalization settings.

Chapter 2: Open-Ended and Generative Commonsense Reasoning

2.1 Open-Ended Commonsense Question Answering

2.1.1 Introduction

The conventional task setting for most current commonsense reasoning research is multiple-choice question answering (QA) — i.e., given a question and a small set of pre-defined answer choices, models are required to determine which of the candidate choices best answers the question. Existing commonsense reasoning models usually work by scoring a question-candidate pair [98, 111, 41]. Hence, even an accurate multiple-choice QA model cannot be directly applied in practical applications where answer candidates are not provided (e.g., answering a question asked on a search engine, or during a conversation with a chatbot).

Because we seek to advance commonsense reasoning towards practical applications, we propose to study open-ended commonsense reasoning (OpenCSR), where answers are generated efficiently, rather than selected from a small list of candidates (see Figure 2.1). As a step toward this, here we explore a setting where the model produces a ranked list of answers from a large question-independent set of candidate concepts that are extracted offline from a corpus of common-sense facts written in natural language.

Figure 2.1: We study the task of open-ended commonsense reasoning (OpenCSR), where answer candidates are not provided (as in a multiple-choice setting). Given a question, a reasoner uses multi-hop reasoning over a knowledge corpus of facts, and outputs a ranked list of concepts from the corpus.
The OpenCSR task is inherently challenging. One problem is that for many questions, finding an answer requires reasoning over two or more natural-language facts from a corpus. In the multiple-choice QA setting, as the set of candidates is small, we can pair a question with an answer, and use the combination to retrieve relevant facts and then reason with them. In the open-ended setting, this is impractical: instead, one needs to retrieve facts from the corpus using the question alone. In this respect, OpenCSR is similar to multi-hop factoid QA about named entities, e.g., as done for HotpotQA [200]. However, the underlying reasoning chains of most multi-hop factoid QA datasets are relatively clear and context-independent, and are thus easier to infer. Commonsense questions, in contrast, exhibit more variable types of reasoning, and the relationship between a question and the reasoning needed to answer it is often unclear. (For example, a factoid question like "who starred in a movie directed by Bradley Cooper?" clearly suggests following a directed-by relationship and then a starred-in relationship, while the underlying reasoning chain of a question like "what can help alleviate global warming?" is relatively implicit from the question.) Furthermore, annotations are not available to identify which facts are needed in the latent reasoning chains that lead to an answer — the only supervision is a set of questions and their answers. We discuss the formulation of OpenCSR and its challenges further in Section 2.1.2.

As shown in Fig. 2.1, another challenge is that many commonsense questions require reasoning about facts that link several concepts together. E.g., the fact "trees remove carbon dioxide from the atmosphere through photosynthesis" cannot be easily decomposed into pairwise relationships between "trees", "carbon dioxide", "the atmosphere", and "photosynthesis", which makes it more difficult to store in a knowledge graph (KG). However, such facts have been collected as sentences in common-sense corpora, e.g., GenericsKB [11]. This motivates the question: how can we conduct multi-hop reasoning over such a knowledge corpus, similar to the way multi-hop reasoning methods traverse a KG? Moreover, can we achieve this in a differentiable way, to support end-to-end learning?

To address this question, we extend work by Seo et al. (2019) and Dhingra et al. (2020), and propose an efficient, differentiable multi-hop reasoning method for OpenCSR, named DRFACT (for Differentiable Reasoning over Facts). Specifically, we formulate multi-hop reasoning over a corpus as an iterative process of differentiable fact-following operations over a hypergraph. We first encode all fact sentences within the corpus as dense vectors to form a neural fact index, such that fast retrieval can be done via maximum inner product search (MIPS). This dense representation is supplemented by a sparse fact-to-fact matrix that stores symbolic links between facts (i.e., a pair of facts are linked if they share common concepts). DRFACT thus merges both neural and symbolic aspects of the relationships between facts to model reasoning in an end-to-end differentiable framework.

To evaluate OpenCSR methods, we construct new OpenCSR datasets by adapting three existing multiple-choice QA datasets: QASC [73], OBQA [117], and ARC [24]. Note that unlike factoid questions that usually have a single correct answer, open-ended commonsense questions can have multiple correct answers. Thus, we collect a set of new answers for each test question by crowd-sourcing human annotations. We compare with several strong baseline methods and show that our proposed DRFACT outperforms them by a large margin. Overall, DRFACT gives a 4.6% absolute improvement in Hit@100 accuracy over DPR [68], a state-of-the-art text retriever for QA, and 3.2% over DrKIT [33], a strong baseline for entity-centric multi-hop reasoning. With a relatively more expensive re-ranking module, the gap between DRFACT and the others is even larger.
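The following is a rough, non-differentiable sketch of the fact-following idea just described, using random toy data in place of real fact encodings; the actual DrFact system is built with learned encoders and differentiable sparse-dense operations, so this only conveys the shape of a single reasoning hop.

```python
# Sketch of one "fact-following" hop in the spirit of DrFact (simplified; the
# real implementation is differentiable and uses trained fact/question encoders).
import numpy as np
import scipy.sparse as sp

n_facts, dim = 5, 4
D = np.random.randn(n_facts, dim).astype(np.float32)        # dense fact index (one vector per fact)
S = sp.random(n_facts, n_facts, density=0.4, format="csr")  # sparse fact-to-fact links (shared concepts)

def fact_follow(F_t, q_vec):
    """F_t: current weights over facts; q_vec: a dense query vector."""
    symbolic = S.T.dot(F_t)               # spread weight to facts linked via shared concepts
    neural = np.maximum(D @ q_vec, 0.0)   # MIPS-style dense relevance score for every fact
    F_next = symbolic * neural            # combine symbolic links with neural scores
    z = F_next.sum()
    return F_next / z if z > 0 else F_next

F0 = np.zeros(n_facts, dtype=np.float32)
F0[0] = 1.0                               # start from facts that share concepts with the question
q = np.random.randn(dim).astype(np.float32)
print(fact_follow(F0, q))                 # weights over facts after one hop
```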
2.1.2 Open-Ended Commonsense Reasoning

Task Formulation. We denote a corpus of knowledge facts as $\mathcal{F}$, and use $\mathcal{V}$ to denote a vocabulary of concepts; both are sets consisting of unique elements. A fact $f_i \in \mathcal{F}$ is a sentence that describes generic commonsense knowledge, such as "trees remove carbon dioxide from the atmosphere through photosynthesis." A concept $c_j \in \mathcal{V}$ is a noun or base noun phrase mentioned frequently in these facts (e.g., 'tree' and 'carbon dioxide'). Concepts are considered identical if their surface forms are the same (after lemmatization). Given only a question $q$ (e.g., "what can help alleviate global warming?"), an open-ended commonsense reasoner is supposed to answer it by returning a weighted set of concepts, such as $\{(a_1 = \text{'renewable energy'}, w_1), (a_2 = \text{'tree'}, w_2), \dots\}$, where $w_i \in \mathbb{R}$ is the weight of the predicted concept $a_i \in \mathcal{V}$.

Figure 2.2: A motivating example of how DrFact works for OpenCSR. We model the knowledge corpus as a hypergraph consisting of concepts in $\mathcal{V}$ as nodes and facts in $\mathcal{F}$ as hyperedges. Then, we develop a differentiable reasoning method, DrFact, to perform multi-hop reasoning via fact-following operations (e.g., $f_1 \to f_2$).

To learn interpretable, trustworthy reasoning models, it is expected that models can output intermediate results that justify the reasoning process — i.e., the supporting facts from $\mathcal{F}$. E.g., an explanation for 'tree' to be an answer to the question above can be the combination of two facts: $f_1$ = "carbon dioxide is the major ..." and $f_2$ = "trees remove ...", as shown in Figure 2.1.

Implicit Multi-Hop Structures. Commonsense questions (i.e., questions that need commonsense knowledge to reason) contrast with better-studied multi-hop factoid QA datasets, e.g., HotpotQA [200], which primarily focus on querying evident relations between named entities. For example, a multi-hop factoid question can be "which team does the player named 2015 Diamond Head Classic's MVP play for?" Its query structure is relatively clear and self-evident from the question itself: in this case the reasoning process can be decomposed into $q_1$ = "the player named 2015 DHC's MVP" and $q_2$ = "which team does $q_1$.answer play for".

The reasoning required to answer commonsense questions is usually more implicit and relatively unclear. Consider the previous example in Fig. 2.1: $q$ = "what can help alleviate global warming?" can be decomposed by $q_1$ = "what contributes to global warming" and $q_2$ = "what removes $q_1$.answer from the atmosphere" — but many other decompositions are also plausible. In addition, unlike HotpotQA, we assume that we have no ground-truth justifications for training, which makes OpenCSR even more challenging.

2.2 Constrained Text Generation for Generative CSR

2.2.1 Introduction
Figure 2.3: An example of the dataset of COMMONGEN. GPT-2, UniLM, BART and T5 are large pre-trained text generation models, fine-tuned on the proposed task.

Commonsense reasoning, the ability to make acceptable and logical assumptions about ordinary scenes in our daily life, has long been acknowledged as a critical bottleneck of artificial intelligence and natural language processing [28]. Most recent commonsense reasoning challenges, such as CommonsenseQA [161], SocialIQA [149], WinoGrande [142] and HellaSwag [208], have been framed as discriminative tasks – i.e., AI systems are required to choose the correct option from a set of choices based on a given context. While significant progress has been made on these discriminative tasks, we argue that commonsense reasoning in text generation poses a distinct complementary challenge. In this section, we advance machine commonsense towards generative reasoning ability.

Figure 2.4: Two key challenges of COMMONGEN: relational reasoning with underlying commonsense knowledge about given concepts (left), and compositional generalization for unseen combinations of concepts (right).

Humans acquire the ability to compose sentences by learning to understand and use common concepts that they recognize in their surrounding environment [167]. The acquisition of such an ability is regarded as a significant milestone of human development [120]. Can machines acquire such generative commonsense reasoning ability? To initiate the investigation, we present COMMONGEN (http://inklab.usc.edu/CommonGen/) – a novel constrained generation task that requires machines to generate a sentence describing a day-to-day scene using concepts from a given concept-set. For example, in Figure 2.3, given a set of concepts {dog, frisbee, catch, throw}, machines are required to generate a sentence such as "a man throws a frisbee and his dog catches it in the air."
To successfully solve the task, models need to incorporate two key capabilities: a) relational reasoning, and b) compositional generalization. Grammatically sound sentences may not always be realistic, as they might violate our commonsense (e.g., "a dog throws a frisbee ..."). In order to compose a plausible sentence that describes an everyday scenario, models need to construct a grammatical sentence while adhering to and reasoning over the commonsense relations between the given concepts. Models additionally need compositional generalization ability to infer about unseen concept compounds. This encourages models to reason about a potentially infinite number of novel combinations of familiar concepts – an ability believed to be a limitation of current AI systems [81, 70].

Therefore, in support of the COMMONGEN task, we present a dataset consisting of 35,141 concept-sets associated with 77,449 sentences. We explicitly design our dataset collection process to capture the key challenges of relational reasoning and compositional generalization described above, through an actively controlled crowd-sourcing process. We establish comprehensive baseline performance for state-of-the-art language generation models with both extensive automatic evaluation and manual comparisons. The best model, based on T5 [134], achieves 28.86% in the SPICE metric, a significant gap compared to the human performance of 52.43% – demonstrating the difficulty of the task. Our analysis shows that state-of-the-art models struggle at the task, generating implausible sentences – e.g., "dog throws a frisbee ...", "giving massage to a table", etc. Additionally, we show that successful COMMONGEN models can benefit downstream tasks (e.g., commonsense-centric question answering) by generating useful context as background scenarios. We believe these findings point to interesting future research directions for the community of commonsense reasoning.

2.2.2 Task Formulation and Key Challenges

We formulate the proposed COMMONGEN task with mathematical notation and discuss its inherent challenges with concrete examples. The input is an unordered set of $k$ concepts $x = \{c_1, c_2, \dots, c_k\} \in \mathcal{X}$ (i.e., a concept-set), where each concept $c_i \in \mathcal{C}$ is a common object (noun) or action (verb). We use $\mathcal{X}$ to denote the space of all possible concept-sets and $\mathcal{C}$ to denote the concept vocabulary (a subset of ConceptNet's unigram concepts). The expected output is a simple, grammatical sentence $y \in \mathcal{Y}$ that describes a common scenario in our daily life, using all given concepts in $x$ (morphological inflections are allowed). A scenario can depict either a static situation or a short series of actions. The COMMONGEN task is to learn a function $f: \mathcal{X} \to \mathcal{Y}$, which maps a concept-set $x$ to a sentence $y$. The unique challenges of this task come from two aspects:
Relational Reasoning with Commonsense. Expected generative reasoners should prioritize the most plausible scenarios over many other less realistic ones. As shown in Figure 2.4, models need to recall necessary relational commonsense facts that are relevant to the given concepts, and then reason about an optimal composition of them for generating a desired sentence. In order to complete a scenario, generative commonsense reasoners also need to reasonably associate additional concepts (e.g., 'woman', 'gym') as agents or background environments for completing a coherent scenario. This not only requires understanding the underlying commonsense relations between concepts, but also incrementally composing them towards a globally optimal scenario. The underlying reasoning chains are inherently based on a variety of background knowledge such as spatial relations, object properties, physical rules, temporal event knowledge, social conventions, etc. However, they may not be recorded in any existing knowledge bases.

Compositional Generalization. Humans can compose a sentence to describe a scenario about concepts that they may have never seen co-occurring. For example, in Figure 2.4, there is a testing concept-set $\hat{x}$ = {pear, basket, pick, put, tree}. The concept 'pear' never appears in the training data, and 'pick' never co-occurs with 'basket'. We, humans, can generalize from the seen scenarios in the training data and infer a plausible output: $\hat{y}$ = "a girl picks some pears from a tree and puts them into her basket." This compositional generalization ability via analogy, i.e., to make "infinite use of finite means" [20], is challenging for machines. This analogical challenge not only requires inference about similar concepts (e.g., 'apple' → 'pear') but also their latent associations.
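To illustrate the input/output format of the task, here is a hedged sketch of generating a scene description from a concept-set with a seq2seq LM; the checkpoint path and the prompt format below are placeholders of mine, not artifacts released with COMMONGEN.

```python
# Illustrative I/O for the concept-set-to-sentence task, assuming access to a
# seq2seq LM already fine-tuned on such data (the path below is a placeholder).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "path/to/your-commongen-finetuned-t5"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

concept_set = ["dog", "frisbee", "catch", "throw"]    # x = {c_1, ..., c_k}
inputs = tokenizer("generate a sentence with: " + " ".join(concept_set),
                   return_tensors="pt")               # assumed prompt format
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# e.g., "a man throws a frisbee and his dog catches it in the air."
```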
2.2.3 Dataset Construction and Analysis

Figure 2.5 illustrates the overall workflow of our data construction for the proposed COMMONGEN task. We utilize several existing caption corpora for sampling frequent concept-sets (Sec. 2.2.3.1) that reflect common scenarios. We employ AMT crowd workers for collecting human-written sentences (Sec. 2.2.3.2) for the development and test sets, while we carefully monitor the quality of crowd workers and refine the data dynamically. Finally, we present the statistics of the COMMONGEN dataset and an analysis of its challenges (Sec. 2.2.3.6).

Figure 2.5: Dataset construction workflow overview.

2.2.3.1 Collecting Concept-Sets from Captions

It can be unreasonable to present an arbitrary set of concepts (e.g., $x$ = {apple, fold, rope}) and ask a reasoner to generate a commonsense scenario, since such an arbitrary set of concepts can be too unrelated. Therefore, our concept-sets are supposed to reflect reasonable concept co-occurrences in everyday situations. As web images and video clips capture diverse everyday scenarios, we use their caption text as a natural resource for collecting concept-sets and their corresponding descriptions of commonsense scenarios. More specifically, we collect visually-grounded sentences from several existing caption datasets, including image captioning datasets, such as Flickr30k [206], MSCOCO [102], and Conceptual Captions [153], as well as video captioning datasets including LSMDC [140], ActivityNet [78], and VATEX [185].

We first conduct part-of-speech tagging over all sentences in the corpora such that words in sentences can be matched to the concept vocabulary of ConceptNet. Then, we compute the sentence frequency of concept-sets consisting of 3–5 concepts. That is, for each combination of three/four/five concepts in the vocabulary, we know how many sentences in the corpora cover all of the concepts.

Ideally, we want the selected concept-sets in our dataset to reflect the natural distribution of concept-sets in the real world. At first glance, a reasonable solution may seem to be sampling from the distribution of the concept-sets based on their frequencies in the source datasets. However, we find that this method leads to a rather unnaturally skewed collection of concept-sets, due to the inherent data biases of the source datasets. We therefore design a function to score a concept-set $x$ based on scene diversity and an inverse frequency penalty. We denote $S(x)$ as the set of unique sentences that contain all given concepts $\{c_1, c_2, \dots, c_k\}$, and then we have

$$\mathrm{score}(x) = |S(x)|\,\frac{\bigl|\bigcup_{s_i \in S(x)} \{w \mid w \in s_i\}\bigr|}{\sum_{s_i \in S(x)} \mathrm{len}(s_i)}\,\rho(x), \quad \text{where } \rho(x) = \frac{|\mathcal{X}|}{\max_{c_i \in x} |\{x' \mid c_i \in x' \text{ and } x' \in \mathcal{X}\}|}.$$

The first term in the score is the number of unique sentences covering all given concepts in $x$, and the second term represents the diversity of the scenes described in these sentences. The last term $\rho(x)$ is the inverse frequency penalty. Specifically, we find the concept in $x$ that has the maximum "set frequency" (i.e., the number of unique concept-sets containing that concept), and then take its inverse, normalized by the number of all concept-sets. This penalty based on inverse set-frequency effectively controls the bias towards highly frequent concepts. With the distribution of such scores of concept-sets, we sample our candidate examples for the next steps.
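A direct Python transcription of this scoring function may make the three terms easier to follow; whitespace tokenization and the toy data below are simplifications of mine.

```python
# Concept-set scoring: |S(x)| * (distinct words / total length) * rho(x).
def rho(x, all_concept_sets):
    """Inverse-frequency penalty: |X| / max_i |{x' in X : c_i in x'}|."""
    max_set_freq = max(sum(1 for xp in all_concept_sets if c in xp) for c in x)
    return len(all_concept_sets) / max_set_freq

def score(x, sentences, all_concept_sets):
    """sentences: the unique sentences S(x) that cover every concept in x."""
    if not sentences:
        return 0.0
    tokenized = [s.split() for s in sentences]
    vocab = set(w for toks in tokenized for w in toks)        # distinct words across S(x)
    diversity = len(vocab) / sum(len(toks) for toks in tokenized)
    return len(sentences) * diversity * rho(x, all_concept_sets)

# Toy usage:
X = [{"dog", "frisbee", "catch"}, {"dog", "run", "park"}, {"apple", "tree", "pick"}]
print(score({"dog", "frisbee", "catch"},
            ["a dog catches a frisbee", "the dog leaps to catch the frisbee"], X))
```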
To address this concern, we first study the correlation between the input concept order and the reference concept order (i.e., the order of the given concepts in the human annotations). We find that in 96.97% of the references, the concept order is different from the order shown during annotation. More specifically, we use Spearman's rank correlation coefficient to measure the correlation between the input concept order and the reference concept order. It turns out that the mean correlation over all input–reference pairs on the test examples is -0.031, which suggests that different permutations of the input concept order do not have a notable influence on the order of concepts in the human references; the annotation is thus permutation-invariant.

2.2.3.4 Finalizing Adequate References

As there may be more than one acceptable scene for each input concept-set, we would like to check whether our human references are sufficient before finalizing the dataset. Thus, we conducted one more round of crowd-sourcing to add one more reference for each concept-set from new annotators. Then, we compute the inter-annotator agreement (IAA) using the cosine similarity between all pairs of human references, based on SentenceBERT [137] (fine-tuned for semantic similarity analysis). Note that if we have k human references for an example in the end, then we will have k(k-1)/2 different pairs of references, each of which has a cosine similarity between their sentence embeddings. We then take the median of these similarity scores as a proxy for whether we have collected adequate human references.

The underlying rationale is that if there are many references that are very similar to each other yet come from different annotators, then it is likely that the current references are adequate for this example. As shown in Figure 2.6, we simulated different numbers of references per example. We find that the IAA saturates once the fifth reference is added, and thus we believe the references are adequate. Also, from the standard deviation of these IAA scores, we find that the diversity of the references likewise saturates at five references.

[Figure 2.6: The curves of inter-annotator agreement (IAA) in terms of its standard deviation (top) and median (bottom) as the average number of references per example increases.]

2.2.3.5 Down-Sampling Training Examples

In order to evaluate compositional generalization ability, we down-sample the remaining candidate concept-sets to construct a distantly supervised training dataset (i.e., using caption sentences as the human references). We explicitly control the overlap of concept-sets between the training examples and the dev and test examples. The basic statistics of the final dataset are shown in Table 2.1. There are on average four sentences for each example in the dev and test sets, which provides a richer and more diverse test-bed for automatic and manual evaluation. Table 2.1 also shows the ratios of unseen concept compositions (i.e., concepts, concept-pairs, and concept-triples) in the dev and test sets. Notably, all pairs of concepts in every test concept-set are unseen in the training data and thus pose a challenge for compositional generalization.

2.2.3.6 Analysis of Underlying Common Sense

We here introduce a deeper analysis of the dataset by utilizing the largest commonsense knowledge graph (KG), ConceptNet [156], as a tool to study connectivity and relation types.
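For the connectivity analysis that follows, the per-set one-hop and two-hop statistics can be computed with a sketch like the one below; it assumes ConceptNet edges have already been loaded into an adjacency dictionary, and the function name is ours.

```python
import itertools

def pairwise_hop_counts(concept_set, neighbors):
    """Count how many of the concept pairs in a set are connected within one hop
    or within two hops on ConceptNet.

    neighbors: dict mapping a concept to the set of its one-hop ConceptNet neighbors.
    """
    one_hop = two_hop = 0
    for a, b in itertools.combinations(concept_set, 2):
        n_a, n_b = neighbors.get(a, set()), neighbors.get(b, set())
        if b in n_a or a in n_b:
            one_hop += 1
            two_hop += 1
        elif n_a & n_b:          # connected through a shared intermediate concept
            two_hop += 1
    return one_hop, two_hop

# For a 5-size concept-set there are 10 pairs, so the returned counts range from 0 to 10.
```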
Connectivity Distribution. If the concepts inside a given concept-set are more densely connected with each other on the KG, then it is likely to be easier to write a scenario about them. In each 5-size concept-set (i.e., a concept-set consisting of five concepts), there are 10 unique pairs of concepts, whose connections we are interested in.

[Figure 2.7: Connectivity analysis of the 5-size concept-sets in the test set, each of which consists of 10 concept pairs. For example, 12.0 in blue means that 12% of the concept-sets have 3 concept pairs with one-hop connections on ConceptNet.]

As shown in Figure 2.7, if we look at the one-hop links on the KG, about 60% of the 5-size concept-sets have less than one link among all concept pairs. On the other hand, if we consider two-hop links, then nearly 50% of them are almost fully connected (i.e., each pair of concepts has a connection). These two observations together suggest that COMMONGEN has a reasonable difficulty: the concepts are neither too distant nor too close, and thus the inputs are neither too difficult nor too trivial.

Relation Distribution. Furthermore, the relation types of such connections can also tell us what kinds of commonsense knowledge are potentially useful for relational reasoning towards generation. We examine the frequency of different relation types² of the one/two-hop connections among concept pairs in the dev and test examples. To better summarize the distributions, we categorize these relations into five major types and present their distributions in Table 2.2, for one-hop and two-hop connections between concept pairs respectively.

² Relation definitions are at https://github.com/commonsense/conceptnet5/wiki/Relations.

Category              Relations                                                      1-hop     2-hop
Spatial knowledge     AtLocation, LocatedNear                                        9.40%     39.31%
Object properties     UsedFor, CapableOf, PartOf, ReceivesAction, MadeOf, FormOf,    9.60%     44.04%
                      HasProperty, HasA
Human behaviors       CausesDesire, MotivatedBy, Desires, NotDesires, Manner         4.60%     19.59%
Temporal knowledge    Subevent, Prerequisite, First/Last-Subevent                    1.50%     24.03%
General               RelatedTo, Synonym, DistinctFrom, IsA, HasContext, SimilarTo   74.89%    69.65%

Table 2.2: The distributions of the relation categories on one/two-hop connections.

2.2.4 Methods

We briefly introduce the baseline methods that are tested on the COMMONGEN task.

Encoder-Decoder Models. Bidirectional RNNs and Transformers [173] are the two most popular architectures for seq2seq learning. We use them with the addition of an attention mechanism [110] and copying ability [51], based on the open-source framework OpenNMT-py [76]. We use bRNN-CopyNet and Trans-CopyNet to denote them respectively. To alleviate the influence of concept ordering in such sequential learning methods, we randomly permute the input concepts multiple times for training and decoding and then average the performance. To explicitly eliminate the order-sensitivity of the inputs, we replace the encoder with a mean-pooling-based MLP network (MeanPooling-CopyNet).

Non-autoregressive generation. Recent advances [84, 158] in conditional sentence generation have shown an emerging interest in (edit-based) non-autoregressive generation models, which iteratively refine generated sequences. We assume that these models could have better performance because of their explicit modeling of iterative refinement, and thus study the most recent such model, the Levenshtein Transformer (LevenTrans) by Gu, Wang, and Zhao (2019).
We also include a recent enhanced version,ConstLeven [159], which incorporates lexical constraints in LevenTrans. Model\ Metrics ROUGE-2/L BLEU-3/4 METEOR CIDEr SPICE Coverage bRNN-CopyNet [51] 7.67 27.77 12.58 7.06 16.38 5.06 13.39 51.15 Trans-CopyNet 8.64 27.85 12.47 7.56 15.91 4.65 12.85 49.06 MeanPooling-CopyNet 9.65 31.15 12.29 7.08 17.10 5.18 15.18 55.70 LevenTrans. [50] 10.61 31.87 21.51 12.65 20.50 7.45 16.84 63.81 ConstLeven. [159] 11.77 33.04 20.87 11.26 25.23 10.80 20.05 94.51 GPT-2 [133] 16.85 39.01 33.92 23.73 26.83 12.19 23.57 79.09 BERT-Gen [8] 17.78 40.21 33.29 23.47 28.25 12.61 24.82 86.06 UniLM [35] 21.20 43.60 41.82 30.73 30.62 14.89 27.43 89.19 UniLM-v2 [8] 18.11 40.51 34.31 24.53 29.04 13.19 25.52 89.13 BART [86] 22.02 41.78 39.52 29.01 31.83 13.98 28.00 97.35 T5-Base [134] 14.63 34.56 28.76 18.54 23.94 9.40 19.87 76.67 T5-Large [134] 21.74 42.75 43.01 31.96 31.12 15.13 28.86 95.29 Human Performance (Upper Bound) 36.72 53.45 52.55 46.49 38.79 37.64 52.43 99.33 Table 2.3: Experimental results of different baseline methods on the COMMONGEN test set (v1.1). The first group of models are non-pretrained models, while the second group is large pretrained models that we have fine-tuned. The best models are bold and second best ones are underlined within each metric. We highlight the metrics that we used in our official leaderboard. Pre-trained Language Generation Models. We also employ various pre-trained language gener- ation models, includingGPT-2 [133],UniLM [35],UniLM-v2 [8],BERT-Gen [8],BART [86], and T5 [134], to tackle this task and test their generative commonsense reasoning ability. We fine-tuned all the above models on our training data with a seq2seq format. Specifically, to use GPT-2 for this sequence-to-sequence task, we condition the language model on the format “c 1 c 2 ... c k = y” during fine-tuning, where c i is a concept in the given concept-set and connects with other concepts with a blank; y is a target sentence. For inference, we sample from the fine-tuned GPT-2 model after a prompt of “c 1 c 2 ... c k =” with beam search and use the first generated sentence as the output sentence. For BERT-Gen, we use thes2s-ft 22 package 3 to fine-tune them in a sequence-to-sequence fashion that is similar to the LM objective employed by UniLM. As forT5, the state-of-the-art text-to-text pre-trained model which is pre-trained with a multi- task objective by prepending a task description before the input text, we prepend the input concept set with a simple prompt: “generate a sentence with:” and fine-tune the model with the source sentence on the format “generate a sentence with c 1 c 2 ... c k .” For decoding, we employ the standard beam search with a beam size of 5 for all compared models. We also report their results with a lexically-constrained decoding method, dynamic beam allocation (DBA) [129], which do not show improvement over conventional beam searching. 4 2.2.5 Evaluation We first introduce the automatic evaluation metrics, then present main experimental results with manual analysis, and finally introduce the potential application in transferring CommonGen-trained models for other downstream tasks. 2.2.5.1 Metrics Following other conventional generation tasks, we use several widely-used automatic metrics to automatically assess the performance, such as BLEU [123], ROUGE [101], METEOR [7], which mainly focus on measuring surface similarities. 
We report the conceptCoverage, which is the average percentage of input concepts that are present in lemmatizatized outputs. In addition, we argue that it is more suitable to use evaluation metrics specially design for captioning task, such as CIDEr [175] and SPICE [1]. They usually assume system generations and human references use similar concepts, and thus focus on evaluate the associations between mentioned concepts instead of n-gram overlap. For example, theSPICE metric uses dependency parse trees as proxy of scene graphs to measure the similarity of scenarios. 5 3 https://github.com/microsoft/unilm 4 The used hyper-parameters are reported in the appendix. 5 We also tried recent metrics such as BERTScore [212], but we find that they overly focus on lexical semantics instead of dependencies between words, thus resulting low correlation with the manual evaluation results. 23 C.Leven GPT BERT-G. UniLM BART T5 Hit@1 3.2 21.5 22.3 21.0 26.3 26.8 Hit@3 18.2 63.0 59.5 69.0 69.0 70.3 Hit@5 51.4 95.5 95.3 96.8 96.3 97.8 Table 2.4: Manual Evaluation via Pair-wise Comparisons for Ranking. Numbers are hit rates (%) at top 1/3/5. To estimate human performance within each metric, we treat each reference sentence in dev/test data as a “system prediction” to be compared with all other references, which is equivalent to compute inter-annotator agreement within each metric. Thus, systems that have better generative ability than average crowd-workers should exceed this. 2.2.5.2 Experimental Results Automatic Evaluation. Table 2.3 presents the experimental results in a variety of metrics. We can see that all fine-tuned pre-trained models (the lower group) outperform non-pretrained models (the upper group) with a significant margin. This is not surprising because their pretraining objectives, including masked language modeling, word ordering, and text infilling which predicts missing words or text spans, are relevant to our task. On the other hand, we find that the key disadvantage of non-pretrained models with CopyNet still falls in the failure of using all given concepts (i.e., low coverage), which results in worse results. Among them, UniLM, BART, and T5 performs the best, which may be due to its inherent sequence-to-sequence pre-training framework. We found that BART has the best concept cover- age, which is probably due to its comprehensive pre-training tasks that aim to recover text with noise. The results suggest that further modifying pre-trained models is a promising direction for generative commonsense. Manual Evaluation. We conduct manual evaluation with a focus on commonsense plausibility for comparing the 6 best-performing models in Table 2.4. We ask five graduate students to compare 24 [bRNN-CopyNet]: a hand works in the sink . [MeanPooling-CopyNet]: the hand of a sink being washed up [ConstLeven]: a hand strikes a sink to wash from his soap. [GPT-2]: hands washing soap on the sink. [BERT-Gen]: a woman washes her hands with a sink of soaps. [UniLM]: hands washing soap in the sink [BART]: a man is washing his hands in a sink with soap and washing them with hand soap. [T5]: hand washed with soap in a sink. 1. A girl is washing her hands with soap in the bathroom sink. 2. I will wash each hand thoroughly with soap while at the sink. 3. The child washed his hands in the sink with soap. 4. A woman washes her hands with hand soap in a sink. 5. The girl uses soap to wash her hands at the sink. 
Concept-Set: { hand, sink, wash, soap } Figure 2.8: A case study with a concept-set {hand, sink, wash, soap} for qualitative analysis of machine generations. Human references are collected from AMT. 1,500 pairs of model-generated sentences respectively, for ranking the models within 100 concept- sets that are covered by all the models. The final average ranked results are shown in Table 2.4 and their inter-annotator agreement is 0.85 in Kendall’s rank correlation coefficient . Note that the coverage-weighted hit@1 rate correlates with the SPICE metric the most, i.e., 0.94 in Spearman’s ρ for model ranks, CIDEr for 0.91, while METEOR and ROUGE-2 are both 0.88 and BLEU-4 is 0.78. Case study. Fig. 2.8 shows the top generations of different models and human references about an input concept-set: {hand, sink, soup, wash} . We find that non-pretrained seq2seq models (e.g., bRNN, MeanPooling, ConstLeven) can successfully use part of given concepts, while the 25 Training Steps Accuracy Figure 2.9: Learning curve for the transferring study. We use several trained COMMONGEN (GG) models to generate choice-specific context for the CSQA task. generated sentences are less meaningful and coherent. On the contrary, the outputs of fine-tuned pre-trained language models are significantly more commonsensical. Most of them use all given concepts in their outputs. ConstLeven tends to make use of frequent patterns to compose a non- sense sentence but uses all concepts. GPT-2 and UniLM incorrectly compose the dependency among hand, wash, and soap. The phrase ‘a sink of soaps’ in BERT-gen’s output makes itself less common. BART and T5 generate relatively reasonable scenarios, but both are not as natural as human references; BART’s contains repetitive content while T5’s lacks a human agent. Influence of Dynamic Beam Allocation. Considering that all tested models decode sentences with beam searching, one may wonder what if we use a decoding method specially designed for constrained decoding. Thus, we employed dynamic beam allocation (DBA) [129]. The results are shown in Table 2.5. Note that the models are the same as in Table 2.3 while only the decoding 26 Model\ Metrics ROUGE-2/L BLEU-3/4 METEOR CIDEr SPICE Coverage T5-large+DBA 16.8 36.71 27.3 18.7 25.3 8.62 24.3 83.98 T5-base+DBA 15.07 34.82 24.8 16 23.5 9.31 21.3 76.81 GPT-2+DBA 17.56 39.45 29.4 20.6 24.9 10.85 26.8 79.51 BART+DBA 18.15 37.02 28.3 19.1 25.5 9.82 25.1 84.78 Table 2.5: Experimental results of models with DBA decoding method on the test set. method is changed to DBA. We can see that all methods are negatively impacted by the decoding method. This suggests that for the COMMONGEN task and pre-trained language models, we may need to focus on knowledge-based decoding or re-ranking as future directions. 2.2.5.3 Transferring CommonGen Models One may wonder how fine-tuned C OMMONGEN models can benefit commonsense-centric down- stream tasks such as Commonsense Question Answering [161] (CSQA) with their generative com- monsense reasoning ability. To this end, we use the models trained with the COMMONGEN dataset for generating useful context. We extract the nouns and verbs in questions and all choices respectively, and combine the concepts of the question q and each choice c i to build five concept-sets. Then, we use these concept- sets as inputs to a trained COMMONGEN model (e.g., T5) for generating scenario a sentence g i for each as choice-specific contexts. Finally, we prepend the outputs in front of the questions, i.e., “G: g i | Q: q C: c i ”. 
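A minimal sketch of this context-construction step is shown below; it assumes spaCy for part-of-speech tagging and a thin `generate(concepts)` wrapper around a COMMONGEN-finetuned model such as T5 (both are our assumptions rather than the exact pipeline).

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any POS tagger would do

def extract_concepts(text):
    """Keep the lemmatized nouns and verbs of a question or answer choice."""
    return [tok.lemma_ for tok in nlp(text) if tok.pos_ in ("NOUN", "VERB")]

def build_choice_input(question, choice, commongen_generate):
    """Build the choice-specific input "G: g_i | Q: q C: c_i" described above.

    commongen_generate: callable mapping a list of concepts to a generated sentence.
    """
    concepts = extract_concepts(question) + extract_concepts(choice)
    g_i = commongen_generate(concepts)          # choice-specific context sentence
    return f"G: {g_i} | Q: {question} C: {choice}"
```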
Note that the state-of-the-art RoBERTa-based models for CSQA uses the same form without “G: g i |” in fine-tuning. We show the learning-efficiency curve in Fig. 2.9, where y is the accuracy on the official dev set and x is the number of training steps. The details of the experiments are shown in the appendix. We highlight the performance of original RoBERTa-Large as the baseline. We find that some CommonGen models further improves the performance by a large margin, e.g., 76.9 UniLM − −−− → 78.4 and they converge at better accuracy in the end. Note that BERT-gen and ConstLeven cause neg- ative transfer due to the low quality of generated context. Particularly, we find that the context 27 generated by the T5-based CommonGen model (CG-T5) helps speed up training about 2 times, if we look at 550th steps of CG-T5 (74.85%) and 1,250th steps of original RoBERTa (74.77%). Through manual analysis, we find that the successful C OMMONGEN models can generate more reasonable and natural sentences for correct choices while noisy sentences for wrong choices. For example with CG (T5), q=“What do people aim to do at work?”, c i =‘complete job’ (✓) with g i =“people work to complete a job aimed at achieving a certain goal.”; c j =‘wear hats’ (✗) g j =“people wearing hats aim their guns at each other while working on a construction site.”Learning curve for the transferring study The used question concepts and choice concepts are underlined. 2.2.6 Related Work Commonsense benchmark datasets. There are many emerging datasets for testing machine commonsense from different angles, such as commonsense extraction [192, 90], next situation pre- diction (SWAG [209], CODAH [18], HellaSWAG [208]), cultural and social understanding [100, 148, 149], visual scene comprehension [207], and general commonsense question answering [161, 61, 178, 180]. However, the success of fine-tuning pre-trained language models for these tasks does not necessarily mean machines can produce novel assumptions in a more open, realistic, generative setting. We see COMMONGEN as a novel, complementary commonsense reasoning benchmark task for advancing machine commonsense in NLG. Constrained Text Generation. Constrained text generation aims to decode sentences with ex- pected attributes such as sentiment [109, 60], tense [60], template [218, 63], style [43, 107, 89], topics [40], etc. Two related scenarios with our task is lexically constrained decoding and word or- dering [214, 55, 34, 58, 130, 115]. However, they are not easily adopted by the recent pre-trained language models and thus not directly useful for our task. Topical story generation [39, 203] is also a related direction, while it targets generating longer, creative stories around the given topics, making it hard to directly adopt them to our task. Additionally, the COMMONGEN task brings some more challenges mentioned in Section 2.2.2. Prior constrained generation methods cannot address these issues together in a unified model. 28 Incorporating Commonsense for NLG. There are a few recent works that incorporate common- sense knowledge in language generation tasks such as essay generation [52, 198], image caption- ing [106], video storytelling [199], and conversational systems [211]. These works suggest that generative commonsense reasoning has a great potential to benefit downstream applications. 
Our proposed COMMONGEN, to the best of our knowledge, is the very first constrained sentence gen- eration dataset for assessing and conferring generative machine commonsense and we hope it can benefit such applications. Our transferring study in Sec. 2.2.5.3 also shows the potential benefits of CommonGen-generated contexts. 2.2.7 Conclusion Our major contribution in this paper are threefold: • we present COMMONGEN, a novel constrained text generation task for generative common- sense reasoning, with a large dataset; • we carefully analyze the inherent challenges of the proposed task, i.e., a) relational reasoning with latent commonsense knowledge, and b) compositional generalization. • our extensive experiments systematically examine recent pre-trained language generation models (e.g., UniLM, BART, T5) on the task , and find that their performance is still far from humans, generating grammatically sound yet realistically implausible sentences. Our study points to interesting future research directions on modeling commonsense knowledge in the language generation process, towards conferring machines with generative commonsense reasoning ability. We hope COMMONGEN would also benefit downstream NLG applications such as conversational systems and storytelling models. 29 Chapter 3 Evaluating the Generalization of Commonsense Reasoning 3.1 Multilingual Generalization of the CSR Ability 3.1.1 Introduction Understanding natural language relies heavily on commonsense reasoning (CSR), which is the process of making inferences with commonsense knowledge. Commonsense knowledge is the set of general facts that reflect our natural understanding of the physical world and human behavior, which are usually seen as an implicit background when people communicate with each other using languages. It is thus of vital importance to evaluate and improve the commonsense reasoning capability of language models (LMs), towards building general natural language understanding (NLU) systems [28]. Many recent benchmark datasets and probing methods have been proposed to evaluate machine common sense. As shown in Figure 7.1, the LAMA probe [126] is for analyzing LMs’ zero-shot commonsense recalling ability; CommonsenseQA (CSQA) [talmor2018commonsenseqaaq] is instead a multiple-choice QA task that needs fine-tuning; CODAH [18] and SWAG [210] focus on the ability to complete the most plausible scenes. However, all these works have been limited only to English. Consequently, follow-up analysis and reasoning methods developed [98, 41, 93] also focus only on English LMs like BERT [30]. Such English-centric trend of commonsense reasoning 30 Birds have [mask] . LAMA Probe Where do adults usually use glue sticks? A) school B) drawer C) office CommonsenseQA SWAG/CODAH The chef drops the piece of shrimp in the fryer. → A) The chef chops the pan. B) The chef watches it sizzle. C) The chef likes fried chicken. English LM Multilingual LM EN AR ZH FR HI IT DE ES JA PL NL PT UR VI RU SW wing C B Figure 3.1: Commonsense reasoning is well-studied with benchmarks and LMs in English. Can we advance commonsense reasoning beyond English? studies not only limits our research scope, but also tends to exacerbate English-specific bias that might prevent future methods from generalizing beyond English [128]. 
It is of pressing urgency for the community to develop NLU systems that can serve all lan- guages in the world to bridge the gap between different cultures and eliminate language barri- ers [59], and multilingual language models (ML-LMs), such as XLM-R [26], are among the most promising tools to achieve this ambitious goal. Although ML-LMs have been evaluated in a few NLU tasks, e.g., XNLI [27] and XTEMRE [59], it is still relatively unclear how ML-LMs perform in commonsense reasoning tasks, due to the lack of 1) dedicated methods for probing common sense in ML-LMs and 2) multilingual benchmark datasets for commonsense reasoning. 31 To analyze how much common sense ML-LMs already have without any tuning, we propose MICKEYPROBE, a zero-shot probing task. It tasks a ML-LM to rank a set of contrastive assertions (i.e., declarative sentences) in the same language by their commonsense plausibility, for which we use pseudo-likelihood (PLL) [143] as a proxy. Unlike the LAMA probe, it can study multi-token concepts which are ubiquitous in some non-English languages. In addition, it fairly compares performance across different languages via a language-invariant evaluation protocol. Alongside the probing task, we also create MickeyCorpus, a large-scale multilingual dataset, consisting of 561k sentences in 11 different languages. Our experiments reveal that there are always large discrepancies across different languages in the tested ML-LMs, and different ML-LMs show very different language preferences. Beyond supervision-free analysis of ML-LMs, we also study their performance in common- sense reasoning tasks, such as CSQA and CODAH, within a cross-lingual transfer setting (i.e., trained on English data and tested on other languages). We find that existing ML-LMs tend to have much lower accuracy in commonsense reasoning beyond English. We conjecture a major com- mon weakness of existing ML-LMs is that their pretraining stages do not have a proper sentence- level objective. Therefore, we propose multilingual contrastive pre-training (MCP), which tasks a ML-LM to select the correct assertion out of a set of N contrastive assertions in N different lan- guages. We re-formatMickeyCorpus by sampling across languages and thus form a dedicated pre-training corpus for the MCP task. To fairly evaluate different ML-LMs and validate the effec- tiveness of MCP, we create X-CSQA and X-CODAH, two cross-lingual commonsense reasoning datasets by translating their English versions to 15 other languages 1 , including low-resource ones such as Swahili (sw) and Urdu (ur). Experiments show that the proposed MCP objective indeed significantly improves the performance of state-of-the-art ML-LMs in cross-lingual commonsense reasoning. Our contributions are as follows: • Resources. We collect a large multilingual parallel corpus,MickeyCorpus, consisting of 561k sentences in 11 languages, which can be used for analyzing and improving ML-LMs. 1 The 16 languages for X-CSQA and X-CODAH: {en, zh, de, es, fr, it, jap, nl, pl, pt, ru, ar, vi, hi, sw, ur}. 32 We also createX-CSQA andX-CODAH, two cross-lingual CSR benchmarks in 16 languages, for question answering and scene completion, respectively. • Evaluation and analysis. We analyze multiple popular ML-LMs with MICKEYPROBE, a language-invariant, zero-shot task for probing common sense in ML-LMs; We also evaluate them on X-CSQA and X-CODAH in a cross-lingual transfer setting. • Method to improve ML-LMs. 
We propose multilingual contrastive pretraining, a simple and effective sentence-level pretext task for enhancing ML-LMs in cross-lingual common- sense reasoning, which significantly improves the state-of-the-art ML-LMs in cross-lingual commonsense reasoning. 3.1.2 Background and Related Work In this section, we introduce important concepts, background knowledge, and related work before we present our work in the following sections. 3.1.2.1 Multilingual Language Models A multilingual language model (ML-LM) aims to produce text representations for multiple lan- guages in a unified embedding space. One of the unique advantages of ML-LMs is their potential ability to perform zero-shot cross-lingual transfer — a model trained (or fine-tuned) on data in one language (usually English) can be directly used in other languages as well without further fine-tuning. Improving ML-LMs is thus believed as one of the most promising approach towards multilingual NLU at scale. mBERT [30] is simply the BERT model [30] trained on multilingual corpora without specific designs about multilinguality. The distil-mBERT (d-mBERT) [145] is a smaller mBERT trained by knowledge distillation. Conneau and Lample (2019) proposed XLM(- 100), which is pretrained with both masked language modeling (MLM) and translation language modeling (TLM). Conneau et al. (2020) further proposed XLM-R, which improves the XLM with a better sub-token vocabulary and high-quality multilingual corpora (CC100). We leave the analysis of recent ML-LMs, such as mBART [103], mT5 [194], and InfoXLM [19] as future work. 33 Note that the above ML-LMs are pretrained only with token-level training objectives such as MLM (i.e., recovering masked tokens in monolingual text) and TLM (i.e., recovering masked tokens in a pair of parallel sentences in two different languages). However, most NLU tasks, including commonsense reasoning, highly rely on sentence-level representations. We argue that a well-designed sentence-level pre-training objective should improve ML-LMs for NLU tasks. This intuition motivates us to propose a sentence-level pre-training objective — MCP (Section 3.1.5). 3.1.2.2 Cross-lingual Language Understanding There are a few recent multilingual benchmarks for NLU tasks, e.g., XTREME[59], TyDi QA[21], and XGLUE[92]. XTREME and XGLUE are unified large-scale multilingual multitask bench- marks, while Ty-Di QA focuses on the QA. These existing cross-lingual benchmarks have not cov- ered commonsense reasoning tasks, such as CSQA [talmor2018commonsenseqaaq], SWAG [210], and CODAH [18]. CSQA is a question answering task and the other two are scene completion tasks, while all have a multiple-choice selection objective, as shown in Figure 7.1. These benchmarks are widely used to evaluate LMs for commonsense reasoning. Unfortunately, they are limited to English, not applicable to evaluate models of multilingual commonsense knowledge, which motivates us to cre- ate X-CSQA and X-CODAH. The goal of the recent XCOPA [128] dataset shares a similar goal, but it only focused on event-based causal reasoning in the scope of humans’ social behavior, which is thus arguably more culturally biased. In contrast, the X-CSQA and X-CODAH are mainly for evaluating general world knowledge and cover more fine-grained types of reasoning (e.g., quan- titative, negation), and thus engage a more language-agnostic, comprehensive understanding of ML-LMs about common sense. 
3.1.2.3 The LAMA Probe and Its Limitations

The LAMA Probe [126] is the seminal work on probing for common sense in (English) language models. It has a straightforward intuition: if a pretrained language model contains more commonsense knowledge, then it should be better at recalling a masked token in a commonsense assertion (e.g., "birds have [mask]"). Specifically, given a LAMA-probe sentence $s$ and its masked token $w_t$, the LM under testing uses all past and future tokens, $s_{\setminus t} := w_1, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{|s|}$, as the input to rank all tokens in the vocabulary by the probability $P(w_t \mid s_{\setminus t})$ via zero-shot inference. One can evaluate the performance of recalling common sense by measuring the position of the correct token "wing" in the ranked list. That is, the LAMA probe method uses token-level probability as a proxy to probe for common sense in LMs by ranking all tokens in their vocabularies.

This intuitive method, however, has several inherent limitations. First, in many other languages, multi-token concepts are ubiquitous, for example, "图书馆" ("library" in Simplified Chinese). Jiang et al. (2020) present several methods to decode multi-token entities so that they can adapt the LAMA probe to probe a LM for language-specific analysis. It is, however, infeasible to use token-level probing tasks if we want to analyze ML-LMs across languages. In addition, the evaluation metric of the LAMA probe can be unfair, because there may be many correct words for a masked position (e.g., "birds have legs/eyes"). The ranking metrics of the LAMA probe tend to ignore these facts, resulting in a less trustworthy analysis. The vocabulary-specific ranking is also unfair when comparing across different languages, as they can have very different label spaces. These limitations of the LAMA Probe prevent us from analyzing common sense in ML-LMs across typologically diverse languages.

We later found that [69] proposed mLAMA, a contemporaneous work extending the LAMA probes from English to other languages via machine translation, sharing a similar goal to ours. mLAMA focuses on factual knowledge about named entities via an entity-retrieval objective for mBERT only, while our MICKEYPROBE aims to address commonsense knowledge via a sentence-ranking objective for more ML-LMs.

3.1.3 The Mickey Probe

The challenges of using the LAMA Probe for probing common sense in ML-LMs motivate us to propose a more suitable method for analyzing ML-LMs, one that can fairly compare across a diverse set of languages. We present MICKEYPROBE, a multilingual task for probing and analyzing commonsense knowledge. We design a language-agnostic probing task with a sentence-selection objective for analyzing the common sense of a ML-LM: given a set of assertions (i.e., declarative sentences) that have similar words and syntactic features, select the one with the highest commonsense plausibility. We present the task formulation in this section and then introduce how we collect the dedicated dataset in Section 3.1.4.

Notations. We define a Mickey probe $M$ as a set of $K$ assertions in the same language, where one and only one of them (say, $M_i$) is the truth assertion with better commonsense plausibility than the other $K-1$ ones. Each Mickey probe $M$ has multiple semantically equivalent versions in different languages. Let us denote a language by $l \in \mathcal{L}$, where $\mathcal{L} = \{en, fr, ru, zh, \ldots\}$ and $|\mathcal{L}|$ is the number of languages of interest. Then, $M^l$ is the probe $M$ in language $l$.
For example, $M^{en}$ and $M^{fr}$ denote the probes with the same meaning in English (en) and French (fr) respectively. We use $\mathcal{M}$ to denote a multilingual parallel dataset for MICKEYPROBE, which consists of $T \times |\mathcal{L}| \times K$ assertions. $T$ is the number of MICKEYPROBE items, and each item has $K$ assertions in each of the $|\mathcal{L}|$ languages. Finally, we can formally describe a multilingual parallel dataset $\mathcal{M}$ for MICKEYPROBE:

$$\forall M \in \mathcal{M},\ \forall (l_x, l_y) \in \mathcal{L}^2,\ \forall i \in \mathbb{N}_{\le K}:\quad M^{l_x}_i \bowtie M^{l_y}_i. \quad (3.1)$$

We use the notation $\bowtie$ to indicate that two assertions in different languages (e.g., $l_x$ and $l_y$) are semantically equivalent to each other. We leave the details of creating such an $\mathcal{M}$ to Section 3.1.4.

[Figure 3.2: A Mickey Probe example M has a set of probes in different languages (e.g., M^{en/zh}), and each of them is a set of 5 assertions (e.g., "The effect of reading the news is learning about the world." as the truth among four distractors). We rank the assertions in the same language by their PLLs to probe common sense in ML-LMs across different languages.]

Commonsense Probing Task. Given an instance $M$ for MICKEYPROBE in the dataset $\mathcal{M}$, and supposing the index of the truth assertion is $t$, a perfect multilingual language model would produce sentence probabilities such that it always gives the truth assertion $M^l_t$ the highest probability among the candidates for every language:

$$\forall l \in \mathcal{L},\ \forall i \in \mathbb{N}_{\le K}:\quad P(M^l_i) \le P(M^l_t). \quad (3.2)$$

While it is still an open problem to properly compute sentence probabilities from masked language models, the recently proposed pseudo-log-likelihood scoring (PLL) [143] has shown promising results in many downstream NLP applications that need sentence re-ranking (e.g., speech recognition and translation), suggesting that it is a promising proxy for sentence probability. Given a sentence $s$, its PLL is defined as:

$$\log P(s) = \mathrm{PLL}(s) := \sum_{i=1}^{|s|} \log P(w_i \mid s_{\setminus i}) \quad (3.3)$$

That is, we mask each token $w_i$ one at a time and use the remaining context $s_{\setminus i}$ to get the probability of the word $w_i$ in the sentence $s$. Finally, we aggregate these probabilities to approximate $P(s)$.

Models \ L     en      de      it      es      fr      nl      ru      bg      vi      zh      hi      avg
BT-Cosine      1.0     0.937   0.936   0.935   0.934   0.933   0.901   0.901   0.882   0.879   0.869   0.919
CC-size (GB)   300.8   66.6    30.2    53.3    56.8    29.3    278.0   57.5    137.3   46.9    20.2    97.9
Shortest       23.17   27.21   29.93   31.00   35.84   31.68   18.55   22.01   15.46   25.07   20.66   25.51
d-mBERT        62.95   34.56   25.26   34.85   50.46   32.39   21.49   29.14   19.77   32.57   25.88   33.57
mBERT          63.56   35.58   29.13   44.70   42.58   35.15   28.30   36.03   24.04   28.15   27.85   35.92
XLM-100        60.57   36.33   26.49   43.39   32.53   36.24   32.90   39.71   25.79   33.01   31.49   36.22
XLM-R_B        89.69   58.94   53.45   60.88   49.12   59.99   45.74   45.26   41.65   51.02   40.73   54.22
XLM-R_L        90.03   61.98   53.42   63.68   59.47   63.12   50.03   47.01   45.30   55.93   43.98   57.63

Table 3.1: The hit@1 accuracy (%) of the five ML-LMs for the MICKEYPROBE task.

Evaluation Metric. The evaluation metric for MICKEYPROBE over a multilingual parallel dataset $\mathcal{M}$ in a specific language $l$ is the overall hit@k accuracy of the selection results, $\mathrm{hit@}k(l) = \sum_{M \in \mathcal{M}} \mathbf{1}\{\text{truth-rank}(M^l) \le k\} / |\mathcal{M}|$, where truth-rank$(M^l)$ is the position of the truth assertion $M^l_t$ in $M^l$ when sorted by the probabilities defined in Eq. (3.3).
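As a concrete reference for Eq. (3.3), the following is a minimal PLL scorer built on HuggingFace Transformers; we use XLM-R base as an illustrative checkpoint and omit the batching needed for efficiency.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").eval()

@torch.no_grad()
def pll(sentence: str) -> float:
    """Pseudo-log-likelihood (Eq. 3.3): mask each token in turn and sum the
    log-probabilities of the original tokens given the remaining context."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):                  # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Each Mickey probe is answered by ranking its K assertions by PLL (argmax = prediction),
# from which the hit@k accuracy above is computed.
```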
The hit@1 is just equivalent to the conventional accuracy. Advantages of MICKEYPROBE. There are two key advantages of the MICKEYPROBE for eval- uating ML-LMs: (1) The sentence-level probability can be more generally applied in languages besides English, comparing with the LAMA probe which only studies single-token English words. (2) The task formulation creates a relatively closed-ended setting, such that we can use a language- independent evaluation metric to fairly compare across various languages within an ML-LM and compare across various ML-LMs for a particular language. In addition, we can see LAMA Probe 38 as a monolingual, word-level version of the more general MICKEYPROBE: the LAMA Probe is whenL ={en}, and{M en }= M∈M is a huge number of K assertions (i.e., the vocabulary size) — a fixed [mask] is replaced by all tokens in the vocabulary. 3.1.4 The Mickey Corpus and Evaluation We present a procedure for automatically creating a multilingual parallel datasetM for the probing task MICKEYPROBE. Our collected corpus, namedMickeyCorpus , has 561k sentences in 11 languages (T =10.2k, K=5,|L|=11). 3.1.4.1 Creating English Probes For the correct commonsense assertions in English, we have an existing resource, the OMCS corpus [154] which contains human-written sentences in English that describe commonsense facts. Each assertion can be used as a M en t and we perform perturbations on it to create the other K− 1 distractor assertions (i.e., false candidates), yielding an M en example. Inspired by BERT-attack method [Li2020BERTATTACKAA], we use a simple method to gen- erate false assertions that are semantically related and syntactically similar to the truth assertions. Given a correct assertion, we first randomly sample a few (1 ∼ 3) words with a part-of-speech tag as noun, verb, or adjective, and replace them with [mask]. Then, we use a beam-search style method to decode the [mask] tokens one by one from left to right. To ensure that the distractors are less plausible, we limit the decoding steps to only sample tokens that ranks between 200th∼ 300th. We repeat the above procedure multiple times with different sets of [mask] tokens. Then, we use Stanza [131] to remove distractors that have sequences of POS tags or morphological features different from the truth assertions. Finally, we sample K− 1 of them as the distractors. 3.1.4.2 Scaling to Ten Other Languages. We use bidirectional translation with the MarianMT models [67] pretrained on the OPUS cor- pora [166]. We translate all English probes to the 25 languages that has models in both directions 39 Figure 3.3: The MICKEYPROBE results in hit@1-acc. and then translate them back to English. As the outputs from these models might contain noise and errors, we compute the semantic similarities (i.e., cosine similarity) between the original M en and the back-translated M x-en via the SentenceBERT [Reimers2019SentenceBERTSE] model. To ensure the quality and fair comparisons, we set a similarity threshold as 0.75 and keep the intersections of probes in all languages. Considering some languages tend to have translations of lower quality, we finally choose the best 10 languages to build the Mickey Probe dataset for our analysis, yielding 10k examples in each language and 10.2k*5*11≈ 561k sentences in total. The language setL ={en,de, f r,ru,es,hi,vi,bg,zh,nl,it}. Note that our purpose of checking the back-translation quality here is mainly to only keep the high-quality translations for all language pairs that we considered. 
Conventional metrics, e.g., BLUE score [124], which focus on the exact word match, are thus less suitable: given the original sentence “I have a book”, the translation results “I have a novel” and “I have a tool” will be seen as equally wrong. Inspired by BERTScore [213], the BT-cosine is based on SentenceBERT, which efficiently gives a higher score for the former and a lower score for the latter, due to the semantic relatedness between “novel” and “book.” We observed that most of our back-translations are in similar situations, and thus decide to use BT-cosine instead of others. 40 3.1.4.3 Analyzing ML-LMs with Mickey Probes We now use the MickeyCorpus to evaluate the 5 pre-trained ML-LMs introduced in Sec- tion 3.1.2.1: d-mBERT [145], mBERT [30], XLM [25], XLM-R Base , and XLM-R Large [26]. All these ML-LMs pretraining objectives contain masked-word-prediction tasks, so we can easily use PPLs (Eq. 3.3) to probe them a zero-shot, supervision-free manner with hit@1 accuracy. (The hit@2 results are shown in Appendix.) We present a histogram in Figure 3.3 and show the concrete results in Table 3.6. We find that there are always large discrepancies across different languages in all tested ML-LMs, which motivates us to analyze the following questions. Q1: Do different ML-LMs have similar language preferences? No. We arrange the languages in all ML-LMs with the same order for Figure 3.3 — the monotonically descending order of XLM- R L . Interestingly, we find that different ML-LMs are good for different languages, resulting in a very diverse set of trends. For example, XLM-R B , has a higher performance in it than zh and fr, unlike XLM-R− L which are pre-trained on the same corpora with the same objectives. mBERT and d-mBERT has stronger performance in fr than nl and de, unlike XLM and XLM-R. Q2: Does length influence PLL ranking? Not much. The PLL computation indeed tends to prefer shorter sequences (see Eq. 3.3), so one may wonder if the length of assertions would influence the probing results. The “Shortest” row in Table 3.6 presents the results when we always select the shortest assertion within a probe, instead of PLL ranking. The gaps between these scores and XLM-R-L’s suggest that the probing task indeed uses PLL as a valid proxy for evaluating common sense based on sentence-level semantics. Q3: Is the translation quality a key factor? We show “BT-Cosine”, the mean of the cosine scores between the original English sentences and the back-translated ones, and sort the table by these numbers. The first 5 languages, {de, it, es, fr, nl} have the largest BT-Cosine, i.e., the best translation quality, and they indeed have better performances in general for XLM-R models. However, although zh has a worse BT-score than vi, all ML-LMs perform better in zh than vi. Thus, we believe the translation quality ofMickeyCorpus will not be a factor to influence our 41 understanding of ML-LMs. Consequently, this suggests that further study must depend on pre- training corpora of each ML-LM in different languages. Q4: Does the size of pre-training corpora matter? We list the size of the monolingual corpus in each language for CC-100 that XLM-R are pre-trained on (i.e., the CC-size row). Although ru has a much larger corpus than de, it, etc., the XLM-R performance in ru is much worse. In addition, fr and nl have almost the same translation quality while fr’s CC-size is twice the size of nl, but the performance in fr is still much worse than nl. 
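For reference, the BT-Cosine filter used in Section 3.1.4.2 can be sketched as follows with the sentence-transformers library; the checkpoint name and helper functions here are illustrative assumptions rather than the exact implementation.

```python
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # any SentenceBERT-style checkpoint

def bt_cosine(original_en, back_translated_en):
    """Cosine similarity between an original English assertion and its round-trip
    (English -> target language -> English) back-translation."""
    emb = sbert.encode([original_en, back_translated_en], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def keep_probe(round_trips, threshold=0.75):
    """Keep a probe only if the round-trip for every language clears the threshold."""
    return all(bt_cosine(orig, back) >= threshold for orig, back in round_trips)
```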
We conjecture this would be due either to the design of the sub-token vocabulary or to the text quality (rather than the size) of the CC-100 corpora.

Further implications. The benchmark results of five popular ML-LMs on the MICKEYPROBE task over the MickeyCorpus offer an initial, valuable understanding and a closer look at the commonsense knowledge of ML-LMs, by probing them under a unified evaluation protocol. One can either compare a ML-LM across different languages or compare a certain language across ML-LMs in Table 3.6. These comparable results support further analysis that can benefit the development of ML-LMs in the future. After all, even the best ML-LM, XLM-R_L, degrades considerably in languages other than English, and it also performs slightly worse than RoBERTa_L in en (93.4%). We argue that (culture-invariant) commonsense knowledge should be seen as an important way to connect multiple languages and thus better align them in the shared embedding space induced by a ML-LM.

3.1.5 Multilingual Contrastive Pre-Training

In this section, we reformat the MICKEYPROBE so that we can reuse the MickeyCorpus for improving pre-trained ML-LMs for commonsense reasoning beyond English. We propose a multilingual contrastive pre-training (MCP) task that focuses on enhancing the sentence-level representations of ML-LMs. MCP improves a ML-LM in a multilingual, contrastive environment, where the model learns to select the assertion with the best commonsense plausibility from a set of contrastive sentences in different languages. Each MCP example is a set of multilingual assertions, while each Mickey probe is a monolingual set.

MCP Dataset Creation from $\mathcal{M}$. We create pretraining examples for the MCP task by converting MICKEYPROBE examples, following the steps illustrated in Algorithm 1. Simply put, we reformat a K-way Mickey probe M (K × |L| assertions) into a MCP example by sampling a set of V candidate assertions in V different languages. We convert all examples in the MickeyCorpus $\mathcal{M}$ to build a new cross-lingual sentence-selection dataset $\mathcal{C}$ for learning the MCP task.

MCP Learning. Given a MCP example $C \in \mathcal{C}$, we append one dense linear layer $f$ on top of a ML-LM, with parameters denoted as $\Theta_{\text{ML-LM}}$, to learn to predict the commonsense plausibility score of each assertion $C_i \in C$ as follows:

$$h_i = \text{ML-LM}(C_i).\texttt{[CLS]} \quad (3.4)$$
$$o_i = f(h_i; \Theta_f) \quad (3.5)$$
$$z_i = \frac{e^{o_i}}{\sum_{j=1}^{V=|C|} e^{o_j}} \quad (3.6)$$
$$\rho = \sum_{i=1}^{V} -\mathbf{1}_i \log(z_i) \quad (3.7)$$

We first get the logit $o_i$ of each assertion by projecting its [CLS] embedding $h_i$ through a dense layer $f$ with parameters $\Theta_f$; then we use softmax to normalize the logits into plausibility scores $z_i$; finally, we compute the cross-entropy loss $\rho$, where $\mathbf{1}_i = 1$ if $C_i$ is a correct assertion and 0 otherwise. We fine-tune $\{\Theta_{\text{ML-LM}}, \Theta_f\}$ to minimize the overall loss over the MCP dataset $\mathcal{C}$.

3.1.6 Evaluation for Cross-lingual CSR

In this section, we introduce the datasets, experimental setup, results, and our analysis.

3.1.6.1 X-CSQA & X-CODAH: Two New Benchmarks for Evaluating XCSR

To evaluate ML-LMs for commonsense reasoning in a cross-lingual zero-shot transfer setting, we create two benchmark datasets, namely X-CSQA and X-CODAH. Table 3.3 shows the statistics

Algorithm 1: Convert a Mickey probe M into an example for the MCP task.
In: $M \in \mathcal{M}$ /* a probe that has $|\mathcal{L}|$ sub-sets; each sub-set $M^{l_x}$ is a set of K assertions in the same language $l_x \in \mathcal{L}$, and $M^{l_x}_t$ is always the truth. */
Out: $C$ /* a set of V assertions in different languages. */
Remarks: $\Gamma_n(X)$ is a function that randomly samples n unique elements from a set X.
1 l a ← − Γ 1 (L) /* Pick an anchor language. */ 2 C← −{ M l a t } /* Initiate w/ the truth assertion. */ /* Iterate each sampled distractor language l i . */ 3 foreach l i ∈Γ V− 1 (L− l a ) do /* Sample an index of distractor assertion. */ 4 j← − Γ 1 (N ≤ K −{ t}) /* Add a distractor assertion as a candidate. */ 5 C.add(M l i j ) en de it es fr nl ru vi zh hi pl ar ja pt sw ur avg CC-size (GB) 300.8 66.6 30.2 53.3 56.8 29.3 278.0 137.3 46.9 20.2 44.6 28.0 69.3 49.1 1.6 5.7 76.10 X-CODAH [Task: Scene Completion; Random Guess: 25.0; RoBERTa L for en: 81.6 ] mBERT 42.9 33.1 33.5 33.8 35.2 33.7 31.9 22.8 38.0 26.5 31.0 34.8 34.0 37.2 30.8 31.5 33.2 XLM-100 42.7 31.5 32.2 30.7 34.9 32.6 30.9 24.7 31.4 26.8 27.0 30.0 27.4 33.2 25.3 24.9 30.4 XLM-R-B 50.1 45.8 44.4 44.2 45.2 42.0 44.1 43.2 44.6 38.1 41.9 37.8 42.0 44.1 35.6 34.6 42.4 XLM-R-L 66.4 59.6 59.9 60.9 60.1 59.3 56.3 57.4 57.3 49.1 57.5 51.2 53.8 58.2 42.2 46.6 56.0 MCP(XLM-R B ) 52.2 47.6 46.2 44.4 48.1 44.8 42.9 43.2 45.7 37.8 41.8 41.8 42.9 44.7 37.2 36.4 43.6 MCP(XLM-R L ) 69.9 60.7 61.9 60.7 61.4 60.7 58.6 62.3 61.9 53.7 59.0 54.1 54.7 60.8 44.6 48.0 58.3 ∆(XLM-R L ) +3.5 +1.1 +2.0 -0.2 +1.3 +1.4 +2.3 +4.9 +4.6 +4.6 +1.5 +2.9 +0.9 +2.6 +2.4 +1.4 +2.3 X-CSQA [Task: Question Answering; Random Guess: 20.0; RoBERTa L for en: 70.4 ] mBERT 38.8 29.6 36.4 35.3 33.8 32.6 32.7 22.2 37.8 21.1 27.2 27.7 31.4 34.1 21.8 23.7 30.4 XLM-100 34.3 26.7 28.5 29.3 28.3 27.2 29.9 21.1 28.6 22.1 26.6 26.3 25.1 30.9 20.1 21.7 26.7 XLM-R B 51.5 44.1 42.1 44.8 44.0 43.3 39.5 42.6 40.6 34.6 40.2 38.4 37.5 43.4 29.6 33.0 40.6 XLM-R L 66.7 56.1 58.2 59.5 60.3 56.8 52.1 51.4 52.7 48.7 53.9 48.4 50.0 59.9 41.6 45.2 53.8 MCP(XLM-R B ) 52.1 46.2 45.6 44.3 44.7 45.3 42.8 45.3 44.3 36.8 41.4 36.8 37.5 44.9 28.1 33.4 41.9 MCP(XLM-R L ) 69.5 59.3 60.3 61.4 60.0 61.1 57.5 55.7 56.7 51.3 56.1 52.3 50.2 60.7 43.3 48.8 56.5 ∆(XLM-R L ) +2.8 +3.3 +2.2 +1.9 -0.4 +4.3 +5.4 +4.3 +4.0 +2.6 +2.1 +3.9 +0.2 +0.8 +1.7 +3.6 +2.7 Table 3.2: Benchmark results for different ML-LMs and MCP-enhanced models for X-CSQA and X-CODAH in a zero-shot cross-lingual setting. ∆ is the improvement of MCP. {pl,ar,ja,pt,sw,ur} are unseen in MCP. of the two datasets. Specifically, we use online commercial services such as DeepL Pro Translate to collect high-quality translations of the examples in CSQA and CODAH for 15 languages other than English. The size of CODAH is small (only 2.7k), so we use 7k SWAG validation exam- ples as additional training data which share the same formulation. We discuss the reduction of 44 Stat.↓ Dataset→ X-CSQA X-CODAH Task Format QA SceneComp. # Languages 15 + en 15 + en # Options per Example 5 4 # Training (en) 8,888 8,476 # Dev per Lang. 1,000 300 # Test per Lang. 1,074 1,000 # Total Instances 80,550 60,000 Table 3.3: Statistics of the two X-CSR datasets. cultural differences and quality control of automatic translations as well as other details in Ethi- cal Considerations (the paragraph for cultural bias reduction) and Appendix (A). As our goal is to evaluate different ML-LMs (instead of different languages) in a unified evaluation protocol for cross-lingual commonsense reasoning, we argue that such automatically translated examples, al- though might contain noise, can serve as a starting benchmark for us to obtain meaningful analysis before more human-translated datasets will be available in the future. 3.1.6.2 Setup We focus on 4 popular ML-LMs that we introduced in Section 3.1.2.1: mBERT, XLM-100, XLM- R B and XLM-R L as well as our proposed MCP method. 
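Both the MCP objective of Section 3.1.5 (Eqs. 3.4–3.7) and the multiple-choice fine-tuning used in this setup come down to scoring each candidate with a linear head over its [CLS] embedding and applying a cross-entropy loss over the candidates; a minimal PyTorch sketch (our own simplification, with XLM-R base as the assumed backbone) is given below.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MCPScorer(nn.Module):
    """Scores a set of candidate assertions and trains with cross-entropy (Eqs. 3.4-3.7)."""

    def __init__(self, name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.f = nn.Linear(self.encoder.config.hidden_size, 1)   # the dense layer f

    def forward(self, input_ids, attention_mask, truth_index):
        # input_ids / attention_mask: (V, seq_len), the V candidates of one example.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0]                # [CLS] embeddings h_i  (Eq. 3.4)
        logits = self.f(h).squeeze(-1)                 # logits o_i            (Eq. 3.5)
        # Softmax + cross-entropy over the V candidates implements Eqs. 3.6-3.7.
        loss = nn.functional.cross_entropy(logits.unsqueeze(0), truth_index.view(1))
        return loss, logits
```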
For both tasks, we concatenate each prompt (the question or first sentence) and each of its options individually in the form of “[CLS] prompt [SEP] option i [SEP]”. Then, we fine-tune ML-LMs over the English training dataset and test them on other languages. Why zero-shot cross-lingual transfer? It is almost impossible to collect data in all languages that an NLU system might be used for. Therefore, prior works mainly focus on zero-shot cross- lingual transfer [27], which is more meaningful and can offer lower-bound performance analysis. It is also an ideal setting for studying CSR because most commonsense facts are language-invariant. Thus, an English-finetuned ML-LM for CSR should be able to transfer its ability to a wide range of other languages as well. Furthermore, our goal of this paper is to evaluate and improve ML-LMs, 45 Idioms 8.3% Negation 4.1% Polysemy 4.8% Ref. 3.7% Quant. 3.1% Others 76% 40 50 60 70 Figure 3.4: Categorized accuracy in for MCP(XLM-R L ) on X-CODAH. Each box is for 15 lan- guages. so translating back to English and then using an English-only LM is also not helpful towards to this end. 3.1.6.3 Experiments for Cross-lingual CSR In Table 3.2, we present the empirical results over X-CODAH and X-CSQA for the ML-LMs as well as two models enhanced by our proposed MCP method. On both tasks, the XLM-R L performs the best with a large margin. Enhanced by the MCP method, both XLM-R B and XLM-R L see significant improvement (e.g., 2.7% absolute improvement for XLM-R L on X-CSQA-avg). Can MCP’s improvement generalize to unseen, low-resource languages? Note that MCP dataset only involves 9 languages here, and there are 6 languages that are totally unseen in the MCP training (i.e., {pl, ar, ja, pt, sw, ur}). The largest performance gain is in ru on X-CSQA and vi on X-CODAH. Surprisingly, we find the improvements on them are also large for XLM-R L (e.g., 48.4→ 52.3 for ar). In addition, for the two low-resource languages sw and ur, MCP also brings 2∼ 3 percentage points of improvement for XLM-R L . It is, however, not always the case for XLM-R B , which we conjecture tends to be more likely to overfit. 46 MCP(XLM-R-B) MCP(XLM-R-L) XLM-R-L XLM-R-B 100 200 300 400 500 600 700 800 Step 0.2 0.3 0.4 0.5 0.6 0.7 Figure 3.5: Dev acc v.s. learning steps on X-CSQA. Although ML-LMs enjoy the merits of zero-shot cross-lingual transfer, their performances are usually worse than the English-only RoBERTa L on the en-test (70.4% vs 66.7% for X-CSQA). Al- though MCP can mitigate the gap (70.4% vs 69.5%) for X-CSQA, there is still a large gap (81.6% vs 69.9%) for X-CODAH. We use Fig. 3.4 to analyze how different categories of commonsense reasoning in CODAH [18] are diverse in different languages. We find that others, reference, and negation have relatively smaller variances across different languages, as they are more language- invariant. However, a few polysemous, idioms examples can be English-specific which may not generalize to other languages. More detailed analysis is in Appendix. From the curve of dev accuracy in Figure 3.5, we see that MCP-enhanced XLM-R models are much more sample efficient and converge much faster than vanilla versions. This suggests that the MCP, if used on a larger corpus with broader topics, can potentially produce a better ML-LM with more general usage, especially when only limited labelled is available. 
Our results on XNLI- 10% (using 10% of the training data) [27] show that MCP-enhanced XLM-R L has 1.2 percent accuracy improvement on the average of 15 languages. As our focus in this paper is commonsense reasoning, we leave the study on other cross-lingual NLU tasks as future work. Importantly, our experiments imply that a proper (continual) pre-training task that has a (contrastive) sentence-level objective could improve both the final performance as well as learning efficiency. 47 3.1.7 Conclusion We evaluate and improve popular multilingual language models (ML-LMs) for advancing com- monsense reasoning beyond English. We propose the MICKEYPROBE, a language-agnostic prob- ing task for analyzing the common sense of ML-LMs in a zero-shot manner. With our proposed new benchmark datasets via automatic translation, X-CSQA and X-CODAH, we evaluate ML-LMs in a cross-lingual transfer setting for commonsense reasoning. We also improve the state-of-the-art ML-LM with a simple yet effective method — multilingual contrastive pre-training, which uses a sentence-level objective to enhance sentence representations, yielding a significant performance gain. All the above work is based on MickeyCorpus, which can be used as both a probing dataset and a pre-training corpus for analyzing and improving ML-LMs. We hope our resources and pre-training method for ML-LMs can help the community advance commonsense reasoning beyond English. 3.2 Generalization to Non-Monotonic Reasoning & Creativity 3.2.1 Introduction “ The essence of a riddle is to express true facts under impossible combinations." — Aristotle, Poetics (350 BCE) A riddle is a puzzling question about concepts in our everyday life. For example, a riddle might ask “My life can be measured in hours. I serve by being devoured. Thin, I am quick. Fat, I am slow. Wind is my foe. What am I?” The correct answer “candle,” is reached by considering a collection of commonsense knowledge: a candle can be lit and burn for a few hours; a candle’s life depends upon its diameter; wind can extinguish candles, etc. 48 It is believed that the riddle is one of the earliest forms of oral literature, which can be seen as a formulation of thoughts about common sense, a mode of association between everyday con- cepts, and a metaphor as higher-order use of natural language [57]. Aristotle stated in his Rhetoric (335-330 BCE) that good riddles generally provide satisfactory metaphors for rethinking common concepts in our daily life. He also pointed out in the Poetics (350 BCE): “the essence of a riddle is to express true facts under impossible combinations,” which suggests that solving riddles is a nontrivial reasoning task. Answering riddles is indeed a challenging cognitive process as it requires complex common- sense reasoning skills. A riddle can describe multiple pieces of commonsense knowledge with figurative devices such as metaphor and personification (e.g., “wind is my foe − → extinguish”). Moreover, counterfactual thinking is also necessary for answering many riddles such as “what can you hold in your left hand but not in your right hand? − → your right elbow.” These riddles with ‘but-no’ cues require that models use counterfactual reasoning ability to consider possible solutions for situations or objects that are seemingly impossible at face value. This reporting bias [49] makes riddles a more difficult type of commonsense question for pretrained language models to learn and reason. 
In contrast, superficial commonsense questions such as “What home entertainment equip- ment requires cable?” in CommonsenseQA [161] are more straightforward and explicitly stated. We illustrate this comparison in Figure 7.1. In this paper, we introduce the RIDDLESENSE challenge to study the task of answering riddle- style commonsense questions 2 requiring creativity, counterfactual thinking and complex common- sense reasoning. RIDDLESENSE is presented as a multiple-choice question answering task where a model selects one of five answer choices to a given riddle question as its predicted answer, as shown in Fig. 7.1. We construct the dataset by first crawling from several free websites featuring large collections of human-written riddles and then aggregating, verifying, and correcting these examples using a combination of human rating and NLP tools to create a dataset consisting of 5.7k 2 We use “riddle” and “riddle-style commonsense question” interchangeably in this paper. 49 high-quality examples. Finally, we use Amazon Mechanical Turk to crowdsource quality distrac- tors to create a challenging benchmark. We show that our riddle questions are more challenging than CommonsenseQA by analyzing graph-based statistics over ConceptNet [156], a large knowl- edge graph for common sense reasoning. Recent studies have demonstrated that fine-tuning large pretrained language models, such as BERT [31], RoBERTa, and ALBERT [82], can achieve strong results on current commonsense rea- soning benchmarks. Developed on top of these language models, graph-based language reasoning models such as KagNet [98] and MHGRN [41] show superior performance. Most recently, Uni- fiedQA [72] proposes to unify different QA tasks and train a text-to-text model for learning from all of them, which achieves state-of-the-art performance on many commonsense benchmarks. To provide a comprehensive benchmarking analysis, we systematically compare the above methods. Our experiments reveal that while humans achieve 91.33% accuracy on RIDDLESENSE, the best language models can only achieve 68.80% accuracy, suggesting that there is still much room for improvement in the field of solutions to complex commonsense reasoning questions with language models. We believe the proposed RIDDLESENSE challenge suggests productive future directions for machine commonsense reasoning as well as the understanding of higher-order and creative use of natural language. 3.2.2 Construction of RIDDLESENSE In this section, we first present our pipeline for collecting the R IDDLESENSE dataset, including the details of data cleaning. We introduce how we design a crowd-sourcing protocol for annotating quality distractors to turn riddle-solving into a multiple-choice question answering task. 3.2.2.1 Riddle Crawling and Cleaning We write web crawlers for collecting a large number (approximately 10,000) of riddles and their answers from public riddle websites, such as brainzilla.com, riddlewot.com, etc. As the crawled data contain much noise such as inconsistent answer format and misspelled words, we process 50 riddles through careful data cleaning as well as human verification. First, we use an open-source tool for detecting typos 3 and then refine the sentences. Then we continuously sample (riddle, answer) pairs and recognize errors, for which we iteratively improve our program with a set of conditions to filter out noisy examples that are not readable or have ambiguous answers. 
Also, we merge the riddles from different sources while removing duplicate riddle questions with similar answers. For detecting duplicate riddles with minor word changes, we use SentenceBERT [138] to find clusters with high cosine similarities.

3.2.2.2 Distractor Collection from AMT

We consider a multiple-choice question-answering format rather than an open-ended format, as it is easier to meaningfully compare the performance of different models in a more controlled manner — there is a limited range of options. For such a dataset, given a riddle-style question and 5 answer options, the model should select the best one as the predicted answer. This format offers a straightforward and fair evaluation metric, accuracy, which is the metric adopted by many popular commonsense reasoning benchmarks such as CommonsenseQA, ARC [23], and OpenbookQA [118].

High-quality distractors are essential for multiple-choice question-answering tasks as they can ensure that the dataset is both clean and challenging — the distractors are neither too similar nor too distant from the correct answer. We thus design a protocol to collect quality distractors from human annotators via Amazon Mechanical Turk (https://www.mturk.com/) based on a pool of candidate distractors. (The open-source typo-detection tool mentioned above is github.com/phatpiglet/autocorrect.)

Candidate Distractor Pool. We use Q to denote the concepts that are mentioned in the question, and a to denote the concept in the answer. (If there are multiple answer concepts, we pick the one with the smallest degree in the graph, as such concepts tend to be more important.) We then first get all two-hop neighbors of a and the one-hop neighbors of each c ∈ Q in ConceptNet, respectively:

A = {x | (x, r_i, y), (y, r_j, a)},  B = {x | (x, r_k, c), ∀ c ∈ Q},  D = A ∩ B,

where r_i, r_j, and r_k are binary relations in ConceptNet such as HasProperty. The final intersection, D, is thus the pool of distractor candidates. We further use WordNet [119] to filter out concepts that have either too low or too high Wu-Palmer similarity (we use 0.5 as the threshold, which is effective in practice). We argue that such sampled distractors are semantically relevant to both questions and answers, and are also closer to answers in the WordNet taxonomy. Thus, they are more likely to serve as ideal distractors in a multiple-choice question answering task.

AMT Crowd-sourcing. We design a three-stage annotation protocol:

• S1) Sanity Check. We show a question and 3 choices where only 1 choice is correct and the other 2 are randomly sampled concepts from the full vocabulary of ConceptNet. Only when workers pass this sanity check are their subsequent annotations considered, so that we can avoid noise from workers who answer randomly.

• S2) Candidate Selection. As it is difficult to control and verify the quality of distractors freely written by crowd workers, we first sample concepts from ConceptNet that are relevant to both question concepts and answer concepts, forming a set of candidate distractors D for annotators to choose from. Workers are required to select at least 5 concepts that they think are good distractors for the question. There are at least 3 different workers for each question, and
we take the candidates that are selected by at least two different workers, to make sure the selected distractors are indeed meaningful.

• S3) Open Distractor Collection. We also ask master workers on AMT to write at least one more distractor based on the question context. This stage is important because sometimes the candidate pool contains fewer candidates of good quality, and the human-written distractors are usually better than the ones in the candidate pool. We thus give extra bonus credits to encourage annotators to write more quality distractors.

3.2.3 Data Analysis of RIDDLESENSE

In this section, we first report the key statistics of the proposed RIDDLESENSE dataset; then we compare it to CommonsenseQA [161] from two major angles: the distribution of the lengths of Q-A paths and the types of reasoning chains, which serve as an effective proxy for analyzing the differences between the two datasets.

Algorithm 2: Get statistics of Q-A paths.
Input: knowledge graph KG = (V, E), riddle question Q, riddle answer A
Output: minPathLength, maxPathLength, meanPathLength
1: QC ← extractConcept(Q)
2: AC ← extractConcept(A)
3: ac ← the v ∈ AC with the smallest deg(KG, v)
4: l ← []
5: foreach qc ∈ QC do
6:   path ← shortestPathLen(KG, qc, ac)
7:   if path ≠ None then
8:     l.append(path)
9: minPathLength ← min(l)
10: maxPathLength ← max(l)
11: meanPathLength ← mean(l)

3.2.3.1 Key Statistics

Table 3.4 presents the key statistics of RIDDLESENSE (RS) and the comparisons with CommonsenseQA (CSQA), which is the most similar benchmark to ours.

Table 3.4: Key statistics of the RIDDLESENSE dataset (v1.1) vs. the CommonsenseQA (CSQA) dataset.
                                 CSQA      RS
# All Examples                  12,102   5,715
# Train Examples                 9,741   3,510
# Validation Examples            1,221   1,021
# Test Examples                  1,140   1,184
Average Question Length          15.06   24.04
% Long Qs (>20 tokens)           16.5%   47.3%
Distinct Question Words          6,822   7,110
Distinct Choice Words            7,044   9,912
Avg. PLL of Qs                  -34.41  -53.98
QA-NLI Conflict                  12.7%   39.6%
QA-NLI Neutral                   71.6%   44.9%
QA-NLI Entailment                15.7%   15.5%

Although the size of RS is smaller than that of CSQA, we argue that RS is complementary to the CSQA dataset and introduces novel challenges for the commonsense reasoning community. As they share the same format, we can test different methods by training on either CSQA only, RS only, or the concatenation of CSQA and RS, as we show later in Section 3.2.4. Moreover, there is a greater number of long questions (i.e., containing more than 20 words) in RS than in CSQA. Additionally, we find that RS questions have a lower normalized pseudo-likelihood (PLL) [144], a proxy for estimating sentence probability, suggesting that RS questions are more puzzling (i.e., their words co-occur less frequently). We also use a RoBERTa model fine-tuned on MNLI [191] to perform natural language inference between CSQA/RS questions and their answers. There is a much greater proportion of questions in RS that have conflicting relations with their correct answers than in CSQA. This is indicative of RS's complexity due to the self-contradictory and perplexing nature of riddles.
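As a concrete illustration of Algorithm 2, the snippet below computes the min/max/mean shortest-path lengths between the question concepts and the answer concept with networkx. The toy graph and the simple degree-based choice of answer concept stand in for the actual ConceptNet grounding pipeline; they are assumptions for this example only.

```python
# A minimal sketch of Algorithm 2 on a toy undirected concept graph.
# The real pipeline grounds concepts in ConceptNet; the graph below is illustrative.
import networkx as nx
from statistics import mean

def qa_path_stats(kg, question_concepts, answer_concepts):
    # Pick the answer concept with the smallest degree (assumed to be most specific).
    ac = min(answer_concepts, key=lambda v: kg.degree(v))
    lengths = []
    for qc in question_concepts:
        try:
            lengths.append(nx.shortest_path_length(kg, qc, ac))
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            continue  # skip question concepts with no path to the answer concept
    return min(lengths), max(lengths), mean(lengths)

# Toy graph loosely mirroring "... Wind is my foe. What am I?" -> "candle".
kg = nx.Graph()
kg.add_edges_from([
    ("wind", "blow"), ("blow", "candle"), ("candle", "wax"),
    ("hour", "time"), ("time", "burn"), ("burn", "candle"),
])
print(qa_path_stats(kg, {"wind", "hour"}, {"candle"}))  # -> (2, 3, 2.5)
```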
Interestingly, we also find that although there are about twice as many examples in CSQA as in RS, there are more distinct words in the questions and answer choices of RS than of CSQA, suggesting that RS covers more diverse topics than CSQA.

Table 3.5: The top-5 most frequent types of reasoning chains in the CSQA and RS datasets, grouped by their length k = {1, 2, 3, 4}. The implicit-ratio ρ is defined as the ratio of the implicit reasoning type (i.e., Related×k) over the most frequent type with at least one explicit relation (e.g., AtLoc) of the same length k.

CommonsenseQA (CSQA):
  1-hop (14.0%): AtLoc (4.8%), Related (3.4%), Causes (1.1%), Antonym (0.9%), CapableOf (0.8%), ...; ρ = 3.4/4.8 = 0.7
  2-hop (34.4%): Related-Related (8.3%), Related-AtLoc (4.5%), Related-Antonym (1.8%), Related-IsA⁻¹ (1.3%), Related-AtLoc⁻¹ (0.9%), ...; ρ = 8.3/4.5 = 1.8
  3-hop (41.5%): Related-Related-Related (4.1%), Related-Related-AtLoc (2.7%), Related-AtLoc⁻¹-AtLoc (1.4%), Related-Related-Antonym (1.3%), Related-Related-CapableOf (1.3%), ...; ρ = 4.1/2.7 = 1.5
  4-hop (9.5%): Related×4 (0.4%), Related×3-AtLoc (0.3%), Related-Related-AtLoc⁻¹-AtLoc (0.3%), Related×3-Antonym (0.2%), Related×2-SubEvent⁻¹-Cause (0.1%), ...; ρ = 0.4/0.3 = 1.3

RiddleSense (RS):
  1-hop (4.6%): Related (3.1%), Antonym (0.4%), IsA⁻¹ (0.3%), PartOf (0.1%), AtLoc⁻¹ (0.1%), ...; ρ = 3.1/0.4 = 7.8
  2-hop (31.6%): Related-Related (13.1%), Related-Antonym (2.1%), Related-IsA⁻¹ (2.0%), Related-AtLoc⁻¹ (1.3%), Antonym-Related (0.8%), ...; ρ = 13.1/2.1 = 6.2
  3-hop (47.8%): Related-Related-Related (10.6%), Related-Related-IsA⁻¹ (2.6%), Related-Related-Antonym (1.6%), Related-Antonym-Related (1.5%), Antonym-Related-Related (1.5%), ...; ρ = 10.6/2.6 = 4.1
  4-hop (14.0%): Related×4 (1.8%), Antonym-Related×3 (0.4%), Related×3-IsA⁻¹ (0.3%), Related×2-IsA⁻¹-Related (0.3%), Related×2-Antonym-Related (0.3%), ...; ρ = 1.8/0.4 = 4.5

3.2.3.2 Distribution of the Lengths of Q-A Paths

Our main intuition is that the shortest paths between question concepts and the answer concept can approximate the underlying reasoning chains, which are hidden and difficult to label. To understand the difference between CSQA and RS in terms of their reasoning chains, we use Q-A paths over ConceptNet as a proxy. For a riddle question, the set of Q-A path lengths consists of the lengths of the shortest paths between every question concept and the answer concept, i.e., shortestPathLen(KG, qc, ac) in Alg. 2. For a question-answer pair, we first extract the concepts mentioned in the question and the answer, respectively (extractConcept() in Algorithm 2), following the steps of Lin et al. (2019) and Feng et al. (2020). If there are three question concepts {q_1, q_2, q_3} and an answer concept a, we denote their shortest path lengths as {L_1, L_2, L_3}. Finally, we compute the min/max/mean over them for a comprehensive understanding of the approximated difficulty of this riddle — a greater value indicates a more challenging example.

As shown in Figure 3.7 (b), RS has longer Q-A paths as underlying reasoning chains. In particular, the minimum Q-A path length in CSQA is 1-hop for more than 80% of the examples, whereas only about 30% of RS examples have a 1-hop minimum Q-A path and about 50% have a 2-hop minimum. The distribution over the maximum in Figure 3.7 (d) also shows that RS tends to have longer maximum paths than CSQA.
We also show the percentage of all Q-A paths of different lengths as part of Table 3.5, and we can see that RS has longer paths in general (e.g., CSQA = 14.0% vs. RS = 4.6% for 1-hop paths).

3.2.3.3 Relational Types of Reasoning Paths

In addition to the analysis of path lengths, we also show that the relation types of Q-A paths for RS and CSQA have clear differences, as shown in Table 3.5. The types of reasoning chains in RS rely more on a special relation in ConceptNet, Related, which is relatively implicit and cannot be grounded to a specific, explicit relation such as AtLoc (e.g., <wind, Related, air> vs. <lamp, AtLoc, table>). The most frequent relation between question concepts and answer concepts in CSQA is the AtLoc relation (4.8%); however, it is Related (3.1%) in RS. We define the implicit-ratio for k-hop paths as ρ_k = %(Related×k) / %(E_k), where E_k is the most frequent type of chain of length k with at least one explicit relation. In RS, ρ_k ranges from about 4.1 to 7.8, while it is about 0.7 to 1.8 for CSQA. Thus, we conclude that the dominant reasoning chains in RS are much more implicit, and consequently RS is more challenging to reason about with commonsense knowledge resources like ConceptNet.

3.2.4 Experiments

We first introduce three types of popular baseline methods for commonsense reasoning (Section 3.2.4.1), then we present our main experimental results with analysis (Section 3.2.4.2), and finally show case studies for error analysis (Section 3.2.4.3).

3.2.4.1 Baseline Methods

Given a riddle question q, there are 5 different choices {c_1, ..., c_5}, where only one of them is the correct choice and the others are distractors. The model needs to rank all choices and select the best one as the final answer. There are three major types of models for commonsense reasoning tasks in this format: 1) fine-tuning pretrained language models, 2) incorporating relevant knowledge graphs for reasoning, and 3) fine-tuning a unified text-to-text QA model, as shown in Figure 3.8.

Fine-tuning Pre-trained LMs. As we seek to investigate how well current NLU models can perform higher-order commonsense reasoning, we first experiment with a typical set of large pretrained language models such as BERT [32], RoBERTa [104], and ALBERT [82]. We concatenate the question with each choice, using [SEP] as the separator, thus forming a statement. Then, we fine-tune a pretrained LM like BERT to predict a score for each statement from its [CLS] token embedding. The set of five scores for an example is fed to a softmax layer, and we optimize to maximize the score of the correct choice.

LMs + Graph Reasoning Modules. KagNet [98] and MHGRN [41] are two typical graph-based language reasoning models. They both extract a schema graph from ConceptNet, i.e., a subgraph of ConceptNet consisting of the Q-A paths in Figure 3.7, and encode it with a graph encoding module. They finally fuse the external commonsense knowledge with a text encoder (e.g., a pretrained LM). KagNet uses heuristics to prune irrelevant paths and then encodes them with a path-based LSTM and hierarchical attention to select the most important paths for improving commonsense reasoning. In contrast, the more recent MHGRN explicitly encodes multi-hop paths at scale using graph networks with relational attention, improving efficiency and performance over KagNet and other models. A unique merit of such graph-based models is their interpretability due to the neural attention over the symbolic structures of KGs.
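The plain LM fine-tuning baseline described above can be sketched as follows. This is an illustrative reimplementation rather than the exact experimental code: the checkpoint name, the use of the first token as the [CLS] representation, and the absence of any training-loop details are assumptions for this example.

```python
# A minimal sketch of multiple-choice scoring with a pretrained LM: each
# (question, choice) pair is encoded separately, its [CLS]-position vector is
# mapped to a scalar, and a softmax over the five scores is trained with
# cross-entropy against the index of the correct choice.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def choice_scores(question, choices):
    enc = tokenizer([question] * len(choices), choices,
                    padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**enc).last_hidden_state[:, 0]   # first-token ([CLS]) vectors
    return score_head(cls).squeeze(-1)             # one score per choice

question = ("My life can be measured in hours. I serve by being devoured. "
            "Thin, I am quick; Fat, I am slow. Wind is my foe. What am I?")
choices = ["paper", "candle", "lamp", "clock", "worm"]
scores = choice_scores(question, choices)
loss = F.cross_entropy(scores.unsqueeze(0), torch.tensor([1]))  # gold = "candle"
prediction = choices[int(scores.argmax())]
```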
Fine-Tuning a Text-to-Text QA Model. UnifiedQA [72], the state-of-the-art multiple-choice QA model, simply concatenates the question with all answer candidates as a single input sequence to a T5 [135] model, which learns to generate the correct choice as if extracting a span from the input. Apart from the multiple-choice QA format, it is also trained with other QA task formats, so that it can benefit from many other QA datasets (including CSQA) via sharing the model parameters.

Human Evaluation. We invite three native English speakers who study computer science to solve 100 riddle examples sampled from the test set. They achieved an average accuracy of 91.3%.

Table 3.6: Benchmark performance over the dev and test sets of RIDDLESENSE, under three training-data settings; all numbers are accuracies (%) on the RiddleSense dev/test splits.

Models                        Train=CSQA: Dev / Test   Train=RiddleSense: Dev / Test   Train=RS+CSQA: Dev / Test
Random Guess                  20.0 / 20.0              20.0 / 20.0                     20.0 / 20.0
BERT-Base [31]                33.59 / 34.61            54.16 / 42.43                   56.22 / 47.67
BERT-Large [31]               36.14 / 39.10            55.24 / 45.09                   57.69 / 54.91
RoBERTa-Large [104]           43.68 / 47.42            60.72 / 52.58                   66.11 / 59.82
ALBERT-XXL [82]               51.03 / 51.00            66.99 / 60.65                   71.50 / 67.30
KagNet (RoBERTa-L) [98]       42.66 / 48.24            61.77 / 53.72                   66.55 / 59.72
MHGRN (RoBERTa-L) [41]        46.83 / 49.65            63.27 / 54.49                   66.90 / 63.73
MHGRN (ALBERT-XXL) [41]       50.89 / 50.21            66.27 / 59.93                   70.81 / 66.81
UnifiedQA (T5-Large) [72]     28.50 / 37.27            56.21 / 56.40                   58.17 / 56.57
UnifiedQA (T5-3B) [72]        37.32 / 50.25            67.38 / 66.06                   68.26 / 68.80
Human Performance             -     / 91.33            -     / 91.33                   -     / 91.33

3.2.4.2 Results and Analysis

We show the main results of the experiments in Table 3.6. There are 3 settings according to the training-data options: 1) the training data of CSQA, 2) the training data of RS, and 3) the concatenation of RS and CSQA; all experiments are validated over the dev set of RS. However, as the public UnifiedQA checkpoints were already trained on CSQA (together with many other QA datasets), we directly use them for inference over RS in the first setting (i.e., "Train=CSQA"). This also suggests that the performance of the UnifiedQA models in the second setting should be better than that of the other models, although they are all fine-tuned on RS's training data only.

We can see that larger pretrained language understanding models always gain better performance, ranging from BERT-Base to ALBERT-XXL, which gets the best performance in this group of baselines (67.30%). This matches their performance comparisons on CSQA and other benchmark datasets as well, suggesting that a better pre-trained language model can also be identified by RIDDLESENSE. Interestingly, we find that ALBERT-XXL is so powerful that it can generalize from training on CSQA only and achieve results comparable to a RoBERTa-Large that is trained on RS (i.e., 51.0% vs. 52.6%). However, if we look at the curve of dev accuracy when using different percentages of the RS training data (setting 2) in Figure 3.9, we can see that RoBERTa-Large generally outperforms ALBERT-XXL when using less than 60% of the data for fine-tuning.

Moreover, we find that the KG-enhanced models, KagNet and MHGRN, using RoBERTa-Large (RB-L) as the encoder, perform better than the vanilla RB-L. Although the Q-A paths over ConceptNet contain many implicit paths (e.g., Related×k), some paths can still be beneficial. For example, the path wind <--Related--> blow <--Related--> candle can still help reason from the riddle "... Wind is my foe. What am I?" to the answer "candle." The fusion of ConceptNet also helps in the setting where RoBERTa-Large is trained with CSQA data only.
However, the improvement of KagNet is negative, which is unexpected. We conjecture that this is because the extracted subgraphs from the ConceptNet does not guarantee the reasoning path from question concepts to answer concepts, while the training phase forces models to learn to reason over those graphs, yielding a possibly harmful impact. Additionally, we find that MHGRN with ALBERT-XXL also results in a worse performance, unlike using RoBERTa-Large. We believe this may be related to the specific design of ALBERT, which reuses model parameters 59 for multiple layers, and thus it could be a problem when fused with another learnable module (e.g., a graph network in MHGRN). Fine-tuning UnifiedQA with T5-3B achieves the best performance, which is also the case for CSQA in their leaderboard. This is expected for two reasons: 1) UnifiedQA has been trained over multiple other QA datasets, which increases its generalization ability, 2) UnifiedQA considers all choices together at a time and thus can better compare different choices with self-attention mechanism of Transformer [172]. 3.2.4.3 Error Analysis and Future Directions We show a few examples that are mistakenly predicted by the UnifiedQA-3B model in Figure 3.10. From these concrete cases, we can see that even the best model cannot solve riddles that can be trivial to humans, especially when there are metaphors and/or counterfactual situations. We argue that future research should aim to address the creative use of language in commonsense reasoning and general understanding of language, as creativity is a critical feature of natural language. We list several promising directions as follows. First of all, we should mine (semi-)structured knowledge of metaphors, so that concepts can connect via metaphorical links (e.g., “tail”→ “thread”). Second, to prevent false inferences, we need more complete, precise commonsense knowledge of concepts. For example, in Figure 3.10, a model should know a chair only has exactly four legs instead of hundreds [93]; ink can be black or red, but it won’t change over time. However, current KGs only have (leg, PartOf, chair) and (ink, HasProperty, black/red). In addition, the reasoning methods should incorporate more symbolic logic rules, so that the multi-hop conditions and counterfactual “but-no” negations will be handled better. Finally, we think the graph-augmented methods should be improved to compare multiple options in a schema graph, e.g., QA-GNN [204]. Both KAGNET and MHGRN consider only a single option at a time which prevents them from effectively reasoning about the subtle differences between options. 60 3.2.5 Related Work Benchmarking Machine Common Sense The prior works on building commonsense reasoning benchmarks touch different aspects of com- monsense reasoning: SWAG [209], HellaSWAG [208], CODAH [18], aNLI [10] for situation- based reasoning; Physical IQA [13] on physical knowledge; Social IQA [149] on social psychol- ogy knowledge; LocatedNearRE [192] on mining spatial commonsense knowledge; DoQ [37] and NumerSense [93] on numerical common sense; CommonGen [94] for generative commonsense reasoning, and many others; OpenCSR [96] and ProtoQA [14] aim to test commonsense reasoning ability in an open-ended setting. CommonsenseQA [161] has the same format as our proposed RIDDLESENSE, and both target general commonsense knowledge via multiple-choice question answering. 
However, CSQA focuses more on straightforward questions where the description of the answer concept is easy to understand and to retrieve over ConceptNet, while RS makes use of riddle questions to test higher-order commonsense reasoning ability. More detailed comparisons between them are in Section 3.2.3, which shows the unique challenges of RiddleSense along multiple dimensions.

Commonsense Reasoning Methods

Our experiments cover three major types of commonsense reasoning methods that are popular on many benchmarks: fine-tuning pretrained LMs [31, 104, 82], graph-based reasoning with external KGs [98, 41], and fine-tuning unified text-to-text QA models [72]. Apart from ConceptNet, there are also some methods [112, 193] using additional knowledge resources such as Wikipedia and Wiktionary. A few recent methods also aim to generate relevant triples via language generation models so that the context graph is more beneficial for reasoning [181, 196]. Our experiments in this paper aim to compare the most typical and popular methods that have open-source implementations, which we believe is beneficial for understanding the limitations of these methods on higher-order commonsense reasoning, i.e., RIDDLESENSE.

Computational Creativity and NLP

Creativity has been seen as a central property of the human use of natural language [114]. Text should not always be taken at face value; higher-order uses of language and figurative devices such as metaphor can communicate richer meanings, but they need deeper reading and more complicated reasoning skills [174]. Recent works on processing creative language focus on metaphor detection [44], pun generation [56, 108], creative story generation, humor detection [189, 190], sarcasm generation [17], etc. Riddling, as a way of using creative descriptions to query a common concept, is relatively underexplored. Previous works [163, 48] focus on the generation of riddles in specific languages and usually rely on language-specific features (e.g., decomposing a Chinese character into multiple smaller pieces). To the best of our knowledge, there are few datasets or public resources for studying riddles as a reasoning task. The proposed RIDDLESENSE is among the very first works connecting commonsense reasoning and computational creativity, and it provides a large dataset to train and evaluate models for answering riddle questions.

3.2.6 Conclusion

We propose a novel commonsense reasoning challenge, RIDDLESENSE, which requires complex commonsense skills for reasoning about creative and counterfactual questions, and comes with a large multiple-choice QA dataset. We systematically evaluate recent commonsense reasoning methods on the proposed RIDDLESENSE dataset, and find that the best model is still far behind human performance, suggesting that there is still much space for commonsense reasoning methods to improve. We hope RIDDLESENSE can serve as a benchmark dataset for future research targeting complex commonsense reasoning and computational creativity.
Figure 3.6: The top example is a trivial commonsense question from the CommonsenseQA [161] dataset ("What home entertainment equipment requires cable?", answered by television). The two bottom examples are from our proposed RIDDLESENSE challenge: "I have five fingers, but I am not alive. What am I?" (glove) and "My life can be measured in hours. I serve by being devoured. Thin, I am quick; Fat, I am slow. Wind is my foe. What am I?" (candle). The right-bottom question is a descriptive riddle that implies multiple commonsense facts about a candle, and it needs understanding of figurative language such as metaphor; the left-bottom one additionally needs counterfactual reasoning ability to address the 'but-no' cues. These riddle-style commonsense questions require NLU systems to have higher-order reasoning skills with an understanding of creative language use.

Figure 3.7: The Q-A paths serve as an estimation of the underlying reasoning chains. Panel (a) illustrates how to compute the mean/min/max of the Q-A paths: {q_1, q_2, q_3} are three concepts mentioned in the question, and a is the answer concept; L_k is the length of the shortest path between q_k and a over ConceptNet (e.g., L_1 = 2, L_2 = 3, L_3 = 4 gives min = 2, max = 4, mean = 3), and min/max/mean are computed over {L_1, L_2, L_3} as three aspects of the overall difficulty. Panels (b), (c), and (d) show that RiddleSense generally has longer question-answer paths than CommonsenseQA, and is thus harder to reason about.

Figure 3.8: Three types of baseline methods: 1) fine-tuning pre-trained LMs (BERT/RoBERTa/ALBERT, etc.) with "[CLS] question [SEP] choice" inputs and a softmax over the choices; 2) symbolic KG-based language reasoning with KagNet (Lin et al., 2019) and MHGRN (Feng et al., 2020); 3) fine-tuning T5 with a text-to-text task format (UnifiedQA; Khashabi et al., 2020), where the input is "question \n A: choice1 B: choice2 C: choice3 ..." and the output is the correct choice.

Figure 3.9: The curve of dev accuracy using different percentages of the RS training data, respectively for RoBERTa-Large and ALBERT-XXL.
Figure 3.10: Case studies of errors by the UnifiedQA-3B model on the test set of RIDDLESENSE (gold answer vs. the model's choice):
(1) "I am black when you buy me, red when you use me. When I turn white, you know it's time to throw me away. What am I?" (gold: charcoal; model: ink); the riddle describes multiple conditions of a common object, and only charcoal satisfies all of them.
(2) "I have a long tail that I let fly. Every time I go through a gap, I leave a bit of my tail in the trap. What am I?" (gold: needle; model: fishing pole); a common event and the involved objects are described with metaphor: tail → thread, fly → sew.
(3) "If you take off my skin, I will not cry, but you will. What am I?" (gold: onion; model: body); personification: cutting onions → taking off my skin.
(4) "What is that which, though black itself, enlightens the world without burning?" (gold: ink; model: sunlight); figure of speech (ink → writing → knowledge → light of wisdom) plus a counterfactual condition ("without burning").
(5) "I have hundreds of legs, but I can only lean. What am I?" (gold: broom; model: chair); counterfactual (many legs but cannot stand) plus metaphor (bristles as legs).

Chapter 4

Testing the Robustness of Commonsense Reasoning

4.1 Robustness in Probing Numerical Common Sense

4.1.1 Introduction

Pre-trained language models (PTLMs), such as BERT [32], have yielded state-of-the-art performance on many natural language processing tasks. Given PTLMs' cited ability to create general yet useful text representations, an investigation of their ability to encode commonsense knowledge into representations is warranted: commonsense knowledge is often required to have a full understanding of language. Recently, a few works have investigated whether PTLMs possess commonsense knowledge [127, 29, 15]. Overall, these prior studies suggest that PTLMs create text representations that often have commonsense knowledge encoded in them. We, however, find it surprising that when posed with a similar reasoning-based masked-word-prediction task, PTLMs perform poorly in recalling the required numerical commonsense knowledge (see Figure 4.1).

Therefore, in this paper, our goal is to study whether PTLMs capture numerical commonsense knowledge, i.e., commonsense knowledge that provides an understanding of the numeric relations between entities. We propose measuring this capability via a masked-word-prediction-based probing task, where the ranking of numeric words, by how probable the model believes each to be as a filler of the mask, would expose the capabilities of PTLMs to capture numeric commonsense knowledge. For example, the masked position in the sentence "A bird usually has [MASK] legs." is best filled by the number "two" when considering only numerical words. Around this concept, we built a carefully crafted dataset, NUMERSENSE, of 3,145 probes that cover questions from 8 different categories such as everyday objects, biology, geometry, etc.

Figure 4.1: Top: PTLMs often cannot solve masked language modeling tasks needing numerical commonsense knowledge, hence our title. For example, BERT-Large's top prediction for "Birds can [MASK]." is "fly" (79.5%, followed by "sing" at 9.1%), but for "A bird usually has [MASK] legs." it is "four" (44.8%) rather than "two" (18.7%). Bottom: Even when PTLMs seemingly succeed, they fail to stay consistent under small perturbations: "A car usually has [MASK] wheels." is answered with "four" (53.7%), but "A car usually has [MASK] round wheels." flips to "two" (37.1%, with "four" at 20.2%).

In our initial experiments, we find PTLMs to be brittle against adversarial attacks. As shown in the bottom section of Figure 4.1, BERT initially correctly predicts the masked word to be "four", but it changes its top result to "two" in the slightly perturbed second sentence (a simple insertion of the word 'round'). Thus, we intentionally included adversarial examples in the probes to test robustness. We evaluate PTLMs in two settings (Section 4.1.3): (1) a zero-shot setting, meaning no probes from our dataset were used to fine-tune the models before evaluation; and (2) a distant supervision setting, where models were fine-tuned on examples from related commonsense reasoning datasets before being evaluated on ours.
Our findings reveal that PTLMs are still much worse than humans on the task, although fine-tuning with distant supervision can help. We also provide some cursory analysis on why PTLMs perhaps perform so poorly, pointing to interesting future research. We also hope our work can benefit future work on: 1) improving PTLMs' ability to faithfully capture (numerical) commonsense, 2) populating numerical facts in current commonsense knowledge bases, and 3) open-domain QA ("Q: How many legs do ants have?" "A: Six!").

4.1.2 The NUMERSENSE Probing Task

We introduce our numerical commonsense reasoning probing task, as well as the creation process of the namesake dataset, NUMERSENSE. Then, we provide a breakdown of what types of knowledge are covered by the probes, and finally we collect additional high-quality distant supervision to test whether fine-tuning can improve performance.

4.1.2.1 Task Formulation

We essentially probe PTLMs with the distribution of words a PTLM thinks could fill the masked position, by ranking their softmax scores (greatest to least). If the ranking demonstrates numerical commonsense knowledge, i.e., the highest-ranked number word (e.g., "one", "two", and so on) is the correct answer, then that probe is successfully completed by the PTLM. The masked position in each probe is chosen such that a number word is an extremely probable way of filling in the blank.

4.1.2.2 Probing Data Collection

To build a suitable dataset for the proposed probing task, we make use of an existing corpus consisting of commonsense assertions, named Open Mind Common Sense (OMCS) [155]. We first extracted the sentences from OMCS that had at least one of the following 12 number words: {"no", "zero", "one", "two", ..., "ten"}. (We include "no" because there exist statements involving numerical commonsense knowledge where "no" is used in place of zero, e.g., "There are no princes in the United States.")

However, as is to be expected, there were many noisy statements that were either 1) incorrect, 2) containing typos, or 3) having no numerical commonsense logic. We thus manually and pragmatically refined these sentences and did two rounds of vetting by different graduate students, from which we only kept the statements that were accepted by all annotators. After this strict filtration process, we ended up with 1,131 cleaned statements for probing.

Table 4.1: NUMERSENSE examples of each category.
Category (share)     Example
Objects (35.2%)      A bicycle has two tires.
Biology (13.5%)      Ants have six legs.
Geometry (11.7%)     A cube has six faces.
Unit (6.3%)          There are seven days in a week.
Math (7.3%)          I will be ten next year, as I am nine now.
Physics (5.7%)       Water will freeze at zero degrees centigrade.
Geography (2.9%)     The world contains seven continents.
Misc. (17.5%)        There are no princes in the United States.

We did an initial test and observed that PTLMs can be brittle under a simple perturbation of inserting an adjective near the masked number word. Thus, in order to study the robustness of models on our proposed task, we also added adversarial examples to our dataset by adding adjectives before the noun involved in the numerical reasoning in each probe. The candidate adjectives are generated by querying relevant triples (e.g., <wheel, HasProperty, round> for the example in Fig. 4.1) in the commonsense knowledge graph ConceptNet [156], and are further selected or modified by human annotators to ensure the adversarial examples are still valid and natural. We finally have 3,145 testing probes for NUMERSENSE as the diagnostic dataset.
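To illustrate how such adversarial variants can be generated, the sketch below inserts an adjective before the noun tied to the number word. The tiny HasProperty lookup table is an illustrative stand-in for querying ConceptNet, and in the real pipeline every generated probe was still verified by human annotators.

```python
# A minimal sketch of building adversarial probe variants by inserting an adjective
# (taken from HasProperty-style knowledge) before the noun involved in the
# numerical reasoning. The dictionary below is a stand-in for ConceptNet queries.
HAS_PROPERTY = {
    "wheels": ["round"],
    "legs": ["long", "thin"],
}

def adversarial_variants(probe, noun):
    """Return copies of `probe` with an adjective inserted right before `noun`."""
    variants = []
    for adj in HAS_PROPERTY.get(noun, []):
        variants.append(probe.replace(noun, f"{adj} {noun}", 1))
    return variants

print(adversarial_variants("A car usually has [MASK] wheels.", "wheels"))
# -> ['A car usually has [MASK] round wheels.']
```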
We also manually annotated the category label for each instance so that we can better understand the covered topics and their percentages. We found 8 types of numerical commonsense knowledge ranging from tangible everyday objects (e.g., car, guitar, and table) to geometry (e.g., cube). Table 4.1 lists some concrete examples of each category.

Table 4.2: Results (%) of PTLMs on NUMERSENSE. 'Ft.' stands for 'Fine-tuned.' The human performance is shown by closed testing (α = 'no external information') / open testing (β = 'Wikipedia is allowed').

                      Core Probes                 + Adversarial Examples
Models                hit@1   hit@2   hit@3       hit@1   hit@2   hit@3
GPT-2                 29.86   50.88   67.49       24.73   44.21   62.30
BERT-Base             31.98   55.92   70.58       25.24   48.66   64.81
RoBERTa-Base          36.04   60.42   72.08       28.39   51.91   67.29
BERT-Large            37.63   62.01   76.77       27.18   52.89   70.22
RoBERTa-Large         45.85   66.70   80.04       35.66   58.52   74.44
Ft. BERT-L.           50.00   66.34   74.91       43.58   62.27   72.92
Ft. RoBERTa-L.        54.06   69.61   79.15       47.52   66.43   76.76
Human Bound           89.7 (α) / 96.3 (β)         88.3 (α) / 93.7 (β)

4.1.2.3 Supervision for Fine-Tuning PTLMs

One may wonder whether fine-tuning on this task could improve performance. In order to answer this question, we further collected training sentences from the GenericsKB corpus [12]. The sentences in GenericsKB are generic commonsense statements that are extracted from Simple Wikipedia, Common Crawl within educational domains, the ARC corpus, etc.

We collected these sentences by first obtaining a list of frequent nouns from various caption corpora such as MSCOCO [102] and VATEX [185]. Then, we selected the collected sentences that contained at least one number word of interest, and finally went through the same human annotator verification process as for the test data. We ended up collecting 10,492 sentences for fine-tuning, and believe these sentences, if used properly, can improve PTLMs' ability to recall numerical commonsense knowledge.

4.1.2.4 Statistics of NUMERSENSE

We show the distribution of the truth number words in the test data in Fig. 4.2. The average length of the sentences in the training data is 11.1, and it is 8.9 in the test data.

Figure 4.2: Truth number distribution of the test set.

4.1.3 Empirical Analysis

We introduce the set-up of the experiments and then present results from different PTLMs in both a zero-shot setting and a distantly supervised fine-tuned one. We will also provide some analysis of the robustness and biases in the various models, and finally a study of the performance of a state-of-the-art open-domain question-answering model.

4.1.3.1 Experiment Set-up

We run our experiments in two settings: zero-shot inference and additional supervision via fine-tuning. In the first setting, we probe PTLMs without any modifications; specifically, we use BERT and RoBERTa with pre-trained masked-word-prediction heads.

In our second setting, we use our collected additional supervision dataset (Sec. 4.1.2.3) and mask the number words in each sentence. We then fine-tune the models above on these masked sentences, before evaluating them on NUMERSENSE.

4.1.3.2 Evaluation Metric and Human Bound

A masked-word-prediction head (either fine-tuned or not) produces a probability distribution over its whole vocabulary via a softmax layer. As mentioned in Sec. 4.1.2.1, NUMERSENSE is the task of using this probability distribution to rank all number words and evaluating this ranking. To evaluate, we use hit@1/2/3 accuracy, which calculates the percentage of predictions where the correct number word is ranked in the top k number words.
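A minimal sketch of this probing procedure is shown below: a masked LM scores each candidate number word at the masked position, and hit@k checks whether the gold word appears in the top k of that ranking. The checkpoint name and the simple single-token handling are illustrative assumptions, not the exact evaluation code.

```python
# Rank the 12 number words by a masked LM's score at the [MASK] position,
# then check hit@k against the gold number word.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

NUMBER_WORDS = ["no", "zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine", "ten"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def rank_number_words(probe):
    """`probe` contains the literal string '[MASK]', e.g. 'A bird usually has [MASK] legs.'"""
    enc = tokenizer(probe.replace("[MASK]", tokenizer.mask_token), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]
    # Assumption: all 12 words are single tokens in this vocabulary.
    ids = tokenizer.convert_tokens_to_ids(NUMBER_WORDS)
    order = torch.argsort(logits[ids], descending=True)
    return [NUMBER_WORDS[i] for i in order]

def hit_at_k(probe, gold, k):
    return gold in rank_number_words(probe)[:k]

print(rank_number_words("A bird usually has [MASK] legs.")[:3])
print(hit_at_k("Ants have [MASK] legs.", gold="six", k=1))
```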
To estimate human performance on the task, we sampled 300 examples and asked two groups of three people to fill in the masked word, where one group had access to external information from the Web such as Wikipedia (open-book test) and the other did not (closed-book test). We take the majority label as the final human label.

4.1.3.3 Experimental results

We show our experimental results in Table 4.2. The first five rows are results from PTLMs in the zero-shot inference setting. We see that size matters, as there is a clear performance gain when the model size increases. Also, RoBERTa's results are consistently better than BERT's, which is probably because RoBERTa uses larger training corpora and focuses more on masked language modeling in its pre-training stage. (We report the performance of GPT-2 by iteratively filling in the masked word and ranking candidates by perplexity.)

We see that our fine-tuning efforts do help improve model performance: 37.63 → 50.00 for BERT-Large and 45.85 → 54.06 for RoBERTa-Large. However, both are still far from the human closed-book evaluation. Figure 4.3 shows that PTLM performance is poor across all categories within the core set of NUMERSENSE.

Figure 4.3: Performance of RoBERTa-Large vs. human performance (closed-book tests) on different categories of numerical commonsense knowledge (accuracy in %): Objects 46.78 vs. 93.88, Biology 41.06 vs. 85.71, Geometry 33.08 vs. 97.14, Unit 26.39 vs. 88.89, Math 43.37 vs. 94.44, Physics 27.69 vs. 73.68, Geography 36.36 vs. 60.00, Misc. 44.72 vs. 81.82.

Comparing the performance of a PTLM on the "Core Probes" set (#=1,131) versus the "+ Adversarial Examples" set (#=3,145), we can measure its robustness. We found that all models incur a significant performance drop when evaluated on the adversarial set. This suggests that PTLMs (even when fine-tuned) can be brittle towards adversarial attacks, and future directions in pre-training language models should consider more structured inductive biases, such as dependencies and semantic roles, when learning contextual representations.

4.1.4 Case Studies

Object bias. Recall the example "a bird usually has [MASK] legs," which BERT-Large predicts to be "four". Does BERT-Large always predict "four" as long as the word adjacent to the [MASK] is 'legs'? To investigate whether this bias exists, we show some case studies in Table 4.3. As 1,000 different randomly generated words fill the '[x]'s, we see that both BERT and RoBERTa have a bias towards a certain answer, evidenced by the existence of a dominant answer in the softmax distribution. However, it seems that RoBERTa's [104] modified pre-training strategy helps it have less bias.

Table 4.3: The average softmax scores of the top-3 predictions for templates in which '[x]' is filled with 1k random words.
Template: "a [x] usually has [MASK] legs."   BERT-L: four 39.3%, two 18.3%, three 10.1%   RoBERTa-L: four 20.8%, two 9.0%, three 8.1%
Template: "most [x] have [MASK] wheels."     BERT-L: four 25.3%, two 14.1%, three 5.1%    RoBERTa-L: four 9.2%, two 7.8%, three 4.6%
Template: "all [x] have [MASK] sides."       BERT-L: two 28.3%, three 12.9%, four 12.9%   RoBERTa-L: two 16.6%, no 2.9%, three 2.3%

Figure 4.4: The attention distribution for the sentence "A bird usually has two legs." in RoBERTa-base. We plot the attention weight (y-axis) between each word and the number word 'two' at different positions (x-axis); e.g., x = 13 corresponds to (Layer 2, Head 1).
We argue that future studies should further control this bias in masked language modeling.

Attention distribution. Following prior probing work [22] on the relationship between attention weights and syntactic structures, we plot the attention distribution of the sentence "A bird usually has two legs." with respect to the word 'two' in Figure 4.4. We find that the root word 'has' receives the maximum attention in the first few and middle layers, while the word 'two' attends maximally to itself in the final layers. The important words for querying the numerical commonsense, namely 'bird' and 'legs', always have low attention weights. This suggests that BERT (and RoBERTa) may inherently lose the relationship between subjects/objects and number words.

4.1.5 Open-Domain 'How-Many' Questions

The examples in NUMERSENSE can also be seen as open-domain questions targeting 'how-many' commonsense, such as "how many legs does a fly usually have?" Answering these open-domain numerical commonsense questions is a practical downstream application of models that are successful on NUMERSENSE. Thus, as a side note, we also report the performance of a state-of-the-art open-domain QA model [4]. We use the model that is trained on the Natural Questions (NQ) dataset [79], and we replace the '[MASK]'s in our examples with 'how many' so that our probes are in a format similar to NQ examples. For example, "a fly usually has [MASK] legs" is converted to "how many legs a fly usually has?" (We also manually tested some queries such as "how many legs does a fly usually have?", which gave similar results.) The accuracy of the state-of-the-art model is only 15.4%, which is even lower than using BERT-Base without fine-tuning. This indicates that improving performance on NUMERSENSE can help improve performance on answering open-domain "how-many" questions.

4.1.6 Related Work

Probing Tasks for PTLMs. Prior work on probing language models has primarily focused on the analysis of linguistic phenomena. Clark et al. (2019) investigated the relationship between BERT's attention weights and syntactic structures, such as dependency relations (e.g., direct objects, noun modifiers), coreference, and sentence segmentation. Tenney, Das, and Pavlick (2019) showed where certain types of linguistic information are captured within BERT; they in fact find that the layers of a PTLM represent the steps of a classical NLP pipeline: POS tagging, parsing, NER, semantic roles, and coreference. This line of work has indeed helped us understand the ability of PTLMs to capture linguistic knowledge via self-supervised learning from unlabeled data. We, in contrast, are interested in the numerical commonsense knowledge of PTLMs.

Probing Commonsense Knowledge. Besides the works that we have discussed in Section 4.1.1, Zhou et al. (2020) and Talmor et al. (2019) also proposed to probe the commonsense knowledge of pre-trained language models, following the prior work of Trinh and Le (2018).
They both utilized various existing language understanding datasets targeting commonsense knowledge to test if PTLMs can capture certain commonsense knowledge. Lin et al. (2019) also show that PTLMs can retrieve paths from ConceptNet that aid in interpreting the decision made by the PTLMs on the CommonsenseQA dataset [161]. Lin et al. (2019) probe the commonsense knowledge in pre- trained language generation models via a constrained text generation task. However, they do not consider numerical commonsense knowledge, which is relatively under-explored area. Numerical Commonsense Knowledge. Forbes and Choi (2017) and Goel, Feng, and Boyd- Graber (2019) studied commonsense comparisons between two physical objects (e.g., a house is usually bigger than a person) in pre-trained word embeddings. Elazar et al. (2019) and Yamane, Lin, and Harada (2020) propose to induce the commonsense distribution of quantitative attributes (e.g., mass, length, and currency) of objects. Their goal is to extract or crowd-source such numer- ical attributes, and then obtain distributions that reflect commonsense knowledge. N UMERSENSE, however, mainly focuses on exact numerical commonsense facts (e.g., a bird has two legs) instead of a range of values (e.g., a tiger weighs around 120kg), and have a larger number of arguments besides physical attributes. Encoding Numerics for Computation. Wallace et al. (2019) probe PTLMs in terms of the ability to represent numeracy tokens by a regression task (e.g., “71”→ 71.0), and also find that BERT is not good at encoding numerical tokens. Some works focus on incorporate algebra computa- tion ability in PTLMs [219, 46], thus making them able to answer math reasoning tasks such as MAWPS [77] and DROP [36]. Note that these models and tasks are not targeting numerical com- monsense knowledge but mainly the numerical-related computation within text. 77 4.1.7 Conclusion We present a probing task, NUMERSENSE, to induce numerical commonsense knowledge from pre-trained language models. We collect a new diagnostic dataset carefully verified by human annotators, which covers 8 different topics. Powerful pre-trained models such as BERT and RoBERTa perform surprisingly poorly, even after fine-tuning with high-quality distant supervi- sion. We hope our findings and probing dataset will provide a basis for improving pre-trained masked language models’ numerical and other concrete types of commonsense knowledge. 78 Chapter 5 Incorporating Structured Knowledge into LMs for CSR 5.1 Knowledge-Aware Graph Networks for CSR 5.1.1 Introduction Human beings are rational and a major component of rationality is the ability to reason. Reasoning is the process of combining facts and beliefs to make new decisions [66], as well as the ability to manipulate knowledge to draw inferences [62]. Commonsense reasoning utilizes the basic knowl- edge that reflects our natural understanding of the world and human behaviors, which is common to all humans. Empowering machines with the ability to perform commonsense reasoning have been seen as the bottleneck of artificial general intelligence [28]. Recently, there have been a few emerg- ing large-scale datasets for testing machine commonsense with various focuses [209, 150, 207]. In a typical dataset, CommonsenseQA [161], given a question like “Where do adults use glue sticks?”, with the answer choices being {classroom(✗), office ( ✓), desk drawer (✗)}, a common- sense reasoner is expected to differentiate the correct choice from other “distractive” candidates. 
False choices are usually highly related to the question context, but are just less plausible in real-world scenarios, making the task even more challenging. This paper aims to tackle the research question of how we can teach machines to make such commonsense inferences, particularly in the question-answering setting.

Figure 5.1: An example of using external commonsense knowledge (symbolic space) for inference on a natural language commonsense question (semantic space). The question "Where do adults use glue sticks?" with choices {classroom, office, desk drawer} is grounded to a schema graph over concepts such as glue_stick, adult, work, use, and office, connected by relations like CapableOf, AtLocation, and HasSubevent, and knowledge-aware commonsense inference is then performed over this graph.

It has been shown that simply fine-tuning large pre-trained language models such as GPT [132] and BERT [31] can be a very strong baseline method. However, there still exists a large gap between the performance of said baselines and human performance. Reasoning with neural models is also lacking in transparency and interpretability: it is unclear how such models manage to answer commonsense questions, which makes their inferences dubious. Merely relying on pre-training large language models on corpora cannot provide well-defined or reusable structures for explainable commonsense reasoning.

We argue that it would be more beneficial to propose reasoners that can exploit commonsense knowledge bases [156, 164, 148]. Knowledge-aware models can explicitly incorporate external knowledge as relational inductive biases [9] to enhance their reasoning capacity, as well as to increase the transparency of model behaviors for more interpretable results. Furthermore, a knowledge-centric approach is extensible through commonsense knowledge acquisition techniques [91, 192].

We propose a knowledge-aware reasoning framework for learning to answer commonsense questions, which has two major steps: schema graph grounding (§5.1.3) and graph modeling for inference (§5.1.4). As shown in Fig. 5.1, for each pair of question and answer candidate, we retrieve a graph from external knowledge graphs (e.g., ConceptNet) in order to capture the relevant knowledge for determining the plausibility of a given answer choice. The graphs are named "schema graphs", inspired by the schema theory proposed by Gestalt psychologists [5]. The grounded schema graphs are usually much more complicated and noisier than the ideal case shown in the figure. Therefore, we propose a knowledge-aware graph network module to effectively model schema graphs. Our model, KagNet, is a combination of graph convolutional networks [75] and LSTMs, with a hierarchical path-based attention mechanism, which forms a GCN-LSTM-HPA architecture for path-based relational graph representation. Experiments show that our framework achieved a new state-of-the-art performance on the CommonsenseQA dataset (the highest score on the leaderboard as of the time we submitted the paper, May 2019). Our model also works better than other methods with limited supervision, and provides human-readable results via intermediate attention scores.
Figure 5.2: The overall workflow of the proposed framework with the knowledge-aware graph network module. A question-answer pair goes through concept recognition and graph construction via path finding to yield a schema graph, while a language encoder (e.g., BERT) produces a statement vector; KagNet's GCN-LSTM-HPA module encodes the schema graph into a graph vector, and an MLP outputs the plausibility score.

5.1.2 Overview

In this section, we first formalize the commonsense question answering problem in a knowledge-aware setting, and then introduce the overall workflow of our framework.

5.1.2.1 Problem statement

Given a commonsense-required natural language question q and a set of N candidate answers {a_i}, the task is to choose one answer from the set. From a knowledge-aware perspective, we additionally assume that the question q and choices {a_i} can be grounded as a schema graph (denoted as g) extracted from a large external knowledge graph G, which is helpful for measuring the plausibility of answer candidates. The knowledge graph G = (V, E) can be defined as a fixed set of concepts V and typed edges E describing semantic relations between concepts. Therefore, our goal is to effectively ground and model schema graphs to improve the reasoning process.

5.1.2.2 Reasoning Workflow

As shown in Fig. 5.2, our framework accepts a pair of question and answer (QA-pair), denoted as q and a. It first recognizes the concepts mentioned in each of them from the concept set V of the knowledge graph. We then algorithmically construct the schema graph g by finding paths between pairs of mentioned concepts (§5.1.3). The grounded schema graph is further encoded with our proposed knowledge-aware graph network module (§5.1.4). We first use a model-agnostic language encoder, which can either be trainable or a fixed feature extractor, to represent the QA-pair as a statement vector. The statement vector serves as an additional input to the GCN-LSTM-HPA architecture for path-based attentive graph modeling to obtain a graph vector. The graph vector is finally fed into a simple multi-layer perceptron to score this QA-pair with a scalar ranging from 0 to 1, representing the plausibility of the inference. The answer candidate with the maximum plausibility score for the same question becomes the final choice of our framework.

5.1.3 Schema Graph Grounding

The grounding stage is three-fold: recognizing concepts mentioned in text (§5.1.3.1), constructing schema graphs by retrieving paths in the knowledge graph (§5.1.3.2), and pruning noisy paths (§5.1.3.3).

5.1.3.1 Concept Recognition

We match tokens in questions and answers to sets of mentioned concepts (C_q and C_a, respectively) from the knowledge graph G (for this paper we chose to use ConceptNet due to its generality).
We enhance this straightforward approach with some rules, such as soft matching with lemmatization and filtering of stop words, and we further deal with noise by pruning paths (§5.1.3.3) and by reducing their importance with attention mechanisms (§5.1.4.3).

5.1.3.2 Schema Graph Construction

ConceptNet. Before diving into the construction of schema graphs, we briefly introduce our target knowledge graph, ConceptNet. ConceptNet can be seen as a large set of triples of the form (h, r, t), like (ice, HasProperty, cold), where h and t represent the head and tail concepts in the concept set V and r is a relation type from the pre-defined set R. We delete and merge the original 42 relation types into 17 types, in order to increase the density of the knowledge graph for grounding and modeling (the full mapping list is in the appendix).

Sub-graph Matching via Path Finding. We define a schema graph as a sub-graph g of the whole knowledge graph G, which represents the knowledge relevant for reasoning about a given question-answer pair with minimal additional concepts and edges. One may want to find a minimal spanning sub-graph covering all the question and answer concepts, which is the NP-complete "Steiner tree problem" in graphs [45]. Due to the incompleteness and tremendous size of ConceptNet, we find that it is impractical to retrieve a comprehensive yet helpful set of knowledge facts this way. Therefore, we propose a straightforward yet effective graph construction algorithm via path finding among the mentioned concepts (C_q ∪ C_a), sketched below.
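A minimal sketch of this path-finding construction is shown below, assuming the merged ConceptNet has been loaded into a networkx graph whose nodes are concepts. Edge labels (relation types) are ignored here for brevity, and the function and variable names are illustrative rather than the actual implementation.

import networkx as nx
from itertools import product

def build_schema_graph(kg, q_concepts, a_concepts, k=4):
    """Collect all paths with at most k concepts between every
    (question concept, answer concept) pair, plus the KG edges that
    connect concepts on the same side (within C_q or within C_a)."""
    schema = nx.DiGraph()
    for cq, ca in product(q_concepts, a_concepts):
        if cq not in kg or ca not in kg:
            continue
        # `cutoff` counts edges, so k concepts corresponds to k - 1 hops.
        for path in nx.all_simple_paths(kg, cq, ca, cutoff=k - 1):
            nx.add_path(schema, path)
    for side in (q_concepts, a_concepts):
        for u, v in product(side, side):
            if u != v and kg.has_edge(u, v):
                schema.add_edge(u, v)
    return schema

In practice the enumerated paths are noisy, which is why §5.1.3.3 prunes them with KG-embedding scores.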
sha1_base64="IoEu3A5j4ruZ01pKrQYyOC/3ry0=">AAAB+XicbVDLSsNAFL2pr1pfUZduBovgQkqigi6LblxWsQ9oQ5hMJ+3YySTMTAol9E/cuFDErX/izr9x0mahrQcGDufcyz1zgoQzpR3n2yqtrK6tb5Q3K1vbO7t79v5BS8WpJLRJYh7LToAV5UzQpmaa004iKY4CTtvB6Db322MqFYvFo54k1IvwQLCQEayN5Nt2L8J6GITZw9TP2NnT1LerTs2ZAS0TtyBVKNDw7a9ePyZpRIUmHCvVdZ1EexmWmhFOp5VeqmiCyQgPaNdQgSOqvGyWfIpOjNJHYSzNExrN1N8bGY6UmkSBmcxzqkUvF//zuqkOr72MiSTVVJD5oTDlSMcorwH1maRE84khmEhmsiIyxBITbcqqmBLcxS8vk9Z5zb2ond9fVus3RR1lOIJjOAUXrqAOd9CAJhAYwzO8wpuVWS/Wu/UxHy1Zxc4h/IH1+QPDdpO9</latexit> LSTM Path Encoder T i,j <latexit sha1_base64="W8O8ds2U1YgPDUbktjjnwFXWkBo=">AAAB+XicbVDLSsNAFL2pr1pfUZduBovgQkqigi6LblxW6AvaECbTSTt2Mgkzk0IJ/RM3LhRx65+482+ctFlo64GBwzn3cs+cIOFMacf5tkpr6xubW+Xtys7u3v6BfXjUVnEqCW2RmMeyG2BFORO0pZnmtJtIiqOA004wvs/9zoRKxWLR1NOEehEeChYygrWRfNvuR1iPgjBrzvyMXTzNfLvq1Jw50CpxC1KFAg3f/uoPYpJGVGjCsVI910m0l2GpGeF0VumniiaYjPGQ9gwVOKLKy+bJZ+jMKAMUxtI8odFc/b2R4UipaRSYyTynWvZy8T+vl+rw1suYSFJNBVkcClOOdIzyGtCASUo0nxqCiWQmKyIjLDHRpqyKKcFd/vIqaV/W3Kva5eN1tX5X1FGGEziFc3DhBurwAA1oAYEJPMMrvFmZ9WK9Wx+L0ZJV7BzDH1ifP8aMk78=</latexit> P i,j <latexit sha1_base64="wjgZQHVCe19RlY4Rdt9vOhdqX/Q=">AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBbBg5REBT0WvXisYD+gDWWznbRrN5uwuxFK6I/w4kERr/4eb/4bt20O2vpg4PHeDDPzgkRwbVz32ymsrK6tbxQ3S1vbO7t75f2Dpo5TxbDBYhGrdkA1Ci6xYbgR2E4U0igQ2ApGt1O/9YRK81g+mHGCfkQHkoecUWOlVr2X8bPHSa9ccavuDGSZeDmpQI56r/zV7ccsjVAaJqjWHc9NjJ9RZTgTOCl1U40JZSM6wI6lkkao/Wx27oScWKVPwljZkobM1N8TGY20HkeB7YyoGepFbyr+53VSE177GZdJalCy+aIwFcTEZPo76XOFzIixJZQpbm8lbEgVZcYmVLIheIsvL5PmedW7qJ7fX1ZqN3kcRTiCYzgFD66gBndQhwYwGMEzvMKbkzgvzrvzMW8tOPnMIfyB8/kDGqSPag==</latexit> P i,j [k] <latexit sha1_base64="teqK/CNCdOXuGAw7kuDknxGtZZM=">AAAB8XicbVBNSwMxEJ2tX7V+VT16CRbBg5TdKuix6MVjBfuB26Vk02wbm02WJCuUpf/CiwdFvPpvvPlvTNs9aOuDgcd7M8zMCxPOtHHdb6ewsrq2vlHcLG1t7+zulfcPWlqmitAmkVyqTog15UzQpmGG006iKI5DTtvh6Gbqt5+o0kyKezNOaBDjgWARI9hY6aHRy9jZ48QfBb1yxa26M6Bl4uWkAjkavfJXty9JGlNhCMda+56bmCDDyjDC6aTUTTVNMBnhAfUtFTimOshmF0/QiVX6KJLKljBopv6eyHCs9TgObWeMzVAvelPxP89PTXQVZEwkqaGCzBdFKUdGoun7qM8UJYaPLcFEMXsrIkOsMDE2pJINwVt8eZm0alXvvFq7u6jUr/M4inAEx3AKHlxCHW6hAU0gIOAZXuHN0c6L8+58zFsLTj5zCH/gfP4AS7eQqw==</latexit> ↵ (i,j,k) <latexit sha1_base64="xtPE3JNDuXx21OqzTCU6ws7I8C8=">AAAB+XicbVBNS8NAEN34WetX1KOXxSJUKCWpgh6LXjxWsB/QhjDZbtq1m03Y3RRK6D/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgoQzpR3n21pb39jc2i7sFHf39g8O7aPjlopTSWiTxDyWnQAU5UzQpmaa004iKUQBp+1gdDfz22MqFYvFo54k1ItgIFjICGgj+bbdA54Mwc/KrPJUGV1MfbvkVJ058Cpxc1JCORq+/dXrxySNqNCEg1Jd10m0l4HUjHA6LfZSRRMgIxjQrqECIqq8bH75FJ8bpY/DWJoSGs/V3xMZREpNosB0RqCHatmbif953VSHN17GRJJqKshiUZhyrGM8iwH3maRE84khQCQzt2IyBAlEm7CKJgR3+eVV0qpV3ctq7eGqVL/N4yigU3SGyshF16iO7lEDNRFBY/SMXtGblVkv1rv1sWhds/KZE/QH1ucPh0KS7w==</latexit> Path-level Attention ConceptPair-level Attention. 
(i,j) <latexit sha1_base64="6SalFpAG/xZaMwaiQhZwsLyyDW4=">AAAB9HicbVDLSgNBEJyNrxhfUY9eBoMQQcJuFPQY9OIxgnlAsoTZSW8yZvbhTG8gLPkOLx4U8erHePNvnCR70GhBQ1HVTXeXF0uh0ba/rNzK6tr6Rn6zsLW9s7tX3D9o6ihRHBo8kpFqe0yDFCE0UKCEdqyABZ6Elje6mfmtMSgtovAeJzG4ARuEwhecoZHcrgfIemlZnD2cTnvFkl2x56B/iZOREslQ7xU/u/2IJwGEyCXTuuPYMbopUyi4hGmhm2iIGR+xAXQMDVkA2k3nR0/piVH61I+UqRDpXP05kbJA60ngmc6A4VAvezPxP6+ToH/lpiKME4SQLxb5iaQY0VkCtC8UcJQTQxhXwtxK+ZApxtHkVDAhOMsv/yXNasU5r1TvLkq16yyOPDkix6RMHHJJauSW1EmDcPJInsgLebXG1rP1Zr0vWnNWNnNIfsH6+AYLWpGf</latexit> LSTM( ) P i,j [k] <latexit sha1_base64="teqK/CNCdOXuGAw7kuDknxGtZZM=">AAAB8XicbVBNSwMxEJ2tX7V+VT16CRbBg5TdKuix6MVjBfuB26Vk02wbm02WJCuUpf/CiwdFvPpvvPlvTNs9aOuDgcd7M8zMCxPOtHHdb6ewsrq2vlHcLG1t7+zulfcPWlqmitAmkVyqTog15UzQpmGG006iKI5DTtvh6Gbqt5+o0kyKezNOaBDjgWARI9hY6aHRy9jZ48QfBb1yxa26M6Bl4uWkAjkavfJXty9JGlNhCMda+56bmCDDyjDC6aTUTTVNMBnhAfUtFTimOshmF0/QiVX6KJLKljBopv6eyHCs9TgObWeMzVAvelPxP89PTXQVZEwkqaGCzBdFKUdGoun7qM8UJYaPLcFEMXsrIkOsMDE2pJINwVt8eZm0alXvvFq7u6jUr/M4inAEx3AKHlxCHW6hAU0gIOAZXuHN0c6L8+58zFsLTj5zCH/gfP4AS7eQqw==</latexit> ….. ….. g <latexit sha1_base64="CFenPOxUj3CMv4UAOfuuXhcuTTU=">AAAB8XicbVDLSsNAFL3xWeur6tLNYBFclaQKuiy4cVnBPrANZTK9aYdOJmFmIpTQv3DjQhG3/o07/8ZJm4W2Hhg4nHMvc+4JEsG1cd1vZ219Y3Nru7RT3t3bPzisHB23dZwqhi0Wi1h1A6pRcIktw43AbqKQRoHATjC5zf3OEyrNY/lgpgn6ER1JHnJGjZUe+xE14yDMRrNBperW3DnIKvEKUoUCzUHlqz+MWRqhNExQrXuemxg/o8pwJnBW7qcaE8omdIQ9SyWNUPvZPPGMnFtlSMJY2ScNmau/NzIaaT2NAjuZJ9TLXi7+5/VSE974GZdJalCyxUdhKoiJSX4+GXKFzIipJZQpbrMSNqaKMmNLKtsSvOWTV0m7XvMua/X7q2rDLeoowSmcwQV4cA0NuIMmtICBhGd4hTdHOy/Ou/OxGF1zip0T+APn8wffSJD9</latexit> Modeling Relational Paths between R <latexit sha1_base64="CrLsuWMDMFWjz+iaUw09lj4OVG4=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeCF4+t2FpoQ9lsJ+3azSbsboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgpr6xubW8Xt0s7u3v5B+fCoreNUMWyxWMSqE1CNgktsGW4EdhKFNAoEPgTjm5n/8IRK81jem0mCfkSHkoecUWOl5l2/XHGr7hxklXg5qUCORr/81RvELI1QGiao1l3PTYyfUWU4Ezgt9VKNCWVjOsSupZJGqP1sfuiUnFllQMJY2ZKGzNXfExmNtJ5Ege2MqBnpZW8m/ud1UxNe+xmXSWpQssWiMBXExGT2NRlwhcyIiSWUKW5vJWxEFWXGZlOyIXjLL6+Sdq3qXVRrzctK3c3jKMIJnMI5eHAFdbiFBrSAAcIzvMKb8+i8OO/Ox6K14OQzx/AHzucPqIWMyA==</latexit> T <latexit sha1_base64="KcCkQ8Dr2DPVFNecfOXjV24oJ5Y=">AAAB6HicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fjw4rGFfkEbymY7adduNmF3I5TQX+DFgyJe/Une/Ddu2xy09cHA470ZZuYFieDauO63s7G5tb2zW9gr7h8cHh2XTk7bOk4VwxaLRay6AdUouMSW4UZgN1FIo0BgJ5jcz/3OEyrNY9k00wT9iI4kDzmjxkqN5qBUdivuAmSdeDkpQ476oPTVH8YsjVAaJqjWPc9NjJ9RZTgTOCv2U40JZRM6wp6lkkao/Wxx6IxcWmVIwljZkoYs1N8TGY20nkaB7YyoGetVby7+5/VSE975GZdJalCy5aIwFcTEZP41GXKFzIipJZQpbm8lbEwVZcZmU7QheKsvr5N2teJdV6qNm3LNzeMowDlcwBV4cAs1eIA6tIABwjO8wpvz6Lw4787HsnXDyWfO4A+czx+rjYzK</latexit> W 1 <latexit sha1_base64="LJRhcNBTUKLIm/qdf0cGHFUw/UU=">AAAB83icbVDLSsNAFL2pr1pfVZduBovgqiRV0GXBjcsK9gFNKZPpTTt0MgkzE6GE/oYbF4q49Wfc+TdO2iy09cDA4Zx7uWdOkAiujet+O6WNza3tnfJuZW//4PCoenzS0XGqGLZZLGLVC6hGwSW2DTcCe4lCGgUCu8H0Lve7T6g0j+WjmSU4iOhY8pAzaqzk+xE1kyDMuvOhN6zW3Lq7AFknXkFqUKA1rH75o5ilEUrDBNW677mJGWRUGc4Ezit+qjGhbErH2LdU0gj1IFtknpMLq4xIGCv7pCEL9fdGRiOtZ1FgJ/OMetXLxf+8fmrC20HGZZIalGx5KEwFMTHJCyAjrpAZMbOEMsVtVsImVFFmbE0VW4K3+uV10mnUvat64+G61nSLOspwBudwCR7cQBPuoQVtYJDAM7zCm5M6L86787EcLTnFzin8gfP5A/POkZE=</latexit> W 2 <latexit 
sha1_base64="lzkh16eyo/LRndrm7yyJ5f1AZBU=">AAAB83icbVDLSsNAFL2pr1pfVZduBovgqiRV0GXBjcsK9gFNKZPpTTt0MgkzE6GE/oYbF4q49Wfc+TdO2iy09cDA4Zx7uWdOkAiujet+O6WNza3tnfJuZW//4PCoenzS0XGqGLZZLGLVC6hGwSW2DTcCe4lCGgUCu8H0Lve7T6g0j+WjmSU4iOhY8pAzaqzk+xE1kyDMuvNhY1ituXV3AbJOvILUoEBrWP3yRzFLI5SGCap133MTM8ioMpwJnFf8VGNC2ZSOsW+ppBHqQbbIPCcXVhmRMFb2SUMW6u+NjEZaz6LATuYZ9aqXi/95/dSEt4OMyyQ1KNnyUJgKYmKSF0BGXCEzYmYJZYrbrIRNqKLM2JoqtgRv9cvrpNOoe1f1xsN1rekWdZThDM7hEjy4gSbcQwvawCCBZ3iFNyd1Xpx352M5WnKKnVP4A+fzB/VSkZI=</latexit> Graph Vector c (a) j <latexit sha1_base64="6jx4y/5xilUNLmSm8VO3ZuDeHtU=">AAAB8HicbVBNSwMxEJ2tX7V+VT16CRahXspuK+ix4MVjBfsh7VqyabaNTbJLkhXK0l/hxYMiXv053vw3pu0etPXBwOO9GWbmBTFn2rjut5NbW9/Y3MpvF3Z29/YPiodHLR0litAmiXikOgHWlDNJm4YZTjuxolgEnLaD8fXMbz9RpVkk78wkpr7AQ8lCRrCx0j3pPz6kZXw+7RdLbsWdA60SLyMlyNDoF796g4gkgkpDONa667mx8VOsDCOcTgu9RNMYkzEe0q6lEguq/XR+8BSdWWWAwkjZkgbN1d8TKRZaT0RgOwU2I73szcT/vG5iwis/ZTJODJVksShMODIRmn2PBkxRYvjEEkwUs7ciMsIKE2MzKtgQvOWXV0mrWvFqlertRanuZnHk4QROoQweXEIdbqABTSAg4Ble4c1Rzovz7nwsWnNONnMMf+B8/gA8EY/6</latexit> c (q) i <latexit sha1_base64="cFp0lm/JqJAAbxCNYiELRwyFIY0=">AAAB8HicbVBNSwMxEJ2tX7V+VT16CRahXspuK+ix4MVjBfsh7VqyabYNTbJrkhXK0l/hxYMiXv053vw3pu0etPXBwOO9GWbmBTFn2rjut5NbW9/Y3MpvF3Z29/YPiodHLR0litAmiXikOgHWlDNJm4YZTjuxolgEnLaD8fXMbz9RpVkk78wkpr7AQ8lCRrCx0j3ps4e0/Hg+7RdLbsWdA60SLyMlyNDoF796g4gkgkpDONa667mx8VOsDCOcTgu9RNMYkzEe0q6lEguq/XR+8BSdWWWAwkjZkgbN1d8TKRZaT0RgOwU2I73szcT/vG5iwis/ZTJODJVksShMODIRmn2PBkxRYvjEEkwUs7ciMsIKE2MzKtgQvOWXV0mrWvFqlertRanuZnHk4QROoQweXEIdbqABTSAg4Ble4c1Rzovz7nwsWnNONnMMf+B8/gBS55AJ</latexit> and g <latexit sha1_base64="70nNBQaCBgZQwVTQrRxMMmGaQK4=">AAAB6XicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fjw4rGK/YA2lM120y7dbMLuRCih/8CLB0W8+o+8+W/ctDlo64OBx3szzMwLEikMuu63s7a+sbm1Xdop7+7tHxxWjo7bJk414y0Wy1h3A2q4FIq3UKDk3URzGgWSd4LJbe53nrg2IlaPOE24H9GREqFgFK30MCoPKlW35s5BVolXkCoUaA4qX/1hzNKIK2SSGtPz3AT9jGoUTPJZuZ8anlA2oSPes1TRiBs/m186I+dWGZIw1rYUkrn6eyKjkTHTKLCdEcWxWfZy8T+vl2J442dCJSlyxRaLwlQSjEn+NhkKzRnKqSWUaWFvJWxMNWVow8lD8JZfXiXtes27rNXvr6oNt4ijBKdwBhfgwTU04A6a0AIGITzDK7w5E+fFeXc+Fq1rTjFzAn/gfP4A/MyM8Q==</latexit> Figure 5.3: Illustration of theGCN-LSTM-HPA architecture for the proposed KagNet module. Specifically, for each question concept c i ∈C q and answer concept c j ∈C a , we can efficiently find paths between them that are shorter than k concepts 3 . Then, we add edges, if any, between the concept pairs withinC q orC a . 5.1.3.3 Path Pruning via KG Embedding To prune irrelevant paths from potentially noisy schema graphs, we first utilize knowledge graph embedding (KGE) techniques, like TransE [186], to pre-train concept embeddings V and relation type embeddings R, which are also used as initialization for KagNet (§5.1.4). In order to measure the quality of a path, we decompose it into a set of triples, the confidence of which can be directly measured by the scoring function of the KGE method (i.e. the confidence of triple classification). Thus, we score a path with the multiplication product of the scores of each triple in the path, and then we empirically set a threshold for pruning (§5.1.5.3). 5.1.4 Knowledge-Aware Graph Network The core component of our reasoning framework is the knowledge-aware graph network module KagNet. The KagNet first encodes plain structures of schema graphs with graph convolutional net- works (§5.1.4.1) to accommodate pre-trained concept embeddings in their particular context within 3 We set k= 4 in experiments to gather three-hop paths. 85 schema graphs. 
It then utilizes LSTMs to encode the paths between C_q and C_a, capturing multi-hop relational information (§5.1.4.2). Finally, we apply a hierarchical path-based attention mechanism (§5.1.4.3) to complete the GCN-LSTM-HPA architecture, which models relational schema graphs with respect to the paths between question and answer concepts.

5.1.4.1 Graph Convolutional Networks

Graph convolutional networks (GCNs) encode graph-structured data by updating node vectors via pooling the features of their adjacent nodes [75]. Our intuition for applying GCNs to schema graphs is to 1) contextually refine the concept vectors and 2) capture structural patterns of schema graphs for generalization. Although we have obtained concept vectors by pre-training (§5.1.3.3), the representations of concepts still need to be further accommodated to their specific schema-graph context. Consider polysemous concepts such as "close" (§5.1.3.1), which can either be a verb concept as in "close the door" or an adjective concept meaning "a short distance apart". Using GCNs to update the concept vectors with their neighbors is thus helpful for disambiguation and contextualized concept embeddings. Also, the pattern of schema graph structures provides potentially valuable information for reasoning. For instance, shorter and denser connections between question and answer concepts could mean higher plausibility under specific contexts. As many works show [113, 215], relational GCNs [151] usually over-parameterize the model and cannot effectively utilize multi-hop relational information. We thus apply GCNs on the plain version (unlabeled, non-directional) of schema graphs, ignoring the relation types on the edges. Specifically, the vector for a concept c_i ∈ V_g in the schema graph g is initialized with its pre-trained embedding (h_i^{(0)} = V_i). Then, we update it at the (l+1)-th layer by pooling the features of its neighboring nodes (N_i) and its own at the l-th layer, with a non-linear activation function σ:

h_i^{(l+1)} = \sigma\Big( W_{\text{self}}^{(l)} h_i^{(l)} + \sum_{j \in N_i} \frac{1}{|N_i|} W^{(l)} h_j^{(l)} \Big)

5.1.4.2 Relational Path Encoding

In order to capture the relational information in schema graphs, we propose an LSTM-based path encoder on top of the outputs of the GCNs. Recall that our graph representation has a special purpose: "to measure the plausibility of a candidate answer to a given question". Thus, we propose to represent graphs with respect to the paths between the question concepts C_q and the answer concepts C_a. We denote the k-th path between the i-th question concept c_i^{(q)} ∈ C_q and the j-th answer concept c_j^{(a)} ∈ C_a as P_{i,j}[k], which is a sequence of triples:

P_{i,j}[k] = [(c_i^{(q)}, r_0, t_0), \ldots, (t_{n-1}, r_n, c_j^{(a)})]

Note that the relations are represented with trainable relation vectors (initialized with the pre-trained relation embeddings), and the concept vectors are the GCNs' outputs (h^{(l)}). Thus, each triple can be represented by the concatenation of the three corresponding vectors. We employ LSTM networks to encode these paths as sequences of triple vectors, taking the concatenation of the first and the last hidden states:

R_{i,j} = \frac{1}{|P_{i,j}|} \sum_k \mathrm{LSTM}(P_{i,j}[k])

The above R_{i,j} can be viewed as the latent relation between the question concept c_i^{(q)} and the answer concept c_j^{(a)}, for which we aggregate the representations of all the paths between them in the schema graph.
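A minimal PyTorch sketch of this path encoder is given below. It is purely illustrative (the class and variable names are assumptions, not the released KagNet code), and it encodes one concept pair at a time rather than batching paths as an efficient implementation would.

import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode the paths between one question concept and one answer concept.

    Each path is a tensor of triple vectors [h; r; t]; a path vector is the
    concatenation of the first and last LSTM hidden states, and R_ij is the
    mean of all path vectors between c_i^(q) and c_j^(a).
    """

    def __init__(self, triple_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(triple_dim, hidden_dim, batch_first=True)

    def encode_path(self, triples):          # triples: (path_len, triple_dim)
        outputs, _ = self.lstm(triples.unsqueeze(0))
        return torch.cat([outputs[0, 0], outputs[0, -1]], dim=-1)

    def forward(self, paths):                # paths: list of (path_len, triple_dim)
        path_vectors = torch.stack([self.encode_path(p) for p in paths])
        return path_vectors.mean(dim=0)      # R_ij, later re-weighted by attention

# Toy usage: two paths of lengths 2 and 3 with 300-dim triple vectors.
encoder = PathEncoder(triple_dim=300, hidden_dim=128)
R_ij = encoder([torch.randn(2, 300), torch.randn(3, 300)])
print(R_ij.shape)  # torch.Size([256])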
Now we can finalize the vector representation of a schema graph g by aggregating all the pair vectors using mean pooling:

T_{i,j} = \mathrm{MLP}([\mathbf{s}; c_i^{(q)}; c_j^{(a)}])

g = \frac{\sum_{i,j} [R_{i,j}; T_{i,j}]}{|C_q| \times |C_a|}

where [· ; ·] denotes the concatenation of two vectors. The statement vector s in the above equation is obtained from a language encoder, which can either be a trainable sequence encoder such as an LSTM, or a pre-trained universal language encoder such as GPT or BERT used as a fixed feature extractor. To encode a question-answer pair with universal language encoders, we simply create a sentence combining the question and the answer with a special token ("question + [sep] + answer"), and then use the vector of '[cls]' as suggested by prior works [161]. We concatenate R_{i,j} with an additional vector T_{i,j} before doing the average pooling. The vector T_{i,j} is inspired by the Relation Network [147], which also encodes latent relational information, yet from the context in the statement s instead of the schema graph g. Simply put, we want to combine the relational representations of a pair of question/answer concepts from both the schema graph side (symbolic space) and the language side (semantic space). Finally, the plausibility score of the answer candidate a to the question q can be computed as score(q, a) = \mathrm{sigmoid}(\mathrm{MLP}(g)).

5.1.4.3 Hierarchical Attention Mechanism

A natural argument against the above GCN-LSTM-mean architecture is that mean pooling over the path vectors does not always make sense, since some paths are more important than others for reasoning. Also, it is usually not the case that all pairs of question and answer concepts contribute equally to the reasoning. Therefore, we propose a hierarchical path-based attention mechanism to selectively aggregate important path vectors and then the more important question-answer concept pairs. This core idea is similar to the work of Yang et al. (2016), which proposes a document encoder with two levels of attention applied at the word and sentence level. In our case, we have path-level and concept-pair-level attention for learning to contextually model graph representations. We learn a parameter matrix W_1 for the path-level attention scores, and the importance of the path P_{i,j}[k] is denoted as \hat{\alpha}_{(i,j,k)}:

\alpha_{(i,j,k)} = T_{i,j} W_1 \mathrm{LSTM}(P_{i,j}[k])
\hat{\alpha}_{(i,j,\cdot)} = \mathrm{SoftMax}(\alpha_{(i,j,\cdot)})
\hat{R}_{i,j} = \sum_k \hat{\alpha}_{(i,j,k)} \cdot \mathrm{LSTM}(P_{i,j}[k])

Afterwards, we similarly obtain the attention over concept pairs:

\beta_{(i,j)} = \mathbf{s}\, W_2\, T_{i,j}
\hat{\beta}_{(\cdot,\cdot)} = \mathrm{SoftMax}(\beta_{(\cdot,\cdot)})
\hat{g} = \sum_{i,j} \hat{\beta}_{(i,j)} [\hat{R}_{i,j}; T_{i,j}]

The whole GCN-LSTM-HPA architecture is illustrated in Figure 5.3. To sum up, KagNet is a graph neural network module with the GCN-LSTM-HPA architecture that models relational graphs for relational reasoning in the context of both the symbolic space of knowledge and the semantic space of language.

5.1.5 Experiments

We introduce our setup for the CommonsenseQA dataset [161], present the baseline methods, and finally analyze the experimental results.

5.1.5.1 Dataset and Experiment Setup

The CommonsenseQA dataset consists of 12,102 (v1.11) natural language questions in total that require human commonsense reasoning ability to answer, where each question has five candidate answers (hard mode). The authors also release an easy version of the dataset by picking two random terms/phrases as distractors for a sanity check.
CommonsenseQA is directly gathered from real human annotators and covers a broad range of types of commonsense, including spatial, social, causal, physical, temporal, etc. To the best of our knowledge, CommonsenseQA may be the most suitable choice for evaluating supervised learning models for question answering.

Table 5.1: Comparisons with large pre-trained language model fine-tuning with different amounts of training data (IHdev / IHtest accuracy, %).

Model                     10% of IHtrain     50% of IHtrain     100% of IHtrain
                          IHdev   IHtest     IHdev   IHtest     IHdev   IHtest
Random guess              20.0    20.0       20.0    20.0       20.0    20.0
GPT-FINETUNING            27.55   26.51      32.46   31.28      47.35   45.58
GPT-KAGNET                28.13   26.98      33.72   32.33      48.95   46.79
BERT-BASE-FINETUNING      30.11   29.78      38.66   36.83      53.48   53.26
BERT-BASE-KAGNET          31.05   30.94      40.32   39.01      55.57   56.19
BERT-LARGE-FINETUNING     35.71   32.88      55.45   49.88      60.61   55.84
BERT-LARGE-KAGNET         36.82   33.91      58.73   51.13      62.35   57.16
Human Performance         -       88.9       -       88.9       -       88.9

For the comparisons with the results reported in the CommonsenseQA paper and on the leaderboard, we use the official split (9,741/1,221/1,140), named (OFtrain/OFdev/OFtest). Note that the performance on OFtest can only be tested by submitting predictions to the organizers. To efficiently test other baseline methods and run ablation studies, we use 1,241 randomly selected examples from the training data as our in-house test data, forming an (8,500/1,221/1,241) split denoted as (IHtrain/IHdev/IHtest). All experiments use the random-split setting as the authors suggested, and three or more random states are tested on the development sets to pick the best-performing one.

5.1.5.2 Compared Methods

We consider two different kinds of baseline methods as follows:

• Knowledge-agnostic Methods. These methods either use no external resources or only use unstructured textual corpora as additional information, e.g., gathering textual snippets from a search engine or using large pre-trained language models like BERT-LARGE. QABILINEAR, QACOMPARE, and ESIM are three supervised learning models for natural language inference that can be equipped with different word embeddings, including GloVe and ELMo. BIDAF++ utilizes Google web snippets as context and is further augmented with a self-attention layer, while using ELMo as input features. GPT/BERT-LARGE are fine-tuning methods with an additional linear layer for classification, as the authors suggested. They both add a special token '[sep]' to the input and use the hidden state of '[cls]' as the input to the linear layer. More details about them can be found in the dataset paper [161].

• Knowledge-aware Methods. We also adopt some recently proposed methods that incorporate knowledge graphs for question answering. KV-MEM [116] incorporates retrieved triples from ConceptNet at the word level, using a key-value memory module to improve the representation of each token individually by learning an attentive aggregation of related triple vectors. CBPT [216] is a plug-in method that assembles the predictions of any model with a straightforward way of utilizing pre-trained concept embeddings from ConceptNet. TEXTGRAPHCAT [184] concatenates the graph-based and text-based representations of the statement and then feeds them into a classifier. As an additional baseline, TRIPLESTRING, we create sentence templates to turn retrieved triples into sentences and feed them as additional text inputs.
Rajani et al. (2019) propose to collect human explanations for commonsense reasoning from annotators as additional knowledge (CoS-E), and then train a language model on such human annotations to improve model performance.

5.1.5.3 Implementation Details of KagNet

Our best settings of KagNet (tested on OFdev) have two GCN layers (100 and 50 dimensions, respectively) and one bidirectional LSTM (128 dimensions). We pre-train the KGE using TransE (100 dimensions) initialized with GloVe embeddings. The statement encoder in use is BERT-LARGE, which works as a pre-trained sentence encoder to obtain fixed features for each pair of question and answer candidate. The paths are pruned with the path-score threshold set to 0.15, keeping 67.21% of the original paths. We did not prune concept pairs with fewer than three paths. For the very few pairs with no paths, \hat{R}_{i,j} is a randomly sampled vector. We learn our KagNet models with Adam optimizers [74]. In our experiments, we found that the recall of ConceptNet on commonsense questions and answers is very high (over 98% of the QA-pairs have more than one grounded concept).

Table 5.2: Comparison with official benchmark baseline methods using the official split on the leaderboard.

Model                               OFdev-Acc.(%)   OFtest-Acc.(%)
Random guess                        20.0            20.0
BIDAF++                             -               32.0
QACOMPARE+GLOVE                     -               25.7
QABLINEAR+GLOVE                     -               31.5
ESIM+ELMO                           -               32.8
ESIM+GLOVE                          -               34.1
GPT-FINETUNING                      47.11           45.5
BERT-BASE-FINETUNING                53.57           53.0
BERT-LARGE-FINETUNING               62.34           56.7
COS-E (w/ additional annotations)   -               58.2
KAGNET (Ours)                       64.46           58.9
Human Performance                   -               88.9

5.1.5.4 Performance Comparisons and Analysis

Comparison with standard baselines. As shown in Table 5.2, we first use the official split to compare our model with the baseline methods reported in the paper and on the leaderboard. BERT- and GPT-based pre-training methods perform much better than the other baseline methods, demonstrating the ability of language models to store commonsense knowledge in an implicit way. This presumption is also investigated by Trinh and Le (2019) and Wang et al. (2019). Our proposed framework achieves an absolute improvement of 2.2% in accuracy on the test data, a new state-of-the-art performance. We conduct experiments with our in-house splits to investigate whether our KagNet can also work well with other universal language encoders (GPT and BERT-BASE), particularly with different fractions of the dataset (10%, 50%, and 100% of the training data). Table 5.1 shows that our KagNet-based methods using fixed pre-trained language encoders outperform fine-tuning those encoders themselves in all settings. Furthermore, we find that the improvements in the small-data setting (10%) are relatively limited, and we believe an important future research direction is thus few-shot learning for commonsense reasoning.

Table 5.3: Comparisons with knowledge-aware baseline methods using the in-house split (both easy and hard mode) on top of a BLSTM as the sentence encoder.

                        Easy Mode              Hard Mode
Model                   IHdev.(%)  IHtest.(%)  IHdev.(%)  IHtest.(%)
Random guess            33.3       33.3        20.0       20.0
BLSTMS                  80.15      78.01       34.79      32.12
+ KV-MN                 81.71      79.63       35.70      33.43
+ CSPT                  81.79      80.01       35.31      33.61
+ TEXTGRAPHCAT          82.68      81.03       34.72      33.15
+ TRIPLESTRING          79.11      76.02       33.19      31.02
+ KAGNET                83.26      82.15       36.38      34.57
Human Performance       -          99.5        -          88.9

Comparison with knowledge-aware baselines. To compare our model with the other adopted baseline methods that also incorporate ConceptNet, we set up a bidirectional-LSTM-based model for our in-house dataset.
Then, we add the baseline methods and KagNet on top of the BLSTMs to compare their abilities to utilize external knowledge (we use the LSTM-based setup because it is non-trivial to apply token-level knowledge-aware baseline methods to complicated pre-trained encoders like BERT). Table 5.3 shows the comparisons under both the easy mode and the hard mode, and our method outperforms all knowledge-aware baseline methods by a large margin in terms of accuracy. Note that we compare our model and CoS-E in Table 5.2. Although CoS-E also achieves a better result than only fine-tuning BERT by training with human-generated explanations, we argue that our proposed KagNet does not utilize any additional human effort to provide more supervision.

Ablation study on model components. To better understand the effectiveness of each component of our method, we conducted an ablation study, shown in Table 5.4. We find that replacing our GCN-LSTM-HPA architecture with traditional relational GCNs, which use separate weight matrices for different relation types, results in worse performance, due to their over-parameterization. The attention mechanisms at the two levels matter almost equally, and pruning also effectively filters noisy paths.

Table 5.4: Ablation study on the KagNet framework.

Model                                   IHdev.(%)   IHtest.(%)
KAGNET (STANDARD)                       62.35       57.16
: replace GCN-HPA-LSTM w/ R-GCN         60.01       55.08
: w/o GCN                               61.84       56.11
: #GCN Layers = 1                       62.05       57.03
: w/o Path-level Attention              60.12       56.05
: w/o QAPair-level Attention            60.39       56.13
: using all paths (w/o pruning)         59.96       55.27

Error analysis. Among the failed cases, there are three kinds of hard problems that KagNet is still not good at.

• Negative reasoning: the grounding stage is not sensitive to negation words, and the model can thus choose exactly the opposite answers.

• Comparative reasoning strategy: for questions with more than one highly plausible answer, the commonsense reasoner should benefit from explicitly investigating the differences between the answer candidates, while the KagNet training method is not capable of doing so.

• Subjective reasoning: many answers actually depend on the "personality" of the reasoner. For instance, for "Traveling from new place to new place is likely to be what?", the dataset gives the answer as "exhilarating" instead of "exhausting", which we think is more like a personalized, subjective inference rather than common sense.

5.1.5.5 Case Study on Interpretability

Our framework enjoys the merit of being more transparent and thus provides a more interpretable inference process. We can understand our model's behaviors by analyzing the hierarchical attention scores over the question-answer concept pairs and the paths between them. Figure 5.4 shows an example of how we can analyze our KagNet framework through both the pair-level and the path-level attention scores. We first select the concept pairs with the highest attention scores and then look at the (one or two) top-ranked paths for each selected pair. We find that the paths located in this way are highly related to the inference process, and we also observe that noisy concepts like "fountain" are diminished during modeling.

5.1.5.6 Model Transferability

We study the transferability of a model that is trained on CommonsenseQA (CSQA) by directly testing it on another task while fixing its parameters. Recall that we have obtained a BERT-LARGE model and a KagNet model trained on CSQA. We denote them as CSQA-BL and CSQA-KN to indicate that they are no longer trainable.
In order to investigate their transferability, we separately test them on the SWAG [209] and WSC [85] datasets. We first test them on the 20k validation examples of SWAG. CSQA-BL has an accuracy of 56.53%, while our fixed CSQA-KN model achieves 59.01%. Similarly, we also test both models on WSC-QA, which converts the WSC pronoun resolution task into a multiple-choice QA task. CSQA-BL achieves an accuracy of 51.23%, while our model CSQA-KN scores 53.51%. These two comparisons further support our assumption that KagNet, as a knowledge-centric model, is more extensible for commonsense reasoning. As we would expect of a good knowledge-aware framework, KagNet indeed enjoys better transferability than only fine-tuning large language encoders like BERT.

5.1.5.7 Recent methods on the leaderboard

We argue that KagNet utilizes ConceptNet as its only external resource, while other methods improve their performance in orthogonal directions: 1) we find that most of the other recent submissions (as of Aug. 2019) with public information on the leaderboard utilize larger additional textual corpora (e.g., the top-10 matched sentences in full Wikipedia via information retrieval tools) and fine-tune larger pre-trained encoders, such as XLNet [201] and RoBERTa [104]; 2) there are also models using multi-task learning to transfer knowledge from other reading comprehension datasets, such as RACE [80] and OpenBookQA [118]. An interesting fact is that the best performance on the OFtest set is still achieved by the original fine-tuned RoBERTa model, which is pre-trained on corpora much larger than BERT's. All other RoBERTa-extended methods show negative improvements. We also use statement vectors from RoBERTa as the input vectors for KagNet, and find that the performance on OFdev marginally improves from 77.47% to 77.56%. Based on the failed cases discussed in our error analysis, we believe fine-tuning RoBERTa has reached its limit due to the annotator biases of the dataset and the lack of comparative reasoning strategies.

5.1.6 Related Work

Commonsense knowledge and reasoning. There is a recent surge of novel large-scale datasets for testing machine commonsense with various focuses, such as situation prediction (SWAG) [209], social behavior understanding [148, 150], visual scene comprehension [207], and general commonsense reasoning [161], which encourages the study of supervised learning methods for commonsense reasoning. Trinh and Le (2018) find that large language models show promising results on the WSC resolution task [85], but this approach can hardly be applied in a more general question answering setting and also does not provide the explicit knowledge used in inference. A unique merit of our KagNet method is that it provides grounded explicit knowledge triples and paths with scores, such that users can better understand and put trust in the behaviors and inferences of the model.

Injecting external knowledge for NLU. Our work also lies in the general context of using external knowledge to encode sentences or answer questions. Yang and Mitchell (2017) are among the first to propose encoding sentences by continually retrieving related entities from knowledge bases and then merging their embeddings into the LSTM computations, achieving better performance on entity/event extraction tasks.
Weissenborn, Kočiský, and Dyer (2017), Mihaylov and Frank (2018), and Annervaz, Chowdhury, and Dukkipati (2018) follow this line of work to incorporate the embeddings of related knowledge triples at the word level and improve the performance of natural language understanding tasks. In contrast to our work, they do not explicitly impose graph-structured knowledge on their models, but limit its potential to transforming word embeddings into concept embeddings. Some other recent attempts [216, 184] to use ConceptNet graph embeddings are adopted and compared in our experiments (§5.1.5.2). Rajani et al. (2019) propose to manually collect more human explanations for correct answers as additional supervision for auxiliary training. Our KagNet-based framework focuses on injecting external knowledge as an explicit graph structure, and enjoys relational reasoning capacity over the graphs.

Relational reasoning. KagNet can be seen as a knowledge-augmented Relation Network (RN) module [147], which was proposed for visual question answering tasks that require relational reasoning (i.e., questions about the relations between multiple 3D objects in an image). We view the concepts in the questions and answers as objects and effectively utilize external knowledge graphs to model their relations from both the semantic and the symbolic spaces (§5.1.4.2), while prior methods mainly work in the semantic one.

5.1.7 Conclusion

We propose a knowledge-aware framework for learning to answer commonsense questions. The framework first constructs schema graphs to represent relevant commonsense knowledge, and then models the graphs with our KagNet module. The module is based on a GCN-LSTM-HPA architecture, which effectively represents graphs for relational reasoning in a transparent, interpretable way, yielding new state-of-the-art results on a large-scale general dataset for testing machine commonsense. Future directions include better question parsing methods to deal with negation and comparative question answering, as well as incorporating knowledge into visual reasoning.

Figure 5.4: An example of interpreting model behaviors by hierarchical attention scores. (For the question "What do you fill with ink to write on an A4 paper?", KagNet selects "fountain pen"; the top-scored concept pairs and paths, e.g., ink --PartOf--> fountain_pen and write <--UsedFor-- pen <--IsA-- fountain_pen, explain the prediction, while the other models choose "printer" (BERT) or "pencil case" (GPT).)

Chapter 6 Modeling Unstructured Knowledge Corpora with LMs for CSR

6.1 DrFact: An Efficient Approach for Differentiable Reasoning over Facts

In this section we present DRFACT, a model for multi-hop reasoning over facts. More implementation details are in Appendix ??.

6.1.0.1 Overview

In DRFACT, we propose to model reasoning as traversing a hypergraph, where each hyperedge corresponds to a fact in F and connects the concepts in V that are mentioned in that fact. This is shown in Figure 2.2. Notice that a fact, as a hyperedge, connects the multiple concepts that are mentioned in it, while the textual form of the fact maintains the contextual information of the original natural language statement; hence we do not assume a fixed set of relations.
Given such a hypergraph, our open-ended reasoning model will traverse the hypergraph starting from the question (concepts) and finally arrive at a set of concept nodes by following multiple hyperedges (facts). A probabilistic view of this process over T hops is:

P(c \mid q) = P(c \mid q, F_T) \prod_{t=1}^{T} P(F_t \mid q, F_{t-1})\, P(F_0 \mid q)

Intuitively, we want to model the distribution of a concept c ∈ V being an answer to a question q as P(c | q). This answering process can be seen as multiple iterations of "fact-following," or moving from one fact to another based on shared concepts, and finally moving from facts to concepts. We use F_t to represent a weighted set of retrieved facts at hop t, and F_0 for the initial facts defined below. Then, given the question and the currently retrieved facts, we iteratively retrieve the facts for the next hop. Finally, we score a concept using the retrieved facts.
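The factorization above can be read procedurally. The following is a minimal sketch under stated assumptions: initial_facts, fact_follow, and the concept-to-fact matrix E are placeholders that the next subsections make concrete, and the matrix product used for concept scoring simplifies the max-over-facts aggregation used in the actual model.

import numpy as np

def open_csr_inference(q_vec, T, alphas, initial_facts, fact_follow, E):
    """Hop-wise traversal: start from facts matched to the question, follow
    facts T times, and accumulate concept scores with learnable hop weights."""
    F = initial_facts(q_vec)            # F_0  ~  P(F_0 | q)
    A = np.zeros(E.shape[0])            # scores over the concept vocabulary V
    for t in range(1, T + 1):
        F = fact_follow(F, q_vec)       # F_t  ~  P(F_t | q, F_{t-1})
        # A_t: project fact weights onto concepts (the real model takes the
        # max over facts mentioning a concept rather than this weighted sum).
        A += alphas[t - 1] * (E @ F)
    return A                            # used to rank answer concepts c for P(c | q)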
sha1_base64="pEwlDp+Fc5Gjf2Fn9fIogcAvOcg=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0WPRi8cW7Ae0oWw2k3btZhN2N0Ip/QVePCji1Z/kzX/jts1BWx8MPN6bYWZekAqujet+O4W19Y3NreJ2aWd3b/+gfHjU0kmmGDZZIhLVCahGwSU2DTcCO6lCGgcC28Hobua3n1BpnsgHM07Rj+lA8ogzaqzUCPvlilt15yCrxMtJBXLU++WvXpiwLEZpmKBadz03Nf6EKsOZwGmpl2lMKRvRAXYtlTRG7U/mh07JmVVCEiXKljRkrv6emNBY63Ec2M6YmqFe9mbif143M9GNP+EyzQxKtlgUZYKYhMy+JiFXyIwYW0KZ4vZWwoZUUWZsNiUbgrf88ippXVS9q6rbuKzUbvM4inACp3AOHlxDDe6hDk1ggPAMr/DmPDovzrvzsWgtOPnMMfyB8/kDyTuM7A==</latexit> d ! # ! $ Concept-to-Fact Sparse Matrix for in [1,…,%]{ } // 1. Initial Facts. // 2. Fact-Follow. <latexit sha1_base64="692OszH2og0RyrvmGvHBbKjKTmg=">AAACGHicbVBNS8NAEN34bf2qevSyWAQFrYkoehGKQvGoYFVoStlsJ+3iJht3J2oJ+Rle/CtePCjitTf/jdvag18PBh7vzTAzL0ikMOi6H87I6Nj4xOTUdGFmdm5+obi4dGFUqjnUuJJKXwXMgBQx1FCghKtEA4sCCZfB9XHfv7wFbYSKz7GbQCNi7ViEgjO0UrO4XW1mmB/6CPeYVRnHraqSUt3lvoQQ1/vulpdv0htfi3YHN5rFklt2B6B/iTckJTLEabPY81uKpxHEyCUzpu65CTYyplFwCXnBTw0kjF+zNtQtjVkEppENHsvpmlVaNFTaVox0oH6fyFhkTDcKbGfEsGN+e33xP6+eYnjQyEScpAgx/1oUppKiov2UaEto4Ci7ljCuhb2V8g7TNh+bZcGG4P1++S+52Cl7e2X3bLdUORrGMUVWyCpZJx7ZJxVyQk5JjXDyQJ7IC3l1Hp1n5815/2odcYYzy+QHnN4ngpOgCA==</latexit> F t = Fact-Follow(F t 1 ,q) <latexit sha1_base64="TjEiygIGSwowzdlUkbxfyAK6Y/U=">AAACBnicbVC7SgNBFJ31GdfXqqUIgyFgFXYF0SaQaGMZIS/IxmV2MkmGzD6YuSuEJZWNP+EH2FgoYus32KgfYu/kUWjigQuHc+7l3nv8WHAFtv1lLCwuLa+sZtbM9Y3NrW1rZ7emokRSVqWRiGTDJ4oJHrIqcBCsEUtGAl+wut+/GPn1GyYVj8IKDGLWCkg35B1OCWjJsw5KuIBdlQReCgVneF3BLhFxj3iASx54VtbO22PgeeJMSbaY+/78uDe7Zc96d9sRTQIWAhVEqaZjx9BKiQROBRuabqJYTGifdFlT05AETLXS8RtDnNNKG3ciqSsEPFZ/T6QkUGoQ+LozINBTs95I/M9rJtA5a6U8jBNgIZ0s6iQCQ4RHmeA2l4yCGGhCqOT6Vkx7RBIKOjlTh+DMvjxPasd55yRvXznZ4jmaIIP20SE6Qg46RUV0icqoiii6RQ/oCT0bd8aj8WK8TloXjOnMHvoD4+0HuZ+byA==</latexit> A= P T t=1 ↵ t A t // 3. Emit Concepts. // 4. Final answers. DrFact <latexit sha1_base64="8zLQJ1Nflbb+E+4rRvniE+Dg8w0=">AAACC3icbVDLSsNAFJ34rPUVdelmaBEEoSSC6DIoiMsK9gFNKJPppB06mYSZiRDS7N248zvcuFDErT/grn/jpC2orQcuHM65l3vv8WNGpbKssbG0vLK6tl7aKG9ube/smnv7TRklApMGjlgk2j6ShFFOGooqRtqxICj0GWn5w6vCb90TIWnE71QaEy9EfU4DipHSUtesjNwQqQFGLGvmI+gqGhIJf8TrfNQ1q1bNmgAuEntGqk7FPXkaO2m9a365vQgnIeEKMyRlx7Zi5WVIKIoZyctuIkmM8BD1SUdTjvRKL5v8ksMjrfRgEAldXMGJ+nsiQ6GUaejrzuJGOe8V4n9eJ1HBhZdRHieKcDxdFCQMqggWwcAeFQQrlmqCsKD6VogHSCCsdHxlHYI9//IiaZ7W7LOadWtXnUswRQkcggo4BjY4Bw64AXXQABg8gGfwCt6MR+PFeDc+pq1LxmzmAPyB8fkNHk2etQ==</latexit> |V|⇥ |F| <latexit sha1_base64="YrbP637LW/mIvbu3nBH/TE2cjmY=">AAAB6XicbZBNS8NAEIYn9avGr6pHL8EieCqJIHoRi148VrEf0Iay2W7apZtN2J0IJfQfePGgiNf+GO9exH/jpu1Bqy8sPLzvDDszQSK4Rtf9sgpLyyura8V1e2Nza3untLvX0HGqKKvTWMSqFRDNBJesjhwFayWKkSgQrBkMr/O8+cCU5rG8x1HC/Ij0JQ85JWisO7S7pbJbcady/oI3h/Llu32RTD7tWrf00enFNI2YRCqI1m3PTdDPiEJOBRvbnVSzhNAh6bO2QUkipv1sOunYOTJOzwljZZ5EZ+r+7MhIpPUoCkxlRHCgF7Pc/C9rpxie+xmXSYpM0tlHYSocjJ18bafHFaMoRgYIVdzM6tABUYSiOU5+BG9x5b/QOKl4pxX31itXr2CmIhzAIRyDB2dQhRuoQR0ohPAIz/BiDa0n69V6m5UWrHnPPvySNfkGdeGQTg==</latexit> t <latexit sha1_base64="j2NXoYoYRdqwvrDpsqGaw4GoLx4=">AAAB8XicbZDLSgMxFIYz9VbrrSq4cRMsgqsyI4huhFpRXLZgL9gOQybNtKGZzJCcEcrQt3DjQhG34lv4BO7c+Cyml4W2/hD4+P9zyDnHjwXXYNtfVmZhcWl5JbuaW1vf2NzKb+/UdZQoymo0EpFq+kQzwSWrAQfBmrFiJPQFa/j9y1HeuGdK80jewiBmbki6kgecEjDW3YUH+BxfXXvg5Qt20R4Lz4MzhUJpr/rN38sfFS//2e5ENAmZBCqI1i3HjsFNiQJOBRvm2olmMaF90mUtg5KETLvpeOIhPjROBweRMk8CHru/O1ISaj0IfVMZEujp2Wxk/pe1EgjO3JTLOAEm6eSjIBEYIjxaH3e4YhTEwAChiptZMe0RRSiYI+XMEZzZleehflx0Top21SmUymiiLNpHB+gIOegUldANqqAaokiiB/SEni1tPVov1uukNGNNe3bRH1lvP9wOk28=</latexit> A t =EF t ! 
!"# $ <latexit sha1_base64="iRVoy/eAik5+A8Lof+h/WZ88ao8=">AAACKHicbZDLSsNAFIYn9V5vVZduBotQQUsiim5E0Y1LBXuBJoTJdNIOTi7OnAg15HHc+CpuRBRx65M4qSlo6w8DP985hznn92LBFZjmp1Gamp6ZnZtfKC8uLa+sVtbWmypKJGUNGolItj2imOAhawAHwdqxZCTwBGt5txd5vXXPpOJReAODmDkB6YXc55SARm7ltGcL5kPNDgj0PT99yNwU9qxsF4/IXU4yW/JeH3ZORrRf9LmVqlk3h8KTxipMFRW6ciuvdjeiScBCoIIo1bHMGJyUSOBUsKxsJ4rFhN6SHutoG5KAKScdHprhbU262I+kfiHgIf09kZJAqUHg6c58TzVey+F/tU4C/rGT8jBOgIX05yM/ERginKeGu1wyCmKgDaGS610x7RNJKOhsyzoEa/zkSdPcr1uHdfP6oHp2XsQxjzbRFqohCx2hM3SJrlADUfSIntEbejeejBfjw/j8aS0ZxcwG+iPj6xvOR6ep</latexit> g(z t 1 ,q t )=h t 1 <latexit sha1_base64="vIkNQ5CDZZCVr9QBGPFVW1oMO+g=">AAACHnicbVBNa9tAFFy5aeq6TaKmx16WmoILjZFCTXI0SQ4tIeCS+gMsI1brJ3vxSit2nwpG6Jf0kr+SSw8tpZBT+2+ycnxInA4sDDPvsfMmyqQw6Hn/nNqTrafbz+rPGy9e7uzuua/2B0blmkOfK6n0KGIGpEihjwIljDINLIkkDKPFaeUPv4E2QqVfcZnBJGGzVMSCM7RS6HYClYFmqHTKEiguPvcuy/A8kBBj6+wDDRKG8ygu5mVY4IFfBlrM5vg+dJte21uBPib+mjTJGr3QvQmmiucJpMglM2bsexlOCqZRcAllI8gNZIwv2AzGllZZzKRYnVfSd1aZ0lhp+1KkK/X+RsESY5ZJZCeruGbTq8T/eeMc4+NJIdIsR0j53UdxLikqWnVFp0IDR7m0hHEtbFbK50wzjrbRhi3B3zz5MRkctv1O2/vysdk9WddRJ2/IW9IiPjkiXfKJ9EifcPKdXJOf5Jdz5fxwfjt/7kZrznrnNXkA5+8tlNuivg==</latexit> MIPS K (D,h t 1 ) Figure 6.1: The overall workflow of D RFACT. We encode the hypergraph (Fig. 2.2) with a concept-to-fact sparse matrix E and a fact-to-fact sparse matrix S. The dense fact index D is pre- computed with a pre-trained bi-encoder. A weighed set of facts is represented as a sparse vector F. The workflow (left) of D RFACT starts mapping a question to a set of initial facts that have common concepts with it. Then, it recursively performsFact-Follow operations (right) for computing F t and A t . Finally, it uses learnable hop-weightsα t to aggregate the answers. 6.1.0.2 Pre-computed Indices Dense Neural Fact Index D. We pre-train a bi-encoder architecture over BERT [31], which learns to maximize the score of facts that contain correct answers to a given question, following the steps of Karpukhin et al. (2020) (i.e., dense passage retrieval), so that we can use MIPS to do dense retrieval over the facts. After pre-training, we embed each fact inF with a dense vector (using the [CLS] token representation). Hence D is a|F|× d dense matrix. Sparse Fact-to-Fact Index S. We pre-compute the sparse links between facts by a set of connec- tion rules, such as f i → f j when f i and f j have at least one common concept and f j introduces at 100 least two more new concepts that are not in f i (see Appendix ?? (2) for more). Hence S is a binary sparse tensor with the dense shape|F|×| F|. Sparse Index of Concept-to-Fact Links E. As shown in Figure 2.2, a concept can appear in multiple facts and a fact also usually mentions multiple concepts. We encode these co-occurrences between each fact and its mentioned concepts into a sparse matrix with the dense shape|V|×| F| — i.e., the concept-to-fact index. 6.1.0.3 Differentiable Fact-Following Operation The most important part in our framework is how to model the fact-following step in our formula- tion, i.e., P(F t | F t− 1 ,q). For modeling the translation from a fact to another fact under the context of a question q, we propose an efficient approach with a differentiable operation that uses both neural embeddings of the facts and their symbolic connections in the hypergraph. The symbolic connections between facts are represented by the very sparse fact-to-fact ma- trix S, which in our model is efficiently implemented with the tf.RaggedTensor construct of TensorFlow [33]. 
6.1.0.3 Differentiable Fact-Following Operation

The most important part of our framework is how to model the fact-following step in our formulation, i.e., P(F_t | F_{t-1}, q). To model the translation from one fact to another under the context of a question q, we propose an efficient approach with a differentiable operation that uses both the neural embeddings of the facts and their symbolic connections in the hypergraph.

The symbolic connections between facts are represented by the very sparse fact-to-fact matrix S, which in our model is efficiently implemented with the tf.RaggedTensor construct of TensorFlow [33]. S stores a pre-computed dependency between pairs of facts, S_ij. Intuitively, if we can traverse from f_i to f_j, these facts should mention some common concepts and their semantics should be related, and S_ij reflects this intuition. The fact embeddings computed by the pre-trained bi-encoder are stored in the dense index of fact vectors D, which contains rich semantic information about each fact and helps measure the plausibility of a fact in the context of a given question.

The proposed fact-follow operation has two parallel sub-steps: 1) sparse retrieval and 2) dense retrieval. The sparse retrieval uses the fact-to-fact sparse matrix to obtain possible next-hop facts. We can compute F_t^s = F_{t-1} S efficiently thanks to the ragged representation of sparse matrices. For the neural dense retrieval, we use a maximum inner product search (MIPS) [65, 53] over the dense fact embedding index D:

z_{t-1} = F_{t-1} D
h_{t-1} = g(z_{t-1}, q_t)
F_t^d = \mathrm{MIPS}_K(h_{t-1}, D)

We first aggregate the dense vectors of the facts in F_{t-1} into the dense vector z_{t-1}, which is fed into a neural layer together with the query embedding at the current step, q_t (encoded by BERT), to create a query vector h_{t-1}. Here g(·) is an MLP that maps the concatenation of the two input vectors to a dense output with the same dimensionality as the fact vectors, which we name the fact-translating function. Finally, we retrieve the next-hop top-K facts F_t^d with the MIPS_K operator. To get the best of both the symbolic and the neural worlds, we use element-wise multiplication to combine the sparse and dense retrieval results: F_t = F_t^s ⊙ F_t^d. We summarize the fact-following operation with these differentiable steps:

F_t = \mathrm{Fact\text{-}Follow}(F_{t-1}, q) = F_{t-1} S \odot \mathrm{MIPS}_K(g(F_{t-1} D, q_t), D)   (6.1)

After each hop, we multiply F_t with a pre-computed fact-to-concept matrix E, thus generating A_t, a set of concept predictions. To aggregate the concept scores, we take the maximum score among the facts that mention a concept c. Finally, we take the weighted sum of the concept predictions at all hops as the final weighted concept set A = \sum_{t=1}^{T} \alpha_t A_t, where α_t is a learnable parameter. Please read Appendix ?? for more details.

Equation 6.1 defines a random-walk process on the hypergraph associated with the corpus. We found that performance was improved by making this a "lazy" random walk, in particular by augmenting F_t with the facts in F_{t-1} that have a weight higher than a threshold τ: F_t = \mathrm{Fact\text{-}Follow}(F_{t-1}, q) + \mathrm{Filter}(F_{t-1}, \tau). We call this self-following; it means that F_t contains highly relevant facts for all distances t' < t, which improves the model when different questions require different numbers of "hops".

Initial Facts. Note that the set of initial facts F_0 is computed differently, as it is produced using the input question q instead of a previous-hop F_{t-1}. We first use our pre-trained bi-encoder and the associated index D via a MIPS query to find facts related to q, and then select from the retrieved set those facts that contain question concepts (i.e., concepts that are matched in the question text), using the concept-to-fact index E.
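The following is a minimal NumPy sketch of one Fact-Follow step (Eq. 6.1), including the optional self-following. It is an illustration under stated assumptions: S and D are plain NumPy arrays here, translate stands in for the learned fact-translating MLP g, and the top-K dot-product search is a stand-in for the MIPS index used in the actual TensorFlow implementation.

import numpy as np

def fact_follow(F_prev, q_t, S, D, translate, K=100, tau=None):
    """One Fact-Follow step: mix sparse fact-to-fact expansion with
    dense (MIPS-style) retrieval, then optionally self-follow."""
    # 1) Symbolic side: expand the previous facts over the fact-to-fact links.
    F_sparse = F_prev @ S

    # 2) Neural side: aggregate fact vectors, build a hop-specific query with
    #    the fact-translating function g, and keep the top-K facts by score.
    z = F_prev @ D                                   # aggregated fact embedding
    h = translate(np.concatenate([z, q_t]))          # h_{t-1} = g(z_{t-1}, q_t)
    scores = D @ h
    top_k = np.argsort(-scores)[:K]
    F_dense = np.zeros_like(F_prev)
    F_dense[top_k] = scores[top_k]

    # 3) Element-wise mixing, plus lazy "self-following" above threshold tau.
    F_next = F_sparse * F_dense
    if tau is not None:
        F_next = F_next + np.where(F_prev > tau, F_prev, 0.0)
    return F_next

In the actual model these steps are differentiable TensorFlow operations over the pre-computed indices, so the query encoder and the translating MLP g can be trained end-to-end.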
6.1.0.4 Auxiliary Learning with Distant Evidence

Intermediate evidence, i.e., supporting facts, is significant for guiding multi-hop reasoning models during training. In a weakly supervised setting, however, we usually do not have ground-truth annotations, as they are expensive to obtain. To get some noisy yet still helpful supporting facts, we use dense retrieval based on the training questions as distant supervision. Specifically, we concatenate the question and the best candidate answer to build a query to our pre-trained index D, and then we divide the results into four groups depending on whether they contain question/answer concepts: 1) question-answer facts, 2) question-only facts, 3) answer-only facts, and 4) none-facts. Then, to get a 2-hop evidence chain, we first check whether a question-only fact can be linked to an answer-only fact through the sparse fact-to-fact matrix S. Similarly, we can also get 3-hop distant evidence. In this manner, we can collect the set of supporting facts at each hop position, denoted as {F*_1, F*_2, ..., F*_T}.

The final learning objective is thus to optimize the sum of the cross-entropy loss l between the final weighted set of concepts A and the answer set A*, and the auxiliary loss from distant evidence, i.e., the mean of the hop-wise loss between the predicted facts F_t and the distant supporting facts at that hop, F*_t:

\mathcal{L} = l(A, A^*) + \frac{1}{T} \sum_{t=1}^{T} l(F_t, F_t^*)

6.2 Experiments

6.2.0.1 Experimental Setup

Fact corpus and concept vocabulary. We use the GenericsKB-Best corpus as the main knowledge source (it was constructed from multiple commonsense knowledge corpora and keeps only naturally occurring generic statements, which makes it a perfect fit for OpenCSR). In total, we have 1,025,413 unique facts as our F. We use the spaCy toolkit to preprocess all sentences in the corpus and then extract frequent noun chunks within them as our concepts. The vocabulary V has 80,524 concepts, and every concept is mentioned at least 3 times.

Datasets for OpenCSR. To facilitate research on open-ended commonsense reasoning (OpenCSR), we reformatted three existing multiple-choice question answering datasets to allow evaluating OpenCSR methods. We choose three datasets, QASC, OBQA, and ARC, as their questions require commonsense knowledge about science and everyday objects and are presented in natural language. By applying a set of filters and rephrasing rules, we selected those open-ended commonsense questions that query concepts in our vocabulary V. As there can be multiple correct answers for a question in OpenCSR, we employed crowd-workers to collect more answers for each test question based on a carefully designed annotation protocol. In total, we collect 15,691 answers for 2,138 rephrased questions for evaluation, which results in 7.5 answers per question on average. Please find more details about the crowd-sourcing and analysis in Appendix ??.

Table 6.1: Statistics of datasets for OpenCSR (v1.0).

Stat. \ Data        ARC       QASC      OBQA      Overall
# All Examples      6,600     8,443     5,288     20,331
# Training Set      5,355     6,883     4,199     16,437
# Validation Set    562       731       463       1,756
# Test Set          683       829       626       2,138
Avg. # Answers      6.8       7.6       7.7       7.5
Single-hop %        66.91%    59.35%    50.80%    59.02%

We show some statistics of the OpenCSR datasets and our new annotations in Table 6.1. To understand the multi-hop nature and the difficulty of each dataset, we use a heuristic to estimate the percentage of "single-hop questions", for which we can find a fact (from the top-1k facts retrieved by BM25) containing both a question concept and an answer concept. The ARC dataset has about 67% one-hop questions and is thus the easiest, while OBQA has only 50%.

Evaluation metrics.
Recall that, given a question q, the final output of every method is a weighted set of concepts A = {(a_1, w_1), ...}. We denote the set of true answer concepts, as defined above, as A* = {a*_1, a*_2, ...}. We define the Hit@K accuracy to be the fraction of questions for which we can find at least one correct answer concept a*_i ∈ A* in the top-K concepts of A (sorted in descending order of weight). As questions have multiple correct answers, recall is also an important aspect for evaluating OpenCSR, so we also use Rec@K to evaluate the average recall of the top-K proposed answers.

6.2.0.2 Baseline Methods

We present the baseline methods and an optional re-ranker component for boosting performance on OpenCSR. Table 6.3 summarizes the comparisons of the three baseline methods and our DrFact.

Direct Retrieval Methods. The most straightforward approach to the OpenCSR task is to directly retrieve relevant facts, and then use the concepts mentioned in the top-ranked facts as answer predictions. BM25 is one of the most popular unsupervised methods for retrieval, while the Dense Passage Retrieval (DPR) model is a state-of-the-art trainable neural retriever [68]. Following prior work with DPR, we used BM25-retrieved facts to create positive and (hard-)negative examples as supervision. For both methods, we score a concept by the max of the relevance scores of the retrieved facts that mention it (we also tried mean and sum, but max performs the best).

Table 6.2: Results of Hit@K and Rec@K (K=50/100) on OpenCSR (v1.0). We present two groups of methods with different inference speed levels: the upper group contains retrieval-only methods that are efficient (< 0.5 sec/q), while the bottom group is augmented with a computationally expensive answer reranker (≥ 14 sec/q).

Metric = Hit@K (%)            ARC             QASC            OBQA            Overall
                              H@50   H@100    H@50   H@100    H@50   H@100    H@50   H@100
BM25 (off-the-shelf)          56.95  67.35    58.50  66.71    53.99  66.29    56.48  66.78
DPR [68]                      68.67  78.62    69.36  78.89    62.30  73.80    66.78  77.10
DrKIT [33]                    67.63  77.89    67.49  81.63    61.74  75.92    65.62  78.48
DRFACT (Ours)                 71.60  80.38    72.01  84.56    69.01  80.03    70.87  81.66
BM25 + MCQA Reranker          76.87  80.38    75.75  80.22    79.23  84.03    77.28  81.54
DPR + MCQA Reranker           76.72  83.16    81.66  87.45    77.16  83.39    78.51  84.67
DrKIT + MCQA Reranker         78.44  83.37    84.00  86.83    79.25  84.03    80.56  84.74
DRFACT + MCQA Reranker        84.19  89.90    89.87  93.00    85.78  90.10    86.61  91.00

Metric = Rec@K (%)            R@50   R@100    R@50   R@100    R@50   R@100    R@50   R@100
BM25 (off-the-shelf)          21.12  28.08    16.33  20.13    14.27  20.21    17.24  22.81
DPR [68]                      28.93  38.63    23.19  32.12    18.11  26.83    23.41  32.53
DrKIT [33]                    27.57  37.29    21.25  30.93    18.18  27.10    22.33  31.77
DRFACT (Ours)                 31.48  40.93    23.29  33.60    21.27  30.32    25.35  34.95
BM25 + MCQA Reranker          39.11  42.96    29.03  32.11    36.38  39.46    34.84  38.18
DPR + MCQA Reranker           43.78  51.56    40.72  48.25    36.18  43.61    40.23  47.81
DrKIT + MCQA Reranker         43.14  49.17    39.20  44.37    35.12  39.85    39.15  44.46
DRFACT + MCQA Reranker        47.73  55.20    44.30  50.30    39.60  45.24    43.88  50.25

DrKIT. Following Dhingra et al. (2020), we use DrKIT for OpenCSR, treating concepts as entities. DrKIT is also an efficient multi-hop reasoning model that reasons over a pre-computed indexed corpus, which, as noted above (Sec. 3.2.5), differs from our work in that DrKIT traverses a graph of entities and entity mentions, while DRFACT traverses a hypergraph of facts.

Multiple-choice style re-ranking (MCQA).
Multiple-choice style re-ranking (MCQA). A conventional approach to multiple-choice QA (MCQA) is to fine-tune a pre-trained language model such as BERT by combining a question and a particular concept into a single input sequence of the form "[CLS] question [SEP] choice", and using the [CLS] vectors for learning to score choices. We follow this schema and train³ such a multiple-choice QA model on top of BERT-Large, and use it to re-rank the top-K concept predictions.

³ Specifically, we fine-tune BERT-Large to score true answers against 9 sampled distractors, and use it to rank the top-500 concepts produced by each of the above retrieval methods.

Methods                      BM25            DPR                     DrKIT                                 DrFact (ours)
Knowledge Corpus Structure   A set of docs   A set of docs           Mention-Entity Bipartite Graph        Concept-Fact Hypergraph
Multi-hop Formulation        N/A             N/A                     Entity-Following                      Fact-Following
Index for Dense Retrieval    N/A             Dense Fact Embeddings   Dense Mention Embeddings              Dense Fact Embeddings
Sparse Retrieval Method      BM25            N/A                     Entity-Entity/Mention Co-occurrence   Fact-to-Fact, Concept-to-Fact Matrix
# Models for Multi-Hop       N/A             N/A                     Multiple Models                       A single model (self-following)
Intermediate Supervision     N/A             N/A                     N/A                                   Auxiliary Learning
Table 6.3: Comparisons of the four retrieval methods.

6.2.0.3 Results and Analysis

Main results. For a comprehensive understanding, we report the Hit@K and Rec@K of all methods, at K=50 and K=100, in Table 6.2. The overall results are the average over the three datasets. We can see that DrFact outperforms all baseline methods on all datasets and metrics. Compared with the state-of-the-art text retriever DPR, DrFact improves by about 4.1 absolute points in overall Hit@50 accuracy. With the expensive yet powerful MCQA reranker module, DrFact yields an even larger gap (about 8 points of gain in Hit@50 accuracy).

[Figure 6.2: Overall Hit@K accuracy curves (K from 10 to 100) for BM25, DPR, DrKIT, DrFact, and their MCQA-reranked variants.]

The performance gains on the QASC and OBQA datasets are larger than the gain on ARC. This observation is consistent with the statistics that the former two datasets have more multi-hop questions, where DrFact has a larger advantage. As shown in Figure 6.2, DrFact consistently outperforms the other retrieval methods at different K by a considerable margin. Interestingly, we find that with the MCQA reranker, DrKIT does not yield a large improvement over DPR, and its reranked results are often below those of the other methods. We conjecture that this is because the entity-centric reasoning schema produces too many possible concepts and is thus more likely to place irrelevant concepts at the top positions.

The results on Rec@K in the bottom section of Table 6.2 show that even our DrFact+MCQA model only recalls about 50% of the correct answers in the top-100 results on average. This suggests that OpenCSR is still a very challenging problem and that future work should focus on ranking more of the correct answers higher.

Run-time efficiency analysis. We use Table 6.4 to summarize the online inference speed of each OpenCSR method. At inference time, DPR makes one call to BERT-base for encoding a question and does one MIPS search. Similarly, DrKIT and DrFact with T hops make one call to BERT-base for query encoding and do T MIPS searches. However, since the entity-to-mention matrix (sp_e2m) of DrKIT is much larger than the fact-to-fact matrix (sp_f2f) of DrFact, DrKIT is about twice as slow as DrFact. The MCQA reranker is much more computationally expensive, as it makes K calls to BERT-Large, one for each combination of question and choice. Note that in these experiments we use T=2 for DrKIT, T=3 for DrFact, and K=500 for the MCQA re-rankers.⁴

⁴ We note that the MCQA reranker could be sped up by scoring more choices in parallel. All run-time tests were performed on an NVIDIA V100 (16GB), but MCQA with a batch size of 1 requires only about 5GB. This suggests that more parallel inference on a V100 could obtain about 4.5 sec/q for MCQA.

Methods      Major Computations                 Speed (sec/q)
BM25         Sparse Retrieval                   0.14
DPR          BERT-base + MIPS                   0.08
DrKIT        BERT-base + T × (MIPS + sp_e2m)    0.47
DrFact       BERT-base + T × (MIPS + sp_f2f)    0.23
X + MCQA     X + K × BERT-Large                 +14.12
Table 6.4: The major computations of each method and their online (batch-size=1) inference speed in sec/q.
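To make the cost breakdown in Table 6.4 concrete, here is a schematic (and deliberately simplified) view of the per-query computation in DrFact: one query-encoder call, followed by T rounds that each combine a MIPS lookup with a sparse fact-to-fact product, and a final projection onto concepts. The interfaces (query_encoder, fact_index.mips, and the sparse matrices S_f2f and C_c2f) are assumptions for illustration; the actual hop operator in DrFact is more involved than this sketch.

def drfact_answer(question, query_encoder, fact_index, S_f2f, C_c2f, T=3, k=1000):
    """Schematic T-hop fact-following; it mirrors the cost pattern
    'BERT-base + T * (MIPS + sp_f2f)' from Table 6.4, not the exact model."""
    q = query_encoder(question)                 # single dense encoder call (BERT-base)
    f = fact_index.mips(q, top_k=k)             # hop 1: sparse scores over facts (F_1)
    for _ in range(T - 1):
        followed = f @ S_f2f                    # sparse fact-to-fact "following" step
        dense = fact_index.mips(q, top_k=k)     # dense re-scoring via MIPS
        f = followed.multiply(dense)            # element-wise combine into the next hop's facts
    A = f @ C_c2f.T                             # weighted concept set via concept-to-fact links
    return A                                    # rank concepts by weight for Hit@K / Rec@K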
                   ARC      QASC     OBQA     Overall
T=1                69.3%    70.1%    65.0%    68.1%
T=2                71.1%    72.2%    68.3%    70.5%
T=3 ✓              71.6%    72.0%    69.0%    70.9%
w/o Self-follow    70.9%    70.4%    68.4%    69.9%
w/o Aux. loss      70.6%    70.1%    68.0%    69.6%
Table 6.5: Ablation study of DrFact (H@50 test acc.).

Ablation study. Varying the maximum number of hops (T ∈ {1, 2, 3}) — i.e., the number of calls to Fact-Follow — indicates that overall performance is best when T=3, as shown in Table 6.5. The performance with T=2 drops by 0.7 points on OBQA. We conjecture this is due to the nature of the datasets, in particular the percentage of hard questions. We also test the model (with T=3) without the auxiliary learning loss (Sec. 6.1.0.4) or the self-following trick. Both are seen to be important to DrFact. Self-following is especially helpful for QASC and OBQA, which have more multi-hop questions. It also makes learning and inference faster than the alternative of ensembling multiple models with different maximum hops, as done in some prior work.

[Figure 6.3: A case study comparing DPR and DrFact on the question "What will separate iron filings from sand?". DPR retrieves facts such as f2 = "sieves are used for separating fossils from sand..." and f3 = "stainless steel has a rough surface just after filing.", while DrFact follows a chain of magnet-related facts, e.g., "magnets produce a magnetic field with a north..." and "magnets attract magnetic metals through magnetism".]

Qualitative analysis. We show a concrete example in Fig. 6.3 to compare the behavior of DPR and DrFact in reasoning. DPR uses purely dense retrieval without any regularization, yielding irrelevant facts. The fact f2 matches the phrase "separating ... from sand," but does not help reason about the question. The fact f3 appears only because of the semantic relatedness of "steel" and "iron," while "filing" here is not related to the question concepts. Our DrFact, however, can faithfully reason about the question via fact-following over the hypergraph, and use neural fact embeddings to cumulatively reason about a concept, e.g., magnet. By backtracking over our hypergraph, we can use the retrieved facts as explanations for a particular prediction.

6.3 Conclusion

We introduce and study a new task — open-ended commonsense reasoning (OpenCSR) — which is both realistic and challenging. We construct three OpenCSR versions of widely used datasets targeting commonsense reasoning with a novel crowd-sourced collection of multiple answers, and evaluate a number of baseline methods for this task. We also present a novel method, DrFact.
DrFact is a scalable multi-hop reasoning method that traverses a corpus (as a hypergraph) via a differentiable "fact-following" reasoning process, employing both a neural dense index of facts and sparse tensors of symbolic links between facts, using a combination of MIPS and sparse-matrix computation. DrFact outperforms several strong baseline methods on our data, making a significant step towards adapting commonsense reasoning approaches to more practical applications. Based on the multi-hop reasoning framework of DrFact, we hope this work can benefit future research on neural-symbolic commonsense reasoning.

Chapter 7
Unsupervised Generalization for CSR with Implicit Knowledge

7.1 Introduction

Advances in pre-training techniques for large language models (LMs) have considerably improved natural language processing (NLP) models on various important tasks via fine-tuning with labeled data. While these fine-tuned models are impressive on their target tasks, they can hardly generalize to unseen tasks. This makes it difficult to approach the general linguistic intelligence that we ultimately want an NLP model to have. A promising avenue is to train a massively multi-task model that learns a large set of NLP tasks. However, in real-world applications, users often expect that a multi-task NLP model can also perform unseen tasks that they are interested in. These users may only be able to provide a few unlabeled examples (i.e., input-only data) of the target tasks with natural-language instructions. How can we generalize the multi-task model to unseen tasks without labels? This desirable ability is dubbed "unsupervised cross-task generalization."

Recent studies show that multi-task prompted training makes language models better at cross-task generalization, especially when natural-language instructions are used for formatting the training data [205, 146, 187]. The general recipe is to first fine-tune a text-to-text language model such as T5 [135] on a multi-task mixture of diverse NLP datasets that are converted to sequence-to-sequence formats. We use the term upstream learning to refer to this multi-task training stage. Given a target task that is unseen during upstream learning, we want the upstream multi-task model to also perform well on it by reusing the previously acquired knowledge. FLAN [187] and T0 [146] both use natural language (NL) instructions as prompts to reformat the data of various NLP tasks for upstream learning and generalization. Their results suggest that NL instructions are key to unsupervised cross-task generalization.

Despite the exciting results from [187] and [146], their studies are limited to analyzing the task generalization performance of frozen, target-agnostic upstream models (i.e., FLAN and T0). We argue that the generalization performance can be further improved if we exploit the unlabeled data of target tasks as hints for adjusting the upstream model into a more dedicated, target-aware model. Intuitively, the upstream examples that share similar skills with the target task should help task generalization if the upstream model can recap these skills by retrieving them. Motivated by this idea, we propose to further improve the cross-task generalization ability of upstream models via retrieval augmentation from the upstream data. The key challenge of such retrieval augmentation is to predict the example-level utility for cross-task generalization, which we introduce in detail in Sec. 7.2.
To address the challenges, we present a two-stage retrieval-augmentation framework, ReCross, for unsupervised cross-task generalization in Section 7.3. Specifically, we pre-compute a dense index by encoding all upstream data as dense vectors. Given a set of unlabeled examples, we first use them to retrieve an initial set of upstream data by using the encoded queries to efficiently search over the dense index. Then, we apply a reranking module to carefully analyze the utility of each candidate example. To get such a reranker, we fine-tune a cross-encoder model with distant supervision mined by a novel algorithm. Finally, we take the top-ranking retrieved data to fine-tune the upstream model for a few steps and use this updated model for inference on the target task in the future (i.e., the retrieval augmentation and model update is a one-time procedure for each unseen task).

To evaluate generalization methods more efficiently without losing generality, we train a variant of T0-like models, named BART0, which has comparable performance with T0-3B yet is 8x smaller. Our extensive experiments show that the proposed ReCross outperforms the baseline methods by a large margin. For example, ReCross improves over the non-retrieval methods by 4 points on the overall performance of 10 target tasks, and similarly on a few BigBench tasks. We also analyze the distribution of the retrieved data to better understand the behavior of retrieval-augmentation methods, and find that ReCross has a very different distribution compared to semantic retrieval baselines.

[Figure 7.1: The unsupervised cross-task generalization problem. In the upstream training stage, we train a multi-task NLP model, M, with a diverse collection of upstream tasks. In the generalization stage, given an unseen task U_i with a few unlabeled examples Q_i, we want to update the upstream model (via retrieval augmentation) such that it can generalize to the target task.]

7.2 Problem Formulation

Massively Multi-Task Language Models. To build a general NLP model that can serve a wide range of real-world downstream applications, it is important to train a massively multi-task upstream model. We assume there are N different upstream tasks (e.g., sentiment analysis of IMDB reviews), dubbed {T_1, ..., T_N}. We use D to denote the collection of all labeled data for these upstream tasks (i.e., the upstream data), which is then used for training a massive multi-task model M (e.g., BART, T5, and other Transformer-based models). The datasets of these upstream tasks are all converted to a shared text-to-text format using natural-language instruction templates from PromptSource [6] to reformat data of different NLP tasks. This pipeline has become a common approach, adopted by several recent massive multi-task models for NLP, such as T0 [146], FLAN [187], and CrossFit [205].

Unsupervised Cross-Task Generalization. In real-world scenarios, it is very common for users to want a general multi-task model to perform tasks of their interest, even if their target tasks have never been seen before by the upstream model. For these unseen target tasks, users usually can provide only a few unlabeled examples (i.e., input-only data) for specifying the task instructions.
This is why we need to study how to generalize a multi-task LM to unseen tasks with only a few unlabeled examples, i.e., unsupervised cross-task generalization. For instance, in Fig. 7.1, the unseen task U_i is a coreference-resolution task that is not covered by the upstream training (the top-right box in Fig. 7.1). We have only a few inputs for it as the "hints" for cross-task generalization, which we call query examples Q_i. Our objective is to use the query examples Q_i to enhance the performance of the upstream model M on the unseen task U_i. For evaluating such unsupervised cross-task generalization methods, we test the enhanced model with held-out labeled data of each target task.

Challenges. Standard fine-tuning approaches (with or without meta-learning designs) for few-shot cross-task generalization are not feasible here. We have to adjust the upstream model based on only a few input-only examples for the unseen task. Intuitively, upstream examples that share similar skills with the target task U_i should be more beneficial than other upstream data. Thus, one naive idea is to first estimate the utility of each upstream example for U_i and then re-train a dedicated model M_i via a weighted learning method (e.g., examples of more utility are trained with a larger loss). However, such a target-aware weighted re-training method cannot scale, because the upstream data is usually very large and there can be a large number of unseen tasks from users in real-world applications. In addition, it is particularly challenging to estimate the utility scores of upstream data for a given unseen task, as we do not have ground-truth annotations for learning this. Although there are some existing studies on task-to-task relatedness and transferability [176, 83, 122], most of them are not designed for unsupervised settings and few are done with multi-task (prompted) upstream models. Moreover, these prior analyses are mainly limited to the task level, and they may not directly generalize to studying example-level utility, which is particularly important for the problem setup of this work.

[Figure 7.2: ReCross is a retrieval-augmentation method for unsupervised cross-task generalization. We reuse the encoder layers of the upstream model to build a dense index, which consists of vectors of the upstream examples D. We also propose an algorithm to generate distant supervision for training a reranker, which takes a pair of examples as input and outputs a score. During evaluation, we encode the query examples Q_i to query the index for initial ranking results R', and then pair them with the queries again for reranking. Finally, we take the top-K results (i.e., R) for generalizing the upstream model M to the unseen task U_i.]

7.3 ReCross: Retrieval Augmentation for Cross-Task Generalization

7.3.1 Overview

To address the above challenges for unsupervised cross-task generalization, we propose a retrieval-augmentation method named ReCross. The ReCross method is based on the simple idea that we should exploit the upstream examples that have higher utility for a given unseen target task. Instead of costly re-training from scratch, our method first retrieves a small subset of the upstream data for each unseen task.
It then uses them to efficiently fine-tune the upstream model such that 116 the updated model is generalized. This can ensure scalability to a great extent and benefit upstream models from re-learning target-specific acquired knowledge for cross-task generalization. Ideally, we aim to retrieve the upstream examples that are the most beneficial ones for gener- alizing the upstream model towards a particular unseen task — ranking the upstream data by their example-level utility. To achieve this goal while preserving the efficiency, we first use the query examples to retrieve initial candidates via efficient maximum inner product search (MIPS) over a dense index, which consists of embedding vectors of all upstream examples (Section 7.3.2). Based on the candidates from dense retrieval, we learn a reranking module for further improv- ing the retrieval results (Section 7.3.3). The reranker is based on the cross-encoder architecture that takes a query-candidate pair of examples and outputs a more curated score of utility. Recall that we do not have any annotation for such example-level utility scores, and the only allowed resources are the upstream data and model. Therefore, we propose an algorithm to mine distant supervision from the upstream data for learning the reranker (Section 7.3.4). The overview of ReCross is shown in Fig. 7.2. 7.3.2 Dense Retrieval To efficiently estimate the example-level utility for generalization, we propose to first employ a dense retrieval module that ensures high scalability. Specifically, we build a matrix D∈R |D|× d , where each upstream example in D is encoded with a dense vector. Based on this dense index, we can now estimate the utility of an upstream example with its cosine distances to the encoded query examples in Q. That is to say, the upstream examples that are the nearest neighbors of query examples, are more likely to be beneficial for generalizing the upstream model M to the unseen target task. To retrieve the candidate set R ′ , we use MIPS to search for the top-K examples for each query example in Q, so K =⌈|R ′ |/|Q|⌉. (We introduce the details and other aggregation strategies in Appendix.) This dense-retrieval process is very efficient as we pre-compute the upstream index 117 and perform MIPS for querying the candidates over the index on-the-fly during the generalization stage. We use the FAISS library [65] in our implementation. Instance embeddings. The example encoder is a key component of the dense-retrieval pipeline. An ideal example encoder is supposed to represent the underlying skills behind an example such that we can use the distances in the result embedding space to estimate utility for cross-task gener- alization. As we do not have annotations of utility scores for training an encoder, one may want to use pre-trained sentence embedding models such as SentenceBERT [139]. Our empirical results show that such semantics-based encoders cannot lead to much improvement over random retrieval results. We think there are two reasons for this failure. First, the semantic similarities between ex- amples are not suitable for estimating the utility for generalization. Second, the external encoding modules do not reflect the nature of the upstream model which we want to generalize. To address these two issues, we propose to use the encoding layers of upstream modelM for computing the example embeddings. Without loss of generality, let us assumeM to be a text-to- text Transformer that has multiple layers for both encoders and decoders such as BART. 
We encode an example by first obtaining the hidden representation of each token at the last encoder layer (i.e., a sequence of token vectors), and then performing mean-pooling over them to get a single dense vector to represent this example. By doing this, the produced example embeddings reflect the internal features of the upstream model, which are more relevant to the “thinking process” of the upstream model for the examples instead of the shallow semantic information. 7.3.3 Reranking Module Weakness of the dense retrieval. Although dense retrieval is very efficient thanks to the MIPS support, the retrieval performance is limited by its two major weakness. First, it is a dual-encoder architecture that encodes the candidate example and the query example separately, which ignores informative features behind token-to-token attention across a pair of examples. Second, it is too costly to frequently update the example encoder, which prevents us from learning to refine the 118 retrieval results with distant supervision (if any). Therefore, we design a re-ranking stage where we train a cross-encoder to further enhance the dense-retrieval results with mined distant supervision (Sec. 7.3.4). Encoding query-candidate pairs. The cross-encoder architecture has been widely used in sentence- pair classification tasks such as natural language inference and paraphrase detection. We here use a cross-encoder to encode the concatenation of a query example and a candidate example. Specif- ically, we fine-tune a RoBERTa [105] model to classify whether an example pair is a positive or negative match. The confidence of classifying such a pair to be positive can thus be used as the utility score of the candidate upstream example for this query example. On top of this, we then develop a reranking module for further improving retrieval performance as follows. Scoring paired data. To re-rank the initially retrieved data by the dense retriever, we apply the cross-encoder on all pairs of query examples Q and candidate retrieved examples R ′ , producing scores of all|Q|∗| R| ′ query-candidate pairs. For each candidate example r∈ R ′ , we use the average of all cross-encoder scores involving r as its utility score. Finally, we take the top-K examples based on this new ranking of candidate examples in R ′ as the final retrieved data R. We use upsampling ratio µ to denote the ratio between R ′ and R, i.e., µ =|R ′ |/|R|. 7.3.4 Mining Distant Supervision for Reranking How do we train such a re-ranking module? Recall that we only have access to the upstream data D and must not use any data from the unseen tasks at this stage. Inspired by meta-learning works, we propose an algorithm (Alg. 3) to mine distant supervision data for creating a training-as- testing environment for learning the reranker. Our key motivation is to examine the utility scores of candidate examples by assessing the generalization performance of updated models that are fine- tuned with these candidates as if we use them for real unseen tasks. Such more realistic estimation of utility scores can thus help us train a reranker to predict. 119 Algorithm 3: Distant Supervision Creation Input:M ; D;T q Output: Z=(Z q ,Z p ,Z n ) 1 2 D T q ← −{ x∈ D|x is an example ofT q } 3 Z q ← − Sample(D T q ); H q ← − Sample(D T q ) 4 R Z ← − DenseRetrieve(Z q ,D) /* Delete retrieved examples from the same task as queries. 
*/ 5 R Z ← − R Z .discard(D T q ) 6 foreach round do 7 R Z .shuffle () /* Split retrieved examples into n groups */ 8 {G 1 ,...,G n }← − R Z .split() 9 foreach G i in{G 1 ,...,G n } do 10 M ′ ← − M.copy() 11 M ′ .fine_tune (G i ) 12 ℓ← − M ′ .calc_loss(H q ) 13 foreach x∈ G i do 14 scores[x].append(ℓ) /* Score each in the group w/ the loss. */ /* Use mean group score as score for single examples */ 15 foreach x∈ R Z do 16 score[x]← − mean(scores[x]) /* Sort R Z by score in increasing order. */ 17 R Z .sort(key:score,order:increasing) 18 Z p ← − First W items of R Z 19 Z n ← − Last W items of R Z Specifically, we define a data point of such distant supervision as a tuple Z =(Z q ,Z p ,Z n ): 1) Z q is a set of query examples of a particular taskT q ; 2) Z p is the set of positive examples from other tasks; 3) Z n is the set of negative examples from other tasks. We expect that Z p is of more 120 utility for generalization than Z n if Z q would be a query set for the target taskT q . To this end, we first randomly sample an upstream task T q and use a small subset of its training data as the Z q . Here, we also sample a larger held-out set H q examples of taskT q to facilitate utility estimation. Then, we apply the dense retriever using Z q as the query examples and get the retrieval results R Z . This R Z is thus the candidate pool where we create Z p and Z n . That is, Z p ⊂ R Z and Z n ⊂ R Z . We discard examples that are from theT q , so that the generated tuples are closer to the real scenarios where we use the reranker on the query sets of unseen tasks. Our criteria to select Z p and Z n from R Z is motivated by the hypothesis that a more suitable set of retrieved examples should improve the performanceM onT i after fine-tuning with it. There- fore, we iteratively sample a small subset from R Z , then fine-tune M with it, and finally, use the fine-tuned model to evaluate on Z ′ q . The performance of such a temporarily fine-tuned model can be seen as the utility score—how well this subset can help generalizeM to the unseen taskT q . Through multiple rounds of such sample-train-test procedures, we can thus score each example in R Z by taking the average of all test results where it is involved. With such a new ranking of examples in R Z , we take the best W examples as Z p and the worst W as Z n . With such distant supervision, we then can create pair of query-positive instances and query- negative instances via pairing Z q with Z p and Z n respectively. Now we can fine-tune a RoBERTa- base model by concatenating each pair and learning a binary-classification objective. The output logits of this trained model will be used for the reranking procedure as shown in Sec. 7.3.3. 7.3.5 Re-learning via Fine-Tuning with Retrieved Data When we have the final retrieved data R i for a certain query set Q i , we can now enhance the upstream modelM for the unseen taskU i . We use a small learning rate to continually fine-tune M with the retrieved upstream examples R i for a small number of steps. We find that the learning rate has to be very small so that this step can be seen as a natural continuation of the finished upstream training and avoid overfitting the retrieved data. We acknowledge that there could be more effective methods to reuse the query examples Q as guidance for fine-tuning, and we leave 121 this as future work. Please find more discussion on the hyper-parameter selection and configuration in our appendix. 
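Putting Sections 7.3.2-7.3.5 together, the sketch below outlines the one-time ReCross procedure for a single unseen task. The object interfaces (upstream_model.encode, reranker.score, upstream_model.fine_tune) are illustrative assumptions; only the FAISS search call and the hyper-parameter values (|R|=512, upsampling ratio µ=2, learning rate 1e-6, batch size 4, 2 epochs) are taken from the main configuration described in this chapter.

import numpy as np
import faiss  # MIPS over the pre-computed upstream index, as in the chapter

def recross_generalize(upstream_model, reranker, index, upstream_examples,
                       query_examples, top_k=512, upsampling=2):
    """Schematic ReCross pipeline: dense retrieval + reranking + re-learning."""
    # 1) Dense retrieval: embed each query with the upstream encoder (mean-pooled
    #    last encoder layer) and take its nearest upstream examples via MIPS.
    per_query = int(np.ceil(top_k * upsampling / len(query_examples)))
    q_vecs = np.stack([upstream_model.encode(q) for q in query_examples]).astype("float32")
    _, ids = index.search(q_vecs, per_query)
    # upstream_examples are assumed to be text strings, so they can live in a set.
    candidates = {upstream_examples[i] for row in ids for i in row}   # initial set R'

    # 2) Reranking: score every (query, candidate) pair with the cross-encoder and
    #    average the scores over all queries for each candidate.
    def utility(cand):
        return np.mean([reranker.score(q, cand) for q in query_examples])
    retrieved = sorted(candidates, key=utility, reverse=True)[:top_k]  # final set R

    # 3) Re-learning: briefly fine-tune the upstream model on R with a small LR.
    upstream_model.fine_tune(retrieved, lr=1e-6, epochs=2, batch_size=4)
    return upstream_model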
7.4 Evaluation In this section, we first introduce the experimental setups, including the task distribution, upstream learning details, and the configurations of the main experiments. We present experimental results and reveal some non-trivial findings with extensive analysis that justify the effectiveness of Re- Cross. 7.4.1 Evaluating Unsupervised Cross-Task Generalization We follow [146] to use the templates from PromptSource [6] for converting data of different types of NLP tasks to text-to-text formats. In total, we have 36 upstream tasks and 10 target unseen tasks for our main experiments. The upstream tasks are the same as the ones that the T0 models used for upstream learning. We follow the evaluation protocol proposed by [146] and select the target tasks that are significantly different from the upstream tasks. Besides, we also include 5 additional tasks from the BIG-bench project [157] to create an even more out-of-distribution set of unseen tasks for analysis. Metric. When we apply the natural-language templates for the test examples, we only keep the templates that can be evaluated with an exact match (classification, question answering, answer selection, etc.) so that it is feasible to use exact-match for evaluating all tasks. To allow a smoother grading, our metric also counts the cases when outputs and truths are sub-strings of each other, which we call SoftEM. The only difference between SoftEM and the standard EM is that it also counts the sub-string matches. We observe that sometimes even though T0-like models (includ- ing ours) answer the input questions correctly, their raw outputs are not exactly the same as the truth outputs generated by the PromptSource templates. In particular, the ground-truth outputs for multiple-choice QA tasks are often in the form of “[A/B/C/D]: [answer]”, while the models often 122 only output the id of the correct choice (e.g., “A”) or the text of the answer. We also find that the model can output some noise (such as additional punctuation) after the answer (e.g., “True” vs “True.”). The standard EM will discard such matches and cause inaccurate measurements. Although SoftEM might add false positives due to substring matches, we found it is very rare ac- cording to our manual inspection of the 10 tasks. Therefore, we choose to use SoftEM for a more precise evaluation. We report the results with the standard EM in Table ?? that also supports our findings. 7.4.2 BART0: Upstream Learning with a Smaller LM The T0(pp) models are all very huge, and the smallest version, T0-3B (3 billion parameters), is still too large to be fine-tuned on popular affordable GPUs. We need a parameter-efficient alternative that makes the study on cross-task generalization more accessible to a broader community while keeping the generality. Thus, we fine-tune a BART-large [87] (0.4 billion parameters) following the recipe of training T0. Specifically, we sample 50k examples at most from each upstream task to build a large upstream dataset consisting of 1.7 million examples (i.e.,|D|= 1.7m), and then we fine-tune a BART-large with 22k steps with this upstream dataset. Finally, we use the fine-tuned checkpoint as our upstream modelM and name it BART0. Surprisingly, we find that BART0 and T0-3B have comparable zero-shot performance on the unseen target tasks, even though T0-3B is about 8x larger than BART0. More implementation details are shown in Appendix. 
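As a concrete reference, here is a minimal sketch of the SoftEM metric described in Sec. 7.4.1. The lower-casing and whitespace stripping are assumptions about normalization, not details taken from the thesis.

def soft_em(prediction: str, truth: str) -> bool:
    """Exact match, plus credit when one string is a substring of the other
    (e.g., 'True' vs. 'True.', or 'A' vs. 'A: [answer]')."""
    p, t = prediction.strip().lower(), truth.strip().lower()
    if not p or not t:              # guard: an empty string is a substring of everything
        return p == t
    return p in t or t in p

def soft_em_score(predictions, truths):
    """Task-level SoftEM: the fraction of examples counted as matches."""
    return sum(soft_em(p, t) for p, t in zip(predictions, truths)) / len(truths)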
7.4.3 Setup and Configurations

In our main experiments, we use |Q_i| = 16 query examples for each unseen task U_i and retrieve |R_i| = 512 examples for augmenting BART0. In the fine-tuning stage, we use a learning rate of 1e-6 and a batch size of 4 to continually fine-tune all layers of BART0 for 2 epochs. As for re-ranking, we set the upsampling ratio µ = 2, meaning that we first retrieve 1024 examples for reranking and use the top 512 as the final retrieved data. To obtain more convincing evaluation results, we average the scores of all target tasks to show the general zero-shot performance. For each task U_i, we use five different query sets, {Q_i^(1), ..., Q_i^(5)}, to conduct five individual rounds of retrieval, thus resulting in five average scores over all tasks. To get a comprehensive assessment, we report the mean, std, median, min, and max of these five overall scores in the lower part of Table 7.1. We present an ablation study on hyper-parameter configurations in Table 7.3 and include more details in the Appendix.

Target Task   T0-3B   BART0   Random          SBERT           ReCross†        ReCross         ∆
anli_r3       26.00   30.50   35.34 ± 1.52    32.64 ± 2.53    36.70 ± 0.53    35.76 ± 0.90    5.26
h-swag        34.40   39.40   33.84 ± 5.59    30.92 ± 7.82    44.36 ± 3.07    47.28 ± 2.95    7.88
cb            53.93   39.64   47.07 ± 1.25    48.00 ± 3.28    44.50 ± 4.20    44.79 ± 3.36    5.15
wic           45.70   46.70   41.04 ± 2.18    46.78 ± 2.22    49.90 ± 0.50    50.58 ± 0.24    3.88
wsc           50.00   57.88   52.50 ± 2.29    52.69 ± 6.13    59.27 ± 1.96    61.46 ± 1.47    3.58
winogrande    47.60   51.10   52.68 ± 0.83    52.18 ± 3.20    54.60 ± 1.35    55.46 ± 0.88    4.36
arc-chan.     41.30   35.70   33.28 ± 1.50    37.90 ± 1.22    37.78 ± 0.73    38.44 ± 0.99    2.74
obqa          38.50   34.40   28.72 ± 2.46    33.28 ± 1.24    36.98 ± 1.55    39.58 ± 2.80    5.18
piqa          45.30   36.10   37.00 ± 2.71    38.54 ± 2.17    41.34 ± 1.75    41.42 ± 1.02    5.32
squadv2       30.60   32.40   29.86 ± 5.46    29.46 ± 0.84    30.26 ± 1.54    30.58 ± 1.61    -1.82
All@mean      41.33   40.38   39.13 ± 2.06    40.24 ± 1.61    43.57 ± 0.68    44.53 ± 0.42    4.15
@median       41.33   40.38   39.93           40.91           43.43           44.31           3.93
@min          41.33   40.38   35.66           38.28           42.65           44.16           3.77
@max          41.33   40.38   40.59           41.76           44.51           45.07           4.69
Table 7.1: The main experimental results (%) for unsupervised cross-task generalization in SoftEM. Each result in the upper section is the average (and std) performance over 5 different query sets for a task. The lower section reports the mean, max, min, and median of the overall performance (i.e., the average performance over all tasks) of these five rounds.

7.4.4 Experimental Results

BART0 vs. T0-3B. As mentioned earlier, we find that BART0 is comparable with the much larger T0-3B in terms of zero-shot performance on our unseen tasks (41.33 vs. 40.38). As we use BART0 as our base model for testing different retrieval-augmentation methods, its overall performance of 40.38 is what we want retrieval-augmentation methods to beat. Note that when using BART0 and T0-3B for non-retrieval zero-shot inference, they do not use any information from the query examples, so their mean, median, min, and max are always the same.

Random Retrieval. The Random column shows the results when we randomly sample R_i from the upstream data D without using any information from Q_i.

SBERT and ReCross†. We use SentenceBERT (SBERT) as a strong baseline method to create a dense index of the upstream data, compared with our proposed indexing method, ReCross† (i.e., ReCross without reranking). We can see that ReCross† always outperforms the other methods.
Even its minimum performance over the five rounds (42.65) is better than the maximum of SBERT (41.76). Besides, the standard deviation also becomes much smaller (1.61 → 0.68), which means that the improvement from ReCross† is more consistent across different query sets. The SBERT indexing relies mainly on the semantic similarities between a query example and the upstream data. Instead, our proposed ReCross† uses the hidden representations inside the upstream model M for representing examples. We believe such an indexing method can better help us find examples that share similar reasoning skills acquired by the upstream model.

ReCross = ReCross† + Reranking. The full version of ReCross with reranking further improves the performance substantially along multiple dimensions. Both all@mean and the median improve by about 1 point over ReCross†, and the std is also reduced from 0.68 to 0.42. The last column (∆) in Table 7.1 shows its improvement compared to the base model BART0, and we can see that ReCross consistently outperforms the non-retrieval methods (e.g., BART0) by a significant gap.

To explore the potential benefits of retrieval-augmentation methods such as ReCross, we also conduct the same experiments on five tasks selected from the BIG-Bench project. The results are shown in Table 7.2, where we can see that ReCross still outperforms the non-retrieval methods. An interesting case is the movie_dialog task, where the prompt in the template requires a model to output "same" or "different." However, both T0-3B and BART0 fail to follow the prompt instruction and only output "yes/no." Only when we use retrieval-augmentation methods are there performance improvements on this task.

Task                T0-3B   BART0   ReCross
hindu_knowledge     24.75   23.48   24.87 ± 0.27
known_unknowns      47.83   43.48   47.17 ± 1.65
logic_grid_puzzle   23.60   20.70   17.12 ± 6.29
strategyqa          47.70   48.30   49.76 ± 0.80
movie_dialog        0.00    4.40    37.22 ± 13.26
All@Mean            28.78   28.07   35.23 ± 2.85
Table 7.2: Results on a subset of BigBench tasks.

Setup \ All@   Mean    std.   Min     Max     Median
Main Exp.      44.53   0.42   44.16   45.07   44.31
|Q|=1          43.20   0.83   42.58   44.58   42.88
|Q|=8          43.67   0.90   42.09   44.32   43.90
|Q|=32         42.52   1.17   40.52   43.40   42.96
|R|=256        40.80   0.83   39.45   41.68   40.96
|R|=1024       44.02   1.43   42.26   45.35   44.59
µ=3            43.92   0.58   43.08   44.57   43.89
µ=4            43.91   0.99   42.76   45.10   44.26
Table 7.3: The ablation study of ReCross.

7.4.5 Analysis & More Findings

More configurations. Our main experiments (Table 7.1) use a particular configuration: |Q|=16, |R|=512, and µ=2. In Table 7.3, we explore more configurations as ablation studies. The "Main Exp." row refers to the results shown in Table 7.1, and the configurations of the other rows change only one factor at a time. Even with a single query example, ReCross is better than BART0. However, when increasing the query size to 32, we find that the performance starts to decrease, meaning that there could be an optimal query size for a given |R|=512. We find that increasing |R| is generally beneficial, although the all@mean decreases when |R| is changed from 512 to 1024, even though the max and the median slightly increase. Finally, we see that increasing µ increases the std and does not improve the overall performance.

Figure 7.3: The mapping between unseen tasks (as rows) and upstream tasks (as columns). Darker cells indicate upstream tasks that account for a larger share of the retrieved data.
For example, for the taskWIC, ReCross retrieves a plurality of examples fromQQP (about 30% of the retrieved examples). Retrieved data distribution. Figure 7.3 presents the difference between the methods in terms of their retrieved data. We draw the distribution of the retrieved data among different upstream tasks for each unseen task individually. From the heatmap, we can see that ReCross tends to have more dominant retrieved tasks (i.e., darker cells), while SBERT’s results are more sparse. They both can identify that squad is most similar to the adversarial_qa tasks. Their behaviors are very different too. Taking the unseen taskwinogrande (wngrnd) as an example, we can see that the SBERT retrieves from multiple upstream tasks such aspaws-x andcosmosQA , but the ReCross mainly retrieves fromsocial-iqa,wiki-qa, andcos-e. The experimental results in Table 7.1 show that ReCross produces a better performance than SBERT (i.e., 55.46 vs 52.18), while it is not clear how we can predict such task correlation in advance. This suggests that we should explore more about the utility of instances and tasks in future work. More analysis. In the appendix, we further presented some analysis to help understand “how” and “when” the retrieval augmentation works. We investigate whether the utility of upstream examples in retrieval augmentation is related to the similarity in terms of the task formats. From Appendix A.1, we found some counterintuitive results. For example, if removing MCQA upstream tasks from the upstream index, then the ARC target task can have an even better performance, 127 although it is an MCQA-formatted task. Thus, we hypothesize that similarity in terms of reasoning types is more important than format similarity for retrieval augmentation. After all, the upstream model has been already trained to work with these basic task formats. Re-learning the tasks of the same format might lead the model to overfit the seen domains. Additionally, to provide a more concrete analysis, we also present case studies with two specific tasks (CB and SQUADv2) in Appendix. Moreover, we conjecture the natural language instructions in the templates are necessary for ReCross to get impressive results. Therefore, we investigated two ways of perturbing the instruc- tions and monitoring the performance changes in Appendix A.2. We find it is indeed true that perturbations of the instructions will lead to much worse performance. We believe that a rigorous, principled way of analyzing the correlation between query and retrieval examples will be a great future direction, given the strong evidence that ReCross works so well as such a simple method. 7.5 More Discussion 7.5.1 Practicality of unsupervised setting. Cost of obtaining task labels The unsupervised setting in the paper does not require any human annotation of labels. For some tasks (NLG tasks in particular, e.g., summarization), the expected output (label) are open-ended and possibly lengthy and thus human annotation is much more ex- pensive and time-consuming. Also, few-shot learning must ask humans to label examples for each new task, and it is thus less practical when there are a large number of emerging tasks from the users. Meanwhile, ReCross requires only a natural-language task template, which does not require potentially expensive manual annotation or domain expertise. Scalability & Real-Time response Deploying the ReCross pipeline is a one-time process. 
All we need to do is to pre-compute the upstream index with LM and configure the reranker (a simple masked LM) by running our script. In production, once the users input the examples with NL 128 instructions, we do not need to wait for any human annotations anymore, so it is much more efficient in the long run. In the scenarios where users only provide one query example and want to get its label from the model, ReCross also shows great performance (i.e., |Q|=1 in Table 7.1). It is then impractical to assume there are a few labeled data from the users too in such cases. 7.5.2 Empirical studies The unsupervised ReCross performance is comparable to few-shot learning with label annotations. In Appendix D.2, we report the performance of directly fine-tuning BART0 with the labeled query examples. Although it is an unfair comparison with our previous ReCross results, we found that they are comparable. More importantly, the ReCross framework does not conflict with the few- shot setting. Given a labeled query set for a target task, retrieved examples from the ReCross can still improve few-shot learning as additional training data. We designed two simple methods for applying ReCross under the few-shot setting and report the empirical results in Appendix D.2. It turns out that ReCross can also boost the performance under the few-shot setting by about 3 points. 7.6 Related Work Multi-task training for task generalization. Text-to-text Transformer language models such as T5 enable us to train a multi-task NLP model with a more straightforward recipe: mixing the data of multiple tasks into a unified seq2seq format, and then fine-tuning text-to-text LMs for implicit multi-task learning. UnifiedQA [72] is among the first works in this direction. Although it shows great generalization performance within QA tasks, it can hardly generalize to other NLP tasks. Recent works, such as CrossFit [205], ExT5 [3], FLAN [187], T0 [146], and InstructGPT [121] focus on how to generalize a massively multi-task model across task boundaries in a much broader context. Particularly, in the CrossFit framework [205], cross-task generalization requires a small number of labeled instances of the target task for fine-tuning. It is because the templates of CrossFit use 129 the task names as the hard prefixes. Therefore, it is necessary to fine-tune the upstream model with a few examples that have the target task names as prefixes (i.e., few-shot learning), but this largely limits the application scenarios of these multi-task NLP models in practice. We instead focus on unsupervised cross-task generalization, where there is no labeled data of an unseen task (i.e., zero- shot learning). Using natural-language instructions as prompts, both FLAN and T0 show that it is promising to perform zero-shot cross-task generalization. In this work, we also focus on such an unsupervised setting for cross-task generalization, while our problem setup is a bit different from the ones used in T0 and FLAN. As for the assumption about the unlabeled data, their setups can be seen as a special case of ours when|Q|= 1 for all unseen tasks. The evaluation protocols of T0 and FLAN assess the generalization performance of the upstream model as it is, and thus their evaluation is more about the quality of templates and the upstream training tricks. In contrast, our evaluation protocol can also study how to efficiently adjust the upstream model such that the updated models can generalize to new tasks without labeled data. 
Thus, we believe ours is a more general setup for studying unsupervised cross-task generalization. Retrieval augmentation in NLP. We aim to retrieve useful examples from the upstream data and re-learning them for cross-task generalization. The proposed ReCross pipeline is inspired by open-ended QA methods such as DPR [68], DrFact [97], and RAG [88]. Retrieval augmentation also shows great performance in pre-training LMs [54]. Besides, [182] shows that learning with similar data via retrieval augmentation can improve the performance of a task-specific model. [141] show that retrieving better demonstration examples is also helpful for in-context few-shot learning of GPT-3 style language models [16]. The key challenge in the problem setup of this work is to predict the utility of the examples for unseen tasks with the consideration of efficiency and scalability. We have discussed more details about this challenge and related works in Sec. 7.2. 130 7.7 Conclusion & Future Directions We demonstrate that retrieval augmentation can largely improve the cross-task generalization abil- ity to multitask LMs in unsupervised settings. Our proposed method, ReCross, is a straightforward yet effective retrieval method that combines both efficient dense retrieval and effective pair-wise reranking. Our empirical results show that it significantly outperforms both non-retrieval methods and other baseline methods. We perform ablation studies showing the impact of changing query sizes, retrieval sizes, upsampling ratios, etc. We also find the distribution of retrieved data for analyzing the behavior differences between ReCross and others. We believe that our paper will spur further research on retrieval-augmentation methods for cross-task generalization. Interesting future directions include: 1) improve the re-learning stage by including more information from query examples, 2) extend the distant supervision mining process as a self-training procedure, 3) rigorously analyze the correlation between upstream data and target tasks, etc. 131 Chapter 8 Conclusion & Future Directions 8.1 Summary My Ph.D. research puts together evaluation and knowledge incorporation on CSR. For evaluation, I argue that we should create datasets dedicated to open-ended, generalizable, and robust CSR. We contribute to evaluating Open-Ended CSR by introducing two benchmarks, 1) OpenCSR for open- ended QA, and 2) CommonGen for LG with generative commonsense We want the CSR models more generalizable in terms of multiple languages, non-monotonic reasoning, and style transfer. Robustness is also a key aspect in evaluating CSR and we focused on the logically equivalent perturbations in RICA and adversarial attacks in NumerSense. For incorporating knowledge, we present a suite of methods for different types of knowledge. For structured knowledge, like KGs, we can use graph NNs to encode and reason with them. When we have unstructured knowledge like a corpus, we can design differentiable multi-hop reasoning methods by retrieving facts. Finally, we show that it’s also promising to encode a large number of data instances of diverse NLP data and achieve zero-shot transfer for CSR with such implicit knowledge. 8.2 Future Directions As for the future directions, I would be very excited if I could be able to work with you guys work these interesting directions: The 1st one is developing grounded and interactive CSR methods, 132 such that we can enhance embodied agents like household robots to be more intelligent. 
Also, I’d like to focus on improving the conversational models with social common sense and thus make them more effective and responsible in our society. Adding to that, I want to extend the current research on CSR with more societal impact. And focus on the cultural differences and social bias in the data and models. Plus, teaching machines to learn social norms and common values is also important to build responsible AI agents with common sense. Let’s have a closer look at these directions. We always hope AI can communicate with us and interact with the world in real-world ap- plications. Say, the robots need to understand the instructions from humans. Which are usually under-specified, for example, “Please prepare a quick breakfast with some fruits!” It can be very hard for the robots to understand that and respond correctly without grounding or commonsense reasoning. Humans have the context of the specific env in their mind, but robots may not. And they could struggle with a lot of uncertainty here. That’s why Grounded CSR in an interactive environment will be very important. Also, even when the instructions are clear, the task planning and execution of the plans can also be very challenging for robots, It’s because it requires a complex use of commonsense knowledge and reasoning ability such as affordance. To enable such studies, I want to combine embodied learning and NLP to evaluate and develop interactive agents for common sense reasoning. As for the broader communication between AI models and humans, we often find that the LMs lack common sense about social norms and many other human topics. such as rights, ethics, law, privacy, cultural differences, etc. This can cause many ethical concerns and make the applications of LMs largely limited to the public. We thus need to put the commonsense reasoning research in such a society in the loop context and make CSR research of more social good. In particular, the resources, evaluation, and modeling of social norms would be my research interests in the near future. 133 Bibliography [1] Peter Anderson et al. “Spice: Semantic propositional image caption evaluation”. In: European Conference on Computer Vision. Springer. 2016, pp. 382–398. URL:https: //link.springer.com/chapter/10.1007/978-3-319-46454-1_24. [2] K. M. Annervaz, Somnath Basu Roy Chowdhury, and Ambedkar Dukkipati. “Learning beyond datasets: Knowledge Graph Augmented Neural Networks for Natural language Processing”. In: NAACL-HLT. 2018. [3] Vamsi Aribandi et al. “ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning”. In: International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=Vzh1BFUCiIX. [4] Akari Asai et al. “Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering”. In: International Conference on Learning Representations. 2020. URL:https://openreview.net/forum?id=SJgVHkrYDH. [5] Robert Axelrod. “Schema theory: An information processing model of perception and cognition”. In: American political science review 67.4 (1973), pp. 1248–1266. [6] Stephen Bach et al. “PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 93–104. DOI: 10.18653/v1/2022.acl-demo.9. URL: https://aclanthology.org/2022.acl-demo.9. [7] Satanjeev Banerjee and Alon Lavie. 
“METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments”. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: Association for Computational Linguistics, June 2005, pp. 65–72. URL:https://www.aclweb.org/anthology/W05-0909. 134 [8] Hangbo Bao et al. “UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training”. In: arXiv: Computation and Language (2020). URL: https://arxiv.org/abs/2002.12804. [9] Peter W. Battaglia et al. “Relational inductive biases, deep learning, and graph networks”. In: CoRR abs/1806.01261 (2018). [10] Chandra Bhagavatula et al. “Abductive Commonsense Reasoning”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL: https://openreview.net/forum?id=Byg1v1HKDB. [11] Sumithra Bhakthavatsalam, Chloe Anastasiades, and Peter Clark. “GenericsKB: A Knowledge Base of Generic Statements”. In: ArXiv preprint abs/2005.00660 (2020). URL:https://arxiv.org/abs/2005.00660. [12] Sumithra Bhakthavatsalam, Chloe Anastasiades, and Peter Clark. “GenericsKB: A Knowledge Base of Generic Statements”. In: ArXiv abs/2005.00660 (2020). URL: https://arxiv.org/abs/2005.00660. [13] Yonatan Bisk et al. “PIQA: Reasoning about Physical Commonsense in Natural Language”. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020, pp. 7432–7439. URL:https://aaai.org/ojs/index.php/AAAI/article/view/6239. [14] Michael Boratko et al. “ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 1122–1136. DOI: 10.18653/v1/2020.emnlp-main.85. URL: https://www.aclweb.org/anthology/2020.emnlp-main.85. [15] Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. “Inducing Relational Knowledge from BERT”. In: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence . 2020. URL:https: //www.aaai.org/Papers/AAAI/2020GB/AAAI-BouraouiZ.5537.pdf. [16] Tom B. Brown et al. “Language Models are Few-Shot Learners”. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by Hugo Larochelle et al. 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/ 1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. 135 [17] Tuhin Chakrabarty et al. “R 3: Reverse, Retrieve, and Rank for Sarcasm Generation with Commonsense Knowledge”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 7976–7986. DOI:10.18653/v1/2020.acl-main.711. URL: https://www.aclweb.org/anthology/2020.acl-main.711. [18] Michael Chen et al. “CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense”. In: ArXiv abs/1904.04365 (2019). URL: https://arxiv.org/abs/1904.04365. [19] Zewen Chi et al. “InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training”. 
Abstract
Large pre-trained language models (LMs) have become the foundation for natural language processing (NLP) and many other areas of artificial intelligence (AI). Built on Transformer-based neural network architectures and trained on large text corpora, these LMs acquire a great amount of linguistic knowledge. These advances have led to significant improvements in many AI tasks such as question answering, information extraction, summarization, machine translation, and dialogue generation, and some recent large LMs even surpass human performance on many standard and popular benchmarks for natural language understanding and generation. However, LMs still often make mistakes when commonsense knowledge is needed to reason about everyday situations. This lack of commonsense reasoning (CSR) ability exposes troubling gaps in current models' world knowledge and reasoning capabilities, and it remains a bottleneck for building human-level AI systems that can naturally think, talk, and act in real life as humans do.
In this thesis, I argue that evaluating and improving the commonsense reasoning ability of LMs is necessary for building human-level AI systems with general intelligence. The first half of the thesis focuses on how to better evaluate the common sense of LMs. Prior work on benchmarking CSR in NLP has primarily relied on two types of evaluation: knowledge probing and multiple-choice question answering (MCQA). Although both are simple and straightforward to use, the current evaluation protocols miss many important aspects, so I create datasets dedicated to open-ended, generalizable, and robust CSR. To evaluate open-ended CSR, I introduce two benchmarks: OpenCSR for open-ended question answering and CommonGen, a constrained text generation challenge for generative commonsense reasoning. To encourage CSR models to generalize across multiple languages, non-monotonic reasoning, and style transfer, I create the X-CSR and RiddleSense benchmarks. Finally, robustness is another key aspect of evaluating CSR, so I study logically equivalent perturbations (RICA) and adversarial attacks in probing numerical commonsense (NumerSense).
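To make the knowledge-probing style of evaluation concrete, the following minimal sketch (not code from the thesis) scores number words for a NumerSense-style cloze statement with an off-the-shelf masked LM; the model choice, probe sentence, and candidate set are illustrative assumptions.

from transformers import pipeline

# Hypothetical probe: rank number words for a cloze statement with a masked LM,
# in the spirit of NumerSense-style numerical commonsense probing.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # assumed model choice

probe = "a bird usually has [MASK] legs."
candidates = {"two", "three", "four", "five", "six"}

# Keep only number-word predictions and rank them by the model's probability.
predictions = fill_mask(probe, top_k=50)
ranked = [(p["token_str"], p["score"]) for p in predictions if p["token_str"] in candidates]
for word, score in ranked:
    print(f"{word}\t{score:.4f}")

A model that assigns a higher score to "two" than to "four" for this probe would be credited with the corresponding piece of numerical commonsense; the actual benchmark aggregates such judgments over many probes.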
The second half of the thesis presents methods for incorporating knowledge to improve the commonsense reasoning ability of LMs. Useful knowledge for commonsense reasoning can be roughly categorized into three types: 1) structured knowledge, 2) unstructured knowledge, and 3) instance-based implicit knowledge. I start with the KagNet model, which first retrieves subgraphs of commonsense knowledge graphs and then fuses them into LMs for CSR. To incorporate unstructured commonsense knowledge in the form of text corpora, I introduce DrFact, an effective multi-hop reasoning method that models more complex commonsense knowledge via retrieval. Beyond such declarative commonsense knowledge, I show that treating annotated instances of NLP tasks as implicit knowledge bases can further improve CSR via retrieval augmentation, which is especially helpful in unsupervised cross-task generalization settings.
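The retrieval-augmentation idea running through these methods can be illustrated with a minimal single-hop sketch: retrieve the most relevant facts from an unstructured corpus and prepend them to the question before it is passed to a reader LM. This is an assumed illustration only (toy fact corpus, off-the-shelf sentence-transformers encoder), not the DrFact implementation, whose multi-hop retrieval over a large corpus is considerably more involved.

from sentence_transformers import SentenceTransformer, util

# Toy stand-in for a large unstructured commonsense corpus (hypothetical facts).
facts = [
    "Birds have two legs.",
    "A fridge is used to keep food cold.",
    "People usually sleep at night.",
]
question = "How many legs does a bird have?"

# Dense retrieval: encode the question and the facts, then take the top-k by cosine similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
fact_emb = encoder.encode(facts, convert_to_tensor=True)
q_emb = encoder.encode(question, convert_to_tensor=True)
hits = util.semantic_search(q_emb, fact_emb, top_k=2)[0]

# Retrieval augmentation: prepend the retrieved facts as context for a downstream reader LM.
context = " ".join(facts[hit["corpus_id"]] for hit in hits)
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
print(prompt)

The same pattern extends to instance-based implicit knowledge by retrieving labeled examples of related tasks instead of facts and using them as additional supervision or in-context demonstrations.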
Asset Metadata
Creator
Lin, Yuchen (author)
Core Title
Evaluating and improving the commonsense reasoning ability of language models
School
Andrew and Erna Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2023-05
Publication Date
02/28/2025
Defense Date
12/05/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
artificial intelligence,benchmarks,CommonGen,commonsense knowledge graphs,commonsense reasoning,CSR,dialogue generation,DrFact,generalizable,human-level AI systems,information extraction,instance-based implicit knowledge,KagNet,large text corpora,linguistic knowledge,machine translation,multi-hop reasoning,natural language processing,OAI-PMH Harvest,OpenCSR,open-ended,pre-trained language models,question answering,reasoning capabilities,retrieval augmentation,RiddleSense,robust,structured knowledge,summarization,Transformer-based neural network,unstructured knowledge,unsupervised cross-task generalization,world knowledge,X-CSR
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Ren, Xiang (committee chair), Liu, Yan (committee member), Mintz, Toby (committee member), Nevatia, Ram (committee member)
Creator Email
yuchen.lin@usc.edu, yuchenlin1995@gmail.com
Unique identifier
UC112764291
Identifier
etd-LinYuchen-11489.pdf (filename)
Legacy Identifier
etd-LinYuchen-11489
Document Type
Dissertation
Rights
Lin, Yuchen
Internet Media Type
application/pdf
Type
texts
Source
20230313-usctheses-batch-1009 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu