EXTERNALIZED REASONING IN LANGUAGE MODELS FOR SCALABLE AND TRUSTWORTHY AI

by

Peifeng Wang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2023

Copyright 2023 Peifeng Wang

Dedication

To my family.

Acknowledgements

First, I would like to thank my PhD advisors, Prof. Muhao Chen and Prof. Xiang Ren, who offered consistent and generous help throughout my PhD journey. They are both young professors in NLP and were always happy to discuss my projects with me every week. In each discussion, they offered detailed suggestions on both high-level ideas and low-level implementations to make sure we moved the projects forward in the right direction. They also helped me connect with other collaborators from academia and industry. I learned a lot from my advisors about how to conduct research in NLP.

Second, I want to thank my collaborators. I had the fortune to work with these amazing researchers and also learned a lot from them. I worked with several great professors from USC-ISI, including Prof. Pedro Szekely, Prof. Nanyun Peng and Prof. Filip Ilievski. They come from different research backgrounds and thus provided many diverse suggestions during our collaborations, which helped me come up with interesting and innovative ideas. I also worked with several labmates, such as Aaron Chan, Jonathan Zamora, Junfeng Liu and Jun Yan. They are all excellent collaborators and good friends who gave me much wonderful life advice. During my industrial internships, I was also lucky to work with many senior researchers, including Zheng Li, Zhengyang Wang and Yifan Gao from Amazon, and Olga Golovneva, Armen Aghajanyan, Asli Celikyilmaz and Maryam Fazel-Zarandi from Meta.

Finally, I want to thank my wife and my parents for their love and support. They give me the courage and motivation to overcome the difficulties and challenges in my PhD career. I will always remember all they have done for me, and I love them.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Recent success brought by language models
  1.2 Limitations of reckless scaling
  1.3 Reasoning requires more than just being good at language
  1.4 Problem settings
  1.5 Externalized reasoning in language models
  1.6 Outline
    1.6.1 Structured Externalized Reasoning
    1.6.2 Free-text Externalized Reasoning
Chapter 2: Literature Review
  2.1 Reasoning with Structured Knowledge
  2.2 Knowledge-Enhanced Text Generation
  2.3 Rationale-based Reasoning
  2.4 Learning from Large Language Models
  2.5 Reasoning with Memory
Chapter 3: A Knowledgeable Path Generator for Commonsense Question Answering
  3.1 Preliminaries
  3.2 Knowledgeable Path Generator
    3.2.1 Knowledge Path Sampling
    3.2.2 Generating Paths to Connect Entities
    3.2.3 Adapted Commonsense QA Framework
  3.3 Experiments
    3.3.1 Datasets
    3.3.2 KG and Path Data Preparation
    3.3.3 Baselines
    3.3.4 Model Variations
    3.3.5 Results
    3.3.6 Study of Path Quality & Interpretability
Chapter 4: Contextualized Scene Imagination for Generative Commonsense Reasoning
  4.1 Preliminaries
  4.2 Method
    4.2.1 The Imagine-and-Verbalize Approach
    4.2.2 Imagination via Generating SKG
    4.2.3 Learning the Scene Imagination Module
    4.2.4 Scene-aware Verbalization
  4.3 Experimental Setup
  4.4 Results and Analysis
    4.4.1 Main Results
    4.4.2 Performance Analysis
    4.4.3 Human Evaluation on Generated SKGs
Chapter 5: Faithful Language Reasoning Using Prompt-Generated Rationales
  5.1 Preliminaries
  5.2 PINTO: Faithful Language Reasoning
    5.2.1 Rationalizing Module
    5.2.2 Reasoning Module
    5.2.3 Training
  5.3 Experimental Setup
  5.4 Experiments
    5.4.1 Main Results
    5.4.2 Performance Analysis
Chapter 6: Self-Consistent Chain-of-Thought Distillation
  6.1 Preliminaries
    6.1.1 Generating Rationale Annotation
    6.1.2 Training a Student Model
  6.2 Distilling a Self-Consistent Student
    6.2.1 A Consistent Teacher: Contrastive Decoding
    6.2.2 A Faithful Student: Counterfactual Reasoning
  6.3 Experiments
    6.3.1 Datasets
    6.3.2 Evaluation Metrics
    6.3.3 Implementation Details
    6.3.4 Baselines
    6.3.5 Main Results
    6.3.6 Ablation on the student model size
    6.3.7 Controlling the behavior of the Student
Chapter 7: Memory-assisted Language Modeling and Chain-of-Thought Reasoning
  7.1 Preliminaries
  7.2 Method
    7.2.1 Symbolic Memory
    7.2.2 Memory Retrieval
    7.2.3 Memory Updating
    7.2.4 Memory-assisted Reasoning
  7.3 Experiments
    7.3.1 Datasets
    7.3.2 Baselines
    7.3.3 Results
Chapter 8: Conclusion
  8.1 Summary
  8.2 Future Direction
Bibliography

List of Tables

3.1 Transformation of a symbolic path into text
3.2 Test accuracy with varying proportions of CommonsenseQA
3.3 Test accuracy on OpenBookQA
3.4 Test accuracy on CommonsenseQA's official leaderboard
3.5 Test accuracy on the OpenBookQA leaderboard
3.6 Automatic and human evaluation of the generated paths on the task test set
3.7 Paths from question to gold answer entities
4.1 Statistics of the SKG instances collected from different resources
4.2 Results on the official CommonGen test set
4.3 Results on the Concept2Story tasks
4.4 Performance of our method using different SKG sources to train the imagination module, with T5-large as the backbone LM
4.5 SPICE performance of our method using different sizes of T5 as backbone for the imagination module
4.6 Human evaluation on the generated SKGs
5.1 Rationalization prompts
5.2 ID results of rationale-based reasoning
5.3 OOD results of rationale-based reasoning
5.4 Sensitivity to noisy rationales
6.1 Human evaluation on the rationales generated by different teacher models

List of Figures

1.1 Model size and performance on GLUE
3.1 A motivating example of the path generator
3.2 KG-augmented QA framework
3.3 Overview of the adapted knowledge module
3.4 Test accuracy on CommonsenseQA (left) and OpenBookQA (right) with different proportions of training data
4.1 Continual pretraining and fine-tuning of the imagination module
4.2 The iterative process of imagination and verbalization
4.3 Ablation study on backbone LM sizes of our verbalization module and Node2Text using the Concept2Story-ROC dataset
4.4 Results (SPICE) of the low-resource experiment on the three benchmark datasets with different numbers of training examples
5.1 Rationale-based language reasoning
5.2 Overview of PINTO
5.3 Standard training vs. counterfactual training
5.4 Low-resource learning: performance (accuracy) of different fine-tuned models in low-resource settings on CSQA
5.5 Rationale quality analysis: accuracy of models with both generated and annotated rationales vs. models using only generated rationales on CSQA
6.1 Vacuous rationales
6.2 Overview of our knowledge distillation framework for faithful reasoning
6.3 Contrastive decoding
6.4 Counterfactual reasoning
6.5 Simulatability (LAS) of the rationales generated from different teacher models
6.6 Faithfulness (LAS) and task performance (accuracy) of the compared methods
6.7 Faithfulness and task performance of the compared methods with different student model sizes
6.8 Performance gain (drop) of the methods with oracle (perturbed) rationales
7.1 Main results on entity tracking

Abstract

The recent advance in language models (LMs) has brought tremendous success to the field of natural language processing (NLP). These LMs are first pre-trained on a large human-authored corpus to predict missing tokens. They are then adapted to a wide range of downstream tasks through fine-tuning or prompting. This new paradigm has proven quite effective and, as of today, LMs have achieved superhuman performance on many NLP benchmarks testing natural language understanding, machine translation, question answering, etc. Notably, scaling the model size seems to be the key to better performance, which drives many tech companies to develop increasingly large LMs. However, recklessly scaling the model size comes with several limitations: (1) scaling introduces huge training and inference costs; (2) scaling aggravates the issues of uninterpretability and biases in neural models; and (3) scaled LMs are good at generating fluent text which is not necessarily coherent. Therefore, this thesis advocates externalized reasoning, in order to build more scalable and trustworthy AI systems. The core idea of externalized reasoning is to let LMs present their otherwise opaque reasoning process explicitly, e.g., as a local knowledge graph or a free-text rationale. The externalized reasoning process can thus serve as part of the explanation of the LM and also guide the language generation process toward coherence. Moreover, externalized reasoning provides a more targeted way to enhance LMs' capabilities compared to reckless scaling.

To motivate our idea, we focus on language-based reasoning tasks in NLP that require external knowledge for inference. In particular, we investigate two types of representations for the reasoning process of LMs. In the first half of this thesis, we resort to a structured representation, i.e., a graph that illustrates the complex relations between concepts mentioned in the task input.
We show that (1) a local knowledge graph can help LMs better infer the implicit relations for answering a commonsense question and (2) a structured plan can help LMs generate more coherent text. In the second half of the thesis, we leverage natural language as the vehicle of externalized reasoning. We present techniques which make use of LMs to provide (1) background knowledge in free text and (2) supervision signals for step-by-step reasoning. We show that externalizing the reasoning process as natural language helps LMs perform better on open-domain reasoning tasks. At the end of the thesis, we also equip LMs with an external memory, which is shown to help LMs better model the observed world and keep track of state changes. Overall, this thesis shows that, different from reckless scaling, externalized reasoning in LMs can provide certain interpretability to the models' behavior, enhance LMs' capabilities in a targeted way, and potentially save us from this endless scaling competition.

Chapter 1
Introduction

1.1 Recent success brought by language models

The rise of language models (LMs) such as BERT [37] and GPT [21] introduces a significant paradigm shift to the field of Natural Language Processing (NLP). Unlike traditional NLP systems that design separate models for specific tasks, LMs adopt one general-purpose architecture, i.e., Transformers [149], to perform different tasks. In particular, the process of building LMs involves pre-training as the first step, where LMs are trained on a huge amount of human-authored text to predict missing tokens. For example, when given an incomplete sentence like "AI can do amazing ___", LMs are trained to predict the token "things". By doing so, it is supposed that LMs can encode tremendous human knowledge from the textual corpus into their parameters. After pre-training, LMs can be adapted to a wide range of NLP tasks, such as natural language understanding, machine translation and question answering, via small-scale fine-tuning or simply prompting.

As of today, LMs have achieved remarkable success on several popular NLP benchmarks, even surpassing human performance on some of the datasets. On the GLUE benchmark [152], which evaluates general language understanding abilities, several LMs, including RoBERTa [95], ELECTRA [29], T5 [123] and many others, have achieved superhuman performance. Among these LMs, Turing-ULR [119] from Microsoft achieves the state-of-the-art result, greatly outperforming the human score (87.1 → 91.3).
Figure 1.1: The model size (left y-axis) of several language models and their performance on GLUE (right y-axis). Image credit: [2].

On the more challenging benchmark for natural language understanding, SuperGLUE [151], the human baseline is also surpassed by several LMs, by 0.5% (DeBERTa [59]) to 1.5% (Vega v2 [177], the current state of the art). On the SQuAD 2.0 dataset [127], which tests machine comprehension, we again see one LM after another, such as DeBERTa [59] (91.9 F1) and ALBERT [75] (90.9 F1), outperform human performance (89.5 F1). LMs have thus become the backbone of modern NLP systems, the so-called foundation models [16], due to their powerful capabilities and generalization.

One critical factor underlying the success of LMs seems to be the model size, i.e., the number of parameters. As shown in Figure 1.1 [2], the performance of LMs increases as the model size scales. This is expected, as these LMs are trained to "memorize" a corpus that contains 10 billion to 1 trillion tokens during the pre-training stage. Thus, LMs need a sufficiently large capacity in order to subsume such a large corpus, which allows them to encode more facts into their parameters.
Moreover, it is observed that certain abilities, like multi-step reasoning, are only present in large LMs but not in small LMs, and are thus considered emergent [161]. The belief that scaling up the size of LMs leads to better performance has driven tech companies to develop increasingly larger LMs, whose size increases by at least a factor of 10 every two years [81]. Starting from BERT-large [37], which has 355M parameters, GPT-2 [122] scales to 1.5B, whereas T5 [123] further scales to 11B. The predecessor of the widely known ChatGPT from OpenAI is GPT-3 [21], with 175 billion parameters. PaLM [28] from Google even has 540 billion parameters. It seems as if we are witnessing another Moore's law in the field of NLP [137].

1.2 Limitations of reckless scaling

Despite the remarkable success brought by LMs, this reckless scaling competition also comes with several vital limitations, as listed below.

• Huge training and inference cost. It is estimated that training a large LM such as GPT-3 could cost more than $4.6M using Tesla V100 GPUs [81], which leads to an intensive carbon footprint as well. The energy consumption is even higher when using these giant LMs for inference, as a single request to ChatGPT can consume 100 times more energy than one Google search [27]. It is also impractical for users or academic laboratories to deploy these LMs, as hosting the models requires large computational resources. This leads to a severe accessibility issue, as most people only get to use large LMs via APIs provided by companies like OpenAI.

• Uninterpretability and biases. Due to the black-box nature of neural networks, it is usually unclear why LMs make certain generations or decisions. Meanwhile, it is reported that LMs often use spurious correlations, such as biases picked up during the pre-training stage, as reasoning shortcuts [49, 20]. The opaque reasoning process of neural models makes it impossible for human developers to examine whether a deployed LM is making decisions based on the right reasons. Scaling up the model size only intensifies this issue, as it is even harder to interpret the behavior of a model with more parameters and to pinpoint the biases that exist in it.

• Hallucination. LMs are known to suffer from hallucination, i.e., generating text that is either not grounded in the input or not factually correct [64]. This is because during pre-training, LMs only learn patterns and associations between words in the training data and do not truly comprehend the observed data [147]. On the TruthfulQA dataset [90], which tests whether an LM is truthful in question answering, GPT-3 only achieves 58% accuracy whereas human performance is 94%. What is more concerning is that large LMs are good at generating fluent text and present their predictions in such a compelling way that humans find them credible. Thus, a hallucinated prediction could be quite misleading and dangerous if it is involved in high-stakes decision making.

1.3 Reasoning requires more than just being good at language

Based on the above observations, it is clear that being good at generating fluent text does not amount to being good at reasoning [103]. What LMs learn during the pre-training stage is to string words (tokens in practice) together. Certainly, they could learn the basic linguistic knowledge that is useful for deciding the next token to generate when completing a sentence.
For example, in the sentence "The key to the cabinet is on the table.", LMs can learn that the verb of a sentence ("is") should follow the subject ("key"). However, not all aspects of human thought can be learnt through simply predicting missing words [103]. One classical example is the Winograd schema challenge [77], where one needs to decide the missing word in an incomplete sentence like "The trophy did not fit into the suitcase because the ___ is too big.". We know it should be the "trophy" that is too big because we have the external knowledge that large objects do not fit into small containers. Besides this type of common sense, we also rely on many types of knowledge other than pure linguistics to decide the proper words to use when speaking. These include planning, general world knowledge, a mental model of the world and many others [103].

Scaling up the model size would certainly teach LMs to be increasingly better at language modeling. But token prediction as a pre-training objective does not provide supervision signals for all the skills required for modeling human reasoning. This leads to the biases and hallucination issues. Below we introduce the problem settings considered in the thesis that require more than just linguistic knowledge to solve a task.

1.4 Problem settings

To motivate the studies in this thesis, we introduce several problem settings that aim to examine a spectrum of reasoning capabilities in LMs. These problem settings are exemplified as language-based reasoning tasks, i.e., with both the task input and output as text, but they are not mere language modeling tasks. In particular, we look into problem settings where LMs need to leverage external knowledge, such as common sense or world knowledge, to solve a given task. Below is a list of the problem settings that our experiments are based on.

• Open-domain Question Answering. This task tests a system's capability in answering a question with background knowledge. Given a question q and potentially a set of answer choices A = {a_i}, the system's goal is to predict the correct answer a* (from A if provided). Unlike machine comprehension, where the answer can be found in the given context, the evidence (or rationales) required for answering an open-domain question is not presented to the system. The evidence could be commonsense knowledge or general facts. To answer this type of question, a system needs to first ground the relevant knowledge and then reason over it. Model performance is measured by accuracy over the test set. We conduct experiments on datasets such as CommonsenseQA [143], StrategyQA [50] and OpenbookQA [107].

• Generative Commonsense Reasoning. This task examines a system's capability in composing concepts into a daily scene description. The task input is a set of concepts X = {x_i}, where each concept is a commonly seen object (a noun such as "dog" or "frisbee") or a commonly performed action (a verb such as "throw" or "catch"). The goal of the system is to use all the concepts to generate a coherent sentence that describes a plausible situation following human common sense. This is not a mere language generation task. To generate a coherent sentence, a system needs to reason about the relations between concepts and the affordances of objects (e.g., a "dog" can perform the action "catch" but not the action "throw").
Moreover, the system needs to perform compositional generalization, i.e., reason over a new concept composition that is not observed during training, and to identify implicit concepts related to the scene that are not provided (e.g., a "person" to perform "throw" in the above example). Task performance is measured by standard evaluation metrics for text generation, including BLEU, CIDEr and SPICE, which compare the machine generation with human annotations. We experiment with CommonGen [88] and our concept-to-story dataset.

• Procedure Understanding. This task measures an LM's capability in maintaining a coherent model of the observed world. We consider two task settings. (1) State Tracking: Given an initial context describing a world state, followed by a list of state-changing instructions, a system's goal is to predict the final state of each entity mentioned in the context. (2) Story Understanding: Given a story with a long passage involving different events, and a question, a system needs to answer the question based on the world described by the story. In both settings, the system is required to properly keep track of the state changes of every entity mentioned in the context in order to perform the task. Model performance is measured by the accuracy of the predictions. We consider datasets including Dyna-bAbI [145], Alchemy [96] and Boxes [69] for experiments.

1.5 Externalized reasoning in language models

The current implementation of LMs entangles the different reasoning skills illustrated above into one single architecture, Transformers. This entanglement creates a need for scaling in order to work effectively, which comes with several limitations, including huge training and inference costs, uninterpretability and hallucination. In order to build scalable and trustworthy AI systems, this thesis advocates externalized reasoning, a general paradigm that asks LMs to "speak out" the underlying reasoning process before performing a task. The core idea is to let LMs present their otherwise opaque reasoning process explicitly as either a structured representation (e.g., a graph representing the complex dependencies between facts) or a free-text rationale (e.g., a few short sentences as the explanation). Below we explain how externalized reasoning in LMs can combat the aforementioned limitations brought by reckless scaling.

1. Externalized reasoning converts part of the underlying reasoning process of LMs into an explicit form to aid interpretability. Instead of merely giving an answer without telling why, LMs with externalized reasoning can expose part (but not all) of the reasoning process underneath by providing the evidence that they rely on for answering. The exposed reasoning, compared with the complicated neural representations inside LMs, can thus serve as a more human-accessible explanation, through which we get to examine whether LMs are making decisions based on the right reasons.

2. Externalized reasoning serves as guidance for text generation in order to address hallucination. Being forced to generate the task output directly from the task input is usually one of the major causes of hallucination. Through externalized reasoning, LMs have a chance to "think" more by generating an intermediate reasoning result, e.g., a high-level plan about how to structure a sentence, which guides the generation process to be more coherent.

3. Externalized reasoning provides a more systematic way to incorporate different skills so as to avoid reckless scaling.
The entanglement of reasoning skills within one LM architecture makes it hard to enhance these skills in a targeted way. As a result, people resort to reckless scaling to enhance the overall model. Instead, externalized reasoning offers a way to tease apart these different reasoning skills with explicit components. We can thus more systematically enhance the corresponding components to mitigate any shortcomings we observe in LMs.

1.6 Outline

This thesis describes my works that investigate two types of representations for externalized reasoning in LMs to enhance different reasoning capabilities. In the first half of the thesis, we focus on structured graphs as the representation of externalized reasoning. A graph can depict the complex relations between concepts that help LMs reason. We describe two works, one using a local knowledge graph for commonsense question answering and another using a structured plan for coherent text generation. In the second half of the thesis, we turn to free-text rationales for representing the externalized reasoning. A rationale consists of a few short sentences laying out the reasoning process step by step in natural language. We describe three works using free-text rationales to empower a small LM for multi-step reasoning.

1.6.1 Structured Externalized Reasoning

In Chapter 3 and Chapter 4, we present our early efforts in leveraging structured representations as the externalized reasoning in LMs. We study commonsense reasoning tasks in these two chapters, where the relationships between the concepts mentioned in the context are not provided and a system needs to reason over the implicit relations. A structured representation, i.e., a graph, can explicitly depict the complex relations between concepts that help the system reason. The graph can then serve as an explanation of the system's underlying reasoning process for humans to examine.

In Chapter 3, we focus on commonsense question answering (QA), which requires background knowledge not explicitly stated in a given context. Prior works use commonsense knowledge graphs (KGs) to obtain this knowledge for reasoning. However, these KGs suffer from limited coverage and the contextual dependence of their knowledge. We present a general commonsense QA framework with a knowledgeable path generator. By extrapolating over existing paths in a KG with a state-of-the-art language model, our generator learns to connect a pair of concepts in text with a dynamic, and potentially novel, multi-hop relational path. Such paths can provide structured evidence for solving commonsense questions. Experiments on two datasets show the superiority of our method over previous works which fully rely on knowledge from KGs (with up to 6% improvement in accuracy). Further evaluation suggests that the generated paths are typically interpretable, novel, and relevant to the task. This chapter is based on Wang et al. [156].

In Chapter 4, we focus on generative commonsense reasoning, where a system needs to compose a set of given concepts into a plausible sentence. This generative commonsense reasoning skill is lacking in state-of-the-art text generation methods. Descriptive sentences about arbitrary concepts generated by neural text generation models (e.g., pre-trained text-to-text LMs) are often grammatically fluent but may not correspond to human common sense.
We propose an Imagine-and-Verbalize (I&V) method, which learns to imagine a relational scene knowledge graph (SKG) with relations between the input concepts, and leverages the SKG as a constraint when generating a plausible scene description. We collect and harmonize a set of knowledge resources from different domains and modalities, providing a rich indirect supervision signal for I&V. The experiments demonstrate the effectiveness of I&V in improving language models on both concept-to-sentence and concept-to-story generation tasks, while enabling the model to learn well from fewer task examples and to generate SKGs that make common sense to human annotators. This chapter is based on Wang et al. [159].

1.6.2 Free-text Externalized Reasoning

In Chapters 5-7, we present three works leveraging natural language as the vehicle for externalized reasoning in LMs. We study general language-based reasoning tasks, including open-domain QA and procedure understanding. The former requires a system to infer the intermediate facts not provided explicitly in the task input in order to deduce the answer to a question. The latter requires a system to keep track of the entity states that are indicated by the context, either explicitly or implicitly, in order to understand the procedure. In both problem settings, free-text rationales have the flexibility and expressiveness to convey a complicated reasoning process in natural language. The externalized reasoning in natural language can thus serve as an explanation that interprets a prediction made by an LM and also provide the necessary knowledge for reasoning.

In Chapter 5, we study rationale-based reasoning, where a system needs to rationalize its reasoning in free text and then predict the answer. We present PINTO, an LM pipeline that rationalizes via prompt-based learning and learns to consistently reason over rationales via counterfactual regularization. First, PINTO maps out a suitable reasoning process for the task input by prompting a frozen rationalizing LM to generate a free-text rationale. Second, PINTO's reasoning LM is fine-tuned to solve the task using the generated rationale as context, while being regularized to output less confident predictions when the rationale is perturbed. Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, we find that PINTO's rationales are more consistent with its task predictions than those generated by competitive baselines. This chapter is based on Wang et al. [155].

In Chapter 6, we again study rationale-based reasoning but present a follow-up work to Chapter 5, SCOTT, a faithful knowledge distillation method for learning a small, self-consistent CoT model from a teacher model that is orders of magnitude larger. To form better supervision, we elicit rationales supporting the gold answers from a large LM (teacher) by contrastive decoding, which encourages the teacher to generate tokens that become more plausible only when the answer is considered. To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective, which prevents the student from ignoring the rationales and making inconsistent predictions. Experiments show that while yielding comparable performance, our method leads to a more faithful model than the baselines.
Further analysis shows that such a model respects the rationales more when making decisions; thus, we can improve its performance further by refining its rationales. This chapter is based on Wang et al. [158].

In Chapter 7, we study procedure understanding, where a system needs to keep track of entity states in order to infer the final states or answer a question. We present a framework where an LM is augmented with an external symbolic memory that maintains the entity states. As the LM reads the context, the maintained memory interleaves with the generation process to provide the previous states when the LM is inferring the new states. When conducting question answering, the memory also interleaves the chain-of-thought reasoning with the stored facts. Experiments show that our method consistently improves LMs' capability of state tracking on several datasets that involve long contexts and complex state changes. The memory also helps LMs conduct more robust chain-of-thought reasoning. This chapter is based on our work in submission.

Chapter 2
Literature Review

In this chapter, we survey the works closely related to our idea of externalized reasoning. These works aim to enhance different reasoning capabilities in LMs while providing certain interpretability. We first discuss existing systems that leverage structured knowledge to boost the performance of LMs on open-domain question answering. We then cover similar works that retrieve knowledge to assist LMs in text generation. After that, we turn our attention to free-text rationales that explicitly rationalize the internal knowledge in LMs for reasoning. At the end, we also touch on several works that aim to distill knowledge from large LMs to train small LMs.

2.1 Reasoning with Structured Knowledge

Static Knowledge Graphs. Recent benchmarks for question answering (QA), like commonsense QA [143], open-domain QA [168] and reading comprehension [163], require systems to conduct multi-hop reasoning. Existing works propose to leverage knowledge graphs (KGs) as the source of knowledge to assist LMs in reasoning. They typically employ entity linking to recognize the relevant entities, ground them to a KG, and retrieve the paths from the local graph neighborhood around the entities. The retrieved paths are scored or ranked using graph-based metrics (e.g., PageRank, centrality) [120, 42, 10], handcrafted rules [65] or neural methods (e.g., attention mechanisms) [73, 87].

Dynamic Knowledge Graphs. Several methods generate knowledge paths instead of extracting them from static KGs. Asai et al. [5] learn reasoning paths by forming sequences of evidence documents; however, their approach relies on inter-document hyperlinks to establish relations in the constructed KG. The extractor of Fu et al. [48] retrieves missing facts in order to address the sparsity of KGs. Unlike our work, their setting is limited to knowledge graph completion, where both a query entity and a single query relation are given. Earlier works [84, 63, 35] pose the commonsense KG (CKG) completion task as triplet classification, where the goal is to score the plausibility of a complete triplet. COMET [19] is the first to cast this task as commonsense inference with LMs. Follow-up contributions utilize COMET as a commonsense provider in various downstream tasks [18, 3, 23], thus providing evidence for LMs' generalization to previously unseen scenarios.
Further efforts include [62], which shows that the quality of the training triplets is a key factor in adapting LMs, and [34], which investigates how to learn COMET in a few-shot setting. Meanwhile, the study by [153] indicates the limited generalization of COMET. [100] also adapt LMs simultaneously on multiple CKGs, albeit their goal is to improve downstream performance rather than CKG inference.

2.2 Knowledge-Enhanced Text Generation

Incorporation of External Resources. Pretrained LMs such as GPT-2 [122] and UniLM [38] are prone to generate fluent but implausible sentences that do not follow human common sense after being fine-tuned on the CommonGen dataset [88]. Recent works [94, 82] on GCSR propose to retrieve external knowledge to enhance the text generation. Prototype-based models, including EKI-BART [44], Re-T5 [154] and KFCNet [82], retrieve massive prototype sentences from external corpora (over 70M) like visual captions and Wikipedia as auxiliary input to the LM. Though the retrieved prototype sentences provide high coverage of the concepts, their models are supervised to compose sentences that are very similar to those existing prototypes. It is thus unclear whether their models are conducting commonsense reasoning or only mimicking the prototypes. KG-BART [94] incorporates the embeddings of relational facts about the concepts from ConceptNet into both the encoders and decoders of the BART architecture [78]. As there could be multiple relations between two concepts, it is unclear how to select the relation that fits a given context (Fadnis et al. [42]).

Content Planning. Our method is also related to prior works [51] that propose intermediate representations as a way to "plan ahead" before generating long narratives. Plan-and-write (Yao et al. [169]) generates chains of keywords as a storyline, but does not consider relations between keywords (concepts) as we do. Action-plan (Fan, Lewis, and Dauphin [43]) takes a step further by using predicate-argument structures obtained with semantic role labeling, but still does not involve all the concepts in a sentence. Moreover, these methods are limited to obtaining supervision from task-specific datasets.

Machine Imagination. There are also some prior works exploring machine imagination in different tasks. [91] proposes to generate images as visual evidence for solving commonsense question answering. [41] improve multi-modal translation by training the model to translate a sentence and imagine via jointly learning a visually-grounded representation. VisCTG [46] retrieves Google images to visually ground Concept2Sentence generation. The aforementioned works directly use images to enrich the generation.

2.3 Rationale-based Reasoning

Prior works on free-text rationale generation can be grouped into three paradigms. In the fine-tuned self-rationalizing paradigm, a single LM is fine-tuned to jointly generate the task output and rationale [113, 104, 172, 83]. Since the LM parameters are shared across two relatively dissimilar objectives, these models often perform worse than non-rationalizing LMs [166, 113]. Notably, this paradigm requires expensive rationale annotations for all training instances. In the prompted self-rationalizing paradigm, a single LM is instead frozen and prompted to jointly generate the task output and rationale, with the prompt consisting of a few input-output-rationale demonstrations [162].
This paradigm performs well and only needs a few rationale annotations for the prompt, but it is computationally prohibitive since it generally requires very large-scale LMs to work effectively [74, 162]. In the pipeline-rationalizing paradigm, a fine-tuned rationalizing LM first generates the rationale, which is then used as input for a separate fine-tuned reasoning LM to generate the output [72, 126]. Here, the generated rationale forms a discrete (i.e., non-differentiable) bottleneck between the two modules, which complicates end-to-end training and can hurt task performance [166, 58]. Additionally, the dedicated rationalizing LM requires extra rationale annotation and computation costs. Moreover, none of these paradigms has a mechanism for regularizing the rationale generation to faithfully reflect the reasoning process of the LM without hurting task performance.

2.4 Learning from Large Language Models

There exist some works that explore the idea of distilling rationale knowledge from a large LM to a small LM as the student. West et al. [164] proposed to train the student for knowledge completion. Chan et al. [25] proposed to learn a student model that only predicts answers from a teacher model that is augmented with rationales. Eisenstein et al. [40] proposed to train the student to extract the sentence containing the answer, which is not applicable to reasoning tasks that require background knowledge. Shridhar, Stolfo, and Sachan [135] proposed to train the student to ask and answer the sub-questions necessary for decomposing the main question, which is tailored to solving math word problems [31] with an equation generator for guiding the student, whereas we do not have such a constraint. Li et al. [83] proposed to train the student on the joint task of generating the answers and the rationales, where the rationales only act as a regularization and do not affect the student's prediction during inference. More importantly, both Shridhar, Stolfo, and Sachan [135] and Li et al. [83] do not consider the faithfulness of the rationales, which is critical for examining the behavior of the student.

2.5 Reasoning with Memory

World Models in LMs. Prior studies on LMs generally conclude that they do not internally maintain a coherent model of the described world. [79] probe the neural representations of LMs with trained classifiers, which fail to correctly predict the entity states up to 53.8% of the time. Through prompting, it has also been shown that LMs struggle with capturing the relations between entities [54] and keeping track of state changes [70]. The failure to build a good world model is shown to be a cause of LMs generating incoherent text [117, 175, 116].

Enhancing LMs on State Tracking. Several approaches have been proposed to enhance LMs' capability of entity tracking, either implicitly or explicitly. [55] and [101] inject state information into the neural representations of LMs by training them with auxiliary tasks, which relies on expensive annotation. Advanced prompting methods [115, 162, 80, 76] guide LMs to infer the states explicitly in context as a way to expand the limited context memory in Transformers. But these methods still rely on the LMs themselves to maintain the entity states, which is prone to error.

Addressing Hallucination. Our scheme of interleaving text generation with memory bears some resemblance to recent works addressing hallucination in LMs. [148] interleaves the chain-of-thought reasoning process with facts retrieved from external resources for knowledge-intensive tasks.
[26] and [98] replace the answer generation step with the results of executing programs generated by LMs.

Chapter 3
A Knowledgeable Path Generator for Commonsense Question Answering

Solving commonsense QA tasks requires filling gaps with external knowledge. For instance, given the multiple-choice question in Figure 3.1, a system needs to know that fungus grows in moist environments, such as caves, and that a cave is a type of geological feature. Such commonsense knowledge is obvious for humans, but most existing QA systems do not have it or cannot reason with it. Although recent advances in pre-trained language models (LMs) have resulted in impressive performance on commonsense-related benchmarks [173, 11, 60], it is unclear whether this is due to commonsense reasoning or to capturing spurious correlations in the data [114]. Pre-trained LMs may answer a question correctly for the wrong reasons, making them highly uninterpretable [110].

Alternatively, a set of systems retrieve external knowledge either from large text corpora or knowledge graphs (KGs). A corpus, however, might not be an ideal source of commonsense knowledge, as such knowledge is seldom stated explicitly in text [139]. In contrast, commonsense KGs, like ConceptNet [138] and ATOMIC [130], provide structured evidence about the relevant entities, thus enabling effective reasoning and higher interpretability. Existing systems retrieve knowledge from a KG in the form of triplets [109], multi-hop paths [87, 10], or subgraphs [65].

Figure 3.1: Our path generator learns to connect the question entities (in red) and choice entities (in blue). The dashed arrow indicates a missing link in a static KG. (Example question: "In what geological feature will you find fungus growing?" with choices: shower stall, toenails, basement, forest, cave.)

Despite the aforementioned benefits, exploiting these KGs poses the following challenges. Firstly, as KGs are known to suffer from sparsity [84], they might not contain the knowledge needed to fill the gaps between the question and the answer. For example, a missing link (cave, IsA, geological_feature) in Figure 3.1 might prevent the QA system from choosing the correct answer. Recent work on commonsense KG completion [84, 19, 18] is limited to predicting the tail of a statement with known head and relation, or a single-hop relation between entities. Secondly, due to the large size and heterogeneity of modern KGs, contextualization, i.e., identifying a set of KG facts which are relevant or needed to answer a question, is also difficult [42]. Simply retrieving all paths could introduce noisy information and potentially harm reasoning.

To address this gap between LMs and KGs, we propose a knowledgeable path generator (PG) that generalizes over the facts stored in a KG, rather than only retrieving them. We call our method a neural KG due to its neural generalization over structured KGs and, in contrast, we use the term static KG for methods which rely exclusively on existing facts in a KG. Our PG connects a pair of question and answer entities with a (novel) multi-hop path, which may not exist in the KG, allowing missing facts like (cave, IsA, geological_feature) in Figure 3.1 to be considered during inference.
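To make the interface of such a path generator concrete, the sketch below shows how a GPT-2-based generator could be queried once it has been fine-tuned on KG paths verbalized as text (§3.2). The checkpoint name, the "<answer entity> <SEP> <question entity>" prompt format, and the decoding settings are illustrative assumptions rather than the exact configuration used in this chapter.

```python
# Minimal sketch (assumed interface): querying a GPT-2 path generator that has
# already been fine-tuned on verbalized KG paths. The separator token <SEP> is
# assumed to have been handled during fine-tuning.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("path/to/finetuned-path-generator")  # hypothetical checkpoint


def generate_path(question_entity: str, answer_entity: str, max_new_tokens: int = 32) -> str:
    # Condition on the answer entity and question entity; the generator is
    # expected to continue with a relational path that ends at the answer
    # entity, e.g. "fungus at location cave ... cave is a geological_feature".
    prompt = f"{answer_entity} <SEP> {question_entity}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=5,              # beam search tends to keep the path well-formed
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated continuation (the path itself).
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


print(generate_path("fungus", "geological_feature"))
```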
To learn such a generator, we: (1) sample a set of random walk instances from a static commonsense KG based on rules and constraints for informativeness and relevance (§3.2.1); (2) fine-tune a pre-trained 18 language model — GPT-2 [122] on the sampled paths (§3.2.2). By doing so, we transfer the rich knowledge encoded in GPT-2 to our PG. This is expected to both enhance the generalization ability of the PG and combat the sparsity of KGs. Also, by generating high-quality missing links between the question and answer entities, we contextualize the task with relevant commonsense knowledge. To understand the impact of our multi-hop PG on downstream commonsense QA tasks, we integrate the PG in an augmented version of a general QA framework (§3.2.3). We run experiments on two benchmark datasets CommonsenseQA [143] and OpenBookQA [108]. The results show that out method performs better than previous systems augmented with static KGs by up to 6% in accuracy, which also reveals its potential as a plug-in module for various datasets and as a vital complement to existing KG structures. In the low-resource setting, the accuracy gain over the baselines grows as the training data decreases, indicating a larger inductive bias of our generator. We also assess the quality and interpretability of our paths through both automatic and human evaluation. To summarize, our key contributions are: 1. We propose a method to generate task-relevant knowledge paths that may not exist in the original KG, thus addressing the contextualization and sparsity challenges of KGs. 2. We design and implement a framework with three variants of our PG, to understand the role of local and global graph information. 3. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our method compared to previous methods, as well as its robustness to limited training data. 3.1 Preliminaries Our multiple-choice commonsense QA setup follows prior work [143, 108, 13]: given a question q, a system selects exactly one of the choices a as an answer. To experiment with contextualized background knowledge, we adopt a general framework (Figure 3.2) consisting of a context module, a knowledge module and 19 Knowledge Encoder Context Encoder Reasoning Score f(q, a) Knowledge Paths Question ; Choice Figure 3.2: Our KG-augmented QA Framework. The reasoning module leverages both the unstructured context and structured knowledge to answer a question. a reasoning module. The context module encodes both the question q and a choice a as unstructured evidence, while the knowledge module encodes external facts as structured evidence. Both the unstructured and the structured evidence are fed to the reasoning module, which produces a score for a question-choice pair. The choice with a highest score would be the predicted answer. Next, we introduce each module in detail. Context Module We concatenate a question q and one of its choices a with a special token, and feed the sequence into a contextual encoder. This encoder generates an embedding c, which serves as an unstructured evidence to our system. As commonly done for textual input, we consider a bidirectional pre-trained language model [37, 95] as a contextual encoder. Knowledge Module Given a commonsense KG G = (E, R), where E is the entity set and R is the relation set, we seek a set of relevant knowledge facts for a question-choice pair {q, a}, which would serve as structured evidence to support reasoning. 
We employ an entity recognition system to extract relevant entity mentions in the question (denoted by $E^q = \{e^q\}$) and one of the choices ($E^a = \{e^a\}$). We connect each pair of question-choice entities with a multi-hop path, which can be done either by retrieving existing paths for now (as in previous methods) or by generating paths (see §3.2.3). Formally, a path is $p(e^q, e^a) = \{e^q, r_0, e_1, r_1, ..., r_{T-1}, e^a\}$, where $T$ is the number of hops. Note that when $T = 1$, the path is a single triplet. The set of paths is denoted by $P = \{p(e^q, e^a) \mid e^q \in E^q, e^a \in E^a\}$.

Naturally, we employ a Relational Network (RN) [129] to aggregate the retrieved paths into a static knowledge embedding $k$, which serves as structured evidence. In essence, an RN is a composite function over the set $P$:

$$k = f_\phi(\{g_\theta(p) \mid p \in P\}), \qquad (3.1)$$

where $f_\phi$ could be any aggregation function and $g_\theta$ could be any neural network which projects a discrete path $p$ into a fixed-size continuous embedding $\mathbf{p}$. We expect that not all paths contribute equally to choosing the right answer. Therefore, we construct the function $f_\phi$ as an attention network:

$$k = \sum_{p \in P} \alpha_p \mathbf{p}. \qquad (3.2)$$

We compute the attention weight $\alpha_p$ by using the context embedding $c$ as a query:

$$\alpha_p = \frac{\exp(\hat{\alpha}_p)}{\sum_{p'} \exp(\hat{\alpha}_{p'})}, \qquad (3.3)$$

where the context embedding $c$ guides (as an attention query) the encoding of the structured evidence:

$$\hat{\alpha}_p = c^\top \tanh(W_{\text{att}} \cdot \mathbf{p} + b_{\text{att}}). \qquad (3.4)$$

Here, the attention network is parameterized by $(W_{\text{att}}, b_{\text{att}})$ and $\tanh(\cdot)$ is a nonlinear activation function. Regarding the function $g_\theta$, we employ its original formulation:

$$g_\theta(p) = \mathrm{MLP}([\mathbf{e}^q; (\mathbf{r}_0 \circ \cdots \circ \mathbf{r}_{T-1}); \mathbf{e}^a]), \qquad (3.5)$$

where $[;]$ is vector concatenation and $\circ$ stands for element-wise multiplication. The components (entities and relations) of a path are represented by their feature vectors.

Figure 3.3: Overview of our adapted knowledge module. (1) Extraction of entities from a question and its answer choices. (2) Generation of a multi-hop knowledge path with our PG to connect each pair of question and answer entities. (3) Aggregation of the generated paths into a knowledge embedding.

Reasoning Module This module leverages the unstructured evidence (the context embedding $c$) and the structured one (the knowledge embedding $k$) to compute the plausibility of a question-choice pair. We concatenate $c$ with $k$ and feed them to the final classification layer, which is a linear transformation that scores a question-choice pair $\{q, a\}$:

$$f(q, a) = W_{\text{cls}} \cdot [c; k] + b_{\text{cls}}. \qquad (3.6)$$

The linear classification layer is parameterized by $(W_{\text{cls}}, b_{\text{cls}})$. We get the final probability over all choices by normalizing with softmax.
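To make the knowledge module concrete, the following is a minimal PyTorch sketch of the attention-based path aggregation and scoring in Eqs. 3.1-3.6. It is only a sketch under stated assumptions: the class name PathAttentionRN, the dimensions, and the toy inputs are illustrative, and the path embeddings $g_\theta(p)$ are assumed to be computed elsewhere (Eq. 3.5 or, later, Eq. 3.9).

```python
import torch
import torch.nn as nn

class PathAttentionRN(nn.Module):
    """Sketch of the RN-style knowledge module (Eqs. 3.1-3.6): path embeddings are
    attended over with the context embedding as query, and the pooled knowledge
    embedding is concatenated with the context embedding to score one
    question-choice pair."""

    def __init__(self, ctx_dim: int, path_dim: int):
        super().__init__()
        self.w_att = nn.Linear(path_dim, ctx_dim)            # (W_att, b_att) in Eq. 3.4
        self.classifier = nn.Linear(ctx_dim + path_dim, 1)   # (W_cls, b_cls) in Eq. 3.6

    def forward(self, c: torch.Tensor, path_embs: torch.Tensor) -> torch.Tensor:
        # c: [ctx_dim] context embedding for one (question, choice) pair
        # path_embs: [num_paths, path_dim] embeddings g_theta(p), one per path
        scores = torch.tanh(self.w_att(path_embs)) @ c        # Eq. 3.4: alpha_hat_p
        alpha = torch.softmax(scores, dim=0)                  # Eq. 3.3: attention weights
        k = (alpha.unsqueeze(-1) * path_embs).sum(dim=0)      # Eq. 3.2: knowledge embedding
        return self.classifier(torch.cat([c, k], dim=-1))     # Eq. 3.6: f(q, a)

# Usage: score each answer choice, then normalize over choices with softmax.
if __name__ == "__main__":
    module = PathAttentionRN(ctx_dim=768, path_dim=768)
    choice_scores = []
    for _ in range(5):                       # e.g., 5 answer choices
        c = torch.randn(768)                 # from the context encoder
        paths = torch.randn(12, 768)         # 12 retrieved/generated path embeddings
        choice_scores.append(module(c, paths))
    probs = torch.softmax(torch.cat(choice_scores), dim=0)
    print(probs)
```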
3.2 Knowledgeable Path Generator Extracting the structured evidence by retrieving paths (or subgraphs) from a static KG, as in prior work [108, 87, 65], faces two key challenges: sparsity and contextualization (§??). We thus propose a knowledgeable path generator (PG), which learns to connect a question-choice entity pair (e q , ea ) with a multi-hop path. 22 The generated paths are used as structured evidence in the knowledge module. Next, we detail the construction of training data (§3.2.1), the learning of our path generator over this data (§3.2.2), and the integration of the generator into the reasoning module (§3.2.3). Figure 6.2 presents an overview of our adapted knowledge module. 3.2.1 Knowledge Path Sampling We sample paths from a commonsense KG using random walks, in order to provide training data for our PG. Such paths are expected to contain useful knowledge for commonsense QA tasks. Given a KG G = (E, R), each sampled path p = {e0, r0, e1, r1, ..., rT −1, eT } is a random walk on the graph, where et ∈ E and rt ∈ R. The number of hops, T, is a hyperparameter in our method. To improve the quality of the paths, we adopt two heuristic strategies. For relevance, we define a subset of relation types that are useful for answering commonsense questions, e.g., AtLocation and IsA, and filter out the remaining ones, e.g., RelatedTo, prior to sampling (see Appendix ?? for the discarded relations). For informativeness, we require all relation types in a path to be distinct. We explore two sampling strategies in order to select the starting node of the random walks: Local Sampling. The random walks start from the entities that appear in the questions and answer choices of the training set of a benchmark. This strategy is expected to favor generation of paths that are tailored to the task. Global Sampling. We conduct random walks starting from each entity in E. This may divert our PG away from biasing on the local structure of the KG and enhance its generalizability to unseen data. To include entities that are connected only with inverse triplets in a path, we add a reverse relation r −1 for each relation r. We also sample paths with a mixed number of hops T, so our generator can learn to connect entities using paths of variable length, when needed. The full path sampling procedure is described by Algorithm ?? in the Appendix. 23 Table 3.1: Example Transformation of a Symbolic Path into Text. {predator, DistinctFrom, prey, IsA, animal} → { animal, [SEP], predator , distinct, from, prey, is, a, animal} 3.2.2 Generating Paths to Connect Entities We employ GPT-2 [122] as the backbone of our path generator. GPT-2 is a pre-trained language model that encodes rich unstructured knowledge from large text corpora. We foresee two benefits of combining a pre-trained model such as GPT-2 and a static KG: (1) the language model would be able to generate commonsense knowledge paths, by being enriched with relevant structured knowledge; (2) the unstructured knowledge encoded in the language model would help to alleviate the sparsity challenge of the static KGs. Unlike COMET [19] which fine-tunes GPT (an earlier version of GPT-2) with independent triplets, we fine-tune GPT-2 with consecutive triplets that form paths (see Section 3.2.1). 
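Before turning to how paths are verbalized and used for fine-tuning, the random-walk sampling of §3.2.1 can be summarized with a short sketch. The adjacency structure, the toy edges, and the helper name sample_path below are illustrative assumptions rather than the actual implementation; the full procedure is given by Algorithm ?? in the Appendix.

```python
import random
from collections import defaultdict

# Hypothetical adjacency structure: entity -> list of (relation, neighbor) edges,
# already restricted to the useful relation subset (e.g., AtLocation, IsA) and
# augmented with an inverse edge "_r" for every relation r.
kg = defaultdict(list)
kg["cave"].append(("IsA", "geological_feature"))
kg["geological_feature"].append(("_IsA", "cave"))
kg["fungus"].append(("AtLocation", "cave"))
kg["cave"].append(("_AtLocation", "fungus"))

def sample_path(start: str, num_hops: int):
    """Sample one random walk of `num_hops` hops starting at `start`, requiring
    all relation types on the path to be distinct (informativeness constraint)."""
    path, used_relations = [start], set()
    current = start
    for _ in range(num_hops):
        candidates = [(r, e) for r, e in kg[current] if r not in used_relations]
        if not candidates:
            return None                      # dead end: discard this walk
        rel, nxt = random.choice(candidates)
        used_relations.add(rel)
        path.extend([rel, nxt])
        current = nxt
    return path

# Global sampling starts from every entity in the KG; local sampling would instead
# start only from entities mentioned in the task's questions and answer choices.
paths = []
for entity in list(kg.keys()):
    for hops in (1, 2, 3):                   # mixed number of hops
        p = sample_path(entity, hops)
        if p is not None:
            paths.append(p)
print(paths[:3])
```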
To fine-tune GPT-2 on such paths, we first use GPT-2's Byte-Pair Encoding [133] to convert each symbolic path $p$ into its textual form as a sequence $\{x_0, y_0, x_1, y_1, ..., y_{T-1}, x_T\}$, where $x_t = \{x_t^1, x_t^2, ..., x_t^{|e_t|}\}$ are the phrase tokens of the entity $e_t$ and $y_t = \{y_t^1, y_t^2, ..., y_t^{|r_t|}\}$ are the phrase tokens of the relation $r_t$. The reverse relations are represented by adding a special prefix token "_". The resulting paths mimic natural language sentences to facilitate optimal usage of the knowledge encoded in the pre-trained language model. In order to connect the question-choice entities at inference time, we also add the last entity's phrase tokens $x_T$ together with a separator token [SEP] at the beginning of each path sequence, which produces the final transformation $s^p$. This informs the generator about the last entity it should output when generating a path. Table 3.1 provides an example path transformation.

The PG learns to maximize the probability of the observed paths given the entity pairs. We use the negative conditional log-likelihood as the loss function:

$$\mathcal{L} = -\sum_{t=|x_0|+|x_T|+1}^{|s^p|} \log P(s^p_t \mid s^p_{<t}), \qquad (3.7)$$

where the conditional probability is defined as:

$$P(s^p_t \mid s^p_{<t}) = \mathrm{softmax}(W_{\text{vocab}} \cdot h_t). \qquad (3.8)$$

Here $h_t$ denotes the final GPT-2 representation for $s^p_t$, and $W_{\text{vocab}}$ is the embedding matrix for the token-based vocabulary used by GPT-2, which generalizes well to unseen words.∗

∗This is because an unseen word of an entity or a relation may be split into several tokens that exist in the vocabulary.

During inference, the target entity ($e^a$), the [SEP] token, and the starting entity ($e^q$) are fed to our generator (the shaded part in Table 3.1), and greedy decoding is used to generate a path connecting the two entities. Other constrained decoding strategies are left as future work.

3.2.3 Adapted Commonsense QA Framework

To facilitate integration of the structured evidence from our path generator instead of a static KG, we adapt the knowledge module from §3.1 slightly. We construct the path set $P$ by generating a multi-hop path $p(e^q, e^a)$ for each pair of a question entity $e^q$ and a choice entity $e^a$ with our PG and greedy decoding. To represent each path with an embedding, we perform mean pooling of the hidden states from the last layer of GPT-2 (before the softmax layer in Eq. 3.8) as a new formulation of the function $g_\theta$:

$$g_\theta(p) = \mathrm{MEAN}(\{h_1, h_2, ..., h_{|s^p|}\}). \qquad (3.9)$$

Since GPT-2 has been pre-trained on a large corpus, we believe such a representation should be sufficient for preserving the information in the paths. Then, the knowledge embedding obtained with the function $f_\phi$ of the RN (Eqs. 3.2-3.4) is concatenated with the original static knowledge embedding as our new definition of $k$.

The whole pipeline is optimized by minimizing its cross-entropy loss. The set of learnable parameters excludes the parameters of our proposed PG, because we observed that fixing their values yields optimal performance. This points to another advantage of our PG: after being fine-tuned on the sampled random walks from a KG, the PG can be integrated within an existing QA system with no further training.

3.3 Experiments

3.3.1 Datasets

We evaluate our method on two commonsense QA benchmarks: CommonsenseQA [143] and OpenBookQA [108]. As the test set of CommonsenseQA is not publicly available, predictions for it can only be evaluated once every two weeks via the official leaderboard. Thus, we report our test score on the leaderboard, and perform more extensive comparisons on the data split used in Lin et al.
[87]. Besides questions and answers, OpenBookQA provides a collection of background facts in a textual form. We use the correspondence between these facts and their questions, prepared by Clark et al. [30], as an additional input to the context module for all methods, except RoBERTa-large (see §3.3.5). 3.3.2 KG and Path Data Preparation Entity Recognition We employ ConceptNet [138], a popular commonsense KG. As stated in §3.2.1, we disregard triplets that belong to a predefined set of relations (see Appendix). Similar to previous work [87], we use lexical matching to ground the entities mentioned in the question and the answer choices to our KG. One exception is that each answer choice in CommonsenseQA is treated as a single entity, as these tend to correspond directly to concepts in ConceptNet. 26 Table 3.2: Test accuracy with varying proportions of CommonsenseQA (using the data split in [87]). Results (as mean and standard deviation) are computed over 4 experimental runs with different random seeds (top score in boldface, second score underlined). Parts of the results for baselines are reported from our another work [47]. Methods BERT-large RoBERTa-large 20% Train 60% Train 100% Train 20% Train 60% Train 100% Train Fine-tuned LM (w/o KG) 46.25 (±0.63) 52.30 (±0.16) 55.39 (±0.40) 55.28 (±0.35) 65.56 (±0.76) 68.69 (±0.56) + RN 45.12 (±0.69) 54.23 (±0.28) 58.92 (±0.14) 61.32 (±0.68) 66.16 (±0.28) 69.59 (±3.80) + RGCN 48.67 (±0.28) 54.71 (±0.37) 57.13 (±0.36) 58.58 (±0.17) 68.33 (±0.85) 68.41 (±0.66) + GconAttn 47.95 (±0.11) 54.96 (±0.69) 56.94 (±0.77) 57.53 (±0.31) 68.09 (±0.63) 69.88 (±0.47) + Link Prediction 47.10 (±0.79) 53.96 (±0.56) 56.02 (±0.55) 60.84 (±1.36) 66.29 (±0.29) 69.33 (±0.98) + PG-Local 50.20 (±0.31) 55.68 (±0.07) 56.81 (±0.73) 61.56 (±0.72) 67.77 (±0.83) 70.43 (±0.65) + PG-Global 49.89 (±1.03) 55.47 (±0.92) 57.21 (±0.45) 62.93 (±0.82) 68.65 (±0.02) 71.55 (±0.99) + PG-Full 51.97 (±0.26) 57.53 (±0.19) 59.07 (±0.30) 63.72 (±0.77) 69.46 (±0.23) 72.68 (±0.42) Path Sampling We sample a set of paths with varying lengths, ranging from 1 to 3 hops. Global sampling generates 2,825,692 paths, while local sampling results in 133,612 paths for CommonsenseQA and 105,155 for OpenBookQA. We split them into training/dev/test sets at a 90 : 5 : 5 ratio. 3.3.3 Baselines As baselines, we consider a fine-tuned LM, static KG-augmented models, and a 1-hop link predictor on the question and the answer entities. Fine-tuned LM. To examine the role of the external knowledge, we compare to a “Fine-tuned LM” ablation of our QA framework without the knowledge module (§3.1). Static KG Models. We compare to three static KG variants of our QA framework that model the knowledge module with path/graph encoders: (1) a RN degenerate version of our system, which computes a knowledge embedding by an attention mechanism over the retrieved paths for each question-choice entity pair; (2) Relational Graph Convolutional Networks (RGCN) [132] which encode local graphs by using graph convolutional networks with relation-specific weight matrices; (3) GconAttn [160] which models the alignment between entities via attention and pools over all entity embeddings. Link Prediction Model. This baseline predicts the relation between question and answer entities instead of creating or finding knowledge paths. Namely, we employ TransE [17] to learn a representation for 27 Table 3.3: Test accuracy on OpenBookQA. Methods with AristoRoBERTa leverage the textual evidence by Clark et al. 
[30] as an additional input to the context module. Methods RoBERTa-large AristoRoBERTa Fine-tuned LMs (w/o KG) 64.80 (±2.37) 78.40 (±1.64) + RN 65.20 (±1.18) 75.35 (±1.39) + RGCN 62.45 (±1.57) 74.60 (±2.53) + GconAtten 64.75 (±1.48) 71.80 (±1.21) + Link Prediction 66.30 (±0.48) 77.25 (±1.11) + PG-Local 70.05 (±1.33) 79.80 (±1.45) + PG-Global 68.40 (±0.31) 80.05 (±0.68) + PG-Full 71.20 (±0.96) 79.15 (±0.78) every entity and relation in ConceptNet, which is then leveraged to predict a 1-hop relation for each pair of question and answer entities. The representations for each resulting triplet are used as 1-hop path embeddings. The rest of this baseline is identical to our QA framework. 3.3.4 Model Variations We experiment with three variants of our method which differ in terms of the knowledge embedding: (1) PG-Full: combination of our global PG and a static RN as detailed in §3.2.3; (2) PG-Local: a local PG which is trained on both local and global paths; (3) PG-Global: a global, data-independent PG which is trained on global paths only. We note that PG-Local and PG-Global do not include the static knowledge embedding. 3.3.5 Results Main Results For all systems, we experiment with several encoders as a context module: BERT-large [37] and RoBERTa-large [95] for CommonsenseQA, RoBERTa-large and AristoRoBERTa [30] for OpenBookQA. Tables 3.2 and 3.3 show the results for CommonsenseQA and OpenBookQA, respectively. On both datasets, we observe consistent improvements brought by our method with different context encoders. Our full model which, combines both generated and static knowledge, achieves the best performance overall, suggesting these two knowledge sources are complementary. Typically, either our local or global variant 28 Table 3.4: Test accuracy on CommonsenseQA’s official leaderboard. Note that the SOTA system, UnifiedQA is impractical (11B parameters) in an academic setting. Methods Single Ensemble RoBERTa [95] 72.1 72.5 RoBERTa+FreeLB [180] - 73.1 RoBERTa+HyKAS [99] 73.2 - XLNet+DREAM 73.3 - RoBERTa+KE - 73.3 RoBERTa+KEDGN - 74.4 XLNet+GraphReason [97] 75.3 - Albert [75] - 76.5 UnifiedQA* [67] 79.1 - Albert+PG-Full 75.6 78.2 yields second best results, demonstrating the effectiveness of the generated paths as structured evidence and their superiority over the static KG methods. The comparable performance of Link Prediction to the static KG methods indicates that even predicting 1-hop knowledge paths helps to address the KG sparsity. Furthermore, we report comparable results to the other systems on the official test sets, accessible via the leaderboards (Tables 3.4 and 3.5). Notably, the two best-performing systems, UnifiedQA [67] and TTTTT [123], are based on the T5 language model [123], which requires excessive computational resources and is impractical in an academic setting. Excluding these, our full method achieves the best performance on both datasets. Table 3.5: Test accuracy on OpenBookQA leaderboard. All listed methods leverage the provided science facts as additional textual input. Note that the top 2 systems, UnifiedQA (11B parameters) and TTTTT (3B parameters) are computationally expensive and impractical in an academic setting. 
Methods Test Careful Selection [8] 72.0 AristoRoBERTa 77.8 KF + SIR [7] 80.0 Albert + KB 81.0 TTTTT* [123] 83.2 UnifiedQA* [67] 87.2 AristoRoBERTa + PG-Full 80.2 Albert + PG-Full 81.8 29 20 40 60 80 100 Proportion of CSQA Trainset (%) 55.0 57.5 60.0 62.5 65.0 67.5 70.0 72.5 Accuracy (%) w/o KG RN GconAttn PG-Full 20 40 60 80 100 Proportion of OBQA Trainset (%) 35 40 45 50 55 60 65 70 w/o KG RN GconAttn PG-Full Figure 3.4: Test accuracy on CommonsenseQA (left) and OpenBookQA (right) with different proportions of training data. Less Labeled Data To compare the robustness of our model and the baselines to sparsity, we perform experiments with {20%, 40%, 60%, 80%, 100%} of the training data from both datasets. The results, displayed in Table 3.2 and Figure 3.4, show that our method (with RoBERTa) performs better or equal to the baselines with any amount of training data. The performance gain brought by either our Global or Full model is higher when less data is used, which shows that introducing structured evidence as inductive bias helps in a low-resource setting. Ablation Study We study the contribution of different strategies for learning our generator based on the performance of our Global and Local variants in Tables 3.2-3.3. We also include another variant by training our path generator from scratch, i.e. training a randomly-initialized model with the same architecture as GPT-2 instead of fine-tuning a pre-trained one. This Scratch variant achieves 68.75 and 65.50 accuracy on the CommonsenseQA and OpenBookQA test sets, respectively, with RoBERTa-large as the text encoder. Its performance thus resembles that of the static KG baselines while our Full method achieves 72.68 and 71.20. This demonstrates that learning paths from scratch approximates what a static KG has already, whereas the unstructured knowledge stored in a pre-trained GPT-2 helps to complement missing knowledge in a static KG. When coupled with a more powerful encoder like RoBERTa or Albert, our Global variant achieves 30 Table 3.6: Automatic and Human Evaluation of the generated Paths on the task testset. All scores are scaled to be percentage-based. Metric CommonsenseQA OpenBookQA Global Scratch Global Scratch Connection 97.33 91.16 96.03 96.01 Valid Entity 98.64 97.78 99.21 97.97 Valid Relation 100.00 100.00 100.00 100.00 Score 59.31 53.27 57.74 50.62 Novelty 75.82 58.18 78.93 53.81 H-Valid 89.20 60.13 84.93 53.73 H-Relevance 87.53 70.53 88.13 74.00 comparable or better results than our Local variant, without fitting the paths to the task, and thus holds a promise to enhance generalization on a wider range of datasets. 3.3.6 Study of Path Quality & Interpretability Automatic Evaluation We perform automatic evaluation of the validity and novelty of the generated paths from our Global and Scratch PG variants. To automatically measure validity, we analyze (1) the proportion of paths which successfully connect the head and the tail entities (Connection), (2) the proportion of entities/relations found in ConceptNet (Valid Entity / Relation). We also leverage a commonsense knowledge base completion model, Bilinear AVG [84], which produces a score for a given triplet. This model reportedly achieves 92.5% accuracy on commonsense knowledge completion and has been used in previous work [19]. We average the scores of all the triplets in a path which are missing in ConceptNet as its Score. We compute novelty as the proportion of paths which contain at least one triplet missing in ConceptNet (Novelty). 
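As a rough illustration of how these automatic metrics can be computed, the sketch below scores a small batch of generated paths against lookup sets built from the static KG. The data structures (a path as a flat list, the toy entity/triplet sets) are assumptions for illustration only, and the Bilinear AVG scoring model is omitted.

```python
# Minimal sketch of the Connection / Valid Entity / Valid Relation / Novelty metrics.
def evaluate_paths(examples, kg_entities, kg_triplets, relation_set):
    n = len(examples)
    connection = valid_entity = valid_relation = novelty = 0
    for head, tail, path in examples:          # (question entity, choice entity, path)
        entities, relations = path[0::2], path[1::2]
        triplets = list(zip(entities[:-1], relations, entities[1:]))
        connection += int(path[0] == head and path[-1] == tail)
        valid_entity += int(all(e in kg_entities for e in entities))
        valid_relation += int(all(r in relation_set for r in relations))
        novelty += int(any(t not in kg_triplets for t in triplets))
    return {"Connection": connection / n, "Valid Entity": valid_entity / n,
            "Valid Relation": valid_relation / n, "Novelty": novelty / n}

# Toy usage with the Q1 example path from Table 3.7.
kg_entities = {"magazine", "book", "bookstore"}
relation_set = {"IsA", "AtLocation"}
kg_triplets = {("magazine", "IsA", "book")}
examples = [("magazine", "bookstore",
             ["magazine", "IsA", "book", "AtLocation", "bookstore"])]
print(evaluate_paths(examples, kg_entities, kg_triplets, relation_set))
```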
The results are presented in Table 3.6. Firstly, our two generator variants are able to connect a vast majority of the entity pairs with a valid path (over 90% Connection). For this purpose, our generators only use the relations in the relation set instead of other, out-of-KG phrases (100% Valid Relation). In addition, the novel paths from the Global generator are of higher quality compared with the ones from the 31 Table 3.7: Paths from question to gold answer entities, with novel and valid triplets in boldface. Q1: Where would you find magazines along side many other printed works? A: doctor. B ∗ : bookstore. C: market. D: train station. E: mortuary. PG-Global (2-hop): {magazine, IsA, book, AtLocation, bookstore} PG-Scratch: {magazine, _IsA, magazine, AtLocation, bookstore} Q2: If you want harmony, what is something you should try to do with the world? A: take time. B: make noise. C: make war. D ∗ : make peace. E: make haste. PG-Global (2-hop): {harmony, _MotivatedByGoal, make better world, HasPrerequisite, make peace} PG-Scratch: {harmony, _UsedFor, committing perjury, Causes, make peace} Q3: Janet was watching the film because she liked what? A: rejection. B: laughter. C ∗ : being entertained. D: fear. E: bordem. PG-Global (1-hop): {film, _CausesDesire, being entertained} PG-Scratch: {film, _HasContext, being entertained} Scratch generator, given that any fact with a score over 0.5 is classified as positive by Bilinear AVG, which is later confirmed by our human evaluation as well. The Global generator also has a higher Novelty, indicating the necessity of transferring knowledge from a pre-trained GPT-2 to complement a static KG. Human Evaluation We also conduct human evaluation on two dimensions of the generated paths: (1) validity (How valid are the paths?) (2) relevance (How relevant are the paths to the question?). We randomly sample 50 paths from our Global and Scratch generator for different question-choice entity pairs in the test datasets. For each path, we provide the corresponding question and answer choices as the context. We ask three annotators to score each path from 1 (Not at all) to 5 (Very), resulting in a total of 150 scores for each dimension/generator/dataset. The averages of these scores are reported as H-Valid and H-Relevance in Table 3.6. For both dimensions, our Global generator achieves higher scores, showing the ability of fine-tuning a pre-trained GPT-2 as our generator to learn the path distribution which is of high quality and relevant to commonsense QA. Path Interpretability. In Table 3.7, we compare example paths generated by our Global and Scratch variants to connect the question entities to the gold answer entities. In Q1, our Global generator provides knowledge about the location of an entity with a 2-hop path, which helps with answering such “Where” questions. Although the path from our Scratch generator also contains the AtLocation relation, its first generated hop (_IsA) is less informative. In Q2, our Global generator is able to connect complex ideas 32 about harmony and making peace with a 2-hop path, while the path from the Scratch variant contains incorrect information: peace is caused by committing perjury. In Q3, the path from our Global generator is able to predict the relevant property of an entity and realizes that a 1-hop relation suffices in this case. Our Scratch variant, however, predicts a less precise relation (_HasContext). 
These cases show the path generalization ability of the fine-tuned pre-trained GPT-2, owed to its unstructured knowledge. We refer readers to Table ?? in Appendix for more cases. 33 Chapter 4 Contextualized Scene Imagination for Generative Commonsense Reasoning Humans describe everyday scenes in natural language based on their understanding of common concepts encountered in their environment [146]. Analogously, the task of generative commonsense reasoning (GCSR) asks machines to generate a description of everyday situations based on a set of concepts and an initial context [94, 82]. For example, given concept words {dog, frisbee, catch, throw}, a machine is expected to generate a plausible description, e.g., “A man throws a frisbee and his dog catches it in the air”. Machines with GCSR skills would communicate fluidly with humans, e.g., when summarizing a document by preserving its key details [134], composing a creative story according to a set of clues [169], and generating a conversation reply that includes specified keywords [112]. GCSR poses three unique challenges for automatic text generation methods. To depict plausible scenes when composing sentences, machines require commonsense knowledge to reason about the relations between concepts and the affordances of objects (e.g., “dog” performs the action “catch” but not the action “throw”). Moreover, machines require a compositional generalization ability [66], i.e., the ability to judge the plausibility of a new concept composition that has not been observed during training, and to identify concepts related to the scene that are not explicitly provided (e.g., “person” to perform “throw” in the above example). 34 GCSR can be directly attempted by fine-tuning pre-trained text-to-text language models (LMs) [123, 122]. While pre-trained LMs capture certain encyclopedic knowledge mentioned in text corpora (e.g., Wikipedia) [121] and can combine concepts in novel ways, they may generate grammatically fluent but implausible sentences that conflict with human common sense [88]. This is because LMs have no intrinsic mechanism to reason over high-level relations between concepts [178]. To close the knowledge gap, recent work augment LM input with knowledge graph triples (e.g., (dog, CapableOf, catch)) retrieved from ConceptNet [94, 86], or prototype sentences that cover input concepts retrieved from external text corpora [44, 154]. However, despite the input augmentation, GCSR skills are implicitly learned based on the concept-text pairs in the training data, without explicit supervision. While some recent work propose content planning in story generation in the form of plots or scripts [169, 43], only the narrative order of concepts are planned in those methods instead of their plausible roles and relations. Given the complexity of the GCSR task, machines need a direct mechanism to create a high-level relational representation of the provided concepts, which would allow them to judge the plausibility of their combination. In this paper, we propose to model an explicit scene imagination step which constructs a structured representation of a plausible scene based on input concepts and initial context. The scene imagination module formalizes the background knowledge required for the reasoning through a contextualized relational graph, called scene knowledge graph (SKG). An SKG allows us to collect and harmonize diverse commonsense knowledge across resources and modalities into a comprehensive SKG distribution (see Figure ?? for an illustration). 
We develop an imagine-and-verbalize framework: an imagination module learns to construct a contextualized SKG from input concepts and context by pretraining over a large amount of external SKGs; a verbalization module learns to faithfully realize the imagined SKG into natural language by training over downstream datasets. By learning from a large number of diverse SKGs, our method is able to capture plausible relations between concepts. By integrating these SKGs with LMs, the imagination module is able to compose objects in novel ways and to identify implicit concepts for a scene. Imagine-and-verbalize decomposes the challenging scene description task into two realistic tasks for which a wealth of training data can be collected, simultaneously enabling effective and explainable GCSR. We experiment with two GCSR tasks and three scene graph resources, observing consistently better or competitive performance relative to SotA baselines. We find that (1) SKGs extracted from visual captions and story datasets are more helpful than other resources; (2) our model can learn faster (with less training data) with the help of scene imagination; and (3) the imagination module with a larger backbone LM demonstrates larger capacity in encoding commonsense knowledge. Our human evaluation study on the generated imagination indicates that these SKGs capture common sense and that the verbalization module generates the text by following the guidance of the imagination.

4.1 Preliminaries

Formally, in GCSR, we consider a list of concept sets $\{x^1, x^2, ..., x^K\}$ and a textual context $c \in \mathcal{C}$ as input. Each concept set $x^i$ is unordered and consists of multiple concept words $\{x_j\}$. A concept word $x_j \in \mathcal{X}$ (or concept for brevity) is a commonly seen object (nouns such as "dog" or "frisbee") or a commonly performed action (verbs such as "throw" or "catch"). The goal of GCSR is to generate $K$ sentences $\{y^1, y^2, ..., y^K\}$, each describing a plausible situation following human common sense for a concept set $x^i$. The $i$-th sentence $y^i \subset \mathcal{Y}$ should be generated using all concepts in $x^i$. We consider two variants of GCSR: 1) concepts-to-sentence generation [88], where no context is given (i.e., $c$ is empty) and only one concept set is provided ($K = 1$); and 2) concepts-to-story generation, where $c$ is the leading sentence of a multi-sentence story and more than one concept set is provided, each corresponding to one sentence to be generated ($K > 1$). Both tasks are evaluated by comparing the machine-generated text with human-generated (gold) references.

4.2 Method

4.2.1 The Imagine-and-Verbalize Approach

Pre-trained LMs struggle with learning a generalizable mapping from concepts to plausible sentences solely based on the training data. Augmenting concepts with external knowledge to form the input $\mathcal{X}'$ and fine-tuning a pretrained LM to model $P(\mathcal{Y} \mid \mathcal{C}, \mathcal{X}')$ [94, 44, 82] alleviates this issue partially, while still learning a direct mapping $\{\mathcal{C}, \mathcal{X}'\} \rightarrow \mathcal{Y}$. In this work (Figure ??), we decompose the GCSR task into two sub-tasks: contextualized scene generation (imagination) and scene-aware text generation (verbalization):

$$P(\mathcal{Y} \mid \mathcal{C}, \mathcal{X}) = \sum_{\mathcal{Z}} P(\mathcal{Y} \mid \mathcal{C}, \mathcal{X}, \mathcal{Z})\, P(\mathcal{Z} \mid \mathcal{C}, \mathcal{X}), \qquad (4.1)$$

where $\mathcal{Z}$ denotes the scene representation for the given concepts and context. The contextualized scene imagination module $P(\mathcal{Z} \mid \mathcal{C}, \mathcal{X})$ aims to construct a multi-relational graph representation $\mathcal{Z}$ (scene knowledge graph, or SKG) that describes a plausible scene that involves all input concepts and corresponds to the provided context.
To learn this module, we collect a diverse set of SKG instances from different resources and modalities to form a comprehensive distribution of scenes (§4.2.2). The imagination module is pre-trained over the collected scene instances and learns to generate SKGs depicting plausible day-to-day situation. The imagination module is based on a neural architecture, which enables it to generate concept compositions that might not have been observed during training (§4.2.3).∗ We leverage the contextualized SKG for text generation with a verbalization module P(Y|C, X , Z) which takes the context, concepts, and the generated SKG as input, and composes a grammatical and plausible scene description in natural language (§4.2.4). To perform GCSR, where one or multiple concept sets are given, we apply the imagination module to sample z i . Since the marginalization over Z is generally intractable due to the complex structure of the ∗The imagination module can be further fine-tuned over the downstream datasets. 37 Table 4.1: Statistics of the SKG instances collected from different resources. Knowledge source # SKGs # Concepts Caption-AMR 584,252 22,961 Story-AMR 927,163 41,272 VG-SceneGraph 292,596 41,629 All 1,792,941 84,835 SKGs, we only sample the most probable scene representation z ∗i that maximizes P(z i |c ′ , x i ), where c ′ includes the given context c and the previously generated y j ,(j < i) . We then apply the verbalization module to generate one sentence at a time by sampling from P(y i |c ′ , x i , z i∗ ). Multiple sentences are generated by iteratively applying the imagination and verbalization modules. 4.2.2 Imagination via Generating SKG Imagination through SKGs We adopt the term “scene graph” from the computer vision community, and we generalize it to a novel relational schema that represents knowledge from multiple modalities. Our SKG is defined as a relational graph G = (E, R) that organizes a set of concepts in a coherent scene that follows common sense. The node set E of the graph includes both given and implicit concepts, while each relation (edge type) r ∈ R denotes how two concepts should be related. We follow the Abstract Meaning Representation (AMR) [6] schema to consider the core relations between two concepts, which corresponds to the commonsense knowledge required by GCSR. Table ?? in the appendix illustrates a few representative relations and their examples. Collecting Diverse SKGs We consider two complementary modalities, text and vision, as some concepts and relationships are more likely to occur in one modality versus another. (1) Textual Modality: According to pragmatic principles of human language, people generally leave out expected details about common scenes [53]. For this reason, we extract SKGs from visual captions and narrative stories, in which human annotators are asked to explicitly describe scenes that may happen using descriptive language as shown in Figure ??(a,b). To extract an SKG out of these textual signals, we adopt the AMR parsing tool to 38 Gold SKGs from external resources for continual pretraining (Optional) Silver SKGs from task dataset for fine-tuning Transformer “Context: People are playing on the grass. Concepts: dog <SEP> frisbee <SEP> catch <SEP> throw” “throw <:ARG0> woman <SEP> throw <:ARG1> frisbee …” Graph linearization Randomly ordered concepts Figure 4.1: Continual pretraining and fine-tuning of the imagination module to output a linearized SKG based on a sequential input (context and concepts). 
transform each sentence into an AMR graph. This process yields a single SKG per sentence. For the story SKGs, we also keep the sentences (up to 256 tokens) that precede the sentence that corresponds to the SKG, as context c. (2) Visual Modality: Image captions focus on salient information and may not capture all useful visual signals. Thus, we also capture the scene structures directly from images, by using VisualGenome [71], a large-scale scene graph dataset annotated by humans. To adopt a unified SKG schema, we manually map the relations in scene graphs from VisualGenome to the ones used in textual SKGs. A full set of mapping rules can be found in the Appendix (??). The statistics of the SKGs collected from each resource/modality are summarized in Table 4.1. We note that visual scene graphs may be biased towards knowledge about spatial relationships and object affordance, which further motivates our decision to extract SKGs from multiple modalities. 4.2.3 Learning the Scene Imagination Module We describe how we pre-train the scene imagination model using multimodal SKG examples collected from diverse sources, and how we fine-tune the imagination module to downstream datasets. A straightforward way to construct a SKG is to retrieve ones that contains all the given concepts from the collected SKGs. However, performance of such method is limited by the coverage of the SKG collection and will fail when encountering novel concept composition. We propose to model P(Z|C, X ) 39 with a neural graph generator. Inspired by previous work on (conditional) graph generation [171], we formulate SKG construction as an auto-regressive sequence generation task, where a linearized SKG is generated sequentially conditioned on the context, input concepts, and the graph sequence generated so far. Sequence generation formulation is advantageous, as it can be natively tackled by pre-trained auto-regressive LMs (e.g., GPT-2 [<empty citation>]). Thus, we adopt these LMs as the backbone of our imagination module [19, 157]. Linearized SKG Generation To form training instances for the imagination module, we treat the nodes in an SKG instance as input concepts and the linearized SKG as the target output (Figure 4.1). The input concepts are concatenated into a sequence x = [x1, x2, ..., xn], preceded by the context c ′ ∈ C. When c ′ is not given, we prepend the word “none" to the concept sequence. To linearize an AMR-based SKG into a sequence z = [z1, z2, ..., zm], we adopt the PENMAN serialization format [52] which converts AMR into a spanning tree over the graph. This format is shown to be more suitable than other linearization strategies like depth-first-search (DFS) in enabling LMs to learn the graph structure [102]. We conduct DFS and follow PENMAN format to prioritize nodes associated with core relations (e.g., ARG0). During training, we randomize the order of the concepts at every training step such that the graph generator learns to be invariant to concept order [176]. For each training instance, we randomly discard a small subset of the SKG nodes (concepts) in each training epoch. This simulates the scenario in which a subset of the concepts that constitute a scene will be given, thus teaching the model to infer implicit concepts for completing a plausible scene. 40 Continual-Pretraining and Fine-tuning With both the input concepts (plus context) and the output graph linearized as sequences based on the collected SKG instances, we continually pretrain an autoregressive LM to generate z = Transformer(c ′ , x). 
The training objective is to maximize $P(\mathcal{Z} \mid \mathcal{C}, \mathcal{X})$ by minimizing the negative log-likelihood:

$$\mathcal{L}_{\text{imagine}} = -\sum_{t=1}^{m} \log P(z_t \mid z_{<t}, c', x). \qquad (4.2)$$

Our pre-trained imagination module generates an SKG on the fly, and it can be further fine-tuned on downstream datasets when their distributions of context and concepts differ from the pretraining data (see Figure 4.1 for an illustration). Since downstream datasets cannot be expected to have ground-truth SKGs paired with each training example, we apply the AMR parsing tool described in §4.2.2 on the training sentences to obtain silver-standard SKGs. We then follow the same training procedure to continually pretrain the module into a customized imagination module for a specific downstream dataset.

4.2.4 Scene-aware Verbalization

Iterative Imagine-and-Verbalize At model inference time, we apply the trained imagination module iteratively to generate the most plausible SKG for each given concept set $x^i$, i.e., $z^{i*} = \arg\max_{z^i} P(z^i \mid c', x^i)$, where the context $c'$ includes both the given context $c$ and the previously generated sentences $\{y^j\}\ (j < i)$. The generated SKG is used by the scene-aware verbalization module to model $P(\mathcal{Y} \mid \mathcal{C}, \mathcal{X}, \mathcal{Z})$. The verbalization module generates the $i$-th sentence by sampling from $P(y^i \mid c', x^i, z^{i*})$. Multiple sentences are generated iteratively by alternating between the scene imagination (to construct the SKG) and verbalization (to produce the next sentence). See Figure 4.2 for an illustration of this iterative inference process.

Figure 4.2: Our I&V method iteratively applies the imagination and the verbalization modules, generating one sentence in each iteration.

Model Training Since both the linearized SKG (generated by the imagination module) and the target sentences are sequences by nature, we design $P(\mathcal{Y} \mid \mathcal{C}, \mathcal{X}, \mathcal{Z})$ as a sequence-to-sequence generative model and learn this verbalization module by fine-tuning another pre-trained auto-regressive LM, i.e., $y^i = \text{Transformer}(c', x^i, z^i)$. To form the input for generating the sentence $y^i$, we concatenate the context $c'$, the concept set sequence $x^i$, and $z^i$ into one sequence† as illustrated in Figure 4.2. We then train the model to maximize $P(\mathcal{Y} \mid \mathcal{C}, \mathcal{X}, \mathcal{Z})$ by minimizing the negative log-likelihood:

$$\mathcal{L}_{\text{verbalize}} = -\sum_{t=1}^{l} \log P(y^i_t \mid y^i_{<t}, c', x^i, z^i). \qquad (4.3)$$

For each training instance $(y^i, c', x^i)$, we construct two types of SKG instances as the input $z^i$: (1) we perform AMR parsing on $y^i$ to obtain a silver-standard SKG; (2) we apply the trained imagination module to generate an SKG $z^{i*} = \arg\max_{z^i} P(z^i \mid c', x^i)$, where $c'$ includes the given context $c$ and the ground-truth prefix sentences $\{y^j\}\ (j < i)$. We find it beneficial to train the verbalization module over these two types of SKGs, as evidenced by our ablation study (§??). During inference, the SKG $z^i$ is generated by the imagination module, while $c'$ includes the given context $c$ and the previous sentences $\{y^j\}\ (j < i)$ generated by the verbalization module.

†Our ablation study in Appendix ?? shows that including all these elements as input is helpful.
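The iterative inference procedure of §4.2.4 can be summarized with a short sketch. The two checkpoints below are placeholders standing in for the trained imagination and verbalization modules (in practice, fine-tuned seq2seq LMs), and the prompt templates are simplified versions of the formats shown in Figures 4.1 and 4.2; none of this is the exact implementation.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoints: in practice these would be the fine-tuned imagination
# and verbalization modules described in Sec. 4.2.
imagine_tok = AutoTokenizer.from_pretrained("t5-base")
imagine_lm = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
verbal_tok = AutoTokenizer.from_pretrained("t5-base")
verbal_lm = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def generate(model, tokenizer, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128)   # greedy decoding
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def imagine_and_verbalize(context: str, concept_sets: list) -> list:
    """Iteratively imagine an SKG for each concept set, then verbalize it,
    feeding previously generated sentences back in as context."""
    sentences = []
    for concepts in concept_sets:
        running_context = " ".join([context] + sentences).strip() or "none"
        concept_str = " <SEP> ".join(concepts)
        # Imagination step: most probable linearized SKG given context + concepts.
        skg = generate(imagine_lm, imagine_tok,
                       f"Context: {running_context} Concepts: {concept_str}")
        # Verbalization step: next sentence given context, concepts, and SKG.
        sentence = generate(verbal_lm, verbal_tok,
                            f"Context: {running_context} Concepts: {concept_str} "
                            f"Relations: {skg}")
        sentences.append(sentence)
    return sentences

story = imagine_and_verbalize("People are playing on the grass.",
                              [["dog", "frisbee", "catch", "throw"]])
print(story)
```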
42 4.3 Experimental Setup Tasks & Datasets We consider two GCSR tasks: Concept2Sentence and Concept2Story. (1) Concept2Sentence is a task of generating a single sentence for a given set of concepts and no context. We evaluate concept2sentence on the CommonGen [88] benchmark. Since the labels of the official test set are not publicly available, we submit our method to the leaderboard to obtain its performance. Notably, the concept sets in CommonGen’s test set are novel and do not appear in the training set. We also create an in-house split of CommonGen to facilitate comparison between different variants of our method and the baselines. (2) Concept2Story is a generalization of the concept2sentence task, where the goal is to generate a coherent story with K = 4 sentences given a set of concepts and an initial verbal context. We construct two benchmarks based on the Visual Story Telling (VIST) [61] and ROCStories [111] datasets. Following CommonGen, we conduct part-of-speech tagging over the sentences and further lemmatize the recognized verbs and nouns to obtain the concept sets. Baselines (1) Concept2Sentence: We consider several recent submissions to the leaderboard of CommonGen that leverage auxiliary information for GCSR. KFCNet [82], Re-T5 [154], and EKI-BART [44] are prototype-based models, which retrieve sentences containing as many input concepts as possible from external captions and NLI datasets, and then use these sentences as auxiliary inputs. VisCTG [46] is an image-augmented model which retrieves images from Google by using concepts as a query, followed by an image captioning model that generates captions as auxiliary inputs. KG-BART [94] is a knowledge graph-augmented model which retrieves relations between concepts from ConceptNet as auxiliary inputs. SAPPHIRE [45] is a keyword-based model which extracts keywords from sentences as auxiliary inputs only during training. We also compare to Node2Text, which fine-tunes a pre-trained auto-regressive LM to take the concatenation of concepts as input and output the target sentences. (2) Concept2Story: We augment Node2Text with the iterative generation pipeline as in our method, which generates the next sentence 43 Table 4.2: Performance comparison with the top-ranked, published models on the official CommonGen test set. ∗Note that KFCNet uses a much larger corpora (over 70M) to retrieve prototypes and on average less than one concept in the concept sets is not covered [82], while we filter out any SKGs that contain concept sets that overlap with CommonGen dataset. Model BLEU-4 CIDEr SPICE KFCNet [82] ∗ 43.62 18.85 33.91 RE-T5 [154] 40.86 17.66 31.08 VisCTG [46] 36.94 17.20 29.97 SAPPHIRE [45] 37.12 16.90 29.75 KG-BART [94] 33.87 16.93 29.63 EKI-BART [44] 35.95 17.00 29.58 T5-base (our implementation) 33.81 15.79 28.34 T5-large (our implementation) 32.85 15.76 28.38 T5-large (reported) 31.96 15.13 28.86 I&V (T5-base) 40.16 17.44 30.57 I&V (T5-large) 40.57 17.71 31.29 given the provided context, previously generated sentences and the current concept set. In addition, we experiment with two representative methods from the controlled text generation literature. Plan-andwrite [169] first generates storyline keywords, then uses the keywords to generate a story. We use the concept set and context to generate storyline keywords. Action-Plan [43] uses predicate-argument pairs as storyline. We adapt the KFCNet model to retrieve prototype sentences. 
All Concept2Story baselines are used in an iterative generation pipeline, to enable fair comparison to our method. Evaluation Metric We evaluate systems against the K reference sentences provided by a dataset, by measuring the similarities between the machine-generated text and the gold references. Following CommonGen [88], we adopt widely-used automatic metrics for evaluating text generation, focused on (1) ngram overlap: BLEU [118], ROUGE [89], and METEOR [9], and (2) concept association: CIDEr [150] and SPICE [4]. [88] reports that SPICE yields the best correlation with human judgments and thus we used it as the main evaluation metric. 44 4.4 Results and Analysis We design experiments to answer the following questions: (1) Does contextualized scene imagination improve the performance of GCSR models? (2) Does imagination allow GCSR models to learn with less data? (3) How does each source of scene knowledge for pretraining affect the GCSR performance? (4) Do generated SKGs make common sense and correspond to the generated text? 4.4.1 Main Results We compare our proposed approach with state-of-the-art text generation methods on two GCSR tasks to understand whether scene imagination helps GCSR. Table 4.2 shows the performance of different models on CommonGen. We have the following observations. First, I&V drastically improves the vanilla T5-large model (Node2Text), demonstrating the effectiveness of the imagination module in GCSR. We also provide concrete examples in §?? which showcase how imagination fixes errors made by Node2Text. All these errors can be attributed to the fact that Node2Text does not properly capture the commonsense relations between concepts while our imagination module learns how concepts are related from indirect supervision. Second, our model outperforms other models using different auxiliary inputs, including prototypes (ReT5 and EKI-BART), knowledge facts (KG-BART) and images (VisCTG), showing the benefit of SKGs over these knowledge sources. Although our model under-performs KFCNet, our analysis in their work reveals that 97.4% of the test cases have perfectly matched prototypes, i.e., sentences containing all the queried concepts. It is thus unclear whether KFCNet is conducting commonsense reasoning or merely rephrasing the prototypes. Note that we filter out any collected SKGs that cover the concept sets from the downstream datasets. This ensures that the imagination module is examined with its compositional generalization. Table 4.3 shows the experimental results by I&V on the two Concept2Story datasets using T5-base and BART-large as the backend respectively. Among most evaluation metrics, our method outperforms Node2Text and baselines with other intermediate representations incorporated in the same backends. This 45 Table 4.3: Performance of the compared methods on the Concept2Story tasks. Best results are bold-faced. We mark them with an asterisk if they exceed the second best with statistical significance (p-value < 0.05). 
Concept2Story-VIST Concept2Story-ROC T5-base BART-large T5-base BART-large Model BLEU-4 CIDEr SPICE BLEU-4 CIDEr SPICE BLEU-4 CIDEr SPICE BLEU-4 CIDEr SPICE Node2Text 20.64 25.41 58.55 18.52 22.91 55.48 23.31 29.32 57.66 20.60 26.09 53.80 Keyword 16.75 21.87 56.23 15.62 20.86 55.49 22.24 27.05 50.41 22.14 27.40 49.52 Action-Plan 17.84 22.77 57.11 16.20 21.10 54.77 21.15 27.32 56.14 20.45 26.29 54.32 Prototype 20.28 25.05 58.17 22.81 26.93 58.84 23.59 29.48 57.68 26.76 31.60 58.35 I&V 21.05∗ 25.78∗ 59.21∗ 22.45 26.80 59.11∗ 26.77∗ 32.33∗ 60.63∗ 28.30∗ 33.40∗ 60.39∗ demonstrates that our imagination module can provide contextualized scene imagination that are more helpful in guiding long narrative generation. 4.4.2 Performance Analysis How does the knowledge source affect GCSR? We perform an ablation study in order to understand how effectively each source of SKGs contributes to the imagination. Specifically, we use each of the following SKG sources to pre-train an imagination module using T5-large as the backend: the silverstandard SKGs extracted from the training set from the downstream task (Task-AMR), and the external SKGs: Caption-AMR, Story-AMR, and VG-SceneGraph (§4.2.2). For CommonGen, we do not further finetune the imagination module in order to distinguish the contributions from each knowledge source more clearly. For Concept2Story (ROCstories), we conduct further fine-tuning using the task-AMR. Since this task provides the context as input, we find it helpful to adapt the imagination module with the task dataset. The results are shown in Table 4.4 and we have the following observations. For CommonGen, the contribution comes mostly from the SKGs based on Caption-AMR while being less from VG-SceneGraph. This may due to the fact that VG-SceneGraph is biased towards spatial relations and attributes of objects. For Concept2Story, we find both Story-AMR and Caption-AMR to be helpful for continual pretraining. The former teaches the model to generate contextualized imagination which is necessary for story generation in particular while the latter teaches the model about general commonsense knowledge. For both datasets, 46 Table 4.4: Performance of our method using different SKG sources to train the imagination module, with T5-large as the backbone LM. CommonGen (in-house) Concept2Story-ROC Knowledge Source BLEU-4 CIDEr SPICE BLEU-4 CIDEr SPICE Task-AMR 28.87 15.74 31.22 23.14 29.25 57.91 Caption-AMR 32.21 16.14 32.16 23.77 29.76 58.46 Story-AMR 23.73 13.51 27.53 24.17 30.10 58.59 VG-SceneGraph 21.00 13.36 29.07 22.84 25.33 53.96 All-SKG 33.27 16.95 33.49 26.77 32.33 60.63 Table 4.5: SPICE performance of our method using different sizes of T5 as backbone for the imagination module. Dataset / Backbone LM T5-base T5-large CommonGen (in-house) 32.00 33.49 Concept2Story-ROC 59.56 60.63 the imagination modules that are pre-trained over all the SKG instances yield significantly better results than the ones trained on the task-AMR datasets. This validates our intuition that different sources of SKGs contain complementary commonsense knowledge, and they should be used together for machine imagination. How does the backbone LM size affect the module’s performance? We also ablate the LM architecture of the imagination module and the verbalization module respectively to see how our method work with different pre-trained LMs. For the imagination module, we use T5-base and T5-large. This is to investigate how the capacity of LMs affects the learning of scene knowledge. 
The results are shown in Table 4.5. Compared to T5-large, we observe a slight performance drop for T5-base, which indicates that larger LMs are able to encode our rich set of SKG instances in a more expressive manner. For the verbalization module, we use BART-base/large and T5-base/large. The results are shown in Figure 4.3: compared to the baseline, our method consistently yields better performance regardless of which LM architecture is used.

Figure 4.3: Ablation study on backbone LM sizes of our verbalization module and Node2Text using the Concept2Story-ROC dataset.

Does imagination allow models to learn (faster) with less data? Next, we study how the indirect supervision provided to the imagination module helps the system learn effectively with limited task-specific training data. Accordingly, we conduct a low-resource experiment where we randomly sample {50, 500, 5000} training and development examples from each dataset. For each data size, we use 5 random seeds to obtain 5 different training and development splits. On each split, we train and test with 3 random initialization seeds, and we report the average over the resulting 15 runs. In this study, the imagination module is kept frozen after continual pretraining and is not fine-tuned over the sampled task datasets. Figure 4.4 shows that our model consistently outperforms the baselines, and the performance gain is larger when less training data is used. This indicates that rich sources of SKGs provide practical forms of indirect supervision to complement limited task-specific training data. The robustness of our model in low-resource settings also justifies the need for including contextualized SKGs as an intermediate representation, which further enables the verbalization module to generate plausible sentences even with little training data.

Figure 4.4: Results (SPICE) of the low-resource experiment on the three benchmark datasets with different numbers of training examples.

Table 4.6: Human evaluation on the generated SKGs regarding Completeness (COM), CommonSense (CS), Alignment (AL) and Similarity (SIM).

Dataset      COM     CS      AL      SIM
CommonGen    97.30   90.15   89.90   88.30
VIST         93.80   89.70   91.40   76.20
ROC          95.70   86.60   87.80   75.68

Is context helpful for imagination? To validate that the textual context, including the provided context as well as the previously generated sentences, is helpful for imagination in the Concept2Story task, we conduct an ablation study where we learn an uncontextualized imagination module which only takes concepts as input. The final results on the VIST and ROC datasets are 47.32 and 45.18 (SPICE), respectively, which are much lower than the results from contextualized I&V (59.21 and 60.63). This demonstrates that the context is critical in generating SKGs that are more relevant to the story line and thus lead to better text generation.

4.4.3 Human Evaluation on Generated SKGs

We conduct human evaluation on the SKGs generated by our imagination module to examine their quality.
4.4.3 Human Evaluation on Generated SKGs

We conduct human evaluation on the SKGs generated by our imagination module to examine their quality. Annotators are presented with the input concepts, the generated SKGs, the predicted sentences resulting from the corresponding SKGs, and the ground-truth sentences for reference. For each dataset, 100 instances are randomly chosen for evaluation. Annotators are students majoring in computer science, and not all of them were familiar with SKGs or the AMR language prior to the human evaluation. To facilitate annotators' understanding of the evaluation task and AMR, we provide detailed instructions and examples of AMR relations. The annotators are asked to judge: 1) Completeness, whether the SKG includes all the concepts (both given and implicit) needed to constitute a coherent scene; 2) CommonSense, whether the SKG organizes the concepts in a way that follows common sense; 3) Alignment, whether the generated sentence aligns with the SKG; and 4) Similarity, whether the predicted sentence is semantically similar to any reference sentence. Annotation is based on a 3-point scale: a) 0 – "I do not agree", b) 0.5 – "I partially agree" and c) 1.0 – "I fully agree". Table 4.6 shows the evaluation results, where we obtain a fair level of agreement as measured by Fleiss' kappa (κ = 0.21). We observe that the generated SKGs are very complete and follow human common sense to a high degree across the three datasets, which demonstrates the effectiveness of training the imagination module to learn useful commonsense knowledge with vast indirect supervision from different resources. Moreover, the SKGs are well aligned with the generated text, which indicates that the verbalization module consistently follows the guidance of the imagination module when generating sentences. The moderate similarity scores validate that the generated text is generally similar to the natural language sentences annotated by humans.
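For concreteness, the following sketch shows one way the 3-point ratings above could be aggregated into per-criterion scores and an inter-annotator agreement value with Fleiss' kappa; the `ratings` layout is an illustrative assumption, not the thesis's actual annotation format.

```python
# Rough sketch: aggregate 3-point human ratings and measure agreement with
# Fleiss' kappa. One row per evaluated SKG, one column per annotator,
# values in {0, 0.5, 1.0} (layout assumed for illustration).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [1.0, 1.0, 0.5],   # instance 1, three annotators
    [0.5, 0.5, 1.0],   # instance 2
    [1.0, 0.5, 0.0],   # instance 3
])

mean_score = ratings.mean() * 100        # reported as a percentage, e.g. Completeness
table, _ = aggregate_raters(ratings)     # per-instance counts for each rating category
kappa = fleiss_kappa(table)              # inter-annotator agreement
print(f"score={mean_score:.2f}, kappa={kappa:.2f}")
```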
Chapter 5

Faithful Language Reasoning Using Prompt-Generated Rationales

Many language-based reasoning tasks require retrieving and reasoning over knowledge beyond the task input, e.g., commonsense reasoning and closed-book QA (Fig. 5.1, left) [143, 107]. Neural language models (LMs) have achieved impressive results on such tasks by utilizing latent knowledge encoded in their pretrained parameters [124, 21]. Still, given LMs' black-box nature, it is unclear whether this knowledge is being used properly [39, 92]. Previous studies have shown that LMs often learn spurious correlations from artifacts in downstream training data, thus limiting their generalizability [20, 49, 33]. With this in mind, a number of prior works aim to make LMs' reasoning processes more explicit by generating free-text rationales, which use LMs' internal knowledge to describe a reasoning process in natural language [113, 162, 104, 172]. In the fine-tuned self-rationalizing paradigm, a single LM is fine-tuned to jointly generate the task output and rationale [113, 104, 172]. In the prompted self-rationalizing paradigm, a single LM is instead frozen and prompted to jointly generate the task output and rationale, with the prompt consisting of a few input-output-rationale demonstrations [162]. In the pipeline-rationalizing paradigm, a fine-tuned rationalizing LM first generates the rationale, which is then used as input for a separate fine-tuned reasoning LM to generate the output [72, 126]. However, when considering generalization performance, reliability, and deployment costs, these existing paradigms all have key limitations. Fine-tuned self-rationalizing LMs often perform worse than non-rationalizing LMs, since their parameters are learned using two relatively dissimilar objectives, while also requiring expensive rationale annotations [166, 113]. Prompted self-rationalizing LMs yield strong task performance and only need a few rationale demonstrations for the prompt, but are computationally prohibitive since they generally require very large-scale (i.e., over 100B parameters) LMs to work effectively [161, 162]. Besides requiring expensive rationale annotations, pipeline-rationalizing LMs' generated rationale forms a non-differentiable bottleneck between the two modules, which complicates end-to-end training and can hurt task performance [166, 58]. Moreover, none of these paradigms has a mechanism for regularizing the rationale generation to faithfully reflect the reasoning process of the LM without hurting task performance.

In this chapter, we propose Prompted RatIonalizing with CouNTerfactual ReasOning (PINTO), an LM pipeline that rationalizes via prompt-based learning, then reasons over the task input and rationale via counterfactual regularization. PINTO's rationalizing module is a medium-scale (i.e., 20B parameters) LM that contains vast latent knowledge obtained via pretraining [14]. Though prohibitive to fine-tune, it is affordable for prompt-based learning. Given the task input and a minimal input-output-rationale demonstration prompt, the rationalizing module uses its internal knowledge to map out a suitable reasoning process for the task input by generating a free-text rationale. The rationalizing module is frozen during fine-tuning, which drastically reduces training costs and prevents it from exploiting spurious shortcuts in the downstream training data. PINTO's reasoning module is a small-scale (i.e., under 1B parameters) LM to which knowledge is transferred from the rationalizing module. The reasoning module is fine-tuned to solve the downstream reasoning task by using the generated rationale as context for the task input. Crucially, to help ensure that the reasoning module's behavior is dictated by the rationale (instead of by spurious shortcuts), the reasoning module is regularized to output less confident predictions when the rationale is noisily perturbed. To simulate shortcut reasoning, we consider two rationale perturbation strategies: token masking (i.e., the rationale is ignored) and token replacement (i.e., the rationale is misused).

Figure 5.1: Rationale-Based Language Reasoning. (a) Examples of reasoning tasks that require implicit knowledge beyond task inputs. (b) Comparison of existing paradigms for providing free-text rationales along with predictions.
Across four question answering datasets (CSQA, StrategyQA, OpenBookQA, QASC), we show that PINTO significantly improves the reasoning LM's generalization, yielding higher performance on both in-distribution (ID) and out-of-distribution (OOD) test sets. Also, we find that rationales are utilized more faithfully by PINTO than by other methods, leading to better performance in low-resource settings. Furthermore, we show that PINTO's counterfactual regularization allows us to further improve task performance with refined rationales.

5.1 Preliminaries

In this work, we study LMs' ability to reason about language using implicit knowledge. We consider a specific type of multi-choice question answering (QA) task where the required knowledge for answering the question is not explicitly provided in the input and needs to be inferred from the LM's parameters [144, 68]: given a question q and a set of answer choices A = {a_i}, the model's goal is to predict a plausibility score ρ(q, a_i) for each (q, a_i) pair, so that the predicted answer $\hat{a} = \arg\max_{a_i \in A} \rho(q, a_i)$ matches the correct answer choice a∗ ∈ A.

Figure 5.2: Overview of PINTO. (1) A frozen medium-scale LM is prompted to generate choice-specific rationales. (2) A small-scale LM is fine-tuned to reason over the generated rationales. (3) We introduce counterfactual regularization in addition to the standard training loss to ensure the rationales are leveraged properly. During inference, the rationalizing LM is prompted with a new question to generate rationales, which are provided to the reasoning module to make a prediction.

Motivated by LMs' common tendency to exploit reasoning shortcuts when solving tasks [20], we focus on methods that explicitly generate free-text rationales to explain their predictions. Whereas extractive rationales are limited to input token scoring [36, 141, 24], free-text rationales use natural language to describe a reasoning process (e.g., with knowledge beyond the task input) [113, 162]. Below, we discuss several paradigms (see also Fig. 5.1) for rationale-based language reasoning.

Fine-Tuned Self-Rationalization In this paradigm, an LM is fine-tuned to autoregressively generate the task output and rationale as a single sequence [113, 93]. If the rationale is generated after the task output, then the rationale is conditioned on the task output, and vice versa. Since the LM parameters are shared across two relatively dissimilar objectives, such LMs often perform worse than non-rationalizing LMs [166, 113]. Notably, this paradigm requires expensive rationale annotations for all training instances.

Prompted Self-Rationalization In this paradigm, a pretrained LM is frozen and prompted to autoregressively generate the task output and rationale as a single sequence, with the prompt consisting of a few input-output-rationale demonstrations [74, 162]. If the rationale is generated after the task output, then the rationale is conditioned on the task output, and vice versa.
This paradigm performs well and only needs a few rationale annotations for the prompt, but it is computationally prohibitive since it generally requires very large-scale (i.e., over 100B parameters) LMs to work effectively [74, 162].

Pipeline Rationalization In this paradigm, a fine-tuned rationalizing LM first generates the rationale, which is then used as input for a separate fine-tuned reasoning LM to predict the task output [72, 126]. Here, the generated rationale forms a discrete (i.e., non-differentiable) bottleneck between the two modules, which complicates end-to-end training and can hurt task performance [166, 58]. Additionally, the dedicated rationalizing LM requires extra rationale annotation/computation costs.

5.2 PINTO: Faithful Language Reasoning

PINTO is a two-stage, rationalize-then-reason pipeline, designed to address the limitations of existing paradigms for rationale-based language reasoning (§5.1). Like the pipeline rationalization paradigm, PINTO has separate modules for rationalizing and reasoning (Fig. 5.2). However, PINTO's rationalizing module is prompted instead of fine-tuned. Thus, PINTO does not suffer from the non-differentiable bottleneck issue and has lower rationale annotation/computation costs. Following prior works, PINTO is based on choice-specific rationales [72, 58]. First, given q and A, the rationalizing module generates a set of choice-specific rationales R = {r_i}, where each r_i explains a reasoning process that supports answer choice a_i ∈ A (§5.2.1), as opposed to generating one rationale per question. We opt for this design choice because rationales are often answer-leaking [140], i.e., the rationale itself is already sufficiently predictive of one of the answer choices. If the rationalizing module only generates one rationale per question, then it is forced to make an "early decision" on the predicted answer, such that the reasoning module would only be left to recover the answer from the rationale [72]. While prior works require expensive rationale annotations to train/prompt the rationalizing module [72, 58], PINTO's rationalizing module is a frozen pretrained LM that uses only a few question-answer-rationale demonstrations as a prompt (§5.2.1). Second, given q, a_i ∈ A, and r_i ∈ R, the reasoning module outputs a plausibility score ρ(q, a_i, r_i) (§5.2.2). We also design a regularization objective that encourages the reasoning module to properly use the rationales to predict the answer (§5.2.3). We describe each module in more detail below.

5.2.1 Rationalizing Module

Prior works mainly rely on human-annotated rationales for teaching a model to rationalize [72, 58, 140]. However, such rationale annotations are expensive and frequently of low quality [1, 140, 126], e.g., not providing sufficient knowledge to support a given answer. Meanwhile, a recent study shows that rationales automatically generated by pretrained LMs are often preferable over human-annotated rationales [165]. Therefore, for PINTO's rationalizing module, we propose using a pretrained LM to generate rationales via in-context learning, which prompts the frozen LM to retrieve knowledge from its parameters [162]. The prompt consists of a fixed set of question-answer-rationale demonstrations that are randomly selected from the training set. Each demonstration consists of a question q, answer choices A,∗ gold answer a∗ ∈ A, and a human-annotated free-text rationale r∗ ∈ R for a∗ (Table 5.1).† With this prompt p, we use the LM to generate rationales for every instance from the dataset. Specifically, for each a_i ∈ A of some instance (q, A), the rationalizing LM's input is constructed as [p, q, A, a_i]. Then, we use greedy decoding of the LM output to obtain rationale r_i for a_i. Note that the LM input does not contain any information about the gold answer a∗. Our rationalizing module's design assumes that r_i will be aligned with accurate knowledge if and only if a_i = a∗, since it should intuitively be difficult to retrieve correct knowledge that supports an incorrect answer choice (see the appendix for examples of the generated rationales). The reasoning module then predicts the correct answer by reasoning over the rationales for each answer choice.

∗ We include the answer choices A in the prompt so that the LM is aware of all the available choices and thus could generate a rationale that is more distinctive.
† As opposed to full human annotation, we only need a few (usually < 8) examples per dataset.

Table 5.1: Rationalization Prompts. The format of our prompts for rationalization with a medium-scale LM. The prompt consists of a few examples as demonstration of how to rationalize for a question-choice pair, followed by placeholders for a new question and a target choice.

Task: CommonsenseQA
Prompt: Q: What do people use to absorb extra ink from a fountain pen? Answer Choices: (a) shirt pocket (b) calligrapher's hand (c) inkwell (d) desk drawer (e) blotter
A: The answer is blotter. Blotting paper absorbs liquids like ink well.

Task: OpenBookQA
Prompt: Q: How do you reduce pollution? Answer choices: (a) igniting fuel and oxidiser (b) transportation technology ... (h) using less resources
A: The answer is using less resources. Conserving resources has a positive impact on the environment. Use of resources affects the environment such as pollution.
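The prompting procedure of the rationalizing module can be sketched roughly as follows, using a frozen causal LM through HuggingFace Transformers; the demonstration text and decoding settings are illustrative, and in practice a 20B-parameter model would be loaded with model parallelism rather than as shown here.

```python
# A minimal sketch (not PINTO's exact code) of prompt-based, choice-specific
# rationale generation with a frozen causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
model.eval()  # frozen: never fine-tuned

DEMOS = (
    "Q: What do people use to absorb extra ink from a fountain pen?\n"
    "Answer Choices: (a) shirt pocket (b) calligrapher's hand (c) inkwell (d) desk drawer (e) blotter\n"
    "A: The answer is blotter. Blotting paper absorbs liquids like ink well.\n\n"
)

def rationalize(question, choices):
    """Generate one rationale per answer choice via greedy decoding."""
    rationales = []
    for choice in choices:
        prompt = (DEMOS + f"Q: {question}\n"
                  + "Answer Choices: " + " ".join(choices) + "\n"
                  + f"A: The answer is {choice}.")          # input is [p, q, A, a_i]
        inputs = tokenizer(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
        text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
        rationales.append(text.split("\n")[0].strip())      # keep the continuation as r_i
    return rationales
```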
5.2.2 Reasoning Module

Given a question q, the answer choices A, an answer candidate a_i ∈ A, and a rationale r_i, the reasoning module learns to output a plausibility score ρ_i = ρ(q, A, a_i, r_i). Following prior works, we use a text-to-text Transformer LM as the backbone of our reasoning module [166, 58]. For each a_i, the reasoning module's input is defined as the token sequence s = [q ⊕ a_1 ⊕ ... ⊕ a_|A| ⊕ r_i], where ⊕ denotes concatenation. Meanwhile, the reasoning module's output is obtained by sequentially teacher-forcing a_i's tokens t_i = [t_i^1, t_i^2, ..., t_i^{|a_i|}] into the decoder, rather than via greedy decoding. This way, we can compute the reasoning module's output token probabilities for arbitrary answer choices a_i. Following [136], we compute a_i's plausibility score ρ_i by aggregating the probabilities P of the tokens t_i^j as:

$$\rho_i = \frac{1}{|a_i|} \sum_{j=1}^{|a_i|} \log P\big(t_i^j \mid t_i^{j-1}, \ldots, t_i^2, t_i^1, q, A, r_i\big).$$

Next, we use the softmax function to normalize ρ_i as the probability $P(a_i \mid q, A, R) = e^{\rho_i} / \sum_{j=1}^{|A|} e^{\rho_j}$. During inference, given question q and answer choices A, the rationalizing module first generates rationales R = {r_i}, then the reasoning module computes the predicted answer choice as $\hat{a} = \arg\max_{a_i \in A} P(a_i \mid q, A, R)$.

Figure 5.3: Standard Training vs. Counterfactual Training. For counterfactual regularization, we train the reasoning module with noisy labels when the rationale tokens are either masked or replaced.
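A rough sketch of the plausibility scoring in §5.2.2, assuming a T5 reasoning module and HuggingFace Transformers (an illustration under those assumptions, not the released implementation):

```python
# Teacher-force each answer choice's tokens into the T5 decoder and average
# their log-probabilities to obtain rho_i, then softmax over choices.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def plausibility_scores(question, choices, rationales):
    scores = []
    for choice, rationale in zip(choices, rationales):
        source = " ".join([question] + choices + [rationale])   # s = [q + a_1..a_|A| + r_i]
        enc = tokenizer(source, return_tensors="pt")
        labels = tokenizer(choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(**enc, labels=labels).logits          # teacher forcing
        log_probs = torch.log_softmax(logits, dim=-1)
        token_lp = log_probs[0].gather(1, labels[0].unsqueeze(1)).squeeze(1)
        scores.append(token_lp.mean().item())                    # rho_i
    return torch.softmax(torch.tensor(scores), dim=0)            # P(a_i | q, A, R)

# Predicted answer: choices[int(plausibility_scores(q, A, R).argmax())]
```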
5.2.3 Training

For multi-choice QA, the standard training objective is to maximize the likelihood of the correct answer choice using a cross-entropy loss, computed as:

$$\mathcal{L}_{\text{std}} = -\sum_{a_i \in A} Q(a_i \mid q, A) \log P(a_i \mid q, A, R), \tag{5.1}$$

where $Q(a_i \mid q, A)$ is 1 if $a_i = a^*$ and 0 otherwise. Let $Q(A \mid q, A)$ be the one-hot target distribution over all $a_i \in A$. There can be spurious correlations between q and A [20], so the reasoning module may take undesirable shortcuts instead of properly using the rationale to predict the answer [56, 106]. In this case, the rationales would be unfaithful in explaining the model's behavior and useless for model debugging.

To address this, we introduce a counterfactual regularization objective in which the reasoning module is regularized to output less confident predictions when the rationale is not utilized properly (i.e., shortcuts are used). This is implemented using label smoothing [142], which softens the target distribution $Q(A \mid q, A)$ by linearly combining it with a noisy distribution $U(A \mid q, A)$, often set as the uniform distribution. Therefore, given a tunable label smoothing factor $0 < \epsilon < 1$, we compute the label-smoothed target distribution as:

$$Q'(A \mid q, A) = (1 - \epsilon)\, Q(A \mid q, A) + \epsilon\, U(A \mid q, A).$$

In order to simulate shortcut reasoning, we consider two strategies for perturbing the generated rationales r_i.

Token Masking addresses the case where the reasoning module ignores the rationale and instead exploits spurious cues in the rest of the input. To simulate this, we mask out the rationales in the input. Recall that the backbone of the reasoning module is a Transformer LM, which uses a self-attention mechanism to aggregate information across tokens. Hence, we implement rationale masking by zeroing the attention mask for the rationale tokens.‡

Token Replacement addresses the case where the reasoning module misunderstands the rationales' meaning and thus uses them improperly. To simulate this, we randomly replace k% of the rationale tokens with other tokens uniformly sampled from the entire language vocabulary.

At each fine-tuning step, we randomly select one of the strategies for obtaining perturbed rationales R' = {r'_i}, which helps keep the LM from overfitting to any particular strategy. Then, the counterfactual regularization loss is computed as:

$$\mathcal{L}_{\text{c-reg}} = -\sum_{a_i \in A} Q'(a_i \mid q, A) \log P(a_i \mid q, A, R'). \tag{5.2}$$

This counterfactual regularization teaches the reasoning module to be less confident when the rationales are either absent or problematic, so that it can learn to make sounder use of the rationales.

‡ We do not replace the tokens in a rationale with special mask tokens, since the LM is already pretrained to recover mask tokens and we want to ensure that this ability cannot be exploited.
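A simplified sketch of the two perturbation strategies and the combined objective (Eqs. 5.1 and 5.2) is shown below; it is illustrative rather than the authors' code, and the helper signatures are assumptions.

```python
# Perturb the rationale by masking or random replacement, then train against a
# label-smoothed (noisier) target distribution in addition to the standard loss.
import random
import torch
import torch.nn.functional as F

def perturb_rationale(rationale_ids, attention_mask, rationale_positions,
                      vocab_size, replace_rate=0.3):
    """Randomly apply token masking or token replacement (1-D tensors, one example)."""
    if random.random() < 0.5:                              # 1) token masking
        attention_mask = attention_mask.clone()
        attention_mask[rationale_positions] = 0            # rationale is ignored
    else:                                                  # 2) token replacement
        rationale_ids = rationale_ids.clone()
        for pos in rationale_positions:
            if random.random() < replace_rate:
                rationale_ids[pos] = random.randrange(vocab_size)
    return rationale_ids, attention_mask

def pinto_loss(logits, logits_perturbed, gold_index, epsilon=0.1):
    """Standard loss (Eq. 5.1) plus counterfactual loss (Eq. 5.2) with label smoothing."""
    num_choices = logits.size(-1)
    one_hot = F.one_hot(gold_index, num_choices).float()
    smoothed = (1 - epsilon) * one_hot + epsilon / num_choices
    std = -(one_hot * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    c_reg = -(smoothed * F.log_softmax(logits_perturbed, dim=-1)).sum(-1).mean()
    return std + c_reg
```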
5.3 Experimental Setup

Questions and hypotheses We design experiments to answer the following questions: (1) What is the impact of our PINTO pipeline on faithfulness and end-task performance? We expect our pipeline with the counterfactual training technique to obtain improvements in both aspects. (2) How does the quality of rationales affect the end-task performance of PINTO? We hypothesize that improving the quality of PINTO's rationales improves its accuracy. (3) Does faithful reasoning based on rationales lead to better generalization? We expect that a method like PINTO that learns to rely on rationales can better generalize to low-resource settings and out-of-distribution (OOD) datasets.

Datasets We experiment with several CSR benchmarks. (1) CommonsenseQA [143] is a 5-choice QA dataset testing general commonsense reasoning about concepts from ConceptNet [138]. (2) StrategyQA [50] is a binary (yes/no) QA dataset that requires models to infer the reasoning strategy. (3) OpenBookQA [107] is a 4-choice QA dataset that requires reasoning based on an open book of facts as well as broad commonsense knowledge. (4) QASC [68] is an 8-choice QA dataset that requires a system to answer a question with a valid composition of basic facts using common sense. Since the gold labels for the test sets of these datasets are not publicly available, we treat the official development set as our test set and separate the training data into our own training set and development set.

Evaluation Metrics To evaluate the reasoning model's task performance, we use the accuracy metric and consider both ID and OOD test sets in our experiments. ID/OOD test sets are taken from the same/different dataset as the training set. To evaluate the faithfulness of the generated rationale to the reasoning model's predicted label, we adopt the LAS metric [58]. LAS measures rationale-label consistency as how well the rationale helps a simulator model predict the reasoning model's predicted label. Following [58], we implement the simulator as a fine-tuned T5-Base LM [125]. To aggregate accuracy and LAS into a single metric, we use the Normalized Relative Gain (NRG) metric [24]. Across all compared methods, NRG first normalizes each of the two constituent metrics' scores as values in [0, 1], then obtains the aggregate score by taking the mean of the two normalized scores.
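The NRG aggregation described under Evaluation Metrics can be sketched as a small helper; the example numbers in the comment are made up and serve only to show the call pattern.

```python
# Illustrative sketch of NRG: min-max normalize accuracy and LAS across the
# compared methods, then average the two normalized scores per method.
def normalized_relative_gain(acc_by_method, las_by_method):
    def min_max(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {m: (s - lo) / (hi - lo) if hi > lo else 0.0 for m, s in scores.items()}
    acc_n, las_n = min_max(acc_by_method), min_max(las_by_method)
    return {m: (acc_n[m] + las_n[m]) / 2 for m in acc_by_method}

# Example call (made-up numbers):
# nrg = normalized_relative_gain({"PINTO": 61.7, "Standard": 59.5},
#                                {"PINTO": 24.2, "Standard": 18.8})
```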
Implementation Details For the rationalizing module, we use GPT-NeoX [14], a pretrained, autoregressive LM with 20B parameters. We manually annotate 7 examples to set up the prompt for each task dataset. For the reasoning module, we adopt T5-base [125] with only 220 million parameters, which is around two orders of magnitude smaller than the rationalizing module. During fine-tuning, the standard training loss (Eq. 5.1) and our counterfactual training loss (Eq. 5.2) are directly combined as the overall training loss. For perturbing rationales, we randomly choose the token masking or token replacement strategy with an equal chance in each training batch. The replacement rate for token replacement is empirically set to 30%. We run all the experiments on the compared methods 4 times using a fixed set of random seeds and report the average results.

Baselines (1) Without Rationales is a T5-based model fine-tuned on the task dataset without using any rationales as additional input. (2) Prompted Self-Rationalization is a GPT-NeoX LM that learns from a few examples in the prompt to first generate a few short sentences as the rationale and then predict the answer. Here, we use the chain-of-thought prompting configuration from [162]. (3) Distilled Self-Rationalization is a small LM (T5-base) trained on the rationales generated by the Prompted Self-Rationalization model. We implement two variants of the distillation model: a) Rationalize-First, which first generates the rationale and then predicts the answer, and b) Predict-First, which first predicts the answer and then generates the rationale. (4) NILE [72] trains a rationalization module by fine-tuning a T5-3B model [125] with the rationales annotated by humans, then trains a reasoning module by fine-tuning a T5-Base model with the task dataset as in our method. We only apply NILE on the CSQA and StrategyQA datasets, since they provide human-annotated gold rationales. (5) Standard Training uses the same rationalize-then-reason pipeline as our method, except the reasoning module is not fine-tuned with the counterfactual training loss. (6) Dropout Context is the same as the Standard Training baseline, except the question is randomly dropped out from the input while fine-tuning the reasoning module. This is a strategy used in prior work to encourage the reasoning module to make good use of the input rationales [58]. Further, we also consider two variants of PINTO, namely Token Masking Only and Token Replacement Only, as baselines. These baselines only adopt token masking or token replacement for perturbing rationale tokens, respectively.

Table 5.2: ID Results. Task performance (accuracy), faithfulness (LAS), and Normalized Relative Gain (NRG) of the compared methods on the test datasets. The reasoning module for the fine-tuning methods is T5-Base. We bold the results that outperform the second-best method with statistical significance (p < 0.05).

                               CSQA                   StrategyQA             OBQA                   QASC
Method                         Acc.↑  LAS↑   NRG↑     Acc.↑  LAS↑   NRG↑     Acc.↑  LAS↑   NRG↑     Acc.↑  LAS↑   NRG↑
w/o Rationales                 58.68  -      -        58.12  -      -        55.85  -      -        35.58  -      -
Self-Rationalization
  Prompted GPT-NeoX            38.41  11.66  0.23     55.31  1.09   0.47     33.80  14.67  0.18     32.61  32.01  0.33
  Prompted GPT-3               73.50  1.38   0.50     66.53  0.60   0.77     -      -      -        -      -      -
  Distill. Explain-First       51.97  11.30  0.41     50.20  1.29   0.33     48.90  13.76  0.41     33.34  31.82  0.40
  Distill. Predict-First       55.77  6.86   0.37     54.61  -2.68  0.13     50.25  12.30  0.33     34.53  18.48  0.18
Pipeline
  NILE                         57.60  18.23  0.64     57.31  2.17   0.62     -      -      -        -      -      -
  Standard Training            59.48  18.75  0.68     57.11  1.50   0.56     56.65  17.03  0.82     37.50  37.91  0.94
  Dropout Context              59.64  20.40  0.72     51.45  0.62   0.31     57.55  18.76  0.97     35.37  37.54  0.73
  PINTO                        61.67  24.22  0.83     60.87  3.35   0.81     58.85  18.02  0.94     37.82  38.98  1.00
  - Masking Only               60.46  17.44  0.67     59.12  1.74   0.64     58.35  13.06  0.55     37.39  34.06  0.84
  - Replacement Only           60.38  22.54  0.78     58.72  2.11   0.66     58.10  18.01  0.93     37.47  34.61  0.86

5.4 Experiments

5.4.1 Main Results

In-Distribution (ID) Performance We first evaluate all methods on ID test sets. Table 5.2 shows the task performance of these methods, with fine-tuning methods using T5-Base as the reasoning module. We have the following two observations. First, the Prompted Self-Rationalization baseline (using the 20B-parameter GPT-NeoX) generally does not outperform the fine-tuning methods, while the GPT-3 version is reported to achieve 73.50 and 66.53 in accuracy on CSQA and StrategyQA, respectively [162]. This validates that Prompted Self-Rationalization requires very large LMs to work effectively [161].
Second, simply augmenting the reasoning module with rationales (as in Standard Training) does not always lead to better results compared with the Without Rationales baseline, since the rationales may not be properly utilized. The Dropout Context baseline helps to address this issue in some, but not all, cases, while PINTO consistently yields the best accuracy in most of the cases. We have similar observations from the results using RoBERTa-Large as the reasoning module (see the appendix). This demonstrates the effectiveness of our counterfactual regularization method in improving ID generalization.

Out-of-Distribution (OOD) Performance To further demonstrate the generalizability brought by faithful reasoning over rationales, we also investigate the performance of our method on OOD test sets. The intuition is that by utilizing rationales faithfully rather than fitting only the ID training data, our model achieves better OOD generalization without any fine-tuning. Table 5.3 shows the OOD performance of all the fine-tuning methods using T5-Base. We conclude that rationales are helpful in improving the generalizability of the model to a dataset unseen during fine-tuning. Among all the methods utilizing rationales, our method yields the best OOD performance, which confirms the benefit of faithful reasoning. A consistent conclusion can be made from the results based on RoBERTa-Large (see the appendix).

Table 5.3: OOD Results. Performance (accuracy) of the compared methods, which are first trained on a source dataset and then directly predict on a target dataset (denoted as source → target).

Method                   CSQA→OBQA   CSQA→QASC   OBQA→CSQA   QASC→CSQA   QASC→OBQA
w/o Rationales           32.05       39.17       24.87       45.74       34.90
Distill. Explain-First   24.85       31.43       23.05       43.16       31.55
Distill. Predict-First   25.10       32.26       26.43       45.17       30.50
NILE                     32.40       40.93       -           -           -
Standard Training        31.05       40.04       25.37       47.71       34.50
Dropout Context          32.30       38.85       23.01       44.27       32.90
PINTO                    34.90       42.25       27.66       48.03       35.75

Rationale-Label Association Table 5.2 also reports the faithfulness of all the methods involving rationalization, measured by LAS. We observe that PINTO achieves a much higher score compared with the baselines, except on OpenBookQA. This demonstrates that counterfactual regularization helps the reasoning module make predictions more faithfully with respect to the rationales.

5.4.2 Performance Analysis

How do different perturbation strategies contribute to the overall performance? Table 5.2 shows the results of the ablation study where we only conduct Token Masking or Token Replacement when perturbing the rationale tokens. In most cases, we note that Token Replacement leads to both better accuracy and faithfulness compared with Token Masking. This is because Token Replacement perturbs the semantics of the rationales more severely, thus further forcing the reasoning module to properly make use of the rationales. Our method yields the best results when both types of perturbation are conducted, which validates that the two strategies comprehensively cover the different ways in which a reasoning module could use the rationales improperly.

Figure 5.4: Low-Resource Learning. Performance (accuracy) of different fine-tuned models in low-resource settings on CSQA.

Figure 5.5: Rationale Quality Analysis. Accuracy of models with both generated and annotated rationales vs. models using only generated rationales on CSQA.

Can faithful rationales lead to better low-resource performance? We also investigate whether, with counterfactual training, the reasoning module can be fine-tuned with less training data. Figure 5.4 shows the accuracy of all the fine-tuning methods. We observe that our method consistently outperforms the baselines at different percentages of training data. The performance gap is larger when less training data is used, demonstrating the data efficiency of our method.
Can we refine the reasoning behavior via rationales? One important application of faithful reasoning is that rationales provide a way to refine the behavior of a model, i.e., we can correct reasoning mistakes by providing a better rationale. To verify this, we make use of ECQA [1], which augments CSQA with human-annotated rationales. We directly provide the human-annotated rationales to the fine-tuned reasoning modules to obtain their oracle results, shown in Figure 5.5. We see that human-annotated rationales generally lead to performance gains for all fine-tuning methods, with our method showing the largest gain. This again showcases the merits of ensuring faithful reasoning over rationales when refining a system.

Is our method more sensitive to perturbed rationales? Intuitively, higher rationale faithfulness (i.e., a stronger connection between the rationale and the reasoning module's behavior) should yield greater sensitivity to noisily perturbed rationales. In other words, a higher performance drop (sensitivity) signals higher faithfulness. To verify this, we conduct a stress test. We choose CSQA and OpenBookQA and replace each question in the test set with a randomly sampled question, but still keep the original answer choices. We then prompt our rationalizing module with the replaced question and the original choices to obtain a set of perturbed rationales. We finally provide the perturbed rationales to the reasoning module. Our results in Table 5.4 show that PINTO exhibits a significantly higher performance drop than the other two methods (especially on OBQA), indicating that counterfactual regularization is effective in improving rationale faithfulness.

Table 5.4: Sensitivity to Noisy Rationales. We use perturbed rationales during inference as a stress test and report the performance drop of the compared methods.

Model               CSQA   OBQA
Standard Training   0.88   0.35
Dropout Context     2.06   0.55
PINTO               2.62   1.55
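The stress test above can be sketched as follows; `rationalize` and `predict` are hypothetical stand-ins for the rationalizing and reasoning modules, and the data layout is assumed for illustration.

```python
# Rough sketch: generate rationales for a randomly substituted question (keeping
# the original answer choices), feed them to the reasoning module for the
# original question, and report the accuracy drop relative to clean rationales.
import random

def stress_test(test_set, rationalize, predict, seed=0):
    rng = random.Random(seed)
    questions = [ex["question"] for ex in test_set]

    def accuracy(perturbed):
        correct = 0
        for ex in test_set:
            q_for_rationale = rng.choice(questions) if perturbed else ex["question"]
            rationales = rationalize(q_for_rationale, ex["choices"])
            pred = predict(ex["question"], ex["choices"], rationales)
            correct += int(pred == ex["answer"])
        return correct / len(test_set)

    return accuracy(perturbed=False) - accuracy(perturbed=True)   # performance drop
```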
Chapter 6

Self-Consistent Chain-of-Thought Distillation

Large language models (LMs) elicit strong reasoning capabilities through chain-of-thought (CoT) prompting [162], which asks LMs to generate free-text rationales explaining their multi-step reasoning. However, CoT prompting does not guarantee that the rationale is consistent with the prediction, rendering the rationale useless for justifying the model's behavior. In this work, we present Self-Consistent Chain-Of-Thought DisTillation (SCOTT), a knowledge distillation (KD) method for eliciting faithful CoT reasoning, where a small student model learns from a large teacher model to generate CoT rationales that are consistent with its own predictions. Existing works [135, 83] propose learning to reason from large LMs mainly for computation efficiency or task performance. They prompt a large LM (the teacher) to generate rationales for a downstream dataset, which is then used to train a small LM (the student). However, these works neglect the following two issues, which could undermine the faithfulness of the rationales. First, LMs are prone to hallucination, meaning they often generate text that is not grounded in the input [105, 64]. Therefore, the teacher may not generate on-topic rationales that fully support the answer. In our pilot study (Figure 6.1) over 100 random rationales generated by GPT-3, we found 42% of them not providing new information beyond what is stated in the task input and 37% of them not justifying the answer.∗ This inconsistency between the rationale and the answer would then be inherited by the student. Second, the student may treat rationale generation and answer prediction as two independent processes. This is due to the spurious correlations between the question and answer, which are exploited as a reasoning shortcut by the student [20]. The two issues together would lead to an unfaithful student which learns to generate vacuous rationales and may make predictions inconsistent with the rationales.

∗ Wiegreffe et al. obtain a similar observation on the rationales generated by GPT-3 for the CommonsenseQA dataset.

Figure 6.1: Vacuous rationales generated by a prompted LM (GPT-3) for StrategyQA. In both types of error cases, the LM fails to give rationales consistent with the answers due to hallucination.

To address these issues, we propose to enhance the vanilla KD process from both ends. To elicit more on-topic rationales from the teacher, we propose to leverage contrastive decoding, which aims to ground each rationale to the answer (§6.2.1). This technique encourages the teacher to generate tokens that are more plausible only when the answer is considered, instead of ones that are fairly plausible even without the answer during decoding. To train a faithful student, we ask the student to conduct counterfactual reasoning, i.e., to predict accordingly when the rationales lead to different answers (§6.2.2). We obtain the training data by asking the teacher to generate a rationale for a sampled incorrect answer. The reasoning shortcut between the question and the gold answer is thus removed, since the student now needs to give a different answer for the same question, according to the rationales provided during training.

Figure 6.2: Overview of our knowledge distillation framework for faithful reasoning. (a) Teacher: A large LM prompted to generate a consistent rationale given a question and the gold answer in the training set via contrastive decoding. (b) Student: A small LM fine-tuned to generate a rationale and then answer via counterfactual reasoning.

We conduct experiments on several open-domain question answering tasks that require knowledge-intensive reasoning. Experiments show that: (1) Contrastive decoding can lead to a more consistent teacher which generates rationales that are more supportive of the gold answers.
(2) Trained on the more consistent rationale-answer pairs, the student learns to better associate the answer prediction with the rationale generation. (3) With counterfactual reasoning as an auxiliary training objective, the student learns not to take the reasoning shortcut and instead to respect the rationale more. (4) Despite being more faithful, our model performs comparably to the baselines. (5) An ablation study shows that although they perform better, larger student models are more prone to being inconsistent. Our method robustly remedies the inconsistency regardless of the size of the student model. (6) With a more faithful student, we can better improve its performance by correcting its rationale, demonstrating the utility of our method in model refinement.

6.1 Preliminaries

Our goal is to 1) elicit consistent rationales, i.e., those that well justify the gold answers, from a large LM as supervision, and then 2) train a self-consistent student model to reason faithfully, i.e., to answer according to its generated rationale. We consider the task of language-based reasoning where the required knowledge is not provided in the task input. Specifically, we focus on open-domain question answering (QA), which is the most general setting adopted by prior works: given a question q, a QA system is asked to predict the gold answer a∗. For interpretability, we also require the model to provide a free-text rationale r, which justifies its prediction. Below we describe the overview of a vanilla KD framework as illustrated in Figure 6.2. We then discuss the limitations and propose our method in §6.2.

6.1.1 Generating Rationale Annotation

Instead of asking humans to annotate a rationale for each question-answer tuple {q, a∗}, we obtain the rationale from a teacher model automatically using in-context learning. The idea is to prompt a frozen LM as the teacher with only a few annotated examples as demonstration before a new instance is provided. Each example consists of a question q randomly sampled from the training set, the gold answer a∗ and a human-annotated rationale r which justifies why a∗ is correct. The prompt p is structured in the format shown in Figure 6.2 (the Prompt in the left part). To obtain the rationale for a new question q, one basic strategy could be greedy decoding, which selects the most plausible token at each step: $t_i^* = \arg\max_{t_i} \log P(t_i \mid p, q, a^*, t_{<i})$.

Figure 6.3: Contrastive decoding for obtaining rationales that are more grounded by the gold answers, by preferring tokens that are more plausible only when the answer is considered.

6.2.1 A Consistent Teacher: Contrastive Decoding

To encourage the teacher to generate a more on-topic rationale that supports the answer, our proposed method extends a prior technique called contrastive decoding for open-ended text generation [85]. The core idea is to search for rationale tokens that are more plausible only when the answer is considered, instead of ones that are fairly plausible even without the answer during decoding.
To implement this idea, we first model the hallucinating behavior by providing a perturbed answer a′ to the same teacher, and then obtain the plausibility growth of any token $t_i$ given the answer $a^*$ as $G(t_i \mid a^*) = \log P(t_i \mid p, q, a^*, t_{<i}) - \log P(t_i \mid p, q, a', t_{<i})$.
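A minimal sketch of how such answer-grounded decoding could be implemented with a small stand-in teacher is given below; it simplifies the method (for example, it omits the plausibility constraint on candidate tokens used in contrastive decoding) and uses GPT-2 only for illustration.

```python
# Sketch: at each step, prefer the token whose log-probability grows most when
# the gold answer is in the prompt, relative to a prompt with a perturbed answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # stand-in teacher
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def contrastive_rationale(prompt_with_answer, prompt_with_perturbed, max_new_tokens=30):
    gold_ids = tokenizer(prompt_with_answer, return_tensors="pt").input_ids
    pert_ids = tokenizer(prompt_with_perturbed, return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            gold_logp = torch.log_softmax(model(gold_ids).logits[0, -1], dim=-1)
            pert_logp = torch.log_softmax(model(pert_ids).logits[0, -1], dim=-1)
        growth = gold_logp - pert_logp                         # G(t_i | a*)
        next_id = int(growth.argmax())                         # most answer-grounded token
        if next_id == tokenizer.eos_token_id:
            break
        generated.append(next_id)
        step = torch.tensor([[next_id]])
        gold_ids = torch.cat([gold_ids, step], dim=-1)         # extend both contexts with t_i
        pert_ids = torch.cat([pert_ids, step], dim=-1)
    return tokenizer.decode(generated)
```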
Abstract
The recent advance in language models (LMs) has brought tremendous success to the field of natural language processing (NLP). These LMs are first pre-trained on large amounts of human-authored text to predict missing tokens. Then they are adapted to a wide range of downstream tasks through fine-tuning or prompting. This new paradigm has proven quite effective, and as of today LMs have achieved super-human performance on many NLP benchmarks testing natural language understanding, machine translation, question answering, and more. Notably, scaling the model size seems to be the key to better performance, which drives many tech companies to develop increasingly large LMs.
However, recklessly scaling the model size comes with several limitations. These include: (1) Scaling introduces huge training and inference costs. (2) Scaling aggravates the issues of uninterpretability and biases in neural models. (3) Scaled LMs are good at generating fluent text which is not necessarily coherent. Therefore, this thesis advocates externalized reasoning in order to build more scalable and trustworthy AI systems. The core idea of externalized reasoning is to let LMs present their otherwise opaque reasoning process explicitly, e.g., as a local knowledge graph or a free-text rationale. The externalized reasoning process can thus serve as part of the explanation of the LM's behavior and also guide the language generation process to achieve coherence. Moreover, externalized reasoning provides a more targeted way to enhance LMs' capabilities compared to reckless scaling.
To motivate our idea, we focus on language-based reasoning tasks in NLP that require external knowledge for inference. In particular, we investigate two types of representations for the reasoning process of LMs. In the first half of this thesis, we resort to structured representations, i.e., graphs illustrating the complex relations between concepts mentioned in the task input. We show that (1) a local knowledge graph can help LMs better infer the implicit relations for answering a commonsense question and (2) a structured plan can help LMs generate more coherent text. In the second half of the thesis, we leverage natural language as the vehicle of externalized reasoning. We present techniques which make use of LMs to provide (1) background knowledge in free text and (2) supervision signals for step-by-step reasoning. We show that externalizing the reasoning process as natural language helps LMs perform better on open-domain reasoning tasks. At the end of the thesis, we also equip LMs with an external memory, which is shown to help LMs better model the observed world and keep track of state changes.
Overall, this thesis shows that, unlike reckless scaling, externalized reasoning in LMs can provide a degree of interpretability into the models' behavior, enhance LMs' capabilities in a targeted way, and potentially save us from an endless scaling competition.