Multi-Modal Preconditioned Inference of Commonsense Knowledge by Ehsan Qasemi A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) December 2023 Copyright 2023 Ehsan Qasemi Dedication To brave women of Iran for their inspiring pursuit of life and liberty Figure 1: A graphic combination of the three words Woman (in red), Life (in grey), and Liberty (In green) in Farsi. ii Acknowledgements I am thrilled to have this opportunity to express my gratitude and appreciation to everyone who has played a significant role in my academic journey. I am beyond grateful to my Ph.D. advisor, Muhao Chen, for his exceptional guidance, mentorship, and positive energy. Working with him for the past three years has been a great pleasure, and I am fortunate to have had him as my mentor. Muhao’s brilliance, dedication, and encouragement have helped me navigate the complexities of a Ph.D. program, and I could not have completed this thesis without his guidance and support. I am also grateful to Pedro Szekely, who initially offered me the Ph.D. position and guided me during the early stages of my Ph.D. journey. His support, mentorship, and advice have helped me become a more mature and organized researcher. Additionally, I would like to express my gratitude to Amir H Assadi and Bijan Alizadeh, my Master’s and undergraduate advisors, respectively, for their unwavering support and belief in my abilities as a researcher. I would like to express my deepest gratitude to Aiichiro Nakano, Daniel O’Leary, Fred Morstatter, Luis Garcia, and Shrikanth S. Narayanan, my Ph.D. committee members, for their invaluable insights, comments, and suggestions that have helped me improve my research work. This endeavor would not have been possible without the generous support of DARPA’s MCS and D3M programs, the National Science Foundation of the United States, and Cisco, who financed my research. I would also like to acknowledge the numerous collaborators who have helped me shape my research ideas and provided invaluable support. In particular, special thanks to Qiang Ning from Amazon Inc., for his guidance and insight in shaping the iii focus on weak supervision in my research, Filip ilievski and Daniel Schwabe, for their help in formulating my ideas into an original research paper. I would like to thank my amazing parents for their unwavering love, support, and patience. Their encouragement and belief in me have been my constant source of motivation throughout my educational pursuits. Additionally, I am incredibly thankful for my wife, Setareh, who has been a rock for me and supported me through the ups and downs of this Ph.D. journey. Her unwavering love, understanding, and support have been instrumental in keeping me grounded and focused. Finally, I would like to thank the many students whom I have had the privilege of working with and supervising. Piyush Khanna, Amani Maina-Kilaas, Devadutta Dash, Khalid Alsaggaf, and Shikhar Singh have taught me so much, and it has been an honor to work with each of them. I am grateful for their hard work, dedication, and enthusiasm, which have contributed immensely to my research. Thank you all from the bottom of my heart for your support, encouragement, and unwavering belief in me throughout my Ph.D. journey. iv Table of Contents Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
ii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Thesis Contributions and Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Chapter 2: Backgrounds and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 Theory of affordance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Language and Visual-Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Weak Supervision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Chapter 3: Preconditions Attributed to Commonsense Knowledge . . . . . . . . . . . . . . . 12 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Preconditions in Commonsense Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.3 PaCo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 v 3.3.1 Edge Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.3 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.5.1 Evaluating SOTA on PaCo Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5.1.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5.1.2 Evaluation Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5.1.3 Results and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.5.2 Diving in the Tuning Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.5.3 Discussion on Different Relation Types . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.5.4 Hard and Soft Preconditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.6 Related Works to PaCo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.7 Resources of Preconditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.8 Reasoning with Preconditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.9 Ethical Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.10 PaCo Data Collection Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.10.1 Main AMT Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.10.2 Gathering Factual Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.10.3 Qualifying Participants . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 32 3.10.4 PaCo Edge Lexicalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.11 PaCo Results in More Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.11.1 Edge Selection Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.11.2 Additional Results from P-NLI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 vi 3.11.3 Details of Soft Preconditions on P-NLI . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.12 PaCo Model Sizes and Run-times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.13 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Chapter 4: PInKS: Preconditioned Commonsense Inference with Weak Supervision . . . . 38 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3 Preconditioned Inference with Minimal Supervision . . . . . . . . . . . . . . . . . . . . . . 42 4.3.1 Weak Supervision with Linguistic Patterns . . . . . . . . . . . . . . . . . . . . . . . 43 4.3.2 Generative Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.3.3 Precondition-Aware Biased Masking . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4.2 Informativeness Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 Analysis on Weak Supervision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.6 Related Works to PInKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.7 Ethical Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.8 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.9 Details on PInKS Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.9.1 Linguistic Patterns for PInKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.9.2 Details of Snorkel Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.9.3 Modified Masked Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.9.4 Interrogative Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.10 Details on Target Data Experiments in PInKS . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.11 Curriculum vs. Multitask Learning in PInKS . . . . . . . . . . . . . . . . . . . . . . . . . . 59 vii 4.12 PInKS Model Sizes and Run-times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.13 Details on PABI Measurement in PInKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.14 Details on LFs in PInKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.15 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 5: Preconditioned Visual Language Inference with Weak Supervision . . . . . . . 64 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.2 Construction of PVLIR . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.3 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.4 Evaluation and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4.1 Inference Benchmarking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.4.2 Anaylsis with Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.4.3 Identifying Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.4.4 Utilizing Rationale for Inference Task . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.6 Ethical Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.7 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.8 Weak Supervision Methods in PVLI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.8.1 Implementation Details and Experimental Setup . . . . . . . . . . . . . . . . . . . . 82 5.8.2 Image Search Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.8.3 Model Sizes and Run-times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.9 Data Annotation Details in PVLI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.9.1 Main AMT Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.9.2 Qualifying Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.9.3 Mechanical Turk Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 viii 5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Chapter 6: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.2.1 Preconditioned Inference in Age of Large Language Models . . . . . . . . . . . . . 94 6.2.2 Affordance in Real-World Applications . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.2.3 Free-Text Rationalization for Visual Question Answering . . . . . . . . . . . . . . . 98 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 ix List of Tables 3.1 Test of language model’s understanding of preconditions . . . . . . . . . . . . . . . . . . . 16 3.2 Example of the three tasks in PaCo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 F1-Macro results of SOTA systems on P-NLI task based on PaCo. Best values are highlighted . 23 3.4 Accuracy results of SOTA systems on P-MCQA task based on PaCo. Best values are highlighted . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.5 BLEU-2, ROUGE-2, and human evaluation Information score for results of SOTA systems on the P-G task. Zero-shot ROUGE scores are omitted to save space as they are negligible and do not add additional insight beyond the zero-shot BLEU-2. Best values are highlighted . 25 3.6 hand-annotated usefulness indication of the precondition statements for top/bottom/mid percentile buckets of the quantitative methods. 
The [A, B] label indicates edges with the metric score in the range of [A, B] percentile of the metric score. . . . . . . . . . . . . . . 34 3.7 hand-annotated usefulness indication of the precondition statements three of the ConceptNet relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.8 Test results of SOTA systems on NLI task based on the PaCo. FP: False Positive, FN: False Negative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.9 F1-Macro results of SOTA systems on P-NLI task based on PaCo dissected based on relation type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.10 Accuracy results of SOTA systems on P-MCQA task based on PaCo . . . . . . . . . . . . . 36 x 3.11 Results of RoBERTa-large-MNLI model on test set of P-NLI after being tuned on different datasets, dissected based on relation type. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1 Examples from the collected dataset through linguistic patterns in Section 4.3.1. . . . . . . 43 4.2 Macro-F1 (%) results of PInKS on the target datasets: no PInKS (Orig.), with PInKS in zero-shot transfer learning setup (PInKS) and PInKS in addition to original task’s data (Orig.+PInKS). Bold values are cases where PInKS is improving supervised results. . . . . 47 4.3 PABI informativeness measures (x100) of PInKS and other target tasks w.r.t PaCo and δ-NLI. Bold values represent the maximum achievable PABI Score by considering train subset as an indirect signal for test subset of respective data. The highest PABI score, excluding the max achievable, is indicated in italic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.4 Precision of the sampled preconditions-statement-label triplets from three checkpoints in the pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.5 Linguistic patterns in PInKS and their recall value. Patterns with not enough matches in the corpora have empty recall values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.6 List of conjunctions used in modified masked loss function in section 4.3.3 . . . . . . . . . 57 4.7 Filtered Labeling Functions Patterns and their associated polarity. . . . . . . . . . . . . . . 57 4.8 Examples from target tasks in NLI format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.9 Macro-F1 (x100) results of PInKS on the target datasets using multi-task fine-tuning strategy and its difference with curriculum strategy. . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.10 Details of PABI metric computations in Section 4.4.2 according to Equation (4.1) . . . . . . 61 xi 4.11 Coverage (fraction of raw corpus instances assigned the non-abstain label by the labeling function), Overlaps (fraction of raw corpus instances with at least two non-abstain labels), and Conflicts (fraction of the raw corpus instances with conflicting (non-abstain) labels) on top-performing LFs. Green and red color respectively represent LFs that assign allow and prevent labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.1 Results of SoTA Visual Language Models on the PVLI task. . . . . . . . . . . . . . . . . . . 75 5.2 Regex patterns for the labeling functions. A=action, E=event, P=precondition, NP=negative precondition. 
Patterns with fewer than 20 examples in the corpora are marked with asterisks, and those with no examples are left empty. Bolded conjunctions were followed with part-of-speech tagging to confirm that they were used as conjunctions. . . . . . . . . 83 5.3 Top 10 websites for preconditions by NLI dataset. There are a total of 10,975 unique websites for 50,729 unique images belonging to 82,740 examples. . . . . . . . . . . . . . . . 86 5.4 Top 10 websites for actions by NLI dataset. There are a total of 9,700 unique websites for 48,305 unique images belonging to 80,170 examples. . . . . . . . . . . . . . . . . . . . . . . 86 xii List of Figures 1 A graphic combination of the three words Woman (in red), Life (in grey), and Liberty (In green) in Farsi. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii 2.1 Weak supervision methods based on Snorkel[120]. . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Weak supervision relies on models to generalize well on noisy labeled instances. Source: Snorkel[120]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Overview of the PaCo data collection and instances of the three tasks derived from it. . . . 13 3.2 F1-Score of fine-tuning RoBERTa-large-MNLI with increasing amounts of training (tuning) data from P-NLI. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3 Data-collection and processing in a nutshell . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4 A sample question-unit used in main survey on the AMT . . . . . . . . . . . . . . . . . . . 31 4.1 Examples on Preconditioned Inference and the NLI format they can be represented in. . . 39 4.2 Overview of the three minimally supervised methods in PInKS. . . . . . . . . . . . . . . . 41 4.3 PABI informativeness measures of PInKS with different precision thresholds on PaCo. . . . 50 5.1 Preconditioned Visual Language Inference (PVLI) and Preconditioned Visual Language Reasoning (PVLR) tasks. The “H” and “P” are the input hypothesis and premise. The outputs, label (letter “L”) and rationale (letter “R”), are highlighted . . . . . . . . . . . . . . . . . . . 65 5.2 Overview of weak supervision methods for constructing PVLIR. . . . . . . . . . . . . . . . 68 5.3 Distribution of instances extracted from captions (log-scale), for each source of the caption. 71 xiii 5.4 Cumulative distribution of the data with respect to the precision of the generating labeling function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.5 a) Observed distribution of matches for preconditions. b) Observed distribution of matches for actions. c) Deviation from the expected distribution of matches for preconditions. d) Deviation from the expected distribution of matches for actions. . . . . . . . . . . . . . . . 73 5.6 Accuracy difference from the peak value of fine-tuning FLAVA (lower is better) with increasing amounts of tuning data from PVLI. The batch size is 64. . . . . . . . . . . . . . . 76 5.7 Heatmap graph comparing the measures of perplexity and model agreement with expert human evaluation in the caption querying method. Bins are computed using 6-quantiles for each axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.8 Heatmap graph comparing the measures of perplexity and model agreement with human evaluation from Amazon Mechanical Turk in the caption querying method. 
Bins are computed using 6-quantiles for each axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.9 A sample question-unit used in the main survey on the AMT. . . . . . . . . . . . . . . . . . 91 6.1 A sample interaction with ChatGPT over preconditions of common sense . . . . . . . . . . 96 6.2 A sample interaction with Bard over preconditions of common sense . . . . . . . . . . . . 97 6.3 ChatGPT as source of weak-supervision signal for preconditioned inference annotation . . 97 6.4 A reasoner deals with a limited set of objects for traffic control [137] . . . . . . . . . . . . 98 xiv Abstract Humans can seamlessly reason with circumstantial preconditions of commonsense knowledge. We understand that “a glass is used for drinking water”, unless “the glass is broken” or “the water is toxic”. Despite state-of-the-art (SOTA) models’ impressive performance in inferring commonsense knowledge, it is unclear whether they understand the circumstantial preconditions. In this dissertation, I initially propose a novel challenge of reasoning with preconditions attributed to commonsense knowledge, design three tasks based on the challenge in text-only setup, and show there is a significant gap between SOTA language models’ performance and humans’ on our tasks. I then use generative AI and weak supervision in a combination of targeted fine-tuning strategies to improve the language model’s performance on the preconditioned inference task. Finally, I go beyond the text-only setup and investigate the problem of preconditioned inference in a multi-modal setup when the model is challenged to infer the preconditions from an image. xv Chapter 1 Introduction 1.1 Motivation Improving machine’s ability to reason with commonsense knowledge is at the frontier of natural language processing (NLP) research, as a critical component in many knowledge-driven tasks such as question answering [146, 142], machine reading comprehension [127], narrative cloze [94], and dialogue systems [1, 161]. Recently, dozens of systems [117, 65, 85, 36] and learning resources [130, 95, 125, 11] have been proposed, focusing on various aspects of commonsense knowledge such as naive physics and naive psychology. According to the Theory of Affordance [45, 24] from cognitive studies, understanding the preconditions in which an action or statement is possible or impossible is a key aspect of human intelligence. For example, a glass may be used for drinking water, under an implicit assumption that the water is at normal temperature, but may not be if the glass is shattered. Hence, such contextual preconditions can naturally be categorized into two classes: the ones that enable the statements, and the ones that disable them [41, 57]. From the cognitive perspective, understanding the affordance of objects, or simply preconditions of actions [110], is part of the commonsense knowledge that constitutes what distinguishes humans from a machine to make inference [71]. From an applications perspective, it also has a potent impact on robotics [2], transportations [109, 131, 68], and general artificial intelligence [99]. 1 Causal preconditions may be partially inferred from text [95, 70], however: 1) as is the case in many other aspects of common sense, we rarely write them explicitly in our text; 2) when mentioned in the text, it is difficult for models to distinguish whether they represent causation or correlation. 
In addition, it is not clear how the state-of-the-art reasoner models are able to comprehend such knowledge and use it in real-world applications. 1.2 Challenges We address several key challenges in this dissertation. • The first is the extent to which current language-based reasoning models comprehend the concept of object affordance. Despite being trained on vast amounts of Web corpora, state-of-the-art language models like RoBERTa [85], it is not clear to what extent they have a thorough understanding of how objects behave and interact with their environment. Previous studies have suggested that these models possess some implicit knowledge of common sense, however, they are not perfect and they fail in other tasks [130, 156, 165]. The absence of an appropriate evaluation framework impedes the accurate measurement of their capacity and deficiencies. Hence, a crucial challenge addressed in this thesis is to formulate a rigorous testing mechanism to evaluate the model’s understanding of the theory of affordance. The proposed model must reason with the capabilities and limitations of each object in different states to make informed decisions. • Once the task for evaluating the model’s understanding of affordance has been defined, the subsequent challenge is to acquire a sufficient amount of high-quality data for a meaningful assessment of the model-under-test. The data set needs to be extensive enough to cover the practical aspects of everyday life, encompassing a wide range of objects and their applications from various perspectives, such as temporal, spatial, and emotional. Beyond the scope of the test, the quality of the data set is crucial to 2 enable accurate evaluation of the model’s capabilities. If the dataset is inaccurate or biased, it can lead to inaccurate evaluations of the model’s performance, which can result in the deployment of a flawed or suboptimal model. For instance, if the dataset used to evaluate a model’s understanding of affordance only covers a limited range of objects and scenarios, the model may perform well on the test but fail when applied to real-world situations that are outside the scope of the dataset. Therefore, it is critical to ensure that the dataset is comprehensive, diverse, and accurate in reflecting the real-world scenarios and properties that the model is expected to handle. This requires careful curation and annotation of the dataset, as well as rigorous testing and evaluation of the model’s performance on the dataset. A high-quality dataset can enable accurate evaluation of the model’s capabilities, leading to better performance and more reliable deployment in real-world applications. The use of human annotators is the most reliable and widely accepted method to ensure high-quality data. However, hiring and training annotators are time-consuming and costly, making it impractical for large-scale research settings. In addition, humans have their own biases, e.g cultural, political, and spiritual that can impact the quality of their annotations. To address these challenges, solutions such as simplifying the annotation task and designing informative user interfaces with care have been effective in expediting the process to some extent. Despite these efforts, the use of human annotators remains the gold standard for obtaining high-quality data. In addition, services such as Amazon Mechanical Turk, provide access to a large pool of annotators to minimize the annotator bias based on individuals. 
In recent years, solutions based on weak supervision, knowledge distillation, and transfer learning have emerged as popular alternatives for data acquisition through human annotations. These methods employ carefully designed automation to expedite the process while maintaining acceptable levels of quality. The main challenge in such solutions is balancing the trade-off between the quantity of data with its quality. 3 • The third challenge in this research endeavors to propose effective solutions for enhancing the language models’ (LMs) comprehension of affordance. In the field of natural language processing (NLP), the central aspect of any successful model is its ability to extend its learned knowledge to novel and unseen domains. For instance, a model that can recognize the constraints of a coffee mug should also be able to generalize this understanding to similar objects like a cup of coffee. Developing methods to enable the models to enhance their grasp of new domains is a crucial hurdle, constrained by the availability of high-quality data. • The final challenge of this dissertation pertains to the role of context and the incorporation of different modalities of context to enhance language-based reasoning models. Presently, such models have access only to text-based knowledge, which limits their ability to reason over everyday aspects of life or common sense, such as the affordances of objects. The main reason is the fact that we rarely write such information as they deemed obvious to us and we do not learn them formally, e.g. through schooling. Conversely, humans perceive context and learn primarily through their visual perception, which emphasizes the need for multi-modal processing of both text and image in reasoning models as well. With the advent of visual language models, the next challenge is to extend the findings of affordance to this multi-modal setup, reintroducing the challenges of careful evaluation task design and obtaining high-quality data. This challenge is essential for language models to gain a human-level understanding of affordances and generalize their knowledge to unseen domains. 1.3 Thesis Contributions and Research Questions To address the challenges associated with reasoning with preconditioned of commonsense, we formulate the following research questions: 4 • RQ1) How the LMs’ understanding of the theory of affordance in commonsense can be quantified? • RQ2) Can SOTA LMs comprehend affordances? • RQ3) How can we improve LMs’ understanding of affordance? • RQ4) Does the addition of visual context in VLMs help in the understanding of affordance? If yes, then how can it be quantified? In addressing RQ1 and RQ2, we aim to tackle the first and second challenges presented in this thesis. To assess language models’ grasp of the theory of affordance, we propose a novel method for evaluation and quantification. Our approach involves framing the problem as preconditioned inference, which requires the model to determine whether a new piece of information either enables or disables the possibility of a given commonsense statement. For instance, given the information that “the glass is shattered”, the model must predict whether it “disables” or “enables” the statement “a glass is used for drinking water”. To obtain a diverse set of commonsense statements, we rely on ConceptNet (CN) [84], which is widely used as a source of commonsense knowledge. 
Our framework involves using human annotators, hired through Amazon Mechanical Turk, to gather precondition knowledge for a wide range of commonsense statements. This method aims to provide a reliable and comprehensive evaluation of language models’ understanding of affordance. In this thesis, the third research question (RQ3) focuses on the third challenge, which is related to improving language models’ understanding of affordance. To tackle this challenge, we propose a novel method based on weak supervision to enhance language models’ performance in preconditioned inference and their ability to comprehend the affordance of objects. Our approach utilizes conjunctions in the English language as proxies for the ways people discuss preconditions of commonsense knowledge. By 5 analyzing raw text, we extract a diverse set of linguistic patterns based on conjunctions that encapsulate affordance knowledge. To further enhance the breadth and scale of the extracted knowledge, we perform generative data augmentation using masked language models. This process enables us to gain access to a massive scale of affordance knowledge from raw text. We also propose a modification to the fine-tuning process of language-based reasoning models to improve their preconditioned inference performance. This modification involves biased masking on conjunctions instead of the conventional random masking strategy. We propose to run additional fine-tuning steps on the same raw data, while solely masking the conjunctions and penalizing the model on how accurately it predicts them. Hence, our proposed method based on weak supervision and biased masking leads to improved performance of language models in understanding the affordance of objects. It is a promising approach to enhance language models’ performance in preconditioned inference and their ability to comprehend the affordance of objects and can be applied to a diverse set of commonsense statements and offers a scalable solution to address the third challenge in the evaluation of language models’ reasoning capabilities. The fourth research question (RQ4) pertains to the final challenge addressed in this thesis. This study proposes a novel method to extend preconditioned inference to a multi-modal setup and evaluates the visual language model’s understanding of the theory of affordance. Specifically, the problem of visual preconditioned inference is formulated, wherein the model must determine whether the information in an image enables or disables the possibility of a commonsense statement. Unlike the updating sentence scenario in the text-based setting, the model receives an image of the object in the question, such as “a shattered glass”, and is expected to predict whether it would “disable” or “enables” the statement that “the glass is used for drinking water”. To tackle this challenge, a framework based on the principles of weak supervision is proposed to leverage existing resources on image captioning and preconditioned inference to obtain visual precondition knowledge. This framework utilizes an established source of the parallel image and text data, such as image captioning datasets and image search tools, to ground the preconditioned 6 inference instances onto an image with minimal human supervision, thus providing large quantities of visual precondition knowledge with acceptable quality. 1.4 Thesis Overview The rest of this thesis is organized as follows. Chapter 3 addresses the first and second research questions (RQ1 and RQ2). 
It is based on my paper titled “PaCo: Preconditions Attributed to Common Sense Knowledge” that was published in “EMNLP-Findings 2022”. Here, we first formulate the problem of reasoning with preconditions (preconditioned inference) as a proxy to evaluate the understanding of affordance, and then propose three tasks to benchmark SOTA language models. We finally show that there is a huge gap between machine performance and human performance on our tasks.

Chapter 4 addresses the third research question (RQ3). It is based on my paper titled “PInKS: Preconditioned CS Inference with Minimal Supervision” that was published in “AACL 2022”. Here, we propose a method to improve language models for reasoning with preconditions through minimal supervision. We show, both empirically and theoretically, that our method improves the results on benchmarks focused on reasoning with the preconditions of commonsense knowledge.

In Chapter 5 we address the final research question (RQ4). This chapter is based on my paper “P-VLI: Preconditioned Visual Language Inference” that is under review for “ACL 2023”. We introduce the task of preconditioned visual language inference and rationalization (PVLIR). We propose a learning resource based on three strategies to retrieve weak supervision signals for the task. Our results reveal the shortcomings of SOTA VLM models on the task and draw a road map to address the challenges ahead in improving them.

Finally, in Chapter 6, we wrap up the thesis and go through possible future directions and potential research opportunities to build on this thesis.

Chapter 2 Backgrounds and Definitions

2.1 Theory of affordance

In psychology and cognitive science, the theory of affordance [45, 24] states that for humans, as intelligent beings, the world is perceived not only in terms of physical aspects (e.g. object shapes and spatial relationships) but also in terms of the possibilities of these objects for actions (affordances) and their limitations. In other words, affordances are the properties of an object or environment that can be perceived and acted upon by an organism. According to Gibson [45], these affordances are not inherent properties of an object or environment. They are perceived relational properties that arise from the interaction between an organism and its environment. For example, a cup affords holding liquids because of its shape and stability, which are perceived by an individual based on their previous experiences with it.

The theory of affordance has been applied in many subfields of computer science, including reasoning and robotics. It has been used to study how intelligent agents (physical or abstract) should perceive and interact with their environment, what type of information they need to retain, and how they make decisions based on the information available to them.

The affordance of an object also includes properties that it does not possess, i.e., its limitations [41]. For example, a glass may be used for drinking water, under a general implicit assumption that the water is at normal temperature, but may not be if the glass is shattered. Accordingly, as discussed in Chapter 3, we use two main categories of affordance: the ones that enable the statements, and the ones that disable them [41, 57].

2.2 Language and Visual-Language Models

Language modeling refers to various techniques to determine the likelihood of observing an upcoming word or token after a given sequence of tokens (e.g. predicting the next word in a sentence).
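As a minimal illustration of this next-token view of language modeling, the sketch below queries an off-the-shelf causal LM for the most likely continuations of a prompt; it assumes the HuggingFace transformers library and the public gpt2 checkpoint, and is not part of the thesis pipeline itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "A glass is used for drinking"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the next token, given the prompt so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id):>12s}  {prob.item():.3f}")
```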
Language Models (LMs) are statistical models that are used to process or generate natural language text [9]. The most advanced language models are based on neural networks such as Transformers[36, 54, 85]. These models are capable of processing vast amounts of text data and can generate coherent and realistic text that is difficult to distinguish from text written by humans [126]. Visual Language Models (VLMs), on the other hand, process both visual input and textual input. They are mainly based on the same underlying architectures as LMs (e.g. transformers) but they have means to convert images to a multitude of tokens and their embeddings. This makes the VLMs capable of processing both visual and textual token embeddings in a single pipeline[76, 134, 87, 67]. VLMs are used in a variety of tasks including, image captioning[81], and visual question-answering[163, 136]. 2.3 Weak Supervision Weak supervision is a machine learning approach that involves training models using imperfect or incomplete labels or annotations that can be obtained without expensive (both time and resource) annotations [120]. This approach is opposed to the traditional supervised learning, which relies on expensive fully-labeled data. 9 Figure 2.1: Weak supervision methods based on Snorkel[120]. The labels used in weak supervision are usually generated automatically or heuristically, rather than by human experts [112]. Figure 2.1 summarizes some popular weak-supervision methods implemented in a commercial tool. As depicted, the class “Patterns & Dictionaries” uses well-curated heuristics to label the data. For example, in a sentiment analysis task[97], weak supervision heuristics might involve using keywords such as good or bad to automatically label text data. The low-cost nature of these annotations, allows us to obtain orders of magnitude more (noisy) data to train our models on a new task. In addition, other methods such as re-purposing of external knowledge [4, 20, 92] or other types of domain knowledge [140] also lie under weak supervision methods. The main principle in weak supervision relies on the ability of modern models to learn patterns and generalize on a large number of instances (depicted in Figure 2.2) even when the training data is less accurate than traditional supervised learning. This makes them a popular option for new tasks that do not have a large body of annotated data [28, 98, 112, 164, 35, 136]. 10 Figure 2.2: Weak supervision relies on models to generalize well on noisy labeled instances. Source: Snorkel[120]. 11 Chapter 3 Preconditions Attributed to Commonsense Knowledge Humans can seamlessly reason with circumstantial preconditions of commonsense knowledge. We understand that a glass is used for drinking water, unless the glass is broken or the water is toxic. Despite state-of-the-art (SOTA) language models’ (LMs) impressive performance on inferring commonsense knowledge, it is unclear whether they understand the circumstantial preconditions. To address this gap, we propose a novel challenge of reasoning with circumstantial preconditions. We collect a dataset, called PaCo, consisting of 12.4 thousand preconditions of commonsense statements expressed in natural language. Based on this dataset, we create three canonical evaluation tasks and use them to examine the capability of existing LMs to understand situational preconditions. Our results reveal a 10-30% gap between machine and human performance on our tasks, which shows that reasoning with preconditions is an open challenge. 
Upon acceptance, we will release the dataset and the code used to test models. 3.1 Introduction Improving a system’s ability to reason with commonsense knowledge is at the frontier of natural language processing (NLP) research, as a critical component in many knowledge-driven tasks such as question answering [146, 142], machine reading comprehension [127], narrative cloze [94], and dialogue systems [1, 161]. Recently, dozens of systems [117, 65, 85, 36] and learning resources [130, 95, 125, 11] have been 12 Figure 3.1: Overview of the PaCo data collection and instances of the three tasks derived from it. proposed, focusing on various aspects of commonsense knowledge such as naive physics and naive psychology. In cognitive studies, the theory of affordance [45, 24] suggests that understanding the circumstances in which an action or statement is possible or impossible is a key aspect of human intelligence. For example, a glass may be used for drinking water, under an implicit assumption that the water is at normal temperature, but may not if the glass is shattered. Accordingly, we argue that for an NLP reasoner to understand common sense, it should comprehend the contextual preconditions associated with commonsense statements. Such contextual preconditions can naturally be categorized into two classes: the ones that enable the statements, and the ones that disable them [41, 57]. 13 Causal preconditions may be partially inferred from text [95, 70], however: 1) as is the case in many other aspects of common sense, we rarely write them explicitly in our text; 2) when mentioned in the text, it is difficult for models to distinguish whether they represent causation or correlation. Similar to our work, Rudinger et al. [125] collect the preconditions by crowdsourcing. Here, the preconditions are seen as soft assumptions, namely: weakeners and strengtheners, which provides a model only with the relative correlation between statements, and is not explicitly testing the model on the underlying preconditions of the statement. Instead, we propose to define the problem based on the crisp conditioning of disablers and enablers, which forces the LM to learn the decisive preconditions of a statement and facilitates explainability based on them. In comparison to a hard logical connection modeled by the crisp condition, although the notion of weakener is also helpful to the commonsense reasoner, it raises additional questions like “by how much?”, or “is the statement still valid?”. Whereas in the notion of disablers, even though annotations are more difficult to collect, it can at least take the system one step forward by sorting out the clutter of the irrelevant statements. This chapter presents a systematic study on the problem of situational preconditions expressed in natural language. As the first contribution, we define a new problem of reasoning with enabling and disabling preconditions associated with commonsense statements (Section 3.2). Given a statement, the task is to infer the preconditions that make the statement possible (enabling) or impossible (disabling). Understanding such preconditions of commonsense knowledge would enable reasoning systems relying on a commonsense knowledge base to decide when to use a given commonsense statement. For example, given the statement “Glass is used for drinking water” in ConceptNet [139], a system should know that it is only possible if the “water is not too hot”, and it is impossible when “the water is toxic”. 
To foster research on preconditions of commonsense knowledge, we develop PaCo, a rich crowdsourced dataset with enabling and disabling preconditions of commonsense statements (Section 3.3), as the second contribution of this chapter. For PaCo, we start by extracting available commonsense statements. We then design and execute a crowdsourcing task to gather preconditions of the statements by asking participants: what makes the statement possible/impossible? for each of the statements. PaCo contains 12.4K labeled preconditions (6.6K enabling, 5.8K disabling), corresponding to 3 × 1K edges from three representative relations in ConceptNet [139], covering knowledge on utility, causality, and motivation. Example preconditions are illustrated in Fig. 3.1. These tasks for the first time allow analysis beyond what is done in prior work that covers enabling preconditions only. Particularly, they realize a head-to-head comparison of enabling and disabling statements, which was not possible before. Besides, they allow analysis of the impact of the knowledge types (e.g., utility) on the task difficulty for both humans and neural language models.

Our third contribution is an extensive NLP benchmarking based on PaCo. To this end, we transform PaCo into three tasks on Preconditions: Natural Language Inference (P-NLI), Multiple-Choice Question Answering (P-MCQA), and Generation (P-G). The three canonical tasks seek to provide a comprehensive evaluation of the ability of natural language reasoners to understand circumstantial preconditions (Section 3.4). These three tasks examine the understanding of preconditions of a number of SOTA language models and reasoners, such as DeBERTa [54] and UnifiedQA [65]. Results show that SOTA methods largely fall behind human performance, indicating the need for further research to improve the comprehension of contextual preconditions by commonsense reasoners (Section 3.5).

3.2 Preconditions in Commonsense Reasoning

Problem Definition. Commonsense statements describe well-known information about concepts, and, as such, they are acceptable by people without need for debate [128, 62]. A commonsense statement can be formalized as s = (h, r, t), where h and t are head and tail concepts, and r is the relation type. Following the notion of “causal complex” [57], we define the precondition set P_f as a collection of eventualities (events or states) that result in s happening. Such preconditions contain eventualities that either allow (p_f^+ ∈ P_f) or prevent (p_f^- ∈ P_f) the statement to happen. Here, to prevent means to allow the negation of the statement [41]. While enumerating a priori all such causal eventualities is impossible, people are still able to reason about them in a given situation [57]. Notably, preconditions are implicit, i.e., we usually omit them from conversation as they are considered obvious [47].
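To make the notation concrete, the sketch below shows one way a statement s = (h, r, t) and its precondition set P_f could be represented in code. The class and field names are illustrative assumptions, not PaCo's released data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Precondition:
    text: str       # e.g., "the glass is shattered"
    polarity: str   # "allow" (enabling, p_f^+) or "prevent" (disabling, p_f^-)

@dataclass
class Statement:
    head: str       # h, e.g., "glass"
    relation: str   # r, e.g., "UsedFor"
    tail: str       # t, e.g., "drinking water"
    preconditions: List[Precondition] = field(default_factory=list)

# The running example from the text, with one enabling and one disabling precondition.
glass = Statement(
    head="glass", relation="UsedFor", tail="drinking water",
    preconditions=[
        Precondition("the water is at normal temperature", "allow"),
        Precondition("the glass is shattered", "prevent"),
    ],
)
```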
Shoham [133] and Hobbs [57] distinguish between two types of preconditions, based on causal connections (hard) or material implication (tends to cause; soft). Here we focus on the more restrictive, hard preconditions; for soft preconditions, see [125].

In this work, the problem of reasoning with preconditions is attempted in two ways: discriminative and generative (cf. Table 3.1). In the discriminative setting, given a statement f and a precondition p, a model is expected to infer if the fact is still valid (p ∈ P_f^+) or not (p ∈ P_f^-). In the generative setting, given only the statement f, a model is requested to compose a reasonable disabling (p_f^-) or enabling (p_f^+) precondition.

Motivating Examples. In a preliminary investigation, we assess the ability of SOTA language models, GPT2 [116] and UnifiedQA [65], to reason with preconditions. As shown in Table 3.1, both models appear to fall short of reasoning with enabling and disabling factors of commonsense statements, regardless of whether the prompt is presented as multiple-choice question answering (row 1) or as text completion (rows 2-4). This observation is not surprising, considering that reasoning with preconditions is an under-addressed research challenge. Yet, it motivates the urgency for this problem to be studied in depth, which is the goal of this chapter.

Table 3.1: Test of language models' understanding of preconditions.
Model | Input | Output
UnifiedQA | A net is used for catching fish. What makes this impossible? (A) You are in water (B) You are in downtown LA | You are in water
UnifiedQA | A net is used for catching fish. What makes this impossible? | A net is used for catching fish.
GPT2 | A glass is used for drinking water only if, | the glass is covered in a protective coat or can be removed with cold water.
GPT2 | A glass is used for drinking water only if, | the water is acidic, not fresh.

3.3 PaCo

This section introduces the procedure of developing the PaCo dataset. We start by selecting relevant commonsense facts (Section 3.3.1), and crowdsourcing preconditions for each statement (Section 3.3.2). Finally, we present the PaCo data statistics (Section 3.3.3).

3.3.1 Edge Selection

We extracted relevant commonsense facts from ConceptNet [139]. We chose ConceptNet due to its breadth of knowledge and popularity in prior research [40, 78, 88]. ConceptNet is a publicly available commonsense knowledge resource. It contains 3.4 million English assertions between concepts (e.g., “Glass”, “Drinking_water”, “Person”), and covers a wide range of knowledge types, including spatial, physical, and temporal knowledge, as well as social and cognitive knowledge about everyday situations.

We performed a pilot analysis of different knowledge types in ConceptNet to help us decide which of them were suitable to be annotated with preconditions. Namely, we sampled 20 random edges for each relation and checked how well one could annotate them with preconditions. Our analysis revealed that not all relations lend themselves naturally to annotation with enabling or disabling preconditions. Specifically, we observed that some relations (e.g., RelatedTo) are underspecified in their meanings, and others, like IsA, are often truisms. Our investigation revealed that it is difficult to come up with preconditions for these relations. Furthermore, we observed that some relations, like CreatedBy, could be easily annotated with enabling conditions, but not with disabling ones. The opposite was observed for PartOf. We opted for the relations UsedFor, Causes, and Desires, because of their suitability for annotation of preconditions, their relatively high number of statements, and their representativeness of three different dimensions of knowledge: utility, temporal, and motivational knowledge [60].
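As an illustration of this edge-selection step, the sketch below pulls a handful of UsedFor assertions from the public ConceptNet 5 REST API and lexicalizes them with simple relation templates. The endpoint usage and the template wording are assumptions for illustration; the actual PaCo templates are described in Appendix 3.10.4.

```python
import requests

# Relation-specific lexicalization templates (illustrative, not the exact PaCo templates).
TEMPLATES = {
    "UsedFor": "{head} is typically used for {tail}.",
    "Causes": "{head} typically causes {tail}.",
    "Desires": "{head} typically desires {tail}.",
}

def fetch_edges(relation: str, limit: int = 20):
    """Query the public ConceptNet API for edges of one relation."""
    resp = requests.get(
        "https://api.conceptnet.io/query",
        params={"rel": f"/r/{relation}", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    # Keep English edges only; other languages are also returned by the API.
    return [e for e in resp.json()["edges"]
            if e["start"].get("language") == "en" and e["end"].get("language") == "en"]

def lexicalize(edge: dict, relation: str) -> str:
    """Turn a ConceptNet edge into a human-readable statement."""
    return TEMPLATES[relation].format(head=edge["start"]["label"], tail=edge["end"]["label"])

if __name__ == "__main__":
    for edge in fetch_edges("UsedFor", limit=5):
        print(lexicalize(edge, "UsedFor"))
    # e.g., "a net is typically used for catching fish."
```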
Following the intuition that not all statements can be annotated with preconditions, e.g., (Looking through telescope, Usedfor, viewing heavens), we computed the correlation between a hand-annotated suitability judgment for the precondition 17 statements, and the several quantitative scores: DICE metrics (Chalier, Razniewski, and Weikum [22]; e.g., salience), LM perplexity, and edge weights in ConceptNet. However, none of these scores had a strong correlation with the suitability for annotating preconditions (Appendix 3.11.1 contains the calculated correlations for UsedFor). Therefore, we opted for the relations UsedFor, Causes, and Desires, because of their suitability for annotation of preconditions, high number. Also they are representative of three different dimensions of knowledge: utility, temporal, and motivational knowledge [60]. We sampled 1K edges from each and lexicalized them into human readable sentences using relation-specific templates (see Appendix 3.10.4). 3.3.2 Data Collection Mechanical Turk We used Amazon Mechanical Turk [31] to collect data on preconditions for the lexicalized statements as part of Institutional Review Boards (IRB) approved (as exempt) study. For this, we asked the participants to provide short responses to the question: “What makes the statement possible/impossible?” for each of the lexicalized statements from ConceptNet. Due to financial limitations, we restricted our annotations to 3 enabling and 3 disabling judgments for each statement. While the goal of PaCo is not to exhaust all possible preconditions associated with each statement, for some statements we observed duplicate answers, signaling a near-saturation point. Further details on the data collection design, including annotator qualification, and survey design details are given in Appendix 3.10. With this procedure, we collected a total of 18K enabling and disabling preconditions. Quality Control We use a mixture of automated and expert annotations for quality control. The automated quality control consisted of three rules that we can programmatically check: 1) not using negative words like “not”, 2) not using pronouns, and 3) proper sentence lengths. In order to measure the informativeness and relevance of the remaining annotations, we use expert annotation. Specifically, for a 18 subset of the recorded responses we asked the annotator to classify the response into three categories, each representing a specific level of informativeness in the response: 1) Truism: the response is correct, but it is not specific to the situation (e.g., being broken/functional or being available/unavailable); 2) Informative: the response is correct and is adding information that is not mentioned in the prompt, while not being a truism (i.e., is specific); 3) Irrelevant: any response that is not placed into the previous two categories. For PaCo, we remove the answers from the Irrelevant category, while truism answers could be removed subsequently if so desired. 3.3.3 Dataset Statistics This data collection procedure resulted in a total of 9k enabling and 9k disabling preconditions for each of the 1k ConceptNet edges selected for UsedFor, Causes, and Desires relations respectively. After filtering out responses in low quality and those marked as Invalid by crowd annotators, PaCo contains 12.4K annotations (6.6K enabling, 5.8K disabling). 
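The automated quality-control rules described above (no explicit negations, no pronouns, proper sentence length) can be approximated with a short programmatic filter; the word lists and length bounds below are illustrative assumptions rather than the exact values used for PaCo.

```python
NEGATION_WORDS = {"not", "n't", "never", "no"}  # assumed list of negative words
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "this", "that"}  # assumed list

def passes_automated_checks(response: str, min_words: int = 3, max_words: int = 20) -> bool:
    """Apply the three programmatic filters from Section 3.3.2 to one crowdsourced precondition."""
    tokens = response.lower().replace(",", " ").split()
    if not (min_words <= len(tokens) <= max_words):   # rule 3: proper sentence length
        return False
    if any(t in NEGATION_WORDS for t in tokens):      # rule 1: no negative words like "not"
        return False
    if any(t in PRONOUNS for t in tokens):            # rule 2: no pronouns
        return False
    return True

print(passes_automated_checks("the net has a large hole in it"))  # False: contains the pronoun "it"
print(passes_automated_checks("the water is toxic"))              # True
```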
Our expert annotation on 10% of the 6K annotations with the UsedFor relation showed that 93% of the crowdsourced responses are informative, whereas only 5% of the responses are irrelevant. The quality of the responses is lower for the two other relations: 70% informative responses for Causes and 61% for Desires. This shows that the two relations are semantically more challenging for human annotators compared to a utility relation like UsedFor. We also observed that on average it took the annotators 3.5 times longer to submit a response for these two relations, which confirms that UsedFor is the most suitable of the three relations for associating preconditions.

3.4 Tasks

Given the data collected in Section 3.3, we devise three complementary tasks to showcase the possible ways one could use the PaCo data to evaluate current SOTA models' understanding of circumstantial preconditions. We select Preconditions Natural Language Inference (P-NLI) and Preconditions Multiple-Choice Question Answering (P-MCQA) as representative discriminative tasks, and the Preconditions Generation (P-G) task as a generative task. Table 3.2 summarizes the tasks and provides an example for each of them. In the rest of this section, we describe each task in detail and discuss the steps to prepare it from the raw precondition data. This preparation is fully automatic, and no human annotation or supervision signals have been used.

Table 3.2: Example of the three tasks in PaCo.
ID | Instance
P-NLI | Hypothesis: A net is used for catching fish. Premise: We are in a desert. Label: Contradiction
P-MCQA | Question: A net is used for catching fish. When is this impossible? Choices: (A) You are in sea, (B) The boat is moving, (C) Net has a large hole in it.
P-G | Question: A net is used for catching fish. When is this impossible? References: (-) Net has a large hole in it, (-) You are in downtown LA, (-) There are no fish in the water

P-NLI Task. Natural Language Inference (NLI) refers to tasks where, given a sentence pair composed of a hypothesis and a premise, the system has to decide whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise [156]. Each of the preconditions (e.g., “water is clean” or “water is polluted”) of a statement can directly serve as a premise in the sense of NLI. Enabling preconditions correspond to entailment cases (e.g., “water is clean” entails “water is used for drinking”), whereas disabling preconditions can be annotated as contradictions (e.g., “water is polluted” contradicts “water is used for drinking”). The P-NLI task consists of 12.4K entries, with 6.6K entailment and 5.8K contradiction cases.

P-MCQA Task. PaCo can also be directly converted to a multiple-choice question answering (MCQA) task in three steps. First, for each statement, each enabling (disabling) response is paired with three disabling (enabling) responses from the same statement. These three responses naturally act as negative samples (distractors), allowing us to have high-quality and fair questions. The question of the MCQA instance is then formed by appending “What makes this possible?” or “... impossible?” to the lexicalized statement. Second, in order to have more distractors and increase the number of multiple-choice instances, we applied the two negative sampling methods used by Zhang, Lyu, and Callison-Burch [166]: Cosine Similarity Filtering, and Question/Answer Shuffling.
Finally, in order to remove annotation artifacts from the data, and hence trivial instances, and to prevent the models from exploiting these artifacts instead of answering the questions, we used the Lite variation of the Adversarial Filtering method, which was introduced in Sakaguchi et al. [127] and formalized in Bras et al. [19]. This resulted in a P-MCQA task with 47K multiple-choice questions, each with 4 choices.

P-G Task Despite our adversarial strategies, it remains possible that reasoning systems may identify annotation artifacts [50] in the data and solve the discriminative tasks without correctly performing the logical inference, as a result of those artifacts [19]. Hence, we provide a third formulation as a generative commonsense reasoning task. In this task, we present the system with the exact question that has been presented to the human annotators, thereby mimicking the human annotation task of writing down the precondition as a natural language sentence. We then evaluate the model's response using the human responses as references. After removing the low-quality and Invalid responses from PaCo, the P-G task consists of 5.2K instances, with an average of 2.4 reference sentences per instance.

3.5 Experiments

This section pitches SOTA language models against the three tasks derived from PaCo (Section 3.5.1), dives into the tuning process to pinpoint when the model starts to comprehend the task (Section 3.5.2), investigates how LMs react to different relation types (Section 3.5.3), and finally revisits the distinction between soft and hard preconditions (Section 3.5.4).

3.5.1 Evaluating SOTA on PaCo Tasks

We assess our benchmark through evaluating representative NLP systems on the three tasks. This part starts with details about the experimental setup (Section 3.5.1.1), followed by result analysis for the three tasks (Section 3.5.1.3).

3.5.1.1 Experimental Setup

For each task, we start from available pretrained models and evaluate their performance on the test set in zero-shot and fine-tuned setups. To create the test set, we use a uniform random split of the statements that each task's instances stem from. For the split we use a [0.45, 0.15, 0.40] ratio of the data for train/dev/test. The rationale for splitting based on the statements instead of the task instances is to prevent data leakage into the test sets through shared edges. The experiments are conducted on a commodity workstation with an Intel Xeon Gold 5217 CPU and an NVIDIA RTX 8000 GPU. For all the tasks, we use the allennlp [44] library for the Textual Entailment (TE) model [104] and use huggingface [157] for the rest of them. For the human evaluations of P-NLI and P-MCQA, we used a small (100) sample from the test subset of each task and asked a CS graduate student to answer them. We then report the respective evaluation metric based on the task, as detailed below.

3.5.1.2 Evaluation Protocols

For P-NLI, we use the F1-Macro score on the ground-truth labels and report the results on the unseen test split of the data. For P-MCQA, we evaluate the systems' performance based on their default evaluation protocols as discussed below. For RoBERTa [85], we use the LM coupled with a linear regression layer as the classification head.
In this method, the LM is tasked with embedding each question/answer pair, and the classification head assigns a score to the pair. Later, for each MC instance, the question/answer pair with the highest score is selected as the output choice. We report the accuracy score (code from [105]) based on the output choices from the model. For UnifiedQA, we follow the original setting by Khashabi et al. [65] and let the model conduct sequence-to-sequence generation based on the question. Here, the question and all choices are fed to the model, and it is expected to generate the correct choice's text. We then select the candidate choice that is closest to the generated answer and report the F1 score based on these selections. For P-G, to automatically evaluate the machine-generated answers of the models, we use the Bleu-2 [102] (code from [13]) and ROUGE-2 [79] (code from [157]) metrics. We do not use methods with larger n-gram matches (e.g., Bleu-4) for two reasons. First, the small number of reference sentences (at most 3) means most model outputs do not match any reference sentence. Second, relatively short reference sentences lead to no 4-gram matches and mostly zero Bleu-4 scores. For the human evaluation score of the machine-generated responses, we sample 100 responses and use a method similar to the quality control method in Section 3.3.2 (here we consider the Truism responses as Informative), and report the percentage of informative responses from tuned models.

3.5.1.3 Results and Discussions

We hereby separately discuss the performance of SOTA models on the three tasks in detail.

Model                0-Shot  Tuned
AllenNLP TE          0.34    0.85
RoBERTa-large-MNLI   0.47    0.90
BART-large-MNLI      0.48    0.90
DeBERTa-base-MNLI    0.37    0.91
DeBERTa-large-MNLI   0.36    0.94
DeBERTa-xl-MNLI      0.37    0.91
Expert Human         0.99    -
Random Baseline      0.5     -
Table 3.3: F1-Macro results of SOTA systems on P-NLI task based on PaCo. Best values are highlighted.

(1) P-NLI Results As shown in Table 3.3, all systems tend to get near-random results in the zero-shot setup. In the case of the BART-large-MNLI model, although the zero-shot F1-Macro score is higher, it is far from the human-level score (1.00). We observe that even models that are trained on large and diverse learning resources (e.g., MNLI [156]) are not able to perform well on P-NLI in a zero-shot fashion. The high scores after fine-tuning can be attributed to the systems exploiting the annotation artifacts of the data instead of learning to reason with preconditions. This claim will be further supported by the P-MCQA results.

Model            0-Shot  Tuned
RoBERTa-base     0.24    0.42
RoBERTa-large    0.22    0.22
UnifiedQA-small  0.32    0.50
UnifiedQA-base   0.23    0.59
UnifiedQA-large  0.28    0.68
Expert Human     0.92    -
Random Baseline  0.25    -
Table 3.4: Accuracy results of SOTA systems on P-MCQA task based on PaCo. Best values are highlighted.

(2) P-MCQA Results The P-MCQA task has all the intricacies of the original precondition data without the simple annotation artifacts, which makes it a better alternative for evaluating systems. As presented in Table 3.4, there is a significant gap between the ideal and machine performance on the P-MCQA benchmark, which further supports the novelty of PaCo and the tasks stemming from it. After investigating the answers, we observe that even the promising large models tend to confuse the enabling vs. disabling cases. For example, the UnifiedQA-large model mistakenly chooses the disabling response "Your car is out of fuel" for the enabling question "Gas are typically used for providing energy. What makes this possible?".
This might be explained by the observation that LMs tend to focus more on correlations of lexical occurrences and statistical patterns (e.g., gas and car/fuel) than on the actual question. In addition, similar to Zhou et al. [168], we observe that LMs lack understanding of linguistic phenomena like negation, and lean toward positive words.

Model            BLEU (0-Shot)  BLEU (Tuned)  ROUGE (Tuned)  HUM (Info.)
UnifiedQA-small  0.007          0.157         0.064          0.12
UnifiedQA-base   0.006          0.303         0.115          0.28
UnifiedQA-large  0.029          0.330         0.128          0.48
BART-base        0.046          0.091         0.140          0.19
BART-large       0.041          0.058         0.117          0.11
GPT2             0.097          0.133         0.067          0.36
Expert Human     -              -             -              1.0
Table 3.5: BLEU-2, ROUGE-2, and human evaluation Information score for results of SOTA systems on the P-G task. Zero-shot ROUGE scores are omitted to save space as they are negligible and do not add additional insight beyond the zero-shot BLEU-2. Best values are highlighted.

(3) P-G Results As summarized in Table 3.5, the automatic evaluation results, BLEU and ROUGE, are close to zero for all models. This shows that the models fall short of generating preconditions similar to the references even after fine-tuning. On the other hand, the human annotation sheds more light on the results and shows the relative comparison of the models. Here the automatic evaluation methods do not sufficiently distinguish between the models, as the differences among them are negligible. Hence, this comparison rather provides complementary insights to the two discriminative tasks. This is consistent with similar generation tasks [125], due to the small number of reference responses and the relatively large space of correct responses, which makes automatic evaluation of such machine responses an unresolved problem [25]. Upon analyzing the results we noticed several patterns in the generated responses. First, models tend to generate simple answers mostly discussing the existence or availability of the subject. For example, BART-base frequently generated patterns such as " is closed" or "You have ", some of which were informative. Second, similar to the P-MCQA task, the models tend to confuse enabling and disabling preconditions. For example, BART-large generated the enabling precondition "The clothes are dirty" instead of a disabling precondition for the statement "Washing clothes are used for making fresh again".

3.5.2 Diving into the Tuning Process

In the above evaluation on P-NLI, we observe that all models get higher scores after fine-tuning. Here, we investigate the fine-tuning process to find at what point the model understands the requirements of the task.

Experimental Setup We focus on the RoBERTa-large-MNLI [85] model in the P-NLI task. The experimental setup is similar to Section 3.5.1.1. We evaluate the model's performance on the test split of P-NLI at checkpoints during the tuning process instead of just at the end of it. Checkpoints are based on the amount of tuning data the model has observed (10%, 20%, · · · , 100%).

Figure 3.2: F1-Score of fine-tuning RoBERTa-large-MNLI with increasing amounts of training (tuning) data from P-NLI.

Results Figure 3.2 plots the change in the model's score as it observes more tuning data. The slow saturation of the F1 score here suggests that the instances in P-NLI are not trivial for the model, and it actually has to see a lot of instances to be able to perform the task. Considering that RoBERTa-large-MNLI has been pre-trained on a vast corpus, this result shows the novelty and uniqueness of the PaCo data.
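The checkpointing scheme above can be reproduced with a short evaluation loop that re-trains on growing fractions of the training split and scores the fixed test split each time. The sketch below follows that protocol but uses a lightweight TF-IDF plus logistic-regression classifier as a stand-in for RoBERTa-large-MNLI, purely to keep the example self-contained; only the 10% increments and the Macro-F1 metric are taken from the text.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def learning_curve(train_pairs, train_labels, test_pairs, test_labels):
    """Score the fixed test split after seeing 10%, 20%, ..., 100% of the
    training data, mirroring the checkpoints plotted in Figure 3.2.
    Assumes the training pairs are shuffled so every slice has both labels."""
    test_text = [f"{premise} [SEP] {hypothesis}" for premise, hypothesis in test_pairs]
    scores = []
    for fraction in [i / 10 for i in range(1, 11)]:
        n = max(1, int(fraction * len(train_pairs)))
        train_text = [f"{p} [SEP] {h}" for p, h in train_pairs[:n]]
        vectorizer = TfidfVectorizer()
        features = vectorizer.fit_transform(train_text)
        model = LogisticRegression(max_iter=1000).fit(features, train_labels[:n])
        predictions = model.predict(vectorizer.transform(test_text))
        scores.append((fraction, f1_score(test_labels, predictions, average="macro")))
    return scores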
3.5.3 Discussion on Different Relation Types

Given that PaCo consists of three relation types, we next pose the question of how well the LMs can handle each relation type. Here, we break down the results presented in Section 3.5.1 per relation type and discuss the model performance on each type.

Experimental Setup Due to the simplicity of automatic evaluation, we focus on the two discriminative tasks, P-NLI and P-MCQA. The experimental setup here is similar to Section 3.5.1.1, except that for both zero-shot and fine-tuned settings we measure the results dissected by relation type as well as their aggregation.

Results On the P-NLI task, similar to the challenges for human annotators (Section 3.3.2), all NLI models tend to get lower accuracy on instances derived from the Causes and Desires relations, compared to UsedFor. For instance, the DeBERTa-large-MNLI model has a 6% gap between the performance on UsedFor and Causes instances. In the P-MCQA task, we observe a similar pattern between the Causes and Desires relations on one hand, and UsedFor on the other hand. For instance, the UnifiedQA-large model shows a 13% gap between instances with UsedFor and Desires relations. The detailed P-NLI and P-MCQA performance results dissected based on relation types are provided in Tables 3.9 and 3.10 in the Appendix.

3.5.4 Hard and Soft Preconditions

In this work, we argued for the use of hard preconditions as opposed to the soft preconditions used in previous works. Although semantically different, one may argue that using soft preconditions may help the models learn the task of reasoning with preconditions from already existing data. In this section we test this hypothesis.

Experimental Setup Using the approach presented in Section 3.4, we created an NLI resource from two available resources with soft preconditions: Rudinger et al. [125] and ATOMIC2020 [59] (details in Appendix 3.11.3). We focused on the RoBERTa-large-MNLI [85] model, fine-tuned it on the two resources, and evaluated it on the test set of P-NLI. The experimental setup here is similar to Section 3.5.1.1.

Results Although these resources have an order of magnitude more data (88K instances in ATOMIC2020 [59] and 236K instances in Rudinger et al. [125]), there is a more than 10% gap between the performance of the model tuned on them on the P-NLI task and that of a model exposed to PaCo data. Table 3.11 presents the detailed results of tuning the RoBERTa-large-MNLI model on each of the NLI-style datasets, while being evaluated on P-NLI's test subset.

3.6 Related Works to PaCo

3.7 Resources of Preconditions.

A few resources have provided representations for preconditions of statements. ConceptNet [139]'s HasPrerequisite relation, ATOMIC [128]'s xNeed relation, and CauseNet [56] data can express concept dependencies, e.g., before one bakes bread, they need to buy ingredients and go to a store. Instead of adding new edges, our work annotates existing edges with contextual preconditions, which helps reasoners understand when to use an edge and when not to. ASER [165] and ASCENT [98] extract edges from unstructured text together with their associated context. As such, their knowledge is restricted by the information available in text, and they do not express disabling preconditions. It is also unclear to what extent their contextual edges express enabling preconditions, rather than coincidental information.
GLUCOSE [95] comes closer to our work, as they also extract enabling preconditions (e.g., Possession state that enables X) via crowdsourcing. Similarly, PeKo [70] extracts enabling preconditions between event pairs from available text and uses them to propose precondition identification and generation tasks between pairs of sentences. However, focusing only on causal relations in available text limits the scope of their tasks. Neither GLUCOSE nor PeKo explores disabling preconditions.

3.8 Reasoning with Preconditions.

Few efforts have been made on evaluating commonsense reasoning with preconditions. Rudinger et al. [125] focus on modeling weakeners and strengtheners of commonsense statements. Their work adds an update sentence to the hypothesis-premise pair in NLI-style tasks and asks whether it weakens or strengthens the relationship of the pair. Similarly, Hwang et al. [59]'s HinderedBy and Causes relations also capture a similar relationship for events, with the focus on presenting a knowledge resource. Our work differs as we focus on a crisp condition of enabling/disabling that can be particularly useful in logic-like reasoning tasks (as opposed to probabilistic inference). In addition, our task allows the reasoning to be processed as canonical NLI and can benefit from existing NLI architectures instead of modifying them.

3.9 Ethical Statement

Although we started from openly available data that is both crowdsource-contributed and neutralized, it may still reflect human biases [90]. During our data collection, we did not collect any sensitive information, such as demographic or identity characteristics. We limited the annotators to English-speaking users from mainly English-speaking countries such as the US, which may add cultural bias to the data. However, neither our crowd annotators nor the expert annotators noticed offensive language in the questions or responses. Given the urgency of addressing climate change, we have reported the detailed model sizes and runtimes associated with all the experiments in Appendix 3.12.

3.10 PaCo Data Collection Details

We used Amazon Mechanical Turk (AMT) [31] to collect PaCo. This enabled us to coordinate the study and access a large pool of English-speaking participants as our study population. AMT is especially suitable for this study as it facilitates access to a diverse population of participants, which is necessary for any notion of commonsense. Our study on AMT consists of two parts: a tutorial that also serves as a qualification test, and the main survey. In addition, we implemented two levels of quality control: in the first we use a response-checker script, and in the second we use human annotators to ensure only high-quality responses end up in the final data.

Figure 3.3: Data-collection and processing in a nutshell

3.10.1 Main AMT Survey

In the main survey, the participants are given a set of question-units (sample in Fig. 3.4), each consisting of a factual sentence (discussed in Section 3.10.2) followed by a prompt question; we then ask participants to write their response to each prompt question in the designated text box in front of the unit. The prompt questions are short questions that ask about the preconditions that enable or disable the factual sentence (e.g., what makes this possible?, when is this impossible?). The goal of this phase is to use the power of crowdsourcing to capture as much information as needed to create a dataset of enabling and disabling conditions.
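For concreteness, a question-unit from the survey can be represented programmatically as a lexicalized statement paired with one of the two prompt questions quoted above and the "does not make sense" flag. The sketch below is an illustrative reconstruction of that structure; the field names are assumptions, and this is not the actual survey-generation code.

from dataclasses import dataclass

PROMPTS = {
    "enabling": "What makes this possible?",
    "disabling": "When is this impossible?",
}

@dataclass
class QuestionUnit:
    statement: str             # lexicalized ConceptNet edge, e.g. "A net is used for catching fish."
    polarity: str              # "enabling" or "disabling"
    does_not_make_sense: bool = False  # checkbox for unintelligible statements

    def render(self) -> str:
        """Text shown to the participant above the response text box."""
        return f"{self.statement} {PROMPTS[self.polarity]}"

unit = QuestionUnit("A net is used for catching fish.", "disabling")
print(unit.render())  # A net is used for catching fish. When is this impossible?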
3.10.2 Gathering Factual Sentences

The first row in Fig. 3.3 summarizes the steps to create the factual sentences. Each factual sentence is a short sentence derived from an edge of a commonsense knowledge graph. The information in this knowledge graph is related to everyday situations such as usage of objects (A net is used for catching fish.), or capabilities of objects (Humans are capable of catching a bus.), etc. [139, 61, 128].

Figure 3.4: A sample question-unit used in the main survey on AMT

In our case, the knowledge associated with each factual sentence is extracted from ConceptNet [139], a well known commonsense resource. To limit the scope of this work we only focus on the UsedFor, Causes, and Desires relations from ConceptNet; however, the method can be extended to any other relation from any other knowledge graph. To convert the knowledge graph edges to human-readable factual sentences, we used automatic lexicalization methods, similar to [88, 15]. In this method, we define a set of templates to convert the edge to a set of sentence candidates, and then use the perplexity score of a language model to pick the best candidate for each edge. The lexicalization is explained in more detail in Appendix 3.10.4. Since ConceptNet's knowledge is not perfect, some of the generated factual sentences may not fully make sense. Additionally, the automatic conversion of edges to sentences is not perfect, hence some sentences may have odd grammar (e.g., An net is used for catch fish). Consequently, some of the question-units may be hard to understand or simply be wrong. To help us find those question-units and ignore them in future iterations, each unit is presented with an adjacent checkbox labeled This does not make sense. The participant may choose to select the checkbox and skip answering that prompt. To make the payment structure fair for the participants, they get paid regardless of their response.

3.10.3 Qualifying Participants

To ensure the participants can understand the task, we prepared detailed instructions that explain to the participants what they need to do and what the criteria for a good vs. bad response are. For example, in the instructions, we ask participants to avoid using negative sentences or pronouns to refer to objects. The instructions are 366 words long, with an expected reading time of under 5 minutes. Additionally, we have prepared a set of good/bad examples associated with each rule that can also be accessed in the tutorial. Each one of the good/bad examples comes with a short explanation clarifying the reason for its good/bad rating. The participants are then asked to take the qualification test as a check on whether they have read and understood the instructions. The qualification test contains 10 multiple-choice questions (each with two choices); each contains a question-unit (similar to those used in the main survey) together with two possible responses one might give to it. We have carefully designed each multiple-choice question such that it tests the participants' understanding of the rules individually and gives them feedback on their wrong answers. For example, for the rule discouraging the use of negative sentences, we have two questions where the wrong answers contain a negative verb. After successfully passing the test, participants with acceptable scores are granted a qualification badge that allows them to engage in the main survey.
It must be noted that the detailed instructions and the good/bad examples are both available in the main survey as a memory refresher for the participants. For the main survey, we have structured the payment on a per-HIT basis, such that the overall compensation equals $15 per hour of work. To simplify the annotation process, we grouped 4 statements together in one HIT, which helped reduce the annotators' wasted time. Participants are paid by the number of submitted HITs and there is no minimum number of HITs required of them. However, AMT allows us to ban participants that produce low-quality responses from further engaging in our study. The banned participants were fully compensated for their accepted work (according to the automatic evaluation script) up until the time they were banned.

3.10.4 PaCo Edge Lexicalization

Each of the selected edges is lexicalized using a combination of templates and masked LMs, as described by Ma et al. [88] and Bouraoui, Camacho-Collados, and Schockaert [15]. Similar to Ma et al. [88], we use a combination of templates for each relation (e.g., [subject] is used for [object], [subject] is used by [object]) and use the perplexity score from the LM to select the best lexicalization for each edge. However, this method does not guarantee the selection of the best lexicalization, as the perplexity score reflects the probability of the sentence tokens appearing in that specific order rather than the sentence's grammatical correctness. To mitigate this issue, in addition to the above method and following [15], we let the LM adjust the templates as well by adding one masked token to some templates (e.g., [subject] is used [MASK] [object]) and letting the LM fill the mask before filling the subject and the object slots of the template.

3.11 PaCo Results in More Details

3.11.1 Edge Selection Results

In this section, we provide further evidence to support the decision to use the UsedFor edges without any additional filtering. First, we showcase the lack of correlation between a hand-annotated usefulness indication of the precondition statements and existing quantitative methods/scores. Then, in a similar setup, we show that the UsedFor edges have a higher usefulness score. For the first study, we only focus on UsedFor edges. For each metric, we randomly sample 20 edges in each percentile bucket of the metric and hand-annotate the usefulness of the sampled edges in each bucket. Then, for each percentile-metric pair, we report the percentage of edges that were considered useful for our study. Table 3.6 summarizes the usefulness scores for three of the percentile buckets and three of the metrics. For the perplexity score we used the RoBERTa [85] language model on the lexicalized edges, for the Salient score we used the DICE metrics [22], and for the weight score we used the weights from ConceptNet [139] itself.

Metric   [0,10](%)  [50,60](%)  [90,100](%)
Perp.    75         95          90
Salient  80         100         95
Weight   95         90          90
Table 3.6: Hand-annotated usefulness indication of the precondition statements for top/bottom/mid percentile buckets of the quantitative methods. The [A, B] label indicates edges with the metric score in the range of [A, B] percentile of the metric score.

Metric     Score(%)
UsedFor    95
CapableOf  90
RelatedTo  40
Table 3.7: Hand-annotated usefulness indication of the precondition statements for three of the ConceptNet relations.

The usefulness scores suggest that a higher score may or may not result in more useful edges, which makes using these scores to filter edges tricky.
This study is by no means conclusive due to both the small sample sizes and a small number of trials, however, it led us to choose the edges solely based on relation type and leave further filterings to future work. For the second study, Table 3.7, we group edges based on their relations only and compute the usefulness score for each relation. The results showed that UsedFor edges tend to generally be more useful for our annotation task. This couple with the statement that UsedFor edges could be annotated with both enabling and disabling preconditions led us to focus on them for this study. 3.11.2 Additional Results from P-NLI Table 3.8 presents some error cases that each model predicts on the test subset of P-NLI. As our version of NLI only consists of Entailment and Contradiction labels, we discuss the results using binary classification terminology. In addition, the detailed results of Table 3.3 dissected by the relation types are provided in Table 3.9. 34 Model Statement Context * TE You can typically use self adhesive label for labelling things The self adhesive label runs out of glue. FP Acoustic ceiling is typically used for dampening sound. in rooms with noise above a certain decibel. FP You can typically use self adhesive label for labelling things. Labeling things that are wet. FP Farm is typically used for raising crops. Enough rain should be available. FN roberta You can typically use pets to provide companionship the pet is dog. FN Acoustic ceiling is typically used for dampening sound The sound is too loud FP Table 3.8: Test results of SOTA systems on NLI task based on the PaCo. FP: False Positive, FN: False Negative Model Rel. 0-Shot Tuned RoBERTa-large-MNLI UsedFor 0.34 0.85 Causes 0.48 0.90 Desires 0.48 0.90 All 0.47 0.90 BART-large-MNLI UsedFor 0.51 0.91 Causes 0.41 0.82 Desires 0.46 0.89 All 0.48 0.89 DeBERTa-base-MNLI UsedFor 0.37 0.91 Causes 0.32 0.84 Desires 0.38 0.88 All 0.37 0.89 DeBERTa-large-MNLI UsedFor 0.38 0.94 Causes 0.31 0.88 Desires 0.36 0.90 All 0.36 0.92 DeBERTa-xlarge-MNLI UsedFor 0.37 0.94 Causes 0.31 0.88 Desires 0.37 0.89 All 0.37 0.91 Table 3.9: F1-Macro results of SOTA systems on P-NLI task based on PaCo dissected based on relation type 35 Model Rel. 0-Shot Tuned RoBERTa-base UsedFor 0.23 0.38 Causes 0.21 0.41 Desires 0.27 0.38 All 0.24 0.42 RoBERTa-large UsedFor 0.19 0.21 Causes 0.28 0.23 Desires 0.23 0.22 All 0.22 0.22 UnifiedQA-small UsedFor 0.37 0.55 Causes 0.35 0.53 Desires 0.31 0.45 All 0.32 0.50 UnifiedQA-base UsedFor 0.56 0.67 Causes 0.21 0.60 Desires 0.22 0.53 All 0.23 0.59 UnifiedQA-large UsedFor 0.31 0.76 Causes 0.26 0.68 Desires 0.26 0.61 All 0.28 0.68 Table 3.10: Accuracy results of SOTA systems on P-MCQA task based on PaCo 3.11.3 Details of Soft Preconditions on P-NLI In order to convert the ATOMIC2020 [59] to an NLI-style task, we method similar to P-NLI and focused on three relations HinderedBy, Causes, and xNeed. From these relations, HinderedBy is converted to Contradiction and the rest are converted to Entailment instances. For converting Rudinger et al. [125], we focused on SNLI subset of their data and used the concatenation of SNLI’s “Hypothesis” and “Premise” as hypothesis and their “Update” sentence as premise. Table 3.11, presents the detailed results of tuning RoBERTa-large-MNLI model on each of the NLI-style datasets, while being evaluated on P-NLI’s test subset. 
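The conversions described in Section 3.11.3 are mechanical, so they can be sketched in a few lines. In the snippet below, the relation-to-label mapping for ATOMIC2020 follows the text (HinderedBy becomes Contradiction; Causes and xNeed become Entailment); treating the ATOMIC tail as the premise and mapping δ-NLI weakeners to contradictions are assumptions consistent with the P-NLI construction, and the field names are illustrative.

def atomic_to_nli(head: str, relation: str, tail: str):
    """Map an ATOMIC2020 triple to a P-NLI-style instance (Section 3.11.3)."""
    label_map = {"HinderedBy": "contradiction", "Causes": "entailment", "xNeed": "entailment"}
    if relation not in label_map:
        return None
    # Assumption: the tail (the precondition-like event) plays the role of the premise.
    return {"premise": tail, "hypothesis": head, "label": label_map[relation]}

def dnli_to_nli(snli_premise: str, snli_hypothesis: str, update: str, update_type: str):
    """Map a defeasible-NLI (SNLI subset) instance: the concatenated SNLI
    hypothesis and premise form the hypothesis, the update is the premise.
    Assumption: weakeners map to contradiction, strengtheners to entailment."""
    label = "contradiction" if update_type == "weakener" else "entailment"
    return {"premise": update, "hypothesis": f"{snli_premise} {snli_hypothesis}", "label": label}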
3.12 PaCo Model Sizes and Run-times For table 3.3, Runtimes: TE=2hr,rbrta=2.5hr, dbrta-base=0.5hr, dbrta-large=2hr, dbrta-xlarge=3.5hr, BARTlarge=2hr and #params: TE=0.5M, rbta=356M, dbrta-base=141M, dbrta-large=401M, dbrta-xlarge=751M, BART-large=407M. For table 3.4, Runtimes:rbta-base=1hr, rbta-large=2hr, uqa-small=1hr, uqa-base=4hr, 36 Tune Dataset Relation F1-Macro PaCo UsedFor 0.85 Causes 0.90 Desires 0.90 All 0.90 Hwang et al. [59] UsedFor 0.50 Causes 0.50 Desires 0.45 All 0.48 Rudinger et al. [125] UsedFor 0.84 Causes 0.80 Desires 0.82 All 0.83 Table 3.11: Results of RoBERTa-large-MNLI model on test set of P-NLI after being tuned on different datasets, dissected based on relation type. uqa-large=20hr and #params: rbta-base=124M,rbta-large=355M, uqa-small=60M, uqa-base=222 M,uqalarge=737M. In table 3.1, Runtimes: uqa, gpt2=10min and #params: gpt2=1.5B. Finally in table 3.5, Runtimes:uqasmall=1hr, uqa-base=2hr, uqa-large=6hr, gpt2=1.5B, bart-base=139M, bart-large= and #params: uqa-small=60M,uqabase=222 M, uqa-large=737M, gpt2=1.5B, bart-base=139M, bart-large=406M. 3.13 Conclusions and Future Work We presented, PaCo, a dataset of 12.4K collected enabling and disabling preconditions of everyday commonsense statements from ConceptNet. We utilize this resource to create three tasks for evaluating the ability of systems to reason over circumstantial preconditions, namely: P-NLI, P-MCQA, and P-G. Our evaluation shows that SOTA reasoners largely fall behind human performance, indicating the need for further investigation to develop precondition-aware systems. Future work should cover the inclusion of preconditions in logical reasoning of the neuro-symbolic reasoners. It should also expand to multimodal setup or investigate using weak-supervision to gather preconditions. Alternatively, we can leverage the contributed resource to develop generative models for automated context-aware knowledge base construction [138]. 37 Chapter 4 PInKS: Preconditioned Commonsense Inference with Weak Supervision Reasoning with preconditions such as “glass can be used for drinking water unless the glass is shattered” remains an open problem for language models. The main challenge lies in the scarcity of preconditions data and model’s lack of support for such reasoning. We present PInKS, Preconditioned Commonsense Inference with WeaK Supervision, an improved model for reasoning with preconditions through minimum supervision. We show, both empirically and theoretically, that PInKS improves the results on benchmarks focused on reasoning with the preconditions of commonsense knowledge (up to 40% Macro-F1 scores). We further investigate PInKS through PAC-Bayesian informativeness analysis, precision measures, and ablation study.∗ 4.1 Introduction Inferring the effect of a situation or precondition on a subsequent action or state (illustrated in Fig. 4.1) is an open part of commonsense reasoning. It requires an agent to possess and understand different dimensions of commonsense knowledge [158], e.g. physical, causal, social, etc. This ability can improve many knowledge-driven tasks such as question answering [146, 142], machine reading comprehension [127], ∗Code and data on https://github.com/luka-group/PInKS 38 and narrative prediction [94]. It also seeks to benefit a wide range of real-world intelligent applications such as legal document processing [51], claim verification [100], and debate processing [155]. Figure 4.1: Examples on Preconditioned Inference and the NLI format they can be represented in. 
Multiple recent studies have made an effort on reasoning with preconditions of commonsense knowledge [125, 111, 95, 59]. These studies show that preconditioned reasoning represents an unresolved challenge to state-of-the-art (SOTA) language model (LM) based reasoners. Generally speaking, the problem of reasoning with preconditions has been formulated as variations of the natural language inference (NLI) task where, given a precondition/update, the model has to decide its effect on a common sense statement or chain of statements. For example, PaCo [111] approaches the task from the causal (hard reasoning) perspective in term of enabling and disabling preconditions of commonsense knowledge, and evaluate reasoners with crowdsourced commonsense statements about the two polarities of preconditions of statements in ConceptNet [139]. Similarly, δ−NLI [125] formulates the problem from soft assumptions’ perspective, i.e., weakeners and strengtheners, and justifies whether the update sentence weakens or strengthens the textual entailment in sentence pairs from sources such as SNLI [16]. Obviously, both tasks capture the 39 same phenomena of reasoning with preconditions and the slight difference in format does not hinder their usefulness [43]. As both works conclude, SOTA models generally fall short of tackling these tasks. We identify two reasons for such shortcomings of LMs on reasoning with preconditions: 1) relying on expensive direct supervision and 2) the need for improved LMs to reason with such knowledge. First, current resources for preconditions of common sense are manually annotated. Although this yields high-quality direct supervision, it is costly and not scalable. Second, off-the-shelf LMs are trained on free-text corpora with no direct guidance on specific tasks. Although such models can be further fine-tuned to achieve impressive performance on a wide range of tasks, they are far from perfect in reasoning on preconditions due to their complexity of need for deep commonsense understanding and lack of large-scale training data. In this work, we present PInKS (see Fig. 4.2), a minimally supervised approach for reasoning with the precondition of commonsense knowledge in LMs. The main contributions are 3 points. First, to enhance the training of the reasoning model (Section 4.3), we propose two strategies for retrieving a rich amount of cheap supervision signals (Fig. 4.1). In the first strategy (Section 4.3.1), we use common linguistic patterns (e.g. “[action] unless [precondition]”) to gather sentences describing preconditions and actions associated with them from massive free-text corpora (e.g. OMCS [52]). The second strategy (Section 4.3.2) then uses generative data augmentation methods on top of the extracted sentences to induce even more training instances. As the second contribution (Section 4.3.3), we improve LMs with more targeted preconditioned commonsense inference. We modify the masked language model (MLM) learning objective to biased masking, which puts more emphasis on preconditions, hence improving the LM’s capability to reason with preconditions. Finally, for third contribution, we go beyond the empirical analysis of PInKS and investigate the performance and robustness through theoretical guarantees of PAC-Bayesian analysis [53]. 
Through extensive evaluation on five representative datasets (ATOMIC2020 [59], WINOVENTI [37], ANION [63], PaCo [111] and DNLI [125]), we show that PInKS improves the performance of NLI models, up to 5% Macro-F1 without seeing any task-specific training data and up to 40% Macro-F1 after being 40 incorporated into them (Section 4.4.1). In addition to the empirical results, using theoretical guarantees of informativeness measure in PABI [53], we show that the minimally supervised data of PInKS is as informative as fully supervised datasets (Section 4.4.2). Finally, to investigate the robustness of PInKS and the effect of each component, we focus on the weak supervision part (Section 4.5). We perform an ablation study of PInKS w.r.t. the linguistic patterns themselves, the recall value associated with linguistic patterns, and finally contribution of each section to overall quality and the final performance. Figure 4.2: Overview of the three minimally supervised methods in PInKS. 4.2 Problem Definition Common sense statements describe well-known information about concepts, and, as such, they are acceptable by people without the need for debate [128, 34]. The preconditions of common sense knowledge are eventualities that affect the happening of a common sense statement [57]. These preconditions can either allow or prevent the common sense statement in different degrees [125, 111]. For example, Qasemi et al. [111] model the preconditions as enabling and disabling (hard preconditions), whereas Rudinger et al. [125] model them as strengthening and weakening(soft preconditions). Beyond the definition of preconditions, 41 the task of inference with preconditions is also defined differently in the literature. Some task definitions have strict constraints on the format of statement, e.g. two sentence format [125] or being human-related [128], whereas others do not [37, 111]. To unify the definitions in the available literature, we define the preconditioned inference task as below: Definition 1. Preconditioned Inference: given a common sense statement and an update sentence that serves as the precondition, is the statement still allowed or prevented? This definition is consistent with definitions in the literature . First, similar to the definition by Rudinger et al. [125], the update can have different levels of effect on the statement, from the causal connection (hard) to material implication (soft). Second, similar to the one Qasemi et al. [111], the statement can have any format. 4.3 Preconditioned Inference with Minimal Supervision In PInKS, to overcome the challenges associated with inference with preconditions, we propose two sources of weak supervision to enhance the training of a reasoner: linguistic patterns to gather rich (but allowably noisy) preconditions (Section 4.3.1), and generative augmentation of the preconditions data (Section 4.3.2). The main hypothesis in using weak-supervision methods is that pretraining models on a large amount of weak-supervised labeled data could improve the model’s performance on similar downstream tasks [120]. In weak supervision terminology for heuristics, the experts design a set of heuristic labeling functions (LFs) that serves as the generators of the noisy label [120]. These labeling functions can produce overlapping or conflicting labels for a single instance of data that will need to be resolved either with simple methods such as ensemble inference or more sophisticated probabilistic methods such as data programming [121], or generative [7]. 
Here, the expert still needs to design the heuristics to query the knowledge and convert the results to appropriate labels for the task. In addition, we propose a modified language modeling objective that uses biased masking to improve the precondition-reasoning capabilities of LMs (Section 4.3.3).

4.3.1 Weak Supervision with Linguistic Patterns

We curate a large-scale, automatically labeled dataset for both types of preconditions of commonsense statements by defining a set of linguistic patterns and searching through raw corpora. Finally, we have a post-processing filtering step to ensure the quality of the extracted preconditions.

Raw Text Corpora: In our experiments, we acquire weak supervision from two corpora: Open Mind Common Sense (OMCS) [135] and ASCENT [99]. OMCS is a large commonsense statement corpus that contains over 1M sentences from over 15,000 contributors. ASCENT has consolidated over 8.9M commonsense statements from the Web. First, we use sentence tokenization in NLTK [13] to separate individual sentences in the raw text. Each sentence is then considered as an individual statement to be fed into the labeling functions. We further filter out data instances based on the conjunctions used in the common sense statements after processing the labeling functions (discussed in the Post-Processing paragraph).

Text | Label | Action | Precondition
A drum makes noise only if you beat it. | Allow | A drum makes noise | you beat it.
Your feet might come into contact with something if it is on the floor. | Allow | Your feet might come into contact with something | it is on the floor.
Pears will rot if not refrigerated | Prevent | Pears will rot | refrigerated
Swimming pools have cold water in the winter unless they are heated. | Prevent | Swimming pools have cold water in the winter | they are heated.
Table 4.1: Examples from the collected dataset through linguistic patterns in Section 4.3.1.

Labeling Functions (LF): We design the LFs required for weak supervision with a focus on the presence of a linguistic pattern in the sentences based on a conjunction (see Tab. 4.1 for examples). In this setup, each LF labels the training data as Allowing, Preventing, or Abstaining (no label assigned) depending on the linguistic pattern it is based on. For example, as shown in Tab. 4.1, the presence of the conjunctions only if and if, with a specific pattern, suggests that the precondition Allows the action. Similarly, the presence of the conjunction unless indicates a Preventing precondition. We designed 20 such LFs based on individual conjunctions through manual inspection of the collected data in several iterations, for which details are described in Section 4.9.1.

Extracting Action-Precondition Pairs: Once a sentence has an assigned label, we extract the action-precondition pair using the same linguistic patterns. This extraction can be achieved by leveraging the fact that a conjunction divides a sentence into an action and a precondition, typically in the pattern "action conjunction precondition", as shown in Tab. 4.1. However, there could be sentences that contain multiple conjunctions. For instance, the sentence "Trees continue to grow for all their lives except in winter if they are not evergreen." includes two conjunctions, "except" and "if". Such co-occurring conjunctions in a sentence lead to ambiguity in the extraction process. To overcome this challenge, we further select among the patterns by measuring their precision†.
To do so, we sample 20 random sentences for each conjunction (400 total) and have two expert annotators manually label whether or not they are relevant to our task. If a sentence is relevant to the task, it is labeled as 1; otherwise, 0. We then average the scores of the two annotators for each pattern/conjunction to get its precision score. This precision score serves as an indicator of the quality of the preconditions extracted by the pattern/conjunction in the context of our problem statement. Hence, priority is given to the conjunction with the higher precision in case of ambiguity. Further, we also set a minimum precision threshold (=0.7) to filter out the conjunctions having a low precision score (8 LFs), indicating low relevance to the task of reasoning with preconditions (see Section 4.9.1 for the list of precision values).

†The amounts of labeled instances (non-abstaining) for each labeling function are relevant.

Post-Processing: On manual inspection of sentences matched by the patterns, we observed a few instances from random samples that were not relevant to the context of commonsense reasoning tasks, for example: How do I know if he is sick? or Pianos are large but entertaining. We accordingly filter out sentences that are likely to be irrelevant instances. Specifically, these include 1) questions, which are identified based on the presence of a question mark and interrogative words (the list of interrogative words is in Section 4.9.4), or 2) sentences that do not have a verb in their precondition. Through this process we end up with a total of 113,395 labeled action-precondition pairs, with 102,474 Allow and 10,921 Prevent assertions.

4.3.2 Generative Data Augmentation

To further augment and diversify the training data, we leverage another technique of retrieving weak supervision signals by probing LMs for generative data augmentation. To do so, we mask the nouns and adjectives (pivot-words) in the text and let a generative language model fill in the masks with appropriate alternatives. After masking the pivot-word and filling in the mask using the LM, we filter out the augmentations that change the POS tag of the pivot-word and then keep the top 3 predictions for each mask. In addition, to keep the diversity of the augmented data, we do not use more than 20 augmented sentences for each original statement (picked randomly). For example, in the statement "Dogs are pets unless they are wild", the pivot-words are "dogs", "pets" and "wild". Upon masking "dogs", using the RoBERTa (large) language model, we get valid augmentations such as "Cats are pets unless they are wild". Using this generative data augmentation, we end up with 7M labeled action-precondition pairs with 11% Prevent preconditions.

4.3.3 Precondition-Aware Biased Masking

To increase the LM's attention on preconditions, we use biased masking on conjunctions, which serve as the closest proxies to precondition reasoning. Based on this, we devise a biased masked language modeling loss that solely focuses on masking conjunctions in the sentences instead of random tokens. Similar to Dai et al. [33], we mask the whole conjunction word in the sentence and ask the LM to fill in the mask. The goal here is to start from a pretrained language model and, through this additional fine-tuning step, improve its ability to reason with preconditions. To use such a fine-tuned LM in an NLI module, we further fine-tune the "LM+classification head" on a subset of the MNLI [156] dataset. For the full list of conjunctions and implementation details, see Section 4.9.3.
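To make the pattern-based extraction of Section 4.3.1 concrete, the sketch below shows one Allow and one Prevent labeling function written against the "action conjunction precondition" structure of Tab. 4.1. The regular expressions are simplified illustrations of two of the 20 LFs; in PInKS the LFs are executed and reconciled through Snorkel (Section 4.4.1), which is omitted here.

import re

ALLOW, PREVENT, ABSTAIN = 1, 0, -1

def lf_only_if(sentence: str):
    """'{action} only if {precondition}' signals an Allowing precondition."""
    match = re.match(r"(?P<action>.+?)\s+only if\s+(?P<precondition>.+)", sentence, re.IGNORECASE)
    if match:
        return ALLOW, match["action"].strip(), match["precondition"].strip(" .")
    return ABSTAIN, None, None

def lf_unless(sentence: str):
    """'{action} unless {precondition}' signals a Preventing precondition."""
    match = re.match(r"(?P<action>.+?)\s+unless\s+(?P<precondition>.+)", sentence, re.IGNORECASE)
    if match:
        return PREVENT, match["action"].strip(), match["precondition"].strip(" .")
    return ABSTAIN, None, None

print(lf_unless("Swimming pools have cold water in the winter unless they are heated."))
# (0, 'Swimming pools have cold water in the winter', 'they are heated')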
4.4 Experiments

This section first showcases the improvements of PInKS on five representative tasks for preconditioned inference (Section 4.4.1). We then theoretically justify the improvements by measuring the informativeness of the weak supervision in PInKS using the PABI [53] score, and then experiment on the effect of precision (discussed in Section 4.3.1) on PInKS using the PABI score (Section 4.4.2). Additional analysis on various training strategies of PInKS is also provided in Section 4.11.

4.4.1 Main Results

Comparing the capability of models to reason with preconditions across different tasks requires canonicalizing the inputs and outputs of such tasks into the same format. We used natural language inference (NLI) as such a canonical format. PaCo [111] and δ-NLI [125] are already formulated as NLI, and the others can be converted easily using the groundwork laid by Qasemi et al. [111]. In NLI, given a sentence pair with a hypothesis and a premise, one predicts whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise [156]. Each task remains equivalent before and after the format conversion; hence the conversion is not expected to affect task performance, as discussed by Gardner et al. [43]. More details on this conversion process are in Section 4.10, and examples from the original target datasets are given in Tab. 4.8.

Setup To implement and execute labeling functions, and to resolve labeling conflicts, we use Snorkel [120], one of the SOTA frameworks for algorithmic labeling of raw data that provides easy-to-use APIs.‡ For more details on Snorkel and its setup, please see Appendix 4.9.2. For each target task, we start from a pretrained NLI model (RoBERTa-Large-MNLI [85]), fine-tune it according to PInKS (as discussed in Section 4.3), and evaluate its performance on the test portion of the target dataset in two setups: zero-shot transfer learning without using the training data of the target task (labeled as the PInKS column) and fine-tuned on the training portion of the target task (labeled as Orig.+PInKS). To facilitate comparison, we also provide the results for fully fine-tuning on the training portion of the target task and evaluating on its testing portion (labeled as the Orig. column; PInKS is not used here). To create the test set, if the original data does not provide a split (e.g., ATOMIC and Winoventi), following Qasemi et al. [111], we use uniform random sampling with a [0.45, 0.15, 0.40] ratio for train/dev/test. The experiments are conducted on a commodity workstation with an Intel Xeon Gold 5217 CPU and an NVIDIA RTX 8000 GPU. For all the tasks, we used pretrained models from huggingface [157], and utilized the PyTorch Lightning [39] library to manage the fine-tuning process. We evaluate performance by aggregating the Macro-F1 score (implemented in Pedregosa et al. [105]) on the ground-truth labels and report the results on the unseen test split of the data.

Target Task  Orig.  PInKS  Orig+PInKS
δ-NLI        83.4   60.3   84.1
PaCo         77.1   69.5   79.4
ANION        81.1   52.9   81.2
ATOMIC       43.2   48.0   88.6
Winoventi    51.1   52.4   51.3
Table 4.2: Macro-F1 (%) results of PInKS on the target datasets: no PInKS (Orig.), with PInKS in the zero-shot transfer learning setup (PInKS), and PInKS in addition to the original task's data (Orig.+PInKS). Bold values are cases where PInKS improves supervised results.

‡Other alternatives such as skweak [82] can also be used for this process.
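The zero-shot arm of the setup above reduces to running the pretrained NLI model over premise-hypothesis pairs and scoring the predictions with Macro-F1. The snippet below is a minimal sketch of that evaluation (fine-tuning, batching, and the PyTorch Lightning wrapper are omitted); the label names are read from the model configuration rather than hard-coded, and the data-loading side is left out.

import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()

def predict_label(premise: str, hypothesis: str) -> str:
    """Return the MNLI-style label (contradiction / neutral / entailment) for one pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))].lower()

def zero_shot_macro_f1(pairs, gold_labels):
    """Macro-F1 of the pretrained model on (premise, hypothesis) pairs, no fine-tuning."""
    predictions = [predict_label(premise, hypothesis) for premise, hypothesis in pairs]
    return f1_score(gold_labels, predictions, average="macro")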
47 Discussion Table 4.2 presents the evaluation results of this section. As illustrated, on ATOMIC [59] and Winoventi [37], PInKS exceeds the supervised results even without seeing any examples from the target data (zero-shot transfer learning setup). On δ-NLI [125], ANION [63] and ATOMIC [59], a combination of PInKS and train subset of target task (PInKS in low-resource setup) outperforms the target task results. This shows PInKS can also utilize additional data from target task to achieve better performance consistently across different aspects of preconditioned inference. 4.4.2 Informativeness Evaluation He et al. [53] proposed a unified PAC-Bayesian motivated informativeness measure, namely PABI, that correlates with the improvement provided by the incidental signals to indicate their effectiveness on a target task. The incidental signal can include an inductive signal, e.g. partial/noisy labeled data, or a transductive signal, e.g. cross-domain signal in transfer learning. In this experiment, we go beyond the empirical results and use the PABI measure to explain how improvements from PInKS are theoretically justified. Here, we use the PABI score for cross-domain signal assuming the weak supervised data portion of PInKS (Section 4.3.1 and Section 4.3.2) as an indirect signal for a given target task. We use PABI measurements from two perspectives. First, we examine how useful is the weak supervised data portion of PInKS for target tasks in comparison with fully-supervised data. And second, we examine how the precision of the linguistic patterns (discussed in Section 4.3.1) affects this usefulness. Setup We carry over the setup on models and tasks from Section 4.4.1. For details on the PABI itself and the measurement details associated with it, please see Section 4.13. For the aforementioned first perspective, we only consider PaCo and δ-NLI as target tasks, as they are the two main learning resources specifically focused on preconditioned inference (as defined in Section 2), which is not the case for others. We measure the PABI of the weak supervised data portion of PInKS on the two target tasks, and compare it with the 48 PABI of the fully-supervised data from Section 4.4.1. For the second perspective, we only focus on PInKS and consider PaCo as target task. We create different versions of the weak supervised data portion of PInKS with different levels of precision threshold (e.g. 0.0, 0.5) and compare their informativeness on PaCo. To limit the computation time, we only use 100K samples from the weak supervised data portion of PInKS in each threshold value, which is especially important in lower thresholds due to huge size of extracted patterns with low precision threshold. Informativeness in Comparison with Direct Supervision: Tab. 4.3 summarizes the PABI informativeness measure in comparison with other datasets with respect to PaCo [111] and δ-NLI [125]. To facilitate the comparison of PABI scores in Tab. 4.3, we have also reported the minimum achievable (“zero rate” classifier) and maximum achievable PABI scores. To clarify, to compute the maximum achievable PABI score, we consider the training subset of the target task as an indirect signal for the test subset. Here, we assume that the training subset is in practice the most informative indirect signal available for the test subset of any task. For the minimum achievable PABI score, we considered the error rate of the “zero rate” classifier (always classifies to the largest class) for computations of PABI. 
Our results show that PInKS is the most informative incidental signal on the δ-NLI target task and the second best on PaCo (less than 0.001 points of difference from the best signal). These PABI numbers are even more significant considering that PInKS is the only weak-supervision data that is automatically acquired, while the others are acquired through sometimes multiple rounds of human annotation and verification.

Indir. Task  PABI on PaCo  PABI on δ-NLI  Explanation
PInKS        52.2          66.7           - Best on δ-NLI
δ-NLI        52.3          85.5           - Max achievable on δ-NLI; Best on PaCo
PaCo         52.3          31.3           - Max achievable on PaCo
ANION        34.1          13.9
ATOMIC       20.9          17.4
Winoventi    36.4          53.4
Zero Rate    26.2          0.0            - Baseline
Table 4.3: PABI informativeness measures (x100) of PInKS and other target tasks w.r.t. PaCo and δ-NLI. Bold values represent the maximum achievable PABI score, obtained by considering the train subset as an indirect signal for the test subset of the respective data. The highest PABI score, excluding the max achievable, is indicated in italic.

Effect of Precision on Informativeness: Fig. 4.3 presents the PABI informativeness estimation on the weak supervision data under different precision threshold levels, and compares them with the "zero rate" classifier (always predicting the majority class). As illustrated, the informativeness shows a significant drop at lower precision thresholds, showcasing the importance of using high-precision templates in our weak-supervision task. For higher thresholds (0.95), the data mostly consists of Allow patterns, and the model drops to near the zero-rate informativeness baseline again. This susceptibility to pattern precision can be mitigated by having more fine-grained patterns on larger corpora. We leave further analysis on the precision of patterns to future work.

Figure 4.3: PABI informativeness measures of PInKS with different precision thresholds on PaCo.

4.5 Analysis on Weak Supervision

In this section, we shift focus from the external evaluation of PInKS on target tasks to analyzing the distinct technical components of PInKS. Here, through an ablation study, we try to answer four main questions to get more insight into the weak supervision provided by those components. First (Q1), how is each labeling
We sampled 100 preconditionsstatement-label triplets from three checkpoints in the pipeline: 1) extracted through linguistic patterns discussed in Section 4.3.1, 2) outcome of the generative augmentations discussed in Section 4.3.2, and 3) final data used in Section 4.3.3. Tab. 4.4 contains the average precision of the collected data, which shows the data has acceptable quality with minor variance in quality for different weak supervised steps in PInKS. Checkpoint Name Precision. % Linguistic Patterns from Section 4.3.1 78 Generative Augmentation from Section 4.3.2 76 Final Data used in Section 4.3.3 76 Table 4.4: Precision of the sampled preconditions-statement-label triplets from three checkpoints in the pipeline. §Convectional inner-annotator agreement (IAA) methods hence are not applicable. 51 (Q3) Effectiveness of Generative Augmentation: The main effect of generative data augmentation (Section 4.3.2) is, among others, to acquire PInKS additional training samples labeled as prevent from pretrained LMs. When considering PaCo as the target task, the PInKS that does not use this technique (no-augment-PInKS) sees a 4.14% absolute drop in Macro-F1 score. Upon further analysis of the two configurations, we observed that the no-augment-PInKS leans more toward the zero rate classifier (only predicting allow as the majority class) in comparison to the PInKS. (Q4) Effectiveness of Biased Masking: We focus on PaCo as the target task and compare the results of PInKS with an alternative setup with no biased masking. In the alternative setup, we only use the weak-supervision data obtained through PInKS to fine-tune the model and compare the results. Our results show that the Macro-F1 score for zero-shot transfer learning setup has a 1.09% absolute drop in Macro-F1 score, without the biased masking process. 4.6 Related Works to PInKS Reasoning with Preconditions Collecting preconditions of common sense and reasoning with them has been studied in multiple works. Rudinger et al. [125] uses the notion of “defeasible inference” [108, 72] in term of how an update sentence weakens or strengthens a common sense hypothesis-premise pair. For example, given the premise “Two men and a dog are standing among rolling green hills.”, the update “The men are studying a tour map” weakens the hypothesis that “they are farmers”, whereas “The dog is a sheep dog” strengthens it. Similarly, PaCo [111] uses the notion of “causal complex” from Hobbs [57], and defines preconditions as eventualities that either allow or prevent (allow negation [41] of) a common sense statement to happen. For example, for the knowledge “the glass is shattered” prevents the statement “A glass is used for drinking water”, whereas "there is gravity" allows it. In PaCo, based on Shoham [133] and 52 Hobbs [57], authors distinguish between two type of preconditions, causal connections (hard), and material implication (tends to cause; soft). Our definition covers these definitions and is consistent with both. Hwang et al. [59], Sap et al. [128], Heindorf et al. [56], and Speer, Chin, and Havasi [139], provided representations for preconditions of statements in term of relation types, e.g. xNeed in ATOMIC2020 [59]. However, the focus in none of these works is on evaluating SOTA models on such data. The closest study of preconditions to our work are Rudinger et al. [125], Qasemi et al. [111], Do and Pavlick [37] and Jiang et al. [63]. In these works, direct human supervision (crowdsourcing) is used to gather preconditions of commonsense knowledge. 
They all show the shortcomings of SOTA models in dealing with such knowledge. Our work differs in that we rely on a combination of distant supervision and targeted fine-tuning, instead of direct supervision, to achieve on-par performance. Similarly, Mostafazadeh et al. [95] and Kwon et al. [70] also study the problem of reasoning with preconditions. However, they do not explore preventing preconditions.

Weak Supervision: In weak supervision, the objective is similar to supervised learning. However, instead of using human/expert resources to directly annotate unlabeled data, one can use the experts to design user-defined patterns to infer "noisy" or "imperfect" labels [123, 164, 35, 136], e.g. using heuristic rules. In addition, other methods such as re-purposing external knowledge [4, 20, 92] or other types of domain knowledge [140] also lie in the same category. Weak supervision has been used extensively in NLU. For instance, Zhou et al. [167] utilize weak supervision to extract temporal commonsense data from raw text, Brahman et al. [18] use it to generate reasoning rationales, Dehghani et al. [35] use it for improved neural ranking models, and Hedderich et al. [55] use it to improve translation in African languages. Similar to our work, ASER [165] and ASCENT [98] use weak supervision to extract relations from unstructured text. However, they do not explore preconditions and cannot express preventing preconditions. As they do not focus on reasoning evaluation, the extent to which their contextual edges express allowing preconditions is unclear.

Generative Data Augmentation: Language models can be viewed as knowledge bases that implicitly store vast knowledge about the world. Hence, querying them as a source of weak supervision is a viable approach. Similar to our work, Wang et al. [147] use LM-based augmentation for saliency of data in tables, Meng et al. [91] use it as a source of weak supervision in named entity recognition, and Dai, Song, and Wang [32] use masked LMs for weak supervision in entity typing.

4.7 Ethical Consideration
We started from openly available data that is both crowdsource-contributed and neutralized; however, it may still reflect human biases. For example, in the case of PaCo [111], ConceptNet is used as the source of commonsense statements, and multiple studies have shown its bias and ethical issues, e.g. [90]. During the design of the labeling functions, we did not collect any sensitive information, and the corpora we used were both publicly available; however, they may contain various types of bias. The labeling functions in PInKS are limited to English-language patterns, which may inject additional cultural bias into the data. However, our expert annotators did not notice any offensive language in the data or the extracted preconditions. Given the urgency of addressing climate change, we have reported the detailed model sizes and runtimes associated with all the experiments in Appendix 4.12.

4.8 Limitations
The main limitation of this work is related to the choice of raw text corpora and the model for the main results. From the raw text corpora perspective, we relied on Open Mind Common Sense (OMCS) [135] and ASCENT [99] as two rich resources of commonsense knowledge. Future iterations of this work should include more fine-grained labeling functions to be applied to other large-scale corpora that result in a more diverse set of extracted preconditions.
The purpose of the experiments in this work is to show the effectiveness of PInKS in preconditioned inference without introducing any expensive (manually labeled) supervision. We chose RoBERTa-LargeMNLI [85] as a representative and strong model that has been widely applied to NLI tasks, including all those evaluated in this work. However, there are other models, e.g. unified-QA-11B for PaCo or DeBERTa for δ-NLI, that can be considered for each one of the target tasks. Of course, achieving the SOTA with these much larger models requires a lot of computational resources, which is beyond the scope and bandwidth of this study. But, given more resources, we could easily extend the analysis to other models as well.

4.9 Details on PInKS Method
In this section, we discuss some of the additional details related to PInKS and its implementation.

4.9.1 Linguistic Patterns for PInKS
We use a set of conjunctions to extract sentences that follow the action-precondition sentence structure. Initially, we started with two simple conjunctions, if and unless, for extracting assertions containing Allowing and Preventing preconditions, respectively. To further include similar sentences, we expanded our vocabulary by considering the synonyms of our initial conjunctions. Adding the synonyms of unless, we got the following set of new conjunctions for Preventing preconditions: {but, except, except for, if not, lest, unless}. Similarly, we expanded the conjunctions for Enabling preconditions using the synonyms of if: {contingent upon, in case, in the case that, in the event, on condition, on the assumption, supposing}. Moreover, on manual inspection of the OMCS and ASCENT datasets, we found the following conjunctions that follow the Enabling precondition sentence pattern: {makes possible, the statement is true, to understand event}. Tab. 4.5 summarizes the final patterns used in PInKS, coupled with their precision values and their associated conjunctions.

Conjunction | Precision | Pattern
but | 0.17 | {action} but {negative_precondition}
contingent upon | 0.6 | {action} contingent upon {precondition}
except | 0.7 | {action} except {precondition}
except for | 0.57 | {action} except for {precondition}
if | 0.52 | {action} if {precondition}
if not | 0.97 | {action} if not {precondition}
in case | 0.75 | {action} in case {precondition}
in the case that | 0.30 | {action} in the case that {precondition}
in the event | 0.3 | {action} in the event {precondition}
lest | 0.06 | {action} lest {precondition}
makes possible | 0.81 | {precondition} makes {action} possible.
on condition | 0.6 | {action} on condition {precondition}
on the assumption | 0.44 | {action} on the assumption {precondition}
statement is true | 1.0 | The statement "{event}" is true because {precondition}.
supposing | 0.07 | {action} supposing {precondition}
to understand event | 0.87 | To understand the event "{event}", it is important to know that {precondition}.
unless | 1.0 | {action} unless {precondition}
with the proviso | - | {action} with the proviso {precondition}
on these terms | - | {action} on these terms {precondition}
only if | - | {action} only if {precondition}
make possible | - | {precondition} makes {action} possible.
without | - | {action} without {precondition}
excepting that | - | {action} excepting that {precondition}

Table 4.5: Linguistic patterns in PInKS and their precision values. Patterns without enough matches in the corpora have empty precision values.
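To make the extraction step concrete, the following is a minimal sketch of how conjunction patterns like those in Tab. 4.5 can be turned into simple labeling functions over raw sentences. The pattern subset, function name, and label strings here are illustrative; the actual PInKS pipeline implements these patterns as Snorkel labeling functions over the full pattern list.

```python
import re

# Illustrative subset of the conjunction patterns in Tab. 4.5 (not the full PInKS set).
# Each entry pairs a label (allow/prevent) with a regex that splits a sentence into
# an {action} part and a {precondition} part around the conjunction.
PATTERNS = [
    ("prevent", re.compile(r"^(?P<action>.+?)\s+unless\s+(?P<precondition>.+)$", re.IGNORECASE)),
    ("prevent", re.compile(r"^(?P<action>.+?)\s+if not\s+(?P<precondition>.+)$", re.IGNORECASE)),
    ("allow",   re.compile(r"^(?P<action>.+?)\s+only if\s+(?P<precondition>.+)$", re.IGNORECASE)),
    ("allow",   re.compile(r"^(?P<action>.+?)\s+in case\s+(?P<precondition>.+)$", re.IGNORECASE)),
]

def extract_precondition(sentence: str):
    """Return (action, precondition, label) for the first matching pattern, or None."""
    sentence = sentence.strip().rstrip(".")
    for label, pattern in PATTERNS:
        match = pattern.match(sentence)
        if match:
            return match.group("action"), match.group("precondition"), label
    return None  # the sentence abstains from all labeling functions

# Example usage:
# -> ('A net is used for catching fish', 'it is torn', 'prevent')
print(extract_precondition("A net is used for catching fish unless it is torn."))
```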
4.9.2 Details of Snorkel Setup
Beyond a simple API for implementing patterns and applying them to the data, Snorkel's main purpose is to model and integrate the noisy signals contributed by the labeling functions, which are modeled as noisy, independent voters that commit mistakes uncorrelated with other LFs. To improve the predictive performance of the model, Snorkel additionally models statistical relationships between LFs. For instance, the model takes into account similar heuristics expressed by two LFs to avoid "double counting" of voters. Snorkel further models the generative learner as a factor graph. A labeling matrix $\Lambda$ is constructed by applying the LFs to unlabeled data points. Here, $\Lambda_{i,j}$ indicates the label assigned by the $j$-th LF to the $i$-th data point. Using this information, the generative model is fed signals via three factor types, representing the labeling propensity, accuracy, and pairwise correlations of LFs:

$$\phi^{\mathrm{Lab}}_{i,j}(\Lambda) = \mathbb{1}\{\Lambda_{i,j} \neq \emptyset\}$$
$$\phi^{\mathrm{Acc}}_{i,j}(\Lambda) = \mathbb{1}\{\Lambda_{i,j} = y_i\}$$
$$\phi^{\mathrm{Corr}}_{i,j,k}(\Lambda) = \mathbb{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}$$

The above three factors are concatenated along with the potential correlations existing between the LFs and are further fed to a generative model, which minimizes the negative log marginal likelihood given the observed label matrix $\Lambda$.

4.9.3 Modified Masked Language Modeling
Tab. 4.6 summarizes the list of Allowing and Preventing conjunctions on which the modified language modeling loss function acts.

Type | Conjunctions
Allowing | only if, subject to, in case, contingent upon, given, if, in the case that, in case, in the case that, in the event, on condition, on the assumption, only if, so, hence, consequently, on these terms, subject to, supposing, with the proviso, so, thus, accordingly, therefore, as a result, because of that, as a consequence, as a result
Preventing | but, except, except for, excepting that, if not, lest, saving, without, unless

Table 4.6: List of conjunctions used in the modified masked loss function in Section 4.3.3.

Conjunction | Pattern
to understand event | To understand the event "{event}", it is important to know that {precondition}.
in case | {action} in case {precondition}
statement is true | The statement "{event}" is true because {precondition}.
except | {action} except {precondition}
unless | {action} unless {precondition}
if not | {action} if not {precondition}

Table 4.7: Filtered labeling function patterns and their associated polarity.

4.9.4 Interrogative Words
On manual inspection of the dataset, we observed some sentences that were not relevant to the commonsense reasoning task. Many such instances were interrogative statements. We filter out such cases based on the presence of interrogative words at the beginning of a sentence. These interrogative words are listed below.
Interrogative words: ["Who", "What", "When", "Where", "Why", "How", "Is", "Can", "Does", "Do"]

4.10 Details on Target Data Experiments in PInKS
For converting Rudinger et al. [125], similar to Qasemi et al. [111], we concatenate the "Hypothesis" and "Premise" and consider them as the NLI hypothesis. We then use the "Update" sentence as the NLI premise. The labels are directly translated based on the Update sentence's label: weakener to prevent and strengthener to allow. To convert ATOMIC2020 [59], similar to Qasemi et al. [111], we focused on three relations: HinderedBy, Causes, and xNeed. From these relations, edges with HinderedBy are converted to prevent and the rest are converted to allow (a schematic sketch of this conversion follows).
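As a concrete illustration of the ATOMIC2020 conversion just described, below is a minimal sketch; the field names and label strings are placeholders for illustration rather than the exact implementation.

```python
# Minimal sketch of converting ATOMIC2020-style (head, relation, tail) triples into
# NLI-style instances, following the rule above: HinderedBy edges become "prevent",
# while Causes and xNeed edges become "allow". Field names are illustrative.
KEPT_RELATIONS = {"HinderedBy", "Causes", "xNeed"}

def atomic_to_nli(triples):
    instances = []
    for head, relation, tail in triples:
        if relation not in KEPT_RELATIONS:
            continue  # other ATOMIC relations are not used in this conversion
        label = "prevent" if relation == "HinderedBy" else "allow"
        instances.append({"hypothesis": head, "premise": tail, "label": label})
    return instances

examples = [
    ("PersonX takes a long walk", "HinderedBy", "It is 10 degrees outside"),
    ("PersonX bakes bread", "xNeed", "PersonX buys flour"),
]
# -> [{'hypothesis': 'PersonX takes a long walk',
#      'premise': 'It is 10 degrees outside', 'label': 'prevent'}, ...]
print(atomic_to_nli(examples))
```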
Winoventi [37] proposes Winograd-style entailment schemas focusing on negation in common sense. To convert it to NLI style, we first separate the two sentences in the masked_prompt of each instance to form the hypothesis and premise. We get two versions of the premise by replacing the MASK token in the premise with the target or incorrect tokens. For the labels, the version with the target token is considered allow and the version with the incorrect token prevent. ANION [63] focuses on contradiction in general. We focus on their commonsense contradiction subset as it is clean of lexical hints. Then we convert their crowdsourced original head or contradiction head to the hypothesis, and the lexicalized predicate and tail to the premise (e.g. xIntent to "PersonX intends to"). Finally, the label depends on the head: allow for the original head and prevent for the contradiction head. We also replace "PersonX" and "PersonY" with random human names (e.g. "Alice", "Bob"). Finally, for PaCo [111], we used their proposed P-NLI task as an NLI-style task derived from their preconditions dataset. We converted their Disabling and Enabling labels to prevent and allow, respectively. Tab. 4.8 summarizes the conversion process through examples from the original data and the NLI task derived from each.

Winoventi [37]
  Original Data: masked_prompt: a Margaret smelled her bottle of maple syrup and it was sweet. The syrup is {MASK}.; target: edible; incorrect: malodorous
  Derived NLI: Hypothesis: a Margaret smelled her bottle of maple syrup and it was sweet.; Premise: The syrup is edible/malodorous; Label: ENTAILMENT/CONTRADICTION
ANION [63]
  Original Data: Orig_Head: PersonX expresses PersonX's delight.; Neg_Head: PersonX expresses PersonX's anger.; Relation: xEffect; Tail: Alice feel happy
  Derived NLI: Hypothesis: Alice expresses Alice's delight/anger.; Premise: feel happy.; Label: ENTAILMENT/CONTRADICTION
ATOMIC2020 [59]
  Original Data: Head: PersonX takes a long walk.; Relation: HinderedBy; Tail: It is 10 degrees outside.
  Derived NLI: Hypothesis: PersonX takes a long walk.; Premise: It is 10 degrees outside.; Label: CONTRADICTION
δ-NLI [125]
  Original Data: Hypothesis: PersonX takes a long walk.; Premise: HinderedBy; Update: It is 10 degrees outside.; Label: Weakener
  Derived NLI: Hypothesis: PersonX takes a long walk.; Premise: It is 10 degrees outside.; Label: CONTRADICTION
PaCo [111]
  Original Data: Statement: A net is used for catching fish.; Precondition: You are in a desert.; Label: Disabling
  Derived NLI: Hypothesis: A net is used for catching fish.; Premise: You are in a desert.; Label: CONTRADICTION

Table 4.8: Examples from target tasks in NLI format.

To run all the experiments, we fine-tune the models on the tuning data for a maximum of 5 epochs, with the option for early stopping upon 5 evaluation cycles with less than 1e-3 change on validation data. For the optimizer, we use AdamW [86] with a learning rate of 3e-6 and default hyperparameters for the rest.

4.11 Curriculum vs. Multitask Learning in PInKS
For the results of Section 4.4.1, we considered the target task and PInKS as separate datasets and fine-tuned the model sequentially on them (curriculum learning [107]). We chose the curriculum learning setup due to its simplicity of implementation, ease of monitoring the fine-tuning process, and hyperparameter setup. It also allows us to monitor each task separately, which increases the interpretability of the results. However, in an alternative fine-tuning setup, one can merge the two datasets into one and fine-tune the model on the aggregate dataset (multi-task learning [21]); a schematic sketch of the two setups follows.
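The sketch below contrasts the two fine-tuning regimes at the data level; it is only schematic, with `fine_tune` standing in for an ordinary supervised fine-tuning loop and the two dataset objects as placeholders.

```python
from torch.utils.data import ConcatDataset

# Schematic contrast between the curriculum and multi-task setups discussed above.
# `fine_tune(model, dataset)` is a placeholder for a standard supervised
# fine-tuning loop (optimizer, epochs, early stopping, ...).

def curriculum_setup(model, pinks_data, target_data, fine_tune):
    # Sequential: first the PInKS weak-supervision data, then the target task.
    fine_tune(model, pinks_data)
    fine_tune(model, target_data)
    return model

def multi_task_setup(model, pinks_data, target_data, fine_tune):
    # Aggregate: a single fine-tuning pass over the merged dataset.
    fine_tune(model, ConcatDataset([pinks_data, target_data]))
    return model
```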
Here, we investigate such an alternative and its effect on the results of Section 4.4.1.

Setup: We use the same setup as Section 4.4.1 for fine-tuning the model on Orig.+PInKS. Here, instead of first creating PInKS and then fine-tuning on the target task, we merge the weak-supervision data of PInKS with the training subset of the target task and then fine-tune on the aggregate dataset. To manage the length of this section, we only consider PaCo, δ-NLI, and Winoventi as the target datasets.

Target Data | Orig+PInKS (Multi-Task) | Diff.
δ-NLI | 72.1 | -11.00
PaCo | 77.3 | +6.8
Winoventi | 51.7 | +0.7

Table 4.9: Macro-F1 (x100) results of PInKS on the target datasets using the multi-task fine-tuning strategy, and its difference from the curriculum strategy.

Discussion: Tab. 4.9 summarizes the results for the multi-task learning setup and their difference w.r.t. the results of the curriculum learning setup in Tab. 4.2. Using multi-task learning does not show consistent results across tasks. We see a significant performance loss on δ-NLI on one hand and major performance improvements on PaCo on the other. Winoventi, however, does not appear to change as much in the new setup. We leave further analysis of curriculum learning to future work.

4.12 PInKS Model Sizes and Run-times
All the experiments are conducted on a commodity workstation with an Intel Xeon Gold 5217 CPU and an NVIDIA RTX 8000 GPU. For all the fine-tuning results in Tab. 4.2 and Tab. 4.3, we used "RoBERTa-LargeMNLI" with 356M tuneable parameters. To fine-tune the model in each experiment, we use Ray [77] to handle hyperparameter tuning with 20 samples each. The hyperparameters that are being tuned fall into two main categories: 1) model hyperparameters such as "sequence length", "batch size", etc., and 2) data hyperparameters such as "precision threshold", "data size", etc. The mean run-time for each sample on the target datasets is 1hr 55mins. For the augmentation in the PInKS dataset, we used the "BERT" language model with 234M tuneable parameters. The mean run-time on the weak supervision data is 49hr, which includes all three steps of data preprocessing, linguistic pattern matching, and generative data augmentation.

4.13 Details on PABI Measurement in PInKS
PABI provides an informativeness measure that quantifies the reduction in uncertainty provided by incidental supervision signals. We use the PABI measure to study the impact of transductive cross-domain signals obtained from our weak-supervision approach. Following [53], in order to calculate PABI $\hat{S}(\pi_0, \tilde{\pi}_0)$, we first find $\eta$, the difference between a perfect system and a gold system in the target domain $D$ that uses a label set $L$ for a task, using Eq. 4.1:

$$\eta = \mathbb{E}_{x\sim P_D(x)}\,\mathbb{1}\big(c(x)\neq\tilde{c}(x)\big) = \frac{(|L|-1)(\eta_1-\eta_2)}{1-|L|(1-\eta_1)} \tag{4.1}$$

Here, $P_D(x)$ indicates the marginal distribution of $x$ under $D$, $c(x)$ refers to the gold system on gold signals,
$\tilde{c}(x)$ is a perfect system on incidental signals, $\eta_1$ refers to the difference between the silver system and the perfect system in the source domain, $\eta'_1$ indicates the difference between the silver system and the perfect system in the target domain, and $\eta_2$ is the difference between the silver system and the gold system in the target domain. Using Eq. 4.1, the informativeness measure supplied by the transductive signals, $\hat{S}(\pi_0, \tilde{\pi}_0)$, can be calculated as follows:

$$\hat{S}(\pi_0, \tilde{\pi}_0) = \sqrt{1 - \frac{\eta\ln(|L|-1) - \eta\ln\eta - (1-\eta)\ln(1-\eta)}{\ln|L|}}$$

Tab. 4.10 contains the details associated with the computation of the PABI scores reported in Section 4.4.2.

Indir. Task | |L| | η1 | η2 (ATMC) | η2 (PaCo) | η2 (δ-NLI) | η (ATMC) | η (PaCo) | η (δ-NLI) | PABI (ATMC) | PABI (PaCo) | PABI (δ-NLI)
PInKS | 2 | 0.04 | 0.11 | 0.21 | 0.16 | 0.076 | 0.202 | 0.129 | 0.782 | 0.523 | 0.667
δ-NLI | 2 | 0.13 | 0.22 | 0.28 | 0.16 | 0.122 | 0.203 | 0.046 | 0.683 | 0.522 | 0.855
PaCo | 2 | 0.03 | 0.10 | 0.22 | 0.33 | 0.074 | 0.202 | 0.318 | 0.786 | 0.523 | 0.313
ATOMIC | 2 | 0.01 | 0.57 | 0.62 | 0.60 | 0.608 | 0.622 | 0.602 | 0.184 | 0.209 | 0.174
ANION | 2 | 0.16 | 0.57 | 0.36 | 0.44 | 0.571 | 0.302 | 0.418 | 0.122 | 0.341 | 0.139
Winoventi | 2 | 0.19 | 0.10 | 0.37 | 0.31 | 0.139 | 0.289 | 0.196 | 0.647 | 0.364 | 0.534

Table 4.10: Details of the PABI metric computations in Section 4.4.2 according to Equation (4.1).

4.14 Details on LFs in PInKS
Tab. 4.11 shows Coverage (fraction of instances assigned a non-abstain label by the labeling function), Overlaps (fraction of instances with at least two non-abstain labels), and Conflicts (fraction of instances with conflicting, non-abstain labels) for the top-performing LFs in PInKS.

LF name | Cov. % | Over. % | Conf. %
to understand | 59.03 | 0.03 | 0.03
statement is | 10.58 | 0.03 | 0.03
except | 4.84 | 0.02 | 0.01
unless | 4.79 | 0.04 | 0.04
in case | 1.46 | 0.01 | 0.00
if not | 1.00 | 0.01 | 0.01
Overall | 81.69 | 0.14 | 0.12

Table 4.11: Coverage (fraction of raw corpus instances assigned a non-abstain label by the labeling function), Overlaps (fraction of raw corpus instances with at least two non-abstain labels), and Conflicts (fraction of raw corpus instances with conflicting (non-abstain) labels) for the top-performing LFs. Green and red respectively represent LFs that assign allow and prevent labels.

4.15 Conclusion
In this work we presented PInKS, an improved method for preconditioned commonsense reasoning that involves two weak supervision techniques. To maximize the effect of the weak supervision data, we modified the masked language modeling loss function using a biased masking method to put more emphasis on conjunctions as the closest proxy to preconditions. Through empirical and theoretical analysis of PInKS, we show it significantly improves the results across the benchmarks on reasoning with the preconditions of commonsense knowledge. In addition, we show the results are robust across different precision values using the PABI informativeness measure and an extensive ablation study. Future work can consider improving the robustness of preconditioned inference models using methods such as virtual adversarial training [93, 75]. With the advent of visual-language models such as Li et al. [76], preconditioned inference should also expand beyond language and include different modalities (such as image or audio). To integrate into downstream tasks, one direction is to include such models to aid inference in neuro-symbolic reasoners [78, 144].

Chapter 5
Preconditioned Visual Language Inference with Weak Supervision
Humans can infer the affordance of objects by extracting related contextual preconditions for each scenario.
For example, upon seeing an image with a broken cup in it, we can infer that this precondition prevents the cup from being used for drinking. Reasoning with preconditions of commonsense knowledge is studied in NLP, where the model explicitly receives the contextual precondition. However, it is unclear if SOTA visual language models (VLMs) can extract such preconditions and infer the affordance of objects with them. In this work, we introduce the task of preconditioned visual language inference and rationalization (PVLIR). We propose a learning resource based on three strategies to retrieve weak supervision signals for the task. Our results reveal the shortcomings of SOTA VLM models in the task and draw a road map to address the challenges ahead in improving them.

5.1 Introduction
According to the Theory of Affordance [45, 24], understanding the preconditions in which an action or statement is possible or impossible is a key aspect of human intelligence. For example, a glass may be used for drinking water, under an implicit assumption that the water is at normal temperature, but may not be if the glass is shattered. From the cognitive perspective, understanding the affordance of objects, or simply the preconditions of actions [110], is part of the commonsense knowledge that constitutes what distinguishes humans from a machine in making inferences [71].

Figure 5.1: Preconditioned Visual Language Inference (PVLI) and Preconditioned Visual Language Reasoning (PVLR) tasks, illustrated with Allow and Prevent examples of the PNLI, PVLI, and PVLR variants (e.g. H: "A glass is used for drinking water", P: "The glass is shattered", L: Contradict). The "H" and "P" are the input hypothesis and premise; the outputs, label ("L") and rationale ("R"), are highlighted.

From an applications perspective, it also has huge implications for areas such as robotics [2], transportation [109, 131, 68], and general artificial intelligence [99]. Reasoning with preconditions of commonsense knowledge (i.e. preconditioned inference) has been proposed as a benchmarking task for evaluating the theory of affordance [110]. Multiple studies have formulated preconditioned natural language inference (PNLI) as variations of the Natural Language Inference (NLI) [156, 16, 29] task and contributed learning resources that are gathered through crowdsourcing [125, 110, 59, 37, 64] or weak supervision [112]. In PNLI, the models rely on the contextual information (i.e. textual preconditions as the premise) as input and have to decide whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise (first row in Fig. 5.1). However, humans reason about affordance using information beyond text [8, 5] and extract the contextual meaning representations for cognitive tasks (such as PNLI) from the pool of available information in various modalities. For example, upon getting the query "can this person run?" and seeing a picture of a person in a full leg cast, one can infer the contextual information from the image that "the person is injured and incapable of running" and use it to answer the query accordingly. Thus, a visual variation of the PNLI task is a cognitively more realistic benchmark for artificial intelligence models.
In this work, we propose PVLIR, to expand the preconditioned inference and reasoning to the visuallanguage realm by considering the interaction between linguistic and visual information in common sense. This work presents three contributions. First, we introduce the Preconditioned Visual Language Inference (PVLI) and Rationalization (PVLR) tasks (2nd and 3rd rows in Fig. 5.1), which evaluate the visual-language models’ (VLM) capabilities to reason with preconditions associated with commonsense knowledge. In PVLI, the precondition is represented as an image that further constrains the context in which the model has to decide the “prevent” or “allow” labels. In PVLR, the model has to provide the rationale for the choice between the labels as well. For example, say the model is given a commonsense statement such as “a glass is used for drinking water” and/or an image of “a person drinking water” as the hypothesis and an image of a “broken glass” as the premise. Then, in PVLI, the model has to decide whether there is a prevented or allowed relation between them, and in PVLR, it has to provide a rationale for its decision, such as the glass is broken. In addition, to foster further research, we created a crowd-verified evaluation dataset to benchmark future models. Second, we propose three strategies for retrieving a rich amount of cheap and allowably noisy supervision signals for inference and rationalization. Similar to Parcalabescu et al. [103], PVLIR’s three strategies rely on the available image captioning datasets (e.g. Changpinyo et al. [23], Sharma et al. [132], Gurari et al. [49], and Lin et al. [80]) that are readily available as a result of years of research in the field and maturity of resources. In the first strategy, Extraction from Captions, we utilize the PInKS [112] method to extract PNLI instances from image captions. PInKS uses a combination of linguistic patterns (e.g. “{action} unless {precondition}”) and generative augmentation to extract large quantities of instances from raw text. In the 66 second strategy, Caption Querying, we use the existing crowdsourced PNLI instances (e.g. Rudinger et al. [125], Qasemi et al. [110], Hwang et al. [59], Do and Pavlick [37], and Jiang et al. [64]) and find an image caption that is semantically identical to them. The third strategy, Image Querying, focuses solely on the PNLI instances and devises queries (such as “you are in a desert”) to search directly for corresponding images on the web using image search engines (e.g. Google Images). Our third contribution is an extensive benchmarking of VLMs based on PVLIR. We benchmark 4 SOTA VLMs, FLAVA [134], VisualBERT [76], ViLBERT [87] and ViLT [67] in inference (Section 5.4.1) and Ayyubi et al. [6] in rationalization (Section 5.4.4). We show how an effective rationalization will improve inference in the VLM models (Section 5.4.4). We further investigate the fine-tuning (learning) process of VLMs in the inference task (Section 5.4.2) and study their exploitation of the spurious correlation in our dataset (Section 5.4.3). 5.2 Construction of PVLIR This section gives an overview of PVLIR (summarized in Fig. 5.2), describing our strategies for obtaining the data, and quality control. Details associated with generating the human-verified test subset are moved to Section 5.9. The implementation details of each strategy is discussed in Section 5.8.1. Datasets: The construction of PVLIR uses existing text-only PNLI and image-captioning datasets as building blocks. 
For the text-only PNLI datasets, we require that they contain a precondition (e.g. premise, context), an action (e.g. hypothesis, question), and a binary label indicating whether the precondition allows or prevents the action. For the image captioning datasets, we simply require images (typically the URL) and their captions. Any datasets that meet these requirements can be used for the following steps.

Preprocessing: The PNLI preconditions and actions (collectively referred to as statements) often use varying conventions for referring to people. We standardize them by replacing these identifiers with "the person", "another person", "a third person", and so on. For example, the sentence "Alice helps Bob" would become "the person helps another person". This ensures that the specific names or traditionally-associated genders are not mistaken as a focus of the statements. We also encourage further preprocessing of the data as seen fit, as not all sources contain clean text. Of the image captions, some may be very long and consist of multiple sentences. In these cases, we split the captions into individual examples, pairing each sentence with the same image. Using these preprocessed resources, we then obtain PVLIR instances with three different strategies: extraction from captions, caption querying, and image querying.

Figure 5.2: Overview of weak supervision methods for constructing PVLIR (the panels illustrate the weak-supervision strategies with example precondition/rationale/label triples).

Extraction from Captions (EC): Our first strategy focuses solely on the image captions, finding the few that contain preconditions and actions and extracting them. By nature, the resulting statements are already grounded in the images. We use the minimally-supervised approach described in PInKS [112], where linguistic patterns are used to extract preconditions and actions from raw corpora. This strategy constructs labeling functions (LF) based on common conjunctions such as "only if" and "unless". For example, the sentence "Swimming pools have cold water in the winter unless they are heated" is matched by the pattern "{action} unless {precondition}", and therefore we can infer that "they are heated" is a precondition that prevents the action "Swimming pools have cold water in the winter". Such labeling functions can be refined and added to as desired. In cases where the conjunction can be used in multiple senses, part-of-speech tagging can be utilized to filter out irrelevant senses (a small sketch of such filtering is given below). After applying the labeling functions to the image captions, we have a dataset consisting of preconditions and actions, where both are grounded in the associated images.
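Purely as an illustration of the POS-based filtering mentioned above, the sketch below keeps a match for the ambiguous conjunction "so" only when it is used as a conjunction rather than as an intensifier. The spaCy model name and the accepted tag set are assumptions for this sketch, not the exact PVLIR configuration.

```python
import spacy

# Keep a "so" match only when spaCy tags it as a conjunction (CCONJ/SCONJ), and
# reject the adverbial intensifier sense ("the bag is so heavy"). The model name
# and the tag set are illustrative choices; tagging may vary between models.
nlp = spacy.load("en_core_web_sm")

def is_conjunction_use(sentence: str, word: str = "so") -> bool:
    for token in nlp(sentence):
        if token.lower_ == word:
            return token.pos_ in {"CCONJ", "SCONJ"}
    return False

print(is_conjunction_use("It rained, so the game was cancelled."))  # expected: True
print(is_conjunction_use("The bag is so heavy."))                    # expected: False (ADV)
```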
To control for quality, we annotate a sample of matches from each labeling function (precision of each LF) to record whether the relation between precondition and action makes sense. Based on the results, we choose a precision threshold and only include labeling functions that meet this minimum. Caption Querying (CQ): Our second strategy bridges the PNLI statements and image captions by grounding preconditions and actions in images that have semantically similar captions. We begin by limiting the statements and captions to those whose length is within one standard deviation of the mean (rounded to the nearest integer) in order to remove outliers. We then encode the PNLI statements and image captions in high-dimensional vector embeddings using multiple models. Next, using a PNLI precondition as a query, we find the most similar captions through approximate nearest neighbors. This returns multiple rankings of the closest captions, one for each model’s encoding. We then aggregate the rankings and select the first-place caption. This strategy of including multiple models in the decision-making process helps make it more robust to model differences and to the approximate nature of the nearest neighbors. The number of models incorporated depends on balancing the desired robustness and time or computational constraints. Likewise, the number of similar captions returned for the rankings can vary but should be chosen such that the rankings typically contain some overlap. Note that the quality of data produced from this approach is dependent on the range of concepts covered by the image caption datasets, as it assumes that each PNLI precondition has a fairly similar caption. As such, it benefits from a very large corpus of captions (more discussion in Section 5.5). 69 To control for quality, when we select the best caption for a given query, we additionally record two values: perplexity and model agreement. The perplexity is the distance (cosine, dot, etc.) between the query and caption, averaged over the models. In the case when one of the models did not include the chosen caption in its ranking, the distance of the last caption is used for the average. By nature, the perplexity measures how good the models believe the match to be. In contrast, the model agreement is not specific to the chosen caption but instead measures how well-aligned the models’ rankings are. Using a ranking similarity metric, we compute the similarity between pairs of rankings and then average the scores for the model agreement. Since a high model agreement indicates that the models agree on which are the closest captions, but does not speak to the actual proximity of the match, it can be thought of as a measure of confidence. Image Querying (IQ): Our third strategy focuses solely on the PNLI statements and utilizes advances in image search engines to directly find the relevant images on the internet. Like the caption querying strategy, we limit the statements to those whose length is within one standard deviation of the mean. For this approach, we recommend excluding any PNLI datasets that deal with largely abstract concepts (e.g. the person is responsible, the person will be grateful), as searching directly for images is unlikely to yield good results. To ground the PNLI statements, we find the top images on the internet, using the statement with any commas removed as the search query. Since each of the top images becomes its own example, this strategy can quickly generate very large amounts of noisy training data. 
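To make the caption-querying strategy concrete, below is a minimal brute-force sketch of encoding a PNLI statement and the captions with more than one sentence encoder, ranking captions per encoder, and aggregating the rankings. The encoder names are assumptions for illustration; the actual PVLIR pipeline uses an approximate-nearest-neighbour index instead of the exhaustive search shown here, and its exact encoders and aggregation may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Two off-the-shelf sentence encoders (illustrative choices).
encoders = [SentenceTransformer("all-MiniLM-L6-v2"),
            SentenceTransformer("all-mpnet-base-v2")]

def best_caption(query: str, captions: list, top_k: int = 10) -> str:
    """Return the caption with the best aggregated rank across encoders."""
    rankings = []
    for enc in encoders:
        q = enc.encode([query], normalize_embeddings=True)[0]
        c = enc.encode(captions, normalize_embeddings=True)
        sims = c @ q  # cosine similarity, since embeddings are L2-normalised
        rankings.append([int(i) for i in np.argsort(-sims)[:top_k]])
    # Aggregate by summed rank; captions missing from an encoder's top-k get a penalty.
    scores = {}
    for ranking in rankings:
        for idx in range(len(captions)):
            rank = ranking.index(idx) if idx in ranking else top_k
            scores[idx] = scores.get(idx, 0) + rank
    return captions[min(scores, key=scores.get)]

captions = ["a shattered glass on the floor", "a dog running on the beach",
            "a person pouring water into a glass"]
print(best_caption("the glass is broken", captions))
```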
5.3 Data Analysis
In this section, we investigate different aspects of the weak supervision data and evaluate the quality of the generated resource. Implementation details and experimental setup details are moved to Section 5.8.1 (for data acquisition) and Section 5.9 (for human annotations) to conserve space.

Figure 5.3: Distribution of instances extracted from captions (log-scale), for each source of the caption (VizWiz, MS COCO, CC3M, CC12M). The x-axis lists the labeling functions (if, because, without, but, so that, in order to, due to, unless, as if, despite, except, even though, in case, as long as, in the event, lest, on condition, in the case that, makes possible, on the assumption, contingent upon) and the y-axis shows the percentage of matches in the dataset.

We then use a random sample of 3% of the preprocessed preconditions and actions to perform Caption Querying (roughly 13k each) and Image Querying (roughly 80k each) to provide a representative sample of the full dataset.

Extraction from Captions Results: After preprocessing, we have a resource of 17 million captions. From this, we utilize the Extraction from Captions method, which results in 34K extracted instances. Fig. 5.3 illustrates the percentage of matches that come from each labeling function, separated by dataset. General statements such as "if" unsurprisingly make up a large percentage of the data, but interestingly, some datasets have very different distributions. Among the sources of captions, VizWiz has disproportionately high counts of "but" and "so that", while MS COCO is high in "in order to" and "as if". Fig. 5.4 shows the percent of the data and the percent of "allow" examples for varying precision thresholds. For the results in Section 5.4, we use the threshold of 0.6 to have a good balance between the quality and quantity of the final resource.

Figure 5.4: Cumulative distribution of the data with respect to the precision of the generating labeling function (x-axis: precision threshold in %; y-axis: percent of data; curves: percent of examples and percent enabling).

Caption Querying Results: With Fig. 5.5a and Fig. 5.5b, we look at where our matches in the CQ method come from. This tells us that the majority of our queries come from Anion while the majority of captions come from CC12M, which is unsurprising given they are our largest datasets. Fig. 5.5c and Fig. 5.5d take the ratio of the observed percentages to the percentages we would expect based purely on the sizes of the datasets. While we always use every query, the captions can be taken from any of the image datasets. This means that some may be over- or underrepresented based on their fit for a given NLI dataset. For example, in Fig. 5.5d it appears that MS COCO captions are not good matches for PaCo actions. We also report more results on Image Querying in Section 5.8.2.
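As an illustration of the observed-versus-expected comparison behind Fig. 5.5c and 5.5d, the following is a small sketch; all numbers in it are made-up placeholders, not the actual PVLIR statistics reported in the figure.

```python
import numpy as np

# Illustrative sketch of the observed-vs-expected ratio used in Fig. 5.5c-d.
pnli_sources = ["WinoVenti", "PaCo", "ATOMIC", "ANION"]
caption_sources = ["VizWiz", "MS COCO", "CC3M", "CC12M"]

# Observed match counts between each PNLI source (rows) and caption source (columns);
# placeholder values only.
matches = np.array([[5, 11, 39, 143],
                    [2, 6, 65, 171],
                    [15, 39, 769, 2743],
                    [19, 85, 998, 4889]], dtype=float)

# Sizes of the underlying datasets (again, placeholders).
pnli_sizes = np.array([4_000, 12_000, 120_000, 300_000], dtype=float)
caption_sizes = np.array([70_000, 600_000, 3_000_000, 12_000_000], dtype=float)

observed = matches / matches.sum()
expected = np.outer(pnli_sizes / pnli_sizes.sum(), caption_sizes / caption_sizes.sum())
ratio = 100 * observed / expected  # 100% = exactly as expected from dataset sizes alone
print(np.round(ratio, 1))
```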
Figure 5.5: a) Observed distribution of matches for preconditions. b) Observed distribution of matches for actions. c) Deviation from the expected distribution of matches for preconditions. d) Deviation from the expected distribution of matches for actions. (Each panel is a matrix over the PNLI sources WinoVenti, PaCo, ATOMIC, and ANION and the caption sources VizWiz, MS COCO, CC3M, and CC12M.)

5.4 Evaluation and Discussion
In this section, we focus on the PVLIR tasks. We first benchmark state-of-the-art visual language models on the inference (Section 5.4.1) and reasoning tasks (Section 5.4.4). Then, we focus on sources of bias in the data through counterfactual analysis in the inference task (Section 5.4.3).

5.4.1 Inference Benchmarking Results
Here, as the main results, we benchmark the SoTA VLM models in the PVLI task.
Experimental Setup: We used 4 SOTA vision-language models: ViLBERT [87], ViLT [67], FLAVA [134], and CLIP [115].
The model then outputs one of the 2 labels - yes or no, which we use to compute its accuracy on the task. FLAVA [134] and CLIP [115] are multi-modal vision and language models that can be used for tasks such as image-text similarity or zero-shot image classification. Similar to ViLT, the hugging face library does not provide CLIP and FLAVA models, pre-trained on binary or multi-label classification tasks. For the fine-tuned results of FLAVA model, we extract the multi-modal embeddings it generates and feed them to a classification head. This classification head is fine-tuned on the VSNLI before using in our experiments. For ∗ https://github.com/facebookresearch/vilbert-multi-task 74 0-shot Finetuned Model Noisy Test Clean Test Noisy Test Clean Test ViLBERT 52.02 48.48 78.75 55.68 ViLT 50.88 45.83 77.92 55.68 CLIP 30.15 42.80 73.13 56.82 FLAVA 47.38 53.78 80.43 59.47 Random 63.47 56.08 Table 5.1: Results of SoTA Visual Language Models on the PVLI task. the CLIP model, we utilize the similarity scores between the visual and the textual features. Similarly, we feed the features through a classification head to output the label which indicates whether the precondition “allows” or “prevents” the common sense statement. We then report the accuracy of the resulting models on the PVLI tasks in zero-shot and fine-tuned, w.r.t. PVLI, setups. From the weak supervision data, we randomly sample 16K for tuning and 6K as noisy test set. For the clean test set we used the 261 human-verified samples obtained through AMT experiments in Section 5.9. The experiments are conducted on a commodity workstation with an Intel Xeon Gold 5217 CPU and an NVIDIA RTX 8000 GPU. Discussion Tab. 5.1 summarizes the results of SoTA VLMs on the PVLI task. In the zero-shot setup, all the models perform below the random baseline, showing the difficulty and novelty of the task for the models. After fine-tuning, the models’ performance improves above the random guess, where the FLAVA’s [134] performance elevates by 33.05 points of accuracy to 80.43% on the noisy-test. However, it still is not mastering the task. Overall, this shows that SOTA methods generally fall behind the human-level performance, therefore indicating the need for further research in order to improve the comprehension of preconditions by commonsense visual reasoners. 75 Figure 5.6: Accuracy difference from the peak value of fine-tuning FLAVA (lower is better) with increasing amounts of tuning data from PVLI. The batch size is 64. 5.4.2 Anaylsis with Fine-tuning In the above evaluation on PVLI, we observe that all models get higher scores after a full fine-tuning process. Here, we dissect the fine-tuning process to find at what point the model understands the task’s requirements. Experimental Setup Here we focus on FLAVA [134] as one of the top-performing models in PVLI. We carry the setup from Section 5.4.1 and evaluate FLAVA on the noisy test set in fine-grained intervals during fine-tuning. Discussion Fig. 5.6 illustrates the progression of the FLAVA model toward its peak accuracy performance. As illustrated, the model’s performance saturates after 600 iterations of fine-tuning (or 38K instances). The slow saturation of the accuracy score here suggests that the instances in PVLI are not trivial for the model 76 and it has to see a substantial number of instances to be able to perform the task. Considering that the FLAVA has been pre-trained on a vast corpus, our result shows the novelty and uniqueness of the PVLI task. 
This result is consistent with the similar analysis in Qasemi et al. [110], for comparing MNLI task with PNLI (text-only). 5.4.3 Identifying Sources of Bias Large LMs (and by extension VLMs) tend to learn to solve the dataset rather than the task [19], by overfitting spurious correlations in the data [160]. To quantify and eliminate such biases in the data/model, recent studies conduct counterfactual inference to debias textual resources used in text classification Qian et al. [114] and information extraction tasks [151, 148]. Internally, debiasing through counterfactuals works on the model trained or fine-tuned on the biased classification data. During inference, this technique creates counterfactuals where parts or all of the input are wiped out to observe what the model would give by seeing only the biasing factors. In this way, bias terms can be distilled from the model, which can be further deducted from the original prediction for debiasing. Specifically, Qian et al. [114] design two types of counterfactual variations of the input to produce two counterfactual output distributions that model label bias and keyword bias in the model. Experimental Setup Since our data contain both images and text, we modified the counterfactuals in Qian et al. [114] to fit the task. We create three counterfactual variants of the inputs to consider, visualtoken bias, textual-token bias, image bias, and text bias. In the visual-token bias and textual-token bias, we partially mask the input image (50%) and text (67% as in [113]) respectively with no change to the other modality of input. In the image bias and text bias we blind the model in one modality by fully masking their respective modalities. Here, we focus on the FLAVA [134] model and carry over the setup from Section 5.4.1 on the noisy test set. 77 Discussion Our results show that the visually blind FLAVA [134] model is performing on par with the original model (79.88 accuracy on noisy test). This shows that the model may overly rely on the text modality as a shortcut in most of the instances rather than utilizing both image and text. This result further motivates the need for further research in multi-modal de-biasing techniques for both data and models. 5.4.4 Utilizing Rationale for Inference Task Here, we try to answer the question “How the rationales can contribute to the inference task?”. In other words, we show how the generated rationales can become a piece of useful evidence for inference. As discussed in Section 5.7 (under “Free-Text Rationale Generation”), even though there exists a rich body of literature on the free-text rationale generation models in the text-only tasks, there are limited publicly available models for the visual language tasks. We implement the architecture proposed in Ayyubi et al. [6] for visually-guided rationale generation† . The architecture feeds the visual embeddings from a VLM to the decoder of a LM and jointly trains both in an end-to-end fashion. Experimental Setup We do an experiment similar to Section 5.4.1, except that the VLM model is trained with both the textual hypothesis and rationale plus the visual premise as input. To contain the length of this experiment we only focus on the FLAVA [134] VLM, and evaluate the performance on the noisy test set in a fully fine-tuned setup. We separately experiment with two types of rationales as input: the FLAVA-rationale-gen gets the generated rationale, and the FLAVA-rationale-gold gets the ground-truth rationale from PVLIR . 
For our implementation of Ayyubi et al. [6] to generate the rationale, we use a separate FLAVA [134] as the VLM to embed the multi-modal input and use GPT-2 [116] as a decoder-only LM to generated the rationale. We initialize both models, from pre-trained weights on Hugging Face [157] library and fine-tune them on PVLIR data for the rationale generation task given the input (text and image). †At the time of this writing, the original implementation of Ayyubi et al. [6] was not public 78 Discussion The inference accuracy of the FLAVA-rationale-gold and FLAVA-rationale-gen is 94.2 and 80.56 respectively. First, the significant jump in the performance of FLAVA-rationale-gold (from the base of 80.43 in Tab. 5.1) shows that in the presence of a competent rationalization model, the generated rationales can significantly contribute to the inference task. Second, we observe that a rationale model as simple as FLAVA-rationale-gen, can also contribute to the performance (although slightly) of the visual preconditioned inference task. This further motivates the need for further research in multi-modal rationalization models. 5.5 Limitations Image captioning datasets are limited both in breadth and depth. We have not investigated the use of automatically generated captions, e.g. Wang et al. [150], in our weak-supervised pipeline, but it is a viable path for future extensions of this work. Alternatively, automatic text-to-image generation techniques, e.g. stable diffusion [124] or Dall-E [119], are gaining a lot of attention and are promising but require a lot of prompt engineering that is challenging on a large scale. In addition, the lack of access to a large number of free-text rationale generation models (through libraries such as Huggingface [157]) limited the evaluation of our PVLR tasks. We hope the availability of resources, such as ours, elicits more research effort in the field. 5.6 Ethical Concerns We started from publicly available data that is both crowd-verified and neutralized, however, multiple studies have shown the existence of bias and ethical issues in such resources, e.g. Mehrabi et al. [90]. Since our work is based on weak supervision, we have no additional filter on the acquired instances, hence our resource exacerbates the bias in models by reinforcing it with biased evidence, e.g. results from the query 79 “fat person” will only return images of obese white males. In addition, there is a combination of well-studied biases in the large models trained on raw text, e.g. Bender et al. [10]. Finally, in this work, we have only relied on English resources. In addition, we have only used Englishspeaking annotators. Hence the judgments and design decisions are heavily skewed culturally which will aggravate the bias issues of our work. 5.7 Related Works Preconditions of Commonsense Knowledge reasoning with preconditions of common sense has been studied in the context of affordance in different fields from cognitive sciences [42] to robotics [2] but was recently brought up in natural language understanding. In NLP, the focus has been mainly on proposing human-verified learning resources [110, 125, 59, 128, 56, 37, 63]. Among them, Qasemi et al. [110] and Rudinger et al. [125] propose variations of the canonical NLI task for preconditioned inference in common sense. Qasemi et al. [112] propose a combination of weak-supervision strategy and biased masking to improve LMs’ performance in the task. Visual Language Inference With the advent of visual language models (VLMs; Li et al. 
[74], Liu et al. [83], Li et al. [76], Cho et al. [27], and Huang et al. [58]) that can simultaneously process visual and linguistic information, there is growing attention to enrich text-only tasks with visual context [103, 159, 145]. Vu et al. [145] propose a visually-grounded version of the textual entailment task, supported by the cognitive science view of enriching meaning representations with multiple modalities. According to how Visual Language Inference (VLI; Xie et al. [159] and Vu et al. [145]) is defined, the task is regarded as a visual extension of the NLI task. In VLI, the premise is substituted with an image with visual context instead of the text in NLI [159]. Instead of relying on crowdsourcing, both works augment the Stanford NLI (SNLI) dataset [17]. Since the textual premises of SNLI are extracted from image captions on Flickr, each premise 80 can be easily replaced with its respective image. Our proposed PVLI task is a variation of the proposed VLI that focuses on the preconditions (affordance) of tasks/objects (similar relation exists between general NLI, e.g. MNLI versus the PNLI task). Weak Supervision Instead of using direct supervision from annotated data, weak supervision in NLP tasks typically use linguistic patterns to infer large-scale “noisy” or “imperfect” labels on unlabelled corpora [123, 164, 35, 136], e.g. using heuristic rules. Models fine-tuned on weak supervision data have shown considerable improvements across NLU tasks lacking direct supervision, including temporal commonsense reasoning [167], rationale generation [18], document ranking [35], and preconditioned inference [112]. Dai, Song, and Wang [32] and Choi et al. [28] use weak supervision from linguistically mined or LM-generated noisy data to enhance ultra-fine entity typing. Choi et al. [28] also uses human annotators to create a small ground-truth test set for testing. Free-Text Rationale Generation There is a large body of research on free-text rationale generation toward faithful and explainable NLP. Work like this typically fine-tunes a single LM to generate the task output and rationale [96, 89, 162], or uses a separate LM to generate the rationale that another LM uses to generate the output [149, 154, 69, 118]. In the visual-language realm, free-text rationale generation is limited, where based on our observation it can be due to the lack of large-scale learning resources. Dua, Kancheti, and Balasubramanian [38] and Ayyubi et al. [6] repurpose the VCR [163] data and propose VL models to generate free-text rationale (instead of picking one as is in the VCR) for it. Other works, e.g. Su et al. [141] and Li et al. [73], use visual inputs for text generation, but they are not focused on the rationale generation. 81 5.8 Weak Supervision Methods in PVLI 5.8.1 Implementation Details and Experimental Setup This section discusses the experimental setup and implementation details for the results in Section 5.3. Precondition Resources: For our P-NLI datasets, we pull from Anion [64], Atomic [129], PaCo [110], δ-NLI [125] and WinoVenti [37]. For the image caption datasets, we use CC12M [23], CC3M [132], MS COCO [81], and VizWiz [48]. Preprocessing Setup: Since Anion and Atomic use fixed identifiers (Alice/Bob, PersonX/PersonY), we rely on regex rules to replace them with “the person” and “another person”. WinoVenti uses random first names, and so we utilize Flair’s ner-english-fast model [3] to identify and replace the spans that are identified as people with greater than 90% confidence. 
PaCo does not have any such identifiers to replace. For the image caption data, we break the captions into multiple lines using the Natural Language Toolkit's sentence tokenizer [14] in combination with splitting on newline characters. We also noticed that some captions contained "<PERSON>" tags and used regex to replace them. As the last step, we leverage regex to fix whitespace issues and replace "the person's" with "their" to increase fluidity. We also found that the datasets were easier to clean after lowercasing, particularly as some contained inconsistent capitalization.

Extraction from Captions Setup: We modify some of the original labeling functions from PInKS [112] and add eight new ones after inspecting our caption corpus. In PInKS, the authors also calculate precision values for each of their labeling functions by sampling 20 examples from each function. The samples are then marked as relevant (score of 1) or irrelevant (score of 0) to the task by two human annotators. The average score of each labeling function provides an estimate of the quality that each labeling function returns and is used for tie-breaking matches or filtering out low-quality functions. We follow their lead and

Label | Conjunction | Precision | Regex Pattern
enables | so that | 0.689 | {P} so that {A}
 | in order to | 0.650 | {P} in order to {A}
 | because | 0.625 | {A} because (?!of\b){P}
 | due to | 0.550 | {A} due to {P}
 | in case | 0.475 | {A} in case (?!of\b){P}
 | as if | 0.400 | {A} as if {P}
 | as long as | 0.375 | {A} as long as {P}
 | if | 0.150 | {A}(?
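For illustration, the following is a minimal sketch of the preprocessing described in the Preprocessing Setup above (identifier replacement, tag removal, whitespace fixes, lowercasing, and caption splitting). The regexes and replacement choices are simplified assumptions, not the exact PVLIR rules, and the NLTK sentence tokenizer requires the 'punkt' data to be downloaded.

```python
import re
from nltk.tokenize import sent_tokenize  # needs: nltk.download("punkt")

def normalize_statement(text: str) -> str:
    # Replace fixed identifiers with neutral mentions (simplified, illustrative rules).
    text = re.sub(r"\bPersonX\b|\bAlice\b", "the person", text)
    text = re.sub(r"\bPersonY\b|\bBob\b", "another person", text)
    text = re.sub(r"\bthe person's\b", "their", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def split_caption(caption: str) -> list:
    # Strip "<PERSON>" tags, then split on newlines and sentence boundaries so that
    # each sentence becomes its own example paired with the same image.
    caption = caption.replace("<PERSON>", "the person")
    sentences = []
    for line in caption.splitlines():
        sentences.extend(sent_tokenize(line))
    return [normalize_statement(s) for s in sentences if s.strip()]

print(normalize_statement("PersonX helps PersonY with PersonX's homework"))
print(split_caption("A dog runs in the park. <PERSON> throws a ball.\nIt is sunny."))
```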
Abstract
Humans can seamlessly reason with circumstantial preconditions of commonsense knowledge. We understand that “a glass is used for drinking water”, unless “the glass is broken” or “the water is toxic”. Despite state-of-the-art (SOTA) models’ impressive performance in inferring commonsense knowledge, it is unclear whether they understand the circumstantial preconditions. In this dissertation, I initially propose a novel challenge of reasoning with preconditions attributed to commonsense knowledge, design three tasks based on the challenge in text-only setup, and show there is a significant gap between SOTA language models’ performance and humans’ on our tasks. I then use generative AI and weak supervision in a combination of targeted fine-tuning strategies to improve the language model’s performance on the preconditioned inference task. Finally, I go beyond the text-only setup and investigate the problem of preconditioned inference in a multi-modal setup when the model is challenged to infer the preconditions from an image.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Externalized reasoning in language models for scalable and trustworthy AI
Bridging the visual reasoning gaps in multi-modal models
Scalable exact inference in probabilistic graphical models on multi-core platforms
Event-centric reasoning with neuro-symbolic networks and knowledge incorporation
Robust and generalizable knowledge acquisition from text
Statistical approaches for inferring category knowledge from social annotation
Aggregating symbols for language models
Robust causal inference with machine learning on observational data
Parametric and semi-parametric methods for knowledge acquisition from text
An FPGA-friendly, mixed-computation inference accelerator for deep neural networks
Grounding language in images and videos
Physics-based data-driven inference
Responsible artificial intelligence for a complex world
Statistical inference for dynamical, interacting multi-object systems with emphasis on human small group interactions
Towards generalized event understanding in text via generative models
Expanding the performance-compute frontier for retrieval-augmented language models
A green learning approach to deepfake detection and camouflage and splicing object localization
Towards learning generalization
Exploiting web tables and knowledge graphs for creating semantic descriptions of data sources
Reward shaping and social learning in self-organizing systems through multi-agent reinforcement learning
Asset Metadata
Creator
Qasemi, Ehsan
(author)
Core Title
Multi-modal preconditioned inference of commonsense knowledge
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2023-12
Publication Date
11/07/2023
Defense Date
05/01/2023
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
multi-modal common sense, natural language processing, preconditioned inference
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Chen, Muhao (committee chair), Garcia, Luis (committee member), Morstatter, Fred (committee member), Nakano, Aiichiro (committee member), O'Leary, Daniel Edmund (committee member)
Creator Email
qasemi.ehs@gmail.com,qasemi@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113763620
Unique identifier
UC113763620
Identifier
etd-QasemiEhsa-12453.pdf (filename)
Legacy Identifier
etd-QasemiEhsa-12453
Document Type
Dissertation
Format
theses (aat)
Rights
Qasemi, Ehsan
Internet Media Type
application/pdf
Type
texts
Source
20231107-usctheses-batch-1105 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu